This page lists the configuration options you can set when creating or updating a dedicated endpoint.
Replica count
A replica (also called a worker) is an instance of your model running on independent hardware, capable of handling requests in parallel with other replicas. Increasing the number of replicas on an endpoint improves throughput under high traffic, lowers latency, and provides resiliency if a single replica fails. You can configure the minimum and maximum number of replicas to scale between:

--min-replicas determines the minimum number of replicas to keep running. These will never scale down, and will always be available to handle requests.

--max-replicas determines the maximum number of replicas to scale up to, allowing you to set a ceiling on cost.
To change these values on an existing endpoint, run together endpoints update and pass the endpoint ID:
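For example, a minimal sketch (the endpoint ID is a placeholder, and passing it positionally is an assumption, mirroring together endpoints start later on this page):

```shell
# Scale an existing endpoint between 1 and 3 replicas.
# Both flags must be passed together (see the note below).
together endpoints update <endpoint_id> \
  --min-replicas 1 \
  --max-replicas 3
```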
Both --min-replicas and --max-replicas must be specified together when updating.

Hardware and GPU count
A hardware configuration ID follows this format:

2x_nvidia_h100_80gb_sxm

This example denotes two NVIDIA H100 80 GB SXM GPUs per replica.
Pass it to the CLI with --hardware <hardware_id>. To see what’s available for a model:
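A sketch of both steps; the endpoints hardware subcommand and the --model flag are assumptions about the Together CLI, and <model_id> is a placeholder:

```shell
# List hardware configurations available for a model
# (assumes the CLI exposes an `endpoints hardware` subcommand).
together endpoints hardware --model <model_id>

# Create an endpoint pinned to one of the listed configurations.
together endpoints create \
  --model <model_id> \
  --hardware 2x_nvidia_h100_80gb_sxm
```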
Availability zone
If you have specific latency or geographic needs, target an availability zone at create time. Restricting to a zone can limit hardware availability, so don't set this unless you need to. Pass --availability-zone <zone> on the create command.
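For example, a sketch (only --availability-zone comes from this page; the --model flag is an assumption and both values are placeholders):

```shell
# Pin the endpoint's replicas to a single availability zone.
together endpoints create \
  --model <model_id> \
  --availability-zone <zone>
```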
Auto-shutdown
To avoid charges from idle replicas, a dedicated endpoint automatically stops after a period of inactivity. The default is 60 minutes. Configure the threshold with the --inactive-timeout parameter on the create command (in minutes). Pass 0 to disable auto-shutdown entirely.
Set the timeout at create time:
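A sketch (--inactive-timeout is from this page; the --model flag is an assumption and <model_id> is a placeholder):

```shell
# Stop the endpoint after 30 minutes without traffic.
together endpoints create \
  --model <model_id> \
  --inactive-timeout 30
```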
Or change it on an existing endpoint with update:
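Continuing the sketch (placeholder endpoint ID, assumed positional as above):

```shell
# Disable auto-shutdown entirely on an existing endpoint.
together endpoints update <endpoint_id> --inactive-timeout 0
```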
If auto-shutdown stops an endpoint, restart it with together endpoints start <endpoint_id>.
Speculative decoding
Speculative decoding is an optimization that improves average throughput by speculatively predicting future tokens. It usually improves performance, but it can introduce occasional tail-latency spikes that real-time or mission-critical workloads won't tolerate. By default, speculative decoding is enabled. To turn it off, pass the --no-speculative-decoding flag to the create command.
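For example, a sketch (only --no-speculative-decoding comes from this page; the --model flag is an assumption and <model_id> is a placeholder):

```shell
# Trade some average throughput for steadier tail latency.
together endpoints create \
  --model <model_id> \
  --no-speculative-decoding
```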
Prompt caching
Prompt caching stores the results of previously executed prompts so the model can return cached responses instead of recomputing them. It significantly reduces redundant compute for repeated prefixes. Prompt caching is enabled by default on every dedicated endpoint and cannot be disabled.

The --no-prompt-cache CLI flag and disable_prompt_cache API field are deprecated and will be removed in February 2026. They are currently accepted but ignored; prompt caching is always enabled.