Dedicated endpoints have two scaling axes: how many GPUs go into each replica (vertical) and how many replicas to run (horizontal). Both change throughput, but they’re optimized for different workload shapes. Only horizontal scaling is dynamic. You configure a minimum and maximum replica count when you create the endpoint, and the platform autoscales between those bounds based on demand. Vertical scaling is set at create time by picking a multi-GPU hardware SKU; changing it later requires redeploying the endpoint.
Vertical vs. horizontal
When to add GPUs per replica (vertical)
More GPUs per replica increase generation speed, lower time to first token, and raise the maximum requests per second a single replica can handle. Reach for vertical scaling when:
- Compute-bound: Your workload is bottlenecked by GPU compute. Adding GPUs to one replica directly speeds up each request.
- Memory-intensive: The model or context window is large enough that one GPU can’t hold it. Adding GPUs to a replica gives you the memory headroom.
- Single-node parallelism works: Your workload benefits from data parallelism or model parallelism within a single node.
- Low-latency requirements: Each request needs to complete quickly. Multiple GPUs in one replica process the request faster than one GPU could.
You pick the hardware SKU when you run together endpoints create. List the options for your model with together endpoints hardware --model <model_id>, then create the endpoint with the SKU you want:
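A minimal sketch of that flow. The --hardware flag name, the model ID, and the SKU string are illustrative assumptions; substitute the values the hardware command actually returns for your model:

```shell
# List the hardware SKUs available for this model.
together endpoints hardware --model meta-llama/Llama-3.3-70B-Instruct-Turbo

# Create the endpoint on a multi-GPU SKU.
# NOTE: the --hardware flag and SKU string below are assumptions for
# illustration; use a SKU returned by the command above.
together endpoints create \
  --model meta-llama/Llama-3.3-70B-Instruct-Turbo \
  --hardware <hardware_sku>
```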
When to add replicas (horizontal)
More replicas raise the maximum requests per second the endpoint can serve in aggregate. Reach for horizontal scaling when:
- Concurrent requests: Your application receives a high volume of simultaneous requests. Replicas spread that load.
- I/O-bound workloads: Requests spend significant time waiting on data loads or writes. Replicas let you do more of that waiting in parallel.
- Fault tolerance: A second replica means a single hardware failure doesn’t take your endpoint offline.
- Multi-node parallelism works: Your workload scales well across nodes (data parallelism, distributed inference).
Set the bounds with --min-replicas and --max-replicas at create time, or update them later on a running endpoint:
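A minimal sketch, using the --min-replicas and --max-replicas flags named above. The update subcommand and the <endpoint_id> placeholder are assumptions for illustration:

```shell
# Set the autoscaling bounds when the endpoint is created.
together endpoints create \
  --model meta-llama/Llama-3.3-70B-Instruct-Turbo \
  --min-replicas 1 \
  --max-replicas 4

# Widen the bounds later on a running endpoint.
# NOTE: the update subcommand is assumed here for illustration.
together endpoints update <endpoint_id> \
  --min-replicas 2 \
  --max-replicas 8
```

The platform autoscales within whatever bounds are in effect, so raising --max-replicas adds capacity headroom without forcing replicas to run when demand is low.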