Dedicated endpoints

How does the system scale?

Dedicated endpoints support horizontal scaling, meaning throughput scales linearly with the number of replicas specified during endpoint configuration.


How does auto-scaling affect my costs?

Billing for dedicated endpoints is proportional to the number of replicas.
For example, scaling from 1 to 2 replicas will double your GPU costs.
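As a quick illustration of the proportional billing model (the per-replica hourly rate used here is an assumed placeholder, not an actual price):

```python
# Illustrative sketch: billing is proportional to replica count.
# The $4.00/hour per-replica rate below is a hypothetical placeholder.

def hourly_cost(replicas: int, per_replica_rate: float = 4.00) -> float:
    """Return the hourly GPU cost for a given number of replicas."""
    return replicas * per_replica_rate

print(hourly_cost(1))  # 4.0 (baseline)
print(hourly_cost(2))  # 8.0 (scaling from 1 to 2 replicas doubles the cost)
```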


Is my endpoint guaranteed to scale to the max replica set?

We will scale to the maximum number of replicas available at the time. This may fall short of the max replica count set in the configuration if hardware availability is limited.


When to use vertical vs horizontal scale?

In other words, when should you add GPUs per replica versus adding more replicas?

Vertical scaling

Adding GPUs per replica, or vertical scaling, increases generation speed, reduces time to first token, and raises max QPS.
You should increase GPUs if your workload meets the following conditions:

Compute-bound
If your workload is compute-intensive and bottlenecked by GPU processing power, adding more GPUs to a single endpoint can significantly improve performance.

Memory-intensive
If your workload requires large amounts of memory, adding more GPUs to a single endpoint can provide more memory and improve performance.

Single-node scalability
If your workload can scale well within a single node (e.g., using data parallelism or model parallelism), adding more GPUs to a single endpoint can be an effective way to increase throughput.

Low-latency requirements
If your application requires low latency, increasing the number of GPUs on a single endpoint can help reduce latency by processing requests in parallel.

Horizontal scaling

Increasing the number of replicas (horizontal scaling) raises the maximum QPS the endpoint can serve.
You should increase the number of replicas if your workload meets the following conditions:

I/O-bound workloads
If your workload is I/O-bound (e.g., waiting for data to be loaded or written), increasing the number of replicas can help spread the I/O load across multiple nodes.

Request concurrency
If your application receives a high volume of concurrent requests, increasing the number of replicas can help distribute the load and improve responsiveness.

Fault tolerance
Increasing the number of replicas can improve fault tolerance by ensuring that if one node fails, others can continue to process requests.

Scalability across multiple nodes
If your workload can scale well across multiple nodes (e.g., using data parallelism or distributed training), increasing the number of replicas can be an effective way to increase throughput.
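The guidance above can be summarized as a simple decision helper. This is an illustrative sketch only; the function and field names are hypothetical and not part of any real API:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    compute_bound: bool = False        # bottlenecked by GPU processing power
    memory_intensive: bool = False     # model/activations need more GPU memory
    latency_sensitive: bool = False    # strict time-to-first-token targets
    io_bound: bool = False             # waiting on data loads or writes
    high_concurrency: bool = False     # many simultaneous requests
    needs_fault_tolerance: bool = False

def scaling_recommendation(w: Workload) -> str:
    """Map workload traits to the scaling axis suggested above."""
    if w.compute_bound or w.memory_intensive or w.latency_sensitive:
        return "vertical"    # add GPUs per replica
    if w.io_bound or w.high_concurrency or w.needs_fault_tolerance:
        return "horizontal"  # add replicas
    return "either"

print(scaling_recommendation(Workload(latency_sensitive=True)))  # vertical
print(scaling_recommendation(Workload(high_concurrency=True)))   # horizontal
```

In practice many workloads have traits in both columns; the helper just encodes which axis the listed conditions point toward.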


Troubleshooting dedicated endpoints configuration

There are several reasons an endpoint may not start successfully right after creation.

Lack of availability: If we are short on available hardware, the endpoint will still be created, but rather than starting automatically, it will be queued for the next available hardware.

Minimal availability: We may have hardware available, but only enough for a small number of replicas. In this case, the endpoint may start but only scale to the number of replicas available. If the min replica count is set higher than our current capacity, we may queue the endpoint until there is enough availability. To avoid the wait, you can reduce the min replica count.
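The capacity behavior described above can be sketched as follows. This is a hypothetical illustration of the logic, not the actual scheduler:

```python
def endpoint_state(available: int, min_replicas: int, max_replicas: int):
    """Illustrate how hardware availability interacts with min/max replica settings.

    Returns a (state, replicas) pair.
    """
    if available < min_replicas:
        # Not enough capacity to meet the configured floor (including zero
        # capacity): the endpoint is queued until availability improves.
        return ("queued", 0)
    # Otherwise start with as many replicas as capacity allows, capped at max.
    return ("running", min(available, max_replicas))

print(endpoint_state(available=0, min_replicas=1, max_replicas=4))  # ('queued', 0)
print(endpoint_state(available=2, min_replicas=1, max_replicas=4))  # ('running', 2)
print(endpoint_state(available=2, min_replicas=3, max_replicas=4))  # ('queued', 0)
```

The last case shows why lowering the min replica count can get a queued endpoint running sooner: with min_replicas=1, the same two available replicas would start immediately.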


Will I be billed for the time spent spinning up the endpoint or looking for resources?

Billing starts only when a dedicated endpoint is successfully up and running. If there is a delay in provisioning or a failure to deploy the endpoint, you will not be billed for that time.