- Predictable performance unaffected by serverless traffic.
- Reliable capacity to respond to spiky traffic.
- Customization to suit the unique usage of the model.
Getting Started
Jump straight into the API with these docs, or create an endpoint with the guide below.
1. Select a model
Explore the list of supported models for dedicated endpoints on our models list. You can also upload your own model.
2. Create a dedicated endpoint
To create a dedicated endpoint, run the create command sketched below. With the --wait flag, the command blocks until the endpoint reaches the READY state. To let it run asynchronously, remove the --wait flag.
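A minimal sketch of the creation call, assuming the platform's CLI is the `together` tool with an `endpoints create` subcommand (neither is named in this excerpt); the flags are the ones referenced throughout this guide:

```bash
# Sketch only: the CLI name and subcommand are assumptions, not
# confirmed by this guide. Flags referenced in this guide:
#   --model:     the model to deploy (placeholder below)
#   --gpu:       GPU type (e.g. a100, h100)
#   --gpu-count: GPUs per replica
#   --wait:      block until the endpoint reports READY
together endpoints create \
  --model <MODEL_NAME> \
  --gpu h100 \
  --gpu-count 2 \
  --wait
```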
To view the hardware options for a specific model, run the hardware listing command (a sketch follows this list). Available configurations include:
- --gpu a100 --gpu-count 2
- --gpu a100 --gpu-count 4
- --gpu a100 --gpu-count 8
- --gpu h100 --gpu-count 2
- --gpu h100 --gpu-count 4
- --gpu h100 --gpu-count 8
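The query itself might look like the following, assuming an `endpoints hardware` subcommand (an assumption; this excerpt never names the command):

```bash
# Assumed subcommand: lists the hardware configurations
# available for a given model.
together endpoints hardware --model <MODEL_NAME>
```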
3. Get endpoint status
You can check on the deployment status by running the status command; a sketch covering status, start, stop, and delete appears after step 4.
4. Start, stop & delete endpoint
If you added the --wait flag on creation or previously stopped the endpoint, you can start it again by running the start command (see the sketch below).
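A hedged sketch of these lifecycle operations, assuming `together endpoints` subcommands named get, start, stop, and delete (assumptions; the excerpt names none of them):

```bash
# All subcommand names below are assumptions for illustration.
# Check deployment status (step 3):
together endpoints get <ENDPOINT_ID>

# Start a stopped endpoint:
together endpoints start <ENDPOINT_ID>

# Stop a running endpoint (it can be started again later):
together endpoints stop <ENDPOINT_ID>

# Delete the endpoint permanently:
together endpoints delete <ENDPOINT_ID>
```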
Endpoint options
Replica count
Replicas provide horizontal scaling, ensuring better handling of high traffic, reduced latency, and resiliency in the event of instance failure. They are set with the --min-replicas and --max-replicas options; both default to 1. When the max replica count is raised above the minimum, the endpoint will automatically scale based on server load.
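For example, a sketch of a creation call that scales between one and three replicas (CLI name and subcommand assumed as above; the replica flags come from this guide):

```bash
# Scale between 1 and 3 replicas based on server load.
# (CLI/subcommand assumed; --min-replicas/--max-replicas from this guide.)
together endpoints create \
  --model <MODEL_NAME> \
  --gpu h100 \
  --gpu-count 2 \
  --min-replicas 1 \
  --max-replicas 3
```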
Auto-shutdown
If an endpoint is inactive for an hour, it will shut down automatically. This window of inactivity can be customized when configuring a deployment in the web interface.
Choosing hardware and GPU count
A hardware configuration for a given model follows this format: [gpu-count]-[hardware]-[gpu-type]-[gpu-link]
Example: 2x_nvidia_h100_80gb_sxm (two NVIDIA H100 80 GB GPUs with an SXM interconnect)
When configuring the hardware on the CLI, you can specify which configuration you would like with the --gpu (the hardware, i.e. the GPU type) and --gpu-count options.
Multiple GPUs
Increasing the --gpu-count will increase the number of GPUs per replica. This results in higher generation speed, lower time-to-first-token, and a higher maximum QPS.
Speculative decoding
Speculative decoding is an optimization technique used to improve the efficiency of text generation and decoding. Using speculators can improve performance, increase throughput, and improve the handling of uncertain or ambiguous input. Customers who require consistently low tail latencies, such as those running real-time or mission-critical applications, may want to avoid speculative decoding: while it improves average performance, it introduces the risk of occasional extreme delays, which may be unacceptable in latency-sensitive workloads. To disable speculative decoding, set the --no-speculative-decoding flag.
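For instance, a sketch of disabling it at creation time (CLI name and subcommand assumed as above; the flag comes from this guide):

```bash
# Disable speculative decoding for consistently low tail latency.
# (CLI/subcommand assumed; --no-speculative-decoding from this guide.)
together endpoints create \
  --model <MODEL_NAME> \
  --gpu h100 \
  --gpu-count 2 \
  --no-speculative-decoding
```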
Prompt caching
Prompt caching stores the results of previously executed prompts, allowing your model to quickly retrieve and return cached responses instead of reprocessing the same input. This significantly improves performance by reducing redundant computation. Caching is enabled by default; however, in cases where prompt inputs vary significantly, disabling it may be preferable. To turn off prompt caching, add the --no-prompt-cache flag to the create command.
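A final sketch combining the flag with an assumed create call (CLI name and subcommand remain assumptions):

```bash
# Turn off prompt caching when inputs rarely repeat.
# (CLI/subcommand assumed; --no-prompt-cache from this guide.)
together endpoints create \
  --model <MODEL_NAME> \
  --gpu a100 \
  --gpu-count 4 \
  --no-prompt-cache
```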