Dedicated Inference

Deploy models on your own custom endpoints for improved reliability at scale

Dedicated Endpoints lets you deploy models on your own endpoints with custom hardware and scaling configurations. Benefits of dedicated endpoints include:

  • Predictable performance unaffected by serverless traffic.
  • Reliable capacity to respond to spiky traffic.
  • Customization to suit the unique usage of the model.

Getting Started

Jump straight into the API with these docs, or follow the guide below to create an endpoint.

1. Select a model

Explore the list of supported models for dedicated endpoints on our models list.

You can also upload your own model.
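If you prefer the terminal, you can also browse models from the CLI. This is a minimal sketch, assuming the together models list command is available in your installed version of the CLI:

together models list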


2. Create a dedicated endpoint

To create a dedicated endpoint, run:

together endpoints create \
--model mistralai/Mixtral-8x7B-Instruct-v0.1 \
--gpu h100 \
--gpu-count 2 \
--wait

This command will finish when the endpoint is READY. To let it run asynchronously, remove the --wait flag.

To view the hardware options for a specific model, run:

together endpoints hardware --model <MODEL_ID>

You will get a response like:

together endpoints hardware --model mistralai/Mixtral-8x7B-Instruct-v0.1

All hardware options:
  2x_nvidia_a100_80gb_sxm
  2x_nvidia_h100_80gb_sxm
  4x_nvidia_a100_80gb_sxm
  4x_nvidia_h100_80gb_sxm
  8x_nvidia_a100_80gb_sxm
  8x_nvidia_h100_80gb_sxm

From this list, you can identify which GPU types and counts can be passed to the create command. For example, in this list, the following combinations are possible:

  1. --gpu a100 with --gpu-count 2, 4, or 8
  2. --gpu h100 with --gpu-count 2, 4, or 8
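For instance, to deploy this model on four A100s using one of the combinations above, you could run:

together endpoints create \
--model mistralai/Mixtral-8x7B-Instruct-v0.1 \
--gpu a100 \
--gpu-count 4 \
--wait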

3. Get endpoint status

You can check on the deployment status by running:

together endpoints get <ENDPOINT_ID>

A sample response will look like the following:

ID:		endpoint-e6c6b82f-90f7-45b7-af39-3ca3b51d08xx
Name:		tester/mistralai/Mixtral-8x7B-Instruct-v0.1-bb04c904
Display Name:	My Endpoint
Hardware:	2x_nvidia_h100_80gb_sxm
Autoscaling:	Min=1, Max=1
Model:		mistralai/Mixtral-8x7B-Instruct-v0.1
Type:		dedicated
Owner:		tester
State:		READY
Created:	2025-02-18 11:55:50.686000+00:00

4. Start, stop & delete endpoint

If you previously stopped the endpoint, or created it without the --wait flag, you can start it by running:

together endpoints start <ENDPOINT_ID>

Stopping the endpoint follows the same pattern:

together endpoints stop <ENDPOINT_ID>

To fully delete the endpoint, run:

together endpoints delete <ENDPOINT_ID>

Endpoint options

Replica count

Replicas provide horizontal scaling, ensuring better handling of high traffic, reduced latency, and resiliency in the event of instance failure. They are set with the --min-replicas and --max-replicas options; both default to 1. When the maximum replica count is increased, the endpoint automatically scales based on server load.
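As an illustration, the replica options can be added to the create command shown earlier; the values here are just an example that lets the endpoint scale between one and three replicas:

together endpoints create \
--model <MODEL_ID> \
--gpu h100 \
--gpu-count 2 \
--min-replicas 1 \
--max-replicas 3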

Auto-shutdown

If an endpoint is inactive for an hour, it will shut down automatically. This window of inactivity can be customized when configuring a deployment in the web interface.

Choosing hardware and GPU count

A hardware configuration for a given model follows this format: [gpu-count]x_[hardware]_[gpu-type]_[gpu-link]

Example: 2x_nvidia_h100_80gb_sxm

When configuring the hardware on the CLI, you specify which configuration you want with the --gpu (GPU type) and --gpu-count options.
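For example, taking the 2x_nvidia_h100_80gb_sxm option from the hardware list above, the pieces map onto the CLI options like this:

# 2x        → --gpu-count 2
# nvidia    → hardware vendor
# h100_80gb → --gpu h100
# sxm       → gpu-link
together endpoints create \
--model <MODEL_ID> \
--gpu h100 \
--gpu-count 2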

Multiple GPUs

Increasing --gpu-count increases the number of GPUs per replica, resulting in higher generation speed, lower time-to-first-token, and a higher maximum QPS.

Speculative decoding

Speculative decoding is an optimization technique that improves the efficiency of text generation. Using speculators can boost performance, increase throughput, and improve the handling of uncertain or ambiguous input.

Customers who require consistently low tail latencies—such as those running real-time or mission-critical applications—may want to avoid speculative decoding. While this technique can improve average performance, it also introduces the risk of occasional extreme delays, which may be unacceptable in latency-sensitive workloads.

To disable speculative decoding, add the --no-speculative-decoding flag.
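For example, to create an endpoint with speculative decoding turned off:

together endpoints create \
--model <MODEL_ID> \
--gpu h100 \
--gpu-count 2 \
--no-speculative-decoding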

Prompt caching

Prompt caching stores the results of previously executed prompts, allowing your model to quickly retrieve and return cached responses instead of reprocessing the same input. This significantly improves performance by reducing redundant computations.

By default, caching is enabled. However, in cases where prompt inputs vary significantly, disabling caching may be preferable. To turn off prompt caching, add --no-prompt-cache to the create command.
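For example, to create an endpoint with prompt caching disabled:

together endpoints create \
--model <MODEL_ID> \
--gpu h100 \
--gpu-count 2 \
--no-prompt-cache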