Using a coding agent? Install the together-dedicated-endpoints skill to let your agent create and manage dedicated endpoints for you.
A dedicated endpoint serves a single model on hardware reserved only for you, offering predictable latency and no shared-fleet rate limits. Dedicated endpoints are highly configurable: you can upload custom fine-tuned models and configure autoscaling and decoding optimizations to match your workload. They use the same inference APIs as serverless models, so you can prototype with serverless and then switch to a dedicated endpoint without changing your application code.

Get started

Quickstart

Deploy and call your first endpoint in 5 minutes.

Manage endpoints

Create, start, stop, update, and delete via the UI or API.

Endpoint settings

Configure endpoint hardware, autoscaling, decoding, and prompt caching.

Inference APIs

Explore the API surface for chat, vision, audio, embeddings, and more.

Available models

Browse Together-hosted models you can deploy on dedicated endpoints.

Upload a custom model

Upload your own model weights.

Pricing

Dedicated endpoints are billed per minute, based on hardware, for as long as the endpoint is running, regardless of which model you deploy or how many requests you send. The following table lists the hourly single-GPU price for each hardware type.
GPU              Hardware ID                 Cost/hour
H100 80GB SXM    1x_nvidia_h100_80gb_sxm     $6.49
H200 140GB SXM   1x_nvidia_h200_140gb_sxm    $7.89
B200 180GB SXM   1x_nvidia_b200_180gb_sxm    $11.95
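As a rough illustration of per-minute billing, the sketch below estimates the charge for one single-GPU replica from the hourly prices listed above. It assumes the per-minute rate is simply the hourly price divided by 60; the function name `estimate_cost_usd` is hypothetical and not part of any Together tool, and actual invoices may round differently.

```python
# Hourly single-GPU prices copied from the pricing table (USD per hour).
HOURLY_PRICE_USD = {
    "1x_nvidia_h100_80gb_sxm": 6.49,
    "1x_nvidia_h200_140gb_sxm": 7.89,
    "1x_nvidia_b200_180gb_sxm": 11.95,
}


def estimate_cost_usd(hardware_id: str, minutes_running: int) -> float:
    """Estimate the charge for one single-GPU replica.

    Assumption: the per-minute rate equals the hourly price divided by 60.
    """
    if minutes_running < 0:
        raise ValueError("minutes_running must be non-negative")
    per_minute = HOURLY_PRICE_USD[hardware_id] / 60
    return round(per_minute * minutes_running, 2)


if __name__ == "__main__":
    # Estimate 90 minutes of uptime on a single H100.
    print(estimate_cost_usd("1x_nvidia_h100_80gb_sxm", 90))
```

Because billing depends only on running time and hardware, the request volume never appears in the calculation.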

Scaling out

You can deploy multiple autoscaling replicas of your model on a single endpoint to reduce latency under high traffic and to stay available if a single replica fails. Each running replica is billed independently and stops billing as soon as it is scaled down. You can also deploy multiple GPUs per replica (2, 4, or 8 GPUs) to increase throughput and lower latency. Cost scales with the GPU count, and the hardware ID carries a matching prefix. For example, the ID for four H100s is 4x_nvidia_h100_80gb_sxm.
For best practices on when to increase replicas vs. GPUs per replica, see Vertical vs. horizontal scaling.
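The two scaling dimensions can be sketched in a few lines: the helper below builds a hardware ID from a GPU count and estimates the hourly cost of an endpoint with several running replicas. It is a minimal sketch with hypothetical function names, and it assumes that cost grows linearly with both GPU count and running replicas; the page only says that cost scales with GPU count, so check the rates reported for your model before relying on these numbers.

```python
# GPU counts per replica: 1 (single GPU) plus the multi-GPU options 2, 4, and 8.
VALID_GPU_COUNTS = (1, 2, 4, 8)

# Single-GPU hourly prices (USD), keyed by the hardware ID without its count prefix.
SINGLE_GPU_HOURLY_USD = {
    "nvidia_h100_80gb_sxm": 6.49,
    "nvidia_h200_140gb_sxm": 7.89,
    "nvidia_b200_180gb_sxm": 11.95,
}


def hardware_id(gpu_count: int, gpu_type: str) -> str:
    """Build a hardware ID such as '4x_nvidia_h100_80gb_sxm'."""
    if gpu_count not in VALID_GPU_COUNTS:
        raise ValueError(f"gpu_count must be one of {VALID_GPU_COUNTS}")
    if gpu_type not in SINGLE_GPU_HOURLY_USD:
        raise ValueError(f"unknown gpu_type: {gpu_type}")
    return f"{gpu_count}x_{gpu_type}"


def hourly_cost_usd(gpu_type: str, gpu_count: int, running_replicas: int) -> float:
    """Estimate hourly cost, ASSUMING linear scaling in GPUs and replicas."""
    hardware_id(gpu_count, gpu_type)  # reuse the validation above
    if running_replicas < 0:
        raise ValueError("running_replicas must be non-negative")
    return SINGLE_GPU_HOURLY_USD[gpu_type] * gpu_count * running_replicas


if __name__ == "__main__":
    print(hardware_id(4, "nvidia_h100_80gb_sxm"))
    # Two running replicas, each with four H100s.
    print(round(hourly_cost_usd("nvidia_h100_80gb_sxm", 4, 2), 2))
```

Only running replicas count toward the total, so an endpoint that has scaled down to one replica is estimated at the single-replica rate.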

List hardware options

To get a list of available hardware options, GPU counts, and per-minute rates for a given model, run this command:
tg endpoints hardware --model <MODEL_ID>