Get started
Quickstart
Deploy and call your first endpoint in 5 minutes.
Manage endpoints
Create, start, stop, update, and delete endpoints via the UI or API.
Endpoint settings
Configure endpoint hardware, autoscaling, decoding, and prompt caching.
Inference APIs
Explore the API surface for chat, vision, audio, embeddings, and more.
Available models
Browse Together-hosted models you can deploy on dedicated endpoints.
Upload a custom model
Upload your own model weights.
Pricing
Dedicated endpoints bill per-minute by hardware while the endpoint is running, regardless of your model or request volume. The following table lists the single-GPU price for each hardware type.

| GPU | Hardware ID | Cost/hour |
|---|---|---|
| H100 80GB SXM | 1x_nvidia_h100_80gb_sxm | $6.49 |
| H200 140GB SXM | 1x_nvidia_h200_140gb_sxm | $7.89 |
| B200 180GB SXM | 1x_nvidia_b200_180gb_sxm | $11.95 |
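
To make per-minute billing concrete, the following sketch estimates the charge for one single-GPU replica. It is an illustration, not an official billing formula: it assumes the per-minute rate is the hourly price divided by 60, that partial minutes are not modeled, and that totals round to the nearest cent. The `estimate_cost` helper is a hypothetical name, not part of any Together SDK.

```python
from decimal import Decimal, ROUND_HALF_UP

# Single-GPU hourly prices (USD), copied from the pricing table above.
HOURLY_PRICE_USD = {
    "1x_nvidia_h100_80gb_sxm": Decimal("6.49"),
    "1x_nvidia_h200_140gb_sxm": Decimal("7.89"),
    "1x_nvidia_b200_180gb_sxm": Decimal("11.95"),
}


def estimate_cost(hardware_id: str, minutes_running: int) -> Decimal:
    """Estimate the charge for one replica that ran for a number of whole minutes.

    Assumption: per-minute rate = hourly price / 60, rounded to cents at the end.
    """
    if minutes_running < 0:
        raise ValueError("minutes_running must be non-negative")
    hourly = HOURLY_PRICE_USD[hardware_id]
    # Multiply before dividing so exact hourly multiples stay exact.
    total = hourly * minutes_running / Decimal(60)
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    # An H200 endpoint that ran for 45 minutes.
    print(estimate_cost("1x_nvidia_h200_140gb_sxm", 45))
```

Because billing depends only on running time and hardware, the estimate takes no model name or request count as input.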
Scaling out
You can deploy multiple autoscaling replicas of your model on a single endpoint to reduce latency under high traffic and provide resiliency in case a single replica fails. Each running replica bills independently, and stops billing as soon as it is scaled down.

You can also deploy multiple GPUs per replica (2, 4, or 8 GPUs) to increase throughput and lower latency. Cost scales with the GPU count, and the hardware ID carries a matching prefix. For example, the ID for four H100s is 4x_nvidia_h100_80gb_sxm.
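
The relationship between GPU count, replica count, and hourly cost can be sketched as follows. This is a rough illustration under stated assumptions: it treats cost as scaling linearly with GPU count (this page says only that cost scales with it), and it assumes every hardware type follows the same `Nx_` prefix pattern shown for the H100. Both function names are hypothetical helpers, not part of any Together SDK.

```python
from decimal import Decimal

# GPU counts per replica: single GPU, or the multi-GPU options 2, 4, and 8.
ALLOWED_GPU_COUNTS = (1, 2, 4, 8)


def multi_gpu_hardware_id(single_gpu_id: str, gpu_count: int) -> str:
    """Swap the leading '1x_' prefix for the requested GPU count."""
    if gpu_count not in ALLOWED_GPU_COUNTS:
        raise ValueError(f"gpu_count must be one of {ALLOWED_GPU_COUNTS}")
    if not single_gpu_id.startswith("1x_"):
        raise ValueError("expected a single-GPU hardware ID starting with '1x_'")
    return f"{gpu_count}x_{single_gpu_id[len('1x_'):]}"


def hourly_cost(single_gpu_price: Decimal, gpu_count: int, running_replicas: int) -> Decimal:
    """Estimate the hourly cost of an endpoint.

    Assumption: price is linear in GPU count; each running replica bills independently.
    """
    return single_gpu_price * gpu_count * running_replicas


if __name__ == "__main__":
    print(multi_gpu_hardware_id("1x_nvidia_h100_80gb_sxm", 4))
    # Two running replicas, each with four H100s at $6.49 per GPU-hour.
    print(hourly_cost(Decimal("6.49"), gpu_count=4, running_replicas=2))
```

When autoscaling removes a replica, `running_replicas` drops and the estimate falls with it, mirroring how a scaled-down replica stops billing.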