A dedicated endpoint serves a single model on hardware reserved only for you, offering predictable latency and no shared-fleet rate limits. Dedicated endpoints are highly configurable: you can upload custom fine-tuned models and tune autoscaling and decoding optimizations to match your workload. They use the same inference APIs as serverless models, so you can prototype with serverless and then switch to a dedicated endpoint without changing your application code.

## Documentation index
Fetch the complete documentation index at: https://docs.together.ai/llms.txt
Use this file to discover all available pages before exploring further.
## Get started

- **Quickstart**: Deploy and call your first endpoint in 5 minutes.
- **Manage endpoints**: Create, start, stop, update, and delete endpoints via the UI or API.
- **Endpoint settings**: Configure endpoint hardware, autoscaling, decoding, and prompt caching.
- **Inference APIs**: Explore the API surface for chat, vision, audio, embeddings, and more.
- **Available models**: Browse Together-hosted models you can deploy on dedicated endpoints.
- **Upload a custom model**: Upload your own model weights.
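Because dedicated endpoints share the inference APIs with serverless models, the same OpenAI-compatible chat completions request works for both; only the `model` field changes. A minimal sketch using only the standard library, assuming a hypothetical dedicated model ID and a `TOGETHER_API_KEY` environment variable:

```python
import json
import os
import urllib.request

# Hypothetical model ID: replace with your deployed dedicated endpoint's model name.
MODEL = "your-account/your-dedicated-model"

# The same chat payload works against serverless or dedicated endpoints.
payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}

# Only send the request when an API key is configured (sketch, not run here).
if os.environ.get("TOGETHER_API_KEY"):
    req = urllib.request.Request(
        "https://api.together.xyz/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

Switching from serverless to dedicated is then a one-line change: point `MODEL` at the dedicated endpoint's model ID and leave the rest of the application code untouched.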
## Pricing

Dedicated endpoints are billed per minute by hardware type while the endpoint is running, regardless of your model or request volume.

| Hardware type | Cost/hour |
|---|---|
| 1x H100 80GB | $3.99 |
| 1x H200 141GB | $5.49 |
| 1x B200 180GB | $9.95 |
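Since billing depends only on hardware type and running time, cost estimation is simple arithmetic. A sketch using the rates from the table above (`monthly_cost` is a hypothetical helper, not part of any API):

```python
# Hourly rates from the pricing table (USD per GPU-hour).
RATES = {
    "1x H100 80GB": 3.99,
    "1x H200 141GB": 5.49,
    "1x B200 180GB": 9.95,
}

def monthly_cost(hardware: str, hours_per_day: float, days: int = 30) -> float:
    """Estimate the bill for an endpoint running a fixed daily schedule.

    Billing is per minute while the endpoint is running, independent of
    the model deployed or the number of requests served.
    """
    return round(RATES[hardware] * hours_per_day * days, 2)

# An H100 endpoint running 8 hours a day for a 30-day month:
print(monthly_cost("1x H100 80GB", hours_per_day=8))  # prints 957.6
```

Stopping the endpoint outside business hours, or letting autoscaling scale replicas down, reduces the running time and therefore the bill proportionally.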