

Using a coding agent? Install the together-dedicated-endpoints skill to let your agent create and manage dedicated endpoints for you.
A dedicated endpoint serves a single model on hardware reserved only for you, offering predictable latency and no shared-fleet rate limits. Dedicated endpoints are highly configurable: you can upload custom fine-tuned models and tune autoscaling and decoding optimizations to match your workload. They use the same inference APIs as serverless models, so you can prototype on serverless and then switch to a dedicated endpoint without changing your application code.
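A minimal sketch of what "same inference APIs" means in practice: the chat completions request body sent to the Together API is identical for serverless and dedicated; only the `model` field changes. Both model names below are hypothetical placeholders, not names from this page.

```python
# Sketch: swapping serverless for dedicated changes only the `model` field
# in the body sent to POST https://api.together.xyz/v1/chat/completions.
# Both names below are hypothetical examples.
SERVERLESS_MODEL = "meta-llama/example-serverless-model"
DEDICATED_MODEL = "my-org/my-dedicated-endpoint"

def chat_body(model: str, prompt: str) -> dict:
    """Build the JSON body for a chat completions request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

serverless_req = chat_body(SERVERLESS_MODEL, "Hello!")
dedicated_req = chat_body(DEDICATED_MODEL, "Hello!")

# Everything except the model name is unchanged:
assert (
    {k: v for k, v in serverless_req.items() if k != "model"}
    == {k: v for k, v in dedicated_req.items() if k != "model"}
)
```

Because the request shape is the same, application code that already targets the serverless API needs only a different model identifier to hit a dedicated endpoint.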

Get started

Quickstart

Deploy and call your first endpoint in 5 minutes.

Manage endpoints

Create, start, stop, update, and delete via the UI or API.

Endpoint settings

Configure endpoint hardware, autoscaling, decoding, prompt caching.

Inference APIs

Explore the API surface for chat, vision, audio, embeddings, and more.

Available models

Browse Together-hosted models you can deploy on dedicated endpoints.

Upload a custom model

Upload your own model weights.

Pricing

Dedicated endpoints bill per minute based on hardware type while the endpoint is running, regardless of the model deployed or request volume.
Hardware type     Cost/hour
1x H100 80GB      $3.99
1x H200 141GB     $5.49
1x B200 180GB     $9.95
Each running replica bills independently and stops billing as soon as it is scaled down.