Together AI offers three ways to deploy and run models for inference:
- Serverless: Call any supported model through a shared per-token API, best for prototyping and variable traffic (see the example after this list).
- Dedicated endpoints: Reserved GPUs running one model, best for steady traffic that needs predictable latency and no shared-fleet rate limits.
- Dedicated containers: Your Docker image on Together’s managed GPUs with autoscaling, best for custom runtimes or models not in the catalog.
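For a sense of what the serverless path looks like in practice, here is a minimal sketch using the Together Python SDK. The model ID is illustrative; check the model catalog for current IDs. The parameters shown mirror what the playground exposes.

```python
# Minimal serverless chat completion, assuming the `together` SDK is
# installed (pip install together) and TOGETHER_API_KEY is set.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative; see the catalog
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the three deployment options."},
    ],
    max_tokens=200,
    temperature=0.7,
)
print(response.choices[0].message.content)
```

Dedicated endpoints are called the same way; only the model name changes, so code written against serverless carries over.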
Comparison
| Option | What you deploy | What Together runs | Pricing | Best for |
|---|---|---|---|---|
| Serverless | Nothing; you call the API | Shared fleet of popular models | Per-token | Prototyping, variable traffic on listed models |
| Dedicated endpoints | A model ID | Reserved GPUs running that model | Per-minute, by hardware | Predictable performance on Together-hosted or fine-tuned models |
| Dedicated containers | A Docker image plus a Sprocket worker | Your container, autoscaled and fronted by a job queue | Per-minute, by hardware | Custom runtimes, models not on Together, bring-your-own stack |
Decision guide
- You need to call a popular open model in your app today: Use serverless.
- Your traffic is steady enough that per-token pricing is no longer the best deal, or you’re hitting rate limits: Use a dedicated endpoint.
- Your model isn’t on Together, or it needs a custom runtime, pre-processing, or post-processing step: Use a dedicated container.
Pricing
Together AI prices each inference option differently. For current rates, see together.ai/pricing.
- Serverless: Pay per token, based on input and output tokens processed. Rates vary by model.
- Dedicated endpoints: Pay per minute for the GPU hardware reserved. Billing runs for as long as the endpoint is active, regardless of request volume (see the break-even sketch after this list).
- Dedicated containers: Pay per minute for the GPU hardware allocated to your container. With autoscaling, you pay for every active replica, including idle ones above the minimum.
- Batch inference: Same per-token model as serverless, with a discount on select models. See batch inference.
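To make the serverless-versus-dedicated trade-off concrete, here is a back-of-the-envelope break-even sketch. All rates below are hypothetical placeholders, not Together's actual prices; substitute current numbers from together.ai/pricing.

```python
# Hypothetical rates -- substitute real numbers from together.ai/pricing.
PER_MILLION_TOKENS = 0.90  # $ per 1M tokens (serverless, input + output combined)
GPU_PER_MINUTE = 0.05      # $ per minute for one reserved GPU (dedicated)

def monthly_serverless_cost(tokens_per_month: float) -> float:
    return tokens_per_month / 1_000_000 * PER_MILLION_TOKENS

def monthly_dedicated_cost(gpus: int = 1) -> float:
    # A dedicated endpoint bills for every minute it is active.
    return GPU_PER_MINUTE * gpus * 60 * 24 * 30

# Token volume at which a dedicated endpoint becomes cheaper than serverless.
break_even_tokens = monthly_dedicated_cost() / PER_MILLION_TOKENS * 1_000_000
print(f"Dedicated wins above ~{break_even_tokens / 1e9:.1f}B tokens/month")
```

With these placeholder rates, one always-on GPU costs about $2,160 per month, so dedicated pays off above roughly 2.4B tokens per month; rerun with real rates for your model and hardware.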
Playground
Try models interactively before writing code. The web playground has separate environments for chat, image, video, audio, and transcription models. Pick a model, adjust the system prompt and inference parameters from the side panel, and iterate.
Next steps
- Browse the catalog of recommended models for chat, coding, vision, and reasoning.
- Check rate limits to see what to expect on serverless and how to request more.
- Set up authentication. API keys work across all three deployment options.
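As a sketch of how one key spans the deployment options: the same bearer token authenticates catalog models and your own dedicated deployments against the OpenAI-compatible chat completions route; only the model field changes. The dedicated endpoint name below is a hypothetical placeholder.

```python
# One API key authenticates every deployment option; this sketch assumes
# the OpenAI-compatible route and uses the `requests` library.
import os
import requests

headers = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}
url = "https://api.together.xyz/v1/chat/completions"

for model in [
    "meta-llama/Llama-3.3-70B-Instruct-Turbo",  # serverless catalog model (illustrative)
    "your-org/your-dedicated-endpoint",         # hypothetical dedicated endpoint name
]:
    r = requests.post(url, headers=headers, json={
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 10,
    })
    print(model, r.status_code)
```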