Together AI offers three ways to deploy and run models for inference:
  • Serverless: Call any supported model through a shared per-token API, best for prototyping and variable traffic (a minimal call is sketched after this list).
  • Dedicated endpoints: Reserved GPUs running one model, best for steady traffic that needs predictable latency and no shared-fleet rate limits.
  • Dedicated containers: Your Docker image on Together’s managed GPUs with autoscaling, best for custom runtimes or models not in the catalog.
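
For the serverless path, a request is one HTTPS call. Here is a minimal sketch using Python's requests library against the OpenAI-compatible /v1/chat/completions route; the model id is illustrative, so substitute any id from the serverless catalog.

```python
import os
import requests

# Serverless: call the shared, per-token API with your account's key,
# read here from the environment.
API_KEY = os.environ["TOGETHER_API_KEY"]

resp = requests.post(
    "https://api.together.xyz/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        # Any model id from the serverless catalog works here;
        # this Llama id is only an example.
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```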

Comparison

| Option | What you deploy | What Together runs | Pricing | Best for |
| --- | --- | --- | --- | --- |
| Serverless | Nothing; you call the API | Shared fleet of popular models | Per-token | Prototyping, variable traffic on listed models |
| Dedicated endpoints | A model ID | Reserved GPUs running that model | Per-minute, by hardware | Predictable performance on Together-hosted or fine-tuned models |
| Dedicated containers | A Docker image plus a Sprocket worker | Your container, autoscaled and fronted by a job queue | Per-minute, by hardware | Custom runtimes, models not on Together, bring-your-own stack |

Decision guide

  • You need to call a popular open model in your app today: Use serverless.
  • Your traffic is steady enough that per-token pricing is no longer the best deal, or you’re hitting rate limits: Use a dedicated endpoint.
  • Your model isn’t on Together, or it needs a custom runtime, pre-processing, or post-processing step: Use a dedicated container.

If you need raw GPUs to train a model or run arbitrary workloads, see GPU clusters.

Pricing

Together AI prices each inference option differently. For current rates, see together.ai/pricing; a rough break-even sketch follows the list below.
  • Serverless: Pay per token, based on input and output tokens processed. Rates vary by model.
  • Dedicated endpoints: Pay per minute for the GPU hardware reserved. Billing runs for as long as the endpoint is active, regardless of request volume.
  • Dedicated containers: Pay per minute for the GPU hardware allocated to your container. With autoscaling, you pay for every running replica, including replicas that have scaled up above the minimum but sit idle.
  • Batch inference: Same per-token model as serverless, with a discount on select models. See batch inference.
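
To make the serverless-versus-dedicated tradeoff concrete, here is a rough break-even sketch. Every rate in it is a hypothetical placeholder, not a published price; plug in the current numbers from together.ai/pricing for your model and hardware.

```python
# Rough break-even: at what monthly token volume does an always-on
# dedicated endpoint become cheaper than serverless per-token pricing?
# Both rates are HYPOTHETICAL placeholders -- substitute current
# numbers from together.ai/pricing for your model and hardware.

SERVERLESS_PER_MTOK = 0.60   # assumed $/1M tokens (input + output)
DEDICATED_PER_MINUTE = 0.05  # assumed $/minute for the reserved GPU
MINUTES_PER_MONTH = 60 * 24 * 30

dedicated_monthly = DEDICATED_PER_MINUTE * MINUTES_PER_MONTH
breakeven_tokens = dedicated_monthly / SERVERLESS_PER_MTOK * 1_000_000

print(f"Always-on dedicated endpoint: ${dedicated_monthly:,.0f}/month")
print(f"Serverless costs the same at ~{breakeven_tokens / 1e9:.1f}B tokens/month")
```

With these placeholder rates, the endpoint runs $2,160 a month, which serverless matches at roughly 3.6 billion tokens a month; above that volume, or once rate limits bite, the dedicated endpoint wins.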

Playground

Try models interactively before writing code. The web playground has separate environments for chat, image, video, audio, and transcription models. Pick a model, adjust the system prompt and inference parameters from the side panel, and iterate.
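
Once a playground configuration feels right, the same settings carry over to code. The sketch below assumes the `together` Python SDK and its OpenAI-style chat interface; the model id is illustrative.

```python
from together import Together  # assumes the `together` Python SDK is installed

# The playground's side-panel controls map onto request parameters:
# the system prompt becomes a "system" message, and sliders such as
# temperature and max tokens become keyword arguments.
client = Together()  # reads TOGETHER_API_KEY from the environment

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",  # illustrative id
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Compare per-token and per-minute pricing."},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```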

Next steps

  • Browse the catalog of recommended models for chat, coding, vision, and reasoning.
  • Check rate limits to see what to expect on serverless and how to request more.
  • Set up authentication. API keys work across all three deployment options.