Inference billing works differently depending on your deployment mode. For per-model and per-hardware rates, see together.ai/pricing.

Serverless models

You pay per unit of work, with units determined by model type:
  • Chat, language, embedding, and rerank: Per input and output token.
  • Image generation: Per megapixel of output.
  • Video generation: Per second of output.
  • Speech-to-text and text-to-speech: Per second of audio.
Find per-model rates in the serverless model catalog.

Cached input discounts: Select chat models bill cached input tokens at a steep discount, applied automatically. See Chat models for the supported list.
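
As a sketch of how per-token billing adds up, the snippet below estimates the cost of a single chat request, including the cached input discount. All rates are hypothetical placeholders, not Together's actual prices; look up real per-model rates in the serverless model catalog.

```python
# Hypothetical serverless cost estimator. All rates are placeholder values
# in USD per million tokens, not Together's actual prices.
INPUT_RATE = 0.60          # fresh (uncached) input tokens
CACHED_INPUT_RATE = 0.15   # assumed discounted rate for cached input tokens
OUTPUT_RATE = 0.60         # output tokens

def chat_request_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Estimate one chat request's cost in USD.

    input_tokens counts only uncached tokens; cached_tokens are billed
    separately at the automatic cached-input discount.
    """
    return (
        input_tokens * INPUT_RATE
        + cached_tokens * CACHED_INPUT_RATE
        + output_tokens * OUTPUT_RATE
    ) / 1_000_000

# 8,000 fresh input tokens, 24,000 cached input tokens, 1,000 output tokens:
print(f"${chat_request_cost(8_000, 24_000, 1_000):.4f}")  # $0.0090
```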

Dedicated endpoints

You pay per minute for the hardware you reserve, regardless of your model or request volume. Billing starts once the endpoint is running; provisioning, queuing, and failed deployments are not billed. See Dedicated endpoint pricing for more details.
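
A minimal sketch of per-minute dedicated billing follows. The per-minute rate and per-replica billing are assumptions for illustration only; see Dedicated endpoint pricing for actual per-hardware rates.

```python
# Hypothetical dedicated-endpoint cost sketch. The per-minute rate and
# per-replica multiplier are placeholders, not actual Together prices.
RATE_PER_MINUTE = 0.04  # placeholder USD per minute per replica

def dedicated_cost(running_minutes: int, replicas: int = 1) -> float:
    """Cost in USD for time spent in the running state.

    Provisioning, queuing, and failed deployments are not billed,
    so only running minutes are counted here.
    """
    return running_minutes * replicas * RATE_PER_MINUTE

# An endpoint that ran for 8 hours on 2 replicas:
print(f"${dedicated_cost(8 * 60, replicas=2):.2f}")  # $38.40
```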

Batch processing

When you don’t need real-time responses, the batch processing API offers a 50% discount on selected serverless models. Dedicated endpoints can also run batch jobs, but at full price.
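
To see what the 50% batch discount means in practice, here is a hypothetical cost comparison for the same token volume run in real time versus as a batch job on a serverless model. The base rate is a placeholder, not an actual Together price.

```python
# Hypothetical comparison of real-time vs. batch cost for the same
# serverless workload, assuming the flat 50% batch discount on selected
# serverless models described above.
BASE_RATE = 0.60       # placeholder USD per million tokens, real-time
BATCH_DISCOUNT = 0.50  # batch jobs on selected serverless models

def serverless_cost(total_tokens: int, use_batch: bool) -> float:
    """Cost in USD for processing total_tokens on a serverless model."""
    rate = BASE_RATE * (1 - BATCH_DISCOUNT) if use_batch else BASE_RATE
    return total_tokens * rate / 1_000_000

# 100M tokens processed overnight as a batch instead of in real time:
print(f"real-time: ${serverless_cost(100_000_000, use_batch=False):.2f}")  # $60.00
print(f"batch:     ${serverless_cost(100_000_000, use_batch=True):.2f}")   # $30.00
```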