Inference billing works differently depending on your deployment mode. For per-model and per-hardware rates, see together.ai/pricing.

Serverless models

You pay per unit of work, with units determined by model type:
  • Chat, language, embedding, and rerank: Per input and output token.
  • Image generation: Per megapixel of output.
  • Video generation: Per second of output.
  • Speech-to-text and text-to-speech: Per second of audio.
Find per-model rates in the serverless model catalog.

Cached input discounts: Select chat models bill cached input tokens at a steep discount, applied automatically. See Chat models for the supported list.
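
As a sketch of how per-token billing adds up, the snippet below estimates the cost of a single chat request, including the cached input discount. All rates are hypothetical placeholders, not Together's actual prices; look up real per-model rates in the serverless model catalog.

```python
# Hypothetical serverless cost estimator. All rates are placeholder values
# in USD per million tokens, not Together's actual prices.
INPUT_RATE = 0.60          # fresh (uncached) input tokens
CACHED_INPUT_RATE = 0.15   # assumed discounted rate for cached input tokens
OUTPUT_RATE = 0.60         # output tokens

def chat_request_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Estimate one chat request's cost in USD.

    input_tokens counts only uncached tokens; cached_tokens are billed
    separately at the automatic cached-input discount.
    """
    return (
        input_tokens * INPUT_RATE
        + cached_tokens * CACHED_INPUT_RATE
        + output_tokens * OUTPUT_RATE
    ) / 1_000_000

# 8,000 fresh input tokens, 24,000 cached input tokens, 1,000 output tokens:
print(f"${chat_request_cost(8_000, 24_000, 1_000):.4f}")  # $0.0090
```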

Dedicated endpoints

You pay per minute for the hardware you reserve, regardless of your model or request volume. Billing starts once the endpoint is running; provisioning, queuing, and failed deployments are not billed. See Dedicated endpoint pricing for more details.
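
A minimal sketch of per-minute dedicated billing follows. The per-minute rate and per-replica billing are assumptions for illustration only; see Dedicated endpoint pricing for actual per-hardware rates.

```python
# Hypothetical dedicated-endpoint cost sketch. The per-minute rate and
# per-replica multiplier are placeholders, not actual Together prices.
RATE_PER_MINUTE = 0.04  # placeholder USD per minute per replica

def dedicated_cost(running_minutes: int, replicas: int = 1) -> float:
    """Cost in USD for time spent in the running state.

    Provisioning, queuing, and failed deployments are not billed,
    so only running minutes are counted here.
    """
    return running_minutes * replicas * RATE_PER_MINUTE

# An endpoint that ran for 8 hours on 2 replicas:
print(f"${dedicated_cost(8 * 60, replicas=2):.2f}")  # $38.40
```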

Batch processing

When you don’t need real-time responses, the batch processing API offers a 50% discount on selected serverless models. Dedicated endpoints can also run batch jobs, but at full price.
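
To see what the 50% batch discount means in practice, here is a hypothetical cost comparison for the same token volume run in real time versus as a batch job on a serverless model. The base rate is a placeholder, not an actual Together price.

```python
# Hypothetical comparison of real-time vs. batch cost for the same
# serverless workload, assuming the flat 50% batch discount on selected
# serverless models described above.
BASE_RATE = 0.60       # placeholder USD per million tokens, real-time
BATCH_DISCOUNT = 0.50  # batch jobs on selected serverless models

def serverless_cost(total_tokens: int, use_batch: bool) -> float:
    """Cost in USD for processing total_tokens on a serverless model."""
    rate = BASE_RATE * (1 - BATCH_DISCOUNT) if use_batch else BASE_RATE
    return total_tokens * rate / 1_000_000

# 100M tokens processed overnight as a batch instead of in real time:
print(f"real-time: ${serverless_cost(100_000_000, use_batch=False):.2f}")  # $60.00
print(f"batch:     ${serverless_cost(100_000_000, use_batch=True):.2f}")   # $30.00
```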