# Pricing

> How Together AI bills for inference.

Inference billing works differently depending on your deployment mode:

* [Serverless models](/docs/serverless/models) bill based on usage, with no minimums and no provisioning cost.
* [Dedicated endpoints](/docs/dedicated-endpoints/overview) bill per minute, at a rate set by the hardware you reserve.

For per-model and per-hardware rates, see [together.ai/pricing](https://together.ai/pricing).

## Serverless models

You pay per unit of work, with units determined by model type:

* **Chat, language, embedding, and rerank:** Per input and output token.
* **Image generation:** Per megapixel of output.
* **Video generation:** Per second of output.
* **Speech-to-text and text-to-speech:** Per second of audio.
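As a rough illustration of the units above, the following sketch estimates charges for each model type. All rates are hypothetical placeholders, and quoting token rates per one million tokens is an assumption made for this example; the actual rates and any rounding rules are listed on the pricing page.

```python
from decimal import Decimal

TOKENS_PER_MILLION = Decimal(1_000_000)
PIXELS_PER_MEGAPIXEL = Decimal(1_000_000)


def token_cost(input_tokens: int, output_tokens: int,
               input_rate: Decimal, output_rate: Decimal) -> Decimal:
    """Chat, language, embedding, rerank: billed per input and output token.

    Rates are hypothetical and assumed to be quoted per 1M tokens.
    """
    total = Decimal(input_tokens) * input_rate + Decimal(output_tokens) * output_rate
    return total / TOKENS_PER_MILLION


def image_cost(width: int, height: int, rate_per_megapixel: Decimal) -> Decimal:
    """Image generation: billed per megapixel of output (no rounding applied here)."""
    return Decimal(width * height) / PIXELS_PER_MEGAPIXEL * rate_per_megapixel


def duration_cost(seconds: Decimal, rate_per_second: Decimal) -> Decimal:
    """Video output, speech-to-text, and text-to-speech: billed per second."""
    return seconds * rate_per_second


if __name__ == "__main__":
    # Placeholder rates, not real prices.
    print(token_cost(1_000_000, 500_000, Decimal("0.50"), Decimal("1.00")))
    print(image_cost(1024, 1024, Decimal("0.01")))
    print(duration_cost(Decimal("30"), Decimal("0.002")))
```

`Decimal` is used instead of `float` so that small per-unit rates do not accumulate binary rounding error.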

Find per-model rates in the [serverless model catalog](/docs/serverless/models).

**Cached input discounts:** Select chat models bill cached input tokens at a steeply discounted rate. The discount is applied automatically; see [Chat models](/docs/serverless/models#chat-models) for the supported list.
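To see how a cached input discount could affect a bill, here is a minimal sketch. It assumes cached tokens are a subset of the input tokens and are charged at a separate, lower rate instead of the standard input rate. Both rates are hypothetical and assumed to be quoted per one million tokens.

```python
from decimal import Decimal

TOKENS_PER_MILLION = Decimal(1_000_000)


def input_cost_with_cache(total_input_tokens: int, cached_tokens: int,
                          input_rate: Decimal, cached_rate: Decimal) -> Decimal:
    """Estimate input-token cost when part of the input is served from cache.

    Assumption: cached tokens are billed at `cached_rate` and the remaining
    input tokens at `input_rate` (both hypothetical, per 1M tokens).
    """
    if not 0 <= cached_tokens <= total_input_tokens:
        raise ValueError("cached_tokens must be between 0 and total_input_tokens")
    uncached_tokens = total_input_tokens - cached_tokens
    total = Decimal(uncached_tokens) * input_rate + Decimal(cached_tokens) * cached_rate
    return total / TOKENS_PER_MILLION


if __name__ == "__main__":
    # 1M input tokens, 800k of them cached, with placeholder rates.
    print(input_cost_with_cache(1_000_000, 800_000, Decimal("1.00"), Decimal("0.10")))
```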

## Dedicated endpoints

You pay per minute for the hardware you reserve, regardless of which model you deploy or how many requests you send. Billing starts once the endpoint is running. Time spent provisioning or queuing is not billed, and neither are failed deployments.
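The per-minute model can be sketched as follows. The rate is a hypothetical placeholder, and because the text does not say how partial minutes are rounded, this sketch applies no rounding. Only the interval during which the endpoint is running is counted, so provisioning and queuing time are excluded by construction.

```python
from datetime import datetime, timedelta
from decimal import Decimal


def dedicated_cost(running_since: datetime, stopped_at: datetime,
                   rate_per_minute: Decimal) -> Decimal:
    """Estimate the charge for one run of a dedicated endpoint.

    `running_since` is when the endpoint entered the running state, not when
    the deployment was requested. `rate_per_minute` is a hypothetical rate
    for the reserved hardware.
    """
    if stopped_at < running_since:
        raise ValueError("stopped_at must not be earlier than running_since")
    running_seconds = int((stopped_at - running_since).total_seconds())
    running_minutes = Decimal(running_seconds) / Decimal(60)
    return running_minutes * rate_per_minute


if __name__ == "__main__":
    start = datetime(2025, 1, 1, 12, 0)
    stop = start + timedelta(minutes=90)
    print(dedicated_cost(start, stop, Decimal("0.05")))  # placeholder rate
```

Request volume does not appear anywhere in the calculation: an idle endpoint costs the same as a busy one for the same running time.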

See [Dedicated endpoint pricing](/docs/dedicated-endpoints/overview#pricing) for more details.

## Batch processing

When you don't need real-time responses, use the [batch processing API](/docs/inference/batch/overview) to get a 50% discount on [selected serverless models](/docs/inference/batch/overview#supported-models). Dedicated endpoints can also run batch jobs, but at full price.
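A quick way to compare the two options is to apply the 50% batch discount to an estimated serverless cost. The function name and the input figure below are illustrative only, and the discount applies only to models on the supported list.

```python
from decimal import Decimal

# 50% discount for supported serverless models, as stated above.
BATCH_DISCOUNT = Decimal("0.50")


def batch_cost(real_time_cost: Decimal) -> Decimal:
    """Estimate the batch price of a workload from its real-time serverless cost."""
    if real_time_cost < 0:
        raise ValueError("real_time_cost must not be negative")
    return real_time_cost * (Decimal(1) - BATCH_DISCOUNT)


if __name__ == "__main__":
    estimate = Decimal("10.00")  # hypothetical real-time cost in USD
    print(batch_cost(estimate))
```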
