Documentation Index

Fetch the complete documentation index at: https://docs.together.ai/llms.txt

Use this file to discover all available pages before exploring further.

Together AI offers two ways to run inference:

- Serverless models: a shared fleet of popular open models you can call through a per-token API. No GPUs to provision or manage. Best for prototyping and variable traffic on the models we host.
- Dedicated endpoints: a single model running on GPUs reserved for you, billed per minute by hardware. Best when traffic is steady, you need consistent latency, or you're serving a fine-tuned model.

Get started

Quickstart

Set up an API key and make your first call in Python, TypeScript, or cURL.

Recommended models

Our picks for common inference use cases.

Pricing

How Together AI bills for inference.

Shared inference API

Serverless models and dedicated endpoints use the same inference APIs for generating and retrieving model outputs. Apps work on either deployment mode without code changes; just swap the model parameter:
from together import Together

client = Together()  # reads the TOGETHER_API_KEY environment variable

# Serverless model request
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Dedicated endpoint request
response = client.chat.completions.create(
    model="<ACCOUNT_NAME>/Qwen/Qwen3.5-9B-FP8-bb04c904",
    messages=[{"role": "user", "content": "Hello!"}],
)

Integrations

OpenAI compatibility

Drop-in replacement for OpenAI clients.

SDK integrations

Together SDKs and framework wiring.

Batch processing
If your workload doesn’t need a real-time response, submit it as a batch job for up to 50% off serverless rates.

Model capabilities

Chat & text

Chat completions, streaming, parameters.

Function calling

Tool use and agentic loops.

Vision

Pass images alongside text.

Image generation

FLUX, Kontext, and Google models.

Video generation

Text-to-video and image-to-video.

Speech-to-text

Batch and streaming transcription.

Text-to-speech

HTTP and WebSocket audio output.

Embeddings & rerank

Vectors, rerankers, and RAG.