Together AI offers two ways to run inference:

Serverless models: A shared fleet of popular open models you can call through a per-token API. No GPUs to provision or manage. Best for prototyping and variable traffic on the models we host.

Dedicated endpoints: A single model running on GPUs reserved for you, billed per minute by hardware. Best when traffic is steady, you need consistent latency, or you're serving a fine-tuned model.

Documentation Index
Fetch the complete documentation index at: https://docs.together.ai/llms.txt
Use this file to discover all available pages before exploring further.
Get started
Quickstart
Set up an API key and make your first call in Python, TypeScript, or cURL. A minimal Python call is sketched at the end of this section.
Recommended models
Our picks for common inference use cases.
Pricing
How Together AI bills for inference.
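As a preview of the Quickstart, here is a minimal sketch of a first call with the Together Python SDK. It assumes the API key is available as the TOGETHER_API_KEY environment variable; the model name is only an example, so check Recommended models for a current pick.

```python
# pip install together
from together import Together

# The client reads TOGETHER_API_KEY from the environment.
client = Together()

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # example serverless model name
    messages=[{"role": "user", "content": "What is Together AI?"}],
)
print(response.choices[0].message.content)
```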
Shared inference API
Serverless models and dedicated endpoints use the same inference APIs for generating and retrieving model outputs. Apps work on either deployment mode without code changes; just swap the model parameter:
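A minimal sketch of that swap using the Together Python SDK: the same chat completions call serves both modes, and only the model string changes. The serverless model name below is an example, and the dedicated endpoint identifier is a placeholder for whatever your own deployment is named.

```python
from together import Together

client = Together()  # assumes TOGETHER_API_KEY is set in the environment

def ask(model: str, prompt: str) -> str:
    # Identical request for either deployment mode; only `model` differs.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Serverless: reference a shared model by name (example name).
print(ask("meta-llama/Llama-3.3-70B-Instruct-Turbo", "Hello!"))

# Dedicated: reference your reserved deployment instead (placeholder identifier).
print(ask("your-account/your-dedicated-endpoint-model", "Hello!"))
```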
Integrations
OpenAI compatibility
Drop-in replacement for OpenAI clients; see the sketch after this section.
SDK integrations
Together SDKs and framework wiring.
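For OpenAI compatibility, a minimal sketch is to point the standard OpenAI Python client at Together's API. The base URL shown is the commonly documented one, but confirm it (and any parameter differences) on the OpenAI compatibility page; the model name is again only an example.

```python
import os
from openai import OpenAI

# Reuse the OpenAI client against Together's OpenAI-compatible endpoint.
client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",  # assumed base URL; verify in the docs
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # example serverless model name
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```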
Batch processing
Batch processing
If your workload doesn’t need a real-time response, submit it as a batch job for up to 50% off serverless rates.
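As a rough illustration, a batch job starts from a JSONL file with one request per line. The per-line shape below (a custom_id plus a chat-completions body) follows the common OpenAI-style batch format and is an assumption here, as are the field names; confirm the exact schema and the upload/submit steps in the Batch processing docs.

```python
import json

prompts = ["Summarize document A.", "Summarize document B.", "Summarize document C."]

# Hypothetical batch input file; field names are assumptions, not the confirmed schema.
with open("batch_input.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{i}",  # assumed field for matching outputs to inputs
            "body": {
                "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",  # example model name
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        f.write(json.dumps(request) + "\n")
```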
Model capabilities
Chat & text
Chat completions, streaming, parameters.
Function calling
Tool use and agentic loops.
Vision
Pass images alongside text.
Image generation
FLUX, Kontext, and Google models.
Video generation
Text-to-video and image-to-video.
Speech-to-text
Batch and streaming transcription.
Text-to-speech
HTTP and WebSocket audio output.
Embeddings & rerank
Vectors, rerankers, and RAG.