Overview

Together AI offers three ways to run inference:

Serverless models: A shared fleet of popular open models that you call through a per-token API. No GPUs to provision or manage. Best for prototyping or for apps with variable traffic.

Provisioned throughput: Reserved capacity for a selected stock model, with a defined SLA covering committed throughput and reliability. Best for production workloads that need stronger guarantees than serverless.

Dedicated model inference: A single model running on GPUs reserved for you, billed per minute by hardware. Best for apps with steady traffic, apps that need consistent latency, or serving fine-tuned models.

Get started

Quickstart

Set up an API key and make your first call in Python, TypeScript, or cURL.

Recommended models

Our picks for common inference use cases.

Shared inference API

Serverless, provisioned throughput, and dedicated model inference all use the same inference APIs for generating and retrieving model outputs. Apps work on any deployment mode without code changes; just swap the model parameter:

Python:

from together import Together

client = Together()

# Serverless model request
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Dedicated model inference request
response = client.chat.completions.create(
    model="<ACCOUNT_NAME>/Qwen/Qwen3.5-9B-FP8-bb04c904",
    messages=[{"role": "user", "content": "Hello!"}],
)

TypeScript:

import Together from "together-ai";
const client = new Together();

// Serverless model request
let response = await client.chat.completions.create({
    model: "moonshotai/Kimi-K2.6",
    messages: [{ role: "user", content: "Hello!" }],
});

// Dedicated model inference request
response = await client.chat.completions.create({
    model: "<ACCOUNT_NAME>/Qwen/Qwen3.5-9B-FP8-bb04c904",
    messages: [{ role: "user", content: "Hello!" }],
});

cURL:

# Serverless model request
curl -X POST "https://api.together.ai/v1/chat/completions" \
     -H "Authorization: Bearer $TOGETHER_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{
          "model": "moonshotai/Kimi-K2.6",
          "messages": [{"role": "user", "content": "Hello!"}]
        }'

# Dedicated model inference request
curl -X POST "https://api.together.ai/v1/chat/completions" \
     -H "Authorization: Bearer $TOGETHER_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{
          "model": "<ACCOUNT_NAME>/Qwen/Qwen3.5-9B-FP8-bb04c904",
          "messages": [{"role": "user", "content": "Hello!"}]
        }'
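
Because the model identifier is the only thing that changes between deployment modes, one way to keep application code mode-agnostic is to look the identifier up from configuration. The sketch below is illustrative, not an official pattern: it uses only the Python standard library and the endpoint from the cURL examples above, and the `MODELS` table, its `serverless` / `dedicated` keys, and the `build_chat_request` and `send` helpers are hypothetical names introduced for this example.

```python
import json
import urllib.request

# Endpoint taken from the cURL examples above.
API_URL = "https://api.together.ai/v1/chat/completions"

# Hypothetical deployment table: the keys are names made up for this sketch,
# the values are the model identifiers used in the examples above.
MODELS = {
    "serverless": "moonshotai/Kimi-K2.6",
    "dedicated": "<ACCOUNT_NAME>/Qwen/Qwen3.5-9B-FP8-bb04c904",
}


def build_chat_request(deployment: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat completions request for one deployment."""
    payload = {
        "model": MODELS[deployment],  # the only field that differs between modes
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a valid key, `send(build_chat_request("serverless", "Hello!", api_key))` and `send(build_chat_request("dedicated", "Hello!", api_key))` would issue the same two requests as the examples above. Moving between modes then becomes a configuration change rather than a code change.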

Integrations

OpenAI compatibility

Drop-in replacement for OpenAI clients.

SDK integrations

Together SDKs and framework wiring.

Batch processing

If your workload doesn’t need a real-time response, submit it as a batch job for up to 50% off serverless rates.
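
To make the "up to 50%" figure concrete, here is a small arithmetic sketch. The discount is an upper bound, so the function returns the lowest cost a batch job could reach, not a quote. The function name and the dollar amount are hypothetical and are not actual Together AI prices.

```python
def batch_cost_floor(serverless_cost: float, max_discount: float = 0.5) -> float:
    """Lowest possible batch cost, given the serverless cost of the same workload."""
    if not 0.0 <= max_discount <= 1.0:
        raise ValueError("max_discount must be between 0 and 1")
    return serverless_cost * (1.0 - max_discount)


# Hypothetical workload that would cost $120.00 at serverless rates.
print(batch_cost_floor(120.00))  # 60.0
```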

Model capabilities

Chat & text

Chat completions, streaming, parameters.

Function calling

Tool use and agentic loops.

Vision

Pass images alongside text.

Image generation

FLUX, Kontext, and Google models.

Video generation

Text-to-video and image-to-video.

Speech-to-text

Batch and streaming transcription.

Text-to-speech

HTTP and WebSocket audio output.

Embeddings & rerank

Vectors, rerankers, and RAG.
