Together AI offers two ways to run inference:

Serverless models: A shared fleet of popular open models you can call through a per-token API. No GPUs to provision or manage. Best for prototyping and variable traffic on the models we host.

Dedicated endpoints: A single model running on GPUs reserved for you, billed per minute by hardware. Best when traffic is steady, you need consistent latency, or you're serving a fine-tuned model.

Documentation Index
Fetch the complete documentation index at: https://docs.together.ai/llms.txt
Use this file to discover all available pages before exploring further.
Get started
Quickstart
Set up an API key and make your first call in Python, TypeScript, or cURL. A minimal Python call is sketched at the end of this section.
Recommended models
Our picks for common inference use cases.
Pricing
How Together AI bills for inference.
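As a preview of the Quickstart, here is a minimal sketch of a first call with the Together Python SDK. It assumes the API key is available as the TOGETHER_API_KEY environment variable; the model name is only an example, so check Recommended models for a current pick.

```python
# pip install together
from together import Together

# The client reads TOGETHER_API_KEY from the environment.
client = Together()

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # example serverless model name
    messages=[{"role": "user", "content": "What is Together AI?"}],
)
print(response.choices[0].message.content)
```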
Shared inference API
Serverless models and dedicated endpoints use the same inference APIs for generating and retrieving model outputs. Apps work on either deployment mode without code changes; just swap the model parameter:
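A minimal sketch of that swap using the Together Python SDK: the same chat completions call serves both modes, and only the model string changes. The serverless model name below is an example, and the dedicated endpoint identifier is a placeholder for whatever your own deployment is named.

```python
from together import Together

client = Together()  # assumes TOGETHER_API_KEY is set in the environment

def ask(model: str, prompt: str) -> str:
    # Identical request for either deployment mode; only `model` differs.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Serverless: reference a shared model by name (example name).
print(ask("meta-llama/Llama-3.3-70B-Instruct-Turbo", "Hello!"))

# Dedicated: reference your reserved deployment instead (placeholder identifier).
print(ask("your-account/your-dedicated-endpoint-model", "Hello!"))
```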
Integrations
OpenAI compatibility
Drop-in replacement for OpenAI clients; see the sketch after this section.
SDK integrations
Together SDKs and framework wiring.
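For OpenAI compatibility, a minimal sketch is to point the standard OpenAI Python client at Together's API. The base URL shown is the commonly documented one, but confirm it (and any parameter differences) on the OpenAI compatibility page; the model name is again only an example.

```python
import os
from openai import OpenAI

# Reuse the OpenAI client against Together's OpenAI-compatible endpoint.
client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",  # assumed base URL; verify in the docs
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # example serverless model name
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```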
Batch processing
Batch processing
If your workload doesn’t need a real-time response, submit it as a batch job for up to 50% off serverless rates.
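As a rough illustration, a batch job starts from a JSONL file with one request per line. The per-line shape below (a custom_id plus a chat-completions body) follows the common OpenAI-style batch format and is an assumption here, as are the field names; confirm the exact schema and the upload/submit steps in the Batch processing docs.

```python
import json

prompts = ["Summarize document A.", "Summarize document B.", "Summarize document C."]

# Hypothetical batch input file; field names are assumptions, not the confirmed schema.
with open("batch_input.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{i}",  # assumed field for matching outputs to inputs
            "body": {
                "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",  # example model name
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        f.write(json.dumps(request) + "\n")
```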
Model capabilities
Chat & text
Chat completions, streaming, parameters.
Function calling
Tool use and agentic loops.
Vision
Pass images alongside text.
Image generation
FLUX, Kontext, and Google models.
Video generation
Text-to-video and image-to-video.
Speech-to-text
Batch and streaming transcription.
Text-to-speech
HTTP and WebSocket audio output.
Embeddings & rerank
Vectors, rerankers, and RAG.