The batch API runs many independent inference requests asynchronously from a single uploaded JSONL file. You get up to 50% off serverless rates and a separate rate-limit pool, in exchange for a job-shaped (rather than request-shaped) workflow.
When to use it
Consider using batch jobs when latency is not your primary concern: for example, classifying a large dataset, running evaluations, generating synthetic data, or producing offline summaries. The 24-hour completion window is a maximum, not a typical wait time; small batches (under 1,000 requests) typically finish in minutes. If your workload is interactive, depends on shared conversation state across requests, or needs sub-second responses, use the standard chat completions endpoint instead.
Rate limits
Batch jobs run against a separate rate-limit pool from the standard real-time API.
- Up to 50,000 requests per batch.
- Up to 100 MB per input file.
- Up to 30B tokens enqueued per model at any time.
- Completion window defaults to `24h` and cannot be changed; it is a best-effort target.
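The file-level limits above can be checked locally before uploading. The sketch below is a minimal pre-upload validator against the documented caps (50,000 requests per batch, 100 MB per input file); the function name and structure are illustrative, not part of the Together SDK:

```python
import json
import os

MAX_REQUESTS = 50_000            # per-batch request cap
MAX_FILE_BYTES = 100 * 1024**2   # 100 MB input-file cap

def precheck_batch_file(path):
    """Validate a batch JSONL file against the documented limits
    before uploading. Raises ValueError on the first violation."""
    if os.path.getsize(path) > MAX_FILE_BYTES:
        raise ValueError("input file exceeds 100 MB")
    with open(path) as f:
        for n, line in enumerate(f, start=1):
            if n > MAX_REQUESTS:
                raise ValueError("more than 50,000 requests in batch")
            try:
                json.loads(line)  # a malformed line fails the whole batch server-side
            except json.JSONDecodeError as e:
                raise ValueError(f"line {n} is not valid JSON: {e}")
```

Running this before every upload is cheap insurance: a single malformed line would otherwise fail the entire batch during validation.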
Supported models
Most serverless models support batch processing through the chat completions endpoint. Batch jobs can also run against dedicated endpoints, but the discount does not apply to dedicated endpoint usage.
Discounted models
The following serverless models receive the 50% batch discount:

| Model ID |
|---|
| meta-llama/Llama-3.3-70B-Instruct-Turbo |
| meta-llama/Llama-3-70b-chat-hf |
| Qwen/Qwen2.5-7B-Instruct-Turbo |
| mistralai/Mixtral-8x7B-Instruct-v0.1 |
| zai-org/GLM-4.5-Air-FP8 |
| openai/whisper-large-v3 |
Models not available for batch
The following serverless models are not currently available for batch processing; batch jobs that target them will fail:

| Model ID |
|---|
| deepseek-ai/DeepSeek-R1 |
| deepseek-ai/DeepSeek-V3.1 |
| deepseek-ai/DeepSeek-V4-Pro |
| MiniMaxAI/MiniMax-M2.7 |
| moonshotai/Kimi-K2.5 |
| moonshotai/Kimi-K2.6 |
| Qwen/Qwen3.5-397B-A17B |
| zai-org/GLM-5 |
| zai-org/GLM-5.1 |
Run your first batch job
Follow the batch tutorial for an end-to-end walkthrough: prepare a JSONL file, upload it, create the batch, poll until it finishes, and download the results. Batch job results are returned in arbitrary order. Use the `custom_id` field on each input request to reconcile inputs with outputs and errors. A single uploaded file can back multiple batch jobs without re-uploading.
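As a sketch of the preparation step, the snippet below writes a JSONL input file for a small labeling job. It assumes each line carries a `custom_id` plus a `body` in chat-completions shape; check the tutorial for the exact per-line field names. The dataset and IDs are hypothetical, and the model ID is taken from the discounted-models table above:

```python
import json

# A hypothetical labeling dataset; each record gets a stable custom_id
# so outputs and errors can be joined back to inputs later.
documents = {"doc-001": "Great product!", "doc-002": "Arrived broken."}

with open("batch_input.jsonl", "w") as f:
    for doc_id, text in documents.items():
        request = {
            "custom_id": doc_id,  # join key across input/output/error files
            "body": {
                "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",
                "messages": [
                    {"role": "system",
                     "content": "Reply with POSITIVE or NEGATIVE only."},
                    {"role": "user", "content": text},
                ],
                "max_tokens": 4,   # short-output workload: cap output tokens
                "temperature": 0,
            },
        }
        f.write(json.dumps(request) + "\n")
```

The resulting `batch_input.jsonl` is what gets uploaded before creating the batch.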
Billing
Together bills you for each successful response in the output file. Failed requests in the error file aren't billed. Cancelling a batch doesn't refund successful responses generated before the cancellation took effect.
Best practices
- Aim for 1,000 to 10,000 requests per batch: smaller batches still work but waste the per-job overhead; larger batches risk hitting the 50,000-request cap.
- Keep `custom_id` values stable and meaningful: treat them as the join key between the input, output, and error files.
- For classification or labeling, set `max_tokens` to 4 and `temperature` to 0: constrain the system prompt to return only the label. Output tokens dominate cost on short-output workloads.
- Validate your JSONL locally before uploading: a malformed input file fails the entire batch in `VALIDATING`.
- Track progress by status, not wall-clock time: complex or popular models can occasionally exceed the standard 24-hour window. As long as the status is `IN_PROGRESS`, the job is still being processed; wait at least 72 hours of `IN_PROGRESS` before contacting support.
- Always inspect the error file: even when the batch reports `COMPLETED`, per-request errors don't change the overall batch status.
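Since results come back in arbitrary order and per-request errors land in a separate file, the reconciliation step can be sketched as a join on `custom_id`. This is an illustrative helper, not SDK code, and it assumes the downloaded output and error files are JSONL with a top-level `custom_id` on each line:

```python
import json

def index_by_custom_id(jsonl_text):
    """Index downloaded batch results by custom_id."""
    index = {}
    for line in jsonl_text.splitlines():
        if line.strip():
            rec = json.loads(line)
            index[rec["custom_id"]] = rec
    return index

def reconcile(input_ids, output_text, error_text):
    """Split input custom_ids into answered / failed / missing.
    Anything in `missing` was neither answered nor errored, which
    usually means the batch is still in progress."""
    outputs = index_by_custom_id(output_text)
    errors = index_by_custom_id(error_text)
    answered = {cid: outputs[cid] for cid in input_ids if cid in outputs}
    failed = {cid: errors[cid] for cid in input_ids if cid in errors}
    missing = [cid for cid in input_ids
               if cid not in outputs and cid not in errors]
    return answered, failed, missing
```

Driving retries from the `failed` and `missing` buckets (rather than re-running the whole batch) keeps billing limited to the requests that actually need another attempt.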
Next steps
- Run a batch job: tutorial walking through an end-to-end batch job from JSONL to results.
- Manage batch jobs: cancel, list, error files, and other operational reference.
- OpenAI compatibility: how Together’s Batch API compares to the OpenAI Batch endpoint.