Batch Inference
Process jobs asynchronously with the Batch API.
Learn how to use the Batch API to send asynchronous groups of requests with 50% lower costs, higher rate limits, and flexible completion windows. The service is ideal for processing jobs that don't require immediate responses.
Overview
The Batch API enables you to process large volumes of requests asynchronously at 50% lower cost compared to real-time API calls. It's perfect for workloads that don't need immediate responses, such as:
- Running evaluations and data analysis
- Classifying large datasets
- Offline summarization
- Synthetic data generation
- Content generation for marketing
- Dataset processing and transformations
Compared to using standard endpoints directly, Batch API offers:
- Better cost efficiency: 50% cost discount compared to synchronous APIs
- Higher rate limits: Substantially more headroom with separate rate limit pools
- Large-scale support: Process thousands of requests per batch
- Flexible completion: Best-effort completion within 24 hours with progress tracking
Getting started
Note: Make sure your together version is >1.5.13. Run pip install together --upgrade to upgrade if needed.
1. Prepare your batch file
Batches start with a .jsonl file where each line contains the details of an individual request to the API. The available endpoint is /v1/chat/completions (Chat Completions API). Each request must include a unique custom_id value, which you can use to reference results after completion. Here's an example of an input file with 2 requests:
{"custom_id": "request-1", "body": {"model": "deepseek-ai/DeepSeek-V3", "messages": [{"role": "user", "content": "Hello, world!"}], "max_tokens": 200}}
{"custom_id": "request-2", "body": {"model": "deepseek-ai/DeepSeek-V3", "messages": [{"role": "user", "content": "Explain quantum computing"}], "max_tokens": 200}}
Each line in your batch file must follow this schema:
Field | Type | Required | Description |
---|---|---|---|
custom_id | string | Yes | Unique identifier for tracking (max 64 chars) |
body | object | Yes | The request body matching the endpoint's schema |
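If you are generating requests programmatically, you can build the input file with Python's standard library. The following is a minimal sketch that reproduces the two-request example above; the prompts and file name are illustrative:
import json

requests = [
    {"custom_id": "request-1",
     "body": {"model": "deepseek-ai/DeepSeek-V3",
              "messages": [{"role": "user", "content": "Hello, world!"}],
              "max_tokens": 200}},
    {"custom_id": "request-2",
     "body": {"model": "deepseek-ai/DeepSeek-V3",
              "messages": [{"role": "user", "content": "Explain quantum computing"}],
              "max_tokens": 200}},
]

# Write one JSON object per line, with no pretty-printing
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")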
2. Upload your batch input file
You must first upload your input file so that you can reference it correctly when creating batches. Upload your .jsonl file using the Files API with purpose=batch-api.
Upload files for Batch API
from together import Together
client = Together()
# Uploads batch job file
file_resp = client.files.upload(file="batch_input.jsonl", purpose="batch-api")
You can also upload the file with the CLI:
together files upload batch_input.jsonl
This will return a file object with id and other details:
FileResponse(
id='file-fa37fdce-89cb-414b-923c-2add62250155',
object=<ObjectType.File: 'file'>,
...
filename='simpleqa_batch_requests.jsonl',
bytes=1268723,
line_count=0,
processed=True,
FileType='jsonl')
3. Create the batch
Once you've successfully uploaded your input file, you can use the input File object's ID to create a batch. For now, the completion window defaults to 24h and cannot be changed. You can also provide custom metadata.
Create the Batch
file_id = file_resp.id
batch = client.batches.create_batch(file_id, endpoint="/v1/chat/completions")
print(batch.id)
This request will return a Batch object with metadata about your batch:
{
"id": "batch-xyz789",
"status": "VALIDATING",
"endpoint": "/v1/chat/completions",
"input_file_id": "file-abc123",
"created_at": "2024-01-15T10:00:00Z",
"request_count": 0,
"model_id": null
}
4. Check the status of a batch
You can check the status of a batch at any time, which will return updated batch information.
Check the status of a batch
batch_stat = client.batches.get_batch(batch.id)
print(batch_stat.status)
The status of a given Batch object can be any of the following:
Status | Description |
---|---|
VALIDATING | The input file is being validated before the batch can begin |
IN_PROGRESS | Batch is in progress |
COMPLETED | Batch processing completed successfully |
FAILED | Batch processing failed |
EXPIRED | Batch exceeded deadline |
CANCELLED | Batch was cancelled |
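Since batches run asynchronously, a simple polling loop is usually enough to track progress. This sketch assumes a batch ID like the one returned in step 3 and polls once a minute; the interval and the string comparison against the status are illustrative (the SDK may return the status as an enum rather than a plain string):
import time
from together import Together

client = Together()

TERMINAL_STATES = {"COMPLETED", "FAILED", "EXPIRED", "CANCELLED"}

while True:
    batch_stat = client.batches.get_batch("batch-xyz789")
    status = str(batch_stat.status)
    print(f"Batch status: {status}")
    # Works whether the SDK returns a plain string or an enum whose
    # name contains one of the terminal states above
    if any(state in status for state in TERMINAL_STATES):
        break
    time.sleep(60)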
5. Retrieve the results
Once the batch is complete, you can download the output by making a request to retrieve the output file using the output_file_id field from the Batch object.
Retrieving the batch results
from together import Together
client = Together()
# Get the batch status to find output_file_id
batch = client.batches.get_batch('batch-xyz789')
if batch.status == 'COMPLETED':
    # Download the output file
    client.files.retrieve_content(id=batch.output_file_id, output="batch_output.jsonl")
The output .jsonl file will have one response line for every successful request line in the input file. Any failed requests will have their error information in a separate error file accessible via error_file_id.
Note that the output line order may not match the input line order. Use the custom_id field to map requests to results.
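For example, you can index the downloaded results by custom_id before processing them. This is a minimal sketch; the exact layout of the response fields inside each line may vary, so the whole record is kept:
import json

results = {}
with open("batch_output.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Key each result by the custom_id supplied in the input file
        results[record["custom_id"]] = record

print(results.get("request-1"))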
6. Get a list of all batches
At any time, you can see all your batches.
Getting a list of all batches
from together import Together
client = Together()
# List all batches
batches = client.batches.list_batches()
for batch in batches:
print(batch)
Model availability
The following models are supported for batch processing:
Model ID | Size |
---|---|
deepseek-ai/DeepSeek-R1 | 685B |
deepseek-ai/DeepSeek-V3 | 671B |
meta-llama/Llama-3-70b-chat-hf | 70B |
meta-llama/Llama-3.3-70B-Instruct-Turbo | 70B |
meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | 17B |
meta-llama/Llama-4-Scout-17B-16E-Instruct | 17B |
meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo | 405B |
meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo | 70B |
meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo | 8B |
mistralai/Mistral-7B-Instruct-v0.1 | 7B |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 8x7B |
Qwen/Qwen2.5-72B-Instruct-Turbo | 72B |
Qwen/Qwen2.5-7B-Instruct-Turbo | 7B |
Qwen/Qwen3-235B-A22B-fp8-tput | 235B |
Qwen/QwQ-32B | 32B |
Rate limits
Batch API rate limits are separate from existing per-model rate limits. The Batch API has specific rate limits:
- Max Token limits: A maximum of 10M tokens can be enqueued per model
- Per-batch limits: A single batch may include up to 50,000 requests (see the chunking sketch after this list)
- Batch file size: Maximum 100MB per batch input file
- Separate pool: Batch API usage doesn't consume tokens from standard rate limits
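If your workload exceeds the per-batch limits, you can split it into multiple input files before uploading. A minimal sketch, assuming a list of request dicts like the ones in step 1 (the chunk size constant and file-name prefix are illustrative):
import json

MAX_REQUESTS_PER_BATCH = 50_000

def write_batch_files(requests, prefix="batch_input"):
    # Split requests into files that respect the 50,000-request cap.
    # Each file should also stay under the 100MB size limit.
    paths = []
    for start in range(0, len(requests), MAX_REQUESTS_PER_BATCH):
        path = f"{prefix}_{start // MAX_REQUESTS_PER_BATCH}.jsonl"
        with open(path, "w") as f:
            for req in requests[start:start + MAX_REQUESTS_PER_BATCH]:
                f.write(json.dumps(req) + "\n")
        paths.append(path)
    return paths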
Error handling
When errors occur during batch processing, they are recorded in a separate error file accessible via the error_file_id
field. Common error codes include:
Error Code | Description | Solution |
---|---|---|
400 | Invalid request format | Check JSONL syntax and required fields |
401 | Authentication failed | Verify API key |
404 | Batch not found | Check batch ID |
429 | Rate limit exceeded | Reduce request frequency |
500 | Server error | Retry with exponential backoff |
Error File Format:
{"custom_id": "req-1", "error": {"message": "Invalid model specified", "code": "invalid_model"}}
{"custom_id": "req-5", "error": {"message": "Request timeout", "code": "timeout"}}
Batch expiration
Batches that do not complete within the 24-hour window will move to an EXPIRED state. Unfinished requests are cancelled, and completed requests are made available via the output file. You will only be charged for tokens consumed by completed requests. Batch completion is best-effort within 24 hours.
Best practices
Optimal Batch Size
- Aim for 1,000-10,000 requests per batch for best performance
- Maximum 50,000 requests per batch
- Keep file size under 100MB
Error Handling
- Always check the error_file_id for partial failures
- Implement retry logic for failed requests (see the sketch after this list)
- Use unique custom_id values for easy tracking
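One way to implement the retry logic is to re-emit the failed requests into a new input file and run it through steps 2 and 3 again. This sketch assumes the original batch_input.jsonl and the batch_errors.jsonl downloaded in the error-handling section are available locally:
import json

# Collect the custom_ids of requests that failed
with open("batch_errors.jsonl") as f:
    failed_ids = {json.loads(line)["custom_id"] for line in f}

# Copy only the failed requests into a new input file for resubmission
with open("batch_input.jsonl") as src, open("batch_retry.jsonl", "w") as dst:
    for line in src:
        if json.loads(line)["custom_id"] in failed_ids:
            dst.write(line)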
Model Selection
- Choose models based on your quality/cost requirements
- Smaller models (7B-17B) for simple tasks
- Larger models (70B+) for complex reasoning
Request Formatting
- Validate JSON before submission (see the sketch after this list)
- Use consistent schema across requests
- Include all required fields
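A lightweight pre-submission check catches most formatting problems before you upload. This sketch only enforces the schema documented above: valid JSON on every line, a unique custom_id of at most 64 characters, and a body object.
import json

def validate_batch_file(path):
    seen = set()
    with open(path) as f:
        for n, line in enumerate(f, start=1):
            record = json.loads(line)  # raises if the line is not valid JSON
            cid = record["custom_id"]
            assert isinstance(cid, str) and len(cid) <= 64, f"line {n}: invalid custom_id"
            assert cid not in seen, f"line {n}: duplicate custom_id {cid!r}"
            assert isinstance(record["body"], dict), f"line {n}: body must be an object"
            seen.add(cid)

validate_batch_file("batch_input.jsonl")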
Monitoring
- Poll status endpoint every 30-60 seconds
- Set up notifications for completion (if available)
FAQ
Q: How long do batches take to complete?
A: Processing time depends on batch size and model complexity. Most batches typically complete within 1-12 hours, but can take up to 24 hours (or only partially complete within 24 hours) depending on inference capacity.
Q: Can I cancel a running batch?
A: Currently, batches cannot be cancelled once processing begins.
Q: What happens if my batch exceeds the deadline?
A: The batch will be marked as EXPIRED and partial results may be available.
Q: Are results returned in the same order as requests?
A: No, results may be in any order. Use custom_id to match requests with responses.
Q: Can I use the same file for multiple batches?
A: Yes, uploaded files can be reused for multiple batch jobs.