Rate limits cap how often you can call Together AI serverless models. They protect platform capacity and keep the service available across users. Limits are set per organization and per model, and they adjust based on your recent usage. Requests that exceed your limit return a 429 Too Many Requests error.
Dynamic rate limits
Together AI uses dynamic rate limits instead of fixed thresholds. Each organization has a dynamic rate per model that adjusts based on:

- The model's live capacity.
- Your recent successful usage on that model.
How Together handles traffic spikes
Together buffers sudden traffic spikes so that every user keeps receiving timely responses. Best-effort smoothing absorbs most bursts before any limiting is applied. If a burst still produces failures, the error code you get back depends on whether the failed request was below or above your dynamic rate:

- Requests at or below your dynamic rate return 503 Service Unavailable. These failures are attributed to platform capacity, i.e., the model is overloaded and unable to serve requests (not due to your usage).
- Requests above your dynamic rate return 429 Too Many Requests with one of these error types:
  - error_type: "dynamic_request_limited" for request-based limiting.
  - error_type: "dynamic_token_limited" for token-based limiting.
Rewards for sustained traffic
Steady traffic helps Together predict demand and scale capacity over time. If your request rate increases gradually and stays consistent, your success rate will improve, which raises your dynamic rate (the burst cushion based on recent successful usage). The platform then ramps up capacity to match the new steady load, leaving more headroom for future bursts.
Best practices
To maximize successful requests on serverless models:

- Stay within your rate limit.
- Send steady, consistent traffic. Avoid bursts.
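One way to keep traffic steady is to pace requests at a fixed interval rather than firing them in bursts. The sketch below computes how long to wait before each send, given a target rate; the class and method names are illustrative, not part of any SDK.

```python
# Sketch: a client-side pacer that spaces requests evenly at a
# target rate instead of sending them in bursts.
# Names are illustrative, not part of the Together SDK.

class Pacer:
    def __init__(self, requests_per_second: float):
        self.interval = 1.0 / requests_per_second
        self.next_slot = 0.0  # earliest time the next request may go out

    def delay_for(self, now: float) -> float:
        """Return how long to sleep before sending, and book the next slot."""
        wait = max(0.0, self.next_slot - now)
        self.next_slot = max(now, self.next_slot) + self.interval
        return wait
```

In a real client you would call `time.sleep(pacer.delay_for(time.monotonic()))` before each request.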
Check your current limits
Every serverless inference API response includes headers reporting the current rate limits for the model, your remaining quota, and the reset window. Limits are model-specific. Plan workloads against the values reported in response headers, not a fixed threshold. For spend and usage trends across keys and workloads, see your project's cost analytics page. Limits are measured in seconds internally and displayed in minutes to match common industry conventions: a limit advertised as 60 RPM is enforced internally as 1 RPS.
Alternatives for high-volume or bursty workloads
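Because the advertised per-minute limit is enforced per second, it helps to convert before pacing a client. A minimal sketch, assuming a hypothetical header name; the real header names appear in each API response and are the authoritative source for your limits.

```python
# Sketch: derive a per-second pacing target from a rate-limit header.
# The header name "x-ratelimit-limit" is a placeholder; use the actual
# header names reported in your API responses.

def per_second_limit(headers: dict[str, str],
                     limit_header: str = "x-ratelimit-limit") -> float:
    """Convert an advertised requests-per-minute limit to requests per second."""
    rpm = float(headers[limit_header])
    return rpm / 60.0  # e.g. 60 RPM is enforced as 1 RPS
```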
If your workload needs higher throughput or runs in large bursts, consider:

- Batch inference for high request or token volumes when latency is not critical. You pay for what you use, with discounts on most models.
- Dedicated inference for predictable, reserved capacity that you control. Use it for workloads with strict SLAs.