429 Too Many Requests error.
Dynamic rate limits
Together uses dynamic rate limits instead of fixed thresholds. Each organization has a dynamic rate per model that adjusts based on:
- The model’s live capacity.
- Your recent successful usage on that model.
How Together handles traffic spikes
The Together platform buffers sudden traffic spikes so that every user keeps receiving timely responses. Best-effort smoothing absorbs most bursts before any limiting is applied. If a burst still produces failures, the error code you get back depends on whether the failed request was below or above your dynamic rate:
- Requests at or below your dynamic rate return 503 Service Unavailable. These failures are attributed to platform capacity, i.e., the model is overloaded and unable to serve requests (not due to your usage).
- Requests above your dynamic rate return 429 Too Many Requests with one of these error types:
  - error_type: "dynamic_request_limited" for request-based limiting.
  - error_type: "dynamic_token_limited" for token-based limiting.
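The status codes and error types above can be told apart on the client side. The following is a minimal sketch, not Together's own client code: the function names and returned labels are hypothetical, and because this page does not show the layout of the error body, the helper assumes the body is JSON and searches it for an error_type key wherever it appears.

```python
import json
from typing import Optional


def find_error_type(body: str) -> Optional[str]:
    """Return the first string value stored under "error_type" in a JSON body.

    Assumption: the response body layout is not documented on this page,
    so the whole JSON document is searched instead of a fixed path.
    """
    try:
        node = json.loads(body)
    except ValueError:
        return None
    stack = [node]
    while stack:
        current = stack.pop()
        if isinstance(current, dict):
            value = current.get("error_type")
            if isinstance(value, str):
                return value
            stack.extend(current.values())
        elif isinstance(current, list):
            stack.extend(current)
    return None


def classify_failure(status: int, body: str) -> str:
    """Map a failed response to a short, hypothetical label."""
    if status == 503:
        # At or below the dynamic rate: attributed to platform capacity.
        return "platform_capacity"
    if status == 429:
        # Above the dynamic rate: request-based or token-based limiting.
        error_type = find_error_type(body)
        if error_type == "dynamic_request_limited":
            return "request_limited"
        if error_type == "dynamic_token_limited":
            return "token_limited"
        return "rate_limited_unknown"
    return "other"
```

A caller might log the label next to the model name to see whether failures come from platform capacity or from exceeding the dynamic rate.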
Rewards for sustained traffic
Steady traffic helps Together predict demand and scale capacity over time. If your request rate increases gradually and stays consistent, your success rate will improve, which raises your dynamic rate (the burst cushion based on recent successful usage). The platform then ramps up capacity to match the new steady load, leaving more headroom for future bursts.
Best practices
To maximize successful requests on serverless models:
- Stay within your rate limit.
- Send steady, consistent traffic. Avoid bursts.
- When a request is rate limited, check the response header x-ratelimit-reset, which reports the suggested retry interval for the model. Consider using exponential backoff so requests retry with increasing wait times instead of failing immediately.
For spend and usage trends across keys and workloads, see your project’s cost analytics page.
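The retry advice above can be sketched as a small helper. This is one possible approach under stated assumptions, not an official client: backoff_delay, call_with_retries, and the send callable are hypothetical names, the base and cap values are arbitrary, the jitter is an addition that this page does not mention, and retrying 503 responses alongside 429 is a choice made here. The send callable is assumed to return the status code, the x-ratelimit-reset value in seconds (or None when absent), and the payload.

```python
import random
import time
from typing import Any, Callable, Optional, Tuple

RETRYABLE = (429, 503)  # assumption: retry both failure codes described above


def backoff_delay(attempt: int, reset_hint: Optional[float] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with jitter, never shorter than the server hint."""
    exp = min(cap, base * (2 ** attempt))
    delay = exp * random.uniform(0.5, 1.0)  # jitter spreads out retries
    if reset_hint is not None:
        delay = max(delay, reset_hint)
    return delay


def call_with_retries(
    send: Callable[[], Tuple[int, Optional[float], Any]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Optional[int], Any]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    status, payload = None, None
    for attempt in range(max_attempts):
        status, reset_hint, payload = send()
        if status not in RETRYABLE:
            break
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, reset_hint))
    return status, payload
```

Passing sleep as a parameter keeps the helper easy to test without real waiting. If every attempt fails, the last status and payload are returned so the caller can decide what to do next.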
Inspect your current rate limit
Dynamic rate limits adjust with usage, so there are no fixed per-model limits published. The most reliable way to see how close you are to hitting a limit is to call the model and read the x-ratelimit-reset response header. When you hit a 429, the value reports how many seconds to wait before retrying.
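Reading the header might look like the sketch below. The function name seconds_until_reset is hypothetical, and the code assumes the header value is a plain number of seconds, as the description above suggests; any other format is treated as missing. Header names are matched case-insensitively, since HTTP clients differ in how they expose them.

```python
import math
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str]) -> Optional[float]:
    """Return the x-ratelimit-reset value in seconds, or None if unusable.

    Assumption: the value is a plain number of seconds.
    """
    for name, value in headers.items():
        if name.lower() != "x-ratelimit-reset":
            continue
        try:
            seconds = float(value)
        except (TypeError, ValueError):
            return None
        if not math.isfinite(seconds):
            return None
        return max(seconds, 0.0)
    return None
```

The headers mapping can come from whichever HTTP client you use; most expose response headers as a dictionary-like object.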
Alternatives for high-volume or bursty workloads
If your workload needs higher throughput or runs in large bursts, consider:
- Batch inference for high request or token volumes when latency is not critical. You pay for what you use, with discounts on most models.
- Dedicated inference for predictable, reserved capacity that you control. Use it for workloads with strict SLAs.