
Rate limits cap how often you can call Together AI serverless models. They protect platform capacity and keep the service available across users. Limits are set per organization and per model, and they adjust based on your recent usage. Requests that exceed your limit return a 429 Too Many Requests error.

Dynamic rate limits

Together AI uses dynamic rate limits instead of fixed thresholds. Each organization has a dynamic rate per model that adjusts based on:
  • The model’s live capacity.
  • Your recent successful usage on that model.
Sustained, successful traffic raises your dynamic rate over time, while sudden spikes far above your recent usage may be throttled. The exact formula is still evolving as Together tunes the system, but the practical takeaway holds: grow your traffic steadily rather than in sudden bursts.

How Together handles traffic spikes

Together buffers sudden traffic spikes so that every user keeps receiving timely responses. Best-effort smoothing absorbs most bursts before any limiting is applied. If a burst still produces failures, the error code you get back depends on whether the failed request was below or above your dynamic rate:
  • Requests at or below your dynamic rate return 503 Service Unavailable. These failures are attributed to platform capacity: the model is overloaded and unable to serve requests, not limiting you because of your usage.
  • Requests above your dynamic rate return 429 Too Many Requests with one of these error types:
    • error_type: "dynamic_request_limited" for request-based limiting.
    • error_type: "dynamic_token_limited" for token-based limiting.
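A minimal retry sketch for these two failure modes, assuming a plain `requests` client against the OpenAI-compatible chat completions endpoint; the location of the error type in the 429 response body is an assumption (OpenAI-style APIs typically nest it under an `error` object), so adjust it to the payload you actually receive:

```python
import os
import time
import requests

API_URL = "https://api.together.xyz/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

def call_with_backoff(payload, max_retries=5):
    """Retry on 429 (over your dynamic rate) and 503 (platform capacity) with exponential backoff."""
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=60)
        if resp.ok:
            return resp.json()
        if resp.status_code == 429:
            # Over your dynamic rate; the error type says whether requests or tokens were limited.
            try:
                error_type = resp.json().get("error", {}).get("type")
            except ValueError:
                error_type = None
            print(f"429 dynamic limit ({error_type}); retrying in {delay:.1f}s")
        elif resp.status_code == 503:
            # At or below your dynamic rate, but the model itself is overloaded; wait and retry.
            print(f"503 capacity error; retrying in {delay:.1f}s")
        else:
            resp.raise_for_status()  # other errors are not retried here
        time.sleep(delay)
        delay *= 2
    raise RuntimeError("Request still failing after retries")
```

Backing off on 503 as well as 429 matters here: both are retryable, but only the 429s indicate that you should reduce your send rate.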

Rewards for sustained traffic

Steady traffic helps Together predict demand and scale capacity over time. If your request rate increases gradually and stays consistent, your successful usage accumulates, which raises your dynamic rate and improves your success rate. The platform then ramps up capacity to match the new steady load, leaving more headroom for future bursts.

Best practices

To maximize successful requests on serverless models:
  • Stay within your rate limit.
  • Send steady, consistent traffic. Avoid bursts.
For example, if your limit is 60 requests per minute (RPM), send roughly 1 request per second (RPS) across the minute rather than 60 requests in a single second. The shorter the window you concentrate requests into, the burstier the traffic. Together does its best to serve bursty traffic, but success depends on the model’s real-time load and available capacity at that moment.
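As a concrete illustration of that pacing, here is a minimal client-side sketch (the 60 RPM figure and the send_request callable are placeholders for your own limit and request function) that spreads requests evenly across the minute instead of firing them in one burst:

```python
import time

REQUESTS_PER_MINUTE = 60                 # illustrative limit from the example above
INTERVAL = 60.0 / REQUESTS_PER_MINUTE    # seconds between sends (1 second here)

def run_paced(payloads, send_request):
    """Send payloads at a steady cadence rather than all at once."""
    for payload in payloads:
        start = time.monotonic()
        send_request(payload)                     # e.g. the call_with_backoff sketch above
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, INTERVAL - elapsed))  # wait out the rest of the interval
```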

Check your current limits

Every serverless inference API response includes headers reporting the current rate limits for the model, your remaining quota, and the reset window. Limits are model-specific. Plan workloads against the values reported in response headers, not a fixed threshold. For spend and usage trends across keys and workloads, see your project’s cost analytics page. Limits are measured in seconds internally and displayed in minutes to match common industry conventions. A limit advertised as 60 RPM is enforced internally as 1 RPS.
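A small sketch of how you might read those headers after a request; exact header names can vary by model and API version, so this filters on names containing "ratelimit" instead of hard-coding specific keys (the model name in the payload is only illustrative):

```python
import os
import requests

resp = requests.post(
    "https://api.together.xyz/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative model name
        "messages": [{"role": "user", "content": "ping"}],
    },
    timeout=60,
)

# Surface whatever rate-limit headers the response carries, rather than assuming exact names.
for name, value in resp.headers.items():
    if "ratelimit" in name.lower().replace("-", ""):
        print(f"{name}: {value}")
```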

Alternatives for high-volume or bursty workloads

If your workload needs higher throughput or runs in large bursts, consider:
  • Batch inference for high request or token volumes when latency is not critical. You pay for what you use, with discounts on most models.
  • Dedicated inference for predictable, reserved capacity that you control. Use it for workloads with strict SLAs.