How We Measure Rate Limits
We measure rate limits in seconds but display them in minutes to align with common industry conventions. For example, if your advertised rate limit is 60 Requests per Minute (RPM), we enforce it internally as 1 Request per Second (RPS).
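As a quick illustration of this conversion (our own sketch, not part of the API), the snippet below shows why a limit displayed as 60 RPM behaves as 1 RPS: a burst that is technically within the per-minute budget can still exceed the per-second limit.

```python
def internal_rps(advertised_rpm: float) -> float:
    """Convert an advertised per-minute limit to the per-second
    rate that is actually enforced internally."""
    return advertised_rpm / 60.0

rpm = 60
print(f"{rpm} RPM is enforced as {internal_rps(rpm)} RPS")  # 1.0 RPS

# Sending all 60 requests in the first second of a minute stays within
# 60 RPM on paper, but runs at 60 RPS for that second, far above the
# 1 RPS internal limit, so most of those requests would be rejected.
```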
Fetching Latest Serverless Rate Limits

Every serverless inference API request includes response headers that report the latest rate limits for the model, including current usage and reset timing. Rate limits are model specific. We recommend planning your workload according to the latest rate limits reported in these response headers.
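As a minimal sketch of reading these headers, the snippet below inspects every header on a response and prints the rate limit ones. The URL, key, and payload are placeholders, and it matches on a generic "ratelimit" substring rather than assuming exact header names; use the names you see in your own responses.

```python
import requests

response = requests.post(
    "https://api.example.com/v1/chat/completions",  # placeholder endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
    json={"model": "your-model", "messages": [{"role": "user", "content": "Hi"}]},
)

# Print any header that looks like a rate limit header, including
# current usage and reset timing for this model.
for name, value in response.headers.items():
    if "ratelimit" in name.lower():
        print(f"{name}: {value}")
```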
Alternatives for High Volume or Bursty Workloads

If your workload requires higher rate limits or has large bursts of traffic, we strongly recommend considering:

- batch-inference: for high volumes of requests/tokens when completion is not time sensitive. Pay for what you use, with discounts applied for most models.
- dedicated-inference: predictable capacity that you can control, for workloads that require strict SLAs.
Best Practice
To maximize successful requests for serverless models:

- Stay within your rate limit.
- Prefer steady, consistent traffic and avoid bursts (a pacing sketch follows this list).
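A minimal pacing sketch, assuming an advertised per-minute limit: it spaces calls evenly at the limit instead of firing them back to back. Here send_request stands in for your actual API call.

```python
import time

def send_steadily(request_fns, advertised_rpm: int = 60) -> None:
    """Run each zero-argument callable in request_fns, spacing calls
    evenly so the average rate never exceeds advertised_rpm."""
    interval = 60.0 / advertised_rpm  # seconds between request starts
    for send_request in request_fns:
        start = time.monotonic()
        send_request()
        # Sleep off whatever remains of the interval, so fast responses
        # do not let the client creep above the limit.
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, interval - elapsed))
```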

Dynamic Rate Limits
We will be rolling out dynamic rate limits to all new users after January 26, 2026 (PST). This is our approach to adapting rate limits based on the live capacity of the model and your past usage patterns. Our goal is to make this experience as good as, or better than, what you have today by enabling higher sustained request volumes for serverless models over time.

How do we handle bursty traffic for serverless?
To ensure fair use of a model across all users, we buffer sudden surges in traffic and apply a fairness mechanism so everyone continues to receive timely service. We also make a best-effort attempt upfront to absorb and smooth bursts via our leading inference speed and capacity management before any limiting behavior is applied. If a burst still results in failed requests despite these protections, we apply response attribution using a Dynamic Rate threshold.
Dynamic Rate
We track a Dynamic Rate per user and per model:

Dynamic Rate ≈ 2 × past_hour_successful_request_rate

We constrain Dynamic Rate as:

base_rate ≤ dynamic_rate ≤ cap_rate

- Default base_rate is 60 RPM.
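In code, the constraint is a simple clamp. This is a sketch of the published relationship, not our exact implementation; the cap_rate default below is purely illustrative, since the actual cap is model specific.

```python
def dynamic_rate(past_hour_successful_rpm: float,
                 base_rate: float = 60.0,   # default base_rate, in RPM
                 cap_rate: float = 600.0    # illustrative value only
                 ) -> float:
    """Dynamic Rate ≈ 2 × past-hour successful request rate,
    clamped to the range [base_rate, cap_rate]."""
    return min(max(2.0 * past_hour_successful_rpm, base_rate), cap_rate)

# A user who sustained 100 successful requests/min over the past hour
# gets a Dynamic Rate of about 200 RPM:
print(dynamic_rate(100.0))  # 200.0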
Behavior during burst failures
When bursty requests fail (a handling sketch follows this list):

- Requests at or below your Dynamic Rate (≤ Dynamic Rate) receive 503: Service Unavailable. These failures are attributed to platform capacity under burst conditions; we take responsibility.
- Requests above your Dynamic Rate (> Dynamic Rate) receive 429: Too Many Requests, with:
  - error_type: "dynamic_request_limited" (request-based limiting), or
  - error_type: "dynamic_token_limited" (token-based limiting)
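A sketch of how a client might handle the two failure modes differently. The status codes and error_type values are as documented above; the retry policy, backoff constants, and exact JSON shape of the error body are our assumptions.

```python
import time
import requests

def call_with_burst_handling(url, payload, headers, max_retries=3):
    """POST with retries: treat 503 as transient platform capacity,
    and back off harder on 429 dynamic rate limiting."""
    for attempt in range(max_retries):
        response = requests.post(url, json=payload, headers=headers)
        if response.status_code == 503:
            # At or below your Dynamic Rate: platform capacity under
            # burst. A short exponential backoff is reasonable.
            time.sleep(2 ** attempt)
        elif response.status_code == 429:
            # Above your Dynamic Rate: the client is sending too fast.
            # Field name from the docs above; body shape is assumed.
            error_type = response.json().get("error_type", "")
            if error_type in ("dynamic_request_limited",
                              "dynamic_token_limited"):
                time.sleep(5 * (2 ** attempt))  # back off more aggressively
            else:
                time.sleep(2 ** attempt)
        else:
            return response
    raise RuntimeError("Request failed after retries")
```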
Recommendation
We strongly recommend avoiding bursty traffic for serverless models; consider batch or dedicated inference instead. If your traffic spikes to roughly 2× (or more) of what you’ve successfully sustained over the past hour, we cannot guarantee capacity.

Rewards of Sustained Traffic
Steady traffic improves success rates and increases your Dynamic Rate, creating a virtuous cycle: consistency builds capacity.