Rate limits are a standard safeguard against abuse or misuse of an API, helping to ensure equitable access with consistent performance. When you exceed a rate limit, the API returns HTTP status code 429. Read our rate limit best practices and tiers below, and find out how you can increase them here:
  • If you have a high volume of steady traffic and good payment history for this traffic, you can request a higher limit by emailing [email protected].
  • If you are interested in our Enterprise package, with custom requests per minute (RPM) and unlimited tokens per minute (TPM), please reach out to sales here.

Best Practice

To maximize successful requests:
  • Stay within your rate limit cap.
  • Prefer steady, consistent traffic and avoid bursts.
For example, if your cap is 60 RPM, we strongly recommend sending traffic steadily, at about 1 RPS for 60 seconds, rather than sending 60 requests in a single second. In general, the more requests you concentrate into a short window (e.g., within one second), the burstier your traffic is. We make a best-effort attempt to serve bursty traffic, since we understand users’ urgency. However, success ultimately depends on the real-time load and available capacity for the target model at that moment.
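The pacing advice above can be sketched in client code. This is a minimal illustration, not an official client: the request function and its return value are placeholders you would swap for your own API call.

```python
import time

def send_steadily(requests, rpm_cap, send_fn):
    """Spread requests evenly across the minute instead of bursting.

    With a 60 RPM cap, this sends ~1 request per second rather than
    60 requests in the first second. `send_fn` is a placeholder for
    your actual API call.
    """
    interval = 60.0 / rpm_cap  # seconds allotted to each request
    results = []
    for req in requests:
        start = time.monotonic()
        results.append(send_fn(req))
        # Sleep off whatever remains of this request's time slot.
        elapsed = time.monotonic() - start
        if elapsed < interval:
            time.sleep(interval - elapsed)
    return results
```

The same effect can be achieved with any token-bucket or fixed-interval scheduler; the key property is that inter-request spacing stays close to 60/RPM seconds.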

How we measure rate limits

We enforce rate limits per second internally (RPS/TPS), but display them per minute (RPM/TPM) to align with common industry conventions. If your limit is 60 RPM, plan for roughly 60/60 = 1 RPS; that is, spread requests evenly across the minute.

Earned Rate Limits

This is a new feature we are rolling out to help us handle bursty traffic and to help users steadily increase their success rate.
To ensure fair use of a model across all users, we buffer sudden surges in traffic and apply a fairness mechanism so everyone continues to receive timely service. Before any limiting behavior is applied, we make a best-effort attempt to absorb and smooth bursts through our leading inference speed and capacity management. If a burst still results in failed requests despite these protections, we apply response attribution using an Earned Rate threshold.

Earned Rate

We track an Earned Rate per user and per model:

Earned Rate ≈ 2 × past_hour_successful_request_rate

We constrain Earned Rate as:

base_rate ≤ earned_rate ≤ cap_rate
  • Default base_rate is 60 RPM.
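The clamp above can be written out directly. In this sketch the default cap_rate of 600 is illustrative only; your actual cap depends on your tier.

```python
def earned_rate(past_hour_successful_rpm, base_rate=60, cap_rate=600):
    """Earned Rate ~= 2 x past-hour successful request rate (RPM),
    constrained so base_rate <= earned_rate <= cap_rate.
    cap_rate here is an illustrative placeholder, not a platform value."""
    return max(base_rate, min(2 * past_hour_successful_rpm, cap_rate))
```

For example, sustaining 100 successful RPM over the past hour would earn a 200 RPM cushion under this formula, while a new user with little recent traffic stays at the 60 RPM base.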

Behavior during burst failures

When bursty requests fail:
  • Requests at or below your Earned Rate (≤ Earned Rate) receive 503 Service Unavailable.
    These failures are attributed to platform capacity under burst conditions — we take responsibility.
  • Requests above your Earned Rate (> Earned Rate) receive 429 Too Many Requests, with:
    • error_type: "earned_request_limited" (request-based limiting), or
    • error_type: "earned_token_limited" (token-based limiting)
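Client code can use this 503-versus-429 distinction to pick a retry strategy. A hedged sketch, where `make_request` is a placeholder returning an HTTP status code and a parsed JSON body, and the backoff values are illustrative rather than prescribed by the platform:

```python
import time

def call_with_burst_handling(make_request, max_retries=3, base_delay=1.0):
    """Retry burst failures: a 503 is attributed to platform capacity
    (retry with backoff); a 429 with an earned_* error_type means the
    request exceeded our Earned Rate, so back off more aggressively.
    `make_request` is a placeholder for your actual API call."""
    delay = base_delay
    for _ in range(max_retries + 1):
        status, body = make_request()
        if status == 200:
            return body
        if status == 503:
            # At or below Earned Rate: platform capacity under burst.
            time.sleep(delay)
        elif status == 429 and body.get("error_type", "").startswith("earned_"):
            # Above Earned Rate: slow down harder before retrying.
            time.sleep(delay * 4)
        else:
            raise RuntimeError(f"unexpected status {status}: {body}")
        delay *= 2
    raise RuntimeError("retries exhausted")
```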

Recommendation

We strongly recommend avoiding bursty traffic. If your traffic spikes to roughly 2× (or more) of what you’ve successfully sustained over the past hour, requests beyond your Earned Rate may be limited even after our best-effort buffering.

Steady Traffic Improves Success Rates and Increases Earned Rate

Steady, sustained traffic helps the system scale capacity over time. As your request rate increases gradually and stays consistent, your success rate improves, which increases your Earned Rate (the burst cushion based on recent successful usage). The platform then ramps up system capacity to match the new steady load, leaving a capacity buffer that makes subsequent bursts more likely to succeed.

A Virtuous Cycle: Consistency Builds Capacity

If you send steady, sustained traffic, it’s easier for us to predict demand and scale capacity in time. Over time, this typically improves your success rate, which in turn can increase your Earned Rate—allowing you to send higher traffic with a higher likelihood of success.
Important: when we launch support for a brand new model, we may temporarily disable automatic rate limit increases for that model. This keeps our service levels stable, since rate limits represent the maximum “up to” capacity a user is entitled to, which is ultimately driven by our available serverless capacity. We strive to enable automatic increases as soon as capacity stabilizes.

Rate limit tiers

You can view your rate limit by navigating to Settings > Billing. As your usage of and spend on the Together API increase, we will automatically raise your rate limits.

Chat, language & code models
Tier     Qualification criteria        RPM     TPM
Tier 1   Credit card added, $5 paid    600     180,000
Tier 2   $50 paid                      1,800   250,000
Tier 3   $100 paid                     3,000   500,000
Tier 4   $250 paid                     4,500   1,000,000
Tier 5   $1,000 paid                   6,000   2,000,000
DeepSeek R1 model-specific rate limits
Due to high demand on the platform, DeepSeek R1 has these special rate limits. We are actively increasing them.
Tier     RPM
Tier 1   3
Tier 2   60
Tier 3   ~400+
Tier 4   ~400+
Tier 5   ~1200+
Embedding models
Tier     Qualification criteria        RPM      TPM
Tier 1   Credit card added, $5 paid    3,000    2,000,000
Tier 2   $50 paid                      5,000    2,000,000
Tier 3   $100 paid                     5,000    10,000,000
Tier 4   $250 paid                     10,000   10,000,000
Tier 5   $1,000 paid                   10,000   20,000,000
Re-rank models
Tier     Qualification criteria        RPM     TPM
Tier 1   Credit card added, $5 paid    2,500   500,000
Tier 2   $50 paid                      3,500   1,500,000
Tier 3   $100 paid                     4,000   2,000,000
Tier 4   $250 paid                     7,500   3,000,000
Tier 5   $1,000 paid                   9,000   5,000,000
Image models
Tier     Qualification criteria        Img/min
Tier 1   Credit card added, $5 paid    240
Tier 2   $50 paid                      480
Tier 3   $100 paid                     600
Tier 4   $250 paid                     960
Tier 5   $1,000 paid                   1,200
Note: Due to high demand:
  • FLUX.1 [schnell] Free has a model specific rate limit of 6 img/min.
  • FLUX.1 Kontext [pro] has a model specific rate limit of 57 img/min.
Video models
Tier     Qualification criteria        RPM
Tier 1   Credit card added, $5 paid    60
Tier 2   $50 paid                      60
Tier 3   $100 paid                     60
Tier 4   $250 paid                     60
Tier 5   $1,000 paid                   100
You may experience congestion from other users’ traffic and may be throttled to a lower level as a result. If you want committed capacity, contact our sales team to inquire about our Scale and Enterprise plans, which include custom RPM and unlimited TPM.

Rate limits in headers

The API response includes headers that show the enforced rate limit, your current usage, and when the limit will reset. We enforce limits per second and per minute for token usage and per second for request rates, but the headers display per-second limits only.
Field                    Description
x-ratelimit-limit        The maximum number of requests per second permitted before the rate limit is exhausted.
x-ratelimit-remaining    The number of requests per second remaining before the rate limit is exhausted.
x-ratelimit-reset        The time until the per-second request limit resets to its initial state.
x-tokenlimit-limit       The maximum number of tokens per second permitted before the rate limit is exhausted.
x-tokenlimit-remaining   The number of tokens per second remaining before the rate limit is exhausted.
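To monitor these values in client code, you can read them from whatever header mapping your HTTP client returns. A minimal sketch; the header names come from the table above, but the assumption that values arrive as plain integers or decimals is ours:

```python
def parse_rate_limit_headers(headers):
    """Extract the per-second rate limit fields from a response's
    header mapping (any dict-like of header name -> string value)."""
    def to_num(value):
        # Values are assumed to be plain integers or decimals as strings.
        if value is None:
            return None
        try:
            return float(value) if "." in value else int(value)
        except ValueError:
            return None
    return {
        "requests_limit": to_num(headers.get("x-ratelimit-limit")),
        "requests_remaining": to_num(headers.get("x-ratelimit-remaining")),
        "reset_seconds": to_num(headers.get("x-ratelimit-reset")),
        "tokens_limit": to_num(headers.get("x-tokenlimit-limit")),
        "tokens_remaining": to_num(headers.get("x-tokenlimit-remaining")),
    }
```

A client might pause when `requests_remaining` hits zero and resume after `reset_seconds`, which complements the steady-pacing advice earlier in this page.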