How to fix AI API rate limit errors from OpenAI, Anthropic, and Gemini
Quick answer
💡AI API rate limit errors return HTTP 429 with a body like { "error": { "code": "rate_limit_exceeded" } }, though the exact field names vary by provider. On OpenAI, read the x-ratelimit-remaining-requests and x-ratelimit-reset-requests response headers to understand remaining capacity, then implement exponential backoff with jitter. Tokens per minute (TPM) limits are usually hit before requests per minute (RPM) limits, so reducing prompt size is often more effective than adding retry delays.
Error symptoms
- ✕HTTP 429 Too Many Requests from OpenAI, Anthropic, or Gemini
- ✕rate_limit_exceeded or overloaded_error in the response body
- ✕Requests succeed locally but fail under production load
- ✕Streaming requests that feel slow still consuming full token quota
- ✕Summarization jobs throttling the same pool as interactive requests
- ✕Error rate spikes immediately after a new AI feature launches
Common causes
- •Tokens per minute exceeded before requests per minute limit is reached
- •Multiple background workers sharing the same API key and organization limit
- •No retry logic, or retrying immediately without reading Retry-After
- •Sending full conversation history on every request instead of a rolling summary
- •Using the largest available model for tasks that do not require it
- •Organization-level limits applying across all projects even when one project is the offender
When it happens
- •Right after a new AI feature launches to production traffic
- •During batch summarization or document processing jobs
- •When switching providers mid-request or using a multi-model fallback without backoff
- •During streaming with stream: true when the full token cost still counts against TPM
- •When a cost spike triggers a billing-based rate limit on the account
Examples and fixes
curl, JavaScript fetch, and Python with tenacity showing correct 429 handling.
Retry with exponential backoff and jitter
❌ Wrong
# curl: no retry awareness
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY_HERE" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"Summarize this."}]}'
// JS: retries immediately in a tight loop
while (true) {
  const res = await fetch("/api/ai");
  if (res.ok) break;
}
# Python: no backoff
while True:
    r = requests.post(url, json=payload)
    if r.status_code == 200:
        break
✅ Fixed
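# curl: inspect rate limit headers before retrying
curl -i https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY_HERE" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"Summarize this."}]}'
# Check x-ratelimit-remaining-requests and x-ratelimit-reset-requests in the response
// JS: honor Retry-After, fall back to exponential backoff, and cap attempts
async function callWithBackoff(url, options, attempt = 0, maxAttempts = 6) {
  const res = await fetch(url, options);
  if (res.status !== 429 || attempt >= maxAttempts - 1) return res;
  // Retry-After is in seconds; if it is absent or non-numeric, wait 2^attempt seconds
  const retryAfter = Number(res.headers.get("retry-after"));
  const base = Number.isFinite(retryAfter) && retryAfter > 0 ? retryAfter : 2 ** attempt;
  const jitter = Math.random(); // up to 1 s, desynchronizes parallel clients
  const delay = Math.min((base + jitter) * 1000, 60000);
  await new Promise(r => setTimeout(r, delay));
  return callWithBackoff(url, options, attempt + 1, maxAttempts);
}
# Python: tenacity with exponential backoff plus jitter, retrying only on 429
import requests
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential_jitter

def is_rate_limited(exc):
    # Only a 429 is worth retrying; other HTTP errors should surface immediately
    return isinstance(exc, requests.HTTPError) and exc.response is not None and exc.response.status_code == 429

@retry(retry=retry_if_exception(is_rate_limited),
       wait=wait_exponential_jitter(initial=1, max=60),
       stop=stop_after_attempt(6), reraise=True)
def call_openai(payload):
    # url is assumed to be defined elsewhere, as in the broken example
    r = requests.post(url, json=payload, timeout=30)
    r.raise_for_status()
    return r.json()
The broken JavaScript and Python examples retry without delay, which amplifies the rate limit problem by immediately burning more quota, and the broken curl call discards the headers that explain the failure. The fixed curl command uses -i to expose x-ratelimit-remaining-requests and x-ratelimit-reset-requests headers so you can see exactly how much capacity remains. The JavaScript version reads the Retry-After header, falls back to an exponential delay when the header is missing, adds random jitter to avoid synchronized retry storms when multiple instances hit the limit simultaneously, and stops after a fixed number of attempts. The Python version uses tenacity's wait_exponential_jitter to implement the formula wait = min(initial_delay * 2^attempt + random(0, 1), max_delay) without writing the loop manually; plain wait_exponential applies the exponential growth but no jitter. All three should be tested using /tools/http-request-builder to verify the 429 response structure before building retry logic.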
Reduce per-request token cost and fall back to a smaller model when the primary hits its limit.
Token budget reduction and model fallback
❌ Wrong
// Sends full conversation history and large system prompt every request
const messages = [
  { role: "system", content: fullPolicyDocument },
  ...fullConversationHistory,
  { role: "user", content: userMessage }
];
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages,
  max_tokens: 4096
});
# Python: no model fallback on 429
response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=2000,
    messages=[{"role": "user", "content": full_document}]
)
✅ Fixed
// Summarize stable history; use compact system prompt; limit output tokens
const messages = [
  { role: "system", content: compactPolicySummary },
  { role: "user", content: rollingConversationSummary },
  { role: "user", content: userMessage }
];
// Wrapped in a function (the name is illustrative) so the return statements are valid
async function completeWithFallback(messages) {
  try {
    return await openai.chat.completions.create({
      model: "gpt-4o",
      messages,
      max_tokens: 512
    });
  } catch (err) {
    if (err.status === 429) {
      // Fall back to gpt-4o-mini for less critical requests
      return await openai.chat.completions.create({
        model: "gpt-4o-mini",
        messages,
        max_tokens: 512
      });
    }
    throw err;
  }
}
# Python: fall back from claude-3-opus (50 RPM) to claude-3-5-haiku (1,000 RPM)
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
try:
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=512,
        messages=[{"role": "user", "content": summarized_document}]
    )
except anthropic.RateLimitError:
    # The primary model is throttled; retry once on the model with more headroom
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=512,
        messages=[{"role": "user", "content": summarized_document}]
    )
The broken pattern sends the full policy document and conversation history on every request. Because the OpenAI TPM limit for gpt-4o is 30,000 on Tier 1, one oversized request can consume the entire per-minute budget. The fixed version summarizes stable context into a compact rolling summary, cutting token cost by 60-80 percent for typical chat apps. The model fallback switches from gpt-4o to gpt-4o-mini, or from claude-3-opus (50 RPM limit) to claude-3-5-haiku (1,000 RPM limit), so critical requests keep flowing while the primary model recovers. This pattern works across providers because 429 errors use the same HTTP status regardless of the SDK.
Why AI APIs throttle your requests
AI API rate limits exist because large language model inference is computationally expensive and providers must distribute capacity fairly across all customers. Unlike a simple database query, a single 4,000-token GPT-4o completion requires substantial GPU time, which is why providers enforce both requests per minute (RPM) and tokens per minute (TPM) limits simultaneously.
On OpenAI Tier 1, gpt-4o allows 500 RPM but only 30,000 TPM. A developer who sends 500 small requests will never exceed the RPM limit, but a developer who sends 10 requests each consuming 3,000 tokens will hit the TPM ceiling almost immediately. This is the most common surprise: you are not sending too many requests, you are sending too many tokens per request.
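The arithmetic behind that surprise can be checked with a few lines. This is a minimal sketch: the function name limit_utilization is hypothetical, and the limits are simply the Tier 1 figures quoted above, which may not match your account.

```python
def limit_utilization(requests_per_min, tokens_per_request, rpm_limit, tpm_limit):
    """Return the fraction of each per-minute limit a steady workload would use."""
    rpm_used = requests_per_min / rpm_limit
    tpm_used = (requests_per_min * tokens_per_request) / tpm_limit
    # Whichever fraction is larger is the limit that will be hit first
    binding = "TPM" if tpm_used >= rpm_used else "RPM"
    return {"rpm_used": rpm_used, "tpm_used": tpm_used, "binding": binding}


# 10 requests per minute of 3,000 tokens each, against 500 RPM and 30,000 TPM
usage = limit_utilization(10, 3_000, rpm_limit=500, tpm_limit=30_000)
print(usage)
```

With these inputs the workload uses 2 percent of the request limit and the whole token limit, so TPM is the binding constraint.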
Anthropic applies the same dual-axis approach. claude-3-opus allows only 50 RPM, which makes it trivial to hit during batch processing. claude-3-5-haiku allows 1,000 RPM but has its own TPM ceiling. Switching models is a genuine mitigation, not just a workaround, because the limits are tracked separately per model.
The 429 response body from OpenAI includes an error object whose code field is rate_limit_exceeded, for example { "error": { "code": "rate_limit_exceeded" } }. Anthropic returns a similar nested structure but names the error type rate_limit_error. The response headers carry more diagnostic detail. On OpenAI, x-ratelimit-limit-requests shows your ceiling, x-ratelimit-remaining-requests shows what is left, and x-ratelimit-reset-requests shows when the window resets; matching -tokens headers report the same three values for the token limit. Reading these headers before writing retry logic tells you whether you are hitting the RPM or the TPM limit, which changes the correct fix.
Organization-level limits add another layer. OpenAI tracks limits at both the organization and project level. A single high-volume project can exhaust the organization TPM pool, causing unrelated projects to start returning 429 responses. Checking the OpenAI usage dashboard filtered by project quickly identifies the source.
Streaming responses with stream: true do not reduce token cost. The model still generates the full completion; the difference is that tokens are streamed back incrementally. A streamed 2,000-token response counts as 2,000 tokens against TPM just like a non-streamed one. Developers who switch to streaming expecting lower quota usage are surprised when limits do not improve.
Diagnosing the specific limit you hit
Start by capturing the exact 429 response including all headers. Use curl -i to include response headers in the output:
curl -i https://api.openai.com/v1/chat/completions -H "Authorization: Bearer YOUR_API_KEY_HERE" -H "Content-Type: application/json" -d '{"model":"gpt-4o","messages":[{"role":"user","content":"Hello"}]}'
The -i flag prints response headers before the body. Look for x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens. If remaining-tokens is near zero but remaining-requests still has headroom, you are hitting a TPM limit, not an RPM limit. The fix is different for each.
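The header comparison described above could be automated roughly as follows. This is a sketch under assumptions: binding_limit and the 5 percent threshold are illustrative choices, and the header names are the OpenAI-style x-ratelimit-* ones used in this guide.

```python
def binding_limit(headers, threshold=0.05):
    """Guess which limit is exhausted from OpenAI-style x-ratelimit-* headers."""
    # HTTP header names are case-insensitive, so normalize before looking them up
    lowered = {name.lower(): value for name, value in headers.items()}

    def fraction_left(kind):
        try:
            remaining = float(lowered[f"x-ratelimit-remaining-{kind}"])
            limit = float(lowered[f"x-ratelimit-limit-{kind}"])
        except (KeyError, ValueError):
            return None  # header missing or not numeric
        return remaining / limit if limit > 0 else None

    tokens_left = fraction_left("tokens")
    requests_left = fraction_left("requests")
    if tokens_left is not None and tokens_left <= threshold:
        return "TPM"
    if requests_left is not None and requests_left <= threshold:
        return "RPM"
    return "unknown"


sample = {
    "x-ratelimit-limit-requests": "500",
    "x-ratelimit-remaining-requests": "480",
    "x-ratelimit-limit-tokens": "30000",
    "x-ratelimit-remaining-tokens": "120",
}
print(binding_limit(sample))
```

Returning "unknown" rather than guessing keeps the caller honest when a provider omits the headers or uses different names.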
For RPM limits, the solution is to reduce the number of concurrent API calls. In Node.js, the p-limit library caps concurrent async calls. Set the concurrency to something like 10 and your workers will automatically queue instead of firing simultaneously.
For TPM limits, count tokens before sending. The tiktoken library for Python and the js-tiktoken port for Node.js let you compute the token count of your message contents before the API call; the API adds a small per-message overhead, so treat the figure as a close estimate rather than the billed total. Log this count alongside each request for a week and you will see which request types are consuming disproportionate quota.
The OpenAI usage dashboard shows RPM and TPM consumption as time-series charts split by model. Spike patterns correlate directly with background jobs. If you see TPM spikes at predictable intervals, a scheduled batch job is the culprit. If spikes align with user activity, interactive requests are consuming more tokens than expected.
You can also reproduce rate limit scenarios outside your application using /tools/http-request-builder to send the same request multiple times in sequence and observe the header values change as remaining capacity drops. This is faster than instrumenting production code to observe the same pattern.
Anthropic exposes the same information under different names: its rate limit headers are prefixed anthropic-ratelimit-, for example anthropic-ratelimit-requests-remaining and anthropic-ratelimit-tokens-remaining. The Retry-After header, when present, tells you how many seconds to wait before the next request. Not all providers include it on every 429 response, so your retry logic should fall back to a computed delay when the header is absent.
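One way to combine Retry-After with a computed fallback is sketched below. The function name retry_delay and its defaults are assumptions for illustration; the sketch accepts both forms that Retry-After can take (a number of seconds or an HTTP date).

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay(headers, attempt, initial_delay=1.0, max_delay=60.0):
    """Seconds to wait before retry number `attempt` (0-based)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("retry-after")
    if value is not None:
        seconds = None
        try:
            seconds = float(value)  # delta-seconds form, e.g. "7"
        except ValueError:
            try:
                when = parsedate_to_datetime(value)  # HTTP-date form
                if when.tzinfo is None:
                    when = when.replace(tzinfo=timezone.utc)
                seconds = (when - datetime.now(timezone.utc)).total_seconds()
            except (TypeError, ValueError):
                pass  # unparseable header: fall through to the computed delay
        if seconds is not None:
            return min(max(seconds, 0.0), max_delay)
    # Header absent or unusable: exponential backoff with up to 1 s of jitter
    return min(initial_delay * 2 ** attempt + random.uniform(0, 1), max_delay)
```

Clamping the header value to max_delay is a design choice: it prevents a single very large Retry-After from stalling a worker indefinitely, at the cost of possibly retrying early.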
Implementing backoff, batching, and caching
Three distinct fixes address rate limit errors at different levels: retry logic handles transient bursts, async batching prevents synchronized overloads, and caching eliminates redundant calls entirely.
For retry logic, the correct formula is: wait = min(initial_delay * 2^attempt + random(0, 1), max_delay). The random component, called jitter, is critical when multiple application instances hit the limit simultaneously. Without jitter, all instances retry at exactly the same moment, producing a synchronized retry storm that extends throttling. In Python, the tenacity library implements this automatically: @retry(wait=wait_exponential_jitter(initial=1, max=60), stop=stop_after_attempt(6)) wraps any function that calls the API. Note that tenacity's plain wait_exponential grows the delay but adds no jitter.
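For readers not using tenacity, the formula can be written out by hand. The following is a minimal sketch: RateLimited is a hypothetical exception that your own client code would raise on a 429, and the sleep parameter exists only so the loop can be tested without waiting.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical exception raised by the caller's code when the API answers 429."""


def backoff_delay(attempt, initial_delay=1.0, max_delay=60.0):
    # wait = min(initial_delay * 2^attempt + random(0, 1), max_delay)
    return min(initial_delay * 2 ** attempt + random.uniform(0, 1), max_delay)


def call_with_backoff(fn, max_attempts=6, sleep=time.sleep):
    """Call fn(), retrying on RateLimited with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(backoff_delay(attempt))
```

Passing a list's append method as sleep records the computed delays instead of pausing, which makes the schedule easy to inspect in a unit test.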
For async batching in Node.js, p-limit controls concurrency. Instead of sending 100 API calls in parallel, wrap each call in a p-limit function capped at 10 concurrent requests. The remaining calls queue automatically and drain as capacity frees up. This pattern is more predictable than rate limiting at the request count level because it naturally adapts to varying token sizes per request.
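The same concurrency cap can be sketched in Python with asyncio.Semaphore, used here as a stand-in for p-limit. The names run_limited and fake_call are illustrative, and fake_call only simulates an API request.

```python
import asyncio


async def run_limited(tasks, concurrency=10):
    """Run zero-argument coroutine functions with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(task):
        async with semaphore:  # waits here while `concurrency` calls are running
            return await task()

    # gather preserves the order of `tasks` in the returned results
    return await asyncio.gather(*(guarded(task) for task in tasks))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_call():
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for the network round trip
        in_flight -= 1
        return 1

    results = await run_limited([fake_call] * 25, concurrency=10)
    return results, peak


results, peak = asyncio.run(demo())
print(len(results), peak)
```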
Caching is the highest-leverage fix when the same prompt produces the same expected output. API completions for deterministic prompts, such as category classification or format extraction, can be cached with Redis or an in-memory LRU cache. The cache key is a hash of the model name, system prompt, and user content. Cache hits reduce both cost and rate limit pressure simultaneously.
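A small in-memory version of that cache might look like the sketch below. CompletionCache, cache_key, and the call parameter are hypothetical names; call stands for whatever function actually sends the request, and a Redis-backed variant would swap the OrderedDict for Redis reads and writes.

```python
import hashlib
import json
from collections import OrderedDict


def cache_key(model, system_prompt, user_content):
    """Stable hash of the inputs that determine a deterministic completion."""
    payload = json.dumps([model, system_prompt, user_content], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CompletionCache:
    """In-memory LRU cache; evicts the least recently used entry when full."""

    def __init__(self, max_entries=1024):
        self._entries = OrderedDict()
        self._max_entries = max_entries

    def get_or_call(self, model, system_prompt, user_content, call):
        key = cache_key(model, system_prompt, user_content)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as recently used
            return self._entries[key]
        result = call(model, system_prompt, user_content)  # the real API call
        self._entries[key] = result
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # drop the oldest entry
        return result
```

Serializing the three fields as a JSON list before hashing avoids collisions that plain string concatenation could produce when one field ends where another begins.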
Model fallback handles the case where the primary model is throttled but the request must complete. If gpt-4o returns 429, retry the same messages array with gpt-4o-mini. Anthropic equivalents: claude-3-opus hitting 50 RPM can fall back to claude-3-5-haiku at 1,000 RPM for less critical completions.
For prompt size, always trim conversation history to a rolling window rather than sending the full history. Use the model to summarize older turns into a compact paragraph, then include only that summary plus the last two or three turns. This typically reduces per-request token cost by 50 to 80 percent without meaningful quality loss.
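The rolling-window assembly could be sketched as follows. build_messages is a hypothetical helper, keep_last counts individual messages rather than user and assistant pairs, and the summary is assumed to have been produced by an earlier summarization call.

```python
def build_messages(system_prompt, summary, history, user_message, keep_last=3):
    """Assemble a compact messages array.

    history is a list of {"role": ..., "content": ...} dicts, oldest first.
    """
    # history[-0:] would return the whole list, so handle keep_last == 0 explicitly
    recent = history[-keep_last:] if keep_last > 0 else []
    messages = [{"role": "system", "content": system_prompt}]
    if summary:
        messages.append(
            {"role": "system", "content": f"Summary of earlier conversation: {summary}"}
        )
    messages.extend(recent)
    messages.append({"role": "user", "content": user_message})
    return messages


history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compact = build_messages("Be brief.", "User asked about billing.", history, "And refunds?")
print(len(compact))
```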
Test your retry implementation against /tools/http-request-builder before deploying, and verify that the x-ratelimit-reset-requests header is being respected rather than ignored.
Edge cases that mislead standard retry logic
Several situations cause standard backoff implementations to fail or produce unexpected behavior even when the core logic is correct.
Synchronized batch jobs are the most common trap. If a cron job triggers 50 workers at 2:00 AM and each worker makes 20 API calls, 1,000 requests fire within seconds of each other. Even with per-worker backoff, the initial burst hits the limit before any retries are attempted. The fix is to add a startup delay: each worker sleeps for a random duration between 0 and 30 seconds before making its first request. This spreads the initial burst across the window.
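The startup delay amounts to a single call at the top of each worker. In this sketch staggered_start is a hypothetical name, and the sleep and rng parameters exist only to make the behavior testable.

```python
import random
import time


def staggered_start(max_jitter_seconds=30.0, sleep=time.sleep, rng=random.uniform):
    """Sleep for a random duration so workers launched together do not burst together."""
    delay = rng(0.0, max_jitter_seconds)
    sleep(delay)
    return delay


# In a worker's entry point, before the first API request:
# staggered_start(30.0)
```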
Streaming and token counting interact unexpectedly. When stream: true is set, the token count is not known until the stream completes because the model generates tokens dynamically. If you are tracking TPM usage in your application, you must read the usage object from the final chunk of the stream, not estimate from the input tokens alone. On OpenAI chat completions, that usage chunk is only sent when the request sets stream_options to { "include_usage": true }.
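Picking the usage out of a stream can be sketched as below. The chunk shape is an assumption made for illustration: each chunk is modeled as a plain dict with an optional "usage" entry, whereas a real SDK returns objects whose attribute names you would need to check.

```python
def usage_from_stream(chunks):
    """Return the last non-empty usage dict seen in an iterable of chunk dicts."""
    usage = None
    for chunk in chunks:
        if chunk.get("usage"):  # content chunks carry no usage, or None
            usage = chunk["usage"]
    return usage


fake_stream = [
    {"choices": [{"delta": {"content": "Hel"}}], "usage": None},
    {"choices": [{"delta": {"content": "lo"}}], "usage": None},
    {"choices": [], "usage": {"prompt_tokens": 120, "completion_tokens": 30, "total_tokens": 150}},
]
print(usage_from_stream(fake_stream))
```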
Organization vs. project limits create false positives. If your organization has multiple projects and one project exhausts the organization TPM pool, all other projects receive 429 responses that look like their own limit is exceeded. Check the OpenAI dashboard to confirm which project is the source before adding retry logic to unrelated services.
Very long completions can hit limits mid-generation. A request that starts successfully may return a rate limit error partway through if the token consumption during generation exceeds the remaining TPM budget. This appears as a truncated streaming response or a 429 on the completion object. The fix is to set max_tokens conservatively and use a rolling limit check before each batch of requests.
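A rolling limit check of that kind might be sketched as a client-side sliding window. TokenBudget is a hypothetical class; it tracks only what this process has reserved, so it approximates the provider's accounting and does not replace it.

```python
import time
from collections import deque


class TokenBudget:
    """Client-side sliding-window estimate of tokens spent in the last minute."""

    def __init__(self, tpm_limit, window_seconds=60.0, clock=time.monotonic):
        self._limit = tpm_limit
        self._window = window_seconds
        self._clock = clock
        self._events = deque()  # (timestamp, tokens) pairs, oldest first

    def _used(self, now):
        # Drop reservations that have aged out of the window
        while self._events and now - self._events[0][0] >= self._window:
            self._events.popleft()
        return sum(tokens for _, tokens in self._events)

    def try_reserve(self, tokens):
        """Reserve prompt tokens plus max_tokens; return False if over budget."""
        now = self._clock()
        if self._used(now) + tokens > self._limit:
            return False
        self._events.append((now, tokens))
        return True
```

Reserving the prompt size plus max_tokens before each request is deliberately pessimistic: the completion may be shorter, but the check never underestimates.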
Anthropic overloaded_error is distinct from a quota error. Anthropic sometimes returns { "type": "overloaded_error" } during high-traffic infrastructure events, and its documentation lists this error under HTTP status 529 rather than 429. This is not a quota limit; it is a server-side capacity issue. The retry logic is the same, but you should not count these against your quota tracking. Treat overloaded_error as a transient 503 rather than a quota violation.
Billing-based limits apply separately from the technical rate limits. Once a monthly spend cap is reached, the API returns 429 even if the RPM and TPM limits have headroom. Check the billing dashboard when standard rate limit headers show full remaining capacity but 429 errors continue.
Mistakes that make rate limiting worse
The most damaging mistake is retrying immediately on 429 without delay. An application that fires the same 50-request batch, gets throttled, and immediately retries the full batch creates a feedback loop that keeps the API throttled far longer than necessary. The retry must pause for at least the duration specified by x-ratelimit-reset-requests or the Retry-After header.
Ignoring jitter causes retry storms. Two application servers that both hit the limit at the same time and both use the same backoff formula without jitter will retry at exactly the same moment. The collision repeats on every retry attempt. Adding Math.random() times one second to the delay breaks the synchronization.
Increasing quota before fixing root causes is expensive and usually ineffective. If 10 workers are each sending requests that are 10x larger than necessary because they include full conversation history, upgrading from Tier 1 to Tier 2 increases the TPM limit but does not fix the underlying inefficiency. The cheaper fix is trimming prompts, which also reduces costs.
Disabling TLS verification to speed up debugging is a common shortcut that sometimes persists into staging. Commands like requests.get(url, verify=False) or curl -k hide real certificate errors and create a false sense of working connectivity. Always test with TLS verification enabled to catch certificate issues before they affect production.
Treating all 429 responses as identical ignores important distinctions. Anthropic overloaded_error requires different tracking than rate_limit_exceeded. OpenAI billing-limit 429 responses require a different action than token-limit 429 responses. Parse the error type from the response body before deciding on a retry strategy.
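Parsing the error type before choosing a strategy could look like the sketch below. The strategy names are invented for illustration, and the insufficient_quota label for billing-related failures is an assumption you should verify against the responses your account actually receives.

```python
def retry_strategy(status, body):
    """Map an error response to a retry decision based on its error type or code."""
    error = body.get("error") if isinstance(body, dict) else None
    if not isinstance(error, dict):
        error = {}
    labels = {str(error.get("type", "")), str(error.get("code", ""))}
    if "overloaded_error" in labels:
        return "backoff_untracked"  # capacity issue: retry, do not count against quota
    if "insufficient_quota" in labels:
        return "stop_and_check_billing"  # retrying will not help until billing changes
    if status == 429:
        return "backoff"  # ordinary RPM or TPM throttling
    return "do_not_retry"


print(retry_strategy(429, {"error": {"code": "rate_limit_exceeded"}}))
```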
Logging full API keys or tokens in retry diagnostics exposes credentials. When logging retry attempts, log only the status code, the error type field from the response body, the remaining-requests header value, and a timestamp. Never log the full Authorization header or any secrets from the request configuration.
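A log record limited to those four fields can be built with an allow-list, as in this sketch. retry_log_record is a hypothetical helper, and the header name is the OpenAI-style one used elsewhere in this guide.

```python
import time


def retry_log_record(status, body, headers, clock=time.time):
    """Build a log-safe dict: copy only the allow-listed fields, never the headers wholesale."""
    error = body.get("error") if isinstance(body, dict) else None
    if not isinstance(error, dict):
        error = {}
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "status": status,
        "error_type": error.get("type") or error.get("code"),
        "remaining_requests": lowered.get("x-ratelimit-remaining-requests"),
        "timestamp": clock(),
    }
```

Building the record from named fields, instead of filtering secrets out of a full dump, means a newly added sensitive header cannot leak by default.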
Production patterns that prevent throttling
Production AI API integrations that handle rate limits gracefully share several common patterns that are worth building into the architecture from the start rather than retrofitting after the first incident.
Maintain a centralized API client with built-in retry logic rather than implementing backoff in every call site. A single shared client ensures that all projects use consistent retry behavior, logs the same structured fields, and applies jitter uniformly. It also makes it easy to add model fallback logic in one place when requirements change.
Track token usage per request type in production. Log the usage.total_tokens field from every API response alongside a tag for the request category such as classification, summarization, or chat. After a week of data, you will know exactly which request types consume disproportionate quota and can prioritize optimization accordingly.
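Once those log lines exist, the per-category breakdown is a short aggregation. This sketch assumes each record has already been reduced to a (category, total_tokens) pair; tokens_by_category is an illustrative name.

```python
from collections import Counter


def tokens_by_category(records):
    """Return (category, tokens, share_of_total) tuples, largest consumer first."""
    totals = Counter()
    for category, total_tokens in records:
        totals[category] += total_tokens
    grand_total = sum(totals.values())
    if not grand_total:
        return []
    return [(category, tokens, tokens / grand_total) for category, tokens in totals.most_common()]


records = [("chat", 1200), ("summarization", 9000), ("chat", 800), ("classification", 1000)]
report = tokens_by_category(records)
print(report[0])
```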
Separate interactive and batch workloads onto different API keys or projects if the provider supports it. Interactive requests from users require low latency and should not compete with background batch jobs for the same TPM pool. OpenAI project-level keys make this straightforward.
Cache deterministic completions aggressively. Format extraction, category classification, and template-driven prompts almost always produce the same output for the same input. A Redis cache keyed on the hash of model plus messages can achieve 40 to 70 percent cache hit rates for these workloads, which directly reduces both cost and rate limit pressure.
Set max_tokens explicitly on every request. A request without a max_tokens limit allows the model to generate up to its context limit, which can produce unexpectedly large completions that consume your entire TPM budget in one call. Cap output at a realistic value for the task.
Test retry logic with /tools/http-request-builder by simulating 429 responses and verifying that your client waits, applies jitter, and eventually succeeds. A rate limit event should never be the first time your retry logic runs in a production context.
Document rate limit tiers and the TPM ceiling for each model in your team's runbook. When a new engineer joins or a new model is added, the limits should be visible without requiring a trip to the provider documentation.
Rate limit fix checklist
- ✓Read x-ratelimit-remaining-tokens and x-ratelimit-remaining-requests to identify which limit you hit.
- ✓Add exponential backoff with jitter: wait = min(initial_delay * 2^attempt + random, max_delay).
- ✓Read the Retry-After header when present and wait that exact duration before retrying.
- ✓Trim conversation history to a rolling summary instead of sending the full context.
- ✓Cap max_tokens on every request to prevent unexpectedly large completions.
- ✓Separate batch workloads onto a different API key or project from interactive requests.
- ✓Cache deterministic completions using Redis or an in-memory LRU keyed on model plus messages hash.
- ✓Add model fallback: catch 429 from the primary model and retry with a cheaper model tier.
Frequently asked questions
Why do I hit the rate limit even with only a few requests per minute?
Tokens per minute (TPM) limits are usually more restrictive than requests per minute (RPM) limits. On OpenAI Tier 1, gpt-4o allows 500 RPM but only 30,000 TPM. Sending 10 requests with 3,000 tokens each exhausts the TPM budget while using only 2 percent of the RPM limit. Check x-ratelimit-remaining-tokens in the response headers to confirm which limit you are hitting.
What is the difference between OpenAI organization limits and project limits?
Organization limits apply across all projects in your OpenAI account. Project limits apply per individual project. A high-volume project can exhaust the organization TPM pool, causing unrelated projects to receive 429 responses even though those projects have not exceeded their individual quotas. The OpenAI usage dashboard allows filtering by project to identify the source.
Does streaming with stream: true reduce my token count?
No. Streaming changes how tokens are delivered to your client but does not reduce the number of tokens generated. A 2,000-token streamed response counts as 2,000 tokens against the TPM limit, identical to a non-streamed response. Streaming can make the user experience feel faster, but it does not help with rate limits.
What is the correct exponential backoff formula for AI API retries?
The standard formula is: wait = min(initial_delay * 2^attempt + random(0, 1), max_delay). Start with initial_delay of 1 second and cap at max_delay of 60 seconds. The random component prevents synchronized retry storms when multiple instances hit the limit simultaneously. The tenacity library in Python implements this with wait_exponential_jitter(initial=1, max=60); its plain wait_exponential omits the random term.
How do I distinguish a billing rate limit from a technical rate limit?
Both return HTTP 429, but the response body error type differs. Technical rate limits return rate_limit_exceeded. Billing limits sometimes return a distinct message about spend caps or account limits. If x-ratelimit-remaining-requests shows headroom but 429 errors continue, check the billing dashboard for a monthly spend cap that has been reached.
What is the Anthropic rate limit for claude-3-opus vs claude-3-5-haiku?
Anthropic applies approximately 50 RPM for claude-3-opus and 1,000 RPM for claude-3-5-haiku, though exact limits depend on account tier. This large difference makes model fallback an effective strategy: catch RateLimitError from claude-3-opus and retry with claude-3-5-haiku for non-critical completions to keep throughput up during peak periods.
Should I cache AI API responses to reduce rate limit pressure?
Yes, for deterministic prompts. Format extraction, category classification, and template-driven completions produce the same output for identical inputs. Cache these responses using Redis or an in-memory LRU cache keyed on a hash of the model name and messages array. Cache hit rates of 40 to 70 percent are common for classification workloads, directly reducing both cost and rate limit pressure.
How do I handle the overloaded_error from Anthropic?
Anthropic returns overloaded_error during high-traffic infrastructure events, separate from quota-based rate limits. Treat it like a transient 503 rather than a quota violation: apply the same exponential backoff retry logic, but do not count these events against your quota tracking. The error typically resolves within seconds to minutes without any action on the developer side.
How can I test my retry logic before it hits production?
Use /tools/http-request-builder to send requests to your own mock endpoint that returns a 429 with realistic rate limit headers. Verify that your client reads x-ratelimit-reset-requests, waits the correct duration with jitter, and retries successfully. Testing retry paths with real 429 responses is the only way to confirm the logic works before a live rate limit event.
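If you prefer to run the mock endpoint locally, the standard library is enough. The sketch below makes several assumptions: make_mock_server and post_with_retry are hypothetical names, the server answers 429 for the first two POSTs and 200 afterwards, and Retry-After is set to 0 so the example finishes instantly.

```python
import json
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_mock_server(fail_first=2):
    """Start a local server that returns 429 for the first `fail_first` POSTs."""
    state = {"calls": 0}

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            state["calls"] += 1
            self.rfile.read(int(self.headers.get("Content-Length", 0)))
            if state["calls"] <= fail_first:
                body = json.dumps({"error": {"code": "rate_limit_exceeded"}}).encode()
                self.send_response(429)
                self.send_header("Retry-After", "0")
                self.send_header("x-ratelimit-remaining-requests", "0")
            else:
                body = json.dumps({"ok": True}).encode()
                self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep test output quiet

    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, state


def post_with_retry(url, max_attempts=5):
    """POST an empty JSON body, waiting Retry-After seconds after each 429."""
    for attempt in range(max_attempts):
        request = urllib.request.Request(url, data=b"{}", method="POST")
        try:
            with urllib.request.urlopen(request, timeout=5) as response:
                return json.loads(response.read())
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            time.sleep(float(err.headers.get("Retry-After", 1)))


server, state = make_mock_server(fail_first=2)
url = f"http://127.0.0.1:{server.server_address[1]}/"
result = post_with_retry(url)
server.shutdown()
print(result, state["calls"])
```

Swapping post_with_retry for your real client, pointed at the mock URL, exercises the production retry path without spending quota.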
Last updated: 2026-05-06.