HTTP 504 Gateway Timeout — Fix the Upstream That Responded Too Slowly

Quick answer

💡A 504 Gateway Timeout means the upstream server took too long to respond and the proxy gave up waiting. The upstream is alive — it is just slow. The real fix is either to speed up the upstream response (fix a slow query, add an index, reduce payload size) or move the slow work to a background job and return a 202 Accepted immediately. Increasing the proxy timeout is a temporary workaround, not a fix.

Error symptoms

✕504 Gateway Timeout in browser with nginx or Cloudflare error page
✕upstream timed out (110: Connection timed out) in nginx error_log
✕AWS API Gateway: Endpoint request timed out after 29 seconds
✕Cloudflare Error 524: A Timeout Occurred
✕Request takes exactly 60 seconds and then fails — suspiciously precise
✕Health check endpoint passes but real API endpoints time out under load

Common causes

•Slow database query taking longer than the proxy's read timeout limit
•nginx proxy_read_timeout set to default 60 seconds, lower than the actual operation time
•AWS API Gateway integration timeout of 29 seconds by default; it can be raised only for Regional and private REST APIs, through a quota increase
•Synchronous file I/O or external HTTP calls blocking an async event loop
•Redis WAIT command or network partition blocking the application thread
•Lambda cold start plus processing time exceeds 29 seconds on API Gateway

When it happens

•On first request after a long idle period when Lambda or container has a cold start
•During heavy database load when slow query log fills up with multi-second queries
•After deploying a new feature that introduced a synchronous blocking operation
•When a downstream service your app depends on becomes slow or degraded
•During data migrations or batch operations that run in the request thread

Examples and fixes

A heavy database aggregation query runs synchronously in the request handler, causing timeouts under load.

Slow synchronous database query in API handler

❌ Wrong

// Express route: holds the request open for the full duration of the query
app.get('/api/report', async (req, res) => {
  // Full-table scan takes 45 seconds under production data volume
  const rows = await db.query(
    'SELECT user_id, SUM(amount) FROM transactions GROUP BY user_id'
  );
  res.json(rows);
});
// nginx proxy_read_timeout: 60s — sometimes passes, sometimes 504

✅ Fixed

// Express route — return cached result or queue async generation
app.get('/api/report', async (req, res) => {
  const cached = await redis.get('report:summary');
  if (cached) return res.json(JSON.parse(cached));
  // Return 202 and process in background
  await reportQueue.add({ userId: req.user.id });
  res.status(202).json({ status: 'generating', pollUrl: '/api/report/status' });
});
// Worker picks up job, caches result, client polls /api/report/status

The original route runs a full-table scan in the request thread. Under any meaningful load, this exceeds the proxy timeout and returns 504. The fix uses two patterns together: cache the last computed result in Redis so repeat requests are instant, and for cache misses, enqueue the work and return 202 Accepted immediately. The client polls a status endpoint until the report is ready. This pattern handles operations of any duration without proxy timeout risk.
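
The 202-and-poll flow described above can be sketched without any framework. This is a minimal illustration, not the article's implementation: the in-memory Map, the function names, and the job ID appended to the poll URL are all assumptions made for the example. A real deployment would keep job state in Redis or a database so that web and worker processes share it.

```javascript
// Hypothetical in-memory job store, used only to keep the sketch self-contained.
const jobs = new Map();
let nextId = 1;

// Called by the request handler: record the job and answer immediately.
function startReport(userId) {
  const jobId = String(nextId++);
  jobs.set(jobId, { userId, status: 'generating', result: null });
  return {
    statusCode: 202,
    body: { status: 'generating', pollUrl: `/api/report/status/${jobId}` },
  };
}

// Called by the worker once the slow query has finished.
function completeReport(jobId, rows) {
  const job = jobs.get(jobId);
  if (job) {
    job.status = 'ready';
    job.result = rows;
  }
}

// Called by the status endpoint that the client polls.
function getReportStatus(jobId) {
  const job = jobs.get(jobId);
  if (!job) return { statusCode: 404, body: { error: 'unknown job' } };
  if (job.status !== 'ready') return { statusCode: 200, body: { status: job.status } };
  return { statusCode: 200, body: { status: 'ready', rows: job.result } };
}
```

Keeping the three steps as plain functions makes the contract explicit: the request path never waits on the query, and the only slow code runs in the worker.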

Lambda processing time plus cold start exceeds API Gateway's default 29-second integration timeout.

AWS API Gateway 29-second limit with Lambda

❌ Wrong

// Lambda function — synchronous heavy processing
exports.handler = async (event) => {
  const imageUrl = JSON.parse(event.body).url;
  // Download 50MB image, resize, upload to S3 — takes 35s
  const buffer = await downloadImage(imageUrl);
  const resized = await sharp(buffer).resize(1200).toBuffer();
  await s3.putObject({
    Bucket: 'my-bucket',
    Key: 'result.jpg',
    Body: resized
  }).promise();
  return { statusCode: 200, body: JSON.stringify({ key: 'result.jpg' }) };
};

✅ Fixed

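// Lambda function: record the job, enqueue the work, return the job ID immediately
const crypto = require('crypto');
const AWS = require('aws-sdk'); // AWS SDK v2 assumed, matching the .promise() calls below
const sqs = new AWS.SQS();
const dynamodb = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event) => {
  const { url } = JSON.parse(event.body);
  const jobId = crypto.randomUUID();
  // Write the status record first so the worker never picks up a job that has no record yet
  await dynamodb.put({
    TableName: 'jobs',
    Item: { jobId, status: 'queued' }
  }).promise();
  await sqs.sendMessage({
    QueueUrl: process.env.PROCESSING_QUEUE_URL,
    MessageBody: JSON.stringify({ jobId, url }),
  }).promise();
  return { statusCode: 202, body: JSON.stringify({ jobId }) };
};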

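AWS API Gateway applies a 29-second integration timeout by default. Regional and private REST APIs can request a higher value as a quota increase, possibly in exchange for a lower account-level throttle quota, but edge-optimized REST APIs cannot, and HTTP APIs are capped at 30 seconds. Any Lambda that might take longer than 25 seconds (leaving a buffer for cold start) should therefore be decoupled from the API Gateway response. The fix records the job in DynamoDB, publishes it to SQS, and returns 202 with a job ID, typically within a fraction of a second. A separate Lambda subscribed to the SQS queue handles the actual processing, bounded by Lambda's own timeout of up to 15 minutes rather than by API Gateway. The client uses the job ID to poll a status endpoint.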
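
The worker side of this pattern might look like the sketch below. It assumes the message body shape used in the fixed handler ({ jobId, url }) and the standard SQS event shape, where messages arrive in event.Records. The saveStatus and processImage helpers are hypothetical and injected as parameters so the example stays self-contained.

```javascript
// Builds the handler for the worker Lambda. saveStatus and processImage are
// hypothetical helpers: one updates the job record, the other does the slow work.
function makeWorker({ saveStatus, processImage }) {
  return async function handler(event) {
    // An SQS-triggered Lambda receives a batch of messages in event.Records.
    for (const record of event.Records) {
      const { jobId, url } = JSON.parse(record.body);
      await saveStatus(jobId, 'processing');
      try {
        const key = await processImage(url);
        await saveStatus(jobId, 'done', { key });
      } catch (err) {
        // Record the failure so the polling client is not left waiting forever.
        await saveStatus(jobId, 'failed', { error: err.message });
      }
    }
  };
}
```

Writing a terminal status on both success and failure matters here: the polling client has no other way to learn that a job will never finish.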

Why each proxy layer has a different timeout limit

A 504 Gateway Timeout is not a single error with a single fix — it is a category of error produced by different systems at different time thresholds. Understanding which layer is timing out determines both the diagnosis and the correct fix.

nginx has a configurable proxy_read_timeout directive that defaults to 60 seconds. It is the longest nginx will wait between two successive reads from the upstream, not a cap on the total response time. If the upstream sends nothing for 60 seconds before the response starts, nginx closes the connection and returns 504. Because the timer resets on every read, streaming responses that keep sending chunks can run far longer than 60 seconds.

AWS API Gateway imposes a 29-second integration timeout by default, and it is independent of your Lambda timeout setting. Even if your Lambda is configured with a 15-minute timeout, API Gateway stops waiting after 29 seconds and returns 504 while the function keeps running. Since June 2024, Regional and private REST APIs can raise the integration timeout through a service quota increase, which may come with a reduced account-level throttle quota; edge-optimized REST APIs stay at 29 seconds and HTTP APIs at 30 seconds. This is the most common source of 504 errors in serverless architectures and the one most frequently misunderstood.

AWS ALB has a configurable idle timeout that defaults to 60 seconds. Unlike the API Gateway default, this can be raised freely through load balancer attributes in the AWS console or CLI. However, raising it treats the symptom rather than the cause. An ALB also stops routing to targets that fail health checks, so a slow upstream that does not answer health checks in time may be taken out of rotation entirely.

Cloudflare enforces a 100-second response timeout from the origin server by default. Requests that take longer than 100 seconds receive a 524 error (Cloudflare-specific) rather than a generic 504. Enterprise Cloudflare plans allow custom timeout values, but most projects operate under the 100-second limit. Because Cloudflare's limit is higher than most origin proxy timeouts, an origin nginx or ALB usually gives up first and Cloudflare simply relays its 504. A 524 therefore means the connection to the origin succeeded but nothing at the origin, proxy included, sent an HTTP response within 100 seconds.

All of these limits exist for legitimate reasons: a proxy that waits indefinitely for slow upstreams accumulates open connections, exhausts file descriptors, and eventually fails entirely. The architectural response to hard limits is decoupling slow operations from the request-response cycle using queues, background workers, and polling endpoints.

Pinpoint the slow operation causing the timeout

The precision of the timeout value is your first diagnostic clue. If requests fail at exactly 60 seconds, nginx proxy_read_timeout is the culprit. If failures happen at exactly 29 seconds, AWS API Gateway is timing out. If failures cluster around 100 seconds, Cloudflare is the terminating layer. Exact thresholds indicate proxy timeouts rather than variable upstream slowness.
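
As a rough illustration of that first diagnostic step, the sketch below maps an observed failure time to the default thresholds named in this guide. The function name and the one-second tolerance are assumptions, and the mapping only holds while those layers run with their default settings.

```javascript
// Default thresholds named in this guide, in seconds.
const KNOWN_LIMITS = [
  { seconds: 29, layer: 'AWS API Gateway integration timeout' },
  { seconds: 60, layer: 'nginx proxy_read_timeout or ALB idle timeout (defaults)' },
  { seconds: 100, layer: 'Cloudflare origin response timeout (524)' },
];

// Returns the layer whose limit is within toleranceSeconds of the observed
// failure time, or null when the duration matches no known threshold.
function guessTimeoutLayer(durationSeconds, toleranceSeconds = 1) {
  const match = KNOWN_LIMITS.find(
    (limit) => Math.abs(durationSeconds - limit.seconds) <= toleranceSeconds
  );
  return match ? match.layer : null;
}
```

A null result is informative too: failures spread across many durations point to variable upstream slowness rather than a proxy threshold.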

Open the browser DevTools Network tab and click the failing request. Look at the Timing tab — the TTFB (Time To First Byte) section shows how long the browser waited before receiving any response. A TTFB equal to the timeout value confirms the proxy gave up. The Response tab will show the 504 page HTML, which often includes identifying markers: nginx version, Cloudflare ray ID, or AWS error codes.

For database-related timeouts, enable the slow query log on your database server. In PostgreSQL, set log_min_duration_statement = 5000 in postgresql.conf to log all queries taking more than 5 seconds. In MySQL, set slow_query_log = 1 and long_query_time = 5. Then reproduce the 504 and check the slow query log immediately: the offending query will appear with its execution time. The log does not include the query plan, so run EXPLAIN ANALYZE on the query to find missing indexes or expensive full-table scans.

For Node.js applications, add timing instrumentation at the handler level. Log the time at the start and end of each async operation: database query, external HTTP call, Redis command, and file I/O. Even a simple Date.now() before and after each await call is enough to identify which operation is consuming the most time. The operation whose duration approaches or exceeds the proxy timeout is the one to optimize.
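
A small wrapper keeps that instrumentation from cluttering the handler. This is a minimal sketch; the helper name timed and its log parameter are assumptions made for the example.

```javascript
// Runs an async operation and logs how long it took, even when it throws.
async function timed(label, operation, log = console.log) {
  const startedAt = Date.now();
  try {
    return await operation();
  } finally {
    log(`${label} took ${Date.now() - startedAt} ms`);
  }
}
```

Inside a handler this would be used as `const rows = await timed('db.query', () => db.query(sql));`. The finally block ensures slow operations that end in an error are still measured, which is often exactly the case being investigated.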

For Lambda, open CloudWatch Logs and filter for 'Task timed out' or review the Duration field in log entries. Lambda logs the billed duration of each invocation. If requests are failing at 29 seconds on API Gateway but Lambda logs show them completing in 35 seconds, the function is completing successfully but API Gateway already returned 504 before Lambda finished. Use /tools/http-request-builder to measure round-trip response times from outside the AWS network and compare against CloudWatch timing data.

Matching the fix to the timeout source

The right fix depends on which layer is timing out and why the upstream is slow. Never start by raising the proxy timeout — that is the last resort after all other options are exhausted.

For slow database queries, the fix is query optimization. Run EXPLAIN ANALYZE and look for sequential scans on large tables. Add indexes on the columns in WHERE, JOIN, and ORDER BY clauses. For aggregation queries that touch millions of rows, precompute the result on a schedule and serve it from a cache. Use materialized views in PostgreSQL for aggregations that are expensive to compute but acceptable to refresh periodically.

For nginx proxy_read_timeout adjustments when raising is genuinely appropriate — for example, a legitimate long-running export endpoint — increase it only for that specific location block rather than globally. Use 'location /api/exports { proxy_read_timeout 120s; }' rather than setting it in the main http block. Document the reason for the higher timeout adjacent to the configuration so future engineers understand why it exists.

For AWS API Gateway with Lambda, the architectural fix is async decoupling. Accept the request in a fast Lambda that validates input and publishes to SQS, EventBridge, or SNS. Return 202 Accepted with a job ID. Implement a polling endpoint that checks job status from DynamoDB or Redis. Process the actual work in a separate Lambda subscribed to the queue. This pattern takes the API Gateway timeout out of the picture and scales horizontally with queue depth; each worker invocation is still bounded by Lambda's 15-minute maximum, so split longer jobs into steps.

For FastAPI, Django, or Flask applications where synchronous database calls block async handlers, migrate heavy operations to background tasks using Celery, ARQ, or FastAPI's BackgroundTasks. The request handler should do only lightweight work: validate input, enqueue the job, return a response. The worker processes the job asynchronously. Use Redis as the message broker for simple setups or RabbitMQ for more complex routing requirements.

For Cloudflare 524 errors specifically, the fix is usually in your origin server configuration. Cloudflare's 100-second limit is generous, so if you are hitting it, the origin is genuinely too slow. Profile the slowest requests in your application using APM tools like Datadog, New Relic, or the open-source OpenTelemetry stack. Identify the p99 response time for each endpoint and, as a first milestone, bring p99 for interactive endpoints below 10 seconds, far enough under the limit that ordinary variance cannot reach it.

Timeout edge cases in distributed systems

Several scenarios produce 504 errors that are difficult to reproduce locally because they depend on production load patterns or distributed system behavior.

Cascading timeouts are the most damaging pattern. Service A calls Service B with a 30-second timeout. Service B calls Service C with a 30-second timeout. If Service C is slow, both B and A accumulate open connections waiting for responses. The cascade amplifies load on every upstream service. The fix is to use context propagation with deadline reduction: subtract the time already spent from the remaining timeout budget at each hop, and set a global request deadline at the edge. In Go, the context package supports this natively. In Node.js, use AbortController with a shared signal passed through async call chains.
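
Deadline reduction in Node.js might be sketched as follows. The function names and the 100 ms reserve are assumptions, and callDownstream stands for any function that accepts an AbortSignal in the way fetch does.

```javascript
// Milliseconds left before the request-wide deadline, minus a reserve that
// leaves this service time to build its own response.
function remainingBudget(deadlineAt, reserveMs = 100, now = Date.now()) {
  return Math.max(0, deadlineAt - now - reserveMs);
}

// Runs a downstream call with an AbortSignal that fires when the budget is spent.
// callDownstream is a hypothetical function that accepts { signal }.
async function callWithDeadline(deadlineAt, callDownstream) {
  const budget = remainingBudget(deadlineAt);
  if (budget === 0) throw new Error('deadline exceeded before call');
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), budget);
  try {
    return await callDownstream({ signal: controller.signal });
  } finally {
    clearTimeout(timer); // never leave the timer running after the call settles
  }
}
```

The edge sets deadlineAt once, for example as Date.now() plus the total budget, and every hop passes the same absolute value along. Each service then waits only as long as the caller is still willing to listen.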

Event loop blocking in Node.js produces intermittent 504 errors that appear random. If a synchronous CPU-intensive operation — JSON parsing of a large payload, bcrypt with high rounds, regex on long strings — executes in the request handler, it blocks the event loop and prevents other requests from progressing. Under load, multiple requests pile up behind the blocked event loop and eventually hit the proxy timeout. Profile Node.js with 'node --prof' and look for synchronous operations taking more than a few milliseconds per invocation.

Database connection pool exhaustion creates a timeout pattern that worsens under load. When all pool connections are in use, new requests queue waiting for a connection. The queue grows faster than connections are released, and requests eventually wait long enough to trigger the proxy timeout. Increase the pool size, but also audit whether connections are being released promptly — especially in error handling paths where a thrown exception might skip the connection release code.
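
One way to make prompt release hard to forget is a wrapper that owns the connection's lifetime. The pool shape used here (an async acquire() and a release(conn)) is an assumption for the sketch; check the method names your database driver actually exposes.

```javascript
// Runs work with a pooled connection and always returns it to the pool.
async function withConnection(pool, work) {
  const conn = await pool.acquire();
  try {
    return await work(conn);
  } finally {
    pool.release(conn); // runs on success and on thrown errors alike
  }
}
```

Because the release sits in a finally block, an exception thrown inside work cannot leak the connection, which is the error-path case described above.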

Redis WAIT commands can block for extended periods during network partitions between nodes in a Redis Cluster or other replicated deployment. If your application uses WAIT to ensure replication durability, a partition event can cause arbitrary delays. WAIT takes a timeout argument in milliseconds, and a value of 0 blocks until the requested number of replicas acknowledge. Set a reasonable non-zero timeout and handle the timeout as a degraded but acceptable state rather than blocking indefinitely. This is especially important in multi-region setups where cross-region replication latency is variable.

Timeout anti-patterns that create new problems

Raising the proxy timeout globally is the most common band-aid applied to 504 errors. Setting proxy_read_timeout to 300 or 600 seconds in the nginx http block affects every upstream, including ones that should fail fast. A request to a hung upstream will now hold its connection open for 5 to 10 minutes instead of 60 seconds, consuming file descriptors and memory on the proxy. Always scope timeout increases to the specific location or upstream that legitimately needs more time.

Setting Lambda timeout to 15 minutes without addressing API Gateway is a common serverless misunderstanding. Lambda's timeout is configurable up to 15 minutes, but when the function is invoked through API Gateway, the gateway stops waiting at its own integration timeout (29 seconds by default) regardless of the Lambda setting. Engineers who raise the Lambda timeout expecting it to fix 504 errors from API Gateway are surprised when nothing changes. The Lambda timeout still caps how long the function itself may run; it simply does not extend how long API Gateway waits for the response.

Not distinguishing between a 504 and a 502 when choosing a fix leads to wasted effort. A 504 means the upstream is alive but slow — fix the slowness or use async processing. A 502 means the upstream is not responding at all — fix the crash or connection issue. The nginx error log message distinguishes them: 'upstream timed out' is a 504 cause, while 'upstream prematurely closed connection' or 'connect() failed' is a 502 cause.
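
When scanning a large error_log, the distinction can be automated with a few substring checks. This sketch covers only the three message fragments quoted above; the function name is an assumption.

```javascript
// Maps an nginx error_log line to the status code nginx most likely returned,
// based on the message fragments quoted in this guide.
function classifyUpstreamError(logLine) {
  if (logLine.includes('upstream timed out')) return 504;
  if (
    logLine.includes('upstream prematurely closed connection') ||
    logLine.includes('connect() failed')
  ) {
    return 502;
  }
  return null; // not one of the messages covered here
}
```

Counting the 504 and 502 lines separately shows quickly whether the incident is about slowness, crashes, or both.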

Ignoring cold start latency in serverless 504 troubleshooting creates a recurring problem. Lambda cold starts for large functions can add 1 to 5 seconds or more to first invocations after idle periods. For functions with heavy dependencies — large node_modules, compiled binaries — cold starts can exceed 10 seconds. Combined with actual processing time, the total easily approaches the API Gateway limit. Use Provisioned Concurrency for latency-sensitive endpoints, or restructure the deployment to use Lambda@Edge for cacheable responses.

Not having a health check endpoint that responds in under 500 milliseconds makes the load balancer remove healthy instances when they are momentarily slow. The health check must be a fast, shallow check, not a deep application probe. Many teams add database queries to health checks thinking this makes them more informative, but a slow health check during a database overload event removes the instance from the pool at exactly the wrong time.

Architecture decisions that make 504 rare

The most effective way to prevent 504 errors is to design every API endpoint to complete within a predictable, short time budget. For interactive endpoints — those called from a browser or mobile app — target a p99 response time under 2 seconds. For internal service-to-service calls, target under 500 milliseconds. Any operation that cannot reliably meet these targets should be decoupled into an async flow with a polling or webhook notification mechanism.

Add response time SLOs to every endpoint during design, not after incidents. Define the maximum acceptable response time and instrument the code to measure it. Alert when p95 or p99 exceeds 50 percent of the timeout threshold — this gives you time to fix performance regressions before they start producing 504 errors in production.
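
The alert rule reduces to a few lines of arithmetic. The sketch below uses the nearest-rank percentile and checks p99 only; the function names and the sample-array input are assumptions, and a real setup would normally take these values from a metrics system.

```javascript
// Nearest-rank percentile of a list of latencies in milliseconds.
function percentile(latenciesMs, p) {
  if (latenciesMs.length === 0) return 0;
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// True when p99 has used more than `fraction` of the proxy timeout.
function shouldAlert(latenciesMs, timeoutMs, fraction = 0.5) {
  return percentile(latenciesMs, 99) > timeoutMs * fraction;
}
```

With a 60-second proxy timeout and the default fraction, the alert fires once p99 passes 30 seconds, well before requests start failing.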

Use circuit breakers between services. When Service A calls Service B and B becomes slow, a circuit breaker detects the degraded response times and short-circuits calls to B, returning a cached response or a meaningful error immediately. Libraries like Resilience4j for JVM, Polly for .NET, and opossum for Node.js implement circuit breaker patterns. This prevents slow upstreams from causing cascading timeouts across the entire dependency graph.
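
The core of a circuit breaker is a small state machine. The sketch below is a simplified illustration, not the API of any of the libraries named above; the class shape, the thresholds, and the injectable clock are assumptions.

```javascript
class CircuitBreaker {
  constructor({ failureThreshold = 5, resetAfterMs = 10000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }

  // Should the next call to the downstream service be attempted?
  allowRequest() {
    if (this.openedAt === null) return true;
    // After the cool-down, let trial calls through (the half-open state).
    return this.now() - this.openedAt >= this.resetAfterMs;
  }

  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }

  // Count slow responses as failures too, since slowness is what causes 504s.
  recordFailure() {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = this.now();
  }
}
```

The caller checks allowRequest() before each call and returns the cached response or an immediate error when it is false. A failed trial call after the cool-down reopens the circuit and restarts the wait.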

Every health check endpoint must respond in under 500 milliseconds unconditionally. A health endpoint that itself runs a slow database query defeats its purpose — a 504 health check causes the load balancer to remove the target from the pool, reducing capacity exactly when you need it most. The health endpoint should check database reachability with a fast SELECT 1 or a cached connection ping, not run application logic.
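
A cached ping can be sketched as follows. The pingDatabase parameter is a hypothetical async function that resolves when a cheap query such as SELECT 1 succeeds; the five-second cache window and the injectable clock are also assumptions.

```javascript
// Builds a health check that pings the database at most once per maxAgeMs
// and otherwise answers from the cached result.
function makeHealthCheck(pingDatabase, { maxAgeMs = 5000, now = Date.now } = {}) {
  let lastCheckedAt = -Infinity;
  let lastHealthy = false;

  return async function healthCheck() {
    if (now() - lastCheckedAt < maxAgeMs) {
      return { healthy: lastHealthy, cached: true };
    }
    try {
      await pingDatabase();
      lastHealthy = true;
    } catch (err) {
      lastHealthy = false;
    }
    lastCheckedAt = now();
    return { healthy: lastHealthy, cached: false };
  };
}
```

This sketch does not bound the ping itself, so the ping should carry its own short timeout; otherwise the one uncached check in each window can still be slow.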

Use /tools/http-request-builder and /tools/cors-tester to test endpoint response times from multiple geographic locations and network conditions. Observing response times from outside your infrastructure reveals proxy timeout behaviors and CDN edge caching effects that are invisible from within the application. Monitor p50, p95, and p99 latencies continuously — a drift in p99 is the first sign of an emerging timeout problem before it affects a significant fraction of users.

504 Gateway Timeout fix checklist

✓Check exact timeout value — 29s means API Gateway, 60s means nginx default, 100s means Cloudflare
✓Read nginx error_log for 'upstream timed out' messages with upstream address and port
✓Enable slow query log on your database and look for queries exceeding 5 seconds
✓Profile your application handler to identify which async operation takes the most time
✓For Lambda behind API Gateway: decouple any operation that may exceed 25 seconds into SQS
✓Raise nginx proxy_read_timeout only for the specific location that needs it, never globally
✓Add timing logs to your application at each major async operation boundary
✓Verify health check endpoint responds in under 500ms under all conditions

Frequently asked questions

What is the difference between a 504 and a 502?

A 504 Gateway Timeout means the upstream is alive but responded too slowly — the proxy gave up waiting. A 502 Bad Gateway means the upstream is not reachable or returned an invalid response immediately. Fix 504 by speeding up the upstream or using async processing. Fix 502 by ensuring the upstream process is running and reachable on the correct host and port.

Can I increase the AWS API Gateway timeout beyond 29 seconds?

Only in some cases. Regional and private REST APIs can raise the integration timeout above the 29-second default through a service quota increase, which may reduce the account-level throttle quota. Edge-optimized REST APIs stay at 29 seconds and HTTP APIs at 30 seconds. Either way, the more robust fix for slow operations is to decouple them: accept the request in a fast Lambda that enqueues the work, return 202 Accepted with a job ID, and process asynchronously in a separate Lambda subscribed to SQS or EventBridge. The client polls a status endpoint for completion.

How do I increase nginx proxy_read_timeout?

Add proxy_read_timeout 120s; to the specific location block that needs more time. Do not add it to the global http block, which applies the longer timeout to all upstreams. Reload nginx with 'nginx -s reload' after the change. Only increase this if the upstream legitimately needs more time — most endpoints should respond in well under 60 seconds, and hitting the default usually indicates a fixable slowness.

Why do my 504 errors happen at exactly 60 seconds?

An exactly-60-second failure is a strong indicator of nginx proxy_read_timeout at its default value. Check the effective configuration with 'nginx -T | grep proxy_read_timeout', which also covers included files; no output means the directive is unset and the 60-second default applies. Then investigate what the upstream is doing during those 60 seconds: enable the slow query log on your database or add timing logs to your application to identify the blocking operation.

Does Cloudflare's 100-second limit apply to all plans?

The 100-second origin response timeout is the default on every Cloudflare plan, including Free and Pro. Only Enterprise plans can configure a custom value, through Cloudflare's API. If your origin consistently takes more than 100 seconds, the fix is architectural: move slow work to background processing. Cloudflare reports these as Error 524, not a generic 504.

How should my health check endpoint respond to avoid false 504s?

Your health endpoint should always respond in under 500 milliseconds. Verify database connectivity with a fast ping such as SELECT 1 but do not execute application logic. Cache the last successful health check result and return it immediately if checked within a few seconds. This prevents the health check itself from triggering timeouts and avoids the load balancer removing healthy targets during database slowdowns.

Can a cold Lambda start cause a 504 on API Gateway?

Yes. Lambda cold starts for large functions can add several seconds to the first invocation after an idle period. If cold start time plus processing time exceeds 29 seconds, API Gateway returns 504. Use Provisioned Concurrency to keep function instances warm for latency-sensitive endpoints. Alternatively, split the function into a thin cold-start-safe entry point that delegates heavy work to an async queue.

What is the fastest way to identify which operation is causing my 504?

Add Date.now() timing logs before and after each major async operation in your handler — database query, Redis call, external HTTP call, file I/O. The operation whose logged duration approaches your proxy timeout value is the one to fix. In PostgreSQL, set log_min_duration_statement = 5000 to automatically log slow queries. This combination identifies the bottleneck without requiring a full APM setup.

Last updated: 2026-05-06.