Who this is for: Ops engineers, platform architects, and senior developers who run n8n in production and need to keep it reliable as traffic grows. We cover this in detail in the n8n Production Readiness & Scalability Risks Guide.
Quick Diagnosis
If any of the following appear, the instance is moving toward a scalability breaking point:
- CPU or memory spikes that stay high for minutes
- Workflow time‑outs or a growing queue backlog
- Credential refresh errors (e.g., OAuth token expires)
- Repeated “worker died” logs
In production these symptoms typically emerge after sustained load, not immediately after a fresh deploy.
The fastest way to stop the bleed is to enable built‑in metrics, cap concurrent executions, and off‑load heavy work to external services. That combination gives immediate visibility and buys time for refactoring.
1. Persistent High‑Load Metrics
| Symptom | Threshold |
|---|---|
| CPU usage | > 85 % for > 5 min |
| Memory usage | > 80 % of RAM |
| Event‑loop lag | > 200 ms |
Why it breaks – When the Node.js event loop stalls, workflows hang and time‑outs cascade. This is easy to miss during first‑time setups.
What to do – Export Prometheus metrics and set alerts:
```sh
docker run -e N8N_METRICS=true -p 5678:5678 n8nio/n8n
```
Add `--max-old-space-size=4096` to the `NODE_OPTIONS` environment variable to give the V8 heap more breathing room:
```sh
export NODE_OPTIONS="--max-old-space-size=4096"
```
Inject the `NODE_OPTIONS` line into the container's environment file so it is applied consistently across pods.
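The thresholds from the table can be encoded as Prometheus alerting rules. A sketch, assuming the default prom-client metric names and a scrape job labelled `n8n` (verify both against your instance's `/metrics` output):

```yaml
# Alerting rules matching the thresholds above; metric and job
# names are assumptions and should be checked against /metrics.
groups:
  - name: n8n-load
    rules:
      - alert: N8nHighCpu
        expr: rate(process_cpu_seconds_total{job="n8n"}[5m]) > 0.85
        for: 5m
        labels: { severity: warning }
      - alert: N8nHighEventLoopLag
        expr: nodejs_eventloop_lag_seconds{job="n8n"} > 0.2
        for: 2m
        labels: { severity: critical }
```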
2. Execution Time‑outs & Queue Backlog
| Indicator | Meaning |
|---|---|
| Execution timeout (default 30 s) | Workflow runs longer than allowed |
| Queue length > 1000 | Workers can’t keep up |
| “Worker died” logs | Process crashes (exceptions, leaks) |
Fixes – step by step
- Raise the timeout only as far as your SLA permits:
  ```sh
  export EXECUTIONS_TIMEOUT_MAX=120   # seconds
  ```
- Move heavy steps to async HTTP nodes or external functions.
- Enable queue mode and add more workers:
  ```sh
  export EXECUTIONS_MODE=queue
  export WORKER_CONCURRENCY=$(nproc)   # match CPU cores
  ```
Never increase the timeout beyond what users expect; doing so only masks design problems.
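The queue-mode setup above can be sketched as a docker-compose fragment. Service names and replica counts are illustrative; the `QUEUE_BULL_REDIS_HOST` variable tells both roles where to find Redis:

```yaml
# Queue mode: the main instance accepts triggers, workers pull jobs from Redis.
services:
  redis:
    image: redis:7
  n8n-main:
    image: n8nio/n8n
    environment:
      - EXECUTIONS_MODE=queue
      - QUEUE_BULL_REDIS_HOST=redis
  n8n-worker:
    image: n8nio/n8n
    command: worker
    environment:
      - EXECUTIONS_MODE=queue
      - QUEUE_BULL_REDIS_HOST=redis
    deploy:
      replicas: 2
```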
3. Credential & Secret Management
| Symptom | Trigger | Fix |
|---|---|---|
| “Invalid OAuth token” after 1 h | Token cache isn’t shared | Store tokens in Redis (CREDENTIALS_STORAGE=redis) |
| “Secret not found” on new pod | Secrets not injected by orchestrator | Use a cloud secret manager (AWS/GCP) and reference via {{ $env.SECRET_NAME }} |
| Rotation race condition | Multiple workers write the same file | Serialize writes with a distributed lock (e.g., Redlock) |
In production, token‑cache misses often appear as a burst of 401 responses. Ensure every worker can read the same store.
Never embed raw API keys in workflow JSON – always reference environment variables or secret‑manager entries.
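Injecting secrets via the orchestrator rather than workflow JSON can look like this Kubernetes pod-spec fragment (the secret and key names are placeholders):

```yaml
# Expose a Kubernetes Secret as an environment variable, then
# reference it in workflows as {{ $env.SLACK_API_KEY }}.
containers:
  - name: n8n
    image: n8nio/n8n
    env:
      - name: SLACK_API_KEY
        valueFrom:
          secretKeyRef:
            name: n8n-credentials   # placeholder secret name
            key: slack-api-key      # placeholder key
```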
4. Database Saturation
| Metric | Warning | Remedy |
|---|---|---|
| PostgreSQL connections > 70 % of max_connections | Pool exhaustion → “too many clients” | Increase max_connections **or** add PgBouncer |
| Write latency > 200 ms | Logs fall behind | Move execution logs to a write‑optimized table or external store (Loki) |
| deadlock detected > 5/hr | Transaction conflicts | Keep transactions short; use SELECT … FOR UPDATE where needed |
When the connection ceiling is hit, errors such as “too many clients” appear in the logs. If you are still on SQLite, migrate to PostgreSQL well before reaching roughly 1,000 concurrent executions.
PgBouncer runs as its own service in front of PostgreSQL; n8n is then pointed at the pooler instead of the database directly. A sketch (the pooler image and hostnames are examples):
```sh
# Run PgBouncer in front of PostgreSQL, then point n8n at it
docker run -d -p 6432:6432 \
  -e DATABASE_URL=postgres://n8n:secret@pg.example.com/n8n \
  edoburu/pgbouncer
docker run -e DB_TYPE=postgresdb \
  -e DB_POSTGRESDB_HOST=pgbouncer.example.com \
  -e DB_POSTGRESDB_PORT=6432 \
  n8nio/n8n
```
A lightweight connection pooler smooths out traffic spikes.
5. Infrastructure Limits & Mis‑configurations
| Issue | Symptom | Quick Fix |
|---|---|---|
| Single‑node deployment | All traffic hits one container | Deploy at least two worker pods behind an ingress and add health checks (/healthz) |
| No Horizontal Pod Autoscaler (HPA) | No auto‑scale on spikes | Add HPA with targetCPUUtilizationPercentage: 70 and minReplicas: 2 |
| Weak Redis persistence | Queue loss after restart | Set appendonly yes and save 60 1; consider a Redis cluster for HA |
| Low container limits | OOM kills | Set resources.limits.cpu: "2" and resources.limits.memory: "4Gi" |
A single‑node deployment is a common first step, but it quickly becomes a bottleneck. Adding a second pod is the simplest way to overcome this.
Set explicit memory bounds; Docker's default --memory-swap behavior can let the container swap, silently degrading Node.js performance.
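The HPA settings from the table can be expressed as a manifest. This uses the current `autoscaling/v2` API (equivalent to `targetCPUUtilizationPercentage` in the older v1 form); the deployment name `n8n-workers` is a placeholder:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: n8n-workers
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: n8n-workers   # placeholder deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```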
6. Step‑by‑Step Audit Checklist
| Steps | Action | Command / Config |
|---|---|---|
| 1 | Export n8n metrics to Prometheus | docker run -e N8N_METRICS=true … |
| 2 | Verify Redis queue health | redis-cli info replication |
| 3 | Check PostgreSQL connection pool | SELECT count(*) FROM pg_stat_activity; |
| 4 | Scan worker logs for unhandled rejections | grep -i unhandledRejection /var/log/n8n/*.log |
| 5 | Confirm secret manager integration | aws secretsmanager get-secret-value --secret-id n8n-oauth-token |
| 6 | Validate HPA scaling policy | kubectl get hpa n8n-workers |
| 7 | Simulate load (10 k req) | hey -n 10000 -c 200 http://n8n.example.com/webhook/test |
| 8 | Capture execution latency histogram | curl http://localhost:5678/metrics \| grep execution_duration_seconds_bucket |
| 9 | Set worker concurrency limit | export WORKER_CONCURRENCY=4 |
| 10 | Schedule credential refresh job | cron: "0 * * * *" node refreshCredentials.js |
7. Real‑World Mitigation Patterns
- Off‑load heavy processing – Move large file transforms to a Lambda or Cloud Function and return a reference URL. This often cuts CPU usage by half.
- Batch API calls – Use the “HTTP Request – Batch” node to combine 50+ calls into a single request, easing external rate limits.
- Stateless workers – Keep only PostgreSQL for workflow state; avoid in‑memory caches that disappear on pod restarts.
- Circuit breaker – Wrap external HTTP nodes in a try/catch and add a fallback that retries after a back‑off.
- Graceful shutdown – Ensure in‑flight executions finish before the container stops:
  ```js
  process.on('SIGTERM', async () => { await gracefulShutdown(); });
  ```
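The circuit-breaker pattern above can be sketched in a few lines of plain JavaScript. The class and option names are illustrative, not an n8n API; wire the `call` wrapper around whatever performs the external request:

```javascript
// Minimal circuit-breaker sketch. After `threshold` consecutive
// failures the breaker opens and rejects calls immediately until
// `resetMs` has elapsed, then allows a single trial call (half-open).
class CircuitBreaker {
  constructor(fn, { threshold = 3, resetMs = 30000 } = {}) {
    this.fn = fn;
    this.threshold = threshold;
    this.resetMs = resetMs;
    this.failures = 0;
    this.openedAt = null;
  }

  async call(...args) {
    if (this.openedAt !== null) {
      if (Date.now() - this.openedAt < this.resetMs) {
        throw new Error('circuit open'); // fail fast, skip the backend
      }
      this.openedAt = null; // half-open: allow one trial call
    }
    try {
      const result = await this.fn(...args);
      this.failures = 0; // success closes the breaker
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) {
        this.openedAt = Date.now(); // trip the breaker
      }
      throw err;
    }
  }
}
```

A caller would wrap the flaky HTTP call once, e.g. `const guarded = new CircuitBreaker(fetchUpstream)`, and invoke `guarded.call(...)` everywhere, catching `circuit open` to trigger the fallback path.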
8. Scaling Roadmap
| Phase | Goal | Recommended Tech |
|---|---|---|
| Immediate | Stop loss | Raise WORKER_CONCURRENCY, enable Redis queue, set alerts |
| Short‑term | Stabilize | Add a second worker pod, deploy PgBouncer, switch to secret manager |
| Mid‑term | Optimize | Refactor long‑running nodes to async services, add circuit breakers |
| Long‑term | Future‑proof | Use the n8n-operator, HPA + Cluster Autoscaler, event‑driven NATS streaming for inter‑node messaging |
Conclusion
If your n8n instance shows high CPU/memory, execution time‑outs, queue backlog, credential errors, or DB connection saturation, it’s primed to fail at scale.
Quick fix:
- Enable Prometheus metrics (N8N_METRICS=true).
- Set WORKER_CONCURRENCY to match CPU cores.
- Store credentials in a shared secret manager.
- Deploy at least two worker pods behind a load balancer and configure an HPA.



