n8n works in staging but slows down in production – root cause

Step by Step Guide to solve n8n works in staging but slows down in production


Who this is for: Platform engineers and DevOps teams managing n8n deployments that run fast in staging but experience latency, time‑outs, or webhook throttling in production. We cover this in detail in the n8n Performance Degradation & Stability Issues Guide.


Quick Diagnosis

Problem: Identical n8n code runs smoothly on a staging server (fast workflow execution, low CPU/memory usage) but becomes noticeably slower once deployed to production (high latency, time‑outs, throttled webhooks).

Quick fix:

  1. Compare environment variables – ensure EXECUTIONS_PROCESS=main (or a correctly configured queue) matches across both environments.
  2. Validate DB indexes – run an EXPLAIN on the execution_entity table in production; add missing indexes.
  3. Check resource limits – confirm the production container/pod has at least 2 CPU and 4 GB RAM and that the ulimit for open files is ≥ 10 000.
  4. Inspect webhook queue – temporarily enable N8N_DISABLE_PRODUCTION_WEBHOOKS to see if inbound traffic is the bottleneck.

Align any differences with the staging configuration and the slowdown usually disappears.


1. Environment‑level Mismatches

Micro‑summary: Verify that execution mode, Redis connectivity, and Node runtime are identical between staging and production.

1.1 Execution mode

| Setting | Staging (working) | Production (slow) | Recommended |
|---|---|---|---|
| EXECUTIONS_PROCESS | main | queue (default) | Keep main for low‑traffic production, or configure a Redis‑backed queue correctly. |
| N8N_QUEUE_BULL_REDIS_HOST | – | redis-prod.internal (unreachable) | Verify Redis connectivity; fall back to main if the queue cannot be reached. |

EEFA note: When EXECUTIONS_PROCESS=queue and Redis is mis‑configured, every workflow execution waits for a failed connection retry, inflating latency dramatically.
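
Before tuning anything else, it is worth confirming that the production pod can actually reach the configured Redis host. A sketch using a plain TCP probe; `redis-prod.internal` is a placeholder, so substitute your own N8N_QUEUE_BULL_REDIS_HOST value:

```shell
# TCP probe for the Bull queue's Redis backend; fails fast instead of
# sitting through Bull's connection-retry back-off.
redis_reachable() {
  local host="$1" port="${2:-6379}"
  timeout 2 bash -c "</dev/tcp/${host}/${port}" 2>/dev/null
}

# "redis-prod.internal" is an assumed host name.
if redis_reachable redis-prod.internal 6379; then
  echo "Redis reachable"
else
  echo "Redis unreachable -- expect queue-mode latency"
fi
```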

1.2 Node.js version

Check the Node version on each environment:

```shell
# Staging
ssh staging-host "node -v"

# Production
ssh prod-host "node -v"
```

If production reports a version older than v18.13, upgrade it; newer V8 optimisations noticeably speed up n8n’s internal libraries.


2. Database Bottlenecks

Micro‑summary: Ensure the execution_entity table is indexed and the DB connection pool is sized for production traffic.

2.1 Missing indexes on execution_entity

Run an execution‑plan query to see if a sequential scan occurs:

```sql
EXPLAIN ANALYZE
SELECT * FROM execution_entity
WHERE workflow_id = $1   -- replace $1 with a concrete workflow id in psql
ORDER BY created_at DESC
LIMIT 20;
```

If the plan shows Seq Scan, add the appropriate index.

PostgreSQL

```sql
CREATE INDEX idx_execution_workflow_created
ON execution_entity (workflow_id, created_at DESC);
```

MySQL

```sql
ALTER TABLE execution_entity
ADD INDEX idx_execution_workflow_created (workflow_id, created_at DESC);
```

2.2 Connection‑pool size

| Variable | Staging | Production | Recommended |
|---|---|---|---|
| DB_MAX_CONNECTIONS | 10 | 5 (default) | Set ≥ 20 for production, or at least match the staging value. |
| PGPOOL_MAX_CLIENTS (Postgres) | 10 | 5 | Align with DB_MAX_CONNECTIONS. |

EEFA note: A low pool size forces n8n to queue workflow executions, which appears as a “slow” UI even though the underlying DB is healthy.
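
To compare the effective pool size across environments without opening each host's configuration by hand, the value can be read straight from the env files. A sketch; the file names are assumptions, and the fallback of 5 mirrors the default shown in the table above:

```shell
# Read DB_MAX_CONNECTIONS from an env file, falling back to the
# default of 5 when the variable is unset.
pool_size() {
  local val
  val=$(grep -E '^DB_MAX_CONNECTIONS=' "$1" 2>/dev/null | tail -1 | cut -d= -f2)
  echo "${val:-5}"
}

# Usage: pool_size .env.production   # compare against pool_size .env.staging
```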


3. Resource Allocation & OS Limits

Micro‑summary: Provide enough CPU, memory, fast storage, and file‑descriptor limits for production workloads.

3.1 Container / VM sizing

| Resource | Staging (Docker) | Production (K8s) | Minimum for production |
|---|---|---|---|
| CPU | 1 core | 0.5 core | 2 cores (burstable) |
| Memory | 2 GB | 1 GB | 4 GB (headroom for concurrent executions) |
| Disk I/O | SSD (fast) | Network‑attached storage (NAS) | Use SSD or provisioned IOPS. |
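
For Docker‑based deployments, the minimums above can be expressed directly in a compose file. This is a sketch only; the service name, image tag, and exact limits are assumptions rather than values from a specific deployment:

```yaml
# docker-compose sketch applying the recommended production minimums
services:
  n8n:
    image: n8nio/n8n        # pin a specific tag in real deployments
    deploy:
      resources:
        limits:
          cpus: "2"         # >= 2 cores
          memory: 4G        # >= 4 GB headroom for concurrent executions
    ulimits:
      nofile:
        soft: 10000         # matches the open-files guidance in Section 3.2
        hard: 10000
```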

3.2 ulimit for open files

Check the current limit and raise it in the service definition:

```shell
# Current limit
ulimit -n   # e.g., 1024
```

Systemd service snippet:

```ini
[Service]
LimitNOFILE=10000
```

EEFA note: n8n opens a file descriptor per active webhook and per DB connection. Hitting the default 1024 limit results in “Too many open files” errors that surface as delayed executions.
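
To see how close the running process actually is to that limit, the open descriptors can be counted via /proc (Linux only). The `n8n` process‑name pattern is an assumption; adjust it to your setup:

```shell
# Count the file descriptors currently held by a process (Linux /proc).
fd_count() {
  ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# "n8n" is an assumed process-name pattern.
pid=$(pgrep -o -f n8n || true)
if [ -n "$pid" ]; then
  echo "n8n pid $pid: $(fd_count "$pid") open fds (limit: $(ulimit -n))"
fi
```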


4. Webhook & Queue Configuration

Micro‑summary: Throttle inbound traffic and verify Redis health to keep the webhook pipeline fluid.

4.1 Webhook throttling

Add rate‑limiting variables to the production .env:

```shell
N8N_WEBHOOK_TUNNEL=false
N8N_WEBHOOK_RATE_LIMIT=200          # requests per minute per IP
N8N_WEBHOOK_RATE_LIMIT_BURST=50
```

4.2 Redis queue health check

```shell
# Ping Redis
redis-cli -h redis-prod.internal ping   # should return PONG

# Inspect memory usage
redis-cli -h redis-prod.internal info | grep used_memory_human
```

If memory usage exceeds 75 % of the allocated quota, increase the Redis pod’s resources.limits.memory or enable the allkeys-lru eviction policy.
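
Persisting the eviction settings in redis.conf avoids losing them on a Redis restart. A sketch with assumed values; note that Bull's own documentation recommends noeviction for queue data, since LRU eviction can silently drop queued jobs, so prefer raising the memory ceiling over allkeys-lru where possible:

```
# redis.conf sketch -- the 2gb cap is an assumption
maxmemory 2gb
maxmemory-policy allkeys-lru   # consider noeviction for Bull queues
```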

EEFA note: A saturated Redis queue causes workers to idle while waiting for jobs, making the UI appear “stuck”.


5. Logging, Monitoring & Real‑World Edge Cases

Micro‑summary: Set up observability and watch for production‑only error patterns.

5.1 Monitoring stack

| Tool | What to monitor | Production‑specific alerts |
|---|---|---|
| Prometheus + Grafana | n8n_workflow_execution_seconds, nodejs_process_cpu_seconds_total | Alert if average execution time > 2 s for > 5 min |
| ELK stack | error‑level logs, Redis connection errors | Notify Slack channel #n8n-prod |
| Health‑check endpoint (/healthz) | 200 OK, response time < 200 ms | Auto‑restart pod after 3 consecutive failures |
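
The auto‑restart behaviour in the last row maps directly onto a Kubernetes liveness probe. A sketch with assumed values; port 5678 is n8n's default, and the thresholds are illustrative:

```yaml
# Liveness probe sketch for the n8n container
livenessProbe:
  httpGet:
    path: /healthz
    port: 5678            # n8n's default port
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3     # restart after 3 consecutive failures
```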

5.2 Common production‑only error patterns

| Log snippet | Root cause | Fix |
|---|---|---|
| Error: connect ECONNREFUSED 127.0.0.1:6379 | Redis service not reachable (different network namespace) | Add a NetworkPolicy allowing pod → Redis, or set N8N_QUEUE_BULL_REDIS_HOST to the correct DNS name. |
| Error: write EPIPE | Webhook payload exceeds the default body size (1 MB) | Set N8N_MAX_PAYLOAD_SIZE=5mb in .env.production. |
| Error: ER_LOCK_DEADLOCK | Contention on execution_entity due to a missing index | Apply the index from Section 2.1. |
| TLS handshake failed | Production DB enforces strict TLS; older OpenSSL in node‑postgres | Upgrade node-postgres to ≥ 8.9. |

6. Step‑by‑Step Production Tuning Checklist

| Step | Action | Verification command |
|---|---|---|
| 1 | Align EXECUTIONS_PROCESS and ensure Redis is reachable | docker exec -it n8n env \| grep EXECUTIONS_PROCESS |
| 2 | Upgrade Node.js to v18+ | node -v |
| 3 | Add missing DB indexes (execution_entity) | psql -c "\d execution_entity" |
| 4 | Increase DB connection pool (DB_MAX_CONNECTIONS) | grep DB_MAX_CONNECTIONS .env.production |
| 5 | Raise container CPU/memory to ≥ 2 CPU / 4 GB | kubectl top pod <n8n-pod> |
| 6 | Set ulimit -n to ≥ 10 000 (systemd or Docker) | ulimit -n |
| 7 | Enable webhook rate limiting (N8N_WEBHOOK_RATE_LIMIT) | grep N8N_WEBHOOK_RATE_LIMIT .env.production |
| 8 | Monitor Redis memory; set an eviction policy if needed | redis-cli info memory |
| 9 | Deploy Prometheus alerts for execution latency > 2 s | Grafana dashboard |
| 10 | Validate logs for ECONNREFUSED, EPIPE, DEADLOCK | kubectl logs -f <n8n-pod> |

Conclusion

By systematically aligning production configuration with the proven staging setup—matching execution mode, upgrading Node, adding critical database indexes, sizing resources appropriately, raising OS limits, and configuring webhook throttling—the “staging works, production slows” symptom disappears in the vast majority of real‑world deployments. Implement the checklist above, monitor the key metrics, and you’ll achieve consistent, low‑latency n8n performance in production.
