n8n degrades under continuous load: why performance drops and how to fix it

A step‑by‑step guide for when n8n starts fast but degrades under continuous load

Who this is for: Ops engineers and platform teams running n8n in production who need a reliable, low‑latency automation pipeline. We cover this in detail in the n8n Performance Degradation & Stability Issues Guide.


Quick Diagnosis: Is Your n8n Instance Slowing Down?

| Symptom | Immediate Check | One‑Line Fix |
|---|---|---|
| Workflow latency spikes after a few minutes | CPU > 80 % or memory > 75 % in `docker stats` / `top` | Switch to queue mode (`EXECUTIONS_MODE=queue`) and run dedicated `n8n worker` processes to off‑load jobs. |
| "Too many connections" errors from PostgreSQL | `SELECT count(*) FROM pg_stat_activity;` approaches `max_connections` | Raise the PostgreSQL `max_connections` setting or enable connection pooling (pgbouncer). |
| "Queue is full" in logs (BullMQ warnings) | `redis-cli LLEN bull:jobs:wait` > 10 000 | Add worker containers or raise each worker's `--concurrency`. |
| Memory keeps growing even after workflows finish | `docker exec <container> node -e "console.log(process.memoryUsage())"` shows a steady rise | Switch from SQLite to PostgreSQL + Redis and enable `EXECUTIONS_MODE=queue`. |

Bottom line: If the first 5–10 minutes are snappy but latency climbs thereafter, the culprit is usually resource saturation (CPU, memory, DB connections) or missing queue infrastructure. Apply the appropriate fix from the sections below and watch the metrics for another 10 minutes to confirm the trend reverses.
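
Those threshold checks are easy to script. Below is a small POSIX‑shell helper (hypothetical, not part of n8n) that classifies a reading taken from `docker stats` against the thresholds above:

```shell
#!/bin/sh
# Hypothetical helper: compare a CPU/memory reading against a threshold.
# Usage: check_metric NAME VALUE THRESHOLD  (VALUE may include a % sign)
check_metric() {
  name=$1; value=$2; threshold=$3
  v=${value%\%}    # strip trailing %
  v=${v%.*}        # truncate decimals so we can compare as integers
  if [ "$v" -gt "$threshold" ]; then
    echo "WARN: $name at ${v}% (threshold ${threshold}%)"
  else
    echo "OK: $name at ${v}%"
  fi
}

# Example readings, e.g. taken from `docker stats --no-stream`:
check_metric "CPU" "85.3%" 80     # exceeds the 80 % CPU threshold
check_metric "Memory" "62%" 75    # within the 75 % memory threshold
```

Wire this into a cron job or a monitoring sidecar to catch saturation before latency climbs.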


1. n8n’s Execution Model and Why It Matters Under Load


| Component | Role | Default Setting |
|---|---|---|
| Workflow engine | Parses and executes nodes | `EXECUTIONS_MODE=regular` (single process) |
| Database | Stores workflow definitions and execution data | SQLite (`DB_TYPE=sqlite`) |
| Queue (BullMQ) | Optional job queue for async execution | Disabled by default |
| Redis | Backend for BullMQ and caching | Not required |

Expert note: Running n8n with SQLite and no queue is acceptable only for development or low‑traffic demos. Production workloads should always use PostgreSQL (or MySQL) plus Redis so that workflow executions are isolated from one another.


2. Common Bottlenecks That Appear Only Under Continuous Load


| Bottleneck | Symptom | Root Cause | Fix |
|---|---|---|---|
| DB connection exhaustion | "Error: too many clients" from PostgreSQL | `max_connections` too low, or n8n opening a new connection per workflow | Use a connection pooler (pgbouncer) or raise the PostgreSQL `max_connections` setting. |
| Event‑loop blocking | CPU spikes, rising latency, "blocked for X ms" in logs | Heavy JavaScript (e.g., large JSON transforms) running in the main process | Move to queue mode (`EXECUTIONS_MODE=queue`) and run multiple dedicated `n8n worker` processes. |
| Redis queue saturation | BullMQ warnings: "Job stalled" | Jobs arriving faster than workers drain them, or the Redis memory limit is reached | Add workers, raise per‑worker `--concurrency`, set a Redis eviction policy (`volatile-lru`), or shard queues across Redis instances. |
| Large payloads | Memory climbs, OOM kills | Nodes that fetch big files keep the data in RAM | Stream large responses instead of buffering them, and cap inbound size via `N8N_PAYLOAD_SIZE_MAX`. |
| Memory leaks in custom code | Memory never releases after a workflow finishes | Custom JavaScript nodes retaining references | Refactor to avoid long‑lived closures; in development, run Node with `--expose-gc` and call `global.gc()` to confirm the leak. |
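
For the connection‑exhaustion row, the usual production remedy is to put pgbouncer between n8n and PostgreSQL. A minimal `pgbouncer.ini` sketch follows; the hostnames, auth file path, and pool sizes are illustrative, and you would point `DB_POSTGRESDB_HOST`/`DB_POSTGRESDB_PORT` at the pooler (port 6432 here) instead of Postgres directly:

```ini
[databases]
; route the n8n database through the pooler (host/port are examples)
n8n = host=db port=5432 dbname=n8n

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling keeps the server-side connection count low
pool_mode = transaction
max_client_conn = 500
default_pool_size = 20
```

If you see prepared‑statement or session‑state errors after switching, fall back to `pool_mode = session`, which is safer for ORMs at the cost of fewer pooled connections.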

3. Step‑by‑Step Diagnostic Checklist

| Step | Command / Action | Expected Result |
|---|---|---|
| 1. Verify execution mode | `docker exec n8n printenv EXECUTIONS_MODE` | `queue` (recommended) or `regular`. |
| 2. Check worker processes | `docker ps --filter "name=worker"` | At least one dedicated `n8n worker` container alongside the main instance. |
| 3. Inspect DB health | `psql -U $POSTGRES_USER -c "SELECT count(*) FROM pg_stat_activity;"` | Below `max_connections` (default 100). |
| 4. Monitor Redis queue depth | `redis-cli LLEN bull:jobs:wait` | < 5 000 (adjustable). |
| 5. Profile CPU/memory | `docker stats n8n` (or `htop` inside) | CPU < 70 %, memory < 70 % of limit. |
| 6. Look for "blocked" logs | `docker logs n8n \| grep "blocked"` | No recent entries. |
| 7. Validate container limits | `docker inspect n8n --format '{{.HostConfig.Memory}}'` | At least 2 GiB (value is reported in bytes) for moderate load. |
| 8. Test a high‑frequency workflow | Create a "ping" workflow that runs every 5 s for 10 min. | Execution time stays below ~200 ms. |
| 9. Review error rates | `docker logs n8n \| grep -i error` | < 1 % of total executions. |
| 10. Enable Prometheus metrics | Set `N8N_METRICS=true` and scrape `/metrics`. | Metrics visible in Grafana. |
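
Step 9's error‑rate check can be made quantitative. The sketch below computes the rate from a captured log file; the sample lines are made up, and in production you would feed it real output from `docker logs --since 1h n8n`:

```shell
#!/bin/sh
# Compute an approximate error rate from an n8n log capture.
# In production: docker logs --since 1h n8n > n8n.log 2>&1
# Here, a few hypothetical log lines stand in for real output.
cat > n8n.log <<'EOF'
Workflow 12 execution finished
Workflow 13 execution finished
ERROR: Workflow 14 execution failed
Workflow 15 execution finished
EOF

total=$(grep -c "execution" n8n.log)     # lines mentioning an execution
errors=$(grep -ci "error" n8n.log)       # case-insensitive error lines
awk -v e="$errors" -v t="$total" \
  'BEGIN { printf "error rate: %.1f%% (%d of %d executions)\n", 100 * e / t, e, t }'
rm -f n8n.log
```

Adjust the two `grep` patterns to match your actual log format before trusting the numbers.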

4. Optimizing n8n for Sustained Load

4.1 Docker‑Compose Example (PostgreSQL + Redis)

Core services declaration

version: "3.8"
services:
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: n8n

Database configuration

      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: n8n
    volumes:
      - db_data:/var/lib/postgresql/data
    restart: unless-stopped

Redis with memory limits

  redis:
    image: redis:7-alpine
    command: ["redis-server", "--maxmemory", "2gb", "--maxmemory-policy", "volatile-lru"]
    restart: unless-stopped

n8n container with queue and scaling

  n8n:
    image: n8nio/n8n:latest
    ports:
      - "5678:5678"
    environment:
      DB_TYPE: postgresdb
      DB_POSTGRESDB_HOST: db
      DB_POSTGRESDB_PORT: 5432
      DB_POSTGRESDB_DATABASE: n8n
      DB_POSTGRESDB_USER: n8n
      DB_POSTGRESDB_PASSWORD: ${POSTGRES_PASSWORD}
      EXECUTIONS_MODE: queue
      QUEUE_BULL_REDIS_HOST: redis
      QUEUE_BULL_REDIS_PORT: 6379
      N8N_PAYLOAD_SIZE_MAX: 16      # Max inbound payload in MB
      N8N_METRICS: "true"
      # In queue mode, jobs are executed by separate containers
      # running the `n8n worker` command, not by this instance.
    depends_on:
      - db
      - redis
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "2"
          memory: 4g

Volume definition

volumes:
  db_data:
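
With `EXECUTIONS_MODE=queue`, the main container only enqueues jobs; dedicated workers drain them. A sketch of an additional Compose service for the stack above (the `--concurrency` value is a starting point to tune after a load test, not a measured recommendation):

```yaml
  n8n-worker:
    image: n8nio/n8n:latest
    command: worker --concurrency=10   # dedicated queue consumer
    environment:
      DB_TYPE: postgresdb
      DB_POSTGRESDB_HOST: db
      DB_POSTGRESDB_PORT: 5432
      DB_POSTGRESDB_DATABASE: n8n
      DB_POSTGRESDB_USER: n8n
      DB_POSTGRESDB_PASSWORD: ${POSTGRES_PASSWORD}
      EXECUTIONS_MODE: queue
      QUEUE_BULL_REDIS_HOST: redis
      QUEUE_BULL_REDIS_PORT: 6379
    depends_on:
      - db
      - redis
    restart: unless-stopped
```

Scale workers horizontally with `docker compose up -d --scale n8n-worker=4`.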

Expert note: Never run n8n with DB_TYPE=sqlite in a container that restarts automatically; SQLite files can become corrupted during abrupt shutdowns under load.


4.2 Kubernetes Deployment (Helm‑style)

Deployment skeleton with replica count

apiVersion: apps/v1
kind: Deployment
metadata:
  name: n8n
spec:
  replicas: 2                     # Horizontal scaling
  selector:
    matchLabels:
      app: n8n

Pod template and container env vars (part 1)

  template:
    metadata:
      labels:
        app: n8n
    spec:
      containers:
        - name: n8n
          image: n8nio/n8n:latest
          env:
            - name: DB_TYPE
              value: "postgresdb"
            - name: DB_POSTGRESDB_HOST
              value: "postgres.default.svc.cluster.local"

Pod template and container env vars (part 2)

            - name: EXECUTIONS_MODE
              value: "queue"
            - name: QUEUE_BULL_REDIS_HOST
              value: "redis-master.default.svc.cluster.local"
            - name: N8N_METRICS
              value: "true"
            # Queue workers belong in a separate Deployment whose
            # container runs `n8n worker --concurrency=10`.

Resource limits and port

          resources:
            limits:
              cpu: "2000m"
              memory: "4Gi"
            requests:
              cpu: "500m"
              memory: "1Gi"
          ports:
            - containerPort: 5678

4.3 Tuning Individual Environment Variables

| Variable / Setting | Recommended Value (Continuous Load) | What It Controls |
|---|---|---|
| `EXECUTIONS_MODE` | `queue` | Switches from in‑process execution to the BullMQ queue. |
| `n8n worker --concurrency` | 5–10 per worker | How many jobs each dedicated worker processes in parallel. |
| Worker replicas | 2–8 (depending on CPU cores) | Number of parallel queue worker containers. |
| `N8N_PAYLOAD_SIZE_MAX` | 5–20 (MB) | Caps inbound data to protect RAM. |
| PostgreSQL `max_connections` | 200 (or a pgbouncer pool) | Prevents "too many connections" errors. |
| Redis `maxmemory` | 2gb | Stops Redis from swapping; pair with an eviction policy such as `volatile-lru`. |
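
Before restarting, it is worth verifying that the queue‑mode variables are actually set. A small sketch that checks an env file for the required keys (the file contents and key list here are illustrative; point the loop at your real `.env`):

```shell
#!/bin/sh
# Sanity-check that an env file defines the core queue-mode settings.
# The file written below is a sample standing in for a real .env.
cat > n8n.env <<'EOF'
EXECUTIONS_MODE=queue
DB_TYPE=postgresdb
QUEUE_BULL_REDIS_HOST=redis
QUEUE_BULL_REDIS_PORT=6379
EOF

missing=0
for key in EXECUTIONS_MODE DB_TYPE QUEUE_BULL_REDIS_HOST; do
  grep -q "^${key}=" n8n.env || { echo "missing: $key"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "queue-mode settings look complete"
rm -f n8n.env
```

Extend the key list with whatever your deployment requires (TLS settings, passwords, payload limits).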

5. Monitoring, Alerting & Observability

5.1 Prometheus Scrape Config

scrape_configs:
  - job_name: 'n8n'
    static_configs:
      - targets: ['n8n:5678']
    metrics_path: /metrics
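
Scraping alone does not alert. A minimal Prometheus alerting‑rule sketch for the queue‑depth threshold (the `n8n_queue_length` metric name follows the table below; confirm it against your instance's `/metrics` output before relying on it):

```yaml
groups:
  - name: n8n-load
    rules:
      - alert: N8nQueueBacklog
        expr: n8n_queue_length > 10000   # verify the metric name on /metrics
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "n8n queue backlog above 10k jobs for 5 minutes"
```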

Key metrics to watch:

| Metric | Alert Threshold | Meaning |
|---|---|---|
| `process_cpu_seconds_total` | > 80 % of allocated CPU for 5 min | CPU saturation. |
| `process_resident_memory_bytes` | > 75 % of memory limit | Memory pressure, OOM risk. |
| `n8n_queue_length` | > 10 000 | Queue backlog; consider scaling workers. |
| `n8n_workflow_execution_duration_seconds_bucket{le="0.5"}` | < 70 % of executions in ≤ 0.5 s | Healthy latency. |
| `postgres_connections` | > 80 % of `max_connections` | DB connection pool near exhaustion. |

5.2 Health‑Check Endpoint

curl -s http://localhost:5678/healthz | jq .

Expected JSON:

{
  "status": "ok"
}

The endpoint reports only overall liveness; verify the database and Redis separately (e.g. with `pg_isready` and `redis-cli PING`).

Configure your load balancer or Kubernetes liveness probe to call this endpoint every 30 s.
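
For Kubernetes, the equivalent liveness probe on the n8n container might look like this (intervals mirror the 30 s cadence above; tune the delays to your startup time):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 5678
  initialDelaySeconds: 30   # give n8n time to connect to Postgres/Redis
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3       # restart after ~90 s of failed checks
```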


6. Production‑Grade Fixes (Expert Advice)

| Issue | Why the Naïve Fix Fails | Production‑Ready Remedy |
|---|---|---|
| Using SQLite | File‑level locking blocks concurrent writes, causing latency spikes. | Migrate to PostgreSQL (or MySQL) **before** traffic exceeds ~10 rps. |
| Running everything in one process | All workflows share a single Node.js event loop, so one long job stalls the rest. | Enable **queue mode** and run **at least two** dedicated `n8n worker` processes. |
| No Redis | Queue mode requires a Redis backend; without one, executions stay in‑process and in‑flight work is lost on restart. | Deploy a dedicated Redis instance (ideally with persistence enabled). |
| Unlimited container resources | The container can consume all host RAM, so the OOM killer restarts n8n and state is lost. | Set **hard CPU/memory limits** (`deploy.resources.limits` in Compose). |
| Ignoring back‑pressure | A high inbound webhook rate floods the queue, producing "Job stalled" warnings. | **Rate‑limit** incoming webhooks at your proxy or load balancer, and scale workers with queue depth. |
| Unauthenticated Redis | An open Redis instance is a security hole in production. | Set a Redis password (`QUEUE_BULL_REDIS_PASSWORD`) and enable TLS between n8n and Redis. |

Final Production Checklist

  • Database = PostgreSQL (or MySQL) with connection pool.
  • Queue = BullMQ backed by Redis.
  • Execution mode = queue, with dedicated `n8n worker` processes.
  • Worker count = CPU cores × 2 (adjust after load test).
  • Resource limits = CPU ≤ 2 cores, Memory ≤ 4 GiB per container.
  • Monitoring = Prometheus + Grafana dashboards.
  • Health‑checks = /healthz endpoint + liveness probes.
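
The worker‑count rule of thumb is easy to compute on the target host (a heuristic starting point only, to be adjusted after a load test):

```shell
#!/bin/sh
# Suggest an initial queue-worker count: CPU cores x 2.
CORES=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
WORKERS=$(( CORES * 2 ))
echo "cores: $CORES -> suggested workers: $WORKERS"
```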

Conclusion

The “fast‑then‑slow” symptom almost always comes from a misaligned execution model (single process + SQLite) combined with resource exhaustion. Switch to queue mode, provision PostgreSQL and Redis, and enforce resource caps. After applying the checklist above, latency should stay flat even under continuous, high‑throughput load, giving you a stable automation platform for production workloads.
