n8n performance drops after scaling horizontally – why more workers isn’t enough

A step-by-step guide to diagnosing and fixing n8n performance drops after horizontal scaling


Who this is for: Site reliability engineers, platform engineers, or DevOps teams that have already deployed n8n in a Kubernetes (or similar) environment and are now adding pods to increase capacity. We cover this in detail in the n8n Performance Degradation & Stability Issues Guide.


Quick Diagnosis & Fix

| Symptom | Likely Root Cause | One‑line Remedy |
|---|---|---|
| CPU spikes on every node, overall throughput ↓ | Shared DB bottleneck (SQLite or single‑instance Postgres) | Move to a dedicated, HA PostgreSQL cluster and enable connection pooling (pgbouncer). |
| Random workflow failures, “execution timed out” | Stateless‑vs‑stateful mismatch (workflows rely on in‑memory state) | Use Redis‑backed queue mode: set EXECUTIONS_MODE=queue and point all instances at the same Redis. |
| Load balancer returns 502/504 under load | Webhook routing mis‑config (callbacks land on the wrong pod) | Run dedicated webhook processors (n8n webhook) or configure sticky sessions (session affinity) on the LB. |
| Memory usage climbs on each replica, OOM kills | Workflow memory leak (large payloads kept in process) | Off‑load binary data to external storage (N8N_DEFAULT_BINARY_DATA_MODE) and set EXECUTIONS_TIMEOUT. |
| Overall latency ↑ despite more pods | Insufficient pod resources / CPU throttling | Raise resources.requests.cpu and resources.limits.cpu in the pod spec; monitor with kubectl top pod. |

Quick test – After applying the appropriate remedy, re‑run a load test (e.g., hey -c 200 -n 10000 http://<lb>/webhook). Latency should drop and the error rate should stay below 1 %.
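That 1 % error budget can be checked mechanically against the totals in the hey summary. A minimal sketch (error_rate_ok is a hypothetical name, not part of hey or n8n):

```shell
# Hypothetical helper: pass when errors stay below 1 % of total requests.
error_rate_ok() {
  local total="$1" errors="$2"
  # integer math: errors/total < 1/100  <=>  errors*100 < total
  [ $(( errors * 100 )) -lt "$total" ]
}

error_rate_ok 10000 50 && echo "PASS" || echo "FAIL"   # → PASS
```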


1. Horizontal Scaling Basics

If your n8n throughput has plateaued and adding workers no longer helps, resolve the issues below before continuing with the setup.

| Component | Single‑Node Default | Change Required for Horizontal Scale |
|---|---|---|
| Workflow engine | In‑process Node.js loop | Must be stateless – any pod can pick up any execution. |
| Queue | In‑memory fallback | Switch to a distributed queue (Redis, the backend n8n’s queue mode supports). |
| Database | SQLite file or single Postgres | Use an HA PostgreSQL cluster with connection pooling. |
| Cache | Process memory | External cache (Redis) for shared state (credentials, webhook IDs). |
| Load balancer | Direct to sole pod | Distribute HTTP requests and preserve session affinity for webhook callbacks. |

EEFA Note: n8n was originally built for “single‑node dev” use‑cases. Scaling without converting these components creates hidden contention points that manifest as performance drops.


2. Common Bottlenecks After Scaling

2.1 Database Saturation

Symptom: query latency > 200 ms, frequent “deadlock detected” errors.
Why it happens: all workflow metadata (logs, credentials, definitions) is written to the same database, so adding pods multiplies concurrent writes.
Fix checklist:
• Switch to a dedicated PostgreSQL cluster (e.g., AWS RDS Aurora).
• Enable connection pooling (pgbouncer).
• Tune max_connections to roughly 2 × (CPU cores × pods).
• (Optional) Add read replicas for UI queries.
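The 2 × (CPU cores × pods) rule of thumb is easy to compute up front; the helper name below is purely illustrative:

```shell
# Illustrative helper for the max_connections rule of thumb:
# max_connections ≈ 2 x (CPU cores per pod x number of pods)
suggest_max_connections() {
  local cores_per_pod="$1" pods="$2"
  echo $(( 2 * cores_per_pod * pods ))
}

# e.g. 2-core pods, 8 replicas:
suggest_max_connections 2 8   # prints 32
```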

PostgreSQL Upgrade Example

# Set environment variables for an external HA Postgres
export DB_TYPE=postgresdb
export DB_POSTGRESDB_HOST=<aurora-endpoint>
export DB_POSTGRESDB_PORT=5432
export DB_POSTGRESDB_USER=<user>
export DB_POSTGRESDB_PASSWORD=<password>

Enable pgbouncer Pooler

helm install pgbouncer bitnami/pgbouncer \
  --set postgresql.host=<aurora-endpoint> \
  --set postgresql.port=5432
# Point n8n to the pooler
export DB_POSTGRESDB_HOST=pgbouncer.default.svc.cluster.local

2.2 Queue Mis‑configuration

Symptom: “Queue is full” errors, workflow stalls.
Root cause: the default in‑memory queue cannot be shared; each pod maintains its own queue, causing duplicate work and lost jobs.
Remedy: deploy a distributed Redis queue and run n8n in queue mode (EXECUTIONS_MODE=queue).

Deploy Redis (replicated)

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install redis bitnami/redis --set architecture=replication

Configure n8n to use Redis (queue mode)

# values.yaml – queue mode
env:
  - name: EXECUTIONS_MODE
    value: "queue"
  - name: QUEUE_BULL_REDIS_HOST
    value: "redis-master.default.svc.cluster.local"
  - name: QUEUE_BULL_REDIS_PORT
    value: "6379"
  - name: QUEUE_BULL_REDIS_DB
    value: "0"

EEFA Warning: Enable Redis persistence (appendonly yes) and ACLs; otherwise a single node failure can lose queued executions.

2.3 Sticky Sessions & Webhook Routing

Symptom: webhook callbacks hit a different pod → 404 or duplicate execution.
Cause: the load balancer does not preserve client affinity.
Fix options:
  1. Enable sticky sessions on the LB.
  2. Make webhooks stateless by running queue mode with dedicated webhook processors (n8n webhook).
  3. Use n8n’s built‑in webhook tunnel only for dev.

Option A – Enable Sticky Sessions (K8s Service)

apiVersion: v1
kind: Service
metadata:
  name: n8n
spec:
  selector:
    app: n8n
  ports:
  - protocol: TCP
    port: 80
    targetPort: 5678
  sessionAffinity: ClientIP   # <-- enable sticky sessions

Option B – Stateless Webhooks (queue mode)

# In queue mode, dedicated webhook processors let any replica serve callbacks
n8n webhook
# Make sure every instance advertises the load-balancer address:
export WEBHOOK_URL=https://<lb>/

2.4 Memory Leaks in Long‑Running Workflows

Symptom: pods OOM‑kill after 30–60 min under load.
Cause: large payloads (files, JSON) stay in the Node.js process until the workflow finishes.
Mitigation:
• Off‑load binary data to the filesystem or S3 (N8N_DEFAULT_BINARY_DATA_MODE).
• Limit incoming payload size (N8N_PAYLOAD_SIZE_MAX, in MB).
• Run heavy workflows on dedicated workers (n8n worker).
• Set a hard execution timeout (EXECUTIONS_TIMEOUT).

Example Settings

export N8N_DEFAULT_BINARY_DATA_MODE=filesystem
export N8N_PAYLOAD_SIZE_MAX=16         # MB
export EXECUTIONS_TIMEOUT=300          # seconds
# Run executions on dedicated workers:
n8n worker --concurrency=5

2.5 CPU Throttling & Pod Resource Limits

| Desired Throughput | CPU Request | CPU Limit | Pods |
|---|---|---|---|
| 100 req/s | 500m | 1 | 3 |
| 250 req/s | 1 | 2 | 5 |
| 500+ req/s | 2 | 4 | 8+ |
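These numbers are starting points; once you have measured per‑pod throughput from your own load tests, a ceiling‑division sketch like the following gives a first replica estimate (pods_needed is a hypothetical helper; the 3‑pod floor is an assumption matching the table above):

```shell
# Hypothetical sizing helper: ceil(target / per_pod), floored at min_pods.
pods_needed() {
  local target="$1" per_pod="$2" min_pods="${3:-3}"
  local n=$(( (target + per_pod - 1) / per_pod ))   # ceiling division
  if [ "$n" -lt "$min_pods" ]; then n="$min_pods"; fi
  echo "$n"
}

pods_needed 500 60    # ceil(500/60) = 9 pods
pods_needed 100 60    # 2, raised to the 3-pod floor
```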

EEFA Tip: Use the Horizontal Pod Autoscaler (HPA) with a custom metric (n8n_executions_per_second) instead of only CPU.

Resource & Autoscaling Manifest

resources:
  requests:
    cpu: "1"
    memory: "1Gi"
  limits:
    cpu: "2"
    memory: "2Gi"
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 12
  targetCPUUtilizationPercentage: 70
  metrics:
  - type: Pods
    pods:
      metric:
        name: n8n_executions_per_second
      target:
        type: AverageValue
        averageValue: "200"

3. Step‑by‑Step Stabilization Guide

3.1 Audit Your Current Stack

# Identify DB type
echo $DB_TYPE          # should be "postgresdb" in production

# Verify execution mode
echo $EXECUTIONS_MODE  # should be "queue", not the default "regular"

# Check LB session affinity
kubectl get svc n8n -o yaml | grep sessionAffinity

If any command returns a default (sqlite, regular, None), you are in a bottleneck zone.
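The environment checks can be bundled into a small audit function. This is a sketch using the DB_TYPE / EXECUTIONS_MODE variable names from recent n8n releases, with the kubectl check omitted so it runs anywhere:

```shell
# Sketch: flag default (single-node) settings that bottleneck a scaled deployment.
audit_n8n_env() {
  local problems=0
  if [ "${DB_TYPE:-sqlite}" != "postgresdb" ]; then
    echo "WARN: DB_TYPE=${DB_TYPE:-sqlite} - switch to postgresdb"
    problems=1
  fi
  if [ "${EXECUTIONS_MODE:-regular}" != "queue" ]; then
    echo "WARN: EXECUTIONS_MODE=${EXECUTIONS_MODE:-regular} - switch to queue"
    problems=1
  fi
  return "$problems"
}

DB_TYPE=postgresdb EXECUTIONS_MODE=queue audit_n8n_env && echo "audit OK"
```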

3.2 Deploy a Distributed Queue (Redis)

  1. Install Redis (see 2.2).
  2. Add env vars (see 2.2 snippets).
  3. Restart n8n pods:
    kubectl rollout restart deployment n8n
    

3.3 Upgrade to HA PostgreSQL

| Step | Action |
|---|---|
| Provision | Create an RDS/Aurora instance (or any HA Postgres). |
| Secret | kubectl create secret generic n8n-pg --from-literal=postgres_user=… |
| Env vars | Apply the variables shown in the “PostgreSQL Upgrade Example”. |
| Pooler | Deploy pgbouncer (see 2.1). |
| Point n8n | Set DB_POSTGRESDB_HOST to the pgbouncer service. |

3.4 Enforce Sticky Sessions or Stateless Webhooks

*Choose one approach that matches your ingress controller.*

  • Sticky Sessions – apply the Service manifest from 2.3 Option A.
  • Stateless Webhooks – run queue mode with dedicated webhook processors (see 2.3 Option B).

3.5 Tune Resources & Enable Autoscaling

  1. Update the deployment with the resources block (see 2.5).
  2. Enable the HPA using the autoscaling block (see 2.5).
  3. Apply changes:
    helm upgrade --install n8n . -f values.yaml
    

4. Troubleshooting Checklist

| Check | How to Verify | Expected |
|---|---|---|
| DB connection pool saturation | SELECT count(*) FROM pg_stat_activity WHERE state='active'; | < 80 % of max_connections |
| Redis queue depth | redis-cli LLEN bull:jobs:wait (key prefix may differ per setup) | < 500 (adjust per load) |
| Pod CPU throttling | kubectl top pod n8n-xxxx | CPU usage ≤ requests |
| Webhook delivery latency | kubectl logs n8n-xxxx \| grep webhook | < 200 ms |
| Memory usage per pod | kubectl exec n8n-xxxx -- free -m | RSS ≤ 1 GiB (or within limit) |
| LB health checks | curl -I http://<lb>/healthz | 200 OK consistently |
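The pool-saturation row, for instance, reduces to a simple percentage comparison. A hedged sketch (pool_ok is a hypothetical name; feed it the pg_stat_activity count and your max_connections):

```shell
# Sketch: true when active connections are under 80 % of max_connections.
pool_ok() {
  local active="$1" max_conns="$2"
  [ $(( active * 100 )) -lt $(( max_conns * 80 )) ]
}

pool_ok 50 100 && echo "pool healthy" || echo "pool saturated"   # → pool healthy
```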

If any metric exceeds the expected range, revisit the corresponding configuration block in Section 3.


5. Production‑Ready Best Practices (EEFA)

| Practice | Why it Matters | How to Implement |
|---|---|---|
| Separate execution processes (worker mode) | Isolates heavy workflows from the API server, preventing request‑latency spikes. | Run dedicated workers: n8n worker --concurrency=5 |
| Enable Prometheus metrics | Real‑time visibility into queue length, execution time, DB latency. | export N8N_METRICS=true and expose /metrics. |
| TLS termination at the LB | Removes per‑pod TLS overhead and secures webhook callbacks. | Configure the LB cert; set WEBHOOK_URL=https://<lb>/. |
| Rotate credentials regularly | Limits blast radius if a node is compromised. | Use N8N_ENCRYPTION_KEY and rotate via CI pipeline. |
| Graceful shutdown hooks | Guarantee in‑flight executions finish before pod termination, avoiding partial runs. | Rely on n8n’s SIGTERM handling: set N8N_GRACEFUL_SHUTDOWN_TIMEOUT and a matching terminationGracePeriodSeconds in the pod spec. |

Conclusion

Performance regressions after horizontal scaling almost always stem from five core causes: database saturation, an in‑memory queue, missing sticky sessions, memory‑leaky workflows, and CPU throttling. By auditing each component, migrating to distributed services (HA PostgreSQL, Redis), enforcing session affinity or stateless webhooks, tuning pod resources, and applying the EEFA best practices above, you can scale n8n horizontally without sacrificing latency or reliability. Validate each change with load testing and the troubleshooting checklist, then let the autoscaler handle traffic spikes while your underlying services stay robust. Happy automating!
