Why Adding Workers Does Not Scale n8n: 4 Reasons

A step-by-step guide to diagnosing why adding more workers doesn't scale n8n.


Who this is for: engineers running n8n in production who need to scale beyond a single worker.

Quick diagnosis: Adding more n8n workers often doesn’t raise throughput because the bottleneck moves from CPU to shared resources (database, queue, file system, network). In production you’ll see the queue length climb while the CPU stays flat. The fix is to measure where the queue or DB stalls, tune concurrency limits, or switch to a dedicated queue (Redis) before adding workers. We cover this in detail in the n8n Architectural Failure Modes Guide.


1. n8n Worker Architecture: What Each Worker Actually Does

| Component | Single‑worker role | Multi‑worker role |
|---|---|---|
| Main Process | Loads workflow definitions, parses triggers, spawns the execution engine. | One instance per worker – each loads the same metadata from the DB. |
| Execution Engine | Executes nodes sequentially, writes intermediate results to the DB. | Runs in parallel across workers but shares the same DB connections and file storage. |
| Queue (default: SQLite) | Holds pending executions when the engine is busy. | All workers pull from the same SQLite file → lock contention. |
| Database (PostgreSQL/MySQL) | Stores workflow definitions, execution data, credentials. | Each worker opens its own pool, quickly hitting the DB's max_connections limit. |
| File System (local / S3) | Stores binary data, logs, and temporary files. | Concurrent writes can saturate I/O, especially on shared volumes. |

Each worker ends up pulling the same rows from the DB, which can be surprising the first time you scale.

Note: n8n’s default SQLite queue isn’t built for multi‑process contention. Switch to Redis (or RabbitMQ) before adding workers; otherwise you’ll see “database locked” errors and stalled executions.


2. Why Linear Worker Scaling Is a Myth

Adding workers sounds simple, but several hidden limits appear once you go beyond a couple of processes.

  1. Lock‑based queues – SQLite uses file‑level locks; each extra worker spends more time waiting than doing work.
  2. DB connection saturation – PostgreSQL default max_connections = 100. Ten workers × 10 concurrent executions = 100 connections → the DB starts rejecting new connections.
  3. CPU vs. I/O trade‑off – Workflows that read/write large payloads become I/O‑bound; extra CPU cores stay idle.
  4. Shared credential store – Credentials are cached per process; duplicate caching wastes memory and can cause race conditions when rotating secrets.

Result: After ~3‑4 workers you typically hit a plateau where latency stops improving and may even degrade.
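The connection arithmetic in point 2 is worth making explicit. Here is a minimal sketch; the function name and the assumption of one DB connection per concurrent execution are illustrative, not n8n internals:

```typescript
// Rough capacity check: will a worker fleet exhaust Postgres connections?
// Assumes roughly one DB connection per concurrent execution.
function connectionsNeeded(workers: number, concurrencyPerWorker: number): number {
  return workers * concurrencyPerWorker;
}

const maxConnections = 100; // PostgreSQL default
const needed = connectionsNeeded(10, 10); // 10 workers × 10 concurrent executions
console.log(`${needed}/${maxConnections}`, needed >= maxConnections ? "saturated" : "headroom");
```

Run this with your own worker count and concurrency before scaling out; if the result approaches max_connections, fix the database first.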

3. Identifying the Real Bottleneck – Metrics & Monitoring


3.1 Key Prometheus / Grafana Metrics

| Metric | Interpretation | Alert threshold |
|---|---|---|
| n8n_worker_active_executions | Executions currently running on a worker. | > 80 % of worker_concurrency |
| n8n_queue_length | Items waiting in the queue (Redis or SQLite). | > 500 |
| postgresql_connections | Active DB connections. | > 80 % of max_connections |
| nodejs_event_loop_lag_seconds | Event‑loop health per worker. | > 0.1 s |
| disk_io_write_bytes_total | Write throughput on the shared volume. | > 80 % of disk bandwidth |

Tip: If nodejs_event_loop_lag_seconds spikes while n8n_queue_length stays low, the worker is CPU‑bound. If the queue length climbs but CPU is idle, the bottleneck is the queue or DB.

These numbers give you a quick health check; if they’re all green you’re probably not hitting a hard limit.
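The thresholds above can be wired into Prometheus directly. A minimal alerting-rule sketch – the group and alert names are placeholders, and the `for:` durations are a judgment call for your traffic pattern:

```yaml
groups:
  - name: n8n-scaling        # placeholder group name
    rules:
      - alert: N8nQueueBacklog
        expr: n8n_queue_length > 500
        for: 5m              # sustained backlog, not a momentary spike
      - alert: N8nEventLoopLag
        expr: nodejs_event_loop_lag_seconds > 0.1
        for: 2m
```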

3.2 Example docker‑compose.yml – Enabling Prometheus Exporter

First, define the n8n service:

services:
  n8n:
    image: n8nio/n8n:latest
    ports:
      - "5678:5678"

Then, nested under the n8n service, add the environment variables that turn on metrics and point to PostgreSQL + Redis:

environment:
  - DB_TYPE=postgresdb
  - DB_POSTGRESDB_HOST=postgres
  - DB_POSTGRESDB_PORT=5432
  - DB_POSTGRESDB_DATABASE=n8n
  - DB_POSTGRESDB_USER=n8n
  - DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD}
  - EXECUTIONS_MODE=queue
  - QUEUE_BULL_REDIS_HOST=redis
  - N8N_METRICS=true   # enable Prometheus metrics
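Once the container is up, you can spot-check the exporter from the host (this assumes the 5678 port mapping shown above):

```shell
# Should print the n8n_* gauges if the exporter is enabled
curl -s http://localhost:5678/metrics | grep '^n8n_'
```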

4. Practical Configuration Tweaks That Actually Scale


4.1 Switch to a Distributed Queue (Redis)

Add the following lines to your .env file (or Docker environment) to enable queue mode backed by Redis instead of the SQLite-based default:

# .env
EXECUTIONS_MODE=queue
QUEUE_BULL_REDIS_HOST=redis
QUEUE_BULL_REDIS_PORT=6379
QUEUE_BULL_REDIS_PASSWORD=${REDIS_PASSWORD}

Why it works: Redis uses in‑memory lists with atomic LPUSH/RPOP, eliminating file‑level locks.
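A quick illustration of those primitives with redis-cli – the key name demo:queue is made up for the demo; n8n's actual queue keys are managed internally by Bull:

```shell
redis-cli LPUSH demo:queue job1 job2   # producers push atomically
redis-cli RPOP demo:queue              # a consumer pops exactly one job,
                                       # with no file-level locks involved
```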

4.2 Tune Worker Concurrency

| Setting | Default | Recommended for a 4‑worker cluster |
|---|---|---|
| EXECUTIONS_PROCESS_TIMEOUT | 3600 s | Keep unchanged – only affects runaway executions. |
| WORKER_CONCURRENCY | 5 | 10 per worker → 40 concurrent executions total, provided the DB can handle 40 connections. |
| DB_MAX_CONNECTIONS (Postgres) | 100 | 200 (increase max_connections in postgresql.conf). |

Update the Docker Compose override: set the concurrency on each worker, and raise the connection limit on the Postgres service itself – max_connections is a server‑side setting, not an n8n environment variable (service names here follow your compose file):

services:
  n8n-worker:
    environment:
      - WORKER_CONCURRENCY=10
  postgres:
    command: postgres -c max_connections=200

When you first hit the DB connection ceiling, raising max_connections is usually faster than chasing obscure Node.js connection errors.
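If you prefer not to hand-edit postgresql.conf, ALTER SYSTEM persists the change for you (max_connections still requires a server restart to take effect):

```sql
SHOW max_connections;                     -- inspect the current limit
ALTER SYSTEM SET max_connections = 200;   -- written to postgresql.auto.conf
-- restart PostgreSQL for the new limit to apply
```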

4.3 Increase Node‑Postgres Connection Pool

Adjust the pool size in the source file that creates the PostgreSQL client:

// src/databases/postgres.ts
import { Pool } from 'pg';
export const pool = new Pool({
  max: parseInt(process.env.DB_POOL_MAX ?? '20'), // raise from default 10
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000,
});

Note: Setting max too high can thrash the DB; keep the total connections under 80 % of max_connections and monitor pg_stat_activity.
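To watch the headroom the note describes, a query like this (run in psql) groups live connections by state:

```sql
-- Keep the total under ~80 % of max_connections
SELECT state, count(*) AS connections
FROM pg_stat_activity
GROUP BY state
ORDER BY connections DESC;
```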

4.4 Use Separate Volumes for Large Payloads

Mount a dedicated SSD volume for binary data to avoid I/O contention:

services:
  n8n:
    volumes:
      - n8n-data:/home/node/.n8n
      - n8n-files:/files   # dedicated volume for large files
volumes:
  n8n-data:
  n8n-files:

Most teams run into this after a few weeks, not on day one.


5. Production‑Grade Checklist for Scaling n8n Workers

| Item | Why it matters | How to verify |
|---|---|---|
| Distributed queue (Redis) | Removes SQLite lock contention. | redis-cli LLEN n8n_queue returns a non‑zero length while workers are idle. |
| DB connection pool < 80 % of max | Prevents "too many connections" errors. | SELECT count(*) FROM pg_stat_activity; < 0.8 × max_connections. |
| Worker concurrency tuned per CPU core | Aligns CPU usage with workload. | top shows < 90 % CPU on each worker under load. |
| Separate I/O volume for binary data | Avoids disk‑I/O saturation. | iostat -x shows < 70 % utilization on the volume. |
| Prometheus alerts for queue length & event‑loop lag | Early detection of scaling limits. | Grafana dashboard fires alerts when thresholds are breached. |
| Graceful shutdown script | Prevents orphaned executions during deploys. | docker stop n8n waits for n8n_worker_active_executions to drop to 0. |
| Credential rotation policy | Avoids stale secrets across many workers. | CI pipeline rotates secrets every 30 days; workers reload on restart. |
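The graceful-shutdown item can be sketched as a small drain loop. This assumes N8N_METRICS=true and a worker exposing /metrics on port 5678; the container name n8n is illustrative, so adjust to your setup:

```shell
#!/bin/sh
# Wait until the worker reports zero active executions, then stop it.
while true; do
  active=$(curl -s http://localhost:5678/metrics \
    | awk '/^n8n_worker_active_executions/ { print $2; exit }')
  [ "${active:-0}" = "0" ] && break
  sleep 5
done
docker stop n8n
```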

6. When Adding Workers Still Makes Sense: Edge Cases & Recommendations

| Scenario | Recommended worker count | Additional steps |
|---|---|---|
| CPU‑intensive custom code nodes (e.g., image processing) | 8–12 workers on a 32‑core host | Use Docker --cpus limits per container to avoid oversubscription. |
| Burst traffic spikes (short‑lived bursts) | Temporary scaling via Docker Swarm/K8s replicas | Combine with an autoscaling policy that also checks Redis queue length. |
| Stateless webhook listeners | 1 worker per 2 CPU cores + a dedicated Nginx reverse proxy | Offload TLS termination and rate‑limiting to Nginx so workers stay focused on execution. |
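The --cpus cap from the first scenario translates to a per-service limit in Compose (the service name n8n-worker is illustrative):

```yaml
services:
  n8n-worker:
    image: n8nio/n8n:latest
    cpus: "4"   # hard cap so 8–12 workers cannot oversubscribe a 32-core host
```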

Bottom line: Adding workers is only effective when the underlying queue, database, and I/O layers are already horizontally scalable.


Conclusion

Scaling n8n isn’t solved by simply launching more workers. The true limits lie in the shared queue, database connections, and I/O paths. By switching to a distributed queue (Redis), tuning worker concurrency, enlarging DB connection pools, and isolating file‑system I/O, you turn additional CPU cores into real throughput gains. Follow the production checklist, monitor the key metrics, and only add workers after the supporting layers are proven to scale. This approach delivers reliable, linear performance improvements in real‑world deployments.
