Who this is for: Ops engineers, DevOps teams, and platform architects who need a production‑grade n8n deployment that survives node failures, traffic spikes, and regional outages. This guide walks through each pattern of that production‑grade n8n architecture in detail.
Quick diagnosis
If your n8n instance drops workflows, returns 502 errors, or slows down after a traffic burst, you need an HA architecture that guarantees > 99.9 % uptime. In production this usually appears when a node restarts unexpectedly or a sudden wave of webhook calls hits the API. The patterns below eliminate single points of failure, auto‑recover from node loss, and keep workflows running without manual intervention.
1. Why n8n needs a dedicated HA blueprint
Decide between a single‑instance and a multi‑instance n8n topology before continuing: every pattern below assumes you are moving from a single node to a clustered deployment.
| Failure mode | Symptom | HA countermeasure |
|---|---|---|
| Single‑node crash | All workflows stop, UI 502 | Horizontal worker pool behind a load balancer |
| Database outage | “Connection refused” errors | Multi‑master or streaming‑replica cluster (Patroni) |
| File‑store loss | Missing uploaded files | Object storage (S3/GCS) or replicated NFS |
| Network partition | Workers can’t reach DB or webhooks | Health‑checked probes + auto‑failover |
| Regional disaster | Complete site outage after AZ failure | Multi‑region active‑passive deployment |
EEFA note – The most common source of production downtime is a stateful file store living on the same node as the workflow engine. Decouple it early to avoid data loss during node replacement.
2. Pattern A – Load‑balanced stateless workers
2.1 Architecture snapshot
```
Client → L7 Load Balancer → n8n Workers
                               ├─ Redis (optional)
                               └─ PostgreSQL (HA)
```
*Workers are stateless; they read workflow definitions from the DB and store temporary files in external object storage.*
The load balancer (NGINX, Traefik, or a cloud‑managed ALB) probes /healthz on each worker. This layout also enables zero‑downtime upgrades: roll workers one at a time behind the balancer and traffic never drops.
2.2 Docker‑Compose – split into focused services
Database service (PostgreSQL – single instance here; Pattern B covers real DB HA)
```yaml
services:
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: n8n
    volumes:
      - db-data:/var/lib/postgresql/data
    deploy:
      mode: replicated
      replicas: 1   # never scale a plain Postgres container to 3 replicas;
                    # that gives you three unrelated databases, not a cluster
```
Redis queue (optional)
```yaml
  redis:
    image: redis:7-alpine
    deploy:
      mode: replicated
      replicas: 1   # for Redis HA use Sentinel or a managed Redis;
                    # naive replicas would be two independent queues
```
n8n worker definition
```yaml
  n8n:
    image: n8nio/n8n:latest
    environment:
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=db
      - DB_POSTGRESDB_DATABASE=n8n
      - DB_POSTGRESDB_USER=postgres
      - DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD}
      - EXECUTIONS_MODE=queue          # distribute executions via Redis (Bull)
      - QUEUE_BULL_REDIS_HOST=redis
      - QUEUE_BULL_REDIS_PASSWORD=${REDIS_PASSWORD}
      - N8N_BASIC_AUTH_ACTIVE=true
      - N8N_BASIC_AUTH_USER=${ADMIN_USER}
      - N8N_BASIC_AUTH_PASSWORD=${ADMIN_PASS}
    ports:
      - "5678:5678"
    depends_on:
      - db
      - redis
    deploy:
      mode: replicated
      replicas: 4
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: on-failure
        max_attempts: 3
```
EEFA tweak – `EXECUTIONS_MODE=queue` hands every execution to the shared Redis queue, so workers pull jobs instead of racing for them. If a worker stalls, a pod restart is usually quicker than hunting a phantom lock.
Persistent volume for the DB
```yaml
volumes:
  db-data:
```
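To see why the shared queue removes the race, here is a minimal Python sketch: threads and an in‑process `queue.Queue` stand in for n8n workers and Redis (both substitutions are for illustration only), and each job is processed exactly once no matter how many workers pull from the queue.

```python
import queue
import threading

# In production the queue is Redis (Bull); queue.Queue stands in for it here.
jobs = queue.Queue()
results = []
lock = threading.Lock()

def worker(worker_id: int) -> None:
    """Pull jobs until the queue is drained; each job is handled exactly once."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        with lock:
            results.append((worker_id, job))
        jobs.task_done()

for j in range(20):          # enqueue 20 workflow executions
    jobs.put(j)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

processed = sorted(job for _, job in results)
print(processed == list(range(20)))  # True: every job handled exactly once
```

The same property holds with Redis: a job popped by one worker is invisible to the others, so adding replicas scales throughput without duplicate executions.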
2.3 Nginx health‑check configuration
```nginx
upstream n8n_cluster {
    # Passive health checks: two failures within 10 s eject a node.
    # Replace the hostnames with your actual worker addresses.
    server n8n-1:5678 max_fails=2 fail_timeout=10s;
    server n8n-2:5678 max_fails=2 fail_timeout=10s;
    server n8n-3:5678 max_fails=2 fail_timeout=10s;
}

server {
    listen 80;
    server_name n8n.example.com;

    location / {
        proxy_pass http://n8n_cluster;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    location /healthz {
        proxy_pass http://n8n_cluster/healthz;
    }
}
```
*The load balancer marks a node unhealthy after **two consecutive failures** within **10 s** (in NGINX terms, `max_fails=2 fail_timeout=10s`), then drops it from rotation automatically until the timeout expires.*
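The ejection behaviour is easy to model. The sketch below is an illustrative simulation of that passive‑check policy, not NGINX internals: a node is ejected after `max_fails` failures inside the `fail_timeout` window and re‑admitted once the timeout elapses.

```python
from dataclasses import dataclass

@dataclass
class UpstreamNode:
    """Passive health tracking: eject after `max_fails` failures within
    `fail_timeout` seconds, re-admit after the timeout expires."""
    max_fails: int = 2
    fail_timeout: float = 10.0
    _fails: int = 0
    _first_fail_at: float = 0.0
    _down_until: float = 0.0

    def healthy(self, now: float) -> bool:
        return now >= self._down_until

    def record(self, now: float, status: int) -> None:
        if status < 500:
            self._fails = 0          # any success resets the failure counter
            return
        if self._fails == 0 or now - self._first_fail_at > self.fail_timeout:
            self._fails, self._first_fail_at = 1, now   # start a new window
        else:
            self._fails += 1
        if self._fails >= self.max_fails:
            self._down_until = now + self.fail_timeout  # eject the node
            self._fails = 0

node = UpstreamNode()
node.record(0.0, 502)
node.record(1.0, 502)       # second 5xx within 10 s -> ejected until t = 11
print(node.healthy(5.0))    # False: still inside the fail_timeout window
print(node.healthy(12.0))   # True: re-admitted after the timeout
```

Two isolated 5xx responses spaced more than 10 s apart never trip the threshold, which is exactly why slow‑burn flakiness needs the Prometheus alerts in section 6 rather than the load balancer alone.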
3. Pattern B – PostgreSQL high‑availability (Patroni + etcd)
3.1 Why a single primary is a SPOF
Even with many workers, a lone PostgreSQL instance can bring the whole system down. Patroni handles automatic failover and usually promotes a replica in under 5 s. Verify data consistency (execution history, credentials) after every failover drill before continuing with the setup.
3.2 ConfigMap – Patroni configuration (focus: DCS & bootstrap)
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: patroni-config
data:
  patroni.yml: |
    scope: n8n
    namespace: /db/
    name: $(POD_NAME)
    restapi:
      listen: 0.0.0.0:8008
    etcd:
      hosts: etcd-0.etcd.svc:2379,etcd-1.etcd.svc:2379,etcd-2.etcd.svc:2379
    bootstrap:
      dcs:
        ttl: 30
        loop_wait: 10
        retry_timeout: 10
        maximum_lag_on_failover: 1048576
        postgresql:
          use_pg_rewind: true
          parameters:
            max_connections: 100
            shared_buffers: 256MB
      initdb:
        - encoding: UTF8
        - data-checksums
      pg_hba:
        - host all all 0.0.0.0/0 md5
```
*Patroni uses etcd for consensus; the rest of the file is mostly boilerplate.*
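Patroni's REST API (port 8008 above) exposes cluster state, and `GET /cluster` lists the members with their roles. A small sketch of finding the current leader from that response; the JSON payload below is a hand‑written sample in the documented shape, not captured from a live cluster:

```python
import json

# Hand-written example of a Patroni `GET :8008/cluster` response body.
payload = json.loads("""
{
  "members": [
    {"name": "patroni-0", "role": "leader",  "state": "running",   "host": "10.0.0.1"},
    {"name": "patroni-1", "role": "replica", "state": "streaming", "host": "10.0.0.2"},
    {"name": "patroni-2", "role": "replica", "state": "streaming", "host": "10.0.0.3"}
  ]
}
""")

def current_leader(cluster: dict):
    """Return the host of the running leader, or None mid-failover."""
    for member in cluster.get("members", []):
        if member.get("role") == "leader" and member.get("state") == "running":
            return member["host"]
    return None

print(current_leader(payload))  # 10.0.0.1
```

A `None` result during a drill is normal for a few seconds while Patroni promotes a replica; alert only if it persists past your failover budget.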
3.3 StatefulSet – three Patroni pods (focus: storage)
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: patroni
spec:
  serviceName: patroni
  replicas: 3
  selector:
    matchLabels:
      app: patroni
  template:
    metadata:
      labels:
        app: patroni
    spec:
      containers:
        - name: patroni
          image: patroni:latest   # use a Patroni-enabled PostgreSQL image, e.g. Zalando Spilo
          command: ["patroni", "/etc/patroni/patroni.yml"]
          env:
            - name: PATRONI_NAME          # overrides `name` in patroni.yml per pod
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          ports:
            - containerPort: 5432 # PostgreSQL
            - containerPort: 8008 # Patroni API
          volumeMounts:
            - name: patroni-config        # mount the ConfigMap as a file, not env vars
              mountPath: /etc/patroni
            - name: pgdata
              mountPath: /home/postgres/pgdata
      volumes:
        - name: patroni-config
          configMap:
            name: patroni-config
  volumeClaimTemplates:
    - metadata:
        name: pgdata
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
```
EEFA tip – Run etcd as a dedicated three‑node cluster (or via an operator) rather than co‑locating it with the database pods; a healthy, low‑latency quorum is what keeps Patroni's failover decisions fast and correct.
3.4 n8n connection string (no changes during failover)
DB_TYPE=postgresdb
DB_POSTGRESDB_HOST=patroni # virtual IP / service name
DB_POSTGRESDB_DATABASE=n8n
DB_POSTGRESDB_USER=postgres
DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD}
Patroni always routes the hostname to the current primary, so workers stay connected automatically.
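Even with a stable service name, workers can still see a few seconds of refused connections while a replica is promoted. A retry wrapper with exponential backoff rides out that window; this sketch simulates a short outage with a fake connector instead of opening real database connections (all names are illustrative):

```python
import time

def connect_with_retry(connect, retries: int = 6, base_delay: float = 0.5):
    """Call `connect()` until it succeeds, sleeping 0.5, 1, 2, ... seconds
    between attempts -- comfortable headroom for a < 5 s Patroni failover."""
    delay = base_delay
    for attempt in range(retries):
        try:
            return connect()
        except ConnectionError:
            if attempt == retries - 1:
                raise               # failover took too long; surface the error
            time.sleep(delay)
            delay *= 2

# Simulate a failover: the first three attempts are refused, then the
# newly promoted primary accepts connections.
attempts = {"n": 0}

def fake_connect():
    attempts["n"] += 1
    if attempts["n"] <= 3:
        raise ConnectionError("connection refused")
    return "connected"

print(connect_with_retry(fake_connect))  # connected (on the 4th attempt)
```

n8n's own DB layer retries similarly, but the same pattern is worth adding to any sidecar scripts or custom nodes that open their own connections.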
4. Pattern C – External object storage for binary data
4.1 Why off‑load binaries
n8n writes binary data to the local filesystem by default. In a cluster that local path becomes a single point of failure and breaks whenever a pod moves. Store binaries in S3 (or a shared NFS volume) instead.
4.2 Terraform – create an S3 bucket (focus: versioning & lifecycle)
```hcl
resource "aws_s3_bucket" "n8n_binary" {
  bucket = "n8n-binary-${var.env}"
  # Buckets are private by default; the legacy `acl` argument is deprecated.
}

resource "aws_s3_bucket_versioning" "n8n_binary" {
  bucket = aws_s3_bucket.n8n_binary.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "n8n_binary" {
  bucket = aws_s3_bucket.n8n_binary.id
  rule {
    id     = "expire-old-objects"
    status = "Enabled"
    filter {}
    expiration {
      days = 365
    }
  }
}
```
4.3 n8n environment variables for S3 storage
```
N8N_DEFAULT_BINARY_DATA_MODE=s3
N8N_EXTERNAL_STORAGE_S3_BUCKET_NAME=${aws_s3_bucket.n8n_binary.id}
N8N_EXTERNAL_STORAGE_S3_BUCKET_REGION=${var.aws_region}
N8N_EXTERNAL_STORAGE_S3_ACCESS_KEY=${aws_iam_access_key.n8n.id}
N8N_EXTERNAL_STORAGE_S3_ACCESS_SECRET=${aws_iam_access_key.n8n.secret}
```
EEFA warning – Enable **S3 Object Lock** (governance mode) for regulated documents; otherwise a rogue delete could break audit trails.
4.4 On‑prem alternative – NFS shared volume (focus: PV definition)
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: n8n-nfs-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    path: /exports/n8n-data
    server: nfs.example.local
```
Mount this PV at `/data` on every worker pod and set `fsGroup: 1000` in the pod security context so the n8n process can read and write.
5. Pattern D – Multi‑region active‑passive disaster recovery
5.1 Component matrix (single purpose)
| Component | Primary region | DR region | Sync method |
|---|---|---|---|
| DB | Patroni (3 nodes) | Patroni (3 nodes) | Logical replication (pglogical) |
| Object store | S3 bucket | Replicated bucket | Cross‑region replication |
| Load balancer | Cloud‑ALB (us‑east‑1) | Cloud‑ALB (eu‑west‑1) | Route 53 health‑check failover |
| Workers | 4 replicas | 2 cold replicas | Same Docker image version |
5.2 Logical replication – primary node setup
```sql
CREATE EXTENSION IF NOT EXISTS pglogical;

SELECT pglogical.create_node(
    node_name := 'primary',
    dsn := 'host=primary-db port=5432 dbname=n8n user=postgres password=***'
);
```
5.3 Logical replication – DR node registration
```sql
SELECT pglogical.create_node(
    node_name := 'dr',
    dsn := 'host=dr-db port=5432 dbname=n8n user=postgres password=***'
);
```
5.4 Subscription – DR pulls from primary
```sql
SELECT pglogical.create_subscription(
    subscription_name := 'dr_sub',
    provider_dsn := 'host=primary-db port=5432 dbname=n8n user=postgres password=***',
    synchronize_structure := true,
    synchronize_data := true
);
```
EEFA tip – Set `synchronous_commit = remote_apply` on the primary for **zero data loss** while the DR link is healthy. If latency is high, switch to `local` and accept a few seconds of lag.
5.5 DNS‑based failover (Route 53) – weight‑based record set
```json
{
  "Comment": "Failover record set for n8n",
  "Changes": [
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "n8n.example.com.",
        "Type": "A",
        "SetIdentifier": "primary-us-east-1",
        "Weight": 100,
        "HealthCheckId": "abcd1234-primary",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "3.3.3.3" }]
      }
    },
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "n8n.example.com.",
        "Type": "A",
        "SetIdentifier": "dr-eu-west-1",
        "Weight": 0,
        "HealthCheckId": "efgh5678-dr",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "4.4.4.4" }]
      }
    }
  ]
}
```
In most setups we let Route 53 handle the switch; manual traffic moves tend to introduce errors.
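The weight‑0 trick works because weighted routing first drops unhealthy records, then serves zero‑weight records only when no positive‑weight record remains. A simplified sketch of that selection logic (real Route 53 weighting is probabilistic across all healthy records; this model only shows which records are eligible):

```python
def select_records(records):
    """records: list of (set_identifier, weight, healthy).
    Return the records eligible to receive traffic under weighted
    routing with health checks."""
    healthy = [r for r in records if r[2]]
    positive = [r for r in healthy if r[1] > 0]
    # Zero-weight records serve only when every healthy record has weight 0.
    return positive if positive else healthy

primary_up = [("primary-us-east-1", 100, True), ("dr-eu-west-1", 0, True)]
primary_down = [("primary-us-east-1", 100, False), ("dr-eu-west-1", 0, True)]

print([r[0] for r in select_records(primary_up)])    # ['primary-us-east-1']
print([r[0] for r in select_records(primary_down)])  # ['dr-eu-west-1']
```

When the primary's health check fails, the DR record becomes the only eligible target within one TTL (60 s here), which bounds the DNS‑level failover time.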
6. Monitoring, alerting & auto‑remediation
| Tool | Metric | Alert threshold | Auto‑remedy |
|---|---|---|---|
| Prometheus + Grafana | n8n_worker_up | < 3/5 workers healthy | kubectl rollout restart |
| PostgreSQL pg_stat_activity | max_connections usage | > 80 % | Scale Patroni StatefulSet |
| Redis connected_clients | Client count | > 10k | Add replica via redis-cli |
| Nginx upstream_response_time | Response latency | > 2 s | Add more workers or raise CPU limits |
| CloudWatch S3ObjectDeleted | Deletion spike | Sudden rise | Review IAM policies |
*Teams often miss the Redis client count until it spikes during a batch job.*
Sample PrometheusRule (worker count)
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: n8n-availability
spec:
  groups:
    - name: n8n.rules
      rules:
        - alert: n8nWorkerInsufficient
          expr: sum(up{job="n8n"}) < 3
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Fewer than 3 n8n workers are healthy"
            runbook_url: "https://example.com/runbooks/n8n-worker-failure"
```
7. TL;DR – One‑page HA checklist
| Area | Action |
|---|---|
| Stateless workers | Deploy ≥ 3 replicas behind an L7 LB with /healthz probes |
| Redis queue | EXECUTIONS_MODE=queue + QUEUE_BULL_REDIS_HOST |
| HA DB | Patroni + etcd (3‑node) – use service name patroni |
| External binaries | S3 (N8N_DEFAULT_BINARY_DATA_MODE=s3) **or** NFS (ReadWriteMany) |
| Failover testing | Kill primary DB → verify promotion < 5 s |
| Monitoring | Alerts for worker count, DB lag, Redis clients |
| DR | Logical replication, cross‑region S3, Route 53 weight failover |
| Security | Least‑privilege IAM for S3, TLS on LB, basic auth enabled |
In one sentence: to achieve high availability for n8n, run multiple stateless workers behind a health‑checked load balancer, use Patroni for PostgreSQL failover, store binary data in S3 or a shared NFS volume, and add a Redis queue for distributed execution.
Conclusion
Deploying n8n in production hinges on three pillars: stateless workers, a fail‑over‑ready database, and externalised binary storage. Wire them together with health‑checked load balancing, a Redis execution queue, and optional multi‑region DR, and you remove the usual single points of failure. The monitoring rules and auto‑remediation steps keep the system self‑healing, while the checklist gives a quick sanity‑check before going live. Follow the patterns above and your n8n workflows stay up, responsive, and resilient—even when traffic spikes or an entire AZ disappears.



