High-Availability n8n Patterns for Production Systems

A step-by-step guide to n8n high-availability patterns


Who this is for: Ops engineers, DevOps teams, and platform architects who need a production‑grade n8n deployment that survives node failures, traffic spikes, and regional outages. We cover this in detail in the Production‑Grade n8n Architecture.


Quick diagnosis

If your n8n instance drops workflows, returns 502 errors, or slows down after a traffic burst, you need an HA architecture that targets 99.9%+ uptime. In production this usually shows up when a node restarts unexpectedly or a sudden wave of webhook calls hits the API. The patterns below eliminate single points of failure, auto‑recover from node loss, and keep workflows running without manual intervention.


1. Why n8n needs a dedicated HA blueprint

Settle the single‑instance vs. multi‑instance question before continuing with the setup; every pattern below assumes a multi‑instance deployment.

| Failure mode | Symptom | HA countermeasure |
| --- | --- | --- |
| Single‑node crash | All workflows stop, UI returns 502 | Horizontal worker pool behind a load balancer |
| Database outage | “Connection refused” errors | Multi‑master or streaming‑replica cluster (Patroni) |
| File‑store loss | Missing uploaded files | Object storage (S3/GCS) or replicated NFS |
| Network partition | Workers can’t reach DB or webhooks | Health‑checked probes + auto‑failover |
| Regional disaster | Complete site outage after AZ failure | Multi‑region active‑passive deployment |

EEFA note – The most common production downtime source is a stateful file store on the same node as the workflow engine. Decouple it early to avoid data loss during node replacement.


2. Pattern A – Load‑balanced stateless workers

2.1 Architecture snapshot

Client → L7 Load Balancer → n8n workers (stateless)
                                 ├─→ Redis (optional queue)
                                 └─→ PostgreSQL (HA)

*Workers are stateless; they read workflow definitions from the DB and store temporary files in external object storage.*
The load balancer (NGINX, Traefik, or a cloud‑managed ALB) probes /healthz on each worker. Plan for zero‑downtime upgrades (rolling updates, one worker at a time) before continuing with the setup.
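Conceptually, the balancer keeps a rotation of only those workers that pass the probe. A minimal Python sketch of that behaviour (illustrative only; the worker names and `healthy_rotation` are not real n8n or NGINX code):

```python
import itertools

def healthy_rotation(workers, is_healthy):
    """Round-robin only over workers whose health probe passes."""
    pool = [w for w in workers if is_healthy(w)]
    if not pool:
        raise RuntimeError("no healthy workers behind the balancer")
    return itertools.cycle(pool)

# Example: worker-2 fails its probe, so traffic rotates over the other two.
down = {"worker-2"}
rotation = healthy_rotation(["worker-1", "worker-2", "worker-3"],
                            lambda w: w not in down)
first_four = [next(rotation) for _ in range(4)]
# first_four == ["worker-1", "worker-3", "worker-1", "worker-3"]
```

The point of the sketch: a failed probe removes a worker from rotation entirely, so clients never see its errors; the worker rejoins once the probe passes again.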

2.2 Docker‑Compose – split into focused services

Database service (PostgreSQL, 3‑node replica)

services:
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: n8n
    volumes:
      - db-data:/var/lib/postgresql/data
    deploy:
      mode: replicated
      replicas: 3   # NOTE: three identical containers, not a replicating cluster;
                    # use Pattern B (Patroni) for real database replication

Redis queue (optional)

  redis:
    image: redis:7-alpine
    deploy:
      mode: replicated
      replicas: 2   # independent instances; for true Redis HA use Sentinel or Cluster

n8n worker definition

  n8n:
    image: n8nio/n8n:latest
    environment:
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=db
      - DB_POSTGRESDB_DATABASE=n8n
      - DB_POSTGRESDB_USER=postgres
      - DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD}
      - EXECUTIONS_MODE=queue          # queue mode: executions go through Redis
      - QUEUE_BULL_REDIS_HOST=redis
      - QUEUE_BULL_REDIS_PASSWORD=${REDIS_PASSWORD}
      - N8N_BASIC_AUTH_ACTIVE=true
      - N8N_BASIC_AUTH_USER=${ADMIN_USER}
      - N8N_BASIC_AUTH_PASSWORD=${ADMIN_PASS}
    ports:
      - "5678:5678"
    depends_on:
      - db
      - redis
    deploy:
      mode: replicated
      replicas: 4
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: on-failure
        max_attempts: 3

EEFA tweak – EXECUTIONS_MODE=queue routes every execution through the shared Redis queue, so each job is claimed by exactly one worker and race conditions between replicas are avoided.
If a worker stalls, a pod restart is usually quicker than hunting a phantom lock.
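The queue's job-claiming guarantee can be pictured like this (a toy stand-in for what Redis provides; `try_claim` and the plain dict are illustrative, not n8n internals):

```python
claimed = {}

def try_claim(execution_id: str, worker: str) -> bool:
    """First caller wins, much like SETNX against a shared Redis key."""
    # setdefault only writes if the key is absent, so the first worker
    # to arrive becomes the owner and every later claim is rejected.
    return claimed.setdefault(execution_id, worker) == worker

print(try_claim("exec-42", "worker-1"))  # True  - worker-1 owns the job
print(try_claim("exec-42", "worker-2"))  # False - already claimed
```

In production the claim lives in Redis so it is shared across all worker containers; the dict here only models the semantics within one process.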

Persistent volume for the DB

volumes:
  db-data:

2.3 Nginx health‑check configuration

upstream n8n_cluster {
    # Passive health checking: drop a worker after 2 failures within 10 s
    server n8n-1:5678 max_fails=2 fail_timeout=10s;
    server n8n-2:5678 max_fails=2 fail_timeout=10s;
    server n8n-3:5678 max_fails=2 fail_timeout=10s;
}

server {
    listen 80;
    server_name n8n.example.com;

    location / {
        proxy_pass http://n8n_cluster;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    location /healthz {
        proxy_pass http://n8n_cluster/healthz;
    }
}

*With open‑source NGINX this is passive health checking (max_fails/fail_timeout): after **2 failed attempts** within **10 s**, the node is dropped from rotation until it recovers.*
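The marking rule boils down to a consecutive-failure counter. A sketch of that logic (illustrative, not NGINX source; `mark_unhealthy` is a made-up name):

```python
def mark_unhealthy(status_codes, max_fails=2):
    """Return True once `max_fails` consecutive 5xx responses are seen."""
    fails = 0
    for code in status_codes:
        # any 5xx increments the streak; a good response resets it
        fails = fails + 1 if code >= 500 else 0
        if fails >= max_fails:
            return True
    return False

print(mark_unhealthy([200, 502, 503, 200]))  # True  - two 5xx in a row
print(mark_unhealthy([200, 502, 200, 503]))  # False - failures not consecutive
```

Note that only *consecutive* failures within the window trip the check, so a single transient error never removes a worker from rotation.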


3. Pattern B – PostgreSQL high‑availability (Patroni + etcd)

3.1 Why a single primary is a SPOF

Even with many workers, a lone PostgreSQL instance can bring the whole system down. Patroni handles automatic failover, typically promoting a replica within seconds to about half a minute, depending on the ttl and loop_wait settings below. Settle your data‑consistency requirements (synchronous vs. asynchronous replication) before continuing with the setup.

3.2 ConfigMap – Patroni configuration (focus: DCS & bootstrap)

apiVersion: v1
kind: ConfigMap
metadata:
  name: patroni-config
data:
  patroni.yml: |
    scope: n8n
    namespace: /db/
    name: $(POD_NAME)
    restapi:
      listen: 0.0.0.0:8008
    etcd:
      host: etcd-0.etcd.svc:2379,etcd-1.etcd.svc:2379,etcd-2.etcd.svc:2379
    bootstrap:
      dcs:
        ttl: 30
        loop_wait: 10
        retry_timeout: 10
        maximum_lag_on_failover: 1048576
        postgresql:
          use_pg_rewind: true
          parameters:
            max_connections: 100
            shared_buffers: 256MB
      initdb:
        - encoding: UTF8
        - data-checksums
      pg_hba:
        - host all all 0.0.0.0/0 md5

*Patroni uses etcd for consensus; the rest of the file is mostly boilerplate.*

3.3 StatefulSet – three Patroni pods (focus: storage)

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: patroni
spec:
  serviceName: patroni
  replicas: 3
  selector:
    matchLabels:
      app: patroni
  template:
    metadata:
      labels:
        app: patroni
    spec:
      containers:
        - name: patroni
          image: patroni:latest
          volumeMounts:
            - name: patroni-config
              mountPath: /etc/patroni   # Patroni reads /etc/patroni/patroni.yml
            - name: pgdata
              mountPath: /home/postgres/pgdata
          ports:
            - containerPort: 5432   # PostgreSQL
            - containerPort: 8008   # Patroni REST API
      volumes:
        - name: patroni-config
          configMap:
            name: patroni-config
  volumeClaimTemplates:
    - metadata:
        name: pgdata
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi

EEFA tip – Run etcd with a proven operator or a managed control plane rather than hand‑rolled manifests; quorum latency and membership mistakes are the usual failure sources.

3.4 n8n connection string (no changes during failover)

DB_TYPE=postgresdb
DB_POSTGRESDB_HOST=patroni   # virtual IP / service name
DB_POSTGRESDB_DATABASE=n8n
DB_POSTGRESDB_USER=postgres
DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD}

The patroni service name always resolves to the current primary (Patroni updates the service endpoints on failover), so workers reconnect automatically.
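In-flight connections still break during the promotion window; clients simply retry and land on the new primary. A hedged sketch of such a retry loop (function names are illustrative; n8n's own reconnect handling is internal to it):

```python
import time

def connect_with_retry(connect, attempts=5, base_delay=0.5):
    """Retry `connect()` with exponential backoff until the new primary answers."""
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...

# Example: a fake connect that fails twice (mid-failover), then succeeds.
state = {"calls": 0}
def fake_connect():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("primary unreachable")
    return "connected"

print(connect_with_retry(fake_connect, base_delay=0.01))  # connected
```

The backoff keeps total wait well within a typical failover window while avoiding a thundering herd of reconnects against the newly promoted node.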


4. Pattern C – External object storage for binary data

4.1 Why off‑load binaries

n8n writes binary data to the local filesystem by default. In a cluster that local folder becomes a single point of failure. Store binaries in S3 (or a shared NFS export) instead.

4.2 Terraform – create an S3 bucket (focus: versioning & lifecycle)

resource "aws_s3_bucket" "n8n_binary" {
  bucket = "n8n-binary-${var.env}"
  # AWS provider v3 syntax; with provider v4+ use the standalone
  # aws_s3_bucket_versioning and aws_s3_bucket_lifecycle_configuration resources
  acl    = "private"

  versioning {
    enabled = true
  }

  lifecycle_rule {
    id      = "expire-old-objects"
    enabled = true
    expiration {
      days = 365
    }
  }
}

4.3 n8n environment variables for S3 storage

N8N_DEFAULT_BINARY_DATA_MODE=s3
N8N_EXTERNAL_STORAGE_S3_BUCKET_NAME=${aws_s3_bucket.n8n_binary.id}
N8N_EXTERNAL_STORAGE_S3_BUCKET_REGION=${var.aws_region}
N8N_EXTERNAL_STORAGE_S3_ACCESS_KEY=${aws_iam_access_key.n8n.id}
N8N_EXTERNAL_STORAGE_S3_ACCESS_SECRET=${aws_iam_access_key.n8n.secret}

Note: in recent n8n versions external S3 storage is a licensed feature; check your plan before relying on it.

EEFA warning – Enable **S3 Object Lock** (governance mode) for regulated documents; otherwise a rogue delete could break audit trails.

4.4 On‑prem alternative – NFS shared volume (focus: PV definition)

apiVersion: v1
kind: PersistentVolume
metadata:
  name: n8n-nfs-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    path: /exports/n8n-data
    server: nfs.example.local

Mount this PV at /data on every worker pod and set fsGroup: 1000 in the pod securityContext so the n8n process (uid 1000) can read and write.


5. Pattern D – Multi‑region active‑passive disaster recovery

5.1 Component matrix (single purpose)

| Component | Primary region | DR region | Sync method |
| --- | --- | --- | --- |
| DB | Patroni (3 nodes) | Patroni (3 nodes) | Logical replication (pglogical) |
| Object store | S3 bucket | Replicated bucket | Cross‑region replication |
| Load balancer | Cloud ALB (us‑east‑1) | Cloud ALB (eu‑west‑1) | Route 53 health‑check failover |
| Workers | 4 replicas | 2 cold replicas | Same Docker image version |

5.2 Logical replication – primary node setup

CREATE EXTENSION IF NOT EXISTS pglogical;
SELECT pglogical.create_node(
    node_name := 'primary',
    dsn := 'host=primary-db port=5432 dbname=n8n user=postgres password=***'
);

5.3 Logical replication – DR node registration

SELECT pglogical.create_node(
    node_name := 'dr',
    dsn := 'host=dr-db port=5432 dbname=n8n user=postgres password=***'
);

5.4 Subscription – DR pulls from primary

SELECT pglogical.create_subscription(
    subscription_name := 'dr_sub',
    provider_dsn := 'host=primary-db port=5432 dbname=n8n user=postgres password=***',
    synchronize_structure := true,
    synchronize_data := true
);

EEFA tip – For **zero data loss** while the DR link is healthy, set synchronous_commit = remote_apply on the primary and list the DR subscriber in synchronous_standby_names. If latency is high, switch back to local and accept a few seconds of lag.

5.5 DNS‑based failover (Route 53) – weight‑based record set

{
  "Comment": "Failover record set for n8n",
  "Changes": [
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "n8n.example.com.",
        "Type": "A",
        "SetIdentifier": "primary-us-east-1",
        "Weight": 100,
        "HealthCheckId": "abcd1234-primary",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "3.3.3.3" }]
      }
    },
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "n8n.example.com.",
        "Type": "A",
        "SetIdentifier": "dr-eu-west-1",
        "Weight": 0,
        "HealthCheckId": "efgh5678-dr",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "4.4.4.4" }]
      }
    }
  ]
}

In most setups we let Route 53 handle the switch; manual traffic moves tend to introduce errors.
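The effective routing decision reduces to "highest-weight record whose health check passes". A simplified model of that rule (Route 53's real behaviour is richer; `pick_record` is a made-up helper):

```python
def pick_record(records, check_passes):
    """records: list of (set_identifier, weight) pairs.
    Return the healthy record with the highest weight, or None."""
    healthy = [(rid, w) for rid, w in records if check_passes(rid)]
    if not healthy:
        return None
    return max(healthy, key=lambda rw: rw[1])[0]

records = [("primary-us-east-1", 100), ("dr-eu-west-1", 0)]
print(pick_record(records, lambda rid: True))                  # primary-us-east-1
print(pick_record(records, lambda rid: rid.startswith("dr")))  # dr-eu-west-1
```

With the primary healthy, its weight of 100 wins; once its health check fails, the weight‑0 DR record is the only healthy candidate and receives all traffic.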


6. Monitoring, alerting & auto‑remediation

| Tool | Metric | Alert threshold | Auto‑remedy |
| --- | --- | --- | --- |
| Prometheus + Grafana | n8n_worker_up | < 3/5 workers healthy | kubectl rollout restart |
| PostgreSQL | pg_stat_activity | max_connections usage > 80% | Scale the Patroni StatefulSet |
| Redis | connected_clients | Client count > 10k | Add a replica via redis-cli |
| Nginx | upstream_response_time | Latency > 2 s | Add workers or raise CPU limits |
| CloudWatch | S3 object deletions | Sudden spike | Review IAM policies |

*Teams often miss the Redis client count until it spikes during a batch job.*

Sample PrometheusRule (worker count)

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: n8n-availability
spec:
  groups:
  - name: n8n.rules
    rules:
    - alert: n8nWorkerInsufficient
      expr: sum(up{job="n8n"}) < 3
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Fewer than 3 n8n workers are healthy"
        runbook_url: "https://example.com/runbooks/n8n-worker-failure"

7. TL;DR – One‑page HA checklist

| Area | Action |
| --- | --- |
| Stateless workers | Deploy ≥ 3 replicas behind an L7 LB with /healthz probes |
| Redis queue | EXECUTIONS_MODE=queue with a shared Redis instance |
| HA DB | Patroni + etcd (3 nodes); connect via the service name patroni |
| External binaries | S3 (N8N_DEFAULT_BINARY_DATA_MODE=s3) **or** NFS (ReadWriteMany) |
| Failover testing | Kill the primary DB and verify a replica is promoted automatically |
| Monitoring | Alerts for worker count, DB replication lag, Redis clients |
| DR | Logical replication, cross‑region S3, Route 53 weighted failover |
| Security | Least‑privilege IAM for S3, TLS at the LB, basic auth enabled |



Conclusion

Deploying n8n in production hinges on three pillars: stateless workers, a fail‑over‑ready database, and externalised binary storage. Wire them together with health‑checked load balancing, a Redis execution queue, and optional multi‑region DR, and you remove the usual single points of failure. The monitoring rules and auto‑remediation steps keep the system self‑healing, while the checklist gives a quick sanity‑check before going live. Follow the patterns above and your n8n workflows stay up, responsive, and resilient—even when traffic spikes or an entire AZ disappears.
