High-Availability n8n Patterns for Production Systems

A step-by-step guide to n8n high-availability patterns


Who this is for: Ops engineers, DevOps teams, and platform architects who need a production‑grade n8n deployment that survives node failures, traffic spikes, and regional outages. We cover this in detail in the Production‑Grade n8n Architecture.


Quick diagnosis

If your n8n instance drops workflows, returns 502 errors, or slows down after a traffic burst, you need an HA architecture that targets 99.9%+ uptime. In production this usually shows up when a node restarts unexpectedly or a sudden wave of webhook calls hits the API. The patterns below eliminate single points of failure, auto‑recover from node loss, and keep workflows running without manual intervention.


1. Why n8n needs a dedicated HA blueprint

Settle the single‑instance vs. multi‑instance question before continuing with the setup; every pattern below assumes a multi‑instance deployment.

| Failure mode | Symptom | HA countermeasure |
| --- | --- | --- |
| Single‑node crash | All workflows stop, UI returns 502 | Horizontal worker pool behind a load balancer |
| Database outage | “Connection refused” errors | Multi‑master or streaming‑replica cluster (Patroni) |
| File‑store loss | Missing uploaded files | Object storage (S3/GCS) or replicated NFS |
| Network partition | Workers can’t reach DB or webhooks | Health‑checked probes + auto‑failover |
| Regional disaster | Complete site outage after AZ failure | Multi‑region active‑passive deployment |

EEFA note – The most common production downtime source is a stateful file store on the same node as the workflow engine. Decouple it early to avoid data loss during node replacement.


2. Pattern A – Load‑balanced stateless workers

2.1 Architecture snapshot

Client → L7 Load Balancer → n8n workers (stateless)
                                 ├─→ Redis (optional queue)
                                 └─→ PostgreSQL (HA)

*Workers are stateless; they read workflow definitions from the DB and store temporary files in external object storage.*
The load balancer (NGINX, Traefik, or a cloud‑managed ALB) probes /healthz on each worker. Plan for zero‑downtime upgrades (rolling updates, one worker at a time) before continuing with the setup.
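Conceptually, the balancer keeps a rotation of only those workers that pass the probe. A minimal Python sketch of that behaviour (illustrative only; the worker names and `healthy_rotation` are not real n8n or NGINX code):

```python
import itertools

def healthy_rotation(workers, is_healthy):
    """Round-robin only over workers whose health probe passes."""
    pool = [w for w in workers if is_healthy(w)]
    if not pool:
        raise RuntimeError("no healthy workers behind the balancer")
    return itertools.cycle(pool)

# Example: worker-2 fails its probe, so traffic rotates over the other two.
down = {"worker-2"}
rotation = healthy_rotation(["worker-1", "worker-2", "worker-3"],
                            lambda w: w not in down)
first_four = [next(rotation) for _ in range(4)]
# first_four == ["worker-1", "worker-3", "worker-1", "worker-3"]
```

The point of the sketch: a failed probe removes a worker from rotation entirely, so clients never see its errors; the worker rejoins once the probe passes again.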

2.2 Docker‑Compose – split into focused services

Database service (PostgreSQL, 3‑node replica)

services:
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: n8n
    volumes:
      - db-data:/var/lib/postgresql/data
    deploy:
      mode: replicated
      replicas: 3   # NOTE: three identical containers, not a replicating cluster;
                    # use Pattern B (Patroni) for real database replication

Redis queue (optional)

  redis:
    image: redis:7-alpine
    deploy:
      mode: replicated
      replicas: 2   # independent instances; for true Redis HA use Sentinel or Cluster

n8n worker definition

  n8n:
    image: n8nio/n8n:latest
    environment:
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=db
      - DB_POSTGRESDB_DATABASE=n8n
      - DB_POSTGRESDB_USER=postgres
      - DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD}
      - EXECUTIONS_MODE=queue          # queue mode: executions go through Redis
      - QUEUE_BULL_REDIS_HOST=redis
      - QUEUE_BULL_REDIS_PASSWORD=${REDIS_PASSWORD}
      - N8N_BASIC_AUTH_ACTIVE=true
      - N8N_BASIC_AUTH_USER=${ADMIN_USER}
      - N8N_BASIC_AUTH_PASSWORD=${ADMIN_PASS}
    ports:
      - "5678:5678"
    depends_on:
      - db
      - redis
    deploy:
      mode: replicated
      replicas: 4
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: on-failure
        max_attempts: 3

EEFA tweak – EXECUTIONS_MODE=queue routes every execution through the shared Redis queue, so each job is claimed by exactly one worker and race conditions between replicas are avoided.
If a worker stalls, a pod restart is usually quicker than hunting a phantom lock.
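The queue's job-claiming guarantee can be pictured like this (a toy stand-in for what Redis provides; `try_claim` and the plain dict are illustrative, not n8n internals):

```python
claimed = {}

def try_claim(execution_id: str, worker: str) -> bool:
    """First caller wins, much like SETNX against a shared Redis key."""
    # setdefault only writes if the key is absent, so the first worker
    # to arrive becomes the owner and every later claim is rejected.
    return claimed.setdefault(execution_id, worker) == worker

print(try_claim("exec-42", "worker-1"))  # True  - worker-1 owns the job
print(try_claim("exec-42", "worker-2"))  # False - already claimed
```

In production the claim lives in Redis so it is shared across all worker containers; the dict here only models the semantics within one process.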

Persistent volume for the DB

volumes:
  db-data:

2.3 Nginx health‑check configuration

upstream n8n_cluster {
    # Passive health checking: drop a worker after 2 failures within 10 s
    server n8n-1:5678 max_fails=2 fail_timeout=10s;
    server n8n-2:5678 max_fails=2 fail_timeout=10s;
    server n8n-3:5678 max_fails=2 fail_timeout=10s;
}

server {
    listen 80;
    server_name n8n.example.com;

    location / {
        proxy_pass http://n8n_cluster;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    location /healthz {
        proxy_pass http://n8n_cluster/healthz;
    }
}

*With open‑source NGINX this is passive health checking (max_fails/fail_timeout): after **2 failed attempts** within **10 s**, the node is dropped from rotation until it recovers.*
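The marking rule boils down to a consecutive-failure counter. A sketch of that logic (illustrative, not NGINX source; `mark_unhealthy` is a made-up name):

```python
def mark_unhealthy(status_codes, max_fails=2):
    """Return True once `max_fails` consecutive 5xx responses are seen."""
    fails = 0
    for code in status_codes:
        # any 5xx increments the streak; a good response resets it
        fails = fails + 1 if code >= 500 else 0
        if fails >= max_fails:
            return True
    return False

print(mark_unhealthy([200, 502, 503, 200]))  # True  - two 5xx in a row
print(mark_unhealthy([200, 502, 200, 503]))  # False - failures not consecutive
```

Note that only *consecutive* failures within the window trip the check, so a single transient error never removes a worker from rotation.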


3. Pattern B – PostgreSQL high‑availability (Patroni + etcd)

3.1 Why a single primary is a SPOF

Even with many workers, a lone PostgreSQL instance can bring the whole system down. Patroni handles automatic failover, typically promoting a replica within seconds to about half a minute, depending on the ttl and loop_wait settings below. Settle your data‑consistency requirements (synchronous vs. asynchronous replication) before continuing with the setup.

3.2 ConfigMap – Patroni configuration (focus: DCS & bootstrap)

apiVersion: v1
kind: ConfigMap
metadata:
  name: patroni-config
data:
  patroni.yml: |
    scope: n8n
    namespace: /db/
    name: $(POD_NAME)
    restapi:
      listen: 0.0.0.0:8008
    etcd:
      host: etcd-0.etcd.svc:2379,etcd-1.etcd.svc:2379,etcd-2.etcd.svc:2379
    bootstrap:
      dcs:
        ttl: 30
        loop_wait: 10
        retry_timeout: 10
        maximum_lag_on_failover: 1048576
        postgresql:
          use_pg_rewind: true
          parameters:
            max_connections: 100
            shared_buffers: 256MB
      initdb:
        - encoding: UTF8
        - data-checksums
      pg_hba:
        - host all all 0.0.0.0/0 md5

*Patroni uses etcd for consensus; the rest of the file is mostly boilerplate.*

3.3 StatefulSet – three Patroni pods (focus: storage)

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: patroni
spec:
  serviceName: patroni
  replicas: 3
  selector:
    matchLabels:
      app: patroni
  template:
    metadata:
      labels:
        app: patroni
    spec:
      containers:
        - name: patroni
          image: patroni:latest
          volumeMounts:
            - name: patroni-config
              mountPath: /etc/patroni   # Patroni reads /etc/patroni/patroni.yml
            - name: pgdata
              mountPath: /home/postgres/pgdata
          ports:
            - containerPort: 5432   # PostgreSQL
            - containerPort: 8008   # Patroni REST API
      volumes:
        - name: patroni-config
          configMap:
            name: patroni-config
  volumeClaimTemplates:
    - metadata:
        name: pgdata
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi

EEFA tip – Run etcd with a proven operator or a managed control plane rather than hand‑rolled manifests; quorum latency and membership mistakes are the usual failure sources.

3.4 n8n connection string (no changes during failover)

DB_TYPE=postgresdb
DB_POSTGRESDB_HOST=patroni   # virtual IP / service name
DB_POSTGRESDB_DATABASE=n8n
DB_POSTGRESDB_USER=postgres
DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD}

The patroni service name always resolves to the current primary (Patroni updates the service endpoints on failover), so workers reconnect automatically.
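In-flight connections still break during the promotion window; clients simply retry and land on the new primary. A hedged sketch of such a retry loop (function names are illustrative; n8n's own reconnect handling is internal to it):

```python
import time

def connect_with_retry(connect, attempts=5, base_delay=0.5):
    """Retry `connect()` with exponential backoff until the new primary answers."""
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...

# Example: a fake connect that fails twice (mid-failover), then succeeds.
state = {"calls": 0}
def fake_connect():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("primary unreachable")
    return "connected"

print(connect_with_retry(fake_connect, base_delay=0.01))  # connected
```

The backoff keeps total wait well within a typical failover window while avoiding a thundering herd of reconnects against the newly promoted node.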


4. Pattern C – External object storage for binary data

4.1 Why off‑load binaries

n8n writes binary data to the local filesystem by default. In a cluster that local folder becomes a single point of failure. Store binaries in S3 (or a shared NFS export) instead.

4.2 Terraform – create an S3 bucket (focus: versioning & lifecycle)

resource "aws_s3_bucket" "n8n_binary" {
  bucket = "n8n-binary-${var.env}"
  # AWS provider v3 syntax; with provider v4+ use the standalone
  # aws_s3_bucket_versioning and aws_s3_bucket_lifecycle_configuration resources
  acl    = "private"

  versioning {
    enabled = true
  }

  lifecycle_rule {
    id      = "expire-old-objects"
    enabled = true
    expiration {
      days = 365
    }
  }
}

4.3 n8n environment variables for S3 storage

N8N_DEFAULT_BINARY_DATA_MODE=s3
N8N_EXTERNAL_STORAGE_S3_BUCKET_NAME=${aws_s3_bucket.n8n_binary.id}
N8N_EXTERNAL_STORAGE_S3_BUCKET_REGION=${var.aws_region}
N8N_EXTERNAL_STORAGE_S3_ACCESS_KEY=${aws_iam_access_key.n8n.id}
N8N_EXTERNAL_STORAGE_S3_ACCESS_SECRET=${aws_iam_access_key.n8n.secret}

Note: in recent n8n versions external S3 storage is a licensed feature; check your plan before relying on it.

EEFA warning – Enable **S3 Object Lock** (governance mode) for regulated documents; otherwise a rogue delete could break audit trails.

4.4 On‑prem alternative – NFS shared volume (focus: PV definition)

apiVersion: v1
kind: PersistentVolume
metadata:
  name: n8n-nfs-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    path: /exports/n8n-data
    server: nfs.example.local

Mount this PV at /data on every worker pod and set fsGroup: 1000 in the pod securityContext so the n8n process (uid 1000) can read and write.


5. Pattern D – Multi‑region active‑passive disaster recovery

5.1 Component matrix (single purpose)

| Component | Primary region | DR region | Sync method |
| --- | --- | --- | --- |
| DB | Patroni (3 nodes) | Patroni (3 nodes) | Logical replication (pglogical) |
| Object store | S3 bucket | Replicated bucket | Cross‑region replication |
| Load balancer | Cloud ALB (us‑east‑1) | Cloud ALB (eu‑west‑1) | Route 53 health‑check failover |
| Workers | 4 replicas | 2 cold replicas | Same Docker image version |

5.2 Logical replication – primary node setup

CREATE EXTENSION IF NOT EXISTS pglogical;
SELECT pglogical.create_node(
    node_name := 'primary',
    dsn := 'host=primary-db port=5432 dbname=n8n user=postgres password=***'
);

5.3 Logical replication – DR node registration

SELECT pglogical.create_node(
    node_name := 'dr',
    dsn := 'host=dr-db port=5432 dbname=n8n user=postgres password=***'
);

5.4 Subscription – DR pulls from primary

SELECT pglogical.create_subscription(
    subscription_name := 'dr_sub',
    provider_dsn := 'host=primary-db port=5432 dbname=n8n user=postgres password=***',
    synchronize_structure := true,
    synchronize_data := true
);

EEFA tip – For **zero data loss** while the DR link is healthy, set synchronous_commit = remote_apply on the primary and list the DR subscriber in synchronous_standby_names. If latency is high, switch back to local and accept a few seconds of lag.

5.5 DNS‑based failover (Route 53) – weight‑based record set

{
  "Comment": "Failover record set for n8n",
  "Changes": [
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "n8n.example.com.",
        "Type": "A",
        "SetIdentifier": "primary-us-east-1",
        "Weight": 100,
        "HealthCheckId": "abcd1234-primary",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "3.3.3.3" }]
      }
    },
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "n8n.example.com.",
        "Type": "A",
        "SetIdentifier": "dr-eu-west-1",
        "Weight": 0,
        "HealthCheckId": "efgh5678-dr",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "4.4.4.4" }]
      }
    }
  ]
}

In most setups we let Route 53 handle the switch; manual traffic moves tend to introduce errors.
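The effective routing decision reduces to "highest-weight record whose health check passes". A simplified model of that rule (Route 53's real behaviour is richer; `pick_record` is a made-up helper):

```python
def pick_record(records, check_passes):
    """records: list of (set_identifier, weight) pairs.
    Return the healthy record with the highest weight, or None."""
    healthy = [(rid, w) for rid, w in records if check_passes(rid)]
    if not healthy:
        return None
    return max(healthy, key=lambda rw: rw[1])[0]

records = [("primary-us-east-1", 100), ("dr-eu-west-1", 0)]
print(pick_record(records, lambda rid: True))                  # primary-us-east-1
print(pick_record(records, lambda rid: rid.startswith("dr")))  # dr-eu-west-1
```

With the primary healthy, its weight of 100 wins; once its health check fails, the weight‑0 DR record is the only healthy candidate and receives all traffic.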


6. Monitoring, alerting & auto‑remediation

| Tool | Metric | Alert threshold | Auto‑remedy |
| --- | --- | --- | --- |
| Prometheus + Grafana | n8n_worker_up | < 3/5 workers healthy | kubectl rollout restart |
| PostgreSQL | pg_stat_activity | max_connections usage > 80% | Scale the Patroni StatefulSet |
| Redis | connected_clients | Client count > 10k | Add a replica via redis-cli |
| Nginx | upstream_response_time | Latency > 2 s | Add workers or raise CPU limits |
| CloudWatch | S3 object deletions | Sudden spike | Review IAM policies |

*Teams often miss the Redis client count until it spikes during a batch job.*

Sample PrometheusRule (worker count)

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: n8n-availability
spec:
  groups:
  - name: n8n.rules
    rules:
    - alert: n8nWorkerInsufficient
      expr: sum(up{job="n8n"}) < 3
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Fewer than 3 n8n workers are healthy"
        runbook_url: "https://example.com/runbooks/n8n-worker-failure"

7. TL;DR – One‑page HA checklist

| Area | Action |
| --- | --- |
| Stateless workers | Deploy ≥ 3 replicas behind an L7 LB with /healthz probes |
| Redis queue | EXECUTIONS_MODE=queue with a shared Redis instance |
| HA DB | Patroni + etcd (3 nodes); connect via the service name patroni |
| External binaries | S3 (N8N_DEFAULT_BINARY_DATA_MODE=s3) **or** NFS (ReadWriteMany) |
| Failover testing | Kill the primary DB and verify a replica is promoted automatically |
| Monitoring | Alerts for worker count, DB replication lag, Redis clients |
| DR | Logical replication, cross‑region S3, Route 53 weighted failover |
| Security | Least‑privilege IAM for S3, TLS at the LB, basic auth enabled |



Conclusion

Deploying n8n in production hinges on three pillars: stateless workers, a fail‑over‑ready database, and externalised binary storage. Wire them together with health‑checked load balancing, a Redis execution queue, and optional multi‑region DR, and you remove the usual single points of failure. The monitoring rules and auto‑remediation steps keep the system self‑healing, while the checklist gives a quick sanity‑check before going live. Follow the patterns above and your n8n workflows stay up, responsive, and resilient—even when traffic spikes or an entire AZ disappears.
