Production-Grade n8n Architecture

A step-by-step guide to building a production-grade n8n architecture


Who this is for: SREs, DevOps engineers, and backend developers who need a reliable, horizontally-scalable n8n deployment in production.


Quick Diagnosis

Your n8n workflows are missing executions, exposing credentials, or failing under load. The root cause is typically a stateful, single‑node setup. Re‑architect with a stateless execution layer, external PostgreSQL, a durable queue, and proper monitoring.

TL;DR – Deploy n8n on Kubernetes (or Docker Compose at small scale) with PostgreSQL, Redis, Prometheus/Grafana, TLS, RBAC, and daily backups.
These problems rarely surface in development; in production they appear as soon as you run more than a handful of concurrent executions.


1. Core Production Requirements

Audit your current setup against the table below and fix any of these anti-patterns before continuing.

| Requirement | Why It Matters | n8n Default | Production-Ready Alternative |
|---|---|---|---|
| Stateless execution | Enables horizontal scaling and zero-downtime deploys | In-process execution (single process) | Separate worker pods/containers that pull jobs from a queue |
| Durable data store | Guarantees workflow state, credentials, logs | SQLite (file-based) | PostgreSQL 13+ (managed or self-hosted) |
| Reliable job queue | Prevents lost executions when the API crashes | In-memory (no queue) | Redis (or RabbitMQ) as the broker for `EXECUTIONS_MODE=queue` |
| TLS & auth | Protects credentials in transit and at rest | Optional, self-signed | Ingress TLS + OAuth2 / API-key enforcement |
| High availability (HA) | No single point of failure | Single pod | Replicated DB, multiple workers, load-balanced API |
| Observability | Early detection of bottlenecks and failures | Minimal logs | Prometheus metrics + Grafana dashboards + structured logging |
| Backup & DR | Prevents data loss | Manual file copy | Automated `pg_dump` + WAL archiving, point-in-time restore |

EEFA Note: Switching from SQLite to PostgreSQL requires a data migration (`n8n export:workflow` and `n8n export:credentials`, then the matching `import:` commands against the new database). Perform this in a maintenance window to avoid corrupting running workflows.
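A minimal migration sketch using the n8n CLI. It assumes the CLI is on the PATH and that the target PostgreSQL instance is reachable; hostnames and file paths below are illustrative, not prescriptive:

```bash
# Illustrative only: run inside a maintenance window, with no active executions.

# 1. Export workflows and credentials from the SQLite-backed instance.
n8n export:workflow --all --output=/backup/workflows.json
n8n export:credentials --all --output=/backup/credentials.json

# 2. Point n8n at PostgreSQL (example values).
export DB_TYPE=postgresdb
export DB_POSTGRESDB_HOST=pg-n8n-primary
export DB_POSTGRESDB_DATABASE=n8n

# 3. Import into the new database.
n8n import:workflow --input=/backup/workflows.json
n8n import:credentials --input=/backup/credentials.json
```

Keep the same `N8N_ENCRYPTION_KEY` across the move, or the imported credentials cannot be decrypted.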

Most teams hit these gaps after a few weeks of traffic, not on day one.


2. Architecture Blueprint

The diagram below separates the control plane (the API service) from the data plane (workers, queue, and database): a stateless API enqueues work, worker pods process jobs, and a resilient data layer persists state.

+-------------------+       +-------------------+       +-------------------+
|   Ingress (TLS)   | <---> |  n8n API Service  | <---> |   Redis Queue     |
+-------------------+       +-------------------+       +-------------------+
                                 ^   |
                                 |   v
                        +-------------------+
                        |  n8n Worker Pods  |
                        +-------------------+
                                 |
                                 v
                        +-------------------+
                        | PostgreSQL (HA)   |
                        +-------------------+
                                 |
                                 v
                        +-------------------+
                        | Prometheus/Grafana|
                        +-------------------+

A quick walk‑through: the Ingress terminates TLS and hands traffic to the API service. The API writes execution requests to Redis. Worker Pods pull from the queue, run the workflow, and persist results to PostgreSQL. Prometheus scrapes metrics from both API and workers for Grafana to visualise.


3. Deployment Options

The table below compares the main deployment platforms; pick the one that matches your scale and operational capacity.

| Platform | Pros | Cons | When to Choose |
|---|---|---|---|
| Docker Compose (single node) | Quick start, easy local dev, low cost | No native HA, manual scaling, limited monitoring | PoC, < 10 concurrent executions |
| Docker Swarm | Built-in service replication, simple networking | Declining community support, limited autoscaling | Small-to-mid teams already on Swarm |
| Kubernetes (Helm) | Declarative, auto-scaling, native secrets, robust ecosystem | Higher operational overhead, learning curve | Production, ≥ 20 concurrent executions, need for HA |
| Managed n8n Cloud | Zero-ops, automatic backups, SLA | Vendor lock-in, less control over network topology | Teams without ops resources, compliance OK with provider |

3.1 Helm Values – Core Settings

Below are the essential Helm values split into focused snippets.

Environment variables (DB, queue, auth)

n8n:
  env:
    - name: DB_TYPE
      value: postgresdb
    - name: DB_POSTGRESDB_HOST
      value: pg-n8n-primary
    - name: DB_POSTGRESDB_PORT
      value: "5432"
    - name: DB_POSTGRESDB_DATABASE
      value: n8n
    - name: DB_POSTGRESDB_USER
      valueFrom:
        secretKeyRef:
          name: n8n-pg-secret
          key: username
    - name: DB_POSTGRESDB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: n8n-pg-secret
          key: password
    - name: EXECUTIONS_MODE
      value: queue
    - name: QUEUE_BULL_REDIS_HOST
      value: redis-n8n
    - name: QUEUE_BULL_REDIS_PORT
      value: "6379"
    - name: N8N_BASIC_AUTH_ACTIVE
      value: "true"
    - name: N8N_BASIC_AUTH_USER
      valueFrom:
        secretKeyRef:
          name: n8n-basic-auth
          key: user
    - name: N8N_BASIC_AUTH_PASSWORD
      valueFrom:
        secretKeyRef:
          name: n8n-basic-auth
          key: password

Resource limits and service definition

  resources:
    limits:
      cpu: "2"
      memory: "2Gi"
    requests:
      cpu: "500m"
      memory: "512Mi"
  service:
    type: ClusterIP
    port: 5678

Worker replica count (horizontal scaling)

worker:
  replicaCount: 3   # increase to meet concurrency needs
  resources:
    limits:
      cpu: "1"
      memory: "1Gi"

EEFA Note: Start with modest resources.limits. Over‑provisioning workers can starve the API pod and cause request timeouts. Bumping the worker replica count is usually faster than hunting a hidden bottleneck.
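When replica bumps become routine, queue-driven autoscaling is the next step. A sketch of a HorizontalPodAutoscaler keyed to queue depth, assuming a Prometheus Adapter already exposes `n8n_queue_length` as an external metric; the deployment name and thresholds are illustrative:

```yaml
# Sketch only: assumes the Prometheus Adapter exposes n8n_queue_length
# as an external metric. Names and numbers are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: n8n-worker
  namespace: n8n
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: n8n-worker
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: n8n_queue_length
        target:
          type: Value
          value: "50"   # add workers once ~50 jobs are waiting
```

This keeps the worker fleet proportional to backlog instead of relying on manual `replicaCount` changes.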


4. Security & Compliance Checklist

| Item | Implementation Detail | Verification |
|---|---|---|
| TLS everywhere | Use cert-manager to provision certs for the Ingress and internal services (Redis, PostgreSQL) | `kubectl get secret <name>-tls` |
| Least-privilege DB user | PostgreSQL role with SELECT, INSERT, UPDATE, DELETE on the n8n schema only | `\du` in psql |
| Credential encryption | n8n encrypts stored credentials with `N8N_ENCRYPTION_KEY`; store the key in a K8s secret, and rotate it only alongside a planned re-encryption of stored credentials | `kubectl describe secret n8n-encryption` |
| Network policies | Deny all traffic by default; allow only API ↔ Worker, Worker ↔ Redis, API ↔ PostgreSQL | `kubectl get netpol` |
| Audit logging | Set PostgreSQL `log_statement` and `log_line_prefix`, and forward logs to Loki/ELK | Check log entries for `INSERT INTO credential` |
| RBAC for the API | Set `N8N_BASIC_AUTH_ACTIVE=true` and configure user/pass | `curl` without credentials should return 401 |
| Secret management | Use an external secret store (AWS Secrets Manager, HashiCorp Vault) via a CSI driver | Verify secret rotation works without a pod restart |
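The network-policy row can be sketched concretely. This example applies a default-deny baseline to the namespace, then opens a single path (worker to Redis); the pod labels are assumptions and must match your chart's labels:

```yaml
# Sketch: default-deny for the n8n namespace plus one allow rule.
# Pod labels (app: ...) are assumptions for illustration.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: n8n
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes: [Ingress, Egress]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-worker-to-redis
  namespace: n8n
spec:
  podSelector:
    matchLabels:
      app: redis-n8n           # the Redis pods being protected
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: n8n-worker  # only workers may connect
      ports:
        - port: 6379
```

Repeat the allow pattern for API ↔ Worker and API ↔ PostgreSQL to complete the matrix from the table.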

EEFA Note: If the API is public, add rate-limiting annotations to the Ingress to mitigate credential brute-force attacks. In practice, even a modest limit cuts down noisy scans dramatically.


5. High‑Availability & Disaster Recovery

5.1 PostgreSQL HA Blueprint

  1. Primary‑Replica (Patroni) – automatic failover within ~30 s.
  2. WAL Archiving to an S3 bucket → enables point‑in‑time recovery.
  3. Scheduled pg_dump (nightly) → stored in immutable object storage.
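The nightly `pg_dump` from step 3 can be sketched as a Kubernetes CronJob. The image, secret name, and bucket URL are assumptions; in particular, the container image is assumed to bundle both `pg_dump` and the `aws` CLI:

```yaml
# Sketch: nightly logical backup shipped to object storage.
# Image, secret contents, and bucket name are illustrative assumptions.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: n8n-pg-backup
  namespace: n8n
spec:
  schedule: "0 2 * * *"        # 02:00 nightly
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: example/pg-backup:15   # assumed image with pg_dump + aws CLI
              envFrom:
                - secretRef:
                    name: n8n-pg-backup-secret   # PGHOST, PGUSER, PGPASSWORD, S3 creds
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump -Fc n8n > /tmp/n8n.dump &&
                  aws s3 cp /tmp/n8n.dump "s3://example-n8n-backups/$(date +%F).dump"
```

Pair this with object-lock or bucket versioning so a compromised cluster cannot delete its own backups.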

Trigger a manual failover for testing:

kubectl exec -n n8n -it patroni-0 -- patronictl -c /etc/patroni.yml failover

5.2 Worker Redundancy

Deploy 3+ replicas behind a ClusterIP service. With EXECUTIONS_MODE=queue, any worker can pick up pending jobs, ensuring no single point of failure.

5.3 Redis Persistence

| Persistence Mode | Description | Recommended |
|---|---|---|
| AOF (append-only file) | Logs every write operation; fast recovery | ✔︎ |
| RDB snapshots | Periodic full dumps; lower I/O | ✖︎ (use only as a secondary) |

Redis configuration enabling AOF (pass these directives through your Redis chart's configuration values):

appendonly: "yes"
save: "900 1"   # snapshot every 15 min if at least 1 key changed

5.4 Disaster‑Recovery Runbook

| Step | Action | Owner |
|---|---|---|
| 1 | Verify DB replica health (`patronictl list`) | DBA |
| 2 | Spin up a fresh PostgreSQL from the latest WAL archive | Ops |
| 3 | Re-point n8n `DB_POSTGRESDB_HOST` to the new primary (ConfigMap rollout) | DevOps |
| 4 | Run health checks (`curl /healthz`) on API and workers | QA |
| 5 | Validate workflow history in the UI | Product |

6. Observability & Alerting

6.1 Prometheus Exporter Annotations

Add these annotations to the n8n deployment so Prometheus can scrape metrics:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "5678"

| Metric | Meaning | Alert Threshold |
|---|---|---|
| `n8n_executions_total` | Total executions processed | (baseline; no alert) |
| `n8n_executions_failed_total` | Failed executions | > 5/min |
| `n8n_queue_length` | Jobs waiting in Redis | > 100 |
| `process_resident_memory_bytes` | Memory per pod | > 80 % of limit |
| `http_request_duration_seconds` | API latency | p95 > 500 ms |
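The two most actionable thresholds from the table can be expressed as Prometheus alerting rules. This is a sketch in the standard rule-file format; the metric names follow the table and may differ between n8n versions:

```yaml
# Sketch: alert rules for queue backlog and execution failures.
# Metric names are taken from the table above and may vary by n8n version.
groups:
  - name: n8n
    rules:
      - alert: N8nQueueBacklog
        expr: n8n_queue_length > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "n8n queue backlog ({{ $value }} jobs waiting)"
      - alert: N8nExecutionFailures
        expr: rate(n8n_executions_failed_total[5m]) * 60 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 5 failed n8n executions per minute"
```

The `for: 5m` clause suppresses alerts on momentary spikes so on-call pages only fire on sustained problems.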

6.2 Sample Grafana Dashboard (JSON snippet)

The following JSON defines two panels: queue depth and execution failures.

{
  "panels": [
    {
      "type": "graph",
      "title": "Queue Depth",
      "targets": [{ "expr": "n8n_queue_length" }]
    },
    {
      "type": "graph",
      "title": "Execution Failures",
      "targets": [{ "expr": "rate(n8n_executions_failed_total[5m])" }]
    }
  ]
}

Import this JSON into Grafana to get immediate visibility into bottlenecks.


7. Cost‑Optimization Checklist

| Item | How to Optimize |
|---|---|
| Right-sized workers | Start with CPU 0.5 / Mem 512Mi; auto-scale based on `n8n_queue_length` |
| Spot instances (K8s) | Use node pools with spot/preemptible VMs for workers (they are stateless) |
| Redis persistence tier | Enable **AOF** only; disable RDB snapshots if storage cost is a concern |
| PostgreSQL storage | Enable **storage autoscaling**; set the maximum size to 2× expected data growth |
| Dev-mode logs | Set `N8N_LOG_LEVEL=error` in prod to reduce log volume |

8. Step‑by‑Step Production Rollout (Docker‑Compose Example)

Below the compose file is broken into logical service blocks for readability.

8.1 Database Service

db:
  image: postgres:15-alpine
  environment:
    POSTGRES_DB: n8n
    POSTGRES_USER: n8n
    POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
  volumes:
    - pg-data:/var/lib/postgresql/data
  restart: unless-stopped

8.2 Redis Service

redis:
  image: redis:7-alpine
  command: ["redis-server", "--appendonly", "yes"]
  volumes:
    - redis-data:/data
  restart: unless-stopped

8.3 API Service

api:
  image: n8nio/n8n:1.30.0
  environment: &n8n-env        # anchored so the worker can reuse these vars
    - DB_TYPE=postgresdb
    - DB_POSTGRESDB_HOST=db
    - DB_POSTGRESDB_PORT=5432
    - DB_POSTGRESDB_DATABASE=n8n
    - DB_POSTGRESDB_USER=n8n
    - DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD}
    - EXECUTIONS_MODE=queue
    - QUEUE_BULL_REDIS_HOST=redis
    - QUEUE_BULL_REDIS_PORT=6379
    - N8N_BASIC_AUTH_ACTIVE=true
    - N8N_BASIC_AUTH_USER=${ADMIN_USER}
    - N8N_BASIC_AUTH_PASSWORD=${ADMIN_PASS}
  ports:
    - "5678:5678"
  depends_on:
    - db
    - redis
  restart: unless-stopped

8.4 Worker Service

worker:
  image: n8nio/n8n:1.30.0
  command: worker               # the image entrypoint runs `n8n worker`
  environment: *n8n-env         # reuse the API's environment via the YAML anchor
  depends_on:
    - api
    - redis
  restart: unless-stopped

8.5 Compose Wrapper

version: "3.8"
services:
  db:      # defined above
  redis:   # defined above
  api:     # defined above
  worker:  # defined above

volumes:
  pg-data:
  redis-data:

**Rollout steps**

  1. Create the required secrets (POSTGRES_PASSWORD, ADMIN_USER, ADMIN_PASS).
  2. Run docker compose up -d. The API becomes reachable at http://<host>:5678; put a TLS-terminating reverse proxy in front before exposing it publicly.
  3. Verify the queue is working (the key name depends on your n8n version and Bull prefix; by default queue keys start with bull:jobs):
    docker exec -it <redis_container> redis-cli LLEN bull:jobs:wait
    
  4. Scale workers as needed, e.g. docker compose up -d --scale worker=4.

EEFA Note: When stopping the API for upgrades, use docker compose stop api (graceful) to let in‑flight executions finish; a hard kill can lose partial runs.


9. Frequently Asked Production Questions

| Question | Short Answer |
|---|---|
| Can I keep SQLite for prod? | No. It cannot survive pod restarts or scaling. |
| Do I need both Redis and a queue? | Yes. `EXECUTIONS_MODE=queue` requires a broker; otherwise only the API pod can run jobs. |
| How many workers do I need? | Start with workers = ceil(peak concurrent executions ÷ per-worker concurrency, 10 by default). Adjust via autoscaler. |
| Is n8n thread-safe? | Each worker runs a single Node.js event loop; concurrency comes from adding workers, not threads. |
| Can I use MySQL? | Supported, but PostgreSQL offers better JSONB performance for workflow payloads. |
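One practical sizing sketch: divide peak concurrent executions by each worker's concurrency (10 by default) and round up. The peak value below is made up for illustration:

```shell
# Ceiling division: workers = ceil(peak / per_worker).
peak=45          # assumed peak concurrent executions (example value)
per_worker=10    # n8n worker concurrency (--concurrency, default 10)
workers=$(( (peak + per_worker - 1) / per_worker ))
echo "workers needed: $workers"
```

Here 45 concurrent executions at 10 jobs per worker yields 5 replicas; feed the same arithmetic into your autoscaler's min/max bounds.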

Conclusion

Deploying n8n in production demands stateless execution, a robust PostgreSQL backend, a persistent queue (Redis), and full observability. By separating the API from worker pods, enforcing TLS/RBAC, and automating backups, you eliminate single points of failure and gain the ability to scale horizontally. Follow the checklist, use the provided Helm/Docker‑Compose snippets, and monitor key metrics to keep the system healthy. This architecture has been battle‑tested in real‑world pipelines and delivers reliable, secure workflow automation at scale.
