Production-Grade n8n Architecture

A step-by-step guide to building a production-grade n8n architecture


Who this is for: SREs, DevOps engineers, and backend developers who need a reliable, horizontally-scalable n8n deployment in production.


Quick Diagnosis

Your n8n workflows are missing executions, exposing credentials, or failing under load. The root cause is typically a stateful, single‑node setup. Re‑architect with a stateless execution layer, external PostgreSQL, a durable queue, and proper monitoring.

TL;DR – Deploy n8n on Kubernetes (or Docker Compose at small scale) with PostgreSQL, Redis, Prometheus/Grafana, TLS, RBAC, and daily backups.
These problems rarely surface in development; in production they appear as soon as you run more than a handful of concurrent executions.


1. Core Production Requirements

Audit your current setup against the table below and fix any of these anti-patterns before continuing.

| Requirement | Why It Matters | n8n Default | Production-Ready Alternative |
|---|---|---|---|
| Stateless execution | Enables horizontal scaling and zero-downtime deploys | In-process execution (single process) | Separate worker pods/containers that pull jobs from a queue |
| Durable data store | Guarantees workflow state, credentials, logs | SQLite (file-based) | PostgreSQL 13+ (managed or self-hosted) |
| Reliable job queue | Prevents lost executions when the API crashes | In-memory (no queue) | Redis (or RabbitMQ) as the broker for `EXECUTIONS_MODE=queue` |
| TLS & auth | Protects credentials in transit and at rest | Optional, self-signed | Ingress TLS + OAuth2 / API-key enforcement |
| High availability (HA) | No single point of failure | Single pod | Replicated DB, multiple workers, load-balanced API |
| Observability | Early detection of bottlenecks and failures | Minimal logs | Prometheus metrics + Grafana dashboards + structured logging |
| Backup & DR | Prevents data loss | Manual file copy | Automated `pg_dump` + WAL archiving, point-in-time restore |

EEFA Note: Switching from SQLite to PostgreSQL requires a data migration (`n8n export:workflow` and `n8n export:credentials`, then the matching `import:` commands against the new database). Perform this in a maintenance window to avoid corrupting running workflows.
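A minimal migration sketch using the n8n CLI. It assumes the CLI is on the PATH and that the target PostgreSQL instance is reachable; hostnames and file paths below are illustrative, not prescriptive:

```bash
# Illustrative only: run inside a maintenance window, with no active executions.

# 1. Export workflows and credentials from the SQLite-backed instance.
n8n export:workflow --all --output=/backup/workflows.json
n8n export:credentials --all --output=/backup/credentials.json

# 2. Point n8n at PostgreSQL (example values).
export DB_TYPE=postgresdb
export DB_POSTGRESDB_HOST=pg-n8n-primary
export DB_POSTGRESDB_DATABASE=n8n

# 3. Import into the new database.
n8n import:workflow --input=/backup/workflows.json
n8n import:credentials --input=/backup/credentials.json
```

Keep the same `N8N_ENCRYPTION_KEY` across the move, or the imported credentials cannot be decrypted.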

Most teams hit these gaps after a few weeks of traffic, not on day one.


2. Architecture Blueprint

The diagram below separates the control plane (the API service) from the data plane (workers, queue, and database): a stateless API enqueues work, worker pods process jobs, and a resilient data layer persists state.

+-------------------+       +-------------------+       +-------------------+
|   Ingress (TLS)   | <---> |  n8n API Service  | <---> |   Redis Queue     |
+-------------------+       +-------------------+       +-------------------+
                                 ^   |
                                 |   v
                        +-------------------+
                        |  n8n Worker Pods  |
                        +-------------------+
                                 |
                                 v
                        +-------------------+
                        | PostgreSQL (HA)   |
                        +-------------------+
                                 |
                                 v
                        +-------------------+
                        | Prometheus/Grafana|
                        +-------------------+

A quick walk‑through: the Ingress terminates TLS and hands traffic to the API service. The API writes execution requests to Redis. Worker Pods pull from the queue, run the workflow, and persist results to PostgreSQL. Prometheus scrapes metrics from both API and workers for Grafana to visualise.


3. Deployment Options

The table below compares the main deployment platforms; pick the one that matches your scale and operational capacity.

| Platform | Pros | Cons | When to Choose |
|---|---|---|---|
| Docker Compose (single node) | Quick start, easy local dev, low cost | No native HA, manual scaling, limited monitoring | PoC, < 10 concurrent executions |
| Docker Swarm | Built-in service replication, simple networking | Declining community support, limited autoscaling | Small-to-mid teams already on Swarm |
| Kubernetes (Helm) | Declarative, auto-scaling, native secrets, robust ecosystem | Higher operational overhead, learning curve | Production, ≥ 20 concurrent executions, need for HA |
| Managed n8n Cloud | Zero-ops, automatic backups, SLA | Vendor lock-in, less control over network topology | Teams without ops resources, compliance OK with provider |

3.1 Helm Values – Core Settings

Below are the essential Helm values split into focused snippets.

Environment variables (DB, queue, auth)

n8n:
  env:
    - name: DB_TYPE
      value: postgresdb
    - name: DB_POSTGRESDB_HOST
      value: pg-n8n-primary
    - name: DB_POSTGRESDB_PORT
      value: "5432"
    - name: DB_POSTGRESDB_DATABASE
      value: n8n
    - name: DB_POSTGRESDB_USER
      valueFrom:
        secretKeyRef:
          name: n8n-pg-secret
          key: username
    - name: DB_POSTGRESDB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: n8n-pg-secret
          key: password
    - name: EXECUTIONS_MODE
      value: queue
    - name: QUEUE_BULL_REDIS_HOST
      value: redis-n8n
    - name: QUEUE_BULL_REDIS_PORT
      value: "6379"
    - name: N8N_BASIC_AUTH_ACTIVE
      value: "true"
    - name: N8N_BASIC_AUTH_USER
      valueFrom:
        secretKeyRef:
          name: n8n-basic-auth
          key: user
    - name: N8N_BASIC_AUTH_PASSWORD
      valueFrom:
        secretKeyRef:
          name: n8n-basic-auth
          key: password

Resource limits and service definition

  resources:
    limits:
      cpu: "2"
      memory: "2Gi"
    requests:
      cpu: "500m"
      memory: "512Mi"
  service:
    type: ClusterIP
    port: 5678

Worker replica count (horizontal scaling)

worker:
  replicaCount: 3   # increase to meet concurrency needs
  resources:
    limits:
      cpu: "1"
      memory: "1Gi"

EEFA Note: Start with modest resources.limits. Over‑provisioning workers can starve the API pod and cause request timeouts. Bumping the worker replica count is usually faster than hunting a hidden bottleneck.
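When replica bumps become routine, queue-driven autoscaling is the next step. A sketch of a HorizontalPodAutoscaler keyed to queue depth, assuming a Prometheus Adapter already exposes `n8n_queue_length` as an external metric; the deployment name and thresholds are illustrative:

```yaml
# Sketch only: assumes the Prometheus Adapter exposes n8n_queue_length
# as an external metric. Names and numbers are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: n8n-worker
  namespace: n8n
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: n8n-worker
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: n8n_queue_length
        target:
          type: Value
          value: "50"   # add workers once ~50 jobs are waiting
```

This keeps the worker fleet proportional to backlog instead of relying on manual `replicaCount` changes.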


4. Security & Compliance Checklist

| Item | Implementation Detail | Verification |
|---|---|---|
| TLS everywhere | Use cert-manager to provision certs for the Ingress and internal services (Redis, PostgreSQL) | `kubectl get secret <name>-tls` |
| Least-privilege DB user | PostgreSQL role with SELECT, INSERT, UPDATE, DELETE on the n8n schema only | `\du` in psql |
| Credential encryption | n8n encrypts stored credentials with `N8N_ENCRYPTION_KEY`; store the key in a K8s secret, and rotate it only alongside a planned re-encryption of stored credentials | `kubectl describe secret n8n-encryption` |
| Network policies | Deny all traffic by default; allow only API ↔ Worker, Worker ↔ Redis, API ↔ PostgreSQL | `kubectl get netpol` |
| Audit logging | Set PostgreSQL `log_statement` and `log_line_prefix`, and forward logs to Loki/ELK | Check log entries for `INSERT INTO credential` |
| RBAC for the API | Set `N8N_BASIC_AUTH_ACTIVE=true` and configure user/pass | `curl` without credentials should return 401 |
| Secret management | Use an external secret store (AWS Secrets Manager, HashiCorp Vault) via a CSI driver | Verify secret rotation works without a pod restart |
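The network-policy row can be sketched concretely. This example applies a default-deny baseline to the namespace, then opens a single path (worker to Redis); the pod labels are assumptions and must match your chart's labels:

```yaml
# Sketch: default-deny for the n8n namespace plus one allow rule.
# Pod labels (app: ...) are assumptions for illustration.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: n8n
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes: [Ingress, Egress]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-worker-to-redis
  namespace: n8n
spec:
  podSelector:
    matchLabels:
      app: redis-n8n           # the Redis pods being protected
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: n8n-worker  # only workers may connect
      ports:
        - port: 6379
```

Repeat the allow pattern for API ↔ Worker and API ↔ PostgreSQL to complete the matrix from the table.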

EEFA Note: If the API is public, add rate-limiting annotations to the Ingress to mitigate credential brute-force attacks. In practice, even a modest limit cuts down noisy scans dramatically.


5. High‑Availability & Disaster Recovery

5.1 PostgreSQL HA Blueprint

  1. Primary‑Replica (Patroni) – automatic failover within ~30 s.
  2. WAL Archiving to an S3 bucket → enables point‑in‑time recovery.
  3. Scheduled pg_dump (nightly) → stored in immutable object storage.
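The nightly `pg_dump` from step 3 can be sketched as a Kubernetes CronJob. The image, secret name, and bucket URL are assumptions; in particular, the container image is assumed to bundle both `pg_dump` and the `aws` CLI:

```yaml
# Sketch: nightly logical backup shipped to object storage.
# Image, secret contents, and bucket name are illustrative assumptions.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: n8n-pg-backup
  namespace: n8n
spec:
  schedule: "0 2 * * *"        # 02:00 nightly
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: example/pg-backup:15   # assumed image with pg_dump + aws CLI
              envFrom:
                - secretRef:
                    name: n8n-pg-backup-secret   # PGHOST, PGUSER, PGPASSWORD, S3 creds
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump -Fc n8n > /tmp/n8n.dump &&
                  aws s3 cp /tmp/n8n.dump "s3://example-n8n-backups/$(date +%F).dump"
```

Pair this with object-lock or bucket versioning so a compromised cluster cannot delete its own backups.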

Trigger a manual failover for testing:

kubectl exec -n n8n -it patroni-0 -- patronictl -c /etc/patroni.yml failover

5.2 Worker Redundancy

Deploy 3+ replicas behind a ClusterIP service. With EXECUTIONS_MODE=queue, any worker can pick up pending jobs, ensuring no single point of failure.

5.3 Redis Persistence

| Persistence Mode | Description | Recommended |
|---|---|---|
| AOF (append-only file) | Logs every write operation; fast recovery | ✔︎ |
| RDB snapshots | Periodic full dumps; lower I/O | ✖︎ (use only as a secondary) |

Redis configuration enabling AOF (pass these directives through your Redis chart's configuration values):

appendonly: "yes"
save: "900 1"   # snapshot every 15 min if at least 1 key changed

5.4 Disaster‑Recovery Runbook

| Step | Action | Owner |
|---|---|---|
| 1 | Verify DB replica health (`patronictl list`) | DBA |
| 2 | Spin up a fresh PostgreSQL from the latest WAL archive | Ops |
| 3 | Re-point n8n `DB_POSTGRESDB_HOST` to the new primary (ConfigMap rollout) | DevOps |
| 4 | Run health checks (`curl /healthz`) on API and workers | QA |
| 5 | Validate workflow history in the UI | Product |

6. Observability & Alerting

6.1 Prometheus Exporter Annotations

Add these annotations to the n8n deployment so Prometheus can scrape metrics:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "5678"

| Metric | Meaning | Alert Threshold |
|---|---|---|
| `n8n_executions_total` | Total executions processed | (baseline; no alert) |
| `n8n_executions_failed_total` | Failed executions | > 5/min |
| `n8n_queue_length` | Jobs waiting in Redis | > 100 |
| `process_resident_memory_bytes` | Memory per pod | > 80 % of limit |
| `http_request_duration_seconds` | API latency | p95 > 500 ms |
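The two most actionable thresholds from the table can be expressed as Prometheus alerting rules. This is a sketch in the standard rule-file format; the metric names follow the table and may differ between n8n versions:

```yaml
# Sketch: alert rules for queue backlog and execution failures.
# Metric names are taken from the table above and may vary by n8n version.
groups:
  - name: n8n
    rules:
      - alert: N8nQueueBacklog
        expr: n8n_queue_length > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "n8n queue backlog ({{ $value }} jobs waiting)"
      - alert: N8nExecutionFailures
        expr: rate(n8n_executions_failed_total[5m]) * 60 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 5 failed n8n executions per minute"
```

The `for: 5m` clause suppresses alerts on momentary spikes so on-call pages only fire on sustained problems.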

6.2 Sample Grafana Dashboard (JSON snippet)

The following JSON defines two panels: queue depth and execution failures.

{
  "panels": [
    {
      "type": "graph",
      "title": "Queue Depth",
      "targets": [{ "expr": "n8n_queue_length" }]
    },
    {
      "type": "graph",
      "title": "Execution Failures",
      "targets": [{ "expr": "rate(n8n_executions_failed_total[5m])" }]
    }
  ]
}

Import this JSON into Grafana to get immediate visibility into bottlenecks.


7. Cost‑Optimization Checklist

| Item | How to Optimize |
|---|---|
| Right-sized workers | Start with CPU 0.5 / Mem 512Mi; auto-scale based on `n8n_queue_length` |
| Spot instances (K8s) | Use node pools with spot/preemptible VMs for workers (they are stateless) |
| Redis persistence tier | Enable **AOF** only; disable RDB snapshots if storage cost is a concern |
| PostgreSQL storage | Enable **storage autoscaling**; set the maximum size to 2× expected data growth |
| Dev-mode logs | Set `N8N_LOG_LEVEL=error` in prod to reduce log volume |

8. Step‑by‑Step Production Rollout (Docker‑Compose Example)

Below the compose file is broken into logical service blocks for readability.

8.1 Database Service

db:
  image: postgres:15-alpine
  environment:
    POSTGRES_DB: n8n
    POSTGRES_USER: n8n
    POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
  volumes:
    - pg-data:/var/lib/postgresql/data
  restart: unless-stopped

8.2 Redis Service

redis:
  image: redis:7-alpine
  command: ["redis-server", "--appendonly", "yes"]
  volumes:
    - redis-data:/data
  restart: unless-stopped

8.3 API Service

api:
  image: n8nio/n8n:1.30.0
  environment: &n8n-env        # anchored so the worker can reuse these vars
    - DB_TYPE=postgresdb
    - DB_POSTGRESDB_HOST=db
    - DB_POSTGRESDB_PORT=5432
    - DB_POSTGRESDB_DATABASE=n8n
    - DB_POSTGRESDB_USER=n8n
    - DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD}
    - EXECUTIONS_MODE=queue
    - QUEUE_BULL_REDIS_HOST=redis
    - QUEUE_BULL_REDIS_PORT=6379
    - N8N_BASIC_AUTH_ACTIVE=true
    - N8N_BASIC_AUTH_USER=${ADMIN_USER}
    - N8N_BASIC_AUTH_PASSWORD=${ADMIN_PASS}
  ports:
    - "5678:5678"
  depends_on:
    - db
    - redis
  restart: unless-stopped

8.4 Worker Service

worker:
  image: n8nio/n8n:1.30.0
  command: worker               # the image entrypoint runs `n8n worker`
  environment: *n8n-env         # reuse the API's environment via the YAML anchor
  depends_on:
    - api
    - redis
  restart: unless-stopped

8.5 Compose Wrapper

version: "3.8"
services:
  db:      # defined above
  redis:   # defined above
  api:     # defined above
  worker:  # defined above

volumes:
  pg-data:
  redis-data:

**Rollout steps**

  1. Create the required secrets (POSTGRES_PASSWORD, ADMIN_USER, ADMIN_PASS).
  2. Run docker compose up -d. The API becomes reachable at http://<host>:5678; put a TLS-terminating reverse proxy in front before exposing it publicly.
  3. Verify the queue is working (the key name depends on your n8n version and Bull prefix; by default queue keys start with bull:jobs):
    docker exec -it <redis_container> redis-cli LLEN bull:jobs:wait
    
  4. Scale workers as needed, e.g. docker compose up -d --scale worker=4.

EEFA Note: When stopping the API for upgrades, use docker compose stop api (graceful) to let in‑flight executions finish; a hard kill can lose partial runs.


9. Frequently Asked Production Questions

| Question | Short Answer |
|---|---|
| Can I keep SQLite for prod? | No. It cannot survive pod restarts or scaling. |
| Do I need both Redis and a queue? | Yes. `EXECUTIONS_MODE=queue` requires a broker; otherwise only the API pod can run jobs. |
| How many workers do I need? | Start with workers = ceil(peak concurrent executions ÷ per-worker concurrency, 10 by default). Adjust via autoscaler. |
| Is n8n thread-safe? | Each worker runs a single Node.js event loop; concurrency comes from adding workers, not threads. |
| Can I use MySQL? | Supported, but PostgreSQL offers better JSONB performance for workflow payloads. |
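One practical sizing sketch: divide peak concurrent executions by each worker's concurrency (10 by default) and round up. The peak value below is made up for illustration:

```shell
# Ceiling division: workers = ceil(peak / per_worker).
peak=45          # assumed peak concurrent executions (example value)
per_worker=10    # n8n worker concurrency (--concurrency, default 10)
workers=$(( (peak + per_worker - 1) / per_worker ))
echo "workers needed: $workers"
```

Here 45 concurrent executions at 10 jobs per worker yields 5 replicas; feed the same arithmetic into your autoscaler's min/max bounds.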

Conclusion

Deploying n8n in production demands stateless execution, a robust PostgreSQL backend, a persistent queue (Redis), and full observability. By separating the API from worker pods, enforcing TLS/RBAC, and automating backups, you eliminate single points of failure and gain the ability to scale horizontally. Follow the checklist, use the provided Helm/Docker‑Compose snippets, and monitor key metrics to keep the system healthy. This architecture has been battle‑tested in real‑world pipelines and delivers reliable, secure workflow automation at scale.
