Who this is for: SREs, DevOps engineers, and backend developers who need a reliable, horizontally‑scalable n8n deployment in production.
Quick Diagnosis
Your n8n workflows are missing executions, exposing credentials, or failing under load. The root cause is typically a stateful, single‑node setup. Re‑architect with a stateless execution layer, external PostgreSQL, a durable queue, and proper monitoring.
TL;DR – Deploy n8n on Kubernetes (or Docker‑Compose for small scale) with PostgreSQL, Redis, Prometheus‑Grafana, TLS, RBAC, and daily backups.
In production these problems surface once you have more than a handful of concurrent runs.
1. Core Production Requirements
| Requirement | Why It Matters | n8n Default | Production‑Ready Alternative |
|---|---|---|---|
| Stateless execution | Enables horizontal scaling and zero‑downtime deploys | In‑process execution (single‑process) | Separate worker pods/containers that pull jobs from a queue |
| Durable data store | Guarantees workflow state, credentials, logs | SQLite (file‑based) | PostgreSQL 13+ (managed or self‑hosted) |
| Reliable job queue | Prevents lost executions when the API crashes | In‑memory (no queue) | Redis (or RabbitMQ) as a broker for EXECUTIONS_MODE=queue |
| TLS & Auth | Protects credentials in transit and at rest | Optional, self‑signed | Ingress TLS + OAuth2 / API‑Key enforcement |
| High‑availability (HA) | No single point of failure | Single pod | Replicated DB, multiple workers, load‑balanced API |
| Observability | Early detection of bottlenecks & failures | Minimal logs | Prometheus metrics + Grafana dashboards + structured logging |
| Backup & DR | Prevent data loss | Manual file copy | Automated PGDump + WAL archiving, point‑in‑time restore |
**Note:** Switching from SQLite to PostgreSQL requires a data migration (`n8n export:workflow` / `n8n export:credentials`, then the matching `import` commands against the new database). Perform this in a maintenance window to avoid corrupting running workflows.
Most teams hit these gaps after a few weeks of traffic, not on day one.
2. Architecture Blueprint
The diagram shows a stateless API that enqueues work, worker pods that process jobs, and a resilient data layer.
```
+-------------------+       +-------------------+       +-------------------+
|   Ingress (TLS)   | <---> |  n8n API Service  | <---> |    Redis Queue    |
+-------------------+       +-------------------+       +-------------------+
                                      ^                           |
                                      |                           v
                                                        +-------------------+
                                                        |  n8n Worker Pods  |
                                                        +-------------------+
                                                                  |
                                                                  v
                                                        +-------------------+
                                                        |  PostgreSQL (HA)  |
                                                        +-------------------+
                                                                  |
                                                                  v
                                                        +-------------------+
                                                        | Prometheus/Grafana|
                                                        +-------------------+
```
A quick walk‑through: the Ingress terminates TLS and hands traffic to the API service. The API writes execution requests to Redis. Worker Pods pull from the queue, run the workflow, and persist results to PostgreSQL. Prometheus scrapes metrics from both API and workers for Grafana to visualise.
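The enqueue/execute/persist loop can be sketched in a few lines of Python. This is an illustration of the pattern only, not n8n internals: a `deque` stands in for the Redis list and a dict for PostgreSQL.

```python
from collections import deque

# Stand-ins for the real infrastructure (illustrative only):
# Redis queue -> deque, PostgreSQL -> dict keyed by execution id.
queue = deque()
results_db = {}

def api_enqueue(execution_id, workflow):
    """API pod: accept a request and push it onto the queue."""
    queue.append((execution_id, workflow))

def worker_poll():
    """Worker pod: pull one job, run it, persist the result."""
    if not queue:
        return None
    execution_id, workflow = queue.popleft()
    result = workflow()                  # run the workflow
    results_db[execution_id] = result    # persist to the data layer
    return execution_id

api_enqueue("exec-1", lambda: "ok")
worker_poll()
print(results_db)  # {'exec-1': 'ok'}
```

Because the API only ever appends and the workers only ever pop, neither side holds execution state: any worker replica can take any job, which is what makes horizontal scaling safe.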
3. Deployment Options
| Platform | Pros | Cons | When to Choose |
|---|---|---|---|
| Docker‑Compose (single‑node) | Quick start, easy local dev, low cost | No native HA, manual scaling, limited monitoring | PoC, < 10 concurrent executions |
| Docker Swarm | Built‑in service replication, simple networking | Declining community support, limited autoscaling | Small‑to‑mid teams already on Swarm |
| Kubernetes (Helm) | Declarative, auto‑scaling, native secrets, robust ecosystem | Higher operational overhead, learning curve | Production, ≥ 20 concurrent executions, need for HA |
| Managed n8n Cloud | Zero‑ops, automatic backups, SLA | Vendor lock‑in, less control over network topology | Teams without ops resources, compliance OK with provider |
3.1 Helm Values – Core Settings
Below are the essential Helm values split into focused snippets.
Environment variables (DB, queue, auth)
```yaml
n8n:
  env:
    - name: DB_TYPE
      value: postgresdb
    - name: DB_POSTGRESDB_HOST
      value: pg-n8n-primary
    - name: DB_POSTGRESDB_PORT
      value: "5432"
    - name: DB_POSTGRESDB_DATABASE
      value: n8n
    - name: DB_POSTGRESDB_USER
      valueFrom:
        secretKeyRef:
          name: n8n-pg-secret
          key: username
    - name: DB_POSTGRESDB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: n8n-pg-secret
          key: password
    - name: EXECUTIONS_MODE
      value: queue
    - name: QUEUE_BULL_REDIS_HOST
      value: redis-n8n
    - name: QUEUE_BULL_REDIS_PORT
      value: "6379"
    - name: N8N_BASIC_AUTH_ACTIVE
      value: "true"
    - name: N8N_BASIC_AUTH_USER
      valueFrom:
        secretKeyRef:
          name: n8n-basic-auth
          key: user
    - name: N8N_BASIC_AUTH_PASSWORD
      valueFrom:
        secretKeyRef:
          name: n8n-basic-auth
          key: password
```
Resource limits and service definition
```yaml
resources:
  limits:
    cpu: "2"
    memory: "2Gi"
  requests:
    cpu: "500m"
    memory: "512Mi"
service:
  type: ClusterIP
  port: 5678
```
Worker replica count (horizontal scaling)
```yaml
worker:
  replicaCount: 3   # increase to meet concurrency needs
  resources:
    limits:
      cpu: "1"
      memory: "1Gi"
```
**Note:** Start with modest `resources.limits`. Over‑provisioning workers can starve the API pod and cause request timeouts; bumping the worker replica count is usually faster than hunting a hidden bottleneck.
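As a rough sizing aid, worker count can be approximated with Little's law (in-flight jobs = arrival rate × service time), divided by how many jobs one worker runs concurrently. This is an approximation, not an official n8n formula; n8n workers default to 10 concurrent jobs (the `--concurrency` flag).

```python
import math

def workers_needed(arrivals_per_sec: float,
                   avg_exec_seconds: float,
                   per_worker_concurrency: int = 10) -> int:
    """Little's law: in-flight jobs = arrival rate x service time.
    Divide by the number of jobs one worker runs concurrently."""
    in_flight = arrivals_per_sec * avg_exec_seconds
    return max(1, math.ceil(in_flight / per_worker_concurrency))

# 5 executions/s averaging 8 s each -> 40 in flight -> 4 workers
print(workers_needed(5, 8))  # 4
```

Use peak (not average) arrival rate when sizing, and let the autoscaler trim the fleet during quiet periods.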
4. Security & Compliance Checklist
| Item | Implementation Detail | Verification |
|---|---|---|
| TLS everywhere | Use cert‑manager to provision certs for Ingress and internal services (Redis, PostgreSQL) | `kubectl get secret <name>-tls` |
| Least‑privilege DB user | PostgreSQL role with SELECT, INSERT, UPDATE, DELETE on the n8n schema only | `\du` in psql |
| Credential encryption | n8n encrypts stored credentials with `N8N_ENCRYPTION_KEY`; store the key in a K8s secret and rotate quarterly | `kubectl describe secret n8n-encryption` |
| Network policies | Deny all traffic by default; allow only API ↔ Worker, Worker ↔ Redis, API ↔ PostgreSQL | `kubectl get netpol` |
| Audit logging | Enable PostgreSQL `log_line_prefix` and forward logs to Loki/ELK | Check log entries for `INSERT INTO credential` |
| RBAC for API | Enable `N8N_BASIC_AUTH_ACTIVE=true` and configure user/pass | Test with `curl -u user:pass …`, expecting 401 without credentials |
| Secret management | Use an external secret store (AWS Secrets Manager, HashiCorp Vault) via CSI driver | Verify secret rotation works without pod restart |
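As an example of the deny‑by‑default approach, here is a minimal NetworkPolicy sketch allowing only workers to reach Redis. The namespace and pod labels (`app: redis`, `app: n8n-worker`) are illustrative assumptions; match them to your actual manifests.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-worker-to-redis
  namespace: n8n            # assumed namespace
spec:
  podSelector:
    matchLabels:
      app: redis            # policy applies to the Redis pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: n8n-worker   # only worker pods may connect
      ports:
        - protocol: TCP
          port: 6379
```

Pair this with a default deny‑all policy in the namespace so anything not explicitly allowed is blocked.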
**Note:** If the API is public, add rate‑limiting annotations to the Ingress to mitigate credential brute‑force attacks; in practice even a modest limit cuts down noisy scans dramatically.
5. High‑Availability & Disaster Recovery
5.1 PostgreSQL HA Blueprint
- Primary‑Replica (Patroni) – automatic failover within ~30 s.
- WAL archiving to an S3 bucket → enables point‑in‑time recovery.
- Scheduled `pg_dump` (nightly) → stored in immutable object storage.

Trigger a manual failover for testing:

```bash
kubectl exec -n n8n -it patroni-0 -- patronictl -c /etc/patroni.yml failover
```
5.2 Worker Redundancy
Deploy 3+ replicas behind a ClusterIP service. With EXECUTIONS_MODE=queue, any worker can pick up pending jobs, ensuring no single point of failure.
5.3 Redis Persistence
| Persistence Mode | Description | Recommended |
|---|---|---|
| AOF (Append‑Only File) | Logs every write operation; fast recovery | ✔︎ |
| RDB Snapshots | Periodic full dumps; lower I/O | ✖︎ (use only as secondary) |
Redis Helm values for AOF:
```yaml
appendonly: "yes"
save: "900 1"   # snapshot every 15 min if at least 1 key changed
```
5.4 Disaster‑Recovery Runbook
| Step | Action | Owner |
|---|---|---|
| 1 | Verify DB replica health (`patronictl list`) | DBA |
| 2 | Spin up a fresh PostgreSQL from the latest WAL archive | Ops |
| 3 | Re‑point n8n `DB_POSTGRESDB_HOST` to the new primary (ConfigMap rollout) | DevOps |
| 4 | Run health checks (`curl /healthz`) on API & workers | QA |
| 5 | Validate workflow history in UI | Product |
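Step 4 can be scripted as a retrying health check. The sketch below is generic: `probe` is an injected function (here a stub) so the loop is testable; in production it would GET `/healthz` and check for HTTP 200.

```python
import time

def wait_healthy(probe, attempts: int = 5, delay: float = 0.0) -> bool:
    """Call probe() until it returns True or attempts run out.
    In production, probe would GET /healthz and check for HTTP 200."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False

# Stub probe for illustration: reports healthy on the third call.
calls = {"n": 0}
def stub_probe():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_healthy(stub_probe))  # True
```

Run it against both the API service and each worker before declaring the failover complete.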
6. Observability & Alerting
6.1 Prometheus Exporter Annotations
Add these annotations to the n8n deployment so Prometheus can scrape metrics:
```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "5678"
```
| Metric | Meaning | Alert Threshold |
|---|---|---|
| n8n_executions_total | Total executions processed | – |
| n8n_executions_failed_total | Failed executions | > 5/min |
| n8n_queue_length | Jobs waiting in Redis | > 100 |
| process_resident_memory_bytes | Memory per pod | > 80 % of limit |
| http_request_duration_seconds | API latency | p95 > 500 ms |
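The thresholds above can be codified as alert rules. Below is a PrometheusRule sketch for the queue‑depth alert (assuming the Prometheus Operator; the rule name and labels are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: n8n-alerts          # illustrative name
spec:
  groups:
    - name: n8n
      rules:
        - alert: N8nQueueBacklog
          expr: n8n_queue_length > 100
          for: 5m            # sustained backlog, not a momentary spike
          labels:
            severity: warning
          annotations:
            summary: "n8n queue depth above 100 for 5 minutes"
```

The `for: 5m` clause avoids paging on brief bursts that the workers clear on their own.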
6.2 Sample Grafana Dashboard (JSON snippet)
The following JSON defines two panels—queue depth and execution failures.
```json
{
  "panels": [
    {
      "type": "graph",
      "title": "Queue Depth",
      "targets": [{ "expr": "n8n_queue_length" }]
    },
    {
      "type": "graph",
      "title": "Execution Failures",
      "targets": [{ "expr": "rate(n8n_executions_failed_total[5m])" }]
    }
  ]
}
```
Import this JSON into Grafana to get immediate visibility into bottlenecks.
7. Cost‑Optimization Checklist
| ✔️ Item | How to Optimize |
|---|---|
| Right‑sized workers | Start with CPU 0.5 / Mem 512Mi; auto‑scale based on n8n_queue_length |
| Spot instances (k8s) | Use node pools with spot/preemptible VMs for workers (stateless) |
| Redis persistence tier | Enable **AOF** only; disable RDB snapshots if storage cost is a concern |
| PostgreSQL storage | Enable **storage autoscaling**; set maxsize to 2× expected data growth |
| Turn off dev‑mode logs | Set N8N_LOG_LEVEL=error in prod to reduce log volume |
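Auto‑scaling workers on queue depth can be wired up with KEDA's Redis lists scaler. This is a sketch under assumptions: the worker Deployment is named `n8n-worker`, and `bull:jobs:wait` stands in for the Bull queue key (inspect your Redis with `KEYS bull:*` to find the real name).

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: n8n-worker-scaler
spec:
  scaleTargetRef:
    name: n8n-worker             # assumed worker Deployment name
  minReplicaCount: 3
  maxReplicaCount: 10
  triggers:
    - type: redis
      metadata:
        address: redis-n8n:6379
        listName: bull:jobs:wait # verify the actual queue key in Redis
        listLength: "20"         # target jobs per replica before scaling up
```

Scaling on queue length rather than CPU matches the cost checklist above: workers are stateless, so they can safely run on spot nodes and scale to the minimum when the queue is empty.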
8. Step‑by‑Step Production Rollout (Docker‑Compose Example)
Below the compose file is broken into logical service blocks for readability.
8.1 Database Service
```yaml
db:
  image: postgres:15-alpine
  environment:
    POSTGRES_DB: n8n
    POSTGRES_USER: n8n
    POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
  volumes:
    - pg-data:/var/lib/postgresql/data
  restart: unless-stopped
```
8.2 Redis Service
```yaml
redis:
  image: redis:7-alpine
  command: ["redis-server", "--appendonly", "yes"]
  volumes:
    - redis-data:/data
  restart: unless-stopped
```
8.3 API Service
```yaml
api:
  image: n8nio/n8n:1.30.0
  environment: &api-env          # anchored so the worker can reuse it
    - DB_TYPE=postgresdb
    - DB_POSTGRESDB_HOST=db
    - DB_POSTGRESDB_PORT=5432
    - DB_POSTGRESDB_DATABASE=n8n
    - DB_POSTGRESDB_USER=n8n
    - DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD}
    - EXECUTIONS_MODE=queue
    - QUEUE_BULL_REDIS_HOST=redis
    - QUEUE_BULL_REDIS_PORT=6379
    - N8N_BASIC_AUTH_ACTIVE=true
    - N8N_BASIC_AUTH_USER=${ADMIN_USER}
    - N8N_BASIC_AUTH_PASSWORD=${ADMIN_PASS}
  ports:
    - "5678:5678"
  depends_on:
    - db
    - redis
  restart: unless-stopped
```
8.4 Worker Service
```yaml
worker:
  image: n8nio/n8n:1.30.0
  command: ["n8n", "worker"]
  environment: *api-env          # reuse the api env vars via the YAML anchor
  depends_on:
    - api
    - redis
  restart: unless-stopped
```
8.5 Compose Wrapper
```yaml
version: "3.8"
services:
  db:      # defined above
  redis:   # defined above
  api:     # defined above
  worker:  # defined above
volumes:
  pg-data:
  redis-data:
```
**Rollout steps**

- Create the required secrets (`POSTGRES_PASSWORD`, `ADMIN_USER`, `ADMIN_PASS`).
- Run `docker compose up -d`. The API becomes reachable at `https://<host>:5678`.
- Verify the queue is working: `docker exec -it <redis_container> redis-cli LLEN n8n:queue`
- Scale workers as needed, e.g. `docker compose up -d --scale worker=4`.
**Note:** When stopping the API for upgrades, use `docker compose stop api` (graceful) to let in‑flight executions finish; a hard kill can lose partial runs.
9. Frequently Asked Production Questions
| Question | Short Answer |
|---|---|
| Can I keep SQLite for prod? | No – it cannot survive pod restarts or scaling. |
| Do I need both Redis and a queue? | Yes. EXECUTIONS_MODE=queue requires a broker; otherwise only the API pod can run jobs. |
| How many workers do I need? | Start with `workers = ceil(peak concurrent executions / per-worker concurrency)`; n8n workers handle 10 concurrent jobs by default. Adjust via autoscaler. |
| Is n8n thread‑safe? | Each worker runs a single Node.js event loop; concurrency is achieved by adding more workers, not threads. |
| Can I use MySQL? | Supported, but PostgreSQL offers better JSONB performance for workflow payloads. |
Conclusion
Deploying n8n in production demands stateless execution, a robust PostgreSQL backend, a persistent queue (Redis), and full observability. By separating the API from worker pods, enforcing TLS/RBAC, and automating backups, you eliminate single points of failure and gain the ability to scale horizontally. Follow the checklist, use the provided Helm/Docker‑Compose snippets, and monitor key metrics to keep the system healthy. This architecture has been battle‑tested in real‑world pipelines and delivers reliable, secure workflow automation at scale.



