Production-Grade n8n Architecture

<figure class="wp-block-image aligncenter"><img src="https://flowgenius.in/wp-content/uploads/2026/01/production-grade-n8n-architecture.png" alt="Step by Step Guide to solve production grade n8n architecture" /> <figcaption style="text-align: center;">Step by Step Guide to solve production grade n8n architecture</p> <hr /> </figcaption></figure> <p style="margin-bottom: 2em; line-height: 1.9;"><strong>Who this is for:</strong> SREs, DevOps engineers, and backend developers who need a reliable, horizontally‑scalable n8n deployment in production. <strong>We cover this in detail in the </strong>Production‑Grade n8n Architecture.</p> <hr style="border: none; margin: 55px 0;" /> <h2 style="margin-bottom: 45px; line-height: 1.3;">Quick Diagnosis</h2> <p style="margin-bottom: 2em; line-height: 1.9;">Your n8n workflows are missing executions, exposing credentials, or failing under load. The root cause is typically a <strong>stateful, single‑node setup</strong>. Re‑architect with a stateless execution layer, external PostgreSQL, a durable queue, and proper monitoring.</p> <blockquote style="margin: 0 0 2em 0; padding-left: 1em; border-left: 4px solid #e0e0e0;"> <p style="margin: 0; line-height: 1.9;"><strong>TL;DR</strong> – Deploy n8n on Kubernetes (or Docker‑Compose for small scale) with PostgreSQL, Redis, Prometheus‑Grafana, TLS, RBAC, and daily backups.<br /> <em>In production you’ll notice it once you have more than a handful of concurrent runs.</em></p> </blockquote> <hr style="border: none; margin: 55px 0;" /> <h2 style="margin-bottom: 45px; line-height: 1.3;">1. 
Core Production Requirements</h2> <p>If you encounter any <a href="/n8n-architecture-anti-patterns">n8n architecture anti patterns </a>resolve them before continuing with the setup.</p> <table style="border-collapse: collapse; width: 100%; margin-bottom: 2em;"> <thead> <tr> <th style="border: 1px solid #e0e0e0; padding: 13px; text-align: left;">Requirement</th> <th style="border: 1px solid #e0e0e0; padding: 13px; text-align: left;">Why It Matters</th> <th style="border: 1px solid #e0e0e0; padding: 13px; text-align: left;">n8n Default</th> <th style="border: 1px solid #e0e0e0; padding: 13px; text-align: left;">Production‑Ready Alternative</th> </tr> </thead> <tbody> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Stateless execution</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Enables horizontal scaling and zero‑downtime deploys</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">In‑process execution (single‑process)</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Separate <strong>worker</strong> pods/containers that pull jobs from a queue</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Durable data store</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Guarantees workflow state, credentials, logs</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">SQLite (file‑based)</td> <td style="border: 1px solid #e0e0e0; padding: 13px;"><strong>PostgreSQL 13+</strong> (managed or self‑hosted)</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Reliable job queue</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Prevents lost executions when the API crashes</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">In‑memory (no queue)</td> <td style="border: 1px solid #e0e0e0; padding: 13px;"><strong>Redis</strong> (or RabbitMQ) as a broker for <code>EXECUTIONS_MODE=queue</code></td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">TLS & Auth</td> <td 
style="border: 1px solid #e0e0e0; padding: 13px;">Protects credentials in transit and at rest</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Optional, self‑signed</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Ingress TLS + OAuth2 / API‑Key enforcement</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">High‑availability (HA)</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">No single point of failure</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Single pod</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Replicated DB, multiple workers, load‑balanced API</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Observability</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Early detection of bottlenecks & failures</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Minimal logs</td> <td style="border: 1px solid #e0e0e0; padding: 13px;"><strong>Prometheus</strong> metrics + <strong>Grafana</strong> dashboards + structured logging</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Backup & DR</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Prevent data loss</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Manual file copy</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Automated <strong>PGDump</strong> + WAL archiving, point‑in‑time restore</td> </tr> </tbody> </table> <blockquote style="margin: 0 0 2em 0; padding-left: 1em; border-left: 4px solid #e0e0e0;"> <p style="margin: 0; line-height: 1.9;"><strong>EEFA Note:</strong> Switching from SQLite to PostgreSQL requires a data migration (<code>n8n export:db</code> → import). 
Perform this in a maintenance window to avoid corrupting running workflows.</p> </blockquote> <p style="margin-bottom: 2em; line-height: 1.9;"><em>Most teams hit these gaps after a few weeks of traffic, not on day one.</em></p> <hr style="border: none; margin: 55px 0;" /> <h2 style="margin-bottom: 45px; line-height: 1.3;">2. Architecture Blueprint</h2> <p style="margin-bottom: 2em; line-height: 1.9;">The diagram shows a stateless API that enqueues work, worker pods that process jobs, and a resilient data layer. If you encounter any <a href="/n8n-control-plane-data-plane">n8n control plane data plane </a>resolve them before continuing with the setup.</p> <pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; line-height: 1.9; margin-bottom: 2em;">+-------------------+ +-------------------+ +-------------------+ | Ingress (TLS) | <---> | n8n API Service | <---> | Redis Queue | +-------------------+ +-------------------+ +-------------------+ ^ | | v +-------------------+ | n8n Worker Pods | +-------------------+ | v +-------------------+ | PostgreSQL (HA) | +-------------------+ | v +-------------------+ | Prometheus/Grafana| +-------------------+ </pre> <p style="margin-bottom: 2em; line-height: 1.9;">A quick walk‑through: the <strong>Ingress</strong> terminates TLS and hands traffic to the API service. The API writes execution requests to <strong>Redis</strong>. <strong>Worker Pods</strong> pull from the queue, run the workflow, and persist results to <strong>PostgreSQL</strong>. Prometheus scrapes metrics from both API and workers for Grafana to visualise.</p> <hr style="border: none; margin: 55px 0;" /> <h2 style="margin-bottom: 45px; line-height: 1.3;">3. 
Deployment Options</h2> <p>If you encounter any <a href="/n8n-multi-tenant-architecture">n8n multi tenant architecture </a>resolve them before continuing with the setup.</p> <table style="border-collapse: collapse; width: 100%; margin-bottom: 2em;"> <thead> <tr> <th style="border: 1px solid #e0e0e0; padding: 13px;">Platform</th> <th style="border: 1px solid #e0e0e0; padding: 13px;">Pros</th> <th style="border: 1px solid #e0e0e0; padding: 13px;">Cons</th> <th style="border: 1px solid #e0e0e0; padding: 13px;">When to Choose</th> </tr> </thead> <tbody> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Docker‑Compose (single‑node)</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Quick start, easy local dev, low cost</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">No native HA, manual scaling, limited monitoring</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">PoC, < 10 concurrent executions</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Docker Swarm</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Built‑in service replication, simple networking</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Declining community support, limited autoscaling</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Small‑to‑mid teams already on Swarm</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Kubernetes (Helm)</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Declarative, auto‑scaling, native secrets, robust ecosystem</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Higher operational overhead, learning curve</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Production, ≥ 20 concurrent executions, need for HA</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Managed n8n Cloud</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Zero‑ops, automatic backups, SLA</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Vendor 
lock‑in, less control over network topology</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Teams without ops resources, compliance OK with provider</td> </tr> </tbody> </table> <h3 style="margin-bottom: 45px; line-height: 1.3;">3.1 Helm Values – Core Settings</h3> <p style="margin-bottom: 2em; line-height: 1.9;">Below are the essential Helm values split into focused snippets.</p> <h4 style="margin-bottom: 45px; line-height: 1.3;">Environment variables (DB, queue, auth)</h4> <pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; line-height: 1.9; margin-bottom: 2em;">n8n: env: - name: DB_TYPE value: postgres - name: DB_POSTGRESDB_HOST value: pg-n8n-primary - name: DB_POSTGRESDB_PORT value: "5432" - name: DB_POSTGRESDB_DATABASE value: n8n - name: DB_POSTGRESDB_USER valueFrom: secretKeyRef: name: n8n-pg-secret key: username - name: DB_POSTGRESDB_PASSWORD valueFrom: secretKeyRef: name: n8n-pg-secret key: password - name: EXECUTIONS_MODE value: queue - name: QUEUE_BULL_REDIS_HOST value: redis-n8n - name: QUEUE_BULL_REDIS_PORT value: "6379" - name: N8N_BASIC_AUTH_ACTIVE value: "true" - name: N8N_BASIC_AUTH_USER valueFrom: secretKeyRef: name: n8n-basic-auth key: user - name: N8N_BASIC_AUTH_PASSWORD valueFrom: secretKeyRef: name: n8n-basic-auth key: password </pre> <h4 style="margin-bottom: 45px; line-height: 1.3;">Resource limits and service definition</h4> <pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; line-height: 1.9; margin-bottom: 2em;"> resources: limits: cpu: "2" memory: "2Gi" requests: cpu: "500m" memory: "512Mi" service: type: ClusterIP port: 5678 </pre> <h4 style="margin-bottom: 45px; line-height: 1.3;">Worker replica count (horizontal scaling)</h4> <pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; line-height: 1.9; margin-bottom: 2em;">worker: replicaCount: 3 # increase to meet concurrency needs resources: limits: cpu: "1" memory: 
"1Gi" </pre> <blockquote style="margin: 0 0 2em 0; padding-left: 1em; border-left: 4px solid #e0e0e0;"> <p style="margin: 0; line-height: 1.9;"><strong>EEFA Note:</strong> Start with modest <code>resources.limits</code>. Over‑provisioning workers can starve the API pod and cause request timeouts. Bumping the worker replica count is usually faster than hunting a hidden bottleneck.</p> </blockquote> <hr style="border: none; margin: 55px 0;" /> <h2 style="margin-bottom: 45px; line-height: 1.3;">4. Security & Compliance Checklist</h2> <table style="border-collapse: collapse; width: 100%; margin-bottom: 2em;"> <thead> <tr> <th style="border: 1px solid #e0e0e0; padding: 13px;">Item</th> <th style="border: 1px solid #e0e0e0; padding: 13px;">Implementation Detail</th> <th style="border: 1px solid #e0e0e0; padding: 13px;">Verification</th> </tr> </thead> <tbody> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">TLS everywhere</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Use cert‑manager to provision certs for Ingress and internal services (Redis, PostgreSQL)</td> <td style="border: 1px solid #e0e0e0; padding: 13px;"><code>kubectl get secret <name>-tls</code></td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Least‑privilege DB user</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">PostgreSQL role with <code>SELECT, INSERT, UPDATE, DELETE</code> on <code>n8n</code> schema only</td> <td style="border: 1px solid #e0e0e0; padding: 13px;"><code>\du</code> in psql</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Credential encryption</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">n8n encrypts stored credentials with <code>ENCRYPTION_KEY</code>; store key in K8s secret, rotate quarterly</td> <td style="border: 1px solid #e0e0e0; padding: 13px;"><code>kubectl describe secret n8n-encryption</code></td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Network policies</td> <td 
style="border: 1px solid #e0e0e0; padding: 13px;">Deny all traffic, allow only API ↔ Worker, Worker ↔ Redis, API ↔ PostgreSQL</td> <td style="border: 1px solid #e0e0e0; padding: 13px;"><code>kubectl get netpol</code></td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Audit logging</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Enable PostgreSQL <code>log_line_prefix</code> and forward to Loki/ELK</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Check log entries for <code>INSERT INTO credential</code></td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">RBAC for API</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Enable <code>N8N_BASIC_AUTH_ACTIVE=true</code> and configure user/pass</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Test with <code>curl -u user:pass …</code> expecting 401 without header</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Secret management</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Use external secret store (AWS Secrets Manager, HashiCorp Vault) via CSI driver</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Verify secret rotation works without pod restart</td> </tr> </tbody> </table> <blockquote style="margin: 0 0 2em 0; padding-left: 1em; border-left: 4px solid #e0e0e0;"> <p style="margin: 0; line-height: 1.9;"><strong>EEFA Note:</strong> If the API is public, add rate‑limiting annotations to the Ingress to mitigate credential‑brute‑force attacks. In practice a modest limit cut down noisy scans dramatically.</p> </blockquote> <hr style="border: none; margin: 55px 0;" /> <h2 style="margin-bottom: 45px; line-height: 1.3;">5. 
High‑Availability & Disaster Recovery</h2> <h3 style="margin-bottom: 45px; line-height: 1.3;">5.1 PostgreSQL HA Blueprint</h3> <ol style="margin-bottom: 1.8em; line-height: 1.9;"> <li>Primary‑Replica (Patroni) – automatic failover within ~30 s.</li> <li>WAL Archiving to an S3 bucket → enables point‑in‑time recovery.</li> <li>Scheduled <code>pg_dump</code> (nightly) → stored in immutable object storage.</li> </ol> <p style="margin-bottom: 2em; line-height: 1.9;">Trigger a manual failover for testing:</p> <pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; line-height: 1.9; margin-bottom: 2em;">kubectl exec -n n8n -it patroni-0 -- patronictl -c /etc/patroni.yml failover </pre> <h3 style="margin-bottom: 45px; line-height: 1.3;">5.2 Worker Redundancy</h3> <p style="margin-bottom: 2em; line-height: 1.9;">Deploy <strong>3+ replicas</strong> behind a ClusterIP service. With <code>EXECUTIONS_MODE=queue</code>, any worker can pick up pending jobs, ensuring no single point of failure.</p> <h3 style="margin-bottom: 45px; line-height: 1.3;">5.3 Redis Persistence</h3> <table style="border-collapse: collapse; width: 100%; margin-bottom: 2em;"> <thead> <tr> <th style="border: 1px solid #e0e0e0; padding: 13px;">Persistence Mode</th> <th style="border: 1px solid #e0e0e0; padding: 13px;">Description</th> <th style="border: 1px solid #e0e0e0; padding: 13px;">Recommended</th> </tr> </thead> <tbody> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">AOF (Append‑Only File)</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Logs every write operation; fast recovery</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">✔︎</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">RDB Snapshots</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Periodic full dumps; lower I/O</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">✖︎ (use only as secondary)</td> </tr> </tbody> </table> <p style="margin-bottom: 
2em; line-height: 1.9;">Redis Helm values for AOF:</p> <pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; line-height: 1.9; margin-bottom: 2em;">appendonly: "yes" save: "900 1" # snapshot every 15 min if at least 1 key changed </pre> <h3 style="margin-bottom: 45px; line-height: 1.3;">5.4 Disaster‑Recovery Runbook</h3> <table style="border-collapse: collapse; width: 100%; margin-bottom: 2em;"> <thead> <tr> <th style="border: 1px solid #e0e0e0; padding: 13px;">Step</th> <th style="border: 1px solid #e0e0e0; padding: 13px;">Action</th> <th style="border: 1px solid #e0e0e0; padding: 13px;">Owner</th> </tr> </thead> <tbody> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">1</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Verify DB replica health (<code>patronictl list</code>)</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">DBA</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">2</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Spin up a fresh PostgreSQL from latest WAL archive</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Ops</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">3</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Re‑point n8n <code>DB_POSTGRESDB_HOST</code> to new primary (ConfigMap rollout)</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">DevOps</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">4</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Run health checks (<code>curl /healthz</code>) on API & workers</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">QA</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">5</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Validate workflow history in UI</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Product</td> </tr> </tbody> </table> <hr style="border: none; margin: 55px 0;" /> <h2 
style="margin-bottom: 45px; line-height: 1.3;">6. Observability & Alerting</h2> <h3 style="margin-bottom: 45px; line-height: 1.3;">6.1 Prometheus Exporter Annotations</h3> <p style="margin-bottom: 2em; line-height: 1.9;">Add these annotations to the n8n deployment so Prometheus can scrape metrics:</p> <pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; line-height: 1.9; margin-bottom: 2em;">metadata: annotations: prometheus.io/scrape: "true" prometheus.io/path: "/metrics" prometheus.io/port: "5678" </pre> <table style="border-collapse: collapse; width: 100%; margin-bottom: 2em;"> <thead> <tr> <th style="border: 1px solid #e0e0e0; padding: 13px;">Metric</th> <th style="border: 1px solid #e0e0e0; padding: 13px;">Meaning</th> <th style="border: 1px solid #e0e0e0; padding: 13px;">Alert Threshold</th> </tr> </thead> <tbody> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">n8n_executions_total</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Total executions processed</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">–</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">n8n_executions_failed_total</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Failed executions</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">> 5/min</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">n8n_queue_length</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Jobs waiting in Redis</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">> 100</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">process_resident_memory_bytes</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Memory per pod</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">> 80 % of limit</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">http_request_duration_seconds</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">API latency</td> 
<td style="border: 1px solid #e0e0e0; padding: 13px;">p95 > 500 ms</td> </tr> </tbody> </table> <h3 style="margin-bottom: 45px; line-height: 1.3;">6.2 Sample Grafana Dashboard (JSON snippet)</h3> <p style="margin-bottom: 2em; line-height: 1.9;">The following JSON defines two panels—queue depth and execution failures.</p> <pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; line-height: 1.9; margin-bottom: 2em;">{ "panels": [ { "type": "graph", "title": "Queue Depth", "targets": [{ "expr": "n8n_queue_length" }] }, { "type": "graph", "title": "Execution Failures", "targets": [{ "expr": "rate(n8n_executions_failed_total[5m])" }] } ] } </pre> <p style="margin-bottom: 2em; line-height: 1.9;">Import this JSON into Grafana to get immediate visibility into bottlenecks.</p> <hr style="border: none; margin: 55px 0;" /> <h2 style="margin-bottom: 45px; line-height: 1.3;">7. Cost‑Optimization Checklist</h2> <table style="border-collapse: collapse; width: 100%; margin-bottom: 2em;"> <thead> <tr> <th style="border: 1px solid #e0e0e0; padding: 13px;">✔️ Item</th> <th style="border: 1px solid #e0e0e0; padding: 13px;">How to Optimize</th> </tr> </thead> <tbody> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Right‑sized workers</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Start with <code>CPU 0.5</code> / <code>Mem 512Mi</code>; auto‑scale based on <code>n8n_queue_length</code></td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Spot instances (k8s)</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Use node pools with spot/preemptible VMs for workers (stateless)</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Redis persistence tier</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Enable **AOF** only; disable RDB snapshots if storage cost is a concern</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">PostgreSQL storage</td> <td style="border: 
1px solid #e0e0e0; padding: 13px;">Enable **storage autoscaling**; set <code>maxsize</code> to 2× expected data growth</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Turn off dev‑mode logs</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Set <code>N8N_LOG_LEVEL=error</code> in prod to reduce log volume</td> </tr> </tbody> </table> <hr style="border: none; margin: 55px 0;" /> <h2 style="margin-bottom: 45px; line-height: 1.3;">8. Step‑by‑Step Production Rollout (Docker‑Compose Example)</h2> <p style="margin-bottom: 2em; line-height: 1.9;">Below the compose file is broken into logical service blocks for readability.</p> <h3 style="margin-bottom: 45px; line-height: 1.3;">8.1 Database Service</h3> <pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; line-height: 1.9; margin-bottom: 2em;">db: image: postgres:15-alpine environment: POSTGRES_DB: n8n POSTGRES_USER: n8n POSTGRES_PASSWORD: ${POSTGRES_PASSWORD} volumes: - pg-data:/var/lib/postgresql/data restart: unless-stopped </pre> <h3 style="margin-bottom: 45px; line-height: 1.3;">8.2 Redis Service</h3> <pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; line-height: 1.9; margin-bottom: 2em;">redis: image: redis:7-alpine command: ["redis-server", "--appendonly", "yes"] volumes: - redis-data:/data restart: unless-stopped </pre> <h3 style="margin-bottom: 45px; line-height: 1.3;">8.3 API Service</h3> <pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; line-height: 1.9; margin-bottom: 2em;">api: image: n8nio/n8n:1.30.0 environment: - DB_TYPE=postgresdb - DB_POSTGRESDB_HOST=db - DB_POSTGRESDB_PORT=5432 - DB_POSTGRESDB_DATABASE=n8n - DB_POSTGRESDB_USER=n8n - DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD} - EXECUTIONS_MODE=queue - QUEUE_BULL_REDIS_HOST=redis - QUEUE_BULL_REDIS_PORT=6379 - N8N_BASIC_AUTH_ACTIVE=true - N8N_BASIC_AUTH_USER=${ADMIN_USER} - 
N8N_BASIC_AUTH_PASSWORD=${ADMIN_PASS} ports: - "5678:5678" depends_on: - db - redis restart: unless-stopped </pre> <h3 style="margin-bottom: 45px; line-height: 1.3;">8.4 Worker Service</h3> <pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; line-height: 1.9; margin-bottom: 2em;">worker: image: n8nio/n8n:1.30.0 command: ["n8n", "worker"] environment: *api.environment # reuse the same env vars depends_on: - api - redis restart: unless-stopped </pre> <h3 style="margin-bottom: 45px; line-height: 1.3;">8.5 Compose Wrapper</h3> <pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; line-height: 1.9; margin-bottom: 2em;">version: "3.8" services: db: # defined above redis: # defined above api: # defined above worker: # defined above volumes: pg-data: redis-data: </pre> <p style="margin-bottom: 2em; line-height: 1.9;">**Rollout steps**</p> <ol style="margin-bottom: 1.8em; line-height: 1.9;"> <li>Create the required secrets (<code>POSTGRES_PASSWORD</code>, <code>ADMIN_USER</code>, <code>ADMIN_PASS</code>).</li> <li>Run <code>docker compose up -d</code>. The API becomes reachable at <code>https://<host>:5678</code>.</li> <li>Verify the queue is working: <pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; line-height: 1.9; margin: 0;">docker exec -it <redis_container> redis-cli LLEN n8n:queue </pre> </li> <li>Scale workers as needed, e.g. <code>docker compose up -d --scale worker=4</code>.</li> </ol> <blockquote style="margin: 0 0 2em 0; padding-left: 1em; border-left: 4px solid #e0e0e0;"> <p style="margin: 0; line-height: 1.9;"><strong>EEFA Note:</strong> When stopping the API for upgrades, use <code>docker compose stop api</code> (graceful) to let in‑flight executions finish; a hard kill can lose partial runs.</p> </blockquote> <hr style="border: none; margin: 55px 0;" /> <h2 style="margin-bottom: 45px; line-height: 1.3;">9. 
Frequently Asked Production Questions</h2> <table style="border-collapse: collapse; width: 100%; margin-bottom: 2em;"> <thead> <tr> <th style="border: 1px solid #e0e0e0; padding: 13px;">Question</th> <th style="border: 1px solid #e0e0e0; padding: 13px;">Short Answer</th> </tr> </thead> <tbody> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Can I keep SQLite for prod?</td> <td style="border: 1px solid #e0e0e0; padding: 13px;"><strong>No</strong> – it cannot survive pod restarts or scaling.</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Do I need both Redis and a queue?</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Yes. <code>EXECUTIONS_MODE=queue</code> requires a broker; otherwise only the API pod can run jobs.</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">How many workers do I need?</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Start with <code>workers = ceil(concurrent_executions / avg_execution_time_seconds)</code>. Adjust via autoscaler.</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Is n8n thread‑safe?</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Each worker runs a single Node.js event loop; concurrency is achieved by adding more workers, not threads.</td> </tr> <tr> <td style="border: 1px solid #e0e0e0; padding: 13px;">Can I use MySQL?</td> <td style="border: 1px solid #e0e0e0; padding: 13px;">Supported, but PostgreSQL offers better JSONB performance for workflow payloads.</td> </tr> </tbody> </table> <hr style="border: none; margin: 55px 0;" /> <h2 style="margin-bottom: 45px; line-height: 1.3;">Conclusion</h2> <p style="margin-bottom: 2em; line-height: 1.9;">Deploying n8n in production demands <strong>stateless execution</strong>, a <strong>robust PostgreSQL backend</strong>, a <strong>persistent queue</strong> (Redis), and <strong>full observability</strong>. 
By separating the API from worker pods, enforcing TLS/RBAC, and automating backups, you eliminate single points of failure and gain the ability to scale horizontally. Follow the checklist, use the provided Helm/Docker‑Compose snippets, and monitor key metrics to keep the system healthy. This architecture has been battle‑tested in real‑world pipelines and delivers reliable, secure workflow automation at scale.</p>

Who this is for: SREs, DevOps engineers, and backend developers who need a reliable, horizontally‑scalable n8n deployment in production. We cover this in detail in the Production‑Grade n8n Architecture guide.

Quick Diagnosis

Your n8n workflows are missing executions, exposing credentials, or failing under load. The root cause is typically a stateful, single‑node setup. Re‑architect with a stateless execution layer, external PostgreSQL, a durable queue, and proper monitoring.

TL;DR – Deploy n8n on Kubernetes (or Docker‑Compose for small scale) with PostgreSQL, Redis, Prometheus‑Grafana, TLS, RBAC, and daily backups.
In production, the limits of a single‑node setup typically surface once you have more than a handful of concurrent runs.
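
One way to check whether an instance still runs on the single‑node defaults is to inspect its environment. The sketch below is a hypothetical helper (the name `diagnose_n8n_env` is an assumption, not part of n8n). It relies on n8n falling back to `DB_TYPE=sqlite`, `EXECUTIONS_MODE=regular`, and a Redis host of `localhost` when the variables are unset.

```shell
#!/usr/bin/env bash
# Hypothetical helper: flags the stateful single-node defaults.
# Reads only environment variables; it does not contact n8n, Redis, or PostgreSQL.
diagnose_n8n_env() {
  local db_type="${DB_TYPE:-sqlite}"            # n8n default when unset
  local exec_mode="${EXECUTIONS_MODE:-regular}" # n8n default when unset
  local findings=0

  if [ "$db_type" != "postgresdb" ]; then
    echo "WARN: DB_TYPE=$db_type (expected postgresdb for production)"
    findings=$((findings + 1))
  fi
  if [ "$exec_mode" != "queue" ]; then
    echo "WARN: EXECUTIONS_MODE=$exec_mode (expected queue for separate workers)"
    findings=$((findings + 1))
  fi
  if [ "$exec_mode" = "queue" ] && [ -z "${QUEUE_BULL_REDIS_HOST:-}" ]; then
    echo "WARN: queue mode without QUEUE_BULL_REDIS_HOST (n8n would assume localhost)"
    findings=$((findings + 1))
  fi

  if [ "$findings" -eq 0 ]; then
    echo "OK: environment matches the queue-mode layout"
  fi
  # Exit status equals the number of findings (0 means clean).
  return "$findings"
}
```

Source the script inside the n8n container and call `diagnose_n8n_env`; a non‑zero exit status is the number of warnings printed.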

1. Core Production Requirements

If you encounter any n8n architecture anti‑patterns, resolve them before continuing with the setup.

Requirement	Why It Matters	n8n Default	Production‑Ready Alternative
Stateless execution	Enables horizontal scaling and zero‑downtime deploys	In‑process execution (single‑process)	Separate worker pods/containers that pull jobs from a queue
Durable data store	Guarantees workflow state, credentials, logs	SQLite (file‑based)	PostgreSQL 13+ (managed or self‑hosted)
Reliable job queue	Prevents lost executions when the API crashes	In‑memory (no queue)	Redis as the broker for `EXECUTIONS_MODE=queue`
TLS & Auth	Protects credentials in transit and restricts who can reach the editor and API	Optional, self‑signed	Ingress TLS + OAuth2 / API‑Key enforcement
High‑availability (HA)	No single point of failure	Single pod	Replicated DB, multiple workers, load‑balanced API
Observability	Early detection of bottlenecks & failures	Minimal logs	Prometheus metrics + Grafana dashboards + structured logging
Backup & DR	Prevents data loss	Manual file copy	Automated `pg_dump` + WAL archiving, point‑in‑time restore

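EEFA Note: Switching from SQLite to PostgreSQL requires a data migration: export workflows and credentials from the SQLite‑backed instance (`n8n export:workflow`, `n8n export:credentials`), then import them into the PostgreSQL‑backed one. Perform this in a maintenance window so that no workflow runs or changes during the cutover.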
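
A minimal sketch of that migration with the n8n CLI might look like the following. The function names and file layout are hypothetical, and it assumes the target instance uses the same encryption key as the source, since exported credentials stay encrypted by default. Execution history is not covered by these commands.

```shell
#!/usr/bin/env bash
# Sketch only: move workflows and credentials from a SQLite-backed n8n
# to a PostgreSQL-backed one. Function names and paths are illustrative.

export_n8n_data() {
  local out_dir="$1"
  mkdir -p "$out_dir" || return 1
  # Run where n8n still points at SQLite.
  n8n export:workflow --all --output="$out_dir/workflows.json" &&
    n8n export:credentials --all --output="$out_dir/credentials.json"
}

import_n8n_data() {
  local in_dir="$1"
  # Run where DB_TYPE=postgresdb is already configured. Credentials are
  # still encrypted, so the same encryption key must be set here.
  n8n import:workflow --input="$in_dir/workflows.json" &&
    n8n import:credentials --input="$in_dir/credentials.json"
}
```

Call `export_n8n_data ./n8n-dump` on the old instance, copy the directory across, then call `import_n8n_data ./n8n-dump` on the new one and spot‑check a few workflows in the UI before reopening traffic.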

Most teams hit these gaps after a few weeks of traffic, not on day one.

2. Architecture Blueprint

The diagram shows a stateless API that enqueues work, worker pods that process jobs, and a resilient data layer. If you encounter any n8n control plane / data plane issues, resolve them before continuing with the setup.

+-------------------+       +-------------------+       +-------------------+
|   Ingress (TLS)   | <---> |  n8n API Service  | <---> |   Redis Queue     |
+-------------------+       +-------------------+       +-------------------+
                                 ^   |
                                 |   v
                        +-------------------+
                        |  n8n Worker Pods  |
                        +-------------------+
                                 |
                                 v
                        +-------------------+
                        | PostgreSQL (HA)   |
                        +-------------------+
                                 |
                                 v
                        +-------------------+
                        | Prometheus/Grafana|
                        +-------------------+

A quick walk‑through: the Ingress terminates TLS and hands traffic to the API service. The API writes execution requests to Redis. Worker pods pull from the queue, run the workflow, and persist results to PostgreSQL. Prometheus sits beside this data path rather than downstream of PostgreSQL: it scrapes metrics from both the API and the workers, and Grafana visualises them.
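To make the enqueue/pull flow concrete, here is a minimal in‑process sketch. It is only a toy model, not n8n code: a Python queue.Queue stands in for Redis, a dict stands in for PostgreSQL, and the names run_pipeline and worker are hypothetical.

```python
import queue
import threading


def run_pipeline(job_ids, worker_count=3):
    """Toy model: an 'API' enqueues jobs, workers pull them and store results."""
    jobs = queue.Queue()   # stands in for the Redis queue
    results = {}           # stands in for PostgreSQL
    lock = threading.Lock()

    def worker(name):
        while True:
            job = jobs.get()
            if job is None:            # sentinel: no more work for this worker
                jobs.task_done()
                return
            with lock:                 # persist the result
                results[job] = f"done by {name}"
            jobs.task_done()

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for t in threads:
        t.start()
    for job in job_ids:                # the API only enqueues, it never executes
        jobs.put(job)
    for _ in threads:                  # one sentinel per worker
        jobs.put(None)
    jobs.join()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(len(run_pipeline(range(20), worker_count=4)))
```

Because the producer never executes a job itself, any worker can be added or removed without the producer noticing, which is the property the real architecture relies on.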

3. Deployment Options

If you encounter any n8n multi‑tenant architecture issues, resolve them before continuing with the setup.

Platform	Pros	Cons	When to Choose
Docker‑Compose (single‑node)	Quick start, easy local dev, low cost	No native HA, manual scaling, limited monitoring	PoC, < 10 concurrent executions
Docker Swarm	Built‑in service replication, simple networking	Declining community support, limited autoscaling	Small‑to‑mid teams already on Swarm
Kubernetes (Helm)	Declarative, auto‑scaling, native secrets, robust ecosystem	Higher operational overhead, learning curve	Production, ≥ 20 concurrent executions, need for HA
Managed n8n Cloud	Zero‑ops, automatic backups, SLA	Vendor lock‑in, less control over network topology	Teams without ops resources, compliance OK with provider

3.1 Helm Values – Core Settings

Below are the essential Helm values split into focused snippets.

Environment variables (DB, queue, auth)

n8n:
  env:
    - name: DB_TYPE
      value: postgresdb
    - name: DB_POSTGRESDB_HOST
      value: pg-n8n-primary
    - name: DB_POSTGRESDB_PORT
      value: "5432"
    - name: DB_POSTGRESDB_DATABASE
      value: n8n
    - name: DB_POSTGRESDB_USER
      valueFrom:
        secretKeyRef:
          name: n8n-pg-secret
          key: username
    - name: DB_POSTGRESDB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: n8n-pg-secret
          key: password
    - name: EXECUTIONS_MODE
      value: queue
    - name: QUEUE_BULL_REDIS_HOST
      value: redis-n8n
    - name: QUEUE_BULL_REDIS_PORT
      value: "6379"
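    # Verify against your n8n version: the N8N_BASIC_AUTH_* settings apply to
    # pre-1.0 releases; n8n 1.x uses built-in user management instead.
    - name: N8N_BASIC_AUTH_ACTIVE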
      value: "true"
    - name: N8N_BASIC_AUTH_USER
      valueFrom:
        secretKeyRef:
          name: n8n-basic-auth
          key: user
    - name: N8N_BASIC_AUTH_PASSWORD
      valueFrom:
        secretKeyRef:
          name: n8n-basic-auth
          key: password

Resource limits and service definition

  resources:
    limits:
      cpu: "2"
      memory: "2Gi"
    requests:
      cpu: "500m"
      memory: "512Mi"
  service:
    type: ClusterIP
    port: 5678

Worker replica count (horizontal scaling)

worker:
  replicaCount: 3   # increase to meet concurrency needs
  resources:
    limits:
      cpu: "1"
      memory: "1Gi"

EEFA Note: Start with modest resources.limits. Over‑provisioned workers scheduled on the same nodes can starve the API pod and cause request timeouts. When throughput is the problem, raising the worker replica count is usually quicker than hunting for a hidden bottleneck.

4. Security & Compliance Checklist

Item	Implementation Detail	Verification
TLS everywhere	Use cert‑manager to provision certs for Ingress and internal services (Redis, PostgreSQL)	`kubectl get secret <name>-tls`
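Least‑privilege DB user	Dedicated PostgreSQL role limited to the `n8n` database and schema; it needs `SELECT, INSERT, UPDATE, DELETE` at runtime plus DDL rights for the migrations n8n runs on startup	`\du` and `\dp` in psql
Credential encryption	n8n encrypts stored credentials with `N8N_ENCRYPTION_KEY`; store the key in a K8s secret and give the API and all workers the same value. Credentials cannot be decrypted with a different key, so any rotation needs an export and re‑import	`kubectl describe secret n8n-encryption`
Network policies	Default‑deny, then allow only Ingress → API, API ↔ Redis, Worker ↔ Redis, API ↔ PostgreSQL, Worker ↔ PostgreSQL	`kubectl get netpol`
Audit logging	Enable PostgreSQL statement logging (`log_statement = 'mod'`) with an informative `log_line_prefix`, and forward the logs to Loki/ELK	Check log entries for writes to the credentials table (`credentials_entity` in recent n8n schemas)
API authentication	Enable `N8N_BASIC_AUTH_ACTIVE=true` and configure user/pass (pre‑1.0 releases; n8n 1.x relies on built‑in user management instead)	`curl -u user:pass …` should succeed; the same request without credentials should return 401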
Secret management	Use external secret store (AWS Secrets Manager, HashiCorp Vault) via CSI driver	Verify secret rotation works without pod restart

EEFA Note: If the API is public, add rate‑limiting annotations to the Ingress to mitigate credential brute‑force attacks. In practice, even a modest limit cuts down noisy scans dramatically.
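For intuition about what such a limit does, the sketch below implements a token bucket, one common rate‑limiting scheme. It is illustrative only: your Ingress controller may use a different algorithm, and the TokenBucket class and its parameters are hypothetical names, not part of n8n or any controller.

```python
class TokenBucket:
    """Allow `burst` requests at once, then refill at `rate_per_sec`."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)   # start with a full bucket
        self.last = 0.0              # timestamp of the previous request

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) may pass."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=1, burst=3)
    # Four login attempts in the same instant: the fourth is rejected.
    print([bucket.allow(0.0) for _ in range(4)])
```

A password‑guessing client that sends hundreds of requests per second is throttled to the refill rate, while a normal user rarely exhausts the burst allowance.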

5. High‑Availability & Disaster Recovery

5.1 PostgreSQL HA Blueprint

Primary‑Replica (Patroni) – automatic failover within ~30 s.
WAL Archiving to an S3 bucket → enables point‑in‑time recovery.
Scheduled pg_dump (nightly) → stored in immutable object storage.

Trigger a manual failover for testing:

kubectl exec -n n8n -it patroni-0 -- patronictl -c /etc/patroni.yml failover

5.2 Worker Redundancy

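Deploy 3+ worker replicas. Workers pull jobs from Redis rather than receiving inbound traffic, so they do not need a load‑balanced Service in front of them. With EXECUTIONS_MODE=queue, any worker can pick up pending jobs, so losing one worker does not stop processing.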

5.3 Redis Persistence

Persistence Mode	Description	Recommended
AOF (Append‑Only File)	Logs every write operation; minimal data loss after a crash	✔︎
RDB Snapshots	Periodic full dumps; lower I/O	✖︎ (use only as secondary)

Redis Helm values for AOF:

appendonly: "yes"
save: "900 1"   # snapshot every 15 min if at least 1 key changed

5.4 Disaster‑Recovery Runbook

Step	Action	Owner
1	Verify DB replica health (`patronictl list`)	DBA
2	Spin up a fresh PostgreSQL from the latest base backup and replay the archived WAL	Ops
3	Re‑point n8n `DB_POSTGRESDB_HOST` to new primary (ConfigMap rollout)	DevOps
4	Run health checks (`curl /healthz`) on API & workers	QA
5	Validate workflow history in UI	Product
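Step 4 of the runbook can be automated with a small polling helper. The sketch below assumes the /healthz endpoint returns HTTP 200 when the instance is healthy; the function names, the retry counts, and the example URL are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=3.0):
    """Return the HTTP status code for `url`, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:    # server answered with 4xx/5xx
        return exc.code
    except (urllib.error.URLError, OSError):  # DNS failure, refused, timeout
        return None


def wait_until_healthy(url, attempts=10, delay=2.0,
                       fetch=http_status, sleep=time.sleep):
    """Poll `url` until it returns 200; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        if fetch(url) == 200:
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None


if __name__ == "__main__":
    # Hypothetical in-cluster address; replace with your API service URL.
    print(wait_until_healthy("http://n8n-api:5678/healthz", attempts=3))
```

Passing fetch and sleep as parameters keeps the helper easy to test without a running cluster.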

6. Observability & Alerting

6.1 Prometheus Exporter Annotations

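Add these annotations to the pod template of the n8n deployment so Prometheus can scrape metrics. n8n only serves /metrics when N8N_METRICS=true is set, so enable that variable on the pods you want scraped: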

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "5678"

Metric	Meaning	Alert Threshold
n8n_executions_total	Total executions processed	–
n8n_executions_failed_total	Failed executions	> 5/min
n8n_queue_length	Jobs waiting in Redis	> 100
process_resident_memory_bytes	Memory per pod	> 80 % of limit
http_request_duration_seconds	API latency	p95 > 500 ms
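The thresholds in the table can be expressed as a small rule check, which is handy for reasoning about alerts before writing real alerting rules. This is a sketch only: the sample keys (failed_per_min, queue_length, memory_ratio, p95_latency_s) are hypothetical names for values you would derive from the metrics above.

```python
# Alert thresholds taken from the table above; a sample fires an alert
# when its value is strictly greater than the limit.
THRESHOLDS = {
    "failed_per_min": 5,     # failed executions per minute
    "queue_length": 100,     # jobs waiting in Redis
    "memory_ratio": 0.80,    # resident memory / memory limit
    "p95_latency_s": 0.5,    # API p95 latency in seconds
}


def firing_alerts(sample, thresholds=THRESHOLDS):
    """Return the sorted names of thresholds exceeded by `sample`."""
    return sorted(
        name for name, limit in thresholds.items()
        if sample.get(name, 0) > limit
    )


if __name__ == "__main__":
    sample = {"failed_per_min": 7, "queue_length": 40,
              "memory_ratio": 0.85, "p95_latency_s": 0.5}
    print(firing_alerts(sample))
```

Missing values are treated as zero here, so a metric that stops reporting will not fire; in production you would usually alert on absent metrics separately.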

6.2 Sample Grafana Dashboard (JSON snippet)

The following JSON defines two panels—queue depth and execution failures.

{
  "panels": [
    {
      "type": "graph",
      "title": "Queue Depth",
      "targets": [{ "expr": "n8n_queue_length" }]
    },
    {
      "type": "graph",
      "title": "Execution Failures",
      "targets": [{ "expr": "rate(n8n_executions_failed_total[5m])" }]
    }
  ]
}

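Import this JSON into Grafana as a starting point for spotting bottlenecks. Confirm the metric names against your own instance's /metrics output first, since they can differ between n8n versions.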

7. Cost‑Optimization Checklist

✔️ Item	How to Optimize
Right‑sized workers	Start with `CPU 0.5` / `Mem 512Mi`; auto‑scale based on `n8n_queue_length`
Spot instances (k8s)	Use node pools with spot/preemptible VMs for workers (stateless)
Redis persistence tier	Enable AOF only; disable RDB snapshots if storage cost is a concern
PostgreSQL storage	Enable storage autoscaling; set `maxsize` to 2× expected data growth
Turn off dev‑mode logs	Set `N8N_LOG_LEVEL=error` in prod to reduce log volume
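Auto‑scaling on queue depth boils down to a clamped division. The sketch below shows that calculation under stated assumptions: jobs_per_worker is the backlog you are willing to leave per worker, and the min/max bounds are illustrative defaults, not n8n settings.

```python
import math


def desired_workers(queue_length, jobs_per_worker=10,
                    min_workers=1, max_workers=10):
    """Worker count that keeps the backlog near `jobs_per_worker` each."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    wanted = math.ceil(queue_length / jobs_per_worker)
    # Clamp so the fleet never scales to zero or beyond the budget.
    return max(min_workers, min(max_workers, wanted))


if __name__ == "__main__":
    for depth in (0, 35, 1000):
        print(depth, desired_workers(depth))
```

The upper bound is what protects the budget: a sudden backlog spike adds workers only up to max_workers, and the queue absorbs the rest.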

8. Step‑by‑Step Production Rollout (Docker‑Compose Example)

Below, the compose file is broken into logical service blocks for readability.

8.1 Database Service

db:
  image: postgres:15-alpine
  environment:
    POSTGRES_DB: n8n
    POSTGRES_USER: n8n
    POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
  volumes:
    - pg-data:/var/lib/postgresql/data
  restart: unless-stopped

8.2 Redis Service

redis:
  image: redis:7-alpine
  command: ["redis-server", "--appendonly", "yes"]
  volumes:
    - redis-data:/data
  restart: unless-stopped

8.3 API Service

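api:
  image: n8nio/n8n:1.30.0
  # The &n8n-env anchor lets the worker reuse this list once all blocks
  # are assembled into a single compose file.
  environment: &n8n-env
    - DB_TYPE=postgresdb
    - DB_POSTGRESDB_HOST=db
    - DB_POSTGRESDB_PORT=5432
    - DB_POSTGRESDB_DATABASE=n8n
    - DB_POSTGRESDB_USER=n8n
    - DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD}
    - EXECUTIONS_MODE=queue
    - QUEUE_BULL_REDIS_HOST=redis
    - QUEUE_BULL_REDIS_PORT=6379
    # Verify against your n8n version: N8N_BASIC_AUTH_* applies to pre-1.0
    # releases; 1.x images use built-in user management instead.
    - N8N_BASIC_AUTH_ACTIVE=true
    - N8N_BASIC_AUTH_USER=${ADMIN_USER}
    - N8N_BASIC_AUTH_PASSWORD=${ADMIN_PASS}
    # Workers can only decrypt credentials if they share the API's
    # encryption key, so set N8N_ENCRYPTION_KEY explicitly in production.
  ports:
    - "5678:5678"
  depends_on:
    - db
    - redis
  restart: unless-stopped

8.4 Worker Service

worker:
  image: n8nio/n8n:1.30.0
  # The image entrypoint already invokes n8n, so only the subcommand is passed.
  command: worker
  environment: *n8n-env   # reuse the API's env vars via the YAML anchor
  depends_on:
    - api
    - redis
  restart: unless-stopped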

8.5 Compose Wrapper

version: "3.8"
services:
  db:      # defined above
  redis:   # defined above
  api:     # defined above
  worker:  # defined above

volumes:
  pg-data:
  redis-data:

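Rollout steps

Create the required secrets (POSTGRES_PASSWORD, ADMIN_USER, ADMIN_PASS).
Run docker compose up -d. The API becomes reachable at http://<host>:5678; this compose file serves plain HTTP, so terminate TLS at a reverse proxy in front of it.

Verify the queue is working (the key assumes Bull's default prefix and n8n's jobs queue; adjust it if you set QUEUE_BULL_PREFIX):

docker exec -it <redis_container> redis-cli LLEN bull:jobs:wait

Scale workers as needed, e.g. docker compose up -d --scale worker=4.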

EEFA Note: When stopping the API for upgrades, use docker compose stop api (graceful) to let in‑flight executions finish; a hard kill can lose partial runs.

9. Frequently Asked Production Questions

Question	Short Answer
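Can I keep SQLite for prod?	No – it is a single local file that cannot be shared across replicas, and it is lost on pod restart unless it sits on a persistent volume.
Do I need both Redis and a queue?	Redis is the queue. `EXECUTIONS_MODE=queue` requires it as the broker; otherwise only the API pod can run jobs.
How many workers do I need?	Estimate in‑flight executions as arrival rate × average duration, then divide by per‑worker concurrency: `workers = ceil(executions_per_second × avg_execution_time_seconds / worker_concurrency)`. Adjust via autoscaler.
Is n8n thread‑safe?	Each worker is a single Node.js process whose event loop interleaves several executions, bounded by the worker's `--concurrency` setting; CPU‑bound throughput scales by adding workers, not threads.
Can I use MySQL?	Older releases support it, but recent n8n versions deprecate MySQL/MariaDB; PostgreSQL is the recommended production backend.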
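A worker‑count estimate can be worked through with Little's law (in‑flight executions = arrival rate × average duration). The sketch below assumes each worker handles up to 10 executions concurrently, which should be checked against your worker's concurrency setting, and adds a hypothetical 20 % headroom factor.

```python
import math


def workers_needed(arrivals_per_sec, avg_duration_sec,
                   concurrency_per_worker=10, headroom=1.2):
    """Estimate worker count from arrival rate and average execution time."""
    # Little's law: average number of executions in flight at any moment.
    in_flight = arrivals_per_sec * avg_duration_sec
    # Add headroom for bursts, then spread across per-worker concurrency.
    return max(1, math.ceil(in_flight * headroom / concurrency_per_worker))


if __name__ == "__main__":
    # 5 executions/s lasting 8 s each -> 40 in flight, 48 with headroom.
    print(workers_needed(5, 8))
```

Treat the result as a starting replica count and let the autoscaler correct it once real traffic shows how long executions actually take.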

Conclusion

Deploying n8n in production demands stateless execution, a robust PostgreSQL backend, a persistent queue (Redis), and full observability. By separating the API from worker pods, enforcing TLS/RBAC, and automating backups, you eliminate single points of failure and gain the ability to scale horizontally. Follow the checklist, use the provided Helm/Docker‑Compose snippets, and monitor key metrics to keep the system healthy. This architecture has been battle‑tested in real‑world pipelines and delivers reliable, secure workflow automation at scale.
