
Who this is for: Ops engineers, SREs, and platform developers who run n8n in a clustered, production‑grade environment. We cover this in detail in the n8n Architectural Failure Modes Guide.
Quick Diagnosis
When some nodes in an n8n cluster lose connectivity, workflows can stall, duplicate, or lose data. To confirm a partition‑induced failure quickly, call the health‑check endpoint on every node and compare the clusterStatus fields.
One‑line remedy: Re‑establish inter‑node connectivity (or force a leader re‑election) and replay any `execution_queue` entries stuck in the “waiting” state.
In production this usually shows up as a sudden spike in “stuck” executions after a network glitch.
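The comparison itself is mechanical and can be scripted. Below is a minimal Python sketch; the payload shape (a `clusterStatus` object carrying `leaderId` and `partitioned`) is an assumption based on the fields this guide inspects, so adjust the key names to whatever your n8n version actually returns:

```python
# Flag a partition from the /health payloads collected on each node.
# Assumed payload shape: {"clusterStatus": {"leaderId": ..., "partitioned": ...}}

def cluster_is_partitioned(health_by_node):
    """health_by_node: dict mapping node name -> parsed /health JSON."""
    statuses = [h["clusterStatus"] for h in health_by_node.values()]
    leaders = {s["leaderId"] for s in statuses}
    self_reported_split = any(s["partitioned"] for s in statuses)
    # More than one leader, or any node self-reporting a split, means trouble.
    return len(leaders) > 1 or self_reported_split

payloads = {
    "api-1": {"clusterStatus": {"leaderId": "api-1", "partitioned": False}},
    "api-2": {"clusterStatus": {"leaderId": "api-2", "partitioned": False}},
}
print(cluster_is_partitioned(payloads))  # two leaders -> True
```

Feed it one parsed `/health` response per node; a `True` result is your cue to jump to the troubleshooting steps below.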
1. What Is a Partial Network Partition in an n8n Cluster?
A partial partition means only some services lose connectivity while the rest keep working. The table below shows each component, its typical deployment, its role, and what breaks when it’s isolated.
| Component | Role in the Cluster | What a Partition Breaks |
|---|---|---|
| API Server(s) | Receives webhooks, validates triggers | Isolated API cannot forward jobs to workers |
| Execution Workers | Runs workflow steps | Workers cannot fetch jobs, causing “stuck” executions |
| Message Queue (Redis / RabbitMQ) | Stores execution_queue items | Heartbeats stop; duplicate pushes appear after healing |
| Database (PostgreSQL) | Persists definitions & execution data | Writes may land on a replica that can’t replicate to primary |
| Load Balancer | Routes HTTP traffic | Continues sending traffic to a partitioned node, amplifying the issue |
2. Symptom Matrix – How Failures Manifest
| Symptom | Observable Effect | Likely Partition‑Induced Root Cause |
|---|---|---|
| Workflow never starts | HTTP 202 returned, but no execution record | API node cannot push to the queue |
| Duplicate executions | Same webhook triggers multiple runs | Two API nodes think they are the leader |
| Stuck executions | status: "running" > 30 min, no logs | Worker cannot read from the queue |
| Missing data in DB | Execution details absent, webhook logs present | Write succeeded on a replica isolated from primary |
| Health endpoint shows “partitioned” | /health JSON includes "partitioned": true | Direct detection of network split |
Use this matrix to narrow the failure to a component before digging into logs. Most teams see it after a few weeks, not on day one.
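For runbook automation, the matrix translates directly into a lookup table. A small Python sketch (the shorthand symptom keys are our own naming, not n8n API values):

```python
# The symptom matrix above as a lookup table for quick triage.
# Keys are shorthand labels of our own; map your alerts onto them.

SUSPECT = {
    "never_starts":       "API node cannot push to the queue",
    "duplicates":         "Two API nodes think they are the leader",
    "stuck_executions":   "Worker cannot read from the queue",
    "missing_db_rows":    "Write succeeded on a replica isolated from primary",
    "health_partitioned": "Direct detection of network split",
}

def triage(symptom):
    """Return the likely partition-induced root cause for a symptom label."""
    return SUSPECT.get(symptom, "Unknown; collect /health output from every node")

print(triage("duplicates"))
```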
3. Step‑by‑Step Troubleshooting Guide
3.1 Verify Cluster Health
Run the health endpoint on every node—API, worker, queue, DB.
curl -s http://localhost:5678/health | jq .
Key fields to inspect
| Field | Expected value | Meaning of deviation |
|---|---|---|
| clusterStatus.leaderId | Same on all API nodes | Leadership split → possible duplicate enqueues |
| clusterStatus.partitioned | false | true indicates a network split |
| queueHealth.connected | true | false means the node cannot talk to Redis/RabbitMQ |
Any mismatch means a partition.
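Checking every node by hand is error-prone, so it is worth scripting the sweep. A Python sketch using only the standard library; the node names are placeholders, and the payload shape follows the table above (adjust both to your deployment):

```python
import json
import urllib.request

NODES = ["api-node-1", "api-node-2", "worker-node-1"]  # hypothetical node names

def key_fields(payload):
    """Pull the three fields from the table out of a parsed /health payload."""
    cs = payload.get("clusterStatus", {})
    return {
        "leaderId": cs.get("leaderId"),
        "partitioned": cs.get("partitioned"),
        "queueConnected": payload.get("queueHealth", {}).get("connected"),
    }

def poll_all(nodes=NODES, port=5678, timeout=5):
    """Call /health on every node; an unreachable node is itself a data point."""
    report = {}
    for node in nodes:
        try:
            url = f"http://{node}:{port}/health"
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                report[node] = key_fields(json.load(resp))
        except OSError as exc:
            report[node] = f"UNREACHABLE: {exc}"
    return report

# Example: run print(poll_all()) from a host that can reach every node,
# then eyeball the rows for mismatched leaderId or partitioned: true.
```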
3.2 Isolate the Faulty Segment
- Port reachability—check that each service port accepts TCP connections.
```bash
nc -zv api-node-1 5678     # API port
nc -zv worker-node-2 5679  # Worker port
nc -zv redis-prod 6379     # Redis port
```
- Traceroute—verify routing paths between nodes.
```bash
traceroute api-node-1
traceroute worker-node-2
```
- Firewall / security‑group audit—look for rules changed by recent auto‑scaling events (common in cloud VPCs).
Document the results in a small table for the post‑mortem.
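Because partial partitions are often asymmetric, run the probes from each node, not just one. A Python sketch that mirrors the nc checks and formats the results for that post-mortem table (target hosts and ports are assumptions; edit them for your cluster):

```python
import socket

# (host, port) pairs to probe; mirrors the nc commands above.
TARGETS = [
    ("api-node-1", 5678),
    ("worker-node-2", 5679),
    ("redis-prod", 6379),
]

def can_connect(host, port, timeout=3):
    """TCP connect test, the programmatic equivalent of `nc -zv host port`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def reachability_table(results):
    """Format {(host, port): bool} into rows ready for the post-mortem doc."""
    return [f"{h}:{p} {'OK' if ok else 'FAIL'}" for (h, p), ok in results.items()]

# Example: run this block on every node and diff the outputs.
#   results = {(h, p): can_connect(h, p) for h, p in TARGETS}
#   print("\n".join(reachability_table(results)))
```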
3.3 Force a Leader Re‑Election (Redis‑backed clustering)
Run this only after confirming all nodes can see each other.
curl -X POST http://localhost:5678/api/v1/cluster/leadership/force
EEFA Note: Forcing leadership while a partition persists can cause a split‑brain with two leaders enqueueing duplicate jobs.
3.4 Replay Stuck Queue Items
3.4.1 List waiting jobs in Redis
redis-cli -h <redis-host> -p 6379 ZRANGE n8n:executionQueue:waiting 0 -1 WITHSCORES
3.4.2 Remove them from the waiting set
ZREMRANGEBYRANK n8n:executionQueue:waiting 0 -1
3.4.3 Push each payload back to the ready queue
LPUSH n8n:executionQueue:ready <job‑payload>
EEFA Warning: Re‑injecting jobs without deduplication can cause double‑processing. Verify that the `executionId` does not already exist in the `executions` table.
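That deduplication step is easy to script before pushing anything back. A Python sketch of the filter; the job-payload shape (a JSON object with an `executionId` field) is an assumption, so feed it the raw ZRANGE output plus the executionIds you pulled from the `executions` table:

```python
import json

def jobs_to_replay(waiting_payloads, existing_execution_ids):
    """Keep only the waiting jobs whose executionId is not already recorded
    in the executions table, so a replay never double-processes."""
    existing = set(existing_execution_ids)
    replay = []
    for raw in waiting_payloads:  # raw JSON strings from ZRANGE
        job = json.loads(raw)
        if job["executionId"] not in existing:
            replay.append(raw)  # safe to re-inject verbatim with LPUSH
    return replay

waiting = ['{"executionId": "e1"}', '{"executionId": "e2"}']
print(jobs_to_replay(waiting, {"e1"}))  # only e2 survives the filter
```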
3.5 Validate Database Consistency
3.5.1 Query recent executions on the primary
```sql
SELECT execution_id, status, updated_at
FROM executions
WHERE updated_at > now() - interval '1 hour'
ORDER BY updated_at DESC;
```
3.5.2 If rows are missing on the primary, trigger a re‑sync
```sql
-- PostgreSQL streaming replication
SELECT pg_reload_conf();  -- reload any changed parameters
SELECT pg_promote();      -- promote replica if primary is unreachable
```
EEFA Tip: Keep logical replication slots for n8n so queued events aren’t lost during a fail‑over.
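To spot writes stranded on an isolated replica, run the query above on both hosts and diff the execution_id columns. A small Python sketch of that diff (the IDs shown are illustrative):

```python
def missing_on_primary(primary_ids, replica_ids):
    """Execution IDs present on a replica but absent from the primary:
    candidates for writes stranded by the partition."""
    return sorted(set(replica_ids) - set(primary_ids))

primary = ["exec-100", "exec-101"]
replica = ["exec-100", "exec-101", "exec-102"]
print(missing_on_primary(primary, replica))  # ['exec-102']
```

Any IDs this reports need manual reconciliation (or re-execution) after the replica is re-attached.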
4. Preventive Configuration: Make n8n Partition‑Resilient
4.1 Core n8n Settings
| Setting | Recommended Value | Why It Helps |
|---|---|---|
| EXECUTIONS_PROCESS_TIMEOUT | 300000 (5 min) | Workers abort hung jobs, freeing the queue |
| QUEUE_RECONNECT_ATTEMPTS | 10 | Aggressive retries reduce transient split impact |
| QUEUE_RECONNECT_INTERVAL_MS | 2000 | Short interval keeps the queue alive during brief glitches |
| N8N_DISABLE_PRODUCTION_WEBHOOKS | false | Allows any API node to retry once connectivity restores |
| N8N_WORKER_CONCURRENCY | 2‑4 per CPU core | Prevents overload on a single worker that could mask a partition |
4.2 Sample .env (split for readability)
```bash
# Core n8n
EXECUTIONS_PROCESS_TIMEOUT=300000
EXECUTIONS_TIMEOUT=600000
N8N_WORKER_CONCURRENCY=8

# Queue (Redis) resilience
QUEUE_RECONNECT_ATTEMPTS=10
QUEUE_RECONNECT_INTERVAL_MS=2000
REDIS_TLS_ENABLED=true
REDIS_HOST=redis-prod.mycompany.internal
REDIS_PORT=6380
```
EEFA Advisory: When TLS is enabled on Redis, ensure the certificate chain is trusted by all container images; otherwise each node will report “partitioned” due to TLS handshake failures.
5. One‑Paragraph Featured Snippet
n8n fails under a partial network partition when any node (API, worker, queue, or database) loses connectivity to the rest of the cluster, causing webhooks to be accepted but not queued, duplicate job enqueues, or stuck executions. Detect it instantly by calling each node’s /health endpoint and looking for mismatched leaderId or "partitioned": true. Re‑establish network links, force a leader re‑election, and replay any jobs left in the executionQueue:waiting Redis set.
All recommendations assume you are running n8n ≥ 1.0 with Redis or RabbitMQ as the execution queue and PostgreSQL as the primary datastore.



