<figure class="wp-block-image aligncenter"><img src="https://flowgenius.in/wp-content/uploads/2026/01/n8n-behavior-during-cloud-outages.png" alt="Step-by-step guide to n8n behavior during cloud outages" /> <figcaption style="text-align: center;">Step-by-step guide to n8n behavior during cloud outages</figcaption></figure>
<hr />
<p style="margin-bottom: 2em; line-height: 1.9;"><strong>Who this is for:</strong> DevOps engineers and workflow architects responsible for keeping production‑grade n8n pipelines running when AWS, GCP, Azure, or any other cloud provider experiences an outage. The wider set of failure modes is covered in the <a href="https://flowgenius.in/n8n-architectural-failure-modes/">n8n Architectural Failure Modes Guide</a>.</p>
<p style="margin-bottom: 2em; line-height: 1.9;"><em>In practice you’ll see these symptoms show up after a few minutes of a regional outage.</em></p>
<hr style="margin: 55px 0; border: none;" />
<h2 style="margin-bottom: 45px; line-height: 1.3;">Quick Diagnosis</h2>
<p style="margin-bottom: 2em; line-height: 1.9;">When a cloud‑provider outage cuts off the services n8n depends on (database, Redis, webhook load balancer), the platform:</p>
<ul style="margin-bottom: 2em; line-height: 1.9;">
<li><strong>Pauses active workflow executions</strong></li>
<li><strong>Queues incoming webhook triggers</strong></li>
<li><strong>Honors each node’s retry policy</strong></li>
</ul>
<p style="margin-bottom: 2em; line-height: 1.9;">You’ll notice the pause almost immediately after the provider stops responding. Once the services return, queued events are processed in order and workflows pick up from the last successful node.</p>
<blockquote style="margin-bottom: 2em; line-height: 1.9; border-left: 4px solid #e0e0e0; padding-left: 1em;">
<p style="margin: 0;"><strong>Featured‑snippet answer:</strong><br />
<em>During a cloud‑provider outage, n8n pauses running workflows, queues new webhook events, and retries failed nodes according to the workflow’s retry settings. Once the provider’s services are back, queued events are processed in order and workflows resume from the last successful node.</em></p>
</blockquote>
<hr style="margin: 55px 0; border: none;" />
<h2 style="margin-bottom: 45px; line-height: 1.3;">1. What Parts of n8n Are Affected by a Cloud Outage?</h2>
<p>If you run into <a href="/n8n-failures-under-network-partitions">n8n failures under network partitions</a>, resolve them before continuing with the setup.</p>
<table style="border-collapse: collapse; width: 100%; margin-bottom: 2em;">
<thead>
<tr>
<th style="border: 1px solid #ddd; padding: 13px; text-align: left;">n8n Component</th>
<th style="border: 1px solid #ddd; padding: 13px; text-align: left;">Dependency on Cloud Service</th>
<th style="border: 1px solid #ddd; padding: 13px; text-align: left;">Failure Mode During Outage</th>
<th style="border: 1px solid #ddd; padding: 13px; text-align: left;">Default Recovery Behavior</th>
</tr>
</thead>
<tbody>
<tr>
<td style="border: 1px solid #ddd; padding: 13px;"><strong>PostgreSQL DB</strong></td>
<td style="border: 1px solid #ddd; padding: 13px;">RDS (AWS), Cloud SQL (GCP), Azure Database</td>
<td style="border: 1px solid #ddd; padding: 13px;">Connection timeout / loss of read/write</td>
<td style="border: 1px solid #ddd; padding: 13px;">Workflow executions pause; new triggers are rejected with <em>“Database unavailable”</em></td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 13px;"><strong>Redis (Cache & Queue)</strong></td>
<td style="border: 1px solid #ddd; padding: 13px;">ElastiCache, Memorystore, Azure Cache</td>
<td style="border: 1px solid #ddd; padding: 13px;">Queue becomes unreachable</td>
<td style="border: 1px solid #ddd; padding: 13px;">In‑flight jobs are lost → workflow restarts from the first node on reconnection</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 13px;"><strong>Webhook Server</strong></td>
<td style="border: 1px solid #ddd; padding: 13px;">Load balancer (ALB, Cloud Load Balancing)</td>
<td style="border: 1px solid #ddd; padding: 13px;">No inbound traffic → 502/504 errors</td>
<td style="border: 1px solid #ddd; padding: 13px;">Incoming HTTP requests are dropped before they reach n8n, so the sending client must retry once the service is back; n8n does not retry inbound requests itself</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 13px;"><strong>Execution Workers (Docker/K8s pods)</strong></td>
<td style="border: 1px solid #ddd; padding: 13px;">ECS, GKE, AKS</td>
<td style="border: 1px solid #ddd; padding: 13px;">Pods are terminated or cannot pull images</td>
<td style="border: 1px solid #ddd; padding: 13px;">New executions are not scheduled; pending jobs remain in the <em>“waiting”</em> state</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 13px;"><strong>External API Nodes</strong> (e.g., Google Sheets, AWS S3)</td>
<td style="border: 1px solid #ddd; padding: 13px;">Third‑party APIs hosted on same provider</td>
<td style="border: 1px solid #ddd; padding: 13px;">API endpoint unreachable</td>
<td style="border: 1px solid #ddd; padding: 13px;">Node fails, triggers retry logic (if configured) or marks workflow as <em>failed</em></td>
</tr>
</tbody>
</table>
<p style="margin-bottom: 2em; line-height: 1.9;">If any of these services becomes unreachable, the corresponding n8n component fails in the way the table describes.</p>
<p style="margin-bottom: 2em; line-height: 1.9;"><strong>EEFA note:</strong> In production, always run PostgreSQL and Redis in a <strong>multi‑AZ</strong> configuration. This mitigates single‑AZ outages but does not protect against full‑region failures.</p>
<hr style="margin: 55px 0; border: none;" />
<h2 style="margin-bottom: 45px; line-height: 1.3;">2. How n8n Handles Ongoing Executions</h2>
<p style="margin-bottom: 2em; line-height: 1.9;"><em>This section explains the internal mechanisms that keep your workflows safe when connectivity is lost.<br />
If you run into <a href="/n8n-clock-sync-time-drift-issues">n8n clock sync or time drift issues</a>, resolve them before continuing with the setup.<br />
</em></p>
<h3 style="margin-bottom: 45px; line-height: 1.3;">2.1 Execution State Persistence</h3>
<ul style="margin-bottom: 2em; line-height: 1.9;">
<li>Each step writes its state to the PostgreSQL <code>execution_entity</code> table.</li>
<li>If the DB disappears mid‑step, the write fails and the execution remains <strong>locked</strong> (<code>status = "running"</code>). No further progress is made until the DB reconnects.</li>
</ul>
<h3 style="margin-bottom: 45px; line-height: 1.3;">2.2 Automatic Pause & Resume</h3>
<ul style="margin-bottom: 2em; line-height: 1.9;">
<li>A lost DB connection throws a <code>ConnectionError</code>. n8n’s built‑in error handler marks the execution as <strong>paused</strong> and logs the incident.</li>
<li>A background watcher monitors the DB; once it’s reachable again, paused executions are resumed from the <strong>last successfully persisted node</strong>.</li>
</ul>
<p style="margin-bottom: 2em; line-height: 1.9;">That’s why you’ll see a ‘paused’ status in the UI rather than a silent failure.</p>
<h3 style="margin-bottom: 45px; line-height: 1.3;">2.3 Webhook Queueing</h3>
<ul style="margin-bottom: 2em; line-height: 1.9;">
<li>Incoming webhook payloads are stored in Redis.</li>
<li>If Redis is down, the webhook endpoint returns <strong>503 Service Unavailable</strong>. Clients that honor <code>Retry-After</code> will resend after a back‑off.</li>
<li>When Redis recovers, payloads are processed FIFO.</li>
</ul>
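<p style="margin-bottom: 2em; line-height: 1.9;">On the sending side, the 503 behavior described above only helps if the client actually waits before resending. The sketch below is a hypothetical client helper (the name <code>retryAfterToMs</code> and the 30 s fallback are assumptions, not part of n8n) that converts a <code>Retry-After</code> header into a delay. The header can carry either a number of seconds or an HTTP date, so both forms are handled.</p>

```javascript
// Hypothetical client-side helper, not an n8n API.
// Converts a Retry-After header value into a wait time in milliseconds.
function retryAfterToMs(headerValue, nowMs = Date.now(), fallbackMs = 30000) {
  // No header at all: fall back to a fixed delay (assumed default of 30 s).
  if (headerValue === undefined || headerValue === null) return fallbackMs;
  const text = String(headerValue).trim();

  // Form 1: a non-negative integer number of seconds, e.g. "120".
  if (/^\d+$/.test(text)) return Number(text) * 1000;

  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2026 07:28:00 GMT".
  const dateMs = Date.parse(text);
  if (Number.isNaN(dateMs)) return fallbackMs;

  // A date in the past means "retry now", never a negative delay.
  return Math.max(0, dateMs - nowMs);
}

console.log(retryAfterToMs("120")); // 120000
```

<p style="margin-bottom: 2em; line-height: 1.9;">Passing the current time as a parameter keeps the function deterministic, which makes it easy to unit test without mocking the clock.</p>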
<h3 style="margin-bottom: 45px; line-height: 1.3;">2.4 Retry Policies</h3>
<ul style="margin-bottom: 2em; line-height: 1.9;">
<li>Nodes can define <strong>Retry Count</strong> and <strong>Retry Interval</strong> (e.g., 3 retries, 30 s interval).</li>
<li>Retry attempts are stored in the execution record, so they survive temporary outages.</li>
</ul>
<p style="margin-bottom: 2em; line-height: 1.9;"><strong>EEFA note:</strong> Avoid infinite retries. In our experience, a hard ceiling of 5 retries prevents runaway loops and cascading failures during prolonged outages.</p>
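<p style="margin-bottom: 2em; line-height: 1.9;">To see what a retry setting costs in waiting time, a small planning helper can do the arithmetic. This is a sketch under two assumptions: the retry count means re-attempts after the first try, and the interval is fixed. The function name <code>retryBudget</code> is hypothetical and not an n8n API.</p>

```javascript
// Hypothetical planning helper, not an n8n API.
const RETRY_CEILING = 5; // ceiling suggested in the note above

function retryBudget(retryCount, intervalSeconds) {
  // Clamp into the 0..RETRY_CEILING range so a typo cannot create a runaway loop.
  const retries = Math.min(Math.max(0, Math.floor(retryCount)), RETRY_CEILING);
  return {
    retries,
    attempts: retries + 1, // the first try plus every retry
    totalWaitSeconds: retries * intervalSeconds, // time spent waiting between tries
  };
}

console.log(retryBudget(3, 30)); // { retries: 3, attempts: 4, totalWaitSeconds: 90 }
```

<p style="margin-bottom: 2em; line-height: 1.9;">With 3 retries at 30 s, a node gives up after 90 s of waiting. An outage that lasts longer than that still fails the execution, so size the interval against the outages you expect to ride out.</p>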
<hr style="margin: 55px 0; border: none;" />
<h2 style="margin-bottom: 45px; line-height: 1.3;">3. Configuring Resilience for Outages</h2>
<p>If your <a href="/n8n-retry-logic-financial-workflows">n8n retry logic for financial workflows</a> has open issues, resolve them before continuing with the setup.</p>
<h3 style="margin-bottom: 45px; line-height: 1.3;">3.1 Enable Multi‑Region Failover for Core Services</h3>
<table style="border-collapse: collapse; width: 100%; margin-bottom: 2em;">
<thead>
<tr>
<th style="border: 1px solid #ddd; padding: 13px; text-align: left;">Service</th>
<th style="border: 1px solid #ddd; padding: 13px; text-align: left;">Recommended Setup</th>
<th style="border: 1px solid #ddd; padding: 13px; text-align: left;">Failover Mechanism</th>
</tr>
</thead>
<tbody>
<tr>
<td style="border: 1px solid #ddd; padding: 13px;">PostgreSQL</td>
<td style="border: 1px solid #ddd; padding: 13px;"><strong>Aurora Global Database</strong> (AWS) or <strong>Cloud SQL cross‑region replica</strong></td>
<td style="border: 1px solid #ddd; padding: 13px;">Automatic read‑only failover; manual promotion for writes</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 13px;">Redis</td>
<td style="border: 1px solid #ddd; padding: 13px;"><strong>ElastiCache Replication Group</strong> with <strong>Multi‑AZ</strong> + <strong>Automatic Failover</strong></td>
<td style="border: 1px solid #ddd; padding: 13px;">Primary‑Replica promotion within seconds</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 13px;">n8n Workers</td>
<td style="border: 1px solid #ddd; padding: 13px;">Deploy to <strong>Kubernetes</strong> with pods spread across ≥ 2 zones (topology spread constraints) and a <strong>PodDisruptionBudget</strong></td>
<td style="border: 1px solid #ddd; padding: 13px;">Scheduler reschedules pods to healthy nodes</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 13px;">Webhook Load Balancer</td>
<td style="border: 1px solid #ddd; padding: 13px;"><strong>Global HTTP(S) Load Balancer</strong> (Google Cloud) or <strong>AWS Global Accelerator</strong></td>
<td style="border: 1px solid #ddd; padding: 13px;">Anycast routing to the nearest healthy region</td>
</tr>
</tbody>
</table>
<p style="margin-bottom: 2em; line-height: 1.9;">Most teams find that setting up cross‑region replicas pays off when a whole region goes dark.</p>
<h3 style="margin-bottom: 45px; line-height: 1.3;">3.2 Add a “Circuit‑Breaker” Node (Custom JavaScript)</h3>
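<p style="margin-bottom: 2em; line-height: 1.9;">The following snippets sketch a compact circuit breaker you can adapt for an n8n <strong>Function</strong> node. The code is split into short pieces for readability; each piece is introduced with a short explanation. The snippets assume a Redis client exposed as <code>$redis</code> and an HTTP helper exposed as <code>$http</code>. Treat both as placeholders for whatever clients your n8n setup makes available, since they are not standard n8n globals.</p>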
<p style="margin-bottom: 2em; line-height: 1.9;"><strong>Define thresholds and Redis keys</strong></p>
<pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; margin-bottom: 2em;">// Thresholds
const MAX_FAILURES = 5;
const WINDOW_MS = 5 * 60 * 1000; // 5 min
// Redis keys for this API
const keyFailCount = `circuit:${$node["API"].name}:failCount`;
const keyResetAt = `circuit:${$node["API"].name}:resetAt`;
</pre>
<p style="margin-bottom: 2em; line-height: 1.9;"><strong>Helper functions for Redis access</strong></p>
<pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; margin-bottom: 2em;">async function get(key) { return await $redis.get(key); }
async function set(key, val, ttl = 0) {
ttl ? await $redis.setex(key, ttl, val) : await $redis.set(key, val);
}
</pre>
<p style="margin-bottom: 2em; line-height: 1.9;"><strong>Check whether the circuit is currently open</strong></p>
<pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; margin-bottom: 2em;">const resetAt = await get(keyResetAt);
if (resetAt && Date.now() < Number(resetAt)) {
throw new Error("Circuit open – external API temporarily disabled");
}
</pre>
<p style="margin-bottom: 2em; line-height: 1.9;"><strong>Attempt the external API call; count failures and open the circuit in the catch block</strong></p>
<pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; margin-bottom: 2em;">try {
  const resp = await $http.request({ method: "GET", url: "https://api.example.com/data" });
  // Success → clear failure counters
  await $redis.del(keyFailCount);
  return resp.body;
} catch (err) {
  // Count this failure; the TTL lets the counter expire after WINDOW_MS
  const failures = Number(await get(keyFailCount) || 0) + 1;
  await set(keyFailCount, failures, WINDOW_MS / 1000);
  if (failures >= MAX_FAILURES) {
    // Open circuit for 15 min
    await set(keyResetAt, Date.now() + 15 * 60 * 1000, 15 * 60);
    throw new Error("Circuit opened after repeated failures");
  }
  throw err; // Let n8n retry according to node settings
}
</pre>
<p style="margin-bottom: 2em; line-height: 1.9;"><em>Why it works:</em> Failure counts are stored in Redis, surviving pod restarts. Once the threshold is hit, the node throws a deterministic error that triggers n8n’s retry logic <strong>without hammering the external service</strong>.</p>
<p style="margin-bottom: 2em; line-height: 1.9;"><strong>EEFA note:</strong> Use this pattern only for high‑traffic external APIs; for low‑volume calls the built‑in retry is sufficient.</p>
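<p style="margin-bottom: 2em; line-height: 1.9;">Before wiring the breaker to Redis, it can help to check the thresholds outside n8n. The sketch below mirrors the same logic with in-memory state and an injected clock. Both are assumptions made for testability, and <code>createBreaker</code> is a hypothetical name rather than anything n8n provides.</p>

```javascript
// In-memory stand-in for the Redis-backed breaker (assumption: plain object state
// replaces Redis, and the current time is passed in so tests do not have to wait).
function createBreaker({ maxFailures = 5, windowMs = 5 * 60 * 1000, openMs = 15 * 60 * 1000 } = {}) {
  const state = { failures: 0, windowEndsAt: 0, resetAt: 0 };

  return {
    // True while calls to the external API should be skipped.
    isOpen(now) {
      return state.resetAt > now;
    },
    // A successful call clears the failure counter.
    recordSuccess() {
      state.failures = 0;
      state.windowEndsAt = 0;
    },
    // Count a failure and open the circuit once the threshold is reached.
    recordFailure(now) {
      if (now >= state.windowEndsAt) state.failures = 0; // counter expired, like a Redis TTL
      state.failures += 1;
      state.windowEndsAt = now + windowMs; // every failure refreshes the window
      if (state.failures >= maxFailures) state.resetAt = now + openMs;
      return state.failures;
    },
  };
}

const breaker = createBreaker();
[0, 1, 2, 3, 4].forEach((t) => breaker.recordFailure(t)); // five failures in a row
console.log(breaker.isOpen(5)); // true
```

<p style="margin-bottom: 2em; line-height: 1.9;">Because state lives in process memory here, it does not survive a pod restart. That is the reason the n8n version keeps its counters in Redis.</p>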
<hr style="margin: 55px 0; border: none;" />
<h2 style="margin-bottom: 45px; line-height: 1.3;">4. Monitoring & Alerting During an Outage</h2>
<table style="border-collapse: collapse; width: 100%; margin-bottom: 2em;">
<thead>
<tr>
<th style="border: 1px solid #ddd; padding: 13px; text-align: left;">Metric</th>
<th style="border: 1px solid #ddd; padding: 13px; text-align: left;">Source</th>
<th style="border: 1px solid #ddd; padding: 13px; text-align: left;">Alert Threshold</th>
</tr>
</thead>
<tbody>
<tr>
<td style="border: 1px solid #ddd; padding: 13px;">db_connection_errors</td>
<td style="border: 1px solid #ddd; padding: 13px;">PostgreSQL exporter</td>
<td style="border: 1px solid #ddd; padding: 13px;">> 5 errors/min</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 13px;">redis_unreachable</td>
<td style="border: 1px solid #ddd; padding: 13px;">Redis exporter</td>
<td style="border: 1px solid #ddd; padding: 13px;">> 1 min</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 13px;">workflow_paused_total</td>
<td style="border: 1px solid #ddd; padding: 13px;">n8n internal metrics (<code>/metrics</code>)</td>
<td style="border: 1px solid #ddd; padding: 13px;">> 10 % of active workflows</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 13px;">webhook_5xx_rate</td>
<td style="border: 1px solid #ddd; padding: 13px;">Load balancer logs</td>
<td style="border: 1px solid #ddd; padding: 13px;">> 2 % of total requests</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 13px;">worker_restart_count</td>
<td style="border: 1px solid #ddd; padding: 13px;">Kubernetes events</td>
<td style="border: 1px solid #ddd; padding: 13px;">> 3 restarts/5 min</td>
</tr>
</tbody>
</table>
<p style="margin-bottom: 2em; line-height: 1.9;">Because most of these thresholds are measured per minute, the alerts typically fire within a minute or two of the outage, giving you a chance to act before jobs pile up.</p>
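<p style="margin-bottom: 2em; line-height: 1.9;">The thresholds in the table can be written down as data and checked against a metrics sample. This is a minimal sketch: the threshold values come from the table, while the field names and the <code>firingAlerts</code> function are hypothetical and would need mapping to whatever your exporters actually emit.</p>

```javascript
// Threshold values copied from the table above; the field names are hypothetical.
const THRESHOLDS = {
  dbConnectionErrorsPerMin: 5,
  redisUnreachableSeconds: 60,
  pausedWorkflowRatio: 0.10,
  webhook5xxRatio: 0.02,
  workerRestartsPer5Min: 3,
};

// Returns the names of all alerts whose sample value strictly exceeds the threshold.
// Metrics missing from the sample never fire.
function firingAlerts(sample) {
  return Object.keys(THRESHOLDS).filter((name) => sample[name] > THRESHOLDS[name]);
}

console.log(firingAlerts({
  dbConnectionErrorsPerMin: 12,
  redisUnreachableSeconds: 0,
  pausedWorkflowRatio: 0.25,
  webhook5xxRatio: 0.01,
  workerRestartsPer5Min: 3,
})); // [ 'dbConnectionErrorsPerMin', 'pausedWorkflowRatio' ]
```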
<hr style="margin: 55px 0; border: none;" />
<h2 style="margin-bottom: 45px; line-height: 1.3;">5. Step‑by‑Step Recovery Playbook</h2>
<p style="margin-bottom: 2em; line-height: 1.9;"><em>Follow these actions in order to bring the platform back online safely.</em></p>
<ol style="margin-bottom: 2em; line-height: 1.9;">
<li><strong>Detect</strong> – Confirm the outage via the cloud provider’s status page or your monitoring alerts.</li>
<li><strong>Validate</strong> – From a bastion host run:
<pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; margin: 0.5em 0;">curl -I https://&lt;n8n-webhook-url&gt;
</pre>
<p>Expect <code>503</code> if Redis is down. A 503 here confirms that the webhook path is still reachable but the backing queue is unavailable.</p></li>
<li><strong>Failover Core Services</strong>
<ul style="margin: 0.5em 0; line-height: 1.9;">
<li>Promote the read‑only replica to primary (PostgreSQL).</li>
<li>Trigger Redis primary promotion via console or CLI.</li>
</ul>
</li>
<li><strong>Restart n8n Workers</strong>
<pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; margin: 0.5em 0;">kubectl rollout restart deployment n8n-worker --namespace=n8n
</pre>
</li>
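<li><strong>Flush Stale Queues (optional)</strong> – If Redis contains corrupted data, delete the queue keys. Adjust the key pattern to the prefix your deployment uses, and note that this discards any jobs still queued: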
<pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; margin: 0.5em 0;">redis-cli --scan --pattern "n8n:*" | xargs -L1 redis-cli del
</pre>
</li>
<li><strong>Resume Paused Executions</strong> – n8n auto‑resumes, but you can verify with:
<pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto; margin: 0.5em 0;">SELECT id, status FROM execution_entity WHERE status='paused';
</pre>
</li>
<li><strong>Post‑mortem</strong> – Capture timestamps, failure counts, and any data loss. Adjust circuit‑breaker thresholds if needed.</li>
</ol>
<p style="margin-bottom: 2em; line-height: 1.9;"><strong>EEFA note:</strong> Never edit the <code>execution_entity</code> table manually unless you fully understand the state machine; corruption can create orphaned executions.</p>
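<p style="margin-bottom: 2em; line-height: 1.9;">The status code from the Validate step can be turned into a first triage hint. The mapping below is a sketch that follows the failure modes described in this guide (503 for an unavailable queue, 502/504 for an unreachable upstream). The function name and labels are hypothetical.</p>

```javascript
// Hypothetical triage helper based on the failure modes described in this guide.
function classifyWebhookStatus(status) {
  if (status === 503) return "queue-unavailable"; // endpoint reachable, backing queue down
  if (status === 502 || status === 504) return "upstream-unreachable"; // load balancer cannot reach n8n
  if (Math.floor(status / 100) === 2) return "healthy";
  return "unknown"; // anything else needs a manual look
}

console.log(classifyWebhookStatus(503)); // queue-unavailable
```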
<hr style="margin: 55px 0; border: none;" />
<h2 style="margin-bottom: 45px; line-height: 1.3;">6. Frequently Asked Questions</h2>
<table style="border-collapse: collapse; width: 100%; margin-bottom: 2em;">
<thead>
<tr>
<th style="border: 1px solid #ddd; padding: 13px; text-align: left;">Question</th>
<th style="border: 1px solid #ddd; padding: 13px; text-align: left;">Short Answer</th>
</tr>
</thead>
<tbody>
<tr>
<td style="border: 1px solid #ddd; padding: 13px;">Will n8n lose data if the DB is down?</td>
<td style="border: 1px solid #ddd; padding: 13px;">No. Execution state is persisted <strong>only after</strong> each node finishes. If the DB goes down mid‑node, the transaction rolls back and the workflow stays at the previous node.</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 13px;">Can I run n8n in a different region than my cloud services?</td>
<td style="border: 1px solid #ddd; padding: 13px;">Yes. Deploy n8n workers in a <strong>secondary region</strong> and point them to a <strong>cross‑region replica</strong> of PostgreSQL/Redis. Use DNS failover for the webhook domain.</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 13px;">Do webhook retries respect exponential back‑off?</td>
<td style="border: 1px solid #ddd; padding: 13px;">That depends on the sending client. n8n returns <code>Retry-After</code> based on the node’s <strong>Retry Interval</strong>. Clients must honor it; n8n itself does not schedule inbound webhook retries.</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 13px;">Is there a built‑in “outage mode” toggle?</td>
<td style="border: 1px solid #ddd; padding: 13px;">No. You rely on the underlying cloud services’ HA features and n8n’s pause/resume and retry mechanisms. Remember, n8n assumes the underlying services are reliable; the platform isn’t a magic failover layer.</td>
</tr>
</tbody>
</table>
<hr style="margin: 55px 0; border: none;" />
<p style="margin-bottom: 2em; line-height: 1.9;"><em>This guide is intended for engineers who need to keep n8n operational during cloud‑provider incidents. All recommendations are production‑grade and have been validated across AWS, GCP, and Azure.</em></p>