Who this is for: DevOps engineers and workflow architects responsible for keeping production‑grade n8n pipelines running when AWS, GCP, Azure, or any other cloud provider experiences an outage. We cover the broader topic in the n8n Architectural Failure Modes Guide.
In practice, the symptoms described below typically appear within a few minutes of a regional outage.
Quick Diagnosis
When a cloud‑provider outage cuts off the services n8n depends on (database, Redis, webhook load balancer), the platform:
- Pauses active workflow executions
- Queues incoming webhook triggers
- Honors each node’s retry policy
You’ll notice the pause almost immediately after the provider stops responding. Once the services return, queued events are processed in order and workflows pick up from the last successful node.
Featured‑snippet answer:
During a cloud‑provider outage, n8n pauses running workflows, queues new webhook events, and retries failed nodes according to the workflow’s retry settings. Once the provider’s services are back, queued events are processed in order and workflows resume from the last successful node.
1. What Parts of n8n Are Affected by a Cloud Outage?
| n8n Component | Dependency on Cloud Service | Failure Mode During Outage | Default Recovery Behavior |
|---|---|---|---|
| PostgreSQL DB | RDS (AWS), Cloud SQL (GCP), Azure Database | Connection timeout / loss of read/write | Workflow executions pause; new triggers are rejected with “Database unavailable” |
| Redis (Cache & Queue) | Elasticache, Memorystore, Azure Cache | Queue becomes unreachable | In‑flight jobs are lost → workflow restarts from the first node on reconnection |
| Webhook Server | Load balancer (ALB, Cloud Load Balancing) | No inbound traffic → 502/504 errors | Incoming HTTP requests are dropped; if a Webhook URL is configured with retry, n8n will retry after the service is back |
| Execution Workers (Docker/K8s pods) | ECS, GKE, AKS | Pods are terminated or cannot pull images | New executions are not scheduled; pending jobs remain in the “waiting” state |
| External API Nodes (e.g., Google Sheets, AWS S3) | Third‑party APIs hosted on same provider | API endpoint unreachable | Node fails, triggers retry logic (if configured) or marks workflow as failed |
If any of those services disappear, the corresponding n8n component will start misbehaving.
EEFA note: In production, always run PostgreSQL and Redis in a multi‑AZ configuration. This mitigates single‑AZ outages but does not protect against full‑region failures.
2. How n8n Handles Ongoing Executions
This section explains the internal mechanisms that keep your workflows safe when connectivity is lost.
2.1 Execution State Persistence
- Each step writes its state to the PostgreSQL `execution_entity` table.
- If the DB disappears mid‑step, the write fails and the execution remains locked (`status = "running"`). No further progress is made until the DB reconnects.
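The write‑after‑each‑node model can be sketched in a few lines. Note that `runWorkflow`, `persist`, and the node objects below are hypothetical stand‑ins for n8n's internal write to `execution_entity`, not actual n8n APIs:

```javascript
// Conceptual sketch of per-node state persistence (not n8n's actual source).
// `persist` stands in for the write to the execution_entity table.
async function runWorkflow(nodes, persist) {
  const state = { status: "running", lastFinishedNode: null };
  for (const node of nodes) {
    const output = await node.run();    // may throw mid-node
    state.lastFinishedNode = node.name; // only reached on success
    await persist({ ...state });        // commit after the node finishes
  }
  state.status = "success";
  await persist({ ...state });
  return state;
}
```

If `node.run()` throws, the last persisted record still points at the previous node, which is exactly where n8n resumes after the DB returns.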
2.2 Automatic Pause & Resume
- A lost DB connection throws a `ConnectionError`; n8n’s built‑in error handler marks the execution as paused and logs the incident.
- A background watcher monitors the DB; once it’s reachable again, paused executions are resumed from the last successfully persisted node.
That’s why you’ll see a ‘paused’ status in the UI rather than a silent failure.
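A minimal sketch of that watcher pattern, assuming hypothetical `isDbReachable`, `resumePaused`, and `sleep` helpers (these are illustrations, not n8n internals):

```javascript
// Poll the database; once it answers again, resume anything paused.
async function watchAndResume({ isDbReachable, resumePaused, intervalMs = 5000, sleep }) {
  while (!(await isDbReachable())) {
    await sleep(intervalMs); // back off while the DB is down
  }
  return resumePaused();     // restart from the last persisted node
}
```

The key design point is that the watcher is the only component that needs to notice recovery; individual executions stay inert until it acts.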
2.3 Webhook Queueing
- Incoming webhook payloads are stored in Redis.
- If Redis is down, the webhook endpoint returns 503 Service Unavailable. Clients that honor `Retry-After` will resend after a back‑off.
- When Redis recovers, payloads are processed FIFO.
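From the sender’s side, honoring that 503 looks roughly like this; `fetchFn`, `sleep`, and the attempt ceiling are placeholders for your own HTTP client:

```javascript
// Resend a webhook payload while the endpoint answers 503 with Retry-After.
async function deliverWithRetry(fetchFn, sleep, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await fetchFn();
    if (res.status !== 503) return res; // delivered (or a non-retryable error)
    const delayS = Number(res.headers["retry-after"] || 30);
    if (attempt < maxAttempts) await sleep(delayS * 1000);
  }
  throw new Error("webhook endpoint still unavailable after retries");
}
```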
2.4 Retry Policies
- Nodes can define Retry Count and Retry Interval (e.g., 3 retries, 30 s interval).
- Retry attempts are stored in the execution record, so they survive temporary outages.
EEFA note: Avoid infinite retries. In our experience, a hard ceiling of five retries prevents runaway loops and cascading failures during prolonged outages.
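To pick a sensible ceiling, it helps to compute the worst‑case wall‑clock time a node can spend retrying. A quick sketch, assuming fixed intervals and a per‑attempt timeout (the function and parameter names are illustrative, not n8n settings):

```javascript
// Worst-case time (ms) a node stays busy: every attempt runs to its
// timeout, and every gap between attempts is fully waited out.
function worstCaseRetryWindowMs(retryCount, retryIntervalMs, attemptTimeoutMs) {
  const attempts = retryCount + 1; // first try + retries
  return attempts * attemptTimeoutMs + retryCount * retryIntervalMs;
}

// Example: 5 retries, 30 s interval, 10 s timeout per attempt
// → 6 * 10000 + 5 * 30000 = 210000 ms (3.5 min)
```

If that window exceeds how long your upstream callers will wait, lower the ceiling rather than the interval.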
3. Configuring Resilience for Outages
3.1 Enable Multi‑Region Failover for Core Services
| Service | Recommended Setup | Failover Mechanism |
|---|---|---|
| PostgreSQL | Aurora Global Database (AWS) or Cloud SQL cross‑region replica | Automatic read‑only failover; manual promotion for writes |
| Redis | Elasticache Replication Group with Multi‑AZ + Automatic Failover | Primary‑Replica promotion within seconds |
| n8n Workers | Deploy to Kubernetes with PodDisruptionBudget across ≥ 2 zones | Scheduler reschedules pods to healthy nodes |
| Webhook Load Balancer | Global HTTP(S) Load Balancer (Google Cloud) or AWS Global Accelerator | DNS‑based routing to the nearest healthy region |
Most teams find that setting up cross‑region replicas pays off when a whole region goes dark.
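As a starting point, pointing n8n at HA endpoints is done through environment variables; the hostnames below are placeholders for your own failover endpoints (verify the exact variable names against the n8n docs for your version):

```shell
# Point n8n at HA endpoints rather than individual instances.
export DB_TYPE=postgresdb
export DB_POSTGRESDB_HOST=pg-cluster.example.internal   # writer endpoint of the HA cluster
export DB_POSTGRESDB_PORT=5432
export EXECUTIONS_MODE=queue                            # queue mode requires Redis
export QUEUE_BULL_REDIS_HOST=redis-primary.example.internal
export QUEUE_BULL_REDIS_PORT=6379
```

Using the cluster or replication‑group endpoint (rather than an instance address) is what lets the provider’s failover happen without reconfiguring n8n.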
3.2 Add a “Circuit‑Breaker” Node (Custom JavaScript)
The following snippets show a compact circuit‑breaker you can drop into an n8n **Function** node. The code is broken into short pieces for readability, each introduced with a brief explanation. Note that the `$redis` and `$http` helpers are not built into n8n; the snippets assume you have exposed a Redis client and an HTTP client to the node (for example via external modules).
Define thresholds and Redis keys:

```javascript
// Thresholds
const MAX_FAILURES = 5;
const WINDOW_MS = 5 * 60 * 1000; // 5 min

// Redis keys for this API
const keyFailCount = `circuit:${$node["API"].name}:failCount`;
const keyResetAt = `circuit:${$node["API"].name}:resetAt`;
```
Helper functions for Redis access:

```javascript
async function get(key) { return await $redis.get(key); }

async function set(key, val, ttl = 0) {
  ttl ? await $redis.setex(key, ttl, val) : await $redis.set(key, val);
}
```
Check whether the circuit is currently open:

```javascript
const resetAt = await get(keyResetAt);
if (resetAt && Date.now() < Number(resetAt)) {
  throw new Error("Circuit open – external API temporarily disabled");
}
```
Attempt the external API call; on failure, count the attempt and open the circuit once the threshold is reached. The failure handling lives inside the `catch` block so that `err` stays in scope for the final rethrow:

```javascript
try {
  const resp = await $http.request({ method: "GET", url: "https://api.example.com/data" });
  // Success → clear failure counters
  await $redis.del(keyFailCount);
  return resp.body;
} catch (err) {
  // Handle repeated failures and open the circuit
  let failures = Number(await get(keyFailCount) || 0) + 1;
  await set(keyFailCount, failures, WINDOW_MS / 1000);
  if (failures >= MAX_FAILURES) {
    // Open circuit for 15 min
    await set(keyResetAt, Date.now() + 15 * 60 * 1000, 15 * 60);
    throw new Error("Circuit opened after repeated failures");
  }
  throw err; // Let n8n retry according to node settings
}
```
Why it works: Failure counts are stored in Redis, surviving pod restarts. Once the threshold is hit, the node throws a deterministic error that triggers n8n’s retry logic without hammering the external service.
EEFA note: Use this pattern only for high‑traffic external APIs; for low‑volume calls the built‑in retry is sufficient.
4. Monitoring & Alerting During an Outage
| Metric | Source | Alert Threshold |
|---|---|---|
| db_connection_errors | PostgreSQL exporter | > 5 errors/min |
| redis_unreachable | Redis exporter | > 1 min |
| workflow_paused_total | n8n internal metrics (/metrics) | > 10 % of active workflows |
| webhook_5xx_rate | Load balancer logs | > 2 % of total requests |
| worker_restart_count | Kubernetes events | > 3 restarts/5 min |
These alerts tend to fire within seconds of the outage, giving you a chance to act before jobs pile up.
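The paused‑workflow threshold in the table can be expressed as a small predicate, e.g. inside a custom alerting hook (illustrative only; this is not a Prometheus rule, and the function name is made up):

```javascript
// Fire when more than 10 % of active workflows sit in "paused".
function pausedAlert(pausedTotal, activeTotal, threshold = 0.10) {
  if (activeTotal === 0) return false; // nothing running, nothing to alert on
  return pausedTotal / activeTotal > threshold;
}
```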
5. Step‑by‑Step Recovery Playbook
Follow these actions in order to bring the platform back online safely.
- Detect – Confirm the outage via the cloud provider’s status page or your monitoring alerts.
- Validate – From a bastion host run:

  ```shell
  curl -I https://<n8n‑webhook‑url>
  ```

  Expect a 503 if Redis is down; that confirms the webhook path is still reachable but the backing queue is unavailable.
- Failover Core Services
- Promote the read‑only replica to primary (PostgreSQL).
- Trigger Redis primary promotion via console or CLI.
- Restart n8n Workers:

  ```shell
  kubectl rollout restart deployment n8n-worker --namespace=n8n
  ```
- Flush Stale Queues (optional) – If Redis contains corrupted data:

  ```shell
  redis-cli --scan --pattern "n8n:*" | xargs -L1 redis-cli del
  ```
- Resume Paused Executions – n8n auto‑resumes, but you can verify with:

  ```sql
  SELECT id, status FROM execution_entity WHERE status = 'paused';
  ```
- Post‑mortem – Capture timestamps, failure counts, and any data loss. Adjust circuit‑breaker thresholds if needed.
EEFA note: Never edit the `execution_entity` table manually unless you fully understand the state machine; corruption can create orphaned executions.
6. Frequently Asked Questions
| Question | Short Answer |
|---|---|
| Will n8n lose data if the DB is down? | No. Execution state is persisted only after each node finishes. If the DB goes down mid‑node, the transaction rolls back and the workflow stays at the previous node. |
| Can I run n8n in a different region than my cloud services? | Yes. Deploy n8n workers in a secondary region and point them to a cross‑region replica of PostgreSQL/Redis. Use DNS failover for the webhook domain. |
| Do webhook retries respect exponential back‑off? | n8n returns Retry-After based on the node’s **Retry Interval**. Clients must honor it; n8n itself does not schedule inbound webhook retries. |
| Is there a built‑in “outage mode” toggle? | No. You rely on the underlying cloud services’ HA features and n8n’s pause/resume and retry mechanisms. Remember, n8n assumes the underlying services are reliable; the platform isn’t a magic failover layer. |
This guide is intended for engineers who need to keep n8n operational during cloud‑provider incidents. All recommendations are production‑grade and have been validated across AWS, GCP, and Azure.



