<figure class="wp-block-image aligncenter"><img src="https://flowgenius.in/wp-content/uploads/2026/01/n8n-long-running-workflow-failures.png" alt="Step by Step Guide to solve n8n long running workflow failures" /> <figcaption style="text-align: center;">Step by Step Guide to solve n8n long running workflow failures</figcaption></figure>
<p style="margin-bottom: 2em; line-height: 1.9;">Who this is for – Developers and SREs who run production‑grade n8n workflows that need to run for hours or days. <strong>We cover this in detail in the </strong><a href="https://flowgenius.in/n8n-production-failure-patterns/">n8n Production Failure Patterns Guide.</a></p>
<div style="margin: 55px 0;">
<hr />
</div>
<h2 style="margin-bottom: 45px; line-height: 1.3;">Quick diagnosis</h2>
<p style="margin-bottom: 2em; line-height: 1.9;">If a workflow that should run for hours or days stops unexpectedly, open the <strong>execution logs</strong> and search for <code>timeout</code>, <code>worker crashed</code>, or <code>database deadlock</code>. Most failures stem from execution time limits, memory pressure, or external‑service throttling. Reduce the batch size, raise the timeout, and add explicit retry/continue‑on‑fail logic to restore stability.</p>
<div style="margin: 55px 0;">
<hr />
</div>
<h2 style="margin-bottom: 45px; line-height: 1.3;">1. Why n8n behaves differently on hour‑long vs. minute‑long workflows</h2>
<p>If you hit <a href="/n8n-idempotency-retry-failures">n8n idempotency retry failures</a>, resolve them before continuing with the setup.</p>
<table style="width: 100%; border-collapse: collapse; margin-bottom: 2em;">
<thead>
<tr>
<th style="padding: 13px; border: 1px solid #e0e0e0; text-align: left;">Aspect</th>
<th style="padding: 13px; border: 1px solid #e0e0e0; text-align: left;">Short‑run (≤ 5 min)</th>
<th style="padding: 13px; border: 1px solid #e0e0e0; text-align: left;">Long‑run (≥ 1 h)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;"><strong>Execution engine</strong></td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Single Node.js process, keeps all data in RAM</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Same process but RAM usage grows as state is stored for each node</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;"><strong>Default limits</strong></td>
<td style="padding: 13px; border: 1px solid #e0e0e0;"><code>EXECUTIONS_TIMEOUT</code> (for example 3600 s = 1 h; disabled by default on self‑hosted n8n) – rarely hit</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Same limit is reached quickly; the OS kills workers whose memory exceeds the container limit (for example 2 GB)</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;"><strong>Database interaction</strong></td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Few reads/writes → low lock contention</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Persistent <code>execution_entity</code> rows → higher chance of deadlocks</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;"><strong>External calls</strong></td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Few HTTP requests → low rate‑limit risk</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Continuous polling / streaming → API throttling, socket time‑outs</td>
</tr>
</tbody>
</table>
<p style="margin-bottom: 2em; line-height: 1.9;"><strong>EEFA note:</strong> In production containers the kernel OOM‑killer will terminate the worker before n8n can emit a graceful error, leaving the execution marked <em>“running”</em> in the UI. Monitor OOM events for any workflow expected to exceed 30 min of CPU time.</p>
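<p style="margin-bottom: 2em; line-height: 1.9;">Because the OOM‑killer gives no warning, it can help to check memory headroom from inside the process before the limit is reached. The following is a minimal sketch, not an n8n feature: the helper name <code>memoryPressure</code>, the 2 GB limit, and the 0.75 warning ratio are all illustrative assumptions.</p>

```javascript
// Sketch: report how close resident memory is to a container limit.
// `limitBytes` and `warnRatio` are assumptions chosen for illustration.
function memoryPressure(rssBytes, limitBytes, warnRatio = 0.75) {
  const ratio = rssBytes / limitBytes;
  return { ratio, warn: ratio >= warnRatio };
}

// Compare the current Node.js process against a hypothetical 2 GB limit.
const TWO_GB = 2 * 1024 ** 3;
const current = memoryPressure(process.memoryUsage().rss, TWO_GB);
console.log(`RSS is ${(current.ratio * 100).toFixed(1)}% of the limit`);
```

<p style="margin-bottom: 2em; line-height: 1.9;">A check like this only warns early; it does not replace alerting on OOM events at the container level.</p>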
<div style="margin: 55px 0;">
<hr />
</div>
<h2 style="margin-bottom: 45px; line-height: 1.3;">2. Failure modes unique to long‑running workflows</h2>
<table style="width: 100%; border-collapse: collapse; margin-bottom: 2em;">
<thead>
<tr>
<th style="padding: 13px; border: 1px solid #e0e0e0; text-align: left;">Failure mode</th>
<th style="padding: 13px; border: 1px solid #e0e0e0; text-align: left;">Typical symptom</th>
<th style="padding: 13px; border: 1px solid #e0e0e0; text-align: left;">Root cause</th>
<th style="padding: 13px; border: 1px solid #e0e0e0; text-align: left;">Immediate mitigation</th>
</tr>
</thead>
<tbody>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;"><strong>Execution timeout</strong></td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">“Execution timed out after 3600 seconds” in logs</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Configured <code>EXECUTIONS_TIMEOUT</code> reached (per‑workflow overrides are capped by <code>EXECUTIONS_TIMEOUT_MAX</code>, default 3600 s)</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Increase <code>EXECUTIONS_TIMEOUT</code> in <code>.env</code> or split workflow into sub‑workflows</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;"><strong>Worker crash / OOM</strong></td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Process exit code 137, UI shows “Running” forever</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Memory > container limit, unbounded arrays, large binary payloads</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Add <code>maxMemoryRestart</code> in <code>config.json</code>, use a <code>SplitInBatches</code> node, store large blobs externally (S3, DB)</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;"><strong>Database deadlock</strong></td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">“Deadlock found when trying to get lock; try restarting transaction”</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Concurrent executions updating the same <code>execution_entity</code> row (e.g., <code>Set</code> node with <code>saveData:true</code>)</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Serialize critical sections with a “Mutex” custom node or reduce <code>saveData</code> usage</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;"><strong>External API rate‑limit</strong></td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">429 responses, retries stop after 3 attempts</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Continuous polling or long loops hitting API limits</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Implement exponential back‑off, add a “Wait” node, request a higher quota</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;"><strong>Node‑specific memory leak</strong></td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Gradual RAM increase, crash after N iterations</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Custom JavaScript in a “Function” node keeping references, e.g., <code>global.someArray.push(...)</code></td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Scope variables locally, clear arrays after each iteration (<code>someArray = []</code>)</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;"><strong>Infinite loop detection</strong></td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">UI shows “Running” > 24 h, no progress</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Mis‑configured “Loop” node without exit condition</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Add explicit <code>if (counter >= max)</code> guard, use a “Break” node</td>
</tr>
</tbody>
</table>
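<p style="margin-bottom: 2em; line-height: 1.9;">The exponential back‑off mentioned for rate‑limited APIs can be sketched as a small delay calculator. This is an illustration only: the function name <code>backoffDelay</code>, the 1 s base, and the 60 s cap are assumptions, not n8n settings.</p>

```javascript
// Sketch: exponential back-off delay with an upper cap.
// attempt 0 -> baseMs, attempt 1 -> 2 * baseMs, attempt 2 -> 4 * baseMs, ...
function backoffDelay(attempt, baseMs = 1000, capMs = 60000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Delays for the first eight attempts, capped at 60 s.
const delays = [0, 1, 2, 3, 4, 5, 6, 7].map((attempt) => backoffDelay(attempt));
console.log(delays);
```

<p style="margin-bottom: 2em; line-height: 1.9;">Inside a workflow, a value like this could feed the wait time of a “Wait” node after a 429 response.</p>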
<div style="margin: 55px 0;">
<hr />
</div>
<h2 style="margin-bottom: 45px; line-height: 1.3;">3. Real‑time monitoring & diagnostics</h2>
<h3 style="margin-bottom: 45px; line-height: 1.3;">3.1 Execution logs (CLI)</h3>
<pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto;"># Tail the latest execution logs (Docker example)
docker logs -f n8n | grep -i "executionId=12345"</pre>
<p style="margin-bottom: 2em; line-height: 1.9;"><em>Look for:</em> <code>ERROR</code>, <code>WARN</code>, <code>timeout</code>, <code>OOM</code>, <code>deadlock</code>.</p>
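<p style="margin-bottom: 2em; line-height: 1.9;">When log volume is high, the same keyword search can be scripted so that each line is tagged with a failure category. The sketch below assumes plain‑text log lines; the pattern names and the helper <code>classifyLogLine</code> are hypothetical and would need tuning to your actual log format.</p>

```javascript
// Sketch: tag a log line with the failure categories it mentions.
// The patterns are illustrative and not an exhaustive list of n8n messages.
const LOG_PATTERNS = {
  timeout: /timeout|timed out/i,
  oom: /\boom\b|out of memory/i,
  deadlock: /deadlock/i,
};

function classifyLogLine(line) {
  return Object.keys(LOG_PATTERNS).filter((name) => LOG_PATTERNS[name].test(line));
}

console.log(classifyLogLine("Execution timed out after 3600 seconds"));
```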
<h3 style="margin-bottom: 45px; line-height: 1.3;">3.2 Prometheus metrics (if enabled)</h3>
<table style="width: 100%; border-collapse: collapse; margin-bottom: 2em;">
<thead>
<tr>
<th style="padding: 13px; border: 1px solid #e0e0e0; text-align: left;">Metric</th>
<th style="padding: 13px; border: 1px solid #e0e0e0; text-align: left;">Meaning</th>
<th style="padding: 13px; border: 1px solid #e0e0e0; text-align: left;">Alert threshold</th>
</tr>
</thead>
<tbody>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;">n8n_execution_active</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Number of currently running executions</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">> 5 (consider scaling workers)</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;">nodejs_process_resident_memory_bytes</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Resident set size of the worker</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">> 1.5 GB (trigger OOM alert)</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;">n8n_execution_errors_total</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Cumulative error count per workflow</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">spikes > 10/min</td>
</tr>
</tbody>
</table>
<p style="margin-bottom: 2em; line-height: 1.9;">Add a Grafana panel to visualise “execution age” (<code>now - start_timestamp</code>).</p>
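<p style="margin-bottom: 2em; line-height: 1.9;">The “execution age” calculation itself is simple arithmetic. As a minimal sketch (the helper names <code>executionAgeSeconds</code> and <code>isStale</code> are assumptions, and timestamps are taken to be ISO 8601 strings):</p>

```javascript
// Sketch: execution age in whole seconds, i.e. now - start_timestamp.
function executionAgeSeconds(startedAt, now = new Date()) {
  return Math.floor((now.getTime() - new Date(startedAt).getTime()) / 1000);
}

// True when an execution has been running longer than the allowed age.
function isStale(startedAt, maxAgeSeconds, now = new Date()) {
  return executionAgeSeconds(startedAt, now) > maxAgeSeconds;
}

const startedAt = "2026-01-01T00:00:00Z";
const checkedAt = new Date("2026-01-01T02:30:00Z");
console.log(executionAgeSeconds(startedAt, checkedAt), isStale(startedAt, 7200, checkedAt));
```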
<h3 style="margin-bottom: 45px; line-height: 1.3;">3.3 Database view for stuck executions</h3>
<pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto;">-- PostgreSQL syntax; camelCase columns must be double-quoted
SELECT id, "workflowId", "startedAt", status
FROM execution_entity
WHERE status = 'running' AND "startedAt" &lt; NOW() - INTERVAL '2 hours';</pre>
<p style="margin-bottom: 2em; line-height: 1.9;">Rows returned indicate <strong>orphaned</strong> executions that need manual cleanup (<code>n8n execution:delete &lt;id&gt;</code>). If executions fail without any log output, resolve the <a href="/n8n-silent-failures-no-logs">n8n silent failures with no logs</a> first.</p>
<div style="margin: 55px 0;">
<hr />
</div>
<h2 style="margin-bottom: 45px; line-height: 1.3;">4. Preventive configuration patterns</h2>
<h3 style="margin-bottom: 45px; line-height: 1.3;">4.1 Extend the global timeout</h3>
<pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto;"># .env (Docker or systemd); keep comments on their own lines
# Default timeout for all workflows: 24 h
EXECUTIONS_TIMEOUT=86400
# Upper bound for per-workflow overrides: 48 h
EXECUTIONS_TIMEOUT_MAX=172800</pre>
<p style="margin-bottom: 2em; line-height: 1.9;"><strong>EEFA:</strong> Raising the timeout without also increasing <code>workerMaxMemory</code> can cause silent OOM. Adjust both together.</p>
<h3 style="margin-bottom: 45px; line-height: 1.3;">4.2 Chunk‑and‑process strategy</h3>
<p style="margin-bottom: 2em; line-height: 1.9;">Instead of a single loop that processes 1 M records, break the work into batches with the <strong>SplitInBatches</strong> node.</p>
<p style="margin-bottom: 2em; line-height: 1.9;"><strong>Get all IDs – HTTP request node:</strong></p>
<pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto;">{
  "name": "Get All IDs",
  "type": "n8n-nodes-base.httpRequest",
  "parameters": {
    "url": "https://api.example.com/ids"
  }
}</pre>
<p style="margin-bottom: 2em; line-height: 1.9;"><strong>Chunk into batches – SplitInBatches node:</strong></p>
<pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto;">{
  "name": "Chunk",
  "type": "n8n-nodes-base.splitInBatches",
  "parameters": {
    "batchSize": 500
  }
}</pre>
<p style="margin-bottom: 2em; line-height: 1.9;"><strong>Process each batch – Function node (kept under five lines):</strong></p>
<pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto;">// Process items here
return items;</pre>
<p style="margin-bottom: 2em; line-height: 1.9;">Each batch finishes within seconds, keeping memory flat.</p>
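<p style="margin-bottom: 2em; line-height: 1.9;">To make the batching idea concrete outside the n8n editor, here is a minimal sketch of what chunking does to a list of IDs. The <code>chunk</code> helper is a hypothetical stand‑in for the behaviour of the batching node, not its actual implementation.</p>

```javascript
// Sketch: split a list into fixed-size batches so only one batch is processed at a time.
function chunk(items, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// 1200 IDs with a batch size of 500 give batches of 500, 500 and 200 items.
const ids = Array.from({ length: 1200 }, (_, i) => i + 1);
const idBatches = chunk(ids, 500);
console.log(idBatches.map((batch) => batch.length));
```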
<h3 style="margin-bottom: 45px; line-height: 1.3;">4.3 Retry & “Continue On Fail”</h3>
<p style="margin-bottom: 2em; line-height: 1.9;">Configure the built‑in retry options on HTTP Request nodes and enable <strong>Continue On Fail</strong> where intermittent errors are acceptable. In the workflow JSON these are node‑level settings; n8n caps the wait between tries at 5000 ms, so use a “Wait” node when longer pauses are needed.</p>
<pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto;">{
  "continueOnFail": true,
  "retryOnFail": true,
  "maxTries": 4,
  "waitBetweenTries": 5000
}</pre>
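<p style="margin-bottom: 2em; line-height: 1.9;">The combined effect of retrying and continuing on failure can be sketched in plain JavaScript. This is an approximation of the semantics for illustration, not n8n’s internal code; the helper name <code>withRetry</code> and its option names are assumptions.</p>

```javascript
// Sketch: retry an async operation a fixed number of times, then either
// return an error result ("continue on fail") or rethrow the last error.
async function withRetry(fn, { maxTries = 4, waitMs = 5000, continueOnFail = true } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxTries; attempt++) {
    try {
      return { ok: true, value: await fn(attempt) };
    } catch (err) {
      lastError = err;
      // Wait before the next try, but not after the final one.
      if (attempt < maxTries) {
        await new Promise((resolve) => setTimeout(resolve, waitMs));
      }
    }
  }
  if (!continueOnFail) throw lastError;
  return { ok: false, error: lastError.message };
}

// Example: an operation that fails twice before succeeding.
let attemptsSeen = 0;
withRetry(async () => {
  attemptsSeen += 1;
  if (attemptsSeen < 3) throw new Error("temporary failure");
  return "ok";
}, { waitMs: 10 }).then((result) => console.log(result, attemptsSeen));
```

<p style="margin-bottom: 2em; line-height: 1.9;">Returning an error object instead of throwing lets the rest of a batch proceed, which is the trade‑off “Continue On Fail” makes: downstream nodes must check for the error case themselves.</p>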
<h3 style="margin-bottom: 45px; line-height: 1.3;">4.4 External persistence for large payloads</h3>
<p style="margin-bottom: 2em; line-height: 1.9;">Upload heavyweight JSON to S3 and keep only the key in the workflow.</p>
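<pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto;">// Self-hosted n8n must allow this module: NODE_FUNCTION_ALLOW_EXTERNAL=aws-sdk
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
// Store the payload under a key derived from the execution ID
await s3.putObject({
  Bucket: 'my-bucket',
  Key: `workflow/${$execution.id}/payload.json`,
  Body: JSON.stringify($json),
}).promise();</pre>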
<p style="margin-bottom: 2em; line-height: 1.9;">Return the S3 key so later nodes can fetch the data without holding it in RAM:</p>
<pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto;">return [{ json: { s3Key: `workflow/${$execution.id}/payload.json` } }];</pre>
<h3 style="margin-bottom: 45px; line-height: 1.3;">4.5 Heartbeat custom node (detect stalled workers)</h3>
<pre style="background: #fafafa; padding: 20px; border: 1px solid #e0e0e0; overflow: auto;">import { IExecuteFunctions } from 'n8n-workflow';

export async function execute(this: IExecuteFunctions) {
  // Use the execution ID (not the workflow ID) so each run is tracked separately
  const executionId = this.getExecutionId();
  await this.helpers.request({
    method: 'POST',
    url: process.env.HEARTBEAT_ENDPOINT,
    json: { executionId, timestamp: new Date().toISOString() },
  });
  return this.prepareOutputData([]);
}</pre>
<p style="margin-bottom: 2em; line-height: 1.9;">Deploy the node at the start of each long loop; a monitoring service can raise an alert if heartbeats stop for more than 5 min. If you run into <a href="/n8n-partial-failure-handling">n8n partial failure handling</a> issues, resolve them before continuing with the setup.</p>
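<p style="margin-bottom: 2em; line-height: 1.9;">On the receiving side, the monitoring service only needs to remember the last heartbeat per execution and flag the ones that have gone quiet. A minimal in‑memory sketch follows; the class name <code>HeartbeatMonitor</code> is hypothetical, and a real service would persist this state and expose an HTTP endpoint.</p>

```javascript
// Sketch: track the last heartbeat per execution and list the stalled ones.
class HeartbeatMonitor {
  constructor(maxSilenceMs = 5 * 60 * 1000) {
    this.maxSilenceMs = maxSilenceMs;
    this.lastSeen = new Map(); // executionId -> epoch milliseconds
  }

  // Record a heartbeat; `timestamp` is an ISO 8601 string as sent by the node.
  record(executionId, timestamp) {
    this.lastSeen.set(executionId, new Date(timestamp).getTime());
  }

  // Return the IDs whose last heartbeat is older than the allowed silence.
  stalled(now = Date.now()) {
    return [...this.lastSeen]
      .filter(([, seen]) => now - seen > this.maxSilenceMs)
      .map(([executionId]) => executionId);
  }
}

const monitor = new HeartbeatMonitor();
monitor.record("exec-1", new Date().toISOString());
console.log(monitor.stalled()); // nothing is stalled right after a heartbeat
```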
<div style="margin: 55px 0;">
<hr />
</div>
<h2 style="margin-bottom: 45px; line-height: 1.3;">5. Recovery & resume patterns</h2>
<table style="width: 100%; border-collapse: collapse; margin-bottom: 2em;">
<thead>
<tr>
<th style="padding: 13px; border: 1px solid #e0e0e0; text-align: left;">Pattern</th>
<th style="padding: 13px; border: 1px solid #e0e0e0; text-align: left;">How it works</th>
<th style="padding: 13px; border: 1px solid #e0e0e0; text-align: left;">When to use</th>
</tr>
</thead>
<tbody>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Manual re‑run with saved state</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Store intermediate results in an external DB; on failure, read the last successful batch ID and continue.</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Very large data migrations</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;">“Execute Workflow” sub‑workflow</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Break the main flow into atomic sub‑workflows that each complete within the timeout. The parent workflow triggers the next sub‑workflow via <code>Execute Workflow</code> node.</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Periodic batch jobs</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Continue On Fail + “Set” node persistence</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Enable <code>Continue On Fail</code> on risky nodes, then use a <code>Set</code> node with <code>saveData:true</code> to persist partial results. On the next run, an “If” node checks for existing data and skips already‑processed items.</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Idempotent API updates</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;">External scheduler (cron) + Redis flag</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">A cron job checks a flag in Redis; if set, it triggers the n8n workflow via REST API (<code>POST /rest/workflows/:id/run</code>). The workflow clears the flag on success.</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Distributed workers across multiple containers</td>
</tr>
</tbody>
</table>
<p style="margin-bottom: 2em; line-height: 1.9;"><strong>EEFA tip:</strong> When using <code>Execute Workflow</code>, pass the parent execution ID as a parameter (<code>{{ $execution.id }}</code>) and store it in the child’s <code>workflowData</code>. This creates a traceable lineage in the UI and simplifies debugging.</p>
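<p style="margin-bottom: 2em; line-height: 1.9;">The “manual re‑run with saved state” pattern from the table above comes down to writing a checkpoint after every successful batch and reading it on start. The sketch below uses an in‑memory <code>Map</code> as a stand‑in for the external DB; the function name <code>runBatches</code> and the key <code>lastBatch</code> are illustrative assumptions.</p>

```javascript
// Sketch: resume batch processing after the last successfully completed batch.
// `store` stands in for an external database; any get/set key-value store works.
function runBatches(batches, store, processBatch) {
  const start = (store.get("lastBatch") ?? -1) + 1;
  for (let i = start; i < batches.length; i++) {
    processBatch(batches[i], i); // if this throws, the checkpoint stays at i - 1
    store.set("lastBatch", i);
  }
  return store.get("lastBatch");
}

// A first run that completes all three batches and records the checkpoint.
const checkpointStore = new Map();
runBatches([[1, 2], [3, 4], [5]], checkpointStore, (batch) => console.log("processed", batch));
console.log("last completed batch index:", checkpointStore.get("lastBatch"));
```

<p style="margin-bottom: 2em; line-height: 1.9;">Because the checkpoint is written only after a batch succeeds, a batch that fails midway is processed again on resume, so the per‑batch work should be idempotent.</p>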
<div style="margin: 55px 0;">
<hr />
</div>
<h2 style="margin-bottom: 45px; line-height: 1.3;">6. Production checklist for long‑running workflows</h2>
<table style="width: 100%; border-collapse: collapse; margin-bottom: 2em;">
<thead>
<tr>
<th style="padding: 13px; border: 1px solid #e0e0e0; text-align: left;">Item</th>
<th style="padding: 13px; border: 1px solid #e0e0e0; text-align: left;">Verification method</th>
</tr>
</thead>
<tbody>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Timeout increased (<code>EXECUTIONS_TIMEOUT</code>)</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;"><code>docker exec n8n printenv EXECUTIONS_TIMEOUT</code></td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Memory limit raised (<code>workerMaxMemory</code>)</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Check <code>config.json</code> → <code>workerMaxMemory</code> (default 2048 MB)</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Chunk size ≤ 1 000 (or as per memory budget)</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Review <code>SplitInBatches</code> node settings</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;">All external calls have retry & back‑off</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Inspect each HTTP Request node → <code>Settings</code> tab → “Retry On Fail”</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Critical nodes have “Continue On Fail”</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">UI → node → “Continue On Fail” toggle</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Heartbeats emitted every ≤ 5 min</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Verify logs from custom Heartbeat node</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Metrics collected (Prometheus)</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;"><code>curl http://localhost:5678/metrics | grep n8n_execution</code></td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Dead‑lock safe DB schema (no long‑running <code>UPDATE</code> on <code>execution_entity</code>)</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Run <code>EXPLAIN ANALYZE</code> on frequent queries</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;">External payloads stored off‑process</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Search for <code>s3.putObject</code> or similar in codebase</td>
</tr>
<tr>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Recovery path documented (state store, resume logic)</td>
<td style="padding: 13px; border: 1px solid #e0e0e0;">Confluence or repo README updated</td>
</tr>
</tbody>
</table>
<div style="margin: 55px 0;">
<hr />
</div>
<h2 style="margin-bottom: 45px; line-height: 1.3;">7. Fixing a stuck long‑running n8n workflow</h2>
<ol style="margin-bottom: 2em; line-height: 1.9;">
<li>Check logs – look for <code>timeout</code>, <code>OOM</code>, <code>deadlock</code>.</li>
<li>Raise the global timeout (<code>EXECUTIONS_TIMEOUT=86400</code>).</li>
<li>Add a <code>SplitInBatches</code> node to keep each iteration &lt; 30 s.</li>
<li>Enable retry + Continue On Fail on all external API nodes.</li>
<li>Persist large payloads to S3/DB and clear RAM after each batch.</li>
<li>Deploy a heartbeat node and monitor its metrics.</li>
<li>If the worker still crashes, increase <code>workerMaxMemory</code> and verify container limits.</li>
</ol>
<p style="margin-bottom: 2em; line-height: 1.9;">Following this checklist turns a fragile hours‑long workflow into a production‑grade, self‑healing pipeline.</p>