n8n long running workflow failures - timeout and memory fix


Step by Step Guide to solve n8n long running workflow failures

Who this is for: Developers and SREs who run production‑grade n8n workflows that need to run for hours or days. We cover this in detail in the n8n Production Failure Patterns Guide (https://flowgenius.in/n8n-production-failure-patterns/).

Quick diagnosis

If a workflow that should run for hours or days stops unexpectedly, open the execution logs and search for timeout, worker crashed, or database deadlock. Most failures stem from execution time limits, memory pressure, or external‑service throttling. Reduce the batch size, raise the timeout, and add explicit retry/continue‑on‑fail logic to restore stability.

1. Why n8n behaves differently on hour‑long vs. minute‑long workflows

If you are also seeing n8n idempotency retry failures (/n8n-idempotency-retry-failures), resolve them before continuing with the setup.

Aspect	Short‑run (≤ 5 min)	Long‑run (≥ 1 h)
Execution engine	Single Node.js process, keeps all data in RAM	Same process but RAM usage grows as state is stored for each node
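Default limits	`EXECUTIONS_TIMEOUT` in seconds (n8n documents the default as `-1`, meaning no limit; 3600 s is a common explicit setting), rarely hit	Any configured limit is reached quickly; the kernel or container runtime kills the process once memory passes its limit (2 GB in this example)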
Database interaction	Few reads/writes → low lock contention	Persistent `execution_entity` rows → higher chance of deadlocks
External calls	Few HTTP requests → low rate‑limit risk	Continuous polling / streaming → API throttling, socket time‑outs

EEFA note: In production containers the kernel OOM‑killer will terminate the worker before n8n can emit a graceful error, leaving the execution marked “running” in the UI. Monitor OOM events for any workflow expected to exceed 30 min of CPU time.

2. Failure modes unique to long‑running workflows

Failure mode	Typical symptom	Root cause	Immediate mitigation
Execution timeout	“Execution timed out after 3600 seconds” in logs	The configured `EXECUTIONS_TIMEOUT` (3600 s here) or the workflow's own timeout setting was reached	Increase `EXECUTIONS_TIMEOUT` in `.env` (and the `EXECUTIONS_TIMEOUT_MAX` cap if needed) or split the workflow into sub‑workflows
Worker crash / OOM	Process exit code 137 (SIGKILL), UI shows “Running” forever	Memory > container limit, unbounded arrays, large binary payloads	Raise the Node.js heap (`NODE_OPTIONS=--max-old-space-size=<MB>`) together with the container memory limit, use a `SplitInBatches` node, store large blobs externally (S3, DB)
Database deadlock	“Deadlock found when trying to get lock; try restarting transaction” (MySQL/MariaDB wording)	Concurrent executions writing execution data to `execution_entity` at the same time	Lower concurrency, serialize critical sections with a custom “Mutex” node, or save less execution data (for example, disable saving execution progress)
External API rate‑limit	429 responses, retries stop after 3 attempts	Continuous polling or long loops hitting API limits	Implement exponential back‑off, add a “Wait” node, request a higher quota
Node‑specific memory leak	Gradual RAM increase, crash after N iterations	Custom JavaScript in a “Function” node keeping references, e.g., `global.someArray.push(...)`	Scope variables locally, clear arrays after each iteration (`someArray = []`)
Infinite loop	UI shows “Running” > 24 h, no progress	Loop (`SplitInBatches`) wired back on itself without an exit condition	Add an explicit `if (counter >= max)` guard in an If node that routes out of the loop
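
The exponential back‑off mentioned in the rate‑limit row can be sketched as a small delay calculator. This is a minimal illustration, not an n8n API: the function names, the 1 s base, the 60 s cap, and the "full jitter" strategy are all assumptions chosen for the example.

```javascript
// Hypothetical helper: wait time before retry number `attempt` (0-based).
// baseMs, capMs and the jitter strategy are illustrative defaults.
function backoffDelayMs(attempt, { baseMs = 1000, capMs = 60000, random = Math.random } = {}) {
  // Exponential growth (base * 2^attempt), capped so waits stay bounded.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: pick a random point below the ceiling to spread retries out.
  return Math.floor(random() * ceiling);
}

// Prefer the server's Retry-After header (in seconds) when a 429 includes one.
function delayFor429(attempt, retryAfterHeader, options) {
  const seconds = Number(retryAfterHeader);
  const usable = retryAfterHeader != null && retryAfterHeader !== '' &&
    Number.isFinite(seconds) && seconds >= 0;
  // Fall back to computed back-off for missing or date-formatted headers.
  return usable ? seconds * 1000 : backoffDelayMs(attempt, options);
}
```

In a workflow, a value like this could feed the duration of a Wait node placed before the retried request; that wiring is left out here because it depends on the workflow.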

3. Real‑time monitoring & diagnostics

3.1 Execution logs (CLI)

# Tail the latest execution logs (Docker example); 2>&1 includes lines written to stderr
docker logs -f n8n 2>&1 | grep -i "executionId=12345"

Look for: ERROR, WARN, timeout, OOM, deadlock.
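
When the log volume is too large to read by eye, the same keywords can be counted per category. The sketch below is an assumption‑laden illustration: the category names and regular expressions are made up for this example and should be tuned to the messages your deployment actually emits.

```javascript
// Illustrative keyword patterns; adjust them to your own log format.
const FAILURE_PATTERNS = [
  { category: 'timeout', regex: /timeout|timed out/i },
  { category: 'memory', regex: /\bOOM\b|out of memory|heap limit/i },
  { category: 'deadlock', regex: /deadlock/i },
];

// Count how many lines fall into each failure category.
function classifyLogLines(lines) {
  const counts = { timeout: 0, memory: 0, deadlock: 0, other: 0 };
  for (const line of lines) {
    // The first matching pattern wins; unmatched lines are counted as "other".
    const match = FAILURE_PATTERNS.find((pattern) => pattern.regex.test(line));
    counts[match ? match.category : 'other'] += 1;
  }
  return counts;
}
```

A dominant category tells you which of the mitigations in section 2 to try first.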

3.2 Prometheus metrics (if enabled)

Metric	Meaning	Alert threshold
n8n_execution_active	Number of currently running executions	> 5 (consider scaling workers)
nodejs_process_resident_memory_bytes	Resident set size of the worker	> 1.5 GB (trigger OOM alert)
n8n_execution_errors_total	Cumulative error count per workflow	spikes > 10/min

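Metric names vary between n8n versions and configured prefixes, so confirm each one against your own /metrics output before wiring alerts. Add a Grafana panel to visualise “execution age” (now - start_timestamp).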

3.3 Database view for stuck executions

-- PostgreSQL syntax; camelCase columns must be double-quoted or they fold to lower case
SELECT id, "workflowId", "startedAt", status
FROM execution_entity
WHERE status = 'running' AND "startedAt" < NOW() - INTERVAL '2 hours';

Rows returned indicate orphaned executions that need manual cleanup, for example by stopping or deleting them from the Executions list in the UI. If you are also seeing silent failures with no logs (/n8n-silent-failures-no-logs), resolve them before continuing with the setup.
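
The same two‑hour rule can be applied in code to rows you have already fetched, which is handy inside an alerting script. This is a sketch under assumptions: the row shape (id, status, startedAt as an ISO string) mirrors the query above, and the function name is hypothetical.

```javascript
const TWO_HOURS_MS = 2 * 60 * 60 * 1000;

// Return executions still marked "running" that started more than maxAgeMs ago.
function findStuckExecutions(rows, now, maxAgeMs = TWO_HOURS_MS) {
  return rows
    .map((row) => ({ ...row, ageMs: now - Date.parse(row.startedAt) }))
    .filter((row) => row.status === 'running' && row.ageMs > maxAgeMs)
    // Report the age in whole minutes, which is easier to read in an alert.
    .map((row) => ({ id: row.id, ageMinutes: Math.floor(row.ageMs / 60000) }));
}
```

Passing `now` explicitly keeps the function deterministic, so it can be checked against fixed timestamps.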

4. Preventive configuration patterns

4.1 Extend the global timeout

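# .env (Docker or systemd); values are in seconds
# Default timeout applied to every execution: 24 h
EXECUTIONS_TIMEOUT=86400
# Upper bound that per-workflow timeouts cannot exceed: 48 h
EXECUTIONS_TIMEOUT_MAX=172800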

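EEFA: Raising the timeout without also raising the memory available to the process (the Node.js heap via `NODE_OPTIONS=--max-old-space-size` and the container memory limit) can cause silent OOM kills. Adjust both together.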
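
Because both timeout values are plain seconds, a small pre‑deploy check can catch arithmetic slips such as a default timeout that exceeds the hard cap. The helper below is a hypothetical sketch, not part of n8n; it only covers the case where both limits are enabled (positive values).

```javascript
const HOUR_SECONDS = 3600;

// Hypothetical sanity check for a default timeout and its hard cap, in seconds.
function checkTimeouts(defaultSeconds, hardCapSeconds) {
  const problems = [];
  const named = [['default timeout', defaultSeconds], ['hard cap', hardCapSeconds]];
  for (const [name, value] of named) {
    if (!Number.isInteger(value) || value <= 0) {
      problems.push(`${name} must be a positive whole number of seconds`);
    }
  }
  // Only compare the two values once both are individually valid.
  if (problems.length === 0 && defaultSeconds > hardCapSeconds) {
    problems.push('default timeout exceeds the hard cap');
  }
  return problems;
}
```

For the values above, `checkTimeouts(24 * HOUR_SECONDS, 48 * HOUR_SECONDS)` returns an empty list.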

4.2 Chunk‑and‑process strategy

Instead of a single loop that processes 1 M records, break the work into batches with the SplitInBatches node (labelled Loop Over Items in recent n8n versions).

Get all IDs – HTTP request node:

{
  "name": "Get All IDs",
  "type": "n8n-nodes-base.httpRequest",
  "parameters": {
    "url": "https://api.example.com/ids"
  }
}

Chunk into batches – SplitInBatches node:

{
  "name": "Chunk",
  "type": "n8n-nodes-base.splitInBatches",
  "parameters": {
    "batchSize": 500
  }
}

Process each batch – Function node (kept under five lines):

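// Function node: `items` holds the current batch (in the newer Code node, read it with $input.all())
// Process items here and return only the fields later nodes need
return items;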

Each batch finishes within seconds, keeping memory flat.
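
Outside n8n, the same chunk‑and‑process idea looks like the sketch below. It is an illustration of the technique rather than how the SplitInBatches node is implemented; the function names are hypothetical.

```javascript
// Yield consecutive slices of at most batchSize items.
function* inBatches(items, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  for (let start = 0; start < items.length; start += batchSize) {
    yield items.slice(start, start + batchSize);
  }
}

// Process one batch at a time so only a single slice is worked on at once.
function processAll(ids, batchSize, handleBatch) {
  let processed = 0;
  for (const batch of inBatches(ids, batchSize)) {
    handleBatch(batch); // per-batch results should be written out, not accumulated
    processed += batch.length;
  }
  return processed;
}
```

Memory stays flat only if `handleBatch` does not keep references to earlier batches, which is the same caveat that applies to code inside the loop in n8n.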

4.3 Retry & “Continue On Fail”

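Configure the built‑in retry options on HTTP Request nodes and enable Continue On Fail where intermittent errors are acceptable. In exported workflow JSON these settings sit at the node level, next to name and type. The editor caps both the number of tries and the wait between them (check the limits in your version), so the values below stay in a conservative range.

{
  "continueOnFail": true,
  "retryOnFail": true,
  "maxTries": 4,
  "waitBetweenTries": 5000
}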
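
To make the semantics concrete, here is a plain JavaScript sketch of what "retry, then continue on fail" means: try a fixed number of times with a pause in between, and on final failure either return an error result or throw. It is an approximation for illustration, not n8n's implementation; `callWithRetry` and its option names are hypothetical.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry an async operation; with continueOnFail, report the error instead of throwing.
async function callWithRetry(operation, options = {}) {
  const { maxTries = 3, waitMs = 1000, continueOnFail = false, wait = sleep } = options;
  let lastError;
  for (let attempt = 1; attempt <= maxTries; attempt += 1) {
    try {
      return { ok: true, value: await operation(attempt) };
    } catch (error) {
      lastError = error;
      // Pause between tries, but not after the final one.
      if (attempt < maxTries) await wait(waitMs);
    }
  }
  if (continueOnFail) {
    // Downstream steps receive the failure as data and can branch on it.
    return { ok: false, error: String((lastError && lastError.message) || lastError) };
  }
  throw lastError;
}
```

The design point is that continuing on failure turns an exception into data, so later steps must check for the error result explicitly or failures go unnoticed.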

4.4 External persistence for large payloads

Upload heavyweight JSON to S3 and keep only the key in the workflow.

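// External modules must be allowed for Function/Code nodes (NODE_FUNCTION_ALLOW_EXTERNAL=aws-sdk)
const AWS = require('aws-sdk');
// Credentials and region come from the environment (AWS SDK v2 default provider chain)
const s3 = new AWS.S3();
await s3.putObject({
  Bucket: 'my-bucket', // placeholder bucket name
  Key: `workflow/${$execution.id}/payload.json`,
  Body: JSON.stringify($json),
  ContentType: 'application/json',
}).promise();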

Return the S3 key so later nodes can fetch the data without holding it in RAM:

return [{ json: { s3Key: `workflow/${$execution.id}/payload.json` } }];
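
The pattern above (store the payload, pass only a reference) can be tried without AWS by swapping in an in‑memory store. Everything in this sketch is hypothetical scaffolding: `MemoryStore` stands in for S3, and `offload` and `restore` are illustrative names.

```javascript
// In-memory stand-in for an object store so the sketch runs anywhere.
class MemoryStore {
  constructor() { this.objects = new Map(); }
  put(key, body) { this.objects.set(key, body); }
  get(key) {
    if (!this.objects.has(key)) throw new Error(`missing object: ${key}`);
    return this.objects.get(key);
  }
}

// Write the payload to the store and return a small reference to pass along.
function offload(store, executionId, payload) {
  const s3Key = `workflow/${executionId}/payload.json`;
  const body = JSON.stringify(payload);
  store.put(s3Key, body);
  return { s3Key, bytes: Buffer.byteLength(body) };
}

// Fetch and parse the payload only in the step that actually needs it.
function restore(store, reference) {
  return JSON.parse(store.get(reference.s3Key));
}
```

Only the reference object travels between steps, so the large payload is held in memory just at the points where it is written and read.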

4.5 Heartbeat custom node (detect stalled workers)

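import { IExecuteFunctions, INodeExecutionData } from 'n8n-workflow';

// execute() method of a custom node; HEARTBEAT_ENDPOINT is an assumed environment variable
export async function execute(this: IExecuteFunctions): Promise<INodeExecutionData[][]> {
  // getExecutionId() identifies this run; getWorkflow().id would return the workflow ID instead
  const executionId = this.getExecutionId();
  await this.helpers.httpRequest({
    method: 'POST',
    url: process.env.HEARTBEAT_ENDPOINT as string,
    body: { executionId, timestamp: new Date().toISOString() },
    json: true,
  });
  // Pass the input items through so the loop continues after the heartbeat
  return [this.getInputData()];
}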

Place the node at the start of each long loop; a monitoring service can raise an alert if heartbeats stop for > 5 min. If you are also dealing with partial‑failure handling problems (/n8n-partial-failure-handling), resolve them before continuing with the setup.
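
On the receiving side, the monitoring service needs little more than a map of last‑seen timestamps. The class below is a minimal sketch of that idea; `HeartbeatMonitor` and its methods are hypothetical names, and the five‑minute threshold follows the figure above.

```javascript
// Tracks the latest heartbeat per execution and reports the ones that went quiet.
class HeartbeatMonitor {
  constructor(maxSilenceMs = 5 * 60 * 1000) {
    this.maxSilenceMs = maxSilenceMs;
    this.lastSeen = new Map();
  }

  // Call for each heartbeat received; timestamp is an ISO 8601 string.
  record(executionId, timestamp) {
    this.lastSeen.set(executionId, Date.parse(timestamp));
  }

  // Call when an execution ends normally so it is no longer watched.
  finish(executionId) {
    this.lastSeen.delete(executionId);
  }

  // Execution IDs whose last heartbeat is older than the allowed silence.
  stalled(now = Date.now()) {
    return [...this.lastSeen]
      .filter(([, seenAt]) => now - seenAt > this.maxSilenceMs)
      .map(([executionId]) => executionId);
  }
}
```

Running `stalled()` on a timer and alerting on a non‑empty result is enough to catch workers that were killed without emitting an error.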

5. Recovery & resume patterns

Pattern	How it works	When to use
Manual re‑run with saved state	Store intermediate results in an external DB; on failure, read the last successful batch ID and continue.	Very large data migrations
“Execute Workflow” sub‑workflow	Break the main flow into atomic sub‑workflows that each complete within the timeout. The parent workflow triggers the next sub‑workflow via `Execute Workflow` node.	Periodic batch jobs
Continue On Fail + persisted progress	Enable `Continue On Fail` on risky nodes, then write partial results to a store that outlives the execution (for example a database row keyed by item ID). On the next run, an “If” node checks for existing data and skips already‑processed items.	Idempotent API updates
External scheduler (cron) + Redis flag	A cron job checks a flag in Redis; if set, it triggers the n8n workflow over HTTP, typically through a Webhook trigger URL (the editor's internal `/rest` endpoints are not a stable public API). The workflow clears the flag on success.	Distributed workers across multiple containers

EEFA tip: When using Execute Workflow, pass the parent execution ID ({{ $execution.id }}) to the child as an input field and include it in the child's output or log messages. This gives you a traceable lineage between parent and child executions and simplifies debugging.
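
The first pattern in the table, resuming from the last successful batch, can be sketched as follows. This is an illustration under assumptions: the checkpoint is a plain object here, whereas a real one would live in an external database, and `resumeBatches` and `saveCheckpoint` are hypothetical names.

```javascript
// Process batches after the checkpoint, recording progress after each success.
function resumeBatches(batches, checkpoint, handleBatch, saveCheckpoint = () => {}) {
  // lastCompleted is the index of the last finished batch, or -1 if none finished.
  for (let index = checkpoint.lastCompleted + 1; index < batches.length; index += 1) {
    handleBatch(batches[index], index); // a throw here leaves the checkpoint untouched
    checkpoint.lastCompleted = index;
    saveCheckpoint(checkpoint); // persist so a crashed run can pick up from here
  }
  return checkpoint;
}
```

Because the checkpoint advances only after a batch succeeds, a re‑run repeats at most the batch that was in flight, which is why the batch handler itself should be idempotent.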

6. Production checklist for long‑running workflows

Item	Verification method
Timeout increased (`EXECUTIONS_TIMEOUT`)	`docker exec n8n printenv EXECUTIONS_TIMEOUT`
Memory limit raised (Node.js heap and container)	`docker exec n8n printenv NODE_OPTIONS` (look for `--max-old-space-size`) and `docker inspect --format '{{.HostConfig.Memory}}' n8n`
Chunk size ≤ 1 000 (or as per memory budget)	Review `SplitInBatches` node settings
All external calls have retry & back‑off	Inspect each HTTP Request node → Settings tab → `Retry On Fail`
Critical nodes have “Continue On Fail”	UI → node → “Continue On Fail” toggle
Heartbeats emitted every ≤ 5 min	Verify logs from custom Heartbeat node
Metrics collected (Prometheus)	`curl http://localhost:5678/metrics | grep n8n_execution`
Dead‑lock safe DB schema (no long‑running `UPDATE` on `execution_entity`)	Run `EXPLAIN ANALYZE` on frequent queries
External payloads stored off‑process	Search for `s3.putObject` or similar in codebase
Recovery path documented (state store, resume logic)	Confluence or repo README updated

7. Fixing a stuck long‑running n8n workflow

1. Check the logs for timeout, OOM, and deadlock messages.
2. Raise the global timeout (EXECUTIONS_TIMEOUT=86400).
3. Add a SplitInBatches node to keep each iteration < 30 s.
4. Enable retry + Continue On Fail on all external API nodes.
5. Persist large payloads to S3/DB so each batch releases its memory.
6. Deploy a heartbeat node and alert when heartbeats stop.
7. If the worker still crashes, raise the Node.js heap size (NODE_OPTIONS=--max-old-space-size) and verify container limits.

Following this checklist turns a fragile hours‑long workflow into a production‑grade, self‑healing pipeline.
