n8n long running workflow failures – timeout and memory fix

Step by Step Guide to solve n8n long running workflow failures


Who this is for – Developers and SREs who run production‑grade n8n workflows that need to run for hours or days. We cover this in detail in the n8n Production Failure Patterns Guide.


Quick diagnosis

If a workflow that should run for hours or days stops unexpectedly, open the execution logs and search for timeout, worker crashed, or database deadlock. Most failures stem from default execution limits, memory pressure, or external‑service throttling. Reduce the batch size, raise the timeout, and add explicit retry/continue‑on‑fail logic to restore stability.


1. Why does n8n behave differently on hour‑long vs. minute‑long workflows?


| Aspect | Short‑run (≤ 5 min) | Long‑run (≥ 1 h) |
| --- | --- | --- |
| Execution engine | Single Node.js process, keeps all data in RAM | Same process, but RAM usage grows as state is stored for each node |
| Default limits | EXECUTIONS_TIMEOUT=3600 s (1 h) – rarely hit | Same limit is reached quickly; workers are killed by the OS if memory > 2 GB |
| Database interaction | Few reads/writes → low lock contention | Persistent execution_entity rows → higher chance of deadlocks |
| External calls | Few HTTP requests → low rate‑limit risk | Continuous polling / streaming → API throttling, socket time‑outs |

EEFA note: In production containers the kernel OOM‑killer will terminate the worker before n8n can emit a graceful error, leaving the execution marked “running” in the UI. Monitor OOM events for any workflow expected to exceed 30 min of CPU time.
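Since the OOM‑killer gives no warning, it helps to log memory from inside the worker itself. A minimal sketch in plain Node.js, usable inside a Function node; `memorySnapshot` and the 1536 MB threshold are illustrative choices, not n8n APIs:

```javascript
// Sketch: sample the worker's resident memory so growth is visible
// long before the kernel OOM-killer terminates the process.
// memorySnapshot is an illustrative helper name, not an n8n API.
function memorySnapshot() {
  const { rss, heapUsed } = process.memoryUsage();
  return {
    rssMb: Math.round(rss / 1024 / 1024),       // resident set size
    heapMb: Math.round(heapUsed / 1024 / 1024), // V8 heap in use
  };
}

// Example: warn when the worker approaches a hypothetical 2 GB budget
const snap = memorySnapshot();
if (snap.rssMb > 1536) {
  console.warn(`Worker at ${snap.rssMb} MB - approaching OOM territory`);
}
```

Logging this once per batch gives a memory curve you can correlate with exit code 137 events.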


2. Failure modes unique to long‑running workflows

| Failure mode | Typical symptom | Root cause | Immediate mitigation |
| --- | --- | --- | --- |
| Execution timeout | “Execution timed out after 3600 seconds” in logs | EXECUTIONS_TIMEOUT (default 1 h) reached | Increase EXECUTIONS_TIMEOUT in .env or split workflow into sub‑workflows |
| Worker crash / OOM | Process exit code 137, UI shows “Running” forever | Memory > container limit, unbounded arrays, large binary payloads | Add maxMemoryRestart in config.json, use a SplitInBatches node, store large blobs externally (S3, DB) |
| Database deadlock | “Deadlock found when trying to get lock; try restarting transaction” | Concurrent executions updating the same execution_entity row (e.g., Set node with saveData:true) | Serialize critical sections with a “Mutex” custom node or reduce saveData usage |
| External API rate‑limit | 429 responses, retries stop after 3 attempts | Continuous polling or long loops hitting API limits | Implement exponential back‑off, add a “Wait” node, request a higher quota |
| Node‑specific memory leak | Gradual RAM increase, crash after N iterations | Custom JavaScript in a “Function” node keeping references, e.g., global.someArray.push(...) | Scope variables locally, clear arrays after each iteration (someArray = []) |
| Infinite loop detection | UI shows “Running” > 24 h, no progress | Mis‑configured “Loop” node without exit condition | Add explicit if (counter >= max) guard, use a “Break” node |
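The last two rows share one cure: an explicit upper bound on the loop. A minimal sketch in plain JavaScript; `pollWithGuard` and its parameters are illustrative names, not n8n APIs:

```javascript
// Sketch of an explicit exit guard for a polling loop. In n8n the
// counter would live in workflow static data; here it is a local
// variable so the logic runs anywhere.
function pollWithGuard(fetchPage, maxIterations = 1000) {
  const results = [];
  let counter = 0;
  let page = fetchPage(counter);
  while (page !== null) {
    results.push(...page);
    counter += 1;
    if (counter >= maxIterations) {
      // Fail loudly instead of looping forever
      throw new Error(`Loop guard hit after ${maxIterations} iterations`);
    }
    page = fetchPage(counter);
  }
  return results;
}
```

Throwing at the guard surfaces the problem as a normal execution error instead of a workflow stuck at “Running” for days.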

3. Real‑time monitoring & diagnostics

3.1 Execution logs (CLI)

# Tail the latest execution logs (Docker example)
docker logs -f n8n | grep -i "executionId=12345"

*Look for:* ERROR, WARN, timeout, OOM, deadlock.

3.2 Prometheus metrics (if enabled)

| Metric | Meaning | Alert threshold |
| --- | --- | --- |
| n8n_execution_active | Number of currently running executions | > 5 (consider scaling workers) |
| nodejs_process_resident_memory_bytes | Resident set size of the worker | > 1.5 GB (trigger OOM alert) |
| n8n_execution_errors_total | Cumulative error count per workflow | spikes > 10/min |

Add a Grafana panel to visualise “execution age” (now - start_timestamp).

3.3 Database view for stuck executions

SELECT id, workflowId, startedAt, status
FROM execution_entity
WHERE status = 'running' AND startedAt < NOW() - INTERVAL '2 hours';

Rows returned indicate orphaned executions that need manual cleanup (n8n execution:delete <id>).
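The same cleanup can be scripted against the n8n public REST API instead of the CLI. A sketch; the `/api/v1/executions/:id` path and `X-N8N-API-KEY` header follow the v1 public API, but verify them against your n8n version, and `buildDeleteRequest`/`cleanupStuck` are illustrative names:

```javascript
// Sketch: delete stuck executions through the n8n public API.
// Endpoint path and API-key header are assumptions based on the
// v1 public API - confirm them for your n8n version before use.
function buildDeleteRequest(baseUrl, executionId, apiKey) {
  return {
    method: 'DELETE',
    url: `${baseUrl}/api/v1/executions/${executionId}`,
    headers: { 'X-N8N-API-KEY': apiKey },
  };
}

async function cleanupStuck(baseUrl, apiKey, stuckIds, doFetch = fetch) {
  for (const id of stuckIds) {
    const req = buildDeleteRequest(baseUrl, id, apiKey);
    await doFetch(req.url, { method: req.method, headers: req.headers });
  }
}
```

Feed it the IDs returned by the SQL query above, e.g. `cleanupStuck('http://localhost:5678', apiKey, stuckIds)`.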


4. Preventive configuration patterns

4.1 Extend the global timeout

# .env (Docker or systemd)
EXECUTIONS_TIMEOUT=86400   # 24 h
EXECUTIONS_TIMEOUT_MAX=172800 # 48 h (hard cap)

EEFA: Raising the timeout without also increasing workerMaxMemory can cause silent OOM. Adjust both together.

4.2 Chunk‑and‑process strategy

Instead of a single loop that processes 1 M records, break the work into batches with the **SplitInBatches** node.

Get all IDs – HTTP request node:

{
  "name": "Get All IDs",
  "type": "n8n-nodes-base.httpRequest",
  "parameters": {
    "url": "https://api.example.com/ids"
  }
}

Chunk into batches – SplitInBatches node:

{
  "name": "Chunk",
  "type": "n8n-nodes-base.splitInBatches",
  "parameters": {
    "batchSize": 500
  }
}

Process each batch – Function node (kept under five lines):

// Transform only the current batch (at most batchSize items)
return items.map((item) => ({ json: { ...item.json, processed: true } }));

Each batch finishes within seconds, keeping memory flat.
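The same chunking idea can be expressed outside n8n for intuition: a generator yields one batch at a time, so only `batchSize` records are ever materialized at once (names are illustrative):

```javascript
// Sketch: stream IDs through fixed-size batches instead of holding
// the full result set in memory at once.
function* chunks(ids, batchSize = 500) {
  for (let i = 0; i < ids.length; i += batchSize) {
    yield ids.slice(i, i + batchSize); // one batch at a time
  }
}

// Process 1200 hypothetical IDs in batches of 500
const batchSizes = [];
for (const batch of chunks(Array.from({ length: 1200 }, (_, i) => i))) {
  batchSizes.push(batch.length); // stand-in for real per-batch work
}
```

This is exactly what SplitInBatches does for you: peak memory is bounded by the batch, not by the total record count.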

4.3 Retry & “Continue On Fail”

Configure the built‑in retry options on HTTP Request nodes and enable **Continue On Fail** where intermittent errors are acceptable.

{
  "continueOnFail": true,
  "retryOnFail": true,
  "maxTries": 4,
  "waitBetweenTries": 20000
}
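For the exponential back‑off recommended in section 2, a small helper shows the shape of the logic outside n8n (`withBackoff` is an illustrative name; inside n8n the node's Retry settings cover the common case):

```javascript
// Sketch: retry an async call with exponential back-off.
// Delays grow as baseDelayMs * 2^attempt: 1 s, 2 s, 4 s, ...
async function withBackoff(fn, { attempts = 4, baseDelayMs = 1000 } = {}) {
  for (let i = 0; i < attempts; i += 1) {
    try {
      return await fn(); // success: return immediately
    } catch (err) {
      if (i === attempts - 1) throw err; // out of attempts
      const delayMs = baseDelayMs * 2 ** i;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

A 429 from a rate‑limited API then costs a few seconds of waiting instead of a failed execution.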

4.4 External persistence for large payloads

Upload heavyweight JSON to S3 and keep only the key in the workflow.

// Requires NODE_FUNCTION_ALLOW_EXTERNAL=aws-sdk so the Function node can require() the module
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
await s3.putObject({
  Bucket: 'my-bucket',
  Key: `workflow/${$execution.id}/payload.json`,
  Body: JSON.stringify($json),
}).promise();

Return the S3 key so later nodes can fetch the data without holding it in RAM:

return [{ json: { s3Key: `workflow/${$execution.id}/payload.json` } }];

4.5 Heartbeat custom node (detect stalled workers)

import { IExecuteFunctions } from 'n8n-workflow';
export async function execute(this: IExecuteFunctions) {
  const executionId = this.getExecutionId(); // the execution ID (getWorkflow().id would return the workflow ID)
  await this.helpers.request({
    method: 'POST',
    url: process.env.HEARTBEAT_ENDPOINT,
    json: { executionId, timestamp: new Date().toISOString() },
  });
  return this.prepareOutputData([]);
}

Deploy the node at the start of each long loop; a monitoring service can raise an alert if heartbeats stop for > 5 min.
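On the receiving side, the monitoring service only needs to remember the last beat per execution and flag silences. A sketch with illustrative names; a real service would persist the map and push alerts:

```javascript
// Sketch: track heartbeats and report executions that have gone
// silent for longer than maxSilenceMs (default 5 minutes).
class HeartbeatWatchdog {
  constructor(maxSilenceMs = 5 * 60 * 1000) {
    this.maxSilenceMs = maxSilenceMs;
    this.lastBeat = new Map(); // executionId -> timestamp (ms)
  }

  // Called by the HTTP endpoint the heartbeat node POSTs to
  beat(executionId, timestamp = Date.now()) {
    this.lastBeat.set(executionId, timestamp);
  }

  // Called periodically by the alerting loop
  stalled(now = Date.now()) {
    return [...this.lastBeat.entries()]
      .filter(([, ts]) => now - ts > this.maxSilenceMs)
      .map(([id]) => id);
  }
}
```

Run `stalled()` on a cron tick and page on any non‑empty result: a stalled worker is then caught minutes after it freezes rather than when someone notices the UI.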


5. Recovery & resume patterns

| Pattern | How it works | When to use |
| --- | --- | --- |
| Manual re‑run with saved state | Store intermediate results in an external DB; on failure, read the last successful batch ID and continue. | Very large data migrations |
| “Execute Workflow” sub‑workflow | Break the main flow into atomic sub‑workflows that each complete within the timeout. The parent workflow triggers the next sub‑workflow via the Execute Workflow node. | Periodic batch jobs |
| Continue On Fail + “Set” node persistence | Enable Continue On Fail on risky nodes, then use a Set node with saveData:true to persist partial results. On the next run, an “If” node checks for existing data and skips already‑processed items. | Idempotent API updates |
| External scheduler (cron) + flag file | A cron job checks a flag in Redis; if set, it triggers the n8n workflow via REST API (POST /rest/workflows/:id/run). The workflow clears the flag on success. | Distributed workers across multiple containers |
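The first pattern ("manual re‑run with saved state") reduces to a few lines of logic. A sketch with an in‑memory object standing in for the external state store; all names are illustrative:

```javascript
// Sketch: resume batch processing after the last recorded success.
// stateStore would be an external DB or Redis in production.
function processBatches(batches, stateStore, workflowId, handler) {
  const last = stateStore[workflowId] ?? -1; // last completed batch index
  for (let i = last + 1; i < batches.length; i += 1) {
    handler(batches[i]);
    stateStore[workflowId] = i; // persist only after the batch succeeds
  }
}
```

Because the index is written only after `handler` returns, a crash mid‑batch means that batch is retried on the next run, so the handler itself should be idempotent.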

EEFA tip: When using Execute Workflow, pass the parent execution ID as a parameter ({{ $execution.id }}) and store it in the child’s workflowData. This creates a traceable lineage in the UI and simplifies debugging.


6. Production checklist for long‑running workflows

| Item | Verification method |
| --- | --- |
| Timeout increased (EXECUTIONS_TIMEOUT) | docker exec n8n printenv EXECUTIONS_TIMEOUT |
| Memory limit raised (workerMaxMemory) | Check config.json → workerMaxMemory (default 2048 MB) |
| Chunk size ≤ 1 000 (or as per memory budget) | Review SplitInBatches node settings |
| All external calls have retry & back‑off | Inspect each HTTP Request node → Retry tab |
| Critical nodes have “Continue On Fail” | UI → node → “Continue On Fail” toggle |
| Heartbeats emitted every ≤ 5 min | Verify logs from custom Heartbeat node |
| Metrics collected (Prometheus) | curl http://localhost:5678/metrics \| grep n8n_execution |
| Dead‑lock safe DB schema (no long‑running UPDATE on execution_entity) | Run EXPLAIN ANALYZE on frequent queries |
| External payloads stored off‑process | Search for s3.putObject or similar in codebase |
| Recovery path documented (state store, resume logic) | Confluence or repo README updated |

7. Fixing a stuck long‑running n8n workflow

  1. Check logs – look for timeout, OOM, deadlock.
  2. Raise the global timeout (EXECUTIONS_TIMEOUT=86400).
  3. Add a SplitInBatches node to keep each iteration < 30 s.
  4. Enable retry + Continue On Fail on all external API nodes.
  5. Persist large payloads to S3/DB and clear RAM after each batch.
  6. Deploy a heartbeat node and monitor its metrics.
  7. If the worker still crashes, increase workerMaxMemory and verify container limits.

Following this checklist turns a fragile hours‑long workflow into a production‑grade, self‑healing pipeline.
