**Who this is for** – Developers and SREs who run production‑grade n8n workflows that need to stay up for hours or days. We cover this in detail in the n8n Production Failure Patterns Guide.
Quick diagnosis
If a workflow that should run for hours or days stops unexpectedly, open the execution logs and search for `timeout`, `worker crashed`, or `database deadlock`. Most failures stem from default execution limits, memory pressure, or external‑service throttling. Adjust the batch size, raise the timeout, and add explicit retry/continue‑on‑fail logic to restore stability.
1. Why does n8n behave differently on hour‑long vs. minute‑long workflows?
| Aspect | Short‑run (≤ 5 min) | Long‑run (≥ 1 h) |
|---|---|---|
| Execution engine | Single Node.js process, keeps all data in RAM | Same process but RAM usage grows as state is stored for each node |
| Default limits | `EXECUTIONS_TIMEOUT=3600` s (1 h) – rarely hit | Same limit is reached quickly; workers are killed by the OS if memory > 2 GB |
| Database interaction | Few reads/writes → low lock contention | Persistent execution_entity rows → higher chance of deadlocks |
| External calls | Few HTTP requests → low rate‑limit risk | Continuous polling / streaming → API throttling, socket time‑outs |
Note: In production containers the kernel OOM killer will terminate the worker before n8n can emit a graceful error, leaving the execution marked “running” in the UI. Monitor OOM events for any workflow expected to exceed 30 min of CPU time.
2. Failure modes unique to long‑running workflows
| Failure mode | Typical symptom | Root cause | Immediate mitigation |
|---|---|---|---|
| Execution timeout | “Execution timed out after 3600 seconds” in logs | `EXECUTIONS_TIMEOUT` (default 1 h) reached | Increase `EXECUTIONS_TIMEOUT` in `.env` or split workflow into sub‑workflows |
| Worker crash / OOM | Process exit code 137, UI shows “Running” forever | Memory > container limit, unbounded arrays, large binary payloads | Add maxMemoryRestart in config.json, use a SplitInBatches node, store large blobs externally (S3, DB) |
| Database deadlock | “Deadlock found when trying to get lock; try restarting transaction” | Concurrent executions updating the same `execution_entity` row (e.g., Set node with `saveData:true`) | Serialize critical sections with a “Mutex” custom node or reduce `saveData` usage |
| External API rate‑limit | 429 responses, retries stop after 3 attempts | Continuous polling or long loops hitting API limits | Implement exponential back‑off, add a “Wait” node, request a higher quota |
| Node‑specific memory leak | Gradual RAM increase, crash after N iterations | Custom JavaScript in a “Function” node keeping references, e.g., `global.someArray.push(...)` | Scope variables locally, clear arrays after each iteration (`someArray = []`) |
| Infinite loop detection | UI shows “Running” > 24 h, no progress | Mis‑configured “Loop” node without exit condition | Add explicit if (counter >= max) guard, use a “Break” node |
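As a concrete guard for the last failure mode, the exit condition can live in a small Function‑node helper. A minimal sketch – the counter, cap, and function name here are illustrative, not n8n built‑ins:

```javascript
// Loop guard: force the cycle to stop once a hard iteration cap is
// reached, even if the "has more work" flag never turns false.
const MAX_ITERATIONS = 1000;

function shouldContinue(counter, hasMoreWork) {
  if (counter >= MAX_ITERATIONS) return false; // safety cap hit
  return hasMoreWork;                          // normal exit condition
}

console.log(shouldContinue(10, true));   // true  – keep looping
console.log(shouldContinue(1000, true)); // false – cap reached
console.log(shouldContinue(42, false));  // false – work finished
```

Wiring such a check into an If node on the loop edge guarantees the execution can never run unbounded, whatever the upstream data does.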
3. Real‑time monitoring & diagnostics
3.1 Execution logs (CLI)
```bash
# Tail the latest execution logs (Docker example)
docker logs -f n8n | grep -i "executionId=12345"
```
*Look for:* ERROR, WARN, timeout, OOM, deadlock.
3.2 Prometheus metrics (if enabled)
| Metric | Meaning | Alert threshold |
|---|---|---|
| n8n_execution_active | Number of currently running executions | > 5 (consider scaling workers) |
| nodejs_process_resident_memory_bytes | Resident set size of the worker | > 1.5 GB (trigger OOM alert) |
| n8n_execution_errors_total | Cumulative error count per workflow | spikes > 10/min |
Add a Grafana panel to visualise “execution age” (now - start_timestamp).
3.3 Database view for stuck executions
```sql
SELECT id, workflowId, startedAt, status
FROM execution_entity
WHERE status = 'running'
  AND startedAt < NOW() - INTERVAL '2 hours';
```
Rows returned indicate orphaned executions that need manual cleanup (`n8n execution:delete <id>`).
4. Preventive configuration patterns
4.1 Extend the global timeout
```bash
# .env (Docker or systemd)
EXECUTIONS_TIMEOUT=86400        # 24 h
MAX_EXECUTION_TIMEOUT=172800    # 48 h (hard cap)
```
Note: Raising the timeout without also increasing `workerMaxMemory` can cause silent OOM. Adjust both together.
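One way to keep the two settings in lockstep is to declare them in the same docker‑compose service. A sketch assuming the stock `n8nio/n8n` image; the Node.js heap flag and the container memory ceiling are example values to size for your own workload, not n8n defaults:

```yaml
services:
  n8n:
    image: n8nio/n8n
    environment:
      - EXECUTIONS_TIMEOUT=86400                # 24 h soft limit
      - MAX_EXECUTION_TIMEOUT=172800            # 48 h hard cap
      - NODE_OPTIONS=--max-old-space-size=4096  # 4 GB V8 heap
    mem_limit: 5g   # container ceiling kept above the Node.js heap
```

Keeping the container limit comfortably above the heap limit lets Node.js fail with a catchable allocation error instead of being killed by the kernel.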
4.2 Chunk‑and‑process strategy
Instead of a single loop that processes 1 M records, break the work into batches with the **SplitInBatches** node.
Get all IDs – HTTP request node:
```json
{
  "name": "Get All IDs",
  "type": "n8n-nodes-base.httpRequest",
  "parameters": {
    "url": "https://api.example.com/ids"
  }
}
```
Chunk into batches – SplitInBatches node:
```json
{
  "name": "Chunk",
  "type": "n8n-nodes-base.splitInBatches",
  "parameters": {
    "batchSize": 500
  }
}
```
Process each batch – Function node (kept under five lines):
```javascript
// Process items here
return items;
```
Each batch finishes within seconds, keeping memory flat.
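A slightly fuller version of that Function‑node body might look like the following sketch. The field names are hypothetical; the key point is that each batch returns a trimmed copy and keeps no state between iterations:

```javascript
// Per-batch transform: keep only the fields downstream nodes need so
// the previous batch (including large blobs) can be garbage-collected.
function processBatch(items) {
  return items.map((item) => ({
    json: {
      id: item.json.id,
      status: item.json.status, // everything else is dropped here
    },
  }));
}

const batch = [
  { json: { id: 1, status: 'ok',     blob: 'x'.repeat(100000) } },
  { json: { id: 2, status: 'failed', blob: 'y'.repeat(100000) } },
];
const out = processBatch(batch); // blobs do not survive the transform
```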
4.3 Retry & “Continue On Fail”
Configure the built‑in retry options on HTTP Request nodes and enable **Continue On Fail** where intermittent errors are acceptable.
```json
{
  "continueOnFail": true,
  "retryOnFail": true,
  "retryAttempts": 4,
  "retryDelay": 20000
}
```
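Where the built‑in retry options are too rigid (fixed delay, fixed attempt count), a Function node can compute the wait itself. A sketch of exponential back‑off with full jitter; the base delay and cap are illustrative values, not n8n defaults:

```javascript
// Exponential back-off with full jitter: attempt 0 waits up to baseMs,
// attempt 1 up to 2x, attempt 2 up to 4x ... capped at capMs.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 60000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling); // jitter spreads retries out
}
```

Feed the result into a Wait node before the next request; the randomisation keeps many parallel executions from retrying in lockstep against the same rate‑limited API.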
4.4 External persistence for large payloads
Upload heavyweight JSON to S3 and keep only the key in the workflow.
```javascript
// Inside a Function node: stream the payload to S3 instead of
// passing it between nodes.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
await s3.putObject({
  Bucket: 'my-bucket',
  Key: `workflow/${$execution.id}/payload.json`,
  Body: JSON.stringify($json),
}).promise();
```
Return the S3 key so later nodes can fetch the data without holding it in RAM:
```javascript
return [{ json: { s3Key: `workflow/${$execution.id}/payload.json` } }];
```
4.5 Heartbeat custom node (detect stalled workers)
```typescript
import { IExecuteFunctions } from 'n8n-workflow';

export async function execute(this: IExecuteFunctions) {
  // getExecutionId() returns the ID of this run; getWorkflow().id
  // would return the workflow ID instead.
  const executionId = this.getExecutionId();
  await this.helpers.request({
    method: 'POST',
    url: process.env.HEARTBEAT_ENDPOINT,
    json: { executionId, timestamp: new Date().toISOString() },
  });
  return this.prepareOutputData([]);
}
```
Deploy the node at the start of each long loop; a monitoring service can raise an alert if heartbeats stop for > 5 min.
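On the monitoring side, the staleness check itself is small. A sketch of the receiving service's logic, assuming heartbeats are stored with the ISO timestamp the node above sends; the 5‑minute gap matches the alert threshold:

```javascript
// Flag an execution as stalled when its last heartbeat is older than
// the allowed gap (5 minutes here, matching the alerting rule).
const MAX_GAP_MS = 5 * 60 * 1000;

function isStalled(lastHeartbeatIso, nowMs = Date.now()) {
  return nowMs - Date.parse(lastHeartbeatIso) > MAX_GAP_MS;
}
```

Running this over all executions with `status = 'running'` catches exactly the workers the OOM killer terminated without a trace.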
5. Recovery & resume patterns
| Pattern | How it works | When to use |
|---|---|---|
| Manual re‑run with saved state | Store intermediate results in an external DB; on failure, read the last successful batch ID and continue. | Very large data migrations |
| “Execute Workflow” sub‑workflow | Break the main flow into atomic sub‑workflows that each complete within the timeout. The parent workflow triggers the next sub‑workflow via Execute Workflow node. | Periodic batch jobs |
| Continue On Fail + “Set” node persistence | Enable Continue On Fail on risky nodes, then use a Set node with `saveData:true` to persist partial results. On the next run, an “If” node checks for existing data and skips already‑processed items. | Idempotent API updates |
| External scheduler (cron) + flag file | A cron job checks a flag in Redis; if set, it triggers the n8n workflow via REST API (`POST /rest/workflows/:id/run`). The workflow clears the flag on success. | Distributed workers across multiple containers |
Tip: When using Execute Workflow, pass the parent execution ID as a parameter (`{{ $execution.id }}`) and store it in the child’s `workflowData`. This creates a traceable lineage in the UI and simplifies debugging.
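The first pattern's resume step can be sketched as a pure function: given the full batch list and the last ID recorded in the external store, it returns what still has to run (names and IDs are hypothetical):

```javascript
// Resume helper for "manual re-run with saved state": everything after
// the last successfully processed batch ID still needs to run.
function remainingBatches(allBatchIds, lastDoneId) {
  if (lastDoneId === null) return allBatchIds; // nothing done yet
  const idx = allBatchIds.indexOf(lastDoneId);
  // Unknown ID in the store: reprocess everything rather than skip work.
  return idx === -1 ? allBatchIds : allBatchIds.slice(idx + 1);
}

const todo = remainingBatches([101, 102, 103, 104], 102); // → [103, 104]
```

Reprocessing on an unknown ID trades wasted work for safety, which is the right default when the downstream updates are idempotent.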
6. Production checklist for long‑running workflows
| Item | Verification method |
|---|---|
| Timeout increased (`EXECUTIONS_TIMEOUT`) | `docker exec n8n printenv EXECUTIONS_TIMEOUT` |
| Memory limit raised (`workerMaxMemory`) | Check `config.json` → `workerMaxMemory` (default 2048 MB) |
| Chunk size ≤ 1 000 (or as per memory budget) | Review SplitInBatches node settings |
| All external calls have retry & back‑off | Inspect each HTTP Request node → Retry tab |
| Critical nodes have “Continue On Fail” | UI → node → “Continue On Fail” toggle |
| Heartbeats emitted every ≤ 5 min | Verify logs from custom Heartbeat node |
| Metrics collected (Prometheus) | `curl http://localhost:5678/metrics \| grep n8n_execution` |
| Deadlock‑safe DB schema (no long‑running UPDATE on `execution_entity`) | Run `EXPLAIN ANALYZE` on frequent queries |
| External payloads stored off‑process | Search for `s3.putObject` or similar in codebase |
| Recovery path documented (state store, resume logic) | Confluence or repo README updated |
7. Fixing a stuck long‑running n8n workflow
- Check logs – look for `timeout`, `OOM`, `deadlock`.
- Raise the global timeout (`EXECUTIONS_TIMEOUT=86400`).
- Add a `SplitInBatches` node to keep each iteration < 30 s.
- Enable retry + Continue On Fail on all external API nodes.
- Persist large payloads to S3/DB and clear RAM after each batch.
- Deploy a heartbeat node and monitor its metrics.
- If the worker still crashes, increase `workerMaxMemory` and verify container limits.
Following this checklist turns a fragile hours‑long workflow into a production‑grade, self‑healing pipeline.