**Who this is for** – Developers and SREs who run production‑grade n8n workflows that need to stay up for hours or days. We cover this in detail in the n8n Production Failure Patterns Guide.
Quick diagnosis
If a workflow that should run for hours or days stops unexpectedly, open the execution logs and search for `timeout`, `worker crashed`, or `database deadlock`. Most failures stem from default execution limits, memory pressure, or external‑service throttling. Adjust the batch size, raise the timeout, and add explicit retry/continue‑on‑fail logic to restore stability.
1. Why does n8n behave differently on hour‑long vs. minute‑long workflows?
| Aspect | Short‑run (≤ 5 min) | Long‑run (≥ 1 h) |
|---|---|---|
| Execution engine | Single Node.js process, keeps all data in RAM | Same process but RAM usage grows as state is stored for each node |
| Default limits | `EXECUTIONS_TIMEOUT=3600` s (1 h) – rarely hit | Same limit is reached quickly; workers are killed by the OS if memory > 2 GB |
| Database interaction | Few reads/writes → low lock contention | Persistent execution_entity rows → higher chance of deadlocks |
| External calls | Few HTTP requests → low rate‑limit risk | Continuous polling / streaming → API throttling, socket time‑outs |
Note: In production containers the kernel OOM killer will terminate the worker before n8n can emit a graceful error, leaving the execution marked “running” in the UI. Monitor OOM events for any workflow expected to exceed 30 min of CPU time.
2. Failure modes unique to long‑running workflows
| Failure mode | Typical symptom | Root cause | Immediate mitigation |
|---|---|---|---|
| Execution timeout | “Execution timed out after 3600 seconds” in logs | `EXECUTIONS_TIMEOUT` (default 1 h) reached | Increase `EXECUTIONS_TIMEOUT` in `.env` or split workflow into sub‑workflows |
| Worker crash / OOM | Process exit code 137, UI shows “Running” forever | Memory > container limit, unbounded arrays, large binary payloads | Add maxMemoryRestart in config.json, use a SplitInBatches node, store large blobs externally (S3, DB) |
| Database deadlock | “Deadlock found when trying to get lock; try restarting transaction” | Concurrent executions updating the same `execution_entity` row (e.g., Set node with `saveData:true`) | Serialize critical sections with a “Mutex” custom node or reduce `saveData` usage |
| External API rate‑limit | 429 responses, retries stop after 3 attempts | Continuous polling or long loops hitting API limits | Implement exponential back‑off, add a “Wait” node, request a higher quota |
| Node‑specific memory leak | Gradual RAM increase, crash after N iterations | Custom JavaScript in a “Function” node keeping references, e.g., `global.someArray.push(...)` | Scope variables locally, clear arrays after each iteration (`someArray = []`) |
| Infinite loop detection | UI shows “Running” > 24 h, no progress | Mis‑configured “Loop” node without exit condition | Add explicit if (counter >= max) guard, use a “Break” node |
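As a concrete guard for the last failure mode, the exit condition can live in a small Function‑node helper. A minimal sketch – the counter, cap, and function name here are illustrative, not n8n built‑ins:

```javascript
// Loop guard: force the cycle to stop once a hard iteration cap is
// reached, even if the "has more work" flag never turns false.
const MAX_ITERATIONS = 1000;

function shouldContinue(counter, hasMoreWork) {
  if (counter >= MAX_ITERATIONS) return false; // safety cap hit
  return hasMoreWork;                          // normal exit condition
}

console.log(shouldContinue(10, true));   // true  – keep looping
console.log(shouldContinue(1000, true)); // false – cap reached
console.log(shouldContinue(42, false));  // false – work finished
```

Wiring such a check into an If node on the loop edge guarantees the execution can never run unbounded, whatever the upstream data does.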
3. Real‑time monitoring & diagnostics
3.1 Execution logs (CLI)
```bash
# Tail the latest execution logs (Docker example)
docker logs -f n8n | grep -i "executionId=12345"
```
*Look for:* ERROR, WARN, timeout, OOM, deadlock.
3.2 Prometheus metrics (if enabled)
| Metric | Meaning | Alert threshold |
|---|---|---|
| n8n_execution_active | Number of currently running executions | > 5 (consider scaling workers) |
| nodejs_process_resident_memory_bytes | Resident set size of the worker | > 1.5 GB (trigger OOM alert) |
| n8n_execution_errors_total | Cumulative error count per workflow | spikes > 10/min |
Add a Grafana panel to visualise “execution age” (now - start_timestamp).
3.3 Database view for stuck executions
```sql
SELECT id, workflowId, startedAt, status
FROM execution_entity
WHERE status = 'running'
  AND startedAt < NOW() - INTERVAL '2 hours';
```
Rows returned indicate orphaned executions that need manual cleanup (`n8n execution:delete <id>`).
4. Preventive configuration patterns
4.1 Extend the global timeout
```bash
# .env (Docker or systemd)
EXECUTIONS_TIMEOUT=86400        # 24 h
MAX_EXECUTION_TIMEOUT=172800    # 48 h (hard cap)
```
Note: Raising the timeout without also increasing `workerMaxMemory` can cause silent OOM. Adjust both together.
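One way to keep the two settings in lockstep is to declare them in the same docker‑compose service. A sketch assuming the stock `n8nio/n8n` image; the Node.js heap flag and the container memory ceiling are example values to size for your own workload, not n8n defaults:

```yaml
services:
  n8n:
    image: n8nio/n8n
    environment:
      - EXECUTIONS_TIMEOUT=86400                # 24 h soft limit
      - MAX_EXECUTION_TIMEOUT=172800            # 48 h hard cap
      - NODE_OPTIONS=--max-old-space-size=4096  # 4 GB V8 heap
    mem_limit: 5g   # container ceiling kept above the Node.js heap
```

Keeping the container limit comfortably above the heap limit lets Node.js fail with a catchable allocation error instead of being killed by the kernel.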
4.2 Chunk‑and‑process strategy
Instead of a single loop that processes 1 M records, break the work into batches with the **SplitInBatches** node.
Get all IDs – HTTP request node:
```json
{
  "name": "Get All IDs",
  "type": "n8n-nodes-base.httpRequest",
  "parameters": {
    "url": "https://api.example.com/ids"
  }
}
```
Chunk into batches – SplitInBatches node:
```json
{
  "name": "Chunk",
  "type": "n8n-nodes-base.splitInBatches",
  "parameters": {
    "batchSize": 500
  }
}
```
Process each batch – Function node (kept under five lines):
```javascript
// Process items here
return items;
```
Each batch finishes within seconds, keeping memory flat.
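A slightly fuller version of that Function‑node body might look like the following sketch. The field names are hypothetical; the key point is that each batch returns a trimmed copy and keeps no state between iterations:

```javascript
// Per-batch transform: keep only the fields downstream nodes need so
// the previous batch (including large blobs) can be garbage-collected.
function processBatch(items) {
  return items.map((item) => ({
    json: {
      id: item.json.id,
      status: item.json.status, // everything else is dropped here
    },
  }));
}

const batch = [
  { json: { id: 1, status: 'ok',     blob: 'x'.repeat(100000) } },
  { json: { id: 2, status: 'failed', blob: 'y'.repeat(100000) } },
];
const out = processBatch(batch); // blobs do not survive the transform
```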
4.3 Retry & “Continue On Fail”
Configure the built‑in retry options on HTTP Request nodes and enable **Continue On Fail** where intermittent errors are acceptable.
```json
{
  "continueOnFail": true,
  "retryOnFail": true,
  "retryAttempts": 4,
  "retryDelay": 20000
}
```
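Where the built‑in retry options are too rigid (fixed delay, fixed attempt count), a Function node can compute the wait itself. A sketch of exponential back‑off with full jitter; the base delay and cap are illustrative values, not n8n defaults:

```javascript
// Exponential back-off with full jitter: attempt 0 waits up to baseMs,
// attempt 1 up to 2x, attempt 2 up to 4x ... capped at capMs.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 60000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling); // jitter spreads retries out
}
```

Feed the result into a Wait node before the next request; the randomisation keeps many parallel executions from retrying in lockstep against the same rate‑limited API.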
4.4 External persistence for large payloads
Upload heavyweight JSON to S3 and keep only the key in the workflow.
```javascript
// Inside a Function node: stream the payload to S3 instead of
// passing it between nodes.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
await s3.putObject({
  Bucket: 'my-bucket',
  Key: `workflow/${$execution.id}/payload.json`,
  Body: JSON.stringify($json),
}).promise();
```
Return the S3 key so later nodes can fetch the data without holding it in RAM:
```javascript
return [{ json: { s3Key: `workflow/${$execution.id}/payload.json` } }];
```
4.5 Heartbeat custom node (detect stalled workers)
```typescript
import { IExecuteFunctions } from 'n8n-workflow';

export async function execute(this: IExecuteFunctions) {
  // getExecutionId() returns the ID of this run; getWorkflow().id
  // would return the workflow ID instead.
  const executionId = this.getExecutionId();
  await this.helpers.request({
    method: 'POST',
    url: process.env.HEARTBEAT_ENDPOINT,
    json: { executionId, timestamp: new Date().toISOString() },
  });
  return this.prepareOutputData([]);
}
```
Deploy the node at the start of each long loop; a monitoring service can raise an alert if heartbeats stop for > 5 min.
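On the monitoring side, the staleness check itself is small. A sketch of the receiving service's logic, assuming heartbeats are stored with the ISO timestamp the node above sends; the 5‑minute gap matches the alert threshold:

```javascript
// Flag an execution as stalled when its last heartbeat is older than
// the allowed gap (5 minutes here, matching the alerting rule).
const MAX_GAP_MS = 5 * 60 * 1000;

function isStalled(lastHeartbeatIso, nowMs = Date.now()) {
  return nowMs - Date.parse(lastHeartbeatIso) > MAX_GAP_MS;
}
```

Running this over all executions with `status = 'running'` catches exactly the workers the OOM killer terminated without a trace.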
5. Recovery & resume patterns
| Pattern | How it works | When to use |
|---|---|---|
| Manual re‑run with saved state | Store intermediate results in an external DB; on failure, read the last successful batch ID and continue. | Very large data migrations |
| “Execute Workflow” sub‑workflow | Break the main flow into atomic sub‑workflows that each complete within the timeout. The parent workflow triggers the next sub‑workflow via Execute Workflow node. | Periodic batch jobs |
| Continue On Fail + “Set” node persistence | Enable Continue On Fail on risky nodes, then use a Set node with `saveData:true` to persist partial results. On the next run, an “If” node checks for existing data and skips already‑processed items. | Idempotent API updates |
| External scheduler (cron) + flag file | A cron job checks a flag in Redis; if set, it triggers the n8n workflow via REST API (`POST /rest/workflows/:id/run`). The workflow clears the flag on success. | Distributed workers across multiple containers |
Tip: When using Execute Workflow, pass the parent execution ID as a parameter (`{{ $execution.id }}`) and store it in the child’s `workflowData`. This creates a traceable lineage in the UI and simplifies debugging.
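The first pattern's resume step can be sketched as a pure function: given the full batch list and the last ID recorded in the external store, it returns what still has to run (names and IDs are hypothetical):

```javascript
// Resume helper for "manual re-run with saved state": everything after
// the last successfully processed batch ID still needs to run.
function remainingBatches(allBatchIds, lastDoneId) {
  if (lastDoneId === null) return allBatchIds; // nothing done yet
  const idx = allBatchIds.indexOf(lastDoneId);
  // Unknown ID in the store: reprocess everything rather than skip work.
  return idx === -1 ? allBatchIds : allBatchIds.slice(idx + 1);
}

const todo = remainingBatches([101, 102, 103, 104], 102); // → [103, 104]
```

Reprocessing on an unknown ID trades wasted work for safety, which is the right default when the downstream updates are idempotent.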
6. Production checklist for long‑running workflows
| Item | Verification method |
|---|---|
| Timeout increased (`EXECUTIONS_TIMEOUT`) | `docker exec n8n printenv EXECUTIONS_TIMEOUT` |
| Memory limit raised (`workerMaxMemory`) | Check `config.json` → `workerMaxMemory` (default 2048 MB) |
| Chunk size ≤ 1 000 (or as per memory budget) | Review SplitInBatches node settings |
| All external calls have retry & back‑off | Inspect each HTTP Request node → Retry tab |
| Critical nodes have “Continue On Fail” | UI → node → “Continue On Fail” toggle |
| Heartbeats emitted every ≤ 5 min | Verify logs from custom Heartbeat node |
| Metrics collected (Prometheus) | `curl http://localhost:5678/metrics \| grep n8n_execution` |
| Deadlock‑safe DB schema (no long‑running UPDATE on `execution_entity`) | Run `EXPLAIN ANALYZE` on frequent queries |
| External payloads stored off‑process | Search for `s3.putObject` or similar in codebase |
| Recovery path documented (state store, resume logic) | Confluence or repo README updated |
7. Fixing a stuck long‑running n8n workflow
- Check logs – look for `timeout`, `OOM`, `deadlock`.
- Raise the global timeout (`EXECUTIONS_TIMEOUT=86400`).
- Add a `SplitInBatches` node to keep each iteration < 30 s.
- Enable retry + Continue On Fail on all external API nodes.
- Persist large payloads to S3/DB and clear RAM after each batch.
- Deploy a heartbeat node and monitor its metrics.
- If the worker still crashes, increase `workerMaxMemory` and verify container limits.
Following this checklist turns a fragile hours‑long workflow into a production‑grade, self‑healing pipeline.