Summary
On March 19, 2026, a bug in the Delay step caused flows to restart from the beginning instead of resuming after the delay, creating an infinite loop that flooded Redis with jobs. Affected flows never completed, and the growing job backlog degraded queue processing for all users.
Impact
- Flows with a Delay step looped forever without completing.
- The runaway job creation overloaded Redis, causing delays for other flows.
- All affected executions were replayed once service was restored — no data was lost.
Timeline
All times are in UTC.
- Mar 18, ~9:00 PM: A code change to the Delay step is deployed, introducing the infinite loop bug.
- Mar 19, ~8:45 AM: Customer reports arrive indicating that flows with Delay steps are not completing. Investigation begins.
- Mar 19, ~10:45 AM: Fix for the Delay step bug is deployed.
Root Cause
When a flow hits a Delay step, the system puts the job on hold via BullMQ’s moveToDelayed(). The bug was that the job still carried executionType: BEGIN instead of RESUME. When the delay expired, the worker re-ran the entire flow from the first step, hit the Delay again, paused again, and looped forever, flooding Redis with new jobs on every iteration.
Because the job carried BEGIN instead of RESUME, each loop iteration was treated as a brand-new execution rather than a continuation. Each fresh execution ran only from the trigger to the Delay step, well within the time limit, before spawning another delayed job and repeating, so the time limit never interrupted the loop.
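
The loop described above can be reproduced in a small in-memory model. This is a simplified sketch, not the actual worker code: the step list, the `DelayedJob` shape, `runOnce`, and `countPasses` are all hypothetical names introduced for illustration, and the pass cap exists only so the buggy case terminates.

```typescript
// Hypothetical, simplified model of the worker. None of these names come from the real codebase.
type ExecutionType = "BEGIN" | "RESUME";

interface DelayedJob {
  executionType: ExecutionType;
  resumeFromStep: number; // index of the step after the Delay (only used for RESUME)
}

const steps = ["trigger", "delay", "send_email"];
const DELAY_INDEX = 1;

// Runs one worker pass. Returns the job to re-enqueue once the delay expires,
// or null when the flow reaches its last step and completes.
function runOnce(job: DelayedJob, buggy: boolean): DelayedJob | null {
  // BEGIN always starts at step 0; RESUME continues after the Delay.
  const start = job.executionType === "RESUME" ? job.resumeFromStep : 0;
  for (let i = start; i < steps.length; i++) {
    if (i === DELAY_INDEX) {
      return {
        // The bug: the delayed job kept BEGIN instead of being switched to RESUME.
        executionType: buggy ? "BEGIN" : "RESUME",
        resumeFromStep: DELAY_INDEX + 1,
      };
    }
  }
  return null;
}

// Counts worker passes until the flow completes, giving up after maxPasses.
function countPasses(buggy: boolean, maxPasses = 50): number {
  let job: DelayedJob | null = { executionType: "BEGIN", resumeFromStep: 0 };
  let passes = 0;
  while (job !== null && passes < maxPasses) {
    job = runOnce(job, buggy);
    passes++;
  }
  return passes;
}

console.log(countPasses(false)); // 2: one pass up to the Delay, one pass to finish
console.log(countPasses(true));  // 50: never completes, stops only at the cap
```

In the buggy case every pass is short (trigger to Delay), which is consistent with each iteration staying within the time limit while the overall run never finishes.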
Detection & Monitoring Gaps
- Detected by customers, not automated alerting. There was no monitoring on repeated execution patterns or runaway job creation for a single flow.
- No alerting on Redis queue depth growth rate or sudden spikes in scheduled job volume.
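
As an illustration of the second gap, a growth-rate check over periodic queue depth samples might look like the sketch below. The `DepthSample` type, both function names, and the threshold are assumptions made for this example. The samples themselves could come from a periodic poll of the queue (for instance via BullMQ's `getJobCounts()`), but that wiring is not shown here.

```typescript
// Hypothetical sample of total queue depth (e.g. waiting + delayed jobs) at a point in time.
interface DepthSample {
  timestampMs: number;
  depth: number;
}

// Average growth in jobs per second between the first and last sample of a window.
// Returns 0 when there are fewer than two samples or no time has elapsed.
function growthPerSecond(samples: DepthSample[]): number {
  if (samples.length < 2) return 0;
  const first = samples[0];
  const last = samples[samples.length - 1];
  const elapsedSec = (last.timestampMs - first.timestampMs) / 1000;
  if (elapsedSec <= 0) return 0;
  return (last.depth - first.depth) / elapsedSec;
}

// True when the queue is growing faster than the allowed rate.
// maxGrowthPerSecond is an assumed, deployment-specific threshold.
function shouldAlert(samples: DepthSample[], maxGrowthPerSecond: number): boolean {
  return growthPerSecond(samples) > maxGrowthPerSecond;
}

const lastMinute: DepthSample[] = [
  { timestampMs: 0, depth: 100 },
  { timestampMs: 60_000, depth: 700 },
];
console.log(growthPerSecond(lastMinute)); // 10 jobs per second
console.log(shouldAlert(lastMinute, 5));  // true
```

Alerting on the rate of change rather than absolute depth is one possible design choice: it would flag a runaway producer early, even while the backlog is still small.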
Action Items
| Action Item | Status |
|---|---|
| Update job data to executionType: RESUME before calling moveToDelayed() so the worker continues from the correct step | Done |
| Add test coverage for Delay step resume behavior to catch regressions where a delayed job restarts instead of resuming | Done |
| Prevent a flow from entering an infinite loop by detecting and halting repeated re-executions of the same run (ENG-320) | Done |
| Add alerting on abnormal queue depth growth to detect runaway job creation before customers are impacted | To do |
| Add monitoring for repeated execution patterns on a single flow (e.g., same flow re-triggered N times within a short window) | To do |
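
The repeated-execution items above (halting re-executions of the same run, and monitoring for N re-triggers within a short window) could both build on a sliding-window counter. The sketch below is one possible shape, not the implementation tracked in ENG-320: the class name, the key, and the limits are hypothetical.

```typescript
// Hypothetical sliding-window detector keyed by run ID or flow ID.
class RepeatedExecutionDetector {
  private readonly starts = new Map<string, number[]>();
  private readonly maxStarts: number;
  private readonly windowMs: number;

  constructor(maxStarts: number, windowMs: number) {
    this.maxStarts = maxStarts;
    this.windowMs = windowMs;
  }

  // Records an execution start for the key and returns true when the key has
  // started more than maxStarts times within the last windowMs milliseconds.
  recordStart(key: string, nowMs: number): boolean {
    const previous = this.starts.get(key) ?? [];
    // Drop timestamps that have fallen out of the window.
    const recent = previous.filter((t) => nowMs - t < this.windowMs);
    recent.push(nowMs);
    this.starts.set(key, recent);
    return recent.length > this.maxStarts;
  }
}

// Example: allow at most 3 starts of the same run per second.
const detector = new RepeatedExecutionDetector(3, 1000);
console.log(detector.recordStart("run-1", 0));   // false
console.log(detector.recordStart("run-1", 100)); // false
console.log(detector.recordStart("run-1", 200)); // false
console.log(detector.recordStart("run-1", 300)); // true: 4th start within 1 second
```

A caller could halt the run when `recordStart` returns true, or only emit a metric for alerting. This in-memory version covers a single worker process. Sharing counts across workers would need a shared store, which is outside the scope of this sketch.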
Improvements Done
- Delay step fix: Updated job data to executionType: RESUME before calling moveToDelayed(), so the worker continues from where the flow left off instead of restarting.
- Defense-in-depth (RESUME empty state guard): The worker validates that RESUME operations have non-empty execution state. An empty state with RESUME is the exact signature of the original bug and is rejected with a VALIDATION error.
- Defense-in-depth (BEGIN non-empty state assertion): The engine asserts that BEGIN operations have empty execution state. A BEGIN with pre-existing steps would indicate a code regression.
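
The three improvements above can be sketched together as follows. This is an illustration under assumptions, not the shipped code: `FlowJobData`, `ExecutionState`, `ValidationError`, `validateJobData`, and `pauseForDelay` are hypothetical names, and `DelayableJob` is a local interface that mirrors only the two BullMQ `Job` methods the fix relies on (`updateData` and `moveToDelayed`).

```typescript
type ExecutionType = "BEGIN" | "RESUME";

// Hypothetical shape: results of already-executed steps, keyed by step name.
interface ExecutionState {
  steps: Record<string, unknown>;
}

interface FlowJobData {
  executionType: ExecutionType;
  executionState: ExecutionState;
}

// Stand-in for the VALIDATION error mentioned above.
class ValidationError extends Error {
  readonly code = "VALIDATION";
}

// Guards: RESUME must carry state, BEGIN must not.
function validateJobData(data: FlowJobData): void {
  const stepCount = Object.keys(data.executionState.steps).length;
  if (data.executionType === "RESUME" && stepCount === 0) {
    throw new ValidationError("RESUME job has empty execution state");
  }
  if (data.executionType === "BEGIN" && stepCount > 0) {
    throw new Error("Assertion failed: BEGIN job has pre-existing steps");
  }
}

// Local interface mirroring the BullMQ Job methods used by the fix.
interface DelayableJob {
  data: FlowJobData;
  updateData(data: FlowJobData): Promise<void>;
  moveToDelayed(timestamp: number, token?: string): Promise<void>;
}

// The fix: mark the job as RESUME *before* moving it to the delayed set,
// so the worker continues after the Delay step when the delay expires.
async function pauseForDelay(job: DelayableJob, delayMs: number, token?: string): Promise<void> {
  await job.updateData({ ...job.data, executionType: "RESUME" });
  await job.moveToDelayed(Date.now() + delayMs, token);
  // In a real BullMQ processor, the caller would typically throw DelayedError
  // after this point so the worker does not mark the job as completed.
}
```

Keeping the two guards separate from the fix means a future regression in the Delay step would fail fast with an explicit error instead of silently looping.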