Description
Problem Description
There is a race condition between the persistence of a paused workflow execution and the processing of incoming resume requests. If a resume request (e.g., via API) arrives immediately after a workflow pauses but before the pausedExecutions record is fully committed to the database, the resume attempt fails with a "Paused execution not found" error.
This is critical for high-throughput automation where external systems may attempt to resume a workflow milliseconds after it enters a paused state.
Current Behavior
- Workflow Execution: A workflow reaches a pause point.
- Persistence Start: The system calls PauseResumeManager.persistPauseResult to save the state.
- Race Window: During the DB write operation, an external API calls the resume endpoint (/api/resume/...).
- Resume Check: The resume API calls enqueueOrStartResume, which attempts to fetch the pausedExecutions record.
- Failure: Since the record is not yet visible/committed, the fetch returns null, and the API throws Paused execution not found.
- Persistence End: persistPauseResult completes, but the valid resume request has already been rejected.
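The sequence above can be reproduced in a minimal, single-process sketch. Everything here is hypothetical and stands in for the real code: FakePausedExecutionsTable models the pausedExecutions table with a write that stays invisible until it is committed, and handleResume models only the lookup-then-fail part of the resume path.

```typescript
// Hypothetical in-memory stand-in for the pausedExecutions table.
// A row becomes visible to readers only after commit(), which mimics
// a database write that is still in flight.
type PausedExecution = { executionId: string; status: "paused" | "resuming" };

class FakePausedExecutionsTable {
  private pending = new Map<string, PausedExecution>();
  private committed = new Map<string, PausedExecution>();

  // Start the write; the row is not yet visible to other readers.
  beginInsert(executionId: string): void {
    this.pending.set(executionId, { executionId, status: "paused" });
  }

  // Finish the write; the row becomes visible.
  commit(executionId: string): void {
    const row = this.pending.get(executionId);
    if (row) {
      this.committed.set(executionId, row);
      this.pending.delete(executionId);
    }
  }

  // Readers see committed rows only.
  find(executionId: string): PausedExecution | null {
    return this.committed.get(executionId) ?? null;
  }
}

// Simplified stand-in for the resume path: look the record up, fail if absent.
function handleResume(table: FakePausedExecutionsTable, executionId: string): string {
  const row = table.find(executionId);
  if (!row) {
    throw new Error("Paused execution not found");
  }
  row.status = "resuming";
  return "resume started";
}

function simulateRace(): string[] {
  const table = new FakePausedExecutionsTable();
  const outcomes: string[] = [];

  table.beginInsert("exec-1"); // Persistence Start
  try {
    outcomes.push(handleResume(table, "exec-1")); // Race Window and Resume Check
  } catch (err) {
    outcomes.push((err as Error).message); // Failure
  }
  table.commit("exec-1"); // Persistence End

  // The identical request succeeds once the write is visible.
  outcomes.push(handleResume(table, "exec-1"));
  return outcomes;
}
```

Calling simulateRace() returns the error message for the early request and a success for the same request issued after the commit, which is the behavior reported in this issue.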
Affected Files
- apps/sim/background/workflow-execution.ts
- apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts
- apps/sim/app/api/resume/[workflowId]/[executionId]/[contextId]/route.ts
Recommended Fix
Wrap the pause persistence and the subsequent queue processing in a database transaction. This ensures that the paused state is atomically visible to any concurrent resume requests, or properly locks the row so the resume request waits until the transaction completes.
Proposed Implementation:
Modify persistPauseResult to use a transaction that includes both the insertion of the paused execution and the checking of the resume queue.
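One possible reading of this proposal is sketched below against a hypothetical in-memory model, not the project's actual database layer. It assumes that a resume request arriving before the paused record exists is recorded in the resume queue instead of being rejected, which this issue does not state explicitly. FakeDb, persistPauseResultSketch, and enqueueOrStartResumeSketch are illustrative names and do not reflect the real signatures.

```typescript
// Hypothetical in-memory model; every name here is illustrative.
type DbState = {
  paused: Set<string>;                // execution IDs with a paused record
  resumeQueue: Map<string, string[]>; // execution ID -> queued context IDs
};
type ResumeOutcome = "started" | "queued";

class FakeDb {
  private state: DbState = { paused: new Set(), resumeQueue: new Map() };

  // Runs fn against a private copy and publishes the copy only if fn
  // returns normally, so readers never observe a half-applied change.
  transaction<T>(fn: (tx: DbState) => T): T {
    const draft: DbState = { paused: new Set(this.state.paused), resumeQueue: new Map() };
    this.state.resumeQueue.forEach((queue, id) => draft.resumeQueue.set(id, [...queue]));
    const result = fn(draft);
    this.state = draft;
    return result;
  }
}

// Inserts the paused record and checks the resume queue in one transaction.
// Returns the context IDs of resume requests that arrived early.
function persistPauseResultSketch(db: FakeDb, executionId: string): string[] {
  return db.transaction<string[]>((tx) => {
    tx.paused.add(executionId);
    const waiting = tx.resumeQueue.get(executionId) ?? [];
    tx.resumeQueue.delete(executionId);
    return waiting;
  });
}

// Assumption: a resume for a not-yet-persisted pause is queued, not rejected.
function enqueueOrStartResumeSketch(
  db: FakeDb,
  executionId: string,
  contextId: string
): ResumeOutcome {
  return db.transaction<ResumeOutcome>((tx) => {
    if (tx.paused.has(executionId)) {
      return "started";
    }
    const queue = tx.resumeQueue.get(executionId) ?? [];
    queue.push(contextId);
    tx.resumeQueue.set(executionId, queue);
    return "queued";
  });
}

function demoEarlyResume(): { early: ResumeOutcome; drained: string[]; late: ResumeOutcome } {
  const db = new FakeDb();
  // Resume arrives before the pause is persisted.
  const early = enqueueOrStartResumeSketch(db, "exec-1", "ctx-a");
  // Persisting the pause picks up the early request in the same transaction.
  const drained = persistPauseResultSketch(db, "exec-1");
  // A later resume finds the paused record directly.
  const late = enqueueOrStartResumeSketch(db, "exec-1", "ctx-b");
  return { early, drained, late };
}
```

The model is single-threaded, so its transactions are serialized for free. In a real database the pause and resume transactions would likely also need to serialize against each other, for example through a lock keyed by execution ID or a sufficiently strict isolation level. Otherwise each side could miss the other's uncommitted write.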