fix(gas): gracefully handle corrupt wf #4220

Open
MasterPtato wants to merge 1 commit into 02-13-chore_improve_debugging_for_runners from 02-17-fix_gas_gracefully_handle_corrupt_wf

Conversation

@MasterPtato
Contributor

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Contributor Author

MasterPtato commented Feb 18, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

@railway-app

railway-app bot commented Feb 18, 2026

🚅 Deployed to the rivet-pr-4220 environment in rivet-frontend

Service | Status | Updated (UTC)
mcp-hub | ✅ Success | Feb 18, 2026 at 3:10 am
website | 😴 Sleeping | Feb 18, 2026 at 2:50 am
ladle | ❌ Build Failed | Feb 18, 2026 at 2:30 am
frontend-cloud | ❌ Build Failed | Feb 18, 2026 at 2:24 am
frontend-inspector | ❌ Build Failed | Feb 18, 2026 at 2:24 am

@claude

claude bot commented Feb 18, 2026

PR Review: fix(gas): gracefully handle corrupt wf

This PR makes the workflow engine resilient to corrupt or undeserializable workflow history data by logging errors and skipping affected workflows (returning Ok(None)) instead of propagating errors.

Summary of Changes

  1. Corrupt event handling (two locations): Replaces ? propagation on Event::try_from() with a match block that logs via tracing::error! and returns Ok(None) to skip the workflow.
  2. Missing key handling: Replaces context("key should exist")? on create_ts_entry/ray_id_entry with a let ... else guard that logs and returns Ok(None).
  3. active_worker_count fix: Removes the .max(1) guard, allowing the value to be 0 when there are no active workers.
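The first change above can be sketched as follows. This is a minimal illustration rather than the PR's code: `Event`, `RawEvent`, the four-byte payload format, and `build_history` are hypothetical stand-ins, and `eprintln!` replaces the `tracing::error!` call so the sketch needs only the standard library.

```rust
use std::convert::TryFrom;

// Hypothetical stand-ins for the engine's types; the real `Event` and its
// raw representation are not shown on this page.
#[derive(Debug, PartialEq)]
pub struct Event {
    pub version: u32,
}

pub struct RawEvent {
    pub bytes: Vec<u8>,
}

impl TryFrom<RawEvent> for Event {
    type Error = String;

    fn try_from(raw: RawEvent) -> Result<Self, Self::Error> {
        // Assumption for the sketch: a valid payload is exactly four bytes.
        let bytes = <[u8; 4]>::try_from(raw.bytes.as_slice())
            .map_err(|_| "corrupt event payload".to_string())?;
        Ok(Event {
            version: u32::from_le_bytes(bytes),
        })
    }
}

/// Builds a workflow's history. Returns `Ok(None)` when any event fails to
/// deserialize, so the caller can skip this workflow and keep pulling others.
pub fn build_history(
    workflow_id: u64,
    raw_events: Vec<RawEvent>,
) -> Result<Option<Vec<Event>>, String> {
    let mut events = Vec::with_capacity(raw_events.len());
    for (location, raw) in raw_events.into_iter().enumerate() {
        match Event::try_from(raw) {
            Ok(event) => events.push(event),
            Err(err) => {
                // Log and skip instead of propagating the error with `?`.
                eprintln!(
                    "failed building workflow history workflow_id={} location={} err={}",
                    workflow_id, location, err
                );
                return Ok(None);
            }
        }
    }
    Ok(Some(events))
}
```

With this shape, a caller treats `Ok(None)` as "skip this workflow" and reserves `Err` for failures that should abort the whole pull.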

Positives

  • The core strategy is sound. Propagating deserialization errors causes callers to treat the entire pull operation as failed, which can stall the engine. Returning Ok(None) to skip individual bad workflows is the right approach.
  • The loc variable capture before moving previous_event/current_event is a clean fix for the borrow issue that would arise from the refactored match pattern.
  • Logging with structured fields (workflow_id, location, err) follows the project's tracing conventions.
  • Removing FailedBuildingWorkflowHistory from the error enum is correct since it is no longer reachable.

Issues and Concerns

1. Division by zero risk from removing .max(1) (line 1177)

active_worker_count is used as a modulus divisor at lines 1254 and 1259:

let wf_worker_idx = wf_hash % active_worker_count;
let next_worker_idx = (current_worker_idx + 1) % active_worker_count;

Removing .max(1) means that if active_worker_ids is empty, active_worker_count becomes 0, and both modulo operations panic at runtime (Rust panics on a remainder with a zero divisor).

Before this change the .max(1) was there precisely to guard against this. Is there an upstream guarantee that active_worker_ids can never be empty at this point in the code? If so, that invariant should be documented with a comment. If not, this removal is a regression and the guard should be kept (or an early return added when the list is empty). This is the most critical issue in the PR.
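One possible way to keep a count of 0 without risking the panic is to make the empty case explicit. The sketch below is an assumption-laden illustration: the function names and the `u64` types are hypothetical, and only the two modulo expressions come from the snippet above. It relies on `checked_rem`, which returns `None` for a zero divisor instead of panicking.

```rust
/// Returns the index of the worker that owns `wf_hash`, or `None` when
/// there are no active workers. (Hypothetical helper, not from the PR.)
pub fn assign_worker(wf_hash: u64, active_worker_count: u64) -> Option<u64> {
    // `checked_rem` yields `None` when the divisor is zero.
    wf_hash.checked_rem(active_worker_count)
}

/// Returns the index following `current_worker_idx`, wrapping around the
/// worker list, or `None` when there are no active workers.
pub fn next_worker(current_worker_idx: u64, active_worker_count: u64) -> Option<u64> {
    current_worker_idx
        .wrapping_add(1)
        .checked_rem(active_worker_count)
}
```

A caller could then return early on `None`, which makes the "no active workers" case visible at the call site rather than hiding it behind `.max(1)`.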

2. Silent data loss for corrupt workflows

Returning Ok(None) drops the corrupt workflow from this pull with nothing but an error log. The workflow remains in the wake queue and is attempted again on the next pull cycle, producing repeated error logs. However:

  • There is no mechanism to mark the workflow as permanently corrupt, so it will log this error on every pull indefinitely.
  • Consider adding a metric increment (e.g., a counter for deserialization failures) so these can be monitored in aggregate rather than only via log scraping.
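As a rough illustration of the metric suggestion, the sketch below uses a process-wide atomic counter from the standard library. The names `CORRUPT_WORKFLOW_TOTAL`, `record_corrupt_workflow`, and `corrupt_workflow_total` are hypothetical; in practice the counter would more likely be registered with whatever metrics library the project already uses.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical counter of workflows skipped because their history could not
// be deserialized.
pub static CORRUPT_WORKFLOW_TOTAL: AtomicU64 = AtomicU64::new(0);

/// Increments the counter and returns the new total.
pub fn record_corrupt_workflow() -> u64 {
    // `fetch_add` returns the previous value, so add 1 for the new total.
    CORRUPT_WORKFLOW_TOTAL.fetch_add(1, Ordering::Relaxed) + 1
}

/// Reads the current total, for example when exporting metrics.
pub fn corrupt_workflow_total() -> u64 {
    CORRUPT_WORKFLOW_TOTAL.load(Ordering::Relaxed)
}
```

Calling `record_corrupt_workflow()` next to the error log in each skip path would let an alert fire on the rate of corrupt workflows instead of relying on log scraping.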

3. No test coverage added

The PR checklist item for tests is unchecked, and no tests are included. Given that this fixes a specific failure mode (corrupt history entries), a unit or integration test covering the Err path of Event::try_from() would make this substantially safer to merge.


Minor Notes

  • The log message "failed building workflow history" is lowercase, consistent with project conventions. Good.
  • The two changed event-building sites are structurally identical; the duplication is inherent to the two separate loops (previous vs. current events) and is not an issue.
  • Unused import of Location is correctly removed from error.rs.

@pkg-pr-new

pkg-pr-new bot commented Feb 18, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@4220

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@4220

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@4220

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@4220

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@4220

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@4220

@rivetkit/sqlite-vfs

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sqlite-vfs@4220

@rivetkit/traces

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/traces@4220

@rivetkit/workflow-engine

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/workflow-engine@4220

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@4220

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@4220

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@4220

commit: ee3daf8

@graphite-app graphite-app bot force-pushed the 02-13-chore_improve_debugging_for_runners branch from ba43d9d to 5a68f06 Compare February 18, 2026 02:29