fix: docker cache issues, dashboards#4181

Closed
MasterPtato wants to merge 1 commit into main from
02-11-fix_docker_cache_issues_dashboards

Conversation

@MasterPtato
Contributor

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4181 February 11, 2026 21:19 Destroyed
@railway-app

railway-app bot commented Feb 11, 2026

🚅 Deployed to the rivet-pr-4181 environment in rivet-frontend

| Service | Status | Web | Updated (UTC) |
|---|---|---|---|
| ladle | ❌ Build Failed (View Logs) | Web | Feb 18, 2026 at 2:14 am |
| website | 😴 Sleeping (View Logs) | Web | Feb 13, 2026 at 7:57 pm |
| frontend-inspector | ❌ Build Failed (View Logs) | Web | Feb 11, 2026 at 9:22 pm |
| frontend-cloud | ❌ Build Failed (View Logs) | Web | Feb 11, 2026 at 9:21 pm |
| mcp-hub | ✅ Success (View Logs) | Web | Feb 11, 2026 at 9:20 pm |

Contributor Author

MasterPtato commented Feb 11, 2026


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Feb 11, 2026

PR Review: fix: docker cache issues, dashboards

Summary

This PR fixes two bugs (a potential divide-by-zero in the workflow engine and a missing cache-invalidation trigger in the build system), plus dashboard improvements for Grafana. The changes are small and well-scoped.


engine/packages/gasoline/src/db/kv/mod.rs — Bug Fix

The core fix (line 1177):

// Before
let active_worker_count = active_worker_ids.len() as u64;

// After
let active_worker_count = active_worker_ids.len().max(1) as u64;

This prevents a divide-by-zero / modulo-by-zero panic when active_worker_ids is empty. The value is used in two places:

  • wf_hash % active_worker_count (modulo)
  • (current_worker_idx + 1) % active_worker_count (modulo)

Both would panic with a zero divisor: in Rust, integer division or remainder by zero panics in every build profile, release included, because release builds only disable arithmetic overflow checks. The fix is correct.

One concern worth noting: when active_worker_ids is empty, the current_worker_idx defaults to 0 (via the error branch). With active_worker_count = 1, wf_worker_idx == 0 will be true for all workflows whose hash is 0 mod 1 (i.e., all of them), so every workflow will be assigned to this "ghost" worker. This may be the intended fallback behavior (all workflows get picked up locally when no peers are visible), but it's worth a comment explaining the intent. The existing error log on the fallback branch is good.

Suggestion: Consider adding a brief comment before the .max(1) guard explaining why, e.g.:

// Prevent division by zero; if no active workers are found, treat count as 1
// so all workflows are assigned to the current worker.
let active_worker_count = active_worker_ids.len().max(1) as u64;
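
To make the fallback concrete, here is a minimal sketch of the guarded modulo, not the engine's actual code. The function names and the use of plain `u64` worker IDs are assumptions for illustration; only the `.max(1)` guard and the `wf_hash % active_worker_count` expression come from the PR.

```rust
/// Hypothetical stand-in for the worker-selection step discussed above.
/// Returns the index of the worker that should own a workflow with this hash.
fn worker_idx_for(wf_hash: u64, active_worker_ids: &[u64]) -> u64 {
    // Same guard as the PR: never let the divisor be zero.
    let active_worker_count = active_worker_ids.len().max(1) as u64;
    wf_hash % active_worker_count
}

/// Unguarded variant for comparison. `checked_rem` returns `None` for a
/// zero divisor, where the plain `%` operator would panic.
fn unguarded_worker_idx(wf_hash: u64, active_worker_ids: &[u64]) -> Option<u64> {
    wf_hash.checked_rem(active_worker_ids.len() as u64)
}
```

With an empty worker list, `worker_idx_for` always returns 0, which matches the review's observation that every workflow maps to index 0 when no peers are visible.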

engine/packages/util/build.rs — Cache Fix

println!("cargo:rerun-if-env-changed=OVERRIDE_GIT_SHA");

Correct fix. Without this instruction, Cargo would not re-run the build script when the OVERRIDE_GIT_SHA environment variable changes, so Docker layer caching or incremental builds could embed a stale SHA. This is the standard pattern for env-var-driven build metadata.
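
As a rough sketch of how such a build script might resolve and publish the SHA (the actual `build.rs` is not shown in this thread): the `build_directives` helper, the `GIT_SHA` variable name, and the `"unknown"` placeholder are assumptions. The `cargo:rerun-if-env-changed` and `cargo:rustc-env` directives are standard Cargo build-script instructions.

```rust
/// Builds the directive lines a Cargo build script would print to stdout.
/// `GIT_SHA` is a hypothetical variable name chosen for this sketch.
fn build_directives(override_sha: Option<&str>, detected_sha: Option<&str>) -> Vec<String> {
    // Prefer the explicit override, then a detected value, then a placeholder.
    let sha = override_sha.or(detected_sha).unwrap_or("unknown");
    vec![
        // Re-run the build script whenever the override variable changes.
        "cargo:rerun-if-env-changed=OVERRIDE_GIT_SHA".to_string(),
        // Expose the resolved SHA to the crate at compile time.
        format!("cargo:rustc-env=GIT_SHA={sha}"),
    ]
}
```

In a real `build.rs`, `main` would read `std::env::var("OVERRIDE_GIT_SHA").ok()` and print each returned line with `println!`.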


engine/docker/universal/Dockerfile — Debug Echo

echo "Copying binary" && \
cp target/$CARGO_BUILD_MODE/rivet-engine /app/dist/

This is a minor debugging aid added during the cache-investigation work. It's harmless but adds noise to CI/build logs permanently. Consider removing it once the cache issue is confirmed resolved, or keeping it only if it provides useful build-time progress visibility.


Grafana Dashboards

gasoline.json:

  • Moves the CPU Core Usage and Load Shedding Ratio heatmaps earlier in the layout (y: 51 → y: 51/51) and shifts later panels down accordingly. Layout reordering looks intentional and correct.
  • Replaces "Last Pull Workflows Duration" and "Last Pull Workflows History Duration" panels (per-worker-id duration gauges) with "Workflows Dispatched/s" and "Workflows Dispatched/s (Not From Workflow)" time-series panels. This is a meaningful observability improvement, moving from latency-per-worker to dispatch rate.
  • The legendFormat field on the "Workflows Dispatched/s" panel uses {{signal_name}} but the query groups by sub_workflow_name. This is a mismatch — the legend will show blank labels. It should be {{sub_workflow_name}}.
  • "format": "heatmap" is set on both new timeseries panels but the panel type is "timeseries", not "heatmap". Grafana may silently ignore this, but it's inconsistent.
  • Default time range extended from 5m → 30m. Good for initial dashboard usefulness.
  • Template variable current.text format inconsistency fixed (array vs string) — good correctness fix.
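
The legendFormat mismatch noted in the list above is the kind of error a small lint could catch. The following is a simplified sketch under stated assumptions: all function names are hypothetical, and the parser only handles a single `by (...)` clause written with a space before the parenthesis, which is far less than full PromQL.

```rust
/// Extracts label names referenced as `{{name}}` in a Grafana legendFormat string.
fn legend_labels(legend: &str) -> Vec<String> {
    let mut labels = Vec::new();
    let mut rest = legend;
    while let Some(start) = rest.find("{{") {
        let after = &rest[start + 2..];
        match after.find("}}") {
            Some(end) => {
                labels.push(after[..end].trim().to_string());
                rest = &after[end + 2..];
            }
            // Unterminated placeholder: stop scanning.
            None => break,
        }
    }
    labels
}

/// Extracts the labels of the first `by (...)` clause in a query expression.
fn group_by_labels(expr: &str) -> Vec<String> {
    let Some(start) = expr.find("by (") else { return Vec::new() };
    let after = &expr[start + 4..];
    let Some(end) = after.find(')') else { return Vec::new() };
    after[..end]
        .split(',')
        .map(|s| s.trim().to_string())
        .filter(|s| !s.is_empty())
        .collect()
}

/// Returns the legend labels that the query does not group by.
fn missing_legend_labels(legend: &str, expr: &str) -> Vec<String> {
    let grouped = group_by_labels(expr);
    legend_labels(legend)
        .into_iter()
        .filter(|label| !grouped.contains(label))
        .collect()
}
```

Called with the legend `{{signal_name}}` and a query starting with `sum by (sub_workflow_name)`, `missing_legend_labels` reports `signal_name` as unmatched; with `{{sub_workflow_name}}` it reports nothing.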

pegboard.json:

  • Queries for pending allocation and serverless slots now filter to only the most-recently-publishing gasoline worker per datacenter using a max ... on(k8s_pod_name) join. This prevents stale metrics from decommissioned pods from inflating the displayed values. This is a well-constructed PromQL fix.
  • legendFormat corrected from {{workflow_name}} to {{rivet_datacenter}} across all panels — the old format was always wrong since these queries group by rivet_datacenter.
  • Default template variable changed from prod to staging. This looks like a personal preference setting; consider keeping it as All or prod for production dashboards if this is committed into the main template.

Summary

| Area | Verdict |
|---|---|
| Divide-by-zero fix | ✅ Correct fix, minor comment suggestion |
| Build script cache fix | ✅ Correct |
| Dockerfile echo | ⚠️ Consider removing once confirmed |
| Grafana gasoline dashboard | ⚠️ legendFormat: {{signal_name}} should be {{sub_workflow_name}} |
| Grafana pegboard dashboard | ✅ Good PromQL and legend fixes |
| Pegboard default env set to staging | ⚠️ Verify this is intentional for the committed template |
The Rust fixes are solid. The main actionable issue is the mismatched legendFormat in the gasoline dashboard.

"legendFormat": "{{worker_id}}",
"expr": "sum by (sub_workflow_name) (rate(rivet_gasoline_workflow_dispatched_total{rivet_project=~\"$project\",rivet_datacenter=~\"$datacenter\",workflow_name=~\"$workflow_name\"} [$__rate_interval]))",
"format": "heatmap",
"legendFormat": "{{signal_name}}",

Legend format references {{signal_name}} but the query groups by sub_workflow_name. This mismatch will result in empty or incorrect legend labels.

Fix:

"legendFormat": "{{sub_workflow_name}}",
Suggested change
- "legendFormat": "{{signal_name}}",
+ "legendFormat": "{{sub_workflow_name}}",

Spotted by Graphite Agent


"editorMode": "code",
"expr": "sum(increase(rivet_gasoline_load_shedding_ratio_bucket{rivet_project=~\"$project\",rivet_datacenter=~\"$datacenter\"} [$__rate_interval])) by (le)",
"expr": "sum by (sub_workflow_name) (rate(rivet_gasoline_workflow_dispatched_total{rivet_project=~\"$project\",rivet_datacenter=~\"$datacenter\",workflow_name=\"\"} [$__rate_interval]))",
"format": "heatmap",

Format is set to "heatmap" but the panel type is "timeseries" (line 1724). This format mismatch will cause visualization errors.

Fix:

"format": "time_series",

Or remove the format field entirely to use the default for timeseries panels.

Suggested change
- "format": "heatmap",
+ "format": "time_series",

Spotted by Graphite Agent


@MasterPtato MasterPtato force-pushed the 02-11-fix_docker_cache_issues_dashboards branch from 3dbb6ec to 69d4538 Compare February 18, 2026 02:13
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4181 February 18, 2026 02:13 Destroyed
@pkg-pr-new

pkg-pr-new bot commented Feb 18, 2026

More templates

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@4181

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@4181

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@4181

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@4181

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@4181

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@4181

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@4181

@rivetkit/sqlite-vfs

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sqlite-vfs@4181

@rivetkit/traces

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/traces@4181

@rivetkit/workflow-engine

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/workflow-engine@4181

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@4181

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@4181

commit: 69d4538

@graphite-app
Contributor

graphite-app bot commented Feb 18, 2026

Merge activity

  • Feb 18, 2:25 AM UTC: MasterPtato added this pull request to the Graphite merge queue.
  • Feb 18, 2:26 AM UTC: CI is running for this pull request on a draft pull request (#4221) due to your merge queue CI optimization settings.
  • Feb 18, 2:27 AM UTC: Merged by the Graphite merge queue via draft PR: #4221.

graphite-app bot pushed a commit that referenced this pull request Feb 18, 2026
@graphite-app graphite-app bot closed this Feb 18, 2026
@graphite-app graphite-app bot deleted the 02-11-fix_docker_cache_issues_dashboards branch February 18, 2026 02:27