fix: reduce metrics aggregator cardinality #4183

Open
MasterPtato wants to merge 1 commit into main from 02-11-fix_reduce_metrics_aggregator_cardinality
@MasterPtato
Contributor

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
railway-app bot commented Feb 12, 2026

🚅 Deployed to the rivet-pr-4183 environment in rivet-frontend

| Service | Status | Web | Updated (UTC) |
| --- | --- | --- | --- |
| ladle | ❌ Build Failed (View Logs) | Web | Feb 18, 2026 at 2:29 am |
| website | 😴 Sleeping (View Logs) | Web | Feb 17, 2026 at 6:36 am |
| frontend-cloud | ❌ Build Failed (View Logs) | Web | Feb 12, 2026 at 1:18 am |
| frontend-inspector | ❌ Build Failed (View Logs) | Web | Feb 12, 2026 at 1:18 am |
| mcp-hub | ✅ Success (View Logs) | Web | Feb 12, 2026 at 1:17 am |

Contributor Author

MasterPtato commented Feb 12, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
claude bot commented Feb 12, 2026

PR Review: fix: reduce metrics aggregator cardinality

Summary

This PR addresses Prometheus cardinality issues in the metrics aggregator through two approaches: fixing Grafana dashboard variable queries and skipping zero-value metric registrations.


Grafana Dashboard Fixes (7 files)

Correct and necessary fix. Changing the bare label_values(rivet_project) to label_values(up, rivet_project) adds a metric selector, so the variable query is scoped to series of the up metric instead of looking up the label across every series. This follows Prometheus best practices, and some Prometheus/Mimir backends reject or warn about selector-less label_values calls. The up metric is a safe, universally available choice.

This change is mechanically identical across all 7 dashboards and looks correct.


Rust: aggregate_serverless_desired_slots (active change)

Concern: the zero-value guard does not fully solve the cardinality problem.

The change skips calling with_label_values(...).add(desired_slots) when desired_slots == 0. This prevents new label combinations from being registered when their count happens to be zero on first observation. However:

  • If a namespace/runner previously had nonzero slots and then drops to zero, its label set is already registered in the metric's internal map. Skipping add(0) does not remove it. A reset() call at the top of the function (as ACTOR_PENDING_ALLOCATION gets) would be needed to actually remove stale label sets.
  • SERVERLESS_DESIRED_SLOTS is updated with .add(), which accumulates, and aggregate_serverless_desired_slots has no reset() at the top. Without a reset, each tick adds to whatever value was there before, and a namespace with 0 slots on a given tick silently retains its old value rather than being cleared to 0.
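
To make the two points above concrete, here is a minimal sketch using only the standard library. ToyGaugeVec and tick_guard_only are hypothetical names invented for illustration; they model a labeled gauge as a plain map and are not the metrics API used in this PR.

```rust
use std::collections::HashMap;

/// Toy stand-in for a labeled gauge family (hypothetical, not the real metrics type).
#[derive(Default)]
struct ToyGaugeVec {
    series: HashMap<(String, String), i64>,
}

impl ToyGaugeVec {
    /// Registers the label pair on first use, then adds `delta` to its value.
    fn add(&mut self, namespace: &str, runner: &str, delta: i64) {
        *self
            .series
            .entry((namespace.to_string(), runner.to_string()))
            .or_insert(0) += delta;
    }
}

/// One aggregation tick with only the zero-value guard and no reset.
fn tick_guard_only(gauge: &mut ToyGaugeVec, observed: &[(&str, &str, i64)]) {
    for &(namespace, runner, desired_slots) in observed {
        if desired_slots == 0 {
            // Avoids registering a new series, but cannot remove an existing one.
            continue;
        }
        gauge.add(namespace, runner, desired_slots);
    }
}

fn main() {
    let mut gauge = ToyGaugeVec::default();

    // Tick 1: ns-b is zero on first observation, so it is never registered.
    tick_guard_only(&mut gauge, &[("ns-a", "runner-1", 3), ("ns-b", "runner-1", 0)]);
    println!("after tick 1: {:?}", gauge.series);

    // Tick 2: ns-a drops to zero, yet its series is still present with the old value.
    tick_guard_only(&mut gauge, &[("ns-a", "runner-1", 0)]);
    println!("after tick 2: {:?}", gauge.series);

    // Tick 3: without a reset the value accumulates (3 + 2), it is not replaced by 2.
    tick_guard_only(&mut gauge, &[("ns-a", "runner-1", 2)]);
    println!("after tick 3: {:?}", gauge.series);
}
```

In this model the guard keeps ns-b out of the map, but ns-a stays registered after dropping to zero and later reports 5 rather than 2.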

It is worth clarifying the intended semantics: is the metric meant to represent a snapshot (current desired slots) or a running total? If snapshot, a reset() at the start of the activity is needed (consistent with aggregate_pending_actors). The zero-guard is then a nice optimization on top, but the reset is the critical correctness fix.
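
If snapshot semantics are intended, the reset-then-populate shape can be sketched as below. This is an illustration under assumptions, not the PR's code: snapshot_desired_slots is a hypothetical function, and starting from an empty map stands in for calling reset() on the real metric at the top of the activity.

```rust
use std::collections::HashMap;

type Labels = (String, String);

/// Rebuilds the per-label values from scratch on every tick.
/// Starting from an empty map plays the role of reset(); the zero guard
/// then only saves registering series that would report 0 anyway.
fn snapshot_desired_slots(observed: &[(&str, &str, i64)]) -> HashMap<Labels, i64> {
    let mut series: HashMap<Labels, i64> = HashMap::new();
    for &(namespace, runner, desired_slots) in observed {
        if desired_slots == 0 {
            continue;
        }
        // Entries sharing a label pair within one tick are still summed.
        *series
            .entry((namespace.to_string(), runner.to_string()))
            .or_insert(0) += desired_slots;
    }
    series
}

fn main() {
    let tick1 = snapshot_desired_slots(&[
        ("ns-a", "runner-1", 3),
        ("ns-a", "runner-1", 1),
        ("ns-b", "runner-1", 0),
    ]);
    // ns-a drops to zero on the next tick, so nothing is carried over.
    let tick2 = snapshot_desired_slots(&[("ns-a", "runner-1", 0)]);
    println!("tick 1: {} series, tick 2: {} series", tick1.len(), tick2.len());
}
```

Here the label set present after each tick is bounded by the nonzero pairs observed in that tick, which is the property the reset provides.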


Rust: Commented-out aggregate_active_actors (formatting only)

The changes to the commented-out block are limited to mechanical formatting plus an equivalent refactor of the metric call that extracts active_actors before the if. There is no behavioral change, since the entire block is still commented out. This is fine.


Tracing Instrumentation

Adding #[tracing::instrument(skip_all)] to:

  • ApiPublicService::handle_request
  • PegboardRunnerWsCustomServe::handle_request
  • PegboardRunnerWsCustomServe::handle_websocket

These are reasonable additions for observability on hot-path request handlers. skip_all is appropriate since the request types likely contain large or non-Debug bodies that should not be captured in spans.

Minor note: these changes are not reflected in the PR title (which focuses on cardinality), and the PR description is still the unfilled template. They appear to be incidental improvements bundled into the fix.


Questions / Suggestions

  1. Missing reset() in aggregate_serverless_desired_slots: Should this function call metrics::SERVERLESS_DESIRED_SLOTS.reset() at the top, consistent with aggregate_pending_actors? Without it, the gauge retains stale values for namespaces that no longer have desired slots.

  2. ACTOR_PENDING_ALLOCATION cardinality: The aggregate_pending_actors function calls .inc() unconditionally for every entry, registering every (namespace_id, runner_name_selector) pair it finds. If the cardinality problem applies to this metric too, it may warrant the same treatment, though it already has a reset() so stale entries are cleared each tick.

  3. Scope of the fix: The PR title says "reduce metrics aggregator cardinality" but the tracing additions are unrelated. Consider splitting into separate PRs or updating the PR description to reflect all changes.


Overall

The Grafana and tracing changes are clean and correct. The zero-value guard in aggregate_serverless_desired_slots partially addresses cardinality (prevents registration of zero-value label sets on fresh runs) but may leave stale cardinality from previously active namespaces. The core concern is whether a reset() is needed at the start of that activity for correctness.

@pkg-pr-new
pkg-pr-new bot commented Feb 18, 2026

More templates

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@4183

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@4183

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@4183

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@4183

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@4183

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@4183

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@4183

@rivetkit/sqlite-vfs

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sqlite-vfs@4183

@rivetkit/traces

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/traces@4183

@rivetkit/workflow-engine

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/workflow-engine@4183

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@4183

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@4183

commit: 81cc8b3

@graphite-app graphite-app bot changed the base branch from 02-11-fix_docker_cache_issues_dashboards to graphite-base/4183 February 18, 2026 02:27
@graphite-app graphite-app bot force-pushed the graphite-base/4183 branch from 69d4538 to 3d5b7c7 Compare February 18, 2026 02:27
@graphite-app graphite-app bot force-pushed the 02-11-fix_reduce_metrics_aggregator_cardinality branch from aa559af to 4bf0ef8 Compare February 18, 2026 02:27
@graphite-app graphite-app bot changed the base branch from graphite-base/4183 to main February 18, 2026 02:28