
controlplane: improve worker crash diagnostics and health check resilience#199

Merged
fuziontech merged 2 commits into main from improve-worker-crash-diagnostics
Feb 14, 2026


@fuziontech

Summary

  • Log exit errors for unexpectedly dead workers. When RetireWorker (triggered by client disconnect) finds the process already exited, it now logs w.exitErr at WARN level. Previously, a race between RetireWorker and the health check loop's crash detection caused the exit error to be silently discarded, which made it impossible to diagnose recurring worker deaths (e.g. DuckDB C++ fatal errors calling _exit()).

  • Propagate gRPC errors in doHealthCheck. The underlying transport error (e.g. connection reset by peer, context deadline exceeded) is now wrapped and visible in logs, replacing the generic "worker not healthy" message.

  • Retire workers after consecutive health check failures. After 3 consecutive failures (~6s with typical 2s interval), the worker is force-killed, removed from the pool, and crash handlers notify affected sessions. Previously, failing health checks were logged but never acted on, leaving unresponsive workers running until clients timed out.
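
For readers who want to see the shape of the consecutive-failure rule, here is a minimal sketch. It is not the code from this PR: failureTracker, record, and forget are hypothetical names, worker IDs are assumed to be ints, and locking is left out. Only the threshold of 3 and the reset-on-success behavior come from the description above.

```go
package main

import "fmt"

// maxConsecutiveFailures mirrors the threshold of 3 described above.
const maxConsecutiveFailures = 3

// failureTracker counts consecutive failed health checks per worker ID.
// Hypothetical type for illustration; it is not safe for concurrent use
// without an external lock.
type failureTracker struct {
	counts map[int]int
}

func newFailureTracker() *failureTracker {
	return &failureTracker{counts: make(map[int]int)}
}

// record notes one health check result and reports whether the worker has
// reached the threshold and should be retired.
func (ft *failureTracker) record(workerID int, healthy bool) bool {
	if healthy {
		delete(ft.counts, workerID) // any success resets the streak
		return false
	}
	ft.counts[workerID]++
	return ft.counts[workerID] >= maxConsecutiveFailures
}

// forget drops the state for a worker that has exited.
func (ft *failureTracker) forget(workerID int) {
	delete(ft.counts, workerID)
}

func main() {
	ft := newFailureTracker()
	results := []bool{false, false, true, false, false, false}
	for i, healthy := range results {
		retire := ft.record(1, healthy)
		fmt.Printf("check %d healthy=%v retire=%v\n", i+1, healthy, retire)
	}
}
```

In this sketch only the sixth check reports retire=true, because the success on the third check resets the streak. With a 2s interval, three failures in a row correspond to the roughly 6s window mentioned above.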

Context

Investigating a recurring issue where worker processes die silently (no OOM, no segfault, no coredump, no signal in dmesg). The exit error captured by cmd.Wait() would tell us the exit code/signal, but it was being discarded in both the RetireWorker path and the crash detection race. 7 occurrences observed in 24h.
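
As a rough illustration of what cmd.Wait() leaves behind, the sketch below reads the exit code or terminating signal from a finished command using only the standard library. describeExit and runAndDescribe are hypothetical helpers, not functions from this PR, and the example assumes a Unix-like host with sh available.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// describeExit summarizes how a finished command ended. It must be called
// after cmd.Wait (or cmd.Run) has returned; waitErr is that return value.
func describeExit(cmd *exec.Cmd, waitErr error) string {
	if cmd.ProcessState == nil {
		// No status was collected, for example because the process never started.
		return fmt.Sprintf("no process state: %v", waitErr)
	}
	// Sys() holds a syscall.WaitStatus, which records a terminating signal.
	if ws, ok := cmd.ProcessState.Sys().(syscall.WaitStatus); ok && ws.Signaled() {
		return fmt.Sprintf("killed by signal %v", ws.Signal())
	}
	// ExitCode works even when waitErr is nil, so exit(0) and exit(1)
	// can be told apart from ProcessState alone.
	return fmt.Sprintf("exit code %d", cmd.ProcessState.ExitCode())
}

// runAndDescribe runs a command to completion and describes its exit.
func runAndDescribe(name string, args ...string) string {
	cmd := exec.Command(name, args...)
	err := cmd.Run()
	return describeExit(cmd, err)
}

func main() {
	fmt.Println(runAndDescribe("sh", "-c", "exit 3"))
	fmt.Println(runAndDescribe("sh", "-c", "exit 0"))
}
```

A process that calls _exit() with a non-zero status would show up here as a plain exit code rather than a signal, which is the distinction the missing logs would have provided.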

Test plan

  • go build ./controlplane/... passes
  • go test ./controlplane/ passes
  • Deploy and verify that next worker death produces diagnostic logs with exit code

🤖 Generated with Claude Code

fuziontech and others added 2 commits February 14, 2026 01:27
…ience

When a worker process dies (e.g. DuckDB C++ fatal error calling _exit()),
the exit error was silently discarded due to a race between RetireWorker
(triggered by client disconnect) and the health check loop's crash
detection. This made it impossible to diagnose recurring worker deaths.

Three fixes:

1. Log exit errors for unexpectedly dead workers in retireWorkerProcess.
   When RetireWorker finds the process already exited, it now logs the
   exit error (exit code/signal) at WARN level instead of silently
   cleaning up.

2. Propagate gRPC errors in doHealthCheck instead of returning the
   generic "worker not healthy" message. The underlying transport error
   (e.g. connection reset, context deadline exceeded) is now wrapped
   and visible in health check failure logs.

3. Add consecutive health check failure tracking. After 3 consecutive
   failures (~6s with typical 2s interval), the worker is force-killed,
   removed from the pool, and crash handlers notify affected sessions.
   Previously, failing health checks were logged but never acted on,
   leaving unresponsive workers running until clients timed out.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Log exit_code from ProcessState in retireWorkerProcess when a worker
  is found already dead (distinguishes exit(0) from exit(1) even when
  exitErr is nil)
- Add comment documenting single-message assumption in doHealthCheck
- Add comment on force-kill goroutine explaining why it skips SIGINT
- Add worker_mgr_test.go with tests for:
  - Health check failure counting (reset on success, trigger at threshold,
    no trigger below threshold, cleanup on worker exit)
  - retireWorkerProcess already-dead path (exit code 0 and non-zero)
  - retireWorkerProcess graceful shutdown path
  - HealthCheckLoop crash detection and notification

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@fuziontech fuziontech enabled auto-merge (squash) February 14, 2026 01:34
@fuziontech fuziontech merged commit 091b974 into main Feb 14, 2026
11 checks passed
@fuziontech fuziontech deleted the improve-worker-crash-diagnostics branch February 14, 2026 01:36