controlplane: improve worker crash diagnostics and health check resilience#199
Merged
fuziontech merged 2 commits into main on Feb 14, 2026
…ience

When a worker process dies (e.g. DuckDB C++ fatal error calling _exit()), the exit error was silently discarded due to a race between RetireWorker (triggered by client disconnect) and the health check loop's crash detection. This made it impossible to diagnose recurring worker deaths.

Three fixes:

1. Log exit errors for unexpectedly dead workers in retireWorkerProcess. When RetireWorker finds the process already exited, it now logs the exit error (exit code/signal) at WARN level instead of silently cleaning up.
2. Propagate gRPC errors in doHealthCheck instead of returning the generic "worker not healthy" message. The underlying transport error (e.g. connection reset, context deadline exceeded) is now wrapped and visible in health check failure logs.
3. Add consecutive health check failure tracking. After 3 consecutive failures (~6s with typical 2s interval), the worker is force-killed, removed from the pool, and crash handlers notify affected sessions. Previously, failing health checks were logged but never acted on, leaving unresponsive workers running until clients timed out.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
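The consecutive-failure rule in fix 3 can be sketched as a small counter keyed by worker. This is an illustration only, not the code from this PR: `failureTracker`, `record`, `forget`, and the integer worker IDs are hypothetical names, and the threshold of 3 is taken from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// maxConsecutiveFailures mirrors the threshold of 3 described in the commit message.
const maxConsecutiveFailures = 3

// failureTracker is a hypothetical helper, not the type used in the PR.
type failureTracker struct {
	mu       sync.Mutex
	failures map[int]int // worker ID -> consecutive failed health checks
}

func newFailureTracker() *failureTracker {
	return &failureTracker{failures: make(map[int]int)}
}

// record notes one health check result and reports whether the worker
// has just reached the threshold and should be force-killed.
func (t *failureTracker) record(workerID int, healthy bool) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if healthy {
		delete(t.failures, workerID) // any success resets the streak
		return false
	}
	t.failures[workerID]++
	if t.failures[workerID] >= maxConsecutiveFailures {
		delete(t.failures, workerID) // the worker is about to leave the pool
		return true
	}
	return false
}

// forget drops the counter for a worker that exited on its own.
func (t *failureTracker) forget(workerID int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.failures, workerID)
}

func main() {
	t := newFailureTracker()
	// Two failures, one success (reset), then three failures in a row.
	results := []bool{false, false, true, false, false, false}
	for i, healthy := range results {
		if t.record(7, healthy) {
			fmt.Printf("check %d: threshold reached, force-kill worker 7\n", i+1)
		}
	}
}
```

In this sketch only the sixth check triggers, because the success on the third check resets the streak. With a 2s check interval, three failures in a row correspond to the roughly 6s window mentioned above.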
- Log exit_code from ProcessState in retireWorkerProcess when a worker
is found already dead (distinguishes exit(0) from exit(1) even when
exitErr is nil)
- Add comment documenting single-message assumption in doHealthCheck
- Add comment on force-kill goroutine explaining why it skips SIGINT
- Add worker_mgr_test.go with tests for:
  - Health check failure counting (reset on success, trigger at threshold,
    no trigger below threshold, cleanup on worker exit)
  - retireWorkerProcess already-dead path (exit code 0 and non-zero)
  - retireWorkerProcess graceful shutdown path
  - HealthCheckLoop crash detection and notification
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
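To illustrate why logging the exit code from ProcessState is useful alongside the exit error, here is a minimal sketch using only os/exec. The helper names `describeExit` and `runAndDescribe` are hypothetical, and the sketch assumes a POSIX `sh` is on the PATH; it does not reproduce retireWorkerProcess.

```go
package main

import (
	"fmt"
	"os/exec"
)

// describeExit summarizes how an already-waited command ended.
// ProcessState is populated once Wait (or Run) has returned.
func describeExit(cmd *exec.Cmd, waitErr error) string {
	if cmd.ProcessState == nil {
		// The process never started, so there is no exit status to report.
		return fmt.Sprintf("exit_code=unknown err=%v", waitErr)
	}
	// ExitCode returns -1 if the process was terminated by a signal.
	return fmt.Sprintf("exit_code=%d err=%v", cmd.ProcessState.ExitCode(), waitErr)
}

// runAndDescribe runs a shell snippet to completion and describes its exit.
// Assumption: "sh" exists on this machine.
func runAndDescribe(script string) string {
	cmd := exec.Command("sh", "-c", script)
	err := cmd.Run() // Run is Start followed by Wait
	return describeExit(cmd, err)
}

func main() {
	fmt.Println(runAndDescribe("exit 0"))
	fmt.Println(runAndDescribe("exit 3"))
}
```

A clean exit leaves the wait error nil, so the exit code taken from ProcessState is the only explicit record that the process exited with status 0.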
Summary
- Log exit errors for unexpectedly dead workers. When `RetireWorker` (triggered by client disconnect) finds the process already exited, it now logs `w.exitErr` at WARN level. Previously, a race between `RetireWorker` and the health check loop's crash detection caused the exit error to be silently discarded, making it impossible to diagnose recurring worker deaths (e.g. DuckDB C++ fatal errors calling `_exit()`).
- Propagate gRPC errors in `doHealthCheck`. The underlying transport error (e.g. `connection reset by peer`, `context deadline exceeded`) is now wrapped and visible in logs, replacing the generic "worker not healthy" message.
- Retire workers after consecutive health check failures. After 3 consecutive failures (~6s with typical 2s interval), the worker is force-killed, removed from the pool, and crash handlers notify affected sessions. Previously, failing health checks were logged but never acted on, leaving unresponsive workers running until clients timed out.
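The error propagation in the second item can be sketched with standard-library wrapping. This is a simplified illustration, not `doHealthCheck` itself: the gRPC call is replaced by a plain function so the example needs no external modules, and `checkFunc`, `healthCheck`, `timedOut`, `hungWorker`, and `healthyWorker` are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// checkFunc stands in for the health RPC (hypothetical; the real call is gRPC).
type checkFunc func(ctx context.Context) error

// healthCheck runs check under a timeout and wraps any failure with %w,
// so the underlying cause stays visible in logs and to errors.Is.
func healthCheck(workerID int, timeout time.Duration, check checkFunc) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if err := check(ctx); err != nil {
		return fmt.Errorf("worker %d health check failed: %w", workerID, err)
	}
	return nil
}

// timedOut reports whether the error chain contains context.DeadlineExceeded.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

// hungWorker simulates a worker that never answers before the deadline.
func hungWorker(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// healthyWorker simulates a worker that answers immediately.
func healthyWorker(ctx context.Context) error {
	return nil
}

func main() {
	err := healthCheck(3, 50*time.Millisecond, hungWorker)
	fmt.Println(err)           // worker 3 health check failed: context deadline exceeded
	fmt.Println(timedOut(err)) // true
	fmt.Println(healthCheck(3, 50*time.Millisecond, healthyWorker)) // <nil>
}
```

Wrapping with `%w` rather than returning a fixed "worker not healthy" string keeps the cause in the message. Errors returned by a real gRPC client carry status codes, so matching on them would likely go through the status package rather than `errors.Is` on context errors.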
Context
Investigating a recurring issue where worker processes die silently (no OOM, no segfault, no coredump, no signal in dmesg). The exit error captured by `cmd.Wait()` would tell us the exit code/signal, but it was being discarded in both the `RetireWorker` path and the crash detection race. 7 occurrences observed in 24h.
Test plan
- `go build ./controlplane/...` passes
- `go test ./controlplane/` passes

🤖 Generated with Claude Code