controlplane: improve worker crash diagnostics and health check resilience by fuziontech · Pull Request #199 · PostHog/duckgres

fuziontech · 2026-02-14T01:27:35Z

Summary

Log exit errors for unexpectedly dead workers. When RetireWorker (triggered by client disconnect) finds the process already exited, it now logs w.exitErr at WARN level. Previously, a race between RetireWorker and the health check loop's crash detection caused the exit error to be silently discarded — making it impossible to diagnose recurring worker deaths (e.g. DuckDB C++ fatal errors calling _exit()).
Propagate gRPC errors in doHealthCheck. The underlying transport error (e.g. connection reset by peer, context deadline exceeded) is now wrapped and visible in logs, replacing the generic "worker not healthy" message.
Retire workers after consecutive health check failures. After 3 consecutive failures (~6s with typical 2s interval), the worker is force-killed, removed from the pool, and crash handlers notify affected sessions. Previously, failing health checks were logged but never acted on, leaving unresponsive workers running until clients timed out.
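The consecutive-failure rule described above can be sketched as a small counter keyed by worker. This is a minimal illustration, not the code from this PR: `failureTracker`, its methods, and the integer worker IDs are hypothetical names, and only the threshold of 3 comes from the summary.

```go
package main

import (
	"fmt"
	"sync"
)

// maxConsecutiveFailures mirrors the threshold of 3 stated in the summary.
const maxConsecutiveFailures = 3

// failureTracker is a hypothetical helper (not the type used in duckgres)
// that counts consecutive failed health checks per worker.
type failureTracker struct {
	mu       sync.Mutex
	failures map[int]int // worker ID -> consecutive failed checks
}

func newFailureTracker() *failureTracker {
	return &failureTracker{failures: make(map[int]int)}
}

// recordFailure increments the worker's streak and reports whether the
// streak has reached the threshold, i.e. the worker should be retired.
func (t *failureTracker) recordFailure(workerID int) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.failures[workerID]++
	return t.failures[workerID] >= maxConsecutiveFailures
}

// recordSuccess resets the streak after a passing health check.
func (t *failureTracker) recordSuccess(workerID int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.failures, workerID)
}

// forget drops any state for a worker that has exited or been removed.
func (t *failureTracker) forget(workerID int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.failures, workerID)
}

func main() {
	tracker := newFailureTracker()
	// Simulated health check results for worker 7: true means healthy.
	results := []bool{false, false, true, false, false, false}
	for i, healthy := range results {
		if healthy {
			tracker.recordSuccess(7)
			continue
		}
		if tracker.recordFailure(7) {
			fmt.Printf("check %d: retiring worker 7\n", i+1)
			tracker.forget(7)
		}
	}
}
```

Resetting on any success means only an unbroken run of failures triggers retirement, so a single slow response does not kill a worker.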

Context

Investigating a recurring issue where worker processes die silently (no OOM, no segfault, no coredump, no signal in dmesg). The exit error captured by cmd.Wait() would tell us the exit code/signal, but it was being discarded in both the RetireWorker path and the crash detection race. 7 occurrences observed in 24h.
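As a rough illustration of what the error from cmd.Wait() and the process state can reveal, the sketch below summarizes how a child process ended. It assumes a Unix-like system with `sh` on the PATH; `describeExit` and `runAndDescribe` are hypothetical helpers and do not reflect the logging format used in this PR.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// describeExit summarizes how a finished command ended. It must only be
// called after cmd.Wait() (or cmd.Run()) has returned.
func describeExit(cmd *exec.Cmd, waitErr error) string {
	ps := cmd.ProcessState
	if ps == nil {
		// The process never started, so there is no exit status to report.
		return fmt.Sprintf("no process state (wait error: %v)", waitErr)
	}
	// On Unix, Sys() returns a syscall.WaitStatus that exposes the signal.
	if ws, ok := ps.Sys().(syscall.WaitStatus); ok && ws.Signaled() {
		return fmt.Sprintf("killed by signal %v", ws.Signal())
	}
	// ExitCode distinguishes exit(0) from exit(1) even when waitErr is nil.
	return fmt.Sprintf("exit code %d (wait error: %v)", ps.ExitCode(), waitErr)
}

// runAndDescribe runs a shell snippet to completion and describes its exit.
func runAndDescribe(script string) string {
	cmd := exec.Command("sh", "-c", script)
	err := cmd.Run() // Run is Start followed by Wait
	return describeExit(cmd, err)
}

func main() {
	fmt.Println(runAndDescribe("exit 0"))
	fmt.Println(runAndDescribe("exit 3"))
	fmt.Println(runAndDescribe("kill -KILL $$")) // the shell kills itself
}
```

A process that calls _exit() with a non-zero status shows up here as a plain exit code rather than a signal, which matches the "no signal in dmesg" symptom described above.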

Test plan

go build ./controlplane/... passes
go test ./controlplane/ passes
Deploy and verify that next worker death produces diagnostic logs with exit code

🤖 Generated with Claude Code

…ience

When a worker process dies (e.g. DuckDB C++ fatal error calling _exit()), the exit error was silently discarded due to a race between RetireWorker (triggered by client disconnect) and the health check loop's crash detection. This made it impossible to diagnose recurring worker deaths.

Three fixes:

1. Log exit errors for unexpectedly dead workers in retireWorkerProcess. When RetireWorker finds the process already exited, it now logs the exit error (exit code/signal) at WARN level instead of silently cleaning up.
2. Propagate gRPC errors in doHealthCheck instead of returning the generic "worker not healthy" message. The underlying transport error (e.g. connection reset, context deadline exceeded) is now wrapped and visible in health check failure logs.
3. Add consecutive health check failure tracking. After 3 consecutive failures (~6s with typical 2s interval), the worker is force-killed, removed from the pool, and crash handlers notify affected sessions. Previously, failing health checks were logged but never acted on, leaving unresponsive workers running until clients timed out.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Log exit_code from ProcessState in retireWorkerProcess when a worker is found already dead (distinguishes exit(0) from exit(1) even when exitErr is nil)
- Add comment documenting single-message assumption in doHealthCheck
- Add comment on force-kill goroutine explaining why it skips SIGINT
- Add worker_mgr_test.go with tests for:
  - Health check failure counting (reset on success, trigger at threshold, no trigger below threshold, cleanup on worker exit)
  - retireWorkerProcess already-dead path (exit code 0 and non-zero)
  - retireWorkerProcess graceful shutdown path
  - HealthCheckLoop crash detection and notification

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fuziontech and others added 2 commits February 14, 2026 01:27

fuziontech enabled auto-merge (squash) February 14, 2026 01:34

fuziontech merged commit 091b974 into main Feb 14, 2026
11 checks passed

fuziontech deleted the improve-worker-crash-diagnostics branch February 14, 2026 01:36

