fix(controlplane): eliminate systemd tracking warning and metrics port conflict #201

Merged
fuziontech merged 2 commits into main from fix/handover-drain-race
Feb 14, 2026

@fuziontech
Member

Summary

  • Under systemd, selfExec now double-forks via setsid --fork so the new CP is reparented to PID 1. This eliminates the "Supervising process X which is not our child" warning and ensures Restart=always works if the new CP crashes
  • initMetrics now retries binding :9090 until the port is available instead of dying on the first failure; during handover the old CP holds the port until its drain completes
  • The reloading flag is now cleared after a successful handover so the 30s timeout recovery cannot interfere with long drains

Test plan

  • All 9 controlplane tests pass (the detached path only activates under systemd via NOTIFY_SOCKET; tests use direct spawn)
  • Deploy to canary and verify there is no "Supervising process X which is not our child" warning
  • Verify metrics server recovers after handover (curl :9090/metrics)

🤖 Generated with Claude Code

fuziontech and others added 2 commits February 14, 2026 01:46
… handover drain

When the PG listener is closed during handover, acceptLoop's `return` caused
RunControlPlane() → main() to exit, killing all in-flight connection goroutines
before the drain logic in handleHandoverRequest could complete. Replace `return`
with `select {}` so the main goroutine blocks until the handover handler (or
shutdown handler) calls os.Exit(0) after properly draining connections.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ics port conflict

Two fixes for zero-downtime handover:

1. selfExec now double-forks via `setsid --fork` when running under systemd
   (NOTIFY_SOCKET set). The new CP is reparented to PID 1, allowing systemd
   to properly track it via waitpid() for Restart=always. Outside systemd
   (tests), the direct spawn path is preserved for fast crash recovery via
   cmd.Wait(). A 30s timeout provides crash recovery for the detached path.

2. initMetrics retries binding :9090 until available. During handover the old
   CP still holds the metrics port until it drains and exits; the new CP's
   metrics goroutine now retries instead of dying on first failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@fuziontech fuziontech merged commit 243973c into main Feb 14, 2026
11 checks passed
@fuziontech fuziontech deleted the fix/handover-drain-race branch February 14, 2026 02:15