Initial implementation of supported services. #266
base: main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings or resume it with the review commands.
📝 Walkthrough
Adds service provisioning end-to-end: API types/validation, database domain models and etcd stores for service instances/status, Swarm orchestrator resources/specs/images, monitors and monitor stores, workflows/activities for provisioning and credential creation, tests, docs, DI/logging and tooling tweaks.
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 16
🤖 Fix all issues with AI agents
In `@docs/development/service-credentials.md`:
- Around line 200-204: Replace the repeated "Confirm" phrasing with varied
imperatives and fix the ambiguous SQL quoting: change the SQL example token to a
fenced code span or use double quotes around the Markdown inline code so the
inner single quotes remain intact (the SQL to keep is `SELECT * FROM pg_user WHERE usename LIKE 'svc_%';`). Update bullets 1–5 (and likewise 226–228) to use
distinct verbs like "Verify", "Ensure", "Check", "Test" and render the SQL as
clear inline code or a code block to avoid nested single-quote ambiguity.
In `@e2e/service_provisioning_test.go`:
- Around line 83-105: Test wait window is shorter than the monitor timeout
causing flakes; increase the maxWait used in the polling loop (the maxWait
variable in the test that currently is 2 * time.Minute) to match the monitor's
creation timeout (e.g., 5 * time.Minute), keep pollInterval unchanged, and apply
the same change to the sibling recovery test's polling loop so both use the
longer deadline before asserting db.ServiceInstances[0].State == "running"
(references: serviceInstance, db.Refresh, maxWait, pollInterval, deadline).
In `@Makefile`:
- Around line 369-376: The dev-teardown target uses portable-unsupported xargs
-r when piping docker service ls -q into docker service inspect and docker
service rm; replace that pipeline with a guard that first captures or tests the
output of docker service ls -q (or tests that the variable is non-empty) before
invoking docker service inspect or docker service rm so the commands are only
run on non-empty input on both Linux and macOS/BSD; update the dev-teardown
recipe (around the docker service ls -q | xargs -r docker service inspect ...
and the final | xargs -r docker service rm) to check for non-empty output (e.g.,
assign to a var or use test -n) and then call docker service inspect and docker
service rm, preserving the same label-filtering awk logic.
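A minimal sketch of the guard pattern, where `FILTER` stands in for the recipe's existing label-filtering awk stage (the target name and flags mirror the comment above, not the actual recipe):

```make
dev-teardown:
	@ids="$$(docker service ls -q)"; \
	if [ -n "$$ids" ]; then \
		ids="$$(docker service inspect $$ids --format '{{.ID}}' | FILTER)"; \
	fi; \
	if [ -n "$$ids" ]; then \
		docker service rm $$ids; \
	fi
```

Unlike `xargs -r` (a GNU extension), the `test -n` guard behaves identically on Linux and macOS/BSD shells.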
In `@server/internal/api/apiv1/convert.go`:
- Around line 184-215: serviceSpecToAPI's ad-hoc key checks on svc.Config can
miss future secret names; add a centralized helper (e.g., isSensitiveConfigKey)
and use it when building filteredConfig to robustly detect secret keys.
Implement isSensitiveConfigKey to either (A) use a conservative allowlist of
safe config keys and treat everything else as sensitive, or (B) use a compiled
case-insensitive regex that matches common secret patterns (e.g., token, key,
apikey, api_key, secret, password, credential, access[_-]?token) with
word-boundary or delimiter handling; then replace the inline strings.Contains
checks in serviceSpecToAPI to call this helper when deciding to skip a config
entry. Ensure the helper is reused anywhere config redaction is needed so secret
detection is centralized.
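A minimal sketch of option (B); the exact pattern list mirrors the comment above, and the identifier names are illustrative rather than the PR's:

```go
package apiv1

import "regexp"

// sensitiveKeyPattern matches common secret-bearing key names, delimited by
// start/end of string, underscores, or hyphens (case-insensitive).
var sensitiveKeyPattern = regexp.MustCompile(
	`(?i)(^|[_-])(token|key|apikey|api[_-]key|secret|password|credential|access[_-]?token)([_-]|$)`,
)

// isSensitiveConfigKey reports whether a config key likely holds a secret and
// should be omitted when the spec is converted for API responses.
func isSensitiveConfigKey(key string) bool {
	return sensitiveKeyPattern.MatchString(key)
}
```

serviceSpecToAPI would then call `isSensitiveConfigKey(k)` in its filtering loop instead of the inline `strings.Contains` checks.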
In `@server/internal/api/apiv1/validate.go`:
- Around line 296-321: The current type assertion for config["llm_provider"] can
silently skip validation if the value exists but isn't a string; change the
logic so you first check for existence (val, exists := config["llm_provider"])
and if exists but not a string append a newValidationError to errs (use
appendPath(path, mapKeyPath("llm_provider")) and a clear error message about
invalid type), otherwise if it is a string continue the existing provider
validation (validProviders check and the switch for
"anthropic"/"openai"/"ollama"); keep using newValidationError, appendPath,
mapKeyPath and errs for all error cases.
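A sketch of the suggested control flow; `newValidationError`, `appendPath`, and `mapKeyPath` are the PR's existing helpers, with their signatures assumed here:

```go
if val, exists := config["llm_provider"]; exists {
	provider, ok := val.(string)
	if !ok {
		// Value present but wrong type: report it instead of silently skipping.
		errs = append(errs, newValidationError(
			appendPath(path, mapKeyPath("llm_provider")),
			"llm_provider must be a string"))
	} else {
		switch provider {
		case "anthropic", "openai", "ollama":
			// Valid provider; provider-specific validation continues here.
		default:
			errs = append(errs, newValidationError(
				appendPath(path, mapKeyPath("llm_provider")),
				"llm_provider must be one of: anthropic, openai, ollama"))
		}
	}
}
```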
In `@server/internal/database/service.go`:
- Around line 596-611: The SetServiceInstanceState method currently returns a
wrapped generic error when the store GetByKey returns storage.ErrNotFound;
change the error handling in Service.SetServiceInstanceState to detect the
not-found case (using errors.Is(err, storage.ErrNotFound) or equivalent) and
return the package-level ErrServiceInstanceNotFound instead of a generic
fmt.Errorf, then proceed as before for other errors; apply the same pattern to
the other methods in the file that call s.store.ServiceInstance.GetByKey/Exec
(the function around lines 709-736) so they also translate storage.ErrNotFound
into ErrServiceInstanceNotFound.
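The translation could be centralized in a small helper — a sketch, assuming the sentinel names from the comment above:

```go
// translateServiceInstanceErr maps storage-layer not-found errors onto the
// package-level sentinel so callers can use errors.Is against the domain error.
func translateServiceInstanceErr(err error) error {
	if errors.Is(err, storage.ErrNotFound) {
		return ErrServiceInstanceNotFound
	}
	return err
}
```

Both `SetServiceInstanceState` and the method around lines 709-736 would route their `GetByKey`/`Exec` errors through this helper before wrapping.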
In `@server/internal/monitor/service.go`:
- Around line 170-189: CreateServiceInstanceMonitor currently persists with the
hostID parameter which can diverge from s.cfg.HostID used by Start and
DeleteServiceInstanceMonitor; update the method to enforce consistency by either
validating hostID == s.cfg.HostID at the top of CreateServiceInstanceMonitor and
returning an error if they differ, or always persisting using s.cfg.HostID
instead of the hostID parameter (mirroring CreateInstanceMonitor behavior);
reference the CreateServiceInstanceMonitor, Start, DeleteServiceInstanceMonitor
methods and s.cfg.HostID when making the change so stored keys remain queryable
and deletable after restart.
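The validating variant might look roughly like this; the method signature is assumed for illustration:

```go
func (s *Service) CreateServiceInstanceMonitor(ctx context.Context, hostID, serviceInstanceID string) error {
	// Reject mismatched hosts up front so stored keys remain queryable by
	// Start and DeleteServiceInstanceMonitor, which always use s.cfg.HostID.
	if hostID != s.cfg.HostID {
		return fmt.Errorf("host ID %q does not match configured host %q", hostID, s.cfg.HostID)
	}
	// ... persist the monitor keyed by s.cfg.HostID, as before ...
	return nil
}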
In `@server/internal/orchestrator/swarm/service_images.go`:
- Around line 73-88: The function serviceImageTag returns an invalid "/<image>"
when cfg.DockerSwarm.ImageRepositoryHost is empty; update serviceImageTag to
check cfg.DockerSwarm.ImageRepositoryHost and only prepend it when non-empty
(otherwise return imageRef unchanged), ensuring you still keep the existing
registry-detection logic that inspects the first part for "." ":" or "localhost"
before deciding to modify the imageRef (see the sketch after this list).
- Around line 19-35: The code registers a non-existent "mcp" version "1.0.0" in
NewServiceVersions which will break at runtime; remove the
versions.addServiceImage call that registers "mcp" "1.0.0" (or replace its tag
with the actual existing image like "1.0.0-beta3") and update the TODO
accordingly so only real image tags are registered via versions.addServiceImage
and serviceImageTag for the ServiceImages entries.
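For the first item, the empty-host guard might be sketched like this, keeping the registry-detection described above (config access simplified to a plain parameter):

```go
func serviceImageTag(imageRef, repoHost string) string {
	if repoHost == "" {
		return imageRef // no repository host configured; avoid producing "/<image>"
	}
	// If the first path segment already looks like a registry host
	// (contains "." or ":", or is "localhost"), leave the reference alone.
	first := strings.SplitN(imageRef, "/", 2)[0]
	if strings.ContainsAny(first, ".:") || first == "localhost" {
		return imageRef
	}
	return repoHost + "/" + imageRef
}
```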
In `@server/internal/orchestrator/swarm/service_spec_test.go`:
- Around line 326-332: The test's checkEnv closure is comparing full "KEY=VALUE"
strings against "PGUSER"/"PGPASSWORD" so it never matches; update checkEnv (the
func(t *testing.T, env []string) closure) to parse each env string (e.g., use
strings.SplitN(e, "=", 2) or strings.HasPrefix(e, "PGUSER=")/"PGPASSWORD=") and
assert on the key portion, and add import "strings" to the test file.
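A sketch of the key-aware assertion; whether a given case expects the credentials to be present or absent depends on the test table, so presence is assumed here:

```go
checkEnv := func(t *testing.T, env []string) {
	t.Helper()
	// Index env entries by key so assertions compare keys, not "KEY=VALUE" strings.
	got := map[string]bool{}
	for _, e := range env {
		if kv := strings.SplitN(e, "=", 2); len(kv) == 2 {
			got[kv[0]] = true
		}
	}
	for _, key := range []string{"PGUSER", "PGPASSWORD"} {
		if !got[key] {
			t.Errorf("missing env var %s", key)
		}
	}
}
```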
In `@server/internal/orchestrator/swarm/service_spec.go`:
- Around line 34-69: ServiceContainerSpec should validate inputs before using
them: check that opts is non-nil and that opts.ServiceSpec and
opts.ServiceImages are non-nil, returning a descriptive error if any are
missing; also validate opts.Port is within TCP range (1-65535) and return an
error for out-of-range values to prevent accidental random port assignment;
update any callers or subsequent logic (e.g., buildServiceEnvVars,
buildServicePortConfig) to assume these invariants so no nil derefs occur and
invalid ports are rejected early (see the sketch after this list).
- Around line 45-59: The networks slice in service_spec.go currently attaches to
a "bridge" network which is invalid for Swarm services; update the networks
[]swarm.NetworkAttachmentConfig to remove the entry with Target: "bridge" and
leave only the overlay attachment using opts.DatabaseNetworkID (refer to the
networks variable and swarm.NetworkAttachmentConfig in the file); ensure
published ports configuration remains unchanged so external accessibility is
handled via the service's published ports rather than a bridge network.
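For the first item, the up-front validation could be sketched as follows; the option and return types are assumed for illustration:

```go
func ServiceContainerSpec(opts *ServiceContainerSpecOptions) (swarm.ServiceSpec, error) {
	switch {
	case opts == nil:
		return swarm.ServiceSpec{}, errors.New("service container spec: nil options")
	case opts.ServiceSpec == nil:
		return swarm.ServiceSpec{}, errors.New("service container spec: nil ServiceSpec")
	case opts.ServiceImages == nil:
		return swarm.ServiceSpec{}, errors.New("service container spec: nil ServiceImages")
	case opts.Port < 1 || opts.Port > 65535:
		return swarm.ServiceSpec{}, fmt.Errorf("service container spec: port %d outside 1-65535", opts.Port)
	}
	// buildServiceEnvVars and buildServicePortConfig can now rely on these
	// invariants without repeating nil and range checks.
	// ...
}
```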
In `@server/internal/orchestrator/swarm/service_user_role.go`:
- Around line 85-181: The Delete method on ServiceUserRole currently swallows
all errors (returning nil) from dbSvc.GetDatabase,
orch.GetInstanceConnectionInfo, certSvc.PostgresUserTLS and the final conn.Exec
call; change it to only treat database.ErrDatabaseNotFound as non-fatal and
return other errors so the controller can retry. Concretely: in
ServiceUserRole.Delete, replace the unconditional logger.Info() + return nil
after dbSvc.GetDatabase with an errors.Is(err, database.ErrDatabaseNotFound)
check (import errors) and only return nil when that specific error; for
orch.GetInstanceConnectionInfo, certSvc.PostgresUserTLS and the DROP ROLE Exec,
stop returning nil on error and instead return the wrapped error
(fmt.Errorf("...: %w", err) or just return err) while keeping the existing logs;
keep the existing fallback behavior for primaryInstanceID but ensure
GetInstanceConnectionInfo errors are propagated unless they match a not-found
sentinel where appropriate.
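The selective handling might be sketched like this; only the database-gone case stays non-fatal:

```go
db, err := dbSvc.GetDatabase(ctx, s.DatabaseID)
if err != nil {
	if errors.Is(err, database.ErrDatabaseNotFound) {
		logger.Info().Msg("database already removed; skipping service role cleanup")
		return nil // nothing left to clean up
	}
	return fmt.Errorf("failed to get database for role cleanup: %w", err) // retryable
}
```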
In `@server/internal/workflows/activities/store_service_instance.go`:
- Around line 33-56: The StoreServiceInstance activity currently dereferences
input and input.ServiceInstance (e.g., input.ServiceInstance.ServiceInstanceID,
DatabaseID, etc.) which will panic if either is nil; add nil-guards at the start
of Activities.StoreServiceInstance to validate input and input.ServiceInstance,
return a clear error (e.g., fmt.Errorf("nil input" / "nil ServiceInstance"))
early, and only enrich the logger with ServiceInstance fields after confirming
the ServiceInstance is non-nil (or include safe defaults when missing) before
calling a.DatabaseService.UpdateServiceInstance.
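The guards could be sketched as follows; the input/output shapes follow the comment above, and the same pattern applies to UpdateServiceInstanceState in the next file:

```go
func (a *Activities) StoreServiceInstance(
	ctx context.Context,
	input *StoreServiceInstanceInput,
) (*StoreServiceInstanceOutput, error) {
	if input == nil {
		return nil, fmt.Errorf("nil input")
	}
	if input.ServiceInstance == nil {
		return nil, fmt.Errorf("nil ServiceInstance")
	}
	// Only enrich the logger once the pointer is known to be safe.
	logger := activity.Logger(ctx).With(
		"service_instance_id", input.ServiceInstance.ServiceInstanceID,
		"database_id", input.ServiceInstance.DatabaseID,
	)
	logger.Debug("storing service instance")
	// ... a.DatabaseService.UpdateServiceInstance(...) as before ...
	return &StoreServiceInstanceOutput{}, nil
}
```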
In `@server/internal/workflows/activities/update_service_instance_state.go`:
- Around line 37-59: The UpdateServiceInstanceState activity currently
dereferences input immediately and will panic if input is nil; add a nil check
at the top of UpdateServiceInstanceState before accessing input fields (use
activity.Logger(ctx) without calling With on input fields until after the nil
check), return a clear error (e.g., fmt.Errorf("nil input")), and log the error;
ensure references to UpdateServiceInstanceStateInput and
UpdateServiceInstanceStateOutput remain unchanged and that the
DatabaseService.UpdateServiceInstanceState call only happens after the nil
check.
In `@server/internal/workflows/update_database.go`:
- Around line 122-153: UpdateDatabase currently unconditionally calls
ProvisionServices (via w.ExecuteProvisionServices with ProvisionServicesInput),
causing non-idempotent side effects; update the call site to first detect
whether the services spec has actually changed since the last successful run
(e.g., fetch the last stored spec for input.Spec.DatabaseID or the task record
and compare to input.Spec.Services) and only invoke ExecuteProvisionServices
when there is a real diff, or alternatively add idempotency guards inside the
ProvisionServices workflow to no-op for already-running instances (skip
resetting state to "creating", avoid ALTER ROLE password rotation when no
credential change, and skip re-creating existing resources). Ensure you
reference UpdateDatabase, provisionServicesInput/ProvisionServicesInput,
w.ExecuteProvisionServices, and logTaskEvent when implementing the check so
repeated UpdateDatabase executions do not trigger destructive service
reprovisioning.
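One way to sketch the guard, using a hypothetical comparison helper; `reflect.DeepEqual` stands in for whatever equality the spec types support:

```go
// servicesSpecChanged reports whether the desired services differ from the
// last successfully applied spec; both arguments use the PR's ServiceSpec type.
func servicesSpecChanged(stored, desired []ServiceSpec) bool {
	return !reflect.DeepEqual(stored, desired)
}
```

UpdateDatabase would fetch the last stored spec for `input.Spec.DatabaseID`, and only call `w.ExecuteProvisionServices` (logging via `logTaskEvent`) when `servicesSpecChanged` returns true.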
🧹 Nitpick comments (9)
server/internal/orchestrator/swarm/service_spec_test.go (1)
372-380: Minor: use promoted field per staticcheck.
`swarm.ServiceSpec` embeds `Annotations`, so you can access `Name` directly.
♻️ Suggested tweak
```diff
-	if got.Annotations.Name != tt.opts.ServiceName {
-		t.Errorf("service name = %v, want %v", got.Annotations.Name, tt.opts.ServiceName)
+	if got.Name != tt.opts.ServiceName {
+		t.Errorf("service name = %v, want %v", got.Name, tt.opts.ServiceName)
 	}
```
server/internal/database/service_instance_store.go (1)
46-50: Consider narrowing update options to mutable fields.
`Update` only applies state/error, but `ServiceInstanceUpdateOptions` also carries identifiers; splitting create vs. update options (or validating immutables) would reduce confusion.
server/internal/database/spec.go (1)
127-141: Shallow clone of `Config` map may not deep-copy nested values.
`maps.Clone(s.Config)` performs a shallow copy. If `Config` contains nested maps or slices as values (which `map[string]any` allows), mutations to those nested structures in the clone would affect the original. Given that MCP config typically contains primitive values (strings for API keys, provider names, etc.), this is likely acceptable. However, if nested structures are expected in the future, consider a deep-copy utility.
server/internal/api/apiv1/validate.go (1)
515-515: Consider whether stricter semver validation is needed.
The current pattern `^\d+\.\d+\.\d+$` accepts basic semver (e.g., `1.0.0`) but rejects valid semver with pre-release or build metadata (e.g., `1.0.0-alpha`, `1.0.0+build`). If the service versioning system will support pre-release versions in the future, consider a more complete semver regex. For now, if only stable versions are expected, this is sufficient.
server/internal/workflows/activities/get_service_instance_status.go (1)
45-48: Consider using `a.Orchestrator` directly instead of `do.Invoke`.
Other activities in this codebase (e.g., `GenerateServiceInstanceResources` at line 45 in that file) access the orchestrator via `a.Orchestrator` directly, since it's already injected into the `Activities` struct. Using `do.Invoke` here adds unnecessary runtime resolution overhead and inconsistency with other activity patterns.
♻️ Suggested refactor
```diff
 func (a *Activities) GetServiceInstanceStatus(
 	ctx context.Context,
 	input *GetServiceInstanceStatusInput,
 ) (*GetServiceInstanceStatusOutput, error) {
 	logger := activity.Logger(ctx).With("service_instance_id", input.ServiceInstanceID)
 	logger.Debug("getting service instance status")

-	orch, err := do.Invoke[database.Orchestrator](a.Injector)
-	if err != nil {
-		return nil, err
-	}
-
-	status, err := orch.GetServiceInstanceStatus(ctx, input.ServiceInstanceID)
+	status, err := a.Orchestrator.GetServiceInstanceStatus(ctx, input.ServiceInstanceID)
 	if err != nil {
 		return nil, fmt.Errorf("failed to get service instance status: %w", err)
 	}

 	return &GetServiceInstanceStatusOutput{Status: status}, nil
 }
```
server/internal/orchestrator/swarm/service_instance.go (1)
93-156: Persist `ServiceName` on Create to keep state self-contained.
Right now `Create` sets `ServiceID` but leaves `ServiceName` unchanged. Setting it from the deployed spec makes Delete/Update robust even if the resource was instantiated without a name (or future refactors change initialization).
🔧 Suggested update
```diff
 res, err := client.ServiceDeploy(ctx, spec.Spec)
 if err != nil {
 	return fmt.Errorf("failed to deploy service instance: %w", err)
 }

 s.ServiceID = res.ServiceID
+s.ServiceName = spec.Spec.Name
```
server/internal/monitor/service_instance_monitor.go (1)
257-271: Consider whether defaulting to healthy is the intended behavior.
When `status.ServiceReady` is nil/false and `status.HealthCheck` is nil, the function returns `true` (healthy). This assumes a running container without explicit health info is healthy. This is likely intentional for services without configured health checks, but it means the monitor will transition "creating" → "running" as soon as a container exists, even if the service isn't actually ready. If this is the desired behavior for Phase 1, a brief comment explaining the rationale would help future maintainers.
server/internal/orchestrator/swarm/service_instance_spec.go (1)
67-113: TODO comment noted for follow-up; implementation is correct.
The `Refresh` method correctly:
- Retrieves the network from state to get the actual `NetworkID`
- Passes all options to `ServiceContainerSpec`
- Assigns the result to `s.Spec`
The TODO on line 76 regarding database host selection alternatives is appropriate for a follow-up discussion.
The TODO on line 76 mentions considering alternatives for database host selection. Would you like me to open an issue to track this discussion item?
server/internal/workflows/provision_services.go (1)
152-199: Omit redundant type declaration on line 157.
Static analysis correctly identifies that the type `int` can be inferred from `5432`.
🔧 Proposed fix
```diff
-	var instancePort int = 5432 // Default Postgres port
+	var instancePort = 5432 // Default Postgres port
```
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/internal/docker/docker.go (1)
395-397: ⚠️ Potential issue | 🟡 Minor — Nil error wrapped in error message.
At line 396, if `service.Spec.Mode.Replicated == nil`, the error returned wraps `err`, but `err` is guaranteed to be `nil` at this point (the previous check at line 392 would have returned). This produces a confusing message like `"WaitForService is only usable for replicated services: %!w(<nil>)"`.
🐛 Proposed fix
```diff
 if service.Spec.Mode.Replicated == nil {
-	return fmt.Errorf("WaitForService is only usable for replicated services: %w", err)
+	return errors.New("WaitForService is only usable for replicated services")
 }
```
🤖 Fix all issues with AI agents
In `@e2e/service_provisioning_test.go`:
- Around line 489-491: The assertion compares host1 (from fixture.HostIDs()[0])
with serviceInstance.HostID but uses different types; make them consistent by
converting host1 to a string before assertion (or vice versa) so both operands
have the same type. Locate host1 (assigned from fixture.HostIDs()) and the
assertion comparing it to serviceInstance.HostID and update the assertion to use
string(host1) (or ensure fixture.HostIDs() returns a string) so the types match.
In `@server/internal/api/apiv1/convert.go`:
- Around line 257-266: The code calls utils.NillablePointerTo on
HealthCheckResult.Message which is a plain string, causing a type mismatch; in
healthCheckResultToAPI replace utils.NillablePointerTo(hc.Message) with the raw
string hc.Message so that api.HealthCheckResult.Message gets a string (use
hc.Message directly in the returned struct and remove the
utils.NillablePointerTo call).
In `@server/internal/monitor/service_instance_monitor.go`:
- Around line 257-271: The function isContainerHealthy in ServiceInstanceMonitor
currently defaults to healthy when both ServiceReady and HealthCheck are
missing; change it to require at least one explicit indicator: return true only
when ServiceReady is non-nil and true or when HealthCheck is non-nil and
HealthCheck.Status == "healthy"; if both fields are nil or neither indicates
health, return false. Update any callers that expect the old default if needed
and reference the ServiceInstanceStatus struct's ServiceReady and HealthCheck
fields when making the change.
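The explicit-indicator version reads roughly like this; the field types follow the comment (`ServiceReady *bool`, `HealthCheck` with a `Status` string):

```go
func isContainerHealthy(status *database.ServiceInstanceStatus) bool {
	if status.ServiceReady != nil && *status.ServiceReady {
		return true
	}
	if status.HealthCheck != nil && status.HealthCheck.Status == "healthy" {
		return true
	}
	// Neither indicator is present or positive: treat as not healthy.
	return false
}
```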
In `@server/internal/orchestrator/swarm/orchestrator.go`:
- Around line 518-572: GetServiceInstanceStatus currently sets ServiceReady to
true unconditionally after a successful inspect; instead determine readiness
from the container state and health info: check inspect.State is non-nil, ensure
inspect.State.Running is true, and if inspect.State.Health is present require
inspect.State.Health.Status == "healthy" (treat absence of Health as okay). Use
that boolean to set database.ServiceInstanceStatus.ServiceReady via
utils.PointerTo(...) and handle nil/false appropriately; update the
GetServiceInstanceStatus function where the ServiceReady field is populated.
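The readiness derivation could be sketched as follows, slotting into GetServiceInstanceStatus and using the standard Docker container-inspect state fields:

```go
// Running is required; a configured health check must also report healthy.
ready := inspect.State != nil && inspect.State.Running
if ready && inspect.State.Health != nil {
	ready = inspect.State.Health.Status == "healthy"
}
status.ServiceReady = utils.PointerTo(ready)
```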
In `@server/internal/orchestrator/swarm/service_instance.go`:
- Around line 182-233: Replace use of s.ServiceName with the Docker-assigned ID
for consistency: in ServiceInstanceResource.Delete, pass s.ServiceID into
docker.ServiceScaleOptions.ServiceID (instead of s.ServiceName) and also call
client.ServiceRemove(ctx, s.ServiceID) so both scaling and removal use the
service ID; ensure s.ServiceID is used in the ServiceScale/ServiceRemove calls
and that it remains populated by Refresh/Create beforehand.
In `@server/internal/orchestrator/swarm/service_user_role.go`:
- Around line 120-140: The code assigns the Patroni node name returned by
GetPrimaryInstanceID to primaryInstanceID, but downstream
GetInstanceConnectionInfo/GetPostgresContainer expect the full InstanceID;
update the loop after calling database.GetPrimaryInstanceID to map that Patroni
node name back to the matching db.Instances entry (compare the Patroni node name
to the instance's Patroni name field or connection info.Name) and set
primaryInstanceID to that instance's InstanceID; if no match is found, continue
the search or fall back to the existing first-instance behavior, ensuring all
uses of primaryInstanceID are the actual InstanceID values.
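The mapping could be sketched like this; the field linking an instance to its Patroni member name is hypothetical (`PatroniName`), and the same fix applies to the create_service_user.go comment below:

```go
var primaryInstanceID string
for _, inst := range db.Instances {
	if inst.PatroniName == patroniMemberName { // hypothetical linking field
		primaryInstanceID = inst.InstanceID
		break
	}
}
if primaryInstanceID == "" && len(db.Instances) > 0 {
	primaryInstanceID = db.Instances[0].InstanceID // existing first-instance fallback
}
```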
🧹 Nitpick comments (9)
server/internal/docker/provide.go (1)
10-11: Consider using `do.Invoke` for consistency with error handling.
Using `do.MustInvoke` will panic if the logger isn't registered in the injector, whereas other providers (e.g., `activities/provide.go`) use `do.Invoke` with explicit error handling. While the logger is a fundamental dependency that should always be present, using `do.Invoke` would maintain consistency and provide a clearer error path.
♻️ Suggested refactor for consistency
```diff
 do.Provide(i, func(i *do.Injector) (*Docker, error) {
-	logger := do.MustInvoke[zerolog.Logger](i)
+	logger, err := do.Invoke[zerolog.Logger](i)
+	if err != nil {
+		return nil, err
+	}

 	cli, err := NewDocker(logger)
```
server/internal/api/apiv1/validate.go (1)
282-283: Resolve the WIP TODO or track it explicitly.
If you want, I can draft the remaining validation rules or open a follow-up issue.
server/internal/database/service_instance.go (1)
51-55: Consider using `*string` for the optional `Message` field.
The API design (`api/apiv1/design/instance.go` lines 218-221) defines `message` as an optional attribute without `g.Required`. The domain type uses `string`, which doesn't distinguish between "no message" and "empty message". This inconsistency is handled in `convert.go` but could be clearer if the domain type matched.
♻️ Suggested change for consistency
```diff
 type HealthCheckResult struct {
 	Status    string    `json:"status"`
-	Message   string    `json:"message,omitempty"`
+	Message   *string   `json:"message,omitempty"`
 	CheckedAt time.Time `json:"checked_at"`
 }
```
server/internal/workflows/activities/get_service_instance_status.go (1)
15-22: Consider removing unused `HostID` from the input struct.
`GetServiceInstanceStatusInput.HostID` is never used in `GetServiceInstanceStatus`. The `hostID` for queue routing is passed as a separate parameter to `ExecuteGetServiceInstanceStatus`. If the field is intended for future use, consider adding a comment; otherwise, remove it to avoid confusion.
server/internal/workflows/activities/create_service_user.go (2)
31-43: Consider adding retries for transient failures.
`MaxAttempts: 1` means no retries on transient failures (network issues, temporary Patroni unavailability). While idempotent user creation is handled by `CreateUserRole`'s conditional check, brief connectivity issues could fail the entire workflow. Consider allowing 2-3 attempts with backoff.
128-148: Primary discovery silently swallows errors.
The loop continues on any error from `GetInstanceConnectionInfo` or `GetPrimaryInstanceID`, falling back to the first instance. If all instances fail to respond, the fallback may execute statements on a replica, which could cause issues depending on the cluster configuration. Consider logging the errors encountered during primary discovery to aid debugging.
Suggested enhancement
```diff
 for _, inst := range db.Instances {
 	connInfo, err := orch.GetInstanceConnectionInfo(ctx, input.DatabaseID, inst.InstanceID)
 	if err != nil {
+		logger.Debug("failed to get connection info for instance", "instance_id", inst.InstanceID, "error", err)
 		continue
 	}

 	patroniClient := patroni.NewClient(connInfo.PatroniURL(), nil)
 	primaryID, err := database.GetPrimaryInstanceID(ctx, patroniClient, 10*time.Second)
 	if err == nil && primaryID != "" {
 		primaryInstanceID = primaryID
 		break
+	} else if err != nil {
+		logger.Debug("failed to determine primary from patroni", "instance_id", inst.InstanceID, "error", err)
 	}
 }
```
server/internal/monitor/service_instance_monitor.go (1)
299-318: Mutating the input parameter may be surprising.
`updateStatus` modifies `status.LastHealthAt` in place. While functional, this side effect on the caller's data could be unexpected. Consider creating a copy or clearly documenting the mutation.
Alternative approach
```diff
 func (m *ServiceInstanceMonitor) updateStatus(
 	ctx context.Context,
 	status *database.ServiceInstanceStatus,
 ) error {
-	// Update last health check time
-	now := time.Now()
-	status.LastHealthAt = utils.PointerTo(now)
+	// Create updated status with current health check time
+	now := time.Now()
+	statusToSave := *status
+	statusToSave.LastHealthAt = utils.PointerTo(now)

 	m.logger.Debug().
 		Str("service_instance_id", m.serviceInstanceID).
 		Str("database_id", m.databaseID).
 		Msg("updating service instance status")

-	err := m.dbSvc.UpdateServiceInstanceStatus(ctx, m.databaseID, m.serviceInstanceID, status)
+	err := m.dbSvc.UpdateServiceInstanceStatus(ctx, m.databaseID, m.serviceInstanceID, &statusToSave)
```
e2e/service_provisioning_test.go (1)
46-59: Remove commented-out code.
Lines 50-51 have a commented-out `Version: "1.0.0"` with `"latest"` used instead. Similar patterns appear throughout the file (lines 184-185, 289-290, 557-558). Either remove the comments or add context explaining why they're preserved.
server/internal/monitor/service_instance_monitor_resource.go (1)
44-50: Avoid hard-coded resource type string in Dependencies.
Consider referencing a shared constant (or moving it to a common package) to prevent drift if the swarm resource type changes.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@e2e/service_provisioning_test.go`:
- Around line 53-57: Replace hardcoded API-key-like strings inside the test
Config map (e.g., the "anthropic_api_key" entry in the Config map in
service_provisioning_test.go) with a non-key placeholder or load from an
environment/test fixture; update any other entries matching "*_api_key" in this
file similarly so tests no longer contain realistic-looking secrets and instead
reference a safe placeholder or os.Getenv-backed fixture variable used by the
test harness.
In `@server/internal/orchestrator/swarm/service_images_test.go`:
- Around line 110-115: The test expectation in service_images_test.go
incorrectly expects a duplicated organization segment; update the test case
(name "image without registry") to expect "ghcr.io/pgedge/postgres-mcp:1.0.0"
instead of "ghcr.io/pgedge/pgedge/postgres-mcp:1.0.0" so it matches the behavior
of serviceImageTag(imageRef, repoHost) which should not duplicate the org when
repoHost already contains the org; change the want value in that test case
accordingly.
In `@server/internal/workflows/activities/create_service_user.go`:
- Around line 128-154: database.GetPrimaryInstanceID returns a Patroni member
name, not the InstanceID expected by GetInstanceConnectionInfo; update the block
that sets primaryInstanceID so that when GetPrimaryInstanceID returns a
non-empty member name you map it to the corresponding db.Instance.InstanceID by
matching the Patroni/member/node name stored on each instance (e.g., instance
metadata or a PatroniName property) before assigning primaryInstanceID; if no
match is found keep the existing fallback (db.Instances[0].InstanceID) and
preserve the existing logging and error handling around
GetInstanceConnectionInfo and the patroni client usage.
In `@server/internal/workflows/activities/get_service_instance_status.go`:
- Around line 38-43: In GetServiceInstanceStatus add a nil-check for the input
pointer at the top of the function (before using input.ServiceInstanceID); if
input is nil, log the problem with the activity logger (e.g., via logger.Error
or activity.Logger(ctx)) and return a clear error (e.g., fmt.Errorf or a
package-level sentinel) instead of proceeding to access fields to avoid a panic;
update callers/tests if they rely on receiving that error.
🧹 Nitpick comments (5)
server/internal/database/service_instance_status_store.go (1)
9-14: Consider adding HostID for consistency.
The `StoredServiceInstanceStatus` lacks a `HostID` field, while the `ServiceInstanceMonitorResource` (from relevant snippets) includes `HostID`. If status queries need to be filtered or grouped by host, this field may be needed. However, if `HostID` is always retrievable via the associated `ServiceInstance`, this is acceptable.
e2e/service_provisioning_test.go (1)
48-52: Prefer a pinned service version for deterministic E2E runs.
Using"latest"can introduce flakes as images change; consider a fixed version used by your fixture.server/internal/workflows/activities/get_service_instance_status.go (1)
15-35: Avoid dual sources of host ID in the activity API.
`HostID` exists on the input but is unused while `ExecuteGetServiceInstanceStatus` takes a separate `hostID`. Consider deriving the queue from `input.HostID` or removing the field to prevent mismatches.
server/internal/workflows/provision_services.go (1)
273-330: Consider more specific error propagation when deployment partially fails.
When `applyEvents` returns an error, the workflow logs a warning and attempts to mark individual service instances as failed. However, the overall workflow continues to Phase 4 and logs completion for all services (lines 373-389) even if some failed. This could produce misleading completion logs.
server/internal/monitor/service_instance_monitor.go (1)
server/internal/monitor/service_instance_monitor.go (1)
299-318: `updateStatus` mutates the input parameter.
Line 305 modifies `status.LastHealthAt` directly, which could have unintended side effects if the caller reuses the status object. Consider making a copy before modification.
🛠️ Suggested fix
```diff
 func (m *ServiceInstanceMonitor) updateStatus(
 	ctx context.Context,
 	status *database.ServiceInstanceStatus,
 ) error {
+	// Create a copy to avoid mutating the caller's object
+	statusCopy := *status
 	// Update last health check time
 	now := time.Now()
-	status.LastHealthAt = utils.PointerTo(now)
+	statusCopy.LastHealthAt = utils.PointerTo(now)

 	m.logger.Debug().
 		Str("service_instance_id", m.serviceInstanceID).
 		Str("database_id", m.databaseID).
 		Msg("updating service instance status")

-	err := m.dbSvc.UpdateServiceInstanceStatus(ctx, m.databaseID, m.serviceInstanceID, status)
+	err := m.dbSvc.UpdateServiceInstanceStatus(ctx, m.databaseID, m.serviceInstanceID, &statusCopy)
 	if err != nil {
 		return fmt.Errorf("failed to update service instance status: %w", err)
 	}

 	return nil
 }
```
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docs/development/service-credentials.md`:
- Around line 44-54: Update the PostgreSQL Compatibility section to clarify that
truncating usernames to 63 characters can create collisions: explicitly state
whether the system prevents collisions (e.g., by appending a deterministic hash
or sequence to the truncated username) or if collisions are possible and what
the mitigation is (e.g., error on create, require shorter IDs, or enforce
uniqueness via additional DB column). Mention the exact constructs from the
doc—"svc_ prefix", "username", "service ID", "host ID", and "63 characters"—so
readers know where truncation is applied and how collisions are handled or
avoided.
In `@server/internal/docker/docker.go`:
- Around line 466-471: The current early-failure check in the service task loop
(vars failed, running, desired, lastFailureMsg) returns an error when failed>0
&& running==0 even if tasks are still in transitional states; update the
condition to also ensure there are no tasks in preparing or pending before
returning an error (e.g., require preparing==0 && pending==0 in addition to
failed>0 && running==0), and preserve the existing lastFailureMsg logic so the
error message remains fmt.Errorf("service tasks failed: %s", lastFailureMsg)
when present.
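The relaxed condition might read like this; the counters come from the surrounding task loop described above:

```go
// Only fail early once no task is running, preparing, or pending.
if failed > 0 && running == 0 && preparing == 0 && pending == 0 {
	if lastFailureMsg != "" {
		return fmt.Errorf("service tasks failed: %s", lastFailureMsg)
	}
	return errors.New("service tasks failed")
}
```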
🧹 Nitpick comments (2)
server/internal/orchestrator/swarm/service_spec_test.go (1)
91-119: DRY env-var assertions into helpers.
These env-var membership checks are repeated across tests; a small helper keeps the tests shorter and more consistent.
♻️ Suggested helper extraction
```diff
+func assertEnvContainsAll(t *testing.T, got, expected []string) {
+	t.Helper()
+	if len(got) != len(expected) {
+		t.Errorf("got %d env vars, want %d", len(got), len(expected))
+	}
+	for _, e := range expected {
+		found := false
+		for _, g := range got {
+			if g == e {
+				found = true
+				break
+			}
+		}
+		if !found {
+			t.Errorf("missing env var: %s", e)
+		}
+	}
+}
+
+func assertEnvNotContainsPrefixes(t *testing.T, got []string, prefixes ...string) {
+	t.Helper()
+	for _, e := range got {
+		for _, p := range prefixes {
+			if strings.HasPrefix(e, p) {
+				t.Errorf("unexpected credential env var: %s", e)
+			}
+		}
+	}
+}
```
Also applies to: 230-247, 279-295, 519-529
server/internal/api/apiv1/validate.go (1)
282-329: Please track the TODO in validateMCPServiceConfig.
If this is intentionally WIP, consider linking it to an issue so it doesn't linger indefinitely.
```go
// Remove service instance monitor
// Note: This is best-effort cleanup. If the monitor doesn't exist or deletion fails,
// we still proceed with service instance deletion. Monitors will be cleaned up on restart.
if monitorSvc, err := do.Invoke[*monitor.Service](rc.Injector); err == nil {
	if err := monitorSvc.DeleteServiceInstanceMonitor(ctx, s.ServiceInstanceID); err != nil {
		if logger, logErr := do.Invoke[zerolog.Logger](rc.Injector); logErr == nil {
			logger.Warn().Err(err).Str("service_instance_id", s.ServiceInstanceID).Msg("failed to delete service instance monitor during cleanup")
		}
		// Continue - not critical for deletion
	}
}
```
This should be managed by the service instance monitor resource. Why is it necessary to do it here?
```go
type ProvisionServicesOutput struct {
}

func (w *Workflows) ExecuteProvisionServices(
```
This shouldn't be split out into a separate workflow. Instead, the service instance resources should be included in the normal database plan. The way that this is currently written, all of the service resources will get deleted and then recreated each time the database is updated.
```go
	}
}

// Find any Postgres instance in this database to get connection details
```
Given my earlier suggestions to use a config file rather than environment variables and to remove this workflow, I suggest that we create a config file resource for the MCP server and do this operation in that resource's Create or Update methods.
```go
// GenerateServiceName creates a Docker Swarm service name for a service instance.
// Format: {service_type}-{database_id}-{service_id}-{host_id}
func GenerateServiceName(serviceType, databaseID, serviceID, hostID string) string {
```
I think this logic belongs in the swarm package. Regarding using the host ID as the suffix, we should be consistent with the way that we name Postgres services and these services. We currently use a hash of the host ID for the postgres service names because it's a little more readable in Cloud where both the database ID and the host ID are UUIDs. We could do something similar to what you did with user names by only using a hash when the host ID exceeds some length limit, but we'd need to be careful about changing the service names in case anyone depends on them. For now, in addition to moving this logic, could you please make it consistent with the Postgres service names?
- Remove container_port from required fields in ServicePort API type
- Sanitize hyphens to underscores in generated Postgres role names
- Switch MustInvoke to Invoke in docker provider for consistency
- Use globally unique service name for container hostname to avoid DNS collisions on shared bridge network
feat: add MCP service instance provisioning
Summary
Adds end-to-end support for provisioning, monitoring, and managing MCP service instances alongside database instances. Services receive dedicated read-only database credentials, run as Docker Swarm services on the database overlay network, and are monitored for health with automatic state transitions.
NOTE: these changes focus on the provisioning and lifecycle of supported services, not (very much) on the specifics of managing the MCP service. Follow-on work is needed to flesh out exactly what control-plane needs to surface for MCP management.
Changes
- domain types
- parallel via resource lifecycle, enriches status post-deploy
- grace period for recently-started instances to avoid false failures during Docker label propagation
- etcd scan in state updates
Testing
Checklist
Notes for Reviewers
PLAT-407 PLAT-409 PLAT-410 PLAT-411 PLAT-412 PLAT-413