Initial implementation of supported services. #266
base: main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings or resume it with the review commands.
📝 Walkthrough
Adds service provisioning end-to-end: API types/validation, database domain models and etcd stores for service instances/status, Swarm orchestrator resources/specs/images, monitors and monitor stores, workflows/activities for provisioning and credential creation, tests, docs, DI/logging and tooling tweaks.
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 16
🤖 Fix all issues with AI agents
In `@docs/development/service-credentials.md`:
- Around line 200-204: Replace the repeated "Confirm" phrasing with varied
imperatives and fix the ambiguous SQL quoting: change the SQL example token to a
fenced code span or use double quotes around the Markdown inline code so the
inner single quotes remain intact (the SQL to keep is `SELECT * FROM pg_user WHERE usename LIKE 'svc_%';`). Update bullets 1–5 (and likewise 226–228) to use
distinct verbs like "Verify", "Ensure", "Check", "Test" and render the SQL as
clear inline code or a code block to avoid nested single-quote ambiguity.
In `@e2e/service_provisioning_test.go`:
- Around line 83-105: Test wait window is shorter than the monitor timeout
causing flakes; increase the maxWait used in the polling loop (the maxWait
variable in the test that currently is 2 * time.Minute) to match the monitor's
creation timeout (e.g., 5 * time.Minute), keep pollInterval unchanged, and apply
the same change to the sibling recovery test's polling loop so both use the
longer deadline before asserting db.ServiceInstances[0].State == "running"
(references: serviceInstance, db.Refresh, maxWait, pollInterval, deadline).
In `@Makefile`:
- Around line 369-376: The dev-teardown target uses portable-unsupported xargs
-r when piping docker service ls -q into docker service inspect and docker
service rm; replace that pipeline with a guard that first captures or tests the
output of docker service ls -q (or tests that the variable is non-empty) before
invoking docker service inspect or docker service rm so the commands are only
run on non-empty input on both Linux and macOS/BSD; update the dev-teardown
recipe (around the docker service ls -q | xargs -r docker service inspect ...
and the final | xargs -r docker service rm) to check for non-empty output (e.g.,
assign to a var or use test -n) and then call docker service inspect and docker
service rm, preserving the same label-filtering awk logic.
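A minimal sketch of the guard pattern, where `FILTER` stands in for the recipe's existing label-filtering awk stage (the target name and flags mirror the comment above, not the actual recipe):

```make
dev-teardown:
	@ids="$$(docker service ls -q)"; \
	if [ -n "$$ids" ]; then \
		ids="$$(docker service inspect $$ids --format '{{.ID}}' | FILTER)"; \
	fi; \
	if [ -n "$$ids" ]; then \
		docker service rm $$ids; \
	fi
```

Unlike `xargs -r` (a GNU extension), the `test -n` guard behaves identically on Linux and macOS/BSD shells.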
In `@server/internal/api/apiv1/convert.go`:
- Around line 184-215: serviceSpecToAPI's ad-hoc key checks on svc.Config can
miss future secret names; add a centralized helper (e.g., isSensitiveConfigKey)
and use it when building filteredConfig to robustly detect secret keys.
Implement isSensitiveConfigKey to either (A) use a conservative allowlist of
safe config keys and treat everything else as sensitive, or (B) use a compiled
case-insensitive regex that matches common secret patterns (e.g., token, key,
apikey, api_key, secret, password, credential, access[_-]?token) with
word-boundary or delimiter handling; then replace the inline strings.Contains
checks in serviceSpecToAPI to call this helper when deciding to skip a config
entry. Ensure the helper is reused anywhere config redaction is needed so secret
detection is centralized.
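A minimal sketch of option (B); the exact pattern list mirrors the comment above, and the identifier names are illustrative rather than the PR's:

```go
package apiv1

import "regexp"

// sensitiveKeyPattern matches common secret-bearing key names, delimited by
// start/end of string, underscores, or hyphens (case-insensitive).
var sensitiveKeyPattern = regexp.MustCompile(
	`(?i)(^|[_-])(token|key|apikey|api[_-]key|secret|password|credential|access[_-]?token)([_-]|$)`,
)

// isSensitiveConfigKey reports whether a config key likely holds a secret and
// should be omitted when the spec is converted for API responses.
func isSensitiveConfigKey(key string) bool {
	return sensitiveKeyPattern.MatchString(key)
}
```

serviceSpecToAPI would then call `isSensitiveConfigKey(k)` in its filtering loop instead of the inline `strings.Contains` checks.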
In `@server/internal/api/apiv1/validate.go`:
- Around line 296-321: The current type assertion for config["llm_provider"] can
silently skip validation if the value exists but isn't a string; change the
logic so you first check for existence (val, exists := config["llm_provider"])
and if exists but not a string append a newValidationError to errs (use
appendPath(path, mapKeyPath("llm_provider")) and a clear error message about
invalid type), otherwise if it is a string continue the existing provider
validation (validProviders check and the switch for
"anthropic"/"openai"/"ollama"); keep using newValidationError, appendPath,
mapKeyPath and errs for all error cases.
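A sketch of the suggested control flow; `newValidationError`, `appendPath`, and `mapKeyPath` are the PR's existing helpers, with their signatures assumed here:

```go
if val, exists := config["llm_provider"]; exists {
	provider, ok := val.(string)
	if !ok {
		// Value present but wrong type: report it instead of silently skipping.
		errs = append(errs, newValidationError(
			appendPath(path, mapKeyPath("llm_provider")),
			"llm_provider must be a string"))
	} else {
		switch provider {
		case "anthropic", "openai", "ollama":
			// Valid provider; provider-specific validation continues here.
		default:
			errs = append(errs, newValidationError(
				appendPath(path, mapKeyPath("llm_provider")),
				"llm_provider must be one of: anthropic, openai, ollama"))
		}
	}
}
```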
In `@server/internal/database/service.go`:
- Around line 596-611: The SetServiceInstanceState method currently returns a
wrapped generic error when the store GetByKey returns storage.ErrNotFound;
change the error handling in Service.SetServiceInstanceState to detect the
not-found case (using errors.Is(err, storage.ErrNotFound) or equivalent) and
return the package-level ErrServiceInstanceNotFound instead of a generic
fmt.Errorf, then proceed as before for other errors; apply the same pattern to
the other methods in the file that call s.store.ServiceInstance.GetByKey/Exec
(the function around lines 709-736) so they also translate storage.ErrNotFound
into ErrServiceInstanceNotFound.
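The translation could be centralized in a small helper — a sketch, assuming the sentinel names from the comment above:

```go
// translateServiceInstanceErr maps storage-layer not-found errors onto the
// package-level sentinel so callers can use errors.Is against the domain error.
func translateServiceInstanceErr(err error) error {
	if errors.Is(err, storage.ErrNotFound) {
		return ErrServiceInstanceNotFound
	}
	return err
}
```

Both `SetServiceInstanceState` and the method around lines 709-736 would route their `GetByKey`/`Exec` errors through this helper before wrapping.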
In `@server/internal/monitor/service.go`:
- Around line 170-189: CreateServiceInstanceMonitor currently persists with the
hostID parameter which can diverge from s.cfg.HostID used by Start and
DeleteServiceInstanceMonitor; update the method to enforce consistency by either
validating hostID == s.cfg.HostID at the top of CreateServiceInstanceMonitor and
returning an error if they differ, or always persisting using s.cfg.HostID
instead of the hostID parameter (mirroring CreateInstanceMonitor behavior);
reference the CreateServiceInstanceMonitor, Start, DeleteServiceInstanceMonitor
methods and s.cfg.HostID when making the change so stored keys remain queryable
and deletable after restart.
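The validating variant might look roughly like this; the method signature is assumed for illustration:

```go
func (s *Service) CreateServiceInstanceMonitor(ctx context.Context, hostID, serviceInstanceID string) error {
	// Reject mismatched hosts up front so stored keys remain queryable by
	// Start and DeleteServiceInstanceMonitor, which always use s.cfg.HostID.
	if hostID != s.cfg.HostID {
		return fmt.Errorf("host ID %q does not match configured host %q", hostID, s.cfg.HostID)
	}
	// ... persist the monitor keyed by s.cfg.HostID, as before ...
	return nil
}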
In `@server/internal/orchestrator/swarm/service_images.go`:
- Around line 73-88: The function serviceImageTag returns an invalid "/<image>"
when cfg.DockerSwarm.ImageRepositoryHost is empty; update serviceImageTag to
check cfg.DockerSwarm.ImageRepositoryHost and only prepend it when non-empty
(otherwise return imageRef unchanged), ensuring you still keep the existing
registry-detection logic that inspects the first part for "." ":" or "localhost"
before deciding to modify the imageRef (see the sketch after this list).
- Around line 19-35: The code registers a non-existent "mcp" version "1.0.0" in
NewServiceVersions which will break at runtime; remove the
versions.addServiceImage call that registers "mcp" "1.0.0" (or replace its tag
with the actual existing image like "1.0.0-beta3") and update the TODO
accordingly so only real image tags are registered via versions.addServiceImage
and serviceImageTag for the ServiceImages entries.
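For the first item, the empty-host guard might be sketched like this, keeping the registry-detection described above (config access simplified to a plain parameter):

```go
func serviceImageTag(imageRef, repoHost string) string {
	if repoHost == "" {
		return imageRef // no repository host configured; avoid producing "/<image>"
	}
	// If the first path segment already looks like a registry host
	// (contains "." or ":", or is "localhost"), leave the reference alone.
	first := strings.SplitN(imageRef, "/", 2)[0]
	if strings.ContainsAny(first, ".:") || first == "localhost" {
		return imageRef
	}
	return repoHost + "/" + imageRef
}
```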
In `@server/internal/orchestrator/swarm/service_spec_test.go`:
- Around line 326-332: The test's checkEnv closure is comparing full "KEY=VALUE"
strings against "PGUSER"/"PGPASSWORD" so it never matches; update checkEnv (the
func(t *testing.T, env []string) closure) to parse each env string (e.g., use
strings.SplitN(e, "=", 2) or strings.HasPrefix(e, "PGUSER=")/"PGPASSWORD=") and
assert on the key portion, and add import "strings" to the test file.
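A sketch of the key-aware assertion; whether a given case expects the credentials to be present or absent depends on the test table, so presence is assumed here:

```go
checkEnv := func(t *testing.T, env []string) {
	t.Helper()
	// Index env entries by key so assertions compare keys, not "KEY=VALUE" strings.
	got := map[string]bool{}
	for _, e := range env {
		if kv := strings.SplitN(e, "=", 2); len(kv) == 2 {
			got[kv[0]] = true
		}
	}
	for _, key := range []string{"PGUSER", "PGPASSWORD"} {
		if !got[key] {
			t.Errorf("missing env var %s", key)
		}
	}
}
```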
In `@server/internal/orchestrator/swarm/service_spec.go`:
- Around line 34-69: ServiceContainerSpec should validate inputs before using
them: check that opts is non-nil and that opts.ServiceSpec and
opts.ServiceImages are non-nil, returning a descriptive error if any are
missing; also validate opts.Port is within TCP range (1-65535) and return an
error for out-of-range values to prevent accidental random port assignment;
update any callers or subsequent logic (e.g., buildServiceEnvVars,
buildServicePortConfig) to assume these invariants so no nil derefs occur and
invalid ports are rejected early (see the sketch after this list).
- Around line 45-59: The networks slice in service_spec.go currently attaches to
a "bridge" network which is invalid for Swarm services; update the networks
[]swarm.NetworkAttachmentConfig to remove the entry with Target: "bridge" and
leave only the overlay attachment using opts.DatabaseNetworkID (refer to the
networks variable and swarm.NetworkAttachmentConfig in the file); ensure
published ports configuration remains unchanged so external accessibility is
handled via the service's published ports rather than a bridge network.
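For the first item, the up-front validation could be sketched as follows; the option and return types are assumed for illustration:

```go
func ServiceContainerSpec(opts *ServiceContainerSpecOptions) (swarm.ServiceSpec, error) {
	switch {
	case opts == nil:
		return swarm.ServiceSpec{}, errors.New("service container spec: nil options")
	case opts.ServiceSpec == nil:
		return swarm.ServiceSpec{}, errors.New("service container spec: nil ServiceSpec")
	case opts.ServiceImages == nil:
		return swarm.ServiceSpec{}, errors.New("service container spec: nil ServiceImages")
	case opts.Port < 1 || opts.Port > 65535:
		return swarm.ServiceSpec{}, fmt.Errorf("service container spec: port %d outside 1-65535", opts.Port)
	}
	// buildServiceEnvVars and buildServicePortConfig can now rely on these
	// invariants without repeating nil and range checks.
	// ...
}
```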
In `@server/internal/orchestrator/swarm/service_user_role.go`:
- Around line 85-181: The Delete method on ServiceUserRole currently swallows
all errors (returning nil) from dbSvc.GetDatabase,
orch.GetInstanceConnectionInfo, certSvc.PostgresUserTLS and the final conn.Exec
call; change it to only treat database.ErrDatabaseNotFound as non-fatal and
return other errors so the controller can retry. Concretely: in
ServiceUserRole.Delete, replace the unconditional logger.Info() + return nil
after dbSvc.GetDatabase with an errors.Is(err, database.ErrDatabaseNotFound)
check (import errors) and only return nil when that specific error; for
orch.GetInstanceConnectionInfo, certSvc.PostgresUserTLS and the DROP ROLE Exec,
stop returning nil on error and instead return the wrapped error
(fmt.Errorf("...: %w", err) or just return err) while keeping the existing logs;
keep the existing fallback behavior for primaryInstanceID but ensure
GetInstanceConnectionInfo errors are propagated unless they match a not-found
sentinel where appropriate.
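The selective handling might be sketched like this; only the database-gone case stays non-fatal:

```go
db, err := dbSvc.GetDatabase(ctx, s.DatabaseID)
if err != nil {
	if errors.Is(err, database.ErrDatabaseNotFound) {
		logger.Info().Msg("database already removed; skipping service role cleanup")
		return nil // nothing left to clean up
	}
	return fmt.Errorf("failed to get database for role cleanup: %w", err) // retryable
}
```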
In `@server/internal/workflows/activities/store_service_instance.go`:
- Around line 33-56: The StoreServiceInstance activity currently dereferences
input and input.ServiceInstance (e.g., input.ServiceInstance.ServiceInstanceID,
DatabaseID, etc.) which will panic if either is nil; add nil-guards at the start
of Activities.StoreServiceInstance to validate input and input.ServiceInstance,
return a clear error (e.g., fmt.Errorf("nil input" / "nil ServiceInstance"))
early, and only enrich the logger with ServiceInstance fields after confirming
the ServiceInstance is non-nil (or include safe defaults when missing) before
calling a.DatabaseService.UpdateServiceInstance.
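The guards could be sketched as follows; the input/output shapes follow the comment above, and the same pattern applies to UpdateServiceInstanceState in the next file:

```go
func (a *Activities) StoreServiceInstance(
	ctx context.Context,
	input *StoreServiceInstanceInput,
) (*StoreServiceInstanceOutput, error) {
	if input == nil {
		return nil, fmt.Errorf("nil input")
	}
	if input.ServiceInstance == nil {
		return nil, fmt.Errorf("nil ServiceInstance")
	}
	// Only enrich the logger once the pointer is known to be safe.
	logger := activity.Logger(ctx).With(
		"service_instance_id", input.ServiceInstance.ServiceInstanceID,
		"database_id", input.ServiceInstance.DatabaseID,
	)
	logger.Debug("storing service instance")
	// ... a.DatabaseService.UpdateServiceInstance(...) as before ...
	return &StoreServiceInstanceOutput{}, nil
}
```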
In `@server/internal/workflows/activities/update_service_instance_state.go`:
- Around line 37-59: The UpdateServiceInstanceState activity currently
dereferences input immediately and will panic if input is nil; add a nil check
at the top of UpdateServiceInstanceState before accessing input fields (use
activity.Logger(ctx) without calling With on input fields until after the nil
check), return a clear error (e.g., fmt.Errorf("nil input")), and log the error;
ensure references to UpdateServiceInstanceStateInput and
UpdateServiceInstanceStateOutput remain unchanged and that the
DatabaseService.UpdateServiceInstanceState call only happens after the nil
check.
In `@server/internal/workflows/update_database.go`:
- Around line 122-153: UpdateDatabase currently unconditionally calls
ProvisionServices (via w.ExecuteProvisionServices with ProvisionServicesInput),
causing non-idempotent side effects; update the call site to first detect
whether the services spec has actually changed since the last successful run
(e.g., fetch the last stored spec for input.Spec.DatabaseID or the task record
and compare to input.Spec.Services) and only invoke ExecuteProvisionServices
when there is a real diff, or alternatively add idempotency guards inside the
ProvisionServices workflow to no-op for already-running instances (skip
resetting state to "creating", avoid ALTER ROLE password rotation when no
credential change, and skip re-creating existing resources). Ensure you
reference UpdateDatabase, provisionServicesInput/ProvisionServicesInput,
w.ExecuteProvisionServices, and logTaskEvent when implementing the check so
repeated UpdateDatabase executions do not trigger destructive service
reprovisioning.
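One way to sketch the guard, using a hypothetical comparison helper; `reflect.DeepEqual` stands in for whatever equality the spec types support:

```go
// servicesSpecChanged reports whether the desired services differ from the
// last successfully applied spec; both arguments use the PR's ServiceSpec type.
func servicesSpecChanged(stored, desired []ServiceSpec) bool {
	return !reflect.DeepEqual(stored, desired)
}
```

UpdateDatabase would fetch the last stored spec for `input.Spec.DatabaseID`, and only call `w.ExecuteProvisionServices` (logging via `logTaskEvent`) when `servicesSpecChanged` returns true.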
🧹 Nitpick comments (9)
server/internal/orchestrator/swarm/service_spec_test.go (1)
372-380: Minor: use promoted field per staticcheck.
`swarm.ServiceSpec` embeds `Annotations`, so you can access `Name` directly.
♻️ Suggested tweak
```diff
-	if got.Annotations.Name != tt.opts.ServiceName {
-		t.Errorf("service name = %v, want %v", got.Annotations.Name, tt.opts.ServiceName)
+	if got.Name != tt.opts.ServiceName {
+		t.Errorf("service name = %v, want %v", got.Name, tt.opts.ServiceName)
 	}
```
server/internal/database/service_instance_store.go (1)
46-50: Consider narrowing update options to mutable fields.
`Update` only applies state/error, but `ServiceInstanceUpdateOptions` also carries identifiers; splitting create vs. update options (or validating immutables) would reduce confusion.
server/internal/database/spec.go (1)
127-141: Shallow clone of `Config` map may not deep-copy nested values.
`maps.Clone(s.Config)` performs a shallow copy. If `Config` contains nested maps or slices as values (which `map[string]any` allows), mutations to those nested structures in the clone would affect the original. Given that MCP config typically contains primitive values (strings for API keys, provider names, etc.), this is likely acceptable. However, if nested structures are expected in the future, consider a deep-copy utility.
server/internal/api/apiv1/validate.go (1)
515-515: Consider whether stricter semver validation is needed.
The current pattern `^\d+\.\d+\.\d+$` accepts basic semver (e.g., `1.0.0`) but rejects valid semver with pre-release or build metadata (e.g., `1.0.0-alpha`, `1.0.0+build`). If the service versioning system will support pre-release versions in the future, consider a more complete semver regex. For now, if only stable versions are expected, this is sufficient.
server/internal/workflows/activities/get_service_instance_status.go (1)
45-48: Consider using `a.Orchestrator` directly instead of `do.Invoke`.
Other activities in this codebase (e.g., `GenerateServiceInstanceResources` at line 45 in that file) access the orchestrator via `a.Orchestrator` directly, since it's already injected into the `Activities` struct. Using `do.Invoke` here adds unnecessary runtime resolution overhead and inconsistency with other activity patterns.
♻️ Suggested refactor
```diff
 func (a *Activities) GetServiceInstanceStatus(
 	ctx context.Context,
 	input *GetServiceInstanceStatusInput,
 ) (*GetServiceInstanceStatusOutput, error) {
 	logger := activity.Logger(ctx).With("service_instance_id", input.ServiceInstanceID)
 	logger.Debug("getting service instance status")

-	orch, err := do.Invoke[database.Orchestrator](a.Injector)
-	if err != nil {
-		return nil, err
-	}
-
-	status, err := orch.GetServiceInstanceStatus(ctx, input.ServiceInstanceID)
+	status, err := a.Orchestrator.GetServiceInstanceStatus(ctx, input.ServiceInstanceID)
 	if err != nil {
 		return nil, fmt.Errorf("failed to get service instance status: %w", err)
 	}

 	return &GetServiceInstanceStatusOutput{Status: status}, nil
 }
```
server/internal/orchestrator/swarm/service_instance.go (1)
93-156: Persist `ServiceName` on Create to keep state self-contained.
Right now `Create` sets `ServiceID` but leaves `ServiceName` unchanged. Setting it from the deployed spec makes Delete/Update robust even if the resource was instantiated without a name (or future refactors change initialization).
🔧 Suggested update
```diff
 res, err := client.ServiceDeploy(ctx, spec.Spec)
 if err != nil {
 	return fmt.Errorf("failed to deploy service instance: %w", err)
 }

 s.ServiceID = res.ServiceID
+s.ServiceName = spec.Spec.Name
```
server/internal/monitor/service_instance_monitor.go (1)
257-271: Consider whether defaulting to healthy is the intended behavior.
When `status.ServiceReady` is nil/false and `status.HealthCheck` is nil, the function returns `true` (healthy). This assumes a running container without explicit health info is healthy. This is likely intentional for services without configured health checks, but it means the monitor will transition "creating" → "running" as soon as a container exists, even if the service isn't actually ready. If this is the desired behavior for Phase 1, a brief comment explaining the rationale would help future maintainers.
server/internal/orchestrator/swarm/service_instance_spec.go (1)
67-113: TODO comment noted for follow-up; implementation is correct.
The `Refresh` method correctly:
- Retrieves the network from state to get the actual `NetworkID`
- Passes all options to `ServiceContainerSpec`
- Assigns the result to `s.Spec`
The TODO on line 76 regarding database host selection alternatives is appropriate for a follow-up discussion.
The TODO on line 76 mentions considering alternatives for database host selection. Would you like me to open an issue to track this discussion item?
server/internal/workflows/provision_services.go (1)
152-199: Omit redundant type declaration on line 157.
Static analysis correctly identifies that the type `int` can be inferred from `5432`.
🔧 Proposed fix
```diff
-	var instancePort int = 5432 // Default Postgres port
+	var instancePort = 5432 // Default Postgres port
```
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/internal/docker/docker.go (1)
395-397: ⚠️ Potential issue | 🟡 Minor — Nil error wrapped in error message.
At line 396, if `service.Spec.Mode.Replicated == nil`, the error returned wraps `err`, but `err` is guaranteed to be `nil` at this point (the previous check at line 392 would have returned). This produces a confusing message like `"WaitForService is only usable for replicated services: %!w(<nil>)"`.
🐛 Proposed fix
```diff
 if service.Spec.Mode.Replicated == nil {
-	return fmt.Errorf("WaitForService is only usable for replicated services: %w", err)
+	return errors.New("WaitForService is only usable for replicated services")
 }
```
🤖 Fix all issues with AI agents
In `@e2e/service_provisioning_test.go`:
- Around line 489-491: The assertion compares host1 (from fixture.HostIDs()[0])
with serviceInstance.HostID but uses different types; make them consistent by
converting host1 to a string before assertion (or vice versa) so both operands
have the same type. Locate host1 (assigned from fixture.HostIDs()) and the
assertion comparing it to serviceInstance.HostID and update the assertion to use
string(host1) (or ensure fixture.HostIDs() returns a string) so the types match.
In `@server/internal/api/apiv1/convert.go`:
- Around line 257-266: The code calls utils.NillablePointerTo on
HealthCheckResult.Message which is a plain string, causing a type mismatch; in
healthCheckResultToAPI replace utils.NillablePointerTo(hc.Message) with the raw
string hc.Message so that api.HealthCheckResult.Message gets a string (use
hc.Message directly in the returned struct and remove the
utils.NillablePointerTo call).
In `@server/internal/monitor/service_instance_monitor.go`:
- Around line 257-271: The function isContainerHealthy in ServiceInstanceMonitor
currently defaults to healthy when both ServiceReady and HealthCheck are
missing; change it to require at least one explicit indicator: return true only
when ServiceReady is non-nil and true or when HealthCheck is non-nil and
HealthCheck.Status == "healthy"; if both fields are nil or neither indicates
health, return false. Update any callers that expect the old default if needed
and reference the ServiceInstanceStatus struct's ServiceReady and HealthCheck
fields when making the change.
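The explicit-indicator version reads roughly like this; the field types follow the comment (`ServiceReady *bool`, `HealthCheck` with a `Status` string):

```go
func isContainerHealthy(status *database.ServiceInstanceStatus) bool {
	if status.ServiceReady != nil && *status.ServiceReady {
		return true
	}
	if status.HealthCheck != nil && status.HealthCheck.Status == "healthy" {
		return true
	}
	// Neither indicator is present or positive: treat as not healthy.
	return false
}
```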
In `@server/internal/orchestrator/swarm/orchestrator.go`:
- Around line 518-572: GetServiceInstanceStatus currently sets ServiceReady to
true unconditionally after a successful inspect; instead determine readiness
from the container state and health info: check inspect.State is non-nil, ensure
inspect.State.Running is true, and if inspect.State.Health is present require
inspect.State.Health.Status == "healthy" (treat absence of Health as okay). Use
that boolean to set database.ServiceInstanceStatus.ServiceReady via
utils.PointerTo(...) and handle nil/false appropriately; update the
GetServiceInstanceStatus function where the ServiceReady field is populated.
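The readiness derivation could be sketched as follows, slotting into GetServiceInstanceStatus and using the standard Docker container-inspect state fields:

```go
// Running is required; a configured health check must also report healthy.
ready := inspect.State != nil && inspect.State.Running
if ready && inspect.State.Health != nil {
	ready = inspect.State.Health.Status == "healthy"
}
status.ServiceReady = utils.PointerTo(ready)
```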
In `@server/internal/orchestrator/swarm/service_instance.go`:
- Around line 182-233: Replace use of s.ServiceName with the Docker-assigned ID
for consistency: in ServiceInstanceResource.Delete, pass s.ServiceID into
docker.ServiceScaleOptions.ServiceID (instead of s.ServiceName) and also call
client.ServiceRemove(ctx, s.ServiceID) so both scaling and removal use the
service ID; ensure s.ServiceID is used in the ServiceScale/ServiceRemove calls
and that it remains populated by Refresh/Create beforehand.
In `@server/internal/orchestrator/swarm/service_user_role.go`:
- Around line 120-140: The code assigns the Patroni node name returned by
GetPrimaryInstanceID to primaryInstanceID, but downstream
GetInstanceConnectionInfo/GetPostgresContainer expect the full InstanceID;
update the loop after calling database.GetPrimaryInstanceID to map that Patroni
node name back to the matching db.Instances entry (compare the Patroni node name
to the instance's Patroni name field or connection info.Name) and set
primaryInstanceID to that instance's InstanceID; if no match is found, continue
the search or fall back to the existing first-instance behavior, ensuring all
uses of primaryInstanceID are the actual InstanceID values.
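The mapping could be sketched like this; the field linking an instance to its Patroni member name is hypothetical (`PatroniName`), and the same fix applies to the create_service_user.go comment below:

```go
var primaryInstanceID string
for _, inst := range db.Instances {
	if inst.PatroniName == patroniMemberName { // hypothetical linking field
		primaryInstanceID = inst.InstanceID
		break
	}
}
if primaryInstanceID == "" && len(db.Instances) > 0 {
	primaryInstanceID = db.Instances[0].InstanceID // existing first-instance fallback
}
```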
🧹 Nitpick comments (9)
server/internal/docker/provide.go (1)
10-11: Consider using `do.Invoke` for consistency with error handling.
Using `do.MustInvoke` will panic if the logger isn't registered in the injector, whereas other providers (e.g., `activities/provide.go`) use `do.Invoke` with explicit error handling. While the logger is a fundamental dependency that should always be present, using `do.Invoke` would maintain consistency and provide a clearer error path.
♻️ Suggested refactor for consistency
```diff
 do.Provide(i, func(i *do.Injector) (*Docker, error) {
-	logger := do.MustInvoke[zerolog.Logger](i)
+	logger, err := do.Invoke[zerolog.Logger](i)
+	if err != nil {
+		return nil, err
+	}

 	cli, err := NewDocker(logger)
```
server/internal/api/apiv1/validate.go (1)
282-283: Resolve the WIP TODO or track it explicitly.
If you want, I can draft the remaining validation rules or open a follow-up issue.
server/internal/database/service_instance.go (1)
51-55: Consider using `*string` for the optional `Message` field.
The API design (`api/apiv1/design/instance.go` lines 218-221) defines `message` as an optional attribute without `g.Required`. The domain type uses `string`, which doesn't distinguish between "no message" and "empty message". This inconsistency is handled in `convert.go` but could be clearer if the domain type matched.
♻️ Suggested change for consistency
```diff
 type HealthCheckResult struct {
 	Status    string    `json:"status"`
-	Message   string    `json:"message,omitempty"`
+	Message   *string   `json:"message,omitempty"`
 	CheckedAt time.Time `json:"checked_at"`
 }
```
server/internal/workflows/activities/get_service_instance_status.go (1)
15-22: Consider removing unused `HostID` from the input struct.
`GetServiceInstanceStatusInput.HostID` is never used in `GetServiceInstanceStatus`. The `hostID` for queue routing is passed as a separate parameter to `ExecuteGetServiceInstanceStatus`. If the field is intended for future use, consider adding a comment; otherwise, remove it to avoid confusion.
server/internal/workflows/activities/create_service_user.go (2)
31-43: Consider adding retries for transient failures.
`MaxAttempts: 1` means no retries on transient failures (network issues, temporary Patroni unavailability). While idempotent user creation is handled by `CreateUserRole`'s conditional check, brief connectivity issues could fail the entire workflow. Consider allowing 2-3 attempts with backoff.
128-148: Primary discovery silently swallows errors.
The loop continues on any error from `GetInstanceConnectionInfo` or `GetPrimaryInstanceID`, falling back to the first instance. If all instances fail to respond, the fallback may execute statements on a replica, which could cause issues depending on the cluster configuration. Consider logging the errors encountered during primary discovery to aid debugging.
Suggested enhancement
```diff
 for _, inst := range db.Instances {
 	connInfo, err := orch.GetInstanceConnectionInfo(ctx, input.DatabaseID, inst.InstanceID)
 	if err != nil {
+		logger.Debug("failed to get connection info for instance", "instance_id", inst.InstanceID, "error", err)
 		continue
 	}

 	patroniClient := patroni.NewClient(connInfo.PatroniURL(), nil)
 	primaryID, err := database.GetPrimaryInstanceID(ctx, patroniClient, 10*time.Second)
 	if err == nil && primaryID != "" {
 		primaryInstanceID = primaryID
 		break
+	} else if err != nil {
+		logger.Debug("failed to determine primary from patroni", "instance_id", inst.InstanceID, "error", err)
 	}
 }
```
server/internal/monitor/service_instance_monitor.go (1)
299-318: Mutating the input parameter may be surprising.
`updateStatus` modifies `status.LastHealthAt` in place. While functional, this side effect on the caller's data could be unexpected. Consider creating a copy or clearly documenting the mutation.
Alternative approach
```diff
 func (m *ServiceInstanceMonitor) updateStatus(
 	ctx context.Context,
 	status *database.ServiceInstanceStatus,
 ) error {
-	// Update last health check time
-	now := time.Now()
-	status.LastHealthAt = utils.PointerTo(now)
+	// Create updated status with current health check time
+	now := time.Now()
+	statusToSave := *status
+	statusToSave.LastHealthAt = utils.PointerTo(now)

 	m.logger.Debug().
 		Str("service_instance_id", m.serviceInstanceID).
 		Str("database_id", m.databaseID).
 		Msg("updating service instance status")

-	err := m.dbSvc.UpdateServiceInstanceStatus(ctx, m.databaseID, m.serviceInstanceID, status)
+	err := m.dbSvc.UpdateServiceInstanceStatus(ctx, m.databaseID, m.serviceInstanceID, &statusToSave)
```
e2e/service_provisioning_test.go (1)
46-59: Remove commented-out code.
Lines 50-51 have a commented-out `Version: "1.0.0"` with `"latest"` used instead. Similar patterns appear throughout the file (lines 184-185, 289-290, 557-558). Either remove the comments or add context explaining why they're preserved.
server/internal/monitor/service_instance_monitor_resource.go (1)
44-50: Avoid hard-coded resource type string in Dependencies.
Consider referencing a shared constant (or moving it to a common package) to prevent drift if the swarm resource type changes.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@e2e/service_provisioning_test.go`:
- Around line 53-57: Replace hardcoded API-key-like strings inside the test
Config map (e.g., the "anthropic_api_key" entry in the Config map in
service_provisioning_test.go) with a non-key placeholder or load from an
environment/test fixture; update any other entries matching "*_api_key" in this
file similarly so tests no longer contain realistic-looking secrets and instead
reference a safe placeholder or os.Getenv-backed fixture variable used by the
test harness.
In `@server/internal/orchestrator/swarm/service_images_test.go`:
- Around line 110-115: The test expectation in service_images_test.go
incorrectly expects a duplicated organization segment; update the test case
(name "image without registry") to expect "ghcr.io/pgedge/postgres-mcp:1.0.0"
instead of "ghcr.io/pgedge/pgedge/postgres-mcp:1.0.0" so it matches the behavior
of serviceImageTag(imageRef, repoHost) which should not duplicate the org when
repoHost already contains the org; change the want value in that test case
accordingly.
In `@server/internal/workflows/activities/create_service_user.go`:
- Around line 128-154: database.GetPrimaryInstanceID returns a Patroni member
name, not the InstanceID expected by GetInstanceConnectionInfo; update the block
that sets primaryInstanceID so that when GetPrimaryInstanceID returns a
non-empty member name you map it to the corresponding db.Instance.InstanceID by
matching the Patroni/member/node name stored on each instance (e.g., instance
metadata or a PatroniName property) before assigning primaryInstanceID; if no
match is found keep the existing fallback (db.Instances[0].InstanceID) and
preserve the existing logging and error handling around
GetInstanceConnectionInfo and the patroni client usage.
In `@server/internal/workflows/activities/get_service_instance_status.go`:
- Around line 38-43: In GetServiceInstanceStatus add a nil-check for the input
pointer at the top of the function (before using input.ServiceInstanceID); if
input is nil, log the problem with the activity logger (e.g., via logger.Error
or activity.Logger(ctx)) and return a clear error (e.g., fmt.Errorf or a
package-level sentinel) instead of proceeding to access fields to avoid a panic;
update callers/tests if they rely on receiving that error.
🧹 Nitpick comments (5)
server/internal/database/service_instance_status_store.go (1)
9-14: Consider adding HostID for consistency.
The `StoredServiceInstanceStatus` lacks a `HostID` field, while the `ServiceInstanceMonitorResource` (from relevant snippets) includes `HostID`. If status queries need to be filtered or grouped by host, this field may be needed. However, if `HostID` is always retrievable via the associated `ServiceInstance`, this is acceptable.
e2e/service_provisioning_test.go (1)
48-52: Prefer a pinned service version for deterministic E2E runs.
Using"latest"can introduce flakes as images change; consider a fixed version used by your fixture.server/internal/workflows/activities/get_service_instance_status.go (1)
15-35: Avoid dual sources of host ID in the activity API.
`HostID` exists on the input but is unused while `ExecuteGetServiceInstanceStatus` takes a separate `hostID`. Consider deriving the queue from `input.HostID` or removing the field to prevent mismatches.
server/internal/workflows/provision_services.go (1)
273-330: Consider more specific error propagation when deployment partially fails.
When `applyEvents` returns an error, the workflow logs a warning and attempts to mark individual service instances as failed. However, the overall workflow continues to Phase 4 and logs completion for all services (lines 373-389) even if some failed. This could produce misleading completion logs.
server/internal/monitor/service_instance_monitor.go (1)
server/internal/monitor/service_instance_monitor.go (1)
299-318: `updateStatus` mutates the input parameter.
Line 305 modifies `status.LastHealthAt` directly, which could have unintended side effects if the caller reuses the status object. Consider making a copy before modification.
🛠️ Suggested fix
```diff
 func (m *ServiceInstanceMonitor) updateStatus(
 	ctx context.Context,
 	status *database.ServiceInstanceStatus,
 ) error {
+	// Create a copy to avoid mutating the caller's object
+	statusCopy := *status
 	// Update last health check time
 	now := time.Now()
-	status.LastHealthAt = utils.PointerTo(now)
+	statusCopy.LastHealthAt = utils.PointerTo(now)

 	m.logger.Debug().
 		Str("service_instance_id", m.serviceInstanceID).
 		Str("database_id", m.databaseID).
 		Msg("updating service instance status")

-	err := m.dbSvc.UpdateServiceInstanceStatus(ctx, m.databaseID, m.serviceInstanceID, status)
+	err := m.dbSvc.UpdateServiceInstanceStatus(ctx, m.databaseID, m.serviceInstanceID, &statusCopy)
 	if err != nil {
 		return fmt.Errorf("failed to update service instance status: %w", err)
 	}

 	return nil
 }
```
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docs/development/service-credentials.md`:
- Around line 44-54: Update the PostgreSQL Compatibility section to clarify that
truncating usernames to 63 characters can create collisions: explicitly state
whether the system prevents collisions (e.g., by appending a deterministic hash
or sequence to the truncated username) or if collisions are possible and what
the mitigation is (e.g., error on create, require shorter IDs, or enforce
uniqueness via additional DB column). Mention the exact constructs from the
doc—"svc_ prefix", "username", "service ID", "host ID", and "63 characters"—so
readers know where truncation is applied and how collisions are handled or
avoided.
In `@server/internal/docker/docker.go`:
- Around line 466-471: The current early-failure check in the service task loop
(vars failed, running, desired, lastFailureMsg) returns an error when failed>0
&& running==0 even if tasks are still in transitional states; update the
condition to also ensure there are no tasks in preparing or pending before
returning an error (e.g., require preparing==0 && pending==0 in addition to
failed>0 && running==0), and preserve the existing lastFailureMsg logic so the
error message remains fmt.Errorf("service tasks failed: %s", lastFailureMsg)
when present.
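The relaxed condition might read like this; the counters come from the surrounding task loop described above:

```go
// Only fail early once no task is running, preparing, or pending.
if failed > 0 && running == 0 && preparing == 0 && pending == 0 {
	if lastFailureMsg != "" {
		return fmt.Errorf("service tasks failed: %s", lastFailureMsg)
	}
	return errors.New("service tasks failed")
}
```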
🧹 Nitpick comments (2)
server/internal/orchestrator/swarm/service_spec_test.go (1)
91-119: DRY env-var assertions into helpers.
These env-var membership checks are repeated across tests; a small helper keeps the tests shorter and more consistent.
♻️ Suggested helper extraction
```diff
+func assertEnvContainsAll(t *testing.T, got, expected []string) {
+	t.Helper()
+	if len(got) != len(expected) {
+		t.Errorf("got %d env vars, want %d", len(got), len(expected))
+	}
+	for _, e := range expected {
+		found := false
+		for _, g := range got {
+			if g == e {
+				found = true
+				break
+			}
+		}
+		if !found {
+			t.Errorf("missing env var: %s", e)
+		}
+	}
+}
+
+func assertEnvNotContainsPrefixes(t *testing.T, got []string, prefixes ...string) {
+	t.Helper()
+	for _, e := range got {
+		for _, p := range prefixes {
+			if strings.HasPrefix(e, p) {
+				t.Errorf("unexpected credential env var: %s", e)
+			}
+		}
+	}
+}
```
Also applies to: 230-247, 279-295, 519-529
server/internal/api/apiv1/validate.go (1)
282-329: Please track the TODO in validateMCPServiceConfig.
If this is intentionally WIP, consider linking it to an issue so it doesn't linger indefinitely.
```go
// Remove service instance monitor
// Note: This is best-effort cleanup. If the monitor doesn't exist or deletion fails,
// we still proceed with service instance deletion. Monitors will be cleaned up on restart.
if monitorSvc, err := do.Invoke[*monitor.Service](rc.Injector); err == nil {
	if err := monitorSvc.DeleteServiceInstanceMonitor(ctx, s.ServiceInstanceID); err != nil {
		if logger, logErr := do.Invoke[zerolog.Logger](rc.Injector); logErr == nil {
			logger.Warn().Err(err).Str("service_instance_id", s.ServiceInstanceID).Msg("failed to delete service instance monitor during cleanup")
		}
		// Continue - not critical for deletion
	}
}
```
This should be managed by the service instance monitor resource. Why is it necessary to do it here?
```go
type ProvisionServicesOutput struct {
}

func (w *Workflows) ExecuteProvisionServices(
```
This shouldn't be split out into a separate workflow. Instead, the service instance resources should be included in the normal database plan. The way that this is currently written, all of the service resources will get deleted and then recreated each time the database is updated.
```go
	}
}

// Find any Postgres instance in this database to get connection details
```
Given my earlier suggestions to use a config file rather than environment variables and to remove this workflow, I suggest that we create a config file resource for the MCP server and do this operation in that resource's Create or Update methods.
```go
// GenerateServiceName creates a Docker Swarm service name for a service instance.
// Format: {service_type}-{database_id}-{service_id}-{host_id}
func GenerateServiceName(serviceType, databaseID, serviceID, hostID string) string {
```
I think this logic belongs in the swarm package. Regarding using the host ID as the suffix, we should be consistent with the way that we name Postgres services and these services. We currently use a hash of the host ID for the postgres service names because it's a little more readable in Cloud where both the database ID and the host ID are UUIDs. We could do something similar to what you did with user names by only using a hash when the host ID exceeds some length limit, but we'd need to be careful about changing the service names in case anyone depends on them. For now, in addition to moving this logic, could you please make it consistent with the Postgres service names?
- Remove container_port from required fields in ServicePort API type
- Sanitize hyphens to underscores in generated Postgres role names
- Switch MustInvoke to Invoke in docker provider for consistency
- Use globally unique service name for container hostname to avoid DNS collisions on shared bridge network
feat: add MCP service instance provisioning
Summary
Adds end-to-end support for provisioning, monitoring, and managing MCP service instances alongside database instances. Services receive dedicated read-only database credentials, run as Docker Swarm services on the database overlay network, and are monitored for health with automatic state transitions.
NOTE: these changes focus on the provisioning and lifecycle of supported services, not (very much) on the specifics of managing the MCP service. Follow-on work is needed to flesh out exactly what control-plane needs to surface for MCP management.
Changes
- domain types
- parallel via resource lifecycle, enriches status post-deploy
- grace period for recently-started instances to avoid false failures during Docker label propagation
- etcd scan in state updates
Testing
Checklist
Notes for Reviewers
PLAT-407 PLAT-409 PLAT-410 PLAT-411 PLAT-412 PLAT-413