K8SPG-957: backup controller must watch VolumeSnapshots only if API is installed #1464

Draft
mayankshah1607 wants to merge 1 commit into main from K8SPG-957

Conversation

@mayankshah1607
Member

CHANGE DESCRIPTION

Problem:
The PG operator crashes on EKS.

```
2026-02-26T08:58:31.770Z        ERROR   Could not wait for Cache to sync        {"controller": "perconapgbackup", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGBackup", "source": "kind source: *v1.VolumeSnapshot", "error": "failed to wait for perconapgbackup caches to sync kind source: *v1.VolumeSnapshot: timed out waiting for cache to be synced for Kind *v1.VolumeSnapshot"}
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1771
```

Cause:
The backup controller tries to watch VolumeSnapshots, but the API does not come pre-installed on EKS (unlike GKE). As a result, the caches fail to start.

Solution:
The watcher must be enabled only if the VolumeSnapshot API is installed and the feature gate is enabled.
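The gating decision can be sketched as follows. This is a minimal stdlib simulation, not the operator's actual code: `featureGateEnabled` and the `installedKinds` map are hypothetical stand-ins for the real feature-gate check and API discovery.

```go
package main

import "fmt"

// shouldWatchVolumeSnapshots mirrors the fix: the watch is registered only
// when the feature gate is on AND the API server actually serves the
// VolumeSnapshot kind. Both inputs are stand-ins for the real checks.
func shouldWatchVolumeSnapshots(featureGateEnabled bool, installedKinds map[string][]string) bool {
	if !featureGateEnabled {
		return false
	}
	for _, kind := range installedKinds["snapshot.storage.k8s.io/v1"] {
		if kind == "VolumeSnapshot" {
			return true
		}
	}
	return false
}

func main() {
	gke := map[string][]string{"snapshot.storage.k8s.io/v1": {"VolumeSnapshot", "VolumeSnapshotContent"}}
	eks := map[string][]string{} // snapshot API not pre-installed

	fmt.Println(shouldWatchVolumeSnapshots(true, gke))  // true: gate on, API present
	fmt.Println(shouldWatchVolumeSnapshots(true, eks))  // false: API missing, watch skipped, no crash
	fmt.Println(shouldWatchVolumeSnapshots(false, gke)) // false: gate off
}
```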

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PG version?
  • Does the change support oldest and newest supported Kubernetes version?

Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Copilot AI review requested due to automatic review settings February 27, 2026 05:58

Copilot AI left a comment


Pull request overview

This PR prevents the PerconaPGBackup controller from crashing on clusters (e.g., EKS) where the VolumeSnapshot API is not installed by conditionally registering the VolumeSnapshot watch only when the feature gate is enabled and the API is discoverable.

Changes:

  • Added a Kubernetes discovery helper (GroupVersionKindExists) to detect whether a GroupVersion/Kind is available on the API server.
  • Updated the pgbackup controller setup to only Owns(VolumeSnapshot) when feature.BackupSnapshots is enabled and the VolumeSnapshot API is installed.
  • Updated operator wiring to pass ctx into the updated pgbackup controller SetupWithManager.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| percona/k8s/util.go | Adds API discovery helper to check for GroupVersion/Kind availability. |
| percona/controller/pgbackup/controller.go | Gates the VolumeSnapshot watch behind feature flag + API discovery to avoid cache sync failure. |
| cmd/postgres-operator/main.go | Updates pgbackup controller initialization to pass context into SetupWithManager. |

Comment on lines +70 to +76
```go
if r.DiscoveryClient == nil {
	var err error
	r.DiscoveryClient, err = discovery.NewDiscoveryClientForConfig(mgr.GetConfig())
	if err != nil {
		return errors.Wrap(err, "failed to create discovery client")
	}
}
```

Copilot AI Feb 27, 2026


  1. Problem: DiscoveryClient is always initialized in SetupWithManager, even when feature.BackupSnapshots is disabled (the only current use site).
  2. Why it matters: This adds unnecessary client construction and discovery traffic during startup for the default/disabled case.
  3. Fix: Only create the discovery client inside the feature.Enabled(ctx, feature.BackupSnapshots) block (or create a local client there) so it is only initialized when the conditional watch is actually needed.
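Copilot's suggestion amounts to deferring client construction into the gated branch. A minimal stdlib sketch of that pattern, where `newDiscoveryClient` is a hypothetical stand-in for the real `discovery.NewDiscoveryClientForConfig` call and a counter makes the cost visible:

```go
package main

import "fmt"

var constructions int

// newDiscoveryClient is a hypothetical stand-in for the real discovery
// client constructor; the counter records how often it actually runs.
func newDiscoveryClient() string {
	constructions++
	return "discovery-client"
}

// setup mirrors the suggested fix: the client is built only inside the
// feature-gated branch, so the default (gate off) startup path pays
// no construction or discovery cost.
func setup(featureGateEnabled bool) {
	if featureGateEnabled {
		dc := newDiscoveryClient()
		fmt.Println("watch registered using", dc)
	}
}

func main() {
	setup(false) // default case: no client constructed
	setup(true)  // gated case: client constructed once
	fmt.Println("constructions:", constructions)
}
```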

Comment on lines +82 to +91
```go
// Watch VolumeSnapshots if the feature is enabled and the API is available.
if feature.Enabled(ctx, feature.BackupSnapshots) {
	gvk := volumesnapshotv1.SchemeGroupVersion.WithKind(pNaming.KindVolumeSnapshot)
	gv := gvk.GroupVersion().String()
	if ok, err := k8s.GroupVersionKindExists(r.DiscoveryClient, gv, gvk.Kind); err != nil {
		return errors.Wrap(err, "check VolumeSnapshot API availability")
	} else if ok {
		b = b.Owns(&volumesnapshotv1.VolumeSnapshot{})
	}
}
```

Copilot AI Feb 27, 2026


  1. Problem: The new conditional watch behavior for VolumeSnapshots (to prevent startup crashes when the API is missing) is not covered by tests.
  2. Why it matters: This is a regression fix for an operator crash path; without a test, it’s easy to reintroduce the unconditional watch and break EKS again.
  3. Fix: Add a regression test (preferably envtest) that starts a manager without the VolumeSnapshot CRDs installed and asserts PGBackupReconciler.SetupWithManager(...) succeeds (and/or that it only adds the watch when the feature gate is enabled and the API is present).

Comment on lines +188 to +210
```go
// GroupVersionKindExists checks to see whether a given Kind for a given
// GroupVersion exists in the Kubernetes API Server.
func GroupVersionKindExists(dc *discovery.DiscoveryClient, groupVersion, kind string) (bool, error) {
	if dc == nil {
		return false, errors.New("discovery client is nil")
	}

	resourceList, err := dc.ServerResourcesForGroupVersion(groupVersion)
	if err != nil {
		if k8serrors.IsNotFound(err) {
			return false, nil
		}
		return false, err
	}

	for _, resource := range resourceList.APIResources {
		if resource.Kind == kind {
			return true, nil
		}
	}

	return false, nil
}
```
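The helper's key semantics, that a missing group version means "not installed" rather than an error, can be exercised with a stdlib mock of the discovery call. Here `fakeServerResources` and `errNotFound` are hypothetical stand-ins for `dc.ServerResourcesForGroupVersion` and the API server's 404 (which the real helper detects via `k8serrors.IsNotFound`):

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the API server's 404 on an uninstalled group.
var errNotFound = errors.New("not found")

// fakeServerResources stands in for dc.ServerResourcesForGroupVersion:
// it maps a groupVersion to the kinds the (mock) API server serves.
func fakeServerResources(installed map[string][]string, groupVersion string) ([]string, error) {
	kinds, ok := installed[groupVersion]
	if !ok {
		return nil, errNotFound
	}
	return kinds, nil
}

// groupVersionKindExists mirrors the helper's semantics: a missing
// groupVersion yields (false, nil), any other error propagates, and
// otherwise the served kinds are scanned for a match.
func groupVersionKindExists(installed map[string][]string, groupVersion, kind string) (bool, error) {
	kinds, err := fakeServerResources(installed, groupVersion)
	if err != nil {
		if errors.Is(err, errNotFound) {
			return false, nil
		}
		return false, err
	}
	for _, k := range kinds {
		if k == kind {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	installed := map[string][]string{
		"snapshot.storage.k8s.io/v1": {"VolumeSnapshot", "VolumeSnapshotContent"},
	}
	ok, err := groupVersionKindExists(installed, "snapshot.storage.k8s.io/v1", "VolumeSnapshot")
	fmt.Println(ok, err) // true <nil>

	ok, err = groupVersionKindExists(map[string][]string{}, "snapshot.storage.k8s.io/v1", "VolumeSnapshot")
	fmt.Println(ok, err) // false <nil>: missing API is not an error
}
```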

Copilot AI Feb 27, 2026


  1. Problem: GroupVersionKindExists duplicates the existing Reconciler.GroupVersionKindExists helper in internal/controller/postgrescluster/controller.go with near-identical discovery logic.
  2. Why it matters: Duplicated helpers tend to drift (different nil/return behavior, error handling, etc.), making future changes harder and increasing the chance of inconsistent behavior.
  3. Fix: Consolidate this logic into a single shared helper (e.g., keep it in percona/k8s and have the PostgresCluster reconciler call it), or remove this helper and reuse the existing one where appropriate.

@JNKPercona
Collaborator

| Test Name | Result | Time |
| --- | --- | --- |
| backup-enable-disable | passed | 00:13:23 |
| builtin-extensions | passed | 00:05:28 |
| cert-manager-tls | passed | 00:05:15 |
| custom-envs | passed | 00:21:37 |
| custom-extensions | passed | 00:13:40 |
| custom-tls | passed | 00:07:42 |
| database-init-sql | passed | 00:05:33 |
| demand-backup | passed | 00:23:26 |
| demand-backup-offline-snapshot | passed | 00:13:53 |
| dynamic-configuration | passed | 00:03:55 |
| finalizers | passed | 00:03:54 |
| init-deploy | passed | 00:02:47 |
| huge-pages | passed | 00:02:57 |
| monitoring | passed | 00:07:08 |
| monitoring-pmm3 | passed | 00:08:01 |
| one-pod | passed | 00:06:24 |
| operator-self-healing | passed | 00:10:08 |
| pg-tde | passed | 00:08:45 |
| pitr | passed | 00:12:43 |
| scaling | passed | 00:05:18 |
| scheduled-backup | passed | 00:24:17 |
| self-healing | passed | 00:08:26 |
| sidecars | passed | 00:03:07 |
| standby-pgbackrest | passed | 00:12:21 |
| standby-streaming | passed | 00:09:22 |
| start-from-backup | passed | 00:10:34 |
| tablespaces | passed | 00:07:39 |
| telemetry-transfer | passed | 00:04:17 |
| upgrade-consistency | passed | 00:05:56 |
| upgrade-minor | passed | 00:05:07 |
| users | passed | 00:04:26 |

| Summary | Value |
| --- | --- |
| Tests Run | 31/31 |
| Job Duration | 01:46:59 |
| Total Test Time | 04:37:44 |

commit: 1ae7957
image: perconalab/percona-postgresql-operator:PR-1464-1ae7957ad

