K8SPG-957: backup controller must watch VolumeSnapshots only if API is installed#1464
Draft
mayankshah1607 wants to merge 1 commit intomainfrom
Draft
K8SPG-957: backup controller must watch VolumeSnapshots only if API is installed#1464mayankshah1607 wants to merge 1 commit intomainfrom
mayankshah1607 wants to merge 1 commit intomainfrom
Conversation
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
There was a problem hiding this comment.
Pull request overview
This PR prevents the PerconaPGBackup controller from crashing on clusters (e.g., EKS) where the VolumeSnapshot API is not installed by conditionally registering the VolumeSnapshot watch only when the feature gate is enabled and the API is discoverable.
Changes:
- Added a Kubernetes discovery helper (
GroupVersionKindExists) to detect whether a GroupVersion/Kind is available on the API server. - Updated the pgbackup controller setup to only
Owns(VolumeSnapshot)whenfeature.BackupSnapshotsis enabled and the VolumeSnapshot API is installed. - Updated operator wiring to pass
ctxinto the updated pgbackup controllerSetupWithManager.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
percona/k8s/util.go |
Adds API discovery helper to check for GroupVersion/Kind availability. |
percona/controller/pgbackup/controller.go |
Gates the VolumeSnapshot watch behind feature flag + API discovery to avoid cache sync failure. |
cmd/postgres-operator/main.go |
Updates pgbackup controller initialization to pass context into SetupWithManager. |
Comment on lines
+70
to
+76
| if r.DiscoveryClient == nil { | ||
| var err error | ||
| r.DiscoveryClient, err = discovery.NewDiscoveryClientForConfig(mgr.GetConfig()) | ||
| if err != nil { | ||
| return errors.Wrap(err, "failed to create discovery client") | ||
| } | ||
| } |
There was a problem hiding this comment.
- Problem:
DiscoveryClientis always initialized inSetupWithManager, even whenfeature.BackupSnapshotsis disabled (the only current use site). - Why it matters: This adds unnecessary client construction and discovery traffic during startup for the default/disabled case.
- Fix: Only create the discovery client inside the
feature.Enabled(ctx, feature.BackupSnapshots)block (or create a local client there) so it is only initialized when the conditional watch is actually needed.
Comment on lines
+82
to
+91
| // Watch VolumeSnapshots if the feature is enabled and the API is available. | ||
| if feature.Enabled(ctx, feature.BackupSnapshots) { | ||
| gvk := volumesnapshotv1.SchemeGroupVersion.WithKind(pNaming.KindVolumeSnapshot) | ||
| gv := gvk.GroupVersion().String() | ||
| if ok, err := k8s.GroupVersionKindExists(r.DiscoveryClient, gv, gvk.Kind); err != nil { | ||
| return errors.Wrap(err, "check VolumeSnapshot API availability") | ||
| } else if ok { | ||
| b = b.Owns(&volumesnapshotv1.VolumeSnapshot{}) | ||
| } | ||
| } |
There was a problem hiding this comment.
- Problem: The new conditional watch behavior for VolumeSnapshots (to prevent startup crashes when the API is missing) is not covered by tests.
- Why it matters: This is a regression fix for an operator crash path; without a test, it’s easy to reintroduce the unconditional watch and break EKS again.
- Fix: Add a regression test (preferably envtest) that starts a manager without the VolumeSnapshot CRDs installed and asserts
PGBackupReconciler.SetupWithManager(...)succeeds (and/or that it only adds the watch when the feature gate is enabled and the API is present).
Comment on lines
+188
to
+210
| // GroupVersionKindExists checks to see whether a given Kind for a given | ||
| // GroupVersion exists in the Kubernetes API Server. | ||
| func GroupVersionKindExists(dc *discovery.DiscoveryClient, groupVersion, kind string) (bool, error) { | ||
| if dc == nil { | ||
| return false, errors.New("discovery client is nil") | ||
| } | ||
|
|
||
| resourceList, err := dc.ServerResourcesForGroupVersion(groupVersion) | ||
| if err != nil { | ||
| if k8serrors.IsNotFound(err) { | ||
| return false, nil | ||
| } | ||
| return false, err | ||
| } | ||
|
|
||
| for _, resource := range resourceList.APIResources { | ||
| if resource.Kind == kind { | ||
| return true, nil | ||
| } | ||
| } | ||
|
|
||
| return false, nil | ||
| } |
There was a problem hiding this comment.
- Problem:
GroupVersionKindExistsduplicates the existingReconciler.GroupVersionKindExistshelper ininternal/controller/postgrescluster/controller.gowith near-identical discovery logic. - Why it matters: Duplicated helpers tend to drift (different nil/return behavior, error handling, etc.), making future changes harder and increasing the chance of inconsistent behavior.
- Fix: Consolidate this logic into a single shared helper (e.g., keep it in
percona/k8sand have the PostgresCluster reconciler call it), or remove this helper and reuse the existing one where appropriate.
Collaborator
commit: 1ae7957 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
CHANGE DESCRIPTION
Problem:
PG operator crash on EKS.
Cause:
The backup controller tries to watch VolumeSnapshots, but the the API does not come pre-installed in EKS (unlike GKE). As a result, the caches fail to start.
Solution:
The watcher must be enabled only if the API is installed and feature gate enabled.
CHECKLIST
Jira
Needs Doc) and QA (Needs QA)?Tests
Config/Logging/Testability