@dcoppa

@dcoppa dcoppa commented Jan 9, 2026

Problem

The ClickHouse operator enters an infinite reconciliation loop when updating StatefulSets that require recreation, leaving some Keeper StatefulSets stuck at 0 replicas indefinitely.

Root Cause

IsStatefulSetReady() in pkg/model/k8s/stateful_set.go incorrectly returns false for StatefulSets with spec.replicas=0 and status.readyReplicas=0. This breaks the operator's own StatefulSet recreation flow:

  1. Deletion phase: doDeleteStatefulSet() scales to 0 replicas for graceful pod termination (line 477 in statefulset-reconciler.go)
  2. Wait phase: Calls WaitHostStatefulSetReady() expecting it to succeed at 0/0 replicas (line 484)
  3. Recreation trigger: updateStatefulSet() checks IsStatefulSetReady() and gets false (line 254)
  4. Infinite loop: Defaults to ErrCRUDRecreate, triggering deletion again instead of allowing the update to proceed
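
To make the failure mode concrete, here is a minimal sketch of the two readiness predicates and the update decision that depends on them. It is not the operator's code: `stsState`, `readyStrict`, `readyAllowZero`, `action` and `chooseAction` are hypothetical names, and the strict predicate is only an assumed model of the behaviour described above.

```go
package main

// stsState is a hypothetical, trimmed-down stand-in for the StatefulSet
// fields that matter here. It is not the operator's or Kubernetes' type.
type stsState struct {
	SpecReplicas  int32 // spec.replicas
	ReadyReplicas int32 // status.readyReplicas
}

// readyStrict models the behaviour described above (an assumption about
// the original check): zero ready replicas is never "ready", even when
// zero replicas were requested.
func readyStrict(s stsState) bool {
	return s.ReadyReplicas > 0 && s.ReadyReplicas == s.SpecReplicas
}

// readyAllowZero treats 0/0 as ready: every requested replica is ready.
func readyAllowZero(s stsState) bool {
	return s.ReadyReplicas == s.SpecReplicas
}

// action is a hypothetical label for the decision taken by the update step.
type action string

const (
	actionUpdate   action = "update"
	actionRecreate action = "recreate"
)

// chooseAction mirrors steps 3 and 4: a StatefulSet that is not ready
// falls back to recreation instead of a plain update.
func chooseAction(ready func(stsState) bool, s stsState) action {
	if ready(s) {
		return actionUpdate
	}
	return actionRecreate
}
```

With a StatefulSet at 0/0, `chooseAction(readyStrict, ...)` keeps returning `actionRecreate`, which is the loop; `chooseAction(readyAllowZero, ...)` returns `actionUpdate` and lets the flow continue.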

Reproduction

Trigger a rolling update that requires StatefulSet recreation, for example by adding a metrics port.

Observed behavior

  1. Operator logs show: "Update StatefulSet switch from Update to Recreate"
  2. StatefulSet remains at replicas: 0, generation: 2
  3. Object version labels flip-flop between two hashes
  4. Pod never gets created
  5. Manual kubectl scale --replicas=1 required to break the loop

Note: This is a race condition - some environments/replicas succeed while others get stuck, depending on reconciliation timing.

$ kubectl -n clickhouse-cluster-subscription get sts
NAME                          READY   AGE
chi-cluster01-cluster01-0-0   1/1     20d
chi-cluster01-cluster01-0-1   1/1     20d
chk-keeper01-keeper01-0-0     1/1     14m
chk-keeper01-keeper01-0-1     1/1     14m
chk-keeper01-keeper01-0-2     0/0     15m

Logs:

1767962546629	2026-01-09T12:42:26.629Z	I0109 12:42:26.629747       1 statefulset-reconciler.go:286] updateStatefulSet():Host:0-2[0/2]:clickhouse-cluster-subscription/keeper01:Update StatefulSet(clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2) switch from Update to Recreate

1767962566760	2026-01-09T12:42:46.760Z	W0109 12:42:46.760543       1 controller-getter.go:39] getPodsIPs():unknown:Pod NO IP address found. Pod: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2-0

1767962581783	2026-01-09T12:43:01.783Z	I0109 12:43:01.781013       1 object-status.go:47] GetObjectStatusFromMetas():unknown:cur and new objects are equal based on object version label. Update of the object is not required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2

1767962581786	2026-01-09T12:43:01.786Z	I0109 12:43:01.786826       1 object-status.go:47] GetObjectStatusFromMetas():unknown:cur and new objects are equal based on object version label. Update of the object is not required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2

1767962617187	2026-01-09T12:43:37.187Z	I0109 12:43:37.187203       1 object-status.go:47] GetObjectStatusFromMetas():unknown:cur and new objects are equal based on object version label. Update of the object is not required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2

1767962617237	2026-01-09T12:43:37.237Z	I0109 12:43:37.237380       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df New: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56

1767962617255	2026-01-09T12:43:37.255Z	I0109 12:43:37.255358       1 object-status.go:47] GetObjectStatusFromMetas():unknown:cur and new objects are equal based on object version label. Update of the object is not required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2

1767962677343	2026-01-09T12:44:37.343Z	I0109 12:44:37.342851       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767962677348	2026-01-09T12:44:37.348Z	I0109 12:44:37.348315       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963515149	2026-01-09T12:58:35.149Z	I0109 12:58:35.148880       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963515154	2026-01-09T12:58:35.154Z	I0109 12:58:35.154470       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963520191	2026-01-09T12:58:40.191Z	I0109 12:58:40.191642       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963520194	2026-01-09T12:58:40.194Z	W0109 12:58:40.194204       1 controller-getter.go:39] getPodsIPs():unknown:Pod NO IP address found. Pod: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2-0

1767963520197	2026-01-09T12:58:40.197Z	I0109 12:58:40.197647       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963520213	2026-01-09T12:58:40.213Z	I0109 12:58:40.212958       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963520215	2026-01-09T12:58:40.215Z	W0109 12:58:40.215336       1 controller-getter.go:39] getPodsIPs():unknown:Pod NO IP address found. Pod: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2-0

1767963520219	2026-01-09T12:58:40.219Z	I0109 12:58:40.219043       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963545401	2026-01-09T12:59:05.401Z	I0109 12:59:05.401229       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963545408	2026-01-09T12:59:05.408Z	I0109 12:59:05.407992       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963882561	2026-01-09T13:04:42.561Z	I0109 13:04:42.561831       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963882704	2026-01-09T13:04:42.704Z	I0109 13:04:42.704061       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963882713	2026-01-09T13:04:42.713Z	I0109 13:04:42.713563       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963882724	2026-01-09T13:04:42.724Z	I0109 13:04:42.722887       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df
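
The flip-flop between the two object version hashes can be spotted mechanically in such logs. The following is a rough sketch, assuming log lines that contain the `Cur: <hash> New: <hash>` suffix shown above; `versionPair`, `extractPairs` and `hasFlipFlop` are hypothetical helper names, not part of the operator.

```go
package main

import "regexp"

// curNewRe matches the "Cur: <hash> New: <hash>" suffix seen in the
// GetObjectStatusFromMetas() log lines above.
var curNewRe = regexp.MustCompile(`Cur: ([0-9a-f]+) New: ([0-9a-f]+)`)

// versionPair holds the current and new object version hashes from one line.
type versionPair struct {
	Cur, New string
}

// extractPairs collects a versionPair from every line that carries one.
func extractPairs(lines []string) []versionPair {
	var pairs []versionPair
	for _, line := range lines {
		if m := curNewRe.FindStringSubmatch(line); m != nil {
			pairs = append(pairs, versionPair{Cur: m[1], New: m[2]})
		}
	}
	return pairs
}

// hasFlipFlop reports whether some transition A->B is later followed by
// the reverse transition B->A, which is the symptom described above.
func hasFlipFlop(pairs []versionPair) bool {
	seen := make(map[versionPair]bool)
	for _, p := range pairs {
		if p.Cur == p.New {
			continue // not a transition
		}
		if seen[versionPair{Cur: p.New, New: p.Cur}] {
			return true
		}
		seen[p] = true
	}
	return false
}
```

Feeding it the 12:43:37 and 12:44:37 lines from the log above yields one pair in each direction, so `hasFlipFlop` returns true.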

Important items to consider before making a Pull Request

Please check items PR complies to:

  • All commits in the PR are squashed. More info
  • The PR is made into dedicated next-release branch, not into master branch [1]. More info
  • The PR is signed. More info

--

[1] If you feel your PR does not affect any Go-code or any testable functionality (for example, PR contains docs only or supplementary materials), PR can be made into master branch, but it has to be confirmed by project's maintainer.

…tion loops

The IsStatefulSetReady() function incorrectly returned false for
StatefulSets with spec.replicas=0 and status.readyReplicas=0, treating
a successfully scaled-to-zero StatefulSet as "not ready".

This caused infinite reconciliation loops during StatefulSet recreation:

1 - doDeleteStatefulSet() scales to 0 replicas for graceful termination
2 - recreateStatefulSet() deletes and creates the StatefulSet
3 - During recreation, the StatefulSet temporarily has 0 replicas
4 - updateStatefulSet() checks IsStatefulSetReady(), gets false
5 - Defaults to ErrCRUDRecreate action, triggering deletion again
6 - Loop repeats indefinitely

The fix recognizes that a StatefulSet with 0/0 replicas is in a valid
ready state, allowing the recreation flow to complete successfully.

This issue manifested during rolling updates when manifest changes
triggered StatefulSet recreation, and only affected some replicas due
to timing/race conditions in the reconciliation loop.

Signed-off-by: David Coppa <dcoppa@gmail.com>
@sunsingerus added the "planned for review" label Jan 13, 2026
@dcoppa
Author

dcoppa commented Jan 13, 2026

I'm now using a dedicated function that checks for StatefulSets stuck at 0 replicas and forces them to scale up to the desired replica count:

I0113 13:31:29.678480       1 worker-reconciler-chk.go:186] logSWVersion():Host:0-1[0/1]:clickhouse-cluster-subscription/keeper01:Host software version: 0-1 25.3.8[25.3.8.10041/parsed from the tag: '25.3.8.10041']
I0113 13:31:29.678488       1 worker-reconciler-chk.go:186] logSWVersion():Host:0-2[0/2]:clickhouse-cluster-subscription/keeper01:Host software version: 0-2 25.3.8[25.3.8.10041/parsed from the tag: '25.3.8.10041']
I0113 13:31:29.678494       1 worker-reconciler-chk.go:189] logSWVersion():unknown:CR software versions min=25.3.8[25.3.8.10041/parsed from the tag: '25.3.8.10041'] max=25.3.8[25.3.8.10041/parsed from the tag: '25.3.8.10041']
I0113 13:31:29.680418       1 worker-reconciler-chk.go:147] unknown:
ActionPlan start buildCR ---------------------------------------------:

ActionPlan end buildCR ---------------------------------------------
I0113 13:31:29.680445       1 util.go:87] Host:0-0[0/0]:clickhouse-cluster-subscription/keeper01:StatefulSet stuck at 0 replicas, forcing scale to 1
I0113 13:31:29.697332       1 worker-reconciler-chk.go:81] reconcileCR():unknown:ActionPlan has no actions - abort reconcile
I0113 13:31:29.697363       1 worker-reconciler-chk.go:83] worker-reconciler-chk.go:58:reconcileCR():end:unknown
I0113 13:31:29.698605       1 worker-reconciler-chk.go:57] worker-reconciler-chk.go:57:reconcileCR():start:unknown
I0113 13:31:29.698641       1 worker.go:392] createTemplatedCR():unknown:CR has an ancestor, use it as a base for reconcile. CR: clickhouse-cluster-subscription/keeper01
I0113 13:31:29.700924       1 statefulset-reconciler.go:103] unknown:Have StatefulSet available, try to perform label-based comparison for sts: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-0
I0113 13:31:29.700953       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-0. Cur: 4e89df41e15a7199f154b82a442289a34d33021d New: c6750ea548be16b50eebc16f94f454109a8780de
I0113 13:31:29.702002       1 statefulset-reconciler.go:103] unknown:Have StatefulSet available, try to perform label-based comparison for sts: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-1

With this, I can't reproduce the problem anymore.
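
For readers without the diff at hand, the idea of the dedicated check can be sketched as follows. This is an illustration under assumptions, not the code in this PR: `stsView`, `desiredScale` and `unstick` are hypothetical names, and the `scale` callback stands in for whatever client call actually patches `spec.replicas`.

```go
package main

// stsView is a hypothetical, minimal view of a StatefulSet.
type stsView struct {
	Name         string
	SpecReplicas int32 // spec.replicas as currently stored
}

// desiredScale returns the replica count to apply and whether a scale-up
// is needed. A StatefulSet counts as stuck when it sits at 0 replicas
// while the host is supposed to run (desired > 0).
func desiredScale(cur stsView, desired int32) (int32, bool) {
	if desired > 0 && cur.SpecReplicas == 0 {
		return desired, true
	}
	return cur.SpecReplicas, false
}

// unstick applies desiredScale to every StatefulSet and calls scale for
// the stuck ones. It returns the names it scaled and stops at the first
// error reported by scale.
func unstick(all []stsView, desired int32, scale func(name string, replicas int32) error) ([]string, error) {
	var scaled []string
	for _, s := range all {
		n, stuck := desiredScale(s, desired)
		if !stuck {
			continue
		}
		if err := scale(s.Name, n); err != nil {
			return scaled, err
		}
		scaled = append(scaled, s.Name)
	}
	return scaled, nil
}
```

A host that is intentionally stopped would pass `desired == 0` and is left alone, so only StatefulSets that should be running are forced back up.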

@dcoppa changed the title from "fix: Treat StatefulSet with 0 replicas as ready to prevent reconciliation loops" to "fix: StatefulSet stuck with 0 replicas" Jan 13, 2026
@alex-zaitsev
Member

@dcoppa, is there any easy way to reproduce the original problem?

@dcoppa
Author

dcoppa commented Jan 14, 2026

@dcoppa, is there any easy way to reproduce the original problem?

clickhousekeeperinstallation_before.txt
clickhousekeeperinstallation_after.txt

Basically, I updated my ClickHouseKeeperInstallation manifest by adding the necessary configuration for Prometheus metrics.

Please see the two attached files (before and after).
