fix: StatefulSet stuck with 0 replicas #1902

dcoppa · 2026-01-09T17:02:04Z

Problem

The ClickHouse operator enters an infinite reconciliation loop when updating StatefulSets that require recreation, causing some keeper pods to remain stuck with 0 replicas indefinitely.

Root Cause

IsStatefulSetReady() in pkg/model/k8s/stateful_set.go incorrectly returns false for StatefulSets with spec.replicas=0 and status.readyReplicas=0. This breaks the operator's own StatefulSet recreation flow:

Deletion phase: doDeleteStatefulSet() scales to 0 replicas for graceful pod termination (line 477 in statefulset-reconciler.go)
Wait phase: Calls WaitHostStatefulSetReady() expecting it to succeed at 0/0 replicas (line 484)
Recreation trigger: updateStatefulSet() checks IsStatefulSetReady() and gets false (line 254)
Infinite loop: Defaults to ErrCRUDRecreate, triggering deletion again instead of allowing the update to proceed
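
A minimal sketch of the readiness decision described above. It is not the operator's code: `stsView`, `isReadyStrict`, `isReadyAllowZero` and `chooseAction` are hypothetical names, and the struct is a simplified stand-in for the Kubernetes apps/v1 StatefulSet. It only shows how treating 0/0 as "not ready" sends the flow back to recreation, while accepting 0/0 lets the update proceed.

```go
package main

import "fmt"

// stsView is a hypothetical, simplified view of a StatefulSet.
// The real operator works with Kubernetes apps/v1 objects.
type stsView struct {
	SpecReplicas  int32 // spec.replicas
	ReadyReplicas int32 // status.readyReplicas
}

// isReadyStrict models the behavior reported in this PR (an assumption
// about its shape, not the actual source): 0/0 is reported as not ready.
func isReadyStrict(s stsView) bool {
	return s.SpecReplicas > 0 && s.ReadyReplicas == s.SpecReplicas
}

// isReadyAllowZero treats a StatefulSet scaled to 0/0 as a valid ready state.
func isReadyAllowZero(s stsView) bool {
	return s.ReadyReplicas == s.SpecReplicas
}

// chooseAction mirrors the decision described for updateStatefulSet():
// a StatefulSet that is not ready falls back to recreation.
func chooseAction(ready bool) string {
	if ready {
		return "update"
	}
	return "recreate"
}

func main() {
	scaledDown := stsView{SpecReplicas: 0, ReadyReplicas: 0}
	fmt.Println("strict check:    ", chooseAction(isReadyStrict(scaledDown)))
	fmt.Println("allow-zero check:", chooseAction(isReadyAllowZero(scaledDown)))
}
```

With the strict check, the scaled-to-zero StatefulSet is routed to recreation on every pass, which matches the loop described in the steps above.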

Reproduction

Trigger a rolling update that requires StatefulSet recreation (e.g., adding a metrics port)

Observed behavior

Operator logs show: "Update StatefulSet switch from Update to Recreate"
StatefulSet remains at replicas: 0, generation: 2
Object version labels flip-flop between two hashes
Pod never gets created
Manual kubectl scale --replicas=1 required to break the loop

Note: This is a race condition - some environments/replicas succeed while others get stuck, depending on reconciliation timing.

$ kubectl -n clickhouse-cluster-subscription get sts
NAME                          READY   AGE
chi-cluster01-cluster01-0-0   1/1     20d
chi-cluster01-cluster01-0-1   1/1     20d
chk-keeper01-keeper01-0-0     1/1     14m
chk-keeper01-keeper01-0-1     1/1     14m
chk-keeper01-keeper01-0-2     0/0     15m

Logs:

1767962546629	2026-01-09T12:42:26.629Z	I0109 12:42:26.629747       1 statefulset-reconciler.go:286] updateStatefulSet():Host:0-2[0/2]:clickhouse-cluster-subscription/keeper01:Update StatefulSet(clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2) switch from Update to Recreate

1767962566760	2026-01-09T12:42:46.760Z	W0109 12:42:46.760543       1 controller-getter.go:39] getPodsIPs():unknown:Pod NO IP address found. Pod: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2-0

1767962581783	2026-01-09T12:43:01.783Z	I0109 12:43:01.781013       1 object-status.go:47] GetObjectStatusFromMetas():unknown:cur and new objects are equal based on object version label. Update of the object is not required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2

1767962581786	2026-01-09T12:43:01.786Z	I0109 12:43:01.786826       1 object-status.go:47] GetObjectStatusFromMetas():unknown:cur and new objects are equal based on object version label. Update of the object is not required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2

1767962617187	2026-01-09T12:43:37.187Z	I0109 12:43:37.187203       1 object-status.go:47] GetObjectStatusFromMetas():unknown:cur and new objects are equal based on object version label. Update of the object is not required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2

1767962617237	2026-01-09T12:43:37.237Z	I0109 12:43:37.237380       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df New: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56

1767962617255	2026-01-09T12:43:37.255Z	I0109 12:43:37.255358       1 object-status.go:47] GetObjectStatusFromMetas():unknown:cur and new objects are equal based on object version label. Update of the object is not required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2

1767962677343	2026-01-09T12:44:37.343Z	I0109 12:44:37.342851       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767962677348	2026-01-09T12:44:37.348Z	I0109 12:44:37.348315       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963515149	2026-01-09T12:58:35.149Z	I0109 12:58:35.148880       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963515154	2026-01-09T12:58:35.154Z	I0109 12:58:35.154470       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963520191	2026-01-09T12:58:40.191Z	I0109 12:58:40.191642       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963520194	2026-01-09T12:58:40.194Z	W0109 12:58:40.194204       1 controller-getter.go:39] getPodsIPs():unknown:Pod NO IP address found. Pod: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2-0

1767963520197	2026-01-09T12:58:40.197Z	I0109 12:58:40.197647       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963520213	2026-01-09T12:58:40.213Z	I0109 12:58:40.212958       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963520215	2026-01-09T12:58:40.215Z	W0109 12:58:40.215336       1 controller-getter.go:39] getPodsIPs():unknown:Pod NO IP address found. Pod: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2-0

1767963520219	2026-01-09T12:58:40.219Z	I0109 12:58:40.219043       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963545401	2026-01-09T12:59:05.401Z	I0109 12:59:05.401229       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963545408	2026-01-09T12:59:05.408Z	I0109 12:59:05.407992       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963882561	2026-01-09T13:04:42.561Z	I0109 13:04:42.561831       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963882704	2026-01-09T13:04:42.704Z	I0109 13:04:42.704061       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963882713	2026-01-09T13:04:42.713Z	I0109 13:04:42.713563       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df

1767963882724	2026-01-09T13:04:42.724Z	I0109 13:04:42.722887       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-2. Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df
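
For anyone checking their own operator logs for the same symptom, here is a small diagnostic sketch that counts how often consecutive "Cur: ... New: ..." entries swap the two hashes. The helper name `countFlips` and the regular expression are assumptions made for illustration; the sample lines are abbreviated from the log above.

```go
package main

import (
	"fmt"
	"regexp"
)

// curNewRe extracts the two hashes from a "Cur: <hash> New: <hash>" fragment.
var curNewRe = regexp.MustCompile(`Cur: ([0-9a-f]+) New: ([0-9a-f]+)`)

// countFlips returns how many times a matching line has Cur and New
// swapped relative to the previous matching line.
func countFlips(lines []string) int {
	flips := 0
	prevCur, prevNew := "", ""
	for _, line := range lines {
		m := curNewRe.FindStringSubmatch(line)
		if m == nil {
			continue // line carries no Cur/New pair
		}
		if prevCur != "" && m[1] == prevNew && m[2] == prevCur {
			flips++
		}
		prevCur, prevNew = m[1], m[2]
	}
	return flips
}

func main() {
	// Abbreviated sample lines; only the Cur/New fragment matters here.
	lines := []string{
		"object-status.go:54] Cur: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df New: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56",
		"object-status.go:47] cur and new objects are equal based on object version label",
		"object-status.go:54] Cur: d4ab60dea2d12b6af06f8a8bf2cb88c2a9de7a56 New: 9d8985b2144a792c97ea5968afbb6f6f6d28f8df",
	}
	fmt.Println("flips:", countFlips(lines))
}
```

A non-zero count for the same object suggests the version label is alternating between two states rather than converging.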

Important items to consider before making a Pull Request

Please check that the PR complies with the following items:

All commits in the PR are squashed.
The PR is made into a dedicated next-release branch, not into the master branch¹.
The PR is signed.

--

¹ If you feel your PR does not affect any Go-code or any testable functionality (for example, PR contains docs only or supplementary materials), PR can be made into master branch, but it has to be confirmed by project's maintainer.

…tion loops

The IsStatefulSetReady() function incorrectly returned false for StatefulSets with spec.replicas=0 and status.readyReplicas=0, treating a successfully scaled-to-zero StatefulSet as "not ready".

This caused infinite reconciliation loops during StatefulSet recreation:

1 - doDeleteStatefulSet() scales to 0 replicas for graceful termination
2 - recreateStatefulSet() deletes and creates the StatefulSet
3 - During recreation, the StatefulSet temporarily has 0 replicas
4 - updateStatefulSet() checks IsStatefulSetReady(), gets false
5 - Defaults to ErrCRUDRecreate action, triggering deletion again
6 - Loop repeats indefinitely

The fix recognizes that a StatefulSet with 0/0 replicas is in a valid ready state, allowing the recreation flow to complete successfully.

This issue manifested during rolling updates when manifest changes triggered StatefulSet recreation, and only affected some replicas due to timing/race conditions in the reconciliation loop.

Signed-off-by: David Coppa <dcoppa@gmail.com>

dcoppa · 2026-01-13T13:36:20Z

I'm now using a dedicated function that checks for StatefulSets stuck at 0 replicas and forces them to scale up to the desired replica count:

I0113 13:31:29.678480       1 worker-reconciler-chk.go:186] logSWVersion():Host:0-1[0/1]:clickhouse-cluster-subscription/keeper01:Host software version: 0-1 25.3.8[25.3.8.10041/parsed from the tag: '25.3.8.10041']
I0113 13:31:29.678488       1 worker-reconciler-chk.go:186] logSWVersion():Host:0-2[0/2]:clickhouse-cluster-subscription/keeper01:Host software version: 0-2 25.3.8[25.3.8.10041/parsed from the tag: '25.3.8.10041']
I0113 13:31:29.678494       1 worker-reconciler-chk.go:189] logSWVersion():unknown:CR software versions min=25.3.8[25.3.8.10041/parsed from the tag: '25.3.8.10041'] max=25.3.8[25.3.8.10041/parsed from the tag: '25.3.8.10041']
I0113 13:31:29.680418       1 worker-reconciler-chk.go:147] unknown:
ActionPlan start buildCR ---------------------------------------------:

ActionPlan end buildCR ---------------------------------------------
I0113 13:31:29.680445       1 util.go:87] Host:0-0[0/0]:clickhouse-cluster-subscription/keeper01:StatefulSet stuck at 0 replicas, forcing scale to 1
I0113 13:31:29.697332       1 worker-reconciler-chk.go:81] reconcileCR():unknown:ActionPlan has no actions - abort reconcile
I0113 13:31:29.697363       1 worker-reconciler-chk.go:83] worker-reconciler-chk.go:58:reconcileCR():end:unknown
I0113 13:31:29.698605       1 worker-reconciler-chk.go:57] worker-reconciler-chk.go:57:reconcileCR():start:unknown
I0113 13:31:29.698641       1 worker.go:392] createTemplatedCR():unknown:CR has an ancestor, use it as a base for reconcile. CR: clickhouse-cluster-subscription/keeper01
I0113 13:31:29.700924       1 statefulset-reconciler.go:103] unknown:Have StatefulSet available, try to perform label-based comparison for sts: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-0
I0113 13:31:29.700953       1 object-status.go:54] GetObjectStatusFromMetas():unknown:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-0. Cur: 4e89df41e15a7199f154b82a442289a34d33021d New: c6750ea548be16b50eebc16f94f454109a8780de
I0113 13:31:29.702002       1 statefulset-reconciler.go:103] unknown:Have StatefulSet available, try to perform label-based comparison for sts: clickhouse-cluster-subscription/chk-keeper01-keeper01-0-1

With this, I can't reproduce the problem anymore.
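
The function itself is not quoted in this thread, so the following is only a sketch of what such a check might look like. `stsState` and `fixStuckAtZero` are hypothetical names, and the struct is a simplified stand-in for the real StatefulSet object; the message text follows the util.go log line above.

```go
package main

import "fmt"

// stsState is a hypothetical, simplified stand-in for a StatefulSet.
type stsState struct {
	Name         string
	SpecReplicas int32 // current spec.replicas
}

// fixStuckAtZero returns the replica count that should be applied and
// whether a scale-up is needed. A StatefulSet counts as stuck when it
// sits at 0 replicas while the desired count is greater than 0.
func fixStuckAtZero(s stsState, desired int32) (int32, bool) {
	if s.SpecReplicas == 0 && desired > 0 {
		return desired, true
	}
	return s.SpecReplicas, false
}

func main() {
	sts := stsState{Name: "chk-keeper01-keeper01-0-0", SpecReplicas: 0}
	if replicas, changed := fixStuckAtZero(sts, 1); changed {
		fmt.Printf("StatefulSet stuck at 0 replicas, forcing scale to %d\n", replicas)
	}
}
```

In a real controller the returned count would be written back through the Kubernetes API; that step is omitted here.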

alex-zaitsev · 2026-01-14T11:19:40Z

@dcoppa, is there any easy way to reproduce the original problem?

dcoppa · 2026-01-14T12:33:15Z

> @dcoppa, is there any easy way to reproduce the original problem?

clickhousekeeperinstallation_before.txt
clickhousekeeperinstallation_after.txt

Basically, I updated my ClickHouseKeeperInstallation manifest by adding the necessary configuration for Prometheus metrics.

Please see the two attached files (before and after).

sunsingerus added the "planned for review" label Jan 13, 2026

dcoppa added 3 commits January 13, 2026 14:13

Merge branch 'Altinity:0.26.0' into 0.26.0

aacca49

Use a dedicated function to check for StatefulSets stuck at 0 replicas

61c2a17

zap empty line

9fa2eb0

dcoppa changed the title ~~fix: Treat StatefulSet with 0 replicas as ready to prevent reconciliation loops~~ fix: StatefulSet stuck with 0 replicas Jan 13, 2026
