Description
What steps did you take and what happened?
When calling clusterctlclient.Client.ApplyUpgrade(upgrade) to upgrade the CAPI core components (whose version is unchanged) together with a CAPI infrastructure provider component (whose version is changed), there is a very low probability that the capi-controller-manager pod is restarted. Both the capi-controller-manager pod log and the previous pod log contain the error "Unable to retrieve Node status":
E0223 18:31:51.557569 1 machineset_controller.go:883] "Unable to retrieve Node status" err="failed to create cluster accessor: failed to get lock for cluster: cluster is locked already" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w-workers-bkzcs" namespace="e2e-mycluster-v1-24-106-sks-upgrade" name="e2e-50a8u5-sks-upgrade-3m3w-workers-bkzcs" reconcileID=b9a3b2d2-00e9-4d0f-97b4-f2448292404d MachineDeployment="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w-workers" Cluster="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w" Machine="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w-workers-bkzcs-75tm4" node=""
This error causes MD.Status.ReadyReplicas to change from 3 to 0; after about 90s it changes back to 3. The reason is that updateStatus() in machineset_controller.go ignores the error returned by getMachineNode() and treats the Node as not ready (see the sketch below). In the meantime, KCP.Status.ReadyReplicas changes from 3 to 2 and back to 3 (after only about 8 seconds); the reason might be that the KCP Reconcile() requeues immediately after hitting the ErrClusterLocked error.
Our code on top of CAPI watches MD.Status.ReadyReplicas, which leads to an issue when MD.Status.ReadyReplicas changes from 3 to 0.
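For illustration, here is a minimal, self-contained sketch of the pattern described above (stand-in types and helpers, not the exact CAPI code): the error from getMachineNode() is only logged and the loop continues, so every replica is counted as not ready for as long as the cluster accessor is unavailable.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for the "cluster is locked already" error seen in the log above.
var errClusterLocked = errors.New("cluster is locked already")

type node struct{ ready bool }

// getMachineNode is a stand-in that fails while the cluster accessor is still being created.
func getMachineNode(machineName string) (*node, error) {
	return nil, fmt.Errorf("failed to create cluster accessor: %w", errClusterLocked)
}

// countReadyReplicas mirrors the shape of the counting loop in updateStatus():
// any error is logged and the replica is silently treated as not ready.
func countReadyReplicas(machines []string) int {
	ready := 0
	for _, m := range machines {
		n, err := getMachineNode(m)
		if err != nil {
			fmt.Printf("Unable to retrieve Node status for %s: %v\n", m, err)
			continue
		}
		if n.ready {
			ready++
		}
	}
	return ready
}

func main() {
	// All three Nodes are actually fine, but ReadyReplicas drops to 0 for as
	// long as the cluster accessor cannot be obtained.
	fmt.Println("readyReplicas:", countReadyReplicas([]string{"m1", "m2", "m3"}))
}
```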
What did you expect to happen?
- MD.Status.ReadyReplicas should not change from 3 to 0 when hitting (at least) the ErrClusterLocked error, and arguably other errors as well, because the Nodes are actually ready.
- KCP.Status.ReadyReplicas should not change either when hitting the ErrClusterLocked error.
Cluster API version
1.5.2
Kubernetes version
1.24.17
Anything else you would like to add?
To avoid the MD.Status.ReadyReplicas change in this case, we can return the error rather than continue at https://github.com/kubernetes-sigs/cluster-api/blob/v1.5.2/internal/controllers/machineset/machineset_controller.go#L882-L884 when the error is ErrClusterLocked (or even return on any error).
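A rough sketch of that change (not verbatim; it assumes remote.ErrClusterLocked is the sentinel error returned by the cluster cache tracker in v1.5.x, and that updateStatus() is able to propagate an error):

```go
node, err := r.getMachineNode(ctx, cluster, machine)
if err != nil && machine.GetDeletionTimestamp().IsZero() {
	if errors.Is(err, remote.ErrClusterLocked) {
		// Transient: the cluster accessor is still being created. Bail out and
		// keep the previously reported ReadyReplicas instead of counting this
		// (healthy) replica as not ready.
		return errors.Wrap(err, "unable to retrieve Node status")
	}
	log.Error(err, "Unable to retrieve Node status", "node", klog.KObj(node))
	continue
}
```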
Label(s) to be applied
/kind bug
Activity
fabriziopandini commented on Feb 26, 2024
/triage needs-discussion
I would like to get @vincepri, @sbueringer, and @JoelSpeed opinions on this one.
Currently, we consider a replica available when we know its Node is ready, and this seems semantically correct to me.
The downside of this formulation is that available can flick whenever the node status changes, or whenever there are connection problems to the workload cluster and we cannot retrieve the node status anymore (like in this use case).
If this is still the behavior we all want, then IMO the behavior of KCP and MD is correct: they should both reduce the number of available replicas based on the info available at a given time.
However, what we can do is
jessehu commented on Feb 27, 2024
Thanks @fabriziopandini. The ErrClusterLocked error should be gone in a short time, so marking the Node as not ready (or the replica as unknown) immediately after hitting ErrClusterLocked might be over-responsive. Also consider that kube-controller-manager only marks a Node as unhealthy after it has been unresponsive for 40s.
jessehu commented on Feb 28, 2024
BTW this could also be impacted by #9810, discussed in #10165 (comment).
JoelSpeed commented on Feb 28, 2024
Yes, I think we may want to take a leaf out of KCM's book here and not immediately flick to the unready state. I would expect in the real world that users monitor things like unready nodes and want to remediate that situation. Going unready early may lead to false-positive alerts.
I think in this case specifically, for ErrClusterLocked, we would want to leave the Nodes in whatever state they were previously in and requeue the request to try again in, say, 20s. Do we currently track when we last observed the Node? We probably also want a timeout on this behaviour: if we haven't seen the Node in x time, then we assume its status is unknown.
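For illustration only, a rough sketch of that idea with made-up names and durations (lastObserved, markNodeStatusUnknown, and the timings are hypothetical, not existing CAPI code or API):

```go
node, err := r.getMachineNode(ctx, cluster, machine)
if err != nil {
	if errors.Is(err, remote.ErrClusterLocked) {
		if time.Since(lastObserved[machine.Name]) < 2*time.Minute {
			// We saw the Node recently: keep the previously reported state
			// and retry shortly instead of flipping to not ready.
			return ctrl.Result{RequeueAfter: 20 * time.Second}, nil
		}
		// Not seen for too long: only now report the Node status as unknown.
		markNodeStatusUnknown(machine)
		return ctrl.Result{RequeueAfter: 20 * time.Second}, nil
	}
	return ctrl.Result{}, err
}
lastObserved[machine.Name] = time.Now()
```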
jessehu commented on Mar 6, 2024
I made a PR to fix this bug with a simple approach (not implementing unknownReplicas). Please kindly take a look. Thanks!
jessehu commented on Mar 6, 2024
/area machineset
fabriziopandini commented on Apr 11, 2024
/priority important-longterm
sbueringer commented on Dec 27, 2024
I wonder if this still happens with v1beta2 conditions / new counter fields
sbueringer commented on Dec 27, 2024
/remove-lifecycle rotten
jessehu commented on Dec 30, 2024
cc @Levi080513 please help take a look!
chrischdi commented on Feb 6, 2025
I tested this scenario on main and was not able to reproduce it directly with the code there.
I was able to reproduce it, though, by adding some latency to the clustercache.
The error "error getting client" gets returned (from here) at https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machineset/machineset_controller.go#L1195
That leads to the Machine being counted as not ready, but only because the clustercache has not yet finished creating the connection.
So this issue is still valid for v1beta1 replicas.
v1beta2
Taking a look at the v1beta2 replica fields: the same happens there, but the reason is a bit different. In this case the Machine is counted as not ready because the Machine's Ready v1beta2 condition flips to Unknown. This is because of the way we handle the "Cluster not connected" error here:
https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machine/machine_controller_status.go#L284-L294
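For illustration only, a self-contained sketch of that effect with hypothetical condition and reason names (the linked machine_controller_status.go is authoritative; nothing below is the real CAPI handling):

```go
package main

import (
	"errors"
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Stand-in for a "cluster not connected" style error from the cluster cache.
var errClusterNotConnected = errors.New("connection to the workload cluster is down")

// setNodeCondition sketches the effect described above: while the connection is
// not established, the Node-related condition goes Unknown, which is what the
// Machine's aggregated Ready v1beta2 condition (and the derived replica
// counters) then reflect.
func setNodeCondition(conditions *[]metav1.Condition, err error) {
	if errors.Is(err, errClusterNotConnected) {
		meta.SetStatusCondition(conditions, metav1.Condition{
			Type:    "NodeReady",
			Status:  metav1.ConditionUnknown,
			Reason:  "ClusterNotConnected",
			Message: "Waiting for a connection to the workload cluster",
		})
	}
}

func main() {
	var conditions []metav1.Condition
	setNodeCondition(&conditions, fmt.Errorf("error getting client: %w", errClusterNotConnected))
	fmt.Printf("NodeReady: %s (%s)\n", conditions[0].Status, conditions[0].Reason)
}
```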
fabriziopandini commented on Feb 7, 2025
@chrischdi thanks for investigating this!
k8s-triage-robot commented on May 8, 2025
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
sbueringer commented on May 9, 2025
/remove-lifecycle stale
I still haven't found time to look at this.