Description
What steps did you take and what happened?
When calling clusterctlclient.Client.ApplyUpgrade(upgrade) to upgrade the CAPI core components (whose version is unchanged) together with a CAPI infrastructure provider component (whose version is changed), there is a very low probability that the capi-controller-manager pod is restarted. Both the capi-controller-manager pod log and the previous pod log contain the error "Unable to retrieve Node status":
E0223 18:31:51.557569 1 machineset_controller.go:883] "Unable to retrieve Node status" err="failed to create cluster accessor: failed to get lock for cluster: cluster is locked already" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w-workers-bkzcs" namespace="e2e-mycluster-v1-24-106-sks-upgrade" name="e2e-50a8u5-sks-upgrade-3m3w-workers-bkzcs" reconcileID=b9a3b2d2-00e9-4d0f-97b4-f2448292404d MachineDeployment="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w-workers" Cluster="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w" Machine="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w-workers-bkzcs-75tm4" node=""
This error causes MD.Status.ReadyReplicas to change from 3 to 0; after about 90s it changes back to 3. The reason is that updateStatus() in machineset_controller.go ignores the error returned by getMachineNode() and treats the Node as not ready (see the sketch below). In the meantime, KCP.Status.ReadyReplicas changes from 3 to 2 and back to 3 (after only about 8 seconds); the reason might be that the KCP Reconcile() requeues immediately after hitting the ErrClusterLocked error.
Our code on top of CAPI watches MD.Status.ReadyReplicas, which leads to an issue when MD.Status.ReadyReplicas changes from 3 to 0.
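For illustration, here is a minimal, self-contained sketch of the pattern described above (stand-in types and helpers, not the exact CAPI code): the error from getMachineNode() is only logged and the loop continues, so every replica is counted as not ready for as long as the cluster accessor is unavailable.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for the "cluster is locked already" error seen in the log above.
var errClusterLocked = errors.New("cluster is locked already")

type node struct{ ready bool }

// getMachineNode is a stand-in that fails while the cluster accessor is still being created.
func getMachineNode(machineName string) (*node, error) {
	return nil, fmt.Errorf("failed to create cluster accessor: %w", errClusterLocked)
}

// countReadyReplicas mirrors the shape of the counting loop in updateStatus():
// any error is logged and the replica is silently treated as not ready.
func countReadyReplicas(machines []string) int {
	ready := 0
	for _, m := range machines {
		n, err := getMachineNode(m)
		if err != nil {
			fmt.Printf("Unable to retrieve Node status for %s: %v\n", m, err)
			continue
		}
		if n.ready {
			ready++
		}
	}
	return ready
}

func main() {
	// All three Nodes are actually fine, but ReadyReplicas drops to 0 for as
	// long as the cluster accessor cannot be obtained.
	fmt.Println("readyReplicas:", countReadyReplicas([]string{"m1", "m2", "m3"}))
}
```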
What did you expect to happen?
- MD.Status.ReadyReplicas should not change from 3 to 0 when hitting (at least) the ErrClusterLocked error, and arguably other errors as well, because the Nodes are actually ready.
- KCP.Status.ReadyReplicas should not change either when hitting the ErrClusterLocked error.
Cluster API version
1.5.2
Kubernetes version
1.24.17
Anything else you would like to add?
To avoid the MD.Status.ReadyReplicas change in this case, we can return the error rather than continue at https://github.com/kubernetes-sigs/cluster-api/blob/v1.5.2/internal/controllers/machineset/machineset_controller.go#L882-L884 when the error is ErrClusterLocked (or even return on any error).
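A rough sketch of that change (not verbatim; it assumes remote.ErrClusterLocked is the sentinel error returned by the cluster cache tracker in v1.5.x, and that updateStatus() is able to propagate an error):

```go
node, err := r.getMachineNode(ctx, cluster, machine)
if err != nil && machine.GetDeletionTimestamp().IsZero() {
	if errors.Is(err, remote.ErrClusterLocked) {
		// Transient: the cluster accessor is still being created. Bail out and
		// keep the previously reported ReadyReplicas instead of counting this
		// (healthy) replica as not ready.
		return errors.Wrap(err, "unable to retrieve Node status")
	}
	log.Error(err, "Unable to retrieve Node status", "node", klog.KObj(node))
	continue
}
```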
Label(s) to be applied
/kind bug
Activity
fabriziopandini commented on Feb 26, 2024
/triage needs-discussion
I would like to get @vincepri, @sbueringer, and @JoelSpeed opinions on this one.
Currently, we consider a replica available when we know its Node is ready, and this seems semantically correct to me.
The downside of this formulation is that available can flick whenever the node status changes, or whenever there are connection problems to the workload cluster and we cannot retrieve the node status anymore (like in this use case).
If this is still the behavior we all want, then IMO the behavior of KCP and MD is correct: they should both reduce the number of available replicas based on the info available at a given time.
However, what we can do is
jessehu commented on Feb 27, 2024
Thanks @fabriziopandini. The ErrClusterLocked error should be gone in a short time, so marking the Node as not ready (or the replica as unknown) immediately after hitting ErrClusterLocked might be over-responsive. Also consider that kube-controller-manager only marks a Node as unhealthy after it has been unresponsive for 40s.
jessehu commented on Feb 28, 2024
BTW this could also be impacted by #9810, discussed in #10165 (comment).
JoelSpeed commented on Feb 28, 2024
Yes, I think we may want to take a leaf out of KCM's book here and not immediately flick to the unready state. I would expect in the real world that users monitor things like unready nodes and want to remediate that situation. Going unready early may lead to false-positive alerts.
I think in this case specifically, for ErrClusterLocked, we would want to leave the Nodes in whatever state they were previously in and requeue the request to try again in, say, 20s. Do we currently track when we last observed the Node? We probably also want a timeout on this behaviour: if we haven't seen the Node in x time, then we assume its status is unknown.
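For illustration only, a rough sketch of that idea with made-up names and durations (lastObserved, markNodeStatusUnknown, and the timings are hypothetical, not existing CAPI code or API):

```go
node, err := r.getMachineNode(ctx, cluster, machine)
if err != nil {
	if errors.Is(err, remote.ErrClusterLocked) {
		if time.Since(lastObserved[machine.Name]) < 2*time.Minute {
			// We saw the Node recently: keep the previously reported state
			// and retry shortly instead of flipping to not ready.
			return ctrl.Result{RequeueAfter: 20 * time.Second}, nil
		}
		// Not seen for too long: only now report the Node status as unknown.
		markNodeStatusUnknown(machine)
		return ctrl.Result{RequeueAfter: 20 * time.Second}, nil
	}
	return ctrl.Result{}, err
}
lastObserved[machine.Name] = time.Now()
```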
jessehu commented on Mar 6, 2024
I made a PR to fix this bug with a simple approach (not implementing unknownReplicas). Please kindly take a look. Thanks!
jessehu commented on Mar 6, 2024
/area machineset
fabriziopandini commented on Apr 11, 2024
/priority important-longterm
sbueringer commented on Dec 27, 2024
I wonder if this still happens with v1beta2 conditions / new counter fields
sbueringer commented on Dec 27, 2024
/remove-lifecycle rotten
jessehu commented on Dec 30, 2024
cc @Levi080513 please help take a look!
chrischdi commented on Feb 6, 2025
I tested this scenario on main and was not able to reproduce it directly with the code there.
I was able to reproduce it, though, by adding some latency to the clustercache.
The error "error getting client" gets returned (from here) at https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machineset/machineset_controller.go#L1195
That leads to the Machine being counted as not ready, but only because the clustercache has not yet finished creating the connection.
So this issue is still valid for v1beta1 replicas.
v1beta2
Taking a look at the v1beta2 replica fields: the same happens there, but the reason is a bit different. In this case the Machine is counted as not ready because the Machine's Ready v1beta2 condition flips to Unknown. This is because of the way we handle the "Cluster not connected" error here:
https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machine/machine_controller_status.go#L284-L294
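For illustration only, a self-contained sketch of that effect with hypothetical condition and reason names (the linked machine_controller_status.go is authoritative; nothing below is the real CAPI handling):

```go
package main

import (
	"errors"
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Stand-in for a "cluster not connected" style error from the cluster cache.
var errClusterNotConnected = errors.New("connection to the workload cluster is down")

// setNodeCondition sketches the effect described above: while the connection is
// not established, the Node-related condition goes Unknown, which is what the
// Machine's aggregated Ready v1beta2 condition (and the derived replica
// counters) then reflect.
func setNodeCondition(conditions *[]metav1.Condition, err error) {
	if errors.Is(err, errClusterNotConnected) {
		meta.SetStatusCondition(conditions, metav1.Condition{
			Type:    "NodeReady",
			Status:  metav1.ConditionUnknown,
			Reason:  "ClusterNotConnected",
			Message: "Waiting for a connection to the workload cluster",
		})
	}
}

func main() {
	var conditions []metav1.Condition
	setNodeCondition(&conditions, fmt.Errorf("error getting client: %w", errClusterNotConnected))
	fmt.Printf("NodeReady: %s (%s)\n", conditions[0].Status, conditions[0].Reason)
}
```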
fabriziopandini commented on Feb 7, 2025
@chrischdi thanks for investigating this!
k8s-triage-robot commented on May 8, 2025
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
sbueringer commented on May 9, 2025
/remove-lifecycle stale
I still haven't found time to look at this.