(this issue originated from a discussion at the 2025-07-28 SIG Autoscaling office hours)
Which component are you using?:
/area cluster-autoscaler
What version of the component are you using?:
Component version: 1.33
What k8s version are you using (kubectl version)?:
Server Version: v1.31.2
What environment is this in?:
cluster-api kubemark and aws providers
What did you expect to happen?:
with all the nodes in a node group cordoned, and no scale-from-zero information provided, i expect the autoscaler to use an unschedulable node as a template, as described in this code: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/nodeinfosprovider/mixed_nodeinfos_processor.go#L160-L178
What happened instead?:
the autoscaler did not make new nodes and produced log messages describing that no node could fit the workload.
How to reproduce it (as minimally and precisely as possible):
- create a cluster-api cluster, with one MachineDeployment configured for autoscaling (do not add scale from zero information)
- set the minimum node group size to 1 for the MachineDeployment
- increase replicas to 1 for the MachineDeployment
- cordon the node associated with the one Machine in the MachineDeployment, eg
kubectl cordon <node>
- create a workload that targets nodes from the MachineDeployment (eg using node selectors)
Anything else we need to know?:
it appears that the autoscaler removes unschedulable nodes from the list of nodes to be processed during a scale-up loop. this means that no nodes remain which could be sanitized of taints and the spec.unschedulable field to serve as a template. this may be by design, but we should evaluate whether unschedulable nodes should be removed from the list before processing.
ready nodes collected here, https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L289
using this function, https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L985-L1006
the ready nodes list is passed in to the process function here, https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L356
and would be included from this clause, https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/nodeinfosprovider/mixed_nodeinfos_processor.go#L161-L178
using this function to sanitize the node, https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/nodeinfosprovider/mixed_nodeinfos_processor.go#L183-L195
it seems like we need to determine if, and when, this functionality changed, and then determine if the node list passed to the Process function should be changed.