
CA potential for skipped node template info when a node group contains only non-ready nodes #8380

@elmiko

(this issue originated from a discussion at the 2025-07-28 SIG Autoscaling office hours)

Which component are you using?:

/area cluster-autoscaler

What version of the component are you using?:

Component version: 1.33

What k8s version are you using (kubectl version)?:

Server Version: v1.31.2

What environment is this in?:

cluster-api kubemark and aws providers

What did you expect to happen?:

with all the nodes in a node group cordoned, and no scale-from-zero information provided, i expect the autoscaler to use an unschedulable node as a template, as described in this code: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/nodeinfosprovider/mixed_nodeinfos_processor.go#L160-L178

What happened instead?:

the autoscaler did not create new nodes, and it logged messages saying that no node could fit the workload.

How to reproduce it (as minimally and precisely as possible):

  1. create a cluster-api cluster, with one MachineDeployment configured for autoscaling (do not add scale-from-zero information)
  2. set the minimum node group size to 1 for the MachineDeployment
  3. increase replicas to 1 for the MachineDeployment
  4. cordon the node associated with the one Machine in the MachineDeployment, e.g. kubectl cordon <node>
  5. create a workload that targets nodes from the MachineDeployment (e.g. using node selectors)

Anything else we need to know?:

it appears as though the autoscaler removes unschedulable nodes from the list of nodes to be processed during a scale-up loop. this means that no nodes remain which could be sanitized of taints and the spec.unschedulable field and then used as a template. this may be by design, but we should evaluate whether the unschedulable nodes should be removed from the list before processing.
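
to illustrate the suspected behavior, here is a minimal sketch in plain Go. it is not the autoscaler's code: the `node` type and the `readyAndSchedulable` function are hypothetical stand-ins, and the filter condition (Ready and not cordoned) is an assumption about what the ready-node list contains.

```go
package main

import "fmt"

// node is a hypothetical, stripped-down stand-in for a Kubernetes Node.
type node struct {
	Name          string
	NodeGroup     string // empty when the node belongs to no autoscaled group
	Ready         bool
	Unschedulable bool // mirrors spec.unschedulable, which `kubectl cordon` sets
}

// readyAndSchedulable is an assumed filter: it keeps only nodes that are
// Ready and not cordoned.
func readyAndSchedulable(nodes []node) []node {
	var out []node
	for _, n := range nodes {
		if n.Ready && !n.Unschedulable {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	all := []node{
		// the only node of node group "md-0", cordoned as in the reproduction steps
		{Name: "md-0-abc", NodeGroup: "md-0", Ready: true, Unschedulable: true},
		// a node outside the autoscaled group
		{Name: "cp-0", Ready: true},
	}
	for _, n := range readyAndSchedulable(all) {
		fmt.Println(n.Name) // prints only "cp-0"
	}
}
```

under that assumption, nothing from "md-0" survives the filter, so any later step that only sees the filtered list has no node from that group to build a template from.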

the ready nodes are collected here, https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L289
using this function, https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L985-L1006

the ready nodes list is passed in to the Process function here, https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L356
and an unschedulable node would be picked up by this clause, https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/nodeinfosprovider/mixed_nodeinfos_processor.go#L161-L178
using this function to sanitize the node, https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/nodeinfosprovider/mixed_nodeinfos_processor.go#L183-L195
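
the interaction between the input list and the last-resort clause can be sketched as follows. this is a simplified model, not the processor's implementation: `node`, `taint`, `sanitize`, and `lastResortTemplates` are hypothetical names, and the sanitization shown (clear the cordon flag, drop all taints) is an assumption based on the description in this issue.

```go
package main

import "fmt"

// taint and node are hypothetical, stripped-down stand-ins for the real API types.
type taint struct{ Key, Effect string }

type node struct {
	Name          string
	NodeGroup     string
	Ready         bool
	Unschedulable bool
	Taints        []taint
}

// sanitize returns a copy suitable as a template: the cordon flag is cleared
// and taints are dropped (assumed behavior). The caller's node is not modified.
func sanitize(n node) node {
	n.Unschedulable = false
	n.Taints = nil
	return n
}

// lastResortTemplates fills templates for node groups that still lack one,
// using unready or unschedulable nodes from candidates.
func lastResortTemplates(candidates []node, templates map[string]node) {
	for _, n := range candidates {
		if n.Ready && !n.Unschedulable {
			continue // healthy nodes are assumed to be handled by an earlier pass
		}
		if n.NodeGroup == "" {
			continue // not part of an autoscaled node group
		}
		if _, found := templates[n.NodeGroup]; found {
			continue
		}
		templates[n.NodeGroup] = sanitize(n)
	}
}

func main() {
	cordoned := node{
		Name: "md-0-abc", NodeGroup: "md-0", Ready: true, Unschedulable: true,
		Taints: []taint{{Key: "node.kubernetes.io/unschedulable", Effect: "NoSchedule"}},
	}

	// case 1: the input was already filtered, so the cordoned node is absent.
	fromFiltered := map[string]node{}
	lastResortTemplates(nil, fromFiltered)
	fmt.Println(len(fromFiltered)) // 0

	// case 2: the input still contains the cordoned node.
	fromAll := map[string]node{}
	lastResortTemplates([]node{cordoned}, fromAll)
	fmt.Println(len(fromAll), fromAll["md-0"].Unschedulable) // 1 false
}
```

in this model the last-resort clause can only help when its input still contains the unschedulable node, which is the question this issue raises about the list passed to Process.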

it seems like we need to determine if, and when, this functionality changed, and then decide whether the node list passed to the Process function should be changed.
