Skip to content

WIP: Update cluster-api provider to use machineTemplate.status.nodeInfo for architecture-aware autoscale from zero #8345

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

aleskandro
Copy link
Member

@aleskandro aleskandro commented Jul 22, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

kubernetes-sigs/cluster-api#11962 introduced the nodeInfo field for MachineTemplates. Providers can reconcile this field in the status subresource to inform the autoscaler about the architecture and operating system that the MachineTemplate's nodes will run.

Previously, we have been implementing this behavior in the cluster autoscaler by leveraging the labels capacity annotation and, as a fallback, default values set in environment variables at cluster-autoscaler deployment time.

With this commit, the cluster autoscaler computes the future architecture of a node with the following priority order:

  • Labels set in existing nodes for not-autoscale-from-zero cases
  • Labels set in the labels capacity annotation of machine template, machine set, and machine deployment.
  • Values in the status.nodeSystemInfo of MachineTemplates
  • Generic/default labels set in the environment of the cluster autoscaler

Which issue(s) this PR fixes:

Fixes #8347

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Cluster API Provider's implementors can now provide the architecture and the operating system of a future node in their machine template at the field status.nodeInfo to inform these values for scaling from zero.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area area/cluster-autoscaler area/provider/cluster-api Issues or PRs related to Cluster API provider and removed do-not-merge/needs-area labels Jul 22, 2025
@k8s-ci-robot k8s-ci-robot requested review from elmiko and hardikdr July 22, 2025 13:00
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 22, 2025
@aleskandro aleskandro force-pushed the arch-aware-node-info branch 5 times, most recently from dd56e8b to 13dea52 Compare July 23, 2025 09:35
@aleskandro aleskandro force-pushed the arch-aware-node-info branch 2 times, most recently from 8ec180c to c3a296a Compare July 23, 2025 12:33
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 23, 2025
@aleskandro aleskandro changed the title WIP: Arch aware node info WIP: Update cluster-api provider to use machineTemplate.status.nodeInfo for architecture-aware autoscale from zero Jul 23, 2025
@aleskandro aleskandro force-pushed the arch-aware-node-info branch from c3a296a to 98c4898 Compare August 6, 2025 11:17
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aleskandro
Once this PR has been reviewed and has the lgtm label, please assign elmiko for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aleskandro aleskandro force-pushed the arch-aware-node-info branch 3 times, most recently from 21705b1 to 9f183df Compare August 6, 2025 14:49
Copy link
Contributor

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on balance this is making sense, i think we need to clean up a couple things.

reducing the number of api server calls is important for the capi provider as we tend to be very noisy from an HTTP perspective.

// InstanceSystemInfo sets the nodeSystemInfo from the infrastructure reference resource.
// If the infrastructure reference resource is not found, returns nil.
func (r unstructuredScalableResource) InstanceSystemInfo() *apiv1.NodeSystemInfo {
infraObj, err := r.readInfrastructureReferenceResource()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something i am concerned about is the notion that we will be calling this function twice for every node template we create. each call to this function results in another call to the api server to get the infrastructure machine template. i think we need to find a way to combine the call required in this function with the call we make in the InstanceCapacity call.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To keep having two separate functions and avoid coupling, I've made a change to the readInfrastructureReferenceResource to cache the infraObj via lazy loading. Wdyt?

@aleskandro aleskandro force-pushed the arch-aware-node-info branch from 9f183df to 9d1c610 Compare August 7, 2025 07:20
…r architecture-aware autoscale from zero

kubernetes-sigs/cluster-api#11962 introduced the nodeInfo field for MachineTemplates. Providers can reconcile this field in the status subresource to inform the autoscaler about the architecture and operating system that the MachineTemplate's nodes will run.

Previously, we have been implementing this behavior in the cluster autoscaler by leveraging the labels capacity annotation and, as a fallback, default values set in environment variables at cluster-autoscaler deployment time.

With this commit, the cluster autoscaler computes the future architecture of a node with the following priority order:

- Labels set in existing nodes for not-autoscale-from-zero cases
- Labels set in the labels capacity annotation of machine template, machine set, and machine deployment.
- Values in the status.nodeSystemInfo of MachineTemplates
- Generic/default labels set in the environment of the cluster autoscaler
@aleskandro aleskandro force-pushed the arch-aware-node-info branch from 9d1c610 to 7e356f2 Compare August 7, 2025 07:24
…eInfo for architecture-aware autoscale from zero

kubernetes-sigs/cluster-api#11962 introduced the nodeInfo field for MachineTemplates. Providers can reconcile this field in the status subresource to inform the autoscaler about the architecture and operating system that the MachineTemplate's nodes will run.

Previously, we have been implementing this behavior in the cluster autoscaler by leveraging the labels capacity annotation and, as a fallback, default values set in environment variables at cluster-autoscaler deployment time.

With this commit, the cluster autoscaler computes the future architecture of a node with the following priority order:

- Labels set in existing nodes for not-autoscale-from-zero cases
- Labels set in the labels capacity annotation of machine template, machine set, and machine deployment.
- Values in the status.nodeSystemInfo of MachineTemplates
- Generic/default labels set in the environment of the cluster autoscaler
@aleskandro aleskandro force-pushed the arch-aware-node-info branch from 7e356f2 to 8b78773 Compare August 7, 2025 07:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cluster-autoscaler area/provider/cluster-api Issues or PRs related to Cluster API provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update cluster-api implementation to use MachineTemplate.status.nodeInfo for inferring architecture information for scale from zero
3 participants