Process apiGroup in capi provider #8410


Open

wjunott wants to merge 3 commits into master from consume-capi-v1beta2-from-zero

Conversation

@wjunott wjunott commented Aug 6, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

With capi v1beta2, a MachineDeployment or MachineSet's infrastructureRef field has changed from ObjectReference to ContractVersionedObjectReference, so the provider needs to handle the difference between apiVersion and apiGroup.
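For illustration, a minimal sketch of the shape difference (hypothetical struct names; see the CAPI v1beta2 types for the authoritative definitions):

package main

import "fmt"

// objectRef mirrors the v1beta1-style ObjectReference shape: the template is
// referenced with a full apiVersion (group/version).
type objectRef struct {
	APIVersion, Kind, Name string
}

// contractVersionedRef mirrors the v1beta2-style ContractVersionedObjectReference
// shape: only the API group is recorded, so a consumer has to resolve a
// concrete version itself (e.g. the group's preferred version from discovery).
type contractVersionedRef struct {
	APIGroup, Kind, Name string
}

func main() {
	oldStyle := objectRef{APIVersion: "infrastructure.cluster.x-k8s.io/v1beta1", Kind: "DockerMachineTemplate", Name: "md-0"}
	newStyle := contractVersionedRef{APIGroup: "infrastructure.cluster.x-k8s.io", Kind: "DockerMachineTemplate", Name: "md-0"}
	fmt.Printf("old: %+v\nnew: %+v\n", oldStyle, newStyle)
}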

Which issue(s) this PR fixes:

Fixes #8330

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Aug 6, 2025

linux-foundation-easycla bot commented Aug 6, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-area labels Aug 6, 2025
@k8s-ci-robot
Contributor

Welcome @wjunott!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 6, 2025
@k8s-ci-robot
Contributor

Hi @wjunott. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added area/cluster-autoscaler area/provider/cluster-api Issues or PRs related to Cluster API provider labels Aug 6, 2025
@k8s-ci-robot k8s-ci-robot requested review from arunmk and hardikdr August 6, 2025 13:21
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed do-not-merge/needs-area labels Aug 6, 2025

apiGroup, ok := infraref["apiGroup"]
if ok {
	if apiversion, err = getAPIGroupPreferredVersion(r.controller.managementDiscoveryClient, apiGroup); err != nil {
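For context, a minimal sketch of what a preferred-version lookup such as getAPIGroupPreferredVersion could do against the discovery client (assumed shape only; the PR's actual helper may differ):

package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
)

// preferredVersionForGroup resolves the apiserver's preferred version for an
// API group via the discovery client.
func preferredVersionForGroup(dc discovery.DiscoveryInterface, apiGroup string) (string, error) {
	groups, err := dc.ServerGroups() // one live call against the apiserver
	if err != nil {
		return "", err
	}
	for _, g := range groups.Groups {
		if g.Name == apiGroup {
			// e.g. "infrastructure.cluster.x-k8s.io/v1beta2"
			return g.PreferredVersion.GroupVersion, nil
		}
	}
	return "", fmt.Errorf("no preferred version found for API group %q", apiGroup)
}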
Member
If I see correctly, this is doing a live call against the apiserver. I'm wondering if one live call for every call of readInfrastructureReferenceResource is too much

Should we use a cache with a TTL to cache the apiGroup => version mapping? (ttl: 1m or 10m?)
(we can use client-go/tools/cache.NewTTLStore for that)
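A rough sketch of that idea, assuming a hypothetical wrapper around cache.NewTTLStore (not part of the PR):

package main

import (
	"time"

	"k8s.io/client-go/tools/cache"
)

// groupVersionEntry is a hypothetical cache entry mapping an API group to
// its preferred version.
type groupVersionEntry struct {
	group   string
	version string
}

// newGroupVersionCache builds a TTL store keyed by API group, so repeated
// lookups within the TTL avoid a discovery round trip.
func newGroupVersionCache(ttl time.Duration) cache.Store {
	return cache.NewTTLStore(func(obj interface{}) (string, error) {
		return obj.(*groupVersionEntry).group, nil
	}, ttl)
}

// cachedPreferredVersion returns the cached version for a group, if present
// and not yet expired.
func cachedPreferredVersion(store cache.Store, group string) (string, bool) {
	item, exists, err := store.GetByKey(group)
	if err != nil || !exists {
		return "", false
	}
	return item.(*groupVersionEntry).version, true
}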

Author

Good point. As far as I can tell, this API is invoked only during scale up/down. @elmiko any advice on where to put the cache?

Member
@sbueringer sbueringer Aug 7, 2025

If it's okay to always do a live call here because this isn't called too often, absolutely fine for me of course (I just don't know :))

Contributor

these calls will only happen when the core autoscaler wants to construct a node template. if the autoscaler has a ready node from the node group, then it will use a node as a template instead of asking the provider to generate a new template (where this function is called).

in the worst case scenario, this function will get called once per node group per scan interval from the autoscaler, which defaults to 10 seconds. in a large cluster this could be called several times for the same template depending on how the cluster-api resources are organized.

i think it's worth investigating putting a cache in for the infrastructure templates as they probably won't change that frequently and it could save us some api calls.

Member

Sounds like we don't necessarily need caching. If I see correctly, the getInfrastructureResource below is also not cached? So this won't add much on top.

Author

getInfrastructureResource uses the informer's cache.

@sbueringer
Member

@wjunott Thx!

/assign @elmiko

Once we've settled on the implementation, I can test with Cluster API to verify it works as expected.

Contributor
@elmiko elmiko left a comment

this is making sense to me, i have some suggestions about the error messages and i tend to agree with @sbueringer about caching.

although, if we feel adding caching to this PR will make it too complex, i'm fine to review it in a followup.



Comment on lines 396 to 397
klog.V(4).Info("Missing apiVersion")
return nil, errors.New("Missing apiVersion")
Contributor

i'd like to add a little more information here to help with triage

Suggested change:

-	klog.V(4).Info("Missing apiVersion")
-	return nil, errors.New("Missing apiVersion")
+	errorMsg := fmt.Sprintf("missing apiVersion for infrastructureRef of scalable resource %q", r.unstructured.GetName())
+	klog.V(4).Info(errorMsg)
+	return nil, errors.New(errorMsg)

Author

Added more detailed information.

Author
@wjunott wjunott Aug 8, 2025

@elmiko @sbueringer I created a commit to support caching the preferred version of an apiGroup, with about 24 lines of change. How about we discuss further whether we still need the cached version, and if so I will create a new PR after this one is merged, given that only scale from zero accesses the apiserver to get the preferred version of an apiGroup?

Contributor

i think a cache would be helpful to reduce the number of api calls that the cluster-api provider makes. i'm not sure that it is absolutely required, but it would be interesting to test it out.

under normal operation, the cluster-api provider can generate many log lines reporting client-side throttling. i would think that having a cache would help us to reduce the frequency of calls.

 Waited for 174.987663ms due to client-side throttling, not priority and fairness, request: <details of HTTP request>

Author

OK, I will create a new PR with cache enabled after this PR is tested and merged.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Aug 8, 2025
@wjunott wjunott changed the title Add process with apiGroup in cluster api provider when scaling from … Add process with apiGroup in cluster api provider Aug 8, 2025
@wjunott wjunott force-pushed the consume-capi-v1beta2-from-zero branch from 619b909 to 53e8f19 Compare August 8, 2025 03:49
@wjunott wjunott changed the title Add process with apiGroup in cluster api provider Add process with apiGroup in capi provider Aug 8, 2025
@sbueringer
Member

@wjunott see #8410 (comment)

@wjunott wjunott force-pushed the consume-capi-v1beta2-from-zero branch from d0de2e3 to 1ca5f44 Compare August 8, 2025 07:18
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Aug 8, 2025
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Aug 8, 2025
@@ -366,7 +366,7 @@ func createTestConfigs(specs ...testSpec) []*testConfig {
 	config.machineSet = &unstructured.Unstructured{
 		Object: map[string]interface{}{
 			"kind":       machineSetKind,
-			"apiVersion": "cluster.x-k8s.io/v1alpha3",
+			"apiVersion": "cluster.x-k8s.io/v1beta2",
Member

I think we have to adjust the infrastructureRef here to use the new format of v1beta2 (i.e. apiGroup instead of apiVersion)

(please also check if there are other cases below)
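For example, the fixture's infrastructureRef would change shape roughly like this (illustrative values; the real fixtures nest infrastructureRef inside the scalable resource's spec):

package main

// v1beta1-style ref the fixtures currently build:
var v1beta1StyleRef = map[string]interface{}{
	"apiVersion": "infrastructure.cluster.x-k8s.io/v1beta1",
	"kind":       "MachineTemplate",
	"name":       "md-template",
}

// v1beta2-style ref using apiGroup instead:
var v1beta2StyleRef = map[string]interface{}{
	"apiGroup": "infrastructure.cluster.x-k8s.io",
	"kind":     "MachineTemplate",
	"name":     "md-template",
}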

Author

The current test structure is not very flexible. For this case, it covers the MachineSet-with-apiVersion case, i.e. the previous behavior.

Member

Maybe then we should use v1beta1 here. As it is, it is not a valid v1beta2 object.

Contributor

+1

Contributor

i have some thoughts about how to make these tests easier to work with, but best if we get this review done first then perhaps i can propose some cleanups.

Author
@wjunott wjunott Aug 12, 2025

+1 to Michael's comment. Plus, v1beta2 + apiVersion in InfrastructureReference is also a valid combination in capi v1beta2.alpha.1, so we may address this in a later cleanup.

Member

Not sure if I got it right, but v1beta2 MachineDeployments, MachineSets and MachinePools always have apiGroup. How could they have apiVersion?

Contributor
@elmiko elmiko Aug 12, 2025

this line is the apiVersion for the kind that is being created; for this clause it's a MachineSet. this isn't about the infrastructure ref.

edit: misread your comment @sbueringer

i agree with your suggestion that we use v1beta1 here.

Member
@sbueringer sbueringer Aug 13, 2025

Just chatted with Jun. While it looks trivial to just change this apiVersion to v1beta1 here, it breaks a huge number of tests and requires refactoring.

(57 tests failed, 143 tests passed when changing this to v1beta1)

So from my side it would be okay to defer the test refactoring if we feel the change in this PR is sufficiently unit tested.

But I leave this to autoscaler reviewers / maintainers of course

Contributor

I assume we're talking about test refactoring under the cloudprovider/clusterapi directory, which is in fact maintained independently from the core autoscaler. So I'll defer to @elmiko for the final call on merging with the v1beta2 change.

Overall lgtm

@elmiko
Contributor

elmiko commented Aug 8, 2025

apologies, i didn't get a chance to do any reviews today. i will revisit this PR early next week.

Contributor
@elmiko elmiko left a comment

i think this is looking good, i'd like to see if @jackfrancis might be able to give a review.

not sure the best way to address the test spec creation for the different infrastructure versions.


@jackfrancis
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 11, 2025
@wjunott wjunott changed the title Add process with apiGroup in capi provider Process apiGroup in capi provider Aug 12, 2025
@jackfrancis
Contributor

/lgtm
/approve

/hold for @elmiko to sign off

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 13, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 13, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis, wjunott

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 13, 2025
@elmiko
Contributor

elmiko commented Aug 13, 2025

sorry, i didn't get a chance to review today. i'm adding to my queue for tomorrow.

Labels
approved · area/cluster-autoscaler · area/provider/cluster-api · cncf-cla: yes · do-not-merge/hold · kind/bug · lgtm · ok-to-test · release-note-none · size/M
Development

Successfully merging this pull request may close these issues:

CA clusterapi provider scale from zero does not work properly with v1beta2 API (#8330)

5 participants