Process apiGroup in capi provider #8410


Open

wjunott wants to merge 3 commits into master from consume-capi-v1beta2-from-zero

Conversation

@wjunott wjunott commented Aug 6, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

With capi v1beta2, a MachineDeployment or MachineSet's infrastructureRef field has changed from ObjectReference to ContractVersionedObjectReference, so the provider needs to handle the difference between apiVersion and apiGroup.
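For illustration, a minimal sketch of the shape difference (hypothetical struct names; see the CAPI v1beta2 types for the authoritative definitions):

package main

import "fmt"

// objectRef mirrors the v1beta1-style ObjectReference shape: the template is
// referenced with a full apiVersion (group/version).
type objectRef struct {
	APIVersion, Kind, Name string
}

// contractVersionedRef mirrors the v1beta2-style ContractVersionedObjectReference
// shape: only the API group is recorded, so a consumer has to resolve a
// concrete version itself (e.g. the group's preferred version from discovery).
type contractVersionedRef struct {
	APIGroup, Kind, Name string
}

func main() {
	oldStyle := objectRef{APIVersion: "infrastructure.cluster.x-k8s.io/v1beta1", Kind: "DockerMachineTemplate", Name: "md-0"}
	newStyle := contractVersionedRef{APIGroup: "infrastructure.cluster.x-k8s.io", Kind: "DockerMachineTemplate", Name: "md-0"}
	fmt.Printf("old: %+v\nnew: %+v\n", oldStyle, newStyle)
}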

Which issue(s) this PR fixes:

Fixes #8330

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Aug 6, 2025

linux-foundation-easycla bot commented Aug 6, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-area labels Aug 6, 2025
@k8s-ci-robot
Contributor

Welcome @wjunott!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 6, 2025
@k8s-ci-robot
Contributor

Hi @wjunott. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added area/cluster-autoscaler area/provider/cluster-api Issues or PRs related to Cluster API provider labels Aug 6, 2025
@k8s-ci-robot k8s-ci-robot requested review from arunmk and hardikdr August 6, 2025 13:21
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed do-not-merge/needs-area labels Aug 6, 2025

apiGroup, ok := infraref["apiGroup"]
if ok {
	if apiversion, err = getAPIGroupPreferredVersion(r.controller.managementDiscoveryClient, apiGroup); err != nil {
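For context, a minimal sketch of what a preferred-version lookup such as getAPIGroupPreferredVersion could do against the discovery client (assumed shape only; the PR's actual helper may differ):

package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
)

// preferredVersionForGroup resolves the apiserver's preferred version for an
// API group via the discovery client.
func preferredVersionForGroup(dc discovery.DiscoveryInterface, apiGroup string) (string, error) {
	groups, err := dc.ServerGroups() // one live call against the apiserver
	if err != nil {
		return "", err
	}
	for _, g := range groups.Groups {
		if g.Name == apiGroup {
			// e.g. "infrastructure.cluster.x-k8s.io/v1beta2"
			return g.PreferredVersion.GroupVersion, nil
		}
	}
	return "", fmt.Errorf("no preferred version found for API group %q", apiGroup)
}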
Member
If I see correctly, this is doing a live call against the apiserver. I'm wondering if one live call for every call of readInfrastructureReferenceResource is too much

Should we use a cache with a TTL to cache the apiGroup => version mapping? (ttl: 1m or 10m?)
(we can use client-go/tools/cache.NewTTLStore for that)
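A rough sketch of that idea, assuming a hypothetical wrapper around cache.NewTTLStore (not part of the PR):

package main

import (
	"time"

	"k8s.io/client-go/tools/cache"
)

// groupVersionEntry is a hypothetical cache entry mapping an API group to
// its preferred version.
type groupVersionEntry struct {
	group   string
	version string
}

// newGroupVersionCache builds a TTL store keyed by API group, so repeated
// lookups within the TTL avoid a discovery round trip.
func newGroupVersionCache(ttl time.Duration) cache.Store {
	return cache.NewTTLStore(func(obj interface{}) (string, error) {
		return obj.(*groupVersionEntry).group, nil
	}, ttl)
}

// cachedPreferredVersion returns the cached version for a group, if present
// and not yet expired.
func cachedPreferredVersion(store cache.Store, group string) (string, bool) {
	item, exists, err := store.GetByKey(group)
	if err != nil || !exists {
		return "", false
	}
	return item.(*groupVersionEntry).version, true
}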

Author

Good point. As far as I can tell, this API is invoked only during scale up/down. @elmiko any advice on where to put the cache?

Member
@sbueringer sbueringer Aug 7, 2025

If it's okay to always do a live call here because this isn't called too often, absolutely fine for me of course (I just don't know :))

Contributor

these calls will only happen when the core autoscaler wants to construct a node template. if the autoscaler has a ready node from the node group, then it will use a node as a template instead of asking the provider to generate a new template (where this function is called).

in the worst case scenario, this function will get called once per node group per scan interval from the autoscaler, which defaults to 10 seconds. in a large cluster this could be called several times for the same template depending on how the cluster-api resources are organized.

i think it's worth investigating putting a cache in for the infrastructure templates as they probably won't change that frequently and it could save us some api calls.

Member

Sounds like we don't necessarily need caching. If I see correctly, the getInfrastructureResource below is also not cached? So this won't add much on top.

Author

getInfrastructureResource uses the informer's cache.

@sbueringer
Member

@wjunott Thx!

/assign @elmiko

Once we've settled on the implementation, I can test with Cluster API to verify it works as expected.

Contributor
@elmiko elmiko left a comment

this is making sense to me, i have some suggestions about the error messages and i tend to agree with @sbueringer about caching.

although, if we feel adding caching to this PR will make it too complex, i'm fine to review it in a followup.



Comment on lines 396 to 397
klog.V(4).Info("Missing apiVersion")
return nil, errors.New("Missing apiVersion")
Contributor

i'd like to add a little more information here to help with triage

Suggested change:

-	klog.V(4).Info("Missing apiVersion")
-	return nil, errors.New("Missing apiVersion")
+	errorMsg := fmt.Sprintf("missing apiVersion for infrastructureRef of scalable resource %q", r.unstructured.GetName())
+	klog.V(4).Info(errorMsg)
+	return nil, errors.New(errorMsg)

Author

Added more detailed information.

Author
@wjunott wjunott Aug 8, 2025

@elmiko @sbueringer I created a commit to support caching the preferred version of an apiGroup, with about 24 lines of change. How about we discuss further whether we still need the cached version, and if so I will create a new PR after this one is merged, given that only scale from zero accesses the apiserver to get the preferred version of an apiGroup?

Contributor

i think a cache would be helpful to reduce the number of api calls that the cluster-api provider makes. i'm not sure that it is absolutely required, but it would be interesting to test it out.

under normal operation, the cluster-api provider can generate many log lines reporting client-side throttling. i would think that having a cache would help us to reduce the frequency of calls.

 Waited for 174.987663ms due to client-side throttling, not priority and fairness, request: <details of HTTP request>

Author

OK, I will create a new PR with cache enabled after this PR is tested and merged.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Aug 8, 2025
@wjunott wjunott changed the title Add process with apiGroup in cluster api provider when scaling from … Add process with apiGroup in cluster api provider Aug 8, 2025
@wjunott wjunott force-pushed the consume-capi-v1beta2-from-zero branch from 619b909 to 53e8f19 Compare August 8, 2025 03:49
@wjunott wjunott changed the title Add process with apiGroup in cluster api provider Add process with apiGroup in capi provider Aug 8, 2025
@sbueringer
Member

@wjunott see #8410 (comment)

@wjunott wjunott force-pushed the consume-capi-v1beta2-from-zero branch from d0de2e3 to 1ca5f44 Compare August 8, 2025 07:18
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Aug 8, 2025
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Aug 8, 2025
@@ -366,7 +366,7 @@ func createTestConfigs(specs ...testSpec) []*testConfig {
 	config.machineSet = &unstructured.Unstructured{
 		Object: map[string]interface{}{
 			"kind":       machineSetKind,
-			"apiVersion": "cluster.x-k8s.io/v1alpha3",
+			"apiVersion": "cluster.x-k8s.io/v1beta2",
Member

I think we have to adjust the infrastructureRef here to use the new format of v1beta2 (i.e. apiGroup instead of apiVersion)

(please also check if there are other cases below)
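For example, the fixture's infrastructureRef would change shape roughly like this (illustrative values; the real fixtures nest infrastructureRef inside the scalable resource's spec):

package main

// v1beta1-style ref the fixtures currently build:
var v1beta1StyleRef = map[string]interface{}{
	"apiVersion": "infrastructure.cluster.x-k8s.io/v1beta1",
	"kind":       "MachineTemplate",
	"name":       "md-template",
}

// v1beta2-style ref using apiGroup instead:
var v1beta2StyleRef = map[string]interface{}{
	"apiGroup": "infrastructure.cluster.x-k8s.io",
	"kind":     "MachineTemplate",
	"name":     "md-template",
}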

Author

The current test structure is not very flexible. For this case, it covers the MachineSet-with-apiVersion case, i.e. the previous behavior.

Member

Maybe then we should use v1beta1 here. As it is, it is not a valid v1beta2 object.

Contributor

+1

Contributor

i have some thoughts about how to make these tests easier to work with, but best if we get this review done first then perhaps i can propose some cleanups.

Author
@wjunott wjunott Aug 12, 2025

+1 to Michael's comment. Plus, v1beta2 + apiVersion in InfrastructureReference is also a valid combination in capi v1beta2.alpha.1, so we may address this in a later cleanup.

Member

Not sure if I got it right, but v1beta2 MachineDeployments, MachineSets and MachinePools always have apiGroup. How could they have apiVersion?

Contributor
@elmiko elmiko Aug 12, 2025

this line is the apiVersion for the kind that is being created; for this clause it's a MachineSet. this isn't about the infrastructure ref.

edit: misread your comment @sbueringer

i agree with your suggestion that we use v1beta1 here.

Member
@sbueringer sbueringer Aug 13, 2025

Just chatted with Jun. While it looks trivial to just change this apiVersion to v1beta1 here, it breaks a huge number of tests and requires refactoring.

(57 tests failed, 143 tests passed when changing this to v1beta1)

So from my side it would be okay to defer the test refactoring if we feel the change in this PR is sufficiently unit tested.

But I leave this to autoscaler reviewers / maintainers of course

Contributor

I assume we're talking about test refactoring under the cloudprovider/clusterapi directory, which is in fact maintained independently from the core autoscaler. So I'll defer to @elmiko for the final call on merging with the v1beta2 change.

Overall lgtm

@elmiko
Contributor

elmiko commented Aug 8, 2025

apologies, i didn't get a chance to do any reviews today. i will revisit this PR early next week.

Contributor
@elmiko elmiko left a comment

i think this is looking good, i'd like to see if @jackfrancis might be able to give a review.

not sure the best way to address the test spec creation for the different infrastructure versions.


@jackfrancis
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 11, 2025
@wjunott wjunott changed the title Add process with apiGroup in capi provider Process apiGroup in capi provider Aug 12, 2025
@jackfrancis
Contributor

/lgtm
/approve

/hold for @elmiko to sign off

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 13, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 13, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis, wjunott

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 13, 2025
@elmiko
Contributor

elmiko commented Aug 13, 2025

sorry, i didn't get a chance to review today. i'm adding to my queue for tomorrow.

Labels
approved · area/cluster-autoscaler · area/provider/cluster-api · cncf-cla: yes · do-not-merge/hold · kind/bug · lgtm · ok-to-test · release-note-none · size/M
Development

Successfully merging this pull request may close these issues:

CA clusterapi provider scale from zero does not work properly with v1beta2 API (#8330)

5 participants