Process apiGroup in capi provider #8410
Conversation
Welcome @wjunott!
Hi @wjunott. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
apiGroup, ok := infraref["apiGroup"]
if ok {
	if apiversion, err = getAPIGroupPreferredVersion(r.controller.managementDiscoveryClient, apiGroup); err != nil {
If I see correctly, this is doing a live call against the apiserver. I'm wondering if one live call for every call of readInfrastructureReferenceResource is too much.
Should we use a cache with a TTL to cache the apiGroup => version mapping? (TTL: 1m or 10m?)
(We can use client-go/tools/cache.NewTTLStore for that.)
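Something along these lines would be a minimal sketch of the TTL-store idea; the entry type and helper names are hypothetical, cache.NewTTLStore is the real client-go API, and the sketch assumes the getAPIGroupPreferredVersion added in this PR accepts a discovery.DiscoveryInterface:

```go
package clusterapi

import (
	"time"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/cache"
)

// groupVersionEntry pairs an apiGroup with its discovered preferred version.
// (Hypothetical type for this sketch.)
type groupVersionEntry struct {
	group   string
	version string
}

// newGroupVersionCache returns a TTL store keyed by apiGroup; entries expire
// after ttl, which forces a fresh discovery call on the next lookup.
func newGroupVersionCache(ttl time.Duration) cache.Store {
	return cache.NewTTLStore(func(obj interface{}) (string, error) {
		return obj.(*groupVersionEntry).group, nil
	}, ttl)
}

// preferredVersionCached consults the store first and only falls back to a
// live discovery call on a miss or after an entry has expired.
func preferredVersionCached(store cache.Store, client discovery.DiscoveryInterface, group string) (string, error) {
	if obj, ok, err := store.GetByKey(group); err == nil && ok {
		return obj.(*groupVersionEntry).version, nil
	}
	version, err := getAPIGroupPreferredVersion(client, group) // the live call discussed above
	if err != nil {
		return "", err
	}
	_ = store.Add(&groupVersionEntry{group: group, version: version})
	return version, nil
}
```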
Good point. I think this API is initially invoked only during scale up/down. @elmiko, any advice on where to put the cache?
If it's okay to always do a live call here because this isn't called too often, that's absolutely fine with me, of course (I just don't know :))
these calls will only happen when the core autoscaler wants to construct a node template. if the autoscaler has a ready node from the node group, then it will use a node as a template instead of asking the provider to generate a new template (where this function is called).
in the worst case scenario, this function will get called once per node group per scan interval from the autoscaler, which defaults to 10 seconds. in a large cluster this could be called several times for the same template depending on how the cluster-api resources are organized.
i think it's worth investigating putting a cache in for the infrastructure templates as they probably won't change that frequently and it could save us some api calls.
Sounds like we don't necessarily need caching. If I see correctly, the getInfrastructureResource below is also not cached? So this won't add much on top.
getInfrastructureResource goes through the informer's cache.
this makes sense to me. i have some suggestions about the error messages, and i tend to agree with @sbueringer about caching.
although, if we feel adding caching to this PR will make it too complex, i'm fine to review it in a followup.
klog.V(4).Info("Missing apiVersion")
return nil, errors.New("Missing apiVersion")
i'd like to add a little more information here to help with triage
klog.V(4).Info("Missing apiVersion")
return nil, errors.New("Missing apiVersion")

Suggested change:

errorMsg := fmt.Sprintf("missing apiVersion for infrastructureRef of scalable resource %q", r.unstructured.GetName())
klog.V(4).Info(errorMsg)
return nil, errors.New(errorMsg)
Added more detailed information.
@elmiko @sbueringer I created a commit to support caching the preferred version of an apiGroup, in about 24 lines of change. How about we discuss whether we still need the cached version, and if so, I will create a new PR after this one is merged? Only scale from zero accesses the apiserver to get the preferred version of an apiGroup.
i think a cache would be helpful to reduce the number of api calls that the cluster-api provider makes. i'm not sure that it is absolutely required, but it would be interesting to test it out.
under normal operation, the cluster-api provider can generate many log lines reporting client-side throttling. i would think that having a cache would help us to reduce the frequency of calls.
Waited for 174.987663ms due to client-side throttling, not priority and fairness, request: <details of HTTP request>
OK, I will create a new PR with cache enabled after this PR is tested and merged.
Force-pushed from 619b909 to 53e8f19
Force-pushed from d0de2e3 to 1ca5f44
@@ -366,7 +366,7 @@ func createTestConfigs(specs ...testSpec) []*testConfig {
 	config.machineSet = &unstructured.Unstructured{
 		Object: map[string]interface{}{
 			"kind": machineSetKind,
-			"apiVersion": "cluster.x-k8s.io/v1alpha3",
+			"apiVersion": "cluster.x-k8s.io/v1beta2",
I think we have to adjust the infrastructureRef here to use the new format of v1beta2 (i.e. apiGroup instead of apiVersion).
(Please also check if there are other cases below.)
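For illustration, the two reference shapes in the unstructured test fixtures would look roughly like this (template kind and name are made-up values):

```go
// v1beta1-style ObjectReference: carries a full apiVersion.
"infrastructureRef": map[string]interface{}{
	"apiVersion": "infrastructure.cluster.x-k8s.io/v1beta1",
	"kind":       "AWSMachineTemplate", // hypothetical kind
	"name":       "test-template",      // hypothetical name
	"namespace":  "default",
},

// v1beta2-style ContractVersionedObjectReference: only the apiGroup is
// stored; the version has to be resolved at runtime.
"infrastructureRef": map[string]interface{}{
	"apiGroup": "infrastructure.cluster.x-k8s.io",
	"kind":     "AWSMachineTemplate",
	"name":     "test-template",
},
```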
The current test structure is not so flexible. For this case, it covers the machineset-with-apiVersion case, i.e. the previous behavior.
Maybe we should use v1beta1 here, then. As it is, it is not a valid v1beta2 object.
+1
i have some thoughts about how to make these tests easier to work with, but best if we get this review done first then perhaps i can propose some cleanups.
+1 to Michael's comment. Plus, v1beta2 + apiVersion in InfrastructureReference is also a valid combination in capi v1beta2.alpha.1, so we may address it in a later cleanup.
Not sure if I got it right, but v1beta2 MachineDeployments, MachineSets and MachinePools always have apiGroup. How could they have apiVersion?
this line is the apiVersion for the kind that is being created; for this clause it's a MachineSet. this isn't about the infrastructure ref.
edit: misread your comment @sbueringer
i agree with your suggestion that we use v1beta1 here.
Just chatted with Jun. While it looks trivial to just change this apiVersion to v1beta1 here, it breaks a huge number of tests and requires refactoring (57 tests failed, 143 tests passed when changing this to v1beta1).
So from my side it would be okay to defer the test refactoring if we feel the change in this PR is sufficiently unit tested. But I leave this to autoscaler reviewers / maintainers, of course.
I assume we're talking about test refactoring under the cloudprovider/clusterapi directory, which is in fact maintained independently from the core autoscaler. So I'll defer to @elmiko for final call on merging w/ v1beta2 change.
Overall lgtm
apologies, i didn't get a chance to do any reviews today. i will revisit this PR early next week.
i think this is looking good. i'd like to see if @jackfrancis might be able to give a review.
not sure of the best way to address the test spec creation for the different infrastructure versions.
/ok-to-test
/lgtm
/hold for @elmiko to sign off
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis, wjunott

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
sorry, i didn't get a chance to review today. i'm adding it to my queue for tomorrow.
What type of PR is this?
/kind bug
What this PR does / why we need it:
With capi v1beta2, a MachineDeployment's or MachineSet's infrastructureRef field has changed from ObjectReference to ContractVersionedObjectReference. We need to handle the difference between apiVersion and apiGroup.
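A rough sketch of how the preferred version can be resolved from an apiGroup via the discovery client; this mirrors the idea behind the PR's getAPIGroupPreferredVersion, though the actual implementation may differ in its details:

```go
package clusterapi

import (
	"fmt"

	"k8s.io/client-go/discovery"
)

// preferredVersionFor asks the apiserver's discovery endpoint for the
// preferred version of the given API group.
func preferredVersionFor(client discovery.DiscoveryInterface, apiGroup string) (string, error) {
	groups, err := client.ServerGroups()
	if err != nil {
		return "", err
	}
	for _, g := range groups.Groups {
		if g.Name == apiGroup {
			return g.PreferredVersion.Version, nil
		}
	}
	return "", fmt.Errorf("failed to find preferred version for apiGroup %q", apiGroup)
}
```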
Which issue(s) this PR fixes:
Fixes #8330
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: