Add flag to determine whether containerd config is overwritten #12278

mkjpryor · 2025-06-02T10:57:02Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds a flag to opt out of overriding the containerd configuration.

Which issue(s) this PR fixes:

Fixes #12277

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Allow opting out of overriding the containerd configuration. This can be desirable when applications that modify the containerd config are installed on the cluster.

k8s-ci-robot · 2025-06-02T10:57:11Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mkjpryor
Once this PR has been reviewed and has the lgtm label, please assign mzaian for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2025-06-02T10:57:12Z

Hi @mkjpryor. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

yankay · 2025-06-03T02:33:25Z

/ok-to-test

mkjpryor · 2025-06-04T12:05:53Z

@yankay

Looks like this tested out OK. Is there anything else you need from me?

yankay · 2025-06-05T12:38:09Z

Thanks @mkjpryor

I appreciate the solution you provided. Personally, I think this parameter is reasonable, as I couldn’t think of a better way to address this issue.

However, there’s one scenario that may need clarification:
If a user needs to modify the containerd config.toml while the cluster has gpu-operator deployed, they must set containerd_config_overwrite: true. After running Ansible, they should also restart the NVIDIA Container Toolkit and ensure the configuration file is correctly updated.

/lgtm
@mzaian @VannTen Could you please share your thoughts?

VannTen · 2025-06-05T16:18:25Z

Hmm, it seems to me that additional/multiple runtimes should be handled by configuring the container engine with different runtimes and having the corresponding RuntimeClass object defined. See https://kubernetes.io/docs/concepts/containers/runtime-class/

IMO, the NVIDIA GPU stuff relying on overwriting the containerd config is a broken model, and I'm very much against introducing crutches to support broken models. I haven't delved deep into this, so it's possible I'm missing a good reason, but it's gonna be a hard sell.

mkjpryor · 2025-06-09T15:44:09Z

@VannTen

I agree with you that this is not an ideal solution, but as @yankay says it is the only reasonable solution I can think of that I can implement right now to stop kubespray and the NVIDIA GPU operator fighting over my containerd config.

In the docs you pointed to, it says that RuntimeClass is used for making Kubernetes aware of the different CRI configurations available on the nodes so that it can request them. However in order to make additional CRI configurations available on a node running containerd, you have to modify the containerd config. The NVIDIA GPU operator does indeed create RuntimeClass objects to make Kubernetes aware of the CRI configurations that it has installed.

The major problem here is actually containerd's lack of support for a config.d style model, which would allow the GPU operator to drop the extra CRI configurations into an include directory. That would be the ideal solution for this but containerd does not support doing this due to some strange decisions they took about which parts of the config are subject to a merge and which are overwritten completely by includes (spoiler - the entire runtimes section is overwritten at once, so you can't have a default runtime in config.toml and add extra runtimes in drop-in files).
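The merge behaviour described in the paragraph above can be illustrated with a toy model. This is only a sketch of the behaviour as stated in this comment, not containerd's actual merge code: the function `merge_with_imports`, the flat dictionary layout, and the binary path are assumptions made for illustration.

```python
def merge_with_imports(base, drop_in, replaced_wholesale=("runtimes",)):
    """Toy merge of a drop-in file over a base config.

    Nested tables are merged key by key, except for the tables named in
    `replaced_wholesale`, which the drop-in overwrites completely.
    """
    merged = dict(base)
    for key, value in drop_in.items():
        both_tables = isinstance(value, dict) and isinstance(merged.get(key), dict)
        if key in replaced_wholesale or not both_tables:
            # The whole value is replaced; nothing from the base survives.
            merged[key] = value
        else:
            merged[key] = merge_with_imports(merged[key], value, replaced_wholesale)
    return merged


# Hypothetical base config with a default runtime.
base = {
    "default_runtime_name": "runc",
    "runtimes": {"runc": {"runtime_type": "io.containerd.runc.v2"}},
}

# Hypothetical drop-in that only wants to add one extra runtime.
drop_in = {
    "runtimes": {
        "nvidia": {
            "runtime_type": "io.containerd.runc.v2",
            "options": {"BinaryName": "/usr/bin/nvidia-container-runtime"},
        }
    }
}

merged = merge_with_imports(base, drop_in)
print(sorted(merged["runtimes"]))
```

In this model the merged config still names `runc` as the default runtime, but the `runc` entry itself has been dropped, which is why a drop-in cannot simply add runtimes next to the default one.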

I'm open to other solutions if you can think of any. I can't. The NVIDIA GPU operator is a very widely used piece of software so I am surprised to be the only one hitting this. Maybe nobody else uses it with kubespray 🤷‍♂️

mkjpryor · 2025-06-09T15:44:38Z

In any case, I think a flag whose default value results in no change in behaviour is a reasonable compromise?

mkjpryor · 2025-06-09T15:48:40Z

> Thanks @mkjpryor
>
> I appreciate the solution you provided. Personally, I think this parameter is reasonable, as I couldn’t think of a better way to address this issue.
>
> However, there’s one scenario that may need clarification: If a user needs to modify the containerd config.toml while the cluster has gpu-operator deployed, they must set containerd_config_overwrite: true. After running Ansible, they should also restart the NVIDIA Container Toolkit and ensure the configuration file is correctly updated.
>
> /lgtm
> @mzaian @VannTen Could you please share your thoughts?

@yankay

You are correct that if a user sets this flag in their group vars, then if they really need kubespray to update the containerd config they will need to run with containerd_config_overwrite=true as an extra var and probably roll the NVIDIA daemonsets after. Does this need a documentation patch? I am happy to do it if you can suggest a good place in the docs for it to go.
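The precedence described above can be sketched as a toy model. This is not Ansible or kubespray code: the function `effective_overwrite` is hypothetical, and the default of true is an assumption inferred from the statement that the flag's default value results in no change in behaviour. It relies only on Ansible giving extra vars precedence over group vars.

```python
FLAG = "containerd_config_overwrite"


def effective_overwrite(group_vars, extra_vars, default=True):
    """Return the value the flag would take, checking the highest
    precedence scope (extra vars) first, then group vars, then the default."""
    for scope in (extra_vars, group_vars):
        if FLAG in scope:
            return scope[FLAG]
    return default


# Operator has opted out in group vars for day-to-day runs.
normal_run = effective_overwrite({FLAG: False}, {})

# One-off run where kubespray really must update the containerd config.
forced_run = effective_overwrite({FLAG: False}, {FLAG: True})

# No setting anywhere: behaviour is unchanged from before the flag existed.
untouched = effective_overwrite({}, {})

print(normal_run, forced_run, untouched)
```

After a forced run, the NVIDIA daemonsets would still need to be rolled so that they re-apply their changes to the freshly written config.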

yankay · 2025-06-11T03:25:10Z

> In any case, I think a flag whose default value results in no change in behaviour is a reasonable compromise?

I agree with your opinion. This PR enhances the configurability of Kubespray without any negative impacts, so it can be merged.
Another reviewer's approval is still needed before this can be merged.

Additionally, we can continue to explore whether there are more elegant solutions in the future.

VannTen · 2025-06-14T10:18:28Z

Ok, I see the rationale. I'm still not a very big fan, especially because this opens the door to subtle bugs where the containerd config is not updated for new versions, that kind of thing: a change that would not be breaking with normal behavior could become one if we don't update the containerd config.

What's the cri-o approach to this, a config.d/ scheme like you mentioned above?

Another question regarding the NVIDIA operator specifically: can the runtime and the runtime class be set up independently?
Currently the container engine and container runtimes are quite coupled in kubespray, but in the future we should decouple them, and I'm wondering whether that would then be a solution that lets us get rid of this workaround.

VannTen · 2025-06-14T16:46:14Z

Also, following my previous message, could `containerd_additional_runtimes` currently be used to solve this? https://github.com/kubernetes-sigs/kubespray/blob/b04ceba89b9094275cd913e24ec7f43eb0f17cf0/roles/container-engine/containerd/templates/config.toml.j2#L43

mkjpryor · 2025-06-18T09:43:14Z

@VannTen

In all honesty, I'd rather use the NVIDIA official approach for similar reasons to the ones that you expressed as concerns for this patch. If the NVIDIA GPU operator changes the installation and I have fixed runtimes defined in containerd_additional_runtimes that are out-of-date with that, I could get in a world of pain that way. I would probably also have to assume responsibility for installing/upgrading the NVIDIA drivers and runtime binaries, which I don't really want to do.

Unfortunately, containerd does not support a config.d style config merging properly right now, so it is not really an option for the NVIDIA GPU operator to use that. The NVIDIA runtime installation process reads the current default runtime and bases the NVIDIA runtimes off that with the required changes, e.g. to point at different runtime binaries.
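The "read the default runtime and derive from it" step described above can be sketched roughly as follows. This is an illustrative model only, not the NVIDIA toolkit's implementation: the function `derive_runtime`, the dictionary layout, and the binary path are assumptions.

```python
import copy


def derive_runtime(config, name, binary):
    """Return a copy of `config` with a new runtime `name` that is a clone
    of the current default runtime, pointed at a different binary."""
    default = config["runtimes"][config["default_runtime_name"]]
    derived = copy.deepcopy(default)
    # Only the runtime binary changes; every other setting is inherited.
    derived.setdefault("options", {})["BinaryName"] = binary

    updated = copy.deepcopy(config)
    updated["runtimes"][name] = derived
    return updated


# Hypothetical config as written by the cluster installer.
current = {
    "default_runtime_name": "runc",
    "runtimes": {
        "runc": {
            "runtime_type": "io.containerd.runc.v2",
            "options": {"SystemdCgroup": True},
        }
    },
}

updated = derive_runtime(current, "nvidia", "/usr/bin/nvidia-container-runtime")
print(updated["runtimes"]["nvidia"])
```

Because the derived entries depend on whatever the default runtime looks like at the time, a hand-maintained copy of them in `containerd_additional_runtimes` could drift from what the operator would generate.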

A stronger decoupling of the installation of the container engine itself and the corresponding runtime config could work, but I'm not sure how that would look much different from what this patch does.

This patch introduces a new option that you specifically opt in to that basically says to kubespray that I, the operator, am taking responsibility for ensuring the containerd config is correct because I know other things need to change it. I'm not sure how much better we can do with the current state of the containerd config merging.

Of course, if a larger refactoring in the future or changes to containerd make a config.d approach feasible, we should revisit this workaround.

VannTen · 2025-06-20T11:19:27Z

I haven't forgotten this, I'm just a bit swamped at work right now; I'll come back to this soonish.

Make overwriting of containerd config optional

0cdfcec

k8s-ci-robot added the release-note, kind/feature, and cncf-cla: yes labels Jun 2, 2025

k8s-ci-robot requested review from MrFreezeex and yankay June 2, 2025 10:57

k8s-ci-robot added the needs-ok-to-test and size/XS labels Jun 2, 2025

k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Jun 3, 2025

k8s-ci-robot assigned yankay Jun 5, 2025

k8s-ci-robot added the lgtm label Jun 5, 2025
