Description
What happened?
When the NVIDIA GPU operator is installed on a cluster, the container toolkit component modifies the containerd config to target the NVIDIA runtime.
If Kubespray is subsequently run with a GPU host in the play (e.g. for an upgrade), the containerd config is overwritten and the NVIDIA runtime definitions are removed. This results in pods failing to schedule on the GPU nodes.
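For reference, this is roughly what the container toolkit adds to /etc/containerd/config.toml, and what is lost when Kubespray re-templates the file. This is an illustrative sketch only; the exact binary path and whether the default runtime is switched depend on the toolkit version and how the operator is configured:

```toml
# Illustrative example of the entries added by nvidia-container-toolkit
# (exact contents vary by version and operator settings)
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```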
What did you expect to happen?
This is what I expected to happen, given that Kubespray re-templates the containerd config, but it is not desirable behaviour IMHO.
How can we reproduce it (as minimally and precisely as possible)?
Deploy a Kubespray cluster with GPU nodes, install the NVIDIA GPU operator and then run Kubespray again.
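A minimal sketch of the reproduction, assuming a standard Helm install of the GPU operator and the same playbook invocation as below (the inventory path is a placeholder):

```shell
# 1. Deploy the cluster with Kubespray (GPU nodes in the inventory)
ansible-playbook -i inventory/mycluster/hosts.yaml kubernetes_sigs.kubespray.cluster -b

# 2. Install the NVIDIA GPU operator; its container-toolkit component
#    rewrites /etc/containerd/config.toml on the GPU nodes
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator

# 3. Run Kubespray again (e.g. an upgrade); the containerd config is
#    re-templated and the nvidia runtime entries are removed
ansible-playbook -i inventory/mycluster/hosts.yaml kubernetes_sigs.kubespray.cluster -b
```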
OS
Ubuntu 22
Version of Ansible
ansible [core 2.16.14]
config file = /Users/mattp/Projects/nks-region/k8s-infra-ndg-region/ansible.cfg
configured module search path = ['/Users/mattp/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /Users/mattp/.pyenv/versions/3.12.9/envs/kubespray-test/lib/python3.12/site-packages/ansible
ansible collection location = /Users/mattp/Projects/nks-region/k8s-infra-ndg-region/.ansible/collections
executable location = /Users/mattp/.pyenv/versions/kubespray-test/bin/ansible
python version = 3.12.9 (main, Apr 10 2025, 11:21:50) [Clang 17.0.0 (clang-1700.0.13.3)] (/Users/mattp/.pyenv/versions/3.12.9/envs/kubespray-test/bin/python)
jinja version = 3.1.6
libyaml = True
Version of Python
Python 3.12.9
Version of Kubespray (commit)
v2.28.0
Network plugin used
cilium
Full inventory with variables
N/A
Command used to invoke ansible
ansible-playbook kubernetes_sigs.kubespray.cluster
Output of ansible run
N/A
Anything else we need to know
No response