Description
What happened?
While performing an upgrade-cluster run with a long-running drain, the Drain node task aborts with UNREACHABLE because the drain exceeds the Ansible command timeout.
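For context, the 10-second default mentioned below presumably refers to the connection timeout in ansible.cfg, which is a minimal sketch of the relevant setting (whether this setting alone explains the UNREACHABLE is my interpretation, not something I have verified):

[defaults]
# SSH connection timeout in seconds; Ansible's default is 10.
# The drain above keeps the delegated connection busy far longer than this.
timeout = 10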
What did you expect to happen?
The task should not mark the node unreachable; instead it should keep waiting until the drain finishes or drain_timeout elapses.
How can we reproduce it (as minimally and precisely as possible)?
Run this with the default drain_timeout=360s and the default ansible timeout=10:
ansible-playbook -e drain_grace_period=-1 upgrade-cluster.yml
and place a PodDisruptionBudget on some pod so that the drain fails on the first node (see the sketch below for a minimal example). (We run a rook-ceph cluster and manually drained other nodes until hitting the PDB for the OSDs, which leaves the OSDs on our first worker node non-evictable.)
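For reference, a minimal sketch of a PodDisruptionBudget that blocks eviction (the name, namespace, and label are hypothetical; any PDB whose minAvailable cannot be satisfied during eviction will do):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb        # hypothetical name
  namespace: example-ns    # hypothetical namespace
spec:
  minAvailable: 1          # with only one matching replica, eviction would always violate this
  selector:
    matchLabels:
      app: example-pdb-pod # must match the label of the pod you want to keep non-evictable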
OS
Fedora 40
Version of Ansible
2.17.0
Version of Python
3.11.2
Version of Kubespray (commit)
v2.26.0
Network plugin used
calico
Full inventory with variables
[all]
k8s-master01 ansible_host=10.112.5.11
k8s-master02 ansible_host=10.112.6.11
k8s-master03 ansible_host=10.112.7.11
k8s-worker01 ansible_host=10.112.5.21
k8s-worker02 ansible_host=10.112.6.21
k8s-worker03 ansible_host=10.112.7.21
bastion ansible_host=<>
[bastion]
bastion
[kube_control_plane]
k8s-master01
k8s-master02
k8s-master03
[kube_node]
k8s-worker01
k8s-worker02
k8s-worker03
[etcd]
k8s-master01
k8s-master02
k8s-master03
[calico_rr]
[k8s_cluster:children]
kube_node
kube_control_plane
calico_rr
Command used to invoke ansible
ansible-playbook -i /inventory/inventory.ini -e ansible_user=fedora -e drain_grace_period=-1 -b --become-user=root --flush-cache ./upgrade-cluster.yml
Output of ansible run
With the default timeouts:
...
TASK [upgrade/pre-upgrade : Cordon node] ***************************************
changed: [k8s-worker02 -> k8s-master01(10.112.5.11)]
Wednesday 11 June 2025 16:30:03 +0000 (0:00:01.472) 0:09:25.905 ********
TASK [upgrade/pre-upgrade : Drain node] ****************************************
fatal: [k8s-worker02 -> k8s-master01]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"10.112.5.11\". Make sure this host can be reached over ssh: ", "unreachable": true}
NO MORE HOSTS LEFT *************************************************************
PLAY [Finally handle worker upgrades, based on given batch size] ***************
Wednesday 11 June 2025 16:34:50 +0000 (0:04:46.282) 0:14:12.188 ********
Wednesday 11 June 2025 16:34:50 +0000 (0:00:00.086) 0:14:12.274 ********
Wednesday 11 June 2025 16:34:50 +0000 (0:00:00.034) 0:14:12.308 ********
Wednesday 11 June 2025 16:34:50 +0000 (0:00:00.034) 0:14:12.343 ********
Wednesday 11 June 2025 16:34:50 +0000 (0:00:00.034) 0:14:12.378 ********
Wednesday 11 June 2025 16:34:50 +0000 (0:00:00.034) 0:14:12.412 ********
Wednesday 11 June 2025 16:34:50 +0000 (0:00:00.035) 0:14:12.448 ********
Wednesday 11 June 2025 16:34:50 +0000 (0:00:00.021) 0:14:12.469 ********
Wednesday 11 June 2025 16:34:50 +0000 (0:00:00.020) 0:14:12.490 ********
TASK [upgrade/pre-upgrade : See if node is in ready state] *********************
ok: [k8s-worker03 -> k8s-master01(10.112.5.11)]
...
With drain_timeout=10s explicitly set (too low for my use case: I need a drain timeout of several minutes for my rook-ceph cluster to stabilize after an OSD failure):
...
TASK [upgrade/pre-upgrade : Cordon node] ***************************************
changed: [k8s-worker01 -> k8s-master01(10.112.5.11)]
Wednesday 11 June 2025 17:13:48 +0000 (0:00:01.604) 0:06:28.505 ********
FAILED - RETRYING: [k8s-worker01 -> k8s-master01]: Drain node (3 retries left).
FAILED - RETRYING: [k8s-worker01 -> k8s-master01]: Drain node (2 retries left).
FAILED - RETRYING: [k8s-worker01 -> k8s-master01]: Drain node (1 retries left).
TASK [upgrade/pre-upgrade : Drain node] ****************************************
fatal: [k8s-worker01 -> k8s-master01(10.112.5.11)]: FAILED! => {"attempts": 3, "changed": true, "cmd": ["/usr/local/bin/kubectl", "--kubeconfig", "/etc/kubernetes/admin.conf", "drain", "--force", "--ignore-daemonsets", "--grace-period", "-1", "--timeout", "10s", "--delete-emptydir-data", "k8s-worker01"], "delta": "0:00:10.554703", "end": "2025-06-11 17:15:06.138801", "failed_when_result": true, "msg": "non-zero return code", "rc": 1, "start": "2025-06-11 17:14:55.584098", "stderr": "Warning: ignoring DaemonSet-managed Pods: kube-system/calico-node-22ds9, kube-system/kube-proxy-xq8rb, kube-system/nodelocaldns-h9w58, rook-ceph/csi-rbdplugin-fbfd4\nerror when evicting pods/\"example-pdb-pod\" -n \"example-ns\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.\nerror when evicting pods/\"rook-ceph-osd-9-565bdd4f
Wednesday 11 June 2025 17:15:06 +0000 (0:01:18.121) 0:07:46.626 ********
TASK [upgrade/pre-upgrade : Set node back to schedulable] **********************
changed: [k8s-worker01 -> k8s-master01(10.112.5.11)]
Wednesday 11 June 2025 17:15:08 +0000 (0:00:01.748) 0:07:48.375 ********
TASK [upgrade/pre-upgrade : Fail after rescue] *********************************
fatal: [k8s-worker01 -> k8s-master01(10.112.5.11)]: FAILED! => {"changed": false, "msg": "Failed to drain node k8s-worker01"}
NO MORE HOSTS LEFT *************************************************************
PLAY RECAP *********************************************************************
k8s-worker01 : ok=290 changed=7 unreachable=0 failed=1 skipped=381 rescued=1 ignored=0
k8s-worker02 : ok=252 changed=4 unreachable=0 failed=0 skipped=330 rescued=0 ignored=0
k8s-worker03 : ok=252 changed=4 unreachable=0 failed=0 skipped=330 rescued=0 ignored=0
Wednesday 11 June 2025 17:15:08 +0000 (0:00:00.041) 0:07:48.417 ********
===============================================================================
upgrade/pre-upgrade : Drain node --------------------------------------- 78.12s
network_plugin/cni : CNI | Copy cni plugins ---------------------------- 19.51s
download : Prep_download | Register docker images info ----------------- 10.91s
download : Check_pull_required | Generate a list of information about the images on a node --- 9.29s
download : Check_pull_required | Generate a list of information about the images on a node --- 9.12s
download : Check_pull_required | Generate a list of information about the images on a node --- 8.82s
download : Check_pull_required | Generate a list of information about the images on a node --- 8.75s
download : Check_pull_required | Generate a list of information about the images on a node --- 8.47s
download : Check_pull_required | Generate a list of information about the images on a node --- 8.33s
download : Check_pull_required | Generate a list of information about the images on a node --- 8.21s
kubernetes/preinstall : Ensure kubelet expected parameters are set ------ 8.04s
kubespray-defaults : Gather ansible_default_ipv4 from all hosts or specific hosts --- 6.71s
bootstrap-os : Assign inventory name to unconfigured hostnames (non-CoreOS, non-Flatcar, Suse and ClearLinux, non-Fedora) --- 6.56s
bootstrap-os : Add proxy to dnf.conf if http_proxy is defined ----------- 6.25s
download : Extract_file | Unpacking archive ----------------------------- 6.20s
bootstrap-os : Gather facts --------------------------------------------- 5.81s
kubernetes/preinstall : Create kubernetes directories ------------------- 5.65s
download : Extract_file | Unpacking archive ----------------------------- 5.18s
bootstrap-os : Ensure bash_completion.d folder exists ------------------- 4.98s
bootstrap-os : Create remote_tmp for it is used by another module ------- 4.94s
Anything else we need to know
Suggestion: set async and poll (as explained in the Ansible docs) on the Drain node task in roles/upgrade/pre-upgrade/tasks/main.yml to avoid the timeout; a sketch follows below.
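A minimal sketch of what that could look like, based on the kubectl command visible in the output above (the async value, variable names, and surrounding options are assumptions, not the exact Kubespray task):

- name: Drain node
  command: >-
    {{ bin_dir }}/kubectl --kubeconfig /etc/kubernetes/admin.conf
    drain --force --ignore-daemonsets
    --grace-period {{ drain_grace_period }}
    --timeout {{ drain_timeout }}
    --delete-emptydir-data {{ inventory_hostname }}
  delegate_to: "{{ groups['kube_control_plane'] | first }}"
  async: 600   # assumption: any value comfortably larger than drain_timeout (360s by default)
  poll: 10     # re-check progress every 10s instead of holding one SSH session open

With poll > 0 the task still blocks the play until the drain finishes, but each status check opens a fresh connection, so a long-running drain should no longer trip the connection timeout and be reported as UNREACHABLE.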