
drain_timeout exceeds ansible default timeout #12297


Description

@timtrense-leadec

What happened?

While performing an upgrade-cluster run with a long-running drain, the drain task aborts with UNREACHABLE because the drain exceeds the Ansible command timeout.
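
If the limit being hit is indeed Ansible's connection timeout (default 10 seconds), it can be raised for a single run with the --timeout option of ansible-playbook; this is only a blunt workaround sketch under that assumption, not a verified fix:

ansible-playbook --timeout 600 -e drain_grace_period=-1 upgrade-cluster.yml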

What did you expect to happen?

The task should not mark the node unreachable; it should instead wait for the drain timeout to elapse and report the drain result.

How can we reproduce it (as minimally and precisely as possible)?

Run this with the default drain_timeout=360s and the default ansible timeout=10:

ansible-playbook -e drain_grace_period=-1 upgrade-cluster.yml

and place a PodDisruptionBudget for some pod that makes the drain fail for the first node. (We run a rook-ceph cluster and manually drained other nodes until hitting the PDB for the OSDs. This leaves the OSDs on our first worker node non-evictable.)
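
For reference, a minimal PodDisruptionBudget that blocks eviction of a matching pod could look like the following sketch (the name, namespace, and label are placeholders mirroring the anonymized log below, not our real manifests):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb
  namespace: example-ns
spec:
  # maxUnavailable: 0 means no matching pod may be evicted,
  # so kubectl drain retries until its --timeout expires
  maxUnavailable: 0
  selector:
    matchLabels:
      app: example-pdb-pod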

OS

Fedora 40

Version of Ansible

2.17.0

Version of Python

3.11.2

Version of Kubespray (commit)

v2.26.0

Network plugin used

calico

Full inventory with variables

[all]
k8s-master01 ansible_host=10.112.5.11
k8s-master02 ansible_host=10.112.6.11
k8s-master03 ansible_host=10.112.7.11
k8s-worker01 ansible_host=10.112.5.21
k8s-worker02 ansible_host=10.112.6.21
k8s-worker03 ansible_host=10.112.7.21
bastion ansible_host=<>

[bastion]
bastion

[kube_control_plane]
k8s-master01
k8s-master02
k8s-master03

[kube_node]
k8s-worker01
k8s-worker02
k8s-worker03

[etcd]
k8s-master01
k8s-master02
k8s-master03

[calico_rr]

[k8s_cluster:children]
kube_node
kube_control_plane
calico_rr

Command used to invoke ansible

ansible-playbook -i /inventory/inventory.ini -e ansible_user=fedora -e drain_grace_period=-1 -b --become-user=root --flush-cache ./upgrade-cluster.yml

Output of ansible run

With the default timeouts:

...

TASK [upgrade/pre-upgrade : Cordon node] ***************************************
changed: [k8s-worker02 -> k8s-master01(10.112.5.11)]
Wednesday 11 June 2025  16:30:03 +0000 (0:00:01.472)       0:09:25.905 ******** 

TASK [upgrade/pre-upgrade : Drain node] ****************************************
fatal: [k8s-worker02 -> k8s-master01]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"10.112.5.11\". Make sure this host can be reached over ssh: ", "unreachable": true}

NO MORE HOSTS LEFT *************************************************************

PLAY [Finally handle worker upgrades, based on given batch size] ***************
Wednesday 11 June 2025  16:34:50 +0000 (0:04:46.282)       0:14:12.188 ******** 
Wednesday 11 June 2025  16:34:50 +0000 (0:00:00.086)       0:14:12.274 ******** 
Wednesday 11 June 2025  16:34:50 +0000 (0:00:00.034)       0:14:12.308 ******** 
Wednesday 11 June 2025  16:34:50 +0000 (0:00:00.034)       0:14:12.343 ******** 
Wednesday 11 June 2025  16:34:50 +0000 (0:00:00.034)       0:14:12.378 ******** 
Wednesday 11 June 2025  16:34:50 +0000 (0:00:00.034)       0:14:12.412 ******** 
Wednesday 11 June 2025  16:34:50 +0000 (0:00:00.035)       0:14:12.448 ******** 
Wednesday 11 June 2025  16:34:50 +0000 (0:00:00.021)       0:14:12.469 ******** 
Wednesday 11 June 2025  16:34:50 +0000 (0:00:00.020)       0:14:12.490 ******** 

TASK [upgrade/pre-upgrade : See if node is in ready state] *********************
ok: [k8s-worker03 -> k8s-master01(10.112.5.11)]

...

With drain_timeout=10s explicitly set (too low for my use case: I need a drain timeout of several minutes for my rook-ceph cluster to stabilize after an OSD failure):

...

TASK [upgrade/pre-upgrade : Cordon node] ***************************************
changed: [k8s-worker01 -> k8s-master01(10.112.5.11)]
Wednesday 11 June 2025  17:13:48 +0000 (0:00:01.604)       0:06:28.505 ******** 
FAILED - RETRYING: [k8s-worker01 -> k8s-master01]: Drain node (3 retries left).
FAILED - RETRYING: [k8s-worker01 -> k8s-master01]: Drain node (2 retries left).
FAILED - RETRYING: [k8s-worker01 -> k8s-master01]: Drain node (1 retries left).
TASK [upgrade/pre-upgrade : Drain node] ****************************************
fatal: [k8s-worker01 -> k8s-master01(10.112.5.11)]: FAILED! => {"attempts": 3, "changed": true, "cmd": ["/usr/local/bin/kubectl", "--kubeconfig", "/etc/kubernetes/admin.conf", "drain", "--force", "--ignore-daemonsets", "--grace-period", "-1", "--timeout", "10s", "--delete-emptydir-data", "k8s-worker01"], "delta": "0:00:10.554703", "end": "2025-06-11 17:15:06.138801", "failed_when_result": true, "msg": "non-zero return code", "rc": 1, "start": "2025-06-11 17:14:55.584098", "stderr": "Warning: ignoring DaemonSet-managed Pods: kube-system/calico-node-22ds9, kube-system/kube-proxy-xq8rb, kube-system/nodelocaldns-h9w58, rook-ceph/csi-rbdplugin-fbfd4\nerror when evicting pods/\"example-pdb-pod\" -n \"example-ns\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.\nerror when evicting pods/\"rook-ceph-osd-9-565bdd4f
Wednesday 11 June 2025  17:15:06 +0000 (0:01:18.121)       0:07:46.626 ******** 
TASK [upgrade/pre-upgrade : Set node back to schedulable] **********************
changed: [k8s-worker01 -> k8s-master01(10.112.5.11)]
Wednesday 11 June 2025  17:15:08 +0000 (0:00:01.748)       0:07:48.375 ******** 
TASK [upgrade/pre-upgrade : Fail after rescue] *********************************
fatal: [k8s-worker01 -> k8s-master01(10.112.5.11)]: FAILED! => {"changed": false, "msg": "Failed to drain node k8s-worker01"}
NO MORE HOSTS LEFT *************************************************************
PLAY RECAP *********************************************************************
k8s-worker01               : ok=290  changed=7    unreachable=0    failed=1    skipped=381  rescued=1    ignored=0   
k8s-worker02               : ok=252  changed=4    unreachable=0    failed=0    skipped=330  rescued=0    ignored=0   
k8s-worker03               : ok=252  changed=4    unreachable=0    failed=0    skipped=330  rescued=0    ignored=0   
Wednesday 11 June 2025  17:15:08 +0000 (0:00:00.041)       0:07:48.417 ******** 
=============================================================================== 
upgrade/pre-upgrade : Drain node --------------------------------------- 78.12s
network_plugin/cni : CNI | Copy cni plugins ---------------------------- 19.51s
download : Prep_download | Register docker images info ----------------- 10.91s
download : Check_pull_required |  Generate a list of information about the images on a node --- 9.29s
download : Check_pull_required |  Generate a list of information about the images on a node --- 9.12s
download : Check_pull_required |  Generate a list of information about the images on a node --- 8.82s
download : Check_pull_required |  Generate a list of information about the images on a node --- 8.75s
download : Check_pull_required |  Generate a list of information about the images on a node --- 8.47s
download : Check_pull_required |  Generate a list of information about the images on a node --- 8.33s
download : Check_pull_required |  Generate a list of information about the images on a node --- 8.21s
kubernetes/preinstall : Ensure kubelet expected parameters are set ------ 8.04s
kubespray-defaults : Gather ansible_default_ipv4 from all hosts or specific hosts --- 6.71s
bootstrap-os : Assign inventory name to unconfigured hostnames (non-CoreOS, non-Flatcar, Suse and ClearLinux, non-Fedora) --- 6.56s
bootstrap-os : Add proxy to dnf.conf if http_proxy is defined ----------- 6.25s
download : Extract_file | Unpacking archive ----------------------------- 6.20s
bootstrap-os : Gather facts --------------------------------------------- 5.81s
kubernetes/preinstall : Create kubernetes directories ------------------- 5.65s
download : Extract_file | Unpacking archive ----------------------------- 5.18s
bootstrap-os : Ensure bash_completion.d folder exists ------------------- 4.98s
bootstrap-os : Create remote_tmp for it is used by another module ------- 4.94s

Anything else we need to know

Suggestion: set async and poll on the Drain node task in roles/upgrade/pre-upgrade/tasks/main.yml, as explained in the Ansible docs, to avoid the connection timeout.
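
A rough sketch of what that could look like, based on the kubectl invocation from the log above (illustrative only: the real task's retry logic and delegation are omitted, and inventory_hostname stands in for whatever hostname variable the task actually uses):

- name: Drain node
  command: >-
    /usr/local/bin/kubectl --kubeconfig /etc/kubernetes/admin.conf drain
    --force --ignore-daemonsets
    --grace-period {{ drain_grace_period }}
    --timeout {{ drain_timeout }}
    --delete-emptydir-data {{ inventory_hostname }}
  # run the drain asynchronously so no long-lived SSH connection is held open;
  # async must comfortably exceed drain_timeout (360s by default)
  async: 600
  poll: 10

With async/poll, the controller starts the drain, returns immediately, and then re-connects every poll seconds to check on it, so the per-connection timeout no longer has to cover the full drain duration.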
