
drain_timeout exceeds ansible default timeout #12297


Description

@timtrense-leadec

What happened?

While performing an upgrade-cluster run with a long-running drain, the drain task aborts with UNREACHABLE because the drain exceeds the Ansible command timeout.
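
If the limit being hit is indeed Ansible's connection timeout (default 10 seconds), it can be raised for a single run with the --timeout option of ansible-playbook; this is only a blunt workaround sketch under that assumption, not a verified fix:

ansible-playbook --timeout 600 -e drain_grace_period=-1 upgrade-cluster.yml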

What did you expect to happen?

The task should not mark the node unreachable; it should instead wait for the drain timeout to elapse and report the drain result.

How can we reproduce it (as minimally and precisely as possible)?

Run this with the default drain_timeout=360s and the default ansible timeout=10:

ansible-playbook -e drain_grace_period=-1 upgrade-cluster.yml

and place a PodDisruptionBudget for some pod that makes the drain fail for the first node. (We run a rook-ceph cluster and manually drained other nodes until hitting the PDB for the OSDs. This leaves the OSDs on our first worker node non-evictable.)
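
For reference, a minimal PodDisruptionBudget that blocks eviction of a matching pod could look like the following sketch (the name, namespace, and label are placeholders mirroring the anonymized log below, not our real manifests):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb
  namespace: example-ns
spec:
  # maxUnavailable: 0 means no matching pod may be evicted,
  # so kubectl drain retries until its --timeout expires
  maxUnavailable: 0
  selector:
    matchLabels:
      app: example-pdb-pod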

OS

Fedora 40

Version of Ansible

2.17.0

Version of Python

3.11.2

Version of Kubespray (commit)

v2.26.0

Network plugin used

calico

Full inventory with variables

[all]
k8s-master01 ansible_host=10.112.5.11
k8s-master02 ansible_host=10.112.6.11
k8s-master03 ansible_host=10.112.7.11
k8s-worker01 ansible_host=10.112.5.21
k8s-worker02 ansible_host=10.112.6.21
k8s-worker03 ansible_host=10.112.7.21
bastion ansible_host=<>

[bastion]
bastion

[kube_control_plane]
k8s-master01
k8s-master02
k8s-master03

[kube_node]
k8s-worker01
k8s-worker02
k8s-worker03

[etcd]
k8s-master01
k8s-master02
k8s-master03

[calico_rr]

[k8s_cluster:children]
kube_node
kube_control_plane
calico_rr

Command used to invoke ansible

ansible-playbook -i /inventory/inventory.ini -e ansible_user=fedora -e drain_grace_period=-1 -b --become-user=root --flush-cache ./upgrade-cluster.yml

Output of ansible run

With the default timeouts:

...

TASK [upgrade/pre-upgrade : Cordon node] ***************************************
changed: [k8s-worker02 -> k8s-master01(10.112.5.11)]
Wednesday 11 June 2025  16:30:03 +0000 (0:00:01.472)       0:09:25.905 ******** 

TASK [upgrade/pre-upgrade : Drain node] ****************************************
fatal: [k8s-worker02 -> k8s-master01]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"10.112.5.11\". Make sure this host can be reached over ssh: ", "unreachable": true}

NO MORE HOSTS LEFT *************************************************************

PLAY [Finally handle worker upgrades, based on given batch size] ***************
Wednesday 11 June 2025  16:34:50 +0000 (0:04:46.282)       0:14:12.188 ******** 
Wednesday 11 June 2025  16:34:50 +0000 (0:00:00.086)       0:14:12.274 ******** 
Wednesday 11 June 2025  16:34:50 +0000 (0:00:00.034)       0:14:12.308 ******** 
Wednesday 11 June 2025  16:34:50 +0000 (0:00:00.034)       0:14:12.343 ******** 
Wednesday 11 June 2025  16:34:50 +0000 (0:00:00.034)       0:14:12.378 ******** 
Wednesday 11 June 2025  16:34:50 +0000 (0:00:00.034)       0:14:12.412 ******** 
Wednesday 11 June 2025  16:34:50 +0000 (0:00:00.035)       0:14:12.448 ******** 
Wednesday 11 June 2025  16:34:50 +0000 (0:00:00.021)       0:14:12.469 ******** 
Wednesday 11 June 2025  16:34:50 +0000 (0:00:00.020)       0:14:12.490 ******** 

TASK [upgrade/pre-upgrade : See if node is in ready state] *********************
ok: [k8s-worker03 -> k8s-master01(10.112.5.11)]

...

With drain_timeout=10s explicitly set (too low for my use case: I need a drain timeout of several minutes for my rook-ceph cluster to stabilize after an OSD failure):

...

TASK [upgrade/pre-upgrade : Cordon node] ***************************************
changed: [k8s-worker01 -> k8s-master01(10.112.5.11)]
Wednesday 11 June 2025  17:13:48 +0000 (0:00:01.604)       0:06:28.505 ******** 
FAILED - RETRYING: [k8s-worker01 -> k8s-master01]: Drain node (3 retries left).
FAILED - RETRYING: [k8s-worker01 -> k8s-master01]: Drain node (2 retries left).
FAILED - RETRYING: [k8s-worker01 -> k8s-master01]: Drain node (1 retries left).
TASK [upgrade/pre-upgrade : Drain node] ****************************************
fatal: [k8s-worker01 -> k8s-master01(10.112.5.11)]: FAILED! => {"attempts": 3, "changed": true, "cmd": ["/usr/local/bin/kubectl", "--kubeconfig", "/etc/kubernetes/admin.conf", "drain", "--force", "--ignore-daemonsets", "--grace-period", "-1", "--timeout", "10s", "--delete-emptydir-data", "k8s-worker01"], "delta": "0:00:10.554703", "end": "2025-06-11 17:15:06.138801", "failed_when_result": true, "msg": "non-zero return code", "rc": 1, "start": "2025-06-11 17:14:55.584098", "stderr": "Warning: ignoring DaemonSet-managed Pods: kube-system/calico-node-22ds9, kube-system/kube-proxy-xq8rb, kube-system/nodelocaldns-h9w58, rook-ceph/csi-rbdplugin-fbfd4\nerror when evicting pods/\"example-pdb-pod\" -n \"example-ns\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.\nerror when evicting pods/\"rook-ceph-osd-9-565bdd4f
Wednesday 11 June 2025  17:15:06 +0000 (0:01:18.121)       0:07:46.626 ******** 
TASK [upgrade/pre-upgrade : Set node back to schedulable] **********************
changed: [k8s-worker01 -> k8s-master01(10.112.5.11)]
Wednesday 11 June 2025  17:15:08 +0000 (0:00:01.748)       0:07:48.375 ******** 
TASK [upgrade/pre-upgrade : Fail after rescue] *********************************
fatal: [k8s-worker01 -> k8s-master01(10.112.5.11)]: FAILED! => {"changed": false, "msg": "Failed to drain node k8s-worker01"}
NO MORE HOSTS LEFT *************************************************************
PLAY RECAP *********************************************************************
k8s-worker01               : ok=290  changed=7    unreachable=0    failed=1    skipped=381  rescued=1    ignored=0   
k8s-worker02               : ok=252  changed=4    unreachable=0    failed=0    skipped=330  rescued=0    ignored=0   
k8s-worker03               : ok=252  changed=4    unreachable=0    failed=0    skipped=330  rescued=0    ignored=0   
Wednesday 11 June 2025  17:15:08 +0000 (0:00:00.041)       0:07:48.417 ******** 
=============================================================================== 
upgrade/pre-upgrade : Drain node --------------------------------------- 78.12s
network_plugin/cni : CNI | Copy cni plugins ---------------------------- 19.51s
download : Prep_download | Register docker images info ----------------- 10.91s
download : Check_pull_required |  Generate a list of information about the images on a node --- 9.29s
download : Check_pull_required |  Generate a list of information about the images on a node --- 9.12s
download : Check_pull_required |  Generate a list of information about the images on a node --- 8.82s
download : Check_pull_required |  Generate a list of information about the images on a node --- 8.75s
download : Check_pull_required |  Generate a list of information about the images on a node --- 8.47s
download : Check_pull_required |  Generate a list of information about the images on a node --- 8.33s
download : Check_pull_required |  Generate a list of information about the images on a node --- 8.21s
kubernetes/preinstall : Ensure kubelet expected parameters are set ------ 8.04s
kubespray-defaults : Gather ansible_default_ipv4 from all hosts or specific hosts --- 6.71s
bootstrap-os : Assign inventory name to unconfigured hostnames (non-CoreOS, non-Flatcar, Suse and ClearLinux, non-Fedora) --- 6.56s
bootstrap-os : Add proxy to dnf.conf if http_proxy is defined ----------- 6.25s
download : Extract_file | Unpacking archive ----------------------------- 6.20s
bootstrap-os : Gather facts --------------------------------------------- 5.81s
kubernetes/preinstall : Create kubernetes directories ------------------- 5.65s
download : Extract_file | Unpacking archive ----------------------------- 5.18s
bootstrap-os : Ensure bash_completion.d folder exists ------------------- 4.98s
bootstrap-os : Create remote_tmp for it is used by another module ------- 4.94s

Anything else we need to know

Suggestion: set async and poll on the Drain node task in roles/upgrade/pre-upgrade/tasks/main.yml, as explained in the Ansible docs, to avoid the connection timeout.
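
A rough sketch of what that could look like, based on the kubectl invocation from the log above (illustrative only: the real task's retry logic and delegation are omitted, and inventory_hostname stands in for whatever hostname variable the task actually uses):

- name: Drain node
  command: >-
    /usr/local/bin/kubectl --kubeconfig /etc/kubernetes/admin.conf drain
    --force --ignore-daemonsets
    --grace-period {{ drain_grace_period }}
    --timeout {{ drain_timeout }}
    --delete-emptydir-data {{ inventory_hostname }}
  # run the drain asynchronously so no long-lived SSH connection is held open;
  # async must comfortably exceed drain_timeout (360s by default)
  async: 600
  poll: 10

With async/poll, the controller starts the drain, returns immediately, and then re-connects every poll seconds to check on it, so the per-connection timeout no longer has to cover the full drain duration.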
