fix: GatewayAPI installation, condition not satisfied when role called with delegate_to and run_once #12279

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

aviral-agarwal
Contributor

@aviral-agarwal aviral-agarwal commented Jun 2, 2025

What type of PR is this?
/kind bug

What this PR does / why we need it:
GatewayAPI is not installed even when gateway_api_enabled is set to true.

Root cause:

  • When gateway_api_enabled is set to true, GatewayAPI is installed by the role kubernetes-apps/gateway_api
  • Since Kubespray 2.28.0, the role is called with
    delegate_to: "{{ groups['kube_control_plane'][0] }}", i.e. the first control plane node
    run_once: true, i.e. the first available host among the hosts of the play that calls the role is selected, and the role is executed for that host
  • The play Invoke kubeadm and install a CNI, which calls the role kubernetes-apps/gateway_api, has hosts: k8s_cluster (control plane and worker nodes)
  • With run_once, the first available host is selected (not necessarily the first control plane node) and the Ansible variable inventory_hostname is set to that host
    (In my case it was the first worker node, though I did not explore in depth how exactly k8s_cluster is formed)
  • Hence the following condition is not met, the tasks are skipped, and GatewayAPI is not installed even though gateway_api_enabled is set to true
when:
  - inventory_hostname == groups['kube_control_plane'][0]
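
A minimal sketch of the behaviour described above, assuming an inventory in which the groups k8s_cluster and kube_control_plane are both defined. The file name repro.yml and the task names are hypothetical and not part of Kubespray; the sketch only illustrates how run_once combined with delegate_to interacts with a when condition on inventory_hostname.

```yaml
# repro.yml (hypothetical file, not part of Kubespray)
- name: Reproduce run_once + delegate_to with a condition on inventory_hostname
  hosts: k8s_cluster          # assumption: this group exists in the inventory
  gather_facts: false
  tasks:
    - name: Show the original host and the first control plane node
      ansible.builtin.debug:
        msg: >-
          inventory_hostname={{ inventory_hostname }},
          first control plane={{ groups['kube_control_plane'][0] }}
      delegate_to: "{{ groups['kube_control_plane'][0] }}"
      run_once: true

    - name: Only runs when the original host is the first control plane node
      ansible.builtin.debug:
        msg: "when condition satisfied"
      delegate_to: "{{ groups['kube_control_plane'][0] }}"
      run_once: true
      when:
        # Evaluated against the original host, not the delegated one
        - inventory_hostname == groups['kube_control_plane'][0]
```

If the host picked for run_once is not the first control plane node, the first task should print two different names and the second task should be reported as skipped.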

Which issue(s) this PR fixes:
I did not open any issue for this, nor could I find any open ones
I saw comments to the same effect in PR #12189

Special notes for your reviewer:

  • from https://docs.ansible.com/ansible/latest/reference_appendices/playbooks_keywords.html
    delegate_to: Host to execute task instead of the target (inventory_hostname). Connection vars from the delegated host will also be used for the task.
    run_once: Boolean that will bypass the host loop, forcing the task to attempt to execute on the first host available and afterward apply any results and facts to all active hosts in the same batch.
  • Since delegate_to on the role call already ensures that the first control plane node executes the tasks, regardless of which host the role runs for (that host is selected by run_once and can differ from the delegate_to host, an Ansible nuance it seems),
    we do not need the when condition anyway

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 2, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aviral-agarwal
Once this PR has been reviewed and has the lgtm label, please assign mzaian for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from ErikJiang and VannTen June 2, 2025 16:05
@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 2, 2025
@k8s-ci-robot
Contributor

Hi @aviral-agarwal. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@yankay
Member

yankay commented Jun 3, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 3, 2025
@tico88612
Member

@aviral-agarwal I can't reproduce the problem you described, can you provide the inventory and variables?

@aviral-agarwal
Contributor Author

Sure,

There is only one variable relevant to GatewayAPI (I am not sure whether any other variable is affecting this particular flow; the variables used in the tasks were picked up properly and use their defaults)
in kubespray\inventory\<inventory name>\group_vars\k8s_cluster\addons.yml
gateway_api_enabled: true

inventory.ini

I only removed the SSH authentication details; the order of the nodes remains the same

# This inventory describe a HA typology with stacked etcd (== same nodes as control plane)
# and 3 worker nodes
# See https://docs.ansible.com/ansible/latest/inventory_guide/intro_inventory.html
# for tips on building your # inventory

# Configure 'ip' variable to bind kubernetes services on a different ip than the default iface
# We should set etcd_member_name for etcd cluster. The node that are not etcd members do not need to set the value,
# or can set the empty string value.
[kube_control_plane]
# node1 ansible_host=95.54.0.12  # ip=10.3.0.1 etcd_member_name=etcd1
# node2 ansible_host=95.54.0.13  # ip=10.3.0.2 etcd_member_name=etcd2
# node3 ansible_host=95.54.0.14  # ip=10.3.0.3 etcd_member_name=etcd3
vm1-private
vm2-private
vm3-private

[etcd:children]
kube_control_plane

[kube_node]
# node4 ansible_host=95.54.0.15  # ip=10.3.0.4
# node5 ansible_host=95.54.0.16  # ip=10.3.0.5
# node6 ansible_host=95.54.0.17  # ip=10.3.0.6
vm4-private
vm5-private
vm6-private

@aviral-agarwal
Contributor Author

Providing the output as well for reference

I added debug tasks at the start and end of the role kubernetes-apps/gateway_api (the file kubespray\roles\kubernetes-apps\gateway_api\tasks\main.yml) to get better clarity

Debug Tasks

at the beginning of kubernetes-apps/gateway_api

- name: Start of GatewayAPI
  pause:
    prompt: |
      Start of GatewayAPI
      Press Enter to resume...
- name: Debug | Show current host and first control plane
  debug:
    msg: "Current: {{ inventory_hostname }}, First CP: {{ groups['kube_control_plane'][0] }}"

- name: Debug When | Show current host and first control plane
  debug:
    msg: "With when condition:Current: {{ inventory_hostname }}, First CP: {{ groups['kube_control_plane'][0] }}"
  when:
    - inventory_hostname == groups['kube_control_plane'][0]

at the end of kubernetes-apps/gateway_api

- name: End of GatewayAPI | Check if Gateway API installed
  pause:
    prompt: |
      End of GatewayAPI | Check if Gateway API installed
      check if "{{ kube_config_dir }}/addons/gateway_api/{{ gateway_api_channel }}-install.yaml" exists
      Press Enter to resume...

output logs of kubernetes-apps/gateway_api

Kubespray executed with the following command, with -v for verbosity
ansible-playbook -i inventory/k8s-tpfm/inventory.ini --become --become-user=root cluster.yml -v

Observe that:

  • The debug task Debug When | ... with the when condition is skipped
  • The debug task Debug | ... shows that inventory_hostname is vm4-private, whereas the first control plane node groups['kube_control_plane'][0] is vm1-private, which explains why the when condition is not satisfied
  • GatewayAPI tasks with the when condition are silently skipped, i.e. they produce no output in the logs
[kubernetes-apps/gateway_api : Start of GatewayAPI]
Start of GatewayAPI
Press Enter to resume...
:

TASK [kubernetes-apps/gateway_api : Start of GatewayAPI] *****************************************************************************************************************************************************
ok: [vm4-private -> vm1-private(vm1.private.k8s-1.tpfm.aviralagarwal.org)] => {"changed": false, "delta": 10, "echo": true, "rc": 0, "start": "2025-06-03 05:57:37.113195", "stderr": "", "stdout": "Paused for 0.17 minutes", "stop": "2025-06-03 05:57:47.224024", "user_input": ""}
Tuesday 03 June 2025  05:57:47 +0000 (0:00:10.165)       0:17:01.318 **********

TASK [kubernetes-apps/gateway_api : Debug | Show current host and first control plane] ***********************************************************************************************************************
ok: [vm4-private -> vm1-private(vm1.private.k8s-1.tpfm.aviralagarwal.org)] => {
    "msg": "Current: vm4-private, First CP: vm1-private"
}
Tuesday 03 June 2025  05:57:47 +0000 (0:00:00.056)       0:17:01.375 **********
Tuesday 03 June 2025  05:57:47 +0000 (0:00:00.047)       0:17:01.423 **********

TASK [kubernetes-apps/gateway_api : Gateway API | Download YAML] *********************************************************************************************************************************************
included: /workdir/The-Platform.Infrastructure.Kubernetes-Cluster-1/kubespray/roles/kubernetes-apps/gateway_api/tasks/../../../download/tasks/download_file.yml for vm4-private
Tuesday 03 June 2025  05:57:47 +0000 (0:00:00.069)       0:17:01.493 **********

TASK [kubernetes-apps/gateway_api : Prep_download | Set a few facts] *****************************************************************************************************************************************
ok: [vm4-private -> vm1-private(vm1.private.k8s-1.tpfm.aviralagarwal.org)] => {"ansible_facts": {"download_force_cache": false}, "changed": false}
Tuesday 03 June 2025  05:57:48 +0000 (0:00:00.611)       0:17:02.104 **********
Tuesday 03 June 2025  05:57:48 +0000 (0:00:00.045)       0:17:02.150 **********

TASK [kubernetes-apps/gateway_api : Download_file | Set pathname of cached file] *****************************************************************************************************************************
ok: [vm4-private -> vm1-private(vm1.private.k8s-1.tpfm.aviralagarwal.org)] => {"ansible_facts": {"file_path_cached": "/tmp/kubespray_cache/gateway-api-standard-install.yaml"}, "changed": false}
Tuesday 03 June 2025  05:57:49 +0000 (0:00:01.150)       0:17:03.300 **********

TASK [kubernetes-apps/gateway_api : Download_file | Create dest directory on node] ***************************************************************************************************************************
changed: [vm4-private -> vm1-private(vm1.private.k8s-1.tpfm.aviralagarwal.org)] => {"changed": true, "gid": 0, "group": "root", "mode": "0755", "owner": "root", "path": "/tmp/releases", "size": 4096, "state": "directory", "uid": 0}
Tuesday 03 June 2025  05:57:51 +0000 (0:00:02.063)       0:17:05.363 **********
Tuesday 03 June 2025  05:57:51 +0000 (0:00:00.035)       0:17:05.399 **********
Tuesday 03 June 2025  05:57:51 +0000 (0:00:00.046)       0:17:05.445 **********

TASK [kubernetes-apps/gateway_api : Download_file | Download item] *******************************************************************************************************************************************
changed: [vm4-private] => {"attempts": 1, "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": true}
Tuesday 03 June 2025  05:57:58 +0000 (0:00:07.359)       0:17:12.805 **********
Tuesday 03 June 2025  05:57:58 +0000 (0:00:00.045)       0:17:12.851 **********
Tuesday 03 June 2025  05:57:58 +0000 (0:00:00.045)       0:17:12.896 **********
Tuesday 03 June 2025  05:57:58 +0000 (0:00:00.051)       0:17:12.948 **********

TASK [kubernetes-apps/gateway_api : Download_file | Extract file archives] ***********************************************************************************************************************************
included: /workdir/The-Platform.Infrastructure.Kubernetes-Cluster-1/kubespray/roles/download/tasks/extract_file.yml for vm4-private
Tuesday 03 June 2025  05:57:58 +0000 (0:00:00.064)       0:17:13.012 **********
Tuesday 03 June 2025  05:57:59 +0000 (0:00:00.614)       0:17:13.627 **********
Tuesday 03 June 2025  05:57:59 +0000 (0:00:00.048)       0:17:13.676 **********
Tuesday 03 June 2025  05:57:59 +0000 (0:00:00.053)       0:17:13.729 **********
Tuesday 03 June 2025  05:57:59 +0000 (0:00:00.046)       0:17:13.776 **********
[kubernetes-apps/gateway_api : End of GatewayAPI | Check if Gateway API installed]
End of GatewayAPI | Check if Gateway API installed
check if "/etc/kubernetes/addons/gateway_api/standard-install.yaml" exists
Press Enter to resume...
:

on vm1-private (my first control plane node)

Checking execution of task: Gateway API | Download YAML
gateway-api-standard-install.yaml is downloaded in /tmp/releases
Task is successful, there is no when condition on this task

root@vm1-private:~# cd /tmp/releases
root@vm1-private:/tmp/releases# ls -la
total 613600
drwxr-xr-x  4 root root        4096 Jun  3 05:52 .
drwxrwxrwt 13 root root        4096 Jun  3 05:57 ..
-rwxr-xr-x  1 root    118 149233848 Apr  3 21:14 cilium
-rwxr-xr-x  1 root root    58328485 Jun  3 05:50 cilium-0.18.3-amd64.tar.gz
-rwxr-xr-x  1 root root    48133828 Jun  3 05:48 cni-plugins-linux-amd64-1.4.1.tgz
-rwxr-xr-x  1 root root    36968652 Jun  3 05:47 containerd-2.0.5-linux-amd64.tar.gz
-rwxr-xr-x  1 root root       22657 May  1 03:11 containerd-rootless-setuptool.sh
-rwxr-xr-x  1 root root        8708 May  1 03:11 containerd-rootless.sh
-rwxr-xr-x  1 root    127  40076447 Dec  9 09:09 crictl
-rwxr-xr-x  1 root root    19100418 Jun  3 05:45 crictl-1.32.0-linux-amd64.tar.gz
-rwxr-xr-x  1 root root    20486388 Jun  3 05:52 etcd-3.5.16-linux-amd64.tar.gz
drwxr-xr-x  3 root ubuntu      4096 Sep 10  2024 etcd-v3.5.16-linux-amd64
-rwxr-xr-x  1 root root      616803 Jun  3 05:52 gateway-api-standard-install.yaml
drwxr-xr-x  2 root root        4096 Jun  3 05:47 images
-rwxr-xr-x  1 root root    70951064 Jun  3 05:48 kubeadm-1.32.5-amd64
-rwxr-xr-x  1 root root    57327768 Jun  3 05:52 kubectl-1.32.5-amd64
-rwxr-xr-x  1 root root    77410564 Jun  3 05:48 kubelet-1.32.5-amd64
-rwxr-xr-x  1 root root    27738296 May  1 03:11 nerdctl
-rwxr-xr-x  1 root root    10316369 Jun  3 05:46 nerdctl-2.0.5-linux-amd64.tar.gz
-rwxr-xr-x  1 root root    11546208 Jun  3 05:44 runc-1.2.6.amd64

Checking execution of task: Gateway API | Create addon dir
addon directory is not created in kube_config_dir i.e. /etc/kubernetes
has when condition

root@vm1-private:/etc/kubernetes# ls -la
total 104
drwxr-xr-x   4 root root  4096 Jun  3 05:56 .
drwxr-xr-x 113 root root 12288 Jun  3 05:55 ..
-rw-------   1 root root  5669 Jun  3 05:56 admin.conf
-rw-------   1 root root  5697 Jun  3 05:56 controller-manager.conf
-rw-------   1 root root  5677 Jun  3 05:56 controller-manager.conf.7418.2025-06-03@05:56:51~
-rw-r-----   1 root root  5163 Jun  3 05:56 kubeadm-config.yaml
-rw-r--r--   1 root root   457 Jun  3 05:48 kubeadm-images.yaml
-rw-------   1 root root  1067 Jun  3 05:55 kubelet-config.yaml
-rw-------   1 root root  2005 Jun  3 05:56 kubelet.conf
-rw-------   1 root root   446 Jun  3 05:55 kubelet.env
-rw-r--r--   1 root root   199 Jun  3 05:56 kubescheduler-config.yaml
drwxr-xr-x   2 root root  4096 Jun  3 05:56 manifests
-rw-r-----   1 root root   408 Jun  3 05:56 node-crb.yml
lrwxrwxrwx   1 root root    19 Jun  3 05:42 pki -> /etc/kubernetes/ssl
-rw-------   1 root root  5645 Jun  3 05:56 scheduler.conf
-rw-------   1 root root  5625 Jun  3 05:56 scheduler.conf.7430.2025-06-03@05:56:51~
drwxr-xr-x   2 root root  4096 Jun  3 05:56 ssl
-rw-------   1 root root  5697 Jun  3 05:56 super-admin.conf
root@vm1-private:/etc/kubernetes# cd addon
-bash: cd: addon: No such file or directory

Checking execution of task: Gateway API | Copy YAML from download dir
gateway-api-standard-install.yaml file not copied from /tmp/releases to "{{ kube_config_dir }}/addons/gateway_api/{{ gateway_api_channel }}-install.yaml"
has when condition

Checking execution of task: Gateway API | Install Gateway API
GatewayAPI CRDs not installed

root@vm1-private:/etc/kubernetes# kubectl get crds -A
No resources found

Note:

  • GatewayAPI CRDs were installed successfully once the when condition was commented out
  • There is no addon directory on vm4-private either

on vm4-private

root@vm4-private:/tmp/releases# ls -la
total 537608
drwxr-xr-x  3 root root      4096 Jun  3 05:57 .
drwxrwxrwt 13 root root      4096 Jun  3 05:57 ..
-rwxr-xr-x  1 root  118 149233848 Apr  3 21:14 cilium
-rwxr-xr-x  1 root root  58328485 Jun  3 05:50 cilium-0.18.3-amd64.tar.gz
-rwxr-xr-x  1 root root  48133828 Jun  3 05:48 cni-plugins-linux-amd64-1.4.1.tgz
-rwxr-xr-x  1 root root  36968652 Jun  3 05:47 containerd-2.0.5-linux-amd64.tar.gz
-rwxr-xr-x  1 root root     22657 May  1 03:11 containerd-rootless-setuptool.sh
-rwxr-xr-x  1 root root      8708 May  1 03:11 containerd-rootless.sh
-rwxr-xr-x  1 root  127  40076447 Dec  9 09:09 crictl
-rwxr-xr-x  1 root root  19100418 Jun  3 05:44 crictl-1.32.0-linux-amd64.tar.gz
-rwxr-xr-x  1 root root    616803 Jun  3 05:57 gateway-api-standard-install.yaml
drwxr-xr-x  2 root root      4096 Jun  3 05:47 images
-rwxr-xr-x  1 root root  70951064 Jun  3 05:48 kubeadm-1.32.5-amd64
-rwxr-xr-x  1 root root  77410564 Jun  3 05:48 kubelet-1.32.5-amd64
-rwxr-xr-x  1 root root  27738296 May  1 03:11 nerdctl
-rwxr-xr-x  1 root root  10316369 Jun  3 05:46 nerdctl-2.0.5-linux-amd64.tar.gz
-rwxr-xr-x  1 root root  11546208 Jun  3 05:44 runc-1.2.6.amd64
root@vm4-private:/tmp/releases# cd /etc/kubernetes/
root@vm4-private:/etc/kubernetes# ls -la
total 40
drwxr-xr-x   4 root root  4096 Jun  3 05:57 .
drwxr-xr-x 113 root root 12288 Jun  3 05:55 ..
-rw-r-----   1 root root   490 Jun  3 05:57 kubeadm-client.conf
-rw-------   1 root root  1068 Jun  3 05:55 kubelet-config.yaml
-rw-------   1 root root  1978 Jun  3 05:57 kubelet.conf
-rw-------   1 root root   447 Jun  3 05:55 kubelet.env
drwxr-xr-x   2 root root  4096 Jun  3 05:42 manifests
lrwxrwxrwx   1 root root    19 Jun  3 05:42 pki -> /etc/kubernetes/ssl
drwxr-xr-x   2 root root  4096 Jun  3 05:57 ssl

@VannTen
Contributor

VannTen commented Jun 5, 2025

When gateway_api_enabled is set to true, GatewayAPI is installed using role kubernetes-apps/gateway_api
Since Kubespray 2.28.0, the role is called with
delegate_to: "{{ groups['kube_control_plane'][0] }}" i.e. first control plane node
run_once: true i.e., select first host available out of the nodes mentioned in hosts of the ansible task in which the role is called, and execute the role on that node

run_once + delegate_to will delegate to the mentioned host, actually.

The problem, I think, is here:


TASK [kubernetes-apps/gateway_api : Download_file | Create dest directory on node] ***************************************************************************************************************************
changed: [vm4-private -> vm1-private(vm1.private.k8s-1.tpfm.aviralagarwal.org)] => {"changed": true, "gid": 0, "group": "root", "mode": "0755", "owner": "root", "path": "/tmp/releases", "size": 4096, "state": "directory", "uid": 0}
Tuesday 03 June 2025  05:57:51 +0000 (0:00:02.063)       0:17:05.363 **********
Tuesday 03 June 2025  05:57:51 +0000 (0:00:00.035)       0:17:05.399 **********
Tuesday 03 June 2025  05:57:51 +0000 (0:00:00.046)       0:17:05.445 **********

TASK [kubernetes-apps/gateway_api : Download_file | Download item] *******************************************************************************************************************************************
changed: [vm4-private] => {"attempts": 1, "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": true}
Tuesday 03 June 2025  05:57:58 +0000 (0:00:07.359)       0:17:12.805 **********

Note that the first of those tasks is delegated, not the second.

- name: Download_file | Create cache directory on download_delegate host
  file:
    path: "{{ file_path_cached | dirname }}"
    state: directory
    recurse: true
  delegate_to: "{{ download_delegate }}"
  delegate_facts: false
  run_once: true
  when:
    - download_force_cache
    - not download_localhost

# This must always be called, to check if the checksum matches. On no-match the file is re-downloaded.
# This task will avoid logging it's parameters to not leak environment passwords in the log
- name: Download_file | Download item
  get_url:
    url: "{{ download.url }}"
    dest: "{{ file_path_cached if download_force_cache else download.dest }}"
    owner: "{{ omit if download_localhost else (download.owner | default(omit)) }}"
    mode: "{{ omit if download_localhost else (download.mode | default(omit)) }}"
    checksum: "{{ download.checksum }}"
    validate_certs: "{{ download_validate_certs }}"
    url_username: "{{ download.username | default(omit) }}"
    url_password: "{{ download.password | default(omit) }}"
    force_basic_auth: "{{ download.force_basic_auth | default(omit) }}"
    timeout: "{{ download.timeout | default(omit) }}"
  delegate_to: "{{ download_delegate if download_force_cache else inventory_hostname }}"

delegate_to: "{{ download_delegate if download_force_cache else inventory_hostname }}": I think this overrides the delegate_to for the Download item task, but since download_delegate is not set, it does not affect the first of those tasks.

(Yeah the download role is in dire need of a refactor)
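
To check how that expression resolves on a given run, a temporary debugging task along these lines could help. This is a hypothetical sketch, not part of the download role; it assumes it is placed where the role's variables are in scope (for example next to the Download item task).

```yaml
# Hypothetical debugging task, not part of the Kubespray download role.
- name: Debug | Show where "Download_file | Download item" would be delegated
  ansible.builtin.debug:
    msg: >-
      download_force_cache={{ download_force_cache }},
      delegate target={{ download_delegate if download_force_cache else inventory_hostname }}
```

In the logs posted earlier in this thread download_force_cache is false, so the expression would resolve to inventory_hostname, which is consistent with Download item being reported on vm4-private without delegation.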

@aviral-agarwal
Contributor Author

In the role kubernetes-apps/gateway_api (kubespray\roles\kubernetes-apps\gateway_api\tasks\main.yml):
After commenting out the when condition, the tasks are no longer skipped, and I can now see in the logs that they are properly delegated to the first control plane node, as expected

Output of the tasks (which had the when condition, now commented out)

TASK [kubernetes-apps/gateway_api : Gateway API | Create addon dir] ************************************************************************************
changed: [vm4-private -> vm1-private(vm1.private.k8s-1.tpfm.aviralagarwal.org)] => {"changed": true, "gid": 0, "group": "root", "mode": "0755", "owner": "root", "path": "/etc/kubernetes/addons/gateway_api", "secontext": "system_u:object_r:etc_t:s0", "size": 4096, "state": "directory", "uid": 0}
Thursday 05 June 2025  09:27:25 +0000 (0:00:00.794)       0:23:35.753 *********

TASK [kubernetes-apps/gateway_api : Gateway API | Copy YAML from download dir] *************************************************************************
changed: [vm4-private -> vm1-private(vm1.private.k8s-1.tpfm.aviralagarwal.org)] => {"changed": true, "checksum": "8708d774ac92965386a65a9364eaa1fec6bd732d", "dest": "/etc/kubernetes/addons/gateway_api/standard-install.yaml", "gid": 0, "group": "root", "md5sum": "036bb7c2b8efa8fbce640bf3e5bec95b", "mode": "0644", "owner": "root", "secontext": "system_u:object_r:etc_t:s0", "size": 616803, "src": "/tmp/releases/gateway-api-standard-install.yaml", "state": "file", "uid": 0}
Thursday 05 June 2025  09:27:26 +0000 (0:00:00.938)       0:23:36.692 *********

TASK [kubernetes-apps/gateway_api : Gateway API | Install Gateway API] *********************************************************************************
ok: [vm4-private -> vm1-private(vm1.private.k8s-1.tpfm.aviralagarwal.org)] => {"changed": false, "msg": "success: customresourcedefinition.apiextensions.k8s.io/gatewayclasses.gateway.networking.k8s.io created customresourcedefinition.apiextensions.k8s.io/gateways.gateway.networking.k8s.io created customresourcedefinition.apiextensions.k8s.io/grpcroutes.gateway.networking.k8s.io created customresourcedefinition.apiextensions.k8s.io/httproutes.gateway.networking.k8s.io created customresourcedefinition.apiextensions.k8s.io/referencegrants.gateway.networking.k8s.io created"}

Correct me if I am wrong, but inventory_hostname for these tasks will be vm4-private.
In that case, the following condition will not be met

when:
  - inventory_hostname == groups['kube_control_plane'][0]

Why is vm4-private (not the first control plane node) being selected?
As far as I can understand from the documentation, due to run_once: true, it is chosen from the hosts of the calling play

Maybe the delegation issue in the download role is limited to the tasks within it, and when control returns to the gateway_api role, delegation works as expected;
it is just that inventory_hostname is not the same as the host to which the task is delegated (maybe an Ansible nuance)

@VannTen
Contributor

VannTen commented Jun 5, 2025 via email

@aviral-agarwal
Contributor Author

aviral-agarwal commented Jun 5, 2025

I agree, tasks will run once on delegate_to host and not for each host

but tasks are not being executed because condition "inventory_hostname == groups['kube_control_plane'][0]" is not met

I debugged groups['k8s_cluster'] and result is ['vm4-private', 'vm5-private', 'vm6-private', 'vm1-private', 'vm2-private', 'vm3-private']
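
For reference, a debugging task of roughly this shape can be used to print the group order. It is a hypothetical sketch, assumed to run inside the Kubespray play where the k8s_cluster group is already defined.

```yaml
# Hypothetical debugging task; assumes the k8s_cluster group is defined
# at the point in the play where this task runs.
- name: Debug | Show host order in k8s_cluster
  ansible.builtin.debug:
    msg:
      k8s_cluster: "{{ groups['k8s_cluster'] }}"
      play_hosts: "{{ ansible_play_hosts }}"
      first_control_plane: "{{ groups['kube_control_plane'][0] }}"
  run_once: true
```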

Please note the following distinction between delegated host and original host

As per the following line from the official documentation
https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_strategies.html#running-on-a-single-machine-with-run-once
"As always with delegation, the action will be executed on the delegated host, but the information is still that of the original host in the task."

  • Could it be that the original host is chosen as the first one from k8s_cluster, i.e. vm4-private (which populates the inventory_hostname variable), and the actual execution then happens on the delegate_to host, i.e. vm1-private?
  • In that case execution happens only once and on the delegate_to host, which also explains why inventory_hostname is not the same as the delegate_to host
  • It also explains the vm4-private -> vm1-private notation in the logs
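
The distinction between the original host and the delegated host can be checked with a sketch like the following. It is a hypothetical playbook, assuming an inventory that defines k8s_cluster and kube_control_plane; the registered variable name is made up for illustration.

```yaml
# Hypothetical playbook, not part of Kubespray.
- name: Compare where a task executes with what inventory_hostname reports
  hosts: k8s_cluster          # assumption: this group exists in the inventory
  gather_facts: false
  tasks:
    - name: Run hostname on the delegated host
      ansible.builtin.command: hostname
      register: delegated_hostname   # hypothetical variable name
      changed_when: false
      delegate_to: "{{ groups['kube_control_plane'][0] }}"
      run_once: true

    - name: Report the executing machine and the original host
      ansible.builtin.debug:
        msg: >-
          command ran on {{ delegated_hostname.stdout }},
          inventory_hostname is {{ inventory_hostname }}
      run_once: true
```

If the hypothesis above holds, the command output should name the first control plane machine while inventory_hostname still names the host selected for run_once.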

@VannTen
Contributor

VannTen commented Jun 6, 2025

Run_once does not necessarily select the first host, it's random (or at least, undefined). But yes, you're correct IIRC that the variables are still those of the original host, in particular inventory_hostname, so the when conditions are a problem.

But I think the conflicting delegate_to keywords (on the import in the gateway_api role and directly in the imported download role file) are also problematic (I haven't tested this, though)

@aviral-agarwal
Contributor Author

Yeah, I do see the multiple delegation problem here
Are there any plans for download role refactor?

If not, then we can

  • either remove delegate_to + run_once from the gateway_api role call and rely only on the when condition to execute on the first control plane node only
  • or simply remove the when condition (as in this PR; I checked that GatewayAPI CRD installation works)
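
The two options can be sketched as follows. These are fragments from different files shown in one stream for comparison; the role-call keywords and the task body are approximations reconstructed from this thread (the when: gateway_api_enabled on the role call in particular is an assumption), not copies of the Kubespray sources.

```yaml
# Option 1, playbook fragment: no delegation on the role call.
# Every host in the play evaluates the tasks; the when condition in the
# role keeps only the first control plane node.
- name: Option 1 | role call without delegate_to / run_once
  hosts: k8s_cluster
  roles:
    - role: kubernetes-apps/gateway_api
      when: gateway_api_enabled        # assumption
---
# Option 1, role task fragment: the condition stays.
- name: Gateway API | Create addon dir
  ansible.builtin.file:
    path: "{{ kube_config_dir }}/addons/gateway_api"
    state: directory
    owner: root
    mode: "0755"
  when:
    - inventory_hostname == groups['kube_control_plane'][0]
---
# Option 2, playbook fragment: delegation on the role call (as in this PR).
- name: Option 2 | role call with delegate_to + run_once
  hosts: k8s_cluster
  roles:
    - role: kubernetes-apps/gateway_api
      when: gateway_api_enabled        # assumption
      delegate_to: "{{ groups['kube_control_plane'][0] }}"
      run_once: true
---
# Option 2, role task fragment: no condition on inventory_hostname,
# because the original host may be any member of k8s_cluster.
- name: Gateway API | Create addon dir
  ansible.builtin.file:
    path: "{{ kube_config_dir }}/addons/gateway_api"
    state: directory
    owner: root
    mode: "0755"
```

Option 1 leaves the tasks visible as skipped on every other host, while option 2 runs them once but depends on delegation being applied consistently to every task the role imports.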

@diablinux

I'm facing this issue, the task is being delegated to my first worker node in the inventory [worker01]. Here you can see the file gateway-api-standard-install.yaml in the folder /tmp/releases in worker01 server. So no gateway-api CRDs are created.

total 537572
-rwxr-xr-x. 1 root    118 149233848 Apr  3 21:14 cilium
-rwxr-xr-x. 1 root root    58328485 Jun  7 03:46 cilium-0.18.3-amd64.tar.gz
-rwxr-xr-x. 1 root root    48133828 Jun  7 03:46 cni-plugins-linux-amd64-1.4.1.tgz
-rwxr-xr-x. 1 root root    36968652 Jun  7 03:45 containerd-2.0.5-linux-amd64.tar.gz
-rwxr-xr-x. 1 root root       22657 May  1 03:11 containerd-rootless-setuptool.sh
-rwxr-xr-x. 1 root root        8708 May  1 03:11 containerd-rootless.sh
-rwxr-xr-x. 1 root docker  40076447 Dec  9 09:09 crictl
-rwxr-xr-x. 1 root root    19100418 Jun  7 03:45 crictl-1.32.0-linux-amd64.tar.gz
-rwxr-xr-x. 1 root root      598441 Jun  7 04:21 gateway-api-standard-install.yaml
drwxr-xr-x. 2 root root          40 Jun  7 03:46 images
-rwxr-xr-x. 1 root root    70951064 Jun  7 03:46 kubeadm-1.32.5-amd64
-rwxr-xr-x. 1 root root    77410564 Jun  7 03:46 kubelet-1.32.5-amd64
-rwxr-xr-x. 1 root root    27738296 May  1 03:11 nerdctl
-rwxr-xr-x. 1 root root    10316369 Jun  7 03:45 nerdctl-2.0.5-linux-amd64.tar.gz
-rwxr-xr-x. 1 root root    11546208 Jun  7 03:45 runc-1.2.6.amd64

If I comment out the when statements in roles/kubernetes-apps/gateway_api/tasks/main.yml, as in @aviral-agarwal's PR, it works as expected and the Gateway API CRDs are created.

So @VannTen is right, run_once seems to be random.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

6 participants