Description
What happened?
I am reinstalling my cluster from scratch, based on Ubuntu Cloud Minimal 24.04, as before.
The cluster.yml playbook runs without any error during the Ansible run.
However, when checking the cluster status after installation:
- all nodes are NotReady
- all pods from daemonset/cilium are Init:CrashLoopBackOff
- the cilium pods' mount-cgroup init container logs report (see the kubectl sketch right after this list):
cp: cannot create regular file '/hostbin/cilium-mount': Permission denied
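A quick way to pull that message straight from the failing init container (pod name and kubeconfig path are the ones from my cluster, shown further below; they will differ elsewhere):
$ export KUBECONFIG=./k8stst/artifacts/admin.conf
$ kubectl -n kube-system logs cilium-68x2t -c mount-cgroup
cp: cannot create regular file '/hostbin/cilium-mount': Permission denied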
What did you expect to happen?
Pods from daemonset/cilium should be Running so that the nodes become Ready; the other pods that are currently Pending would then start as well.
How can we reproduce it (as minimally and precisely as possible)?
- create instances for the master and worker nodes, based on the upstream qcow2 Ubuntu Cloud Minimal 24.04 image
- add users and ssh keys so that Ansible can connect
- install iputils-ping; otherwise the preflight checks fail (I will propose a change for this as soon as the present issue is solved)
- run cluster.yml at commit 92e8ac9, plus a cherry-pick of my fix from "Fix template syntax for cilium-values.yaml when encryption is enabled" #12244
- the Ansible playbook's final recap reports no failures:
PLAY RECAP **********************************************************************************************************************************************************************
k8ststmaster-1 : ok=484 changed=134 unreachable=0 failed=0 skipped=772 rescued=0 ignored=4
k8ststmaster-2 : ok=417 changed=112 unreachable=0 failed=0 skipped=725 rescued=0 ignored=3
k8ststmaster-3 : ok=419 changed=113 unreachable=0 failed=0 skipped=723 rescued=0 ignored=3
k8ststworker-1 : ok=316 changed=81 unreachable=0 failed=0 skipped=472 rescued=0 ignored=1
k8ststworker-2 : ok=316 changed=81 unreachable=0 failed=0 skipped=472 rescued=0 ignored=1
k8ststworker-3 : ok=316 changed=81 unreachable=0 failed=0 skipped=472 rescued=0 ignored=1
k8ststworker-4 : ok=316 changed=81 unreachable=0 failed=0 skipped=472 rescued=0 ignored=1
k8ststworker-5 : ok=316 changed=81 unreachable=0 failed=0 skipped=472 rescued=0 ignored=1
- define the new ${KUBECONFIG} for apiserver auth
KUBECONFIG=./k8stst/artifacts/admin.conf
- get nodes
$ kubectl get node
NAME STATUS ROLES AGE VERSION
k8ststmaster-1 NotReady control-plane 34m v1.32.5
k8ststmaster-2 NotReady control-plane 33m v1.32.5
k8ststmaster-3 NotReady control-plane 33m v1.32.5
k8ststworker-1 NotReady worker 30m v1.32.5
k8ststworker-2 NotReady worker 30m v1.32.5
k8ststworker-3 NotReady worker 30m v1.32.5
k8ststworker-4 NotReady worker 30m v1.32.5
k8ststworker-5 NotReady worker 30m v1.32.5
- get pods in kube-system
$ kubectl get pod -n kube-system
NAME READY STATUS RESTARTS AGE
cilium-68x2t 0/1 Init:CrashLoopBackOff 17 (55s ago) 38m
cilium-envoy-5l64j 1/1 Running 0 38m
cilium-envoy-d6ctm 1/1 Running 0 38m
cilium-envoy-knl8c 1/1 Running 0 38m
cilium-envoy-mqw8s 1/1 Running 0 38m
cilium-envoy-rp5bs 1/1 Running 0 38m
cilium-envoy-ssgfm 1/1 Running 0 38m
cilium-envoy-wvcpd 1/1 Running 0 38m
cilium-envoy-zc5pq 1/1 Running 0 38m
cilium-g24gf 0/1 Init:CrashLoopBackOff 12 (2m8s ago) 38m
cilium-h9cb9 0/1 Init:CrashLoopBackOff 12 (111s ago) 38m
cilium-hh7nr 0/1 Init:CrashLoopBackOff 17 (66s ago) 38m
cilium-ntnnz 0/1 Init:CrashLoopBackOff 12 (92s ago) 38m
cilium-operator-bd445f8f5-4bbzt 1/1 Running 0 38m
cilium-operator-bd445f8f5-w6vt6 1/1 Running 0 38m
cilium-wxhqp 0/1 Init:CrashLoopBackOff 17 (51s ago) 38m
cilium-xbd8z 0/1 Init:CrashLoopBackOff 17 (59s ago) 38m
cilium-zhdlt 0/1 Init:CrashLoopBackOff 17 (27s ago) 38m
coredns-5c54f84c97-27p9v 0/1 Pending 0 28m
dns-autoscaler-56cb45595c-5sjp7 0/1 Pending 0 28m
hubble-relay-7b4c9d4474-r56b4 0/1 Pending 0 38m
hubble-ui-76d4965bb6-ptbdm 0/2 Pending 0 38m
kube-apiserver-k8ststmaster-1 1/1 Running 1 43m
kube-apiserver-k8ststmaster-2 1/1 Running 1 43m
kube-apiserver-k8ststmaster-3 1/1 Running 1 42m
kube-controller-manager-k8ststmaster-1 1/1 Running 2 43m
kube-controller-manager-k8ststmaster-2 1/1 Running 1 43m
kube-controller-manager-k8ststmaster-3 1/1 Running 1 42m
kube-scheduler-k8ststmaster-1 1/1 Running 2 43m
kube-scheduler-k8ststmaster-2 1/1 Running 1 43m
kube-scheduler-k8ststmaster-3 1/1 Running 1 42m
metrics-server-964649464-4bm8p 0/1 Pending 0 27m
metrics-server-964649464-bzcct 0/1 Pending 0 27m
metrics-server-964649464-wqqjv 0/1 Pending 0 27m
nginx-proxy-k8ststworker-1 1/1 Running 0 40m
nginx-proxy-k8ststworker-2 1/1 Running 0 40m
nginx-proxy-k8ststworker-3 1/1 Running 0 40m
nginx-proxy-k8ststworker-4 1/1 Running 0 40m
nginx-proxy-k8ststworker-5 1/1 Running 0 40m
nodelocaldns-7g74j 1/1 Running 0 28m
nodelocaldns-9c796 1/1 Running 0 28m
nodelocaldns-9rgbt 1/1 Running 0 28m
nodelocaldns-d2vgg 1/1 Running 0 28m
nodelocaldns-rnm2z 1/1 Running 0 28m
nodelocaldns-t5jcp 1/1 Running 0 28m
nodelocaldns-wpdws 1/1 Running 0 28m
nodelocaldns-xd69z 1/1 Running 0 28m
snapshot-controller-5bcc5977f-dw49s 0/1 Pending 0 26m
- get logs from any pod in ds/cilium (a sketch for inspecting the failing init container follows these logs):
$ stern cilium-68x2t
+ cilium-68x2t › mount-cgroup
+ cilium-68x2t › config
cilium-68x2t mount-cgroup cp: cannot create regular file '/hostbin/cilium-mount': Permission denied
cilium-68x2t config Running
cilium-68x2t config 2025/05/30 12:37:05 INFO Starting hive
cilium-68x2t config time="2025-05-30T12:37:05.142948467Z" level=info msg="Establishing connection to apiserver" host="https://141.94.2.45:6443" subsys=k8s-client
cilium-68x2t config time="2025-05-30T12:37:05.163867625Z" level=info msg="Connected to apiserver" subsys=k8s-client
cilium-68x2t config time="2025-05-30T12:37:05.165848906Z" level=info msg="Reading configuration from config-map:kube-system/cilium-config" configSource="config-map:kube-system/cilium-config" subsys=option-resolver
cilium-68x2t config time="2025-05-30T12:37:05.172936977Z" level=info msg="Got 168 config pairs from source" configSource="config-map:kube-system/cilium-config" subsys=option-resolver
- cilium-68x2t › mount-cgroup
cilium-68x2t config time="2025-05-30T12:37:05.173025702Z" level=info msg="Reading configuration from cilium-node-config:kube-system/" configSource="cilium-node-config:kube-system/" subsys=option-resolver
cilium-68x2t config time="2025-05-30T12:37:05.178977704Z" level=info msg="Got 0 config pairs from source" configSource="cilium-node-config:kube-system/" subsys=option-resolver
cilium-68x2t config 2025/05/30 12:37:05 INFO Started duration=47.96484ms
cilium-68x2t config 2025/05/30 12:37:05 INFO Stopping
cilium-68x2t config 2025/05/30 12:37:05 INFO health.job-module-status-metrics (rev=2) module=health
- cilium-68x2t › config
+ cilium-68x2t › mount-cgroup
cilium-68x2t mount-cgroup cp: cannot create regular file '/hostbin/cilium-mount': Permission denied
- cilium-68x2t › mount-cgroup
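For context on where /hostbin comes from: the failing cp runs in the mount-cgroup init container of the cilium DaemonSet, which mounts a hostPath volume at /hostbin (in the upstream chart that volume is named cni-path and defaults to /opt/cni/bin; I am assuming the Kubespray-rendered chart keeps those names). The relevant spec can be inspected with:
$ kubectl -n kube-system get daemonset cilium -o yaml | grep -B2 -A25 'name: mount-cgroup'
$ kubectl -n kube-system get daemonset cilium \
    -o jsonpath='{.spec.template.spec.volumes[?(@.name=="cni-path")].hostPath.path}'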
OS
Ubuntu 24.04 (Cloud Minimal)
Version of Ansible
ansible [core 2.16.14]
config file = /home/shartmann/git/streamlane/ansible.cfg
configured module search path = ['/home/shartmann/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /home/shartmann/.local/lib/python3.12/site-packages/ansible
ansible collection location = /home/shartmann/.ansible/collections:/usr/share/ansible/collections
executable location = /home/shartmann/.local/bin/ansible
python version = 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] (/usr/bin/python3)
jinja version = 3.1.4
libyaml = True
Version of Python
Python 3.12.3
Version of Kubespray (commit)
92e8ac9 (plus a cherry-pick of #12244)
Network plugin used
cilium
Full inventory with variables
crio_enable_metrics: true
crio_registries_mirrors:
  - prefix: docker.io
    insecure: false
    blocked: false
    location: registry-1.docker.io
    mirrors:
      - location: mirror.gcr.io
        insecure: false
nri_enabled: true
# required for cri-o
download_container: false
skip_downloads: false
etcd_deployment_type: host
metrics_server_enabled: true
metrics_server_replicas: 3
metrics_server_limits_cpu: 400m
metrics_server_limits_memory: 600Mi
metrics_server_metric_resolution: 20s
local_path_provisioner_enabled: true
local_path_provisioner_is_default_storageclass: "false"
local_path_provisioner_helper_image_repo: docker.io/library/busybox
ingress_nginx_enabled: true
ingress_nginx_host_network: true
ingress_nginx_class: nginx
csi_snapshot_controller_enabled: true
cert_manager_enabled: true
cephfs_provisioner_enabled: false
argocd_enabled: false
kubernetes_audit: true
kube_encrypt_secret_data: true
# watch #11835 then set back to true
remove_anonymous_access: false
kubeconfig_localhost: true
system_reserved: true
kubelet_max_pods: 280
kubelet_systemd_wants_dependencies: ["rpc-statd.service"]
kube_network_node_prefix: 23
kube_network_node_prefix_ipv6: 120
kube_network_plugin: cilium
container_manager: crio
crun_enabled: true
ndots: 2
system_upgrade: true
system_upgrade_reboot: never
kube_proxy_strict_arp: true
resolvconf_mode: host_resolvconf
upstream_dns_servers: [213.186.33.99]
serial: 2 # how many nodes are upgraded at the same time
unsafe_show_logs: true # when need to debug kubespray output
cilium_version: 1.17.3
cilium_cni_exclusive: true
cilium_encryption_enabled: true
cilium_encryption_type: wireguard
cilium_tunnel_mode: vxlan
cilium_enable_bandwidth_manager: true
cilium_enable_hubble: true
cilium_enable_hubble_ui: true
cilium_hubble_install: true
cilium_hubble_tls_generate: true
cilium_enable_hubble_metrics: true
cilium_hubble_metrics:
- dns
- drop
- tcp
- flow
- icmp
- http
cilium_enable_host_firewall: true
cilium_policy_audit_mode: true
cilium_kube_proxy_replacement: true
cilium_gateway_api_enabled: true
cilium_enable_well_known_identities: true
# masters
node_labels:
  role: master
# workers
node_labels:
  role: worker
  node-role.kubernetes.io/worker: ""
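As a sanity check that these Cilium options reached the cluster (the config init container above reports 168 pairs from config-map:kube-system/cilium-config), the ConfigMap can be grepped for the corresponding keys; the key names below are the upstream cilium-config names I expect for these settings, so treat them as assumptions:
$ kubectl -n kube-system get configmap cilium-config -o yaml \
    | grep -E 'enable-wireguard|kube-proxy-replacement|enable-host-firewall|tunnel'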
Command used to invoke ansible
ansible-playbook -i ${AI_KTST} --become --vault-id k8s cluster.yml
Output of ansible run
No error during the playbook run. The problem therefore appears to lie in the Cilium configuration or installation parameters.
PLAY RECAP **********************************************************************************************************************************************************************
k8ststmaster-1 : ok=484 changed=134 unreachable=0 failed=0 skipped=772 rescued=0 ignored=4
k8ststmaster-2 : ok=417 changed=112 unreachable=0 failed=0 skipped=725 rescued=0 ignored=3
k8ststmaster-3 : ok=419 changed=113 unreachable=0 failed=0 skipped=723 rescued=0 ignored=3
k8ststworker-1 : ok=316 changed=81 unreachable=0 failed=0 skipped=472 rescued=0 ignored=1
k8ststworker-2 : ok=316 changed=81 unreachable=0 failed=0 skipped=472 rescued=0 ignored=1
k8ststworker-3 : ok=316 changed=81 unreachable=0 failed=0 skipped=472 rescued=0 ignored=1
k8ststworker-4 : ok=316 changed=81 unreachable=0 failed=0 skipped=472 rescued=0 ignored=1
k8ststworker-5 : ok=316 changed=81 unreachable=0 failed=0 skipped=472 rescued=0 ignored=1
Anything else we need to know
There is plenty of disk space; this is not a "no space left on device" issue.
I have also tried Cilium 1.17.4 instead of 1.17.3, with the same result.
Note that we use the host-firewall (currently in audit mode, which acts like a dry run), kube-proxy-replacement, and Gateway API features; these are standard features but relatively new to most users.
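A node-side check that may help narrow this down (the /opt/cni/bin path is my assumption about what backs /hostbin, see the DaemonSet inspection in the reproduction steps): attempt the same write directly as root on a node, then look for read-only mounts or LSM denials:
$ sudo touch /opt/cni/bin/cilium-mount.test && sudo rm /opt/cni/bin/cilium-mount.test
$ mount | grep '/opt/cni'
$ sudo dmesg | grep -iE 'apparmor|denied' | tail -n 20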
On the nodes, Cilium itself reports that it is unavailable (tested here with Cilium 1.17.4):
$ sudo cilium status
/¯¯\
/¯¯\__/¯¯\ Cilium: 1 errors, 8 warnings
\__/¯¯\__/ Operator: OK
/¯¯\__/¯¯\ Envoy DaemonSet: OK
\__/¯¯\__/ Hubble Relay: 1 errors, 2 warnings
\__/ ClusterMesh: disabled
DaemonSet cilium Desired: 8, Unavailable: 8/8
DaemonSet cilium-envoy Desired: 8, Ready: 8/8, Available: 8/8
Deployment cilium-operator Desired: 2, Ready: 2/2, Available: 2/2
Deployment hubble-relay Desired: 1, Unavailable: 1/1
Deployment hubble-ui Desired: 1, Unavailable: 1/1
Containers: cilium Pending: 8
cilium-envoy Running: 8
cilium-operator Running: 2
clustermesh-apiserver
hubble-relay Pending: 1
hubble-ui Pending: 1
Cluster Pods: 0/12 managed by Cilium
Helm chart version: 1.17.4
Image versions cilium quay.io/cilium/cilium:v1.17.4@sha256:24a73fe795351cf3279ac8e84918633000b52a9654ff73a6b0d7223bcff4a67a: 8
cilium-envoy quay.io/cilium/cilium-envoy:v1.32.5-1744305768-f9ddca7dcd91f7ca25a505560e655c47d3dec2cf@sha256:a04218c6879007d60d96339a441c448565b6f86650358652da27582e0efbf182: 8
cilium-operator quay.io/cilium/operator-generic:v1.17.4@sha256:a3906412f477b09904f46aac1bed28eb522bef7899ed7dd81c15f78b7aa1b9b5: 2
hubble-relay quay.io/cilium/hubble-relay:v1.17.4@sha256:c16de12a64b8b56de62b15c1652d036253b40cd7fa643d7e1a404dc71dc66441: 1
hubble-ui quay.io/cilium/hubble-ui-backend:v0.13.2@sha256:a034b7e98e6ea796ed26df8f4e71f83fc16465a19d166eff67a03b822c0bfa15: 1
hubble-ui quay.io/cilium/hubble-ui:v0.13.2@sha256:9e37c1296b802830834cc87342a9182ccbb71ffebb711971e849221bd9d59392: 1
Errors: cilium cilium 8 pods of DaemonSet cilium are not ready
hubble-relay hubble-relay 1 pods of Deployment hubble-relay are not ready
hubble-ui hubble-ui 1 pods of Deployment hubble-ui are not ready
Warnings: cilium cilium-68x2t pod is pending
cilium cilium-g24gf pod is pending
cilium cilium-h9cb9 pod is pending
cilium cilium-hh7nr pod is pending
cilium cilium-ntnnz pod is pending
cilium cilium-wxhqp pod is pending
cilium cilium-xbd8z pod is pending
cilium cilium-zhdlt pod is pending
hubble-relay hubble-relay-7b4c9d4474-r56b4 pod is pending
hubble-relay hubble-relay-7b4c9d4474-r56b4 pod is pending
hubble-ui hubble-ui-76d4965bb6-ptbdm pod is pending