Description
What happened:
My team uses the eks-node-monitoring-agent to detect bad nodes and Karpenter's Node Repair feature to replace them. Karpenter's Node Repair works by measuring a condition's LastTransitionTime against a preset toleration period to determine when to take action.
We have encountered a case where more than one failing DCGM diag test causes the LastTransitionTime to constantly reset, so the Karpenter toleration is never met and no repair ever occurs.
Example node-monitoring agent logs:
{"level":"info","ts":"2026-01-29xxxx","msg":"sending condition to exporter","condition":{"Reason":"DCGMHealthCode101","Message":"DCGM detected issues in health check system with error code 101","Severity":"Fatal","MinOccurrences":0},"conditionType":"AcceleratedHardwareReady"}
{"level":"info","ts":"2026-01-29xxxx","msg":"sending condition to exporter","condition":{"Reason":"DCGMHealthCode4","Message":"DCGM detected issues in health check system with error code 4","Severity":"Fatal","MinOccurrences":0},"conditionType":"AcceleratedHardwareReady"}
Resulting description of the nodes status conditions:
$ kubectl describe node node_a | grep AcceleratedHardwareReady
AcceleratedHardwareReady False Wed, 28 Jan 2026 23:42:22 +0000 Wed, 28 Jan 2026 23:42:22 +0000 DCGMHealthCode4 DCGM detected issues in health check system with error code 4
$ kubectl describe node node_a | grep AcceleratedHardwareReady
AcceleratedHardwareReady False Wed, 28 Jan 2026 23:47:22 +0000 Wed, 28 Jan 2026 23:47:22 +0000 DCGMHealthCode4 DCGM detected issues in health check system with error code 4
$ kubectl describe node node_a | grep AcceleratedHardwareReady
AcceleratedHardwareReady False Wed, 28 Jan 2026 23:52:22 +0000 Wed, 28 Jan 2026 23:52:22 +0000 DCGMHealthCode4 DCGM detected issues in health check system with error code 4
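This pattern is what you would expect from a condition updater that treats any change in Reason as a transition. Below is a minimal Go sketch of that assumed behavior (illustrative only, not the agent's actual code): with two failing diag tests reported alternately, LastTransitionTime keeps advancing even though Status never leaves False.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setCondition is a hypothetical updater that refreshes LastTransitionTime
// whenever Status OR Reason differs from the stored condition.
func setCondition(stored *corev1.NodeCondition, desired corev1.NodeCondition, now time.Time) {
	if stored.Status != desired.Status || stored.Reason != desired.Reason {
		desired.LastTransitionTime = metav1.NewTime(now)
	} else {
		desired.LastTransitionTime = stored.LastTransitionTime
	}
	*stored = desired
}

func main() {
	start := time.Now()
	cond := corev1.NodeCondition{
		Type:               "AcceleratedHardwareReady",
		Status:             corev1.ConditionFalse,
		Reason:             "DCGMHealthCode101",
		LastTransitionTime: metav1.NewTime(start),
	}

	// Two failing diag tests reported alternately, as in the agent logs above.
	reasons := []string{"DCGMHealthCode4", "DCGMHealthCode101", "DCGMHealthCode4"}
	for i, reason := range reasons {
		setCondition(&cond, corev1.NodeCondition{
			Type:   "AcceleratedHardwareReady",
			Status: corev1.ConditionFalse,
			Reason: reason,
		}, start.Add(time.Duration(i+1)*5*time.Minute))
		fmt.Printf("reason=%s lastTransitionTime=%s\n",
			cond.Reason, cond.LastTransitionTime.Format(time.RFC3339))
	}
	// LastTransitionTime advances on every poll, so a repair controller that
	// waits for the condition to age past its toleration never fires.
}
```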
What you expected to happen:
When a node is failing a set of accelerated hardware tests, the entire failing set should be treated as a single condition, so that the LastTransitionTime does not change as long as the same tests keep failing together. This would be much more representative of the timeline of the node's health; a sketch of this aggregation follows the example below.
Example:
Timestamp_A: Node fails DCGM with error code 101 -> set AcceleratedHardwareReady = False AND set LastTransitionTime = Timestamp_A
Timestamp_B: Node fails DCGM with error code 101 and error code 4 -> set AcceleratedHardwareReady = False AND set LastTransitionTime = Timestamp_B
Timestamp_C: Node still fails DCGM with error code 101 and error code 4 -> set AcceleratedHardwareReady = False. Keep LastTransitionTime = Timestamp_B
Timestamp_D: Node still fails DCGM with error code 101 and error code 4 -> set AcceleratedHardwareReady = False. Keep LastTransitionTime = Timestamp_B
Timestamp_E: Node only fails DCGM with error code 4 -> set AcceleratedHardwareReady = False AND set LastTransitionTime = Timestamp_E
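Here is a minimal Go sketch of that aggregation (the type and helper names are illustrative, not a proposal for the agent's actual API): the set of currently failing DCGM error codes is folded into one condition, and LastTransitionTime is only reset when that set, or the overall status, changes.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// aggregatedCondition is a hypothetical representation of the whole failing set.
type aggregatedCondition struct {
	Ready              bool   // true == AcceleratedHardwareReady
	FailingCodes       string // canonical sorted list of failing error codes
	LastTransitionTime time.Time
}

// update recomputes the condition from the currently failing codes and keeps
// the old LastTransitionTime while the failing set stays identical.
func (c *aggregatedCondition) update(failingCodes []int, now time.Time) {
	codes := make([]string, 0, len(failingCodes))
	for _, code := range failingCodes {
		codes = append(codes, fmt.Sprint(code))
	}
	sort.Strings(codes)
	key := strings.Join(codes, ",")

	ready := len(failingCodes) == 0
	if ready != c.Ready || key != c.FailingCodes {
		c.LastTransitionTime = now // the set of failures actually changed
	}
	c.Ready = ready
	c.FailingCodes = key
}

func main() {
	t0 := time.Now()
	var cond aggregatedCondition
	cond.update(nil, t0) // node starts healthy

	// Timestamp_A .. Timestamp_E from the example above, 5 minutes apart.
	steps := [][]int{{101}, {101, 4}, {4, 101}, {101, 4}, {4}}
	for i, failing := range steps {
		cond.update(failing, t0.Add(time.Duration(i+1)*5*time.Minute))
		fmt.Printf("ready=%v failing=[%s] lastTransitionTime=%s\n",
			cond.Ready, cond.FailingCodes, cond.LastTransitionTime.Format(time.RFC3339))
	}
	// LastTransitionTime only moves at Timestamp_A, Timestamp_B, and
	// Timestamp_E, matching the expected timeline above.
}
```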
How to reproduce it (as minimally and precisely as possible):
Run the node-monitoring agent on a GPU node that fails multiple DCGM diag tests and observe how LastTransitionTime updates.
Environment:
- AWS Region: us-east-1
- Cluster Kubernetes version: 1.33
- Node Kubernetes version: 1.33
- EKS Node Monitoring Agent version: v1.5.1-eksbuild.1