Description
What happened:
My team uses the eks-node-monitoring-agent to detect bad nodes and Karpenter's Node Repair feature to replace them. Karpenter's Node Repair works by measuring a condition's LastTransitionTime against a preset toleration period to determine when to take action.
We have encountered a case where more than one failing DCGM diag test causes the LastTransitionTime to constantly reset, so the Karpenter toleration is never met and no repair ever occurs.
Example node-monitoring agent logs:
{"level":"info","ts":"2026-01-29xxxx","msg":"sending condition to exporter","condition":{"Reason":"DCGMHealthCode101","Message":"DCGM detected issues in health check system with error code 101","Severity":"Fatal","MinOccurrences":0},"conditionType":"AcceleratedHardwareReady"}
{"level":"info","ts":"2026-01-29xxxx","msg":"sending condition to exporter","condition":{"Reason":"DCGMHealthCode4","Message":"DCGM detected issues in health check system with error code 4","Severity":"Fatal","MinOccurrences":0},"conditionType":"AcceleratedHardwareReady"}
Resulting description of the nodes status conditions:
$ kubectl describe node node_a | grep AcceleratedHardwareReady
AcceleratedHardwareReady False Wed, 28 Jan 2026 23:42:22 +0000 Wed, 28 Jan 2026 23:42:22 +0000 DCGMHealthCode4 DCGM detected issues in health check system with error code 4
$ kubectl describe node node_a | grep AcceleratedHardwareReady
AcceleratedHardwareReady False Wed, 28 Jan 2026 23:47:22 +0000 Wed, 28 Jan 2026 23:47:22 +0000 DCGMHealthCode4 DCGM detected issues in health check system with error code 4
$ kubectl describe node node_a | grep AcceleratedHardwareReady
AcceleratedHardwareReady False Wed, 28 Jan 2026 23:52:22 +0000 Wed, 28 Jan 2026 23:52:22 +0000 DCGMHealthCode4 DCGM detected issues in health check system with error code 4
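This pattern is what you would expect from a condition updater that treats any change in Reason as a transition. Below is a minimal Go sketch of that assumed behavior (illustrative only, not the agent's actual code): with two failing diag tests reported alternately, LastTransitionTime keeps advancing even though Status never leaves False.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setCondition is a hypothetical updater that refreshes LastTransitionTime
// whenever Status OR Reason differs from the stored condition.
func setCondition(stored *corev1.NodeCondition, desired corev1.NodeCondition, now time.Time) {
	if stored.Status != desired.Status || stored.Reason != desired.Reason {
		desired.LastTransitionTime = metav1.NewTime(now)
	} else {
		desired.LastTransitionTime = stored.LastTransitionTime
	}
	*stored = desired
}

func main() {
	start := time.Now()
	cond := corev1.NodeCondition{
		Type:               "AcceleratedHardwareReady",
		Status:             corev1.ConditionFalse,
		Reason:             "DCGMHealthCode101",
		LastTransitionTime: metav1.NewTime(start),
	}

	// Two failing diag tests reported alternately, as in the agent logs above.
	reasons := []string{"DCGMHealthCode4", "DCGMHealthCode101", "DCGMHealthCode4"}
	for i, reason := range reasons {
		setCondition(&cond, corev1.NodeCondition{
			Type:   "AcceleratedHardwareReady",
			Status: corev1.ConditionFalse,
			Reason: reason,
		}, start.Add(time.Duration(i+1)*5*time.Minute))
		fmt.Printf("reason=%s lastTransitionTime=%s\n",
			cond.Reason, cond.LastTransitionTime.Format(time.RFC3339))
	}
	// LastTransitionTime advances on every poll, so a repair controller that
	// waits for the condition to age past its toleration never fires.
}
```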
What you expected to happen:
When a node is failing a set of accelerated hardware tests, the entire failing set should be treated as a single condition, so that the LastTransitionTime does not change as long as the same tests keep failing together. This would be much more representative of the timeline of the node's health; a sketch of this aggregation follows the example below.
Example:
Timestamp_A: Node fails DCGM with error code 101 -> set AcceleratedHardwareReady = False AND set LastTransitionTime = Timestamp_A
Timestamp_B: Node fails DCGM with error code 101 and error code 4 -> set AcceleratedHardwareReady = False AND set LastTransitionTime = Timestamp_B
Timestamp_C: Node still fails DCGM with error code 101 and error code 4 -> set AcceleratedHardwareReady = False. Keep LastTransitionTime = Timestamp_B
Timestamp_D: Node still fails DCGM with error code 101 and error code 4 -> set AcceleratedHardwareReady = False. Keep LastTransitionTime = Timestamp_B
Timestamp_E: Node only fails DCGM with error code 4 -> set AcceleratedHardwareReady = False AND set LastTransitionTime = Timestamp_E
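Here is a minimal Go sketch of that aggregation (the type and helper names are illustrative, not a proposal for the agent's actual API): the set of currently failing DCGM error codes is folded into one condition, and LastTransitionTime is only reset when that set, or the overall status, changes.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// aggregatedCondition is a hypothetical representation of the whole failing set.
type aggregatedCondition struct {
	Ready              bool   // true == AcceleratedHardwareReady
	FailingCodes       string // canonical sorted list of failing error codes
	LastTransitionTime time.Time
}

// update recomputes the condition from the currently failing codes and keeps
// the old LastTransitionTime while the failing set stays identical.
func (c *aggregatedCondition) update(failingCodes []int, now time.Time) {
	codes := make([]string, 0, len(failingCodes))
	for _, code := range failingCodes {
		codes = append(codes, fmt.Sprint(code))
	}
	sort.Strings(codes)
	key := strings.Join(codes, ",")

	ready := len(failingCodes) == 0
	if ready != c.Ready || key != c.FailingCodes {
		c.LastTransitionTime = now // the set of failures actually changed
	}
	c.Ready = ready
	c.FailingCodes = key
}

func main() {
	t0 := time.Now()
	var cond aggregatedCondition
	cond.update(nil, t0) // node starts healthy

	// Timestamp_A .. Timestamp_E from the example above, 5 minutes apart.
	steps := [][]int{{101}, {101, 4}, {4, 101}, {101, 4}, {4}}
	for i, failing := range steps {
		cond.update(failing, t0.Add(time.Duration(i+1)*5*time.Minute))
		fmt.Printf("ready=%v failing=[%s] lastTransitionTime=%s\n",
			cond.Ready, cond.FailingCodes, cond.LastTransitionTime.Format(time.RFC3339))
	}
	// LastTransitionTime only moves at Timestamp_A, Timestamp_B, and
	// Timestamp_E, matching the expected timeline above.
}
```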
How to reproduce it (as minimally and precisely as possible):
Run the node-monitoring agent on a GPU node that fails multiple DCGM diag tests and observe how LastTransitionTime updates.
Environment:
- AWS Region: us-east-1
- Cluster Kubernetes version: 1.33
- Node Kubernetes version: 1.33
- EKS Node Monitoring Agent version: v1.5.1-eksbuild.1