Description
Goal
The goal of this issue is to consistently propagate timeouts (NodeDrainTimeout, NodeDeletionTimeout, ...) from MDs to MSs to Machines. This is desirable so that users can still change timeouts even if a Machine is, for example, stuck in draining.
A first PR ensures that a MachineSet propagates the timeouts to Machines that are being deleted: #10589
But there are a few other cases, as described here: #10589 (inlining below for convenience)
The following specifically focuses on cases where Machines are deleted by the MS controller.
Case 1. MD is deleted
The following happens:
- MD goes away
- ownerRef triggers MS deletion
- MS goes away
- ownerRef triggers Machine deletion
=> The MS will already be gone when the deletionTimestamp is set on the Machines. In this case users would have to modify the timeouts on each Machine individually. Because the MS no longer exists, it is not possible to propagate timeouts from the MS to the Machines.
Case 2. MD is scaled down to 0
The following happens:
- MD scales down MS to 0
- MS deletes Machine
This use case was addressed by: #10589
Case 3. MD rollout
The following happens:
- Someone updates the MD (e.g. bumps the Kubernetes version)
- MD creates a new MS and scales it up
- In parallel MD scales down the old MS to 0
=> In this scenario the MD controller currently propagates the timeouts from the MD only to the new/current MS, not to the old ones. So the Machines of the old MSs won't pick up timeouts that are newly set in the MD.
Implementation
To address all scenarios I propose to always propagate timeouts from MD => MS => Machine. To make that happen we have to implement the following:
- Ensure that during MD deletion the MD & MS objects stay around until all Machines are deleted: Consider implementing "forced" MD foreground deletion #10710
- Ensure timeouts are always propagated from MD to all MachineSets to all Machines
  - Even if an MD, MS or Machine is being deleted (in both regular reconcile and reconcileDelete)
  - Even if an MS is not the "current" MS
Follow-up:
- We should also check other objects like Cluster (topology), KCP, ...