Skip to content

Support for per-zone PDB #253

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

tcp13equals2
Copy link

@tcp13equals2 tcp13equals2 commented Jul 31, 2025

This is a draft PR for the implementation of a custom pod disruption budget which is zone aware.

#194

Included in this PR;

  • new web hook for handling pod eviction requests added into the rollout-operator
  • example manifest files to introducing a new custom resource definition to define the new PodDisruptionZoneBudget kind
  • example manifest files to add a PodDisruptionZoneBudget to the test-app developer environment
  • example manifest file to register the rollout-operator for the pod eviction web hook
  • modified example manifest file for updated role permissions to enable the pod eviction handler to run under the role-operator-role

Note - this is a draft PR to give early visibility to @charleskorn and to share details on the new custom resource definition.

Relates to grafana/mimir#12315

Includes;

* new web-hook handler for pod eviction added to rollout-operator
* manifests for new custom resource definition for configuring the PDBZ
* updated development example files and environment
@CLAassistant
Copy link

CLAassistant commented Jul 31, 2025

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@charleskorn charleskorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add an integration test to cover this functionality?

* rename to ZoneAwarePodDisruptionBudget
* updated root README
* updates to main eviction handler per PR
*
* updates per PR review
* added ZPDB informer to monitor ZPDB configurations
* added Pod informer to work with pod eviction cache
* added pod eviction cache
* added validating webhook configuration for ZPDB files
* added locking for concurrent eviction handling
Copy link
Contributor

@charleskorn charleskorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't finished reviewing the tests - will take a look tomorrow.

.golangci.yaml Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems unrelated - am I missing something?


This allows an operator to perform maintenance on a single zone whilst ensuring sufficient pod availability in other zones.

Consider the following topology where the `PDZB` has `maxUnavailability=1`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Consider the following topology where the `PDZB` has `maxUnavailability=1`;
Consider the following topology where the `ZPDB` has `maxUnavailable` set to 1:

Comment on lines +437 to +441
If `ingester-zone-a-0` is to be evicted, it will be allowed if there are no disruptions in either zone `b` or zone `c`.

If `ingester-zone-a-1` has failed and `ingester-zone-a-0` is to be evicted, it will be allowed if there are no disruptions in either zone `b` or zone `c`.

*The already disrupted zone can be further disrupted as long as the other zones meet the unavailability criteria.*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I should have caught this earlier, this isn't quite right - we still want to limit the number of pods allowed to become unavailable :

Suggested change
If `ingester-zone-a-0` is to be evicted, it will be allowed if there are no disruptions in either zone `b` or zone `c`.
If `ingester-zone-a-1` has failed and `ingester-zone-a-0` is to be evicted, it will be allowed if there are no disruptions in either zone `b` or zone `c`.
*The already disrupted zone can be further disrupted as long as the other zones meet the unavailability criteria.*
If `ingester-zone-a-0` is to be evicted, it will be allowed if there are no disruptions in either zone `a`, zone `b` or zone `c`.
If `ingester-zone-a-1` has failed and `ingester-zone-a-0` is to be evicted, it will not be allowed.

It might be good to add another example where maxUnavailability is set higher than 1 to illustrate what happens in that case as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants