Skip to content

fix: reject scale down request if there is already an in-process request in the same rollout group #254

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

sandeepsukhani
Copy link
Contributor

@sandeepsukhani sandeepsukhani commented Jul 31, 2025

If multiple simultaneous scale-down requests are made to prepare the downscale handler for the same rollout group, the code might inadvertently allow all the requests, even if grafana.com/min-time-between-zones-downscale is set to a non-zero value.
The issue is that there is no synchronisation to prevent that from happening.

In this PR, I am fixing the issue by tracking ongoing prepare downscale requests by rollout group and rejecting any new downscale requests for the same rollout group until the previous request is finished serving.

Copy link
Contributor

@JordanRushing JordanRushing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, let's test further before merging. Approving to unblock.

Comment on lines 381 to 382
// mimic in progress downscaling request of rollout group ingester
zt.rolloutGroupDownscalingInProgress.Store(rolloutGroupIngester, ingesterZoneA)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than reaching into the internals like this, what if we changed the test to something like:

  1. configure StatefulSet to call fake prepare downscale handler
  2. initiate downscale, rollout-operator calls fake handler and marks downscale as in progress
  3. fake handler does not return until signalled later
  4. initiate a second downscale: rollout-operator should reject the call immediately given the other is in progress
  5. release fake handler to allow first request to complete

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Charles, for the review! I have made the suggested change.

Comment on lines 429 to 430
ar = buildAdmissionRequest(rolloutGroupIngester, ingesterZoneB)
admissionResponse = zt.prepareDownscale(context.Background(), logger, ar, api, f)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be racy - this request could arrive while the zone A request is still finishing. Should we wait for the zone A admission request to complete before sending the next request?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sorry, I missed that. I have fixed it. Thank you!

Copy link
Contributor

@charleskorn charleskorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo suggestion below

@sandeepsukhani sandeepsukhani enabled auto-merge (squash) August 11, 2025 05:41
@sandeepsukhani sandeepsukhani merged commit f60fb11 into main Aug 11, 2025
10 checks passed
@sandeepsukhani sandeepsukhani deleted the handle-multiple-downscale-request-same-rollout-group branch August 11, 2025 05:45
@sandeepsukhani
Copy link
Contributor Author

I forgot to update you that I deployed a build generated with this PR on top of the existing release(i.e., v0.20.1) we were running in two of the dev cells. It was running there for a whole week, and I did not see any issues while it was running.
Screenshot 2025-08-11 at 11 34 53 AM
Screenshot 2025-08-11 at 11 34 42 AM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants