Skip to content

manager/koordlet: add pod-level CPU burst strategies for fine-grained control #2557

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ktalg
Copy link

@ktalg ktalg commented Jul 30, 2025

Ⅰ. Describe what this PR does

Adds Pod-level grey control for the CPU Burst feature without breaking the existing
ConfigMap ➜ koord-manager ➜ NodeSLO ➜ koordlet pipeline.

  • API extensionCPUBurstCfg now contains an optional podStrategies field that lets users target Pods by namespace and/or label selector and optionally override per-Pod annotations.
  • koord-manager – when converting slo-controller-config to NodeSLO, it copies the full podStrategies list together with the node/cluster strategies.
  • koordletcpuburst plugin merges podStrategies at runtime and applies the selected strategy during the normal pod scan; no extra grey-control plugin or additional watch channel is introduced.
  • Keeps full backward compatibility (old configs work unchanged).

Ⅱ. Does this pull request fix one issue?

close #2553

Ⅲ. Describe how to verify it

  1. Edit the slo-controller-config ConfigMap and add a podStrategies entry
    eg.
apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  cpu-burst-config: |
    {
      "clusterStrategy": {
        "policy": "none"
      },
      "nodeStrategies": [
        {
          "nodeSelector": {
            "matchLabels": {
              "kubernetes.io/hostname": "ci-testing-worker"
            }
          },
          "cpuBurstStrategy": {
            "policy": "auto",
            "cpuBurstPercent": 200,
            "cfsQuotaBurstPercent": 300,
            "cfsQuotaBurstPeriodSeconds": 6
          }
        }
      ],
      "podStrategies": [
        {
          "namespace": "frontend",
          "podSelector": {
            "matchLabels": {
              "app": "payment"
            }
          },
          "overridePod": true,
          "cpuBurstConfig": {
            "policy": "auto",
            "cpuBurstPercent": 100,
            "cfsQuotaBurstPercent": 600,
            "cfsQuotaBurstPeriodSeconds": 10
          }
        },
        {
          "namespace": "batch-jobs",
          "podSelector": {
            "matchLabels": {
              "workload-type": "low-priority"
            }
          },
          "overridePod": false,
          "cpuBurstConfig": {
            "policy": "cpuBurstOnly",
            "cpuBurstPercent": 50
          }
        }
      ]
    }

  1. Ensure koord-manager reconciles the change into each node’s NodeSLO.

  2. Launch a matching Pod and observe on the node:

    • cpu.cfs_burst_us / cpu.max.burst set as expected.
    • Logs show genPodBurstConfig picked the pod-specific strategy.
  3. Non-matching Pods and nodes continue to follow node / cluster strategy.

Ⅳ. Special notes for reviews

V. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

@koordinator-bot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign saintube, zwzhang0107 after the PR has been reviewed.
You can assign the PR to them by writing /assign @saintube @zwzhang0107 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link

codecov bot commented Jul 30, 2025

Codecov Report

❌ Patch coverage is 78.12500% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.36%. Comparing base (f4b91fb) to head (b238c89).

Files with missing lines Patch % Lines
.../koordlet/qosmanager/plugins/cpuburst/cpu_burst.go 70.83% 5 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2557      +/-   ##
==========================================
- Coverage   66.37%   66.36%   -0.01%     
==========================================
  Files         485      485              
  Lines       57115    57140      +25     
==========================================
+ Hits        37908    37922      +14     
- Misses      16458    16463       +5     
- Partials     2749     2755       +6     
Flag Coverage Δ
unittests 66.36% <78.12%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ktalg ktalg force-pushed the main branch 2 times, most recently from bcd9e4e to 0a9e7ef Compare July 31, 2025 04:02
@ktalg ktalg changed the title manager/koordlet: add pod-level CPU burst strategies for fine-grained… manager/koordlet: add pod-level CPU burst strategies for fine-grained control Jul 31, 2025
@ZiMengSheng ZiMengSheng added this to the v1.7 milestone Aug 1, 2025
@ktalg
Copy link
Author

ktalg commented Aug 5, 2025

Hi @ZiMengSheng @saintube 👋 Just a gentle ping on this PR

The feature has been fully verified in our test clusters — including dynamic pod-level burst control without restarts, and full backward compatibility with existing configurations.

Let me know if anything needs to be adjusted. Thanks a lot for your time and review.

@saintube
Copy link
Member

saintube commented Aug 6, 2025

@ktalg Thanks for your contributions!
For your use case, I'm wondering if we could use the pod annotation API koordinator.sh/cpuBurst to control the pod-level configuration. A controller may be added to patch the annotations to target pods, or we can reuse the ColocationProfile to fulfill it.

@ktalg
Copy link
Author

ktalg commented Aug 6, 2025

@saintube

Thanks for the thoughtful suggestions! I've carefully considered both of the proposed alternatives, and would love to share some thoughts:

  1. If we go with the new controller approach, it would require introducing a new configuration source — either a new CRD or reuse of the existing slo-controller-config. A new CRD might be too heavy just for this purpose, while sharing slo-controller-config between multiple controllers feels a bit unclean and risks tight coupling. Additionally, we'd also need to introduce a new webhook to inject annotations for newly created Pods.

  2. Handling cases where Pods already have manually specified burst annotations (and how to safely override or restore them) adds extra complexity around state reconciliation and ownership tracking.

  3. As for using ColocationProfile, it currently only affects newly created Pods via admission webhook. This doesn't align well with the goal of dynamically updating burst configurations on already-running Pods.

Compared to those options, the current implementation — which extends NodeSLO with a podStrategies field and delegates the merging to koordlet — offers a lightweight, non-intrusive, and backward-compatible way to dynamically control burst behavior without touching the Pods themselves or requiring additional components.

Happy to continue the discussion — open to adjustments if there's a direction the community prefers!

Copy link
Member

@saintube saintube left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm convinced. Any comments @zwzhang0107 @songtao98

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[proposal] Support dynamic and fine-grained enabling of cpuBurst via QOSGreyCtrlPlugin + ConfigMap
3 participants