cpu_freq: Add CPU frequency scaling subsystem #93461

Open · seankyer wants to merge 2 commits into main from feat/cpu-freq-subsystem

Conversation

@seankyer (Member) commented Jul 21, 2025

This PR implements: #92123 CPU Frequency Scaling Subsystem

Updated System Level Diagram

Differences from Original RFC Proposal

  • During development it became clear that a P-state policy is tightly coupled to a given metric (a metric here being CPU load, CPU temperature, or some other statistic that CPU frequency decisions can be based on).
    • Based on this, we tweaked the design to create a metrics folder within the subsystem, in which different metrics can be defined.
    • A standard (or custom) policy can then use a metric (or set of metrics) to make P-state decisions.
      • The policy itself makes direct calls to the metric (see the sketch after this list).
  • We did not refactor or promote any existing subsystem into a standalone CPU frequency subsystem.
    • The THREAD_RUNTIME_STATS option is a lightweight way to get this information and is essentially the "subsystem" we had originally hoped for.
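
A rough sketch of how a policy might call a metric directly and walk the P-state table. All names here are illustrative (reusing the p_state fields shown later in this PR); this is not the merged API:

/* Hypothetical sketch only: a policy calling a metric directly and picking
 * the first enabled P-state whose trigger threshold is below the current
 * CPU load. Names are illustrative, not the final API.
 */
#include <stdint.h>
#include <stddef.h>

struct pstate {
    uint32_t load_threshold; /* CPU load (%) at which this P-state triggers */
    uint8_t disabled;        /* skip this state if set */
    const void *config;      /* vendor-specific devicetree properties */
};

/* Metric: reports the current CPU load as a percentage. */
extern int cpu_load_get(uint32_t *load_pct);

/* Policy: pick the first enabled state triggered by the current load. */
static const struct pstate *policy_select(const struct pstate *states, size_t n)
{
    uint32_t load;

    if (cpu_load_get(&load) != 0) {
        return NULL;
    }

    for (size_t i = 0; i < n; i++) {
        if (!states[i].disabled && states[i].load_threshold < load) {
            return &states[i];
        }
    }

    return NULL;
}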

Future Improvements and Features

  • The current implementation of the CPU freq subsystem does not support SMP.
    • Tentative solution: the worker thread is updated to iterate over all cores, and the cpu_load metric is updated to take a CPU core ID as an argument.

@seankyer seankyer force-pushed the feat/cpu-freq-subsystem branch 3 times, most recently from 2205de8 to 364c54e on July 21, 2025 20:42
@nashif nashif requested a review from peter-mitsis July 21, 2025 21:02
@seankyer seankyer marked this pull request as ready for review July 21, 2025 21:28
@seankyer (Member, Author) commented Jul 21, 2025

Is someone able to link this PR to #92123? I don't seem to have the permission for it.

@seankyer (Member, Author) commented:

@bjarki-andreasen I left an example implementation of what the P-state driver might look like in the second commit. If you implement one for your board, I can remove this commit.

@MaureenHelm MaureenHelm linked an issue Jul 21, 2025 that may be closed by this pull request
@peter-mitsis (Contributor) commented:

I really like this; it seems very straightforward. I still need to go through it a little more to make sure that I understand it properly.

As to extending it for SMP, it looks like it should be mostly straightforward. IIRC, there is a caveat about k_thread_runtime_stats_cpu_get(): the data it reports for other CPUs is from the last time it was updated on that CPU (which could be a while ago if there were no interrupts or context switches). It may be more resource intensive, but if the CPU load monitoring had its own thread on each CPU (for SMP), that would allow it to have the most recent data.

@seankyer seankyer force-pushed the feat/cpu-freq-subsystem branch from 364c54e to 3d9b500 on July 23, 2025 15:10
@seankyer seankyer force-pushed the feat/cpu-freq-subsystem branch 2 times, most recently from b3c9b00 to 36db67a on July 23, 2025 15:50
@seankyer seankyer force-pushed the feat/cpu-freq-subsystem branch 3 times, most recently from 7e3e99b to 0538e7b on July 23, 2025 17:21
@nashif nashif assigned peter-mitsis and unassigned kartben Jul 23, 2025
@seankyer seankyer force-pushed the feat/cpu-freq-subsystem branch 3 times, most recently from af2eb01 to 0ad1ad1 on July 23, 2025 18:04
@seankyer seankyer force-pushed the feat/cpu-freq-subsystem branch 2 times, most recently from d43806a to bec9b05 on July 23, 2025 22:23
@andyross (Contributor) left a comment

Notes & nitpicks. Nothing seems too scary from my perspective. The big points to note are: "We need a story for how this API extends to multiprocessor setups" and "The governor code probably wants to run in a timer and not the work queue"

*
* @param state Performance state.
*/
int32_t cpu_freq_performance_state_set(struct p_state state);
Contributor:

Naming: if the struct is "p_state", the function name should match. Also "erformance" is a lot of keys to type. You could probably drop the underscore for extra credit, as Linux tends strongly to spell this "pstate".

Contributor:

Serious review: what happens when we need to support a SOC with 2+ CPUs that can run on independent clocks? Is the idea that "pstate" is an opaque thing and that someone else will provide per-cpu governor integration?

@seankyer (Member, Author) Jul 24, 2025

Okay, I have updated all instances of p_state or performance_state to now use pstate (I think I got them all).

For a multicore setup, I think we have two options:

  1. We can instantiate an instance of cpu_freq per cpu_id/core. Each core could even have its own unique pstates defined in its respective .dts file (the vendor can customize this).
  2. Have the cpu_freq subsystem iterate over all cores and call a driver with the cpu_id as an argument (a rough sketch of this option follows below).

Both options would work for SMP or asymmetric multiprocessing, provided the cores have independent clock control.

Multicore is out of scope for this PR, but as you said, the path forward should be clear. I think what I've done is open to extension to support multicore systems.
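
A rough sketch of option 2, assuming a per-CPU variant of the metric and a P-state driver call that takes the CPU ID (all of these names are hypothetical, not existing APIs):

/* Hypothetical sketch of option 2: the subsystem iterates over all cores.
 * cpu_load_get_on_cpu(), policy_select_for_load() and
 * pstate_driver_set_on_cpu() are illustrative names, not existing APIs.
 */
#include <zephyr/kernel.h>

struct pstate;

extern int cpu_load_get_on_cpu(unsigned int cpu_id, uint32_t *load_pct);
extern const struct pstate *policy_select_for_load(uint32_t load_pct);
extern int pstate_driver_set_on_cpu(unsigned int cpu_id, const struct pstate *state);

static void cpu_freq_update_all_cores(void)
{
    unsigned int num_cpus = arch_num_cpus();

    for (unsigned int cpu = 0; cpu < num_cpus; cpu++) {
        uint32_t load;

        if (cpu_load_get_on_cpu(cpu, &load) != 0) {
            continue;
        }

        const struct pstate *next = policy_select_for_load(load);

        if (next != NULL) {
            pstate_driver_set_on_cpu(cpu, next);
        }
    }
}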

struct p_state {
uint32_t load_threshold; /**< CPU load threshold at which P-state should be triggered */
uint8_t disabled; /**< Flag to indicate if P-state is disabled */
const void *config; /**< Vendor specific devicetree properties */
Contributor:

Given this is almost 100% guaranteed to be static, it might make more sense to let vendors with derived implementations define their own device structs that merely contain a set of struct p_state's, and provide e.g. macros to do static initialization instead. Saves carrying the pointer around for platforms that won't use it.

@seankyer (Member, Author):

I'm not sure if I follow. What I've tried to do is define a skeleton of what a standard pstate should look like, so policy and drivers can pass around a known struct, but leave it open for extension. I based it on Bjarki's comment on the RFC: #92123 (comment)

Contributor:

The pstate "owner" could define:

struct my_pstate {
    struct p_state pstate;
    int my_config_data;
};

And then any time it wants to get access to my_config_data from the callback, it can use CONTAINER_OF(). Now the struct doesn't need the pointer in it anymore.
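
For example, given the struct above, a sketch of that pattern (the helper name is hypothetical):

/* Sketch: recover the enclosing vendor struct from the embedded p_state.
 * The helper name is hypothetical; CONTAINER_OF() comes from
 * zephyr/sys/util.h.
 */
#include <zephyr/sys/util.h>

static inline struct my_pstate *my_pstate_from(struct p_state *state)
{
    return CONTAINER_OF(state, struct my_pstate, pstate);
}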

*/
#define P_STATE_DT_GET(_node) &(_CONCAT(p_state_, DEVICE_DT_NAME_GET(_node))),

struct p_state {
@andyross (Contributor) Jul 23, 2025

Not sure about having a completely separate struct for subsystem config. Usually the way this works is that you fold this in as a struct device.

The DTS militia will need to chime in here (not really my area, and my tastes don't always align), but my gut says that people aren't going to like the idea of DTS-configuring something that isn't itself a device.

@seankyer (Member, Author):

I am basically just trying to follow what is done in the pm subsystem. There, they have a struct called struct pm_state_info, whose details are defined through the .dtsi for a given chip, e.g.:

power-states {
      /* Sleep Mode */
      idle: idle {
        compatible = "zephyr,power-state";
        power-state-name = "runtime-idle";
        min-residency-us = <50>;
        exit-latency-us = <5>;
      };
....etc.

Just following Bjarki's suggestion (#92123 (comment)); it seemed to tie together nicely.

@bjarki-andreasen (Contributor) Jul 25, 2025

These states have to be in the devicetree; they are entirely hardware specific. They describe combinations of voltages and clock speeds supported by the hardware (or at least map to them), much like clock states, pinctrl states, or indeed sleep states.

static uint64_t execution_cycles_prev;
static uint64_t total_cycles_prev;

int get_cpu_load(uint32_t *load)
Contributor:

Seems like this wants to move into the kernel, as part of the runtime stats API, no?

Contributor:

Also naming: global functions need some kind of name spacing, names like "get_xxx()" are basically guaranteed to collide for someone.

And naming again: this is in fact not what Unix has traditionally meant by "load", which captures the number of runnable processes in the scheduler state (i.e. "how many 'users' are competing for the system", not "how much of the CPU time is running app code"). I'd suggest something like "cpu_utilization" instead.

Reply:

There are already a few different APIs available to retrieve this type of information; it really depends on the use case:

  • subsys/debug/cpu_load.c
  • subsys/debug/thread_analyzer
  • Object Core Statistics

I figured it would be best to rely on an already established API to retrieve these stats. The call we use, k_thread_runtime_stats_cpu_get(), and the infrastructure it needs are also pretty lightweight. Also, this implementation is fairly application specific: it relies on stable periodic calls to produce consistent results. I think if we want to promote this type of functionality, it would require another discussion.
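
For reference, one way such a load figure could be computed between two periodic calls, assuming the execution_cycles/total_cycles fields of k_thread_runtime_stats_t (a hedged sketch, not necessarily the exact code in this PR):

/* Hedged sketch: derive a CPU load percentage from deltas of the scheduler
 * runtime stats between two periodic calls. Assumes CONFIG_SCHED_THREAD_USAGE
 * and the execution_cycles/total_cycles fields; not necessarily the exact
 * code in this PR.
 */
#include <errno.h>
#include <zephyr/kernel.h>

static uint64_t execution_cycles_prev;
static uint64_t total_cycles_prev;

int get_cpu_load(uint32_t *load)
{
    k_thread_runtime_stats_t stats;
    int ret = k_thread_runtime_stats_cpu_get(0, &stats);

    if (ret != 0) {
        return ret;
    }

    uint64_t exec_delta = stats.execution_cycles - execution_cycles_prev;
    uint64_t total_delta = stats.total_cycles - total_cycles_prev;

    execution_cycles_prev = stats.execution_cycles;
    total_cycles_prev = stats.total_cycles;

    if (total_delta == 0U) {
        return -EAGAIN; /* not enough data since the last sample */
    }

    *load = (uint32_t)((exec_delta * 100U) / total_delta);

    return 0;
}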

*
* @retval 0 In case of success, nonzero in case of failure.
*/
int cpu_freq_policy_get_p_state_next(struct p_state *p_state);
@andyross (Contributor) Jul 23, 2025

Naming: "next" implies that this is an iterator or something (my immediate thought was "where is first()?!"). How about "cpu_freq_select_pstate()" instead?

Also docs should probably be clearer about intent. "This function is implemented by a governor implementation selected via kconfig, for example 'on_demand'."

@seankyer (Member, Author):

I was following PM precedent with this, but I actually agree that I never liked the use of next. I renamed it as per your suggestion and also added some more info to the comment.
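
Something along these lines for the renamed prototype and expanded doc comment (the wording here is illustrative, not the merged header):

/**
 * @brief Select the next P-state for the CPU.
 *
 * Implemented by the policy (governor) selected via Kconfig, for example
 * the on-demand policy.
 *
 * @param pstate Location to store the selected P-state.
 *
 * @retval 0 In case of success, nonzero in case of failure.
 */
int cpu_freq_select_pstate(struct pstate *pstate);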

LOG_MODULE_REGISTER(cpu_freq, CONFIG_CPU_FREQ_LOG_LEVEL);

static void cpu_freq_work_handler(struct k_work *work);
K_WORK_DELAYABLE_DEFINE(cpu_freq_work, cpu_freq_work_handler);
Contributor:

Can this be a k_timer and not a work item? The former is a lightweight callback that runs in the timer ISR, the latter demands that the system work queue be instantiated, which is a whole thread. Small apps don't necessarily want to pay a cost like that, and my gut says it's not needed here. Is there a reason you need the governor code to run in a thread?

Also note that this introduces a priority inversion case: if you have a very high priority thread spinning, the work queue might not run, and so might not be able to boost the CPU speed for the very high priority spinning thread!

@seankyer (Member, Author):

Good idea, I switched it to use a timer instead.
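
For reference, a minimal sketch of the timer-based approach (the callback name and the interval Kconfig symbol are illustrative):

/* Minimal sketch of driving the governor from a k_timer instead of the
 * system work queue. The expiry handler runs in the timer ISR, so it must
 * stay short. Names and CONFIG_CPU_FREQ_INTERVAL_MS are illustrative.
 */
#include <zephyr/kernel.h>
#include <zephyr/init.h>

static void cpu_freq_timer_handler(struct k_timer *timer)
{
    ARG_UNUSED(timer);

    /* Sample the metric and apply the policy's P-state decision here. */
}

K_TIMER_DEFINE(cpu_freq_timer, cpu_freq_timer_handler, NULL);

static int cpu_freq_init(void)
{
    k_timer_start(&cpu_freq_timer,
                  K_MSEC(CONFIG_CPU_FREQ_INTERVAL_MS),  /* first expiry */
                  K_MSEC(CONFIG_CPU_FREQ_INTERVAL_MS)); /* then periodic */

    return 0;
}

SYS_INIT(cpu_freq_init, APPLICATION, CONFIG_APPLICATION_INIT_PRIORITY);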

@seankyer seankyer force-pushed the feat/cpu-freq-subsystem branch 4 times, most recently from 2f2eb79 to 2efb4fc on July 24, 2025 19:32
seankyer and others added 2 commits July 24, 2025 12:32
Add a CPU frequency scaling subsystem, allowing a policy
algorithm to control the dynamic frequency and voltage
scaling abilities of a given SoC/MCU.

Implement a basic, "on-demand" policy algorithm which
iterates through the P-states supported by the SoC and
selects the first P-state whose trigger threshold is
less than the CPU load.

Implement the CPU load measurement function for use by
the on-demand scaling policy. The statistics used to calculate
the load are taken from the scheduler. Specifically, the number
of cycles used by the idle thread.

The CPU frequency scaling subsystem does not currently
support SMP. The CPU load measurement can be made to support
SMP since statistics are measured from the scheduler.

Co-authored-by: Eric Hay <[email protected]>
Signed-off-by: Sean Kyer <[email protected]>
Define P-states for Max32 and add cpu_freq driver
for MAX32655.

Signed-off-by: Sean Kyer <[email protected]>

Successfully merging this pull request may close these issues.

[RFC] CPU Frequency Scaling Subsystem