cpu_freq: Add CPU frequency scaling subsystem #93461
Conversation
Is someone able to link this PR to #92123? I don't seem to have the permission for it.
@bjarki-andreasen I left an example implementation of what the P state driver may be in the second commit. If you implement one for your board, I can remove this commit.
I really like this--it seems to be very straightforward. I still need to go through it a little more to make sure that I understand it properly. As to extending it for SMP, it looks like it should be mostly straightforward. IIRC, there is a caveat about k_thread_runtime_stats_cpu_get()--the data it reports for other CPUs is from the last time it was updated on that CPU (which could be a while ago if there were no interrupts or context switches). It may be more resource intensive, but if the cpu load monitoring had its own thread on each CPU (for SMP), that would allow it to have the most recent data.
Notes & nitpicks. Nothing seems too scary from my perspective. The big points to note are: "We need a story for how this API extends to multiprocessor setups" and "The governor code probably wants to run in a timer and not the work queue"
include/zephyr/cpu_freq/cpu_freq.h (outdated)
 *
 * @param state Performance state.
 */
int32_t cpu_freq_performance_state_set(struct p_state state);
Naming: if the struct is "p_state", the function name should match. Also "erformance" is a lot of keys to type. You could probably drop the underscore for extra credit, as Linux tends strongly to spell this "pstate".
Serious review: what happens when we need to support a SOC with 2+ CPUs that can run on independent clocks? Is the idea that "pstate" is an opaque thing and that someone else will provide per-cpu governor integration?
Okay, I have updated all instances of p_state or performance_state to now use pstate (I think I got them all).
For a multicore setup, I think we have two options:
- We can instantiate an instance of cpu_freq per cpu_id/core. Each core could even have its own unique pstates defined in its respective .dts file (the vendor can customize this).
- Have the cpu_freq subsystem iterate over all cores and call a driver with cpu_id as an argument.
Both options would work for either SMP or asymmetric multiprocessing, given the cores have independent clock control.
Multicore is out of scope for this PR, but as you said, the path forward should be clear. I think what I've done is open to extension to support multicore systems.
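For illustration only, a rough sketch of the shape the second option could take; apart from struct p_state and CONFIG_MP_MAX_NUM_CPUS, every name below is invented for this example and is not part of the PR:

#include <zephyr/kernel.h>
#include <zephyr/cpu_freq/p_state.h>

/* Hypothetical hooks, named here only to show the shape of option 2. */
extern const struct p_state *cpu_freq_policy_select_pstate_for(unsigned int cpu_id);
extern int cpu_freq_driver_set_pstate_on_cpu(unsigned int cpu_id, const struct p_state *ps);

static void cpu_freq_update_all_cpus(void)
{
	/* Iterate over every core and let the per-core driver apply its P-state. */
	for (unsigned int cpu_id = 0; cpu_id < CONFIG_MP_MAX_NUM_CPUS; cpu_id++) {
		const struct p_state *ps = cpu_freq_policy_select_pstate_for(cpu_id);

		if (ps != NULL) {
			(void)cpu_freq_driver_set_pstate_on_cpu(cpu_id, ps);
		}
	}
}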
include/zephyr/cpu_freq/p_state.h (outdated)
struct p_state {
	uint32_t load_threshold; /**< CPU load threshold at which P-state should be triggered */
	uint8_t disabled;        /**< Flag to indicate if P-state is disabled */
	const void *config;      /**< Vendor specific devicetree properties */
Given this is almost 100% guaranteed to be static, it might make more sense to let vendors with derived implementations define their own device structs that merely contain a set of struct p_state's, and provide e.g. macros to do static initialization instead. Saves carrying the pointer around for platforms that won't use it.
I'm not sure if I follow. What I've tried to do is define a skeleton of what a standard pstate should look like, so policy and drivers can pass around a known struct, but leave it open for extension. I based it off of Bjarki's comment on the RFC #92123 (comment).
The pstate "owner" could define:
struct my_pstate {
	struct p_state pstate;
	int my_config_data;
};
And then any time it wants to get access to my_config_data from the callback, it can use CONTAINER_OF(). Now the struct doesn't need the pointer in it anymore.
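As a minimal sketch of that pattern (the callback name my_pstate_set() and its signature are assumptions for illustration, not something defined by this PR):

#include <zephyr/sys/util.h>
#include <zephyr/sys/printk.h>
#include <zephyr/cpu_freq/p_state.h>

struct my_pstate {
	struct p_state pstate;
	int my_config_data;
};

/* Hypothetical vendor callback: it receives only the generic p_state pointer
 * and recovers the wrapper with CONTAINER_OF(), so struct p_state no longer
 * needs to carry a const void *config member. */
static int my_pstate_set(struct p_state *state)
{
	struct my_pstate *mine = CONTAINER_OF(state, struct my_pstate, pstate);

	printk("applying pstate, vendor data: %d\n", mine->my_config_data);
	return 0;
}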
include/zephyr/cpu_freq/p_state.h (outdated)
 */
#define P_STATE_DT_GET(_node) &(_CONCAT(p_state_, DEVICE_DT_NAME_GET(_node))),

struct p_state {
Not sure about having a completely separate struct for subsystem config. Usually the way this works is that you fold this in as a struct device.
The DTS militia will need to chime in here (not really my area, and my tastes don't always align), but my gut says that people aren't going to like the idea of DTS-configuring something that isn't itself a device.
I am basically just trying to follow what is done in the pm subsystem. There, they have a struct called struct pm_state_info whose details are defined through the .dtsi for a given chip, e.g.:
power-states {
	/* Sleep Mode */
	idle: idle {
		compatible = "zephyr,power-state";
		power-state-name = "runtime-idle";
		min-residency-us = <50>;
		exit-latency-us = <5>;
	};
	/* ...etc. */
};
Just following Bjarki's suggestion in #92123 (comment), it seemed to tie together nicely.
These states have to be in the devicetree: they are entirely hardware specific, describing combinations of voltages and clock speeds supported by the hardware (or at least mapping to them), much like clock states, pinctrl states, or indeed sleep states.
subsys/cpu_freq/metrics/cpu_load.c (outdated)
static uint64_t execution_cycles_prev;
static uint64_t total_cycles_prev;

int get_cpu_load(uint32_t *load)
Seems like this wants to move into the kernel, as part of the runtime stats API, no?
Also naming: global functions need some kind of name spacing, names like "get_xxx()" are basically guaranteed to collide for someone.
And naming again: this is in fact not what Unix has traditionally meant by "load", which captures the number of runnable processes in the scheduler state (i.e. "how many 'users' are competing for the system", not "how much of the CPU time is running app code"). I'd suggest something like "cpu_utilization" instead.
There are already a few different APIs available to retrieve this type of information; it really depends on the use case:
- subsys/debug/cpu_load.c
- subsys/debug/thread_analyzer
- Object Core Statistics
I figured it would be best to rely on an already established API to retrieve these stats. The call we use, k_thread_runtime_stats_cpu_get(), and the infrastructure needed for it are also pretty lightweight. Also, this implementation is fairly application specific: it relies on stable periodic calls to produce consistent results. I think if we want to promote this type of functionality, it would require another discussion.
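As a sketch of the approach (this is my reading of the metric, not the PR's exact code; it assumes CONFIG_SCHED_THREAD_USAGE_ALL is enabled so CPU-level stats are available):

#include <errno.h>
#include <zephyr/kernel.h>

static uint64_t execution_cycles_prev;
static uint64_t total_cycles_prev;

int get_cpu_load(uint32_t *load)
{
	k_thread_runtime_stats_t stats;
	int ret = k_thread_runtime_stats_cpu_get(0, &stats);

	if (ret != 0) {
		return ret;
	}

	/* For CPU statistics, execution_cycles covers idle plus non-idle time,
	 * while total_cycles counts only the non-idle part. */
	uint64_t exec_delta = stats.execution_cycles - execution_cycles_prev;
	uint64_t busy_delta = stats.total_cycles - total_cycles_prev;

	execution_cycles_prev = stats.execution_cycles;
	total_cycles_prev = stats.total_cycles;

	if (exec_delta == 0U) {
		return -EAGAIN;
	}

	/* Utilization in percent over the window since the previous call. */
	*load = (uint32_t)((100U * busy_delta) / exec_delta);

	return 0;
}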
include/zephyr/cpu_freq/policy.h (outdated)
 *
 * @retval 0 In case of success, nonzero in case of failure.
 */
int cpu_freq_policy_get_p_state_next(struct p_state *p_state);
Naming: "next" implies that this is an iterator or something (my immediate thought was "where is first()?!"). How about "cpu_freq_select_pstate()
" instead?
Also docs should probably be clearer about intent. "This function is implemented by a governor implementation selected via kconfig, for example 'on_demand'."
I was following PM precedent with this, but I actually agree that I never liked the use of next. I renamed it as per your suggestion and also added some more info to the comment.
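For reference, a minimal sketch of what the on-demand selection could look like after the rename; the pstates[] table, num_pstates, and the exact signature are placeholders for illustration, not the PR's actual code:

#include <errno.h>
#include <stddef.h>
#include <zephyr/kernel.h>
#include <zephyr/cpu_freq/p_state.h>

int get_cpu_load(uint32_t *load); /* metric from subsys/cpu_freq/metrics */

/* Placeholder table; the PR builds the list of P-states from devicetree. */
extern const struct p_state *const pstates[];
extern const size_t num_pstates;

int cpu_freq_select_pstate(const struct p_state **out)
{
	uint32_t load;
	int ret = get_cpu_load(&load);

	if (ret != 0) {
		return ret;
	}

	/* Pick the first enabled P-state whose trigger threshold the current
	 * load exceeds, as described in the commit message. */
	for (size_t i = 0; i < num_pstates; i++) {
		if (!pstates[i]->disabled && load > pstates[i]->load_threshold) {
			*out = pstates[i];
			return 0;
		}
	}

	return -ENODEV;
}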
subsys/cpu_freq/cpu_freq.c (outdated)
LOG_MODULE_REGISTER(cpu_freq, CONFIG_CPU_FREQ_LOG_LEVEL);

static void cpu_freq_work_handler(struct k_work *work);
K_WORK_DELAYABLE_DEFINE(cpu_freq_work, cpu_freq_work_handler);
Can this be a k_timer and not a work item? The former is a lightweight callback that runs in the timer ISR, the latter demands that the system work queue be instantiated, which is a whole thread. Small apps don't necessarily want to pay a cost like that, and my gut says it's not needed here. Is there a reason you need the governor code to run in a thread?
Also note that this introduces a priority inversion case: if you have a very high priority thread spinning, the work queue might not run, and so might not be able to boost the CPU speed for the very high priority spinning thread!
Good idea, I switched it to use a timer instead.
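For completeness, a rough sketch of the timer-based arrangement (the 100 ms interval, the init hook, and the names are placeholders; the expiry function runs in the timer ISR, so the governor work done there must not block):

#include <zephyr/kernel.h>

static void cpu_freq_timer_handler(struct k_timer *timer)
{
	ARG_UNUSED(timer);

	/* Evaluate the policy and apply the selected P-state here;
	 * keep it short since this runs in interrupt context. */
}

K_TIMER_DEFINE(cpu_freq_timer, cpu_freq_timer_handler, NULL);

static int cpu_freq_init(void)
{
	/* 100 ms is an arbitrary example interval, not the PR's value. */
	k_timer_start(&cpu_freq_timer, K_MSEC(100), K_MSEC(100));

	return 0;
}

SYS_INIT(cpu_freq_init, APPLICATION, CONFIG_APPLICATION_INIT_PRIORITY);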
Add a CPU frequency scaling subsystem, allowing a policy algorithm to control the dynamic frequency and voltage scaling abilities of a given SoC/MCU.
Implement a basic "on-demand" policy algorithm which iterates through the P-states supported by the SoC and selects the first P-state whose trigger threshold is less than the CPU load.
Implement the CPU load measurement function for use by the on-demand scaling policy. The statistics used to calculate the load are taken from the scheduler, specifically the number of cycles used by the idle thread.
The CPU frequency scaling subsystem does not currently support SMP. The CPU load measurement can be made to support SMP since statistics are measured from the scheduler.
Co-authored-by: Eric Hay <[email protected]>
Signed-off-by: Sean Kyer <[email protected]>
Define P-states for Max32 and add cpu_freq driver for MAX32655. Signed-off-by: Sean Kyer <[email protected]>
This PR implements: #92123 CPU Frequency Scaling Subsystem
Updated System Level Diagram

Differences from Original RFC Proposal
- The original RFC proposed a metrics folder within the subsystem in which different metrics could be defined. The THREAD_RUNTIME_STATS option is a lightweight way to get this information and essentially is the "subsystem" we had originally hoped for.
Future Improvements and Features