For some specific platforms (E.g. AlderLake) the balance performance
EPP is updated from the hard coded value in the driver. This acts as
the default and balance_performance EPP. The purpose of this EPP
update is to reach maximum 1 core turbo frequency (when possible) out
of the box.
Although we can achieve the objective by using hard coded value in the
driver, there can be other EPP which can be better in terms of power.
But that will be very subjective based on platform and use cases.
This is not practical to have a per platform specific default hard coded
in the driver.
If a platform wants to specify default EPP, it can be set in the firmware.
If this EPP is not the chipset default of 0x80 (balance_perf_epp unless
driver changed it) and more performance oriented but not 0, the driver
can use this as the default and balanced_perf EPP. In this case no driver
update is required every time there is some new platform and default EPP.
If the firmware didn't update the EPP from the chipset default then
the hard coded value is used as per existing implementation.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
With HWP enabled, when the turbo range of performance levels is
disabled by the platform firmware, the CPU capacity is given by
the "guaranteed performance" field in MSR_HWP_CAPABILITIES which
is generally dynamic. When it changes, the kernel receives an HWP
notification interrupt handled by notify_hwp_interrupt().
When the "guaranteed performance" value changes in the above
configuration, the CPU performance scaling needs to be adjusted so
as to use the new CPU capacity in computations, which means that
the cpuinfo.max_freq value needs to be updated for that CPU.
Accordingly, modify intel_pstate_notify_work() to read
MSR_HWP_CAPABILITIES and update cpuinfo.max_freq to reflect the
new configuration (this update can be carried out even if the
configuration doesn't actually change, because it simply doesn't
matter then and it takes less time to update it than to do extra
checks to decide whether or not a change has really occurred).
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There is an expectation from users that they can get frequency specified
by cpufreq/cpuinfo_max_freq when conditions permit. But with AlderLake
mobile it may not be possible. This is possible that frequency is clipped
based on the system power-up EPP value. In this case users can update
cpufreq/energy_performance_preference to some performance oriented EPP to
limit clipping of frequencies.
To get out of box behavior as the prior generations of CPUs, update EPP
for AlderLake mobile CPUs on boot. On prior generations of CPUs EPP = 128
was enough to get maximum frequency, but with AlderLake mobile the
equivalent EPP is 102. Since EPP is model specific, this is possible that
they have different meaning on each generation of CPU.
The current EPP string "balance_performance" corresponds to EPP = 128.
Change the EPP corresponding to "balance_performance" to 102 for only
AlderLake mobile CPUs and update this on each CPU during boot.
To implement reuse epp_values[] array and update the modified EPP at the
index for BALANCE_PERFORMANCE. Add a dummy EPP_INDEX_DEFAULT to
epp_values[] to match indexes in the energy_perf_strings[].
After HWP PM is enabled also update EPP when "balance_performance" is
redefined for the very first time after the boot on each CPU. On
subsequent suspend/resume or offline/online the old EPP is restored,
so no specific action is needed.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
It is not necessary to call intel_pstate_get_hwp_cap() from
intel_pstate_update_perf_limits(), because it gets called from
intel_pstate_verify_cpu_policy() which is either invoked directly
right before intel_pstate_update_perf_limits(), in
intel_cpufreq_verify_policy() in the passive mode, or called
from driver callbacks in a sequence that causes it to be followed
by an immediate intel_pstate_update_perf_limits().
Namely, in the active mode intel_cpufreq_verify_policy() is called
by intel_pstate_verify_policy() which is the ->verify() callback
routine of intel_pstate and gets called by the cpufreq core right
before intel_pstate_set_policy(), which is the driver's ->setoplicy()
callback routine, where intel_pstate_update_perf_limits() is called.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
On systems with overclocking enabled, CPPC Highest Performance can be
hard coded to 0xff. In this case even if we have cores with different
highest performance, ITMT can't be enabled as the current implementation
depends on CPPC Highest Performance.
On such systems we can use MSR_HWP_CAPABILITIES maximum performance field
when CPPC.Highest Performance is 0xff.
Due to legacy reasons, we can't solely depend on MSR_HWP_CAPABILITIES as
in some older systems CPPC Highest Performance is the only way to identify
different performing cores.
Reported-by: Michael Larabel <Michael@MichaelLarabel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Michael Larabel <Michael@MichaelLarabel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
After commit 4adcf2e582 ("cpufreq: intel_pstate: Add ->offline and
->online callbacks") the EPP value set by the "performance" scaling
algorithm in the active mode is not restored after an offline/online
cycle which replaces it with the saved EPP value coming from user
space.
Address this issue by forcing intel_pstate_hwp_set() to set a new
EPP value when it runs first time after online.
Fixes: 4adcf2e582 ("cpufreq: intel_pstate: Add ->offline and ->online callbacks")
Link: https://lore.kernel.org/linux-pm/adc7132c8655bd4d1c8b6129578e931a14fe1db2.camel@linux.intel.com/
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit fbdc21e9b0 ("cpufreq: intel_pstate: Add Icelake servers
support in no-HWP mode") enabled the use of Intel P-State driver
for Ice Lake servers.
But it doesn't cover the case when OS can't control P-States.
Therefore, for Ice Lake server, if MSR_MISC_PWR_MGMT bits 8 or 18
are enabled, then the Intel P-State driver should exit as OS can't
control P-States.
Fixes: fbdc21e9b0 ("cpufreq: intel_pstate: Add Icelake servers support in no-HWP mode")
Signed-off-by: Adamos Ttofari <attofari@amazon.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
It is possible that some performance excursions happened before OS boot
or enable HWP interrupts. So clear MSR_HWP_STATUS bits when we enable
HWP interrupt. In this way a next excursion will results in a HWP
interrupt.
The status bits of MSR_HWP_STATUS must be cleared (0) by software so
that a new status condition change will cause the hardware to set the
bit again and issue the notification.
Fixes: 57577c996d ("cpufreq: intel_pstate: Process HWP Guaranteed change notification")
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
It is possible that on some platforms HWP interrupts are disabled. In
that case accessing MSR 0x773 will result in warning.
So check X86_FEATURE_HWP_NOTIFY feature to access MSR 0x773. The other
places in code where this MSR is accessed, already checks this feature
except during disable path called during cpufreq offline and suspend
callbacks.
Fixes: 57577c996d ("cpufreq: intel_pstate: Process HWP Guaranteed change notification")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit a365ab6b9d ("cpufreq: intel_pstate: Implement the
->adjust_perf() callback") caused intel_pstate to use nonzero HWP
desired values in certain usage scenarios, but it did not prevent
them from being leaked into the confugirations in which HWP desired
is expected to be 0.
The failing scenarios are switching the driver from the passive
mode to the active mode and starting a new kernel via kexec() while
intel_pstate is running in the passive mode.
To address this issue, ensure that HWP desired will be cleared on
offline and suspend/shutdown.
Fixes: a365ab6b9d ("cpufreq: intel_pstate: Implement the ->adjust_perf() callback")
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Tested-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Fix a problem in active mode that cpu->pstate.turbo_freq is initialized
only if HWP-to-frequency scaling factor is refined.
In passive mode, this problem is not exposed, because
cpu->pstate.turbo_freq is set again, later in
intel_cpufreq_cpu_init()->intel_pstate_get_hwp_cap().
Fixes: eb3693f052 ("cpufreq: intel_pstate: hybrid: CPU-specific scaling factor")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
It is possible that HWP guaranteed ratio is changed in response to
change in power and thermal limits. For example when Intel Speed Select
performance profile is changed or there is change in TDP, hardware can
send notifications. It is possible that the guaranteed ratio is
increased. This creates an issue when turbo is disabled, as the old
limits set in MSR_HWP_REQUEST are still lower and hardware will clip
to older limits.
This change enables HWP interrupt and process HWP interrupts. When
guaranteed is changed, calls cpufreq_update_policy() so that driver
callbacks are called to update to new HWP limits. This callback
is called from a delayed workqueue of 10ms to avoid frequent updates.
Although the scope of IA32_HWP_INTERRUPT is per logical cpu, on some
plaforms interrupt is generated on all CPUs. This is particularly a
problem during initialization, when the driver didn't allocated
data for other CPUs. So this change uses a cpumask of enabled CPUs and
process interrupts on those CPUs only.
When the cpufreq offline() or suspend() callback is called, HWP interrupt
is disabled on those CPUs and also cancels any pending work item.
Spin lock is used to protect data and processing shared with interrupt
handler. Here READ_ONCE(), WRITE_ONCE() macros are used to designate
shared data, even though spin lock act as an optimization barrier here.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: pablomh@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If HWP has been already been enabled by BIOS, it may be
necessary to override some kernel command line parameters.
Once it has been enabled it requires a reset to be disabled.
Suggested-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The current HWP calibration for hybrid processors in intel_pstate is
fragile, because it depends too much on the information provided by
the platform firmware via CPPC which may not be reliable enough. It
also need not be so complicated.
In order to improve that mechanism and make it more resistant to
platform firmware issues, make it only use the CPPC nominal_perf
values to compute the HWP-to-frequency scaling factors for all
CPUs and possibly use the HWP_CAP highest_perf values to recompute
them if the ones derived from the CPPC nominal_perf values alone
appear to be too high.
Namely, fetch CPC.nominal_perf for all CPUs present in the system,
find the minimum one and use it as a reference for computing all of
the CPUs' scaling factors (using the observation that for the CPUs
having the minimum CPC.nominal_perf the HWP range of available
performance levels should be the same as the range of available
"legacy" P-states and so the HWP-to-frequency scaling factor for
them should be the same as the corresponding scaling factor used
for representing the P-state values in kHz).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Revert commit d0e936adbd ("cpufreq: intel_pstate: Process HWP
Guaranteed change notification"), because it causes a NULL pointer
dereference to occur on Lenovo X1 gen9 laptops due to an HWP
guaranteed performance change interrupt arriving prematurely.
This feature will be revisited in the next cycle.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
It is possible that HWP guaranteed ratio is changed in response to
change in power and thermal limits. For example when Intel Speed Select
performance profile is changed or there is change in TDP, hardware can
send notifications. It is possible that the guaranteed ratio is
increased. This creates an issue when turbo is disabled, as the old
limits set in MSR_HWP_REQUEST are still lower and hardware will clip
to older limits.
This change enables HWP interrupt and process HWP interrupts. When
guaranteed is changed, calls cpufreq_update_policy() so that driver
callbacks are called to update to new HWP limits. This callback
is called from a delayed workqueue of 10ms to avoid frequent updates.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Combine the ->stop_cpu() and ->offline() callback routines for
intel_pstate in the active mode so as to avoid setting the
->stop_cpu callback pointer which is going to be dropped from
the framework.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
One of the previous commits introducing hybrid processor support to
intel_pstate broke build with CONFIG_ACPI unset.
Fix that and while at it make empty stubs of two functions related
to ACPI CPPC static inline and fix a spelling mistake in the name of
one of them.
Fixes: eb3693f052 ("cpufreq: intel_pstate: hybrid: CPU-specific scaling factor")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Users may disable HWP in firmware, in which case intel_pstate wouldn't load
unless the CPU model is explicitly supported.
See also commit d8de7a44e1 ("cpufreq: intel_pstate: Add Skylake servers
support").
Suggested-by: Doug Smythies <dsmythies@telus.net>
Tested-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Users may disable HWP in firmware, in which case intel_pstate wouldn't load
unless the CPU model is explicitly supported.
Add ICELAKE_X to the list of CPUs that can register intel_pstate while not
advertising the HWP capability. Without this change, an ICELAKE_X in no-HWP
mode could only use the acpi_cpufreq frequency scaling driver.
See also commit d8de7a44e1 ("cpufreq: intel_pstate: Add Skylake servers
support").
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The scaling factor between HWP performance levels and CPU frequency
may be different for different types of CPUs in a hybrid processor
and in general the HWP performance levels need not correspond to
"P-states" representing values that would be written to
MSR_IA32_PERF_CTL if HWP was disabled.
However, the policy limits control in cpufreq is defined in terms
of CPU frequency, so it is necessary to map the frequency limits set
through that interface to HWP performance levels with reasonable
accuracy and the behavior of that interface on hybrid processors
has to be compatible with its behavior on non-hybrid ones.
To address this problem, use the observations that (1) on hybrid
processors the sysfs interface can operate by mapping frequency
to "P-states" and translating those "P-states" to specific HWP
performance levels of the given CPU and (2) the scaling factor
between the MSR_IA32_PERF_CTL "P-states" and CPU frequency can be
regarded as a known value. Moreover, the mapping between the
HWP performance levels and CPU frequency can be assumed to be
linear and such that HWP performance level 0 correspond to the
frequency value of 0, so it is only necessary to know the
frequency corresponding to one specific HWP performance level
to compute the scaling factor applicable to all of them.
One possibility is to take the nominal performance value from CPPC,
if available, and use cpu_khz as the corresponding frequency. If
the CPPC capabilities interface is not there or the nominal
performance value provided by it is out of range, though, something
else needs to be done.
Namely, the guaranteed performance level either from CPPC or from
MSR_HWP_CAPABILITIES can be used instead, but the corresponding
frequency needs to be determined. That can be done by computing the
product of the (known) scaling factor between the MSR_IA32_PERF_CTL
P-states and CPU frequency (the PERF_CTL scaling factor) and the
P-state value referred to as the "TDP ratio".
If the HWP-to-frequency scaling factor value obtained in one of the
ways above turns out to be euqal to the PERF_CTL scaling factor, it
can be assumed that the number of HWP performance levels is equal to
the number of P-states and the given CPU can be handled as though
this was not a hybrid processor.
Otherwise, one more adjustment may still need to be made, because the
HWP-to-frequency scaling factor computed so far may not be accurate
enough (e.g. because the CPPC information does not match the exact
behavior of the processor). Specifically, in that case the frequency
corresponding to the highest HWP performance value from
MSR_HWP_CAPABILITIES (computed as the product of that value and the
HWP-to-frequency scaling factor) cannot exceed the frequency that
corresponds to the maximum 1-core turbo P-state value from
MSR_TURBO_RATIO_LIMIT (computed as the procuct of that value and the
PERF_CTL scaling factor) and the HWP-to-frequency scaling factor may
need to be adjusted accordingly.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The turbo_pct and num_pstates sysfs attributes represent CPU
properties that may be different for differenty types of CPUs in
a hybrid processor, so avoid exposing them in that case.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
It turns out that there are systems where HWP is enabled during
initialization by the platform firmware (BIOS), but HWP EPP support
is not advertised.
After commit 7aa1031223 ("cpufreq: intel_pstate: Avoid enabling HWP
if EPP is not supported") intel_pstate refuses to use HWP on those
systems, but the fallback PERF_CTL interface does not work on them
either because of enabled HWP, and once enabled, HWP cannot be
disabled. Consequently, the users of those systems cannot control
CPU performance scaling.
Address this issue by making intel_pstate use HWP unconditionally if
it is enabled already when the driver starts.
Fixes: 7aa1031223 ("cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported")
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+
Because pstate.max_freq is always equal to the product of
pstate.max_pstate and pstate.scaling and, analogously,
pstate.turbo_freq is always equal to the product of
pstate.turbo_pstate and pstate.scaling, the result of the
max_policy_perf computation in intel_pstate_update_perf_limits() is
always equal to the quotient of policy_max and pstate.scaling,
regardless of whether or not turbo is disabled. Analogously, the
result of min_policy_perf in intel_pstate_update_perf_limits() is
always equal to the quotient of policy_min and pstate.scaling.
Accordingly, intel_pstate_update_perf_limits() need not check
whether or not turbo is enabled at all and in order to compute
max_policy_perf and min_policy_perf it can always divide policy_max
and policy_min, respectively, by pstate.scaling. Make it do so.
While at it, move the definition and initialization of the
turbo_max local variable to the code branch using it.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Notice that some computations related to frequency in intel_pstate
can be simplified if (a) intel_pstate_get_hwp_max() updates the
relevant members of struct cpudata by itself and (b) the "turbo
disabled" check is moved from it to its callers, so modify the code
accordingly and while at it rename intel_pstate_get_hwp_max() to
intel_pstate_get_hwp_cap() which better reflects its purpose and
provide a simplified variat of it, __intel_pstate_get_hwp_cap(),
suitable for the initialization path.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
In the comment for trace in passive mode there is an
unnecessary "the". Eradicate it.
Signed-off-by: Nigel Christian <nigel.l.christian@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently, when turbo is disabled (either by BIOS or by the user),
the intel_pstate driver reads the max non-turbo frequency from the
package-wide MSR_PLATFORM_INFO(0xce) register.
However, on asymmetric platforms it is possible in theory that small
and big core with HWP enabled might have different max non-turbo CPU
frequency, because MSR_HWP_CAPABILITIES is per-CPU scope according
to Intel Software Developer Manual.
The turbo max freq is already per-CPU in current code, so make
similar change to the max non-turbo frequency as well.
Reported-by: Wendy Wang <wendy.wang@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
[ rjw: Subject and changelog edits ]
Cc: 4.18+ <stable@vger.kernel.org> # 4.18+: a45ee4d4e1: cpufreq: intel_pstate: Change intel_pstate_get_hwp_max() argument
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Rename intel_cpufreq_adjust_hwp() and intel_cpufreq_adjust_perf_ctl()
to intel_cpufreq_hwp_update() and intel_cpufreq_perf_ctl_update(),
respectively, to avoid possible confusion with the ->adjist_perf()
callback function, intel_cpufreq_adjust_perf().
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
All of the callers of intel_pstate_get_hwp_max() access the struct
cpudata object that corresponds to the given CPU already and the
function itself needs to access that object (in order to update
hwp_cap_cached), so modify the code to pass a struct cpudata pointer
to it instead of the CPU number.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Because intel_pstate_get_hwp_max() which updates hwp_cap_cached
may run in parallel with the readers of it, annotate all of the
read accesses to it with READ_ONCE().
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
percent_fp() was used in intel_pstate_pid_reset(), which was removed in
commit 9d0ef7af1f ("cpufreq: intel_pstate: Do not use PID-based P-state
selection") and hence, percent_fp() is unused since then.
percent_ext_fp() was last used in intel_pstate_update_perf_limits(), which
was refactored in commit 1a4fe38add ("cpufreq: intel_pstate: Remove
max/min fractions to limit performance"), and hence, percent_ext_fp() is
unused since then.
make CC=clang W=1 points us those unused functions:
drivers/cpufreq/intel_pstate.c:79:23: warning: unused function 'percent_fp' [-Wunused-function]
static inline int32_t percent_fp(int percent)
^
drivers/cpufreq/intel_pstate.c:94:23: warning: unused function 'percent_ext_fp' [-Wunused-function]
static inline int32_t percent_ext_fp(int percent)
^
Remove those obsolete functions.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If turbo P-states cannot be used, either due to the configuration of
the processor, or because intel_pstate is not allowed to used them,
the maximum available P-state with HWP enabled corresponds to the
HWP_CAP.GUARANTEED value which is not static. It can be adjusted by
an out-of-band agent or during an Intel Speed Select performance
level change, so long as it remains less than or equal to
HWP_CAP.MAX.
However, if turbo P-states cannot be used, intel_cpufreq_adjust_perf()
always uses pstate.max_pstate (set during the initialization of the
driver only) as the maximum available P-state, so it may miss a change
of the HWP_CAP.GUARANTEED value.
Prevent that from happening by modifyig intel_cpufreq_adjust_perf()
to always read the "guaranteed" and "maximum turbo" performance
levels from the cached HWP_CAP value.
Fixes: a365ab6b9d ("cpufreq: intel_pstate: Implement the ->adjust_perf() callback")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
When sugov_update_single_perf() falls back to the "frequency"
path due to the missing scale-invariance, it will call
cpufreq_driver_fast_switch() via sugov_fast_switch()
and the driver's ->fast_switch() callback will be invoked,
so it must not be NULL.
However, after commit a365ab6b9d ("cpufreq: intel_pstate: Implement
the ->adjust_perf() callback") intel_pstate sets ->fast_switch() to
NULL when it is going to use intel_cpufreq_adjust_perf(), which is a
mistake, because on x86 the scale-invariance may be turned off
dynamically, so modify it to retain the original ->adjust_perf()
callback pointer.
Fixes: a365ab6b9d ("cpufreq: intel_pstate: Implement the ->adjust_perf() callback")
Reported-by: Kenneth R. Crudup <kenny@panix.com>
Tested-by: Kenneth R. Crudup <kenny@panix.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When turbo has been disabled by the BIOS, but HWP_CAP.GUARANTEED is
changed later, user space may want to take advantage of this increased
guaranteed performance.
HWP_CAP.GUARANTEED is not a static value. It can be adjusted by an
out-of-band agent or during an Intel Speed Select performance level
change. The HWP_CAP.MAX is still the maximum achievable performance
with turbo disabled by the BIOS, so HWP_CAP.GUARANTEED can still
change as long as it remains less than or equal to HWP_CAP.MAX.
When HWP_CAP.GUARANTEED is changed, the sysfs base_frequency
attribute shows the most recent guaranteed frequency value. This
attribute can be used by user space software to update the scaling
min/max limits of the CPU.
Currently, the ->setpolicy() callback already uses the latest
HWP_CAP values when setting HWP_REQ, but the ->verify() callback will
restrict the user settings to the to old guaranteed performance value
which prevents user space from making use of the extra CPU capacity
theoretically available to it after increasing HWP_CAP.GUARANTEED.
To address this, read HWP_CAP in intel_pstate_verify_cpu_policy()
to obtain the maximum P-state that can be used and use that to
confine the policy max limit instead of using the cached and
possibly stale pstate.max_freq value for this purpose.
For consistency, update intel_pstate_update_perf_limits() to use the
maximum available P-state returned by intel_pstate_get_hwp_max() to
compute the maximum frequency instead of using the return value of
intel_pstate_get_max_freq() which, again, may be stale.
This issue is a side-effect of fixing the scaling frequency limits in
commit eacc9c5a92 ("cpufreq: intel_pstate: Fix intel_pstate_get_hwp_max()
for turbo disabled") which corrected the setting of the reduced scaling
frequency values, but caused stale HWP_CAP.GUARANTEED to be used in
the case at hand.
Fixes: eacc9c5a92 ("cpufreq: intel_pstate: Fix intel_pstate_get_hwp_max() for turbo disabled")
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 5.8+ <stable@vger.kernel.org> # 5.8+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Make intel_pstate expose the ->adjust_perf() callback when it
operates in the passive mode with HWP enabled which causes the
schedutil governor to use that callback instead of ->fast_switch().
The minimum and target performance-level values passed by the
governor to ->adjust_perf() are converted to HWP.REQ.MIN and
HWP.REQ.DESIRED, respectively, which allows the processor to
adjust its configuration to maximize energy-efficiency while
providing sufficient capacity.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Avoid doing the same assignment in both branches of a conditional,
do it after the whole conditional instead.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Make intel_pstate take the new CPUFREQ_GOV_STRICT_TARGET governor
flag into account when it operates in the passive mode with HWP
enabled, so as to fix the "powersave" governor behavior in that
case (currently, HWP is allowed to scale the performance all the
way up to the policy max limit when the "powersave" governor is
used, but it should be constrained to the policy min limit then).
Fixes: f6ebbcf08f ("cpufreq: intel_pstate: Implement passive mode with HWP enabled")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: 9a2a9ebc0a cpufreq: Introduce governor flags
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: 218f668701 cpufreq: Introduce CPUFREQ_GOV_STRICT_TARGET
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: ea9364bbad cpufreq: Add strict_target to struct cpufreq_policy
If the cpufreq policy max limit is changed when intel_pstate operates
in the passive mode with HWP enabled and the "powersave" governor is
used on top of it, the HWP max limit is not updated as appropriate.
Namely, in the "powersave" governor case, the target P-state
is always equal to the policy min limit, so if the latter does
not change, intel_cpufreq_adjust_hwp() is not invoked to update
the HWP Request MSR due to the "target_pstate != old_pstate" check
in intel_cpufreq_update_pstate(), so the HWP max limit is not
updated as a result.
Also, if the CPUFREQ_NEED_UPDATE_LIMITS flag is not set for the
driver and the target frequency does not change along with the
policy max limit, the "target_freq == policy->cur" check in
__cpufreq_driver_target() prevents the driver's ->target() callback
from being invoked at all, so the HWP max limit is not updated.
To prevent that occurring, set the CPUFREQ_NEED_UPDATE_LIMITS flag
in the intel_cpufreq driver structure if HWP is enabled and modify
intel_cpufreq_update_pstate() to do the "target_pstate != old_pstate"
check only in the non-HWP case and let intel_cpufreq_adjust_hwp()
always run in the HWP case (it will update HWP Request only if the
cached value of the register is different from the new one including
the limits, so if neither the target P-state value nor the max limit
changes, the register write will still be avoided).
Fixes: f6ebbcf08f ("cpufreq: intel_pstate: Implement passive mode with HWP enabled")
Reported-by: Zhang Rui <rui.zhang@intel.com>
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: 1c534352f4 cpufreq: Introduce CPUFREQ_NEED_UPDATE_LIMITS ...
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
There is a corner case that if the intel_pstate driver fails to be
registered (might be due to invalid MSR access) and acpi_cpufreq
takse over, the intel_pstate sysfs interface is still populated
which may confuse user space (turbostat for example):
grep . /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
acpi-cpufreq
grep . /sys/devices/system/cpu/intel_pstate/*
/sys/devices/system/cpu/intel_pstate/max_perf_pct:0
/sys/devices/system/cpu/intel_pstate/min_perf_pct:0
grep: /sys/devices/system/cpu/intel_pstate/no_turbo: Resource temporarily unavailable
grep: /sys/devices/system/cpu/intel_pstate/num_pstates: Resource temporarily unavailable
/sys/devices/system/cpu/intel_pstate/status:off
grep: /sys/devices/system/cpu/intel_pstate/turbo_pct: Resource temporarily unavailable
The mere presence of the intel_pstate sysfs interface does not mean
that intel_pstate is in use (for example, echo "off" to "status"),
but it should not be created in the failing case.
Fix this issue by deleting the intel_pstate sysfs if the driver
registration fails.
Reported-by: Wendy Wang <wendy.wang@intel.com>
Suggested-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com
[ rjw: Refactor code to avoid jumps, change function name, changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Fix missing return statement when writing "off" to intel_pstate status
sysfs I/F.
Fixes: 55671ea325 ("cpufreq: intel_pstate: Free memory only when turning off")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This fixes the behavior of the scaling_max_freq and scaling_min_freq
sysfs files in systems which had turbo disabled by the BIOS.
Caleb noticed that the HWP is programmed to operate in the wrong
P-state range on his system when the CPUFREQ policy min/max frequency
is set via sysfs. This seems to be because in his system
intel_pstate_get_hwp_max() is returning the maximum turbo P-state even
though turbo was disabled by the BIOS, which causes intel_pstate to
scale kHz frequencies incorrectly e.g. setting the maximum turbo
frequency whenever the maximum guaranteed frequency is requested via
sysfs.
Tested-by: Caleb Callaway <caleb.callaway@intel.com>
Signed-off-by: Francisco Jerez <currojerez@riseup.net>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Minor subject edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When intel_pstate switches the operation mode from "active" to
"passive" or the other way around, freeing its data structures
representing CPUs and allocating them again from scratch is not
necessary and wasteful. Moreover, if these data structures are
preserved, the cached HWP Request MSR value from there may be
written to the MSR to start with to reinitialize it and help to
restore the EPP value set previously (it is set to 0xFF when CPUs
go offline to allow their SMT siblings to use the full range of
EPP values and that also happens when the driver gets unregistered).
Accordingly, modify the driver to only do a full cleanup on driver
object registration errors and when its status is changed to "off"
via sysfs and to write the cached HWP Request MSR value back to
the MSR on CPU init if the data structure representing the given
CPU is still there.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Add ->offline and ->online driver callbacks to prepare for taking a
CPU offline and to restore its working configuration when it goes
back online, respectively, to avoid invoking the ->init callback on
every CPU online which is quite a bit of unnecessary overhead.
Define ->offline and ->online so that they can be used in the
passive mode as well as in the active mode and because ->offline
will do the majority of ->stop_cpu work, the passive mode does
not need that callback any more, so drop it from there.
Also modify the active mode ->suspend and ->resume callbacks to
prevent them from interfering with the new ->offline and ->online
ones in case the latter are invoked withing the system-wide suspend
and resume code flow and make the passive mode use them too.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Modify the EPP sysfs interface to reject attempts to change the EPP
to values different from 0 ("performance") in the active mode with
the "performance" policy (ie. scaling_governor set to "performance"),
to avoid situations in which the kernel appears to discard data
passed to it via the EPP sysfs attribute.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Make intel_pstate update the cached EPP value when setting the EPP
via sysfs in the active mode just like it is the case in the passive
mode, for consistency, but also for the benefit of subsequent
changes.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
After commit f6ebbcf08f ("cpufreq: intel_pstate: Implement passive
mode with HWP enabled") it is possible to change the driver status
to "off" via sysfs with HWP enabled, which effectively causes the
driver to unregister itself, but HWP remains active and it forces the
minimum performance, so even if another cpufreq driver is loaded,
it will not be able to control the CPU frequency.
For this reason, make the driver refuse to change the status to
"off" with HWP enabled.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Allow intel_pstate to work in the passive mode with HWP enabled and
make it set the HWP minimum performance limit (HWP floor) to the
P-state value given by the target frequency supplied by the cpufreq
governor, so as to prevent the HWP algorithm and the CPU scheduler
from working against each other, at least when the schedutil governor
is in use, and update the intel_pstate documentation accordingly.
Among other things, this allows utilization clamps to be taken
into account, at least to a certain extent, when intel_pstate is
in use and makes it more likely that sufficient capacity for
deadline tasks will be provided.
After this change, the resulting behavior of an HWP system with
intel_pstate in the passive mode should be close to the behavior
of the analogous non-HWP system with intel_pstate in the passive
mode, except that the HWP algorithm is generally allowed to make the
CPU run at a frequency above the floor P-state set by intel_pstate in
the entire available range of P-states, while without HWP a CPU can
run in a P-state above the requested one if the latter falls into the
range of turbo P-states (referred to as the turbo range) or if the
P-states of all CPUs in one package are coordinated with each other
at the hardware level.
[Note that in principle the HWP floor may not be taken into account
by the processor if it falls into the turbo range, in which case the
processor has a license to choose any P-state, either below or above
the HWP floor, just like a non-HWP processor in the case when the
target P-state falls into the turbo range.]
With this change applied, intel_pstate in the passive mode assumes
complete control over the HWP request MSR and concurrent changes of
that MSR (eg. via the direct MSR access interface) are overridden by
it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
The MSR_TURBO_RATIO_LIMIT can be 0. This is not an error. User can update
this MSR via BIOS settings on some systems or can use msr tools to update.
Also some systems boot with value = 0.
This results in display of cpufreq/cpuinfo_max_freq wrong. This value
will be equal to cpufreq/base_frequency, even though turbo is enabled.
But platform will still function normally in HWP mode as we get max
1-core frequency from the MSR_HWP_CAPABILITIES. This MSR is already used
to calculate cpu->pstate.turbo_freq, which is used for to set
policy->cpuinfo.max_freq. But some other places cpu->pstate.turbo_pstate
is used. For example to set policy->max.
To fix this, also update cpu->pstate.turbo_pstate when updating
cpu->pstate.turbo_freq.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Because intel_pstate_set_energy_pref_index() reads and writes the
MSR_HWP_REQUEST register without using the cached value of it used by
intel_pstate_hwp_boost_up() and intel_pstate_hwp_boost_down(), those
functions may overwrite the value written by it and so the EPP value
set via sysfs may be lost.
To avoid that, make intel_pstate_set_energy_pref_index() take the
cached value of MSR_HWP_REQUEST just like the other two routines
mentioned above and update it with the new EPP value coming from
user space in addition to updating the MSR.
Note that the MSR itself still needs to be updated too in case
hwp_boost is unset or the boosting mechanism is not active at the
EPP change time.
Fixes: e0efd5be63 ("cpufreq: intel_pstate: Add HWP boost utility and sched util hooks")
Reported-by: Francisco Jerez <currojerez@riseup.net>
Cc: 4.18+ <stable@vger.kernel.org> # 4.18+: 3da97d4db8ee cpufreq: intel_pstate: Rearrange ...
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>