Go to file
Chen Yu addca28512 cpufreq: intel_pstate: Handle no_turbo in frequency invariance
Problem statement:

Once the user has disabled turbo frequency by

# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

the cfs_rq's util_avg becomes quite small when compared with
CPU capacity.

Step to reproduce:

# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

# ./x86_cpuload --count 1 --start 3 --timeout 100 --busy 99

would launch 1 thread and bind it to CPU3, lasting for 100 seconds,
with a CPU utilization of 99%. [1]

top result:
%Cpu3  : 98.4 us,  0.0 sy,  0.0 ni,  1.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

check util_avg:
cat /sys/kernel/debug/sched/debug | grep "cfs_rq\[3\]" -A 20 | grep util_avg
  .util_avg                      : 611

So the util_avg/cpu capacity is 611/1024, which is much smaller than
98.4% shown in the top result.

This might impact some logic in the scheduler. For example,
group_is_overloaded() would compare the group_capacity and group_util
in the sched group, to check if this sched group is overloaded or not.
With this gap, even when there is a nearly 100% workload, the sched
group will not be regarded as overloaded. Besides group_is_overloaded(),
there are also other victims. There is a ongoing work that aims to
optimize the task wakeup in a LLC domain. The main idea is to stop
searching idle CPUs if the sched domain is overloaded[2]. This proposal
also relies on the util_avg/CPU capacity to decide whether the LLC
domain is overloaded.

Analysis:

CPU frequency invariance has caused this difference. In summary,
the util_sum of cfs rq would decay quite fast when the CPU is in
idle, when the CPU frequency invariance is enabled.

The detail is as followed:

As depicted in update_rq_clock_pelt(), when the frequency invariance
is enabled, there would be two clock variables on each rq, clock_task
and clock_pelt:

   The clock_pelt scales the time to reflect the effective amount of
   computation done during the running delta time but then syncs back to
   clock_task when rq is idle.

   absolute time    | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16
   @ max frequency  ------******---------------******---------------
   @ half frequency ------************---------************---------
   clock pelt       | 1| 2|    3|    4| 7| 8| 9|   10|   11|14|15|16

The fast decay of util_sum during idle is due to:

 1. rq->clock_pelt is always behind rq->clock_task
 2. rq->last_update is updated to rq->clock_pelt' after invoking
    ___update_load_sum()
 3. Then the CPU becomes idle, the rq->clock_pelt' would be suddenly
    increased a lot to rq->clock_task
 4. Enters ___update_load_sum() again, the idle period is calculated by
    rq->clock_task - rq->last_update, AKA, rq->clock_task - rq->clock_pelt'.
    The lower the CPU frequency is, the larger the delta =
    rq->clock_task - rq->clock_pelt' will be. Since the idle period will be
    used to decay the util_sum only, the util_sum drops significantly during
    idle period.

Proposal:

This symptom is not only caused by disabling turbo frequency, but it
would also appear if the user limits the max frequency at runtime.

Because, if the frequency is always lower than the max frequency,
CPU frequency invariance would decay the util_sum quite fast during
idle.

As some end users would disable turbo after boot up, this patch aims to
present this symptom and deals with turbo scenarios for now.

It might be ideal if CPU frequency invariance is aware of the max CPU
frequency (user specified) at runtime in the future.

Link: https://github.com/yu-chen-surf/x86_cpuload.git #1
Link: https://lore.kernel.org/lkml/20220310005228.11737-1-yu.c.chen@intel.com/ #2
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-04-13 17:38:04 +02:00
Documentation linux-kselftest-kunit-fixes-5.18-rc2 2022-04-08 15:06:11 -10:00
LICENSES LICENSES/LGPL-2.1: Add LGPL-2.1-or-later as valid identifiers 2021-12-16 14:33:10 +01:00
arch Driver core changes for 5.18-rc2 2022-04-10 09:55:09 -10:00
block for-5.18/block-2022-04-01 2022-04-01 16:20:00 -07:00
certs Kbuild updates for v5.18 2022-03-31 11:59:03 -07:00
crypto for-5.18/64bit-pi-2022-03-25 2022-03-26 12:01:35 -07:00
drivers cpufreq: intel_pstate: Handle no_turbo in frequency invariance 2022-04-13 17:38:04 +02:00
fs Driver core changes for 5.18-rc2 2022-04-10 09:55:09 -10:00
include Driver core changes for 5.18-rc2 2022-04-10 09:55:09 -10:00
init Kbuild updates for v5.18 2022-03-31 11:59:03 -07:00
ipc fs: allocate inode by using alloc_inode_sb() 2022-03-22 15:57:03 -07:00
kernel - A couple of fixes to cgroup-related handling of perf events 2022-04-10 07:08:22 -10:00
lib Driver core changes for 5.18-rc2 2022-04-10 09:55:09 -10:00
mm - Allow the compiler to optimize away unused percpu accesses and change 2022-04-10 06:56:46 -10:00
net NFS client bugfixes for Linux 5.18 2022-04-08 07:39:17 -10:00
samples dma-mapping updates for Linux 5.18 2022-03-29 08:50:14 -07:00
scripts modpost: restore the warning message for missing symbol versions 2022-04-03 03:11:51 +09:00
security hardening updates for v5.18-rc1-fix1 2022-03-31 11:43:01 -07:00
sound sound fixes for 5.18-rc1 2022-04-01 10:32:46 -07:00
tools - Fix the MSI message data struct definition 2022-04-10 07:12:27 -10:00
usr Kbuild updates for v5.18 2022-03-31 11:59:03 -07:00
virt KVM: Remove dirty handling from gfn_to_pfn_cache completely 2022-04-02 05:34:41 -04:00
.clang-format genirq/msi: Make interrupt allocation less convoluted 2021-12-16 22:22:20 +01:00
.cocciconfig scripts: add Linux .cocciconfig for coccinelle 2016-07-22 12:13:39 +02:00
.get_maintainer.ignore Opt out of scripts/get_maintainer.pl 2019-05-16 10:53:40 -07:00
.gitattributes .gitattributes: use 'dts' diff driver for dts files 2019-12-04 19:44:11 -08:00
.gitignore .gitignore: ignore only top-level modules.builtin 2021-05-02 00:43:35 +09:00
.mailmap mailmap: update Vasily Averin's email address 2022-04-08 14:20:36 -10:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS MAINTAINERS: replace a Microchip AT91 maintainer 2022-02-09 11:30:01 +01:00
Kbuild kbuild: rename hostprogs-y/always to hostprogs/always-y 2020-02-04 01:53:07 +09:00
Kconfig kbuild: ensure full rebuild when the compiler is updated 2020-05-12 13:28:33 +09:00
MAINTAINERS v5.18 first rc pull request 2022-04-08 18:29:02 -10:00
Makefile Linux 5.18-rc2 2022-04-10 14:21:36 -10:00
README Drop all 00-INDEX files from Documentation/ 2018-09-09 15:08:58 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.