Commit graph

1185189 commits

Author SHA1 Message Date
Vineeth Pillai
e20f204c88 sched/deadline: Update GRUB description in the documentation
Update the details of GRUB to reflect the updated logic.

Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20230530135526.2385378-2-vineeth@bitbyteword.org
2023-06-16 22:08:12 +02:00
Vineeth Pillai
6a9d623aad sched/deadline: Fix bandwidth reclaim equation in GRUB
According to the GRUB[1] rule, the runtime is depreciated as:
  "dq = -max{u, (1 - Uinact - Uextra)} dt" (1)

To guarantee that deadline tasks don't starve lower class tasks,
we do not allocate the full bandwidth of the cpu to deadline tasks.
Maximum bandwidth usable by deadline tasks is denoted by "Umax".
Considering Umax, equation (1) becomes:
  "dq = -(max{u, (Umax - Uinact - Uextra)} / Umax) dt" (2)

Current implementation has a minor bug in equation (2), which this
patch fixes.
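
Equation (2) can be explored with a small floating-point sketch. This is not the kernel's fixed-point implementation: the function names are hypothetical, all tasks are assumed to stay active (Uinact = 0), and Uextra is assumed to be Umax minus the sum of the admitted task bandwidths.

```c
#include <assert.h>

/* Runtime depreciation per unit of wall-clock time, per equation (2):
 * max{u, (Umax - Uinact - Uextra)} / Umax.
 * All inputs are fractions of one CPU. */
static double grub_rate(double u, double umax, double uinact, double uextra)
{
    double reclaim = umax - uinact - uextra;

    assert(umax > 0.0 && umax <= 1.0);
    return (u > reclaim ? u : reclaim) / umax;
}

/* CPU fraction used by a task that always runs until its budget is gone:
 * a budget of `runtime` lasts runtime / rate of wall-clock time per period. */
static double grub_reclaimed_util(double runtime, double period, double rate)
{
    double used;

    assert(rate > 0.0 && period > 0.0);
    used = runtime / rate;
    return (used < period ? used : period) / period;
}
```

Under these assumptions, a single task with runtime=7ms and period=10ms (u = 0.7, Umax = 0.95, Uextra = 0.25) depreciates at roughly 0.737 per unit of time, so its budget lasts about 9.5ms of every 10ms period.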

The reclamation logic is verified by a sample program which creates
multiple deadline threads and observes their utilization. The tests
were run on an isolated CPU (isolcpus=3) on a 4-CPU system.

Tests on 6.3.0
==============

RUN 1: runtime=7ms, deadline=period=10ms, RT capacity = 95%
TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.33
TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.35

RUN 2: runtime=1ms, deadline=period=100ms, RT capacity = 95%
TID[708]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 16.69
TID[708]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 16.69

RUN 3: 2 tasks
  Task 1: runtime=1ms, deadline=period=10ms
  Task 2: runtime=1ms, deadline=period=100ms
TID[631]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 62.67
TID[632]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 6.37
TID[631]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 62.38
TID[632]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 6.23

As seen above, the reclamation doesn't reclaim the maximum allowed
bandwidth and as the bandwidth of tasks gets smaller, the reclaimed
bandwidth also comes down.

Tests with this patch applied
=============================

RUN 1: runtime=7ms, deadline=period=10ms, RT capacity = 95%
TID[608]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.19
TID[608]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.16

RUN 2: runtime=1ms, deadline=period=100ms, RT capacity = 95%
TID[616]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 95.27
TID[616]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 95.21

RUN 3: 2 tasks
  Task 1: runtime=1ms, deadline=period=10ms
  Task 2: runtime=1ms, deadline=period=100ms
TID[620]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 86.64
TID[621]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 8.66
TID[620]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 86.45
TID[621]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 8.73

Running tasks on all CPUs with migration allowed also showed that
the utilization is reclaimed to the maximum. Running 10 tasks on
3 CPUs with SCHED_FLAG_RECLAIM, top shows:
%Cpu0  : 94.6 us,  0.0 sy,  0.0 ni,  5.4 id,  0.0 wa
%Cpu1  : 95.2 us,  0.0 sy,  0.0 ni,  4.8 id,  0.0 wa
%Cpu2  : 95.8 us,  0.0 sy,  0.0 ni,  4.2 id,  0.0 wa

[1]: Abeni, Luca & Lipari, Giuseppe & Parri, Andrea & Sun, Youcheng.
     (2015). Parallel and sequential reclaiming in multicore
     real-time global scheduling.

Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20230530135526.2385378-1-vineeth@bitbyteword.org
2023-06-16 22:08:11 +02:00
Arve Hjønnevåg
ef73d6a4ef sched/wait: Fix a kthread_park race with wait_woken()
kthread_park and wait_woken have a race similar to the one that
kthread_stop and wait_woken used to have before it was fixed in
commit cb6538e740 ("sched/wait: Fix a kthread race with
wait_woken()"). Extend that fix to also cover kthread_park.

[jstultz: Made changes suggested by Peter to optimize
 memory loads]

Signed-off-by: Arve Hjønnevåg <arve@android.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230602212350.535358-1-jstultz@google.com
2023-06-16 17:08:01 +02:00
Miaohe Lin
0cce0fde49 sched/topology: Mark set_sched_topology() __init
All callers of set_sched_topology() are within __init section. Mark
it __init too.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230603073645.1173332-1-linmiaohe@huawei.com
2023-06-16 17:08:01 +02:00
Tom Rix
a707df30c9 sched/fair: Rename variable cpu_util eff_util
cppcheck reports
kernel/sched/fair.c:7436:17: style: Local variable 'cpu_util' shadows outer function [shadowFunction]
  unsigned long cpu_util;
                ^

Clean this up by renaming the variable to eff_util

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20230611122535.183654-1-trix@redhat.com
2023-06-16 17:08:01 +02:00
Peter Zijlstra
5416bf1cf5 arm64/arch_timer: Fix MMIO byteswap
The readl_relaxed() to __raw_readl() change meant to lose the
instrumentation, but also (inadvertently) lost the byteswap.
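
To illustrate why the byteswap matters: the register contents are little-endian, so a big-endian kernel has to convert the raw value before using it. The following user-space sketch shows such a conversion; the helper name is hypothetical and it only stands in for the conversion that readl_relaxed() performs around the raw read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a 32-bit little-endian register image into a host-order value.
 * The result is the same on little- and big-endian hosts. */
static uint32_t le32_bytes_to_host(const uint8_t *bytes)
{
    assert(bytes != NULL);
    return (uint32_t)bytes[0] |
           ((uint32_t)bytes[1] << 8) |
           ((uint32_t)bytes[2] << 16) |
           ((uint32_t)bytes[3] << 24);
}
```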

Fixes: 24ee7607b2 ("arm64/arch_timer: Provide noinstr sched_clock_read() functions")
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lkml.kernel.org/r/20230606080614.GB905437@hirez.programming.kicks-ass.net
2023-06-06 10:19:51 +02:00
Dietmar Eggemann
7d0583cf9e sched/fair, cpufreq: Introduce 'runnable boosting'
The responsiveness of the Per Entity Load Tracking (PELT) util_avg in
mobile devices is still considered too low for utilization changes
during task ramp-up.

In Android this manifests in the fact that the first frames of a UI
activity are very prone to be jankframes (a frame which doesn't meet
the required frame rendering time, e.g. 16ms@60Hz) since the CPU
frequency is normally low at this point and has to ramp up quickly.

The beginning of a UI activity is also characterized by the occurrence
of CPU contention, especially on little CPUs. Current little CPUs can
have an original CPU capacity of only ~ 150 which means that the actual
CPU capacity at lower frequency can even be much smaller.

Schedutil maps CPU util_avg into CPU frequency request via:

  util = effective_cpu_util(..., cpu_util_cfs(cpu), ...) ->
  util = map_util_perf(util) -> freq = map_util_freq(util, ...)

CPU contention for CFS tasks can be detected by 'CPU runnable > CPU
utilization' in cpu_util_cfs_boost() -> cpu_util(..., boost = 1).
Schedutil uses 'runnable boosting' by calling cpu_util_cfs_boost().

To be in sync with schedutil's CPU frequency selection, Energy Aware
Scheduling (EAS) also calls cpu_util(..., boost = 1) during max util
detection.

Moreover, 'runnable boosting' is also used in load-balance for busiest
CPU selection when the migration type is 'migrate_util', i.e. only at
sched domains which don't have the SD_SHARE_PKG_RESOURCES flag set.
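
The core idea of 'runnable boosting' can be sketched as a toy model. The function name, its signature, and the cap at the CPU capacity are assumptions made for illustration; this is not the kernel's cpu_util().

```c
#include <assert.h>

/* Toy model: with boost set, CPU contention (runnable > util) raises the
 * reported utilization to the runnable value. The result is capped at the
 * CPU capacity (an assumption of this sketch). */
static unsigned long toy_boosted_cpu_util(unsigned long util,
                                          unsigned long runnable,
                                          unsigned long capacity,
                                          int boost)
{
    assert(capacity > 0);
    if (boost && runnable > util)
        util = runnable;
    return util < capacity ? util : capacity;
}
```

For a little CPU with a capacity of 150, a util of 100 and a runnable of 140, the boosted value is 140, which would translate into a higher frequency request than the unboosted 100.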

Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230515115735.296329-3-dietmar.eggemann@arm.com
2023-06-05 21:13:44 +02:00
Dietmar Eggemann
3eb6d6ecec sched/fair: Refactor CPU utilization functions
There is a lot of code duplication in cpu_util_next() & cpu_util_cfs().

Remove this by allowing cpu_util_next() to be called with p = NULL.
Rename cpu_util_next() to cpu_util() since the '_next' suffix is no
longer necessary to distinguish CPU utilization related functions.
Implement cpu_util_cfs(cpu) as cpu_util(cpu, p = NULL, -1).

This allows future CPU utilization changes to be coded in one place
only, namely in cpu_util().
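
The shape of this refactoring can be sketched with a toy model. The data structures and the per-CPU values below are hypothetical; only the idea that p = NULL means "no task to add or remove" comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS 4

struct toy_task {
    int cpu;              /* CPU the task currently contributes to */
    unsigned long util;   /* the task's utilization */
};

/* Hypothetical per-CPU utilization averages. */
static unsigned long toy_cpu_util_avg[TOY_NR_CPUS] = { 300, 120, 0, 50 };

/* Utilization of @cpu as if task @p had been migrated to @dst_cpu.
 * With p == NULL nothing is added or removed. */
static unsigned long cpu_util(int cpu, const struct toy_task *p, int dst_cpu)
{
    unsigned long util;

    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    util = toy_cpu_util_avg[cpu];

    if (p && p->cpu == cpu && dst_cpu != cpu)
        util = util > p->util ? util - p->util : 0;  /* task leaves @cpu */
    else if (p && p->cpu != cpu && dst_cpu == cpu)
        util += p->util;                             /* task arrives on @cpu */

    return util;
}

/* The plain per-CPU query becomes a thin wrapper. */
static unsigned long cpu_util_cfs(int cpu)
{
    return cpu_util(cpu, NULL, -1);
}
```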

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230515115735.296329-2-dietmar.eggemann@arm.com
2023-06-05 21:13:43 +02:00
Peter Zijlstra
e6a15fa9ea cpuidle: Use local_clock_noinstr()
With the introduction of local_clock_noinstr(), local_clock() itself
is no longer marked noinstr, use the correct function.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>  # Hyper-V
Link: https://lore.kernel.org/r/20230519102716.045980863@infradead.org
2023-06-05 21:11:09 +02:00
Peter Zijlstra
fb7d4948c4 sched/clock: Provide local_clock_noinstr()
Now that all ARCH_WANTS_NO_INSTR architectures (arm64, loongarch,
s390, x86) provide sched_clock_noinstr(), use this to provide
local_clock_noinstr().

This local_clock_noinstr() will be safe to use from noinstr code with
the assumption that any such noinstr code is non-preemptible (it had
better be, entry code will have IRQs disabled while __cpuidle must
have preemption disabled).

Specifically, preempt_enable_notrace(), a common part of many a
sched_clock() implementation calls out to schedule() -- even though,
per the above, it will never trigger -- which frustrates noinstr
validation.

  vmlinux.o: warning: objtool: local_clock+0xb5: call to preempt_schedule_notrace_thunk() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>  # Hyper-V
Link: https://lore.kernel.org/r/20230519102715.978624636@infradead.org
2023-06-05 21:11:09 +02:00
Peter Zijlstra
5c5e9a2b25 x86/tsc: Provide sched_clock_noinstr()
With the intent to provide local_clock_noinstr(), a variant of
local_clock() that's safe to be called from noinstr code (with the
assumption that any such code will already be non-preemptible),
prepare for things by providing a noinstr sched_clock_noinstr()
function.

Specifically, preempt_enable_*() calls out to schedule(), which upsets
noinstr validation efforts.

  vmlinux.o: warning: objtool: native_sched_clock+0x96: call to preempt_schedule_notrace_thunk() leaves .noinstr.text section
  vmlinux.o: warning: objtool: kvm_clock_read+0x22: call to preempt_schedule_notrace_thunk() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>  # Hyper-V
Link: https://lore.kernel.org/r/20230519102715.910937674@infradead.org
2023-06-05 21:11:08 +02:00
Peter Zijlstra
e39acc37db clocksource: hyper-v: Provide noinstr sched_clock()
With the intent to provide local_clock_noinstr(), a variant of
local_clock() that's safe to be called from noinstr code (with the
assumption that any such code will already be non-preemptible),
prepare for things by making the Hyper-V TSC and MSR sched_clock
implementations noinstr.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>  # Hyper-V
Link: https://lore.kernel.org/r/20230519102715.843039089@infradead.org
2023-06-05 21:11:08 +02:00
Peter Zijlstra
9397fa2ea3 clocksource: hyper-v: Adjust hv_read_tsc_page_tsc() to avoid special casing U64_MAX
Currently hv_read_tsc_page_tsc() (ab)uses the (valid) time value of
U64_MAX as an error return. This breaks the clean wrap-around of the
clock.

Modify the function signature to return a boolean state and provide
another u64 pointer to store the actual time on success. This obviates
the need to steal one time value and restores the full counter width.
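
A minimal sketch of this calling convention, using a hypothetical structure and function name rather than the real hv_read_tsc_page_tsc() signature:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a clock page that may or may not be usable. */
struct toy_time_page {
    bool valid;
    uint64_t now;
};

/* Returns true and stores the time in *cur on success. Returns false and
 * leaves *cur untouched otherwise. Every u64 value, including UINT64_MAX,
 * stays a legal time because success is reported separately. */
static bool toy_read_time(const struct toy_time_page *pg, uint64_t *cur)
{
    assert(pg != NULL && cur != NULL);
    if (!pg->valid)
        return false;
    *cur = pg->now;
    return true;
}
```

A caller checks the boolean first and only then consumes the value, so no sentinel has to be carved out of the counter range.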

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>  # Hyper-V
Link: https://lore.kernel.org/r/20230519102715.775630881@infradead.org
2023-06-05 21:11:07 +02:00
Peter Zijlstra
77750f78b0 x86/vdso: Fix gettimeofday masking
Because of how the virtual clocks use U64_MAX as an exception value
instead of a valid time, the clocks can no longer be assumed to wrap
cleanly. This is then compounded by arch_vdso_cycles_ok() rejecting
everything with the MSB/Sign-bit set.

Therefore, the effective mask becomes S64_MAX, and the comment with
vdso_calc_delta() that states the mask is U64_MAX and isn't optimized
out is just plain silly.

Now, the code has a negative filter -- to deal with TSC wobbles:

	if (cycles > last)

which is just plain wrong, because it should've been written as:

	if ((s64)(cycles - last) > 0)

to take wrapping into account, but per all the above, we don't
actually wrap on u64 anymore.
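
The difference between the two comparisons can be seen in a small user-space sketch. The function names are hypothetical, and the cast assumes the usual two's complement conversion from uint64_t to int64_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Plain comparison: breaks once the counter wraps past 2^64. */
static bool toy_after_naive(uint64_t cycles, uint64_t last)
{
    return cycles > last;
}

/* Wrap-aware comparison: the unsigned subtraction wraps modulo 2^64 and
 * the sign of the difference tells which value is newer. */
static bool toy_after_wrapsafe(uint64_t cycles, uint64_t last)
{
    return (int64_t)(cycles - last) > 0;
}
```

With last just below the wrap point and cycles just past it, the naive form reports that time went backwards while the wrap-aware form does not.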

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>  # Hyper-V
Link: https://lore.kernel.org/r/20230519102715.704767397@infradead.org
2023-06-05 21:11:07 +02:00
Peter Zijlstra
fc4a0db414 math64: Always inline u128 version of mul_u64_u64_shr()
In order to prevent the following complaint from happening, always
inline the u128 variant of mul_u64_u64_shr() -- which is what x86_64
will use.

  vmlinux.o: warning: objtool: read_hv_sched_clock_tsc+0x5a: call to mul_u64_u64_shr.constprop.0() leaves .noinstr.text section

It should compile into something like:

  asm("mul	%[mul];"
      "shrd	%rdx, %rax, %cl"
      : "+&a" (a)
      : "c" shift, [mul] "r" (mul)
      : "d");

Which is silly not to inline, but it happens.
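
For reference, the operation itself is a 64x64 multiply with a 128-bit intermediate followed by a right shift. A user-space sketch, assuming a compiler that provides unsigned __int128 (GCC or Clang on a 64-bit target); the function name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* (a * mul) >> shift, computed on the full 128-bit product so that the
 * high bits of the multiplication are not lost. */
static inline uint64_t toy_mul_u64_u64_shr(uint64_t a, uint64_t mul,
                                           unsigned int shift)
{
    assert(shift < 128);
    return (uint64_t)(((unsigned __int128)a * mul) >> shift);
}
```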

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>  # Hyper-V
Link: https://lore.kernel.org/r/20230519102715.637420396@infradead.org
2023-06-05 21:11:06 +02:00
Peter Zijlstra
91b41a2375 s390/time: Provide sched_clock_noinstr()
With the intent to provide local_clock_noinstr(), a variant of
local_clock() that's safe to be called from noinstr code (with the
assumption that any such code will already be non-preemptible),
prepare for things by providing a noinstr sched_clock_noinstr()
function.

Specifically, preempt_enable_*() calls out to schedule(), which upsets
noinstr validation efforts.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>  # Hyper-V
Link: https://lore.kernel.org/r/20230519102715.570170436@infradead.org
2023-06-05 21:11:06 +02:00
Peter Zijlstra
6b10fef09f loongarch: Provide noinstr sched_clock_read()
With the intent to provide local_clock_noinstr(), a variant of
local_clock() that's safe to be called from noinstr code (with the
assumption that any such code will already be non-preemptible),
prepare for things by providing a noinstr sched_clock_read() function.

Specifically, preempt_enable_*() calls out to schedule(), which upsets
noinstr validation efforts.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>  # Hyper-V
Link: https://lore.kernel.org/r/20230519102715.502547082@infradead.org
2023-06-05 21:11:05 +02:00
Peter Zijlstra
24ee7607b2 arm64/arch_timer: Provide noinstr sched_clock_read() functions
With the intent to provide local_clock_noinstr(), a variant of
local_clock() that's safe to be called from noinstr code (with the
assumption that any such code will already be non-preemptible),
prepare for things by providing a noinstr sched_clock_read() function.

Specifically, preempt_enable_*() calls out to schedule(), which upsets
noinstr validation efforts.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>  # Hyper-V
Link: https://lore.kernel.org/r/20230519102715.435618812@infradead.org
2023-06-05 21:11:05 +02:00
Peter Zijlstra
c1d26c0f02 arm64/io: Always inline all of __raw_{read,write}[bwlq]()
The next patch will want to use __raw_readl() from a noinstr section
and as such that needs to be marked __always_inline to avoid the
compiler being a silly bugger.

Turns out it already is, but its siblings are not. Finish the work
started in commit e43f1331e2 ("arm64: Ask the compiler to
__always_inline functions used by KVM at HYP") for consistency's sake.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>  # Hyper-V
Link: https://lore.kernel.org/r/20230519102715.368919762@infradead.org
2023-06-05 21:11:04 +02:00
Peter Zijlstra
5949a68c73 time/sched_clock: Provide sched_clock_noinstr()
With the intent to provide local_clock_noinstr(), a variant of
local_clock() that's safe to be called from noinstr code (with the
assumption that any such code will already be non-preemptible),
prepare for things by providing a noinstr sched_clock_noinstr() function.

Specifically, preempt_enable_*() calls out to schedule(), which upsets
noinstr validation efforts.

As such, pull out the preempt_{dis,en}able_notrace() requirements from
the sched_clock_read() implementations by explicitly providing it in
the sched_clock() function.
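
The resulting split can be sketched in user space as follows. All names are hypothetical stand-ins: a counter replaces the clock source and a plain integer replaces the preemption count.

```c
#include <assert.h>
#include <stdint.h>

static int toy_preempt_count;     /* stand-in for the preemption counter */
static uint64_t toy_clock_ticks;  /* stand-in for the hardware counter */

static void toy_preempt_disable(void)
{
    toy_preempt_count++;
}

static void toy_preempt_enable(void)
{
    assert(toy_preempt_count > 0);
    toy_preempt_count--;
}

/* Raw variant: no preemption handling at all. The caller must already be
 * non-preemptible. */
static uint64_t toy_sched_clock_noinstr(void)
{
    return ++toy_clock_ticks;
}

/* Regular variant: the wrapper owns the preempt disable/enable pair
 * instead of each read implementation doing it on its own. */
static uint64_t toy_sched_clock(void)
{
    uint64_t ns;

    toy_preempt_disable();
    ns = toy_sched_clock_noinstr();
    toy_preempt_enable();
    return ns;
}
```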

This further requires said sched_clock_read() functions to be noinstr
themselves, for ARCH_WANTS_NO_INSTR users. See the next few patches.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>  # Hyper-V
Link: https://lore.kernel.org/r/20230519102715.302350330@infradead.org
2023-06-05 21:11:04 +02:00
Peter Zijlstra
d16317de9b seqlock/latch: Provide raw_read_seqcount_latch_retry()
The read side of seqcount_latch consists of:

  do {
    seq = raw_read_seqcount_latch(&latch->seq);
    ...
  } while (read_seqcount_latch_retry(&latch->seq, seq));

which is asymmetric in the raw_ department, and sure enough,
read_seqcount_latch_retry() includes (explicit) instrumentation where
raw_read_seqcount_latch() does not.

This inconsistency becomes a problem when trying to use it from
noinstr code. As such, fix it by renaming and re-implementing
raw_read_seqcount_latch_retry() without the instrumentation.

Specifically the instrumentation in question is kcsan_atomic_next(0)
in do___read_seqcount_retry(). Losing this annotation is not a
problem because raw_read_seqcount_latch() does not pass through
kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX).
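
The latch read loop can be modelled in user space with C11 atomics. This is a sketch under stated assumptions: the type and function names are hypothetical, sequentially consistent atomics replace the kernel's barriers, and there is no instrumentation of any kind.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Two copies of the data plus a sequence counter. The low bit of the
 * counter selects the copy that readers may use. */
struct toy_latch {
    atomic_uint seq;
    _Atomic uint64_t data[2];
};

static void toy_latch_write(struct toy_latch *l, uint64_t v)
{
    atomic_fetch_add(&l->seq, 1);   /* odd: readers use data[1] */
    atomic_store(&l->data[0], v);
    atomic_fetch_add(&l->seq, 1);   /* even: readers use data[0] */
    atomic_store(&l->data[1], v);
}

static uint64_t toy_latch_read(struct toy_latch *l)
{
    unsigned int seq;
    uint64_t v;

    do {
        seq = atomic_load(&l->seq);           /* begin: pick a copy */
        v = atomic_load(&l->data[seq & 1]);
    } while (atomic_load(&l->seq) != seq);    /* retry if a write ran */

    return v;
}
```

Both halves of the loop are equally "raw" here, which is the symmetry the renamed raw_read_seqcount_latch_retry() restores.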

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>  # Hyper-V
Link: https://lore.kernel.org/r/20230519102715.233598176@infradead.org
2023-06-05 21:11:03 +02:00
Peter Zijlstra
8f2d6c41e5 x86/sched: Rewrite topology setup
Instead of having a number of fixed topologies to pick from, build one
on the fly. This is both simpler now and simpler to extend in the
future.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230601153522.GB559993%40hirez.programming.kicks-ass.net
2023-06-05 21:11:03 +02:00
Peter Zijlstra
1c06918788 sched: Consider task_struct::saved_state in wait_task_inactive()
With the introduction of task_struct::saved_state in commit
5f220be214 ("sched/wakeup: Prepare for RT sleeping spin/rwlocks")
matching the task state has gotten more complicated. That same commit
changed try_to_wake_up() to consider both states, but
wait_task_inactive() has been neglected.

Sebastian noted that the wait_task_inactive() usage in
ptrace_check_attach() can misbehave when ptrace_stop() is blocked on
the tasklist_lock after it sets TASK_TRACED.

Therefore extract a common helper from ttwu_state_match() and use that
to teach wait_task_inactive() about the PREEMPT_RT locks.
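
A toy version of such a shared state-matching helper might look like this. The structure, the function name, the state bits used in the usage note and the 1 / -1 / 0 return convention are assumptions of this sketch, not a copy of the kernel helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction of a task to the two state words that matter. */
struct toy_task_state {
    unsigned int state;        /* current state */
    unsigned int saved_state;  /* state parked while blocked on an RT lock */
};

/* Returns 1 if the current state matches @mask, -1 if only the saved
 * state matches (PREEMPT_RT only), and 0 if neither matches. */
static int toy_task_state_match(const struct toy_task_state *p,
                                unsigned int mask, int preempt_rt)
{
    assert(p != NULL);
    if (p->state & mask)
        return 1;
    if (preempt_rt && (p->saved_state & mask))
        return -1;
    return 0;
}
```

A waiter that only looked at the current state would miss a task whose requested state sits in saved_state while it is blocked on a lock. Sharing the helper lets the wakeup path and the waiting path agree on what counts as a match.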

Originally-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20230601091234.GW83892@hirez.programming.kicks-ass.net
2023-06-05 21:11:03 +02:00
Peter Zijlstra
d5e1586617 sched: Unconditionally use full-fat wait_task_inactive()
While modifying wait_task_inactive() for PREEMPT_RT, the build robot
noted that UP got broken. This led to an audit and consideration of the
UP implementation of wait_task_inactive().

It looks like the UP implementation is also broken for PREEMPT;
consider task_current_syscall() getting preempted between the two
calls to wait_task_inactive().

Therefore move the wait_task_inactive() implementation out of
CONFIG_SMP and unconditionally use it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230602103731.GA630648%40hirez.programming.kicks-ass.net
2023-06-05 21:11:02 +02:00
Yicong Yang
0dd37d6dd3 sched/fair: Don't balance task to its current running CPU
We've run into the case where the balancer tries to balance a migration
disabled task and triggers the warning in set_task_cpu() like below:

 ------------[ cut here ]------------
 WARNING: CPU: 7 PID: 0 at kernel/sched/core.c:3115 set_task_cpu+0x188/0x240
 Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT nf_reject_ipv4 <...snip>
 CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G           O       6.1.0-rc4+ #1
 Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B221.01 12/09/2021
 pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : set_task_cpu+0x188/0x240
 lr : load_balance+0x5d0/0xc60
 sp : ffff80000803bc70
 x29: ffff80000803bc70 x28: ffff004089e190e8 x27: ffff004089e19040
 x26: ffff007effcabc38 x25: 0000000000000000 x24: 0000000000000001
 x23: ffff80000803be84 x22: 000000000000000c x21: ffffb093e79e2a78
 x20: 000000000000000c x19: ffff004089e19040 x18: 0000000000000000
 x17: 0000000000001fad x16: 0000000000000030 x15: 0000000000000000
 x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000000
 x11: 0000000000000001 x10: 0000000000000400 x9 : ffffb093e4cee530
 x8 : 00000000fffffffe x7 : 0000000000ce168a x6 : 000000000000013e
 x5 : 00000000ffffffe1 x4 : 0000000000000001 x3 : 0000000000000b2a
 x2 : 0000000000000b2a x1 : ffffb093e6d6c510 x0 : 0000000000000001
 Call trace:
  set_task_cpu+0x188/0x240
  load_balance+0x5d0/0xc60
  rebalance_domains+0x26c/0x380
  _nohz_idle_balance.isra.0+0x1e0/0x370
  run_rebalance_domains+0x6c/0x80
  __do_softirq+0x128/0x3d8
  ____do_softirq+0x18/0x24
  call_on_irq_stack+0x2c/0x38
  do_softirq_own_stack+0x24/0x3c
  __irq_exit_rcu+0xcc/0xf4
  irq_exit_rcu+0x18/0x24
  el1_interrupt+0x4c/0xe4
  el1h_64_irq_handler+0x18/0x2c
  el1h_64_irq+0x74/0x78
  arch_cpu_idle+0x18/0x4c
  default_idle_call+0x58/0x194
  do_idle+0x244/0x2b0
  cpu_startup_entry+0x30/0x3c
  secondary_start_kernel+0x14c/0x190
  __secondary_switched+0xb0/0xb4
 ---[ end trace 0000000000000000 ]---

Further investigation shows that the warning is superfluous: the migration
disabled task is just going to be migrated to its current running CPU.
This happens because, during load balance, if the dst_cpu is not allowed
by the task, we re-select a new_dst_cpu as a candidate. If no task can be
balanced to dst_cpu, we try to balance the task to the new_dst_cpu
instead. A migration disabled task that is not currently running (not on
CPU) is only allowed to run on its current CPU, so load balance selects
its current CPU as new_dst_cpu and later triggers the warning above.

The new_dst_cpu is chosen from the env->dst_grpmask. Currently it
contains the CPUs in sched_group_span(), and if we have overlapped groups
it's possible to run into this case. This patch sets env->dst_grpmask to
group_balance_mask(), which excludes any CPUs from the busiest group and
so solves the issue. For balancing in a domain with no overlapped groups
the behaviour stays the same as before.

Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230530082507.10444-1-yangyicong@huawei.com
2023-06-05 21:08:24 +02:00
Miaohe Lin
3f4bf7aa31 sched/deadline: remove unused dl_bandwidth
The default deadline bandwidth control structure has been removed since
commit eb77cf1c15 ("sched/deadline: Remove unused def_dl_bandwidth")
leading to unused init_dl_bandwidth() and struct dl_bandwidth. Remove
them to clean up the code.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20230524102514.407486-1-linmiaohe@huawei.com
2023-05-30 22:46:26 +02:00
Arnd Bergmann
7aa55f2a59 sched/fair: Move unused stub functions to header
These four functions have a normal definition for CONFIG_FAIR_GROUP_SCHED,
and an empty one that is only referenced when FAIR_GROUP_SCHED is disabled
but CGROUP_SCHED is still enabled. If both are turned off, the functions
are still defined but the missing prototype causes a W=1 warning:

kernel/sched/fair.c:12544:6: error: no previous prototype for 'free_fair_sched_group'
kernel/sched/fair.c:12546:5: error: no previous prototype for 'alloc_fair_sched_group'
kernel/sched/fair.c:12553:6: error: no previous prototype for 'online_fair_sched_group'
kernel/sched/fair.c:12555:6: error: no previous prototype for 'unregister_fair_sched_group'

Move the alternatives into the header as static inline functions with
the correct combination of #ifdef checks to avoid the warning without
adding even more complexity.
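
The header pattern described here can be sketched as follows. The TOY_FAIR_GROUP_SCHED macro, the reduced struct and the stub return value are assumptions for illustration; the real header uses the kernel's CONFIG_ symbols and may combine several of them.

```c
#include <assert.h>

struct task_group {
    int id;
};

#ifdef TOY_FAIR_GROUP_SCHED
/* Real implementations live in a .c file; the header only declares them. */
int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent);
void free_fair_sched_group(struct task_group *tg);
#else
/* Stubs are static inline in the header, so no prototype is missing and
 * no out-of-line empty function is emitted. */
static inline int alloc_fair_sched_group(struct task_group *tg,
                                         struct task_group *parent)
{
    (void)tg;
    (void)parent;
    return 1;
}

static inline void free_fair_sched_group(struct task_group *tg)
{
    (void)tg;
}
#endif
```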

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230522195021.3456768-6-arnd@kernel.org
2023-05-30 22:46:26 +02:00
Arnd Bergmann
f7df852ad6 sched: Make task_vruntime_update() prototype visible
Having the prototype next to the caller but not visible to the callee causes
a W=1 warning:

kernel/sched/fair.c:11985:6: error: no previous prototype for 'task_vruntime_update' [-Werror=missing-prototypes]

Move this to a header, as we do for all other function declarations.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230522195021.3456768-5-arnd@kernel.org
2023-05-30 22:46:26 +02:00
Arnd Bergmann
c0bdfd72fb sched/fair: Hide unused init_cfs_bandwidth() stub
init_cfs_bandwidth() is only used when CONFIG_FAIR_GROUP_SCHED is
enabled; without it, the stub causes a W=1 warning for the missing prototype:

kernel/sched/fair.c:6131:6: error: no previous prototype for 'init_cfs_bandwidth'

The normal implementation is only defined for CONFIG_CFS_BANDWIDTH,
so the stub exists when CFS_BANDWIDTH is disabled but FAIR_GROUP_SCHED
is enabled.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230522195021.3456768-4-arnd@kernel.org
2023-05-30 22:46:25 +02:00
Arnd Bergmann
378be384e0 sched: Add schedule_user() declaration
The schedule_user() function is used on powerpc and sparc architectures, but
only ever called from assembler, so it has no prototype, causing a harmless W=1
warning:

kernel/sched/core.c:6730:35: error: no previous prototype for 'schedule_user' [-Werror=missing-prototypes]

Add a prototype in sched/sched.h to shut up the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230522195021.3456768-3-arnd@kernel.org
2023-05-30 22:46:25 +02:00
Arnd Bergmann
d55ebae3f3 sched: Hide unused sched_update_scaling()
This function is only used when CONFIG_SMP is enabled; without that there
is no caller and no prototype:

kernel/sched/fair.c:688:5: error: no previous prototype for 'sched_update_scaling' [-Werror=missing-prototypes]

Hide the definition in the same #ifdef check as the declaration.

Fixes: 8a99b6833c ("sched: Move SCHED_DEBUG sysctl to debugfs")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230522195021.3456768-2-arnd@kernel.org
2023-05-30 22:46:24 +02:00
Yang Yang
e2a1f85bf9 sched/psi: Avoid resetting the min update period when it is unnecessary
Psi_group's poll_min_period is determined by the minimum window size of
psi_trigger when creating new triggers. While destroying a psi_trigger,
there is no need to reset poll_min_period if the psi_trigger being
destroyed did not have the minimum window size, since in this condition
poll_min_period will remain the same as before.
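
The optimization amounts to skipping the rescan unless the destroyed trigger was the one defining the minimum. A toy sketch with a hypothetical function operating on plain arrays rather than the kernel's trigger list:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the minimum period after a trigger with period @destroyed has
 * been removed. @remaining holds the periods of the surviving triggers.
 * With no triggers left the result is UINT64_MAX (a choice made for this
 * sketch only). */
static uint64_t toy_min_period_after_destroy(uint64_t cur_min,
                                             uint64_t destroyed,
                                             const uint64_t *remaining,
                                             size_t n)
{
    uint64_t min = UINT64_MAX;
    size_t i;

    assert(destroyed >= cur_min);
    assert(n == 0 || remaining != NULL);

    if (destroyed != cur_min)
        return cur_min;   /* minimum cannot have changed: skip the rescan */

    for (i = 0; i < n; i++)
        if (remaining[i] < min)
            min = remaining[i];
    return min;
}
```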

Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lkml.kernel.org/r/20230514163338.834345-1-surenb@google.com
2023-05-20 12:53:16 +02:00
晏艳(采苓)
a6fcdd8d95 sched/debug: Correct printing for rq->nr_uninterruptible
Commit e6fe3f422b ("sched: Make multiple runqueue task counters
32-bit") changed the type for rq->nr_uninterruptible from "unsigned
long" to "unsigned int", but left the wrong cast in the output printed
to /sys/kernel/debug/sched/debug and to the console.

For example, when nr_uninterruptible's value is fffffff7 with type
"unsigned int", (long)nr_uninterruptible shows 4294967287 while
(int)nr_uninterruptible prints -9. So using an int cast fixes the wrong
output.
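
The two casts can be compared directly in user space. The helper names are hypothetical, and the (long) result assumes a platform where long is 64 bits wide (LP64, as on 64-bit Linux).

```c
#include <assert.h>

/* Widening an unsigned int to a 64-bit long keeps the large positive value. */
static long toy_as_long(unsigned int v)
{
    return (long)v;
}

/* Converting to int reinterprets the same 32 bits as a signed value. */
static int toy_as_int(unsigned int v)
{
    return (int)v;
}
```

For the value 0xfffffff7 the first helper yields 4294967287 and the second yields -9, matching the numbers quoted in the commit message.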

Signed-off-by: Yan Yan <yanyan.yan@antgroup.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230506074253.44526-1-yanyan.yan@antgroup.com
2023-05-08 10:58:39 +02:00
Tim C Chen
bf2dc42d6b sched/topology: Propagate SMT flags when removing degenerate domain
When a degenerate cluster domain for core with SMT CPUs is removed,
the SD_SHARE_CPUCAPACITY flag in the local child sched group was not
propagated to the new parent.  We need this flag to properly determine
whether the local sched group is SMT.  Set the flag in the local
child sched group of the new parent sched domain.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Link: https://lkml.kernel.org/r/73cf0959eafa53c02e7ef6bf805d751d9190e55d.1683156492.git.tim.c.chen@linux.intel.com
2023-05-08 10:58:39 +02:00
Suren Baghdasaryan
519fabc7aa psi: remove 500ms min window size limitation for triggers
The current 500ms min window size for psi triggers limits the polling
interval to 50ms to prevent polling threads from using too much CPU
bandwidth by polling too frequently. However, the number of cgroups with
triggers is unlimited, so this protection can be defeated by creating
multiple cgroups with psi triggers (triggers in each cgroup are served by
a single "psimon" kernel thread).
Instead of limiting min polling period, which also limits the latency of
psi events, it's better to limit psi trigger creation to authorized users
only, like we do for system-wide psi triggers (/proc/pressure/* files can
be written only by processes with CAP_SYS_RESOURCE capability). This also
makes access rules for cgroup psi files consistent with system-wide ones.
Add a CAP_SYS_RESOURCE capability check for cgroup psi file writers and
remove the psi window min size limitation.

Suggested-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/all/cover.1676067791.git.quic_sudaraja@quicinc.com/
2023-05-08 10:58:38 +02:00
Chen Yu
044f0e27de x86/sched: Add the SD_ASYM_PACKING flag to the die domain of hybrid processors
Intel Meteor Lake hybrid processors have cores in two separate dies. The
cores in one of the dies have higher maximum frequency. Use the
SD_ASYM_PACKING flag to give higher priority to the die with CPUs of higher
maximum frequency.

Suggested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230406203148.19182-13-ricardo.neri-calderon@linux.intel.com
2023-05-08 10:58:38 +02:00
Ricardo Neri
046a5a95c3 x86/sched/itmt: Give all SMT siblings of a core the same priority
X86 does not have the SD_ASYM_PACKING flag in the SMT domain. The scheduler
knows how to handle SMT and non-SMT cores of different priority. There is
no reason for SMT siblings of a core to have different priorities.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-12-ricardo.neri-calderon@linux.intel.com
2023-05-08 10:58:38 +02:00
Ricardo Neri
995998ebde x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags
There is no difference between any of the SMT siblings of a physical core.
Do not do asym_packing load balancing at this level.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-11-ricardo.neri-calderon@linux.intel.com
2023-05-08 10:58:37 +02:00
Ricardo Neri
ca528cc501 sched/topology: Remove SHARED_CHILD from ASYM_PACKING
Only x86 and Power7 use ASYM_PACKING. They use it differently.

Power7 has cores of equal priority, but the SMT siblings of a core have
different priorities. Parent scheduling domains do not need (nor have) the
ASYM_PACKING flag. SHARED_CHILD is not needed. Using SHARED_PARENT would
cause the topology debug code to complain.

X86 has cores of different priority, but all the SMT siblings of the core
have equal priority. It needs ASYM_PACKING at the MC level, but not at the
SMT level (it also needs it at upper levels if they have scheduling groups
of different priority). Removing ASYM_PACKING from the SMT domain causes
the topology debug code to complain.

Remove SHARED_CHILD for now. We still need a topology check that satisfies
both architectures.

Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-10-ricardo.neri-calderon@linux.intel.com
2023-05-08 10:58:37 +02:00
Ricardo Neri
40b4d3dc32 sched/topology: Check SDF_SHARED_CHILD in highest_flag_domain()
Do not assume that all the children of a scheduling domain have a given
flag. Check whether it has the SDF_SHARED_CHILD meta flag.

Suggested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230406203148.19182-9-ricardo.neri-calderon@linux.intel.com
2023-05-08 10:58:36 +02:00
Ricardo Neri
c9ca07886a sched/fair: Do not even the number of busy CPUs via asym_packing
Now that find_busiest_group() triggers load balancing between a fully_busy
SMT2 core and an idle non-SMT core, it is no longer necessary to force
balancing via asym_packing. Use asym_packing only as intended: when there
is a high-priority CPU that is idle.

After this change, the same logic applies to SMT and non-SMT local groups.
It makes less sense having a separate function to deal specifically with
SMT. Fold the logic in asym_smt_can_pull_tasks() into sched_asym().

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-8-ricardo.neri-calderon@linux.intel.com
2023-05-08 10:58:36 +02:00
Ricardo Neri
43726bdedd sched/fair: Use the busiest group to set prefer_sibling
The prefer_sibling setting acts on the busiest group to move excess tasks
to the local group. This should be done as per request of the child of the
busiest group's sched domain, not the local group's.

Using the flags of the child domain of the local group works fortuitously
if both groups have child domains.

There are cases, however, in which the busiest group's sched domain has a
child but the local group's does not. Consider, for instance, a non-SMT
core (or an SMT core with only one online sibling) doing load balance with
an SMT core at the MC level. SD_PREFER_SIBLING of the busiest group's child
domain will not be honored. We are left with a fully busy SMT core and an
idle non-SMT core.

Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-7-ricardo.neri-calderon@linux.intel.com
2023-05-08 10:58:35 +02:00
Ricardo Neri
5fd6d7f439 sched/fair: Keep a fully_busy SMT sched group as busiest
When comparing two fully_busy scheduling groups, keep the current busiest
group if it represents an SMT core. Tasks in such scheduling group share
CPU resources and need more help than tasks in a non-SMT fully_busy group.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-6-ricardo.neri-calderon@linux.intel.com
2023-05-08 10:58:35 +02:00
Ricardo Neri
18ad345327 sched/fair: Let low-priority cores help high-priority busy SMT cores
Using asym_packing priorities within an SMT core is straightforward. Just
follow the priorities that hardware indicates.

When balancing load from an SMT core, also consider the idle state of its
siblings. Priorities do not reflect that an SMT core divides its throughput
among all its busy siblings. They only make sense when exactly one sibling
is busy.

Indicate that active balance is needed if the destination CPU has lower
priority than the source CPU but the latter has busy SMT siblings.

Make find_busiest_queue() not skip higher-priority SMT cores with more than
one busy sibling.

Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-5-ricardo.neri-calderon@linux.intel.com
2023-05-08 10:58:35 +02:00
Ricardo Neri
ef7657d4d2 sched/fair: Simplify asym_packing logic for SMT cores
Callers of asym_smt_can_pull_tasks() check the idle state of the
destination CPU and its SMT siblings, if any. No extra checks are needed
in that function.

Since SMT cores divide capacity among their siblings, priorities only really
make sense if only one sibling is active. This is true for SMT2, SMT4,
SMT8, etc. Do not use asym_packing load balance for this case. Instead,
let find_busiest_group() handle imbalances.

When balancing non-SMT cores or at higher scheduling domains (e.g.,
between MC scheduling groups), continue using priorities.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-4-ricardo.neri-calderon@linux.intel.com
2023-05-08 10:58:34 +02:00
Ricardo Neri
eefefa716c sched/fair: Only do asym_packing load balancing from fully idle SMT cores
When balancing load between cores, all the SMT siblings of the destination
CPU, if any, must be idle. Otherwise, pulling new tasks degrades the
throughput of the busy SMT siblings. The overall throughput of the system
remains the same.

When balancing load within an SMT core this consideration is not relevant.
Follow the priorities that hardware indicates.

Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-3-ricardo.neri-calderon@linux.intel.com
2023-05-08 10:58:34 +02:00
Ricardo Neri
8b36d07f1d sched/fair: Move is_core_idle() out of CONFIG_NUMA
asym_packing needs this function to determine whether an SMT core is a
suitable destination for load balancing.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-2-ricardo.neri-calderon@linux.intel.com
2023-05-08 10:58:33 +02:00
Linus Torvalds
ac9a78681b Linux 6.4-rc1 2023-05-07 13:34:35 -07:00
Linus Torvalds
f085df1be6 Disable building BPF based features by default for v6.4.
We need to better polish building with BPF skels, so revert back to
 making it an experimental feature that has to be explicitly enabled
 using BUILD_BPF_SKEL=1.
 
 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

Merge tag 'perf-tools-for-v6.4-3-2023-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux

Pull perf tool updates from Arnaldo Carvalho de Melo:
 "Third version of perf tool updates, with the build problems with
  using a 'vmlinux.h' generated from the main build fixed, and the bpf
  skeleton build disabled by default.

  Build:

   - Require libtraceevent to build, one can disable it using
     NO_LIBTRACEEVENT=1.

     It is required for tools like 'perf sched', 'perf kvm', 'perf
     trace', etc.

     libtraceevent is available in most distros so installing
     'libtraceevent-devel' should be a one-time event to continue
     building perf as usual.

     Using NO_LIBTRACEEVENT=1 produces tooling that is functional and
     sufficient for lots of users not interested in those libtraceevent
     dependent features.

   - Allow Python support in 'perf script' when libtraceevent isn't
     linked, as not all features require it, for instance Intel PT does
     not use tracepoints.

   - Error if the python interpreter needed for jevents to work isn't
     available and NO_JEVENTS=1 isn't set, preventing a build without
     support for JSON vendor events, which is a rare but possible
     condition. The two check error messages:

        $(error ERROR: No python interpreter needed for jevents generation. Install python or build with NO_JEVENTS=1.)
        $(error ERROR: Python interpreter needed for jevents generation too old (older than 3.6). Install a newer python or build with NO_JEVENTS=1.)

   - Make libbpf 1.0 the minimum required when building with out of
     tree, distro provided libbpf.

   - Use libstdc++'s and LLVM's libcxx's __cxa_demangle, a portable C++
     demangler, add 'perf test' entry for it.

   - Make binutils libraries opt in, as distros disable building with it
     due to licensing, they were used for C++ demangling, for instance.

   - Switch libpfm4 to opt-out rather than opt-in, if libpfm-devel (or
     equivalent) isn't installed, we'll just have a build warning:

       Makefile.config:1144: libpfm4 not found, disables libpfm4 support. Please install libpfm4-dev

   - Add a feature test for scandirat(), that is not implemented so far
     in musl and uclibc, disabling features that need it, such as
     scanning for tracepoints in /sys/kernel/tracing/events.

  perf BPF filters:

   - New feature where BPF can be used to filter samples, for instance:

      $ sudo ./perf record -e cycles --filter 'period > 1000' true
      $ sudo ./perf script
           perf-exec 2273949 546850.708501:       5029 cycles:  ffffffff826f9e25 finish_wait+0x5 ([kernel.kallsyms])
           perf-exec 2273949 546850.708508:      32409 cycles:  ffffffff826f9e25 finish_wait+0x5 ([kernel.kallsyms])
           perf-exec 2273949 546850.708526:     143369 cycles:  ffffffff82b4cdbf xas_start+0x5f ([kernel.kallsyms])
           perf-exec 2273949 546850.708600:     372650 cycles:  ffffffff8286b8f7 __pagevec_lru_add+0x117 ([kernel.kallsyms])
           perf-exec 2273949 546850.708791:     482953 cycles:  ffffffff829190de __mod_memcg_lruvec_state+0x4e ([kernel.kallsyms])
                true 2273949 546850.709036:     501985 cycles:  ffffffff828add7c tlb_gather_mmu+0x4c ([kernel.kallsyms])
                true 2273949 546850.709292:     503065 cycles:      7f2446d97c03 _dl_map_object_deps+0x973 (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)

   - In addition to 'period' (PERF_SAMPLE_PERIOD), the other
     PERF_SAMPLE_ can be used for filtering, and also some other sample
     accessible values, from tools/perf/Documentation/perf-record.txt:

        Essentially the BPF filter expression is:

        <term> <operator> <value> (("," | "||") <term> <operator> <value>)*

     The <term> can be one of:
        ip, id, tid, pid, cpu, time, addr, period, txn, weight, phys_addr,
        code_pgsz, data_pgsz, weight1, weight2, weight3, ins_lat, retire_lat,
        p_stage_cyc, mem_op, mem_lvl, mem_snoop, mem_remote, mem_lock,
        mem_dtlb, mem_blk, mem_hops

     The <operator> can be one of:
        ==, !=, >, >=, <, <=, &

     The <value> can be one of:
        <number> (for any term)
        na, load, store, pfetch, exec (for mem_op)
        l1, l2, l3, l4, cxl, io, any_cache, lfb, ram, pmem (for mem_lvl)
        na, none, hit, miss, hitm, fwd, peer (for mem_snoop)
        remote (for mem_remote)
        na, locked (for mem_lock)
        na, l1_hit, l1_miss, l2_hit, l2_miss, any_hit, any_miss, walk, fault (for mem_dtlb)
        na, by_data, by_addr (for mem_blk)
        hops0, hops1, hops2, hops3 (for mem_hops)

  perf lock contention:

   - Show lock type with address.

   - Track and show mmap_lock, siglock and per-cpu rq_lock with address.
     This is done for mmap_lock by following the current->mm pointer:

      $ sudo ./perf lock con -abl -- sleep 10
       contended   total wait     max wait     avg wait            address   symbol
       ...
           16344    312.30 ms      2.22 ms     19.11 us   ffff8cc702595640
           17686    310.08 ms      1.49 ms     17.53 us   ffff8cc7025952c0
               3     84.14 ms     45.79 ms     28.05 ms   ffff8cc78114c478   mmap_lock
            3557     76.80 ms     68.75 us     21.59 us   ffff8cc77ca3af58
               1     68.27 ms     68.27 ms     68.27 ms   ffff8cda745dfd70
               9     54.53 ms      7.96 ms      6.06 ms   ffff8cc7642a48b8   mmap_lock
           14629     44.01 ms     60.00 us      3.01 us   ffff8cc7625f9ca0
            3481     42.63 ms    140.71 us     12.24 us   ffffffff937906ac   vmap_area_lock
           16194     38.73 ms     42.15 us      2.39 us   ffff8cd397cbc560
              11     38.44 ms     10.39 ms      3.49 ms   ffff8ccd6d12fbb8   mmap_lock
               1      5.43 ms      5.43 ms      5.43 ms   ffff8cd70018f0d8
            1674      5.38 ms    422.93 us      3.21 us   ffffffff92e06080   tasklist_lock
             581      4.51 ms    130.68 us      7.75 us   ffff8cc9b1259058
               5      3.52 ms      1.27 ms    703.23 us   ffff8cc754510070
             112      3.47 ms     56.47 us     31.02 us   ffff8ccee38b3120
             381      3.31 ms     73.44 us      8.69 us   ffffffff93790690   purge_vmap_area_lock
             255      3.19 ms     36.35 us     12.49 us   ffff8d053ce30c80

   - Update default map size to 16384.

   - Allocate single letter option -M for --map-nr-entries, as it is
     proving to be frequently used.

   - Fix struct rq lock access for older kernels with BPF's CO-RE
     (Compile once, run everywhere).

   - Fix problems found with MSan.

  perf report/top:

   - Add inline information when using --call-graph=fp or lbr, as was
     already done to the --call-graph=dwarf callchain mode.

   - Improve the 'srcfile' sort key performance by really using an
     optimization introduced in 6.2 for the 'srcline' sort key that
     avoids calling addr2line for comparison with each sample.

  perf sched:

   - Make 'perf sched latency/map/replay' use "sched:sched_waking"
     instead of "sched:sched_wakeup", consistent with 'perf record'
     since d566a9c2d4 ("perf sched: Prefer sched_waking event when it
     exists").

  perf ftrace:

   - Make system wide the default target for latency subcommand, run the
     following command then generate some network traffic and press
     control+C:

       # perf ftrace latency -T __kfree_skb
     ^C
         DURATION     |      COUNT | GRAPH                                          |
          0 - 1    us |         27 | #############                                  |
          1 - 2    us |         22 | ###########                                    |
          2 - 4    us |          8 | ####                                           |
          4 - 8    us |          5 | ##                                             |
          8 - 16   us |         24 | ############                                   |
         16 - 32   us |          2 | #                                              |
         32 - 64   us |          1 |                                                |
         64 - 128  us |          0 |                                                |
        128 - 256  us |          0 |                                                |
        256 - 512  us |          0 |                                                |
        512 - 1024 us |          0 |                                                |
          1 - 2    ms |          0 |                                                |
          2 - 4    ms |          0 |                                                |
          4 - 8    ms |          0 |                                                |
          8 - 16   ms |          0 |                                                |
         16 - 32   ms |          0 |                                                |
         32 - 64   ms |          0 |                                                |
         64 - 128  ms |          0 |                                                |
        128 - 256  ms |          0 |                                                |
        256 - 512  ms |          0 |                                                |
        512 - 1024 ms |          0 |                                                |
          1 - ...   s |          0 |                                                |
       #

  perf top:

   - Add --branch-history (LBR: Last Branch Record) option, just like
     already available for 'perf record'.

   - Fix segfault in thread__comm_len() where thread->comm was being
     used outside thread->comm_lock.

  perf annotate:

   - Allow configuring objdump and addr2line in ~/.perfconfig, so that
     you can use alternative binaries, such as llvm's.

  perf kvm:

   - Add TUI mode for 'perf kvm stat report'.

  Reference counting:

   - Add reference count checking infrastructure to check for use after
     free, done to the 'cpumap', 'namespaces', 'maps' and 'map' structs,
     more to come.

     To build with it use -DREFCNT_CHECKING=1 in the make command line
     to build tools/perf. Documented at:

       https://perf.wiki.kernel.org/index.php/Reference_Count_Checking

   - The above caught, for instance, this bug, fixed in this series:

        - Fix maps use after put in 'perf test "Share thread maps"':

          'maps' is copied from leader, but the leader is put on line 79
          and then 'maps' is used to read the reference count below - so
          a use after put, with the put of maps happening within
          thread__put.

     Fixed by reversing the order of puts so that the leader is put
     last.

   - Also several fixes were made to places where reference counts were
     not being held.

   - Make this one of the tests in 'make -C tools/perf build-test' to
     build test it regularly and to make sure the reference counted
     structs are not accessed directly, only via accessors that check
     the validity of the struct pointer.

  ARM64:

   - Fix 'perf report' segfault when filtering coresight traces by
     sparse lists of CPUs.

   - Add support for 'simd' as a sort field for 'perf report', to show
     ARM's NEON SIMD's predicate flags: "partial" and "empty".

  arm64 vendor events:

   - Add N1 metrics.

  Intel vendor events:

   - Add graniterapids, grandridge and sierraforest events.

   - Refresh events for: alderlake, alderlaken, broadwell, broadwellde,
     broadwellx, cascadelakex, haswell, haswellx, icelake, icelakex,
     jaketown, meteorlake, knightslanding, sandybridge, sapphirerapids,
     silvermont, skylake, tigerlake and westmereep-dp

   - Refresh metrics for alderlake-n, broadwell, broadwellde,
     broadwellx, haswell, haswellx, icelakex, ivybridge, ivytown and
     skylakex.

  perf stat:

   - Implement --topdown using JSON metrics.

   - Add TopdownL1 JSON metric as a default if present, but disable it
     for now for some Intel hybrid architectures, a series of patches
     addressing this is being reviewed and will be submitted for v6.5.

   - Use metrics for --smi-cost.

   - Update topdown documentation.

  Vendor events (JSON) infrastructure:

   - Add support for computing and printing metric threshold values. For
     instance, here is one found in the sapphirerapids json file:

       {
           "BriefDescription": "Percentage of cycles spent in System Management Interrupts.",
           "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)",
           "MetricGroup": "smi",
           "MetricName": "smi_cycles",
           "MetricThreshold": "smi_cycles > 0.1",
           "ScaleUnit": "100%"
       },

   - Test parsing metric thresholds with the fake PMU in 'perf test
     pmu-events'.

   - Support for printing metric thresholds in 'perf list'.

   - Add --metric-no-threshold option to 'perf stat'.

   - Add rand (reverse and) and has_pmem (optane memory) support to
     metrics.

   - Sort list of input files to avoid depending on the order from
     readdir() helping in obtaining reproducible builds.

  S/390:

   - Add common metrics: CPI (cycles per instruction), prbstate (ratio
     of instructions executed in problem state compared to total number
     of instructions), l1mp (Level one instruction and data cache misses
     per 100 instructions).

   - Add cache metrics for z13, z14, z15 and z16.

   - Add metric for TLB and cache.

  ARM:

   - Add raw decoding for SPE (Statistical Profiling Extension) v1.3 MTE
     (Memory Tagging Extension) and MOPS (Memory Operations) load/store.

  Intel PT hardware tracing:

   - Add event type names UINTR (User interrupt delivered) and UIRET
     (Exiting from user interrupt routine), documented in table 32-50
     "CFE Packet Type and Vector Fields Details" in the Intel Processor
     Trace chapter of The Intel SDM Volume 3 version 078.

   - Add support for new branch instructions ERETS and ERETU.

   - Fix CYC timestamps after standalone CBR

  ARM CoreSight hardware tracing:

   - Allow user to override timestamp and contextid settings.

   - Fix segfault in dso lookup.

   - Fix timeless decode mode detection.

   - Add separate decode paths for timeless and per-thread modes.

  auxtrace:

   - Fix address filter entire kernel size.

  Miscellaneous:

   - Fix use-after-free and unaligned bugs in the PLT handling routines.

   - Use zfree() to reduce chances of use after free.

   - Add missing 0x prefix for addresses printed in hexadecimal in 'perf
     probe'.

   - Suppress massive unsupported target platform errors in the unwind
     code.

   - Fix the incorrect build_id size returned by elf_read_build_id().

   - Fix 'perf scripts intel-pt-events.py' IPC output for Python 2.

   - Add missing new parameter in kfree_skb tracepoint to the python
     scripts using it.

   - Add 'perf bench syscall fork' benchmark.

   - Add support for printing PERF_MEM_LVLNUM_UNC (Uncached access) in
     'perf mem'.

   - Fix wrong size expectation for perf test 'Setup struct
     perf_event_attr' caused by the patch adding
     perf_event_attr::config3.

   - Fix some spelling mistakes"

* tag 'perf-tools-for-v6.4-3-2023-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (365 commits)
  Revert "perf build: Make BUILD_BPF_SKEL default, rename to NO_BPF_SKEL"
  Revert "perf build: Warn for BPF skeletons if endian mismatches"
  perf metrics: Fix SEGV with --for-each-cgroup
  perf bpf skels: Stop using vmlinux.h generated from BTF, use subset of used structs + CO-RE
  perf stat: Separate bperf from bpf_profiler
  perf test record+probe_libc_inet_pton: Fix call chain match on x86_64
  perf test record+probe_libc_inet_pton: Fix call chain match on s390
  perf tracepoint: Fix memory leak in is_valid_tracepoint()
  perf cs-etm: Add fix for coresight trace for any range of CPUs
  perf build: Fix unescaped # in perf build-test
  perf unwind: Suppress massive unsupported target platform errors
  perf script: Add new parameter in kfree_skb tracepoint to the python scripts using it
  perf script: Print raw ip instead of binary offset for callchain
  perf symbols: Fix return incorrect build_id size in elf_read_build_id()
  perf list: Modify the warning message about scandirat(3)
  perf list: Fix memory leaks in print_tracepoint_events()
  perf lock contention: Rework offset calculation with BPF CO-RE
  perf lock contention: Fix struct rq lock access
  perf stat: Disable TopdownL1 on hybrid
  perf stat: Avoid SEGV on counter->name
  ...
2023-05-07 11:32:18 -07:00
Linus Torvalds
17784de648 A single fix for debugobjects:
The recent fix to ensure atomicity of lookup and allocation inadvertently
   broke the pool refill mechanism, so that debugobject OOMs now in certain
   situations. The reason is that the functions which got updated no longer
   invoke debug_objecs_init(), which is now the only place to care about
   refilling the tracking object pool.
 
   Restore the original behaviour by adding explicit refill opportunities to
   those places.

Merge tag 'core-debugobjects-2023-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull debugobjects fix from Thomas Gleixner:
 "A single fix for debugobjects:

  The recent fix to ensure atomicity of lookup and allocation
  inadvertently broke the pool refill mechanism, so that debugobject
  OOMs now in certain situations. The reason is that the functions which
  got updated no longer invoke debug_objecs_init(), which is now the
  only place to care about refilling the tracking object pool.

  Restore the original behaviour by adding explicit refill opportunities
  to those places"

* tag 'core-debugobjects-2023-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  debugobject: Ensure pool refill (again)
2023-05-07 11:04:26 -07:00