Commit graph

3940 commits

Author SHA1 Message Date
Raghavendra K T
fc137c0dda sched/numa: enhance vma scanning logic
During Numa scanning make sure only relevant vmas of the tasks are
scanned.

Before:
 All the tasks of a process participate in scanning the vma even if they
 do not access vma in it's lifespan.

Now:
 Except cases of first few unconditional scans, if a process do
 not touch vma (exluding false positive cases of PID collisions)
 tasks no longer scan all vma

Logic used:

1) 6 bits of PID used to mark active bit in vma numab status during
   fault to remember PIDs accessing vma.  (Thanks Mel)

2) Subsequently in scan path, vma scanning is skipped if current PID
   had not accessed vma.

3) First two times we do allow unconditional scan to preserve earlier
   behaviour of scanning.

Acknowledgement to Bharata B Rao <bharata@amd.com> for initial patch to
store pid information and Peter Zijlstra <peterz@infradead.org> (Usage of
test and set bit)

Link: https://lkml.kernel.org/r/092f03105c7c1d3450f4636b1ea350407f07640e.1677672277.git.raghavendra.kt@amd.com
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: Disha Talreja <dishaa.talreja@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:03:03 -07:00
Mel Gorman
ef6a22b70f sched/numa: apply the scan delay to every new vma
Pach series "sched/numa: Enhance vma scanning", v3.

The patchset proposes one of the enhancements to numa vma scanning
suggested by Mel.  This is continuation of [3].

Reposting the rebased patchset to akpm mm-unstable tree (March 1) 

Existing mechanism of scan period involves, scan period derived from
per-thread stats.  Process Adaptive autoNUMA [1] proposed to gather NUMA
fault stats at per-process level to capture aplication behaviour better.

During that course of discussion, Mel proposed several ideas to enhance
current numa balancing.  One of the suggestion was below

Track what threads access a VMA.  The suggestion was to use an unsigned
long pid_mask and use the lower bits to tag approximately what threads
access a VMA.  Skip VMAs that did not trap a fault.  This would be
approximate because of PID collisions but would reduce scanning of areas
the thread is not interested in.  The above suggestion intends not to
penalize threads that has no interest in the vma, thus reduce scanning
overhead.

V3 changes are mostly based on PeterZ comments (details below in changes)

Summary of patchset:

Current patchset implements:

1. Delay the vma scanning logic for newly created VMA's so that
   additional overhead of scanning is not incurred for short lived tasks
   (implementation by Mel)

2. Store the information of tasks accessing VMA in 2 windows.  It is
   regularly cleared in (4*sysctl_numa_balancing_scan_delay) interval. 
   The above time is derived from experimenting (Suggested by PeterZ) to
   balance between frequent clearing vs obsolete access data

3. hash_32 used to encode task index accessing VMA information

4. VMA's acess information is used to skip scanning for the tasks
   which had not accessed VMA

Changes since V2:
patch1: 
 - Renaming of structure, macro to function,
 - Add explanation to heuristics
 - Adding more details from result (PeterZ)
 Patch2:
 - Usage of test and set bit (PeterZ)
 - Move storing access PID info to numa_migrate_prep()
 - Add a note on fainess among tasks allowed to scan
   (PeterZ)
 Patch3:
 - Maintain two windows of access PID information
  (PeterZ supported implementation and Gave idea to extend
   to N if needed)
 Patch4:
 - Apply hash_32 function to track VMA accessing PIDs (PeterZ)

Changes since RFC V1:
 - Include Mel's vma scan delay patch
 - Change the accessing pid store logic (Thanks Mel)
 - Fencing structure / code to NUMA_BALANCING (David, Mel)
 - Adding clearing access PID logic (Mel)
 - Descriptive change log ( Mike Rapoport)

Things to ponder over:
==========================================

- Improvement to clearing accessing PIDs logic (discussed in-detail in
  patch3 itself (Done in this patchset by implementing 2 window history)

- Current scan period is not changed in the patchset, so we do see
  frequent tries to scan.  Relaxing scan period dynamically could improve
  results further.

[1] sched/numa: Process Adaptive autoNUMA 
 Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@amd.com/T/

[2] RFC V1 Link: 
  https://lore.kernel.org/all/cover.1673610485.git.raghavendra.kt@amd.com/

[3] V2 Link:
  https://lore.kernel.org/lkml/cover.1675159422.git.raghavendra.kt@amd.com/


Results:
Summary: Huge autonuma cost reduction seen in mmtest. Kernbench improvement 
is more than 5% and huge system time (80%+) improvement from mmtest autonuma.
(dbench had huge std deviation to post)

kernbench
===========
                      6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Amean     user-256    22002.51 (   0.00%)    22649.95 *  -2.94%*
Amean     syst-256    10162.78 (   0.00%)     8214.13 *  19.17%*
Amean     elsp-256      160.74 (   0.00%)      156.92 *   2.38%*

Duration User       66017.43    67959.84
Duration System     30503.15    24657.03
Duration Elapsed      504.61      493.12

                      6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Ops NUMA alloc hit                1738835089.00  1738780310.00
Ops NUMA alloc local              1738834448.00  1738779711.00
Ops NUMA base-page range updates      477310.00      392566.00
Ops NUMA PTE updates                  477310.00      392566.00
Ops NUMA hint faults                   96817.00       87555.00
Ops NUMA hint local faults %           10150.00        2192.00
Ops NUMA hint local percent               10.48           2.50
Ops NUMA pages migrated                86660.00       85363.00
Ops AutoNUMA cost                        489.07         442.14

autonumabench
===============
                      6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Amean     syst-NUMA01                  399.50 (   0.00%)       52.05 *  86.97%*
Amean     syst-NUMA01_THREADLOCAL        0.21 (   0.00%)        0.22 *  -5.41%*
Amean     syst-NUMA02                    0.80 (   0.00%)        0.78 *   2.68%*
Amean     syst-NUMA02_SMT                0.65 (   0.00%)        0.68 *  -3.95%*
Amean     elsp-NUMA01                  313.26 (   0.00%)      313.11 *   0.05%*
Amean     elsp-NUMA01_THREADLOCAL        1.06 (   0.00%)        1.08 *  -1.76%*
Amean     elsp-NUMA02                    3.19 (   0.00%)        3.24 *  -1.52%*
Amean     elsp-NUMA02_SMT                3.72 (   0.00%)        3.61 *   2.92%*

Duration User      396433.47   324835.96
Duration System      2808.70      376.66
Duration Elapsed     2258.61     2258.12

                      6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Ops NUMA alloc hit                  59921806.00    49623489.00
Ops NUMA alloc miss                        0.00           0.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local                59920880.00    49622594.00
Ops NUMA base-page range updates   152259275.00       50075.00
Ops NUMA PTE updates               152259275.00       50075.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults               154660352.00       39014.00
Ops NUMA hint local faults %       138550501.00       23139.00
Ops NUMA hint local percent               89.58          59.31
Ops NUMA pages migrated              8179067.00       14147.00
Ops AutoNUMA cost                     774522.98         195.69


This patch (of 4):

Currently whenever a new task is created we wait for
sysctl_numa_balancing_scan_delay to avoid unnessary scanning overhead. 
Extend the same logic to new or very short-lived VMAs.

[raghavendra.kt@amd.com: add initialization in vm_area_dup())]
Link: https://lkml.kernel.org/r/cover.1677672277.git.raghavendra.kt@amd.com
Link: https://lkml.kernel.org/r/7a6fbba87c8b51e67efd3e74285bb4cb311a16ca.1677672277.git.raghavendra.kt@amd.com
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Disha Talreja <dishaa.talreja@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:03:03 -07:00
Domenico Cerasuolo
d82caa2735 sched/psi: Allow unprivileged polling of N*2s period
PSI offers 2 mechanisms to get information about a specific resource
pressure. One is reading from /proc/pressure/<resource>, which gives
average pressures aggregated every 2s. The other is creating a pollable
fd for a specific resource and cgroup.

The trigger creation requires CAP_SYS_RESOURCE, and gives the
possibility to pick specific time window and threshold, spawing an RT
thread to aggregate the data.

Systemd would like to provide containers the option to monitor pressure
on their own cgroup and sub-cgroups. For example, if systemd launches a
container that itself then launches services, the container should have
the ability to poll() for pressure in individual services. But neither
the container nor the services are privileged.

This patch implements a mechanism to allow unprivileged users to create
pressure triggers. The difference with privileged triggers creation is
that unprivileged ones must have a time window that's a multiple of 2s.
This is so that we can avoid unrestricted spawning of rt threads, and
use instead the same aggregation mechanism done for the averages, which
runs independently of any triggers.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20230330105418.77061-5-cerasuolodomenico@gmail.com
2023-04-05 09:58:50 +02:00
Domenico Cerasuolo
4468fcae49 sched/psi: Extract update_triggers side effect
This change moves update_total flag out of update_triggers function,
currently called only in psi_poll_work.
In the next patch, update_triggers will be called also in psi_avgs_work,
but the total update information is specific to psi_poll_work.
Returning update_total value to the caller let us avoid differentiating
the implementation of update_triggers for different aggregators.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20230330105418.77061-4-cerasuolodomenico@gmail.com
2023-04-05 09:58:49 +02:00
Domenico Cerasuolo
65457b74aa sched/psi: Rename existing poll members in preparation
Renaming in PSI implementation to make a clear distinction between
privileged and unprivileged triggers code to be implemented in the
next patch.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20230330105418.77061-3-cerasuolodomenico@gmail.com
2023-04-05 09:58:49 +02:00
Domenico Cerasuolo
7fab21fa0d sched/psi: Rearrange polling code in preparation
Move a few functions up in the file to avoid forward declaration needed
in the patch implementing unprivileged PSI triggers.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20230330105418.77061-2-cerasuolodomenico@gmail.com
2023-04-05 09:58:48 +02:00
Libo Chen
39afe5d6fc sched/fair: Fix inaccurate tally of ttwu_move_affine
There are scenarios where non-affine wakeups are incorrectly counted as
affine wakeups by schedstats.

When wake_affine_idle() returns prev_cpu which doesn't equal to
nr_cpumask_bits, it will slip through the check: target == nr_cpumask_bits
in wake_affine() and be counted as if target == this_cpu in schedstats.

Replace target == nr_cpumask_bits with target != this_cpu to make sure
affine wakeups are accurately tallied.

Fixes: 806486c377 (sched/fair: Do not migrate if the prev_cpu is idle)
Suggested-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Libo Chen <libo.chen@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Link: https://lore.kernel.org/r/20220810223313.386614-1-libo.chen@oracle.com
2023-04-05 09:58:48 +02:00
Greg Kroah-Hartman
cd8fe5b6db Merge 6.3-rc5 into driver-core-next
We need the fixes in here for testing, as well as the driver core
changes for documentation updates to build on.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-03 09:33:30 +02:00
Nicholas Piggin
aa464ba9a1 lazy tlb: introduce lazy tlb mm refcount helper functions
Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting. 
This makes the lazy tlb mm references more obvious, and allows the
refcounting scheme to be modified in later changes.  There is no
functional change with this patch.

Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28 16:20:08 -07:00
Peter Zijlstra
68e2d17c9e trace: Add trace_ipi_send_cpu()
Because copying cpumasks around when targeting a single CPU is a bit
daft...

Tested-and-reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230322103004.GA571242%40hirez.programming.kicks-ass.net
2023-03-24 11:01:29 +01:00
Valentin Schneider
68f4ff04db sched, smp: Trace smp callback causing an IPI
Context
=======

The newly-introduced ipi_send_cpumask tracepoint has a "callback" parameter
which so far has only been fed with NULL.

While CSD_TYPE_SYNC/ASYNC and CSD_TYPE_IRQ_WORK share a similar backing
struct layout (meaning their callback func can be accessed without caring
about the actual CSD type), CSD_TYPE_TTWU doesn't even have a function
attached to its struct. This means we need to check the type of a CSD
before eventually dereferencing its associated callback.

This isn't as trivial as it sounds: the CSD type is stored in
__call_single_node.u_flags, which get cleared right before the callback is
executed via csd_unlock(). This implies checking the CSD type before it is
enqueued on the call_single_queue, as the target CPU's queue can be flushed
before we get to sending an IPI.

Furthermore, send_call_function_single_ipi() only has a CPU parameter, and
would need to have an additional argument to trickle down the invoked
function. This is somewhat silly, as the extra argument will always be
pushed down to the function even when nothing is being traced, which is
unnecessary overhead.

Changes
=======

send_call_function_single_ipi() is only used by smp.c, and is defined in
sched/core.c as it contains scheduler-specific ops (set_nr_if_polling() of
a CPU's idle task).

Split it into two parts: the scheduler bits remain in sched/core.c, and the
actual IPI emission is moved into smp.c. This lets us define an
__always_inline helper function that can take the related callback as
parameter without creating useless register pressure in the non-traced path
which only gains a (disabled) static branch.

Do the same thing for the multi IPI case.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230307143558.294354-8-vschneid@redhat.com
2023-03-24 11:01:29 +01:00
Valentin Schneider
cc9cb0a717 sched, smp: Trace IPIs sent via send_call_function_single_ipi()
send_call_function_single_ipi() is the thing that sends IPIs at the bottom
of smp_call_function*() via either generic_exec_single() or
smp_call_function_many_cond(). Give it an IPI-related tracepoint.

Note that this ends up tracing any IPI sent via __smp_call_single_queue(),
which covers __ttwu_queue_wakelist() and irq_work_queue_on() "for free".

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230307143558.294354-3-vschneid@redhat.com
2023-03-24 11:01:27 +01:00
Josh Poimboeuf
e3ff7c609f livepatch,sched: Add livepatch task switching to cond_resched()
There have been reports [1][2] of live patches failing to complete
within a reasonable amount of time due to CPU-bound kthreads.

Fix it by patching tasks in cond_resched().

There are four different flavors of cond_resched(), depending on the
kernel configuration.  Hook into all of them.

A more elegant solution might be to use a preempt notifier.  However,
non-ORC unwinders can't unwind a preempted task reliably.

[1] https://lore.kernel.org/lkml/20220507174628.2086373-1-song@kernel.org/
[2] https://lkml.kernel.org/lkml/20230120-vhost-klp-switching-v1-0-7c2b65519c43@kernel.org

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Link: https://lore.kernel.org/r/4ae981466b7814ec221014fc2554b2f86f3fb70b.1677257135.git.jpoimboe@kernel.org
2023-03-22 17:09:28 +01:00
Shrikanth Hegde
41abdba937 sched: Interleave cfs bandwidth timers for improved single thread performance at low utilization
CPU cfs bandwidth controller uses hrtimer. Currently there is no initial
value set. Hence all period timers would align at expiry.
This happens when there are multiple CPU cgroup's.

There is a performance gain that can be achieved here if the timers are
interleaved when the utilization of each CPU cgroup is low and total
utilization of all the CPU cgroup's is less than 50%. If the timers are
interleaved, then the unthrottled cgroup can run freely without many
context switches and can also benefit from SMT Folding. This effect will
be further amplified in SPLPAR environment.

This commit adds a random offset after initializing each hrtimer. This
would result in interleaving the timers at expiry, which helps in achieving
the said performance gain.

This was tested on powerpc platform with 8 core SMT=8. Socket power was
measured when the workload. Benchmarked the stress-ng with power
information. Throughput oriented benchmarks show significant gain up to
25% while power consumption increases up to 15%.

Workload: stress-ng --cpu=32 --cpu-ops=50000.
1CG - 1 cgroup is running.
2CG - 2 cgroups are running together.
Time taken to complete stress-ng in seconds and power is in watts.
each cgroup is throttled at 25% with 100ms as the period value.
           6.2-rc6                     |   with patch
8 core   1CG    power   2CG     power  |  1CG    power  2 CG    power
        27.5    80.6    40      90     |  27.3    82    32.3    104
        27.5    81      40.2    91     |  27.5    81    38.7     96
        27.7    80      40.1    89     |  27.6    80    29.7    106
        27.7    80.1    40.3    94     |  27.6    80    31.5    105

Latency might be affected by this change. That could happen if the CPU was
in a deep idle state which is possible if we interleave the timers. Used
schbench for measuring the latency. Each cgroup is throttled at 25% with
period value is set to 100ms. Numbers are when both the cgroups are
running simultaneously. Latency values don't degrade much. Some
improvement is seen in tail latencies.

		6.2-rc6        with patch
Groups: 16
50.0th:          39.5            42.5
75.0th:         924.0           922.0
90.0th:         972.0           968.0
95.0th:        1005.5           994.0
99.0th:        4166.0          2287.0
99.5th:        7314.0          7448.0
99.9th:       15024.0         13600.0

Groups: 32
50.0th:         819.0           463.0
75.0th:        1596.0           918.0
90.0th:        5992.0          1281.5
95.0th:       13184.0          2765.0
99.0th:       21792.0         14240.0
99.5th:       25696.0         18920.0
99.9th:       33280.0         35776.0

Groups: 64
50.0th:        4806.0          3440.0
75.0th:       31136.0         33664.0
90.0th:       54144.0         58752.0
95.0th:       66176.0         67200.0
99.0th:       84736.0         91520.0
99.5th:       97408.0        114048.0
99.9th:      136448.0        140032.0

Initial RFC PATCH, discussions and details on the problem:

Link1: https://lore.kernel.org/lkml/5ae3cb09-8c9a-11e8-75a7-cc774d9bc283@linux.vnet.ibm.com/
Link2: https://lore.kernel.org/lkml/9c57c92c-3e0c-b8c5-4be9-8f4df344a347@linux.vnet.ibm.com/
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Shrikanth Hegde<sshegde@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230223185153.1499710-1-sshegde@linux.vnet.ibm.com
2023-03-22 10:10:58 +01:00
wuchi
eff6c8ce8d sched/core: Reduce cost of sched_move_task when config autogroup
Some sched_move_task calls are useless because that
task_struct->sched_task_group maybe not changed (equals task_group
of cpu_cgroup) when system enable autogroup. So do some checks in
sched_move_task.

sched_move_task eg:
task A belongs to cpu_cgroup0 and autogroup0, it will always belong
to cpu_cgroup0 when do_exit. So there is no need to do {de|en}queue.
The call graph is as follow.

  do_exit
    sched_autogroup_exit_task
      sched_move_task
	dequeue_task
	  sched_change_group
	    A.sched_task_group = sched_get_task_group (=cpu_cgroup0)
	enqueue_task

Performance results:
===========================
1. env
        cpu: bogomips=4600.00
     kernel: 6.3.0-rc3
 cpu_cgroup: 6:cpu,cpuacct:/user.slice

2. cmds
do_exit script:

  for i in {0..10000}; do
      sleep 0 &
      done
  wait

Run the above script, then use the following bpftrace cmd to get
the cost of sched_move_task:

  bpftrace -e 'k:sched_move_task { @ts[tid] = nsecs; }
	       kr:sched_move_task /@ts[tid]/
		  { @ns += nsecs - @ts[tid]; delete(@ts[tid]); }'

3. cost time(ns):
  without patch: 43528033
  with    patch: 18541416
           diff:-24986617  -57.4%

As the result show, the patch will save 57.4% in the scenario.

Signed-off-by: wuchi <wuchi.zero@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230321064459.39421-1-wuchi.zero@gmail.com
2023-03-22 10:10:58 +01:00
Hao Jia
530bfad1d5 sched/core: Avoid selecting the task that is throttled to run when core-sched enable
When {rt, cfs}_rq or dl task is throttled, since cookied tasks
are not dequeued from the core tree, So sched_core_find() and
sched_core_next() may return throttled task, which may
cause throttled task to run on the CPU.

So we add checks in sched_core_find() and sched_core_next()
to make sure that the return is a runnable task that is
not throttled.

Co-developed-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230316081806.69544-1-jiahao.os@bytedance.com
2023-03-22 10:10:58 +01:00
Tom Rix
d91e15a21d sched/topology: Make sched_energy_mutex,update static
smatch reports
kernel/sched/topology.c:212:1: warning:
  symbol 'sched_energy_mutex' was not declared. Should it be static?
kernel/sched/topology.c:213:6: warning:
  symbol 'sched_energy_update' was not declared. Should it be static?

These variables are only used in topology.c, so should be static

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230314144818.1453523-1-trix@redhat.com
2023-03-22 10:10:57 +01:00
Vincent Guittot
a53ce18cac sched/fair: Sanitize vruntime of entity being migrated
Commit 829c1651e9 ("sched/fair: sanitize vruntime of entity being placed")
fixes an overflowing bug, but ignore a case that se->exec_start is reset
after a migration.

For fixing this case, we delay the reset of se->exec_start after
placing the entity which se->exec_start to detect long sleeping task.

In order to take into account a possible divergence between the clock_task
of 2 rqs, we increase the threshold to around 104 days.

Fixes: 829c1651e9 ("sched/fair: sanitize vruntime of entity being placed")
Originally-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
Link: https://lore.kernel.org/r/20230317160810.107988-1-vincent.guittot@linaro.org
2023-03-21 14:43:04 +01:00
Phil Auld
34320745df sched/debug: Put sched/domains files under the verbose flag
The debug files under sched/domains can take a long time to regenerate,
especially when updates are done one at a time. Move these files under
the sched verbose debug flag. Allow changes to verbose to trigger
generation of the files. This lets a user batch the updates but still
have the information available.  The detailed topology printk messages
are also under verbose.

Discussion that lead to this approach can be found in the link below.

Simplified code to maintain use of debugfs bool routines suggested by
Michael Ellerman <mpe@ellerman.id.au>.

Signed-off-by: Phil Auld <pauld@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Vishal Chourasia <vishalc@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vishal Chourasia <vishalc@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/all/Y01UWQL2y2r69sBX@li-05afa54c-330e-11b2-a85c-e3f3aa0db1e9.ibm.com/
Link: https://lore.kernel.org/r/20230303183754.3076321-1-pauld@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-17 15:24:19 +01:00
Linus Torvalds
6015b1aca1 sched_getaffinity: don't assume 'cpumask_size()' is fully initialized
The getaffinity() system call uses 'cpumask_size()' to decide how big
the CPU mask is - so far so good.  It is indeed the allocation size of a
cpumask.

But the code also assumes that the whole allocation is initialized
without actually doing so itself.  That's wrong, because we might have
fixed-size allocations (making copying and clearing more efficient), but
not all of it is then necessarily used if 'nr_cpu_ids' is smaller.

Having checked other users of 'cpumask_size()', they all seem to be ok,
either using it purely for the allocation size, or explicitly zeroing
the cpumask before using the size in bytes to copy it.

See for example the ublk_ctrl_get_queue_affinity() function that uses
the proper 'zalloc_cpumask_var()' to make sure that the whole mask is
cleared, whether the storage is on the stack or if it was an external
allocation.

Fix this by just zeroing the allocation before using it.  Do the same
for the compat version of sched_getaffinity(), which had the same logic.

Also, for consistency, make sched_getaffinity() use 'cpumask_bits()' to
access the bits.  For a cpumask_var_t, it ends up being a pointer to the
same data either way, but it's just a good idea to treat it like you
would a 'cpumask_t'.  The compat case already did that.

Reported-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/lkml/7d026744-6bd6-6827-0471-b5e8eae0be3f@arm.com/
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-03-14 19:32:38 -07:00
Josh Poimboeuf
071c44e427 sched/idle: Mark arch_cpu_idle_dead() __noreturn
Before commit 076cbf5d2163 ("x86/xen: don't let xen_pv_play_dead()
return"), in Xen, when a previously offlined CPU was brought back
online, it unexpectedly resumed execution where it left off in the
middle of the idle loop.

There were some hacks to make that work, but the behavior was surprising
as do_idle() doesn't expect an offlined CPU to return from the dead (in
arch_cpu_idle_dead()).

Now that Xen has been fixed, and the arch-specific implementations of
arch_cpu_idle_dead() also don't return, give it a __noreturn attribute.

This will cause the compiler to complain if an arch-specific
implementation might return.  It also improves code generation for both
caller and callee.

Also fixes the following warning:

  vmlinux.o: warning: objtool: do_idle+0x25f: unreachable instruction

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/60d527353da8c99d4cf13b6473131d46719ed16d.1676358308.git.jpoimboe@kernel.org
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2023-03-08 08:44:28 -08:00
Josh Poimboeuf
dfb0f170ca sched/idle: Make sure weak version of arch_cpu_idle_dead() doesn't return
arch_cpu_idle_dead() should never return.  Make it so.

Link: https://lore.kernel.org/r/cf5ad95eef50f7704bb30e7770c59bfe23372af7.1676358308.git.jpoimboe@kernel.org
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2023-03-08 08:44:27 -08:00
Linus Torvalds
c8b4accf86 More power management updates for 6.3-rc1
- Fix error handling in the apple-soc cpufreq driver (Dan Carpenter).
 
  - Change the log level of a message in the amd-pstate cpufreq driver
    so it is more visible to users (Kai-Heng Feng).
 
  - Adjust the balance_performance EPP value for Sapphire Rapids in the
    intel_pstate cpufreq driver (Srinivas Pandruvada).
 
  - Remove MODULE_LICENSE from 3 pieces of non-modular code (Nick Alcock).
 
  - Make a read-only kobj_type structure in the schedutil cpufreq governor
    constant (Thomas Weißschuh).
 
  - Add Add Power Limit4 support for Meteor Lake SoC to the Intel RAPL
    power capping driver (Sumeet Pawnikar).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmQCNm0SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRx0P8QAJdcjEg4cY/OJq9VTqiUo92mt63aqWXr
 vOMYgp5QDuDRPKiVMMuCJNCJSrA/xPwAz4pcSOj/59CZn1UZqzg6Ap/vdmV7pQV1
 vDp6N9E2sgCL0QMQqueyv3yEN9meleGrQPnAulOZf29Q9PC0rGiAUdivvgDXZq/9
 9wo+/11EzZpKyaDTfLzotIvNOkmmkp5FtxaCR64D5w3e6gehGrL/wL3FgSefDnsa
 fNlhxgi66FYuamSgXqQTkuIuig0Rbvlp0fmhllPaIOkNMI4o7rvP5rB7FbAKZm1E
 XI+M3aVlZsImPpEEJ1dTqbc4y9WU9HakLfRRSiUnnWHSXpwBI8ncEwP0oulqqVoF
 elA9kd7Sv1DwLiUGMy3GaOwscTN6NDUwICH4UJeISWlMZfrP7YR3Bb7gVtzlCrKC
 b99oA92OFazzWliYPSzzSsFQTI1PczNLZKnelzgF1+5g76q3sIUa3tu6Ts6Z84UK
 rHCLDVw8TUFpsEQxKOvM3oLUdpvE0mmQpdG8fSeHVxon47jVzmkbDjgOBFp1i19F
 HBI7LfOhDkkzO35qb6+DfkQoij0mpn8ldbVVLd1XQe4F0WjfSe8D4oGTgB3ErTK7
 8cOKGoez9FsQRHbSVCaZcaTSemDHdB3UMRUHG2hvU1jbwXOaQcLfMAF5McyECJiN
 U8uI28JVGVPI
 =lRKS
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.3-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management updates from Rafael Wysocki:
 "These update power capping (new hardware support and cleanup) and
  cpufreq (bug fixes, cleanups and intel_pstate adjustment for a new
  platform).

  Specifics:

   - Fix error handling in the apple-soc cpufreq driver (Dan Carpenter)

   - Change the log level of a message in the amd-pstate cpufreq driver
     so it is more visible to users (Kai-Heng Feng)

   - Adjust the balance_performance EPP value for Sapphire Rapids in the
     intel_pstate cpufreq driver (Srinivas Pandruvada)

   - Remove MODULE_LICENSE from 3 pieces of non-modular code (Nick
     Alcock)

   - Make a read-only kobj_type structure in the schedutil cpufreq
     governor constant (Thomas Weißschuh)

   - Add Add Power Limit4 support for Meteor Lake SoC to the Intel RAPL
     power capping driver (Sumeet Pawnikar)"

* tag 'pm-6.3-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: apple-soc: Fix an IS_ERR() vs NULL check
  powercap: remove MODULE_LICENSE in non-modules
  cpufreq: intel_pstate: remove MODULE_LICENSE in non-modules
  powercap: RAPL: Add Power Limit4 support for Meteor Lake SoC
  cpufreq: amd-pstate: remove MODULE_LICENSE in non-modules
  cpufreq: schedutil: make kobj_type structure constant
  cpufreq: amd-pstate: Let user know amd-pstate is disabled
  cpufreq: intel_pstate: Adjust balance_performance EPP for Sapphire Rapids
2023-03-03 10:30:58 -08:00
Linus Torvalds
3822a7c409 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X bit.
 
 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.
 
 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes
 
 - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
   does perform some memcg maintenance and cleanup work.
 
 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".  These filters provide users
   with finer-grained control over DAMOS's actions.  SeongJae has also done
   some DAMON cleanup work.
 
 - Kairui Song adds a series ("Clean up and fixes for swap").
 
 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".
 
 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series.  It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.
 
 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".
 
 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".
 
 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".
 
 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".
 
 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series "mm:
   support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
   PTEs".
 
 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".
 
 - Sergey Senozhatsky has improved zsmalloc's memory utilization with his
   series "zsmalloc: make zspage chain size configurable".
 
 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.  The previous BPF-based approach had
   shortcomings.  See "mm: In-kernel support for memory-deny-write-execute
   (MDWE)".
 
 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
 
 - T.J.  Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".
 
 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a per-node
   basis.  See the series "Introduce per NUMA node memory error
   statistics".
 
 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage during
   compaction".
 
 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".
 
 - Christoph Hellwig has removed block_device_operations.rw_page() in ths
   series "remove ->rw_page".
 
 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".
 
 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier functions".
 
 - Some pagemap cleanup and generalization work in Mike Rapoport's series
   "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
   "fixups for generic implementation of pfn_valid()"
 
 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
 
 - Jason Gunthorpe rationalized the GUP system's interface to the rest of
   the kernel in the series "Simplify the external interface for GUP".
 
 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface.  To support this, we'll temporarily be
   printing warnings when people use the debugfs interface.  See the series
   "mm/damon: deprecate DAMON debugfs interface".
 
 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.
 
 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".
 
 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
 jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
 DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
 =MlGs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
   F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X
   bit.

 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.

 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes

 - Johannes Weiner has a series ("mm: push down lock_page_memcg()")
   which does perform some memcg maintenance and cleanup work.

 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".

   These filters provide users with finer-grained control over DAMOS's
   actions. SeongJae has also done some DAMON cleanup work.

 - Kairui Song adds a series ("Clean up and fixes for swap").

 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".

 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.

 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".

 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".

 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".

 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".

 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series
   "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
   swap PTEs".

 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".

 - Sergey Senozhatsky has improved zsmalloc's memory utilization with
   his series "zsmalloc: make zspage chain size configurable".

 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.

   The previous BPF-based approach had shortcomings. See "mm: In-kernel
   support for memory-deny-write-execute (MDWE)".

 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".

 - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".

 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a
   per-node basis. See the series "Introduce per NUMA node memory error
   statistics".

 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage
   during compaction".

 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".

 - Christoph Hellwig has removed block_device_operations.rw_page() in
   ths series "remove ->rw_page".

 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".

 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier
   functions".

 - Some pagemap cleanup and generalization work in Mike Rapoport's
   series "mm, arch: add generic implementation of pfn_valid() for
   FLATMEM" and "fixups for generic implementation of pfn_valid()"

 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".

 - Jason Gunthorpe rationalized the GUP system's interface to the rest
   of the kernel in the series "Simplify the external interface for
   GUP".

 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface. To support this, we'll temporarily be
   printing warnings when people use the debugfs interface. See the
   series "mm/damon: deprecate DAMON debugfs interface".

 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.

 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".

 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".

* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
  include/linux/migrate.h: remove unneeded externs
  mm/memory_hotplug: cleanup return value handing in do_migrate_range()
  mm/uffd: fix comment in handling pte markers
  mm: change to return bool for isolate_movable_page()
  mm: hugetlb: change to return bool for isolate_hugetlb()
  mm: change to return bool for isolate_lru_page()
  mm: change to return bool for folio_isolate_lru()
  objtool: add UACCESS exceptions for __tsan_volatile_read/write
  kmsan: disable ftrace in kmsan core code
  kasan: mark addr_has_metadata __always_inline
  mm: memcontrol: rename memcg_kmem_enabled()
  sh: initialize max_mapnr
  m68k/nommu: add missing definition of ARCH_PFN_OFFSET
  mm: percpu: fix incorrect size in pcpu_obj_full_size()
  maple_tree: reduce stack usage with gcc-9 and earlier
  mm: page_alloc: call panic() when memoryless node allocation fails
  mm: multi-gen LRU: avoid futile retries
  migrate_pages: move THP/hugetlb migration support check to simplify code
  migrate_pages: batch flushing TLB
  migrate_pages: share more code between _unmap and _move
  ...
2023-02-23 17:09:35 -08:00
Thomas Weißschuh
70ba26cbe0 cpufreq: schedutil: make kobj_type structure constant
Since commit ee6d3dd4ed ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.

Take advantage of this to constify the structure definition to prevent
modification at runtime.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-02-23 19:57:29 +01:00
Linus Torvalds
5b7c4cabbb Networking changes for 6.3.
Core
 ----
 
  - Add dedicated kmem_cache for typical/small skb->head, avoid having
    to access struct page at kfree time, and improve memory use.
 
  - Introduce sysctl to set default RPS configuration for new netdevs.
 
  - Define Netlink protocol specification format which can be used
    to describe messages used by each family and auto-generate parsers.
    Add tools for generating kernel data structures and uAPI headers.
 
  - Expose all net/core sysctls inside netns.
 
  - Remove 4s sleep in netpoll if carrier is instantly detected on boot.
 
  - Add configurable limit of MDB entries per port, and port-vlan.
 
  - Continue populating drop reasons throughout the stack.
 
  - Retire a handful of legacy Qdiscs and classifiers.
 
 Protocols
 ---------
 
  - Support IPv4 big TCP (TSO frames larger than 64kB).
 
  - Add IP_LOCAL_PORT_RANGE socket option, to control local port range
    on socket by socket basis.
 
  - Track and report in procfs number of MPTCP sockets used.
 
  - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP
    path manager.
 
  - IPv6: don't check net.ipv6.route.max_size and rely on garbage
    collection to free memory (similarly to IPv4).
 
  - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
 
  - ICMP: add per-rate limit counters.
 
  - Add support for user scanning requests in ieee802154.
 
  - Remove static WEP support.
 
  - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
    reporting.
 
  - WiFi 7 EHT channel puncturing support (client & AP).
 
 BPF
 ---
 
  - Add a rbtree data structure following the "next-gen data structure"
    precedent set by recently added linked list, that is, by using
    kfunc + kptr instead of adding a new BPF map type.
 
  - Expose XDP hints via kfuncs with initial support for RX hash and
    timestamp metadata.
 
  - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key
    to better support decap on GRE tunnel devices not operating
    in collect metadata.
 
  - Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
 
  - Remove the need for trace_printk_lock for bpf_trace_printk
    and bpf_trace_vprintk helpers.
 
  - Extend libbpf's bpf_tracing.h support for tracing arguments of
    kprobes/uprobes and syscall as a special case.
 
  - Significantly reduce the search time for module symbols
    by livepatch and BPF.
 
  - Enable cpumasks to be used as kptrs, which is useful for tracing
    programs tracking which tasks end up running on which CPUs in
    different time intervals.
 
  - Add support for BPF trampoline on s390x and riscv64.
 
  - Add capability to export the XDP features supported by the NIC.
 
  - Add __bpf_kfunc tag for marking kernel functions as kfuncs.
 
  - Add cgroup.memory=nobpf kernel parameter option to disable BPF
    memory accounting for container environments.
 
 Netfilter
 ---------
 
  - Remove the CLUSTERIP target. It has been marked as obsolete
    for years, and we still have WARN splats wrt. races of
    the out-of-band /proc interface installed by this target.
 
  - Add 'destroy' commands to nf_tables. They are identical to
    the existing 'delete' commands, but do not return an error if
    the referenced object (set, chain, rule...) did not exist.
 
 Driver API
 ----------
 
  - Improve cpumask_local_spread() locality to help NICs set the right
    IRQ affinity on AMD platforms.
 
  - Separate C22 and C45 MDIO bus transactions more clearly.
 
  - Introduce new DCB table to control DSCP rewrite on egress.
 
  - Support configuration of Physical Layer Collision Avoidance (PLCA)
    Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
    shared medium Ethernet.
 
  - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
    preemption of low priority frames by high priority frames.
 
  - Add support for controlling MACSec offload using netlink SET.
 
  - Rework devlink instance refcounts to allow registration and
    de-registration under the instance lock. Split the code into multiple
    files, drop some of the unnecessarily granular locks and factor out
    common parts of netlink operation handling.
 
  - Add TX frame aggregation parameters (for USB drivers).
 
  - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
    messages with notifications for debug.
 
  - Allow offloading of UDP NEW connections via act_ct.
 
  - Add support for per action HW stats in TC.
 
  - Support hardware miss to TC action (continue processing in SW from
    a specific point in the action chain).
 
  - Warn if old Wireless Extension user space interface is used with
    modern cfg80211/mac80211 drivers. Do not support Wireless Extensions
    for Wi-Fi 7 devices at all. Everyone should switch to using nl80211
    interface instead.
 
  - Improve the CAN bit timing configuration. Use extack to return error
    messages directly to user space, update the SJW handling, including
    the definition of a new default value that will benefit CAN-FD
    controllers, by increasing their oscillator tolerance.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - nVidia BlueField-3 support (control traffic driver)
    - Ethernet support for imx93 SoCs
    - Motorcomm yt8531 gigabit Ethernet PHY
    - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
    - Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
    - Amlogic gxl MDIO mux
 
  - WiFi:
    - RealTek RTL8188EU (rtl8xxxu)
    - Qualcomm Wi-Fi 7 devices (ath12k)
 
  - CAN:
    - Renesas R-Car V4H
 
 Drivers
 -------
 
  - Bluetooth:
    - Set Per Platform Antenna Gain (PPAG) for Intel controllers.
 
  - Ethernet NICs:
    - Intel (1G, igc):
      - support TSN / Qbv / packet scheduling features of i226 model
    - Intel (100G, ice):
      - use GNSS subsystem instead of TTY
      - multi-buffer XDP support
      - extend support for GPIO pins to E823 devices
    - nVidia/Mellanox:
      - update the shared buffer configuration on PFC commands
      - implement PTP adjphase function for HW offset control
      - TC support for Geneve and GRE with VF tunnel offload
      - more efficient crypto key management method
      - multi-port eswitch support
    - Netronome/Corigine:
      - add DCB IEEE support
      - support IPsec offloading for NFP3800
    - Freescale/NXP (enetc):
      - enetc: support XDP_REDIRECT for XDP non-linear buffers
      - enetc: improve reconfig, avoid link flap and waiting for idle
      - enetc: support MAC Merge layer
    - Other NICs:
      - sfc/ef100: add basic devlink support for ef100
      - ionic: rx_push mode operation (writing descriptors via MMIO)
      - bnxt: use the auxiliary bus abstraction for RDMA
      - r8169: disable ASPM and reset bus in case of tx timeout
      - cpsw: support QSGMII mode for J721e CPSW9G
      - cpts: support pulse-per-second output
      - ngbe: add an mdio bus driver
      - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
      - r8152: handle devices with FW with NCM support
      - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
      - virtio-net: support multi buffer XDP
      - virtio/vsock: replace virtio_vsock_pkt with sk_buff
      - tsnep: XDP support
 
  - Ethernet high-speed switches:
    - nVidia/Mellanox (mlxsw):
      - add support for latency TLV (in FW control messages)
    - Microchip (sparx5):
      - separate explicit and implicit traffic forwarding rules, make
        the implicit rules always active
      - add support for egress DSCP rewrite
      - IS0 VCAP support (Ingress Classification)
      - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.)
      - ES2 VCAP support (Egress Access Control)
      - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1)
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - add MAB (port auth) offload support
      - enable PTP receive for mv88e6390
    - NXP (ocelot):
      - support MAC Merge layer
      - support for the the vsc7512 internal copper phys
    - Microchip:
      - lan9303: convert to PHYLINK
      - lan966x: support TC flower filter statistics
      - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
      - lan937x: support Credit Based Shaper configuration
      - ksz9477: support Energy Efficient Ethernet
    - other:
      - qca8k: convert to regmap read/write API, use bulk operations
      - rswitch: Improve TX timestamp accuracy
 
  - Intel WiFi (iwlwifi):
    - EHT (Wi-Fi 7) rate reporting
    - STEP equalizer support: transfer some STEP (connection to radio
      on platforms with integrated wifi) related parameters from the
      BIOS to the firmware.
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - IPQ5018 support
    - Fine Timing Measurement (FTM) responder role support
    - channel 177 support
 
  - MediaTek WiFi (mt76):
    - per-PHY LED support
    - mt7996: EHT (Wi-Fi 7) support
    - Wireless Ethernet Dispatch (WED) reset support
    - switch to using page pool allocator
 
  - RealTek WiFi (rtw89):
    - support new version of Bluetooth co-existance
 
  - Mobile:
    - rmnet: support TX aggregation.
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmP1VIYACgkQMUZtbf5S
 IrvsChAApz0rNL/sPKxXTEfxZ1tN7D3sYxYKQPomxvl5BV+MvicrLddJy3KmzEFK
 nnJNO3nuRNuH422JQ/ylZ4mGX1opa6+5QJb0UINImXUI7Fm8HHBIuPGkv7d5CheZ
 7JexFqjPJXUy9nPyh1Rra+IA9AcRd2U7jeGEZR38wb99bHJQj5Bzdk20WArEB0el
 n44aqg49LXH71bSeXRz77x5SjkwVtYiccQxLcnmTbjLU2xVraLvI2J+wAhHnVXWW
 9lrU1+V4Ex2Xcd1xR0L0cHeK+meP1TrPRAeF+JDpVI3a/zJiE7cZjfHdG/jH5xWl
 leZJqghVozrZQNtewWWO7XhUFhMDgFu3W/1vNLjSHPZEqaz1JpM67J1+ql6s63l4
 LMWoXbcYZz+SL9ZRCoPkbGue/5fKSHv8/Jl9Sh58+eTS+c/zgN8uFGRNFXLX1+EP
 n8uvt985PxMd6x1+dHumhOUzxnY4Sfi1vjitSunTsNFQ3Cmp4SO0IfBVJWfLUCuC
 xz5hbJGJJbSpvUsO+HWyCg83E5OWghRE/Onpt2jsQSZCrO9HDg4FRTEf3WAMgaqc
 edb5KfbRZPTJQM08gWdluXzSk1nw3FNP2tXW4XlgUrEbjb+fOk0V9dQg2gyYTxQ1
 Nhvn8ZQPi6/GMMELHAIPGmmW1allyOGiAzGlQsv8EmL+OFM6WDI=
 =xXhC
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - Add dedicated kmem_cache for typical/small skb->head, avoid having
     to access struct page at kfree time, and improve memory use.

   - Introduce sysctl to set default RPS configuration for new netdevs.

   - Define Netlink protocol specification format which can be used to
     describe messages used by each family and auto-generate parsers.
     Add tools for generating kernel data structures and uAPI headers.

   - Expose all net/core sysctls inside netns.

   - Remove 4s sleep in netpoll if carrier is instantly detected on
     boot.

   - Add configurable limit of MDB entries per port, and port-vlan.

   - Continue populating drop reasons throughout the stack.

   - Retire a handful of legacy Qdiscs and classifiers.

  Protocols:

   - Support IPv4 big TCP (TSO frames larger than 64kB).

   - Add IP_LOCAL_PORT_RANGE socket option, to control local port range
     on socket by socket basis.

   - Track and report in procfs number of MPTCP sockets used.

   - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
     manager.

   - IPv6: don't check net.ipv6.route.max_size and rely on garbage
     collection to free memory (similarly to IPv4).

   - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).

   - ICMP: add per-rate limit counters.

   - Add support for user scanning requests in ieee802154.

   - Remove static WEP support.

   - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
     reporting.

   - WiFi 7 EHT channel puncturing support (client & AP).

  BPF:

   - Add a rbtree data structure following the "next-gen data structure"
     precedent set by recently added linked list, that is, by using
     kfunc + kptr instead of adding a new BPF map type.

   - Expose XDP hints via kfuncs with initial support for RX hash and
     timestamp metadata.

   - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
     better support decap on GRE tunnel devices not operating in collect
     metadata.

   - Improve x86 JIT's codegen for PROBE_MEM runtime error checks.

   - Remove the need for trace_printk_lock for bpf_trace_printk and
     bpf_trace_vprintk helpers.

   - Extend libbpf's bpf_tracing.h support for tracing arguments of
     kprobes/uprobes and syscall as a special case.

   - Significantly reduce the search time for module symbols by
     livepatch and BPF.

   - Enable cpumasks to be used as kptrs, which is useful for tracing
     programs tracking which tasks end up running on which CPUs in
     different time intervals.

   - Add support for BPF trampoline on s390x and riscv64.

   - Add capability to export the XDP features supported by the NIC.

   - Add __bpf_kfunc tag for marking kernel functions as kfuncs.

   - Add cgroup.memory=nobpf kernel parameter option to disable BPF
     memory accounting for container environments.

  Netfilter:

   - Remove the CLUSTERIP target. It has been marked as obsolete for
     years, and we still have WARN splats wrt races of the out-of-band
     /proc interface installed by this target.

   - Add 'destroy' commands to nf_tables. They are identical to the
     existing 'delete' commands, but do not return an error if the
     referenced object (set, chain, rule...) did not exist.

  Driver API:

   - Improve cpumask_local_spread() locality to help NICs set the right
     IRQ affinity on AMD platforms.

   - Separate C22 and C45 MDIO bus transactions more clearly.

   - Introduce new DCB table to control DSCP rewrite on egress.

   - Support configuration of Physical Layer Collision Avoidance (PLCA)
     Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
     shared medium Ethernet.

   - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
     preemption of low priority frames by high priority frames.

   - Add support for controlling MACSec offload using netlink SET.

   - Rework devlink instance refcounts to allow registration and
     de-registration under the instance lock. Split the code into
     multiple files, drop some of the unnecessarily granular locks and
     factor out common parts of netlink operation handling.

   - Add TX frame aggregation parameters (for USB drivers).

   - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
     messages with notifications for debug.

   - Allow offloading of UDP NEW connections via act_ct.

   - Add support for per action HW stats in TC.

   - Support hardware miss to TC action (continue processing in SW from
     a specific point in the action chain).

   - Warn if old Wireless Extension user space interface is used with
     modern cfg80211/mac80211 drivers. Do not support Wireless
     Extensions for Wi-Fi 7 devices at all. Everyone should switch to
     using nl80211 interface instead.

   - Improve the CAN bit timing configuration. Use extack to return
     error messages directly to user space, update the SJW handling,
     including the definition of a new default value that will benefit
     CAN-FD controllers, by increasing their oscillator tolerance.

  New hardware / drivers:

   - Ethernet:
      - nVidia BlueField-3 support (control traffic driver)
      - Ethernet support for imx93 SoCs
      - Motorcomm yt8531 gigabit Ethernet PHY
      - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
      - Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
      - Amlogic gxl MDIO mux

   - WiFi:
      - RealTek RTL8188EU (rtl8xxxu)
      - Qualcomm Wi-Fi 7 devices (ath12k)

   - CAN:
      - Renesas R-Car V4H

  Drivers:

   - Bluetooth:
      - Set Per Platform Antenna Gain (PPAG) for Intel controllers.

   - Ethernet NICs:
      - Intel (1G, igc):
         - support TSN / Qbv / packet scheduling features of i226 model
      - Intel (100G, ice):
         - use GNSS subsystem instead of TTY
         - multi-buffer XDP support
         - extend support for GPIO pins to E823 devices
      - nVidia/Mellanox:
         - update the shared buffer configuration on PFC commands
         - implement PTP adjphase function for HW offset control
         - TC support for Geneve and GRE with VF tunnel offload
         - more efficient crypto key management method
         - multi-port eswitch support
      - Netronome/Corigine:
         - add DCB IEEE support
         - support IPsec offloading for NFP3800
      - Freescale/NXP (enetc):
         - support XDP_REDIRECT for XDP non-linear buffers
         - improve reconfig, avoid link flap and waiting for idle
         - support MAC Merge layer
      - Other NICs:
         - sfc/ef100: add basic devlink support for ef100
         - ionic: rx_push mode operation (writing descriptors via MMIO)
         - bnxt: use the auxiliary bus abstraction for RDMA
         - r8169: disable ASPM and reset bus in case of tx timeout
         - cpsw: support QSGMII mode for J721e CPSW9G
         - cpts: support pulse-per-second output
         - ngbe: add an mdio bus driver
         - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
         - r8152: handle devices with FW with NCM support
         - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
         - virtio-net: support multi buffer XDP
         - virtio/vsock: replace virtio_vsock_pkt with sk_buff
         - tsnep: XDP support

   - Ethernet high-speed switches:
      - nVidia/Mellanox (mlxsw):
         - add support for latency TLV (in FW control messages)
      - Microchip (sparx5):
         - separate explicit and implicit traffic forwarding rules, make
           the implicit rules always active
         - add support for egress DSCP rewrite
         - IS0 VCAP support (Ingress Classification)
         - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
           etc.)
         - ES2 VCAP support (Egress Access Control)
         - support for Per-Stream Filtering and Policing (802.1Q,
           8.6.5.1)

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - add MAB (port auth) offload support
         - enable PTP receive for mv88e6390
      - NXP (ocelot):
         - support MAC Merge layer
         - support for the the vsc7512 internal copper phys
      - Microchip:
         - lan9303: convert to PHYLINK
         - lan966x: support TC flower filter statistics
         - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
         - lan937x: support Credit Based Shaper configuration
         - ksz9477: support Energy Efficient Ethernet
      - other:
         - qca8k: convert to regmap read/write API, use bulk operations
         - rswitch: Improve TX timestamp accuracy

   - Intel WiFi (iwlwifi):
      - EHT (Wi-Fi 7) rate reporting
      - STEP equalizer support: transfer some STEP (connection to radio
        on platforms with integrated wifi) related parameters from the
        BIOS to the firmware.

   - Qualcomm 802.11ax WiFi (ath11k):
      - IPQ5018 support
      - Fine Timing Measurement (FTM) responder role support
      - channel 177 support

   - MediaTek WiFi (mt76):
      - per-PHY LED support
      - mt7996: EHT (Wi-Fi 7) support
      - Wireless Ethernet Dispatch (WED) reset support
      - switch to using page pool allocator

   - RealTek WiFi (rtw89):
      - support new version of Bluetooth co-existance

   - Mobile:
      - rmnet: support TX aggregation"

* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
  page_pool: add a comment explaining the fragment counter usage
  net: ethtool: fix __ethtool_dev_mm_supported() implementation
  ethtool: pse-pd: Fix double word in comments
  xsk: add linux/vmalloc.h to xsk.c
  sefltests: netdevsim: wait for devlink instance after netns removal
  selftest: fib_tests: Always cleanup before exit
  net/mlx5e: Align IPsec ASO result memory to be as required by hardware
  net/mlx5e: TC, Set CT miss to the specific ct action instance
  net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
  net/mlx5: Refactor tc miss handling to a single function
  net/mlx5: Kconfig: Make tc offload depend on tc skb extension
  net/sched: flower: Support hardware miss to tc action
  net/sched: flower: Move filter handle initialization earlier
  net/sched: cls_api: Support hardware miss to tc action
  net/sched: Rename user cookie and act cookie
  sfc: fix builds without CONFIG_RTC_LIB
  sfc: clean up some inconsistent indentings
  net/mlx4_en: Introduce flexible array to silence overflow warning
  net: lan966x: Fix possible deadlock inside PTP
  net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
  ...
2023-02-21 18:24:12 -08:00
Linus Torvalds
8cc01d43f8 RCU pull request for v6.3
This pull request contains the following branches:
 
 doc.2023.01.05a: Documentation updates.
 
 fixes.2023.01.23a: Miscellaneous fixes, perhaps most notably:
 
 o	Throttling callback invocation based on the number of callbacks
 	that are now ready to invoke instead of on the total number
 	of callbacks.
 
 o	Several patches that suppress false-positive boot-time
 	diagnostics, for example, due to lockdep not yet being
 	initialized.
 
 o	Make expedited RCU CPU stall warnings dump stacks of any tasks
 	that are blocking the stalled grace period.  (Normal RCU CPU
 	stall warnings have doen this for mnay years.)
 
 o	Lazy-callback fixes to avoid delays during boot, suspend, and
 	resume.  (Note that lazy callbacks must be explicitly enabled,
 	so this should not (yet) affect production use cases.)
 
 kvfree.2023.01.03a: Cause kfree_rcu() and friends to take advantage of
 	polled grace periods, thus reducing memory footprint by almost
 	two orders of magnitude, admittedly on a microbenchmark.
 	This series also begins the transition from kfree_rcu(p) to
 	kfree_rcu_mightsleep(p).  This transition was motivated by bugs
 	where kfree_rcu(p), which can block, was typed instead of the
 	intended kfree_rcu(p, rh).
 
 srcu.2023.01.03a: SRCU updates, perhaps most notably fixing a bug that
 	causes SRCU to fail when booted on a system with a non-zero boot
 	CPU.  This surprising situation actually happens for kdump kernels
 	on the powerpc architecture.  It also adds an srcu_down_read()
 	and srcu_up_read(), which act like srcu_read_lock() and
 	srcu_read_unlock(), but allow an SRCU read-side critical section
 	to be handed off from one task to another.
 
 srcu-always.2023.02.02a: Cleans up the now-useless SRCU Kconfig option.
 	There are a few more commits that are not yet acked or pulled
 	into maintainer trees, and these will be in a pull request for
 	a later merge window.
 
 tasks.2023.01.03a: RCU-tasks updates, perhaps most notably these fixes:
 
 o	A strange interaction between PID-namespace unshare and the
 	RCU-tasks grace period that results in a low-probability but
 	very real hang.
 
 o	A race between an RCU tasks rude grace period on a single-CPU
 	system and CPU-hotplug addition of the second CPU that can result
 	in a too-short grace period.
 
 o	A race between shrinking RCU tasks down to a single callback list
 	and queuing a new callback to some other CPU, but where that
 	queuing is delayed for more than an RCU grace period.  This can
 	result in that callback being stranded on the non-boot CPU.
 
 torture.2023.01.05a: Torture-test updates and fixes.
 
 torturescript.2023.01.03a: Torture-test scripting updates and fixes.
 
 stall.2023.01.09a: Provide additional RCU CPU stall-warning information
 	in kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y, and
 	restore the full five-minute timeout limit for expedited RCU
 	CPU stall warnings.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmPq29UTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jAhVEACEAKJY1VJ9IUqz7CwzAYkzgRJfiygh
 oDUXmlqtm6ew9pr2GdLUVCVsUSldzBc0K7Djb/G1niv4JPs+v7YwupIV33+UbStU
 Qxt6ztTdxc4lKospLm1+2vF9ZdzVEmiP4wVCc4iDarv5FM3FpWSTNc8+L7qmlC+X
 myjv+GqMTxkXZBvYJOgJGFjDwN8noTd7Fr3mCCVLFm3PXMDa7tcwD6HRP5AqD2N8
 qC5M6LEqepKVGmz0mYMLlSN1GPaqIsEcexIFEazRsPEivPh/iafyQCQ/cqxwhXmV
 vEt7u+dXGZT/oiDq9cJ+/XRDS2RyKIS6dUE14TiiHolDCn1ONESahfA/gXWKykC2
 BaGPfjWXrWv/hwbeZ+8xEdkAvTIV92tGpXir9Fby1Z5PjP3balvrnn6hs5AnQBJb
 NdhRPLzy/dCnEF+CweAYYm1qvTo8cd5nyiNwBZHn7rEAIu3Axrecag1rhFl3AJ07
 cpVMQXZtkQVa2X8aIRTUC+ijX6yIqNaHlu0HqNXgIUTDzL4nv5cMjOMzpNQP9/dZ
 FwAMZYNiOk9IlMiKJ8ZiVcxeiA8ouIBlkYM3k6vGrmiONZ7a/EV/mSHoJqI8bvqr
 AxUIJ2Ayhg3bxPboL5oKgCiLql0A7ZVvz6quX6McitWGMgaSvel1fDzT3TnZd41e
 4AFBFd/+VedUGg==
 =bBYK
 -----END PGP SIGNATURE-----

Merge tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU updates from Paul McKenney:

 - Documentation updates

 - Miscellaneous fixes, perhaps most notably:

      - Throttling callback invocation based on the number of callbacks
        that are now ready to invoke instead of on the total number of
        callbacks

      - Several patches that suppress false-positive boot-time
        diagnostics, for example, due to lockdep not yet being
        initialized

      - Make expedited RCU CPU stall warnings dump stacks of any tasks
        that are blocking the stalled grace period. (Normal RCU CPU
        stall warnings have done this for many years)

      - Lazy-callback fixes to avoid delays during boot, suspend, and
        resume. (Note that lazy callbacks must be explicitly enabled, so
        this should not (yet) affect production use cases)

 - Make kfree_rcu() and friends take advantage of polled grace periods,
   thus reducing memory footprint by almost two orders of magnitude,
   admittedly on a microbenchmark

   This also begins the transition from kfree_rcu(p) to
   kfree_rcu_mightsleep(p). This transition was motivated by bugs where
   kfree_rcu(p), which can block, was typed instead of the intended
   kfree_rcu(p, rh)

 - SRCU updates, perhaps most notably fixing a bug that causes SRCU to
   fail when booted on a system with a non-zero boot CPU. This
   surprising situation actually happens for kdump kernels on the
   powerpc architecture

   This also adds an srcu_down_read() and srcu_up_read(), which act like
   srcu_read_lock() and srcu_read_unlock(), but allow an SRCU read-side
   critical section to be handed off from one task to another

 - Clean up the now-useless SRCU Kconfig option

   There are a few more commits that are not yet acked or pulled into
   maintainer trees, and these will be in a pull request for a later
   merge window

 - RCU-tasks updates, perhaps most notably these fixes:

      - A strange interaction between PID-namespace unshare and the
        RCU-tasks grace period that results in a low-probability but
        very real hang

      - A race between an RCU tasks rude grace period on a single-CPU
        system and CPU-hotplug addition of the second CPU that can
        result in a too-short grace period

      - A race between shrinking RCU tasks down to a single callback
        list and queuing a new callback to some other CPU, but where
        that queuing is delayed for more than an RCU grace period. This
        can result in that callback being stranded on the non-boot CPU

 - Torture-test updates and fixes

 - Torture-test scripting updates and fixes

 - Provide additional RCU CPU stall-warning information in kernels built
   with CONFIG_RCU_CPU_STALL_CPUTIME=y, and restore the full five-minute
   timeout limit for expedited RCU CPU stall warnings

* tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (80 commits)
  rcu/kvfree: Add kvfree_rcu_mightsleep() and kfree_rcu_mightsleep()
  kernel/notifier: Remove CONFIG_SRCU
  init: Remove "select SRCU"
  fs/quota: Remove "select SRCU"
  fs/notify: Remove "select SRCU"
  fs/btrfs: Remove "select SRCU"
  fs: Remove CONFIG_SRCU
  drivers/pci/controller: Remove "select SRCU"
  drivers/net: Remove "select SRCU"
  drivers/md: Remove "select SRCU"
  drivers/hwtracing/stm: Remove "select SRCU"
  drivers/dax: Remove "select SRCU"
  drivers/base: Remove CONFIG_SRCU
  rcu: Disable laziness if lazy-tracking says so
  rcu: Track laziness during boot and suspend
  rcu: Remove redundant call to rcu_boost_kthread_setaffinity()
  rcu: Allow up to five minutes expedited RCU CPU stall-warning timeouts
  rcu: Align the output of RCU CPU stall warning messages
  rcu: Add RCU stall diagnosis information
  sched: Add helper nr_context_switches_cpu()
  ...
2023-02-21 10:45:51 -08:00
Linus Torvalds
1f2d9ffc7a Scheduler updates in this cycle are:
- Improve the scalability of the CFS bandwidth unthrottling logic
    with large number of CPUs.
 
  - Fix & rework various cpuidle routines, simplify interaction with
    the generic scheduler code. Add __cpuidle methods as noinstr to
    objtool's noinstr detection and fix boatloads of cpuidle bugs & quirks.
 
  - Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS,
    to query previously issued registrations.
 
  - Limit scheduler slice duration to the sysctl_sched_latency period,
    to improve scheduling granularity with a large number of SCHED_IDLE
    tasks.
 
  - Debuggability enhancement on sys_exit(): warn about disabled IRQs,
    but also enable them to prevent a cascade of followup problems and
    repeat warnings.
 
  - Fix the rescheduling logic in prio_changed_dl().
 
  - Micro-optimize cpufreq and sched-util methods.
 
  - Micro-optimize ttwu_runnable()
 
  - Micro-optimize the idle-scanning in update_numa_stats(),
    select_idle_capacity() and steal_cookie_task().
 
  - Update the RSEQ code & self-tests
 
  - Constify various scheduler methods
 
  - Remove unused methods
 
  - Refine __init tags
 
  - Documentation updates
 
  - ... Misc other cleanups, fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzbJwRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iIvA//ZcEaB8Z6ChLRQjM+bsaudKJu3pdLQbPK
 iYbP8Da+LsAfxbEfYuGV3m+jIp0LlBOtsI/EezxQrXV+V7FvNyAX9Y00eEu/zlj8
 7Jn3LMy/DBYTwH7LwVdcU0MyIVI8ZPc6WNnkx0LOtGZn8n+qfHPSDzcP3CW+a5AV
 UvllPYpYyEmsX0Eby7CF4Ue8mSmbViw/xR3rNr8ZSve0c25XzKabw8O9kE3jiHxP
 d/zERJoAYeDyYUEuZqhfn5dTlB4an4IjNEkAfRE5SQ09RA8Gkxsa5Ar8gob9e9M1
 eQsdd4/bdhnrkM8L5qDZczqmgCTZ2bukQrxkBXhRDhLgoFxwAn77b+2ZjmIW3Lae
 AyGqRcDSg1q2oxaYm5ZiuO/t26aDOZu9vPHyHRDGt95EGbZlrp+GgeePyfCigJYz
 UmPdZAAcHdSymnnnlcvdG37WVvaVkpgWZzd8LbtBi23QR+Zc4WQ2IlgnUS5WKNNf
 VOBcAcP6E1IslDotZDQCc2dPFFQoQQEssVooyUc5oMytm7BsvxXLOeHG+Ncu/8uc
 H+U8Qn8jnqTxJbC5hkWQIJlhVKCq2FJrHxxySYTKROfUNcDgCmxboFeAcXTCIU1K
 T0S+sdoTS/CvtLklRkG0j6B8N4N98mOd9cFwUV3tX+/gMLMep3hCQs5L76JagvC5
 skkQXoONNaM=
 =l1nN
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Improve the scalability of the CFS bandwidth unthrottling logic with
   large number of CPUs.

 - Fix & rework various cpuidle routines, simplify interaction with the
   generic scheduler code. Add __cpuidle methods as noinstr to objtool's
   noinstr detection and fix boatloads of cpuidle bugs & quirks.

 - Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS, to query
   previously issued registrations.

 - Limit scheduler slice duration to the sysctl_sched_latency period, to
   improve scheduling granularity with a large number of SCHED_IDLE
   tasks.

 - Debuggability enhancement on sys_exit(): warn about disabled IRQs,
   but also enable them to prevent a cascade of followup problems and
   repeat warnings.

 - Fix the rescheduling logic in prio_changed_dl().

 - Micro-optimize cpufreq and sched-util methods.

 - Micro-optimize ttwu_runnable()

 - Micro-optimize the idle-scanning in update_numa_stats(),
   select_idle_capacity() and steal_cookie_task().

 - Update the RSEQ code & self-tests

 - Constify various scheduler methods

 - Remove unused methods

 - Refine __init tags

 - Documentation updates

 - Misc other cleanups, fixes

* tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
  sched/rt: pick_next_rt_entity(): check list_entry
  sched/deadline: Add more reschedule cases to prio_changed_dl()
  sched/fair: sanitize vruntime of entity being placed
  sched/fair: Remove capacity inversion detection
  sched/fair: unlink misfit task from cpu overutilized
  objtool: mem*() are not uaccess safe
  cpuidle: Fix poll_idle() noinstr annotation
  sched/clock: Make local_clock() noinstr
  sched/clock/x86: Mark sched_clock() noinstr
  x86/pvclock: Improve atomic update of last_value in pvclock_clocksource_read()
  x86/atomics: Always inline arch_atomic64*()
  cpuidle: tracing, preempt: Squash _rcuidle tracing
  cpuidle: tracing: Warn about !rcu_is_watching()
  cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG
  cpuidle: drivers: firmware: psci: Dont instrument suspend code
  KVM: selftests: Fix build of rseq test
  exit: Detect and fix irq disabled state in oops
  cpuidle, arm64: Fix the ARM64 cpuidle logic
  cpuidle: mvebu: Fix duplicate flags assignment
  sched/fair: Limit sched slice duration
  ...
2023-02-20 17:41:08 -08:00
Yury Norov
01bb11ad82 sched/topology: fix KASAN warning in hop_cmp()
Despite that prev_hop is used conditionally on cur_hop
is not the first hop, it's initialized unconditionally.

Because initialization implies dereferencing, it might happen
that the code dereferences uninitialized memory, which has been
spotted by KASAN. Fix it by reorganizing hop_cmp() logic.

Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Fixes: cd7f55359c ("sched: add sched_numa_find_nth_cpu()")
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Link: https://lore.kernel.org/r/Y+7avK6V9SyAWsXi@yury-laptop/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-20 11:45:33 -08:00
Munehisa Kamata
c2dbe32d5d sched/psi: Fix use-after-free in ep_remove_wait_queue()
If a non-root cgroup gets removed when there is a thread that registered
trigger and is polling on a pressure file within the cgroup, the polling
waitqueue gets freed in the following path:

 do_rmdir
   cgroup_rmdir
     kernfs_drain_open_files
       cgroup_file_release
         cgroup_pressure_release
           psi_trigger_destroy

However, the polling thread still has a reference to the pressure file and
will access the freed waitqueue when the file is closed or upon exit:

 fput
   ep_eventpoll_release
     ep_free
       ep_remove_wait_queue
         remove_wait_queue

This results in use-after-free as pasted below.

The fundamental problem here is that cgroup_file_release() (and
consequently waitqueue's lifetime) is not tied to the file's real lifetime.
Using wake_up_pollfree() here might be less than ideal, but it is in line
with the comment at commit 42288cb44c ("wait: add wake_up_pollfree()")
since the waitqueue's lifetime is not tied to file's one and can be
considered as another special case. While this would be fixable by somehow
making cgroup_file_release() be tied to the fput(), it would require
sizable refactoring at cgroups or higher layer which might be more
justifiable if we identify more cases like this.

  BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
  Write of size 4 at addr ffff88810e625328 by task a.out/4404

	CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
	Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
	Call Trace:
	<TASK>
	dump_stack_lvl+0x73/0xa0
	print_report+0x16c/0x4e0
	kasan_report+0xc3/0xf0
	kasan_check_range+0x2d2/0x310
	_raw_spin_lock_irqsave+0x60/0xc0
	remove_wait_queue+0x1a/0xa0
	ep_free+0x12c/0x170
	ep_eventpoll_release+0x26/0x30
	__fput+0x202/0x400
	task_work_run+0x11d/0x170
	do_exit+0x495/0x1130
	do_group_exit+0x100/0x100
	get_signal+0xd67/0xde0
	arch_do_signal_or_restart+0x2a/0x2b0
	exit_to_user_mode_prepare+0x94/0x100
	syscall_exit_to_user_mode+0x20/0x40
	do_syscall_64+0x52/0x90
	entry_SYSCALL_64_after_hwframe+0x63/0xcd
	</TASK>

 Allocated by task 4404:

	kasan_set_track+0x3d/0x60
	__kasan_kmalloc+0x85/0x90
	psi_trigger_create+0x113/0x3e0
	pressure_write+0x146/0x2e0
	cgroup_file_write+0x11c/0x250
	kernfs_fop_write_iter+0x186/0x220
	vfs_write+0x3d8/0x5c0
	ksys_write+0x90/0x110
	do_syscall_64+0x43/0x90
	entry_SYSCALL_64_after_hwframe+0x63/0xcd

 Freed by task 4407:

	kasan_set_track+0x3d/0x60
	kasan_save_free_info+0x27/0x40
	____kasan_slab_free+0x11d/0x170
	slab_free_freelist_hook+0x87/0x150
	__kmem_cache_free+0xcb/0x180
	psi_trigger_destroy+0x2e8/0x310
	cgroup_file_release+0x4f/0xb0
	kernfs_drain_open_files+0x165/0x1f0
	kernfs_drain+0x162/0x1a0
	__kernfs_remove+0x1fb/0x310
	kernfs_remove_by_name_ns+0x95/0xe0
	cgroup_addrm_files+0x67f/0x700
	cgroup_destroy_locked+0x283/0x3c0
	cgroup_rmdir+0x29/0x100
	kernfs_iop_rmdir+0xd1/0x140
	vfs_rmdir+0xfe/0x240
	do_rmdir+0x13d/0x280
	__x64_sys_rmdir+0x2c/0x30
	do_syscall_64+0x43/0x90
	entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: 0e94682b73 ("psi: introduce psi monitor")
Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
Signed-off-by: Mengchi Cheng <mengcc@amazon.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/
Link: https://lore.kernel.org/r/20230214212705.4058045-1-kamatam@amazon.com
2023-02-15 14:19:16 +01:00
Waiman Long
df14b7f9ef sched/core: Fix a missed update of user_cpus_ptr
Since commit 8f9ea86fdf ("sched: Always preserve the user requested
cpumask"), a successful call to sched_setaffinity() should always save
the user requested cpu affinity mask in a task's user_cpus_ptr. However,
when the given cpu mask is the same as the current one, user_cpus_ptr
is not updated. Fix this by saving the user mask in this case too.

Fixes: 8f9ea86fdf ("sched: Always preserve the user requested cpumask")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230203181849.221943-1-longman@redhat.com
2023-02-13 16:36:14 +01:00
Pietro Borrello
7c4a5b89a0 sched/rt: pick_next_rt_entity(): check list_entry
Commit 326587b840 ("sched: fix goto retry in pick_next_task_rt()")
removed any path which could make pick_next_rt_entity() return NULL.
However, BUG_ON(!rt_se) in _pick_next_task_rt() (the only caller of
pick_next_rt_entity()) still checks the error condition, which can
never happen, since list_entry() never returns NULL.
Remove the BUG_ON check, and instead emit a warning in the only
possible error condition here: the queue being empty which should
never happen.

Fixes: 326587b840 ("sched: fix goto retry in pick_next_task_rt()")
Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20230128-list-entry-null-check-sched-v3-1-b1a71bd1ac6b@diag.uniroma1.it
2023-02-11 11:18:10 +01:00
Valentin Schneider
7ea98dfa44 sched/deadline: Add more reschedule cases to prio_changed_dl()
I've been tracking down an issue on a ~5.17ish kernel where:

  CPUx                           CPUy

  <DL task p0 owns an rtmutex M>
  <p0 depletes its runtime, gets throttled>
  <rq switches to the idle task>
				 <DL task p1 blocks on M, boost/replenish p0>
				 <No call to resched_curr() happens here>

  [idle task keeps running here until *something*
   accidentally sets TIF_NEED_RESCHED]

On that kernel, it is quite easy to trigger using rt-tests's deadline_test
[1] with the test running on isolated CPUs (this reduces the chance of
something unrelated setting TIF_NEED_RESCHED on the idle tasks, making the
issue even more obvious as the hung task detector chimes in).

I haven't been able to reproduce this using a mainline kernel, even if I
revert

  2972e3050e ("tracing: Make trace_marker{,_raw} stream-like")

which gets rid of the lock involved in the above test, *but* I cannot
convince myself the issue isn't there from looking at the code.

Make prio_changed_dl() issue a reschedule if the current task isn't a
deadline one. While at it, ensure a reschedule is emitted when a
queued-but-not-current task gets boosted with an earlier deadline that
current's.

[1]: https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20230206140612.701871-1-vschneid@redhat.com
2023-02-11 11:18:10 +01:00
Zhang Qiao
829c1651e9 sched/fair: sanitize vruntime of entity being placed
When a scheduling entity is placed onto cfs_rq, its vruntime is pulled
to the base level (around cfs_rq->min_vruntime), so that the entity
doesn't gain extra boost when placed backwards.

However, if the entity being placed wasn't executed for a long time, its
vruntime may get too far behind (e.g. while cfs_rq was executing a
low-weight hog), which can inverse the vruntime comparison due to s64
overflow.  This results in the entity being placed with its original
vruntime way forwards, so that it will effectively never get to the cpu.

To prevent that, ignore the vruntime of the entity being placed if it
didn't execute for much longer than the characteristic sheduler time
scale.

[rkagan: formatted, adjusted commit log, comments, cutoff value]
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Co-developed-by: Roman Kagan <rkagan@amazon.de>
Signed-off-by: Roman Kagan <rkagan@amazon.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230130122216.3555094-1-rkagan@amazon.de
2023-02-11 11:18:09 +01:00
Vincent Guittot
a2e90611b9 sched/fair: Remove capacity inversion detection
Remove the capacity inversion detection which is now handled by
util_fits_cpu() returning -1 when we need to continue to look for a
potential CPU with better performance.

This ends up almost reverting patches below except for some comments:
commit da07d2f9c1 ("sched/fair: Fixes for capacity inversion detection")
commit aa69c36f31 ("sched/fair: Consider capacity inversion in util_fits_cpu()")
commit 44c7b80bff ("sched/fair: Detect capacity inversion")

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230201143628.270912-3-vincent.guittot@linaro.org
2023-02-11 11:18:09 +01:00
Vincent Guittot
e5ed0550c0 sched/fair: unlink misfit task from cpu overutilized
By taking into account uclamp_min, the 1:1 relation between task misfit
and cpu overutilized is no more true as a task with a small util_avg may
not fit a high capacity cpu because of uclamp_min constraint.

Add a new state in util_fits_cpu() to reflect the case that task would fit
a CPU except for the uclamp_min hint which is a performance requirement.

Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we
can use this new value to take additional action to select the best CPU
that doesn't match uclamp_min hint.

When util_fits_cpu() returns -1, we will continue to look for a possible
CPU with better performance, which replaces Capacity Inversion detection
with capacity_orig_of() - thermal_load_avg to detect a capacity inversion.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-and-tested-by: Qais Yousef <qyousef@layalina.io>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
Link: https://lore.kernel.org/r/20230201143628.270912-2-vincent.guittot@linaro.org
2023-02-11 11:18:09 +01:00
Liam R. Howlett
214dbc4281 sched: convert to vma iterator
Use the vma iterator so that the iterator can be invalidated or updated to
avoid each caller doing so.

Link: https://lkml.kernel.org/r/20230120162650.984577-23-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-09 16:51:34 -08:00
Valentin Schneider
9feae65845 sched/topology: Introduce sched_numa_hop_mask()
Tariq has pointed out that drivers allocating IRQ vectors would benefit
from having smarter NUMA-awareness - cpumask_local_spread() only knows
about the local node and everything outside is in the same bucket.

sched_domains_numa_masks is pretty much what we want to hand out (a cpumask
of CPUs reachable within a given distance budget), introduce
sched_numa_hop_mask() to export those cpumasks.

Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-07 18:20:00 -08:00
Yury Norov
cd7f55359c sched: add sched_numa_find_nth_cpu()
The function finds Nth set CPU in a given cpumask starting from a given
node.

Leveraging the fact that each hop in sched_domains_numa_masks includes the
same or greater number of CPUs than the previous one, we can use binary
search on hops instead of linear walk, which makes the overall complexity
of O(log n) in terms of number of cpumask_weight() calls.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-07 18:20:00 -08:00
Peter Zijlstra
776f22913b sched/clock: Make local_clock() noinstr
With sched_clock() noinstr, provide a noinstr implementation of
local_clock().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230126151323.760767043@infradead.org
2023-01-31 15:01:47 +01:00
Ingo Molnar
57a30218fa Linux 6.2-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmPW7E8eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGf7MIAI0JnHN9WvtEukSZ
 E6j6+cEGWxsvD6q0g3GPolaKOCw7hlv0pWcFJFcUAt0jebspMdxV2oUGJ8RYW7Lg
 nCcHvEVswGKLAQtQSWw52qotW6fUfMPsNYYB5l31sm1sKH4Cgss0W7l2HxO/1LvG
 TSeNHX53vNAZ8pVnFYEWCSXC9bzrmU/VALF2EV00cdICmfvjlgkELGXoLKJJWzUp
 s63fBHYGGURSgwIWOKStoO6HNo0j/F/wcSMx8leY8qDUtVKHj4v24EvSgxUSDBER
 ch3LiSQ6qf4sw/z7pqruKFthKOrlNmcc0phjiES0xwwGiNhLv0z3rAhc4OM2cgYh
 SDc/Y/c=
 =zpaD
 -----END PGP SIGNATURE-----

Merge tag 'v6.2-rc6' into sched/core, to pick up fixes

Pick up fixes before merging another batch of cpuidle updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-01-31 15:01:20 +01:00
Waiman Long
5657c11678 sched/core: Fix NULL pointer access fault in sched_setaffinity() with non-SMP configs
The kernel commit 9a5418bc48 ("sched/core: Use kfree_rcu() in
do_set_cpus_allowed()") introduces a bug for kernels built with non-SMP
configs. Calling sched_setaffinity() on such a uniprocessor kernel will
cause cpumask_copy() to be called with a NULL pointer leading to general
protection fault. This is not really a problem in real use cases as
there aren't that many uniprocessor kernel configs in use and calling
sched_setaffinity() on such a uniprocessor system doesn't make sense.

Fix this problem by making sure cpumask_copy() will not be called in
such a case.

Fixes: 9a5418bc48 ("sched/core: Use kfree_rcu() in do_set_cpus_allowed()")
Reported-by: kernel test robot <yujie.liu@intel.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230115193122.563036-1-longman@redhat.com
2023-01-16 10:07:25 +01:00
Vincent Guittot
79ba1e607d sched/fair: Limit sched slice duration
In presence of a lot of small weight tasks like sched_idle tasks, normal
or high weight tasks can see their ideal runtime (sched_slice) to increase
to hundreds ms whereas it normally stays below sysctl_sched_latency.

2 normal tasks running on a CPU will have a max sched_slice of 12ms
(half of the sched_period). This means that they will make progress
every sysctl_sched_latency period.

If we now add 1000 idle tasks on the CPU, the sched_period becomes
3006 ms and the ideal runtime of the normal tasks becomes 609 ms.
It will even become 1500ms if the idle tasks belongs to an idle cgroup.
This means that the scheduler will look for picking another waiting task
after 609ms running time (1500ms respectively). The idle tasks change
significantly the way the 2 normal tasks interleave their running time
slot whereas they should have a small impact.

Such long sched_slice can delay significantly the release of resources
as the tasks can wait hundreds of ms before the next running slot just
because of idle tasks queued on the rq.

Cap the ideal_runtime to sysctl_sched_latency to make sure that tasks will
regularly make progress and will not be significantly impacted by
idle/background tasks queued on the rq.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20230113133613.257342-1-vincent.guittot@linaro.org
2023-01-15 09:59:00 +01:00
Peter Zijlstra
89b3098703 arch/idle: Change arch_cpu_idle() behavior: always exit with IRQs disabled
Current arch_cpu_idle() is called with IRQs disabled, but will return
with IRQs enabled.

However, the very first thing the generic code does after calling
arch_cpu_idle() is raw_local_irq_disable(). This means that
architectures that can idle with IRQs disabled end up doing a
pointless 'enable-disable' dance.

Therefore, push this IRQ disabling into the idle function, meaning
that those architectures can avoid the pointless IRQ state flipping.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Acked-by: Mark Rutland <mark.rutland@arm.com> [arm64]
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230112195540.618076436@infradead.org
2023-01-13 11:48:15 +01:00
Peter Zijlstra
a01353cf18 cpuidle: Fix ct_idle_*() usage
The whole disable-RCU, enable-IRQS dance is very intricate since
changing IRQ state is traced, which depends on RCU.

Add two helpers for the cpuidle case that mirror the entry code:

  ct_cpuidle_enter()
  ct_cpuidle_exit()

And fix all the cases where the enter/exit dance was buggy.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230112195540.130014793@infradead.org
2023-01-13 11:48:15 +01:00
Qais Yousef
da07d2f9c1 sched/fair: Fixes for capacity inversion detection
Traversing the Perf Domains requires rcu_read_lock() to be held and is
conditional on sched_energy_enabled(). Ensure right protections applied.

Also skip capacity inversion detection for our own pd; which was an
error.

Fixes: 44c7b80bff ("sched/fair: Detect capacity inversion")
Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230112122708.330667-3-qyousef@layalina.io
2023-01-13 11:40:21 +01:00
Qais Yousef
e26fd28db8 sched/uclamp: Fix a uninitialized variable warnings
Addresses the following warnings:

> config: riscv-randconfig-m031-20221111
> compiler: riscv64-linux-gcc (GCC) 12.1.0
>
> smatch warnings:
> kernel/sched/fair.c:7263 find_energy_efficient_cpu() error: uninitialized symbol 'util_min'.
> kernel/sched/fair.c:7263 find_energy_efficient_cpu() error: uninitialized symbol 'util_max'.

Fixes: 244226035a ("sched/uclamp: Fix fits_capacity() check in feec()")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230112122708.330667-2-qyousef@layalina.io
2023-01-13 11:40:21 +01:00
Waiman Long
9a5418bc48 sched/core: Use kfree_rcu() in do_set_cpus_allowed()
Commit 851a723e45 ("sched: Always clear user_cpus_ptr in
do_set_cpus_allowed()") may call kfree() if user_cpus_ptr was previously
set. Unfortunately, some of the callers of do_set_cpus_allowed()
may have pi_lock held when calling it. So the following splats may be
printed especially when running with a PREEMPT_RT kernel:

   WARNING: possible circular locking dependency detected
   BUG: sleeping function called from invalid context

To avoid these problems, kfree_rcu() is used instead. An internal
cpumask_rcuhead union is created for the sole purpose of facilitating
the use of kfree_rcu() to free the cpumask.

Since user_cpus_ptr is not being used in non-SMP configs, the newly
introduced alloc_user_cpus_ptr() helper will return NULL in this case
and sched_setaffinity() is modified to handle this special case.

Fixes: 851a723e45 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221231041120.440785-3-longman@redhat.com
2023-01-09 11:43:23 +01:00
Waiman Long
87ca4f9efb sched/core: Fix use-after-free bug in dup_user_cpus_ptr()
Since commit 07ec77a1d4 ("sched: Allow task CPU affinity to be
restricted on asymmetric systems"), the setting and clearing of
user_cpus_ptr are done under pi_lock for arm64 architecture. However,
dup_user_cpus_ptr() accesses user_cpus_ptr without any lock
protection. Since sched_setaffinity() can be invoked from another
process, the process being modified may be undergoing fork() at
the same time.  When racing with the clearing of user_cpus_ptr in
__set_cpus_allowed_ptr_locked(), it can lead to user-after-free and
possibly double-free in arm64 kernel.

Commit 8f9ea86fdf ("sched: Always preserve the user requested
cpumask") fixes this problem as user_cpus_ptr, once set, will never
be cleared in a task's lifetime. However, this bug was re-introduced
in commit 851a723e45 ("sched: Always clear user_cpus_ptr in
do_set_cpus_allowed()") which allows the clearing of user_cpus_ptr in
do_set_cpus_allowed(). This time, it will affect all arches.

Fix this bug by always clearing the user_cpus_ptr of the newly
cloned/forked task before the copying process starts and check the
user_cpus_ptr state of the source task under pi_lock.

Note to stable, this patch won't be applicable to stable releases.
Just copy the new dup_user_cpus_ptr() function over.

Fixes: 07ec77a1d4 ("sched: Allow task CPU affinity to be restricted on asymmetric systems")
Fixes: 851a723e45 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()")
Reported-by: David Wang 王标 <wangbiao3@xiaomi.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20221231041120.440785-2-longman@redhat.com
2023-01-09 11:43:07 +01:00
Yair Podemsky
7fb3ff22ad sched/core: Fix arch_scale_freq_tick() on tickless systems
In order for the scheduler to be frequency invariant we measure the
ratio between the maximum CPU frequency and the actual CPU frequency.

During long tickless periods of time the calculations that keep track
of that might overflow, in the function scale_freq_tick():

  if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
          goto error;

eventually forcing the kernel to disable the feature for all CPUs,
and show the warning message:

   "Scheduler frequency invariance went wobbly, disabling!".

Let's avoid that by limiting the frequency invariant calculations
to CPUs with regular tick.

Fixes: e2b0d619b4 ("x86, sched: check for counters overflow in frequency invariant accounting")
Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Yair Podemsky <ypodemsk@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Link: https://lore.kernel.org/r/20221130125121.34407-1-ypodemsk@redhat.com
2023-01-07 12:25:50 +01:00
Michal Clapinski
544a4f2ecd sched/membarrier: Introduce MEMBARRIER_CMD_GET_REGISTRATIONS
Provide a method to query previously issued registrations.

Signed-off-by: Michal Clapinski <mclapinski@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20221207164338.1535591-2-mclapinski@google.com
2023-01-07 11:29:29 +01:00
Lukasz Luba
948fb4c4e9 cpufreq, sched/util: Optimize operations with single CPU capacity lookup
The max CPU capacity is the same for all CPUs sharing frequency domain.
There is a way to avoid heavy operations in a loop for each CPU by
leveraging this knowledge. Thus, simplify the looping code in the
sugov_next_freq_shared() and drop heavy multiplications. Instead, use
simple max() to get the highest utilization from these CPUs.

This is useful for platforms with many (4 or 6) little CPUs. We avoid
heavy 2*PD_CPU_NUM multiplications in that loop, which is called billions
of times, since it's not limited by the schedutil time delta filter in
sugov_should_update_freq(). When there was no need to change frequency
the code bailed out, not updating the sg_policy::last_freq_update_time.
Then every visit after delta_ns time longer than the
sg_policy::freq_update_delay_ns goes through and triggers the next
frequency calculation code. Although, if the next frequency, as outcome
of that, would be the same as current frequency, we won't update the
sg_policy::last_freq_update_time and the story will be repeated (in
a very short period, sometimes a few microseconds).

The max CPU capacity must be fetched every time we are called, due to
difficulties during the policy setup, where we are not able to get the
normalized CPU capacity at the right time.

The fetched CPU capacity value is than used in sugov_iowait_apply() to
calculate the right boost. This required a few changes in the local
functions and arguments. The capacity value should hopefully be fetched
once when needed and then passed over CPU registers to those functions.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20221208160256.859-2-lukasz.luba@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
2023-01-07 11:25:38 +01:00
Chengming Zhou
160fb0d83f sched/core: Reorganize ttwu_do_wakeup() and ttwu_do_activate()
ttwu_do_activate() is used for a complete wakeup, in which we will
activate_task() and use ttwu_do_wakeup() to mark the task runnable
and perform wakeup-preemption, also call class->task_woken() callback
and update the rq->idle_stamp.

Since ttwu_runnable() is not a complete wakeup, don't need all those
done in ttwu_do_wakeup(), so we can move those to ttwu_do_activate()
to simplify ttwu_do_wakeup(), making it only mark the task runnable
to be reused in ttwu_runnable() and try_to_wake_up().

This patch should not have any functional changes.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20221223103257.4962-2-zhouchengming@bytedance.com
2023-01-07 10:48:38 +01:00
Chengming Zhou
efe0938586 sched/core: Micro-optimize ttwu_runnable()
ttwu_runnable() is used as a fast wakeup path when the wakee task
is running on CPU or runnable on RQ, in both cases we can just
set its state to TASK_RUNNING to prevent a sleep.

If the wakee task is on_cpu running, we don't need to update_rq_clock()
or check_preempt_curr().

But if the wakee task is on_rq && !on_cpu (e.g. an IRQ hit before
the task got to schedule() and the task been preempted), we should
check_preempt_curr() to see if it can preempt the current running.

This also removes the class->task_woken() callback from ttwu_runnable(),
which wasn't required per the RT/DL implementations: any required push
operation would have been queued during class->set_next_task() when p
got preempted.

ttwu_runnable() also loses the update to rq->idle_stamp, as by definition
the rq cannot be idle in this scenario.

Suggested-by: Valentin Schneider <vschneid@redhat.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20221223103257.4962-1-zhouchengming@bytedance.com
2023-01-07 10:48:38 +01:00
Zhen Lei
7c182722a0 sched: Add helper nr_context_switches_cpu()
Add a function nr_context_switches_cpu() that returns  number of context
switches since boot on the specified CPU.  This information will be used
to diagnose RCU CPU stalls.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-05 12:21:11 -08:00
Bing Huang
ef90cf2281 sched/topology: Add __init for sched_init_domains()
sched_init_domains() is only used in initialization

Signed-off-by: Bing Huang <huangbing@kylinos.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230105014943.9857-1-huangbing775@126.com
2023-01-05 11:42:13 +01:00
Mathieu Desnoyers
bbd0b03150 sched/rseq: Fix concurrency ID handling of usermodehelper kthreads
sched_mm_cid_after_execve() does not expect NULL t->mm, but it may happen
if a usermodehelper kthread fails when attempting to execute a binary.

sched_mm_cid_fork() can be issued from a usermodehelper kthread, which
has t->flags PF_KTHREAD set.

Fixes: af7f588d8f ("sched: Introduce per-memory-map concurrency ID")
Reported-by: kernel test robot <yujie.liu@intel.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/oe-lkp/202212301353.5c959d72-yujie.liu@intel.com
2023-01-02 16:34:12 +01:00
Nicholas Piggin
c89970202a cputime: remove cputime_to_nsecs fallback
The archs that use cputime_to_nsecs() internally provide their own
definition and don't need the fallback. cputime_to_usecs() unused except
in this fallback, and is not defined anywhere.

This removes the final remnant of the cputime_t code from the kernel.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20221220070705.2958959-1-npiggin@gmail.com
2022-12-27 12:52:17 +01:00
Hao Jia
8589018acc sched/core: Adjusting the order of scanning CPU
When select_idle_capacity() starts scanning for an idle CPU, it starts
with target CPU that has already been checked in select_idle_sibling().
So we start checking from the next CPU and try the target CPU at the end.
Similarly for task_numa_assign(), we have just checked numa_migrate_on
of dst_cpu, so start from the next CPU. This also works for
steal_cookie_task(), the first scan must fail and start directly
from the next one.

Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/r/20221216062406.7812-3-jiahao.os@bytedance.com
2022-12-27 12:52:17 +01:00
Hao Jia
feaed76376 sched/numa: Stop an exhastive search if an idle core is found
In update_numa_stats() we try to find an idle cpu on the NUMA node,
preferably an idle core. we can stop looking for the next idle core
or idle cpu after finding an idle core. But we can't stop the
whole loop of scanning the CPU, because we need to calculate
approximate NUMA stats at a point in time. For example,
the src and dst nr_running is needed by task_numa_find_cpu().

Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/r/20221216062406.7812-2-jiahao.os@bytedance.com
2022-12-27 12:52:16 +01:00
Matthew Wilcox (Oracle)
904cbab71d sched: Make const-safe
With a modified container_of() that preserves constness, the compiler
finds some pointers which should have been marked as const.  task_of()
also needs to become const-preserving for the !FAIR_GROUP_SCHED case so
that cfs_rq_of() can take a const argument.  No change to generated code.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221212144946.2657785-1-willy@infradead.org
2022-12-27 12:52:16 +01:00
Mathieu Desnoyers
af7f588d8f sched: Introduce per-memory-map concurrency ID
This feature allows the scheduler to expose a per-memory map concurrency
ID to user-space. This concurrency ID is within the possible cpus range,
and is temporarily (and uniquely) assigned while threads are actively
running within a memory map. If a memory map has fewer threads than
cores, or is limited to run on few cores concurrently through sched
affinity or cgroup cpusets, the concurrency IDs will be values close
to 0, thus allowing efficient use of user-space memory for per-cpu
data structures.

This feature is meant to be exposed by a new rseq thread area field.

The primary purpose of this feature is to do the heavy-lifting needed
by memory allocators to allow them to use per-cpu data structures
efficiently in the following situations:

- Single-threaded applications,
- Multi-threaded applications on large systems (many cores) with limited
  cpu affinity mask,
- Multi-threaded applications on large systems (many cores) with
  restricted cgroup cpuset per container.

One of the key concern from scheduler maintainers is the overhead
associated with additional spin locks or atomic operations in the
scheduler fast-path. This is why the following optimization is
implemented.

On context switch between threads belonging to the same memory map,
transfer the mm_cid from prev to next without any atomic ops. This
takes care of use-cases involving frequent context switch between
threads belonging to the same memory map.

Additional optimizations can be done if the spin locks added when
context switching between threads belonging to different memory maps end
up being a performance bottleneck. Those are left out of this patch
though. A performance impact would have to be clearly demonstrated to
justify the added complexity.

The credit goes to Paul Turner (Google) for the original virtual cpu id
idea. This feature is implemented based on the discussions with Paul
Turner and Peter Oskolkov (Google), but I took the liberty to implement
scheduler fast-path optimizations and my own NUMA-awareness scheme. The
rumor has it that Google have been running a rseq vcpu_id extension
internally in production for a year. The tcmalloc source code indeed has
comments hinting at a vcpu_id prototype extension to the rseq system
call [1].

The following benchmarks do not show any significant overhead added to
the scheduler context switch by this feature:

* perf bench sched messaging (process)

Baseline:                    86.5±0.3 ms
With mm_cid:                 86.7±2.6 ms

* perf bench sched messaging (threaded)

Baseline:                    84.3±3.0 ms
With mm_cid:                 84.7±2.6 ms

* hackbench (process)

Baseline:                    82.9±2.7 ms
With mm_cid:                 82.9±2.9 ms

* hackbench (threaded)

Baseline:                    85.2±2.6 ms
With mm_cid:                 84.4±2.9 ms

[1] https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/linux_syscall_support.h#L26

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221122203932.231377-8-mathieu.desnoyers@efficios.com
2022-12-27 12:52:11 +01:00
Josh Don
8ad075c2eb sched: Async unthrottling for cfs bandwidth
CFS bandwidth currently distributes new runtime and unthrottles cfs_rq's
inline in an hrtimer callback. Runtime distribution is a per-cpu
operation, and unthrottling is a per-cgroup operation, since a tg walk
is required. On machines with a large number of cpus and large cgroup
hierarchies, this cpus*cgroups work can be too much to do in a single
hrtimer callback: since IRQ are disabled, hard lockups may easily occur.
Specifically, we've found this scalability issue on configurations with
256 cpus, O(1000) cgroups in the hierarchy being throttled, and high
memory bandwidth usage.

To fix this, we can instead unthrottle cfs_rq's asynchronously via a
CSD. Each cpu is responsible for unthrottling itself, thus sharding the
total work more fairly across the system, and avoiding hard lockups.

Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221117005418.3499691-1-joshdon@google.com
2022-12-27 12:52:09 +01:00
Bing Huang
9a5322db46 sched/topology: Add __init for init_defrootdomain
init_defrootdomain is only used in initialization

Signed-off-by: Bing Huang <huangbing@kylinos.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20221118034208.267330-1-huangbing775@126.com
2022-12-27 12:52:09 +01:00
Linus Torvalds
48ea09cdda hardening updates for v6.2-rc1
- Convert flexible array members, fix -Wstringop-overflow warnings,
   and fix KCFI function type mismatches that went ignored by
   maintainers (Gustavo A. R. Silva, Nathan Chancellor, Kees Cook).
 
 - Remove the remaining side-effect users of ksize() by converting
   dma-buf, btrfs, and coredump to using kmalloc_size_roundup(),
   add more __alloc_size attributes, and introduce full testing
   of all allocator functions. Finally remove the ksize() side-effect
   so that each allocation-aware checker can finally behave without
   exceptions.
 
 - Introduce oops_limit (default 10,000) and warn_limit (default off)
   to provide greater granularity of control for panic_on_oops and
   panic_on_warn (Jann Horn, Kees Cook).
 
 - Introduce overflows_type() and castable_to_type() helpers for
   cleaner overflow checking.
 
 - Improve code generation for strscpy() and update str*() kern-doc.
 
 - Convert strscpy and sigphash tests to KUnit, and expand memcpy
   tests.
 
 - Always use a non-NULL argument for prepare_kernel_cred().
 
 - Disable structleak plugin in FORTIFY KUnit test (Anders Roxell).
 
 - Adjust orphan linker section checking to respect CONFIG_WERROR
   (Xin Li).
 
 - Make sure siginfo is cleared for forced SIGKILL (haifeng.xu).
 
 - Fix um vs FORTIFY warnings for always-NULL arguments.
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmOZSOoWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJjAAD/0YkvpU7f03f8hcQMJK6wv//24K
 AW41hEaBikq9RcmkuvkLLrJRibGgZ5O2xUkUkxRs/HxhkhrZ0kEw8sbwZe8MoWls
 F4Y9+TDjsrdHmjhfcBZdLnVxwcKK5wlaEcpjZXtbsfcdhx3TbgcDA23YELl5t0K+
 I11j4kYmf9SLl4CwIrSP5iACml8CBHARDh8oIMF7FT/LrjNbM8XkvBcVVT6hTbOV
 yjgA8WP2e9GXvj9GzKgqvd0uE/kwPkVAeXLNFWopPi4FQ8AWjlxbBZR0gamA6/EB
 d7TIs0ifpVU2JGQaTav4xO6SsFMj3ntoUI0qIrFaTxZAvV4KYGrPT/Kwz1O4SFaG
 rN5lcxseQbPQSBTFNG4zFjpywTkVCgD2tZqDwz5Rrmiraz0RyIokCN+i4CD9S0Ds
 oEd8JSyLBk1sRALczkuEKo0an5AyC9YWRcBXuRdIHpLo08PsbeUUSe//4pe303cw
 0ApQxYOXnrIk26MLElTzSMImlSvlzW6/5XXzL9ME16leSHOIfDeerPnc9FU9Eb3z
 ODv22z6tJZ9H/apSUIHZbMciMbbVTZ8zgpkfydr08o87b342N/ncYHZ5cSvQ6DWb
 jS5YOIuvl46/IhMPT16qWC8p0bP5YhxoPv5l6Xr0zq0ooEj0E7keiD/SzoLvW+Qs
 AHXcibguPRQBPAdiPQ==
 =yaaN
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull kernel hardening updates from Kees Cook:

 - Convert flexible array members, fix -Wstringop-overflow warnings, and
   fix KCFI function type mismatches that went ignored by maintainers
   (Gustavo A. R. Silva, Nathan Chancellor, Kees Cook)

 - Remove the remaining side-effect users of ksize() by converting
   dma-buf, btrfs, and coredump to using kmalloc_size_roundup(), add
   more __alloc_size attributes, and introduce full testing of all
   allocator functions. Finally remove the ksize() side-effect so that
   each allocation-aware checker can finally behave without exceptions

 - Introduce oops_limit (default 10,000) and warn_limit (default off) to
   provide greater granularity of control for panic_on_oops and
   panic_on_warn (Jann Horn, Kees Cook)

 - Introduce overflows_type() and castable_to_type() helpers for cleaner
   overflow checking

 - Improve code generation for strscpy() and update str*() kern-doc

 - Convert strscpy and sigphash tests to KUnit, and expand memcpy tests

 - Always use a non-NULL argument for prepare_kernel_cred()

 - Disable structleak plugin in FORTIFY KUnit test (Anders Roxell)

 - Adjust orphan linker section checking to respect CONFIG_WERROR (Xin
   Li)

 - Make sure siginfo is cleared for forced SIGKILL (haifeng.xu)

 - Fix um vs FORTIFY warnings for always-NULL arguments

* tag 'hardening-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (31 commits)
  ksmbd: replace one-element arrays with flexible-array members
  hpet: Replace one-element array with flexible-array member
  um: virt-pci: Avoid GCC non-NULL warning
  signal: Initialize the info in ksignal
  lib: fortify_kunit: build without structleak plugin
  panic: Expose "warn_count" to sysfs
  panic: Introduce warn_limit
  panic: Consolidate open-coded panic_on_warn checks
  exit: Allow oops_limit to be disabled
  exit: Expose "oops_count" to sysfs
  exit: Put an upper limit on how often we can oops
  panic: Separate sysctl logic from CONFIG_SMP
  mm/pgtable: Fix multiple -Wstringop-overflow warnings
  mm: Make ksize() a reporting-only function
  kunit/fortify: Validate __alloc_size attribute results
  drm/sti: Fix return type of sti_{dvo,hda,hdmi}_connector_mode_valid()
  drm/fsl-dcu: Fix return type of fsl_dcu_drm_connector_mode_valid()
  driver core: Add __alloc_size hint to devm allocators
  overflow: Introduce overflows_type() and castable_to_type()
  coredump: Proactively round up to kmalloc bucket size
  ...
2022-12-14 12:20:00 -08:00
Linus Torvalds
8fa37a6835 sysctl changes for v6.2-rc1
Only step forward on the sysctl cleanups for this cycle. This
 has been on linux-next since September and this time it goes
 with a "Yeah, think so, it just moves stuff around a bit" from
 Peter Zijlstra.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmOYC3sSHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinVEYQAL6/3nRt854jULd3zRwrWDyJZd5yxbnc
 R8jJBTt3q4CKwtMqd59uQqVYLpSqOCx/GsArfsXkmY4x7KYhlaSKcC4LHmFS8Z/u
 dofyVKumIFqtXMI+hYuTyqNqfGoK9UKXUftrqYb8pK+K3h73uqYbrDgSex4G9GJo
 Au0/WeDjTzLlgqt7RPN7n0PL2jMtfWVQkr3001OCQOWW9sdrOjtprn/3bDTUnW5q
 KukKB5saU0CvUzrTn2DaweQiRCJxQfCQfy3DZfhDRHVuWFYMV9b1okaGEoVmQlQT
 I9/urfdf3aLCdBBxCQG5W6uRxZwZ2Yb93M+rijZNWNFMC6WHrMCmSiADwz9LJzIK
 iQV7LoolGe1TFTEVJbsde5xKSF6BeId0IF5mmPQuokAx3TPE9279HNgluaB/38c8
 p3P4+mP6qE12mMPyhpwDwNOzEWgUnLsGSIE5n/WPwxCiGNa7UsN2lzMDP1cJejp5
 NlRg1hRKmgt30d9+t9sHeKMcWhrjxyPGsyUMwBJTuMCHbjqizGyBsB8DzyK95OoF
 aN66pyRqwsK0+IUivd8VfLgfriE2gDrQD5VqkJ8lfWBx9pq8RMEq7zQ1eE9IbCff
 hzbfG+7k9R3o4SPfJYmCBXtp6fcq+ovjbLYSvGGCJk0zfFe6SQE21rZ3hCQPq3v5
 xKFh05xUfbRF
 =M48U
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull sysctl updates from Luis Chamberlain:
 "Only a small step forward on the sysctl cleanups for this cycle"

* tag 'sysctl-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  sched: Move numa_balancing sysctls to its own file
2022-12-13 14:16:44 -08:00
Linus Torvalds
ce8a79d560 for-6.2/block-2022-12-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOScsgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpi5ID/9pLXFYOq1+uDjU0KO/MdjMjK8Ukr34lCnk
 WkajRLheE8JBKOFDE54XJk56sQSZHX9bTWqziar0h1fioh7FlQR/tVvzsERCm2M9
 2y9THJNJygC68wgybStyiKlshFjl7TD7Kv5N9Y3xP3mkQygT+D6o8fXZk5xQbYyH
 YdFSoq4rJVHxRL03yzQiReGGIYdOUEQQh8l1FiLwLlKa3lXAey1KuxWIzksVN0KK
 aZB4QhiBpOiPgDHUVisq2XtyQjpZ2byoCImPzgrcqk9Jo4esvm/e6esrg4xlsvII
 LKFFkTmbVqjUZtFjqakFHmfuzVor4nU5f+xb90ZHExuuODYckkxWp5rWhf9QwqqI
 0ik6WYgI1/5vnHnX8f2DYzOFQf9qa/rLgg0CshyUODlD6RfHa9vntqYvlIFkmOBd
 Q7KblIoK8YTzUS1M+v7X8JQ7gDR2KwygH37Da2KJS+vgvfIb8kJGr1ZORuhJuJJ7
 Bl69gaNkHTHrqufp7UI64YXfueeuNu2J9z3zwzGoxeaFaofF/phDn0/2gCQE1fQI
 XBhsMw+ETqI6B2SPHMnzYDu2DM1S8ZTOYQlaD4G3uqgWnAM1tG707395uAy5yu4n
 D5azU1fVG4UocoNIyPujpaoSRs2zWZycEFEeUQkhyDDww/j4hlHi6H33eOnk0zsr
 wxzFGfvHfw==
 =k/vv
 -----END PGP SIGNATURE-----

Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull requests via Christoph:
      - Support some passthrough commands without CAP_SYS_ADMIN (Kanchan
        Joshi)
      - Refactor PCIe probing and reset (Christoph Hellwig)
      - Various fabrics authentication fixes and improvements (Sagi
        Grimberg)
      - Avoid fallback to sequential scan due to transient issues (Uday
        Shankar)
      - Implement support for the DEAC bit in Write Zeroes (Christoph
        Hellwig)
      - Allow overriding the IEEE OUI and firmware revision in configfs
        for nvmet (Aleksandr Miloserdov)
      - Force reconnect when number of queue changes in nvmet (Daniel
        Wagner)
      - Minor fixes and improvements (Uros Bizjak, Joel Granados, Sagi
        Grimberg, Christoph Hellwig, Christophe JAILLET)
      - Fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni)
      - Use the common tagset helpers in nvme-pci driver (Christoph
        Hellwig)
      - Cleanup the nvme-pci removal path (Christoph Hellwig)
      - Use kstrtobool() instead of strtobool (Christophe JAILLET)
      - Allow unprivileged passthrough of Identify Controller (Joel
        Granados)
      - Support io stats on the mpath device (Sagi Grimberg)
      - Minor nvmet cleanup (Sagi Grimberg)

 - MD pull requests via Song:
      - Code cleanups (Christoph)
      - Various fixes

 - Floppy pull request from Denis:
      - Fix a memory leak in the init error path (Yuan)

 - Series fixing some batch wakeup issues with sbitmap (Gabriel)

 - Removal of the pktcdvd driver that was deprecated more than 5 years
   ago, and subsequent removal of the devnode callback in struct
   block_device_operations as no users are now left (Greg)

 - Fix for partition read on an exclusively opened bdev (Jan)

 - Series of elevator API cleanups (Jinlong, Christoph)

 - Series of fixes and cleanups for blk-iocost (Kemeng)

 - Series of fixes and cleanups for blk-throttle (Kemeng)

 - Series adding concurrent support for sync queues in BFQ (Yu)

 - Series bringing drbd a bit closer to the out-of-tree maintained
   version (Christian, Joel, Lars, Philipp)

 - Misc drbd fixes (Wang)

 - blk-wbt fixes and tweaks for enable/disable (Yu)

 - Fixes for mq-deadline for zoned devices (Damien)

 - Add support for read-only and offline zones for null_blk
   (Shin'ichiro)

 - Series fixing the delayed holder tracking, as used by DM (Yu,
   Christoph)

 - Series enabling bio alloc caching for IRQ based IO (Pavel)

 - Series enabling userspace peer-to-peer DMA (Logan)

 - BFQ waker fixes (Khazhismel)

 - Series fixing elevator refcount issues (Christoph, Jinlong)

 - Series cleaning up references around queue destruction (Christoph)

 - Series doing quiesce by tagset, enabling cleanups in drivers
   (Christoph, Chao)

 - Series untangling the queue kobject and queue references (Christoph)

 - Misc fixes and cleanups (Bart, David, Dawei, Jinlong, Kemeng, Ye,
   Yang, Waiman, Shin'ichiro, Randy, Pankaj, Christoph)

* tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux: (247 commits)
  blktrace: Fix output non-blktrace event when blk_classic option enabled
  block: sed-opal: Don't include <linux/kernel.h>
  sed-opal: allow using IOC_OPAL_SAVE for locking too
  blk-cgroup: Fix typo in comment
  block: remove bio_set_op_attrs
  nvmet: don't open-code NVME_NS_ATTR_RO enumeration
  nvme-pci: use the tagset alloc/free helpers
  nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set
  nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers
  nvme: consolidate setting the tagset flags
  nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set
  block: bio_copy_data_iter
  nvme-pci: split out a nvme_pci_ctrl_is_dead helper
  nvme-pci: return early on ctrl state mismatch in nvme_reset_work
  nvme-pci: rename nvme_disable_io_queues
  nvme-pci: cleanup nvme_suspend_queue
  nvme-pci: remove nvme_pci_disable
  nvme-pci: remove nvme_disable_admin_queue
  nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl
  nvme: use nvme_wait_ready in nvme_shutdown_ctrl
  ...
2022-12-13 10:43:59 -08:00
Linus Torvalds
8702f2c611 Non-MM patches for 6.2-rc1.
- A ptrace API cleanup series from Sergey Shtylyov
 
 - Fixes and cleanups for kexec from ye xingchen
 
 - nilfs2 updates from Ryusuke Konishi
 
 - squashfs feature work from Xiaoming Ni: permit configuration of the
   filesystem's compression concurrency from the mount command line.
 
 - A series from Akinobu Mita which addresses bound checking errors when
   writing to debugfs files.
 
 - A series from Yang Yingliang to address rapido memory leaks
 
 - A series from Zheng Yejian to address possible overflow errors in
   encode_comp_t().
 
 - And a whole shower of singleton patches all over the place.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY5efRgAKCRDdBJ7gKXxA
 jgvdAP0al6oFDtaSsshIdNhrzcMwfjt6PfVxxHdLmNhF1hX2dwD/SVluS1bPSP7y
 0sZp7Ustu3YTb8aFkMl96Y9m9mY1Nwg=
 =ga5B
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - A ptrace API cleanup series from Sergey Shtylyov

 - Fixes and cleanups for kexec from ye xingchen

 - nilfs2 updates from Ryusuke Konishi

 - squashfs feature work from Xiaoming Ni: permit configuration of the
   filesystem's compression concurrency from the mount command line

 - A series from Akinobu Mita which addresses bound checking errors when
   writing to debugfs files

 - A series from Yang Yingliang to address rapidio memory leaks

 - A series from Zheng Yejian to address possible overflow errors in
   encode_comp_t()

 - And a whole shower of singleton patches all over the place

* tag 'mm-nonmm-stable-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (79 commits)
  ipc: fix memory leak in init_mqueue_fs()
  hfsplus: fix bug causing custom uid and gid being unable to be assigned with mount
  rapidio: devices: fix missing put_device in mport_cdev_open
  kcov: fix spelling typos in comments
  hfs: Fix OOB Write in hfs_asc2mac
  hfs: fix OOB Read in __hfs_brec_find
  relay: fix type mismatch when allocating memory in relay_create_buf()
  ocfs2: always read both high and low parts of dinode link count
  io-mapping: move some code within the include guarded section
  kernel: kcsan: kcsan_test: build without structleak plugin
  mailmap: update email for Iskren Chernev
  eventfd: change int to __u64 in eventfd_signal() ifndef CONFIG_EVENTFD
  rapidio: fix possible UAF when kfifo_alloc() fails
  relay: use strscpy() is more robust and safer
  cpumask: limit visibility of FORCE_NR_CPUS
  acct: fix potential integer overflow in encode_comp_t()
  acct: fix accuracy loss for input value of encode_comp_t()
  linux/init.h: include <linux/build_bug.h> and <linux/stringify.h>
  rapidio: rio: fix possible name leak in rio_register_mport()
  rapidio: fix possible name leaks when rio_add_device() fails
  ...
2022-12-12 17:28:58 -08:00
Linus Torvalds
bf57ae2165 Scheduler changes for v6.2:
- Implement persistent user-requested affinity: introduce affinity_context::user_mask
    and unconditionally preserve the user-requested CPU affinity masks, for long-lived
    tasks to better interact with cpusets & CPU hotplug events over longer timespans,
    without destroying the original affinity intent if the underlying topology changes.
 
  - Uclamp updates: fix relationship between uclamp and fits_capacity()
 
  - PSI fixes
 
  - Misc fixes & updates.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmOXkmgRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1j/dQ//WYW/JaBpydqnVxDu6C21z0w3+fHDdlsN
 nQ6jyLPlouFjI2Ink1E7i7Iq8C73sdewCgD7Jq3xGa1GhsPEJIrPAaBgacxYjOqc
 x9HHZoygSkAihTfrVvzq37YttD2t/gQQxc81tBziMBVP2A+gb9z44u+ezMlxjiGz
 irgE07qNfiLyTeD/dJhEU2EOsPJm/gestW3+Cd8uwYAe6pj0X4FE3n8ipmr0BzNZ
 6nxFJaSspwAkREjpAIZVENEArq7XrkGqUFKgKpYqWn0HAnuTWgFcW8E2NrDw7Qbf
 4aAdBuzimbWdbkqRoX9r7++r5wqc3KW+is8Y97aEUsc0zhrXHAW1Hn2w7en5XxiQ
 btaPi77Boi69sHvOrfMy3i6UZ895yh2sROIkYBDT485w57BR75HsMLkk2LNIm7qE
 mATrrZ65bbGAgAxZouqlnQk40BUlniIfDlfZyReyFtXkW8UH5tTNX6qtpWzzdwfy
 posrm+XvgDcP96/7DIczZwT6VEJE5GBZbPvk2Vw4GNq6/QeW7g9GPhYTaV6CXzzW
 lCk/MV1n+IWCUqjkGXXCTS53TIyC6WZh2ehegcsh1KYyWcVijEs42S6eqXZI9cO7
 F4oU7sehg4vlhMm1uE5JgaABfYqqzzKlvZySdwXbne2Vjt4nsWlWoe6u6JAdA4EB
 PRwmUDRMyEE=
 =aao/
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Implement persistent user-requested affinity: introduce
   affinity_context::user_mask and unconditionally preserve the
   user-requested CPU affinity masks, for long-lived tasks to better
   interact with cpusets & CPU hotplug events over longer timespans,
   without destroying the original affinity intent if the underlying
   topology changes.

 - Uclamp updates: fix relationship between uclamp and fits_capacity()

 - PSI fixes

 - Misc fixes & updates

* tag 'sched-core-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Clear ttwu_pending after enqueue_task()
  sched/psi: Use task->psi_flags to clear in CPU migration
  sched/psi: Stop relying on timer_pending() for poll_work rescheduling
  sched/psi: Fix avgs_work re-arm in psi_avgs_work()
  sched/psi: Fix possible missing or delayed pending event
  sched: Always clear user_cpus_ptr in do_set_cpus_allowed()
  sched: Enforce user requested affinity
  sched: Always preserve the user requested cpumask
  sched: Introduce affinity_context
  sched: Add __releases annotations to affine_move_task()
  sched/fair: Check if prev_cpu has highest spare cap in feec()
  sched/fair: Consider capacity inversion in util_fits_cpu()
  sched/fair: Detect capacity inversion
  sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition
  sched/uclamp: Make cpu_overutilized() use util_fits_cpu()
  sched/uclamp: Make asym_fits_capacity() use util_fits_cpu()
  sched/uclamp: Make select_idle_capacity() use util_fits_cpu()
  sched/uclamp: Fix fits_capacity() check in feec()
  sched/uclamp: Make task_fits_capacity() use util_fits_cpu()
  sched/uclamp: Fix relationship between uclamp and migration margin
2022-12-12 15:33:42 -08:00
Kees Cook
79cc1ba7ba panic: Consolidate open-coded panic_on_warn checks
Several run-time checkers (KASAN, UBSAN, KFENCE, KCSAN, sched) roll
their own warnings, and each check "panic_on_warn". Consolidate this
into a single function so that future instrumentation can be added in
a single location.

Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Gow <davidgow@google.com>
Cc: tangmeng <tangmeng@uniontech.com>
Cc: Jann Horn <jannh@google.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
Cc: kasan-dev@googlegroups.com
Cc: linux-mm@kvack.org
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Link: https://lore.kernel.org/r/20221117234328.594699-4-keescook@chromium.org
2022-12-02 13:04:44 -08:00
Sam Wu
cdcc5ef26b Revert "cpufreq: schedutil: Move max CPU capacity to sugov_policy"
This reverts commit 6d5afdc97e.

On a Pixel 6 device, it is observed that this commit increases
latency by approximately 50ms, or 20%, in migrating a task
that requires full CPU utilization from a LITTLE CPU to Fmax
on a big CPU. Reverting this change restores the latency back
to its original baseline value.

Fixes: 6d5afdc97e ("cpufreq: schedutil: Move max CPU capacity to sugov_policy")
Signed-off-by: Sam Wu <wusamuel@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-11-22 19:56:52 +01:00
Kefeng Wang
0dff89c448 sched: Move numa_balancing sysctls to its own file
The sysctl_numa_balancing_promote_rate_limit and sysctl_numa_balancing
are part of sched, move them to its own file.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-11-20 20:55:26 -08:00
Uros Bizjak
8baceabca6 sched/fair: use try_cmpxchg in task_numa_work
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
task_numa_work.  x86 CMPXCHG instruction returns success in ZF flag, so
this change saves a compare after cmpxchg (and related move instruction in
front of cmpxchg).

No functional change intended.

Link: https://lkml.kernel.org/r/20220822173956.82525-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-18 13:55:09 -08:00
Gabriel Krisman Bertazi
ee7dc86b6d wait: Return number of exclusive waiters awaken
Sbitmap code will need to know how many waiters were actually woken for
its batched wakeups implementation.  Return the number of woken
exclusive waiters from __wake_up() to facilitate that.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221115224553.23594-3-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16 11:33:03 -07:00
Tianchen Ding
d6962c4fe8 sched: Clear ttwu_pending after enqueue_task()
We found a long tail latency in schbench whem m*t is close to nr_cpus.
(e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.)

This is because when the wakee cpu is idle, rq->ttwu_pending is cleared
too early, and idle_cpu() will return true until the wakee task enqueued.
This will mislead the waker when selecting idle cpu, and wake multiple
worker threads on the same wakee cpu. This situation is enlarged by
commit f3dd3f6745 ("sched: Remove the limitation of WF_ON_CPU on
wakelist if wakee cpu is idle") because it tends to use wakelist.

Here is the result of "schbench -m 2 -t 16" on a VM with 32vcpu
(Intel(R) Xeon(R) Platinum 8369B).

Latency percentiles (usec):
                base      base+revert_f3dd3f674555   base+this_patch
50.0000th:         9                            13                 9
75.0000th:        12                            19                12
90.0000th:        15                            22                15
95.0000th:        18                            24                17
*99.0000th:       27                            31                24
99.5000th:      3364                            33                27
99.9000th:     12560                            36                30

We also tested on unixbench and hackbench, and saw no performance
change.

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20221104023601.12844-1-dtcccc@linux.alibaba.com
2022-11-16 10:13:05 +01:00
Peter Zijlstra
91dabf33ae sched: Fix race in task_call_func()
There is a very narrow race between schedule() and task_call_func().

  CPU0						CPU1

  __schedule()
    rq_lock();
    prev_state = READ_ONCE(prev->__state);
    if (... && prev_state) {
      deactivate_tasl(rq, prev, ...)
        prev->on_rq = 0;

						task_call_func()
						  raw_spin_lock_irqsave(p->pi_lock);
						  state = READ_ONCE(p->__state);
						  smp_rmb();
						  if (... || p->on_rq) // false!!!
						    rq = __task_rq_lock()

						  ret = func();

    next = pick_next_task();
    rq = context_switch(prev, next)
      prepare_lock_switch()
        spin_release(&__rq_lockp(rq)->dep_map...)

So while the task is on it's way out, it still holds rq->lock for a
little while, and right then task_call_func() comes in and figures it
doesn't need rq->lock anymore (because the task is already dequeued --
but still running there) and then the __set_task_frozen() thing observes
it's holding rq->lock and yells murder.

Avoid this by waiting for p->on_cpu to get cleared, which guarantees
the task is fully finished on the old CPU.

( While arguably the fixes tag is 'wrong' -- none of the previous
  task_call_func() users appears to care for this case. )

Fixes: f5d39b0208 ("freezer,sched: Rewrite core freezer logic")
Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://lkml.kernel.org/r/Y1kdRNNfUeAU+FNl@hirez.programming.kicks-ass.net
2022-11-14 09:58:32 +01:00
Chengming Zhou
52b33d87b9 sched/psi: Use task->psi_flags to clear in CPU migration
The commit d583d360a6 ("psi: Fix psi state corruption when schedule()
races with cgroup move") fixed a race problem by making cgroup_move_task()
use task->psi_flags instead of looking at the scheduler state.

We can extend task->psi_flags usage to CPU migration, which should be
a minor optimization for performance and code simplicity.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220926081931.45420-1-zhouchengming@bytedance.com
2022-10-30 10:12:15 +01:00
Suren Baghdasaryan
710ffe671e sched/psi: Stop relying on timer_pending() for poll_work rescheduling
Psi polling mechanism is trying to minimize the number of wakeups to
run psi_poll_work and is currently relying on timer_pending() to detect
when this work is already scheduled. This provides a window of opportunity
for psi_group_change to schedule an immediate psi_poll_work after
poll_timer_fn got called but before psi_poll_work could reschedule itself.
Below is the depiction of this entire window:

poll_timer_fn
  wake_up_interruptible(&group->poll_wait);

psi_poll_worker
  wait_event_interruptible(group->poll_wait, ...)
  psi_poll_work
    psi_schedule_poll_work
      if (timer_pending(&group->poll_timer)) return;
      ...
      mod_timer(&group->poll_timer, jiffies + delay);

Prior to 461daba06b we used to rely on poll_scheduled atomic which was
reset and set back inside psi_poll_work and therefore this race window
was much smaller.
The larger window causes increased number of wakeups and our partners
report visible power regression of ~10mA after applying 461daba06b.
Bring back the poll_scheduled atomic and make this race window even
narrower by resetting poll_scheduled only when we reach polling expiration
time. This does not completely eliminate the possibility of extra wakeups
caused by a race with psi_group_change however it will limit it to the
worst case scenario of one extra wakeup per every tracking window (0.5s
in the worst case).
This patch also ensures correct ordering between clearing poll_scheduled
flag and obtaining changed_states using memory barrier. Correct ordering
between updating changed_states and setting poll_scheduled is ensured by
atomic_xchg operation.
By tracing the number of immediate rescheduling attempts performed by
psi_group_change and the number of these attempts being blocked due to
psi monitor being already active, we can assess the effects of this change:

Before the patch:
                                           Run#1    Run#2      Run#3
Immediate reschedules attempted:           684365   1385156    1261240
Immediate reschedules blocked:             682846   1381654    1258682
Immediate reschedules (delta):             1519     3502       2558
Immediate reschedules (% of attempted):    0.22%    0.25%      0.20%

After the patch:
                                           Run#1    Run#2      Run#3
Immediate reschedules attempted:           882244   770298    426218
Immediate reschedules blocked:             881996   769796    426074
Immediate reschedules (delta):             248      502       144
Immediate reschedules (% of attempted):    0.03%    0.07%     0.03%

The number of non-blocked immediate reschedules dropped from 0.22-0.25%
to 0.03-0.07%. The drop is attributed to the decrease in the race window
size and the fact that we allow this race only when psi monitors reach
polling window expiration time.

Fixes: 461daba06b ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
Reported-by: Kathleen Chang <yt.chang@mediatek.com>
Reported-by: Wenju Xu <wenju.xu@mediatek.com>
Reported-by: Jonathan Chen <jonathan.jmchen@mediatek.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: SH Chen <show-hong.chen@mediatek.com>
Link: https://lore.kernel.org/r/20221028194541.813985-1-surenb@google.com
2022-10-30 10:12:14 +01:00
Chengming Zhou
2fcd7bbae9 sched/psi: Fix avgs_work re-arm in psi_avgs_work()
Pavan reported a problem that PSI avgs_work idle shutoff is not
working at all. Because PSI_NONIDLE condition would be observed in
psi_avgs_work()->collect_percpu_times()->get_recent_times() even if
only the kworker running avgs_work on the CPU.

Although commit 1b69ac6b40 ("psi: fix aggregation idle shut-off")
avoided the ping-pong wake problem when the worker sleep, psi_avgs_work()
still will always re-arm the avgs_work, so shutoff is not working.

This patch changes to use PSI_STATE_RESCHEDULE to flag whether to
re-arm avgs_work in get_recent_times(). For the current CPU, we re-arm
avgs_work only when (NR_RUNNING > 1 || NR_IOWAIT > 0 || NR_MEMSTALL > 0),
for other CPUs we can just check PSI_NONIDLE delta. The new flag
is only used in psi_avgs_work(), so we check in get_recent_times()
that current_work() is avgs_work.

One potential problem is that the brief period of non-idle time
incurred between the aggregation run and the kworker's dequeue will
be stranded in the per-cpu buckets until avgs_work run next time.
The buckets can hold 4s worth of time, and future activity will wake
the avgs_work with a 2s delay, giving us 2s worth of data we can leave
behind when shut off the avgs_work. If the kworker run other works after
avgs_work shut off and doesn't have any scheduler activities for 2s,
this maybe a problem.

Reported-by: Pavan Kondeti <quic_pkondeti@quicinc.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/r/20221014110551.22695-1-zhouchengming@bytedance.com
2022-10-30 10:12:14 +01:00
Hao Lee
e38f89af6a sched/psi: Fix possible missing or delayed pending event
When a pending event exists and growth is less than the threshold, the
current logic is to skip this trigger without generating event. However,
from e6df4ead85 ("psi: fix possible trigger missing in the window"),
our purpose is to generate event as long as pending event exists and the
rate meets the limit, no matter what growth is.
This patch handles this case properly.

Fixes: e6df4ead85 ("psi: fix possible trigger missing in the window")
Signed-off-by: Hao Lee <haolee.swjtu@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/r/20220919072356.GA29069@haolee.io
2022-10-30 10:12:13 +01:00
Waiman Long
851a723e45 sched: Always clear user_cpus_ptr in do_set_cpus_allowed()
The do_set_cpus_allowed() function is used by either kthread_bind() or
select_fallback_rq(). In both cases the user affinity (if any) should be
destroyed too.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220922180041.1768141-6-longman@redhat.com
2022-10-27 11:01:22 +02:00
Waiman Long
da01903281 sched: Enforce user requested affinity
It was found that the user requested affinity via sched_setaffinity()
can be easily overwritten by other kernel subsystems without an easy way
to reset it back to what the user requested. For example, any change
to the current cpuset hierarchy may reset the cpumask of the tasks in
the affected cpusets to the default cpuset value even if those tasks
have pre-existing user requested affinity. That is especially easy to
trigger under a cgroup v2 environment where writing "+cpuset" to the
root cgroup's cgroup.subtree_control file will reset the cpus affinity
of all the processes in the system.

That is problematic in a nohz_full environment where the tasks running
in the nohz_full CPUs usually have their cpus affinity explicitly set
and will behave incorrectly if cpus affinity changes.

Fix this problem by looking at user_cpus_ptr in __set_cpus_allowed_ptr()
and use it to restrcit the given cpumask unless there is no overlap. In
that case, it will fallback to the given one. The SCA_USER flag is
reused to indicate intent to set user_cpus_ptr and so user_cpus_ptr
masking should be skipped. In addition, masking should also be skipped
if any of the SCA_MIGRATE_* flag is set.

All callers of set_cpus_allowed_ptr() will be affected by this change.
A scratch cpumask is added to percpu runqueues structure for doing
additional masking when user_cpus_ptr is set.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220922180041.1768141-4-longman@redhat.com
2022-10-27 11:01:22 +02:00
Waiman Long
8f9ea86fdf sched: Always preserve the user requested cpumask
Unconditionally preserve the user requested cpumask on
sched_setaffinity() calls. This allows using it outside of the fairly
narrow restrict_cpus_allowed_ptr() use-case and fix some cpuset issues
that currently suffer destruction of cpumasks.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220922180041.1768141-3-longman@redhat.com
2022-10-27 11:01:22 +02:00
Waiman Long
713a2e21a5 sched: Introduce affinity_context
In order to prepare for passing through additional data through the
affinity call-chains, convert the mask and flags argument into a
structure.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220922180041.1768141-5-longman@redhat.com
2022-10-27 11:01:21 +02:00
Waiman Long
5584e8ac2c sched: Add __releases annotations to affine_move_task()
affine_move_task() assumes task_rq_lock() has been called and it does
an implicit task_rq_unlock() before returning. Add the appropriate
__releases annotations to make this clear.

A typo error in comment is also fixed.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220922180041.1768141-2-longman@redhat.com
2022-10-27 11:01:21 +02:00
Pierre Gondois
ad841e569f sched/fair: Check if prev_cpu has highest spare cap in feec()
When evaluating the CPU candidates in the perf domain (pd) containing
the previously used CPU (prev_cpu), find_energy_efficient_cpu()
evaluates the energy of the pd:
- without the task (base_energy)
- with the task placed on prev_cpu (if the task fits)
- with the task placed on the CPU with the highest spare capacity,
  prev_cpu being excluded from this set

If prev_cpu is already the CPU with the highest spare capacity,
max_spare_cap_cpu will be the CPU with the second highest spare
capacity.

On an Arm64 Juno-r2, with a workload of 10 tasks at a 10% duty cycle,
when prev_cpu and max_spare_cap_cpu are both valid candidates,
prev_spare_cap > max_spare_cap at ~82%.
Thus the energy of the pd when placing the task on max_spare_cap_cpu
is computed with no possible positive outcome 82% most of the time.

Do not consider max_spare_cap_cpu as a valid candidate if
prev_spare_cap > max_spare_cap.

Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20221006081052.3862167-2-pierre.gondois@arm.com
2022-10-27 11:01:20 +02:00
Qais Yousef
aa69c36f31 sched/fair: Consider capacity inversion in util_fits_cpu()
We do consider thermal pressure in util_fits_cpu() for uclamp_min only.
With the exception of the biggest cores which by definition are the max
performance point of the system and all tasks by definition should fit.

Even under thermal pressure, the capacity of the biggest CPU is the
highest in the system and should still fit every task. Except when it
reaches capacity inversion point, then this is no longer true.

We can handle this by using the inverted capacity as capacity_orig in
util_fits_cpu(). Which not only addresses the problem above, but also
ensure uclamp_max now considers the inverted capacity. Force fitting
a task when a CPU is in this adverse state will contribute to making the
thermal throttling last longer.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220804143609.515789-10-qais.yousef@arm.com
2022-10-27 11:01:20 +02:00
Qais Yousef
44c7b80bff sched/fair: Detect capacity inversion
Check each performance domain to see if thermal pressure is causing its
capacity to be lower than another performance domain.

We assume that each performance domain has CPUs with the same
capacities, which is similar to an assumption made in energy_model.c

We also assume that thermal pressure impacts all CPUs in a performance
domain equally.

If there're multiple performance domains with the same capacity_orig, we
will trigger a capacity inversion if the domain is under thermal
pressure.

The new cpu_in_capacity_inversion() should help users to know when
information about capacity_orig are not reliable and can opt in to use
the inverted capacity as the 'actual' capacity_orig.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220804143609.515789-9-qais.yousef@arm.com
2022-10-27 11:01:20 +02:00
Qais Yousef
d81304bc61 sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition
If the utilization of the woken up task is 0, we skip the energy
calculation because it has no impact.

But if the task is boosted (uclamp_min != 0) will have an impact on task
placement and frequency selection. Only skip if the util is truly
0 after applying uclamp values.

Change uclamp_task_cpu() signature to avoid unnecessary additional calls
to uclamp_eff_get(). feec() is the only user now.

Fixes: 732cd75b8c ("sched/fair: Select an energy-efficient CPU on task wake-up")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220804143609.515789-8-qais.yousef@arm.com
2022-10-27 11:01:19 +02:00
Qais Yousef
c56ab1b350 sched/uclamp: Make cpu_overutilized() use util_fits_cpu()
So that it is now uclamp aware.

This fixes a major problem of busy tasks capped with UCLAMP_MAX keeping
the system in overutilized state which disables EAS and leads to wasting
energy in the long run.

Without this patch running a busy background activity like JIT
compilation on Pixel 6 causes the system to be in overutilized state
74.5% of the time.

With this patch this goes down to  9.79%.

It also fixes another problem when long running tasks that have their
UCLAMP_MIN changed while running such that they need to upmigrate to
honour the new UCLAMP_MIN value. The upmigration doesn't get triggered
because overutilized state never gets set in this state, hence misfit
migration never happens at tick in this case until the task wakes up
again.

Fixes: af24bde8df ("sched/uclamp: Add uclamp support to energy_compute()")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220804143609.515789-7-qais.yousef@arm.com
2022-10-27 11:01:19 +02:00
Qais Yousef
a2e7f03ed2 sched/uclamp: Make asym_fits_capacity() use util_fits_cpu()
Use the new util_fits_cpu() to ensure migration margin and capacity
pressure are taken into account correctly when uclamp is being used
otherwise we will fail to consider CPUs as fitting in scenarios where
they should.

s/asym_fits_capacity/asym_fits_cpu/ to better reflect what it does now.

Fixes: b4c9c9f156 ("sched/fair: Prefer prev cpu in asymmetric wakeup path")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220804143609.515789-6-qais.yousef@arm.com
2022-10-27 11:01:19 +02:00
Qais Yousef
b759caa1d9 sched/uclamp: Make select_idle_capacity() use util_fits_cpu()
Use the new util_fits_cpu() to ensure migration margin and capacity
pressure are taken into account correctly when uclamp is being used
otherwise we will fail to consider CPUs as fitting in scenarios where
they should.

Fixes: b4c9c9f156 ("sched/fair: Prefer prev cpu in asymmetric wakeup path")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220804143609.515789-5-qais.yousef@arm.com
2022-10-27 11:01:18 +02:00
Qais Yousef
244226035a sched/uclamp: Fix fits_capacity() check in feec()
As reported by Yun Hsiang [1], if a task has its uclamp_min >= 0.8 * 1024,
it'll always pick the previous CPU because fits_capacity() will always
return false in this case.

The new util_fits_cpu() logic should handle this correctly for us beside
more corner cases where similar failures could occur, like when using
UCLAMP_MAX.

We open code uclamp_rq_util_with() except for the clamp() part,
util_fits_cpu() needs the 'raw' values to be passed to it.

Also introduce uclamp_rq_{set, get}() shorthand accessors to get uclamp
value for the rq. Makes the code more readable and ensures the right
rules (use READ_ONCE/WRITE_ONCE) are respected transparently.

[1] https://lists.linaro.org/pipermail/eas-dev/2020-July/001488.html

Fixes: 1d42509e47 ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
Reported-by: Yun Hsiang <hsiang023167@gmail.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220804143609.515789-4-qais.yousef@arm.com
2022-10-27 11:01:18 +02:00
Qais Yousef
b48e16a697 sched/uclamp: Make task_fits_capacity() use util_fits_cpu()
So that the new uclamp rules in regard to migration margin and capacity
pressure are taken into account correctly.

Fixes: a7008c07a5 ("sched/fair: Make task_fits_capacity() consider uclamp restrictions")
Co-developed-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220804143609.515789-3-qais.yousef@arm.com
2022-10-27 11:01:18 +02:00
Qais Yousef
48d5e9daa8 sched/uclamp: Fix relationship between uclamp and migration margin
fits_capacity() verifies that a util is within 20% margin of the
capacity of a CPU, which is an attempt to speed up upmigration.

But when uclamp is used, this 20% margin is problematic because for
example if a task is boosted to 1024, then it will not fit on any CPU
according to fits_capacity() logic.

Or if a task is boosted to capacity_orig_of(medium_cpu). The task will
end up on big instead on the desired medium CPU.

Similar corner cases exist for uclamp and usage of capacity_of().
Slightest irq pressure on biggest CPU for example will make a 1024
boosted task look like it can't fit.

What we really want is for uclamp comparisons to ignore the migration
margin and capacity pressure, yet retain them for when checking the
_actual_ util signal.

For example, task p:

	p->util_avg = 300
	p->uclamp[UCLAMP_MIN] = 1024

Will fit a big CPU. But

	p->util_avg = 900
	p->uclamp[UCLAMP_MIN] = 1024

will not, this should trigger overutilized state because the big CPU is
now *actually* being saturated.

Similar reasoning applies to capping tasks with UCLAMP_MAX. For example:

	p->util_avg = 1024
	p->uclamp[UCLAMP_MAX] = capacity_orig_of(medium_cpu)

Should fit the task on medium cpus without triggering overutilized
state.

Inlined comments expand more on desired behavior in more scenarios.

Introduce new util_fits_cpu() function which encapsulates the new logic.
The new function is not used anywhere yet, but will be used to update
various users of fits_capacity() in later patches.

Fixes: af24bde8df ("sched/uclamp: Add uclamp support to energy_compute()")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220804143609.515789-2-qais.yousef@arm.com
2022-10-27 11:01:17 +02:00
Kees Cook
8e5bad7dcc sched: Introduce struct balance_callback to avoid CFI mismatches
Introduce distinct struct balance_callback instead of performing function
pointer casting which will trip CFI. Avoids warnings as found by Clang's
future -Wcast-function-type-strict option:

In file included from kernel/sched/core.c:84:
kernel/sched/sched.h:1755:15: warning: cast from 'void (*)(struct rq *)' to 'void (*)(struct callback_head *)' converts to incompatible function type [-Wcast-function-type-strict]
        head->func = (void (*)(struct callback_head *))func;
                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

No binary differences result from this change.

This patch is a cleanup based on Brad Spengler/PaX Team's modifications
to sched code in their last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code
are mine and don't reflect the original grsecurity/PaX code.

Reported-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Link: https://github.com/ClangBuiltLinux/linux/issues/1724
Link: https://lkml.kernel.org/r/20221008000758.2957718-1-keescook@chromium.org
2022-10-17 16:41:25 +02:00
Lin Shengwang
e705968dd6 sched/core: Fix comparison in sched_group_cookie_match()
In commit 97886d9dcd ("sched: Migration changes for core scheduling"),
sched_group_cookie_match() was added to help determine if a cookie
matches the core state.

However, while it iterates the SMT group, it fails to actually use the
RQ for each of the CPUs iterated, use cpu_rq(cpu) instead of rq to fix
things.

Fixes: 97886d9dcd ("sched: Migration changes for core scheduling")
Signed-off-by: Lin Shengwang <linshengwang1@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221008022709.642-1-linshengwang1@huawei.com
2022-10-17 16:41:24 +02:00
Linus Torvalds
bd9a3dba18 PSI updates for v6.1:
- Various performance optimizations, resulting in a 4%-9% speedup
    in the mmtests/config-scheduler-perfpipe micro-benchmark.
 
  - New interface to turn PSI on/off on a per cgroup level.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmNJKPsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iPmg//aovCitAQX2lLoHJDIgdQibU40oaEpKTX
 wM549EGz3Dr6qmwF8+qT1U2Ge6af/hHQc5G/ZqDpKbuTjUIc3RmBkqX80dNKFLuH
 uyi9UtfsSriw+ks8fWuDdjr+S4oppwW9ZoIXvK8v4bisd3F31DNGvKPTayNxt73m
 lExfzJiD1oJixDxGX8MGO9QpcoywmjWjzjrB2P+J8hnTpArouHx/HOKdQOpG6wXq
 ZRr9kZvju6ucDpXCTa1HJrfVRxNAh35tx/b4cDtXbBFifVAeKaPOrHapMTVsqfel
 Z7T+2DymhidNYK0hrRJoGUwa/vkz+2Sm1ZLG9LlgUCXVco/9S1zw1ZuQakVvzPen
 wriuxRaAkR+szCP0L8js5+/DAkGa43MjKsvQHmDVnetQtlsAD4eYnn+alQ837SXv
 MP3jwFqF+e4mcWdoQcfh0OWUgGec5XZzdgRYrFkBKyTWGLB2iPivcAMNf0X/h82Q
 xxv4DQJIIJ017GOQ/ho2saq+GbtFCvX8YnGYas9T47Bjjluhjo7jgTVtPTo+mhtN
 RfwMdG718Ap/gvnAX7wMe/t+L/4AP8AIgDRi5L35dTRqETwOjH+LAvOYjleQFYgu
 kMVtLMyzU+TGwHscuzPFRh7TnvSJ4sD48Ll1BPnyZsh3SS9u0gAs1bml7Cu7JbmW
 SIZD/S/hzdI=
 =91tB
 -----END PGP SIGNATURE-----

Merge tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull PSI updates from Ingo Molnar:

 - Various performance optimizations, resulting in a 4%-9% speedup in
   the mmtests/config-scheduler-perfpipe micro-benchmark.

 - New interface to turn PSI on/off on a per cgroup level.

* tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/psi: Per-cgroup PSI accounting disable/re-enable interface
  sched/psi: Cache parent psi_group to speed up group iteration
  sched/psi: Consolidate cgroup_psi()
  sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
  sched/psi: Remove NR_ONCPU task accounting
  sched/psi: Optimize task switch inside shared cgroups again
  sched/psi: Move private helpers to sched/stats.h
  sched/psi: Save percpu memory when !psi_cgroups_enabled
  sched/psi: Don't create cgroup PSI files when psi_disabled
  sched/psi: Fix periodic aggregation shut off
2022-10-14 13:03:00 -07:00
Linus Torvalds
27bc50fc90 - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any negative
   reports (or any positive ones, come to that).
 
 - Also the Maple Tree from Liam R.  Howlett.  An overlapping range-based
   tree for vmas.  It it apparently slight more efficient in its own right,
   but is mainly targeted at enabling work to reduce mmap_lock contention.
 
   Liam has identified a number of other tree users in the kernel which
   could be beneficially onverted to mapletrees.
 
   Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
   (https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com).
   This has yet to be addressed due to Liam's unfortunately timed
   vacation.  He is now back and we'll get this fixed up.
 
 - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer.  It uses
   clang-generated instrumentation to detect used-unintialized bugs down to
   the single bit level.
 
   KMSAN keeps finding bugs.  New ones, as well as the legacy ones.
 
 - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
   memory into THPs.
 
 - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support
   file/shmem-backed pages.
 
 - userfaultfd updates from Axel Rasmussen
 
 - zsmalloc cleanups from Alexey Romanov
 
 - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure
 
 - Huang Ying adds enhancements to NUMA balancing memory tiering mode's
   page promotion, with a new way of detecting hot pages.
 
 - memcg updates from Shakeel Butt: charging optimizations and reduced
   memory consumption.
 
 - memcg cleanups from Kairui Song.
 
 - memcg fixes and cleanups from Johannes Weiner.
 
 - Vishal Moola provides more folio conversions
 
 - Zhang Yi removed ll_rw_block() :(
 
 - migration enhancements from Peter Xu
 
 - migration error-path bugfixes from Huang Ying
 
 - Aneesh Kumar added ability for a device driver to alter the memory
   tiering promotion paths.  For optimizations by PMEM drivers, DRM
   drivers, etc.
 
 - vma merging improvements from Jakub Matěn.
 
 - NUMA hinting cleanups from David Hildenbrand.
 
 - xu xin added aditional userspace visibility into KSM merging activity.
 
 - THP & KSM code consolidation from Qi Zheng.
 
 - more folio work from Matthew Wilcox.
 
 - KASAN updates from Andrey Konovalov.
 
 - DAMON cleanups from Kaixu Xia.
 
 - DAMON work from SeongJae Park: fixes, cleanups.
 
 - hugetlb sysfs cleanups from Muchun Song.
 
 - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA
 joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf
 bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU=
 =xfWx
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
   linux-next for a couple of months without, to my knowledge, any
   negative reports (or any positive ones, come to that).

 - Also the Maple Tree from Liam Howlett. An overlapping range-based
   tree for vmas. It it apparently slightly more efficient in its own
   right, but is mainly targeted at enabling work to reduce mmap_lock
   contention.

   Liam has identified a number of other tree users in the kernel which
   could be beneficially onverted to mapletrees.

   Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
   at [1]. This has yet to be addressed due to Liam's unfortunately
   timed vacation. He is now back and we'll get this fixed up.

 - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
   clang-generated instrumentation to detect used-unintialized bugs down
   to the single bit level.

   KMSAN keeps finding bugs. New ones, as well as the legacy ones.

 - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
   memory into THPs.

 - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
   support file/shmem-backed pages.

 - userfaultfd updates from Axel Rasmussen

 - zsmalloc cleanups from Alexey Romanov

 - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
   memory-failure

 - Huang Ying adds enhancements to NUMA balancing memory tiering mode's
   page promotion, with a new way of detecting hot pages.

 - memcg updates from Shakeel Butt: charging optimizations and reduced
   memory consumption.

 - memcg cleanups from Kairui Song.

 - memcg fixes and cleanups from Johannes Weiner.

 - Vishal Moola provides more folio conversions

 - Zhang Yi removed ll_rw_block() :(

 - migration enhancements from Peter Xu

 - migration error-path bugfixes from Huang Ying

 - Aneesh Kumar added ability for a device driver to alter the memory
   tiering promotion paths. For optimizations by PMEM drivers, DRM
   drivers, etc.

 - vma merging improvements from Jakub Matěn.

 - NUMA hinting cleanups from David Hildenbrand.

 - xu xin added aditional userspace visibility into KSM merging
   activity.

 - THP & KSM code consolidation from Qi Zheng.

 - more folio work from Matthew Wilcox.

 - KASAN updates from Andrey Konovalov.

 - DAMON cleanups from Kaixu Xia.

 - DAMON work from SeongJae Park: fixes, cleanups.

 - hugetlb sysfs cleanups from Muchun Song.

 - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.

Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]

* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
  hugetlb: allocate vma lock for all sharable vmas
  hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
  hugetlb: fix vma lock handling during split vma and range unmapping
  mglru: mm/vmscan.c: fix imprecise comments
  mm/mglru: don't sync disk for each aging cycle
  mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
  mm: memcontrol: use do_memsw_account() in a few more places
  mm: memcontrol: deprecate swapaccounting=0 mode
  mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
  mm/secretmem: remove reduntant return value
  mm/hugetlb: add available_huge_pages() func
  mm: remove unused inline functions from include/linux/mm_inline.h
  selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
  selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
  selftests/vm: add thp collapse shmem testing
  selftests/vm: add thp collapse file and tmpfs testing
  selftests/vm: modularize thp collapse memory operations
  selftests/vm: dedup THP helpers
  mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
  mm/madvise: add file and shmem support to MADV_COLLAPSE
  ...
2022-10-10 17:53:04 -07:00
Linus Torvalds
d4013bc4d4 bitmap patches for v6.1-rc1
From Phil Auld:
 drivers/base: Fix unsigned comparison to -1 in CPUMAP_FILE_MAX_BYTES
 
 From me:
 cpumask: cleanup nr_cpu_ids vs nr_cpumask_bits mess
 
 This series cleans that mess and adds new config FORCE_NR_CPUS that
 allows to optimize cpumask subsystem if the number of CPUs is known
 at compile-time.
 
 From me:
 lib: optimize find_bit() functions
 
 Reworks find_bit() functions based on new FIND_{FIRST,NEXT}_BIT() macros.
 
 From me:
 lib/find: add find_nth_bit()
 
 Adds find_nth_bit(), which is ~70 times faster than bitcounting with
 for_each() loop:
         for_each_set_bit(bit, mask, size)
                 if (n-- == 0)
                         return bit;
 
 Also adds bitmap_weight_and() to let people replace this pattern:
 	tmp = bitmap_alloc(nbits);
 	bitmap_and(tmp, map1, map2, nbits);
 	weight = bitmap_weight(tmp, nbits);
 	bitmap_free(tmp);
 with a single bitmap_weight_and() call.
 
 From me:
 cpumask: repair cpumask_check()
 
 After switching cpumask to use nr_cpu_ids, cpumask_check() started
 generating many false-positive warnings. This series fixes it.
 
 From Valentin Schneider:
 bitmap,cpumask: Add for_each_cpu_andnot() and for_each_cpu_andnot()
 
 Extends the API with one more function and applies it in sched/core.
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEEi8GdvG6xMhdgpu/4sUSA/TofvsgFAmNBwmUACgkQsUSA/Tof
 vshPRwv+KlqnZlKtuSPgbo/Kgswworpi/7TqfnN9GWlb8AJ2uhjBKI3GFwv4TDow
 7KV6wdKdXYLr4pktcIhWy3qLrT+bDDExfarHRo3QI1A1W42EJ+ZiUaGnQGcnVMzD
 5q/K1YMJYq0oaesHEw5PVUh8mm6h9qRD8VbX1u+riW/VCWBj3bho9Dp4mffQ48Q6
 hVy/SnMGgClQwNYp+sxkqYx38xUqUGYoU5MzeziUmoS6pZQh+4lF33MULnI3EKmc
 /ehXilPPtOV/Tm0RovDWFfm3rjNapV9FXHu8Ob2z/c+1A29EgXnE3pwrBDkAx001
 TQrL9qbCANRDGPLzWQHw0dwFIaXvTdrSttCsfYYfU5hI4JbnJEe0Pqkaaohy7jqm
 r0dW/TlyOG5T+k8Kwdx9w9A+jKs8TbKKZ8HOaN8BpkXswVnpbzpQbj3TITZI4aeV
 6YR4URBQ5UkrVLEXFXbrOzwjL2zqDdyNoBdTJmGLJ+5b/n0HHzmyMVkegNIwLLM3
 GR7sMQae
 =Q/+F
 -----END PGP SIGNATURE-----

Merge tag 'bitmap-6.1-rc1' of https://github.com/norov/linux

Pull bitmap updates from Yury Norov:

 - Fix unsigned comparison to -1 in CPUMAP_FILE_MAX_BYTES (Phil Auld)

 - cleanup nr_cpu_ids vs nr_cpumask_bits mess (me)

   This series cleans that mess and adds new config FORCE_NR_CPUS that
   allows to optimize cpumask subsystem if the number of CPUs is known
   at compile-time.

 - optimize find_bit() functions (me)

   Reworks find_bit() functions based on new FIND_{FIRST,NEXT}_BIT()
   macros.

 - add find_nth_bit() (me)

   Adds find_nth_bit(), which is ~70 times faster than bitcounting with
   for_each() loop:

	for_each_set_bit(bit, mask, size)
		if (n-- == 0)
			return bit;

   Also adds bitmap_weight_and() to let people replace this pattern:

	tmp = bitmap_alloc(nbits);
	bitmap_and(tmp, map1, map2, nbits);
	weight = bitmap_weight(tmp, nbits);
	bitmap_free(tmp);

   with a single bitmap_weight_and() call.

 - repair cpumask_check() (me)

   After switching cpumask to use nr_cpu_ids, cpumask_check() started
   generating many false-positive warnings. This series fixes it.

 - Add for_each_cpu_andnot() and for_each_cpu_andnot() (Valentin
   Schneider)

   Extends the API with one more function and applies it in sched/core.

* tag 'bitmap-6.1-rc1' of https://github.com/norov/linux: (28 commits)
  sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot()
  lib/test_cpumask: Add for_each_cpu_and(not) tests
  cpumask: Introduce for_each_cpu_andnot()
  lib/find_bit: Introduce find_next_andnot_bit()
  cpumask: fix checking valid cpu range
  lib/bitmap: add tests for for_each() loops
  lib/find: optimize for_each() macros
  lib/bitmap: introduce for_each_set_bit_wrap() macro
  lib/find_bit: add find_next{,_and}_bit_wrap
  cpumask: switch for_each_cpu{,_not} to use for_each_bit()
  net: fix cpu_max_bits_warn() usage in netif_attrmask_next{,_and}
  cpumask: add cpumask_nth_{,and,andnot}
  lib/bitmap: remove bitmap_ord_to_pos
  lib/bitmap: add tests for find_nth_bit()
  lib: add find_nth{,_and,_andnot}_bit()
  lib/bitmap: add bitmap_weight_and()
  lib/bitmap: don't call __bitmap_weight() in kernel code
  tools: sync find_bit() implementation
  lib/find_bit: optimize find_next_bit() functions
  lib/find_bit: create find_first_zero_bit_le()
  ...
2022-10-10 12:49:34 -07:00
Linus Torvalds
30c999937f Scheduler changes for v6.1:
- Debuggability:
 
      - Change most occurances of BUG_ON() to WARN_ON_ONCE()
 
      - Reorganize & fix TASK_ state comparisons, turn it into a bitmap
 
      - Update/fix misc scheduler debugging facilities
 
  - Load-balancing & regular scheduling:
 
      - Improve the behavior of the scheduler in presence of lot of
        SCHED_IDLE tasks - in particular they should not impact other
        scheduling classes.
 
      - Optimize task load tracking, cleanups & fixes
 
      - Clean up & simplify misc load-balancing code
 
  - Freezer:
 
      - Rewrite the core freezer to behave better wrt thawing and be simpler
        in general, by replacing PF_FROZEN with TASK_FROZEN & fixing/adjusting
        all the fallout.
 
  - Deadline scheduler:
 
      - Fix the DL capacity-aware code
 
      - Factor out dl_task_is_earliest_deadline() & replenish_dl_new_period()
 
      - Relax/optimize locking in task_non_contending()
 
  - Cleanups:
 
      - Factor out the update_current_exec_runtime() helper
 
      - Various cleanups, simplifications
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmM/01cRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1geZA/+PB4KC1T9aVxzaTHI36R03YgJYZmIdtxw
 wTf02MixePmz+gQCbepJbempGOh5ST28aOcI0xhdYOql5B63MaUBBMlB0HvGUyDG
 IU3zETqLMRtAbnSTdQFv8m++ECUtZYp8/x1FCel4WO7ya4ETkRu1NRfCoUepEhpZ
 aVAlae9LH3NBaF9t7s0PT2lTjf3pIzMFRkddJ0ywJhbFR3VnWat05fAK+J6fGY8+
 LS54coefNlJD4oDh5TY8uniL1j5SmWmmwbk9Cdj7bLU5P3dFSS0/+5FJNHJPVGDE
 srGT7wstRUcDrN0CnZo48VIUBiApJCCDqTfJYi9wNYd0NAHvwY6MIJJgEIY8mKsI
 L/qH26H81Wt+ezSZ/5JIlGlZ/LIeNaa6OO/fbWEYABBQogvvx3nxsRNUYKSQzumH
 CnSBasBjLnjWyLlK4qARM9cI7NFSEK6NUigrEx/7h8JFu/8T4DlSy6LsF1HUyKgq
 4+FJLAqG6cL0tcwB/fHYd0oRESN8dStnQhGxSojgufwLc7dlFULvCYF5JM/dX+/V
 IKwbOfIOeOn6ViMtSOXAEGdII+IQ2/ZFPwr+8Z5JC7NzvTVL6xlu/3JXkLZR3L7o
 yaXTSaz06h1vil7Z+GRf7RHc+wUeGkEpXh5vnarGZKXivhFdWsBdROIJANK+xR0i
 TeSLCxQxXlU=
 =KjMD
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Debuggability:

   - Change most occurances of BUG_ON() to WARN_ON_ONCE()

   - Reorganize & fix TASK_ state comparisons, turn it into a bitmap

   - Update/fix misc scheduler debugging facilities

  Load-balancing & regular scheduling:

   - Improve the behavior of the scheduler in presence of lot of
     SCHED_IDLE tasks - in particular they should not impact other
     scheduling classes.

   - Optimize task load tracking, cleanups & fixes

   - Clean up & simplify misc load-balancing code

  Freezer:

   - Rewrite the core freezer to behave better wrt thawing and be
     simpler in general, by replacing PF_FROZEN with TASK_FROZEN &
     fixing/adjusting all the fallout.

  Deadline scheduler:

   - Fix the DL capacity-aware code

   - Factor out dl_task_is_earliest_deadline() &
     replenish_dl_new_period()

   - Relax/optimize locking in task_non_contending()

  Cleanups:

   - Factor out the update_current_exec_runtime() helper

   - Various cleanups, simplifications"

* tag 'sched-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
  sched: Fix more TASK_state comparisons
  sched: Fix TASK_state comparisons
  sched/fair: Move call to list_last_entry() in detach_tasks
  sched/fair: Cleanup loop_max and loop_break
  sched/fair: Make sure to try to detach at least one movable task
  sched: Show PF_flag holes
  freezer,sched: Rewrite core freezer logic
  sched: Widen TAKS_state literals
  sched/wait: Add wait_event_state()
  sched/completion: Add wait_for_completion_state()
  sched: Add TASK_ANY for wait_task_inactive()
  sched: Change wait_task_inactive()s match_state
  freezer,umh: Clean up freezer/initrd interaction
  freezer: Have {,un}lock_system_sleep() save/restore flags
  sched: Rename task_running() to task_on_cpu()
  sched/fair: Cleanup for SIS_PROP
  sched/fair: Default to false in test_idle_cores()
  sched/fair: Remove useless check in select_idle_core()
  sched/fair: Avoid double search on same cpu
  sched/fair: Remove redundant check in select_idle_smt()
  ...
2022-10-10 09:10:28 -07:00
Linus Torvalds
513389809e for-6.1/block-2022-10-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67XkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiHoD/9eN+6YnNRPu5+2zeGnnm1Nlwic6YMZeORr
 KFIeC0COMWoFhNBIPFkgAKT+0qIH+uGt5UsHSM3Y5La7wMR8yLxD4PAnvTZ/Ijtt
 yxVIOmonJoQ0OrQ2kTbvDXL/9OCUrzwXXyUIEPJnH0Ca1mxeNOgDHbE7VGF6DMul
 0D3pI8qs2WLnHlDi1V/8kH5qZ6WoAJSDcb8sTzOUVnyveZPNaZhGQJuHA2XAYMtg
 fqKMDJqgmNk6jdTMUgdF5B+rV64PQoCy28I7fXqGkEe+RE5TBy57vAa0XY84V8XR
 /a8CEuwMts2ypk1hIcJG8Vv8K6u5war9yPM5MTngKsoMpzNIlhrhaJQVyjKdcs+E
 Ixwzexu6xTYcrcq+mUARgeTh79FzTBM/uXEdbCG2G3S6HPd6UZWUJZGfxw/l0Aem
 V4xB7lj6SQaJDU1iJCYUaHcekNXhQAPvyVG+R2ED1SO3McTpTPIM1aeigxw6vj7u
 bH3Kfdr94Z8HNuoLuiS6YYfjNt2Shf4LEB6GxKJ9TYHtyhdOyO0H64jGHpygrWqN
 cSnkWPUqUUNpF7srKM0ZgbliCshvmyJc4aMOFd0gBY/kXf5J/j7IXvh8TFCi9rHH
 0KyZH3/3Zsu9geUn3ynznlr4FXU+BcqE6boaa/iWb9sN1m+Rvaahv8cSch/dh44a
 vQNj/iOBQA==
 =R05e
 -----END PGP SIGNATURE-----

Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull requests via Christoph:
      - handle number of queue changes in the TCP and RDMA drivers
        (Daniel Wagner)
      - allow changing the number of queues in nvmet (Daniel Wagner)
      - also consider host_iface when checking ip options (Daniel
        Wagner)
      - don't map pages which can't come from HIGHMEM (Fabio M. De
        Francesco)
      - avoid unnecessary flush bios in nvmet (Guixin Liu)
      - shrink and better pack the nvme_iod structure (Keith Busch)
      - add comment for unaligned "fake" nqn (Linjun Bao)
      - print actual source IP address through sysfs "address" attr
        (Martin Belanger)
      - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang)
      - handle effects after freeing the request (Keith Busch)
      - copy firmware_rev on each init (Keith Busch)
      - restrict management ioctls to admin (Keith Busch)
      - ensure subsystem reset is single threaded (Keith Busch)
      - report the actual number of tagset maps in nvme-pci (Keith
        Busch)
      - small fabrics authentication fixups (Christoph Hellwig)
      - add common code for tagset allocation and freeing (Christoph
        Hellwig)
      - stop using the request_queue in nvmet (Christoph Hellwig)
      - set min_align_mask before calculating max_hw_sectors (Rishabh
        Bhatnagar)
      - send a rediscover uevent when a persistent discovery controller
        reconnects (Sagi Grimberg)
      - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi)

 - MD pull request via Song:
      - Various raid5 fix and clean up, by Logan Gunthorpe and David
        Sloan.
      - Raid10 performance optimization, by Yu Kuai.

 - sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu)

 - IO scheduler switching quisce fix (Keith)

 - s390/dasd block driver updates (Stefan)

 - support for recovery for the ublk driver (ZiyangZhang)

 - rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph)

 - blk-mq and null_blk map fixes (Bart)

 - various bcache fixes (Coly, Jilin, Jules)

 - nbd signal hang fix (Shigeru)

 - block writeback throttling fix (Yu)

 - optimize the passthrough mapping handling (me)

 - prepare block cgroups to being gendisk based (Christoph)

 - get rid of an old PSI hack in the block layer, moving it to the
   callers instead where it belongs (Christoph)

 - blk-throttle fixes and cleanups (Yu)

 - misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj,
   Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming,
   Miaohe, Bart, Coly, Gaosheng

* tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits)
  sbitmap: fix lockup while swapping
  block: add rationale for not using blk_mq_plug() when applicable
  block: adapt blk_mq_plug() to not plug for writes that require a zone lock
  s390/dasd: use blk_mq_alloc_disk
  blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
  nvmet: don't look at the request_queue in nvmet_bdev_set_limits
  nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
  blk-mq: use quiesced elevator switch when reinitializing queues
  block: replace blk_queue_nowait with bdev_nowait
  nvme: remove nvme_ctrl_init_connect_q
  nvme-loop: use the tagset alloc/free helpers
  nvme-loop: store the generic nvme_ctrl in set->driver_data
  nvme-loop: initialize sqsize later
  nvme-fc: use the tagset alloc/free helpers
  nvme-fc: store the generic nvme_ctrl in set->driver_data
  nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
  nvme-rdma: use the tagset alloc/free helpers
  nvme-rdma: store the generic nvme_ctrl in set->driver_data
  nvme-tcp: use the tagset alloc/free helpers
  nvme-tcp: store the generic nvme_ctrl in set->driver_data
  ...
2022-10-07 09:19:14 -07:00
Valentin Schneider
585463f0d5 sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot()
This removes the second use of the sched_core_mask temporary mask.

Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
2022-10-06 05:57:36 -07:00
Linus Torvalds
c79e6fa98c Power management updates for 6.1-rc1
- Add isupport for Tiger Lake in no-HWP mode to intel_pstate (Doug
    Smythies).
 
  - Update the AMD P-state driver (Perry Yuan):
    * Fix wrong lowest perf fetch.
    * Map desired perf into pstate scope for powersave governor.
    * Update pstate frequency transition delay time.
    * Fix initial highest_perf value.
    * Clean up.
 
  - Move max CPU capacity to sugov_policy in the schedutil cpufreq
    governor (Lukasz Luba).
 
  - Add SM6115 to cpufreq-dt blocklist (Adam Skladowski).
 
  - Add support for Tegra239 and minor cleanups (Sumit Gupta, ye xingchen,
    and Yang Yingliang).
 
  - Add freq qos for qcom cpufreq driver and minor cleanups (Xuewen Yan,
    and Viresh Kumar).
 
  - Minor cleanups around functions called at module_init() (Xiu Jianfeng).
 
  - Use module_init and add module_exit for bmips driver (Zhang Jianhua).
 
  - Add AlderLake-N support to intel_idle (Zhang Rui).
 
  - Replace strlcpy() with unused retval with strscpy() in intel_idle
    (Wolfram Sang).
 
  - Remove redundant check from cpuidle_switch_governor() (Yu Liao).
 
  - Replace strlcpy() with unused retval with strscpy() in the powernv
    cpuidle driver (Wolfram Sang).
 
  - Drop duplicate word from a comment in the coupled cpuidle driver
    (Jason Wang).
 
  - Make rpm_resume() return -EINPROGRESS if RPM_NOWAIT is passed to it
    in the flags and the device is about to resume (Rafael Wysocki).
 
  - Add extra debugging statement for multiple active IRQs to system
    wakeup handling code (Mario Limonciello).
 
  - Replace strlcpy() with unused retval with strscpy() in the core
    system suspend support code (Wolfram Sang).
 
  - Update the intel_rapl power capping driver:
    * Use standard Energy Unit for SPR Dram RAPL domain (Zhang Rui).
    * Add support for RAPTORLAKE_S (Zhang Rui).
    * Fix UBSAN shift-out-of-bounds issue (Chao Qin).
 
  - Handle -EPROBE_DEFER when regulator is not probed on
    mtk-ci-devfreq.c (AngeloGioacchino Del Regno).
 
  - Fix message typo and use dev_err_probe() in rockchip-dfi.c
    (Christophe JAILLET).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmM7OrYSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxeKAP/jFiZ1lhTGRngiVLMV6a6SSSy5xzzXZZ
 b/V0oqsuUvWWo6CzVmfU4QfmKGr55+77NgI9Yh5qN6zJTEJmunuCYwVD80KdxPDJ
 8SjMUNCACiVwfryLR1gFJlO+0BN4CWTxvto2gjGxzm0l1UQBACf71wm9MQCP8b7A
 gcBNuOtM7o5NLywDB+/528SiF9AXfZKjkwXhJACimak5yQytaCJaqtOWtcG2KqYF
 USunmqSB3IIVkAa5LJcwloc8wxHYo5mTPaWGGuSA65hfF42k3vJQ2/b8v8oTVza7
 bKzhegErIYtL6B9FjB+P1FyknNOvT7BYr+4RSGLvaPySfjMn1bwz9fM1Epo59Guk
 Azz3ExpaPixDh+x7b89W1Gb751FZU/zlWT+h1CNy5sOP/ChfxgCEBHw0mnWJ2Y0u
 CPcI/Ch0FNQHG+PdbdGlyfvORHVh7te/t6dOhoEHXBue+1r3VkOo8tRGY9x+2IrX
 /JB968u1r0oajF0btGwaDdbbWlyMRTzjrxVl3bwsuz/Kv/0JxsryND2JT0zkKAMZ
 qYT29HQxhdE0Duw1chgAK6X+BsgP58Bu6LeM3mVcwnGPZE9QvcFa0GQh7z+H71AW
 3yOGNmMVMqQSThBYFC6GDi7O2N1UEsLOMV9+ThTRh6D11nU4uiITM5QVIn8nWZGR
 z3IZ52Jg0oeJ
 =+3IL
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These add support for some new hardware, extend the existing hardware
  support, fix some issues and clean up code

  Specifics:

   - Add isupport for Tiger Lake in no-HWP mode to intel_pstate (Doug
     Smythies)

   - Update the AMD P-state driver (Perry Yuan):
      - Fix wrong lowest perf fetch
      - Map desired perf into pstate scope for powersave governor
      - Update pstate frequency transition delay time
      - Fix initial highest_perf value
      - Clean up

   - Move max CPU capacity to sugov_policy in the schedutil cpufreq
     governor (Lukasz Luba)

   - Add SM6115 to cpufreq-dt blocklist (Adam Skladowski)

   - Add support for Tegra239 and minor cleanups (Sumit Gupta, ye
     xingchen, and Yang Yingliang)

   - Add freq qos for qcom cpufreq driver and minor cleanups (Xuewen
     Yan, and Viresh Kumar)

   - Minor cleanups around functions called at module_init() (Xiu
     Jianfeng)

   - Use module_init and add module_exit for bmips driver (Zhang
     Jianhua)

   - Add AlderLake-N support to intel_idle (Zhang Rui)

   - Replace strlcpy() with unused retval with strscpy() in intel_idle
     (Wolfram Sang)

   - Remove redundant check from cpuidle_switch_governor() (Yu Liao)

   - Replace strlcpy() with unused retval with strscpy() in the powernv
     cpuidle driver (Wolfram Sang)

   - Drop duplicate word from a comment in the coupled cpuidle driver
     (Jason Wang)

   - Make rpm_resume() return -EINPROGRESS if RPM_NOWAIT is passed to it
     in the flags and the device is about to resume (Rafael Wysocki)

   - Add extra debugging statement for multiple active IRQs to system
     wakeup handling code (Mario Limonciello)

   - Replace strlcpy() with unused retval with strscpy() in the core
     system suspend support code (Wolfram Sang)

   - Update the intel_rapl power capping driver:
      - Use standard Energy Unit for SPR Dram RAPL domain (Zhang Rui).
      - Add support for RAPTORLAKE_S (Zhang Rui).
      - Fix UBSAN shift-out-of-bounds issue (Chao Qin)

   - Handle -EPROBE_DEFER when regulator is not probed on
     mtk-ci-devfreq.c (AngeloGioacchino Del Regno)

   - Fix message typo and use dev_err_probe() in rockchip-dfi.c
     (Christophe JAILLET)"

* tag 'pm-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (29 commits)
  cpufreq: qcom-cpufreq-hw: Add cpufreq qos for LMh
  cpufreq: Add __init annotation to module init funcs
  cpufreq: tegra194: change tegra239_cpufreq_soc to static
  PM / devfreq: rockchip-dfi: Fix an error message
  PM / devfreq: mtk-cci: Handle sram regulator probe deferral
  powercap: intel_rapl: Use standard Energy Unit for SPR Dram RAPL domain
  PM: runtime: Return -EINPROGRESS from rpm_resume() in the RPM_NOWAIT case
  intel_idle: Add AlderLake-N support
  powercap: intel_rapl: fix UBSAN shift-out-of-bounds issue
  cpufreq: tegra194: Add support for Tegra239
  cpufreq: qcom-cpufreq-hw: Fix uninitialized throttled_freq warning
  cpufreq: intel_pstate: Add Tigerlake support in no-HWP mode
  powercap: intel_rapl: Add support for RAPTORLAKE_S
  cpufreq: amd-pstate: Fix initial highest_perf value
  cpuidle: Remove redundant check in cpuidle_switch_governor()
  PM: wakeup: Add extra debugging statement for multiple active IRQs
  cpufreq: tegra194: Remove the unneeded result variable
  PM: suspend: move from strlcpy() with unused retval to strscpy()
  intel_idle: move from strlcpy() with unused retval to strscpy()
  cpuidle: powernv: move from strlcpy() with unused retval to strscpy()
  ...
2022-10-03 13:26:47 -07:00
Rafael J. Wysocki
0766fa2e8a Merge branch 'pm-cpufreq'
Merge cpufreq changes for 6.1-rc1:

 - Add isupport for Tiger Lake in no-HWP mode to intel_pstate (Doug
   Smythies).

 - Update the AMD P-state driver (Perry Yuan):
   * Fix wrong lowest perf fetch.
   * Map desired perf into pstate scope for powersave governor.
   * Update pstate frequency transition delay time.
   * Fix initial highest_perf value.
   * Clean up.

 - Move max CPU capacity to sugov_policy in the schedutil cpufreq
   governor (Lukasz Luba).

 - Add SM6115 to cpufreq-dt blocklist (Adam Skladowski).

 - Add support for Tegra239 and minor cleanups (Sumit Gupta, ye xingchen,
   and Yang Yingliang).

 - Add freq qos for qcom cpufreq driver and minor cleanups (Xuewen Yan,
   and Viresh Kumar).

 - Minor cleanups around functions called at module_init() (Xiu Jianfeng).

 - Use module_init and add module_exit for bmips driver (Zhang Jianhua).

* pm-cpufreq:
  cpufreq: qcom-cpufreq-hw: Add cpufreq qos for LMh
  cpufreq: Add __init annotation to module init funcs
  cpufreq: tegra194: change tegra239_cpufreq_soc to static
  cpufreq: tegra194: Add support for Tegra239
  cpufreq: qcom-cpufreq-hw: Fix uninitialized throttled_freq warning
  cpufreq: intel_pstate: Add Tigerlake support in no-HWP mode
  cpufreq: amd-pstate: Fix initial highest_perf value
  cpufreq: tegra194: Remove the unneeded result variable
  cpufreq: amd-pstate: update pstate frequency transition delay time
  cpufreq: amd_pstate: map desired perf into pstate scope for powersave governor
  cpufreq: amd_pstate: fix wrong lowest perf fetch
  cpufreq: amd-pstate: fix white-space
  cpufreq: amd-pstate: simplify cpudata pointer assignment
  cpufreq: bmips-cpufreq: Use module_init and add module_exit
  cpufreq: schedutil: Move max CPU capacity to sugov_policy
  cpufreq: Add SM6115 to cpufreq-dt-platdev blocklist
2022-10-03 20:19:13 +02:00
Linus Torvalds
890f242084 RCU pull request for v6.1
This pull request contains the following branches:
 
 doc.2022.08.31b: Documentation updates.  This is the first in a series
 	from an ongoing review of the RCU documentation.  "Why are people
 	thinking -that- about RCU?  Oh.  Because that is an entirely
 	reasonable interpretation of its documentation."
 
 fixes.2022.08.31b: Miscellaneous fixes.
 
 kvfree.2022.08.31b: Improved memory allocation and heuristics.
 
 nocb.2022.09.01a: Improve rcu_nocbs diagnostic output.
 
 poll.2022.08.31b: Add full-sized polled RCU grace period state values.
 	These are the same size as an rcu_head structure, which is double
 	that of the traditional unsigned long state values that may still
 	be obtained from et_state_synchronize_rcu().  The added size
 	avoids missing overlapping grace periods.  This benefit is that
 	call_rcu() can be replaced by polling, which can be attractive
 	in situations where RCU-protected data is aged out of memory.
 
 	Early in the series, the size of this state value is three
 	unsigned longs.  Later in the series, the synchronize_rcu() and
 	synchronize_rcu_expedited() fastpaths are reworked to permit
 	the full state to be represented by only two unsigned longs.
 	This reworking slows these two functions down in SMP kernels
 	running either on single-CPU systems or on systems with all but
 	one CPU offlined, but this should not be a significant problem.
 	And if it somehow becomes a problem in some yet-as-unforeseen
 	situations, three-value state values can be provided for only
 	those situations.
 
 	Finally, a pair of functions named same_state_synchronize_rcu()
 	and same_state_synchronize_rcu_full() allow grace-period state
 	values to be compared for equality.  This permits users to
 	maintain lists of data structures having the same state value,
 	removing the need for per-data-structure grace-period state
 	values, thus decreasing memory footprint.
 
 poll-srcu.2022.08.31b: Polled SRCU grace-period updates, including
 	adding tests to rcutorture and reducing the incidence of Tiny
 	SRCU grace-period-state counter wrap.
 
 tasks.2022.08.31b: Improve Tasks RCU diagnostics and quiescent-state
 	detection.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmM3bxQTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jC0zD/0cPe9Nl3LPKVTqDN8wWG6SUOHcwQrg
 dLBUo1pbBh3mK3HcHzwl1iIF7gd2nmKN7UT3m+C+qv+N3Q9ej9K+MutGThCiRvNT
 A56TDYU9I1xfqoQ25E9TL7nqty818rtYYMl36Rw8epcLKHo/It9MFODb5kEBY5ir
 P5UaIK2D4heHfJL6Di8JDq9vC5a/NlNIIkiIj7lUB+px0FpVW0dUqmnWbIOE74YH
 OBGJ/Mxn6KDO4WeFO0v0DxVaBTLd+khu6W0JspI0szOO6iyTqiDCGE5EqkEdcs5I
 Fk9WCifdo9nrQG0LPIuEBv0YnwNfGbe5nYXupAmGFb3tdCbjkM+W0UBUE032nXog
 3E6m5FEBD1XGQttScFHm70kYssa+xI7khGb9/ZFoYN/QW28oWqwfnx6+eAZGxPNS
 AZx6pc2bebg8sOUhkz/Sv+qMH7CQgIgcMR66SKl5SdT1Onaig45sgdUuC23BshgG
 oEdDxvK7vexFQT6q0oqU8LAO/CVKdyVIswt3pB6CUmn8yNgSo+qDZzlEHt0gPdMY
 4Xa1jnNtOHobDnI4g0JMdVqAujByrRq74ZsVW96hdedKrA0r9y462jnVBm9tqW68
 lu0Lw9WLif2kw0lMY8Q59zqTL+fB8TdNiZrHoqefwvQ/ZrvinfHGSblcrS8zAhX3
 4oVwCUs9pPQRMA==
 =AZPZ
 -----END PGP SIGNATURE-----

Merge tag 'rcu.2022.09.30a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU updates from Paul McKenney:

 - Documentation updates.

   This is the first in a series from an ongoing review of the RCU
   documentation. "Why are people thinking -that- about RCU? Oh. Because
   that is an entirely reasonable interpretation of its documentation."

 - Miscellaneous fixes.

 - Improved memory allocation and heuristics.

 - Improve rcu_nocbs diagnostic output.

 - Add full-sized polled RCU grace period state values.

   These are the same size as an rcu_head structure, which is double
   that of the traditional unsigned long state values that may still be
   obtained from et_state_synchronize_rcu(). The added size avoids
   missing overlapping grace periods. This benefit is that call_rcu()
   can be replaced by polling, which can be attractive in situations
   where RCU-protected data is aged out of memory.

   Early in the series, the size of this state value is three unsigned
   longs. Later in the series, the fastpaths in synchronize_rcu() and
   synchronize_rcu_expedited() are reworked to permit the full state to
   be represented by only two unsigned longs. This reworking slows these
   two functions down in SMP kernels running either on single-CPU
   systems or on systems with all but one CPU offlined, but this should
   not be a significant problem. And if it somehow becomes a problem in
   some yet-as-unforeseen situations, three-value state values can be
   provided for only those situations.

   Finally, a pair of functions named same_state_synchronize_rcu() and
   same_state_synchronize_rcu_full() allow grace-period state values to
   be compared for equality. This permits users to maintain lists of
   data structures having the same state value, removing the need for
   per-data-structure grace-period state values, thus decreasing memory
   footprint.

 - Polled SRCU grace-period updates, including adding tests to
   rcutorture and reducing the incidence of Tiny SRCU grace-period-state
   counter wrap.

 - Improve Tasks RCU diagnostics and quiescent-state detection.

* tag 'rcu.2022.09.30a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (55 commits)
  rcutorture: Use the barrier operation specified by cur_ops
  rcu-tasks: Make RCU Tasks Trace check for userspace execution
  rcu-tasks: Ensure RCU Tasks Trace loops have quiescent states
  rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE()
  srcu: Make Tiny SRCU use full-sized grace-period counters
  srcu: Make Tiny SRCU poll_state_synchronize_srcu() more precise
  srcu: Add GP and maximum requested GP to Tiny SRCU rcutorture output
  rcutorture: Make "srcud" option also test polled grace-period API
  rcutorture: Limit read-side polling-API testing
  rcu: Add functions to compare grace-period state values
  rcutorture: Expand rcu_torture_write_types() first "if" statement
  rcutorture: Use 1-suffixed variable in rcu_torture_write_types() check
  rcu: Make synchronize_rcu() fastpath update only boot-CPU counters
  rcutorture: Adjust rcu_poll_need_2gp() for rcu_gp_oldstate field removal
  rcu: Remove ->rgos_polled field from rcu_gp_oldstate structure
  rcu: Make synchronize_rcu_expedited() fast path update .expedited_sequence
  rcu: Remove expedited grace-period fast-path forward-progress helper
  rcu: Make synchronize_rcu() fast path update ->gp_seq counters
  rcu-tasks: Remove grace-period fast-path rcu-tasks helper
  rcu: Set rcu_data structures' initial ->gpwrap value to true
  ...
2022-10-03 10:11:11 -07:00
Peter Zijlstra
5aec788aeb sched: Fix TASK_state comparisons
Task state is fundamentally a bitmask; direct comparisons are probably
not working as intended. Specifically the normal wait-state have
a number of possible modifiers:

  TASK_UNINTERRUPTIBLE:	TASK_WAKEKILL, TASK_NOLOAD, TASK_FREEZABLE
  TASK_INTERRUPTIBLE:   TASK_FREEZABLE

Specifically, the addition of TASK_FREEZABLE wrecked
__wait_is_interruptible(). This however led to an audit of direct
comparisons yielding the rest of the changes.

Fixes: f5d39b0208 ("freezer,sched: Rewrite core freezer logic")
Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Debugged-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
2022-09-28 10:00:16 +02:00
Matthew Wilcox (Oracle)
0cd4d02c32 sched: use maple tree iterator to walk VMAs
The linked list is slower than walking the VMAs using the maple tree.  We
can't use the VMA iterator here because it doesn't support moving to an
earlier position.

Link: https://lkml.kernel.org/r/20220906194824.2110408-49-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:22 -07:00
Aneesh Kumar K.V
467b171af8 mm/demotion: update node_is_toptier to work with memory tiers
With memory tier support we can have memory only NUMA nodes in the top
tier from which we want to avoid promotion tracking NUMA faults.  Update
node_is_toptier to work with memory tiers.  All NUMA nodes are by default
top tier nodes.  With lower(slower) memory tiers added we consider all
memory tiers above a memory tier having CPU NUMA nodes as a top memory
tier

[sj@kernel.org: include missed header file, memory-tiers.h]
  Link: https://lkml.kernel.org/r/20220820190720.248704-1-sj@kernel.org
[akpm@linux-foundation.org: mm/memory.c needs linux/memory-tiers.h]
[aneesh.kumar@linux.ibm.com: make toptier_distance inclusive upper bound of toptiers]
  Link: https://lkml.kernel.org/r/20220830081457.118960-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20220818131042.113280-10-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hesham Almatary <hesham.almatary@huawei.com>
Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:12 -07:00
Yu Zhao
bd74fdaea1 mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page tables
to search for young PTEs and promote hot pages.  A kill switch will be
added in the next patch to disable this behavior.  When disabled, the
aging relies on the rmap only.

NB: this behavior has nothing similar with the page table scanning in the
2.4 kernel [1], which searches page tables for old PTEs, adds cold pages
to swapcache and unmaps them.

To avoid confusion, the term "iteration" specifically means the traversal
of an entire mm_struct list; the term "walk" will be applied to page
tables and the rmap, as usual.

An mm_struct list is maintained for each memcg, and an mm_struct follows
its owner task to the new memcg when this task is migrated.  Given an
lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot pages
before it increments max_seq.

When multiple page table walkers iterate the same list, each of them gets
a unique mm_struct; therefore they can run concurrently.  Page table
walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
pages it left in the previous memcg will not be promoted when its current
memcg is under reclaim.  Similarly, page table walkers will not promote
pages from nodes other than the one under reclaim.

This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
   page table walkers can skip processes that have been sleeping since
   the last iteration.
2. It uses generational Bloom filters to record populated branches so
   that page table walkers can reduce their search space based on the
   query results, e.g., to skip page tables containing mostly holes or
   misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
   CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
   spanning multiple VMAs. IOW, it finishes all the VMAs within the
   range of the same PMD table before it returns to a PGD table. This
   improves the cache performance for workloads that have large
   numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[8, 10]%
                Ops/sec      KB/sec
      patch1-7: 1147696.57   44640.29
      patch1-8: 1245274.91   48435.66

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-7
      48.16%  lzo1x_1_do_compress (real work)
       8.20%  page_vma_mapped_walk (overhead)
       7.06%  _raw_spin_unlock_irq
       2.92%  ptep_clear_flush
       2.53%  __zram_bvec_write
       2.11%  do_raw_spin_lock
       2.02%  memmove
       1.93%  lru_gen_look_around
       1.56%  free_unref_page_list
       1.40%  memset

    patch1-8
      49.44%  lzo1x_1_do_compress (real work)
       6.19%  page_vma_mapped_walk (overhead)
       5.97%  _raw_spin_unlock_irq
       3.13%  get_pfn_folio
       2.85%  ptep_clear_flush
       2.42%  __zram_bvec_write
       2.08%  do_raw_spin_lock
       1.92%  memmove
       1.44%  alloc_zspage
       1.36%  memset

  Configurations:
    no change

Thanks to the following developers for their efforts [3].
  kernel test robot <lkp@intel.com>

[1] https://lwn.net/Articles/23732/
[2] https://llvm.org/docs/ScudoHardenedAllocator.html
[3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/

Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:09 -07:00
Christoph Hellwig
527eb453bb sched/psi: export psi_memstall_{enter,leave}
To properly account for all refaults from file system logic, file systems
need to call psi_memstall_enter directly, so export it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220915094200.139713-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-20 08:24:38 -06:00
Vincent Guittot
7e9518baed sched/fair: Move call to list_last_entry() in detach_tasks
Move the call to list_last_entry() in detach_tasks() after testing
loop_max and loop_break.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220825122726.20819-4-vincent.guittot@linaro.org
2022-09-15 16:13:52 +02:00
Vincent Guittot
c59862f826 sched/fair: Cleanup loop_max and loop_break
sched_nr_migrate_break is set to a fix value and never changes so we can
replace it by a define SCHED_NR_MIGRATE_BREAK.

Also, we adjust SCHED_NR_MIGRATE_BREAK to be aligned with the init value
of sysctl_sched_nr_migrate which can be init to different values.

Then, use SCHED_NR_MIGRATE_BREAK to init sysctl_sched_nr_migrate.

The behavior stays unchanged unless you modify sysctl_sched_nr_migrate
trough debugfs.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220825122726.20819-3-vincent.guittot@linaro.org
2022-09-15 16:13:51 +02:00
Vincent Guittot
b0defa7ae0 sched/fair: Make sure to try to detach at least one movable task
During load balance, we try at most env->loop_max time to move a task.
But it can happen that the loop_max LRU tasks (ie tail of
the cfs_tasks list) can't be moved to dst_cpu because of affinity.
In this case, loop in the list until we found at least one.

The maximum of detached tasks remained the same as before.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220825122726.20819-2-vincent.guittot@linaro.org
2022-09-15 16:13:51 +02:00
Huang Ying
c959924b0d memory tiering: adjust hot threshold automatically
The promotion hot threshold is workload and system configuration
dependent.  So in this patch, a method to adjust the hot threshold
automatically is implemented.  The basic idea is to control the number of
the candidate promotion pages to match the promotion rate limit.  If the
hint page fault latency of a page is less than the hot threshold, we will
try to promote the page, and the page is called the candidate promotion
page.

If the number of the candidate promotion pages in the statistics interval
is much more than the promotion rate limit, the hot threshold will be
decreased to reduce the number of the candidate promotion pages. 
Otherwise, the hot threshold will be increased to increase the number of
the candidate promotion pages.

To make the above method works, in each statistics interval, the total
number of the pages to check (on which the hint page faults occur) and the
hot/cold distribution need to be stable.  Because the page tables are
scanned linearly in NUMA balancing, but the hot/cold distribution isn't
uniform along the address usually, the statistics interval should be
larger than the NUMA balancing scan period.  So in the patch, the max scan
period is used as statistics interval and it works well in our tests.

Link: https://lkml.kernel.org/r/20220713083954.34196-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: osalvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11 20:25:54 -07:00
Huang Ying
c6833e1000 memory tiering: rate limit NUMA migration throughput
In NUMA balancing memory tiering mode, if there are hot pages in slow
memory node and cold pages in fast memory node, we need to promote/demote
hot/cold pages between the fast and cold memory nodes.

A choice is to promote/demote as fast as possible.  But the CPU cycles and
memory bandwidth consumed by the high promoting/demoting throughput will
hurt the latency of some workload because of accessing inflating and slow
memory bandwidth contention.

A way to resolve this issue is to restrict the max promoting/demoting
throughput.  It will take longer to finish the promoting/demoting.  But
the workload latency will be better.  This is implemented in this patch as
the page promotion rate limit mechanism.

The number of the candidate pages to be promoted to the fast memory node
via NUMA balancing is counted, if the count exceeds the limit specified by
the users, the NUMA balancing promotion will be stopped until the next
second.

A new sysctl knob kernel.numa_balancing_promote_rate_limit_MBps is added
for the users to specify the limit.

Link: https://lkml.kernel.org/r/20220713083954.34196-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: osalvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11 20:25:54 -07:00
Huang Ying
33024536ba memory tiering: hot page selection with hint page fault latency
Patch series "memory tiering: hot page selection", v4.

To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory nodes need to be identified. 
Essentially, the original NUMA balancing implementation selects the mostly
recently accessed (MRU) pages to promote.  But this isn't a perfect
algorithm to identify the hot pages.  Because the pages with quite low
access frequency may be accessed eventually given the NUMA balancing page
table scanning period could be quite long (e.g.  60 seconds).  So in this
patchset, we implement a new hot page identification algorithm based on
the latency between NUMA balancing page table scanning and hint page
fault.  Which is a kind of mostly frequently accessed (MFU) algorithm.

In NUMA balancing memory tiering mode, if there are hot pages in slow
memory node and cold pages in fast memory node, we need to promote/demote
hot/cold pages between the fast and cold memory nodes.

A choice is to promote/demote as fast as possible.  But the CPU cycles and
memory bandwidth consumed by the high promoting/demoting throughput will
hurt the latency of some workload because of accessing inflating and slow
memory bandwidth contention.

A way to resolve this issue is to restrict the max promoting/demoting
throughput.  It will take longer to finish the promoting/demoting.  But
the workload latency will be better.  This is implemented in this patchset
as the page promotion rate limit mechanism.

The promotion hot threshold is workload and system configuration
dependent.  So in this patchset, a method to adjust the hot threshold
automatically is implemented.  The basic idea is to control the number of
the candidate promotion pages to match the promotion rate limit.

We used the pmbench memory accessing benchmark tested the patchset on a
2-socket server system with DRAM and PMEM installed.  The test results are
as follows,

		pmbench score		promote rate
		 (accesses/s)			MB/s
		-------------		------------
base		  146887704.1		       725.6
hot selection     165695601.2		       544.0
rate limit	  162814569.8		       165.2
auto adjustment	  170495294.0                  136.9

From the results above,

With hot page selection patch [1/3], the pmbench score increases about
12.8%, and promote rate (overhead) decreases about 25.0%, compared with
base kernel.

With rate limit patch [2/3], pmbench score decreases about 1.7%, and
promote rate decreases about 69.6%, compared with hot page selection
patch.

With threshold auto adjustment patch [3/3], pmbench score increases about
4.7%, and promote rate decrease about 17.1%, compared with rate limit
patch.

Baolin helped to test the patchset with MySQL on a machine which contains
1 DRAM node (30G) and 1 PMEM node (126G).

sysbench /usr/share/sysbench/oltp_read_write.lua \
......
--tables=200 \
--table-size=1000000 \
--report-interval=10 \
--threads=16 \
--time=120

The tps can be improved about 5%.


This patch (of 3):

To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory node need to be identified.  Essentially,
the original NUMA balancing implementation selects the mostly recently
accessed (MRU) pages to promote.  But this isn't a perfect algorithm to
identify the hot pages.  Because the pages with quite low access frequency
may be accessed eventually given the NUMA balancing page table scanning
period could be quite long (e.g.  60 seconds).  The most frequently
accessed (MFU) algorithm is better.

So, in this patch we implemented a better hot page selection algorithm. 
Which is based on NUMA balancing page table scanning and hint page fault
as follows,

- When the page tables of the processes are scanned to change PTE/PMD
  to be PROT_NONE, the current time is recorded in struct page as scan
  time.

- When the page is accessed, hint page fault will occur.  The scan
  time is gotten from the struct page.  And The hint page fault
  latency is defined as

    hint page fault time - scan time

The shorter the hint page fault latency of a page is, the higher the
probability of their access frequency to be higher.  So the hint page
fault latency is a better estimation of the page hot/cold.

It's hard to find some extra space in struct page to hold the scan time. 
Fortunately, we can reuse some bits used by the original NUMA balancing.

NUMA balancing uses some bits in struct page to store the page accessing
CPU and PID (referring to page_cpupid_xchg_last()).  Which is used by the
multi-stage node selection algorithm to avoid to migrate pages shared
accessed by the NUMA nodes back and forth.  But for pages in the slow
memory node, even if they are shared accessed by multiple NUMA nodes, as
long as the pages are hot, they need to be promoted to the fast memory
node.  So the accessing CPU and PID information are unnecessary for the
slow memory pages.  We can reuse these bits in struct page to record the
scan time.  For the fast memory pages, these bits are used as before.

For the hot threshold, the default value is 1 second, which works well in
our performance test.  All pages with hint page fault latency < hot
threshold will be considered hot.

It's hard for users to determine the hot threshold.  So we don't provide a
kernel ABI to set it, just provide a debugfs interface for advanced users
to experiment.  We will continue to work on a hot threshold automatic
adjustment mechanism.

The downside of the above method is that the response time to the workload
hot spot changing may be much longer.  For example,

- A previous cold memory area becomes hot

- The hint page fault will be triggered.  But the hint page fault
  latency isn't shorter than the hot threshold.  So the pages will
  not be promoted.

- When the memory area is scanned again, maybe after a scan period,
  the hint page fault latency measured will be shorter than the hot
  threshold and the pages will be promoted.

To mitigate this, if there are enough free space in the fast memory node,
the hot threshold will not be used, all pages will be promoted upon the
hint page fault for fast response.

Thanks Zhong Jiang reported and tested the fix for a bug when disabling
memory tiering mode dynamically.

Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Wei Xu <weixugc@google.com>
Cc: osalvador <osalvador@suse.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11 20:25:54 -07:00
Linus Torvalds
e35be05d74 Driver core fixes for 6.0-rc5
Here are some small driver core and debugfs fixes for 6.0-rc5.
 
 Included in here are:
   - multiple attempts to get the arch_topology code to work properly on
     non-cluster SMT systems.  First attempt caused build breakages in
     linux-next and 0-day, second try worked.
   - debugfs fixes for a long-suffering memory leak.  The pattern of
     debugfs_remove(debugfs_lookup(...)) turns out to leak dentries, so
     add debugfs_lookup_and_remove() to fix this problem.  Also fix up
     the scheduler debug code that highlighted this problem.  Fixes for
     other subsystems will be trickling in over the next few months for
     this same issue once the debugfs function is merged.
 
 All of these have been in linux-next since Wednesday with no reported
 problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYxuERw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylPqwCgjU6xlN2y/80HH+66k+yyzlxocE8AoLPgnGrA
 dJZIGWFXExzO26tvMT52
 =zGHA
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
 "Here are some small driver core and debugfs fixes for 6.0-rc5.

  Included in here are:

   - multiple attempts to get the arch_topology code to work properly on
     non-cluster SMT systems. First attempt caused build breakages in
     linux-next and 0-day, second try worked.

   - debugfs fixes for a long-suffering memory leak. The pattern of
     debugfs_remove(debugfs_lookup(...)) turns out to leak dentries, so
     add debugfs_lookup_and_remove() to fix this problem. Also fix up
     the scheduler debug code that highlighted this problem. Fixes for
     other subsystems will be trickling in over the next few months for
     this same issue once the debugfs function is merged.

  All of these have been in linux-next since Wednesday with no reported
  problems"

* tag 'driver-core-6.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  arch_topology: Make cluster topology span at least SMT CPUs
  sched/debug: fix dentry leak in update_sched_domain_debugfs
  debugfs: add debugfs_lookup_and_remove()
  driver core: fix driver_set_override() issue with empty strings
  Revert "arch_topology: Make cluster topology span at least SMT CPUs"
  arch_topology: Make cluster topology span at least SMT CPUs
2022-09-09 15:08:40 -04:00
Chengming Zhou
34f26a1561 sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.

commit 3958e2d0c3 ("cgroup: make per-cgroup pressure stall tracking configurable")
make PSI to skip per-cgroup stall accounting, only account system-wide
to avoid this each level overhead.

But for our use case, we also want leaf cgroup PSI stats accounted for
userspace adjustment on that cgroup, apart from only system-wide adjustment.

So this patch introduce a per-cgroup PSI accounting disable/re-enable
interface "cgroup.pressure", which is a read-write single value file that
allowed values are "0" and "1", the defaults is "1" so per-cgroup
PSI stats is enabled by default.

Implementation details:

It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level,
if we can live with losing history from while it was disabled.
I.e. the avgs will restart from 0, total= will have gaps.

But it's hard or complex to stop/restart groupc->tasks[] updates,
which is not implemented in this patch. So we always update
groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when
the cgroup PSI stats is disabled.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
2022-09-09 11:08:33 +02:00
Chengming Zhou
dc86aba751 sched/psi: Cache parent psi_group to speed up group iteration
We use iterate_groups() to iterate each level psi_group to update
PSI stats, which is a very hot path.

In current code, iterate_groups() have to use multiple branches and
cgroup_parent() to get parent psi_group for each level, which is not
very efficient.

This patch cache parent psi_group in struct psi_group, only need to get
psi_group of task itself first, then just use group->parent to iterate.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220825164111.29534-10-zhouchengming@bytedance.com
2022-09-09 11:08:33 +02:00
Chengming Zhou
52b1364ba0 sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
Now PSI already tracked workload pressure stall information for
CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have
obvious impact on some workload productivity, such as web service
workload.

When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time
from update_rq_clock_task(), in which we can record that delta
to CPU curr task's cgroups as PSI_IRQ_FULL status.

Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in
the current task on the CPU, make nothing productive could run
even if it were runnable, so we only use PSI_IRQ_FULL.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220825164111.29534-8-zhouchengming@bytedance.com
2022-09-09 11:08:32 +02:00
Johannes Weiner
71dbdde791 sched/psi: Remove NR_ONCPU task accounting
We put all fields updated by the scheduler in the first cacheline of
struct psi_group_cpu for performance.

Since we want add another PSI_IRQ_FULL to track IRQ/SOFTIRQ pressure,
we need to reclaim space first. This patch remove NR_ONCPU task accounting
in struct psi_group_cpu, use one bit in state_mask to track instead.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/r/20220825164111.29534-7-zhouchengming@bytedance.com
2022-09-09 11:08:32 +02:00
Chengming Zhou
65176f59a1 sched/psi: Optimize task switch inside shared cgroups again
Way back when PSI_MEM_FULL was accounted from the timer tick, task
switching could simply iterate next and prev to the common ancestor to
update TSK_ONCPU and be done.

Then memstall ticks were replaced with checking curr->in_memstall
directly in psi_group_change(). That meant that now if the task switch
was between a memstall and a !memstall task, we had to iterate through
the common ancestors at least ONCE to fix up their state_masks.

We added the identical_state filter to make sure the common ancestor
elimination was skipped in that case. It seems that was always a
little too eager, because it caused us to walk the common ancestors
*twice* instead of the required once: the iteration for next could
have stopped at the common ancestor; prev could have updated TSK_ONCPU
up to the common ancestor, then finish to the root without changing
any flags, just to get the new curr->in_memstall into the state_masks.

This patch recognizes this and makes it so that we walk to the root
exactly once if state_mask needs updating, which is simply catching up
on a missed optimization that could have been done in commit 7fae6c8171
("psi: Use ONCPU state tracking machinery to detect reclaim") directly.

Apart from this, it's also necessary for the next patch "sched/psi: remove
NR_ONCPU task accounting". Suppose we walk the common ancestors twice:

(1) psi_group_change(.clear = 0, .set = TSK_ONCPU)
(2) psi_group_change(.clear = TSK_ONCPU, .set = 0)

We previously used tasks[NR_ONCPU] to record TSK_ONCPU, tasks[NR_ONCPU]++
in (1) then tasks[NR_ONCPU]-- in (2), so tasks[NR_ONCPU] still be correct.

The next patch change to use one bit in state mask to record TSK_ONCPU,
PSI_ONCPU bit will be set in (1), but then be cleared in (2), which cause
the psi_group_cpu has task running on CPU but without PSI_ONCPU bit set!

With this patch, we will never walk the common ancestors twice, so won't
have above problem.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220825164111.29534-6-zhouchengming@bytedance.com
2022-09-09 11:08:32 +02:00
Chengming Zhou
d79ddb069c sched/psi: Move private helpers to sched/stats.h
This patch move psi_task_change/psi_task_switch declarations out of
PSI public header, since they are only needed for implementing the
PSI stats tracking in sched/stats.h

psi_task_switch is obvious, psi_task_change can't be public helper
since it doesn't check psi_disabled static key. And there is no
any user now, so put it in sched/stats.h too.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220825164111.29534-5-zhouchengming@bytedance.com
2022-09-09 11:08:31 +02:00
Chengming Zhou
e2ad8ab04c sched/psi: Save percpu memory when !psi_cgroups_enabled
We won't use cgroup psi_group when !psi_cgroups_enabled, so don't
bother to alloc percpu memory and init for it.

Also don't need to migrate task PSI stats between cgroups in
cgroup_move_task().

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220825164111.29534-4-zhouchengming@bytedance.com
2022-09-09 11:08:31 +02:00
Chengming Zhou
c530a3c716 sched/psi: Fix periodic aggregation shut off
We don't want to wake periodic aggregation work back up if the
task change is the aggregation worker itself going to sleep, or
we'll ping-pong forever.

Previously, we would use psi_task_change() in psi_dequeue() when
task going to sleep, so this check was put in psi_task_change().

But commit 4117cebf1a ("psi: Optimize task switch inside shared cgroups")
defer task sleep handling to psi_task_switch(), won't go through
psi_task_change() anymore.

So this patch move this check to psi_task_switch().

Fixes: 4117cebf1a ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220825164111.29534-2-zhouchengming@bytedance.com
2022-09-09 11:08:30 +02:00
Peter Zijlstra
f5d39b0208 freezer,sched: Rewrite core freezer logic
Rewrite the core freezer to behave better wrt thawing and be simpler
in general.

By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
ensured frozen tasks stay frozen until thawed and don't randomly wake
up early, as is currently possible.

As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up
two PF_flags (yay!).

Specifically; the current scheme works a little like:

	freezer_do_not_count();
	schedule();
	freezer_count();

And either the task is blocked, or it lands in try_to_freezer()
through freezer_count(). Now, when it is blocked, the freezer
considers it frozen and continues.

However, on thawing, once pm_freezing is cleared, freezer_count()
stops working, and any random/spurious wakeup will let a task run
before its time.

That is, thawing tries to thaw things in explicit order; kernel
threads and workqueues before doing bringing SMP back before userspace
etc.. However due to the above mentioned races it is entirely possible
for userspace tasks to thaw (by accident) before SMP is back.

This can be a fatal problem in asymmetric ISA architectures (eg ARMv9)
where the userspace task requires a special CPU to run.

As said; replace this with a special task state TASK_FROZEN and add
the following state transitions:

	TASK_FREEZABLE	-> TASK_FROZEN
	__TASK_STOPPED	-> TASK_FROZEN
	__TASK_TRACED	-> TASK_FROZEN

The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL
(IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state
is already required to deal with spurious wakeups and the freezer
causes one such when thawing the task (since the original state is
lost).

The special __TASK_{STOPPED,TRACED} states *can* be restored since
their canonical state is in ->jobctl.

With this, frozen tasks need an explicit TASK_FROZEN wakeup and are
free of undue (early / spurious) wakeups.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org
2022-09-07 21:53:50 +02:00
Peter Zijlstra
929659acea sched/completion: Add wait_for_completion_state()
Allows waiting with a custom @state.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220822114648.922711674@infradead.org
2022-09-07 21:53:49 +02:00
Peter Zijlstra
f9fc8cad97 sched: Add TASK_ANY for wait_task_inactive()
Now that wait_task_inactive()'s @match_state argument is a mask (like
ttwu()) it is possible to replace the special !match_state case with
an 'all-states' value such that any blocked state will match.

Suggested-by: Ingo Molnar (mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YxhkzfuFTvRnpUaH@hirez.programming.kicks-ass.net
2022-09-07 21:53:49 +02:00
Peter Zijlstra
9204a97f7a sched: Change wait_task_inactive()s match_state
Make wait_task_inactive()'s @match_state work like ttwu()'s @state.

That is, instead of an equal comparison, use it as a mask. This allows
matching multiple block conditions.

(removes the unlikely; it doesn't make sense how it's only part of the
condition)

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220822114648.856734578@infradead.org
2022-09-07 21:53:48 +02:00
Peter Zijlstra
0b9d46fc5e sched: Rename task_running() to task_on_cpu()
There is some ambiguity about task_running() in that it is unrelated
to TASK_RUNNING but instead tests ->on_cpu. As such, rename the thing
task_on_cpu().

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/Yxhkhn55uHZx+NGl@hirez.programming.kicks-ass.net
2022-09-07 21:53:47 +02:00
Abel Wu
96c1c0cfe4 sched/fair: Cleanup for SIS_PROP
The sched-domain of this cpu is only used for some heuristics when
SIS_PROP is enabled, and it should be irrelevant whether the local
sd_llc is valid or not, since all we care about is target sd_llc
if !SIS_PROP.

Access the local domain only when there is a need.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20220907112000.1854-6-wuyun.abel@bytedance.com
2022-09-07 21:53:47 +02:00
Abel Wu
398ba2b0cc sched/fair: Default to false in test_idle_cores()
It's uncertain whether idle cores exist or not if shared sched-
domains are not ready, so returning "no idle cores" usually
makes sense.

While __update_idle_core() is an exception, it checks status
of this core and set hint to shared sched-domain if necessary.
So the whole logic of this function depends on the existence
of shared sched-domain, and can certainly bail out early if
it is not available.

It's somehow a little tricky, and as Josh suggested that it
should be transient while the domain isn't ready. So remove
the self-defined default value to make things more clearer.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Don <joshdon@google.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/r/20220907112000.1854-5-wuyun.abel@bytedance.com
2022-09-07 21:53:47 +02:00
Abel Wu
8eeeed9c4a sched/fair: Remove useless check in select_idle_core()
The function select_idle_core() only gets called when has_idle_cores
is true which can be possible only when sched_smt_present is enabled.

This change also aligns select_idle_core() with select_idle_smt() in
the way that the caller do the check if necessary.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/r/20220907112000.1854-4-wuyun.abel@bytedance.com
2022-09-07 21:53:46 +02:00
Abel Wu
b9bae70440 sched/fair: Avoid double search on same cpu
The prev cpu is checked at the beginning of SIS, and it's unlikely
to be idle before the second check in select_idle_smt(). So we'd
better focus on its SMT siblings.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Don <joshdon@google.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/r/20220907112000.1854-3-wuyun.abel@bytedance.com
2022-09-07 21:53:46 +02:00
Abel Wu
3e6efe87cd sched/fair: Remove redundant check in select_idle_smt()
If two cpus share LLC cache, then the two cores they belong to
are also in the same LLC domain.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Don <joshdon@google.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/r/20220907112000.1854-2-wuyun.abel@bytedance.com
2022-09-07 21:53:46 +02:00
Greg Kroah-Hartman
c2e4065965 sched/debug: fix dentry leak in update_sched_domain_debugfs
Kuyo reports that the pattern of using debugfs_remove(debugfs_lookup())
leaks a dentry and with a hotplug stress test, the machine eventually
runs out of memory.

Fix this up by using the newly created debugfs_lookup_and_remove() call
instead which properly handles the dentry reference counting logic.

Cc: Major Chen <major.chen@samsung.com>
Cc: stable <stable@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Reported-by: Kuyo Chang <kuyo.chang@mediatek.com>
Tested-by: Kuyo Chang <kuyo.chang@mediatek.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220902123107.109274-2-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-05 13:02:38 +02:00
Shang XiaoJing
33f9352579 sched/deadline: Move __dl_clear_params out of dl_bw lock
As members in sched_dl_entity are independent with dl_bw, move
__dl_clear_params out of dl_bw lock.

Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Link: https://lore.kernel.org/r/20220827020911.30641-1-shangxiaojing@huawei.com
2022-09-01 11:19:55 +02:00
Shang XiaoJing
96458e7f7d sched/deadline: Add replenish_dl_new_period helper
Wrap repeated code in helper function replenish_dl_new_period, which set
the deadline and runtime of input dl_se based on pi_of(dl_se). Note that
setup_new_dl_entity originally set the deadline and runtime base on
dl_se, which should equals to pi_of(dl_se) for non-boosted task.

Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Link: https://lore.kernel.org/r/20220826100037.12146-1-shangxiaojing@huawei.com
2022-09-01 11:19:54 +02:00
Shang XiaoJing
973bee493a sched/deadline: Add dl_task_is_earliest_deadline helper
Wrap repeated code in helper function dl_task_is_earliest_deadline, which
return true if there is no deadline task on the rq at all, or task's
deadline earlier than the whole rq.

Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Link: https://lore.kernel.org/r/20220826083453.698-1-shangxiaojing@huawei.com
2022-09-01 11:19:54 +02:00
Zhen Lei
bc1cca97e6 sched/debug: Show the registers of 'current' in dump_cpu_task()
The dump_cpu_task() function does not print registers on architectures
that do not support NMIs.  However, registers can be useful for
debugging.  Fortunately, in the case where dump_cpu_task() is invoked
from an interrupt handler and is dumping the current CPU's stack, the
get_irq_regs() function can be used to get the registers.

Therefore, this commit makes dump_cpu_task() check to see if it is being
asked to dump the current CPU's stack from within an interrupt handler,
and, if so, it uses the get_irq_regs() function to obtain the registers.
On systems that do support NMIs, this commit has the further advantage
of avoiding a self-NMI in this case.

This is an example of rcu self-detected stall on arm64, which does not
support NMIs:
[   27.501721] rcu: INFO: rcu_preempt self-detected stall on CPU
[   27.502238] rcu:     0-....: (1250 ticks this GP) idle=4f7/1/0x4000000000000000 softirq=2594/2594 fqs=619
[   27.502632]  (t=1251 jiffies g=2989 q=29 ncpus=4)
[   27.503845] CPU: 0 PID: 306 Comm: test0 Not tainted 5.19.0-rc7-00009-g1c1a6c29ff99-dirty #46
[   27.504732] Hardware name: linux,dummy-virt (DT)
[   27.504947] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   27.504998] pc : arch_counter_read+0x18/0x24
[   27.505301] lr : arch_counter_read+0x18/0x24
[   27.505328] sp : ffff80000b29bdf0
[   27.505345] x29: ffff80000b29bdf0 x28: 0000000000000000 x27: 0000000000000000
[   27.505475] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[   27.505553] x23: 0000000000001f40 x22: ffff800009849c48 x21: 000000065f871ae0
[   27.505627] x20: 00000000000025ec x19: ffff80000a6eb300 x18: ffffffffffffffff
[   27.505654] x17: 0000000000000001 x16: 0000000000000000 x15: ffff80000a6d0296
[   27.505681] x14: ffffffffffffffff x13: ffff80000a29bc18 x12: 0000000000000426
[   27.505709] x11: 0000000000000162 x10: ffff80000a2f3c18 x9 : ffff80000a29bc18
[   27.505736] x8 : 00000000ffffefff x7 : ffff80000a2f3c18 x6 : 00000000759bd013
[   27.505761] x5 : 01ffffffffffffff x4 : 0002dc6c00000000 x3 : 0000000000000017
[   27.505787] x2 : 00000000000025ec x1 : ffff80000b29bdf0 x0 : 0000000075a30653
[   27.505937] Call trace:
[   27.506002]  arch_counter_read+0x18/0x24
[   27.506171]  ktime_get+0x48/0xa0
[   27.506207]  test_task+0x70/0xf0
[   27.506227]  kthread+0x10c/0x110
[   27.506243]  ret_from_fork+0x10/0x20

This is a marked improvement over the old output:
[   27.944550] rcu: INFO: rcu_preempt self-detected stall on CPU
[   27.944980] rcu:     0-....: (1249 ticks this GP) idle=cbb/1/0x4000000000000000 softirq=2610/2610 fqs=614
[   27.945407]  (t=1251 jiffies g=2681 q=28 ncpus=4)
[   27.945731] Task dump for CPU 0:
[   27.945844] task:test0           state:R  running task     stack:    0 pid:  306 ppid:     2 flags:0x0000000a
[   27.946073] Call trace:
[   27.946151]  dump_backtrace.part.0+0xc8/0xd4
[   27.946378]  show_stack+0x18/0x70
[   27.946405]  sched_show_task+0x150/0x180
[   27.946427]  dump_cpu_task+0x44/0x54
[   27.947193]  rcu_dump_cpu_stacks+0xec/0x130
[   27.947212]  rcu_sched_clock_irq+0xb18/0xef0
[   27.947231]  update_process_times+0x68/0xac
[   27.947248]  tick_sched_handle+0x34/0x60
[   27.947266]  tick_sched_timer+0x4c/0xa4
[   27.947281]  __hrtimer_run_queues+0x178/0x360
[   27.947295]  hrtimer_interrupt+0xe8/0x244
[   27.947309]  arch_timer_handler_virt+0x38/0x4c
[   27.947326]  handle_percpu_devid_irq+0x88/0x230
[   27.947342]  generic_handle_domain_irq+0x2c/0x44
[   27.947357]  gic_handle_irq+0x44/0xc4
[   27.947376]  call_on_irq_stack+0x2c/0x54
[   27.947415]  do_interrupt_handler+0x80/0x94
[   27.947431]  el1_interrupt+0x34/0x70
[   27.947447]  el1h_64_irq_handler+0x18/0x24
[   27.947462]  el1h_64_irq+0x64/0x68                       <--- the above backtrace is worthless
[   27.947474]  arch_counter_read+0x18/0x24
[   27.947487]  ktime_get+0x48/0xa0
[   27.947501]  test_task+0x70/0xf0
[   27.947520]  kthread+0x10c/0x110
[   27.947538]  ret_from_fork+0x10/0x20

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
2022-08-31 05:05:49 -07:00
Zhen Lei
e73dfe3093 sched/debug: Try trigger_single_cpu_backtrace(cpu) in dump_cpu_task()
The trigger_all_cpu_backtrace() function attempts to send an NMI to the
target CPU, which usually provides much better stack traces than the
dump_cpu_task() function's approach of dumping that stack from some other
CPU.  So much so that most calls to dump_cpu_task() only happen after
a call to trigger_all_cpu_backtrace() has failed.  And the exception to
this rule really should attempt to use trigger_all_cpu_backtrace() first.

Therefore, move the trigger_all_cpu_backtrace() invocation into
dump_cpu_task().

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
2022-08-31 05:03:14 -07:00
Ingo Molnar
53aa930dc4 Merge branch 'sched/warnings' into sched/core, to pick up WARN_ON_ONCE() conversion commit
Merge in the BUG_ON() => WARN_ON_ONCE() conversion commit.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-08-30 10:28:15 +02:00
Shang XiaoJing
5531ecffa4 sched: Add update_current_exec_runtime helper
Wrap repeated code in helper function update_current_exec_runtime for
update the exec time of the current.

Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220824082856.15674-1-shangxiaojing@huawei.com
2022-08-27 00:05:35 +02:00
Mikulas Patocka
8238b45798 wait_on_bit: add an acquire memory barrier
There are several places in the kernel where wait_on_bit is not followed
by a memory barrier (for example, in drivers/md/dm-bufio.c:new_read).

On architectures with weak memory ordering, it may happen that memory
accesses that follow wait_on_bit are reordered before wait_on_bit and
they may return invalid data.

Fix this class of bugs by introducing a new function "test_bit_acquire"
that works like test_bit, but has acquire memory ordering semantics.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-08-26 09:30:25 -07:00
Lukasz Luba
6d5afdc97e cpufreq: schedutil: Move max CPU capacity to sugov_policy
There is no need to keep the max CPU capacity in the per_cpu instance.
Furthermore, there is no need to check and update that variable
(sg_cpu->max) every time in the frequency change request, which is part
of hot path. Instead use struct sugov_policy to store that information.
Initialize the max CPU capacity during the setup and start callback.
We can do that since all CPUs in the same frequency domain have the same
max capacity (capacity setup and thermal pressure are based on that).

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-08-23 20:03:33 +02:00
Chengming Zhou
e4fe074d6c sched/fair: Don't init util/runnable_avg for !fair task
post_init_entity_util_avg() init task util_avg according to the cpu util_avg
at the time of fork, which will decay when switched_to_fair() some time later,
we'd better to not set them at all in the case of !fair task.

Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220818124805.601-10-zhouchengming@bytedance.com
2022-08-23 11:01:20 +02:00
Chengming Zhou
d6531ab6e5 sched/fair: Move task sched_avg attach to enqueue_task_fair()
When wake_up_new_task(), we use post_init_entity_util_avg() to init
util_avg/runnable_avg based on cpu's util_avg at that time, and
attach task sched_avg to cfs_rq.

Since enqueue_task_fair() -> enqueue_entity() -> update_load_avg()
loop will do attach, we can move this work to update_load_avg().

wake_up_new_task(p)
  post_init_entity_util_avg(p)
    attach_entity_cfs_rq()  --> (1)
  activate_task(rq, p)
    enqueue_task() := enqueue_task_fair()
      enqueue_entity() loop
        update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH)
          if (!se->avg.last_update_time && (flags & DO_ATTACH))
            attach_entity_load_avg()  --> (2)

This patch move attach from (1) to (2), update related comments too.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220818124805.601-9-zhouchengming@bytedance.com
2022-08-23 11:01:19 +02:00
Chengming Zhou
df16b71c68 sched/fair: Allow changing cgroup of new forked task
commit 7dc603c902 ("sched/fair: Fix PELT integrity for new tasks")
introduce a TASK_NEW state and an unnessary limitation that would fail
when changing cgroup of new forked task.

Because at that time, we can't handle task_change_group_fair() for new
forked fair task which hasn't been woken up by wake_up_new_task(),
which will cause detach on an unattached task sched_avg problem.

This patch delete this unnessary limitation by adding check before do
detach or attach in task_change_group_fair().

So cpu_cgrp_subsys.can_attach() has nothing to do for fair tasks,
only define it in #ifdef CONFIG_RT_GROUP_SCHED.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220818124805.601-8-zhouchengming@bytedance.com
2022-08-23 11:01:19 +02:00
Chengming Zhou
7e2edaf618 sched/fair: Fix another detach on unattached task corner case
commit 7dc603c902 ("sched/fair: Fix PELT integrity for new tasks")
fixed two load tracking problems for new task, including detach on
unattached new task problem.

There still left another detach on unattached task problem for the task
which has been woken up by try_to_wake_up() and waiting for actually
being woken up by sched_ttwu_pending().

try_to_wake_up(p)
  cpu = select_task_rq(p)
  if (task_cpu(p) != cpu)
    set_task_cpu(p, cpu)
      migrate_task_rq_fair()
        remove_entity_load_avg()       --> unattached
        se->avg.last_update_time = 0;
      __set_task_cpu()
  ttwu_queue(p, cpu)
    ttwu_queue_wakelist()
      __ttwu_queue_wakelist()

task_change_group_fair()
  detach_task_cfs_rq()
    detach_entity_cfs_rq()
      detach_entity_load_avg()   --> detach on unattached task
  set_task_rq()
  attach_task_cfs_rq()
    attach_entity_cfs_rq()
      attach_entity_load_avg()

The reason of this problem is similar, we should check in detach_entity_cfs_rq()
that se->avg.last_update_time != 0, before do detach_entity_load_avg().

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220818124805.601-7-zhouchengming@bytedance.com
2022-08-23 11:01:19 +02:00
Chengming Zhou
e1f078f504 sched/fair: Combine detach into dequeue when migrating task
When we are migrating task out of the CPU, we can combine detach and
propagation into dequeue_entity() to save the detach_entity_cfs_rq()
in migrate_task_rq_fair().

This optimization is like combining DO_ATTACH in the enqueue_entity()
when migrating task to the CPU. So we don't have to traverse the CFS tree
extra time to do the detach_entity_cfs_rq() -> propagate_entity_cfs_rq(),
which wouldn't be called anymore with this patch's change.

detach_task()
  deactivate_task()
    dequeue_task_fair()
      for_each_sched_entity(se)
        dequeue_entity()
          update_load_avg() /* (1) */
            detach_entity_load_avg()

  set_task_cpu()
    migrate_task_rq_fair()
      detach_entity_cfs_rq() /* (2) */
        update_load_avg();
        detach_entity_load_avg();
        propagate_entity_cfs_rq();
          for_each_sched_entity()
            update_load_avg()

This patch save the detach_entity_cfs_rq() called in (2) by doing
the detach_entity_load_avg() for a CPU migrating task inside (1)
(the task being the first se in the loop)

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220818124805.601-6-zhouchengming@bytedance.com
2022-08-23 11:01:18 +02:00
Chengming Zhou
859f206290 sched/fair: Update comments in enqueue/dequeue_entity()
When reading the sched_avg related code, I found the comments in
enqueue/dequeue_entity() are not updated with the current code.

We don't add/subtract entity's runnable_avg from cfs_rq->runnable_avg
during enqueue/dequeue_entity(), those are done only for attach/detach.

This patch updates the comments to reflect the current code working.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220818124805.601-5-zhouchengming@bytedance.com
2022-08-23 11:01:18 +02:00
Chengming Zhou
5d6da83c44 sched/fair: Reset sched_avg last_update_time before set_task_rq()
set_task_rq() -> set_task_rq_fair() will try to synchronize the blocked
task's sched_avg when migrate, which is not needed for already detached
task.

task_change_group_fair() will detached the task sched_avg from prev cfs_rq
first, so reset sched_avg last_update_time before set_task_rq() to avoid that.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220818124805.601-4-zhouchengming@bytedance.com
2022-08-23 11:01:18 +02:00
Chengming Zhou
39c4261191 sched/fair: Remove redundant cpu_cgrp_subsys->fork()
We use cpu_cgrp_subsys->fork() to set task group for the new fair task
in cgroup_post_fork().

Since commit b1e8206582 ("sched: Fix yet more sched_fork() races")
has already set_task_rq() for the new fair task in sched_cgroup_fork(),
so cpu_cgrp_subsys->fork() can be removed.

  cgroup_can_fork()	--> pin parent's sched_task_group
  sched_cgroup_fork()
    __set_task_cpu()
      set_task_rq()
  cgroup_post_fork()
    ss->fork() := cpu_cgroup_fork()
      sched_change_group(..., TASK_SET_GROUP)
        task_set_group_fair()
          set_task_rq()  --> can be removed

After this patch's change, task_change_group_fair() only need to
care about task cgroup migration, make the code much simplier.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20220818124805.601-3-zhouchengming@bytedance.com
2022-08-23 11:01:17 +02:00
Chengming Zhou
78b6b15770 sched/fair: Maintain task se depth in set_task_rq()
Previously we only maintain task se depth in task_move_group_fair(),
if a !fair task change task group, its se depth will not be updated,
so commit eb7a59b2c8 ("sched/fair: Reset se-depth when task switched to FAIR")
fix the problem by updating se depth in switched_to_fair() too.

Then commit daa59407b5 ("sched/fair: Unify switched_{from,to}_fair()
and task_move_group_fair()") unified these two functions, moved se.depth
setting to attach_task_cfs_rq(), which further into attach_entity_cfs_rq()
with commit df217913e7 ("sched/fair: Factorize attach/detach entity").

This patch move task se depth maintenance from attach_entity_cfs_rq()
to set_task_rq(), which will be called when CPU/cgroup change, so its
depth will always be correct.

This patch is preparation for the next patch.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220818124805.601-2-zhouchengming@bytedance.com
2022-08-23 11:01:17 +02:00
Hao Jia
76b079ef4c sched/psi: Remove unused parameter nbytes of psi_trigger_create()
psi_trigger_create()'s 'nbytes' parameter is not used, so we can remove it.

Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-15 12:35:25 -10:00
Hao Jia
2b97cf7628 sched/psi: Zero the memory of struct psi_group
After commit 5f69a6577b ("psi: dont alloc memory for psi by default"),
the memory used by struct psi_group is no longer allocated and zeroed
in cgroup_create().

Since the memory of struct psi_group is not zeroed, the data in this
memory is random, which will lead to inaccurate psi statistics when
creating a new cgroup.

So we use kzlloc() to allocate and zero the struct psi_group and
remove the redundant zeroing in group_init().

Steps to reproduce:
1. Use cgroup v2 and enable CONFIG_PSI
2. Create a new cgroup, and query psi statistics
mkdir /sys/fs/cgroup/test
cat /sys/fs/cgroup/test/cpu.pressure
some avg10=0.00 avg60=0.00 avg300=47927752200.00 total=12884901
full avg10=561815124.00 avg60=125835394188.00 avg300=1077090462000.00 total=10273561772

cat /sys/fs/cgroup/test/io.pressure
some avg10=1040093132823.95 avg60=1203770351379.21 avg300=3862252669559.46 total=4294967296
full avg10=921884564601.39 avg60=0.00 avg300=1984507298.35 total=442381631

cat /sys/fs/cgroup/test/memory.pressure
some avg10=232476085778.11 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=2585658472280.57 total=12884901

Fixes: commit 5f69a6577b ("psi: dont alloc memory for psi by default")
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-15 12:35:13 -10:00
Ingo Molnar
09348d75a6 sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()
There's no good reason to crash a user's system with a BUG_ON(),
chances are high that they'll never even see the crash message on
Xorg, and it won't make it into the syslog either.

By using a WARN_ON_ONCE() we at least give the user a chance to report
any bugs triggered here - instead of getting silent hangs.

None of these WARN_ON_ONCE()s are supposed to trigger, ever - so we ignore
cases where a NULL check is done via a BUG_ON() and we let a NULL
pointer through after a WARN_ON_ONCE().

There's one exception: WARN_ON_ONCE() arguments with side-effects,
such as locking - in this case we use the return value of the
WARN_ON_ONCE(), such as in:

 -       BUG_ON(!lock_task_sighand(p, &flags));
 +       if (WARN_ON_ONCE(!lock_task_sighand(p, &flags)))
 +               return;

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/YvSsKcAXISmshtHo@gmail.com
2022-08-12 11:25:10 +02:00
Linus Torvalds
cac03ac368 Various fixes: a deadline scheduler fix, a migration fix, a Sparse fix and a comment fix.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLuvmwRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gONQ/+KkkPTeKgGDvrahTfeYZlmRyvcI1R78r9
 yooa8v+DtifznBW2eXDBc8WTruzqr78VyUY+1YSjfKS6FRQWYMficJ3qk3hxgBru
 998KZbvl3jXBBlRkqgGeFlF5Ty2KaryEZgX97a7IF/0xWDgpm972jFkJ/KCo/YTY
 WSQrzutz2FKe71EjK4cAplYxPZIiy/zo2hSGTbsso4M7bO5VLc1Y4qMtFGcCZ7JB
 s9JYkj2Rfz+AS5wioDRcGuec4A4SrroxKszZA6QDDBuhMJukqexO02xs/fxZ2W4Z
 DF4U5MFOrtz9AWSGsf1P6XXbgJO8qTgQXZchFsEcJwypV13w8U0IViXQfD/Pvx2X
 y+WHdnZVIO2sDwOJ15ew7IuoJZ2LsVygrBNFJJaIFOtIz3RzprI0BJN7LeWFALOa
 IPmbtiY8hVwhKmjRgMHWDwJhMEHLuhGx3idiD89w1pknzTUnKDiwLyEUtyynxeGd
 ft9uCvPefrYQVx9AiH7wf0W+fg334FCccC+0f8LyduyftUyQCfZIZY6LUSKuKded
 Odm7k0ngLDPbdZwAHs0Nf/ilRwd91Z7b6hGt5U3ptx+8BPMKB+/k1VoKog7OISPc
 zGaP7DrtuC4sEdX4X6bqX+mEQhpkLcQw15gVGxhKoHqygWNSZrV634aSSXwfVXJx
 eT5m/K9a7L0=
 =CYl5
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
 "Various fixes: a deadline scheduler fix, a migration fix, a Sparse fix
  and a comment fix"

* tag 'sched-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Do not requeue task on CPU excluded from cpus_mask
  sched/rt: Fix Sparse warnings due to undefined rt.c declarations
  exit: Fix typo in comment: s/sub-theads/sub-threads
  sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowed
2022-08-06 17:34:06 -07:00
Mel Gorman
751d4cbc43 sched/core: Do not requeue task on CPU excluded from cpus_mask
The following warning was triggered on a large machine early in boot on
a distribution kernel but the same problem should also affect mainline.

   WARNING: CPU: 439 PID: 10 at ../kernel/workqueue.c:2231 process_one_work+0x4d/0x440
   Call Trace:
    <TASK>
    rescuer_thread+0x1f6/0x360
    kthread+0x156/0x180
    ret_from_fork+0x22/0x30
    </TASK>

Commit c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")
optimises ttwu by queueing a task that is descheduling on the wakelist,
but does not check if the task descheduling is still allowed to run on that CPU.

In this warning, the problematic task is a workqueue rescue thread which
checks if the rescue is for a per-cpu workqueue and running on the wrong CPU.
While this is early in boot and it should be possible to create workers,
the rescue thread may still used if the MAYDAY_INITIAL_TIMEOUT is reached
or MAYDAY_INTERVAL and on a sufficiently large machine, the rescue
thread is being used frequently.

Tracing confirmed that the task should have migrated properly using the
stopper thread to handle the migration. However, a parallel wakeup from udev
running on another CPU that does not share CPU cache observes p->on_cpu and
uses task_cpu(p), queues the task on the old CPU and triggers the warning.

Check that the wakee task that is descheduling is still allowed to run
on its current CPU and if not, wait for the descheduling to complete
and select an allowed CPU.

Fixes: c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220804092119.20137-1-mgorman@techsingularity.net
2022-08-04 11:26:13 +02:00
Xin Gao
8648f92a66 sched/core: Remove superfluous semicolon
Signed-off-by: Xin Gao <gaoxin@cdjrlc.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220719111044.7095-1-gaoxin@cdjrlc.com
2022-08-04 11:02:08 +02:00
Bing Huang
18c31c9711 sched/fair: Make per-cpu cpumasks static
The load_balance_mask and select_rq_mask percpu variables are only used in
kernel/sched/fair.c.

Make them static and move their allocation into init_sched_fair_class().

Replace kzalloc_node() with zalloc_cpumask_var_node() to get rid of the
CONFIG_CPUMASK_OFFSTACK #ifdef and to align with per-cpu cpumask
allocation for RT (local_cpu_mask in init_sched_rt_class()) and DL
class (local_cpu_mask_dl in init_sched_dl_class()).

[ mingo: Tidied up changelog & touched up the code. ]

Signed-off-by: Bing Huang <huangbing@kylinos.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220722213609.3901-1-huangbing775@126.com
2022-08-03 19:17:33 +02:00
Hao Jia
d985ee9f44 sched/fair: Remove unused parameter idle of _nohz_idle_balance()
After commit 7a82e5f52a ("sched/fair: Merge for each idle cpu loop of ILB"),
_nohz_idle_balance()'s 'idle' parameter is not used anymore, so we can remove it.

Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220803130223.70419-1-jiahao.os@bytedance.com
2022-08-03 18:54:26 +02:00
Linus Torvalds
b6bb70f9ab Several core optimizations:
* threadgroup_rwsem write locking is skipped when configuring controllers in
   empty subtrees. Combined with CLONE_INTO_CGROUP, this allows the common
   static usage pattern to not grab threadgroup_rwsem at all (glibc still
   doesn't seem ready for CLONE_INTO_CGROUP unfortunately).
 
 * threadgroup_rwsem used to be put into non-percpu mode by default due to
   latency concerns in specific use cases. There's no reason for everyone
   else to pay for it. Make the behavior optional.
 
 * psi no longer allocates memory when disabled.
 
 along with some code cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCYugHIQ4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGd+oAP9lfD3fTRdNo4qWV2VsZsYzoOxzNIuJSwN/dnYx
 IEbQOwD/cd2YMfeo6zcb427U/VfTFqjJjFK04OeljYtJU8fFywo=
 =sucy
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:
 "Several core optimizations:

   - threadgroup_rwsem write locking is skipped when configuring
     controllers in empty subtrees.

     Combined with CLONE_INTO_CGROUP, this allows the common static
     usage pattern to not grab threadgroup_rwsem at all (glibc still
     doesn't seem ready for CLONE_INTO_CGROUP unfortunately).

   - threadgroup_rwsem used to be put into non-percpu mode by default
     due to latency concerns in specific use cases. There's no reason
     for everyone else to pay for it. Make the behavior optional.

   - psi no longer allocates memory when disabled.

  ... along with some code cleanups"

* tag 'cgroup-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Skip subtree root in cgroup_update_dfl_csses()
  cgroup: remove "no" prefixed mount options
  cgroup: Make !percpu threadgroup_rwsem operations optional
  cgroup: Add "no" prefixed mount options
  cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree
  cgroup.c: remove redundant check for mixable cgroup in cgroup_migrate_vet_dst
  cgroup.c: add helper __cset_cgroup_from_root to cleanup duplicated codes
  psi: dont alloc memory for psi by default
2022-08-03 09:45:08 -07:00
Ben Dooks
87514b2c24 sched/rt: Fix Sparse warnings due to undefined rt.c declarations
There are several symbols defined in kernel/sched/sched.h but get wrapped
in CONFIG_CGROUP_SCHED, even though dummy versions get built in rt.c and
therefore trigger Sparse warnings:

  kernel/sched/rt.c:309:6: warning: symbol 'unregister_rt_sched_group' was not declared. Should it be static?
  kernel/sched/rt.c:311:6: warning: symbol 'free_rt_sched_group' was not declared. Should it be static?
  kernel/sched/rt.c:313:5: warning: symbol 'alloc_rt_sched_group' was not declared. Should it be static?

Fix this by moving them outside the CONFIG_CGROUP_SCHED block.

[ mingo: Refreshed to the latest scheduler tree, tweaked changelog. ]

Signed-off-by: Ben Dooks <ben-linux@fluff.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220721145155.358366-1-ben-linux@fluff.org
2022-08-03 11:22:37 +02:00
Waiman Long
b6e8d40d43 sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowed
With cgroup v2, the cpuset's cpus_allowed mask can be empty indicating
that the cpuset will just use the effective CPUs of its parent. So
cpuset_can_attach() can call task_can_attach() with an empty mask.
This can lead to cpumask_any_and() returns nr_cpu_ids causing the call
to dl_bw_of() to crash due to percpu value access of an out of bound
CPU value. For example:

	[80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0
	  :
	[80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0
	  :
	[80468.207946] Call Trace:
	[80468.208947]  cpuset_can_attach+0xa0/0x140
	[80468.209953]  cgroup_migrate_execute+0x8c/0x490
	[80468.210931]  cgroup_update_dfl_csses+0x254/0x270
	[80468.211898]  cgroup_subtree_control_write+0x322/0x400
	[80468.212854]  kernfs_fop_write_iter+0x11c/0x1b0
	[80468.213777]  new_sync_write+0x11f/0x1b0
	[80468.214689]  vfs_write+0x1eb/0x280
	[80468.215592]  ksys_write+0x5f/0xe0
	[80468.216463]  do_syscall_64+0x5c/0x80
	[80468.224287]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Fix that by using effective_cpus instead. For cgroup v1, effective_cpus
is the same as cpus_allowed. For v2, effective_cpus is the real cpumask
to be used by tasks within the cpuset anyway.

Also update task_can_attach()'s 2nd argument name to cs_effective_cpus to
reflect the change. In addition, a check is added to task_can_attach()
to guard against the possibility that cpumask_any_and() may return a
value >= nr_cpu_ids.

Fixes: 7f51412a41 ("sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20220803015451.2219567-1-longman@redhat.com
2022-08-03 10:34:26 +02:00
Linus Torvalds
7d9d077c78 RCU pull request for v5.20 (or whatever)
This pull request contains the following branches:
 
 doc.2022.06.21a: Documentation updates.
 
 fixes.2022.07.19a: Miscellaneous fixes.
 
 nocb.2022.07.19a: Callback-offload updates, perhaps most notably a new
 	RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to
 	be offloaded at boot time, regardless of kernel boot parameters.
 	This is useful to battery-powered systems such as ChromeOS
 	and Android.  In addition, a new RCU_NOCB_CPU_CB_BOOST kernel
 	boot parameter prevents offloaded callbacks from interfering
 	with real-time workloads and with energy-efficiency mechanisms.
 
 poll.2022.07.21a: Polled grace-period updates, perhaps most notably
 	making these APIs account for both normal and expedited grace
 	periods.
 
 rcu-tasks.2022.06.21a: Tasks RCU updates, perhaps most notably reducing
 	the CPU overhead of RCU tasks trace grace periods by more than
 	a factor of two on a system with 15,000 tasks.	The reduction
 	is expected to increase with the number of tasks, so it seems
 	reasonable to hypothesize that a system with 150,000 tasks might
 	see a 20-fold reduction in CPU overhead.
 
 torture.2022.06.21a: Torture-test updates.
 
 ctxt.2022.07.05a: Updates that merge RCU's dyntick-idle tracking into
 	context tracking, thus reducing the overhead of transitioning to
 	kernel mode from either idle or nohz_full userspace execution
 	for kernels that track context independently of RCU.  This is
 	expected to be helpful primarily for kernels built with
 	CONFIG_NO_HZ_FULL=y.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmLgMcgTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jArXD/0fjbCwqpRjHVTzjMY8jN4zDkqZZD6m
 g8Fx27hZ4ToNFwRptyHwNezrNj14skjAJEXfdjaVw32W62ivXvf0HINvSzsTLCSq
 k2kWyBdXLc9CwY5p5W4smnpn5VoAScjg5PoPL59INoZ/Zziji323C7Zepl/1DYJt
 0T6bPCQjo1ZQoDUCyVpSjDmAqxnderWG0MeJVt74GkLqmnYLANg0GH8c7mH4+9LL
 kVGlLp5nlPgNJ4FEoFdMwNU8T/ETmaVld/m2dkiawjkXjJzB2XKtBigU91DDmXz5
 7DIdV4ABrxiy4kGNqtIe/jFgnKyVD7xiDpyfjd6KTeDr/rDS8u2ZH7+1iHsyz3g0
 Np/tS3vcd0KR+gI/d0eXxPbgm5sKlCmKw/nU2eArpW/+4LmVXBUfHTG9Jg+LJmBc
 JrUh6aEdIZJZHgv/nOQBNig7GJW43IG50rjuJxAuzcxiZNEG5lUSS23ysaA9CPCL
 PxRWKSxIEfK3kdmvVO5IIbKTQmIBGWlcWMTcYictFSVfBgcCXpPAksGvqA5JiUkc
 egW+xLFo/7K+E158vSKsVqlWZcEeUbsNJ88QOlpqnRgH++I2Yv/LhK41XfJfpH+Y
 ALxVaDd+mAq6v+qSHNVq9wT3ozXIPy/zK1hDlMIqx40h2YvaEsH4je+521oSoN9r
 vX60+QNxvUBLwA==
 =vUNm
 -----END PGP SIGNATURE-----

Merge tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU updates from Paul McKenney:

 - Documentation updates

 - Miscellaneous fixes

 - Callback-offload updates, perhaps most notably a new
   RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to be
   offloaded at boot time, regardless of kernel boot parameters.

   This is useful to battery-powered systems such as ChromeOS and
   Android. In addition, a new RCU_NOCB_CPU_CB_BOOST kernel boot
   parameter prevents offloaded callbacks from interfering with
   real-time workloads and with energy-efficiency mechanisms

 - Polled grace-period updates, perhaps most notably making these APIs
   account for both normal and expedited grace periods

 - Tasks RCU updates, perhaps most notably reducing the CPU overhead of
   RCU tasks trace grace periods by more than a factor of two on a
   system with 15,000 tasks.

   The reduction is expected to increase with the number of tasks, so it
   seems reasonable to hypothesize that a system with 150,000 tasks
   might see a 20-fold reduction in CPU overhead

 - Torture-test updates

 - Updates that merge RCU's dyntick-idle tracking into context tracking,
   thus reducing the overhead of transitioning to kernel mode from
   either idle or nohz_full userspace execution for kernels that track
   context independently of RCU.

   This is expected to be helpful primarily for kernels built with
   CONFIG_NO_HZ_FULL=y

* tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (98 commits)
  rcu: Add irqs-disabled indicator to expedited RCU CPU stall warnings
  rcu: Diagnose extended sync_rcu_do_polled_gp() loops
  rcu: Put panic_on_rcu_stall() after expedited RCU CPU stall warnings
  rcutorture: Test polled expedited grace-period primitives
  rcu: Add polled expedited grace-period primitives
  rcutorture: Verify that polled GP API sees synchronous grace periods
  rcu: Make Tiny RCU grace periods visible to polled APIs
  rcu: Make polled grace-period API account for expedited grace periods
  rcu: Switch polled grace-period APIs to ->gp_seq_polled
  rcu/nocb: Avoid polling when my_rdp->nocb_head_rdp list is empty
  rcu/nocb: Add option to opt rcuo kthreads out of RT priority
  rcu: Add nocb_cb_kthread check to rcu_is_callbacks_kthread()
  rcu/nocb: Add an option to offload all CPUs on boot
  rcu/nocb: Fix NOCB kthreads spawn failure with rcu_nocb_rdp_deoffload() direct call
  rcu/nocb: Invert rcu_state.barrier_mutex VS hotplug lock locking order
  rcu/nocb: Add/del rdp to iterate from rcuog itself
  rcu/tree: Add comment to describe GP-done condition in fqs loop
  rcu: Initialize first_gp_fqs at declaration in rcu_gp_fqs()
  rcu/kvfree: Remove useless monitor_todo flag
  rcu: Cleanup RCU urgency state for offline CPU
  ...
2022-08-02 19:12:45 -07:00
Linus Torvalds
b349b1181d for-5.20/io_uring-2022-07-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLkm5gQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmKMD/4l3QIrLbjYIxlfrzQcHbmYuUkbQtj3SbZg
 6ejbnGVhCs1P9DdXH8MgE2BxgpiXQE0CqOK7vbSoo5ep2n2UTLI2DIxAl74SMIo7
 0wmJXtUJySuViKr3NYVHqlN180MkQYddBz0nGElhkQBPBCMhW8CrtPCeURr/YyHp
 2RxSYBXiUx2gRyig+klnp6oPEqelcBZJUyNHdA9yVrgl/RhB/t2rKj7D++8ukQM3
 Zuyh8WIkTeTfUz9hdGG7fuCEdZN4DlO2CCEc7uy0cKi6VRCKH4hYUCqClJ+/cfd2
 43dUI2O7B6D1t/ObFh8AGIDXBDqVA6ePQohQU6gooRkfQiBPKkc9d0ts4yIhRqca
 AjkzNM+0Eve3A01loJ8J84w8oZnvNpYEv5n8/sZVLWcyU3UIs0I88nC2OBiFtoRq
 d77CtFLwOTo+r3STtAhnZOqez90rhS6BqKtqlUP346PCuFItl6/MbGtwdTbLYEFj
 CVNIb2pERWSr2NxGv4lFyXaX/cRwruxojWH7yc3rRYjr4Ykevd1pe/fMGNiMAnKw
 5em/3QU3qq0ZVcXLMihksKeHHFIQwGDRMuyuv/fktV10+yYXQ0t16WzkJT3aR8Xo
 cqs0r8+6Jnj3uYcOMzj/FoLcpEPr21hnwAtzLto1mG1Wh4JRn/D7Nx5zqxPLxcW+
 NiU6VihPOw==
 =gxeV
 -----END PGP SIGNATURE-----

Merge tag 'for-5.20/io_uring-2022-07-29' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:

 - As per (valid) complaint in the last merge window, fs/io_uring.c has
   grown quite large these days. io_uring isn't really tied to fs
   either, as it supports a wide variety of functionality outside of
   that.

   Move the code to io_uring/ and split it into files that either
   implement a specific request type, and split some code into helpers
   as well. The code is organized a lot better like this, and io_uring.c
   is now < 4K LOC (me).

 - Deprecate the epoll_ctl opcode. It'll still work, just trigger a
   warning once if used. If we don't get any complaints on this, and I
   don't expect any, then we can fully remove it in a future release
   (me).

 - Improve the cancel hash locking (Hao)

 - kbuf cleanups (Hao)

 - Efficiency improvements to the task_work handling (Dylan, Pavel)

 - Provided buffer improvements (Dylan)

 - Add support for recv/recvmsg multishot support. This is similar to
   the accept (or poll) support for have for multishot, where a single
   SQE can trigger everytime data is received. For applications that
   expect to do more than a few receives on an instantiated socket, this
   greatly improves efficiency (Dylan).

 - Efficiency improvements for poll handling (Pavel)

 - Poll cancelation improvements (Pavel)

 - Allow specifiying a range for direct descriptor allocations (Pavel)

 - Cleanup the cqe32 handling (Pavel)

 - Move io_uring types to greatly cleanup the tracing (Pavel)

 - Tons of great code cleanups and improvements (Pavel)

 - Add a way to do sync cancelations rather than through the sqe -> cqe
   interface, as that's a lot easier to use for some use cases (me).

 - Add support to IORING_OP_MSG_RING for sending direct descriptors to a
   different ring. This avoids the usually problematic SCM case, as we
   disallow those. (me)

 - Make the per-command alloc cache we use for apoll generic, place
   limits on it, and use it for netmsg as well (me).

 - Various cleanups (me, Michal, Gustavo, Uros)

* tag 'for-5.20/io_uring-2022-07-29' of git://git.kernel.dk/linux-block: (172 commits)
  io_uring: ensure REQ_F_ISREG is set async offload
  net: fix compat pointer in get_compat_msghdr()
  io_uring: Don't require reinitable percpu_ref
  io_uring: fix types in io_recvmsg_multishot_overflow
  io_uring: Use atomic_long_try_cmpxchg in __io_account_mem
  io_uring: support multishot in recvmsg
  net: copy from user before calling __get_compat_msghdr
  net: copy from user before calling __copy_msghdr
  io_uring: support 0 length iov in buffer select in compat
  io_uring: fix multishot ending when not polled
  io_uring: add netmsg cache
  io_uring: impose max limit on apoll cache
  io_uring: add abstraction around apoll cache
  io_uring: move apoll cache to poll.c
  io_uring: consolidate hash_locked io-wq handling
  io_uring: clear REQ_F_HASH_LOCKED on hash removal
  io_uring: don't race double poll setting REQ_F_ASYNC_DATA
  io_uring: don't miss setting REQ_F_DOUBLE_POLL
  io_uring: disable multishot recvmsg
  io_uring: only trace one of complete or overflow
  ...
2022-08-02 13:20:44 -07:00
Zhen Lei
0f03d6805b sched/debug: Print each field value left-aligned in sched_show_task()
Currently, the values of some fields are printed right-aligned, causing
the field value to be next to the next field name rather than next to its
own field name. So print each field value left-aligned, to make it more
readable.

 Before:
	stack:    0 pid:  307 ppid:     2 flags:0x00000008
 After:
	stack:0     pid:308   ppid:2      flags:0x0000000a

This also makes them print in the same style as the other two fields:

	task:demo0           state:R  running task

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20220727060819.1085-1-thunder.leizhen@huawei.com
2022-08-02 21:45:35 +02:00
Dietmar Eggemann
b3f53daacc sched/deadline: Use sched_dl_entity's dl_density in dl_task_fits_capacity()
Save a multiplication in dl_task_fits_capacity() by using already
maintained per-sched_dl_entity (i.e. per-task) `dl_runtime/dl_deadline`
(dl_density).

  cap_scale(dl_deadline, cap) >= dl_runtime

  dl_deadline * cap >> SCHED_CAPACITY_SHIFT >= dl_runtime

  cap >= dl_runtime << SCHED_CAPACITY_SHIFT / dl_deadline

  cap >= (dl_runtime << BW_SHIFT / dl_deadline) >>
				BW_SHIFT - SCHED_CAPACITY_SHIFT

  cap >= dl_density >> BW_SHIFT - SCHED_CAPACITY_SHIFT

__sched_setscheduler()->__checkparam_dl() ensures that the 2 corner
cases (if conditions) `runtime == RUNTIME_INF (-1)` and `period == 0`
of to_ratio(deadline, runtime) are not met when setting dl_density in
__sched_setscheduler()-> __setscheduler_params()->__setparam_dl().

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220729111305.1275158-4-dietmar.eggemann@arm.com
2022-08-02 12:32:46 +02:00
Dietmar Eggemann
6092478bcb sched/deadline: Make dl_cpuset_cpumask_can_shrink() capacity-aware
dl_cpuset_cpumask_can_shrink() is used to validate whether there is
still enough CPU capacity for DL tasks in the reduced cpuset.

Currently it still operates on `# remaining CPUs in the cpuset` (1).
Change this to use the already capacity-aware DL admission control
__dl_overflow() for the `cpumask can shrink` test.

  dl_b->bw = sched_rt_period << BW_SHIFT / sched_rt_period

  dl_b->bw * (1) >= currently allocated bandwidth in root_domain (rd)

  Replace (1) w/ `\Sum CPU capacity in rd >> SCHED_CAPACITY_SHIFT`

Adapt __dl_bw_capacity() to take a cpumask instead of a CPU number
argument so that `rd->span` and `cpumask of the reduced cpuset` can
be used here.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220729111305.1275158-3-dietmar.eggemann@arm.com
2022-08-02 12:32:45 +02:00
Dietmar Eggemann
740cf8a760 sched/core: Introduce sched_asym_cpucap_active()
Create an inline helper for conditional code to be only executed on
asymmetric CPU capacity systems. This makes these (currently ~10 and
future) conditions a lot more readable.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220729111305.1275158-2-dietmar.eggemann@arm.com
2022-08-02 12:32:45 +02:00
Linus Torvalds
b167fdffe9 This cycle's scheduler updates for v6.0 are:
Load-balancing improvements:
 ============================
 
 - Improve NUMA balancing on AMD Zen systems for affine workloads.
 
 - Improve the handling of reduced-capacity CPUs in load-balancing.
 
 - Energy Model improvements: fix & refine all the energy fairness metrics (PELT),
   and remove the conservative threshold requiring 6% energy savings to
   migrate a task. Doing this improves power efficiency for most workloads,
   and also increases the reliability of energy-efficiency scheduling.
 
 - Optimize/tweak select_idle_cpu() to spend (much) less time searching
   for an idle CPU on overloaded systems. There's reports of several
   milliseconds spent there on large systems with large workloads ...
 
   [ Since the search logic changed, there might be behavioral side effects. ]
 
 - Improve NUMA imbalance behavior. On certain systems
   with spare capacity, initial placement of tasks is non-deterministic,
   and such an artificial placement imbalance can persist for a long time,
   hurting (and sometimes helping) performance.
 
   The fix is to make fork-time task placement consistent with runtime
   NUMA balancing placement.
 
   Note that some performance regressions were reported against this,
   caused by workloads that are not memory bandwith limited, which benefit
   from the artificial locality of the placement bug(s). Mel Gorman's
   conclusion, with which we concur, was that consistency is better than
   random workload benefits from non-deterministic bugs:
 
      "Given there is no crystal ball and it's a tradeoff, I think it's
       better to be consistent and use similar logic at both fork time
       and runtime even if it doesn't have universal benefit."
 
 - Improve core scheduling by fixing a bug in sched_core_update_cookie() that
   caused unnecessary forced idling.
 
 - Improve wakeup-balancing by allowing same-LLC wakeup of idle CPUs for newly
   woken tasks.
 
 - Fix a newidle balancing bug that introduced unnecessary wakeup latencies.
 
 ABI improvements/fixes:
 =======================
 
 - Do not check capabilities and do not issue capability check denial messages
   when a scheduler syscall doesn't require privileges. (Such as increasing niceness.)
 
 - Add forced-idle accounting to cgroups too.
 
 - Fix/improve the RSEQ ABI to not just silently accept unknown flags.
   (No existing tooling is known to have learned to rely on the previous behavior.)
 
 - Depreciate the (unused) RSEQ_CS_FLAG_NO_RESTART_ON_* flags.
 
 Optimizations:
 ==============
 
 - Optimize & simplify leaf_cfs_rq_list()
 
 - Micro-optimize set_nr_{and_not,if}_polling() via try_cmpxchg().
 
 Misc fixes & cleanups:
 ======================
 
 - Fix the RSEQ self-tests on RISC-V and Glibc 2.35 systems.
 
 - Fix a full-NOHZ bug that can in some cases result in the tick not being
   re-enabled when the last SCHED_RT task is gone from a runqueue but there's
   still SCHED_OTHER tasks around.
 
 - Various PREEMPT_RT related fixes.
 
 - Misc cleanups & smaller fixes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLn2ywRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iNfxAAhPJMwM4tYCpIM6PhmxKiHl6kkiT2tt42
 HhEmiJVLjczLybWaWwmGA2dSFkv1f4+hG7nqdZTm9QYn0Pqat2UTSRcwoKQc+gpB
 x85Hwt2IUmnUman52fRl5r1miH9LTdCI6agWaFLQae5ds1XmOugFo52t2ahax+Gn
 dB8LxS2fa/GrKj229EhkJSPWAK4Y94asoTProwpKLuKEeXhDkqUNrOWbKhz+wEnA
 pVZySpA9uEOdNLVSr1s0VB6mZoh5/z6yQefj5YSNntsG71XWo9jxKCIm5buVdk2U
 wjdn6UzoTThOy/5Ygm64eYRexMHG71UamF1JYUdmvDeUJZ5fhG6RD0FECUQNVcJB
 Msu2fce6u1AV0giZGYtiooLGSawB/+e6MoDkjTl8guFHi/peve9CezKX1ZgDWPfE
 eGn+EbYkUS9RMafXCKuEUBAC1UUqAavGN9sGGN1ufyR4za6ogZplOqAFKtTRTGnT
 /Ne3fHTtvv73DLGW9ohO5vSS2Rp7zhAhB6FunhibhxCWlt7W6hA4Ze2vU9hf78Yn
 SJDLAJjOEilLaKUkRG/d9uM3FjKJM1tqxuT76+sUbM0MNxdyiKcviQlP1b8oq5Um
 xE1KNZUevnr/WXqOTGDKHH/HNPFgwxbwavMiP7dNFn8h/hEk4t9dkf5siDmVHtn4
 nzDVOob1LgE=
 =xr2b
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
"Load-balancing improvements:

   - Improve NUMA balancing on AMD Zen systems for affine workloads.

   - Improve the handling of reduced-capacity CPUs in load-balancing.

   - Energy Model improvements: fix & refine all the energy fairness
     metrics (PELT), and remove the conservative threshold requiring 6%
     energy savings to migrate a task. Doing this improves power
     efficiency for most workloads, and also increases the reliability
     of energy-efficiency scheduling.

   - Optimize/tweak select_idle_cpu() to spend (much) less time
     searching for an idle CPU on overloaded systems. There's reports of
     several milliseconds spent there on large systems with large
     workloads ...

     [ Since the search logic changed, there might be behavioral side
       effects. ]

   - Improve NUMA imbalance behavior. On certain systems with spare
     capacity, initial placement of tasks is non-deterministic, and such
     an artificial placement imbalance can persist for a long time,
     hurting (and sometimes helping) performance.

     The fix is to make fork-time task placement consistent with runtime
     NUMA balancing placement.

     Note that some performance regressions were reported against this,
     caused by workloads that are not memory bandwith limited, which
     benefit from the artificial locality of the placement bug(s). Mel
     Gorman's conclusion, with which we concur, was that consistency is
     better than random workload benefits from non-deterministic bugs:

        "Given there is no crystal ball and it's a tradeoff, I think
         it's better to be consistent and use similar logic at both fork
         time and runtime even if it doesn't have universal benefit."

   - Improve core scheduling by fixing a bug in
     sched_core_update_cookie() that caused unnecessary forced idling.

   - Improve wakeup-balancing by allowing same-LLC wakeup of idle CPUs
     for newly woken tasks.

   - Fix a newidle balancing bug that introduced unnecessary wakeup
     latencies.

  ABI improvements/fixes:

   - Do not check capabilities and do not issue capability check denial
     messages when a scheduler syscall doesn't require privileges. (Such
     as increasing niceness.)

   - Add forced-idle accounting to cgroups too.

   - Fix/improve the RSEQ ABI to not just silently accept unknown flags.
     (No existing tooling is known to have learned to rely on the
     previous behavior.)

   - Depreciate the (unused) RSEQ_CS_FLAG_NO_RESTART_ON_* flags.

  Optimizations:

   - Optimize & simplify leaf_cfs_rq_list()

   - Micro-optimize set_nr_{and_not,if}_polling() via try_cmpxchg().

  Misc fixes & cleanups:

   - Fix the RSEQ self-tests on RISC-V and Glibc 2.35 systems.

   - Fix a full-NOHZ bug that can in some cases result in the tick not
     being re-enabled when the last SCHED_RT task is gone from a
     runqueue but there's still SCHED_OTHER tasks around.

   - Various PREEMPT_RT related fixes.

   - Misc cleanups & smaller fixes"

* tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
  rseq: Kill process when unknown flags are encountered in ABI structures
  rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_* flags
  sched/core: Fix the bug that task won't enqueue into core tree when update cookie
  nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()
  sched/core: Always flush pending blk_plug
  sched/fair: fix case with reduced capacity CPU
  sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling
  sched/core: add forced idle accounting for cgroups
  sched/fair: Remove the energy margin in feec()
  sched/fair: Remove task_util from effective utilization in feec()
  sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu()
  sched/fair: Rename select_idle_mask to select_rq_mask
  sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()
  sched/fair: Decay task PELT values during wakeup migration
  sched/fair: Provide u64 read for 32-bits arch helper
  sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  sched: only perform capability check on privileged operation
  sched: Remove unused function group_first_cpu()
  sched/fair: Remove redundant word " *"
  selftests/rseq: check if libc rseq support is registered
  ...
2022-08-01 11:49:06 -07:00
Jens Axboe
ed29b0b4fd io_uring: move to separate directory
In preparation for splitting io_uring up a bit, move it into its own
top level directory. It didn't really belong in fs/ anyway, as it's
not a file system only API.

This adds io_uring/ and moves the core files in there, and updates the
MAINTAINERS file for the new location.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:10 -06:00
Paul E. McKenney
34bc7b454d Merge branch 'ctxt.2022.07.05a' into HEAD
ctxt.2022.07.05a: Linux-kernel memory model development branch.
2022-07-21 17:46:18 -07:00
Cruz Zhao
91caa5ae24 sched/core: Fix the bug that task won't enqueue into core tree when update cookie
In function sched_core_update_cookie(), a task will enqueue into the
core tree only when it enqueued before, that is, if an uncookied task
is cookied, it will not enqueue into the core tree until it enqueue
again, which will result in unnecessary force idle.

Here follows the scenario:
  CPU x and CPU y are a pair of SMT siblings.
  1. Start task a running on CPU x without sleeping, and task b and
     task c running on CPU y without sleeping.
  2. We create a cookie and share it to task a and task b, and then
     we create another cookie and share it to task c.
  3. Simpling core_forceidle_sum of task a and b from /proc/PID/sched

And we will find out that core_forceidle_sum of task a takes 30%
time of the sampling period, which shouldn't happen as task a and b
have the same cookie.

Then we migrate task a to CPU x', migrate task b and c to CPU y', where
CPU x' and CPU y' are a pair of SMT siblings, and sampling again, we
will found out that core_forceidle_sum of task a and b are almost zero.

To solve this problem, we enqueue the task into the core tree if it's
on rq.

Fixes: 6e33cad0af49("sched: Trivial core scheduling cookie management")
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1656403045-100840-2-git-send-email-CruzZhao@linux.alibaba.com
2022-07-21 10:39:39 +02:00
Nicolas Saenz Julienne
5c66d1b9b3 nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()
dequeue_task_rt() only decrements 'rt_rq->rt_nr_running' after having
called sched_update_tick_dependency() preventing it from re-enabling the
tick on systems that no longer have pending SCHED_RT tasks but have
multiple runnable SCHED_OTHER tasks:

  dequeue_task_rt()
    dequeue_rt_entity()
      dequeue_rt_stack()
        dequeue_top_rt_rq()
	  sub_nr_running()	// decrements rq->nr_running
	    sched_update_tick_dependency()
	      sched_can_stop_tick()	// checks rq->rt.rt_nr_running,
	      ...
        __dequeue_rt_entity()
          dec_rt_tasks()	// decrements rq->rt.rt_nr_running
	  ...

Every other scheduler class performs the operation in the opposite
order, and sched_update_tick_dependency() expects the values to be
updated as such. So avoid the misbehaviour by inverting the order in
which the above operations are performed in the RT scheduler.

Fixes: 76d92ac305 ("sched: Migrate sched to use new tick dependency mask model")
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220628092259.330171-1-nsaenzju@redhat.com
2022-07-21 10:39:38 +02:00
Juri Lelli
ddfc710395 sched/deadline: Fix BUG_ON condition for deboosted tasks
Tasks the are being deboosted from SCHED_DEADLINE might enter
enqueue_task_dl() one last time and hit an erroneous BUG_ON condition:
since they are not boosted anymore, the if (is_dl_boosted()) branch is
not taken, but the else if (!dl_prio) is and inside this one we
BUG_ON(!is_dl_boosted), which is of course false (BUG_ON triggered)
otherwise we had entered the if branch above. Long story short, the
current condition doesn't make sense and always leads to triggering of a
BUG.

Fix this by only checking enqueue flags, properly: ENQUEUE_REPLENISH has
to be present, but additional flags are not a problem.

Fixes: 64be6f1f5f ("sched/deadline: Don't replenish from a !SCHED_DEADLINE entity")
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20220714151908.533052-1-juri.lelli@redhat.com
2022-07-21 10:35:28 +02:00
John Keeping
401e4963bf sched/core: Always flush pending blk_plug
With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
normal priority tasks (SCHED_OTHER, nice level zero):

	INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
	      Not tainted 5.15.49-rt46 #1
	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	task:kworker/u8:0    state:D stack:    0 pid:    8 ppid:     2 flags:0x00000000
	Workqueue: writeback wb_workfn (flush-7:0)
	[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
	[<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174)
	[<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>]
	+(rt_mutex_slowlock.constprop.0+0xac/0x174)
	[<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54)
	[<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec)
	[<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c)
	[<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8)
	[<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4)
	[<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4)
	[<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c)
	[<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8)
	[<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178)
	[<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
	Exception stack(0xc10e3fb0 to 0xc10e3ff8)
	3fa0:                                     00000000 00000000 00000000 00000000
	3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
	3fe0: 00000000 00000000 00000000 00000000 00000013 00000000

	INFO: task tar:2083 blocked for more than 491 seconds.
	      Not tainted 5.15.49-rt46 #1
	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	task:tar             state:D stack:    0 pid: 2083 ppid:  2082 flags:0x00000000
	[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
	[<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24)
	[<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30)
	[<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8)
	[<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0)
	[<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144)
	[<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4)
	[<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250)
	[<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148)
	[<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98)
	[<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec)
	[<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c)
	Exception stack(0xc2e1bfa8 to 0xc2e1bff0)
	bfa0:                   01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000
	bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190
	bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc

Here the kworker is waiting on msdos_sb_info::s_lock which is held by
tar which is in turn waiting for a buffer which is locked waiting to be
flushed, but this operation is plugged in the kworker.

The lock is a normal struct mutex, so tsk_is_pi_blocked() will always
return false on !RT and thus the behaviour changes for RT.

It seems that the intent here is to skip blk_flush_plug() in the case
where a non-preemptible lock (such as a spinlock) has been converted to
a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT
schedule flag.  But sched_submit_work() is only called from schedule()
which is never called in this scenario, so the check can simply be
deleted.

Looking at the history of the -rt patchset, in fact this change was
present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
of a larger patch [1] most of which was replaced by commit b4bfa3fcfe
("sched/core: Rework the __schedule() preempt argument").

As described in [1]:

   The schedule process must distinguish between blocking on a regular
   sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock
   and rwlock):
   - rwsem and mutex must flush block requests (blk_schedule_flush_plug())
     even if blocked on a lock. This can not deadlock because this also
     happens for non-RT.
     There should be a warning if the scheduling point is within a RCU read
     section.

   - spinlock and rwlock must not flush block requests. This will deadlock
     if the callback attempts to acquire a lock which is already acquired.
     Similarly to being preempted, there should be no warning if the
     scheduling point is within a RCU read section.

and with the tsk_is_pi_blocked() in the scheduler path, we hit the first
issue.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches

Signed-off-by: John Keeping <john@metanate.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20220708162702.1758865-1-john@metanate.com
2022-07-13 11:29:17 +02:00
Vincent Guittot
c82a69629c sched/fair: fix case with reduced capacity CPU
The capacity of the CPU available for CFS tasks can be reduced because of
other activities running on the latter. In such case, it's worth trying to
move CFS tasks on a CPU with more available capacity.

The rework of the load balance has filtered the case when the CPU is
classified to be fully busy but its capacity is reduced.

Check if CPU's capacity is reduced while gathering load balance statistic
and classify it group_misfit_task instead of group_fully_busy so we can
try to move the load on another CPU.

Reported-by: David Chen <david.chen@nutanix.com>
Reported-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: David Chen <david.chen@nutanix.com>
Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
Link: https://lkml.kernel.org/r/20220708154401.21411-1-vincent.guittot@linaro.org
2022-07-13 11:29:17 +02:00
Frederic Weisbecker
e67198cc05 context_tracking: Take idle eqs entrypoints over RCU
The RCU dynticks counter is going to be merged into the context tracking
subsystem. Start with moving the idle extended quiescent states
entrypoints to context tracking. For now those are dumb redirections to
existing RCU calls.

[ paulmck: Apply kernel test robot feedback. ]

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-07-05 13:32:16 -07:00
Uros Bizjak
c02d5546ea sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) != old in
set_nr_{and_not,if}_polling. x86 cmpxchg returns success in ZF flag,
so this change saves a compare after cmpxchg.

The definition of cmpxchg based fetch_or was changed in the
same way as atomic_fetch_##op definitions were changed
in e6790e4b5d.

Also declare these two functions as inline to ensure inlining. In the
case of set_nr_and_not_polling, the compiler (gcc) tries to outsmart
itself by constructing the boolean return value with logic operations
on the fetched value, and these extra operations enlarge the function
over the inlining threshold value.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220629151552.6015-1-ubizjak@gmail.com
2022-07-04 09:23:08 +02:00
Josh Don
1fcf54deb7 sched/core: add forced idle accounting for cgroups
4feee7d126 previously added per-task forced idle accounting. This patch
extends this to also include cgroups.

rstat is used for cgroup accounting, except for the root, which uses
kcpustat in order to bypass the need for doing an rstat flush when
reading root stats.

Only cgroup v2 is supported. Similar to the task accounting, the cgroup
accounting requires that schedstats is enabled.

Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lkml.kernel.org/r/20220629211426.3329954-1-joshdon@google.com
2022-07-04 09:23:07 +02:00
Frederic Weisbecker
24a9c54182 context_tracking: Split user tracking Kconfig
Context tracking is going to be used not only to track user transitions
but also idle/IRQs/NMIs. The user tracking part will then become a
separate feature. Prepare Kconfig for that.

[ frederic: Apply Max Filippov feedback. ]

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-06-29 17:04:09 -07:00
Vincent Donnefort
b812fc9768 sched/fair: Remove the energy margin in feec()
find_energy_efficient_cpu() integrates a margin to protect tasks from
bouncing back and forth from a CPU to another. This margin is set as being
6% of the total current energy estimated on the system. This however does
not work for two reasons:

1. The energy estimation is not a good absolute value:

compute_energy() used in feec() is a good estimation for task placement as
it allows to compare the energy with and without a task. The computed
delta will give a good overview of the cost for a certain task placement.
It, however, doesn't work as an absolute estimation for the total energy
of the system. First it adds the contribution to idle CPUs into the
energy, second it mixes util_avg with util_est values. util_avg contains
the near history for a CPU usage, it doesn't tell at all what the current
utilization is. A system that has been quite busy in the near past will
hold a very high energy and then a high margin preventing any task
migration to a lower capacity CPU, wasting energy. It even creates a
negative feedback loop: by holding the tasks on a less efficient CPU, the
margin contributes in keeping the energy high.

2. The margin handicaps small tasks:

On a system where the workload is composed mostly of small tasks (which is
often the case on Android), the overall energy will be high enough to
create a margin none of those tasks can cross. On a Pixel4, a small
utilization of 5% on all the CPUs creates a global estimated energy of 140
joules, as per the Energy Model declaration of that same device. This
means, after applying the 6% margin that any migration must save more than
8 joules to happen. No task with a utilization lower than 40 would then be
able to migrate away from the biggest CPU of the system.

The 6% of the overall system energy was brought by the following patch:

 (eb92692b25 sched/fair: Speed-up energy-aware wake-ups)

It was previously 6% of the prev_cpu energy. Also, the following one
made this margin value conditional on the clusters where the task fits:

 (8d4c97c105 sched/fair: Only compute base_energy_pd if necessary)

We could simply revert that margin change to what it was, but the original
version didn't have strong grounds neither and as demonstrated in (1.) the
estimated energy isn't a good absolute value. Instead, removing it
completely. It is indeed, made possible by recent changes that improved
energy estimation comparison fairness (sched/fair: Remove task_util from
effective utilization in feec()) (PM: EM: Increase energy calculation
precision) and task utilization stabilization (sched/fair: Decay task
util_avg during migration)

Without a margin, we could have feared bouncing between CPUs. But running
LISA's eas_behaviour test coverage on three different platforms (Hikey960,
RB-5 and DB-845) showed no issue.

Removing the energy margin enables more energy-optimized placements for a
more energy efficient system.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-8-vdonnefort@google.com
2022-06-28 09:17:48 +02:00
Vincent Donnefort
3e8c6c9aac sched/fair: Remove task_util from effective utilization in feec()
The energy estimation in find_energy_efficient_cpu() (feec()) relies on
the computation of the effective utilization for each CPU of a perf domain
(PD). This effective utilization is then used as an estimation of the busy
time for this pd. The function effective_cpu_util() which gives this value,
scales the utilization relative to IRQ pressure on the CPU to take into
account that the IRQ time is hidden from the task clock. The IRQ scaling is
as follow:

   effective_cpu_util = irq + (cpu_cap - irq)/cpu_cap * util

Where util is the sum of CFS/RT/DL utilization, cpu_cap the capacity of
the CPU and irq the IRQ avg time.

If now we take as an example a task placement which doesn't raise the OPP
on the candidate CPU, we can write the energy delta as:

  delta = OPPcost/cpu_cap * (effective_cpu_util(cpu_util + task_util) -
                             effective_cpu_util(cpu_util))
        = OPPcost/cpu_cap * (cpu_cap - irq)/cpu_cap * task_util

We end-up with an energy delta depending on the IRQ avg time, which is a
problem: first the time spent on IRQs by a CPU has no effect on the
additional energy that would be consumed by a task. Second, we don't want
to favour a CPU with a higher IRQ avg time value.

Nonetheless, we need to take the IRQ avg time into account. If a task
placement raises the PD's frequency, it will increase the energy cost for
the entire time where the CPU is busy. A solution is to only use
effective_cpu_util() with the CPU contribution part. The task contribution
is added separately and scaled according to prev_cpu's IRQ time.

No change for the FREQUENCY_UTIL component of the energy estimation. We
still want to get the actual frequency that would be selected after the
task placement.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-7-vdonnefort@google.com
2022-06-28 09:17:47 +02:00
Dietmar Eggemann
9b340131a4 sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu()
The Perf Domain (PD) cpumask (struct em_perf_domain.cpus) stays
invariant after Energy Model creation, i.e. it is not updated after
CPU hotplug operations.

That's why the PD mask is used in conjunction with the cpu_online_mask
(or Sched Domain cpumask). Thereby the cpu_online_mask is fetched
multiple times (in compute_energy()) during a run-queue selection
for a task.

cpu_online_mask may change during this time which can lead to wrong
energy calculations.

To be able to avoid this, use the select_rq_mask per-cpu cpumask to
create a cpumask out of PD cpumask and cpu_online_mask and pass it
through the function calls of the EAS run-queue selection path.

The PD cpumask for max_spare_cap_cpu/compute_prev_delta selection
(find_energy_efficient_cpu()) is now ANDed not only with the SD mask
but also with the cpu_online_mask. This is fine since this cpumask
has to be in syc with the one used for energy computation
(compute_energy()).
An exclusive cpuset setup with at least one asymmetric CPU capacity
island (hence the additional AND with the SD cpumask) is the obvious
exception here.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-6-vdonnefort@google.com
2022-06-28 09:17:47 +02:00
Dietmar Eggemann
ec4fc801a0 sched/fair: Rename select_idle_mask to select_rq_mask
On 21/06/2022 11:04, Vincent Donnefort wrote:
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>

https://lkml.kernel.org/r/202206221253.ZVyGQvPX-lkp@intel.com discovered
that this patch doesn't build anymore (on tip sched/core or linux-next)
because of commit f5b2eeb499 ("sched/fair: Consider CPU affinity when
allowing NUMA imbalance in find_idlest_group()").

New version of [PATCH v11 4/7] sched/fair: Rename select_idle_mask to
select_rq_mask below.

-- >8 --

Decouple the name of the per-cpu cpumask select_idle_mask from its usage
in select_idle_[cpu/capacity]() of the CFS run-queue selection
(select_task_rq_fair()).

This is to support the reuse of this cpumask in the Energy Aware
Scheduling (EAS) path (find_energy_efficient_cpu()) of the CFS run-queue
selection.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/250691c7-0e2b-05ab-bedf-b245c11d9400@arm.com
2022-06-28 09:17:47 +02:00
Dietmar Eggemann
bb44799949 sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()
effective_cpu_util() already has a `int cpu' parameter which allows to
retrieve the CPU capacity scale factor (or maximum CPU capacity) inside
this function via an arch_scale_cpu_capacity(cpu).

A lot of code calling effective_cpu_util() (or the shim
sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call
arch_scale_cpu_capacity() already.
But not having to pass it into effective_cpu_util() will make the EAS
wake-up code easier, especially when the maximum CPU capacity reduced
by the thermal pressure is passed through the EAS wake-up functions.

Due to the asymmetric CPU capacity support of arm/arm64 architectures,
arch_scale_cpu_capacity(int cpu) is a per-CPU variable read access via
per_cpu(cpu_scale, cpu) on such a system.
On all other architectures it is a a compile-time constant
(SCHED_CAPACITY_SCALE).

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-4-vdonnefort@google.com
2022-06-28 09:17:46 +02:00
Vincent Donnefort
e2f3e35f1f sched/fair: Decay task PELT values during wakeup migration
Before being migrated to a new CPU, a task sees its PELT values
synchronized with rq last_update_time. Once done, that same task will also
have its sched_avg last_update_time reset. This means the time between
the migration and the last clock update will not be accounted for in
util_avg and a discontinuity will appear. This issue is amplified by the
PELT clock scaling. It takes currently one tick after the CPU being idle
to let clock_pelt catching up clock_task.

This is especially problematic for asymmetric CPU capacity systems which
need stable util_avg signals for task placement and energy estimation.

Ideally, this problem would be solved by updating the runqueue clocks
before the migration. But that would require taking the runqueue lock
which is quite expensive [1]. Instead estimate the missing time and update
the task util_avg with that value.

To that end, we need sched_clock_cpu() but it is a costly function. Limit
the usage to the case where the source CPU is idle as we know this is when
the clock is having the biggest risk of being outdated.

See comment in migrate_se_pelt_lag() for more details about how the PELT
value is estimated. Notice though this estimation doesn't take into account
IRQ and Paravirt time.

[1] https://lkml.kernel.org/r/20190709115759.10451-1-chris.redpath@arm.com

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-3-vdonnefort@google.com
2022-06-28 09:17:46 +02:00
Vincent Donnefort
d05b43059d sched/fair: Provide u64 read for 32-bits arch helper
Introducing macro helpers u64_u32_{store,load}() to factorize lockless
accesses to u64 variables for 32-bits architectures.

Users are for now cfs_rq.min_vruntime and sched_avg.last_update_time. To
accommodate the later where the copy lies outside of the structure
(cfs_rq.last_udpate_time_copy instead of sched_avg.last_update_time_copy),
use the _copy() version of those helpers.

Those new helpers encapsulate smp_rmb() and smp_wmb() synchronization and
therefore, have a small penalty for 32-bits machines in set_task_rq_fair()
and init_cfs_rq().

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-2-vdonnefort@google.com
2022-06-28 09:17:46 +02:00
Chen Yu
70fb5ccf2e sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
[Problem Statement]
select_idle_cpu() might spend too much time searching for an idle CPU,
when the system is overloaded.

The following histogram is the time spent in select_idle_cpu(),
when running 224 instances of netperf on a system with 112 CPUs
per LLC domain:

@usecs:
[0]                  533 |                                                    |
[1]                 5495 |                                                    |
[2, 4)             12008 |                                                    |
[4, 8)            239252 |                                                    |
[8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
[16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
[32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
[512, 1K)        2600472 |@@@@@@@@@                                           |
[1K, 2K)          927912 |@@@                                                 |
[2K, 4K)          218720 |                                                    |
[4K, 8K)           98161 |                                                    |
[8K, 16K)          37722 |                                                    |
[16K, 32K)          6715 |                                                    |
[32K, 64K)           477 |                                                    |
[64K, 128K)            7 |                                                    |

netperf latency usecs:
=======
case            	load    	    Lat_99th	    std%
TCP_RR          	thread-224	      257.39	(  0.21)

The time spent in select_idle_cpu() is visible to netperf and might have a negative
impact.

[Symptom analysis]
The patch [1] from Mel Gorman has been applied to track the efficiency
of select_idle_sibling. Copy the indicators here:

SIS Search Efficiency(se_eff%):
        A ratio expressed as a percentage of runqueues scanned versus
        idle CPUs found. A 100% efficiency indicates that the target,
        prev or recent CPU of a task was idle at wakeup. The lower the
        efficiency, the more runqueues were scanned before an idle CPU
        was found.

SIS Domain Search Efficiency(dom_eff%):
        Similar, except only for the slower SIS
	patch.

SIS Fast Success Rate(fast_rate%):
        Percentage of SIS that used target, prev or
	recent CPUs.

SIS Success rate(success_rate%):
        Percentage of scans that found an idle CPU.

The test is based on Aubrey's schedtests tool, including netperf, hackbench,
schbench and tbench.

Test on vanilla kernel:
schedstat_parse.py -f netperf_vanilla.log
case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
TCP_RR	   28 threads	     99.978	      18.535	      99.995	     100.000
TCP_RR	   56 threads	     99.397	       5.671	      99.964	     100.000
TCP_RR	   84 threads	     21.721	       6.818	      73.632	     100.000
TCP_RR	  112 threads	     12.500	       5.533	      59.000	     100.000
TCP_RR	  140 threads	      8.524	       4.535	      49.020	     100.000
TCP_RR	  168 threads	      6.438	       3.945	      40.309	      99.999
TCP_RR	  196 threads	      5.397	       3.718	      32.320	      99.982
TCP_RR	  224 threads	      4.874	       3.661	      25.775	      99.767
UDP_RR	   28 threads	     99.988	      17.704	      99.997	     100.000
UDP_RR	   56 threads	     99.528	       5.977	      99.970	     100.000
UDP_RR	   84 threads	     24.219	       6.992	      76.479	     100.000
UDP_RR	  112 threads	     13.907	       5.706	      62.538	     100.000
UDP_RR	  140 threads	      9.408	       4.699	      52.519	     100.000
UDP_RR	  168 threads	      7.095	       4.077	      44.352	     100.000
UDP_RR	  196 threads	      5.757	       3.775	      35.764	      99.991
UDP_RR	  224 threads	      5.124	       3.704	      28.748	      99.860

schedstat_parse.py -f schbench_vanilla.log
(each group has 28 tasks)
case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
normal	   1   mthread	     99.152	       6.400	      99.941	     100.000
normal	   2   mthreads	     97.844	       4.003	      99.908	     100.000
normal	   3   mthreads	     96.395	       2.118	      99.917	      99.998
normal	   4   mthreads	     55.288	       1.451	      98.615	      99.804
normal	   5   mthreads	      7.004	       1.870	      45.597	      61.036
normal	   6   mthreads	      3.354	       1.346	      20.777	      34.230
normal	   7   mthreads	      2.183	       1.028	      11.257	      21.055
normal	   8   mthreads	      1.653	       0.825	       7.849	      15.549

schedstat_parse.py -f hackbench_vanilla.log
(each group has 28 tasks)
case			load	        se_eff%	    dom_eff%	  fast_rate%	success_rate%
process-pipe	     1 group	         99.991	       7.692	      99.999	     100.000
process-pipe	    2 groups	         99.934	       4.615	      99.997	     100.000
process-pipe	    3 groups	         99.597	       3.198	      99.987	     100.000
process-pipe	    4 groups	         98.378	       2.464	      99.958	     100.000
process-pipe	    5 groups	         27.474	       3.653	      89.811	      99.800
process-pipe	    6 groups	         20.201	       4.098	      82.763	      99.570
process-pipe	    7 groups	         16.423	       4.156	      77.398	      99.316
process-pipe	    8 groups	         13.165	       3.920	      72.232	      98.828
process-sockets	     1 group	         99.977	       5.882	      99.999	     100.000
process-sockets	    2 groups	         99.927	       5.505	      99.996	     100.000
process-sockets	    3 groups	         99.397	       3.250	      99.980	     100.000
process-sockets	    4 groups	         79.680	       4.258	      98.864	      99.998
process-sockets	    5 groups	          7.673	       2.503	      63.659	      92.115
process-sockets	    6 groups	          4.642	       1.584	      58.946	      88.048
process-sockets	    7 groups	          3.493	       1.379	      49.816	      81.164
process-sockets	    8 groups	          3.015	       1.407	      40.845	      75.500
threads-pipe	     1 group	         99.997	       0.000	     100.000	     100.000
threads-pipe	    2 groups	         99.894	       2.932	      99.997	     100.000
threads-pipe	    3 groups	         99.611	       4.117	      99.983	     100.000
threads-pipe	    4 groups	         97.703	       2.624	      99.937	     100.000
threads-pipe	    5 groups	         22.919	       3.623	      87.150	      99.764
threads-pipe	    6 groups	         18.016	       4.038	      80.491	      99.557
threads-pipe	    7 groups	         14.663	       3.991	      75.239	      99.247
threads-pipe	    8 groups	         12.242	       3.808	      70.651	      98.644
threads-sockets	     1 group	         99.990	       6.667	      99.999	     100.000
threads-sockets	    2 groups	         99.940	       5.114	      99.997	     100.000
threads-sockets	    3 groups	         99.469	       4.115	      99.977	     100.000
threads-sockets	    4 groups	         87.528	       4.038	      99.400	     100.000
threads-sockets	    5 groups	          6.942	       2.398	      59.244	      88.337
threads-sockets	    6 groups	          4.359	       1.954	      49.448	      87.860
threads-sockets	    7 groups	          2.845	       1.345	      41.198	      77.102
threads-sockets	    8 groups	          2.871	       1.404	      38.512	      74.312

schedstat_parse.py -f tbench_vanilla.log
case			load	      se_eff%	    dom_eff%	  fast_rate%	success_rate%
loopback	  28 threads	       99.976	      18.369	      99.995	     100.000
loopback	  56 threads	       99.222	       7.799	      99.934	     100.000
loopback	  84 threads	       19.723	       6.819	      70.215	     100.000
loopback	 112 threads	       11.283	       5.371	      55.371	      99.999
loopback	 140 threads	        0.000	       0.000	       0.000	       0.000
loopback	 168 threads	        0.000	       0.000	       0.000	       0.000
loopback	 196 threads	        0.000	       0.000	       0.000	       0.000
loopback	 224 threads	        0.000	       0.000	       0.000	       0.000

According to the test above, if the system becomes busy, the
SIS Search Efficiency(se_eff%) drops significantly. Although some
benchmarks would finally find an idle CPU(success_rate% = 100%), it is
doubtful whether it is worth it to search the whole LLC domain.

[Proposal]
It would be ideal to have a crystal ball to answer this question:
How many CPUs must a wakeup path walk down, before it can find an idle
CPU? Many potential metrics could be used to predict the number.
One candidate is the sum of util_avg in this LLC domain. The benefit
of choosing util_avg is that it is a metric of accumulated historic
activity, which seems to be smoother than instantaneous metrics
(such as rq->nr_running). Besides, choosing the sum of util_avg
would help predict the load of the LLC domain more precisely, because
SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle
time.

In summary, the lower the util_avg is, the more select_idle_cpu()
should scan for idle CPU, and vice versa. When the sum of util_avg
in this LLC domain hits 85% or above, the scan stops. The reason to
choose 85% as the threshold is that this is the imbalance_pct(117)
when a LLC sched group is overloaded.

Introduce the quadratic function:

y = SCHED_CAPACITY_SCALE - p * x^2
and y'= y / SCHED_CAPACITY_SCALE

x is the ratio of sum_util compared to the CPU capacity:
x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)
y' is the ratio of CPUs to be scanned in the LLC domain,
and the number of CPUs to scan is calculated by:

nr_scan = llc_weight * y'

Choosing quadratic function is because:
[1] Compared to the linear function, it scans more aggressively when the
    sum_util is low.
[2] Compared to the exponential function, it is easier to calculate.
[3] It seems that there is no accurate mapping between the sum of util_avg
    and the number of CPUs to be scanned. Use heuristic scan for now.

For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
sum_util%   0    5   15   25  35  45  55   65   75   85   86 ...
scan_nr   112  111  108  102  93  81  65   47   25    1    0 ...

For a platform with 16 CPUs per LLC, the number of CPUs to scan is:
sum_util%   0    5   15   25  35  45  55   65   75   85   86 ...
scan_nr    16   15   15   14  13  11   9    6    3    0    0 ...

Furthermore, to minimize the overhead of calculating the metrics in
select_idle_cpu(), borrow the statistics from periodic load balance.
As mentioned by Abel, on a platform with 112 CPUs per LLC, the
sum_util calculated by periodic load balance after 112 ms would
decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay
in reflecting the latest utilization. But it is a trade-off.
Checking the util_avg in newidle load balance would be more frequent,
but it brings overhead - multiple CPUs write/read the per-LLC shared
variable and introduces cache contention. Tim also mentioned that,
it is allowed to be non-optimal in terms of scheduling for the
short-term variations, but if there is a long-term trend in the load
behavior, the scheduler can adjust for that.

When SIS_UTIL is enabled, the select_idle_cpu() uses the nr_scan
calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and
Mel suggested, SIS_UTIL should be enabled by default.

This patch is based on the util_avg, which is very sensitive to the
CPU frequency invariance. There is an issue that, when the max frequency
has been clamp, the util_avg would decay insanely fast when
the CPU is idle. Commit addca28512 ("cpufreq: intel_pstate: Handle no_turbo
in frequency invariance") could be used to mitigate this symptom, by adjusting
the arch_max_freq_ratio when turbo is disabled. But this issue is still
not thoroughly fixed, because the current code is unaware of the user-specified
max CPU frequency.

[Test result]

netperf and tbench were launched with 25% 50% 75% 100% 125% 150%
175% 200% of CPU number respectively. Hackbench and schbench were launched
by 1, 2 ,4, 8 groups. Each test lasts for 100 seconds and repeats 3 times.

The following is the benchmark result comparison between
baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare%
indicates better performance.

Each netperf test is a:
netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100
netperf.throughput
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	28 threads	 1.00 (  0.34)	 -0.16 (  0.40)
TCP_RR          	56 threads	 1.00 (  0.19)	 -0.02 (  0.20)
TCP_RR          	84 threads	 1.00 (  0.39)	 -0.47 (  0.40)
TCP_RR          	112 threads	 1.00 (  0.21)	 -0.66 (  0.22)
TCP_RR          	140 threads	 1.00 (  0.19)	 -0.69 (  0.19)
TCP_RR          	168 threads	 1.00 (  0.18)	 -0.48 (  0.18)
TCP_RR          	196 threads	 1.00 (  0.16)	+194.70 ( 16.43)
TCP_RR          	224 threads	 1.00 (  0.16)	+197.30 (  7.85)
UDP_RR          	28 threads	 1.00 (  0.37)	 +0.35 (  0.33)
UDP_RR          	56 threads	 1.00 ( 11.18)	 -0.32 (  0.21)
UDP_RR          	84 threads	 1.00 (  1.46)	 -0.98 (  0.32)
UDP_RR          	112 threads	 1.00 ( 28.85)	 -2.48 ( 19.61)
UDP_RR          	140 threads	 1.00 (  0.70)	 -0.71 ( 14.04)
UDP_RR          	168 threads	 1.00 ( 14.33)	 -0.26 ( 11.16)
UDP_RR          	196 threads	 1.00 ( 12.92)	+186.92 ( 20.93)
UDP_RR          	224 threads	 1.00 ( 11.74)	+196.79 ( 18.62)

Take the 224 threads as an example, the SIS search metrics changes are
illustrated below:

    vanilla                    patched
   4544492          +237.5%   15338634        sched_debug.cpu.sis_domain_search.avg
     38539        +39686.8%   15333634        sched_debug.cpu.sis_failed.avg
  128300000          -87.9%   15551326        sched_debug.cpu.sis_scanned.avg
   5842896          +162.7%   15347978        sched_debug.cpu.sis_search.avg

There is -87.9% less CPU scans after patched, which indicates lower overhead.
Besides, with this patch applied, there is -13% less rq lock contention
in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
.try_to_wake_up.default_wake_function.woken_wake_function.
This might help explain the performance improvement - Because this patch allows
the waking task to remain on the previous CPU, rather than grabbing other CPUs'
lock.

Each hackbench test is a:
hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100
hackbench.throughput
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1 group 	 1.00 (  1.29)	 +0.57 (  0.47)
process-pipe    	2 groups 	 1.00 (  0.27)	 +0.77 (  0.81)
process-pipe    	4 groups 	 1.00 (  0.26)	 +1.17 (  0.02)
process-pipe    	8 groups 	 1.00 (  0.15)	 -4.79 (  0.02)
process-sockets 	1 group 	 1.00 (  0.63)	 -0.92 (  0.13)
process-sockets 	2 groups 	 1.00 (  0.03)	 -0.83 (  0.14)
process-sockets 	4 groups 	 1.00 (  0.40)	 +5.20 (  0.26)
process-sockets 	8 groups 	 1.00 (  0.04)	 +3.52 (  0.03)
threads-pipe    	1 group 	 1.00 (  1.28)	 +0.07 (  0.14)
threads-pipe    	2 groups 	 1.00 (  0.22)	 -0.49 (  0.74)
threads-pipe    	4 groups 	 1.00 (  0.05)	 +1.88 (  0.13)
threads-pipe    	8 groups 	 1.00 (  0.09)	 -4.90 (  0.06)
threads-sockets 	1 group 	 1.00 (  0.25)	 -0.70 (  0.53)
threads-sockets 	2 groups 	 1.00 (  0.10)	 -0.63 (  0.26)
threads-sockets 	4 groups 	 1.00 (  0.19)	+11.92 (  0.24)
threads-sockets 	8 groups 	 1.00 (  0.08)	 +4.31 (  0.11)

Each tbench test is a:
tbench -t 100 $job 127.0.0.1
tbench.throughput
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	28 threads	 1.00 (  0.06)	 -0.14 (  0.09)
loopback        	56 threads	 1.00 (  0.03)	 -0.04 (  0.17)
loopback        	84 threads	 1.00 (  0.05)	 +0.36 (  0.13)
loopback        	112 threads	 1.00 (  0.03)	 +0.51 (  0.03)
loopback        	140 threads	 1.00 (  0.02)	 -1.67 (  0.19)
loopback        	168 threads	 1.00 (  0.38)	 +1.27 (  0.27)
loopback        	196 threads	 1.00 (  0.11)	 +1.34 (  0.17)
loopback        	224 threads	 1.00 (  0.11)	 +1.67 (  0.22)

Each schbench test is a:
schbench -m $job -t 28 -r 100 -s 30000 -c 30000
schbench.latency_90%_us
========
case            	load    	baseline(std%)	compare%( std%)
normal          	1 mthread	 1.00 ( 31.22)	 -7.36 ( 20.25)*
normal          	2 mthreads	 1.00 (  2.45)	 -0.48 (  1.79)
normal          	4 mthreads	 1.00 (  1.69)	 +0.45 (  0.64)
normal          	8 mthreads	 1.00 (  5.47)	 +9.81 ( 14.28)

*Consider the Standard Deviation, this -7.36% regression might not be valid.

Also, a OLTP workload with a commercial RDBMS has been tested, and there
is no significant change.

There were concerns that unbalanced tasks among CPUs would cause problems.
For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are
bound to CPU0~CPU6, while CPU7 is idle:

          CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
util_avg  1024    1024    1024    1024    1024    1024    1024    0

Since the util_avg ratio is 87.5%( = 7/8 ), which is higher than 85%,
select_idle_cpu() will not scan, thus CPU7 is undetected during scan.
But according to Mel, it is unlikely the CPU7 will be idle all the time
because CPU7 could pull some tasks via CPU_NEWLY_IDLE.

lkp(kernel test robot) has reported a regression on stress-ng.sock on a
very busy system. According to the sched_debug statistics, it might be caused
by SIS_UTIL terminates the scan and chooses a previous CPU earlier, and this
might introduce more context switch, especially involuntary preemption, which
impacts a busy stress-ng. This regression has shown that, not all benchmarks
in every scenario benefit from idle CPU scan limit, and it needs further
investigation.

Besides, there is slight regression in hackbench's 16 groups case when the
LLC domain has 16 CPUs. Prateek mentioned that we should scan aggressively
in an LLC domain with 16 CPUs. Because the cost to search for an idle one
among 16 CPUs is negligible. The current patch aims to propose a generic
solution and only considers the util_avg. Something like the below could
be applied on top of the current patch to fulfill the requirement:

	if (llc_weight <= 16)
		nr_scan = nr_scan * 32 / llc_weight;

For LLC domain with 16 CPUs, the nr_scan will be expanded to 2 times large.
The smaller the CPU number this LLC domain has, the larger nr_scan will be
expanded. This needs further investigation.

There is also ongoing work[2] from Abel to filter out the busy CPUs during
wakeup, to further speed up the idle CPU scan. And it could be a following-up
optimization on top of this change.

Suggested-by: Tim Chen <tim.c.chen@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com
2022-06-28 09:08:30 +02:00
Christian Göttsche
700a78335f sched: only perform capability check on privileged operation
sched_setattr(2) issues via kernel/sched/core.c:__sched_setscheduler()
a CAP_SYS_NICE audit event unconditionally, even when the requested
operation does not require that capability / is unprivileged, i.e. for
reducing niceness.
This is relevant in connection with SELinux, where a capability check
results in a policy decision and by default a denial message on
insufficient permission is issued.
It can lead to three undesired cases:
  1. A denial message is generated, even in case the operation was an
     unprivileged one and thus the syscall succeeded, creating noise.
  2. To avoid the noise from 1. the policy writer adds a rule to ignore
     those denial messages, hiding future syscalls, where the task
     performs an actual privileged operation, leading to hidden limited
     functionality of that task.
  3. To avoid the noise from 1. the policy writer adds a rule to allow
     the task the capability CAP_SYS_NICE, while it does not need it,
     violating the principle of least privilege.

Conduct privilged/unprivileged categorization first and perform a
capable test (and at most once) only if needed.

Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220615152505.310488-1-cgzones@googlemail.com
2022-06-28 09:08:29 +02:00
Zhang Qiao
c64b551f6a sched: Remove unused function group_first_cpu()
As of commit afe06efdf0 ("sched: Extend scheduler's asym packing")
group_first_cpu() became an unused function, remove it.

Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20220617181151.29980-3-zhangqiao22@huawei.com
2022-06-28 09:08:29 +02:00
Zhang Qiao
fb95a5a04d sched/fair: Remove redundant word " *"
" *" is redundant. so remove it.

Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220617181151.29980-2-zhangqiao22@huawei.com
2022-06-28 09:08:29 +02:00
Paul E. McKenney
e386b67257 rcu-tasks: Eliminate RCU Tasks Trace IPIs to online CPUs
Currently, the RCU Tasks Trace grace-period kthread IPIs each online CPU
using smp_call_function_single() in order to track any tasks currently in
RCU Tasks Trace read-side critical sections during which the corresponding
task has neither blocked nor been preempted.  These IPIs are annoying
and are also not strictly necessary because any task that blocks or is
preempted within its current RCU Tasks Trace read-side critical section
will be tracked on one of the per-CPU rcu_tasks_percpu structure's
->rtp_blkd_tasks list.  So the only time that this is a problem is if
one of the CPUs runs through a long-duration RCU Tasks Trace read-side
critical section without a context switch.

Note that the task_call_func() function cannot help here because there is
no safe way to identify the target task.  Of course, the task_call_func()
function will be very useful later, when processing the list of tasks,
but it needs to know the task.

This commit therefore creates a cpu_curr_snapshot() function that returns
a pointer the task_struct structure of some task that happened to be
running on the specified CPU more or less during the time that the
cpu_curr_snapshot() function was executing.  If there was no context
switch during this time, this function will return a pointer to the
task_struct structure of the task that was running throughout.  If there
was a context switch, then the outgoing task will be taken care of by
RCU's context-switch hook, and the incoming task was either already taken
care during some previous context switch, or it is not currently within an
RCU Tasks Trace read-side critical section.  And in this latter case, the
grace period already started, so there is no need to wait on this task.

This new cpu_curr_snapshot() function is invoked on each CPU early in
the RCU Tasks Trace grace-period processing, and the resulting tasks
are queued for later quiescent-state inspection.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: KP Singh <kpsingh@kernel.org>
2022-06-21 15:49:38 -07:00
Tianchen Ding
f3dd3f6745 sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle
Wakelist can help avoid cache bouncing and offload the overhead of waker
cpu. So far, using wakelist within the same llc only happens on
WF_ON_CPU, and this limitation could be removed to further improve
wakeup performance.

The commit 518cd62341 ("sched: Only queue remote wakeups when
crossing cache boundaries") disabled queuing tasks on wakelist when
the cpus share llc. This is because, at that time, the scheduler must
send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist also
supports TIF_POLLING, so this is not a problem now when the wakee cpu is
in idle polling.

Benefits:
  Queuing the task on idle cpu can help improving performance on waker cpu
  and utilization on wakee cpu, and further improve locality because
  the wakee cpu can handle its own rq. This patch helps improving rt on
  our real java workloads where wakeup happens frequently.

  Consider the normal condition (CPU0 and CPU1 share same llc)
  Before this patch:

         CPU0                                       CPU1

    select_task_rq()                                idle
    rq_lock(CPU1->rq)
    enqueue_task(CPU1->rq)
    notify CPU1 (by sending IPI or CPU1 polling)

                                                    resched()

  After this patch:

         CPU0                                       CPU1

    select_task_rq()                                idle
    add to wakelist of CPU1
    notify CPU1 (by sending IPI or CPU1 polling)

                                                    rq_lock(CPU1->rq)
                                                    enqueue_task(CPU1->rq)
                                                    resched()

  We see CPU0 can finish its work earlier. It only needs to put task to
  wakelist and return.
  While CPU1 is idle, so let itself handle its own runqueue data.

This patch brings no difference about IPI.
  This patch only takes effect when the wakee cpu is:
  1) idle polling
  2) idle not polling

  For 1), there will be no IPI with or without this patch.

  For 2), there will always be an IPI before or after this patch.
  Before this patch: waker cpu will enqueue task and check preempt. Since
  "idle" will be sure to be preempted, waker cpu must send a resched IPI.
  After this patch: waker cpu will put the task to the wakelist of wakee
  cpu, and send an IPI.

Benchmark:
We've tested schbench, unixbench, and hachbench on both x86 and arm64.

On x86 (Intel Xeon Platinum 8269CY):
  schbench -m 2 -t 8

    Latency percentiles (usec)              before        after
        50.0000th:                             8            6
        75.0000th:                            10            7
        90.0000th:                            11            8
        95.0000th:                            12            8
        *99.0000th:                           13           10
        99.5000th:                            15           11
        99.9000th:                            18           14

  Unixbench with full threads (104)
                                            before        after
    Dhrystone 2 using register variables  3011862938    3009935994  -0.06%
    Double-Precision Whetstone              617119.3      617298.5   0.03%
    Execl Throughput                         27667.3       27627.3  -0.14%
    File Copy 1024 bufsize 2000 maxblocks   785871.4      784906.2  -0.12%
    File Copy 256 bufsize 500 maxblocks     210113.6      212635.4   1.20%
    File Copy 4096 bufsize 8000 maxblocks  2328862.2     2320529.1  -0.36%
    Pipe Throughput                      145535622.8   145323033.2  -0.15%
    Pipe-based Context Switching           3221686.4     3583975.4  11.25%
    Process Creation                        101347.1      103345.4   1.97%
    Shell Scripts (1 concurrent)            120193.5      123977.8   3.15%
    Shell Scripts (8 concurrent)             17233.4       17138.4  -0.55%
    System Call Overhead                   5300604.8     5312213.6   0.22%

  hackbench -g 1 -l 100000
                                            before        after
    Time                                     3.246        2.251

On arm64 (Ampere Altra):
  schbench -m 2 -t 8

    Latency percentiles (usec)              before        after
        50.0000th:                            14           10
        75.0000th:                            19           14
        90.0000th:                            22           16
        95.0000th:                            23           16
        *99.0000th:                           24           17
        99.5000th:                            24           17
        99.9000th:                            28           25

  Unixbench with full threads (80)
                                            before        after
    Dhrystone 2 using register variables  3536194249    3537019613   0.02%
    Double-Precision Whetstone              629383.6      629431.6   0.01%
    Execl Throughput                         65920.5       65846.2  -0.11%
    File Copy 1024 bufsize 2000 maxblocks  1063722.8     1064026.8   0.03%
    File Copy 256 bufsize 500 maxblocks     322684.5      318724.5  -1.23%
    File Copy 4096 bufsize 8000 maxblocks  2348285.3     2328804.8  -0.83%
    Pipe Throughput                      133542875.3   131619389.8  -1.44%
    Pipe-based Context Switching           3215356.1     3576945.1  11.25%
    Process Creation                        108520.5      120184.6  10.75%
    Shell Scripts (1 concurrent)            122636.3        121888  -0.61%
    Shell Scripts (8 concurrent)             17462.1       17381.4  -0.46%
    System Call Overhead                   4429998.9     4435006.7   0.11%

  hackbench -g 1 -l 100000
                                            before        after
    Time                                     4.217        2.916

Our patch has improvement on schbench, hackbench
and Pipe-based Context Switching of unixbench
when there exists idle cpus,
and no obvious regression on other tests of unixbench.
This can help improve rt in scenes where wakeup happens frequently.

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20220608233412.327341-3-dtcccc@linux.alibaba.com
2022-06-13 10:30:01 +02:00
Tianchen Ding
28156108fe sched: Fix the check of nr_running at queue wakelist
The commit 2ebb177175 ("sched/core: Offload wakee task activation if it
the wakee is descheduling") checked rq->nr_running <= 1 to avoid task
stacking when WF_ON_CPU.

Per the ordering of writes to p->on_rq and p->on_cpu, observing p->on_cpu
(WF_ON_CPU) in ttwu_queue_cond() implies !p->on_rq, IOW p has gone through
the deactivate_task() in __schedule(), thus p has been accounted out of
rq->nr_running. As such, the task being the only runnable task on the rq
implies reading rq->nr_running == 0 at that point.

The benchmark result is in [1].

[1] https://lore.kernel.org/all/e34de686-4e85-bde1-9f3c-9bbc86b38627@linux.alibaba.com/

Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20220608233412.327341-2-dtcccc@linux.alibaba.com
2022-06-13 10:30:01 +02:00
Josh Don
792b9f65a5 sched: Allow newidle balancing to bail out of load_balance
While doing newidle load balancing, it is possible for new tasks to
arrive, such as with pending wakeups. newidle_balance() already accounts
for this by exiting the sched_domain load_balance() iteration if it
detects these cases. This is very important for minimizing wakeup
latency.

However, if we are already in load_balance(), we may stay there for a
while before returning back to newidle_balance(). This is most
exacerbated if we enter a 'goto redo' loop in the LBF_ALL_PINNED case. A
very straightforward workaround to this is to adjust should_we_balance()
to bail out if we're doing a CPU_NEWLY_IDLE balance and new tasks are
detected.

This was tested with the following reproduction:
- two threads that take turns sleeping and waking each other up are
  affined to two cores
- a large number of threads with 100% utilization are pinned to all
  other cores

Without this patch, wakeup latency was ~120us for the pair of threads,
almost entirely spent in load_balance(). With this patch, wakeup latency
is ~6us.

Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220609025515.2086253-1-joshdon@google.com
2022-06-13 10:30:01 +02:00
Yajun Deng
2ed81e7654 sched/deadline: Use proc_douintvec_minmax() limit minimum value
sysctl_sched_dl_period_max and sysctl_sched_dl_period_min are unsigned
integer, but proc_dointvec() wouldn't return error even if we set a
negative number.

Use proc_douintvec_minmax() instead of proc_dointvec(). Add extra1 for
sysctl_sched_dl_period_max and extra2 for sysctl_sched_dl_period_min.

It's just an optimization for match data and proc_handler in struct
ctl_table. The 'if (period < min || period > max)' in __checkparam_dl()
will work fine even if there hasn't this patch.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Link: https://lore.kernel.org/r/20220607101807.249965-1-yajun.deng@linux.dev
2022-06-13 10:30:00 +02:00
Chengming Zhou
51bf903b64 sched/fair: Optimize and simplify rq leaf_cfs_rq_list
We notice the rq leaf_cfs_rq_list has two problems when do bugfix
backports and some test profiling.

1. cfs_rqs under throttled subtree could be added to the list, and
   make their fully decayed ancestors on the list, even though not needed.

2. #1 also make the leaf_cfs_rq_list management complex and error prone,
   this is the list of related bugfix so far:

   commit 31bc6aeaab ("sched/fair: Optimize update_blocked_averages()")
   commit fe61468b2c ("sched/fair: Fix enqueue_task_fair warning")
   commit b34cb07dde ("sched/fair: Fix enqueue_task_fair() warning some more")
   commit 39f23ce07b ("sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list")
   commit 0258bdfaff ("sched/fair: Fix unfairness caused by missing load decay")
   commit a7b359fc6a ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
   commit fdaba61ef8 ("sched/fair: Ensure that the CFS parent is added after unthrottling")
   commit 2630cde267 ("sched/fair: Add ancestors of unthrottled undecayed cfs_rq")

commit 31bc6aeaab ("sched/fair: Optimize update_blocked_averages()")
delete every cfs_rq under throttled subtree from rq->leaf_cfs_rq_list,
and delete the throttled_hierarchy() test in update_blocked_averages(),
which optimized update_blocked_averages().

But those later bugfix add cfs_rqs under throttled subtree back to
rq->leaf_cfs_rq_list again, with their fully decayed ancestors, for
the integrity of rq->leaf_cfs_rq_list.

This patch takes another method, skip all cfs_rqs under throttled
hierarchy when list_add_leaf_cfs_rq(), to completely make cfs_rqs
under throttled subtree off the leaf_cfs_rq_list.

So we don't need to consider throttled related things in
enqueue_entity(), unthrottle_cfs_rq() and enqueue_task_fair(),
which simplify the code a lot. Also optimize update_blocked_averages()
since cfs_rqs under throttled hierarchy and their ancestors
won't be on the leaf_cfs_rq_list.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220601021848.76943-1-zhouchengming@bytedance.com
2022-06-13 10:30:00 +02:00
K Prateek Nayak
f5b2eeb499 sched/fair: Consider CPU affinity when allowing NUMA imbalance in find_idlest_group()
In the case of systems containing multiple LLCs per socket, like
AMD Zen systems, users want to spread bandwidth hungry applications
across multiple LLCs. Stream is one such representative workload where
the best performance is obtained by limiting one stream thread per LLC.
To ensure this, users are known to pin the tasks to a specify a subset
of the CPUs consisting of one CPU per LLC while running such bandwidth
hungry tasks.

Suppose we kickstart a multi-threaded task like stream with 8 threads
using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3
server where each socket contains 128 CPUs
(0-63,128-191 in one socket, 64-127,192-255 in another socket)

Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8

Here each CPU in the list is from a different LLC and 4 of those LLCs
are on one socket, while the other 4 are on another socket.

Ideally we would prefer that each stream thread runs on a different
CPU from the allowed list of CPUs. However, the current heuristics in
find_idlest_group() do not allow this during the initial placement.

Suppose the first socket (0-63,128-191) is our local group from which
we are kickstarting the stream tasks. The first four stream threads
will be placed in this socket. When it comes to placing the 5th
thread, all the allowed CPUs are from the local group (0,16,32,48)
would have been taken.

However, the current scheduler code simply checks if the number of
tasks in the local group is fewer than the allowed numa-imbalance
threshold. This threshold was previously 25% of the NUMA domain span
(in this case threshold = 32) but after the v6 of Mel's patchset
"Adjust NUMA imbalance for multiple LLCs", got merged in sched-tip,
Commit: e496132ebe ("sched/fair: Adjust the allowed NUMA imbalance
when SD_NUMA spans multiple LLCs") it is now equal to number of LLCs
in the NUMA domain, for processors with multiple LLCs.
(in this case threshold = 8).

For this example, the number of tasks will always be within threshold
and thus all the 8 stream threads will be woken up on the first socket
thereby resulting in sub-optimal performance.

The following sched_wakeup_new tracepoint output shows the initial
placement of tasks in the current tip/sched/core on the Zen3 machine:

stream-5313    [016] d..2.   627.005036: sched_wakeup_new: comm=stream pid=5315 prio=120 target_cpu=032
stream-5313    [016] d..2.   627.005086: sched_wakeup_new: comm=stream pid=5316 prio=120 target_cpu=048
stream-5313    [016] d..2.   627.005141: sched_wakeup_new: comm=stream pid=5317 prio=120 target_cpu=000
stream-5313    [016] d..2.   627.005183: sched_wakeup_new: comm=stream pid=5318 prio=120 target_cpu=016
stream-5313    [016] d..2.   627.005218: sched_wakeup_new: comm=stream pid=5319 prio=120 target_cpu=016
stream-5313    [016] d..2.   627.005256: sched_wakeup_new: comm=stream pid=5320 prio=120 target_cpu=016
stream-5313    [016] d..2.   627.005295: sched_wakeup_new: comm=stream pid=5321 prio=120 target_cpu=016

Once the first four threads are distributed among the allowed CPUs of
socket one, the rest of the treads start piling on these same CPUs
when clearly there are CPUs on the second socket that can be used.

Following the initial pile up on a small number of CPUs, though the
load-balancer eventually kicks in, it takes a while to get to {4}{4}
and even {4}{4} isn't stable as we observe a bunch of ping ponging
between {4}{4} to {5}{3} and back before a stable state is reached
much later (1 Stream thread per allowed CPU) and no more migration is
required.

We can detect this piling and avoid it by checking if the number of
allowed CPUs in the local group are fewer than the number of tasks
running in the local group and use this information to spread the
5th task out into the next socket (after all, the goal in this
slowpath is to find the idlest group and the idlest CPU during the
initial placement!).

The following sched_wakeup_new tracepoint output shows the initial
placement of tasks after adding this fix on the Zen3 machine:

stream-4485    [016] d..2.   230.784046: sched_wakeup_new: comm=stream pid=4487 prio=120 target_cpu=032
stream-4485    [016] d..2.   230.784123: sched_wakeup_new: comm=stream pid=4488 prio=120 target_cpu=048
stream-4485    [016] d..2.   230.784167: sched_wakeup_new: comm=stream pid=4489 prio=120 target_cpu=000
stream-4485    [016] d..2.   230.784222: sched_wakeup_new: comm=stream pid=4490 prio=120 target_cpu=112
stream-4485    [016] d..2.   230.784271: sched_wakeup_new: comm=stream pid=4491 prio=120 target_cpu=096
stream-4485    [016] d..2.   230.784322: sched_wakeup_new: comm=stream pid=4492 prio=120 target_cpu=080
stream-4485    [016] d..2.   230.784368: sched_wakeup_new: comm=stream pid=4493 prio=120 target_cpu=064

We see that threads are using all of the allowed CPUs and there is
no pileup.

No output is generated for tracepoint sched_migrate_task with this
patch due to a perfect initial placement which removes the need
for balancing later on - both across NUMA boundaries and within
NUMA boundaries for stream.

Following are the results from running 8 Stream threads with and
without pinning on a dual socket Zen3 Machine (2 x 64C/128T):

During the testing of this patch, the tip sched/core was at
commit: 089c02ae27 "ftrace: Use preemption model accessors for trace
header printout"

Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8

	           5.18.0-rc1               5.18.0-rc1                5.18.0-rc1
               tip sched/core           tip sched/core            tip sched/core
                 (no pinning)                + pinning              + this-patch
								       + pinning

 Copy:   109364.74 (0.00 pct)     94220.50 (-13.84 pct)    158301.28 (44.74 pct)
Scale:   109670.26 (0.00 pct)     90210.59 (-17.74 pct)    149525.64 (36.34 pct)
  Add:   129029.01 (0.00 pct)    101906.00 (-21.02 pct)    186658.17 (44.66 pct)
Triad:   127260.05 (0.00 pct)    106051.36 (-16.66 pct)    184327.30 (44.84 pct)

Pinning currently hurts the performance compared to unbound case on
tip/sched/core. With the addition of this patch, we are able to
outperform tip/sched/core by a good margin with pinning.

Following are the results from running 16 Stream threads with and
without pinning on a dual socket IceLake Machine (2 x 32C/64T):

NUMA Topology of Intel Skylake machine:
Node 1: 0,2,4,6 ... 126 (Even numbers)
Node 2: 1,3,5,7 ... 127 (Odd numbers)

Pinning is done using: numactl -C 0-15 ./stream16

	           5.18.0-rc1               5.18.0-rc1                5.18.0-rc1
               tip sched/core           tip sched/core            tip sched/core
                 (no pinning)                 +pinning              + this-patch
								       + pinning

 Copy:    85815.31 (0.00 pct)     149819.21 (74.58 pct)    156807.48 (82.72 pct)
Scale:    64795.60 (0.00 pct)      97595.07 (50.61 pct)     99871.96 (54.13 pct)
  Add:    71340.68 (0.00 pct)     111549.10 (56.36 pct)    114598.33 (60.63 pct)
Triad:    68890.97 (0.00 pct)     111635.16 (62.04 pct)    114589.24 (66.33 pct)

In case of Icelake machine, with single LLC per socket, pinning across
the two sockets reduces cache contention, thus showing great
improvement in pinned case which is further benefited by this patch.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lkml.kernel.org/r/20220407111222.22649-1-kprateek.nayak@amd.com
2022-06-13 10:30:00 +02:00
Mel Gorman
026b98a93b sched/numa: Adjust imb_numa_nr to a better approximation of memory channels
For a single LLC per node, a NUMA imbalance is allowed up until 25%
of CPUs sharing a node could be active. One intent of the cut-off is
to avoid an imbalance of memory channels but there is no topological
information based on active memory channels. Furthermore, there can
be differences between nodes depending on the number of populated
DIMMs.

A cut-off of 25% was arbitrary but generally worked. It does have a severe
corner cases though when an parallel workload is using 25% of all available
CPUs over-saturates memory channels. This can happen due to the initial
forking of tasks that get pulled more to one node after early wakeups
(e.g. a barrier synchronisation) that is not quickly corrected by the
load balancer. The LB may fail to act quickly as the parallel tasks are
considered to be poor migrate candidates due to locality or cache hotness.

On a range of modern Intel CPUs, 12.5% appears to be a better cut-off
assuming all memory channels are populated and is used as the new cut-off
point. A minimum of 1 is specified to allow a communicating pair to
remain local even for CPUs with low numbers of cores. For modern AMDs,
there are multiple LLCs and are not affected.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220520103519.1863-5-mgorman@techsingularity.net
2022-06-13 10:30:00 +02:00
Mel Gorman
cb29a5c19d sched/numa: Apply imbalance limitations consistently
The imbalance limitations are applied inconsistently at fork time
and at runtime. At fork, a new task can remain local until there are
too many running tasks even if the degree of imbalance is larger than
NUMA_IMBALANCE_MIN which is different to runtime. Secondly, the imbalance
figure used during load balancing is different to the one used at NUMA
placement. Load balancing uses the number of tasks that must move to
restore imbalance where as NUMA balancing uses the total imbalance.

In combination, it is possible for a parallel workload that uses a small
number of CPUs without applying scheduler policies to have very variable
run-to-run performance.

[lkp@intel.com: Fix build breakage for arc-allyesconfig]

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220520103519.1863-4-mgorman@techsingularity.net
2022-06-13 10:29:59 +02:00
Mel Gorman
13ede33150 sched/numa: Do not swap tasks between nodes when spare capacity is available
If a destination node has spare capacity but there is an imbalance then
two tasks are selected for swapping. If the tasks have no numa group
or are within the same NUMA group, it's simply shuffling tasks around
without having any impact on the compute imbalance. Instead, it's just
punishing one task to help another.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220520103519.1863-3-mgorman@techsingularity.net
2022-06-13 10:29:59 +02:00
Mel Gorman
70ce3ea9aa sched/numa: Initialise numa_migrate_retry
On clone, numa_migrate_retry is inherited from the parent which means
that the first NUMA placement of a task is non-deterministic. This
affects when load balancing recognises numa tasks and whether to
migrate "regular", "remote" or "all" tasks between NUMA scheduler
domains.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220520103519.1863-2-mgorman@techsingularity.net
2022-06-13 10:29:59 +02:00
Peter Zijlstra
04193d590b sched: Fix balance_push() vs __sched_setscheduler()
The purpose of balance_push() is to act as a filter on task selection
in the case of CPU hotplug, specifically when taking the CPU out.

It does this by (ab)using the balance callback infrastructure, with
the express purpose of keeping all the unlikely/odd cases in a single
place.

In order to serve its purpose, the balance_push_callback needs to be
(exclusively) on the callback list at all times (noting that the
callback always places itself back on the list the moment it runs,
also noting that when the CPU goes down, regular balancing concerns
are moot, so ignoring them is fine).

And here-in lies the problem, __sched_setscheduler()'s use of
splice_balance_callbacks() takes the callbacks off the list across a
lock-break, making it possible for, an interleaving, __schedule() to
see an empty list and not get filtered.

Fixes: ae79270232 ("sched: Optimize finish_lock_switch()")
Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Link: https://lkml.kernel.org/r/20220519134706.GH2578@worktop.programming.kicks-ass.net
2022-06-13 10:15:07 +02:00
Chen Wandun
5f69a6577b psi: dont alloc memory for psi by default
Memory about struct psi_group is allocated by default for
each cgroup even if psi_disabled is true, in this case, these
allocated memory is waste, so alloc memory for struct psi_group
only when psi_disabled is false.

Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-06-07 07:11:47 -10:00
Linus Torvalds
bc1e02c3e5 Fix the fallout of sysctl code move which placed the init function wrong.
-----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKcc+ATHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYobuAD/i9gZZ1b1Qtr9vCeioy8hXkP1QltXC5
 sBK/l+LH8kWMi2eQtqhr5T8Xhca0K84Ml2gCLimnhX5PZxUJ7VTL9DVhhU0vPbFx
 7dKVAF6xU6B0gV92Ojn4ztHXM/qulqbOExESaxOXpIQ8Rh5QUqokVSeTFVzQmxve
 GykHUGK6DFi3+W1aID4oW5BsGgG1i+5Qn4HuSP35vfvS4e0K056s5ANfJdzt4+Sz
 q4APbOrB5Dgbg351vRk+ms49TariVrdWOBc4ujPAYJms+oCrDHpTbHtTAKt/n1UF
 z5AbfFkebSK7hf2nplo1Qr+QZ0vLlWchbc+IdjhPvkXrZwqbgdJKJUznieHyUEOO
 xUUYKErEAcDNuF6wkhum5hhRARuHpHOdd6qAIFGsexTN7gOIgFF3Mdq2GD4YBebF
 O86tinGdOD1hzzxzW+r+AIgZgCUILmP0+lRasEZuJMT/iTWAiU1MjrNNPL45ZX1e
 Ldie5b7rGubwKaCwCv1R//D9NI6OTalVnmCyf1NRIeb9py4bG3SgCQecHP0pZeh5
 +xmsO8rrMlkq2fOe9K6N57vWeRcvQzTMGb17wZh/fAWCa2Ny7Si52NFgRwPNExCu
 YKR2bptFSZmKe0cZ8Q6mdw7J1tCJm+4Htgw9LSgaf/jvmvHyTV261ZZTHeglsdTQ
 JDGbad9Vt6h1
 =arj3
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Thomas Gleixner:
 "Fix the fallout of sysctl code move which placed the init function
  wrong"

* tag 'sched-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/autogroup: Fix sysctl move
2022-06-05 10:42:40 -07:00
Linus Torvalds
67850b7bdc While looking at the ptrace problems with PREEMPT_RT and the problems
of Peter Zijlstra was encountering with ptrace in his freezer rewrite
 I identified some cleanups to ptrace_stop that make sense on their own
 and move make resolving the other problems much simpler.
 
 The biggest issue is the habbit of the ptrace code to change task->__state
 from the tracer to suppress TASK_WAKEKILL from waking up the tracee.  No
 other code in the kernel does that and it is straight forward to update
 signal_wake_up and friends to make that unnecessary.
 
 Peter's task freezer sets frozen tasks to a new state TASK_FROZEN and
 then it stores them by calling "wake_up_state(t, TASK_FROZEN)" relying
 on the fact that all stopped states except the special stop states can
 tolerate spurious wake up and recover their state.
 
 The state of stopped and traced tasked is changed to be stored in
 task->jobctl as well as in task->__state.  This makes it possible for
 the freezer to recover tasks in these special states, as well as
 serving as a general cleanup.  With a little more work in that
 direction I believe TASK_STOPPED can learn to tolerate spurious wake
 ups and become an ordinary stop state.
 
 The TASK_TRACED state has to remain a special state as the registers for
 a process are only reliably available when the process is stopped in
 the scheduler.  Fundamentally ptrace needs acess to the saved
 register values of a task.
 
 There are bunch of semi-random ptrace related cleanups that were found
 while looking at these issues.
 
 One cleanup that deserves to be called out is from commit 57b6de08b5
 ("ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs").  This
 makes a change that is technically user space visible, in the handling
 of what happens to a tracee when a tracer dies unexpectedly.
 According to our testing and our understanding of userspace nothing
 cares that spurious SIGTRAPs can be generated in that case.
 
 The entire discussion can be found at:
   https://lkml.kernel.org/r/87a6bv6dl6.fsf_-_@email.froward.int.ebiederm.org
 
 Eric W. Biederman (11):
       signal: Rename send_signal send_signal_locked
       signal: Replace __group_send_sig_info with send_signal_locked
       ptrace/um: Replace PT_DTRACE with TIF_SINGLESTEP
       ptrace/xtensa: Replace PT_SINGLESTEP with TIF_SINGLESTEP
       ptrace: Remove arch_ptrace_attach
       signal: Use lockdep_assert_held instead of assert_spin_locked
       ptrace: Reimplement PTRACE_KILL by always sending SIGKILL
       ptrace: Document that wait_task_inactive can't fail
       ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs
       ptrace: Don't change __state
       ptrace: Always take siglock in ptrace_resume
 
 Peter Zijlstra (1):
       sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state
 
  arch/ia64/include/asm/ptrace.h    |   4 --
  arch/ia64/kernel/ptrace.c         |  57 ----------------
  arch/um/include/asm/thread_info.h |   2 +
  arch/um/kernel/exec.c             |   2 +-
  arch/um/kernel/process.c          |   2 +-
  arch/um/kernel/ptrace.c           |   8 +--
  arch/um/kernel/signal.c           |   4 +-
  arch/x86/kernel/step.c            |   3 +-
  arch/xtensa/kernel/ptrace.c       |   4 +-
  arch/xtensa/kernel/signal.c       |   4 +-
  drivers/tty/tty_jobctrl.c         |   4 +-
  include/linux/ptrace.h            |   7 --
  include/linux/sched.h             |  10 ++-
  include/linux/sched/jobctl.h      |   8 +++
  include/linux/sched/signal.h      |  20 ++++--
  include/linux/signal.h            |   3 +-
  kernel/ptrace.c                   |  87 ++++++++---------------
  kernel/sched/core.c               |   5 +-
  kernel/signal.c                   | 140 +++++++++++++++++---------------------
  kernel/time/posix-cpu-timers.c    |   6 +-
  20 files changed, 140 insertions(+), 240 deletions(-)
 
 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmKaXaYACgkQC/v6Eiaj
 j0CgoA/+JncSQ6PY2D5Jh1apvHzmnRsFXzr3DRvtv/CVx4oIebOXRQFyVDeD5tRn
 TmMgB29HpBlHRDLojlmlZRGAld1HR/aPEW9j8W1D3Sy/ZFO5L8lQitv9aDHO9Ntw
 4lZvlhS1M0KhATudVVBqSPixiG6CnV5SsGmixqdOyg7xcXSY6G1l2nB7Zk9I3Tat
 ZlmhuZ6R5Z5qsm4MEq0vUSrnsHiGxYrpk6uQOaVz8Wkv8ZFmbutt6XgxF0tsyZNn
 mHSmWSiZzIgBjTlaibEmxi8urYJTPj3vGBeJQVYHblFwLFi6+Oy7bDxQbWjQvaZh
 DsgWPScfBF4Jm0+8hhCiSYpvPp8XnZuklb4LNCeok/VFr+KfSmpJTIhn00kagQ1u
 vxQDqLws8YLW4qsfGydfx9uUIFCbQE/V2VDYk5J3Re3gkUNDOOR1A56hPniKv6VB
 2aqGO2Fl0RdBbUa3JF+XI5Pwq5y1WrqR93EUvj+5+u5W9rZL/8WLBHBMEz6gbmfD
 DhwFE0y8TG2WRlWJVEDRId+5zo3di/YvasH0vJZ5HbrxhS2RE/yIGAd+kKGx/lZO
 qWDJC7IHvFJ7Mw5KugacyF0SHeNdloyBM7KZW6HeXmgKn9IMJBpmwib92uUkRZJx
 D8j/bHHqD/zsgQ39nO+c4M0MmhO/DsPLG/dnGKrRCu7v1tmEnkY=
 =ZUuO
 -----END PGP SIGNATURE-----

Merge tag 'ptrace_stop-cleanup-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull ptrace_stop cleanups from Eric Biederman:
 "While looking at the ptrace problems with PREEMPT_RT and the problems
  Peter Zijlstra was encountering with ptrace in his freezer rewrite I
  identified some cleanups to ptrace_stop that make sense on their own
  and move make resolving the other problems much simpler.

  The biggest issue is the habit of the ptrace code to change
  task->__state from the tracer to suppress TASK_WAKEKILL from waking up
  the tracee. No other code in the kernel does that and it is straight
  forward to update signal_wake_up and friends to make that unnecessary.

  Peter's task freezer sets frozen tasks to a new state TASK_FROZEN and
  then it stores them by calling "wake_up_state(t, TASK_FROZEN)" relying
  on the fact that all stopped states except the special stop states can
  tolerate spurious wake up and recover their state.

  The state of stopped and traced tasked is changed to be stored in
  task->jobctl as well as in task->__state. This makes it possible for
  the freezer to recover tasks in these special states, as well as
  serving as a general cleanup. With a little more work in that
  direction I believe TASK_STOPPED can learn to tolerate spurious wake
  ups and become an ordinary stop state.

  The TASK_TRACED state has to remain a special state as the registers
  for a process are only reliably available when the process is stopped
  in the scheduler. Fundamentally ptrace needs acess to the saved
  register values of a task.

  There are bunch of semi-random ptrace related cleanups that were found
  while looking at these issues.

  One cleanup that deserves to be called out is from commit 57b6de08b5
  ("ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs"). This
  makes a change that is technically user space visible, in the handling
  of what happens to a tracee when a tracer dies unexpectedly. According
  to our testing and our understanding of userspace nothing cares that
  spurious SIGTRAPs can be generated in that case"

* tag 'ptrace_stop-cleanup-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state
  ptrace: Always take siglock in ptrace_resume
  ptrace: Don't change __state
  ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs
  ptrace: Document that wait_task_inactive can't fail
  ptrace: Reimplement PTRACE_KILL by always sending SIGKILL
  signal: Use lockdep_assert_held instead of assert_spin_locked
  ptrace: Remove arch_ptrace_attach
  ptrace/xtensa: Replace PT_SINGLESTEP with TIF_SINGLESTEP
  ptrace/um: Replace PT_DTRACE with TIF_SINGLESTEP
  signal: Replace __group_send_sig_info with send_signal_locked
  signal: Rename send_signal send_signal_locked
2022-06-03 16:13:25 -07:00
Linus Torvalds
1ec6574a3c This set of changes updates init and user mode helper tasks to be
ordinary user mode tasks.
 
 In commit 40966e316f ("kthread: Ensure struct kthread is present for
 all kthreads") caused init and the user mode helper threads that call
 kernel_execve to have struct kthread allocated for them.  This struct
 kthread going away during execve in turned made a use after free of
 struct kthread possible.
 
 The commit 343f4c49f2 ("kthread: Don't allocate kthread_struct for
 init and umh") is enough to fix the use after free and is simple enough
 to be backportable.
 
 The rest of the changes pass struct kernel_clone_args to clean things
 up and cause the code to make sense.
 
 In making init and the user mode helpers tasks purely user mode tasks
 I ran into two complications.  The function task_tick_numa was
 detecting tasks without an mm by testing for the presence of
 PF_KTHREAD.  The initramfs code in populate_initrd_image was using
 flush_delayed_fput to ensuere the closing of all it's file descriptors
 was complete, and flush_delayed_fput does not work in a userspace thread.
 
 I have looked and looked and more complications and in my code review
 I have not found any, and neither has anyone else with the code sitting
 in linux-next.
 
 Link: https://lkml.kernel.org/r/87mtfu4up3.fsf@email.froward.int.ebiederm.org
 
 Eric W. Biederman (8):
       kthread: Don't allocate kthread_struct for init and umh
       fork: Pass struct kernel_clone_args into copy_thread
       fork: Explicity test for idle tasks in copy_thread
       fork: Generalize PF_IO_WORKER handling
       init: Deal with the init process being a user mode process
       fork: Explicitly set PF_KTHREAD
       fork: Stop allowing kthreads to call execve
       sched: Update task_tick_numa to ignore tasks without an mm
 
  arch/alpha/kernel/process.c      | 13 ++++++------
  arch/arc/kernel/process.c        | 13 ++++++------
  arch/arm/kernel/process.c        | 12 ++++++-----
  arch/arm64/kernel/process.c      | 12 ++++++-----
  arch/csky/kernel/process.c       | 15 ++++++-------
  arch/h8300/kernel/process.c      | 10 ++++-----
  arch/hexagon/kernel/process.c    | 12 ++++++-----
  arch/ia64/kernel/process.c       | 15 +++++++------
  arch/m68k/kernel/process.c       | 12 ++++++-----
  arch/microblaze/kernel/process.c | 12 ++++++-----
  arch/mips/kernel/process.c       | 13 ++++++------
  arch/nios2/kernel/process.c      | 12 ++++++-----
  arch/openrisc/kernel/process.c   | 12 ++++++-----
  arch/parisc/kernel/process.c     | 18 +++++++++-------
  arch/powerpc/kernel/process.c    | 15 +++++++------
  arch/riscv/kernel/process.c      | 12 ++++++-----
  arch/s390/kernel/process.c       | 12 ++++++-----
  arch/sh/kernel/process_32.c      | 12 ++++++-----
  arch/sparc/kernel/process_32.c   | 12 ++++++-----
  arch/sparc/kernel/process_64.c   | 12 ++++++-----
  arch/um/kernel/process.c         | 15 +++++++------
  arch/x86/include/asm/fpu/sched.h |  2 +-
  arch/x86/include/asm/switch_to.h |  8 +++----
  arch/x86/kernel/fpu/core.c       |  4 ++--
  arch/x86/kernel/process.c        | 18 +++++++++-------
  arch/xtensa/kernel/process.c     | 17 ++++++++-------
  fs/exec.c                        |  8 ++++---
  include/linux/sched/task.h       |  8 +++++--
  init/initramfs.c                 |  2 ++
  init/main.c                      |  2 +-
  kernel/fork.c                    | 46 +++++++++++++++++++++++++++++++++-------
  kernel/sched/fair.c              |  2 +-
  kernel/umh.c                     |  6 +++---
  33 files changed, 234 insertions(+), 160 deletions(-)
 
 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmKaR/MACgkQC/v6Eiaj
 j0Aayg/7Bx66872d9c6igkJ+MPCTuh+v9QKCGwiYEmiU4Q5sVAFB0HPJO27qC14u
 630X0RFNZTkPzNNEJNIW4kw6Dj8s8YRKf+FgQAVt4SzdRwT7eIPDjk1nGraopPJ3
 O04pjvuTmUyidyViRyFcf2ptx/pnkrwP8jUSc+bGTgfASAKAgAokqKE5ecjewbBc
 Y/EAkQ6QW7KxPjeSmpAHwI+t3BpBev9WEC4PbhRhsBCQFO2+PJiklvqdhVNBnIjv
 qUezll/1xv9UYgniB15Q4Nb722SmnWSU3r8as1eFPugzTHizKhufrrpyP+KMK1A0
 tdtEJNs5t2DZF7ZbGTFSPqJWmyTYLrghZdO+lOmnaSjHxK4Nda1d4NzbefJ0u+FE
 tutewowvHtBX6AFIbx+H3O+DOJM2IgNMf+ReQDU/TyNyVf3wBrTbsr9cLxypIJIp
 zze8npoLMlB7B4yxVo5ES5e63EXfi3iHl0L3/1EhoGwriRz1kWgVLUX/VZOUpscL
 RkJHsW6bT8sqxPWAA5kyWjEN+wNR2PxbXi8OE4arT0uJrEBMUgDCzydzOv5tJB00
 mSQdytxH9LVdsmxBKAOBp5X6WOLGA4yb1cZ6E/mEhlqXMpBDF1DaMfwbWqxSYi4q
 sp5zU3SBAW0qceiZSsWZXInfbjrcQXNV/DkDRDO9OmzEZP4m1j0=
 =x6fy
 -----END PGP SIGNATURE-----

Merge tag 'kthread-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull kthread updates from Eric Biederman:
 "This updates init and user mode helper tasks to be ordinary user mode
  tasks.

  Commit 40966e316f ("kthread: Ensure struct kthread is present for
  all kthreads") caused init and the user mode helper threads that call
  kernel_execve to have struct kthread allocated for them. This struct
  kthread going away during execve in turned made a use after free of
  struct kthread possible.

  Here, commit 343f4c49f2 ("kthread: Don't allocate kthread_struct for
  init and umh") is enough to fix the use after free and is simple
  enough to be backportable.

  The rest of the changes pass struct kernel_clone_args to clean things
  up and cause the code to make sense.

  In making init and the user mode helpers tasks purely user mode tasks
  I ran into two complications. The function task_tick_numa was
  detecting tasks without an mm by testing for the presence of
  PF_KTHREAD. The initramfs code in populate_initrd_image was using
  flush_delayed_fput to ensuere the closing of all it's file descriptors
  was complete, and flush_delayed_fput does not work in a userspace
  thread.

  I have looked and looked and more complications and in my code review
  I have not found any, and neither has anyone else with the code
  sitting in linux-next"

* tag 'kthread-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  sched: Update task_tick_numa to ignore tasks without an mm
  fork: Stop allowing kthreads to call execve
  fork: Explicitly set PF_KTHREAD
  init: Deal with the init process being a user mode process
  fork: Generalize PF_IO_WORKER handling
  fork: Explicity test for idle tasks in copy_thread
  fork: Pass struct kernel_clone_args into copy_thread
  kthread: Don't allocate kthread_struct for init and umh
2022-06-03 16:03:05 -07:00
Peter Zijlstra
82f586f923 sched/autogroup: Fix sysctl move
Ivan reported /proc/sys/kernel/sched_autogroup_enabled went walk-about
and using the noautogroup command line parameter would result in a
boot error message.

Turns out the sysctl move placed the init function wrong.

Fixes: c8eaf6ac76 ("sched: move autogroup sysctls into its own file")
Reported-by: Ivan Kozik <ivan@ludios.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Ivan Kozik <ivan@ludios.org>
Link: https://lkml.kernel.org/r/YpR2IqndgsyMzN00@worktop.programming.kicks-ass.net
2022-05-30 12:36:36 +02:00
Linus Torvalds
44d35720c9 sysctl changes for v5.19-rc1
For two kernel releases now kernel/sysctl.c has been being cleaned up
 slowly, since the tables were grossly long, sprinkled with tons of #ifdefs and
 all this caused merge conflicts with one susbystem or another.
 
 This tree was put together to help try to avoid conflicts with these cleanups
 going on different trees at time. So nothing exciting on this pull request,
 just cleanups.
 
 I actually had this sysctl-next tree up since v5.18 but I missed sending a
 pull request for it on time during the last merge window. And so these changes
 have been being soaking up on sysctl-next and so linux-next for a while.
 The last change was merged May 4th.
 
 Most of the compile issues were reported by 0day and fixed.
 
 To help avoid a conflict with bpf folks at Daniel Borkmann's request
 I merged bpf-next/pr/bpf-sysctl into sysctl-next to get the effor which
 moves the BPF sysctls from kernel/sysctl.c to BPF core.
 
 Possible merge conflicts and known resolutions as per linux-next:
 
 bfp:
 https://lkml.kernel.org/r/20220414112812.652190b5@canb.auug.org.au
 
 rcu:
 https://lkml.kernel.org/r/20220420153746.4790d532@canb.auug.org.au
 
 powerpc:
 https://lkml.kernel.org/r/20220520154055.7f964b76@canb.auug.org.au
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmKOq8ASHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinDAkQAJVo5YVM9f74UwYp4PQhTpjxJBCjRoZD
 z1u9bp5rMj2ujTC8Fr7VmzKaHrb8+r1C1WvCvZtIzemYNB4lZUrHpVDYfXuXiPRB
 ihPmEjhlPO5PFBx6cVCpI3cu9bEhG00rLc1QXnABx/pXwNPcOTJAGZJVamZvqubk
 chjgZrb7N+adHPfvS55v1+zpwdeKfpp5U3zuu5qlT/nn0GS0HCVzOj5fj4oC4wtJ
 IqfUubo+FX50Ga58yQABWNrjaPD9Crykz5ohVazy3ElQl0hJ4VsK65ct3blqc2vz
 1Bb8kPpWuv6aZ5nr1lCVE8qvF4ZIL33ySvpg5BSdWLQEDrBbSpzvJe9Yn7wgR+eq
 y7fhpO24+zRM82EoDMEvyxX9u1n1RsvoXRtf3ds9BGf63MUxk8a1cgjlU6vuyO2U
 JhDmfM1xzdKvPoY4COOnHzcAiIqzItTqKd09N5y0cahmYstROU8lvp9huhTAHqk1
 SjQMbLIZG7OnX8ZeQcR1EB8sq/IOPZT48ejj0iJmQ8FyMaep71MOQLYyLPAq4lgh
 JHXm8P6QdB57jfJbqAeNSyZoK0qdxOUR/83Zcah7Jjns6vkju1DNatEsaEEI2y2M
 4n7/rkHeZ3TyFHBUX4e9FomKvGLsAalDBRiqsuxLSOPMU8rGrNLAslOAtKwvp90X
 4ht3M2VP098l
 =btwh
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull sysctl updates from Luis Chamberlain:
 "For two kernel releases now kernel/sysctl.c has been being cleaned up
  slowly, since the tables were grossly long, sprinkled with tons of
  #ifdefs and all this caused merge conflicts with one susbystem or
  another.

  This tree was put together to help try to avoid conflicts with these
  cleanups going on different trees at time. So nothing exciting on this
  pull request, just cleanups.

  Thanks a lot to the Uniontech and Huawei folks for doing some of this
  nasty work"

* tag 'sysctl-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (28 commits)
  sched: Fix build warning without CONFIG_SYSCTL
  reboot: Fix build warning without CONFIG_SYSCTL
  kernel/kexec_core: move kexec_core sysctls into its own file
  sysctl: minor cleanup in new_dir()
  ftrace: fix building with SYSCTL=y but DYNAMIC_FTRACE=n
  fs/proc: Introduce list_for_each_table_entry for proc sysctl
  mm: fix unused variable kernel warning when SYSCTL=n
  latencytop: move sysctl to its own file
  ftrace: fix building with SYSCTL=n but DYNAMIC_FTRACE=y
  ftrace: Fix build warning
  ftrace: move sysctl_ftrace_enabled to ftrace.c
  kernel/do_mount_initrd: move real_root_dev sysctls to its own file
  kernel/delayacct: move delayacct sysctls to its own file
  kernel/acct: move acct sysctls to its own file
  kernel/panic: move panic sysctls to its own file
  kernel/lockdep: move lockdep sysctls to its own file
  mm: move page-writeback sysctls to their own file
  mm: move oom_kill sysctls to their own file
  kernel/reboot: move reboot sysctls to its own file
  sched: Move energy_aware sysctls to topology.c
  ...
2022-05-26 16:57:20 -07:00
Linus Torvalds
6f3f04c190 Scheduler changes in this cycle were:
- Updates to scheduler metrics:
 
     - PELT fixes & enhancements
     - PSI fixes & enhancements
     - Refactor cpu_util_without()
 
  - Updates to instrumentation/debugging:
 
     - Remove sched_trace_*() helper functions - can be done via debug info
     - Fix double update_rq_clock() warnings
 
  - Introduce & use "preemption model accessors" to simplify some of
    the Kconfig complexity.
 
  - Make softirq handling RT-safe.
 
  - Misc smaller fixes & cleanups.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmKLvXYRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hcXg//fJ1jAB9pQOg/Su9wwwbcOeaXNUpQA38e
 970nXdK6i7w+YeAT2x1ikIQZq5S/px7k9S4Fzks8U9LMhnKPxhjdnG6J69h5XLuB
 z1BtRJBB6W8BAYWzAeq1M+R8whQylciOMZOBSjeTIEdpYBK7c9QA/R1DkDqPRlBA
 7nW0mFbpYcK8Q1n1ItjP0wkpiHG4q8orp+BXiPG8rjiHdCa3GFt7g38hiqNls64H
 fOQ/Ka25tBSYrmeqQY3QsWKnKNHKQRLNareHAwI/x4V8F8d4tn/OmJzmWGDdtprn
 6/gi/E99ej1j5Do8sgx/oTp/aVg4j8AsurrpGFd4/er+euoG4UyHr42UhX6zmFM6
 /KIinp0Z/V2n9okgI9LUZ2x7mD682iXDilNDgiSAwu1bNDUvxBXPD30gThh+KasA
 HxeKxTzb4/dZV4ih4xUMsCOjUT4NFZT2rmiMorUystgyNRk28DtFCdBMtrs/zVtG
 qAktb7v5g76pKAmV4nQu4imZeSD+f+RJP2fuSUYQCJbCxQfthTZkn8GfCMYEdY7Y
 sDyBx4Te8Vu/dcnal9qMpY/m5EPruPQAkvC9zK4YvkvLUmGC742PG/xHfCdC9J2m
 Adbl/Cmn7tD9dOGYbHPsrViqwIiZUcjbnBlMN5DjJXQF6kWNOIXUEguZpBocminP
 1CSy0+gyI6o=
 =GY8N
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2022-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Updates to scheduler metrics:
     - PELT fixes & enhancements
     - PSI fixes & enhancements
     - Refactor cpu_util_without()

 - Updates to instrumentation/debugging:
     - Remove sched_trace_*() helper functions - can be done via debug
       info
     - Fix double update_rq_clock() warnings

 - Introduce & use "preemption model accessors" to simplify some of the
   Kconfig complexity.

 - Make softirq handling RT-safe.

 - Misc smaller fixes & cleanups.

* tag 'sched-core-2022-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  topology: Remove unused cpu_cluster_mask()
  sched: Reverse sched_class layout
  sched/deadline: Remove superfluous rq clock update in push_dl_task()
  sched/core: Avoid obvious double update_rq_clock warning
  smp: Make softirq handling RT safe in flush_smp_call_function_queue()
  smp: Rename flush_smp_call_function_from_idle()
  sched: Fix missing prototype warnings
  sched/fair: Remove cfs_rq_tg_path()
  sched/fair: Remove sched_trace_*() helper functions
  sched/fair: Refactor cpu_util_without()
  sched/fair: Revise comment about lb decision matrix
  sched/psi: report zeroes for CPU full at the system level
  sched/fair: Delete useless condition in tg_unthrottle_up()
  sched/fair: Fix cfs_rq_clock_pelt() for throttled cfs_rq
  sched/fair: Move calculate of avg_load to a better location
  mailmap: Update my email address to @redhat.com
  MAINTAINERS: Add myself as scheduler topology reviewer
  psi: Fix trigger being fired unexpectedly at initial
  ftrace: Use preemption model accessors for trace header printout
  kcsan: Use preemption model accessors
2022-05-24 11:11:13 -07:00
Linus Torvalds
2319be1356 Locking changes in this cycle were:
- rwsem cleanups & optimizations/fixes:
     - Conditionally wake waiters in reader/writer slowpaths
     - Always try to wake waiters in out_nolock path
 
  - Add try_cmpxchg64() implementation, with arch optimizations - and use it to
    micro-optimize sched_clock_{local,remote}()
 
  - Various force-inlining fixes to address objdump instrumentation-check warnings
 
  - Add lock contention tracepoints:
 
     lock:contention_begin
     lock:contention_end
 
  - Misc smaller fixes & cleanups
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmKLsrERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1js3g//cPR9PYlvZv87T2hI8VWKfNzapgSmwCsH
 1P+nk27Pef+jfxHr/N7YScvSD06+z2wIroLE3npPNETmNd1X8obBDThmeD4VI899
 J6h4sE0cFOpTG/mHeECFxqnDuzhdHiRHWS52RxOwTjZTpdbeKWZYueC0Mvqn+tIp
 UM2D2yTseIHs67ikxYtayU/iJgSZ+PYrMPv9nSVUjIFILmg7gMIz38OZYQzR84++
 auL3m8sAq/i2pjzDBbXMpfYeu177/tPHpPJr2rOErLEXWqK2K6op8+CbX4z3yv3z
 EBBhGiUNqDmFaFuIgg7Mx94SvPh8MBGexUnT0XA2aXPwyP9oAaenCk2CZ1j9u15m
 /Xp1A4KNvg1WY8jHu5ZM4VIEXQ7d6Gwtbej7IeovUxBD6y7Trb3+rxb7PVdZX941
 uVGjss1Lgk70wUQqBqBPmBm08O6NUF3vekHlona5CZTQgEF84zD7+7D++QPaAZo7
 kiuNUptdgfq6X0xqgP88GX1KU85gJYoF5Q13vb7lAcv19QhRG5JBJeWMYiXEmg12
 Ktl97Sru0zCpCY1NCvwsBll09SLVO9kX3Lq+QFD8bFMZ0obsGIBotHq1qH6U7cH8
 RY6esVBF/1/+qdrxOKs8qowlJ4EUp/3bX0R/MKYHJJbulj/aaE916HvMsoN/QR4Y
 oW7GcxMQGLE=
 =gaS5
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2022-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:

 - rwsem cleanups & optimizations/fixes:
    - Conditionally wake waiters in reader/writer slowpaths
    - Always try to wake waiters in out_nolock path

 - Add try_cmpxchg64() implementation, with arch optimizations - and use
   it to micro-optimize sched_clock_{local,remote}()

 - Various force-inlining fixes to address objdump instrumentation-check
   warnings

 - Add lock contention tracepoints:

    lock:contention_begin
    lock:contention_end

 - Misc smaller fixes & cleanups

* tag 'locking-core-2022-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/clock: Use try_cmpxchg64 in sched_clock_{local,remote}
  locking/atomic/x86: Introduce arch_try_cmpxchg64
  locking/atomic: Add generic try_cmpxchg64 support
  futex: Remove a PREEMPT_RT_FULL reference.
  locking/qrwlock: Change "queue rwlock" to "queued rwlock"
  lockdep: Delete local_irq_enable_in_hardirq()
  locking/mutex: Make contention tracepoints more consistent wrt adaptive spinning
  locking: Apply contention tracepoints in the slow path
  locking: Add lock contention tracepoints
  locking/rwsem: Always try to wake waiters in out_nolock path
  locking/rwsem: Conditionally wake waiters in reader/writer slowpaths
  locking/rwsem: No need to check for handoff bit if wait queue empty
  lockdep: Fix -Wunused-parameter for _THIS_IP_
  x86/mm: Force-inline __phys_addr_nodebug()
  x86/kvm/svm: Force-inline GHCB accessors
  task_stack, x86/cea: Force-inline stack helpers
2022-05-24 10:18:23 -07:00
Linus Torvalds
1e57930e9f RCU pull request for v5.19
This pull request contains the following branches:
 
 docs.2022.04.20a: Documentation updates.
 
 fixes.2022.04.20a: Miscellaneous fixes.
 
 nocb.2022.04.11b: Callback-offloading updates, mainly simplifications.
 
 rcu-tasks.2022.04.11b: RCU-tasks updates, including some -rt fixups,
 	handling of systems with sparse CPU numbering, and a fix for a
 	boot-time race-condition failure.
 
 srcu.2022.05.03a: Put SRCU on a memory diet in order to reduce the size
 	of the srcu_struct structure.
 
 torture.2022.04.11b: Torture-test updates fixing some bugs in tests and
 	closing some testing holes.
 
 torture-tasks.2022.04.20a: Torture-test updates for the RCU tasks flavors,
 	most notably ensuring that building rcutorture and friends does
 	not change the RCU-tasks-related Kconfig options.
 
 torturescript.2022.04.20a: Torture-test scripting updates.
 
 exp.2022.05.11a: Expedited grace-period updates, most notably providing
 	milliseconds-scale (not all that) soft real-time response from
 	synchronize_rcu_expedited().  This is also the first time in
 	almost 30 years of RCU that someone other than me has pushed
 	for a reduction in the RCU CPU stall-warning timeout, in this
 	case by more than three orders of magnitude from 21 seconds to
 	20 milliseconds.  This tighter timeout applies only to expedited
 	grace periods.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmKG2zcTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jGXgD/90xtRtZyN0umlN/IOBBn8fIOM+BAMu
 5k3ef6wLsXKXlLO13WTjSitypX9LEFwytTeVhEyN4ODeX0cI9mUmts6Z8/6sV92D
 fN8vqTavveE7m5YfFfLRvDRfVHpB0LpLMM+V0qWPu/F8dWPDKA0225rX9IC7iICP
 LkxCuNVNzJ0cLaVTvsUWlxMdHcogydXZb1gPDVRhnR6iVFWCBtL4RRpU41CoSNh4
 fWRSLQak6OhZRFE7hVoLQhZyLE0GIw1fuUJgj2fCllhgGogDx78FQ8jHdDzMEhVk
 cD4Yel5vUPiy2AKphGfi28bKFYcyhVBnD/Jq733VJV0/szyddxNbz0xKpEA0/8qh
 w1T7IjBN6MAKHSh0uUitm6U24VN13m4r30HrUQSpp71VFZkUD4QS6TismKsaRNjR
 lK4q2QKBprBb3Hv7KPAGYT1Us3aS7qLPrgPf3gzSxL1aY5QV0A5UpPP6RKTLbWPl
 CEQxEno6g5LTHwKd5QD74dG8ccphg9377lDMJpeesYShBqlLNrNWCxqJoZk2HnSf
 f2dTQeQWrtRJjeTGy/4cfONCGZTghE0Pch43XMzLLt3ZTuDc8FVM0t3Xs9J5Kg22
 zmThQh6LRXTGjrb1vLiOrjPf5JaTnX2Sz8OUJTo/ZxwcixxP/mj8Ja+W81NjfqnK
 LLZ1D6UN4a8n9A==
 =4spH
 -----END PGP SIGNATURE-----

Merge tag 'rcu.2022.05.19a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU update from Paul McKenney:

 - Documentation updates

 - Miscellaneous fixes

 - Callback-offloading updates, mainly simplifications

 - RCU-tasks updates, including some -rt fixups, handling of systems
   with sparse CPU numbering, and a fix for a boot-time race-condition
   failure

 - Put SRCU on a memory diet in order to reduce the size of the
   srcu_struct structure

 - Torture-test updates fixing some bugs in tests and closing some
   testing holes

 - Torture-test updates for the RCU tasks flavors, most notably ensuring
   that building rcutorture and friends does not change the
   RCU-tasks-related Kconfig options

 - Torture-test scripting updates

 - Expedited grace-period updates, most notably providing
   milliseconds-scale (not all that) soft real-time response from
   synchronize_rcu_expedited().

   This is also the first time in almost 30 years of RCU that someone
   other than me has pushed for a reduction in the RCU CPU stall-warning
   timeout, in this case by more than three orders of magnitude from 21
   seconds to 20 milliseconds. This tighter timeout applies only to
   expedited grace periods

* tag 'rcu.2022.05.19a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (80 commits)
  rcu: Move expedited grace period (GP) work to RT kthread_worker
  rcu: Introduce CONFIG_RCU_EXP_CPU_STALL_TIMEOUT
  srcu: Drop needless initialization of sdp in srcu_gp_start()
  srcu: Prevent expedited GPs and blocking readers from consuming CPU
  srcu: Add contention check to call_srcu() srcu_data ->lock acquisition
  srcu: Automatically determine size-transition strategy at boot
  rcutorture: Make torture.sh allow for --kasan
  rcutorture: Make torture.sh refscale and rcuscale specify Tasks Trace RCU
  rcutorture: Make kvm.sh allow more memory for --kasan runs
  torture: Save "make allmodconfig" .config file
  scftorture: Remove extraneous "scf" from per_version_boot_params
  rcutorture: Adjust scenarios' Kconfig options for CONFIG_PREEMPT_DYNAMIC
  torture: Enable CSD-lock stall reports for scftorture
  torture: Skip vmlinux check for kvm-again.sh runs
  scftorture: Adjust for TASKS_RCU Kconfig option being selected
  rcuscale: Allow rcuscale without RCU Tasks Rude/Trace
  rcuscale: Allow rcuscale without RCU Tasks
  refscale: Allow refscale without RCU Tasks Rude/Trace
  refscale: Allow refscale without RCU Tasks
  rcutorture: Allow specifying per-scenario stat_interval
  ...
2022-05-23 11:46:51 -07:00
Peter Zijlstra
546a3fee17 sched: Reverse sched_class layout
Because GCC-12 is fully stupid about array bounds and it's just really
hard to get a solid array definition from a linker script, flip the
array order to avoid needing negative offsets :-/

This makes the whole relational pointer magic a little less obvious, but
alas.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/YoOLLmLG7HRTXeEm@hirez.programming.kicks-ass.net
2022-05-19 23:46:13 +02:00
Uros Bizjak
8491d1bdf5 sched/clock: Use try_cmpxchg64 in sched_clock_{local,remote}
Use try_cmpxchg64 instead of cmpxchg64 (*ptr, old, new) != old in
sched_clock_{local,remote}. x86 cmpxchg returns success in ZF flag,
so this change saves a compare after cmpxchg (and related move
instruction in front of cmpxchg).

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220518184953.3446778-1-ubizjak@gmail.com
2022-05-19 23:46:09 +02:00
Delyan Kratunov
9c2136be08 sched/tracing: Append prev_state to tp args instead
Commit fa2c3254d7 (sched/tracing: Don't re-read p->state when emitting
sched_switch event, 2022-01-20) added a new prev_state argument to the
sched_switch tracepoint, before the prev task_struct pointer.

This reordering of arguments broke BPF programs that use the raw
tracepoint (e.g. tp_btf programs). The type of the second argument has
changed and existing programs that assume a task_struct* argument
(e.g. for bpf_task_storage access) will now fail to verify.

If we instead append the new argument to the end, all existing programs
would continue to work and can conditionally extract the prev_state
argument on supported kernel versions.

Fixes: fa2c3254d7 (sched/tracing: Don't re-read p->state when emitting sched_switch event, 2022-01-20)
Signed-off-by: Delyan Kratunov <delyank@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/c8a6930dfdd58a4a5755fc01732675472979732b.camel@fb.com
2022-05-12 00:37:11 +02:00
Eric W. Biederman
2500ad1c7f ptrace: Don't change __state
Stop playing with tsk->__state to remove TASK_WAKEKILL while a ptrace
command is executing.

Instead remove TASK_WAKEKILL from the definition of TASK_TRACED, and
implement a new jobctl flag TASK_PTRACE_FROZEN.  This new flag is set
in jobctl_freeze_task and cleared when ptrace_stop is awoken or in
jobctl_unfreeze_task (when ptrace_stop remains asleep).

In signal_wake_up add __TASK_TRACED to state along with TASK_WAKEKILL
when the wake up is for a fatal signal.  Skip adding __TASK_TRACED
when TASK_PTRACE_FROZEN is not set.  This has the same effect as
changing TASK_TRACED to __TASK_TRACED as all of the wake_ups that use
TASK_KILLABLE go through signal_wake_up.

Handle a ptrace_stop being called with a pending fatal signal.
Previously it would have been handled by schedule simply failing to
sleep.  As TASK_WAKEKILL is no longer part of TASK_TRACED schedule
will sleep with a fatal_signal_pending.   The code in signal_wake_up
guarantees that the code will be awaked by any fatal signal that
codes after TASK_TRACED is set.

Previously the __state value of __TASK_TRACED was changed to
TASK_RUNNING when woken up or back to TASK_TRACED when the code was
left in ptrace_stop.  Now when woken up ptrace_stop now clears
JOBCTL_PTRACE_FROZEN and when left sleeping ptrace_unfreezed_traced
clears JOBCTL_PTRACE_FROZEN.

Tested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/20220505182645.497868-10-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-05-11 14:35:32 -05:00
Eric W. Biederman
b3f9916d81 sched: Update task_tick_numa to ignore tasks without an mm
Qian Cai <quic_qiancai@quicinc.com> wrote:
> Reverting the last 3 commits of the series fixed a boot crash.
>
> 1b2552cbdb fork: Stop allowing kthreads to call execve
> 753550eb0c fork: Explicitly set PF_KTHREAD
> 68d85f0a33 init: Deal with the init process being a user mode process
>
>  BUG: KASAN: null-ptr-deref in task_nr_scan_windows.isra.0
>  arch_atomic_long_read at ./include/linux/atomic/atomic-long.h:29
>  (inlined by) atomic_long_read at ./include/linux/atomic/atomic-instrumented.h:1266
>  (inlined by) get_mm_counter at ./include/linux/mm.h:1996
>  (inlined by) get_mm_rss at ./include/linux/mm.h:2049
>  (inlined by) task_nr_scan_windows at kernel/sched/fair.c:1123
>  Read of size 8 at addr 00000000000003d0 by task swapper/0/1

With the change to init and the user mode helper processes to not have
PF_KTHREAD set before they call kernel_execve the PF_KTHREAD test in
task_tick_numa became insufficient to detect all tasks that have
"->mm == NULL".  Correct that by testing for "->mm == NULL" directly.

Reported-by: Qian Cai <quic_qiancai@quicinc.com>
Tested-by: Qian Cai <quic_qiancai@quicinc.com>
Fixes: 1b2552cbdb ("fork: Stop allowing kthreads to call execve")
Link: https://lkml.kernel.org/r/87r150ug1l.fsf_-_@email.froward.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-05-11 12:41:48 -05:00
Hao Jia
734387ec2f sched/deadline: Remove superfluous rq clock update in push_dl_task()
The change to call update_rq_clock() before activate_task()
commit 840d719604 ("sched/deadline: Update rq_clock of later_rq
when pushing a task") is no longer needed since commit f4904815f9
("sched/deadline: Fix double accounting of rq/running bw in push & pull")
removed the add_running_bw() before the activate_task().

So we remove some comments that are no longer needed and update
rq clock in activate_task().

Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lore.kernel.org/r/20220430085843.62939-3-jiahao.os@bytedance.com
2022-05-11 16:27:12 +02:00
Hao Jia
2679a83731 sched/core: Avoid obvious double update_rq_clock warning
When we use raw_spin_rq_lock() to acquire the rq lock and have to
update the rq clock while holding the lock, the kernel may issue
a WARN_DOUBLE_CLOCK warning.

Since we directly use raw_spin_rq_lock() to acquire rq lock instead of
rq_lock(), there is no corresponding change to rq->clock_update_flags.
In particular, we have obtained the rq lock of other CPUs, the
rq->clock_update_flags of this CPU may be RQCF_UPDATED at this time, and
then calling update_rq_clock() will trigger the WARN_DOUBLE_CLOCK warning.

So we need to clear RQCF_UPDATED of rq->clock_update_flags to avoid
the WARN_DOUBLE_CLOCK warning.

For the sched_rt_period_timer() and migrate_task_rq_dl() cases
we simply replace raw_spin_rq_lock()/raw_spin_rq_unlock() with
rq_lock()/rq_unlock().

For the {pull,push}_{rt,dl}_task() cases, we add the
double_rq_clock_clear_update() function to clear RQCF_UPDATED of
rq->clock_update_flags, and call double_rq_clock_clear_update()
before double_lock_balance()/double_rq_lock() returns to avoid the
WARN_DOUBLE_CLOCK warning.

Some call trace reports:
Call Trace 1:
 <IRQ>
 sched_rt_period_timer+0x10f/0x3a0
 ? enqueue_top_rt_rq+0x110/0x110
 __hrtimer_run_queues+0x1a9/0x490
 hrtimer_interrupt+0x10b/0x240
 __sysvec_apic_timer_interrupt+0x8a/0x250
 sysvec_apic_timer_interrupt+0x9a/0xd0
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x12/0x20

Call Trace 2:
 <TASK>
 activate_task+0x8b/0x110
 push_rt_task.part.108+0x241/0x2c0
 push_rt_tasks+0x15/0x30
 finish_task_switch+0xaa/0x2e0
 ? __switch_to+0x134/0x420
 __schedule+0x343/0x8e0
 ? hrtimer_start_range_ns+0x101/0x340
 schedule+0x4e/0xb0
 do_nanosleep+0x8e/0x160
 hrtimer_nanosleep+0x89/0x120
 ? hrtimer_init_sleeper+0x90/0x90
 __x64_sys_nanosleep+0x96/0xd0
 do_syscall_64+0x34/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Call Trace 3:
 <TASK>
 deactivate_task+0x93/0xe0
 pull_rt_task+0x33e/0x400
 balance_rt+0x7e/0x90
 __schedule+0x62f/0x8e0
 do_task_dead+0x3f/0x50
 do_exit+0x7b8/0xbb0
 do_group_exit+0x2d/0x90
 get_signal+0x9df/0x9e0
 ? preempt_count_add+0x56/0xa0
 ? __remove_hrtimer+0x35/0x70
 arch_do_signal_or_restart+0x36/0x720
 ? nanosleep_copyout+0x39/0x50
 ? do_nanosleep+0x131/0x160
 ? audit_filter_inodes+0xf5/0x120
 exit_to_user_mode_prepare+0x10f/0x1e0
 syscall_exit_to_user_mode+0x17/0x30
 do_syscall_64+0x40/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Call Trace 4:
 update_rq_clock+0x128/0x1a0
 migrate_task_rq_dl+0xec/0x310
 set_task_cpu+0x84/0x1e4
 try_to_wake_up+0x1d8/0x5c0
 wake_up_process+0x1c/0x30
 hrtimer_wakeup+0x24/0x3c
 __hrtimer_run_queues+0x114/0x270
 hrtimer_interrupt+0xe8/0x244
 arch_timer_handler_phys+0x30/0x50
 handle_percpu_devid_irq+0x88/0x140
 generic_handle_domain_irq+0x40/0x60
 gic_handle_irq+0x48/0xe0
 call_on_irq_stack+0x2c/0x60
 do_interrupt_handler+0x80/0x84

Steps to reproduce:
1. Enable CONFIG_SCHED_DEBUG when compiling the kernel
2. echo 1 > /sys/kernel/debug/clear_warn_once
   echo "WARN_DOUBLE_CLOCK" > /sys/kernel/debug/sched/features
   echo "NO_RT_PUSH_IPI" > /sys/kernel/debug/sched/features
3. Run some rt/dl tasks that periodically work and sleep, e.g.
Create 2*n rt or dl (90% running) tasks via rt-app (on a system
with n CPUs), and Dietmar Eggemann reports Call Trace 4 when running
on PREEMPT_RT kernel.

Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20220430085843.62939-2-jiahao.os@bytedance.com
2022-05-11 16:27:11 +02:00
YueHaibing
494dcdf46e sched: Fix build warning without CONFIG_SYSCTL
IF CONFIG_SYSCTL is n, build warn:

kernel/sched/core.c:1782:12: warning: ‘sysctl_sched_uclamp_handler’ defined but not used [-Wunused-function]
 static int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~

sysctl_sched_uclamp_handler() is used while CONFIG_SYSCTL enabled,
wrap all related code with CONFIG_SYSCTL to fix this.

Fixes: 3267e0156c ("sched: Move uclamp_util sysctls to core.c")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-05-09 16:54:57 -07:00
Ingo Molnar
d70522fc54 Linux 5.18-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmJu9FYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGAyEH/16xtJSpLmLwrQzG
 o+4ToQxSQ+/9UHyu0RTEvHg2THm9/8emtIuYyc/5FgdoWctcSa3AaDcveWmuWmkS
 KYcdhfJsaEqjNHS3OPYXN84fmo9Hel7263shu5+IYmP/sN0DfQp6UWTryX1q4B3Q
 4Pdutkuq63Uwd8nBZ5LXQBumaBrmkkuMgWEdT4+6FOo1mPzwdIGBxCuz1UsNNl5k
 chLWxkQfe2eqgWbYJrgCQfrVdORXVtoU2fGilZUNrHRVGkkldXkkz5clJfapyZD3
 odmZCEbrE4GPKgZwCmDERMfD1hzhZDtYKiHfOQ506szH5ykJjPBcOjHed7dA60eB
 J3+wdek=
 =39Ca
 -----END PGP SIGNATURE-----

Merge tag 'v5.18-rc5' into sched/core to pull in fixes & to resolve a conflict

 - sched/core is on a pretty old -rc1 base - refresh it to include recent fixes.
 - this also allows up to resolve a (trivial) .mailmap conflict

Conflicts:
	.mailmap

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-05-06 10:21:46 +02:00
Thomas Gleixner
16bf5a5e1e smp: Rename flush_smp_call_function_from_idle()
This is invoked from the stopper thread too, which is definitely not idle.
Rename it to flush_smp_call_function_queue() and fixup the callers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220413133024.305001096@linutronix.de
2022-05-01 10:03:43 +02:00
Thomas Gleixner
d664e39912 sched: Fix missing prototype warnings
A W=1 build emits more than a dozen missing prototype warnings related to
scheduler and scheduler specific includes.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220413133024.249118058@linutronix.de
2022-05-01 10:03:43 +02:00
Dietmar Eggemann
97956dd278 sched/fair: Remove cfs_rq_tg_path()
cfs_rq_tg_path() is used by a tracepoint-to traceevent (tp-2-te)
converter to format the path of a taskgroup or autogroup respectively.
It doesn't have any in-kernel users after the removal of the
sched_trace_cfs_rq_path() helper function.

cfs_rq_tg_path() can be coded in a tp-2-te converter.

Remove it from kernel/sched/fair.c.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220428144338.479094-3-qais.yousef@arm.com
2022-04-29 11:06:29 +02:00
Dietmar Eggemann
50e7b416d2 sched/fair: Remove sched_trace_*() helper functions
We no longer need them as we can use DWARF debug info or BTF + pahole to
re-generate the required structs to compile against them for a given
kernel.

This moves the burden of maintaining these helper functions to the
module.

	https://github.com/qais-yousef/sched_tp

Note that pahole v1.15 is required at least for using DWARF. And for BTF
v1.23 which is not yet released will be required. There's alignment
problem that will lead to crashes in earlier versions when used with
BTF.

We should have enough infrastructure to make these helper functions now
obsolete, so remove them.

[Rewrote commit message to reflect the new alternative]
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220428144338.479094-2-qais.yousef@arm.com
2022-04-29 11:06:29 +02:00
Dietmar Eggemann
4e3c7d338a sched/fair: Refactor cpu_util_without()
Except the 'task has no contribution or is new' condition at the
beginning of cpu_util_without(), which it shares with the load and
runnable counterpart functions, a cpu_util_next(..., dst_cpu = -1)
call can replace the rest of it.

The UTIL_EST specific check that task util_est has to be subtracted
from the CPU one in case of an enqueued (or current (to cater for the
wakeup - lb race)) task has to be moved to cpu_util_next().
This was initially introduced by commit c469933e77
("sched/fair: Fix cpu_util_wake() for 'execl' type workloads").
UnixBench's `execl` throughput tests were run on the dual socket 40
CPUs Intel E5-2690 v2 to make sure it doesn't regress again.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220318163656.954440-1-dietmar.eggemann@arm.com
2022-04-29 11:06:29 +02:00
Tao Zhou
a658353167 sched/fair: Revise comment about lb decision matrix
If busiest group type is group_misfit_task, the local
group type must be group_has_spare according to below
code in update_sd_pick_busiest():

  if (sgs->group_type == group_misfit_task &&
      (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
       sds->local_stat.group_type != group_has_spare))
	   return false;

group type imbalanced and overloaded and fully_busy are filtered in here.
misfit and asym are filtered before in update_sg_lb_stats().
So, change the decision matrix to:

  busiest \ local has_spare fully_busy misfit asym imbalanced overloaded
  has_spare        nr_idle   balanced   N/A    N/A  balanced   balanced
  fully_busy       nr_idle   nr_idle    N/A    N/A  balanced   balanced
  misfit_task      force     N/A        N/A    N/A  *N/A*      *N/A*
  asym_packing     force     force      N/A    N/A  force      force
  imbalanced       force     force      N/A    N/A  force      force
  overloaded       force     force      N/A    N/A  force      avg_load

Fixes: 0b0695f2b3 ("sched/fair: Rework load_balance()")
Signed-off-by: Tao Zhou <tao.zhou@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20220415095505.7765-1-tao.zhou@linux.dev
2022-04-22 12:14:08 +02:00
Chengming Zhou
890d550d7d sched/psi: report zeroes for CPU full at the system level
Martin find it confusing when look at the /proc/pressure/cpu output,
and found no hint about that CPU "full" line in psi Documentation.

% cat /proc/pressure/cpu
some avg10=0.92 avg60=0.91 avg300=0.73 total=933490489
full avg10=0.22 avg60=0.23 avg300=0.16 total=358783277

The PSI_CPU_FULL state is introduced by commit e7fcd76228
("psi: Add PSI_CPU_FULL state"), which mainly for cgroup level,
but also counted at the system level as a side effect.

Naturally, the FULL state doesn't exist for the CPU resource at
the system level. These "full" numbers can come from CPU idle
schedule latency. For example, t1 is the time when task wakeup
on an idle CPU, t2 is the time when CPU pick and switch to it.
The delta of (t2 - t1) will be in CPU_FULL state.

Another case all processes can be stalled is when all cgroups
have been throttled at the same time, which unlikely to happen.

Anyway, CPU_FULL metric is meaningless and confusing at the
system level. So this patch will report zeroes for CPU full
at the system level, and update psi Documentation accordingly.

Fixes: e7fcd76228 ("psi: Add PSI_CPU_FULL state")
Reported-by: Martin Steigerwald <Martin.Steigerwald@proact.de>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220408121914.82855-1-zhouchengming@bytedance.com
2022-04-22 12:14:08 +02:00
Chengming Zhou
0a00a35464 sched/fair: Delete useless condition in tg_unthrottle_up()
We have tested cfs_rq->load.weight in cfs_rq_is_decayed(),
the first condition "!cfs_rq_is_decayed(cfs_rq)" is enough
to cover the second condition "cfs_rq->nr_running".

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220408115309.81603-2-zhouchengming@bytedance.com
2022-04-22 12:14:07 +02:00
Chengming Zhou
64eaf50731 sched/fair: Fix cfs_rq_clock_pelt() for throttled cfs_rq
Since commit 2312729688 ("sched/fair: Update scale invariance of PELT")
change to use rq_clock_pelt() instead of rq_clock_task(), we should also
use rq_clock_pelt() for throttled_clock_task_time and throttled_clock_task
accounting to get correct cfs_rq_clock_pelt() of throttled cfs_rq. And
rename throttled_clock_task(_time) to be clock_pelt rather than clock_task.

Fixes: 2312729688 ("sched/fair: Update scale invariance of PELT")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220408115309.81603-1-zhouchengming@bytedance.com
2022-04-22 12:14:07 +02:00
zgpeng
0635490078 sched/fair: Move calculate of avg_load to a better location
In calculate_imbalance function, when the value of local->avg_load is
greater than or equal to busiest->avg_load, the calculated sds->avg_load is
not used. So this calculation can be placed in a more appropriate position.

Signed-off-by: zgpeng <zgpeng@tencent.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Samuel Liao <samuelliao@tencent.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/1649239025-10010-1-git-send-email-zgpeng@tencent.com
2022-04-22 12:14:07 +02:00
Hailong Liu
915a087e4c psi: Fix trigger being fired unexpectedly at initial
When a trigger being created, its win.start_value and win.start_time are
reset to zero. If group->total[PSI_POLL][t->state] has accumulated before,
this trigger will be fired unexpectedly in the next period, even if its
growth time does not reach its threshold.

So set the window of the new trigger to the current state value.

Signed-off-by: Hailong Liu <liuhailong@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/r/1648789811-3788971-1-git-send-email-liuhailong@linux.alibaba.com
2022-04-22 12:14:06 +02:00
kuyo chang
40f5aa4c5e sched/pelt: Fix attach_entity_load_avg() corner case
The warning in cfs_rq_is_decayed() triggered:

    SCHED_WARN_ON(cfs_rq->avg.load_avg ||
		  cfs_rq->avg.util_avg ||
		  cfs_rq->avg.runnable_avg)

There exists a corner case in attach_entity_load_avg() which will
cause load_sum to be zero while load_avg will not be.

Consider se_weight is 88761 as per the sched_prio_to_weight[] table.
Further assume the get_pelt_divider() is 47742, this gives:
se->avg.load_avg is 1.

However, calculating load_sum:

  se->avg.load_sum = div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
  se->avg.load_sum = 1*47742/88761 = 0.

Then enqueue_load_avg() adds this to the cfs_rq totals:

  cfs_rq->avg.load_avg += se->avg.load_avg;
  cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;

Resulting in load_avg being 1 with load_sum is 0, which will trigger
the WARN.

Fixes: f207934fb7 ("sched/fair: Align PELT windows between cfs_rq and its se")
Signed-off-by: kuyo chang <kuyo.chang@mediatek.com>
[peterz: massage changelog]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20220414090229.342-1-kuyo.chang@mediatek.com
2022-04-19 21:15:41 +02:00
Zhen Ni
8a0441415b sched: Move energy_aware sysctls to topology.c
move energy_aware sysctls to topology.c and use the new
register_sysctl_init() to register the sysctl interface.

Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-06 13:43:44 -07:00
Zhen Ni
d4ae80ffa6 sched: Move cfs_bandwidth_slice sysctls to fair.c
move cfs_bandwidth_slice sysctls to fair.c and use the
new register_sysctl_init() to register the sysctl interface.

Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-06 13:43:43 -07:00
Zhen Ni
3267e0156c sched: Move uclamp_util sysctls to core.c
move uclamp_util sysctls to core.c and use the new
register_sysctl_init() to register the sysctl interface.

Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-06 13:43:43 -07:00
Baisong Zhong
28f152cd09 sched/rt: fix build error when CONFIG_SYSCTL is disable
Avoid random build errors which do not select
CONFIG_SYSCTL by depending on it in Kconfig.

This fixes the following warning:

In file included from kernel/sched/build_policy.c:43:
At top level:
kernel/sched/rt.c:3017:12: error: ‘sched_rr_handler’ defined but not used [-Werror=unused-function]
 3017 | static int sched_rr_handler(struct ctl_table *table, int write, void *buffer,
      |            ^~~~~~~~~~~~~~~~
kernel/sched/rt.c:2978:12: error: ‘sched_rt_handler’ defined but not used [-Werror=unused-function]
 2978 | static int sched_rt_handler(struct ctl_table *table, int write, void *buffer,
      |            ^~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[2]: *** [scripts/Makefile.build:310: kernel/sched/build_policy.o] Error 1
make[1]: *** [scripts/Makefile.build:638: kernel/sched] Error 2
make[1]: *** Waiting for unfinished jobs....

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baisong Zhong <zhongbaisong@huawei.com>
[mcgrof: small build fix, we need sched_rt_can_attach() even
 when CONFIG_SYSCTL is disabled]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-06 13:43:43 -07:00
Zhen Ni
dafd7a9dad sched: Move rr_timeslice sysctls to rt.c
move rr_timeslice sysctls to rt.c and use the new
register_sysctl_init() to register the sysctl interface.

Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-06 13:43:43 -07:00
Zhen Ni
84227c1288 sched: Move deadline_period sysctls to deadline.c
move deadline_period sysctls to deadline.c and use the new
register_sysctl_init() to register the sysctl interface.

Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-06 13:43:43 -07:00
Zhen Ni
d9ab0e63fa sched: Move rt_period/runtime sysctls to rt.c
move rt_period/runtime sysctls to rt.c and use the new
register_sysctl_init() to register the sysctl interface.

Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-06 13:43:43 -07:00
Zhen Ni
f5ef06d58b sched: Move schedstats sysctls to core.c
move schedstats sysctls to core.c and use the new
register_sysctl_init() to register the sysctl interface.

Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-06 13:43:43 -07:00
Zhen Ni
a60707d74b sched: Move child_runs_first sysctls to fair.c
move child_runs_first sysctls to fair.c and use the new
register_sysctl_init() to register the sysctl interface.

Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-06 13:43:43 -07:00
Valentin Schneider
cfe43f478b preempt/dynamic: Introduce preemption model accessors
CONFIG_PREEMPT{_NONE, _VOLUNTARY} designate either:
o The build-time preemption model when !PREEMPT_DYNAMIC
o The default boot-time preemption model when PREEMPT_DYNAMIC

IOW, using those on PREEMPT_DYNAMIC kernels is meaningless - the actual
model could have been set to something else by the "preempt=foo" cmdline
parameter. Same problem applies to CONFIG_PREEMPTION.

Introduce a set of helpers to determine the actual preemption model used by
the live kernel.

Suggested-by: Marco Elver <elver@google.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Marco Elver <elver@google.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20211112185203.280040-3-valentin.schneider@arm.com
2022-04-05 10:24:42 +02:00
Nick Desaulniers
8b023accc8 lockdep: Fix -Wunused-parameter for _THIS_IP_
While looking into a bug related to the compiler's handling of addresses
of labels, I noticed some uses of _THIS_IP_ seemed unused in lockdep.
Drive by cleanup.

-Wunused-parameter:
kernel/locking/lockdep.c:1383:22: warning: unused parameter 'ip'
kernel/locking/lockdep.c:4246:48: warning: unused parameter 'ip'
kernel/locking/lockdep.c:4844:19: warning: unused parameter 'ip'

Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20220314221909.2027027-1-ndesaulniers@google.com
2022-04-05 10:24:34 +02:00
Sebastian Andrzej Siewior
386ef214c3 sched: Teach the forced-newidle balancer about CPU affinity limitation.
try_steal_cookie() looks at task_struct::cpus_mask to decide if the
task could be moved to `this' CPU. It ignores that the task might be in
a migration disabled section while not on the CPU. In this case the task
must not be moved otherwise per-CPU assumption are broken.

Use is_cpu_allowed(), as suggested by Peter Zijlstra, to decide if the a
task can be moved.

Fixes: d2dfa17bc7 ("sched: Trivial forced-newidle balancer")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YjNK9El+3fzGmswf@linutronix.de
2022-04-05 09:59:36 +02:00
Peter Zijlstra
5b6547ed97 sched/core: Fix forceidle balancing
Steve reported that ChromeOS encounters the forceidle balancer being
ran from rt_mutex_setprio()'s balance_callback() invocation and
explodes.

Now, the forceidle balancer gets queued every time the idle task gets
selected, set_next_task(), which is strictly too often.
rt_mutex_setprio() also uses set_next_task() in the 'change' pattern:

	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
	running = task_current(rq, p); /* rq->curr == p */

	if (queued)
		dequeue_task(...);
	if (running)
		put_prev_task(...);

	/* change task properties */

	if (queued)
		enqueue_task(...);
	if (running)
		set_next_task(...);

However, rt_mutex_setprio() will explicitly not run this pattern on
the idle task (since priority boosting the idle task is quite insane).
Most other 'change' pattern users are pidhash based and would also not
apply to idle.

Also, the change pattern doesn't contain a __balance_callback()
invocation and hence we could have an out-of-band balance-callback,
which *should* trigger the WARN in rq_pin_lock() (which guards against
this exact anti-pattern).

So while none of that explains how this happens, it does indicate that
having it in set_next_task() might not be the most robust option.

Instead, explicitly queue the forceidle balancer from pick_next_task()
when it does indeed result in forceidle selection. Having it here,
ensures it can only be triggered under the __schedule() rq->lock
instance, and hence must be ran from that context.

This also happens to clean up the code a little, so win-win.

Fixes: d2dfa17bc7 ("sched: Trivial forced-newidle balancer")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: T.J. Alumbaugh <talumbau@chromium.org>
Link: https://lkml.kernel.org/r/20220330160535.GN8939@worktop.programming.kicks-ass.net
2022-04-05 09:59:36 +02:00