Commit graph

2209 commits

Paul E. McKenney
c69a00a12e srcu: Add boot-time control over srcu_node array allocation
This commit adds an srcutree.convert_to_big kernel parameter that either
refuses to convert at all (0), converts immediately at init_srcu_struct()
time (1), or lets rcutorture convert it (2).  An additional contention-based
dynamic conversion choice will be added, along with documentation.

[ paulmck: Apply callback-scanning feedback from Neeraj Upadhyay. ]

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11 15:51:08 -07:00
Neeraj Upadhyay
0b56f95390 srcu: Ensure snp nodes tree is fully initialized before traversal
For configurations where the snp node tree is not initialized at
init time (added in subsequent commits), srcu_funnel_gp_start()
and srcu_funnel_exp_start() can potentially traverse and observe
the snp nodes' transient (uninitialized) states. This can happen
when init_srcu_struct_nodes()'s initialization of sdp->mynode
races with srcu_funnel_gp_start() and srcu_funnel_exp_start().

Consider the case below, where srcu_funnel_gp_start() observes
sdp->mynode to be non-NULL and uses an uninitialized sdp->grpmask:

          P1                                  P2

init_srcu_struct_nodes()           void srcu_funnel_gp_start(...)
{
  for_each_possible_cpu(cpu) {
    ...
    sdp->mynode = &snp_first[...];
    for (snp = sdp->mynode;...)       struct srcu_node *snp_leaf =
                                        smp_load_acquire(&sdp->mynode)
      ...                             if (snp_leaf) {
                                        for (snp = snp_leaf; ...)
                                          ...
                                          if (snp == snp_leaf)
                                            snp->srcu_data_have_cbs[idx] |=
                                              sdp->grpmask;
    sdp->grpmask =
      1 << (cpu - sdp->mynode->grplo);
  }
}

Similarly, init_srcu_struct_nodes() and srcu_funnel_exp_start() can
race, with srcu_funnel_exp_start() observing the state of an snp lock
before spin_lock_init() has run.

          P1                                      P2

init_srcu_struct_nodes()               void srcu_funnel_exp_start(...)
{
  srcu_for_each_node_breadth_first(ssp, snp) {      for (; ...) {
                                                      spin_lock_...(snp, )
    spin_lock_init(&ACCESS_PRIVATE(snp, lock));
    ...
  }
  for_each_possible_cpu(cpu) {
    ...
    sdp->mynode = &snp_first[...];

To avoid these issues, ensure that snp node tree initialization is
complete, that is, that the SRCU_SIZE_WAIT_BARRIER srcu_size_state has
been reached, before traversing the tree.  Given that srcu_funnel_gp_start()
and srcu_funnel_exp_start() are called within SRCU read-side critical
sections, this check is safe in the sense that all callbacks are
enqueued on CPU0's srcu_cblist until SRCU_SIZE_WAIT_CALL is entered,
and these read-side critical sections (containing srcu_funnel_gp_start()
and srcu_funnel_exp_start()) need to complete before SRCU_SIZE_WAIT_CALL
is reached.

Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11 15:31:03 -07:00
Paul E. McKenney
cbdc98e93e srcu: Use invalid initial value for srcu_node GP sequence numbers
Currently, tree SRCU relies on the srcu_node structures being initialized
at the same time that the srcu_struct itself is initialized, and thus
uses the initial grace-period sequence number as the initial value for
the srcu_node structure's ->srcu_have_cbs[] and ->srcu_gp_seq_needed_exp
fields.  Although this has a high probability of also working when the
srcu_node array is allocated and initialized at some random later time,
it would be better to avoid leaving such things to chance.

This commit therefore initializes these fields with 0x2, which is a
recognizable invalid value.  It then adds the required checks for this
invalid value in order to avoid confusion on long-running kernels
(especially those on 32-bit systems) that allocate and initialize
srcu_node arrays late in life.
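
A minimal sketch of the idea, using SRCU_SNP_INIT_SEQ as an assumed name
for the 0x2 sentinel (the macro and the surrounding code are illustrative,
not necessarily the committed form):

    #define SRCU_SNP_INIT_SEQ 0x2  /* Recognizably invalid GP sequence number. */

    /* At srcu_node initialization time: */
    int i;

    for (i = 0; i < ARRAY_SIZE(snp->srcu_have_cbs); i++)
            snp->srcu_have_cbs[i] = SRCU_SNP_INIT_SEQ;
    snp->srcu_gp_seq_needed_exp = SRCU_SNP_INIT_SEQ;

    /* Later readers must treat the sentinel as "nothing recorded yet": */
    if (snp->srcu_have_cbs[idx] == SRCU_SNP_INIT_SEQ)
            ;  /* Behave as if no callbacks had been recorded on this node. */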

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11 15:31:03 -07:00
Paul E. McKenney
aeb9b39b8f srcu: Compute snp_seq earlier in srcu_funnel_gp_start()
Currently, srcu_funnel_gp_start() tests snp->srcu_have_cbs[idx] and then
separately assigns it to the snp_seq local variable.  This commit does
the assignment earlier to simplify the code a bit.  While in the area,
this commit also takes advantage of the 100-character line limit to put
the call to srcu_schedule_cbs_sdp() on a single line.

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11 15:31:02 -07:00
Paul E. McKenney
3bedebcf63 srcu: Make rcutorture dump the SRCU size state
This commit adds the numeric and string version of ->srcu_size_state to
the Tree-SRCU-specific portion of the rcutorture output.

[ paulmck: Apply feedback from kernel test robot and Dan Carpenter. ]
[ quic_neeraju: Apply feedback from Jiapeng Chong. ]

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11 15:31:02 -07:00
Paul E. McKenney
e2f638365d srcu: Add size-state transitioning code
This is just dead code at the moment, and will be used once
the state-transition code is activated.

Because srcu_barrier() must be aware of transition before call_srcu(), the
state machine waits for an SRCU grace period before callbacks are queued
to the non-CPU-0 queues.  This requires that portions of srcu_barrier()
be enclosed in an SRCU read-side critical section.

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11 15:31:02 -07:00
Paul E. McKenney
2ec303113d srcu: Dynamically allocate srcu_node array
This commit shrinks the srcu_struct structure by converting its ->node
field from a fixed-size compile-time array to a pointer to a dynamically
allocated array.  In kernels built with large values of NR_CPUS that boot
on systems with smaller numbers of CPUs, this can save significant memory.
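
A sketch of the shape of the change, with the allocation size and GFP
flags treated as illustrative:

    struct srcu_struct {
            struct srcu_node *node;         /* Was a fixed-size array. */
            /* ... */
    };

    /* At srcu_node-tree setup time: */
    ssp->node = kcalloc(rcu_num_nodes, sizeof(*ssp->node), GFP_ATOMIC);
    if (!ssp->node)
            return;                         /* Keep running without an snp_node array. */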

[ paulmck: Apply kernel test robot feedback. ]

Reported-by: A cast of thousands
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11 15:31:02 -07:00
Paul E. McKenney
994f706872 srcu: Make Tree SRCU able to operate without snp_node array
This commit makes Tree SRCU able to operate without an snp_node
array, that is, when the srcu_data structures' ->mynode pointers
are NULL.  This can result in high contention on the srcu_struct
structure's ->lock, but only when there are lots of call_srcu(),
synchronize_srcu(), and synchronize_srcu_expedited() calls.

Note that when there is no snp_node array, all SRCU callbacks use
CPU 0's callback queue.  This is optimal in the common case of low
update-side load because it removes the need to search each CPU
for the single callback that made the grace period happen.

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11 15:31:02 -07:00
Paul E. McKenney
7b9e9b5856 srcu: Make srcu_funnel_gp_start() cache ->mynode in snp_leaf
Currently, srcu_funnel_gp_start() walks its local variable snp up the
tree and reloads sdp->mynode whenever it is necessary to check whether
it is still at the leaf srcu_node level.  This works, but is a bit more
obtuse than absolutely necessary.  In addition, upcoming commits will
dynamically size srcu_struct structures, in which case sdp->mynode will
no longer necessarily be a constant, and this commit helps prepare for
that dynamic sizing.
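
A sketch of the cached-leaf pattern that results (simplified):

    struct srcu_node *snp_leaf = sdp->mynode;       /* Read ->mynode once. */
    struct srcu_node *snp;

    for (snp = snp_leaf; snp != NULL; snp = snp->srcu_parent) {
            if (snp == snp_leaf)                    /* Was: snp == sdp->mynode. */
                    ;  /* Leaf-only bookkeeping, e.g. ->srcu_data_have_cbs[]. */
            /* ... funnel locking up the tree ... */
    }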

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11 15:31:02 -07:00
Paul E. McKenney
8ed0076020 srcu: Tighten cleanup_srcu_struct() GP checks
Currently, cleanup_srcu_struct() checks for a grace period in progress,
but it does not check for a grace period that has not yet started but
which might start at any time.  Such a situation could result in a
use-after-free bug, so this commit adds a check for a grace period that
is needed but not yet started to cleanup_srcu_struct().
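
A sketch of the added check, using Tree SRCU's existing grace-period
sequence fields (illustrative, not the exact committed code):

    if (WARN_ON(rcu_seq_current(&ssp->srcu_gp_seq) != READ_ONCE(ssp->srcu_gp_seq_needed)))
            return;  /* A needed grace period has not yet started. */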

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-11 15:31:01 -07:00
Zhen Ni
d9ab0e63fa sched: Move rt_period/runtime sysctls to rt.c
Move rt_period/runtime sysctls to rt.c and use the new
register_sysctl_init() to register the sysctl interface.

Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-06 13:43:43 -07:00
Linus Torvalds
3fe2f7446f Changes in this cycle were:
- Cleanups for SCHED_DEADLINE
  - Tracing updates/fixes
  - CPU Accounting fixes
  - First wave of changes to optimize the overhead of the scheduler build,
    from the fast-headers tree - including placeholder *_api.h headers for
    later header split-ups.
  - Preempt-dynamic using static_branch() for ARM64
  - Isolation housekeeping mask rework; preparatory for further changes
  - NUMA-balancing: deal with CPU-less nodes
  - NUMA-balancing: tune systems that have multiple LLC cache domains per node (eg. AMD)
  - Updates to RSEQ UAPI in preparation for glibc usage
  - Lots of RSEQ/selftests, for same
  - Add Suren as PSI co-maintainer
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmI5rg8RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hGrw/+M3QOk6fH7G48wjlNnBvcOife6ls+Ni4k
 ixOAcF4JKoixO8HieU5vv0A7yf/83tAa6fpeXeMf1hkCGc0NSlmLtuIux+WOmoAL
 LzCyDEYfiP8KnVh0A1Tui/lK0+AkGo21O6ADhQE2gh8o2LpslOHQMzvtyekSzeeb
 mVxMYQN+QH0m518xdO2D8IQv9ctOYK0eGjmkqdNfntOlytypPZHeNel/tCzwklP/
 dElJUjNiSKDlUgTBPtL3DfpoLOI/0mHF2p6NEXvNyULxSOqJTu8pv9Z2ADb2kKo1
 0D56iXBDngMi9MHIJLgvzsA8gKzHLFSuPbpODDqkTZCa28vaMB9NYGhJ643NtEie
 IXTJEvF1rmNkcLcZlZxo0yjL0fjvPkczjw4Vj27gbrUQeEBfb4mfuI4BRmij63Ep
 qEkgQTJhduCqqrQP1rVyhwWZRk1JNcVug+F6N42qWW3fg1xhj0YSrLai2c9nPez6
 3Zt98H8YGS1Z/JQomSw48iGXVqfTp/ETI7uU7jqHK8QcjzQ4lFK5H4GZpwuqGBZi
 NJJ1l97XMEas+rPHiwMEN7Z1DVhzJLCp8omEj12QU+tGLofxxwAuuOVat3CQWLRk
 f80Oya3TLEgd22hGIKDRmHa22vdWnNQyS0S15wJotawBzQf+n3auS9Q3/rh979+t
 ES/qvlGxTIs=
 =Z8uT
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Cleanups for SCHED_DEADLINE

 - Tracing updates/fixes

 - CPU Accounting fixes

 - First wave of changes to optimize the overhead of the scheduler
   build, from the fast-headers tree - including placeholder *_api.h
   headers for later header split-ups.

 - Preempt-dynamic using static_branch() for ARM64

 - Isolation housekeeping mask rework; preparatory for further changes

 - NUMA-balancing: deal with CPU-less nodes

 - NUMA-balancing: tune systems that have multiple LLC cache domains per
   node (eg. AMD)

 - Updates to RSEQ UAPI in preparation for glibc usage

 - Lots of RSEQ/selftests, for same

 - Add Suren as PSI co-maintainer

* tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
  sched/headers: ARM needs asm/paravirt_api_clock.h too
  sched/numa: Fix boot crash on arm64 systems
  headers/prep: Fix header to build standalone: <linux/psi.h>
  sched/headers: Only include <linux/entry-common.h> when CONFIG_GENERIC_ENTRY=y
  cgroup: Fix suspicious rcu_dereference_check() usage warning
  sched/preempt: Tell about PREEMPT_DYNAMIC on kernel headers
  sched/topology: Remove redundant variable and fix incorrect type in build_sched_domains
  sched/deadline,rt: Remove unused parameter from pick_next_[rt|dl]_entity()
  sched/deadline,rt: Remove unused functions for !CONFIG_SMP
  sched/deadline: Use __node_2_[pdl|dle]() and rb_first_cached() consistently
  sched/deadline: Merge dl_task_can_attach() and dl_cpu_busy()
  sched/deadline: Move bandwidth mgmt and reclaim functions into sched class source file
  sched/deadline: Remove unused def_dl_bandwidth
  sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
  sched/tracing: Don't re-read p->state when emitting sched_switch event
  sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race
  sched/cpuacct: Remove redundant RCU read lock
  sched/cpuacct: Optimize away RCU read lock
  sched/cpuacct: Fix charge percpu cpuusage
  sched/headers: Reorganize, clean up and optimize kernel/sched/sched.h dependencies
  ...
2022-03-22 14:39:12 -07:00
Linus Torvalds
35dc0352bb RCU pull request for v5.18
This pull request contains the following branches:
 
 exp.2022.02.24a: Contains a fix for idle detection from Neeraj Upadhyay
 	and missing access marking detected by KCSAN.
 
 fixes.2022.02.14a: Miscellaneous fixes.
 
 rcu_barrier.2022.02.08a: Reduces coupling between rcu_barrier() and
 	CPU-hotplug operations, so that rcu_barrier() no longer needs
 	to do cpus_read_lock().  This may also someday allow system
 	boot to bring CPUs online concurrently.
 
 rcu-tasks.2022.02.08a: Enable more aggressive movement to per-CPU
 	queueing when reacting to excessive lock contention due
 	to workloads placing heavy update-side stress on RCU tasks.
 
 rt.2022.02.01b: Improvements to RCU priority boosting, including
 	changes from Neeraj Upadhyay, Zqiang, and Alison Chaiken.
 
 torture.2022.02.01b: Various fixes improving test robustness and
 	debug information.
 
 torturescript.2022.02.08a: Add tests for SRCU size transitions, further
 	compress torture.sh build products, and improve debug output.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmIusb0THHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jAklD/9VXLK7crcg2YeRXUIg1IOdnancsVCV
 MNtTfxNYqYIis+W2UfuHKuQu2yEXF5fihdY0J9TQv0byHsprp6FIZT+i1An4Ukgd
 0vyHjd/DaIKgs2txsB1DjhlatWlJUfQuBwhtNUkpYFLFwKdCI1l813bPbNlL+GiL
 p0ZejVMpBC5HgE6sDOtaaQSAB+AEUp+Lgr+yaG/On8hfzwWFKO8KldxhiKY9n07v
 SNDfKDgXB+80hx4RBVGbkuogV3s9brFULoNRXJy7Uf79DtiY09uazhhA3G0TjO34
 zGwmF91dqsXDF/Uz8g4aZO0xYRXUchOrsQ5lgO/GhTVbM9I0wWlMHEk/8WHyBJkU
 vlXOMuwzBc9/5uwZE3rnkA4a3nkXhPQjLlCr+/I7A/7Vsv9IBW9WSlgMvUN0Qf4S
 XAwTnIqfErnR60a+L0+HRr5kIV5VoXcxqI/Nv0/4/BMLRubS/c7cYjOTxXNJL9SU
 50pv5vty9xk3HSpuz0JAOyLf+PUT773uUQhFr5xCBSCVqbAm5WFg6hWPAgrN/tUS
 wstBc0wlA73rKVJxeLDQwHc/oT1zTUEzswVZITQ5zLHK0t0GbeR6QHccsdeaJyTe
 DisX+66A6YQrEuJmx5xUZqjYHqtYLDOBTbHA3ZwQmvjKu8ibWZ8Fg9ioURLCS4bF
 +FVkp/5KdcAN9w==
 =ljVY
 -----END PGP SIGNATURE-----

Merge tag 'rcu.2022.03.13a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU updates from Paul McKenney:

 - Fix idle detection (Neeraj Upadhyay) and missing access marking
   detected by KCSAN.

 - Reduce coupling between rcu_barrier() and CPU-hotplug operations, so
   that rcu_barrier() no longer needs to do cpus_read_lock(). This may
   also someday allow system boot to bring CPUs online concurrently.

 - Enable more aggressive movement to per-CPU queueing when reacting to
   excessive lock contention due to workloads placing heavy update-side
   stress on RCU tasks.

 - Improvements to RCU priority boosting, including changes from Neeraj
   Upadhyay, Zqiang, and Alison Chaiken.

 - Various fixes improving test robustness and debug information.

 - Add tests for SRCU size transitions, further compress torture.sh
   build products, and improve debug output.

 - Miscellaneous fixes.

* tag 'rcu.2022.03.13a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (49 commits)
  rcu: Replace cpumask_weight with cpumask_empty where appropriate
  rcu: Remove __read_mostly annotations from rcu_scheduler_active externs
  rcu: Uninline multi-use function: finish_rcuwait()
  rcu: Mark writes to the rcu_segcblist structure's ->flags field
  kasan: Record work creation stack trace with interrupts enabled
  rcu: Inline __call_rcu() into call_rcu()
  rcu: Add mutex for rcu boost kthread spawning and affinity setting
  rcu: Fix description of kvfree_rcu()
  MAINTAINERS:  Add Frederic and Neeraj to their RCU files
  rcutorture: Provide non-power-of-two Tasks RCU scenarios
  rcutorture: Test SRCU size transitions
  torture: Make torture.sh help message match reality
  rcu-tasks: Set ->percpu_enqueue_shift to zero upon contention
  rcu-tasks: Use order_base_2() instead of ilog2()
  rcu: Create and use an rcu_rdp_cpu_online()
  rcu: Make rcu_barrier() no longer block CPU-hotplug operations
  rcu: Rework rcu_barrier() and callback-migration logic
  rcu: Refactor rcu_barrier() empty-list handling
  rcu: Kill rnp->ofl_seq and use only rcu_state.ofl_lock for exclusion
  torture: Change KVM environment variable to RCUTORTURE
  ...
2022-03-21 14:00:56 -07:00
Frederic Weisbecker
2984539959 tick/rcu: Remove obsolete rcu_needs_cpu() parameters
With the removal of CONFIG_RCU_FAST_NO_HZ, the parameters in
rcu_needs_cpu() are not necessary anymore. Simply remove them.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
2022-03-07 23:01:26 +01:00
Paul E. McKenney
d5578190be Merge branches 'exp.2022.02.24a', 'fixes.2022.02.14a', 'rcu_barrier.2022.02.08a', 'rcu-tasks.2022.02.08a', 'rt.2022.02.01b', 'torture.2022.02.01b' and 'torturescript.2022.02.08a' into HEAD
exp.2022.02.24a: Expedited grace-period updates.
fixes.2022.02.14a: Miscellaneous fixes.
rcu_barrier.2022.02.08a: Make rcu_barrier() no longer exclude CPU hotplug.
rcu-tasks.2022.02.08a: RCU-tasks updates.
rt.2022.02.01b: Real-time-related updates.
torture.2022.02.01b: Torture-test updates.
torturescript.2022.02.08a: Torture-test scripting updates.
2022-02-24 09:38:46 -08:00
Ingo Molnar
6255b48aeb Linux 5.17-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmISrYgeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGg20IAKDZr7rfSHBopjQV
 Cocw744tom0XuxpvSZpp2GGOOXF+tkswcNNaRIrbGOl1mkyxA7eBZCTMpDeDS9aQ
 wB0D0Gxx8QBAJp4KgB1W7TB+hIGes/rs8Ve+6iO4ulLLdCVWX/q2boI0aZ7QX9O9
 qNi8OsoZQtk6falRvciZFHwV5Av1p2Sy1AW57udQ7DvJ4H98AfKf1u8/z208WWW8
 1ixC+qJxQcUcM9vI+7P9Tt7NbFSKv8SvAmqjFY7P+DxQAsVw6KXoqVXykDzeOv0t
 fUNOE/t0oFZafwtn8h7KBQnwS9lH03+3KkslVZs+iMFyUj/Bar+NVVyKoDhWXtVg
 /PuMhEg=
 =eU1o
 -----END PGP SIGNATURE-----

Merge tag 'v5.17-rc5' into sched/core, to resolve conflicts

New conflicts in sched/core due to the following upstream fixes:

  44585f7bc0 ("psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n")
  a06247c680 ("psi: Fix uaf issue when psi trigger is destroyed while being polled")

Conflicts:
	include/linux/psi_types.h
	kernel/sched/psi.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-02-21 11:53:51 +01:00
Frederic Weisbecker
04d4e665a6 sched/isolation: Use single feature type while referring to housekeeping cpumask
Refer to housekeeping APIs using single feature types instead of flags.
This prevents passing multiple isolation features at once to
housekeeping interfaces, which soon won't be possible anymore because
each isolation feature will have its own cpumask.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org
2022-02-16 15:57:55 +01:00
Yury Norov
6a2c1d450a rcu: Replace cpumask_weight with cpumask_empty where appropriate
In some places, RCU code calls cpumask_weight() to check if any bit of a
given cpumask is set. We can do it more efficiently with cpumask_empty()
because cpumask_empty() stops traversing the cpumask as soon as it finds
the first set bit, while cpumask_weight() counts all bits unconditionally.
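
An illustrative before/after, with cm and do_cleanup() standing in for
whichever cpumask and action a given call site uses:

    /* Before: counts every set bit just to learn whether any exist. */
    if (cpumask_weight(cm) == 0)
            do_cleanup();

    /* After: returns as soon as the first set bit is found. */
    if (cpumask_empty(cm))
            do_cleanup();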

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-14 10:36:58 -08:00
Ingo Molnar
58d4292bd0 rcu: Uninline multi-use function: finish_rcuwait()
This is a rarely used function, so uninlining its 3 instructions
is probably a win or a wash - but the main motivation is to
make <linux/rcuwait.h> independent of task_struct details.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-14 10:36:58 -08:00
Paul E. McKenney
c099290310 rcu: Mark writes to the rcu_segcblist structure's ->flags field
KCSAN reports data races between the rcu_segcblist_clear_flags() and
rcu_segcblist_set_flags() functions, though misreporting the latter
as a call to rcu_segcblist_is_enabled() from call_rcu().  This commit
converts the updates of this field to WRITE_ONCE(), relying on the
resulting unmarked reads to continue to detect buggy concurrent writes
to this field.
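
A sketch of the resulting marking, shown for the set-side helper (the
actual diff may differ in detail):

    static inline void rcu_segcblist_set_flags(struct rcu_segcblist *rsclp, int flags)
    {
            WRITE_ONCE(rsclp->flags, rsclp->flags | flags);  /* Was: rsclp->flags |= flags; */
    }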

Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
2022-02-14 10:36:58 -08:00
Zqiang
d818cc76e2 kasan: Record work creation stack trace with interrupts enabled
Recording the work creation stack trace for KASAN reports in
call_rcu() is expensive, due to unwinding the stack, but also
due to acquiring depot_lock inside stackdepot (which may be contended).
Because calling kasan_record_aux_stack_noalloc() does not require
interrupts to already be disabled, this may unnecessarily extend
the time with interrupts disabled.

Therefore, move calling kasan_record_aux_stack() before the section
with interrupts disabled.
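
A simplified sketch of the reordering in call_rcu()'s enqueue path
(details elided):

    head->func = func;
    head->next = NULL;
    kasan_record_aux_stack_noalloc(head);  /* May unwind and contend; IRQs still enabled. */
    local_irq_save(flags);
    /* ... enqueue the callback and do the rest with interrupts disabled ... */
    local_irq_restore(flags);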

Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-14 10:36:58 -08:00
Paul E. McKenney
1fe09ebe7a rcu: Inline __call_rcu() into call_rcu()
Because __call_rcu() is invoked only by call_rcu(), this commit inlines
the former into the latter.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-14 10:36:58 -08:00
David Woodhouse
218b957a69 rcu: Add mutex for rcu boost kthread spawning and affinity setting
As we handle parallel CPU bringup, we will need to take care to avoid
spawning multiple boost threads, or race conditions when setting their
affinity. Spotted by Paul McKenney.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-14 10:36:35 -08:00
Paul E. McKenney
00a8b4b54c rcu-tasks: Set ->percpu_enqueue_shift to zero upon contention
Currently, call_rcu_tasks_generic() sets ->percpu_enqueue_shift to
order_base_2(nr_cpu_ids) upon encountering sufficient contention.
This does not shift to use of non-CPU-0 callback queues as intended, but
rather continues using only CPU 0's queue.  Although this does provide
some decrease in contention due to spreading work over multiple locks,
it is not the dramatic decrease that was intended.

This commit therefore makes call_rcu_tasks_generic() set
->percpu_enqueue_shift to 0.
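
The shift maps a CPU number to a callback-queue index roughly as follows
(simplified):

    int index = smp_processor_id() >> READ_ONCE(rtp->percpu_enqueue_shift);

    /* shift == order_base_2(nr_cpu_ids): every CPU shifts down to index 0. */
    /* shift == 0: each CPU keeps its own index, spreading out the enqueues. */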

Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-08 10:13:22 -08:00
Paul E. McKenney
2bcd18e041 rcu-tasks: Use order_base_2() instead of ilog2()
The ilog2() function can be used to generate a shift count, but it will
generate the same count for a power of two as for one greater than a power
of two.  This results in shift counts that are larger than necessary for
systems with a power-of-two number of CPUs because the CPUs are numbered
from zero, so that the maximum CPU number is one less than that power
of two.

This commit therefore substitutes order_base_2(), which appears to have
been designed for exactly this use case.
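
A worked comparison of the two <linux/log2.h> helpers:

    /*
     * ilog2() rounds down, while order_base_2() rounds up to the number
     * of bits needed to cover CPU numbers 0 .. nr_cpu_ids - 1:
     *
     *      ilog2(8) == 3           order_base_2(8) == 3
     *      ilog2(9) == 3           order_base_2(9) == 4
     *
     * With 8 CPUs (numbered 0..7), a shift of 3 already maps every CPU to
     * queue 0, so adding a "+1" safety margin on top of ilog2() would be
     * one bit more than necessary.
     */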

Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-08 10:13:12 -08:00
Paul E. McKenney
5ae0f1b58b rcu: Create and use an rcu_rdp_cpu_online()
The pattern "rdp->grpmask & rcu_rnp_online_cpus(rnp)" occurs frequently
in RCU code in order to determine whether rdp->cpu is online from an
RCU perspective.  This commit therefore creates an rcu_rdp_cpu_online()
function to replace it.
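
A sketch of the new helper, close to (though not necessarily identical
to) the committed code:

    static bool rcu_rdp_cpu_online(struct rcu_data *rdp)
    {
            return !!(rdp->grpmask & rcu_rnp_online_cpus(rdp->mynode));
    }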

[ paulmck: Apply kernel test robot unused-variable feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-08 10:12:28 -08:00
Paul E. McKenney
80b3fd474c rcu: Make rcu_barrier() no longer block CPU-hotplug operations
This commit removes the cpus_read_lock() and cpus_read_unlock() calls
from rcu_barrier(), thus allowing CPUs to come and go during the course
of rcu_barrier() execution.  Posting of the ->barrier_head callbacks does
synchronize with portions of RCU's CPU-hotplug notifiers, but these locks
are held for short time periods on both sides.  Thus, full CPU-hotplug
operations could both start and finish during the execution of a given
rcu_barrier() invocation.

Additional synchronization is provided by a global ->barrier_lock.
Since the ->barrier_lock is only used during rcu_barrier() execution and
during onlining/offlining a CPU, the contention for this lock should
be low.  It might be tempting to make use of a per-CPU lock just on
general principles, but straightforward attempts to do this have the
problems shown below.

Initial state: 3 CPUs present; CPU0 and CPU1 do not have
any callbacks, and CPU2 has callbacks.

1. CPU0 calls rcu_barrier().

2. CPU1 starts offlining for CPU2. CPU1 calls
   rcutree_migrate_callbacks(). rcu_barrier_entrain() is called
   from rcutree_migrate_callbacks(), with CPU2's rdp->barrier_lock.
   It does not entrain ->barrier_head for CPU2, as rcu_barrier()
   on CPU0 hasn't started the barrier sequence (by calling
   rcu_seq_start(&rcu_state.barrier_sequence)) yet.

3. CPU0 starts new barrier sequence. It iterates over
   CPU0 and CPU1, after acquiring their per-cpu ->barrier_lock
   and finds 0 segcblist length. It updates ->barrier_seq_snap
   for CPU0 and CPU1 and continues loop iteration to CPU2.

    for_each_possible_cpu(cpu) {
        raw_spin_lock_irqsave(&rdp->barrier_lock, flags);
        if (!rcu_segcblist_n_cbs(&rdp->cblist)) {
            WRITE_ONCE(rdp->barrier_seq_snap, gseq);
            raw_spin_unlock_irqrestore(&rdp->barrier_lock, flags);
            rcu_barrier_trace(TPS("NQ"), cpu, rcu_state.barrier_sequence);
            continue;
        }

4. rcutree_migrate_callbacks() completes execution on CPU1.
   Segcblist len for CPU2 becomes 0.

5. The loop iteration on CPU0, checks rcu_segcblist_n_cbs(&rdp->cblist)
   for CPU2 and completes the loop iteration after setting
   ->barrier_seq_snap.

6. Because no ->barrier_head callback has been entrained, rcu_barrier()
   on CPU0 returns at this point.

7. The callbacks, which migrated from CPU2 to CPU1, execute.

Straightforward per-CPU locking is also subject to the following race
condition noted by Boqun Feng:

1. CPU0 calls rcu_barrier(), starting a new barrier sequence by invoking
   rcu_seq_start() and init_completion(), but does not yet initialize
   rcu_state.barrier_cpu_count.

2. CPU1 starts offlining for CPU2, calling rcutree_migrate_callbacks(),
   which in turn calls rcu_barrier_entrain() holding CPU2's
   rdp->barrier_lock.  It then entrains ->barrier_head for CPU2
   and atomically increments rcu_state.barrier_cpu_count, which is
   unfortunately not yet initialized to the value 2.

3. The just-entrained RCU callback is invoked.  It atomically
   decrements rcu_state.barrier_cpu_count and sees that it is
   now zero.  This callback therefore invokes complete().

4. CPU0 continues executing rcu_barrier(), but is not blocked
   by its call to wait_for_completion().  This results in rcu_barrier()
   returning before all pre-existing callbacks have been invoked,
   which is a bug.

Therefore, synchronization is provided by rcu_state.barrier_lock,
which is also held across the initialization sequence, especially the
rcu_seq_start() and the atomic_set() that sets rcu_state.barrier_cpu_count
to the value 2.  In addition, this lock is held when entraining the
rcu_barrier() callback, when deciding whether or not a CPU has callbacks
that rcu_barrier() must wait on, when setting the ->qsmaskinitnext for
incoming CPUs, and when migrating callbacks from a CPU that is going
offline.

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-08 10:12:28 -08:00
Paul E. McKenney
a16578dd5e rcu: Rework rcu_barrier() and callback-migration logic
This commit reworks rcu_barrier() and callback-migration logic to
permit allowing rcu_barrier() to run concurrently with CPU-hotplug
operations.  The key trick is for callback migration to check to see if
an rcu_barrier() is in flight, and, if so, enqueue the ->barrier_head
callback on its behalf.

This commit adds synchronization with RCU's CPU-hotplug notifiers.  Taken
together, this will permit a later commit to remove the cpus_read_lock()
and cpus_read_unlock() calls from rcu_barrier().

[ paulmck: Updated per kbuild test robot feedback. ]
[ paulmck: Updated per reviews session with Neeraj, Frederic, Uladzislau, and Boqun. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-08 10:12:28 -08:00
Paul E. McKenney
0cabb47af3 rcu: Refactor rcu_barrier() empty-list handling
This commit saves a few lines by checking first for an empty callback
list.  If the callback list is empty, then that CPU is taken care of,
regardless of its online or nocb state.  Also simplify tracing accordingly
and fold a few lines together.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-08 10:12:28 -08:00
David Woodhouse
82980b1622 rcu: Kill rnp->ofl_seq and use only rcu_state.ofl_lock for exclusion
If we allow architectures to bring APs online in parallel, then we end
up requiring rcu_cpu_starting() to be reentrant. But currently, the
manipulation of rnp->ofl_seq is not thread-safe.

However, rnp->ofl_seq is also fairly much pointless anyway since both
rcu_cpu_starting() and rcu_report_dead() hold rcu_state.ofl_lock for
fairly much the whole time that rnp->ofl_seq is set to an odd number
to indicate that an operation is in progress.

So drop rnp->ofl_seq completely, and use only rcu_state.ofl_lock.

This has a couple of minor complexities: lockdep will complain when we
take rcu_state.ofl_lock, and currently accepts the 'excuse' of having
an odd value in rnp->ofl_seq. So switch it to an arch_spinlock_t to
avoid that false positive complaint. Since we're killing rnp->ofl_seq
of course that 'excuse' has to be changed too, so make it check for
arch_spin_is_locked(rcu_state.ofl_lock).

There's no arch_spin_lock_irqsave() so we have to manually save and
restore local interrupts around the locking.
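
A sketch of the manual interrupt handling around the arch spinlock:

    unsigned long flags;

    local_irq_save(flags);
    arch_spin_lock(&rcu_state.ofl_lock);
    /* ... update the online masks ... */
    arch_spin_unlock(&rcu_state.ofl_lock);
    local_irq_restore(flags);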

At Paul's request based on Neeraj's analysis, make rcu_gp_init not just
wait but *exclude* any CPU online/offline activity, which was fairly
much true already by virtue of it holding rcu_state.ofl_lock.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-08 10:11:41 -08:00
Paul E. McKenney
9c0f1c7fd7 rcutorture: Enable limited callback-flooding tests of SRCU
This commit allows up to 50,000 callbacks worth of callback-flooding
tests of SRCU.  The goal of this change is to exercise Tree SRCU's
ability to transition from SRCU_SIZE_SMALL to SRCU_SIZE_BIG triggered
by callback-queue-time lock contention.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:24:39 -08:00
Paul E. McKenney
89440d2dad rcutorture: Fix rcu_fwd_mutex deadlock
The rcu_torture_fwd_cb_hist() function acquires rcu_fwd_mutex, but is
invoked from the rcutorture_oom_notify() function, which holds this same
mutex across this call.  This commit fixes the resulting deadlock.

Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:24:39 -08:00
Paul E. McKenney
02b51a1cf4 rcutorture: Add end-of-test check to rcu_torture_fwd_prog() loop
The second and subsequent forward-progress kthreads loop waiting for
the first forward-progress kthread to start the next test interval.
Unfortunately, if the test ends while one of those kthreads is waiting,
the test will hang.  This hang occurs because that wait loop fails to
check for the end of the test.  This commit therefore adds an end-of-test
check to that wait loop.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:24:38 -08:00
Paul E. McKenney
e22ef8df41 rcutorture: Make rcu_fwd_cb_nodelay be a counter
Back when only one rcutorture kthread could do forward-progress testing,
it was just fine for rcu_fwd_cb_nodelay to be a non-atomic bool.  It was
set at the start of forward-progress testing and cleared at the end.
But now that there are multiple threads, the value can be cleared while
one of the threads is still doing forward-progress testing.  This commit
therefore makes rcu_fwd_cb_nodelay be an atomic counter, replacing the
WRITE_ONCE() operations with atomic_inc() and atomic_dec().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:24:38 -08:00
Paul E. McKenney
05b724655b rcutorture: Increase visibility of forward-progress hangs
This commit adds a few pr_alert() calls to rcutorture's forward-progress
testing in order to better diagnose shutdown-time hangs.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:24:38 -08:00
Paul E. McKenney
6f81bd6a4e rcutorture: Print message before invoking ->cb_barrier()
The various ->cb_barrier() functions, for example, rcu_barrier(),
sometimes cause rcutorture hangs.  But currently, the last console
message is the unenlightening "Stopping rcu_torture_stats".  This commit
therefore prints a message of the form "rcu_torture_cleanup: Invoking
rcu_barrier+0x0/0x1e0()" to help point people in the right direction.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:24:38 -08:00
Zqiang
c951587585 rcu: Add per-CPU rcuc task dumps to RCU CPU stall warnings
When the rcutree.use_softirq kernel boot parameter is set to zero, all
RCU_SOFTIRQ processing is carried out by the per-CPU rcuc kthreads.
If these kthreads are being starved, quiescent states will not be
reported, which in turn means that the grace period will not end, which
can in turn trigger RCU CPU stall warnings.  This commit therefore dumps
stack traces of stalled CPUs' rcuc kthreads, which can help identify
what is preventing those kthreads from running.

Suggested-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Reviewed-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:22:17 -08:00
Paul E. McKenney
10c5357874 rcu: Don't deboost before reporting expedited quiescent state
Currently rcu_preempt_deferred_qs_irqrestore() releases rnp->boost_mtx
before reporting the expedited quiescent state.  Under heavy real-time
load, this can result in this function being preempted before the
quiescent state is reported, which can in turn prevent the expedited grace
period from completing.  Tim Murray reports that the resulting expedited
grace periods can take hundreds of milliseconds and even more than one
second, when they should normally complete in less than a millisecond.

This was fine given that there were no particular response-time
constraints for synchronize_rcu_expedited(), as it was designed
for throughput rather than latency.  However, some users now need
sub-100-millisecond response-time constraints.

This patch therefore follows Neeraj's suggestion (seconded by Tim and
by Uladzislau Rezki) of simply reversing the two operations.

Reported-by: Tim Murray <timmurray@google.com>
Reported-by: Joel Fernandes <joelaf@google.com>
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Tim Murray <timmurray@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: <stable@vger.kernel.org> # 5.4.x
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:19:41 -08:00
Alison Chaiken
c8b16a6526 rcu: Elevate priority of offloaded callback threads
When CONFIG_PREEMPT_RT=y, the rcutree.kthread_prio command-line
parameter signals initialization code to boost the priority of rcuc
callbacks to the designated value.  With the additional
CONFIG_RCU_NOCB_CPU=y configuration and an additional rcu_nocbs
command-line parameter, the callbacks on the listed cores are
offloaded to new rcuop kthreads that are not pinned to the cores whose
post-grace-period work is performed.  While the rcuop kthreads perform
the same function as the rcuc kthreads they offload, the kthread_prio
parameter only boosts the priority of the rcuc kthreads.  Fix this
inconsistency by elevating rcuop kthreads to the same priority as the rcuc
kthreads.

Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:19:25 -08:00
Alison Chaiken
54577e23fa rcu: Make priority of grace-period thread consistent
The priority of RCU grace period threads is set to kthread_prio when
they are launched from rcu_spawn_gp_kthread().  The same is not true
of rcu_spawn_one_nocb_kthread().  Accordingly, add priority elevation
to rcu_spawn_one_nocb_kthread().

Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:19:17 -08:00
Alison Chaiken
c8db27dd0e rcu: Move kthread_prio bounds-check to a separate function
Move the bounds-check of the kthread_prio cmdline parameter to a new
function in order to facilitate a different callsite.

Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:19:02 -08:00
Zqiang
4b4399b245 rcu: Create per-cpu rcuc kthreads only when rcutree.use_softirq=0
The per-CPU "rcuc" kthreads are used only by kernels booted with
rcutree.use_softirq=0, but they are nevertheless unconditionally created
by kernels built with CONFIG_RCU_BOOST=y.  This results in "rcuc"
kthreads being created that are never actually used.  This commit
therefore refrains from creating these kthreads unless the kernel
is actually booted with rcutree.use_softirq=0.

Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:19:02 -08:00
Neeraj Upadhyay
eae9f147a4 rcu: Remove unused rcu_state.boost
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:19:02 -08:00
Neeraj Upadhyay
02e3024175 rcu/nocb: Handle concurrent nocb kthreads creation
When multiple CPUs in the same nocb gp/cb group concurrently
come online, they might try to concurrently create the same
rcuog kthread. Fix this by using nocb gp CPU's spawn mutex to
provide mutual exclusion for the rcuog kthread creation code.

[ paulmck: Whitespace fixes per kernel test robot feedback. ]

Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:19:02 -08:00
Paul E. McKenney
a47f9f131d rcu: Mark accesses to boost_starttime
The boost_starttime shared variable has conflicting unmarked C-language
accesses, which are dangerous at best.  This commit therefore adds
appropriate marking.  This was found by KCSAN.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:19:02 -08:00
Paul E. McKenney
63c564da11 rcu: Mark ->expmask access in synchronize_rcu_expedited_wait()
This commit adds a READ_ONCE() to an access to the rcu_node structure's
->expmask field to prevent compiler mischief.  Detected by KCSAN.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:05:10 -08:00
Neeraj Upadhyay
4d266c247d rcu/exp: Fix check for idle context in rcu_exp_handler
For PREEMPT_RCU, the rcu_exp_handler() function checks
whether the current CPU is in idle, by calling
rcu_dynticks_curr_cpu_in_eqs(). However, rcu_exp_handler()
is called in IPI handler context. So, it should be checking
the idle context using rcu_is_cpu_rrupt_from_idle(). Fix this
by using rcu_is_cpu_rrupt_from_idle() instead of
rcu_dynticks_curr_cpu_in_eqs(). Non-preempt configuration
already uses the correct check.

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:05:10 -08:00
Paul E. McKenney
da123016ca rcu-tasks: Fix computation of CPU-to-list shift counts
The ->percpu_enqueue_shift field is used to map from the running CPU
number to the index of the corresponding callback list.  This mapping
can change at runtime in response to varying callback load, resulting
in varying levels of contention on the callback-list locks.

Unfortunately, the initial value of this field is correct only if the
system happens to have a power-of-two number of CPUs, otherwise the
callbacks from the high-numbered CPUs can be placed into the callback list
indexed by 1 (rather than 0), and those index-1 callbacks will be ignored.
This can result in soft lockups and hangs.

This commit therefore corrects this mapping, adding one to this shift
count as needed for systems having odd numbers of CPUs.
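
A sketch of the corrected shift computation (illustrative rather than
the exact diff):

    int shift = ilog2(nr_cpu_ids);

    if (!is_power_of_2(nr_cpu_ids))
            shift++;  /* E.g. 6 CPUs: ilog2(6) == 2, but CPU 4 >> 2 == 1, so use 3. */
    smp_store_release(&rtp->percpu_enqueue_shift, shift);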

Fixes: 7a30871b6a ("rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic queue selection")
Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-01-26 13:04:05 -08:00
Linus Torvalds
f56caedaf9 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "146 patches.

  Subsystems affected by this patch series: kthread, ia64, scripts,
  ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak,
  dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap,
  memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb,
  userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp,
  ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and
  damon)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (146 commits)
  mm/damon: hide kernel pointer from tracepoint event
  mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log
  mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging
  mm/damon/dbgfs: remove an unnecessary variable
  mm/damon: move the implementation of damon_insert_region to damon.h
  mm/damon: add access checking for hugetlb pages
  Docs/admin-guide/mm/damon/usage: update for schemes statistics
  mm/damon/dbgfs: support all DAMOS stats
  Docs/admin-guide/mm/damon/reclaim: document statistics parameters
  mm/damon/reclaim: provide reclamation statistics
  mm/damon/schemes: account how many times quota limit has exceeded
  mm/damon/schemes: account scheme actions that successfully applied
  mm/damon: remove a mistakenly added comment for a future feature
  Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts
  Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning
  Docs/admin-guide/mm/damon/usage: remove redundant information
  Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks
  mm/damon: convert macro functions to static inline functions
  mm/damon: modify damon_rand() macro to static inline function
  mm/damon: move damon_rand() definition into damon.h
  ...
2022-01-15 20:37:06 +02:00
Cai Huoqing
3b9cb4ba4b rcutorture: make use of the helper function kthread_run_on_cpu()
Replace kthread_create_on_node/kthread_bind/wake_up_process() with
kthread_run_on_cpu() to simplify the code.
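
An illustrative before/after, with fn, arg, and the kthread name standing
in for the actual rcutorture values:

    /* Before: create, bind, and wake in three separate steps. */
    t = kthread_create_on_node(fn, arg, cpu_to_node(cpu), "torture_kthread/%u", cpu);
    kthread_bind(t, cpu);
    wake_up_process(t);

    /* After: one helper performs the whole sequence. */
    t = kthread_run_on_cpu(fn, arg, cpu, "torture_kthread/%u");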

Link: https://lkml.kernel.org/r/20211022025711.3673-5-caihuoqing@baidu.com
Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Cc: Bernard Metzler <bmt@zurich.ibm.com>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15 16:30:24 +02:00
Paul E. McKenney
f80fe66c38 Merge branches 'doc.2021.11.30c', 'exp.2021.12.07a', 'fastnohz.2021.11.30c', 'fixes.2021.11.30c', 'nocb.2021.12.09a', 'nolibc.2021.11.30c', 'tasks.2021.12.09a', 'torture.2021.12.07a' and 'torturescript.2021.11.30c' into HEAD
doc.2021.11.30c: Documentation updates.
exp.2021.12.07a: Expedited-grace-period fixes.
fastnohz.2021.11.30c: Remove CONFIG_RCU_FAST_NO_HZ.
fixes.2021.11.30c: Miscellaneous fixes.
nocb.2021.12.09a: No-CB CPU updates.
nolibc.2021.11.30c: Tiny in-kernel library updates.
tasks.2021.12.09a: RCU-tasks updates, including update-side scalability.
torture.2021.12.07a: Torture-test in-kernel module updates.
torturescript.2021.11.30c: Torture-test scripting updates.
2021-12-09 11:38:09 -08:00
Frederic Weisbecker
10d4703154 rcu/nocb: Merge rcu_spawn_cpu_nocb_kthread() and rcu_spawn_one_nocb_kthread()
The rcu_spawn_one_nocb_kthread() function is called only from
rcu_spawn_cpu_nocb_kthread().  Therefore, inline the former into
the latter, saving a few lines of code.

Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09 11:35:16 -08:00
Frederic Weisbecker
d2cf0854d7 rcu/nocb: Allow empty "rcu_nocbs" kernel parameter
Allow the rcu_nocbs kernel parameter to be specified just by itself,
without specifying any CPUs.  This allows systems administrators to use
"rcu_nocbs" to specify that none of the CPUs are to be offloaded at boot
time, but that any of them may be offloaded at runtime via cpusets.

In contrast, if the "rcu_nocbs" or "nohz_full" kernel parameters are not
specified at all, then not only are none of the CPUs offloaded at boot,
none of them can be offloaded at runtime, either.

While in the area, modernize the description of the "rcuo" kthreads'
naming scheme.

Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09 11:35:11 -08:00
Frederic Weisbecker
2cf4528d6d rcu/nocb: Create kthreads on all CPUs if "rcu_nocbs=" or "nohz_full=" are passed
In order to be able to (de-)offload any CPU using cpusets in the future,
create the NOCB data structures for all possible CPUs.  For now this is
done only as long as the "rcu_nocbs=" or "nohz_full=" kernel parameters
are passed to avoid the unnecessary overhead for most users.

Note that the rcuog and rcuoc kthreads are not created until at least
one of the corresponding CPUs comes online.  This approach avoids the
creation of excess kthreads when firmware lies about the number of CPUs
present on the system.

Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09 11:35:06 -08:00
Frederic Weisbecker
a81aeaf7a1 rcu/nocb: Optimize kthreads and rdp initialization
Currently cpumask_available() is used to prevent unwanted NOCB
initialization.  However if neither "rcu_nocbs=" nor "nohz_full="
parameters are passed to a kernel built with CONFIG_CPUMASK_OFFSTACK=n,
the initialization path is still taken, running through all sorts of
needless operations and iterations on an empty cpumask.

Fix this by relying on a real initialization state instead.  This also
optimizes kthread creation, preventing needless iteration over all online
CPUs when the kernel is booted without any offloaded CPUs.

Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09 11:34:37 -08:00
Frederic Weisbecker
8d97039646 rcu/nocb: Prepare nocb_cb_wait() to start with a non-offloaded rdp
In order to be able to toggle the offloaded state from cpusets, a nocb
kthread will need to be created for all possible CPUs whenever either
of the "rcu_nocbs=" or "nohz_full=" parameters are specified.

Therefore, the nocb_cb_wait() kthread must be prepared to start running
on a de-offloaded rdp.  To accomplish this, simply move the sleeping
condition to the beginning of the nocb_cb_wait() function, which prevents
this kthread from attempting to invoke callbacks before the corresponding
CPU is offloaded.

Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09 11:34:30 -08:00
Frederic Weisbecker
2ebc45c44c rcu/nocb: Remove rcu_node structure from nocb list when de-offloaded
The nocb_gp_wait() function iterates over all CPUs in its group,
including even those CPUs that have been de-offloaded.  This is of
course suboptimal, especially if none of the CPUs within the group are
currently offloaded.  This will become even more of a problem once a
nocb kthread is created for all possible CPUs.

Therefore use a standard doubly linked list to link all the offloaded
rcu_data structures and safely add or delete these structures as we
offload or de-offload them, respectively.

Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09 11:34:07 -08:00
Paul E. McKenney
fd796e4139 rcu-tasks: Use fewer callbacks queues if callback flood ends
By default, when lock contention is encountered, the RCU Tasks flavors
of RCU switch to using per-CPU queueing.  However, if the callback
flood ends, per-CPU queueing continues to be used, which introduces
significant additional overhead, especially for callback invocation,
which fans out a series of workqueue handlers.

This commit therefore switches back to single-queue operation if at the
beginning of a grace period there are very few callbacks.  The definition
of "very few" is set by the rcupdate.rcu_task_collapse_lim module
parameter, which defaults to 10.  This switch happens in two phases,
with the first phase causing future callbacks to be enqueued on CPU 0's
queue, but with all queues continuing to be checked for grace periods
and callback invocation.  The second phase checks to see if an RCU grace
period has elapsed and if all remaining RCU-Tasks callbacks are queued
on CPU 0.  If so, only CPU 0 is checked for future grace periods and
callback operation.

Of course, the return of contention anywhere during this process will
result in returning to per-CPU callback queueing.

Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09 10:52:11 -08:00
Paul E. McKenney
2cee0789b4 rcu-tasks: Use separate ->percpu_dequeue_lim for callback dequeueing
Decreasing the number of callback queues is a bit tricky because it
is necessary to handle callbacks that were queued before the number of
queues decreased, but which were not ready to invoke until afterwards.
This commit takes a first step in this direction by maintaining a separate
->percpu_dequeue_lim to control callback dequeueing, in addition to the
existing ->percpu_enqueue_lim which now controls only enqueueing.

Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09 10:52:11 -08:00
Paul E. McKenney
ab97152f88 rcu-tasks: Use more callback queues if contention encountered
The rcupdate.rcu_task_enqueue_lim module parameter allows system
administrators to tune the number of callback queues used by the RCU
Tasks flavors.  However if callback storms are infrequent, it would
be better to operate with a single queue on a given system unless and
until that system actually needed more queues.  Systems not needing
more queues can then avoid the overhead of checking the extra queues
and especially avoid the overhead of fanning workqueue handlers out to
all CPUs to invoke callbacks.

This commit therefore switches to using all the CPUs' callback queues if
call_rcu_tasks_generic() encounters too much lock contention.  The amount
of lock contention to tolerate defaults to 100 contended lock acquisitions
per jiffy, and can be adjusted using the new rcupdate.rcu_task_contend_lim
module parameter.

Such switching is undertaken only if the rcupdate.rcu_task_enqueue_lim
module parameter is negative, which is its default value (-1).
This allows savvy systems administrators to set the number of queues
to some known good value and to not have to worry about the kernel doing
any second guessing.

[ paulmck: Apply feedback from Guillaume Tucker and kernelci. ]

Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09 10:52:11 -08:00
Paul E. McKenney
3063b33a34 rcu-tasks: Avoid raw-spinlocked wakeups from call_rcu_tasks_generic()
If the caller of call_rcu_tasks(), call_rcu_tasks_rude(),
or call_rcu_tasks_trace() holds a raw spinlock, and then if
call_rcu_tasks_generic() determines that the grace-period kthread must
be awakened, then the wakeup might acquire a normal spinlock while a
raw spinlock is held.  This results in lockdep splats when the
kernel is built with CONFIG_PROVE_RAW_LOCK_NESTING=y.

This commit therefore defers the wakeup using irq_work_queue().

It would be nice to directly invoke wakeup when a raw spinlock is not
held, but there is currently no way to check for this in all kernels.
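
A sketch of the deferral, with hypothetical names for the irq_work and
the wait queue being woken:

    static void rtp_deferred_wakeup(struct irq_work *iwp)
    {
            wake_up(&rcu_tasks.cbs_wq);  /* Takes a normal (non-raw) lock, now safely. */
    }
    static DEFINE_IRQ_WORK(rtp_wakeup_work, rtp_deferred_wakeup);

    /* In call_rcu_tasks_generic(), instead of a direct wakeup: */
    if (needwake)
            irq_work_queue(&rtp_wakeup_work);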

Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09 10:52:11 -08:00
Paul E. McKenney
7d13d30bb6 rcu-tasks: Count trylocks to estimate call_rcu_tasks() contention
This commit converts the unconditional raw_spin_lock_rcu_node() lock
acquisition in call_rcu_tasks_generic() to a trylock followed by an
unconditional acquisition if the trylock fails.  If the trylock fails,
the failure is counted, but the count is reset to zero on each new jiffy.

This statistic will be used to determine when to move from a single
callback queue to per-CPU callback queues.
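
A sketch of the counting scheme (field names illustrative):

    if (!raw_spin_trylock_rcu_node(rtpcp)) {
            raw_spin_lock_rcu_node(rtpcp);          /* Fall back to unconditional acquisition. */
            if (rtpcp->rtp_jiffies != jiffies) {
                    rtpcp->rtp_jiffies = jiffies;   /* New jiffy, so reset the count. */
                    rtpcp->rtp_n_lock_retries = 0;
            }
            rtpcp->rtp_n_lock_retries++;            /* One more contended acquisition this jiffy. */
    }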

Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09 10:52:11 -08:00
Paul E. McKenney
8610b65680 rcu-tasks: Add rcupdate.rcu_task_enqueue_lim to set initial queueing
This commit adds a rcupdate.rcu_task_enqueue_lim module parameter that
sets the initial number of callback queues to use for the RCU Tasks
family of RCU implementations.  This parameter allows testing of various
fanout values.
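
For illustration, the parameter can be supplied on the kernel command line,
for example via an rcutorture run along these lines (the scenario name and
value are only placeholders):

tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --configs TRACE01 \
--bootargs "rcupdate.rcu_task_enqueue_lim=4" --trust-make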

Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09 10:52:11 -08:00
Paul E. McKenney
ce9b1c667f rcu-tasks: Make rcu_barrier_tasks*() handle multiple callback queues
Currently, rcu_barrier_tasks(), rcu_barrier_tasks_rude(),
and rcu_barrier_tasks_trace() simply invoke the corresponding
synchronize_rcu_tasks*() function.  This works because there is only
one callback queue.

However, there will soon be multiple callback queues.  This commit
therefore scans the queues currently in use, entraining a callback on
each non-empty queue.  Sequence numbers and reference counts are used
to synchronize this process in a manner similar to the approach taken
by rcu_barrier().

Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09 10:52:11 -08:00
Paul E. McKenney
d363f833c6 rcu-tasks: Use workqueues for multiple rcu_tasks_invoke_cbs() invocations
If there is a flood of callbacks, it is necessary to put multiple
CPUs to work invoking those callbacks.  This commit therefore uses a
workqueue-flooding approach to parallelize RCU Tasks callback execution.
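
A minimal sketch of the workqueue-flooding approach, assuming a per-CPU
structure with an embedded work item (the rtp_work field name and the
invoke_ready_callbacks() helper are assumptions):

    static void rcu_tasks_invoke_cbs_wq(struct work_struct *wp)
    {
        struct rcu_tasks_percpu *rtpcp =
            container_of(wp, struct rcu_tasks_percpu, rtp_work);

        invoke_ready_callbacks(rtpcp);          /* Illustrative helper. */
    }

    /* For each per-CPU list holding ready callbacks: */
    queue_work_on(cpu, system_wq, &rtpcp->rtp_work);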

Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09 10:52:00 -08:00
Paul E. McKenney
57881863ad rcu-tasks: Abstract invocations of callbacks
This commit adds a rcu_tasks_invoke_cbs() function that invokes all
ready callbacks on all of the per-CPU lists that are currently in use.

Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09 10:51:44 -08:00
Paul E. McKenney
4d1114c054 rcu-tasks: Abstract checking of callback lists
This commit adds a rcu_tasks_need_gpcb() function that returns an
indication of whether another grace period is required, and if no grace
period is required, whether there are callbacks that need to be invoked.
The function scans all per-CPU lists currently in use.

Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09 10:51:27 -08:00
Paul E. McKenney
8dd593fddd rcu-tasks: Add a ->percpu_enqueue_lim to the rcu_tasks structure
This commit adds a ->percpu_enqueue_lim field to the rcu_tasks structure.
This field contains two to the power of the ->percpu_enqueue_shift
field, easing construction of iterators over the per-CPU queues that
might contain RCU Tasks callbacks.  Such iterators will be introduced
in later commits.
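
In sketch form, the relationship described above and the kind of iterator
it enables (->rtpcpu is assumed here as the per-CPU array pointer; this is
a rough illustration, not the exact kernel code):

    rtp->percpu_enqueue_lim = 1 << rtp->percpu_enqueue_shift;

    for (cpu = 0; cpu < READ_ONCE(rtp->percpu_enqueue_lim); cpu++) {
        struct rcu_tasks_percpu *rtpcp = per_cpu_ptr(rtp->rtpcpu, cpu);
        /* ... operate on this per-CPU queue ... */
    }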

Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09 10:13:55 -08:00
Neeraj Upadhyay
65b629e704 rcu-tasks: Inspect stalled task's trc state in locked state
On an RCU Tasks Trace stall, inspect the RCU-Tasks-Trace-specific
state of the stalled task while that task is locked down, using
try_invoke_on_locked_down_task(), in order to obtain a reliable trc
state for a non-running stalled task.

This was tested using the following command:

tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --configs TRACE01 \
--bootargs "rcutorture.torture_type=tasks-tracing rcutorture.stall_cpu=10 \
rcutorture.stall_cpu_block=1 rcupdate.rcu_task_stall_timeout=100" --trust-make

As expected, this produced the following console output for running and
sleeping tasks.

[   21.520291] INFO: rcu_tasks_trace detected stalls on tasks:
[   21.521292] P85: ... nesting: 1N cpu: 2
[   21.521966] task:rcu_torture_sta state:D stack:15080 pid:   85 ppid:     2
flags:0x00004000
[   21.523384] Call Trace:
[   21.523808]  __schedule+0x273/0x6e0
[   21.524428]  schedule+0x35/0xa0
[   21.524971]  schedule_timeout+0x1ed/0x270
[   21.525690]  ? del_timer_sync+0x30/0x30
[   21.526371]  ? rcu_torture_writer+0x720/0x720
[   21.527106]  rcu_torture_stall+0x24a/0x270
[   21.527816]  kthread+0x115/0x140
[   21.528401]  ? set_kthread_struct+0x40/0x40
[   21.529136]  ret_from_fork+0x22/0x30
[   21.529766]  1 holdouts
[   21.632300] INFO: rcu_tasks_trace detected stalls on tasks:
[   21.632345] rcu_torture_stall end.
[   21.633293] P85: .
[   21.633294] task:rcu_torture_sta state:R  running task stack:15080 pid:
85 ppid:     2 flags:0x00004000
[   21.633299] Call Trace:
[   21.633301]  ? vprintk_emit+0xab/0x180
[   21.633306]  ? vprintk_emit+0x11a/0x180
[   21.633308]  ? _printk+0x4d/0x69
[   21.633311]  ? __default_send_IPI_shortcut+0x1f/0x40

[ paulmck: Update to new v5.16 task_call_func() name. ]

Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09 10:13:55 -08:00
Paul E. McKenney
381a4f3b38 rcu-tasks: Use spin_lock_rcu_node() and friends
This commit renames the rcu_tasks_percpu structure's ->cbs_pcpu_lock
to ->lock and then uses spin_lock_rcu_node() and friends to acquire and
release this lock, preparing for upcoming commits that will spread the
grace-period process across multiple CPUs and kthreads.

[ paulmck: Apply feedback from kernel test robot. ]

Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09 10:12:45 -08:00
Paul E. McKenney
53b541fbdb rcutorture: Combine n_max_cbs from all kthreads in a callback flood
With the addition of multiple callback-flood kthreads, the maximum number
of callbacks from any one of those kthreads is reported in the rcutorture
run summary.  This commit changes this to report the sum of each kthread's
maximum number of callbacks in a given callback-flooding episode.

Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:36:18 -08:00
Paul E. McKenney
613b00fbe6 rcutorture: Add ability to limit callback-flood intensity
The RCU tasks flavors of RCU now need concurrent callback flooding to
test their ability to switch between single-queue mode and per-CPU queue
mode, but their lack of heavy-duty forward-progress features rules out
the use of rcutorture's current callback-flooding code.  This commit
therefore provides the ability to limit the intensity of the callback
floods using a new ->cbflood_max field in the rcu_operations structure.
When this field is zero, there is no limit; otherwise, each callback-flood
kthread allocates at most ->cbflood_max callbacks.

Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:36:18 -08:00
Paul E. McKenney
82e310033d rcutorture: Enable multiple concurrent callback-flood kthreads
This commit converts the rcutorture.fwd_progress module parameter from
bool to int, so that it specifies the number of callback-flood kthreads.
Values less than zero specify one kthread per CPU; however, the number of
kthreads executing concurrently is limited to the number of online CPUs.
This commit also reverses the order of the need-resched and callback-flood
operations so that the callback flooding happens more nearly at the
same time.
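
As an illustration, a run exercising several concurrent callback-flood
kthreads might look something like this (the scenario name and count are
placeholders):

tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --configs TREE01 \
--bootargs "rcutorture.fwd_progress=4" --trust-make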

Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:36:18 -08:00
Wander Lairson Costa
5ff7c9f9d7 rcutorture: Avoid soft lockup during cpu stall
If the stall_cpu module option is used but the stall_cpu_block option is
not, a soft-lockup warning may result.

Introduce the stall_no_softlockup option to avoid a soft lockup during a
CPU stall even when the stall_cpu_block option is not used.
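
For example, a stall test that previously tripped the soft-lockup detector
might now pass boot arguments along these lines (values are placeholders):

rcutorture.stall_cpu=30 rcutorture.stall_no_softlockup=1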

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:36:17 -08:00
Li Zhijian
81faa4f6fb locktorture,rcutorture,torture: Always log error message
Unconditionally log messages corresponding to errors.

Acked-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Li Zhijian <zhijianx.li@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:36:17 -08:00
Li Zhijian
86e7ed1bd5 rcuscale: Always log error message
Unconditionally log messages corresponding to errors.

Acked-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Li Zhijian <zhijianx.li@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:36:17 -08:00
Li Zhijian
f71f22b67d refscale: Add missing '\n' to flush message
Add '\n' to the macros so that the message is flushed on each call.

Acked-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Li Zhijian <zhijianx.li@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:36:12 -08:00
Li Zhijian
4feeb9d5f8 refscale: Always log the error message
An OOM is a serious error that should be logged even in non-verbose runs.
This commit therefore adds an unconditional SCALEOUT_ERRSTRING() macro
and uses it instead of VERBOSE_SCALEOUT_ERRSTRING() when reporting an OOM.
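
A sketch of what such an unconditional single-statement macro could look
like (the exact format string is an assumption):

    #define SCALEOUT_ERRSTRING(s, x...) \
        pr_alert("%s" SCALE_FLAG "!!! " s "\n", scale_type, ## x)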

[ paulmck: Drop do-while from SCALEOUT_ERRSTRING() due to only single statement. ]

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:31:44 -08:00
Paul E. McKenney
9b073de1c7 rcu_tasks: Convert bespoke callback list to rcu_segcblist structure
This commit moves from a bespoke head and tail pointer in the
rcu_tasks_percpu structure to an rcu_segcblist structure, thus allowing
associating the grace-period sequence number with groups of callbacks.
This in turn will allow callbacks to be invoked independently on
different CPUs.

Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:26:57 -08:00
Paul E. McKenney
b14fb4fbbc rcu-tasks: Convert grace-period counter to grace-period sequence number
This commit moves the rcu_tasks structure's ->n_gps grace-period-counter
field to a ->task_gp_seq grace-period sequence number in order to enable
use of the rcu_segcblist structure for the callback lists.  This in turn
permits CPUs to lag behind the RCU Tasks grace-period sequence number
without suffering long-term slowdowns in callback invocation.

Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:26:57 -08:00
Paul E. McKenney
7a30871b6a rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic queue selection
This commit introduces a ->percpu_enqueue_shift field to the rcu_tasks
structure, and uses it to shift down the CPU number in order to
select a rcu_tasks_percpu structure.  This field is currently set to a
sufficiently large shift count to always select the CPU-0 instance of
the rcu_tasks_percpu structure, and later commits will adjust this.
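
A hedged sketch of the selection step this enables (field names such as
->rtpcpu are assumed from the surrounding commits, not quoted from the
actual code):

    rtpcp = per_cpu_ptr(rtp->rtpcpu,
                        smp_processor_id() >> READ_ONCE(rtp->percpu_enqueue_shift));
    /* With a sufficiently large initial shift, every CPU maps to index 0. */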

Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:26:57 -08:00
Paul E. McKenney
cafafd6776 rcu-tasks: Create per-CPU callback lists
Currently, RCU Tasks Trace (as well as the other two flavors of RCU Tasks)
use a single global callback list.  This works well and is simple, but
expected changes in workload will cause this list to become a bottleneck.
This commit therefore creates per-CPU callback lists for the various
flavors of RCU Tasks, but continues queueing on a single list, namely
that of CPU 0.  Later commits will dynamically vary the number of lists
in use to accommodate dynamic changes in workload.

Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Tested-by: kernel test robot <beibei.si@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:26:09 -08:00
Frederic Weisbecker
0598a4d442 rcu/nocb: Don't invoke local rcu core on callback overload from nocb kthread
rcu_core() tries to ensure that its self-invocation in case of callback
overload happens only in softirq/rcuc mode. Indeed, it doesn't make sense
to trigger the local RCU core from the nocb_cb kthread, since that kthread
can execute on a CPU different from the target rdp. Also, in case of
overload, the nocb_cb kthread simply iterates another loop of callback
processing.

However the "offloaded" check that aims at preventing misplaced
rcu_core() invocations is wrong. First of all that state is volatile
and second: softirq/rcuc can execute while the target rdp is offloaded.
As a result rcu_core() can be invoked on the wrong CPU while in the
process of (de-)offloading.

Fix that by moving the rcu_core() self-invocation into rcu_core() itself,
irrespective of the rdp offloaded state.

Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:24:44 -08:00
Frederic Weisbecker
a554ba2888 rcu: Apply callbacks processing time limit only on softirq
Time limit only makes sense when callbacks are serviced in softirq mode
because:

_ In case we need to get back to the scheduler,
  cond_resched_tasks_rcu_qs() is called after each callback.

_ In case some other softirq vector needs the CPU, the call to
  local_bh_enable() before cond_resched_tasks_rcu_qs() takes care of
  them via a call to do_softirq().

Therefore, make sure the time limit only applies to softirq mode.
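
A minimal sketch of the resulting check in the callback-invocation loop
(variable names follow the description above; the exact kernel code is not
reproduced here):

    if (in_serving_softirq() && unlikely(tlimit) && local_clock() >= tlimit)
        break;  /* rcuc and NOCB kthreads rely on cond_resched_tasks_rcu_qs() instead. */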

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:24:44 -08:00
Frederic Weisbecker
3e61e95e2d rcu: Fix callbacks processing time limit retaining cond_resched()
The callbacks processing time limit makes sure we are not exceeding a
given amount of time executing the queue.

However its "continue" clause bypasses the cond_resched() call on
rcuc and NOCB kthreads, delaying it until we reach the limit, which can
be very long...

Make sure the scheduler has a higher priority than the time limit.

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:24:44 -08:00
Frederic Weisbecker
78ad37a2c5 rcu/nocb: Limit number of softirq callbacks only on softirq
The current condition limiting the number of callbacks executed in a
row checks the offloaded state of the rdp. Not only is that state volatile
but the check is also misleading: rcu_core() may well be executing
callbacks concurrently with NOCB kthreads, and the offloaded state
would then be seen in both cases. As a result the limit would
spuriously stop applying on softirq while in the middle of the
(de-)offloading process.

Fix and clarify the condition with those constraints in mind (see the
sketch after this list):

_ If callbacks are processed either by rcuc or NOCB kthread, the call
  to cond_resched_tasks_rcu_qs() is enough to take care of the overload.

_ If instead callbacks are processed by softirqs:
  * If need_resched(), exit the callbacks processing
  * Otherwise if CPU is idle we can continue
  * Otherwise exit because a softirq shouldn't interrupt a task for too
    long nor deprive other pending softirq vectors of the CPU.
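
A hedged sketch of the clarified exit condition (count and bl are the
usual rcu_do_batch() locals; the exact kernel code may differ):

    if (in_serving_softirq()) {
        if (count >= bl && (need_resched() || !is_idle_task(current)))
            break;
    }
    /* rcuc and NOCB kthreads: cond_resched_tasks_rcu_qs() handles the overload. */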

Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:24:44 -08:00
Frederic Weisbecker
7b65dfa32d rcu/nocb: Use appropriate rcu_nocb_lock_irqsave()
Instead of hardcoding IRQ save and nocb lock, use the consolidated
API (and fix a comment as per Valentin Schneider's suggestion).

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:24:44 -08:00
Frederic Weisbecker
344e219d7d rcu/nocb: Check a stable offloaded state to manipulate qlen_last_fqs_check
It's not entirely obvious why rdp->qlen_last_fqs_check is updated before
processing the queue only on an offloaded rdp. There can be different
effects, either in favour of triggering the force-quiescent-state
path or not. For example:

1) If the number of callbacks has decreased since the last
   rdp->qlen_last_fqs_check update (because we recently called
   rcu_do_batch() and we executed below qhimark callbacks) and the number
   of processed callbacks on a subsequent do_batch() arranges for
   exceeding qhimark on non-offloaded but not on offloaded setup, then we
   may spare a later run to the force quiescent state
   slow path on __call_rcu_nocb_wake(), as compared to the non-offloaded
   counterpart scenario.

   Here is such an offloaded scenario instance:

    qhimark = 1000
    rdp->qlen_last_fqs_check = 3000
    rcu_segcblist_n_cbs(rdp) = 2000

    rcu_do_batch() {
        if (offloaded)
            rdp->qlen_last_fqs_check = rcu_segcblist_n_cbs(rdp) // 2000
        // run 1000 callbacks
        rcu_segcblist_n_cbs(rdp) = 1000
        // Not updating rdp->qlen_last_fqs_check
        if (count < rdp->qlen_last_fqs_check - qhimark)
            rdp->qlen_last_fqs_check = count;
    }

    call_rcu() * 1001 {
        __call_rcu_nocb_wake() {
            // not taking the fqs slowpath:
            // rcu_segcblist_n_cbs(rdp) == 2001
            // rdp->qlen_last_fqs_check == 2000
            // qhimark == 1000
            if (len > rdp->qlen_last_fqs_check + qhimark)
                ...
    }

    In the case of a non-offloaded scenario, rdp->qlen_last_fqs_check
    would be 1000 and the fqs slowpath would have executed.

2) If the number of callbacks has increased since the last
   rdp->qlen_last_fqs_check update (because we recently queued below
   qhimark callbacks) and the number of callbacks executed in rcu_do_batch()
   doesn't exceed qhimark for either offloaded or non-offloaded setup,
   then it's possible that the offloaded scenario later runs the force
   quiescent state slow path on __call_rcu_nocb_wake() while the
   non-offloaded one doesn't.

    qhimark = 1000
    rdp->qlen_last_fqs_check = 3000
    rcu_segcblist_n_cbs(rdp) = 2000

    rcu_do_batch() {
        if (offloaded)
            rdp->qlen_last_fqs_check = rcu_segcblist_n_cbs(rdp) // 2000
        // run 100 callbacks
        // concurrent queued 100
        rcu_segcblist_n_cbs(rdp) = 2000
        // Not updating rdp->qlen_last_fqs_check
        if (count < rdp->qlen_last_fqs_check - qhimark)
            rdp->qlen_last_fqs_check = count;
    }

    call_rcu() * 1001 {
        __call_rcu_nocb_wake() {
            // Taking the fqs slowpath:
            // rcu_segcblist_n_cbs(rdp) == 3001
            // rdp->qlen_last_fqs_check == 2000
            // qhimark == 1000
            if (len > rdp->qlen_last_fqs_check + qhimark)
                ...
    }

    In the case of a non-offloaded scenario, rdp->qlen_last_fqs_check
    would be 3000 and the fqs slowpath would have executed.

The reason for updating rdp->qlen_last_fqs_check when invoking callbacks
for offloaded CPUs is that there is usually no point in waking up either
the rcuog or rcuoc kthreads while in this state.  After all, both threads
are prohibited from indefinite sleeps.

The exception is when some huge number of callbacks are enqueued while
rcu_do_batch() is in the midst of invoking, in which case interrupting
the rcuog kthread's timed sleep might get more callbacks set up for the
next grace period.

Reported-and-tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Original-patch-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:24:44 -08:00
Frederic Weisbecker
b3bb02fe5a rcu/nocb: Make rcu_core() callbacks acceleration (de-)offloading safe
When callbacks are offloaded, the NOCB kthreads handle the callbacks
progression on behalf of rcu_core().

However, during the (de-)offloading process, the kthread may not be
entirely up to the task. As a result, some callbacks' grace-period
sequence numbers may remain stale for a while because rcu_core() won't
take care of them either.

Fix this by forcing callback acceleration from rcu_core() as long
as the offloading process isn't complete.

Reported-and-tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:24:44 -08:00
Thomas Gleixner
24ee940d89 rcu/nocb: Make rcu_core() callbacks acceleration preempt-safe
While reporting a quiescent state for a given CPU, rcu_core() takes
advantage of the freshly loaded grace period sequence number and the
locked rnp to accelerate the callbacks whose sequence number have been
assigned a stale value.

This action is only necessary when the rdp isn't offloaded, otherwise
the NOCB kthreads already take care of the callbacks progression.

However the check for the offloaded state is volatile because it is
performed outside the IRQs disabled section. It's possible for the
offloading process to preempt rcu_core() at that point on PREEMPT_RT.

This is dangerous because rcu_core() may end up accelerating callbacks
concurrently with NOCB kthreads without appropriate locking.

Fix this by moving the offloaded check inside the rnp locking section.

Reported-and-tested-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:24:44 -08:00
Frederic Weisbecker
fbb94cbd70 rcu/nocb: Invoke rcu_core() at the start of deoffloading
On PREEMPT_RT, if rcu_core() is preempted by the de-offloading process,
some work, such as callbacks acceleration and invocation, may be left
unattended due to the volatile checks on the offloaded state.

In the worst case, this work is postponed until the next rcu_pending()
check, which can take up to a jiffy to arrive, and that can be a problem
in case of callback flooding.

Solve that by invoking rcu_core() early in the de-offloading process.
This way, any work dismissed by an ongoing rcu_core() call fooled by
a preempting de-offloading process will be caught up by a subsequent
invocation of rcu_core(), this time fully aware of the de-offloading state.

Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:24:44 -08:00
Frederic Weisbecker
213d56bf33 rcu/nocb: Prepare state machine for a new step
Currently SEGCBLIST_SOFTIRQ_ONLY is a bit of an exception among the
segcblist flags because it is an exclusive state that doesn't mix up
with the other flags. Remove it in favour of:

_ A flag specifying that rcu_core() needs to perform callbacks execution
  and acceleration

and

_ A flag specifying we want the nocb lock to be held in any needed
  circumstances

This clarifies the code and is more flexible: it allows a state
where rcu_core() runs with locking while offloading hasn't started yet.
This is a necessary step in preparing to trigger rcu_core() at the
very beginning of the de-offloading process, so that rcu_core() won't
dismiss work while being preempted by the de-offloading process, at
least not without a pending subsequent rcu_core() invocation that will
quickly catch up.

Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:24:44 -08:00
Frederic Weisbecker
118e0d4a1b rcu/nocb: Make local rcu_nocb_lock_irqsave() safe against concurrent deoffloading
rcu_nocb_lock_irqsave() can be preempted between the call to
rcu_segcblist_is_offloaded() and the actual locking. This matters now
that rcu_core() is preemptible on PREEMPT_RT and the (de-)offloading
process can interrupt the softirq or the rcuc kthread.

As a result, we may locklessly call into code that requires nocb locking.
In practice, this is a problem while accelerating callbacks from rcu_core().

Simply disabling interrupts before (instead of after) checking the NOCB
offload state fixes the issue.
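
In sketch form, the fix amounts to the following ordering (a simplified
rendering, not necessarily the exact macro body):

    #define rcu_nocb_lock_irqsave(rdp, flags)                       \
    do {                                                            \
            local_irq_save(flags);                                  \
            if (rcu_segcblist_is_offloaded(&(rdp)->cblist))         \
                    raw_spin_lock(&(rdp)->nocb_lock);               \
    } while (0)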

Reported-and-tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:24:44 -08:00
Paul E. McKenney
614ddad17f rcu: Tighten rcu_advance_cbs_nowake() checks
Currently, rcu_advance_cbs_nowake() checks that a grace period is in
progress, however, that grace period could end just after the check.
This commit rechecks that a grace period is still in progress while
holding the rcu_node structure's lock.  The grace period cannot end while
the current CPU's rcu_node structure's ->lock is held, thus avoiding
false positives from the WARN_ON_ONCE().

As Daniel Vacek noted, it is not necessary for the rcu_node structure
to have a CPU that has not yet passed through its quiescent state.
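
A rough sketch of the recheck-under-lock pattern described above
(simplified; the actual function also handles the trylock-failure path):

    raw_spin_lock_rcu_node(rnp);
    /* The grace period cannot end while this rcu_node's ->lock is held. */
    if (rcu_seq_state(rcu_seq_current(&rnp->gp_seq)))
        WARN_ON_ONCE(rcu_advance_cbs(rnp, rdp));
    raw_spin_unlock_rcu_node(rnp);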

Tested-by: Guillaume Morin <guillaume@morinfr.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:23:03 -08:00
Frederic Weisbecker
81f6d49cce rcu/exp: Mark current CPU as exp-QS in IPI loop second pass
Expedited RCU grace periods invoke sync_rcu_exp_select_node_cpus(), which
takes two passes over the leaf rcu_node structure's CPUs.  The first
pass gathers up the current CPU and CPUs that are in dynticks idle mode.
The workqueue will report a quiescent state on their behalf later.
The second pass sends IPIs to the rest of the CPUs, but excludes the
current CPU, incorrectly assuming it has been included in the first
pass's list of CPUs.

Unfortunately the current CPU may have changed between the first and
second pass, due to the fact that the various rcu_node structures'
->lock fields have been dropped, thus momentarily enabling preemption.
This means that if the second pass's CPU was not on the first pass's
list, it will be ignored completely.  There will be no IPI sent to
it, and there will be no reporting of quiescent states on its behalf.
Unfortunately, the expedited grace period will nevertheless be waiting
for that CPU to report a quiescent state, but with that CPU having no
reason to believe that such a report is needed.

The result will be an expedited grace period stall.

Fix this by no longer excluding the current CPU from consideration during
the second pass.

Fixes: b9ad4d6ed1 ("rcu: Avoid self-IPI in sync_rcu_exp_select_node_cpus()")
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:22:22 -08:00
Paul E. McKenney
790da24897 rcu: Make idle entry report expedited quiescent states
In non-preemptible kernels, an unfortunately timed expedited grace period
can result in the rcu_exp_handler() IPI handler setting the rcu_data
structure's cpu_no_qs.b.exp field just as the target CPU enters idle.
There are situations in which this field will not be checked until after
that CPU exits idle.  The resulting grace-period latency does not qualify
as "expedited".

This commit therefore checks this field upon non-preemptible idle entry in
the rcu_preempt_deferred_qs() function.  It also qualifies the rcu_core()
preempt_count() check with IS_ENABLED(CONFIG_PREEMPT_COUNT) to prevent
false-positive quiescent states from count-free kernels.

Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:22:22 -08:00
Paul E. McKenney
147f04b14a rcu: Prevent expedited GP from enabling tick on offline CPU
If an RCU expedited grace period starts just when a CPU is in the process
of going offline, so that the outgoing CPU has completed its pass through
stop-machine but has not yet completed its final dive into the idle loop,
RCU will attempt to enable that CPU's scheduling-clock tick via a call
to tick_dep_set_cpu().  For this to happen, that CPU has to have been
online when the expedited grace period completed its CPU-selection phase.

This is pointless:  The outgoing CPU has interrupts disabled, so it cannot
take a scheduling-clock tick anyway.  In addition, the tick_dep_set_cpu()
function's eventual call to irq_work_queue_on() will splat as follows:

smpboot: CPU 1 is now offline
WARNING: CPU: 6 PID: 124 at kernel/irq_work.c:95
+irq_work_queue_on+0x57/0x60
Modules linked in:
CPU: 6 PID: 124 Comm: kworker/6:2 Not tainted 5.15.0-rc1+ #3
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
+rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
Workqueue: rcu_gp wait_rcu_exp_gp
RIP: 0010:irq_work_queue_on+0x57/0x60
Code: 8b 05 1d c7 ea 62 a9 00 00 f0 00 75 21 4c 89 ce 44 89 c7 e8
+9b 37 fa ff ba 01 00 00 00 89 d0 c3 4c 89 cf e8 3b ff ff ff eb ee <0f> 0b eb b7
+0f 0b eb db 90 48 c7 c0 98 2a 02 00 65 48 03 05 91
 6f
RSP: 0000:ffffb12cc038fe48 EFLAGS: 00010282
RAX: 0000000000000001 RBX: 0000000000005208 RCX: 0000000000000020
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff9ad01f45a680
RBP: 000000000004c990 R08: 0000000000000001 R09: ffff9ad01f45a680
R10: ffffb12cc0317db0 R11: 0000000000000001 R12: 00000000fffecee8
R13: 0000000000000001 R14: 0000000000026980 R15: ffffffff9e53ae00
FS:  0000000000000000(0000) GS:ffff9ad01f580000(0000)
+knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000de0c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 tick_nohz_dep_set_cpu+0x59/0x70
 rcu_exp_wait_wake+0x54e/0x870
 ? sync_rcu_exp_select_cpus+0x1fc/0x390
 process_one_work+0x1ef/0x3c0
 ? process_one_work+0x3c0/0x3c0
 worker_thread+0x28/0x3c0
 ? process_one_work+0x3c0/0x3c0
 kthread+0x115/0x140
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
---[ end trace c5bf75eb6aa80bc6 ]---

This commit therefore avoids invoking tick_dep_set_cpu() on offlined
CPUs to limit both futility and false-positive splats.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:22:22 -08:00
Paul E. McKenney
5401cc5264 rcu: Mark sync_sched_exp_online_cleanup() ->cpu_no_qs.b.exp load
The sync_sched_exp_online_cleanup() is called from rcutree_online_cpu(),
which can be invoked with interrupts enabled.  This means that
the ->cpu_no_qs.b.exp field is subject to data races from the
rcu_exp_handler() IPI handler, so this commit marks the load from
that field.
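
In sketch form, the marking simply annotates the racy read (illustrative,
not the full function):

    if (!READ_ONCE(rdp->cpu_no_qs.b.exp))
        return;  /* Nothing to clean up; field may be written by rcu_exp_handler(). */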

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:22:21 -08:00
Frederic Weisbecker
6120b72e25 rcu: Remove rcu_data.exp_deferred_qs and convert to rcu_data.cpu_no_qs.b.exp
Having two fields for the same purpose with subtle differences on
different RCU flavours is confusing, especially when both fields always
exist on both RCU flavours.

Fortunately, it is now safe for preemptible RCU to rely on the rcu_data
structure's ->cpu_no_qs.b.exp field, just like non-preemptible RCU.
This commit therefore removes the ad-hoc ->exp_deferred_qs field.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:22:21 -08:00
Frederic Weisbecker
6e16b0f7ba rcu: Move rcu_data.cpu_no_qs.b.exp reset to rcu_export_exp_rdp()
On non-preemptible RCU, move clearing of the rcu_data structure's
->cpu_no_qs.b.exp field to the actual expedited quiescent state report
function, matching how preemptible RCU handles the ->exp_deferred_qs field.

This prepares for removing ->exp_deferred_qs in favor of ->cpu_no_qs.b.exp
for both preemptible and non-preemptible RCU.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:22:21 -08:00
Frederic Weisbecker
a438265948 rcu: Ignore rdp.cpu_no_qs.b.exp on preemptible RCU's rcu_qs()
Preemptible RCU does not use the rcu_data structure's ->cpu_no_qs.b.exp,
instead using a separate ->exp_deferred_qs field to record the need for
an expedited quiescent state.

In fact ->cpu_no_qs.b.exp should never be set in preemptible RCU because
preemptible RCU's expedited grace periods use other mechanisms to record
quiescent states.

This commit therefore removes the implicit rcu_qs() reference to
->cpu_no_qs.b.exp in favor of a direct reference to ->cpu_no_qs.b.norm.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07 16:20:59 -08:00
Li Zhijian
9880eb878c refscale: Prevent buffer to pr_alert() being too long
0Day/LKP observed that the refscale results fail to complete when larger
values of nrun (such as 300) are specified.  The problem is that printk()
can accept at most a 1024-byte buffer.  This commit therefore prints
the buffer whenever its length exceeds 800 bytes.
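
A minimal sketch of the early-flush logic (buf is the accumulation buffer;
the 800-byte threshold comes from the text above):

    if (strlen(buf) >= 800) {
        pr_alert("%s", buf);
        buf[0] = '\0';  /* Start accumulating the next chunk. */
    }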

CC: Philip Li <philip.li@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30 17:29:50 -08:00
Li Zhijian
c30c876312 refscale: Simplify the errexit checkpoint
There is only the one OOM error case in main_func(), so this commit
eliminates the errexit local variable in favor of a branch to cleanup
code.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30 17:29:50 -08:00
Paul E. McKenney
340170fef0 rcutorture: Suppress pi-lock-across read-unlock testing for Tiny SRCU
Because Tiny srcu_read_unlock() directly calls swake_up_one(), lockdep
complains when a pi lock is held across that srcu_read_unlock().
Although this is a lockdep false positive (there is no other CPU to
complete the deadlock cycle), lockdep is what it is at the moment.
This commit therefore prevents rcutorture from holding pi lock across
a Tiny srcu_read_unlock().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30 17:29:50 -08:00
Paul E. McKenney
1c3d53986f rcutorture: More thoroughly test nested readers
Currently, nested readers occur only when a timer handler interrupts a
reader.  This is rare, and is thus insufficient testing of the transition
between nesting levels.  This commit therefore causes rcutorture nested
readers to be the rule rather than the exception.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30 17:29:50 -08:00
Paul E. McKenney
902d82e629 rcutorture: Sanitize RCUTORTURE_RDR_MASK
RCUTORTURE_RDR_MASK is currently not the bit indicated by
RCUTORTURE_RDR_SHIFT, but is instead all the bits less significant than
that one.  This is an accident waiting to happen, so this commit makes
RCUTORTURE_RDR_MASK be that one bit and adjusts uses accordingly.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30 17:29:49 -08:00
Paul E. McKenney
f5dbc594b5 rcu-tasks: Don't remove tasks with pending IPIs from holdout list
Currently, the check_all_holdout_tasks_trace() function removes all tasks
marked with ->trc_reader_checked from the holdout list, including those
with IPIs pending.  This means that the IPI handler might arrive at
a task that has already been removed from the list, which is at best
an accident waiting to happen.

This commit therefore avoids removing tasks with IPIs pending from
the holdout list.  This in turn means that the "if" condition in the
for_each_online_cpu() loop in rcu_tasks_trace_postgp() should always
evaluate to false, so a WARN_ON_ONCE() is added to check that.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30 17:29:06 -08:00
Paul E. McKenney
1f8da406a9 srcu: Prevent redundant __srcu_read_unlock() wakeup
Tiny SRCU readers can appear at task level, but also in interrupt and
softirq handlers.  Because Tiny SRCU is selected only in kernels built
with CONFIG_SMP=n and CONFIG_PREEMPTION=n, it is not possible for a grace
period to start while there is a non-task-level SRCU reader executing.
This means that it does not make sense for __srcu_read_unlock() to awaken
the Tiny SRCU grace period, because that can only happen when the grace
period is waiting for one value of ->srcu_idx and __srcu_read_unlock()
is ending the last reader for some other value of ->srcu_idx.  After all,
any such wakeup will be redundant.

Worse yet, in some cases, such wakeups generate lockdep splats:

	======================================================
	WARNING: possible circular locking dependency detected
	5.15.0-rc1+ #3758 Not tainted
	------------------------------------------------------
	rcu_torture_rea/53 is trying to acquire lock:
	ffffffff9514e6a8 (srcu_ctl.srcu_wq.lock){..-.}-{2:2}, at:
	xa/0x30

	but task is already holding lock:
	ffff95c642479d80 (&p->pi_lock){-.-.}-{2:2}, at:
	_extend+0x370/0x400

	which lock already depends on the new lock.

	the existing dependency chain (in reverse order) is:

	-> #1 (&p->pi_lock){-.-.}-{2:2}:
	       _raw_spin_lock_irqsave+0x2f/0x50
	       try_to_wake_up+0x50/0x580
	       swake_up_locked.part.7+0xe/0x30
	       swake_up_one+0x22/0x30
	       rcutorture_one_extend+0x1b6/0x400
	       rcu_torture_one_read+0x290/0x5d0
	       rcu_torture_timer+0x1a/0x70
	       call_timer_fn+0xa6/0x230
	       run_timer_softirq+0x493/0x4c0
	       __do_softirq+0xc0/0x371
	       irq_exit+0x73/0x90
	       sysvec_apic_timer_interrupt+0x63/0x80
	       asm_sysvec_apic_timer_interrupt+0x12/0x20
	       default_idle+0xb/0x10
	       default_idle_call+0x5e/0x170
	       do_idle+0x18a/0x1f0
	       cpu_startup_entry+0xa/0x10
	       start_kernel+0x678/0x69f
	       secondary_startup_64_no_verify+0xc2/0xcb

	-> #0 (srcu_ctl.srcu_wq.lock){..-.}-{2:2}:
	       __lock_acquire+0x130c/0x2440
	       lock_acquire+0xc2/0x270
	       _raw_spin_lock_irqsave+0x2f/0x50
	       swake_up_one+0xa/0x30
	       rcutorture_one_extend+0x387/0x400
	       rcu_torture_one_read+0x290/0x5d0
	       rcu_torture_reader+0xac/0x200
	       kthread+0x12d/0x150
	       ret_from_fork+0x22/0x30

	other info that might help us debug this:

	 Possible unsafe locking scenario:

	       CPU0                    CPU1
	       ----                    ----
	  lock(&p->pi_lock);
				       lock(srcu_ctl.srcu_wq.lock);
				       lock(&p->pi_lock);
	  lock(srcu_ctl.srcu_wq.lock);

	 *** DEADLOCK ***

	1 lock held by rcu_torture_rea/53:
	 #0: ffff95c642479d80 (&p->pi_lock){-.-.}-{2:2}, at:
	_extend+0x370/0x400

	stack backtrace:
	CPU: 0 PID: 53 Comm: rcu_torture_rea Not tainted 5.15.0-rc1+

	Hardware name: Red Hat KVM/RHEL-AV, BIOS
	e_el8.5.0+746+bbd5d70c 04/01/2014
	Call Trace:
	 check_noncircular+0xfe/0x110
	 ? find_held_lock+0x2d/0x90
	 __lock_acquire+0x130c/0x2440
	 lock_acquire+0xc2/0x270
	 ? swake_up_one+0xa/0x30
	 ? find_held_lock+0x72/0x90
	 _raw_spin_lock_irqsave+0x2f/0x50
	 ? swake_up_one+0xa/0x30
	 swake_up_one+0xa/0x30
	 rcutorture_one_extend+0x387/0x400
	 rcu_torture_one_read+0x290/0x5d0
	 rcu_torture_reader+0xac/0x200
	 ? rcutorture_oom_notify+0xf0/0xf0
	 ? __kthread_parkme+0x61/0x90
	 ? rcu_torture_one_read+0x5d0/0x5d0
	 kthread+0x12d/0x150
	 ? set_kthread_struct+0x40/0x40
	 ret_from_fork+0x22/0x30

This is a false positive because there is only one CPU, and both locks
are raw (non-preemptible) spinlocks.  However, it is worthwhile getting
rid of the redundant wakeup, which has the side effect of breaking
the theoretical deadlock cycle.  This commit therefore eliminates the
redundant wakeups.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30 17:28:16 -08:00
Jun Miao
300c0c5e72 rcu: Avoid alloc_pages() when recording stack
The default kasan_record_aux_stack() calls stack_depot_save() with GFP_NOWAIT,
which in turn can then call alloc_pages(GFP_NOWAIT, ...).  In general, however,
it is not possible to use either GFP_ATOMIC or GFP_NOWAIT in certain
non-preemptive contexts and on RT kernels, including under raw_spin_locks
(see gfp.h and ab00db216c).  Fix this by instructing stackdepot not to expand
stack storage via alloc_pages() in case it runs out, by using
kasan_record_aux_stack_noalloc().
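
The substitution at the call_rcu()/kvfree_call_rcu() recording sites is
then of this form (shown schematically; head is the callback pointer):

    /* Before: could expand stack storage via alloc_pages(GFP_NOWAIT, ...).
     * After: never allocates, at the cost of possibly dropping the aux stack. */
    kasan_record_aux_stack_noalloc(head);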

Jianwei Hu reported:
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:969
in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 15319, name: python3
INFO: lockdep is turned off.
irq event stamp: 0
  hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  hardirqs last disabled at (0): [<ffffffff856c8b13>] copy_process+0xaf3/0x2590
  softirqs last  enabled at (0): [<ffffffff856c8b13>] copy_process+0xaf3/0x2590
  softirqs last disabled at (0): [<0000000000000000>] 0x0
  CPU: 6 PID: 15319 Comm: python3 Tainted: G        W  O 5.15-rc7-preempt-rt #1
  Hardware name: Supermicro SYS-E300-9A-8C/A2SDi-8C-HLN4F, BIOS 1.1b 12/17/2018
  Call Trace:
    show_stack+0x52/0x58
    dump_stack+0xa1/0xd6
    ___might_sleep.cold+0x11c/0x12d
    rt_spin_lock+0x3f/0xc0
    rmqueue+0x100/0x1460
    rmqueue+0x100/0x1460
    mark_usage+0x1a0/0x1a0
    ftrace_graph_ret_addr+0x2a/0xb0
    rmqueue_pcplist.constprop.0+0x6a0/0x6a0
     __kasan_check_read+0x11/0x20
     __zone_watermark_ok+0x114/0x270
     get_page_from_freelist+0x148/0x630
     is_module_text_address+0x32/0xa0
     __alloc_pages_nodemask+0x2f6/0x790
     __alloc_pages_slowpath.constprop.0+0x12d0/0x12d0
     create_prof_cpu_mask+0x30/0x30
     alloc_pages_current+0xb1/0x150
     stack_depot_save+0x39f/0x490
     kasan_save_stack+0x42/0x50
     kasan_save_stack+0x23/0x50
     kasan_record_aux_stack+0xa9/0xc0
     __call_rcu+0xff/0x9c0
     call_rcu+0xe/0x10
     put_object+0x53/0x70
     __delete_object+0x7b/0x90
     kmemleak_free+0x46/0x70
     slab_free_freelist_hook+0xb4/0x160
     kfree+0xe5/0x420
     kfree_const+0x17/0x30
     kobject_cleanup+0xaa/0x230
     kobject_put+0x76/0x90
     netdev_queue_update_kobjects+0x17d/0x1f0
     ... ...
     ksys_write+0xd9/0x180
     __x64_sys_write+0x42/0x50
     do_syscall_64+0x38/0x50
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

Links: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/include/linux/kasan.h?id=7cb3007ce2da27ec02a1a3211941e7fe6875b642
Fixes: 84109ab585 ("rcu: Record kvfree_call_rcu() call stack for KASAN")
Fixes: 26e760c9a7 ("rcu: kasan: record and print call_rcu() call stack")
Reported-by: Jianwei Hu <jianwei.hu@windriver.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Marco Elver <elver@google.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Jun Miao <jun.miao@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30 17:25:20 -08:00
Zqiang
c2cf0767e9 rcu: Avoid running boost kthreads on isolated CPUs
When the boost kthreads are created on systems with nohz_full CPUs,
the cpus_allowed_ptr is set to housekeeping_cpumask(HK_FLAG_KTHREAD).
However, when the rcu_boost_kthread_setaffinity() is called, the original
affinity will be changed and these kthreads can subsequently run on
nohz_full CPUs.  This commit makes rcu_boost_kthread_setaffinity()
restrict these boost kthreads to housekeeping CPUs.
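
A hedged sketch of the adjustment in rcu_boost_kthread_setaffinity()
(cm is the cpumask being computed; the exact code may differ slightly):

    cpumask_and(cm, cm, housekeeping_cpumask(HK_FLAG_KTHREAD));
    if (cpumask_empty(cm))
        cpumask_copy(cm, housekeeping_cpumask(HK_FLAG_KTHREAD));
    set_cpus_allowed_ptr(t, cm);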

Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30 17:25:20 -08:00
Zhouyi Zhou
17ea371882 rcu: Improve tree_plugin.h comments and add code cleanups
This commit cleans up some comments and code in kernel/rcu/tree_plugin.h.

Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30 17:25:20 -08:00
Changbin Du
2407a64f80 rcu: in_irq() cleanup
This commit replaces the obsolete and ambiguous macro in_irq() with its
shiny new in_hardirq() equivalent.

Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30 17:25:20 -08:00
Paul E. McKenney
bc849e9192 rcu: Move rcu_needs_cpu() to tree.c
Now that RCU_FAST_NO_HZ is no more, there is but one implementation of
the rcu_needs_cpu() function.  This commit therefore moves this function
from kernel/rcu/tree_plugin.c to kernel/rcu/tree.c.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30 17:24:47 -08:00
Paul E. McKenney
e2c73a6860 rcu: Remove the RCU_FAST_NO_HZ Kconfig option
All of the uses of CONFIG_RCU_FAST_NO_HZ=y that I have seen involve
systems with RCU callbacks offloaded.  In this situation, all that this
Kconfig option does is slow down idle entry/exit with an additional
always-taken early exit.  If this is the only use case, then this
Kconfig option is nothing but an attractive nuisance that needs to go away.

This commit therefore removes the RCU_FAST_NO_HZ Kconfig option.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30 17:24:47 -08:00
Linus Torvalds
6fedc28076 RCU pull request for v5.16
This pull request contains the following branches:
 
 fixes.2021.10.07a: Miscellaneous fixes.
 
 scftorture.2021.09.16a: smp_call_function torture-test updates, most
 	notably better checking of module parameters.
 
 tasks.2021.09.15a: Tasks-trace RCU updates that fix a number of rare
 	but important race-condition bugs.
 
 torture.2021.09.13b: Other torture-test updates, most notably
 	better checking of module parameters.  In addition, rcutorture
 	may now be run on CONFIG_PREEMPT_RT kernels.
 
 torturescript.2021.09.16a: Torture-test scripting updates, most notably
 	specifying the new CONFIG_KCSAN_STRICT kconfig option rather
 	than maintaining an ever-changing list of individual KCSAN
 	kconfig options.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmGAVMMTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jGJBD/9ld6USOpedBLAbTYVMQYvIKoSqqDIG
 74ZFhKvZ5I6Y8OZAGxXjb5U06rh4V2brlTN7IJ7XLEA1t401ENffsGeQSCxEmpEf
 PqQN04dbmVvaWjD4jiLZCcl3oDp+w1gIKwmX6wh0Weogr3KZWu5aNvD5tl9qIz4a
 uPC1JqTBxf7WDrLhqNxG5N4MXs27+KvukCd9wftk3NTzRJ9tyLM/YNGOVArM8rW2
 QpEh8n6veB5dEoXBxmRHzuxYHN1k0Fhkbm3irMjcI0T5wj8TDod89zbg9mdFXMIj
 AjZ9CGpIBa4frThdu654ZNuEQHDCsPWtMi925xNOWxh5lkPGjeWnwYpcRrwfI2pj
 op0xVlur+Nam5CT/AJNT9+KogpZthAWXvwqCs5GbYNSU30Rlw99bw1vyAsJUD+af
 Mv08/z4o7Kuhr4cw2vkd2UfF9zuIQsJ1jWCIjMxfj4ctBnIpedrEnEISp8Y61fWk
 w9vXgCRhZCSkxoURoNss+nAUsiePUafptsvqKLu6Z53ufPA5yL0rVS778xq8vurP
 Xyd34TVlQ94ydZDC5pkSNpri1HGV1U7pztFwey5GloE66iV+7TSQCfMhzLd4CM0K
 wW96wimHrDtIxD6LedCZOHLHkS9AJd7F9uSoNodKspTH0tJowQztrzPW1eZifDE3
 iJP8xcJ+vL67Og==
 =nmaP
 -----END PGP SIGNATURE-----

Merge tag 'rcu.2021.11.01a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU updates from Paul McKenney:

 - Miscellaneous fixes

 - Torture-test updates for smp_call_function(), most notably improved
   checking of module parameters.

 - Tasks-trace RCU updates that fix a number of rare but important
   race-condition bugs.

 - Other torture-test updates, most notably better checking of module
   parameters. In addition, rcutorture may once again be run on
   CONFIG_PREEMPT_RT kernels.

 - Torture-test scripting updates, most notably specifying the new
   CONFIG_KCSAN_STRICT kconfig option rather than maintaining an
   ever-changing list of individual KCSAN kconfig options.

* tag 'rcu.2021.11.01a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (46 commits)
  rcu: Fix rcu_dynticks_curr_cpu_in_eqs() vs noinstr
  rcu: Always inline rcu_dynticks_task*_{enter,exit}()
  torture: Make kvm-remote.sh print size of downloaded tarball
  torture: Allot 1G of memory for scftorture runs
  tools/rcu: Add an extract-stall script
  scftorture: Warn on individual scf_torture_init() error conditions
  scftorture: Count reschedule IPIs
  scftorture: Account for weight_resched when checking for all zeroes
  scftorture: Shut down if nonsensical arguments given
  scftorture: Allow zero weight to exclude an smp_call_function*() category
  rcu: Avoid unneeded function call in rcu_read_unlock()
  rcu-tasks: Update comments to cond_resched_tasks_rcu_qs()
  rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader
  rcu-tasks: Fix read-side primitives comment for call_rcu_tasks_trace
  rcu-tasks: Clarify read side section info for rcu_tasks_rude GP primitives
  rcu-tasks: Correct comparisons for CPU numbers in show_stalled_task_trace
  rcu-tasks: Correct firstreport usage in check_all_holdout_tasks_trace
  rcu-tasks: Fix s/rcu_add_holdout/trc_add_holdout/ typo in comment
  rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop
  rcu-tasks: Fix s/instruction/instructions/ typo in comment
  ...
2021-11-01 20:25:38 -07:00
Linus Torvalds
9a7e0a90a4 Scheduler updates:
- Revert the printk format based wchan() symbol resolution as it can leak
    the raw value in case that the symbol is not resolvable.
 
  - Make wchan() more robust and work with all kinds of unwinders by
    enforcing that the task stays blocked while unwinding is in progress.
 
  - Prevent sched_fork() from accessing an invalid sched_task_group
 
  - Improve asymmetric packing logic
 
  - Extend scheduler statistics to RT and DL scheduling classes and add
    statistics for bandwidth burst to the SCHED_FAIR class.
 
  - Properly account SCHED_IDLE entities
 
  - Prevent a potential deadlock when initial priority is assigned to a
    newly created kthread. A recent change to plug a race between cpuset and
    __sched_setscheduler() introduced a new lock dependency which is now
    triggered. Break the lock dependency chain by moving the priority
    assignment to the thread function.
 
  - Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
 
  - Improve idle balancing in general and especially for NOHZ enabled
    systems.
 
  - Provide proper interfaces for live patching so it does not have to
    fiddle with scheduler internals.
 
  - Add cluster aware scheduling support.
 
  - A small set of tweaks for RT (irqwork, wait_task_inactive(), various
    scheduler options and delaying mmdrop)
 
  - The usual small tweaks and improvements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF/OUkTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoR/5D/9ikdGNpKg9osNqJ3GjAmxsK6kVkB29
 iFe2k8pIpWDToWQf/wQRGih4Yj3Cl49QSnZcPIibh2/12EB1qrrW6iSPJkInz8Ec
 /1LS5/Vewn2OyoxyXZjdvGC5gTXEodSbIazASvX7nvdMeI4gsAsL5etzrMJirT/t
 aymqvr7zovvywrwMTQJrGjUMo9l4ewE8tafMNNhRu1BHU1U4ojM9yvThyRAAcmp7
 3Xy49A+Yq3IgrvYI4u8FMK5Zh08KaxSFjiLhePGm/bF+wSfYmWop2TP1jY05W2Uo
 ti8hfbJMUoFRYuMxAiEldkItnc0wV4M9PtWZZ/x+B71bs65Y4Zjt9cW+rxJv2+m1
 vzV31EsQwGnOti072dzWN4c/cZqngVXAjaNtErvDwJUr+Tw1ayv9KUvuodMQqZY6
 mu68bFUO2kV9EMe1CBOv51Uy1RGHyLj3rlNqrkw+Xp5ISE9Ad2vhUEiRp5bQx5Ci
 V/XFhGZkGUluh0vccrdFlNYZwhj8cZEzkOPCnPSeZ+bq8SyZE6xuHH/lTP1CJCOy
 s800rW1huM+kgV+zRN8adDkGXibAk9N3RtVGnQXmuEy8gB9LZmQg+JeM2wsc9B+6
 i0gdqZnsjNAfoK+BBAG4holxptSL8/eOJsFH8ZNIoxQ+iqooyPx9tFX7yXnRTBQj
 d2qWG7UvoseT+g==
 =fgtS
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Thomas Gleixner:

 - Revert the printk format based wchan() symbol resolution as it can
   leak the raw value in case that the symbol is not resolvable.

 - Make wchan() more robust and work with all kinds of unwinders by
   enforcing that the task stays blocked while unwinding is in progress.

 - Prevent sched_fork() from accessing an invalid sched_task_group

 - Improve asymmetric packing logic

 - Extend scheduler statistics to RT and DL scheduling classes and add
   statistics for bandwidth burst to the SCHED_FAIR class.

 - Properly account SCHED_IDLE entities

 - Prevent a potential deadlock when initial priority is assigned to a
   newly created kthread. A recent change to plug a race between cpuset
   and __sched_setscheduler() introduced a new lock dependency which is
   now triggered. Break the lock dependency chain by moving the priority
   assignment to the thread function.

 - Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.

 - Improve idle balancing in general and especially for NOHZ enabled
   systems.

 - Provide proper interfaces for live patching so it does not have to
   fiddle with scheduler internals.

 - Add cluster aware scheduling support.

 - A small set of tweaks for RT (irqwork, wait_task_inactive(), various
   scheduler options and delaying mmdrop)

 - The usual small tweaks and improvements all over the place

* tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
  sched/fair: Cleanup newidle_balance
  sched/fair: Remove sysctl_sched_migration_cost condition
  sched/fair: Wait before decaying max_newidle_lb_cost
  sched/fair: Skip update_blocked_averages if we are defering load balance
  sched/fair: Account update_blocked_averages in newidle_balance cost
  x86: Fix __get_wchan() for !STACKTRACE
  sched,x86: Fix L2 cache mask
  sched/core: Remove rq_relock()
  sched: Improve wake_up_all_idle_cpus() take #2
  irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT
  irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
  irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.
  sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
  sched: Add cluster scheduler level for x86
  sched: Add cluster scheduler level in core and related Kconfig for ARM64
  topology: Represent clusters of CPUs within a die
  sched: Disable -Wunused-but-set-variable
  sched: Add wrapper for get_wchan() to keep task blocked
  x86: Fix get_wchan() to support the ORC unwinder
  proc: Use task_is_running() for wchan in /proc/$pid/stat
  ...
2021-11-01 13:48:52 -07:00
Paul E. McKenney
dd1277d2ad Merge branches 'fixes.2021.10.07a', 'scftorture.2021.09.16a', 'tasks.2021.09.15a', 'torture.2021.09.13b' and 'torturescript.2021.09.16a' into HEAD
fixes.2021.10.07a: Miscellaneous fixes.
scftorture.2021.09.16a: smp_call_function torture-test updates.
tasks.2021.09.15a: Tasks-trace RCU updates.
torture.2021.09.13b: Other torture-test updates.
torturescript.2021.09.16a: Torture-test scripting updates.
2021-10-07 10:02:17 -07:00
Peter Zijlstra
74aece72f9 rcu: Fix rcu_dynticks_curr_cpu_in_eqs() vs noinstr
vmlinux.o: warning: objtool: rcu_nmi_enter()+0x36: call to __kasan_check_read() leaves .noinstr.text section

noinstr code cannot use atomic_*() functions because they are explicitly
annotated; use arch_atomic_*() instead.
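
A minimal, illustrative sketch of the pattern (not the actual patch; the
function name is made up):

	noinstr void example_eqs_snapshot(atomic_t *dynticks)
	{
		/* atomic_read() is explicitly instrumented and would pull a
		 * call such as __kasan_check_read() into .noinstr.text, so
		 * the raw arch_atomic_read() variant is used instead. */
		int snap = arch_atomic_read(dynticks);

		(void)snap;
	}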

Fixes: 2be57f7328 ("rcu: Weaken ->dynticks accesses and updates")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-10-07 10:00:07 -07:00
Peter Zijlstra
7663ad9a5d rcu: Always inline rcu_dynticks_task*_{enter,exit}()
RCU managed to grow a few noinstr violations:

  vmlinux.o: warning: objtool: rcu_dynticks_eqs_enter()+0x0: call to rcu_dynticks_task_trace_enter() leaves .noinstr.text section
  vmlinux.o: warning: objtool: rcu_dynticks_eqs_exit()+0xe: call to rcu_dynticks_task_trace_exit() leaves .noinstr.text section

Fix them by adding __always_inline to the relevant trivial functions.

Also replace noinstr with __always_inline for the existing
rcu_dynticks_task_*() functions, since noinstr would force them to be
noinline even when empty, which seems silly.
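
A hedged sketch of the annotation change (the helper name below is
hypothetical, not one of the real rcu_dynticks_task_*() functions):

	/* Tiny helper: forcing it inline means no call instruction is
	 * emitted at all, so nothing can leave the .noinstr.text section,
	 * even when the body compiles down to nothing. */
	static __always_inline void example_task_trace_enter(void)
	{
		/* body intentionally trivial in this sketch */
	}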

Fixes: 7d0c9c50c5 ("rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-10-07 09:59:41 -07:00
Peter Zijlstra
9b3c4ab304 sched,rcu: Rework try_invoke_on_locked_down_task()
Give try_invoke_on_locked_down_task() a saner name and have it return
an int so that the caller might distinguish between different reasons
for failure.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390
Link: https://lkml.kernel.org/r/20210929152428.649944917@infradead.org
2021-10-07 13:51:15 +02:00
Zhouyi Zhou
a2e05ddda1 lockdep: Improve comments in wait-type checks
Improve the comments in the wait-type checks by mentioning the
PREEMPT_RT kernel configuration option.

Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lkml.kernel.org/r/20210811025920.20751-1-zhouzhouyi@gmail.com
2021-09-17 15:08:45 +02:00
Waiman Long
925da92ba5 rcu: Avoid unneeded function call in rcu_read_unlock()
Since commit aa40c138cc ("rcu: Report QS for outermost PREEMPT=n
rcu_read_unlock() for strict GPs") the function rcu_read_unlock_strict()
is invoked by the inlined rcu_read_unlock() function.  However,
rcu_read_unlock_strict() is an empty function in production kernels,
which are built with CONFIG_RCU_STRICT_GRACE_PERIOD=n.

There is a mention of rcu_read_unlock_strict() in the BPF verifier,
but this is in a deny-list, meaning that BPF does not care whether
rcu_read_unlock_strict() is ever called.

This commit therefore provides a slight performance improvement
by hoisting the check of CONFIG_RCU_STRICT_GRACE_PERIOD from
rcu_read_unlock_strict() into rcu_read_unlock(), thus avoiding the
pointless call to an empty function.
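
A minimal sketch of the hoisting pattern, assuming a simplified caller
(the real rcu_read_unlock() does more than this):

	static inline void example_read_unlock(void)
	{
		/* The compile-time check now lives in the inlined caller, so
		 * CONFIG_RCU_STRICT_GRACE_PERIOD=n kernels emit no call to
		 * the empty rcu_read_unlock_strict() at all. */
		if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
			rcu_read_unlock_strict();
	}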

Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-16 10:24:05 -07:00
Paul E. McKenney
8af9e2c782 rcu-tasks: Update comments to cond_resched_tasks_rcu_qs()
The cond_resched_rcu_qs() function no longer exists, despite being mentioned
several times in kernel/rcu/tasks.h.  This commit therefore updates those
mentions to the current cond_resched_tasks_rcu_qs().

Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 11:41:19 -07:00
Neeraj Upadhyay
46aa886c48 rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader
The trc_wait_for_one_reader() function is called at multiple stages
of the trace rcu-tasks GP function, rcu_tasks_wait_gp():

- First, it is called from the per-task function rcu_tasks_trace_pertask()
  for all non-idle tasks. As part of this per-task processing, the function
  adds the task to the holdout list and, if the task is currently running
  on a CPU, sends an IPI to that CPU. The IPI handler takes action
  depending on whether the task is in a trace rcu-tasks read-side critical
  section:

  - a. If the task is in a trace rcu-tasks read-side critical section
       (t->trc_reader_nesting != 0), the IPI handler sets the task's
       ->trc_reader_special.b.need_qs so that the task notifies the GP
       handling function when it exits its outermost read-side critical
       section (by decrementing trc_n_readers_need_end).
       trc_wait_for_one_reader() also increments trc_n_readers_need_end,
       so that the trace rcu-tasks GP handler waits for this task's
       read-side exit notification. The IPI handler also sets
       t->trc_reader_checked to true, so no further IPIs are sent for
       this task during this trace rcu-tasks grace period, and the task
       can be removed from the holdout list.

  - b. If the task is in the process of exiting its trace rcu-tasks
       read-side critical section (t->trc_reader_nesting < 0), defer
       this task's processing to future calls to trc_wait_for_one_reader().

  - c. If the task is not in a trace rcu-tasks read-side critical section
       (t->trc_reader_nesting == 0), ->trc_reader_checked is set for the
       task, so that it is removed from the holdout list.

- Second, trc_wait_for_one_reader() is called as part of the post-scan
  processing, in rcu_tasks_trace_postscan(), for all idle tasks.

- Third, trc_wait_for_one_reader() is called from
  check_all_holdout_tasks_trace() for each task in the holdout list, but
  only if there is no pending IPI for the task (->trc_ipi_to_cpu == -1).
  check_all_holdout_tasks_trace() removes the task from the holdout list
  once the IPI handler has completed the required work, which ensures that
  the current trace rcu-tasks grace period either waits for this task or
  that this task is not in a trace rcu-tasks read-side critical section.

Now consider the scenario where smp_call_function_single() fails in the
first case above, inside rcu_tasks_trace_pertask(). In this case,
->trc_ipi_to_cpu remains set to a valid CPU number for that task. This
causes trc_wait_for_one_reader() to be skipped in the third case, inside
check_all_holdout_tasks_trace(), for this task. As a result,
->trc_reader_checked never gets set for this task and the task is never
removed from the holdout list, which can cause the current trace
rcu-tasks grace period to stall.

Fix this problem by resetting ->trc_ipi_to_cpu to -1 on
smp_call_function_single() failure, so that future IPIs can be sent for
this task.

Note that all three of the trc_wait_for_one_reader() function's
callers (rcu_tasks_trace_pertask(), rcu_tasks_trace_postscan(),
check_all_holdout_tasks_trace()) hold cpu_read_lock().  This means
that smp_call_function_single() cannot race with CPU hotplug, and thus
should never fail.  Therefore, also add a warning in order to report
any such failure in case smp_call_function_single() grows some other
reason for failure.
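
A hedged sketch of the failure-path handling described above
(illustrative, not the exact diff):

	/* Attempt to interrogate the running reader via IPI. */
	if (smp_call_function_single(cpu, trc_read_check_handler, t, 0)) {
		/* IPI could not be sent: complain, because the callers hold
		 * cpus_read_lock() and this "cannot happen", then reset the
		 * marker so that a later scan may retry this task. */
		WARN_ONCE(1, "%s(): IPI to CPU %d failed\n", __func__, cpu);
		t->trc_ipi_to_cpu = -1;
	}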

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 11:41:19 -07:00
Neeraj Upadhyay
ed42c38067 rcu-tasks: Fix read-side primitives comment for call_rcu_tasks_trace
call_rcu_tasks_trace() does have read-side primitives - rcu_read_lock_trace()
and rcu_read_unlock_trace(). Fix this information in the comments.

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 11:41:19 -07:00
Neeraj Upadhyay
a6517e9ce0 rcu-tasks: Clarify read side section info for rcu_tasks_rude GP primitives
The RCU tasks rude variant does not check whether the context currently
running on a CPU is usermode. A read-side critical section ends on the
transition to usermode execution, by virtue of usermode execution being
schedulable. Clarify this in the comments for call_rcu_tasks_rude() and
synchronize_rcu_tasks_rude().

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 11:41:19 -07:00
Neeraj Upadhyay
d39ec8f3c1 rcu-tasks: Correct comparisons for CPU numbers in show_stalled_task_trace
Valid CPU numbers are zero or greater, but the checks for
->trc_ipi_to_cpu and tick_nohz_full_cpu()'s argument test for strictly
greater than zero.  This commit therefore corrects the nohz_full CPU
check in show_stalled_task_trace() so as to include CPU 0.
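
An illustrative one-line sketch of the corrected comparison (surrounding
code and message text are simplified):

	/* CPU 0 is a valid CPU number, so test ">= 0" rather than "> 0". */
	if (t->trc_ipi_to_cpu >= 0 && tick_nohz_full_cpu(t->trc_ipi_to_cpu))
		pr_alert("\t... IPI pending to nohz_full CPU %d\n", t->trc_ipi_to_cpu);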

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 11:41:14 -07:00
Neeraj Upadhyay
89401176da rcu-tasks: Correct firstreport usage in check_all_holdout_tasks_trace
In check_all_holdout_tasks_trace(), firstreport is a pointer argument;
so, check the dereferenced value, instead of checking the pointer.
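
A minimal sketch of the pointer-versus-value distinction (hypothetical
surrounding code):

	static void example_report(bool *firstreport)
	{
		/* Test the value the pointer refers to; the pointer itself
		 * is always non-NULL here, so checking it is meaningless. */
		if (*firstreport) {
			pr_err("Still waiting for holdout tasks:\n");
			*firstreport = false;
		}
	}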

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 11:37:51 -07:00
Neeraj Upadhyay
d0a8585856 rcu-tasks: Fix s/rcu_add_holdout/trc_add_holdout/ typo in comment
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 11:37:51 -07:00
Paul E. McKenney
0db7c32ad3 rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop
Early in debugging, it made some sense to differentiate the first
iteration from subsequent iterations, but now this just causes confusion.
This commit therefore moves the "set_tasks_gp_state(rtp, RTGS_WAIT_CBS)"
statement to the beginning of the "for" loop in rcu_tasks_kthread().

Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 11:37:51 -07:00
Paul E. McKenney
c4f113ac45 rcu-tasks: Fix s/instruction/instructions/ typo in comment
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 11:37:50 -07:00
Paul E. McKenney
a5c071ccfa rcu-tasks: Remove second argument of rcu_read_unlock_trace_special()
The second argument of rcu_read_unlock_trace_special() is always zero.
When called from exit_tasks_rcu_finish_trace(), it is the constant
zero, and rcu_read_unlock_trace_special() doesn't get called from
rcu_read_unlock_trace() unless the value of local variable "nesting"
is zero because in that case the early return is taken instead.

This commit therefore removes the "nesting" argument from the
rcu_read_unlock_trace_special() function, substituting the constant
zero within that function.  This commit also adds a WARN_ON_ONCE()
to rcu_read_lock_trace_held() in case non-zeroness some day appears.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 11:37:50 -07:00
Paul E. McKenney
18f08e758f rcu-tasks: Add trc_inspect_reader() checks for exiting critical section
Currently, trc_inspect_reader() treats a task exiting its RCU Tasks
Trace read-side critical section the same as being within that critical
section.  However, this can fail because that task might have already
checked its .need_qs field, which means that it might never decrement
the all-important trc_n_readers_need_end counter.  Of course, for that
to happen, the task would need to never again execute an RCU Tasks Trace
read-side critical section, but this really could happen if the system's
last trampoline was removed.  Note that exit from such a critical section
cannot be treated as a quiescent state due to the possibility of nested
critical sections.  This means that if trc_inspect_reader() sees a
negative nesting value, it must set up to try again later.

This commit therefore ignores tasks that are exiting their RCU Tasks
Trace read-side critical sections so that they will be rechecked later.
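
A hedged sketch of the check described above (the return convention is
illustrative):

	/* A negative nesting value means the task is in the process of
	 * exiting its read-side critical section, which cannot be treated
	 * as a quiescent state.  Tell the caller to try again later. */
	if (READ_ONCE(t->trc_reader_nesting) < 0)
		return false;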

[ paulmck: Apply feedback from Neeraj Upadhyay and Boqun Feng. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 11:37:50 -07:00
Paul E. McKenney
96017bf903 rcu-tasks: Simplify trc_read_check_handler() atomic operations
Currently, trc_wait_for_one_reader() atomically increments
the trc_n_readers_need_end counter before sending the IPI
invoking trc_read_check_handler().  All failure paths out of
trc_read_check_handler() and also from the smp_call_function_single()
within trc_wait_for_one_reader() must carefully atomically decrement
this counter.  This is more complex than it needs to be.

This commit therefore simplifies things and saves a few lines of
code by dispensing with the atomic decrements in favor of having
trc_read_check_handler() do the atomic increment only in the success case.
In theory, this represents no change in functionality.
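
An illustrative before/after sketch of the counting scheme (heavily
simplified, not the exact diff):

	/* Before: increment up front, then every failure path must undo it. */
	atomic_inc(&trc_n_readers_need_end);
	if (smp_call_function_single(cpu, trc_read_check_handler, t, 0))
		atomic_dec(&trc_n_readers_need_end);  /* one of several decrements */

	/* After: only the IPI handler's success path increments, so no
	 * failure path needs a compensating atomic_dec(). */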

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 11:37:50 -07:00
Scott Wood
71921a9606 rcutorture: Avoid problematic critical section nesting on PREEMPT_RT
rcutorture is generating some nesting scenarios that are not compatible with PREEMPT_RT.
For example:
	preempt_disable();
	rcu_read_lock_bh();
	preempt_enable();
	rcu_read_unlock_bh();

The problem here is that on PREEMPT_RT the bottom halves have to be
disabled and enabled in preemptible context.

Reorder the locking: start with BH locking and only then disable
preemption or interrupts.  Unlock in the reverse order: first re-enable
interrupts and preemption, and re-enable BH at the very end.  Ensure that
on PREEMPT_RT the BH locking remains unchanged if already in
non-preemptible context.
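
A sketch of the corrected ordering (illustrative nesting):

	/* Lock: take the BH "lock" first, then disable preemption. */
	rcu_read_lock_bh();
	preempt_disable();

	/* ... critical section ... */

	/* Unlock: reverse order, re-enabling BH at the very end. */
	preempt_enable();
	rcu_read_unlock_bh();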

Link: https://lkml.kernel.org/r/20190911165729.11178-6-swood@redhat.com
Link: https://lkml.kernel.org/r/20210819182035.GF4126399@paulmck-ThinkPad-P17-Gen-1
Signed-off-by: Scott Wood <swood@redhat.com>
[bigeasy: Drop ATOM_BH, make it only about changing BH in atomic
context. Allow enabling RCU in IRQ-off section. Reword commit message.]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 16:36:16 -07:00
Paul E. McKenney
fd13fe16db rcutorture: Don't cpuhp_remove_state() if cpuhp_setup_state() failed
Currently, in CONFIG_RCU_BOOST kernels, if the rcu_torture_init()
function's call to cpuhp_setup_state() fails, rcu_torture_cleanup()
gamely passes nonsense to cpuhp_remove_state().  This results in
strange and misleading splats.  This commit therefore ensures that if
the rcu_torture_init() function's call to cpuhp_setup_state() fails,
rcu_torture_cleanup() avoids invoking cpuhp_remove_state().
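
A minimal sketch of the resulting guard (the variable name is
illustrative):

	/* cpuhp_setup_state() returns a negative errno on failure, so only
	 * a successful setup is torn down at cleanup time. */
	if (rcutor_hp >= 0)
		cpuhp_remove_state(rcutor_hp);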

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 16:36:16 -07:00
Paul E. McKenney
eb77abfdee rcuscale: Warn on individual rcu_scale_init() error conditions
When running rcuscale as a module, any rcu_scale_init() issues will be
reflected in the error code from modprobe or insmod, as the case may be.
However, these error codes are not available when running rcuscale
built-in, for example, when using the kvm.sh script.  This commit
therefore adds WARN_ON_ONCE() to allow distinguishing rcu_scale_init()
errors when running rcuscale built-in.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 16:36:16 -07:00
Paul E. McKenney
ed60ad733a refscale: Warn on individual ref_scale_init() error conditions
When running refscale as a module, any ref_scale_init() issues will be
reflected in the error code from modprobe or insmod, as the case may be.
However, these error codes are not available when running refscale
built-in, for example, when using the kvm.sh script.  This commit
therefore adds WARN_ON_ONCE() to allow distinguishing ref_scale_init()
errors when running refscale built-in.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 16:36:16 -07:00
Paul E. McKenney
efeff6b39b rcutorture: Warn on individual rcu_torture_init() error conditions
When running rcutorture as a module, any rcu_torture_init() issues will be
reflected in the error code from modprobe or insmod, as the case may be.
However, these error codes are not available when running rcutorture
built-in, for example, when using the kvm.sh script.  This commit
therefore adds WARN_ON_ONCE() to allow distinguishing rcu_torture_init()
errors when running rcutorture built-in.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 16:36:16 -07:00
Paul E. McKenney
fda84866b1 rcutorture: Suppressing read-exit testing is not an error
Currently, specifying the rcutorture.read_exit_burst=0 kernel boot
parameter will result in a -EINVAL exit code that will stop the rcutorture
test run before it has fully initialized.  This commit therefore uses a
zero exit code in that case, thus allowing rcutorture.read_exit_burst=0
to complete normally.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 16:36:15 -07:00
Paul E. McKenney
cbe0d8d914 rcu-tasks: Wait for trc_read_check_handler() IPIs
Currently, RCU Tasks Trace initializes the trc_n_readers_need_end counter
to the value one, increments it before each trc_read_check_handler()
IPI, then decrements it within trc_read_check_handler() if the target
task was in a quiescent state (or if the target task moved to some other
CPU while the IPI was in flight), complaining if the new value was zero.
The rationale for complaining is that the initial value of one must be
decremented away before zero can be reached, and this decrement has not
yet happened.

Except that trc_read_check_handler() is initiated with an asynchronous
smp_call_function_single(), which might be significantly delayed.  This
can result in false-positive complaints about the counter reaching zero.

This commit therefore waits for in-flight IPI handlers to complete before
decrementing away the initial value of one from the trc_n_readers_need_end
counter.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 16:35:08 -07:00
Neeraj Upadhyay
f0b2b2df54 rcu: Fix existing exp request check in sync_sched_exp_online_cleanup()
The sync_sched_exp_online_cleanup() function checks to see if RCU needs
an expedited quiescent state from the incoming CPU, sending it an IPI
if so.  Before sending the IPI, it checks whether an expedited quiescent
state has already been requested for the incoming CPU by inspecting
rcu_data.cpu_no_qs.b.exp, but it does so for the current CPU, that is,
the CPU on which sync_sched_exp_online_cleanup() is running.  This works
when the incoming CPU is the same as the current CPU.  However, when the
incoming CPU is a different CPU, the expedited request does not get
marked, which can potentially delay reporting of the expedited quiescent
state for the incoming CPU.
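
A hedged sketch of the corrected lookup (surrounding checks elided):

	/* Consult the *incoming* CPU's rcu_data structure, not the one
	 * belonging to the CPU running this cleanup function. */
	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);

	/* If the incoming CPU has already been asked for an expedited
	 * quiescent state, there is nothing more to do here. */
	if (READ_ONCE(rdp->cpu_no_qs.b.exp))
		return;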

Fixes: e015a34112 ("rcu: Avoid self-IPI in sync_sched_exp_online_cleanup()")
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 16:32:46 -07:00
Juri Lelli
1eac0075eb rcu: Make rcu update module parameters world-readable
The rcu update module parameters currently don't appear in sysfs, which
is a serviceability issue because their default values might need to be
inspected at runtime.

Fix this issue by changing the rcu update module parameters' permissions
to world-readable.
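
An illustrative sketch of the permission change (the parameter shown is
just one example of the pattern):

	/* 0444 (world-readable) makes the value visible under
	 * /sys/module/rcupdate/parameters/ at runtime. */
	module_param(rcu_expedited, int, 0444);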

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 16:32:46 -07:00
Juri Lelli
ebb6d30d9e rcu: Make rcu_normal_after_boot writable again
Certain configurations (e.g., systems that make heavy use of netns)
need to use synchronize_rcu_expedited() to service RCU grace periods
even after boot.

Even though synchronize_rcu_expedited() has traditionally been
considered harmful for RT because of its heavy use of IPIs, it is
perfectly usable under certain conditions (e.g. nohz_full).

Make rcupdate.rcu_normal_after_boot= writable again on RT (if NO_HZ_FULL
is defined), but keep its default value of 1 (enabled) to avoid
regressions.  Users who need synchronize_rcu_expedited() will boot with
rcupdate.rcu_normal_after_boot=0 on the kernel command line.

Reflect the change in synchronize_rcu_expedited_wait() by removing the
WARN related to CONFIG_PREEMPT_RT.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 16:32:46 -07:00
Paul E. McKenney
4aa846f97c rcu: Make rcutree_dying_cpu() use its "cpu" parameter
The CPU-hotplug functions take a "cpu" parameter, but rcutree_dying_cpu()
ignores it in favor of this_cpu_ptr().  This works at the moment, but
it would be better to be consistent.  This might also work better given
some possible future changes.  This commit therefore uses per_cpu_ptr()
to avoid ignoring the rcutree_dying_cpu() function's argument.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 16:32:46 -07:00
Paul E. McKenney
768f5d50e6 rcu: Simplify rcu_report_dead() call to rcu_report_exp_rdp()
Currently, rcu_report_dead() disables preemption across its call to
rcu_report_exp_rdp(), but this is pointless because interrupts are
already disabled by the caller.  In addition, rcu_report_dead() computes
the address of the outgoing CPU's rcu_data structure, which is also
pointless because this address is already present in local variable rdp.
This commit therefore drops the preemption disabling and passes rdp
to rcu_report_exp_rdp().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 16:32:46 -07:00
Paul E. McKenney
2caebefb00 rcu: Move rcu_dynticks_eqs_online() to rcu_cpu_starting()
The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks
counter of an incoming CPU when required.  It is currently invoked
from rcutree_prepare_cpu(), which runs before the incoming CPU is
running, and thus on some other CPU.  This makes the per-CPU accesses in
rcu_dynticks_eqs_online() iffy at best, and it all "works" only because
the running CPU cannot possibly be in dyntick-idle mode, which means
that rcu_dynticks_eqs_online() never has any effect.

It is currently OK for rcu_dynticks_eqs_online() to have no effect, but
only because the CPU-offline process just happens to leave ->dynticks in
the correct state.  After all, if ->dynticks were in the wrong state on a
just-onlined CPU, rcutorture would complain bitterly the next time that
CPU went idle, at least in kernels built with CONFIG_RCU_EQS_DEBUG=y,
for example, those built by rcutorture scenario TREE04.  One could
argue that this means that rcu_dynticks_eqs_online() is unnecessary;
however, removing it would make the CPU-online process vulnerable to
slight changes in the CPU-offline process.

One could also ask why it is safe to move the rcu_dynticks_eqs_online()
call so late in the CPU-online process.  Indeed, there was a time when it
would not have been safe, which does much to explain its current location.
However, the marking of a CPU as online from an RCU perspective has long
since moved from rcutree_prepare_cpu() to rcu_cpu_starting(), and all
that is required is that ->dynticks be set correctly by the time that
the CPU is marked as online from an RCU perspective.  After all, the RCU
grace-period kthread does not check to see if offline CPUs are also idle.
(In case you were curious, this is one reason why there is quiescent-state
reporting as part of the offlining process.)

This commit therefore moves the call to rcu_dynticks_eqs_online() from
rcutree_prepare_cpu() to rcu_cpu_starting(), this latter being guaranteed
to be running on the incoming CPU.  The call to this function must of
course be placed before this rcu_cpu_starting() announces this CPU's
presence to RCU.

Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 16:32:46 -07:00
Paul E. McKenney
ebc88ad491 rcu: Comment rcu_gp_init() code waiting for CPU-hotplug operations
Near the beginning of rcu_gp_init() is a per-rcu_node loop that waits
for CPU-hotplug operations that might have started before the new
grace period did.  This commit adds a comment explaining that this
wait does not exclude CPU-hotplug operations.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 16:32:46 -07:00
Zhouyi Zhou
3ac8587852 rcu: Fix undefined Kconfig macros
Invoking scripts/checkkconfigsymbols.py in the Linux-kernel source tree
located the following issues:

1. TREE_PREEMPT_RCU
Referencing files: arch/sh/configs/sdk7786_defconfig

It should now be CONFIG_PREEMPT_RCU, except that CONFIG_PREEMPT=y in
that same file already implies CONFIG_PREEMPT_RCU=y.  Therefore, delete
the CONFIG_TREE_PREEMPT_RCU=y line.

The reason is as follows:

In kernel/rcu/Kconfig, we have
config PREEMPT_RCU
        bool
        default y if PREEMPTION

https://www.kernel.org/doc/Documentation/kbuild/kconfig-language.txt says,
"The default value is only assigned to the config symbol if no other value
 was set by the user (via the input prompt above)."
There is no prompt in the "config PREEMPT_RCU" entry, so we are
guaranteed to get CONFIG_PREEMPT_RCU=y when CONFIG_PREEMPT is present.

2. RCU_CPU_STALL_INFO
Referencing files: arch/xtensa/configs/nommu_kc705_defconfig

The old Kconfig option RCU_CPU_STALL_INFO was removed by commit
75c27f119b ("rcu: Remove CONFIG_RCU_CPU_STALL_INFO"), and the kernel
now acts as if this Kconfig option was unconditionally enabled.

3. RCU_NOCB_CPU_ALL
Referencing files:
Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst

This is an old snapshot of the code.  I updated it from the real
rcu_prepare_for_idle() function in kernel/rcu/tree_plugin.h.
This change was tested by invoking "make htmldocs".

4. RCU_TORTURE_TESTS
Referencing files: kernel/rcu/rcutorture.c

Forward-progress checking conflicts with CPU-stall testing, so we should
complain at "modprobe rcutorture" when both are enabled.

Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 16:32:46 -07:00
Paul E. McKenney
9424b867a7 rcu: Eliminate rcu_implicit_dynticks_qs() local variable ruqp
The rcu_implicit_dynticks_qs() function's local variable ruqp references
the ->rcu_urgent_qs field in the rcu_data structure referenced by the
function parameter rdp, with a rather odd method for computing the
pointer to this field.  This commit therefore simplifies things and
saves a couple of lines of code by replacing each instance of ruqp with
&rdp->rcu_urgent_qs.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 16:32:46 -07:00
Paul E. McKenney
88ee23ef1c rcu: Eliminate rcu_implicit_dynticks_qs() local variable rnhqp
The rcu_implicit_dynticks_qs() function's local variable rnhqp references
the ->rcu_need_heavy_qs field in the rcu_data structure referenced by
the function parameter rdp, with a rather odd method for computing
the pointer to this field.  This commit therefore simplifies things
and saves a few lines of code by replacing each instance of rnhqp with
&rdp->rcu_need_heavy_qs.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 16:32:45 -07:00
Paul E. McKenney
52b030aa27 rcu-nocb: Fix a couple of tree_nocb code-style nits
This commit removes a non-value-returning "return" statement at the end
of __call_rcu_nocb_wake() and adds a blank line following declarations
in nocb_cb_can_run().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 16:32:45 -07:00
Paul E. McKenney
2431774f04 rcu: Mark accesses to rcu_state.n_force_qs
This commit marks accesses to the rcu_state.n_force_qs.  These data
races are hard to make happen, but syzkaller was equal to the task.

Reported-by: syzbot+e08a83a1940ec3846cd5@syzkaller.appspotmail.com
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 16:32:45 -07:00
Linus Torvalds
e5e726f7bb Updates for locking and atomics:
The regular pile:
 
   - A few improvements to the mutex code
 
   - Documentation updates for atomics to clarify the difference between
     cmpxchg() and try_cmpxchg() and to explain the forward progress
     expectations.
 
   - Simplification of the atomics fallback generator
 
   - The addition of arch_atomic_long*() variants and generic arch_*()
     bitops based on them.
 
   - Add the missing might_sleep() invocations to the down*() operations of
     semaphores.
 
 The PREEMPT_RT locking core:
 
   - Scheduler updates to support the state preserving mechanism for
     'sleeping' spin- and rwlocks on RT. This mechanism is carefully
     preserving the state of the task when blocking on a 'sleeping' spin- or
     rwlock and takes regular wake-ups targeted at the same task into
     account. The preserved or updated (via a regular wakeup) state is
     restored when the lock has been acquired.
 
   - Restructuring of the rtmutex code so it can be utilized and extended
     for the RT specific lock variants.
 
   - Restructuring of the ww_mutex code to allow sharing of the ww_mutex
     specific functionality for rtmutex based ww_mutexes.
 
   - Header file disentangling to allow substitution of the regular lock
     implementations with the PREEMPT_RT variants without creating an
     unmaintainable #ifdef mess.
 
   - Shared base code for the PREEMPT_RT specific rw_semaphore and rwlock
     implementations. Contrary to the regular rw_semaphores and rwlocks the
     PREEMPT_RT implementation is writer unfair because it is infeasible to
     do priority inheritance on multiple readers. Experience over the years
     has shown that real-time workloads are not the typical workloads which
     are sensitive to writer starvation. The alternative solution would be
     to allow only a single reader which has been tried and discarded as it
     is a major bottleneck especially for mmap_sem. Aside from that, many of
     the writer starvation critical usage sites have been converted to a
     writer side mutex/spinlock and RCU read side protections in the past
     decade so that the issue is less prominent than it used to be.
 
   - The actual rtmutex based lock substitutions for PREEMPT_RT enabled
     kernels which affect mutex, ww_mutex, rw_semaphore, spinlock_t and
     rwlock_t. The spin/rw_lock*() functions disable migration across the
     critical section to preserve the existing semantics vs. per CPU
     variables.
 
   - Rework of the futex REQUEUE_PI mechanism to handle the case of early
     wake-ups which interleave with a re-queue operation to prevent the
     situation that a task would be blocked on both the rtmutex associated
     to the outer futex and the rtmutex based hash bucket spinlock.
 
     While this situation cannot happen on !RT enabled kernels the changes
     make the underlying concurrency problems easier to understand in
     general. As a result the difference between !RT and RT kernels is
     reduced to the handling of waiting for the critical section. !RT
     kernels simply spin-wait as before and RT kernels utilize rcu_wait().
 
   - The substitution of local_lock for PREEMPT_RT with a spinlock which
     protects the critical section while staying preemptible. The CPU
     locality is established by disabling migration.
 
   The underlying concepts of this code have been in use in PREEMPT_RT for
   way more than a decade. The code has been refactored several times over
   the years and this final incarnation has been optimized once again to be
   as non-intrusive as possible, i.e. the RT specific parts are mostly
   isolated.
 
   It has been extensively tested in the 5.14-rt patch series and it has
   been verified that !RT kernels are not affected by these changes.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEsnuMTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaeWD/wLNMoAZXslS0prfr64ANjRgLXIqMFA
 r6xgioiwxxaxbmZ/GNPraoLC//ENo6mwobuUovq8yKljv2oBu6AmlUkBwrmMBc8Q
 nnm7jjGM3bZ1REup7rWERnjdOZfdGVSL5CUAAfthyC744XmXaepwrrrqfXG22GxJ
 QwLXBTAwXFVDxKfUjDKzEo5zgLNHRvHbzc0DpTYYn6WcuDJOmlyWnhfDTu2mNG9Z
 rqjqy+OgOUEUprQDgitk5hedfeic2kPm1mxxZrXkpkuPef5be2inQq2siC7GxR4g
 0AKeUsMFgFmSqiD4iJTALJ+8WXkgMnD9VgooeWHk4OaqZfaGzi/iwRSnrlnf7+OV
 GTmrsmX+TX/Wz2BDjB+3zylQnYqYh3quE5w4UO6uUyJXfdhlnvsjVc8bEajDFjeM
 yUapaWxdAri7k2n+vjXQthAngxtYPgXtFbZPoOl109JcDcG6jJsCdM5TdenegaRs
 WeUh05JqrH8+qI+Nwzc4rO+PmKHQ8on2wKdgLp11dviiPOf8OguH65nDQSGZ/fGv
 7cnD9A1/MUd0sdrvc52AqkIYxh+Rp9GnCs1xA82JsTXgAPcXqAWjjR2JFPHL4neV
 eW2upZekl8lMR7hkfcQbhe4MVjQIjff3iFOkQXittxMzfzFdi0tly8xB8AzpTHOx
 h91MycvmMR2zRw==
 =IEqE
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking and atomics updates from Thomas Gleixner:
 "The regular pile:

   - A few improvements to the mutex code

   - Documentation updates for atomics to clarify the difference between
     cmpxchg() and try_cmpxchg() and to explain the forward progress
     expectations.

   - Simplification of the atomics fallback generator

   - The addition of arch_atomic_long*() variants and generic arch_*()
     bitops based on them.

   - Add the missing might_sleep() invocations to the down*() operations
     of semaphores.

  The PREEMPT_RT locking core:

   - Scheduler updates to support the state preserving mechanism for
     'sleeping' spin- and rwlocks on RT.

     This mechanism is carefully preserving the state of the task when
     blocking on a 'sleeping' spin- or rwlock and takes regular wake-ups
     targeted at the same task into account. The preserved or updated
     (via a regular wakeup) state is restored when the lock has been
     acquired.

   - Restructuring of the rtmutex code so it can be utilized and
     extended for the RT specific lock variants.

   - Restructuring of the ww_mutex code to allow sharing of the ww_mutex
     specific functionality for rtmutex based ww_mutexes.

   - Header file disentangling to allow substitution of the regular lock
     implementations with the PREEMPT_RT variants without creating an
     unmaintainable #ifdef mess.

   - Shared base code for the PREEMPT_RT specific rw_semaphore and
     rwlock implementations.

     Contrary to the regular rw_semaphores and rwlocks the PREEMPT_RT
     implementation is writer unfair because it is infeasible to do
     priority inheritance on multiple readers. Experience over the years
     has shown that real-time workloads are not the typical workloads
     which are sensitive to writer starvation.

     The alternative solution would be to allow only a single reader
     which has been tried and discarded as it is a major bottleneck
     especially for mmap_sem. Aside from that, many of the writer
     starvation critical usage sites have been converted to a writer
     side mutex/spinlock and RCU read side protections in the past
     decade so that the issue is less prominent than it used to be.

   - The actual rtmutex based lock substitutions for PREEMPT_RT enabled
     kernels which affect mutex, ww_mutex, rw_semaphore, spinlock_t and
     rwlock_t. The spin/rw_lock*() functions disable migration across
     the critical section to preserve the existing semantics vs per-CPU
     variables.

   - Rework of the futex REQUEUE_PI mechanism to handle the case of
     early wake-ups which interleave with a re-queue operation to
     prevent the situation that a task would be blocked on both the
     rtmutex associated to the outer futex and the rtmutex based hash
     bucket spinlock.

     While this situation cannot happen on !RT enabled kernels the
     changes make the underlying concurrency problems easier to
     understand in general. As a result the difference between !RT and
     RT kernels is reduced to the handling of waiting for the critical
     section. !RT kernels simply spin-wait as before and RT kernels
     utilize rcu_wait().

   - The substitution of local_lock for PREEMPT_RT with a spinlock which
     protects the critical section while staying preemptible. The CPU
     locality is established by disabling migration.

  The underlying concepts of this code have been in use in PREEMPT_RT for
  way more than a decade. The code has been refactored several times over
  the years and this final incarnation has been optimized once again to be
  as non-intrusive as possible, i.e. the RT specific parts are mostly
  isolated.

  It has been extensively tested in the 5.14-rt patch series and it has
  been verified that !RT kernels are not affected by these changes"

* tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (92 commits)
  locking/rtmutex: Return success on deadlock for ww_mutex waiters
  locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexes
  locking/rtmutex: Dequeue waiter on ww_mutex deadlock
  locking/rtmutex: Dont dereference waiter lockless
  locking/semaphore: Add might_sleep() to down_*() family
  locking/ww_mutex: Initialize waiter.ww_ctx properly
  static_call: Update API documentation
  locking/local_lock: Add PREEMPT_RT support
  locking/spinlock/rt: Prepare for RT local_lock
  locking/rtmutex: Add adaptive spinwait mechanism
  locking/rtmutex: Implement equal priority lock stealing
  preempt: Adjust PREEMPT_LOCK_OFFSET for RT
  locking/rtmutex: Prevent lockdep false positive with PI futexes
  futex: Prevent requeue_pi() lock nesting issue on RT
  futex: Simplify handle_early_requeue_pi_wakeup()
  futex: Reorder sanity checks in futex_requeue()
  futex: Clarify comment in futex_requeue()
  futex: Restructure futex_requeue()
  futex: Correct the number of requeued waiters for PI
  futex: Remove bogus condition for requeue PI
  ...
2021-08-30 14:26:36 -07:00
Peter Zijlstra
830e6acc8a locking/rtmutex: Split out the inner parts of 'struct rtmutex'
RT builds substitutions for rwsem, mutex, spinlock and rwlock around
rtmutexes. Split the inner working out so each lock substitution can use
them with the appropriate lockdep annotations. This avoids having an extra
unused lockdep map in the wrapped rtmutex.

No functional change.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211302.784739994@linutronix.de
2021-08-17 17:04:01 +02:00
Paul E. McKenney
b770efc460 Merge branches 'doc.2021.07.20c', 'fixes.2021.08.06a', 'nocb.2021.07.20c', 'nolibc.2021.07.20c', 'tasks.2021.07.20c', 'torture.2021.07.27a' and 'torturescript.2021.07.27a' into HEAD
doc.2021.07.20c: Documentation updates.
fixes.2021.08.06a: Miscellaneous fixes.
nocb.2021.07.20c: Callback-offloading (NOCB CPU) updates.
nolibc.2021.07.20c: Tiny userspace library updates.
tasks.2021.07.20c: Tasks RCU updates.
torture.2021.07.27a: In-kernel torture-test updates.
torturescript.2021.07.27a: Torture-test scripting updates.
2021-08-10 11:00:53 -07:00
Sebastian Andrzej Siewior
d3dd95a885 rcu: Replace deprecated CPU-hotplug functions
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
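
An illustrative sketch of the one-for-one substitution:

	/* Old, deprecated:              New, equivalent:
	 *   get_online_cpus();   ->       cpus_read_lock();
	 *   put_online_cpus();   ->       cpus_read_unlock();
	 */
	cpus_read_lock();
	/* ... section that must not race with CPU hotplug ... */
	cpus_read_unlock();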

Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: rcu@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-10 10:47:32 -07:00
Paul E. McKenney
521c89b3a4 rcu: Print human-readable message for schedule() in RCU reader
The WARN_ON_ONCE() invocation within the CONFIG_PREEMPT=y version of
rcu_note_context_switch() triggers when there is a voluntary context
switch in an RCU read-side critical section, but there is quite a gap
between the output of that WARN_ON_ONCE() and this RCU-usage error.
This commit therefore converts the WARN_ON_ONCE() to a WARN_ONCE()
that explicitly describes the problem in its message.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06 13:41:49 -07:00
Liu Song
8211e922de rcu: Use per_cpu_ptr to get the pointer of per_cpu variable
There are a few remaining locations in kernel/rcu that still use
"&per_cpu()".  This commit replaces them with "per_cpu_ptr(&)", and does
not introduce any functional change.
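
A minimal sketch of the substitution (the variable is illustrative):

	/* Before: take the address of a per_cpu() lvalue. */
	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);

	/* After: ask for the pointer directly; behavior is identical. */
	rdp = per_cpu_ptr(&rcu_data, cpu);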

Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06 13:41:49 -07:00
Liu Song
eb880949ef rcu: Remove useless "ret" update in rcu_gp_fqs_loop()
Within rcu_gp_fqs_loop(), the "ret" local variable is set to the
return value from swait_event_idle_timeout_exclusive(), but "ret" is
unconditionally overwritten later in the code.  This commit therefore
removes this useless assignment.

Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06 13:41:48 -07:00
Paul E. McKenney
d283aa1b04 rcu: Mark accesses in tree_stall.h
This commit marks the accesses in tree_stall.h so as to both avoid
undesirable compiler optimizations and to keep KCSAN focused on the
accesses of the core algorithm.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06 13:41:48 -07:00
Paul E. McKenney
f74126dcbc rcu: Make rcu_gp_init() and rcu_gp_fqs_loop noinline to conserve stack
The kbuild test project found an oversized stack frame in rcu_gp_kthread()
for some kernel configurations.  This oversizing was due to a very large
amount of inlining, which is unnecessary because this code executes
infrequently.  This commit therefore marks rcu_gp_init() and
rcu_gp_fqs_loop() noinline_for_stack to conserve stack space.

Reported-by: kernel test robot <lkp@intel.com>
Tested-by: Rong Chen <rong.a.chen@intel.com>
[ paulmck: noinline_for_stack per Nathan Chancellor. ]
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06 13:41:48 -07:00
Paul E. McKenney
d9ee962feb rcu: Mark lockless ->qsmask read in rcu_check_boost_fail()
Accesses to ->qsmask are normally protected by ->lock, but there is an
exception in the diagnostic code in rcu_check_boost_fail().  This commit
therefore applies data_race() to this access to avoid KCSAN complaining
about the C-language writes protected by ->lock.
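
An illustrative sketch of the marking (context simplified):

	/* Diagnostic-only lockless read: data_race() documents that the
	 * access is intentionally unsynchronized and keeps KCSAN focused
	 * on the core algorithm instead. */
	unsigned long qsmask = data_race(rnp->qsmask);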

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06 13:41:48 -07:00
Paul E. McKenney
65bfdd36c1 srcutiny: Mark read-side data races
This commit marks some interrupt-induced read-side data races in
__srcu_read_lock(), __srcu_read_unlock(), and srcu_torture_stats_print().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06 13:41:48 -07:00
Paul E. McKenney
b169246feb rcu: Start timing stall repetitions after warning complete
Systems with low-bandwidth consoles can have very large printk()
latencies, and on such systems it makes no sense to have the next RCU CPU
stall warning message start output before the prior message completed.
This commit therefore sets the time of the next stall only after the
prints have completed.  While printing, the time of the next stall
message is set to ULONG_MAX/2 jiffies into the future.

Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06 13:41:48 -07:00
Sergey Senozhatsky
a80be428fb rcu: Do not disable GP stall detection in rcu_cpu_stall_reset()
rcu_cpu_stall_reset() is one of the functions virtual CPUs
execute during VM resume in order to handle jiffies skew
that can trigger false positive stall warnings. Paul has
pointed out that this approach is problematic because
rcu_cpu_stall_reset() disables RCU grace period stall-detection
virtually forever, while in fact it can just restart the
stall-detection timeout.
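
A hedged sketch of the restart-instead-of-disable behavior (illustrative
body):

	void rcu_cpu_stall_reset(void)
	{
		/* Re-arm the stall-detection timeout from "now" rather than
		 * pushing it out by ULONG_MAX / 2 (i.e., effectively never). */
		WRITE_ONCE(rcu_state.jiffies_stall,
			   jiffies + rcu_jiffies_till_stall_check());
	}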

Suggested-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06 13:41:48 -07:00
Sergey Senozhatsky
ccfc9dd691 rcu/tree: Handle VM stoppage in stall detection
The soft watchdog timer function checks whether a virtual machine
was suspended, in which case what looks like a lockup is in fact
a false positive.

This is what kvm_check_and_clear_guest_paused() does: it tests the
guest's PVCLOCK_GUEST_STOPPED flag (which is set by the host) and,
if the flag is set, touches all watchdogs and bails out.

The watchdog timer function runs from IRQ context, so the
PVCLOCK_GUEST_STOPPED check works fine there.

There is, however, one more watchdog that runs from IRQ context and
races with the watchdog timer function, and that watchdog is not
aware of PVCLOCK_GUEST_STOPPED: the RCU stall detector.

apic_timer_interrupt()
 smp_apic_timer_interrupt()
  hrtimer_interrupt()
   __hrtimer_run_queues()
    tick_sched_timer()
     tick_sched_handle()
      update_process_times()
       rcu_sched_clock_irq()

This triggers RCU stalls on our devices during VM resume.

If tick_sched_handle()->rcu_sched_clock_irq() runs on a VCPU before
watchdog_timer_fn()->kvm_check_and_clear_guest_paused(), then nothing
on this VCPU has touched the watchdogs yet, so RCU reads a stale grace
period stall timestamp along with the new jiffies value, which makes it
think that RCU has stalled.

Make RCU stall watchdog aware of PVCLOCK_GUEST_STOPPED and
don't report RCU stalls when we resume the VM.
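
A minimal sketch of the resulting check (its placement within the
stall-detection path is illustrative):

	/* The host sets PVCLOCK_GUEST_STOPPED while the VM is suspended;
	 * if it was set, this is a false positive, so skip the report. */
	if (kvm_check_and_clear_guest_paused())
		return;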

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06 13:41:48 -07:00
Paul E. McKenney
5fcb3a5f04 rcu: Mark accesses to ->rcu_read_lock_nesting
KCSAN flags accesses to ->rcu_read_lock_nesting as data races, but
in the past, the overhead of marked accesses was excessive.  However,
that was long ago, and much has changed since then, both in terms of
hardware and of compilers.  Here is data taken on an eight-core laptop
using Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz with a kernel built
using gcc version 9.3.0, with all data in nanoseconds.

Unmarked accesses (status quo), measured by three refscale runs:

	Minimum reader duration:  3.286  2.851  3.395
	Median reader duration:   3.698  3.531  3.4695
	Maximum reader duration:  4.481  5.215  5.157

Marked accesses, also measured by three refscale runs:

	Minimum reader duration:  3.501  3.677  3.580
	Median reader duration:   4.053  3.723  3.895
	Maximum reader duration:  7.307  4.999  5.511

This focused microbenchmark shows only sub-nanosecond differences which
are unlikely to be visible at the system level.  This commit therefore
marks data-racing accesses to ->rcu_read_lock_nesting.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06 13:41:48 -07:00
Paul E. McKenney
2be57f7328 rcu: Weaken ->dynticks accesses and updates
Accesses to the rcu_data structure's ->dynticks field have always been
fully ordered because it was not possible to prove that weaker ordering
was safe.  However, with the removal of the rcu_eqs_special_set() function
and the advent of the Linux-kernel memory model, it is now easy to show
that two of the four original full memory barriers can be weakened to
acquire and release operations.  The remaining pair must remain full
memory barriers.  This change makes the memory ordering requirements
more evident, and it might well also speed up the to-idle and from-idle
fastpaths on some architectures.

The following litmus test, adapted from one supplied off-list by Frederic
Weisbecker, models the RCU grace-period kthread detecting an idle CPU
that is concurrently transitioning to non-idle:

	C dynticks-from-idle

	{
		DYNTICKS=0; (* Initially idle. *)
	}

	P0(int *X, int *DYNTICKS)
	{
		int dynticks;
		int x;

		// Idle.
		dynticks = READ_ONCE(*DYNTICKS);
		smp_store_release(DYNTICKS, dynticks + 1);
		smp_mb();
		// Now non-idle
		x = READ_ONCE(*X);
	}

	P1(int *X, int *DYNTICKS)
	{
		int dynticks;

		WRITE_ONCE(*X, 1);
		smp_mb();
		dynticks = smp_load_acquire(DYNTICKS);
	}

	exists (1:dynticks=0 /\ 0:x=1)

Running "herd7 -conf linux-kernel.cfg dynticks-from-idle.litmus" verifies
this transition, namely, showing that if the RCU grace-period kthread (P1)
sees another CPU as idle (P0), then any memory access prior to the start
of the grace period (P1's write to X) will be seen by any RCU read-side
critical section following the to-non-idle transition (P0's read from X).
This is a straightforward use of full memory barriers to force ordering
in a store-buffering (SB) litmus test.

The following litmus test, also adapted from the one supplied off-list
by Frederic Weisbecker, models the RCU grace-period kthread detecting
a non-idle CPU that is concurrently transitioning to idle:

	C dynticks-into-idle

	{
		DYNTICKS=1; (* Initially non-idle. *)
	}

	P0(int *X, int *DYNTICKS)
	{
		int dynticks;

		// Non-idle.
		WRITE_ONCE(*X, 1);
		dynticks = READ_ONCE(*DYNTICKS);
		smp_store_release(DYNTICKS, dynticks + 1);
		smp_mb();
		// Now idle.
	}

	P1(int *X, int *DYNTICKS)
	{
		int x;
		int dynticks;

		smp_mb();
		dynticks = smp_load_acquire(DYNTICKS);
		x = READ_ONCE(*X);
	}

	exists (1:dynticks=2 /\ 1:x=0)

Running "herd7 -conf linux-kernel.cfg dynticks-into-idle.litmus" verifies
this transition, namely, showing that if the RCU grace-period kthread
(P1) sees another CPU as newly idle (P0), then any pre-idle memory access
(P0's write to X) will be seen by any code following the grace period
(P1's read from X).  This is a simple release-acquire pair forcing
ordering in a message-passing (MP) litmus test.

Of course, if the grace-period kthread detects the CPU as non-idle,
it will refrain from reporting a quiescent state on behalf of that CPU,
so there are no ordering requirements from the grace-period kthread in
that case.  However, other subsystems call rcu_is_idle_cpu() to check
for CPUs being non-idle from an RCU perspective.  That case is also
verified by the above litmus tests with the proviso that the sense of
the low-order bit of the DYNTICKS counter be inverted.

Unfortunately, on x86 smp_mb() is as expensive as a cache-local atomic
increment.  This commit therefore weakens only the read from ->dynticks.
However, the updates are abstracted into a rcu_dynticks_inc() function
to ease any future changes that might be needed.
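
A hedged sketch of such a helper (simplified):

	/* Funnel all ->dynticks updates through one place so that any
	 * future ordering changes need touch only this helper. */
	static noinstr unsigned long rcu_dynticks_inc(int incby)
	{
		return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
	}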

[ paulmck: Apply Linus Torvalds feedback. ]

Link: https://lore.kernel.org/lkml/20210721202127.2129660-4-paulmck@kernel.org/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06 13:41:48 -07:00
Joel Fernandes (Google)
a86baa69c2 rcu: Remove special bit at the bottom of the ->dynticks counter
Commit b8c17e6664 ("rcu: Maintain special bits at bottom of ->dynticks
counter") reserved a bit at the bottom of the ->dynticks counter to defer
flushing of TLBs, but this facility never has been used.  This commit
therefore removes this capability along with the rcu_eqs_special_set()
function used to trigger it.

Link: https://lore.kernel.org/linux-doc/CALCETrWNPOOdTrFabTDd=H7+wc6xJ9rJceg6OL1S0rTV5pfSsA@mail.gmail.com/
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
[ paulmck: Forward-port to v5.13-rc1. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06 13:41:48 -07:00
Yanfei Xu
dc87740c8a rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock
If rcu_print_task_stall() is invoked on an rcu_node structure that does
not contain any tasks blocking the current grace period, it takes an
early exit that fails to release that rcu_node structure's lock.  This
results in a self-deadlock, which is detected by lockdep.

To reproduce this bug:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0"

This will also result in other complaints, including RCU's scheduler
hook complaining about blocking rather than preemption and an rcutorture
writer stall.

Only a partial RCU CPU stall warning message will be printed because of
the self-deadlock.

This commit therefore releases the lock on the rcu_print_task_stall()
function's early exit path.
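
A minimal sketch of the repaired early exit (simplified):

	if (!rcu_preempt_blocked_readers_cgp(rnp)) {
		/* No tasks are blocking this grace period, but the lock
		 * taken above must still be released before returning. */
		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
		return 0;
	}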

Fixes: c583bcb8f5 ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
Tested-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06 13:41:42 -07:00
Yanfei Xu
e6a901a44f rcu: Fix to include first blocked task in stall warning
The for loop in rcu_print_task_stall() always omits ts[0], which points
to the first task blocking the stalled grace period.  This in turn fails
to count this first task, which means that ndetected will be equal to
zero when all CPUs have passed through their quiescent states and only
one task is blocking the stalled grace period.  This zero value for
ndetected will in turn result in an incorrect "All QSes seen" message:

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu:    Tasks blocked on level-1 rcu_node (CPUs 12-23):
        (detected by 15, t=6504 jiffies, g=164777, q=9011209)
rcu: All QSes seen, last rcu_preempt kthread activity 1 (4295252379-4295252378), jiffies_till_next_fqs=1, root ->qsmask 0x2
BUG: sleeping function called from invalid context at include/linux/uaccess.h:156
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 70613, name: msgstress04
INFO: lockdep is turned off.
Preemption disabled at:
[<ffff8000104031a4>] create_object.isra.0+0x204/0x4b0
CPU: 15 PID: 70613 Comm: msgstress04 Kdump: loaded Not tainted
5.12.2-yoctodev-standard #1
Hardware name: Marvell OcteonTX CN96XX board (DT)
Call trace:
 dump_backtrace+0x0/0x2cc
 show_stack+0x24/0x30
 dump_stack+0x110/0x188
 ___might_sleep+0x214/0x2d0
 __might_sleep+0x7c/0xe0

This commit therefore fixes the loop to include ts[0].

Fixes: c583bcb8f5 ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
Tested-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06 13:41:29 -07:00
Jiangong.Han
811192c5f2 rcuscale: Console output claims too few grace periods
The rcuscale console output claims N grace periods, numbered from zero
to N, which means that there were really N+1 grace periods.  The root
cause of this bug is that rcu_scale_writer() stores the number of the
last grace period (numbered from zero) into writer_n_durations[me]
instead of the number of grace periods.  This commit therefore assigns
the actual number of grace periods to writer_n_durations[me], and also
makes the corresponding adjustment to the loop outputting per-grace-period
measurements.

Sample of old console output:
    rcu-scale: writer 0 gps: 133
    ......
    rcu-scale:    0 writer-duration:     0 44003961
    rcu-scale:    0 writer-duration:     1 32003582
    ......
    rcu-scale:    0 writer-duration:   132 28004391
    rcu-scale:    0 writer-duration:   133 27996410

Sample of new console output:
    rcu-scale: writer 0 gps: 134
    ......
    rcu-scale:    0 writer-duration:     0 44003961
    rcu-scale:    0 writer-duration:     1 32003582
    ......
    rcu-scale:    0 writer-duration:   132 28004391
    rcu-scale:    0 writer-duration:   133 27996410
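
In code, the change amounts roughly to the following sketch, in which i_max
and print_writer_duration() are illustrative stand-ins for the actual local
variable and pr_alert() call:

    /* Before: stored the index of the last grace period, under-counting by one. */
    writer_n_durations[me] = i_max;

    /* After: store the actual number of grace periods measured. */
    writer_n_durations[me] = i_max + 1;

    /* The output loop is adjusted to match, printing entries 0 through count - 1. */
    for (j = 0; j < writer_n_durations[i]; j++)
        print_writer_duration(i, j, wdp[j]);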

Signed-off-by: Jiangong.Han <jiangong.han@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-27 11:39:30 -07:00
Paul E. McKenney
59e8366628 rcutorture: Preempt rather than block when testing task stalls
Currently, rcu_torture_stall() does a one-jiffy timed wait when
stall_cpu_block is set.  This works, but emits a pointless splat in
CONFIG_PREEMPT=y kernels.  This commit avoids this splat by instead
invoking preempt_schedule() in CONFIG_PREEMPT=y kernels.

This uses an admittedly ugly #ifdef, but abstracted approaches just
looked worse.  A prettier approach would provide a preempt_schedule()
definition with a WARN_ON() for CONFIG_PREEMPT=n kernels, but this seems
quite silly.
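
The resulting stall-loop fragment looks roughly like this (the exact Kconfig
guard and timeout value are assumptions based on the description above):

    if (stall_cpu_block) {
    #ifdef CONFIG_PREEMPT
        preempt_schedule();                  /* Preempt without the blocking splat. */
    #else
        schedule_timeout_uninterruptible(1); /* One-jiffy timed wait, as before. */
    #endif
    }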

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-27 11:39:30 -07:00
Paul E. McKenney
25f6fa53a0 refscale: Add measurement of clock readout
This commit adds a "clock" type to refscale, which checks the performance
of ktime_get_real_fast_ns().  Use the "clocksource=" kernel boot parameter
to select the underlying clock source.

[ paulmck: Work around compiler false positive per kernel test robot. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-27 11:38:56 -07:00
Zhouyi Zhou
fed31a4dd3 rcu: Fix macro name CONFIG_TASKS_RCU_TRACE
This commit fixes several typos where CONFIG_TASKS_RCU_TRACE should
instead be CONFIG_TASKS_TRACE_RCU.  Among other things, these typos
could cause CONFIG_TASKS_TRACE_RCU_READ_MB=y kernels to suffer from
memory-ordering bugs that could result in false-positive quiescent
states and too-short grace periods.

Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-20 13:43:44 -07:00
Paul E. McKenney
e4be1f44b6 rcu-tasks: Fix synchronize_rcu_rude() typo in comment
This commit replaces the fictitious synchronize_rcu_rude() function with
its real-world synchronize_rcu_tasks_rude() counterpart.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-20 13:43:44 -07:00
Paul E. McKenney
f8ab3fad80 rcu-tasks: Mark ->trc_reader_special.b.need_qs data races
There are several ->trc_reader_special.b.need_qs data races that are
too low-probability for KCSAN to notice, but which will happen sooner
or later.  This commit therefore marks these accesses.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-20 13:43:44 -07:00
Paul E. McKenney
bdb0cca0d1 rcu-tasks: Mark ->trc_reader_nesting data races
There are several ->trc_reader_nesting data races that are too
low-probability for KCSAN to notice, but which will happen sooner or
later.  This commit therefore marks these accesses, and comments one
that cannot race.
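
The markings use the usual READ_ONCE()/WRITE_ONCE() idiom, for example
(accesses shown schematically rather than as the exact patched lines):

    /* Remote read of another task's nesting count, e.g. from the GP kthread. */
    int nesting = READ_ONCE(t->trc_reader_nesting);

    /* Write that can race with such remote reads. */
    WRITE_ONCE(t->trc_reader_nesting, 0);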

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-20 13:43:44 -07:00
Paul E. McKenney
45f4b4a202 rcu-tasks: Add comments explaining task_struct strategy
Accesses to task_struct structures must be either protected by RCU
or by get_task_struct().  Tasks trace RCU uses these in a non-obvious
combination, in conjunction with an IPI handler.  This commit therefore
adds comments explaining this usage.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-20 13:43:44 -07:00
Frederic Weisbecker
cba712beeb rcu/nocb: Remove NOCB deferred wakeup from rcutree_dead_cpu()
At CPU offline time, we must handle any pending wakeup for the nocb_gp
kthread linked to the outgoing CPU.

Now we are making sure of that twice:

1) From rcu_report_dead() when the outgoing CPU makes the very last
   local cleanups by itself before switching offline.

2) From rcutree_dead_cpu(). Here the offlining CPU has gone and is truly
   now offline. Another CPU takes care of the post-mortem cleanup and
   checks whether the offline CPU had a pending wakeup.

Both ways are fine, but because there is no need to repeat that action,
we have to choose one or the other.  Simply benefit from cache locality
and keep only the first solution.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-20 13:41:51 -07:00
Frederic Weisbecker
dfcb275402 rcu/nocb: Start moving nocb code to its own plugin file
The kernel/rcu/tree_plugin.h file contains not only the plugins for
preemptible RCU, but also many other features including rcu_nocbs
callback offloading.  This offloading has become large and complex,
so it is time to put it in its own file.

This commit starts that process.

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
[ paulmck: Rename to tree_nocb.h, add Frederic as author. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-20 13:41:51 -07:00
Paul E. McKenney
2a2ed5618a rcu: Fix pr_info() formats and values in show_rcu_gp_kthreads()
This commit changes from "%lx" to "%x" and from "0x1ffffL" to "0x1ffff"
to match the change in type between the old field ->state (unsigned long)
and the new field ->__state (unsigned int).

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-06 15:53:12 -07:00
Paul E. McKenney
a9ab9cce93 rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader()
Invoking trc_del_holdout() from within trc_wait_for_one_reader() is
only a performance optimization because the RCU Tasks Trace grace-period
kthread will eventually do this within check_all_holdout_tasks_trace().
But it is not a particularly important performance optimization because
it only applies to the grace-period kthread, of which there is but one.
This commit therefore removes this invocation of trc_del_holdout() in
favor of the one in check_all_holdout_tasks_trace() in the grace-period
kthread.

Reported-by: "Xu, Yanfei" <yanfei.xu@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-06 15:52:49 -07:00
Paul E. McKenney
1d10bf55d8 rcu-tasks: Don't delete holdouts within trc_inspect_reader()
As Yanfei pointed out, although invoking trc_del_holdout() is safe
from the viewpoint of the integrity of the holdout list itself,
the put_task_struct() invoked by trc_del_holdout() can result in
use-after-free errors due to later accesses to this task_struct structure
by the RCU Tasks Trace grace-period kthread.

This commit therefore removes this call to trc_del_holdout() from
trc_inspect_reader() in favor of the grace-period thread's existing call
to trc_del_holdout(), thus eliminating that particular class of
use-after-free errors.

Reported-by: "Xu, Yanfei" <yanfei.xu@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-06 12:38:21 -07:00
Paul E. McKenney
05bc276cf2 refscale: Avoid false-positive warnings in ref_scale_reader()
If the call to set_cpus_allowed_ptr() in ref_scale_reader()
fails, a later WARN_ONCE() complains.  But with the advent of
570a752b7a ("lib/smp_processor_id: Use is_percpu_thread() instead of
nr_cpus_allowed"), this complaint can be drowned out by complaints from
smp_processor_id().  The rationale for this change is that refscale's
kthreads are not marked with PF_NO_SETAFFINITY, which means that a system
administrator could change affinity at any time.

However, refscale is a performance/stress test, and the system
administrator might well have a valid test-the-test reason for changing
affinity.  This commit therefore changes to raw_smp_processor_id()
in order to avoid the noise, and also adds a WARN_ON_ONCE() to the
call to set_cpus_allowed_ptr() in order to directly detect immediate
failure.  There is no WARN_ON_ONCE() within the test loop, allowing
human-reflex-based affinity resetting, if desired.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-06 12:38:08 -07:00
Linus Torvalds
28e92f9903 Merge branch 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:

 - Bitmap parsing support for "all" as an alias for all bits

 - Documentation updates

 - Miscellaneous fixes, including some that overlap into mm and lockdep

 - kvfree_rcu() updates

 - mem_dump_obj() updates, with acks from one of the slab-allocator
   maintainers

 - RCU NOCB CPU updates, including limited deoffloading

 - SRCU updates

 - Tasks-RCU updates

 - Torture-test updates

* 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (78 commits)
  tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inline
  rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states
  rcu: Add missing __releases() annotation
  rcu: Remove obsolete rcu_read_unlock() deadlock commentary
  rcu: Improve comments describing RCU read-side critical sections
  rcu: Create an unrcu_pointer() to remove __rcu from a pointer
  srcu: Early test SRCU polling start
  rcu: Fix various typos in comments
  rcu/nocb: Unify timers
  rcu/nocb: Prepare for fine-grained deferred wakeup
  rcu/nocb: Only cancel nocb timer if not polling
  rcu/nocb: Delete bypass_timer upon nocb_gp wakeup
  rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup
  rcu/nocb: Allow de-offloading rdp leader
  rcu/nocb: Directly call __wake_nocb_gp() from bypass timer
  rcu: Don't penalize priority boosting when there is nothing to boost
  rcu: Point to documentation of ordering guarantees
  rcu: Make rcu_gp_cleanup() be noinline for tracing
  rcu: Restrict RCU_STRICT_GRACE_PERIOD to at most four CPUs
  rcu: Make show_rcu_gp_kthreads() dump rcu_node structures blocking GP
  ...
2021-07-04 12:58:33 -07:00
Linus Torvalds
71bd934101 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "190 patches.

  Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
  vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
  migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
  zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
  core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
  signals, exec, kcov, selftests, compress/decompress, and ipc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
  ipc/util.c: use binary search for max_idx
  ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
  ipc: use kmalloc for msg_queue and shmid_kernel
  ipc sem: use kvmalloc for sem_undo allocation
  lib/decompressors: remove set but not used variabled 'level'
  selftests/vm/pkeys: exercise x86 XSAVE init state
  selftests/vm/pkeys: refill shadow register after implicit kernel write
  selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
  selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
  kcov: add __no_sanitize_coverage to fix noinstr for all architectures
  exec: remove checks in __register_bimfmt()
  x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
  hfsplus: report create_date to kstat.btime
  hfsplus: remove unnecessary oom message
  nilfs2: remove redundant continue statement in a while-loop
  kprobes: remove duplicated strong free_insn_page in x86 and s390
  init: print out unknown kernel parameters
  checkpatch: do not complain about positive return values starting with EPOLL
  checkpatch: improve the indented label test
  checkpatch: scripts/spdxcheck.py now requires python3
  ...
2021-07-02 12:08:10 -07:00
Andy Shevchenko
f39650de68 kernel.h: split out panic and oops helpers
kernel.h has been used as a dumping ground for all kinds of things for a
long time.  This is an attempt to start cleaning it up by splitting out the
panic and oops helpers.

There are several purposes in doing this:
- dropping a dependency in bug.h
- breaking an include loop by moving out panic_notifier.h
- unloading kernel.h of material that has its own domain

At the same time, convert users tree-wide to the new headers, although for
the time being the new header is included back into kernel.h to avoid
twisted indirect includes for existing users.

[akpm@linux-foundation.org: thread_info.h needs limits.h]
[andriy.shevchenko@linux.intel.com: ia64 fix]
  Link: https://lkml.kernel.org/r/20210520130557.55277-1-andriy.shevchenko@linux.intel.com

Link: https://lkml.kernel.org/r/20210511074137.33666-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Co-developed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Sebastian Reichel <sre@kernel.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:04 -07:00
Peter Zijlstra
2f064a59a1 sched: Change task_struct::state
Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
2021-06-18 11:43:09 +02:00
Peter Zijlstra
b03fbd4ff2 sched: Introduce task_is_running()
Replace a bunch of 'p->state == TASK_RUNNING' with a new helper:
task_is_running(p).
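
Combined with the ->__state rename above, the helper boils down to a single
READ_ONCE() (sketch of the definition in include/linux/sched.h):

    #define task_is_running(task)  (READ_ONCE((task)->__state) == TASK_RUNNING)

Open-coded tests such as "p->state == TASK_RUNNING" thus become
task_is_running(p).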

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org
2021-06-18 11:43:07 +02:00
Paul E. McKenney
641faf1b90 Merge branches 'bitmaprange.2021.05.10c', 'doc.2021.05.10c', 'fixes.2021.05.13a', 'kvfree_rcu.2021.05.10c', 'mmdumpobj.2021.05.10c', 'nocb.2021.05.12a', 'srcu.2021.05.12a', 'tasks.2021.05.18a' and 'torture.2021.05.10c' into HEAD
bitmaprange.2021.05.10c: Allow "all" for bitmap ranges.
doc.2021.05.10c: Documentation updates.
fixes.2021.05.13a: Miscellaneous fixes.
kvfree_rcu.2021.05.10c: kvfree_rcu() updates.
mmdumpobj.2021.05.10c: mem_dump_obj() updates.
nocb.2021.05.12a: RCU NOCB CPU updates, including limited deoffloading.
srcu.2021.05.12a: SRCU updates.
tasks.2021.05.18a: Tasks-RCU updates.
torture.2021.05.10c: Torture-test updates.
2021-05-18 10:56:19 -07:00
Paul E. McKenney
474d099736 tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inline
On some architectures, the no-op variant of show_rcu_tasks_gp_kthreads()
gets "no previous prototype" compiler warnings.  These are false positives
given that kernel/rcu/tasks.h is included only once.  But why put up
with the compiler noise?

This commit therefore adds "static inline" to this definition to force
the compiler to accept this situation, while also moving it to its proper
place in kernel/rcu/rcu.h.

Reported-by: kernel test robot <lkp@intel.com>
[ paulmck: Update per Stephen Rothwell feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-18 10:55:17 -07:00
Paul E. McKenney
cf868c2af2 rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states
Heavy networking load can cause a CPU to execute continuously and
indefinitely within ksoftirqd, in which case there will be no voluntary
task switches and thus no RCU-tasks quiescent states.  This commit
therefore causes the existing rcu_softirq_qs() to provide an RCU-tasks
quiescent state.

This of course means that __do_softirq() and its callers cannot be
invoked from within a tracing trampoline.

Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
2021-05-18 10:54:51 -07:00
Jules Irenge
c70360c334 rcu: Add missing __releases() annotation
Sparse reports a warning at rcu_print_task_stall():

"warning: context imbalance in rcu_print_task_stall - unexpected unlock"

The root cause is a missing annotation on rcu_print_task_stall().

This commit therefore adds the missing __releases(rnp->lock) annotation.

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-13 09:13:23 -07:00
Paul E. McKenney
1893afd634 rcu: Improve comments describing RCU read-side critical sections
There are a number of places that call out the fact that preempt-disable
regions of code now act as RCU read-side critical sections, where
preempt-disable regions of code include irq-disable regions of code,
bh-disable regions of code, hardirq handlers, and NMI handlers.  However,
someone relying solely on (for example) the call_rcu() header comment
might well have no idea that preempt-disable regions of code have RCU
semantics.

This commit therefore updates the header comments for
call_rcu(), synchronize_rcu(), rcu_dereference_bh_check(), and
rcu_dereference_sched_check() to call out these new(ish) forms of RCU
readers.
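
For example, under these rules a preempt-disable region pairs with
synchronize_rcu() just as an explicit rcu_read_lock() region would.  In the
sketch below, gp, newp, oldp, p, and do_something_with() are illustrative
only:

    /* Reader: a preempt-disable region acts as an RCU read-side critical section. */
    preempt_disable();
    p = rcu_dereference_sched(gp);
    do_something_with(p);
    preempt_enable();

    /* Updater: synchronize_rcu() waits for such regions before oldp is freed. */
    rcu_assign_pointer(gp, newp);
    synchronize_rcu();
    kfree(oldp);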

Reported-by: Michel Lespinasse <michel@lespinasse.org>
[ paulmck: Apply Matthew Wilcox and Michel Lespinasse feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-13 09:13:23 -07:00
Frederic Weisbecker
0a580fa65c srcu: Early test SRCU polling start
Place an early call to start_poll_synchronize_srcu() before the invocation
of call_srcu() on the same srcu_struct structure.

After the later call to srcu_barrier(), the completion of the
first grace period should be visible to a subsequent invocation of
poll_state_synchronize_srcu(), and if not, warn.
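
A sketch of the test sequence (the srcu_struct, rcu_head, and callback names
are illustrative):

    /* Early boot, on the same srcu_struct: */
    cookie = start_poll_synchronize_srcu(&early_srcu);
    call_srcu(&early_srcu, &early_rh, early_srcu_cb);

    /* Later, once srcu_barrier(&early_srcu) has completed: */
    WARN_ON_ONCE(!poll_state_synchronize_srcu(&early_srcu, cookie));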

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12 12:12:33 -07:00
Ingo Molnar
a616aec9aa rcu: Fix various typos in comments
Fix ~12 single-word typos in RCU code comments.

[ paulmck: Apply feedback from Randy Dunlap. ]
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12 12:11:05 -07:00
Frederic Weisbecker
e75bcd48e2 rcu/nocb: Unify timers
Now that ->nocb_timer and ->nocb_bypass_timer have become quite similar,
this commit merges them together.  A new RCU_NOCB_WAKE_BYPASS wake level
is introduced.  As a result, timers perform all kinds of deferred wake
ups but other deferred wakeup callsites only handle non-bypass wakeups
in order not to wake up rcuo too early.

The timer also unconditionally executes a full barrier so as to order
timer_pending() and callback enqueue, although whether the RCU_NOCB_WAKE_FORCE
path that makes use of it really needs that barrier is debatable.  That path
should also test against the rdp leader instead of the current rdp.

This unconditional full barrier shouldn't bring visible overhead since
these timers almost never fire.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12 12:10:23 -07:00
Frederic Weisbecker
870905169d rcu/nocb: Prepare for fine-grained deferred wakeup
Tuning the deferred wakeup level must be done from a safe wakeup
point. Currently those sites are:

* ->nocb_timer
* user/idle/guest entry
* CPU down
* softirq/rcuc

All of these sites perform the wake up for both RCU_NOCB_WAKE and
RCU_NOCB_WAKE_FORCE.

In order to merge ->nocb_timer and ->nocb_bypass_timer together, we plan
to add a new RCU_NOCB_WAKE_BYPASS that really should be deferred until
a timer fires so that we don't wake up the NOCB-gp kthread too early.

To prepare for that, this commit specifies the per-callsite wakeup
level/limit.

Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
[ paulmck: Fix non-NOCB rcu_nocb_need_deferred_wakeup() definition. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12 12:10:23 -07:00
Frederic Weisbecker
f9fc166b79 rcu/nocb: Only cancel nocb timer if not polling
This commit refrains from deleting the ->nocb_timer when rcu_nocb is polling
because that timer should never have been queued in the polling case.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12 12:10:23 -07:00
Frederic Weisbecker
3b2348e2fd rcu/nocb: Delete bypass_timer upon nocb_gp wakeup
A NOCB-gp wakeup can safely delete the ->nocb_bypass_timer because
nocb_gp_wait() will recheck the bypass state and rearm the bypass
timer if necessary.  This commit therefore deletes this timer.

Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12 12:10:23 -07:00
Frederic Weisbecker
b6e2c4ed35 rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup
When waking up in nocb_gp_wait(), there is no need to keep the nocb_timer
around because this function will traverse the whole rdp list. Any
update performed before the timer was armed will now be visible after
the ->nocb_gp_lock acquire.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12 12:10:23 -07:00
Frederic Weisbecker
552cac80e6 rcu/nocb: Allow de-offloading rdp leader
The only thing that prevented an rdp leader from being de-offloaded was
the nocb_bypass_timer that used to lock the nocb_lock of the rdp leader.

If an rdp gets de-offloaded, it will subtly ignore rcu_nocb_lock()
calls and do its job in the timer unsafely.  Worse yet:  If it gets
re-offloaded in the middle of the timer, rcu_nocb_unlock() would try to
unlock, leaving it imbalanced.

Now that the nocb_bypass_timer doesn't use the nocb_lock anymore,
de-offloading the rdp leader is now safe.  This commit therefore allows
the rdp leader to be de-offloaded.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12 12:10:23 -07:00
Frederic Weisbecker
c7ef7500a8 rcu/nocb: Directly call __wake_nocb_gp() from bypass timer
The bypass timer calls __call_rcu_nocb_wake() instead of directly
calling __wake_nocb_gp().  The only difference here is that
rdp->qlen_last_fqs_check gets overridden.  But resetting the deferred
force quiescent state base shouldn't be relevant for that timer.  In fact
the bypass queue in question can be for any rdp from the group and not
necessarily the rdp leader on which the bypass timer is attached.

This commit therefore calls __wake_nocb_gp() directly.  This way we
don't even need to lock the ->nocb_lock.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12 12:10:08 -07:00
Paul E. McKenney
5390473ec1 rcu: Don't penalize priority boosting when there is nothing to boost
RCU priority boosting cannot do anything unless there is at least one
task blocking the current RCU grace period that was preempted within
the RCU read-side critical section that it still resides in.  However,
the current rcu_torture_boost_failed() code will count this as an RCU
priority-boosting failure if there were no CPUs blocking the current
grace period.  This situation can happen (for example) if the last CPU
blocking the current grace period was subjected to vCPU preemption,
which is always a risk for rcutorture guest OSes.

This commit therefore causes rcu_torture_boost_failed() to refrain from
reporting failure unless there is at least one task blocking the current
RCU grace period that was preempted within the RCU read-side critical
section that it still resides in.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:44:11 -07:00
Paul E. McKenney
3d3a0d1b50 rcu: Point to documentation of ordering guarantees
Add comments to synchronize_rcu() and friends that point to
Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:22:54 -07:00
Paul E. McKenney
2f20de99a6 rcu: Make rcu_gp_cleanup() be noinline for tracing
Although there are trace events for RCU grace periods, these are only
enabled in CONFIG_RCU_TRACE=y kernels.  This commit therefore marks
rcu_gp_cleanup() noinline in order to provide a function that can be
traced that is invoked near the end of each grace period.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:22:54 -07:00
Paul E. McKenney
4d80b8e196 rcu: Restrict RCU_STRICT_GRACE_PERIOD to at most four CPUs
Kernels built with CONFIG_RCU_STRICT_GRACE_PERIOD=y can experience
significant lock contention due to RCU's resulting focus on ending grace
periods as soon as possible.  This is OK, but only if there are not very
many CPUs.  This commit therefore puts this Kconfig option off-limits
to systems with more than four CPUs.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:22:54 -07:00
Paul E. McKenney
b15805013b rcu: Make show_rcu_gp_kthreads() dump rcu_node structures blocking GP
Currently, show_rcu_gp_kthreads() only dumps rcu_node structures that
have outdated ideas of the current grace-period number.  This commit
also dumps those that are in any way blocking the current grace period.
This helps diagnose RCU priority boosting failures.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:22:54 -07:00
Paul E. McKenney
3ef5a1c382 rcu: Make RCU priority boosting work on single-CPU rcu_node structures
When any CPU comes online, it checks to see if an RCU-boost kthread has
already been created for that CPU's leaf rcu_node structure, and if
not, it creates one.  Unfortunately, it also verifies that this leaf
rcu_node structure actually has at least one online CPU, and if not,
it declines to create the kthread.  Although this behavior makes sense
during early boot, especially on systems that claim far more CPUs than
they actually have, it makes no sense for the first CPU to come online
for a given rcu_node structure.  There is no point in checking because
we know there is a CPU on its way in.

The problem is that timing differences can cause this incoming CPU to not
yet be reflected in the various bit masks even at rcutree_online_cpu()
time, and there is no chance at rcutree_prepare_cpu() time.  Plus it
would be better to create the RCU-boost kthread at rcutree_prepare_cpu()
to handle the case where the CPU is involved in an RCU priority inversion
very shortly after it comes online.

This commit therefore moves the checking to rcu_prepare_kthreads(), which
is called only at early boot, when the check is appropriate.  In addition,
it makes rcutree_prepare_cpu() invoke rcu_spawn_one_boost_kthread(), which
no longer does any checking for online CPUs.

With this change, RCU priority boosting tests now pass for short rcutorture
runs, even with single-CPU leaf rcu_node structures.

Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:22:54 -07:00
Paul E. McKenney
396eba65f6 rcu: Add quiescent states and boost states to show_rcu_gp_kthreads() output
This commit adds each rcu_node structure's ->qsmask and "bBEG" output
indicating whether: (1) There is a boost kthread, (2) A reader needs
to be (or is in the process of being) boosted, (3) A reader is blocking
an expedited grace period, and (4) A reader is blocking a normal grace
period.  This helps diagnose RCU priority boosting failures.
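
The per-node line is built with the kernel's usual ".x"[cond] character
trick, roughly as follows (rcu_node field names per kernel/rcu/tree.h; the
exact format string is approximate):

    pr_info("    rcu_node %d:%d ->qsmask %#lx %c%c%c%c\n",
            rnp->grplo, rnp->grphi, data_race(rnp->qsmask),
            ".b"[!!rnp->boost_kthread_task], /* b: boost kthread exists. */
            ".B"[!!rnp->boost_tasks],        /* B: reader needs (or is being) boosted. */
            ".E"[!!rnp->exp_tasks],          /* E: reader blocks an expedited GP. */
            ".G"[!!rnp->gp_tasks]);          /* G: reader blocks a normal GP. */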

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:22:54 -07:00
Paul E. McKenney
3066820034 rcu: Reject RCU_LOCKDEP_WARN() false positives
If another lockdep report runs concurrently with an RCU lockdep report
from RCU_LOCKDEP_WARN(), the following sequence of events can occur:

1.	debug_lockdep_rcu_enabled() sees that lockdep is enabled
	when called from (say) synchronize_rcu().

2.	Lockdep is disabled by a concurrent lockdep report.

3.	debug_lockdep_rcu_enabled() evaluates its lockdep-expression
	argument, for example, lock_is_held(&rcu_bh_lock_map).

4.	Because lockdep is now disabled, lock_is_held() plays it safe and
	returns the constant 1.

5.	But in this case, the constant 1 is not safe, because invoking
	synchronize_rcu() under rcu_read_lock_bh() is disallowed.

6.	debug_lockdep_rcu_enabled() wrongly invokes lockdep_rcu_suspicious(),
	resulting in a false-positive splat.

This commit therefore changes RCU_LOCKDEP_WARN() to check
debug_lockdep_rcu_enabled() after checking the lockdep expression,
so that any "safe" returns from lock_is_held() are rejected by
debug_lockdep_rcu_enabled().  This requires memory ordering, which is
supplied by READ_ONCE(debug_locks).  The resulting volatile accesses
prevent the compiler from reordering and the fact that only one variable
is being accessed prevents the underlying hardware from reordering.
The combination works for IA64, which can reorder reads to the same
location, but this is defeated by the volatile accesses, which compile
to load instructions that provide ordering.
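
A sketch of the reordered check, where the __warned bookkeeping follows the
macro's long-standing pattern and only the ordering of the two tests is the
point:

    #define RCU_LOCKDEP_WARN(c, s)                                          \
        do {                                                                \
            static bool __section(".data.unlikely") __warned;               \
                                                                            \
            /* Evaluate (c) first so that a "safe" constant-true return */  \
            /* from lock_is_held() is rejected by the enabled check.    */  \
            if ((c) && debug_lockdep_rcu_enabled() && !__warned) {          \
                __warned = true;                                            \
                lockdep_rcu_suspicious(__FILE__, __LINE__, s);              \
            }                                                               \
        } while (0)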

Reported-by: syzbot+dde0cc33951735441301@syzkaller.appspotmail.com
Reported-by: Matthew Wilcox <willy@infradead.org>
Reported-by: syzbot+88e4f02896967fe1ab0d@syzkaller.appspotmail.com
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:22:54 -07:00
Paul E. McKenney
27ba76e164 rcu: Add ->gp_max to show_rcu_gp_kthreads() output
This commit adds ->gp_max to show_rcu_gp_kthreads() output in order to
better diagnose RCU priority boosting failures.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:22:54 -07:00
Paul E. McKenney
e44111ed20 rcu: Add ->rt_priority and ->gp_start to show_rcu_gp_kthreads() output
This commit adds ->rt_priority and ->gp_start to show_rcu_gp_kthreads()
output in order to better diagnose RCU priority boosting failures.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:22:54 -07:00
Paul E. McKenney
8e4b1d2bc1 rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread()
Currently, rcu_spawn_core_kthreads() is invoked via an early_initcall(),
which works, except that rcu_spawn_gp_kthread() is also invoked via an
early_initcall() and rcu_spawn_core_kthreads() relies on adjustments to
kthread_prio that are carried out by rcu_spawn_gp_kthread().  There is
no guarantee of ordering among early_initcall() handlers, and thus no
guarantee that kthread_prio will be properly checked and range-limited
at the time that rcu_spawn_core_kthreads() needs it.

In most cases, this bug is harmless.  After all, the only reason that
rcu_spawn_gp_kthread() adjusts the value of kthread_prio is if the user
specified a nonsensical value for this boot parameter, which experience
indicates is rare.

Nevertheless, a bug is a bug.  This commit therefore causes the
rcu_spawn_core_kthreads() function to be invoked directly from
rcu_spawn_gp_kthread() after any needed adjustments to kthread_prio have
been carried out.

Fixes: 48d07c04b4 ("rcu: Enable elimination of Tree-RCU softirq processing")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:22:54 -07:00
Zhouyi Zhou
277ffe1b70 rcu: Improve tree.c comments and add code cleanups
This commit cleans up some comments and code in kernel/rcu/tree.c.

Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:22:53 -07:00
Paul E. McKenney
ce7c169dee rcu: Remove the unused rcu_irq_exit_preempt() function
Commit 9ee01e0f69 ("x86/entry: Clean up idtentry_enter/exit()
leftovers") left the rcu_irq_exit_preempt() in place in order to avoid
conflicts with the -rcu tree.  Now that this change has long since hit
mainline, this commit removes the no-longer-used rcu_irq_exit_preempt()
function.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:22:53 -07:00
Paul E. McKenney
7ab2bd31df rcutorture: Move mem_dump_obj() tests into separate function
To make the purpose of the code more apparent, this commit moves the
tests of mem_dump_obj() to a new rcu_torture_mem_dump_obj() function
and calls it from rcu_torture_cleanup().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:05:07 -07:00
Paul E. McKenney
063f5a4df9 rcutorture: Don't count CPU-stalled time against priority boosting
It will frequently be the case that rcu_torture_boost() will get a
->start_gp_poll() cookie that needs almost all of the current grace period
plus an additional grace period to elapse before ->poll_gp_state() will
return true.  It is quite possible that the current grace period will have
(say) two seconds of stall by a CPU failing to pass through a quiescent
state, followed by 300 milliseconds of delay due to a preempted reader.
The next grace period might suffer only one second of stall by a CPU,
followed by another 300 milliseconds of delay due to a preempted reader.
This is an example of RCU priority boosting doing its job, but the full
elapsed time of 3.6 seconds exceeds the 3.5-second limit.  In addition,
there is no CPU stall in force at the 3.5-second mark, so this would
nevertheless currently be counted as an RCU priority boosting failure.

This commit therefore avoids this sort of false positive by resetting
the gp_state_time timestamp any time that the current grace period is
being blocked by a CPU.  This results in extremely frequent calls to
the ->check_boost_failed() function, so this commit provides a lockless
fastpath that is selected by supplying a NULL CPU-number pointer.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:05:07 -07:00
Paul E. McKenney
0260b92e1c rcutorture: Forgive RCU boost failures when CPUs don't pass through QS
Currently, rcu_torture_boost() runs CPU-bound at real-time priority
to force RCU priority inversions.  It then checks that grace periods
progress during this CPU-bound time.  If grace periods fail to progress,
it reports an RCU priority boosting failure.

However, it is possible (and sometimes does happen) that the grace period
fails to progress due to a CPU failing to pass through a quiescent state
for an extended time period (3.5 seconds by default).  This can happen
due to vCPU preemption, long-running interrupts, and much else besides.
There is nothing that RCU priority boosting can do about these situations,
and so they should not be counted as RCU priority boosting failures.

This commit therefore checks for CPUs (as opposed to preempted tasks)
holding up a grace period, and flags the resulting RCU priority boosting
failures, but does not splat nor count them as errors.  It does rate-limit
them to avoid flooding the console log.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:05:07 -07:00
Paul E. McKenney
bcd4af44e2 rcutorture: Make rcu_torture_boost_failed() check for GP end
It is possible that a delayed grace period that rcu_torture_boost()
was polling for ended while rcu_torture_boost_failed() was printing the
failure splat.  It would be good to know when this happens.  This commit
therefore has rcu_torture_boost_failed() recheck the grace period after
printing the splat, and print a message indicating whether or not
the grace period has ended.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:05:06 -07:00
Paul E. McKenney
8c7ec02e2a rcutorture: Consolidate rcu_torture_boost() timing and statistics
This commit consolidates two loops in rcu_torture_boost(), one of which
counts the number of boost-test episodes and the other of which computes
the start time of the next episode, into one loop that does both with but
a single acquisition of boost_mutex.  This means that the count of the
number of boost-test episodes is incremented after an episode completes
rather than before it starts, but it also avoids the over-counting that
was possible previously.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:05:06 -07:00
Paul E. McKenney
7b9dad7aba rcutorture: Delay-based false positives for RCU priority boosting tests
If an rcu_torture_boost() kthread determines that its grace period
has not yet ended, it invokes rcu_torture_boost_failed() which checks
whether enough time has elapsed for this to be considered a failure of
RCU priority boosting, and, if so, flags the error.

Unfortunately, that kthread might be preempted for some seconds between
the time that it checks the grace period and the time that it checks the
time.  This delay can result in a false positive, featuring a complaint
that a particular grace period has not ended, followed by a diagnostic
dump featuring a much later grace period.

This commit avoids these false positives by rechecking for the end of
the grace period after the time check.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:05:06 -07:00
Paul E. McKenney
ea6d962e80 rcutorture: Judge RCU priority boosting on grace periods, not callbacks
Currently, rcutorture's testing of RCU priority boosting insists not
only that grace periods complete, but also that callbacks be invoked.
Although this is in fact what the user would want, ensuring that there
is sufficient CPU bandwidth devoted to callback execution is in fact
the user's responsibility.  One could argue that rcutorture can take on
that responsibility, which is true in theory.  But in practice, ensuring
sufficient CPU bandwidth to ksoftirqd, any rcuc kthreads, and any rcuo
kthreads is not particularly consistent with rcutorture's main job,
that of stress-testing RCU.  In addition, if the system administrator
(say) makes very poor choices when pinning rcuo kthreads and then runs
rcutorture, there really isn't much rcutorture can do.

Besides, RCU priority boosting only boosts lagging readers, not all the
machinery required to invoke callbacks in a timely fashion.

This commit therefore switches rcutorture's evaluation of RCU priority
boosting from callback execution to grace-period completion by using
the new start_poll_synchronize_rcu() and poll_state_synchronize_rcu()
functions.  When rcutorture is built in (as in when there is no innocent
workload to inconvenience), the ksoftirqd ktheads are boosted to real-time
priority 2 in order to allow timeouts to work properly in the face of
rcutorture's testing of RCU priority boosting.

Indeed, it is not as easy as it looks to create a reliable test of RCU
priority boosting without destroying the rest of the kernel!

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:05:06 -07:00
Paul E. McKenney
a5c095e0e9 rcutorture: Abstract read-lock-held checks
This commit adds a (*readlock_held)() function pointer to the
rcu_torture_ops structure in order to make the rcu_torture_one_read()
function's rcu_dereference_check() lockdep expression more appropriate
for a given run.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:05:05 -07:00
Paul E. McKenney
e9b800db96 refscale: Add acqrel, lock, and lock-irq
This commit adds the scale_type values acqrel, lock, and lock-irq to
test acquisition and release.  Note that the refscale.nreaders=1
module parameter is required if you wish to test uncontended locking.
In contrast, acqrel uses a per-CPU variable, so it should be just fine
with large values of the refscale.nreaders module parameter.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:05:05 -07:00
Paul E. McKenney
9fc98e3143 rcu-tasks: Add block comment laying out RCU Rude design
This commit adds a block comment that gives a high-level overview of
how RCU Rude grace periods progress.  It also gives an overview of the
memory ordering.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:04:24 -07:00
Paul E. McKenney
06a3ec9205 rcu-tasks: Add block comment laying out RCU Tasks design
This commit adds a block comment that gives a high-level overview of how
RCU tasks grace periods progress.  It also adds a note about how exiting
tasks are handled, plus it gives an overview of the memory ordering.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:04:24 -07:00
Frederic Weisbecker
b5befe842e srcu: Fix broken node geometry after early ssp init
An srcu_struct structure that is initialized before rcu_init_geometry()
will have its srcu_node hierarchy based on CONFIG_NR_CPUS.  Once
rcu_init_geometry() is called, this hierarchy is compressed as needed
for the actual maximum number of CPUs for this system.

Later on, that srcu_struct structure is confused, sometimes referring
to its initial CONFIG_NR_CPUS-based hierarchy, and sometimes instead
to the new num_possible_cpus() hierarchy.  For example, each of its
->mynode fields continues to reference the original leaf rcu_node
structures, some of which might no longer exist.  On the other hand,
srcu_for_each_node_breadth_first() traverses to the new node hierarchy.

There are at least two bad possible outcomes to this:

1) a) A callback enqueued early on an srcu_data structure (call it
      *sdp) is recorded pending on sdp->mynode->srcu_data_have_cbs in
      srcu_funnel_gp_start() with sdp->mynode pointing to a deep leaf
      (say 3 levels).

   b) The grace period ends after rcu_init_geometry() shrinks the
      nodes level to a single one.  srcu_gp_end() walks through the new
      srcu_node hierarchy without ever reaching the old leaves so the
      callback is never executed.

   This is easily reproduced on an 8-CPU machine with CONFIG_NR_CPUS >= 32
   and "rcupdate.rcu_self_test=1". The srcu_barrier() after early-test
   verification never completes and the boot hangs:

	[ 5413.141029] INFO: task swapper/0:1 blocked for more than 4915 seconds.
	[ 5413.147564]       Not tainted 5.12.0-rc4+ #28
	[ 5413.151927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	[ 5413.159753] task:swapper/0       state:D stack:    0 pid:    1 ppid:     0 flags:0x00004000
	[ 5413.168099] Call Trace:
	[ 5413.170555]  __schedule+0x36c/0x930
	[ 5413.174057]  ? wait_for_completion+0x88/0x110
	[ 5413.178423]  schedule+0x46/0xf0
	[ 5413.181575]  schedule_timeout+0x284/0x380
	[ 5413.185591]  ? wait_for_completion+0x88/0x110
	[ 5413.189957]  ? mark_held_locks+0x61/0x80
	[ 5413.193882]  ? mark_held_locks+0x61/0x80
	[ 5413.197809]  ? _raw_spin_unlock_irq+0x24/0x50
	[ 5413.202173]  ? wait_for_completion+0x88/0x110
	[ 5413.206535]  wait_for_completion+0xb4/0x110
	[ 5413.210724]  ? srcu_torture_stats_print+0x110/0x110
	[ 5413.215610]  srcu_barrier+0x187/0x200
	[ 5413.219277]  ? rcu_tasks_verify_self_tests+0x50/0x50
	[ 5413.224244]  ? rdinit_setup+0x2b/0x2b
	[ 5413.227907]  rcu_verify_early_boot_tests+0x2d/0x40
	[ 5413.232700]  do_one_initcall+0x63/0x310
	[ 5413.236541]  ? rdinit_setup+0x2b/0x2b
	[ 5413.240207]  ? rcu_read_lock_sched_held+0x52/0x80
	[ 5413.244912]  kernel_init_freeable+0x253/0x28f
	[ 5413.249273]  ? rest_init+0x250/0x250
	[ 5413.252846]  kernel_init+0xa/0x110
	[ 5413.256257]  ret_from_fork+0x22/0x30

2) An srcu_struct structure that is initialized before rcu_init_geometry()
   and used afterward will always have stale sdp->mynode references,
   resulting in callbacks to be missed in srcu_gp_end(), just like in
   the previous scenario.

This commit therefore causes init_srcu_struct_nodes() to initialize the
geometry, if needed.  This ensures that the srcu_node hierarchy is
properly built and distributed from the get-go.

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:03:35 -07:00
Frederic Weisbecker
8e9c01c717 srcu: Initialize SRCU after timers
Once srcu_init() is called, the SRCU core will make use of delayed
workqueues, which rely on timers.  However init_timers() is called
several steps after rcu_init().  This means that a call_srcu() after
rcu_init() but before init_timers() would find itself within a dangerously
uninitialized timer core.

This commit therefore creates a separate call to srcu_init() after
init_timers() completes, which ensures that we stay in early SRCU mode
until timers are safe(r).

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:03:35 -07:00
Frederic Weisbecker
c75e9d2915 srcu: Remove superfluous ssp initialization for early callbacks
Pre-srcu_init() invocations of call_srcu() initialize the srcu_struct
structure in question, so there is no need to check this initialization
in srcu_init() when initiating grace periods for srcu_struct structures
that had early call_srcu() invocations.  This commit therefore drops
the calls to check_init_srcu_struct() in srcu_init().

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:03:34 -07:00
Frederic Weisbecker
94df76a197 srcu: Remove superfluous sdp->srcu_lock_count zero filling
Because alloc_percpu() zeroes out the allocated memory, there is no need
to zero-fill newly allocated per-CPU memory.  This commit therefore removes
the loop zeroing the ->srcu_lock_count and ->srcu_unlock_count arrays
from init_srcu_struct_nodes().  This is the only use of that function's
is_static parameter, which this commit also removes.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:03:34 -07:00
Frederic Weisbecker
d76e0926d8 rcu/nocb: Use the rcuog CPU's ->nocb_timer
Currently each CPU has its own ->nocb_timer queued when the nocb_gp
wakeup must be deferred.  This approach has many drawbacks, compared to
a solution based on a single timer per NOCB group:

* There are a lot of timers to maintain.

* The per-rdp ->nocb_lock must be held to queue and cancel the timer
  and this lock can already be heavily contended.

* One timer firing doesn't cancel the other timers in the same group:
  - These other timers can thus cause spurious wakeups
  - Each rdp that queued a timer must lock both ->nocb_lock and then
    ->nocb_gp_lock upon exit from the kernel to idle/user/guest mode.

* We can't cancel all of them if we detect an unflushed bypass in
  nocb_gp_wait(). In fact currently we only ever cancel the ->nocb_timer
  of the leader group.

* The leader group's nocb_timer is cancelled without locking ->nocb_lock
  in nocb_gp_wait().  This currently appears to be safe but is an
  accident waiting to happen.

* Since the timer acquires ->nocb_lock, it requires extra care in the
  NOCB (de-)offloading process, requiring that it be either enabled or
  disabled and then flushed.

This commit therefore uses the rcuog kthread's CPU's ->nocb_timer instead.
It is protected by nocb_gp_lock, which is _way_ less contended and
remains so even after this change.  As a matter of fact, the nocb_timer
almost never fires and the deferred wakeup is mostly carried out upon
idle/user/guest entry.  Now the early check performed at this point in
do_nocb_deferred_wakeup() is done on rdp_gp->nocb_defer_wakeup, which
is of course racy.  However, this raciness is harmless because we only
need the guarantee that the timer is queued if we were the last one to
queue it.  Any other situation (another CPU has queued it and we either
see it or not) is fine.

This solves all the issues listed above.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:02:44 -07:00
Uladzislau Rezki (Sony)
a78d4a2a10 kvfree_rcu: Refactor kfree_rcu_monitor()
Currently we have three functions that depend on each other.
Two of them are quite tiny and the last one is where most of the
work is done.  All of them are related to queuing RCU batches
to reclaim objects after a GP.

1. kfree_rcu_monitor(). It consists of a few lines. It acquires a spin-lock
   and calls kfree_rcu_drain_unlock().

2. kfree_rcu_drain_unlock(). It also consists of a few lines of code. It
   calls queue_kfree_rcu_work() to queue the batch.  If this fails,
   it rearms the monitor work to try again later.

3. queue_kfree_rcu_work(). This provides the bulk of the functionality,
   attempting to start a new batch to free objects after a GP.

Since there are no external users of functions [2] and [3], both
can be eliminated by moving all logic directly into [1], which both
shrinks and simplifies the code.

Also convert comments that start with "/*" to the "//" format in order
to make the comment style consistent across the file.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:00:48 -07:00
Uladzislau Rezki (Sony)
d8628f35ba kvfree_rcu: Fix comments according to current code
The kvfree_rcu() function now defers allocations in the common
case due to the fact that there is no lockless access to the
memory-allocator caches/pools.  In addition, in CONFIG_PREEMPT_NONE=y
and in CONFIG_PREEMPT_VOLUNTARY=y kernels, there is no reliable way to
determine if spinlocks are held.  As a result, allocation is deferred in
the common case, and the two-argument form of kvfree_rcu() thus uses the
"channel 3" queue through all the rcu_head structures.  This channel
is called referred to as the emergency case in comments, and these
comments are now obsolete.

This commit therefore updates these comments to reflect the new
common-case nature of such emergencies.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:00:48 -07:00
Uladzislau Rezki (Sony)
7fe1da33f6 kvfree_rcu: Use kfree_rcu_monitor() instead of open-coded variant
Replace an open-coded version of the kfree_rcu_monitor() function body
with a call to that function.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:00:48 -07:00
Uladzislau Rezki (Sony)
dd28c9f057 kvfree_rcu: Update "monitor_todo" once a batch is started
Before attempting to start a new batch, the "monitor_todo" variable is
set to "false", and is set back to "true" when a previous RCU batch is
still in progress.  This is at best confusing.

Thus, change this variable to "false" only when a new batch has been
successfully queued; otherwise, just leave it be.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:00:48 -07:00
Uladzislau Rezki (Sony)
d434c00fa3 kvfree_rcu: Add a bulk-list check when a scheduler is run
The rcu_scheduler_active flag is set to RCU_SCHEDULER_RUNNING once the
scheduler is up and running.  That signal is used in order to check
and queue a "monitor work" to reclaim freed objects (if there are any)
during early boot.  This flag is used by kvfree_rcu() to determine when
work can safely be queued, at which point memory passed to earlier
invocations of kvfree_rcu() can be processed.

However, only "krcp->head" is checked for objects that need to be
released, and there are now two more, namely, "krcp->bkvhead[0]" and
"krcp->bkvhead[1]".  Therefore, check these two additional channels.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:00:48 -07:00
Uladzislau Rezki (Sony)
ac7625ebd5 kvfree_rcu: Use [READ/WRITE]_ONCE() macros to access to nr_bkv_objs
nr_bkv_objs is a count of the objects in the kvfree_rcu page cache.
Accessing it requires holding the ->lock.  Switch to READ_ONCE() and
WRITE_ONCE() macros to provide lockless access to this counter.
This lockless access is used for the shrinker.
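
Schematically, the accesses become (krcp being the per-CPU kfree_rcu_cpu
pointer; surrounding code elided):

    /* Lockless read, e.g. from the shrinker's object-count callback: */
    count += READ_ONCE(krcp->nr_bkv_objs);

    /* Updates remain under krcp->lock, but are marked for the lockless readers: */
    WRITE_ONCE(krcp->nr_bkv_objs, krcp->nr_bkv_objs + 1);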

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:00:48 -07:00
Zhang Qiang
d0bfa8b3c4 kvfree_rcu: Release a page cache under memory pressure
Add a drain_page_cache() function to drain the per-CPU page cache.
The reason behind it is that a system can run into a low-memory
condition, in which case the page shrinker can ask its users to free
their caches in order to make extra memory available for other needs
in the system.

When a system hits such a condition, the page cache is drained for
all CPUs in the system.  By default, refilling of the page cache is
delayed at 5-second intervals until the memory pressure disappears.
If needed, this interval can be changed via the
rcu_delay_page_cache_fill_msec module parameter.

Co-developed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:00:48 -07:00
Rolf Eike Beer
4c9c3809ae rcu: Fix typo in comment: kthead -> kthread
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 15:45:58 -07:00
Yury Norov
a6814a79f2 rcu/tree_plugin: Don't handle the case of 'all' CPU range
The 'all' semantics is now supported by bitmap_parselist(), so we can
drop support for it as a special case in RCU code.  Since 'all' is properly
supported in core bitmap code, also drop the legacy comment in RCU for it.

This patch does not make any functional changes for existing users.

Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 15:38:20 -07:00
Paul E. McKenney
ab6ad3dbdd Merge branches 'bitmaprange.2021.03.08a', 'fixes.2021.03.15a', 'kvfree_rcu.2021.03.08a', 'mmdumpobj.2021.03.08a', 'nocb.2021.03.15a', 'poll.2021.03.24a', 'rt.2021.03.08a', 'tasks.2021.03.08a', 'torture.2021.03.08a' and 'torturescript.2021.03.22a' into HEAD
bitmaprange.2021.03.08a:  Allow 3-N for bitmap ranges.
fixes.2021.03.15a:  Miscellaneous fixes.
kvfree_rcu.2021.03.08a:  kvfree_rcu() updates.
mmdumpobj.2021.03.08a:  mem_dump_obj() updates.
nocb.2021.03.15a:  RCU NOCB CPU updates, including limited deoffloading.
poll.2021.03.24a:  Polling grace-period interfaces for RCU.
rt.2021.03.08a:  Realtime-related RCU changes.
tasks.2021.03.08a:  Tasks-RCU updates.
torture.2021.03.08a:  Torture-test updates.
torturescript.2021.03.22a:  Torture-test scripting updates.
2021-03-24 17:20:18 -07:00
Paul E. McKenney
7ac3fdf099 rcutorture: Test start_poll_synchronize_rcu() and poll_state_synchronize_rcu()
This commit causes rcutorture to test the new start_poll_synchronize_rcu()
and poll_state_synchronize_rcu() functions.  Because of the difficulty of
determining the nature of a synchronous RCU grace period (expedited or not),
the test that insisted that poll_state_synchronize_rcu() detect an
intervening synchronize_rcu() had to be dropped.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-24 17:17:38 -07:00
Paul E. McKenney
0909fc2b2c rcu: Provide polling interfaces for Tiny RCU grace periods
There is a need for a non-blocking polling interface for RCU grace
periods, so this commit supplies start_poll_synchronize_rcu() and
poll_state_synchronize_rcu() for this purpose.  Note that the existing
get_state_synchronize_rcu() may be used if future grace periods are
inevitable (perhaps due to a later call_rcu() invocation).  The new
start_poll_synchronize_rcu() is to be used if future grace periods
might not otherwise happen.  Finally, poll_state_synchronize_rcu()
provides a lockless check for a grace period having elapsed since
the corresponding call to either get_state_synchronize_rcu() or
start_poll_synchronize_rcu().

As with get_state_synchronize_rcu(), the return value from either
get_state_synchronize_rcu() or start_poll_synchronize_rcu() is passed in
to a later call to either poll_state_synchronize_rcu() or the existing
(might_sleep) cond_synchronize_rcu().
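
A hedged usage sketch of the new pair (the surrounding helpers and the
gp_cookie variable are hypothetical; the RCU calls are the ones named
above):

/* Hypothetical user of the polling API. */
static unsigned long gp_cookie;

static void begin_deferred_reclaim(void)
{
	/*
	 * Start a grace period (if one is not already in flight) and
	 * record a cookie identifying it.
	 */
	gp_cookie = start_poll_synchronize_rcu();
}

static bool deferred_reclaim_can_proceed(void)
{
	/*
	 * Lockless check: true only once a full grace period has elapsed
	 * since the cookie was obtained.
	 */
	return poll_state_synchronize_rcu(gp_cookie);
}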

[ paulmck: Revert cond_synchronize_rcu() to might_sleep() per Frederic Weisbecker feedback. ]
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-24 17:16:15 -07:00
Paul E. McKenney
7abb18bd75 rcu: Provide polling interfaces for Tree RCU grace periods
There is a need for a non-blocking polling interface for RCU grace
periods, so this commit supplies start_poll_synchronize_rcu() and
poll_state_synchronize_rcu() for this purpose.  Note that the existing
get_state_synchronize_rcu() may be used if future grace periods are
inevitable (perhaps due to a later call_rcu() invocation).  The new
start_poll_synchronize_rcu() is to be used if future grace periods
might not otherwise happen.  Finally, poll_state_synchronize_rcu()
provides a lockless check for a grace period having elapsed since
the corresponding call to either get_state_synchronize_rcu() or
start_poll_synchronize_rcu().

As with get_state_synchronize_rcu(), the return value from either
get_state_synchronize_rcu() or start_poll_synchronize_rcu() is passed in
to a later call to either poll_state_synchronize_rcu() or the existing
(might_sleep) cond_synchronize_rcu().
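
As a complementary sketch of the flavors mentioned above (the
surrounding function is hypothetical): use get_state_synchronize_rcu()
when future grace periods are expected anyway, then hand the cookie to
cond_synchronize_rcu(), which sleeps only if a grace period has not
already elapsed.

static void update_then_wait_if_needed(void)
{
	unsigned long cookie;

	cookie = get_state_synchronize_rcu();	/* does not start a grace period */

	/*
	 * ... publish the update; a later call_rcu() or other activity is
	 * expected to drive grace periods forward ...
	 */

	/*
	 * Returns immediately if a grace period has already elapsed since
	 * the cookie was taken; otherwise blocks (may sleep).
	 */
	cond_synchronize_rcu(cookie);
}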

[ paulmck: Remove redundant smp_mb() per Frederic Weisbecker feedback. ]
[ Update poll_state_synchronize_rcu() docbook per Frederic Weisbecker feedback. ]
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-22 08:23:48 -07:00
Frederic Weisbecker
e02691b7ef rcu/nocb: Move trace_rcu_nocb_wake() calls outside nocb_lock when possible
Those tracing calls don't need to be under ->nocb_lock.  This commit
therefore moves them outside of that lock.
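
The resulting pattern, as a hedged sketch (the wrapper is hypothetical;
trace_rcu_nocb_wake() and the unlock helper are the existing ones): drop
->nocb_lock first, then emit the tracepoint.

/* Hypothetical helper showing the ordering: unlock, then trace. */
static void nocb_unlock_and_trace(struct rcu_data *rdp, unsigned long flags,
				  const char *reason)
{
	rcu_nocb_unlock_irqrestore(rdp, flags);
	trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, reason);
}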

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-15 13:54:55 -07:00
Frederic Weisbecker
0efdf14a9f rcu/nocb: Remove stale comment above rcu_segcblist_offload()
This commit removes a stale comment claiming that the cblist must be
empty before changing the offloading state.  This claim was correct back
when the offloaded state was defined exclusively at boot.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-15 13:54:54 -07:00
Frederic Weisbecker
76d00b494d rcu/nocb: Disable bypass when CPU isn't completely offloaded
Currently, the bypass is flushed at the very last moment in the
deoffloading procedure.  However, this approach leads to a larger state
space than would be preferred.  This commit therefore disables the
bypass as soon as the deoffloading procedure begins, then flushes it.
This guarantees that the bypass remains empty and thus out of the way
of the deoffloading procedure.

Symmetrically, this commit waits to enable the bypass until the offloading
procedure has completed.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-15 13:54:54 -07:00