Commit Graph

2472 Commits

Author SHA1 Message Date
Frederic Weisbecker 5c83cedbaa rcu/nocb: Protect lazy shrinker against concurrent (de-)offloading
The shrinker may run concurrently with callbacks (de-)offloading. As
such, calling rcu_nocb_lock() is very dangerous because it does a
conditional locking. The worst outcome is that rcu_nocb_lock() doesn't
lock but rcu_nocb_unlock() eventually unlocks, or the reverse, creating
an imbalance.

Fix this with protecting against (de-)offloading using the barrier mutex.
Although if the barrier mutex is contended, which should be rare, then
step aside so as not to trigger a mutex VS allocation
dependency chain.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-05-09 17:26:58 -07:00
Zqiang 6b706e5603 rcu/kvfree: Make drain_page_cache() take early return if cache is disabled
If the rcutree.rcu_min_cached_objs kernel boot parameter is set to zero,
then krcp->page_cache_work will never be triggered to fill page cache.
In addition, the put_cached_bnode() will not fill page cache.  As a
result krcp->bkvcache will always be empty, so there is no need to acquire
krcp->lock to get page from krcp->bkvcache.  This commit therefore makes
drain_page_cache() return immediately if the rcu_min_cached_objs is zero.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-05-09 17:26:21 -07:00
Zqiang 60888b77a0 rcu/kvfree: Make fill page cache start from krcp->nr_bkv_objs
When the fill_page_cache_func() function is invoked, it assumes that
the cache of pages is completely empty.  However, there can be some time
between triggering execution of this function and its actual invocation.
During this time, kfree_rcu_work() might run, and might fill in part or
all of this cache of pages, thus invalidating the fill_page_cache_func()
function's assumption.

This will not overfill the cache because put_cached_bnode() will reject
the extra page.  However, it will result in a needless allocation and
freeing of one extra page, which might not be helpful under lowish-memory
conditions.

This commit therefore causes the fill_page_cache_func() to explicitly
account for pages that have been placed into the cache shortly before
it starts running.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-05-09 17:26:21 -07:00
Uladzislau Rezki (Sony) 021a5ff847 rcu/kvfree: Do not run a page work if a cache is disabled
By default the cache size is 5 pages per CPU, but it can be disabled at
boot time by setting the rcu_min_cached_objs to zero.  When that happens,
the current code will uselessly set an hrtimer to schedule refilling this
cache with zero pages.  This commit therefore streamlines this process
by simply refusing the set the hrtimer when rcu_min_cached_objs is zero.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-05-09 17:26:21 -07:00
Zqiang 309a431650 rcu/kvfree: Use consistent krcp when growing kfree_rcu() page cache
The add_ptr_to_bulk_krc_lock() function is invoked to allocate a new
kfree_rcu() page, also known as a kvfree_rcu_bulk_data structure.
The kfree_rcu_cpu structure's lock is used to protect this operation,
except that this lock must be momentarily dropped when allocating memory.
It is clearly important that the lock that is reacquired be the same
lock that was acquired initially via krc_this_cpu_lock().

Unfortunately, this same krc_this_cpu_lock() function is used to
re-acquire this lock, and if the task migrated to some other CPU during
the memory allocation, this will result in the kvfree_rcu_bulk_data
structure being added to the wrong CPU's kfree_rcu_cpu structure.

This commit therefore replaces that second call to krc_this_cpu_lock()
with raw_spin_lock_irqsave() in order to explicitly acquire the lock on
the correct kfree_rcu_cpu structure, thus keeping things straight even
when the task migrates.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-05-09 17:26:21 -07:00
Zqiang 1e237994d9 rcu/kvfree: Invoke debug_rcu_bhead_unqueue() after checking bnode->gp_snap
If kvfree_rcu_bulk() sees that the required grace period has failed to
elapse, it leaks the memory because readers might still be using it.
But in that case, the debug-objects subsystem still marks the relevant
structures as having been freed, even though they are instead being
leaked.

This commit fixes this mismatch by invoking debug_rcu_bhead_unqueue()
only when we are actually going to free the objects.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-05-09 17:26:21 -07:00
Uladzislau Rezki (Sony) f32276a376 rcu/kvfree: Add debug check for GP complete for kfree_rcu_cpu list
Under low-memory conditions, kvfree_rcu() will use each object's
rcu_head structure to queue objects in a singly linked list headed by
the kfree_rcu_cpu structure's ->head field.  This list is passed to
call_rcu() as a unit, but there is no indication of which grace period
this list needs to wait for.  This in turn prevents adding debug checks
in the kfree_rcu_work() as was done for the two page-of-pointers channels
in the kfree_rcu_cpu structure.

This commit therefore adds a ->head_free_gp_snap field to the
kfree_rcu_cpu_work structure to record this grace-period number.  It also
adds a WARN_ON_ONCE() to kfree_rcu_monitor() that checks to make sure
that the required grace period has in fact elapsed.

[ paulmck: Fix kerneldoc issue raised by Stephen Rothwell. ]

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-05-09 17:26:21 -07:00
Paul E. McKenney cdfa0f6fa6 rcu/kvfree: Add debug to check grace periods
This commit adds debugging checks to verify that the required RCU
grace period has elapsed for each kvfree_rcu_bulk_data structure that
arrives at the kvfree_rcu_bulk() function.  These checks make use
of that structure's ->gp_snap field, which has been upgraded from an
unsigned long to an rcu_gp_oldstate structure.  This upgrade reduces
the chances of false positives to nearly zero, even on 32-bit systems,
for which this structure carries 64 bits of state.

Cc: Ziwei Dai <ziwei.dai@unisoc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-05-09 17:26:21 -07:00
Linus Torvalds 5dfb75e842 RCU Changes for 6.4:
o  MAINTAINERS files additions and changes.
  o  Fix hotplug warning in nohz code.
  o  Tick dependency changes by Zqiang.
  o  Lazy-RCU shrinker fixes by Zqiang.
  o  rcu-tasks stall reporting improvements by Neeraj.
  o  Initial changes for renaming of k[v]free_rcu() to its new k[v]free_rcu_mightsleep()
     name for robustness.
  o  Documentation Updates:
  o  Significant changes to srcu_struct size.
  o  Deadlock detection for srcu_read_lock() vs synchronize_srcu() from Boqun.
  o  rcutorture and rcu-related tool, which are targeted for v6.4 from Boqun's tree.
  o  Other misc changes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEcoCIrlGe4gjE06JJqA4nf2o45hAFAmQuBnIACgkQqA4nf2o4
 5hACVRAAoXu7/gfh5Pjw9O4E4pCdPJKsZZVYrcrVGrq6NAxRn6M1SgurAdC5grj2
 96x0waoGaiO82V0H5iJMcKdAVu67x9R8WaQ1JoxN75Efn8h9W4TguB87TV1gk0xS
 eZ18b/CyEaM5mNb80DFFF4FLohy5737p/kNTMqXQdUyR1BsDl16iRMgjiBiFhNUx
 yPo8Y2kC2U2OTbldZgaE7s9bQO3xxEcifx93sGWsAex/gx54FYNisiwSlCOSgOE+
 XkYo/OKk8Xvr82tLVX8XQVEPCMJ+rxea8T5zSs8/alvsPq7gA8wW3y6fsoa3vUU/
 +Gd+W+Q/OsONIDtp8rQAY1qsD0ScDpaR8052RSH0zTa7pj8HsQgE5PjZ+cJW0SEi
 cKN+Oe8+ETqKald+xZ6PDf58O212VLrru3RpQWrOQcJ7fmKmfT4REK0RcbLgg4qT
 CBgOo6eg+ub4pxq2y11LZJBNTv1/S7xAEzFE0kArew64KB2gyVud0VJRZVAJnEfe
 93QQVDFrwK2bhgWQZ6J6IbTvGeQW0L93IibuaU6jhZPR283VtUIIvM7vrOylN7Fq
 4jsae0T7YGYfKUhgTpm7rCnm8A/D3Ni8MY0sKYYgDSyKmZUsnpI5wpx1xke4lwwV
 ErrY46RCFa+k8wscc6iWfB4cGXyyFHyu+wtyg0KpFn5JAzcfz4A=
 =Rgbj
 -----END PGP SIGNATURE-----

Merge tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux

Pull RCU updates from Joel Fernandes:

 - Updates and additions to MAINTAINERS files, with Boqun being added to
   the RCU entry and Zqiang being added as an RCU reviewer.

   I have also transitioned from reviewer to maintainer; however, Paul
   will be taking over sending RCU pull-requests for the next merge
   window.

 - Resolution of hotplug warning in nohz code, achieved by fixing
   cpu_is_hotpluggable() through interaction with the nohz subsystem.

   Tick dependency modifications by Zqiang, focusing on fixing usage of
   the TICK_DEP_BIT_RCU_EXP bitmask.

 - Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n
   kernels, fixed by Zqiang.

 - Improvements to rcu-tasks stall reporting by Neeraj.

 - Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for
   increased robustness, affecting several components like mac802154,
   drbd, vmw_vmci, tracing, and more.

   A report by Eric Dumazet showed that the API could be unknowingly
   used in an atomic context, so we'd rather make sure they know what
   they're asking for by being explicit:

      https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/

 - Documentation updates, including corrections to spelling,
   clarifications in comments, and improvements to the srcu_size_state
   comments.

 - Better srcu_struct cache locality for readers, by adjusting the size
   of srcu_struct in support of SRCU usage by Christoph Hellwig.

 - Teach lockdep to detect deadlocks between srcu_read_lock() vs
   synchronize_srcu() contributed by Boqun.

   Previously lockdep could not detect such deadlocks, now it can.

 - Integration of rcutorture and rcu-related tools, targeted for v6.4
   from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis
   module parameter, and more

 - Miscellaneous changes, various code cleanups and comment improvements

* tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits)
  checkpatch: Error out if deprecated RCU API used
  mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep()
  rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep()
  ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep()
  net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep()
  net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
  rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
  rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan()
  rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early
  rcu: Remove never-set needwake assignment from rcu_report_qs_rdp()
  rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels
  rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
  rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race
  rcu/trace: use strscpy() to instead of strncpy()
  tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
  ...
2023-04-24 12:16:14 -07:00
Ziwei Dai 5da7cb193d rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period
Memory passed to kvfree_rcu() that is to be freed is tracked by a
per-CPU kfree_rcu_cpu structure, which in turn contains pointers
to kvfree_rcu_bulk_data structures that contain pointers to memory
that has not yet been handed to RCU, along with an kfree_rcu_cpu_work
structure that tracks the memory that has already been handed to RCU.
These structures track three categories of memory: (1) Memory for
kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived
during an OOM episode.  The first two categories are tracked in a
cache-friendly manner involving a dynamically allocated page of pointers
(the aforementioned kvfree_rcu_bulk_data structures), while the third
uses a simple (but decidedly cache-unfriendly) linked list through the
rcu_head structures in each block of memory.

On a given CPU, these three categories are handled as a unit, with that
CPU's kfree_rcu_cpu_work structure having one pointer for each of the
three categories.  Clearly, new memory for a given category cannot be
placed in the corresponding kfree_rcu_cpu_work structure until any old
memory has had its grace period elapse and thus has been removed.  And
the kfree_rcu_monitor() function does in fact check for this.

Except that the kfree_rcu_monitor() function checks these pointers one
at a time.  This means that if the previous kfree_rcu() memory passed
to RCU had only category 1 and the current one has only category 2, the
kfree_rcu_monitor() function will send that current category-2 memory
along immediately.  This can result in memory being freed too soon,
that is, out from under unsuspecting RCU readers.

To see this, consider the following sequence of events, in which:

o	Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset",
	then is preempted.

o	CPU 1 calls kfree_rcu(cset, rcu_head) in order to free "from_cset"
	after a later grace period.  Except that "from_cset" is freed
	right after the previous grace period ended, so that "from_cset"
	is immediately freed.  Task A resumes and references "from_cset"'s
	member, after which nothing good happens.

In full detail:

CPU 0					CPU 1
----------------------			----------------------
count_memcg_event_mm()
|rcu_read_lock()  <---
|mem_cgroup_from_task()
 |// css_set_ptr is the "from_cset" mentioned on CPU 1
 |css_set_ptr = rcu_dereference((task)->cgroups)
 |// Hard irq comes, current task is scheduled out.

					cgroup_attach_task()
					|cgroup_migrate()
					|cgroup_migrate_execute()
					|css_set_move_task(task, from_cset, to_cset, true)
					|cgroup_move_task(task, to_cset)
					|rcu_assign_pointer(.., to_cset)
					|...
					|cgroup_migrate_finish()
					|put_css_set_locked(from_cset)
					|from_cset->refcount return 0
					|kfree_rcu(cset, rcu_head) // free from_cset after new gp
					|add_ptr_to_bulk_krc_lock()
					|schedule_delayed_work(&krcp->monitor_work, ..)

					kfree_rcu_monitor()
					|krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
					|queue_rcu_work(system_wq, &krwp->rcu_work)
					|if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
					|call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request new gp

					// There is a perious call_rcu(.., rcu_work_rcufn)
					// gp end, rcu_work_rcufn() is called.
					rcu_work_rcufn()
					|__queue_work(.., rwork->wq, &rwork->work);

					|kfree_rcu_work()
					|krwp->bulk_head_free[0] bulk is freed before new gp end!!!
					|The "from_cset" is freed before new gp end.

// the task resumes some time later.
 |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed.

This commit therefore causes kfree_rcu_monitor() to refrain from moving
kfree_rcu() memory to the kfree_rcu_cpu_work structure until the RCU
grace period has completed for all three categories.

v2: Use helper function instead of inserted code block at kfree_rcu_monitor().

Fixes: 34c8817455 ("rcu: Support kfree_bulk() interface in kfree_rcu()")
Fixes: 5f3c8d6204 ("rcu/tree: Maintain separate array for vmalloc ptrs")
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-06 10:04:23 -07:00
Joel Fernandes (Google) 8ae9985774 Merge branches 'rcu/staging-core', 'rcu/staging-docs' and 'rcu/staging-kfree', remote-tracking branches 'paul/srcu-cf.2023.04.04a', 'fbq/rcu/lockdep.2023.03.27a' and 'fbq/rcu/rcutorture.2023.03.20a' into rcu/staging 2023-04-05 13:50:37 +00:00
Uladzislau Rezki (Sony) 936c7e19c6 rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep()
The kfree_rcu() and kvfree_rcu() macros' single-argument forms are
deprecated.  Therefore switch to the new kfree_rcu_mightsleep() and
kvfree_rcu_mightsleep() variants. The goal is to avoid accidental use
of the single-argument forms, which can introduce functionality bugs in
atomic contexts and latency bugs in non-atomic contexts.

Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05 13:48:04 +00:00
Zqiang 3c1566bca3 rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
For kernels built with CONFIG_PREEMPT_RCU=y, the following scenario can
result in a NULL-pointer dereference:

           CPU1                                           CPU2
rcu_preempt_deferred_qs_irqrestore                rcu_print_task_exp_stall
  if (special.b.blocked)                            READ_ONCE(rnp->exp_tasks) != NULL
    raw_spin_lock_rcu_node
    np = rcu_next_node_entry(t, rnp)
    if (&t->rcu_node_entry == rnp->exp_tasks)
      WRITE_ONCE(rnp->exp_tasks, np)
      ....
      raw_spin_unlock_irqrestore_rcu_node
                                                    raw_spin_lock_irqsave_rcu_node
                                                    t = list_entry(rnp->exp_tasks->prev,
                                                        struct task_struct, rcu_node_entry)
                                                    (if rnp->exp_tasks is NULL, this
                                                       will dereference a NULL pointer)

The problem is that CPU2 accesses the rcu_node structure's->exp_tasks
field without holding the rcu_node structure's ->lock and CPU2 did
not observe CPU1's change to rcu_node structure's ->exp_tasks in time.
Therefore, if CPU1 sets rcu_node structure's->exp_tasks pointer to NULL,
then CPU2 might dereference that NULL pointer.

This commit therefore holds the rcu_node structure's ->lock while
accessing that structure's->exp_tasks field.

[ paulmck: Apply Frederic Weisbecker feedback. ]

Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05 13:47:44 +00:00
Zheng Yejian 7a29fb4a47 rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
Registering a kprobe on __rcu_irq_enter_check_tick() can cause kernel
stack overflow as shown below. This issue can be reproduced by enabling
CONFIG_NO_HZ_FULL and booting the kernel with argument "nohz_full=",
and then giving the following commands at the shell prompt:

  # cd /sys/kernel/tracing/
  # echo 'p:mp1 __rcu_irq_enter_check_tick' >> kprobe_events
  # echo 1 > events/kprobes/enable

This commit therefore adds __rcu_irq_enter_check_tick() to the kprobes
blacklist using NOKPROBE_SYMBOL().

Insufficient stack space to handle exception!
ESR: 0x00000000f2000004 -- BRK (AArch64)
FAR: 0x0000ffffccf3e510
Task stack:     [0xffff80000ad30000..0xffff80000ad38000]
IRQ stack:      [0xffff800008050000..0xffff800008058000]
Overflow stack: [0xffff089c36f9f310..0xffff089c36fa0310]
CPU: 5 PID: 190 Comm: bash Not tainted 6.2.0-rc2-00320-g1f5abbd77e2c #19
Hardware name: linux,dummy-virt (DT)
pstate: 400003c5 (nZcv DAIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __rcu_irq_enter_check_tick+0x0/0x1b8
lr : ct_nmi_enter+0x11c/0x138
sp : ffff80000ad30080
x29: ffff80000ad30080 x28: ffff089c82e20000 x27: 0000000000000000
x26: 0000000000000000 x25: ffff089c02a8d100 x24: 0000000000000000
x23: 00000000400003c5 x22: 0000ffffccf3e510 x21: ffff089c36fae148
x20: ffff80000ad30120 x19: ffffa8da8fcce148 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: ffffa8da8e44ea6c
x14: ffffa8da8e44e968 x13: ffffa8da8e03136c x12: 1fffe113804d6809
x11: ffff6113804d6809 x10: 0000000000000a60 x9 : dfff800000000000
x8 : ffff089c026b404f x7 : 00009eec7fb297f7 x6 : 0000000000000001
x5 : ffff80000ad30120 x4 : dfff800000000000 x3 : ffffa8da8e3016f4
x2 : 0000000000000003 x1 : 0000000000000000 x0 : 0000000000000000
Kernel panic - not syncing: kernel stack overflow
CPU: 5 PID: 190 Comm: bash Not tainted 6.2.0-rc2-00320-g1f5abbd77e2c #19
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace+0xf8/0x108
 show_stack+0x20/0x30
 dump_stack_lvl+0x68/0x84
 dump_stack+0x1c/0x38
 panic+0x214/0x404
 add_taint+0x0/0xf8
 panic_bad_stack+0x144/0x160
 handle_bad_stack+0x38/0x58
 __bad_stack+0x78/0x7c
 __rcu_irq_enter_check_tick+0x0/0x1b8
 arm64_enter_el1_dbg.isra.0+0x14/0x20
 el1_dbg+0x2c/0x90
 el1h_64_sync_handler+0xcc/0xe8
 el1h_64_sync+0x64/0x68
 __rcu_irq_enter_check_tick+0x0/0x1b8
 arm64_enter_el1_dbg.isra.0+0x14/0x20
 el1_dbg+0x2c/0x90
 el1h_64_sync_handler+0xcc/0xe8
 el1h_64_sync+0x64/0x68
 __rcu_irq_enter_check_tick+0x0/0x1b8
 arm64_enter_el1_dbg.isra.0+0x14/0x20
 el1_dbg+0x2c/0x90
 el1h_64_sync_handler+0xcc/0xe8
 el1h_64_sync+0x64/0x68
 __rcu_irq_enter_check_tick+0x0/0x1b8
 [...]
 el1_dbg+0x2c/0x90
 el1h_64_sync_handler+0xcc/0xe8
 el1h_64_sync+0x64/0x68
 __rcu_irq_enter_check_tick+0x0/0x1b8
 arm64_enter_el1_dbg.isra.0+0x14/0x20
 el1_dbg+0x2c/0x90
 el1h_64_sync_handler+0xcc/0xe8
 el1h_64_sync+0x64/0x68
 __rcu_irq_enter_check_tick+0x0/0x1b8
 arm64_enter_el1_dbg.isra.0+0x14/0x20
 el1_dbg+0x2c/0x90
 el1h_64_sync_handler+0xcc/0xe8
 el1h_64_sync+0x64/0x68
 __rcu_irq_enter_check_tick+0x0/0x1b8
 el1_interrupt+0x28/0x60
 el1h_64_irq_handler+0x18/0x28
 el1h_64_irq+0x64/0x68
 __ftrace_set_clr_event_nolock+0x98/0x198
 __ftrace_set_clr_event+0x58/0x80
 system_enable_write+0x144/0x178
 vfs_write+0x174/0x738
 ksys_write+0xd0/0x188
 __arm64_sys_write+0x4c/0x60
 invoke_syscall+0x64/0x180
 el0_svc_common.constprop.0+0x84/0x160
 do_el0_svc+0x48/0xe8
 el0_svc+0x34/0xd0
 el0t_64_sync_handler+0xb8/0xc0
 el0t_64_sync+0x190/0x194
SMP: stopping secondary CPUs
Kernel Offset: 0x28da86000000 from 0xffff800008000000
PHYS_OFFSET: 0xfffff76600000000
CPU features: 0x00000,01a00100,0000421b
Memory Limit: none

Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/all/20221119040049.795065-1-zhengyejian1@huawei.com/
Fixes: aaf2bc50df ("rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05 13:47:44 +00:00
Neeraj Upadhyay a4533cc0a5 rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan()
The call to synchronize_srcu() from rcu_tasks_postscan() can be stalled
by a task getting stuck in do_exit() between that function's calls to
exit_tasks_rcu_start() and exit_tasks_rcu_finish().   To ease diagnosis
of this situation, print a stall warning message every rcu_task_stall_info
period when rcu_tasks_postscan() is stalled.

[ paulmck: Adjust to handle CONFIG_SMP=n. ]

Acked-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reported-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/rcu/20230111212736.GA1062057@paulmck-ThinkPad-P17-Gen-1/
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05 13:47:44 +00:00
Zqiang 7ea91307ad rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early
According to the commit log of the patch that added it to the kernel,
start_poll_synchronize_rcu_expedited() can be invoked very early, as
in long before rcu_init() has been invoked.  But before rcu_init(),
the rcu_data structure's ->mynode field has not yet been initialized.
This means that the start_poll_synchronize_rcu_expedited() function's
attempt to set the CPU's leaf rcu_node structure's ->exp_seq_poll_rq
field will result in a segmentation fault.

This commit therefore causes start_poll_synchronize_rcu_expedited() to
set ->exp_seq_poll_rq only after rcu_init() has initialized all CPUs'
rcu_data structures' ->mynode fields.  It also removes the check from
the rcu_init() function so that start_poll_synchronize_rcu_expedited(
is unconditionally invoked.  Yes, this might result in an unnecessary
boot-time grace period, but this is down in the noise.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05 13:47:44 +00:00
Zqiang 46103fe01b rcu: Remove never-set needwake assignment from rcu_report_qs_rdp()
The rcu_accelerate_cbs() function is invoked by rcu_report_qs_rdp()
only if there is a grace period in progress that is still blocked
by at least one CPU on this rcu_node structure.  This means that
rcu_accelerate_cbs() should never return the value true, and thus that
this function should never set the needwake variable and in turn never
invoke rcu_gp_kthread_wake().

This commit therefore removes the needwake variable and the invocation
of rcu_gp_kthread_wake() in favor of a WARN_ON_ONCE() on the call to
rcu_accelerate_cbs().  The purpose of this new WARN_ON_ONCE() is to
detect situations where the system's opinion differs from ours.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05 13:47:44 +00:00
Zqiang 2450b78e0b rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels
The lazy_rcu_shrink_count() shrinker function is registered even in
kernels built with CONFIG_RCU_LAZY=n, in which case this function
uselessly consumes cycles learning that no CPU has any lazy callbacks
queued.

This commit therefore registers this shrinker function only in the kernels
built with CONFIG_RCU_LAZY=y, where it might actually do something useful.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05 13:47:43 +00:00
Zqiang e22abe180c rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race
For kernels built with CONFIG_NO_HZ_FULL=y, the following scenario can result
in the scheduling-clock interrupt remaining enabled on a holdout CPU after
its quiescent state has been reported:

	CPU1                                                 CPU2
rcu_report_exp_cpu_mult                          synchronize_rcu_expedited_wait
   acquires rnp->lock                               mask = rnp->expmask;
                                                    for_each_leaf_node_cpu_mask(rnp, cpu, mask)
   rnp->expmask = rnp->expmask & ~mask;                rdp = per_cpu_ptr(&rcu_data, cpu1);
   for_each_leaf_node_cpu_mask(rnp, cpu, mask)
      rdp = per_cpu_ptr(&rcu_data, cpu1);
      if (!rdp->rcu_forced_tick_exp)
             continue;                                 rdp->rcu_forced_tick_exp = true;
                                                       tick_dep_set_cpu(cpu1, TICK_DEP_BIT_RCU_EXP);

The problem is that CPU2's sampling of rnp->expmask is obsolete by the
time it invokes tick_dep_set_cpu(), and CPU1 is not guaranteed to see
CPU2's store to ->rcu_forced_tick_exp in time to clear it.  And even if
CPU1 does see that store, it might invoke tick_dep_clear_cpu() before
CPU2 got around to executing its tick_dep_set_cpu(), which would still
leave the victim CPU with its scheduler-clock tick running.

Either way, an nohz_full real-time application running on the victim
CPU would have its latency needlessly degraded.

Note that expedited RCU grace periods look at context-tracking
information, and so if the CPU is executing in nohz_full usermode
throughout, that CPU cannot be victimized in this manner.

This commit therefore causes synchronize_rcu_expedited_wait to hold
the rcu_node structure's ->lock when checking for holdout CPUs, setting
TICK_DEP_BIT_RCU_EXP, and invoking tick_dep_set_cpu(), thus preventing
this race.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05 13:47:43 +00:00
Paul E. McKenney e035e8876e rcu: Remove CONFIG_SRCU
Now that all references to CONFIG_SRCU have been removed, it is time to
remove CONFIG_SRCU itself.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Petr Mladek <pmladek@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05 13:47:41 +00:00
Paul E. McKenney 09853fb89f rcu: Add comment to rcu_do_batch() identifying rcuoc code path
This commit adds a comment to help explain why the "else" clause of the
in_serving_softirq() "if" statement does not need to enforce a time limit.
The reason is that this "else" clause handles rcuoc kthreads that do not
block handlers for other softirq vectors.

Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05 13:47:18 +00:00
Joel Fernandes (Google) 754aa6427e srcu: Clarify comments on memory barrier "E"
There is an smp_mb() named "E" in srcu_flip() immediately before the
increment (flip) of the srcu_struct structure's ->srcu_idx.

The purpose of E is to order the preceding scan's read of lock counters
against the flipping of the ->srcu_idx, in order to prevent new readers
from continuing to use the old ->srcu_idx value, which might needlessly
extend the grace period.

However, this ordering is already enforced because of the control
dependency between the preceding scan and the ->srcu_idx flip.
This control dependency exists because atomic_long_read() is used
to scan the counts, because WRITE_ONCE() is used to flip ->srcu_idx,
and because ->srcu_idx is not flipped until the ->srcu_lock_count[] and
->srcu_unlock_count[] counts match.  And such a match cannot happen when
there is an in-flight reader that started before the flip (observation
courtesy Mathieu Desnoyers).

The litmus test below (courtesy of Frederic Weisbecker, with changes
for ctrldep by Boqun and Joel) shows this:

C srcu
(*
 * bad condition: P0's first scan (SCAN1) saw P1's idx=0 LOCK count inc, though P1 saw flip.
 *
 * So basically, the ->po ordering on both P0 and P1 is enforced via ->ppo
 * (control deps) on both sides, and both P0 and P1 are interconnected by ->rf
 * relations. Combining the ->ppo with ->rf, a cycle is impossible.
 *)

{}

// updater
P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
        int lock1;
        int unlock1;
        int lock0;
        int unlock0;

        // SCAN1
        unlock1 = READ_ONCE(*UNLOCK1);
        smp_mb(); // A
        lock1 = READ_ONCE(*LOCK1);

        // FLIP
        if (lock1 == unlock1) {   // Control dep
                smp_mb(); // E    // Remove E and still passes.
                WRITE_ONCE(*IDX, 1);
                smp_mb(); // D

                // SCAN2
                unlock0 = READ_ONCE(*UNLOCK0);
                smp_mb(); // A
                lock0 = READ_ONCE(*LOCK0);
        }
}

// reader
P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
        int tmp;
        int idx1;
        int idx2;

        // 1st reader
        idx1 = READ_ONCE(*IDX);
        if (idx1 == 0) {         // Control dep
                tmp = READ_ONCE(*LOCK0);
                WRITE_ONCE(*LOCK0, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK0);
                WRITE_ONCE(*UNLOCK0, tmp + 1);
        } else {
                tmp = READ_ONCE(*LOCK1);
                WRITE_ONCE(*LOCK1, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK1);
                WRITE_ONCE(*UNLOCK1, tmp + 1);
        }
}

exists (0:lock1=1 /\ 1:idx1=1)

More complicated litmus tests with multiple SRCU readers also show that
memory barrier E is not needed.

This commit therefore clarifies the comment on memory barrier E.

Why not also remove that redundant smp_mb()?

Because control dependencies are quite fragile due to their not being
recognized by most compilers and tools.  Control dependencies therefore
exact an ongoing maintenance burden, and such a burden cannot be justified
in this slowpath.  Therefore, that smp_mb() stays until such time as
its overhead becomes a measurable problem in a real workload running on
a real production system, or until such time as compilers start paying
attention to this sort of control dependency.

Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Co-developed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05 13:47:18 +00:00
Frederic Weisbecker 3636d8d114 rcu: Further comment and explain the state space of GP sequences
The state space of the GP sequence number isn't documented and the
definitions of its special values are scattered.  This commit therefore
gathers some common knowledge near the grace-period sequence-number
definitions.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05 13:47:17 +00:00
Paul E. McKenney cefc0a599b srcu: Fix long lines in srcu_funnel_gp_start()
This commit creates an srcu_usage pointer named "sup" as a shorter
synonym for the "ssp->srcu_sup" that was bloating several lines of code.

Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-04 08:37:08 -07:00
Paul E. McKenney 6c366522e1 srcu: Fix long lines in srcu_gp_end()
This commit creates an srcu_usage pointer named "sup" as a shorter
synonym for the "ssp->srcu_sup" that was bloating several lines of code.

Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-04 08:37:02 -07:00
Paul E. McKenney 5ff8319f07 srcu: Fix long lines in cleanup_srcu_struct()
This commit creates an srcu_usage pointer named "sup" as a shorter
synonym for the "ssp->srcu_sup" that was bloating several lines of code.

Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-04 08:36:57 -07:00
Paul E. McKenney eabe7625f0 srcu: Fix long lines in srcu_get_delay()
This commit creates an srcu_usage pointer named "sup" as a shorter
synonym for the "ssp->srcu_sup" that was bloating several lines of code.

Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-04 08:36:48 -07:00
Paul E. McKenney a7bf4d7c16 srcu: Check for readers at module-exit time
If a given statically allocated in-module srcu_struct structure was ever
used for updates, srcu_module_going() will invoke cleanup_srcu_struct()
at module-exit time.  This will check for the error case of SRCU readers
persisting past module-exit time.  On the other hand, if this srcu_struct
structure never went through a grace period, srcu_module_going() only
invokes free_percpu(), which would result in strange failures if SRCU
readers persisted past module-exit time.

This commit therefore adds a srcu_readers_active() check to
srcu_module_going(), splatting if readers have persisted and refraining
from invoking free_percpu() in that case.  Better to leak memory than
to suffer silent memory corruption!

[ paulmck: Apply Zhang, Qiang1 feedback on memory leak. ]

Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-04 08:36:43 -07:00
Paul E. McKenney fd1b3f8e09 srcu: Move work-scheduling fields from srcu_struct to srcu_usage
This commit moves the ->reschedule_jiffies, ->reschedule_count, and
->work fields from the srcu_struct structure to the srcu_usage structure
to reduce the size of the former in order to improve cache locality.

However, this means that the container_of() calls cannot get a pointer
to the srcu_struct because they are no longer in the srcu_struct.
This issue is addressed by adding a ->srcu_ssp field in the srcu_usage
structure that references the corresponding srcu_struct structure.
And given the presence of the sup pointer to the srcu_usage structure,
replace some ssp->srcu_usage-> instances with sup->.

[ paulmck Apply feedback from kernel test robot. ]

Link: https://lore.kernel.org/oe-kbuild-all/202303191400.iO5BOqka-lkp@intel.com/
Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-04 08:36:37 -07:00
Paul E. McKenney d20162e0bf srcu: Move srcu_barrier() fields from srcu_struct to srcu_usage
This commit moves the ->srcu_barrier_seq, ->srcu_barrier_mutex,
->srcu_barrier_completion, and ->srcu_barrier_cpu_cnt fields from the
srcu_struct structure to the srcu_usage structure to reduce the size of
the former in order to improve cache locality.

Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-04 08:36:33 -07:00
Paul E. McKenney 660349ac79 srcu: Move ->sda_is_static from srcu_struct to srcu_usage
This commit moves the ->sda_is_static field from the srcu_struct structure
to the srcu_usage structure to reduce the size of the former in order
to improve cache locality.

Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-04 08:36:28 -07:00
Paul E. McKenney 3b46679c62 srcu: Move heuristics fields from srcu_struct to srcu_usage
This commit moves the ->srcu_size_jiffies, ->srcu_n_lock_retries,
and ->srcu_n_exp_nodelay fields from the srcu_struct structure to the
srcu_usage structure to reduce the size of the former in order to improve
cache locality.

Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-04 08:36:23 -07:00
Paul E. McKenney 03200b5ca3 srcu: Move grace-period fields from srcu_struct to srcu_usage
This commit moves the ->srcu_gp_seq, ->srcu_gp_seq_needed,
->srcu_gp_seq_needed_exp, ->srcu_gp_start, and ->srcu_last_gp_end fields
from the srcu_struct structure to the srcu_usage structure to reduce
the size of the former in order to improve cache locality.

Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-04 08:36:19 -07:00
Paul E. McKenney e3a6ab25cf srcu: Move ->srcu_gp_mutex from srcu_struct to srcu_usage
This commit moves the ->srcu_gp_mutex field from the srcu_struct structure
to the srcu_usage structure to reduce the size of the former in order
to improve cache locality.

Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-04 08:36:14 -07:00
Paul E. McKenney b3fb11f7e9 srcu: Move ->lock from srcu_struct to srcu_usage
This commit moves the ->lock field from the srcu_struct structure to
the srcu_usage structure to reduce the size of the former in order to
improve cache locality.

Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-04 08:36:07 -07:00
Paul E. McKenney 0839ade94b srcu: Move ->lock initialization after srcu_usage allocation
Currently, both __init_srcu_struct() in CONFIG_DEBUG_LOCK_ALLOC=y kernels
and init_srcu_struct() in CONFIG_DEBUG_LOCK_ALLOC=n kernel initialize
the srcu_struct structure's ->lock before the srcu_usage structure has
been allocated.  This of course prevents the ->lock from being moved
to the srcu_usage structure, so this commit moves the initialization
into the init_srcu_struct_fields() after the srcu_usage structure has
been allocated.

Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-04 08:36:01 -07:00
Paul E. McKenney 574dc1a7ef srcu: Move ->srcu_cb_mutex from srcu_struct to srcu_usage
This commit moves the ->srcu_cb_mutex field from the srcu_struct structure
to the srcu_usage structure to reduce the size of the former in order
to improve cache locality.

Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-04 08:35:56 -07:00
Paul E. McKenney a0d8cbd382 srcu: Move ->srcu_size_state from srcu_struct to srcu_usage
This commit moves the ->srcu_size_state field from the srcu_struct
structure to the srcu_usage structure to reduce the size of the former
in order to improve cache locality.

Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-04 08:35:50 -07:00
Paul E. McKenney 208f41b131 srcu: Move ->level from srcu_struct to srcu_usage
This commit moves the ->level[] array from the srcu_struct structure to
the srcu_usage structure to reduce the size of the former in order to
improve cache locality.

Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-04 08:35:44 -07:00
Paul E. McKenney 95433f7263 srcu: Begin offloading srcu_struct fields to srcu_update
The current srcu_struct structure is on the order of 200 bytes in size
(depending on architecture and .config), which is much better than the
old-style 26K bytes, but still all too inconvenient when one is trying
to achieve good cache locality on a fastpath involving SRCU readers.

However, only a few fields in srcu_struct are used by SRCU readers.
The remaining fields could be offloaded to a new srcu_update
structure, thus shrinking the srcu_struct structure down to a few
tens of bytes.  This commit begins this noble quest, a quest that is
complicated by open-coded initialization of the srcu_struct within the
srcu_notifier_head structure.  This complication is addressed by updating
the srcu_notifier_head structure's open coding, given that there does
not appear to be a straightforward way of abstracting that initialization.

This commit moves only the ->node pointer to srcu_update.  Later commits
will move additional fields.

[ paulmck: Fold in qiang1.zhang@intel.com's memory-leak fix. ]

Link: https://lore.kernel.org/all/20230320055751.4120251-1-qiang1.zhang@intel.com/
Suggested-by: Christoph Hellwig <hch@lst.de>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-04 08:35:34 -07:00
Paul E. McKenney f4d01a2593 srcu: Use static init for statically allocated in-module srcu_struct
Further shrinking the srcu_struct structure is eased by requiring
that in-module srcu_struct structures rely more heavily on static
initialization.  In particular, this preserves the property that
a module-load-time srcu_struct initialization can fail only due
to memory-allocation failure of the per-CPU srcu_data structures.
It might also slightly improve robustness by keeping the number of memory
allocations that must succeed down percpu_alloc() call.

This is in preparation for splitting an srcu_usage structure out
of the srcu_struct structure.

[ paulmck: Fold in qiang1.zhang@intel.com feedback. ]

Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-04 08:35:28 -07:00
Paul E. McKenney 2b4be54830 rcu-tasks: Fix warning for unused tasks_rcu_exit_srcu
The tasks_rcu_exit_srcu variable is used only by kernels built
with CONFIG_TASKS_RCU=y, but is defined for all kernesl with
CONFIG_TASKS_RCU_GENERIC=y.  Therefore, in kernels built with
CONFIG_TASKS_RCU_GENERIC=y but CONFIG_TASKS_RCU=n, this gives
a "defined but not used" warning.

This commit therefore moves this variable under CONFIG_TASKS_RCU.

Link: https://lore.kernel.org/oe-kbuild-all/202303191536.XzMSyzTl-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-04 08:34:47 -07:00
Paul E. McKenney 6e2044887b rcutorture: Add RCU Tasks Trace and SRCU deadlock scenarios
Add a test number 3 that creates deadlock cycles involving one RCU
Tasks Trace step and L-1 SRCU steps.  Please note that lockdep will not
detect these deadlocks until synchronize_rcu_tasks_trace() is marked
with lockdep's new "sync" annotation, which will probably not happen
until some time after these markings prove their worth on SRCU.

Please note that these tests are available only in kernels built with
CONFIG_TASKS_TRACE_RCU=y.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2023-03-27 11:16:12 -07:00
Paul E. McKenney d94f12e8a4 rcutorture: Add SRCU deadlock scenarios
In order to test the new SRCU-lockdep functionality, this commit adds
an rcutorture.test_srcu_lockdep module parameter that, when non-zero,
selects an SRCU deadlock scenario to execute.  This parameter is a
five-digit number formatted as DNNL, where "D" is 1 to force a deadlock
and 0 to avoid doing so; "NN" is the test number, 0 for SRCU-based, 1
for SRCU/mutex-based, and 2 for SRCU/rwsem-based; and "L" is the number
of steps in the deadlock cycle.

Note that rcutorture.test_srcu_lockdep=1 will also force a hard hang.

If a non-zero value of rcutorture.test_srcu_lockdep does not select a
deadlock scenario, a console message is printed and testing continues.

[ paulmck: Apply kernel test robot feedback, add rwsem support. ]
[ paulmck: Apply Dan Carpenter feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2023-03-27 11:16:10 -07:00
Boqun Feng f0f44752f5 rcu: Annotate SRCU's update-side lockdep dependencies
Although all flavors of RCU readers are annotated correctly with
lockdep as recursive read locks, they do not set the lock_acquire
'check' parameter.  This means that RCU read locks are not added to
the lockdep dependency graph, which in turn means that lockdep cannot
detect RCU-based deadlocks.  This is not a problem for RCU flavors having
atomic read-side critical sections because context-based annotations can
catch these deadlocks, see for example the RCU_LOCKDEP_WARN() statement
in synchronize_rcu().  But context-based annotations are not helpful
for sleepable RCU, especially given that it is perfectly legal to do
synchronize_srcu(&srcu1) within an srcu_read_lock(&srcu2).

However, we can detect SRCU-based by: (1) Making srcu_read_lock() a
'check'ed recursive read lock and (2) Making synchronize_srcu() a empty
write lock critical section.  Even better, with the newly introduced
lock_sync(), we can avoid false positives about irq-unsafe/safe.
This commit therefore makes it so.

Note that NMI-safe SRCU read side critical sections are currently not
annotated, but might be annotated in the future.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ boqun: Add comments for annotation per Waiman's suggestion ]
[ boqun: Fix comment warning reported by Stephen Rothwell ]
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2023-03-27 11:15:59 -07:00
Paul E. McKenney 6bc6e6b275 refscale: Move shutdown from wait_event() to wait_event_idle()
The ref_scale_shutdown() kthread/function uses wait_event() to wait for
the refscale test to complete.  However, although the read-side tests
are normally extremely fast, there is no law against specifying a very
large value for the refscale.loops module parameter or against having
a slow read-side primitive.  Either way, this might well trigger the
hung-task timeout.

This commit therefore replaces those wait_event() calls with calls to
wait_event_idle(), which do not trigger the hung-task timeout.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2023-03-20 11:12:16 -07:00
Paul E. McKenney ef1ef3d476 rcuscale: Move shutdown from wait_event() to wait_event_idle()
The rcu_scale_shutdown() and kfree_scale_shutdown() kthreads/functions
use wait_event() to wait for the rcuscale test to complete.  However,
each updater thread in such a test waits for at least 100 grace periods.
If each grace period takes more than 1.2 seconds, which is long, but
not insanely so, this can trigger the hung-task timeout.

This commit therefore replaces those wait_event() calls with calls to
wait_event_idle(), which do not trigger the hung-task timeout.

Reported-by: kernel test robot <yujie.liu@intel.com>
Reported-by: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yujie Liu <yujie.liu@intel.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2023-03-20 11:12:16 -07:00
Zqiang 4f02ac2378 rcutorture: Create nocb kthreads only when testing rcu in CONFIG_RCU_NOCB_CPU=y kernels
Given a non-zero rcutorture.nocbs_nthreads module parameter, the specified
number of nocb kthreads will be created, regardless of whether or not
the RCU implementation under test is capable of offloading callbacks.
Please note that even vanilla RCU is incapable of offloading in kernels
built with CONFIG_RCU_NOCB_CPU=n.  And when the RCU implementation is
incapable of offloading callbacks, there is no point in creating those
kthreads.

This commit therefore checks the cur_ops.torture_type module parameter and
CONFIG_RCU_NOCB_CPU Kconfig option in order to avoid creating unnecessary
nocb tasks.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ boqun: Fix checkpatch warning ]
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2023-03-20 11:11:44 -07:00
Yue Hu fdcf87a3df rcutorture: Eliminate variable n_rcu_torture_boost_rterror
After commit 8b700983de ("sched: Remove sched_set_*() return value"),
this variable is not used anymore. So eliminate it entirely.

Signed-off-by: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2023-03-11 18:11:13 -08:00
Paul E. McKenney 808a9d6759 rcutorture: Add test_nmis module parameter
This commit adds a test_nmis module parameter to generate the
specified number of NMI stack backtraces 15 seconds apart.  This module
parameter can be used to test NMI delivery and accompanying diagnostics.
Note that this parameter is ignored when rcutorture is a module rather
than built into the kernel.  This could be changed with the addition of
an EXPORT_SYMBOL_GPL().

[ paulmck: Apply kernel test robot feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2023-03-11 18:03:41 -08:00
Paul E. McKenney bba8d3d17d Merge branch 'stall.2023.01.09a' into HEAD
stall.2023.01.09a: RCU CPU stall-warning updates.
2023-02-02 16:40:07 -08:00
Paul E. McKenney 8e1704b6a8 Merge branches 'doc.2023.01.05a', 'fixes.2023.01.23a', 'kvfree.2023.01.03a', 'srcu.2023.01.03a', 'srcu-always.2023.02.02a', 'tasks.2023.01.03a', 'torture.2023.01.05a' and 'torturescript.2023.01.03a' into HEAD
doc.2023.01.05a: Documentation update.
fixes.2023.01.23a: Miscellaneous fixes.
kvfree.2023.01.03a: kvfree_rcu() updates.
srcu.2023.01.03a: SRCU updates.
srcu-always.2023.02.02a: Finish making SRCU be unconditionally available.
tasks.2023.01.03a: Tasks-RCU updates.
torture.2023.01.05a: Torture-test updates.
torturescript.2023.01.03a: Torture-test scripting updates.
2023-02-02 16:33:43 -08:00
Joel Fernandes (Google) cf7066b97e rcu: Disable laziness if lazy-tracking says so
During suspend, we see failures to suspend 1 in 300-500 suspends.
Looking closer, it appears that asynchronous RCU callbacks are being
queued as lazy even though synchronous callbacks are expedited. These
delays appear to not be very welcome by the suspend/resume code as
evidenced by these occasional suspend failures.

This commit modifies call_rcu() to check if rcu_async_should_hurry(),
which will return true if we are in suspend or in-kernel boot.

[ paulmck: Alphabetize local variables. ]

Ignoring the lazy hint makes the 3000 suspend/resume cycles pass
reliably on a 12th gen 12-core Intel CPU, and there is some evidence
that it also slightly speeds up boot performance.

Fixes: 3cb278e73b ("rcu: Make call_rcu() lazy to save power")
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-23 16:51:29 -08:00
Joel Fernandes (Google) 6efdda8bec rcu: Track laziness during boot and suspend
Boot and suspend/resume should not be slowed down in kernels built with
CONFIG_RCU_LAZY=y.  In particular, suspend can sometimes fail in such
kernels.

This commit therefore adds rcu_async_hurry(), rcu_async_relax(), and
rcu_async_should_hurry() functions that track whether or not either
a boot or a suspend/resume operation is in progress.  This will
enable a later commit to refrain from laziness during those times.

Export rcu_async_should_hurry(), rcu_async_hurry(), and rcu_async_relax()
for later use by rcutorture.

[ paulmck: Apply feedback from Steve Rostedt. ]

Fixes: 3cb278e73b ("rcu: Make call_rcu() lazy to save power")
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-17 20:20:11 -08:00
Zqiang ccfe1fef94 rcu: Remove redundant call to rcu_boost_kthread_setaffinity()
The rcu_boost_kthread_setaffinity() function is invoked at
rcutree_online_cpu() and rcutree_offline_cpu() time, early in the online
timeline and late in the offline timeline, respectively.  It is also
invoked from rcutree_dead_cpu(), however, in the absence of userspace
manipulations (for which userspace must take responsibility), this call
is redundant with that from rcutree_offline_cpu().  This redundancy can
be demonstrated by printing out the relevant cpumasks

This commit therefore removes the call to rcu_boost_kthread_setaffinity()
from rcutree_dead_cpu().

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
2023-01-12 11:30:11 -08:00
Paul E. McKenney 84ec7c2036 rcu: Allow up to five minutes expedited RCU CPU stall-warning timeouts
The maximum value of RCU CPU stall-warning timeouts has historically been
five minutes (300 seconds).  However, the recently introduced expedited
RCU CPU stall-warning timeout is instead limited to 21 seconds.  This
causes problems for CI/fuzzing services such as syzkaller by obscuring
the issue in question with expedited RCU CPU stall-warning timeout splats.

This commit therefore sets the RCU_EXP_CPU_STALL_TIMEOUT Kconfig options
upper bound to 300000 milliseconds, which is 300 seconds (AKA 5 minutes).

[ paulmck: Apply feedback from Hillf Danton. ]
[ paulmck: Apply feedback from Geert Uytterhoeven. ]

Reported-by: Dave Chinner <david@fromorbit.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-09 12:09:52 -08:00
Zhen Lei 3ab955de92 rcu: Align the output of RCU CPU stall warning messages
Time stamps are added to the output in kernels built with
CONFIG_PRINTK_TIME=y, which causes misaligned output.  Therefore,
replace pr_cont() with pr_err(), which fixes alignment and gets
rid of a couple of despised pr_cont() calls.

Before:
[   37.567343] rcu: INFO: rcu_preempt self-detected stall on CPU
[   37.567839] rcu:     0-....: (1500 ticks this GP) idle=***
[   37.568270]  (t=1501 jiffies g=4717 q=28 ncpus=4)
[   37.568668] CPU: 0 PID: 313 Comm: test0 Not tainted 6.1.0-rc4 #8

After:
[   36.762074] rcu: INFO: rcu_preempt self-detected stall on CPU
[   36.762543] rcu:     0-....: (1499 ticks this GP) idle=***
[   36.763003] rcu:     (t=1500 jiffies g=5097 q=27 ncpus=4)
[   36.763522] CPU: 0 PID: 313 Comm: test0 Not tainted 6.1.0-rc4 #9

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-05 12:21:11 -08:00
Zhen Lei be42f00b73 rcu: Add RCU stall diagnosis information
Because RCU CPU stall warnings are driven from the scheduling-clock
interrupt handler, a workload consisting of a very large number of
short-duration hardware interrupts can result in misleading stall-warning
messages.  On systems supporting only a single level of interrupts,
that is, where interrupts handlers cannot be interrupted, this can
produce misleading diagnostics.  The stack traces will show the
innocent-bystander interrupted task, not the interrupts that are
at the very least exacerbating the stall.

This situation can be improved by displaying the number of interrupts
and the CPU time that they have consumed.  Diagnosing other types
of stalls can be eased by also providing the count of softirqs and
the CPU time that they consumed as well as the number of context
switches and the task-level CPU time consumed.

Consider the following output given this change:

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu:     0-....: (1250 ticks this GP) <omitted>
rcu:          hardirqs   softirqs   csw/system
rcu:  number:      624         45            0
rcu: cputime:       69          1         2425   ==> 2500(ms)

This output shows that the number of hard and soft interrupts is small,
there are no context switches, and the system takes up a lot of time. This
indicates that the current task is looping with preemption disabled.

The impact on system performance is negligible because snapshot is
recorded only once for all continuous RCU stalls.

This added debugging information is suppressed by default and can be
enabled by building the kernel with CONFIG_RCU_CPU_STALL_CPUTIME=y or
by booting with rcupdate.rcu_cpu_stall_cputime=1.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-05 12:21:11 -08:00
Paul E. McKenney 273661595c rcutorture: Drop sparse lock-acquisition annotations
The sparse __acquires() and __releases() annotations provide very
little value.  The argument is ignored, so sparse cannot tell the
differences between acquiring one lock and releasing another on the one
hand and acquiring and releasing a given lock on the other.  In addition,
lockdep annotations provide much more precision, for but one example,
actually knowing which lock is held.

This commit therefore removes the __acquires() and __releases()
annotations from rcutorture.

Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-05 12:10:35 -08:00
Paul E. McKenney a6889becb0 refscale: Add tests using SLAB_TYPESAFE_BY_RCU
This commit adds three read-side-only tests of three use cases featuring
SLAB_TYPESAFE_BY_RCU: One using per-object reference counting, one using
per-object locking, and one using per-object sequence locking.

[ paulmck: Apply feedback from kernel test robot. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-05 12:09:42 -08:00
Paul E. McKenney 3c6496c86e refscale: Provide for initialization failure
Current tests all have init() functions that are guaranteed to succeed.
But upcoming tests will need to allocate memory, thus possibly failing.
This commit therefore handles init() function failure.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:52:54 -08:00
Zqiang a4fcfbee8f rcu-tasks: Handle queue-shrink/callback-enqueue race condition
The rcu_tasks_need_gpcb() determines whether or not: (1) There are
callbacks needing another grace period, (2) There are callbacks ready
to be invoked, and (3) It would be a good time to shrink back down to a
single-CPU callback list.  This third case is interesting because some
other CPU might be adding new callbacks, which might suddenly make this
a very bad time to be shrinking.

This is currently handled by requiring call_rcu_tasks_generic() to
enqueue callbacks under the protection of rcu_read_lock() and requiring
rcu_tasks_need_gpcb() to wait for an RCU grace period to elapse before
finalizing the transition.  This works well in practice.

Unfortunately, the current code assumes that a grace period whose end is
detected by the poll_state_synchronize_rcu() in the second "if" condition
actually ended before the earlier code counted the callbacks queued on
CPUs other than CPU 0 (local variable "ncbsnz").  Given the current code,
it is possible that a long-delayed call_rcu_tasks_generic() invocation
will queue a callback on a non-zero CPU after these CPUs have had their
callbacks counted and zero has been stored to ncbsnz.  Such a callback
would trigger the WARN_ON_ONCE() in the second "if" statement.

To see this, consider the following sequence of events:

o	CPU 0 invokes rcu_tasks_one_gp(), and counts fewer than
	rcu_task_collapse_lim callbacks.  It sees at least one
	callback queued on some other CPU, thus setting ncbsnz
	to a non-zero value.

o	CPU 1 invokes call_rcu_tasks_generic() and loads 42 from
	->percpu_enqueue_lim.  It therefore decides to enqueue its
	callback onto CPU 1's callback list, but is delayed.

o	CPU 0 sees the rcu_task_cb_adjust is non-zero and that the number
	of callbacks does not exceed rcu_task_collapse_lim.  It therefore
	checks percpu_enqueue_lim, and sees that its value is greater
	than the value one.  CPU 0 therefore  starts the shift back
	to a single callback list.  It sets ->percpu_enqueue_lim to 1,
	but CPU 1 has already read the old value of 42.  It also gets
	a grace-period state value from get_state_synchronize_rcu().

o	CPU 0 sees that ncbsnz is non-zero in its second "if" statement,
	so it declines to finalize the shrink operation.

o	CPU 0 again invokes rcu_tasks_one_gp(), and counts fewer than
	rcu_task_collapse_lim callbacks.  It also sees that there are
	no callback queued on any other CPU, and thus sets ncbsnz to zero.

o	CPU 1 resumes execution and enqueues its callback onto its own
	list.  This invalidates the value of ncbsnz.

o	CPU 0 sees the rcu_task_cb_adjust is non-zero and that the number
	of callbacks does not exceed rcu_task_collapse_lim.  It therefore
	checks percpu_enqueue_lim, but sees that its value is already
	unity.	It therefore does not get a new grace-period state value.

o	CPU 0 sees that rcu_task_cb_adjust is non-zero, ncbsnz is zero,
	and that poll_state_synchronize_rcu() says that the grace period
	has completed.  it therefore finalizes the shrink operation,
	setting ->percpu_dequeue_lim to the value one.

o	CPU 0 does a debug check, scanning the other CPUs' callback lists.
	It sees that CPU 1's list has a callback, so it (rightly)
	triggers the WARN_ON_ONCE().  After all, the new value of
	->percpu_dequeue_lim says to not bother looking at CPU 1's
	callback list, which means that this callback will never be
	invoked.  This can result in hangs and maybe even OOMs.

Based on long experience with rcutorture, this is an extremely
low-probability race condition, but it really can happen, especially in
preemptible kernels or within guest OSes.

This commit therefore checks for completion of the grace period
before counting callbacks.  With this change, in the above failure
scenario CPU 0 would know not to prematurely end the shrink operation
because the grace period would not have completed before the count
operation started.

[ paulmck: Adjust grace-period end rather than adding RCU reader. ]
[ paulmck: Avoid spurious WARN_ON_ONCE() with ->percpu_dequeue_lim check. ]

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:52:17 -08:00
Zqiang ea5c8987fe rcu-tasks: Make rude RCU-Tasks work well with CPU hotplug
The synchronize_rcu_tasks_rude() function invokes rcu_tasks_rude_wait_gp()
to wait one rude RCU-tasks grace period.  The rcu_tasks_rude_wait_gp()
function in turn checks if there is only a single online CPU.  If so, it
will immediately return, because a call to synchronize_rcu_tasks_rude()
is by definition a grace period on a single-CPU system.  (We could
have blocked!)

Unfortunately, this check uses num_online_cpus() without synchronization,
which can result in too-short grace periods.  To see this, consider the
following scenario:

        CPU0                                   CPU1 (going offline)
                                          migration/1 task:
                                      cpu_stopper_thread
                                       -> take_cpu_down
                                          -> _cpu_disable
                                           (dec __num_online_cpus)
                                          ->cpuhp_invoke_callback
                                                preempt_disable
                                                access old_data0
           task1
 del old_data0                                  .....
 synchronize_rcu_tasks_rude()
 task1 schedule out
 ....
 task2 schedule in
 rcu_tasks_rude_wait_gp()
     ->__num_online_cpus == 1
       ->return
 ....
 task1 schedule in
 ->free old_data0
                                                preempt_enable

When CPU1 decrements __num_online_cpus, its value becomes 1.  However,
CPU1 has not finished going offline, and will take one last trip through
the scheduler and the idle loop before it actually stops executing
instructions.  Because synchronize_rcu_tasks_rude() is mostly used for
tracing, and because both the scheduler and the idle loop can be traced,
this means that CPU0's prematurely ended grace period might disrupt the
tracing on CPU1.  Given that this disruption might include CPU1 executing
instructions in memory that was just now freed (and maybe reallocated),
this is a matter of some concern.

This commit therefore removes that problematic single-CPU check from the
rcu_tasks_rude_wait_gp() function.  This dispenses with the single-CPU
optimization, but there is no evidence indicating that this optimization
is important.  In addition, synchronize_rcu_tasks_generic() contains a
similar optimization (albeit only for early boot), which also splats.
(As in exactly why are you invoking synchronize_rcu_tasks_rude() so
early in boot, anyway???)

It is OK for the synchronize_rcu_tasks_rude() function's check to be
unsynchronized because the only times that this check can evaluate to
true is when there is only a single CPU running with preemption
disabled.

While in the area, this commit also fixes a minor bug in which a
call to synchronize_rcu_tasks_rude() would instead be attributed to
synchronize_rcu_tasks().

[ paulmck: Add "synchronize_" prefix and "()" suffix. ]

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:52:17 -08:00
Frederic Weisbecker 28319d6dc5 rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()
RCU Tasks and PID-namespace unshare can interact in do_exit() in a
complicated circular dependency:

1) TASK A calls unshare(CLONE_NEWPID), this creates a new PID namespace
   that every subsequent child of TASK A will belong to. But TASK A
   doesn't itself belong to that new PID namespace.

2) TASK A forks() and creates TASK B. TASK A stays attached to its PID
   namespace (let's say PID_NS1) and TASK B is the first task belonging
   to the new PID namespace created by unshare()  (let's call it PID_NS2).

3) Since TASK B is the first task attached to PID_NS2, it becomes the
   PID_NS2 child reaper.

4) TASK A forks() again and creates TASK C which get attached to PID_NS2.
   Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has
   TASK B (belonging to PID_NS2) as a pid_namespace child_reaper.

5) TASK B exits and since it is the child reaper for PID_NS2, it has to
   kill all other tasks attached to PID_NS2, and wait for all of them to
   die before getting reaped itself (zap_pid_ns_process()).

6) TASK A calls synchronize_rcu_tasks() which leads to
   synchronize_srcu(&tasks_rcu_exit_srcu).

7) TASK B is waiting for TASK C to get reaped. But TASK B is under a
   tasks_rcu_exit_srcu SRCU critical section (exit_notify() is between
   exit_tasks_rcu_start() and exit_tasks_rcu_finish()), blocking TASK A.

8) TASK C exits and since TASK A is its parent, it waits for it to reap
   TASK C, but it can't because TASK A waits for TASK B that waits for
   TASK C.

Pid_namespace semantics can hardly be changed at this point. But the
coverage of tasks_rcu_exit_srcu can be reduced instead.

The current task is assumed not to be concurrently reapable at this
stage of exit_notify() and therefore tasks_rcu_exit_srcu can be
temporarily relaxed without breaking its constraints, providing a way
out of the deadlock scenario.

[ paulmck: Fix build failure by adding additional declaration. ]

Fixes: 3f95aa81d2 ("rcu: Make TASKS_RCU handle tasks that are almost done exiting")
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Suggested-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Eric W . Biederman <ebiederm@xmission.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:52:16 -08:00
Frederic Weisbecker 4475709295 rcu-tasks: Remove preemption disablement around srcu_read_[un]lock() calls
Ever since the following commit:

	5a41344a3d ("srcu: Simplify __srcu_read_unlock() via this_cpu_dec()")

SRCU doesn't rely anymore on preemption to be disabled in order to
modify the per-CPU counter. And even then it used to be done from the API
itself.

Therefore and after checking further, it appears to be safe to remove
the preemption disablement around __srcu_read_[un]lock() in
exit_tasks_rcu_start() and exit_tasks_rcu_finish()

Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Suggested-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:52:16 -08:00
Frederic Weisbecker e4e1e8089c rcu-tasks: Improve comments explaining tasks_rcu_exit_srcu purpose
Make sure we don't need to look again into the depths of git blame in
order not to miss a subtle part about how rcu-tasks is dealing with
exiting tasks.

Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Suggested-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:52:16 -08:00
Zqiang 9420fb934c rcu-tasks: Use accurate runstart time for RCU Tasks boot-time testing
Currently, test_rcu_tasks_callback() reads from the jiffies counter only
once when this function is invoked.  This introduces inaccuracies because
of the latencies induced by the synchronize_rcu_tasks*() invocations.
This commit therefore re-reads the jiffies counter at the beginning
of each test, thus avoiding penalizing later tests for the latencies
induced by earlier tests.

Therefore, this commit at the start of each RCU Tasks test, re-fetch the
jiffies time as the runstart time.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:52:16 -08:00
Paul E. McKenney dafc4d1603 srcu: Update comment after the index flip
Because there is not guaranteed to be a full memory barrier between
the ->srcu_unlock_count increment of an srcu_read_unlock() and the
->srcu_lock_count increment of the next srcu_read_lock(), this next
srcu_read_lock() is not guaranteed to see the effect of the index flip
just prior to this comment.  However, this next srcu_read_lock() will
execute a full memory barrier, so the srcu_read_lock() after that is
guaranteed to see that index flip.

This guarantee is illustrated by the following diagram of events and
the litmus test following that.

------------------------------------------------------------------------

READER                  UPDATER
-------------           ----------
                           // idx is initially 0.

                           srcu_flip() {
                              smp_mb();
// RSCS

srcu_read_unlock() {
  smp_mb();
                              idx++;    // P
                              smp_mb(); // QQ
                           }

                           srcu_readers_unlock_idx(0) {
        ,--counted------------ count all unlock[0]; // Q
        |
  unlock[0]++;  // X

}
                               smp_mb();
srcu_read_lock() {
  READ(idx) = 0;         ,---- count all lock[0]; // contributes imbalance of 1.
  lock[0]++;  ----counted              |
  smp_mb(); // PP          }           |
}                                      |
                                       |
// RSCS                             not going to effect above scan
                                       |
srcu_read_unlock() {                   |
  smp_mb();                            |
  unlock[0]++;                         |
}                                      |
                                      /
                                     /
srcu_read_lock() {                  |
  READ(idx);  // Y  -----cannot be counted because of P (has to sample idx as 1)
  lock[1]++;
  ...
}

------------------------------------------------------------------------

This makes it similar to the store buffer pattern. Using X, Y, P and Q
annotated above, we get:

------------------------------------------------------------------------

READER                    UPDATER
X (write)                 P (write)

smp_mb(); //PP            smp_mb(); //QQ

Y (read)                  Q (read)

------------------------------------------------------------------------

ASCII art courtesy of Joel Fernandes.

Reported-by: Joel Fernandes <joel@joelfernandes.org>
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:49:23 -08:00
Paul E. McKenney 0cd4b50b12 srcu: Yet more detail for srcu_readers_active_idx_check() comments
The comment in srcu_readers_active_idx_check() following the smp_mb()
is out of date, hailing from a simpler time when preemption was disabled
across the bulk of __srcu_read_lock().  The fact that preemption was
disabled meant that the number of tasks that had fetched the old index
but not yet incremented counters was limited by the number of CPUs.

In our more complex modern times, the number of CPUs is no longer a limit.
This commit therefore updates this comment, additionally giving more
memory-ordering detail.

[ paulmck: Apply Nt->Nc feedback from Joel Fernandes. ]

Reported-by: Boqun Feng <boqun.feng@gmail.com>
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Reported-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Reported-by: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Reported-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:49:23 -08:00
Pingfan Liu 1bafbfb3e1 srcu: Remove needless rcu_seq_done() check while holding read lock
The srcu_gp_start_if_needed() function now read-holds the srcu_struct
whose grace period is being started, which means that the corresponding
SRCU grace period cannot end.  This in turn means that the SRCU
grace-period sequence number returned by rcu_seq_snap() cannot expire
during this time.  And that means that the calls to rcu_seq_done() in
srcu_funnel_exp_start() and srcu_funnel_gp_start() can never return true.

This commit therefore removes these rcu_seq_done() checks, but adds checks
in kernels built with CONFIG_PROVE_RCU=y that splats if rcu_seq_done()
does somehow return true.

[ paulmck: Rearrange checks to handle kernels built with lockdep. ]

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: rcu@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:49:23 -08:00
Paul E. McKenney efa3c40cfa rcu: Add test code for semaphore-like SRCU readers
This commit adds trivial test code for srcu_down_read() and
srcu_up_read().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:49:22 -08:00
Pingfan Liu 50be0c0439 srcu: Fix the comparision in srcu_invl_snp_seq()
A grace-period sequence number contains two fields: counter and
state.  SRCU_SNP_INIT_SEQ provides a guaranteed invalid value for
grace-period sequence numbers in newly allocated srcu_node structures'
->srcu_have_cbs[] and ->srcu_gp_seq_needed_exp fields.  The point of the
comparison in srcu_invl_snp_seq() is not to detect invalid grace-period
sequence numbers in general, but rather to detect a newly allocated
srcu_node structure whose ->srcu_have_cbs[] and ->srcu_gp_seq_needed_exp
fields need to be brought into line with the srcu_struct structure's
->srcu_gp_seq field.

This commit therefore causes srcu_invl_snp_seq() to compare both fields
of the specified grace-period sequence number.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: <rcu@vger.kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:49:22 -08:00
Pingfan Liu 7f24626d6d srcu: Delegate work to the boot cpu if using SRCU_SIZE_SMALL
Commit 994f706872 ("srcu: Make Tree SRCU able to operate without
snp_node array") assumes that cpu 0 is always online.  However, there
really are situations when some other CPU is the boot CPU, for example,
when booting a kdump kernel with the maxcpus=1 boot parameter.

On PowerPC, the kdump kernel can hang as follows:
...
[    1.740036] systemd[1]: Hostname set to <xyz.com>
[  243.686240] INFO: task systemd:1 blocked for more than 122 seconds.
[  243.686264]       Not tainted 6.1.0-rc1 #1
[  243.686272] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.686281] task:systemd         state:D stack:0     pid:1     ppid:0      flags:0x00042000
[  243.686296] Call Trace:
[  243.686301] [c000000016657640] [c000000016657670] 0xc000000016657670 (unreliable)
[  243.686317] [c000000016657830] [c00000001001dec0] __switch_to+0x130/0x220
[  243.686333] [c000000016657890] [c000000010f607b8] __schedule+0x1f8/0x580
[  243.686347] [c000000016657940] [c000000010f60bb4] schedule+0x74/0x140
[  243.686361] [c0000000166579b0] [c000000010f699b8] schedule_timeout+0x168/0x1c0
[  243.686374] [c000000016657a80] [c000000010f61de8] __wait_for_common+0x148/0x360
[  243.686387] [c000000016657b20] [c000000010176bb0] __flush_work.isra.0+0x1c0/0x3d0
[  243.686401] [c000000016657bb0] [c0000000105f2768] fsnotify_wait_marks_destroyed+0x28/0x40
[  243.686415] [c000000016657bd0] [c0000000105f21b8] fsnotify_destroy_group+0x68/0x160
[  243.686428] [c000000016657c40] [c0000000105f6500] inotify_release+0x30/0xa0
[  243.686440] [c000000016657cb0] [c0000000105751a8] __fput+0xc8/0x350
[  243.686452] [c000000016657d00] [c00000001017d524] task_work_run+0xe4/0x170
[  243.686464] [c000000016657d50] [c000000010020e94] do_notify_resume+0x134/0x140
[  243.686478] [c000000016657d80] [c00000001002eb18] interrupt_exit_user_prepare_main+0x198/0x270
[  243.686493] [c000000016657de0] [c00000001002ec60] syscall_exit_prepare+0x70/0x180
[  243.686505] [c000000016657e10] [c00000001000bf7c] system_call_vectored_common+0xfc/0x280
[  243.686520] --- interrupt: 3000 at 0x7fffa47d5ba4
[  243.686528] NIP:  00007fffa47d5ba4 LR: 0000000000000000 CTR: 0000000000000000
[  243.686538] REGS: c000000016657e80 TRAP: 3000   Not tainted  (6.1.0-rc1)
[  243.686548] MSR:  800000000000d033 <SF,EE,PR,ME,IR,DR,RI,LE>  CR: 42044440  XER: 00000000
[  243.686572] IRQMASK: 0
[  243.686572] GPR00: 0000000000000006 00007ffffa606710 00007fffa48e7200 0000000000000000
[  243.686572] GPR04: 0000000000000002 000000000000000a 0000000000000000 0000000000000001
[  243.686572] GPR08: 000001000c172dd0 0000000000000000 0000000000000000 0000000000000000
[  243.686572] GPR12: 0000000000000000 00007fffa4ff4bc0 0000000000000000 0000000000000000
[  243.686572] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  243.686572] GPR20: 0000000132dfdc50 000000000000000e 0000000000189375 0000000000000000
[  243.686572] GPR24: 00007ffffa606ae0 0000000000000005 000001000c185490 000001000c172570
[  243.686572] GPR28: 000001000c172990 000001000c184850 000001000c172e00 00007fffa4fedd98
[  243.686683] NIP [00007fffa47d5ba4] 0x7fffa47d5ba4
[  243.686691] LR [0000000000000000] 0x0
[  243.686698] --- interrupt: 3000
[  243.686708] INFO: task kworker/u16:1:24 blocked for more than 122 seconds.
[  243.686717]       Not tainted 6.1.0-rc1 #1
[  243.686724] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.686733] task:kworker/u16:1   state:D stack:0     pid:24    ppid:2      flags:0x00000800
[  243.686747] Workqueue: events_unbound fsnotify_mark_destroy_workfn
[  243.686758] Call Trace:
[  243.686762] [c0000000166736e0] [c00000004fd91000] 0xc00000004fd91000 (unreliable)
[  243.686775] [c0000000166738d0] [c00000001001dec0] __switch_to+0x130/0x220
[  243.686788] [c000000016673930] [c000000010f607b8] __schedule+0x1f8/0x580
[  243.686801] [c0000000166739e0] [c000000010f60bb4] schedule+0x74/0x140
[  243.686814] [c000000016673a50] [c000000010f699b8] schedule_timeout+0x168/0x1c0
[  243.686827] [c000000016673b20] [c000000010f61de8] __wait_for_common+0x148/0x360
[  243.686840] [c000000016673bc0] [c000000010210840] __synchronize_srcu.part.0+0xa0/0xe0
[  243.686855] [c000000016673c30] [c0000000105f2c64] fsnotify_mark_destroy_workfn+0xc4/0x1a0
[  243.686868] [c000000016673ca0] [c000000010174ea8] process_one_work+0x2a8/0x570
[  243.686882] [c000000016673d40] [c000000010175208] worker_thread+0x98/0x5e0
[  243.686895] [c000000016673dc0] [c0000000101828d4] kthread+0x124/0x130
[  243.686908] [c000000016673e10] [c00000001000cd40] ret_from_kernel_thread+0x5c/0x64
[  366.566274] INFO: task systemd:1 blocked for more than 245 seconds.
[  366.566298]       Not tainted 6.1.0-rc1 #1
[  366.566305] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  366.566314] task:systemd         state:D stack:0     pid:1     ppid:0      flags:0x00042000
[  366.566329] Call Trace:
...

The above splat occurs because PowerPC really does use maxcpus=1
instead of nr_cpus=1 in the kernel command line.  Consequently, the
(quite possibly non-zero) kdump CPU is the only online CPU in the kdump
kernel.  SRCU unconditionally queues a sdp->work on cpu 0, for which no
worker thread has been created, so sdp->work will be never executed and
__synchronize_srcu() will never be completed.

This commit therefore replaces CPU ID 0 with get_boot_cpu_id() in key
places in Tree SRCU.  Since the CPU indicated by get_boot_cpu_id()
is guaranteed to be online, this avoids the above splat.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: rcu@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:49:22 -08:00
Zqiang 66ea1029f9 srcu: Release early_srcu resources when no longer in use
Kernels built with the CONFIG_TREE_SRCU Kconfig option set and then
booted with rcupdate.rcu_self_test=1 and srcutree.convert_to_big=1 will
test Tree SRCU during early boot.  The early_srcu structure's srcu_node
array will be allocated when init_srcu_struct_fields() is invoked,
but after the test completes this early_srcu structure will not be used.

This commit therefore invokes cleanup_srcu_struct() to free that srcu_node
structure.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:49:22 -08:00
Uladzislau Rezki (Sony) 2ca836b1da rcu/kvfree: Split ready for reclaim objects from a batch
This patch splits the lists of objects so as to avoid sending any
through RCU that have already been queued for more than one grace
period.  These long-term-resident objects are immediately freed.
The remaining short-term-resident objects are queued for later freeing
using queue_rcu_work().

This change avoids delaying workqueue handlers with synchronize_rcu()
invocations.  Yes, workqueue handlers are designed to handle blocking,
but avoiding blocking when unnecessary improves performance during
low-memory situations.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:48:41 -08:00
Uladzislau Rezki (Sony) 4c33464ae8 rcu/kvfree: Carefully reset number of objects in krcp
The schedule_delayed_monitor_work() function relies on the count of
objects queued into any given kfree_rcu_cpu structure.  This count is
used to determine how quickly to schedule passing these objects to RCU.

There are three pipes where pointers can be placed.  When any pipe is
offloaded, the kfree_rcu_cpu structure's ->count counter is set to zero,
which is wrong because the other pipes might still be non-empty.

This commit therefore maintains per-pipe counters, and introduces a
krc_count() helper to access the aggregate value of those counters.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:48:41 -08:00
Uladzislau Rezki (Sony) 9627456101 rcu/kvfree: Use READ_ONCE() when access to krcp->head
The need_offload_krc() function is now lock-free, which gives the
compiler freedom to load old values from plain C-language loads from
the kfree_rcu_cpu struture's ->head pointer.  This commit therefore
applied READ_ONCE() to these loads.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:48:41 -08:00
Uladzislau Rezki (Sony) cc37d52076 rcu/kvfree: Use a polled API to speedup a reclaim process
Currently all objects placed into a batch wait for a full grace period
to elapse after that batch is ready to send to RCU.  However, this
can unnecessarily delay freeing of the first objects that were added
to the batch.  After all, several RCU grace periods might have elapsed
since those objects were added, and if so, there is no point in further
deferring their freeing.

This commit therefore adds per-page grace-period snapshots which are
obtained from get_state_synchronize_rcu().  When the batch is ready
to be passed to call_rcu(), each page's snapshot is checked by passing
it to poll_state_synchronize_rcu().  If a given page's RCU grace period
has already elapsed, its objects are freed immediately by kvfree_rcu_bulk().
Otherwise, these objects are freed after a call to synchronize_rcu().

This approach requires that the pages be traversed in reverse order,
that is, the oldest ones first.

Test example:

kvm.sh --memory 10G --torture rcuscale --allcpus --duration 1 \
  --kconfig CONFIG_NR_CPUS=64 \
  --kconfig CONFIG_RCU_NOCB_CPU=y \
  --kconfig CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y \
  --kconfig CONFIG_RCU_LAZY=n \
  --bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 \
  rcuscale.holdoff=20 rcuscale.kfree_loops=10000 \
  torture.disable_onoff_at_boot" --trust-make

Before this commit:

Total time taken by all kfree'ers: 8535693700 ns, loops: 10000, batches: 1188, memory footprint: 2248MB
Total time taken by all kfree'ers: 8466933582 ns, loops: 10000, batches: 1157, memory footprint: 2820MB
Total time taken by all kfree'ers: 5375602446 ns, loops: 10000, batches: 1130, memory footprint: 6502MB
Total time taken by all kfree'ers: 7523283832 ns, loops: 10000, batches: 1006, memory footprint: 3343MB
Total time taken by all kfree'ers: 6459171956 ns, loops: 10000, batches: 1150, memory footprint: 6549MB

After this commit:

Total time taken by all kfree'ers: 8560060176 ns, loops: 10000, batches: 1787, memory footprint: 61MB
Total time taken by all kfree'ers: 8573885501 ns, loops: 10000, batches: 1777, memory footprint: 93MB
Total time taken by all kfree'ers: 8320000202 ns, loops: 10000, batches: 1727, memory footprint: 66MB
Total time taken by all kfree'ers: 8552718794 ns, loops: 10000, batches: 1790, memory footprint: 75MB
Total time taken by all kfree'ers: 8601368792 ns, loops: 10000, batches: 1724, memory footprint: 62MB

The reduction in memory footprint is well in excess of an order of
magnitude.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:48:41 -08:00
Uladzislau Rezki (Sony) 8fc5494ad5 rcu/kvfree: Move need_offload_krc() out of krcp->lock
The need_offload_krc() function currently holds the krcp->lock in order
to safely check krcp->head.  This commit removes the need for this lock
in that function by updating the krcp->head pointer using WRITE_ONCE()
macro so that readers can carry out lockless loads of that pointer.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:48:41 -08:00
Uladzislau Rezki (Sony) 8c15a9e808 rcu/kvfree: Move bulk/list reclaim to separate functions
The kvfree_rcu() code maintains lists of pages of pointers, but also a
singly linked list, with the latter being used when memory allocation
fails.  Traversal of these two types of lists is currently open coded.
This commit simplifies the code by providing kvfree_rcu_bulk() and
kvfree_rcu_list() functions, respectively, to traverse these two types
of lists.  This patch does not introduce any functional change.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:48:41 -08:00
Uladzislau Rezki (Sony) 27538e18b6 rcu/kvfree: Switch to a generic linked list API
This commit improves the readability and maintainability of the
kvfree_rcu() code by switching from an open-coded linked list to
the standard Linux-kernel circular doubly linked list.  This patch
does not introduce any functional change.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:48:41 -08:00
Uladzislau Rezki (Sony) 04a522b7da rcu: Refactor kvfree_call_rcu() and high-level helpers
Currently a kvfree_call_rcu() takes an offset within a structure as
a second parameter, so a helper such as a kvfree_rcu_arg_2() has to
convert rcu_head and a freed ptr to an offset in order to pass it. That
leads to an extra conversion on macro entry.

Instead of converting, refactor the code in way that a pointer that has
to be freed is passed directly to the kvfree_call_rcu().

This patch does not make any functional change and is transparent to
all kvfree_rcu() users.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:48:40 -08:00
Paul E. McKenney 92987fe8bd rcu: Allow expedited RCU CPU stall warnings to dump task stacks
This commit introduces the rcupdate.rcu_exp_stall_task_details kernel
boot parameter, which cause expedited RCU CPU stall warnings to dump
the stacks of any tasks blocking the current expedited grace period.

Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:47:44 -08:00
Paul E. McKenney 748bf47a89 rcu: Test synchronous RCU grace periods at the end of rcu_init()
This commit tests synchronize_rcu() and synchronize_rcu_expedited()
at the end of rcu_init(), in addition to the test already at the
beginning of that function.  These tests are run only in kernels built
with CONFIG_PROVE_RCU=y.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:28:34 -08:00
Zqiang 3d1adf7ada rcu: Make rcu_blocking_is_gp() stop early-boot might_sleep()
Currently, rcu_blocking_is_gp() invokes might_sleep() even during early
boot when interrupts are disabled and before the scheduler is scheduling.
This is at best an accident waiting to happen.  Therefore, this commit
moves that might_sleep() under an rcu_scheduler_active check in order
to ensure that might_sleep() is not invoked unless sleeping might actually
happen.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:28:34 -08:00
Paul E. McKenney 2d7f00b2f0 rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait()
The normal grace period's RCU CPU stall warnings are invoked from the
scheduling-clock interrupt handler, and can thus invoke smp_processor_id()
with impunity, which allows them to directly invoke dump_cpu_task().
In contrast, the expedited grace period's RCU CPU stall warnings are
invoked from process context, which causes the dump_cpu_task() function's
calls to smp_processor_id() to complain bitterly in debug kernels.

This commit therefore causes synchronize_rcu_expedited_wait() to disable
preemption around its call to dump_cpu_task().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:28:34 -08:00
Paul E. McKenney 95ff24ee7b rcu: Upgrade header comment for poll_state_synchronize_rcu()
This commit emphasizes the possibility of concurrent calls to
synchronize_rcu() and synchronize_rcu_expedited() causing one or
the other of the two grace periods being lost from the viewpoint of
poll_state_synchronize_rcu().

If you cannot afford to lose grace periods this way, you should
instead use the _full() variants of the polled RCU API, for
example, poll_state_synchronize_rcu_full().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:28:34 -08:00
Paul E. McKenney 253cbbff62 rcu: Throttle callback invocation based on number of ready callbacks
Currently, rcu_do_batch() sizes its batches based on the total number
of callbacks in the callback list.  This can result in some strange
choices, for example, if there was 12,800 callbacks in the list, but
only 200 were ready to invoke, RCU would invoke 100 at a time (12,800
shifted down by seven bits).

A more measured approach would use the number that were actually ready
to invoke, an approach that has become feasible only recently given the
per-segment ->seglen counts in ->cblist.

This commit therefore bases the batch limit on the number of callbacks
ready to invoke instead of on the total number of callbacks.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:28:34 -08:00
Paul E. McKenney 5a04848d00 rcu: Consolidate initialization and CPU-hotplug code
This commit consolidates the initialization and CPU-hotplug code at
the end of kernel/rcu/tree.c.  This is strictly a code-motion commit.
No functionality has changed.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03 17:28:34 -08:00
Linus Torvalds 19822e3ee4 Urgent RCU pull request for v6.2
This commit fixes a lockdep false positive in synchronize_rcu() that
 can otherwise occur during early boot.  Theis fix simply avoids invoking
 lockdep if the scheduler has not yet been initialized, that is, during
 that portion of boot when interrupts are disabled.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmOeXj8THHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jPmZEACaI5JqO6Dr2U4HojJJBYEfLVaSYxDp
 JrUi5D5WzzZidyjM2fyyZZkdRVQ24i1aV2H/fbLoIIH/smYjE/KLEFHQmclpphw5
 BSOyapotjdt5YhIavvAeOjdUd7jPyMqhbDVnwzjnblhUD1ObLVlhIs8Pjn7/03sF
 gzlIhYgp3EL7GenT9j9kud2FwWP+wrVQ7SdJ+Ni/WAHYO8860xQAmFXH/07bYzx7
 fbp5iPkCOSSUoRMw/qQ8s7CE3XhBNKufv1BtcvV/uxEtutfV1qvEQBv/l2RBd0Vg
 wOVBZnWXze+7IUx13M90R/d04Nn7RaGwon6xBMlvIwL3qzEj8x/r1FYz7zZhQPkv
 wwChAxFHQACnLCZSu48WBtVrawNdZHM57KHUK4rloAbrK92FpVznhQU+5pBDy4c6
 rfY2my+SNO4kWvePEg/2fd8aQycrZr99fK/ojCIerEn8MNboxuVOYTjzy0qtUcVT
 yJ/80O8ADI3QL/NRhjMFWgEnBDbHN1PcGhiRoutApdLQkg/UPTJjCRZ7ibmIFYY2
 ViW3cSndr/f0I7sOex2EILHwiZ2bUKiwyeTW6vWuFl/7MEWsvpJaWoUxXgQj99Bt
 ncAOaxtmmuhbwrOCt2kab90A0c/thNx9kNYYIkG3vUNcSRzyHQtg3ydEljBpaTFR
 OzhrqdUA7W9Sfg==
 =UKUo
 -----END PGP SIGNATURE-----

Merge tag 'rcu-urgent.2022.12.17a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU fix from Paul McKenney:
 "This fixes a lockdep false positive in synchronize_rcu() that can
  otherwise occur during early boot.

  The fix simply avoids invoking lockdep if the scheduler has not yet
  been initialized, that is, during that portion of boot when interrupts
  are disabled"

* tag 'rcu-urgent.2022.12.17a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  rcu: Don't assert interrupts enabled too early in boot
2022-12-21 07:59:57 -08:00
Paul E. McKenney 3f6c3d29df rcu: Don't assert interrupts enabled too early in boot
The rcu_poll_gp_seq_end() and rcu_poll_gp_seq_end_unlocked() both check
that interrupts are enabled, as they normally should be when waiting for
an RCU grace period.  Except that it is legal to wait for grace periods
during early boot, before interrupts have been enabled for the first time,
and polling for grace periods is required to work during this time.
This can result in false-positive lockdep splats in the presence of
boot-time-initiated tracing.

This commit therefore conditions those interrupts-enabled checks on
rcu_scheduler_active having advanced past RCU_SCHEDULER_INACTIVE, by
which time interrupts have been enabled.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-12-17 16:12:20 -08:00
Linus Torvalds 7e68dd7d07 Networking changes for 6.2.
Core
 ----
  - Allow live renaming when an interface is up
 
  - Add retpoline wrappers for tc, improving considerably the
    performances of complex queue discipline configurations.
 
  - Add inet drop monitor support.
 
  - A few GRO performance improvements.
 
  - Add infrastructure for atomic dev stats, addressing long standing
    data races.
 
  - De-duplicate common code between OVS and conntrack offloading
    infrastructure.
 
  - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements.
 
  - Netfilter: introduce packet parser for tunneled packets
 
  - Replace IPVS timer-based estimators with kthreads to scale up
    the workload with the number of available CPUs.
 
  - Add the helper support for connection-tracking OVS offload.
 
 BPF
 ---
  - Support for user defined BPF objects: the use case is to allocate
    own objects, build own object hierarchies and use the building
    blocks to build own data structures flexibly, for example, linked
    lists in BPF.
 
  - Make cgroup local storage available to non-cgroup attached BPF
    programs.
 
  - Avoid unnecessary deadlock detection and failures wrt BPF task
    storage helpers.
 
  - A relevant bunch of BPF verifier fixes and improvements.
 
  - Veristat tool improvements to support custom filtering, sorting,
    and replay of results.
 
  - Add LLVM disassembler as default library for dumping JITed code.
 
  - Lots of new BPF documentation for various BPF maps.
 
  - Add bpf_rcu_read_{,un}lock() support for sleepable programs.
 
  - Add RCU grace period chaining to BPF to wait for the completion
    of access from both sleepable and non-sleepable BPF programs.
 
  - Add support storing struct task_struct objects as kptrs in maps.
 
  - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
    values.
 
  - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions.
 
 Protocols
 ---------
  - TCP: implement Protective Load Balancing across switch links.
 
  - TCP: allow dynamically disabling TCP-MD5 static key, reverting
    back to fast[er]-path.
 
  - UDP: Introduce optional per-netns hash lookup table.
 
  - IPv6: simplify and cleanup sockets disposal.
 
  - Netlink: support different type policies for each generic
    netlink operation.
 
  - MPTCP: add MSG_FASTOPEN and FastOpen listener side support.
 
  - MPTCP: add netlink notification support for listener sockets
    events.
 
  - SCTP: add VRF support, allowing sctp sockets binding to VRF
    devices.
 
  - Add bridging MAC Authentication Bypass (MAB) support.
 
  - Extensions for Ethernet VPN bridging implementation to better
    support multicast scenarios.
 
  - More work for Wi-Fi 7 support, comprising conversion of all
    the existing drivers to internal TX queue usage.
 
  - IPSec: introduce a new offload type (packet offload) allowing
    complete header processing and crypto offloading.
 
  - IPSec: extended ack support for more descriptive XFRM error
    reporting.
 
  - RXRPC: increase SACK table size and move processing into a
    per-local endpoint kernel thread, reducing considerably the
    required locking.
 
  - IEEE 802154: synchronous send frame and extended filtering
    support, initial support for scanning available 15.4 networks.
 
  - Tun: bump the link speed from 10Mbps to 10Gbps.
 
  - Tun/VirtioNet: implement UDP segmentation offload support.
 
 Driver API
 ----------
 
  - PHY/SFP: improve power level switching between standard
    level 1 and the higher power levels.
 
  - New API for netdev <-> devlink_port linkage.
 
  - PTP: convert existing drivers to new frequency adjustment
    implementation.
 
  - DSA: add support for rx offloading.
 
  - Autoload DSA tagging driver when dynamically changing protocol.
 
  - Add new PCP and APPTRUST attributes to Data Center Bridging.
 
  - Add configuration support for 800Gbps link speed.
 
  - Add devlink port function attribute to enable/disable RoCE and
    migratable.
 
  - Extend devlink-rate to support strict prioriry and weighted fair
    queuing.
 
  - Add devlink support to directly reading from region memory.
 
  - New device tree helper to fetch MAC address from nvmem.
 
  - New big TCP helper to simplify temporary header stripping.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Marvel Octeon CNF95N and CN10KB Ethernet Switches.
    - Marvel Prestera AC5X Ethernet Switch.
    - WangXun 10 Gigabit NIC.
    - Motorcomm yt8521 Gigabit Ethernet.
    - Microchip ksz9563 Gigabit Ethernet Switch.
    - Microsoft Azure Network Adapter.
    - Linux Automation 10Base-T1L adapter.
 
  - PHY:
    - Aquantia AQR112 and AQR412.
    - Motorcomm YT8531S.
 
  - PTP:
    - Orolia ART-CARD.
 
  - WiFi:
    - MediaTek Wi-Fi 7 (802.11be) devices.
    - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
      devices.
 
  - Bluetooth:
    - Broadcom BCM4377/4378/4387 Bluetooth chipsets.
    - Realtek RTL8852BE and RTL8723DS.
    - Cypress.CYW4373A0 WiFi + Bluetooth combo device.
 
 Drivers
 -------
  - CAN:
    - gs_usb: bus error reporting support.
    - kvaser_usb: listen only and bus error reporting support.
 
  - Ethernet NICs:
    - Intel (100G):
      - extend action skbedit to RX queue mapping.
      - implement devlink-rate support.
      - support direct read from memory.
    - nVidia/Mellanox (mlx5):
      - SW steering improvements, increasing rules update rate.
      - Support for enhanced events compression.
      - extend H/W offload packet manipulation capabilities.
      - implement IPSec packet offload mode.
    - nVidia/Mellanox (mlx4):
      - better big TCP support.
    - Netronome Ethernet NICs (nfp):
      - IPsec offload support.
      - add support for multicast filter.
    - Broadcom:
      - RSS and PTP support improvements.
    - AMD/SolarFlare:
      - netlink extened ack improvements.
      - add basic flower matches to offload, and related stats.
    - Virtual NICs:
      - ibmvnic: introduce affinity hint support.
    - small / embedded:
      - FreeScale fec: add initial XDP support.
      - Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood.
      - TI am65-cpsw: add suspend/resume support.
      - Mediatek MT7986: add RX wireless wthernet dispatch support.
      - Realtek 8169: enable GRO software interrupt coalescing per
        default.
 
  - Ethernet high-speed switches:
    - Microchip (sparx5):
      - add support for Sparx5 TC/flower H/W offload via VCAP.
    - Mellanox mlxsw:
      - add 802.1X and MAC Authentication Bypass offload support.
      - add ip6gre support.
 
  - Embedded Ethernet switches:
    - Mediatek (mtk_eth_soc):
      - improve PCS implementation, add DSA untag support.
      - enable flow offload support.
    - Renesas:
      - add rswitch R-Car Gen4 gPTP support.
    - Microchip (lan966x):
      - add full XDP support.
      - add TC H/W offload via VCAP.
      - enable PTP on bridge interfaces.
    - Microchip (ksz8):
      - add MTU support for KSZ8 series.
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - support configuring channel dwell time during scan.
 
  - MediaTek WiFi (mt76):
    - enable Wireless Ethernet Dispatch (WED) offload support.
    - add ack signal support.
    - enable coredump support.
    - remain_on_channel support.
 
  - Intel WiFi (iwlwifi):
    - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities.
    - 320 MHz channels support.
 
  - RealTek WiFi (rtw89):
    - new dynamic header firmware format support.
    - wake-over-WLAN support.
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmOYXUcSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOk8zQP/R7BZtbJMTPiWkRnSoKHnAyupDVwrz5U
 ktukLkwPsCyJuEbAjgxrxf4EEEQ9uq2FFlxNSYuKiiQMqIpFxV6KED7LCUygn4Tc
 kxtkp0Q+5XiqisWlQmtfExf2OjuuPqcjV9tWCDBI6GebKUbfNwY/eI44RcMu4BSv
 DzIlW5GkX/kZAPqnnuqaLsN3FudDTJHGEAD7NbA++7wJ076RWYSLXlFv0Z+SCSPS
 H8/PEG0/ZK/65rIWMAFRClJ9BNIDwGVgp0GrsIvs1gqbRUOlA1hl1rDM21TqtNFf
 5QPQT7sIfTcCE/nerxKJD5JE3JyP+XRlRn96PaRw3rt4MgI6I/EOj/HOKQ5tMCNc
 oPiqb7N70+hkLZyr42qX+vN9eDPjp2koEQm7EO2Zs+/534/zWDs24Zfk/Aa1ps0I
 Fa82oGjAgkBhGe/FZ6i5cYoLcyxqRqZV1Ws9XQMl72qRC7/BwvNbIW6beLpCRyeM
 yYIU+0e9dEm+wHQEdh2niJuVtR63hy8tvmPx56lyh+6u0+pondkwbfSiC5aD3kAC
 ikKsN5DyEsdXyiBAlytCEBxnaOjQy4RAz+3YXSiS0eBNacXp03UUrNGx4Pzpu/D0
 QLFJhBnMFFCgy5to8/DvKnrTPgZdSURwqbIUcZdvU21f1HLR8tUTpaQnYffc/Whm
 V8gnt1EL+0cc
 =CbJC
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Allow live renaming when an interface is up

   - Add retpoline wrappers for tc, improving considerably the
     performances of complex queue discipline configurations

   - Add inet drop monitor support

   - A few GRO performance improvements

   - Add infrastructure for atomic dev stats, addressing long standing
     data races

   - De-duplicate common code between OVS and conntrack offloading
     infrastructure

   - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements

   - Netfilter: introduce packet parser for tunneled packets

   - Replace IPVS timer-based estimators with kthreads to scale up the
     workload with the number of available CPUs

   - Add the helper support for connection-tracking OVS offload

  BPF:

   - Support for user defined BPF objects: the use case is to allocate
     own objects, build own object hierarchies and use the building
     blocks to build own data structures flexibly, for example, linked
     lists in BPF

   - Make cgroup local storage available to non-cgroup attached BPF
     programs

   - Avoid unnecessary deadlock detection and failures wrt BPF task
     storage helpers

   - A relevant bunch of BPF verifier fixes and improvements

   - Veristat tool improvements to support custom filtering, sorting,
     and replay of results

   - Add LLVM disassembler as default library for dumping JITed code

   - Lots of new BPF documentation for various BPF maps

   - Add bpf_rcu_read_{,un}lock() support for sleepable programs

   - Add RCU grace period chaining to BPF to wait for the completion of
     access from both sleepable and non-sleepable BPF programs

   - Add support storing struct task_struct objects as kptrs in maps

   - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
     values

   - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions

  Protocols:

   - TCP: implement Protective Load Balancing across switch links

   - TCP: allow dynamically disabling TCP-MD5 static key, reverting back
     to fast[er]-path

   - UDP: Introduce optional per-netns hash lookup table

   - IPv6: simplify and cleanup sockets disposal

   - Netlink: support different type policies for each generic netlink
     operation

   - MPTCP: add MSG_FASTOPEN and FastOpen listener side support

   - MPTCP: add netlink notification support for listener sockets events

   - SCTP: add VRF support, allowing sctp sockets binding to VRF devices

   - Add bridging MAC Authentication Bypass (MAB) support

   - Extensions for Ethernet VPN bridging implementation to better
     support multicast scenarios

   - More work for Wi-Fi 7 support, comprising conversion of all the
     existing drivers to internal TX queue usage

   - IPSec: introduce a new offload type (packet offload) allowing
     complete header processing and crypto offloading

   - IPSec: extended ack support for more descriptive XFRM error
     reporting

   - RXRPC: increase SACK table size and move processing into a
     per-local endpoint kernel thread, reducing considerably the
     required locking

   - IEEE 802154: synchronous send frame and extended filtering support,
     initial support for scanning available 15.4 networks

   - Tun: bump the link speed from 10Mbps to 10Gbps

   - Tun/VirtioNet: implement UDP segmentation offload support

  Driver API:

   - PHY/SFP: improve power level switching between standard level 1 and
     the higher power levels

   - New API for netdev <-> devlink_port linkage

   - PTP: convert existing drivers to new frequency adjustment
     implementation

   - DSA: add support for rx offloading

   - Autoload DSA tagging driver when dynamically changing protocol

   - Add new PCP and APPTRUST attributes to Data Center Bridging

   - Add configuration support for 800Gbps link speed

   - Add devlink port function attribute to enable/disable RoCE and
     migratable

   - Extend devlink-rate to support strict prioriry and weighted fair
     queuing

   - Add devlink support to directly reading from region memory

   - New device tree helper to fetch MAC address from nvmem

   - New big TCP helper to simplify temporary header stripping

  New hardware / drivers:

   - Ethernet:
      - Marvel Octeon CNF95N and CN10KB Ethernet Switches
      - Marvel Prestera AC5X Ethernet Switch
      - WangXun 10 Gigabit NIC
      - Motorcomm yt8521 Gigabit Ethernet
      - Microchip ksz9563 Gigabit Ethernet Switch
      - Microsoft Azure Network Adapter
      - Linux Automation 10Base-T1L adapter

   - PHY:
      - Aquantia AQR112 and AQR412
      - Motorcomm YT8531S

   - PTP:
      - Orolia ART-CARD

   - WiFi:
      - MediaTek Wi-Fi 7 (802.11be) devices
      - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
        devices

   - Bluetooth:
      - Broadcom BCM4377/4378/4387 Bluetooth chipsets
      - Realtek RTL8852BE and RTL8723DS
      - Cypress.CYW4373A0 WiFi + Bluetooth combo device

  Drivers:

   - CAN:
      - gs_usb: bus error reporting support
      - kvaser_usb: listen only and bus error reporting support

   - Ethernet NICs:
      - Intel (100G):
         - extend action skbedit to RX queue mapping
         - implement devlink-rate support
         - support direct read from memory
      - nVidia/Mellanox (mlx5):
         - SW steering improvements, increasing rules update rate
         - Support for enhanced events compression
         - extend H/W offload packet manipulation capabilities
         - implement IPSec packet offload mode
      - nVidia/Mellanox (mlx4):
         - better big TCP support
      - Netronome Ethernet NICs (nfp):
         - IPsec offload support
         - add support for multicast filter
      - Broadcom:
         - RSS and PTP support improvements
      - AMD/SolarFlare:
         - netlink extened ack improvements
         - add basic flower matches to offload, and related stats
      - Virtual NICs:
         - ibmvnic: introduce affinity hint support
      - small / embedded:
         - FreeScale fec: add initial XDP support
         - Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood
         - TI am65-cpsw: add suspend/resume support
         - Mediatek MT7986: add RX wireless wthernet dispatch support
         - Realtek 8169: enable GRO software interrupt coalescing per
           default

   - Ethernet high-speed switches:
      - Microchip (sparx5):
         - add support for Sparx5 TC/flower H/W offload via VCAP
      - Mellanox mlxsw:
         - add 802.1X and MAC Authentication Bypass offload support
         - add ip6gre support

   - Embedded Ethernet switches:
      - Mediatek (mtk_eth_soc):
         - improve PCS implementation, add DSA untag support
         - enable flow offload support
      - Renesas:
         - add rswitch R-Car Gen4 gPTP support
      - Microchip (lan966x):
         - add full XDP support
         - add TC H/W offload via VCAP
         - enable PTP on bridge interfaces
      - Microchip (ksz8):
         - add MTU support for KSZ8 series

   - Qualcomm 802.11ax WiFi (ath11k):
      - support configuring channel dwell time during scan

   - MediaTek WiFi (mt76):
      - enable Wireless Ethernet Dispatch (WED) offload support
      - add ack signal support
      - enable coredump support
      - remain_on_channel support

   - Intel WiFi (iwlwifi):
      - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities
      - 320 MHz channels support

   - RealTek WiFi (rtw89):
      - new dynamic header firmware format support
      - wake-over-WLAN support"

* tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2002 commits)
  ipvs: fix type warning in do_div() on 32 bit
  net: lan966x: Remove a useless test in lan966x_ptp_add_trap()
  net: ipa: add IPA v4.7 support
  dt-bindings: net: qcom,ipa: Add SM6350 compatible
  bnxt: Use generic HBH removal helper in tx path
  IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver
  selftests: forwarding: Add bridge MDB test
  selftests: forwarding: Rename bridge_mdb test
  bridge: mcast: Support replacement of MDB port group entries
  bridge: mcast: Allow user space to specify MDB entry routing protocol
  bridge: mcast: Allow user space to add (*, G) with a source list and filter mode
  bridge: mcast: Add support for (*, G) with a source list and filter mode
  bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source
  bridge: mcast: Add a flag for user installed source entries
  bridge: mcast: Expose __br_multicast_del_group_src()
  bridge: mcast: Expose br_multicast_new_group_src()
  bridge: mcast: Add a centralized error path
  bridge: mcast: Place netlink policy before validation functions
  bridge: mcast: Split (*, G) and (S, G) addition into different functions
  bridge: mcast: Do not derive entry type from its filter mode
  ...
2022-12-13 15:47:48 -08:00
Linus Torvalds 98d0052d0d printk changes for 6.2
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmORzikACgkQUqAMR0iA
 lPKF/g/7Bmcao3rJkZjEagsYY+s7rGhaFaSbML8FDdyE3UzeXLJOnNxBLrD0JIe9
 XFW7+DMqr2uRxsab5C7APy0mrIWp/zCGyJ8CmBILnrPDNcAQ27OhFzxv6WlMUmEc
 xEjGHrk5dFV96s63gyHGLkKGOZMd/cfcpy/QDOyg0vfF8EZCiPywWMbQQ2Ij8E50
 N6UL70ExkoLjT9tzb8NXQiaDqHxqNRvd15aIomDjRrce7eeaL4TaZIT7fKnEcULz
 0Lmdo8RUknonCI7Y00RWdVXMqqPD2JsKz3+fh0vBnXEN+aItwyxis/YajtN+m6l7
 jhPGt7hNhCKG17auK0/6XVJ3717QwjI3+xLXCvayA8jyewMK14PgzX70hCws0eXM
 +5M+IeXI4ze5qsq+ln9Dt8zfC+5HGmwXODUtaYTBWhB4nVWdL/CZ+nTv349zt+Uc
 VIi/QcPQ4vq6EfsxUZR2r6Y12+sSH40iLIROUfqSchtujbLo7qxSNF5x7x9+rtff
 nWuXo5OsjGE7TZDwn3kr0zSuJ+w/pkWMYQ7jch+A2WqUMYyGC86sL3At7ocL+Esq
 34uvzwEgWnNySV8cLiMh34kBmgBwhAP34RhV0RS9iCv8kev2DV7pLQTs9V3QAjw9
 EZnFDHATUdikgugaFKCeDV86R3wFgnRWWOdlRrRi6aAzFDqNcYk=
 =1PTZ
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk updates from Petr Mladek:

 - Add NMI-safe SRCU reader API. It uses atomic_inc() instead of
   this_cpu_inc() on strong load-store architectures.

 - Introduce new console_list_lock to synchronize a manipulation of the
   list of registered consoles and their flags.

   This is a first step in removing the big-kernel-lock-like behavior of
   console_lock(). This semaphore still serializes console->write()
   calbacks against:

      - each other. It primary prevents potential races between early
        and proper console drivers using the same device.

      - suspend()/resume() callbacks and init() operations in some
        drivers.

      - various other operations in the tty/vt and framebufer
        susbsystems. It is likely that console_lock() serializes even
        operations that are not directly conflicting with the
        console->write() callbacks here. This is the most complicated
        big-kernel-lock aspect of the console_lock() that will be hard
        to untangle.

 - Introduce new console_srcu lock that is used to safely iterate and
   access the registered console drivers under SRCU read lock.

   This is a prerequisite for introducing atomic console drivers and
   console kthreads. It will reduce the complexity of serialization
   against normal consoles and console_lock(). Also it should remove the
   risk of deadlock during critical situations, like Oops or panic, when
   only atomic consoles are registered.

 - Check whether the console is registered instead of enabled on many
   locations. It was a historical leftover.

 - Cleanly force a preferred console in xenfb code instead of a dirty
   hack.

 - A lot of code and comment clean ups and improvements.

* tag 'printk-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (47 commits)
  printk: htmldocs: add missing description
  tty: serial: sh-sci: use setup() callback for early console
  printk: relieve console_lock of list synchronization duties
  tty: serial: kgdboc: use console_list_lock to trap exit
  tty: serial: kgdboc: synchronize tty_find_polling_driver() and register_console()
  tty: serial: kgdboc: use console_list_lock for list traversal
  tty: serial: kgdboc: use srcu console list iterator
  proc: consoles: use console_list_lock for list iteration
  tty: tty_io: use console_list_lock for list synchronization
  printk, xen: fbfront: create/use safe function for forcing preferred
  netconsole: avoid CON_ENABLED misuse to track registration
  usb: early: xhci-dbc: use console_is_registered()
  tty: serial: xilinx_uartps: use console_is_registered()
  tty: serial: samsung_tty: use console_is_registered()
  tty: serial: pic32_uart: use console_is_registered()
  tty: serial: earlycon: use console_is_registered()
  tty: hvc: use console_is_registered()
  efi: earlycon: use console_is_registered()
  tty: nfcon: use console_is_registered()
  serial_core: replace uart_console_enabled() with uart_console_registered()
  ...
2022-12-12 09:01:36 -08:00
Linus Torvalds 1fab45ab6e RCU pull request for v6.2
This pull request contains the following branches:
 
 doc.2022.10.20a: Documentation updates.  This is the second
 	in a series from an ongoing review of the RCU documentation.
 
 fixes.2022.10.21a: Miscellaneous fixes.
 
 lazy.2022.11.30a: Introduces a default-off Kconfig option that depends
 	on RCU_NOCB_CPU that, on CPUs mentioned in the nohz_full or
 	rcu_nocbs boot-argument CPU lists, causes call_rcu() to introduce
 	delays.  These delays result in significant power savings on
 	nearly idle Android and ChromeOS systems.  These savings range
 	from a few percent to more than ten percent.
 
 	This series also includes several commits that change call_rcu()
 	to a new call_rcu_hurry() function that avoids these delays in
 	a few cases, for example, where timely wakeups are required.
 	Several of these are outside of RCU and thus have acks and
 	reviews from the relevant maintainers.
 
 srcunmisafe.2022.11.09a: Creates an srcu_read_lock_nmisafe() and an
 	srcu_read_unlock_nmisafe() for architectures that support NMIs,
 	but which do not provide NMI-safe this_cpu_inc().  These NMI-safe
 	SRCU functions are required by the upcoming lockless printk()
 	work by John Ogness et al.
 
 	That printk() series depends on these commits, so if you pull
 	the printk() series before this one, you will have already
 	pulled in this branch, plus two more SRCU commits:
 
 	0cd7e350ab ("rcu: Make SRCU mandatory")
 	51f5f78a4f ("srcu: Make Tiny synchronize_srcu() check for readers")
 
 	These two commits appear to work well, but do not have
 	sufficient testing exposure over a long enough time for me to
 	feel comfortable pushing them unless something in mainline is
 	definitely going to use them immediately, and currently only
 	the new printk() work uses them.
 
 torture.2022.10.18c: Changes providing minor but important increases
 	in test coverage for the new RCU polled-grace-period APIs.
 
 torturescript.2022.10.20a: Changes that avoid redundant kernel builds,
 	thus providing about a 30% speedup for the torture.sh acceptance
 	test.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmOKnS8THHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jCMiD/4weraRjmcLhZ3tz2vgTI8ZsXdIiCfU
 vCln0AOKroVo37S4BhViVfryV2D4VFfEb1UY6EgxNFu7Jd3z0seQShZh/5r8bFMU
 p0E6TC8PwyKUpQstTOwOynkw6BWGW1qeL620PpBNRAy4MkxL8AGv40tHRIHEeAzc
 cCTax2+xW9ae0ZtAZHDDCUAzpYpcjScIf4OZ3tkSaFCcpWZijg+dN60dnsZ9l7h9
 DtqKH61rszXAtxkmN9Fs9OY5MPCXi9Es6LVYq6KN06jqxwJRqmYf+pai3apmNIOf
 P8isXOQG58tbhBLpNCG58UBSkjI2GG8Lcq6hYr6d/7Ukm7RF49q8eL7OQlVrJMuQ
 Zi2DVTEAu2U3pzdTC14gi3RvqP7dO+psBs+LpGXtj4RxYvAP99e9KSRcG14j/Wwa
 L52AetBzBXTCS5nhPOG8RP22d8HRZLxMe9x7T8iVCDuwH4M1zTF5cVzLeEdgPAD7
 tdX4eV16PLt1AvhCEuHU/2v520gc2K9oGXLI1A6kzquXh7FflcPWl5WS+sYUbB/p
 gBsblz7C3I5GgSoW4aAMnkukZiYgSvVql8ZyRwQuRzvLpYcofMpoanZbcufDjuw9
 N5QzAaMmzHnBu3hOJS2WaSZRZ73fed3NO8jo8q8EMfYeWK3NAHybBdaQqSTgsO8i
 s+aN+LZ4s5MnRw==
 =eMOr
 -----END PGP SIGNATURE-----

Merge tag 'rcu.2022.12.02a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU updates from Paul McKenney:

 - Documentation updates. This is the second in a series from an ongoing
   review of the RCU documentation.

 - Miscellaneous fixes.

 - Introduce a default-off Kconfig option that depends on RCU_NOCB_CPU
   that, on CPUs mentioned in the nohz_full or rcu_nocbs boot-argument
   CPU lists, causes call_rcu() to introduce delays.

   These delays result in significant power savings on nearly idle
   Android and ChromeOS systems. These savings range from a few percent
   to more than ten percent.

   This series also includes several commits that change call_rcu() to a
   new call_rcu_hurry() function that avoids these delays in a few
   cases, for example, where timely wakeups are required. Several of
   these are outside of RCU and thus have acks and reviews from the
   relevant maintainers.

 - Create an srcu_read_lock_nmisafe() and an srcu_read_unlock_nmisafe()
   for architectures that support NMIs, but which do not provide
   NMI-safe this_cpu_inc(). These NMI-safe SRCU functions are required
   by the upcoming lockless printk() work by John Ogness et al.

 - Changes providing minor but important increases in torture test
   coverage for the new RCU polled-grace-period APIs.

 - Changes to torturescript that avoid redundant kernel builds, thus
   providing about a 30% speedup for the torture.sh acceptance test.

* tag 'rcu.2022.12.02a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (49 commits)
  net: devinet: Reduce refcount before grace period
  net: Use call_rcu_hurry() for dst_release()
  workqueue: Make queue_rcu_work() use call_rcu_hurry()
  percpu-refcount: Use call_rcu_hurry() for atomic switch
  scsi/scsi_error: Use call_rcu_hurry() instead of call_rcu()
  rcu/rcutorture: Use call_rcu_hurry() where needed
  rcu/rcuscale: Use call_rcu_hurry() for async reader test
  rcu/sync: Use call_rcu_hurry() instead of call_rcu
  rcuscale: Add laziness and kfree tests
  rcu: Shrinker for lazy rcu
  rcu: Refactor code a bit in rcu_nocb_do_flush_bypass()
  rcu: Make call_rcu() lazy to save power
  rcu: Implement lockdep_rcu_enabled for !CONFIG_DEBUG_LOCK_ALLOC
  srcu: Debug NMI safety even on archs that don't require it
  srcu: Explain the reason behind the read side critical section on GP start
  srcu: Warn when NMI-unsafe API is used in NMI
  arch/s390: Add ARCH_HAS_NMI_SAFE_THIS_CPU_OPS Kconfig option
  arch/loongarch: Add ARCH_HAS_NMI_SAFE_THIS_CPU_OPS Kconfig option
  rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state()
  rcu-tasks: Make grace-period-age message human-readable
  ...
2022-12-12 07:47:15 -08:00
Zqiang 51f5f78a4f srcu: Make Tiny synchronize_srcu() check for readers
This commit adds lockdep checks for illegal use of synchronize_srcu()
within same-type SRCU read-side critical sections and within normal
RCU read-side critical sections.  It also makes synchronize_srcu()
be a no-op during early boot.

These changes bring Tiny synchronize_srcu() into line with both Tree
synchronize_srcu() and Tiny synchronize_rcu().

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: John Ogness <john.ogness@linutronix.de>
2022-12-01 15:49:12 -08:00
Paul E. McKenney 87492c06e6 Merge branches 'doc.2022.10.20a', 'fixes.2022.10.21a', 'lazy.2022.11.30a', 'srcunmisafe.2022.11.09a', 'torture.2022.10.18c' and 'torturescript.2022.10.20a' into HEAD
doc.2022.10.20a: Documentation updates.
fixes.2022.10.21a: Miscellaneous fixes.
lazy.2022.11.30a: Lazy call_rcu() and NOCB updates.
srcunmisafe.2022.11.09a: NMI-safe SRCU readers.
torture.2022.10.18c: Torture-test updates.
torturescript.2022.10.20a: Torture-test scripting updates.
2022-11-30 13:20:05 -08:00
Paul E. McKenney 0cd7e350ab rcu: Make SRCU mandatory
Kernels configured with CONFIG_PRINTK=n and CONFIG_SRCU=n get build
failures.  This causes trouble for deep embedded systems.  But given
that there are more than 25 instances of "select SRCU" in the kernel,
it is hard to believe that there are many kernels running in production
without SRCU.  This commit therefore makes SRCU mandatory.  The SRCU
Kconfig option remains for backwards compatibility, and will be removed
when it is no longer used.

[ paulmck: Update per kernel test robot feedback. ]

Reported-by: John Ogness <john.ogness@linutronix.de>
Reported-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: <linux-arch@vger.kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Reviewed-by: John Ogness <john.ogness@linutronix.de>
2022-11-29 15:00:06 -08:00
Joel Fernandes (Google) 405d8e91f0 rcu/rcutorture: Use call_rcu_hurry() where needed
call_rcu() changes to save power will change the behavior of rcutorture
tests. Use the call_rcu_hurry() API instead which reverts to the old
behavior.

[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29 14:04:33 -08:00
Joel Fernandes (Google) 723df859d8 rcu/rcuscale: Use call_rcu_hurry() for async reader test
rcuscale uses call_rcu() to queue async readers. With recent changes to
save power, the test will have fewer async readers in flight. Use the
call_rcu_hurry() API instead to revert to the old behavior.

[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29 14:04:33 -08:00
Joel Fernandes (Google) 7651d6b250 rcu/sync: Use call_rcu_hurry() instead of call_rcu
call_rcu() changes to save power will slow down rcu sync. Use the
call_rcu_hurry() API instead which reverts to the old behavior.

[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29 14:04:33 -08:00
Joel Fernandes (Google) 084e04fff1 rcuscale: Add laziness and kfree tests
This commit adds 2 tests to rcuscale.  The first one is a startup test
to check whether we are not too lazy or too hard working.  The second
one causes kfree_rcu() itself to use call_rcu() and checks memory
pressure. Testing indicates that the new call_rcu() keeps memory pressure
under control roughly as well as does kfree_rcu().

[ paulmck: Apply checkpatch feedback. ]

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29 14:02:52 -08:00
Vineeth Pillai c945b4da7a rcu: Shrinker for lazy rcu
The shrinker is used to speed up the free'ing of memory potentially held
by RCU lazy callbacks. RCU kernel module test cases show this to be
effective. Test is introduced in a later patch.

Signed-off-by: Vineeth Pillai <vineeth@bitbyteword.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29 14:02:52 -08:00
Joel Fernandes (Google) 3d222a0c0c rcu: Refactor code a bit in rcu_nocb_do_flush_bypass()
This consolidates the code a bit and makes it cleaner. Functionally it
is the same.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29 14:02:52 -08:00
Joel Fernandes (Google) 3cb278e73b rcu: Make call_rcu() lazy to save power
Implement timer-based RCU callback batching (also known as lazy
callbacks). With this we save about 5-10% of power consumed due
to RCU requests that happen when system is lightly loaded or idle.

By default, all async callbacks (queued via call_rcu) are marked
lazy. An alternate API call_rcu_hurry() is provided for the few users,
for example synchronize_rcu(), that need the old behavior.

The batch is flushed whenever a certain amount of time has passed, or
the batch on a particular CPU grows too big. Also memory pressure will
flush it in a future patch.

To handle several corner cases automagically (such as rcu_barrier() and
hotplug), we re-use bypass lists which were originally introduced to
address lock contention, to handle lazy CBs as well. The bypass list
length has the lazy CB length included in it. A separate lazy CB length
counter is also introduced to keep track of the number of lazy CBs.

[ paulmck: Fix formatting of inline call_rcu_lazy() definition. ]
[ paulmck: Apply Zqiang feedback. ]
[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]

Suggested-by: Paul McKenney <paulmck@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29 14:02:23 -08:00
Jakub Kicinski 96917bb3a3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
include/linux/net.h
  a5ef058dc4 ("net: introduce and use custom sockopt socket flag")
  e993ffe3da ("net: flag sockets supporting msghdr originated zerocopy")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 13:44:11 -07:00
Frederic Weisbecker e29a4915db srcu: Debug NMI safety even on archs that don't require it
Currently the NMI safety debugging is only performed on architectures
that don't support NMI-safe this_cpu_inc().

Reorder the code so that other architectures like x86 also detect bad
uses.

[ paulmck: Apply kernel test robot, Stephen Rothwell, and Zqiang feedback. ]

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-21 10:44:11 -07:00
Frederic Weisbecker ae3c070616 srcu: Explain the reason behind the read side critical section on GP start
Tell about the need to protect against concurrent updaters who may
overflow the GP counter behind the current update.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-21 10:16:15 -07:00
Frederic Weisbecker 6b77bb9b99 srcu: Warn when NMI-unsafe API is used in NMI
Using the NMI-unsafe reader API from within an NMI handler is very likely
to be buggy for three reasons:

1) NMIs aren't strictly re-entrant (a pending nested NMI will execute at
   the end of the current one) so it should be fine to use a non-atomic
   increment here. However, breakpoints can still interrupt NMIs and if
   a breakpoint callback has a reader on that same ssp, a racy increment
   can happen.

2) If the only reader site for a given srcu_struct structure is in an
   NMI handler, then RCU should be used instead of SRCU.

3) Because of the previous reason (2), an srcu_struct structure having
   an SRCU read side critical section in an NMI handler is likely to
   have another one from a task context.

For all these reasons, warn if an NMI-unsafe reader API is used from an
NMI handler.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-21 10:15:53 -07:00
Zqiang ceb1c8c9b8 rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state()
Running rcutorture with non-zero fqs_duration module parameter in a
kernel built with CONFIG_PREEMPTION=y results in the following splat:

BUG: using __this_cpu_read() in preemptible [00000000]
code: rcu_torture_fqs/398
caller is __this_cpu_preempt_check+0x13/0x20
CPU: 3 PID: 398 Comm: rcu_torture_fqs Not tainted 6.0.0-rc1-yoctodev-standard+
Call Trace:
<TASK>
dump_stack_lvl+0x5b/0x86
dump_stack+0x10/0x16
check_preemption_disabled+0xe5/0xf0
__this_cpu_preempt_check+0x13/0x20
rcu_force_quiescent_state.part.0+0x1c/0x170
rcu_force_quiescent_state+0x1e/0x30
rcu_torture_fqs+0xca/0x160
? rcu_torture_boost+0x430/0x430
kthread+0x192/0x1d0
? kthread_complete_and_exit+0x30/0x30
ret_from_fork+0x22/0x30
</TASK>

The problem is that rcu_force_quiescent_state() uses __this_cpu_read()
in preemptible code instead of the proper raw_cpu_read().  This commit
therefore changes __this_cpu_read() to raw_cpu_read().

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-21 10:11:01 -07:00
Paul E. McKenney df83fff758 rcu-tasks: Make grace-period-age message human-readable
This commit adds a few words to the informative message that appears
every ten seconds in RCU Tasks and RCU Tasks Trace grace periods.
This message currently reads as follows:

rcu_tasks_wait_gp: rcu_tasks grace period 1046 is 10088 jiffies old.

After this change, it provides additional context, instead reading
as follows:

rcu_tasks_wait_gp: rcu_tasks grace period number 1046 (since boot) is 10088 jiffies old.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-21 10:11:00 -07:00
Yipeng Zou fdbdb86845 rcu: Remove rcu_is_idle_cpu()
The commit 3fcd6a230f ("x86/cpu: Avoid cpuinfo-induced IPIing of
idle CPUs") introduced rcu_is_idle_cpu() in order to identify the
current CPU idle state.  But commit f3eca381bd ("x86/aperfmperf:
Replace arch_freq_get_on_cpu()") switched to using MAX_SAMPLE_AGE,
so rcu_is_idle_cpu() is no longer used.  This commit therefore removes it.

Fixes: f3eca381bd ("x86/aperfmperf: Replace arch_freq_get_on_cpu()")
Signed-off-by: Yipeng Zou <zouyipeng@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-21 10:11:00 -07:00
Paul E. McKenney 31d8aaa87f rcu: Keep synchronize_rcu() from enabling irqs in early boot
Making polled RCU grace periods account for expedited grace periods
required acquiring the leaf rcu_node structure's lock during early boot,
but after rcu_init() was called.  This lock is irq-disabled, but the
code incorrectly assumes that irqs are always disabled when invoking
synchronize_rcu().  The exception is early boot before the scheduler has
started, which means that upon return from synchronize_rcu(), irqs will
be incorrectly enabled.

This commit fixes this bug by using irqsave/irqrestore locking primitives.

Fixes: bf95b2bc3e ("rcu: Switch polled grace-period APIs to ->gp_seq_polled")

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-20 15:34:49 -07:00
Paul E. McKenney 36f65f1d15 srcu: Check for consistent global per-srcu_struct NMI safety
This commit adds runtime checks to verify that a given srcu_struct uses
consistent NMI-safe (or not) read-side primitives globally, but based
on the per-CPU data.  These global checks are made by the grace-period
code that must scan the srcu_data structures anyway, and are done only
in kernels built with CONFIG_PROVE_RCU=y.

Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Petr Mladek <pmladek@suse.com>
2022-10-20 15:02:27 -07:00
Paul E. McKenney 27120e7d2c srcu: Check for consistent per-CPU per-srcu_struct NMI safety
This commit adds runtime checks to verify that a given srcu_struct uses
consistent NMI-safe (or not) read-side primitives on a per-CPU basis.

Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Petr Mladek <pmladek@suse.com>
2022-10-20 15:02:15 -07:00
Paul E. McKenney 2e83b879fb srcu: Create an srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe()
On strict load-store architectures, the use of this_cpu_inc() by
srcu_read_lock() and srcu_read_unlock() is not NMI-safe in TREE SRCU.
To see this suppose that an NMI arrives in the middle of srcu_read_lock(),
just after it has read ->srcu_lock_count, but before it has written
the incremented value back to memory.  If that NMI handler also does
srcu_read_lock() and srcu_read_lock() on that same srcu_struct structure,
then upon return from that NMI handler, the interrupted srcu_read_lock()
will overwrite the NMI handler's update to ->srcu_lock_count, but
leave unchanged the NMI handler's update by srcu_read_unlock() to
->srcu_unlock_count.

This can result in a too-short SRCU grace period, which can in turn
result in arbitrary memory corruption.

If the NMI handler instead interrupts the srcu_read_unlock(), this
can result in eternal SRCU grace periods, which is not much better.

This commit therefore creates a pair of new srcu_read_lock_nmisafe()
and srcu_read_unlock_nmisafe() functions, which allow SRCU readers in
both NMI handlers and in process and IRQ context.  It is bad practice
to mix the existing and the new _nmisafe() primitives on the same
srcu_struct structure.  Use one set or the other, not both.

Just to underline that "bad practice" point, using srcu_read_lock() at
process level and srcu_read_lock_nmisafe() in your NMI handler will not,
repeat NOT, work.  If you do not immediately understand why this is the
case, please review the earlier paragraphs in this commit log.

[ paulmck: Apply kernel test robot feedback. ]
[ paulmck: Apply feedback from Randy Dunlap. ]
[ paulmck: Apply feedback from John Ogness. ]
[ paulmck: Apply feedback from Frederic Weisbecker. ]

Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Petr Mladek <pmladek@suse.com>
2022-10-20 14:39:18 -07:00
Paul E. McKenney 1324d95b1c rcutorture: Verify NUM_ACTIVE_RCU_POLL_OLDSTATE
This commit adds code to the RTWS_POLL_GET case of rcu_torture_writer()
to verify that the value of NUM_ACTIVE_RCU_POLL_OLDSTATE is sufficiently
large

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18 15:02:12 -07:00
Paul E. McKenney 1d5ebc351f rcutorture: Verify NUM_ACTIVE_RCU_POLL_FULL_OLDSTATE
This commit adds code to the RTWS_POLL_GET_FULL case
of rcu_torture_writer() to verify that the value of
NUM_ACTIVE_RCU_POLL_FULL_OLDSTATE is sufficiently large.

[ paulmck: Fix whitespace issue located by checkpatch.pl. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18 15:02:12 -07:00
Frederic Weisbecker b8f7aca3f0 rcu: Fix missing nocb gp wake on rcu_barrier()
In preparation for RCU lazy changes, wake up the RCU nocb gp thread if
needed after an entrain.  This change prevents the RCU barrier callback
from waiting in the queue for several seconds before the lazy callbacks
in front of it are serviced.

Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18 15:01:31 -07:00
Joel Fernandes (Google) b50606f35f rcu: Fix late wakeup when flush of bypass cblist happens
When the bypass cblist gets too big or its timeout has occurred, it is
flushed into the main cblist. However, the bypass timer is still running
and the behavior is that it would eventually expire and wake the GP
thread.

Since we are going to use the bypass cblist for lazy CBs, do the wakeup
soon as the flush for "too big or too long" bypass list happens.
Otherwise, long delays can happen for callbacks which get promoted from
lazy to non-lazy.

This is a good thing to do anyway (regardless of future lazy patches),
since it makes the behavior consistent with behavior of other code paths
where flushing into the ->cblist makes the GP kthread into a
non-sleeping state quickly.

[ Frederic Weisbecker: Changes to avoid unnecessary GP-thread wakeups plus
		    comment changes. ]

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18 15:00:36 -07:00
Zhen Lei f1ffec1ea3 rcu: Simplify rcu_init_nohz() cpumask handling
In kernels built with either CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y or
CONFIG_NO_HZ_FULL=y, additional CPUs must be added to rcu_nocb_mask.
Except that kernels booted without the rcu_nocbs= will not have
allocated rcu_nocb_mask.  And the current rcu_init_nohz() function uses
its need_rcu_nocb_mask and offload_all local variables to track the
rcu_nocb and nohz_full state.

But there is a much simpler approach, namely creating a cpumask pointer
to track the default and then using cpumask_available() to check the
rcu_nocb_mask state.  This commit takes this approach, thereby simplifying
and shortening the rcu_init_nohz() function.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18 15:00:36 -07:00
Joel Fernandes (Google) aba9645bd1 rcu: Use READ_ONCE() for lockless read of rnp->qsmask
The rnp->qsmask is locklessly accessed from rcutree_dying_cpu(). This
may help avoid load tearing due to concurrent access, KCSAN
issues, and preserve sanity of people reading the mask in tracing.

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18 14:59:57 -07:00
Pingfan Liu 6343402ac3 rcu: Synchronize ->qsmaskinitnext in rcu_boost_kthread_setaffinity()
Once either rcutree_online_cpu() or rcutree_dead_cpu() is invoked
concurrently, the following rcu_boost_kthread_setaffinity() race can
occur:

        CPU 1                               CPU2
mask = rcu_rnp_online_cpus(rnp);
...

                                   mask = rcu_rnp_online_cpus(rnp);
                                   ...
                                   set_cpus_allowed_ptr(t, cm);

set_cpus_allowed_ptr(t, cm);

This results in CPU2's update being overwritten by that of CPU1, and
thus the possibility of ->boost_kthread_task continuing to run on a
to-be-offlined CPU.

This commit therefore eliminates this race by relying on the pre-existing
acquisition of ->boost_kthread_mutex to serialize the full process of
changing the affinity of ->boost_kthread_task.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18 14:59:57 -07:00
Zqiang d6fd907a95 rcu: Remove duplicate RCU exp QS report from rcu_report_dead()
The rcu_report_dead() function invokes rcu_report_exp_rdp() in order
to force an immediate expedited quiescent state on the outgoing
CPU, and then it invokes rcu_preempt_deferred_qs() to provide any
required deferred quiescent state of either sort.  Because the call to
rcu_preempt_deferred_qs() provides the expedited RCU quiescent state if
requested, the call to rcu_report_exp_rdp() is potentially redundant.

One possible issue is a concurrent start of a new expedited RCU
grace period, but this situation is already handled correctly
by __sync_rcu_exp_select_node_cpus().  This function will detect
that the CPU is going offline via the error return from its call
to smp_call_function_single().  In that case, it will retry, and
eventually stop retrying due to rcu_report_exp_rdp() clearing the
->qsmaskinitnext bit corresponding to the target CPU.  As a result,
__sync_rcu_exp_select_node_cpus() will report the necessary quiescent
state after dealing with any remaining CPU.

This change assumes that control does not enter rcu_report_dead() within
an RCU read-side critical section, but then again, the surviving call
to rcu_preempt_deferred_qs() has always made this assumption.

This commit therefore removes the call to rcu_report_exp_rdp(), thus
relying on rcu_preempt_deferred_qs() to handle both normal and expedited
quiescent states.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18 14:59:57 -07:00
Paul E. McKenney 5d0f5953b6 srcu: Convert ->srcu_lock_count and ->srcu_unlock_count to atomic
NMI-safe variants of srcu_read_lock() and srcu_read_unlock() are needed
by printk(), which on many architectures entails read-modify-write
atomic operations.  This commit prepares Tree SRCU for this change by
making both ->srcu_lock_count and ->srcu_unlock_count by atomic_long_t.

[ paulmck: Apply feedback from John Ogness. ]

Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Petr Mladek <pmladek@suse.com>
2022-10-18 11:24:24 -07:00
Paul E. McKenney e6c86c513f rcu-tasks: Provide rcu_trace_implies_rcu_gp()
As an accident of implementation, an RCU Tasks Trace grace period also
acts as an RCU grace period.  However, this could change at any time.
This commit therefore creates an rcu_trace_implies_rcu_gp() that currently
returns true to codify this accident.  Code relying on this accident
must call this function to verify that this accident is still happening.

Reported-by: Hou Tao <houtao@huaweicloud.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Link: https://lore.kernel.org/r/20221014113946.965131-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-18 10:27:02 -07:00
Paul E. McKenney 5c0ec49004 Merge branches 'doc.2022.08.31b', 'fixes.2022.08.31b', 'kvfree.2022.08.31b', 'nocb.2022.09.01a', 'poll.2022.08.31b', 'poll-srcu.2022.08.31b' and 'tasks.2022.08.31b' into HEAD
doc.2022.08.31b: Documentation updates
fixes.2022.08.31b: Miscellaneous fixes
kvfree.2022.08.31b: kvfree_rcu() updates
nocb.2022.09.01a: NOCB CPU updates
poll.2022.08.31b: Full-oldstate RCU polling grace-period API
poll-srcu.2022.08.31b: Polled SRCU grace-period updates
tasks.2022.08.31b: Tasks RCU updates
2022-09-01 10:55:57 -07:00
Zqiang 48297a22a3 rcutorture: Use the barrier operation specified by cur_ops
The rcutorture_oom_notify() function unconditionally invokes
rcu_barrier(), which is OK when the rcutorture.torture_type value is
"rcu", but unhelpful otherwise.  The purpose of these barrier calls is to
wait for all outstanding callback-flooding callbacks to be invoked before
cleaning up their data.  Using the wrong barrier function therefore
risks arbitrary memory corruption.  Thus, this commit changes these
rcu_barrier() calls into cur_ops->cb_barrier() to make things work when
torturing non-vanilla flavors of RCU.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-09-01 10:50:04 -07:00
Zqiang 528262f502 rcu-tasks: Make RCU Tasks Trace check for userspace execution
Userspace execution is a valid quiescent state for RCU Tasks Trace,
but the scheduling-clock interrupt does not currently report such
quiescent states.

Of course, the scheduling-clock interrupt is not strictly speaking
userspace execution.  However, the only way that this code is not
in a quiescent state is if something invoked rcu_read_lock_trace(),
and that would be reflected in the ->trc_reader_nesting field in
the task_struct structure.  Furthermore, this field is checked by
rcu_tasks_trace_qs(), which is invoked by rcu_tasks_qs() which is in
turn invoked by rcu_note_voluntary_context_switch() in kernels building
at least one of the RCU Tasks flavors.  It is therefore safe to invoke
rcu_tasks_trace_qs() from the rcu_sched_clock_irq().

But rcu_tasks_qs() also invokes rcu_tasks_classic_qs() for RCU
Tasks, which lacks the read-side markers provided by RCU Tasks Trace.
This raises the possibility that an RCU Tasks grace period could start
after the interrupt from userspace execution, but before the call to
rcu_sched_clock_irq().  However, it turns out that this is safe because
the RCU Tasks grace period waits for an RCU grace period, which will
wait for the entire scheduling-clock interrupt handler, including any
RCU Tasks read-side critical section that this handler might contain.

This commit therefore updates the rcu_sched_clock_irq() function's
check for usermode execution and its call to rcu_tasks_classic_qs()
to instead check for both usermode execution and interrupt from idle,
and to instead call rcu_note_voluntary_context_switch().  This
consolidates code and provides more faster RCU Tasks Trace
reporting of quiescent states in kernels that do scheduling-clock
interrupts for userspace execution.

[ paulmck: Consolidate checks into rcu_sched_clock_irq(). ]

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:10:55 -07:00
Paul E. McKenney d6ad60635c rcu-tasks: Ensure RCU Tasks Trace loops have quiescent states
The RCU Tasks Trace grace-period kthread loops across all CPUs, and
there can be quite a few CPUs, with some commercially available systems
sporting well over a thousand of them.  Some of these loops can feature
IPIs, which can take some time.  This commit therefore places a call to
cond_resched_tasks_rcu_qs() in each such loop.

Link: https://docs.google.com/document/d/1V0YnG1HTWMt9WHJjroiJL9lf-hMrud4v8Fn3fhyY0cI/edit?usp=sharing
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:10:55 -07:00
Zqiang fcd53c8a4d rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE()
Kernels built with CONFIG_PROVE_RCU=y and CONFIG_DEBUG_LOCK_ALLOC=y
attempt to emit a warning when the synchronize_rcu_tasks_generic()
function is called during early boot while the rcu_scheduler_active
variable is RCU_SCHEDULER_INACTIVE.  However the warnings is not
actually be printed because the debug_lockdep_rcu_enabled() returns
false, exactly because the rcu_scheduler_active variable is still equal
to RCU_SCHEDULER_INACTIVE.

This commit therefore replaces RCU_LOCKDEP_WARN() with WARN_ONCE()
to force these warnings to actually be printed.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:10:54 -07:00
Paul E. McKenney 5fe89191e4 srcu: Make Tiny SRCU use full-sized grace-period counters
This commit makes Tiny SRCU use full-sized grace-period counters to
further avoid counter-wrap issues when using polled grace-period APIs.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:10:15 -07:00
Paul E. McKenney de3f2671ae srcu: Make Tiny SRCU poll_state_synchronize_srcu() more precise
This commit applies the more-precise grace-period-state check used by
rcu_seq_done_exact() to poll_state_synchronize_srcu().  This is important
because Tiny SRCU uses a 16-bit counter, which can wrap quite quickly.
If counter wrap continues to be a problem, then expanding ->srcu_idx
and ->srcu_idx_max to 32 bits might be warranted.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:10:15 -07:00
Paul E. McKenney 599d97e3f2 rcutorture: Make "srcud" option also test polled grace-period API
This commit brings the "srcud" (dynamically allocated) SRCU test in line
with the "srcu" (statically allocated) test, so that both test the full
SRCU polled grace-period API.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:10:15 -07:00
Paul E. McKenney 967c298d65 rcutorture: Limit read-side polling-API testing
RCU's polled grace-period API is reasonably lightweight, but still
contains heavyweight memory barriers.  This commit therefore limits
testing of this API from rcutorture's readers in order to avoid the
false negatives that these heavyweight operations could provoke.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:09:22 -07:00
Paul E. McKenney 5d7801f201 rcutorture: Expand rcu_torture_write_types() first "if" statement
This commit expands the rcu_torture_write_types() function's first "if"
condition and body, placing one element per line, in order to make the
compiler's error messages more helpful.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:09:22 -07:00
Paul E. McKenney cc8faf5b65 rcutorture: Use 1-suffixed variable in rcu_torture_write_types() check
This commit changes the use of gp_poll_exp to gp_poll_exp1 in the first
check in rcu_torture_write_types().  No functional effect, but consistency
is a good thing.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:09:22 -07:00
Paul E. McKenney d761de8a7d rcu: Make synchronize_rcu() fastpath update only boot-CPU counters
Large systems can have hundreds of rcu_node structures, and updating
counters in each of them might slow down booting.  This commit therefore
updates only the counters in those rcu_node structures corresponding
to the boot CPU, up to and including the root rcu_node structure.

The counters for the remaining rcu_node structures are updated by the
rcu_scheduler_starting() function, which executes just before the first
non-boot kthread is spawned.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:09:22 -07:00
Paul E. McKenney b3cdd0a79c rcutorture: Adjust rcu_poll_need_2gp() for rcu_gp_oldstate field removal
Now that rcu_gp_oldstate can accurately track both normal and
expedited grace periods regardless of system state, rcutorture's
rcu_poll_need_2gp() function need only call for a second grace period
for the old single-unsigned-long grace-period polling APIs
This commit therefore adjusts rcu_poll_need_2gp() accordingly.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:09:21 -07:00
Paul E. McKenney 7ecef0871d rcu: Remove ->rgos_polled field from rcu_gp_oldstate structure
Because both normal and expedited grace periods increment their respective
counters on their pre-scheduler early boot fastpaths, the rcu_gp_oldstate
structure no longer needs its ->rgos_polled field.  This commit therefore
removes this field, shrinking this structure so that it is the same size
as an rcu_head structure.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:09:21 -07:00
Paul E. McKenney 43ff97cc99 rcu: Make synchronize_rcu_expedited() fast path update .expedited_sequence
This commit causes the early boot single-CPU synchronize_rcu_expedited()
fastpath to update the rcu_state structure's ->expedited_sequence
counter.  This will allow the full-state polled grace-period APIs to
detect all expedited grace periods without the need to track the special
combined polling-only counter, which is another step towards removing
the ->rgos_polled field from the rcu_gp_oldstate, thereby reducing its
size by one third.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:09:21 -07:00
Paul E. McKenney e8755d2bde rcu: Remove expedited grace-period fast-path forward-progress helper
Now that the expedited grace-period fast path can only happen during
the pre-scheduler portion of early boot, this fast path can no longer
block run-time RCU Trace grace periods.  This commit therefore removes
the conditional cond_resched() invocation.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:09:21 -07:00
Paul E. McKenney 910e12092e rcu: Make synchronize_rcu() fast path update ->gp_seq counters
This commit causes the early boot single-CPU synchronize_rcu() fastpath to
update the rcu_state and rcu_node structures' ->gp_seq and ->gp_seq_needed
counters.  This will allow the full-state polled grace-period APIs to
detect all normal grace periods without the need to track the special
combined polling-only counter, which is a step towards removing the
->rgos_polled field from the rcu_gp_oldstate, thereby reducing its size
by one third.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:09:21 -07:00
Paul E. McKenney 5f11bad6b7 rcu-tasks: Remove grace-period fast-path rcu-tasks helper
Now that the grace-period fast path can only happen during the
pre-scheduler portion of early boot, this fast path can no longer block
run-time RCU Tasks and RCU Tasks Trace grace periods.  This commit
therefore removes the conditional cond_resched_tasks_rcu_qs() invocation.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:08:08 -07:00
Paul E. McKenney a5d1b0b68a rcu: Set rcu_data structures' initial ->gpwrap value to true
It would be good do reduce the size of the rcu_gp_oldstate structure
from three unsigned long instances to two, but this requires that the
boot-time optimized grace periods update the various ->gp_seq fields.
Updating these fields in the rcu_state structure and in all of the
rcu_node structures is at least semi-reasonable, but updating them in
all of the rcu_data structures is a bridge too far.  This means that if
there are too many early boot-time grace periods, the ->gp_seq field in
the rcu_data structure cannot be trusted.  This commit therefore sets
each rcu_data structure's ->gpwrap field to provide the necessary impetus
for a suitable level of distrust.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:08:08 -07:00
Paul E. McKenney 258f887aba rcu: Disable run-time single-CPU grace-period optimization
The run-time single-CPU grace-period optimization applies only to
kernels built with CONFIG_SMP=y && CONFIG_PREEMPTION=y that are running
on a single-CPU system.  But a kernel intended for a single-CPU system
should instead be built with CONFIG_SMP=n, and in any case, single-CPU
systems running Linux no longer appear to be the common case.  Plus this
optimization results in the rcu_gp_oldstate structure being half again
larger than it needs to be.

This commit therefore disables the run-time single-CPU grace-period
optimization, so that this optimization applies only during the
pre-scheduler portion of the boot sequence.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:08:08 -07:00
Paul E. McKenney 8df13f0160 rcu: Add full-sized polling for cond_sync_exp_full()
The cond_synchronize_rcu_expedited() API compresses the combined expedited and
normal grace-period states into a single unsigned long, which conserves
storage, but can miss grace periods in certain cases involving overlapping
normal and expedited grace periods.  Missing the occasional grace period
is usually not a problem, but there are use cases that care about each
and every grace period.

This commit therefore adds yet another member of the full-state RCU
grace-period polling API, which is the cond_synchronize_rcu_exp_full()
function.  This uses up to three times the storage (rcu_gp_oldstate
structure instead of unsigned long), but is guaranteed not to miss
grace periods.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:08:08 -07:00
Paul E. McKenney b6fe4917ae rcu: Add full-sized polling for cond_sync_full()
The cond_synchronize_rcu() API compresses the combined expedited and
normal grace-period states into a single unsigned long, which conserves
storage, but can miss grace periods in certain cases involving overlapping
normal and expedited grace periods.  Missing the occasional grace period
is usually not a problem, but there are use cases that care about each
and every grace period.

This commit therefore adds yet another member of the full-state RCU
grace-period polling API, which is the cond_synchronize_rcu_full()
function.  This uses up to three times the storage (rcu_gp_oldstate
structure instead of unsigned long), but is guaranteed not to miss
grace periods.

[ paulmck: Apply feedback from kernel test robot and Julia Lawall. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:08:08 -07:00
Paul E. McKenney f21e014345 rcu: Remove blank line from poll_state_synchronize_rcu() docbook header
This commit removes the blank line preceding the oldstate parameter to
the docbook header for the poll_state_synchronize_rcu() function and
marks uses of this parameter later in that header.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:08:08 -07:00
Paul E. McKenney 6c502b14ba rcu: Add full-sized polling for start_poll_expedited()
The start_poll_synchronize_rcu_expedited() API compresses the combined
expedited and normal grace-period states into a single unsigned long,
which conserves storage, but can miss grace periods in certain cases
involving overlapping normal and expedited grace periods.  Missing the
occasional grace period is usually not a problem, but there are use
cases that care about each and every grace period.

This commit therefore adds yet another member of the
full-state RCU grace-period polling API, which is the
start_poll_synchronize_rcu_expedited_full() function.  This uses up to
three times the storage (rcu_gp_oldstate structure instead of unsigned
long), but is guaranteed not to miss grace periods.

[ paulmck: Apply feedback from kernel test robot and Julia Lawall. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:08:08 -07:00
Paul E. McKenney 76ea364161 rcu: Add full-sized polling for start_poll()
The start_poll_synchronize_rcu() API compresses the combined expedited and
normal grace-period states into a single unsigned long, which conserves
storage, but can miss grace periods in certain cases involving overlapping
normal and expedited grace periods.  Missing the occasional grace period
is usually not a problem, but there are use cases that care about each
and every grace period.

This commit therefore adds the next member of the full-state RCU
grace-period polling API, namely the start_poll_synchronize_rcu_full()
function.  This uses up to three times the storage (rcu_gp_oldstate
structure instead of unsigned long), but is guaranteed not to miss
grace periods.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:08:08 -07:00
Paul E. McKenney f4754ad292 rcutorture: Verify long-running reader prevents full polling from completing
This commit adds full-state polling checks to accompany the old-style
polling checks in the rcu_torture_one_read() function.  If a polling
cycle within an RCU reader completes, a WARN_ONCE() is triggered.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:08:07 -07:00
Paul E. McKenney 37d6ade31c rcutorture: Remove redundant RTWS_DEF_FREE check
This check does nothing because the state at this point in the code
because the rcu_torture_writer_state value is guaranteed to instead
be RTWS_REPLACE.  This commit therefore removes this check.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:08:07 -07:00
Paul E. McKenney d594231aa5 rcutorture: Verify RCU reader prevents full polling from completing
This commit adds a test to rcu_torture_writer() that verifies that a
->get_gp_state_full() and ->poll_gp_state_full() polled grace-period
sequence does not claim that a grace period elapsed within the confines
of the corresponding read-side critical section.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:08:07 -07:00
Paul E. McKenney ed7d2f1abe rcutorture: Allow per-RCU-flavor polled double-GP check
Only vanilla RCU needs a double grace period for its compressed
polled grace-period old-state cookie.  This commit therefore adds an
rcu_torture_ops per-flavor function ->poll_need_2gp to allow this check
to be adapted to the RCU flavor under test.  A NULL pointer for this
function says that doubled grace periods are never needed.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:08:07 -07:00
Paul E. McKenney ccb42229fb rcutorture: Abstract synchronous and polled API testing
This commit abstracts a do_rtws_sync() function that does synchronous
grace-period testing, but also testing the polled API 25% of the time
each for the normal and full-state variants of the polled API.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:08:07 -07:00
Paul E. McKenney 3fdefca9b4 rcu: Add full-sized polling for get_state()
The get_state_synchronize_rcu() API compresses the combined expedited and
normal grace-period states into a single unsigned long, which conserves
storage, but can miss grace periods in certain cases involving overlapping
normal and expedited grace periods.  Missing the occasional grace period
is usually not a problem, but there are use cases that care about each
and every grace period.

This commit therefore adds the next member of the full-state RCU
grace-period polling API, namely the get_state_synchronize_rcu_full()
function.  This uses up to three times the storage (rcu_gp_oldstate
structure instead of unsigned long), but is guaranteed not to miss
grace periods.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:08:07 -07:00
Paul E. McKenney 91a967fd69 rcu: Add full-sized polling for get_completed*() and poll_state*()
The get_completed_synchronize_rcu() and poll_state_synchronize_rcu()
APIs compress the combined expedited and normal grace-period states into a
single unsigned long, which conserves storage, but can miss grace periods
in certain cases involving overlapping normal and expedited grace periods.
Missing the occasional grace period is usually not a problem, but there
are use cases that care about each and every grace period.

This commit therefore adds the first members of the full-state RCU
grace-period polling API, namely the get_completed_synchronize_rcu_full()
and poll_state_synchronize_rcu_full() functions.  These use up to three
times the storage (rcu_gp_oldstate structure instead of unsigned long),
but which are guaranteed not to miss grace periods, at least in situations
where the single-CPU grace-period optimization does not apply.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:08:07 -07:00
Paul E. McKenney 638dce227a rcu/nocb: Add CPU number to CPU-{,de}offload failure messages
Offline CPUs cannot be offloaded or deoffloaded.  Any attempt to offload
or deoffload an offline CPU causes a message to be printed on the console,
which is good, but this message does not contain the CPU number, which
is bad.  Such a CPU number can be helpful when debugging, as it gives a
clear indication that the CPU in question is in fact offline.  This commit
therefore adds the CPU number to the CPU-{,de}offload failure messages.

Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:07:19 -07:00
Zqiang 5334da2af2 rcu/nocb: Choose the right rcuog/rcuop kthreads to output
The show_rcu_nocb_gp_state() function is supposed to dump out the rcuog
kthread and the show_rcu_nocb_state() function is supposed to dump out
the rcuo[ps] kthread.  Currently, both do a mixture, which is not optimal
for debugging, even though it does not affect functionality.

This commit therefore adjusts these two functions to focus on their
respective kthreads.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:07:19 -07:00
Uladzislau Rezki (Sony) 51824b780b rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval
Currently the monitor work is scheduled with a fixed interval of HZ/20,
which is roughly 50 milliseconds. The drawback of this approach is
low utilization of the 512 page slots in scenarios with infrequence
kvfree_rcu() calls.  For example on an Android system:

<snip>
  kworker/3:3-507     [003] ....   470.286305: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d0f0dde5 nr_records=6
  kworker/6:1-76      [006] ....   470.416613: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000ea0d6556 nr_records=1
  kworker/6:1-76      [006] ....   470.416625: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000003e025849 nr_records=9
  kworker/3:3-507     [003] ....   471.390000: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000815a8713 nr_records=48
  kworker/1:1-73      [001] ....   471.725785: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000fda9bf20 nr_records=3
  kworker/1:1-73      [001] ....   471.725833: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000a425b67b nr_records=76
  kworker/0:4-1411    [000] ....   472.085673: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000007996be9d nr_records=1
  kworker/0:4-1411    [000] ....   472.085728: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d0f0dde5 nr_records=5
  kworker/6:1-76      [006] ....   472.260340: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000065630ee4 nr_records=102
<snip>

In many cases, out of 512 slots, fewer than 10 were actually used.
In order to improve batching and make utilization more efficient this
commit sets a drain interval to a fixed 5-seconds interval. Floods are
detected when a page fills quickly, and in that case, the reclaim work
is re-scheduled for the next scheduling-clock tick (jiffy).

After this change:

<snip>
  kworker/7:1-371     [007] ....  5630.725708: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000005ab0ffb3 nr_records=121
  kworker/7:1-371     [007] ....  5630.989702: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000060c84761 nr_records=47
  kworker/7:1-371     [007] ....  5630.989714: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000000babf308 nr_records=510
  kworker/7:1-371     [007] ....  5631.553790: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000bb7bd0ef nr_records=169
  kworker/7:1-371     [007] ....  5631.553808: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000044c78753 nr_records=510
  kworker/5:6-9428    [005] ....  5631.746102: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d98519aa nr_records=123
  kworker/4:7-9434    [004] ....  5632.001758: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000526c9d44 nr_records=322
  kworker/4:7-9434    [004] ....  5632.002073: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000002c6a8afa nr_records=185
  kworker/7:1-371     [007] ....  5632.277515: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000007f4a962f nr_records=510
<snip>

Here, all but one of the cases, more than one hundreds slots were used,
representing an order-of-magnitude improvement.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:06:50 -07:00
Joel Fernandes (Google) 3826909635 rcu/kfree: Fix kfree_rcu_shrink_count() return value
As per the comments in include/linux/shrinker.h, .count_objects callback
should return the number of freeable items, but if there are no objects
to free, SHRINK_EMPTY should be returned. The only time 0 is returned
should be when we are unable to determine the number of objects, or the
cache should be skipped for another reason.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:06:50 -07:00
Michal Hocko 093590c16b rcu: Back off upon fill_page_cache_func() allocation failure
The fill_page_cache_func() function allocates couple of pages to store
kvfree_rcu_bulk_data structures. This is a lightweight (GFP_NORETRY)
allocation which can fail under memory pressure. The function will,
however keep retrying even when the previous attempt has failed.

This retrying is in theory correct, but in practice the allocation is
invoked from workqueue context, which means that if the memory reclaim
gets stuck, these retries can hog the worker for quite some time.
Although the workqueues subsystem automatically adjusts concurrency, such
adjustment is not guaranteed to happen until the worker context sleeps.
And the fill_page_cache_func() function's retry loop is not guaranteed
to sleep (see the should_reclaim_retry() function).

And we have seen this function cause workqueue lockups:

kernel: BUG: workqueue lockup - pool cpus=93 node=1 flags=0x1 nice=0 stuck for 32s!
[...]
kernel: pool 74: cpus=37 node=0 flags=0x1 nice=0 hung=32s workers=2 manager: 2146
kernel:   pwq 498: cpus=249 node=1 flags=0x1 nice=0 active=4/256 refcnt=5
kernel:     in-flight: 1917:fill_page_cache_func
kernel:     pending: dbs_work_handler, free_work, kfree_rcu_monitor

Originally, we thought that the root cause of this lockup was several
retries with direct reclaim, but this is not yet confirmed.  Furthermore,
we have seen similar lockups without any heavy memory pressure.  This
suggests that there are other factors contributing to these lockups.
However, it is not really clear that endless retries are desireable.

So let's make the fill_page_cache_func() function back off after
allocation failure.

Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:06:50 -07:00
Paul E. McKenney 7634b1eaa0 rcu: Exclude outgoing CPU when it is the last to leave
The rcu_boost_kthread_setaffinity() function removes the outgoing CPU
from the set_cpus_allowed() mask for the corresponding leaf rcu_node
structure's rcub priority-boosting kthread.  Except that if the outgoing
CPU will leave that structure without any online CPUs, the mask is set
to the housekeeping CPU mask from housekeeping_cpumask().  Which is fine
unless the outgoing CPU happens to be a housekeeping CPU.

This commit therefore removes the outgoing CPU from the housekeeping mask.
This would of course be problematic if the outgoing CPU was the last
online housekeeping CPU, but in that case you are in a world of hurt
anyway.  If someone comes up with a valid use case for a system needing
all the housekeeping CPUs to be offline, further adjustments can be made.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:06:03 -07:00
Zqiang 621189a1fe rcu: Avoid triggering strict-GP irq-work when RCU is idle
Kernels built with PREEMPT_RCU=y and RCU_STRICT_GRACE_PERIOD=y trigger
irq-work from rcu_read_unlock(), and the resulting irq-work handler
invokes rcu_preempt_deferred_qs_handle().  The point of this triggering
is to force grace periods to end quickly in order to give tools like KASAN
a better chance of detecting RCU usage bugs such as leaking RCU-protected
pointers out of an RCU read-side critical section.

However, this irq-work triggering is unconditional.  This works, but
there is no point in doing this irq-work unless the current grace period
is waiting on the running CPU or task, which is not the common case.
After all, in the common case there are many rcu_read_unlock() calls
per CPU per grace period.

This commit therefore triggers the irq-work only when the current grace
period is waiting on the running CPU or task.

This change was tested as follows on a four-CPU system:

	echo rcu_preempt_deferred_qs_handler > /sys/kernel/debug/tracing/set_ftrace_filter
	echo 1 > /sys/kernel/debug/tracing/function_profile_enabled
	insmod rcutorture.ko
	sleep 20
	rmmod rcutorture.ko
	echo 0 > /sys/kernel/debug/tracing/function_profile_enabled
	echo > /sys/kernel/debug/tracing/set_ftrace_filter

This procedure produces results in this per-CPU set of files:

	/sys/kernel/debug/tracing/trace_stat/function*

Sample output from one of these files is as follows:

  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  rcu_preempt_deferred_qs_handle      838746    182650.3 us     0.217 us        0.004 us

The baseline sum of the "Hit" values (the number of calls to this
function) was 3,319,015.  With this commit, that sum was 1,140,359,
for a 2.9x reduction.  The worst-case variance across the CPUs was less
than 25%, so this large effect size is statistically significant.

The raw data is available in the Link: URL.

Link: https://lore.kernel.org/all/20220808022626.12825-1-qiang1.zhang@intel.com/
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:06:02 -07:00
Zhen Lei e73dfe3093 sched/debug: Try trigger_single_cpu_backtrace(cpu) in dump_cpu_task()
The trigger_all_cpu_backtrace() function attempts to send an NMI to the
target CPU, which usually provides much better stack traces than the
dump_cpu_task() function's approach of dumping that stack from some other
CPU.  So much so that most calls to dump_cpu_task() only happen after
a call to trigger_all_cpu_backtrace() has failed.  And the exception to
this rule really should attempt to use trigger_all_cpu_backtrace() first.

Therefore, move the trigger_all_cpu_backtrace() invocation into
dump_cpu_task().

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
2022-08-31 05:03:14 -07:00
Paul E. McKenney 089254fd38 rcu: Document reason for rcu_all_qs() call to preempt_disable()
Given that rcu_all_qs() is in non-preemptible kernels, why on earth should
it invoke preempt_disable()?  This commit adds the reason, which is to
work nicely with debugging enabled in CONFIG_PREEMPT_COUNT=y kernels.

Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:03:14 -07:00
Zqiang 6ca0292ccf rcu: Make tiny RCU support leak callbacks for debug-object errors
Currently, only Tree RCU leaks callbacks setting when it detects a
duplicate call_rcu().  This commit causes Tiny RCU to also leak
callbacks in this situation.

Because this is Tiny RCU, kernel size is important:

1. CONFIG_TINY_RCU=y and CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
   (Production kernel)

    Original:
    text      data      bss       dec       hex     filename
    26290663  20159823  15212544  61663030  3ace736 vmlinux

    With this commit:
    text      data      bss       dec       hex     filename
    26290663  20159823  15212544  61663030  3ace736 vmlinux

2. CONFIG_TINY_RCU=y and CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
   (Debugging kernel)

    Original:
    text      data      bss       dec       hex     filename
    26291319  20160143  15212544  61664006  3aceb06 vmlinux

    With this commit:
    text      data      bss       dec       hex     filename
    26291319  20160431  15212544  61664294  3acec26 vmlinux

These results show that the kernel size is unchanged for production
kernels, as desired.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:03:14 -07:00
Zqiang fcb42c9a77 rcu: Add QS check in rcu_exp_handler() for non-preemptible kernels
Kernels built with CONFIG_PREEMPTION=n and CONFIG_PREEMPT_COUNT=y maintain
preempt_count() state.  Because such kernels map __rcu_read_lock()
and __rcu_read_unlock() to preempt_disable() and preempt_enable(),
respectively, this allows the expedited grace period's !CONFIG_PREEMPT_RCU
version of the rcu_exp_handler() IPI handler function to use
preempt_count() to detect quiescent states.

This preempt_count() usage might seem to risk failures due to
use of implicit RCU readers in portions of the kernel under #ifndef
CONFIG_PREEMPTION, except that rcu_core() already disallows such implicit
RCU readers.  The moral of this story is that you must use explicit
read-side markings such as rcu_read_lock() or preempt_disable() even if
the code knows that this kernel does not support preemption.

This commit therefore adds a preempt_count()-based check for a quiescent
state in the !CONFIG_PREEMPT_RCU version of the rcu_exp_handler()
function for kernels built with CONFIG_PREEMPT_COUNT=y, reporting an
immediate quiescent state when the interrupted code had both preemption
and softirqs enabled.

This change results in about a 2% reduction in expedited grace-period
latency in kernels built with both CONFIG_PREEMPT_RCU=n and
CONFIG_PREEMPT_COUNT=y.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/all/20220622103549.2840087-1-qiang1.zhang@intel.com/
2022-08-31 05:03:14 -07:00
Zqiang bca4fa8cb0 rcu: Update rcu_preempt_deferred_qs() comments for !PREEMPT kernels
In non-premptible kernels, tasks never do context switches within
RCU read-side critical sections.  Therefore, in such kernels, each
leaf rcu_node structure's ->blkd_tasks list will always be empty.
The comment on the non-preemptible version of rcu_preempt_deferred_qs()
confuses this point, so this commit therefore fixes it.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:03:14 -07:00
Zqiang 6d60ea03ac rcu: Fix rcu_read_unlock_strict() strict QS reporting
Kernels built with CONFIG_PREEMPT=n and CONFIG_RCU_STRICT_GRACE_PERIOD=y
report the quiescent state directly from the outermost rcu_read_unlock().
However, the current CPU's rcu_data structure's ->cpu_no_qs.b.norm
might still be set, in which case rcu_report_qs_rdp() will exit early,
thus failing to report quiescent state.

This commit therefore causes rcu_read_unlock_strict() to clear
CPU's rcu_data structure's ->cpu_no_qs.b.norm field before invoking
rcu_report_qs_rdp().

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31 05:03:14 -07:00
Linus Torvalds 6614a3c316 - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
 
 - Some kmemleak fixes from Patrick Wang and Waiman Long
 
 - DAMON updates from SeongJae Park
 
 - memcg debug/visibility work from Roman Gushchin
 
 - vmalloc speedup from Uladzislau Rezki
 
 - more folio conversion work from Matthew Wilcox
 
 - enhancements for coherent device memory mapping from Alex Sierra
 
 - addition of shared pages tracking and CoW support for fsdax, from
   Shiyang Ruan
 
 - hugetlb optimizations from Mike Kravetz
 
 - Mel Gorman has contributed some pagealloc changes to improve latency
   and realtime behaviour.
 
 - mprotect soft-dirty checking has been improved by Peter Xu
 
 - Many other singleton patches all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA
 jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/
 SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE=
 =w/UH
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Most of the MM queue. A few things are still pending.

  Liam's maple tree rework didn't make it. This has resulted in a few
  other minor patch series being held over for next time.

  Multi-gen LRU still isn't merged as we were waiting for mapletree to
  stabilize. The current plan is to merge MGLRU into -mm soon and to
  later reintroduce mapletree, with a view to hopefully getting both
  into 6.1-rc1.

  Summary:

   - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
     Lin, Yang Shi, Anshuman Khandual and Mike Rapoport

   - Some kmemleak fixes from Patrick Wang and Waiman Long

   - DAMON updates from SeongJae Park

   - memcg debug/visibility work from Roman Gushchin

   - vmalloc speedup from Uladzislau Rezki

   - more folio conversion work from Matthew Wilcox

   - enhancements for coherent device memory mapping from Alex Sierra

   - addition of shared pages tracking and CoW support for fsdax, from
     Shiyang Ruan

   - hugetlb optimizations from Mike Kravetz

   - Mel Gorman has contributed some pagealloc changes to improve
     latency and realtime behaviour.

   - mprotect soft-dirty checking has been improved by Peter Xu

   - Many other singleton patches all over the place"

 [ XFS merge from hell as per Darrick Wong in

   https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]

* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
  tools/testing/selftests/vm/hmm-tests.c: fix build
  mm: Kconfig: fix typo
  mm: memory-failure: convert to pr_fmt()
  mm: use is_zone_movable_page() helper
  hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
  hugetlbfs: cleanup some comments in inode.c
  hugetlbfs: remove unneeded header file
  hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
  hugetlbfs: use helper macro SZ_1{K,M}
  mm: cleanup is_highmem()
  mm/hmm: add a test for cross device private faults
  selftests: add soft-dirty into run_vmtests.sh
  selftests: soft-dirty: add test for mprotect
  mm/mprotect: fix soft-dirty check in can_change_pte_writable()
  mm: memcontrol: fix potential oom_lock recursion deadlock
  mm/gup.c: fix formatting in check_and_migrate_movable_page()
  xfs: fail dax mount if reflink is enabled on a partition
  mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
  userfaultfd: don't fail on unrecognized features
  hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
  ...
2022-08-05 16:32:45 -07:00
Linus Torvalds 228dfe98a3 Char / Misc driver changes for 6.0-rc1
Here is the large set of char and misc and other driver subsystem
 changes for 6.0-rc1.
 
 Highlights include:
 	- large set of IIO driver updates, additions, and cleanups
 	- new habanalabs device support added (loads of register maps
 	  much like GPUs have)
 	- soundwire driver updates
 	- phy driver updates
 	- slimbus driver updates
 	- tiny virt driver fixes and updates
 	- misc driver fixes and updates
 	- interconnect driver updates
 	- hwtracing driver updates
 	- fpga driver updates
 	- extcon driver updates
 	- firmware driver updates
 	- counter driver update
 	- mhi driver fixes and updates
 	- binder driver fixes and updates
 	- speakup driver fixes
 
 Full details are in the long shortlog contents.
 
 All of these have been in linux-next for a while without any reported
 problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYup9QQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylBKQCfaSuzl9ZP9dTvAw2FPp14oRqXnpoAnicvWAoq
 1vU9Vtq2c73uBVLdZm4m
 =AwP3
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char / misc driver updates from Greg KH:
 "Here is the large set of char and misc and other driver subsystem
  changes for 6.0-rc1.

  Highlights include:

   - large set of IIO driver updates, additions, and cleanups

   - new habanalabs device support added (loads of register maps much
     like GPUs have)

   - soundwire driver updates

   - phy driver updates

   - slimbus driver updates

   - tiny virt driver fixes and updates

   - misc driver fixes and updates

   - interconnect driver updates

   - hwtracing driver updates

   - fpga driver updates

   - extcon driver updates

   - firmware driver updates

   - counter driver update

   - mhi driver fixes and updates

   - binder driver fixes and updates

   - speakup driver fixes

  All of these have been in linux-next for a while without any reported
  problems"

* tag 'char-misc-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (634 commits)
  drivers: lkdtm: fix clang -Wformat warning
  char: remove VR41XX related char driver
  misc: Mark MICROCODE_MINOR unused
  spmi: trace: fix stack-out-of-bound access in SPMI tracing functions
  dt-bindings: iio: adc: Add compatible for MT8188
  iio: light: isl29028: Fix the warning in isl29028_remove()
  iio: accel: sca3300: Extend the trigger buffer from 16 to 32 bytes
  iio: fix iio_format_avail_range() printing for none IIO_VAL_INT
  iio: adc: max1027: unlock on error path in max1027_read_single_value()
  iio: proximity: sx9324: add empty line in front of bullet list
  iio: magnetometer: hmc5843: Remove duplicate 'the'
  iio: magn: yas530: Use DEFINE_RUNTIME_DEV_PM_OPS() and pm_ptr() macros
  iio: magnetometer: ak8974: Use DEFINE_RUNTIME_DEV_PM_OPS() and pm_ptr() macros
  iio: light: veml6030: Use DEFINE_RUNTIME_DEV_PM_OPS() and pm_ptr() macros
  iio: light: vcnl4035: Use DEFINE_RUNTIME_DEV_PM_OPS() and pm_ptr() macros
  iio: light: vcnl4000: Use DEFINE_RUNTIME_DEV_PM_OPS() and pm_ptr() macros
  iio: light: tsl2591: Use DEFINE_RUNTIME_DEV_PM_OPS() and pm_ptr()
  iio: light: tsl2583: Use DEFINE_RUNTIME_DEV_PM_OPS and pm_ptr()
  iio: light: isl29028: Use DEFINE_RUNTIME_DEV_PM_OPS() and pm_ptr()
  iio: light: gp2ap002: Switch to DEFINE_RUNTIME_DEV_PM_OPS and pm_ptr()
  ...
2022-08-04 11:05:48 -07:00
Linus Torvalds 7d9d077c78 RCU pull request for v5.20 (or whatever)
This pull request contains the following branches:
 
 doc.2022.06.21a: Documentation updates.
 
 fixes.2022.07.19a: Miscellaneous fixes.
 
 nocb.2022.07.19a: Callback-offload updates, perhaps most notably a new
 	RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to
 	be offloaded at boot time, regardless of kernel boot parameters.
 	This is useful to battery-powered systems such as ChromeOS
 	and Android.  In addition, a new RCU_NOCB_CPU_CB_BOOST kernel
 	boot parameter prevents offloaded callbacks from interfering
 	with real-time workloads and with energy-efficiency mechanisms.
 
 poll.2022.07.21a: Polled grace-period updates, perhaps most notably
 	making these APIs account for both normal and expedited grace
 	periods.
 
 rcu-tasks.2022.06.21a: Tasks RCU updates, perhaps most notably reducing
 	the CPU overhead of RCU tasks trace grace periods by more than
 	a factor of two on a system with 15,000 tasks.	The reduction
 	is expected to increase with the number of tasks, so it seems
 	reasonable to hypothesize that a system with 150,000 tasks might
 	see a 20-fold reduction in CPU overhead.
 
 torture.2022.06.21a: Torture-test updates.
 
 ctxt.2022.07.05a: Updates that merge RCU's dyntick-idle tracking into
 	context tracking, thus reducing the overhead of transitioning to
 	kernel mode from either idle or nohz_full userspace execution
 	for kernels that track context independently of RCU.  This is
 	expected to be helpful primarily for kernels built with
 	CONFIG_NO_HZ_FULL=y.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmLgMcgTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jArXD/0fjbCwqpRjHVTzjMY8jN4zDkqZZD6m
 g8Fx27hZ4ToNFwRptyHwNezrNj14skjAJEXfdjaVw32W62ivXvf0HINvSzsTLCSq
 k2kWyBdXLc9CwY5p5W4smnpn5VoAScjg5PoPL59INoZ/Zziji323C7Zepl/1DYJt
 0T6bPCQjo1ZQoDUCyVpSjDmAqxnderWG0MeJVt74GkLqmnYLANg0GH8c7mH4+9LL
 kVGlLp5nlPgNJ4FEoFdMwNU8T/ETmaVld/m2dkiawjkXjJzB2XKtBigU91DDmXz5
 7DIdV4ABrxiy4kGNqtIe/jFgnKyVD7xiDpyfjd6KTeDr/rDS8u2ZH7+1iHsyz3g0
 Np/tS3vcd0KR+gI/d0eXxPbgm5sKlCmKw/nU2eArpW/+4LmVXBUfHTG9Jg+LJmBc
 JrUh6aEdIZJZHgv/nOQBNig7GJW43IG50rjuJxAuzcxiZNEG5lUSS23ysaA9CPCL
 PxRWKSxIEfK3kdmvVO5IIbKTQmIBGWlcWMTcYictFSVfBgcCXpPAksGvqA5JiUkc
 egW+xLFo/7K+E158vSKsVqlWZcEeUbsNJ88QOlpqnRgH++I2Yv/LhK41XfJfpH+Y
 ALxVaDd+mAq6v+qSHNVq9wT3ozXIPy/zK1hDlMIqx40h2YvaEsH4je+521oSoN9r
 vX60+QNxvUBLwA==
 =vUNm
 -----END PGP SIGNATURE-----

Merge tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU updates from Paul McKenney:

 - Documentation updates

 - Miscellaneous fixes

 - Callback-offload updates, perhaps most notably a new
   RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to be
   offloaded at boot time, regardless of kernel boot parameters.

   This is useful to battery-powered systems such as ChromeOS and
   Android. In addition, a new RCU_NOCB_CPU_CB_BOOST kernel boot
   parameter prevents offloaded callbacks from interfering with
   real-time workloads and with energy-efficiency mechanisms

 - Polled grace-period updates, perhaps most notably making these APIs
   account for both normal and expedited grace periods

 - Tasks RCU updates, perhaps most notably reducing the CPU overhead of
   RCU tasks trace grace periods by more than a factor of two on a
   system with 15,000 tasks.

   The reduction is expected to increase with the number of tasks, so it
   seems reasonable to hypothesize that a system with 150,000 tasks
   might see a 20-fold reduction in CPU overhead

 - Torture-test updates

 - Updates that merge RCU's dyntick-idle tracking into context tracking,
   thus reducing the overhead of transitioning to kernel mode from
   either idle or nohz_full userspace execution for kernels that track
   context independently of RCU.

   This is expected to be helpful primarily for kernels built with
   CONFIG_NO_HZ_FULL=y

* tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (98 commits)
  rcu: Add irqs-disabled indicator to expedited RCU CPU stall warnings
  rcu: Diagnose extended sync_rcu_do_polled_gp() loops
  rcu: Put panic_on_rcu_stall() after expedited RCU CPU stall warnings
  rcutorture: Test polled expedited grace-period primitives
  rcu: Add polled expedited grace-period primitives
  rcutorture: Verify that polled GP API sees synchronous grace periods
  rcu: Make Tiny RCU grace periods visible to polled APIs
  rcu: Make polled grace-period API account for expedited grace periods
  rcu: Switch polled grace-period APIs to ->gp_seq_polled
  rcu/nocb: Avoid polling when my_rdp->nocb_head_rdp list is empty
  rcu/nocb: Add option to opt rcuo kthreads out of RT priority
  rcu: Add nocb_cb_kthread check to rcu_is_callbacks_kthread()
  rcu/nocb: Add an option to offload all CPUs on boot
  rcu/nocb: Fix NOCB kthreads spawn failure with rcu_nocb_rdp_deoffload() direct call
  rcu/nocb: Invert rcu_state.barrier_mutex VS hotplug lock locking order
  rcu/nocb: Add/del rdp to iterate from rcuog itself
  rcu/tree: Add comment to describe GP-done condition in fqs loop
  rcu: Initialize first_gp_fqs at declaration in rcu_gp_fqs()
  rcu/kvfree: Remove useless monitor_todo flag
  rcu: Cleanup RCU urgency state for offline CPU
  ...
2022-08-02 19:12:45 -07:00
Linus Torvalds 4ba1329cbb Urgent RCU pull request for v5.19
This pull request contains a pair of commits that fix 282d8998e9 ("srcu:
 Prevent expedited GPs and blocking readers from consuming CPU"), which
 was itself a fix to an SRCU expedited grace-period problem that could
 prevent kernel live patching (KLP) from completing.  That SRCU fix for
 KLP introduced large (as in minutes) boot-time delays to embedded Linux
 kernels running on qemu/KVM.  These delays were due to the emulation of
 certain MMIO operations controlling memory layout, which were emulated
 with one expedited grace period per access.  Common configurations
 required thousands of boot-time MMIO accesses, and thus thousands of
 boot-time expedited SRCU grace periods.
 
 In these configurations, the occasional sleeps that allowed KLP to proceed
 caused excessive boot delays.  These commits preserve enough sleeps to
 permit KLP to proceed, but few enough that the virtual embedded kernels
 still boot reasonably quickly.
 
 This represents a regression introduced in the v5.19 merge window,
 and the bug is causing significant inconvenience, hence this pull request.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmLZ6LoTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jNHgD/4tb8Un6vZlrEaYbyA/ztUITX/2DisS
 kiqbQz1BH8V3B3PxSo4ldEiw+z3fC3SMyIPymuu9bhwm6SFdjEsarFkIqySxkYnX
 jnuk0JbWxs4Kk64rIkHHzAxzvM2Iw1EjSzjP1M+DC7iymSJpsgp+0zFJJtcJ8Y87
 67hbQRQYk+1T7ZT+vq77NiyAAFEzSd8UydgBVxlsOOdkXQ91NYTyB8D6ldUJAnLU
 opwCEpgpu74Sp4Te5q6f9uAt8xZmXsyrm8zJgzTz0KSgivcpt4GmIoyEFYUQczj0
 Hewr6+qM9AWfvfQxNvRCS25yeox18kbdp1qdp9rl0BZMtYN2Zsk1Ec4c79s7NBLc
 G3TIvJkGLHuZO1dO4BhLkYczgRYlaPxOR/0GKNn4m69/TbVmseUL1WeZS0pswB0q
 cH1AKKEg9KdPoaX0hTLoOrlv/vwbgjhKKuoqEv7yEUhJJdACy50rmnhWhSxeuQDb
 aIITVKkjkwpDtRX5QTdG1f5uIMoGz9BbUDv7VeodB0mrYHluXEfyNTwlqcISKAgm
 T9kLmsdfvMrQ4fLR5S3i3dwnL3b52OB8h5NyfW3YRkXEnA7//ef/XpPiW2HY8BMT
 7QwPqOoUSr/IraAcI8j0QxRpioUk1oaNi+UJ3FSHni8re6rZ0kaxatRCT20h6Djq
 C9RVLaevw3bGXQ==
 =ndhB
 -----END PGP SIGNATURE-----

Merge tag 'rcu-urgent.2022.07.21a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU fix from Paul McKenney:
 "This contains a pair of commits that fix 282d8998e9 ("srcu: Prevent
  expedited GPs and blocking readers from consuming CPU"), which was
  itself a fix to an SRCU expedited grace-period problem that could
  prevent kernel live patching (KLP) from completing.

  That SRCU fix for KLP introduced large (as in minutes) boot-time
  delays to embedded Linux kernels running on qemu/KVM. These delays
  were due to the emulation of certain MMIO operations controlling
  memory layout, which were emulated with one expedited grace period per
  access. Common configurations required thousands of boot-time MMIO
  accesses, and thus thousands of boot-time expedited SRCU grace
  periods.

  In these configurations, the occasional sleeps that allowed KLP to
  proceed caused excessive boot delays. These commits preserve enough
  sleeps to permit KLP to proceed, but few enough that the virtual
  embedded kernels still boot reasonably quickly.

  This represents a regression introduced in the v5.19 merge window, and
  the bug is causing significant inconvenience"

* tag 'rcu-urgent.2022.07.21a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  srcu: Make expedited RCU grace periods block even less frequently
  srcu: Block less aggressively for expedited grace periods
2022-07-22 10:01:20 -07:00
Paul E. McKenney 34bc7b454d Merge branch 'ctxt.2022.07.05a' into HEAD
ctxt.2022.07.05a: Linux-kernel memory model development branch.
2022-07-21 17:46:18 -07:00
Paul E. McKenney d38c8fe483 Merge branches 'doc.2022.06.21a', 'fixes.2022.07.19a', 'nocb.2022.07.19a', 'poll.2022.07.21a', 'rcu-tasks.2022.06.21a' and 'torture.2022.06.21a' into HEAD
doc.2022.06.21a: Documentation updates.
fixes.2022.07.19a: Miscellaneous fixes.
nocb.2022.07.19a: Callback-offload updates.
poll.2022.07.21a: Polled grace-period updates.
rcu-tasks.2022.06.21a: Tasks RCU updates.
torture.2022.06.21a: Torture-test updates.
2022-07-21 17:43:16 -07:00
Zqiang 28787e04fb rcu: Add irqs-disabled indicator to expedited RCU CPU stall warnings
If a CPU has interrupts disabled continuously starting before the
beginning of a given expedited RCU grace period, that CPU will not
execute that grace period's IPI handler.  This will in turn mean
that the ->cpu_no_qs.b.exp field in that CPU's rcu_data structure
will continue to contain the boolean value false.

Knowing whether or not a CPU has had interrupts disabled can be helpful
when debugging an expedited RCU CPU stall warning, so this commit
adds a "D" indicator expedited RCU CPU stall warnings that signifies
that the corresponding CPU has had interrupts disabled throughout.

This capability was tested as follows:

runqemu kvm slirp nographic qemuparams="-m 4096 -smp 4"  bootparams=
"isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3 rcutree.dump_tree=1
rcutorture.stall_cpu_holdoff=30 rcutorture.stall_cpu=40
rcutorture.stall_cpu_irqsoff=1 rcutorture.stall_cpu_block=0
rcutorture.stall_no_softlockup=1" -d

The rcu_torture_stall() function ran on CPU 1, which displays the "D"
as expected given the rcutorture.stall_cpu_irqsoff=1 module parameter:

............
rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks:
{ 1-...D } 26467 jiffies s: 13317 root: 0x1/.
rcu: blocking rcu_node structures (internal RCU debug): l=1:0-1:0x2/.
Task dump for CPU 1:
task:rcu_torture_sta state:R  running task     stack:    0 pid:   76  ppid:     2 flags:0x00004008

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-21 17:41:56 -07:00
Paul E. McKenney 82e445697d rcu: Diagnose extended sync_rcu_do_polled_gp() loops
This commit dumps out state when the sync_rcu_do_polled_gp() function
loops more than expected.  This is a debugging aid.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-21 17:41:56 -07:00
Zqiang ef4f9d9b92 rcu: Put panic_on_rcu_stall() after expedited RCU CPU stall warnings
When a normal RCU CPU stall warning is encountered with the
panic_on_rcu_stall sysfs variable is set, the system panics only after
the stall warning is printed.  But when an expedited RCU CPU stall
warning is encountered with the panic_on_rcu_stall sysfs variable is
set, the system panics first, thus never printing the stall warning.
This commit therefore brings the expedited stall warning into line with
the normal stall warning by printing first and panicking afterwards.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-21 17:41:56 -07:00
Paul E. McKenney 11d62f0f43 rcutorture: Test polled expedited grace-period primitives
This commit adds tests of start_poll_synchronize_rcu_expedited() and
poll_state_synchronize_rcu_expedited().

Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <bfoster@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-21 17:41:56 -07:00
Paul E. McKenney d96c52fe49 rcu: Add polled expedited grace-period primitives
This commit adds expedited grace-period functionality to RCU's polled
grace-period API, adding start_poll_synchronize_rcu_expedited() and
cond_synchronize_rcu_expedited(), which are similar to the existing
start_poll_synchronize_rcu() and cond_synchronize_rcu() functions,
respectively.

Note that although start_poll_synchronize_rcu_expedited() can be invoked
very early, the resulting expedited grace periods are not guaranteed
to start until after workqueues are fully initialized.  On the other
hand, both synchronize_rcu() and synchronize_rcu_expedited() can also
be invoked very early, and the resulting grace periods will be taken
into account as they occur.

[ paulmck: Apply feedback from Neeraj Upadhyay. ]

Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <bfoster@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-21 17:41:56 -07:00
Paul E. McKenney e4333cb20f rcutorture: Verify that polled GP API sees synchronous grace periods
This commit causes rcu_torture_writer() to use WARN_ON_ONCE() to check
that the cookie returned by the current RCU flavor's ->get_gp_state()
function (get_state_synchronize_rcu() for vanilla RCU) causes that
flavor's ->poll_gp_state function (poll_state_synchronize_rcu() for
vanilla RCU) to unconditionally return true.

Note that a pair calls to synchronous grace-period-wait functions are
used.  This is necessary to account for partially overlapping normal and
expedited grace periods aligning in just the wrong way with polled API
invocations, which can cause those polled API invocations to ignore one or
the other of those partially overlapping grace periods.  It is unlikely
that this sort of ignored grace period will be a problem in production,
but rcutorture can make it happen quite within a few tens of seconds.

This commit is in preparation for polled expedited grace periods.

[ paulmck: Apply feedback from Frederic Weisbecker. ]

Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <bfoster@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-21 17:41:56 -07:00
Paul E. McKenney 7f4535366f rcu: Make Tiny RCU grace periods visible to polled APIs
This commit makes the Tiny RCU implementation of synchronize_rcu()
increment the rcu_ctrlblk.gp_seq counter, thus making both
synchronize_rcu() and synchronize_rcu_expedited() visible to
get_state_synchronize_rcu() and friends.

Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <bfoster@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-21 17:41:56 -07:00
Paul E. McKenney dd04140531 rcu: Make polled grace-period API account for expedited grace periods
Currently, this code could splat:

	oldstate = get_state_synchronize_rcu();
	synchronize_rcu_expedited();
	WARN_ON_ONCE(!poll_state_synchronize_rcu(oldstate));

This situation is counter-intuitive and user-unfriendly.  After all, there
really was a perfectly valid full grace period right after the call to
get_state_synchronize_rcu(), so why shouldn't poll_state_synchronize_rcu()
know about it?

This commit therefore makes the polled grace-period API aware of expedited
grace periods in addition to the normal grace periods that it is already
aware of.  With this change, the above code is guaranteed not to splat.

Please note that the above code can still splat due to counter wrap on the
one hand and situations involving partially overlapping normal/expedited
grace periods on the other.  On 64-bit systems, the second is of course
much more likely than the first.  It is possible to modify this approach
to prevent overlapping grace periods from causing splats, but only at
the expense of greatly increasing the probability of counter wrap, as
in within milliseconds on 32-bit systems and within minutes on 64-bit
systems.

This commit is in preparation for polled expedited grace periods.

Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <bfoster@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-21 17:41:56 -07:00
Paul E. McKenney bf95b2bc3e rcu: Switch polled grace-period APIs to ->gp_seq_polled
This commit switches the existing polled grace-period APIs to use a
new ->gp_seq_polled counter in the rcu_state structure.  An additional
->gp_seq_polled_snap counter in that same structure allows the normal
grace period kthread to interact properly with the !SMP !PREEMPT fastpath
through synchronize_rcu().  The first of the two to note the end of a
given grace period will make knowledge of this transition available to
the polled API.

This commit is in preparation for polled expedited grace periods.

[ paulmck: Fix use of rcu_state.gp_seq_polled to start normal grace period. ]

Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <bfoster@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ian Kent <raven@themaw.net>
Co-developed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-21 17:40:06 -07:00
Zqiang 0578e14c94 rcu/nocb: Avoid polling when my_rdp->nocb_head_rdp list is empty
Currently, if the 'rcu_nocb_poll' kernel boot parameter is enabled, all
rcuog kthreads enter polling mode.  However, if all of a given group
of rcuo kthreads correspond to CPUs that have been de-offloaded, the
corresponding rcuog kthread will nonetheless still wake up periodically,
unnecessarily consuming power and perturbing workloads.  Fortunately,
this situation is easily detected by the fact that the rcuog kthread's
CPU's rcu_data structure's ->nocb_head_rdp list is empty.

This commit saves power and avoids unnecessarily perturbing workloads
by putting an rcuog kthread to sleep during any time period when all of
its rcuo kthreads' CPUs are de-offloaded.

Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19 11:43:55 -07:00
Uladzislau Rezki (Sony) 8f489b4da5 rcu/nocb: Add option to opt rcuo kthreads out of RT priority
This commit introduces a RCU_NOCB_CPU_CB_BOOST Kconfig option that
prevents rcuo kthreads from running at real-time priority, even in
kernels built with RCU_BOOST.  This capability is important to devices
needing low-latency (as in a few milliseconds) response from expedited
RCU grace periods, but which are not running a classic real-time workload.
On such devices, permitting the rcuo kthreads to run at real-time priority
results in unacceptable latencies imposed on the application tasks,
which run as SCHED_OTHER.

See for example the following trace output:

<snip>
<...>-60 [006] d..1 2979.028717: rcu_batch_start: rcu_preempt CBs=34619 bl=270
<snip>

If that rcuop kthread were permitted to run at real-time SCHED_FIFO
priority, it would monopolize its CPU for hundreds of milliseconds
while invoking those 34619 RCU callback functions, which would cause an
unacceptably long latency spike for many application stacks on Android
platforms.

However, some existing real-time workloads require that callback
invocation run at SCHED_FIFO priority, for example, those running on
systems with heavy SCHED_OTHER background loads.  (It is the real-time
system's administrator's responsibility to make sure that important
real-time tasks run at a higher priority than do RCU's kthreads.)

Therefore, this new RCU_NOCB_CPU_CB_BOOST Kconfig option defaults to
"y" on kernels built with PREEMPT_RT and defaults to "n" otherwise.
The effect is to preserve current behavior for real-time systems, but for
other systems to allow expedited RCU grace periods to run with real-time
priority while continuing to invoke RCU callbacks as SCHED_OTHER.

As you would expect, this RCU_NOCB_CPU_CB_BOOST Kconfig option has no
effect except on CPUs with offloaded RCU callbacks.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19 11:43:48 -07:00
Zqiang 5103850654 rcu: Add nocb_cb_kthread check to rcu_is_callbacks_kthread()
Callbacks are invoked in RCU kthreads when calbacks are offloaded
(rcu_nocbs boot parameter) or when RCU's softirq handler has been
offloaded to rcuc kthreads (use_softirq==0).  The current code allows
for the rcu_nocbs case but not the use_softirq case.  This commit adds
support for the use_softirq case.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19 11:43:39 -07:00
Joel Fernandes b37a667c62 rcu/nocb: Add an option to offload all CPUs on boot
Systems built with CONFIG_RCU_NOCB_CPU=y but booted without either
the rcu_nocbs= or rcu_nohz_full= kernel-boot parameters will not have
callback offloading on any of the CPUs, nor can any of the CPUs be
switched to enable callback offloading at runtime.  Although this is
intentional, it would be nice to have a way to offload all the CPUs
without having to make random bootloaders specify either the rcu_nocbs=
or the rcu_nohz_full= kernel-boot parameters.

This commit therefore provides a new CONFIG_RCU_NOCB_CPU_DEFAULT_ALL
Kconfig option that switches the default so as to offload callback
processing on all of the CPUs.  This default can still be overridden
using the rcu_nocbs= and rcu_nohz_full= kernel-boot parameters.

Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Reviewed-by: Uladzislau Rezki <urezki@gmail.com>
(In v4.1, fixed issues with CONFIG maze reported by kernel test robot).
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19 11:43:34 -07:00
Zqiang 3a5761dc02 rcu/nocb: Fix NOCB kthreads spawn failure with rcu_nocb_rdp_deoffload() direct call
If the rcuog/o[p] kthreads spawn failed, the offloaded rdp needs to
be explicitly deoffloaded, otherwise the target rdp is still considered
offloaded even though nothing actually handles the callbacks.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19 11:43:04 -07:00
Zqiang 24a57affd2 rcu/nocb: Invert rcu_state.barrier_mutex VS hotplug lock locking order
In case of failure to spawn either rcuog or rcuo[p] kthreads for a given
rdp, rcu_nocb_rdp_deoffload() needs to be called with the hotplug
lock and the barrier_mutex held. However cpus write lock is already held
while calling rcutree_prepare_cpu(). It's not possible to call
rcu_nocb_rdp_deoffload() from there with just locking the barrier_mutex
or this would result in a locking inversion against
rcu_nocb_cpu_deoffload() which holds both locks in the reverse order.

Simply solve this with inverting the locking order inside
rcu_nocb_cpu_[de]offload(). This will be a pre-requisite to toggle NOCB
states toward cpusets anyway.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19 11:42:55 -07:00
Frederic Weisbecker 1598f4a476 rcu/nocb: Add/del rdp to iterate from rcuog itself
NOCB rdp's are part of a group whose list is iterated by the
corresponding rdp leader.

This list is RCU traversed because an rdp can be either added or
deleted concurrently. Upon addition, a new iteration to the list after
a synchronization point (a pair of LOCK/UNLOCK ->nocb_gp_lock) is forced
to make sure:

1) we didn't miss a new element added in the middle of an iteration
2) we didn't ignore a whole subset of the list due to an element being
   quickly deleted and then re-added.
3) we prevent from probably other surprises...

Although this layout is expected to be safe, it doesn't help anybody
to sleep well.

Simplify instead the nocb state toggling with moving the list
modification from the nocb (de-)offloading workqueue to the rcuog
kthreads instead.

Whenever the rdp leader is expected to (re-)set the SEGCBLIST_KTHREAD_GP
flag of a target rdp, the latter is queued so that the leader handles
the flag flip along with adding or deleting the target rdp to the list
to iterate. This way the list modification and iteration happen from the
same kthread and those operations can't race altogether.

As a bonus, the flags for each rdp don't need to be checked locklessly
before each iteration, which is one less opportunity to produce
nightmares.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19 11:42:36 -07:00
Neeraj Upadhyay a03ae49c47 rcu/tree: Add comment to describe GP-done condition in fqs loop
Add a comment to explain why !rcu_preempt_blocked_readers_cgp() condition
is required on root rnp node, for GP completion check in rcu_gp_fqs_loop().

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-19 11:40:00 -07:00
Paul E. McKenney 9bdb5b3a8d rcu: Initialize first_gp_fqs at declaration in rcu_gp_fqs()
This commit saves a line of code by initializing the rcu_gp_fqs()
function's first_gp_fqs local variable in its declaration.

Reported-by: Frederic Weisbecker <frederic@kernel.org>
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-19 11:40:00 -07:00
Joel Fernandes (Google) 82d26c36cc rcu/kvfree: Remove useless monitor_todo flag
monitor_todo is not needed as the work struct already tracks
if work is pending. Just use that to know if work is pending
using schedule_delayed_work() helper.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19 11:40:00 -07:00
Zqiang e2bb1288a3 rcu: Cleanup RCU urgency state for offline CPU
When a CPU is slow to provide a quiescent state for a given grace
period, RCU takes steps to encourage that CPU to get with the
quiescent-state program in a more timely fashion.  These steps
include these flags in the rcu_data structure:

1.	->rcu_urgent_qs, which causes the scheduling-clock interrupt to
	request an otherwise pointless context switch from the scheduler.

2.	->rcu_need_heavy_qs, which causes both cond_resched() and RCU's
	context-switch hook to do an immediate momentary quiscent state.

3.	->rcu_need_heavy_qs, which causes the scheduler-clock tick to
	be enabled even on nohz_full CPUs with only one runnable task.

These flags are of course cleared once the corresponding CPU has passed
through a quiescent state.  Unless that quiescent state is the CPU
going offline, which means that when the CPU comes back online, it will
needlessly consume additional CPU time and incur additional latency,
which constitutes a minor but very real performance bug.

This commit therefore adds the call to rcu_disable_urgency_upon_qs()
that clears these flags to the CPU-hotplug offlining code path.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19 11:40:00 -07:00
Johannes Berg 800d6acf40 rcu: tiny: Record kvfree_call_rcu() call stack for KASAN
When running KASAN with Tiny RCU (e.g. under ARCH=um, where
a working KASAN patch is now available), we don't get any
information on the original kfree_rcu() (or similar) caller
when a problem is reported, as Tiny RCU doesn't record this.

Add the recording, which required pulling kvfree_call_rcu()
out of line for the KASAN case since the recording function
(kasan_record_aux_stack_noalloc) is neither exported, nor
can we include kasan.h into rcutiny.h.

without KASAN, the patch has no size impact (ARCH=um kernel):
    text       data         bss         dec        hex    filename
 6151515    4423154    33148520    43723189    29b29b5    linux
 6151515    4423154    33148520    43723189    29b29b5    linux + patch

with KASAN, the impact on my build was minimal:
    text       data         bss         dec        hex    filename
13915539    7388050    33282304    54585893    340ea25    linux
13911266    7392114    33282304    54585684    340e954    linux + patch
   -4273      +4064         +-0        -209

Acked-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-19 11:40:00 -07:00
Neeraj Upadhyay 4f2bfd9494 srcu: Make expedited RCU grace periods block even less frequently
The purpose of commit 282d8998e9 ("srcu: Prevent expedited GPs
and blocking readers from consuming CPU") was to prevent a long
series of never-blocking expedited SRCU grace periods from blocking
kernel-live-patching (KLP) progress.  Although it was successful, it also
resulted in excessive boot times on certain embedded workloads running
under qemu with the "-bios QEMU_EFI.fd" command line.  Here "excessive"
means increasing the boot time up into the three-to-four minute range.
This increase in boot time was due to the more than 6000 back-to-back
invocations of synchronize_rcu_expedited() within the KVM host OS, which
in turn resulted from qemu's emulation of a long series of MMIO accesses.

Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace
periods") did not significantly help this particular use case.

Zhangfei Gao and Shameerali Kolothum Thodi did experiments varying the
value of SRCU_MAX_NODELAY_PHASE with HZ=250 and with various values
of non-sleeping per phase counts on a system with preemption enabled,
and observed the following boot times:

+──────────────────────────+────────────────+
| SRCU_MAX_NODELAY_PHASE   | Boot time (s)  |
+──────────────────────────+────────────────+
| 100                      | 30.053         |
| 150                      | 25.151         |
| 200                      | 20.704         |
| 250                      | 15.748         |
| 500                      | 11.401         |
| 1000                     | 11.443         |
| 10000                    | 11.258         |
| 1000000                  | 11.154         |
+──────────────────────────+────────────────+

Analysis on the experiment results show additional improvements with
CPU-bound delays approaching one jiffy in duration. This improvement was
also seen when number of per-phase iterations were scaled to one jiffy.

This commit therefore scales per-grace-period phase number of non-sleeping
polls so that non-sleeping polls extend for about one jiffy. In addition,
the delay-calculation call to srcu_get_delay() in srcu_gp_end() is
replaced with a simple check for an expedited grace period.  This change
schedules callback invocation immediately after expedited grace periods
complete, which results in greatly improved boot times.  Testing done
by Marc and Zhangfei confirms that this change recovers most of the
performance degradation in boottime; for CONFIG_HZ_250 configuration,
specifically, boot times improve from 3m50s to 41s on Marc's setup;
and from 2m40s to ~9.7s on Zhangfei's setup.

In addition to the changes to default per phase delays, this
change adds 3 new kernel parameters - srcutree.srcu_max_nodelay,
srcutree.srcu_max_nodelay_phase, and srcutree.srcu_retry_check_delay.
This allows users to configure the srcu grace period scanning delays in
order to more quickly react to additional use cases.

Fixes: 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods")
Fixes: 282d8998e9 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU")
Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reported-by: yueluck <yueluck@163.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Tested-by: Marc Zyngier <maz@kernel.org>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-19 11:39:59 -07:00
Paul E. McKenney b3ade95b8e rcu: Forbid RCU_STRICT_GRACE_PERIOD in TINY_RCU kernels
The RCU_STRICT_GRACE_PERIOD Kconfig option does nothing in kernels
built with CONFIG_TINY_RCU=y, so this commit adjusts the dependencies
to disallow this combination.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19 11:39:59 -07:00
Paul E. McKenney 8f870e6eb8 srcu: Block less aggressively for expedited grace periods
Commit 282d8998e9 ("srcu: Prevent expedited GPs and blocking readers
from consuming CPU") fixed a problem where a long-running expedited SRCU
grace period could block kernel live patching.  It did so by giving up
on expediting once a given SRCU expedited grace period grew too old.

Unfortunately, this added excessive delays to boots of virtual embedded
systems specifying "-bios QEMU_EFI.fd" to qemu.  This commit therefore
makes the transition away from expediting less aggressive, increasing
the per-grace-period phase number of non-sleeping polls of readers from
one to three and increasing the required grace-period age from one jiffy
(actually from zero to one jiffies) to two jiffies (actually from one
to two jiffies).

Fixes: 282d8998e9 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reported-by: chenxiang (M)" <chenxiang66@hisilicon.com>
Cc: Shameerali Kolothum Thodi  <shameerali.kolothum.thodi@huawei.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/
2022-07-19 11:39:59 -07:00