Commit Graph

1245618 Commits

Author SHA1 Message Date
Tejun Heo bf52b1ac6a async: Use a dedicated unbound workqueue with raised min_active
Async can schedule a number of interdependent work items. However, since
5797b1c189 ("workqueue: Implement system-wide nr_active enforcement for
unbound workqueues"), unbound workqueues have separate min_active which sets
the number of interdependent work items that can be handled. This default
value is 8 which isn't sufficient for async and can lead to stalls during
resume from suspend in some cases.

Let's use a dedicated unbound workqueue with raised min_active.
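
For illustration, a minimal sketch of the kind of setup described here,
assuming the workqueue_set_min_active() interface added by the previous
commit (names and values are illustrative; the exact async.c change may
differ):

  #include <linux/workqueue.h>

  static struct workqueue_struct *async_wq;

  static int __init example_async_wq_init(void)
  {
          async_wq = alloc_workqueue("async", WQ_UNBOUND, 0);
          if (!async_wq)
                  return -ENOMEM;
          /* let longer chains of interdependent async work items progress */
          workqueue_set_min_active(async_wq, WQ_DFL_ACTIVE);
          return 0;
  }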

Link: http://lkml.kernel.org/r/708a65cc-79ec-44a6-8454-a93d0f3114c3@samsung.com
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-09 11:13:59 -10:00
Tejun Heo 8f172181f2 workqueue: Implement workqueue_set_min_active()
Since 5797b1c189 ("workqueue: Implement system-wide nr_active enforcement
for unbound workqueues"), unbound workqueues have separate min_active which
sets the number of interdependent work items that can be handled. This value
is currently initialized to WQ_DFL_MIN_ACTIVE which is 8. This isn't high
enough for some users, let's add an interface to adjust the setting.
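
As a sketch, the resulting interface looks like this (the exact declaration
lives in include/linux/workqueue.h; treat the signature here as an
assumption):

  /* Adjust the minimum per-node concurrency of an unbound workqueue. */
  void workqueue_set_min_active(struct workqueue_struct *wq, int min_active);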

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-09 11:13:59 -10:00
Waiman Long 516d3dc99f workqueue: Fix kernel-doc comment of unplug_oldest_pwq()
Fix the kernel-doc comment of the unplug_oldest_pwq() function to enable
proper processing and formatting of the embedded ASCII diagram.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-09 11:04:13 -10:00
Waiman Long 49584bb8dd workqueue: Bind unbound workqueue rescuer to wq_unbound_cpumask
Commit 85f0ab43f9 ("kernel/workqueue: Bind rescuer to unbound
cpumask for WQ_UNBOUND") modified init_rescuer() to bind rescuer of
an unbound workqueue to the cpumask in wq->unbound_attrs. However,
unbound_attrs->cpumask of every workqueue is initialized to
cpu_possible_mask and is only changed if the workqueue has the WQ_SYSFS
flag, which exposes a cpumask sysfs file that users can write to. So that
patch doesn't achieve what it was intended to do.

If an unbound workqueue is created after wq_unbound_cpumask is modified
and there is no more unbound cpumask update after that, the unbound
rescuer will be bound to all CPUs unless the workqueue is created
with the WQ_SYSFS flag and a user explicitly modified its cpumask
sysfs file.  Fix this problem by binding directly to wq_unbound_cpumask
in init_rescuer().
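
A simplified sketch of the intended logic in init_rescuer() (not the
verbatim patch; the exact helper used to bind the rescuer kthread may
differ):

  if (wq->flags & WQ_UNBOUND)
          kthread_bind_mask(rescuer->task, wq_unbound_cpumask);
  else
          kthread_bind_mask(rescuer->task, cpu_possible_mask);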

Fixes: 85f0ab43f9 ("kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-08 09:23:38 -10:00
Juri Lelli d64f2fa064 kernel/workqueue: Let rescuers follow unbound wq cpumask changes
When workqueue cpumask changes are committed, the affinity of the
associated rescuer (if one exists) is not touched, and this might be a
problem down the line for isolated setups.

Make sure rescuer affinity is updated every time a workqueue cpumask
changes, so that rescuers can't break isolation.

 [longman: set_cpus_allowed_ptr() will block until the designated task
  is enqueued on an allowed CPU, no wake_up_process() needed. Also use
  the unbound_effective_cpumask() helper as suggested by Tejun.]
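
A sketch of the update applied when new attrs are committed (simplified;
assumes the unbound_effective_cpumask() helper mentioned above):

  /* move an existing rescuer onto the workqueue's new effective cpumask */
  if (wq->rescuer)
          set_cpus_allowed_ptr(wq->rescuer->task,
                               unbound_effective_cpumask(wq));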

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-08 09:23:32 -10:00
Waiman Long 4c065dbce1 workqueue: Enable unbound cpumask update on ordered workqueues
Ordered workqueues do not currently follow changes made to the
global unbound cpumask because per-pool workqueue changes may break
the ordering guarantee. IOW, a work function in an ordered workqueue
may run on an isolated CPU.

This patch enables ordered workqueues to follow changes made to the
global unbound cpumask by temporarily plugging (suspending) the newly
allocated pool_workqueue, preventing it from executing newly queued work
items until the old pwq has been properly drained. For an ordered
workqueue, only one pwq should be unplugged at a time; the rest should
remain plugged.

This enables ordered workqueues to follow the unbound cpumask changes
like other unbound workqueues at the expense of some delay in execution
of work functions during the transition period.
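
Conceptually, the activation path ends up with a check along these lines
(field name and placement are illustrative, not the exact patch):

  /* a plugged pwq must not activate work until the older pwq has drained */
  if (pwq->plugged)
          return false;   /* keep newly queued work on the inactive list */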

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-08 09:22:49 -10:00
Waiman Long 26fb7e3dda workqueue: Link pwq's into wq->pwqs from oldest to newest
Add a new pwq into the tail of wq->pwqs so that pwq iteration will
start from the oldest pwq to the newest. This ordering will facilitate
the inclusion of ordered workqueues in a wq_unbound_cpumask update.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-08 09:22:30 -10:00
Tejun Heo 40911d4457 Merge branch 'for-6.8-fixes' into for-6.9
The for-6.8-fixes commit ae9cc8956944 ("Revert "workqueue: Override implicit
ordered attribute in workqueue_apply_unbound_cpumask()") also fixes build for

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-05 15:49:47 -10:00
Tejun Heo aac8a59537 Revert "workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()"
This reverts commit ca10d851b9.

The commit allowed workqueue_apply_unbound_cpumask() to clear __WQ_ORDERED
on now-removed implicitly ordered workqueues. This was incorrect in that a
system-wide config change shouldn't break the ordering properties of all
workqueues. The reason the apply_workqueue_attrs() path was allowed to do so
is that it targets a specific workqueue - either the workqueue has WQ_SYSFS
set or the workqueue user specifically tried to change max_active, both of
which indicate that the workqueue doesn't need to be ordered.

The implicitly ordered workqueue promotion was removed by the previous
commit 3bc1e711c2 ("workqueue: Don't implicitly make UNBOUND workqueues w/
@max_active==1 ordered"). However, that commit didn't update this path and
broke the build. Let's revert the commit that was incorrect in the first
place, which also fixes the build.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3bc1e711c2 ("workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered")
Fixes: ca10d851b9 ("workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()")
Cc: stable@vger.kernel.org # v6.6+
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-05 15:49:06 -10:00
Tejun Heo 3bc1e711c2 workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered
5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
automatically promoted UNBOUND workqueues w/ @max_active==1 to ordered
workqueues because UNBOUND workqueues w/ @max_active==1 used to be the way
to create ordered workqueues and the new NUMA support broke that. The
resulting problems could be subtle, and the fact that they could only
trigger on NUMA machines made them even more difficult to debug.

However, overloading the UNBOUND allocation interface this way creates other
issues. It's difficult to tell whether a given workqueue actually needs to
be ordered, and users that legitimately want a minimum concurrency level wq
unexpectedly get an ordered one instead. With planned UNBOUND workqueue
updates to improve execution locality and the growing prevalence of chiplet
designs which can benefit from such improvements, this isn't a state we want
to be in forever.

There aren't that many UNBOUND w/ @max_active==1 users in the tree and the
preceding patches audited them all and converted them to
alloc_ordered_workqueue() as appropriate. This patch removes the implicit
promotion of UNBOUND w/ @max_active==1 workqueues to ordered ones.
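
For illustration, the user-visible distinction after this change (workqueue
names are made up):

  struct workqueue_struct *wq;

  /* min concurrency of 1, but execution order is no longer guaranteed */
  wq = alloc_workqueue("example_wq", WQ_UNBOUND, 1);

  /* strict ordering must now be requested explicitly */
  wq = alloc_ordered_workqueue("example_ordered_wq", 0);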

v2: v1 patch incorrectly dropped !list_empty(&wq->pwqs) condition in
    apply_workqueue_attrs_locked() which spuriously triggers WARNING and
    fails workqueue creation. Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202304251050.45a5df1f-oliver.sang@intel.com
2024-02-05 14:19:10 -10:00
Waiman Long 8eb17dc1a6 workqueue: Skip __WQ_DESTROYING workqueues when updating global unbound cpumask
Skip updating workqueues with __WQ_DESTROYING bit set when updating
global unbound cpumask to avoid unnecessary work and other complications.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-05 07:52:22 -10:00
Wang Jinchao 96068b6030 workqueue: fix a typo in comment
There should be three, fix it.

Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-05 07:48:25 -10:00
Tejun Heo 4f19b8e01e Revert "workqueue: make wq_subsys const"
This reverts commit d412ace111. The reverted commit leads to
build failures as it depends on the driver-core commit 32f78abe59 ("driver
core: bus: constantify subsys_register() calls"). Let's drop it from the wq
tree and route it through the driver-core tree.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202402051505.kM9Rr3CJ-lkp@intel.com/
2024-02-05 07:19:54 -10:00
Tejun Heo 4cb1ef6460 workqueue: Implement BH workqueues to eventually replace tasklets
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws, such as
the execution code accessing the tasklet item after the execution is
complete, which can lead to subtle use-after-free in certain usage scenarios,
and less-developed flush and cancel mechanisms.

This patch implements BH workqueues which share the same semantics and
features of regular workqueues but execute their work items in the softirq
context. As there is always only one BH execution context per CPU, none of
the concurrency management mechanisms apply and a BH workqueue can be
thought of as a convenience wrapper around softirq.

Except for the inability to sleep while executing and lack of max_active
adjustments, BH workqueues and work items should behave the same as regular
workqueues and work items.

Currently, the execution is hooked to tasklet[_hi]. However, the goal is to
convert all tasklet users over to BH workqueues. Once the conversion is
complete, tasklet can be removed and BH workqueues can directly take over
the tasklet softirqs.

system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in
tasklet, all existing tasklet users should be able to use the system BH
workqueues without creating their own workqueues.
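
For illustration, a tasklet user could be converted along these lines (a
sketch assuming only the system_bh_wq added here; all other names are made
up):

  static void example_bh_fn(struct work_struct *work)
  {
          /* runs in BH (softirq) context - must not sleep */
  }
  static DECLARE_WORK(example_bh_work, example_bh_fn);

  static void example_irq_handler(void)
  {
          /* was: tasklet_schedule(&example_tasklet); */
          queue_work(system_bh_wq, &example_bh_work);
  }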

v3: - Add missing interrupt.h include.

v2: - Instead of using tasklets, hook directly into its softirq action
      functions - tasklet[_hi]_action(). This is slightly cheaper and closer
      to the eventual code structure we want to arrive at. Suggested by Lai.

    - Lai also pointed out several places which need NULL worker->task
      handling or can use clarification. Updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com
Tested-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-02-04 11:28:06 -10:00
Tejun Heo 2fcdb1b444 workqueue: Factor out init_cpu_worker_pool()
Factor out init_cpu_worker_pool() from workqueue_init_early(). This is pure
reorganization in preparation of BH workqueue support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Allen Pais <allen.lkml@gmail.com>
2024-02-04 11:28:06 -10:00
Tejun Heo c35aea39d1 workqueue: Update lock debugging code
These changes are in preparation of BH workqueue which will execute work
items from BH context.

- Update lock and RCU depth checks in process_one_work() so that it
  remembers and checks against the starting depths and prints out the depth
  changes.

- Factor out lockdep annotations in the flush paths into
  touch_{wq|work}_lockdep_map(). The work->lockdep_map touching is moved
  from __flush_work() to its callee - start_flush_work(). This brings it
  closer to the wq counterpart and will allow testing the associated wq's
  flags which will be needed to support BH workqueues. This is not expected
  to cause any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Allen Pais <allen.lkml@gmail.com>
2024-02-04 11:28:06 -10:00
Ricardo B. Marliere d412ace111 workqueue: make wq_subsys const
Now that the driver core can properly handle constant struct bus_type,
move the wq_subsys variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-and-reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-04 11:23:25 -10:00
Tejun Heo c70e1779b7 workqueue: Fix pwq->nr_in_flight corruption in try_to_grab_pending()
dd6c3c5441 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work
item handling") relocated pwq_dec_nr_in_flight() after
set_work_pool_and_keep_pending(). However, the latter destroys information
contained in work->data that's needed by pwq_dec_nr_in_flight() including
the flush color. With flush color destroyed, flush_workqueue() can stall
easily when mixed with cancel_work*() usages.

This is easily triggered by running xfstests generic/001 test on xfs:

     INFO: task umount:6305 blocked for more than 122 seconds.
     ...
     task:umount          state:D stack:13008 pid:6305  tgid:6305  ppid:6301   flags:0x00004000
     Call Trace:
      <TASK>
      __schedule+0x2f6/0xa20
      schedule+0x36/0xb0
      schedule_timeout+0x20b/0x280
      wait_for_completion+0x8a/0x140
      __flush_workqueue+0x11a/0x3b0
      xfs_inodegc_flush+0x24/0xf0
      xfs_unmountfs+0x14/0x180
      xfs_fs_put_super+0x3d/0x90
      generic_shutdown_super+0x7c/0x160
      kill_block_super+0x1b/0x40
      xfs_kill_sb+0x12/0x30
      deactivate_locked_super+0x35/0x90
      deactivate_super+0x42/0x50
      cleanup_mnt+0x109/0x170
      __cleanup_mnt+0x12/0x20
      task_work_run+0x60/0x90
      syscall_exit_to_user_mode+0x146/0x150
      do_syscall_64+0x5d/0x110
      entry_SYSCALL_64_after_hwframe+0x6c/0x74

Fix it by stashing work_data before calling set_work_pool_and_keep_pending()
and using the stashed value for pwq_dec_nr_in_flight().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Chandan Babu R <chandanbabu@kernel.org>
Link: http://lkml.kernel.org/r/87o7cxeehy.fsf@debian-BULLSEYE-live-builder-AMD64
Fixes: dd6c3c5441 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling")
2024-02-04 11:14:49 -10:00
Miguel Ojeda 3e0bc2855b workqueue: rust: sync with `WORK_CPU_UNBOUND` change
Commit e563d0a7cd ("workqueue: Break up enum definitions and give
names to the types") gives a name to the `enum` where `WORK_CPU_UNBOUND`
was defined, so `bindgen` changes its output from e.g.:

    pub type _bindgen_ty_10 = core::ffi::c_uint;
    pub const WORK_CPU_UNBOUND: _bindgen_ty_10 = 64;

to e.g.:

    pub type wq_misc_consts = core::ffi::c_uint;
    pub const wq_misc_consts_WORK_CPU_UNBOUND: wq_misc_consts = 64;

Thus update Rust's side to match the change (which requires a slight
reformat of the code), fixing the build error.

Closes: https://lore.kernel.org/rust-for-linux/CANiq72=9PZ89bCAVX0ZV4cqrYSLoZWyn-d_K4KpBMHjwUMdC3A@mail.gmail.com/
Fixes: e563d0a7cd ("workqueue: Break up enum definitions and give names to the types")
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-01 09:26:00 -10:00
Tejun Heo c5f8cd6c62 workqueue: Avoid premature init of wq->node_nr_active[].max
System workqueues are allocated early during boot from
workqueue_init_early(). While allocating unbound workqueues,
wq_update_node_max_active() is invoked from apply_workqueue_attrs() and
accesses NUMA topology to initialize wq->node_nr_active[].max.

However, topology information may not be set up at this point.
wq_update_node_max_active() is explicitly invoked from
workqueue_init_topology() later when topology information is known to be
available.

This doesn't seem to crash anything but it's doing useless work with dubious
data. Let's skip the premature and duplicate node_max_active updates by
initializing the field to WQ_DFL_MIN_ACTIVE on allocation and making
wq_update_node_max_active() noop until workqueue_init_topology().

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9221a4c57ae1..a65081ec6780 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -386,6 +386,8 @@ static const char *wq_affn_names[WQ_AFFN_NR_TYPES] = {
 	[WQ_AFFN_SYSTEM]		= "system",
 };

+static bool wq_topo_initialized = false;
+
 /*
  * Per-cpu work items which run for longer than the following threshold are
  * automatically considered CPU intensive and excluded from concurrency
@@ -1510,6 +1512,9 @@ static void wq_update_node_max_active(struct workqueue_struct *wq, int off_cpu)

 	lockdep_assert_held(&wq->mutex);

+	if (!wq_topo_initialized)
+		return;
+
 	if (!cpumask_test_cpu(off_cpu, effective))
 		off_cpu = -1;

@@ -4356,6 +4361,7 @@ static void free_node_nr_active(struct wq_node_nr_active **nna_ar)

 static void init_node_nr_active(struct wq_node_nr_active *nna)
 {
+	nna->max = WQ_DFL_MIN_ACTIVE;
 	atomic_set(&nna->nr, 0);
 	raw_spin_lock_init(&nna->lock);
 	INIT_LIST_HEAD(&nna->pending_pwqs);
@@ -7400,6 +7406,8 @@ void __init workqueue_init_topology(void)
 	init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
 	init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);

+	wq_topo_initialized = true;
+
 	mutex_lock(&wq_pool_mutex);

 	/*
2024-01-30 19:17:00 -10:00
Tejun Heo 15930da42f workqueue: Don't call cpumask_test_cpu() with -1 CPU in wq_update_node_max_active()
For wq_update_node_max_active(), @off_cpu of -1 indicates that no CPU is
going down. The function was incorrectly calling cpumask_test_cpu() with -1
CPU leading to oopses like the following on some archs:

  Unable to handle kernel paging request at virtual address ffff0002100296e0
  ..
  pc : wq_update_node_max_active+0x50/0x1fc
  lr : wq_update_node_max_active+0x1f0/0x1fc
  ...
  Call trace:
    wq_update_node_max_active+0x50/0x1fc
    apply_wqattrs_commit+0xf0/0x114
    apply_workqueue_attrs_locked+0x58/0xa0
    alloc_workqueue+0x5ac/0x774
    workqueue_init_early+0x460/0x540
    start_kernel+0x258/0x684
    __primary_switched+0xb8/0xc0
  Code: 9100a273 35000d01 53067f00 d0016dc1 (f8607a60)
  ---[ end trace 0000000000000000 ]---
  Kernel panic - not syncing: Attempted to kill the idle task!
  ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: http://lkml.kernel.org/r/91eacde0-df99-4d5c-a980-91046f66e612@samsung.com
Fixes: 5797b1c189 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
2024-01-30 18:55:55 -10:00
Leonardo Bras aae17ebb53 workqueue: Avoid using isolated cpus' timers on queue_delayed_work
When __queue_delayed_work() is called, it chooses a cpu for handling the
timer interrupt. As of today, it will pick either the cpu passed as a
parameter or the last cpu used for this.

This is not good if a system uses CPU isolation, because it can take
valuable cpu time away from an isolated cpu to:
1 - deal with the timer interrupt,
2 - schedule-out the desired task,
3 - queue work on a random workqueue, and
4 - schedule the desired task back to the cpu.

So to fix this, during __queue_delayed_work(), if cpu isolation is in
place, pick a random non-isolated cpu to handle the timer interrupt.

As an optimization, if the current cpu is not isolated, use it instead
of looking for another candidate.
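
A simplified sketch of the selection logic described above, using the
housekeeping API (the exact placement inside __queue_delayed_work() and the
housekeeping type used are glossed over):

  if (housekeeping_enabled(HK_TYPE_TIMER)) {
          int cpu = smp_processor_id();

          if (!housekeeping_test_cpu(cpu, HK_TYPE_TIMER))
                  cpu = housekeeping_any_cpu(HK_TYPE_TIMER);
          add_timer_on(timer, cpu);
  } else {
          add_timer(timer);
  }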

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-29 15:21:37 -10:00
Tejun Heo 07daa99b7f tools/workqueue/wq_dump.py: Add node_nr/max_active dump
Print out per-node nr/max_active numbers to improve visibility into
node_nr_active operations.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-29 08:11:25 -10:00
Tejun Heo 5797b1c189 workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per CPU for
per-cpu workqueues and one per NUMA node for unbound workqueues, which was a
natural result of per-cpu workqueues being served by per-cpu pools and
unbound workqueues by per-NUMA pools.

In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound workqueues, it wasn't great
in that NUMA machines would get a max_active that's multiplied by the number
of nodes, but this didn't cause huge problems because NUMA machines are
relatively rare and the node count is usually pretty low.

However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating in 8639ecebc9 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented a more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.

While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.

636b927eba ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulates concurrency level which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.

Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of the in-flight IOs, so we don't
want to set max_active too low, but as soon as we increase max_active a bit,
we can end up with an unreasonable number of in-flight work items when many
CPUs issue IOs at the same time. i.e. the lowest acceptable max_active is
higher than the highest acceptable max_active.

Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:

- Once max_active enforcement is decoupled from pool boundaries, chaining
  execution after a work item finishes requires inter-pool operations, which
  would require lock dancing - nasty.

- Sharing a single nr_active count across the whole system can be pretty
  expensive on NUMA machines.

- Per-pwq enforcement had been more or less okay while we were using
  per-node pools.

It looks like we can no longer avoid decoupling max_active enforcement from
pool boundaries. This patch implements a system-wide nr_active mechanism with
the following design characteristics:

- To avoid sharing a single counter across multiple nodes, the configured
  max_active is split across nodes according to the proportion of each
  workqueue's online effective CPUs per node. e.g. a node with twice as many
  online effective CPUs will get twice the portion of max_active (see the
  arithmetic sketch after this list).

- Workqueue used to be able to process a chain of interdependent work items
  which is as long as max_active. We can't do this anymore as max_active is
  distributed across the nodes. Instead, a new parameter min_active is
  introduced which determines the minimum level of concurrency within a node
  regardless of how max_active distribution comes out to be.

  It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE, which is 8.
  This can lead to a higher effective max_active than configured and also to
  deadlocks if a workqueue was depending on being able to handle chains of
  interdependent work items that are longer than 8.

  I believe these should be fine given that the number of CPUs in each NUMA
  node is usually higher than 8 and a work item chain longer than 8 is pretty
  unlikely. However, if these assumptions turn out to be wrong, we'll need
  to add an interface to adjust min_active.

- Each unbound wq has an array of struct wq_node_nr_active which tracks
  per-node nr_active. When its pwq wants to run a work item, it has to
  obtain the matching node's nr_active. If it is over the node's max_active,
  the pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
  the completion path round-robins the pending pwqs, activating the first
  inactive work item of each, which involves some pool lock dancing and
  kicking other pools. It's not the simplest code but doesn't look too bad.
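
As a rough illustration of the per-node split described in the first point
above (variable names are illustrative; the real computation lives in
wq_update_node_max_active()):

  /* illustrative names only: max_active=16, min_active=8, node has 4 of
   * the workqueue's 16 effective online CPUs */
  int node_max, wq_max_active = 16, wq_min_active = 8;
  int node_effective_cpus = 4, total_effective_cpus = 16;

  node_max = DIV_ROUND_UP(wq_max_active * node_effective_cpus,
                          total_effective_cpus);    /* -> 4 */
  node_max = max(node_max, wq_min_active);          /* -> raised to 8 */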

v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().

    - wq_adjust_max_active() is now protected by wq->mutex instead of
      wq_pool_mutex.

v3: - wq_node_max_active() used to calculate per-node max_active on the fly
      based on system-wide CPU online states. Lai pointed out that this can
      lead to skewed distributions for workqueues with restricted cpumasks.
      Update the max_active distribution to use per-workqueue effective
      online CPU counts instead of system-wide and cache the calculation
      results in node_nr_active->max.

v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:25 -10:00
Tejun Heo 91ccc6e723 workqueue: Introduce struct wq_node_nr_active
Currently, for both percpu and unbound workqueues, max_active applies
per-cpu, which is a recent change for unbound workqueues. The change for
unbound workqueues was a significant departure from the previous behavior of
per-node application. It made some use cases create an undesirable number of
concurrent work items and left no good way of fixing them. To address the
problem, workqueue is implementing a NUMA-node-segmented global nr_active
mechanism, which will be explained further in the next patch.

As a preparation, this patch introduces struct wq_node_nr_active. It's a
data structure allocated for each workqueue and NUMA node pair and
currently only tracks the workqueue's number of active work items on the
node. This is split out from the next patch to make it easier to understand
and review.

Note that there is an extra wq_node_nr_active allocated for the invalid node
nr_node_ids, which is used to track nr_active for pools which don't have a
NUMA node associated, such as the default fallback system-wide pool.

This doesn't cause any behavior changes visible to userland yet. The next
patch will expand to implement the control mechanism on top.
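
A sketch of the shape of the structure at this point in the series (reduced;
the next patch grows it with the actual control fields):

  struct wq_node_nr_active {
          atomic_t        nr;     /* active work items on this node */
  };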

v4: - Fixed out-of-bound access when freeing per-cpu workqueues.

v3: - Use flexible array for wq->node_nr_active as suggested by Lai.

v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.

    - Lai pointed out that pwq_tryinc_nr_active() incorrectly dropped
      pwq->max_active check. Restored. As the next patch replaces the
      max_active enforcement mechanism, this doesn't change the end result.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo dd6c3c5441 workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling
The planned shared nr_active handling for unbound workqueues will make
pwq_dec_nr_active() sometimes drop the pool lock temporarily to acquire
other pool locks, which is necessary as retirement of an nr_active count
from one pool may need to kick off an inactive work item in another pool.

This patch moves pwq_dec_nr_in_flight() call in try_to_grab_pending() to the
end of work item handling so that work item state changes stay atomic.
process_one_work() which is the other user of pwq_dec_nr_in_flight() already
calls it at the end of work item handling. Comments are added to both call
sites and pwq_dec_nr_in_flight().

This shouldn't cause any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo 9f66cff212 workqueue: RCU protect wq->dfl_pwq and implement accessors for it
wq->cpu_pwq is RCU protected but wq->dfl_pwq isn't. This is okay because
currently wq->dfl_pwq is only accessed to install it into wq->cpu_pwq,
which doesn't require RCU access. However, we want to be able to access
wq->dfl_pwq under RCU in the future to access its __pod_cpumask and the code
can be made easier to read by making the two pwq fields behave in the same
way.

- Make wq->dfl_pwq RCU protected.

- Add unbound_pwq_slot() and unbound_pwq() which can access both ->dfl_pwq
  and ->cpu_pwq. The former returns the double pointer that can be used to
  access and update the pwqs. The latter performs a locking check and
  dereferences the double pointer (see the sketch after this list).

- pwq accesses and updates are converted to use unbound_pwq[_slot]().
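
A simplified sketch of the two accessors (locking checks reduced; selecting
the dfl_pwq slot with a negative CPU is a convention assumed here):

  static struct pool_workqueue __rcu **
  unbound_pwq_slot(struct workqueue_struct *wq, int cpu)
  {
          if (cpu >= 0)
                  return per_cpu_ptr(wq->cpu_pwq, cpu);
          return &wq->dfl_pwq;
  }

  static struct pool_workqueue *unbound_pwq(struct workqueue_struct *wq,
                                            int cpu)
  {
          return rcu_dereference_check(*unbound_pwq_slot(wq, cpu),
                                       lockdep_is_held(&wq->mutex));
  }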

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo c5404d4e6d workqueue: Make wq_adjust_max_active() round-robin pwqs while activating
wq_adjust_max_active() needs to activate work items after max_active is
increased. Previously, it did that by visiting each pwq once, activating all
that could be activated. While this makes sense with per-pwq nr_active,
nr_active will be shared across multiple pwqs for unbound wqs. Then, we'd
want to round-robin through pwqs to be fairer.

In preparation, this patch makes wq_adjust_max_active() round-robin pwqs
while activating. While the activation ordering changes, this shouldn't
cause user-noticeable behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo 1c270b79ce workqueue: Move nr_active handling into helpers
__queue_work(), pwq_dec_nr_in_flight() and wq_adjust_max_active() were
open-coding nr_active handling, which is fine given that the operations are
trivial. However, the planned unbound nr_active update will make them more
complicated, so let's move them into helpers.

- pwq_tryinc_nr_active() is added. It increments nr_active if under the
  max_active limit and returns a boolean indicating whether the increment was
  successful (see the sketch after this list). Note that the function is
  structured to accommodate future changes. __queue_work() is updated to use
  the new helper.

- pwq_activate_first_inactive() is updated to use pwq_tryinc_nr_active() and
  thus no longer assumes that nr_active is under max_active and returns a
  boolean to indicate whether a work item has been activated.

- wq_adjust_max_active() no longer tests directly whether a work item can be
  activated. Instead, it's updated to use the return value of
  pwq_activate_first_inactive() to tell whether a work item has been
  activated.

- nr_active decrement and activating the first inactive work item is
  factored into pwq_dec_nr_active().
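
A simplified sketch of the helper described in the first item above (pool
locking and the later unbound-specific handling are omitted):

  static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq)
  {
          struct workqueue_struct *wq = pwq->wq;
          bool obtained = pwq->nr_active < READ_ONCE(wq->max_active);

          if (obtained)
                  pwq->nr_active++;
          return obtained;
  }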

v3: - WARN_ON_ONCE(!WORK_STRUCT_INACTIVE) added to __pwq_activate_work() as
      now we're calling the function unconditionally from
      pwq_activate_first_inactive().

v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo 4c6380305d workqueue: Replace pwq_activate_inactive_work() with [__]pwq_activate_work()
To prepare for unbound nr_active handling improvements, move work activation
part of pwq_activate_inactive_work() into __pwq_activate_work() and add
pwq_activate_work() which tests WORK_STRUCT_INACTIVE and updates nr_active.

pwq_activate_first_inactive() and try_to_grab_pending() are updated to use
pwq_activate_work(). The latter conversion is functionally identical. For
the former, this conversion adds an unnecessary WORK_STRUCT_INACTIVE
test. This is temporary and will be removed by the next patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo afa87ce853 workqueue: Factor out pwq_is_empty()
"!pwq->nr_active && list_empty(&pwq->inactive_works)" test is repeated
multiple times. Let's factor it out into pwq_is_empty().
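
The factored-out helper, as a sketch:

  static bool pwq_is_empty(struct pool_workqueue *pwq)
  {
          return !pwq->nr_active && list_empty(&pwq->inactive_works);
  }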

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo a045a272d8 workqueue: Move pwq->max_active to wq->max_active
max_active is a workqueue-wide setting and the configured value is stored in
wq->saved_max_active; however, the effective value was stored in
pwq->max_active. While this is harmless, it makes the max_active update
process more complicated and gets in the way of the planned max_active
semantic updates for unbound workqueues.

This patch moves pwq->max_active to wq->max_active. This simplifies the
code and makes freezing and noop max_active updates cheaper too. No
user-visible behavior change is intended.

As wq->max_active is updated while holding the wq mutex but read without any
locking, it now uses WRITE/READ_ONCE(). A new locking rule WO is added
for it.

v2: wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo e563d0a7cd workqueue: Break up enum definitions and give names to the types
workqueue is collecting different sorts of enums into a single unnamed enum
type which can increase confusion around enum width. Also, unnamed enums
can't be accessed from BPF. Let's break up enum definitions according to
their purposes and give them type names.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-26 11:55:50 -10:00
Tejun Heo 6a229b0e2f workqueue: Drop unnecessary kick_pool() in create_worker()
After creating a new worker, create_worker() is calling kick_pool() to wake
up the new worker task. However, as kick_pool() doesn't do anything if there
is no work pending, it also calls wake_up_process() explicitly. There's no
reason to call kick_pool() at all. wake_up_process() is enough by itself.
Drop the unnecessary kick_pool() call.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-26 11:55:46 -10:00
Audra Mitchell 8318d6a636 workqueue: Shorten events_freezable_power_efficient name
Since we have set WQ_NAME_LEN to 32, shorten the name of
events_freezable_power_efficient so that it does not trip the name length
warning when the workqueue is created.

Signed-off-by: Audra Mitchell <audra@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-25 09:11:40 -10:00
Tejun Heo a6b48c83d2 tools/workqueue/wq_dump.py: Clean up code and drop duplicate information
- Factor out wq_type_str()

- Improve formatting so that it adapts to actual field widths.

- Drop duplicate information from "Workqueue -> rescuer" section. If
  anything, we should add more rescuer-specific info - e.g. the number of
  work items rescued.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
2024-01-25 06:22:03 -10:00
Marcelo Tosatti 7bd20b6b87 workqueue: mark power efficient workqueue as unbounded if nohz_full enabled
A customer using nohz_full has experienced the following interruption:

oslat-1004510 [018] timer_cancel:         timer=0xffff90a7ca663cf8
oslat-1004510 [018] timer_expire_entry:   timer=0xffff90a7ca663cf8 function=delayed_work_timer_fn now=4709188240 baseclk=4709188240
oslat-1004510 [018] workqueue_queue_work: work struct=0xffff90a7ca663cd8 function=fb_flashcursor workqueue=events_power_efficient req_cpu=8192 cpu=18
oslat-1004510 [018] workqueue_activate_work: work struct 0xffff90a7ca663cd8
oslat-1004510 [018] sched_wakeup:         kworker/18:1:326 [120] CPU:018
oslat-1004510 [018] timer_expire_exit:    timer=0xffff90a7ca663cf8
oslat-1004510 [018] irq_work_entry:       vector=246
oslat-1004510 [018] irq_work_exit:        vector=246
oslat-1004510 [018] tick_stop:            success=0 dependency=SCHED
oslat-1004510 [018] hrtimer_start:        hrtimer=0xffff90a70009cb00 function=tick_sched_timer/0x0 ...
oslat-1004510 [018] softirq_exit:         vec=1 [action=TIMER]
oslat-1004510 [018] softirq_entry:        vec=7 [action=SCHED]
oslat-1004510 [018] softirq_exit:         vec=7 [action=SCHED]
oslat-1004510 [018] tick_stop:            success=0 dependency=SCHED
oslat-1004510 [018] sched_switch:         oslat:1004510 [120] R ==> kworker/18:1:326 [120]
kworker/18:1-326 [018] workqueue_execute_start: work struct 0xffff90a7ca663cd8: function fb_flashcursor
kworker/18:1-326 [018] workqueue_queue_work: work struct=0xffff9078f119eed0 function=drm_fb_helper_damage_work workqueue=events req_cpu=8192 cpu=18
kworker/18:1-326 [018] workqueue_activate_work: work struct 0xffff9078f119eed0
kworker/18:1-326 [018] timer_start:          timer=0xffff90a7ca663cf8 function=delayed_work_timer_fn ...

Set wq_power_efficient to true in case nohz_full is enabled.
This makes the power-efficient workqueue unbound, which allows its work
items to be moved to housekeeping (HK) CPUs.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-19 13:55:47 -10:00
Xuewen Yan 1a65a6d17c workqueue: Add rcu lock check at the end of work item execution
Currently the workqueue just checks the atomic and locking states after work
execution ends. However, sometimes a work item may not unlock RCU after
acquiring rcu_read_lock(). As a result, it causes an RCU stall, but the RCU
stall warning cannot dump the work function, because the work has already
finished.

In order to quickly discover works that do not call rcu_read_unlock() after
rcu_read_lock(), add an RCU lock check.

Use rcu_preempt_depth() to check the work's RCU status. Normally, this value
is 0. If it is bigger than 0, the work is still holding an RCU lock. In that
case, print error info and the work function.
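
A sketch of the added check at the end of work item execution (the actual
condition and message format in the patch may differ):

  /* a positive rcu_preempt_depth() means the work leaked an RCU read lock */
  if (unlikely(rcu_preempt_depth() > 0))
          pr_err("BUG: workqueue leaked rcu_read_lock: %ps\n",
                 worker->current_func);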

tj: Reworded the description for clarity. Minor formatting tweak.

Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-16 10:20:44 -10:00
Juri Lelli 85f0ab43f9 kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND
At the time they are created, unbound workqueue rescuers currently use
cpu_possible_mask as their affinity, but this can be too wide in case a
workqueue's unbound mask has been set as a subset of cpu_possible_mask.

Make new rescuers use their associated workqueue unbound cpumask from
the start.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-16 08:47:30 -10:00
Juri Lelli ab5e5b99a9 tools/workqueue: Add rescuers printing to wq_dump.py
Retrieving rescuer information (e.g., affinity and name) is quite
useful when debugging workqueue configurations.

Add printing of such information to the existing wq_dump.py script.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-16 08:47:22 -10:00
Audra Mitchell 31c8900728 workqueue.c: Increase workqueue name length
Currently we limit the size of the workqueue name to 24 characters due to
commit ecf6881ff3 ("workqueue: make workqueue->name[] fixed len").
Increase the size to 32 characters and print a warning in the event
the requested name is longer than the 32-character limit.

Signed-off-by: Audra Mitchell <audra@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-16 08:31:24 -10:00
Linus Torvalds 052d534373 Description for this pull request:
- Replace the internal table lookup algorithm with the hweight library
     and ffs of the bitops library.
   - Handle the two types of stream entry, valid data size (has been written)
     and data size, separately. It improves compatibility with two
     differently sized files created on Windows.

Merge tag 'exfat-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat

Pull exfat updates from Namjae Jeon:

 - Replace the internal table lookup algorithm with the hweight library
   and ffs of the bitops library.

 - Handle the two types of stream entry, valid data size (has been
   written) and data size separately. It improves compatibility with two
   differently sized files created on Windows.

* tag 'exfat-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
  exfat: do not zero the extended part
  exfat: change to get file size from DataLength
  exfat: using ffs instead of internal logic
  exfat: using hweight instead of internal logic
2024-01-12 18:05:56 -08:00
Linus Torvalds f16ab99c2e fix buggered locking in bch2_ioctl_subvolume_destroy()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Merge tag 'pull-bcachefs-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull bcachefs locking fix from Al Viro:
 "Fix broken locking in bch2_ioctl_subvolume_destroy()"

* tag 'pull-bcachefs-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  bch2_ioctl_subvolume_destroy(): fix locking
  new helper: user_path_locked_at()
2024-01-12 18:04:01 -08:00
Linus Torvalds 1acc24b300 more simple_recursive_removal() conversions
nfsctl this time...
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Merge tag 'pull-simple_recursive_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull nfsctl update from Al Viro:
 "More simple_recursive_removal() conversions.

  nfsctl this time..."

* tag 'pull-simple_recursive_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  nfsctl: switch to simple_recursive_removal()
2024-01-12 18:02:03 -08:00
Linus Torvalds 23a80d462c RCU pull request for v6.8
This pull request contains the following branches:
 
 doc.2023.12.13a: Documentation and comment updates.
 
 torture.2023.11.23a: RCU torture, locktorture updates that include
         cleanups; nolibc init build support for mips, ppc and rv64;
         testing of mid stall duration scenario and fixing fqs task
         creation conditions.
 
 fixes.2023.12.13a: Misc fixes, most notably restricting usage of
         RCU CPU stall notifiers, to confine their usage primarily
         to debug kernels.
 
 rcu-tasks.2023.12.12b: RCU tasks minor fixes.
 
 srcu.2023.12.13a: lockdep annotation fix for NMI-safe accesses,
         callback advancing/acceleration cleanup and documentation
         improvements.

Merge tag 'rcu.release.v6.8' of https://github.com/neeraju/linux

Pull RCU updates from Neeraj Upadhyay:

 - Documentation and comment updates

 - RCU torture, locktorture updates that include cleanups; nolibc init
   build support for mips, ppc and rv64; testing of mid stall duration
   scenario and fixing fqs task creation conditions

 - Misc fixes, most notably restricting usage of RCU CPU stall
   notifiers, to confine their usage primarily to debug kernels

 - RCU tasks minor fixes

 - lockdep annotation fix for NMI-safe accesses, callback
   advancing/acceleration cleanup and documentation improvements

* tag 'rcu.release.v6.8' of https://github.com/neeraju/linux:
  rcu: Force quiescent states only for ongoing grace period
  doc: Clarify historical disclaimers in memory-barriers.txt
  doc: Mention address and data dependencies in rcu_dereference.rst
  doc: Clarify RCU Tasks reader/updater checklist
  rculist.h: docs: Fix wrong function summary
  Documentation: RCU: Remove repeated word in comments
  srcu: Use try-lock lockdep annotation for NMI-safe access.
  srcu: Explain why callbacks invocations can't run concurrently
  srcu: No need to advance/accelerate if no callback enqueued
  srcu: Remove superfluous callbacks advancing from srcu_gp_start()
  rcu: Remove unused macros from rcupdate.h
  rcu: Restrict access to RCU CPU stall notifiers
  rcu-tasks: Mark RCU Tasks accesses to current->rcu_tasks_idle_cpu
  rcutorture: Add fqs_holdoff check before fqs_task is created
  rcutorture: Add mid-sized stall to TREE07
  rcutorture: add nolibc init support for mips, ppc and rv64
  locktorture: Increase Hamming distance between call_rcu_chain and rcu_call_chains
2024-01-12 16:35:58 -08:00
Linus Torvalds 38814330fe Devicetree for v6.8:
- Convert FPGA bridge, all TPMs (finally), and Rockchip HDMI bindings to
   schemas
 
 - Improvements in Samsung GPU schemas
 
 - A few more cases of dropping unneeded quotes in schemas
 
 - Merge QCom idle-states txt binding into common idle-states schema
 
 - Add X1E80100, SM8650, SM8650, and SDX75 SoCs to QCom Power Domain
   Controller
 
 - Add NXP i.mx8dl to SCU PD
 
 - Add synaptics r63353 panel controller
 
 - Clarify the wording around the use of 'wakeup-source' property
 
 - Add a DTS coding style doc
 
 - Add smi vendor prefix
 
 - Fix DT_SCHEMA_FILES incorrect matching of paths outside the kernel
   tree
 
 - Disable sysfb (e.g. EFI FB) when simple-framebuffer node is present
 
 - Fix double free in of_parse_phandle_with_args_map()
 
 - A couple of kerneldoc fixes

Merge tag 'devicetree-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux

Pull devicetree updates from Rob Herring:

 - Convert FPGA bridge, all TPMs (finally), and Rockchip HDMI bindings
   to schemas

 - Improvements in Samsung GPU schemas

 - A few more cases of dropping unneeded quotes in schemas

 - Merge QCom idle-states txt binding into common idle-states schema

 - Add X1E80100, SM8650, SM8650, and SDX75 SoCs to QCom Power Domain
   Controller

 - Add NXP i.mx8dl to SCU PD

 - Add synaptics r63353 panel controller

 - Clarify the wording around the use of 'wakeup-source' property

 - Add a DTS coding style doc

 - Add smi vendor prefix

 - Fix DT_SCHEMA_FILES incorrect matching of paths outside the kernel
   tree

 - Disable sysfb (e.g. EFI FB) when simple-framebuffer node is present

 - Fix double free in of_parse_phandle_with_args_map()

 - A couple of kerneldoc fixes

* tag 'devicetree-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (37 commits)
  of: unittest: Fix of_count_phandle_with_args() expected value message
  dt-bindings: fpga: altera: Convert bridge bindings to yaml
  dt-bindings: fpga: Convert bridge binding to yaml
  dt-bindings: vendor-prefixes: Add smi
  dt-bindings: power: Clarify wording for wakeup-source property
  of: Fix double free in of_parse_phandle_with_args_map
  dt-bindings: ignore paths outside kernel for DT_SCHEMA_FILES
  drivers: of: Fixed kernel doc warning
  dt-bindings: tpm: Document Microsoft fTPM bindings
  dt-bindings: tpm: Convert IBM vTPM bindings to DT schema
  dt-bindings: tpm: Convert Google Cr50 bindings to DT schema
  dt-bindings: tpm: Consolidate TCG TIS bindings
  dt-bindings: display: rockchip,inno-hdmi: Document RK3128 compatible
  dt-bindings: arm: Add remote etm dt-binding
  dt-bindings: mmc: sdhci-pxa: Fix 'regs' typo
  media: dt-bindings: samsung,s5p-mfc: Fix iommu properties schemas
  dt-bindings: display: panel: Add synaptics r63353 panel controller
  dt-bindings: arm: merge qcom,idle-state with idle-state
  dt-bindings: drm: rockchip: convert inno_hdmi-rockchip.txt to yaml
  dt-bindings: cache: qcom,llcc: correct QDU1000 reg entries
  ...
2024-01-12 15:05:30 -08:00
Linus Torvalds 42bff4d0f9 pwm: Changes for v6.8-rc1
This contains a bunch of cleanups and simplifications across the board,
 as well as a number of small fixes.
 
 Perhaps the most notable change here is the addition of an API that
 allows PWMs to be used in atomic contexts, which is useful when time-
 critical operations are involved, such as using a PWM to generate IR
 signals.
 
 Finally, I have decided to step down as PWM subsystem maintainer. Due to
 other responsibilities I have lately not been able to find the time that
 the subsystem deserves and Uwe, who has been helping out a lot for the
 past few years and has many things planned for the future, has kindly
 volunteered to take over. I have no doubt that he will be a suitable
 replacement.

Merge tag 'pwm/for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm

Pull pwm updates from Thierry Reding:
 "This contains a bunch of cleanups and simplifications across the
  board, as well as a number of small fixes.

  Perhaps the most notable change here is the addition of an API that
  allows PWMs to be used in atomic contexts, which is useful when time-
  critical operations are involved, such as using a PWM to generate IR
  signals.

  Finally, I have decided to step down as PWM subsystem maintainer. Due
  to other responsibilities I have lately not been able to find the time
  that the subsystem deserves and Uwe, who has been helping out a lot
  for the past few years and has many things planned for the future, has
  kindly volunteered to take over. I have no doubt that he will be a
  suitable replacement"

* tag 'pwm/for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm: (44 commits)
  MAINTAINERS: pwm: Thierry steps down, Uwe takes over
  pwm: linux/pwm.h: fix Excess kernel-doc description warning
  pwm: Add pwm_apply_state() compatibility stub
  pwm: cros-ec: Drop documentation for dropped struct member
  pwm: Drop two unused API functions
  pwm: lpc18xx-sct: Don't modify the cached period of other PWM outputs
  pwm: meson: Simplify using dev_err_probe()
  pwm: stmpe: Silence duplicate error messages
  pwm: Reduce number of pointer dereferences in pwm_device_request()
  pwm: crc: Use consistent variable naming for driver data
  pwm: omap-dmtimer: Drop locking
  dt-bindings: pwm: ti,pwm-omap-dmtimer: Update binding for yaml
  media: pwm-ir-tx: Trigger edges from hrtimer interrupt context
  pwm: bcm2835: Allow PWM driver to be used in atomic context
  pwm: Make it possible to apply PWM changes in atomic context
  pwm: renesas: Remove unused include
  pwm: Replace ENOTSUPP with EOPNOTSUPP
  pwm: Rename pwm_apply_state() to pwm_apply_might_sleep()
  pwm: Stop referencing pwm->chip
  pwm: Update kernel doc for struct pwm_chip
  ...
2024-01-12 14:59:50 -08:00
Linus Torvalds fef018d819 hid-for-linus-2024010801

Merge tag 'hid-for-linus-2024010801' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid

Pull HID updates from Jiri Kosina:

 - assorted functional fixes for hid-steam ported from SteamOS betas
   (Vicki Pfau)

 - fix for custom sensor-hub sensors (hinge angle sensor and LISS
   sensors) not working (Yauhen Kharuzhy)

 - functional fix for handling Confidence in Wacom driver (Jason
   Gerecke)

 - support for Ilitek ili2901 touchscreen (Zhengqiao Xia)

 - power management fix for Wacom userspace battery exporting
   (Tatsunosuke Tobita)

 - rework of wait-for-reset in order to reduce the need for
   I2C_HID_QUIRK_NO_IRQ_AFTER_RESET quirk; the success rate is now 50%
   better, but there are still further improvements to be made (Hans de
   Goede)

 - greatly improved coverage of Tablets in hid-selftests (Benjamin
   Tissoires)

 - support for Nintendo NSO controllers -- SNES, Genesis and N64 (Ryan
   McClelland)

 - support for controlling mcp2200 GPIOs (Johannes Roith)

 - power management improvement for EHL OOB wakeup in intel-ish
   (Kai-Heng Feng)

 - other assorted device-specific fixes and code cleanups

* tag 'hid-for-linus-2024010801' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: (53 commits)
  HID: amd_sfh: Add a new interface for exporting ALS data
  HID: amd_sfh: Add a new interface for exporting HPD data
  HID: amd_sfh: rename float_to_int() to amd_sfh_float_to_int()
  HID: i2c-hid: elan: Add ili2901 timing
  dt-bindings: HID: i2c-hid: elan: Introduce Ilitek ili2901
  HID: bpf: make bus_type const in struct hid_bpf_ops
  HID: make ishtp_cl_bus_type const
  HID: make hid_bus_type const
  HID: hid-steam: Add gamepad-only mode switched to by holding options
  HID: hid-steam: Better handling of serial number length
  HID: hid-steam: Update list of identifiers from SDL
  HID: hid-steam: Make client_opened a counter
  HID: hid-steam: Clean up locking
  HID: hid-steam: Disable watchdog instead of using a heartbeat
  HID: hid-steam: Avoid overwriting smoothing parameter
  HID: magicmouse: fix kerneldoc for struct magicmouse_sc
  HID: sensor-hub: Enable hid core report processing for all devices
  HID: wacom: Add additional tests of confidence behavior
  HID: wacom: Correct behavior when processing some confidence == false touches
  HID: nintendo: add support for nso controllers
  ...
2024-01-12 14:45:13 -08:00
Linus Torvalds d97a78423c fbdev fixes and cleanups for 6.8-rc1:
- Remove intelfb fbdev driver (Thomas Zimmermann)
 - Remove amba-clcd fbdev driver (Linus Walleij)
 - Remove vmlfb Carillo Ranch fbdev driver (Matthew Wilcox)
 - fb_deferred_io flushing fixes (Nam Cao)
 - imxfb code fixes and cleanups (Dario Binacchi)
 - stifb primary screen detection cleanups (Thomas Zimmermann)

Merge tag 'fbdev-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev

Pull fbdev updates from Helge Deller:
 "Three fbdev drivers (~8500 lines of code) removed. The Carillo Ranch
  fbdev driver is for an Intel product which was never shipped, and for
  the intelfb and the amba-clcd drivers the drm drivers can be used
  instead.

  The other code changes are minor: some fb_deferred_io flushing fixes,
  imxfb margin fixes and stifb cleanups.

  Summary:
   - Remove intelfb fbdev driver (Thomas Zimmermann)
   - Remove amba-clcd fbdev driver (Linus Walleij)
   - Remove vmlfb Carillo Ranch fbdev driver (Matthew Wilcox)
   - fb_deferred_io flushing fixes (Nam Cao)
   - imxfb code fixes and cleanups (Dario Binacchi)
   - stifb primary screen detection cleanups (Thomas Zimmermann)"

* tag 'fbdev-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev: (28 commits)
  fbdev/intelfb: Remove driver
  fbdev/hyperv_fb: Do not clear global screen_info
  firmware/sysfb: Clear screen_info state after consuming it
  fbdev/hyperv_fb: Remove firmware framebuffers with aperture helpers
  drm/hyperv: Remove firmware framebuffers with aperture helper
  fbdev/sis: Remove dependency on screen_info
  video/logo: use %u format specifier for unsigned int values
  video/sticore: Remove info field from STI struct
  arch/parisc: Detect primary video device from device instance
  fbdev/stifb: Allocate fb_info instance with framebuffer_alloc()
  video/sticore: Store ROM device in STI struct
  fbdev: flush deferred IO before closing
  fbdev: flush deferred work in fb_deferred_io_fsync()
  fbdev: amba-clcd: Delete the old CLCD driver
  fbdev: Remove support for Carillo Ranch driver
  fbdev: hgafb: fix kernel-doc comments
  fbdev: mmp: Fix typo and wording in code comment
  fbdev: fsl-diu-fb: Fix sparse warning due to virt_to_phys() prototype change
  fbdev: imxfb: add '*/' on a separate line in block comment
  fbdev: imxfb: use __func__ for function name
  ...
2024-01-12 14:38:08 -08:00
Linus Torvalds 61da593f44 media updates for v6.8-rc1

Merge tag 'media/v6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media updates from Mauro Carvalho Chehab:

 - v4l core: subdev frame interval now supports which field

 - v4l kapi: moves and renames the init_cfg pad op to init_state as an
   internal op.

 - new sensor drivers: gc0308, gc2145, Avnet Alvium, ov64a40, tw9900

 - new camera driver: STM32 DCMIPP

 - s5p-mfc has gained MFC v12 support

 - new ISP driver added to staging: Starfive

 - new stateful encoder/decoder: Wave5 codec. It is found on the J721S2
   SoC, JH7100 SoC, ssd202d SoC, etc.

 - fwnode gained support for MIPI "DisCo for Imaging"
   (https://www.mipi.org/specifications/mipi-disco-imaging)

 - as usual, lots of cleanups, fixups and driver improvements.

* tag 'media/v6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (309 commits)
  media: i2c: thp7312: select CONFIG_FW_LOADER
  media: i2c: mt9m114: use fsleep() in place of udelay()
  media: videobuf2: core: Rename min_buffers_needed field in vb2_queue
  media: i2c: thp7312: Store frame interval in subdev state
  media: docs: uAPI: Fix documentation of 'which' field for routing ioctls
  media: docs: uAPI: Expand error documentation for invalid 'which' value
  media: docs: uAPI: Clarify error documentation for invalid 'which' value
  media: v4l2-subdev: Store frame interval in subdev state
  media: v4l2-subdev: Add which field to struct v4l2_subdev_frame_interval
  media: v4l2-subdev: Turn .[gs]_frame_interval into pad operations
  media: v4l: subdev: Move out subdev state lock macros outside CONFIG_MEDIA_CONTROLLER
  media: s5p-mfc: DPB Count Independent of VIDIOC_REQBUF
  media: s5p-mfc: Load firmware for each run in MFCv12.
  media: s5p-mfc: Set context for valid case before calling try_run
  media: s5p-mfc: Add support for DMABUF for encoder
  media: s5p-mfc: Add support for UHD encoding.
  media: s5p-mfc: Add support for rate controls in MFCv12
  media: s5p-mfc: Add YV12 and I420 multiplanar format support
  media: s5p-mfc: Add initial support for MFCv12
  media: s5p-mfc: Rename IS_MFCV10 macro
  ...
2024-01-12 14:29:48 -08:00