License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner, Kate Stewart, and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to the license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied to
a file was done in a spreadsheet of side-by-side results from the output
of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file-by-file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
The criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
  lines of source.
- The file already had some variant of a license header in it (even if <5
  lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, the file was
  considered to have no license information in it, and the top-level
  COPYING file license was applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was "GPL-2.0". The results of that were:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
  of the */uapi/* ones, it was denoted with the Linux-syscall-note if
  any GPL-family license was found in the file or if it had no licensing in
  it (per the prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review of the spreadsheet was done
by Kate, Philippe and Thomas to determine the SPDX license identifiers to
apply to the source files, with confirmation in some cases by lawyers
working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors; they have been fixed to reflect the
correct identifier.
Additionally, Philippe spent 10 hours doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version from early this week, with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types). Finally Greg ran the script using the .csv files to
generate the patches.
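As a small illustration of the "different comment types" mentioned above
(just the resulting first line of each file type, not the script itself):

For .c source files:

	// SPDX-License-Identifier: GPL-2.0

For headers (and other files that cannot take C++-style comments):

	/* SPDX-License-Identifier: GPL-2.0 */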
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
/* SPDX-License-Identifier: GPL-2.0 */
/*
 * workqueue.h --- work queue handling for Linux.
 */

#ifndef _LINUX_WORKQUEUE_H
#define _LINUX_WORKQUEUE_H

#include <linux/timer.h>
#include <linux/linkage.h>
#include <linux/bitops.h>
#include <linux/lockdep.h>
#include <linux/threads.h>
#include <linux/atomic.h>
#include <linux/cpumask.h>
#include <linux/rcupdate.h>
#include <linux/workqueue_types.h>

/*
 * The first word is the work queue pointer and the flags rolled into
 * one
 */
#define work_data_bits(work) ((unsigned long *)(&(work)->data))

enum work_bits {
	WORK_STRUCT_PENDING_BIT	= 0,	/* work item is pending execution */
	WORK_STRUCT_INACTIVE_BIT,	/* work item is inactive */
	WORK_STRUCT_PWQ_BIT,		/* data points to pwq */
	WORK_STRUCT_LINKED_BIT,		/* next work is linked to this one */
#ifdef CONFIG_DEBUG_OBJECTS_WORK
	WORK_STRUCT_STATIC_BIT,		/* static initializer (debugobjects) */
#endif
	WORK_STRUCT_FLAG_BITS,

	/* color for workqueue flushing */
	WORK_STRUCT_COLOR_SHIFT	= WORK_STRUCT_FLAG_BITS,
	WORK_STRUCT_COLOR_BITS	= 4,

	/*
	 * When WORK_STRUCT_PWQ is set, reserve 8 bits off of pwq pointer w/
	 * debugobjects turned off. This makes pwqs aligned to 256 bytes (512
	 * bytes w/ DEBUG_OBJECTS_WORK) and allows 16 workqueue flush colors.
	 *
	 * MSB
	 * [ pwq pointer ] [ flush color ] [ STRUCT flags ]
	 *                     4 bits        4 or 5 bits
	 */
	WORK_STRUCT_PWQ_SHIFT	= WORK_STRUCT_COLOR_SHIFT + WORK_STRUCT_COLOR_BITS,

	/*
	 * data contains off-queue information when !WORK_STRUCT_PWQ.
	 *
	 * MSB
	 * [ pool ID ] [ OFFQ flags ] [ STRUCT flags ]
	 *                  1 bit       4 or 5 bits
	 */
	WORK_OFFQ_FLAG_SHIFT	= WORK_STRUCT_FLAG_BITS,
	WORK_OFFQ_CANCELING_BIT	= WORK_OFFQ_FLAG_SHIFT,
	WORK_OFFQ_FLAG_END,
	WORK_OFFQ_FLAG_BITS	= WORK_OFFQ_FLAG_END - WORK_OFFQ_FLAG_SHIFT,

	/*
	 * When a work item is off queue, the high bits encode off-queue flags
	 * and the last pool it was on. Cap pool ID to 31 bits and use the
	 * highest number to indicate that no pool is associated.
	 */
	WORK_OFFQ_POOL_SHIFT	= WORK_OFFQ_FLAG_SHIFT + WORK_OFFQ_FLAG_BITS,
	WORK_OFFQ_LEFT		= BITS_PER_LONG - WORK_OFFQ_POOL_SHIFT,
	WORK_OFFQ_POOL_BITS	= WORK_OFFQ_LEFT <= 31 ? WORK_OFFQ_LEFT : 31,
};
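To make the bit layout above concrete, here is the arithmetic for the common
case, assuming CONFIG_DEBUG_OBJECTS_WORK=n (no WORK_STRUCT_STATIC_BIT) and
64-bit longs:

/*
 * WORK_STRUCT_FLAG_BITS   = 4   (PENDING, INACTIVE, PWQ, LINKED)
 * WORK_STRUCT_COLOR_SHIFT = 4,  WORK_STRUCT_COLOR_BITS = 4
 * WORK_STRUCT_PWQ_SHIFT   = 4 + 4 = 8
 *   => with WORK_STRUCT_PWQ set, the low 8 bits of work->data hold the
 *      STRUCT flags and flush color, so the pwq pointer stored in the
 *      remaining bits must be aligned to 1 << 8 = 256 bytes.
 *
 * WORK_OFFQ_FLAG_SHIFT    = 4,  WORK_OFFQ_FLAG_BITS = 1 (CANCELING)
 * WORK_OFFQ_POOL_SHIFT    = 5
 * WORK_OFFQ_LEFT          = 64 - 5 = 59, capped: WORK_OFFQ_POOL_BITS = 31
 */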
enum work_flags {
	WORK_STRUCT_PENDING	= 1 << WORK_STRUCT_PENDING_BIT,
	WORK_STRUCT_INACTIVE	= 1 << WORK_STRUCT_INACTIVE_BIT,
	WORK_STRUCT_PWQ		= 1 << WORK_STRUCT_PWQ_BIT,
	WORK_STRUCT_LINKED	= 1 << WORK_STRUCT_LINKED_BIT,
#ifdef CONFIG_DEBUG_OBJECTS_WORK
	WORK_STRUCT_STATIC	= 1 << WORK_STRUCT_STATIC_BIT,
#else
	WORK_STRUCT_STATIC	= 0,
#endif
};

enum wq_misc_consts {
	WORK_NR_COLORS		= (1 << WORK_STRUCT_COLOR_BITS),

	/* not bound to any CPU, prefer the local CPU */
	WORK_CPU_UNBOUND	= NR_CPUS,

	/* bit mask for work_busy() return values */
	WORK_BUSY_PENDING	= 1 << 0,
	WORK_BUSY_RUNNING	= 1 << 1,

	/* maximum string length for set_worker_desc() */
	WORKER_DESC_LEN		= 24,
};

workqueue: clean up WORK_* constant types, clarify masking
Dave Airlie reports that gcc-13.1.1 has started complaining about some
of the workqueue code in 32-bit arm builds:
kernel/workqueue.c: In function ‘get_work_pwq’:
kernel/workqueue.c:713:24: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
713 | return (void *)(data & WORK_STRUCT_WQ_DATA_MASK);
| ^
[ ... a couple of other cases ... ]
and while it's not immediately clear exactly why gcc started complaining
about it now, I suspect some C23-induced enum type handling fixup in
gcc-13 is the cause.
Whatever the reason for starting to complain, the code and data types
are indeed disgusting enough that the complaint is warranted.
The wq code ends up creating various "helper constants" (like that
WORK_STRUCT_WQ_DATA_MASK) using an enum type, which is all kinds of
confused. The mask needs to be 'unsigned long', not some unspecified
enum type.
To make matters worse, the actual "mask and cast to a pointer" is
repeated a couple of times, and the cast isn't even always done to the
right pointer, but - as the error case above - to a 'void *' with then
the compiler finishing the job.
That's not how we roll in the kernel.
So create the masks using the proper types rather than some ambiguous
enumeration, and use a nice helper that actually does the type
conversion in one well-defined place.
Incidentally, this magically makes clang generate better code. That,
admittedly, is really just a sign of clang having been seriously
confused before, and cleaning up the typing unconfuses the compiler too.
Reported-by: Dave Airlie <airlied@gmail.com>
Link: https://lore.kernel.org/lkml/CAPM=9twNnV4zMCvrPkw3H-ajZOH-01JVh_kDrxdPYQErz8ZTdA@mail.gmail.com/
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
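The commit above replaces the open-coded "mask and cast" with a single typed
helper in kernel/workqueue.c; a minimal sketch of that pattern (the helper
name work_struct_pwq() and its exact form are assumed here for illustration)
looks like:

static inline struct pool_workqueue *work_struct_pwq(unsigned long data)
{
	/* mask with an unsigned long constant, cast in exactly one place */
	return (struct pool_workqueue *)(data & WORK_STRUCT_PWQ_MASK);
}

so that callers such as get_work_pwq() no longer repeat the cast themselves.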
/* Convenience constants - of type 'unsigned long', not 'enum'! */
#define WORK_OFFQ_CANCELING	(1ul << WORK_OFFQ_CANCELING_BIT)
#define WORK_OFFQ_POOL_NONE	((1ul << WORK_OFFQ_POOL_BITS) - 1)
#define WORK_STRUCT_NO_POOL	(WORK_OFFQ_POOL_NONE << WORK_OFFQ_POOL_SHIFT)
#define WORK_STRUCT_PWQ_MASK	(~((1ul << WORK_STRUCT_PWQ_SHIFT) - 1))
#define WORK_DATA_INIT()	ATOMIC_LONG_INIT((unsigned long)WORK_STRUCT_NO_POOL)
#define WORK_DATA_STATIC_INIT()	\
	ATOMIC_LONG_INIT((unsigned long)(WORK_STRUCT_NO_POOL | WORK_STRUCT_STATIC))
struct delayed_work {
	struct work_struct work;
	struct timer_list timer;

	/* target workqueue and CPU ->timer uses to queue ->work */
	struct workqueue_struct *wq;
	int cpu;
};

struct rcu_work {
	struct work_struct work;
	struct rcu_head rcu;

	/* target workqueue ->rcu uses to queue ->work */
	struct workqueue_struct *wq;
};

enum wq_affn_scope {
	WQ_AFFN_DFL,			/* use system default */
	WQ_AFFN_CPU,			/* one pod per CPU */
	WQ_AFFN_SMT,			/* one pod per SMT */
	WQ_AFFN_CACHE,			/* one pod per LLC */
	WQ_AFFN_NUMA,			/* one pod per NUMA node */
	WQ_AFFN_SYSTEM,			/* one pod across the whole system */

	WQ_AFFN_NR_TYPES,
};

/**
 * struct workqueue_attrs - A struct for workqueue attributes.
 *
 * This can be used to change attributes of an unbound workqueue.
 */
struct workqueue_attrs {
	/**
	 * @nice: nice level
	 */
	int nice;

	/**
	 * @cpumask: allowed CPUs
	 *
	 * Work items in this workqueue are affine to these CPUs and not allowed
	 * to execute on other CPUs. A pool serving a workqueue must have the
	 * same @cpumask.
	 */
	cpumask_var_t cpumask;

workqueue: Add workqueue_attrs->__pod_cpumask
workqueue_attrs has two uses:
* to specify the required unbound workqueue properties by users
* to match worker_pool's properties to workqueues by core code
For example, if the user wants to restrict a workqueue to run only CPUs 0
and 2, and the two CPUs are on different affinity scopes, the workqueue's
attrs->cpumask would contain CPUs 0 and 2, and the workqueue would be
associated with two worker_pools, one with attrs->cpumask containing just
CPU 0 and the other CPU 2.
Workqueue wants to support non-strict affinity scopes where work items are
started in their matching affinity scopes but the scheduler is free to
migrate them outside the starting scopes, which can enable utilizing the
whole machine while maintaining most of the locality benefits from affinity
scopes.
To enable that, a worker_pool needs to distinguish the strict affinity that it
has to follow (because that's the restriction coming from the user) and the
soft affinity that it wants to apply when dispatching work items. Note that
two worker_pools with different soft dispatching requirements have to be
separate; otherwise, for example, we'd be ping-ponging worker threads across
NUMA boundaries constantly.
This patch adds workqueue_attrs->__pod_cpumask. The new field is double
underscored as it's only used internally to distinguish worker_pools. A
worker_pool's ->cpumask is now always the same as the online subset of
allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's
subset of that ->cpumask. Going back to the example above, both worker_pools
would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask
would contain 0 while the other's 2.
* pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask
that the pool's workers must stay within. This is currently always
->__pod_cpumask as all boundaries are still strict.
* As a workqueue_attrs can now track both the associated workqueues' cpumask
and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external
out-argument. Drop @cpumask and instead store the result in
->__pod_cpumask.
* The above also simplifies apply_wqattrs_prepare() as the same
workqueue_attrs can be used to create all pods associated with a
workqueue. tmp_attrs is dropped.
* wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq
update is needed instead of only comparing ->cpumask so that
->__pod_cpumask is compared too. It can directly compare ->__pod_cpumask
but the code is easier to understand and more robust this way.
The only user-visible behavior change is that two workqueues with different
cpumasks no longer can share worker_pools even when their pod subsets
coincide. Going back to the example, let's say there's another workqueue
with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped
to two worker_pools - one with CPU 0, the other with 2 and 3. The former has
the same cpumask as the first pod of the earlier example and would have
shared the same worker_pool but that's no longer the case after this patch.
The worker_pools would have the same ->__pod_cpumask but their ->cpumask's
wouldn't match.
While this is necessary to support non-strict affinity scopes, there can be
further optimizations to maintain sharing among strict affinity scopes.
However, non-strict affinity scopes are going to be preferable for most use
cases and we don't see very diverse mixture of unbound workqueue cpumasks
anyway, so the additional overhead doesn't seem to justify the extra
complexity.
v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask
to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by
using wqattrs_equal() for comparison instead.
- Per-cpu worker pools weren't initializing ->__pod_cpumask which caused
a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask.
Signed-off-by: Tejun Heo <tj@kernel.org>
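To make the "CPUs 0 and 2" example above concrete, a rough sketch of how such
a restriction is applied to an unbound workqueue (error handling elided;
unbound_wq is a placeholder name, and apply_workqueue_attrs() is primarily
used by core/workqueue-internal callers, so treat this as illustration rather
than a recommended driver API):

	struct workqueue_attrs *attrs = alloc_workqueue_attrs();

	if (attrs) {
		cpumask_clear(attrs->cpumask);
		cpumask_set_cpu(0, attrs->cpumask);
		cpumask_set_cpu(2, attrs->cpumask);
		apply_workqueue_attrs(unbound_wq, attrs);	/* unbound_wq: a WQ_UNBOUND workqueue */
		free_workqueue_attrs(attrs);
	}

The workqueue core then derives the per-pod ->__pod_cpumask subsets from that
cpumask as described above.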
	/**
	 * @__pod_cpumask: internal attribute used to create per-pod pools
	 *
	 * Internal use only.
	 *
	 * Per-pod unbound worker pools are used to improve locality. Always a
	 * subset of ->cpumask. A workqueue can be associated with multiple
	 * worker pools with disjoint @__pod_cpumask's. Whether the enforcement
	 * of a pool's @__pod_cpumask is strict depends on @affn_strict.
	 */
	cpumask_var_t __pod_cpumask;

workqueue: Implement non-strict affinity scope for unbound workqueues
An unbound workqueue can be served by multiple worker_pools to improve
locality. The segmentation is achieved by grouping CPUs into pods. By
default, the cache boundaries according to cpus_share_cache() define how the
CPUs are grouped. Say a workqueue is allowed to run on all CPUs and the
system has two L3 caches. The workqueue would be mapped to two worker_pools
each serving one L3 cache domain.
While this improves locality, because the pod boundaries are strict, it
limits the total bandwidth a given issuer can consume. For example, let's
say there is a thread pinned to a CPU issuing enough work items to saturate
the whole machine. With the machine segmented into two pods, no matter how
many work items it issues, it can only use half of the CPUs on the system.
While this limitation has existed for a very long time, it wasn't very
pronounced because the affinity grouping used to be always by NUMA nodes.
With cache boundaries as the default and support for even finer grained
scopes (smt and cpu), it is now a much more pressing problem.
This patch implements non-strict affinity scope where the pod boundaries
aren't enforced strictly. Going back to the previous example, the workqueue
would still be mapped to two worker_pools; however, the affinity enforcement
would be soft. The workers in both pools would have their cpus_allowed set
to the whole machine thus allowing the scheduler to migrate them anywhere on
the machine. However, whenever an idle worker is woken up, the workqueue
code asks the scheduler to bring back the task within the pod if the worker
is outside. I.e., work items start executing within their affinity scope but can
be migrated outside as the scheduler sees fit. This removes the hard cap on
utilization while maintaining the benefits of affinity scopes.
After the earlier ->__pod_cpumask changes, the implementation is pretty
simple. When non-strict, which is the new default:
* pool_allowed_cpus() returns @pool->attrs->cpumask instead of
->__pod_cpumask so that the workers are allowed to run on any CPU that
the associated workqueues allow.
* If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets
the field to a CPU within the pod.
This would be the first use of task_struct->wake_cpu outside scheduler
proper, so it isn't clear whether this would be acceptable. However, other
methods of migrating tasks are significantly more expensive and are likely
prohibitively so if we want to do this on every work item. This needs
discussion with scheduler folks.
There is also a race window where setting ->wake_cpu wouldn't be effective
as the target task is still on CPU. However, the window is pretty small and
this being a best-effort optimization, it doesn't seem to warrant more
complexity at the moment.
While the non-strict cache affinity scopes seem to be the best option, the
performance picture interacts with the affinity scope and is a bit
complicated to fully discuss in this patch, so the behavior is made easily
selectable through wqattrs and sysfs and the next patch will add
documentation to discuss performance implications.
v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
	/**
	 * @affn_strict: affinity scope is strict
	 *
	 * If clear, workqueue will make a best-effort attempt at starting the
	 * worker inside @__pod_cpumask but the scheduler is free to migrate it
	 * outside.
	 *
	 * If set, workers are only allowed to run inside @__pod_cpumask.
	 */
	bool affn_strict;

	/*
	 * Below fields aren't properties of a worker_pool. They only modify how
	 * :c:func:`apply_workqueue_attrs` select pools and thus don't
	 * participate in pool hash calculations or equality comparisons.
	 */

	/**
	 * @affn_scope: unbound CPU affinity scope
	 *
	 * CPU pods are used to improve execution locality of unbound work
	 * items. There are multiple pod types, one for each wq_affn_scope, and
	 * every CPU in the system belongs to one pod in every pod type. CPUs
	 * that belong to the same pod share the worker pool. For example,
	 * selecting %WQ_AFFN_NUMA makes the workqueue use a separate worker
	 * pool for each NUMA node.
	 */
	enum wq_affn_scope affn_scope;

	/**
	 * @ordered: work items must be executed one by one in queueing order
	 */
	bool ordered;
};
static inline struct delayed_work *to_delayed_work(struct work_struct *work)
{
	return container_of(work, struct delayed_work, work);
}

static inline struct rcu_work *to_rcu_work(struct work_struct *work)
{
	return container_of(work, struct rcu_work, work);
}

struct execute_work {
	struct work_struct work;
};
#ifdef CONFIG_LOCKDEP
/*
 * NB: because we have to copy the lockdep_map, setting _key
 * here is required, otherwise it could get initialised to the
 * copy of the lockdep_map!
 */
#define __WORK_INIT_LOCKDEP_MAP(n, k) \
	.lockdep_map = STATIC_LOCKDEP_MAP_INIT(n, k),
#else
#define __WORK_INIT_LOCKDEP_MAP(n, k)
#endif

#define __WORK_INITIALIZER(n, f) {					\
	.data = WORK_DATA_STATIC_INIT(),				\
	.entry	= { &(n).entry, &(n).entry },				\
	.func = (f),							\
	__WORK_INIT_LOCKDEP_MAP(#n, &(n))				\
	}

#define __DELAYED_WORK_INITIALIZER(n, f, tflags) {			\
	.work = __WORK_INITIALIZER((n).work, (f)),			\
	.timer = __TIMER_INITIALIZER(delayed_work_timer_fn,		\
				     (tflags) | TIMER_IRQSAFE),		\
	}

#define DECLARE_WORK(n, f)						\
	struct work_struct n = __WORK_INITIALIZER(n, f)

#define DECLARE_DELAYED_WORK(n, f)					\
	struct delayed_work n = __DELAYED_WORK_INITIALIZER(n, f, 0)

#define DECLARE_DEFERRABLE_WORK(n, f)					\
	struct delayed_work n = __DELAYED_WORK_INITIALIZER(n, f, TIMER_DEFERRABLE)
#ifdef CONFIG_DEBUG_OBJECTS_WORK
extern void __init_work(struct work_struct *work, int onstack);
extern void destroy_work_on_stack(struct work_struct *work);
extern void destroy_delayed_work_on_stack(struct delayed_work *work);
static inline unsigned int work_static(struct work_struct *work)
{
	return *work_data_bits(work) & WORK_STRUCT_STATIC;
}
#else
static inline void __init_work(struct work_struct *work, int onstack) { }
static inline void destroy_work_on_stack(struct work_struct *work) { }
static inline void destroy_delayed_work_on_stack(struct delayed_work *work) { }
static inline unsigned int work_static(struct work_struct *work) { return 0; }
#endif

/*
 * initialize all of a work item in one go
 *
 * NOTE! No point in using "atomic_long_set()": using a direct
 * assignment of the work data initializer allows the compiler
 * to generate better code.
 */
#ifdef CONFIG_LOCKDEP

workqueue: Provide one lock class key per work_on_cpu() callsite
All callers of work_on_cpu() share the same lock class key for all the
functions queued. As a result the workqueue related locking scenario for
a function A may be spuriously accounted as an inversion against the
locking scenario of function B such as in the following model:
long A(void *arg)
{
mutex_lock(&mutex);
mutex_unlock(&mutex);
}
long B(void *arg)
{
}
void launchA(void)
{
work_on_cpu(0, A, NULL);
}
void launchB(void)
{
mutex_lock(&mutex);
work_on_cpu(1, B, NULL);
mutex_unlock(&mutex);
}
launchA and launchB running concurrently have no chance to deadlock.
However the above can be reported by lockdep as a possible locking
inversion because the works containing A() and B() are treated as
belonging to the same locking class.
The following shows an existing example of such a spurious lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
6.6.0-rc1-00065-g934ebd6e5359 #35409 Not tainted
------------------------------------------------------
kworker/0:1/9 is trying to acquire lock:
ffffffff9bc72f30 (cpu_hotplug_lock){++++}-{0:0}, at: _cpu_down+0x57/0x2b0
but task is already holding lock:
ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
__flush_work+0x83/0x4e0
work_on_cpu+0x97/0xc0
rcu_nocb_cpu_offload+0x62/0xb0
rcu_nocb_toggle+0xd0/0x1d0
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
-> #1 (rcu_state.barrier_mutex){+.+.}-{3:3}:
__mutex_lock+0x81/0xc80
rcu_nocb_cpu_deoffload+0x38/0xb0
rcu_nocb_toggle+0x144/0x1d0
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1538/0x2500
lock_acquire+0xbf/0x2a0
percpu_down_write+0x31/0x200
_cpu_down+0x57/0x2b0
__cpu_down_maps_locked+0x10/0x20
work_for_cpu_fn+0x15/0x20
process_scheduled_works+0x2a7/0x500
worker_thread+0x173/0x330
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> rcu_state.barrier_mutex --> (work_completion)(&wfc.work)
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock((work_completion)(&wfc.work));
lock(rcu_state.barrier_mutex);
lock((work_completion)(&wfc.work));
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by kworker/0:1/9:
#0: ffff900481068b38 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works+0x212/0x500
#1: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500
stack backtrace:
CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.6.0-rc1-00065-g934ebd6e5359 #35409
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Workqueue: events work_for_cpu_fn
Call Trace:
rcu-torture: rcu_torture_read_exit: Start of episode
<TASK>
dump_stack_lvl+0x4a/0x80
check_noncircular+0x132/0x150
__lock_acquire+0x1538/0x2500
lock_acquire+0xbf/0x2a0
? _cpu_down+0x57/0x2b0
percpu_down_write+0x31/0x200
? _cpu_down+0x57/0x2b0
_cpu_down+0x57/0x2b0
__cpu_down_maps_locked+0x10/0x20
work_for_cpu_fn+0x15/0x20
process_scheduled_works+0x2a7/0x500
worker_thread+0x173/0x330
? __pfx_worker_thread+0x10/0x10
kthread+0xe6/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2f/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK
Fix this by providing one lock class key per work_on_cpu() caller.
Reported-and-tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
#define __INIT_WORK_KEY(_work, _func, _onstack, _key)			\
	do {								\
		__init_work((_work), _onstack);				\
		(_work)->data = (atomic_long_t) WORK_DATA_INIT();	\
		lockdep_init_map(&(_work)->lockdep_map, "(work_completion)"#_work, (_key), 0); \
		INIT_LIST_HEAD(&(_work)->entry);			\
		(_work)->func = (_func);				\
	} while (0)
#else
#define __INIT_WORK_KEY(_work, _func, _onstack, _key)			\
	do {								\
		__init_work((_work), _onstack);				\
		(_work)->data = (atomic_long_t) WORK_DATA_INIT();	\
		INIT_LIST_HEAD(&(_work)->entry);			\
		(_work)->func = (_func);				\
	} while (0)
#endif
#define __INIT_WORK(_work, _func, _onstack)				\
	do {								\
		static __maybe_unused struct lock_class_key __key;	\
									\
		__INIT_WORK_KEY(_work, _func, _onstack, &__key);	\
	} while (0)

#define INIT_WORK(_work, _func)						\
	__INIT_WORK((_work), (_func), 0)

#define INIT_WORK_ONSTACK(_work, _func)					\
	__INIT_WORK((_work), (_func), 1)
workqueue: Provide one lock class key per work_on_cpu() callsite
All callers of work_on_cpu() share the same lock class key for all the
functions queued. As a result the workqueue related locking scenario for
a function A may be spuriously accounted as an inversion against the
locking scenario of function B such as in the following model:
long A(void *arg)
{
mutex_lock(&mutex);
mutex_unlock(&mutex);
}
long B(void *arg)
{
}
void launchA(void)
{
work_on_cpu(0, A, NULL);
}
void launchB(void)
{
mutex_lock(&mutex);
work_on_cpu(1, B, NULL);
mutex_unlock(&mutex);
}
launchA and launchB running concurrently have no chance to deadlock.
However the above can be reported by lockdep as a possible locking
inversion because the works containing A() and B() are treated as
belonging to the same locking class.
The following shows an existing example of such a spurious lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
6.6.0-rc1-00065-g934ebd6e5359 #35409 Not tainted
------------------------------------------------------
kworker/0:1/9 is trying to acquire lock:
ffffffff9bc72f30 (cpu_hotplug_lock){++++}-{0:0}, at: _cpu_down+0x57/0x2b0
but task is already holding lock:
ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
__flush_work+0x83/0x4e0
work_on_cpu+0x97/0xc0
rcu_nocb_cpu_offload+0x62/0xb0
rcu_nocb_toggle+0xd0/0x1d0
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
-> #1 (rcu_state.barrier_mutex){+.+.}-{3:3}:
__mutex_lock+0x81/0xc80
rcu_nocb_cpu_deoffload+0x38/0xb0
rcu_nocb_toggle+0x144/0x1d0
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1538/0x2500
lock_acquire+0xbf/0x2a0
percpu_down_write+0x31/0x200
_cpu_down+0x57/0x2b0
__cpu_down_maps_locked+0x10/0x20
work_for_cpu_fn+0x15/0x20
process_scheduled_works+0x2a7/0x500
worker_thread+0x173/0x330
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> rcu_state.barrier_mutex --> (work_completion)(&wfc.work)
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock((work_completion)(&wfc.work));
lock(rcu_state.barrier_mutex);
lock((work_completion)(&wfc.work));
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by kworker/0:1/9:
#0: ffff900481068b38 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works+0x212/0x500
#1: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500
stack backtrace:
CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.6.0-rc1-00065-g934ebd6e5359 #35409
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Workqueue: events work_for_cpu_fn
Call Trace:
rcu-torture: rcu_torture_read_exit: Start of episode
<TASK>
dump_stack_lvl+0x4a/0x80
check_noncircular+0x132/0x150
__lock_acquire+0x1538/0x2500
lock_acquire+0xbf/0x2a0
? _cpu_down+0x57/0x2b0
percpu_down_write+0x31/0x200
? _cpu_down+0x57/0x2b0
_cpu_down+0x57/0x2b0
__cpu_down_maps_locked+0x10/0x20
work_for_cpu_fn+0x15/0x20
process_scheduled_works+0x2a7/0x500
worker_thread+0x173/0x330
? __pfx_worker_thread+0x10/0x10
kthread+0xe6/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2f/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
Fix this by providing one lock class key per work_on_cpu() caller.
Reported-and-tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-09-24 15:07:02 +00:00
|
|
|
#define INIT_WORK_ONSTACK_KEY(_work, _func, _key) \
|
|
|
|
__INIT_WORK_KEY((_work), (_func), 1, _key)
|
|
|
|
|
2012-08-21 20:18:23 +00:00
|
|
|
#define __INIT_DELAYED_WORK(_work, _func, _tflags) \
|
2012-08-21 20:18:23 +00:00
|
|
|
do { \
|
|
|
|
INIT_WORK(&(_work)->work, (_func)); \
|
2017-10-23 01:48:43 +00:00
|
|
|
__init_timer(&(_work)->timer, \
|
|
|
|
delayed_work_timer_fn, \
|
|
|
|
(_tflags) | TIMER_IRQSAFE); \
|
2006-11-22 14:54:01 +00:00
|
|
|
} while (0)
|
|
|
|
|
2012-08-21 20:18:23 +00:00
|
|
|
#define __INIT_DELAYED_WORK_ONSTACK(_work, _func, _tflags) \
|
2012-08-21 20:18:23 +00:00
|
|
|
do { \
|
|
|
|
INIT_WORK_ONSTACK(&(_work)->work, (_func)); \
|
2017-10-23 01:48:43 +00:00
|
|
|
__init_timer_on_stack(&(_work)->timer, \
|
|
|
|
delayed_work_timer_fn, \
|
|
|
|
(_tflags) | TIMER_IRQSAFE); \
|
2009-01-12 11:52:23 +00:00
|
|
|
} while (0)
|
|
|
|
|
2012-08-21 20:18:23 +00:00
|
|
|
#define INIT_DELAYED_WORK(_work, _func) \
|
|
|
|
__INIT_DELAYED_WORK(_work, _func, 0)
|
|
|
|
|
|
|
|
#define INIT_DELAYED_WORK_ONSTACK(_work, _func) \
|
|
|
|
__INIT_DELAYED_WORK_ONSTACK(_work, _func, 0)
|
|
|
|
|
2012-08-21 20:18:23 +00:00
|
|
|
#define INIT_DEFERRABLE_WORK(_work, _func) \
|
2012-08-21 20:18:23 +00:00
|
|
|
__INIT_DELAYED_WORK(_work, _func, TIMER_DEFERRABLE)
|
|
|
|
|
|
|
|
#define INIT_DEFERRABLE_WORK_ONSTACK(_work, _func) \
|
|
|
|
__INIT_DELAYED_WORK_ONSTACK(_work, _func, TIMER_DEFERRABLE)
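A minimal sketch of the delayed variants, assuming a hypothetical foo_poll item: INIT_DELAYED_WORK() pairs with queue_delayed_work(), and INIT_DEFERRABLE_WORK() would be chosen instead when the timer is allowed to slip while the CPU is idle:

#include <linux/workqueue.h>
#include <linux/jiffies.h>

static struct delayed_work foo_poll;		/* hypothetical periodic poller */

static void foo_poll_fn(struct work_struct *work)
{
	/* to_delayed_work() recovers the delayed_work so we can re-arm ourselves */
	queue_delayed_work(system_wq, to_delayed_work(work),
			   msecs_to_jiffies(1000));
}

static void foo_start_polling(void)
{
	INIT_DELAYED_WORK(&foo_poll, foo_poll_fn);
	queue_delayed_work(system_wq, &foo_poll, msecs_to_jiffies(1000));
}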
|
2007-05-08 07:27:47 +00:00
|
|
|
|
2018-03-14 19:45:13 +00:00
|
|
|
#define INIT_RCU_WORK(_work, _func) \
|
|
|
|
INIT_WORK(&(_work)->work, (_func))
|
|
|
|
|
|
|
|
#define INIT_RCU_WORK_ONSTACK(_work, _func) \
|
|
|
|
INIT_WORK_ONSTACK(&(_work)->work, (_func))
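An rcu_work item initialized here is queued with queue_rcu_work() (declared further below) and its handler only runs once an RCU grace period has elapsed, which makes it a convenient deferred-free mechanism. A hedged sketch with a hypothetical object type:

#include <linux/workqueue.h>
#include <linux/slab.h>

struct foo_obj {
	struct rcu_work free_rwork;		/* hypothetical deferred free */
};

static void foo_free_fn(struct work_struct *work)
{
	struct rcu_work *rwork = to_rcu_work(work);
	struct foo_obj *obj = container_of(rwork, struct foo_obj, free_rwork);

	kfree(obj);				/* safe: a grace period has passed */
}

static void foo_release(struct foo_obj *obj)
{
	INIT_RCU_WORK(&obj->free_rwork, foo_free_fn);
	queue_rcu_work(system_wq, &obj->free_rwork);
}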
|
|
|
|
|
2006-11-22 14:54:49 +00:00
|
|
|
/**
|
|
|
|
* work_pending - Find out whether a work item is currently pending
|
|
|
|
* @work: The work item in question
|
|
|
|
*/
|
|
|
|
#define work_pending(work) \
|
2010-06-29 08:07:10 +00:00
|
|
|
test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))
|
2006-11-22 14:54:49 +00:00
|
|
|
|
|
|
|
/**
|
|
|
|
* delayed_work_pending - Find out whether a delayable work item is currently
|
|
|
|
* pending
|
2015-08-13 23:52:02 +00:00
|
|
|
* @w: The work item in question
|
2006-11-22 14:54:49 +00:00
|
|
|
*/
|
2006-12-15 22:13:51 +00:00
|
|
|
#define delayed_work_pending(w) \
|
|
|
|
work_pending(&(w)->work)
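These helpers only report the PENDING bit at the instant of the call; the queueing functions already perform the same test atomically, so the check below (reusing the hypothetical foo_poll item from the earlier sketch) is purely illustrative:

	/* Re-arm the poller only if it is not already queued. */
	if (!delayed_work_pending(&foo_poll))
		queue_delayed_work(system_wq, &foo_poll, msecs_to_jiffies(1000));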
|
2006-11-22 14:54:49 +00:00
|
|
|
|
2010-09-10 14:51:36 +00:00
|
|
|
/*
|
|
|
|
* Workqueue flags and constants. For details, please refer to
|
2016-10-28 08:14:09 +00:00
|
|
|
* Documentation/core-api/workqueue.rst.
|
2010-09-10 14:51:36 +00:00
|
|
|
*/
|
2024-01-26 21:55:50 +00:00
|
|
|
enum wq_flags {
|
2024-02-04 21:28:06 +00:00
|
|
|
WQ_BH = 1 << 0, /* execute in bottom half (softirq) context */
|
2010-07-02 08:03:51 +00:00
|
|
|
WQ_UNBOUND = 1 << 1, /* not bound to any cpu */
|
2011-02-16 08:25:31 +00:00
|
|
|
WQ_FREEZABLE = 1 << 2, /* freeze during suspend */
|
2010-10-11 13:12:27 +00:00
|
|
|
WQ_MEM_RECLAIM = 1 << 3, /* may be used for memory reclaim */
|
2010-06-29 08:07:14 +00:00
|
|
|
WQ_HIGHPRI = 1 << 4, /* high priority */
|
2014-03-24 20:37:02 +00:00
|
|
|
WQ_CPU_INTENSIVE = 1 << 5, /* cpu intensive workqueue */
|
2021-01-18 08:04:55 +00:00
|
|
|
WQ_SYSFS = 1 << 6, /* visible in sysfs, see workqueue_sysfs_register() */
|
2010-06-29 08:07:14 +00:00
|
|
|
|
2013-04-08 11:15:40 +00:00
|
|
|
/*
|
|
|
|
* Per-cpu workqueues are generally preferred because they tend to
|
|
|
|
* show better performance thanks to cache locality. Per-cpu
|
|
|
|
* workqueues exclude the scheduler from choosing the CPU to
|
|
|
|
* execute the worker threads, which has an unfortunate side effect
|
|
|
|
* of increasing power consumption.
|
|
|
|
*
|
|
|
|
* The scheduler considers a CPU idle if it doesn't have any task
|
|
|
|
* to execute and tries to keep idle cores idle to conserve power;
|
|
|
|
* however, for example, a per-cpu work item scheduled from an
|
|
|
|
* interrupt handler on an idle CPU will force the scheduler to
|
2021-07-31 00:01:29 +00:00
|
|
|
* execute the work item on that CPU breaking the idleness, which in
|
2013-04-08 11:15:40 +00:00
|
|
|
* turn may lead to more scheduling choices which are sub-optimal
|
|
|
|
* in terms of power consumption.
|
|
|
|
*
|
|
|
|
* Workqueues marked with WQ_POWER_EFFICIENT are per-cpu by default
|
|
|
|
* but become unbound if workqueue.power_efficient kernel param is
|
|
|
|
* specified. Per-cpu workqueues which are identified to
|
|
|
|
 * contribute significantly to power-consumption are
|
|
|
|
* marked with this flag and enabling the power_efficient mode
|
|
|
|
* leads to noticeable power saving at the cost of small
|
|
|
|
* performance disadvantage.
|
|
|
|
*
|
|
|
|
* http://thread.gmane.org/gmane.linux.kernel/1480396
|
|
|
|
*/
|
|
|
|
WQ_POWER_EFFICIENT = 1 << 7,
|
|
|
|
|
2022-12-13 04:39:36 +00:00
|
|
|
__WQ_DESTROYING = 1 << 15, /* internal: workqueue is destroying */
|
2013-03-12 18:30:04 +00:00
|
|
|
__WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */
|
2013-03-12 18:30:04 +00:00
|
|
|
__WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */
|
2016-01-29 10:59:46 +00:00
|
|
|
__WQ_LEGACY = 1 << 18, /* internal: create*_workqueue() */
|
2024-02-04 21:28:06 +00:00
|
|
|
|
|
|
|
/* BH wq only allows the following flags */
|
|
|
|
__WQ_BH_ALLOWS = WQ_BH | WQ_HIGHPRI,
|
2024-01-26 21:55:50 +00:00
|
|
|
};
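As a hedged illustration of how these flags compose (the workqueue name and flag choice are hypothetical), a subsystem that writes back data under memory pressure might ask for an unbound, freezable, reclaim-capable workqueue:

static struct workqueue_struct *foo_wq;

static int foo_init(void)
{
	/* unbound + freezable + usable from the memory-reclaim path;
	 * max_active of 0 selects the default (WQ_DFL_ACTIVE below). */
	foo_wq = alloc_workqueue("foo_writeback",
				 WQ_UNBOUND | WQ_FREEZABLE | WQ_MEM_RECLAIM, 0);
	return foo_wq ? 0 : -ENOMEM;
}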
|
2010-08-24 12:22:47 +00:00
|
|
|
|
2024-01-26 21:55:50 +00:00
|
|
|
enum wq_consts {
|
2010-06-29 08:07:14 +00:00
|
|
|
WQ_MAX_ACTIVE = 512, /* I like 512, better ideas? */
|
workqueue: Make unbound workqueues to use per-cpu pool_workqueues
A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.
As a per-cpu workqueue should be associated with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:
* Because @max_active is per-pwq, the meaning of @max_active changes
depending on the machine configuration and whether workqueue NUMA locality
support is enabled.
* Makes per-cpu and unbound code deviate.
* Gets in the way of making workqueue CPU locality awareness more flexible.
This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:
* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
workqueues.
* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
the specified pwq to the target CPU's wq->cpu_pwq.
* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
This makes the return value of wq_calc_node_cpumask() unnecessary. It now
returns void.
* @max_active now means the same thing for both per-cpu and unbound
workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
used in workqueue implementation and will be removed later.
* All unbound pwq operations which used to be per-numa-node are now per-cpu.
For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a small bit faster, flush_workqueue()
would become a bit more expensive, and the total concurrency limit would
likely become higher. All @max_active==1 use cases are currently being
audited for conversion into alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion are complete.
One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
2023-08-08 01:57:23 +00:00
|
|
|
WQ_UNBOUND_MAX_ACTIVE = WQ_MAX_ACTIVE,
|
2010-06-29 08:07:14 +00:00
|
|
|
WQ_DFL_ACTIVE = WQ_MAX_ACTIVE / 2,
|
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate their concurrency level, which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time, i.e. the acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement decouples from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice as many
online effective CPUs will get twice the share of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to a higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 18:11:25 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Per-node default cap on min_active. Unless explicitly set, min_active
|
|
|
|
* is set to min(max_active, WQ_DFL_MIN_ACTIVE). For more details, see
|
|
|
|
* workqueue_struct->min_active definition.
|
|
|
|
*/
|
|
|
|
WQ_DFL_MIN_ACTIVE = 8,
|
2010-06-29 08:07:10 +00:00
|
|
|
};
|
2006-11-22 14:54:01 +00:00
|
|
|
|
2010-06-29 08:07:14 +00:00
|
|
|
/*
|
|
|
|
* System-wide workqueues which are always present.
|
|
|
|
*
|
|
|
|
* system_wq is the one used by schedule[_delayed]_work[_on]().
|
|
|
|
* Multi-CPU multi-threaded. There are users which expect relatively
|
|
|
|
* short queue flush time. Don't queue works which can run for too
|
|
|
|
* long.
|
|
|
|
*
|
2014-05-22 08:42:41 +00:00
|
|
|
* system_highpri_wq is similar to system_wq but for work items which
|
|
|
|
* require WQ_HIGHPRI.
|
|
|
|
*
|
2010-06-29 08:07:14 +00:00
|
|
|
* system_long_wq is similar to system_wq but may host long running
|
|
|
|
* works. Queue flushing might take relatively long.
|
|
|
|
*
|
2010-07-02 08:03:51 +00:00
|
|
|
 * system_unbound_wq is an unbound workqueue. Workers are not bound to
|
|
|
|
* any specific CPU, not concurrency managed, and all queued works are
|
|
|
|
* executed immediately as long as max_active limit is not reached and
|
|
|
|
* resources are available.
|
2011-02-08 09:39:03 +00:00
|
|
|
*
|
2011-02-21 08:52:50 +00:00
|
|
|
* system_freezable_wq is equivalent to system_wq except that it's
|
|
|
|
* freezable.
|
2013-04-24 11:42:54 +00:00
|
|
|
*
|
|
|
|
* *_power_efficient_wq are inclined towards saving power and converted
|
|
|
|
* into WQ_UNBOUND variants if 'wq_power_efficient' is enabled; otherwise,
|
|
|
|
 * they are the same as their non-power-efficient counterparts - e.g.
|
|
|
|
* system_power_efficient_wq is identical to system_wq if
|
|
|
|
* 'wq_power_efficient' is disabled. See WQ_POWER_EFFICIENT for more info.
|
2024-02-04 21:28:06 +00:00
|
|
|
*
|
|
|
|
 * system_bh[_highpri]_wq are convenience interfaces to softirq. BH work items
|
|
|
|
* are executed in the queueing CPU's BH context in the queueing order.
|
2010-06-29 08:07:14 +00:00
|
|
|
*/
|
|
|
|
extern struct workqueue_struct *system_wq;
|
2014-05-22 08:42:41 +00:00
|
|
|
extern struct workqueue_struct *system_highpri_wq;
|
2010-06-29 08:07:14 +00:00
|
|
|
extern struct workqueue_struct *system_long_wq;
|
2010-07-02 08:03:51 +00:00
|
|
|
extern struct workqueue_struct *system_unbound_wq;
|
2011-02-21 08:52:50 +00:00
|
|
|
extern struct workqueue_struct *system_freezable_wq;
|
2013-04-24 11:42:54 +00:00
|
|
|
extern struct workqueue_struct *system_power_efficient_wq;
|
|
|
|
extern struct workqueue_struct *system_freezable_power_efficient_wq;
|
2024-02-04 21:28:06 +00:00
|
|
|
extern struct workqueue_struct *system_bh_wq;
|
|
|
|
extern struct workqueue_struct *system_bh_highpri_wq;
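A hedged dispatch sketch (the three work items are hypothetical and assumed to be initialized elsewhere with INIT_WORK()): short CPU-local items go to system_wq, long-running or CPU-agnostic items to system_unbound_wq, and, on kernels that provide it, softirq-context items to system_bh_wq:

static struct work_struct foo_short_work, foo_slow_work, foo_bh_work;

static void foo_dispatch(void)
{
	queue_work(system_wq, &foo_short_work);		/* short, per-cpu */
	queue_work(system_unbound_wq, &foo_slow_work);	/* may run long, any CPU */
	queue_work(system_bh_wq, &foo_bh_work);		/* executes in BH context */
}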
|
|
|
|
|
|
|
|
void workqueue_softirq_action(bool highpri);
|
2024-02-27 01:38:55 +00:00
|
|
|
void workqueue_softirq_dead(unsigned int cpu);
|
2012-08-20 21:51:23 +00:00
|
|
|
|
2012-01-10 23:11:35 +00:00
|
|
|
/**
|
|
|
|
* alloc_workqueue - allocate a workqueue
|
|
|
|
* @fmt: printf format for the name of the workqueue
|
|
|
|
* @flags: WQ_* flags
|
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
2024-01-29 18:11:25 +00:00
|
|
|
* @max_active: max in-flight work items, 0 for default
|
2019-02-14 23:00:54 +00:00
|
|
|
* remaining args: args for @fmt
|
2012-01-10 23:11:35 +00:00
|
|
|
*
|
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
2024-01-29 18:11:25 +00:00
|
|
|
* For a per-cpu workqueue, @max_active limits the number of in-flight work
|
|
|
|
* items for each CPU. e.g. @max_active of 1 indicates that each CPU can be
|
|
|
|
* executing at most one work item for the workqueue.
|
|
|
|
*
|
|
|
|
* For unbound workqueues, @max_active limits the number of in-flight work items
|
|
|
|
 * for the whole system. e.g. @max_active of 16 indicates that there can be
|
|
|
|
* at most 16 work items executing for the workqueue in the whole system.
|
|
|
|
*
|
|
|
|
* As sharing the same active counter for an unbound workqueue across multiple
|
|
|
|
* NUMA nodes can be expensive, @max_active is distributed to each NUMA node
|
|
|
|
* according to the proportion of the number of online CPUs and enforced
|
|
|
|
* independently.
|
|
|
|
*
|
|
|
|
* Depending on online CPU distribution, a node may end up with per-node
|
|
|
|
* max_active which is significantly lower than @max_active, which can lead to
|
|
|
|
* deadlocks if the per-node concurrency limit is lower than the maximum number
|
|
|
|
* of interdependent work items for the workqueue.
|
|
|
|
*
|
|
|
|
* To guarantee forward progress regardless of online CPU distribution, the
|
|
|
|
* concurrency limit on every node is guaranteed to be equal to or greater than
|
|
|
|
* min_active which is set to min(@max_active, %WQ_DFL_MIN_ACTIVE). This means
|
|
|
|
* that the sum of per-node max_active's may be larger than @max_active.
|
|
|
|
*
|
|
|
|
* For detailed information on %WQ_* flags, please refer to
|
2016-10-28 08:14:09 +00:00
|
|
|
* Documentation/core-api/workqueue.rst.
|
2012-01-10 23:11:35 +00:00
|
|
|
*
|
|
|
|
* RETURNS:
|
|
|
|
* Pointer to the allocated workqueue on success, %NULL on failure.
|
|
|
|
*/
|
2021-09-13 10:02:56 +00:00
|
|
|
__printf(1, 4) struct workqueue_struct *
|
|
|
|
alloc_workqueue(const char *fmt, unsigned int flags, int max_active, ...);
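A hedged allocate/teardown sketch (names hypothetical): the name is a printf format, so per-instance workqueues can be labelled, and destroy_workqueue() drains any remaining work items before freeing:

static struct workqueue_struct *foo_wq;

static int foo_probe(int id)
{
	foo_wq = alloc_workqueue("foo/%d", WQ_MEM_RECLAIM | WQ_HIGHPRI, 0, id);
	if (!foo_wq)
		return -ENOMEM;
	return 0;
}

static void foo_remove(void)
{
	destroy_workqueue(foo_wq);	/* drains and frees the workqueue */
}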
|
2007-10-19 06:39:55 +00:00
|
|
|
|
2010-09-16 08:17:35 +00:00
|
|
|
/**
|
|
|
|
* alloc_ordered_workqueue - allocate an ordered workqueue
|
2012-01-10 23:11:35 +00:00
|
|
|
* @fmt: printf format for the name of the workqueue
|
2011-02-16 08:25:31 +00:00
|
|
|
* @flags: WQ_* flags (only WQ_FREEZABLE and WQ_MEM_RECLAIM are meaningful)
|
2022-06-09 23:41:10 +00:00
|
|
|
* @args: args for @fmt
|
2010-09-16 08:17:35 +00:00
|
|
|
*
|
|
|
|
* Allocate an ordered workqueue. An ordered workqueue executes at
|
|
|
|
* most one work item at any given time in the queued order. They are
|
|
|
|
* implemented as unbound workqueues with @max_active of one.
|
|
|
|
*
|
|
|
|
* RETURNS:
|
|
|
|
* Pointer to the allocated workqueue on success, %NULL on failure.
|
|
|
|
*/
|
2012-08-21 20:18:23 +00:00
|
|
|
#define alloc_ordered_workqueue(fmt, flags, args...) \
|
2024-02-06 00:19:10 +00:00
|
|
|
alloc_workqueue(fmt, WQ_UNBOUND | __WQ_ORDERED | (flags), 1, ##args)
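A hedged example (names hypothetical): an ordered workqueue suits event handlers or state machines that must never see two items processed concurrently or out of order:

static struct workqueue_struct *foo_event_wq;

static int foo_events_init(void)
{
	/* at most one handler runs at a time, strictly in queueing order */
	foo_event_wq = alloc_ordered_workqueue("foo_events", WQ_MEM_RECLAIM);
	return foo_event_wq ? 0 : -ENOMEM;
}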
|
2010-09-16 08:17:35 +00:00
|
|
|
|
2012-08-21 20:18:23 +00:00
|
|
|
#define create_workqueue(name) \
|
2016-01-29 10:59:46 +00:00
|
|
|
alloc_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, 1, (name))
|
2012-08-21 20:18:23 +00:00
|
|
|
#define create_freezable_workqueue(name) \
|
2016-01-29 10:59:46 +00:00
|
|
|
alloc_workqueue("%s", __WQ_LEGACY | WQ_FREEZABLE | WQ_UNBOUND | \
|
|
|
|
WQ_MEM_RECLAIM, 1, (name))
|
2012-08-21 20:18:23 +00:00
|
|
|
#define create_singlethread_workqueue(name) \
|
2016-01-29 10:59:46 +00:00
|
|
|
alloc_ordered_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, name)
|
2005-04-16 22:20:36 +00:00
|
|
|
|
workqueue: Introduce from_work() helper for cleaner callback declarations
To streamline the transition from tasklets to workqueues, a new helper
function, from_work(), is introduced. This helper, inspired by existing
from_() patterns, utilizes container_of() and eliminates the redundancy
of declaring variable types, leading to more concise and readable code.
The modified code snippet demonstrates the enhanced clarity achieved
with from_work():
void callback(struct work_struct *w)
{
- struct some_data_structure *local = container_of(w,
struct some_data_structure,
work);
+ struct some_data_structure *local = from_work(local, w, work);
This change aims to facilitate a smoother transition and uphold code
quality standards.
Based on:
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git disable_work-v3
Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-27 19:10:37 +00:00
|
|
|
#define from_work(var, callback_work, work_fieldname) \
|
|
|
|
container_of(callback_work, typeof(*var), work_fieldname)
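Following the commit message above, a hedged sketch of a callback written with from_work() instead of an open-coded container_of() (struct and field names are hypothetical):

struct foo_dev {
	struct work_struct irq_work;	/* hypothetical deferred IRQ handling */
	u32 pending;
};

static void foo_irq_work_fn(struct work_struct *work)
{
	/* expands to container_of(work, typeof(*fd), irq_work) */
	struct foo_dev *fd = from_work(fd, work, irq_work);

	fd->pending = 0;
}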
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
extern void destroy_workqueue(struct workqueue_struct *wq);
|
|
|
|
|
2019-09-06 01:40:22 +00:00
|
|
|
struct workqueue_attrs *alloc_workqueue_attrs(void);
|
|
|
|
void free_workqueue_attrs(struct workqueue_attrs *attrs);
|
|
|
|
int apply_workqueue_attrs(struct workqueue_struct *wq,
|
|
|
|
const struct workqueue_attrs *attrs);
|
2023-10-25 18:25:52 +00:00
|
|
|
extern int workqueue_unbound_exclude_cpumask(cpumask_var_t cpumask);
|
2013-03-12 18:30:00 +00:00
|
|
|
|
2012-08-03 17:30:44 +00:00
|
|
|
extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
|
2008-07-24 04:28:39 +00:00
|
|
|
struct work_struct *work);
|
2019-01-22 18:39:26 +00:00
|
|
|
extern bool queue_work_node(int node, struct workqueue_struct *wq,
|
|
|
|
struct work_struct *work);
|
2012-08-03 17:30:44 +00:00
|
|
|
extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
|
2007-05-09 09:34:22 +00:00
|
|
|
struct delayed_work *work, unsigned long delay);
|
2012-08-03 17:30:47 +00:00
|
|
|
extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
|
|
|
|
struct delayed_work *dwork, unsigned long delay);
|
2018-03-14 19:45:13 +00:00
|
|
|
extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork);
|
2007-05-09 09:34:22 +00:00
|
|
|
|
2022-06-01 07:32:47 +00:00
|
|
|
extern void __flush_workqueue(struct workqueue_struct *wq);
|
2011-04-05 16:01:44 +00:00
|
|
|
extern void drain_workqueue(struct workqueue_struct *wq);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-11-22 14:55:48 +00:00
|
|
|
extern int schedule_on_each_cpu(work_func_t func);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-11-22 14:55:48 +00:00
|
|
|
int execute_in_process_context(work_func_t fn, struct execute_work *);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2010-09-16 08:36:00 +00:00
|
|
|
extern bool flush_work(struct work_struct *work);
|
2022-05-19 13:47:28 +00:00
|
|
|
extern bool cancel_work(struct work_struct *work);
|
2010-09-16 08:36:00 +00:00
|
|
|
extern bool cancel_work_sync(struct work_struct *work);
|
|
|
|
|
|
|
|
extern bool flush_delayed_work(struct delayed_work *dwork);
|
2012-08-21 20:18:24 +00:00
|
|
|
extern bool cancel_delayed_work(struct delayed_work *dwork);
|
2010-09-16 08:36:00 +00:00
|
|
|
extern bool cancel_delayed_work_sync(struct delayed_work *dwork);
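A hedged teardown sketch (struct and fields hypothetical): the _sync variants wait for a running callback to finish, so they must not be called while holding a lock that the callback itself takes:

struct foo_dev {
	struct delayed_work poll_dwork;
	struct work_struct reset_work;
};

static void foo_shutdown(struct foo_dev *fd)
{
	cancel_delayed_work_sync(&fd->poll_dwork);	/* cancel, wait if running */
	cancel_work_sync(&fd->reset_work);		/* likewise for plain work */
}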
|
2007-05-09 09:34:22 +00:00
|
|
|
|
2018-03-14 19:45:13 +00:00
|
|
|
extern bool flush_rcu_work(struct rcu_work *rwork);
|
|
|
|
|
2010-06-29 08:07:14 +00:00
|
|
|
extern void workqueue_set_max_active(struct workqueue_struct *wq,
|
|
|
|
int max_active);
|
2024-02-09 00:11:56 +00:00
|
|
|
extern void workqueue_set_min_active(struct workqueue_struct *wq,
|
|
|
|
int min_active);
|
2018-02-11 09:38:28 +00:00
|
|
|
extern struct work_struct *current_work(void);
|
2013-03-13 00:41:37 +00:00
|
|
|
extern bool current_is_workqueue_rescuer(void);
|
2013-03-12 18:29:59 +00:00
|
|
|
extern bool workqueue_congested(int cpu, struct workqueue_struct *wq);
|
2010-06-29 08:07:14 +00:00
|
|
|
extern unsigned int work_busy(struct work_struct *work);
|
2013-04-30 22:27:22 +00:00
|
|
|
extern __printf(1, 2) void set_worker_desc(const char *fmt, ...);
|
|
|
|
extern void print_worker_info(const char *log_lvl, struct task_struct *task);
|
2021-10-20 03:09:00 +00:00
|
|
|
extern void show_all_workqueues(void);
|
2023-03-20 03:29:05 +00:00
|
|
|
extern void show_freezable_workqueues(void);
|
2021-10-20 03:09:00 +00:00
|
|
|
extern void show_one_workqueue(struct workqueue_struct *wq);
|
2018-05-18 15:47:13 +00:00
|
|
|
extern void wq_worker_comm(char *buf, size_t size, struct task_struct *task);
|
2010-06-29 08:07:14 +00:00
|
|
|
|
2013-03-13 23:51:36 +00:00
|
|
|
/**
|
|
|
|
* queue_work - queue work on a workqueue
|
|
|
|
* @wq: workqueue to use
|
|
|
|
* @work: work to queue
|
|
|
|
*
|
|
|
|
* Returns %false if @work was already on a queue, %true otherwise.
|
|
|
|
*
|
|
|
|
* We queue the work to the CPU on which it was submitted, but if the CPU dies
|
|
|
|
* it can be processed by another CPU.
|
2020-01-22 18:39:52 +00:00
|
|
|
*
|
|
|
|
* Memory-ordering properties: If it returns %true, guarantees that all stores
|
|
|
|
* preceding the call to queue_work() in the program order will be visible from
|
|
|
|
* the CPU which will execute @work by the time such work executes, e.g.,
|
|
|
|
*
|
|
|
|
* { x is initially 0 }
|
|
|
|
*
|
|
|
|
* CPU0 CPU1
|
|
|
|
*
|
|
|
|
* WRITE_ONCE(x, 1); [ @work is being executed ]
|
|
|
|
* r0 = queue_work(wq, work); r1 = READ_ONCE(x);
|
|
|
|
*
|
|
|
|
* Forbids: r0 == true && r1 == 0
|
2013-03-13 23:51:36 +00:00
|
|
|
*/
|
|
|
|
static inline bool queue_work(struct workqueue_struct *wq,
|
|
|
|
struct work_struct *work)
|
|
|
|
{
|
|
|
|
return queue_work_on(WORK_CPU_UNBOUND, wq, work);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* queue_delayed_work - queue work on a workqueue after delay
|
|
|
|
* @wq: workqueue to use
|
|
|
|
* @dwork: delayable work to queue
|
|
|
|
* @delay: number of jiffies to wait before queueing
|
|
|
|
*
|
|
|
|
* Equivalent to queue_delayed_work_on() but tries to use the local CPU.
|
|
|
|
*/
|
|
|
|
static inline bool queue_delayed_work(struct workqueue_struct *wq,
|
|
|
|
struct delayed_work *dwork,
|
|
|
|
unsigned long delay)
|
|
|
|
{
|
|
|
|
return queue_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* mod_delayed_work - modify delay of or queue a delayed work
|
|
|
|
* @wq: workqueue to use
|
|
|
|
* @dwork: work to queue
|
|
|
|
* @delay: number of jiffies to wait before queueing
|
|
|
|
*
|
|
|
|
* mod_delayed_work_on() on local CPU.
|
|
|
|
*/
|
|
|
|
static inline bool mod_delayed_work(struct workqueue_struct *wq,
|
|
|
|
struct delayed_work *dwork,
|
|
|
|
unsigned long delay)
|
|
|
|
{
|
|
|
|
return mod_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* schedule_work_on - put work task on a specific cpu
|
|
|
|
* @cpu: cpu to put the work task on
|
|
|
|
* @work: job to be done
|
|
|
|
*
|
|
|
|
 * This puts a job on a specific CPU.
|
|
|
|
*/
|
|
|
|
static inline bool schedule_work_on(int cpu, struct work_struct *work)
|
|
|
|
{
|
|
|
|
return queue_work_on(cpu, system_wq, work);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* schedule_work - put work task in global workqueue
|
|
|
|
* @work: job to be done
|
|
|
|
*
|
|
|
|
* Returns %false if @work was already on the kernel-global workqueue and
|
|
|
|
* %true otherwise.
|
|
|
|
*
|
|
|
|
* This puts a job in the kernel-global workqueue if it was not already
|
|
|
|
* queued and leaves it in the same position on the kernel-global
|
|
|
|
* workqueue otherwise.
|
2020-01-22 18:39:52 +00:00
|
|
|
*
|
|
|
|
* Shares the same memory-ordering properties of queue_work(), cf. the
|
|
|
|
* DocBook header of queue_work().
|
2013-03-13 23:51:36 +00:00
|
|
|
*/
|
|
|
|
static inline bool schedule_work(struct work_struct *work)
|
|
|
|
{
|
|
|
|
return queue_work(system_wq, work);
|
|
|
|
}
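A hedged example of the common defer-from-atomic-context pattern (struct, handler and IRQ wiring are hypothetical): schedule_work() may be called from an interrupt handler, and the bound handler then runs later in process context on system_wq:

#include <linux/interrupt.h>
#include <linux/workqueue.h>

struct foo_dev {
	struct work_struct irq_work;	/* assumed initialized with INIT_WORK() */
	int pending;
};

static irqreturn_t foo_irq(int irq, void *data)
{
	struct foo_dev *fd = data;

	fd->pending = 1;
	schedule_work(&fd->irq_work);	/* IRQ-safe; handler runs later */
	return IRQ_HANDLED;
}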
|
|
|
|
|
2022-06-01 07:32:47 +00:00
|
|
|
/*
|
|
|
|
* Detect attempt to flush system-wide workqueues at compile time when possible.
|
2023-06-30 12:28:53 +00:00
|
|
|
* Warn attempt to flush system-wide workqueues at runtime.
|
2022-06-01 07:32:47 +00:00
|
|
|
*
|
|
|
|
* See https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp
|
|
|
|
* for reasons and steps for converting system-wide workqueues into local workqueues.
|
|
|
|
*/
|
|
|
|
extern void __warn_flushing_systemwide_wq(void)
|
|
|
|
__compiletime_warning("Please avoid flushing system-wide workqueues.");
|
|
|
|
|
2023-06-30 12:28:53 +00:00
|
|
|
/* Please stop using this function, for this function will be removed in near future. */
|
2022-06-01 07:32:47 +00:00
|
|
|
#define flush_scheduled_work() \
|
|
|
|
({ \
|
2023-06-30 12:28:53 +00:00
|
|
|
__warn_flushing_systemwide_wq(); \
|
2022-06-01 07:32:47 +00:00
|
|
|
__flush_workqueue(system_wq); \
|
|
|
|
})
|
|
|
|
|
|
|
|
#define flush_workqueue(wq) \
|
|
|
|
({ \
|
|
|
|
struct workqueue_struct *_wq = (wq); \
|
|
|
|
\
|
|
|
|
if ((__builtin_constant_p(_wq == system_wq) && \
|
|
|
|
_wq == system_wq) || \
|
|
|
|
(__builtin_constant_p(_wq == system_highpri_wq) && \
|
|
|
|
_wq == system_highpri_wq) || \
|
|
|
|
(__builtin_constant_p(_wq == system_long_wq) && \
|
|
|
|
_wq == system_long_wq) || \
|
|
|
|
(__builtin_constant_p(_wq == system_unbound_wq) && \
|
|
|
|
_wq == system_unbound_wq) || \
|
|
|
|
(__builtin_constant_p(_wq == system_freezable_wq) && \
|
|
|
|
_wq == system_freezable_wq) || \
|
|
|
|
(__builtin_constant_p(_wq == system_power_efficient_wq) && \
|
|
|
|
_wq == system_power_efficient_wq) || \
|
|
|
|
(__builtin_constant_p(_wq == system_freezable_power_efficient_wq) && \
|
|
|
|
_wq == system_freezable_power_efficient_wq)) \
|
|
|
|
__warn_flushing_systemwide_wq(); \
|
|
|
|
__flush_workqueue(_wq); \
|
|
|
|
})
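In line with the deprecation warning above, a hedged conversion sketch (names hypothetical): give the subsystem its own workqueue and flush only that, instead of flushing every user of system_wq via flush_scheduled_work():

static struct workqueue_struct *foo_wq;	/* assumed allocated in foo_init() */

static void foo_quiesce(void)
{
	flush_workqueue(foo_wq);	/* waits only for foo's own work items */
}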
|
2015-05-20 06:41:19 +00:00
|
|
|
|
2013-03-13 23:51:36 +00:00
|
|
|
/**
|
|
|
|
* schedule_delayed_work_on - queue work in global workqueue on CPU after delay
|
|
|
|
* @cpu: cpu to use
|
|
|
|
* @dwork: job to be done
|
|
|
|
* @delay: number of jiffies to wait
|
|
|
|
*
|
|
|
|
* After waiting for a given time this puts a job in the kernel-global
|
|
|
|
* workqueue on the specified CPU.
|
|
|
|
*/
|
|
|
|
static inline bool schedule_delayed_work_on(int cpu, struct delayed_work *dwork,
|
|
|
|
unsigned long delay)
|
|
|
|
{
|
|
|
|
return queue_delayed_work_on(cpu, system_wq, dwork, delay);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* schedule_delayed_work - put work task in global workqueue after delay
|
|
|
|
* @dwork: job to be done
|
|
|
|
* @delay: number of jiffies to wait or 0 for immediate execution
|
|
|
|
*
|
|
|
|
* After waiting for a given time this puts a job in the kernel-global
|
|
|
|
* workqueue.
|
|
|
|
*/
|
|
|
|
static inline bool schedule_delayed_work(struct delayed_work *dwork,
|
|
|
|
unsigned long delay)
|
|
|
|
{
|
|
|
|
return queue_delayed_work(system_wq, dwork, delay);
|
|
|
|
}
|
|
|
|
|
2008-11-05 02:39:10 +00:00
|
|
|
#ifndef CONFIG_SMP
|
2013-03-12 18:29:59 +00:00
|
|
|
static inline long work_on_cpu(int cpu, long (*fn)(void *), void *arg)
|
2008-11-05 02:39:10 +00:00
|
|
|
{
|
|
|
|
return fn(arg);
|
|
|
|
}
|
2017-04-12 20:07:28 +00:00
|
|
|
static inline long work_on_cpu_safe(int cpu, long (*fn)(void *), void *arg)
|
|
|
|
{
|
|
|
|
return fn(arg);
|
|
|
|
}
|
2008-11-05 02:39:10 +00:00
|
|
|
#else
|
workqueue: Provide one lock class key per work_on_cpu() callsite
2023-09-24 15:07:02 +00:00
|
|
|
long work_on_cpu_key(int cpu, long (*fn)(void *),
|
|
|
|
void *arg, struct lock_class_key *key);
|
|
|
|
/*
|
|
|
|
* A new key is defined for each caller to make sure the work
|
|
|
|
* associated with the function doesn't share its locking class.
|
|
|
|
*/
|
|
|
|
#define work_on_cpu(_cpu, _fn, _arg) \
|
|
|
|
({ \
|
|
|
|
static struct lock_class_key __key; \
|
|
|
|
\
|
|
|
|
work_on_cpu_key(_cpu, _fn, _arg, &__key); \
|
|
|
|
})
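A hedged example of the per-callsite key in action (function and purpose are hypothetical): each textual expansion of work_on_cpu() gets its own static lock_class_key, so lockdep tracks the call sites independently, as the commit message above explains:

static long foo_read_state(void *arg)
{
	/* runs on the requested CPU inside a workqueue worker */
	return 0;			/* hypothetical per-cpu query */
}

static long foo_query_cpu(int cpu)
{
	return work_on_cpu(cpu, foo_read_state, NULL);	/* blocks until done */
}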
|
|
|
|
|
|
|
|
long work_on_cpu_safe_key(int cpu, long (*fn)(void *),
|
|
|
|
void *arg, struct lock_class_key *key);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* A new key is defined for each caller to make sure the work
|
|
|
|
* associated with the function doesn't share its locking class.
|
|
|
|
*/
|
|
|
|
#define work_on_cpu_safe(_cpu, _fn, _arg) \
|
|
|
|
({ \
|
|
|
|
static struct lock_class_key __key; \
|
|
|
|
\
|
|
|
|
work_on_cpu_safe_key(_cpu, _fn, _arg, &__key); \
|
|
|
|
})
|
2008-11-05 02:39:10 +00:00
|
|
|
#endif /* CONFIG_SMP */
|
2010-05-13 19:32:28 +00:00
|
|
|
|
2010-06-29 08:07:12 +00:00
|
|
|
#ifdef CONFIG_FREEZER
|
|
|
|
extern void freeze_workqueues_begin(void);
|
|
|
|
extern bool freeze_workqueues_busy(void);
|
|
|
|
extern void thaw_workqueues(void);
|
|
|
|
#endif /* CONFIG_FREEZER */
|
|
|
|
|
2013-03-12 18:30:05 +00:00
|
|
|
#ifdef CONFIG_SYSFS
|
|
|
|
int workqueue_sysfs_register(struct workqueue_struct *wq);
|
|
|
|
#else /* CONFIG_SYSFS */
|
|
|
|
static inline int workqueue_sysfs_register(struct workqueue_struct *wq)
|
|
|
|
{ return 0; }
|
|
|
|
#endif /* CONFIG_SYSFS */
|
|
|
|
|
workqueue: implement lockup detector
Workqueue stalls can happen from a variety of usage bugs such as
missing WQ_MEM_RECLAIM flag or concurrency managed work item
indefinitely staying RUNNING. These stalls can be extremely difficult
to hunt down because the usual warning mechanisms can't detect
workqueue stalls and the internal state is pretty opaque.
To alleviate the situation, this patch implements workqueue lockup
detector. It periodically monitors all worker_pools and,
if any pool failed to make forward progress longer than the threshold
duration, triggers warning and dumps workqueue state as follows.
BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
workqueue events_power_efficient: flags=0x80
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
pending: check_lifetime, neigh_periodic_work
workqueue cgroup_pidlist_destroy: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
pending: cgroup_pidlist_destroy_work_fn
...
The detection mechanism is controlled through the kernel parameter
workqueue.watchdog_thresh and can be updated at runtime through the
sysfs module parameter file.
v2: Decoupled from softlockup control knobs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
2015-12-08 16:28:04 +00:00
|
|
|
#ifdef CONFIG_WQ_WATCHDOG
|
|
|
|
void wq_watchdog_touch(int cpu);
|
|
|
|
#else /* CONFIG_WQ_WATCHDOG */
|
|
|
|
static inline void wq_watchdog_touch(int cpu) { }
|
|
|
|
#endif /* CONFIG_WQ_WATCHDOG */
|
|
|
|
|
2016-07-13 17:16:29 +00:00
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
int workqueue_prepare_cpu(unsigned int cpu);
|
|
|
|
int workqueue_online_cpu(unsigned int cpu);
|
|
|
|
int workqueue_offline_cpu(unsigned int cpu);
|
|
|
|
#endif
|
|
|
|
|
2020-02-23 07:28:52 +00:00
|
|
|
void __init workqueue_init_early(void);
|
|
|
|
void __init workqueue_init(void);
|
2023-08-08 01:57:24 +00:00
|
|
|
void __init workqueue_init_topology(void);
|
2016-09-16 19:49:32 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
#endif
|