A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.
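To make the association concrete, below is a minimal sketch of the queueing flow, assuming hypothetical helpers select_pwq() and insert_work(); the real definitions in kernel/workqueue.c carry many more fields.

  /* Simplified sketch -- not the actual kernel definitions. */
  struct pool_workqueue {
          struct worker_pool *pool;           /* pool this pwq feeds work to */
          struct workqueue_struct *wq;        /* owning workqueue */
          int nr_active;                      /* currently active work items */
          int max_active;                     /* concurrency cap, per-pwq */
          struct list_head inactive_works;    /* held back over @max_active */
  };

  /* Queueing conceptually walks wq -> pwq -> pool. */
  static void queue_work_sketch(struct workqueue_struct *wq,
                                struct work_struct *work)
  {
          struct pool_workqueue *pwq = select_pwq(wq);    /* hypothetical */

          if (pwq->nr_active < pwq->max_active) {
                  pwq->nr_active++;
                  insert_work(pwq->pool, work);           /* hypothetical */
          } else {
                  /* over @max_active: park on the pwq until a slot frees up */
                  list_add_tail(&work->entry, &pwq->inactive_works);
          }
  }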
As a per-cpu workqueue should be associated with a different worker_pool on
each CPU, it has always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:
* Because @max_active is per-pwq, the meaning of @max_active changes
depending on the machine configuration and whether workqueue NUMA locality
support is enabled.
* Makes per-cpu and unbound code deviate.
* Gets in the way of making workqueue CPU locality awareness more flexible.
This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:
* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
workqueues (see the sketch after this list).
* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
the specified pwq to the target CPU's wq->cpu_pwq.
* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
This makes the return value of wq_calc_node_cpumask() unnecessary. It now
returns void.
* @max_active now means the same thing for both per-cpu and unbound
workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
used in workqueue implementation and will be removed later.
* All unbound pwq operations which used to be per-numa-node are now per-cpu.
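As referenced in the first bullet above, here is a minimal sketch of the per-CPU install and the RCU-protected lookup, with illustrative names (the exact kernel code differs in detail):

  /* install_unbound_pwq() sketch: publish @pwq as @cpu's pwq for @wq. */
  static void install_unbound_pwq_sketch(struct workqueue_struct *wq, int cpu,
                                         struct pool_workqueue *pwq)
  {
          struct pool_workqueue __rcu **slot = per_cpu_ptr(wq->cpu_pwq, cpu);

          /* the old pwq, if any, is released after an RCU grace period */
          rcu_assign_pointer(*slot, pwq);
  }

  /* The queueing path dereferences the slot under RCU. */
  static struct pool_workqueue *unbound_pwq_sketch(struct workqueue_struct *wq,
                                                   int cpu)
  {
          return rcu_dereference(*per_cpu_ptr(wq->cpu_pwq, cpu));
  }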
For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be slightly faster, flush_workqueue()
will become somewhat more expensive, and the total concurrency limit will
likely become higher. All @max_active==1 use cases are currently being
audited for conversion to alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion are complete.
One area where the behavior change may be more noticeable is
workqueue_congested(), as the reported congestion state is now per-CPU
instead of per-NUMA-node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.
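For reference, a typical call site looks like the sketch below; my_wq is a placeholder, and only the scope of the answer changes, not the interface:

  /*
   * workqueue_congested() now answers for a single CPU's pwq rather
   * than the NUMA node's shared pwq.  @my_wq is a placeholder.
   */
  if (workqueue_congested(WORK_CPU_UNBOUND, my_wq)) {
          /* the local CPU's pwq has work items held over @max_active */
          return -EBUSY;
  }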
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
Now that wq_worker_tick() is there, we can easily track the rough CPU time
consumption of each workqueue by charging the whole tick whenever a tick
hits an active workqueue. While not super accurate, it provides reasonable
visibility into the workqueues that consume a lot of CPU cycles.
wq_monitor.py is updated to report the per-workqueue CPU times.
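A sketch of the charging described above, with illustrative names; the real wq_worker_tick() in kernel/workqueue.c does more than this:

  /* Sketch: called from scheduler_tick() for worker kthreads. */
  void wq_worker_tick_sketch(struct task_struct *task)
  {
          struct worker *worker = kthread_data(task);
          struct pool_workqueue *pwq = worker->current_pwq;

          if (!pwq)
                  return;         /* worker is between work items */

          /* charge the whole tick to the workqueue being executed */
          pwq->stats[PWQ_STAT_CPU_TIME] += TICK_USEC;
  }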
v2: wq_monitor.py was using "cputime" as the key when outputting in json
format. Use "cpu_time" instead for consistency with other fields.
Signed-off-by: Tejun Heo <tj@kernel.org>
If a per-cpu work item hogs the CPU, it can prevent other work items from
starting through concurrency management. A per-cpu workqueue which intends
to host such CPU-hogging work items can choose to not participate in
concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be
error-prone and difficult to debug when missed.
This patch adds automatic detection based on CPU usage. If a
concurrency-managed work item consumes more CPU time than the threshold
(10ms by default) continuously without intervening sleeps, wq_worker_tick(),
which is called from scheduler_tick(), will detect the condition and
automatically mark the worker CPU_INTENSIVE.
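Schematically, the check looks like the following; field names are illustrative and the actual logic in kernel/workqueue.c differs in detail:

  /* Sketch of the auto-detection in wq_worker_tick(). */
  void cpu_intensive_detect_sketch(struct worker *worker)
  {
          /* CPU time consumed by the current work item without sleeping */
          u64 ran = worker->task->se.sum_exec_runtime - worker->current_at;

          if (worker->flags & WORKER_CPU_INTENSIVE)
                  return;         /* already excluded from concurrency mgmt */

          if (ran < wq_cpu_intensive_thresh_us * NSEC_PER_USEC)
                  return;         /* still under the 10ms default threshold */

          /*
           * Mark CPU_INTENSIVE so concurrency management ignores this
           * worker and other work items can start on the CPU.
           */
          worker->flags |= WORKER_CPU_INTENSIVE;
  }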
The mechanism isn't foolproof:
* Detection depends on the tick hitting the work item. Getting preempted at
the right times may allow a violating work item to evade detection, at
least temporarily.
* nohz_full CPUs may not be running ticks and thus can fail detection.
* Even when detection is working, the 10ms detection delays can add up if
many CPU-hogging work items are queued at the same time.
However, in the vast majority of cases, this should be able to detect
violations reliably and provide reasonable protection with a small increase
in code complexity.
If some work items trigger this condition repeatedly, the bigger problem
likely is the CPU being saturated with such per-cpu work items and the
solution would be making them UNBOUND. The next patch will add a debug
mechanism to help spot such cases.
v4: Documentation for workqueue.cpu_intensive_thresh_us added to
kernel-parameters.txt.
v3: Switch to use wq_worker_tick() instead of hooking into preemptions as
suggested by Peter.
v2: Lai pointed out that wq_worker_stopping() also needs to be called from
preemption and rtlock paths and an earlier patch was updated
accordingly. This patch adds a comment describing the risk of infinite
recursions and how they're avoided.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Currently, the only way to peer into workqueue operations is through
tracing. While possible, it isn't easy or convenient to monitor
per-workqueue behaviors over time this way. Let's add pwq->stats[] that
track relevant events and a drgn monitoring script -
tools/workqueue/wq_monitor.py.
It's arguable whether this needs to be configurable. However, it currently
has only a few counters and the runtime overhead shouldn't be noticeable
given that they're on pwq's, which are per-CPU on per-cpu workqueues and
per-NUMA-node on unbound ones. Let's keep it simple for the time being.
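For a rough idea of the shape, a sketch of the counters follows; the enum mirrors the style of kernel/workqueue.c but the exact set of events here is illustrative:

  /* Sketch of per-pwq event counters read by wq_monitor.py. */
  enum pool_workqueue_stats {
          PWQ_STAT_STARTED,       /* work items started execution */
          PWQ_STAT_COMPLETED,     /* work items completed execution */
          PWQ_STAT_CM_WAKEUP,     /* concurrency-management worker wakeups */
          PWQ_NR_STATS,
  };

  /* On the hot paths, updating a counter is a plain increment. */
  static inline void pwq_stat_inc(struct pool_workqueue *pwq, int stat)
  {
          pwq->stats[stat]++;     /* e.g. pwq_stat_inc(pwq, PWQ_STAT_STARTED) */
  }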
v2: Patch reordered to earlier with fewer fields. Fields will be added back
gradually. Help message improved.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
The canonical location for the tracefs filesystem is at /sys/kernel/tracing.
But, from Documentation/trace/ftrace.rst:
    Before 4.1, all ftrace tracing control files were within the debugfs
    file system, which is typically located at /sys/kernel/debug/tracing.
    For backward compatibility, when mounting the debugfs file system,
    the tracefs file system will be automatically mounted at:

        /sys/kernel/debug/tracing
Many parts of Documentation still reference this older debugfs path, so
let's update them to avoid confusion.
Signed-off-by: Ross Zwisler <zwisler@google.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20230125213251.2013791-1-zwisler@google.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
The current doc of the workqueue API suggests that work items are
non-reentrant: any work item is guaranteed to be executed by at most one
worker system-wide at any given time. However, this is not true; the
following sequence can cause a work item W to be executed by two workers at
the same time:

  queue_work_on(0, WQ1, W);
  // after a worker picks up W and clears the pending bit
  queue_work_on(1, WQ2, W);
  // workers on CPU0 and CPU1 will execute W at the same time

This means the non-reentrancy of a work item is conditional. Lai Jiangshan
provided a nice summary[1] of the conditions, so use it to describe a work
item instance and improve the doc.
[1]: https://lore.kernel.org/lkml/CAJhGHyDudet_xyNk=8xnuO2==o-u06s0E0GZVP4Q67nmQ84Ceg@mail.gmail.com/
Suggested-by: Matthew Wilcox <willy@infradead.org>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
As there's already an rst file with workqueue markups containing part of
them, move the remaining definitions there in order to avoid Sphinx
warnings.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Clean up workqueue.rst:
- fix minor typos
- put '@' after `` instead of preceding them (one place)
- use "CPU" instead of "cpu" in text consistently
- quote one function name
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Florian Mickler <florian@mickler.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
The NUMA rework of workqueue made the combination of max_active == 1 and
WQ_UNBOUND insufficient to guarantee single-threaded (ST) behavior
system-wide. alloc_ordered_workqueue() should now be used instead.
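As an illustration of the conversion (with a placeholder name; flags and args depend on the caller):

  /* Before: ordering was assumed from WQ_UNBOUND + max_active == 1. */
  wq = alloc_workqueue("my_wq", WQ_UNBOUND, 1);

  /* After: request ordered (single-threaded) behavior explicitly. */
  wq = alloc_ordered_workqueue("my_wq", 0);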
Signed-off-by: Alexei Potashnik <alexei@purestorage.com>
Signed-off-by: Tejun Heo <tj@kernel.org>