linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-09-26 20:38:12 +00:00

Author	SHA1	Message	Date
Xuewen Yan	ccdec92198	workqueue: Control intensive warning threshold through cmdline When CONFIG_WQ_CPU_INTENSIVE_REPORT is set, the kernel will report the work functions which violate the intensive_threshold_us repeatedly. And now, only when the violate times exceed 4 and is a power of 2, the kernel warning could be triggered. However, sometimes, even if a long work execution time occurs only once, it may cause other work to be delayed for a long time. This may also cause some problems sometimes. In order to freely control the threshold of warninging, a boot argument is added so that the user can control the warning threshold to be printed. At the same time, keep the exponential backoff to prevent reporting too much. By default, the warning threshold is 4. tj: Updated kernel-parameters.txt description. Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-22 07:50:45 -10:00
Linus Torvalds	efa80dcbb7	Tracing fix for v6.8: - While working on the ring buffer I noticed that the counter used for knowing where the end of the data is on a sub-buffer was not a full "int" but just 20 bits. It was masked out to 0xfffff. With the new code that allows the user to change the size of the sub-buffer, it is theoretically possible to ask for a size bigger than 2^20. If that happens, unexpected results may occur as there's no code checking if the counter overflowed the 20 bits of the write mask. There are other checks to make sure events fit in the sub-buffer, but if the sub-buffer itself is too big, that is not checked. Add a check in the resize of the sub-buffer to make sure that it never goes beyond the size of the counter that holds how much data is on it. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZdaf+RQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qjEIAQDpsvHqUFNoG5fkRlWr2U0hNl5M6zLI xTf2mWoG/h8bwQD+NfiRC2UrD5EaubO15z0z6MxScOl1H9X+iI7WVwZkqQ8= =txOr -----END PGP SIGNATURE----- Merge tag 'trace-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fix from Steven Rostedt: - While working on the ring buffer I noticed that the counter used for knowing where the end of the data is on a sub-buffer was not a full "int" but just 20 bits. It was masked out to 0xfffff. With the new code that allows the user to change the size of the sub-buffer, it is theoretically possible to ask for a size bigger than 2^20. If that happens, unexpected results may occur as there's no code checking if the counter overflowed the 20 bits of the write mask. There are other checks to make sure events fit in the sub-buffer, but if the sub-buffer itself is too big, that is not checked. Add a check in the resize of the sub-buffer to make sure that it never goes beyond the size of the counter that holds how much data is on it. * tag 'trace-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ring-buffer: Do not let subbuf be bigger than write mask	2024-02-22 09:23:22 -08:00
Anna-Maria Behnsen	b2cf7507e1	timers: Always queue timers on the local CPU The timer pull model is in place so we can remove the heuristics which try to guess the best target CPU at enqueue/modification time. All non pinned timers are queued on the local CPU in the separate storage and eventually pulled at expiry time to a remote CPU. Originally-by: Richard Cochran (linutronix GmbH) <richardcochran@gmail.com> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-21-anna-maria@linutronix.de	2024-02-22 17:52:32 +01:00
Anna-Maria Behnsen	36e40df35d	timer_migration: Add tracepoints The timer pull logic needs proper debugging aids. Add tracepoints so the hierarchical idle machinery can be diagnosed. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240222103403.31923-1-anna-maria@linutronix.de	2024-02-22 17:52:32 +01:00
Anna-Maria Behnsen	7ee9887703	timers: Implement the hierarchical pull model Placing timers at enqueue time on a target CPU based on dubious heuristics does not make any sense: 1) Most timer wheel timers are canceled or rearmed before they expire. 2) The heuristics to predict which CPU will be busy when the timer expires are wrong by definition. So placing the timers at enqueue wastes precious cycles. The proper solution to this problem is to always queue the timers on the local CPU and allow the non pinned timers to be pulled onto a busy CPU at expiry time. Therefore split the timer storage into local pinned and global timers: Local pinned timers are always expired on the CPU on which they have been queued. Global timers can be expired on any CPU. As long as a CPU is busy it expires both local and global timers. When a CPU goes idle it arms for the first expiring local timer. If the first expiring pinned (local) timer is before the first expiring movable timer, then no action is required because the CPU will wake up before the first movable timer expires. If the first expiring movable timer is before the first expiring pinned (local) timer, then this timer is queued into an idle timerqueue and eventually expired by another active CPU. To avoid global locking the timerqueues are implemented as a hierarchy. The lowest level of the hierarchy holds the CPUs. The CPUs are associated to groups of 8, which are separated per node. If more than one CPU group exist, then a second level in the hierarchy collects the groups. Depending on the size of the system more than 2 levels are required. Each group has a "migrator" which checks the timerqueue during the tick for remote expirable timers. If the last CPU in a group goes idle it reports the first expiring event in the group up to the next group(s) in the hierarchy. If the last CPU goes idle it arms its timer for the first system wide expiring timer to ensure that no timer event is missed. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240222103710.32582-1-anna-maria@linutronix.de	2024-02-22 17:52:32 +01:00
Anna-Maria Behnsen	57e95a5c41	timers: Introduce function to check timer base is_idle flag To prepare for the conversion of the NOHZ timer placement to a pull at expiry time model it's required to have a function that returns the value of the is_idle flag of the timer base to keep the hierarchy states during online in sync with timer base state. No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-18-anna-maria@linutronix.de	2024-02-22 17:52:32 +01:00
Richard Cochran (linutronix GmbH)	4c532939aa	tick/sched: Split out jiffies update helper function The logic to get the time of the last jiffies update will be needed by the timer pull model as well. Move the code into a global function in anticipation of the new caller. No functional change. Signed-off-by: Richard Cochran (linutronix GmbH) <richardcochran@gmail.com> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-17-anna-maria@linutronix.de	2024-02-22 17:52:32 +01:00
Anna-Maria Behnsen	89f01e10c9	timers: Check if timers base is handled already Due to the conversion of the NOHZ timer placement to a pull at expiry time model, the per CPU timer bases with non pinned timers are no longer handled only by the local CPU. In case a remote CPU already expires the non pinned timers base of the local CPU, nothing more needs to be done by the local CPU. A check at the begin of the expire timers routine is required, because timer base lock is dropped before executing the timer callback function. This is a preparatory work, but has no functional impact right now. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-16-anna-maria@linutronix.de	2024-02-22 17:52:32 +01:00
Richard Cochran (linutronix GmbH)	90f5df66c8	timers: Restructure internal locking Move the locking out from __run_timers() to the call sites, so the protected section can be extended at the call site. Preparatory work for changing the NOHZ timer placement to a pull at expiry time model. No functional change. Signed-off-by: Richard Cochran (linutronix GmbH) <richardcochran@gmail.com> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-15-anna-maria@linutronix.de	2024-02-22 17:52:31 +01:00
Anna-Maria Behnsen	f73d9257ff	timers: Add get next timer interrupt functionality for remote CPUs To prepare for the conversion of the NOHZ timer placement to a pull at expiry time model it's required to have functionality available getting the next timer interrupt on a remote CPU. Locking of the timer bases and getting the information for the next timer interrupt functionality is split into separate functions. This is required to be compliant with lock ordering when the new model is in place. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-14-anna-maria@linutronix.de	2024-02-22 17:52:31 +01:00
Anna-Maria Behnsen	70b4cf84f3	timers: Split out "get next timer interrupt" functionality The functionality for getting the next timer interrupt in get_next_timer_interrupt() is split into a separate function fetch_next_timer_interrupt() to be usable by other call sites. This is preparatory work for the conversion of the NOHZ timer placement to a pull at expiry time model. No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-13-anna-maria@linutronix.de	2024-02-22 17:52:31 +01:00
Anna-Maria Behnsen	21927fc89e	timers: Retrieve next expiry of pinned/non-pinned timers separately For the conversion of the NOHZ timer placement to a pull at expiry time model it's required to have separate expiry times for the pinned and the non-pinned (movable) timers. Therefore struct timer_events is introduced. No functional change Originally-by: Richard Cochran (linutronix GmbH) <richardcochran@gmail.com> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-12-anna-maria@linutronix.de	2024-02-22 17:52:31 +01:00
Anna-Maria Behnsen	83a665dc99	timers: Keep the pinned timers separate from the others Separate the storage space for pinned timers. Deferrable timers (doesn't matter if pinned or non pinned) are still enqueued into their own base. This is preparatory work for changing the NOHZ timer placement from a push at enqueue time to a pull at expiry time model. Originally-by: Richard Cochran (linutronix GmbH) <richardcochran@gmail.com> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-11-anna-maria@linutronix.de	2024-02-22 17:52:31 +01:00
Anna-Maria Behnsen	9f6a3c602c	timers: Split next timer interrupt logic Split the logic for getting next timer interrupt (no matter of recalculated or already stored in base->next_expiry) into a separate function named next_timer_interrupt(). Make it available to local call sites only. No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-10-anna-maria@linutronix.de	2024-02-22 17:52:31 +01:00
Anna-Maria Behnsen	af68cb3fc7	timers: Simplify code in run_local_timers() The logic for raising a softirq the way it is implemented right now, is readable for two timer bases. When increasing the number of timer bases, code gets harder to read. With the introduction of the timer migration hierarchy, there will be three timer bases. Therefore restructure the code to use a loop. No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-9-anna-maria@linutronix.de	2024-02-22 17:52:31 +01:00
Anna-Maria Behnsen	aae55e9fb8	timers: Make sure TIMER_PINNED flag is set in add_timer_on() When adding a timer to the timer wheel using add_timer_on(), it is an implicitly pinned timer. With the timer pull at expiry time model in place, the TIMER_PINNED flag is required to make sure timers end up in proper base. Set the TIMER_PINNED flag unconditionally when add_timer_on() is executed. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-8-anna-maria@linutronix.de	2024-02-22 17:52:31 +01:00
Anna-Maria Behnsen	c0e8c5b599	workqueue: Use global variant for add_timer() The implementation of the NOHZ pull at expiry model will change the timer bases per CPU. Timers, that have to expire on a specific CPU, require the TIMER_PINNED flag. If the CPU doesn't matter, the TIMER_PINNED flag must be dropped. This is required for call sites which use the timer alternately as pinned and not pinned timer like workqueues do. Therefore use add_timer_global() in __queue_delayed_work() for non-bound delayed work to make sure the TIMER_PINNED flag is dropped. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-7-anna-maria@linutronix.de	2024-02-22 17:52:30 +01:00
Anna-Maria Behnsen	8e7e247f64	timers: Introduce add_timer() variants which modify timer flags A timer might be used as a pinned timer (using add_timer_on()) and later on as non-pinned timer using add_timer(). When the "NOHZ timer pull at expiry model" is in place, the TIMER_PINNED flag is required to be used whenever a timer needs to expire on a dedicated CPU. Otherwise the flag must not be set if expiration on a dedicated CPU is not required. add_timer_on()'s behavior will be changed during the preparation patches for the "NOHZ timer pull at expiry model" to unconditionally set the TIMER_PINNED flag. To be able to clear/ set the flag when queueing a timer, two variants of add_timer() are introduced. This is a preparatory step and has no functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-6-anna-maria@linutronix.de	2024-02-22 17:52:30 +01:00
Anna-Maria Behnsen	73129cf4b6	timers: Optimization for timer_base_try_to_set_idle() When tick is stopped also the timer base is_idle flag is set. When reentering timer_base_try_to_set_idle() with the tick stopped, there is no need to check whether the timer base needs to be set idle again. When a timer was enqueued in the meantime, this is already handled by the tick_nohz_next_event() call which was executed before tick_nohz_stop_tick(). Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-5-anna-maria@linutronix.de	2024-02-22 17:52:30 +01:00
Anna-Maria Behnsen	e2e1d724e9	timers: Move marking timer bases idle into tick_nohz_stop_tick() The timer base is marked idle when get_next_timer_interrupt() is executed. But the decision whether the tick will be stopped and whether the system is able to go idle is done later. When the timer bases is marked idle and a new first timer is enqueued remote an IPI is raised. Even if it is not required because the tick is not stopped and the timer base is evaluated again at the next tick. To prevent this, the timer base is marked idle in tick_nohz_stop_tick() and get_next_timer_interrupt() is streamlined by only looking for the next timer interrupt. All other work is postponed to timer_base_try_to_set_idle() which is called by tick_nohz_stop_tick(). timer_base_try_to_set_idle() never resets timer_base::is_idle state. This is done when the tick is restarted via tick_nohz_restart_sched_tick(). With this, tick_sched::tick_stopped and timer_base::is_idle are always in sync. So there is no longer the need to execute timer_clear_idle() in tick_nohz_idle_retain_tick(). This was required before, as tick_nohz_next_event() set timer_base::is_idle even if the tick would not be stopped. So timer_clear_idle() is only executed, when timer base is idle. So the check whether timer base is idle, is now no longer required as well. While at it fix some nearby whitespace damage as well. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-4-anna-maria@linutronix.de	2024-02-22 17:52:30 +01:00
Anna-Maria Behnsen	39ed699fb6	timers: Split out get next timer interrupt Split out get_next_timer_interrupt() to be able to extend it and make it reusable for other call sites. No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-3-anna-maria@linutronix.de	2024-02-22 17:52:30 +01:00
Anna-Maria Behnsen	bebed6649e	timers: Restructure get_next_timer_interrupt() get_next_timer_interrupt() contains two parts for the next timer interrupt calculation. Those two parts are separated by forwarding the base clock. But the second part does not depend on the forwarded base clock. Therefore restructure get_next_timer_interrupt() to keep things together which belong together. No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-2-anna-maria@linutronix.de	2024-02-22 17:52:30 +01:00
Max Kellermann	266e957864	cpu: Remove stray semicolon This syntax error was introduced by commit `da92df490e` ("cpu: Mark cpu_possible_mask as __ro_after_init"). Fixes: `da92df490e` ("cpu: Mark cpu_possible_mask as __ro_after_init") Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240222114727.1144588-1-max.kellermann@ionos.com	2024-02-22 17:51:14 +01:00
Paolo Abeni	fdcd4467ba	bpf-for-netdev -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZdaBCwAKCRDbK58LschI g3EhAP0d+S18mNabiEGz8efnE2yz3XcFchJgjiRS8WjOv75GvQEA6/sWncFjbc8k EqxPHmeJa19rWhQlFrmlyNQfLYGe4gY= =VkOs -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-02-22 The following pull-request contains BPF updates for your net tree. We've added 11 non-merge commits during the last 24 day(s) which contain a total of 15 files changed, 217 insertions(+), 17 deletions(-). The main changes are: 1) Fix a syzkaller-triggered oops when attempting to read the vsyscall page through bpf_probe_read_kernel and friends, from Hou Tao. 2) Fix a kernel panic due to uninitialized iter position pointer in bpf_iter_task, from Yafang Shao. 3) Fix a race between bpf_timer_cancel_and_free and bpf_timer_cancel, from Martin KaFai Lau. 4) Fix a xsk warning in skb_add_rx_frag() (under CONFIG_DEBUG_NET) due to incorrect truesize accounting, from Sebastian Andrzej Siewior. 5) Fix a NULL pointer dereference in sk_psock_verdict_data_ready, from Shigeru Yoshida. 6) Fix a resolve_btfids warning when bpf_cpumask symbol cannot be resolved, from Hari Bathini. bpf-for-netdev * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf, sockmap: Fix NULL pointer dereference in sk_psock_verdict_data_ready() selftests/bpf: Add negtive test cases for task iter bpf: Fix an issue due to uninitialized bpf_iter_task selftests/bpf: Test racing between bpf_timer_cancel_and_free and bpf_timer_cancel bpf: Fix racing between bpf_timer_cancel_and_free and bpf_timer_cancel selftest/bpf: Test the read of vsyscall page under x86-64 x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault() x86/mm: Move is_vsyscall_vaddr() into asm/vsyscall.h bpf, scripts: Correct GPL license name xsk: Add truesize to skb_add_rx_frag(). bpf: Fix warning for bpf_cpumask in verifier ==================== Link: https://lore.kernel.org/r/20240221231826.1404-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-02-22 10:04:47 +01:00
Steven Rostedt (Google)	e78fb4eac8	ring-buffer: Do not let subbuf be bigger than write mask The data on the subbuffer is measured by a write variable that also contains status flags. The counter is just 20 bits in length. If the subbuffer is bigger than then counter, it will fail. Make sure that the subbuffer can not be set to greater than the counter that keeps track of the data on the subbuffer. Link: https://lore.kernel.org/linux-trace-kernel/20240220095112.77e9cb81@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: `2808e31ec1` ("ring-buffer: Add interface for configuring trace sub buffer size") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-02-21 09:15:23 -05:00
Feng Tang	2ed08e4bc5	clocksource: Scale the watchdog read retries automatically On a 8-socket server the TSC is wrongly marked as 'unstable' and disabled during boot time on about one out of 120 boot attempts: clocksource: timekeeping watchdog on CPU227: wd-tsc-wd excessive read-back delay of 153560ns vs. limit of 125000ns, wd-wd read-back delay only 11440ns, attempt 3, marking tsc unstable tsc: Marking TSC unstable due to clocksource watchdog TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'. sched_clock: Marking unstable (119294969739, 159204297)<-(125446229205, -5992055152) clocksource: Checking clocksource tsc synchronization from CPU 319 to CPUs 0,99,136,180,210,542,601,896. clocksource: Switched to clocksource hpet The reason is that for platform with a large number of CPUs, there are sporadic big or huge read latencies while reading the watchog/clocksource during boot or when system is under stress work load, and the frequency and maximum value of the latency goes up with the number of online CPUs. The cCurrent code already has logic to detect and filter such high latency case by reading the watchdog twice and checking the two deltas. Due to the randomness of the latency, there is a low probabilty that the first delta (latency) is big, but the second delta is small and looks valid. The watchdog code retries the readouts by default twice, which is not necessarily sufficient for systems with a large number of CPUs. There is a command line parameter 'max_cswd_read_retries' which allows to increase the number of retries, but that's not user friendly as it needs to be tweaked per system. As the number of required retries is proportional to the number of online CPUs, this parameter can be calculated at runtime. Scale and enlarge the number of retries according to the number of online CPUs and remove the command line parameter completely. [ tglx: Massaged change log and comments ] Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jin Wang <jin1.wang@intel.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20240221060859.1027450-1-feng.tang@intel.com	2024-02-21 12:00:42 +01:00
David Gow	e0a1284b29	time/kunit: Use correct format specifier 'days' is a s64 (from div_s64), and so should use a %lld specifier. This was found by extending KUnit's assertion macros to use gcc's __printf attribute. Fixes: `2760105516` ("time: Improve performance of time64_to_tm()") Signed-off-by: David Gow <davidgow@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240221092728.1281499-5-davidgow@google.com	2024-02-21 12:00:42 +01:00
Christian Brauner	e1fb1dc08e	pidfd: allow to override signal scope in pidfd_send_signal() Right now we determine the scope of the signal based on the type of pidfd. There are use-cases where it's useful to override the scope of the signal. For example in [1]. Add flags to determine the scope of the signal: (1) PIDFD_SIGNAL_THREAD: send signal to specific thread reference by @pidfd (2) PIDFD_SIGNAL_THREAD_GROUP: send signal to thread-group of @pidfd (2) PIDFD_SIGNAL_PROCESS_GROUP: send signal to process-group of @pidfd Since we now allow specifying PIDFD_SEND_PROCESS_GROUP for pidfd_send_signal() to send signals to process groups we need to adjust the check restricting si_code emulation by userspace to account for PIDTYPE_PGID. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://github.com/systemd/systemd/issues/31093 [1] Link: https://lore.kernel.org/r/20240210-chihuahua-hinzog-3945b6abd44a@brauner Link: https://lore.kernel.org/r/20240214123655.GB16265@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-21 09:46:08 +01:00
Tejun Heo	bccdc1faaf	workqueue: Make @flags handling consistent across set_work_data() and friends - set_work_data() takes a separate @flags argument but just ORs it to @data. This is more confusing than helpful. Just take @data. - Use the name @flags consistently and add the parameter to set_work_pool_and_{keep\|clear}_pending(). This will be used by the planned disable/enable support. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-20 19:36:15 -10:00
Tejun Heo	afe928c1dc	workqueue: Remove clear_work_data() clear_work_data() is only used in one place and immediately followed by smp_mb(), making it equivalent to set_work_pool_and_clear_pending() w/ WORK_OFFQ_POOL_NONE for @pool_id. Drop it. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-20 19:36:14 -10:00
Tejun Heo	978b8409ea	workqueue: Factor out work_grab_pending() from __cancel_work_sync() The planned disable/enable support will need the same logic. Let's factor it out. No functional changes. v2: Update function comment to include @irq_flags. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-20 19:36:14 -10:00
Tejun Heo	e9a8e01f9b	workqueue: Clean up enum work_bits and related constants The bits of work->data are used for a few different purposes. How the bits are used is determined by enum work_bits. The planned disable/enable support will add another use, so let's clean it up a bit in preparation. - Let WORK_STRUCT_*_BIT's values be determined by enum definition order. - Deliminate different bit sections the same way using SHIFT and BITS values. - Rename __WORK_OFFQ_CANCELING to WORK_OFFQ_CANCELING_BIT for consistency. - Introduce WORK_STRUCT_PWQ_SHIFT and replace WORK_STRUCT_FLAG_MASK and WORK_STRUCT_WQ_DATA_MASK with WQ_STRUCT_PWQ_MASK for clarity. - Improve documentation. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-20 19:36:14 -10:00
Tejun Heo	c5f5b9422a	workqueue: Introduce work_cancel_flags The cancel path used bool @is_dwork to distinguish canceling a regular work and a delayed one. The planned disable/enable support will need passing around another flag in the code path. As passing them around with bools will be confusing, let's introduce named flags to pass around in the cancel path. WORK_CANCEL_DELAYED replaces @is_dwork. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-20 19:36:14 -10:00
Tejun Heo	c26e2f2e2f	workqueue: Use variable name irq_flags for saving local irq flags Using the generic term `flags` for irq flags is conventional but can be confusing as there's quite a bit of code dealing with work flags which involves some subtleties. Let's use a more explicit name `irq_flags` for local irq flags. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-20 19:36:14 -10:00
Tejun Heo	cdc6e4b329	workqueue: Reorganize flush and cancel[_sync] functions They are currently a bit disorganized with flush and cancel functions mixed. Reoranize them so that flush functions come first, cancel next and cancel_sync last. This way, we won't have to add prototypes for internal functions for the planned disable/enable support. This is pure code reorganization. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-20 19:36:14 -10:00
Tejun Heo	c5140688d1	workqueue: Rename __cancel_work_timer() to __cancel_timer_sync() __cancel_work_timer() is used to implement cancel_work_sync() and cancel_delayed_work_sync(), similarly to how __cancel_work() is used to implement cancel_work() and cancel_delayed_work(). ie. The _timer part of the name is a complete misnomer. The difference from __cancel_work() is the fact that it syncs against work item execution not whether it handles timers or not. Let's rename it to less confusing __cancel_work_sync(). No functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-20 19:36:14 -10:00
Tejun Heo	d355001fa9	workqueue: Use rcu_read_lock_any_held() instead of rcu_read_lock_held() The different flavors of RCU read critical sections have been unified. Let's update the locking assertion macros accordingly to avoid requiring unnecessary explicit rcu_read_[un]lock() calls. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-20 19:36:13 -10:00
Tejun Heo	c7a40c49af	workqueue: Cosmetic changes Reorder some global declarations and adjust comments and whitespaces for clarity and consistency. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-20 19:36:13 -10:00
Marco Elver	de2683e7fd	hardening: Enable KFENCE in the hardening config KFENCE is not a security mitigation mechanism (due to sampling), but has the performance characteristics of unintrusive hardening techniques. When used at scale, however, it improves overall security by allowing kernel developers to detect heap memory-safety bugs cheaply. Link: https://lkml.kernel.org/r/79B9A832-B3DE-4229-9D87-748B2CFB7D12@kernel.org Cc: Matthieu Baerts <matttbe@kernel.org> Cc: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/r/20240212130116.997627-1-elver@google.com Signed-off-by: Kees Cook <keescook@chromium.org>	2024-02-20 20:47:32 -08:00
Lukas Bulwahn	7b3133aa4b	hardening: drop obsolete DRM_LEGACY from config fragment Commit `94f8f319cb` ("drm: Remove Kconfig option for legacy support (CONFIG_DRM_LEGACY)") removes the config DRM_LEGACY, but one reference to that config is left in the hardening.config fragment. As there is no drm legacy driver left, we do not need to recommend this attack surface reduction anymore. Drop this reference in hardening.config fragment. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Link: https://lore.kernel.org/r/20240208091045.9219-3-lukas.bulwahn@gmail.com Signed-off-by: Kees Cook <keescook@chromium.org>	2024-02-20 20:47:32 -08:00
Lukas Bulwahn	006eac3fe2	hardening: drop obsolete UBSAN_SANITIZE_ALL from config fragment Commit 7a628f818499 ("ubsan: Remove CONFIG_UBSAN_SANITIZE_ALL") removes the config UBSAN_SANITIZE_ALL, but one reference to that config is left in the hardening.config fragment. Drop this reference in hardening.config fragment. Note that CONFIG_UBSAN is still enabled in the hardening.config fragment, so the functionality when using this fragment remains the same. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Link: https://lore.kernel.org/r/20240208091045.9219-2-lukas.bulwahn@gmail.com Signed-off-by: Kees Cook <keescook@chromium.org>	2024-02-20 20:47:32 -08:00
Linus Torvalds	944d5fe50f	sched/membarrier: reduce the ability to hammer on sys_membarrier On some systems, sys_membarrier can be very expensive, causing overall slowdowns for everything. So put a lock on the path in order to serialize the accesses to prevent the ability for this to be called at too high of a frequency and saturate the machine. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-and-tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Borislav Petkov <bp@alien8.de> Fixes: `22e4ebb975` ("membarrier: Provide expedited private command") Fixes: `c5f58bd58f` ("membarrier: Provide GLOBAL_EXPEDITED command") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2024-02-20 09:38:05 -08:00
Marc Zyngier	5aa3c0cf5b	genirq/irqdomain: Don't call ops->select for DOMAIN_BUS_ANY tokens Users of the IRQCHIP_PLATFORM_DRIVER_{BEGIN,END} helpers rely on a fwspec containing only the fwnode (and crucially a number of parameters set to 0) together with a DOMAIN_BUS_ANY token to check whether a parent irqchip has probed and registered a domain. Since `de1ff306dc` ("genirq/irqdomain: Remove the param count restriction from select()"), ops->select() is called unconditionally, meaning that irqchips implementing select() now need to handle ANY as a match. Instead of adding more esoteric checks to the individual drivers, add that condition to irq_find_matching_fwspec(), and let it handle the corner case, as per the comment in the function. This restores the functionality of the above helpers. Fixes: `de1ff306dc` ("genirq/irqdomain: Remove the param count restriction from select()") Reported-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reported-by: Biju Das <biju.das.jz@bp.renesas.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Tested-by: Biju Das <biju.das.jz@bp.renesas.com> Link: https://lore.kernel.org/r/20240220114731.1898534-1-maz@kernel.org Link: https://lore.kernel.org/r/20240219-gic-fix-child-domain-v1-1-09f8fd2d9a8f@linaro.org	2024-02-20 17:30:57 +01:00
Ingo Molnar	94bf12af35	Merge tag 'v6.8-rc5' into timers/core, to resolve conflict There's a conflict between this recent upstream fix: `dad6a09f31` ("hrtimer: Report offline hrtimer enqueue") and a pending commit in the timers tree: `1a4729ecaf` ("hrtimers: Move hrtimer base related definitions into hrtimer_defs.h") Resolve it by applying the upstream fix to the new <linux/hrtimer_defs.h> header. Conflict: include/linux/hrtimer.h Semantic conflict: include/linux/hrtimer_defs.h Signed-off-by: Ingo Molnar <mingo@kernel.org>	2024-02-19 22:27:57 +01:00
Alexey Dobriyan	da92df490e	cpu: Mark cpu_possible_mask as __ro_after_init cpu_possible_mask is by definition "cpus which could be hotplugged without reboot". It's a property which is fixed after kernel enumerates the hardware configuration. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/41cd78af-92a3-4f23-8c7a-4316a04a66d8@p183	2024-02-19 18:05:47 +01:00
Yafang Shao	5f2ae606cb	bpf: Fix an issue due to uninitialized bpf_iter_task Failure to initialize it->pos, coupled with the presence of an invalid value in the flags variable, can lead to it->pos referencing an invalid task, potentially resulting in a kernel panic. To mitigate this risk, it's crucial to ensure proper initialization of it->pos to NULL. Fixes: `ac8148d957` ("bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/bpf/20240217114152.1623-2-laoar.shao@gmail.com	2024-02-19 12:28:15 +01:00
Martin KaFai Lau	0281b919e1	bpf: Fix racing between bpf_timer_cancel_and_free and bpf_timer_cancel The following race is possible between bpf_timer_cancel_and_free and bpf_timer_cancel. It will lead a UAF on the timer->timer. bpf_timer_cancel(); spin_lock(); t = timer->time; spin_unlock(); bpf_timer_cancel_and_free(); spin_lock(); t = timer->timer; timer->timer = NULL; spin_unlock(); hrtimer_cancel(&t->timer); kfree(t); /* UAF on t */ hrtimer_cancel(&t->timer); In bpf_timer_cancel_and_free, this patch frees the timer->timer after a rcu grace period. This requires a rcu_head addition to the "struct bpf_hrtimer". Another kfree(t) happens in bpf_timer_init, this does not need a kfree_rcu because it is still under the spin_lock and timer->timer has not been visible by others yet. In bpf_timer_cancel, rcu_read_lock() is added because this helper can be used in a non rcu critical section context (e.g. from a sleepable bpf prog). Other timer->timer usages in helpers.c have been audited, bpf_timer_cancel() is the only place where timer->timer is used outside of the spin_lock. Another solution considered is to mark a t->flag in bpf_timer_cancel and clear it after hrtimer_cancel() is done. In bpf_timer_cancel_and_free, it busy waits for the flag to be cleared before kfree(t). This patch goes with a straight forward solution and frees timer->timer after a rcu grace period. Fixes: `b00628b1c7` ("bpf: Introduce bpf timers.") Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20240215211218.990808-1-martin.lau@linux.dev	2024-02-19 12:26:46 +01:00
Peter Hilber	14274d0bd3	timekeeping: Fix cross-timestamp interpolation for non-x86 So far, get_device_system_crosststamp() unconditionally passes system_counterval.cycles to timekeeping_cycles_to_ns(). But when interpolating system time (do_interp == true), system_counterval.cycles is before tkr_mono.cycle_last, contrary to the timekeeping_cycles_to_ns() expectations. On x86, CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE will mitigate on interpolating, setting delta to 0. With delta == 0, xtstamp->sys_monoraw and xtstamp->sys_realtime are then set to the last update time, as implicitly expected by adjust_historical_crosststamp(). On other architectures, the resulting nonsense xtstamp->sys_monoraw and xtstamp->sys_realtime corrupt the xtstamp (ts) adjustment in adjust_historical_crosststamp(). Fix this by deriving xtstamp->sys_monoraw and xtstamp->sys_realtime from the last update time when interpolating, by using the local variable "cycles". The local variable already has the right value when interpolating, unlike system_counterval.cycles. Fixes: `2c756feb18` ("time: Add history to cross timestamp interface supporting slower devices") Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/20231218073849.35294-4-peter.hilber@opensynergy.com	2024-02-19 12:18:51 +01:00
Peter Hilber	87a4113088	timekeeping: Fix cross-timestamp interpolation corner case decision The cycle_between() helper checks if parameter test is in the open interval (before, after). Colloquially speaking, this also applies to the counter wrap-around special case before > after. get_device_system_crosststamp() currently uses cycle_between() at the first call site to decide whether to interpolate for older counter readings. get_device_system_crosststamp() has the following problem with cycle_between() testing against an open interval: Assume that, by chance, cycles == tk->tkr_mono.cycle_last (in the following, "cycle_last" for brevity). Then, cycle_between() at the first call site, with effective argument values cycle_between(cycle_last, cycles, now), returns false, enabling interpolation. During interpolation, get_device_system_crosststamp() will then call cycle_between() at the second call site (if a history_begin was supplied). The effective argument values are cycle_between(history_begin->cycles, cycles, cycles), since system_counterval.cycles == interval_start == cycles, per the assumption. Due to the test against the open interval, cycle_between() returns false again. This causes get_device_system_crosststamp() to return -EINVAL. This failure should be avoided, since get_device_system_crosststamp() works both when cycles follows cycle_last (no interpolation), and when cycles precedes cycle_last (interpolation). For the case cycles == cycle_last, interpolation is actually unneeded. Fix this by changing cycle_between() into timestamp_in_interval(), which now checks against the closed interval, rather than the open interval. This changes the get_device_system_crosststamp() behavior for three corner cases: 1. Bypass interpolation in the case cycles == tk->tkr_mono.cycle_last, fixing the problem described above. 2. At the first timestamp_in_interval() call site, cycles == now no longer causes failure. 3. At the second timestamp_in_interval() call site, history_begin->cycles == system_counterval.cycles no longer causes failure. adjust_historical_crosststamp() also works for this corner case, where partial_history_cycles == total_history_cycles. These behavioral changes should not cause any problems. Fixes: `2c756feb18` ("time: Add history to cross timestamp interface supporting slower devices") Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20231218073849.35294-3-peter.hilber@opensynergy.com	2024-02-19 12:18:51 +01:00
Peter Hilber	84dccadd3e	timekeeping: Fix cross-timestamp interpolation on counter wrap cycle_between() decides whether get_device_system_crosststamp() will interpolate for older counter readings. cycle_between() yields wrong results for a counter wrap-around where after < before < test, and for the case after < test < before. Fix the comparison logic. Fixes: `2c756feb18` ("time: Add history to cross timestamp interface supporting slower devices") Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/20231218073849.35294-2-peter.hilber@opensynergy.com	2024-02-19 12:18:51 +01:00
Crystal Wood	c99303a2d2	genirq: Wake interrupt threads immediately when changing affinity The affinity setting of interrupt threads happens in the context of the thread when the thread is woken up by an hard interrupt. As this can be an arbitrary after changing the affinity, the thread can become runnable on an isolated CPU and cause isolation disruption. Avoid this by checking the set affinity request in wait_for_interrupt() and waking the threads immediately when the affinity is modified. Note that this is of the most benefit on systems where the interrupt affinity itself does not need to be deferred to the interrupt handler, but even where that's not the case, the total dirsuption will be less. Signed-off-by: Crystal Wood <crwood@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240122235353.15235-1-crwood@redhat.com	2024-02-19 11:43:46 +01:00
Anna-Maria Behnsen	892abd3571	timers: Add struct member description for timer_base timer_base struct lacks description of struct members. Important struct member information is sprinkled in comments or in code all over the place. Collect information and write struct description to keep track of most important information in a single place. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240123164702.55612-5-anna-maria@linutronix.de	2024-02-19 09:38:00 +01:00
Anna-Maria Behnsen	f365d05506	tick/sched: Add function description for tick_nohz_next_event() The return value of tick_nohz_next_event() is not obvious at the first glance. Add a kernel-doc compatible function description which also covers return values. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240123164702.55612-4-anna-maria@linutronix.de	2024-02-19 09:38:00 +01:00
Anna-Maria Behnsen	ca2768bbf5	hrtimers: Update formatting of documentation Documentation of functions lacks the annotations which are used by kernel-doc and *.rst to make appearance in rendered documents more user-friendly. Use those annotations to improve user-friendliness. While at it prevent duplication of comments and use a reference instead. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240123164702.55612-3-anna-maria@linutronix.de	2024-02-19 09:37:59 +01:00
Linus Torvalds	ad645dea35	Probes fixes for v6.8-rc4: - tracing/probes: Fix BTF structure member finder to find the members which are placed after any anonymous union member correctly. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmXQqH8bHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bKCsH/3EWGLBbwNMFFje+t2Bp L6LgeyG7ZUzeDb4ioO2bae++e3m7xtR7YAWbZoX193/BfPbBpr9rkUCfYqAaImBk dzdI+E8/gOkliLP0Vvj5GIecKDUdlB053NRgoOcq0n1NYOO3/4kyPZ+53tX5gsBI YNcdIKDdEVWX0+vinK6cCjWOLdVJc/8pH/6G2KNgG6Qs2T1dV0YzW5QRv1hhUOr3 1Obog64AFt7tVtIkGOK19gRJKa6VpysdD8wpekJnKfquTtFffrsgf4yTg84zf7Fe ueaOtJ7pV4R0SbQ/n5SnXe5cxXjX7MD24FZaeHCB7mp8sOsdeTukFmSTsIgOFeD/ cBQ= =KY+B -----END PGP SIGNATURE----- Merge tag 'probes-fixes-v6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fix from Masami Hiramatsu: - tracing/probes: Fix BTF structure member finder to find the members which are placed after any anonymous union member correctly. * tag 'probes-fixes-v6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/probes: Fix to search structure fields correctly	2024-02-17 07:59:47 -08:00
Masami Hiramatsu (Google)	9704669c38	tracing/probes: Fix to search structure fields correctly Fix to search a field from the structure which has anonymous union correctly. Since the reference `type` pointer was updated in the loop, the search loop suddenly aborted where it hits an anonymous union. Thus it can not find the field after the anonymous union. This avoids updating the cursor `type` pointer in the loop. Link: https://lore.kernel.org/all/170791694361.389532.10047514554799419688.stgit@devnote2/ Fixes: `302db0f5b3` ("tracing/probes: Add a function to search a member of a struct/union") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2024-02-17 21:25:42 +09:00
Linus Torvalds	975b26ab30	workqueue: Fixes for v6.8-rc4 Just one patch to revert `ca10d851b9` ("workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()"). This could break ordering guarantee for ordered workqueues. The problem that the commit tried to resolve partially - making ordered workqueues follow unbound cpumask - is fully solved in wq/for-6.9 branch. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZc/UeA4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGR/pAQCeGwvE0m/NzdIsFh0Gd6fLwZlE9S9Kh0zsCaV4 3N5GdwD/XAXG77ZIBx1z88CT6YDyRt7iKWU6acSnfglgeFEJoAU= =Z54U -----END PGP SIGNATURE----- Merge tag 'wq-for-6.8-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: "Just one patch to revert commit `ca10d851b9` ("workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()"). This commit could break ordering guarantees for ordered workqueues. The problem that the commit tried to resolve partially - making ordered workqueues follow unbound cpumask - is fully solved in wq/for-6.9 branch" * tag 'wq-for-6.8-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: Revert "workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()"	2024-02-16 14:00:19 -08:00
Christophe Leroy	d1909c0221	module: Don't ignore errors from set_memory_XX() set_memory_ro(), set_memory_nx(), set_memory_x() and other helpers can fail and return an error. In that case the memory might not be protected as expected and the module loading has to be aborted to avoid security issues. Check return value of all calls to set_memory_XX() and handle error if any. Add a check to not call set_memory_XX() on NULL pointers as some architectures may not like it allthough numpages is always 0 in that case. This also avoid a useless call to set_vm_flush_reset_perms(). Link: https://github.com/KSPP/linux/issues/7 Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2024-02-16 11:30:43 -08:00
Linus Torvalds	4b6f7c624e	Tracing fixes for v6.8: - Fix the #ifndef that didn't have CONFIG_ on HAVE_DYNAMIC_FTRACE_WITH_REGS The fix to have dynamic trampolines work with x86 broke arm64 as the config used in the #ifdef was HAVE_DYNAMIC_FTRACE_WITH_REGS and not CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS which removed the fix that the previous fix was to fix. - Fix tracing_on state The code to test if "tracing_on" is set used ring_buffer_record_is_on() which returns false if the ring buffer isn't able to be written to. But the ring buffer disable has several bits that disable it. One is internal disabling which is used for resizing and other modifications of the ring buffer. But the "tracing_on" user space visible flag should only report if tracing is actually on and not internally disabled, as this can cause confusion as writing "1" when it is disabled will not enable it. Instead use ring_buffer_record_is_set_on() which shows the user space visible settings. - Fix a false positive kmemleak on saved cmdlines Now that the saved_cmdlines structure is allocated via alloc_page() and not via kmalloc() it has become invisible to kmemleak. The allocation done to one of its pointers was flagged as a dangling allocation leak. Make kmemleak aware of this allocation and free. - Fix synthetic event dynamic strings. A update that cleaned up the synthetic event code removed the return value of trace_string(), and had it return zero instead of the length, causing dynamic strings in the synthetic event to always have zero size. - Clean up documentation and header files for seq_buf -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZc92CxQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qiD6AQCCtF9hBWq7slLlBQ+k07hWXOi1h4nR Ae6UmoGlu3u4ugEA6/g8mO2vjABagnK7RSk/R+s1SvSGqWkmAsWeEKirEAY= =Eqjs -----END PGP SIGNATURE----- Merge tag 'trace-v6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix the #ifndef that didn't have the 'CONFIG_' prefix on HAVE_DYNAMIC_FTRACE_WITH_REGS The fix to have dynamic trampolines work with x86 broke arm64 as the config used in the #ifdef was HAVE_DYNAMIC_FTRACE_WITH_REGS and not CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS which removed the fix that the previous fix was to fix. - Fix tracing_on state The code to test if "tracing_on" is set incorrectly used ring_buffer_record_is_on() which returns false if the ring buffer isn't able to be written to. But the ring buffer disable has several bits that disable it. One is internal disabling which is used for resizing and other modifications of the ring buffer. But the "tracing_on" user space visible flag should only report if tracing is actually on and not internally disabled, as this can cause confusion as writing "1" when it is disabled will not enable it. Instead use ring_buffer_record_is_set_on() which shows the user space visible settings. - Fix a false positive kmemleak on saved cmdlines Now that the saved_cmdlines structure is allocated via alloc_page() and not via kmalloc() it has become invisible to kmemleak. The allocation done to one of its pointers was flagged as a dangling allocation leak. Make kmemleak aware of this allocation and free. - Fix synthetic event dynamic strings An update that cleaned up the synthetic event code removed the return value of trace_string(), and had it return zero instead of the length, causing dynamic strings in the synthetic event to always have zero size. - Clean up documentation and header files for seq_buf * tag 'trace-v6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: seq_buf: Fix kernel documentation seq_buf: Don't use "proxy" headers tracing/synthetic: Fix trace_string() return value tracing: Inform kmemleak of saved_cmdlines allocation tracing: Use ring_buffer_record_is_set_on() in tracer_tracing_is_on() tracing: Fix HAVE_DYNAMIC_FTRACE_WITH_REGS ifdef	2024-02-16 10:33:51 -08:00
Tejun Heo	fd0a68a233	workqueue, irq_work: Build fix for !CONFIG_IRQ_WORK `2f34d7337d` ("workqueue: Fix queue_work_on() with BH workqueues") added irq_work usage to workqueue; however, it turns out irq_work is actually optional and the change breaks build on configuration which doesn't have CONFIG_IRQ_WORK enabled. Fix build by making workqueue use irq_work only when CONFIG_SMP and enabling CONFIG_IRQ_WORK when CONFIG_SMP is set. It's reasonable to argue that it may be better to just always enable it. However, this still saves a small bit of memory for tiny UP configs and also the least amount of change, so, for now, let's keep it conditional. Verified to do the right thing for x86_64 allnoconfig and defconfig, and aarch64 allnoconfig, allnoconfig + prink disable (SMP but nothing selects IRQ_WORK) and a modified aarch64 Kconfig where !SMP and nothing selects IRQ_WORK. v2: `depends on SMP` leads to Kconfig warnings when CONFIG_IRQ_WORK is selected by something else when !CONFIG_SMP. Use `def_bool y if SMP` instead. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Tested-by: Anders Roxell <anders.roxell@linaro.org> Fixes: `2f34d7337d` ("workqueue: Fix queue_work_on() with BH workqueues") Cc: Stephen Rothwell <sfr@canb.auug.org.au>	2024-02-16 06:33:43 -10:00
Shrikanth Hegde	8cec3dd9e5	sched/core: Simplify code by removing duplicate #ifdefs There's a few cases of nested #ifdefs in the scheduler code that can be simplified: #ifdef DEFINE_A ...code block... #ifdef DEFINE_A <-- This is a duplicate. ...code block... #endif #else #ifndef DEFINE_A <-- This is also duplicate. ...code block... #endif #endif More details about the script and methods used to find these code patterns can be found at: https://lore.kernel.org/all/20240118080326.13137-1-sshegde@linux.ibm.com/ No change in functionality intended. [ mingo: Clarified the changelog. ] Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240216061433.535522-1-sshegde@linux.ibm.com	2024-02-16 09:37:15 +01:00
Matthieu Baerts (NGI0)	3738d710af	configs/debug: add NET debug config The debug.config file is really great to easily enable a bunch of general debugging features on a CI-like setup. But it would be great to also include core networking debugging config. A few CI's validating features from the Net tree also enable a few other debugging options on top of debug.config. A small selection is quite generic for the whole net tree. They validate some assumptions in different parts of the core net tree. As suggested by Jakub Kicinski in [1], having them added to this debug.config file would help other CIs using network features to find bugs in this area. Note that the two REFCNT configs also select REF_TRACKER, which doesn't seem to be an issue. Link: https://lore.kernel.org/netdev/20240202093148.33bd2b14@kernel.org/T/ [1] Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20240212-kconfig-debug-enable-net-v1-1-fb026de8174c@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-15 17:46:53 -08:00
Jakub Kicinski	73be9a3aab	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: net/core/dev.c `9f30831390` ("net: add rcu safety to rtnl_prop_list_size()") `723de3ebef` ("net: free altname using an RCU callback") net/unix/garbage.c `11498715f2` ("af_unix: Remove io_uring code for GC.") `25236c91b5` ("af_unix: Fix task hung while purging oob_skb in GC.") drivers/net/ethernet/renesas/ravb_main.c `ed4adc0720` ("net: ravb: Count packets instead of descriptors in GbEth RX path" ) `c2da940857` ("ravb: Add Rx checksum offload support for GbEth") net/mptcp/protocol.c `bdd70eb689` ("mptcp: drop the push_pending field") `28e5c13805` ("mptcp: annotate lockless accesses around read-mostly fields") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-15 16:20:04 -08:00
Yonghong Song	682158ab53	bpf: Fix test verif_scale_strobemeta_subprogs failure due to llvm19 With latest llvm19, I hit the following selftest failures with $ ./test_progs -j libbpf: prog 'on_event': BPF program load failed: Permission denied libbpf: prog 'on_event': -- BEGIN PROG LOAD LOG -- combined stack size of 4 calls is 544. Too large verification time 1344153 usec stack depth 24+440+0+32 processed 51008 insns (limit 1000000) max_states_per_insn 19 total_states 1467 peak_states 303 mark_read 146 -- END PROG LOAD LOG -- libbpf: prog 'on_event': failed to load: -13 libbpf: failed to load object 'strobemeta_subprogs.bpf.o' scale_test:FAIL:expect_success unexpected error: -13 (errno 13) #498 verif_scale_strobemeta_subprogs:FAIL The verifier complains too big of the combined stack size (544 bytes) which exceeds the maximum stack limit 512. This is a regression from llvm19 ([1]). In the above error log, the original stack depth is 24+440+0+32. To satisfy interpreter's need, in verifier the stack depth is adjusted to 32+448+32+32=544 which exceeds 512, hence the error. The same adjusted stack size is also used for jit case. But the jitted codes could use smaller stack size. $ egrep -r stack_depth \| grep round_up arm64/net/bpf_jit_comp.c: ctx->stack_size = round_up(prog->aux->stack_depth, 16); loongarch/net/bpf_jit.c: bpf_stack_adjust = round_up(ctx->prog->aux->stack_depth, 16); powerpc/net/bpf_jit_comp.c: cgctx.stack_size = round_up(fp->aux->stack_depth, 16); riscv/net/bpf_jit_comp32.c: round_up(ctx->prog->aux->stack_depth, STACK_ALIGN); riscv/net/bpf_jit_comp64.c: bpf_stack_adjust = round_up(ctx->prog->aux->stack_depth, 16); s390/net/bpf_jit_comp.c: u32 stack_depth = round_up(fp->aux->stack_depth, 8); sparc/net/bpf_jit_comp_64.c: stack_needed += round_up(stack_depth, 16); x86/net/bpf_jit_comp.c: EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8)); x86/net/bpf_jit_comp.c: int tcc_off = -4 - round_up(stack_depth, 8); x86/net/bpf_jit_comp.c: round_up(stack_depth, 8)); x86/net/bpf_jit_comp.c: int tcc_off = -4 - round_up(stack_depth, 8); x86/net/bpf_jit_comp.c: EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8)); In the above, STACK_ALIGN in riscv/net/bpf_jit_comp32.c is defined as 16. So stack is aligned in either 8 or 16, x86/s390 having 8-byte stack alignment and the rest having 16-byte alignment. This patch calculates total stack depth based on 16-byte alignment if jit is requested. For the above failing case, the new stack size will be 32+448+0+32=512 and no verification failure. llvm19 regression will be discussed separately in llvm upstream. The verifier change caused three test failures as these tests compared messages with stack size. More specifically, - test_global_funcs/global_func1: fail with interpreter mode and success with jit mode. Adjusted stack sizes so both jit and interpreter modes will fail. - async_stack_depth/{pseudo_call_check, async_call_root_check}: since jit and interpreter will calculate different stack sizes, the failure msg is adjusted to omit those specific stack size numbers. [1] https://lore.kernel.org/bpf/32bde0f0-1881-46c9-931a-673be566c61d@linux.dev/ Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240214232951.4113094-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-02-15 13:45:27 -08:00
Andrii Nakryiko	57354f5fde	bpf: improve duplicate source code line detection Verifier log avoids printing the same source code line multiple times when a consecutive block of BPF assembly instructions are covered by the same original (C) source code line. This greatly improves verifier log legibility. Unfortunately, this check is imperfect and in production applications it quite often happens that verifier log will have multiple duplicated source lines emitted, for no apparently good reason. E.g., this is excerpt from a real-world BPF application (with register states omitted for clarity): BEFORE ====== ; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394 5369: (07) r8 += 2 ; 5370: (07) r7 += 16 ; ; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394 5371: (07) r9 += 1 ; 5372: (79) r4 = (u64 )(r10 -32) ; ; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394 5373: (55) if r9 != 0xf goto pc+2 ; if (i >= map->cnt) @ strobemeta_probe.bpf.c:396 5376: (79) r1 = (u64 )(r10 -40) ; 5377: (79) r1 = (u64 )(r1 +8) ; ; if (i >= map->cnt) @ strobemeta_probe.bpf.c:396 5378: (dd) if r1 s<= r9 goto pc-5 ; ; descr->key_lens[i] = 0; @ strobemeta_probe.bpf.c:398 5379: (b4) w1 = 0 ; 5380: (6b) (u16 )(r8 -30) = r1 ; ; task, data, off, STROBE_MAX_STR_LEN, map->entries[i].key); @ strobemeta_probe.bpf.c:400 5381: (79) r3 = (u64 )(r7 -8) ; 5382: (7b) (u64 )(r10 -24) = r6 ; ; task, data, off, STROBE_MAX_STR_LEN, map->entries[i].key); @ strobemeta_probe.bpf.c:400 5383: (bc) w6 = w6 ; ; barrier_var(payload_off); @ strobemeta_probe.bpf.c:280 5384: (bf) r2 = r6 ; 5385: (bf) r1 = r4 ; As can be seen, line 394 is emitted thrice, 396 is emitted twice, and line 400 is duplicated as well. Note that there are no intermingling other lines of source code in between these duplicates, so the issue is not compiler reordering assembly instruction such that multiple original source code lines are in effect. It becomes more obvious what's going on if we look at full original line info information (using btfdump for this, [0]): #2764: line: insn #5363 --> 394:3 @ ./././strobemeta_probe.bpf.c for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { #2765: line: insn #5373 --> 394:21 @ ./././strobemeta_probe.bpf.c for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { #2766: line: insn #5375 --> 394:47 @ ./././strobemeta_probe.bpf.c for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { #2767: line: insn #5377 --> 394:3 @ ./././strobemeta_probe.bpf.c for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { #2768: line: insn #5378 --> 414:10 @ ./././strobemeta_probe.bpf.c return off; We can see that there are four line info records covering instructions #5363 through #5377 (instruction indices are shifted due to subprog instruction being appended to main program), all of them are pointing to the same C source code line #394. But each of them points to a different part of that line, which is denoted by differing column numbers (3, 21, 47, 3). But verifier log doesn't distinguish between parts of the same source code line and doesn't emit this column number information, so for end user it's just a repetitive visual noise. So let's improve the detection of repeated source code line and avoid this. With the changes in this patch, we get this output for the same piece of BPF program log: AFTER ===== ; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394 5369: (07) r8 += 2 ; 5370: (07) r7 += 16 ; 5371: (07) r9 += 1 ; 5372: (79) r4 = (u64 )(r10 -32) ; 5373: (55) if r9 != 0xf goto pc+2 ; if (i >= map->cnt) @ strobemeta_probe.bpf.c:396 5376: (79) r1 = (u64 )(r10 -40) ; 5377: (79) r1 = (u64 )(r1 +8) ; 5378: (dd) if r1 s<= r9 goto pc-5 ; ; descr->key_lens[i] = 0; @ strobemeta_probe.bpf.c:398 5379: (b4) w1 = 0 ; 5380: (6b) (u16 )(r8 -30) = r1 ; ; task, data, off, STROBE_MAX_STR_LEN, map->entries[i].key); @ strobemeta_probe.bpf.c:400 5381: (79) r3 = (u64 )(r7 -8) ; 5382: (7b) (u64 )(r10 -24) = r6 ; 5383: (bc) w6 = w6 ; ; barrier_var(payload_off); @ strobemeta_probe.bpf.c:280 5384: (bf) r2 = r6 ; 5385: (bf) r1 = r4 ; All the duplication is gone and the log is cleaner and less distracting. [0] https://github.com/anakryiko/btfdump Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240214174100.2847419-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-02-15 13:00:48 -08:00
Thomas Gleixner	9bbe13a5d4	genirq/msi: Provide MSI_FLAG_PARENT_PM_DEV Some platform-MSI implementations require that power management is redirected to the underlying interrupt chip device. To make this work with per device MSI domains provide a new feature flag and let the core code handle the setup of dev->pm_dev when set during device MSI domain creation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240127161753.114685-14-apatel@ventanamicro.com	2024-02-15 17:55:41 +01:00
Thomas Gleixner	e49312fe09	genirq/irqdomain: Reroute device MSI create_mapping Reroute interrupt allocation in irq_create_fwspec_mapping() if the domain is a MSI device domain. This is required to convert the support for wire to MSI bridges to per device MSI domains. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240127161753.114685-13-apatel@ventanamicro.com	2024-02-15 17:55:41 +01:00
Thomas Gleixner	0ee1578b00	genirq/msi: Provide allocation/free functions for "wired" MSI interrupts To support wire to MSI bridges proper in the MSI core infrastructure it is required to have separate allocation/free interfaces which can be invoked from the regular irqdomain allocaton/free functions. The mechanism for allocation is: - Allocate the next free MSI descriptor index in the domain - Store the hardware interrupt number and the trigger type which was extracted by the irqdomain core from the firmware spec in the MSI descriptor device cookie so it can be retrieved by the underlying interrupt domain and interrupt chip - Use the regular MSI allocation mechanism for the newly allocated index which returns a fully initialized Linux interrupt on succes This works because: - the domains have a fixed size - each hardware interrupt is only allocated once - the underlying domain does not care about the MSI index it only cares about the hardware interrupt number and the trigger type The free function looks up the MSI index in the MSI descriptor of the provided Linux interrupt number and uses the regular index based free functions of the MSI core. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240127161753.114685-12-apatel@ventanamicro.com	2024-02-15 17:55:41 +01:00
Thomas Gleixner	9d1c58c800	genirq/msi: Optionally use dev->fwnode for device domain To support wire to MSI domains via the MSI infrastructure it is required to use the firmware node of the device which implements this for creating the MSI domain. Otherwise the existing firmware match mechanisms to find the correct irqdomain for a wired interrupt which is connected to a wire to MSI bridge would fail. This cannot be used for the general case because not all devices provide firmware nodes and all regular per device MSI domains are directly associated to the device and have not be searched for. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240127161753.114685-11-apatel@ventanamicro.com	2024-02-15 17:55:41 +01:00
Thomas Gleixner	3095cc0d5b	genirq/msi: Split msi_domain_alloc_irq_at() In preparation for providing a special allocation function for wired interrupts which are connected to a wire to MSI bridge, split the inner workings of msi_domain_alloc_irq_at() out into a helper function so the code can be shared. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240127161753.114685-9-apatel@ventanamicro.com	2024-02-15 17:55:40 +01:00
Thomas Gleixner	9c78c1a85c	genirq/msi: Provide optional translation op irq_create_fwspec_mapping() requires translation of the firmware spec to a hardware interrupt number and the trigger type information. Wired interrupts which are connected to a wire to MSI bridge, like MBIGEN are allocated that way. So far MBIGEN provides a regular irqdomain which then hooks backwards into the MSI infrastructure. That's an unholy mess and will be replaced with per device MSI domains which are regular MSI domains. Interrupts on MSI domains are not supported by irq_create_fwspec_mapping(), but for making the wire to MSI bridges sane it makes sense to provide a special allocation/free interface in the MSI infrastructure. That avoids the backdoors into the core MSI allocation code and just shares all the regular MSI infrastructure. Provide an optional translation callback in msi_domain_ops which can be utilized by these wire to MSI bridges. No other MSI domain should provide a translation callback. The default translation callback of the MSI irqdomains will warn when it is invoked on a non-prepared MSI domain. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240127161753.114685-8-apatel@ventanamicro.com	2024-02-15 17:55:40 +01:00
Thomas Gleixner	de1ff306dc	genirq/irqdomain: Remove the param count restriction from select() Now that the GIC-v3 callback can handle invocation with a fwspec parameter count of 0 lift the restriction in the core code and invoke select() unconditionally when the domain provides it. Preparatory change for per device MSI domains. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240127161753.114685-3-apatel@ventanamicro.com	2024-02-15 17:55:39 +01:00
Thorsten Blum	9b6326354c	tracing/synthetic: Fix trace_string() return value Fix trace_string() by assigning the string length to the return variable which got lost in commit `ddeea494a1` ("tracing/synthetic: Use union instead of casts") and caused trace_string() to always return 0. Link: https://lore.kernel.org/linux-trace-kernel/20240214220555.711598-1-thorsten.blum@toblux.com Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: `ddeea494a1` ("tracing/synthetic: Use union instead of casts") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-02-15 11:40:01 -05:00
Andrii Nakryiko	a4561f5afe	bpf: Use O(log(N)) binary search to find line info record Real-world BPF applications keep growing in size. Medium-sized production application can easily have 50K+ verified instructions, and its line info section in .BTF.ext has more than 3K entries. When verifier emits log with log_level>=1, it annotates assembly code with matched original C source code. Currently it uses linear search over line info records to find a match. As complexity of BPF applications grows, this O(K * N) approach scales poorly. So, let's instead of linear O(N) search for line info record use faster equivalent O(log(N)) binary search algorithm. It's not a plain binary search, as we don't look for exact match. It's an upper bound search variant, looking for rightmost line info record that starts at or before given insn_off. Some unscientific measurements were done before and after this change. They were done in VM and fluctuate a bit, but overall the speed up is undeniable. BASELINE ======== File Program Duration (us) Insns -------------------------------- ---------------- ------------- ------ katran.bpf.o balancer_ingress 2497130 343552 pyperf600.bpf.linked3.o on_event 12389611 627288 strobelight_pyperf_libbpf.o on_py_event 387399 52445 -------------------------------- ---------------- ------------- ------ BINARY SEARCH ============= File Program Duration (us) Insns -------------------------------- ---------------- ------------- ------ katran.bpf.o balancer_ingress 2339312 343552 pyperf600.bpf.linked3.o on_event 5602203 627288 strobelight_pyperf_libbpf.o on_py_event 294761 52445 -------------------------------- ---------------- ------------- ------ While Katran's speed up is pretty modest (about 105ms, or 6%), for production pyperf BPF program (on_py_event) it's much greater already, going from 387ms down to 295ms (23% improvement). Looking at BPF selftests's biggest pyperf example, we can see even more dramatic improvement, shaving more than 50% of time, going from 12.3s down to 5.6s. Different amount of improvement is the function of overall amount of BPF assembly instructions in .bpf.o files (which contributes to how much line info records there will be and thus, on average, how much time linear search will take), among other things: $ llvm-objdump -d katran.bpf.o \| wc -l 3863 $ llvm-objdump -d strobelight_pyperf_libbpf.o \| wc -l 6997 $ llvm-objdump -d pyperf600.bpf.linked3.o \| wc -l 87854 Granted, this only applies to debugging cases (e.g., using veristat, or failing verification in production), but seems worth doing to improve overall developer experience anyways. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240214002311.2197116-1-andrii@kernel.org	2024-02-14 23:53:42 +01:00
Tejun Heo	2f34d7337d	workqueue: Fix queue_work_on() with BH workqueues When queue_work_on() is used to queue a BH work item on a remote CPU, the work item is queued on that CPU but kick_pool() raises softirq on the local CPU. This leads to stalls as the work item won't be executed until something else on the remote CPU schedules a BH work item or tasklet locally. Fix it by bouncing raising softirq to the target CPU using per-cpu irq_work. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `4cb1ef6460` ("workqueue: Implement BH workqueues to eventually replace tasklets")	2024-02-14 08:33:55 -10:00
Steven Rostedt (Google)	2394ac4145	tracing: Inform kmemleak of saved_cmdlines allocation The allocation of the struct saved_cmdlines_buffer structure changed from: s = kmalloc(sizeof(s), GFP_KERNEL); s->saved_cmdlines = kmalloc_array(TASK_COMM_LEN, val, GFP_KERNEL); to: orig_size = sizeof(s) + val * TASK_COMM_LEN; order = get_order(orig_size); size = 1 << (order + PAGE_SHIFT); page = alloc_pages(GFP_KERNEL, order); if (!page) return NULL; s = page_address(page); memset(s, 0, sizeof(*s)); s->saved_cmdlines = kmalloc_array(TASK_COMM_LEN, val, GFP_KERNEL); Where that s->saved_cmdlines allocation looks to be a dangling allocation to kmemleak. That's because kmemleak only keeps track of kmalloc() allocations. For allocations that use page_alloc() directly, the kmemleak needs to be explicitly informed about it. Add kmemleak_alloc() and kmemleak_free() around the page allocation so that it doesn't give the following false positive: unreferenced object 0xffff8881010c8000 (size 32760): comm "swapper", pid 0, jiffies 4294667296 hex dump (first 32 bytes): ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ backtrace (crc ae6ec1b9): [<ffffffff86722405>] kmemleak_alloc+0x45/0x80 [<ffffffff8414028d>] __kmalloc_large_node+0x10d/0x190 [<ffffffff84146ab1>] __kmalloc+0x3b1/0x4c0 [<ffffffff83ed7103>] allocate_cmdlines_buffer+0x113/0x230 [<ffffffff88649c34>] tracer_alloc_buffers.isra.0+0x124/0x460 [<ffffffff8864a174>] early_trace_init+0x14/0xa0 [<ffffffff885dd5ae>] start_kernel+0x12e/0x3c0 [<ffffffff885f5758>] x86_64_start_reservations+0x18/0x30 [<ffffffff885f582b>] x86_64_start_kernel+0x7b/0x80 [<ffffffff83a001c3>] secondary_startup_64_no_verify+0x15e/0x16b Link: https://lore.kernel.org/linux-trace-kernel/87r0hfnr9r.fsf@kernel.org/ Link: https://lore.kernel.org/linux-trace-kernel/20240214112046.09a322d6@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Fixes: `44dc5c41b5` ("tracing: Fix wasted memory in saved_cmdlines logic") Reported-by: Kalle Valo <kvalo@kernel.org> Tested-by: Kalle Valo <kvalo@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-02-14 12:36:34 -05:00
Onkarnath	c90e3ecc91	rcu/sync: remove un-used rcu_sync_enter_start function With commit '6a010a49b63a ("cgroup: Make !percpu threadgroup_rwsem operations optional")' usage of rcu_sync_enter_start is removed. So this function can also be removed. In the words of Oleg Nesterov: __rcu_sync_enter(wait => false) is a better alternative if someone needs rcu_sync_enter_start() again. Link: https://lore.kernel.org/all/20220725121208.GB28662@redhat.com/ Signed-off-by: Onkarnath <onkarnath.1@samsung.com> Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 08:00:57 -08:00
Paul E. McKenney	fd2a749d3f	rcutorture: Suppress rtort_pipe_count warnings until after stalls Currently, if rcu_torture_writer() sees fewer than ten grace periods having elapsed during a call to stutter_wait() that actually waited, the rtort_pipe_count warning is emitted. This has worked well for a long time. Except that the rcutorture TREE07 scenario now does a short-term 14-second RCU CPU stall, which can most definitely case false-positive rtort_pipe_count warnings. This commit therefore changes rcu_torture_writer() to compute the full expected holdoff and stall duration, and to refuse to report any rtort_pipe_count warnings until after all stalls have completed. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 08:00:57 -08:00
Joel Fernandes (Google)	67050837ec	srcu: Improve comments about acceleration leak The comments added in commit 1ef990c4b36b ("srcu: No need to advance/accelerate if no callback enqueued") are a bit confusing. The comments are describing a scenario for code that was moved and is no longer the way it was (snapshot after advancing). Improve the code comments to reflect this and also document why acceleration can never fail. Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 08:00:57 -08:00
Qais Yousef	7f66f099de	rcu: Provide a boot time parameter to control lazy RCU To allow more flexible arrangements while still provide a single kernel for distros, provide a boot time parameter to enable/disable lazy RCU. Specify: rcutree.enable_rcu_lazy=[y\|1\|n\|0] Which also requires rcu_nocbs=all at boot time to enable/disable lazy RCU. To disable it by default at build time when CONFIG_RCU_LAZY=y, the new CONFIG_RCU_LAZY_DEFAULT_OFF can be used. Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Tested-by: Andrea Righi <andrea.righi@canonical.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 08:00:57 -08:00
Frederic Weisbecker	499d7e7e83	rcu: Rename jiffies_till_flush to jiffies_lazy_flush The variable name jiffies_till_flush is too generic and therefore: * It may shadow a global variable * It doesn't tell on what it operates Make the name more precise, along with the related APIs. Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 08:00:57 -08:00
Paul E. McKenney	3b239b308e	context_tracking: Fix kerneldoc headers for __ct_user_{enter,exit}() Document the "state" parameter of both of these functions. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202312041922.YZCcEPYD-lkp@intel.com/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:53:50 -08:00
Frederic Weisbecker	23da2ad64d	rcu/exp: Remove rcu_par_gp_wq TREE04 running on short iterations can produce writer stalls of the following kind: ??? Writer stall state RTWS_EXP_SYNC(4) g3968 f0x0 ->state 0x2 cpu 0 task:rcu_torture_wri state:D stack:14568 pid:83 ppid:2 flags:0x00004000 Call Trace: <TASK> __schedule+0x2de/0x850 ? trace_event_raw_event_rcu_exp_funnel_lock+0x6d/0xb0 schedule+0x4f/0x90 synchronize_rcu_expedited+0x430/0x670 ? __pfx_autoremove_wake_function+0x10/0x10 ? __pfx_synchronize_rcu_expedited+0x10/0x10 do_rtws_sync.constprop.0+0xde/0x230 rcu_torture_writer+0x4b4/0xcd0 ? __pfx_rcu_torture_writer+0x10/0x10 kthread+0xc7/0xf0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2f/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> Waiting for an expedited grace period and polling for an expedited grace period both are operations that internally rely on the same workqueue performing necessary asynchronous work. However, a dependency chain is involved between those two operations, as depicted below: ====== CPU 0 ======= ====== CPU 1 ======= synchronize_rcu_expedited() exp_funnel_lock() mutex_lock(&rcu_state.exp_mutex); start_poll_synchronize_rcu_expedited queue_work(rcu_gp_wq, &rnp->exp_poll_wq); synchronize_rcu_expedited_queue_work() queue_work(rcu_gp_wq, &rew->rew_work); wait_event() // A, wait for &rew->rew_work completion mutex_unlock() // B //======> switch to kworker sync_rcu_do_polled_gp() { synchronize_rcu_expedited() exp_funnel_lock() mutex_lock(&rcu_state.exp_mutex); // C, wait B .... } // D Since workqueues are usually implemented on top of several kworkers handling the queue concurrently, the above situation wouldn't deadlock most of the time because A then doesn't depend on D. But in case of memory stress, a single kworker may end up handling alone all the works in a serialized way. In that case the above layout becomes a problem because A then waits for D, closing a circular dependency: A -> D -> C -> B -> A This however only happens when CONFIG_RCU_EXP_KTHREAD=n. Indeed synchronize_rcu_expedited() is otherwise implemented on top of a kthread worker while polling still relies on rcu_gp_wq workqueue, breaking the above circular dependency chain. Fix this with making expedited grace period to always rely on kthread worker. The workqueue based implementation is essentially a duplicate anyway now that the per-node initialization is performed by per-node kthread workers. Meanwhile the CONFIG_RCU_EXP_KTHREAD switch is still kept around to manage the scheduler policy of these kthread workers. Reported-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Reported-by: Thomas Gleixner <tglx@linutronix.de> Suggested-by: Joel Fernandes <joel@joelfernandes.org> Suggested-by: Paul E. McKenney <paulmck@kernel.org> Suggested-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:51:36 -08:00
Frederic Weisbecker	b67cffcbbf	rcu/exp: Handle parallel exp gp kworkers affinity Affine the parallel expedited gp kworkers to their respective RCU node in order to make them close to the cache their are playing with. This reuses the boost kthreads machinery that probe into CPU hotplug operations such that the kthreads become/stay affine to their respective node as soon/long as they contain online CPUs. Otherwise and if the current CPU going down was the last online on the leaf node, the related kthread is affine to the housekeeping CPUs. In the long run, this affinity VS CPU hotplug operation game should probably be implemented at the generic kthread level. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> [boqun: s/* rcu_boost_task/*rcu_boost_task as reported by checkpatch] Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:51:36 -08:00
Frederic Weisbecker	8e5e621566	rcu/exp: Make parallel exp gp kworker per rcu node When CONFIG_RCU_EXP_KTHREAD=n, the expedited grace period per node initialization is performed in parallel via workqueues (one work per node). However in CONFIG_RCU_EXP_KTHREAD=y, this per node initialization is performed by a single kworker serializing each node initialization (one work for all nodes). The second part is certainly less scalable and efficient beyond a single leaf node. To improve this, expand this single kworker into per-node kworkers. This new layout is eventually intended to remove the workqueues based implementation since it will essentially now become duplicate code. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:51:36 -08:00
Frederic Weisbecker	c19e5d3b49	rcu/exp: Move expedited kthread worker creation functions above rcutree_prepare_cpu() The expedited kthread worker performing the per node initialization is going to be split into per node kthreads. As such, the future per node kthread creation will need to be called from CPU hotplug callbacks instead of an initcall, right beside the per node boost kthread creation. To prepare for that, move the kthread worker creation above rcutree_prepare_cpu() as a first step to make the review smoother for the upcoming modifications. No intended functional change. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:51:36 -08:00
Frederic Weisbecker	7836b27060	rcu: s/boost_kthread_mutex/kthread_mutex This mutex is currently protecting per node boost kthreads creation and affinity setting across CPU hotplug operations. Since the expedited kworkers will soon be split per node as well, they will be subject to the same concurrency constraints against hotplug. Therefore their creation and affinity tuning operations will be grouped with those of boost kthreads and then rely on the same mutex. To prepare for that, generalize its name. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:51:36 -08:00
Frederic Weisbecker	e7539ffc9a	rcu/exp: Handle RCU expedited grace period kworker allocation failure Just like is done for the kworker performing nodes initialization, gracefully handle the possible allocation failure of the RCU expedited grace period main kworker. While at it perform a rename of the related checking functions to better reflect the expedited specifics. Reviewed-by: Kalesh Singh <kaleshsingh@google.com> Fixes: `9621fbee44` ("rcu: Move expedited grace period (GP) work to RT kthread_worker") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:51:36 -08:00
Frederic Weisbecker	a636c5e6f8	rcu/exp: Fix RCU expedited parallel grace period kworker allocation failure recovery Under CONFIG_RCU_EXP_KTHREAD=y, the nodes initialization for expedited grace periods is queued to a kworker. However if the allocation of that kworker failed, the nodes initialization is performed synchronously by the caller instead. Now the check for kworker initialization failure relies on the kworker pointer to be NULL while its value might actually encapsulate an allocation failure error. Make sure to handle this case. Reviewed-by: Kalesh Singh <kaleshsingh@google.com> Fixes: `9621fbee44` ("rcu: Move expedited grace period (GP) work to RT kthread_worker") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:51:36 -08:00
Frederic Weisbecker	a7e4074dcc	rcu/exp: Remove full barrier upon main thread wakeup When an expedited grace period is ending, care must be taken so that all the quiescent states propagated up to the root are correctly ordered against the wake up of the main expedited grace period workqueue. This ordering is already carried through the root rnp locking augmented by an smp_mb__after_unlock_lock() barrier. Therefore the explicit smp_mb() placed before the wake up is not needed and can be removed. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:51:35 -08:00
Zqiang	f3c4c00784	rcu/nocb: Check rdp_gp->nocb_timer in __call_rcu_nocb_wake() Currently, only rdp_gp->nocb_timer is used, for nocb_timer of no-rdp_gp structure, the timer_pending() is always return false, this commit therefore need to check rdp_gp->nocb_timer in __call_rcu_nocb_wake(). Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:50:45 -08:00
Zqiang	dda98810b5	rcu/nocb: Fix WARN_ON_ONCE() in the rcu_nocb_bypass_lock() For the kernels built with CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y and CONFIG_RCU_LAZY=y, the following scenarios will trigger WARN_ON_ONCE() in the rcu_nocb_bypass_lock() and rcu_nocb_wait_contended() functions: CPU2 CPU11 kthread rcu_nocb_cb_kthread ksys_write rcu_do_batch vfs_write rcu_torture_timer_cb proc_sys_write __kmem_cache_free proc_sys_call_handler kmemleak_free drop_caches_sysctl_handler delete_object_full drop_slab __delete_object shrink_slab put_object lazy_rcu_shrink_scan call_rcu rcu_nocb_flush_bypass __call_rcu_commn rcu_nocb_bypass_lock raw_spin_trylock(&rdp->nocb_bypass_lock) fail atomic_inc(&rdp->nocb_lock_contended); rcu_nocb_wait_contended WARN_ON_ONCE(smp_processor_id() != rdp->cpu); WARN_ON_ONCE(atomic_read(&rdp->nocb_lock_contended)) \| \|_ _ _ _ _ _ _ _ _ _same rdp and rdp->cpu != 11_ _ _ _ _ _ _ _ _ __\| Reproduce this bug with "echo 3 > /proc/sys/vm/drop_caches". This commit therefore uses rcu_nocb_try_flush_bypass() instead of rcu_nocb_flush_bypass() in lazy_rcu_shrink_scan(). If the nocb_bypass queue is being flushed, then rcu_nocb_try_flush_bypass will return directly. Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:50:45 -08:00
Frederic Weisbecker	afd4e69647	rcu/nocb: Re-arrange call_rcu() NOCB specific code Currently the call_rcu() function interleaves NOCB and !NOCB enqueue code in a complicated way such that: * The bypass enqueue code may or may not have enqueued and may or may not have locked the ->nocb_lock. Everything that follows is in a Schrödinger locking state for the unwary reviewer's eyes. * The was_alldone is always set but only used in NOCB related code. * The NOCB wake up is distantly related to the locking hopefully performed by the bypass enqueue code that did not enqueue on the bypass list. Unconfuse the whole and gather NOCB and !NOCB specific enqueue code to their own functions. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:50:45 -08:00
Frederic Weisbecker	b913c3fe68	rcu/nocb: Make IRQs disablement symmetric Currently IRQs are disabled on call_rcu() and then depending on the context: * If the CPU is in nocb mode: - If the callback is enqueued in the bypass list, IRQs are re-enabled implictly by rcu_nocb_try_bypass() - If the callback is enqueued in the normal list, IRQs are re-enabled implicitly by __call_rcu_nocb_wake() * If the CPU is NOT in nocb mode, IRQs are reenabled explicitly from call_rcu() This makes the code a bit hard to follow, especially as it interleaves with nocb locking. To make the IRQ flags coverage clearer and also in order to prepare for moving all the nocb enqueue code to its own function, always re-enable the IRQ flags explicitly from call_rcu(). Reviewed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:50:45 -08:00
Frederic Weisbecker	1e8e6951a5	rcu/nocb: Remove needless full barrier after callback advancing A full barrier is issued from nocb_gp_wait() upon callbacks advancing to order grace period completion with callbacks execution. However these two events are already ordered by the smp_mb__after_unlock_lock() barrier within the call to raw_spin_lock_rcu_node() that is necessary for callbacks advancing to happen. The following litmus test shows the kind of guarantee that this barrier provides: C smp_mb__after_unlock_lock {} // rcu_gp_cleanup() P0(spinlock_t rnp_lock, int gpnum) { // Grace period cleanup increase gp sequence number spin_lock(rnp_lock); WRITE_ONCE(gpnum, 1); spin_unlock(rnp_lock); } // nocb_gp_wait() P1(spinlock_t rnp_lock, spinlock_t nocb_lock, int gpnum, int cb_ready) { int r1; // Call rcu_advance_cbs() from nocb_gp_wait() spin_lock(nocb_lock); spin_lock(rnp_lock); smp_mb__after_unlock_lock(); r1 = READ_ONCE(gpnum); WRITE_ONCE(cb_ready, 1); spin_unlock(rnp_lock); spin_unlock(nocb_lock); } // nocb_cb_wait() P2(spinlock_t nocb_lock, int cb_ready, int cb_executed) { int r2; // rcu_do_batch() -> rcu_segcblist_extract_done_cbs() spin_lock(nocb_lock); r2 = READ_ONCE(cb_ready); spin_unlock(nocb_lock); // Actual callback execution WRITE_ONCE(cb_executed, 1); } P3(int cb_executed, int gpnum) { int r3; WRITE_ONCE(cb_executed, 2); smp_mb(); r3 = READ_ONCE(gpnum); } exists (1:r1=1 /\ 2:r2=1 /\ cb_executed=2 /\ 3:r3=0) (* Bad outcome. *) Here the bad outcome only occurs if the smp_mb__after_unlock_lock() is removed. This barrier orders the grace period completion against callbacks advancing and even later callbacks invocation, thanks to the opportunistic propagation via the ->nocb_lock to nocb_cb_wait(). Therefore the smp_mb() placed after callbacks advancing can be safely removed. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:50:45 -08:00
Frederic Weisbecker	ca16265aaf	rcu/nocb: Remove needless LOAD-ACQUIRE The LOAD-ACQUIRE access performed on rdp->nocb_cb_sleep advertizes ordering callback execution against grace period completion. However this is contradicted by the following: * This LOAD-ACQUIRE doesn't pair with anything. The only counterpart barrier that can be found is the smp_mb() placed after callbacks advancing in nocb_gp_wait(). However the barrier is placed _after_ ->nocb_cb_sleep write. * Callbacks can be concurrently advanced between the LOAD-ACQUIRE on ->nocb_cb_sleep and the call to rcu_segcblist_extract_done_cbs() in rcu_do_batch(), making any ordering based on ->nocb_cb_sleep broken. * Both rcu_segcblist_extract_done_cbs() and rcu_advance_cbs() are called under the nocb_lock, the latter hereby providing already the desired ACQUIRE semantics. Therefore it is safe to access ->nocb_cb_sleep with a simple compiler barrier. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:50:44 -08:00
Ingo Molnar	4589f199eb	Merge branch 'x86/bugs' into x86/core, to pick up pending changes before dependent patches Merge in pending alternatives patching infrastructure changes, before applying more patches. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2024-02-14 10:49:37 +01:00
Andrii Nakryiko	7cc13adbd0	bpf: emit source code file name and line number in verifier log As BPF applications grow in size and complexity and are separated into multiple .bpf.c files that are statically linked together, it becomes harder and harder to match verifier's BPF assembly level output to original C code. While often annotated C source code is unique enough to be able to identify the file it belongs to, quite often this is actually problematic as parts of source code can be quite generic. Long story short, it is very useful to see source code file name and line number information along with the original C code. Verifier already knows this information, we just need to output it. This patch extends verifier log with file name and line number information, emitted next to original (presumably C) source code, annotating BPF assembly output, like so: ; <original C code> @ <filename>.bpf.c:<line> If file name has directory names in it, they are stripped away. This should be fine in practice as file names tend to be pretty unique with C code anyways, and keeping log size smaller is always good. In practice this might look something like below, where some code is coming from application files, while others are from libbpf's usdt.bpf.h header file: ; if (STROBEMETA_READ( @ strobemeta_probe.bpf.c:534 5592: (79) r1 = (u64 )(r10 -56) ; R1_w=mem_or_null(id=1589,sz=7680) R10=fp0 5593: (7b) (u64 )(r10 -56) = r1 ; R1_w=mem_or_null(id=1589,sz=7680) R10=fp0 5594: (79) r3 = (u64 )(r10 -8) ; R3_w=scalar() R10=fp0 fp-8=mmmmmmmm ... 170: (71) r1 = (u8 )(r8 +15) ; frame1: R1_w=scalar(...) R8_w=map_value(map=__bpf_usdt_spec,ks=4,vs=208) 171: (67) r1 <<= 56 ; frame1: R1_w=scalar(...) 172: (c7) r1 s>>= 56 ; frame1: R1_w=scalar(smin=smin32=-128,smax=smax32=127) ; val <<= arg_spec->arg_bitshift; @ usdt.bpf.h:183 173: (67) r1 <<= 32 ; frame1: R1_w=scalar(...) 174: (77) r1 >>= 32 ; frame1: R1_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) 175: (79) r2 = (u64 )(r10 -8) ; frame1: R2_w=scalar() R10=fp0 fp-8=mmmmmmmm 176: (6f) r2 <<= r1 ; frame1: R1_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R2_w=scalar() 177: (7b) (u64 )(r10 -8) = r2 ; frame1: R2_w=scalar(id=61) R10=fp0 fp-8_w=scalar(id=61) ; if (arg_spec->arg_signed) @ usdt.bpf.h:184 178: (bf) r3 = r2 ; frame1: R2_w=scalar(id=61) R3_w=scalar(id=61) 179: (7f) r3 >>= r1 ; frame1: R1_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R3_w=scalar() ; if (arg_spec->arg_signed) @ usdt.bpf.h:184 180: (71) r4 = (u8 )(r8 +14) 181: safe log_fixup tests needed a minor adjustment as verifier log output increased a bit and that test is quite sensitive to such changes. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240212235944.2816107-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-02-13 18:51:32 -08:00
Andrii Nakryiko	879bbe7aa4	bpf: don't infer PTR_TO_CTX for programs with unnamed context type For program types that don't have named context type name (e.g., BPF iterator programs or tracepoint programs), ctx_tname will be a non-NULL empty string. For such programs it shouldn't be possible to have PTR_TO_CTX argument for global subprogs based on type name alone. arg:ctx tag is the only way to have PTR_TO_CTX passed into global subprog for such program types. Fix this loophole, which currently would assume PTR_TO_CTX whenever user uses a pointer to anonymous struct as an argument to their global subprogs. This happens in practice with the following (quite common, in practice) approach: typedef struct { /* anonymous / int x; } my_type_t; int my_subprog(my_type_t arg) { ... } User's intent is to have PTR_TO_MEM argument for `arg`, but verifier will complain about expecting PTR_TO_CTX. This fix also closes unintended s390x-specific KPROBE handling of PTR_TO_CTX case. Selftest change is necessary to accommodate this. Fixes: `91cc1a9974` ("bpf: Annotate context types") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240212233221.2575350-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-02-13 18:46:47 -08:00
Andrii Nakryiko	824c58fb10	bpf: handle bpf_user_pt_regs_t typedef explicitly for PTR_TO_CTX global arg Expected canonical argument type for global function arguments representing PTR_TO_CTX is `bpf_user_pt_regs_t *ctx`. This currently works on s390x by accident because kernel resolves such typedef to underlying struct (which is anonymous on s390x), and erroneously accepting it as expected context type. We are fixing this problem next, which would break s390x arch, so we need to handle `bpf_user_pt_regs_t` case explicitly for KPROBE programs. Fixes: `91cc1a9974` ("bpf: Annotate context types") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240212233221.2575350-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-02-13 18:46:47 -08:00
Andrii Nakryiko	fb5b86cfd4	bpf: simplify btf_get_prog_ctx_type() into btf_is_prog_ctx_type() Return result of btf_get_prog_ctx_type() is never used and callers only check NULL vs non-NULL case to determine if given type matches expected PTR_TO_CTX type. So rename function to `btf_is_prog_ctx_type()` and return a simple true/false. We'll use this simpler interface to handle kprobe program type's special typedef case in the next patch. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240212233221.2575350-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-02-13 18:46:46 -08:00
Oliver Crumrine	32e18e7688	bpf: remove check in __cgroup_bpf_run_filter_skb Originally, this patch removed a redundant check in BPF_CGROUP_RUN_PROG_INET_EGRESS, as the check was already being done in the function it called, __cgroup_bpf_run_filter_skb. For v2, it was reccomended that I remove the check from __cgroup_bpf_run_filter_skb, and add the checks to the other macro that calls that function, BPF_CGROUP_RUN_PROG_INET_INGRESS. To sum it up, checking that the socket exists and that it is a full socket is now part of both macros BPF_CGROUP_RUN_PROG_INET_EGRESS and BPF_CGROUP_RUN_PROG_INET_INGRESS, and it is no longer part of the function they call, __cgroup_bpf_run_filter_skb. v3->v4: Fixed weird merge conflict. v2->v3: Sent to bpf-next instead of generic patch v1->v2: Addressed feedback about where check should be removed. Signed-off-by: Oliver Crumrine <ozlinuxc@gmail.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/7lv62yiyvmj5a7eozv2iznglpkydkdfancgmbhiptrgvgan5sy@3fl3onchgdz3 Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2024-02-13 15:41:17 -08:00
Kui-Feng Lee	1611603537	bpf: Create argument information for nullable arguments. Collect argument information from the type information of stub functions to mark arguments of BPF struct_ops programs with PTR_MAYBE_NULL if they are nullable. A nullable argument is annotated by suffixing "__nullable" at the argument name of stub function. For nullable arguments, this patch sets a struct bpf_ctx_arg_aux to label their reg_type with PTR_TO_BTF_ID \| PTR_TRUSTED \| PTR_MAYBE_NULL. This makes the verifier to check programs and ensure that they properly check the pointer. The programs should check if the pointer is null before accessing the pointed memory. The implementer of a struct_ops type should annotate the arguments that can be null. The implementer should define a stub function (empty) as a placeholder for each defined operator. The name of a stub function should be in the pattern "<st_op_type>__<operator name>". For example, for test_maybe_null of struct bpf_testmod_ops, it's stub function name should be "bpf_testmod_ops__test_maybe_null". You mark an argument nullable by suffixing the argument name with "__nullable" at the stub function. Since we already has stub functions for kCFI, we just reuse these stub functions with the naming convention mentioned earlier. These stub functions with the naming convention is only required if there are nullable arguments to annotate. For functions having not nullable arguments, stub functions are not necessary for the purpose of this patch. This patch will prepare a list of struct bpf_ctx_arg_aux, aka arg_info, for each member field of a struct_ops type. "arg_info" will be assigned to "prog->aux->ctx_arg_info" of BPF struct_ops programs in check_struct_ops_btf_id() so that it can be used by btf_ctx_access() later to set reg_type properly for the verifier. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240209023750.1153905-4-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2024-02-13 15:16:44 -08:00
Kui-Feng Lee	6115a0aeef	bpf: Move __kfunc_param_match_suffix() to btf.c. Move __kfunc_param_match_suffix() to btf.c and rename it as btf_param_match_suffix(). It can be reused by bpf_struct_ops later. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240209023750.1153905-3-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2024-02-13 15:16:44 -08:00
Kui-Feng Lee	77c0208e19	bpf: add btf pointer to struct bpf_ctx_arg_aux. Enable the providers to use types defined in a module instead of in the kernel (btf_vmlinux). Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240209023750.1153905-2-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2024-02-13 15:16:44 -08:00
Hari Bathini	11f522256e	bpf: Fix warning for bpf_cpumask in verifier Compiling with CONFIG_BPF_SYSCALL & !CONFIG_BPF_JIT throws the below warning: "WARN: resolve_btfids: unresolved symbol bpf_cpumask" Fix it by adding the appropriate #ifdef. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/bpf/20240208100115.602172-1-hbathini@linux.ibm.com	2024-02-13 11:13:39 -08:00
Yonghong Song	178c54666f	bpf: Mark bpf_spin_{lock,unlock}() helpers with notrace correctly Currently tracing is supposed not to allow for bpf_spin_{lock,unlock}() helper calls. This is to prevent deadlock for the following cases: - there is a prog (prog-A) calling bpf_spin_{lock,unlock}(). - there is a tracing program (prog-B), e.g., fentry, attached to bpf_spin_lock() and/or bpf_spin_unlock(). - prog-B calls bpf_spin_{lock,unlock}(). For such a case, when prog-A calls bpf_spin_{lock,unlock}(), a deadlock will happen. The related source codes are below in kernel/bpf/helpers.c: notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock , lock) notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock , lock) notrace is supposed to prevent fentry prog from attaching to bpf_spin_{lock,unlock}(). But actually this is not the case and fentry prog can successfully attached to bpf_spin_lock(). Siddharth Chintamaneni reported the issue in [1]. The following is the macro definition for above BPF_CALL_1: #define BPF_CALL_x(x, name, ...) \ static __always_inline \ u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)); \ typedef u64 (*btf_##name)(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)); \ u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__)); \ u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__)) \ { \ return ((btf_##name)____##name)(__BPF_MAP(x,__BPF_CAST,__BPF_N,__VA_ARGS__));\ } \ static __always_inline \ u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)) #define BPF_CALL_1(name, ...) BPF_CALL_x(1, name, __VA_ARGS__) The notrace attribute is actually applied to the static always_inline function ____bpf_spin_{lock,unlock}(). The actual callback function bpf_spin_{lock,unlock}() is not marked with notrace, hence allowing fentry prog to attach to two helpers, and this may cause the above mentioned deadlock. Siddharth Chintamaneni actually has a reproducer in [2]. To fix the issue, a new macro NOTRACE_BPF_CALL_1 is introduced which will add notrace attribute to the original function instead of the hidden always_inline function and this fixed the problem. [1] https://lore.kernel.org/bpf/CAE5sdEigPnoGrzN8WU7Tx-h-iFuMZgW06qp0KHWtpvoXxf1OAQ@mail.gmail.com/ [2] https://lore.kernel.org/bpf/CAE5sdEg6yUc_Jz50AnUXEEUh6O73yQ1Z6NV2srJnef0ZrQkZew@mail.gmail.com/ Fixes: `d83525ca62` ("bpf: introduce bpf_spin_lock") Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240207070102.335167-1-yonghong.song@linux.dev	2024-02-13 11:11:25 -08:00
Daniel Xu	5b268d1ebc	bpf: Have bpf_rdonly_cast() take a const pointer Since `20d59ee551` ("libbpf: add bpf_core_cast() macro"), libbpf is now exporting a const arg version of bpf_rdonly_cast(). This causes the following conflicting type error when generating kfunc prototypes from BTF: In file included from skeleton/pid_iter.bpf.c:5: /home/dxu/dev/linux/tools/bpf/bpftool/bootstrap/libbpf/include/bpf/bpf_core_read.h:297:14: error: conflicting types for 'bpf_rdonly_cast' extern void bpf_rdonly_cast(const void obj__ign, __u32 btf_id__k) __ksym __weak; ^ ./vmlinux.h:135625:14: note: previous declaration is here extern void bpf_rdonly_cast(void obj__ign, u32 btf_id__k) __weak __ksym; This is b/c the kernel defines bpf_rdonly_cast() with non-const arg. Since const arg is more permissive and thus backwards compatible, we change the kernel definition as well to avoid conflicting type errors. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/dfd3823f11ffd2d4c838e961d61ec9ae8a646773.1707080349.git.dxu@dxuuu.xyz	2024-02-13 11:05:26 -08:00
Sven Schnelle	a6eaa24f1c	tracing: Use ring_buffer_record_is_set_on() in tracer_tracing_is_on() tracer_tracing_is_on() checks whether record_disabled is not zero. This checks both the record_disabled counter and the RB_BUFFER_OFF flag. Reading the source it looks like this function should only check for the RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on(). This fixes spurious fails in the 'test for function traceon/off triggers' test from the ftrace testsuite when the system is under load. Link: https://lore.kernel.org/linux-trace-kernel/20240205065340.2848065-1-svens@linux.ibm.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Tested-By: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-02-13 12:04:17 -05:00
Petr Pavlu	bdbddb109c	tracing: Fix HAVE_DYNAMIC_FTRACE_WITH_REGS ifdef Commit `a8b9cf62ad` ("ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default") attempted to fix an issue with direct trampolines on x86, see its description for details. However, it wrongly referenced the HAVE_DYNAMIC_FTRACE_WITH_REGS config option and the problem is still present. Add the missing "CONFIG_" prefix for the logic to work as intended. Link: https://lore.kernel.org/linux-trace-kernel/20240213132434.22537-1-petr.pavlu@suse.com Fixes: `a8b9cf62ad` ("ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-02-13 11:56:49 -05:00
Marco Elver	68bc61c26c	bpf: Allow compiler to inline most of bpf_local_storage_lookup() In various performance profiles of kernels with BPF programs attached, bpf_local_storage_lookup() appears as a significant portion of CPU cycles spent. To enable the compiler generate more optimal code, turn bpf_local_storage_lookup() into a static inline function, where only the cache insertion code path is outlined Notably, outlining cache insertion helps avoid bloating callers by duplicating setting up calls to raw_spin_{lock,unlock}_irqsave() (on architectures which do not inline spin_lock/unlock, such as x86), which would cause the compiler produce worse code by deciding to outline otherwise inlinable functions. The call overhead is neutral, because we make 2 calls either way: either calling raw_spin_lock_irqsave() and raw_spin_unlock_irqsave(); or call __bpf_local_storage_insert_cache(), which calls raw_spin_lock_irqsave(), followed by a tail-call to raw_spin_unlock_irqsave() where the compiler can perform TCO and (in optimized uninstrumented builds) turns it into a plain jump. The call to __bpf_local_storage_insert_cache() can be elided entirely if cacheit_lockit is a false constant expression. Based on results from './benchs/run_bench_local_storage.sh' (21 trials, reboot between each trial; x86 defconfig + BPF, clang 16) this produces improvements in throughput and latency in the majority of cases, with an average (geomean) improvement of 8%: +---- Hashmap Control -------------------- \| \| + num keys: 10 \| : <before> \| <after> \| +-+ hashmap (control) sequential get +----------------------+---------------------- \| +- hits throughput \| 14.789 M ops/s \| 14.745 M ops/s ( ~ ) \| +- hits latency \| 67.679 ns/op \| 67.879 ns/op ( ~ ) \| +- important_hits throughput \| 14.789 M ops/s \| 14.745 M ops/s ( ~ ) \| \| + num keys: 1000 \| : <before> \| <after> \| +-+ hashmap (control) sequential get +----------------------+---------------------- \| +- hits throughput \| 12.233 M ops/s \| 12.170 M ops/s ( ~ ) \| +- hits latency \| 81.754 ns/op \| 82.185 ns/op ( ~ ) \| +- important_hits throughput \| 12.233 M ops/s \| 12.170 M ops/s ( ~ ) \| \| + num keys: 10000 \| : <before> \| <after> \| +-+ hashmap (control) sequential get +----------------------+---------------------- \| +- hits throughput \| 7.220 M ops/s \| 7.204 M ops/s ( ~ ) \| +- hits latency \| 138.522 ns/op \| 138.842 ns/op ( ~ ) \| +- important_hits throughput \| 7.220 M ops/s \| 7.204 M ops/s ( ~ ) \| \| + num keys: 100000 \| : <before> \| <after> \| +-+ hashmap (control) sequential get +----------------------+---------------------- \| +- hits throughput \| 5.061 M ops/s \| 5.165 M ops/s (+2.1%) \| +- hits latency \| 198.483 ns/op \| 194.270 ns/op (-2.1%) \| +- important_hits throughput \| 5.061 M ops/s \| 5.165 M ops/s (+2.1%) \| \| + num keys: 4194304 \| : <before> \| <after> \| +-+ hashmap (control) sequential get +----------------------+---------------------- \| +- hits throughput \| 2.864 M ops/s \| 2.882 M ops/s ( ~ ) \| +- hits latency \| 365.220 ns/op \| 361.418 ns/op (-1.0%) \| +- important_hits throughput \| 2.864 M ops/s \| 2.882 M ops/s ( ~ ) \| +---- Local Storage ---------------------- \| \| + num_maps: 1 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 33.005 M ops/s \| 39.068 M ops/s (+18.4%) \| +- hits latency \| 30.300 ns/op \| 25.598 ns/op (-15.5%) \| +- important_hits throughput \| 33.005 M ops/s \| 39.068 M ops/s (+18.4%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 37.151 M ops/s \| 44.926 M ops/s (+20.9%) \| +- hits latency \| 26.919 ns/op \| 22.259 ns/op (-17.3%) \| +- important_hits throughput \| 37.151 M ops/s \| 44.926 M ops/s (+20.9%) \| \| + num_maps: 10 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 32.288 M ops/s \| 38.099 M ops/s (+18.0%) \| +- hits latency \| 30.972 ns/op \| 26.248 ns/op (-15.3%) \| +- important_hits throughput \| 3.229 M ops/s \| 3.810 M ops/s (+18.0%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 34.473 M ops/s \| 41.145 M ops/s (+19.4%) \| +- hits latency \| 29.010 ns/op \| 24.307 ns/op (-16.2%) \| +- important_hits throughput \| 12.312 M ops/s \| 14.695 M ops/s (+19.4%) \| \| + num_maps: 16 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 32.524 M ops/s \| 38.341 M ops/s (+17.9%) \| +- hits latency \| 30.748 ns/op \| 26.083 ns/op (-15.2%) \| +- important_hits throughput \| 2.033 M ops/s \| 2.396 M ops/s (+17.9%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 34.575 M ops/s \| 41.338 M ops/s (+19.6%) \| +- hits latency \| 28.925 ns/op \| 24.193 ns/op (-16.4%) \| +- important_hits throughput \| 11.001 M ops/s \| 13.153 M ops/s (+19.6%) \| \| + num_maps: 17 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 28.861 M ops/s \| 32.756 M ops/s (+13.5%) \| +- hits latency \| 34.649 ns/op \| 30.530 ns/op (-11.9%) \| +- important_hits throughput \| 1.700 M ops/s \| 1.929 M ops/s (+13.5%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 31.529 M ops/s \| 36.110 M ops/s (+14.5%) \| +- hits latency \| 31.719 ns/op \| 27.697 ns/op (-12.7%) \| +- important_hits throughput \| 9.598 M ops/s \| 10.993 M ops/s (+14.5%) \| \| + num_maps: 24 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 18.602 M ops/s \| 19.937 M ops/s (+7.2%) \| +- hits latency \| 53.767 ns/op \| 50.166 ns/op (-6.7%) \| +- important_hits throughput \| 0.776 M ops/s \| 0.831 M ops/s (+7.2%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 21.718 M ops/s \| 23.332 M ops/s (+7.4%) \| +- hits latency \| 46.047 ns/op \| 42.865 ns/op (-6.9%) \| +- important_hits throughput \| 6.110 M ops/s \| 6.564 M ops/s (+7.4%) \| \| + num_maps: 32 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 14.118 M ops/s \| 14.626 M ops/s (+3.6%) \| +- hits latency \| 70.856 ns/op \| 68.381 ns/op (-3.5%) \| +- important_hits throughput \| 0.442 M ops/s \| 0.458 M ops/s (+3.6%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 17.111 M ops/s \| 17.906 M ops/s (+4.6%) \| +- hits latency \| 58.451 ns/op \| 55.865 ns/op (-4.4%) \| +- important_hits throughput \| 4.776 M ops/s \| 4.998 M ops/s (+4.6%) \| \| + num_maps: 100 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 5.281 M ops/s \| 5.528 M ops/s (+4.7%) \| +- hits latency \| 192.398 ns/op \| 183.059 ns/op (-4.9%) \| +- important_hits throughput \| 0.053 M ops/s \| 0.055 M ops/s (+4.9%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 6.265 M ops/s \| 6.498 M ops/s (+3.7%) \| +- hits latency \| 161.436 ns/op \| 152.877 ns/op (-5.3%) \| +- important_hits throughput \| 1.636 M ops/s \| 1.697 M ops/s (+3.7%) \| \| + num_maps: 1000 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 0.355 M ops/s \| 0.354 M ops/s ( ~ ) \| +- hits latency \| 2826.538 ns/op \| 2827.139 ns/op ( ~ ) \| +- important_hits throughput \| 0.000 M ops/s \| 0.000 M ops/s ( ~ ) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 0.404 M ops/s \| 0.403 M ops/s ( ~ ) \| +- hits latency \| 2481.190 ns/op \| 2487.555 ns/op ( ~ ) \| +- important_hits throughput \| 0.102 M ops/s \| 0.101 M ops/s ( ~ ) The on_lookup test in {cgrp,task}_ls_recursion.c is removed because the bpf_local_storage_lookup is no longer traceable and adding tracepoint will make the compiler generate worse code: https://lore.kernel.org/bpf/ZcJmok64Xqv6l4ZS@elver.google.com/ Signed-off-by: Marco Elver <elver@google.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240207122626.3508658-1-elver@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2024-02-11 14:06:24 -08:00
Linus Torvalds	2766f59ca4	- Make sure a warning is issued when a hrtimer gets queued after the timers have been migrated on the CPU down path and thus said timer will get ignored -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmXIpjEACgkQEsHwGGHe VUrG8g/7BuGzGzFHYli7LZKuzI76tN0CU44v7oPauqZvkcaSUmF/E+/RAeHxMjUa 20Mlo2AUGrPkKbRUgN93IrRKZAfdjqKQ6UZiH/FTyFUtFfs9XNv2G0cCIVQAepXL WbKPxL/M9vbnJwK6CC5prSLHazGH2y5vg0zbY9RycGKvxza+HgkIrZoYp7ctHX1A xZFF5EyLu6g0x+yz7Tt0Zf93tADJxFSHmfz7Nmx1RFh0GJzceuKUvC2ZVyUr63fv ttQn4TLm0NKySaR+SPYPKKp1lLkHvfh9pFV2kdI/c7oo4Pig6bFTjMJpf0o541A2 s87sz2w6P16LMi2sjf/ASQmgMHmGiIQlmjjFbVX8sKeibdtUM3Vg7s/Hs/EilY7Q P7ANSmZTtQBoQsWd/E+8aOBUkC263Ua0uoOufH7dcfL9mSJxUos1SPleRnaO4o5n mm5GVDxggNj879nHZUBh9g7+JsLdZ5yozWne0xAyrI0WycsK0hzWuW0B5p4QMK3T 4zamSuZNObBUdbwb5cD1fL5X5aRkPvorj9iLliui1X0wfA+nByR5cjlcptXirCq4 8D2WCiQ+tbP3w4UJ0gtrC3mlXokWorMV8XoLZm9+RtWLi2LFAhMCZ9U3B0EgsziD whMGJBXKRaeeFnCnxPjlHC/5a1nXTCqFlks+PSUVT6RiavYJ7VA= =+SjG -----END PGP SIGNATURE----- Merge tag 'timers_urgent_for_v6.8_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Borislav Petkov: - Make sure a warning is issued when a hrtimer gets queued after the timers have been migrated on the CPU down path and thus said timer will get ignored * tag 'timers_urgent_for_v6.8_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimer: Report offline hrtimer enqueue	2024-02-11 11:44:14 -08:00
Linus Torvalds	7521f258ea	21 hotfixes. 12 are cc:stable and the remainder pertain to post-6.7 issues or aren't considered to be needed in earlier kernel versions. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZcfLvgAKCRDdBJ7gKXxA joCTAP4/XdBXA7Sj3GyjSAkYjg2U0quwX9oRhsx2Qy9duPDaLAD+NRl9XG14YSOB f/7OiTQoDfnwVgHAOVBHY/ylrcgZRQg= =2wdS -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2024-02-10-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "21 hotfixes. 12 are cc:stable and the remainder pertain to post-6.7 issues or aren't considered to be needed in earlier kernel versions" * tag 'mm-hotfixes-stable-2024-02-10-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits) nilfs2: fix potential bug in end_buffer_async_write mm/damon/sysfs-schemes: fix wrong DAMOS tried regions update timeout setup nilfs2: fix hang in nilfs_lookup_dirty_data_buffers() MAINTAINERS: Leo Yan has moved mm/zswap: don't return LRU_SKIP if we have dropped lru lock fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super mailmap: switch email address for John Moon mm: zswap: fix objcg use-after-free in entry destruction mm/madvise: don't forget to leave lazy MMU mode in madvise_cold_or_pageout_pte_range() arch/arm/mm: fix major fault accounting when retrying under per-VMA lock selftests: core: include linux/close_range.h for CLOSE_RANGE_* macros mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page mm: memcg: optimize parent iteration in memcg_rstat_updated() nilfs2: fix data corruption in dsync block recovery for small block sizes mm/userfaultfd: UFFDIO_MOVE implementation should use ptep_get() exit: wait_task_zombie: kill the no longer necessary spin_lock_irq(siglock) fs/proc: do_task_stat: use sig->stats_lock to gather the threads/children stats fs/proc: do_task_stat: move thread_group_cputime_adjusted() outside of lock_task_sighand() getrusage: use sig->stats_lock rather than lock_task_sighand() getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand() ...	2024-02-10 15:28:07 -08:00
Oleg Nesterov	81b9d8ac06	pidfd: change pidfd_send_signal() to respect PIDFD_THREAD Turn kill_pid_info() into kill_pid_info_type(), this allows to pass any pid_type to group_send_sig_info(), despite its name it should work fine even if type = PIDTYPE_PID. Change pidfd_send_signal() to use PIDTYPE_PID or PIDTYPE_TGID depending on PIDFD_THREAD. While at it kill another TODO comment in pidfd_show_fdinfo(). As Christian expains fdinfo reports f_flags, userspace can already detect PIDFD_THREAD. Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240209130650.GA8048@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-10 22:37:23 +01:00
Oleg Nesterov	c044a95026	signal: fill in si_code in prepare_kill_siginfo() So that do_tkill() can use this helper too. This also simplifies the next patch. TODO: perhaps we can kill prepare_kill_siginfo() and change the callers to use SEND_SIG_NOINFO, but this needs some changes in __send_signal_locked() and TP_STORE_SIGINFO(). Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240209130620.GA8039@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-10 22:37:18 +01:00
Tejun Heo	bf52b1ac6a	async: Use a dedicated unbound workqueue with raised min_active Async can schedule a number of interdependent work items. However, since `5797b1c189` ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues"), unbound workqueues have separate min_active which sets the number of interdependent work items that can be handled. This default value is 8 which isn't sufficient for async and can lead to stalls during resume from suspend in some cases. Let's use a dedicated unbound workqueue with raised min_active. Link: http://lkml.kernel.org/r/708a65cc-79ec-44a6-8454-a93d0f3114c3@samsung.com Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-09 11:13:59 -10:00
Tejun Heo	8f172181f2	workqueue: Implement workqueue_set_min_active() Since `5797b1c189` ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues"), unbound workqueues have separate min_active which sets the number of interdependent work items that can be handled. This value is currently initialized to WQ_DFL_MIN_ACTIVE which is 8. This isn't high enough for some users, let's add an interface to adjust the setting. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-09 11:13:59 -10:00
Waiman Long	516d3dc99f	workqueue: Fix kernel-doc comment of unplug_oldest_pwq() Fix the kernel-doc comment of the unplug_oldest_pwq() function to enable proper processing and formatting of the embedded ASCII diagram. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-09 11:04:13 -10:00
Linus Torvalds	ca8a66738a	Tracing fixes for v6.8-rc3: - Fix broken direct trampolines being called when another callback is attached the same function. ARM 64 does not support FTRACE_WITH_REGS, and when it added direct trampoline calls from ftrace, it removed the "WITH_REGS" flag from the ftrace_ops for direct trampolines. This broke x86 as x86 requires direct trampolines to have WITH_REGS. This wasn't noticed because direct trampolines work as long as the function it is attached to is not shared with other callbacks (like the function tracer). When there's other callbacks, a helper trampoline is called, to call all the non direct callbacks and when it returns, the direct trampoline is called. For x86, the direct trampoline sets a flag in the regs field to tell the x86 specific code to call the direct trampoline. But this only works if the ftrace_ops had WITH_REGS set. ARM does things differently that does not require this. For now, set WITH_REGS if the arch supports WITH_REGS (which ARM does not), and this makes it work for both ARM64 and x86. - Fix wasted memory in the saved_cmdlines logic. The saved_cmdlines is a cache that maps PIDs to COMMs that tracing can use. Most trace events only save the PID in the event. The saved_cmdlines file lists PIDs to COMMs so that the tracing tools can show an actual name and not just a PID for each event. There's an array of PIDs that map to a small set of saved COMM strings. The array is set to PID_MAX_DEFAULT which is usually set to 32768. When a PID comes in, it will add itself to this array along with the index into the COMM array (note if the system allows more than PID_MAX_DEFAULT, this cache is similar to cache lines as an update of a PID that has the same PID_MAX_DEFAULT bits set will flush out another task with the same matching bits set). A while ago, the size of this cache was changed to be dynamic and the array was moved into a structure and created with kmalloc(). But this new structure had the size of 131104 bytes, or 0x20020 in hex. As kmalloc allocates in powers of two, it was actually allocating 0x40000 bytes (262144) leaving 131040 bytes of wasted memory. The last element of this structure was a pointer to the COMM string array which defaulted to just saving 128 COMMs. By changing the last field of this structure to a variable length string, and just having it round up to fill the allocated memory, the default size of the saved COMM cache is now 8190. This not only uses the wasted space, but actually saves space by removing the extra allocation for the COMM names. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZcYi8RQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qqENAQD6xGE9EPkbHArElKfgpSuQOfGhcyyP LjgVhqVgmIoqUwD8CeVpxk3VwZIOQYvPn5XictcZgkYSeEWUZcKYg4c/3gs= =iIBv -----END PGP SIGNATURE----- Merge tag 'trace-v6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix broken direct trampolines being called when another callback is attached the same function. ARM 64 does not support FTRACE_WITH_REGS, and when it added direct trampoline calls from ftrace, it removed the "WITH_REGS" flag from the ftrace_ops for direct trampolines. This broke x86 as x86 requires direct trampolines to have WITH_REGS. This wasn't noticed because direct trampolines work as long as the function it is attached to is not shared with other callbacks (like the function tracer). When there are other callbacks, a helper trampoline is called, to call all the non direct callbacks and when it returns, the direct trampoline is called. For x86, the direct trampoline sets a flag in the regs field to tell the x86 specific code to call the direct trampoline. But this only works if the ftrace_ops had WITH_REGS set. ARM does things differently that does not require this. For now, set WITH_REGS if the arch supports WITH_REGS (which ARM does not), and this makes it work for both ARM64 and x86. - Fix wasted memory in the saved_cmdlines logic. The saved_cmdlines is a cache that maps PIDs to COMMs that tracing can use. Most trace events only save the PID in the event. The saved_cmdlines file lists PIDs to COMMs so that the tracing tools can show an actual name and not just a PID for each event. There's an array of PIDs that map to a small set of saved COMM strings. The array is set to PID_MAX_DEFAULT which is usually set to 32768. When a PID comes in, it will add itself to this array along with the index into the COMM array (note if the system allows more than PID_MAX_DEFAULT, this cache is similar to cache lines as an update of a PID that has the same PID_MAX_DEFAULT bits set will flush out another task with the same matching bits set). A while ago, the size of this cache was changed to be dynamic and the array was moved into a structure and created with kmalloc(). But this new structure had the size of 131104 bytes, or 0x20020 in hex. As kmalloc allocates in powers of two, it was actually allocating 0x40000 bytes (262144) leaving 131040 bytes of wasted memory. The last element of this structure was a pointer to the COMM string array which defaulted to just saving 128 COMMs. By changing the last field of this structure to a variable length string, and just having it round up to fill the allocated memory, the default size of the saved COMM cache is now 8190. This not only uses the wasted space, but actually saves space by removing the extra allocation for the COMM names. * tag 'trace-v6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix wasted memory in saved_cmdlines logic ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default	2024-02-09 11:13:19 -08:00
Steven Rostedt (Google)	44dc5c41b5	tracing: Fix wasted memory in saved_cmdlines logic While looking at improving the saved_cmdlines cache I found a huge amount of wasted memory that should be used for the cmdlines. The tracing data saves pids during the trace. At sched switch, if a trace occurred, it will save the comm of the task that did the trace. This is saved in a "cache" that maps pids to comms and exposed to user space via the /sys/kernel/tracing/saved_cmdlines file. Currently it only caches by default 128 comms. The structure that uses this creates an array to store the pids using PID_MAX_DEFAULT (which is usually set to 32768). This causes the structure to be of the size of 131104 bytes on 64 bit machines. In hex: 131104 = 0x20020, and since the kernel allocates generic memory in powers of two, the kernel would allocate 0x40000 or 262144 bytes to store this structure. That leaves 131040 bytes of wasted space. Worse, the structure points to an allocated array to store the comm names, which is 16 bytes times the amount of names to save (currently 128), which is 2048 bytes. Instead of allocating a separate array, make the structure end with a variable length string and use the extra space for that. This is similar to a recommendation that Linus had made about eventfs_inode names: https://lore.kernel.org/all/20240130190355.11486-5-torvalds@linux-foundation.org/ Instead of allocating a separate string array to hold the saved comms, have the structure end with: char saved_cmdlines[]; and round up to the next power of two over sizeof(struct saved_cmdline_buffers) + num_cmdlines * TASK_COMM_LEN It will use this extra space for the saved_cmdline portion. Now, instead of saving only 128 comms by default, by using this wasted space at the end of the structure it can save over 8000 comms and even saves space by removing the need for allocating the other array. Link: https://lore.kernel.org/linux-trace-kernel/20240209063622.1f7b6d5f@rorschach.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Mete Durlu <meted@linux.ibm.com> Fixes: `939c7a4f04` ("tracing: Introduce saved_cmdlines_size file") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-02-09 06:43:21 -05:00
Masami Hiramatsu (Google)	a8b9cf62ad	ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default The commit `60c8971899` ("ftrace: Make DIRECT_CALLS work WITH_ARGS and !WITH_REGS") changed DIRECT_CALLS to use SAVE_ARGS when there are multiple ftrace_ops at the same function, but since the x86 only support to jump to direct_call from ftrace_regs_caller, when we set the function tracer on the same target function on x86, ftrace-direct does not work as below (this actually works on arm64.) At first, insmod ftrace-direct.ko to put a direct_call on 'wake_up_process()'. # insmod kernel/samples/ftrace/ftrace-direct.ko # less trace ... <idle>-0 [006] ..s1. 564.686958: my_direct_func: waking up rcu_preempt-17 <idle>-0 [007] ..s1. 564.687836: my_direct_func: waking up kcompactd0-63 <idle>-0 [006] ..s1. 564.690926: my_direct_func: waking up rcu_preempt-17 <idle>-0 [006] ..s1. 564.696872: my_direct_func: waking up rcu_preempt-17 <idle>-0 [007] ..s1. 565.191982: my_direct_func: waking up kcompactd0-63 Setup a function filter to the 'wake_up_process' too, and enable it. # cd /sys/kernel/tracing/ # echo wake_up_process > set_ftrace_filter # echo function > current_tracer # less trace ... <idle>-0 [006] ..s3. 686.180972: wake_up_process <-call_timer_fn <idle>-0 [006] ..s3. 686.186919: wake_up_process <-call_timer_fn <idle>-0 [002] ..s3. 686.264049: wake_up_process <-call_timer_fn <idle>-0 [002] d.h6. 686.515216: wake_up_process <-kick_pool <idle>-0 [002] d.h6. 686.691386: wake_up_process <-kick_pool Then, only function tracer is shown on x86. But if you enable 'kprobe on ftrace' event (which uses SAVE_REGS flag) on the same function, it is shown again. # echo 'p wake_up_process' >> dynamic_events # echo 1 > events/kprobes/p_wake_up_process_0/enable # echo > trace # less trace ... <idle>-0 [006] ..s2. 2710.345919: p_wake_up_process_0: (wake_up_process+0x4/0x20) <idle>-0 [006] ..s3. 2710.345923: wake_up_process <-call_timer_fn <idle>-0 [006] ..s1. 2710.345928: my_direct_func: waking up rcu_preempt-17 <idle>-0 [006] ..s2. 2710.349931: p_wake_up_process_0: (wake_up_process+0x4/0x20) <idle>-0 [006] ..s3. 2710.349934: wake_up_process <-call_timer_fn <idle>-0 [006] ..s1. 2710.349937: my_direct_func: waking up rcu_preempt-17 To fix this issue, use SAVE_REGS flag for multiple ftrace_ops flag of direct_call by default. Link: https://lore.kernel.org/linux-trace-kernel/170484558617.178953.1590516949390270842.stgit@devnote2 Fixes: `60c8971899` ("ftrace: Make DIRECT_CALLS work WITH_ARGS and !WITH_REGS") Cc: stable@vger.kernel.org Cc: Florent Revest <revest@chromium.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64] Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-02-09 04:58:22 -05:00
Jakub Kicinski	3be042cf46	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/stmicro/stmmac/common.h `38cc3c6dcc` ("net: stmmac: protect updates of 64-bit statistics counters") `fd5a6a7131` ("net: stmmac: est: Per Tx-queue error count for HLBF") `c5c3e1bfc9` ("net: stmmac: Offload queueMaxSDU from tc-taprio") drivers/net/wireless/microchip/wilc1000/netdev.c `c901388028` ("wifi: fill in MODULE_DESCRIPTION()s for wilc1000") `328efda22a` ("wifi: wilc1000: do not realloc workqueue everytime an interface is added") net/unix/garbage.c `11498715f2` ("af_unix: Remove io_uring code for GC.") `1279f9d9de` ("af_unix: Call kfree_skb() for dead unix_(sk)->oob_skb in GC.") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-08 15:30:33 -08:00
Geliang Tang	947e56f82f	bpf, btf: Check btf for register_bpf_struct_ops Similar to the handling in the functions __register_btf_kfunc_id_set() and register_btf_id_dtor_kfuncs(), this patch uses the newly added helper check_btf_kconfigs() to handle module with its btf section stripped. While at it, the patch also adds the missed IS_ERR() check to fix the commit `f6be98d199` ("bpf, net: switch to dynamic registration") Fixes: `f6be98d199` ("bpf, net: switch to dynamic registration") Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Link: https://lore.kernel.org/r/69082b9835463fe36f9e354bddf2d0a97df39c2b.1707373307.git.tanggeliang@kylinos.cn Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2024-02-08 11:37:49 -08:00
Waiman Long	49584bb8dd	workqueue: Bind unbound workqueue rescuer to wq_unbound_cpumask Commit `85f0ab43f9` ("kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND") modified init_rescuer() to bind rescuer of an unbound workqueue to the cpumask in wq->unbound_attrs. However unbound_attrs->cpumask's of all workqueues are initialized to cpu_possible_mask and will only be changed if it has the WQ_SYSFS flag to expose a cpumask sysfs file to be written by users. So this patch doesn't achieve what it is intended to do. If an unbound workqueue is created after wq_unbound_cpumask is modified and there is no more unbound cpumask update after that, the unbound rescuer will be bound to all CPUs unless the workqueue is created with the WQ_SYSFS flag and a user explicitly modified its cpumask sysfs file. Fix this problem by binding directly to wq_unbound_cpumask in init_rescuer(). Fixes: `85f0ab43f9` ("kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-08 09:23:38 -10:00
Juri Lelli	d64f2fa064	kernel/workqueue: Let rescuers follow unbound wq cpumask changes When workqueue cpumask changes are committed the associated rescuer (if one exists) affinity is not touched and this might be a problem down the line for isolated setups. Make sure rescuers affinity is updated every time a workqueue cpumask changes, so that rescuers can't break isolation. [longman: set_cpus_allowed_ptr() will block until the designated task is enqueued on an allowed CPU, no wake_up_process() needed. Also use the unbound_effective_cpumask() helper as suggested by Tejun.] Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-08 09:23:32 -10:00
Geliang Tang	9e60b0e025	bpf, btf: Add check_btf_kconfigs helper This patch extracts duplicate code on error path when btf_get_module_btf() returns NULL from the functions __register_btf_kfunc_id_set() and register_btf_id_dtor_kfuncs() into a new helper named check_btf_kconfigs() to check CONFIG_DEBUG_INFO_BTF and CONFIG_DEBUG_INFO_BTF_MODULES in it. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/fa5537fc55f1e4d0bfd686598c81b7ab9dbd82b7.1707373307.git.tanggeliang@kylinos.cn Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2024-02-08 11:22:56 -08:00
Waiman Long	4c065dbce1	workqueue: Enable unbound cpumask update on ordered workqueues Ordered workqueues does not currently follow changes made to the global unbound cpumask because per-pool workqueue changes may break the ordering guarantee. IOW, a work function in an ordered workqueue may run on an isolated CPU. This patch enables ordered workqueues to follow changes made to the global unbound cpumask by temporaily plug or suspend the newly allocated pool_workqueue from executing newly queued work items until the old pwq has been properly drained. For ordered workqueues, there should only be one pwq that is unplugged, the rests should be plugged. This enables ordered workqueues to follow the unbound cpumask changes like other unbound workqueues at the expense of some delay in execution of work functions during the transition period. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-08 09:22:49 -10:00
Waiman Long	26fb7e3dda	workqueue: Link pwq's into wq->pwqs from oldest to newest Add a new pwq into the tail of wq->pwqs so that pwq iteration will start from the oldest pwq to the newest. This ordering will facilitate the inclusion of ordered workqueues in a wq_unbound_cpumask update. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-08 09:22:30 -10:00
Geliang Tang	b9a395f0f7	bpf, btf: Fix return value of register_btf_id_dtor_kfuncs The same as __register_btf_kfunc_id_set(), to let the modules with stripped btf section loaded, this patch changes the return value of register_btf_id_dtor_kfuncs() too from -ENOENT to 0 when btf is NULL. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Link: https://lore.kernel.org/r/eab65586d7fb0e72f2707d3747c7d4a5d60c823f.1707373307.git.tanggeliang@kylinos.cn Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2024-02-08 11:05:07 -08:00
Li zeming	9efd24ec55	kprobes: Remove unnecessary initial values of variables ri and sym is assigned first, so it does not need to initialize the assignment. Link: https://lore.kernel.org/all/20230919012823.7815-1-zeming@nfschina.com/ Signed-off-by: Li zeming <zeming@nfschina.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2024-02-08 23:29:29 +09:00
Masami Hiramatsu (Google)	9a571c1e27	tracing/probes: Fix to set arg size and fmt after setting type from BTF Since the BTF type setting updates probe_arg::type, the type size calculation and setting print-fmt should be done after that. Without this fix, the argument size and print-fmt can be wrong. Link: https://lore.kernel.org/all/170602218196.215583.6417859469540955777.stgit@devnote2/ Fixes: `b576e09701` ("tracing/probes: Support function parameters if BTF is available") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2024-02-08 23:26:25 +09:00
Masami Hiramatsu (Google)	8c427cc2fa	tracing/probes: Fix to show a parse error for bad type for $comm Fix to show a parse error for bad type (non-string) for $comm/$COMM and immediate-string. With this fix, error_log file shows appropriate error message as below. /sys/kernel/tracing # echo 'p vfs_read $comm:u32' >> kprobe_events sh: write error: Invalid argument /sys/kernel/tracing # echo 'p vfs_read \"hoge":u32' >> kprobe_events sh: write error: Invalid argument /sys/kernel/tracing # cat error_log [ 30.144183] trace_kprobe: error: $comm and immediate-string only accepts string type Command: p vfs_read $comm:u32 ^ [ 62.618500] trace_kprobe: error: $comm and immediate-string only accepts string type Command: p vfs_read \"hoge":u32 ^ Link: https://lore.kernel.org/all/170602215411.215583.2238016352271091852.stgit@devnote2/ Fixes: `3dd1f7f24f` ("tracing: probeevent: Fix to make the type of $comm string") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2024-02-08 23:26:13 +09:00
Lukasz Luba	22ea02848c	PM: EM: Add em_dev_compute_costs() The device drivers can modify EM at runtime by providing a new EM table. The EM is used by the EAS and the em_perf_state::cost stores pre-calculated value to avoid overhead. This patch provides the API for device drivers to calculate the cost values properly (and not duplicate the same code). Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-08 15:00:32 +01:00
Lukasz Luba	24e9fb635d	PM: EM: Remove old table Remove the old EM table which wasn't able to modify the data. Clean the unneeded function and refactor the code a bit. Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-08 15:00:32 +01:00
Lukasz Luba	09417e673c	PM: EM: Change debugfs configuration to use runtime EM table data Dump the runtime EM table values which can be modified in time. In order to do that allocate chunk of debug memory which can be later freed automatically thanks to devm_kcalloc(). This design can handle the fact that the EM table memory can change after EM update, so debug code cannot use the pointer from initialization phase. Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-08 15:00:32 +01:00
Lukasz Luba	1b600da510	PM: EM: Optimize em_cpu_energy() and remove division The Energy Model (EM) can be modified at runtime which brings new possibilities. The em_cpu_energy() is called by the Energy Aware Scheduler (EAS) in its hot path. The energy calculation uses power value for a given performance state (ps) and the CPU busy time as percentage for that given frequency. It is possible to avoid the division by 'scale_cpu' at runtime, because EM is updated whenever new max capacity CPU is set in the system. Use that feature and do the needed division during the calculation of the coefficient 'ps->cost'. That enhanced 'ps->cost' value can be then just multiplied simply by utilization: pd_nrg = ps->cost * \Sum cpu_util to get the needed energy for whole Performance Domain (PD). With this optimization and earlier removal of map_util_freq(), the em_cpu_energy() should run faster on the Big CPU by 1.43x and on the Little CPU by 1.69x (RockPi 4B board). Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-08 15:00:31 +01:00
Lukasz Luba	e3f1164fc9	PM: EM: Support late CPUs booting and capacity adjustment The patch adds needed infrastructure to handle the late CPUs boot, which might change the previous CPUs capacity values. With this changes the new CPUs which try to register EM will trigger the needed re-calculations for other CPUs EMs. Thanks to that the em_per_state::performance values will be aligned with the CPU capacity information after all CPUs finish the boot and EM registrations. Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-08 15:00:31 +01:00
Lukasz Luba	5a367f7b70	PM: EM: Add performance field to struct em_perf_state and optimize The performance doesn't scale linearly with the frequency. Also, it may be different in different workloads. Some CPUs are designed to be particularly good at some applications e.g. images or video processing and other CPUs in different. When those different types of CPUs are combined in one SoC they should be properly modeled to get max of the HW in Energy Aware Scheduler (EAS). The Energy Model (EM) provides the power vs. performance curves to the EAS, but assumes the CPUs capacity is fixed and scales linearly with the frequency. This patch allows to adjust the curve on the 'performance' axis as well. Code speed optimization: Removing map_util_freq() allows to avoid one division and one multiplication operations from the EAS hot code path. Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-08 15:00:31 +01:00
Lukasz Luba	977230d5d5	PM: EM: Introduce em_dev_update_perf_domain() for EM updates Add API function em_dev_update_perf_domain() which allows the EM to be changed safely. Concurrent updaters are serialized with a mutex and the removal of memory that will not be used any more is carried out with the help of RCU. Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-08 15:00:30 +01:00
Lukasz Luba	ffcf9bce7a	PM: EM: Add functions for memory allocations for new EM tables The runtime modified EM table can be provided from drivers. Create mechanism which allows safely allocate and free the table for device drivers. The same table can be used by the EAS in task scheduler code paths, so make sure the memory is not freed when the device driver module is unloaded. Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-08 15:00:30 +01:00
Lukasz Luba	ca0fc871f1	PM: EM: Introduce runtime modifiable table The new runtime table can be populated with a new power data to better reflect the actual efficiency of the device e.g. CPU. The power can vary over time e.g. due to the SoC temperature change. Higher temperature can increase power values. For longer running scenarios, such as game or camera, when also other devices are used (e.g. GPU, ISP) the CPU power can change. The new EM framework is able to addresses this issue and change the EM data at runtime safely. Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-08 15:00:27 +01:00
Lukasz Luba	8552d68201	PM: EM: Split the allocation and initialization of the EM table Split the process of allocation and data initialization for the EM table. The upcoming changes for modifiable EM will use it. This change is not expected to alter the general functionality. Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-08 15:00:27 +01:00
Lukasz Luba	818867224d	PM: EM: Check if the get_cost() callback is present in em_compute_costs() Subsequent changes will introduce a case in which 'cb->get_cost' may not be set in em_compute_costs(), so add a check to ensure that it is not NULL before attempting to dereference it. Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-08 15:00:26 +01:00
Lukasz Luba	faf7075b79	PM: EM: Introduce em_compute_costs() Move the EM costs computation code into a new dedicated function, em_compute_costs(), that can be reused in other places in the future. This change is not expected to alter the general functionality. Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-08 15:00:25 +01:00
Lukasz Luba	99907d6054	PM: EM: Find first CPU active while updating OPP efficiency The Energy Model might be updated at runtime and the energy efficiency for each OPP may change. Thus, there is a need to update also the cpufreq framework and make it aligned to the new values. In order to do that, use a first active CPU from the Performance Domain. This is needed since the first CPU in the cpumask might be offline when we run this code path. Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-08 15:00:23 +01:00
Lukasz Luba	e7b1cc9a7e	PM: EM: Extend em_cpufreq_update_efficiencies() argument list In order to prepare the code for the modifiable EM perf_state table, make em_cpufreq_update_efficiencies() take a pointer to the EM table as its second argument and modify it to use that new argument instead of the 'table' member of dev->em_pd. No functional impact. Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-08 15:00:22 +01:00
Lukasz Luba	4274521fab	PM: EM: Add missing newline for the message log Fix missing newline for the string long in the error code path. Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-08 15:00:21 +01:00
Oleg Nesterov	c1be35a16b	exit: wait_task_zombie: kill the no longer necessary spin_lock_irq(siglock) After the recent changes nobody use siglock to read the values protected by stats_lock, we can kill spin_lock_irq(&current->sighand->siglock) and update the comment. With this patch only __exit_signal() and thread_group_start_cputime() take stats_lock under siglock. Link: https://lkml.kernel.org/r/20240123153359.GA21866@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Dylan Hatch <dylanbhatch@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-02-07 21:20:33 -08:00
Oleg Nesterov	f7ec1cd5cc	getrusage: use sig->stats_lock rather than lock_task_sighand() lock_task_sighand() can trigger a hard lockup. If NR_CPUS threads call getrusage() at the same time and the process has NR_THREADS, spin_lock_irq will spin with irqs disabled O(NR_CPUS * NR_THREADS) time. Change getrusage() to use sig->stats_lock, it was specifically designed for this type of use. This way it runs lockless in the likely case. TODO: - Change do_task_stat() to use sig->stats_lock too, then we can remove spin_lock_irq(siglock) in wait_task_zombie(). - Turn sig->stats_lock into seqcount_rwlock_t, this way the readers in the slow mode won't exclude each other. See https://lore.kernel.org/all/20230913154907.GA26210@redhat.com/ - stats_lock has to disable irqs because ->siglock can be taken in irq context, it would be very nice to change __exit_signal() to avoid the siglock->stats_lock dependency. Link: https://lkml.kernel.org/r/20240122155053.GA26214@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Dylan Hatch <dylanbhatch@google.com> Tested-by: Dylan Hatch <dylanbhatch@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-02-07 21:20:32 -08:00
Oleg Nesterov	daa694e413	getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand() Patch series "getrusage: use sig->stats_lock", v2. This patch (of 2): thread_group_cputime() does its own locking, we can safely shift thread_group_cputime_adjusted() which does another for_each_thread loop outside of ->siglock protected section. This is also preparation for the next patch which changes getrusage() to use stats_lock instead of siglock, thread_group_cputime() takes the same lock. With the current implementation recursive read_seqbegin_or_lock() is fine, thread_group_cputime() can't enter the slow mode if the caller holds stats_lock, yet this looks more safe and better performance-wise. Link: https://lkml.kernel.org/r/20240122155023.GA26169@redhat.com Link: https://lkml.kernel.org/r/20240122155050.GA26205@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Dylan Hatch <dylanbhatch@google.com> Tested-by: Dylan Hatch <dylanbhatch@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-02-07 21:20:32 -08:00
Masahiro Yamada	e55dad12ab	bpf: Merge two CONFIG_BPF entries 'config BPF' exists in both init/Kconfig and kernel/bpf/Kconfig. Commit `b24abcff91` ("bpf, kconfig: Add consolidated menu entry for bpf with core options") added the second one to kernel/bpf/Kconfig instead of moving the existing one. Merge them together. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20240204075634.32969-1-masahiroy@kernel.org	2024-02-07 16:38:20 -08:00
John Ogness	7412dc6d55	dump_stack: Do not get cpu_sync for panic CPU dump_stack() is called in panic(). If for some reason another CPU is holding the printk_cpu_sync and is unable to release it, the panic CPU will be unable to continue and print the stacktrace. Since non-panic CPUs are not allowed to store new printk messages anyway, there is no need to synchronize the stacktrace output in a panic situation. For the panic CPU, do not get the printk_cpu_sync because it is not needed and avoids a potential deadlock scenario in panic(). Link: https://lore.kernel.org/lkml/ZcIGKU8sxti38Kok@alley Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240207134103.1357162-15-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-02-07 17:23:19 +01:00
John Ogness	d988d9a9b9	panic: Flush kernel log buffer at the end If the kernel crashes in a context where printk() calls always defer printing (such as in NMI or inside a printk_safe section) then the final panic messages will be deferred to irq_work. But if irq_work is not available, the messages will not get printed unless explicitly flushed. The result is that the final "end Kernel panic" banner does not get printed. Add one final flush after the last printk() call to make sure the final panic messages make it out as well. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240207134103.1357162-14-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-02-07 17:23:19 +01:00
John Ogness	779dbc2e78	printk: Avoid non-panic CPUs writing to ringbuffer Commit `13fb0f74d7` ("printk: Avoid livelock with heavy printk during panic") introduced a mechanism to silence non-panic CPUs if too many messages are being dropped. Aside from trying to workaround the livelock bugs of legacy consoles, it was also intended to avoid losing panic messages. However, if non-panic CPUs are writing to the ringbuffer, then reacting to dropped messages is too late. Another motivation is that non-finalized messages already might be skipped in panic(). In other words, random messages from non-panic CPUs might already get lost. It is better to ignore all to avoid confusion. To avoid losing panic CPU messages, silence non-panic CPUs immediately on panic. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240207134103.1357162-13-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-02-07 17:23:19 +01:00
Petr Mladek	d04d5882cd	printk: Disable passing console lock owner completely during panic() The commit `d51507098f` ("printk: disable optimistic spin during panic") added checks to avoid becoming a console waiter if a panic is in progress. However, the transition to panic can occur while there is already a waiter. The current owner should not pass the lock to the waiter because it might get stopped or blocked anytime. Also the panic context might pass the console lock owner to an already stopped waiter by mistake. It might happen when console_flush_on_panic() ignores the current lock owner, for example: CPU0 CPU1 ---- ---- console_lock_spinning_enable() console_trylock_spinning() [CPU1 now console waiter] NMI: panic() panic_other_cpus_shutdown() [stopped as console waiter] console_flush_on_panic() console_lock_spinning_enable() [print 1 record] console_lock_spinning_disable_and_check() [handover to stopped CPU1] This results in panic() not flushing the panic messages. Fix these problems by disabling all spinning operations completely during panic(). Another advantage is that it prevents possible deadlocks caused by "console_owner_lock". The panic() context does not need to take it any longer. The lockless checks are safe because the functions become NOPs when they see the panic in progress. All operations manipulating the state are still synchronized by the lock even when non-panic CPUs would notice the panic synchronously. The current owner might stay spinning. But non-panic() CPUs would get stopped anyway and the panic context will never start spinning. Fixes: `dbdda842fe` ("printk: Add console owner and waiter logic to load balance console writes") Signed-off-by: John Ogness <john.ogness@linutronix.de> Link: https://lore.kernel.org/r/20240207134103.1357162-12-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-02-07 17:23:18 +01:00
John Ogness	b1c4c67a5e	printk: ringbuffer: Skip non-finalized records in panic Normally a reader will stop once reaching a non-finalized record. However, when a panic happens, writers from other CPUs (or an interrupted context on the panic CPU) may have been writing a record and were unable to finalize it. The panic CPU will reserve/commit/finalize its panic records, but these will be located after the non-finalized records. This results in panic() not flushing the panic messages. Extend _prb_read_valid() to skip over non-finalized records if on the panic CPU. Fixes: `896fbe20b4` ("printk: use the lockless ringbuffer") Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240207134103.1357162-11-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-02-07 17:23:18 +01:00
John Ogness	ac7d7844c6	printk: Wait for all reserved records with pr_flush() Currently pr_flush() will only wait for records that were available to readers at the time of the call (using prb_next_seq()). But there may be more records (non-finalized) that have following finalized records. pr_flush() should wait for these to print as well. Particularly because any trailing finalized records may be the messages that the calling context wants to ensure are printed. Add a new ringbuffer function prb_next_reserve_seq() to return the sequence number following the most recently reserved record. This guarantees that pr_flush() will wait until all current printk() messages (completed or in progress) have been printed. Fixes: `3b604ca812` ("printk: add pr_flush()") Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240207134103.1357162-10-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-02-07 17:23:18 +01:00
John Ogness	584528d621	printk: ringbuffer: Cleanup reader terminology With the lockless ringbuffer, it is allowed that multiple CPUs/contexts write simultaneously into the buffer. This creates an ambiguity as some writers will finalize sooner. The documentation for the prb_read functions is not clear as it refers to "not yet written" and "no data available". Clarify the return values and language to be in terms of the reader: records available for reading. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240207134103.1357162-9-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-02-07 17:23:18 +01:00
John Ogness	36652d0f3b	printk: Add this_cpu_in_panic() There is already panic_in_progress() and other_cpu_in_panic(), but checking if the current CPU is the panic CPU must still be open coded. Add this_cpu_in_panic() to complete the set. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240207134103.1357162-8-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-02-07 17:23:18 +01:00
John Ogness	0ab7cdd004	printk: For @suppress_panic_printk check for other CPU in panic Currently @suppress_panic_printk is checked along with non-matching @panic_cpu and current CPU. This works because @suppress_panic_printk is only set when panic_in_progress() is true. Rather than relying on the @suppress_panic_printk semantics, use the concise helper function other_cpu_in_progress(). The helper function exists to avoid open coding such tests. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240207134103.1357162-7-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-02-07 17:23:18 +01:00
John Ogness	5113cf5f4c	printk: ringbuffer: Clarify special lpos values For empty line records, no data blocks are created. Instead, these valid records are identified by special logical position values (in fields of @prb_desc.text_blk_lpos). Currently the macro NO_LPOS is used for empty line records. This name is confusing because it does not imply _why_ there is no data block. Rename NO_LPOS to EMPTY_LINE_LPOS so that it is clear why there is no data block. Also add comments explaining the use of EMPTY_LINE_LPOS as well as clarification to the values used to represent data-less blocks. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240207134103.1357162-6-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-02-07 17:23:18 +01:00
John Ogness	5f72e52ba9	printk: ringbuffer: Do not skip non-finalized records with prb_next_seq() Commit `f244b4dc53` ("printk: ringbuffer: Improve prb_next_seq() performance") introduced an optimization for prb_next_seq() by using best-effort to track recently finalized records. However, the order of finalization does not necessarily match the order of the records. The optimization changed prb_next_seq() to return inconsistent results, possibly yielding sequence numbers that are not available to readers because they are preceded by non-finalized records or they are not yet visible to the reader CPU. Rather than simply best-effort tracking recently finalized records, force the committing writer to read records and increment the last "contiguous block" of finalized records. In order to do this, the sequence number instead of ID must be stored because ID's cannot be directly compared. A new memory barrier pair is introduced to guarantee that a reader can always read the records up until the sequence number returned by prb_next_seq() (unless the records have since been overwritten in the ringbuffer). This restores the original functionality of prb_next_seq() while also keeping the optimization. For 32bit systems, only the lower 32 bits of the sequence number are stored. When reading the value, it is expanded to the full 64bit sequence number using the 32bit seq macros, which fold in the value returned by prb_first_seq(). Fixes: `f244b4dc53` ("printk: ringbuffer: Improve prb_next_seq() performance") Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240207134103.1357162-5-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-02-07 17:23:18 +01:00
John Ogness	90ad525c2d	printk: Use prb_first_seq() as base for 32bit seq macros Note: This change only applies to 32bit architectures. On 64bit architectures the macros are NOPs. Currently prb_next_seq() is used as the base for the 32bit seq macros __u64seq_to_ulseq() and __ulseq_to_u64seq(). However, in a follow-up commit, prb_next_seq() will need to make use of the 32bit seq macros. Use prb_first_seq() as the base for the 32bit seq macros instead because it is guaranteed to return 64bit sequence numbers without relying on any 32bit seq macros. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240207134103.1357162-4-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-02-07 17:23:18 +01:00
Sebastian Andrzej Siewior	418ec1961c	printk: Adjust mapping for 32bit seq macros Note: This change only applies to 32bit architectures. On 64bit architectures the macros are NOPs. __ulseq_to_u64seq() computes the upper 32 bits of the passed argument value (@ulseq). The upper bits are derived from a base value (@rb_next_seq) in a way that assumes @ulseq represents a 64bit number that is less than or equal to @rb_next_seq. Until now this mapping has been correct for all call sites. However, in a follow-up commit, values of @ulseq will be passed in that are higher than the base value. This requires a change to how the 32bit value is mapped to a 64bit sequence number. Rather than mapping @ulseq such that the base value is the end of a 32bit block, map @ulseq such that the base value is in the middle of a 32bit block. This allows supporting 31 bits before and after the base value, which is deemed acceptable for the console sequence number during runtime. Here is an example to illustrate the previous and new mappings. For a base value (@rb_next_seq) of 2 2000 0000... Before this change the range of possible return values was: 1 2000 0001 to 2 2000 0000 __ulseq_to_u64seq(1fff ffff) => 2 1fff ffff __ulseq_to_u64seq(2000 0000) => 2 2000 0000 __ulseq_to_u64seq(2000 0001) => 1 2000 0001 __ulseq_to_u64seq(9fff ffff) => 1 9fff ffff __ulseq_to_u64seq(a000 0000) => 1 a000 0000 __ulseq_to_u64seq(a000 0001) => 1 a000 0001 After this change the range of possible return values are: 1 a000 0001 to 2 a000 0000 __ulseq_to_u64seq(1fff ffff) => 2 1fff ffff __ulseq_to_u64seq(2000 0000) => 2 2000 0000 __ulseq_to_u64seq(2000 0001) => 2 2000 0001 __ulseq_to_u64seq(9fff ffff) => 2 9fff ffff __ulseq_to_u64seq(a000 0000) => 2 a000 0000 __ulseq_to_u64seq(a000 0001) => 1 a000 0001 [ john.ogness: Rewrite commit message. ] Reported-by: Francesco Dolcini <francesco@dolcini.it> Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240207134103.1357162-3-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-02-07 17:23:17 +01:00
John Ogness	5b73e706f0	printk: nbcon: Relocate 32bit seq macros The macros __seq_to_nbcon_seq() and __nbcon_seq_to_seq() are used to provide support for atomic handling of sequence numbers on 32bit systems. Until now this was only used by nbcon.c, which is why they were located in nbcon.c and include nbcon in the name. In a follow-up commit this functionality is also needed by printk_ringbuffer. Rather than duplicating the functionality, relocate the macros to printk_ringbuffer.h. Also, since the macros will be no longer nbcon-specific, rename them to __u64seq_to_ulseq() and __ulseq_to_u64seq(). This does not result in any functional change. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240207134103.1357162-2-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-02-07 17:23:17 +01:00
Peter Hilber	4b7f521229	timekeeping: Evaluate system_counterval_t.cs_id instead of .cs Clocksource pointers can be problematic to obtain for drivers which are not clocksource drivers themselves. In particular, the RFC virtio_rtc driver [1] would require a new helper function to obtain a pointer to the ARM Generic Timer clocksource. The ptp_kvm driver also required a similar workaround. Address this by evaluating the clocksource ID, rather than the clocksource pointer, of struct system_counterval_t. By this, setting the clocksource pointer becomes unneeded, and get_device_system_crosststamp() callers will no longer need to supply clocksource pointers. All relevant clocksource drivers provide the ID, so this change is not changing the behaviour. [1] https://lore.kernel.org/lkml/20231218073849.35294-1-peter.hilber@opensynergy.com/ Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240201010453.2212371-7-peter.hilber@opensynergy.com	2024-02-07 17:05:21 +01:00
Ricardo B. Marliere	49f1ff50d4	clockevents: Make clockevents_subsys const Now that the driver core can properly handle constant struct bus_type, move the clockevents_subsys variable to be a constant structure as well, placing it into read-only memory which can not be modified at runtime. Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20240204-bus_cleanup-time-v1-2-207ec18e24b8@marliere.net	2024-02-07 15:11:24 +01:00
Ricardo B. Marliere	2bc7fc24f9	clocksource: Make clocksource_subsys const Now that the driver core can properly handle constant struct bus_type, move the clocksource_subsys variable to be a constant structure as well, placing it into read-only memory which can not be modified at runtime. Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/20240204-bus_cleanup-time-v1-1-207ec18e24b8@marliere.net	2024-02-07 15:11:24 +01:00
Tycho Andersen	0c9bd6bc4b	pidfd: getfd should always report ESRCH if a task is exiting We can get EBADF from pidfd_getfd() if a task is currently exiting, which might be confusing. Let's check PF_EXITING, and just report ESRCH if so. I chose PF_EXITING, because it is set in exit_signals(), which is called before exit_files(). Since ->exit_status is mostly set after exit_files() in exit_notify(), using that still leaves a window open for the race. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Tycho Andersen <tandersen@netflix.com> Link: https://lore.kernel.org/r/20240206192357.81942-1-tycho@tycho.pizza Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-07 12:09:44 +01:00
Oleg Nesterov	83b290c9e3	pidfd: clone: allow CLONE_THREAD \| CLONE_PIDFD together copy_process() just needs to pass PIDFD_THREAD to __pidfd_prepare() if clone_flags & CLONE_THREAD. We can also add another CLONE_ flag (or perhaps reuse CLONE_DETACHED) to enforce PIDFD_THREAD without CLONE_THREAD. Originally-from: Tycho Andersen <tycho@tycho.pizza> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240205145532.GA28823@redhat.com Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-06 14:39:32 +01:00
Oleg Nesterov	e2e8a142fb	pidfd: exit: kill the no longer used thread_group_exited() It was used by pidfd_poll() but now it has no callers. If it finally finds a modular user we can revert this change, but note that the comment above this helper and the changelog in `38fd525a4c` ("exit: Factor thread_group_exited out of pidfd_poll") are not accurate, thread_group_exited() won't return true if all other threads have passed exit_notify() and are zombies, it returns true only when all other threads are completely gone. Not to mention that it can only work if the task identified by @pid is a thread-group leader. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240205174347.GA31461@redhat.com Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-06 14:02:51 +01:00
Oleg Nesterov	9ed52108f6	pidfd: change do_notify_pidfd() to use __wake_up(poll_to_key(EPOLLIN)) rather than wake_up_all(). This way do_notify_pidfd() won't wakeup the POLLHUP-only waiters which wait for pid_task() == NULL. TODO: - as Christian pointed out, this asks for the new wake_up_all_poll() helper, it can already have other users. - we can probably discriminate the PIDFD_THREAD and non-PIDFD_THREAD waiters, but this needs more work. See https://lore.kernel.org/all/20240205140848.GA15853@redhat.com/ Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240205141348.GA16539@redhat.com Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-06 13:52:51 +01:00
Frederic Weisbecker	dad6a09f31	hrtimer: Report offline hrtimer enqueue The hrtimers migration on CPU-down hotplug process has been moved earlier, before the CPU actually goes to die. This leaves a small window of opportunity to queue an hrtimer in a blind spot, leaving it ignored. For example a practical case has been reported with RCU waking up a SCHED_FIFO task right before the CPUHP_AP_IDLE_DEAD stage, queuing that way a sched/rt timer to the local offline CPU. Make sure such situations never go unnoticed and warn when that happens. Fixes: `5c0930ccaa` ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240129235646.3171983-4-boqun.feng@gmail.com	2024-02-06 10:56:35 +01:00
Kumar Kartikeya Dwivedi	6fceea0fa5	bpf: Transfer RCU lock state between subprog calls Allow transferring an imbalanced RCU lock state between subprog calls during verification. This allows patterns where a subprog call returns with an RCU lock held, or a subprog call releases an RCU lock held by the caller. Currently, the verifier would end up complaining if the RCU lock is not released when processing an exit from a subprog, which is non-ideal if its execution is supposed to be enclosed in an RCU read section of the caller. Instead, simply only check whether we are processing exit for frame#0 and do not complain on an active RCU lock otherwise. We only need to update the check when processing BPF_EXIT insn, as copy_verifier_state is already set up to do the right thing. Suggested-by: David Vernet <void@manifault.com> Tested-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20240205055646.1112186-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-02-05 20:00:14 -08:00
Kumar Kartikeya Dwivedi	a44b1334aa	bpf: Allow calling static subprogs while holding a bpf_spin_lock Currently, calling any helpers, kfuncs, or subprogs except the graph data structure (lists, rbtrees) API kfuncs while holding a bpf_spin_lock is not allowed. One of the original motivations of this decision was to force the BPF programmer's hand into keeping the bpf_spin_lock critical section small, and to ensure the execution time of the program does not increase due to lock waiting times. In addition to this, some of the helpers and kfuncs may be unsafe to call while holding a bpf_spin_lock. However, when it comes to subprog calls, atleast for static subprogs, the verifier is able to explore their instructions during verification. Therefore, it is similar in effect to having the same code inlined into the critical section. Hence, not allowing static subprog calls in the bpf_spin_lock critical section is mostly an annoyance that needs to be worked around, without providing any tangible benefit. Unlike static subprog calls, global subprog calls are not safe to permit within the critical section, as the verifier does not explore them during verification, therefore whether the same lock will be taken again, or unlocked, cannot be ascertained. Therefore, allow calling static subprogs within a bpf_spin_lock critical section, and only reject it in case the subprog linkage is global. Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20240204222349.938118-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-02-05 19:58:47 -08:00
Tejun Heo	40911d4457	Merge branch 'for-6.8-fixes' into for-6.9 The for-6.8-fixes commit ae9cc8956944 ("Revert "workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()") also fixes build for Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-05 15:49:47 -10:00
Tejun Heo	aac8a59537	Revert "workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()" This reverts commit `ca10d851b9`. The commit allowed workqueue_apply_unbound_cpumask() to clear __WQ_ORDERED on now removed implicitly ordered workqueues. This was incorrect in that system-wide config change shouldn't break ordering properties of all workqueues. The reason why apply_workqueue_attrs() path was allowed to do so was because it was targeting the specific workqueue - either the workqueue had WQ_SYSFS set or the workqueue user specifically tried to change max_active, both of which indicate that the workqueue doesn't need to be ordered. The implicitly ordered workqueue promotion was removed by the previous commit `3bc1e711c2` ("workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered"). However, it didn't update this path and broke build. Let's revert the commit which was incorrect in the first place which also fixes build. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `3bc1e711c2` ("workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered") Fixes: `ca10d851b9` ("workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()") Cc: stable@vger.kernel.org # v6.6+ Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-05 15:49:06 -10:00
Tejun Heo	3bc1e711c2	workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered `5c0338c687` ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") automoatically promoted UNBOUND workqueues w/ @max_active==1 to ordered workqueues because UNBOUND workqueues w/ @max_active==1 used to be the way to create ordered workqueues and the new NUMA support broke it. These problems can be subtle and the fact that they can only trigger on NUMA machines made them even more difficult to debug. However, overloading the UNBOUND allocation interface this way creates other issues. It's difficult to tell whether a given workqueue actually needs to be ordered and users that legitimately want a min concurrency level wq unexpectedly gets an ordered one instead. With planned UNBOUND workqueue udpates to improve execution locality and more prevalence of chiplet designs which can benefit from such improvements, this isn't a state we wanna be in forever. There aren't that many UNBOUND w/ @max_active==1 users in the tree and the preceding patches audited all and converted them to alloc_ordered_workqueue() as appropriate. This patch removes the implicit promotion of UNBOUND w/ @max_active==1 workqueues to ordered ones. v2: v1 patch incorrectly dropped !list_empty(&wq->pwqs) condition in apply_workqueue_attrs_locked() which spuriously triggers WARNING and fails workqueue creation. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/oe-lkp/202304251050.45a5df1f-oliver.sang@intel.com	2024-02-05 14:19:10 -10:00
Tejun Heo	7245d24f87	backtracetest: Convert from tasklet to BH workqueue The only generic interface to execute asynchronously in the BH context is tasklet; however, it's marked deprecated and has some design flaws. To replace tasklets, BH workqueue support was recently added. A BH workqueue behaves similarly to regular workqueues except that the queued work items are executed in the BH context. This patch converts backtracetest from tasklet to BH workqueue. - Replace "irq" with "bh" in names and message to better reflect what's happening. - Replace completion usage with a flush_work() call. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com>	2024-02-05 13:22:34 -10:00
Kui-Feng Lee	df9705eaa0	bpf: Remove an unnecessary check. The "i" here is always equal to "btf_type_vlen(t)" since the "for_each_member()" loop never breaks. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240203055119.2235598-1-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2024-02-05 10:25:08 -08:00
Waiman Long	8eb17dc1a6	workqueue: Skip __WQ_DESTROYING workqueues when updating global unbound cpumask Skip updating workqueues with __WQ_DESTROYING bit set when updating global unbound cpumask to avoid unnecessary work and other complications. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-05 07:52:22 -10:00
Wang Jinchao	96068b6030	workqueue: fix a typo in comment There should be three, fix it. Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-05 07:48:25 -10:00
Tejun Heo	4f19b8e01e	Revert "workqueue: make wq_subsys const" This reverts commit `d412ace111`. This leads to build failures as it depends on a driver-core commit `32f78abe59` ("driver core: bus: constantify subsys_register() calls"). Let's drop it from wq tree and route it through driver-core tree. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202402051505.kM9Rr3CJ-lkp@intel.com/	2024-02-05 07:19:54 -10:00
Jens Axboe	06b23f92af	block: update cached timestamp post schedule/preemption Mark the task as having a cached timestamp when set assign it, so we can efficiently check if it needs updating post being scheduled back in. This covers both the actual schedule out case, which would've flushed the plug, and the preemption case which doesn't touch the plugged requests (for many reasons, one of them being then we'd need to have preemption disabled around plug state manipulation). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-05 10:07:34 -07:00
Nikhil V	8bc2973635	PM: hibernate: Add support for LZ4 compression for hibernation Extend the support for LZ4 compression to be used with hibernation. The main idea is that different compression algorithms have different characteristics and hibernation may benefit when it uses any of these algorithms: a default algorithm, having higher compression rate but is slower(compression/decompression) and a secondary algorithm, that is faster(compression/decompression) but has lower compression rate. LZ4 algorithm has better decompression speeds over LZO. This reduces the hibernation image restore time. As per test results: LZO LZ4 Size before Compression(bytes) 682696704 682393600 Size after Compression(bytes) 146502402 155993547 Decompression Rate 335.02 MB/s 501.05 MB/s Restore time 4.4s 3.8s LZO is the default compression algorithm used for hibernation. Enable CONFIG_HIBERNATION_COMP_LZ4 to set the default compressor as LZ4. Signed-off-by: Nikhil V <quic_nprakash@quicinc.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-05 14:30:35 +01:00
Nikhil V	a06c6f5d3c	PM: hibernate: Move to crypto APIs for LZO compression Currently for hibernation, LZO is the only compression algorithm available and uses the existing LZO library calls. However, there is no flexibility to switch to other algorithms which provides better results. The main idea is that different compression algorithms have different characteristics and hibernation may benefit when it uses alternate algorithms. By moving to crypto based APIs, it lays a foundation to use other compression algorithms for hibernation. There are no functional changes introduced by this approach. Signed-off-by: Nikhil V <quic_nprakash@quicinc.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-05 14:28:54 +01:00
Nikhil V	89a807625f	PM: hibernate: Rename lzo* to make it generic Renaming lzo* to generic names, except for lzo_xxx() APIs. This is used in the next patch where we move to crypto based APIs for compression. There are no functional changes introduced by this approach. Signed-off-by: Nikhil V <quic_nprakash@quicinc.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-05 14:28:54 +01:00
Rafael J. Wysocki	a6d38e991d	PM: sleep: stats: Use locking in dpm_save_failed_dev() Because dpm_save_failed_dev() may be called simultaneously by multiple failing device PM functions, the state of the suspend_stats fields updated by it may become inconsistent. Prevent that from happening by using a lock in dpm_save_failed_dev(). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>	2024-02-05 14:28:54 +01:00
Rafael J. Wysocki	9ff544fa5f	PM: sleep: stats: Define suspend_stats next to the code using it It is not necessary to define struct suspend_stats in a header file and the suspend_stats variable in the core device system-wide PM code. They both can be defined in kernel/power/main.c, next to the sysfs and debugfs code accessing suspend_stats, which can be static. Modify the code in question in accordance with the above observation and replace the static inline functions manipulating suspend_stats with regular ones defined in kernel/power/main.c. While at it, move the enum suspend_stat_step to the end of suspend.h which is a more suitable place for it. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>	2024-02-05 14:28:19 +01:00
Rafael J. Wysocki	2231f78d3e	PM: sleep: stats: Use unsigned int for success and failure counters Change the type of the "success" and "fail" fields in struct suspend_stats to unsigned int, because they cannot be negative. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>	2024-02-05 14:26:27 +01:00
Rafael J. Wysocki	b730bab0b9	PM: sleep: stats: Use an array of step failure counters Instead of using a set of individual struct suspend_stats fields representing suspend step failure counters, use an array of counters indexed by enum suspend_stat_step for this purpose, which allows dpm_save_failed_step() to increment the appropriate counter automatically, so that its callers don't need to do that directly. It also allows suspend_stats_show() to carry out a loop over the counters array to print their values. Because the counters cannot become negative, use unsigned int for representing them. The only user-observable impact of this change is a different ordering of entries in the suspend_stats debugfs file which is not expected to matter. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>	2024-02-05 14:25:56 +01:00
Rafael J. Wysocki	bc88528cda	PM: sleep: stats: Use array of suspend step names Replace suspend_step_name() in the suspend statistics code with an array of suspend step names which has fewer lines of code and less overhead. While at it, remove two unnecessary line breaks in suspend_stats_show() and adjust some white space in there to the kernel coding style for a more consistent code layout. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>	2024-02-05 14:21:24 +01:00
Tejun Heo	4cb1ef6460	workqueue: Implement BH workqueues to eventually replace tasklets The only generic interface to execute asynchronously in the BH context is tasklet; however, it's marked deprecated and has some design flaws such as the execution code accessing the tasklet item after the execution is complete which can lead to subtle use-after-free in certain usage scenarios and less-developed flush and cancel mechanisms. This patch implements BH workqueues which share the same semantics and features of regular workqueues but execute their work items in the softirq context. As there is always only one BH execution context per CPU, none of the concurrency management mechanisms applies and a BH workqueue can be thought of as a convenience wrapper around softirq. Except for the inability to sleep while executing and lack of max_active adjustments, BH workqueues and work items should behave the same as regular workqueues and work items. Currently, the execution is hooked to tasklet[_hi]. However, the goal is to convert all tasklet users over to BH workqueues. Once the conversion is complete, tasklet can be removed and BH workqueues can directly take over the tasklet softirqs. system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in tasklet, all existing tasklet users should be able to use the system BH workqueues without creating their own workqueues. v3: - Add missing interrupt.h include. v2: - Instead of using tasklets, hook directly into its softirq action functions - tasklet[_hi]_action(). This is slightly cheaper and closer to the eventual code structure we want to arrive at. Suggested by Lai. - Lai also pointed out several places which need NULL worker->task handling or can use clarification. Updated. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com Tested-by: Allen Pais <allen.lkml@gmail.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-04 11:28:06 -10:00
Tejun Heo	2fcdb1b444	workqueue: Factor out init_cpu_worker_pool() Factor out init_cpu_worker_pool() from workqueue_init_early(). This is pure reorganization in preparation of BH workqueue support. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Allen Pais <allen.lkml@gmail.com>	2024-02-04 11:28:06 -10:00
Tejun Heo	c35aea39d1	workqueue: Update lock debugging code These changes are in preparation of BH workqueue which will execute work items from BH context. - Update lock and RCU depth checks in process_one_work() so that it remembers and checks against the starting depths and prints out the depth changes. - Factor out lockdep annotations in the flush paths into touch_{wq\|work}_lockdep_map(). The work->lockdep_map touching is moved from __flush_work() to its callee - start_flush_work(). This brings it closer to the wq counterpart and will allow testing the associated wq's flags which will be needed to support BH workqueues. This is not expected to cause any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Allen Pais <allen.lkml@gmail.com>	2024-02-04 11:28:06 -10:00
Ricardo B. Marliere	d412ace111	workqueue: make wq_subsys const Now that the driver core can properly handle constant struct bus_type, move the wq_subsys variable to be a constant structure as well, placing it into read-only memory which can not be modified at runtime. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-and-reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-04 11:23:25 -10:00
Tejun Heo	c70e1779b7	workqueue: Fix pwq->nr_in_flight corruption in try_to_grab_pending() `dd6c3c5441` ("workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling") relocated pwq_dec_nr_in_flight() after set_work_pool_and_keep_pending(). However, the latter destroys information contained in work->data that's needed by pwq_dec_nr_in_flight() including the flush color. With flush color destroyed, flush_workqueue() can stall easily when mixed with cancel_work*() usages. This is easily triggered by running xfstests generic/001 test on xfs: INFO: task umount:6305 blocked for more than 122 seconds. ... task:umount state:D stack:13008 pid:6305 tgid:6305 ppid:6301 flags:0x00004000 Call Trace: <TASK> __schedule+0x2f6/0xa20 schedule+0x36/0xb0 schedule_timeout+0x20b/0x280 wait_for_completion+0x8a/0x140 __flush_workqueue+0x11a/0x3b0 xfs_inodegc_flush+0x24/0xf0 xfs_unmountfs+0x14/0x180 xfs_fs_put_super+0x3d/0x90 generic_shutdown_super+0x7c/0x160 kill_block_super+0x1b/0x40 xfs_kill_sb+0x12/0x30 deactivate_locked_super+0x35/0x90 deactivate_super+0x42/0x50 cleanup_mnt+0x109/0x170 __cleanup_mnt+0x12/0x20 task_work_run+0x60/0x90 syscall_exit_to_user_mode+0x146/0x150 do_syscall_64+0x5d/0x110 entry_SYSCALL_64_after_hwframe+0x6c/0x74 Fix it by stashing work_data before calling set_work_pool_and_keep_pending() and using the stashed value for pwq_dec_nr_in_flight(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Chandan Babu R <chandanbabu@kernel.org> Link: http://lkml.kernel.org/r/87o7cxeehy.fsf@debian-BULLSEYE-live-builder-AMD64 Fixes: `dd6c3c5441` ("workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling")	2024-02-04 11:14:49 -10:00
Andrii Nakryiko	1eb986746a	bpf: don't emit warnings intended for global subprogs for static subprogs When btf_prepare_func_args() was generalized to handle both static and global subprogs, a few warnings/errors that are meant only for global subprog cases started to be emitted for static subprogs, where they are sort of expected and irrelavant. Stop polutting verifier logs with irrelevant scary-looking messages. Fixes: `e26080d0da` ("bpf: prepare btf_prepare_func_args() for handling static subprogs") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240202190529.2374377-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-02-02 18:08:59 -08:00
Andrii Nakryiko	8f13c34087	bpf: handle trusted PTR_TO_BTF_ID_OR_NULL in argument check logic Add PTR_TRUSTED \| PTR_MAYBE_NULL modifiers for PTR_TO_BTF_ID to check_reg_type() to support passing trusted nullable PTR_TO_BTF_ID registers into global functions accepting `__arg_trusted __arg_nullable` arguments. This hasn't been caught earlier because tests were either passing known non-NULL PTR_TO_BTF_ID registers or known NULL (SCALAR) registers. When utilizing this functionality in complicated real-world BPF application that passes around PTR_TO_BTF_ID_OR_NULL, it became apparent that verifier rejects valid case because check_reg_type() doesn't handle this case explicitly. Existing check_reg_type() logic is already anticipating this combination, so we just need to explicitly list this combo in the switch statement. Fixes: `e2b3c4ff5d` ("bpf: add __arg_trusted global func arg tag") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240202190529.2374377-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-02-02 18:08:58 -08:00
Linus Torvalds	56897d5188	Tracing and eventfs fixes for v6.8: - Fix the return code for ring_buffer_poll_wait() It was returing a -EINVAL instead of EPOLLERR. - Zero out the tracefs_inode so that all fields are initialized. The ti->private could have had stale data, but instead of just initializing it to NULL, clear out the entire structure when it is allocated. - Fix a crash in timerlat The hrtimer was initialized at read and not open, but is canceled at close. If the file was opened and never read the close will pass a NULL pointer to hrtime_cancel(). - Rewrite of eventfs. Linus wrote a patch series to remove the dentry references in the eventfs_inode and to use ref counting and more of proper VFS interfaces to make it work. - Add warning to put_ei() if ei is not set to free. That means something is about to free it when it shouldn't. - Restructure the eventfs_inode to make it more compact, and remove the unused llist field. - Remove the fsnotify() funtions for when the inodes were being created in the lookup code. It doesn't make sense to notify about creation just because something is being looked up. - The inode hard link count was not accurate. It was being updated when a file was looked up. The inodes of directories were updating their parent inode hard link count every time the inode was created. That means if memory reclaim cleaned a stale directory inode and the inode was lookup up again, it would increment the parent inode again as well. Al Viro said to just have all eventfs directories have a hard link count of 1. That tells user space not to trust it. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZb1l/RQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qk6jAQDmecDOnx+j/Rm5krbX/meVPYXFj2CU 1wO7w1HBzopsBwEA5AjTKm9IGrl/eVG/+jViS165b+sJfwEcblHEFPWcIwo= =uUzb -----END PGP SIGNATURE----- Merge tag 'trace-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing and eventfs fixes from Steven Rostedt: - Fix the return code for ring_buffer_poll_wait() It was returing a -EINVAL instead of EPOLLERR. - Zero out the tracefs_inode so that all fields are initialized. The ti->private could have had stale data, but instead of just initializing it to NULL, clear out the entire structure when it is allocated. - Fix a crash in timerlat The hrtimer was initialized at read and not open, but is canceled at close. If the file was opened and never read the close will pass a NULL pointer to hrtime_cancel(). - Rewrite of eventfs. Linus wrote a patch series to remove the dentry references in the eventfs_inode and to use ref counting and more of proper VFS interfaces to make it work. - Add warning to put_ei() if ei is not set to free. That means something is about to free it when it shouldn't. - Restructure the eventfs_inode to make it more compact, and remove the unused llist field. - Remove the fsnotify() funtions for when the inodes were being created in the lookup code. It doesn't make sense to notify about creation just because something is being looked up. - The inode hard link count was not accurate. It was being updated when a file was looked up. The inodes of directories were updating their parent inode hard link count every time the inode was created. That means if memory reclaim cleaned a stale directory inode and the inode was lookup up again, it would increment the parent inode again as well. Al Viro said to just have all eventfs directories have a hard link count of 1. That tells user space not to trust it. * tag 'trace-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Keep all directory links at 1 eventfs: Remove fsnotify*() functions from lookup() eventfs: Restructure eventfs_inode structure to be more condensed eventfs: Warn if an eventfs_inode is freed without is_freed being set tracing/timerlat: Move hrtimer_init to timerlat_fd open() eventfs: Get rid of dentry pointers without refcounts eventfs: Clean up dentry ops and add revalidate function eventfs: Remove unused d_parent pointer field tracefs: dentry lookup crapectomy tracefs: Avoid using the ei->dentry pointer unnecessarily eventfs: Initialize the tracefs inode properly tracefs: Zero out the tracefs_inode when allocating it ring-buffer: Clean ring_buffer_poll_wait() error return	2024-02-02 15:32:58 -08:00

... 2 3 4 5 6 ...

44169 commits