Commit graph

500 commits

Author SHA1 Message Date
Zheng Yejian
45d99ea451 ring-buffer: Fix bytes info in per_cpu buffer stats
The 'bytes' info in file 'per_cpu/cpu<X>/stats' means the number of
bytes in cpu buffer that have not been consumed. However, currently
after consuming data by reading file 'trace_pipe', the 'bytes' info
was not changed as expected.

  # cat per_cpu/cpu0/stats
  entries: 0
  overrun: 0
  commit overrun: 0
  bytes: 568             <--- 'bytes' is problematical !!!
  oldest event ts:  8651.371479
  now ts:  8653.912224
  dropped events: 0
  read events: 8

The root cause is incorrect stat on cpu_buffer->read_bytes. To fix it:
  1. When stat 'read_bytes', account consumed event in rb_advance_reader();
  2. When stat 'entries_bytes', exclude the discarded padding event which
     is smaller than minimum size because it is invisible to reader. Then
     use rb_page_commit() instead of BUF_PAGE_SIZE at where accounting for
     page-based read/remove/overrun.

Also correct the comments of ring_buffer_bytes_cpu() in this patch.

Link: https://lore.kernel.org/linux-trace-kernel/20230921125425.1708423-1-zhengyejian1@huawei.com

Cc: stable@vger.kernel.org
Fixes: c64e148a3b ("trace: Add ring buffer stats to measure rate of events")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-22 16:57:14 -04:00
Steven Rostedt (Google)
95a404bd60 ring-buffer: Do not attempt to read past "commit"
When iterating over the ring buffer while the ring buffer is active, the
writer can corrupt the reader. There's barriers to help detect this and
handle it, but that code missed the case where the last event was at the
very end of the page and has only 4 bytes left.

The checks to detect the corruption by the writer to reads needs to see the
length of the event. If the length in the first 4 bytes is zero then the
length is stored in the second 4 bytes. But if the writer is in the process
of updating that code, there's a small window where the length in the first
4 bytes could be zero even though the length is only 4 bytes. That will
cause rb_event_length() to read the next 4 bytes which could happen to be off the
allocated page.

To protect against this, fail immediately if the next event pointer is
less than 8 bytes from the end of the commit (last byte of data), as all
events must be a minimum of 8 bytes anyway.

Link: https://lore.kernel.org/all/20230905141245.26470-1-Tze-nan.Wu@mediatek.com/
Link: https://lore.kernel.org/linux-trace-kernel/20230907122820.0899019c@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Reported-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-08 23:12:19 -04:00
Zheng Yejian
f6bd2c9248 ring-buffer: Avoid softlockup in ring_buffer_resize()
When user resize all trace ring buffer through file 'buffer_size_kb',
then in ring_buffer_resize(), kernel allocates buffer pages for each
cpu in a loop.

If the kernel preemption model is PREEMPT_NONE and there are many cpus
and there are many buffer pages to be allocated, it may not give up cpu
for a long time and finally cause a softlockup.

To avoid it, call cond_resched() after each cpu buffer allocation.

Link: https://lore.kernel.org/linux-trace-kernel/20230906081930.3939106-1-zhengyejian1@huawei.com

Cc: <mhiramat@kernel.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-07 16:38:54 -04:00
Uros Bizjak
00a8478f8f ring_buffer: Use try_cmpxchg instead of cmpxchg
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
ring_buffer.c. x86 CMPXCHG instruction returns success in ZF flag,
so this change saves a compare after cmpxchg (and related move
instruction in front of cmpxchg).

No functional change intended.

Link: https://lore.kernel.org/linux-trace-kernel/20230714154418.8884-1-ubizjak@gmail.com

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-30 18:11:45 -04:00
Gaosheng Cui
151e34d1c6 ring-buffer: Fix kernel-doc warnings in ring_buffer.c
Fix kernel-doc warnings:

kernel/trace/ring_buffer.c:954: warning: Function parameter or
member 'cpu' not described in 'ring_buffer_wake_waiters'
kernel/trace/ring_buffer.c:3383: warning: Excess function parameter
'event' description in 'ring_buffer_unlock_commit'
kernel/trace/ring_buffer.c:5359: warning: Excess function parameter
'cpu' description in 'ring_buffer_reset_online_cpus'

Link: https://lkml.kernel.org/r/20230724140827.1023266-2-cuigaosheng1@huawei.com

Cc: <mhiramat@kernel.org>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28 19:59:03 -04:00
Zheng Yejian
2d093282b0 ring-buffer: Fix wrong stat of cpu_buffer->read
When pages are removed in rb_remove_pages(), 'cpu_buffer->read' is set
to 0 in order to make sure any read iterators reset themselves. However,
this will mess 'entries' stating, see following steps:

  # cd /sys/kernel/tracing/
  # 1. Enlarge ring buffer prepare for later reducing:
  # echo 20 > per_cpu/cpu0/buffer_size_kb
  # 2. Write a log into ring buffer of cpu0:
  # taskset -c 0 echo "hello1" > trace_marker
  # 3. Read the log:
  # cat per_cpu/cpu0/trace_pipe
       <...>-332     [000] .....    62.406844: tracing_mark_write: hello1
  # 4. Stop reading and see the stats, now 0 entries, and 1 event readed:
  # cat per_cpu/cpu0/stats
   entries: 0
   [...]
   read events: 1
  # 5. Reduce the ring buffer
  # echo 7 > per_cpu/cpu0/buffer_size_kb
  # 6. Now entries became unexpected 1 because actually no entries!!!
  # cat per_cpu/cpu0/stats
   entries: 1
   [...]
   read events: 0

To fix it, introduce 'page_removed' field to count total removed pages
since last reset, then use it to let read iterators reset themselves
instead of changing the 'read' pointer.

Link: https://lore.kernel.org/linux-trace-kernel/20230724054040.3489499-1-zhengyejian1@huawei.com

Cc: <mhiramat@kernel.org>
Cc: <vnagarnaik@google.com>
Fixes: 83f40318da ("ring-buffer: Make removal of ring buffer pages atomic")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28 19:16:23 -04:00
Chen Lin
8a96c0288d ring-buffer: Do not swap cpu_buffer during resize process
When ring_buffer_swap_cpu was called during resize process,
the cpu buffer was swapped in the middle, resulting in incorrect state.
Continuing to run in the wrong state will result in oops.

This issue can be easily reproduced using the following two scripts:
/tmp # cat test1.sh
//#! /bin/sh
for i in `seq 0 100000`
do
         echo 2000 > /sys/kernel/debug/tracing/buffer_size_kb
         sleep 0.5
         echo 5000 > /sys/kernel/debug/tracing/buffer_size_kb
         sleep 0.5
done
/tmp # cat test2.sh
//#! /bin/sh
for i in `seq 0 100000`
do
        echo irqsoff > /sys/kernel/debug/tracing/current_tracer
        sleep 1
        echo nop > /sys/kernel/debug/tracing/current_tracer
        sleep 1
done
/tmp # ./test1.sh &
/tmp # ./test2.sh &

A typical oops log is as follows, sometimes with other different oops logs.

[  231.711293] WARNING: CPU: 0 PID: 9 at kernel/trace/ring_buffer.c:2026 rb_update_pages+0x378/0x3f8
[  231.713375] Modules linked in:
[  231.714735] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G        W          6.5.0-rc1-00276-g20edcec23f92 #15
[  231.716750] Hardware name: linux,dummy-virt (DT)
[  231.718152] Workqueue: events update_pages_handler
[  231.719714] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  231.721171] pc : rb_update_pages+0x378/0x3f8
[  231.722212] lr : rb_update_pages+0x25c/0x3f8
[  231.723248] sp : ffff800082b9bd50
[  231.724169] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000
[  231.726102] x26: 0000000000000001 x25: fffffffffffff010 x24: 0000000000000ff0
[  231.728122] x23: ffff0000c3a0b600 x22: ffff0000c3a0b5c0 x21: fffffffffffffe0a
[  231.730203] x20: ffff0000c3a0b600 x19: ffff0000c0102400 x18: 0000000000000000
[  231.732329] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffffe7aa8510
[  231.734212] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
[  231.736291] x11: ffff8000826998a8 x10: ffff800082b9baf0 x9 : ffff800081137558
[  231.738195] x8 : fffffc00030e82c8 x7 : 0000000000000000 x6 : 0000000000000001
[  231.740192] x5 : ffff0000ffbafe00 x4 : 0000000000000000 x3 : 0000000000000000
[  231.742118] x2 : 00000000000006aa x1 : 0000000000000001 x0 : ffff0000c0007208
[  231.744196] Call trace:
[  231.744892]  rb_update_pages+0x378/0x3f8
[  231.745893]  update_pages_handler+0x1c/0x38
[  231.746893]  process_one_work+0x1f0/0x468
[  231.747852]  worker_thread+0x54/0x410
[  231.748737]  kthread+0x124/0x138
[  231.749549]  ret_from_fork+0x10/0x20
[  231.750434] ---[ end trace 0000000000000000 ]---
[  233.720486] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  233.721696] Mem abort info:
[  233.721935]   ESR = 0x0000000096000004
[  233.722283]   EC = 0x25: DABT (current EL), IL = 32 bits
[  233.722596]   SET = 0, FnV = 0
[  233.722805]   EA = 0, S1PTW = 0
[  233.723026]   FSC = 0x04: level 0 translation fault
[  233.723458] Data abort info:
[  233.723734]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[  233.724176]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  233.724589]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  233.725075] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000104943000
[  233.725592] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[  233.726231] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[  233.726720] Modules linked in:
[  233.727007] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G        W          6.5.0-rc1-00276-g20edcec23f92 #15
[  233.727777] Hardware name: linux,dummy-virt (DT)
[  233.728225] Workqueue: events update_pages_handler
[  233.728655] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  233.729054] pc : rb_update_pages+0x1a8/0x3f8
[  233.729334] lr : rb_update_pages+0x154/0x3f8
[  233.729592] sp : ffff800082b9bd50
[  233.729792] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000
[  233.730220] x26: 0000000000000000 x25: ffff800082a8b840 x24: ffff0000c0102418
[  233.730653] x23: 0000000000000000 x22: fffffc000304c880 x21: 0000000000000003
[  233.731105] x20: 00000000000001f4 x19: ffff0000c0102400 x18: ffff800082fcbc58
[  233.731727] x17: 0000000000000000 x16: 0000000000000001 x15: 0000000000000001
[  233.732282] x14: ffff8000825fe0c8 x13: 0000000000000001 x12: 0000000000000000
[  233.732709] x11: ffff8000826998a8 x10: 0000000000000ae0 x9 : ffff8000801b760c
[  233.733148] x8 : fefefefefefefeff x7 : 0000000000000018 x6 : ffff0000c03298c0
[  233.733553] x5 : 0000000000000002 x4 : 0000000000000000 x3 : 0000000000000000
[  233.733972] x2 : ffff0000c3a0b600 x1 : 0000000000000000 x0 : 0000000000000000
[  233.734418] Call trace:
[  233.734593]  rb_update_pages+0x1a8/0x3f8
[  233.734853]  update_pages_handler+0x1c/0x38
[  233.735148]  process_one_work+0x1f0/0x468
[  233.735525]  worker_thread+0x54/0x410
[  233.735852]  kthread+0x124/0x138
[  233.736064]  ret_from_fork+0x10/0x20
[  233.736387] Code: 92400000 910006b5 aa000021 aa0303f7 (f9400060)
[  233.736959] ---[ end trace 0000000000000000 ]---

After analysis, the seq of the error is as follows [1-5]:

int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
			int cpu_id)
{
	for_each_buffer_cpu(buffer, cpu) {
		cpu_buffer = buffer->buffers[cpu];
		//1. get cpu_buffer, aka cpu_buffer(A)
		...
		...
		schedule_work_on(cpu,
		 &cpu_buffer->update_pages_work);
		//2. 'update_pages_work' is queue on 'cpu', cpu_buffer(A) is passed to
		// update_pages_handler, do the update process, set 'update_done' in
		// complete(&cpu_buffer->update_done) and to wakeup resize process.
	//---->
		//3. Just at this moment, ring_buffer_swap_cpu is triggered,
		//cpu_buffer(A) be swaped to cpu_buffer(B), the max_buffer.
		//ring_buffer_swap_cpu is called as the 'Call trace' below.

		Call trace:
		 dump_backtrace+0x0/0x2f8
		 show_stack+0x18/0x28
		 dump_stack+0x12c/0x188
		 ring_buffer_swap_cpu+0x2f8/0x328
		 update_max_tr_single+0x180/0x210
		 check_critical_timing+0x2b4/0x2c8
		 tracer_hardirqs_on+0x1c0/0x200
		 trace_hardirqs_on+0xec/0x378
		 el0_svc_common+0x64/0x260
		 do_el0_svc+0x90/0xf8
		 el0_svc+0x20/0x30
		 el0_sync_handler+0xb0/0xb8
		 el0_sync+0x180/0x1c0
	//<----

	/* wait for all the updates to complete */
	for_each_buffer_cpu(buffer, cpu) {
		cpu_buffer = buffer->buffers[cpu];
		//4. get cpu_buffer, cpu_buffer(B) is used in the following process,
		//the state of cpu_buffer(A) and cpu_buffer(B) is totally wrong.
		//for example, cpu_buffer(A)->update_done will leave be set 1, and will
		//not 'wait_for_completion' at the next resize round.
		  if (!cpu_buffer->nr_pages_to_update)
			continue;

		if (cpu_online(cpu))
			wait_for_completion(&cpu_buffer->update_done);
		cpu_buffer->nr_pages_to_update = 0;
	}
	...
}
	//5. the state of cpu_buffer(A) and cpu_buffer(B) is totally wrong,
	//Continuing to run in the wrong state, then oops occurs.

Link: https://lore.kernel.org/linux-trace-kernel/202307191558478409990@zte.com.cn

Signed-off-by: Chen Lin <chen.lin5@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-23 11:09:25 -04:00
Zheng Yejian
7e42907f3a ring-buffer: Fix deadloop issue on reading trace_pipe
Soft lockup occurs when reading file 'trace_pipe':

  watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [cat:4488]
  [...]
  RIP: 0010:ring_buffer_empty_cpu+0xed/0x170
  RSP: 0018:ffff88810dd6fc48 EFLAGS: 00000246
  RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffffffff93d1aaeb
  RDX: ffff88810a280040 RSI: 0000000000000008 RDI: ffff88811164b218
  RBP: ffff88811164b218 R08: 0000000000000000 R09: ffff88815156600f
  R10: ffffed102a2acc01 R11: 0000000000000001 R12: 0000000051651901
  R13: 0000000000000000 R14: ffff888115e49500 R15: 0000000000000000
  [...]
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f8d853c2000 CR3: 000000010dcd8000 CR4: 00000000000006e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   __find_next_entry+0x1a8/0x4b0
   ? peek_next_entry+0x250/0x250
   ? down_write+0xa5/0x120
   ? down_write_killable+0x130/0x130
   trace_find_next_entry_inc+0x3b/0x1d0
   tracing_read_pipe+0x423/0xae0
   ? tracing_splice_read_pipe+0xcb0/0xcb0
   vfs_read+0x16b/0x490
   ksys_read+0x105/0x210
   ? __ia32_sys_pwrite64+0x200/0x200
   ? switch_fpu_return+0x108/0x220
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x61/0xc6

Through the vmcore, I found it's because in tracing_read_pipe(),
ring_buffer_empty_cpu() found some buffer is not empty but then it
cannot read anything due to "rb_num_of_entries() == 0" always true,
Then it infinitely loop the procedure due to user buffer not been
filled, see following code path:

  tracing_read_pipe() {
    ... ...
    waitagain:
      tracing_wait_pipe() // 1. find non-empty buffer here
      trace_find_next_entry_inc()  // 2. loop here try to find an entry
        __find_next_entry()
          ring_buffer_empty_cpu();  // 3. find non-empty buffer
          peek_next_entry()  // 4. but peek always return NULL
            ring_buffer_peek()
              rb_buffer_peek()
                rb_get_reader_page()
                  // 5. because rb_num_of_entries() == 0 always true here
                  //    then return NULL
      // 6. user buffer not been filled so goto 'waitgain'
      //    and eventually leads to an deadloop in kernel!!!
  }

By some analyzing, I found that when resetting ringbuffer, the 'entries'
of its pages are not all cleared (see rb_reset_cpu()). Then when reducing
the ringbuffer, and if some reduced pages exist dirty 'entries' data, they
will be added into 'cpu_buffer->overrun' (see rb_remove_pages()), which
cause wrong 'overrun' count and eventually cause the deadloop issue.

To fix it, we need to clear every pages in rb_reset_cpu().

Link: https://lore.kernel.org/linux-trace-kernel/20230708225144.3785600-1-zhengyejian1@huawei.com

Cc: stable@vger.kernel.org
Fixes: a5fb833172 ("ring-buffer: Fix uninitialized read_stamp")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-12 12:15:26 -04:00
Linus Torvalds
d579c468d7 tracing updates for 6.4:
- User events are finally ready!
   After lots of collaboration between various parties, we finally locked
   down on a stable interface for user events that can also work with user
   space only tracing. This is implemented by telling the kernel (or user
   space library, but that part is user space only and not part of this
   patch set), where the variable is that the application uses to know if
   something is listening to the trace. There's also an interface to tell
   the kernel about these events, which will show up in the
   /sys/kernel/tracing/events/user_events/ directory, where it can be
    enabled. When it's enabled, the kernel will update the variable, to tell
   the application to start writing to the kernel.
   See https://lwn.net/Articles/927595/
 
 - Cleaned up the direct trampolines code to simplify arm64 addition of
   direct trampolines. Direct trampolines use the ftrace interface but
   instead of jumping to the ftrace trampoline, applications (mostly BPF)
   can register their own trampoline for performance reasons.
 
 - Some updates to the fprobe infrastructure. fprobes are more efficient than
   kprobes, as it does not need to save all the registers that kprobes on
   ftrace do. More work needs to be done before the fprobes will be exposed
   as dynamic events.
 
 - More updates to references to the obsolete path of
   /sys/kernel/debug/tracing for the new /sys/kernel/tracing path.
 
 - Add a seq_buf_do_printk() helper to seq_bufs, to print a large buffer line
   by line instead of all at once. There's users in production kernels that
   have a large data dump that originally used printk() directly, but the
   data dump was larger than what printk() allowed as a single print.
   Using seq_buf() to do the printing fixes that.
 
 - Add /sys/kernel/tracing/touched_functions that shows all functions that
   was every traced by ftrace or a direct trampoline. This is used for
   debugging issues where a traced function could have caused a crash by
   a bpf program or live patching.
 
 - Add a "fields" option that is similar to "raw" but outputs the fields of
   the events. It's easier to read by humans.
 
 - Some minor fixes and clean ups.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZEr36xQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6quZHAQCzuqnn2S8DsPd3Sy1vKIYaj0uajW5D
 Kz1oUJH4F0H7kgEA8XwXkdtfKpOXWc/ZH4LWfL7Orx2wJZJQMV9dVqEPDAE=
 =w0Z1
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

 - User events are finally ready!

   After lots of collaboration between various parties, we finally
   locked down on a stable interface for user events that can also work
   with user space only tracing.

   This is implemented by telling the kernel (or user space library, but
   that part is user space only and not part of this patch set), where
   the variable is that the application uses to know if something is
   listening to the trace.

   There's also an interface to tell the kernel about these events,
   which will show up in the /sys/kernel/tracing/events/user_events/
   directory, where it can be enabled.

   When it's enabled, the kernel will update the variable, to tell the
   application to start writing to the kernel.

   See https://lwn.net/Articles/927595/

 - Cleaned up the direct trampolines code to simplify arm64 addition of
   direct trampolines.

   Direct trampolines use the ftrace interface but instead of jumping to
   the ftrace trampoline, applications (mostly BPF) can register their
   own trampoline for performance reasons.

 - Some updates to the fprobe infrastructure. fprobes are more efficient
   than kprobes, as it does not need to save all the registers that
   kprobes on ftrace do. More work needs to be done before the fprobes
   will be exposed as dynamic events.

 - More updates to references to the obsolete path of
   /sys/kernel/debug/tracing for the new /sys/kernel/tracing path.

 - Add a seq_buf_do_printk() helper to seq_bufs, to print a large buffer
   line by line instead of all at once.

   There are users in production kernels that have a large data dump
   that originally used printk() directly, but the data dump was larger
   than what printk() allowed as a single print.

   Using seq_buf() to do the printing fixes that.

 - Add /sys/kernel/tracing/touched_functions that shows all functions
   that was every traced by ftrace or a direct trampoline. This is used
   for debugging issues where a traced function could have caused a
   crash by a bpf program or live patching.

 - Add a "fields" option that is similar to "raw" but outputs the fields
   of the events. It's easier to read by humans.

 - Some minor fixes and clean ups.

* tag 'trace-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (41 commits)
  ring-buffer: Sync IRQ works before buffer destruction
  tracing: Add missing spaces in trace_print_hex_seq()
  ring-buffer: Ensure proper resetting of atomic variables in ring_buffer_reset_online_cpus
  recordmcount: Fix memory leaks in the uwrite function
  tracing/user_events: Limit max fault-in attempts
  tracing/user_events: Prevent same address and bit per process
  tracing/user_events: Ensure bit is cleared on unregister
  tracing/user_events: Ensure write index cannot be negative
  seq_buf: Add seq_buf_do_printk() helper
  tracing: Fix print_fields() for __dyn_loc/__rel_loc
  tracing/user_events: Set event filter_type from type
  ring-buffer: Clearly check null ptr returned by rb_set_head_page()
  tracing: Unbreak user events
  tracing/user_events: Use print_format_fields() for trace output
  tracing/user_events: Align structs with tabs for readability
  tracing/user_events: Limit global user_event count
  tracing/user_events: Charge event allocs to cgroups
  tracing/user_events: Update documentation for ABI
  tracing/user_events: Use write ABI in example
  tracing/user_events: Add ABI self-test
  ...
2023-04-28 15:57:53 -07:00
Johannes Berg
675751bb20 ring-buffer: Sync IRQ works before buffer destruction
If something was written to the buffer just before destruction,
it may be possible (maybe not in a real system, but it did
happen in ARCH=um with time-travel) to destroy the ringbuffer
before the IRQ work ran, leading this KASAN report (or a crash
without KASAN):

    BUG: KASAN: slab-use-after-free in irq_work_run_list+0x11a/0x13a
    Read of size 8 at addr 000000006d640a48 by task swapper/0

    CPU: 0 PID: 0 Comm: swapper Tainted: G        W  O       6.3.0-rc1 #7
    Stack:
     60c4f20f 0c203d48 41b58ab3 60f224fc
     600477fa 60f35687 60c4f20f 601273dd
     00000008 6101eb00 6101eab0 615be548
    Call Trace:
     [<60047a58>] show_stack+0x25e/0x282
     [<60c609e0>] dump_stack_lvl+0x96/0xfd
     [<60c50d4c>] print_report+0x1a7/0x5a8
     [<603078d3>] kasan_report+0xc1/0xe9
     [<60308950>] __asan_report_load8_noabort+0x1b/0x1d
     [<60232844>] irq_work_run_list+0x11a/0x13a
     [<602328b4>] irq_work_tick+0x24/0x34
     [<6017f9dc>] update_process_times+0x162/0x196
     [<6019f335>] tick_sched_handle+0x1a4/0x1c3
     [<6019fd9e>] tick_sched_timer+0x79/0x10c
     [<601812b9>] __hrtimer_run_queues.constprop.0+0x425/0x695
     [<60182913>] hrtimer_interrupt+0x16c/0x2c4
     [<600486a3>] um_timer+0x164/0x183
     [...]

    Allocated by task 411:
     save_stack_trace+0x99/0xb5
     stack_trace_save+0x81/0x9b
     kasan_save_stack+0x2d/0x54
     kasan_set_track+0x34/0x3e
     kasan_save_alloc_info+0x25/0x28
     ____kasan_kmalloc+0x8b/0x97
     __kasan_kmalloc+0x10/0x12
     __kmalloc+0xb2/0xe8
     load_elf_phdrs+0xee/0x182
     [...]

    The buggy address belongs to the object at 000000006d640800
     which belongs to the cache kmalloc-1k of size 1024
    The buggy address is located 584 bytes inside of
     freed 1024-byte region [000000006d640800, 000000006d640c00)

Add the appropriate irq_work_sync() so the work finishes before
the buffers are destroyed.

Prior to the commit in the Fixes tag below, there was only a
single global IRQ work, so this issue didn't exist.

Link: https://lore.kernel.org/linux-trace-kernel/20230427175920.a76159263122.I8295e405c44362a86c995e9c2c37e3e03810aa56@changeid

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 15693458c4 ("tracing/ring-buffer: Move poll wake ups into ring buffer code")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-04-27 18:01:58 -04:00
Tze-nan Wu
7c339fb4d8 ring-buffer: Ensure proper resetting of atomic variables in ring_buffer_reset_online_cpus
In ring_buffer_reset_online_cpus, the buffer_size_kb write operation
may permanently fail if the cpu_online_mask changes between two
for_each_online_buffer_cpu loops. The number of increases and decreases
on both cpu_buffer->resize_disabled and cpu_buffer->record_disabled may be
inconsistent, causing some CPUs to have non-zero values for these atomic
variables after the function returns.

This issue can be reproduced by "echo 0 > trace" while hotplugging cpu.
After reproducing success, we can find out buffer_size_kb will not be
functional anymore.

To prevent leaving 'resize_disabled' and 'record_disabled' non-zero after
ring_buffer_reset_online_cpus returns, we ensure that each atomic variable
has been set up before atomic_sub() to it.

Link: https://lore.kernel.org/linux-trace-kernel/20230426062027.17451-1-Tze-nan.Wu@mediatek.com

Cc: stable@vger.kernel.org
Cc: <mhiramat@kernel.org>
Cc: npiggin@gmail.com
Fixes: b23d7a5f4a ("ring-buffer: speed up buffer resets by avoiding synchronize_rcu for each CPU")
Reviewed-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com>
Signed-off-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-04-26 09:08:53 -04:00
Zheng Yejian
625ed52717 ring-buffer: Clearly check null ptr returned by rb_set_head_page()
In error case, 'buffer_page' returned by rb_set_head_page() is NULL,
currently check '&buffer_page->list' is equivalent to check 'buffer_page'
due to 'list' is the first member of 'buffer_page', but suppose it is not
some time, 'head_page' would be wild memory while check would be bypassed.

Link: https://lore.kernel.org/linux-trace-kernel/20230414071729.57312-1-zhengyejian1@huawei.com

Cc: <mhiramat@kernel.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-04-25 20:11:26 -04:00
Zheng Yejian
6455b6163d ring-buffer: Fix race while reader and writer are on the same page
When user reads file 'trace_pipe', kernel keeps printing following logs
that warn at "cpu_buffer->reader_page->read > rb_page_size(reader)" in
rb_get_reader_page(). It just looks like there's an infinite loop in
tracing_read_pipe(). This problem occurs several times on arm64 platform
when testing v5.10 and below.

  Call trace:
   rb_get_reader_page+0x248/0x1300
   rb_buffer_peek+0x34/0x160
   ring_buffer_peek+0xbc/0x224
   peek_next_entry+0x98/0xbc
   __find_next_entry+0xc4/0x1c0
   trace_find_next_entry_inc+0x30/0x94
   tracing_read_pipe+0x198/0x304
   vfs_read+0xb4/0x1e0
   ksys_read+0x74/0x100
   __arm64_sys_read+0x24/0x30
   el0_svc_common.constprop.0+0x7c/0x1bc
   do_el0_svc+0x2c/0x94
   el0_svc+0x20/0x30
   el0_sync_handler+0xb0/0xb4
   el0_sync+0x160/0x180

Then I dump the vmcore and look into the problematic per_cpu ring_buffer,
I found that tail_page/commit_page/reader_page are on the same page while
reader_page->read is obviously abnormal:
  tail_page == commit_page == reader_page == {
    .write = 0x100d20,
    .read = 0x8f9f4805,  // Far greater than 0xd20, obviously abnormal!!!
    .entries = 0x10004c,
    .real_end = 0x0,
    .page = {
      .time_stamp = 0x857257416af0,
      .commit = 0xd20,  // This page hasn't been full filled.
      // .data[0...0xd20] seems normal.
    }
 }

The root cause is most likely the race that reader and writer are on the
same page while reader saw an event that not fully committed by writer.

To fix this, add memory barriers to make sure the reader can see the
content of what is committed. Since commit a0fcaaed0c ("ring-buffer: Fix
race between reset page and reading page") has added the read barrier in
rb_get_reader_page(), here we just need to add the write barrier.

Link: https://lore.kernel.org/linux-trace-kernel/20230325021247.2923907-1-zhengyejian1@huawei.com

Cc: stable@vger.kernel.org
Fixes: 77ae365eca ("ring-buffer: make lockless")
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-04-03 11:51:38 -04:00
Uros Bizjak
8328e36da9 ring_buffer: Use try_cmpxchg instead of cmpxchg
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old.
x86 CMPXCHG instruction returns success in ZF flag, so this change
saves a compare after cmpxchg (and related move instruction in
front of cmpxchg).

Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
fails. There is no need to re-read the value in the loop.

No functional change intended.

Link: https://lkml.kernel.org/r/20230305155532.5549-4-ubizjak@gmail.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-21 13:59:48 -04:00
Uros Bizjak
bc92b9562a ring_buffer: Change some static functions to bool
The return values of some functions are of boolean type. Change the
type of these function to bool and adjust their return values. Also
change type of some internal varibles to bool.

No functional change intended.

Link: https://lkml.kernel.org/r/20230305155532.5549-3-ubizjak@gmail.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-21 13:59:42 -04:00
Uros Bizjak
b4b55dfd96 ring_buffer: Change some static functions to void
The results of some static functions are not used. Change the
type of these function to void and remove unnecessary returns.

No functional change intended.

Link: https://lkml.kernel.org/r/20230305155532.5549-2-ubizjak@gmail.com

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-21 13:59:31 -04:00
Linus Torvalds
eaba52d63b Tracing fixes for 6.3:
- Fix setting affinity of hwlat threads in containers
   Using sched_set_affinity() has unwanted side effects when being
   called within a container. Use set_cpus_allowed_ptr() instead.
 
 - Fix per cpu thread management of the hwlat tracer
   * Do not start per_cpu threads if one is already running for the CPU.
   * When starting per_cpu threads, do not clear the kthread variable
     as it may already be set to running per cpu threads
 
 - Fix return value for test_gen_kprobe_cmd()
   On error the return value was overwritten by being set to
   the result of the call from kprobe_event_delete(), which would
   likely succeed, and thus have the function return success.
 
 - Fix splice() reads from the trace file that was broken by
   36e2c7421f ("fs: don't allow splice read/write without explicit ops")
 
 - Remove obsolete and confusing comment in ring_buffer.c
   The original design of the ring buffer used struct page flags
   for tricks to optimize, which was shortly removed due to them
   being tricks. But a comment for those tricks remained.
 
 - Set local functions and variables to static
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZBdIkBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qti2AP49s1GM8teFgDF/CO3oa45BoYq1lMJO
 1Z+x+mS/bdUAZgEAw+3KGo8oZDuvWu/nr04JPeoy0GL1/JnbQ6JNCCjzhQc=
 =Yk66
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix setting affinity of hwlat threads in containers

   Using sched_set_affinity() has unwanted side effects when being
   called within a container. Use set_cpus_allowed_ptr() instead

 - Fix per cpu thread management of the hwlat tracer:
    - Do not start per_cpu threads if one is already running for the CPU
    - When starting per_cpu threads, do not clear the kthread variable
      as it may already be set to running per cpu threads

 - Fix return value for test_gen_kprobe_cmd()

   On error the return value was overwritten by being set to the result
   of the call from kprobe_event_delete(), which would likely succeed,
   and thus have the function return success

 - Fix splice() reads from the trace file that was broken by commit
   36e2c7421f ("fs: don't allow splice read/write without explicit
   ops")

 - Remove obsolete and confusing comment in ring_buffer.c

   The original design of the ring buffer used struct page flags for
   tricks to optimize, which was shortly removed due to them being
   tricks. But a comment for those tricks remained

 - Set local functions and variables to static

* tag 'trace-v6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/hwlat: Replace sched_setaffinity with set_cpus_allowed_ptr
  ring-buffer: remove obsolete comment for free_buffer_page()
  tracing: Make splice_read available again
  ftrace: Set direct_ops storage-class-specifier to static
  trace/hwlat: Do not start per-cpu thread if it is already running
  trace/hwlat: Do not wipe the contents of per-cpu thread data
  tracing/osnoise: set several trace_osnoise.c variables storage-class-specifier to static
  tracing: Fix wrong return in kprobe_event_gen_test.c
2023-03-19 10:46:02 -07:00
Vlastimil Babka
a98151ad53 ring-buffer: remove obsolete comment for free_buffer_page()
The comment refers to mm/slob.c which is being removed. It comes from
commit ed56829cb3 ("ring_buffer: reset buffer page when freeing") and
according to Steven the borrowed code was a page mapcount and mapping
reset, which was later removed by commit e4c2ce82ca ("ring_buffer:
allocate buffer page pointer"). Thus the comment is not accurate anyway,
remove it.

Link: https://lore.kernel.org/linux-trace-kernel/20230315142446.27040-1-vbabka@suse.cz

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Reported-by: Mike Rapoport <mike.rapoport@gmail.com>
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Fixes: e4c2ce82ca ("ring_buffer: allocate buffer page pointer")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-19 13:23:22 -04:00
Linus Torvalds
b72b5fecc1 tracing updates for 6.3:
- Add function names as a way to filter function addresses
 
 - Add sample module to test ftrace ops and dynamic trampolines
 
 - Allow stack traces to be passed from beginning event to end event for
   synthetic events. This will allow seeing the stack trace of when a task is
   scheduled out and recorded when it gets scheduled back in.
 
 - Add trace event helper __get_buf() to use as a temporary buffer when printing
   out trace event output.
 
 - Add kernel command line to create trace instances on boot up.
 
 - Add enabling of events to instances created at boot up.
 
 - Add trace_array_puts() to write into instances.
 
 - Allow boot instances to take a snapshot at the end of boot up.
 
 - Allow live patch modules to include trace events
 
 - Minor fixes and clean ups
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY/PaaBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qh5iAPoD0LKZzD33rhO5Ec4hoexE0DkqycP3
 dvmOMbCBL8GkxwEA+d2gLz/EquSFm166hc4D79Sn3geCqvkwmy8vQWVjIQc=
 =M82D
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

 - Add function names as a way to filter function addresses

 - Add sample module to test ftrace ops and dynamic trampolines

 - Allow stack traces to be passed from beginning event to end event for
   synthetic events. This will allow seeing the stack trace of when a
   task is scheduled out and recorded when it gets scheduled back in.

 - Add trace event helper __get_buf() to use as a temporary buffer when
   printing out trace event output.

 - Add kernel command line to create trace instances on boot up.

 - Add enabling of events to instances created at boot up.

 - Add trace_array_puts() to write into instances.

 - Allow boot instances to take a snapshot at the end of boot up.

 - Allow live patch modules to include trace events

 - Minor fixes and clean ups

* tag 'trace-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (31 commits)
  tracing: Remove unnecessary NULL assignment
  tracepoint: Allow livepatch module add trace event
  tracing: Always use canonical ftrace path
  tracing/histogram: Fix stacktrace histogram Documententation
  tracing/histogram: Fix stacktrace key
  tracing/histogram: Fix a few problems with stacktrace variable printing
  tracing: Add BUILD_BUG() to make sure stacktrace fits in strings
  tracing/histogram: Don't use strlen to find length of stacktrace variables
  tracing: Allow boot instances to have snapshot buffers
  tracing: Add trace_array_puts() to write into instance
  tracing: Add enabling of events to boot instances
  tracing: Add creation of instances at boot command line
  tracing: Fix trace_event_raw_event_synth() if else statement
  samples: ftrace: Make some global variables static
  ftrace: sample: avoid open-coded 64-bit division
  samples: ftrace: Include the nospec-branch.h only for x86
  tracing: Acquire buffer from temparary trace sequence
  tracing/histogram: Wrap remaining shell snippets in code blocks
  tracing/osnoise: No need for schedule_hrtimeout range
  bpf/tracing: Use stage6 of tracing to not duplicate macros
  ...
2023-02-23 10:20:49 -08:00
Ross Zwisler
2455f0e124 tracing: Always use canonical ftrace path
The canonical location for the tracefs filesystem is at /sys/kernel/tracing.

But, from Documentation/trace/ftrace.rst:

  Before 4.1, all ftrace tracing control files were within the debugfs
  file system, which is typically located at /sys/kernel/debug/tracing.
  For backward compatibility, when mounting the debugfs file system,
  the tracefs file system will be automatically mounted at:

  /sys/kernel/debug/tracing

Many comments and Kconfig help messages in the tracing code still refer
to this older debugfs path, so let's update them to avoid confusion.

Link: https://lore.kernel.org/linux-trace-kernel/20230215223350.2658616-2-zwisler@google.com

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-02-18 14:34:09 -05:00
Mukesh Ojha
8843e06f67 ring-buffer: Handle race between rb_move_tail and rb_check_pages
It seems a data race between ring_buffer writing and integrity check.
That is, RB_FLAG of head_page is been updating, while at same time
RB_FLAG was cleared when doing integrity check rb_check_pages():

  rb_check_pages()            rb_handle_head_page():
  --------                    --------
  rb_head_page_deactivate()
                              rb_head_page_set_normal()
  rb_head_page_activate()

We do intergrity test of the list to check if the list is corrupted and
it is still worth doing it. So, let's refactor rb_check_pages() such that
we no longer clear and set flag during the list sanity checking.

[1] and [2] are the test to reproduce and the crash report respectively.

1:
``` read_trace.sh
  while true;
  do
    # the "trace" file is closed after read
    head -1 /sys/kernel/tracing/trace > /dev/null
  done
```
``` repro.sh
  sysctl -w kernel.panic_on_warn=1
  # function tracer will writing enough data into ring_buffer
  echo function > /sys/kernel/tracing/current_tracer
  ./read_trace.sh &
  ./read_trace.sh &
  ./read_trace.sh &
  ./read_trace.sh &
  ./read_trace.sh &
  ./read_trace.sh &
  ./read_trace.sh &
  ./read_trace.sh &
```

2:
------------[ cut here ]------------
WARNING: CPU: 9 PID: 62 at kernel/trace/ring_buffer.c:2653
rb_move_tail+0x450/0x470
Modules linked in:
CPU: 9 PID: 62 Comm: ksoftirqd/9 Tainted: G        W          6.2.0-rc6+
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
RIP: 0010:rb_move_tail+0x450/0x470
Code: ff ff 4c 89 c8 f0 4d 0f b1 02 48 89 c2 48 83 e2 fc 49 39 d0 75 24
83 e0 03 83 f8 02 0f 84 e1 fb ff ff 48 8b 57 10 f0 ff 42 08 <0f> 0b 83
f8 02 0f 84 ce fb ff ff e9 db
RSP: 0018:ffffb5564089bd00 EFLAGS: 00000203
RAX: 0000000000000000 RBX: ffff9db385a2bf81 RCX: ffffb5564089bd18
RDX: ffff9db281110100 RSI: 0000000000000fe4 RDI: ffff9db380145400
RBP: ffff9db385a2bf80 R08: ffff9db385a2bfc0 R09: ffff9db385a2bfc2
R10: ffff9db385a6c000 R11: ffff9db385a2bf80 R12: 0000000000000000
R13: 00000000000003e8 R14: ffff9db281110100 R15: ffffffffbb006108
FS:  0000000000000000(0000) GS:ffff9db3bdcc0000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005602323024c8 CR3: 0000000022e0c000 CR4: 00000000000006e0
Call Trace:
 <TASK>
 ring_buffer_lock_reserve+0x136/0x360
 ? __do_softirq+0x287/0x2df
 ? __pfx_rcu_softirq_qs+0x10/0x10
 trace_function+0x21/0x110
 ? __pfx_rcu_softirq_qs+0x10/0x10
 ? __do_softirq+0x287/0x2df
 function_trace_call+0xf6/0x120
 0xffffffffc038f097
 ? rcu_softirq_qs+0x5/0x140
 rcu_softirq_qs+0x5/0x140
 __do_softirq+0x287/0x2df
 run_ksoftirqd+0x2a/0x30
 smpboot_thread_fn+0x188/0x220
 ? __pfx_smpboot_thread_fn+0x10/0x10
 kthread+0xe7/0x110
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2c/0x50
 </TASK>
---[ end trace 0000000000000000 ]---

[ crash report and test reproducer credit goes to Zheng Yejian]

Link: https://lore.kernel.org/linux-trace-kernel/1676376403-16462-1-git-send-email-quic_mojha@quicinc.com

Cc: <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 1039221cc2 ("ring-buffer: Do not disable recording when there is an iterator")
Reported-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-02-15 11:59:16 -05:00
Jia-Ju Bai
3e4272b995 tracing: Add NULL checks for buffer in ring_buffer_free_read_page()
In a previous commit 7433632c9f, buffer, buffer->buffers and
buffer->buffers[cpu] in ring_buffer_wake_waiters() can be NULL,
and thus the related checks are added.

However, in the same call stack, these variables are also used in
ring_buffer_free_read_page():

tracing_buffers_release()
  ring_buffer_wake_waiters(iter->array_buffer->buffer)
    cpu_buffer = buffer->buffers[cpu] -> Add checks by previous commit
  ring_buffer_free_read_page(iter->array_buffer->buffer)
    cpu_buffer = buffer->buffers[cpu] -> No check

Thus, to avod possible null-pointer derefernces, the related checks
should be added.

These results are reported by a static tool designed by myself.

Link: https://lkml.kernel.org/r/20230113125501.760324-1-baijiaju1990@gmail.com

Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-01-25 10:31:24 -05:00
Steven Rostedt
88ca6a71dc ring-buffer: Handle resize in early boot up
With the new command line option that allows trace event triggers to be
added at boot, the "snapshot" trigger will allocate the snapshot buffer
very early, when interrupts can not be enabled. Allocating the ring buffer
is not the problem, but it also resizes it, which is, as the resize code
does synchronization that can not be preformed at early boot.

To handle this, first change the raw_spin_lock_irq() in rb_insert_pages()
to raw_spin_lock_irqsave(), such that the unlocking of that spin lock will
not enable interrupts.

Next, where it calls schedule_work_on(), disable migration and check if
the CPU to update is the current CPU, and if so, perform the work
directly, otherwise re-enable migration and call the schedule_work_on() to
the CPU that is being updated. The rb_insert_pages() just needs to be run
on the CPU that it is updating, and does not need preemption nor
interrupts disabled when calling it.

Link: https://lore.kernel.org/lkml/Y5J%2FCajlNh1gexvo@google.com/
Link: https://lore.kernel.org/linux-trace-kernel/20221209101151.1fec1167@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>

Fixes: a01fdc897f ("tracing: Add trace_trigger kernel command line option")
Reported-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-12-10 13:36:04 -05:00
Song Chen
04aabc32fb ring_buffer: Remove unused "event" parameter
After commit a389d86f7f ("ring-buffer: Have nested events still record
running time stamp"), the "event" parameter is no longer used in either
ring_buffer_unlock_commit() or rb_commit(). Best to remove it.

Link: https://lkml.kernel.org/r/1666274811-24138-1-git-send-email-chensong_2000@189.cn

Signed-off-by: Song Chen <chensong_2000@189.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-11-23 19:08:30 -05:00
Daniil Tatianin
56f4ca0a79 ring_buffer: Do not deactivate non-existant pages
rb_head_page_deactivate() expects cpu_buffer to contain a valid list of
->pages, so verify that the list is actually present before calling it.

Found by Linux Verification Center (linuxtesting.org) with the SVACE
static analysis tool.

Link: https://lkml.kernel.org/r/20221114143129.3534443-1-d-tatianin@yandex-team.ru

Cc: stable@vger.kernel.org
Fixes: 77ae365eca ("ring-buffer: make lockless")
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-11-17 17:10:40 -05:00
Steven Rostedt (Google)
31029a8b2c ring-buffer: Include dropped pages in counting dirty patches
The function ring_buffer_nr_dirty_pages() was created to find out how many
pages are filled in the ring buffer. There's two running counters. One is
incremented whenever a new page is touched (pages_touched) and the other
is whenever a page is read (pages_read). The dirty count is the number
touched minus the number read. This is used to determine if a blocked task
should be woken up if the percentage of the ring buffer it is waiting for
is hit.

The problem is that it does not take into account dropped pages (when the
new writes overwrite pages that were not read). And then the dirty pages
will always be greater than the percentage.

This makes the "buffer_percent" file inaccurate, as the number of dirty
pages end up always being larger than the percentage, event when it's not
and this causes user space to be woken up more than it wants to be.

Add a new counter to keep track of lost pages, and include that in the
accounting of dirty pages so that it is actually accurate.

Link: https://lkml.kernel.org/r/20221021123013.55fb6055@gandalf.local.home

Fixes: 2c2b0a78b3 ("ring-buffer: Add percentage of ring buffer full to wake up reader")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-11-16 18:08:29 -05:00
Steven Rostedt (Google)
42fb0a1e84 tracing/ring-buffer: Have polling block on watermark
Currently the way polling works on the ring buffer is broken. It will
return immediately if there's any data in the ring buffer whereas a read
will block until the watermark (defined by the tracefs buffer_percent file)
is hit.

That is, a select() or poll() will return as if there's data available,
but then the following read will block. This is broken for the way
select()s and poll()s are supposed to work.

Have the polling on the ring buffer also block the same way reads and
splice does on the ring buffer.

Link: https://lkml.kernel.org/r/20221020231427.41be3f26@gandalf.local.home

Cc: Linux Trace Kernel <linux-trace-kernel@vger.kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Primiano Tucci <primiano@google.com>
Cc: stable@vger.kernel.org
Fixes: 1e0d6714ac ("ring-buffer: Do not wake up a splice waiter when page is not full")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-11-16 17:13:04 -05:00
Steven Rostedt (Google)
7433632c9f ring-buffer: Check for NULL cpu_buffer in ring_buffer_wake_waiters()
On some machines the number of listed CPUs may be bigger than the actual
CPUs that exist. The tracing subsystem allocates a per_cpu directory with
access to the per CPU ring buffer via a cpuX file. But to save space, the
ring buffer will only allocate buffers for online CPUs, even though the
CPU array will be as big as the nr_cpu_ids.

With the addition of waking waiters on the ring buffer when closing the
file, the ring_buffer_wake_waiters() now needs to make sure that the
buffer is allocated (with the irq_work allocated with it) before trying to
wake waiters, as it will cause a NULL pointer dereference.

While debugging this, I added a NULL check for the buffer itself (which is
OK to do), and also NULL pointer checks against buffer->buffers (which is
not fine, and will WARN) as well as making sure the CPU number passed in
is within the nr_cpu_ids (which is also not fine if it isn't).

Link: https://lore.kernel.org/all/87h6zklb6n.wl-tiwai@suse.de/
Link: https://lore.kernel.org/all/CAM6Wdxc0KRJMXVAA0Y=u6Jh2V=uWB-_Fn6M4xRuNppfXzL1mUg@mail.gmail.com/
Link: https://lkml.kernel.org/linux-trace-kernel/20221101191009.1e7378c8@rorschach.local.home

Cc: stable@vger.kernel.org
Cc: Steven Noonan <steven.noonan@gmail.com>
Bugzilla: https://bugzilla.opensuse.org/show_bug.cgi?id=1204705
Reported-by: Takashi Iwai <tiwai@suse.de>
Reported-by: Roland Ruckerbauer <roland.rucky@gmail.com>
Fixes: f3ddb74ad0 ("tracing: Wake up ring buffer waiters on closing of the file")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-11-02 23:52:03 -04:00
Jiapeng Chong
b7085b6ffe ring-buffer: Fix kernel-doc
kernel/trace/ring_buffer.c:895: warning: expecting prototype for ring_buffer_nr_pages_dirty(). Prototype was for ring_buffer_nr_dirty_pages() instead.
kernel/trace/ring_buffer.c:5313: warning: expecting prototype for ring_buffer_reset_cpu(). Prototype was for ring_buffer_reset_online_cpus() instead.
kernel/trace/ring_buffer.c:5382: warning: expecting prototype for rind_buffer_empty(). Prototype was for ring_buffer_empty() instead.

Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2340
Link: https://lkml.kernel.org/r/20221009020642.12506-1-jiapeng.chong@linux.alibaba.com

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-10-12 08:31:06 -04:00
Steven Rostedt (Google)
a0fcaaed0c ring-buffer: Fix race between reset page and reading page
The ring buffer is broken up into sub buffers (currently of page size).
Each sub buffer has a pointer to its "tail" (the last event written to the
sub buffer). When a new event is requested, the tail is locally
incremented to cover the size of the new event. This is done in a way that
there is no need for locking.

If the tail goes past the end of the sub buffer, the process of moving to
the next sub buffer takes place. After setting the current sub buffer to
the next one, the previous one that had the tail go passed the end of the
sub buffer needs to be reset back to the original tail location (before
the new event was requested) and the rest of the sub buffer needs to be
"padded".

The race happens when a reader takes control of the sub buffer. As readers
do a "swap" of sub buffers from the ring buffer to get exclusive access to
the sub buffer, it replaces the "head" sub buffer with an empty sub buffer
that goes back into the writable portion of the ring buffer. This swap can
happen as soon as the writer moves to the next sub buffer and before it
updates the last sub buffer with padding.

Because the sub buffer can be released to the reader while the writer is
still updating the padding, it is possible for the reader to see the event
that goes past the end of the sub buffer. This can cause obvious issues.

To fix this, add a few memory barriers so that the reader definitely sees
the updates to the sub buffer, and also waits until the writer has put
back the "tail" of the sub buffer back to the last event that was written
on it.

To be paranoid, it will only spin for 1 second, otherwise it will
warn and shutdown the ring buffer code. 1 second should be enough as
the writer does have preemption disabled. If the writer doesn't move
within 1 second (with preemption disabled) something is horribly
wrong. No interrupt should last 1 second!

Link: https://lore.kernel.org/all/20220830120854.7545-1-jiazi.li@transsion.com/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216369
Link: https://lkml.kernel.org/r/20220929104909.0650a36c@gandalf.local.home

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: c7b0930857 ("ring-buffer: prevent adding write in discarded area")
Reported-by: Jiazi.Li <jiazi.li@transsion.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-29 10:52:05 -04:00
Steven Rostedt (Google)
7e9fbbb1b7 ring-buffer: Add ring_buffer_wake_waiters()
On closing of a file that represents a ring buffer or flushing the file,
there may be waiters on the ring buffer that needs to be woken up and exit
the ring_buffer_wait() function.

Add ring_buffer_wake_waiters() to wake up the waiters on the ring buffer
and allow them to exit the wait loop.

Link: https://lkml.kernel.org/r/20220928133938.28dc2c27@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 15693458c4 ("tracing/ring-buffer: Move poll wake ups into ring buffer code")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-28 20:27:43 -04:00
Steven Rostedt (Google)
ec0bbc5ec5 ring-buffer: Check pending waiters when doing wake ups as well
The wake up waiters only checks the "wakeup_full" variable and not the
"full_waiters_pending". The full_waiters_pending is set when a waiter is
added to the wait queue. The wakeup_full is only set when an event is
triggered, and it clears the full_waiters_pending to avoid multiple calls
to irq_work_queue().

The irq_work callback really needs to check both wakeup_full as well as
full_waiters_pending such that this code can be used to wake up waiters
when a file is closed that represents the ring buffer and the waiters need
to be woken up.

Link: https://lkml.kernel.org/r/20220927231824.209460321@goodmis.org

Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 15693458c4 ("tracing/ring-buffer: Move poll wake ups into ring buffer code")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-27 19:23:04 -04:00
Steven Rostedt (Google)
3b19d614b6 ring-buffer: Have the shortest_full queue be the shortest not longest
The logic to know when the shortest waiters on the ring buffer should be
woken up or not has uses a less than instead of a greater than compare,
which causes the shortest_full to actually be the longest.

Link: https://lkml.kernel.org/r/20220927231823.718039222@goodmis.org

Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 2c2b0a78b3 ("ring-buffer: Add percentage of ring buffer full to wake up reader")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-27 19:23:03 -04:00
Steven Rostedt (Google)
fa8f4a8973 ring-buffer: Allow splice to read previous partially read pages
If a page is partially read, and then the splice system call is run
against the ring buffer, it will always fail to read, no matter how much
is in the ring buffer. That's because the code path for a partial read of
the page does will fail if the "full" flag is set.

The splice system call wants full pages, so if the read of the ring buffer
is not yet full, it should return zero, and the splice will block. But if
a previous read was done, where the beginning has been consumed, it should
still be given to the splice caller if the rest of the page has been
written to.

This caused the splice command to never consume data in this scenario, and
let the ring buffer just fill up and lose events.

Link: https://lkml.kernel.org/r/20220927144317.46be6b80@gandalf.local.home

Cc: stable@vger.kernel.org
Fixes: 8789a9e7df ("ring-buffer: read page interface")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-27 14:48:27 -04:00
Steven Rostedt (Google)
f03f2abce4 ring-buffer: Have 32 bit time stamps use all 64 bits
When the new logic was made to handle deltas of events from interrupts
that interrupted other events, it required 64 bit local atomics.
Unfortunately, 64 bit local atomics are expensive on 32 bit architectures.
Thus, commit 10464b4aa6 ("ring-buffer: Add rb_time_t 64 bit operations
for speeding up 32 bit") created a type of seq lock timer for 32 bits.
It used two 32 bit local atomics, but required 2 bits from them each for
synchronization, making it only 60 bits.

Add a new "msb" field to hold the extra 4 bits that are cut off.

Link: https://lore.kernel.org/all/20220426175338.3807ca4f@gandalf.local.home/
Link: https://lkml.kernel.org/r/20220427170812.53cc7139@gandalf.local.home

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-27 17:19:30 -04:00
Steven Rostedt (Google)
6695da58f9 ring-buffer: Have absolute time stamps handle large numbers
There's an absolute timestamp event in the ring buffer, but this only
saves 59 bits of the timestamp, as the 5 MSB is used for meta data
(stating it is an absolute time stamp). This was never an issue as all the
clocks currently in use never used those 5 MSB. But now there's a new
clock (TAI) that does.

To handle this case, when reading an absolute timestamp, a previous full
timestamp is passed in, and the 5 MSB of that timestamp is OR'd to the
absolute timestamp (if any of the 5 MSB are set), and then to test for
overflow, if the new result is smaller than the passed in previous
timestamp, then 1 << 59 is added to it.

All the extra processing is done on the reader "slow" path, with the
exception of the "too big delta" check, and the reading of timestamps
for histograms.

Note, libtraceevent will need to be updated to handle this case as well.
But this is not a user space regression, as user space was never able to
handle any timestamps that used more than 59 bits.

Link: https://lore.kernel.org/all/20220426175338.3807ca4f@gandalf.local.home/
Link: https://lkml.kernel.org/r/20220427153339.16c33f75@gandalf.local.home

Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-27 15:59:44 -04:00
Wan Jiabing
ed888241a0 ring-buffer: Simplify if-if to if-else
Use if and else instead of if(A) and if (!A).

Link: https://lkml.kernel.org/r/20220426070628.167565-1-wanjiabing@vivo.com

Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-26 17:58:52 -04:00
Cai Huoqing
64ed3a049e ring-buffer: make use of the helper function kthread_run_on_cpu()
Replace kthread_create/kthread_bind/wake_up_process() with
kthread_run_on_cpu() to simplify the code.

Link: https://lkml.kernel.org/r/20211022025711.3673-4-caihuoqing@baidu.com
Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Cc: Bernard Metzler <bmt@zurich.ibm.com>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15 16:30:23 +02:00
Steven Rostedt (VMware)
51d1579466 ring-buffer: Protect ring_buffer_reset() from reentrancy
The resetting of the entire ring buffer use to simply go through and reset
each individual CPU buffer that had its own protection and synchronization.
But this was very slow, due to performing a synchronization for each CPU.
The code was reshuffled to do one disabling of all CPU buffers, followed
by a single RCU synchronization, and then the resetting of each of the CPU
buffers. But unfortunately, the mutex that prevented multiple occurrences
of resetting the buffer was not moved to the upper function, and there is
nothing to protect from it.

Take the ring buffer mutex around the global reset.

Cc: stable@vger.kernel.org
Fixes: b23d7a5f4a ("ring-buffer: speed up buffer resets by avoiding synchronize_rcu for each CPU")
Reported-by: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-10 11:56:29 -05:00
Steven Rostedt (VMware)
91ebe8bcbf tracing/perf: Add interrupt_context_level() helper
Now that there are three different instances of doing the addition trick
to the preempt_count() and NMI_MASK, HARDIRQ_MASK and SOFTIRQ_OFFSET
macros, it deserves a helper function defined in the preempt.h header.

Add the interrupt_context_level() helper and replace the three instances
that do that logic with it.

Link: https://lore.kernel.org/all/20211015142541.4badd8a9@gandalf.local.home/

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-19 20:33:20 -04:00
Steven Rostedt (VMware)
9b84fadc44 tracing: Reuse logic from perf's get_recursion_context()
Instead of having branches that adds noise to the branch prediction, use
the addition logic to set the bit for the level of interrupt context that
the state is currently in. This copies the logic from perf's
get_recursion_context() function.

Link: https://lore.kernel.org/all/20211015161702.GF174703@worktop.programming.kicks-ass.net/

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-19 20:33:20 -04:00
Sebastian Andrzej Siewior
99c37d1a63 tracing: Replace deprecated CPU-hotplug functions.
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.

Link: https://lkml.kernel.org/r/20210803141621.780504-37-bigeasy@linutronix.de

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-17 15:47:14 -04:00
Haoran Luo
67f0d6d988 tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop.
The "rb_per_cpu_empty()" misinterpret the condition (as not-empty) when
"head_page" and "commit_page" of "struct ring_buffer_per_cpu" points to
the same buffer page, whose "buffer_data_page" is empty and "read" field
is non-zero.

An error scenario could be constructed as followed (kernel perspective):

1. All pages in the buffer has been accessed by reader(s) so that all of
them will have non-zero "read" field.

2. Read and clear all buffer pages so that "rb_num_of_entries()" will
return 0 rendering there's no more data to read. It is also required
that the "read_page", "commit_page" and "tail_page" points to the same
page, while "head_page" is the next page of them.

3. Invoke "ring_buffer_lock_reserve()" with large enough "length"
so that it shot pass the end of current tail buffer page. Now the
"head_page", "commit_page" and "tail_page" points to the same page.

4. Discard current event with "ring_buffer_discard_commit()", so that
"head_page", "commit_page" and "tail_page" points to a page whose buffer
data page is now empty.

When the error scenario has been constructed, "tracing_read_pipe" will
be trapped inside a deadloop: "trace_empty()" returns 0 since
"rb_per_cpu_empty()" returns 0 when it hits the CPU containing such
constructed ring buffer. Then "trace_find_next_entry_inc()" always
return NULL since "rb_num_of_entries()" reports there's no more entry
to read. Finally "trace_seq_to_user()" returns "-EBUSY" spanking
"tracing_read_pipe" back to the start of the "waitagain" loop.

I've also written a proof-of-concept script to construct the scenario
and trigger the bug automatically, you can use it to trace and validate
my reasoning above:

  https://github.com/aegistudio/RingBufferDetonator.git

Tests has been carried out on linux kernel 5.14-rc2
(2734d6c1b1), my fixed version
of kernel (for testing whether my update fixes the bug) and
some older kernels (for range of affected kernels). Test result is
also attached to the proof-of-concept repository.

Link: https://lore.kernel.org/linux-trace-devel/YPaNxsIlb2yjSi5Y@aegistudio/
Link: https://lore.kernel.org/linux-trace-devel/YPgrN85WL9VyrZ55@aegistudio

Cc: stable@vger.kernel.org
Fixes: bf41a158ca ("ring-buffer: make reentrant")
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Haoran Luo <www@aegistudio.net>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-07-22 11:52:33 -04:00
Wei Ming Chen
957cdcd9bd ring-buffer: Use fallthrough pseudo-keyword
Replace /* fall through */ comment with pseudo-keyword macro fallthrough[1]

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Link: https://lkml.kernel.org/r/20210511140246.18868-1-jj251510319013@gmail.com

Signed-off-by: Wei Ming Chen <jj251510319013@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-10 11:16:20 -04:00
Yordan Karadzhov (VMware)
f3ef7202ef tracing: Remove unused argument from "ring_buffer_time_stamp()
The "cpu" parameter is not being used by the function.

Link: https://lkml.kernel.org/r/20210329130331.199402-1-y.karadz@gmail.com

Signed-off-by: Yordan Karadzhov (VMware) <y.karadz@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-04-01 14:18:32 -04:00
Ingo Molnar
f2cc020d78 tracing: Fix various typos in comments
Fix ~59 single-word typos in the tracing code comments, and fix
the grammar in a handful of places.

Link: https://lore.kernel.org/r/20210322224546.GA1981273@gmail.com
Link: https://lkml.kernel.org/r/20210323174935.GA4176821@gmail.com

Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-23 14:08:18 -04:00
Steven Rostedt (VMware)
a948c69d6f ring-buffer: Add verifier for using ring_buffer_event_time_stamp()
The ring_buffer_event_time_stamp() must be only called by an event that has
not been committed yet, and is on the buffer that is passed in. This was
used to help debug converting the histogram logic over to using the new
time stamp code, and was proven to be very useful.

Add a verifier that can check that this is the case, and extra WARN_ONs to
catch unexpected use cases.

Link: https://lkml.kernel.org/r/20210316164113.987294354@goodmis.org

Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-18 12:58:26 -04:00
Steven Rostedt (VMware)
efe6196a6b ring-buffer: Allow ring_buffer_event_time_stamp() to return time stamp of all events
Currently, ring_buffer_event_time_stamp() only returns an accurate time
stamp of the event if it has an absolute extended time stamp attached to
it. To make it more robust, use the event_stamp() in case the event does
not have an absolute value attached to it.

This will allow ring_buffer_event_time_stamp() to be used in more cases
than just histograms, and it will also allow histograms to not require
including absolute values all the time.

Link: https://lkml.kernel.org/r/20210316164113.704830885@goodmis.org

Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-18 12:58:26 -04:00
Steven Rostedt (VMware)
8672e4948d ring-buffer: Add a event_stamp to cpu_buffer for each level of nesting
Add a place to save the current event time stamp for each level of nesting.
This will be used to retrieve the time stamp of the current event before it
is committed.

Link: https://lkml.kernel.org/r/20210316164113.399089673@goodmis.org

Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-18 12:58:25 -04:00
Steven Rostedt (VMware)
e20044f7e9 ring-buffer: Separate out internal use of ring_buffer_event_time_stamp()
The exported use of ring_buffer_event_time_stamp() is going to become
different than how it is used internally. Move the internal logic out into a
static function called rb_event_time_stamp(), and have the internal callers
call that instead.

Link: https://lkml.kernel.org/r/20210316164113.257790481@goodmis.org

Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-18 12:58:25 -04:00
Steven Rostedt (VMware)
6549de1fe3 ring-buffer: Add a little more information and a WARN when time stamp going backwards is detected
When the CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS is enabled, and the time
stamps are detected as not being valid, it reports information about the
write stamp, but does not show the before_stamp which is still useful
information. Also, it should give a warning once, such that tests detect
this happening.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-04 09:45:17 -05:00
Steven Rostedt (VMware)
6f6be606e7 ring-buffer: Force before_stamp and write_stamp to be different on discard
Part of the logic of the new time stamp code depends on the before_stamp and
the write_stamp to be different if the write_stamp does not match the last
event on the buffer, as it will be used to calculate the delta of the next
event written on the buffer.

The discard logic depends on this, as the next event to come in needs to
inject a full timestamp as it can not rely on the last event timestamp in
the buffer because it is unknown due to events after it being discarded. But
by changing the write_stamp back to the time before it, it forces the next
event to use a full time stamp, instead of relying on it.

The issue came when a full time stamp was used for the event, and
rb_time_delta() returns zero in that case. The update to the write_stamp
(which subtracts delta) made it not change. Then when the event is removed
from the buffer, because the before_stamp and write_stamp still match, the
next event written would calculate its delta from the write_stamp, but that
would be wrong as the write_stamp is of the time of the event that was
discarded.

In the case that the delta change being made to write_stamp is zero, set the
before_stamp to zero as well, and this will force the next event to inject a
full timestamp and not use the current write_stamp.

Cc: stable@vger.kernel.org
Fixes: a389d86f7f ("ring-buffer: Have nested events still record running time stamp")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-04 09:45:17 -05:00
Qiujun Huang
c6358bacdc ring-buffer: Drop unneeded check in ring_buffer_resize()
Remove the cpumask check, as we has done it at the beginning of
the function.
Also fix a typo. s/also the on the/also on the/

Link: https://lkml.kernel.org/r/20201224144634.3210-1-hqjagain@gmail.com

Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-02-02 17:02:06 -05:00
Qiujun Huang
6689bed36c ring-buffer: Remove cpu_buffer argument from the rb_inc_page()
The cpu_buffer argument is not used inside the rb_inc_page() after
commit 3adc54fa82 ("ring-buffer: make the buffer a true circular link
list").
And cpu_buffer argument is not used inside the two functions too,
rb_is_head_page/rb_set_list_to_head.

Link: https://lkml.kernel.org/r/20201225140356.23008-1-hqjagain@gmail.com

Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-02-02 17:02:06 -05:00
Linus Torvalds
09c0796adf Tracing updates for 5.11
The major update to this release is that there's a new arch config option called:
 CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS. Currently, only x86_64 enables it.
 All the ftrace callbacks now take a struct ftrace_regs instead of a struct
 pt_regs. If the architecture has HAVE_DYNAMIC_FTRACE_WITH_ARGS enabled, then
 the ftrace_regs will have enough information to read the arguments of the
 function being traced, as well as access to the stack pointer. This way, if
 a user (like live kernel patching) only cares about the arguments, then it
 can avoid using the heavier weight "regs" callback, that puts in enough
 information in the struct ftrace_regs to simulate a breakpoint exception
 (needed for kprobes).
 
 New config option that audits the timestamps of the ftrace ring buffer at
 most every event recorded.  The "check_buffer()" calls will conflict with
 mainline, because I purposely added the check without including the fix that
 it caught, which is in mainline. Running a kernel built from the commit of
 the added check will trigger it.
 
 Ftrace recursion protection has been cleaned up to move the protection to
 the callback itself (this saves on an extra function call for those
 callbacks).
 
 Perf now handles its own RCU protection and does not depend on ftrace to do
 it for it (saving on that extra function call).
 
 New debug option to add "recursed_functions" file to tracefs that lists all
 the places that triggered the recursion protection of the function tracer.
 This will show where things need to be fixed as recursion slows down the
 function tracer.
 
 The eval enum mapping updates done at boot up are now offloaded to a work
 queue, as it caused a noticeable pause on slow embedded boards.
 
 Various clean ups and last minute fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCX9uq8xQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qtrwAQCHevqWMjKc1Q76bnCgwB0AbFKB6vqy
 5b6g/co5+ihv8wD/eJPWlZMAt97zTVW7bdp5qj/GTiCDbAsODMZ597LsxA0=
 =rZEz
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "The major update to this release is that there's a new arch config
  option called CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS.

  Currently, only x86_64 enables it. All the ftrace callbacks now take a
  struct ftrace_regs instead of a struct pt_regs. If the architecture
  has HAVE_DYNAMIC_FTRACE_WITH_ARGS enabled, then the ftrace_regs will
  have enough information to read the arguments of the function being
  traced, as well as access to the stack pointer.

  This way, if a user (like live kernel patching) only cares about the
  arguments, then it can avoid using the heavier weight "regs" callback,
  that puts in enough information in the struct ftrace_regs to simulate
  a breakpoint exception (needed for kprobes).

  A new config option that audits the timestamps of the ftrace ring
  buffer at most every event recorded.

  Ftrace recursion protection has been cleaned up to move the protection
  to the callback itself (this saves on an extra function call for those
  callbacks).

  Perf now handles its own RCU protection and does not depend on ftrace
  to do it for it (saving on that extra function call).

  New debug option to add "recursed_functions" file to tracefs that
  lists all the places that triggered the recursion protection of the
  function tracer. This will show where things need to be fixed as
  recursion slows down the function tracer.

  The eval enum mapping updates done at boot up are now offloaded to a
  work queue, as it caused a noticeable pause on slow embedded boards.

  Various clean ups and last minute fixes"

* tag 'trace-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
  tracing: Offload eval map updates to a work queue
  Revert: "ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS"
  ring-buffer: Add rb_check_bpage in __rb_allocate_pages
  ring-buffer: Fix two typos in comments
  tracing: Drop unneeded assignment in ring_buffer_resize()
  tracing: Disable ftrace selftests when any tracer is running
  seq_buf: Avoid type mismatch for seq_buf_init
  ring-buffer: Fix a typo in function description
  ring-buffer: Remove obsolete rb_event_is_commit()
  ring-buffer: Add test to validate the time stamp deltas
  ftrace/documentation: Fix RST C code blocks
  tracing: Clean up after filter logic rewriting
  tracing: Remove the useless value assignment in test_create_synth_event()
  livepatch: Use the default ftrace_ops instead of REGS when ARGS is available
  ftrace/x86: Allow for arguments to be passed in to ftrace_regs by default
  ftrace: Have the callbacks receive a struct ftrace_regs instead of pt_regs
  MAINTAINERS: assign ./fs/tracefs to TRACING
  tracing: Fix some typos in comments
  ftrace: Remove unused varible 'ret'
  ring-buffer: Add recording of ring buffer recursion into recursed_functions
  ...
2020-12-17 13:22:17 -08:00
Steven Rostedt (VMware)
adab66b71a Revert: "ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS"
It was believed that metag was the only architecture that required the ring
buffer to keep 8 byte words aligned on 8 byte architectures, and with its
removal, it was assumed that the ring buffer code did not need to handle
this case. It appears that sparc64 also requires this.

The following was reported on a sparc64 boot up:

   kernel: futex hash table entries: 65536 (order: 9, 4194304 bytes, linear)
   kernel: Running postponed tracer tests:
   kernel: Testing tracer function:
   kernel: Kernel unaligned access at TPC[552a20] trace_function+0x40/0x140
   kernel: Kernel unaligned access at TPC[552a24] trace_function+0x44/0x140
   kernel: Kernel unaligned access at TPC[552a20] trace_function+0x40/0x140
   kernel: Kernel unaligned access at TPC[552a24] trace_function+0x44/0x140
   kernel: Kernel unaligned access at TPC[552a20] trace_function+0x40/0x140
   kernel: PASSED

Need to put back the 64BIT aligned code for the ring buffer.

Link: https://lore.kernel.org/r/CADxRZqzXQRYgKc=y-KV=S_yHL+Y8Ay2mh5ezeZUnpRvg+syWKw@mail.gmail.com

Cc: stable@vger.kernel.org
Fixes: 86b3de60a0 ("ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS")
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-12-14 12:33:51 -05:00
Qiujun Huang
74e2afc6df ring-buffer: Add rb_check_bpage in __rb_allocate_pages
It may be better to check each page is aligned by 4 bytes. The 2
least significant bits of the address will be used as flags.

Link: https://lkml.kernel.org/r/20201015113842.2921-1-hqjagain@gmail.com

Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-12-14 12:26:32 -05:00
Qiujun Huang
82db909e6b ring-buffer: Fix two typos in comments
s/inerrupting/interrupting/
s/beween/between/

Link: https://lkml.kernel.org/r/20201014152749.29986-1-hqjagain@gmail.com

Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-12-14 12:21:27 -05:00
Lukas Bulwahn
3b3493531c tracing: Drop unneeded assignment in ring_buffer_resize()
Since commit 0a1754b2a9 ("ring-buffer: Return 0 on success from
ring_buffer_resize()"), computing the size is not needed anymore.

Drop unneeded assignment in ring_buffer_resize().

Link: https://lkml.kernel.org/r/20201214084503.3079-1-lukas.bulwahn@gmail.com

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-12-14 12:09:54 -05:00
Qiujun Huang
888834903d ring-buffer: Fix a typo in function description
s/ring_buffer_commit_discard/ring_buffer_discard_commit/

Link: https://lkml.kernel.org/r/20201112151800.14382-1-hqjagain@gmail.com

Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-12-07 18:14:53 -05:00
Lukas Bulwahn
a32ded3389 ring-buffer: Remove obsolete rb_event_is_commit()
Commit a389d86f7f ("ring-buffer: Have nested events still record running
time stamp") removed the only uses of rb_event_is_commit() in
rb_update_event() and rb_update_write_stamp().

Hence, since then, make CC=clang W=1 warns:

  kernel/trace/ring_buffer.c:2763:1:
    warning: unused function 'rb_event_is_commit' [-Wunused-function]

Remove this obsolete function.

Link: https://lkml.kernel.org/r/20201117053703.11275-1-lukas.bulwahn@gmail.com

Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-12-07 18:08:53 -05:00
Steven Rostedt (VMware)
5b7be9c709 ring-buffer: Add test to validate the time stamp deltas
While debugging a situation where a delta for an event was calucalted wrong,
I realize there was nothing making sure that the delta of events are
correct. If a single event has an incorrect delta, then all events after it
will also have one. If the discrepency gets large enough, it could cause
the time stamps to go backwards when crossing sub buffers, that record a
full 64 bit time stamp, and the new deltas are added to that.

Add a way to validate the events at most events and when crossing a buffer
page. This will help make sure that the deltas are always correct. This test
will detect if they are ever corrupted.

The test adds a high overhead to the ring buffer recording, as it does the
audit for almost every event, and should only be used for testing the ring
buffer.

This will catch the bug that is fixed by commit 55ea4cf403 ("ring-buffer:
Update write stamp with the correct ts"), which is not applied when this
commit is applied.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-12-01 22:00:46 -05:00
Steven Rostedt (VMware)
68e10d5ff5 ring-buffer: Always check to put back before stamp when crossing pages
The current ring buffer logic checks to see if the updating of the event
buffer was interrupted, and if it is, it will try to fix up the before stamp
with the write stamp to make them equal again. This logic is flawed, because
if it is not interrupted, the two are guaranteed to be different, as the
current event just updated the before stamp before allocation. This
guarantees that the next event (this one or another interrupting one) will
think it interrupted the time updates of a previous event and inject an
absolute time stamp to compensate.

The correct logic is to always update the timestamps when traversing to a
new sub buffer.

Cc: stable@vger.kernel.org
Fixes: a389d86f7f ("ring-buffer: Have nested events still record running time stamp")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-30 23:21:51 -05:00
Andrea Righi
8785f51a17 ring-buffer: Set the right timestamp in the slow path of __rb_reserve_next()
In the slow path of __rb_reserve_next() a nested event(s) can happen
between evaluating the timestamp delta of the current event and updating
write_stamp via local_cmpxchg(); in this case the delta is not valid
anymore and it should be set to 0 (same timestamp as the interrupting
event), since the event that we are currently processing is not the last
event in the buffer.

Link: https://lkml.kernel.org/r/X8IVJcp1gRE+FJCJ@xps-13-7390

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lwn.net/Articles/831207
Fixes: a389d86f7f ("ring-buffer: Have nested events still record running time stamp")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-30 15:22:05 -05:00
Steven Rostedt (VMware)
55ea4cf403 ring-buffer: Update write stamp with the correct ts
The write stamp, used to calculate deltas between events, was updated with
the stale "ts" value in the "info" structure, and not with the updated "ts"
variable. This caused the deltas between events to be inaccurate, and when
crossing into a new sub buffer, had time go backwards.

Link: https://lkml.kernel.org/r/20201124223917.795844-1-elavila@google.com

Cc: stable@vger.kernel.org
Fixes: a389d86f7f ("ring-buffer: Have nested events still record running time stamp")
Reported-by: "J. Avila" <elavila@google.com>
Tested-by: Daniel Mentz <danielmentz@google.com>
Tested-by: Will McVicker <willmcvicker@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-30 15:21:31 -05:00
Steven Rostedt (VMware)
28575c61ea ring-buffer: Add recording of ring buffer recursion into recursed_functions
Add a new config RING_BUFFER_RECORD_RECURSION that will place functions that
recurse from the ring buffer into the ftrace recused_functions file.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-10 20:39:40 -05:00
Steven Rostedt (VMware)
b02414c8f0 ring-buffer: Fix recursion protection transitions between interrupt context
The recursion protection of the ring buffer depends on preempt_count() to be
correct. But it is possible that the ring buffer gets called after an
interrupt comes in but before it updates the preempt_count(). This will
trigger a false positive in the recursion code.

Use the same trick from the ftrace function callback recursion code which
uses a "transition" bit that gets set, to allow for a single recursion for
to handle transitions between contexts.

Cc: stable@vger.kernel.org
Fixes: 567cd4da54 ("ring-buffer: User context bit recursion checking")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-02 15:58:32 -05:00
Qiujun Huang
e1981f75d3 ring-buffer: Update the description for ring_buffer_wait
The function changed at some point, but the description was not
updated.

Link: https://lkml.kernel.org/r/20201017095246.5170-1-hqjagain@gmail.com

Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-22 11:26:26 -04:00
Qiujun Huang
0a1754b2a9 ring-buffer: Return 0 on success from ring_buffer_resize()
We don't need to check the new buffer size, and the return value
had confused resize_buffer_duplicate_size().
...
	ret = ring_buffer_resize(trace_buf->buffer,
		per_cpu_ptr(size_buf->data,cpu_id)->entries, cpu_id);
	if (ret == 0)
		per_cpu_ptr(trace_buf->data, cpu_id)->entries =
			per_cpu_ptr(size_buf->data, cpu_id)->entries;
...

Link: https://lkml.kernel.org/r/20201019142242.11560-1-hqjagain@gmail.com

Cc: stable@vger.kernel.org
Fixes: d60da506cb ("tracing: Add a resize function to make one buffer equivalent to another buffer")
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-22 11:24:10 -04:00
Gaurav Kohli
bbeb97464e tracing: Fix race in trace_open and buffer resize call
Below race can come, if trace_open and resize of
cpu buffer is running parallely on different cpus
CPUX                                CPUY
				    ring_buffer_resize
				    atomic_read(&buffer->resize_disabled)
tracing_open
tracing_reset_online_cpus
ring_buffer_reset_cpu
rb_reset_cpu
				    rb_update_pages
				    remove/insert pages
resetting pointer

This race can cause data abort or some times infinte loop in
rb_remove_pages and rb_insert_pages while checking pages
for sanity.

Take buffer lock to fix this.

Link: https://lkml.kernel.org/r/1601976833-24377-1-git-send-email-gkohli@codeaurora.org

Cc: stable@vger.kernel.org
Fixes: b23d7a5f4a ("ring-buffer: speed up buffer resets by avoiding synchronize_rcu for each CPU")
Signed-off-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-15 12:01:13 -04:00
Linus Torvalds
32663c78c1 Tracing updates for 5.9
- The biggest news in that the tracing ring buffer can now time events that
    interrupted other ring buffer events. Before this change, if an interrupt
    came in while recording another event, and that interrupt also had an
    event, those events would all have the same time stamp as the event it
    interrupted. Now, with the new design, those events will have a unique time
    stamp and rightfully display the time for those events that were recorded
    while interrupting another event.
 
  - Bootconfig how has an "override" operator that lets the users have a
    default config, but then add options to override the default.
 
  - A fix was made to properly filter function graph tracing to the ftrace
    PIDs. This came in at the end of the -rc cycle, and needs to be backported.
 
  - Several clean ups, performance updates, and minor fixes as well.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXy3GOBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qphsAP9ci1jtrC2+cMBMCNKb/AFpA/nDaKsD
 hpsDzvD0YPOmCAEA9QbZset8wUNG49R4FexP7egQ8Ad2S6Oa5f60jWleDQY=
 =lH+q
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:

 - The biggest news in that the tracing ring buffer can now time events
   that interrupted other ring buffer events.

   Before this change, if an interrupt came in while recording another
   event, and that interrupt also had an event, those events would all
   have the same time stamp as the event it interrupted.

   Now, with the new design, those events will have a unique time stamp
   and rightfully display the time for those events that were recorded
   while interrupting another event.

 - Bootconfig how has an "override" operator that lets the users have a
   default config, but then add options to override the default.

 - A fix was made to properly filter function graph tracing to the
   ftrace PIDs. This came in at the end of the -rc cycle, and needs to
   be backported.

 - Several clean ups, performance updates, and minor fixes as well.

* tag 'trace-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (39 commits)
  tracing: Add trace_array_init_printk() to initialize instance trace_printk() buffers
  kprobes: Fix compiler warning for !CONFIG_KPROBES_ON_FTRACE
  tracing: Use trace_sched_process_free() instead of exit() for pid tracing
  bootconfig: Fix to find the initargs correctly
  Documentation: bootconfig: Add bootconfig override operator
  tools/bootconfig: Add testcases for value override operator
  lib/bootconfig: Add override operator support
  kprobes: Remove show_registers() function prototype
  tracing/uprobe: Remove dead code in trace_uprobe_register()
  kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler
  ftrace: Fix ftrace_trace_task return value
  tracepoint: Use __used attribute definitions from compiler_attributes.h
  tracepoint: Mark __tracepoint_string's __used
  trace : Have tracing buffer info use kvzalloc instead of kzalloc
  tracing: Remove outdated comment in stack handling
  ftrace: Do not let direct or IPMODIFY ftrace_ops be added to module and set trampolines
  ftrace: Setup correct FTRACE_FL_REGS flags for module
  tracing/hwlat: Honor the tracing_cpumask
  tracing/hwlat: Drop the duplicate assignment in start_kthread()
  tracing: Save one trace_event->type by using __TRACE_LAST_TYPE
  ...
2020-08-07 18:29:15 -07:00
Kees Cook
3f649ab728 treewide: Remove uninitialized_var() usage
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
	xargs perl -pi -e \
		's/\buninitialized_var\(([^\)]+)\)/\1/g;
		 s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-16 12:35:15 -07:00
Steven Rostedt (VMware)
29ce24519c ring-buffer: Do not trigger a WARN if clock going backwards is detected
After tweaking the ring buffer to be a bit faster, a warning is triggering
on one of my machines, and causing my tests to fail. This warning is caused
when the delta (current time stamp minus previous time stamp), is larger
than the max time held by the ring buffer (59 bits).

If the clock were to go backwards slightly, this would then easily trigger
this warning. The machine that it triggered on, the clock did go backwards
by around 450 nanoseconds, and this happened after a recalibration of the
TSC clock. Now that the ring buffer is faster, it detects this, and the
delta that is used larger than the max, the warning is triggered and my test
fails.

To handle the clock going backwards, look at the saved before and after time
stamps. If they are the same, it means that the current event did not
interrupt another event, and that those timestamp are of a previous event
that was recorded. If the max delta is triggered, look at those time stamps,
make sure they are the same, then use them to compare with the current
timestamp. If the current timestamp is less than the before/after time
stamps, then that means the clock being used went backward.

Print out a message that this has happened, but do not warn about it (and
only print the message once).

Still do the warning if the delta is indeed larger than what can be used.

Also remove the unneeded KERN_WARNING from the WARN_ONCE() print.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-01 22:12:07 -04:00
Steven Rostedt (VMware)
bbeba3e58f ring-buffer: Call trace_clock_local() directly for RETPOLINE kernels
After doing some benchmarks and examining the code, I found that the ring
buffer clock calls were quite expensive, and noticed that it uses
retpolines. This is because the ring buffer clock is programmable, and can
be set. But in most cases it simply uses the fastest ns unit clock which is
the trace_clock_local(). For RETPOLINE builds, checking if the ring buffer
clock is set to trace_clock_local() and then calling it directly has brought
the time of an event on my i7 box from an average of 93 nanoseconds an event
down to 83 nanoseconds an event, and the minimum time from 81 nanoseconds to
68 nanoseconds!

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-01 22:12:07 -04:00
Steven Rostedt (VMware)
74e879373b ring-buffer: Move the add_timestamp into its own function
Make a helper function rb_add_timestamp() that moves the adding of the
extended time stamps into its own function. Also, remove the noinline and
inline for the functions it calls, as recent benchmarks appear they do not
make a difference (just let gcc decide).

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-01 22:12:06 -04:00
Steven Rostedt (VMware)
58fbc3c632 ring-buffer: Consolidate add_timestamp to remove some branches
Reorganize a little the logic to handle adding the absolute time stamp,
extended and forced time stamps, in such a way to remove a branch or two.
This is just a micro optimization.

Also add before and after time stamps to the rb_event_info structure to
display those values in the rb_check_timestamps() code, if something were to
go wrong.

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-01 22:11:22 -04:00
Steven Rostedt (VMware)
75b21c6dfa ring-buffer: Mark the !tail (crossing a page) as unlikely
It is the uncommon case where an event crosses a sub buffer boundary (page)
mark that check at the end of reserving an event as unlikely.

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30 17:18:56 -04:00
Nicholas Piggin
b23d7a5f4a ring-buffer: speed up buffer resets by avoiding synchronize_rcu for each CPU
On a 144 thread system, `perf ftrace` takes about 20 seconds to start
up, due to calling synchronize_rcu() for each CPU.

  cat /proc/108560/stack
    0xc0003e7eb336f470
    __switch_to+0x2e0/0x480
    __wait_rcu_gp+0x20c/0x220
    synchronize_rcu+0x9c/0xc0
    ring_buffer_reset_cpu+0x88/0x2e0
    tracing_reset_online_cpus+0x84/0xe0
    tracing_open+0x1d4/0x1f0

On a system with 10x more threads, it starts to become an annoyance.

Batch these up so we disable all the per-cpu buffers first, then
synchronize_rcu() once, then reset each of the buffers. This brings
the time down to about 0.5s.

Link: https://lkml.kernel.org/r/20200625053403.2386972-1-npiggin@gmail.com

Tested-by: Anton Blanchard <anton@ozlabs.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30 17:18:56 -04:00
Steven Rostedt (VMware)
10464b4aa6 ring-buffer: Add rb_time_t 64 bit operations for speeding up 32 bit
After a discussion with the new time algorithm to have nested events still
have proper time keeping but required using local64_t atomic operations.
Mathieu was concerned about the performance this would have on 32 bit
machines, as in most cases, atomic 64 bit operations on them can be
expensive.

As the ring buffer's timing needs do not require full features of local64_t,
a wrapper is made to implement a new rb_time_t operation that uses two longs
on 32 bit machines but still uses the local64_t operations on 64 bit
machines. There's a switch that can be made in the file to force 64 bit to
use the 32 bit version just for testing purposes.

All reads do not need to succeed if a read happened while the stamp being
read is in the process of being updated. The requirement is that all reads
must succed that were done by an interrupting event (where this event was
interrupted by another event that did the write). Or if the event itself did
the write first. That is: rb_time_set(t, x) followed by rb_time_read(t) will
always succeed (even if it gets interrupted by another event that writes to
t. The result of the read will be either the previous set, or a set
performed by an interrupting event.

If the read is done by an event that interrupted another event that was in
the process of setting the time stamp, and no other event came along to
write to that time stamp, it will fail and the rb_time_read() will return
that it failed (the value to read will be undefined).

A set will always write to the time stamp and return with a valid time
stamp, such that any read after it will be valid.

A cmpxchg may fail if it interrupted an event that was in the process of
updating the time stamp just like the reads do. Other than that, it will act
like a normal cmpxchg.

The way this works is that the rb_time_t is made of of three fields. A cnt,
that gets updated atomically everyting a modification is made. A top that
represents the most significant 30 bits of the time, and a bottom to
represent the least significant 30 bits of the time. Notice, that the time
values is only 60 bits long (where the ring buffer only uses 59 bits, which
gives us 18 years of nanoseconds!).

The top two bits of both the top and bottom is a 2 bit counter that gets set
by the value of the least two significant bits of the cnt. A read of the top
and the bottom where both the top and bottom have the same most significant
top 2 bits, are considered a match and a valid 60 bit number can be created
from it. If they do not match, then the number is considered invalid, and
this must only happen if an event interrupted another event in the midst of
updating the time stamp.

This is only used for 32 bits machines as 64 bit machines can get better
performance out of the local64_t. This has been tested heavily by forcing 64
bit to use this logic.

Link: https://lore.kernel.org/r/20200625225345.18cf5881@oasis.local.home
Link: http://lkml.kernel.org/r/20200629025259.309232719@goodmis.org

Inspired-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30 17:18:51 -04:00
Steven Rostedt (VMware)
7c4b4a5164 ring-buffer: Incorporate absolute timestamp into add_timestamp logic
Instead of calling out the absolute test for each time to check if the
ring buffer wants absolute time stamps for all its recording, incorporate it
with the add_timestamp field and turn it into flags for faster processing
between wanting a absolute tag and needing to force one.

Link: http://lkml.kernel.org/r/20200629025259.154892368@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30 16:16:14 -04:00
Steven Rostedt (VMware)
a389d86f7f ring-buffer: Have nested events still record running time stamp
Up until now, if an event is interrupted while it is recorded by an
interrupt, and that interrupt records events, the time of those events will
all be the same. This is because events only record the delta of the time
since the previous event (or beginning of a page), and to handle updating
the time keeping for that of nested events is extremely racy. After years of
thinking about this and several failed attempts, I finally have a solution
to solve this puzzle.

The problem is that you need to atomically calculate the delta and then
update the time stamp you made the delta from, as well as then record it
into the buffer, all this while at any time an interrupt can come in and
do the same thing. This is easy to solve with heavy weight atomics, but that
would be detrimental to the performance of the ring buffer. The current
state of affairs sacrificed the time deltas for nested events for
performance.

The reason for previous failed attempts at solving this puzzle was because I
was trying to completely avoid slow atomic operations like cmpxchg. I final
came to the conclusion to always avoid cmpxchg is not possible, which is why
those previous attempts always failed. But it is possible to pick one path
(the most common case) and avoid cmpxchg in that path, which is the "fast
path". The most common case is that an event will not be interrupted and
have other events added into it. An event can detect if it has
interrupted another event, and for these cases we can make it the slow
path and use the heavy operations like cmpxchg.

One more player was added to the game that made this possible, and that is
the "absolute timestamp" (by Tom Zanussi) that allows us to inject a full 59
bit time stamp. (Of course this breaks if a machine is running for more than
18 years without a reboot!).

There's barrier() placements around for being paranoid, even when they
are not needed because of other atomic functions near by. But those
should not hurt, as if they are not needed, they basically become a nop.

Note, this also makes the race window much smaller, which means there
are less slow paths to slow down the performance.

The basic idea is that there's two main paths taken.

 1) Not being interrupted between time stamps and reserving buffer space.
    In this case, the time stamps taken are true to the location in the
    buffer.

 2) Was interrupted by another path between taking time stamps and reserving
    buffer space.

The objective is to know what the delta is from the last reserved location
in the buffer.

As it is possible to detect if an event is interrupting another event before
reserving data, space is added to the length to be reserved to inject a full
time stamp along with the event being reserved.

When an event is not interrupted, the write stamp is always the time of the
last event written to the buffer.

In path 1, there's two sub paths we care about:

 a) The event did not interrupt another event.
 b) The event interrupted another event.

In case a, as the write stamp was read and known to be correct, the delta
between the current time stamp and the write stamp is the delta between the
current event and the previously recorded event.

In case b, extra space was reserved to just put the full time stamp into the
buffer. Which is done, as stated, in this path the time stamp taken is known
to match the location in the buffer.

In path 2, there's also two sub paths we care about:

 a) The event was not interrupted by another event since it reserved space
    on the buffer and re-reading the write stamp.
 b) The event was interrupted by another event.

In case a, the write stamp is that of the last event that interrupted this
event between taking the time stamps and reserving. As no event came in
after re-reading the write stamp, that event is known to be the time of the
event directly before this event and the delta can be the new time stamp and
the write stamp.

In case b, one or more events came in between reserving the event and
re-reading he write stamp. Since this event's buffer reservation is between
other events at this path, there's no way to know what the delta is. But
because an event interrupted this event after it started, its fine to just
give a zero delta, and take the same time stamp as the events that happened
within the event being recorded.

Here's the implementation of the design of this solution:

 All this is per cpu, and only needs to worry about nested events (not
 parallel events).

The players:

 write_tail: The index in the buffer where new events can be written to.
     It is incremented via local_add() to reserve space for a new event.

 before_stamp: A time stamp set by all events before reserving space.

 write_stamp: A time stamp updated by events after it has successfully
     reserved space.

	/* Save the current position of write */
 [A]	w = local_read(write_tail);
	barrier();
	/* Read both before and write stamps before touching anything */
	before = local_read(before_stamp);
	after = local_read(write_stamp);
	barrier();

	/*
	 * If before and after are the same, then this event is not
	 * interrupting a time update. If it is, then reserve space for adding
	 * a full time stamp (this can turn into a time extend which is
	 * just an extended time delta but fill up the extra space).
	 */
	if (after != before)
		abs = true;

	ts = clock();

	/* Now update the before_stamp (everyone does this!) */
 [B]	local_set(before_stamp, ts);

	/* Now reserve space on the buffer */
 [C]	write = local_add_return(len, write_tail);

	/* Set tail to be were this event's data is */
	tail = write - len;

 	if (w == tail) {

		/* Nothing interrupted this between A and C */
 [D]		local_set(write_stamp, ts);
		barrier();
 [E]		save_before = local_read(before_stamp);

 		if (!abs) {
			/* This did not interrupt a time update */
			delta = ts - after;
		} else {
			delta = ts; /* The full time stamp will be in use */
		}
		if (ts != save_before) {
			/* slow path - Was interrupted between C and E */
			/* The update to write_stamp could have overwritten the update to
			 * it by the interrupting event, but before and after should be
			 * the same for all completed top events */
			after = local_read(write_stamp);
			if (save_before > after)
				local_cmpxchg(write_stamp, after, save_before);
		}
	} else {
		/* slow path - Interrupted between A and C */

		after = local_read(write_stamp);
		temp_ts = clock();
		barrier();
 [F]		if (write == local_read(write_tail) && after < temp_ts) {
			/* This was not interrupted since C and F
			 * The last write_stamp is still valid for the previous event
			 * in the buffer. */
			delta = temp_ts - after;
			/* OK to keep this new time stamp */
			ts = temp_ts;
		} else {
			/* Interrupted between C and F
			 * Well, there's no use to try to know what the time stamp
			 * is for the previous event. Just set delta to zero and
			 * be the same time as that event that interrupted us before
			 * the reservation of the buffer. */

			delta = 0;
		}
		/* No need to use full timestamps here */
		abs = 0;
	}

Link: https://lkml.kernel.org/r/20200625094454.732790f7@oasis.local.home
Link: https://lore.kernel.org/r/20200627010041.517736087@goodmis.org
Link: http://lkml.kernel.org/r/20200629025258.957440797@goodmis.org

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30 14:29:33 -04:00
Steven Rostedt (VMware)
097350d1c6 ring-buffer: Zero out time extend if it is nested and not absolute
Currently the ring buffer makes events that happen in interrupts that preempt
another event have a delta of zero. (Hopefully we can change this soon). But
this is to deal with the races of updating a global counter with lockless
and nesting functions updating deltas.

With the addition of absolute time stamps, the time extend didn't follow
this rule. A time extend can happen if two events happen longer than 2^27
nanoseconds appart, as the delta time field in each event is only 27 bits.
If that happens, then a time extend is injected with 2^59 bits of
nanoseconds to use (18 years). But if the 2^27 nanoseconds happen between
two events, and as it is writing the event, an interrupt triggers, it will
see the 2^27 difference as well and inject a time extend of its own. But a
recent change made the time extend logic not take into account the nesting,
and this can cause two time extend deltas to happen moving the time stamp
much further ahead than the current time. This gets all reset when the ring
buffer moves to the next page, but that can cause time to appear to go
backwards.

This was observed in a trace-cmd recording, and since the data is saved in a
file, with trace-cmd report --debug, it was possible to see that this indeed
did happen!

  bash-52501   110d... 81778.908247: sched_switch:         bash:52501 [120] S ==> swapper/110:0 [120] [12770284:0x2e8:64]
  <idle>-0     110d... 81778.908757: sched_switch:         swapper/110:0 [120] R ==> bash:52501 [120] [509947:0x32c:64]
 TIME EXTEND: delta:306454770 length:0
  bash-52501   110.... 81779.215212: sched_swap_numa:      src_pid=52501 src_tgid=52388 src_ngid=52501 src_cpu=110 src_nid=2 dst_pid=52509 dst_tgid=52388 dst_ngid=52501 dst_cpu=49 dst_nid=1 [0:0x378:48]
 TIME EXTEND: delta:306458165 length:0
  bash-52501   110dNh. 81779.521670: sched_wakeup:         migration/110:565 [0] success=1 CPU:110 [0:0x3b4:40]

and at the next page, caused the time to go backwards:

  bash-52504   110d... 81779.685411: sched_switch:         bash:52504 [120] S ==> swapper/110:0 [120] [8347057:0xfb4:64]
CPU:110 [SUBBUFFER START] [81779379165886:0x1320000]
  <idle>-0     110dN.. 81779.379166: sched_wakeup:         bash:52504 [120] success=1 CPU:110 [0:0x10:40]
  <idle>-0     110d... 81779.379167: sched_switch:         swapper/110:0 [120] R ==> bash:52504 [120] [1168:0x3c:64]

Link: https://lkml.kernel.org/r/20200622151815.345d1bf5@oasis.local.home

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: stable@vger.kernel.org
Fixes: dc4e2801d4 ("ring-buffer: Redefine the unimplemented RINGBUF_TYPE_TIME_STAMP")
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-23 11:18:42 -04:00
Steven Rostedt (VMware)
da4d401a6b ring-buffer: Remove all BUG() calls
There's a lot of checks to make sure the ring buffer is working, and if an
anomaly is detected, it safely shuts itself down. But there's a few cases
that it will call BUG(), which defeats the point of being safe (it crashes
the kernel when an anomaly is found!). There's no reason for them. Switch
them all to either WARN_ON_ONCE() (when no ring buffer descriptor is present),
or to RB_WARN_ON() (when a ring buffer descriptor is present).

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-14 08:51:02 -04:00
Steven Rostedt (VMware)
3d2353de81 ring-buffer: Don't deactivate the ring buffer on failed iterator reads
If the function tracer is running and the trace file is read (which uses the
ring buffer iterator), the iterator can get in sync with the writes, and
caues it to fail to find a page with content it can read three times. This
causes a warning and deactivation of the ring buffer code.

Looking at the other cases of failure to get an event, it appears that
there's a chance that the writer could cause them too. Since the iterator is
a "best effort" to read the ring buffer if there's an active writer (the
consumer reader is made for this case "see trace_pipe"), if it fails to get
an event after three tries, simply give up and return NULL. Don't warn, nor
disable the ring buffer on this failure.

Link: https://lore.kernel.org/r/20200429090508.GG5770@shao2-debian

Reported-by: kernel test robot <lkp@intel.com>
Fixes: ff84c50cfb ("ring-buffer: Do not die if rb_iter_peek() fails more than thrice")
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-14 08:50:51 -04:00
Steven Rostedt (VMware)
c9b7a4a72f ring-buffer/tracing: Have iterator acknowledge dropped events
Have the ring_buffer_iterator set a flag if events were dropped as it were
to go and peek at the next event. Have the trace file display this fact if
it happened with a "LOST EVENTS" message.

Link: http://lkml.kernel.org/r/20200317213417.045858900@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-27 16:39:01 -04:00
Steven Rostedt (VMware)
1039221cc2 ring-buffer: Do not disable recording when there is an iterator
Now that the iterator can handle a concurrent writer, do not disable writing
to the ring buffer when there is an iterator present.

Link: http://lkml.kernel.org/r/20200317213416.759770696@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-27 16:38:58 -04:00
Steven Rostedt (VMware)
07b8b10ec9 ring-buffer: Make resize disable per cpu buffer instead of total buffer
When the ring buffer becomes writable for even when the trace file is read,
it must still not be resized. But since tracers can be activated while the
trace file is being read, the irqsoff tracer can modify the per CPU buffers,
and this can cause the reader of the trace file to update the wrong buffer's
resize disable bit, as the irqsoff tracer swaps out cpu buffers.

By making the resize disable per cpu_buffer, it makes the update follow the
per cpu_buffer even if it's swapped out with the snapshot buffer and keeps
the release of the trace file modifying the same data as the open did.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-27 16:21:22 -04:00
Steven Rostedt (VMware)
153368ce1b ring-buffer: Optimize rb_iter_head_event()
As it is fine to perform several "peeks" of event data in the ring buffer
via the iterator before moving it forward, do not re-read the event, just
return what was read before. Otherwise, it can cause inconsistent results,
especially when testing multiple CPU buffers to interleave them.

Link: http://lkml.kernel.org/r/20200317213416.592032170@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-19 19:11:20 -04:00
Steven Rostedt (VMware)
ff84c50cfb ring-buffer: Do not die if rb_iter_peek() fails more than thrice
As the iterator will be reading a live buffer, and if the event being read
is on a page that a writer crosses, it will fail and try again, the
condition in rb_iter_peek() that only allows a retry to happen three times
is no longer valid. Allow rb_iter_peek() to retry more than three times
without killing the ring buffer, but only if rb_iter_head_event() had failed
at least once.

Link: http://lkml.kernel.org/r/20200317213416.452888193@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-19 19:11:19 -04:00
Steven Rostedt (VMware)
785888c544 ring-buffer: Have rb_iter_head_event() handle concurrent writer
Have the ring_buffer_iter structure have a place to store an event, such
that it can not be overwritten by a writer, and load it in such a way via
rb_iter_head_event() that it will return NULL and reset the iter to the
start of the current page if a writer updated the page.

Link: http://lkml.kernel.org/r/20200317213416.306959216@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-19 19:11:19 -04:00
Steven Rostedt (VMware)
28e3fc56a4 ring-buffer: Add page_stamp to iterator for synchronization
Have the ring_buffer_iter structure contain a page_stamp, such that it can
be used to see if the writer entered the page the iterator is on. When going
to a new page, the iterator will record the time stamp of that page. When
reading events, it can copy the event to an internal buffer on the iterator
(to be implemented later), then check the page's time stamp with its own to
see if the writer entered the page. If so, it will need to try to read the
event again.

Link: http://lkml.kernel.org/r/20200317213416.163549674@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-19 19:11:19 -04:00
Steven Rostedt (VMware)
bc1a72afdc ring-buffer: Rename ring_buffer_read() to read_buffer_iter_advance()
When the ring buffer was first created, the iterator followed the normal
producer/consumer operations where it had both a peek() operation, that just
returned the event at the current location, and a read(), that would return
the event at the current location and also increment the iterator such that
the next peek() or read() will return the next event.

The only use of the ring_buffer_read() is currently to move the iterator to
the next location and nothing now actually reads the event it returns.
Rename this function to its actual use case to ring_buffer_iter_advance(),
which also adds the "iter" part to the name, which is more meaningful. As
the timestamp returned by ring_buffer_read() was never used, there's no
reason that this new version should bother having returning it. It will also
become a void function.

Link: http://lkml.kernel.org/r/20200317213416.018928618@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-19 19:11:19 -04:00
Steven Rostedt (VMware)
ead6ecfdde ring-buffer: Have ring_buffer_empty() not depend on tracing stopped
It was complained about that when the trace file is read, that the tracing
is disabled, as the iterator expects writing to the buffer it reads is not
updated. Several steps are needed to make the iterator handle a writer,
by testing if things have changed as it reads.

This step is to make ring_buffer_empty() expect the buffer to be changing.
Note if the current location of the iterator is overwritten, then it will
return false as new data is being added. Note, that this means that data
will be skipped.

Link: http://lkml.kernel.org/r/20200317213415.870741809@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-19 19:11:19 -04:00
Alex Shi
b83479482f ring-buffer: Remove abandoned macro RB_MISSED_FLAGS
This macro isn't used since commit d325c40296 ("ring-buffer: Remove
unused function ring_buffer_page_len()"), so better to remove it.

Link: http://lkml.kernel.org/r/1579586080-45300-1-git-send-email-alex.shi@linux.alibaba.com

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-21 18:38:02 -05:00
Steven Rostedt (VMware)
cfc585a401 ring-buffer: Fix kernel doc for rb_update_event()
rb_update_event has changed without the kernel-doc update.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-14 16:27:51 -05:00
Fabian Frederick
59e7cffe5c ring-bufer: kernel-doc warning fixes
Also fixes a couple of typos

Link: http://lkml.kernel.org/r/1401992525-10417-1-git-send-email-fabf@skynet.be

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
[ Found this deep in the abyss of my INBOX ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-14 16:23:34 -05:00
Steven Rostedt (VMware)
1329249437 tracing: Make struct ring_buffer less ambiguous
As there's two struct ring_buffers in the kernel, it causes some confusion.
The other one being the perf ring buffer. It was agreed upon that as neither
of the ring buffers are generic enough to be used globally, they should be
renamed as:

   perf's ring_buffer -> perf_buffer
   ftrace's ring_buffer -> trace_buffer

This implements the changes to the ring buffer that ftrace uses.

Link: https://lore.kernel.org/r/20191213140531.116b3200@gandalf.local.home

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13 13:19:38 -05:00
Stephen Rothwell
ee19545220 Fix up for "printk: Drop pr_warning definition"
Link: http://lkml.kernel.org/r/20191206092503.303d6a57@canb.auug.org.au
Cc: Linux Next Mailing List <linux-next@vger.kernel.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-12-06 10:02:35 +01:00
Steven Rostedt (VMware)
a356646a56 tracing: Do not create directories if lockdown is in affect
If lockdown is disabling tracing on boot up, it prevents the tracing files
from even bering created. But when that happens, there's several places that
will give a warning that the files were not created as that is usually a
sign of a bug.

Add in strategic locations where a check is made to see if tracing is
disabled by lockdown, and if it is, do not go further, and fail silently
(but print that tracing is disabled by lockdown, without doing a WARN_ON()).

Cc: Matthew Garrett <mjg59@google.com>
Fixes: 17911ff38a ("tracing: Add locked_down checks to the open calls of files created for tracefs")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-04 08:27:15 -05:00
Steven Rostedt (VMware)
86b3de60a0 ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS
Commit c19fa94a8f ("Add HAVE_64BIT_ALIGNED_ACCESS") added the config for
architectures that required 64bit aligned access for all 64bit words. As
the ftrace ring buffer stores data on 4 byte alignment, this config option
was used to force it to store data on 8 byte alignment to make sure the data
being stored and written directly into the ring buffer was 8 byte aligned as
it would cause issues trying to write an 8 byte word on a 4 not 8 byte
aligned memory location.

But with the removal of the metag architecture, which was the only
architecture to use this, there is no architecture supported by Linux that
requires 8 byte aligne access for all 8 byte words (4 byte alignment is good
enough). Removing this config can simplify the code a bit.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-28 09:36:19 -04:00