trace: Add osnoise tracer

In the context of high-performance computing (HPC), the Operating System
Noise (*osnoise*) refers to the interference experienced by an application
due to activities inside the operating system. In the context of Linux,
NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the
system. Moreover, hardware-related jobs can also cause noise, for example,
via SMIs.

The osnoise tracer leverages the hwlat_detector by running a similar
loop with preemption, SoftIRQs and IRQs enabled, thus allowing all
the sources of *osnoise* during its execution. Using the same approach
as hwlat, osnoise takes note of the entry and exit points of any
source of interference, increasing a per-cpu interference counter. The
osnoise tracer also saves an interference counter for each source of
interference. The interference counter for NMI, IRQs, SoftIRQs, and
threads is increased anytime the tool observes these interferences' entry
events. When a noise happens without any interference from the operating
system level, the hardware noise counter increases, pointing to a
hardware-related noise. In this way, osnoise can account for any
source of interference. At the end of the period, the osnoise tracer
prints the sum of all noise, the max single noise, the percentage of CPU
available for the thread, and the counters for the noise sources.

Usage

Write the ASCII text "osnoise" into the current_tracer file of the
tracing system (generally mounted at /sys/kernel/tracing).

For example::

        [root@f32 ~]# cd /sys/kernel/tracing/
        [root@f32 tracing]# echo osnoise > current_tracer

It is possible to follow the trace by reading the trace file::

        [root@f32 tracing]# cat trace
        # tracer: osnoise
        #
        #                                _-----=> irqs-off
        #                               / _----=> need-resched
        #                              | / _---=> hardirq/softirq
        #                              || / _--=> preempt-depth                            MAX
        #                              || /                                             SINGLE     Interference counters:
        #                              ||||               RUNTIME      NOISE   % OF CPU  NOISE    +-----------------------------+
        #           TASK-PID      CPU# ||||   TIMESTAMP    IN US       IN US  AVAILABLE  IN US     HW    NMI    IRQ   SIRQ THREAD
        #              | |         |   ||||      |           |             |    |            |      |      |      |      |      |
                   <...>-859     [000] ....    81.637220: 1000000        190  99.98100       9     18      0   1007     18      1
                   <...>-860     [001] ....    81.638154: 1000000        656  99.93440      74     23      0   1006     16      3
                   <...>-861     [002] ....    81.638193: 1000000       5675  99.43250     202      6      0   1013     25     21
                   <...>-862     [003] ....    81.638242: 1000000        125  99.98750      45      1      0   1011     23      0
                   <...>-863     [004] ....    81.638260: 1000000       1721  99.82790     168      7      0   1002     49     41
                   <...>-864     [005] ....    81.638286: 1000000        263  99.97370      57      6      0   1006     26      2
                   <...>-865     [006] ....    81.638302: 1000000        109  99.98910      21      3      0   1006     18      1
                   <...>-866     [007] ....    81.638326: 1000000       7816  99.21840     107      8      0   1016     39     19

In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the
tracer prints a message at the end of each period for each CPU that is
running an osnoise/CPU thread. The osnoise specific fields report:

 - The RUNTIME IN US reports the amount of time in microseconds that
   the osnoise thread kept looping reading the time.
 - The NOISE IN US reports the sum of noise in microseconds observed
   by the osnoise tracer during the associated runtime.
 - The % OF CPU AVAILABLE reports the percentage of CPU available for
   the osnoise thread during the runtime window.
 - The MAX SINGLE NOISE IN US reports the maximum single noise observed
   during the runtime window.
 - The Interference counters display how many times each of the
   respective interferences happened during the runtime window.
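
As a worked example of how the % OF CPU AVAILABLE column relates to the RUNTIME IN US and NOISE IN US columns, the sketch below recomputes it with integer arithmetic: the net runtime (runtime minus noise) is divided by the runtime and scaled to five decimal digits. The helper name osnoise_cpu_available is hypothetical and only serves this illustration.

```shell
#!/bin/sh
# Sketch: recompute "% OF CPU AVAILABLE" from RUNTIME IN US and NOISE IN US.
# osnoise_cpu_available is a hypothetical helper, not part of the kernel.
osnoise_cpu_available() {
    runtime_us=$1
    noise_us=$2
    # Net runtime: the time the osnoise thread actually had the CPU.
    net_us=$((runtime_us - noise_us))
    # Scale so the quotient carries five decimal digits of the percentage.
    ratio=$((net_us * 10000000 / runtime_us))
    printf '%d.%05d\n' $((ratio / 100000)) $((ratio % 100000))
}

# First row of the sample output: runtime 1000000 us, noise 190 us.
osnoise_cpu_available 1000000 190
```

With the values of the first sample row, this prints 99.98100, matching the trace output.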

Note that the example above shows a high number of HW noise samples.
The reason is that this sample was taken on a virtual machine,
and the host interference is detected as a hardware interference.

Tracer options

The tracer has a set of options inside the osnoise directory:

 - osnoise/cpus: CPUs on which an osnoise thread will execute.
 - osnoise/period_us: the period of the osnoise thread.
 - osnoise/runtime_us: how long an osnoise thread will look for noise.
 - osnoise/stop_tracing_us: stop the system tracing if a single noise
   higher than the configured value happens. Writing 0 disables this
   option.
 - osnoise/stop_tracing_total_us: stop the system tracing if total noise
   higher than the configured value happens. Writing 0 disables this
   option.
 - tracing_thresh: the minimum delta between two time() reads to be
   considered as noise, in us. When set to 0, the default value will
   be used, which is currently 5 us.
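
A minimal sketch of setting some of these options from a shell follows. The helper name osnoise_configure, the CPU list format "0-3", and the period and runtime values are assumptions chosen for illustration; the tracing directory is passed as a parameter so the same function can be pointed at /sys/kernel/tracing or at a scratch directory.

```shell
#!/bin/sh
# Sketch: write osnoise options under a tracing directory.
# osnoise_configure is a hypothetical helper, not part of the kernel.
#   $1: tracing directory (generally /sys/kernel/tracing)
#   $2: CPUs for the osnoise threads (assumed to be a CPU list such as 0-3)
#   $3: period in microseconds
#   $4: runtime in microseconds
osnoise_configure() {
    tracing_dir=$1
    cpus=$2
    period_us=$3
    runtime_us=$4
    echo "$cpus"       > "$tracing_dir/osnoise/cpus" &&
    echo "$period_us"  > "$tracing_dir/osnoise/period_us" &&
    echo "$runtime_us" > "$tracing_dir/osnoise/runtime_us"
}

# Example (as root, on a kernel built with CONFIG_OSNOISE_TRACER):
#   osnoise_configure /sys/kernel/tracing 0-3 1000000 500000
```

The function stops at the first failed write, so a rejected value is reported through its exit status.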

Additional Tracing

In addition to the tracer, a set of tracepoints were added to
facilitate the identification of the osnoise source.

 - osnoise:sample_threshold: printed anytime a noise is higher than
   the configurable tolerance_ns.
 - osnoise:nmi_noise: noise from NMI, including the duration.
 - osnoise:irq_noise: noise from an IRQ, including the duration.
 - osnoise:softirq_noise: noise from a SoftIRQ, including the
   duration.
 - osnoise:thread_noise: noise from a thread, including the duration.
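
These tracepoints can be switched on like any other trace event group, through the events directory of the tracing system. The sketch below assumes the standard events/SYSTEM/enable layout; the helper name osnoise_events_set is hypothetical.

```shell
#!/bin/sh
# Sketch: enable or disable all osnoise: tracepoints at once.
# osnoise_events_set is a hypothetical helper, not part of the kernel.
#   $1: tracing directory (generally /sys/kernel/tracing)
#   $2: 1 to enable, 0 to disable (defaults to 1)
osnoise_events_set() {
    tracing_dir=$1
    state=${2:-1}
    echo "$state" > "$tracing_dir/events/osnoise/enable"
}

# Example (as root, with the osnoise tracer available):
#   osnoise_events_set /sys/kernel/tracing 1
```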

Note that all the values are *net values*. For example, if while osnoise
is running, another thread preempts the osnoise thread, it will start a
thread_noise duration at the start. Then, an IRQ takes place, preempting
the thread_noise, starting an irq_noise. When the IRQ ends its execution,
it will compute its duration, and this duration will be subtracted from
the thread_noise, in such a way as to avoid the double accounting of the
IRQ execution. This logic is valid for all sources of noise.
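
The net-value rule can be illustrated with a small arithmetic sketch. The numbers are made up for illustration and do not come from a real trace, and net_noise_ns is a hypothetical helper: it takes the gross duration of the outer interference followed by the durations of the interferences nested inside it.

```shell
#!/bin/sh
# Sketch: net duration = gross duration minus every nested interference.
# net_noise_ns is a hypothetical helper, not part of the kernel.
#   $1:   gross duration of the outer interference, in ns
#   $2..: durations of the interferences that preempted it, in ns
net_noise_ns() {
    net=$1
    shift
    for nested in "$@"; do
        net=$((net - nested))
    done
    echo "$net"
}

# Illustrative values: a thread holds the CPU for 5000 ns in total, and an
# IRQ lasting 1500 ns runs in the middle of it.
net_noise_ns 5000 1500
```

Here the thread_noise would report 3500 ns and the irq_noise 1500 ns, so the IRQ time is counted only once.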

Here is one example of the usage of these tracepoints::

       osnoise/8-961     [008] d.h.  5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
       osnoise/8-961     [008] dNh.  5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
     migration/8-54      [008] d...  5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
       osnoise/8-961     [008] ....  5789.858413: sample_threshold: start 5789.858404555 duration 8723 ns interferences 2

In this example, a noise sample of 8 microseconds was reported in the last
line, pointing to two interferences. Looking backward in the trace, the
two previous entries were about the migration thread running after a
timer IRQ execution. The first event is not part of the noise because
it took place one millisecond before.

It is worth noticing that the sum of the durations reported in the
tracepoints is smaller than the eight us reported in the sample_threshold.
The reason lies in the overhead of the entry and exit code that runs
before and after any interference execution. This justifies the dual
approach: measuring thread and tracing.
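
The gap can be checked by hand. The sketch below sums the "duration N ns" fields of the two interference events from the example in this message and subtracts the result from the 8723 ns reported by sample_threshold. The helper name sum_noise_ns is hypothetical, and the parsing assumes the event format shown in the example.

```shell
#!/bin/sh
# Sketch: add up the durations reported by the *_noise tracepoints.
# sum_noise_ns is a hypothetical helper; it reads trace lines on stdin.
sum_noise_ns() {
    awk '/(nmi|irq|softirq|thread)_noise:/ {
             for (i = 1; i < NF; i++)
                 if ($i == "duration") sum += $(i + 1)
         }
         END { print sum + 0 }'
}

# The two events that belong to the reported noise sample.
sum=$(sum_noise_ns <<'EOF'
osnoise/8-961     [008] dNh.  5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
migration/8-54    [008] d...  5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
EOF
)

# 8723 ns is the duration reported by sample_threshold in the example.
echo "tracepoints: $sum ns, unaccounted: $((8723 - sum)) ns"
```

This prints "tracepoints: 5916 ns, unaccounted: 2807 ns"; the unaccounted part is the entry and exit overhead that only the measuring thread sees.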

Link: https://lkml.kernel.org/r/e649467042d60e7b62714c9c6751a56299d15119.1624372313.git.bristot@redhat.com

Cc: Phil Auld <pauld@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Kate Carcia <kcarcia@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
[
  Made the following functions static:
   trace_irqentry_callback()
   trace_irqexit_callback()
   trace_intel_irqentry_callback()
   trace_intel_irqexit_callback()

  Added to include/trace.h:
   osnoise_arch_register()
   osnoise_arch_unregister()

  Fixed define logic for LATENCY_FS_NOTIFY

  Reported-by: kernel test robot <lkp@intel.com>
]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Daniel Bristot de Oliveira 2021-06-22 16:42:27 +02:00 committed by Steven Rostedt (VMware)
parent 6880c987e4
commit bce29ac9ce
13 changed files with 2072 additions and 4 deletions

@ -23,6 +23,7 @@ Linux Tracing Technologies
histogram-design
boottime-trace
hwlat_detector
osnoise-tracer
intel_th
ring-buffer-design
stm

@ -0,0 +1,152 @@
==============
OSNOISE Tracer
==============
In the context of high-performance computing (HPC), the Operating System
Noise (*osnoise*) refers to the interference experienced by an application
due to activities inside the operating system. In the context of Linux,
NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the
system. Moreover, hardware-related jobs can also cause noise, for example,
via SMIs.
hwlat_detector is one of the tools used to identify the most complex
source of noise: *hardware noise*.
In a nutshell, the hwlat_detector creates a thread that runs
periodically for a given period. At the beginning of a period, the thread
disables interrupts and starts sampling. While running, the hwlatd
thread reads the time in a loop. As interrupts are disabled, threads,
IRQs, and SoftIRQs cannot interfere with the hwlatd thread. Hence, the
cause of any gap between two different reads of the time is either an
NMI or the hardware itself. At the end of the period, hwlatd enables
interrupts and reports the max observed gap between the reads. It also
prints an NMI occurrence counter. If the output does not report NMI
executions, the user can conclude that the hardware is the culprit for
the latency. hwlat detects the NMI execution by observing
the entry and exit of an NMI.
The osnoise tracer leverages the hwlat_detector by running a
similar loop with preemption, SoftIRQs and IRQs enabled, thus allowing
all the sources of *osnoise* during its execution. Using the same approach
as hwlat, osnoise takes note of the entry and exit points of any
source of interference, increasing a per-cpu interference counter. The
osnoise tracer also saves an interference counter for each source of
interference. The interference counter for NMI, IRQs, SoftIRQs, and
threads is increased anytime the tool observes these interferences' entry
events. When a noise happens without any interference from the operating
system level, the hardware noise counter increases, pointing to a
hardware-related noise. In this way, osnoise can account for any
source of interference. At the end of the period, the osnoise tracer
prints the sum of all noise, the max single noise, the percentage of CPU
available for the thread, and the counters for the noise sources.
Usage
-----
Write the ASCII text "osnoise" into the current_tracer file of the
tracing system (generally mounted at /sys/kernel/tracing).
For example::
[root@f32 ~]# cd /sys/kernel/tracing/
[root@f32 tracing]# echo osnoise > current_tracer
It is possible to follow the trace by reading the trace file::
[root@f32 tracing]# cat trace
# tracer: osnoise
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth                            MAX
#                              || /                                             SINGLE     Interference counters:
#                              ||||               RUNTIME      NOISE   % OF CPU  NOISE    +-----------------------------+
#           TASK-PID      CPU# ||||   TIMESTAMP    IN US       IN US  AVAILABLE  IN US     HW    NMI    IRQ   SIRQ THREAD
#              | |         |   ||||      |           |             |    |            |      |      |      |      |      |
           <...>-859     [000] ....    81.637220: 1000000        190  99.98100       9     18      0   1007     18      1
           <...>-860     [001] ....    81.638154: 1000000        656  99.93440      74     23      0   1006     16      3
           <...>-861     [002] ....    81.638193: 1000000       5675  99.43250     202      6      0   1013     25     21
           <...>-862     [003] ....    81.638242: 1000000        125  99.98750      45      1      0   1011     23      0
           <...>-863     [004] ....    81.638260: 1000000       1721  99.82790     168      7      0   1002     49     41
           <...>-864     [005] ....    81.638286: 1000000        263  99.97370      57      6      0   1006     26      2
           <...>-865     [006] ....    81.638302: 1000000        109  99.98910      21      3      0   1006     18      1
           <...>-866     [007] ....    81.638326: 1000000       7816  99.21840     107      8      0   1016     39     19
In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the
tracer prints a message at the end of each period for each CPU that is
running an osnoise/CPU thread. The osnoise specific fields report:
- The RUNTIME IN US reports the amount of time in microseconds that
the osnoise thread kept looping reading the time.
- The NOISE IN US reports the sum of noise in microseconds observed
by the osnoise tracer during the associated runtime.
- The % OF CPU AVAILABLE reports the percentage of CPU available for
the osnoise thread during the runtime window.
- The MAX SINGLE NOISE IN US reports the maximum single noise observed
during the runtime window.
- The Interference counters display how many times each of the
respective interferences happened during the runtime window.
Note that the example above shows a high number of HW noise samples.
The reason is that this sample was taken on a virtual machine,
and the host interference is detected as a hardware interference.
Tracer options
---------------------
The tracer has a set of options inside the osnoise directory:
- osnoise/cpus: CPUs on which an osnoise thread will execute.
- osnoise/period_us: the period of the osnoise thread.
- osnoise/runtime_us: how long an osnoise thread will look for noise.
- osnoise/stop_tracing_us: stop the system tracing if a single noise
higher than the configured value happens. Writing 0 disables this
option.
- osnoise/stop_tracing_total_us: stop the system tracing if total noise
higher than the configured value happens. Writing 0 disables this
option.
- tracing_thresh: the minimum delta between two time() reads to be
considered as noise, in us. When set to 0, the default value will
be used, which is currently 5 us.
Additional Tracing
------------------
In addition to the tracer, a set of tracepoints were added to
facilitate the identification of the osnoise source.
- osnoise:sample_threshold: printed anytime a noise is higher than
the configurable tolerance_ns.
- osnoise:nmi_noise: noise from NMI, including the duration.
- osnoise:irq_noise: noise from an IRQ, including the duration.
- osnoise:softirq_noise: noise from a SoftIRQ, including the
duration.
- osnoise:thread_noise: noise from a thread, including the duration.
Note that all the values are *net values*. For example, if while osnoise
is running, another thread preempts the osnoise thread, it will start a
thread_noise duration at the start. Then, an IRQ takes place, preempting
the thread_noise, starting an irq_noise. When the IRQ ends its execution,
it will compute its duration, and this duration will be subtracted from
the thread_noise, in such a way as to avoid the double accounting of the
IRQ execution. This logic is valid for all sources of noise.
Here is one example of the usage of these tracepoints::
osnoise/8-961 [008] d.h. 5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
osnoise/8-961 [008] dNh. 5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
migration/8-54 [008] d... 5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
osnoise/8-961 [008] .... 5789.858413: sample_threshold: start 5789.858404555 duration 8812 ns interferences 2
In this example, a noise sample of 8 microseconds was reported in the last
line, pointing to two interferences. Looking backward in the trace, the
two previous entries were about the migration thread running after a
timer IRQ execution. The first event is not part of the noise because
it took place one millisecond before.
It is worth noticing that the sum of the durations reported in the
tracepoints is smaller than the eight us reported in the sample_threshold.
The reason lies in the overhead of the entry and exit code that runs
before and after any interference execution. This justifies the dual
approach: measuring thread and tracing.

@ -102,6 +102,7 @@ obj-$(CONFIG_FUNCTION_TRACER) += ftrace_$(BITS).o
obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
obj-$(CONFIG_FTRACE_SYSCALLS) += ftrace.o
obj-$(CONFIG_X86_TSC) += trace_clock.o
obj-$(CONFIG_TRACING) += trace.o
obj-$(CONFIG_CRASH_CORE) += crash_core_$(BITS).o
obj-$(CONFIG_KEXEC_CORE) += machine_kexec_$(BITS).o
obj-$(CONFIG_KEXEC_CORE) += relocate_kernel_$(BITS).o crash.o

237
arch/x86/kernel/trace.c Normal file

@ -0,0 +1,237 @@
#include <asm/trace/irq_vectors.h>
#include <linux/trace.h>
#if defined(CONFIG_OSNOISE_TRACER) && defined(CONFIG_X86_LOCAL_APIC)
extern void osnoise_trace_irq_entry(int id);
extern void osnoise_trace_irq_exit(int id, const char *desc);
/*
* trace_intel_irq_entry - record intel specific IRQ entry
*/
static void trace_intel_irq_entry(void *data, int vector)
{
osnoise_trace_irq_entry(vector);
}
/*
* trace_intel_irq_exit - record intel specific IRQ exit
*/
static void trace_intel_irq_exit(void *data, int vector)
{
char *vector_desc = (char *) data;
osnoise_trace_irq_exit(vector, vector_desc);
}
/*
* osnoise_arch_register - Register intel specific IRQ entry and exit tracepoints
*/
int osnoise_arch_register(void)
{
int ret;
ret = register_trace_local_timer_entry(trace_intel_irq_entry, NULL);
if (ret)
goto out_err;
ret = register_trace_local_timer_exit(trace_intel_irq_exit, "local_timer");
if (ret)
goto out_timer_entry;
#ifdef CONFIG_X86_THERMAL_VECTOR
ret = register_trace_thermal_apic_entry(trace_intel_irq_entry, NULL);
if (ret)
goto out_timer_exit;
ret = register_trace_thermal_apic_exit(trace_intel_irq_exit, "thermal_apic");
if (ret)
goto out_thermal_entry;
#endif /* CONFIG_X86_THERMAL_VECTOR */
#ifdef CONFIG_X86_MCE_AMD
ret = register_trace_deferred_error_apic_entry(trace_intel_irq_entry, NULL);
if (ret)
goto out_thermal_exit;
ret = register_trace_deferred_error_apic_exit(trace_intel_irq_exit, "deferred_error");
if (ret)
goto out_deferred_entry;
#endif
#ifdef CONFIG_X86_MCE_THRESHOLD
ret = register_trace_threshold_apic_entry(trace_intel_irq_entry, NULL);
if (ret)
goto out_deferred_exit;
ret = register_trace_threshold_apic_exit(trace_intel_irq_exit, "threshold_apic");
if (ret)
goto out_threshold_entry;
#endif /* CONFIG_X86_MCE_THRESHOLD */
#ifdef CONFIG_SMP
ret = register_trace_call_function_single_entry(trace_intel_irq_entry, NULL);
if (ret)
goto out_threshold_exit;
ret = register_trace_call_function_single_exit(trace_intel_irq_exit,
"call_function_single");
if (ret)
goto out_call_function_single_entry;
ret = register_trace_call_function_entry(trace_intel_irq_entry, NULL);
if (ret)
goto out_call_function_single_exit;
ret = register_trace_call_function_exit(trace_intel_irq_exit, "call_function");
if (ret)
goto out_call_function_entry;
ret = register_trace_reschedule_entry(trace_intel_irq_entry, NULL);
if (ret)
goto out_call_function_exit;
ret = register_trace_reschedule_exit(trace_intel_irq_exit, "reschedule");
if (ret)
goto out_reschedule_entry;
#endif /* CONFIG_SMP */
#ifdef CONFIG_IRQ_WORK
ret = register_trace_irq_work_entry(trace_intel_irq_entry, NULL);
if (ret)
goto out_reschedule_exit;
ret = register_trace_irq_work_exit(trace_intel_irq_exit, "irq_work");
if (ret)
goto out_irq_work_entry;
#endif
ret = register_trace_x86_platform_ipi_entry(trace_intel_irq_entry, NULL);
if (ret)
goto out_irq_work_exit;
ret = register_trace_x86_platform_ipi_exit(trace_intel_irq_exit, "x86_platform_ipi");
if (ret)
goto out_x86_ipi_entry;
ret = register_trace_error_apic_entry(trace_intel_irq_entry, NULL);
if (ret)
goto out_x86_ipi_exit;
ret = register_trace_error_apic_exit(trace_intel_irq_exit, "error_apic");
if (ret)
goto out_error_apic_entry;
ret = register_trace_spurious_apic_entry(trace_intel_irq_entry, NULL);
if (ret)
goto out_error_apic_exit;
ret = register_trace_spurious_apic_exit(trace_intel_irq_exit, "spurious_apic");
if (ret)
goto out_spurious_apic_entry;
return 0;
out_spurious_apic_entry:
unregister_trace_spurious_apic_entry(trace_intel_irq_entry, NULL);
out_error_apic_exit:
unregister_trace_error_apic_exit(trace_intel_irq_exit, "error_apic");
out_error_apic_entry:
unregister_trace_error_apic_entry(trace_intel_irq_entry, NULL);
out_x86_ipi_exit:
unregister_trace_x86_platform_ipi_exit(trace_intel_irq_exit, "x86_platform_ipi");
out_x86_ipi_entry:
unregister_trace_x86_platform_ipi_entry(trace_intel_irq_entry, NULL);
out_irq_work_exit:
#ifdef CONFIG_IRQ_WORK
unregister_trace_irq_work_exit(trace_intel_irq_exit, "irq_work");
out_irq_work_entry:
unregister_trace_irq_work_entry(trace_intel_irq_entry, NULL);
out_reschedule_exit:
#endif
#ifdef CONFIG_SMP
unregister_trace_reschedule_exit(trace_intel_irq_exit, "reschedule");
out_reschedule_entry:
unregister_trace_reschedule_entry(trace_intel_irq_entry, NULL);
out_call_function_exit:
unregister_trace_call_function_exit(trace_intel_irq_exit, "call_function");
out_call_function_entry:
unregister_trace_call_function_entry(trace_intel_irq_entry, NULL);
out_call_function_single_exit:
unregister_trace_call_function_single_exit(trace_intel_irq_exit, "call_function_single");
out_call_function_single_entry:
unregister_trace_call_function_single_entry(trace_intel_irq_entry, NULL);
out_threshold_exit:
#endif
#ifdef CONFIG_X86_MCE_THRESHOLD
unregister_trace_threshold_apic_exit(trace_intel_irq_exit, "threshold_apic");
out_threshold_entry:
unregister_trace_threshold_apic_entry(trace_intel_irq_entry, NULL);
out_deferred_exit:
#endif
#ifdef CONFIG_X86_MCE_AMD
unregister_trace_deferred_error_apic_exit(trace_intel_irq_exit, "deferred_error");
out_deferred_entry:
unregister_trace_deferred_error_apic_entry(trace_intel_irq_entry, NULL);
out_thermal_exit:
#endif /* CONFIG_X86_MCE_AMD */
#ifdef CONFIG_X86_THERMAL_VECTOR
unregister_trace_thermal_apic_exit(trace_intel_irq_exit, "thermal_apic");
out_thermal_entry:
unregister_trace_thermal_apic_entry(trace_intel_irq_entry, NULL);
out_timer_exit:
#endif /* CONFIG_X86_THERMAL_VECTOR */
unregister_trace_local_timer_exit(trace_intel_irq_exit, "local_timer");
out_timer_entry:
unregister_trace_local_timer_entry(trace_intel_irq_entry, NULL);
out_err:
return -EINVAL;
}
void osnoise_arch_unregister(void)
{
unregister_trace_spurious_apic_exit(trace_intel_irq_exit, "spurious_apic");
unregister_trace_spurious_apic_entry(trace_intel_irq_entry, NULL);
unregister_trace_error_apic_exit(trace_intel_irq_exit, "error_apic");
unregister_trace_error_apic_entry(trace_intel_irq_entry, NULL);
unregister_trace_x86_platform_ipi_exit(trace_intel_irq_exit, "x86_platform_ipi");
unregister_trace_x86_platform_ipi_entry(trace_intel_irq_entry, NULL);
#ifdef CONFIG_IRQ_WORK
unregister_trace_irq_work_exit(trace_intel_irq_exit, "irq_work");
unregister_trace_irq_work_entry(trace_intel_irq_entry, NULL);
#endif
#ifdef CONFIG_SMP
unregister_trace_reschedule_exit(trace_intel_irq_exit, "reschedule");
unregister_trace_reschedule_entry(trace_intel_irq_entry, NULL);
unregister_trace_call_function_exit(trace_intel_irq_exit, "call_function");
unregister_trace_call_function_entry(trace_intel_irq_entry, NULL);
unregister_trace_call_function_single_exit(trace_intel_irq_exit, "call_function_single");
unregister_trace_call_function_single_entry(trace_intel_irq_entry, NULL);
#endif
#ifdef CONFIG_X86_MCE_THRESHOLD
unregister_trace_threshold_apic_exit(trace_intel_irq_exit, "threshold_apic");
unregister_trace_threshold_apic_entry(trace_intel_irq_entry, NULL);
#endif
#ifdef CONFIG_X86_MCE_AMD
unregister_trace_deferred_error_apic_exit(trace_intel_irq_exit, "deferred_error");
unregister_trace_deferred_error_apic_entry(trace_intel_irq_entry, NULL);
#endif
#ifdef CONFIG_X86_THERMAL_VECTOR
unregister_trace_thermal_apic_exit(trace_intel_irq_exit, "thermal_apic");
unregister_trace_thermal_apic_entry(trace_intel_irq_entry, NULL);
#endif /* CONFIG_X86_THERMAL_VECTOR */
unregister_trace_local_timer_exit(trace_intel_irq_exit, "local_timer");
unregister_trace_local_timer_entry(trace_intel_irq_entry, NULL);
}
#endif /* CONFIG_OSNOISE_TRACER && CONFIG_X86_LOCAL_APIC */

@ -7,12 +7,21 @@ extern bool trace_hwlat_callback_enabled;
extern void trace_hwlat_callback(bool enter);
#endif
#ifdef CONFIG_OSNOISE_TRACER
extern bool trace_osnoise_callback_enabled;
extern void trace_osnoise_callback(bool enter);
#endif
static inline void ftrace_nmi_enter(void)
{
#ifdef CONFIG_HWLAT_TRACER
if (trace_hwlat_callback_enabled)
trace_hwlat_callback(true);
#endif
#ifdef CONFIG_OSNOISE_TRACER
if (trace_osnoise_callback_enabled)
trace_osnoise_callback(true);
#endif
}
static inline void ftrace_nmi_exit(void)
@ -21,6 +30,10 @@ static inline void ftrace_nmi_exit(void)
if (trace_hwlat_callback_enabled)
trace_hwlat_callback(false);
#endif
#ifdef CONFIG_OSNOISE_TRACER
if (trace_osnoise_callback_enabled)
trace_osnoise_callback(false);
#endif
}
#endif /* _LINUX_FTRACE_IRQ_H */

@ -41,6 +41,11 @@ int trace_array_init_printk(struct trace_array *tr);
void trace_array_put(struct trace_array *tr);
struct trace_array *trace_array_get_by_name(const char *name);
int trace_array_destroy(struct trace_array *tr);
/* For osnoise tracer */
int osnoise_arch_register(void);
void osnoise_arch_unregister(void);
#endif /* CONFIG_TRACING */
#endif /* _LINUX_TRACE_H */

@ -0,0 +1,142 @@
/* SPDX-License-Identifier: GPL-2.0 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM osnoise
#if !defined(_OSNOISE_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
#define _OSNOISE_TRACE_H
#include <linux/tracepoint.h>
TRACE_EVENT(thread_noise,
TP_PROTO(struct task_struct *t, u64 start, u64 duration),
TP_ARGS(t, start, duration),
TP_STRUCT__entry(
__array( char, comm, TASK_COMM_LEN)
__field( u64, start )
__field( u64, duration)
__field( pid_t, pid )
),
TP_fast_assign(
memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
__entry->pid = t->pid;
__entry->start = start;
__entry->duration = duration;
),
TP_printk("%8s:%d start %llu.%09u duration %llu ns",
__entry->comm,
__entry->pid,
__print_ns_to_secs(__entry->start),
__print_ns_without_secs(__entry->start),
__entry->duration)
);
TRACE_EVENT(softirq_noise,
TP_PROTO(int vector, u64 start, u64 duration),
TP_ARGS(vector, start, duration),
TP_STRUCT__entry(
__field( u64, start )
__field( u64, duration)
__field( int, vector )
),
TP_fast_assign(
__entry->vector = vector;
__entry->start = start;
__entry->duration = duration;
),
TP_printk("%8s:%d start %llu.%09u duration %llu ns",
show_softirq_name(__entry->vector),
__entry->vector,
__print_ns_to_secs(__entry->start),
__print_ns_without_secs(__entry->start),
__entry->duration)
);
TRACE_EVENT(irq_noise,
TP_PROTO(int vector, const char *desc, u64 start, u64 duration),
TP_ARGS(vector, desc, start, duration),
TP_STRUCT__entry(
__field( u64, start )
__field( u64, duration)
__string( desc, desc )
__field( int, vector )
),
TP_fast_assign(
__assign_str(desc, desc);
__entry->vector = vector;
__entry->start = start;
__entry->duration = duration;
),
TP_printk("%s:%d start %llu.%09u duration %llu ns",
__get_str(desc),
__entry->vector,
__print_ns_to_secs(__entry->start),
__print_ns_without_secs(__entry->start),
__entry->duration)
);
TRACE_EVENT(nmi_noise,
TP_PROTO(u64 start, u64 duration),
TP_ARGS(start, duration),
TP_STRUCT__entry(
__field( u64, start )
__field( u64, duration)
),
TP_fast_assign(
__entry->start = start;
__entry->duration = duration;
),
TP_printk("start %llu.%09u duration %llu ns",
__print_ns_to_secs(__entry->start),
__print_ns_without_secs(__entry->start),
__entry->duration)
);
TRACE_EVENT(sample_threshold,
TP_PROTO(u64 start, u64 duration, u64 interference),
TP_ARGS(start, duration, interference),
TP_STRUCT__entry(
__field( u64, start )
__field( u64, duration)
__field( u64, interference)
),
TP_fast_assign(
__entry->start = start;
__entry->duration = duration;
__entry->interference = interference;
),
TP_printk("start %llu.%09u duration %llu ns interferences %llu",
__print_ns_to_secs(__entry->start),
__print_ns_without_secs(__entry->start),
__entry->duration,
__entry->interference)
);
#endif /* _OSNOISE_TRACE_H */
/* This part must be outside protection */
#include <trace/define_trace.h>

@ -356,6 +356,40 @@ config HWLAT_TRACER
file. Every time a latency is greater than tracing_thresh, it will
be recorded into the ring buffer.
config OSNOISE_TRACER
bool "OS Noise tracer"
select GENERIC_TRACER
help
In the context of high-performance computing (HPC), the Operating
System Noise (osnoise) refers to the interference experienced by an
application due to activities inside the operating system. In the
context of Linux, NMIs, IRQs, SoftIRQs, and any other system thread
can cause noise to the system. Moreover, hardware-related jobs can
also cause noise, for example, via SMIs.
The osnoise tracer leverages the hwlat_detector by running a similar
loop with preemption, SoftIRQs and IRQs enabled, thus allowing all
the sources of osnoise during its execution. The osnoise tracer takes
note of the entry and exit point of any source of interferences,
increasing a per-cpu interference counter. It saves an interference
counter for each source of interference. The interference counter for
NMI, IRQs, SoftIRQs, and threads is increased anytime the tool
observes these interferences' entry events. When a noise happens
without any interference from the operating system level, the
hardware noise counter increases, pointing to a hardware-related
noise. In this way, osnoise can account for any source of
interference. At the end of the period, the osnoise tracer prints
the sum of all noise, the max single noise, the percentage of CPU
available for the thread, and the counters for the noise sources.
In addition to the tracer, a set of tracepoints were added to
facilitate the identification of the osnoise source.
The output will appear in the trace and trace_pipe files.
To enable this tracer, echo "osnoise" into the current_tracer
file.
config MMIOTRACE
bool "Memory mapped IO tracing"
depends on HAVE_MMIOTRACE_SUPPORT && PCI

@ -58,6 +58,7 @@ obj-$(CONFIG_IRQSOFF_TRACER) += trace_irqsoff.o
obj-$(CONFIG_PREEMPT_TRACER) += trace_irqsoff.o
obj-$(CONFIG_SCHED_TRACER) += trace_sched_wakeup.o
obj-$(CONFIG_HWLAT_TRACER) += trace_hwlat.o
obj-$(CONFIG_OSNOISE_TRACER) += trace_osnoise.o
obj-$(CONFIG_NOP_TRACER) += trace_nop.o
obj-$(CONFIG_STACK_TRACER) += trace_stack.o
obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o

@ -44,6 +44,7 @@ enum trace_type {
TRACE_BLK,
TRACE_BPUTS,
TRACE_HWLAT,
TRACE_OSNOISE,
TRACE_RAW_DATA,
TRACE_FUNC_REPEATS,
@ -297,7 +298,8 @@ struct trace_array {
struct array_buffer max_buffer;
bool allocated_snapshot;
#endif
#if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)
#if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER) \
|| defined(CONFIG_OSNOISE_TRACER)
unsigned long max_latency;
#ifdef CONFIG_FSNOTIFY
struct dentry *d_max_latency;
@ -445,6 +447,7 @@ extern void __ftrace_bad_type(void);
IF_ASSIGN(var, ent, struct bprint_entry, TRACE_BPRINT); \
IF_ASSIGN(var, ent, struct bputs_entry, TRACE_BPUTS); \
IF_ASSIGN(var, ent, struct hwlat_entry, TRACE_HWLAT); \
IF_ASSIGN(var, ent, struct osnoise_entry, TRACE_OSNOISE);\
IF_ASSIGN(var, ent, struct raw_data_entry, TRACE_RAW_DATA);\
IF_ASSIGN(var, ent, struct trace_mmiotrace_rw, \
TRACE_MMIO_RW); \
@ -675,8 +678,8 @@ void update_max_tr_single(struct trace_array *tr,
struct task_struct *tsk, int cpu);
#endif /* CONFIG_TRACER_MAX_TRACE */
#if (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)) && \
defined(CONFIG_FSNOTIFY)
#if (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER) \
|| defined(CONFIG_OSNOISE_TRACER)) && defined(CONFIG_FSNOTIFY)
#define LATENCY_FS_NOTIFY
#endif

@ -360,3 +360,28 @@ FTRACE_ENTRY(func_repeats, func_repeats_entry,
__entry->count,
FUNC_REPEATS_GET_DELTA_TS(__entry))
);
FTRACE_ENTRY(osnoise, osnoise_entry,
TRACE_OSNOISE,
F_STRUCT(
__field( u64, noise )
__field( u64, runtime )
__field( u64, max_sample )
__field( unsigned int, hw_count )
__field( unsigned int, nmi_count )
__field( unsigned int, irq_count )
__field( unsigned int, softirq_count )
__field( unsigned int, thread_count )
),
F_printk("noise:%llu\tmax_sample:%llu\thw:%u\tnmi:%u\tirq:%u\tsoftirq:%u\tthread:%u\n",
__entry->noise,
__entry->max_sample,
__entry->hw_count,
__entry->nmi_count,
__entry->irq_count,
__entry->softirq_count,
__entry->thread_count)
);

1384
kernel/trace/trace_osnoise.c Normal file

File diff suppressed because it is too large

@ -1202,7 +1202,6 @@ trace_hwlat_print(struct trace_iterator *iter, int flags,
return trace_handle_return(s);
}
static enum print_line_t
trace_hwlat_raw(struct trace_iterator *iter, int flags,
struct trace_event *event)
@ -1232,6 +1231,76 @@ static struct trace_event trace_hwlat_event = {
.funcs = &trace_hwlat_funcs,
};
/* TRACE_OSNOISE */
static enum print_line_t
trace_osnoise_print(struct trace_iterator *iter, int flags,
struct trace_event *event)
{
struct trace_entry *entry = iter->ent;
struct trace_seq *s = &iter->seq;
struct osnoise_entry *field;
u64 ratio, ratio_dec;
u64 net_runtime;
trace_assign_type(field, entry);
/*
* compute the available % of cpu time.
*/
net_runtime = field->runtime - field->noise;
ratio = net_runtime * 10000000;
do_div(ratio, field->runtime);
ratio_dec = do_div(ratio, 100000);
trace_seq_printf(s, "%llu %10llu %3llu.%05llu %7llu",
field->runtime,
field->noise,
ratio, ratio_dec,
field->max_sample);
trace_seq_printf(s, " %6u", field->hw_count);
trace_seq_printf(s, " %6u", field->nmi_count);
trace_seq_printf(s, " %6u", field->irq_count);
trace_seq_printf(s, " %6u", field->softirq_count);
trace_seq_printf(s, " %6u", field->thread_count);
trace_seq_putc(s, '\n');
return trace_handle_return(s);
}
static enum print_line_t
trace_osnoise_raw(struct trace_iterator *iter, int flags,
struct trace_event *event)
{
struct osnoise_entry *field;
struct trace_seq *s = &iter->seq;
trace_assign_type(field, iter->ent);
trace_seq_printf(s, "%llu %llu %llu %u %u %u %u %u\n",
field->runtime,
field->noise,
field->max_sample,
field->hw_count,
field->nmi_count,
field->irq_count,
field->softirq_count,
field->thread_count);
return trace_handle_return(s);
}
static struct trace_event_functions trace_osnoise_funcs = {
.trace = trace_osnoise_print,
.raw = trace_osnoise_raw,
};
static struct trace_event trace_osnoise_event = {
.type = TRACE_OSNOISE,
.funcs = &trace_osnoise_funcs,
};
/* TRACE_BPUTS */
static enum print_line_t
trace_bputs_print(struct trace_iterator *iter, int flags,
@ -1442,6 +1511,7 @@ static struct trace_event *events[] __initdata = {
&trace_bprint_event,
&trace_print_event,
&trace_hwlat_event,
&trace_osnoise_event,
&trace_raw_data_event,
&trace_func_repeats_event,
NULL