linux-stable/tools/perf/Documentation/perf-bench.txt

261 lines
5.4 KiB
Plaintext
Raw Normal View History

perf-bench(1)
=============
NAME
----
perf-bench - General framework for benchmark suites
SYNOPSIS
--------
[verse]
'perf bench' [<common options>] <subsystem> <suite> [<options>]
DESCRIPTION
-----------
This 'perf bench' command is a general framework for benchmark suites.
COMMON OPTIONS
--------------
-r::
--repeat=::
Specify number of times to repeat the run (default 10).
-f::
--format=::
Specify format style.
Current available format styles are:
'default'::
Default style. This is mainly for human reading.
---------------------
% perf bench sched pipe # with no style specified
(executing 1000000 pipe operations between two tasks)
Total time:5.855 sec
5.855061 usecs/op
170792 ops/sec
---------------------
'simple'::
This simple style is friendly for automated
processing by scripts.
---------------------
% perf bench --format=simple sched pipe # specified simple
5.988
---------------------
SUBSYSTEM
---------
'sched'::
Scheduler and IPC mechanisms.
perf bench: Add basic syscall benchmark The usefulness of having a standard way of testing syscall performance has come up from time to time[0]. Furthermore, some of our testing machinery (such as 'mmtests') already makes use of a simplified version of the microbenchmark. This patch mainly takes the same idea to measure syscall throughput compatible with 'perf-bench' via getppid(2), yet without any of the additional template stuff from Ingo's version (based on numa.c). The code is identical to what mmtests uses. [0] https://lore.kernel.org/lkml/20160201074156.GA27156@gmail.com/ Committer notes: Add mising stdlib.h and unistd.h to get the prototypes for exit() and getppid(). Committer testing: $ perf bench Usage: perf bench [<common options>] <collection> <benchmark> [<options>] # List of all available benchmark collections: sched: Scheduler and IPC benchmarks syscall: System call benchmarks mem: Memory access benchmarks numa: NUMA scheduling and MM benchmarks futex: Futex stressing benchmarks epoll: Epoll stressing benchmarks internals: Perf-internals benchmarks all: All benchmarks $ $ perf bench syscall # List of available benchmarks for collection 'syscall': basic: Benchmark for basic getppid(2) calls all: Run all syscall benchmarks $ perf bench syscall basic # Running 'syscall/basic' benchmark: # Executed 10000000 getppid() calls Total time: 3.679 [sec] 0.367957 usecs/op 2717708 ops/sec $ perf bench syscall all # Running syscall/basic benchmark... # Executed 10000000 getppid() calls Total time: 3.644 [sec] 0.364456 usecs/op 2743815 ops/sec $ Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lore.kernel.org/lkml/20190308181747.l36zqz2avtivrr3c@linux-r8p5 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-03-08 18:17:47 +00:00
'syscall'::
System call performance (throughput).
'mem'::
Memory access performance.
'numa'::
NUMA scheduling and MM benchmarks.
'futex'::
Futex stressing benchmarks.
perf bench: Add epoll parallel epoll_wait benchmark This program benchmarks concurrent epoll_wait(2) for file descriptors that are monitored with with EPOLLIN along various semantics, by a single epoll instance. Such conditions can be found when using single/combined or multiple queuing when load balancing. Each thread has a number of private, nonblocking file descriptors, referred to as fdmap. A writer thread will constantly be writing to the fdmaps of all threads, minimizing each threads's chances of epoll_wait not finding any ready read events and blocking as this is not what we want to stress. Full details in the start of the C file. Committer testing: # perf bench Usage: perf bench [<common options>] <collection> <benchmark> [<options>] # List of all available benchmark collections: sched: Scheduler and IPC benchmarks mem: Memory access benchmarks numa: NUMA scheduling and MM benchmarks futex: Futex stressing benchmarks epoll: Epoll stressing benchmarks all: All benchmarks # perf bench epoll # List of available benchmarks for collection 'epoll': wait: Benchmark epoll concurrent epoll_waits all: Run all futex benchmarks # perf bench epoll wait # Running 'epoll/wait' benchmark: Run summary [PID 19295]: 3 threads monitoring on 64 file-descriptors for 8 secs. [thread 0] fdmap: 0xdaa650 ... 0xdaa74c [ 328241 ops/sec ] [thread 1] fdmap: 0xdaa900 ... 0xdaa9fc [ 351695 ops/sec ] [thread 2] fdmap: 0xdaabb0 ... 0xdaacac [ 381423 ops/sec ] Averaged 353786 operations/sec (+- 4.35%), total secs = 8 # Committer notes: Fix the build on debian:experimental-x-mips, debian:experimental-x-mipsel and others: CC /tmp/build/perf/bench/epoll-wait.o bench/epoll-wait.c: In function 'writerfn': bench/epoll-wait.c:399:12: error: format '%ld' expects argument of type 'long int', but argument 2 has type 'size_t' {aka 'unsigned int'} [-Werror=format=] printinfo("exiting writer-thread (total full-loops: %ld)\n", iter); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~ bench/epoll-wait.c:86:31: note: in definition of macro 'printinfo' do { if (__verbose) { printf(fmt, ## arg); fflush(stdout); } } while (0) ^~~ cc1: all warnings being treated as errors Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Jason Baron <jbaron@akamai.com> <jbaron@akamai.com> Link: http://lkml.kernel.org/r/20181106152226.20883-2-dave@stgolabs.net Link: http://lkml.kernel.org/r/20181106182349.thdkpvshkna5vd7o@linux-r8p5> [ Applied above fixup as per Davidlohr's request ] [ Use inttypes.h to print rlim_t fields, fixing the build on Alpine Linux / musl libc ] [ Check if eventfd() is available, i.e. if HAVE_EVENTFD is defined ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-11-06 15:22:25 +00:00
'epoll'::
Eventpoll (epoll) stressing benchmarks.
'internals'::
Benchmark internal perf functionality.
'uprobe'::
Benchmark overhead of uprobe + BPF.
'all'::
All benchmark subsystems.
SUITES FOR 'sched'
~~~~~~~~~~~~~~~~~~
*messaging*::
Suite for evaluating performance of scheduler and IPC mechanisms.
Based on hackbench by Rusty Russell.
Options of *messaging*
^^^^^^^^^^^^^^^^^^^^^^
-p::
--pipe::
Use pipe() instead of socketpair()
-t::
--thread::
Be multi thread instead of multi process
-g::
--group=::
Specify number of groups
-l::
--nr_loops=::
Specify number of loops
Example of *messaging*
^^^^^^^^^^^^^^^^^^^^^^
---------------------
% perf bench sched messaging # run with default
options (20 sender and receiver processes per group)
(10 groups == 400 processes run)
Total time:0.308 sec
% perf bench sched messaging -t -g 20 # be multi-thread, with 20 groups
(20 sender and receiver threads per group)
(20 groups == 800 threads run)
Total time:0.582 sec
---------------------
*pipe*::
Suite for pipe() system call.
Based on pipe-test-1m.c by Ingo Molnar.
Options of *pipe*
^^^^^^^^^^^^^^^^^
-l::
--loop=::
Specify number of loops.
-G::
--cgroups=::
Names of cgroups for sender and receiver, separated by a comma.
This is useful to check cgroup context switching overhead.
Note that perf doesn't create nor delete the cgroups, so users should
make sure that the cgroups exist and are accessible before use.
Example of *pipe*
^^^^^^^^^^^^^^^^^
---------------------
% perf bench sched pipe
(executing 1000000 pipe operations between two tasks)
Total time:8.091 sec
8.091833 usecs/op
123581 ops/sec
% perf bench sched pipe -l 1000 # loop 1000
(executing 1000 pipe operations between two tasks)
Total time:0.016 sec
16.948000 usecs/op
59004 ops/sec
% perf bench sched pipe -G AAA,BBB
(executing 1000000 pipe operations between cgroups)
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes
Total time: 6.886 [sec]
6.886208 usecs/op
145217 ops/sec
---------------------
perf bench: Add basic syscall benchmark The usefulness of having a standard way of testing syscall performance has come up from time to time[0]. Furthermore, some of our testing machinery (such as 'mmtests') already makes use of a simplified version of the microbenchmark. This patch mainly takes the same idea to measure syscall throughput compatible with 'perf-bench' via getppid(2), yet without any of the additional template stuff from Ingo's version (based on numa.c). The code is identical to what mmtests uses. [0] https://lore.kernel.org/lkml/20160201074156.GA27156@gmail.com/ Committer notes: Add mising stdlib.h and unistd.h to get the prototypes for exit() and getppid(). Committer testing: $ perf bench Usage: perf bench [<common options>] <collection> <benchmark> [<options>] # List of all available benchmark collections: sched: Scheduler and IPC benchmarks syscall: System call benchmarks mem: Memory access benchmarks numa: NUMA scheduling and MM benchmarks futex: Futex stressing benchmarks epoll: Epoll stressing benchmarks internals: Perf-internals benchmarks all: All benchmarks $ $ perf bench syscall # List of available benchmarks for collection 'syscall': basic: Benchmark for basic getppid(2) calls all: Run all syscall benchmarks $ perf bench syscall basic # Running 'syscall/basic' benchmark: # Executed 10000000 getppid() calls Total time: 3.679 [sec] 0.367957 usecs/op 2717708 ops/sec $ perf bench syscall all # Running syscall/basic benchmark... # Executed 10000000 getppid() calls Total time: 3.644 [sec] 0.364456 usecs/op 2743815 ops/sec $ Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lore.kernel.org/lkml/20190308181747.l36zqz2avtivrr3c@linux-r8p5 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-03-08 18:17:47 +00:00
SUITES FOR 'syscall'
~~~~~~~~~~~~~~~~~~
*basic*::
Suite for evaluating performance of core system call throughput (both usecs/op and ops/sec metrics).
This uses a single thread simply doing getppid(2), which is a simple syscall where the result is not
cached by glibc.
SUITES FOR 'mem'
~~~~~~~~~~~~~~~~
*memcpy*::
Suite for evaluating performance of simple memory copy in various ways.
Options of *memcpy*
^^^^^^^^^^^^^^^^^^^
-l::
--size::
Specify size of memory to copy (default: 1MB).
Available units are B, KB, MB, GB and TB (case insensitive).
-f::
--function::
Specify function to copy (default: default).
Available functions are depend on the architecture.
On x86-64, x86-64-unrolled, x86-64-movsq and x86-64-movsb are supported.
-l::
--nr_loops::
Repeat memcpy invocation this number of times.
-c::
--cycles::
Use perf's cpu-cycles event instead of gettimeofday syscall.
*memset*::
Suite for evaluating performance of simple memory set in various ways.
Options of *memset*
^^^^^^^^^^^^^^^^^^^
-l::
--size::
Specify size of memory to set (default: 1MB).
Available units are B, KB, MB, GB and TB (case insensitive).
-f::
--function::
Specify function to set (default: default).
Available functions are depend on the architecture.
On x86-64, x86-64-unrolled, x86-64-stosq and x86-64-stosb are supported.
-l::
--nr_loops::
Repeat memset invocation this number of times.
-c::
--cycles::
Use perf's cpu-cycles event instead of gettimeofday syscall.
SUITES FOR 'numa'
~~~~~~~~~~~~~~~~~
*mem*::
Suite for evaluating NUMA workloads.
SUITES FOR 'futex'
~~~~~~~~~~~~~~~~~~
*hash*::
Suite for evaluating hash tables.
*wake*::
Suite for evaluating wake calls.
*wake-parallel*::
Suite for evaluating parallel wake calls.
*requeue*::
Suite for evaluating requeue calls.
*lock-pi*::
Suite for evaluating futex lock_pi calls.
perf bench: Add epoll parallel epoll_wait benchmark This program benchmarks concurrent epoll_wait(2) for file descriptors that are monitored with with EPOLLIN along various semantics, by a single epoll instance. Such conditions can be found when using single/combined or multiple queuing when load balancing. Each thread has a number of private, nonblocking file descriptors, referred to as fdmap. A writer thread will constantly be writing to the fdmaps of all threads, minimizing each threads's chances of epoll_wait not finding any ready read events and blocking as this is not what we want to stress. Full details in the start of the C file. Committer testing: # perf bench Usage: perf bench [<common options>] <collection> <benchmark> [<options>] # List of all available benchmark collections: sched: Scheduler and IPC benchmarks mem: Memory access benchmarks numa: NUMA scheduling and MM benchmarks futex: Futex stressing benchmarks epoll: Epoll stressing benchmarks all: All benchmarks # perf bench epoll # List of available benchmarks for collection 'epoll': wait: Benchmark epoll concurrent epoll_waits all: Run all futex benchmarks # perf bench epoll wait # Running 'epoll/wait' benchmark: Run summary [PID 19295]: 3 threads monitoring on 64 file-descriptors for 8 secs. [thread 0] fdmap: 0xdaa650 ... 0xdaa74c [ 328241 ops/sec ] [thread 1] fdmap: 0xdaa900 ... 0xdaa9fc [ 351695 ops/sec ] [thread 2] fdmap: 0xdaabb0 ... 0xdaacac [ 381423 ops/sec ] Averaged 353786 operations/sec (+- 4.35%), total secs = 8 # Committer notes: Fix the build on debian:experimental-x-mips, debian:experimental-x-mipsel and others: CC /tmp/build/perf/bench/epoll-wait.o bench/epoll-wait.c: In function 'writerfn': bench/epoll-wait.c:399:12: error: format '%ld' expects argument of type 'long int', but argument 2 has type 'size_t' {aka 'unsigned int'} [-Werror=format=] printinfo("exiting writer-thread (total full-loops: %ld)\n", iter); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~ bench/epoll-wait.c:86:31: note: in definition of macro 'printinfo' do { if (__verbose) { printf(fmt, ## arg); fflush(stdout); } } while (0) ^~~ cc1: all warnings being treated as errors Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Jason Baron <jbaron@akamai.com> <jbaron@akamai.com> Link: http://lkml.kernel.org/r/20181106152226.20883-2-dave@stgolabs.net Link: http://lkml.kernel.org/r/20181106182349.thdkpvshkna5vd7o@linux-r8p5> [ Applied above fixup as per Davidlohr's request ] [ Use inttypes.h to print rlim_t fields, fixing the build on Alpine Linux / musl libc ] [ Check if eventfd() is available, i.e. if HAVE_EVENTFD is defined ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-11-06 15:22:25 +00:00
SUITES FOR 'epoll'
~~~~~~~~~~~~~~~~~~
*wait*::
Suite for evaluating concurrent epoll_wait calls.
perf bench: Add epoll_ctl(2) benchmark Benchmark the various operations allowed for epoll_ctl(2). The idea is to concurrently stress a single epoll instance doing add/mod/del operations. Committer testing: # perf bench epoll ctl # Running 'epoll/ctl' benchmark: Run summary [PID 20344]: 4 threads doing epoll_ctl ops 64 file-descriptors for 8 secs. [thread 0] fdmap: 0x21a46b0 ... 0x21a47ac [ add: 1680960 ops; mod: 1680960 ops; del: 1680960 ops ] [thread 1] fdmap: 0x21a4960 ... 0x21a4a5c [ add: 1685440 ops; mod: 1685440 ops; del: 1685440 ops ] [thread 2] fdmap: 0x21a4c10 ... 0x21a4d0c [ add: 1674368 ops; mod: 1674368 ops; del: 1674368 ops ] [thread 3] fdmap: 0x21a4ec0 ... 0x21a4fbc [ add: 1677568 ops; mod: 1677568 ops; del: 1677568 ops ] Averaged 1679584 ADD operations (+- 0.14%) Averaged 1679584 MOD operations (+- 0.14%) Averaged 1679584 DEL operations (+- 0.14%) # Lets measure those calls with 'perf trace' to get a glympse at what this benchmark is doing in terms of syscalls: # perf trace -m32768 -s perf bench epoll ctl # Running 'epoll/ctl' benchmark: Run summary [PID 20405]: 4 threads doing epoll_ctl ops 64 file-descriptors for 8 secs. [thread 0] fdmap: 0x21764e0 ... 0x21765dc [ add: 1100480 ops; mod: 1100480 ops; del: 1100480 ops ] [thread 1] fdmap: 0x2176790 ... 0x217688c [ add: 1250176 ops; mod: 1250176 ops; del: 1250176 ops ] [thread 2] fdmap: 0x2176a40 ... 0x2176b3c [ add: 1022464 ops; mod: 1022464 ops; del: 1022464 ops ] [thread 3] fdmap: 0x2176cf0 ... 0x2176dec [ add: 705472 ops; mod: 705472 ops; del: 705472 ops ] Averaged 1019648 ADD operations (+- 11.27%) Averaged 1019648 MOD operations (+- 11.27%) Averaged 1019648 DEL operations (+- 11.27%) Summary of events: epoll-ctl (20405), 1264 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ eventfd2 256 9.514 0.001 0.037 5.243 68.00% clone 4 1.245 0.204 0.311 0.531 24.13% mprotect 66 0.345 0.002 0.005 0.021 7.43% openat 45 0.313 0.004 0.007 0.073 21.93% mmap 88 0.302 0.002 0.003 0.013 5.02% futex 4 0.160 0.002 0.040 0.140 83.43% sched_setaffinity 4 0.124 0.005 0.031 0.070 49.39% read 44 0.103 0.001 0.002 0.013 15.54% fstat 40 0.052 0.001 0.001 0.003 5.43% close 39 0.039 0.001 0.001 0.001 1.48% stat 9 0.034 0.003 0.004 0.006 7.30% access 3 0.023 0.007 0.008 0.008 4.25% open 2 0.021 0.008 0.011 0.013 22.60% getdents 4 0.019 0.001 0.005 0.009 37.15% write 2 0.013 0.004 0.007 0.009 38.48% munmap 1 0.010 0.010 0.010 0.010 0.00% brk 3 0.006 0.001 0.002 0.003 26.34% rt_sigprocmask 2 0.004 0.001 0.002 0.003 43.95% rt_sigaction 3 0.004 0.001 0.001 0.002 16.07% prlimit64 3 0.004 0.001 0.001 0.001 5.39% prctl 1 0.003 0.003 0.003 0.003 0.00% epoll_create 1 0.003 0.003 0.003 0.003 0.00% lseek 2 0.002 0.001 0.001 0.001 11.42% sched_getaffinity 1 0.002 0.002 0.002 0.002 0.00% arch_prctl 1 0.002 0.002 0.002 0.002 0.00% set_tid_address 1 0.001 0.001 0.001 0.001 0.00% getpid 1 0.001 0.001 0.001 0.001 0.00% set_robust_list 1 0.001 0.001 0.001 0.001 0.00% execve 1 0.000 0.000 0.000 0.000 0.00% epoll-ctl (20406), 1245480 events, 14.6% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ epoll_ctl 619511 1034.927 0.001 0.002 6.691 0.67% nanosleep 3226 616.114 0.006 0.191 10.376 7.57% futex 2 11.336 0.002 5.668 11.334 99.97% set_robust_list 1 0.001 0.001 0.001 0.001 0.00% clone 1 0.000 0.000 0.000 0.000 0.00% epoll-ctl (20407), 1243151 events, 14.5% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ epoll_ctl 618350 1042.181 0.001 0.002 2.512 0.40% nanosleep 3220 366.261 0.012 0.114 18.162 9.59% futex 4 5.463 0.001 1.366 5.427 99.12% set_robust_list 1 0.002 0.002 0.002 0.002 0.00% epoll-ctl (20408), 1801690 events, 21.1% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ epoll_ctl 896174 1540.581 0.001 0.002 6.987 0.74% nanosleep 4667 783.393 0.006 0.168 10.419 7.10% futex 2 4.682 0.002 2.341 4.681 99.93% set_robust_list 1 0.002 0.002 0.002 0.002 0.00% clone 1 0.000 0.000 0.000 0.000 0.00% epoll-ctl (20409), 4254890 events, 49.8% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ epoll_ctl 2116416 3768.097 0.001 0.002 9.956 0.41% nanosleep 11023 1141.778 0.006 0.104 9.447 4.95% futex 3 0.037 0.002 0.012 0.029 70.50% set_robust_list 1 0.008 0.008 0.008 0.008 0.00% madvise 1 0.005 0.005 0.005 0.005 0.00% clone 1 0.000 0.000 0.000 0.000 0.00% # Committer notes: Fix build on fedora:24-x-ARC-uClibc, debian:experimental-x-mips, debian:experimental-x-mipsel, ubuntu:16.04-x-arm and ubuntu:16.04-x-powerpc CC /tmp/build/perf/bench/epoll-ctl.o bench/epoll-ctl.c: In function 'init_fdmaps': bench/epoll-ctl.c:214:16: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare] for (i = 0; i < nfds; i+=inc) { ^ bench/epoll-ctl.c: In function 'bench_epoll_ctl': bench/epoll-ctl.c:377:16: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare] for (i = 0; i < nthreads; i++) { ^ bench/epoll-ctl.c:388:16: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare] for (i = 0; i < nthreads; i++) { ^ cc1: all warnings being treated as errors Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Jason Baron <jbaron@akamai.com> Link: http://lkml.kernel.org/r/20181106152226.20883-3-dave@stgolabs.net [ Use inttypes.h to print rlim_t fields, fixing the build on Alpine Linux / musl libc ] [ Check if eventfd() is available, i.e. if HAVE_EVENTFD is defined ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-11-06 15:22:26 +00:00
*ctl*::
Suite for evaluating multiple epoll_ctl calls.
SUITES FOR 'internals'
~~~~~~~~~~~~~~~~~~~~~~
*synthesize*::
Suite for evaluating perf's event synthesis performance.
SEE ALSO
--------
linkperf:perf[1]