Commit Graph

27383 Commits

Author SHA1 Message Date
David S. Miller 01adc4851a Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Minor conflict, a CHECK was placed into an if() statement
in net-next, whilst a newline was added to that CHECK
call in 'net'.  Thanks to Daniel for the merge resolution.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-07 23:35:08 -04:00
Daniel Borkmann e94fa1d931 bpf, xskmap: fix crash in xsk_map_alloc error path handling
If bpf_map_precharge_memlock() did not fail, then we set err to zero.
However, any subsequent failure from either alloc_percpu() or the
bpf_map_area_alloc() will return ERR_PTR(0) which in find_and_alloc_map()
will cause NULL pointer deref.

In devmap we have the convention that we return -EINVAL on page count
overflow, so keep the same logic here and just set err to -ENOMEM
after successful bpf_map_precharge_memlock().

Fixes: fbfc504a24 ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Björn Töpel <bjorn.topel@intel.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-04 14:55:54 -07:00
Jakub Kicinski ab7f5bf092 bpf: fix references to free_bpf_prog_info() in comments
Comments in the verifier refer to free_bpf_prog_info() which
seems to have never existed in tree.  Replace it with
free_used_maps().

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04 23:41:04 +02:00
Jakub Kicinski f4e3ec0d57 bpf: replace map pointer loads before calling into offloads
Offloads may find host map pointers more useful than map fds.
Map pointers can be used to identify the map, while fds are
only valid within the context of loading process.

Jump to skip_full_check on error in case verifier log overflow
has to be handled (replace_map_fd_with_map_ptr() prints to the
log, driver prep may do that too in the future).

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04 23:41:03 +02:00
Jakub Kicinski 6cb5fb3891 bpf: export bpf_event_output()
bpf_event_output() is useful for offloads to add events to BPF
event rings, export it.  Note that export is placed near the stub
since tracing is optional and kernel/bpf/core.c is always going
to be built.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04 23:41:03 +02:00
Jakub Kicinski 630a4d3874 nfp: bpf: record offload neutral maps in the driver
For asynchronous events originating from the device, like perf event
output, we need to be able to make sure that objects being referred
to by the FW message are valid on the host.  FW events can get queued
and reordered.  Even if we had a FW message "barrier" we should still
protect ourselves from bogus FW output.

Add a reverse-mapping hash table and record in it all raw map pointers
FW may refer to.  Only record neutral maps, i.e. perf event arrays.
These are currently the only objects FW can refer to.  Use RCU protection
on the read side, update side is under RTNL.

Since program vs map destruction order is slightly painful for offload
simply take an extra reference on all the recorded maps to make sure
they don't disappear.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04 23:41:03 +02:00
Jakub Kicinski 0cd3cbed3c bpf: offload: allow offloaded programs to use perf event arrays
BPF_MAP_TYPE_PERF_EVENT_ARRAY is special as far as offload goes.
The map only holds glue to perf ring, not actual data.  Allow
non-offloaded perf event arrays to be used in offloaded programs.
Offload driver can extract the events from HW and put them in
the map for user space to retrieve.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04 23:41:03 +02:00
David S. Miller a7b15ab887 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Overlapping changes in selftests Makefile.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04 09:58:56 -04:00
Jiong Wang 4cb3d99c84 bpf: add faked "ending" subprog
There are quite a few code snippet like the following in verifier:

       subprog_start = 0;
       if (env->subprog_cnt == cur_subprog + 1)
               subprog_end = insn_cnt;
       else
               subprog_end = env->subprog_info[cur_subprog + 1].start;

The reason is there is no marker in subprog_info array to tell the end of
it.

We could resolve this issue by introducing a faked "ending" subprog.
The special "ending" subprog is with "insn_cnt" as start offset, so it is
serving as the end mark whenever we iterate over all subprogs.

Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04 11:58:36 +02:00
Jiong Wang 9c8105bd44 bpf: centre subprog information fields
It is better to centre all subprog information fields into one structure.
This structure could later serve as function node in call graph.

Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04 11:58:36 +02:00
Jiong Wang f910cefa32 bpf: unify main prog and subprog
Currently, verifier treat main prog and subprog differently. All subprogs
detected are kept in env->subprog_starts while main prog is not kept there.
Instead, main prog is implicitly defined as the prog start at 0.

There is actually no difference between main prog and subprog, it is better
to unify them, and register all progs detected into env->subprog_starts.

This could also help simplifying some code logic.

Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04 11:58:35 +02:00
Linus Torvalds e523a2562a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Various sockmap fixes from John Fastabend (pinned map handling,
    blocking in recvmsg, double page put, error handling during redirect
    failures, etc.)

 2) Fix dead code handling in x86-64 JIT, from Gianluca Borello.

 3) Missing device put in RDS IB code, from Dag Moxnes.

 4) Don't process fast open during repair mode in TCP< from Yuchung
    Cheng.

 5) Move address/port comparison fixes in SCTP, from Xin Long.

 6) Handle add a bond slave's master into a bridge properly, from
    Hangbin Liu.

 7) IPv6 multipath code can operate on unitialized memory due to an
    assumption that the icmp header is in the linear SKB area. Fix from
    Eric Dumazet.

 8) Don't invoke do_tcp_sendpages() recursively via TLS, from Dave
    Watson.

9) Fix memory leaks in x86-64 JIT, from Daniel Borkmann.

10) RDS leaks kernel memory to userspace, from Eric Dumazet.

11) DCCP can invoke a tasklet on a freed socket, take a refcount. Also
    from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (78 commits)
  dccp: fix tasklet usage
  smc: fix sendpage() call
  net/smc: handle unregistered buffers
  net/smc: call consolidation
  qed: fix spelling mistake: "offloded" -> "offloaded"
  net/mlx5e: fix spelling mistake: "loobpack" -> "loopback"
  tcp: restore autocorking
  rds: do not leak kernel memory to user land
  qmi_wwan: do not steal interfaces from class drivers
  ipv4: fix fnhe usage by non-cached routes
  bpf: sockmap, fix error handling in redirect failures
  bpf: sockmap, zero sg_size on error when buffer is released
  bpf: sockmap, fix scatterlist update on error path in send with apply
  net_sched: fq: take care of throttled flows before reuse
  ipv6: Revert "ipv6: Allow non-gateway ECMP for IPv6"
  bpf, x64: fix memleak when not converging on calls
  bpf, x64: fix memleak when not converging after image
  net/smc: restrict non-blocking connect finish
  8139too: Use disable_irq_nosync() in rtl8139_poll_controller()
  sctp: fix the issue that the cookie-ack with auth can't get processed
  ...
2018-05-03 18:57:03 -10:00
Daniel Borkmann e0cea7ce98 bpf: implement ld_abs/ld_ind in native bpf
The main part of this work is to finally allow removal of LD_ABS
and LD_IND from the BPF core by reimplementing them through native
eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF and
keeping them around in native eBPF caused way more trouble than
actually worth it. To just list some of the security issues in
the past:

  * fdfaf64e75 ("x86: bpf_jit: support negative offsets")
  * 35607b02db ("sparc: bpf_jit: fix loads from negative offsets")
  * e0ee9c1215 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
  * 07aee94394 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call")
  * 6d59b7dbf7 ("bpf, s390x: do not reload skb pointers in non-skb context")
  * 87338c8e2c ("bpf, ppc64: do not reload skb pointers in non-skb context")

For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy
these days due to their limitations and more efficient/flexible
alternatives that have been developed over time such as direct
packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a
register, the load happens in host endianness and its exception
handling can yield unexpected behavior. The latter is explained
in depth in f6b1b3bf0d ("bpf: fix subprog verifier bypass by
div/mod by 0 exception") with similar cases of exceptions we had.
In native eBPF more recent program types will disable LD_ABS/LD_IND
altogether through may_access_skb() in verifier, and given the
limitations in terms of exception handling, it's also disabled
in programs that use BPF to BPF calls.

In terms of cBPF, the LD_ABS/LD_IND is used in networking programs
to access packet data. It is not used in seccomp-BPF but programs
that use it for socket filtering or reuseport for demuxing with
cBPF. This is mostly relevant for applications that have not yet
migrated to native eBPF.

The main complexity and source of bugs in LD_ABS/LD_IND is coming
from their implementation in the various JITs. Most of them keep
the model around from cBPF times by implementing a fastpath written
in asm. They use typically two from the BPF program hidden CPU
registers for caching the skb's headlen (skb->len - skb->data_len)
and skb->data. Throughout the JIT phase this requires to keep track
whether LD_ABS/LD_IND are used and if so, the two registers need
to be recached each time a BPF helper would change the underlying
packet data in native eBPF case. At least in eBPF case, available
CPU registers are rare and the additional exit path out of the
asm written JIT helper makes it also inflexible since not all
parts of the JITer are in control from plain C. A LD_ABS/LD_IND
implementation in eBPF therefore allows to significantly reduce
the complexity in JITs with comparable performance results for
them, e.g.:

test_bpf             tcpdump port 22             tcpdump complex
x64      - before    15 21 10                    14 19  18
         - after      7 10 10                     7 10  15
arm64    - before    40 91 92                    40 91 151
         - after     51 64 73                    51 62 113

For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter()
and cache the skb's headlen and data in the cBPF prologue. The
BPF_REG_TMP gets remapped from R8 to R2 since it's mainly just
used as a local temporary variable. This allows to shrink the
image on x86_64 also for seccomp programs slightly since mapping
to %rsi is not an ereg. In callee-saved R8 and R9 we now track
skb data and headlen, respectively. For normal prologue emission
in the JITs this does not add any extra instructions since R8, R9
are pushed to stack in any case from eBPF side. cBPF uses the
convert_bpf_ld_abs() emitter which probes the fast path inline
already and falls back to bpf_skb_load_helper_{8,16,32}() helper
relying on the cached skb data and headlen as well. R8 and R9
never need to be reloaded due to bpf_helper_changes_pkt_data()
since all skb access in cBPF is read-only. Then, for the case
of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls
the bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally,
does neither cache skb data and headlen nor has an inlined fast
path. The reason for the latter is that native eBPF does not have
any extra registers available anyway, but even if there were, it
avoids any reload of skb data and headlen in the first place.
Additionally, for the negative offsets, we provide an alternative
bpf_skb_load_bytes_relative() helper in eBPF which operates
similarly as bpf_skb_load_bytes() and allows for more flexibility.
Tested myself on x64, arm64, s390x, from Sandipan on ppc64.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03 16:49:19 -07:00
Björn Töpel fbfc504a24 bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
The xskmap is yet another BPF map, very much inspired by
dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
adds AF_XDP sockets into the map, and by using the bpf_redirect_map
helper, an XDP program can redirect XDP frames to an AF_XDP socket.

Note that a socket that is bound to certain ifindex/queue index will
*only* accept XDP frames from that netdev/queue index. If an XDP
program tries to redirect from a netdev/queue index other than what
the socket is bound to, the frame will not be received on the socket.

A socket can reside in multiple maps.

v3: Fixed race and simplified code.
v2: Removed one indirection in map lookup.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03 15:55:24 -07:00
Linus Torvalds f4ef6a438c Various fixes in tracing:
- Tracepoints should not give warning on OOM failures
 
  - Use special field for function pointer in trace event
 
  - Fix igrab issues in uprobes
 
  - Fixes to the new histogram triggers
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCWuoYdBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qtFnAP9X4+AVDQH0VfsMLSc9D+rK6WmcRIhv
 q8J2gNPv3anM+AD/SFXWGO4ihN+0KDw/TqmJxESNEybq47vTZ/s5lM6A4gQ=
 =fQbj
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Various fixes in tracing:

   - Tracepoints should not give warning on OOM failures

   - Use special field for function pointer in trace event

   - Fix igrab issues in uprobes

   - Fixes to the new histogram triggers"

* tag 'trace-v4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracepoint: Do not warn on ENOMEM
  tracing: Add field modifier parsing hist error for hist triggers
  tracing: Add field parsing hist error for hist triggers
  tracing: Restore proper field flag printing when displaying triggers
  tracing: initcall: Ordered comparison of function pointers
  tracing: Remove igrab() iput() call from uprobes.c
  tracing: Fix bad use of igrab in trace_uprobe.c
2018-05-02 17:38:37 -10:00
John Fastabend abaeb096ca bpf: sockmap, fix error handling in redirect failures
When a redirect failure happens we release the buffers in-flight
without calling a sk_mem_uncharge(), the uncharge is called before
dropping the sock lock for the redirecte, however we missed updating
the ring start index. When no apply actions are in progress this
is OK because we uncharge the entire buffer before the redirect.
But, when we have apply logic running its possible that only a
portion of the buffer is being redirected. In this case we only
do memory accounting for the buffer slice being redirected and
expect to be able to loop over the BPF program again and/or if
a sock is closed uncharge the memory at sock destruct time.

With an invalid start index however the program logic looks at
the start pointer index, checks the length, and when seeing the
length is zero (from the initial release and failure to update
the pointer) aborts without uncharging/releasing the remaining
memory.

The fix for this is simply to update the start index. To avoid
fixing this error in two locations we do a small refactor and
remove one case where it is open-coded. Then fix it in the
single function.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-02 15:30:45 -07:00
John Fastabend fec51d40ea bpf: sockmap, zero sg_size on error when buffer is released
When an error occurs during a redirect we have two cases that need
to be handled (i) we have a cork'ed buffer (ii) we have a normal
sendmsg buffer.

In the cork'ed buffer case we don't currently support recovering from
errors in a redirect action. So the buffer is released and the error
should _not_ be pushed back to the caller of sendmsg/sendpage. The
rationale here is the user will get an error that relates to old
data that may have been sent by some arbitrary thread on that sock.
Instead we simple consume the data and tell the user that the data
has been consumed. We may add proper error recovery in the future.
However, this patch fixes a bug where the bytes outstanding counter
sg_size was not zeroed. This could result in a case where if the user
has both a cork'ed action and apply action in progress we may
incorrectly call into the BPF program when the user expected an
old verdict to be applied via the apply action. I don't have a use
case where using apply and cork at the same time is valid but we
never explicitly reject it because it should work fine. This patch
ensures the sg_size is zeroed so we don't have this case.

In the normal sendmsg buffer case (no cork data) we also do not
zero sg_size. Again this can confuse the apply logic when the logic
calls into the BPF program when the BPF programmer expected the old
verdict to remain. So ensure we set sg_size to zero here as well. And
additionally to keep the psock state in-sync with the sk_msg_buff
release all the memory as well. Previously we did this before
returning to the user but this left a gap where psock and sk_msg_buff
states were out of sync which seems fragile. No additional overhead
is taken here except for a call to check the length and realize its
already been freed. This is in the error path as well so in my
opinion lets have robust code over optimized error paths.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-02 15:30:45 -07:00
John Fastabend 3cc9a472d6 bpf: sockmap, fix scatterlist update on error path in send with apply
When the call to do_tcp_sendpage() fails to send the complete block
requested we either retry if only a partial send was completed or
abort if we receive a error less than or equal to zero. Before
returning though we must update the scatterlist length/offset to
account for any partial send completed.

Before this patch we did this at the end of the retry loop, but
this was buggy when used while applying a verdict to fewer bytes
than in the scatterlist. When the scatterlist length was being set
we forgot to account for the apply logic reducing the size variable.
So the result was we chopped off some bytes in the scatterlist without
doing proper cleanup on them. This results in a WARNING when the
sock is tore down because the bytes have previously been charged to
the socket but are never uncharged.

The simple fix is to simply do the accounting inside the retry loop
subtracting from the absolute scatterlist values rather than trying
to accumulate the totals and subtract at the end.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-02 15:30:44 -07:00
Mathieu Desnoyers d66a270be3 tracepoint: Do not warn on ENOMEM
Tracepoint should only warn when a kernel API user does not respect the
required preconditions (e.g. same tracepoint enabled twice, or called
to remove a tracepoint that does not exist).

Silence warning in out-of-memory conditions, given that the error is
returned to the caller.

This ensures that out-of-memory error-injection testing does not trigger
warnings in tracepoint.c, which were seen by syzbot.

Link: https://lkml.kernel.org/r/001a114465e241a8720567419a72@google.com
Link: https://lkml.kernel.org/r/001a1140e0de15fc910567464190@google.com
Link: http://lkml.kernel.org/r/20180315124424.32319-1-mathieu.desnoyers@efficios.com

CC: Peter Zijlstra <peterz@infradead.org>
CC: Jiri Olsa <jolsa@redhat.com>
CC: Arnaldo Carvalho de Melo <acme@kernel.org>
CC: Alexander Shishkin <alexander.shishkin@linux.intel.com>
CC: Namhyung Kim <namhyung@kernel.org>
CC: stable@vger.kernel.org
Fixes: de7b297390 ("tracepoint: Use struct pointer instead of name hash for reg/unreg tracepoints")
Reported-by: syzbot+9c0d616860575a73166a@syzkaller.appspotmail.com
Reported-by: syzbot+4e9ae7fa46233396f64d@syzkaller.appspotmail.com
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-04-30 12:09:56 -04:00
Alexei Starovoitov 4d220ed0f8 bpf: remove tracepoints from bpf core
tracepoints to bpf core were added as a way to provide introspection
to bpf programs and maps, but after some time it became clear that
this approach is inadequate, so prog_id, map_id and corresponding
get_next_id, get_fd_by_id, get_info_by_fd, prog_query APIs were
introduced and fully adopted by bpftool and other applications.
The tracepoints in bpf core started to rot and causing syzbot warnings:
WARNING: CPU: 0 PID: 3008 at kernel/trace/trace_event_perf.c:274
Kernel panic - not syncing: panic_on_warn set ...
perf_trace_bpf_map_keyval+0x260/0xbd0 include/trace/events/bpf.h:228
trace_bpf_map_update_elem include/trace/events/bpf.h:274 [inline]
map_update_elem kernel/bpf/syscall.c:597 [inline]
SYSC_bpf kernel/bpf/syscall.c:1478 [inline]
Hence this patch deletes tracepoints in bpf core.

Reported-by: Eric Biggers <ebiggers3@gmail.com>
Reported-by: syzbot <bot+a9dbb3c3e64b62536a4bc5ee7bbd4ca627566188@syzkaller.appspotmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-30 10:55:56 +02:00
Teng Qin 7ef3771205 bpf: Allow bpf_current_task_under_cgroup in interrupt
Currently, the bpf_current_task_under_cgroup helper has a check where if
the BPF program is running in_interrupt(), it will return -EINVAL. This
prevents the helper to be used in many useful scenarios, particularly
BPF programs attached to Perf Events.

This commit removes the check. Tested a few NMI (Perf Event) and some
softirq context, the helper returns the correct result.

Signed-off-by: Teng Qin <qinteng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-04-29 09:18:04 -07:00
Linus Torvalds 810fb07a9b Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "Two fixes from the timer departement:

   - Fix a long standing issue in the NOHZ tick code which causes RB
     tree corruption, delayed timers and other malfunctions. The cause
     for this is code which modifies the expiry time of an enqueued
     hrtimer.

   - Revert the CLOCK_MONOTONIC/CLOCK_BOOTTIME unification due to
     regression reports. Seems userspace _is_ relying on the documented
     behaviour despite our hope that it wont"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME
  tick/sched: Do not mess with an enqueued hrtimer
2018-04-29 09:03:25 -07:00
Yonghong Song 9cbe1f5a32 bpf/verifier: improve register value range tracking with ARSH
When helpers like bpf_get_stack returns an int value
and later on used for arithmetic computation, the LSH and ARSH
operations are often required to get proper sign extension into
64-bit. For example, without this patch:
    54: R0=inv(id=0,umax_value=800)
    54: (bf) r8 = r0
    55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
    55: (67) r8 <<= 32
    56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
    56: (c7) r8 s>>= 32
    57: R8=inv(id=0)
With this patch:
    54: R0=inv(id=0,umax_value=800)
    54: (bf) r8 = r0
    55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
    55: (67) r8 <<= 32
    56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
    56: (c7) r8 s>>= 32
    57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
With better range of "R8", later on when "R8" is added to other register,
e.g., a map pointer or scalar-value register, the better register
range can be derived and verifier failure may be avoided.

In our later example,
    ......
    usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
    if (usize < 0)
        return 0;
    ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
    ......
Without improving ARSH value range tracking, the register representing
"max_len - usize" will have smin_value equal to S64_MIN and will be
rejected by verifier.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-04-29 08:45:53 -07:00
Yonghong Song afbe1a5b79 bpf: remove never-hit branches in verifier adjust_scalar_min_max_vals
In verifier function adjust_scalar_min_max_vals,
when src_known is false and the opcode is BPF_LSH/BPF_RSH,
early return will happen in the function. So remove
the branch in handling BPF_LSH/BPF_RSH when src_known is false.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-04-29 08:45:53 -07:00
Yonghong Song 849fa50662 bpf/verifier: refine retval R0 state for bpf_get_stack helper
The special property of return values for helpers bpf_get_stack
and bpf_probe_read_str are captured in verifier.
Both helpers return a negative error code or
a length, which is equal to or smaller than the buffer
size argument. This additional information in the
verifier can avoid the condition such as "retval > bufsize"
in the bpf program. For example, for the code blow,
    usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
    if (usize < 0 || usize > max_len)
        return 0;
The verifier may have the following errors:
    52: (85) call bpf_get_stack#65
     R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
     R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
     R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
     R9_w=inv800 R10=fp0,call_-1
    53: (bf) r8 = r0
    54: (bf) r1 = r8
    55: (67) r1 <<= 32
    56: (bf) r2 = r1
    57: (77) r2 >>= 32
    58: (25) if r2 > 0x31f goto pc+33
     R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512,
                         umax_value=18446744069414584320,
                         var_off=(0x0; 0xffffffff00000000))
     R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff))
     R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
     R8=inv(id=0) R9=inv800 R10=fp0,call_-1
    59: (1f) r9 -= r8
    60: (c7) r1 s>>= 32
    61: (bf) r2 = r7
    62: (0f) r2 += r1
    math between map_value pointer and register with unbounded
    min value is not allowed
The failure is due to llvm compiler optimization where register "r2",
which is a copy of "r1", is tested for condition while later on "r1"
is used for map_ptr operation. The verifier is not able to track such
inst sequence effectively.

Without the "usize > max_len" condition, there is no llvm optimization
and the below generated code passed verifier:
    52: (85) call bpf_get_stack#65
     R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
     R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
     R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
     R9_w=inv800 R10=fp0,call_-1
    53: (b7) r1 = 0
    54: (bf) r8 = r0
    55: (67) r8 <<= 32
    56: (c7) r8 s>>= 32
    57: (6d) if r1 s> r8 goto pc+24
     R0=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff))
     R1=inv0 R6=ctx(id=0,off=0,imm=0)
     R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
     R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800
     R10=fp0,call_-1
    58: (bf) r2 = r7
    59: (0f) r2 += r8
    60: (1f) r9 -= r8
    61: (bf) r1 = r6

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-04-29 08:45:53 -07:00
Yonghong Song c195651e56 bpf: add bpf_get_stack helper
Currently, stackmap and bpf_get_stackid helper are provided
for bpf program to get the stack trace. This approach has
a limitation though. If two stack traces have the same hash,
only one will get stored in the stackmap table,
so some stack traces are missing from user perspective.

This patch implements a new helper, bpf_get_stack, will
send stack traces directly to bpf program. The bpf program
is able to see all stack traces, and then can do in-kernel
processing or send stack traces to user space through
shared map or bpf_perf_event_output.

Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-04-29 08:45:53 -07:00
Yonghong Song 5f41263274 bpf: change prototype for stack_map_get_build_id_offset
This patch didn't incur functionality change. The function prototype
got changed so that the same function can be reused later.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-04-29 08:45:53 -07:00
Linus Torvalds 7b87308e71 Modules fix for v4.17-rc3
- Fix display of module section addresses in sysfs, which were getting
 hashed with %pK and breaking tools like perf.
 
 Signed-off-by: Jessica Yu <jeyu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJa4wQwAAoJEMBFfjjOO8Fy5IUQAJYKkClqo0BuQocleR9aPJSg
 dIzeSHeUThT66KSBrmi74Q4t2UoVg4M4V/ktAIECqW9oNn2eWvVd5tovgEHntqYL
 GevuQK207VOJSNS+ohE0N0hPACd2hjCu58EnMUUheDvRdFHpLwTBqnejN6EvIq/o
 OoEin6Iq/NKdYCY2yQt5iRROmph61rpIyM4/js4BRz4flLE/MZemHRekNMhmMSqr
 IjUv83ez50PaWJAmk0fjNqAw9j2EmSl5B77wGrM+POifvcvBdxzBZpbeZHgdAESX
 3QgUihDRkpJ/bhf+HvmVxNe2WRV/7WD8d+3e/drkg2++CeP/Pw+bWCpcMflMZOOg
 MIroCd4H3jOSK2aunal1WftGca0awj4XdHdl01m3OgwAGUc6gCxwuPQ6/UaYUhkf
 jV4BV0XROvR49Mgs9V8/aZpomfF7u2vLZPPiR/2yvylcRfh6Fh7iUJU/N+LGFjdU
 KQCmt7ZWgGFYaf392bexVdQzMA+R1h0IWn6mKm6krdQ6x3XnQ/f0wwtWc0G6Vb1B
 ojF73rWCUqe6W/UhCk1ja3Bz6kOuECeKZr2YUTPiOJhNsLl3kDUhFhdH0ObX0D4x
 cf+VZep6hQoagc2x3ZcWe5AiBeChwQ0xypV19AVvGcgfGfoX6EQ61ORcqDVdcgO4
 fr39iXQSvau7jFP7EyTg
 =ZGdS
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v4.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull modules fix from Jessica Yu:
 "Fix display of module section addresses in sysfs, which were getting
  hashed with %pK and breaking tools like perf"

* tag 'modules-for-v4.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  module: Fix display of wrong module .text address
2018-04-27 11:01:21 -07:00
Linus Torvalds 79a17dd9d2 Staging fixes for 4.17-rc3
Here are 2 staging driver fixups for 4.17-rc3.
 
 The first is the remaining stragglers of the irda code removal that you
 pointed out during the merge window.  The second is a fix for the
 wilc1000 driver due to a patch that got merged in 4.17-rc1.
 
 Both of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWuMyew8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymXxACffYtMbj0Vg5pD0yAPqRzJ2iVMVE0AnRkp4BYQ
 kXgAjDeSyrdKPUwQ7Hl2
 =UNuF
 -----END PGP SIGNATURE-----

Merge tag 'staging-4.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging fixes from Greg KH:
 "Here are two staging driver fixups for 4.17-rc3.

  The first is the remaining stragglers of the irda code removal that
  you pointed out during the merge window. The second is a fix for the
  wilc1000 driver due to a patch that got merged in 4.17-rc1.

  Both of these have been in linux-next for a while with no reported
  issues"

* tag 'staging-4.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: wilc1000: fix NULL pointer exception in host_int_parse_assoc_resp_info()
  staging: irda: remove remaining remants of irda code removal
2018-04-27 09:37:12 -07:00
Tom Zanussi dcf234577c tracing: Add field modifier parsing hist error for hist triggers
If the user specifies an invalid field modifier for a hist trigger,
the current code correctly flags that as an error, but doesn't tell
the user what happened.

Fix this by invoking hist_err() with an appropriate message when
invalid modifiers are specified.

Before:

  # echo 'hist:keys=pid:ts0=common_timestamp.junkusecs' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
  -su: echo: write error: Invalid argument
  # cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/hist

After:

  # echo 'hist:keys=pid:ts0=common_timestamp.junkusecs' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
  -su: echo: write error: Invalid argument
  # cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/hist
  ERROR: Invalid field modifier: junkusecs
    Last command: keys=pid:ts0=common_timestamp.junkusecs

Link: http://lkml.kernel.org/r/b043c59fa79acd06a5f14a1d44dee9e5a3cd1248.1524790601.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-04-26 21:39:58 -04:00
Tom Zanussi 5ec432d7bf tracing: Add field parsing hist error for hist triggers
If the user specifies a nonexistent field for a hist trigger, the
current code correctly flags that as an error, but doesn't tell the
user what happened.

Fix this by invoking hist_err() with an appropriate message when
nonexistent fields are specified.

Before:

  # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
  -su: echo: write error: Invalid argument
  # cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist

After:

  # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
  -su: echo: write error: Invalid argument
  # cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
  ERROR: Couldn't find field: pid
    Last command: keys=pid:ts0=common_timestamp.usecs

Link: http://lkml.kernel.org/r/fdc8746969d16906120f162b99dd71c741e0b62c.1524790601.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-04-26 21:39:32 -04:00
Tom Zanussi 608940dabe tracing: Restore proper field flag printing when displaying triggers
The flag-printing code used when displaying hist triggers somehow got
dropped during refactoring of the inter-event patchset.  This restores
it.

Below are a couple examples - in the first case, .usecs wasn't being
displayed properly for common_timestamps and the second illustrates
the same for other flags such as .execname.

Before:

  # echo 'hist:key=common_pid.execname:val=count:sort=count' > /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
  # cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
  hist:keys=common_pid:vals=hitcount,count:sort=count:size=2048 [active]

  # echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
  # cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
  hist:keys=pid:vals=hitcount:ts0=common_timestamp:sort=hitcount:size=2048:clock=global if comm=="cyclictest" [active]

After:

  # echo 'hist:key=common_pid.execname:val=count:sort=count' > /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
  # cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
  hist:keys=common_pid.execname:vals=hitcount,count:sort=count:size=2048 [active]

  # echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
  # cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
  hist:keys=pid:vals=hitcount:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global if comm=="cyclictest" [active]

Link: http://lkml.kernel.org/r/492bab42ff21806600af98a8ea901af10efbee0c.1524790601.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-04-26 21:38:43 -04:00
David S. Miller 79741a38b4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-04-27

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add extensive BPF helper description into include/uapi/linux/bpf.h
   and a new script bpf_helpers_doc.py which allows for generating a
   man page out of it. Thus, every helper in BPF now comes with proper
   function signature, detailed description and return code explanation,
   from Quentin.

2) Migrate the BPF collect metadata tunnel tests from BPF samples over
   to the BPF selftests and further extend them with v6 vxlan, geneve
   and ipip tests, simplify the ipip tests, improve documentation and
   convert to bpf_ntoh*() / bpf_hton*() api, from William.

3) Currently, helpers that expect ARG_PTR_TO_MAP_{KEY,VALUE} can only
   access stack and packet memory. Extend this to allow such helpers
   to also use map values, which enabled use cases where value from
   a first lookup can be directly used as a key for a second lookup,
   from Paul.

4) Add a new helper bpf_skb_get_xfrm_state() for tc BPF programs in
   order to retrieve XFRM state information containing SPI, peer
   address and reqid values, from Eyal.

5) Various optimizations in nfp driver's BPF JIT in order to turn ADD
   and SUB instructions with negative immediate into the opposite
   operation with a positive immediate such that nfp can better fit
   small immediates into instructions. Savings in instruction count
   up to 4% have been observed, from Jakub.

6) Add the BPF prog's gpl_compatible flag to struct bpf_prog_info
   and add support for dumping this through bpftool, from Jiri.

7) Move the BPF sockmap samples over into BPF selftests instead since
   sockmap was rather a series of tests than sample anyway and this way
   this can be run from automated bots, from John.

8) Follow-up fix for bpf_adjust_tail() helper in order to make it work
   with generic XDP, from Nikita.

9) Some follow-up cleanups to BTF, namely, removing unused defines from
   BTF uapi header and renaming 'name' struct btf_* members into name_off
   to make it more clear they are offsets into string section, from Martin.

10) Remove test_sock_addr from TEST_GEN_PROGS in BPF selftests since
    not run directly but invoked from test_sock_addr.sh, from Yonghong.

11) Remove redundant ret assignment in sample BPF loader, from Wang.

12) Add couple of missing files to BPF selftest's gitignore, from Anders.

There are two trivial merge conflicts while pulling:

  1) Remove samples/sockmap/Makefile since all sockmap tests have been
     moved to selftests.
  2) Add both hunks from tools/testing/selftests/bpf/.gitignore to the
     file since git should ignore all of them.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26 21:19:50 -04:00
Linus Torvalds 47b5ece937 Following tracing fixes:
- Add workqueue forward declaration (for new work, but a nice clean up)
 
  - seftest fixes for the new histogram code
 
  - Print output fix for hwlat tracer
 
  - Fix missing system call events - due to change in x86 syscall naming
 
  - Fix kprobe address being used by perf being hashed
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCWuIMShQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qkrdAQDRrgIGcm4pRGrvPiGhp4FeQKUx3woM
 LY10qMYo3St7zwEAn5oor/e/7KQaQSdKQ7QkL690QU2bTO6FXz4VwE1OcgM=
 =OHJk
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Add workqueue forward declaration (for new work, but a nice clean up)

 - seftest fixes for the new histogram code

 - Print output fix for hwlat tracer

 - Fix missing system call events - due to change in x86 syscall naming

 - Fix kprobe address being used by perf being hashed

* tag 'trace-v4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix missing tab for hwlat_detector print format
  selftests: ftrace: Add a testcase for multiple actions on trigger
  selftests: ftrace: Fix trigger extended error testcase
  kprobes: Fix random address output of blacklist file
  tracing: Fix kernel crash while using empty filter with perf
  tracing/x86: Update syscall trace events to handle new prefixed syscall func names
  tracing: Add missing forward declaration
2018-04-26 16:22:47 -07:00
Jiri Olsa b85fab0e67 bpf: Add gpl_compatible flag to struct bpf_prog_info
Adding gpl_compatible flag to struct bpf_prog_info
so it can be dumped via bpf_prog_get_info_by_fd and
displayed via bpftool progs dump.

Alexei noticed 4-byte hole in struct bpf_prog_info,
so we put the u32 flags field in there, and we can
keep adding bit fields in there without breaking
user space.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-26 22:36:11 +02:00
Song Liu 61f94203c9 tracing: Remove igrab() iput() call from uprobes.c
Caller of uprobe_register is required to keep the inode and containing
mount point referenced.

There was misuse of igrab() in uprobes.c and trace_uprobe.c. This is
because igrab() will not prevent umount of the containing mount point.
To fix this, we added path to struct trace_uprobe, which keeps the inode
and containing mount reference.

For uprobes.c, it is not necessary to call igrab() in uprobe_register(),
as the caller is required to keep the inode reference. The igrab() is
removed and comments on this requirement is added to uprobe_register().

Link: http://lkml.kernel.org/r/CAELBmZB2XX=qEOLAdvGG4cPx4GEntcSnWQquJLUK1ongRj35cA@mail.gmail.com
Link: http://lkml.kernel.org/r/20180423172135.4050588-2-songliubraving@fb.com

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Howard McLauchlan <hmclauchlan@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-04-26 14:50:56 -04:00
Song Liu 0c92c7a3c5 tracing: Fix bad use of igrab in trace_uprobe.c
As Miklos reported and suggested:

  This pattern repeats two times in trace_uprobe.c and in
  kernel/events/core.c as well:

      ret = kern_path(filename, LOOKUP_FOLLOW, &path);
      if (ret)
          goto fail_address_parse;

      inode = igrab(d_inode(path.dentry));
      path_put(&path);

  And it's wrong.  You can only hold a reference to the inode if you
  have an active ref to the superblock as well (which is normally
  through path.mnt) or holding s_umount.

  This way unmounting the containing filesystem while the tracepoint is
  active will give you the "VFS: Busy inodes after unmount..." message
  and a crash when the inode is finally put.

  Solution: store path instead of inode.

This patch fixes two instances in trace_uprobe.c. struct path is added to
struct trace_uprobe to keep the inode and containing mount point
referenced.

Link: http://lkml.kernel.org/r/20180423172135.4050588-1-songliubraving@fb.com

Fixes: f3f096cfed ("tracing: Provide trace events interface for uprobes")
Fixes: 33ea4b2427 ("perf/core: Implement the 'perf_uprobe' PMU")
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Howard McLauchlan <hmclauchlan@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Reported-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-04-26 14:49:55 -04:00
Thomas Gleixner a3ed0e4393 Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME
Revert commits

92af4dcb4e ("tracing: Unify the "boot" and "mono" tracing clocks")
127bfa5f43 ("hrtimer: Unify MONOTONIC and BOOTTIME clock behavior")
7250a4047a ("posix-timers: Unify MONOTONIC and BOOTTIME clock behavior")
d6c7270e91 ("timekeeping: Remove boot time specific code")
f2d6fdbfd2 ("Input: Evdev - unify MONOTONIC and BOOTTIME clock behavior")
d6ed449afd ("timekeeping: Make the MONOTONIC clock behave like the BOOTTIME clock")
72199320d4 ("timekeeping: Add the new CLOCK_MONOTONIC_ACTIVE clock")

As stated in the pull request for the unification of CLOCK_MONOTONIC and
CLOCK_BOOTTIME, it was clear that we might have to revert the change.

As reported by several folks systemd and other applications rely on the
documented behaviour of CLOCK_MONOTONIC on Linux and break with the above
changes. After resume daemons time out and other timeout related issues are
observed. Rafael compiled this list:

* systemd kills daemons on resume, after >WatchdogSec seconds
  of suspending (Genki Sky).  [Verified that that's because systemd uses
  CLOCK_MONOTONIC and expects it to not include the suspend time.]

* systemd-journald misbehaves after resume:
  systemd-journald[7266]: File /var/log/journal/016627c3c4784cd4812d4b7e96a34226/system.journal
corrupted or uncleanly shut down, renaming and replacing.
  (Mike Galbraith).

* NetworkManager reports "networking disabled" and networking is broken
  after resume 50% of the time (Pavel).  [May be because of systemd.]

* MATE desktop dims the display and starts the screensaver right after
  system resume (Pavel).

* Full system hang during resume (me).  [May be due to systemd or NM or both.]

That happens on debian and open suse systems.

It's sad, that these problems were neither catched in -next nor by those
folks who expressed interest in this change.

Reported-by: Rafael J. Wysocki <rjw@rjwysocki.net>
Reported-by: Genki Sky <sky@genki.is>,
Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kevin Easton <kevin@guarana.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
2018-04-26 14:53:32 +02:00
Thomas Gleixner 1f71addd34 tick/sched: Do not mess with an enqueued hrtimer
Kaike reported that in tests rdma hrtimers occasionaly stopped working. He
did great debugging, which provided enough context to decode the problem.

CPU 3			     	      	     CPU 2

idle
start sched_timer expires = 712171000000
 queue->next = sched_timer
					    start rdmavt timer. expires = 712172915662
					    lock(baseof(CPU3))
tick_nohz_stop_tick()
tick = 716767000000			    timerqueue_add(tmr)

hrtimer_set_expires(sched_timer, tick);
  sched_timer->expires = 716767000000  <---- FAIL
					     if (tmr->expires < queue->next->expires)
hrtimer_start(sched_timer)		          queue->next = tmr;
lock(baseof(CPU3))
					     unlock(baseof(CPU3))
timerqueue_remove()
timerqueue_add()

ts->sched_timer is queued and queue->next is pointing to it, but then
ts->sched_timer.expires is modified.

This not only corrupts the ordering of the timerqueue RB tree, it also
makes CPU2 see the new expiry time of timerqueue->next->expires when
checking whether timerqueue->next needs to be updated. So CPU2 sees that
the rdma timer is earlier than timerqueue->next and sets the rdma timer as
new next.

Depending on whether it had also seen the new time at RB tree enqueue, it
might have queued the rdma timer at the wrong place and then after removing
the sched_timer the RB tree is completely hosed.

The problem was introduced with a commit which tried to solve inconsistency
between the hrtimer in the tick_sched data and the underlying hardware
clockevent. It split out hrtimer_set_expires() to store the new tick time
in both the NOHZ and the NOHZ + HIGHRES case, but missed the fact that in
the NOHZ + HIGHRES case the hrtimer might still be queued.

Use hrtimer_start(timer, tick...) for the NOHZ + HIGHRES case which sets
timer->expires after canceling the timer and move the hrtimer_set_expires()
invocation into the NOHZ only code path which is not affected as it merily
uses the hrtimer as next event storage so code pathes can be shared with
the NOHZ + HIGHRES case.

Fixes: d4af6d933c ("nohz: Fix spurious warning when hrtimer and clockevent get out of sync")
Reported-by: "Wan Kaike" <kaike.wan@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Cc: "Marciniszyn Mike" <mike.marciniszyn@intel.com>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: linux-rdma@vger.kernel.org
Cc: "Dalessandro Dennis" <dennis.dalessandro@intel.com>
Cc: "Fleck John" <john.fleck@intel.com>
Cc: stable@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: "Weiny Ira" <ira.weiny@intel.com>
Cc: "linux-rdma@vger.kernel.org"
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1804241637390.1679@nanos.tec.linutronix.de
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1804242119210.1597@nanos.tec.linutronix.de
2018-04-26 14:53:32 +02:00
David S. Miller a9537c937c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Merging net into net-next to help the bpf folks avoid
some really ugly merge conflicts.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25 23:04:22 -04:00
David S. Miller 25eb0ea717 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-04-25

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix to clear the percpu metadata_dst that could otherwise carry
   stale ip_tunnel_info, from William.

2) Fix that reduces the number of passes in x64 JIT with regards to
   dead code sanitation to avoid risk of prog rejection, from Gianluca.

3) Several fixes of sockmap programs, besides others, fixing a double
   page_put() in error path, missing refcount hold for pinned sockmap,
   adding required -target bpf for clang in sample Makefile, from John.

4) Fix to disable preemption in __BPF_PROG_RUN_ARRAY() paths, from Roman.

5) Fix tools/bpf/ Makefile with regards to a lex/yacc build error
   seen on older gcc-5, from John.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25 22:55:33 -04:00
Peter Xu 9a0fd67530 tracing: Fix missing tab for hwlat_detector print format
It's been missing for a while but no one is touching that up.  Fix it.

Link: http://lkml.kernel.org/r/20180315060639.9578-1-peterx@redhat.com

CC: Ingo Molnar <mingo@kernel.org>
Cc:stable@vger.kernel.org
Fixes: 7b2c862501 ("tracing: Add NMI tracing in hwlat detector")
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-04-25 10:28:46 -04:00
Thomas Richter bcbd385b61 kprobes: Fix random address output of blacklist file
File /sys/kernel/debug/kprobes/blacklist displays random addresses:

[root@s8360046 linux]# cat /sys/kernel/debug/kprobes/blacklist
0x0000000047149a90-0x00000000bfcb099a	print_type_x8
....

This breaks 'perf probe' which uses the blacklist file to prohibit
probes on certain functions by checking the address range.

Fix this by printing the correct (unhashed) address.

The file mode is read all but this is not an issue as the file
hierarchy points out:
 # ls -ld /sys/ /sys/kernel/ /sys/kernel/debug/ /sys/kernel/debug/kprobes/
	/sys/kernel/debug/kprobes/blacklist
dr-xr-xr-x 12 root root 0 Apr 19 07:56 /sys/
drwxr-xr-x  8 root root 0 Apr 19 07:56 /sys/kernel/
drwx------ 16 root root 0 Apr 19 06:56 /sys/kernel/debug/
drwxr-xr-x  2 root root 0 Apr 19 06:56 /sys/kernel/debug/kprobes/
-r--r--r--  1 root root 0 Apr 19 06:56 /sys/kernel/debug/kprobes/blacklist

Everything in and below /sys/kernel/debug is rwx to root only,
no group or others have access.

Background:
Directory /sys/kernel/debug/kprobes is created by debugfs_create_dir()
which sets the mode bits to rwxr-xr-x. Maybe change that to use the
parent's directory mode bits instead?

Link: http://lkml.kernel.org/r/20180419105556.86664-1-tmricht@linux.ibm.com

Fixes: ad67b74d24 ("printk: hash addresses printed with %p")
Cc: stable@vger.kernel.org
Cc: <stable@vger.kernel.org> # v4.15+
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S Miller <davem@davemloft.net>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: acme@kernel.org

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-04-25 10:27:56 -04:00
Ravi Bangoria ba16293dad tracing: Fix kernel crash while using empty filter with perf
Kernel is crashing when user tries to record 'ftrace:function' event
with empty filter:

  # perf record -e ftrace:function --filter="" ls

  # dmesg
  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  Oops: 0000 [#1] SMP PTI
  ...
  RIP: 0010:ftrace_profile_set_filter+0x14b/0x2d0
  RSP: 0018:ffffa4a7c0da7d20 EFLAGS: 00010246
  RAX: ffffa4a7c0da7d64 RBX: 0000000000000000 RCX: 0000000000000006
  RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffff8c48ffc968f0
  ...
  Call Trace:
   _perf_ioctl+0x54a/0x6b0
   ? rcu_all_qs+0x5/0x30
  ...

After patch:
  # perf record -e ftrace:function --filter="" ls
  failed to set filter "" on event ftrace:function with 22 (Invalid argument)

Also, if user tries to echo "" > filter, it used to throw an error.
This behavior got changed by commit 80765597bc ("tracing: Rewrite
filter logic to be simpler and faster"). This patch restores the
behavior as a side effect:

Before patch:
  # echo "" > filter
  #

After patch:
  # echo "" > filter
  bash: echo: write error: Invalid argument
  #

Link: http://lkml.kernel.org/r/20180420150758.19787-1-ravi.bangoria@linux.ibm.com

Fixes: 80765597bc ("tracing: Rewrite filter logic to be simpler and faster")
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-04-25 10:27:55 -04:00
David S. Miller c749fa181b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-04-24 23:59:11 -04:00
Linus Torvalds 24cac7009c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix rtnl deadlock in ipvs, from Julian Anastasov.

 2) s390 qeth fixes from Julian Wiedmann (control IO completion stalls,
    bad MAC address update sequence, request side races on command IO
    timeouts).

 3) Handle seq_file overflow properly in l2tp, from Guillaume Nault.

 4) Fix VLAN priority mappings in cpsw driver, from Ivan Khoronzhuk.

 5) Packet scheduler ife action fixes (malformed TLV lengths, etc.) from
    Alexander Aring.

 6) Fix out of bounds access in tcp md5 option parser, from Jann Horn.

 7) Missing netlink attribute policies in rtm_ipv6_policy table, from
    Eric Dumazet.

 8) Missing socket address length checks in l2tp and pppoe connect, from
    Guillaume Nault.

 9) Fix netconsole over team and bonding, from Xin Long.

10) Fix race with AF_PACKET socket state bitfields, from Willem de
    Bruijn.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (51 commits)
  ice: Fix insufficient memory issue in ice_aq_manage_mac_read
  sfc: ARFS filter IDs
  net: ethtool: Add missing kernel doc for FEC parameters
  packet: fix bitfield update race
  ice: Do not check INTEVENT bit for OICR interrupts
  ice: Fix incorrect comment for action type
  ice: Fix initialization for num_nodes_added
  igb: Fix the transmission mode of queue 0 for Qav mode
  ixgbevf: ensure xdp_ring resources are free'd on error exit
  team: fix netconsole setup over team
  amd-xgbe: Only use the SFP supported transceiver signals
  amd-xgbe: Improve KR auto-negotiation and training
  amd-xgbe: Add pre/post auto-negotiation phy hooks
  pppoe: check sockaddr length in pppoe_connect()
  l2tp: check sockaddr length in pppol2tp_connect()
  net: phy: marvell: clear wol event before setting it
  ipv6: add RTA_TABLE and RTA_PREFSRC to rtm_ipv6_policy
  bonding: do not set slave_dev npinfo before slave_enable_netpoll in bond_enslave
  tcp: don't read out-of-bounds opsize
  ibmvnic: Clean actual number of RX or TX pools
  ...
2018-04-24 14:16:40 -07:00
Paul Chaignon d71962f3e6 bpf: allow map helpers access to map values directly
Helpers that expect ARG_PTR_TO_MAP_KEY and ARG_PTR_TO_MAP_VALUE can only
access stack and packet memory.  Allow these helpers to directly access
map values by passing registers of type PTR_TO_MAP_VALUE.

This change removes the need for an extra copy to the stack when using a
map value to perform a second map lookup, as in the following:

struct bpf_map_def SEC("maps") infobyreq = {
    .type = BPF_MAP_TYPE_HASHMAP,
    .key_size = sizeof(struct request *),
    .value_size = sizeof(struct info_t),
    .max_entries = 1024,
};
struct bpf_map_def SEC("maps") counts = {
    .type = BPF_MAP_TYPE_HASHMAP,
    .key_size = sizeof(struct info_t),
    .value_size = sizeof(u64),
    .max_entries = 1024,
};
SEC("kprobe/blk_account_io_start")
int bpf_blk_account_io_start(struct pt_regs *ctx)
{
    struct info_t *info = bpf_map_lookup_elem(&infobyreq, &ctx->di);
    u64 *count = bpf_map_lookup_elem(&counts, info);
    (*count)++;
}

Signed-off-by: Paul Chaignon <paul.chaignon@orange.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-24 22:39:13 +02:00
John Fastabend 4fcfdfb833 bpf: sockmap, fix double page_put on ENOMEM error in redirect path
In the case where the socket memory boundary is hit the redirect
path returns an ENOMEM error. However, before checking for this
condition the redirect scatterlist buffer is setup with a valid
page and length. This is never unwound so when the buffers are
released latter in the error path we do a put_page() and clear
the scatterlist fields. But, because the initial error happens
before completing the scatterlist buffer we end up with both the
original buffer and the redirect buffer pointing to the same page
resulting in duplicate put_page() calls.

To fix this simply move the initial configuration of the redirect
scatterlist buffer below the sock memory check.

Found this while running TCP_STREAM test with netperf using Cilium.

Fixes: fa246693a1 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-24 00:49:45 +02:00
John Fastabend e20f733483 bpf: sockmap, sk_wait_event needed to handle blocking cases
In the recvmsg handler we need to add a wait event to support the
blocking use cases. Without this we return zero and may confuse
user applications. In the wait event any data received on the
sk either via sk_receive_queue or the psock ingress list will
wake up the sock.

Fixes: fa246693a1 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-24 00:49:45 +02:00
John Fastabend ba6b8de423 bpf: sockmap, map_release does not hold refcnt for pinned maps
Relying on map_release hook to decrement the reference counts when a
map is removed only works if the map is not being pinned. In the
pinned case the ref is decremented immediately and the BPF programs
released. After this BPF programs may not be in-use which is not
what the user would expect.

This patch moves the release logic into bpf_map_put_uref() and brings
sockmap in-line with how a similar case is handled in prog array maps.

Fixes: 3d9e952697 ("bpf: sockmap, fix leaking maps with attached but not detached progs")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-24 00:49:45 +02:00