Commit Graph

900703 Commits

Author SHA1 Message Date
Jakub Sitnicki 6e830c2f6c bpf, sockmap: Don't set up upcalls and progs for listening sockets
Now that sockmap/sockhash can hold listening sockets, when setting up the
psock we will (i) grab references to verdict/parser progs, and (2) override
socket upcalls sk_data_ready and sk_write_space.

However, since we cannot redirect to listening sockets so we don't need to
link the socket to the BPF progs. And more importantly we don't want the
listening socket to have overridden upcalls because they would get
inherited by child sockets cloned from it.

Introduce a separate initialization path for listening sockets that does
not change the upcalls and ignores the BPF progs.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-6-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki 8ca30379a4 bpf, sockmap: Allow inserting listening TCP sockets into sockmap
In order for sockmap/sockhash types to become generic collections for
storing TCP sockets we need to loosen the checks during map update, while
tightening the checks in redirect helpers.

Currently sock{map,hash} require the TCP socket to be in established state,
which prevents inserting listening sockets.

Change the update pre-checks so the socket can also be in listening state.

Since it doesn't make sense to redirect with sock{map,hash} to listening
sockets, add appropriate socket state checks to BPF redirect helpers too.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-5-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki e80251555f tcp_bpf: Don't let child socket inherit parent protocol ops on copy
Prepare for cloning listening sockets that have their protocol callbacks
overridden by sk_msg. Child sockets must not inherit parent callbacks that
access state stored in sk_user_data owned by the parent.

Restore the child socket protocol callbacks before it gets hashed and any
of the callbacks can get invoked.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-4-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki f1ff5ce2cd net, sk_msg: Clear sk_user_data pointer on clone if tagged
sk_user_data can hold a pointer to an object that is not intended to be
shared between the parent socket and the child that gets a pointer copy on
clone. This is the case when sk_user_data points at reference-counted
object, like struct sk_psock.

One way to resolve it is to tag the pointer with a no-copy flag by
repurposing its lowest bit. Based on the bit-flag value we clear the child
sk_user_data pointer after cloning the parent socket.

The no-copy flag is stored in the pointer itself as opposed to externally,
say in socket flags, to guarantee that the pointer and the flag are copied
from parent to child socket in an atomic fashion. Parent socket state is
subject to change while copying, we don't hold any locks at that time.

This approach relies on an assumption that sk_user_data holds a pointer to
an object aligned at least 2 bytes. A manual audit of existing users of
rcu_dereference_sk_user_data helper confirms our assumption.

Also, an RCU-protected sk_user_data is not likely to hold a pointer to a
char value or a pathological case of "struct { char c; }". To be safe, warn
when the flag-bit is set when setting sk_user_data to catch any future
misuses.

It is worth considering why clearing sk_user_data unconditionally is not an
option. There exist users, DRBD, NVMe, and Xen drivers being among them,
that rely on the pointer being copied when cloning the listening socket.

Potentially we could distinguish these users by checking if the listening
socket has been created in kernel-space via sock_create_kern, and hence has
sk_kern_sock flag set. However, this is not the case for NVMe and Xen
drivers, which create sockets without marking them as belonging to the
kernel.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-3-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki b8e202d1d1 net, sk_msg: Annotate lockless access to sk_prot on clone
sk_msg and ULP frameworks override protocol callbacks pointer in
sk->sk_prot, while tcp accesses it locklessly when cloning the listening
socket, that is with neither sk_lock nor sk_callback_lock held.

Once we enable use of listening sockets with sockmap (and hence sk_msg),
there will be shared access to sk->sk_prot if socket is getting cloned
while being inserted/deleted to/from the sockmap from another CPU:

Read side:

tcp_v4_rcv
  sk = __inet_lookup_skb(...)
  tcp_check_req(sk)
    inet_csk(sk)->icsk_af_ops->syn_recv_sock
      tcp_v4_syn_recv_sock
        tcp_create_openreq_child
          inet_csk_clone_lock
            sk_clone_lock
              READ_ONCE(sk->sk_prot)

Write side:

sock_map_ops->map_update_elem
  sock_map_update_elem
    sock_map_update_common
      sock_map_link_no_progs
        tcp_bpf_init
          tcp_bpf_update_sk_prot
            sk_psock_update_proto
              WRITE_ONCE(sk->sk_prot, ops)

sock_map_ops->map_delete_elem
  sock_map_delete_elem
    __sock_map_delete
     sock_map_unref
       sk_psock_put
         sk_psock_drop
           sk_psock_restore_proto
             tcp_update_ulp
               WRITE_ONCE(sk->sk_prot, proto)

Mark the shared access with READ_ONCE/WRITE_ONCE annotations.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200218171023.844439-2-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Yonghong Song e42da4c62a docs/bpf: Update bpf development Q/A file
bpf now has its own mailing list bpf@vger.kernel.org.
Update the bpf_devel_QA.rst file to reflect this.

Also llvm has switch to github with llvm and clang
in the same repo https://github.com/llvm/llvm-project.git.
Update the QA file with newer build instructions.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200221004354.930952-1-yhs@fb.com
2020-02-20 18:05:37 -08:00
Andrii Nakryiko 006ed53e8c selftests/bpf: Fix trampoline_count clean up logic
Libbpf's Travis CI tests caught this issue. Ensure bpf_link and bpf_object
clean up is performed correctly.

Fixes: d633d57902 ("selftest/bpf: Add test for allowed trampolines count")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20200220230546.769250-1-andriin@fb.com
2020-02-20 18:03:10 -08:00
Alexei Starovoitov 2c3a368127 Merge branch 'set_attach_target'
Eelco Chaudron says:

====================
Currently when you want to attach a trace program to a bpf program
the section name needs to match the tracepoint/function semantics.

However the addition of the bpf_program__set_attach_target() API
allows you to specify the tracepoint/function dynamically.

The call flow would look something like this:

  xdp_fd = bpf_prog_get_fd_by_id(id);
  trace_obj = bpf_object__open_file("func.o", NULL);
  prog = bpf_object__find_program_by_title(trace_obj,
                                           "fentry/myfunc");
  bpf_program__set_expected_attach_type(prog, BPF_TRACE_FENTRY);
  bpf_program__set_attach_target(prog, xdp_fd,
                                 "xdpfilt_blk_all");
  bpf_object__load(trace_obj)

v1 -> v2: Remove requirement for attach type hint in API
v2 -> v3: Moved common warning to __find_vmlinux_btf_id, requested by Andrii
          Updated the xdp_bpf2bpf test to use this new API
v3 -> v4: Split up patch, update libbpf.map version
v4 -> v5: Fix return code, and prog assignment in test case
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-02-20 17:51:40 -08:00
Eelco Chaudron 933ce62d68 selftests/bpf: Update xdp_bpf2bpf test to use new set_attach_target API
Use the new bpf_program__set_attach_target() API in the xdp_bpf2bpf
selftest so it can be referenced as an example on how to use it.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/158220520562.127661.14289388017034825841.stgit@xdp-tutorial
2020-02-20 17:48:40 -08:00
Eelco Chaudron ff26ce5cd7 libbpf: Add support for dynamic program attach target
Currently when you want to attach a trace program to a bpf program
the section name needs to match the tracepoint/function semantics.

However the addition of the bpf_program__set_attach_target() API
allows you to specify the tracepoint/function dynamically.

The call flow would look something like this:

  xdp_fd = bpf_prog_get_fd_by_id(id);
  trace_obj = bpf_object__open_file("func.o", NULL);
  prog = bpf_object__find_program_by_title(trace_obj,
                                           "fentry/myfunc");
  bpf_program__set_expected_attach_type(prog, BPF_TRACE_FENTRY);
  bpf_program__set_attach_target(prog, xdp_fd,
                                 "xdpfilt_blk_all");
  bpf_object__load(trace_obj)

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158220519486.127661.7964708960649051384.stgit@xdp-tutorial
2020-02-20 17:48:40 -08:00
Eelco Chaudron dd88aed92d libbpf: Bump libpf current version to v0.0.8
New development cycles starts, bump to v0.0.8.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158220518424.127661.8278643006567775528.stgit@xdp-tutorial
2020-02-20 17:48:40 -08:00
Andrii Nakryiko 5327644614 libbpf: Relax check whether BTF is mandatory
If BPF program is using BTF-defined maps, BTF is required only for
libbpf itself to process map definitions. If after that BTF fails to
be loaded into kernel (e.g., if it doesn't support BTF at all), this
shouldn't prevent valid BPF program from loading. Existing
retry-without-BTF logic for creating maps will succeed to create such
maps without any problems. So, presence of .maps section shouldn't make
BTF required for kernel. Update the check accordingly.

Validated by ensuring simple BPF program with BTF-defined maps is still
loaded on old kernel without BTF support and map is correctly parsed and
created.

Fixes: abd29c9314 ("libbpf: allow specifying map definitions using BTF")
Reported-by: Julia Kartseva <hex@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200220062635.1497872-1-andriin@fb.com
2020-02-20 11:03:39 -08:00
Alexei Starovoitov 500897804a selftests/bpf: Fix build of sockmap_ktls.c
The selftests fails to build with:
tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c: In function ‘test_sockmap_ktls_disconnect_after_delete’:
tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c:72:37: error: ‘TCP_ULP’ undeclared (first use in this function)
   72 |  err = setsockopt(cli, IPPROTO_TCP, TCP_ULP, "tls", strlen("tls"));
      |                                     ^~~~~~~

Similar to commit that fixes build of sockmap_basic.c on systems with old
/usr/include fix the build of sockmap_ktls.c

Fixes: d1ba1204f2 ("selftests/bpf: Test unhashing kTLS socket after removing from map")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200219205514.3353788-1-ast@kernel.org
2020-02-20 01:17:24 +01:00
Yonghong Song 83250f2b69 selftests/bpf: Change llvm flag -mcpu=probe to -mcpu=v3
The latest llvm supports cpu version v3, which is cpu version v1
plus some additional 64bit jmp insns and 32bit jmp insn support.

In selftests/bpf Makefile, the llvm flag -mcpu=probe did runtime
probe into the host system. Depending on compilation environments,
it is possible that runtime probe may fail, e.g., due to
memlock issue. This will cause generated code with cpu version v1.
This may cause confusion as the same compiler and the same C code
generates different byte codes in different environment.

Let us change the llvm flag -mcpu=probe to -mcpu=v3 so the
generated code will be the same regardless of the compilation
environment.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200219004236.2291125-1-yhs@fb.com
2020-02-19 15:15:07 -08:00
Alexei Starovoitov 03aa39558e Merge branch 'bpf_read_branch_records'
Daniel Xu says:

====================
Branch records are a CPU feature that can be configured to record
certain branches that are taken during code execution. This data is
particularly interesting for profile guided optimizations. perf has had
branch record support for a while but the data collection can be a bit
coarse grained.

We (Facebook) have seen in experiments that associating metadata with
branch records can improve results (after postprocessing). We generally
use bpf_probe_read_*() to get metadata out of userspace. That's why bpf
support for branch records is useful.

Aside from this particular use case, having branch data available to bpf
progs can be useful to get stack traces out of userspace applications
that omit frame pointers.

Changes in v8:
- Use globals instead of perf buffer
- Call test_perf_branches__detach() before destroying skeleton
- Fix typo in docs

Changes in v7:
- Const-ify and static-ify local var
- Documentation formatting

Changes in v6:
- Move #ifdef a little to avoid unused variable warnings on !x86
- Test negative condition in selftest (-EINVAL on improperly configured
  perf event)
- Skip positive condition selftest on setups that don't support branch
  records

Changes in v5:
- Rename bpf_perf_prog_read_branches() -> bpf_read_branch_records()
- Rename BPF_F_GET_BR_SIZE -> BPF_F_GET_BRANCH_RECORDS_SIZE
- Squash tools/ bpf.h sync into selftest commit

Changes in v4:
- Add BPF_F_GET_BR_SIZE flag
- Return -ENOENT on unsupported architectures
- Only accept initialized memory in helper
- Check buffer size is multiple of sizeof(struct perf_branch_entry)
- Use bpf skeleton in selftest
- Add commit messages
- Spelling and formatting

Changes in v3:
- Document filling unused buffer with zero
- Formatting fixes
- Rebase

Changes in v2:
- Change to a bpf helper instead of context access
- Avoid mentioning Intel specific things
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-02-19 15:01:12 -08:00
Daniel Xu 67306f84ca selftests/bpf: Add bpf_read_branch_records() selftest
Add a selftest to test:

* default bpf_read_branch_records() behavior
* BPF_F_GET_BRANCH_RECORDS_SIZE flag behavior
* error path on non branch record perf events
* using helper to write to stack
* using helper to write to global

On host with hardware counter support:

    # ./test_progs -t perf_branches
    #27/1 perf_branches_hw:OK
    #27/2 perf_branches_no_hw:OK
    #27 perf_branches:OK
    Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED

On host without hardware counter support (VM):

    # ./test_progs -t perf_branches
    #27/1 perf_branches_hw:OK
    #27/2 perf_branches_no_hw:OK
    #27 perf_branches:OK
    Summary: 1/2 PASSED, 1 SKIPPED, 0 FAILED

Also sync tools/include/uapi/linux/bpf.h.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200218030432.4600-3-dxu@dxuuu.xyz
2020-02-19 15:01:07 -08:00
Daniel Xu fff7b64355 bpf: Add bpf_read_branch_records() helper
Branch records are a CPU feature that can be configured to record
certain branches that are taken during code execution. This data is
particularly interesting for profile guided optimizations. perf has had
branch record support for a while but the data collection can be a bit
coarse grained.

We (Facebook) have seen in experiments that associating metadata with
branch records can improve results (after postprocessing). We generally
use bpf_probe_read_*() to get metadata out of userspace. That's why bpf
support for branch records is useful.

Aside from this particular use case, having branch data available to bpf
progs can be useful to get stack traces out of userspace applications
that omit frame pointers.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200218030432.4600-2-dxu@dxuuu.xyz
2020-02-19 14:37:36 -08:00
Daniel Borkmann 2f14b2d9dd Merge branch 'bpf-skmsg-simplify-restore'
Jakub Sitnicki says:

====================
This series has been split out from "Extend SOCKMAP to store listening
sockets" [0]. I think it stands on its own, and makes the latter series
smaller, which will make the review easier, hopefully.

The essence is that we don't need to do a complicated dance in
sk_psock_restore_proto, if we agree that the contract with tcp_update_ulp
is to restore callbacks even when the socket doesn't use ULP. This is what
tcp_update_ulp currently does, and we just make use of it.

Series is accompanied by a test for a particularly tricky case of restoring
callbacks when we have both sockmap and tls callbacks configured in
sk->sk_prot.

[0] https://lore.kernel.org/bpf/20200127131057.150941-1-jakub@cloudflare.com/
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2020-02-19 16:54:24 +01:00
Jakub Sitnicki d1ba1204f2 selftests/bpf: Test unhashing kTLS socket after removing from map
When a TCP socket gets inserted into a sockmap, its sk_prot callbacks get
replaced with tcp_bpf callbacks built from regular tcp callbacks. If TLS
gets enabled on the same socket, sk_prot callbacks get replaced once again,
this time with kTLS callbacks built from tcp_bpf callbacks.

Now, we allow removing a socket from a sockmap that has kTLS enabled. After
removal, socket remains with kTLS configured. This is where things things
get tricky.

Since the socket has a set of sk_prot callbacks that are a mix of kTLS and
tcp_bpf callbacks, we need to restore just the tcp_bpf callbacks to the
original ones. At the moment, it comes down to the the unhash operation.

We had a regression recently because tcp_bpf callbacks were not cleared in
this particular scenario of removing a kTLS socket from a sockmap. It got
fixed in commit 4da6a196f9 ("bpf: Sockmap/tls, during free we may call
tcp_bpf_unhash() in loop").

Add a test that triggers the regression so that we don't reintroduce it in
the future.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200217121530.754315-4-jakub@cloudflare.com
2020-02-19 16:54:05 +01:00
Jakub Sitnicki a178b45858 bpf, sk_msg: Don't clear saved sock proto on restore
There is no need to clear psock->sk_proto when restoring socket protocol
callbacks in sk->sk_prot. The psock is about to get detached from the sock
and eventually destroyed. At worst we will restore the protocol callbacks
and the write callback twice.

This makes reasoning about psock state easier. Once psock is initialized,
we can count on psock->sk_proto always being set.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200217121530.754315-3-jakub@cloudflare.com
2020-02-19 16:54:05 +01:00
Jakub Sitnicki a4393861a3 bpf, sk_msg: Let ULP restore sk_proto and write_space callback
We don't need a fallback for when the socket is not using ULP.
tcp_update_ulp handles this case exactly the same as we do in
sk_psock_restore_proto. Get rid of the duplicated code.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200217121530.754315-2-jakub@cloudflare.com
2020-02-19 16:54:05 +01:00
Song Liu b80b033bed bpf: Allow bpf_perf_event_read_value in all BPF programs
bpf_perf_event_read_value() is NMI safe. Enable it for all BPF programs.
This can be used in fentry/fexit to profile BPF program and individual
kernel function with hardware counters.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200214234146.2910011-1-songliubraving@fb.com
2020-02-18 16:08:27 +01:00
YueHaibing b182a66792 net: ena: remove set but not used variable 'hash_key'
drivers/net/ethernet/amazon/ena/ena_com.c: In function ena_com_hash_key_allocate:
drivers/net/ethernet/amazon/ena/ena_com.c:1070:50:
 warning: variable hash_key set but not used [-Wunused-but-set-variable]

commit 6a4f7dc82d ("net: ena: rss: do not allocate key when not supported")
introduced this, but not used, so remove it.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 22:32:50 -08:00
Gustavo A. R. Silva 2b73812483 net: netlink: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 19:05:06 -08:00
Gustavo A. R. Silva fbfc8502af net: switchdev: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 19:05:06 -08:00
Gustavo A. R. Silva 45a4296b6e bpf, sockmap: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 19:05:05 -08:00
Gustavo A. R. Silva 9814428a44 NFC: digital: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 19:05:05 -08:00
Gustavo A. R. Silva dc3cc347d2 net: usb: cdc-phonet: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 19:05:05 -08:00
Russell King 725d23b59c net: phy: allow bcm84881 to be a module
Now that the phylib module loading issue has been resolved, we can
allow this PHY driver to be built as a module.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 15:08:55 -08:00
David S. Miller 4c08222170 Merge branch 'net-smc-next'
Ursula Braun says:

====================
net/smc: patches 2020-02-17

here are patches for SMC making termination tasks more perfect.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:25 -08:00
Ursula Braun 5613f20c93 net/smc: reduce port_event scheduling
IB event handlers schedule the port event worker for further
processing of port state changes. This patch reduces the number of
schedules to avoid duplicate processing of the same port change.

Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:24 -08:00
Karsten Graul 5f78fe968d net/smc: simplify normal link termination
smc_lgr_terminate() and smc_lgr_terminate_sched() both result in soft
link termination, smc_lgr_terminate_sched() is scheduling a worker for
this task. Take out complexity by always using the termination worker
and getting rid of smc_lgr_terminate() completely.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:24 -08:00
Karsten Graul ba95206042 net/smc: remove unused parameter of smc_lgr_terminate()
The soft parameter of smc_lgr_terminate() is not used and obsolete.
Remove it.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:24 -08:00
Karsten Graul 3739707c45 net/smc: do not delete lgr from list twice
When 2 callers call smc_lgr_terminate() at the same time
for the same lgr, one gets the lgr_lock and deletes the lgr from the
list and releases the lock. Then the second caller gets the lock and
tries to delete it again.
In smc_lgr_terminate() add a check if the link group lgr is already
deleted from the link group list and prevent to try to delete it a
second time.
And add a check if the lgr is marked as freeing, which means that a
termination is already pending.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:24 -08:00
Karsten Graul 354ea2baa3 net/smc: use termination worker under send_lock
smc_tx_rdma_write() is called under the send_lock and should not call
smc_lgr_terminate() directly. Call smc_lgr_terminate_sched() instead
which schedules a worker.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:24 -08:00
Karsten Graul 55dd575817 net/smc: improve smc_lgr_cleanup()
smc_lgr_cleanup() is called during termination processing, there is no
need to send a DELETE_LINK at that time. A DELETE_LINK should have been
sent before the termination is initiated, if needed.
And remove the extra call to wake_up(&lnk->wr_reg_wait) because
smc_llc_link_inactive() already calls the related helper function
smc_wr_wakeup_reg_wait().

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:24 -08:00
David S. Miller 790a9a7cce Merge branch 'mlxsw-Reduce-dependency-between-bridge-and-router-code'
Ido Schimmel says:

====================
mlxsw: Reduce dependency between bridge and router code

This patch set reduces the dependency between the bridge and the router
code in preparation for RTNL removal from the route insertion path in
mlxsw.

The motivation and solution are explained in detail in patch #3. The
main idea is that we need to stop special-casing the VXLAN devices with
regards to the reference counting of the FIDs. Otherwise, we can bump
into the situation described in patch #3, where the routing code calls
into the bridge code which calls back into the routing code. After
adding a mutex to protect router data structures to remove RTNL
dependency, this can result in an AA deadlock.

Patches #1 and #2 are preparations. They convert the FIDs to use
'refcount_t' for reference counting in order to catch over/under flows
and add extack to the bridge creation function.

Patches #3-#5 reduce the dependency between the bridge and the router
code. First, by having the VXLAN device take a reference on the FID in
patch #3 and then by removing unnecessary code following the change in
patch #3.

Patches #6-#10 adjust existing selftests and add new ones to exercise
the new code paths.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:42:54 -08:00
Ido Schimmel 495c3da648 selftests: mlxsw: vxlan: Add test for error path
Test that when two VXLAN tunnels with conflicting configurations (i.e.,
different TTL) are enslaved to the same VLAN-aware bridge, then the
enslavement of a port to the bridge is denied.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:42:53 -08:00
Ido Schimmel 58ba0238e9 selftests: mlxsw: vxlan: Adjust test to recent changes
After recent changes, the VXLAN tunnel will be offloaded regardless if
any local ports are member in the FID or not. Adjust the test to make
sure the tunnel is offloaded in this case.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:42:53 -08:00
Ido Schimmel 6c4e61ff5f selftests: mlxsw: extack: Test creation of multiple VLAN-aware bridges
The driver supports a single VLAN-aware bridge. Test that the
enslavement of a port to the second VLAN-aware bridge fails with an
extack.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:42:53 -08:00
Ido Schimmel bdc58bea0d selftests: mlxsw: extack: Test bridge creation with VXLAN
Test that creation of a bridge (both VLAN-aware and VLAN-unaware) fails
with an extack when a VXLAN device with an unsupported configuration is
already enslaved to it.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:42:53 -08:00
Ido Schimmel 745a7ea72d selftests: mlxsw: Remove deprecated test
The addition of a VLAN on a bridge slave prompts the driver to have the
local port in question join the FID corresponding to this VLAN.

Before recent changes, the operation of joining the FID would also mean
that the driver would enable VXLAN tunneling if a VXLAN device was also
member in the VLAN. In case the configuration of the VXLAN tunnel was
not supported, an extack error would be returned.

Since the operation of joining the FID no longer means that VXLAN
tunneling is potentially enabled, the test is no longer relevant. Remove
it.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:42:53 -08:00
Ido Schimmel da1f9f8cb7 mlxsw: spectrum: Reduce dependency between bridge and router code
Commit f40be47a3e ("mlxsw: spectrum_router: Do not force specific
configuration order") added a call from the routing code to the bridge
code in order to handle the case where VNI should be set on a FID
following the joining of the router port to the FID.

This is no longer required, as previous patches made VXLAN devices
explicitly take a reference on the FID and set VNI on it.

Therefore, remove the unnecessary call and simply have the RIF take a
reference on the FID without checking if VNI should also be set on it.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:42:53 -08:00
Ido Schimmel 578e55124c mlxsw: spectrum_switchdev: Remove VXLAN checks during FID membership
As explained in previous patch, VXLAN devices now take a reference on
the FID and not only local ports. Therefore, there is no need for local
ports to check if they need to set a VNI on the FID when they join the
FID.

Remove these unnecessary checks.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:42:53 -08:00
Ido Schimmel 71afb45a14 mlxsw: spectrum_switchdev: Have VXLAN device take reference on FID
Up until now only local ports and the router port (which is also a local
port) took a reference on the corresponding FID (Filtering Identifier)
when joining a bridge. For example:

        192.0.2.1/24
            br0
             |
      +------+------+
      |             |
     swp1        vxlan0

In this case the reference count of the FID will be '2'. Since the VXLAN
device does not take a reference on the FID, whenever a local port joins
the bridge it needs to check if a VXLAN device is already enslaved. If
the VXLAN device should be mapped to the FID in question, then the VXLAN
device's VNI is set on the FID.

Beside the fact that this scheme special-cases the VXLAN device, it also
creates an unnecessary dependency between the routing and bridge code:

1. [R] IP address is added on 'br0', which prompts the creation of a RIF
   and a backing FID
2. [B] VNI is enabled on backing FID
3. [R] Host route corresponding to VXLAN device's source address is
   promoted to perform NVE decapsulation

[R] - Routing code
[B] - Bridge code

This back and forth dependency will become problematic when a lock is
added in the routing code instead of relying on RTNL, as it will result
in an AA deadlock.

Instead, have the VXLAN device take a reference on the FID just like all
the other netdev members of the bridge. In order to correctly handle the
case where VXLAN devices are already enslaved to the bridge when it is
offloaded, walk the bridge's slaves and replay the configuration.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:42:53 -08:00
Ido Schimmel 23a1a0b391 mlxsw: spectrum_switchdev: Propagate extack to bridge creation function
Propagate extack to bridge creation function so that error messages
could be passed to user space via netlink instead of printing them to
kernel log.

A subsequent patch will pass the new extack argument to more functions.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:42:53 -08:00
Ido Schimmel b96f546980 mlxsw: spectrum_fid: Use 'refcount_t' for FID reference counting
'refcount_t' is very useful for catching over/under flows. Convert the
FID (Filtering Identifier) objects to use it instead of 'unsigned int'
for their reference count.

A subsequent patch in the series will change the way VXLAN devices hold
/ release the FID reference, which is why the conversion is made now.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:42:53 -08:00
Julian Wiedmann 583cb0b412 net: bridge: teach ndo_dflt_bridge_getlink() more brport flags
This enables ndo_dflt_bridge_getlink() to report a bridge port's
offload settings for multicast and broadcast flooding.

CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:36:40 -08:00
David S. Miller 5f1475b171 Merge branch 'sfc-couple-more-ARFS-tidy-ups'
Edward Cree says:

====================
couple more ARFS tidy-ups

Tie up some loose ends from the recent ARFS work.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:35:30 -08:00
Edward Cree 025c5a0b58 sfc: move some ARFS code out of headers
efx_filter_rfs_expire() is a work-function, so it being inline makes no
 sense.  It's only ever used in efx_channels.c, so move it there.
While we're at it, clean out some related unused cruft.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:35:23 -08:00