Commit graph

1071876 commits

Alexei Starovoitov
79b203926d bpf: Convert bpf preload to light skeleton.
Convert bpffs preload iterators to light skeleton.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220131220528.98088-6-alexei.starovoitov@gmail.com
2022-02-01 23:56:18 +01:00
Alexei Starovoitov
1ddbddd706 bpf: Remove unnecessary setrlimit from bpf preload.
BPF programs and maps are memcg-accounted, so setrlimit is obsolete.
Remove its use from bpf preload.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220131220528.98088-5-alexei.starovoitov@gmail.com
2022-02-01 23:56:18 +01:00
Alexei Starovoitov
c69f94a33d libbpf: Open code raw_tp_open and link_create commands.
Open code the raw_tracepoint_open and link_create commands used by the light
skeleton, so that it can eventually avoid depending on full libbpf.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220131220528.98088-4-alexei.starovoitov@gmail.com
2022-02-01 23:56:18 +01:00
Alexei Starovoitov
e981f41fd0 libbpf: Open code low level bpf commands.
Open code the low-level bpf commands used by the light skeleton, so that it
can eventually avoid depending on full libbpf.
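
Here, "open code" means issuing the bpf() syscall directly rather than going
through libbpf's public wrappers. A minimal sketch of the idea (illustrative,
not the exact patch code):

  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  /* thin, dependency-free wrapper around the bpf() syscall */
  static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
                            unsigned int size)
  {
          return syscall(__NR_bpf, cmd, attr, size);
  }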

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220131220528.98088-3-alexei.starovoitov@gmail.com
2022-02-01 23:56:18 +01:00
Alexei Starovoitov
42d1d53fed libbpf: Add support for bpf iter in light skeleton.
bpf iterator programs should attach via bpf_link_create instead of
bpf_raw_tracepoint_open, which other tracing programs use.
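
In raw syscall terms, the difference boils down to the command used. A hedged
sketch building on a sys_bpf()-style wrapper; prog_fd is assumed to be the fd
of a loaded iterator program:

  union bpf_attr attr = {};
  int link_fd;

  attr.link_create.prog_fd = prog_fd;
  attr.link_create.attach_type = BPF_TRACE_ITER;
  /* BPF_LINK_CREATE instead of BPF_RAW_TRACEPOINT_OPEN */
  link_fd = sys_bpf(BPF_LINK_CREATE, &attr, sizeof(attr));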

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220131220528.98088-2-alexei.starovoitov@gmail.com
2022-02-01 23:56:18 +01:00
Andrii Nakryiko
533de4aea6 Merge branch 'libbpf: deprecate xdp_cpumap, xdp_devmap and classifier sec definitions'
Lorenzo Bianconi says:

====================

Deprecate xdp_cpumap, xdp_devmap and classifier sec definitions.
Update cpumap/devmap samples and kselftests.

Changes since v2:
- update warning log
- split libbpf and samples/kselftests changes
- deprecate classifier sec definition

Changes since v1:
- refer to Libbpf-1.0-migration-guide in the warning raised by libbpf
====================

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2022-02-01 09:51:32 -08:00
Lorenzo Bianconi
8bab532233 samples/bpf: Update cpumap/devmap sec_name
Substitute deprecated xdp_cpumap and xdp_devmap sec_name with
xdp/cpumap and xdp/devmap respectively.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/509201497c6c4926bc941f1cba24173cf500e760.1643727185.git.lorenzo@kernel.org
2022-02-01 09:51:31 -08:00
Lorenzo Bianconi
439f033656 selftests/bpf: Update cpumap/devmap sec_name
Substitute deprecated xdp_cpumap and xdp_devmap sec_name with
xdp/cpumap and xdp/devmap respectively in bpf kselftests.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/9a4286cd36781e2c31ba3773bfdcf45cf1bbaa9e.1643727185.git.lorenzo@kernel.org
2022-02-01 09:51:31 -08:00
Lorenzo Bianconi
4a4d4cee48 libbpf: Deprecate xdp_cpumap, xdp_devmap and classifier sec definitions
Deprecate xdp_cpumap, xdp_devmap and classifier sec definitions.
Introduce xdp/devmap and xdp/cpumap definitions according to the
standard for SEC("") in libbpf:
- prog_type.prog_flags/attach_place
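
A hedged example of a program using the new-style section name (the program
body is illustrative):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  SEC("xdp/devmap")
  int xdp_devmap_prog(struct xdp_md *ctx)
  {
          return XDP_PASS;  /* pass the frame on to the target device */
  }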

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/5c7bd9426b3ce6a31d9a4b1f97eb299e1467fc52.1643727185.git.lorenzo@kernel.org
2022-02-01 09:51:31 -08:00
Dave Marchevsky
5ee32ea24c libbpf: Deprecate btf_ext rec_size APIs
btf_ext__{func,line}_info_rec_size functions are used in conjunction
with already-deprecated btf_ext__reloc_{func,line}_info functions. Since
struct btf_ext is opaque to the user, it was necessary to expose rec_size
getters in the past.

btf_ext__reloc_{func,line}_info were deprecated in commit 8505e8709b
("libbpf: Implement generalized .BTF.ext func/line info adjustment")
as they're not compatible with support for multiple programs per
section. It was decided[0] that users of these APIs should implement their
own .btf.ext parsing to access this data, in which case the rec_size
getters are unnecessary. So deprecate them from libbpf 0.7.0 onwards.

  [0] Closes: https://github.com/libbpf/libbpf/issues/277

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220201014610.3522985-1-davemarchevsky@fb.com
2022-02-01 09:55:19 +01:00
Kenta Tada
0407a65f35 bpf: make bpf_copy_from_user_task() gpl only
access_process_vm() is exported by EXPORT_SYMBOL_GPL().

Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
Link: https://lore.kernel.org/r/20220128170906.21154-1-Kenta.Tada@sony.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-31 12:44:37 -08:00
Alexei Starovoitov
1fc5bdb2b8 Merge branch 'Split bpf_sock dst_port field'
Jakub Sitnicki says:

====================

This is a follow-up to the discussion in [1] around the idea of making
dst_port in struct bpf_sock a 16-bit field.

v2:
- use an anonymous field for zero padding (Alexei)

v1:
- keep dst_field offset unchanged to prevent existing BPF program breakage
  (Martin)
- allow 8-bit loads from dst_port[0] and [1]
- add test coverage for the verifier and the context access converter

[1] https://lore.kernel.org/bpf/87sftbobys.fsf@cloudflare.com/
====================

Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-31 12:39:19 -08:00
Jakub Sitnicki
8f50f16ff3 selftests/bpf: Extend verifier and bpf_sock tests for dst_port loads
Add coverage to the verifier tests and the tests for reading bpf_sock fields
to ensure that 32-bit, 16-bit, and 8-bit loads from the dst_port field are
allowed only at intended offsets and produce expected values.

While 16-bit and 8-bit access to the dst_port field is straightforward,
32-bit wide loads need to be allowed and must produce a zero-padded 16-bit
value for backward compatibility.
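
A hedged sketch in BPF C of the three access widths being exercised; sk is
assumed to be a struct bpf_sock pointer available to the program:

  __u32 w = *(__u32 *)&sk->dst_port; /* 32-bit legacy load, zero-padded */
  __u16 h = sk->dst_port;            /* 16-bit load of the whole field  */
  __u8  b = *(__u8 *)&sk->dst_port;  /* 8-bit load of the port MSB      */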

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20220130115518.213259-3-jakub@cloudflare.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-31 12:39:12 -08:00
Jakub Sitnicki
4421a58271 bpf: Make dst_port field in struct bpf_sock 16-bit wide
Menglong Dong reports that the documentation for the dst_port field in
struct bpf_sock is inaccurate and confusing. From the BPF program PoV, the
field is a zero-padded 16-bit integer in network byte order. The value
appears to the BPF user as if laid out in memory as follows:

  offsetof(struct bpf_sock, dst_port) + 0  <port MSB>
                                      + 8  <port LSB>
                                      +16  0x00
                                      +24  0x00

32-, 16-, and 8-bit wide loads from the field are all allowed, but only if
the offset into the field is 0.

32-bit wide loads from dst_port are especially confusing. The loaded value,
after converting to host byte order with bpf_ntohl(dst_port), contains the
port number in the upper 16-bits.

Remove the confusion by splitting the field into two 16-bit fields. For
backward compatibility, allow 32-bit wide loads from offsetof(struct
bpf_sock, dst_port).

While at it, allow 8-bit loads at offsets [0] and [1] from dst_port.
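
For reference, the resulting uapi layout is along these lines (a hedged
reconstruction; surrounding fields omitted):

  struct bpf_sock {
          /* ... */
          __be16 dst_port;        /* network byte order */
          __u16  :16;             /* zero padding */
          /* ... */
  };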

Reported-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20220130115518.213259-2-jakub@cloudflare.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-31 12:39:12 -08:00
Alexei Starovoitov
b3dddab2ff Merge branch 'selftests/bpf: use temp netns for testing'
Hangbin Liu says:

====================

Some bpf tests use hard-coded netns names like ns0, ns1, etc. Such names
are easily taken by other tests or by the system itself. If a netns with
the same name already exists, all the related tests will fail. So let's
use temporary netns names for testing.

The first patch not only changes to a temp netns but also fixes an
interface index issue, so I added a Fixes tag. The later patches are
updates rather than fixes, so no Fixes tags are added there.
====================

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 19:21:36 -08:00
Hangbin Liu
4ec25b49f4 selftests/bpf/test_xdp_redirect: use temp netns for testing
Use a temp netns instead of a hard-coded name for testing, in case the
netns already exists.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Link: https://lore.kernel.org/r/20220125081717.1260849-8-liuhangbin@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 19:21:28 -08:00
Hangbin Liu
36d9970e52 selftests/bpf/test_xdp_meta: use temp netns for testing
Use a temp netns instead of a hard-coded name for testing, in case the
netns already exists.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20220125081717.1260849-7-liuhangbin@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 19:21:28 -08:00
Hangbin Liu
ab6bcc2072 selftests/bpf/test_tcp_check_syncookie: use temp netns for testing
Use a temp netns instead of a hard-coded name for testing, in case the
netns already exists.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Lorenz Bauer <lmb@cloudflare.com>
Link: https://lore.kernel.org/r/20220125081717.1260849-6-liuhangbin@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 19:21:28 -08:00
Hangbin Liu
07c5855461 selftests/bpf/test_lwt_seg6local: use temp netns for testing
Use a temp netns instead of a hard-coded name for testing, in case the
netns already exists.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20220125081717.1260849-5-liuhangbin@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 19:21:27 -08:00
Hangbin Liu
3cc382e02f selftests/bpf/test_xdp_vlan: use temp netns for testing
Use a temp netns instead of a hard-coded name for testing, in case the
netns already exists.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20220125081717.1260849-4-liuhangbin@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 19:21:27 -08:00
Hangbin Liu
9d66c9ddc9 selftests/bpf/test_xdp_veth: use temp netns for testing
Use a temp netns instead of a hard-coded name for testing, in case the
netns already exists.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20220125081717.1260849-3-liuhangbin@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 19:21:27 -08:00
Hangbin Liu
cec74489a8 selftests/bpf/test_xdp_redirect_multi: use temp netns for testing
Use a temp netns instead of a hard-coded name for testing, in case the
netns already exists.

Remove the hard-coded interface index when creating the veth interfaces,
because when the system loads some virtual interface modules, e.g. tunnels,
ifindex 2 will already be in use and the command will fail.

As the netns has not been created if the environment check fails, set the
cleanup trap only after checking the environment.

Fixes: 8955c1a329 ("selftests/bpf/xdp_redirect_multi: Limit the tests in netns")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Link: https://lore.kernel.org/r/20220125081717.1260849-2-liuhangbin@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 19:21:27 -08:00
Hou Tao
b6ec79518e bpf, x86: Remove unnecessary handling of BPF_SUB atomic op
According to the LLVM commit (https://reviews.llvm.org/D72184),
__sync_fetch_and_sub() is implemented as a negation followed by
__sync_fetch_and_add(), so there will never be a BPF_SUB op; thus just
remove it. BPF_SUB is also rejected by the verifier anyway.
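
Conceptually, the lowering looks like this (an illustration of the
equivalence, not the patch code):

  long v = 0, d = 1;

  __sync_fetch_and_sub(&v, d);   /* what the C source says...           */
  __sync_fetch_and_add(&v, -d);  /* ...and the equivalent form emitted, */
                                 /* so only BPF_ADD reaches the JIT     */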

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Brendan Jackman <jackmanb@google.com>
Link: https://lore.kernel.org/bpf/20220127083240.1425481-1-houtao1@huawei.com
2022-01-27 22:47:05 +01:00
Alexei Starovoitov
50fc9786b2 Merge branch 'bpf: add __user tagging support in vmlinux BTF'
Yonghong Song says:

====================

The __user attribute is currently mainly used by sparse for type checking.
The attribute indicates whether a memory access is in user memory address
space or not. Such information is important when tracing kernel-internal
functions or data structures, as accessing user memory involves different
mechanisms than accessing kernel memory. For example, perf-probe needs an
explicit command line specification to indicate that a particular argument
or string is in user-space memory ([1], [2], [3]).
Currently, vmlinux BTF is available in kernel with many distributions.
If __user attribute information is available in vmlinux BTF, the explicit
user memory access information from users will not be necessary as
the kernel can figure it out by itself with vmlinux BTF.

Besides the above possible use for perf/probe, another use case is
for the bpf verifier. Currently, for BPF_PROG_TYPE_TRACING bpf
programs, users can write direct code like
  p->m1->m2
where "p" could be a function parameter. Without __user information in BTF,
the verifier will assume p->m1 accesses kernel memory and will generate
normal loads. But say "p" is actually tagged with __user in the source
code. In such cases, p->m1 actually accesses user memory, so a direct
load is not right and may produce an incorrect result. For such cases,
bpf_probe_read_user() is the correct way to read p->m1.

To support encoding __user information in BTF, a new attribute
  __attribute__((btf_type_tag("<arbitrary_string>")))
is implemented in clang ([4]). For example, if we have
  #define __user __attribute__((btf_type_tag("user")))
during kernel compilation, the attribute "user" information will
be preserved in dwarf. After pahole converts dwarf to BTF, __user
information will be available in vmlinux BTF, and such information
can be used by the bpf verifier, perf/probe, or other use cases.

Currently, btf_type_tag is only supported in clang (>= clang14) and
pahole (>= 1.23). gcc support is also proposed and under development ([5]).

In the rest of the patch set, Patch 1 adds support for the __user
btf_type_tag during compilation. Patch 2 adds bpf verifier support to
utilize the __user tag information to reject bpf programs that do not use
the proper helper to access user memory. Patches 3-5 are bpf selftests which
demonstrate that the verifier can reject direct user memory accesses.

  [1] http://lkml.kernel.org/r/155789874562.26965.10836126971405890891.stgit@devnote2
  [2] http://lkml.kernel.org/r/155789872187.26965.4468456816590888687.stgit@devnote2
  [3] http://lkml.kernel.org/r/155789871009.26965.14167558859557329331.stgit@devnote2
  [4] https://reviews.llvm.org/D111199
  [5] https://lore.kernel.org/bpf/0cbeb2fb-1a18-f690-e360-24b1c90c2a91@fb.com/

Changelog:
  v2 -> v3:
    - remove FLAG_DONTCARE enumerator and just use 0 as dontcare flag.
    - explain how btf type_tag is encoded in btf type chain.
  v1 -> v2:
    - use MEM_USER flag for PTR_TO_BTF_ID reg type instead of a separate
      field to encode __user tag.
    - add a test with kernel function __sys_getsockname which has __user tagged
      argument.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 12:03:47 -08:00
Yonghong Song
b72903847a docs/bpf: clarify how btf_type_tag gets encoded in the type chain
Clarify where the BTF_KIND_TYPE_TAG gets encoded in the type chain,
so applications and kernel can properly parse them.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220127154627.665163-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 12:03:47 -08:00
Yonghong Song
67ef7e1a75 selftests/bpf: specify pahole version requirement for btf_tag test
Specify pahole version requirement (1.23) for btf_tag subtests
btf_type_tag_user_{mod1, mod2, vmlinux}.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220127154622.663337-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 12:03:47 -08:00
Yonghong Song
696c390115 selftests/bpf: add a selftest with __user tag
Add a selftest with three __user usages: a __user pointer-type argument
in bpf_testmod, a __user pointer-type struct member in bpf_testmod,
and a __user pointer-type struct member in vmlinux. In all cases,
directly accessing the user memory results in a verification failure.

  $ ./test_progs -v -n 22/3
  ...
  libbpf: prog 'test_user1': BPF program load failed: Permission denied
  libbpf: prog 'test_user1': -- BEGIN PROG LOAD LOG --
  R1 type=ctx expected=fp
  0: R1=ctx(id=0,off=0,imm=0) R10=fp0
  ; int BPF_PROG(test_user1, struct bpf_testmod_btf_type_tag_1 *arg)
  0: (79) r1 = *(u64 *)(r1 +0)
  func 'bpf_testmod_test_btf_type_tag_user_1' arg0 has btf_id 136561 type STRUCT 'bpf_testmod_btf_type_tag_1'
  1: R1_w=user_ptr_bpf_testmod_btf_type_tag_1(id=0,off=0,imm=0)
  ; g = arg->a;
  1: (61) r1 = *(u32 *)(r1 +0)
  R1 invalid mem access 'user_ptr_'
  ...
  #22/3 btf_tag/btf_type_tag_user_mod1:OK

  $ ./test_progs -v -n 22/4
  ...
  libbpf: prog 'test_user2': BPF program load failed: Permission denied
  libbpf: prog 'test_user2': -- BEGIN PROG LOAD LOG --
  R1 type=ctx expected=fp
  0: R1=ctx(id=0,off=0,imm=0) R10=fp0
  ; int BPF_PROG(test_user2, struct bpf_testmod_btf_type_tag_2 *arg)
  0: (79) r1 = *(u64 *)(r1 +0)
  func 'bpf_testmod_test_btf_type_tag_user_2' arg0 has btf_id 136563 type STRUCT 'bpf_testmod_btf_type_tag_2'
  1: R1_w=ptr_bpf_testmod_btf_type_tag_2(id=0,off=0,imm=0)
  ; g = arg->p->a;
  1: (79) r1 = *(u64 *)(r1 +0)          ; R1_w=user_ptr_bpf_testmod_btf_type_tag_1(id=0,off=0,imm=0)
  ; g = arg->p->a;
  2: (61) r1 = *(u32 *)(r1 +0)
  R1 invalid mem access 'user_ptr_'
  ...
  #22/4 btf_tag/btf_type_tag_user_mod2:OK

  $ ./test_progs -v -n 22/5
  ...
  libbpf: prog 'test_sys_getsockname': BPF program load failed: Permission denied
  libbpf: prog 'test_sys_getsockname': -- BEGIN PROG LOAD LOG --
  R1 type=ctx expected=fp
  0: R1=ctx(id=0,off=0,imm=0) R10=fp0
  ; int BPF_PROG(test_sys_getsockname, int fd, struct sockaddr *usockaddr,
  0: (79) r1 = *(u64 *)(r1 +8)
  func '__sys_getsockname' arg1 has btf_id 2319 type STRUCT 'sockaddr'
  1: R1_w=user_ptr_sockaddr(id=0,off=0,imm=0)
  ; g = usockaddr->sa_family;
  1: (69) r1 = *(u16 *)(r1 +0)
  R1 invalid mem access 'user_ptr_'
  ...
  #22/5 btf_tag/btf_type_tag_user_vmlinux:OK

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220127154616.659314-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 12:03:46 -08:00
Yonghong Song
571d01a9d0 selftests/bpf: rename btf_decl_tag.c to test_btf_decl_tag.c
The uapi btf.h contains the following declaration:
  struct btf_decl_tag {
       __s32   component_idx;
  };

The skeleton will also generate a struct with name
"btf_decl_tag" for bpf program btf_decl_tag.c.

Rename btf_decl_tag.c to test_btf_decl_tag.c so
the corresponding skeleton struct name becomes
"test_btf_decl_tag". This way, we could include
uapi btf.h in prog_tests/btf_tag.c.
There is no functionality change for this patch.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220127154611.656699-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 12:03:46 -08:00
Yonghong Song
c6f1bfe89a bpf: reject program if a __user tagged memory accessed in kernel way
The BPF verifier supports direct memory access for the BPF_PROG_TYPE_TRACING
type of bpf programs, e.g., a->b. If "a" is a pointer
pointing to kernel memory, the bpf verifier will allow the user to write
code in C like a->b and will translate it to a kernel
load properly. If "a" is a pointer to user memory, the bpf
developer is expected to use the bpf_probe_read_user() helper to
get the value of a->b. Without utilizing the BTF __user tagging information,
the current verifier will assume that a->b is a kernel memory access
and this may generate an incorrect result.

Now that BTF contains __user information, the verifier can check whether the
pointer points to user memory or not. If it does, the verifier
can reject the program and force users to use the bpf_probe_read_user()
helper explicitly.

In the future, we can easily extend btf_add_space for other
address space tagging, for example, rcu/percpu etc.
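
As a hedged sketch, reusing the p->m1 example from above, a tracing program
would read the __user-tagged member through the helper:

  int m1;

  if (bpf_probe_read_user(&m1, sizeof(m1), &p->m1) < 0)
          return 0;  /* direct p->m1 would now be rejected */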

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220127154606.654961-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 12:03:46 -08:00
Yonghong Song
7472d5a642 compiler_types: define __user as __attribute__((btf_type_tag("user")))
The __user attribute is currently mainly used by sparse for type checking.
The attribute indicates whether a memory access is in user memory address
space or not. Such information is important when tracing kernel-internal
functions or data structures, as accessing user memory involves different
mechanisms than accessing kernel memory. For example, perf-probe needs an
explicit command line specification to indicate that a particular argument
or string is in user-space memory ([1], [2], [3]).
Currently, vmlinux BTF is available in kernel with many distributions.
If __user attribute information is available in vmlinux BTF, the explicit
user memory access information from users will not be necessary as
the kernel can figure it out by itself with vmlinux BTF.

Besides the above possible use for perf/probe, another use case is
for the bpf verifier. Currently, for BPF_PROG_TYPE_TRACING bpf
programs, users can write direct code like
  p->m1->m2
where "p" could be a function parameter. Without __user information in BTF,
the verifier will assume p->m1 accesses kernel memory and will generate
normal loads. But say "p" is actually tagged with __user in the source
code. In such cases, p->m1 actually accesses user memory, so a direct
load is not right and may produce an incorrect result. For such cases,
bpf_probe_read_user() is the correct way to read p->m1.

To support encoding __user information in BTF, a new attribute
  __attribute__((btf_type_tag("<arbitrary_string>")))
is implemented in clang ([4]). For example, if we have
  #define __user __attribute__((btf_type_tag("user")))
during kernel compilation, the attribute "user" information will
be preserved in dwarf. After pahole converts dwarf to BTF, __user
information will be available in vmlinux BTF.

The following is an example with latest upstream clang (clang14) and
pahole 1.23:

  [$ ~] cat test.c
  #define __user __attribute__((btf_type_tag("user")))
  int foo(int __user *arg) {
          return *arg;
  }
  [$ ~] clang -O2 -g -c test.c
  [$ ~] pahole -JV test.o
  ...
  [1] INT int size=4 nr_bits=32 encoding=SIGNED
  [2] TYPE_TAG user type_id=1
  [3] PTR (anon) type_id=2
  [4] FUNC_PROTO (anon) return=1 args=(3 arg)
  [5] FUNC foo type_id=4
  [$ ~]

You can see for the function argument "int __user *arg", its type is
described as
  PTR -> TYPE_TAG(user) -> INT
The kernel can use this information for bpf verification or other
use cases.

Currently, btf_type_tag is only supported in clang (>= clang14) and
pahole (>= 1.23). gcc support is also proposed and under development ([5]).

  [1] http://lkml.kernel.org/r/155789874562.26965.10836126971405890891.stgit@devnote2
  [2] http://lkml.kernel.org/r/155789872187.26965.4468456816590888687.stgit@devnote2
  [3] http://lkml.kernel.org/r/155789871009.26965.14167558859557329331.stgit@devnote2
  [4] https://reviews.llvm.org/D111199
  [5] https://lore.kernel.org/bpf/0cbeb2fb-1a18-f690-e360-24b1c90c2a91@fb.com/

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220127154600.652613-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 12:03:46 -08:00
Pavel Begunkov
46531a3036 cgroup/bpf: fast path skb BPF filtering
Even though there is a static key protecting from the overhead of
cgroup-bpf skb filtering when there is nothing attached, in many cases
it's not enough, as registering a filter for one type will ruin the fast
path for all others. It's observed in production servers I've looked
at but also in laptops, where registration is done during init by
systemd or something else.

Add a per-socket fast path check guarding against such overhead. This
affects both receive and transmit paths of TCP, UDP and other
protocols. It showed a ~1% tx/s improvement in small-payload UDP
send benchmarks using a real NIC in a server environment, and the
number jumps to 2-3% for preemptible kernels.

Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/d8c58857113185a764927a46f4b5a058d36d3ec3.1643292455.git.asml.silence@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 10:15:00 -08:00
Yonghong Song
cdb5ed9796 selftests/bpf: fix a clang compilation error
When building selftests/bpf with clang
  make -j LLVM=1
  make -C tools/testing/selftests/bpf -j LLVM=1
I hit the following compilation error:

  trace_helpers.c:152:9: error: variable 'found' is used uninitialized whenever 'while' loop exits because its condition is false [-Werror,-Wsometimes-uninitialized]
          while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &base) == 4) {
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  trace_helpers.c:161:7: note: uninitialized use occurs here
          if (!found)
               ^~~~~
  trace_helpers.c:152:9: note: remove the condition if it is always true
          while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &base) == 4) {
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                 1
  trace_helpers.c:145:12: note: initialize the variable 'found' to silence this warning
          bool found;
                    ^
                     = false

It is possible that for a sane /proc/self/maps we may never hit the above
issue in practice. But let us initialize the variable 'found' properly to
silence the compilation error.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220127163726.1442032-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 09:48:49 -08:00
Magnus Karlsson
3b22523bca selftests, xsk: Fix bpf_res cleanup test
After commit 710ad98c36 ("veth: Do not record rx queue hint in veth_xmit"),
veth no longer receives traffic on the same queue as it was sent on. This
breaks the bpf_res test for the AF_XDP selftests as the socket tied to
queue 1 will not receive traffic anymore.

Modify the test so that two sockets are tied to queue id 0 using a shared
umem instead. When the first socket is killed, enter the second socket into
the xskmap so that traffic will flow to it. This still tests that the
resources are not cleaned up until after the second socket dies, without
having to rely on veth supporting rx_queue hints.

Reported-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220125082945.26179-1-magnus.karlsson@gmail.com
2022-01-27 17:43:51 +01:00
Daniel Borkmann
33372bc274 Merge branch 'xsk-batching'
Maciej Fijalkowski says:

====================
Unfortunately, scalability issues similar to those that were addressed for
XDP processing in ice exist for XDP in the zero-copy driver used by AF_XDP.
Let's resolve them in mostly the same way as we did in [0] and utilize
the Tx batching API from the XSK buffer pool.

Move the array of Tx descriptors that is used with the batching approach to
the XSK buffer pool. This means that future users of this API will not
have to carry the array on their own side; they can simply refer to the
pool's tx_desc array.

We also improve the Rx side where we extend ice_alloc_rx_buf_zc() to
handle the ring wrap and bump Rx tail more frequently. By doing so,
the Rx side is adjusted to Tx, which was needed for the l2fwd scenario.

Here are the performance improvements that this set brings, measured
with the xdpsock app in busy-poll mode for 1-core and 2-core modes.
Both Tx and Rx rings were sized to 1k length and busy poll budget was
256.

----------------------------------------------------------------
     |      txonly       |      l2fwd      |      rxdrop
----------------------------------------------------------------
1C   |       149%        |       14%       |        3%
----------------------------------------------------------------
2C   |       134%        |       20%       |        5%
----------------------------------------------------------------

Next step will be to introduce batching onto Rx side.

v5:
* collect acks
* fix typos
* correct comments showing cache line boundaries in ice_tx_ring struct
v4 - address Alexandr's review:
* new patch (2) for making sure ring size is pow(2) when attaching
  xsk socket
* don't open code ALIGN_DOWN (patch 3)
* resign from storing tx_thresh in ice_tx_ring (patch 4)
* scope variables in a better way for Tx batching (patch 7)
v3:
* drop likely() that was wrapping napi_complete_done (patch 1)
* introduce configurable Tx threshold (patch 2)
* handle ring wrap on Rx side when allocating buffers (patch 3)
* respect NAPI budget when cleaning Tx descriptors in ZC (patch 6)
v2:
* introduce new patch that resets @next_dd and @next_rs fields
* use batching API for AF_XDP Tx on ice side

  [0]: https://lore.kernel.org/bpf/20211015162908.145341-8-anthony.l.nguyen@intel.com/
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2022-01-27 17:27:31 +01:00
Maciej Fijalkowski
59e92bfe4d ice: xsk: Borrow xdp_tx_active logic from i40e
One of the things that commit 5574ff7b7b ("i40e: optimize AF_XDP Tx
completion path") introduced was the @xdp_tx_active field. Its usage
from i40e can be adjusted to ice driver and give us positive performance
results.

If the descriptor that @next_dd points to has been sent by HW (its DD
bit is set), then we are sure that at least quarter of the ring is ready
to be cleaned. If @xdp_tx_active is 0 which means that related xdp_ring
is not used for XDP_{TX, REDIRECT} workloads, then we know how many XSK
entries should placed to completion queue, IOW walking through the ring
can be skipped.
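
A hedged sketch of the fast path; xsk_entry() is a hypothetical predicate
standing in for the per-descriptor check:

  static u32 count_xsk_frames(struct ice_tx_ring *xdp_ring, u32 done)
  {
          u32 i, xsk_frames = 0;

          if (!xdp_ring->xdp_tx_active)
                  return done;  /* no XDP_TX/REDIRECT frames in flight */

          for (i = 0; i < done; i++)  /* slow path: walk the ring */
                  xsk_frames += xsk_entry(xdp_ring, i);

          return xsk_frames;
  }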

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-9-maciej.fijalkowski@intel.com
2022-01-27 17:25:33 +01:00
Maciej Fijalkowski
126cdfe100 ice: xsk: Improve AF_XDP ZC Tx and use batching API
Apply the logic that was done for regular XDP from commit 9610bd988d
("ice: optimize XDP_TX workloads") to the ZC side of the driver. On top
of that, introduce batching to Tx that is inspired by i40e's
implementation, with adjustments to the cleaning logic - take the
NAPI budget into account in ice_clean_xdp_irq_zc().

Separating the stats structs onto separate cache lines seemed to improve
the performance.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-8-maciej.fijalkowski@intel.com
2022-01-27 17:25:33 +01:00
Maciej Fijalkowski
86e3f78c8d ice: xsk: Avoid potential dead AF_XDP Tx processing
Commit 9610bd988d ("ice: optimize XDP_TX workloads") introduced
@next_dd and @next_rs to the ice_tx_ring struct. Currently, their state is
not restored in ice_clean_tx_ring(), which was not causing any trouble
as the XDP rings are gone after we're done with the XDP prog on the
interface.

For upcoming usage of mentioned fields in AF_XDP, this might expose us
to a potential dead Tx side. Scenario would look like following (based
on xdpsock):

- two xdpsock instances are spawned in Tx mode
- one of them is killed
- XDP prog is kept on interface due to the other xdpsock still running
  * this means that XDP rings stayed in place
- xdpsock is launched again on same queue id that was terminated on
- @next_dd and @next_rs setting is bogus, therefore transmit side is
  broken

To protect us from the above, restore the initial @next_rs and @next_dd
values when cleaning the Tx ring.
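
The restore itself is small; a hedged sketch, assuming the quarter-of-ring
threshold introduced earlier in the series:

  tx_ring->next_dd = ICE_RING_QUARTER(tx_ring) - 1;
  tx_ring->next_rs = ICE_RING_QUARTER(tx_ring) - 1;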

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-7-maciej.fijalkowski@intel.com
2022-01-27 17:25:33 +01:00
Magnus Karlsson
d1bc532e99 i40e: xsk: Move tmp desc array from driver to pool
Move desc_array from the driver to the pool. The reason behind this is
that we can then reuse this array as a temporary storage for descriptors
in all zero-copy drivers that use the batched interface. This will make
it easier to add batching to more drivers.

i40e is the only driver that has a batched Tx zero-copy
implementation, so no need to touch any other driver.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-6-maciej.fijalkowski@intel.com
2022-01-27 17:25:32 +01:00
Maciej Fijalkowski
3dd411efe1 ice: Make Tx threshold dependent on ring length
XDP_TX workloads use a concept of a Tx threshold that indicates the
interval of setting the RS bit on descriptors, which in turn tells the HW to
generate an interrupt to signal the completion of Tx on the HW side. It is
currently based on a constant value of 32, which might not work out well
for various ring sizes combined with, for example, the batch size that can
be set via SO_BUSY_POLL_BUDGET.

Internal tests based on AF_XDP showed that the most convenient setting of
the mentioned threshold is a quarter of the ring length.

Make use of the recently introduced ICE_RING_QUARTER macro and use this
value as a substitute for ICE_TX_THRESH.

Also align the ethtool -G callback so that the next_dd/next_rs fields stay
up to date in terms of the ring size.
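
Assuming ICE_RING_QUARTER has the obvious shape, the threshold then scales
with the ring automatically:

  #define ICE_RING_QUARTER(R) ((R)->count >> 2)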

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-5-maciej.fijalkowski@intel.com
2022-01-27 17:25:32 +01:00
Maciej Fijalkowski
3876ff525d ice: xsk: Handle SW XDP ring wrap and bump tail more often
Currently, if ice_clean_rx_irq_zc() processed the whole ring and
next_to_use != 0, then ice_alloc_rx_buf_zc() would not refill the whole
ring even if the XSK buffer pool would have enough free entries (either
from fill ring or the internal recycle mechanism) - it is because ring
wrap is not handled.

Improve the logic in ice_alloc_rx_buf_zc() to address the problem above.
Do not clamp the count of buffers that is passed to
xsk_buff_alloc_batch() in the case when next_to_use + buffer count >=
rx_ring->count, but rather split it and have two calls to the mentioned
function - one for the part up until the wrap and one for the part after
the wrap.
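
A hedged sketch of the split; the allocator helper name is illustrative:

  u16 ntu = rx_ring->next_to_use;
  bool ok;

  if (ntu + count >= rx_ring->count) {
          u16 until_wrap = rx_ring->count - ntu;

          ok = __alloc_zc(rx_ring, until_wrap) &&       /* up to the wrap */
               __alloc_zc(rx_ring, count - until_wrap); /* after the wrap */
  } else {
          ok = __alloc_zc(rx_ring, count);
  }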

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-4-maciej.fijalkowski@intel.com
2022-01-27 17:25:32 +01:00
Maciej Fijalkowski
296f13ff38 ice: xsk: Force rings to be sized to power of 2
With the upcoming introduction of batching to the XSK data path, it is
best performance-wise to have the ring descriptor count aligned to a
power of 2.

Check whether the ring sizes that the user is going to attach the XSK socket
to fulfill the condition above. For the Tx side, although the check is done
against the Tx queue while in the end the socket will be attached to the XDP
queue, it is fine since XDP queues get the ring->count setting from Tx
queues.
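
A hedged sketch of the check, using the kernel's is_power_of_2() from
<linux/log2.h>; the ring lookup is illustrative:

  if (!is_power_of_2(vsi->rx_rings[qid]->count) ||
      !is_power_of_2(vsi->tx_rings[qid]->count)) {
          netdev_err(vsi->netdev, "Please align ring sizes to power of 2\n");
          return -EINVAL;
  }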

Suggested-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-3-maciej.fijalkowski@intel.com
2022-01-27 17:25:32 +01:00
Maciej Fijalkowski
a4e186693c ice: Remove likely for napi_complete_done
Remove the likely before napi_complete_done as this is the unlikely case
when busy-poll is used. Removing this has a positive performance impact
for busy-poll and no negative impact on the regular case.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-2-maciej.fijalkowski@intel.com
2022-01-27 17:25:32 +01:00
Jakub Kicinski
8033c6c2fe bpf: remove unused static inlines
Remove two dead stubs: sk_msg_clear_meta() was never
used, and the use of xskq_cons_is_full() was replaced by
xsk_tx_writeable() in v5.10.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20220126185412.2776254-1-kuba@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-26 21:17:53 -08:00
Andrii Nakryiko
ff943683f8 selftests/bpf: fix uprobe offset calculation in selftests
Fix how selftests determine the relative offset of a function that is
uprobed. Previously, there was an assumption that the uprobed function is
always in the first executable region, which is not always the case
(libbpf CI hits this case now), so the get_base_addr() approach in isolation
doesn't work anymore. Teach get_uprobe_offset() to determine the correct
memory mapping and calculate the uprobe offset correctly.

While at it, I merged together two implementations of the
get_uprobe_offset() helper, moving the powerpc64-specific logic inside (had
to add extra {} block to avoid unused variable error for insn).

Also ensured that uprobed functions are never inlined, but are still
static (and thus local to each selftest), by using a no-op asm volatile
block internally. I didn't want to keep them global __weak, because some
tests use uprobe's ref counter offset (to test USDT-like logic) which is
not compatible with non-refcounted uprobe. So it's nicer to have each
test uprobe target local to the file and guaranteed to not be inlined or
skipped by the compiler (which can happen with static functions,
especially if compiling selftests with -O2).
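
The resulting pattern for a target function looks roughly like this (the
function name is illustrative):

  /* local to the selftest, never inlined, never optimized away */
  static noinline void uprobe_target(void)
  {
          asm volatile ("");
  }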

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220126193058.3390292-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-26 20:04:01 -08:00
Yonghong Song
e5465a9027 selftests/bpf: Fix a clang compilation error
Compiling the kernel and selftests/bpf with the latest llvm as below:
  make -j LLVM=1
  make -C tools/testing/selftests/bpf -j LLVM=1
I hit the following compilation error:
  /.../prog_tests/log_buf.c:215:6: error: variable 'log_buf' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
          if (!ASSERT_OK_PTR(raw_btf_data, "raw_btf_data_good"))
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  /.../prog_tests/log_buf.c:264:7: note: uninitialized use occurs here
          free(log_buf);
               ^~~~~~~
  /.../prog_tests/log_buf.c:215:2: note: remove the 'if' if its condition is always false
          if (!ASSERT_OK_PTR(raw_btf_data, "raw_btf_data_good"))
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  /.../prog_tests/log_buf.c:205:15: note: initialize the variable 'log_buf' to silence this warning
          char *log_buf;
                       ^
                        = NULL
  1 error generated.

The compiler rightfully detected that log_buf is uninitialized in one of the
failure paths, as indicated above.

Properly initializing the 'log_buf' variable fixes the issue.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220126181940.4105997-1-yhs@fb.com
2022-01-26 12:07:01 -08:00
Stanislav Fomichev
c446fdacb1 bpf: fix register_btf_kfunc_id_set for !CONFIG_DEBUG_INFO_BTF
Commit dee872e124 ("bpf: Populate kfunc BTF ID sets in struct btf")
breaks loading of some modules when CONFIG_DEBUG_INFO_BTF is not set.
register_btf_kfunc_id_set returns -ENOENT to the callers when
there is no module btf. Let's return 0 (success) instead to let
those modules work in !CONFIG_DEBUG_INFO_BTF cases.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: dee872e124 ("bpf: Populate kfunc BTF ID sets in struct btf")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220126001340.1573649-1-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-25 19:27:28 -08:00
Felix Maurer
fc1ca95585 selftests: bpf: Less strict size check in sockopt_sk
Originally, the kernel strictly checked the size of the optval in
getsockopt(TCP_ZEROCOPY_RECEIVE) to be equal to sizeof(struct
tcp_zerocopy_receive). With c8856c0514, this was changed to allow
optvals of different sizes.

The bpf code in the sockopt_sk test was still performing the strict size
check. This fix adapts the kernel behavior from c8856c0514 in the
selftest, i.e., it just checks whether the required fields are there.

Fixes: 9cacf81f81 ("bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE")
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/6f569cca2e45473f9a724d54d03fdfb45f29e35f.1643129402.git.fmaurer@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-25 18:25:05 -08:00
Alexei Starovoitov
451c426044 Merge branch 'libbpf: deprecate some setter and getter APIs'
Andrii Nakryiko says:

====================

Another batch of simple deprecations. One of the last few, hopefully, as we
are getting close to deprecating all the planned APIs/features. See
individual patches for details.

v1->v2:
  - rebased on latest bpf-next, fixed Closes: reference.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-25 17:59:07 -08:00
Andrii Nakryiko
23fcfcf8bb perf: use generic bpf_program__set_type() to set BPF prog type
The bpf_program__set_<type>() APIs are deprecated; use the generic
bpf_program__set_type() API instead.
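
For example (a hedged before/after; prog is assumed to be a valid
struct bpf_program pointer):

  bpf_program__set_tracepoint(prog);                     /* deprecated */
  bpf_program__set_type(prog, BPF_PROG_TYPE_TRACEPOINT); /* preferred  */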

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220124194254.2051434-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-25 17:59:07 -08:00
Andrii Nakryiko
61afd3da08 samples/bpf: use preferred getters/setters instead of deprecated ones
Use preferred setter and getter APIs instead of deprecated ones.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220124194254.2051434-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-25 17:59:07 -08:00