Commit graph

1071876 commits

Alexei Starovoitov
79b203926d bpf: Convert bpf preload to light skeleton.
Convert bpffs preload iterators to light skeleton.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220131220528.98088-6-alexei.starovoitov@gmail.com
2022-02-01 23:56:18 +01:00
Alexei Starovoitov
1ddbddd706 bpf: Remove unnecessary setrlimit from bpf preload.
BPF programs and maps are memcg-accounted, so setrlimit is obsolete.
Remove its use from bpf preload.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220131220528.98088-5-alexei.starovoitov@gmail.com
2022-02-01 23:56:18 +01:00
Alexei Starovoitov
c69f94a33d libbpf: Open code raw_tp_open and link_create commands.
Open code the raw_tracepoint_open and link_create commands used by the light
skeleton, so that it can eventually avoid depending on full libbpf.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220131220528.98088-4-alexei.starovoitov@gmail.com
2022-02-01 23:56:18 +01:00
Alexei Starovoitov
e981f41fd0 libbpf: Open code low level bpf commands.
Open code the low-level bpf commands used by the light skeleton, so that it
can eventually avoid depending on full libbpf.
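
Here, "open code" means issuing the bpf() syscall directly rather than going
through libbpf's public wrappers. A minimal sketch of the idea (illustrative,
not the exact patch code):

  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  /* thin, dependency-free wrapper around the bpf() syscall */
  static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
                            unsigned int size)
  {
          return syscall(__NR_bpf, cmd, attr, size);
  }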

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220131220528.98088-3-alexei.starovoitov@gmail.com
2022-02-01 23:56:18 +01:00
Alexei Starovoitov
42d1d53fed libbpf: Add support for bpf iter in light skeleton.
bpf iterator programs should attach via bpf_link_create instead of
bpf_raw_tracepoint_open, which other tracing programs use.
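
In raw syscall terms, the difference boils down to the command used. A hedged
sketch building on a sys_bpf()-style wrapper; prog_fd is assumed to be the fd
of a loaded iterator program:

  union bpf_attr attr = {};
  int link_fd;

  attr.link_create.prog_fd = prog_fd;
  attr.link_create.attach_type = BPF_TRACE_ITER;
  /* BPF_LINK_CREATE instead of BPF_RAW_TRACEPOINT_OPEN */
  link_fd = sys_bpf(BPF_LINK_CREATE, &attr, sizeof(attr));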

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220131220528.98088-2-alexei.starovoitov@gmail.com
2022-02-01 23:56:18 +01:00
Andrii Nakryiko
533de4aea6 Merge branch 'libbpf: deprecate xdp_cpumap, xdp_devmap and classifier sec definitions'
Lorenzo Bianconi says:

====================

Deprecate xdp_cpumap, xdp_devmap and classifier sec definitions.
Update cpumap/devmap samples and kselftests.

Changes since v2:
- update warning log
- split libbpf and samples/kselftests changes
- deprecate classifier sec definition

Changes since v1:
- refer to Libbpf-1.0-migration-guide in the warning raised by libbpf
====================

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2022-02-01 09:51:32 -08:00
Lorenzo Bianconi
8bab532233 samples/bpf: Update cpumap/devmap sec_name
Substitute deprecated xdp_cpumap and xdp_devmap sec_name with
xdp/cpumap and xdp/devmap respectively.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/509201497c6c4926bc941f1cba24173cf500e760.1643727185.git.lorenzo@kernel.org
2022-02-01 09:51:31 -08:00
Lorenzo Bianconi
439f033656 selftests/bpf: Update cpumap/devmap sec_name
Substitute deprecated xdp_cpumap and xdp_devmap sec_name with
xdp/cpumap and xdp/devmap respectively in bpf kselftests.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/9a4286cd36781e2c31ba3773bfdcf45cf1bbaa9e.1643727185.git.lorenzo@kernel.org
2022-02-01 09:51:31 -08:00
Lorenzo Bianconi
4a4d4cee48 libbpf: Deprecate xdp_cpumap, xdp_devmap and classifier sec definitions
Deprecate xdp_cpumap, xdp_devmap and classifier sec definitions.
Introduce xdp/devmap and xdp/cpumap definitions according to the
standard for SEC("") in libbpf:
- prog_type.prog_flags/attach_place
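
A hedged example of a program using the new-style section name (the program
body is illustrative):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  SEC("xdp/devmap")
  int xdp_devmap_prog(struct xdp_md *ctx)
  {
          return XDP_PASS;  /* pass the frame on to the target device */
  }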

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/5c7bd9426b3ce6a31d9a4b1f97eb299e1467fc52.1643727185.git.lorenzo@kernel.org
2022-02-01 09:51:31 -08:00
Dave Marchevsky
5ee32ea24c libbpf: Deprecate btf_ext rec_size APIs
btf_ext__{func,line}_info_rec_size functions are used in conjunction
with already-deprecated btf_ext__reloc_{func,line}_info functions. Since
struct btf_ext is opaque to the user, it was necessary to expose rec_size
getters in the past.

btf_ext__reloc_{func,line}_info were deprecated in commit 8505e8709b
("libbpf: Implement generalized .BTF.ext func/line info adjustment")
as they're not compatible with support for multiple programs per
section. It was decided[0] that users of these APIs should implement their
own .btf.ext parsing to access this data, in which case the rec_size
getters are unnecessary. So deprecate them from libbpf 0.7.0 onwards.

  [0] Closes: https://github.com/libbpf/libbpf/issues/277

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220201014610.3522985-1-davemarchevsky@fb.com
2022-02-01 09:55:19 +01:00
Kenta Tada
0407a65f35 bpf: make bpf_copy_from_user_task() gpl only
access_process_vm() is exported by EXPORT_SYMBOL_GPL().

Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
Link: https://lore.kernel.org/r/20220128170906.21154-1-Kenta.Tada@sony.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-31 12:44:37 -08:00
Alexei Starovoitov
1fc5bdb2b8 Merge branch 'Split bpf_sock dst_port field'
Jakub Sitnicki says:

====================

This is a follow-up to the discussion in [1] around the idea of making
dst_port in struct bpf_sock a 16-bit field.

v2:
- use an anonymous field for zero padding (Alexei)

v1:
- keep dst_field offset unchanged to prevent existing BPF program breakage
  (Martin)
- allow 8-bit loads from dst_port[0] and [1]
- add test coverage for the verifier and the context access converter

[1] https://lore.kernel.org/bpf/87sftbobys.fsf@cloudflare.com/
====================

Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-31 12:39:19 -08:00
Jakub Sitnicki
8f50f16ff3 selftests/bpf: Extend verifier and bpf_sock tests for dst_port loads
Add coverage to the verifier tests and the tests for reading bpf_sock fields
to ensure that 32-bit, 16-bit, and 8-bit loads from the dst_port field are
allowed only at intended offsets and produce expected values.

While 16-bit and 8-bit access to the dst_port field is straightforward,
32-bit wide loads need to be allowed and must produce a zero-padded 16-bit
value for backward compatibility.
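
A hedged sketch in BPF C of the three access widths being exercised; sk is
assumed to be a struct bpf_sock pointer available to the program:

  __u32 w = *(__u32 *)&sk->dst_port; /* 32-bit legacy load, zero-padded */
  __u16 h = sk->dst_port;            /* 16-bit load of the whole field  */
  __u8  b = *(__u8 *)&sk->dst_port;  /* 8-bit load of the port MSB      */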

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20220130115518.213259-3-jakub@cloudflare.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-31 12:39:12 -08:00
Jakub Sitnicki
4421a58271 bpf: Make dst_port field in struct bpf_sock 16-bit wide
Menglong Dong reports that the documentation for the dst_port field in
struct bpf_sock is inaccurate and confusing. From the BPF program PoV, the
field is a zero-padded 16-bit integer in network byte order. The value
appears to the BPF user as if laid out in memory as follows:

  offsetof(struct bpf_sock, dst_port) + 0  <port MSB>
                                      + 8  <port LSB>
                                      +16  0x00
                                      +24  0x00

32-, 16-, and 8-bit wide loads from the field are all allowed, but only if
the offset into the field is 0.

32-bit wide loads from dst_port are especially confusing. The loaded value,
after converting to host byte order with bpf_ntohl(dst_port), contains the
port number in the upper 16-bits.

Remove the confusion by splitting the field into two 16-bit fields. For
backward compatibility, allow 32-bit wide loads from offsetof(struct
bpf_sock, dst_port).

While at it, allow 8-bit loads at offsets [0] and [1] from dst_port.
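
For reference, the resulting uapi layout is along these lines (a hedged
reconstruction; surrounding fields omitted):

  struct bpf_sock {
          /* ... */
          __be16 dst_port;        /* network byte order */
          __u16  :16;             /* zero padding */
          /* ... */
  };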

Reported-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20220130115518.213259-2-jakub@cloudflare.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-31 12:39:12 -08:00
Alexei Starovoitov
b3dddab2ff Merge branch 'selftests/bpf: use temp netns for testing'
Hangbin Liu says:

====================

Some bpf tests use hard-coded netns names like ns0, ns1, etc. Such names
are easily taken by other tests or by the system itself. If a netns with
the same name already exists, all the related tests will fail. So let's
use temporary netns names for testing.

The first patch not only changes to a temp netns but also fixes an
interface index issue, so I added a Fixes tag. The later patches are
updates rather than fixes, so no Fixes tags are added there.
====================

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 19:21:36 -08:00
Hangbin Liu
4ec25b49f4 selftests/bpf/test_xdp_redirect: use temp netns for testing
Use a temp netns instead of a hard-coded name for testing, in case the
netns already exists.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Link: https://lore.kernel.org/r/20220125081717.1260849-8-liuhangbin@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 19:21:28 -08:00
Hangbin Liu
36d9970e52 selftests/bpf/test_xdp_meta: use temp netns for testing
Use a temp netns instead of a hard-coded name for testing, in case the
netns already exists.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20220125081717.1260849-7-liuhangbin@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 19:21:28 -08:00
Hangbin Liu
ab6bcc2072 selftests/bpf/test_tcp_check_syncookie: use temp netns for testing
Use a temp netns instead of a hard-coded name for testing, in case the
netns already exists.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Lorenz Bauer <lmb@cloudflare.com>
Link: https://lore.kernel.org/r/20220125081717.1260849-6-liuhangbin@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 19:21:28 -08:00
Hangbin Liu
07c5855461 selftests/bpf/test_lwt_seg6local: use temp netns for testing
Use a temp netns instead of a hard-coded name for testing, in case the
netns already exists.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20220125081717.1260849-5-liuhangbin@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 19:21:27 -08:00
Hangbin Liu
3cc382e02f selftests/bpf/test_xdp_vlan: use temp netns for testing
Use a temp netns instead of a hard-coded name for testing, in case the
netns already exists.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20220125081717.1260849-4-liuhangbin@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 19:21:27 -08:00
Hangbin Liu
9d66c9ddc9 selftests/bpf/test_xdp_veth: use temp netns for testing
Use a temp netns instead of a hard-coded name for testing, in case the
netns already exists.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20220125081717.1260849-3-liuhangbin@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 19:21:27 -08:00
Hangbin Liu
cec74489a8 selftests/bpf/test_xdp_redirect_multi: use temp netns for testing
Use a temp netns instead of a hard-coded name for testing, in case the
netns already exists.

Remove the hard-coded interface index when creating the veth interfaces,
because when the system loads some virtual interface modules, e.g. tunnels,
ifindex 2 will already be in use and the command will fail.

As the netns has not been created if the environment check fails, set the
cleanup trap only after checking the environment.

Fixes: 8955c1a329 ("selftests/bpf/xdp_redirect_multi: Limit the tests in netns")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Link: https://lore.kernel.org/r/20220125081717.1260849-2-liuhangbin@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 19:21:27 -08:00
Hou Tao
b6ec79518e bpf, x86: Remove unnecessary handling of BPF_SUB atomic op
According to the LLVM commit (https://reviews.llvm.org/D72184),
__sync_fetch_and_sub() is implemented as a negation followed by
__sync_fetch_and_add(), so there will never be a BPF_SUB op; thus just
remove it. BPF_SUB is also rejected by the verifier anyway.
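
Conceptually, the lowering looks like this (an illustration of the
equivalence, not the patch code):

  long v = 0, d = 1;

  __sync_fetch_and_sub(&v, d);   /* what the C source says...           */
  __sync_fetch_and_add(&v, -d);  /* ...and the equivalent form emitted, */
                                 /* so only BPF_ADD reaches the JIT     */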

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Brendan Jackman <jackmanb@google.com>
Link: https://lore.kernel.org/bpf/20220127083240.1425481-1-houtao1@huawei.com
2022-01-27 22:47:05 +01:00
Alexei Starovoitov
50fc9786b2 Merge branch 'bpf: add __user tagging support in vmlinux BTF'
Yonghong Song says:

====================

The __user attribute is currently mainly used by sparse for type checking.
The attribute indicates whether a memory access is in user memory address
space or not. Such information is important when tracing kernel-internal
functions or data structures, as accessing user memory involves different
mechanisms than accessing kernel memory. For example, perf-probe needs an
explicit command line specification to indicate that a particular argument
or string is in user-space memory ([1], [2], [3]).
Currently, vmlinux BTF is available in kernel with many distributions.
If __user attribute information is available in vmlinux BTF, the explicit
user memory access information from users will not be necessary as
the kernel can figure it out by itself with vmlinux BTF.

Besides the above possible use for perf/probe, another use case is
for the bpf verifier. Currently, for BPF_PROG_TYPE_TRACING bpf
programs, users can write direct code like
  p->m1->m2
where "p" could be a function parameter. Without __user information in BTF,
the verifier will assume p->m1 accesses kernel memory and will generate
normal loads. But say "p" is actually tagged with __user in the source
code. In such cases, p->m1 actually accesses user memory, so a direct
load is not right and may produce an incorrect result. For such cases,
bpf_probe_read_user() is the correct way to read p->m1.

To support encoding __user information in BTF, a new attribute
  __attribute__((btf_type_tag("<arbitrary_string>")))
is implemented in clang ([4]). For example, if we have
  #define __user __attribute__((btf_type_tag("user")))
during kernel compilation, the attribute "user" information will
be preserved in dwarf. After pahole converts dwarf to BTF, __user
information will be available in vmlinux BTF, and such information
can be used by the bpf verifier, perf/probe, or other use cases.

Currently, btf_type_tag is only supported in clang (>= clang14) and
pahole (>= 1.23). gcc support is also proposed and under development ([5]).

In the rest of the patch set, Patch 1 adds support for the __user
btf_type_tag during compilation. Patch 2 adds bpf verifier support to
utilize the __user tag information to reject bpf programs that do not use
the proper helper to access user memory. Patches 3-5 are bpf selftests which
demonstrate that the verifier can reject direct user memory accesses.

  [1] http://lkml.kernel.org/r/155789874562.26965.10836126971405890891.stgit@devnote2
  [2] http://lkml.kernel.org/r/155789872187.26965.4468456816590888687.stgit@devnote2
  [3] http://lkml.kernel.org/r/155789871009.26965.14167558859557329331.stgit@devnote2
  [4] https://reviews.llvm.org/D111199
  [5] https://lore.kernel.org/bpf/0cbeb2fb-1a18-f690-e360-24b1c90c2a91@fb.com/

Changelog:
  v2 -> v3:
    - remove FLAG_DONTCARE enumerator and just use 0 as dontcare flag.
    - explain how btf type_tag is encoded in btf type chain.
  v1 -> v2:
    - use MEM_USER flag for PTR_TO_BTF_ID reg type instead of a separate
      field to encode __user tag.
    - add a test with kernel function __sys_getsockname which has __user tagged
      argument.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 12:03:47 -08:00
Yonghong Song
b72903847a docs/bpf: clarify how btf_type_tag gets encoded in the type chain
Clarify where the BTF_KIND_TYPE_TAG gets encoded in the type chain,
so applications and kernel can properly parse them.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220127154627.665163-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 12:03:47 -08:00
Yonghong Song
67ef7e1a75 selftests/bpf: specify pahole version requirement for btf_tag test
Specify pahole version requirement (1.23) for btf_tag subtests
btf_type_tag_user_{mod1, mod2, vmlinux}.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220127154622.663337-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 12:03:47 -08:00
Yonghong Song
696c390115 selftests/bpf: add a selftest with __user tag
Add a selftest with three __user usages: a __user pointer-type argument
in bpf_testmod, a __user pointer-type struct member in bpf_testmod,
and a __user pointer-type struct member in vmlinux. In all cases,
directly accessing the user memory results in a verification failure.

  $ ./test_progs -v -n 22/3
  ...
  libbpf: prog 'test_user1': BPF program load failed: Permission denied
  libbpf: prog 'test_user1': -- BEGIN PROG LOAD LOG --
  R1 type=ctx expected=fp
  0: R1=ctx(id=0,off=0,imm=0) R10=fp0
  ; int BPF_PROG(test_user1, struct bpf_testmod_btf_type_tag_1 *arg)
  0: (79) r1 = *(u64 *)(r1 +0)
  func 'bpf_testmod_test_btf_type_tag_user_1' arg0 has btf_id 136561 type STRUCT 'bpf_testmod_btf_type_tag_1'
  1: R1_w=user_ptr_bpf_testmod_btf_type_tag_1(id=0,off=0,imm=0)
  ; g = arg->a;
  1: (61) r1 = *(u32 *)(r1 +0)
  R1 invalid mem access 'user_ptr_'
  ...
  #22/3 btf_tag/btf_type_tag_user_mod1:OK

  $ ./test_progs -v -n 22/4
  ...
  libbpf: prog 'test_user2': BPF program load failed: Permission denied
  libbpf: prog 'test_user2': -- BEGIN PROG LOAD LOG --
  R1 type=ctx expected=fp
  0: R1=ctx(id=0,off=0,imm=0) R10=fp0
  ; int BPF_PROG(test_user2, struct bpf_testmod_btf_type_tag_2 *arg)
  0: (79) r1 = *(u64 *)(r1 +0)
  func 'bpf_testmod_test_btf_type_tag_user_2' arg0 has btf_id 136563 type STRUCT 'bpf_testmod_btf_type_tag_2'
  1: R1_w=ptr_bpf_testmod_btf_type_tag_2(id=0,off=0,imm=0)
  ; g = arg->p->a;
  1: (79) r1 = *(u64 *)(r1 +0)          ; R1_w=user_ptr_bpf_testmod_btf_type_tag_1(id=0,off=0,imm=0)
  ; g = arg->p->a;
  2: (61) r1 = *(u32 *)(r1 +0)
  R1 invalid mem access 'user_ptr_'
  ...
  #22/4 btf_tag/btf_type_tag_user_mod2:OK

  $ ./test_progs -v -n 22/5
  ...
  libbpf: prog 'test_sys_getsockname': BPF program load failed: Permission denied
  libbpf: prog 'test_sys_getsockname': -- BEGIN PROG LOAD LOG --
  R1 type=ctx expected=fp
  0: R1=ctx(id=0,off=0,imm=0) R10=fp0
  ; int BPF_PROG(test_sys_getsockname, int fd, struct sockaddr *usockaddr,
  0: (79) r1 = *(u64 *)(r1 +8)
  func '__sys_getsockname' arg1 has btf_id 2319 type STRUCT 'sockaddr'
  1: R1_w=user_ptr_sockaddr(id=0,off=0,imm=0)
  ; g = usockaddr->sa_family;
  1: (69) r1 = *(u16 *)(r1 +0)
  R1 invalid mem access 'user_ptr_'
  ...
  #22/5 btf_tag/btf_type_tag_user_vmlinux:OK

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220127154616.659314-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 12:03:46 -08:00
Yonghong Song
571d01a9d0 selftests/bpf: rename btf_decl_tag.c to test_btf_decl_tag.c
The uapi btf.h contains the following declaration:
  struct btf_decl_tag {
       __s32   component_idx;
  };

The skeleton will also generate a struct with name
"btf_decl_tag" for bpf program btf_decl_tag.c.

Rename btf_decl_tag.c to test_btf_decl_tag.c so
the corresponding skeleton struct name becomes
"test_btf_decl_tag". This way, we could include
uapi btf.h in prog_tests/btf_tag.c.
There is no functionality change for this patch.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220127154611.656699-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 12:03:46 -08:00
Yonghong Song
c6f1bfe89a bpf: reject program if a __user tagged memory accessed in kernel way
The BPF verifier supports direct memory access for the BPF_PROG_TYPE_TRACING
type of bpf programs, e.g., a->b. If "a" is a pointer
pointing to kernel memory, the bpf verifier will allow the user to write
code in C like a->b and will translate it to a kernel
load properly. If "a" is a pointer to user memory, the bpf
developer is expected to use the bpf_probe_read_user() helper to
get the value of a->b. Without utilizing the BTF __user tagging information,
the current verifier will assume that a->b is a kernel memory access
and this may generate an incorrect result.

Now that BTF contains __user information, the verifier can check whether the
pointer points to user memory or not. If it does, the verifier
can reject the program and force users to use the bpf_probe_read_user()
helper explicitly.

In the future, we can easily extend btf_add_space for other
address space tagging, for example, rcu/percpu etc.
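
As a hedged sketch, reusing the p->m1 example from above, a tracing program
would read the __user-tagged member through the helper:

  int m1;

  if (bpf_probe_read_user(&m1, sizeof(m1), &p->m1) < 0)
          return 0;  /* direct p->m1 would now be rejected */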

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220127154606.654961-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 12:03:46 -08:00
Yonghong Song
7472d5a642 compiler_types: define __user as __attribute__((btf_type_tag("user")))
The __user attribute is currently mainly used by sparse for type checking.
The attribute indicates whether a memory access is in user memory address
space or not. Such information is important when tracing kernel-internal
functions or data structures, as accessing user memory involves different
mechanisms than accessing kernel memory. For example, perf-probe needs an
explicit command line specification to indicate that a particular argument
or string is in user-space memory ([1], [2], [3]).
Currently, vmlinux BTF is available in kernel with many distributions.
If __user attribute information is available in vmlinux BTF, the explicit
user memory access information from users will not be necessary as
the kernel can figure it out by itself with vmlinux BTF.

Besides the above possible use for perf/probe, another use case is
for the bpf verifier. Currently, for BPF_PROG_TYPE_TRACING bpf
programs, users can write direct code like
  p->m1->m2
where "p" could be a function parameter. Without __user information in BTF,
the verifier will assume p->m1 accesses kernel memory and will generate
normal loads. But say "p" is actually tagged with __user in the source
code. In such cases, p->m1 actually accesses user memory, so a direct
load is not right and may produce an incorrect result. For such cases,
bpf_probe_read_user() is the correct way to read p->m1.

To support encoding __user information in BTF, a new attribute
  __attribute__((btf_type_tag("<arbitrary_string>")))
is implemented in clang ([4]). For example, if we have
  #define __user __attribute__((btf_type_tag("user")))
during kernel compilation, the attribute "user" information will
be preserved in dwarf. After pahole converts dwarf to BTF, __user
information will be available in vmlinux BTF.

The following is an example with latest upstream clang (clang14) and
pahole 1.23:

  [$ ~] cat test.c
  #define __user __attribute__((btf_type_tag("user")))
  int foo(int __user *arg) {
          return *arg;
  }
  [$ ~] clang -O2 -g -c test.c
  [$ ~] pahole -JV test.o
  ...
  [1] INT int size=4 nr_bits=32 encoding=SIGNED
  [2] TYPE_TAG user type_id=1
  [3] PTR (anon) type_id=2
  [4] FUNC_PROTO (anon) return=1 args=(3 arg)
  [5] FUNC foo type_id=4
  [$ ~]

You can see for the function argument "int __user *arg", its type is
described as
  PTR -> TYPE_TAG(user) -> INT
The kernel can use this information for bpf verification or other
use cases.

Currently, btf_type_tag is only supported in clang (>= clang14) and
pahole (>= 1.23). gcc support is also proposed and under development ([5]).

  [1] http://lkml.kernel.org/r/155789874562.26965.10836126971405890891.stgit@devnote2
  [2] http://lkml.kernel.org/r/155789872187.26965.4468456816590888687.stgit@devnote2
  [3] http://lkml.kernel.org/r/155789871009.26965.14167558859557329331.stgit@devnote2
  [4] https://reviews.llvm.org/D111199
  [5] https://lore.kernel.org/bpf/0cbeb2fb-1a18-f690-e360-24b1c90c2a91@fb.com/

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220127154600.652613-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 12:03:46 -08:00
Pavel Begunkov
46531a3036 cgroup/bpf: fast path skb BPF filtering
Even though there is a static key protecting from the overhead of
cgroup-bpf skb filtering when there is nothing attached, in many cases
it's not enough, as registering a filter for one type will ruin the fast
path for all others. It's observed in production servers I've looked
at but also in laptops, where registration is done during init by
systemd or something else.

Add a per-socket fast path check guarding against such overhead. This
affects both receive and transmit paths of TCP, UDP and other
protocols. It showed a ~1% tx/s improvement in small-payload UDP
send benchmarks using a real NIC in a server environment, and the
number jumps to 2-3% for preemptible kernels.

Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/d8c58857113185a764927a46f4b5a058d36d3ec3.1643292455.git.asml.silence@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 10:15:00 -08:00
Yonghong Song
cdb5ed9796 selftests/bpf: fix a clang compilation error
When building selftests/bpf with clang
  make -j LLVM=1
  make -C tools/testing/selftests/bpf -j LLVM=1
I hit the following compilation error:

  trace_helpers.c:152:9: error: variable 'found' is used uninitialized whenever 'while' loop exits because its condition is false [-Werror,-Wsometimes-uninitialized]
          while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &base) == 4) {
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  trace_helpers.c:161:7: note: uninitialized use occurs here
          if (!found)
               ^~~~~
  trace_helpers.c:152:9: note: remove the condition if it is always true
          while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &base) == 4) {
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                 1
  trace_helpers.c:145:12: note: initialize the variable 'found' to silence this warning
          bool found;
                    ^
                     = false

It is possible that for a sane /proc/self/maps we may never hit the above
issue in practice. But let us initialize the variable 'found' properly to
silence the compilation error.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220127163726.1442032-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 09:48:49 -08:00
Magnus Karlsson
3b22523bca selftests, xsk: Fix bpf_res cleanup test
After commit 710ad98c36 ("veth: Do not record rx queue hint in veth_xmit"),
veth no longer receives traffic on the same queue as it was sent on. This
breaks the bpf_res test for the AF_XDP selftests as the socket tied to
queue 1 will not receive traffic anymore.

Modify the test so that two sockets are tied to queue id 0 using a shared
umem instead. When the first socket is killed, enter the second socket into
the xskmap so that traffic will flow to it. This still tests that the
resources are not cleaned up until after the second socket dies, without
having to rely on veth supporting rx_queue hints.

Reported-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220125082945.26179-1-magnus.karlsson@gmail.com
2022-01-27 17:43:51 +01:00
Daniel Borkmann
33372bc274 Merge branch 'xsk-batching'
Maciej Fijalkowski says:

====================
Unfortunately, scalability issues similar to those that were addressed for
XDP processing in ice exist for XDP in the zero-copy driver used by AF_XDP.
Let's resolve them in mostly the same way as we did in [0] and utilize
the Tx batching API from the XSK buffer pool.

Move the array of Tx descriptors that is used with the batching approach to
the XSK buffer pool. This means that future users of this API will not
have to carry the array on their own side; they can simply refer to the
pool's tx_desc array.

We also improve the Rx side where we extend ice_alloc_rx_buf_zc() to
handle the ring wrap and bump Rx tail more frequently. By doing so,
the Rx side is adjusted to Tx, which was needed for the l2fwd scenario.

Here are the performance improvements that this set brings, measured
with the xdpsock app in busy-poll mode for 1-core and 2-core modes.
Both Tx and Rx rings were sized to 1k length and busy poll budget was
256.

----------------------------------------------------------------
     |      txonly       |      l2fwd      |      rxdrop
----------------------------------------------------------------
1C   |       149%        |       14%       |        3%
----------------------------------------------------------------
2C   |       134%        |       20%       |        5%
----------------------------------------------------------------

Next step will be to introduce batching onto Rx side.

v5:
* collect acks
* fix typos
* correct comments showing cache line boundaries in ice_tx_ring struct
v4 - address Alexandr's review:
* new patch (2) for making sure ring size is pow(2) when attaching
  xsk socket
* don't open code ALIGN_DOWN (patch 3)
* resign from storing tx_thresh in ice_tx_ring (patch 4)
* scope variables in a better way for Tx batching (patch 7)
v3:
* drop likely() that was wrapping napi_complete_done (patch 1)
* introduce configurable Tx threshold (patch 2)
* handle ring wrap on Rx side when allocating buffers (patch 3)
* respect NAPI budget when cleaning Tx descriptors in ZC (patch 6)
v2:
* introduce new patch that resets @next_dd and @next_rs fields
* use batching API for AF_XDP Tx on ice side

  [0]: https://lore.kernel.org/bpf/20211015162908.145341-8-anthony.l.nguyen@intel.com/
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2022-01-27 17:27:31 +01:00
Maciej Fijalkowski
59e92bfe4d ice: xsk: Borrow xdp_tx_active logic from i40e
One of the things that commit 5574ff7b7b ("i40e: optimize AF_XDP Tx
completion path") introduced was the @xdp_tx_active field. Its usage
from i40e can be adjusted to ice driver and give us positive performance
results.

If the descriptor that @next_dd points to has been sent by HW (its DD
bit is set), then we are sure that at least quarter of the ring is ready
to be cleaned. If @xdp_tx_active is 0 which means that related xdp_ring
is not used for XDP_{TX, REDIRECT} workloads, then we know how many XSK
entries should placed to completion queue, IOW walking through the ring
can be skipped.
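
A hedged sketch of the fast path; xsk_entry() is a hypothetical predicate
standing in for the per-descriptor check:

  static u32 count_xsk_frames(struct ice_tx_ring *xdp_ring, u32 done)
  {
          u32 i, xsk_frames = 0;

          if (!xdp_ring->xdp_tx_active)
                  return done;  /* no XDP_TX/REDIRECT frames in flight */

          for (i = 0; i < done; i++)  /* slow path: walk the ring */
                  xsk_frames += xsk_entry(xdp_ring, i);

          return xsk_frames;
  }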

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-9-maciej.fijalkowski@intel.com
2022-01-27 17:25:33 +01:00
Maciej Fijalkowski
126cdfe100 ice: xsk: Improve AF_XDP ZC Tx and use batching API
Apply the logic that was done for regular XDP from commit 9610bd988d
("ice: optimize XDP_TX workloads") to the ZC side of the driver. On top
of that, introduce batching to Tx that is inspired by i40e's
implementation, with adjustments to the cleaning logic - take the
NAPI budget into account in ice_clean_xdp_irq_zc().

Separating the stats structs onto separate cache lines seemed to improve
the performance.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-8-maciej.fijalkowski@intel.com
2022-01-27 17:25:33 +01:00
Maciej Fijalkowski
86e3f78c8d ice: xsk: Avoid potential dead AF_XDP Tx processing
Commit 9610bd988d ("ice: optimize XDP_TX workloads") introduced
@next_dd and @next_rs to the ice_tx_ring struct. Currently, their state is
not restored in ice_clean_tx_ring(), which was not causing any trouble
as the XDP rings are gone after we're done with the XDP prog on the
interface.

For upcoming usage of mentioned fields in AF_XDP, this might expose us
to a potential dead Tx side. Scenario would look like following (based
on xdpsock):

- two xdpsock instances are spawned in Tx mode
- one of them is killed
- XDP prog is kept on interface due to the other xdpsock still running
  * this means that XDP rings stayed in place
- xdpsock is launched again on same queue id that was terminated on
- @next_dd and @next_rs setting is bogus, therefore transmit side is
  broken

To protect us from the above, restore the initial @next_rs and @next_dd
values when cleaning the Tx ring.
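
The restore itself is small; a hedged sketch, assuming the quarter-of-ring
threshold introduced earlier in the series:

  tx_ring->next_dd = ICE_RING_QUARTER(tx_ring) - 1;
  tx_ring->next_rs = ICE_RING_QUARTER(tx_ring) - 1;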

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-7-maciej.fijalkowski@intel.com
2022-01-27 17:25:33 +01:00
Magnus Karlsson
d1bc532e99 i40e: xsk: Move tmp desc array from driver to pool
Move desc_array from the driver to the pool. The reason behind this is
that we can then reuse this array as a temporary storage for descriptors
in all zero-copy drivers that use the batched interface. This will make
it easier to add batching to more drivers.

i40e is the only driver that has a batched Tx zero-copy
implementation, so no need to touch any other driver.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-6-maciej.fijalkowski@intel.com
2022-01-27 17:25:32 +01:00
Maciej Fijalkowski
3dd411efe1 ice: Make Tx threshold dependent on ring length
XDP_TX workloads use a concept of a Tx threshold that indicates the
interval of setting the RS bit on descriptors, which in turn tells the HW to
generate an interrupt to signal the completion of Tx on the HW side. It is
currently based on a constant value of 32, which might not work out well
for various ring sizes combined with, for example, the batch size that can
be set via SO_BUSY_POLL_BUDGET.

Internal tests based on AF_XDP showed that the most convenient setting of
the mentioned threshold is a quarter of the ring length.

Make use of the recently introduced ICE_RING_QUARTER macro and use this
value as a substitute for ICE_TX_THRESH.

Also align the ethtool -G callback so that the next_dd/next_rs fields stay
up to date in terms of the ring size.
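
Assuming ICE_RING_QUARTER has the obvious shape, the threshold then scales
with the ring automatically:

  #define ICE_RING_QUARTER(R) ((R)->count >> 2)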

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-5-maciej.fijalkowski@intel.com
2022-01-27 17:25:32 +01:00
Maciej Fijalkowski
3876ff525d ice: xsk: Handle SW XDP ring wrap and bump tail more often
Currently, if ice_clean_rx_irq_zc() processed the whole ring and
next_to_use != 0, then ice_alloc_rx_buf_zc() would not refill the whole
ring even if the XSK buffer pool would have enough free entries (either
from fill ring or the internal recycle mechanism) - it is because ring
wrap is not handled.

Improve the logic in ice_alloc_rx_buf_zc() to address the problem above.
Do not clamp the count of buffers that is passed to
xsk_buff_alloc_batch() in the case when next_to_use + buffer count >=
rx_ring->count, but rather split it and have two calls to the mentioned
function - one for the part up until the wrap and one for the part after
the wrap.
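
A hedged sketch of the split; the allocator helper name is illustrative:

  u16 ntu = rx_ring->next_to_use;
  bool ok;

  if (ntu + count >= rx_ring->count) {
          u16 until_wrap = rx_ring->count - ntu;

          ok = __alloc_zc(rx_ring, until_wrap) &&       /* up to the wrap */
               __alloc_zc(rx_ring, count - until_wrap); /* after the wrap */
  } else {
          ok = __alloc_zc(rx_ring, count);
  }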

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-4-maciej.fijalkowski@intel.com
2022-01-27 17:25:32 +01:00
Maciej Fijalkowski
296f13ff38 ice: xsk: Force rings to be sized to power of 2
With the upcoming introduction of batching to the XSK data path, it is
best performance-wise to have the ring descriptor count aligned to a
power of 2.

Check whether the ring sizes that the user is going to attach the XSK socket
to fulfill the condition above. For the Tx side, although the check is done
against the Tx queue while in the end the socket will be attached to the XDP
queue, it is fine since XDP queues get the ring->count setting from Tx
queues.
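
A hedged sketch of the check, using the kernel's is_power_of_2() from
<linux/log2.h>; the ring lookup is illustrative:

  if (!is_power_of_2(vsi->rx_rings[qid]->count) ||
      !is_power_of_2(vsi->tx_rings[qid]->count)) {
          netdev_err(vsi->netdev, "Please align ring sizes to power of 2\n");
          return -EINVAL;
  }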

Suggested-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-3-maciej.fijalkowski@intel.com
2022-01-27 17:25:32 +01:00
Maciej Fijalkowski
a4e186693c ice: Remove likely for napi_complete_done
Remove the likely before napi_complete_done as this is the unlikely case
when busy-poll is used. Removing this has a positive performance impact
for busy-poll and no negative impact on the regular case.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-2-maciej.fijalkowski@intel.com
2022-01-27 17:25:32 +01:00
Jakub Kicinski
8033c6c2fe bpf: remove unused static inlines
Remove two dead stubs: sk_msg_clear_meta() was never
used, and the use of xskq_cons_is_full() was replaced by
xsk_tx_writeable() in v5.10.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20220126185412.2776254-1-kuba@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-26 21:17:53 -08:00
Andrii Nakryiko
ff943683f8 selftests/bpf: fix uprobe offset calculation in selftests
Fix how selftests determine the relative offset of a function that is
uprobed. Previously, there was an assumption that the uprobed function is
always in the first executable region, which is not always the case
(libbpf CI hits this case now), so the get_base_addr() approach in isolation
doesn't work anymore. Teach get_uprobe_offset() to determine the correct
memory mapping and calculate the uprobe offset correctly.

While at it, I merged together two implementations of the
get_uprobe_offset() helper, moving the powerpc64-specific logic inside (had
to add extra {} block to avoid unused variable error for insn).

Also ensured that uprobed functions are never inlined, but are still
static (and thus local to each selftest), by using a no-op asm volatile
block internally. I didn't want to keep them global __weak, because some
tests use uprobe's ref counter offset (to test USDT-like logic) which is
not compatible with non-refcounted uprobe. So it's nicer to have each
test uprobe target local to the file and guaranteed to not be inlined or
skipped by the compiler (which can happen with static functions,
especially if compiling selftests with -O2).
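
The resulting pattern for a target function looks roughly like this (the
function name is illustrative):

  /* local to the selftest, never inlined, never optimized away */
  static noinline void uprobe_target(void)
  {
          asm volatile ("");
  }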

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220126193058.3390292-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-26 20:04:01 -08:00
Yonghong Song
e5465a9027 selftests/bpf: Fix a clang compilation error
Compiling the kernel and selftests/bpf with the latest llvm as below:
  make -j LLVM=1
  make -C tools/testing/selftests/bpf -j LLVM=1
I hit the following compilation error:
  /.../prog_tests/log_buf.c:215:6: error: variable 'log_buf' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
          if (!ASSERT_OK_PTR(raw_btf_data, "raw_btf_data_good"))
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  /.../prog_tests/log_buf.c:264:7: note: uninitialized use occurs here
          free(log_buf);
               ^~~~~~~
  /.../prog_tests/log_buf.c:215:2: note: remove the 'if' if its condition is always false
          if (!ASSERT_OK_PTR(raw_btf_data, "raw_btf_data_good"))
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  /.../prog_tests/log_buf.c:205:15: note: initialize the variable 'log_buf' to silence this warning
          char *log_buf;
                       ^
                        = NULL
  1 error generated.

The compiler rightfully detected that log_buf is uninitialized in one of the
failure paths, as indicated above.

Properly initializing the 'log_buf' variable fixes the issue.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220126181940.4105997-1-yhs@fb.com
2022-01-26 12:07:01 -08:00
Stanislav Fomichev
c446fdacb1 bpf: fix register_btf_kfunc_id_set for !CONFIG_DEBUG_INFO_BTF
Commit dee872e124 ("bpf: Populate kfunc BTF ID sets in struct btf")
breaks loading of some modules when CONFIG_DEBUG_INFO_BTF is not set.
register_btf_kfunc_id_set returns -ENOENT to the callers when
there is no module btf. Let's return 0 (success) instead to let
those modules work in !CONFIG_DEBUG_INFO_BTF cases.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: dee872e124 ("bpf: Populate kfunc BTF ID sets in struct btf")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220126001340.1573649-1-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-25 19:27:28 -08:00
Felix Maurer
fc1ca95585 selftests: bpf: Less strict size check in sockopt_sk
Originally, the kernel strictly checked the size of the optval in
getsockopt(TCP_ZEROCOPY_RECEIVE) to be equal to sizeof(struct
tcp_zerocopy_receive). With c8856c0514, this was changed to allow
optvals of different sizes.

The bpf code in the sockopt_sk test was still performing the strict size
check. This fix adapts the kernel behavior from c8856c0514 in the
selftest, i.e., it just checks whether the required fields are there.

Fixes: 9cacf81f81 ("bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE")
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/6f569cca2e45473f9a724d54d03fdfb45f29e35f.1643129402.git.fmaurer@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-25 18:25:05 -08:00
Alexei Starovoitov
451c426044 Merge branch 'libbpf: deprecate some setter and getter APIs'
Andrii Nakryiko says:

====================

Another batch of simple deprecations. One of the last few, hopefully, as we
are getting close to deprecating all the planned APIs/features. See
individual patches for details.

v1->v2:
  - rebased on latest bpf-next, fixed Closes: reference.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-25 17:59:07 -08:00
Andrii Nakryiko
23fcfcf8bb perf: use generic bpf_program__set_type() to set BPF prog type
The bpf_program__set_<type>() APIs are deprecated; use the generic
bpf_program__set_type() API instead.
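
For example (a hedged before/after; prog is assumed to be a valid
struct bpf_program pointer):

  bpf_program__set_tracepoint(prog);                     /* deprecated */
  bpf_program__set_type(prog, BPF_PROG_TYPE_TRACEPOINT); /* preferred  */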

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220124194254.2051434-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-25 17:59:07 -08:00
Andrii Nakryiko
61afd3da08 samples/bpf: use preferred getters/setters instead of deprecated ones
Use preferred setter and getter APIs instead of deprecated ones.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220124194254.2051434-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-25 17:59:07 -08:00