linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-09-29 05:44:11 +00:00

Author	SHA1	Message	Date
Alexei Starovoitov	3ee4f53355	libbpf: Split bpf_core_apply_relo() into bpf_program independent helper. bpf_core_apply_relo() doesn't need to know bpf_program internals and hashmap details. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210721000822.40958-3-alexei.starovoitov@gmail.com	2021-07-26 12:13:05 -07:00
Alexei Starovoitov	6e43b28607	libbpf: Cleanup the layering between CORE and bpf_program. CO-RE processing functions don't need to know 'struct bpf_program' details. Cleanup the layering to eventually be able to move CO-RE logic into a separate file. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210721000822.40958-2-alexei.starovoitov@gmail.com	2021-07-26 12:11:23 -07:00
Johan Almbladh	2b7e9f25e5	bpf/tests: Do not PASS tests without actually testing the result Each test case can have a set of sub-tests, where each sub-test can run the cBPF/eBPF test snippet with its own data_size and expected result. Before, the end of the sub-test array was indicated by both data_size and result being zero. However, most or all of the internal eBPF tests has a data_size of zero already. When such a test also had an expected value of zero, the test was never run but reported as PASS anyway. Now the test runner always runs the first sub-test, regardless of the data_size and result values. The sub-test array zero-termination only applies for any additional sub-tests. There are other ways fix it of course, but this solution at least removes the surprise of eBPF tests with a zero result always succeeding. Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210721103822.3755111-1-johan.almbladh@anyfinetworks.com	2021-07-23 17:21:39 -07:00
Johan Almbladh	ae7f47041d	bpf/tests: Fix copy-and-paste error in double word test This test now operates on DW as stated instead of W, which was already covered by another test. Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210721104058.3755254-1-johan.almbladh@anyfinetworks.com	2021-07-23 17:17:12 -07:00
Evgeniy Litvinenko	7a18844223	selftests/bpf: Document vmtest.sh dependencies Add a list of vmtest script dependencies to make it easier for new contributors to get going. Signed-off-by: Evgeniy Litvinenko <evgeniyl@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210723223645.907802-1-evgeniyl@fb.com	2021-07-23 17:09:13 -07:00
Evgeniy Litvinenko	e244d34d0e	libbpf: Add bpf_map__pin_path function Add bpf_map__pin_path, so that the inconsistently named bpf_map__get_pin_path can be deprecated later. This is part of the effort towards libbpf v1.0: https://github.com/libbpf/libbpf/issues/307 Also, add a selftest for the new function. Signed-off-by: Evgeniy Litvinenko <evgeniyl@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210723221511.803683-1-evgeniyl@fb.com	2021-07-23 16:57:03 -07:00
Andrii Nakryiko	d9e8d14b12	Merge branch 'bpf: Allow bpf tcp iter to do bpf_(get\|set)sockopt' Martin KaFai says: ==================== This set is to allow bpf tcp iter to call bpf_(get\|set)sockopt. With bpf-tcp-cc, new algo rollout happens more often. Instead of restarting the applications to pick up the new tcp-cc, this set allows the bpf tcp iter to call bpf_(get\|set)sockopt(TCP_CONGESTION). It is not limited to TCP_CONGESTION, the bpf tcp iter can call bpf_(get\|set)sockopt() with other options. The bpf tcp iter can read into all the fields of a tcp_sock, so there is a lot of flexibility to select the desired sk to do setsockopt(), e.g. it can test for TCP_LISTEN only and leave the established connections untouched, or check the addr/port, or check the current tcp-cc name, ...etc. Patch 1-4 are some cleanup and prep work in the tcp and bpf seq_file. Patch 5 is to have the tcp seq_file iterate on the port+addr lhash2 instead of the port only listening_hash. Patch 6 is to have the bpf tcp iter doing batching which then allows lock_sock. lock_sock is needed for setsockopt. Patch 7 allows the bpf tcp iter to call bpf_(get\|set)sockopt. v2: - Use __GFP_NOWARN in patch 6 - Add bpf_getsockopt() in patch 7 to give a symmetrical user experience. selftest in patch 8 is changed to also cover bpf_getsockopt(). - Remove CAP_NET_ADMIN check in patch 7. Tracing bpf prog has already required CAP_SYS_ADMIN or CAP_PERFMON. - Move some def macros to bpf_tracing_net.h in patch 8 ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org>	2021-07-23 16:45:36 -07:00
Martin KaFai Lau	eed92afdd1	bpf: selftest: Test batching and bpf_(get\|set)sockopt in bpf tcp iter This patch adds tests for the batching and bpf_(get\|set)sockopt in bpf tcp iter. It first creates: a) 1 non SO_REUSEPORT listener in lhash2. b) 256 passive and active fds connected to the listener in (a). c) 256 SO_REUSEPORT listeners in one of the lhash2 bucket. The test sets all listeners and connections to bpf_cubic before running the bpf iter. The bpf iter then calls setsockopt(TCP_CONGESTION) to switch each listener and connection from bpf_cubic to bpf_dctcp. The bpf iter has a random_retry mode such that it can return EAGAIN to the usespace in the middle of a batch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210701200625.1036874-1-kafai@fb.com	2021-07-23 16:45:17 -07:00
Martin KaFai Lau	3cee6fb8e6	bpf: tcp: Support bpf_(get\|set)sockopt in bpf tcp iter This patch allows bpf tcp iter to call bpf_(get\|set)sockopt. To allow a specific bpf iter (tcp here) to call a set of helpers, get_func_proto function pointer is added to bpf_iter_reg. The bpf iter is a tracing prog which currently requires CAP_PERFMON or CAP_SYS_ADMIN, so this patch does not impose other capability checks for bpf_(get\|set)sockopt. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210701200619.1036715-1-kafai@fb.com	2021-07-23 16:45:07 -07:00
Martin KaFai Lau	04c7820b77	bpf: tcp: Bpf iter batching and lock_sock This patch does batching and lock_sock for the bpf tcp iter. It does not affect the proc fs iteration. With bpf-tcp-cc, new algo rollout happens more often. Instead of restarting the application to pick up the new tcp-cc, the next patch will allow bpf iter to do setsockopt(TCP_CONGESTION). This requires locking the sock. Also, unlike the proc iteration (cat /proc/net/tcp[6]), the bpf iter can inspect all fields of a tcp_sock. It will be useful to have a consistent view on some of the fields (e.g. the ones reported in tcp_get_info() that also acquires the sock lock). Double lock: locking the bucket first and then locking the sock could lead to deadlock. This patch takes a batching approach similar to inet_diag. While holding the bucket lock, it batch a number of sockets into an array first and then unlock the bucket. Before doing show(), it then calls lock_sock_fast(). In a machine with ~400k connections, the maximum number of sk in a bucket of the established hashtable is 7. 0.02% of the established connections fall into this bucket size. For listen hash (port+addr lhash2), the bucket is usually very small also except for the SO_REUSEPORT use case which the userspace could have one SO_REUSEPORT socket per thread. While batching is used, it can also minimize the chance of missing sock in the setsockopt use case if the whole bucket is batched. This patch will start with a batch array with INIT_BATCH_SZ (16) which will be enough for the most common cases. bpf_iter_tcp_batch() will try to realloc to a larger array to handle exception case (e.g. the SO_REUSEPORT case in the lhash2). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com	2021-07-23 16:45:00 -07:00
Martin KaFai Lau	05c0b35709	tcp: seq_file: Replace listening_hash with lhash2 This patch moves the tcp seq_file iteration on listeners from the port only listening_hash to the port+addr lhash2. When iterating from the bpf iter, the next patch will need to lock the socket such that the bpf iter can call setsockopt (e.g. to change the TCP_CONGESTION). To avoid locking the bucket and then locking the sock, the bpf iter will first batch some sockets from the same bucket and then unlock the bucket. If the bucket size is small (which usually is), it is easier to batch the whole bucket such that it is less likely to miss a setsockopt on a socket due to changes in the bucket. However, the port only listening_hash could have many listeners hashed to a bucket (e.g. many individual VIP(s):443 and also multiple by the number of SO_REUSEPORT). We have seen bucket size in tens of thousands range. Also, the chance of having changes in some popular port buckets (e.g. 443) is also high. The port+addr lhash2 was introduced to solve this large listener bucket issue. Also, the listening_hash usage has already been replaced with lhash2 in the fast path inet[6]_lookup_listener(). This patch follows the same direction on moving to lhash2 and iterates the lhash2 instead of listening_hash. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210701200606.1035783-1-kafai@fb.com	2021-07-23 16:44:57 -07:00
Martin KaFai Lau	b72acf4501	tcp: seq_file: Add listening_get_first() The current listening_get_next() is overloaded by passing NULL to the 2nd arg, like listening_get_next(seq, NULL), to mean get_first(). This patch moves some logic from the listening_get_next() into a new function listening_get_first(). It will be equivalent to the current established_get_first() and established_get_next() setup. get_first() is to find a non empty bucket and return the first sk. get_next() is to find the next sk of the current bucket and then resorts to get_first() if the current bucket is exhausted. The next patch is to move the listener seq_file iteration from listening_hash (port only) to lhash2 (port+addr). Separating out listening_get_first() from listening_get_next() here will make the following lhash2 changes cleaner and easier to follow. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210701200600.1035353-1-kafai@fb.com	2021-07-23 16:44:53 -07:00
Martin KaFai Lau	62001372c2	bpf: tcp: seq_file: Remove bpf_seq_afinfo from tcp_iter_state A following patch will create a separate struct to store extra bpf_iter state and it will embed the existing tcp_iter_state like this: struct bpf_tcp_iter_state { struct tcp_iter_state state; /* More bpf_iter specific states here ... / } As a prep work, this patch removes the "struct tcp_seq_afinfo bpf_seq_afinfo" where its purpose is to tell if it is iterating from bpf_iter instead of proc fs. Currently, if "bpf_seq_afinfo" is not NULL, it is iterating from bpf_iter. The kernel should not filter by the addr family and leave this filtering decision to the bpf prog. Instead of adding a "bpf_seq_afinfo" pointer, this patch uses the "seq->op == &bpf_iter_tcp_seq_ops" test to tell if it is iterating from the bpf iter. The bpf_iter_(init\|fini)_tcp() is left here to prepare for the change of a following patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210701200554.1034982-1-kafai@fb.com	2021-07-23 16:44:41 -07:00
Martin KaFai Lau	ad2d61376a	tcp: seq_file: Refactor net and family matching This patch refactors the net and family matching into two new helpers, seq_sk_match() and seq_file_family(). seq_file_family() is in the later part of the file to prepare the change of a following patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210701200548.1034629-1-kafai@fb.com	2021-07-23 16:44:37 -07:00
Martin KaFai Lau	525e2f9fd0	tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos st->bucket stores the current bucket number. st->offset stores the offset within this bucket that is the sk to be seq_show(). Thus, st->offset only makes sense within the same st->bucket. These two variables are an optimization for the common no-lseek case. When resuming the seq_file iteration (i.e. seq_start()), tcp_seek_last_pos() tries to continue from the st->offset at bucket st->bucket. However, it is possible that the bucket pointed by st->bucket has changed and st->offset may end up skipping the whole st->bucket without finding a sk. In this case, tcp_seek_last_pos() currently continues to satisfy the offset condition in the next (and incorrect) bucket. Instead, regardless of the offset value, the first sk of the next bucket should be returned. Thus, "bucket == st->bucket" check is added to tcp_seek_last_pos(). The chance of hitting this is small and the issue is a decade old, so targeting for the next tree. Fixes: `a8b690f98b` ("tcp: Fix slowness in read /proc/net/tcp") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210701200541.1033917-1-kafai@fb.com	2021-07-23 16:43:24 -07:00
Jiri Olsa	da97553ec6	libbpf: Export bpf_program__attach_kprobe_opts function Export bpf_program__attach_kprobe_opts as a public API. Rename bpf_program_attach_kprobe_opts to bpf_kprobe_opts and turn it into OPTS struct. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Tested-by: Alan Maguire <alan.maguire@oracle.com> Link: https://lore.kernel.org/bpf/20210721215810.889975-4-jolsa@kernel.org	2021-07-22 20:09:16 -07:00
Jiri Olsa	e3f9bc35ea	libbpf: Allow decimal offset for kprobes Allow to specify decimal offset in SEC macro, like: SEC("kprobe/bpf_fentry_test7+5") Add selftest for that. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Tested-by: Alan Maguire <alan.maguire@oracle.com> Link: https://lore.kernel.org/bpf/20210721215810.889975-3-jolsa@kernel.org	2021-07-22 20:09:16 -07:00
Jiri Olsa	1f71a468a7	libbpf: Fix func leak in attach_kprobe Add missing free() for func pointer in attach_kprobe function. Fixes: `a2488b5f48` ("libbpf: Allow specification of "kprobe/function+offset"") Reported-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Tested-by: Alan Maguire <alan.maguire@oracle.com> Link: https://lore.kernel.org/bpf/20210721215810.889975-2-jolsa@kernel.org	2021-07-22 20:08:39 -07:00
Martynas Pumputis	9907442fcd	selftests/bpf: Mute expected invalid map creation error msg Previously, the newly introduced test case in test_map_in_map(), which checks whether the inner map is destroyed after unsuccessful creation of the outer map, logged the following harmless and expected error: libbpf: map 'mim': failed to create: Invalid argument(-22) libbpf: failed to load object './test_map_in_map_invalid.o' To avoid any possible confusion, mute the logging during loading of the prog. Fixes: `08f71a1e39` ("selftests/bpf: Check inner map deletion") Reported-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210721140941.563175-1-m@lambda.lt	2021-07-22 16:44:10 -07:00
Colin Ian King	724f17b7d4	bpf: Remove redundant intiialization of variable stype The variable stype is being initialized with a value that is never read, it is being updated later on. The assignment is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210721115630.109279-1-colin.king@canonical.com	2021-07-22 16:34:35 -07:00
Arnd Bergmann	16c5900ba7	bpf: Fix pointer cast warning kp->addr is a pointer, so it cannot be cast directly to a 'u64' when it gets interpreted as an integer value: kernel/trace/bpf_trace.c: In function '____bpf_get_func_ip_kprobe': kernel/trace/bpf_trace.c:968:21: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast] 968 \| return kp ? (u64) kp->addr : 0; Use the uintptr_t type instead. Fixes: `9ffd9f3ff7` ("bpf: Add bpf_get_func_ip helper for kprobe programs") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210721212007.3876595-1-arnd@kernel.org	2021-07-22 16:27:42 -07:00
Andrii Nakryiko	807b8f0e24	Merge branch 'libbpf: btf typed data dumping fixes (__int128 usage, error propagation)' Alan Maguire says: ==================== This series aims to resolve further issues with the BTF typed data dumping interfaces in libbpf. Compilation failures with use of __int128 on 32-bit platforms were reported [1]. As a result, the use of __int128 in libbpf typed data dumping is replaced with __u64 usage for bitfield manipulations. In the case of 128-bit integer values, they are simply split into two 64-bit hex values for display (patch 1). Tests are added for __int128 display in patch 2, using conditional compilation to avoid problems with a lack of __int128 support. Patch 3 resolves an issue Andrii noted about error propagation when handling enum data display. More followup work is required to ensure multi-dimensional char array display works correctly. [1] https://lore.kernel.org/bpf/1626362126-27775-1-git-send-email-alan.maguire@oracle.com/T/#mc2cb023acfd6c3cd0b661e385787b76bb757430d Changes since v1: - added error handling for bitfield size > 64 bits by changing function signature for bitfield retrieval to return an int error value and to set bitfield value via a __u64 * argument (Andrii) ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org>	2021-07-20 13:49:26 -07:00
Alan Maguire	720c29fca9	libbpf: Propagate errors when retrieving enum value for typed data display When retrieving the enum value associated with typed data during "is data zero?" checking in btf_dump_type_data_check_zero(), the return value of btf_dump_get_enum_value() is not passed to the caller if the function returns a non-zero (error) value. Currently, 0 is returned if the function returns an error. We should instead propagate the error to the caller. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/1626770993-11073-4-git-send-email-alan.maguire@oracle.com	2021-07-20 13:49:25 -07:00
Alan Maguire	a17553dde2	selftests/bpf: Add __int128-specific tests for typed data dump Add tests for __int128 display for platforms that support it. __int128s are dumped as hex values. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/1626770993-11073-3-git-send-email-alan.maguire@oracle.com	2021-07-20 13:49:25 -07:00
Alan Maguire	a1d3cc3c5e	libbpf: Avoid use of __int128 in typed dump display __int128 is not supported for some 32-bit platforms (arm and i386). __int128 was used in carrying out computations on bitfields which aid display, but the same calculations could be done with __u64 with the small effect of not supporting 128-bit bitfields. With these changes, a big-endian issue with casting 128-bit integers to 64-bit for enum bitfields is solved also, as we now use 64-bit integers for bitfield calculations. Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/1626770993-11073-2-git-send-email-alan.maguire@oracle.com	2021-07-20 13:49:25 -07:00
Vincent Li	875fc315db	selftests, bpf: test_tc_tunnel.sh nc: Cannot use -p and -l When run test_tc_tunnel.sh, it complains following error ipip encap 192.168.1.1 to 192.168.1.2, type ipip, mac none len 100 test basic connectivity nc: cannot use -p and -l nc man page has: -l Listen for an incoming connection rather than initiating a connection to a remote host.Cannot be used together with any of the options -psxz. Additionally, any timeouts specified with the -w option are ignored. Correct nc in server_listen(). Signed-off-by: Vincent Li <vincent.mc.li@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210719223022.66681-1-vincent.mc.li@gmail.com	2021-07-20 13:43:00 -07:00
John Fastabend	c39aa21599	bpf, selftests: Fix test_maps now that sockmap supports UDP UDP socket support was added recently so testing UDP insert failure is no longer correct and causes test_maps failure. The fix is easy though, we simply need to test that UDP is correctly added instead of blocked. Fixes: `122e6c79ef` ("sock_map: Update sock type checks for UDP") Reported-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210720184832.452430-1-john.fastabend@gmail.com	2021-07-20 13:18:51 -07:00
Martynas Pumputis	08f71a1e39	selftests/bpf: Check inner map deletion Add a test case to check whether an unsuccessful creation of an outer map of a BTF-defined map-in-map destroys the inner map. As bpf_object__create_map() is a static function, we cannot just call it from the test case and then check whether a map accessible via map->inner_map_fd has been closed. Instead, we iterate over all maps and check whether the map "$MAP_NAME.inner" does not exist. Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210719173838.423148-3-m@lambda.lt	2021-07-19 15:58:07 -07:00
Martynas Pumputis	a21ab4c59e	libbpf: Fix removal of inner map in bpf_object__create_map If creating an outer map of a BTF-defined map-in-map fails (via bpf_object__create_map()), then the previously created its inner map won't be destroyed. Fix this by ensuring that the destroy routines are not bypassed in the case of a failure. Fixes: `646f02ffdd` ("libbpf: Add BTF-defined map-in-map support") Reported-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210719173838.423148-2-m@lambda.lt	2021-07-19 15:48:20 -07:00
Andrii Nakryiko	78e4a95592	Merge branch 'libbpf: BTF typed dump cleanups' Alan Maguire says: ==================== Fix issues with libbpf BTF typed dump code. Patch 1 addresses handling of unaligned data. Patch 2 fixes issues Andrii noticed when compiling on ppc64le. Patch 3 simplifies typed dump by getting rid of allocation of dump data structure which tracks dump state etc. Changes since v1: - Andrii suggested using a function instead of a macro for checking alignment of data, and pointed out that we need to consider dump ptr size versus native pointer size (patch 1) ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org>	2021-07-16 17:30:22 -07:00
Alan Maguire	add192f81a	libbpf: Btf typed dump does not need to allocate dump data By using the stack for this small structure, we avoid the need for freeing memory in error paths. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/1626475617-25984-4-git-send-email-alan.maguire@oracle.com	2021-07-16 17:29:56 -07:00
Alan Maguire	04eb4dff6a	libbpf: Fix compilation errors on ppc64le for btf dump typed data __s64 can be defined as either long or long long, depending on the architecture. On ppc64le it's defined as long, giving this error: In file included from btf_dump.c:22: btf_dump.c: In function 'btf_dump_type_data_check_overflow': libbpf_internal.h:111:22: error: format '%lld' expects argument of type 'long long int', but argument 3 has type '__s64' {aka 'long int'} [-Werror=format=] 111 \| libbpf_print(level, "libbpf: " fmt, ##__VA_ARGS__); \ \| ^~~~~~~~~~ libbpf_internal.h:114:27: note: in expansion of macro '__pr' 114 \| #define pr_warn(fmt, ...) __pr(LIBBPF_WARN, fmt, ##__VA_ARGS__) \| ^~~~ btf_dump.c:1992:3: note: in expansion of macro 'pr_warn' 1992 \| pr_warn("unexpected size [%lld] for id [%u]\n", \| ^~~~~~~ btf_dump.c:1992:32: note: format string is defined here 1992 \| pr_warn("unexpected size [%lld] for id [%u]\n", \| ~~~^ \| \| \| long long int \| %ld Cast to size_t and use %zu instead. Reported-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/1626475617-25984-3-git-send-email-alan.maguire@oracle.com	2021-07-16 17:25:57 -07:00
Alan Maguire	8d44c3578b	libbpf: Clarify/fix unaligned data issues for btf typed dump If data is packed, data structures can store it outside of usual boundaries. For example a 4-byte int can be stored on a unaligned boundary in a case like this: struct s { char f1; int f2; } __attribute((packed)); ...the int is stored at an offset of one byte. Some platforms have problems dereferencing data that is not aligned with its size, and code exists to handle most cases of this for BTF typed data display. However pointer display was missed, and a simple function to test if "ptr_is_aligned(data, data_sz)" would help clarify this code. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/1626475617-25984-2-git-send-email-alan.maguire@oracle.com	2021-07-16 17:24:55 -07:00
Andrii Nakryiko	068dfc655b	Merge branch 'libbpf: BTF dumper support for typed data' Alan Maguire says: ==================== Add a libbpf dumper function that supports dumping a representation of data passed in using the BTF id associated with the data in a manner similar to the bpf_snprintf_btf helper. Default output format is identical to that dumped by bpf_snprintf_btf() (bar using tabs instead of spaces for indentation, but the indent string can be customized also); for example, a "struct sk_buff" representation would look like this: (struct sk_buff){ (union){ (struct){ .next = (struct sk_buff )0xffffffffffffffff, .prev = (struct sk_buff )0xffffffffffffffff, (union){ .dev = (struct net_device )0xffffffffffffffff, .dev_scratch = (long unsigned int)18446744073709551615, }, }, ... Patch 1 implements the dump functionality in a manner similar to that in kernel/bpf/btf.c, but with a view to fitting into libbpf more naturally. For example, rather than using flags, boolean dump options are used to control output. In addition, rather than combining checks for display (such as is this field zero?) and actual display - as is done for the kernel code - the code is organized to separate zero and overflow checks from type display. Patch 2 adds ASSERT_STRNEQ() for use in the following BTF dumper tests. Patch 3 consists of selftests that utilize a dump printf function to snprintf the dump output to a string for comparison with expected output. Tests deliberately mirror those in snprintf_btf helper test to keep output consistent, but also cover overflow handling, var/section display. Changes since v5 [1] - readjust dump options to avoid unnecessary padding (Andrii, patch 1). - tidied up bitfield data checking/retrieval using Andrii's suggestions. Removed code where we adjust data pointer prior to calling bitfield functions as this adjustment is not needed, provided we use the type size as the number of bytes to iterate over when retrieving the full value we apply bit shifting operations to retrieve the bitfield value. With these chances, the _int_bits() functions were no longer needed (Andrii, patch 1). - coalesced the "is zero" checking for ints, floats and pointers into btf_dump_base_type_check_zero(), using a memcmp() of the size of the data. This can be derived from t->size for ints and floats, and pointer size is retrieved from dump's ptr_sz field (Andrii, patch 1). - Added alignment-aware handling for int, enum, float retrieval. Packed data structures can force ints, enums and floats to be aligned on different boundaries; for example, the struct p { char f1; int f2; } __attribute__((packed)); ...will have the int f2 field offset at byte 1, rather than at byte 4 for an unpacked structure. The problem is directly dereferencing that as an int is problematic on some platforms. For ints and enums, we can reuse bitfield retrieval to get the value for display, while for floats we use a local union of the floating-point types and memcpy into it, ensuring we can then dereference pointers into that union which will have safe alignment (Andrii, patch 1). - added comments to explain why we increment depth prior to displaying opening parens, and decrement it prior to displaying closing parens for structs, unions and arrays. The reason is that we don't want to have a trailing newline when displaying a type. The logic that handles this says "don't show a newline when the depth we're at is 0". For this to work for opening parens then we need to bump depth before showing opening parens + newline, and when we close out structure we need to show closing parens after reducing depth so that we don't append a newline to a top-level structure. So as a result we have struct foo {\n struct bar {\n }\n } - silently truncate provided indent string with strncat() if > 31 bytes (Andrii, patch 1). - fixed ASSERT_STRNEQ() macro to show only n bytes of string (Andrii, patch 2). - fixed strncat() of type data string to avoid stack corruption (Andrii, patch 3). - removed early returns from dump type tests (Andrii, patch 3). - have tests explicitly specify prefix (enum, struct, union) (Andrii, patch 3). - switch from CHECK() to ASSERT_* where possible (Andrii, patch 3). Changes since v4 [2] - Andrii kindly provided code to unify emitting a prepended cast (for example "(int)") with existing code, and this had the nice benefit of adding array indices in type specifications (Andrii, patches 1, 3) - Fixed indent_str option to make it a const char , stored in a fixed-length buffer internally (Andrii, patch 1) - Reworked bit shift logic to minimize endian-specific interactions, and use same macros as found elsewhere in libbpf to determine endianness (Andrii, patch 1) - Fixed type emitting to ensure that a trailing '\n' is not displayed; newlines are added during struct/array display, but for a single type the last character is no longer a newline (Andrii, patches 1, 3) - Added support for ASSERT_STRNEQ() macro (Andrii, patch 2) - Split tests into subtests for int, char, enum etc rather than one "dump type data" subtest (Andrii, patch 3) - Made better use of ASSERT macros (Andrii, patch 3) - Got rid of some other TEST_* macros that were unneeded (Andrii, patch 3) - Switched to using "struct fs_context" to verify enum bitfield values (Andrii, patch 3) Changes since v3 [3] - Retained separation of emitting of type name cast prefixing type values from existing functionality such as btf_dump_emit_type_chain() since initial code-shared version had so many exceptions it became hard to read. For example, we don't emit a type name if the type to be displayed is an array member, we also always emit "forward" definitions for structs/unions that aren't really forward definitions (we just want a "struct foo" output for "(struct foo){.bar = ...". We also always ignore modifiers const/volatile/restrict as they clutter output when emitting large types. - Added configurable 4-char indent string option; defaults to tab (Andrii) - Added support for BTF_KIND_FLOAT and associated tests (Andrii) - Added support for BTF_KIND_FUNC_PROTO function pointers to improve output of "ops" structures; for example: (struct file_operations){ .owner = (struct module )0xffffffffffffffff, .llseek = (loff_t()(struct file *, loff_t, int))0xffffffffffffffff, ... Added associated test also (Andrii) - Added handling for enum bitfields and associated test (Andrii) - Allocation of "struct btf_dump_data" done on-demand (Andrii) - Removed ".field = " output from function emitting type name and into caller (Andrii) - Removed BTF_INT_OFFSET() support (Andrii) - Use libbpf_err() to set errno for error cases (Andrii) - btf_dump_dump_type_data() returns size written, which is used when returning successfully from btf_dump__dump_type_data() (Andrii) Changes since v2 [4] - Renamed function to btf_dump__dump_type_data, reorganized arguments such that opts are last (Andrii) - Modified code to separate questions about display such as have we overflowed?/is this field zero? from actual display of typed data, such that we ask those questions separately from the code that actually displays typed data (Andrii) - Reworked code to handle overflow - where we do not provide enough data for the type we wish to display - by returning -E2BIG and attempting to present as much data as possible. Such a mode of operation allows for tracers which retrieve partial data (such as first 1024 bytes of a "struct task_struct" say), and want to display that partial data, while also knowing that it is not the full type. Such tracers can then denote this (perhaps via "..." or similar). - Explored reusing existing type emit functions, such as passing in a type id stack with a single type id to btf_dump_emit_type_chain() to support the display of typed data where a "cast" is prepended to the data to denote its type; "(int)1", "(struct foo){", etc. However the task of emitting a ".field_name = (typecast)" did not match well with model of walking the stack to display innermost types first and made the resultant code harder to read. Added a dedicated btf_dump_emit_type_name() function instead which is only ~70 lines (Andrii) - Various cleanups around bitfield macros, unneeded member iteration macros, avoiding compiler complaints when displaying int da ta by casting to long long, etc (Andrii) - Use DECLARE_LIBBPF_OPTS() in defining opts for tests (Andrii) - Added more type tests, overflow tests, var tests and section tests. Changes since RFC [5] - The initial approach explored was to share the kernel code with libbpf using #defines to paper over the different needs; however it makes more sense to try and fit in with libbpf code style for maintenance. A comment in the code points at the implementation in kernel/bpf/btf.c and notes that any issues found in it should be fixed there or vice versa; mirroring the tests should help with this also (Andrii) [1] https://lore.kernel.org/bpf/1624092968-5598-1-git-send-email-alan.maguire@oracle.com/ [2] https://lore.kernel.org/bpf/CAEf4BzYtbnphCkhz0epMKE4zWfvSOiMpu+-SXp9hadsrRApuZw@mail.gmail.com/T/ [3] https://lore.kernel.org/bpf/1622131170-8260-1-git-send-email-alan.maguire@oracle.com/ [4] https://lore.kernel.org/bpf/1610921764-7526-1-git-send-email-alan.maguire@oracle.com/ [5] https://lore.kernel.org/bpf/1610386373-24162-1-git-send-email-alan.maguire@oracle.com/ ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org>	2021-07-16 13:50:41 -07:00
Alan Maguire	70a9241fbc	selftests/bpf: Add dump type data tests to btf dump tests Test various type data dumping operations by comparing expected format with the dumped string; an snprintf-style printf function is used to record the string dumped. Also verify overflow handling where the data passed does not cover the full size of a type, such as would occur if a tracer has a portion of the 8k "struct task_struct". Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/1626362126-27775-4-git-send-email-alan.maguire@oracle.com	2021-07-16 13:50:36 -07:00
Alan Maguire	17283337ff	selftests/bpf: Add ASSERT_STRNEQ() variant for test_progs It will support strncmp()-style string comparisons. Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/1626362126-27775-3-git-send-email-alan.maguire@oracle.com	2021-07-16 13:46:59 -07:00
Alan Maguire	920d16af9b	libbpf: BTF dumper support for typed data Add a BTF dumper for typed data, so that the user can dump a typed version of the data provided. The API is int btf_dump__dump_type_data(struct btf_dump d, __u32 id, void data, size_t data_sz, const struct btf_dump_type_data_opts opts); ...where the id is the BTF id of the data pointed to by the "void " argument; for example the BTF id of "struct sk_buff" for a "struct skb " data pointer. Options supported are - a starting indent level (indent_lvl) - a user-specified indent string which will be printed once per indent level; if NULL, tab is chosen but any string <= 32 chars can be provided. - a set of boolean options to control dump display, similar to those used for BPF helper bpf_snprintf_btf(). Options are - compact : omit newlines and other indentation - skip_names: omit member names - emit_zeroes: show zero-value members Default output format is identical to that dumped by bpf_snprintf_btf(), for example a "struct sk_buff" representation would look like this: struct sk_buff){ (union){ (struct){ .next = (struct sk_buff )0xffffffffffffffff, .prev = (struct sk_buff )0xffffffffffffffff, (union){ .dev = (struct net_device )0xffffffffffffffff, .dev_scratch = (long unsigned int)18446744073709551615, }, }, ... If the data structure is larger than the data_sz number of bytes that are available in data, as much of the data as possible will be dumped and -E2BIG will be returned. This is useful as tracers will sometimes not be able to capture all of the data associated with a type; for example a "struct task_struct" is ~16k. Being able to specify that only a subset is available is important for such cases. On success, the amount of data dumped is returned. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/1626362126-27775-2-git-send-email-alan.maguire@oracle.com	2021-07-16 13:46:59 -07:00
Andrii Nakryiko	334faa5ce5	Merge branch 'Add btf_custom_path in bpf_obj_open_opts' Shuyi Cheng says: ==================== This patch set adds the ability to point to a custom BTF for the purposes of BPF CO-RE relocations. This is useful for using BPF CO-RE on old kernels that don't yet natively support kernel (vmlinux) BTF and thus libbpf needs application's help in locating kernel BTF generated separately from the kernel itself. This was already possible to do through bpf_object__load's attribute struct, but that makes it inconvenient to use with BPF skeleton, which only allows to specify bpf_object_open_opts during the open step. Thus, add the ability to override vmlinux BTF at open time. Patch #1 adds libbpf changes. Patch #2 fixes pre-existing memory leak detected during the code review. Patch #3 switches existing selftests to using open_opts for custom BTF. Changelog: ---------- v3: https://lore.kernel.org/bpf/CAEf4BzY2cdT44bfbMus=gei27ViqGE1BtGo6XrErSsOCnqtVJg@mail.gmail.com/T/#m877eed1d4cf0a1d3352d3f3d6c5ff158be45c542 v3->v4: - Follow Andrii's suggestion to modify cover letter description. - Delete function bpf_object__load_override_btf. - Follow Dan's suggestion to add fixes tag and modify commit msg to patch #2. - Add pathch #3 to switch existing selftests to using open_opts. v2: https://lore.kernel.org/bpf/CAEf4Bza_ua+tjxdhyy4nZ8Boeo+scipWmr_1xM1pC6N5wyuhAA@mail.gmail.com/T/#mf9cf86ae0ffa96180ac29e4fd12697eb70eccd0f v2->v3: - Load the BTF specified by btf_custom_path to btf_vmlinux_override instead of btf_bmlinux. - Fix the memory leak that may be introduced by the second version of the patch. - Add a new patch to fix the possible memory leak caused by obj->kconfig. v1: https://lore.kernel.org/bpf/CAEf4BzaGjEC4t1OefDo11pj2-HfNy0BLhs_G2UREjRNTmb2u=A@mail.gmail.com/t/#m4d9f7c6761fbd2b436b5dfe491cd864b70225804 v1->v2: - Change custom_btf_path to btf_custom_path. - If the length of btf_custom_path of bpf_obj_open_opts is too long, return ERR_PTR(-ENAMETOOLONG). - Add `custom BTF is in addition to vmlinux BTF` with btf_custom_path field. ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org>	2021-07-16 13:22:47 -07:00
Shuyi Cheng	f0b7d11925	selftests/bpf: Switch existing selftests to using open_opts for custom BTF This patch mainly replaces the bpf_object_load_attr of the core_autosize.c and core_reloc.c files with bpf_object_open_opts. Signed-off-by: Shuyi Cheng <chengshuyi@linux.alibaba.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/1626180159-112996-4-git-send-email-chengshuyi@linux.alibaba.com	2021-07-16 13:22:47 -07:00
Shuyi Cheng	18353c87e0	libbpf: Fix the possible memory leak on error If the strdup() fails then we need to call bpf_object__close(obj) to avoid a resource leak. Fixes: `166750bc1d` ("libbpf: Support libbpf-provided extern variables") Signed-off-by: Shuyi Cheng <chengshuyi@linux.alibaba.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/1626180159-112996-3-git-send-email-chengshuyi@linux.alibaba.com	2021-07-16 13:22:47 -07:00
Shuyi Cheng	1373ff5995	libbpf: Introduce 'btf_custom_path' to 'bpf_obj_open_opts' btf_custom_path allows developers to load custom BTF which libbpf will subsequently use for CO-RE relocation instead of vmlinux BTF. Having btf_custom_path in bpf_object_open_opts one can directly use the skeleton's <objname>_bpf__open_opts() API to pass in the btf_custom_path parameter, as opposed to using bpf_object__load_xattr() which is slated to be deprecated ([0]). This work continues previous work started by another developer ([1]). [0] https://lore.kernel.org/bpf/CAEf4BzbJZLjNoiK8_VfeVg_Vrg=9iYFv+po-38SMe=UzwDKJ=Q@mail.gmail.com/#t [1] https://yhbt.net/lore/all/CAEf4Bzbgw49w2PtowsrzKQNcxD4fZRE6AKByX-5-dMo-+oWHHA@mail.gmail.com/ Signed-off-by: Shuyi Cheng <chengshuyi@linux.alibaba.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/1626180159-112996-2-git-send-email-chengshuyi@linux.alibaba.com	2021-07-16 13:22:47 -07:00
Roy, UjjaL	8886534747	bpf, doc: Add heading and example for extensions in cbpf Add new heading for extensions to make it more readable. Also, add one more example of filtering interface index for better understanding. Signed-off-by: Roy, UjjaL <royujjal@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/CAADnVQJ=DoRDcVkaXmY3EmNdLoO7gq1mkJOn5G=00wKH8qUtZQ@mail.gmail.com	2021-07-16 22:08:17 +02:00
Andrii Nakryiko	c7603cfa04	bpf: Add ambient BPF runtime context stored in current `b910eaaaa4` ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper") fixed the problem with cgroup-local storage use in BPF by pre-allocating per-CPU array of 8 cgroup storage pointers to accommodate possible BPF program preemptions and nested executions. While this seems to work good in practice, it introduces new and unnecessary failure mode in which not all BPF programs might be executed if we fail to find an unused slot for cgroup storage, however unlikely it is. It might also not be so unlikely when/if we allow sleepable cgroup BPF programs in the future. Further, the way that cgroup storage is implemented as ambiently-available property during entire BPF program execution is a convenient way to pass extra information to BPF program and helpers without requiring user code to pass around extra arguments explicitly. So it would be good to have a generic solution that can allow implementing this without arbitrary restrictions. Ideally, such solution would work for both preemptable and sleepable BPF programs in exactly the same way. This patch introduces such solution, bpf_run_ctx. It adds one pointer field (bpf_ctx) to task_struct. This field is maintained by BPF_PROG_RUN family of macros in such a way that it always stays valid throughout BPF program execution. BPF program preemption is handled by remembering previous current->bpf_ctx value locally while executing nested BPF program and restoring old value after nested BPF program finishes. This is handled by two helper functions, bpf_set_run_ctx() and bpf_reset_run_ctx(), which are supposed to be used before and after BPF program runs, respectively. Restoring old value of the pointer handles preemption, while bpf_run_ctx pointer being a property of current task_struct naturally solves this problem for sleepable BPF programs by "following" BPF program execution as it is scheduled in and out of CPU. It would even allow CPU migration of BPF programs, even though it's not currently allowed by BPF infra. This patch cleans up cgroup local storage handling as a first application. The design itself is generic, though, with bpf_run_ctx being an empty struct that is supposed to be embedded into a specific struct for a given BPF program type (bpf_cg_run_ctx in this case). Follow up patches are planned that will expand this mechanism for other uses within tracing BPF programs. To verify that this change doesn't revert the fix to the original cgroup storage issue, I ran the same repro as in the original report ([0]) and didn't get any problems. Replacing bpf_reset_run_ctx(old_run_ctx) with bpf_reset_run_ctx(NULL) triggers the issue pretty quickly (so repro does work). [0] https://lore.kernel.org/bpf/YEEvBUiJl2pJkxTd@krava/ Fixes: `b910eaaaa4` ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210712230615.3525979-1-andrii@kernel.org	2021-07-16 21:15:28 +02:00
Peilin Ye	d4861fc6be	netdevsim: Add multi-queue support Currently netdevsim only supports a single queue per port, which is insufficient for testing multi-queue TC schedulers e.g. sch_mq. Extend the current sysfs interface so that users can create ports with multiple queues: $ echo "[ID] [PORT_COUNT] [NUM_QUEUES]" > /sys/bus/netdevsim/new_device As an example, echoing "2 4 8" creates 4 ports, with 8 queues per port. Note, this is compatible with the current interface, with default number of queues set to 1. For example, echoing "2 4" creates 4 ports with 1 queue per port; echoing "2" simply creates 1 port with 1 queue. Reviewed-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-16 11:17:56 -07:00
Mark Gray	b83d23a2a3	openvswitch: Introduce per-cpu upcall dispatch The Open vSwitch kernel module uses the upcall mechanism to send packets from kernel space to user space when it misses in the kernel space flow table. The upcall sends packets via a Netlink socket. Currently, a Netlink socket is created for every vport. In this way, there is a 1:1 mapping between a vport and a Netlink socket. When a packet is received by a vport, if it needs to be sent to user space, it is sent via the corresponding Netlink socket. This mechanism, with various iterations of the corresponding user space code, has seen some limitations and issues: * On systems with a large number of vports, there is a correspondingly large number of Netlink sockets which can limit scaling. (https://bugzilla.redhat.com/show_bug.cgi?id=1526306) * Packet reordering on upcalls. (https://bugzilla.redhat.com/show_bug.cgi?id=1844576) * A thundering herd issue. (https://bugzilla.redhat.com/show_bug.cgi?id=1834444) This patch introduces an alternative, feature-negotiated, upcall mode using a per-cpu dispatch rather than a per-vport dispatch. In this mode, the Netlink socket to be used for the upcall is selected based on the CPU of the thread that is executing the upcall. In this way, it resolves the issues above as: a) The number of Netlink sockets scales with the number of CPUs rather than the number of vports. b) Ordering per-flow is maintained as packets are distributed to CPUs based on mechanisms such as RSS and flows are distributed to a single user space thread. c) Packets from a flow can only wake up one user space thread. The corresponding user space code can be found at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/385139.html Bugzilla: https://bugzilla.redhat.com/1844576 Signed-off-by: Mark Gray <mark.d.gray@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-16 11:06:33 -07:00
Bill Wendling	919d527956	bnx2x: remove unused variable 'cur_data_offset' Fix the clang build warning: drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c:1862:13: error: variable 'cur_data_offset' set but not used [-Werror,-Wunused-but-set-variable] dma_addr_t cur_data_offset; Signed-off-by: Bill Wendling <morbo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-16 10:56:55 -07:00
Christophe JAILLET	a99f030b24	net: switchdev: Simplify 'mlxsw_sp_mc_write_mdb_entry()' Use 'bitmap_alloc()/bitmap_free()' instead of hand-writing it. This makes the code less verbose. Also, use 'bitmap_alloc()' instead of 'bitmap_zalloc()' because the bitmap is fully overridden by a 'bitmap_copy()' call just after its allocation. While at it, remove an extra and unneeded space. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-16 10:54:18 -07:00
Yajun Deng	f79a3bcb1a	net/sched: Remove unnecessary if statement It has been deal with the 'if (err' statement in rtnetlink_send() and rtnl_unicast(). so remove unnecessary if statement. v2: use the raw name rtnetlink_send(). Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-16 10:46:35 -07:00
Yajun Deng	cfdf0d9ae7	rtnetlink: use nlmsg_notify() in rtnetlink_send() The netlink_{broadcast, unicast} don't deal with 'if (err > 0' statement but nlmsg_{multicast, unicast} do. The nlmsg_notify() contains them. so use nlmsg_notify() instead. so that the caller wouldn't deal with 'if (err > 0' statement. v2: use nlmsg_notify() will do well. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-16 10:46:35 -07:00
Haiyue Wang	63a9192b8f	gve: fix the wrong AdminQ buffer overflow check The 'tail' pointer is also free-running count, so it needs to be masked as 'adminq_prod_cnt' does, to become an index value of AdminQ buffer. Fixes: `5cdad90de6` ("gve: Batch AQ commands for creating and destroying queues.") Signed-off-by: Haiyue Wang <haiyue.wang@intel.com> Reviewed-by: Catherine Sullivan <csully@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-16 10:41:40 -07:00

1 2 3 4 5 ...

1029631 commits