linux-stable/kernel/bpf
Daniel Borkmann 35dfaad718 netkit, bpf: Add bpf programmable net device
This work adds a new, minimal BPF-programmable device called "netkit"
(former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
core idea is that BPF programs are executed within the drivers xmit routine
and therefore e.g. in case of containers/Pods moving BPF processing closer
to the source.

One of the goals was that in case of Pod egress traffic, this allows to
move BPF programs from hostns tcx ingress into the device itself, providing
earlier drop or forward mechanisms, for example, if the BPF program
determines that the skb must be sent out of the node, then a redirect to
the physical device can take place directly without going through per-CPU
backlog queue. This helps to shift processing for such traffic from softirq
to process context, leading to better scheduling decisions/performance (see
measurements in the slides).

In this initial version, the netkit device ships as a pair, but we plan to
extend this further so it can also operate in single device mode. The pair
comes with a primary and a peer device. Only the primary device, typically
residing in hostns, can manage BPF programs for itself and its peer. The
peer device is designated for containers/Pods and cannot attach/detach
BPF programs. Upon the device creation, the user can set the default policy
to 'pass' or 'drop' for the case when no BPF program is attached.

Additionally, the device can be operated in L3 (default) or L2 mode. The
management of BPF programs is done via bpf_mprog, so that multi-attach is
supported right from the beginning with similar API and dependency controls
as tcx. For details on the latter see commit 053c8e1f23 ("bpf: Add generic
attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
so that existing programs can be easily migrated.

Going forward, we plan to use netkit devices in Cilium as the main device
type for connecting Pods. They will be operated in L3 mode in order to
simplify a Pod's neighbor management and the peer will operate in default
drop mode, so that no traffic is leaving between the time when a Pod is
brought up by the CNI plugin and programs attached by the agent.
Additionally, the programs we attach via tcx on the physical devices are
using bpf_redirect_peer() for inbound traffic into netkit device, hence the
latter is also supporting the ndo_get_peer_dev callback. Similarly, we use
bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device
directly. Also, BIG TCP is supported on netkit device. For the follow-up
work in single device mode, we plan to convert Cilium's cilium_host/_net
devices into a single one.

An extensive test suite for checking device operations and the BPF program
and link management API comes as BPF selftests in this series.

Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://github.com/borkmann/iproute2/tree/pr/netkit
Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.)
Link: https://lore.kernel.org/r/20231024214904.29825-2-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-24 16:06:03 -07:00
..
preload bpf: make preloaded map iterators to display map elements count 2023-07-06 12:42:25 -07:00
arraymap.c bpf: return long from bpf_map_ops funcs 2023-03-22 15:11:30 -07:00
bloom_filter.c bpf: Centralize permissions checks for all BPF map types 2023-06-19 14:04:04 +02:00
bpf_cgrp_storage.c bpf: Teach verifier that certain helpers accept NULL pointer. 2023-04-04 16:57:16 -07:00
bpf_inode_storage.c Networking changes for 6.4. 2023-04-26 16:07:23 -07:00
bpf_iter.c bpf: Don't explicitly emit BTF for struct btf_iter_num 2023-10-13 15:48:58 -07:00
bpf_local_storage.c bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc 2023-09-06 11:08:14 +02:00
bpf_lru_list.c bpf: Address KCSAN report on bpf_lru_list 2023-05-12 12:01:03 -07:00
bpf_lru_list.h bpf: lru: Remove unused declaration bpf_lru_promote() 2023-08-08 17:21:42 -07:00
bpf_lsm.c bpf: Fix the kernel crash caused by bpf_setsockopt(). 2023-01-26 23:26:40 -08:00
bpf_struct_ops.c bpf: Charge modmem for struct_ops trampoline 2023-09-14 15:30:45 -07:00
bpf_struct_ops_types.h
bpf_task_storage.c bpf: Teach verifier that certain helpers accept NULL pointer. 2023-04-04 16:57:16 -07:00
btf.c bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf 2023-10-11 16:29:25 -07:00
cgroup.c bpf: Implement cgroup sockaddr hooks for unix sockets 2023-10-11 17:27:47 -07:00
cgroup_iter.c bpf: Introduce css open-coded iterator kfuncs 2023-10-19 17:02:46 -07:00
core.c Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2023-09-17 15:12:06 +01:00
cpumap.c net, bpf: Add a warning if NAPI cb missed xdp_do_flush(). 2023-10-17 15:02:03 +02:00
cpumask.c bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu. 2023-07-12 23:45:23 +02:00
devmap.c net, bpf: Add a warning if NAPI cb missed xdp_do_flush(). 2023-10-17 15:02:03 +02:00
disasm.c bpf: change bpf_alu_sign_string and bpf_movsx_string to static 2023-08-04 16:15:50 -07:00
disasm.h
dispatcher.c bpf: Synchronize dispatcher update with bpf_dispatcher_xdp_func 2022-12-14 12:02:14 -08:00
hashtab.c bpf: Fix unnecessary -EBUSY from htab_lock_bucket 2023-10-24 14:25:55 +02:00
helpers.c bpf: Use bpf_global_percpu_ma for per-cpu kptr in __bpf_obj_drop_impl() 2023-10-20 14:15:13 -07:00
inode.c bpf: convert to ctime accessor functions 2023-07-24 10:30:07 +02:00
Kconfig bpf: Add fd-based tcx multi-prog infra with link support 2023-07-19 10:07:27 -07:00
link_iter.c bpf: Add bpf_link iterator 2022-05-10 11:20:45 -07:00
local_storage.c cgroup changes for v6.4-rc1 2023-04-29 10:05:22 -07:00
log.c bpf: drop unnecessary user-triggerable WARN_ONCE in verifierl log 2023-05-16 22:34:50 -07:00
lpm_trie.c bpf: Centralize permissions checks for all BPF map types 2023-06-19 14:04:04 +02:00
Makefile bpf: Add fd-based tcx multi-prog infra with link support 2023-07-19 10:07:27 -07:00
map_in_map.c bpf: Fix elem_size not being set for inner maps 2023-06-02 16:22:12 -07:00
map_in_map.h
map_iter.c bpf: allow any program to use the bpf_map_sum_elem_count kfunc 2023-07-19 09:48:53 -07:00
memalloc.c bpf: Use pcpu_alloc_size() in bpf_mem_free{_rcu}() 2023-10-20 14:15:13 -07:00
mmap_unlock_work.h
mprog.c bpf: Handle bpf_mprog_query with NULL entry 2023-10-06 17:11:20 -07:00
net_namespace.c net: Add includes masked by netdevice.h including uapi/bpf.h 2021-12-29 20:03:05 -08:00
offload.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2023-09-21 21:49:45 +02:00
percpu_freelist.c bpf: Initialize same number of free nodes for each pcpu_freelist 2022-11-11 12:05:14 -08:00
percpu_freelist.h
prog_iter.c
queue_stack_maps.c bpf: Avoid deadlock when using queue and stack maps from NMI 2023-09-11 19:04:49 -07:00
reuseport_array.c bpf: Centralize permissions checks for all BPF map types 2023-06-19 14:04:04 +02:00
ringbuf.c bpf: Fold smp_mb__before_atomic() into atomic_set_release() 2023-10-24 14:26:07 +02:00
stackmap.c bpf: Annotate struct bpf_stack_map with __counted_by 2023-10-06 23:44:35 +02:00
syscall.c netkit, bpf: Add bpf programmable net device 2023-10-24 16:06:03 -07:00
sysfs_btf.c
task_iter.c bpf: Let bpf_iter_task_new accept null task ptr 2023-10-19 17:02:46 -07:00
tcx.c bpf, tcx: Get rid of tcx_link_const 2023-10-23 15:01:53 -07:00
tnum.c
trampoline.c bpf, x64: Fix tailcall infinite loop 2023-09-12 13:06:12 -07:00
verifier.c bpf: Improve JEQ/JNE branch taken logic 2023-10-24 14:45:51 +02:00