linux-stable/kernel
Martin KaFai Lau 5dc4c4b7d4 bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY
This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.

To unleash the full potential of a bpf prog, it is essential for the
userspace to be capable of directly setting up a bpf map which can then
be consumed by the bpf prog to make decision.  In this case, decide which
SO_REUSEPORT sk to serve the incoming request.

By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
the bpf prog can directly select a sk from the bpf map.  That will
raise the programmability of the bpf prog attached to a reuseport
group (a group of sk serving the same IP:PORT).

For example, in UDP, the bpf prog can peek into the payload (e.g.
through the "data" pointer introduced in the later patch) to learn
the application level's connection information and then decide which sk
to pick from a bpf map.  The userspace can tightly couple the sk's location
in a bpf map with the application logic in generating the UDP payload's
connection information.  This connection info contact/API stays within the
userspace.

Also, when used with map-in-map, the userspace can switch the
old-server-process's inner map to a new-server-process's inner map
in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
The bpf prog will then direct incoming requests to the new process instead
of the old process.  The old process can finish draining the pending
requests (e.g. by "accept()") before closing the old-fds.  [Note that
deleting a fd from a bpf map does not necessary mean the fd is closed]

During map_update_elem(),
Only SO_REUSEPORT sk (i.e. which has already been added
to a reuse->socks[]) can be used.  That means a SO_REUSEPORT sk that is
"bind()" for UDP or "bind()+listen()" for TCP.  These conditions are
ensured in "reuseport_array_update_check()".

A SO_REUSEPORT sk can only be added once to a map (i.e. the
same sk cannot be added twice even to the same map).  SO_REUSEPORT
already allows another sk to be created for the same IP:PORT.
There is no need to re-create a similar usage in the BPF side.

When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"),
it will notify the bpf map to remove it from the map also.  It is
done through "bpf_sk_reuseport_detach()" and it will only be called
if >=1 of the "reuse->sock[]" has ever been added to a bpf map.

The map_update()/map_delete() has to be in-sync with the
"reuse->socks[]".  Hence, the same "reuseport_lock" used
by "reuse->socks[]" has to be used here also. Care has
been taken to ensure the lock is only acquired when the
adding sk passes some strict tests. and
freeing the map does not require the reuseport_lock.

The reuseport_array will also support lookup from the syscall
side.  It will return a sock_gen_cookie().  The sock_gen_cookie()
is on-demand (i.e. a sk's cookie is not generated until the very
first map_lookup_elem()).

The lookup cookie is 64bits but it goes against the logical userspace
expectation on 32bits sizeof(fd) (and as other fd based bpf maps do also).
It may catch user in surprise if we enforce value_size=8 while
userspace still pass a 32bits fd during update.  Supporting different
value_size between lookup and update seems unintuitive also.

We also need to consider what if other existing fd based maps want
to return 64bits value from syscall's lookup in the future.
Hence, reuseport_array supports both value_size 4 and 8, and
assuming user will usually use value_size=4.  The syscall's lookup
will return ENOSPC on value_size=4.  It will will only
return 64bits value from sock_gen_cookie() when user consciously
choose value_size=8 (as a signal that lookup is desired) which then
requires a 64bits value in both lookup and update.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-11 01:58:46 +02:00
..
bpf bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY 2018-08-11 01:58:46 +02:00
cgroup kernfs: allow creating kernfs objects with arbitrary uid/gid 2018-07-20 23:44:35 -07:00
configs kconfig: tinyconfig: remove stale stack protector fixups 2018-06-15 07:15:28 +09:00
debug treewide: kzalloc() -> kcalloc() 2018-06-12 16:19:22 -07:00
dma swiotlb: export swiotlb_dma_ops 2018-06-28 14:00:40 +02:00
events perf/core: Fix crash when using HW tracing kernel filters 2018-07-25 11:46:22 +02:00
gcov gcov: remove CONFIG_GCOV_FORMAT_AUTODETECT 2018-06-08 18:56:02 +09:00
irq genirq: Make force irq threading setup more robust 2018-08-03 15:19:01 +02:00
livepatch livepatch: Allow to call a custom callback when freeing shadow variables 2018-04-17 13:42:48 +02:00
locking locking/rtmutex: Allow specifying a subclass for nested locking 2018-07-25 11:22:19 +02:00
power fix a series of Documentation/ broken file name references 2018-06-15 18:10:01 -03:00
printk Printk changes for 4.18 2018-06-06 16:04:55 -07:00
rcu treewide: Use array_size() in vmalloc() 2018-06-12 16:19:22 -07:00
sched sched/rt: Restore rt_runtime after disabling RT_RUNTIME_SHARE 2018-07-25 11:29:08 +02:00
time nohz: Fix local_timer_softirq_pending() 2018-07-31 22:08:44 +02:00
trace tracing: Quiet gcc warning about maybe unused link variable 2018-07-25 22:33:50 -04:00
.gitignore
acct.c kernel/acct.c: fix the acct->needcheck check in check_free_space() 2018-01-04 16:45:09 -08:00
async.c kernel/async.c: revert "async: simplify lowest_in_progress()" 2018-02-06 18:32:44 -08:00
audit.c audit: use inline function to get audit context 2018-05-14 17:24:18 -04:00
audit.h audit: track the owner of the command mutex ourselves 2018-02-23 11:22:22 -05:00
audit_fsnotify.c fsnotify: add fsnotify_add_inode_mark() wrappers 2018-05-18 14:58:22 +02:00
audit_tree.c fsnotify: add fsnotify_add_inode_mark() wrappers 2018-05-18 14:58:22 +02:00
audit_watch.c \n 2018-06-17 05:06:18 +09:00
auditfilter.c audit: use existing session info function 2018-05-18 15:47:54 -04:00
auditsc.c audit: fix potential null dereference 'context->module.name' 2018-07-30 18:09:37 -04:00
backtracetest.c
bounds.c
capability.c
compat.c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-06-04 20:27:54 -07:00
configs.c
context_tracking.c
cpu.c cpu/hotplug: Fix unused function warning 2018-03-15 20:34:40 +01:00
cpu_pm.c
crash_core.c mm: split page_type out from _mapcount 2018-06-07 17:34:37 -07:00
crash_dump.c
cred.c
delayacct.c delayacct: Use raw_spinlocks 2018-04-27 14:34:51 +02:00
dma.c proc: introduce proc_create_single{,_data} 2018-05-16 07:23:35 +02:00
elfcore.c
exec_domain.c proc: introduce proc_create_single{,_data} 2018-05-16 07:23:35 +02:00
exit.c kernel: use kernel_wait4() instead of sys_wait4() 2018-04-02 20:14:51 +02:00
extable.c extable: Make init_kernel_text() global 2018-02-21 16:54:06 +01:00
fail_function.c treewide: kmalloc() -> kmalloc_array() 2018-06-12 16:19:22 -07:00
fork.c mm: introduce vma_init() 2018-07-26 19:38:03 -07:00
freezer.c
futex.c pids: introduce find_get_task_by_vpid() helper 2018-02-06 18:32:46 -08:00
futex_compat.c
groups.c
hung_task.c kernel/hung_task.c: show all hung tasks before panic 2018-06-07 17:34:39 -07:00
iomem.c memremap: split devm_memremap_pages() and memremap() infrastructure 2018-05-15 23:08:33 -07:00
irq_work.c irq/work: Improve the flag definitions 2018-01-08 19:43:15 +01:00
jump_label.c jump_label: Disable jump labels in __exit code 2018-03-20 08:57:17 +01:00
kallsyms.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk 2018-02-01 13:36:15 -08:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kcov.c sched/core / kcov: avoid kcov_area during task switch 2018-06-15 07:55:24 +09:00
kexec.c kexec: call do_kexec_load() in compat syscall directly 2018-04-02 20:15:01 +02:00
kexec_core.c kexec: yield to scheduler when loading kimage segments 2018-06-15 07:55:24 +09:00
kexec_file.c treewide: Use array_size() in vzalloc() 2018-06-12 16:19:22 -07:00
kexec_internal.h
kmod.c
kprobes.c kprobes: Fix random address output of blacklist file 2018-04-25 10:27:56 -04:00
ksysfs.c
kthread.c kthread, tracing: Don't expose half-written comm when creating kthreads 2018-07-26 09:59:33 -04:00
latencytop.c
Makefile dma-mapping: move all DMA mapping code to kernel/dma 2018-06-14 08:50:37 +02:00
memremap.c mm: fix exports that inadvertently make put_page() EXPORT_SYMBOL_GPL 2018-07-26 19:38:03 -07:00
module-internal.h
module.c Modules updates for v4.18 2018-06-16 07:36:39 +09:00
module_signing.c
notifier.c
nsproxy.c
padata.c padata: add SPDX identifier 2018-01-05 18:43:00 +11:00
panic.c Kbuild: rename CC_STACKPROTECTOR[_STRONG] config variables 2018-06-14 12:21:18 +09:00
params.c kernel/params.c: downgrade warning for unsafe parameters 2018-04-11 10:28:37 -07:00
pid.c xarray: add the xa_lock to the radix_tree_root 2018-04-11 10:28:39 -07:00
pid_namespace.c Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2018-04-03 19:15:32 -07:00
profile.c
ptrace.c pids: introduce find_get_task_by_vpid() helper 2018-02-06 18:32:46 -08:00
range.c
reboot.c
relay.c kernel/relay.c: change return type to vm_fault_t 2018-06-15 07:55:24 +09:00
resource.c libnvdimm for 4.18 2018-06-08 17:21:52 -07:00
rseq.c rseq: uapi: Declare rseq_cs field as union, update includes 2018-07-10 22:18:52 +02:00
seccomp.c audit/stable-4.18 PR 20180605 2018-06-06 16:34:00 -07:00
signal.c signal: Remove no longer required irqsave/restore 2018-06-10 06:14:01 +02:00
smp.c
smpboot.c
smpboot.h
softirq.c nohz: Fix missing tick reprogram when interrupting an inline softirq 2018-08-03 15:52:10 +02:00
stacktrace.c
stop_machine.c stop_machine: Disable preemption after queueing stopper threads 2018-07-25 11:25:08 +02:00
sys.c mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct 2018-06-07 17:34:34 -07:00
sys_ni.c Merge branch 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-06-10 10:17:09 -07:00
sysctl.c treewide: kzalloc() -> kcalloc() 2018-06-12 16:19:22 -07:00
sysctl_binary.c staging: irda: remove remaining remants of irda code removal 2018-04-16 11:26:49 +02:00
task_work.c
taskstats.c pids: introduce find_get_task_by_vpid() helper 2018-02-06 18:32:46 -08:00
test_kprobes.c
torture.c rcu: Rename cond_resched_rcu_qs() to cond_resched_tasks_rcu_qs() 2018-05-15 10:27:29 -07:00
tracepoint.c tracepoints: Fix the descriptions of tracepoint_probe_register{_prio} 2018-05-28 12:49:51 -04:00
tsacct.c
ucount.c headers: untangle kmemleak.h from mm.h 2018-04-05 21:36:27 -07:00
uid16.c fs: add do_fchownat(), ksys_fchown() helpers and ksys_{,l}chown() wrappers 2018-04-02 20:15:59 +02:00
uid16.h kernel: provide ksys_*() wrappers for syscalls called by kernel/uid16.c 2018-04-02 20:15:30 +02:00
umh.c umh: fix race condition 2018-06-07 16:56:28 -04:00
up.c
user-return-notifier.c
user.c efivarfs: Limit the rate for non-root to read files 2018-02-22 10:21:02 -08:00
user_namespace.c treewide: kmalloc() -> kmalloc_array() 2018-06-12 16:19:22 -07:00
utsname.c uts: create "struct uts_namespace" from kmem_cache 2018-04-11 10:28:35 -07:00
utsname_sysctl.c
watchdog.c
watchdog_hld.c
workqueue.c treewide: kzalloc() -> kcalloc() 2018-06-12 16:19:22 -07:00
workqueue_internal.h workqueue: Set worker->desc to workqueue name by default 2018-05-18 08:47:13 -07:00