linux-stable/net/core
Martin KaFai Lau 40a1227ea8 tcp: Avoid TCP syncookie rejected by SO_REUSEPORT socket
Although the actual cookie check "__cookie_v[46]_check()" does
not involve sk specific info, it checks whether the sk has recent
synq overflow event in "tcp_synq_no_recent_overflow()".  The
tcp_sk(sk)->rx_opt.ts_recent_stamp is updated every second
when it has sent out a syncookie (through "tcp_synq_overflow()").

The above per sk "recent synq overflow event timestamp" works well
for non SO_REUSEPORT use case.  However, it may cause random
connection request reject/discard when SO_REUSEPORT is used with
syncookie because it fails the "tcp_synq_no_recent_overflow()"
test.

When SO_REUSEPORT is used, it usually has multiple listening
socks serving TCP connection requests destinated to the same local IP:PORT.
There are cases that the TCP-ACK-COOKIE may not be received
by the same sk that sent out the syncookie.  For example,
if reuse->socks[] began with {sk0, sk1},
1) sk1 sent out syncookies and tcp_sk(sk1)->rx_opt.ts_recent_stamp
   was updated.
2) the reuse->socks[] became {sk1, sk2} later.  e.g. sk0 was first closed
   and then sk2 was added.  Here, sk2 does not have ts_recent_stamp set.
   There are other ordering that will trigger the similar situation
   below but the idea is the same.
3) When the TCP-ACK-COOKIE comes back, sk2 was selected.
   "tcp_synq_no_recent_overflow(sk2)" returns true. In this case,
   all syncookies sent by sk1 will be handled (and rejected)
   by sk2 while sk1 is still alive.

The userspace may create and remove listening SO_REUSEPORT sockets
as it sees fit.  E.g. Adding new thread (and SO_REUSEPORT sock) to handle
incoming requests, old process stopping and new process starting...etc.
With or without SO_ATTACH_REUSEPORT_[CB]BPF,
the sockets leaving and joining a reuseport group makes picking
the same sk to check the syncookie very difficult (if not impossible).

The later patches will allow bpf prog more flexibility in deciding
where a sk should be located in a bpf map and selecting a particular
SO_REUSEPORT sock as it sees fit.  e.g. Without closing any sock,
replace the whole bpf reuseport_array in one map_update() by using
map-in-map.  Getting the syncookie check working smoothly across
socks in the same "reuse->socks[]" is important.

A partial solution is to set the newly added sk's ts_recent_stamp
to the max ts_recent_stamp of a reuseport group but that will require
to iterate through reuse->socks[]  OR
pessimistically set it to "now - TCP_SYNCOOKIE_VALID" when a sk is
joining a reuseport group.  However, neither of them will solve the
existing sk getting moved around the reuse->socks[] and that
sk may not have ts_recent_stamp updated, unlikely under continuous
synflood but not impossible.

This patch opts to treat the reuseport group as a whole when
considering the last synq overflow timestamp since
they are serving the same IP:PORT from the userspace
(and BPF program) perspective.

"synq_overflow_ts" is added to "struct sock_reuseport".
The tcp_synq_overflow() and tcp_synq_no_recent_overflow()
will update/check reuse->synq_overflow_ts if the sk is
in a reuseport group.  Similar to the reuseport decision in
__inet_lookup_listener(), both sk->sk_reuseport and
sk->sk_reuseport_cb are tested for SO_REUSEPORT usage.
Update on "synq_overflow_ts" happens at roughly once
every second.

A synflood test was done with a 16 rx-queues and 16 reuseport sockets.
No meaningful performance change is observed.  Before and
after the change is ~9Mpps in IPv4.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-11 01:58:45 +02:00
..
datagram.c net: simplify sock_poll_wait 2018-07-30 09:10:25 -07:00
dev.c net: check extack._msg before print 2018-08-05 17:24:45 -07:00
dev_addr_lists.c net: change the comment of dev_mc_init 2018-04-19 12:58:20 -04:00
dev_ioctl.c net: remove redundant input checks in SIOCSIFTXQLEN case of dev_ifsioc 2018-07-24 11:36:15 -07:00
devlink.c devlink: Add generic parameters region_snapshot 2018-07-12 17:37:13 -07:00
drop_monitor.c treewide: setup_timer() -> timer_setup() 2017-11-21 15:57:07 -08:00
dst.c netfilter: nf_tables: add tunnel support 2018-08-03 21:12:12 +02:00
dst_cache.c net: core: dst_cache_set_ip6: Rename 'addr' parameter to 'saddr' for consistency 2018-03-05 12:52:45 -05:00
ethtool.c net: Add TLS RX offload feature 2018-07-16 00:12:09 -07:00
failover.c net: Introduce generic failover module 2018-05-28 22:59:54 -04:00
fib_notifier.c net: Fix fib notifer to return errno 2018-03-29 14:10:30 -04:00
fib_rules.c fib_rules: NULL check before kfree is not needed 2018-07-30 09:44:06 -07:00
filter.c bpf: Make redirect_info accessible from modules 2018-08-10 16:12:21 +02:00
flow_dissector.c flow_dissector: allow dissection of tunnel options from metadata 2018-08-07 12:22:14 -07:00
gen_estimator.c net_sched: gen_estimator: fix broken estimators based on percpu stats 2018-02-23 12:35:46 -05:00
gen_stats.c gen_stats: Fix netlink stats dumping in the presence of padding 2018-07-04 14:44:45 +09:00
gro_cells.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
hwbm.c
link_watch.c net: link_watch: mark bonding link events urgent 2018-01-23 19:43:30 -05:00
lwt_bpf.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2018-08-07 11:02:05 -07:00
lwtunnel.c ipv6: sr: define core operations for seg6local lightweight tunnel 2017-08-07 14:16:22 -07:00
Makefile net: Introduce generic failover module 2018-05-28 22:59:54 -04:00
neighbour.c net: remove blank lines at end of file 2018-07-24 14:10:43 -07:00
net-procfs.c proc: introduce proc_create_net{,_data} 2018-05-16 07:24:30 +02:00
net-sysfs.c net: create reusable function for getting ownership info of sysfs inodes 2018-07-20 23:44:36 -07:00
net-sysfs.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
net-traces.c net/ipv6: Udate fib6_table_lookup tracepoint 2018-05-24 23:01:15 -04:00
net_namespace.c net: create reusable function for getting ownership info of sysfs inodes 2018-07-20 23:44:36 -07:00
netclassid_cgroup.c cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS 2017-07-21 11:14:51 -04:00
netevent.c
netpoll.c netpoll: Use lockdep to assert IRQs are disabled/enabled 2017-11-08 11:13:54 +01:00
netprio_cgroup.c net: remove duplicate includes 2017-12-13 13:18:46 -05:00
page_pool.c net/page_pool: Fix inconsistent lock state warning 2018-07-19 23:23:01 -07:00
pktgen.c Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2018-07-27 09:33:37 -07:00
ptp_classifier.c
request_sock.c ipv4: Namespaceify tcp_max_syn_backlog knob 2016-12-29 11:38:31 -05:00
rtnetlink.c net: report invalid mtu value via netlink extack 2018-07-29 12:57:26 -07:00
scm.c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/user.h> 2017-03-02 08:42:29 +01:00
secure_seq.c tcp: Namespaceify sysctl_tcp_timestamps 2017-06-08 10:53:29 -04:00
skbuff.c net: Export skb_headers_offset_update 2018-08-10 16:12:20 +02:00
sock.c net: avoid unnecessary sock_flag() check when enable timestamp 2018-08-06 10:42:48 -07:00
sock_diag.c net: Drop pernet_operations::async 2018-03-27 13:18:09 -04:00
sock_reuseport.c tcp: Avoid TCP syncookie rejected by SO_REUSEPORT socket 2018-08-11 01:58:45 +02:00
stream.c vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
sysctl_net_core.c headers: untangle kmemleak.h from mm.h 2018-04-05 21:36:27 -07:00
timestamping.c
tso.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
utils.c net: Remove some unneeded semicolon 2018-08-04 13:05:39 -07:00
xdp.c xdp: Helpers for disabling napi_direct of xdp_return_frame 2018-08-10 16:12:21 +02:00