linux-stable/net
Kuniyuki Iwashima 91a8528694 soreuseport: Fix socket selection for SO_INCOMING_CPU.
[ Upstream commit b261eda84e ]

Kazuho Oku reported that setsockopt(SO_INCOMING_CPU) does not work
with setsockopt(SO_REUSEPORT) since v4.6.

With the combination of SO_REUSEPORT and SO_INCOMING_CPU, we could
build a highly efficient server application.

setsockopt(SO_INCOMING_CPU) associates a CPU with a TCP listener
or UDP socket, and then incoming packets processed on the CPU will
likely be distributed to the socket.  Technically, a socket could
even receive packets handled on another CPU if no sockets in the
reuseport group have the same CPU receiving the flow.

The logic exists in compute_score() so that a socket will get a higher
score if it has the same CPU with the flow.  However, the score gets
ignored after the blamed two commits, which introduced a faster socket
selection algorithm for SO_REUSEPORT.

This patch introduces a counter of sockets with SO_INCOMING_CPU in
a reuseport group to check if we should iterate all sockets to find
a proper one.  We increment the counter when

  * calling listen() if the socket has SO_INCOMING_CPU and SO_REUSEPORT

  * enabling SO_INCOMING_CPU if the socket is in a reuseport group

Also, we decrement it when

  * detaching a socket out of the group to apply SO_INCOMING_CPU to
    migrated TCP requests

  * disabling SO_INCOMING_CPU if the socket is in a reuseport group

When the counter reaches 0, we can get back to the O(1) selection
algorithm.

The overall changes are negligible for the non-SO_INCOMING_CPU case,
and the only notable thing is that we have to update sk_incomnig_cpu
under reuseport_lock.  Otherwise, the race prevents transitioning to
the O(n) algorithm and results in the wrong socket selection.

 cpu1 (setsockopt)               cpu2 (listen)
+-----------------+             +-------------+

lock_sock(sk1)                  lock_sock(sk2)

reuseport_update_incoming_cpu(sk1, val)
.
|  /* set CPU as 0 */
|- WRITE_ONCE(sk1->incoming_cpu, val)
|
|                               spin_lock_bh(&reuseport_lock)
|                               reuseport_grow(sk2, reuse)
|                               .
|                               |- more_socks_size = reuse->max_socks * 2U;
|                               |- if (more_socks_size > U16_MAX &&
|                               |       reuse->num_closed_socks)
|                               |  .
|                               |  |- RCU_INIT_POINTER(sk1->sk_reuseport_cb, NULL);
|                               |  `- __reuseport_detach_closed_sock(sk1, reuse)
|                               |     .
|                               |     `- reuseport_put_incoming_cpu(sk1, reuse)
|                               |        .
|                               |        |  /* Read shutdown()ed sk1's sk_incoming_cpu
|                               |        |   * without lock_sock().
|                               |        |   */
|                               |        `- if (sk1->sk_incoming_cpu >= 0)
|                               |           .
|                               |           |  /* decrement not-yet-incremented
|                               |           |   * count, which is never incremented.
|                               |           |   */
|                               |           `- __reuseport_put_incoming_cpu(reuse);
|                               |
|                               `- spin_lock_bh(&reuseport_lock)
|
|- spin_lock_bh(&reuseport_lock)
|
|- reuse = rcu_dereference_protected(sk1->sk_reuseport_cb, ...)
|- if (!reuse)
|  .
|  |  /* Cannot increment reuse->incoming_cpu. */
|  `- goto out;
|
`- spin_unlock_bh(&reuseport_lock)

Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Fixes: c125e80b88 ("soreuseport: fast reuseport TCP socket selection")
Reported-by: Kazuho Oku <kazuhooku@gmail.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31 13:32:04 +01:00
..
6lowpan
9p Including fixes from bpf, can and wifi. 2022-11-29 09:52:10 -08:00
802 treewide: use prandom_u32_max() when possible, part 1 2022-10-11 17:42:55 -06:00
8021q net: gro: skb_gro_header helper function 2022-08-25 10:33:21 +02:00
appletalk
atm net/atm: fix proc_mpc_write incorrect return value 2022-10-15 11:08:36 +01:00
ax25 ax25: move from strlcpy with unused retval to strscpy 2022-08-22 17:55:50 -07:00
batman-adv Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-09-22 13:02:10 -07:00
bluetooth Bluetooth: Fix crash when replugging CSR fake controllers 2022-12-02 13:22:56 -08:00
bpf bpf, test_run: Fix alignment problem in bpf_prog_test_run_skb() 2022-11-04 16:22:34 +01:00
bpfilter
bridge bridge: switchdev: Fix memory leaks when changing VLAN protocol 2022-11-15 13:38:11 +01:00
caif net: caif: fix double disconnect client in chnl_net_open() 2022-11-14 10:51:13 +00:00
can can: af_can: fix NULL pointer dereference in can_rcv_filter 2022-12-07 10:30:47 +01:00
ceph Random number generator fixes for Linux 6.1-rc1. 2022-10-16 15:27:07 -07:00
core soreuseport: Fix socket selection for SO_INCOMING_CPU. 2022-12-31 13:32:04 +01:00
dcb
dccp dccp/tcp: Fixup bhash2 bucket when connect() fails. 2022-11-22 20:15:37 -08:00
dns_resolver
dsa net: dsa: sja1105: Check return value 2022-12-02 20:46:52 -08:00
ethernet net: gro: skb_gro_header helper function 2022-08-25 10:33:21 +02:00
ethtool ethtool: eeprom: fix null-deref on genl_info in dump 2022-10-24 19:08:07 -07:00
hsr net: hsr: Fix potential use-after-free 2022-11-28 18:09:00 -08:00
ieee802154 net: ieee802154: fix error return code in dgram_bind() 2022-10-07 09:29:17 +02:00
ife
ipv4 ipv4: Fix incorrect route flushing when table ID 0 is used 2022-12-06 20:34:43 -08:00
ipv6 ipv6: avoid use-after-free in ip6_fragment() 2022-12-07 20:14:35 -08:00
iucv
kcm kcm: close race conditions on sk_receive_queue 2022-11-15 12:42:26 +01:00
key xfrm: Fix oops in __xfrm_state_delete() 2022-11-22 07:14:55 +01:00
l2tp l2tp: Don't sleep and disable BH under writer-side sk_callback_lock 2022-11-23 12:45:19 +00:00
l3mdev
lapb
llc
mac80211 wifi: mac80211: fix ifdef symbol name 2022-12-31 13:32:01 +01:00
mac802154 mac802154: fix missing INIT_LIST_HEAD in ieee802154_if_add() 2022-12-05 09:53:08 +01:00
mctp mctp: Fix an error handling path in mctp_init() 2022-11-09 19:26:08 -08:00
mpls net: Use u64_stats_fetch_begin_irq() for stats fetch. 2022-08-29 13:02:27 +01:00
mptcp mptcp: fix sleep in atomic at close time 2022-11-28 18:03:07 -08:00
ncsi genetlink: start to validate reserved header bytes 2022-08-29 12:47:15 +01:00
netfilter netfilter: ctnetlink: fix compilation warning after data race fixes in ct mark 2022-11-30 13:08:49 +01:00
netlabel genetlink: start to validate reserved header bytes 2022-08-29 12:47:15 +01:00
netlink genetlink: limit the use of validation workarounds to old ops 2022-10-27 08:20:21 -07:00
netrom
nfc NFC: nci: Bounds check struct nfc_target arrays 2022-12-05 17:46:25 -08:00
nsh
openvswitch netfilter: conntrack: Fix data-races around ct mark 2022-11-18 15:21:00 +01:00
packet packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE 2022-11-29 08:30:18 -08:00
phonet
psample genetlink: start to validate reserved header bytes 2022-08-29 12:47:15 +01:00
qrtr net: qrtr: start MHI channel after endpoit creation 2022-08-15 11:21:42 +01:00
rds treewide: use get_random_{u8,u16}() when possible, part 2 2022-10-11 17:42:58 -06:00
rfkill
rose rose: Fix NULL pointer dereference in rose_send_frame() 2022-11-02 11:57:30 +00:00
rxrpc rxrpc: Fix race between conn bundle lookup and bundle removal [ZDI-CAN-15975] 2022-11-18 12:05:44 +00:00
sched net: sched: allow act_ct to be built without NF_NAT 2022-11-22 12:16:55 +01:00
sctp sctp: fix memory leak in sctp_stream_outq_migrate() 2022-11-29 08:30:50 -08:00
smc net/smc: Fix possible leaked pernet namespace in smc_init() 2022-11-02 20:42:09 -07:00
strparser
sunrpc SUNRPC: Fix crasher in gss_unwrap_resp_integ() 2022-10-27 15:52:10 -04:00
switchdev
tipc tipc: call tipc_lxc_xmit without holding node_read_lock 2022-12-07 11:32:04 +01:00
tls net/tls: Fix memory leak in tls_enc_skb() and tls_sw_fallback_init() 2022-11-11 20:08:17 -08:00
unix af_unix: Get user_ns from in_skb in unix_diag_get_exact(). 2022-12-01 10:32:20 +01:00
vmw_vsock vsock: fix possible infinite sleep in vsock_connectible_wait_data() 2022-11-03 10:49:29 +01:00
wireless wifi: cfg80211: don't allow multi-BSSID in S1G 2022-11-25 12:43:14 +01:00
x25 net/x25: Fix skb leak in x25_lapb_receive_frame() 2022-11-15 20:22:19 -08:00
xdp Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-10-03 17:44:18 -07:00
xfrm Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec 2022-11-23 19:18:59 -08:00
compat.c net: clear msg_get_inq in __get_compat_msghdr() 2022-09-20 08:23:20 -07:00
devres.c
Kconfig Remove DECnet support from kernel 2022-08-22 14:26:30 +01:00
Kconfig.debug net: make NET_(DEV|NS)_REFCNT_TRACKER depend on NET 2022-09-20 14:23:56 -07:00
Makefile Remove DECnet support from kernel 2022-08-22 14:26:30 +01:00
socket.c d_path pile 2022-10-06 16:55:41 -07:00
sysctl_net.c