linux-stable/net
Magnus Karlsson ad7219cd87 xsk: Fix race at socket teardown
[ Upstream commit 18b1ab7aa7 ]

Fix a race in the xsk socket teardown code that can lead to a NULL pointer
dereference splat. The current xsk unbind code in xsk_unbind_dev() starts by
setting xs->state to XSK_UNBOUND, sets xs->dev to NULL and then waits for any
NAPI processing to terminate using synchronize_net(). After that, the release
code starts to tear down the socket state and free allocated memory.

  BUG: kernel NULL pointer dereference, address: 00000000000000c0
  PGD 8000000932469067 P4D 8000000932469067 PUD 0
  Oops: 0000 [#1] PREEMPT SMP PTI
  CPU: 25 PID: 69132 Comm: grpcpp_sync_ser Tainted: G          I       5.16.0+ #2
  Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.2.10 03/09/2015
  RIP: 0010:__xsk_sendmsg+0x2c/0x690
  [...]
  RSP: 0018:ffffa2348bd13d50 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: 0000000000000040 RCX: ffff8d5fc632d258
  RDX: 0000000000400000 RSI: ffffa2348bd13e10 RDI: ffff8d5fc5489800
  RBP: ffffa2348bd13db0 R08: 0000000000000000 R09: 00007ffffffff000
  R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d5fc5489800
  R13: ffff8d5fcb0f5140 R14: ffff8d5fcb0f5140 R15: 0000000000000000
  FS:  00007f991cff9400(0000) GS:ffff8d6f1f700000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000000000c0 CR3: 0000000114888005 CR4: 00000000001706e0
  Call Trace:
  <TASK>
  ? aa_sk_perm+0x43/0x1b0
  xsk_sendmsg+0xf0/0x110
  sock_sendmsg+0x65/0x70
  __sys_sendto+0x113/0x190
  ? debug_smp_processor_id+0x17/0x20
  ? fpregs_assert_state_consistent+0x23/0x50
  ? exit_to_user_mode_prepare+0xa5/0x1d0
  __x64_sys_sendto+0x29/0x30
  do_syscall_64+0x3b/0xc0
  entry_SYSCALL_64_after_hwframe+0x44/0xae

There are two problems with the current code. First, setting xs->dev to NULL
before waiting for all users to stop using the socket is not correct. The
entry to the data plane functions xsk_poll(), xsk_sendmsg(), and xsk_recvmsg()
are all guarded by a test that xs->state is in the state XSK_BOUND and if not,
it returns right away. But one process might have passed this test but still
have not gotten to the point in which it uses xs->dev in the code. In this
interim, a second process executing xsk_unbind_dev() might have set xs->dev to
NULL which will lead to a crash for the first process. The solution here is
just to get rid of this NULL assignment since it is not used anymore. Before
commit 42fddcc7c6 ("xsk: use state member for socket synchronization"),
xs->dev was the gatekeeper to admit processes into the data plane functions,
but it was replaced with the state variable xs->state in the aforementioned
commit.

The second problem is that synchronize_net() does not wait for any process in
xsk_poll(), xsk_sendmsg(), or xsk_recvmsg() to complete, which means that the
state they rely on might be cleaned up prematurely. This can happen when the
notifier gets called (at driver unload for example) as it uses xsk_unbind_dev().
Solve this by extending the RCU critical region from just the ndo_xsk_wakeup
to the whole functions mentioned above, so that both the test of xs->state ==
XSK_BOUND and the last use of any member of xs is covered by the RCU critical
section. This will guarantee that when synchronize_net() completes, there will
be no processes left executing xsk_poll(), xsk_sendmsg(), or xsk_recvmsg() and
state can be cleaned up safely. Note that we need to drop the RCU lock for the
skb xmit path as it uses functions that might sleep. Due to this, we have to
retest the xs->state after we grab the mutex that protects the skb xmit code
from, among a number of things, an xsk_unbind_dev() being executed from the
notifier at the same time.

Fixes: 42fddcc7c6 ("xsk: use state member for socket synchronization")
Reported-by: Elza Mathew <elza.mathew@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20220228094552.10134-1-magnus.karlsson@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08 14:23:36 +02:00
..
6lowpan 6lowpan: iphc: Fix an off-by-one check of array index 2021-07-22 16:19:03 +02:00
9p xen/9p: use alloc/free_pages_exact() 2022-03-11 12:22:36 +01:00
802 net: 802: remove dead leftover after ipx driver removal 2021-08-13 16:30:35 -07:00
8021q net: vlan: fix underflow for the real_dev refcnt 2021-12-01 09:04:53 +01:00
appletalk net: socket: rework compat_ifreq_ioctl() 2021-07-23 14:20:25 +01:00
atm atm: Use list_for_each_entry() to simplify code in resources.c 2021-06-10 14:08:09 -07:00
ax25 ax25: Fix NULL pointer dereference in ax25_kill_by_device 2022-03-16 14:23:38 +01:00
batman-adv batman-adv: Don't expect inter-netns unique iflink indices 2022-03-08 19:12:45 +01:00
bluetooth Bluetooth: hci_core: Fix leaking sent_cmd skb 2022-03-19 13:47:48 +01:00
bpf bpf, test, cgroup: Use sk_{alloc,free} for test cases 2021-09-28 09:29:28 +02:00
bpfilter bpfilter: Specify the log level for the kmsg message 2021-06-25 13:13:50 +02:00
bridge net: bridge: multicast: notify switchdev driver whenever MC processing gets disabled 2022-02-23 12:03:13 +01:00
caif net-caif: avoid user-triggerable WARN_ON(1) 2021-09-14 12:51:15 +01:00
can net-timestamp: convert sk->sk_tskey to atomic_t 2022-03-02 11:48:01 +01:00
ceph Networking changes for 5.14. 2021-06-30 15:51:09 -07:00
core net-sysfs: add check for netdevice being present to speed_show 2022-03-16 14:23:41 +01:00
dcb net: dcb: disable softirqs in dcbnl_flush_dev() 2022-03-08 19:12:52 +01:00
dccp tcp: switch orphan_count to bare per-cpu counters 2021-11-18 19:16:33 +01:00
decnet net: Remove redundant if statements 2021-08-05 13:27:50 +01:00
dns_resolver
dsa net: dsa: Add missing of_node_put() in dsa_port_parse_of 2022-03-23 09:16:42 +01:00
ethernet move netdev_boot_setup into Space.c 2021-08-03 13:05:26 +01:00
ethtool ethtool: do not perform operations on net devices being unregistered 2021-12-14 10:57:09 +01:00
hsr net: hsr: don't check sequence number if tag removal is offloaded 2021-06-16 12:13:01 -07:00
ieee802154 net: ieee802154: Return meaningful error codes from the netlink helpers 2022-02-08 18:34:09 +01:00
ife
ipv4 tcp: make tcp_read_sock() more robust 2022-03-19 13:47:50 +01:00
ipv6 xfrm: fix tunnel model fragmentation behavior 2022-04-08 14:22:46 +02:00
iucv net/iucv: Replace deprecated CPU-hotplug functions. 2021-08-09 10:13:32 +01:00
kcm net: sock: introduce sk_error_report 2021-06-29 11:28:21 -07:00
key af_key: add __GFP_ZERO flag for compose_sadb_supported in function pfkey_register 2022-04-08 14:22:47 +02:00
l2tp net/l2tp: Fix reference count leak in l2tp_udp_recv_core 2021-09-09 11:00:20 +01:00
l3mdev
lapb net: lapb: Use list_for_each_entry() to simplify code in lapb_iface.c 2021-06-08 16:31:25 -07:00
llc llc: only change llc->dev when bind() succeeds 2022-03-28 09:58:46 +02:00
mac80211 mac80211: limit bandwidth in HE capabilities 2022-04-08 14:23:28 +02:00
mac802154 ieee802154: Remove redundant initialization of variable ret 2021-09-07 14:06:08 +01:00
mctp mctp: Don't let RTM_DELROUTE delete local routes 2021-12-08 09:04:53 +01:00
mpls net: mpls: Fix notifications when deleting a device 2021-12-08 09:04:47 +01:00
mptcp mptcp: Correctly set DATA_FIN timeout when number of retransmits is large 2022-03-08 19:12:48 +01:00
ncsi net/ncsi: check for error return from call to nla_put_u32 2022-01-05 12:42:37 +01:00
netfilter netfilter: nf_tables: validate registers coming from userspace. 2022-03-28 09:58:44 +02:00
netlabel net: fix NULL pointer reference in cipso_v4_doi_free 2021-08-30 12:23:18 +01:00
netlink net: netlink: af_netlink: Prevent empty skb by adding a check on len. 2021-12-17 10:30:15 +01:00
netrom netrom: fix api breakage in nr_setsockopt() 2022-01-27 11:04:00 +01:00
nfc nfc: llcp: fix NULL error pointer dereference on sendmsg() after failed bind() 2022-01-27 11:02:48 +01:00
nsh
openvswitch openvswitch: Fix setting ipv6 fields causing hw csum failure 2022-03-02 11:47:57 +01:00
packet net/packet: fix slab-out-of-bounds access in packet_recvmsg() 2022-03-23 09:16:41 +01:00
phonet phonet: refcount leak in pep_sock_accep 2022-01-11 15:35:16 +01:00
psample
qrtr net: qrtr: revert check in qrtr_endpoint_post() 2021-09-02 11:37:02 +01:00
rds rds: memory leak in __rds_conn_create() 2021-12-22 09:32:42 +01:00
rfkill rfkill: make new event layout opt-in 2022-04-08 14:23:00 +02:00
rose
rxrpc rxrpc: Adjust retransmission backoff 2022-02-01 17:27:11 +01:00
sched net/sched: act_ct: Fix flow table lookup after ct clear or switching zones 2022-03-02 11:47:57 +01:00
sctp sctp: fix kernel-infoleak for SCTP sockets 2022-03-16 14:23:39 +01:00
smc net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error cause by server 2022-03-08 19:12:45 +01:00
strparser bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding 2021-11-18 19:17:11 +01:00
sunrpc SUNRPC: avoid race between mod_timer() and del_timer_sync() 2022-04-08 14:22:52 +02:00
switchdev net: make switchdev_bridge_port_{,unoffload} loosely coupled with the bridge 2021-08-04 12:35:07 +01:00
tipc tipc: fix incorrect order of state message data sanity check 2022-03-16 14:23:38 +01:00
tls net/tls: Fix authentication failure in CCM mode 2021-12-08 09:04:41 +01:00
unix af_unix: annote lockless accesses to unix_tot_inflight & gc_in_progress 2022-01-27 11:05:30 +01:00
vmw_vsock vsock: each transport cycles only on its own sockets 2022-03-23 09:16:41 +01:00
wireless nl80211: Update bss channel on channel switch for P2P_CLIENT 2022-03-19 13:47:49 +01:00
x25 net: x25: Use list_for_each_entry() to simplify code in x25_route.c 2021-06-10 14:08:09 -07:00
xdp xsk: Fix race at socket teardown 2022-04-08 14:23:36 +02:00
xfrm xfrm: fix tunnel model fragmentation behavior 2022-04-08 14:22:46 +02:00
compat.c
devres.c net: devres: Correct a grammatical error 2021-06-11 12:55:28 -07:00
Kconfig mctp: Add MCTP base 2021-07-29 15:06:49 +01:00
Makefile mctp: Add MCTP base 2021-07-29 15:06:49 +01:00
socket.c net: fix SOF_TIMESTAMPING_BIND_PHC to work with multiple sockets 2022-01-27 11:03:52 +01:00
sysctl_net.c