Commit Graph

77391 Commits

Author SHA1 Message Date
Florian Westphal a590f47609 netfilter: nft_set_pipapo: move prove_locking helper around
Preparation patch, the helper will soon get called from insert
function too.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06 16:29:22 +02:00
Florian Westphal 119c790a27 netfilter: conntrack: remove flowtable early-drop test
Not sure why this special case exists.  Early drop logic
(which kicks in when conntrack table is full) should be independent
of flowtable offload and only consider assured bit (i.e., two-way
traffic was seen).

flowtable entries hold a reference to the conntrack entry (struct
nf_conn) that has been offloaded. The conntrack use count is not
decremented until after the entry is free'd.

This change therefore will not result in exceeding the conntrack table
limit.  It does allow early-drop of tcp flows even when they've been
offloaded, but only if they have been offloaded before syn-ack was
received or after at least one peer has sent a fin.

Currently 'fin' packet reception already stops offloading, so this
should not impact offloading either.

Cc: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06 16:29:21 +02:00
Jason Xing 8edc27fc4f netfilter: use NF_DROP instead of -NF_DROP
At the beginning in 2009 one patch [1] introduced collecting drop
counter in nf_conntrack_in() by returning -NF_DROP. Later, another
patch [2] changed the return value of tcp_packet() which now is
renamed to nf_conntrack_tcp_packet() from -NF_DROP to NF_DROP. As
we can see, that -NF_DROP should be corrected.

Similarly, there are other two points where the -NF_DROP is used.

Well, as NF_DROP is equal to 0, inverting NF_DROP makes no sense
as patch [2] said many years ago.

[1]
commit 7d1e04598e ("netfilter: nf_conntrack: account packets drop by tcp_packet()")
[2]
commit ec8d540969 ("netfilter: conntrack: fix dropping packet after l4proto->packet()")

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06 16:29:21 +02:00
Guoqing Jiang f2ad13ad08 SUNRPC: Remove comment for sp_lock
It is obsolete since sp_lock was discarded in commit 580a25756a
("SUNRPC: discard sp_lock").

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:23 -04:00
Jeff Layton cf619507ae SUNRPC: add a new svc_find_listener helper
svc_find_listener will return the transport instance pointer for the
endpoint accepting connections/peer traffic from the specified transport
class and matching sockaddr.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:22 -04:00
Lorenzo Bianconi a79ec2aecb SUNRPC: introduce svc_xprt_create_from_sa utility routine
Add svc_xprt_create_from_sa utility routine and refactor
svc_xprt_create() codebase in order to introduce the capability to
create a svc port from socket address.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:22 -04:00
Aleksandr Aprelkov a576f36971 sunrpc: removed redundant procp check
since vs_proc pointer is dereferenced before getting it's address there's
no need to check for NULL.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: 8e5b67731d ("SUNRPC: Add a callback to initialise server requests")
Signed-off-by: Aleksandr Aprelkov <aaprelkov@usergate.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06 09:07:18 -04:00
Thadeu Lima de Souza Cascardo a26ff37e62 net: fix out-of-bounds access in ops_init
net_alloc_generic is called by net_alloc, which is called without any
locking. It reads max_gen_ptrs, which is changed under pernet_ops_rwsem. It
is read twice, first to allocate an array, then to set s.len, which is
later used to limit the bounds of the array access.

It is possible that the array is allocated and another thread is
registering a new pernet ops, increments max_gen_ptrs, which is then used
to set s.len with a larger than allocated length for the variable array.

Fix it by reading max_gen_ptrs only once in net_alloc_generic. If
max_gen_ptrs is later incremented, it will be caught in net_assign_generic.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Fixes: 073862ba5d ("netns: fix net_alloc_generic()")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240502132006.3430840-1-cascardo@igalia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06 13:38:14 +02:00
Felix Fietkau c9d1d23e52 net: add heuristic for enabling TCP fraglist GRO
When forwarding TCP after GRO, software segmentation is very expensive,
especially when the checksum needs to be recalculated.
One case where that's currently unavoidable is when routing packets over
PPPoE. Performance improves significantly when using fraglist GRO
implemented in the same way as for UDP.

When NETIF_F_GRO_FRAGLIST is enabled, perform a lookup for an established
socket in the same netns as the receiving device. While this may not
cover all relevant use cases in multi-netns configurations, it should be
good enough for most configurations that need this.

Here's a measurement of running 2 TCP streams through a MediaTek MT7622
device (2-core Cortex-A53), which runs NAT with flow offload enabled from
one ethernet port to PPPoE on another ethernet port + cake qdisc set to
1Gbps.

rx-gro-list off: 630 Mbit/s, CPU 35% idle
rx-gro-list on:  770 Mbit/s, CPU 40% idle

Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06 11:54:04 +02:00
Felix Fietkau 7516b27c55 net: create tcp_gro_header_pull helper function
Pull the code out of tcp_gro_receive in order to access the tcp header
from tcp4/6_gro_receive.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06 11:54:04 +02:00
Felix Fietkau 80e85fbdf1 net: create tcp_gro_lookup helper function
This pulls the flow port matching out of tcp_gro_receive, so that it can be
reused for the next change, which adds the TCP fraglist GRO heuristic.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06 11:54:04 +02:00
Felix Fietkau 8d95dc474f net: add code for TCP fraglist GRO
This implements fraglist GRO similar to how it's handled in UDP, however
no functional changes are added yet. The next change adds a heuristic for
using fraglist GRO instead of regular GRO.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06 11:54:04 +02:00
Felix Fietkau bee88cd5bd net: add support for segmenting TCP fraglist GSO packets
Preparation for adding TCP fraglist GRO support. It expects packets to be
combined in a similar way as UDP fraglist GSO packets.
For IPv4 packets, NAT is handled in the same way as UDP fraglist GSO.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06 11:54:03 +02:00
Felix Fietkau 8928756d53 net: move skb_gro_receive_list from udp to core
This helper function will be used for TCP fraglist GRO support

Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06 11:54:03 +02:00
Jason Xing 40616789ec netfilter: conntrack: dccp: try not to drop skb in conntrack
It would be better not to drop skb in conntrack unless we have good
alternatives. So we can treat the result of testing skb's header
pointer as nf_conntrack_tcp_packet() does.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06 11:13:56 +02:00
Linus Lüssing 4a3540a8bf netfilter: conntrack: fix ct-state for ICMPv6 Multicast Router Discovery
So far Multicast Router Advertisements and Multicast Router
Solicitations from the Multicast Router Discovery protocol (RFC4286)
would be marked as INVALID for IPv6, even if they are in fact intact
and adhering to RFC4286.

This broke MRA reception and by that multicast reception on
IPv6 multicast routers in a Proxmox managed setup, where Proxmox
would install a rule like "-m conntrack --ctstate INVALID -j DROP"
at the top of the FORWARD chain with br-nf-call-ip6tables enabled
by default.

Similar to as it's done for MLDv1, MLDv2 and IPv6 Neighbor Discovery
already, fix this issue by excluding MRD from connection tracking
handling as MRD always uses predefined multicast destinations
for its messages, too. This changes the ct-state for ICMPv6 MRD messages
from INVALID to UNTRACKED.

This issue was found and fixed with the help of the mrdisc tool
(https://github.com/troglobit/mrdisc).

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06 11:13:56 +02:00
Pablo Neira Ayuso 6e20eef413 netfilter: nf_tables: remove NETDEV_CHANGENAME from netdev chain event handler
Originally, device name used to be stored in the basechain, but it is
not the case anymore. Remove check for NETDEV_CHANGENAME.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06 11:13:55 +02:00
Pablo Neira Ayuso 84b1a0c014 netfilter: nf_tables: skip transaction if update object is not implemented
Turn update into noop as a follow up for:

  9fedd894b4 ("netfilter: nf_tables: fix unexpected EOPNOTSUPP error")

instead of adding a transaction object which is simply discarded at a
later stage of the commit protocol.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06 11:13:54 +02:00
Mina Almasry 173e7622cc Revert "net: mirror skb frag ref/unref helpers"
This reverts commit a580ea994f.

This revert is to resolve Dragos's report of page_pool leak here:
https://lore.kernel.org/lkml/20240424165646.1625690-2-dtatulea@nvidia.com/

The reverted patch interacts very badly with commit 2cc3aeb5ec ("skbuff:
Fix a potential race while recycling page_pool packets"). The reverted
commit hopes that the pp_recycle + is_pp_page variables do not change
between the skb_frag_ref and skb_frag_unref operation. If such a change
occurs, the skb_frag_ref/unref will not operate on the same reference type.
In the case of Dragos's report, the grabbed ref was a pp ref, but the unref
was a page ref, because the pp_recycle setting on the skb was changed.

Attempting to fix this issue on the fly is risky. Lets revert and I hope
to reland this with better understanding and testing to ensure we don't
regress some edge case while streamlining skb reffing.

Fixes: a580ea994f ("net: mirror skb frag ref/unref helpers")
Reported-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://lore.kernel.org/r/20240502175423.2456544-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03 16:05:53 -07:00
Roded Zats 1aec77b2bb rtnetlink: Correct nested IFLA_VF_VLAN_LIST attribute validation
Each attribute inside a nested IFLA_VF_VLAN_LIST is assumed to be a
struct ifla_vf_vlan_info so the size of such attribute needs to be at least
of sizeof(struct ifla_vf_vlan_info) which is 14 bytes.
The current size validation in do_setvfinfo is against NLA_HDRLEN (4 bytes)
which is less than sizeof(struct ifla_vf_vlan_info) so this validation
is not enough and a too small attribute might be cast to a
struct ifla_vf_vlan_info, this might result in an out of bands
read access when accessing the saved (casted) entry in ivvl.

Fixes: 79aab093a0 ("net: Update API for VF vlan protocol 802.1ad support")
Signed-off-by: Roded Zats <rzats@paloaltonetworks.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240502155751.75705-1-rzats@paloaltonetworks.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03 15:57:50 -07:00
Jakub Kicinski d0de616739 ipsec-2024-05-02
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmYzUX4ACgkQrB3Eaf9P
 W7e7qQ/+LgkDkL/LyXv3kAPN8b2SapIiIajarlRfgdPYdM6PP+kzGJxC/t5NZ2HE
 Q1N6K0hIL042rna1/grkUKHeQn4PXUlfT6y8YgjiuCvpFDVNb2ofyl3AmxjJnH1A
 iwMWf6EhwGoxbVs3DbDJ554U8T0nBJeZ+MXLF/4BI13bNdj7stbcKRqj6KHC5sQO
 JgtFVX+ip6LLGL7rR4YMv2h2p1sSu3Vp6bMcfM85I4ENec0UIjgsAF9P0buPl4gr
 2oKtMxga86CQWcymKo6DI+MsBBk91wvM+5/T9zQtpdxDuNEQNrotCoCc0Kd03xmP
 EGzJagwVGFj08kYJ7qICDwpXWCpLDVumoxWFNBWmAW9uNEkUW8Tiqmm8eW2Azs3d
 VAUFcyzHr7mkAaqSDDdE4J+L276Z+dS+BHPnoF6Sp+ctuvSmmeS6lyY9mGnFGH7H
 OiqFKonjBEC5iNAMIXF3WRKueMDdbbDFwHK4NEiTIUSeAMqETUP2sBC1GNTaN8YJ
 soKYtwUtiag2P44ZYy5UYeKJlaBnT1FOZHLs24iCOY1XjqJerwjefQuBO6HDBz/I
 vkaSY6ak6uRsAdfst45uQNPfxlJkFDbwRDowFCdhu5qG7bifqnXstQmNta2U1109
 4e3vt5jPowN/9bCtMx7Z+ftmmTsapxYCu5ZYRVAq82WahsXFPtE=
 =aeD1
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-2024-05-02' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
pull request (net): ipsec 2024-05-02

1) Fix an error pointer dereference in xfrm_in_fwd_icmp.
   From Antony Antony.

2) Preserve vlan tags for ESP transport mode software GRO.
   From Paul Davey.

3) Fix a spelling mistake in an uapi xfrm.h comment.
   From Anotny Antony.

* tag 'ipsec-2024-05-02' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  xfrm: Correct spelling mistake in xfrm.h comment
  xfrm: Preserve vlan tags for transport mode software GRO
  xfrm: fix possible derferencing in error path
====================

Link: https://lore.kernel.org/r/20240502084838.2269355-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03 15:56:15 -07:00
Eric Dumazet c1742dcb6b net: no longer acquire RTNL in threaded_show()
dev->threaded can be read locklessly, if we add
corresponding READ_ONCE()/WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240502173926.2010646-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03 15:14:01 -07:00
Eric Dumazet 0feb396f74 rtnetlink: use for_each_netdev_dump() in rtnl_stats_dump()
Switch rtnl_stats_dump() to use for_each_netdev_dump()
instead of net->dev_index_head[] hash table.

This makes the code much easier to read, and fixes
scalability issues.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240502113748.1622637-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03 15:03:42 -07:00
Eric Dumazet 136c2a9a2a rtnetlink: change rtnl_stats_dump() return value
By returning 0 (or an error) instead of skb->len,
we allow NLMSG_DONE to be appended to the current
skb at the end of a dump, saving a couple of recvmsg()
system calls.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240502113748.1622637-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03 15:03:42 -07:00
Duoming Zhou adf0398cee Bluetooth: l2cap: fix null-ptr-deref in l2cap_chan_timeout
There is a race condition between l2cap_chan_timeout() and
l2cap_chan_del(). When we use l2cap_chan_del() to delete the
channel, the chan->conn will be set to null. But the conn could
be dereferenced again in the mutex_lock() of l2cap_chan_timeout().
As a result the null pointer dereference bug will happen. The
KASAN report triggered by POC is shown below:

[  472.074580] ==================================================================
[  472.075284] BUG: KASAN: null-ptr-deref in mutex_lock+0x68/0xc0
[  472.075308] Write of size 8 at addr 0000000000000158 by task kworker/0:0/7
[  472.075308]
[  472.075308] CPU: 0 PID: 7 Comm: kworker/0:0 Not tainted 6.9.0-rc5-00356-g78c0094a146b #36
[  472.075308] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu4
[  472.075308] Workqueue: events l2cap_chan_timeout
[  472.075308] Call Trace:
[  472.075308]  <TASK>
[  472.075308]  dump_stack_lvl+0x137/0x1a0
[  472.075308]  print_report+0x101/0x250
[  472.075308]  ? __virt_addr_valid+0x77/0x160
[  472.075308]  ? mutex_lock+0x68/0xc0
[  472.075308]  kasan_report+0x139/0x170
[  472.075308]  ? mutex_lock+0x68/0xc0
[  472.075308]  kasan_check_range+0x2c3/0x2e0
[  472.075308]  mutex_lock+0x68/0xc0
[  472.075308]  l2cap_chan_timeout+0x181/0x300
[  472.075308]  process_one_work+0x5d2/0xe00
[  472.075308]  worker_thread+0xe1d/0x1660
[  472.075308]  ? pr_cont_work+0x5e0/0x5e0
[  472.075308]  kthread+0x2b7/0x350
[  472.075308]  ? pr_cont_work+0x5e0/0x5e0
[  472.075308]  ? kthread_blkcg+0xd0/0xd0
[  472.075308]  ret_from_fork+0x4d/0x80
[  472.075308]  ? kthread_blkcg+0xd0/0xd0
[  472.075308]  ret_from_fork_asm+0x11/0x20
[  472.075308]  </TASK>
[  472.075308] ==================================================================
[  472.094860] Disabling lock debugging due to kernel taint
[  472.096136] BUG: kernel NULL pointer dereference, address: 0000000000000158
[  472.096136] #PF: supervisor write access in kernel mode
[  472.096136] #PF: error_code(0x0002) - not-present page
[  472.096136] PGD 0 P4D 0
[  472.096136] Oops: 0002 [#1] PREEMPT SMP KASAN NOPTI
[  472.096136] CPU: 0 PID: 7 Comm: kworker/0:0 Tainted: G    B              6.9.0-rc5-00356-g78c0094a146b #36
[  472.096136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu4
[  472.096136] Workqueue: events l2cap_chan_timeout
[  472.096136] RIP: 0010:mutex_lock+0x88/0xc0
[  472.096136] Code: be 08 00 00 00 e8 f8 23 1f fd 4c 89 f7 be 08 00 00 00 e8 eb 23 1f fd 42 80 3c 23 00 74 08 48 88
[  472.096136] RSP: 0018:ffff88800744fc78 EFLAGS: 00000246
[  472.096136] RAX: 0000000000000000 RBX: 1ffff11000e89f8f RCX: ffffffff8457c865
[  472.096136] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff88800744fc78
[  472.096136] RBP: 0000000000000158 R08: ffff88800744fc7f R09: 1ffff11000e89f8f
[  472.096136] R10: dffffc0000000000 R11: ffffed1000e89f90 R12: dffffc0000000000
[  472.096136] R13: 0000000000000158 R14: ffff88800744fc78 R15: ffff888007405a00
[  472.096136] FS:  0000000000000000(0000) GS:ffff88806d200000(0000) knlGS:0000000000000000
[  472.096136] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  472.096136] CR2: 0000000000000158 CR3: 000000000da32000 CR4: 00000000000006f0
[  472.096136] Call Trace:
[  472.096136]  <TASK>
[  472.096136]  ? __die_body+0x8d/0xe0
[  472.096136]  ? page_fault_oops+0x6b8/0x9a0
[  472.096136]  ? kernelmode_fixup_or_oops+0x20c/0x2a0
[  472.096136]  ? do_user_addr_fault+0x1027/0x1340
[  472.096136]  ? _printk+0x7a/0xa0
[  472.096136]  ? mutex_lock+0x68/0xc0
[  472.096136]  ? add_taint+0x42/0xd0
[  472.096136]  ? exc_page_fault+0x6a/0x1b0
[  472.096136]  ? asm_exc_page_fault+0x26/0x30
[  472.096136]  ? mutex_lock+0x75/0xc0
[  472.096136]  ? mutex_lock+0x88/0xc0
[  472.096136]  ? mutex_lock+0x75/0xc0
[  472.096136]  l2cap_chan_timeout+0x181/0x300
[  472.096136]  process_one_work+0x5d2/0xe00
[  472.096136]  worker_thread+0xe1d/0x1660
[  472.096136]  ? pr_cont_work+0x5e0/0x5e0
[  472.096136]  kthread+0x2b7/0x350
[  472.096136]  ? pr_cont_work+0x5e0/0x5e0
[  472.096136]  ? kthread_blkcg+0xd0/0xd0
[  472.096136]  ret_from_fork+0x4d/0x80
[  472.096136]  ? kthread_blkcg+0xd0/0xd0
[  472.096136]  ret_from_fork_asm+0x11/0x20
[  472.096136]  </TASK>
[  472.096136] Modules linked in:
[  472.096136] CR2: 0000000000000158
[  472.096136] ---[ end trace 0000000000000000 ]---
[  472.096136] RIP: 0010:mutex_lock+0x88/0xc0
[  472.096136] Code: be 08 00 00 00 e8 f8 23 1f fd 4c 89 f7 be 08 00 00 00 e8 eb 23 1f fd 42 80 3c 23 00 74 08 48 88
[  472.096136] RSP: 0018:ffff88800744fc78 EFLAGS: 00000246
[  472.096136] RAX: 0000000000000000 RBX: 1ffff11000e89f8f RCX: ffffffff8457c865
[  472.096136] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff88800744fc78
[  472.096136] RBP: 0000000000000158 R08: ffff88800744fc7f R09: 1ffff11000e89f8f
[  472.132932] R10: dffffc0000000000 R11: ffffed1000e89f90 R12: dffffc0000000000
[  472.132932] R13: 0000000000000158 R14: ffff88800744fc78 R15: ffff888007405a00
[  472.132932] FS:  0000000000000000(0000) GS:ffff88806d200000(0000) knlGS:0000000000000000
[  472.132932] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  472.132932] CR2: 0000000000000158 CR3: 000000000da32000 CR4: 00000000000006f0
[  472.132932] Kernel panic - not syncing: Fatal exception
[  472.132932] Kernel Offset: disabled
[  472.132932] ---[ end Kernel panic - not syncing: Fatal exception ]---

Add a check to judge whether the conn is null in l2cap_chan_timeout()
in order to mitigate the bug.

Fixes: 3df91ea20e ("Bluetooth: Revert to mutexes from RCU list")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-05-03 13:05:54 -04:00
Sungwoo Kim d2706004a1 Bluetooth: HCI: Fix potential null-ptr-deref
Fix potential null-ptr-deref in hci_le_big_sync_established_evt().

Fixes: f777d88278 (Bluetooth: ISO: Notify user space about failed bis connections)
Signed-off-by: Sungwoo Kim <iam@sung-woo.kim>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-05-03 13:05:53 -04:00
Sungwoo Kim 10f9f426ac Bluetooth: msft: fix slab-use-after-free in msft_do_close()
Tying the msft->data lifetime to hdev by freeing it in
hci_release_dev() to fix the following case:

[use]
msft_do_close()
  msft = hdev->msft_data;
  if (!msft)                      ...(1) <- passed.
    return;
  mutex_lock(&msft->filter_lock); ...(4) <- used after freed.

[free]
msft_unregister()
  msft = hdev->msft_data;
  hdev->msft_data = NULL;         ...(2)
  kfree(msft);                    ...(3) <- msft is freed.

==================================================================
BUG: KASAN: slab-use-after-free in __mutex_lock_common
kernel/locking/mutex.c:587 [inline]
BUG: KASAN: slab-use-after-free in __mutex_lock+0x8f/0xc30
kernel/locking/mutex.c:752
Read of size 8 at addr ffff888106cbbca8 by task kworker/u5:2/309

Fixes: bf6a4e30ff ("Bluetooth: disable advertisement filters during suspend")
Signed-off-by: Sungwoo Kim <iam@sung-woo.kim>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-05-03 13:05:28 -04:00
Sungwoo Kim 4d7b41c0e4 Bluetooth: L2CAP: Fix slab-use-after-free in l2cap_connect()
Extend a critical section to prevent chan from early freeing.
Also make the l2cap_connect() return type void. Nothing is using the
returned value but it is ugly to return a potentially freed pointer.
Making it void will help with backports because earlier kernels did use
the return value. Now the compile will break for kernels where this
patch is not a complete fix.

Call stack summary:

[use]
l2cap_bredr_sig_cmd
  l2cap_connect
  ┌ mutex_lock(&conn->chan_lock);
  │ chan = pchan->ops->new_connection(pchan); <- alloc chan
  │ __l2cap_chan_add(conn, chan);
  │   l2cap_chan_hold(chan);
  │   list_add(&chan->list, &conn->chan_l);   ... (1)
  └ mutex_unlock(&conn->chan_lock);
    chan->conf_state              ... (4) <- use after free

[free]
l2cap_conn_del
┌ mutex_lock(&conn->chan_lock);
│ foreach chan in conn->chan_l:            ... (2)
│   l2cap_chan_put(chan);
│     l2cap_chan_destroy
│       kfree(chan)               ... (3) <- chan freed
└ mutex_unlock(&conn->chan_lock);

==================================================================
BUG: KASAN: slab-use-after-free in instrument_atomic_read
include/linux/instrumented.h:68 [inline]
BUG: KASAN: slab-use-after-free in _test_bit
include/asm-generic/bitops/instrumented-non-atomic.h:141 [inline]
BUG: KASAN: slab-use-after-free in l2cap_connect+0xa67/0x11a0
net/bluetooth/l2cap_core.c:4260
Read of size 8 at addr ffff88810bf040a0 by task kworker/u3:1/311

Fixes: 73ffa904b7 ("Bluetooth: Move conf_{req,rsp} stuff to struct l2cap_chan")
Signed-off-by: Sungwoo Kim <iam@sung-woo.kim>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-05-03 13:05:26 -04:00
Duoming Zhou 483bc08181 Bluetooth: Fix use-after-free bugs caused by sco_sock_timeout
When the sco connection is established and then, the sco socket
is releasing, timeout_work will be scheduled to judge whether
the sco disconnection is timeout. The sock will be deallocated
later, but it is dereferenced again in sco_sock_timeout. As a
result, the use-after-free bugs will happen. The root cause is
shown below:

    Cleanup Thread               |      Worker Thread
sco_sock_release                 |
  sco_sock_close                 |
    __sco_sock_close             |
      sco_sock_set_timer         |
        schedule_delayed_work    |
  sco_sock_kill                  |    (wait a time)
    sock_put(sk) //FREE          |  sco_sock_timeout
                                 |    sock_hold(sk) //USE

The KASAN report triggered by POC is shown below:

[   95.890016] ==================================================================
[   95.890496] BUG: KASAN: slab-use-after-free in sco_sock_timeout+0x5e/0x1c0
[   95.890755] Write of size 4 at addr ffff88800c388080 by task kworker/0:0/7
...
[   95.890755] Workqueue: events sco_sock_timeout
[   95.890755] Call Trace:
[   95.890755]  <TASK>
[   95.890755]  dump_stack_lvl+0x45/0x110
[   95.890755]  print_address_description+0x78/0x390
[   95.890755]  print_report+0x11b/0x250
[   95.890755]  ? __virt_addr_valid+0xbe/0xf0
[   95.890755]  ? sco_sock_timeout+0x5e/0x1c0
[   95.890755]  kasan_report+0x139/0x170
[   95.890755]  ? update_load_avg+0xe5/0x9f0
[   95.890755]  ? sco_sock_timeout+0x5e/0x1c0
[   95.890755]  kasan_check_range+0x2c3/0x2e0
[   95.890755]  sco_sock_timeout+0x5e/0x1c0
[   95.890755]  process_one_work+0x561/0xc50
[   95.890755]  worker_thread+0xab2/0x13c0
[   95.890755]  ? pr_cont_work+0x490/0x490
[   95.890755]  kthread+0x279/0x300
[   95.890755]  ? pr_cont_work+0x490/0x490
[   95.890755]  ? kthread_blkcg+0xa0/0xa0
[   95.890755]  ret_from_fork+0x34/0x60
[   95.890755]  ? kthread_blkcg+0xa0/0xa0
[   95.890755]  ret_from_fork_asm+0x11/0x20
[   95.890755]  </TASK>
[   95.890755]
[   95.890755] Allocated by task 506:
[   95.890755]  kasan_save_track+0x3f/0x70
[   95.890755]  __kasan_kmalloc+0x86/0x90
[   95.890755]  __kmalloc+0x17f/0x360
[   95.890755]  sk_prot_alloc+0xe1/0x1a0
[   95.890755]  sk_alloc+0x31/0x4e0
[   95.890755]  bt_sock_alloc+0x2b/0x2a0
[   95.890755]  sco_sock_create+0xad/0x320
[   95.890755]  bt_sock_create+0x145/0x320
[   95.890755]  __sock_create+0x2e1/0x650
[   95.890755]  __sys_socket+0xd0/0x280
[   95.890755]  __x64_sys_socket+0x75/0x80
[   95.890755]  do_syscall_64+0xc4/0x1b0
[   95.890755]  entry_SYSCALL_64_after_hwframe+0x67/0x6f
[   95.890755]
[   95.890755] Freed by task 506:
[   95.890755]  kasan_save_track+0x3f/0x70
[   95.890755]  kasan_save_free_info+0x40/0x50
[   95.890755]  poison_slab_object+0x118/0x180
[   95.890755]  __kasan_slab_free+0x12/0x30
[   95.890755]  kfree+0xb2/0x240
[   95.890755]  __sk_destruct+0x317/0x410
[   95.890755]  sco_sock_release+0x232/0x280
[   95.890755]  sock_close+0xb2/0x210
[   95.890755]  __fput+0x37f/0x770
[   95.890755]  task_work_run+0x1ae/0x210
[   95.890755]  get_signal+0xe17/0xf70
[   95.890755]  arch_do_signal_or_restart+0x3f/0x520
[   95.890755]  syscall_exit_to_user_mode+0x55/0x120
[   95.890755]  do_syscall_64+0xd1/0x1b0
[   95.890755]  entry_SYSCALL_64_after_hwframe+0x67/0x6f
[   95.890755]
[   95.890755] The buggy address belongs to the object at ffff88800c388000
[   95.890755]  which belongs to the cache kmalloc-1k of size 1024
[   95.890755] The buggy address is located 128 bytes inside of
[   95.890755]  freed 1024-byte region [ffff88800c388000, ffff88800c388400)
[   95.890755]
[   95.890755] The buggy address belongs to the physical page:
[   95.890755] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88800c38a800 pfn:0xc388
[   95.890755] head: order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[   95.890755] anon flags: 0x100000000000840(slab|head|node=0|zone=1)
[   95.890755] page_type: 0xffffffff()
[   95.890755] raw: 0100000000000840 ffff888006842dc0 0000000000000000 0000000000000001
[   95.890755] raw: ffff88800c38a800 000000000010000a 00000001ffffffff 0000000000000000
[   95.890755] head: 0100000000000840 ffff888006842dc0 0000000000000000 0000000000000001
[   95.890755] head: ffff88800c38a800 000000000010000a 00000001ffffffff 0000000000000000
[   95.890755] head: 0100000000000003 ffffea000030e201 ffffea000030e248 00000000ffffffff
[   95.890755] head: 0000000800000000 0000000000000000 00000000ffffffff 0000000000000000
[   95.890755] page dumped because: kasan: bad access detected
[   95.890755]
[   95.890755] Memory state around the buggy address:
[   95.890755]  ffff88800c387f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   95.890755]  ffff88800c388000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   95.890755] >ffff88800c388080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   95.890755]                    ^
[   95.890755]  ffff88800c388100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   95.890755]  ffff88800c388180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   95.890755] ==================================================================

Fix this problem by adding a check protected by sco_conn_lock to judget
whether the conn->hcon is null. Because the conn->hcon will be set to null,
when the sock is releasing.

Fixes: ba316be1b6 ("Bluetooth: schedule SCO timeouts with delayed_work")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-05-03 13:03:53 -04:00
Joel Granados 78a7b5dbc0 ax.25: x.25: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which will
reduce the overall build time size of the kernel and run time memory
bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

Avoid a buffer overflow when traversing the ctl_table by ensuring that
AX25_MAX_VALUES is the same as the size of ax25_param_table. This is
done with a BUILD_BUG_ON where ax25_param_table is defined and a
CONFIG_AX25_DAMA_SLAVE guard in the unnamed enum definition as well as
in the ax25_dev_device_up and ax25_ds_set_timer functions.

The overflow happened when the sentinel was removed from
ax25_param_table. The sentinel's data element was changed when
CONFIG_AX25_DAMA_SLAVE was undefined. This had no adverse effects as it
still stopped on the sentinel's null procname but needed to be addressed
once the sentinel was removed.

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03 13:29:43 +01:00
Joel Granados e00e35e217 appletalk: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which will
reduce the overall build time size of the kernel and run time memory
bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

Remove sentinel from atalk_table ctl_table array.

Acked-by: Kees Cook <keescook@chromium.org> # loadpin & yama
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03 13:29:42 +01:00
Joel Granados 635470eb0a netfilter: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which will
reduce the overall build time size of the kernel and run time memory
bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

* Remove sentinel elements from ctl_table structs
* Remove instances where an array element is zeroed out to make it look
  like a sentinel. This is not longer needed and is safe after commit
  c899710fe7 ("networking: Update to register_net_sysctl_sz") added
  the array size to the ctl_table registration
* Remove the need for having __NF_SYSCTL_CT_LAST_SYSCTL as the
  sysctl array size is now in NF_SYSCTL_CT_LAST_SYSCTL
* Remove extra element in ctl_table arrays declarations

Acked-by: Kees Cook <keescook@chromium.org> # loadpin & yama
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03 13:29:42 +01:00
Joel Granados 73dbd8cf79 net: Remove ctl_table sentinel elements from several networking subsystems
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

To avoid lots of small commits, this commit brings together network
changes from (as they appear in MAINTAINERS) LLC, MPTCP, NETROM NETWORK
LAYER, PHONET PROTOCOL, ROSE NETWORK LAYER, RXRPC SOCKETS, SCTP
PROTOCOL, SHARED MEMORY COMMUNICATIONS (SMC), TIPC NETWORK LAYER and
NETWORKING [IPSEC]

* Remove sentinel element from ctl_table structs.
* Replace empty array registration with the register_net_sysctl_sz call
  in llc_sysctl_init
* Replace the for loop stop condition that tests for procname == NULL
  with one that depends on array size in sctp_sysctl_net_register
* Remove instances where an array element is zeroed out to make it look
  like a sentinel in xfrm_sysctl_init. This is not longer needed and is
  safe after commit c899710fe7 ("networking: Update to
  register_net_sysctl_sz") added the array size to the ctl_table
  registration
* Use a table_size variable to keep the value of ARRAY_SIZE

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03 13:29:42 +01:00
Joel Granados ca5d1fce79 net: sunrpc: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

* Remove sentinel element from ctl_table structs.

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03 13:29:42 +01:00
Joel Granados 92bedf0783 net: rds: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

* Remove sentinel element from ctl_table structs.

Signed-off-by: Joel Granados <j.granados@samsung.com>
Acked-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03 13:29:42 +01:00
Joel Granados 1c106eb01c net: ipv{6,4}: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

* Remove sentinel element from ctl_table structs.
* Remove the zeroing out of an array element (to make it look like a
  sentinel) in sysctl_route_net_init And ipv6_route_sysctl_init.
  This is not longer needed and is safe after commit c899710fe7
  ("networking: Update to register_net_sysctl_sz") added the array size
  to the ctl_table registration.
* Remove extra sentinel element in the declaration of devinet_vars.
* Removed the "-1" in __devinet_sysctl_register, sysctl_route_net_init,
  ipv6_sysctl_net_init and ipv4_sysctl_init_net that adjusted for having
  an extra empty element when looping over ctl_table arrays
* Replace the for loop stop condition in __addrconf_sysctl_register that
  tests for procname == NULL with one that depends on array size
* Removing the unprivileged user check in ipv6_route_sysctl_init is
  safe as it is replaced by calling ipv6_route_sysctl_table_size;
  introduced in commit c899710fe7 ("networking: Update to
  register_net_sysctl_sz")
* Use a table_size variable to keep the value of ARRAY_SIZE

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03 13:29:42 +01:00
Joel Granados ce218712b0 net: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

* Remove sentinel element from ctl_table structs.
* Remove the zeroing out of an array element (to make it look like a
  sentinel) in neigh_sysctl_register and lowpan_frags_ns_sysctl_register
  This is not longer needed and is safe after commit c899710fe7
  ("networking: Update to register_net_sysctl_sz") added the array size
  to the ctl_table registration.
* Replace the for loop stop condition in sysctl_core_net_init that tests
  for procname == NULL with one that depends on array size
* Removed the "-1" in mpls_net_init that adjusted for having an extra
  empty element when looping over ctl_table arrays
* Use a table_size variable to keep the value of ARRAY_SIZE

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03 13:29:41 +01:00
Kalle Valo f1c26960b6 ath.git patches for v6.10
ath12k
 
 * debugfs support
 
 * dfs_simulate_radar debugfs file
 
 * disable Wireless Extensions
 
 * suspend and hibernation support
 
 * ACPI support
 
 * refactoring in preparation of multi-link support
 
 ath11k
 
 * support hibernation (required changes in qrtr and MHI subsystems)
 
 * ieee80211-freq-limit Device Tree property support
 
 ath10k
 
 * firmware-name Device Tree property support
 -----BEGIN PGP SIGNATURE-----
 
 iQFLBAABCgA1FiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmYztxUXHHF1aWNfa3Zh
 bG9AcXVpY2luYy5jb20ACgkQbhckVSbrbZs0NQf/dYF+Pjy3eh9vabwZ05rUZwLF
 P/k7Q8o631Gl0LTkmGw6SseUFswAOH1uwaKpWQvjhzHE8Fy4ziMeQ+aruqtbfZ3i
 Q1BYsnCbQHZ+1HX0EJmB9KDz8+/G1JxOlwiux7EsujOuFixjomm08VEJ4QXGVSnj
 NnCg0zx/0njDqT/3KjJuoL4stAk0HDvIayiVGsQOwcQQJelLGudqTNmRPnKQkR5w
 yRUX+NaFlVHgbbhsnEyW3XeqKVFwgy64BtNRIVpJ0lF51aazT80DagNmw5CawznC
 uak1nFktAtX/AID3lWRjek3smIdlnUz39JojKW7iO60/wfZXXFbR8Xl2XEpp9g==
 =fF10
 -----END PGP SIGNATURE-----

Merge tag 'ath-next-20240502' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath

ath.git patches for v6.10

ath12k

* debugfs support

* dfs_simulate_radar debugfs file

* disable Wireless Extensions

* suspend and hibernation support

* ACPI support

* refactoring in preparation of multi-link support

ath11k

* support hibernation (required changes in qrtr and MHI subsystems)

* ieee80211-freq-limit Device Tree property support

ath10k

* firmware-name Device Tree property support
2024-05-03 13:30:19 +03:00
Aditya Kumar Singh 414e736c3d wifi: mac80211: handle color change per link
In order to support color change with MLO, handle the link ID now
passed from cfg80211, adjust the code to do everything per link
and call the notifications to cfg80211 correctly.

Signed-off-by: Aditya Kumar Singh <quic_adisi@quicinc.com>
Link: https://msgid.link/20240422053412.2024075-4-quic_adisi@quicinc.com
Link: https://msgid.link/20240422053412.2024075-5-quic_adisi@quicinc.com
Link: https://msgid.link/20240422053412.2024075-6-quic_adisi@quicinc.com
Link: https://msgid.link/20240422053412.2024075-7-quic_adisi@quicinc.com
[squash, move API call updates to this patch]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-03 10:18:19 +02:00
Aditya Kumar Singh 91d2b6ee13 wifi: cfg80211: handle color change per link
Currently, during color change, no link id information is passed down.
In order to support color change during Multi Link Operation, it is
required to pass link id as well.
Additionally, update notification APIs to allow drivers/mac80211 to
pass the link ID.

Signed-off-by: Aditya Kumar Singh <quic_adisi@quicinc.com>
Link: https://msgid.link/20240422053412.2024075-2-quic_adisi@quicinc.com
Link: https://msgid.link/20240422053412.2024075-3-quic_adisi@quicinc.com
[squash, actually only pass 0 from mac80211]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-03 10:18:03 +02:00
Xin Deng 9f6d4b8d14 wifi: cfg80211: Clear mlo_links info when STA disconnects
wdev->valid_links is not cleared when upper layer disconnect from a
wdev->AP MLD. It has been observed that this would prevent offchannel
operations like remain-on-channel which would be needed for user space
operations with Public Action frame.
Clear the wdev->valid_links when STA disconnects.

Signed-off-by: Xin Deng <quic_deng@quicinc.com>
Link: https://msgid.link/20240426092501.8592-1-quic_deng@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-03 10:03:19 +02:00
Kees Cook 838c7b8f1f wifi: nl80211: Avoid address calculations via out of bounds array indexing
Before request->channels[] can be used, request->n_channels must be set.
Additionally, address calculations for memory after the "channels" array
need to be calculated from the allocation base ("request") rather than
via the first "out of bounds" index of "channels", otherwise run-time
bounds checking will throw a warning.

Reported-by: Nathan Chancellor <nathan@kernel.org>
Fixes: e3eac9f32e ("wifi: cfg80211: Annotate struct cfg80211_scan_request with __counted_by")
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://msgid.link/20240424220057.work.819-kees@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-03 09:20:26 +02:00
Kuniyuki Iwashima f2db7230f7 tcp: Use refcount_inc_not_zero() in tcp_twsk_unique().
Anderson Nascimento reported a use-after-free splat in tcp_twsk_unique()
with nice analysis.

Since commit ec94c2696f ("tcp/dccp: avoid one atomic operation for
timewait hashdance"), inet_twsk_hashdance() sets TIME-WAIT socket's
sk_refcnt after putting it into ehash and releasing the bucket lock.

Thus, there is a small race window where other threads could try to
reuse the port during connect() and call sock_hold() in tcp_twsk_unique()
for the TIME-WAIT socket with zero refcnt.

If that happens, the refcnt taken by tcp_twsk_unique() is overwritten
and sock_put() will cause underflow, triggering a real use-after-free
somewhere else.

To avoid the use-after-free, we need to use refcount_inc_not_zero() in
tcp_twsk_unique() and give up on reusing the port if it returns false.

[0]:
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 0 PID: 1039313 at lib/refcount.c:25 refcount_warn_saturate+0xe5/0x110
CPU: 0 PID: 1039313 Comm: trigger Not tainted 6.8.6-200.fc39.x86_64 #1
Hardware name: VMware, Inc. VMware20,1/440BX Desktop Reference Platform, BIOS VMW201.00V.21805430.B64.2305221830 05/22/2023
RIP: 0010:refcount_warn_saturate+0xe5/0x110
Code: 42 8e ff 0f 0b c3 cc cc cc cc 80 3d aa 13 ea 01 00 0f 85 5e ff ff ff 48 c7 c7 f8 8e b7 82 c6 05 96 13 ea 01 01 e8 7b 42 8e ff <0f> 0b c3 cc cc cc cc 48 c7 c7 50 8f b7 82 c6 05 7a 13 ea 01 01 e8
RSP: 0018:ffffc90006b43b60 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff888009bb3ef0 RCX: 0000000000000027
RDX: ffff88807be218c8 RSI: 0000000000000001 RDI: ffff88807be218c0
RBP: 0000000000069d70 R08: 0000000000000000 R09: ffffc90006b439f0
R10: ffffc90006b439e8 R11: 0000000000000003 R12: ffff8880029ede84
R13: 0000000000004e20 R14: ffffffff84356dc0 R15: ffff888009bb3ef0
FS:  00007f62c10926c0(0000) GS:ffff88807be00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020ccb000 CR3: 000000004628c005 CR4: 0000000000f70ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? refcount_warn_saturate+0xe5/0x110
 ? __warn+0x81/0x130
 ? refcount_warn_saturate+0xe5/0x110
 ? report_bug+0x171/0x1a0
 ? refcount_warn_saturate+0xe5/0x110
 ? handle_bug+0x3c/0x80
 ? exc_invalid_op+0x17/0x70
 ? asm_exc_invalid_op+0x1a/0x20
 ? refcount_warn_saturate+0xe5/0x110
 tcp_twsk_unique+0x186/0x190
 __inet_check_established+0x176/0x2d0
 __inet_hash_connect+0x74/0x7d0
 ? __pfx___inet_check_established+0x10/0x10
 tcp_v4_connect+0x278/0x530
 __inet_stream_connect+0x10f/0x3d0
 inet_stream_connect+0x3a/0x60
 __sys_connect+0xa8/0xd0
 __x64_sys_connect+0x18/0x20
 do_syscall_64+0x83/0x170
 entry_SYSCALL_64_after_hwframe+0x78/0x80
RIP: 0033:0x7f62c11a885d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a3 45 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007f62c1091e58 EFLAGS: 00000296 ORIG_RAX: 000000000000002a
RAX: ffffffffffffffda RBX: 0000000020ccb004 RCX: 00007f62c11a885d
RDX: 0000000000000010 RSI: 0000000020ccb000 RDI: 0000000000000003
RBP: 00007f62c1091e90 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000296 R12: 00007f62c10926c0
R13: ffffffffffffff88 R14: 0000000000000000 R15: 00007ffe237885b0
 </TASK>

Fixes: ec94c2696f ("tcp/dccp: avoid one atomic operation for timewait hashdance")
Reported-by: Anderson Nascimento <anderson@allelesecurity.com>
Closes: https://lore.kernel.org/netdev/37a477a6-d39e-486b-9577-3463f655a6b7@allelesecurity.com/
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240501213145.62261-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02 19:02:46 -07:00
Eric Dumazet 94062790ae tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets
TCP_SYN_RECV state is really special, it is only used by
cross-syn connections, mostly used by fuzzers.

In the following crash [1], syzbot managed to trigger a divide
by zero in tcp_rcv_space_adjust()

A socket makes the following state transitions,
without ever calling tcp_init_transfer(),
meaning tcp_init_buffer_space() is also not called.

         TCP_CLOSE
connect()
         TCP_SYN_SENT
         TCP_SYN_RECV
shutdown() -> tcp_shutdown(sk, SEND_SHUTDOWN)
         TCP_FIN_WAIT1

To fix this issue, change tcp_shutdown() to not
perform a TCP_SYN_RECV -> TCP_FIN_WAIT1 transition,
which makes no sense anyway.

When tcp_rcv_state_process() later changes socket state
from TCP_SYN_RECV to TCP_ESTABLISH, then look at
sk->sk_shutdown to finally enter TCP_FIN_WAIT1 state,
and send a FIN packet from a sane socket state.

This means tcp_send_fin() can now be called from BH
context, and must use GFP_ATOMIC allocations.

[1]
divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 1 PID: 5084 Comm: syz-executor358 Not tainted 6.9.0-rc6-syzkaller-00022-g98369dccd2f8 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
 RIP: 0010:tcp_rcv_space_adjust+0x2df/0x890 net/ipv4/tcp_input.c:767
Code: e3 04 4c 01 eb 48 8b 44 24 38 0f b6 04 10 84 c0 49 89 d5 0f 85 a5 03 00 00 41 8b 8e c8 09 00 00 89 e8 29 c8 48 0f af c3 31 d2 <48> f7 f1 48 8d 1c 43 49 8d 96 76 08 00 00 48 89 d0 48 c1 e8 03 48
RSP: 0018:ffffc900031ef3f0 EFLAGS: 00010246
RAX: 0c677a10441f8f42 RBX: 000000004fb95e7e RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000027d4b11f R08: ffffffff89e535a4 R09: 1ffffffff25e6ab7
R10: dffffc0000000000 R11: ffffffff8135e920 R12: ffff88802a9f8d30
R13: dffffc0000000000 R14: ffff88802a9f8d00 R15: 1ffff1100553f2da
FS:  00005555775c0380(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1155bf2304 CR3: 000000002b9f2000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
  tcp_recvmsg_locked+0x106d/0x25a0 net/ipv4/tcp.c:2513
  tcp_recvmsg+0x25d/0x920 net/ipv4/tcp.c:2578
  inet6_recvmsg+0x16a/0x730 net/ipv6/af_inet6.c:680
  sock_recvmsg_nosec net/socket.c:1046 [inline]
  sock_recvmsg+0x109/0x280 net/socket.c:1068
  ____sys_recvmsg+0x1db/0x470 net/socket.c:2803
  ___sys_recvmsg net/socket.c:2845 [inline]
  do_recvmmsg+0x474/0xae0 net/socket.c:2939
  __sys_recvmmsg net/socket.c:3018 [inline]
  __do_sys_recvmmsg net/socket.c:3041 [inline]
  __se_sys_recvmmsg net/socket.c:3034 [inline]
  __x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7faeb6363db9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffcc1997168 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007faeb6363db9
RDX: 0000000000000001 RSI: 0000000020000bc0 RDI: 0000000000000005
RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000001c
R10: 0000000000000122 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20240501125448.896529-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02 19:01:59 -07:00
Eric Dumazet a17ef9e6c2 net_sched: sch_sfq: annotate data-races around q->perturb_period
sfq_perturbation() reads q->perturb_period locklessly.
Add annotations to fix potential issues.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240430180015.3111398-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02 19:01:35 -07:00
Miao Xu 0325cbd21e bpf: tcp: Allow to write tp->snd_cwnd_stamp in bpf_tcp_ca
This patch allows the write of tp->snd_cwnd_stamp in a bpf tcp
ca program. An use case of writing this field is to keep track
of the time whenever tp->snd_cwnd is raised or reduced inside
the `cong_control` callback.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Miao Xu <miaxu@meta.com>
Link: https://lore.kernel.org/r/20240502042318.801932-3-miaxu@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-02 16:26:56 -07:00
Miao Xu 57bfc7605c tcp: Add new args for cong_control in tcp_congestion_ops
This patch adds two new arguments for cong_control of struct
tcp_congestion_ops:
 - ack
 - flag
These two arguments are inherited from the caller tcp_cong_control in
tcp_intput.c. One use case of them is to update cwnd and pacing rate
inside cong_control based on the info they provide. For example, the
flag can be used to decide if it is the right time to raise or reduce a
sender's cwnd.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Miao Xu <miaxu@meta.com>
Link: https://lore.kernel.org/r/20240502042318.801932-2-miaxu@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-02 16:26:56 -07:00
Jakub Kicinski e958da0ddb Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

include/linux/filter.h
kernel/bpf/core.c
  66e13b615a ("bpf: verifier: prevent userspace memory access")
  d503a04f8b ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
https://lore.kernel.org/all/20240429114939.210328b0@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02 12:06:25 -07:00
Linus Torvalds 545c494465 Including fixes from bpf.
Relatively calm week, likely due to public holiday in most places.
 No known outstanding regressions.
 
 Current release - regressions:
 
   - rxrpc: fix wrong alignmask in __page_frag_alloc_align()
 
   - eth: e1000e: change usleep_range to udelay in PHY mdic access
 
 Previous releases - regressions:
 
   - gro: fix udp bad offset in socket lookup
 
   - bpf: fix incorrect runtime stat for arm64
 
   - tipc: fix UAF in error path
 
   - netfs: fix a potential infinite loop in extract_user_to_sg()
 
   - eth: ice: ensure the copied buf is NUL terminated
 
   - eth: qeth: fix kernel panic after setting hsuid
 
 Previous releases - always broken:
 
   - bpf:
     - verifier: prevent userspace memory access
     - xdp: use flags field to disambiguate broadcast redirect
 
   - bridge: fix multicast-to-unicast with fraglist GSO
 
   - mptcp: ensure snd_nxt is properly initialized on connect
 
   - nsh: fix outer header access in nsh_gso_segment().
 
   - eth: bcmgenet: fix racing registers access
 
   - eth: vxlan: fix stats counters.
 
 Misc:
 
   - a bunch of MAINTAINERS file updates
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmYzaRsSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkh70P/jzsTsvzHspu3RUwcsyvWpSoJPcxP2tF
 5SKR66o8sbSjB5I26zUi/LtRZgbPO32GmLN2Y8GvP74h9lwKdDo4AY4volZKCT6f
 lRG6GohvMa0lSPSn1fti7CKVzDOsaTHvLz3uBBr+Xb9ITCKh+I+zGEEDGj/47SQN
 tmDWHPF8OMs2ezmYS5NqRIQ3CeRz6uyLmEoZhVm4SolypZ18oEg7GCtL3u6U48n+
 e3XB3WwKl0ZxK8ipvPgUDwGIDuM5hEyAaeNon3zpYGoqitRsRITUjULpb9dT4DtJ
 Jma3OkarFJNXgm4N/p/nAtQ9AdiAloF9ivZXs2t0XCdrrUZJUh05yuikoX+mLfpw
 GedG2AbaVl6mdqNkrHeyf5SXKuiPgeCLVfF2xMjS0l1kFbY+Bt8BqnRSdOrcoUG0
 zlSzBeBtajttMdnalWv2ZshjP8uo/NjXydUjoVNwuq8xGO5wP+zhNnwhOvecNyUg
 t7q2PLokahlz4oyDqyY/7SQ0hSEndqxOlt43I6CthoWH0XkS83nTPdQXcTKQParD
 ntJUk5QYwefUT1gimbn/N8GoP7a1+ysWiqcf/7+SNm932gJGiDt36+HOEmyhIfIG
 IDWTWJJW64SnPBIUw59MrG7hMtbfaiZiFQqeUJQpFVrRr+tg5z5NUZ5thA+EJVd8
 qiVDvmngZFiv
 =f6KY
 -----END PGP SIGNATURE-----

Merge tag 'net-6.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from bpf.

  Relatively calm week, likely due to public holiday in most places. No
  known outstanding regressions.

  Current release - regressions:

   - rxrpc: fix wrong alignmask in __page_frag_alloc_align()

   - eth: e1000e: change usleep_range to udelay in PHY mdic access

  Previous releases - regressions:

   - gro: fix udp bad offset in socket lookup

   - bpf: fix incorrect runtime stat for arm64

   - tipc: fix UAF in error path

   - netfs: fix a potential infinite loop in extract_user_to_sg()

   - eth: ice: ensure the copied buf is NUL terminated

   - eth: qeth: fix kernel panic after setting hsuid

  Previous releases - always broken:

   - bpf:
       - verifier: prevent userspace memory access
       - xdp: use flags field to disambiguate broadcast redirect

   - bridge: fix multicast-to-unicast with fraglist GSO

   - mptcp: ensure snd_nxt is properly initialized on connect

   - nsh: fix outer header access in nsh_gso_segment().

   - eth: bcmgenet: fix racing registers access

   - eth: vxlan: fix stats counters.

  Misc:

   - a bunch of MAINTAINERS file updates"

* tag 'net-6.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (45 commits)
  MAINTAINERS: mark MYRICOM MYRI-10G as Orphan
  MAINTAINERS: remove Ariel Elior
  net: gro: add flush check in udp_gro_receive_segment
  net: gro: fix udp bad offset in socket lookup by adding {inner_}network_offset to napi_gro_cb
  ipv4: Fix uninit-value access in __ip_make_skb()
  s390/qeth: Fix kernel panic after setting hsuid
  vxlan: Pull inner IP header in vxlan_rcv().
  tipc: fix a possible memleak in tipc_buf_append
  tipc: fix UAF in error path
  rxrpc: Clients must accept conn from any address
  net: core: reject skb_copy(_expand) for fraglist GSO skbs
  net: bridge: fix multicast-to-unicast with fraglist GSO
  mptcp: ensure snd_nxt is properly initialized on connect
  e1000e: change usleep_range to udelay in PHY mdic access
  net: dsa: mv88e6xxx: Fix number of databases for 88E6141 / 88E6341
  cxgb4: Properly lock TX queue for the selftest.
  rxrpc: Fix using alignmask being zero for __page_frag_alloc_align()
  vxlan: Add missing VNI filter counter update in arp_reduce().
  vxlan: Fix racy device stats updates.
  net: qede: use return from qede_parse_actions()
  ...
2024-05-02 08:51:47 -07:00
Davide Caratti 86735b57c9 net/sched: unregister lockdep keys in qdisc_create/qdisc_alloc error path
Naresh and Eric report several errors (corrupted elements in the dynamic
key hash list), when running tdc.py or syzbot. The error path of
qdisc_alloc() and qdisc_create() frees the qdisc memory, but it forgets
to unregister the lockdep key, thus causing use-after-free like the
following one:

 ==================================================================
 BUG: KASAN: slab-use-after-free in lockdep_register_key+0x5f2/0x700
 Read of size 8 at addr ffff88811236f2a8 by task ip/7925

 CPU: 26 PID: 7925 Comm: ip Kdump: loaded Not tainted 6.9.0-rc2+ #648
 Hardware name: Supermicro SYS-6027R-72RF/X9DRH-7TF/7F/iTF/iF, BIOS 3.0  07/26/2013
 Call Trace:
  <TASK>
  dump_stack_lvl+0x7c/0xc0
  print_report+0xc9/0x610
  kasan_report+0x89/0xc0
  lockdep_register_key+0x5f2/0x700
  qdisc_alloc+0x21d/0xb60
  qdisc_create_dflt+0x63/0x3c0
  attach_one_default_qdisc.constprop.37+0x8e/0x170
  dev_activate+0x4bd/0xc30
  __dev_open+0x275/0x380
  __dev_change_flags+0x3f1/0x570
  dev_change_flags+0x7c/0x160
  do_setlink+0x1ea1/0x34b0
  __rtnl_newlink+0x8c9/0x1510
  rtnl_newlink+0x61/0x90
  rtnetlink_rcv_msg+0x2f0/0xbc0
  netlink_rcv_skb+0x120/0x380
  netlink_unicast+0x420/0x630
  netlink_sendmsg+0x732/0xbc0
  __sock_sendmsg+0x1ea/0x280
  ____sys_sendmsg+0x5a9/0x990
  ___sys_sendmsg+0xf1/0x180
  __sys_sendmsg+0xd3/0x180
  do_syscall_64+0x96/0x180
  entry_SYSCALL_64_after_hwframe+0x71/0x79
 RIP: 0033:0x7f9503f4fa07
 Code: 0a 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
 RSP: 002b:00007fff6c729068 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 000000006630c681 RCX: 00007f9503f4fa07
 RDX: 0000000000000000 RSI: 00007fff6c7290d0 RDI: 0000000000000003
 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000078
 R10: 000000000000009b R11: 0000000000000246 R12: 0000000000000001
 R13: 00007fff6c729180 R14: 0000000000000000 R15: 000055bf67dd9040
  </TASK>

 Allocated by task 7745:
  kasan_save_stack+0x1c/0x40
  kasan_save_track+0x10/0x30
  __kasan_kmalloc+0x7b/0x90
  __kmalloc_node+0x1ff/0x460
  qdisc_alloc+0xae/0xb60
  qdisc_create+0xdd/0xfb0
  tc_modify_qdisc+0x37e/0x1960
  rtnetlink_rcv_msg+0x2f0/0xbc0
  netlink_rcv_skb+0x120/0x380
  netlink_unicast+0x420/0x630
  netlink_sendmsg+0x732/0xbc0
  __sock_sendmsg+0x1ea/0x280
  ____sys_sendmsg+0x5a9/0x990
  ___sys_sendmsg+0xf1/0x180
  __sys_sendmsg+0xd3/0x180
  do_syscall_64+0x96/0x180
  entry_SYSCALL_64_after_hwframe+0x71/0x79

 Freed by task 7745:
  kasan_save_stack+0x1c/0x40
  kasan_save_track+0x10/0x30
  kasan_save_free_info+0x36/0x60
  __kasan_slab_free+0xfe/0x180
  kfree+0x113/0x380
  qdisc_create+0xafb/0xfb0
  tc_modify_qdisc+0x37e/0x1960
  rtnetlink_rcv_msg+0x2f0/0xbc0
  netlink_rcv_skb+0x120/0x380
  netlink_unicast+0x420/0x630
  netlink_sendmsg+0x732/0xbc0
  __sock_sendmsg+0x1ea/0x280
  ____sys_sendmsg+0x5a9/0x990
  ___sys_sendmsg+0xf1/0x180
  __sys_sendmsg+0xd3/0x180
  do_syscall_64+0x96/0x180
  entry_SYSCALL_64_after_hwframe+0x71/0x79

Fix this ensuring that lockdep_unregister_key() is called before the
qdisc struct is freed, also in the error path of qdisc_create() and
qdisc_alloc().

Fixes: af0cb3fa3f ("net/sched: fix false lockdep warning on qdisc root lock")
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Closes: https://lore.kernel.org/netdev/20240429221706.1492418-1-naresh.kamboju@linaro.org/
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/2aa1ca0c0a3aa0acc15925c666c777a4b5de553c.1714496886.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02 07:12:40 -07:00
Richard Gobert 5babae777c net: gro: add flush check in udp_gro_receive_segment
GRO-GSO path is supposed to be transparent and as such L3 flush checks are
relevant to all UDP flows merging in GRO. This patch uses the same logic
and code from tcp_gro_receive, terminating merge if flush is non zero.

Fixes: e20cf8d3f1 ("udp: implement GRO for plain UDP sockets.")
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-02 11:03:20 +02:00
Richard Gobert 5ef31ea5d0 net: gro: fix udp bad offset in socket lookup by adding {inner_}network_offset to napi_gro_cb
Commits a602456 ("udp: Add GRO functions to UDP socket") and 57c67ff ("udp:
additional GRO support") introduce incorrect usage of {ip,ipv6}_hdr in the
complete phase of gro. The functions always return skb->network_header,
which in the case of encapsulated packets at the gro complete phase, is
always set to the innermost L3 of the packet. That means that calling
{ip,ipv6}_hdr for skbs which completed the GRO receive phase (both in
gro_list and *_gro_complete) when parsing an encapsulated packet's _outer_
L3/L4 may return an unexpected value.

This incorrect usage leads to a bug in GRO's UDP socket lookup.
udp{4,6}_lib_lookup_skb functions use ip_hdr/ipv6_hdr respectively. These
*_hdr functions return network_header which will point to the innermost L3,
resulting in the wrong offset being used in __udp{4,6}_lib_lookup with
encapsulated packets.

This patch adds network_offset and inner_network_offset to napi_gro_cb, and
makes sure both are set correctly.

To fix the issue, network_offsets union is used inside napi_gro_cb, in
which both the outer and the inner network offsets are saved.

Reproduction example:

Endpoint configuration example (fou + local address bind)

    # ip fou add port 6666 ipproto 4
    # ip link add name tun1 type ipip remote 2.2.2.1 local 2.2.2.2 encap fou encap-dport 5555 encap-sport 6666 mode ipip
    # ip link set tun1 up
    # ip a add 1.1.1.2/24 dev tun1

Netperf TCP_STREAM result on net-next before patch is applied:

net-next main, GRO enabled:
    $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5
    Recv   Send    Send
    Socket Socket  Message  Elapsed
    Size   Size    Size     Time     Throughput
    bytes  bytes   bytes    secs.    10^6bits/sec

    131072  16384  16384    5.28        2.37

net-next main, GRO disabled:
    $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5
    Recv   Send    Send
    Socket Socket  Message  Elapsed
    Size   Size    Size     Time     Throughput
    bytes  bytes   bytes    secs.    10^6bits/sec

    131072  16384  16384    5.01     2745.06

patch applied, GRO enabled:
    $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5
    Recv   Send    Send
    Socket Socket  Message  Elapsed
    Size   Size    Size     Time     Throughput
    bytes  bytes   bytes    secs.    10^6bits/sec

    131072  16384  16384    5.01     2877.38

Fixes: a6024562ff ("udp: Add GRO functions to UDP socket")
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-02 11:02:48 +02:00
Shigeru Yoshida fc1092f515 ipv4: Fix uninit-value access in __ip_make_skb()
KMSAN reported uninit-value access in __ip_make_skb() [1].  __ip_make_skb()
tests HDRINCL to know if the skb has icmphdr. However, HDRINCL can cause a
race condition. If calling setsockopt(2) with IP_HDRINCL changes HDRINCL
while __ip_make_skb() is running, the function will access icmphdr in the
skb even if it is not included. This causes the issue reported by KMSAN.

Check FLOWI_FLAG_KNOWN_NH on fl4->flowi4_flags instead of testing HDRINCL
on the socket.

Also, fl4->fl4_icmp_type and fl4->fl4_icmp_code are not initialized. These
are union in struct flowi4 and are implicitly initialized by
flowi4_init_output(), but we should not rely on specific union layout.

Initialize these explicitly in raw_sendmsg().

[1]
BUG: KMSAN: uninit-value in __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481
 __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481
 ip_finish_skb include/net/ip.h:243 [inline]
 ip_push_pending_frames+0x4c/0x5c0 net/ipv4/ip_output.c:1508
 raw_sendmsg+0x2381/0x2690 net/ipv4/raw.c:654
 inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851
 sock_sendmsg_nosec net/socket.c:730 [inline]
 __sock_sendmsg+0x274/0x3c0 net/socket.c:745
 __sys_sendto+0x62c/0x7b0 net/socket.c:2191
 __do_sys_sendto net/socket.c:2203 [inline]
 __se_sys_sendto net/socket.c:2199 [inline]
 __x64_sys_sendto+0x130/0x200 net/socket.c:2199
 do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x6d/0x75

Uninit was created at:
 slab_post_alloc_hook mm/slub.c:3804 [inline]
 slab_alloc_node mm/slub.c:3845 [inline]
 kmem_cache_alloc_node+0x5f6/0xc50 mm/slub.c:3888
 kmalloc_reserve+0x13c/0x4a0 net/core/skbuff.c:577
 __alloc_skb+0x35a/0x7c0 net/core/skbuff.c:668
 alloc_skb include/linux/skbuff.h:1318 [inline]
 __ip_append_data+0x49ab/0x68c0 net/ipv4/ip_output.c:1128
 ip_append_data+0x1e7/0x260 net/ipv4/ip_output.c:1365
 raw_sendmsg+0x22b1/0x2690 net/ipv4/raw.c:648
 inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851
 sock_sendmsg_nosec net/socket.c:730 [inline]
 __sock_sendmsg+0x274/0x3c0 net/socket.c:745
 __sys_sendto+0x62c/0x7b0 net/socket.c:2191
 __do_sys_sendto net/socket.c:2203 [inline]
 __se_sys_sendto net/socket.c:2199 [inline]
 __x64_sys_sendto+0x130/0x200 net/socket.c:2199
 do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x6d/0x75

CPU: 1 PID: 15709 Comm: syz-executor.7 Not tainted 6.8.0-11567-gb3603fcb79b1 #25
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014

Fixes: 99e5acae19 ("ipv4: Fix potential uninit variable access bug in __ip_make_skb()")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Link: https://lore.kernel.org/r/20240430123945.2057348-1-syoshida@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-02 10:16:35 +02:00
Florian Fainelli 8a021a863a net: dsa: Remove adjust_link paths
Now that we no longer any drivers using PHYLIB's adjust_link callback,
remove all paths that made use of adjust_link as well as the associated
functions.

Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20240430164816.2400606-3-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01 19:06:32 -07:00
Xin Long 97bf6f81b2 tipc: fix a possible memleak in tipc_buf_append
__skb_linearize() doesn't free the skb when it fails, so move
'*buf = NULL' after __skb_linearize(), so that the skb can be
freed on the err path.

Fixes: b7df21cf1b ("tipc: skb_linearize the head skb when reassembling msgs")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Link: https://lore.kernel.org/r/90710748c29a1521efac4f75ea01b3b7e61414cf.1714485818.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01 18:39:44 -07:00
Paolo Abeni 080cbb8902 tipc: fix UAF in error path
Sam Page (sam4k) working with Trend Micro Zero Day Initiative reported
a UAF in the tipc_buf_append() error path:

BUG: KASAN: slab-use-after-free in kfree_skb_list_reason+0x47e/0x4c0
linux/net/core/skbuff.c:1183
Read of size 8 at addr ffff88804d2a7c80 by task poc/8034

CPU: 1 PID: 8034 Comm: poc Not tainted 6.8.2 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.16.0-debian-1.16.0-5 04/01/2014
Call Trace:
 <IRQ>
 __dump_stack linux/lib/dump_stack.c:88
 dump_stack_lvl+0xd9/0x1b0 linux/lib/dump_stack.c:106
 print_address_description linux/mm/kasan/report.c:377
 print_report+0xc4/0x620 linux/mm/kasan/report.c:488
 kasan_report+0xda/0x110 linux/mm/kasan/report.c:601
 kfree_skb_list_reason+0x47e/0x4c0 linux/net/core/skbuff.c:1183
 skb_release_data+0x5af/0x880 linux/net/core/skbuff.c:1026
 skb_release_all linux/net/core/skbuff.c:1094
 __kfree_skb linux/net/core/skbuff.c:1108
 kfree_skb_reason+0x12d/0x210 linux/net/core/skbuff.c:1144
 kfree_skb linux/./include/linux/skbuff.h:1244
 tipc_buf_append+0x425/0xb50 linux/net/tipc/msg.c:186
 tipc_link_input+0x224/0x7c0 linux/net/tipc/link.c:1324
 tipc_link_rcv+0x76e/0x2d70 linux/net/tipc/link.c:1824
 tipc_rcv+0x45f/0x10f0 linux/net/tipc/node.c:2159
 tipc_udp_recv+0x73b/0x8f0 linux/net/tipc/udp_media.c:390
 udp_queue_rcv_one_skb+0xad2/0x1850 linux/net/ipv4/udp.c:2108
 udp_queue_rcv_skb+0x131/0xb00 linux/net/ipv4/udp.c:2186
 udp_unicast_rcv_skb+0x165/0x3b0 linux/net/ipv4/udp.c:2346
 __udp4_lib_rcv+0x2594/0x3400 linux/net/ipv4/udp.c:2422
 ip_protocol_deliver_rcu+0x30c/0x4e0 linux/net/ipv4/ip_input.c:205
 ip_local_deliver_finish+0x2e4/0x520 linux/net/ipv4/ip_input.c:233
 NF_HOOK linux/./include/linux/netfilter.h:314
 NF_HOOK linux/./include/linux/netfilter.h:308
 ip_local_deliver+0x18e/0x1f0 linux/net/ipv4/ip_input.c:254
 dst_input linux/./include/net/dst.h:461
 ip_rcv_finish linux/net/ipv4/ip_input.c:449
 NF_HOOK linux/./include/linux/netfilter.h:314
 NF_HOOK linux/./include/linux/netfilter.h:308
 ip_rcv+0x2c5/0x5d0 linux/net/ipv4/ip_input.c:569
 __netif_receive_skb_one_core+0x199/0x1e0 linux/net/core/dev.c:5534
 __netif_receive_skb+0x1f/0x1c0 linux/net/core/dev.c:5648
 process_backlog+0x101/0x6b0 linux/net/core/dev.c:5976
 __napi_poll.constprop.0+0xba/0x550 linux/net/core/dev.c:6576
 napi_poll linux/net/core/dev.c:6645
 net_rx_action+0x95a/0xe90 linux/net/core/dev.c:6781
 __do_softirq+0x21f/0x8e7 linux/kernel/softirq.c:553
 do_softirq linux/kernel/softirq.c:454
 do_softirq+0xb2/0xf0 linux/kernel/softirq.c:441
 </IRQ>
 <TASK>
 __local_bh_enable_ip+0x100/0x120 linux/kernel/softirq.c:381
 local_bh_enable linux/./include/linux/bottom_half.h:33
 rcu_read_unlock_bh linux/./include/linux/rcupdate.h:851
 __dev_queue_xmit+0x871/0x3ee0 linux/net/core/dev.c:4378
 dev_queue_xmit linux/./include/linux/netdevice.h:3169
 neigh_hh_output linux/./include/net/neighbour.h:526
 neigh_output linux/./include/net/neighbour.h:540
 ip_finish_output2+0x169f/0x2550 linux/net/ipv4/ip_output.c:235
 __ip_finish_output linux/net/ipv4/ip_output.c:313
 __ip_finish_output+0x49e/0x950 linux/net/ipv4/ip_output.c:295
 ip_finish_output+0x31/0x310 linux/net/ipv4/ip_output.c:323
 NF_HOOK_COND linux/./include/linux/netfilter.h:303
 ip_output+0x13b/0x2a0 linux/net/ipv4/ip_output.c:433
 dst_output linux/./include/net/dst.h:451
 ip_local_out linux/net/ipv4/ip_output.c:129
 ip_send_skb+0x3e5/0x560 linux/net/ipv4/ip_output.c:1492
 udp_send_skb+0x73f/0x1530 linux/net/ipv4/udp.c:963
 udp_sendmsg+0x1a36/0x2b40 linux/net/ipv4/udp.c:1250
 inet_sendmsg+0x105/0x140 linux/net/ipv4/af_inet.c:850
 sock_sendmsg_nosec linux/net/socket.c:730
 __sock_sendmsg linux/net/socket.c:745
 __sys_sendto+0x42c/0x4e0 linux/net/socket.c:2191
 __do_sys_sendto linux/net/socket.c:2203
 __se_sys_sendto linux/net/socket.c:2199
 __x64_sys_sendto+0xe0/0x1c0 linux/net/socket.c:2199
 do_syscall_x64 linux/arch/x86/entry/common.c:52
 do_syscall_64+0xd8/0x270 linux/arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x6f/0x77 linux/arch/x86/entry/entry_64.S:120
RIP: 0033:0x7f3434974f29
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48
89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 8b 0d 37 8f 0d 00 f7 d8 64 89 01 48
RSP: 002b:00007fff9154f2b8 EFLAGS: 00000212 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3434974f29
RDX: 00000000000032c8 RSI: 00007fff9154f300 RDI: 0000000000000003
RBP: 00007fff915532e0 R08: 00007fff91553360 R09: 0000000000000010
R10: 0000000000000000 R11: 0000000000000212 R12: 000055ed86d261d0
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

In the critical scenario, either the relevant skb is freed or its
ownership is transferred into a frag_lists. In both cases, the cleanup
code must not free it again: we need to clear the skb reference earlier.

Fixes: 1149557d64 ("tipc: eliminate unnecessary linearization of incoming buffers")
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-23852
Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/752f1ccf762223d109845365d07f55414058e5a3.1714484273.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01 18:39:29 -07:00
Kuniyuki Iwashima bf4ea58874 arp: Convert ioctl(SIOCGARP) to RCU.
ioctl(SIOCGARP) holds rtnl_lock() to get netdev by __dev_get_by_name()
and copy dev->name safely and calls neigh_lookup() later, which looks
up a neighbour entry under RCU.

Let's replace __dev_get_by_name() with dev_get_by_name_rcu() and strscpy()
with netdev_copy_name() to avoid locking rtnl_lock().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240430015813.71143-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01 18:37:07 -07:00
Kuniyuki Iwashima 0840556e5a net: Protect dev->name by seqlock.
We will convert ioctl(SIOCGARP) to RCU, and then we need to copy
dev->name which is currently protected by rtnl_lock().

This patch does the following:

  1) Add seqlock netdev_rename_lock to protect dev->name

  2) Add netdev_copy_name() that copies dev->name to buffer
     under netdev_rename_lock

  3) Use netdev_copy_name() in netdev_get_name() and drop
     devnet_rename_sem

Suggested-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/netdev/CANn89iJEWs7AYSJqGCUABeVqOCTkErponfZdT5kV-iD=-SajnQ@mail.gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240430015813.71143-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01 18:37:07 -07:00
Kuniyuki Iwashima a428bfc77a arp: Get dev after calling arp_req_(delete|set|get)().
arp_ioctl() holds rtnl_lock() first regardless of cmd (SIOCDARP,
SIOCSARP, and SIOCGARP) to get net_device by __dev_get_by_name()
and copy dev->name safely.

In the SIOCGARP path, arp_req_get() calls neigh_lookup(), which
looks up a neighbour entry under RCU.

We will extend the RCU section not to take rtnl_lock() and instead
use dev_get_by_name_rcu() for SIOCGARP.

As a preparation, let's move __dev_get_by_name() into another
function and call it from arp_req_delete(), arp_req_set(), and
arp_req_get().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240430015813.71143-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01 18:37:07 -07:00
Kuniyuki Iwashima 51e9ba48d4 arp: Remove a nest in arp_req_get().
This is a prep patch to make the following changes tidy.

No functional change intended.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240430015813.71143-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01 18:37:07 -07:00
Kuniyuki Iwashima f8696133f6 arp: Factorise ip_route_output() call in arp_req_set() and arp_req_delete().
When ioctl(SIOCDARP/SIOCSARP) is issued for non-proxy entry (no ATF_COM)
without arpreq.arp_dev[] set, arp_req_set() and arp_req_delete() looks up
dev based on IPv4 address by ip_route_output().

Let's factorise the same code as arp_req_dev().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240430015813.71143-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01 18:37:06 -07:00
Kuniyuki Iwashima 0592367424 arp: Validate netmask earlier for SIOCDARP and SIOCSARP in arp_ioctl().
When ioctl(SIOCDARP/SIOCSARP) is issued with ATF_PUBL, r.arp_netmask
must be 0.0.0.0 or 255.255.255.255.

Currently, the netmask is validated in arp_req_delete_public() or
arp_req_set_public() under rtnl_lock().

We have ATF_NETMASK test in arp_ioctl() before holding rtnl_lock(),
so let's move the netmask validation there.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240430015813.71143-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01 18:37:06 -07:00
Kuniyuki Iwashima 42033d0cfc arp: Move ATF_COM setting in arp_req_set().
In arp_req_set(), if ATF_PERM is set in arpreq.arp_flags,
ATF_COM is set automatically.

The flag will be used later for neigh_update() only when
a neighbour entry is found.

Let's set ATF_COM just before calling neigh_update().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240430015813.71143-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01 18:37:06 -07:00
Jeffrey Altman 8953285d7b rxrpc: Clients must accept conn from any address
The find connection logic of Transarc's Rx was modified in the mid-1990s
to support multi-homed servers which might send a response packet from
an address other than the destination address in the received packet.
The rules for accepting a packet by an Rx initiator (RX_CLIENT_CONNECTION)
were altered to permit acceptance of a packet from any address provided
that the port number was unchanged and all of the connection identifiers
matched (Epoch, CID, SecurityClass, ...).

This change applies the same rules to the Linux implementation which makes
it consistent with IBM AFS 3.6, Arla, OpenAFS and AuriStorFS.

Fixes: 17926a7932 ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
Signed-off-by: Jeffrey Altman <jaltman@auristor.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Link: https://lore.kernel.org/r/20240419163057.4141728-1-marc.dionne@auristor.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01 14:49:56 -07:00
David Howells 5fb70e7275 netfs, 9p: Implement helpers for new write code
Implement the helpers for the new write code in 9p.  There's now an
optional ->prepare_write() that allows the filesystem to set the parameters
for the next write, such as maximum size and maximum segment count, and an
->issue_write() that is called to initiate an (asynchronous) write
operation.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: v9fs@lists.linux.dev
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
2024-05-01 18:07:37 +01:00
Eric Dumazet fff6e6accd ipv6: anycast: use call_rcu_hurry() in aca_put()
This is a followup of commit b5327b9a30 ("ipv6: use
call_rcu_hurry() in fib6_info_release()").

I had another pmtu.sh failure, and found another lazy
call_rcu() causing this failure.

aca_free_rcu() calls fib6_info_release() which releases
devices references.

We must not delay it too much or risk unregister_netdevice/ref_tracker
traces because references to netdev are not released in time.

This should speedup device/netns dismantles when CONFIG_RCU_LAZY=y

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-01 11:46:21 +01:00
Felix Fietkau d091e579b8 net: core: reject skb_copy(_expand) for fraglist GSO skbs
SKB_GSO_FRAGLIST skbs must not be linearized, otherwise they become
invalid. Return NULL if such an skb is passed to skb_copy or
skb_copy_expand, in order to prevent a crash on a potential later
call to skb_gso_segment.

Fixes: 3a1296a38d ("net: Support GRO/GSO fraglist chaining.")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-01 11:44:10 +01:00
Felix Fietkau 59c878cbcd net: bridge: fix multicast-to-unicast with fraglist GSO
Calling skb_copy on a SKB_GSO_FRAGLIST skb is not valid, since it returns
an invalid linearized skb. This code only needs to change the ethernet
header, so pskb_copy is the right function to call here.

Fixes: 6db6f0eae6 ("bridge: multicast to unicast")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-01 11:44:10 +01:00
Antony Antony 451b509678 xfrm: Restrict SA direction attribute to specific netlink message types
Reject the usage of the SA_DIR attribute in xfrm netlink messages when
it's not applicable. This ensures that SA_DIR is only accepted for
certain message types (NEWSA, UPDSA, and ALLOCSPI)

Signed-off-by: Antony Antony <antony.antony@secunet.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-05-01 10:06:59 +02:00
Antony Antony 304b44f0d5 xfrm: Add dir validation to "in" data path lookup
Introduces validation for the x->dir attribute within the XFRM input
data lookup path. If the configured direction does not match the
expected direction, input, increment the XfrmInStateDirError counter
and drop the packet to ensure data integrity and correct flow handling.

grep -vw 0 /proc/net/xfrm_stat
XfrmInStateDirError     	1

Signed-off-by: Antony Antony <antony.antony@secunet.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-05-01 10:06:27 +02:00
Antony Antony 601a0867f8 xfrm: Add dir validation to "out" data path lookup
Introduces validation for the x->dir attribute within the XFRM output
data lookup path. If the configured direction does not match the expected
direction, output, increment the XfrmOutStateDirError counter and drop
the packet to ensure data integrity and correct flow handling.

grep -vw 0 /proc/net/xfrm_stat
XfrmOutPolError         	1
XfrmOutStateDirError    	1

Signed-off-by: Antony Antony <antony.antony@secunet.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-05-01 10:05:52 +02:00
Antony Antony a4a87fa4e9 xfrm: Add Direction to the SA in or out
This patch introduces the 'dir' attribute, 'in' or 'out', to the
xfrm_state, SA, enhancing usability by delineating the scope of values
based on direction. An input SA will restrict values pertinent to input,
effectively segregating them from output-related values.
And an output SA will restrict attributes for output. This change aims
to streamline the configuration process and improve the overall
consistency of SA attributes during configuration.

This feature sets the groundwork for future patches, including
the upcoming IP-TFS patch.

Signed-off-by: Antony Antony <antony.antony@secunet.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-05-01 10:05:11 +02:00
Breno Leitao c2e6a872bd netpoll: Fix race condition in netpoll_owner_active
KCSAN detected a race condition in netpoll:

	BUG: KCSAN: data-race in net_rx_action / netpoll_send_skb
	write (marked) to 0xffff8881164168b0 of 4 bytes by interrupt on cpu 10:
	net_rx_action (./include/linux/netpoll.h:90 net/core/dev.c:6712 net/core/dev.c:6822)
<snip>
	read to 0xffff8881164168b0 of 4 bytes by task 1 on cpu 2:
	netpoll_send_skb (net/core/netpoll.c:319 net/core/netpoll.c:345 net/core/netpoll.c:393)
	netpoll_send_udp (net/core/netpoll.c:?)
<snip>
	value changed: 0x0000000a -> 0xffffffff

This happens because netpoll_owner_active() needs to check if the
current CPU is the owner of the lock, touching napi->poll_owner
non atomically. The ->poll_owner field contains the current CPU holding
the lock.

Use an atomic read to check if the poll owner is the current CPU.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240429100437.3487432-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30 19:03:47 -07:00
Paolo Abeni fb7a0d3348 mptcp: ensure snd_nxt is properly initialized on connect
Christoph reported a splat hinting at a corrupted snd_una:

  WARNING: CPU: 1 PID: 38 at net/mptcp/protocol.c:1005 __mptcp_clean_una+0x4b3/0x620 net/mptcp/protocol.c:1005
  Modules linked in:
  CPU: 1 PID: 38 Comm: kworker/1:1 Not tainted 6.9.0-rc1-gbbeac67456c9 #59
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
  Workqueue: events mptcp_worker
  RIP: 0010:__mptcp_clean_una+0x4b3/0x620 net/mptcp/protocol.c:1005
  Code: be 06 01 00 00 bf 06 01 00 00 e8 a8 12 e7 fe e9 00 fe ff ff e8
  	8e 1a e7 fe 0f b7 ab 3e 02 00 00 e9 d3 fd ff ff e8 7d 1a e7 fe
  	<0f> 0b 4c 8b bb e0 05 00 00 e9 74 fc ff ff e8 6a 1a e7 fe 0f 0b e9
  RSP: 0018:ffffc9000013fd48 EFLAGS: 00010293
  RAX: 0000000000000000 RBX: ffff8881029bd280 RCX: ffffffff82382fe4
  RDX: ffff8881003cbd00 RSI: ffffffff823833c3 RDI: 0000000000000001
  RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
  R10: 0000000000000000 R11: fefefefefefefeff R12: ffff888138ba8000
  R13: 0000000000000106 R14: ffff8881029bd908 R15: ffff888126560000
  FS:  0000000000000000(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f604a5dae38 CR3: 0000000101dac002 CR4: 0000000000170ef0
  Call Trace:
   <TASK>
   __mptcp_clean_una_wakeup net/mptcp/protocol.c:1055 [inline]
   mptcp_clean_una_wakeup net/mptcp/protocol.c:1062 [inline]
   __mptcp_retrans+0x7f/0x7e0 net/mptcp/protocol.c:2615
   mptcp_worker+0x434/0x740 net/mptcp/protocol.c:2767
   process_one_work+0x1e0/0x560 kernel/workqueue.c:3254
   process_scheduled_works kernel/workqueue.c:3335 [inline]
   worker_thread+0x3c7/0x640 kernel/workqueue.c:3416
   kthread+0x121/0x170 kernel/kthread.c:388
   ret_from_fork+0x44/0x50 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
   </TASK>

When fallback to TCP happens early on a client socket, snd_nxt
is not yet initialized and any incoming ack will copy such value
into snd_una. If the mptcp worker (dumbly) tries mptcp-level
re-injection after such ack, that would unconditionally trigger a send
buffer cleanup using 'bad' snd_una values.

We could easily disable re-injection for fallback sockets, but such
dumb behavior already helped catching a few subtle issues and a very
low to zero impact in practice.

Instead address the issue always initializing snd_nxt (and write_seq,
for consistency) at connect time.

Fixes: 8fd738049a ("mptcp: fallback in case of simultaneous connect")
Cc: stable@vger.kernel.org
Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/485
Tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20240429-upstream-net-20240429-mptcp-snd_nxt-init-connect-v1-1-59ceac0a7dcb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30 18:48:28 -07:00
Eric Dumazet c204fef97e net: move sysctl_mem_pcpu_rsv to net_hotdata
sysctl_mem_pcpu_rsv is used in TCP fast path,
move it to net_hodata for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240429134025.1233626-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30 18:46:52 -07:00
Eric Dumazet f3d93817fb net: add <net/proto_memory.h>
Move some proto memory definitions out of <net/sock.h>

Very few files need them, and following patch
will include <net/hotdata.h> from <net/proto_memory.h>

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240429134025.1233626-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30 18:46:52 -07:00
Eric Dumazet dda4d96acb tcp: move tcp_out_of_memory() to net/ipv4/tcp.c
tcp_out_of_memory() has a single caller: tcp_check_oom().

Following patch will also make sk_memory_allocated()
not anymore visible from <net/sock.h> and <net/tcp.h>

Add const qualifier to sock argument of tcp_out_of_memory()
and tcp_check_oom().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240429134025.1233626-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30 18:46:52 -07:00
Eric Dumazet d480dc76d9 net: move sysctl_skb_defer_max to net_hotdata
sysctl_skb_defer_max is used in TCP fast path,
move it to net_hodata.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240429134025.1233626-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30 18:46:52 -07:00
Eric Dumazet a86a0661b8 net: move sysctl_max_skb_frags to net_hotdata
sysctl_max_skb_frags is used in TCP and MPTCP fast paths,
move it to net_hodata for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240429134025.1233626-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30 18:46:52 -07:00
Eric Dumazet 05d6d49209 inet: introduce dst_rtable() helper
I added dst_rt6_info() in commit
e8dfd42c17 ("ipv6: introduce dst_rt6_info() helper")

This patch does a similar change for IPv4.

Instead of (struct rtable *)dst casts, we can use :

 #define dst_rtable(_ptr) \
             container_of_const(_ptr, struct rtable, dst)

Patch is smaller than IPv6 one, because IPv4 has skb_rtable() helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/20240429133009.1227754-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30 18:32:38 -07:00
Yunsheng Lin 9f8eeea164 rxrpc: Fix using alignmask being zero for __page_frag_alloc_align()
rxrpc_alloc_data_txbuf() may be called with data_align being
zero in none_alloc_txbuf() and rxkad_alloc_txbuf(), data_align
is supposed to be an order-based alignment value, but zero is
not a valid order-based alignment value, and '~(data_align - 1)'
doesn't result in a valid mask-based alignment value for
__page_frag_alloc_align().

Fix it by passing a valid order-based alignment value in
none_alloc_txbuf() and rxkad_alloc_txbuf().

Also use page_frag_alloc_align() expecting an order-based
alignment value in rxrpc_alloc_data_txbuf() to avoid doing the
alignment converting operation and to catch possible invalid
alignment value in the future. Remove the 'if (data_align)'
checking too, as it is always true for a valid order-based
alignment value.

Fixes: 6b2536462f ("rxrpc: Fix use of changed alignment param to page_frag_alloc_align()")
Fixes: 49489bb03a ("rxrpc: Do zerocopy using MSG_SPLICE_PAGES and page frags")
CC: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240428111640.27306-1-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30 18:30:43 -07:00
Jakub Kicinski 12b6c3a038 net: page_pool: support error injection
Because of caching / recycling using the general page allocation
failures to induce errors in page pool allocation is very hard.
Add direct error injection support to page_pool_alloc_pages().

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240429144426.743476-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30 08:15:31 -07:00
Wen Gu c3a910f238 net/smc: implement DMB-merged operations of loopback-ism
This implements operations related to merging sndbuf with peer DMB in
loopback-ism. The DMB won't be freed until no sndbuf is attached to it.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30 13:24:49 +02:00
Wen Gu cc0ab806fc net/smc: adapt cursor update when sndbuf and peer DMB are merged
If the local sndbuf shares the same physical memory with peer DMB,
the cursor update processing needs to be adapted to ensure that the
data to be consumed won't be overwritten.

So in this case, the fin_curs and sndbuf_space that were originally
updated after sending the CDC message should be modified to not be
update until the peer updates cons_curs.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30 13:24:48 +02:00
Wen Gu ae2be35cbe net/smc: {at|de}tach sndbuf to peer DMB if supported
If the device used by SMC-D supports merging local sndbuf to peer DMB,
then create sndbuf descriptor and attach it to peer DMB once peer
token is obtained, and detach and free the sndbuf descriptor when the
connection is freed.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30 13:24:48 +02:00
Wen Gu 4398888268 net/smc: add operations to merge sndbuf with peer DMB
In some scenarios using Emulated-ISM device, sndbuf can share the same
physical memory region with peer DMB to avoid data copy from one side
to the other. In such case the sndbuf is only a descriptor that
describes the shared memory and does not actually occupy memory, it's
more like a ghost buffer.

      +----------+                     +----------+
      | socket A |                     | socket B |
      +----------+                     +----------+
            |                               |
       +--------+                       +--------+
       | sndbuf |                       |  DMB   |
       |  desc  |                       |  desc  |
       +--------+                       +--------+
            |                               |
            |                          +----v-----+
            +-------------------------->  memory  |
                                       +----------+

So here introduces three new SMC-D device operations to check if this
feature is supported by device, and to {attach|detach} ghost sndbuf to
peer DMB. For now only loopback-ism supports this.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30 13:24:48 +02:00
Wen Gu 04791343d8 net/smc: register loopback-ism into SMC-D device list
After the loopback-ism device is ready, add it to the SMC-D device list
as an ISMv2 device, and always keep it at the beginning to ensure it is
preferred for providing a shortcut for data transfer within the same
kernel.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30 13:24:48 +02:00
Wen Gu c8df2d449f net/smc: ignore loopback-ism when dumping SMC-D devices
Since loopback-ism is not a PCI device, the PCI information fed back by
smc_nl_handle_smcd_dev() does not apply to loopback-ism. So currently
ignore loopback-ism when dumping SMC-D devices. The netlink function of
loopback-ism will be refactored when SMC netlink interface is updated.

Link: https://lore.kernel.org/r/caab067b-f5c3-490f-9259-262624c236b4@linux.ibm.com/
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30 13:24:48 +02:00
Wen Gu d1d8d0b6c7 net/smc: mark optional smcd_ops and check for support when called
Some operations are not supported by new introduced Emulated-ISM, so
mark them as optional and check if the device supports them when called.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30 13:24:48 +02:00
Wen Gu f7a22071db net/smc: implement DMB-related operations of loopback-ism
This implements DMB (un)registration and data move operations of
loopback-ism device.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30 13:24:48 +02:00
Wen Gu 45783ee85b net/smc: implement ID-related operations of loopback-ism
This implements operations related to IDs for the loopback-ism device.
loopback-ism uses an Extended GID that is a 128-bit GID instead of the
existing ISM 64-bit GID, and uses the CHID defined with the reserved
value 0xFFFF.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30 13:24:48 +02:00
Wen Gu 46ac64419d net/smc: introduce loopback-ism for SMC intra-OS shortcut
This introduces a kind of Emulated-ISM device named loopback-ism for
SMCv2.1. The loopback-ism device is currently exclusive for SMC usage,
and aims to provide an SMC shortcut for sockets within the same kernel,
leading to improved intra-OS traffic performance. Configuration of this
feature is managed through the config SMC_LO.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30 13:24:48 +02:00
Wen Gu 784c46f546 net/smc: decouple ism_client from SMC-D DMB registration
The struct 'ism_client' is specialized for s390 platform firmware ISM.
So replace it with 'void' to make SMCD DMB registration helper generic
for both Emulated-ISM and existing ISM.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30 13:24:48 +02:00
Erick Archer e5c5f3596d sctp: prefer struct_size over open coded arithmetic
This is an effort to get rid of all multiplications from allocation
functions in order to prevent integer overflows [1][2].

As the "ids" variable is a pointer to "struct sctp_assoc_ids" and this
structure ends in a flexible array:

struct sctp_assoc_ids {
	[...]
	sctp_assoc_t	gaids_assoc_id[];
};

the preferred way in the kernel is to use the struct_size() helper to
do the arithmetic instead of the calculation "size + size * count" in
the kmalloc() function.

Also, refactor the code adding the "ids_size" variable to avoid sizing
twice.

This way, the code is more readable and safer.

This code was detected with the help of Coccinelle, and audited and
modified manually.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [1]
Link: https://github.com/KSPP/linux/issues/160 [2]
Signed-off-by: Erick Archer <erick.archer@outlook.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/PAXPR02MB724871DB78375AB06B5171C88B152@PAXPR02MB7248.eurprd02.prod.outlook.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30 11:39:03 +02:00
Xuan Zhuo 0cfe71f45f netdev: add queue stats
These stats are commonly. Support reporting those via netdev-genl queue
stats.

name: rx-hw-drops
name: rx-hw-drop-overruns
name: rx-csum-unnecessary
name: rx-csum-none
name: rx-csum-bad
name: rx-hw-gro-packets
name: rx-hw-gro-bytes
name: rx-hw-gro-wire-packets
name: rx-hw-gro-wire-bytes
name: rx-hw-drop-ratelimits
name: tx-hw-drops
name: tx-hw-drop-errors
name: tx-csum-none
name: tx-needs-csum
name: tx-hw-gso-packets
name: tx-hw-gso-bytes
name: tx-hw-gso-wire-packets
name: tx-hw-gso-wire-bytes
name: tx-hw-drop-ratelimits

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30 10:51:33 +02:00
Eric Dumazet 3c668cef61 net: hsr: init prune_proxy_timer sooner
We must initialize prune_proxy_timer before we attempt
a del_timer_sync() on it.

syzbot reported the following splat:

INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 1 PID: 11 Comm: kworker/u8:1 Not tainted 6.9.0-rc5-syzkaller-01199-gfc48de77d69d #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
Workqueue: netns cleanup_net
Call Trace:
 <TASK>
  __dump_stack lib/dump_stack.c:88 [inline]
  dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
  assign_lock_key+0x238/0x270 kernel/locking/lockdep.c:976
  register_lock_class+0x1cf/0x980 kernel/locking/lockdep.c:1289
  __lock_acquire+0xda/0x1fd0 kernel/locking/lockdep.c:5014
  lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
  __timer_delete_sync+0x148/0x310 kernel/time/timer.c:1648
  del_timer_sync include/linux/timer.h:185 [inline]
  hsr_dellink+0x33/0x80 net/hsr/hsr_netlink.c:132
  default_device_exit_batch+0x956/0xa90 net/core/dev.c:11737
  ops_exit_list net/core/net_namespace.c:175 [inline]
  cleanup_net+0x89d/0xcc0 net/core/net_namespace.c:637
  process_one_work kernel/workqueue.c:3254 [inline]
  process_scheduled_works+0xa10/0x17c0 kernel/workqueue.c:3335
  worker_thread+0x86d/0xd70 kernel/workqueue.c:3416
  kthread+0x2f0/0x390 kernel/kthread.c:388
  ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
 </TASK>
ODEBUG: assert_init not available (active state 0) object: ffff88806d3fcd88 object type: timer_list hint: 0x0
 WARNING: CPU: 1 PID: 11 at lib/debugobjects.c:517 debug_print_object+0x17a/0x1f0 lib/debugobjects.c:514

Fixes: 5055cccfc2 ("net: hsr: Provide RedBox support (HSR-SAN)")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lukasz Majewski <lukma@denx.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240426163355.2613767-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-29 19:02:53 -07:00
Jakub Kicinski 89de2db193 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZi9+AAAKCRDbK58LschI
 g0nEAP487m7L0nLVriC2oIOWsi29tklW3etm6DO7gmGRGIHgrgEAnMyV1xBj3bGj
 v6jJwDcybCym1hLx+1x1JCZ4eoAFswE=
 =xbna
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-04-29

We've added 147 non-merge commits during the last 32 day(s) which contain
a total of 158 files changed, 9400 insertions(+), 2213 deletions(-).

The main changes are:

1) Add an internal-only BPF per-CPU instruction for resolving per-CPU
   memory addresses and implement support in x86 BPF JIT. This allows
   inlining per-CPU array and hashmap lookups
   and the bpf_get_smp_processor_id() helper, from Andrii Nakryiko.

2) Add BPF link support for sk_msg and sk_skb programs, from Yonghong Song.

3) Optimize x86 BPF JIT's emit_mov_imm64, and add support for various
   atomics in bpf_arena which can be JITed as a single x86 instruction,
   from Alexei Starovoitov.

4) Add support for passing mark with bpf_fib_lookup helper,
   from Anton Protopopov.

5) Add a new bpf_wq API for deferring events and refactor sleepable
   bpf_timer code to keep common code where possible,
   from Benjamin Tissoires.

6) Fix BPF_PROG_TEST_RUN infra with regards to bpf_dummy_struct_ops programs
   to check when NULL is passed for non-NULLable parameters,
   from Eduard Zingerman.

7) Harden the BPF verifier's and/or/xor value tracking,
   from Harishankar Vishwanathan.

8) Introduce crypto kfuncs to make BPF programs able to utilize the kernel
   crypto subsystem, from Vadim Fedorenko.

9) Various improvements to the BPF instruction set standardization doc,
   from Dave Thaler.

10) Extend libbpf APIs to partially consume items from the BPF ringbuffer,
    from Andrea Righi.

11) Bigger batch of BPF selftests refactoring to use common network helpers
    and to drop duplicate code, from Geliang Tang.

12) Support bpf_tail_call_static() helper for BPF programs with GCC 13,
    from Jose E. Marchesi.

13) Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF
    program to have code sections where preemption is disabled,
    from Kumar Kartikeya Dwivedi.

14) Allow invoking BPF kfuncs from BPF_PROG_TYPE_SYSCALL programs,
    from David Vernet.

15) Extend the BPF verifier to allow different input maps for a given
    bpf_for_each_map_elem() helper call in a BPF program, from Philo Lu.

16) Add support for PROBE_MEM32 and bpf_addr_space_cast instructions
    for riscv64 and arm64 JITs to enable BPF Arena, from Puranjay Mohan.

17) Shut up a false-positive KMSAN splat in interpreter mode by unpoison
    the stack memory, from Martin KaFai Lau.

18) Improve xsk selftest coverage with new tests on maximum and minimum
    hardware ring size configurations, from Tushar Vyavahare.

19) Various ReST man pages fixes as well as documentation and bash completion
    improvements for bpftool, from Rameez Rehman & Quentin Monnet.

20) Fix libbpf with regards to dumping subsequent char arrays,
    from Quentin Deslandes.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (147 commits)
  bpf, docs: Clarify PC use in instruction-set.rst
  bpf_helpers.h: Define bpf_tail_call_static when building with GCC
  bpf, docs: Add introduction for use in the ISA Internet Draft
  selftests/bpf: extend BPF_SOCK_OPS_RTT_CB test for srtt and mrtt_us
  bpf: add mrtt and srtt as BPF_SOCK_OPS_RTT_CB args
  selftests/bpf: dummy_st_ops should reject 0 for non-nullable params
  bpf: check bpf_dummy_struct_ops program params for test runs
  selftests/bpf: do not pass NULL for non-nullable params in dummy_st_ops
  selftests/bpf: adjust dummy_st_ops_success to detect additional error
  bpf: mark bpf_dummy_struct_ops.test_1 parameter as nullable
  selftests/bpf: Add ring_buffer__consume_n test.
  bpf: Add bpf_guard_preempt() convenience macro
  selftests: bpf: crypto: add benchmark for crypto functions
  selftests: bpf: crypto skcipher algo selftests
  bpf: crypto: add skcipher to bpf crypto
  bpf: make common crypto API for TC/XDP programs
  bpf: update the comment for BTF_FIELDS_MAX
  selftests/bpf: Fix wq test.
  selftests/bpf: Use make_sockaddr in test_sock_addr
  selftests/bpf: Use connect_to_addr in test_sock_addr
  ...
====================

Link: https://lore.kernel.org/r/20240429131657.19423-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-29 13:12:19 -07:00
Linus Torvalds 9e4bc4bcae NFS client bugfixes for Linux 6.9
Bugfixes:
  - Fix an Oops in xs_tcp_tls_setup_socket
  - Fix an Oops due to missing error handling in nfs_net_init()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmYv7PEACgkQZwvnipYK
 APKcgA//XE7iII6qQU9IC6jiv44qO7NtB6Zy3PQFCWO2ssMSqXbc4lO2eCmR/1nA
 3Mlf1RwPxp+M+iOqFgANZV7voQ/r6djEyM1ycr+J2G/mfoxmKMVmnvg3lcyAfYNj
 3fm6n0t8ZCkb3URoO4K0ejw007QfN2zpCL2psKucBdahOX7OYHT45o02liKN8ge5
 0h7XKUjDStKsId4y5UVNB+QUeaQaWzKKMCzTzX4CxfHXZpIjbDjkdJ9WxAue2+Th
 5yvNtPKkTi9EYwHjFAgN7ZKAC7Gu+jZXs9ewdqdyaSfYlioGk7ALz2ZZSptKb5zr
 +nwuR+SC4Yzg5uBdjOLXS7/6Z/CVyp2bmgoFAzrP8cC0zfB7wJMNwUTZbUx3d823
 Q7xYwecj1F9PE5CjHJlYpFZiKkiMHB242EFmBrcUR+yoRPHRgEg32UpI1/YWYnVO
 pG+Hto0O8JnlGzkqslKy/qN7OMFgNTli+nrFnJT7TxDk27GpOY3gv271164TgeUt
 MOk7iY6QjDqk6Zpzbg5AMcq6UhB+QUe76XAAFtDvjXLLBbsRckmbNTBbkFv6/O3p
 a3bOd7oeugNIaRJJaR/lQ/EVAoBSeNUUj5G1ivoXDWsNXE+ZKNNdYyrCociVgI6S
 j31k3XtbGVMk8M/7Lu5slPOGL4xqDtJAddF0eE7buQhRZNX7pPA=
 =OtQc
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:

 - Fix an Oops in xs_tcp_tls_setup_socket

 - Fix an Oops due to missing error handling in nfs_net_init()

* tag 'nfs-for-6.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  nfs: Handle error of rpc_proc_register() in nfs_net_init().
  SUNRPC: add a missing rpc_stat for TCP TLS
2024-04-29 12:07:37 -07:00
Eric Dumazet e8dfd42c17 ipv6: introduce dst_rt6_info() helper
Instead of (struct rt6_info *)dst casts, we can use :

 #define dst_rt6_info(_ptr) \
         container_of_const(_ptr, struct rt6_info, dst)

Some places needed missing const qualifiers :

ip6_confirm_neigh(), ipv6_anycast_destination(),
ipv6_unicast_destination(), has_gateway()

v2: added missing parts (David Ahern)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-29 13:32:01 +01:00
Eric Dumazet 61f5338d62 inet: use call_rcu_hurry() in inet_free_ifa()
This is a followup of commit c4e86b4363 ("net: add two more
call_rcu_hurry()")

Our reference to ifa->ifa_dev must be freed ASAP
to release the reference to the netdev the same way.

inet_rcu_free_ifa()

	in_dev_put()
	 -> in_dev_finish_destroy()
	   -> netdev_put()

This should speedup device/netns dismantles when CONFIG_RCU_LAZY=y

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-29 09:58:58 +01:00
Eric Dumazet cd42ba1c8a net: give more chances to rcu in netdev_wait_allrefs_any()
This came while reviewing commit c4e86b4363 ("net: add two more
call_rcu_hurry()").

Paolo asked if adding one synchronize_rcu() would help.

While synchronize_rcu() does not help, making sure to call
rcu_barrier() before msleep(wait) is definitely helping
to make sure lazy call_rcu() are completed.

Instead of waiting ~100 seconds in my tests, the ref_tracker
splats occurs one time only, and netdev_wait_allrefs_any()
latency is reduced to the strict minimum.

Ideally we should audit our call_rcu() users to make sure
no refcount (or cascading call_rcu()) is held too long,
because rcu_barrier() is quite expensive.

Fixes: 0e4be9e57e ("net: use exponential backoff in netdev_wait_allrefs")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/all/28bbf698-befb-42f6-b561-851c67f464aa@kernel.org/T/#m76d73ed6b03cd930778ac4d20a777f22a08d6824
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-29 09:54:12 +01:00
Jakub Kicinski b2ff42c6d3 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZiwdfQAKCRDbK58LschI
 g1oqAP9mjayeIHCfYMQZa2eevy1PmVlgdNdFdMDWZFS/pHv9cgD/ZdmGzbUDKCAQ
 Y/KiTajitZw3kxtHX45v8/Ugtlsh9Qg=
 =Ewiw
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2024-04-26

We've added 12 non-merge commits during the last 22 day(s) which contain
a total of 14 files changed, 168 insertions(+), 72 deletions(-).

The main changes are:

1) Fix BPF_PROBE_MEM in verifier and JIT to skip loads from vsyscall page,
   from Puranjay Mohan.

2) Fix a crash in XDP with devmap broadcast redirect when the latter map
   is in process of being torn down, from Toke Høiland-Jørgensen.

3) Fix arm64 and riscv64 BPF JITs to properly clear start time for BPF
   program runtime stats, from Xu Kuohai.

4) Fix a sockmap KCSAN-reported data race in sk_psock_skb_ingress_enqueue,
    from Jason Xing.

5) Fix BPF verifier error message in resolve_pseudo_ldimm64,
   from Anton Protopopov.

6) Fix missing DEBUG_INFO_BTF_MODULES Kconfig menu item,
   from Andrii Nakryiko.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Test PROBE_MEM of VSYSCALL_ADDR on x86-64
  bpf, x86: Fix PROBE_MEM runtime load check
  bpf: verifier: prevent userspace memory access
  xdp: use flags field to disambiguate broadcast redirect
  arm32, bpf: Reimplement sign-extension mov instruction
  riscv, bpf: Fix incorrect runtime stats
  bpf, arm64: Fix incorrect runtime stats
  bpf: Fix a verifier verbose message
  bpf, skmsg: Fix NULL pointer dereference in sk_psock_skb_ingress_enqueue
  MAINTAINERS: bpf: Add Lehui and Puranjay as riscv64 reviewers
  MAINTAINERS: Update email address for Puranjay Mohan
  bpf, kconfig: Fix DEBUG_INFO_BTF_MODULES Kconfig definition
====================

Link: https://lore.kernel.org/r/20240426224248.26197-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-26 17:36:53 -07:00
Eric Dumazet 1bede0a12d tcp: fix tcp_grow_skb() vs tstamps
I forgot to call tcp_skb_collapse_tstamp() in the
case we consume the second skb in write queue.

Neal suggested to create a common helper used by tcp_mtu_probe()
and tcp_grow_skb().

Fixes: 8ee602c635 ("tcp: try to send bigger TSO packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20240425193450.411640-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-26 13:55:29 -07:00
Jason Xing b533fb9cf4 rstreason: make it work in trace world
At last, we should let it work by introducing this reset reason in
trace world.

One of the possible expected outputs is:
... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx
state=TCP_ESTABLISHED reason=NOT_SPECIFIED

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-26 15:34:01 +02:00
Jason Xing 215d40248b mptcp: introducing a helper into active reset logic
Since we have mapped every mptcp reset reason definition in enum
sk_rst_reason, introducing a new helper can cover some missing places
where we have already set the subflow->reset_reason.

Note: using SK_RST_REASON_NOT_SPECIFIED is the same as
SK_RST_REASON_MPTCP_RST_EUNSPEC. They are both unknown. So we can convert
it directly.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-26 15:34:00 +02:00
Jason Xing 3e140491dd mptcp: support rstreason for passive reset
It relies on what reset options in the skb are as rfc8684 says. Reusing
this logic can save us much energy. This patch replaces most of the prior
NOT_SPECIFIED reasons.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-26 15:34:00 +02:00
Jason Xing 120391ef9c tcp: support rstreason for passive reset
Reuse the dropreason logic to show the exact reason of tcp reset,
so we can finally display the corresponding item in enum sk_reset_reason
instead of reinventing new reset reasons. This patch replaces all
the prior NOT_SPECIFIED reasons.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-26 15:34:00 +02:00
Jason Xing 5691276b39 rstreason: prepare for active reset
Like what we did to passive reset:
only passing possible reset reason in each active reset path.

No functional changes.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-26 15:34:00 +02:00
Jason Xing 6be49deaa0 rstreason: prepare for passive reset
Adjust the parameter and support passing reason of reset which
is for now NOT_SPECIFIED. No functional changes.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-26 15:34:00 +02:00
David Bauer 42f853b428 net l2tp: drop flow hash on forward
Drop the flow-hash of the skb when forwarding to the L2TP netdev.

This avoids the L2TP qdisc from using the flow-hash from the outer
packet, which is identical for every flow within the tunnel.

This does not affect every platform but is specific for the ethernet
driver. It depends on the platform including L4 information in the
flow-hash.

One such example is the Mediatek Filogic MT798x family of networking
processors.

Fixes: d9e31d17ce ("l2tp: Add L2TP ethernet pseudowire support")
Acked-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David Bauer <mail@david-bauer.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240424171110.13701-1-mail@david-bauer.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-26 13:48:24 +02:00
Kuniyuki Iwashima 4b911a9690 nsh: Restore skb->{protocol,data,mac_header} for outer header in nsh_gso_segment().
syzbot triggered various splats (see [0] and links) by a crafted GSO
packet of VIRTIO_NET_HDR_GSO_UDP layering the following protocols:

  ETH_P_8021AD + ETH_P_NSH + ETH_P_IPV6 + IPPROTO_UDP

NSH can encapsulate IPv4, IPv6, Ethernet, NSH, and MPLS.  As the inner
protocol can be Ethernet, NSH GSO handler, nsh_gso_segment(), calls
skb_mac_gso_segment() to invoke inner protocol GSO handlers.

nsh_gso_segment() does the following for the original skb before
calling skb_mac_gso_segment()

  1. reset skb->network_header
  2. save the original skb->{mac_heaeder,mac_len} in a local variable
  3. pull the NSH header
  4. resets skb->mac_header
  5. set up skb->mac_len and skb->protocol for the inner protocol.

and does the following for the segmented skb

  6. set ntohs(ETH_P_NSH) to skb->protocol
  7. push the NSH header
  8. restore skb->mac_header
  9. set skb->mac_header + mac_len to skb->network_header
 10. restore skb->mac_len

There are two problems in 6-7 and 8-9.

  (a)
  After 6 & 7, skb->data points to the NSH header, so the outer header
  (ETH_P_8021AD in this case) is stripped when skb is sent out of netdev.

  Also, if NSH is encapsulated by NSH + Ethernet (so NSH-Ethernet-NSH),
  skb_pull() in the first nsh_gso_segment() will make skb->data point
  to the middle of the outer NSH or Ethernet header because the Ethernet
  header is not pulled by the second nsh_gso_segment().

  (b)
  While restoring skb->{mac_header,network_header} in 8 & 9,
  nsh_gso_segment() does not assume that the data in the linear
  buffer is shifted.

  However, udp6_ufo_fragment() could shift the data and change
  skb->mac_header accordingly as demonstrated by syzbot.

  If this happens, even the restored skb->mac_header points to
  the middle of the outer header.

It seems nsh_gso_segment() has never worked with outer headers so far.

At the end of nsh_gso_segment(), the outer header must be restored for
the segmented skb, instead of the NSH header.

To do that, let's calculate the outer header position relatively from
the inner header and set skb->{data,mac_header,protocol} properly.

[0]:
BUG: KMSAN: uninit-value in ipvlan_process_outbound drivers/net/ipvlan/ipvlan_core.c:524 [inline]
BUG: KMSAN: uninit-value in ipvlan_xmit_mode_l3 drivers/net/ipvlan/ipvlan_core.c:602 [inline]
BUG: KMSAN: uninit-value in ipvlan_queue_xmit+0xf44/0x16b0 drivers/net/ipvlan/ipvlan_core.c:668
 ipvlan_process_outbound drivers/net/ipvlan/ipvlan_core.c:524 [inline]
 ipvlan_xmit_mode_l3 drivers/net/ipvlan/ipvlan_core.c:602 [inline]
 ipvlan_queue_xmit+0xf44/0x16b0 drivers/net/ipvlan/ipvlan_core.c:668
 ipvlan_start_xmit+0x5c/0x1a0 drivers/net/ipvlan/ipvlan_main.c:222
 __netdev_start_xmit include/linux/netdevice.h:4989 [inline]
 netdev_start_xmit include/linux/netdevice.h:5003 [inline]
 xmit_one net/core/dev.c:3547 [inline]
 dev_hard_start_xmit+0x244/0xa10 net/core/dev.c:3563
 __dev_queue_xmit+0x33ed/0x51c0 net/core/dev.c:4351
 dev_queue_xmit include/linux/netdevice.h:3171 [inline]
 packet_xmit+0x9c/0x6b0 net/packet/af_packet.c:276
 packet_snd net/packet/af_packet.c:3081 [inline]
 packet_sendmsg+0x8aef/0x9f10 net/packet/af_packet.c:3113
 sock_sendmsg_nosec net/socket.c:730 [inline]
 __sock_sendmsg net/socket.c:745 [inline]
 __sys_sendto+0x735/0xa10 net/socket.c:2191
 __do_sys_sendto net/socket.c:2203 [inline]
 __se_sys_sendto net/socket.c:2199 [inline]
 __x64_sys_sendto+0x125/0x1c0 net/socket.c:2199
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x63/0x6b

Uninit was created at:
 slab_post_alloc_hook mm/slub.c:3819 [inline]
 slab_alloc_node mm/slub.c:3860 [inline]
 __do_kmalloc_node mm/slub.c:3980 [inline]
 __kmalloc_node_track_caller+0x705/0x1000 mm/slub.c:4001
 kmalloc_reserve+0x249/0x4a0 net/core/skbuff.c:582
 __alloc_skb+0x352/0x790 net/core/skbuff.c:651
 skb_segment+0x20aa/0x7080 net/core/skbuff.c:4647
 udp6_ufo_fragment+0xcab/0x1150 net/ipv6/udp_offload.c:109
 ipv6_gso_segment+0x14be/0x2ca0 net/ipv6/ip6_offload.c:152
 skb_mac_gso_segment+0x3e8/0x760 net/core/gso.c:53
 nsh_gso_segment+0x6f4/0xf70 net/nsh/nsh.c:108
 skb_mac_gso_segment+0x3e8/0x760 net/core/gso.c:53
 __skb_gso_segment+0x4b0/0x730 net/core/gso.c:124
 skb_gso_segment include/net/gso.h:83 [inline]
 validate_xmit_skb+0x107f/0x1930 net/core/dev.c:3628
 __dev_queue_xmit+0x1f28/0x51c0 net/core/dev.c:4343
 dev_queue_xmit include/linux/netdevice.h:3171 [inline]
 packet_xmit+0x9c/0x6b0 net/packet/af_packet.c:276
 packet_snd net/packet/af_packet.c:3081 [inline]
 packet_sendmsg+0x8aef/0x9f10 net/packet/af_packet.c:3113
 sock_sendmsg_nosec net/socket.c:730 [inline]
 __sock_sendmsg net/socket.c:745 [inline]
 __sys_sendto+0x735/0xa10 net/socket.c:2191
 __do_sys_sendto net/socket.c:2203 [inline]
 __se_sys_sendto net/socket.c:2199 [inline]
 __x64_sys_sendto+0x125/0x1c0 net/socket.c:2199
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x63/0x6b

CPU: 1 PID: 5101 Comm: syz-executor421 Not tainted 6.8.0-rc5-syzkaller-00297-gf2e367d6ad3b #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024

Fixes: c411ed8545 ("nsh: add GSO support")
Reported-and-tested-by: syzbot+42a0dc856239de4de60e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=42a0dc856239de4de60e
Reported-and-tested-by: syzbot+c298c9f0e46a3c86332b@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c298c9f0e46a3c86332b
Link: https://lore.kernel.org/netdev/20240415222041.18537-1-kuniyu@amazon.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240424023549.21862-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-26 12:20:01 +02:00
Lukasz Majewski 5055cccfc2 net: hsr: Provide RedBox support (HSR-SAN)
Introduce RedBox support (HSR-SAN to be more precise) for HSR networks.
Following traffic reduction optimizations have been implemented:
- Do not send HSR supervisory frames to Port C (interlink)
- Do not forward to HSR ring frames addressed to Port C
- Do not forward to Port C frames from HSR ring
- Do not send duplicate HSR frame to HSR ring when destination is Port C

The corresponding patch to modify iptable2 sources has already been sent:
https://lore.kernel.org/netdev/20240308145729.490863-1-lukma@denx.de/T/

Testing procedure (veth and netns):
-----------------------------------
One shall run:
linux-vanila/tools/testing/selftests/net/hsr/hsr_redbox.sh
(Detailed description of the setup one can find in the test
script file).

Testing procedure (real hardware):
----------------------------------
The EVB-KSZ9477 has been used for testing on net-next branch
(SHA1: 5fc68320c1).

Ports 4/5 were used for SW managed HSR (hsr1) as first hsr0 for ports 1/2
(with HW offloading for ksz9477) was created. Port 3 has been used as
interlink port (single USB-ETH dongle).

Configuration - RedBox (EVB-KSZ9477):
if link set lan1 down;ip link set lan2 down
ip link add name hsr0 type hsr slave1 lan1 slave2 lan2 supervision 45 version 1
ip link add name hsr1 type hsr slave1 lan4 slave2 lan5 interlink lan3 supervision 45 version 1
ip link set lan4 up;ip link set lan5 up
ip link set lan3 up
ip addr add 192.168.0.11/24 dev hsr1
ip link set hsr1 up

Configuration - DAN-H (EVB-KSZ9477):

ip link set lan1 down;ip link set lan2 down
ip link add name hsr0 type hsr slave1 lan1 slave2 lan2 supervision 45 version 1
ip link add name hsr1 type hsr slave1 lan4 slave2 lan5 supervision 45 version 1
ip link set lan4 up;ip link set lan5 up
ip addr add 192.168.0.12/24 dev hsr1
ip link set hsr1 up

This approach uses only SW based HSR devices (hsr1).

--------------          -----------------       ------------
DAN-H  Port5 | <------> | Port5         |       |
       Port4 | <------> | Port4   Port3 | <---> | PC
             |          | (RedBox)      |       | (USB-ETH)
EVB-KSZ9477  |          | EVB-KSZ9477   |       |
--------------          -----------------       ------------

Signed-off-by: Lukasz Majewski <lukma@denx.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-26 12:04:43 +02:00
Davide Caratti af0cb3fa3f net/sched: fix false lockdep warning on qdisc root lock
Xiumei and Christoph reported the following lockdep splat, complaining of
the qdisc root lock being taken twice:

 ============================================
 WARNING: possible recursive locking detected
 6.7.0-rc3+ #598 Not tainted
 --------------------------------------------
 swapper/2/0 is trying to acquire lock:
 ffff888177190110 (&sch->q.lock){+.-.}-{2:2}, at: __dev_queue_xmit+0x1560/0x2e70

 but task is already holding lock:
 ffff88811995a110 (&sch->q.lock){+.-.}-{2:2}, at: __dev_queue_xmit+0x1560/0x2e70

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&sch->q.lock);
   lock(&sch->q.lock);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 5 locks held by swapper/2/0:
  #0: ffff888135a09d98 ((&in_dev->mr_ifc_timer)){+.-.}-{0:0}, at: call_timer_fn+0x11a/0x510
  #1: ffffffffaaee5260 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x2c0/0x1ed0
  #2: ffffffffaaee5200 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x209/0x2e70
  #3: ffff88811995a110 (&sch->q.lock){+.-.}-{2:2}, at: __dev_queue_xmit+0x1560/0x2e70
  #4: ffffffffaaee5200 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x209/0x2e70

 stack backtrace:
 CPU: 2 PID: 0 Comm: swapper/2 Not tainted 6.7.0-rc3+ #598
 Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7353+9de0a3cc 04/01/2014
 Call Trace:
  <IRQ>
  dump_stack_lvl+0x4a/0x80
  __lock_acquire+0xfdd/0x3150
  lock_acquire+0x1ca/0x540
  _raw_spin_lock+0x34/0x80
  __dev_queue_xmit+0x1560/0x2e70
  tcf_mirred_act+0x82e/0x1260 [act_mirred]
  tcf_action_exec+0x161/0x480
  tcf_classify+0x689/0x1170
  prio_enqueue+0x316/0x660 [sch_prio]
  dev_qdisc_enqueue+0x46/0x220
  __dev_queue_xmit+0x1615/0x2e70
  ip_finish_output2+0x1218/0x1ed0
  __ip_finish_output+0x8b3/0x1350
  ip_output+0x163/0x4e0
  igmp_ifc_timer_expire+0x44b/0x930
  call_timer_fn+0x1a2/0x510
  run_timer_softirq+0x54d/0x11a0
  __do_softirq+0x1b3/0x88f
  irq_exit_rcu+0x18f/0x1e0
  sysvec_apic_timer_interrupt+0x6f/0x90
  </IRQ>

This happens when TC does a mirred egress redirect from the root qdisc of
device A to the root qdisc of device B. As long as these two locks aren't
protecting the same qdisc, they can be acquired in chain: add a per-qdisc
lockdep key to silence false warnings.
This dynamic key should safely replace the static key we have in sch_htb:
it was added to allow enqueueing to the device "direct qdisc" while still
holding the qdisc root lock.

v2: don't use static keys anymore in HTB direct qdiscs (thanks Eric Dumazet)

CC: Maxim Mikityanskiy <maxim@isovalent.com>
CC: Xiumei Mu <xmu@redhat.com>
Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/451
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/7dc06d6158f72053cf877a82e2a7a5bd23692faa.1713448007.git.dcaratti@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-26 10:46:41 +02:00
Paul Davey 58fbfecab9 xfrm: Preserve vlan tags for transport mode software GRO
The software GRO path for esp transport mode uses skb_mac_header_rebuild
prior to re-injecting the packet via the xfrm_napi_dev.  This only
copies skb->mac_len bytes of header which may not be sufficient if the
packet contains 802.1Q tags or other VLAN tags.  Worse copying only the
initial header will leave a packet marked as being VLAN tagged but
without the corresponding tag leading to mangling when it is later
untagged.

The VLAN tags are important when receiving the decrypted esp transport
mode packet after GRO processing to ensure it is received on the correct
interface.

Therefore record the full mac header length in xfrm*_transport_input for
later use in corresponding xfrm*_transport_finish to copy the entire mac
header when rebuilding the mac header for GRO.  The skb->data pointer is
left pointing skb->mac_header bytes after the start of the mac header as
is expected by the network stack and network and transport header
offsets reset to this location.

Fixes: 7785bba299 ("esp: Add a software GRO codepath")
Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-04-26 06:44:33 +02:00
Suren Baghdasaryan 2c321f3f70 mm: change inlined allocation helpers to account at the call site
Main goal of memory allocation profiling patchset is to provide accounting
that is cheap enough to run in production.  To achieve that we inject
counters using codetags at the allocation call sites to account every time
allocation is made.  This injection allows us to perform accounting
efficiently because injected counters are immediately available as opposed
to the alternative methods, such as using _RET_IP_, which would require
counter lookup and appropriate locking that makes accounting much more
expensive.  This method requires all allocation functions to inject
separate counters at their call sites so that their callers can be
individually accounted.  Counter injection is implemented by allocation
hooks which should wrap all allocation functions.

Inlined functions which perform allocations but do not use allocation
hooks are directly charged for the allocations they perform.  In most
cases these functions are just specialized allocation wrappers used from
multiple places to allocate objects of a specific type.  It would be more
useful to do the accounting at their call sites instead.  Instrument these
helpers to do accounting at the call site.  Simple inlined allocation
wrappers are converted directly into macros.  More complex allocators or
allocators with documentation are converted into _noprof versions and
allocation hooks are added.  This allows memory allocation profiling
mechanism to charge allocations to the callers of these functions.

Link: https://lkml.kernel.org/r/20240415020731.1152108-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Jan Kara <jack@suse.cz>		[jbd2]
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:55:59 -07:00
Jakub Kicinski 1cedb16b94 Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
net: intel: start The Great Code Dedup + Page Pool for iavf

Alexander Lobakin says:

Here's a two-shot: introduce {,Intel} Ethernet common library (libeth and
libie) and switch iavf to Page Pool. Details are in the commit messages;
here's a summary:

Not a secret there's a ton of code duplication between two and more Intel
ethernet modules. Before introducing new changes, which would need to be
copied over again, start decoupling the already existing duplicate
functionality into a new module, which will be shared between several
Intel Ethernet drivers. The first name that came to my mind was
"libie" -- "Intel Ethernet common library". Also this sounds like
"lovelie" (-> one word, no "lib I E" pls) and can be expanded as
"lib Internet Explorer" :P
The "generic", pure-software part is placed separately, so that it can be
easily reused in any driver by any vendor without linking to the Intel
pre-200G guts. In a few words, it's something any modern driver does the
same way, but nobody moved it level up (yet).
The series is only the beginning. From now on, adding every new feature
or doing any good driver refactoring will remove much more lines than add
for quite some time. There's a basic roadmap with some deduplications
planned already, not speaking of that touching every line now asks:
"can I share this?". The final destination is very ambitious: have only
one unified driver for at least i40e, ice, iavf, and idpf with a struct
ops for each generation. That's never gonna happen, right? But you still
can at least try.
PP conversion for iavf lands within the same series as these two are tied
closely. libie will support Page Pool model only, so that a driver can't
use much of the lib until it's converted. iavf is only the example, the
rest will eventually be converted soon on a per-driver basis. That is
when it gets really interesting. Stay tech.

* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  MAINTAINERS: add entry for libeth and libie
  iavf: switch to Page Pool
  iavf: pack iavf_ring more efficiently
  libeth: add Rx buffer management
  page_pool: add DMA-sync-for-CPU inline helper
  page_pool: constify some read-only function arguments
  slab: introduce kvmalloc_array_node() and kvcalloc_node()
  iavf: drop page splitting and recycling
  iavf: kill "legacy-rx" for good
  net: intel: introduce {, Intel} Ethernet common library
====================

Link: https://lore.kernel.org/r/20240424203559.3420468-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-25 20:00:54 -07:00
Eric Dumazet c4e86b4363 net: add two more call_rcu_hurry()
I had failures with pmtu.sh selftests lately,
with netns dismantles firing ref_tracking alerts [1].

After much debugging, I found that some queued
rcu callbacks were delayed by minutes, because
of CONFIG_RCU_LAZY=y option.

Joel Fernandes had a similar issue in the past,
fixed with commit 483c26ff63 ("net: Use call_rcu_hurry()
for dst_release()")

In this commit, I make sure nexthop_free_rcu()
and free_fib_info_rcu() are not delayed too much
because they both can release device references.

tools/testing/selftests/net/pmtu.sh no longer fails.

Traces were:

[  968.179860] ref_tracker: veth_A-R1@00000000d0ff3fe2 has 3/5 users at
                    dst_alloc+0x76/0x160
                    ip6_dst_alloc+0x25/0x80
                    ip6_pol_route+0x2a8/0x450
                    ip6_pol_route_output+0x1f/0x30
                    fib6_rule_lookup+0x163/0x270
                    ip6_route_output_flags+0xda/0x190
                    ip6_dst_lookup_tail.constprop.0+0x1d0/0x260
                    ip6_dst_lookup_flow+0x47/0xa0
                    udp_tunnel6_dst_lookup+0x158/0x210
                    vxlan_xmit_one+0x4c2/0x1550 [vxlan]
                    vxlan_xmit+0x52d/0x14f0 [vxlan]
                    dev_hard_start_xmit+0x7b/0x1e0
                    __dev_queue_xmit+0x20b/0xe40
                    ip6_finish_output2+0x2ea/0x6e0
                    ip6_finish_output+0x143/0x320
                    ip6_output+0x74/0x140

[  968.179860] ref_tracker: veth_A-R1@00000000d0ff3fe2 has 1/5 users at
                    netdev_get_by_index+0xc0/0xe0
                    fib6_nh_init+0x1a9/0xa90
                    rtm_new_nexthop+0x6fa/0x1580
                    rtnetlink_rcv_msg+0x155/0x3e0
                    netlink_rcv_skb+0x61/0x110
                    rtnetlink_rcv+0x19/0x20
                    netlink_unicast+0x23f/0x380
                    netlink_sendmsg+0x1fc/0x430
                    ____sys_sendmsg+0x2ef/0x320
                    ___sys_sendmsg+0x86/0xd0
                    __sys_sendmsg+0x67/0xc0
                    __x64_sys_sendmsg+0x21/0x30
                    x64_sys_call+0x252/0x2030
                    do_syscall_64+0x6c/0x190
                    entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  968.179860] ref_tracker: veth_A-R1@00000000d0ff3fe2 has 1/5 users at
                    ipv6_add_dev+0x136/0x530
                    addrconf_notify+0x19d/0x770
                    notifier_call_chain+0x65/0xd0
                    raw_notifier_call_chain+0x1a/0x20
                    call_netdevice_notifiers_info+0x54/0x90
                    register_netdevice+0x61e/0x790
                    veth_newlink+0x230/0x440
                    __rtnl_newlink+0x7d2/0xaa0
                    rtnl_newlink+0x4c/0x70
                    rtnetlink_rcv_msg+0x155/0x3e0
                    netlink_rcv_skb+0x61/0x110
                    rtnetlink_rcv+0x19/0x20
                    netlink_unicast+0x23f/0x380
                    netlink_sendmsg+0x1fc/0x430
                    ____sys_sendmsg+0x2ef/0x320
                    ___sys_sendmsg+0x86/0xd0
....
[ 1079.316024]  ? show_regs+0x68/0x80
[ 1079.316087]  ? __warn+0x8c/0x140
[ 1079.316103]  ? ref_tracker_free+0x1a0/0x270
[ 1079.316117]  ? report_bug+0x196/0x1c0
[ 1079.316135]  ? handle_bug+0x42/0x80
[ 1079.316149]  ? exc_invalid_op+0x1c/0x70
[ 1079.316162]  ? asm_exc_invalid_op+0x1f/0x30
[ 1079.316193]  ? ref_tracker_free+0x1a0/0x270
[ 1079.316208]  ? _raw_spin_unlock+0x1a/0x40
[ 1079.316222]  ? free_unref_page+0x126/0x1a0
[ 1079.316239]  ? destroy_large_folio+0x69/0x90
[ 1079.316251]  ? __folio_put+0x99/0xd0
[ 1079.316276]  dst_dev_put+0x69/0xd0
[ 1079.316308]  fib6_nh_release_dsts.part.0+0x3d/0x80
[ 1079.316327]  fib6_nh_release+0x45/0x70
[ 1079.316340]  nexthop_free_rcu+0x131/0x170
[ 1079.316356]  rcu_do_batch+0x1ee/0x820
[ 1079.316370]  ? rcu_do_batch+0x179/0x820
[ 1079.316388]  rcu_core+0x1aa/0x4d0
[ 1079.316405]  rcu_core_si+0x12/0x20
[ 1079.316417]  __do_softirq+0x13a/0x3dc
[ 1079.316435]  __irq_exit_rcu+0xa3/0x110
[ 1079.316449]  irq_exit_rcu+0x12/0x30
[ 1079.316462]  sysvec_apic_timer_interrupt+0x5b/0xe0
[ 1079.316474]  asm_sysvec_apic_timer_interrupt+0x1f/0x30
[ 1079.316569] RIP: 0033:0x7f06b65c63f0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240423205408.39632-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-25 15:24:23 -07:00
Philo Lu 48e2cd3e3d bpf: add mrtt and srtt as BPF_SOCK_OPS_RTT_CB args
Two important arguments in RTT estimation, mrtt and srtt, are passed to
tcp_bpf_rtt(), so that bpf programs get more information about RTT
computation in BPF_SOCK_OPS_RTT_CB.

The difference between bpf_sock_ops->srtt_us and the srtt here is: the
former is an old rtt before update, while srtt passed by tcp_bpf_rtt()
is that after update.

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240425161724.73707-2-lulie@linux.alibaba.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-04-25 14:09:05 -07:00
Eduard Zingerman 980ca8ceea bpf: check bpf_dummy_struct_ops program params for test runs
When doing BPF_PROG_TEST_RUN for bpf_dummy_struct_ops programs,
reject execution when NULL is passed for non-nullable params.
For programs with non-nullable params verifier assumes that
such params are never NULL and thus might optimize out NULL checks.

Suggested-by: Kui-Feng Lee <sinquersw@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240424012821.595216-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-25 12:42:43 -07:00
Eduard Zingerman 1479eaff1f bpf: mark bpf_dummy_struct_ops.test_1 parameter as nullable
Test case dummy_st_ops/dummy_init_ret_value passes NULL as the first
parameter of the test_1() function. Mark this parameter as nullable to
make verifier aware of such possibility.
Otherwise, NULL check in the test_1() code:

      SEC("struct_ops/test_1")
      int BPF_PROG(test_1, struct bpf_dummy_ops_state *state)
      {
            if (!state)
                    return ...;

            ... access state ...
      }

Might be removed by verifier, thus triggering NULL pointer dereference
under certain conditions.

Reported-by: Jose E. Marchesi <jemarch@gnu.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240424012821.595216-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-25 12:42:43 -07:00
Jakub Kicinski 2bd87951de Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/ti/icssg/icssg_prueth.c

net/mac80211/chan.c
  89884459a0 ("wifi: mac80211: fix idle calculation with multi-link")
  87f5500285 ("wifi: mac80211: simplify ieee80211_assign_link_chanctx()")
https://lore.kernel.org/all/20240422105623.7b1fbda2@canb.auug.org.au/

net/unix/garbage.c
  1971d13ffa ("af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().")
  4090fa373f ("af_unix: Replace garbage collection algorithm.")

drivers/net/ethernet/ti/icssg/icssg_prueth.c
drivers/net/ethernet/ti/icssg/icssg_common.c
  4dcd0e83ea ("net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()")
  e2dc7bfd67 ("net: ti: icssg-prueth: Move common functions into a separate file")

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-25 12:41:37 -07:00
Eric Dumazet ec00ed472b tcp: avoid premature drops in tcp_add_backlog()
While testing TCP performance with latest trees,
I saw suspect SOCKET_BACKLOG drops.

tcp_add_backlog() computes its limit with :

    limit = (u32)READ_ONCE(sk->sk_rcvbuf) +
            (u32)(READ_ONCE(sk->sk_sndbuf) >> 1);
    limit += 64 * 1024;

This does not take into account that sk->sk_backlog.len
is reset only at the very end of __release_sock().

Both sk->sk_backlog.len and sk->sk_rmem_alloc could reach
sk_rcvbuf in normal conditions.

We should double sk->sk_rcvbuf contribution in the formula
to absorb bubbles in the backlog, which happen more often
for very fast flows.

This change maintains decent protection against abuses.

Fixes: c377411f24 ("net: sk_add_backlog() take rmem_alloc into account")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240423125620.3309458-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-25 12:15:02 -07:00
Jakub Kicinski e6be197f23 wireless-next patches for v6.10
The second "new features" pull request for v6.10 with changes both in
 stack and in drivers. This time the pull request is rather small and
 nothing special standing out except maybe that we have several
 kernel-doc fixes. Great to see that we are getting warning free
 wireless code (until new warnings are added).
 
 Do note that this pull request has a simple conflict in mac80211 with
 net tree, here's an example conflict resolution:
 
 https://lore.kernel.org/all/20240422105623.7b1fbda2@canb.auug.org.au/
 
 Major changes:
 
 rtl8xxxu:
 
 * enable Management Frame Protection (MFP) support
 
 rtw88:
 
 * disable unsupported interface type of mesh point for all chips, and only
   support station mode for SDIO chips.
 -----BEGIN PGP SIGNATURE-----
 
 iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmYo194RHGt2YWxvQGtl
 cm5lbC5vcmcACgkQbhckVSbrbZvS4Qf/VgymhfRmZ3UfqQiL9uu806ol3tBOnhAG
 zGo//wBVVgdOEWZhVtW2aVvPH1pGRnocxWZG8/G0zx4N3YaMtDocV8X3gRp8JpGI
 Gk4Atc7J6Eyyp+Csxz2HvG0BwvkcEt65GwBpE2PmrEukMByS29EzjUppUZISlBKM
 7e3rqKbLMVPueIoKfTImpWGJVdYOCvErbqZakcXV97eQw3wOb/NLIOBAPobrx8oS
 cNdtLhEHwZzAtwYWL7VzU6HBsmBMpSCBl08Pobq8esH82x6yCoHVmyQlb7SS+iZ1
 klffYjoPT0fWUAMjHJdM26KJ77GXOHtphxmTeS5xnrq/+AhPDJmdNg==
 =0Ijq
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2024-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v6.10

The second "new features" pull request for v6.10 with changes both in
stack and in drivers. This time the pull request is rather small and
nothing special standing out except maybe that we have several
kernel-doc fixes. Great to see that we are getting warning free
wireless code (until new warnings are added).

Major changes:

rtl8xxxu:
 * enable Management Frame Protection (MFP) support

rtw88:
 * disable unsupported interface type of mesh point for all chips, and only
   support station mode for SDIO chips.

* tag 'wireless-next-2024-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (63 commits)
  wifi: mac80211: handle link ID during management Tx
  wifi: mac80211: handle sdata->u.ap.active flag with MLO
  wifi: cfg80211: add return docs for regulatory functions
  wifi: cfg80211: make some regulatory functions void
  wifi: mac80211: add return docs for sta_info_flush()
  wifi: mac80211: keep mac80211 consistent on link activation failure
  wifi: mac80211: simplify ieee80211_assign_link_chanctx()
  wifi: mac80211: reserve chanctx during find
  wifi: cfg80211: fix cfg80211 function kernel-doc
  wifi: mac80211_hwsim: Use wider regulatory for custom for 6GHz tests
  wifi: iwlwifi: mvm: Don't allow EMLSR when the RSSI is low
  wifi: iwlwifi: mvm: disable EMLSR when we suspend with wowlan
  wifi: iwlwifi: mvm: get periodic statistics in EMLSR
  wifi: iwlwifi: mvm: don't recompute EMLSR mode in can_activate_links
  wifi: iwlwifi: mvm: implement EMLSR prevention mechanism.
  wifi: iwlwifi: mvm: exit EMLSR upon missed beacon
  wifi: iwlwifi: mvm: init vif works only once
  wifi: iwlwifi: mvm: Add helper functions to update EMLSR status
  wifi: iwlwifi: mvm: Implement new link selection algorithm
  wifi: iwlwifi: mvm: move EMLSR/links code
  ...
====================

Link: https://lore.kernel.org/r/20240424100122.217AEC113CE@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-25 11:49:35 -07:00
Linus Torvalds 52afb15e9d Including fixes from netfilter, wireless and bluetooth.
Nothing major, regression fixes are mostly in drivers, two more
 of those are flowing towards us thru various trees. I wish some of
 the changes went into -rc5, we'll try to keep an eye on frequency
 of PRs from sub-trees.
 
 Also disproportional number of fixes for bugs added in v6.4,
 strange coincidence.
 
 Current release - regressions:
 
  - igc: fix LED-related deadlock on driver unbind
 
  - wifi: mac80211: small fixes to recent clean up of the connection
    process
 
  - Revert "wifi: iwlwifi: bump FW API to 90 for BZ/SC devices",
    kernel doesn't have all the code to deal with that version, yet
 
  - Bluetooth:
    - set power_ctrl_enabled on NULL returned by gpiod_get_optional()
    - qca: fix invalid device address check, again
 
  - eth: ravb: fix registered interrupt names
 
 Current release - new code bugs:
 
  - wifi: mac80211: check EHT/TTLM action frame length
 
 Previous releases - regressions:
 
  - fix sk_memory_allocated_{add|sub} for architectures where
    __this_cpu_{add|sub}* are not IRQ-safe
 
  - dsa: mv88e6xx: fix link setup for 88E6250
 
 Previous releases - always broken:
 
  - ip: validate dev returned from __in_dev_get_rcu(), prevent possible
    null-derefs in a few places
 
  - switch number of for_each_rcu() loops using call_rcu() on the iterator
    to for_each_safe()
 
  - macsec: fix isolation of broadcast traffic in presence of offload
 
  - vxlan: drop packets from invalid source address
 
  - eth: mlxsw: trap and ACL programming fixes
 
  - eth: bnxt: PCIe error recovery fixes, fix counting dropped packets
 
  - Bluetooth:
   - lots of fixes for the command submission rework from v6.4
   - qca: fix NULL-deref on non-serdev suspend
 
 Misc:
 
  - tools: ynl: don't ignore errors in NLMSG_DONE messages
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmYqjvgACgkQMUZtbf5S
 IrvxBA/9HdiiBU/qWdlZ5BorvVFj5XmOiGGD0UagKD2VZCxdLX8S/yfmY3KMoohy
 Dls5c3WxQbJbGsoIMEU6ztE0Iv1YYl1wamTfbyUDwv2ZMKR/vN5uzacB4CS9/FJ0
 vOQO1Y/VWx+uoA1gXRsY8Ffmh2ZMKdwoiKdpdRf/ADgPB8hNQYx78PqTBvKusqBa
 go1mahZbtsYIxLn/oL0xKQRKRZUY1T5T8zQ02i+8MvWBJDyRWCCaOICQus7FBdtz
 JAy5IyztzH0cYXgC0aRTPJkbwqXdpXjSoeOwNElRtUpD98zprDm16jqpSGrwhJoP
 AaWo5+1o908aOd+chhoCqfrEGbraMSRgvCTNMemPxL8cNF4JJfdp1A+v0+cZKlMy
 yjGTKoFZX6GPbOFYPC+rF8Zm6WzDsLcit/r01RTvf1JLf+Jdft72QwQec0rQykEV
 ATrYAQAW/B6zcfOmIXngFuCkO7KM9Yp2BSQNAtYOQR2GKijmALO74suIbNujP3hU
 kn25jnw0Fwzv5RIWluFK+V2AcW8cd1JZMbq8NQzhOXmrHbP4OmaYQrk0vkk8f9b9
 q5BK4C4/JcjCdEBGe38BlPFUx3Jr6xKOcF/DoAnhehwwEpCi5El9S5l7a4+HNBSh
 e1c/1vvcO54m4onXYJ+CH5clQLGs5NU71aqtBeleF5YoDLvwD8g=
 =EQyI
 -----END PGP SIGNATURE-----

Merge tag 'net-6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter, wireless and bluetooth.

  Nothing major, regression fixes are mostly in drivers, two more of
  those are flowing towards us thru various trees. I wish some of the
  changes went into -rc5, we'll try to keep an eye on frequency of PRs
  from sub-trees.

  Also disproportional number of fixes for bugs added in v6.4, strange
  coincidence.

  Current release - regressions:

   - igc: fix LED-related deadlock on driver unbind

   - wifi: mac80211: small fixes to recent clean up of the connection
     process

   - Revert "wifi: iwlwifi: bump FW API to 90 for BZ/SC devices", kernel
     doesn't have all the code to deal with that version, yet

   - Bluetooth:
       - set power_ctrl_enabled on NULL returned by gpiod_get_optional()
       - qca: fix invalid device address check, again

   - eth: ravb: fix registered interrupt names

  Current release - new code bugs:

   - wifi: mac80211: check EHT/TTLM action frame length

  Previous releases - regressions:

   - fix sk_memory_allocated_{add|sub} for architectures where
     __this_cpu_{add|sub}* are not IRQ-safe

   - dsa: mv88e6xx: fix link setup for 88E6250

  Previous releases - always broken:

   - ip: validate dev returned from __in_dev_get_rcu(), prevent possible
     null-derefs in a few places

   - switch number of for_each_rcu() loops using call_rcu() on the
     iterator to for_each_safe()

   - macsec: fix isolation of broadcast traffic in presence of offload

   - vxlan: drop packets from invalid source address

   - eth: mlxsw: trap and ACL programming fixes

   - eth: bnxt: PCIe error recovery fixes, fix counting dropped packets

   - Bluetooth:
       - lots of fixes for the command submission rework from v6.4
       - qca: fix NULL-deref on non-serdev suspend

  Misc:

   - tools: ynl: don't ignore errors in NLMSG_DONE messages"

* tag 'net-6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
  af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().
  net: b44: set pause params only when interface is up
  tls: fix lockless read of strp->msg_ready in ->poll
  dpll: fix dpll_pin_on_pin_register() for multiple parent pins
  net: ravb: Fix registered interrupt names
  octeontx2-af: fix the double free in rvu_npc_freemem()
  net: ethernet: ti: am65-cpts: Fix PTPv1 message type on TX packets
  ice: fix LAG and VF lock dependency in ice_reset_vf()
  iavf: Fix TC config comparison with existing adapter TC config
  i40e: Report MFS in decimal base instead of hex
  i40e: Do not use WQ_MEM_RECLAIM flag for workqueue
  net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()
  net/mlx5e: Advertise mlx5 ethernet driver updates sk_buff md_dst for MACsec
  macsec: Detect if Rx skb is macsec-related for offloading devices that update md_dst
  ethernet: Add helper for assigning packet type when dest address does not match device address
  macsec: Enable devices to advertise whether they update sk_buff md_dst during offloads
  net: phy: dp83869: Fix MII mode failure
  netfilter: nf_tables: honor table dormant flag from netdev release event path
  eth: bnxt: fix counting packets discarded due to OOM and netpoll
  igc: Fix LED-related deadlock on driver unbind
  ...
2024-04-25 11:19:38 -07:00
Jakub Kicinski e8baa63f87 netfilter pull request 24-04-25
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmYqG18ACgkQ1V2XiooU
 IOT2fg/+Ir0uSBi5YldKlCqVGVTEAVoUuvo8yuzuUktYI5s+YpyNptFcNJHgJuP1
 H94qccf4K6yJuyb0dNaBooxkVY4kiPIDs2+XuI6fz9bJNI3kypITfhvUKIkLiKvX
 cwqvAG+v0HZ1CKMD/icCftF/gOK3+MSasPhqz6I0U9xp86shw5ImFwmg0n7rtgmB
 +WxKbGzVSw2f6QLWpYunhZI7HUxnsiR5l3YyqPP4HHh+8e1rNjfolS6yX/4MmrfH
 5TR7MkwjAxiXOy6JsC8TQqEc5hUASY0loKMfrEJjwol2ksmx7OBw8X8ivfv/PnnA
 gfaVzTC5WovHQotFFQ+Z4EKgMDkHZsZbxjsoWA5MPlrxYha/YYo6OzEvvjZYWe2Z
 5kKxSpBAF9IMY/wQfjicpTILhFW6/CjffzFQU6RESau6tn6YcFoTpJozq4Fyq6CX
 XI8vc21l8n/h5Ne03axN/+6FxPuSatYDBrvstcTuf2o1sefw91Ak4TYlERKTiynq
 xmlsq/3PqoTzPLeQcUzyuwKTsJmzKn5qt95NnWbzdo5ZicnrMGMCAxjVr/wyvhnK
 HHqMRG6EcdBH+608XpialmvyQ9/kMEoH2YBMJG4cHkxF/y0OKSXMs9lfNq4cxGLf
 KIWShd13MpgdA64uQNZ80OQulhU9/KKxOC5NGG4cZONmM3bogqw=
 =YK6G
 -----END PGP SIGNATURE-----

Merge tag 'nf-24-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains two Netfilter/IPVS fixes for net:

Patch #1 fixes SCTP checksumming for IPVS with gso packets,
	 from Ismael Luceno.

Patch #2 honor dormant flag from netdev event path to fix a possible
	 double hook unregistration.

* tag 'nf-24-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: honor table dormant flag from netdev release event path
  ipvs: Fix checksumming on GSO of SCTP packets
====================

Link: https://lore.kernel.org/r/20240425090149.1359547-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-25 08:46:53 -07:00
Kuniyuki Iwashima 1971d13ffa af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().
syzbot reported a lockdep splat regarding unix_gc_lock and
unix_state_lock().

One is called from recvmsg() for a connected socket, and another
is called from GC for TCP_LISTEN socket.

So, the splat is false-positive.

Let's add a dedicated lock class for the latter to suppress the splat.

Note that this change is not necessary for net-next.git as the issue
is only applied to the old GC impl.

[0]:
WARNING: possible circular locking dependency detected
6.9.0-rc5-syzkaller-00007-g4d2008430ce8 #0 Not tainted
 -----------------------------------------------------
kworker/u8:1/11 is trying to acquire lock:
ffff88807cea4e70 (&u->lock){+.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
ffff88807cea4e70 (&u->lock){+.+.}-{2:2}, at: __unix_gc+0x40e/0xf70 net/unix/garbage.c:302

but task is already holding lock:
ffffffff8f6ab638 (unix_gc_lock){+.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
ffffffff8f6ab638 (unix_gc_lock){+.+.}-{2:2}, at: __unix_gc+0x117/0xf70 net/unix/garbage.c:261

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

 -> #1 (unix_gc_lock){+.+.}-{2:2}:
       lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
       __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
       _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
       spin_lock include/linux/spinlock.h:351 [inline]
       unix_notinflight+0x13d/0x390 net/unix/garbage.c:140
       unix_detach_fds net/unix/af_unix.c:1819 [inline]
       unix_destruct_scm+0x221/0x350 net/unix/af_unix.c:1876
       skb_release_head_state+0x100/0x250 net/core/skbuff.c:1188
       skb_release_all net/core/skbuff.c:1200 [inline]
       __kfree_skb net/core/skbuff.c:1216 [inline]
       kfree_skb_reason+0x16d/0x3b0 net/core/skbuff.c:1252
       kfree_skb include/linux/skbuff.h:1262 [inline]
       manage_oob net/unix/af_unix.c:2672 [inline]
       unix_stream_read_generic+0x1125/0x2700 net/unix/af_unix.c:2749
       unix_stream_splice_read+0x239/0x320 net/unix/af_unix.c:2981
       do_splice_read fs/splice.c:985 [inline]
       splice_file_to_pipe+0x299/0x500 fs/splice.c:1295
       do_splice+0xf2d/0x1880 fs/splice.c:1379
       __do_splice fs/splice.c:1436 [inline]
       __do_sys_splice fs/splice.c:1652 [inline]
       __se_sys_splice+0x331/0x4a0 fs/splice.c:1634
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

 -> #0 (&u->lock){+.+.}-{2:2}:
       check_prev_add kernel/locking/lockdep.c:3134 [inline]
       check_prevs_add kernel/locking/lockdep.c:3253 [inline]
       validate_chain+0x18cb/0x58e0 kernel/locking/lockdep.c:3869
       __lock_acquire+0x1346/0x1fd0 kernel/locking/lockdep.c:5137
       lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
       __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
       _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
       spin_lock include/linux/spinlock.h:351 [inline]
       __unix_gc+0x40e/0xf70 net/unix/garbage.c:302
       process_one_work kernel/workqueue.c:3254 [inline]
       process_scheduled_works+0xa10/0x17c0 kernel/workqueue.c:3335
       worker_thread+0x86d/0xd70 kernel/workqueue.c:3416
       kthread+0x2f0/0x390 kernel/kthread.c:388
       ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(unix_gc_lock);
                               lock(&u->lock);
                               lock(unix_gc_lock);
  lock(&u->lock);

 *** DEADLOCK ***

3 locks held by kworker/u8:1/11:
 #0: ffff888015089148 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3229 [inline]
 #0: ffff888015089148 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_scheduled_works+0x8e0/0x17c0 kernel/workqueue.c:3335
 #1: ffffc90000107d00 (unix_gc_work){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3230 [inline]
 #1: ffffc90000107d00 (unix_gc_work){+.+.}-{0:0}, at: process_scheduled_works+0x91b/0x17c0 kernel/workqueue.c:3335
 #2: ffffffff8f6ab638 (unix_gc_lock){+.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
 #2: ffffffff8f6ab638 (unix_gc_lock){+.+.}-{2:2}, at: __unix_gc+0x117/0xf70 net/unix/garbage.c:261

stack backtrace:
CPU: 0 PID: 11 Comm: kworker/u8:1 Not tainted 6.9.0-rc5-syzkaller-00007-g4d2008430ce8 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
Workqueue: events_unbound __unix_gc
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
 check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2187
 check_prev_add kernel/locking/lockdep.c:3134 [inline]
 check_prevs_add kernel/locking/lockdep.c:3253 [inline]
 validate_chain+0x18cb/0x58e0 kernel/locking/lockdep.c:3869
 __lock_acquire+0x1346/0x1fd0 kernel/locking/lockdep.c:5137
 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
 __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
 _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
 spin_lock include/linux/spinlock.h:351 [inline]
 __unix_gc+0x40e/0xf70 net/unix/garbage.c:302
 process_one_work kernel/workqueue.c:3254 [inline]
 process_scheduled_works+0xa10/0x17c0 kernel/workqueue.c:3335
 worker_thread+0x86d/0xd70 kernel/workqueue.c:3416
 kthread+0x2f0/0x390 kernel/kthread.c:388
 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
 </TASK>

Fixes: 47d8ac011f ("af_unix: Fix garbage collector racing against connect()")
Reported-and-tested-by: syzbot+fa379358c28cc87cc307@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=fa379358c28cc87cc307
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240424170443.9832-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-25 08:37:02 -07:00
Sabrina Dubroca 0844370f89 tls: fix lockless read of strp->msg_ready in ->poll
tls_sk_poll is called without locking the socket, and needs to read
strp->msg_ready (via tls_strp_msg_ready). Convert msg_ready to a bool
and use READ_ONCE/WRITE_ONCE where needed. The remaining reads are
only performed when the socket is locked.

Fixes: 121dca784f ("tls: suppress wakeups unless we have a full record")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/0b7ee062319037cf86af6b317b3d72f7bfcd2e97.1713797701.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-25 08:32:37 -07:00
Rahul Rameshbabu 6e159fd653 ethernet: Add helper for assigning packet type when dest address does not match device address
Enable reuse of logic in eth_type_trans for determining packet type.

Suggested-by: Sabrina Dubroca <sd@queasysnail.net>
Cc: stable@vger.kernel.org
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/20240423181319.115860-3-rrameshbabu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-25 08:20:54 -07:00
linke li e7d96e750f net: bridge: remove redundant check of f->dst
In br_fill_forward_path(), f->dst is checked not to be NULL, then
immediately read using READ_ONCE and checked again. The first check is
useless, so this patch aims to remove the redundant check of f->dst.

Signed-off-by: linke li <lilinke99@qq.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-25 13:15:38 +01:00
David S. Miller 46bf0c9ab7 Fixes for the current cycle:
* ath11k: convert to correct RCU iteration of IPv6 addresses
  * iwlwifi: link ID, FW API version, scanning and PASN fixes
  * cfg80211: NULL-deref and tracing fixes
  * mac80211: connection mode, mesh fast-TX, multi-link and
              various other small fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmYnhcAACgkQ10qiO8sP
 aAAR9w//ZU+X4jehyQ3JDN+vYk+Fjl0A8iYELaLMBk2GrHBQztBR5LqKdNuEj2xB
 8/550V3KDw0s58FVS6KaYzuViYYVgtZCcE4xDls6A3IA5rs/zgfRKQTXW+QBbX0J
 LoKOtvTWoTlBGCSp+GMhihU2IG5QBrYOxuNlxNQWld0BE1u/+PLfdg5UzyydpVVl
 osjN/8ieGJwpu9S+0IvS9uQu+1sDGZqLHGAEkk5er+3brXxyvZ0I2jHZwoYVqPpn
 Pd4qcc15zo6I3IudCastUJQyEOHTp+P4Vy4nEaqb6g3B7xEJwwL021wqVhek3Cdm
 kYRWoHq7a48FI8eJoywR4NrexP8vPpK2vaC9u+kC8AmgaI2w+BHYePMrivQeLYFP
 gd/eWqZfp/O5E2ULbc2sZ9651TiSMQEVy/mprxsjq52+wZnEiwF3hfiH2tqz5AK+
 /JZuwRiY30LwnodkasPHei1jFkDPt8dMbp+y0ProTPw6nbM38xLQ/BOzWduV+QWZ
 RLNtCuYHF2OpUyCJjJS1VF40PUUBSvGiArXy9tddzeHqEyow+E9DAohlv0nPusCZ
 9CN1q07YKN3GZnvEIZjPOn4IQ5D8/sLYbGYjhY5AVXJAo5A8RdtjeUeISKBxqn+j
 K/zJ1jGFjdV3nPpC55ayI//uaLemoW6GAwXC1q+OSiKf998DAtE=
 =dPas
 -----END PGP SIGNATURE-----

Merge tag 'wireless-2024-04-23' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes berg says:

====================
Fixes for the current cycle:
 * ath11k: convert to correct RCU iteration of IPv6 addresses
 * iwlwifi: link ID, FW API version, scanning and PASN fixes
 * cfg80211: NULL-deref and tracing fixes
 * mac80211: connection mode, mesh fast-TX, multi-link and
             various other small fixes
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-25 12:18:37 +01:00
Pablo Neira Ayuso 8e30abc9ac netfilter: nf_tables: honor table dormant flag from netdev release event path
Check for table dormant flag otherwise netdev release event path tries
to unregister an already unregistered hook.

[524854.857999] ------------[ cut here ]------------
[524854.858010] WARNING: CPU: 0 PID: 3386599 at net/netfilter/core.c:501 __nf_unregister_net_hook+0x21a/0x260
[...]
[524854.858848] CPU: 0 PID: 3386599 Comm: kworker/u32:2 Not tainted 6.9.0-rc3+ #365
[524854.858869] Workqueue: netns cleanup_net
[524854.858886] RIP: 0010:__nf_unregister_net_hook+0x21a/0x260
[524854.858903] Code: 24 e8 aa 73 83 ff 48 63 43 1c 83 f8 01 0f 85 3d ff ff ff e8 98 d1 f0 ff 48 8b 3c 24 e8 8f 73 83 ff 48 63 43 1c e9 26 ff ff ff <0f> 0b 48 83 c4 18 48 c7 c7 00 68 e9 82 5b 5d 41 5c 41 5d 41 5e 41
[524854.858914] RSP: 0018:ffff8881e36d79e0 EFLAGS: 00010246
[524854.858926] RAX: 0000000000000000 RBX: ffff8881339ae790 RCX: ffffffff81ba524a
[524854.858936] RDX: dffffc0000000000 RSI: 0000000000000008 RDI: ffff8881c8a16438
[524854.858945] RBP: ffff8881c8a16438 R08: 0000000000000001 R09: ffffed103c6daf34
[524854.858954] R10: ffff8881e36d79a7 R11: 0000000000000000 R12: 0000000000000005
[524854.858962] R13: ffff8881c8a16000 R14: 0000000000000000 R15: ffff8881351b5a00
[524854.858971] FS:  0000000000000000(0000) GS:ffff888390800000(0000) knlGS:0000000000000000
[524854.858982] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[524854.858991] CR2: 00007fc9be0f16f4 CR3: 00000001437cc004 CR4: 00000000001706f0
[524854.859000] Call Trace:
[524854.859006]  <TASK>
[524854.859013]  ? __warn+0x9f/0x1a0
[524854.859027]  ? __nf_unregister_net_hook+0x21a/0x260
[524854.859044]  ? report_bug+0x1b1/0x1e0
[524854.859060]  ? handle_bug+0x3c/0x70
[524854.859071]  ? exc_invalid_op+0x17/0x40
[524854.859083]  ? asm_exc_invalid_op+0x1a/0x20
[524854.859100]  ? __nf_unregister_net_hook+0x6a/0x260
[524854.859116]  ? __nf_unregister_net_hook+0x21a/0x260
[524854.859135]  nf_tables_netdev_event+0x337/0x390 [nf_tables]
[524854.859304]  ? __pfx_nf_tables_netdev_event+0x10/0x10 [nf_tables]
[524854.859461]  ? packet_notifier+0xb3/0x360
[524854.859476]  ? _raw_spin_unlock_irqrestore+0x11/0x40
[524854.859489]  ? dcbnl_netdevice_event+0x35/0x140
[524854.859507]  ? __pfx_nf_tables_netdev_event+0x10/0x10 [nf_tables]
[524854.859661]  notifier_call_chain+0x7d/0x140
[524854.859677]  unregister_netdevice_many_notify+0x5e1/0xae0

Fixes: d54725cd11 ("netfilter: nf_tables: support for multiple devices per netdev hook")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-25 10:42:57 +02:00
Philo Lu 2bf90a57f0 tcp: update sacked after tracepoint in __tcp_retransmit_skb
Marking TCP_SKB_CB(skb)->sacked with TCPCB_EVER_RETRANS after the
traceopint (trace_tcp_retransmit_skb), then we can get the
retransmission efficiency by counting skbs w/ and w/o TCPCB_EVER_RETRANS
mark in this tracepoint.

We have discussed to achieve this with BPF_SOCK_OPS in [0], and using
tracepoint is thought to be a better solution.

[0]
https://lore.kernel.org/all/20240417124622.35333-1-lulie@linux.alibaba.com/

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-25 08:52:12 +01:00
Jakub Kicinski e6b219014f bluetooth pull request for net:
- qca: set power_ctrl_enabled on NULL returned by gpiod_get_optional()
  - hci_sync: Using hci_cmd_sync_submit when removing Adv Monitor
  - qca: fix invalid device address check
  - hci_sync: Use advertised PHYs on hci_le_ext_create_conn_sync
  - Fix type of len in {l2cap,sco}_sock_getsockopt_old()
  - btusb: mediatek: Fix double free of skb in coredump
  - btusb: Add Realtek RTL8852BE support ID 0x0bda:0x4853
  - btusb: Fix triggering coredump implementation for QCA
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmYpbfEZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKWaeD/9d7B2uO+tArOk6QbPx6J5T
 iS4lRGBQtfnGN1NlzJUvNqN0IklgWBmD9GvoVN8EuuF/mTZie3cl7Naelyj/+jvh
 4bat9dvnDpjmJuqFrxUCl7svo4hN/6eMenHQVwC6c5HSkAQQTrHmXJTWWOI56mbU
 wMTLq8eDs3Ggv04Ul7sGlMrXxDB+h1kGPhxFZ5aSwYrGE+hRKAMTqikHsFD4IJCk
 73HOq9kOeg50LRgv7JyBuTKWefvsl3pKYjNnM25QJmLmwpegaI2Cv5p1reZdt0hh
 a9Xkz9nvfSGjB/Q9S+b9pTJIDBS/nuzHamKG+9VExa9eZ3a9B1NwAXXoxTEIyUEH
 1wDMONqaRMuVIzI6uTCZKwCBD4kZQYBo+M3qdZgBNeMwGv3II52ZIudGz10q+OTm
 lWWj5nveRDS0QSOlljEfFSr90ea6bsDYZkjhjTvV36RkTbJRf0E0TsGnYEYiuo+b
 dxEkL07X3HuqneOIeen821Zj+YjVnmfwTlvhvrHVzRyiz0W2LBAKi++G3mwHrZd4
 gTSGKIN8uEBM4M1rw50jyBtOGolugDQBvl7qLOl0HWzwHcQyqqYrKGoroO6TCc6r
 tcrpNEEbtQYk+deQz/EHY7qQJeNVpS6f0v8MdnruppctV1lO8v2lZKHVRrahVzUf
 MY+ONRCIqn+fUC2U6YmakQ==
 =t+Dp
 -----END PGP SIGNATURE-----

Merge tag 'for-net-2024-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - qca: set power_ctrl_enabled on NULL returned by gpiod_get_optional()
 - hci_sync: Using hci_cmd_sync_submit when removing Adv Monitor
 - qca: fix invalid device address check
 - hci_sync: Use advertised PHYs on hci_le_ext_create_conn_sync
 - Fix type of len in {l2cap,sco}_sock_getsockopt_old()
 - btusb: mediatek: Fix double free of skb in coredump
 - btusb: Add Realtek RTL8852BE support ID 0x0bda:0x4853
 - btusb: Fix triggering coredump implementation for QCA

* tag 'for-net-2024-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: qca: set power_ctrl_enabled on NULL returned by gpiod_get_optional()
  Bluetooth: hci_sync: Using hci_cmd_sync_submit when removing Adv Monitor
  Bluetooth: qca: fix NULL-deref on non-serdev setup
  Bluetooth: qca: fix NULL-deref on non-serdev suspend
  Bluetooth: btusb: mediatek: Fix double free of skb in coredump
  Bluetooth: MGMT: Fix failing to MGMT_OP_ADD_UUID/MGMT_OP_REMOVE_UUID
  Bluetooth: qca: fix invalid device address check
  Bluetooth: hci_event: Fix sending HCI_OP_READ_ENC_KEY_SIZE
  Bluetooth: btusb: Fix triggering coredump implementation for QCA
  Bluetooth: btusb: Add Realtek RTL8852BE support ID 0x0bda:0x4853
  Bluetooth: hci_sync: Use advertised PHYs on hci_le_ext_create_conn_sync
  Bluetooth: Fix type of len in {l2cap,sco}_sock_getsockopt_old()
====================

Link: https://lore.kernel.org/r/20240424204102.2319483-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-24 20:29:49 -07:00
Jakub Kicinski 21d9f921f8 Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
ice: Support 5 layer Tx scheduler topology

Mateusz Polchlopek says:

For performance reasons there is a need to have support for selectable
Tx scheduler topology. Currently firmware supports only the default
9-layer and 5-layer topology. This patch series enables switch from
default to 5-layer topology, if user decides to opt-in.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: Document tx_scheduling_layers parameter
  ice: Add tx_scheduling_layers devlink param
  ice: Enable switching default Tx scheduler topology
  ice: Adjust the VSI/Aggregator layers
  ice: Support 5 layer topology
  devlink: extend devlink_param *set pointer
====================

Link: https://lore.kernel.org/r/20240422203913.225151-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-24 20:05:31 -07:00
Hyunwoo Kim 5ea7b72d4f net: openvswitch: Fix Use-After-Free in ovs_ct_exit
Since kfree_rcu, which is called in the hlist_for_each_entry_rcu traversal
of ovs_ct_limit_exit, is not part of the RCU read critical section, it
is possible that the RCU grace period will pass during the traversal and
the key will be free.

To prevent this, it should be changed to hlist_for_each_entry_safe.

Fixes: 11efd5cb04 ("openvswitch: Support conntrack zone limit")
Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://lore.kernel.org/r/ZiYvzQN/Ry5oeFQW@v4bel-B760M-AORUS-ELITE-AX
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-24 17:14:24 -07:00
Jun Gu 66270920f9 net: openvswitch: Release reference to netdev
dev_get_by_name will provide a reference on the netdev. So ensure that
the reference of netdev is released after completed.

Fixes: 2540088b83 ("net: openvswitch: Check vport netdev name")
Signed-off-by: Jun Gu <jun.gu@easystack.cn>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://lore.kernel.org/r/20240423073751.52706-1-jun.gu@easystack.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-24 17:07:17 -07:00
Ismael Luceno e10d3ba4d4 ipvs: Fix checksumming on GSO of SCTP packets
It was observed in the wild that pairs of consecutive packets would leave
the IPVS with the same wrong checksum, and the issue only went away when
disabling GSO.

IPVS needs to avoid computing the SCTP checksum when using GSO.

Fixes: 90017accff ("sctp: Add GSO support")
Co-developed-by: Firo Yang <firo.yang@suse.com>
Signed-off-by: Ismael Luceno <iluceno@suse.de>
Tested-by: Andreas Taschner <andreas.taschner@suse.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-25 00:22:48 +02:00
Chun-Yi Lee 88cd6e6b2d Bluetooth: hci_sync: Using hci_cmd_sync_submit when removing Adv Monitor
Since the d883a4669a be introduced in v6.4, bluetooth daemon
got the following failed message of MGMT_OP_REMOVE_ADV_MONITOR
command when controller is power-off:

bluetoothd[20976]:
src/adapter.c:reset_adv_monitors_complete() Failed to reset Adv
Monitors: Failed>

Normally this situation is happened when the bluetoothd deamon
be started manually after system booting. Which means that
bluetoothd received MGMT_EV_INDEX_ADDED event after kernel
runs hci_power_off().

Base on doc/mgmt-api.txt, the MGMT_OP_REMOVE_ADV_MONITOR command
can be used when the controller is not powered. This patch changes
the code in remove_adv_monitor() to use hci_cmd_sync_submit()
instead of hci_cmd_sync_queue().

Fixes: d883a4669a ("Bluetooth: hci_sync: Only allow hci_cmd_sync_queue if running")
Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Cc: Manish Mandlik <mmandlik@google.com>
Cc: Archie Pusaka <apusaka@chromium.org>
Cc: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Chun-Yi Lee <jlee@suse.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-04-24 16:26:20 -04:00
Luiz Augusto von Dentz 6eb5fcc416 Bluetooth: MGMT: Fix failing to MGMT_OP_ADD_UUID/MGMT_OP_REMOVE_UUID
These commands don't require the adapter to be up and running so don't
use hci_cmd_sync_queue which would check that flag, instead use
hci_cmd_sync_submit which would ensure mgmt_class_complete is set
properly regardless if any command was actually run or not.

Link: https://github.com/bluez/bluez/issues/809
Fixes: d883a4669a ("Bluetooth: hci_sync: Only allow hci_cmd_sync_queue if running")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-04-24 16:26:14 -04:00
Luiz Augusto von Dentz a9a830a676 Bluetooth: hci_event: Fix sending HCI_OP_READ_ENC_KEY_SIZE
The code shall always check if HCI_QUIRK_BROKEN_READ_ENC_KEY_SIZE has
been set before attempting to use HCI_OP_READ_ENC_KEY_SIZE.

Fixes: c569242cd4 ("Bluetooth: hci_event: set the conn encrypted before conn establishes")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-04-24 16:26:11 -04:00
Luiz Augusto von Dentz 2e7ed5f5e6 Bluetooth: hci_sync: Use advertised PHYs on hci_le_ext_create_conn_sync
The extended advertising reports do report the PHYs so this store then
in hci_conn so it can be later used in hci_le_ext_create_conn_sync to
narrow the PHYs to be scanned since the controller will also perform a
scan having a smaller set of PHYs shall reduce the time it takes to
find and connect peers.

Fixes: 288c90224e ("Bluetooth: Enable all supported LE PHY by default")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-04-24 16:26:08 -04:00
Nathan Chancellor 9bf4e919cc Bluetooth: Fix type of len in {l2cap,sco}_sock_getsockopt_old()
After an innocuous optimization change in LLVM main (19.0.0), x86_64
allmodconfig (which enables CONFIG_KCSAN / -fsanitize=thread) fails to
build due to the checks in check_copy_size():

  In file included from net/bluetooth/sco.c:27:
  In file included from include/linux/module.h:13:
  In file included from include/linux/stat.h:19:
  In file included from include/linux/time.h:60:
  In file included from include/linux/time32.h:13:
  In file included from include/linux/timex.h:67:
  In file included from arch/x86/include/asm/timex.h:6:
  In file included from arch/x86/include/asm/tsc.h:10:
  In file included from arch/x86/include/asm/msr.h:15:
  In file included from include/linux/percpu.h:7:
  In file included from include/linux/smp.h:118:
  include/linux/thread_info.h:244:4: error: call to '__bad_copy_from'
  declared with 'error' attribute: copy source size is too small
    244 |                         __bad_copy_from();
        |                         ^

The same exact error occurs in l2cap_sock.c. The copy_to_user()
statements that are failing come from l2cap_sock_getsockopt_old() and
sco_sock_getsockopt_old(). This does not occur with GCC with or without
KCSAN or Clang without KCSAN enabled.

len is defined as an 'int' because it is assigned from
'__user int *optlen'. However, it is clamped against the result of
sizeof(), which has a type of 'size_t' ('unsigned long' for 64-bit
platforms). This is done with min_t() because min() requires compatible
types, which results in both len and the result of sizeof() being casted
to 'unsigned int', meaning len changes signs and the result of sizeof()
is truncated. From there, len is passed to copy_to_user(), which has a
third parameter type of 'unsigned long', so it is widened and changes
signs again. This excessive casting in combination with the KCSAN
instrumentation causes LLVM to fail to eliminate the __bad_copy_from()
call, failing the build.

The official recommendation from LLVM developers is to consistently use
long types for all size variables to avoid the unnecessary casting in
the first place. Change the type of len to size_t in both
l2cap_sock_getsockopt_old() and sco_sock_getsockopt_old(). This clears
up the error while allowing min_t() to be replaced with min(), resulting
in simpler code with no casts and fewer implicit conversions. While len
is a different type than optlen now, it should result in no functional
change because the result of sizeof() will clamp all values of optlen in
the same manner as before.

Cc: stable@vger.kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/2007
Link: https://github.com/llvm/llvm-project/issues/85647
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-04-24 16:26:06 -04:00
Alexander Lobakin ef9226cd56 page_pool: constify some read-only function arguments
There are several functions taking pointers to data they don't modify.
This includes statistics fetching, page and page_pool parameters, etc.
Constify the pointers, so that call sites will be able to pass const
pointers as well.
No functional changes, no visible changes in functions sizes.

Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-24 11:06:25 -07:00
Breno Leitao c661050f93 net: create a dummy net_device allocator
It is impossible to use init_dummy_netdev together with alloc_netdev()
as the 'setup' argument.

This is because alloc_netdev() initializes some fields in the net_device
structure, and later init_dummy_netdev() memzero them all. This causes
some problems as reported here:

	https://lore.kernel.org/all/20240322082336.49f110cc@kernel.org/

Split the init_dummy_netdev() function in two. Create a new function called
init_dummy_netdev_core() that does not memzero the net_device structure.
Then have init_dummy_netdev() memzero-ing and calling
init_dummy_netdev_core(), keeping the old behaviour.

init_dummy_netdev_core() is the new function that could be called as an
argument for alloc_netdev().

Also, create a helper to allocate and initialize dummy net devices,
leveraging init_dummy_netdev_core() as the setup argument. This function
basically simplify the allocation of dummy devices, by allocating and
initializing it. Freeing the device continue to be done through
free_netdev()

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-24 12:00:16 +01:00
Breno Leitao f8d05679fb net: free_netdev: exit earlier if dummy
For dummy devices, exit earlier at free_netdev() instead of executing
the whole function. This is necessary, because dummy devices are
special, and shouldn't have the second part of the function executed.

Otherwise reg_state, which is NETREG_DUMMY, will be overwritten and
there will be no way to identify that this is a dummy device. Also, this
device do not need the final put_device(), since dummy devices are not
registered (through register_netdevice()), where the device reference is
increased (at netdev_register_kobject()/device_add()).

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-24 12:00:16 +01:00
Breno Leitao c6e7f27684 net: core: Fix documentation
Fix bad grammar in description of init_dummy_netdev() function.  This
topic showed up in the review of the "allocate dummy device dynamically"
patch set.

Suggested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-24 12:00:16 +01:00
Thomas Weißschuh 795f90c6f1 sysctl: treewide: constify argument ctl_table_root::permissions(table)
The permissions callback should not modify the ctl_table. Enforce this
expectation via the typesystem. This is a step to put "struct ctl_table"
into .rodata throughout the kernel.

The patch was created with the following coccinelle script:

  @@
  identifier func, head, ctl;
  @@

  int func(
    struct ctl_table_header *head,
  - struct ctl_table *ctl)
  + const struct ctl_table *ctl)
  { ... }

(insert_entry() from fs/proc/proc_sysctl.c is a false-positive)

No additional occurrences of '.permissions =' were found after a
tree-wide search for places missed by the conccinelle script.

Reviewed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-04-24 09:43:54 +02:00
Thomas Weißschuh 520713a93d sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table)
Remove the 'table' argument from set_ownership as it is never used. This
change is a step towards putting "struct ctl_table" into .rodata and
eventually having sysctl core only use "const struct ctl_table".

The patch was created with the following coccinelle script:

  @@
  identifier func, head, table, uid, gid;
  @@

  void func(
    struct ctl_table_header *head,
  - struct ctl_table *table,
    kuid_t *uid, kgid_t *gid)
  { ... }

No additional occurrences of 'set_ownership' were found after doing a
tree-wide search.

Reviewed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-04-24 09:43:54 +02:00
Hyunwoo Kim 80e679b352 tcp: Fix Use-After-Free in tcp_ao_connect_init
Since call_rcu, which is called in the hlist_for_each_entry_rcu traversal
of tcp_ao_connect_init, is not part of the RCU read critical section, it
is possible that the RCU grace period will pass during the traversal and
the key will be free.

To prevent this, it should be changed to hlist_for_each_entry_safe.

Fixes: 7c2ffaf21b ("net/tcp: Calculate TCP-AO traffic keys")
Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://lore.kernel.org/r/ZiYu9NJ/ClR8uSkH@v4bel-B760M-AORUS-ELITE-AX
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23 19:06:39 -07:00
Eric Dumazet 1c04b46cbd neighbour: fix neigh_master_filtered()
If we no longer hold RTNL, we must use netdev_master_upper_dev_get_rcu()
instead of netdev_master_upper_dev_get().

Fixes: ba0f780694 ("neighbour: no longer hold RTNL in neigh_dump_info()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240421185753.1808077-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23 19:04:50 -07:00
Eric Dumazet 58a4c9b1e5 ipv4: check for NULL idev in ip_route_use_hint()
syzbot was able to trigger a NULL deref in fib_validate_source()
in an old tree [1].

It appears the bug exists in latest trees.

All calls to __in_dev_get_rcu() must be checked for a NULL result.

[1]
general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 2 PID: 3257 Comm: syz-executor.3 Not tainted 5.10.0-syzkaller #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
 RIP: 0010:fib_validate_source+0xbf/0x15a0 net/ipv4/fib_frontend.c:425
Code: 18 f2 f2 f2 f2 42 c7 44 20 23 f3 f3 f3 f3 48 89 44 24 78 42 c6 44 20 27 f3 e8 5d 88 48 fc 4c 89 e8 48 c1 e8 03 48 89 44 24 18 <42> 80 3c 20 00 74 08 4c 89 ef e8 d2 15 98 fc 48 89 5c 24 10 41 bf
RSP: 0018:ffffc900015fee40 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88800f7a4000 RCX: ffff88800f4f90c0
RDX: 0000000000000000 RSI: 0000000004001eac RDI: ffff8880160c64c0
RBP: ffffc900015ff060 R08: 0000000000000000 R09: ffff88800f7a4000
R10: 0000000000000002 R11: ffff88800f4f90c0 R12: dffffc0000000000
R13: 0000000000000000 R14: 0000000000000000 R15: ffff88800f7a4000
FS:  00007f938acfe6c0(0000) GS:ffff888058c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f938acddd58 CR3: 000000001248e000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  ip_route_use_hint+0x410/0x9b0 net/ipv4/route.c:2231
  ip_rcv_finish_core+0x2c4/0x1a30 net/ipv4/ip_input.c:327
  ip_list_rcv_finish net/ipv4/ip_input.c:612 [inline]
  ip_sublist_rcv+0x3ed/0xe50 net/ipv4/ip_input.c:638
  ip_list_rcv+0x422/0x470 net/ipv4/ip_input.c:673
  __netif_receive_skb_list_ptype net/core/dev.c:5572 [inline]
  __netif_receive_skb_list_core+0x6b1/0x890 net/core/dev.c:5620
  __netif_receive_skb_list net/core/dev.c:5672 [inline]
  netif_receive_skb_list_internal+0x9f9/0xdc0 net/core/dev.c:5764
  netif_receive_skb_list+0x55/0x3e0 net/core/dev.c:5816
  xdp_recv_frames net/bpf/test_run.c:257 [inline]
  xdp_test_run_batch net/bpf/test_run.c:335 [inline]
  bpf_test_run_xdp_live+0x1818/0x1d00 net/bpf/test_run.c:363
  bpf_prog_test_run_xdp+0x81f/0x1170 net/bpf/test_run.c:1376
  bpf_prog_test_run+0x349/0x3c0 kernel/bpf/syscall.c:3736
  __sys_bpf+0x45c/0x710 kernel/bpf/syscall.c:5115
  __do_sys_bpf kernel/bpf/syscall.c:5201 [inline]
  __se_sys_bpf kernel/bpf/syscall.c:5199 [inline]
  __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5199

Fixes: 02b2494161 ("ipv4: use dst hint for ipv4 list receive")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240421184326.1704930-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23 19:02:57 -07:00
Jakub Kicinski 8af4f60472 netlink: support all extack types in dumps
Note that when this commit message refers to netlink dump
it only means the actual dumping part, the parsing / dump
start is handled by the same code as "doit".

Commit 4a19edb60d ("netlink: Pass extack to dump handlers")
added support for returning extack messages from dump handlers,
but left out other extack info, e.g. bad attribute.

This used to be fine because until YNL we had little practical
use for the machine readable attributes, and only messages were
used in practice.

YNL flips the preference 180 degrees, it's now much more useful
to point to a bad attr with NL_SET_BAD_ATTR() than type
an English message saying "attribute XYZ is $reason-why-bad".

Support all of extack. The fact that extack only gets added if
it fits remains unaddressed.

Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240420023543.3300306-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23 10:09:49 -07:00
Jakub Kicinski 652332e3f1 netlink: move extack writing helpers
Next change will need them in netlink_dump_done(), pure move.

Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240420023543.3300306-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23 10:09:49 -07:00
Jakub Kicinski ce05d0f203 netdev: support dumping a single netdev in qstats
Having to filter the right ifindex in the tests is a bit tedious.
Add support for dumping qstats for a single ifindex.

Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240420023543.3300306-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23 10:09:49 -07:00
Kuniyuki Iwashima 1af2dface5 af_unix: Don't access successor in unix_del_edges() during GC.
syzbot reported use-after-free in unix_del_edges().  [0]

What the repro does is basically repeat the following quickly.

  1. pass a fd of an AF_UNIX socket to itself

    socketpair(AF_UNIX, SOCK_DGRAM, 0, [3, 4]) = 0
    sendmsg(3, {..., msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET,
                                   cmsg_type=SCM_RIGHTS, cmsg_data=[4]}], ...}, 0) = 0

  2. pass other fds of AF_UNIX sockets to the socket above

    socketpair(AF_UNIX, SOCK_SEQPACKET, 0, [5, 6]) = 0
    sendmsg(3, {..., msg_control=[{cmsg_len=48, cmsg_level=SOL_SOCKET,
                                   cmsg_type=SCM_RIGHTS, cmsg_data=[5, 6]}], ...}, 0) = 0

  3. close all sockets

Here, two skb are created, and every unix_edge->successor is the first
socket.  Then, __unix_gc() will garbage-collect the two skb:

  (a) free skb with self-referencing fd
  (b) free skb holding other sockets

After (a), the self-referencing socket will be scheduled to be freed
later by the delayed_fput() task.

syzbot repeated the sequences above (1. ~ 3.) quickly and triggered
the task concurrently while GC was running.

So, at (b), the socket was already freed, and accessing it was illegal.

unix_del_edges() accesses the receiver socket as edge->successor to
optimise GC.  However, we should not do it during GC.

Garbage-collecting sockets does not change the shape of the rest
of the graph, so we need not call unix_update_graph() to update
unix_graph_grouped when we purge skb.

However, if we clean up all loops in the unix_walk_scc_fast() path,
unix_graph_maybe_cyclic remains unchanged (true), and __unix_gc()
will call unix_walk_scc_fast() continuously even though there is no
socket to garbage-collect.

To keep that optimisation while fixing UAF, let's add the same
updating logic of unix_graph_maybe_cyclic in unix_walk_scc_fast()
as done in unix_walk_scc() and __unix_walk_scc().

Note that when unix_del_edges() is called from other places, the
receiver socket is always alive:

  - sendmsg: the successor's sk_refcnt is bumped by sock_hold()
             unix_find_other() for SOCK_DGRAM, connect() for SOCK_STREAM

  - recvmsg: the successor is the receiver, and its fd is alive

[0]:
BUG: KASAN: slab-use-after-free in unix_edge_successor net/unix/garbage.c:109 [inline]
BUG: KASAN: slab-use-after-free in unix_del_edge net/unix/garbage.c:165 [inline]
BUG: KASAN: slab-use-after-free in unix_del_edges+0x148/0x630 net/unix/garbage.c:237
Read of size 8 at addr ffff888079c6e640 by task kworker/u8:6/1099

CPU: 0 PID: 1099 Comm: kworker/u8:6 Not tainted 6.9.0-rc4-next-20240418-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
Workqueue: events_unbound __unix_gc
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
 print_address_description mm/kasan/report.c:377 [inline]
 print_report+0x169/0x550 mm/kasan/report.c:488
 kasan_report+0x143/0x180 mm/kasan/report.c:601
 unix_edge_successor net/unix/garbage.c:109 [inline]
 unix_del_edge net/unix/garbage.c:165 [inline]
 unix_del_edges+0x148/0x630 net/unix/garbage.c:237
 unix_destroy_fpl+0x59/0x210 net/unix/garbage.c:298
 unix_detach_fds net/unix/af_unix.c:1811 [inline]
 unix_destruct_scm+0x13e/0x210 net/unix/af_unix.c:1826
 skb_release_head_state+0x100/0x250 net/core/skbuff.c:1127
 skb_release_all net/core/skbuff.c:1138 [inline]
 __kfree_skb net/core/skbuff.c:1154 [inline]
 kfree_skb_reason+0x16d/0x3b0 net/core/skbuff.c:1190
 __skb_queue_purge_reason include/linux/skbuff.h:3251 [inline]
 __skb_queue_purge include/linux/skbuff.h:3256 [inline]
 __unix_gc+0x1732/0x1830 net/unix/garbage.c:575
 process_one_work kernel/workqueue.c:3218 [inline]
 process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3299
 worker_thread+0x86d/0xd70 kernel/workqueue.c:3380
 kthread+0x2f0/0x390 kernel/kthread.c:389
 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
 </TASK>

Allocated by task 14427:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
 unpoison_slab_object mm/kasan/common.c:312 [inline]
 __kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:338
 kasan_slab_alloc include/linux/kasan.h:201 [inline]
 slab_post_alloc_hook mm/slub.c:3897 [inline]
 slab_alloc_node mm/slub.c:3957 [inline]
 kmem_cache_alloc_noprof+0x135/0x290 mm/slub.c:3964
 sk_prot_alloc+0x58/0x210 net/core/sock.c:2074
 sk_alloc+0x38/0x370 net/core/sock.c:2133
 unix_create1+0xb4/0x770
 unix_create+0x14e/0x200 net/unix/af_unix.c:1034
 __sock_create+0x490/0x920 net/socket.c:1571
 sock_create net/socket.c:1622 [inline]
 __sys_socketpair+0x33e/0x720 net/socket.c:1773
 __do_sys_socketpair net/socket.c:1822 [inline]
 __se_sys_socketpair net/socket.c:1819 [inline]
 __x64_sys_socketpair+0x9b/0xb0 net/socket.c:1819
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 1805:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
 kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579
 poison_slab_object+0xe0/0x150 mm/kasan/common.c:240
 __kasan_slab_free+0x37/0x60 mm/kasan/common.c:256
 kasan_slab_free include/linux/kasan.h:184 [inline]
 slab_free_hook mm/slub.c:2190 [inline]
 slab_free mm/slub.c:4393 [inline]
 kmem_cache_free+0x145/0x340 mm/slub.c:4468
 sk_prot_free net/core/sock.c:2114 [inline]
 __sk_destruct+0x467/0x5f0 net/core/sock.c:2208
 sock_put include/net/sock.h:1948 [inline]
 unix_release_sock+0xa8b/0xd20 net/unix/af_unix.c:665
 unix_release+0x91/0xc0 net/unix/af_unix.c:1049
 __sock_release net/socket.c:659 [inline]
 sock_close+0xbc/0x240 net/socket.c:1421
 __fput+0x406/0x8b0 fs/file_table.c:422
 delayed_fput+0x59/0x80 fs/file_table.c:445
 process_one_work kernel/workqueue.c:3218 [inline]
 process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3299
 worker_thread+0x86d/0xd70 kernel/workqueue.c:3380
 kthread+0x2f0/0x390 kernel/kthread.c:389
 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

The buggy address belongs to the object at ffff888079c6e000
 which belongs to the cache UNIX of size 1920
The buggy address is located 1600 bytes inside of
 freed 1920-byte region [ffff888079c6e000, ffff888079c6e780)

Reported-by: syzbot+f3f3eef1d2100200e593@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f3f3eef1d2100200e593
Fixes: 77e5593aeb ("af_unix: Skip GC if no cycle exists.")
Fixes: fd86344823 ("af_unix: Try not to hold unix_gc_lock during accept().")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240419235102.31707-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-23 13:57:35 +02:00
Duoming Zhou 467324bcfe ax25: Fix netdev refcount issue
The dev_tracker is added to ax25_cb in ax25_bind(). When the
ax25 device is detaching, the dev_tracker of ax25_cb should be
deallocated in ax25_kill_by_device() instead of the dev_tracker
of ax25_dev. The log reported by ref_tracker is shown below:

[   80.884935] ref_tracker: reference already released.
[   80.885150] ref_tracker: allocated in:
[   80.885349]  ax25_dev_device_up+0x105/0x540
[   80.885730]  ax25_device_event+0xa4/0x420
[   80.885730]  notifier_call_chain+0xc9/0x1e0
[   80.885730]  __dev_notify_flags+0x138/0x280
[   80.885730]  dev_change_flags+0xd7/0x180
[   80.885730]  dev_ifsioc+0x6a9/0xa30
[   80.885730]  dev_ioctl+0x4d8/0xd90
[   80.885730]  sock_do_ioctl+0x1c2/0x2d0
[   80.885730]  sock_ioctl+0x38b/0x4f0
[   80.885730]  __se_sys_ioctl+0xad/0xf0
[   80.885730]  do_syscall_64+0xc4/0x1b0
[   80.885730]  entry_SYSCALL_64_after_hwframe+0x67/0x6f
[   80.885730] ref_tracker: freed in:
[   80.885730]  ax25_device_event+0x272/0x420
[   80.885730]  notifier_call_chain+0xc9/0x1e0
[   80.885730]  dev_close_many+0x272/0x370
[   80.885730]  unregister_netdevice_many_notify+0x3b5/0x1180
[   80.885730]  unregister_netdev+0xcf/0x120
[   80.885730]  sixpack_close+0x11f/0x1b0
[   80.885730]  tty_ldisc_kill+0xcb/0x190
[   80.885730]  tty_ldisc_hangup+0x338/0x3d0
[   80.885730]  __tty_hangup+0x504/0x740
[   80.885730]  tty_release+0x46e/0xd80
[   80.885730]  __fput+0x37f/0x770
[   80.885730]  __x64_sys_close+0x7b/0xb0
[   80.885730]  do_syscall_64+0xc4/0x1b0
[   80.885730]  entry_SYSCALL_64_after_hwframe+0x67/0x6f
[   80.893739] ------------[ cut here ]------------
[   80.894030] WARNING: CPU: 2 PID: 140 at lib/ref_tracker.c:255 ref_tracker_free+0x47b/0x6b0
[   80.894297] Modules linked in:
[   80.894929] CPU: 2 PID: 140 Comm: ax25_conn_rel_6 Not tainted 6.9.0-rc4-g8cd26fd90c1a #11
[   80.895190] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qem4
[   80.895514] RIP: 0010:ref_tracker_free+0x47b/0x6b0
[   80.895808] Code: 83 c5 18 4c 89 eb 48 c1 eb 03 8a 04 13 84 c0 0f 85 df 01 00 00 41 83 7d 00 00 75 4b 4c 89 ff 9
[   80.896171] RSP: 0018:ffff888009edf8c0 EFLAGS: 00000286
[   80.896339] RAX: 1ffff1100141ac00 RBX: 1ffff1100149463b RCX: dffffc0000000000
[   80.896502] RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffff88800a0d6518
[   80.896925] RBP: ffff888009edf9b0 R08: ffff88806d3288d3 R09: 1ffff1100da6511a
[   80.897212] R10: dffffc0000000000 R11: ffffed100da6511b R12: ffff88800a4a31d4
[   80.897859] R13: ffff88800a4a31d8 R14: dffffc0000000000 R15: ffff88800a0d6518
[   80.898279] FS:  00007fd88b7fe700(0000) GS:ffff88806d300000(0000) knlGS:0000000000000000
[   80.899436] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   80.900181] CR2: 00007fd88c001d48 CR3: 000000000993e000 CR4: 00000000000006f0
...
[   80.935774] ref_tracker: sp%d@000000000bb9df3d has 1/1 users at
[   80.935774]      ax25_bind+0x424/0x4e0
[   80.935774]      __sys_bind+0x1d9/0x270
[   80.935774]      __x64_sys_bind+0x75/0x80
[   80.935774]      do_syscall_64+0xc4/0x1b0
[   80.935774]      entry_SYSCALL_64_after_hwframe+0x67/0x6f

Change ax25_dev->dev_tracker to the dev_tracker of ax25_cb
in order to mitigate the bug.

Fixes: feef318c85 ("ax25: fix UAF bugs of net_device caused by rebinding operation")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Link: https://lore.kernel.org/r/20240419020456.29826-1-duoming@zju.edu.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-23 11:35:52 +02:00
Jens Axboe 3830fff399 Merge branch 'for-uring-ubufops' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux into for-6.10/io_uring
Merge net changes required for the upcoming send zerocopy improvements.

* 'for-uring-ubufops' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux:
  net: add callback for setting a ubuf_info to skb
  net: extend ubuf_info callback to ops structure

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-22 19:30:05 -06:00
Jun Gu 2540088b83 net: openvswitch: Check vport netdev name
Ensure that the provided netdev name is not one of its aliases to
prevent unnecessary creation and destruction of the vport by
ovs-vswitchd.

Signed-off-by: Jun Gu <jun.gu@easystack.cn>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/20240419061425.132723-1-jun.gu@easystack.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-22 17:34:02 -07:00
Donald Hunter bf2ac490d2 netfilter: nfnetlink: Handle ACK flags for batch messages
The NLM_F_ACK flag is ignored for nfnetlink batch begin and end
messages. This is a problem for ynl which wants to receive an ack for
every message it sends, not just the commands in between the begin/end
messages.

Add processing for ACKs for begin/end messages and provide responses
when requested.

I have checked that iproute2, pyroute2 and systemd are unaffected by
this change since none of them use NLM_F_ACK for batch begin/end.

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240418104737.77914-5-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-22 17:20:42 -07:00
Jakub Kicinski af046fd169 Merge branch 'for-uring-ubufops' into HEAD
Pavel Begunkov says:

====================
implement io_uring notification (ubuf_info) stacking (net part)

To have per request buffer notifications each zerocopy io_uring send
request allocates a new ubuf_info. However, as an skb can carry only
one uarg, it may force the stack to create many small skbs hurting
performance in many ways.

The patchset implements notification, i.e. an io_uring's ubuf_info
extension, stacking. It attempts to link ubuf_info's into a list,
allowing to have multiple of them per skb.

liburing/examples/send-zerocopy shows up 6 times performance improvement
for TCP with 4KB bytes per send, and levels it with MSG_ZEROCOPY. Without
the patchset it requires much larger sends to utilise all potential.

bytes  | before | after (Kqps)
1200   | 195    | 1023
4000   | 193    | 1386
8000   | 154    | 1058
====================

Link: https://lore.kernel.org/all/cover.1713369317.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-22 17:15:39 -07:00
Pavel Begunkov 65bada80de net: add callback for setting a ubuf_info to skb
At the moment an skb can only have one ubuf_info associated with it,
which might be a performance problem for zerocopy sends in cases like
TCP via io_uring. Add a callback for assigning ubuf_info to skb, this
way we will implement smarter assignment later like linking ubuf_info
together.

Note, it's an optional callback, which should be compatible with
skb_zcopy_set(), that's because the net stack might potentially decide
to clone an skb and take another reference to ubuf_info whenever it
wishes. Also, a correct implementation should always be able to bind to
an skb without prior ubuf_info, otherwise we could end up in a situation
when the send would not be able to progress.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/all/b7918aadffeb787c84c9e72e34c729dc04f3a45d.1713369317.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-22 16:21:59 -07:00
Pavel Begunkov 7ab4f16f9e net: extend ubuf_info callback to ops structure
We'll need to associate additional callbacks with ubuf_info, introduce
a structure holding ubuf_info callbacks. Apart from a more smarter
io_uring notification management introduced in next patches, it can be
used to generalise msg_zerocopy_put_abort() and also store
->sg_from_iter, which is currently passed in struct msghdr.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/all/a62015541de49c0e2a8a0377a1d5d0a5aeb07016.1713369317.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-22 16:21:35 -07:00
Eric Dumazet 8ee602c635 tcp: try to send bigger TSO packets
While investigating TCP performance, I found that TCP would
sometimes send big skbs followed by a single MSS skb,
in a 'locked' pattern.

For instance, BIG TCP is enabled, MSS is set to have 4096 bytes
of payload per segment. gso_max_size is set to 181000.

This means that an optimal TCP packet size should contain
44 * 4096 = 180224 bytes of payload,

However, I was seeing packets sizes interleaved in this pattern:

172032, 8192, 172032, 8192, 172032, 8192, <repeat>

tcp_tso_should_defer() heuristic is defeated, because after a split of
a packet in write queue for whatever reason (this might be a too small
CWND or a small enough pacing_rate),
the leftover packet in the queue is smaller than the optimal size.

It is time to try to make 'leftover packets' bigger so that
tcp_tso_should_defer() can give its full potential.

After this patch, we can see the following output:

14:13:34.009273 IP6 sender > receiver: Flags [P.], seq 4048380:4098360, ack 1, win 256, options [nop,nop,TS val 3425678144 ecr 1561784500], length 49980
14:13:34.010272 IP6 sender > receiver: Flags [P.], seq 4098360:4148340, ack 1, win 256, options [nop,nop,TS val 3425678145 ecr 1561784501], length 49980
14:13:34.011271 IP6 sender > receiver: Flags [P.], seq 4148340:4198320, ack 1, win 256, options [nop,nop,TS val 3425678146 ecr 1561784502], length 49980
14:13:34.012271 IP6 sender > receiver: Flags [P.], seq 4198320:4248300, ack 1, win 256, options [nop,nop,TS val 3425678147 ecr 1561784503], length 49980
14:13:34.013272 IP6 sender > receiver: Flags [P.], seq 4248300:4298280, ack 1, win 256, options [nop,nop,TS val 3425678148 ecr 1561784504], length 49980
14:13:34.014271 IP6 sender > receiver: Flags [P.], seq 4298280:4348260, ack 1, win 256, options [nop,nop,TS val 3425678149 ecr 1561784505], length 49980
14:13:34.015272 IP6 sender > receiver: Flags [P.], seq 4348260:4398240, ack 1, win 256, options [nop,nop,TS val 3425678150 ecr 1561784506], length 49980
14:13:34.016270 IP6 sender > receiver: Flags [P.], seq 4398240:4448220, ack 1, win 256, options [nop,nop,TS val 3425678151 ecr 1561784507], length 49980
14:13:34.017269 IP6 sender > receiver: Flags [P.], seq 4448220:4498200, ack 1, win 256, options [nop,nop,TS val 3425678152 ecr 1561784508], length 49980
14:13:34.018276 IP6 sender > receiver: Flags [P.], seq 4498200:4548180, ack 1, win 256, options [nop,nop,TS val 3425678153 ecr 1561784509], length 49980
14:13:34.019259 IP6 sender > receiver: Flags [P.], seq 4548180:4598160, ack 1, win 256, options [nop,nop,TS val 3425678154 ecr 1561784510], length 49980

With 200 concurrent flows on a 100Gbit NIC, we can see a reduction
of TSO packets (and ACK packets) of about 30 %.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240418214600.1291486-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-22 14:25:28 -07:00
Eric Dumazet d5b38a71d3 tcp: call tcp_set_skb_tso_segs() from tcp_write_xmit()
tcp_write_xmit() calls tcp_init_tso_segs()
to set gso_size and gso_segs on the packet.

tcp_init_tso_segs() requires the stack to maintain
an up to date tcp_skb_pcount(), and this makes sense
for packets in rtx queue. Not so much for packets
still in the write queue.

In the following patch, we don't want to deal with
tcp_skb_pcount() when moving payload from 2nd
skb to 1st skb in the write queue.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240418214600.1291486-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-22 14:25:27 -07:00
Eric Dumazet 22555032c5 tcp: remove dubious FIN exception from tcp_cwnd_test()
tcp_cwnd_test() has a special handing for the last packet in
the write queue if it is smaller than one MSS and has the FIN flag.

This is in violation of TCP RFC, and seems quite dubious.

This packet can be sent only if the current CWND is bigger
than the number of packets in flight.

Making tcp_cwnd_test() result independent of the first skb
in the write queue is needed for the last patch of the series.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240418214600.1291486-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-22 14:25:27 -07:00
Mateusz Polchlopek 5625ca5640 devlink: extend devlink_param *set pointer
Extend devlink_param *set function pointer to take extack as a param.
Sometimes it is needed to pass information to the end user from set
function. It is more proper to use for that netlink instead of passing
message to dmesg.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-22 13:05:19 -07:00
Linus Torvalds 1c582ec142 nfsd-6.9 fixes:
- Fix an NFS/RDMA performance regression in v6.9-rc
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmYmrs4ACgkQM2qzM29m
 f5c39Q/9HiozR34Dr/5X+7DaVJNhMAhz1li6uu1EbXlNPJHRuacJl7UH+Li3+5mB
 ozoZhIwcSjmAP9wAMFEFkBUI3ddCYoKT+PcJmrvpAoR9NHxeLz2hGU2nPvt2s4na
 JiKKV+uM8YLNK9bJxWSe2rrOmcEIuA8jJn7dnURuBBVgFFRHKmNE2WUdJrAGmIzv
 ph/+6Gjxrw83UyEs6KwV5SSaC7WuGQEph1otNzjQ8DgImvN/4zyQek4yqe7bt8FS
 0OnJtoM9ubIryiUfR4ZxrIk9xCGsgh2csEZsSbqFx2ECwoBrCR4RkKue+jHQSVON
 OV35+b1CxdVdYChjmdYWz63P/fkyMC9iNGces/WRZmoD6x9mn5OErCrsDFJDEbD/
 lf+2xS+ky8M6pb7TMzUwiYtYu9yJL8N4yUPZUbWQGwb4c2oKXJ+6SoxD+4aWi2xw
 9Is9tGcbr3fNJLw5/j5y30tVE4x80EuuV4xCxHSayqlKiD40OXkztKBFwnbugEqJ
 mD77L2PVhU7ozfwwIDtvrxqDsITbVB877Bu1CCvlWpUnq8YFWaA8CuQXNuYeQHR6
 KgVuYZfvn5oYXC+9b9EBiIkObatzU8VUsdstjx8aulceIG92rZrRsg56TrUq9teV
 IMBJCneg9iqzQJQVSgQ9wcqv+EUVfnpSehzo5vWdh/Hz1DlyOEA=
 =l92Y
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.9-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fix from Chuck Lever:

 - Fix an NFS/RDMA performance regression in v6.9-rc

* tag 'nfsd-6.9-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  Revert "svcrdma: Add Write chunk WRs to the RPC's Send WR chain"
2024-04-22 12:20:09 -07:00
Toke Høiland-Jørgensen 5bcf0dcbf9 xdp: use flags field to disambiguate broadcast redirect
When redirecting a packet using XDP, the bpf_redirect_map() helper will set
up the redirect destination information in struct bpf_redirect_info (using
the __bpf_xdp_redirect_map() helper function), and the xdp_do_redirect()
function will read this information after the XDP program returns and pass
the frame on to the right redirect destination.

When using the BPF_F_BROADCAST flag to do multicast redirect to a whole
map, __bpf_xdp_redirect_map() sets the 'map' pointer in struct
bpf_redirect_info to point to the destination map to be broadcast. And
xdp_do_redirect() reacts to the value of this map pointer to decide whether
it's dealing with a broadcast or a single-value redirect. However, if the
destination map is being destroyed before xdp_do_redirect() is called, the
map pointer will be cleared out (by bpf_clear_redirect_map()) without
waiting for any XDP programs to stop running. This causes xdp_do_redirect()
to think that the redirect was to a single target, but the target pointer
is also NULL (since broadcast redirects don't have a single target), so
this causes a crash when a NULL pointer is passed to dev_map_enqueue().

To fix this, change xdp_do_redirect() to react directly to the presence of
the BPF_F_BROADCAST flag in the 'flags' value in struct bpf_redirect_info
to disambiguate between a single-target and a broadcast redirect. And only
read the 'map' pointer if the broadcast flag is set, aborting if that has
been cleared out in the meantime. This prevents the crash, while keeping
the atomic (cmpxchg-based) clearing of the map pointer itself, and without
adding any more checks in the non-broadcast fast path.

Fixes: e624d4ed4a ("xdp: Extend xdp_redirect_map with broadcast support")
Reported-and-tested-by: syzbot+af9492708df9797198d6@syzkaller.appspotmail.com
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20240418071840.156411-1-toke@redhat.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-04-22 10:24:41 -07:00
Hangbin Liu 4fd1edcdf1 bridge/br_netlink.c: no need to return void function
br_info_notify is a void function. There is no need to return.

Fixes: b6d0425b81 ("bridge: cfm: Netlink Notifications.")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-22 11:57:14 +01:00
Eric Dumazet c51db4ac10 tcp: do not export tcp_twsk_purge()
After commit 1eeb504357 ("tcp/dccp: do not care about
families in inet_twsk_purge()") tcp_twsk_purge() is
no longer potentially called from a module.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-22 11:53:51 +01:00
Eric Dumazet c58e88d490 icmp: prevent possible NULL dereferences from icmp_build_probe()
First problem is a double call to __in_dev_get_rcu(), because
the second one could return NULL.

if (__in_dev_get_rcu(dev) && __in_dev_get_rcu(dev)->ifa_list)

Second problem is a read from dev->ip6_ptr with no NULL check:

if (!list_empty(&rcu_dereference(dev->ip6_ptr)->addr_list))

Use the correct RCU API to fix these.

v2: add missing include <net/addrconf.h>

Fixes: d329ea5bd8 ("icmp: add response to RFC 8335 PROBE messages")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-22 10:00:31 +01:00
Thomas Weißschuh bfa858f220 sysctl: treewide: constify ctl_table_header::ctl_table_arg
To be able to constify instances of struct ctl_tables it is necessary to
remove ways through which non-const versions are exposed from the
sysctl core.
One of these is the ctl_table_arg member of struct ctl_table_header.

Constify this reference as a prerequisite for the full constification of
struct ctl_table instances.
No functional change.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-22 08:56:31 +01:00
Chuck Lever 32cf5a4eda Revert "svcrdma: Add Write chunk WRs to the RPC's Send WR chain"
Performance regression reported with NFS/RDMA using Omnipath,
bisected to commit e084ee673c ("svcrdma: Add Write chunk WRs to
the RPC's Send WR chain").

Tracing on the server reports:

  nfsd-7771  [060]  1758.891809: svcrdma_sq_post_err:
	cq.id=205 cid=226 sc_sq_avail=13643/851 status=-12

sq_post_err reports ENOMEM, and the rdma->sc_sq_avail (13643) is
larger than rdma->sc_sq_depth (851). The number of available Send
Queue entries is always supposed to be smaller than the Send Queue
depth. That seems like a Send Queue accounting bug in svcrdma.

As it's getting to be late in the 6.9-rc cycle, revert this commit.
It can be revisited in a subsequent kernel release.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=218743
Fixes: e084ee673c ("svcrdma: Add Write chunk WRs to the RPC's Send WR chain")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-04-20 11:20:41 -04:00
Yick Xie 680d11f6e5 udp: preserve the connected status if only UDP cmsg
If "udp_cmsg_send()" returned 0 (i.e. only UDP cmsg),
"connected" should not be set to 0. Otherwise it stops
the connected socket from using the cached route.

Fixes: 2e8de85763 ("udp: add gso segment cmsg")
Signed-off-by: Yick Xie <yick.xie@gmail.com>
Cc: stable@vger.kernel.org
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240418170610.867084-1-yick.xie@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-19 20:10:15 -07:00
Eric Dumazet ba0f780694 neighbour: no longer hold RTNL in neigh_dump_info()
neigh_dump_table() is already relying on RCU protection.

pneigh_dump_table() is using its own protection (tbl->lock)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 12:39:20 +01:00
Eric Dumazet 7e4975f7e7 neighbour: fix neigh_dump_info() return value
Change neigh_dump_table() and pneigh_dump_table()
to either return 0 or -EMSGSIZE if not enough
space was available in the skb.

Then neigh_dump_info() can do the same.

This allows NLMSG_DONE to be appended to the current
skb at the end of a dump, saving a couple of recvmsg()
system calls.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 12:39:20 +01:00
Eric Dumazet f8f2eb9de6 neighbour: add RCU protection to neigh_tables[]
In order to remove RTNL protection from neightbl_dump_info()
and neigh_dump_info() later, we need to add
RCU protection to neigh_tables[].

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 12:39:20 +01:00
Jason Xing f7b60cce84 net: rps: locklessly access rflow->cpu
This is the last member in struct rps_dev_flow which should be
protected locklessly. So finish it.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:38:03 +01:00
Jason Xing f00bf5dc83 net: rps: protect filter locklessly
As we can see, rflow->filter can be written/read concurrently, so
lockless access is needed.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:38:03 +01:00
Jason Xing 84b6823cd9 net: rps: protect last_qtail with rps_input_queue_tail_save() helper
Removing one unnecessary reader protection and add another writer
protection to finish the locklessly proctection job.

Note: the removed READ_ONCE() is not needed because we only have to protect
the locklessly reader in the different context (rps_may_expire_flow()).

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:38:03 +01:00
Eric Dumazet c85cedb38f net_sched: sch_skbprio: implement lockless skbprio_dump()
Instead of relying on RTNL, skbprio_dump() can use READ_ONCE()
annotation, paired with WRITE_ONCE() one in skbprio_change().

Also add a READ_ONCE(sch->limit) in skbprio_enqueue().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:34:08 +01:00
Eric Dumazet 6c00dc4fdb net_sched: sch_pie: implement lockless pie_dump()
Instead of relying on RTNL, pie_dump() can use READ_ONCE()
annotations, paired with WRITE_ONCE() ones in pie_change().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:34:08 +01:00
Eric Dumazet 293c7e2b3e net_sched: sch_hhf: implement lockless hhf_dump()
Instead of relying on RTNL, hhf_dump() can use READ_ONCE()
annotations, paired with WRITE_ONCE() ones in hhf_change().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:34:08 +01:00
Eric Dumazet 49e8ae5370 net_sched: sch_hfsc: implement lockless accesses to q->defcls
Instead of relying on RTNL, hfsc_dump_qdisc() can use READ_ONCE()
annotation, paired with WRITE_ONCE() one in hfsc_change_qdisc().

Use READ_ONCE(q->defcls) in hfsc_classify() to
no longer acquire qdisc lock from hfsc_change_qdisc().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:34:08 +01:00
Eric Dumazet 13a9965de3 net_sched: sch_fq_pie: implement lockless fq_pie_dump()
Instead of relying on RTNL, fq_pie_dump() can use READ_ONCE()
annotations, paired with WRITE_ONCE() ones in fq_pie_change().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:34:07 +01:00
Eric Dumazet 396a003850 net_sched: sch_fq_codel: implement lockless fq_codel_dump()
Instead of relying on RTNL, fq_codel_dump() can use READ_ONCE()
annotations, paired with WRITE_ONCE() ones in fq_codel_change().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:34:07 +01:00
Eric Dumazet 01daf66b79 net_sched: sch_fifo: implement lockless __fifo_dump()
Instead of relying on RTNL, __fifo_dump() can use READ_ONCE()
annotations, paired with WRITE_ONCE() ones in __fifo_init().

Also add missing READ_ONCE(sh->limit) in bfifo_enqueue(),
pfifo_enqueue() and pfifo_tail_enqueue().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:34:07 +01:00
Eric Dumazet c5f1dde7f7 net_sched: sch_ets: implement lockless ets_dump()
Instead of relying on RTNL, ets_dump() can use READ_ONCE()
annotations, paired with WRITE_ONCE() ones in ets_change().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:34:07 +01:00
Eric Dumazet a1ac3a7c3d net_sched: sch_tfs: implement lockless etf_dump()
Instead of relying on RTNL, codel_dump() can use READ_ONCE()
annotations.

There is no etf_change() yet, this patch imply aligns
this qdisc with others.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:34:07 +01:00
Eric Dumazet c45bd26c82 net_sched: sch_codel: implement lockless codel_dump()
Instead of relying on RTNL, codel_dump() can use READ_ONCE()
annotations, paired with WRITE_ONCE() ones in codel_change().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:34:07 +01:00
Eric Dumazet 7253c1d1e7 net_sched: sch_choke: implement lockless choke_dump()
Instead of relying on RTNL, choke_dump() can use READ_ONCE()
annotations, paired with WRITE_ONCE() ones in choke_change().

v2: added a WRITE_ONCE(p->Scell_log, Scell_log)
    per Simon feedback in V1
    Removed the READ_ONCE(q->limit) in choke_enqueue()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:34:07 +01:00
Eric Dumazet 8eb54a421a net_sched: sch_cbs: implement lockless cbs_dump()
Instead of relying on RTNL, cbs_dump() can use READ_ONCE()
annotations, paired with WRITE_ONCE() ones in cbs_change().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:34:07 +01:00
Eric Dumazet 9263650102 net_sched: cake: implement lockless cake_dump()
Instead of relying on RTNL, cake_dump() can use READ_ONCE()
annotations, paired with WRITE_ONCE() ones in cake_change().

v2: addressed Simon feedback in V1: https://lore.kernel.org/netdev/20240417083549.GA3846178@kernel.org/

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:34:07 +01:00
Eric Dumazet 24bcc30767 net_sched: sch_fq: implement lockless fq_dump()
Instead of relying on RTNL, fq_dump() can use READ_ONCE()
annotations, paired with WRITE_ONCE() in fq_change()

v2: Addressed Simon feedback in V1: https://lore.kernel.org/netdev/20240416181915.GT2320920@kernel.org/

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:34:07 +01:00
Sriram R b0d2d8f996 wifi: mac80211: handle link ID during management Tx
During non-STA management Tx, when source address is same as one of the
link addresses and even when userspace requested Tx on a specific link,
the link ID is not set in the TX control information. Now if the MLD
address is also the same as that of the link address, then mac80211
fills link as "unspecified", since it looks like MLD TX.

This is unexpected, however, since non-STA TX must specify which link
to use. In hwsim, this will (after warnings) result in dropping such
frames as well.

Use and set the link id if the link bss is matching the address and
requested channel.

Signed-off-by: Sriram R <quic_srirrama@quicinc.com>
Signed-off-by: Aditya Kumar Singh <quic_adisi@quicinc.com>
Link: https://msgid.link/20240410052705.169865-1-quic_adisi@quicinc.com
Link: https://lore.kernel.org/r/0496fb7e-53cc-476f-8052-985d82fd8d01@quicinc.com
[reword commit message, should spell out hwsim etc.]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-19 10:54:58 +02:00
Aditya Kumar Singh 6d4ed5b356 wifi: mac80211: handle sdata->u.ap.active flag with MLO
Currently whenever link AP beacon is assigned, sdata->u.ap.active flag is
set and whenever it is brought down, the flag is reset. However, with MLO,
all the links of the same MLD would use the same sdata. Hence there is no
need to set/reset for each link up/down. Also, resetting it  when only one
of the links went down is not desirable.

Add changes to set the active flag only when first link is assigned
beacon. Similarly, add changes to reset that flag only when last link is
brought down.

Signed-off-by: Aditya Kumar Singh <quic_adisi@quicinc.com>
Link: https://msgid.link/20240409094017.3165560-1-quic_adisi@quicinc.com
[remove unnecessary check before constant true assignment]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-19 10:48:01 +02:00
Johannes Berg 2bf78f0f10 wifi: cfg80211: add return docs for regulatory functions
Add return value documentation for regulatory functions
that are missing it.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-19 10:29:08 +02:00
Johannes Berg dbda949b7f wifi: cfg80211: make some regulatory functions void
The return value of regulatory_hint_indoor() is always 0 for
success, and the return value of regulatory_hint_found_beacon()
is always ignored. Make them both have void return.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-19 10:28:25 +02:00
Johannes Berg 630009e244 wifi: mac80211: add return docs for sta_info_flush()
Use the Return: annotation instead of spelling out "Returns" in
the documentation, for both sta_info_flush()/__sta_info_flush().

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-19 10:27:21 +02:00
Benjamin Berg cc3ea42cfa wifi: mac80211: keep mac80211 consistent on link activation failure
In the unlikely event that link_use_channel fails while activating a
link, mac80211 would go into a bad state. Unfortunately, we cannot
completely avoid failures from drivers in this case.

However, what we can do is to just continue internally anyway and assume
the driver is going to trigger a recovery flow from its side. Doing that
means that we at least have a consistent state in mac80211 allowing such
a recovery flow to succeed.

Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://msgid.link/20240418115219.1129e89f4b55.I6299678353e50e88b55c99b0bce15c64b52c2804@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-19 10:19:37 +02:00
Johannes Berg 87f5500285 wifi: mac80211: simplify ieee80211_assign_link_chanctx()
There's no need for a label/goto here, the only thing is
that drv_assign_vif_chanctx() must succeed to set 'conf'
and add the new context to the list, the remaining code
is (and must be) the same regardless.

Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://msgid.link/20240418115219.a94852030d33.I9d647178ab25636372ed79e5312c68a06e0bf60c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-19 10:19:37 +02:00
Johannes Berg 5e0c422d12 wifi: mac80211: reserve chanctx during find
When searching for a chanctx for re-use, it's later adjusted and
assigned. It may also be that another one is already assigned to
the link in question, so unassign can also happen. In short, the
driver is called multiple times. During these callbacks, it may
thus change active links (on another interface), which then can
in turn cause the found chanctx (that's going to be reused) to
get removed and freed.

To avoid this, temporarily assign it to the reserved chanctx and
track the link that wants to use it in the reserved_links list.
This causes the ieee80211_chanctx_refcount() to be increased by
one during these operations, thus avoiding the free.

Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://msgid.link/20240418115219.94ea84c8ee1e.I0b247dbc0cd937ae6367bc0fc7e8d156b5d5f9b1@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-19 10:19:36 +02:00
Miri Korenblit c6d075be96 wifi: mac80211: defer link switch work in reconfig
If a link switch work was queued, and then a restart happened, the
worker might be executed before the reconfig, and obviously it will fail
(the HW might not respond to updates etc.)
So, don't perform the switch if we are in reconfig, instead - do it
at the end of the reconfig.

Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240415112355.1ef1008e3a0a.I19add3f2152dcfd55a759de97b1d09265c1cde98@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-19 10:14:29 +02:00
Johannes Berg 570944a094 wifi: mac80211: transmit deauth only if link is available
There's an issue in that when we disconnect from an AP
due to the AP switching to an unsupported channel, we
might not tell the driver about this before we try to
send the deauth. If the underlying implementation has
detected the quiet CSA, this may cause issues if this
is the only active link. Avoid this by transmitting
(and flushing) the deauth only when there's an active
link available that's not affected by quiet CSA.

Since this introduces link->u.mgd.csa_blocked_tx and we
no longer check sdata->csa_blocked_tx for the TX itself
also rename the latter to csa_blocked_queues.

Fixes: 6f0107d195 ("wifi: mac80211: introduce a feature flag for quiet in CSA")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240415112355.1d91db5e95aa.Iad3a5df3367f305dff48cd61776abfd6cf0fd4ab@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-19 10:14:08 +02:00
Johannes Berg c53d8a5935 wifi: mac80211: fix unaligned le16 access
The AP removal timer field need not be aligned, so the
code shouldn't access it directly, but use unaligned
loads. Use get_unaligned_le16(), which even is shorter
than the current code since it doesn't need a cast.

Fixes: 8eb8dd2ffb ("wifi: mac80211: Support link removal using Reconfiguration ML element")
Reviewed-by: Ilan Peer <ilan.peer@intel.com>
Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240418105220.356788ba0045.I2b3cdb3644e205d5bb10322c345c0499171cf5d2@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-19 10:02:27 +02:00
Johannes Berg cb55e08dba wifi: mac80211: remove link before AP
If the AP removal timer is long, we don't really want to
remove the link immediately. However, we really should do
it _before_ the AP removes it (which happens at or after
count reaches 0), so subtract 1 from the countdown when
scheduling the timer. This causes the link removal work
to run just after the beacon with value 1 is received. If
the counter is already zero, do it immediately.

This fixes an issue where we do the removal too late and
receive a beacon from the AP that's no longer associated
with the MLD, but thus removed EHT and ML elements, and
then we disconnect instead from the whole MLD, since one
of the associated APs changed mode from EHT to HE.

Fixes: 8eb8dd2ffb ("wifi: mac80211: Support link removal using Reconfiguration ML element")
Reviewed-by: Ilan Peer <ilan.peer@intel.com>
Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240418105220.03ac4a09fa74.Ifb8c8d38e3402721a81ce5981568f47b5c5889cb@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-19 10:02:24 +02:00
Johannes Berg 801ea33ae8 wifi: nl80211: don't free NULL coalescing rule
If the parsing fails, we can dereference a NULL pointer here.

Cc: stable@vger.kernel.org
Fixes: be29b99a9b ("cfg80211/nl80211: Add packet coalesce support")
Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240418105220.b328f80406e7.Id75d961050deb05b3e4e354e024866f350c68103@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-19 10:02:17 +02:00
Johannes Berg 2fb5dfe18e wifi: mac80211: mlme: re-parse if AP mode is less than client
If the AP mode ends up being determined less than the client mode,
there may be different reasons for this, e.g. AP misconfiguration.
If this happens in a way that causes e.g. EHT to be rejected, the
elements need to be re-parsed since we'll connect as HE, but not
reparsing means that we'll still think it's OK to use multi-link,
so we can connect in a non-sensical configuration of advertising
only HE on a secondary link. This normally won't happen for the
assoc link because that reuses the mode from authentication, and
if that's not EHT, multi-link association is rejected.

Fix this inconsistency by parsing the elements again if the mode
was different from the first parsing attempt. Print the message a
bit later to avoid printing "determined AP ... to be HE" twice in
cases where ieee80211_determine_ap_chan() returned a lesser mode,
rather than the regulatory downgrades below changing it.

Fixes: 310c8387c6 ("wifi: mac80211: clean up connection process")
Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240418105220.d1f25d92cfe7.Ia21eff6cdcae2f5aca13cf8e742a986af5e70f89@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-19 10:02:14 +02:00
Johannes Berg 1ac6f60aab wifi: mac80211: mlme: fix memory leak
When re-parsing the elements here (with changed mode), free
the original ones first to avoid leaking memory.

Fixes: 310c8387c6 ("wifi: mac80211: clean up connection process")
Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240418105220.458421e3bbff.Icb5b84cba3ea420794cf009cf18ec3d76e434736@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-19 10:02:12 +02:00
Johannes Berg 645acc6f55 wifi: mac80211: mlme: re-parse with correct mode
When doing re-parsing in ieee80211_determine_chan_mode(),
the conn->mode is changed, and the whole point of doing
the parsing again was to parse as the downgraded mode.
However, that didn't actually work, because the setting
was copied before and never changed again. Fix that.

Fixes: 310c8387c6 ("wifi: mac80211: clean up connection process")
Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240418105220.5e0d1fcb5622.Ib0673e0bc90033fd6d387b6a5f107c040eb907cf@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-19 10:02:10 +02:00
Johannes Berg 89884459a0 wifi: mac80211: fix idle calculation with multi-link
The vif's idle state doesn't automatically go to true when
any link removes the channel context, it's only idle when
_all_ links no longer have a channel context. Fix that.

Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240418105220.90df97557702.I05d2228ce85c203b9f2d6da8538cc16dce46752a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-19 10:02:07 +02:00
Kory Maincent (Dent Project) 4d18e3ddf4 net: ethtool: pse-pd: Expand pse commands with the PSE PoE interface
Add PSE PoE interface support in the ethtool pse command.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://lore.kernel.org/r/20240417-feature_poe-v9-3-242293fd1900@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-18 18:27:02 -07:00
Jakub Kicinski 41e3ddb291 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

include/trace/events/rpcgss.h
  386f4a7379 ("trace: events: cleanup deprecated strncpy uses")
  a4833e3aba ("SUNRPC: Fix rpcgss_context trace event acceptor field")

Adjacent changes:

drivers/net/ethernet/intel/ice/ice_tc_lib.c
  2cca35f5dd ("ice: Fix checking for unsupported keys on non-tunnel device")
  784feaa65d ("ice: Add support for PFCP hardware offload in switchdev")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-18 13:12:24 -07:00
Paolo Abeni ac1a21db32 netfilter pull request 24-04-18
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmYgcKsACgkQ1V2XiooU
 IOQnrQ//QxwhrMTNHWTLj5D++SWw0E8fixurVMZ7/ligVINVOUBzKBYgAP5HOCpK
 hbR5LwHxa2g4GAoupcOjRueu+L4lhUYpbczSPUSfQ996cAdgkR+Y8Refn4k8kGKW
 vGmnH+0jMr+7OIMAxh6nZrpzv2lycEzSCZyRag7PP9V+QhEdrOWQNjS45s9KuoCx
 bttWp+nWGmF9OjzX7dwJ6Aviu2KcQO0RPM8hekIQbauoiI/CUtfFwqK+M+sP4RcU
 nrlOqSgtJDPTQoISokhaNIWq9Sk6bJyZ+Bic22NIdMGtlX/e7FFbqlZtXPdY87/X
 hiOa9S+CM4htGElOmRDeb8fXp7Qqp+jeARt8Xb5UWNouVyNg+KC/1ojN71HJ2Ngm
 GhjqYtDoZK28BvHttSmZK7BCjTFOn9c8k37rZ9I5vPnKTV3RuiQyQ6fQDTIw4R+1
 ntCuScx7tiUNwPjFrU20HSIu9j8JDD8JFLbP1nEsyXJz0937TmE1ajodt/bcDc6G
 zYGuEaU950ke5LpUvdkDZZbp+AgOTAHAU+0UX6wCbLDT/KSTaE8RpIXWzXJBg4L2
 17SxBCRIjTmy+eCAYMPgdY3QddOEwTtvnDqzXOmQSbjdBcP/o7nMvQZEwHbel20z
 +ijwMO+MHTBilMGcwu0Phgu+d9qLIDpdg1ZLYAFeul2uFbL5Muc=
 =R1+V
 -----END PGP SIGNATURE-----

Merge tag 'nf-24-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Patch #1 amends a missing spot where the set iterator type is unset.
	 This is fixing a issue in the previous pull request.

Patch #2 fixes the delete set command abort path by restoring state
         of the elements. Reverse logic for the activate (abort) case
	 otherwise element state is not restored, this requires to move
	 the check for active/inactive elements to the set iterator
	 callback. From the deactivate path, toggle the next generation
	 bit and from the activate (abort) path, clear the next generation
	 bitmask.

Patch #3 skips elements already restored by delete set command from the
	 abort path in case there is a previous delete element command in
	 the batch. Check for the next generation bit just like it is done
	 via set iteration to restore maps.

netfilter pull request 24-04-18

* tag 'nf-24-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: fix memleak in map from abort path
  netfilter: nf_tables: restore set elements when delete set fails
  netfilter: nf_tables: missing iterator type in lookup walk
====================

Link: https://lore.kernel.org/r/20240418010948.3332346-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-18 13:12:37 +02:00
Zheng Li eabf425bc6 neighbour: guarantee the localhost connections be established successfully even the ARP table is full
Inter-process communication on localhost should be established successfully
even the ARP table is full, many processes on server machine use the
localhost to communicate such as command-line interface (CLI),
servers hope all CLI commands can be executed successfully even the arp
table is full. Right now CLI commands got timeout when the arp table is
full. Set the parameter of exempt_from_gc to be true for LOOPBACK net
device to keep localhost neigh in arp table, not removed by gc.

the steps of reproduced:
server with "gc_thresh3 = 1024" setting, ping server from more than 1024
same netmask Lan IPv4 addresses, run "ssh localhost" on console interface,
then the command will get timeout.

Signed-off-by: Zheng Li <James.Z.Li@Dell.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240416095343.540-1-lizheng043@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-18 12:01:03 +02:00
Antony Antony aeb48a428d udpencap: Remove Obsolete UDP_ENCAP_ESPINUDP_NON_IKE Support
The UDP_ENCAP_ESPINUDP_NON_IKE mode, introduced into the Linux kernel
in 2004 [2], has remained inactive and obsolete for an extended period.

This mode was originally defined in an early version of an IETF draft
[1] from 2001. By the time it was integrated into the kernel in 2004 [2],
it had already been replaced by UDP_ENCAP_ESPINUDP [3] in later
versions of draft-ietf-ipsec-udp-encaps, particularly in version 06.

Over time, UDP_ENCAP_ESPINUDP_NON_IKE has lost its relevance, with no
known use cases.

With this commit, we remove support for UDP_ENCAP_ESPINUDP_NON_IKE,
simplifying the codebase and eliminating unnecessary complexity.
Kernel will return an error -ENOPROTOOPT if the userspace tries to set
this option.

References:
[1] https://datatracker.ietf.org/doc/html/draft-ietf-ipsec-udp-encaps-00.txt

[2] Commit that added UDP_ENCAP_ESPINUDP_NON_IKE to the Linux historic
    repository.

    Author: Andreas Gruenbacher <agruen@suse.de>
    Date: Fri Apr 9 01:47:47 2004 -0700

   [IPSEC]: Support draft-ietf-ipsec-udp-encaps-00/01, some ipec impls need it.

[3] Commit that added UDP_ENCAP_ESPINUDP to the Linux historic
    repository.

    Author: Derek Atkins <derek@ihtfp.com>
    Date: Wed Apr 2 13:21:02 2003 -0800

    [IPSEC]: Implement UDP Encapsulation framework.

Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-04-18 11:38:23 +02:00
Eric Dumazet ba3de6d803 tcp_metrics: use parallel_ops for tcp_metrics_nl_family
TCP_METRICS_CMD_GET and TCP_METRICS_CMD_DEL use their
own locking (tcp_metrics_lock and RCU),
they do not need genl_mutex protection.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240416162025.1251547-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-17 18:31:53 -07:00
Eric Dumazet ade1c9cc40 tcp_metrics: fix tcp_metrics_nl_dump() return value
Change tcp_metrics_nl_dump() to return 0 at the end
of a dump so that NLMSG_DONE can be appended
to the current skb, saving one recvmsg() system call.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240416161112.1199265-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-17 18:30:58 -07:00
Eric Dumazet 1514b06aff netns: no longer hold RTNL in rtnl_net_dumpid()
- rtnl_net_dumpid() is already fully RCU protected,
  RTNL is not needed there.

- Fix return value at the end of a dump,
  so that NLMSG_DONE can be appended to current skb,
  saving one recvmsg() system call.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://lore.kernel.org/r/20240416140739.967941-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-17 18:30:28 -07:00
Eric Dumazet 0f022d32c3 net/sched: Fix mirred deadlock on device recursion
When the mirred action is used on a classful egress qdisc and a packet is
mirrored or redirected to self we hit a qdisc lock deadlock.
See trace below.

[..... other info removed for brevity....]
[   82.890906]
[   82.890906] ============================================
[   82.890906] WARNING: possible recursive locking detected
[   82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G        W
[   82.890906] --------------------------------------------
[   82.890906] ping/418 is trying to acquire lock:
[   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
__dev_queue_xmit+0x1778/0x3550
[   82.890906]
[   82.890906] but task is already holding lock:
[   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
__dev_queue_xmit+0x1778/0x3550
[   82.890906]
[   82.890906] other info that might help us debug this:
[   82.890906]  Possible unsafe locking scenario:
[   82.890906]
[   82.890906]        CPU0
[   82.890906]        ----
[   82.890906]   lock(&sch->q.lock);
[   82.890906]   lock(&sch->q.lock);
[   82.890906]
[   82.890906]  *** DEADLOCK ***
[   82.890906]
[..... other info removed for brevity....]

Example setup (eth0->eth0) to recreate
tc qdisc add dev eth0 root handle 1: htb default 30
tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
     action mirred egress redirect dev eth0

Another example(eth0->eth1->eth0) to recreate
tc qdisc add dev eth0 root handle 1: htb default 30
tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
     action mirred egress redirect dev eth1

tc qdisc add dev eth1 root handle 1: htb default 30
tc filter add dev eth1 handle 1: protocol ip prio 2 matchall \
     action mirred egress redirect dev eth0

We fix this by adding an owner field (CPU id) to struct Qdisc set after
root qdisc is entered. When the softirq enters it a second time, if the
qdisc owner is the same CPU, the packet is dropped to break the loop.

Reported-by: Mingshuai Ren <renmingshuai@huawei.com>
Closes: https://lore.kernel.org/netdev/20240314111713.5979-1-renmingshuai@huawei.com/
Fixes: 3bcb846ca4 ("net: get rid of spin_trylock() in net_tx_action()")
Fixes: e578d9c025 ("net: sched: use counter to break reclassify loops")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20240415210728.36949-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-17 18:22:52 -07:00
Pablo Neira Ayuso 86a1471d7c netfilter: nf_tables: fix memleak in map from abort path
The delete set command does not rely on the transaction object for
element removal, therefore, a combination of delete element + delete set
from the abort path could result in restoring twice the refcount of the
mapping.

Check for inactive element in the next generation for the delete element
command in the abort path, skip restoring state if next generation bit
has been already cleared. This is similar to the activate logic using
the set walk iterator.

[ 6170.286929] ------------[ cut here ]------------
[ 6170.286939] WARNING: CPU: 6 PID: 790302 at net/netfilter/nf_tables_api.c:2086 nf_tables_chain_destroy+0x1f7/0x220 [nf_tables]
[ 6170.287071] Modules linked in: [...]
[ 6170.287633] CPU: 6 PID: 790302 Comm: kworker/6:2 Not tainted 6.9.0-rc3+ #365
[ 6170.287768] RIP: 0010:nf_tables_chain_destroy+0x1f7/0x220 [nf_tables]
[ 6170.287886] Code: df 48 8d 7d 58 e8 69 2e 3b df 48 8b 7d 58 e8 80 1b 37 df 48 8d 7d 68 e8 57 2e 3b df 48 8b 7d 68 e8 6e 1b 37 df 48 89 ef eb c4 <0f> 0b 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 0f
[ 6170.287895] RSP: 0018:ffff888134b8fd08 EFLAGS: 00010202
[ 6170.287904] RAX: 0000000000000001 RBX: ffff888125bffb28 RCX: dffffc0000000000
[ 6170.287912] RDX: 0000000000000003 RSI: ffffffffa20298ab RDI: ffff88811ebe4750
[ 6170.287919] RBP: ffff88811ebe4700 R08: ffff88838e812650 R09: fffffbfff0623a55
[ 6170.287926] R10: ffffffff8311d2af R11: 0000000000000001 R12: ffff888125bffb10
[ 6170.287933] R13: ffff888125bffb10 R14: dead000000000122 R15: dead000000000100
[ 6170.287940] FS:  0000000000000000(0000) GS:ffff888390b00000(0000) knlGS:0000000000000000
[ 6170.287948] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6170.287955] CR2: 00007fd31fc00710 CR3: 0000000133f60004 CR4: 00000000001706f0
[ 6170.287962] Call Trace:
[ 6170.287967]  <TASK>
[ 6170.287973]  ? __warn+0x9f/0x1a0
[ 6170.287986]  ? nf_tables_chain_destroy+0x1f7/0x220 [nf_tables]
[ 6170.288092]  ? report_bug+0x1b1/0x1e0
[ 6170.287986]  ? nf_tables_chain_destroy+0x1f7/0x220 [nf_tables]
[ 6170.288092]  ? report_bug+0x1b1/0x1e0
[ 6170.288104]  ? handle_bug+0x3c/0x70
[ 6170.288112]  ? exc_invalid_op+0x17/0x40
[ 6170.288120]  ? asm_exc_invalid_op+0x1a/0x20
[ 6170.288132]  ? nf_tables_chain_destroy+0x2b/0x220 [nf_tables]
[ 6170.288243]  ? nf_tables_chain_destroy+0x1f7/0x220 [nf_tables]
[ 6170.288366]  ? nf_tables_chain_destroy+0x2b/0x220 [nf_tables]
[ 6170.288483]  nf_tables_trans_destroy_work+0x588/0x590 [nf_tables]

Fixes: 591054469b ("netfilter: nf_tables: revisit chain/object refcounting from elements")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-18 02:41:32 +02:00
Pablo Neira Ayuso e79b47a861 netfilter: nf_tables: restore set elements when delete set fails
From abort path, nft_mapelem_activate() needs to restore refcounters to
the original state. Currently, it uses the set->ops->walk() to iterate
over these set elements. The existing set iterator skips inactive
elements in the next generation, this does not work from the abort path
to restore the original state since it has to skip active elements
instead (not inactive ones).

This patch moves the check for inactive elements to the set iterator
callback, then it reverses the logic for the .activate case which
needs to skip active elements.

Toggle next generation bit for elements when delete set command is
invoked and call nft_clear() from .activate (abort) path to restore the
next generation bit.

The splat below shows an object in mappings memleak:

[43929.457523] ------------[ cut here ]------------
[43929.457532] WARNING: CPU: 0 PID: 1139 at include/net/netfilter/nf_tables.h:1237 nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables]
[...]
[43929.458014] RIP: 0010:nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables]
[43929.458076] Code: 83 f8 01 77 ab 49 8d 7c 24 08 e8 37 5e d0 de 49 8b 6c 24 08 48 8d 7d 50 e8 e9 5c d0 de 8b 45 50 8d 50 ff 89 55 50 85 c0 75 86 <0f> 0b eb 82 0f 0b eb b3 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90
[43929.458081] RSP: 0018:ffff888140f9f4b0 EFLAGS: 00010246
[43929.458086] RAX: 0000000000000000 RBX: ffff8881434f5288 RCX: dffffc0000000000
[43929.458090] RDX: 00000000ffffffff RSI: ffffffffa26d28a7 RDI: ffff88810ecc9550
[43929.458093] RBP: ffff88810ecc9500 R08: 0000000000000001 R09: ffffed10281f3e8f
[43929.458096] R10: 0000000000000003 R11: ffff0000ffff0000 R12: ffff8881434f52a0
[43929.458100] R13: ffff888140f9f5f4 R14: ffff888151c7a800 R15: 0000000000000002
[43929.458103] FS:  00007f0c687c4740(0000) GS:ffff888390800000(0000) knlGS:0000000000000000
[43929.458107] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[43929.458111] CR2: 00007f58dbe5b008 CR3: 0000000123602005 CR4: 00000000001706f0
[43929.458114] Call Trace:
[43929.458118]  <TASK>
[43929.458121]  ? __warn+0x9f/0x1a0
[43929.458127]  ? nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables]
[43929.458188]  ? report_bug+0x1b1/0x1e0
[43929.458196]  ? handle_bug+0x3c/0x70
[43929.458200]  ? exc_invalid_op+0x17/0x40
[43929.458211]  ? nft_setelem_data_deactivate+0xd7/0xf0 [nf_tables]
[43929.458271]  ? nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables]
[43929.458332]  nft_mapelem_deactivate+0x24/0x30 [nf_tables]
[43929.458392]  nft_rhash_walk+0xdd/0x180 [nf_tables]
[43929.458453]  ? __pfx_nft_rhash_walk+0x10/0x10 [nf_tables]
[43929.458512]  ? rb_insert_color+0x2e/0x280
[43929.458520]  nft_map_deactivate+0xdc/0x1e0 [nf_tables]
[43929.458582]  ? __pfx_nft_map_deactivate+0x10/0x10 [nf_tables]
[43929.458642]  ? __pfx_nft_mapelem_deactivate+0x10/0x10 [nf_tables]
[43929.458701]  ? __rcu_read_unlock+0x46/0x70
[43929.458709]  nft_delset+0xff/0x110 [nf_tables]
[43929.458769]  nft_flush_table+0x16f/0x460 [nf_tables]
[43929.458830]  nf_tables_deltable+0x501/0x580 [nf_tables]

Fixes: 628bd3e49c ("netfilter: nf_tables: drop map element references from preparation phase")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-17 17:43:11 +02:00
Pablo Neira Ayuso efefd4f00c netfilter: nf_tables: missing iterator type in lookup walk
Add missing decorator type to lookup expression and tighten WARN_ON_ONCE
check in pipapo to spot earlier that this is unset.

Fixes: 29b359cf6d ("netfilter: nft_set_pipapo: walk over current view on netlink dump")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-17 17:43:01 +02:00
Eric Dumazet 2bd99aef1b tcp: accept bare FIN packets under memory pressure
Andrew Oates reported that some macOS hosts could repeatedly
send FIN packets even if the remote peer drops them and
send back DUP ACK RWIN 0 packets.

<quoting Andrew>

 20:27:16.968254 gif0  In  IP macos > victim: Flags [SEW], seq 1950399762, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 501897188 ecr 0,sackOK,eol], length 0
 20:27:16.968339 gif0  Out IP victim > macos: Flags [S.E], seq 2995489058, ack 1950399763, win 1448, options [mss 1460,sackOK,TS val 3829877593 ecr 501897188,nop,wscale 0], length 0
 20:27:16.968833 gif0  In  IP macos > victim: Flags [.], ack 1, win 2058, options [nop,nop,TS val 501897188 ecr 3829877593], length 0
 20:27:16.968885 gif0  In  IP macos > victim: Flags [P.], seq 1:1449, ack 1, win 2058, options [nop,nop,TS val 501897188 ecr 3829877593], length 1448
 20:27:16.968896 gif0  Out IP victim > macos: Flags [.], ack 1449, win 0, options [nop,nop,TS val 3829877593 ecr 501897188], length 0
 20:27:19.454593 gif0  In  IP macos > victim: Flags [F.], seq 1449, ack 1, win 2058, options [nop,nop,TS val 501899674 ecr 3829877593], length 0
 20:27:19.454675 gif0  Out IP victim > macos: Flags [.], ack 1449, win 0, options [nop,nop,TS val 3829880079 ecr 501899674], length 0
 20:27:19.455116 gif0  In  IP macos > victim: Flags [F.], seq 1449, ack 1, win 2058, options [nop,nop,TS val 501899674 ecr 3829880079], length 0

 The retransmits/dup-ACKs then repeat in a tight loop.

</quoting Andrew>

RFC 9293 3.4. Sequence Numbers states :

  Note that when the receive window is zero no segments should be
  acceptable except ACK segments.  Thus, it is be possible for a TCP to
  maintain a zero receive window while transmitting data and receiving
  ACKs.  However, even when the receive window is zero, a TCP must
  process the RST and URG fields of all incoming segments.

Even if we could consider a bare FIN.ACK packet to be an ACK in RFC terms,
the retransmits should use exponential backoff.

Accepting the FIN in linux does not add extra memory costs,
because the FIN flag will simply be merged to the tail skb in
the receive queue, and incoming packet is freed.

Reported-by: Andrew Oates <aoates@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Christoph Paasch <cpaasch@apple.com>
Cc: Vidhi Goel <vidhi_goel@apple.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-17 12:58:40 +01:00
Felix Fietkau 8c75cdcdf8 wifi: mac80211: split mesh fast tx cache into local/proxied/forwarded
Depending on the origin of the packets (and their SA), 802.11 + mesh headers
could be filled in differently. In order to properly deal with that, add a
new field to the lookup key, indicating the type (local, proxied or
forwarded). This can fix spurious packet drop issues that depend on the order
in which nodes/hosts communicate with each other.

Fixes: d5edb9ae8d ("wifi: mac80211: mesh fast xmit support")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://msgid.link/20240415121811.13391-1-nbd@nbd.name
[use sizeof_field() for key_len]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-17 09:21:32 +02:00
Kory Maincent (Dent Project) b010bf72da net: pse-pd: Rectify and adapt the naming of admin_cotrol member of struct pse_control_config
In commit 18ff0bcda6 ("ethtool: add interface to interact with Ethernet
Power Equipment"), the 'pse_control_config' structure was introduced,
housing a single member labeled 'admin_cotrol' responsible for maintaining
the operational state of the PoDL PSE functions.

A noticeable typographical error exists in the naming of this field
('cotrol' should be corrected to 'control'), which this commit aims to
rectify.

Furthermore, with upcoming extensions of this structure to encompass PoE
functionalities, the field is being renamed to 'podl_admin_state' to
distinctly indicate that this state is tailored specifically for PoDL."

Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://lore.kernel.org/r/20240414-feature_poe-v8-3-e4bf1e860da5@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-16 19:17:17 -07:00
Florian Westphal 86600ea11d ip6_vti: fix memleak on netns dismantle
kmemleak reports net_device resources are no longer released, restore
needs_free_netdev toggle.  Sample backtrace:

unreferenced object 0xffff88810874f000 (size 4096): [..]
    [<00000000a2b8af8b>] __kmalloc_node+0x209/0x290
    [<0000000040b0a1a9>] alloc_netdev_mqs+0x58/0x470
    [<00000000b4be1e78>] vti6_init_net+0x94/0x230
    [<000000008830c1ea>] ops_init+0x32/0xc0
    [<000000006a26fa8f>] setup_net+0x134/0x2e0
[..]

Cc: Breno Leitao <leitao@debian.org>
Fixes: a9b2d55a8f ("ip6_vti: Do not use custom stat allocator")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240415122346.26503-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-16 17:33:59 -07:00
Colin Ian King c2b640529e net/handshake: remove redundant assignment to variable ret
The variable is being assigned an value and then is being re-assigned
a new value in the next statement. The assignment is redundant and can
be removed.

Cleans up clang scan build warning:
net/handshake/tlshd.c:216:2: warning: Value stored to 'ret' is never
read [deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/20240415100713.483399-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-16 17:14:55 -07:00
Kuniyuki Iwashima fd86344823 af_unix: Try not to hold unix_gc_lock during accept().
Commit dcf70df204 ("af_unix: Fix up unix_edge.successor for embryo
socket.") added spin_lock(&unix_gc_lock) in accept() path, and it
caused regression in a stress test as reported by kernel test robot.

If the embryo socket is not part of the inflight graph, we need not
hold the lock.

To decide that in O(1) time and avoid the regression in the normal
use case,

  1. add a new stat unix_sk(sk)->scm_stat.nr_unix_fds

  2. count the number of inflight AF_UNIX sockets in the receive
     queue under unix_state_lock()

  3. move unix_update_edges() call under unix_state_lock()

  4. avoid locking if nr_unix_fds is 0 in unix_update_edges()

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202404101427.92a08551-oliver.sang@intel.com
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240413021928.20946-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-16 13:50:57 +02:00
Pavel Begunkov 8c9a6f549e io_uring: separate header for exported net bits
We're exporting some io_uring bits to networking, e.g. for implementing
a net callback for io_uring cmds, but we don't want to expose more than
needed. Add a separate header for networking.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://lore.kernel.org/r/20240409210554.1878789-1-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Gabriel Krisman Bertazi 50aee97d15 udp: Avoid call to compute_score on multiple sites
We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
commit f0ea27e7bf ("udp: re-score reuseport groups when connected
sockets are present").  The failing tests were those that would spawn
UDP sockets per-cpu on systems that have a high number of cpus.

Unsurprisingly, it is not caused by the extra re-scoring of the reused
socket, but due to the compiler no longer inlining compute_score, once
it has the extra call site in udp4_lib_lookup2.  This is augmented by
the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.

We could just explicitly inline it, but compute_score() is quite a large
function, around 300b.  Inlining in two sites would almost double
udp4_lib_lookup2, which is a silly thing to do just to workaround a
mitigation.  Instead, this patch shuffles the code a bit to avoid the
multiple calls to compute_score.  Since it is a static function used in
one spot, the compiler can safely fold it in, as it did before, without
increasing the text size.

With this patch applied I ran my original iperf3 testcases.  The failing
cases all looked like this (ipv4):
	iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2

where $R is either 1G/10G/0 (max, unlimited).  I ran 3 times each.
baseline is v6.9-rc3. harmean == harmonic mean; CV == coefficient of
variation.

ipv4:
                 1G                10G                  MAX
	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
baseline 1743852.66(0.0208) 1725933.02(0.0167) 1705203.78(0.0386)
patched  1968727.61(0.0035) 1962283.22(0.0195) 1923853.50(0.0256)

ipv6:
                 1G                10G                  MAX
	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
baseline 1729020.03(0.0028) 1691704.49(0.0243) 1692251.34(0.0083)
patched  1900422.19(0.0067) 1900968.01(0.0067) 1568532.72(0.1519)

This restores the performance we had before the change above with this
benchmark.  We obviously don't expect any real impact when mitigations
are disabled, but just to be sure it also doesn't regresses:

mitigations=off ipv4:
                 1G                10G                  MAX
	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
patched  3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)

Cc: Lorenz Bauer <lmb@isovalent.com>
Fixes: f0ea27e7bf ("udp: re-score reuseport groups when connected sockets are present")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-15 11:59:58 +01:00
Breno Leitao 05d604a577 net: ip6_gre: Remove generic .ndo_get_stats64
Commit 3e2f544dd8 ("net: get stats64 if device if driver is
configured") moved the callback to dev_get_tstats64() to net core, so,
unless the driver is doing some custom stats collection, it does not
need to set .ndo_get_stats64.

Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it
doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64
function pointer.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-15 11:32:13 +01:00
Breno Leitao 8622f90a37 net: ipv6_gre: Do not use custom stat allocator
With commit 34d21de99c ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of in this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Remove the allocation in the ip6_gre and leverage the network
core allocation instead.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-15 11:32:13 +01:00
Russell King (Oracle) a788fafff5 net: dsa: convert dsa_user_phylink_fixed_state() to use dsa_phylink_to_port()
Convert dsa_user_phylink_fixed_state() to use the newly introduced
dsa_phylink_to_port() helper.

Suggested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-15 10:48:41 +01:00
Heiner Kallweit 9382b4f338 net: constify net_class
AFAICS all users of net_class take a const struct class * argument.
Therefore fully constify net_class.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-15 10:46:01 +01:00
Jason Xing 4d0470b9ad net: save some cycles when doing skb_attempt_defer_free()
Normally, we don't face these two exceptions very often meanwhile
we have some chance to meet the condition where the current cpu id
is the same as skb->alloc_cpu.

One simple test that can help us see the frequency of this statement
'cpu == raw_smp_processor_id()':
1. running iperf -s and iperf -c [ip] -P [MAX CPU]
2. using BPF to capture skb_attempt_defer_free()

I can see around 4% chance that happens to satisfy the statement.
So moving this statement at the beginning can save some cycles in
most cases.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-15 10:34:59 +01:00
Jakub Kicinski 3db3b62955 net: dev_addr_lists: move locking out of init/exit in kunit
We lock and unlock rtnl in init/exit for convenience,
but it started causing problems if the exit is handled
by a different thread. To avoid having to futz with
disabling locking assertions move the locking into
the test cases. We don't use ASSERTs so it should
be safe.

   ============= dev-addr-list-test (6 subtests) ==============
   [PASSED] dev_addr_test_basic
   [PASSED] dev_addr_test_sync_one
   [PASSED] dev_addr_test_add_del
   [PASSED] dev_addr_test_del_main
   [PASSED] dev_addr_test_add_set
   [PASSED] dev_addr_test_add_excl
   =============== [PASSED] dev-addr-list-test ================

Link: https://lore.kernel.org/all/20240403131936.787234-7-linux@roeck-us.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-15 10:26:35 +01:00
Jakub Kicinski 460b0d33cf inet: bring NLM_DONE out to a separate recv() again
Commit under Fixes optimized the number of recv() calls
needed during RTM_GETROUTE dumps, but we got multiple
reports of applications hanging on recv() calls.
Applications expect that a route dump will be terminated
with a recv() reading an individual NLM_DONE message.

Coalescing NLM_DONE is perfectly legal in netlink,
but even tho reporters fixed the code in respective
projects, chances are it will take time for those
applications to get updated. So revert to old behavior
(for now)?

Old kernel (5.19):

 $ ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_route.yaml \
            --dump getroute --json '{"rtm-family": 2}'
 Recv: read 692 bytes, 11 messages
   nl_len = 68 (52) nl_flags = 0x22 nl_type = 24
 ...
   nl_len = 60 (44) nl_flags = 0x22 nl_type = 24
 Recv: read 20 bytes, 1 messages
   nl_len = 20 (4) nl_flags = 0x2 nl_type = 3

Before (6.9-rc2):

 $ ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_route.yaml \
            --dump getroute --json '{"rtm-family": 2}'
 Recv: read 712 bytes, 12 messages
   nl_len = 68 (52) nl_flags = 0x22 nl_type = 24
 ...
   nl_len = 60 (44) nl_flags = 0x22 nl_type = 24
   nl_len = 20 (4) nl_flags = 0x2 nl_type = 3

After:

 $ ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_route.yaml \
            --dump getroute --json '{"rtm-family": 2}'
 Recv: read 692 bytes, 11 messages
   nl_len = 68 (52) nl_flags = 0x22 nl_type = 24
 ...
   nl_len = 60 (44) nl_flags = 0x22 nl_type = 24
 Recv: read 20 bytes, 1 messages
   nl_len = 20 (4) nl_flags = 0x2 nl_type = 3

Reported-by: Stefano Brivio <sbrivio@redhat.com>
Link: https://lore.kernel.org/all/20240315124808.033ff58d@elisabeth
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Link: https://lore.kernel.org/all/02b50aae-f0e9-47a4-8365-a977a85975d3@ovn.org
Fixes: 4ce5dc9316 ("inet: switch inet_dump_fib() to RCU protection")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-15 10:25:27 +01:00
Wander Lairson Costa f1e197a665 drop_monitor: replace spin_lock by raw_spin_lock
trace_drop_common() is called with preemption disabled, and it acquires
a spin_lock. This is problematic for RT kernels because spin_locks are
sleeping locks in this configuration, which causes the following splat:

BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 449, name: rcuc/47
preempt_count: 1, expected: 0
RCU nest depth: 2, expected: 2
5 locks held by rcuc/47/449:
 #0: ff1100086ec30a60 ((softirq_ctrl.lock)){+.+.}-{2:2}, at: __local_bh_disable_ip+0x105/0x210
 #1: ffffffffb394a280 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock+0xbf/0x130
 #2: ffffffffb394a280 (rcu_read_lock){....}-{1:2}, at: __local_bh_disable_ip+0x11c/0x210
 #3: ffffffffb394a160 (rcu_callback){....}-{0:0}, at: rcu_do_batch+0x360/0xc70
 #4: ff1100086ee07520 (&data->lock){+.+.}-{2:2}, at: trace_drop_common.constprop.0+0xb5/0x290
irq event stamp: 139909
hardirqs last  enabled at (139908): [<ffffffffb1df2b33>] _raw_spin_unlock_irqrestore+0x63/0x80
hardirqs last disabled at (139909): [<ffffffffb19bd03d>] trace_drop_common.constprop.0+0x26d/0x290
softirqs last  enabled at (139892): [<ffffffffb07a1083>] __local_bh_enable_ip+0x103/0x170
softirqs last disabled at (139898): [<ffffffffb0909b33>] rcu_cpu_kthread+0x93/0x1f0
Preemption disabled at:
[<ffffffffb1de786b>] rt_mutex_slowunlock+0xab/0x2e0
CPU: 47 PID: 449 Comm: rcuc/47 Not tainted 6.9.0-rc2-rt1+ #7
Hardware name: Dell Inc. PowerEdge R650/0Y2G81, BIOS 1.6.5 04/15/2022
Call Trace:
 <TASK>
 dump_stack_lvl+0x8c/0xd0
 dump_stack+0x14/0x20
 __might_resched+0x21e/0x2f0
 rt_spin_lock+0x5e/0x130
 ? trace_drop_common.constprop.0+0xb5/0x290
 ? skb_queue_purge_reason.part.0+0x1bf/0x230
 trace_drop_common.constprop.0+0xb5/0x290
 ? preempt_count_sub+0x1c/0xd0
 ? _raw_spin_unlock_irqrestore+0x4a/0x80
 ? __pfx_trace_drop_common.constprop.0+0x10/0x10
 ? rt_mutex_slowunlock+0x26a/0x2e0
 ? skb_queue_purge_reason.part.0+0x1bf/0x230
 ? __pfx_rt_mutex_slowunlock+0x10/0x10
 ? skb_queue_purge_reason.part.0+0x1bf/0x230
 trace_kfree_skb_hit+0x15/0x20
 trace_kfree_skb+0xe9/0x150
 kfree_skb_reason+0x7b/0x110
 skb_queue_purge_reason.part.0+0x1bf/0x230
 ? __pfx_skb_queue_purge_reason.part.0+0x10/0x10
 ? mark_lock.part.0+0x8a/0x520
...

trace_drop_common() also disables interrupts, but this is a minor issue
because we could easily replace it with a local_lock.

Replace the spin_lock with raw_spin_lock to avoid sleeping in atomic
context.

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reported-by: Hu Chunyu <chuhu@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-15 09:54:15 +01:00
Eric Dumazet 32affa5578 fib: rules: no longer hold RTNL in fib_nl_dumprule()
- fib rules are already RCU protected, RTNL is not needed
  to get them.

- Fix return value at the end of a dump,
  so that NLMSG_DONE can be appended to current skb,
  saving one recvmsg() system call.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240411133340.1332796-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-12 19:09:31 -07:00
Colin Ian King 195b7fc53c tipc: remove redundant assignment to ret, simplify code
Variable err is being assigned a zero value and it is never read
afterwards in either the break path or continue path, the assignment
is redundant and can be removed. With it removed, the if statement
can also be simplified.

Cleans up clang scan warning:
net/tipc/socket.c:3570:5: warning: Value stored to 'err' is never
read [deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20240411091704.306752-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-12 19:07:31 -07:00
Eric Dumazet d13b059623 tcp: small optimization when TCP_TW_SYN is processed
When TCP_TW_SYN is processed, we perform a lookup to find
a listener and jump back in tcp_v6_rcv() and tcp_v4_rcv()

Paolo suggested that we do not have to check if the
found socket is a TIME_WAIT or NEW_SYN_RECV one.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/netdev/68085c8a84538cacaac991415e4ccc72f45e76c2.camel@redhat.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20240411082530.907113-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-12 19:06:50 -07:00
Kuniyuki Iwashima 22dd70eb2c af_unix: Don't peek OOB data without MSG_OOB.
Currently, we can read OOB data without MSG_OOB by using MSG_PEEK
when OOB data is sitting on the front row, which is apparently
wrong.

  >>> from socket import *
  >>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
  >>> c1.send(b'a', MSG_OOB)
  1
  >>> c2.recv(1, MSG_PEEK | MSG_DONTWAIT)
  b'a'

If manage_oob() is called when no data has been copied, we only
check if the socket enables SO_OOBINLINE or MSG_PEEK is not used.
Otherwise, the skb is returned as is.

However, here we should return NULL if MSG_PEEK is set and no data
has been copied.

Also, in such a case, we should not jump to the redo label because
we will be caught in the loop and hog the CPU until normal data
comes in.

Then, we need to handle skb == NULL case with the if-clause below
the manage_oob() block.

With this patch:

  >>> from socket import *
  >>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
  >>> c1.send(b'a', MSG_OOB)
  1
  >>> c2.recv(1, MSG_PEEK | MSG_DONTWAIT)
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  BlockingIOError: [Errno 11] Resource temporarily unavailable

Fixes: 314001f0bf ("af_unix: Add OOB support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240410171016.7621-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-12 19:05:49 -07:00
Kuniyuki Iwashima 283454c8a1 af_unix: Call manage_oob() for every skb in unix_stream_read_generic().
When we call recv() for AF_UNIX socket, we first peek one skb and
calls manage_oob() to check if the skb is sent with MSG_OOB.

However, when we fetch the next (and the following) skb, manage_oob()
is not called now, leading a wrong behaviour.

Let's say a socket send()s "hello" with MSG_OOB and the peer tries
to recv() 5 bytes with MSG_PEEK.  Here, we should get only "hell"
without 'o', but actually not:

  >>> from socket import *
  >>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
  >>> c1.send(b'hello', MSG_OOB)
  5
  >>> c2.recv(5, MSG_PEEK)
  b'hello'

The first skb fills 4 bytes, and the next skb is peeked but not
properly checked by manage_oob().

Let's move up the again label to call manage_oob() for evry skb.

With this patch:

  >>> from socket import *
  >>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
  >>> c1.send(b'hello', MSG_OOB)
  5
  >>> c2.recv(5, MSG_PEEK)
  b'hell'

Fixes: 314001f0bf ("af_unix: Add OOB support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240410171016.7621-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-12 19:05:48 -07:00
Lin Ma a799de0e59 net: nfc: remove inappropriate attrs check
Revert "NFC: fix attrs checks in netlink interface"
This reverts commit 18917d5147.

Our checks found weird attrs present check in function
nfc_genl_dep_link_down() and nfc_genl_llc_get_params(), which are
introduced by commit 18917d5147 ("NFC: fix attrs checks in netlink
interface").

According to its message, it should add checks for functions
nfc_genl_deactivate_target() and nfc_genl_fw_download(). However, it
didn't do that. In fact, the expected checks are added by
(1) commit 385097a367 ("nfc: Ensure presence of required attributes in
the deactivate_target handler") and
(2) commit 280e3ebdaf ("nfc: Ensure presence of NFC_ATTR_FIRMWARE_NAME
attribute in nfc_genl_fw_download()"). Perhaps something went wrong.

Anyway, the attr NFC_ATTR_TARGET_INDEX is never accessed in callback
nfc_genl_dep_link_down() and same for NFC_ATTR_FIRMWARE_NAME and
nfc_genl_llc_get_params(). Thus, remove those checks.

Signed-off-by: Lin Ma <linma@zju.edu.cn>
Link: https://lore.kernel.org/r/20240410034846.167421-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-12 18:52:35 -07:00
David S. Miller 90be7a5ce0 netfilter pull request 24-04-11
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmYXx3kACgkQ1V2XiooU
 IOThqQ//W+3Pq/Y7uI8z1bl8Mzb17q/zB5YSggNQzO9SzW5MMiXXKwMkTLK8huA3
 YXOhy1hTYl0FFEfA6ZWeInOSri9SlXA6j909tvMFuTwNHYn+q6CCdnx0rFwSly8Z
 8LmdlacxFDX7NMZXiJnPcyCZ7LaGrNUcOWKFhvEFzSTSYn3AQpECDPFgw8EhCE2B
 yybOooiqEbeUHgtxC3JQKTO1VZJY/6qpy+MjWNX0VAVAT8ynKTh4DSl9irOMcWHg
 D6yglRpfjJ539zZN48Gz4He0s3i36scrgcZ6QHB1mwLDl3zDM2589RRGk9nfmK8L
 /ELBT8uasNRanFq+yMii2QFxC+r2l6M+83s/Er+X0lVb5Mlu4aI8lzABpdThhlfb
 DLvan0rC5bKYlRsB1lg5HKSopgLeSfhj9ewLh9HFbc1eyG4Qg9ZUrk5sI27rhfju
 c7My9cBfEnF9+rIv/PT0l/chmhFBDdSmNjVfApMF99jbgscb0EGf2paKXEhk0mde
 3ppudYdPiQnx369+4anOx2d7G5U7YhdPZdKPE/cFFZVLOvCgyHCczEh7ZcO/mOfP
 5uMZanlimpXyVcrA4Vu9AQcURyX1S4d2+A8XLecYU5INT/D6VW1nQWjXqR/oesZe
 zzsgWDdIaLUD5hJ3f3lRG37b2msoOa519Urf9eL229K9Uzqn8Xo=
 =tl4K
 -----END PGP SIGNATURE-----

Merge tag 'nf-24-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

netfilter pull request 24-04-11

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Patches #1 and #2 add missing rcu read side lock when iterating over
expression and object type list which could race with module removal.

Patch #3 prevents promisc packet from visiting the bridge/input hook
	 to amend a recent fix to address conntrack confirmation race
	 in br_netfilter and nf_conntrack_bridge.

Patch #4 adds and uses iterate decorator type to fetch the current
	 pipapo set backend datastructure view when netlink dumps the
	 set elements.

Patch #5 fixes removal of duplicate elements in the pipapo set backend.

Patch #6 flowtable validates pppoe header before accessing it.

Patch #7 fixes flowtable datapath for pppoe packets, otherwise lookup
         fails and pppoe packets follow classic path.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-12 13:02:13 +01:00
Russell King (Oracle) cae425cb43 net: dsa: allow DSA switch drivers to provide their own phylink mac ops
Rather than having a shim for each and every phylink MAC operation,
allow DSA switch drivers to provide their own ops structure. When a
DSA driver provides the phylink MAC operations, the shimmed ops must
not be provided, so fail an attempt to register a switch with both
the phylink_mac_ops in struct dsa_switch and the phylink_mac_*
operations populated in dsa_switch_ops populated.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/E1rudqF-006K9H-Cc@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-11 20:01:14 -07:00
Russell King (Oracle) dd0c9855b4 net: dsa: introduce dsa_phylink_to_port()
We convert from a phylink_config struct to a dsa_port struct in many
places, let's provide a helper for this.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/E1rudqA-006K9B-85@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-11 20:01:14 -07:00
Colin Ian King f7ac8fbd32 tls: remove redundant assignment to variable decrypted
The variable decrypted is being assigned a value that is never read,
the control of flow after the assignment is via an return path and
decrypted is not referenced in this path. The assignment is redundant
and can be removed.

Cleans up clang scan warning:
net/tls/tls_sw.c:2150:4: warning: Value stored to 'decrypted' is never
read [deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20240410144136.289030-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-11 20:00:22 -07:00
Guillaume Nault 5618603f5d ipv4: Remove RTO_ONLINK.
RTO_ONLINK was a flag used in ->flowi4_tos that allowed to alter the
scope of an IPv4 route lookup. Setting this flag was equivalent to
specifying RT_SCOPE_LINK in ->flowi4_scope.

With commit ec20b28300 ("ipv4: Set scope explicitly in
ip_route_output()."), the last users of RTO_ONLINK have been removed.
Therefore, we can now drop the code that checked this bit and stop
modifying ->flowi4_scope in ip_route_output_key_hash().

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/57de760565cab55df7b129f523530ac6475865b2.1712754146.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-11 19:52:11 -07:00
Jon Maloy 05ea491641 tcp: add support for SO_PEEK_OFF socket option
When reading received messages from a socket with MSG_PEEK, we may want
to read the contents with an offset, like we can do with pread/preadv()
when reading files. Currently, it is not possible to do that.

In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
in a similar way it is done for Unix Domain sockets.

In the iperf3 log examples shown below, we can observe a throughput
improvement of 15-20 % in the direction host->namespace when using the
protocol splicer 'pasta' (https://passt.top).
This is a consistent result.

pasta(1) and passt(1) implement user-mode networking for network
namespaces (containers) and virtual machines by means of a translation
layer between Layer-2 network interface and native Layer-4 sockets
(TCP, UDP, ICMP/ICMPv6 echo).

Received, pending TCP data to the container/guest is kept in kernel
buffers until acknowledged, so the tool routinely needs to fetch new
data from socket, skipping data that was already sent.

At the moment this is implemented using a dummy buffer passed to
recvmsg(). With this change, we don't need a dummy buffer and the
related buffer copy (copy_to_user()) anymore.

passt and pasta are supported in KubeVirt and libvirt/qemu.

jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
SO_PEEK_OFF not supported by kernel.

jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 44822
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.02 GBytes  8.78 Gbits/sec
[  5]   1.00-2.00   sec  1.06 GBytes  9.08 Gbits/sec
[  5]   2.00-3.00   sec  1.07 GBytes  9.15 Gbits/sec
[  5]   3.00-4.00   sec  1.10 GBytes  9.46 Gbits/sec
[  5]   4.00-5.00   sec  1.03 GBytes  8.85 Gbits/sec
[  5]   5.00-6.00   sec  1.10 GBytes  9.44 Gbits/sec
[  5]   6.00-7.00   sec  1.11 GBytes  9.56 Gbits/sec
[  5]   7.00-8.00   sec  1.07 GBytes  9.20 Gbits/sec
[  5]   8.00-9.00   sec   667 MBytes  5.59 Gbits/sec
[  5]   9.00-10.00  sec  1.03 GBytes  8.83 Gbits/sec
[  5]  10.00-10.04  sec  30.1 MBytes  6.36 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  10.3 GBytes  8.78 Gbits/sec   receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
jmaloy@freyr:~/passt#
logout
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
jmaloy@freyr:~/passt$

jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
SO_PEEK_OFF supported by kernel.

jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 52084
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.32 GBytes  11.3 Gbits/sec
[  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
[  5]   2.00-3.00   sec  1.26 GBytes  10.8 Gbits/sec
[  5]   3.00-4.00   sec  1.36 GBytes  11.7 Gbits/sec
[  5]   4.00-5.00   sec  1.33 GBytes  11.4 Gbits/sec
[  5]   5.00-6.00   sec  1.21 GBytes  10.4 Gbits/sec
[  5]   6.00-7.00   sec  1.31 GBytes  11.2 Gbits/sec
[  5]   7.00-8.00   sec  1.25 GBytes  10.7 Gbits/sec
[  5]   8.00-9.00   sec  1.33 GBytes  11.5 Gbits/sec
[  5]   9.00-10.00  sec  1.24 GBytes  10.7 Gbits/sec
[  5]  10.00-10.04  sec  56.0 MBytes  12.1 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  12.9 GBytes  11.0 Gbits/sec  receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
logout
[ perf record: Woken up 20 times to write data ]
[ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
jmaloy@freyr:~/passt$

The perf record confirms this result. Below, we can observe that the
CPU spends significantly less time in the function ____sys_recvmsg()
when we have offset support.

Without offset support:
----------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
                       -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
46.32%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg

With offset support:
----------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
                       -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
28.12%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240409152805.913891-1-jmaloy@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-11 19:46:55 -07:00
Eric Dumazet e0f89d2864 mpls: no longer hold RTNL in mpls_netconf_dump_devconf()
- Use for_each_netdev_dump() to no longer rely
  on net->dev_index_head hash table.

- No longer care of net->dev_base_seq

- Fix return value at the end of a dump,
  so that NLMSG_DONE can be appended to current skb,
  saving one recvmsg() system call.

- No longer grab RTNL, RCU protection is enough,
  afer adding one READ_ONCE(mdev->input_enabled)
  in mpls_netconf_fill_devconf()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240410111951.2673193-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-11 19:38:37 -07:00
Mina Almasry a580ea994f net: mirror skb frag ref/unref helpers
Refactor some of the skb frag ref/unref helpers for improved clarity.

Implement napi_pp_get_page() to be the mirror counterpart of
napi_pp_put_page().

Implement skb_page_ref() to be the mirror of skb_page_unref().

Improve __skb_frag_ref() to become a mirror counterpart of
__skb_frag_unref(). Previously unref could handle pp & non-pp pages,
while the ref could only handle non-pp pages. Now both the ref & unref
helpers can correctly handle both pp & non-pp pages.

Now that __skb_frag_ref() can handle both pp & non-pp pages, remove
skb_pp_frag_ref(), and use __skb_frag_ref() instead.  This lets us
remove pp specific handling from skb_try_coalesce.

Additionally, since __skb_frag_ref() can now handle both pp & non-pp
pages, a latent issue in skb_shift() should now be fixed. Previously
this function would do a non-pp ref & pp unref on potential pp frags
(fragfrom). After this patch, skb_shift() should correctly do a pp
ref/unref on pp frags.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240410190505.1225848-3-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-11 19:29:23 -07:00
Mina Almasry f6d827b180 net: move skb ref helpers to new header
Add a new header, linux/skbuff_ref.h, which contains all the skb_*_ref()
helpers. Many of the consumers of skbuff.h do not actually use any of
the skb ref helpers, and we can speed up compilation a bit by minimizing
this header file.

Additionally in the later patch in the series we add page_pool support
to skb_frag_ref(), which requires some page_pool dependencies. We can
now add these dependencies to skbuff_ref.h instead of a very ubiquitous
skbuff.h

Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://lore.kernel.org/r/20240410190505.1225848-2-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-11 19:29:22 -07:00
Jakub Kicinski 94426ed213 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/unix/garbage.c
  47d8ac011f ("af_unix: Fix garbage collector racing against connect()")
  4090fa373f ("af_unix: Replace garbage collection algorithm.")

Adjacent changes:

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  faa12ca245 ("bnxt_en: Reset PTP tx_avail after possible firmware reset")
  b3d0083caf ("bnxt_en: Support RSS contexts in ethtool .{get|set}_rxfh()")

drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
  7ac10c7d72 ("bnxt_en: Fix possible memory leak in bnxt_rdma_aux_device_init()")
  194fad5b27 ("bnxt_en: Refactor bnxt_rdma_aux_device_init/uninit functions")

drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
  958f56e483 ("net/mlx5e: Un-expose functions in en.h")
  49e6c93870 ("net/mlx5e: RSS, Block XOR hash with over 128 channels")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-11 14:23:47 -07:00
Linus Torvalds 2ae9a8972c Including fixes from bluetooth.
Current release - new code bugs:
 
   - netfilter: complete validation of user input
 
   - mlx5: disallow SRIOV switchdev mode when in multi-PF netdev
 
 Previous releases - regressions:
 
   - core: fix u64_stats_init() for lockdep when used repeatedly in one file
 
   - ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr
 
   - bluetooth: fix memory leak in hci_req_sync_complete()
 
   - batman-adv: avoid infinite loop trying to resize local TT
 
   - drv: geneve: fix header validation in geneve[6]_xmit_skb
 
   - drv: bnxt_en: fix possible memory leak in bnxt_rdma_aux_device_init()
 
   - drv: mlx5: offset comp irq index in name by one
 
   - drv: ena: avoid double-free clearing stale tx_info->xdpf value
 
   - drv: pds_core: fix pdsc_check_pci_health deadlock
 
 Previous releases - always broken:
 
   - xsk: validate user input for XDP_{UMEM|COMPLETION}_FILL_RING
 
   - bluetooth: fix setsockopt not validating user input
 
   - af_unix: clear stale u->oob_skb.
 
   - nfc: llcp: fix nfc_llcp_setsockopt() unsafe copies
 
   - drv: virtio_net: fix guest hangup on invalid RSS update
 
   - drv: mlx5e: Fix mlx5e_priv_init() cleanup flow
 
   - dsa: mt7530: trap link-local frames regardless of ST Port State
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmYXyoQSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOk72wQAJJ9DQra9b/8S3Zla1dutBcznSCxruas
 vWrpgIZiT3Aw5zUmZUZn+rNP8xeWLBK78Yv4m236B8/D3Kji2uMbVrjUAApBBcHr
 /lLmctZIhDHCoJCYvRTC/VOVPuqbbbmxOmx6rNvry93iNAiHAnBdCUlOYzMNPzJz
 6XIGtztFP0ICtt9owFtQRnsPeKhZJ5DoxqgE9KS2Pmb9PU99i1bEShpwLwB5I83S
 yTHKUY5W0rknkQZTW1gbv+o3dR0iFy7LZ+1FItJ/UzH0bG6JmcqzSlH5mZZJCc/L
 5LdUwtwMmKG2Kez/vKr1DAwTeAyhwVU+d+Hb28QXiO0kAYbjbOgNXse1st3RwDt5
 YKMKlsmR+kgPYLcvs9df2aubNSRvi2utwIA2kuH33HxBYF5PfQR5PTGeR21A+cKo
 wvSit8aMaGFTPJ7rRIzkNaPdIHSvPMKYcXV/T8EPvlOHzi5GBX0qHWj99JO9Eri+
 VFci+FG3HCPHK8v683g/WWiiVNx/IHMfNbcukes1oDFsCeNo7KZcnPY+zVhtdvvt
 QBnvbAZGKeDXMbnHZLB3DCR3ENHWTrJzC3alLDp3/uFC79VKtfIRO2wEX3gkrN8S
 JHsdYU13Yp1ERaNjUeq7Sqk2OGLfsBt4HSOhcK8OPPgE5rDRON5UPjkuNvbaEiZY
 Morzaqzerg1B
 =a9bB
 -----END PGP SIGNATURE-----

Merge tag 'net-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from bluetooth.

  Current release - new code bugs:

   - netfilter: complete validation of user input

   - mlx5: disallow SRIOV switchdev mode when in multi-PF netdev

  Previous releases - regressions:

   - core: fix u64_stats_init() for lockdep when used repeatedly in one
     file

   - ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

   - bluetooth: fix memory leak in hci_req_sync_complete()

   - batman-adv: avoid infinite loop trying to resize local TT

   - drv: geneve: fix header validation in geneve[6]_xmit_skb

   - drv: bnxt_en: fix possible memory leak in
     bnxt_rdma_aux_device_init()

   - drv: mlx5: offset comp irq index in name by one

   - drv: ena: avoid double-free clearing stale tx_info->xdpf value

   - drv: pds_core: fix pdsc_check_pci_health deadlock

  Previous releases - always broken:

   - xsk: validate user input for XDP_{UMEM|COMPLETION}_FILL_RING

   - bluetooth: fix setsockopt not validating user input

   - af_unix: clear stale u->oob_skb.

   - nfc: llcp: fix nfc_llcp_setsockopt() unsafe copies

   - drv: virtio_net: fix guest hangup on invalid RSS update

   - drv: mlx5e: Fix mlx5e_priv_init() cleanup flow

   - dsa: mt7530: trap link-local frames regardless of ST Port State"

* tag 'net-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (59 commits)
  net: ena: Set tx_info->xdpf value to NULL
  net: ena: Fix incorrect descriptor free behavior
  net: ena: Wrong missing IO completions check order
  net: ena: Fix potential sign extension issue
  af_unix: Fix garbage collector racing against connect()
  net: dsa: mt7530: trap link-local frames regardless of ST Port State
  Revert "s390/ism: fix receive message buffer allocation"
  net: sparx5: fix wrong config being used when reconfiguring PCS
  net/mlx5: fix possible stack overflows
  net/mlx5: Disallow SRIOV switchdev mode when in multi-PF netdev
  net/mlx5e: RSS, Block XOR hash with over 128 channels
  net/mlx5e: Do not produce metadata freelist entries in Tx port ts WQE xmit
  net/mlx5e: HTB, Fix inconsistencies with QoS SQs number
  net/mlx5e: Fix mlx5e_priv_init() cleanup flow
  net/mlx5e: RSS, Block changing channels number when RXFH is configured
  net/mlx5: Correctly compare pkt reformat ids
  net/mlx5: Properly link new fs rules into the tree
  net/mlx5: offset comp irq index in name by one
  net/mlx5: Register devlink first under devlink lock
  net/mlx5: E-switch, store eswitch pointer before registering devlink_param
  ...
2024-04-11 11:46:31 -07:00
Geliang Tang 18d82cde74 mptcp: add last time fields in mptcp_info
This patch adds "last time" fields last_data_sent, last_data_recv and
last_ack_recv in struct mptcp_sock to record the last time data_sent,
data_recv and ack_recv happened. They all are initialized as
tcp_jiffies32 in __mptcp_init_sock(), and updated as tcp_jiffies32 too
when data is sent in __subflow_push_pending(), data is received in
__mptcp_move_skbs_from_subflow(), and ack is received in ack_update_msk().

Similar to tcpi_last_data_sent, tcpi_last_data_recv and tcpi_last_ack_recv
exposed with TCP, this patch exposes the last time "an action happened" for
MPTCP in mptcp_info, named mptcpi_last_data_sent, mptcpi_last_data_recv and
mptcpi_last_ack_recv, calculated in mptcp_diag_fill_info() as the time
deltas between now and the newly added last time fields in mptcp_sock.

Since msk->last_ack_recv needs to be protected by mptcp_data_lock/unlock,
and lock_sock_fast can sleep and be quite slow, move the entire
mptcp_data_lock/unlock block after the lock/unlock_sock_fast block.
Then mptcpi_last_data_sent and mptcpi_last_data_recv are set in
lock/unlock_sock_fast block, while mptcpi_last_ack_recv is set in
mptcp_data_lock/unlock block, which is protected by a spinlock and
should not block for too long.

Also add three reserved bytes in struct mptcp_info not to have holes in
this structure exposed to userspace.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/446
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20240410-upstream-net-next-20240405-mptcp-last-time-info-v2-1-f95bd6b33e51@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-11 08:19:54 -07:00
Pablo Neira Ayuso 6db5dc7b35 netfilter: flowtable: incorrect pppoe tuple
pppoe traffic reaching ingress path does not match the flowtable entry
because the pppoe header is expected to be at the network header offset.
This bug causes a mismatch in the flow table lookup, so pppoe packets
enter the classical forwarding path.

Fixes: 72efd585f7 ("netfilter: flowtable: add pppoe support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-11 12:14:10 +02:00
Pablo Neira Ayuso 87b3593bed netfilter: flowtable: validate pppoe header
Ensure there is sufficient room to access the protocol field of the
PPPoe header. Validate it once before the flowtable lookup, then use a
helper function to access protocol field.

Reported-by: syzbot+b6f07e1c07ef40199081@syzkaller.appspotmail.com
Fixes: 72efd585f7 ("netfilter: flowtable: add pppoe support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-11 12:13:11 +02:00
Florian Westphal 3cfc9ec039 netfilter: nft_set_pipapo: do not free live element
Pablo reports a crash with large batches of elements with a
back-to-back add/remove pattern.  Quoting Pablo:

  add_elem("00000000") timeout 100 ms
  ...
  add_elem("0000000X") timeout 100 ms
  del_elem("0000000X") <---------------- delete one that was just added
  ...
  add_elem("00005000") timeout 100 ms

  1) nft_pipapo_remove() removes element 0000000X
  Then, KASAN shows a splat.

Looking at the remove function there is a chance that we will drop a
rule that maps to a non-deactivated element.

Removal happens in two steps, first we do a lookup for key k and return the
to-be-removed element and mark it as inactive in the next generation.
Then, in a second step, the element gets removed from the set/map.

The _remove function does not work correctly if we have more than one
element that share the same key.

This can happen if we insert an element into a set when the set already
holds an element with same key, but the element mapping to the existing
key has timed out or is not active in the next generation.

In such case its possible that removal will unmap the wrong element.
If this happens, we will leak the non-deactivated element, it becomes
unreachable.

The element that got deactivated (and will be freed later) will
remain reachable in the set data structure, this can result in
a crash when such an element is retrieved during lookup (stale
pointer).

Add a check that the fully matching key does in fact map to the element
that we have marked as inactive in the deactivation step.
If not, we need to continue searching.

Add a bug/warn trap at the end of the function as well, the remove
function must not ever be called with an invisible/unreachable/non-existent
element.

v2: avoid uneeded temporary variable (Stefano)

Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-11 12:10:13 +02:00
Pablo Neira Ayuso 29b359cf6d netfilter: nft_set_pipapo: walk over current view on netlink dump
The generation mask can be updated while netlink dump is in progress.
The pipapo set backend walk iterator cannot rely on it to infer what
view of the datastructure is to be used. Add notation to specify if user
wants to read/update the set.

Based on patch from Florian Westphal.

Fixes: 2b84e215f8 ("netfilter: nft_set_pipapo: .walk does not deal with generations")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-11 12:10:03 +02:00
Pablo Neira Ayuso 751de2012e netfilter: br_netfilter: skip conntrack input hook for promisc packets
For historical reasons, when bridge device is in promisc mode, packets
that are directed to the taps follow bridge input hook path. This patch
adds a workaround to reset conntrack for these packets.

Jianbo Liu reports warning splats in their test infrastructure where
cloned packets reach the br_netfilter input hook to confirm the
conntrack object.

Scratch one bit from BR_INPUT_SKB_CB to annotate that this packet has
reached the input hook because it is passed up to the bridge device to
reach the taps.

[   57.571874] WARNING: CPU: 1 PID: 0 at net/bridge/br_netfilter_hooks.c:616 br_nf_local_in+0x157/0x180 [br_netfilter]
[   57.572749] Modules linked in: xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat xt_addrtype xt_conntrack nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_isc si ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core mlx5ctl mlx5_core
[   57.575158] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.8.0+ #19
[   57.575700] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[   57.576662] RIP: 0010:br_nf_local_in+0x157/0x180 [br_netfilter]
[   57.577195] Code: fe ff ff 41 bd 04 00 00 00 be 04 00 00 00 e9 4a ff ff ff be 04 00 00 00 48 89 ef e8 f3 a9 3c e1 66 83 ad b4 00 00 00 04 eb 91 <0f> 0b e9 f1 fe ff ff 0f 0b e9 df fe ff ff 48 89 df e8 b3 53 47 e1
[   57.578722] RSP: 0018:ffff88885f845a08 EFLAGS: 00010202
[   57.579207] RAX: 0000000000000002 RBX: ffff88812dfe8000 RCX: 0000000000000000
[   57.579830] RDX: ffff88885f845a60 RSI: ffff8881022dc300 RDI: 0000000000000000
[   57.580454] RBP: ffff88885f845a60 R08: 0000000000000001 R09: 0000000000000003
[   57.581076] R10: 00000000ffff1300 R11: 0000000000000002 R12: 0000000000000000
[   57.581695] R13: ffff8881047ffe00 R14: ffff888108dbee00 R15: ffff88814519b800
[   57.582313] FS:  0000000000000000(0000) GS:ffff88885f840000(0000) knlGS:0000000000000000
[   57.583040] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   57.583564] CR2: 000000c4206aa000 CR3: 0000000103847001 CR4: 0000000000370eb0
[   57.584194] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[   57.584820] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[   57.585440] Call Trace:
[   57.585721]  <IRQ>
[   57.585976]  ? __warn+0x7d/0x130
[   57.586323]  ? br_nf_local_in+0x157/0x180 [br_netfilter]
[   57.586811]  ? report_bug+0xf1/0x1c0
[   57.587177]  ? handle_bug+0x3f/0x70
[   57.587539]  ? exc_invalid_op+0x13/0x60
[   57.587929]  ? asm_exc_invalid_op+0x16/0x20
[   57.588336]  ? br_nf_local_in+0x157/0x180 [br_netfilter]
[   57.588825]  nf_hook_slow+0x3d/0xd0
[   57.589188]  ? br_handle_vlan+0x4b/0x110
[   57.589579]  br_pass_frame_up+0xfc/0x150
[   57.589970]  ? br_port_flags_change+0x40/0x40
[   57.590396]  br_handle_frame_finish+0x346/0x5e0
[   57.590837]  ? ipt_do_table+0x32e/0x430
[   57.591221]  ? br_handle_local_finish+0x20/0x20
[   57.591656]  br_nf_hook_thresh+0x4b/0xf0 [br_netfilter]
[   57.592286]  ? br_handle_local_finish+0x20/0x20
[   57.592802]  br_nf_pre_routing_finish+0x178/0x480 [br_netfilter]
[   57.593348]  ? br_handle_local_finish+0x20/0x20
[   57.593782]  ? nf_nat_ipv4_pre_routing+0x25/0x60 [nf_nat]
[   57.594279]  br_nf_pre_routing+0x24c/0x550 [br_netfilter]
[   57.594780]  ? br_nf_hook_thresh+0xf0/0xf0 [br_netfilter]
[   57.595280]  br_handle_frame+0x1f3/0x3d0
[   57.595676]  ? br_handle_local_finish+0x20/0x20
[   57.596118]  ? br_handle_frame_finish+0x5e0/0x5e0
[   57.596566]  __netif_receive_skb_core+0x25b/0xfc0
[   57.597017]  ? __napi_build_skb+0x37/0x40
[   57.597418]  __netif_receive_skb_list_core+0xfb/0x220

Fixes: 62e7151ae3 ("netfilter: bridge: confirm multicast packets before passing them up the stack")
Reported-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-11 12:09:33 +02:00
Ziyang Xuan d78d867dce netfilter: nf_tables: Fix potential data-race in __nft_obj_type_get()
nft_unregister_obj() can concurrent with __nft_obj_type_get(),
and there is not any protection when iterate over nf_tables_objects
list in __nft_obj_type_get(). Therefore, there is potential data-race
of nf_tables_objects list entry.

Use list_for_each_entry_rcu() to iterate over nf_tables_objects
list in __nft_obj_type_get(), and use rcu_read_lock() in the caller
nft_obj_type_get() to protect the entire type query process.

Fixes: e50092404c ("netfilter: nf_tables: add stateful objects")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-11 12:09:11 +02:00
Ziyang Xuan f969eb84ce netfilter: nf_tables: Fix potential data-race in __nft_expr_type_get()
nft_unregister_expr() can concurrent with __nft_expr_type_get(),
and there is not any protection when iterate over nf_tables_expressions
list in __nft_expr_type_get(). Therefore, there is potential data-race
of nf_tables_expressions list entry.

Use list_for_each_entry_rcu() to iterate over nf_tables_expressions
list in __nft_expr_type_get(), and use rcu_read_lock() in the caller
nft_expr_type_get() to protect the entire type query process.

Fixes: ef1f7df917 ("netfilter: nf_tables: expression ops overloading")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-11 12:08:34 +02:00
Paolo Abeni fe3eb40672 bluetooth pull request for net:
- L2CAP: Don't double set the HCI_CONN_MGMT_CONNECTED bit
   - Fix memory leak in hci_req_sync_complete
   - hci_sync: Fix using the same interval and window for Coded PHY
   - Fix not validating setsockopt user input
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmYW5JMZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKcQjD/9JiPI2Tdb+LQ8g8WZDVEIG
 CefzSzDU1BQLDrU4JaaORPCUjTLNT+dDMTbDzcstmnL3g9yhMzUB8IfcTgcXWa7Z
 gs/cFyssPVtNEoDCZQwB3m83Hrx0e4dPxQAzs8qlIZDhedP33Uohy76jPiLqgoLM
 GtogGfezLthXeQlsCymwS7qju+37QW+GaBdid8N0g7YrAMPEqIkYRrx21OUxTKok
 q/4p9BDeYDpA7JdWXv3Izr2HT0cm6eaCkVu0rANj1pYSytalZxe2GAb10bP/uTLi
 DPxObxOz7J7gh93T6wDZhG3NHZIhRN9yBlO+9FAqKSs6RlPdBq33xNs+Fnilf2Rd
 iKu/cfjtbmc0N/NWih8dpnCMhNU277WJFJMZlOh2Wu0FwSTg0Sqy+sdNicm1cHP3
 oR46lbsD4ctM1JhGX5Whl2BxMR7D8ty3R/d+sTxsICY9MoUFSAzwnsN1+jwCyeW2
 PgPkHEs/gptA8l5XWyw33Rt206PO3B2jgdy96aJ+RN1fU8FkAl6WxnqXZZE2C8cN
 RcYFrFF6iq8pqyey9Y6+9dmzXolcxZ6bC0ooB84gGLRz5L2ptnD7NB1G8XvPkpKJ
 wyVLAdQ6w0KNEgorQkOvh4JVF3/dG04yE841Qz1MsOo8Pmr74j7vfFvko72uGdEC
 49B98NoX3xEmcC815aO3rQ==
 =YwaX
 -----END PGP SIGNATURE-----

Merge tag 'for-net-2024-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

  - L2CAP: Don't double set the HCI_CONN_MGMT_CONNECTED bit
  - Fix memory leak in hci_req_sync_complete
  - hci_sync: Fix using the same interval and window for Coded PHY
  - Fix not validating setsockopt user input

* tag 'for-net-2024-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: l2cap: Don't double set the HCI_CONN_MGMT_CONNECTED bit
  Bluetooth: hci_sock: Fix not validating setsockopt user input
  Bluetooth: ISO: Fix not validating setsockopt user input
  Bluetooth: L2CAP: Fix not validating setsockopt user input
  Bluetooth: RFCOMM: Fix not validating setsockopt user input
  Bluetooth: SCO: Fix not validating setsockopt user input
  Bluetooth: Fix memory leak in hci_req_sync_complete()
  Bluetooth: hci_sync: Fix using the same interval and window for Coded PHY
  Bluetooth: ISO: Don't reject BT_ISO_QOS if parameters are unset
====================

Link: https://lore.kernel.org/r/20240410191610.4156653-1-luiz.dentz@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-11 10:42:43 +02:00
Antony Antony 8b06a24bb6 xfrm: fix possible derferencing in error path
Fix derferencing pointer when xfrm_policy_lookup_bytype returns an
 error.

Fixes: 63b21caba1 ("xfrm: introduce forwarding of ICMP Error messages")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/kernel-janitors/f6ef0d0d-96de-4e01-9dc3-c1b3a6338653@moroto.mountain/
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-04-11 10:23:36 +02:00
Michal Luczaj 47d8ac011f af_unix: Fix garbage collector racing against connect()
Garbage collector does not take into account the risk of embryo getting
enqueued during the garbage collection. If such embryo has a peer that
carries SCM_RIGHTS, two consecutive passes of scan_children() may see a
different set of children. Leading to an incorrectly elevated inflight
count, and then a dangling pointer within the gc_inflight_list.

sockets are AF_UNIX/SOCK_STREAM
S is an unconnected socket
L is a listening in-flight socket bound to addr, not in fdtable
V's fd will be passed via sendmsg(), gets inflight count bumped

connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
----------------	-------------------------	-----------

NS = unix_create1()
skb1 = sock_wmalloc(NS)
L = unix_find_other(addr)
unix_state_lock(L)
unix_peer(S) = NS
			// V count=1 inflight=0

 			NS = unix_peer(S)
 			skb2 = sock_alloc()
			skb_queue_tail(NS, skb2[V])

			// V became in-flight
			// V count=2 inflight=1

			close(V)

			// V count=1 inflight=1
			// GC candidate condition met

						for u in gc_inflight_list:
						  if (total_refs == inflight_refs)
						    add u to gc_candidates

						// gc_candidates={L, V}

						for u in gc_candidates:
						  scan_children(u, dec_inflight)

						// embryo (skb1) was not
						// reachable from L yet, so V's
						// inflight remains unchanged
__skb_queue_tail(L, skb1)
unix_state_unlock(L)
						for u in gc_candidates:
						  if (u.inflight)
						    scan_children(u, inc_inflight_move_tail)

						// V count=1 inflight=2 (!)

If there is a GC-candidate listening socket, lock/unlock its state. This
makes GC wait until the end of any ongoing connect() to that socket. After
flipping the lock, a possibly SCM-laden embryo is already enqueued. And if
there is another embryo coming, it can not possibly carry SCM_RIGHTS. At
this point, unix_inflight() can not happen because unix_gc_lock is already
taken. Inflight graph remains unaffected.

Fixes: 1fd05ba5a2 ("[AF_UNIX]: Rewrite garbage collector, fixes race.")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240409201047.1032217-1-mhal@rbox.co
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-11 09:46:15 +02:00
Yonghong Song 699c23f02c bpf: Add bpf_link support for sk_msg and sk_skb progs
Add bpf_link support for sk_msg and sk_skb programs. We have an
internal request to support bpf_link for sk_msg programs so user
space can have a uniform handling with bpf_link based libbpf
APIs. Using bpf_link based libbpf API also has a benefit which
makes system robust by decoupling prog life cycle and
attachment life cycle.

Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240410043527.3737160-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-10 19:52:25 -07:00
Eric Dumazet 65acf6e050 netfilter: complete validation of user input
In my recent commit, I missed that do_replace() handlers
use copy_from_sockptr() (which I fixed), followed
by unsafe copy_from_sockptr_offset() calls.

In all functions, we can perform the @optlen validation
before even calling xt_alloc_table_info() with the following
check:

if ((u64)optlen < (u64)tmp.size + sizeof(tmp))
        return -EINVAL;

Fixes: 0c83842df4 ("netfilter: validate user input for expected length")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://lore.kernel.org/r/20240409120741.3538135-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-10 19:42:56 -07:00
Pavel Begunkov d8415a165c net: use SKB_CONSUMED in skb_attempt_defer_free()
skb_attempt_defer_free() is used to free already processed skbs, so pass
SKB_CONSUMED as the reason in kfree_skb_napi_cache().

Suggested-by: Jason Xing <kerneljasonxing@gmail.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/bcf5dbdda79688b074ab7ae2238535840a6d3fc2.1712711977.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-10 19:27:32 -07:00
Pavel Begunkov 7cb31c46b9 net: cache for same cpu skb_attempt_defer_free
Optimise skb_attempt_defer_free() when run by the same CPU the skb was
allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
disable softirqs and put the buffer into cpu local caches.

CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
I'd expect the win doubled with rx only benchmarks, as the optimisation
is for the receive path, but the test spends >55% of CPU doing writes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/a887463fb219d973ec5ad275e31194812571f1f5.1712711977.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-10 19:27:32 -07:00
Eric Dumazet 9b9fd45869 tcp: tweak tcp_sock_write_txrx size assertion
I forgot 32bit arches might have 64bit alignment for u64
fields.

tcp_sock_write_txrx group does not contain pointers,
but two u64 fields. It is possible that on 32bit kernel,
a 32bit hole is before tp->tcp_clock_cache.

I will try to remember a group can be bigger on 32bit
kernels in the future.

With help from Vladimir Oltean.

Fixes: d2c3a7eb1a ("tcp: more struct tcp_sock adjustments")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202404082207.HCEdQhUO-lkp@intel.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20240409140914.4105429-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-10 18:55:05 -07:00
Archie Pusaka 600b0bbe73 Bluetooth: l2cap: Don't double set the HCI_CONN_MGMT_CONNECTED bit
The bit is set and tested inside mgmt_device_connected(), therefore we
must not set it just outside the function.

Fixes: eeda1bf97b ("Bluetooth: hci_event: Fix not indicating new connection for BIG Sync")
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Manish Mandlik <mmandlik@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-04-10 15:10:16 -04:00
Luiz Augusto von Dentz b2186061d6 Bluetooth: hci_sock: Fix not validating setsockopt user input
Check user input length before copying data.

Fixes: 09572fca72 ("Bluetooth: hci_sock: Add support for BT_{SND,RCV}BUF")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-04-10 15:03:52 -04:00
Luiz Augusto von Dentz 9e8742cdfc Bluetooth: ISO: Fix not validating setsockopt user input
Check user input length before copying data.

Fixes: ccf74f2390 ("Bluetooth: Add BTPROTO_ISO socket type")
Fixes: 0731c5ab4d ("Bluetooth: ISO: Add support for BT_PKT_STATUS")
Fixes: f764a6c2c1 ("Bluetooth: ISO: Add broadcast support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-04-10 15:03:50 -04:00
Luiz Augusto von Dentz 4f3951242a Bluetooth: L2CAP: Fix not validating setsockopt user input
Check user input length before copying data.

Fixes: 33575df7be ("Bluetooth: move l2cap_sock_setsockopt() to l2cap_sock.c")
Fixes: 3ee7b7cd83 ("Bluetooth: Add BT_MODE socket option")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-04-10 15:03:49 -04:00
Luiz Augusto von Dentz a97de7bff1 Bluetooth: RFCOMM: Fix not validating setsockopt user input
syzbot reported rfcomm_sock_setsockopt_old() is copying data without
checking user input length.

BUG: KASAN: slab-out-of-bounds in copy_from_sockptr_offset
include/linux/sockptr.h:49 [inline]
BUG: KASAN: slab-out-of-bounds in copy_from_sockptr
include/linux/sockptr.h:55 [inline]
BUG: KASAN: slab-out-of-bounds in rfcomm_sock_setsockopt_old
net/bluetooth/rfcomm/sock.c:632 [inline]
BUG: KASAN: slab-out-of-bounds in rfcomm_sock_setsockopt+0x893/0xa70
net/bluetooth/rfcomm/sock.c:673
Read of size 4 at addr ffff8880209a8bc3 by task syz-executor632/5064

Fixes: 9f2c8a03fb ("Bluetooth: Replace RFCOMM link mode with security level")
Fixes: bb23c0ab82 ("Bluetooth: Add support for deferring RFCOMM connection setup")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-04-10 15:03:47 -04:00
Luiz Augusto von Dentz 51eda36d33 Bluetooth: SCO: Fix not validating setsockopt user input
syzbot reported sco_sock_setsockopt() is copying data without
checking user input length.

BUG: KASAN: slab-out-of-bounds in copy_from_sockptr_offset
include/linux/sockptr.h:49 [inline]
BUG: KASAN: slab-out-of-bounds in copy_from_sockptr
include/linux/sockptr.h:55 [inline]
BUG: KASAN: slab-out-of-bounds in sco_sock_setsockopt+0xc0b/0xf90
net/bluetooth/sco.c:893
Read of size 4 at addr ffff88805f7b15a3 by task syz-executor.5/12578

Fixes: ad10b1a487 ("Bluetooth: Add Bluetooth socket voice option")
Fixes: b96e9c671b ("Bluetooth: Add BT_DEFER_SETUP option to sco socket")
Fixes: 00398e1d51 ("Bluetooth: Add support for BT_PKT_STATUS CMSG data for SCO connections")
Fixes: f6873401a6 ("Bluetooth: Allow setting of codec for HFP offload use case")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-04-10 15:03:46 -04:00
Dmitry Antipov 45d355a926 Bluetooth: Fix memory leak in hci_req_sync_complete()
In 'hci_req_sync_complete()', always free the previous sync
request state before assigning reference to a new one.

Reported-by: syzbot+39ec16ff6cc18b1d066d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=39ec16ff6cc18b1d066d
Cc: stable@vger.kernel.org
Fixes: f60cb30579 ("Bluetooth: Convert hci_req_sync family of function to new request API")
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-04-10 15:03:44 -04:00
Luiz Augusto von Dentz 53cb4197e6 Bluetooth: hci_sync: Fix using the same interval and window for Coded PHY
Coded PHY recommended intervals are 3 time bigger than the 1M PHY so
this aligns with that by multiplying by 3 the values given to 1M PHY
since the code already used recommended values for that.

Fixes: 288c90224e ("Bluetooth: Enable all supported LE PHY by default")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-04-10 15:03:43 -04:00
Luiz Augusto von Dentz b37cab587a Bluetooth: ISO: Don't reject BT_ISO_QOS if parameters are unset
Consider certain values (0x00) as unset and load proper default if
an application has not set them properly.

Fixes: 0fe8c8d071 ("Bluetooth: Split bt_iso_qos into dedicated structures")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-04-10 15:03:42 -04:00
Jianbo Liu 2ecd487b67 net: sched: cls_api: fix slab-use-after-free in fl_dump_key
The filter counter is updated under the protection of cb_lock in the
cited commit. While waiting for the lock, it's possible the filter is
being deleted by other thread, and thus causes UAF when dump it.

Fix this issue by moving tcf_block_filter_cnt_update() after
tfilter_put().

 ==================================================================
 BUG: KASAN: slab-use-after-free in fl_dump_key+0x1d3e/0x20d0 [cls_flower]
 Read of size 4 at addr ffff88814f864000 by task tc/2973

 CPU: 7 PID: 2973 Comm: tc Not tainted 6.9.0-rc2_for_upstream_debug_2024_04_02_12_41 #1
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x7e/0xc0
  print_report+0xc1/0x600
  ? __virt_addr_valid+0x1cf/0x390
  ? fl_dump_key+0x1d3e/0x20d0 [cls_flower]
  ? fl_dump_key+0x1d3e/0x20d0 [cls_flower]
  kasan_report+0xb9/0xf0
  ? fl_dump_key+0x1d3e/0x20d0 [cls_flower]
  fl_dump_key+0x1d3e/0x20d0 [cls_flower]
  ? lock_acquire+0x1c2/0x530
  ? fl_dump+0x172/0x5c0 [cls_flower]
  ? lockdep_hardirqs_on_prepare+0x400/0x400
  ? fl_dump_key_options.part.0+0x10f0/0x10f0 [cls_flower]
  ? do_raw_spin_lock+0x12d/0x270
  ? spin_bug+0x1d0/0x1d0
  fl_dump+0x21d/0x5c0 [cls_flower]
  ? fl_tmplt_dump+0x1f0/0x1f0 [cls_flower]
  ? nla_put+0x15f/0x1c0
  tcf_fill_node+0x51b/0x9a0
  ? tc_skb_ext_tc_enable+0x150/0x150
  ? __alloc_skb+0x17b/0x310
  ? __build_skb_around+0x340/0x340
  ? down_write+0x1b0/0x1e0
  tfilter_notify+0x1a5/0x390
  ? fl_terse_dump+0x400/0x400 [cls_flower]
  tc_new_tfilter+0x963/0x2170
  ? tc_del_tfilter+0x1490/0x1490
  ? print_usage_bug.part.0+0x670/0x670
  ? lock_downgrade+0x680/0x680
  ? security_capable+0x51/0x90
  ? tc_del_tfilter+0x1490/0x1490
  rtnetlink_rcv_msg+0x75e/0xac0
  ? if_nlmsg_stats_size+0x4c0/0x4c0
  ? lockdep_set_lock_cmp_fn+0x190/0x190
  ? __netlink_lookup+0x35e/0x6e0
  netlink_rcv_skb+0x12c/0x360
  ? if_nlmsg_stats_size+0x4c0/0x4c0
  ? netlink_ack+0x15e0/0x15e0
  ? lockdep_hardirqs_on_prepare+0x400/0x400
  ? netlink_deliver_tap+0xcd/0xa60
  ? netlink_deliver_tap+0xcd/0xa60
  ? netlink_deliver_tap+0x1c9/0xa60
  netlink_unicast+0x43e/0x700
  ? netlink_attachskb+0x750/0x750
  ? lock_acquire+0x1c2/0x530
  ? __might_fault+0xbb/0x170
  netlink_sendmsg+0x749/0xc10
  ? netlink_unicast+0x700/0x700
  ? __might_fault+0xbb/0x170
  ? netlink_unicast+0x700/0x700
  __sock_sendmsg+0xc5/0x190
  ____sys_sendmsg+0x534/0x6b0
  ? import_iovec+0x7/0x10
  ? kernel_sendmsg+0x30/0x30
  ? __copy_msghdr+0x3c0/0x3c0
  ? entry_SYSCALL_64_after_hwframe+0x46/0x4e
  ? lock_acquire+0x1c2/0x530
  ? __virt_addr_valid+0x116/0x390
  ___sys_sendmsg+0xeb/0x170
  ? __virt_addr_valid+0x1ca/0x390
  ? copy_msghdr_from_user+0x110/0x110
  ? __delete_object+0xb8/0x100
  ? __virt_addr_valid+0x1cf/0x390
  ? do_sys_openat2+0x102/0x150
  ? lockdep_hardirqs_on_prepare+0x284/0x400
  ? do_sys_openat2+0x102/0x150
  ? __fget_light+0x53/0x1d0
  ? sockfd_lookup_light+0x1a/0x150
  __sys_sendmsg+0xb5/0x140
  ? __sys_sendmsg_sock+0x20/0x20
  ? lock_downgrade+0x680/0x680
  do_syscall_64+0x70/0x140
  entry_SYSCALL_64_after_hwframe+0x46/0x4e
 RIP: 0033:0x7f98e3713367
 Code: 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
 RSP: 002b:00007ffc74a64608 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 000000000047eae0 RCX: 00007f98e3713367
 RDX: 0000000000000000 RSI: 00007ffc74a64670 RDI: 0000000000000003
 RBP: 0000000000000008 R08: 0000000000000000 R09: 0000000000000000
 R10: 00007f98e360c5e8 R11: 0000000000000246 R12: 00007ffc74a6a508
 R13: 00000000660d518d R14: 0000000000484a80 R15: 00007ffc74a6a50b
  </TASK>

 Allocated by task 2973:
  kasan_save_stack+0x20/0x40
  kasan_save_track+0x10/0x30
  __kasan_kmalloc+0x77/0x90
  fl_change+0x27a6/0x4540 [cls_flower]
  tc_new_tfilter+0x879/0x2170
  rtnetlink_rcv_msg+0x75e/0xac0
  netlink_rcv_skb+0x12c/0x360
  netlink_unicast+0x43e/0x700
  netlink_sendmsg+0x749/0xc10
  __sock_sendmsg+0xc5/0x190
  ____sys_sendmsg+0x534/0x6b0
  ___sys_sendmsg+0xeb/0x170
  __sys_sendmsg+0xb5/0x140
  do_syscall_64+0x70/0x140
  entry_SYSCALL_64_after_hwframe+0x46/0x4e

 Freed by task 283:
  kasan_save_stack+0x20/0x40
  kasan_save_track+0x10/0x30
  kasan_save_free_info+0x37/0x50
  poison_slab_object+0x105/0x190
  __kasan_slab_free+0x11/0x30
  kfree+0x111/0x340
  process_one_work+0x787/0x1490
  worker_thread+0x586/0xd30
  kthread+0x2df/0x3b0
  ret_from_fork+0x2d/0x70
  ret_from_fork_asm+0x11/0x20

 Last potentially related work creation:
  kasan_save_stack+0x20/0x40
  __kasan_record_aux_stack+0x9b/0xb0
  insert_work+0x25/0x1b0
  __queue_work+0x640/0xc90
  rcu_work_rcufn+0x42/0x70
  rcu_core+0x6a9/0x1850
  __do_softirq+0x264/0x88f

 Second to last potentially related work creation:
  kasan_save_stack+0x20/0x40
  __kasan_record_aux_stack+0x9b/0xb0
  __call_rcu_common.constprop.0+0x6f/0xac0
  queue_rcu_work+0x56/0x70
  fl_mask_put+0x20d/0x270 [cls_flower]
  __fl_delete+0x352/0x6b0 [cls_flower]
  fl_delete+0x97/0x160 [cls_flower]
  tc_del_tfilter+0x7d1/0x1490
  rtnetlink_rcv_msg+0x75e/0xac0
  netlink_rcv_skb+0x12c/0x360
  netlink_unicast+0x43e/0x700
  netlink_sendmsg+0x749/0xc10
  __sock_sendmsg+0xc5/0x190
  ____sys_sendmsg+0x534/0x6b0
  ___sys_sendmsg+0xeb/0x170
  __sys_sendmsg+0xb5/0x140
  do_syscall_64+0x70/0x140
  entry_SYSCALL_64_after_hwframe+0x46/0x4e

Fixes: 2081fd3445 ("net: sched: cls_api: add filter counter")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Tested-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-10 08:28:26 +01:00
Mina Almasry f58f3c9563 net: remove napi_frag_unref
With the changes in the last patches, napi_frag_unref() is now
reduandant. Remove it and use skb_page_unref directly.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240408153000.2152844-4-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-09 18:20:32 -07:00
Mina Almasry 959fa5c188 net: make napi_frag_unref reuse skb_page_unref
The implementations of these 2 functions are almost identical. Remove
the implementation of napi_frag_unref, and make it a call into
skb_page_unref so we don't duplicate the implementation.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240408153000.2152844-2-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-09 18:20:29 -07:00
Kuan-Wei Chiu d034d02de8 net: sched: cake: Optimize the number of function calls and branches in heap construction
When constructing a heap, heapify operations are required on all
non-leaf nodes. Thus, determining the index of the first non-leaf node
is crucial. In a heap, the left child's index of node i is 2 * i + 1
and the right child's index is 2 * i + 2. Node CAKE_MAX_TINS *
CAKE_QUEUES / 2 has its left and right children at indexes
CAKE_MAX_TINS * CAKE_QUEUES + 1 and CAKE_MAX_TINS * CAKE_QUEUES + 2,
respectively, which are beyond the heap's range, indicating it as a
leaf node. Conversely, node CAKE_MAX_TINS * CAKE_QUEUES / 2 - 1 has a
left child at index CAKE_MAX_TINS * CAKE_QUEUES - 1, confirming its
non-leaf status. The loop should start from it since it's not a leaf
node.

By starting the loop from CAKE_MAX_TINS * CAKE_QUEUES / 2 - 1, we
minimize function calls and branch condition evaluations. This
adjustment theoretically reduces two function calls (one for
cake_heapify() and another for cake_heap_get_backlog()) and five branch
evaluations (one for iterating all non-leaf nodes, one within
cake_heapify()'s while loop, and three more within the while loop
with if conditions).

Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://lore.kernel.org/r/20240408174716.751069-1-visitorckw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-09 17:30:14 -07:00
Jiri Benc 7633c4da91 ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr
Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
still means hlist_for_each_entry_rcu can return an item that got removed
from the list. The memory itself of such item is not freed thanks to RCU
but nothing guarantees the actual content of the memory is sane.

In particular, the reference count can be zero. This can happen if
ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
timing, this can happen:

1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.

2. Then, the whole ipv6_del_addr is executed for the given entry. The
   reference count drops to zero and kfree_rcu is scheduled.

3. ipv6_get_ifaddr continues and tries to increments the reference count
   (in6_ifa_hold).

4. The rcu is unlocked and the entry is freed.

5. The freed entry is returned.

Prevent increasing of the reference count in such case. The name
in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.

[   41.506330] refcount_t: addition on 0; use-after-free.
[   41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
[   41.507413] Modules linked in: veth bridge stp llc
[   41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
[   41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[   41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
[   41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
[   41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
[   41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
[   41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
[   41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
[   41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
[   41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
[   41.514086] FS:  00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
[   41.514726] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
[   41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   41.516799] Call Trace:
[   41.517037]  <TASK>
[   41.517249]  ? __warn+0x7b/0x120
[   41.517535]  ? refcount_warn_saturate+0xa5/0x130
[   41.517923]  ? report_bug+0x164/0x190
[   41.518240]  ? handle_bug+0x3d/0x70
[   41.518541]  ? exc_invalid_op+0x17/0x70
[   41.520972]  ? asm_exc_invalid_op+0x1a/0x20
[   41.521325]  ? refcount_warn_saturate+0xa5/0x130
[   41.521708]  ipv6_get_ifaddr+0xda/0xe0
[   41.522035]  inet6_rtm_getaddr+0x342/0x3f0
[   41.522376]  ? __pfx_inet6_rtm_getaddr+0x10/0x10
[   41.522758]  rtnetlink_rcv_msg+0x334/0x3d0
[   41.523102]  ? netlink_unicast+0x30f/0x390
[   41.523445]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[   41.523832]  netlink_rcv_skb+0x53/0x100
[   41.524157]  netlink_unicast+0x23b/0x390
[   41.524484]  netlink_sendmsg+0x1f2/0x440
[   41.524826]  __sys_sendto+0x1d8/0x1f0
[   41.525145]  __x64_sys_sendto+0x1f/0x30
[   41.525467]  do_syscall_64+0xa5/0x1b0
[   41.525794]  entry_SYSCALL_64_after_hwframe+0x72/0x7a
[   41.526213] RIP: 0033:0x7fbc4cfcea9a
[   41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[   41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[   41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
[   41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
[   41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
[   41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[   41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
[   41.531573]  </TASK>

Fixes: 5c578aedcb ("IPv6: convert addrconf hash list to RCU")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-09 17:09:05 -07:00
Eric Dumazet 7a87441c96 nfc: llcp: fix nfc_llcp_setsockopt() unsafe copies
syzbot reported unsafe calls to copy_from_sockptr() [1]

Use copy_safe_from_sockptr() instead.

[1]

BUG: KASAN: slab-out-of-bounds in copy_from_sockptr_offset include/linux/sockptr.h:49 [inline]
 BUG: KASAN: slab-out-of-bounds in copy_from_sockptr include/linux/sockptr.h:55 [inline]
 BUG: KASAN: slab-out-of-bounds in nfc_llcp_setsockopt+0x6c2/0x850 net/nfc/llcp_sock.c:255
Read of size 4 at addr ffff88801caa1ec3 by task syz-executor459/5078

CPU: 0 PID: 5078 Comm: syz-executor459 Not tainted 6.8.0-syzkaller-08951-gfe46a7dd189e #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
Call Trace:
 <TASK>
  __dump_stack lib/dump_stack.c:88 [inline]
  dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
  print_address_description mm/kasan/report.c:377 [inline]
  print_report+0x169/0x550 mm/kasan/report.c:488
  kasan_report+0x143/0x180 mm/kasan/report.c:601
  copy_from_sockptr_offset include/linux/sockptr.h:49 [inline]
  copy_from_sockptr include/linux/sockptr.h:55 [inline]
  nfc_llcp_setsockopt+0x6c2/0x850 net/nfc/llcp_sock.c:255
  do_sock_setsockopt+0x3b1/0x720 net/socket.c:2311
  __sys_setsockopt+0x1ae/0x250 net/socket.c:2334
  __do_sys_setsockopt net/socket.c:2343 [inline]
  __se_sys_setsockopt net/socket.c:2340 [inline]
  __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
 do_syscall_64+0xfd/0x240
 entry_SYSCALL_64_after_hwframe+0x6d/0x75
RIP: 0033:0x7f7fac07fd89
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 91 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fff660eb788 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f7fac07fd89
RDX: 0000000000000000 RSI: 0000000000000118 RDI: 0000000000000004
RBP: 0000000000000000 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000020000a80 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20240408082845.3957374-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-09 17:01:01 -07:00
Arnd Bergmann cf1b7201df ipv4/route: avoid unused-but-set-variable warning
The log_martians variable is only used in an #ifdef, causing a 'make W=1'
warning with gcc:

net/ipv4/route.c: In function 'ip_rt_send_redirect':
net/ipv4/route.c:880:13: error: variable 'log_martians' set but not used [-Werror=unused-but-set-variable]

Change the #ifdef to an equivalent IS_ENABLED() to let the compiler
see where the variable is used.

Fixes: 30038fc61a ("net: ip_rt_send_redirect() optimization")
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240408074219.3030256-2-arnd@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-09 15:49:19 +02:00
Arnd Bergmann 74043489fc ipv6: fib: hide unused 'pn' variable
When CONFIG_IPV6_SUBTREES is disabled, the only user is hidden, causing
a 'make W=1' warning:

net/ipv6/ip6_fib.c: In function 'fib6_add':
net/ipv6/ip6_fib.c:1388:32: error: variable 'pn' set but not used [-Werror=unused-but-set-variable]

Add another #ifdef around the variable declaration, matching the other
uses in this file.

Fixes: 66729e18df ("[IPV6] ROUTE: Make sure we have fn->leaf when adding a node on subtree.")
Link: https://lore.kernel.org/netdev/20240322131746.904943-1-arnd@kernel.org/
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240408074219.3030256-1-arnd@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-09 15:49:19 +02:00
Baochen Qiang e0cd118590 net: qrtr: support suspend/hibernation
MHI devices may not be destroyed during suspend/hibernation, so need
to unprepare/prepare MHI channels throughout the transition, this is
done by adding suspend/resume callbacks.

The suspend callback is called in the late suspend stage, this means
MHI channels are still alive at suspend stage, and that makes it
possible for an MHI controller driver to communicate with others over
those channels at suspend stage. While the resume callback is called
in the early resume stage, for a similar reason.

Also note that we won't do unprepare/prepare when MHI device is in
suspend state because it's pointless if MHI is only meant to go through
a suspend/resume transition, instead of a complete power cycle.

Tested-on: WCN6855 hw2.0 PCI WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.30

Signed-off-by: Baochen Qiang <quic_bqiang@quicinc.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Kalle Valo <quic_kvalo@quicinc.com>
Link: https://msgid.link/20240305021320.3367-3-quic_bqiang@quicinc.com
2024-04-09 14:43:28 +03:00
Eric Dumazet 41eecbd712 tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field
TCP can transform a TIMEWAIT socket into a SYN_RECV one from
a SYN packet, and the ISN of the SYNACK packet is normally
generated using TIMEWAIT tw_snd_nxt :

tcp_timewait_state_process()
...
    u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
    if (isn == 0)
        isn++;
    TCP_SKB_CB(skb)->tcp_tw_isn = isn;
    return TCP_TW_SYN;

This SYN packet also bypasses normal checks against listen queue
being full or not.

tcp_conn_request()
...
       __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
...
        /* TW buckets are converted to open requests without
         * limitations, they conserve resources and peer is
         * evidently real one.
         */
        if ((syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) && !isn) {
                want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);
                if (!want_cookie)
                        goto drop;
        }

This was using TCP_SKB_CB(skb)->tcp_tw_isn field in skb.

Unfortunately this field has been accidentally cleared
after the call to tcp_timewait_state_process() returning
TCP_TW_SYN.

Using a field in TCP_SKB_CB(skb) for a temporary state
is overkill.

Switch instead to a per-cpu variable.

As a bonus, we do not have to clear tcp_tw_isn in TCP receive
fast path.
It is temporarily set then cleared only in the TCP_TW_SYN dance.

Fixes: 4ad19de877 ("net: tcp6: fix double call of tcp_v6_fill_cb()")
Fixes: eeea10b83a ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-09 11:47:40 +02:00
Eric Dumazet b9e8104058 tcp: propagate tcp_tw_isn via an extra parameter to ->route_req()
tcp_v6_init_req() reads TCP_SKB_CB(skb)->tcp_tw_isn to find
out if the request socket is created by a SYN hitting a TIMEWAIT socket.

This has been buggy for a decade, lets directly pass the information
from tcp_conn_request().

This is a preparatory patch to make the following one easier to review.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-09 11:47:40 +02:00
Kuniyuki Iwashima b46f4eaa4f af_unix: Clear stale u->oob_skb.
syzkaller started to report deadlock of unix_gc_lock after commit
4090fa373f ("af_unix: Replace garbage collection algorithm."), but
it just uncovers the bug that has been there since commit 314001f0bf
("af_unix: Add OOB support").

The repro basically does the following.

  from socket import *
  from array import array

  c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
  c1.sendmsg([b'a'], [(SOL_SOCKET, SCM_RIGHTS, array("i", [c2.fileno()]))], MSG_OOB)
  c2.recv(1)  # blocked as no normal data in recv queue

  c2.close()  # done async and unblock recv()
  c1.close()  # done async and trigger GC

A socket sends its file descriptor to itself as OOB data and tries to
receive normal data, but finally recv() fails due to async close().

The problem here is wrong handling of OOB skb in manage_oob().  When
recvmsg() is called without MSG_OOB, manage_oob() is called to check
if the peeked skb is OOB skb.  In such a case, manage_oob() pops it
out of the receive queue but does not clear unix_sock(sk)->oob_skb.
This is wrong in terms of uAPI.

Let's say we send "hello" with MSG_OOB, and "world" without MSG_OOB.
The 'o' is handled as OOB data.  When recv() is called twice without
MSG_OOB, the OOB data should be lost.

  >>> from socket import *
  >>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM, 0)
  >>> c1.send(b'hello', MSG_OOB)  # 'o' is OOB data
  5
  >>> c1.send(b'world')
  5
  >>> c2.recv(5)  # OOB data is not received
  b'hell'
  >>> c2.recv(5)  # OOB date is skipped
  b'world'
  >>> c2.recv(5, MSG_OOB)  # This should return an error
  b'o'

In the same situation, TCP actually returns -EINVAL for the last
recv().

Also, if we do not clear unix_sk(sk)->oob_skb, unix_poll() always set
EPOLLPRI even though the data has passed through by previous recv().

To avoid these issues, we must clear unix_sk(sk)->oob_skb when dequeuing
it from recv queue.

The reason why the old GC did not trigger the deadlock is because the
old GC relied on the receive queue to detect the loop.

When it is triggered, the socket with OOB data is marked as GC candidate
because file refcount == inflight count (1).  However, after traversing
all inflight sockets, the socket still has a positive inflight count (1),
thus the socket is excluded from candidates.  Then, the old GC lose the
chance to garbage-collect the socket.

With the old GC, the repro continues to create true garbage that will
never be freed nor detected by kmemleak as it's linked to the global
inflight list.  That's why we couldn't even notice the issue.

Fixes: 314001f0bf ("af_unix: Add OOB support")
Reported-by: syzbot+7f7f201cc2668a8fd169@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7f7f201cc2668a8fd169
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240405221057.2406-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-08 19:58:48 -07:00
Michael-CY Lee a35b36e6ee wifi: mac80211: extend IEEE80211_KEY_FLAG_GENERATE_MMIE to other ciphers
Extend the flag IEEE80211_KEY_FLAG_GENERATE_MMIE to BIP-CMAC-256,
BIP-GMAC-128 and BIP-GMAC-256 for the same reason and in the same
way that the flag was added originally in commit a0b4496a43
("mac80211: add IEEE80211_KEY_FLAG_GENERATE_MMIE to ieee80211_key_flags").

Signed-off-by: Michael-CY Lee <michael-cy.lee@mediatek.com>
Link: https://msgid.link/20240326003036.15215-1-michael-cy.lee@mediatek.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-08 20:52:28 +02:00
Jeff Johnson 499921d3dc wifi: mac80211: correctly document struct mesh_table
Currently kernel-doc -Wall reports:

net/mac80211/ieee80211_i.h:687: warning: missing initial short description on line:
 * struct mesh_table

So add a short description.

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/linux-wireless/a009a21a-56d7-4a1a-aaf9-feefa5acc561@infradead.org/
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://msgid.link/20240328-mesh_table-kerneldoc-v1-1-174c4df341b1@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-08 20:19:14 +02:00
Igor Artemiev 9ef369973c wifi: cfg80211: fix the order of arguments for trace events of the tx_rx_evt class
The declarations of the tx_rx_evt class and the rdev_set_antenna event
use the wrong order of arguments in the TP_ARGS macro.

Fix the order of arguments in the TP_ARGS macro.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Igor Artemiev <Igor.A.Artemiev@mcst.ru>
Link: https://msgid.link/20240405152431.270267-1-Igor.A.Artemiev@mcst.ru
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-08 20:14:08 +02:00
Richard Kinder d12b9779cc wifi: mac80211: ensure beacon is non-S1G prior to extracting the beacon timestamp field
Logic inside ieee80211_rx_mgmt_beacon accesses the
mgmt->u.beacon.timestamp field without first checking whether the beacon
received is non-S1G format.

Fix the problem by checking the beacon is non-S1G format to avoid access
of the mgmt->u.beacon.timestamp field.

Signed-off-by: Richard Kinder <richard.kinder@gmail.com>
Link: https://msgid.link/20240328005725.85355-1-richard.kinder@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-08 20:12:01 +02:00
Johannes Berg ab9177d83c wifi: mac80211: don't use rate mask for scanning
The rate mask is intended for use during operation, and
can be set to only have masks for the currently active
band. As such, it cannot be used for scanning which can
be on other bands as well.

Simply ignore the rate masks during scanning to avoid
warnings from incorrect settings.

Reported-by: syzbot+fdc5123366fb9c3fdc6d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=fdc5123366fb9c3fdc6d
Co-developed-by: Dmitry Antipov <dmantipov@yandex.ru>
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Tested-by: Dmitry Antipov <dmantipov@yandex.ru>
Link: https://msgid.link/20240326220854.9594cbb418ca.I7f86c0ba1f98cf7e27c2bacf6c2d417200ecea5c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-08 20:11:34 +02:00
Johannes Berg 7c1c73bf84 wifi: mac80211: check EHT/TTLM action frame length
Check the EHT action frame length before accessing
the action code, if it's not present then the frame
cannot be valid.

Reported-by: syzbot+75af45a00cf13243ba39@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/0000000000006c06870614886611@google.com/
Fixes: 8f500fbc6c ("wifi: mac80211: process and save negotiated TID to Link mapping request")
Link: https://msgid.link/20240326213858.19c84f34349f.I71b439f016b28f65284bb7646fe36343b74cbc9a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-08 20:11:21 +02:00
Parav Pandit 5af3e3876d devlink: Support setting max_io_eqs
Many devices send event notifications for the IO queues,
such as tx and rx queues, through event queues.

Enable a privileged owner, such as a hypervisor PF, to set the number
of IO event queues for the VF and SF during the provisioning stage.

example:
Get maximum IO event queues of the VF device::

  $ devlink port show pci/0000:06:00.0/2
  pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
          hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 10

Set maximum IO event queues of the VF device::

  $ devlink port function set pci/0000:06:00.0/2 max_io_eqs 32

  $ devlink port show pci/0000:06:00.0/2
  pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
          hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 32

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-08 14:10:45 +01:00