Commit graph

8601 commits

Author SHA1 Message Date
Petr Machata
a1aee20d5d net: bridge: Add netlink knobs for number / maximum MDB entries
The previous patch added accounting for number of MDB entries per port and
per port-VLAN, and the logic to verify that these values stay within
configured bounds. However it didn't provide means to actually configure
those bounds or read the occupancy. This patch does that.

Two new netlink attributes are added for the MDB occupancy:
IFLA_BRPORT_MCAST_N_GROUPS for the per-port occupancy and
BRIDGE_VLANDB_ENTRY_MCAST_N_GROUPS for the per-port-VLAN occupancy.
And another two for the maximum number of MDB entries:
IFLA_BRPORT_MCAST_MAX_GROUPS for the per-port maximum, and
BRIDGE_VLANDB_ENTRY_MCAST_MAX_GROUPS for the per-port-VLAN one.

Note that the two new IFLA_BRPORT_ attributes prompt bumping of
RTNL_SLAVE_MAX_TYPE to size the slave attribute tables large enough.

The new attributes are used like this:

 # ip link add name br up type bridge vlan_filtering 1 mcast_snooping 1 \
                                      mcast_vlan_snooping 1 mcast_querier 1
 # ip link set dev v1 master br
 # bridge vlan add dev v1 vid 2

 # bridge vlan set dev v1 vid 1 mcast_max_groups 1
 # bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 1
 # bridge mdb add dev br port v1 grp 230.1.2.4 temp vid 1
 Error: bridge: Port-VLAN is already in 1 groups, and mcast_max_groups=1.

 # bridge link set dev v1 mcast_max_groups 1
 # bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 2
 Error: bridge: Port is already in 1 groups, and mcast_max_groups=1.

 # bridge -d link show
 5: v1@v2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br [...]
     [...] mcast_n_groups 1 mcast_max_groups 1

 # bridge -d vlan show
 port              vlan-id
 br                1 PVID Egress Untagged
                     state forwarding mcast_router 1
 v1                1 PVID Egress Untagged
                     [...] mcast_n_groups 1 mcast_max_groups 1
                   2
                     [...] mcast_n_groups 0 mcast_max_groups 0

Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-06 08:48:26 +00:00
Petr Machata
d47230a348 net: bridge: Add a tracepoint for MDB overflows
The following patch will add two more maximum MDB allowances to the global
one, mcast_hash_max, that exists today. In all these cases, attempts to add
MDB entries above the configured maximums through netlink, fail noisily and
obviously. Such visibility is missing when adding entries through the
control plane traffic, by IGMP or MLD packets.

To improve visibility in those cases, add a trace point that reports the
violation, including the relevant netdevice (be it a slave or the bridge
itself), and the MDB entry parameters:

	# perf record -e bridge:br_mdb_full &
	# [...]
	# perf script | cut -d: -f4-
	 dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 0
	 dev v2 af 10 src :: grp ff0e::112/00:00:00:00:00:00 vid 0
	 dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 10
	 dev v2 af 10 src 2001:db8:1::1 grp ff0e::1/00:00:00:00:00:00 vid 10
	 dev v2 af 2 src ::ffff:192.0.2.1 grp ::ffff:239.1.1.1/00:00:00:00:00:00 vid 10

CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-trace-kernel@vger.kernel.org
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-06 08:48:25 +00:00
Jakub Kicinski
82b4a9412b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
net/core/gro.c
  7d2c89b325 ("skb: Do mix page pool and page referenced frags in GRO")
  b1a78b9b98 ("net: add support for ipv4 big tcp")
https://lore.kernel.org/all/20230203094454.5766f160@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-02 14:49:55 -08:00
Brian Haley
62e395f82d neighbor: fix proxy_delay usage when it is zero
When set to zero, the neighbor sysctl proxy_delay value
does not cause an immediate reply for ARP/ND requests
as expected, it instead causes a random delay between
[0, U32_MAX). Looking at this comment from
__get_random_u32_below() explains the reason:

/*
 * This function is technically undefined for ceil == 0, and in fact
 * for the non-underscored constant version in the header, we build bug
 * on that. But for the non-constant case, it's convenient to have that
 * evaluate to being a straight call to get_random_u32(), so that
 * get_random_u32_inclusive() can work over its whole range without
 * undefined behavior.
 */

Added helper function that does not call get_random_u32_below()
if proxy_delay is zero and just uses the current value of
jiffies instead, causing pneigh_enqueue() to respond
immediately.

Also added definition of proxy_delay to ip-sysctl.txt since
it was missing.

Signed-off-by: Brian Haley <haleyb.dev@gmail.com>
Link: https://lore.kernel.org/r/20230130171428.367111-1-haleyb.dev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-01 21:02:54 -08:00
Xin Long
b1a78b9b98 net: add support for ipv4 big tcp
Similar to Eric's IPv6 BIG TCP, this patch is to enable IPv4 BIG TCP.

Firstly, allow sk->sk_gso_max_size to be set to a value greater than
GSO_LEGACY_MAX_SIZE by not trimming gso_max_size in sk_trim_gso_size()
for IPv4 TCP sockets.

Then on TX path, set IP header tot_len to 0 when skb->len > IP_MAX_MTU
in __ip_local_out() to allow to send BIG TCP packets, and this implies
that skb->len is the length of a IPv4 packet; On RX path, use skb->len
as the length of the IPv4 packet when the IP header tot_len is 0 and
skb->len > IP_MAX_MTU in ip_rcv_core(). As the API iph_set_totlen() and
skb_ip_totlen() are used in __ip_local_out() and ip_rcv_core(), we only
need to update these APIs.

Also in GRO receive, add the check for ETH_P_IP/IPPROTO_TCP, and allows
the merged packet size >= GRO_LEGACY_MAX_SIZE in skb_gro_receive(). In
GRO complete, set IP header tot_len to 0 when the merged packet size
greater than IP_MAX_MTU in iph_set_totlen() so that it can be processed
on RX path.

Note that by checking skb_is_gso_tcp() in API iph_totlen(), it makes
this implementation safe to use iph->len == 0 indicates IPv4 BIG TCP
packets.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-01 20:54:27 -08:00
Xin Long
9eefedd58a net: add gso_ipv4_max_size and gro_ipv4_max_size per device
This patch introduces gso_ipv4_max_size and gro_ipv4_max_size
per device and adds netlink attributes for them, so that IPV4
BIG TCP can be guarded by a separate tunable in the next patch.

To not break the old application using "gso/gro_max_size" for
IPv4 GSO packets, this patch updates "gso/gro_ipv4_max_size"
in netif_set_gso/gro_max_size() if the new size isn't greater
than GSO_LEGACY_MAX_SIZE, so that nothing will change even if
userspace doesn't realize the new netlink attributes.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-01 20:54:27 -08:00
Yan Zhai
876e8ca836 net: fix NULL pointer in skb_segment_list
Commit 3a1296a38d ("net: Support GRO/GSO fraglist chaining.")
introduced UDP listifyed GRO. The segmentation relies on frag_list being
untouched when passing through the network stack. This assumption can be
broken sometimes, where frag_list itself gets pulled into linear area,
leaving frag_list being NULL. When this happens it can trigger
following NULL pointer dereference, and panic the kernel. Reverse the
test condition should fix it.

[19185.577801][    C1] BUG: kernel NULL pointer dereference, address:
...
[19185.663775][    C1] RIP: 0010:skb_segment_list+0x1cc/0x390
...
[19185.834644][    C1] Call Trace:
[19185.841730][    C1]  <TASK>
[19185.848563][    C1]  __udp_gso_segment+0x33e/0x510
[19185.857370][    C1]  inet_gso_segment+0x15b/0x3e0
[19185.866059][    C1]  skb_mac_gso_segment+0x97/0x110
[19185.874939][    C1]  __skb_gso_segment+0xb2/0x160
[19185.883646][    C1]  udp_queue_rcv_skb+0xc3/0x1d0
[19185.892319][    C1]  udp_unicast_rcv_skb+0x75/0x90
[19185.900979][    C1]  ip_protocol_deliver_rcu+0xd2/0x200
[19185.910003][    C1]  ip_local_deliver_finish+0x44/0x60
[19185.918757][    C1]  __netif_receive_skb_one_core+0x8b/0xa0
[19185.927834][    C1]  process_backlog+0x88/0x130
[19185.935840][    C1]  __napi_poll+0x27/0x150
[19185.943447][    C1]  net_rx_action+0x27e/0x5f0
[19185.951331][    C1]  ? mlx5_cq_tasklet_cb+0x70/0x160 [mlx5_core]
[19185.960848][    C1]  __do_softirq+0xbc/0x25d
[19185.968607][    C1]  irq_exit_rcu+0x83/0xb0
[19185.976247][    C1]  common_interrupt+0x43/0xa0
[19185.984235][    C1]  asm_common_interrupt+0x22/0x40
...
[19186.094106][    C1]  </TASK>

Fixes: 3a1296a38d ("net: Support GRO/GSO fraglist chaining.")
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/Y9gt5EUizK1UImEP@debian
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31 21:07:04 -08:00
Jakub Kicinski
2d104c390f bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY9RqJgAKCRDbK58LschI
 gw2IAP9G5uhFO5abBzYLupp6SY3T5j97MUvPwLfFqUEt7EXmuwEA2lCUEWeW0KtR
 QX+QmzCa6iHxrW7WzP4DUYLue//FJQY=
 =yYqA
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
bpf-next 2023-01-28

We've added 124 non-merge commits during the last 22 day(s) which contain
a total of 124 files changed, 6386 insertions(+), 1827 deletions(-).

The main changes are:

1) Implement XDP hints via kfuncs with initial support for RX hash and
   timestamp metadata kfuncs, from Stanislav Fomichev and
   Toke Høiland-Jørgensen.
   Measurements on overhead: https://lore.kernel.org/bpf/875yellcx6.fsf@toke.dk

2) Extend libbpf's bpf_tracing.h support for tracing arguments of
   kprobes/uprobes and syscall as a special case, from Andrii Nakryiko.

3) Significantly reduce the search time for module symbols by livepatch
   and BPF, from Jiri Olsa and Zhen Lei.

4) Enable cpumasks to be used as kptrs, which is useful for tracing
   programs tracking which tasks end up running on which CPUs
   in different time intervals, from David Vernet.

5) Fix several issues in the dynptr processing such as stack slot liveness
   propagation, missing checks for PTR_TO_STACK variable offset, etc,
   from Kumar Kartikeya Dwivedi.

6) Various performance improvements, fixes, and introduction of more
   than just one XDP program to XSK selftests, from Magnus Karlsson.

7) Big batch to BPF samples to reduce deprecated functionality,
   from Daniel T. Lee.

8) Enable struct_ops programs to be sleepable in verifier,
   from David Vernet.

9) Reduce pr_warn() noise on BTF mismatches when they are expected under
   the CONFIG_MODULE_ALLOW_BTF_MISMATCH config anyway, from Connor O'Brien.

10) Describe modulo and division by zero behavior of the BPF runtime
    in BPF's instruction specification document, from Dave Thaler.

11) Several improvements to libbpf API documentation in libbpf.h,
    from Grant Seltzer.

12) Improve resolve_btfids header dependencies related to subcmd and add
    proper support for HOSTCC, from Ian Rogers.

13) Add ipip6 and ip6ip decapsulation support for bpf_skb_adjust_room()
    helper along with BPF selftests, from Ziyang Xuan.

14) Simplify the parsing logic of structure parameters for BPF trampoline
    in the x86-64 JIT compiler, from Pu Lehui.

15) Get BTF working for kernels with CONFIG_RUST enabled by excluding
    Rust compilation units with pahole, from Martin Rodriguez Reboredo.

16) Get bpf_setsockopt() working for kTLS on top of TCP sockets,
    from Kui-Feng Lee.

17) Disable stack protection for BPF objects in bpftool given BPF backends
    don't support it, from Holger Hoffstätte.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (124 commits)
  selftest/bpf: Make crashes more debuggable in test_progs
  libbpf: Add documentation to map pinning API functions
  libbpf: Fix malformed documentation formatting
  selftests/bpf: Properly enable hwtstamp in xdp_hw_metadata
  selftests/bpf: Calls bpf_setsockopt() on a ktls enabled socket.
  bpf: Check the protocol of a sock to agree the calls to bpf_setsockopt().
  bpf/selftests: Verify struct_ops prog sleepable behavior
  bpf: Pass const struct bpf_prog * to .check_member
  libbpf: Support sleepable struct_ops.s section
  bpf: Allow BPF_PROG_TYPE_STRUCT_OPS programs to be sleepable
  selftests/bpf: Fix vmtest static compilation error
  tools/resolve_btfids: Alter how HOSTCC is forced
  tools/resolve_btfids: Install subcmd headers
  bpf/docs: Document the nocast aliasing behavior of ___init
  bpf/docs: Document how nested trusted fields may be defined
  bpf/docs: Document cpumask kfuncs in a new file
  selftests/bpf: Add selftest suite for cpumask kfuncs
  selftests/bpf: Add nested trust selftests suite
  bpf: Enable cpumasks to be queried and used as kptrs
  bpf: Disallow NULLable pointers for trusted kfuncs
  ...
====================

Link: https://lore.kernel.org/r/20230128004827.21371-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-28 00:00:14 -08:00
Jakub Kicinski
0548c5f26a bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY9RCwAAKCRDbK58LschI
 g7drAQDfMPc1Q2CE4LZ9oh2wu1Nt2/85naTDK/WirlCToKs0xwD+NRQOyO3hcoJJ
 rOCwfjOlAm+7uqtiwwodBvWgTlDgVAM=
 =0fC+
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
bpf 2023-01-27

We've added 10 non-merge commits during the last 9 day(s) which contain
a total of 10 files changed, 170 insertions(+), 59 deletions(-).

The main changes are:

1) Fix preservation of register's parent/live fields when copying
   range-info, from Eduard Zingerman.

2) Fix an off-by-one bug in bpf_mem_cache_idx() to select the right
   cache, from Hou Tao.

3) Fix stack overflow from infinite recursion in sock_map_close(),
   from Jakub Sitnicki.

4) Fix missing btf_put() in register_btf_id_dtor_kfuncs()'s error path,
   from Jiri Olsa.

5) Fix a splat from bpf_setsockopt() via lsm_cgroup/socket_sock_rcv_skb,
   from Kui-Feng Lee.

6) Fix bpf_send_signal[_thread]() helpers to hold a reference on the task,
   from Yonghong Song.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Fix the kernel crash caused by bpf_setsockopt().
  selftests/bpf: Cover listener cloning with progs attached to sockmap
  selftests/bpf: Pass BPF skeleton to sockmap_listen ops tests
  bpf, sockmap: Check for any of tcp_bpf_prots when cloning a listener
  bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself
  bpf: Add missing btf_put to register_btf_id_dtor_kfuncs
  selftests/bpf: Verify copy_register_state() preserves parent/live fields
  bpf: Fix to preserve reg parent/live fields when copying range info
  bpf: Fix a possible task gone issue with bpf_send_signal[_thread]() helpers
  bpf: Fix off-by-one error in bpf_mem_cache_idx()
====================

Link: https://lore.kernel.org/r/20230127215820.4993-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-27 23:32:03 -08:00
Breno Leitao
d8afe2f8a9 netpoll: Remove 4s sleep during carrier detection
This patch removes the msleep(4s) during netpoll_setup() if the carrier
appears instantly.

Here are some scenarios where this workaround is counter-productive in
modern ages:

Servers which have BMC communicating over NC-SI via the same NIC as gets
used for netconsole. BMC will keep the PHY up, hence the carrier
appearing instantly.

The link is fibre, SERDES getting sync could happen within 0.1Hz, and
the carrier also appears instantly.

Other than that, if a driver is reporting instant carrier and then
losing it, this is probably a driver bug.

Reported-by: Michael van der Westhuizen <rmikey@meta.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20230125185230.3574681-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-27 23:24:07 -08:00
Alexander Duyck
7d2c89b325 skb: Do mix page pool and page referenced frags in GRO
GSO should not merge page pool recycled frames with standard reference
counted frames. Traditionally this didn't occur, at least not often.
However as we start looking at adding support for wireless adapters there
becomes the potential to mix the two due to A-MSDU repartitioning frames in
the receive path. There are possibly other places where this may have
occurred however I suspect they must be few and far between as we have not
seen this issue until now.

Fixes: 53e0961da1 ("page_pool: add frag page recycling support in page pool")
Reported-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/167475990764.1934330.11960904198087757911.stgit@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-27 23:21:27 -08:00
Jakub Kicinski
b568d3072a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

drivers/net/ethernet/intel/ice/ice_main.c
  418e53401e ("ice: move devlink port creation/deletion")
  643ef23bd9 ("ice: Introduce local var for readability")
https://lore.kernel.org/all/20230127124025.0dacef40@canb.auug.org.au/
https://lore.kernel.org/all/20230124005714.3996270-1-anthony.l.nguyen@intel.com/

drivers/net/ethernet/engleder/tsnep_main.c
  3d53aaef43 ("tsnep: Fix TX queue stop/wake for multiple queues")
  25faa6a4c5 ("tsnep: Replace TX spin_lock with __netif_tx_lock")
https://lore.kernel.org/all/20230127123604.36bb3e99@canb.auug.org.au/

net/netfilter/nf_conntrack_proto_sctp.c
  13bd9b31a9 ("Revert "netfilter: conntrack: add sctp DATA_SENT state"")
  a44b765148 ("netfilter: conntrack: unify established states for SCTP paths")
  f71cb8f45d ("netfilter: conntrack: sctp: use nf log infrastructure for invalid packets")
https://lore.kernel.org/all/20230127125052.674281f9@canb.auug.org.au/
https://lore.kernel.org/all/d36076f3-6add-a442-6d4b-ead9f7ffff86@tessares.net/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-27 22:56:18 -08:00
Jakub Kicinski
2195e2a024 net: skbuff: drop the linux/textsearch.h include
This include was added for skb_find_text() but all we need there
is a forward declaration of struct ts_config.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-27 11:19:46 +00:00
Kui-Feng Lee
2ab42c7b87 bpf: Check the protocol of a sock to agree the calls to bpf_setsockopt().
Resolve an issue when calling sol_tcp_sockopt() on a socket with ktls
enabled. Prior to this patch, sol_tcp_sockopt() would only allow calls
if the function pointer of setsockopt of the socket was set to
tcp_setsockopt(). However, any socket with ktls enabled would have its
function pointer set to tls_setsockopt(). To resolve this issue, the
patch adds a check of the protocol of the linux socket and allows
bpf_setsockopt() to be called if ktls is initialized on the linux
socket. This ensures that calls to sol_tcp_sockopt() will succeed on
sockets with ktls enabled.

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Link: https://lore.kernel.org/r/20230125201608.908230-2-kuifeng@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-25 14:49:11 -08:00
Jakub Sitnicki
5b4a79ba65 bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself
sock_map proto callbacks should never call themselves by design. Protect
against bugs like [1] and break out of the recursive loop to avoid a stack
overflow in favor of a resource leak.

[1] https://lore.kernel.org/all/00000000000073b14905ef2e7401@google.com/

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20230113-sockmap-fix-v2-1-1e0ee7ac2f90@cloudflare.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-24 21:32:55 -08:00
Jesper Dangaard Brouer
3176eb8268 net: avoid irqsave in skb_defer_free_flush
The spin_lock irqsave/restore API variant in skb_defer_free_flush can
be replaced with the faster spin_lock irq variant, which doesn't need
to read and restore the CPU flags.

Using the unconditional irq "disable/enable" API variant is safe,
because the skb_defer_free_flush() function is only called during
NAPI-RX processing in net_rx_action(), where it is known the IRQs
are enabled.

Expected gain is 14 cycles from avoiding reading and restoring CPU
flags in a spin_lock_irqsave/restore operation, measured via a
microbencmark kernel module[1] on CPU E5-1650 v4 @ 3.60GHz.

Microbenchmark overhead of spin_lock+unlock:
 - spin_lock_unlock_irq     cost: 34 cycles(tsc)  9.486 ns
 - spin_lock_unlock_irqsave cost: 48 cycles(tsc) 13.567 ns

We don't expect to see a measurable packet performance gain, as
skb_defer_free_flush() is called infrequently once per NIC device NAPI
bulk cycle and conditionally only if SKBs have been deferred by other
CPUs via skb_attempt_defer_free().

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/r/167421646327.1321776.7390743166998776914.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-23 22:08:06 -08:00
Jesper Dangaard Brouer
f72ff8b81e net: fix kfree_skb_list use of skb_mark_not_on_list
A bug was introduced by commit eedade12f4 ("net: kfree_skb_list use
kmem_cache_free_bulk"). It unconditionally unlinked the SKB list via
invoking skb_mark_not_on_list().

In this patch we choose to remove the skb_mark_not_on_list() call as it
isn't necessary. It would be possible and correct to call
skb_mark_not_on_list() only when __kfree_skb_reason() returns true,
meaning the SKB is ready to be free'ed, as it calls/check skb_unref().

This fix is needed as kfree_skb_list() is also invoked on skb_shared_info
frag_list (skb_drop_fraglist() calling kfree_skb_list()). A frag_list can
have SKBs with elevated refcnt due to cloning via skb_clone_fraglist(),
which takes a reference on all SKBs in the list. This implies the
invariant that all SKBs in the list must have the same refcnt, when using
kfree_skb_list().

Reported-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com
Fixes: eedade12f4 ("net: kfree_skb_list use kmem_cache_free_bulk")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/167421088417.1125894.9761158218878962159.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-23 21:39:04 -08:00
Stanislav Fomichev
3d76a4d3d4 bpf: XDP metadata RX kfuncs
Define a new kfunc set (xdp_metadata_kfunc_ids) which implements all possible
XDP metatada kfuncs. Not all devices have to implement them. If kfunc is not
supported by the target device, the default implementation is called instead.
The verifier, at load time, replaces a call to the generic kfunc with a call
to the per-device one. Per-device kfunc pointers are stored in separate
struct xdp_metadata_ops.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-8-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-23 09:38:11 -08:00
Stanislav Fomichev
2b3486bc2d bpf: Introduce device-bound XDP programs
New flag BPF_F_XDP_DEV_BOUND_ONLY plus all the infra to have a way
to associate a netdev with a BPF program at load time.

netdevsim checks are dropped in favor of generic check in dev_xdp_attach.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-6-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-23 09:38:10 -08:00
Stanislav Fomichev
9d03ebc71a bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded
BPF offloading infra will be reused to implement
bound-but-not-offloaded bpf programs. Rename existing
helpers for clarity. No functional changes.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-3-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-23 09:38:10 -08:00
Peilin Ye
40e0b09081 net/sock: Introduce trace_sk_data_ready()
As suggested by Cong, introduce a tracepoint for all ->sk_data_ready()
callback implementations.  For example:

<...>
  iperf-609  [002] .....  70.660425: sk_data_ready: family=2 protocol=6 func=sock_def_readable
  iperf-609  [002] .....  70.660436: sk_data_ready: family=2 protocol=6 func=sock_def_readable
<...>

Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-23 11:26:50 +00:00
Paolo Abeni
71ab9c3e22 net: fix UaF in netns ops registration error path
If net_assign_generic() fails, the current error path in ops_init() tries
to clear the gen pointer slot. Anyway, in such error path, the gen pointer
itself has not been modified yet, and the existing and accessed one is
smaller than the accessed index, causing an out-of-bounds error:

 BUG: KASAN: slab-out-of-bounds in ops_init+0x2de/0x320
 Write of size 8 at addr ffff888109124978 by task modprobe/1018

 CPU: 2 PID: 1018 Comm: modprobe Not tainted 6.2.0-rc2.mptcp_ae5ac65fbed5+ #1641
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.1-2.fc37 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x6a/0x9f
  print_address_description.constprop.0+0x86/0x2b5
  print_report+0x11b/0x1fb
  kasan_report+0x87/0xc0
  ops_init+0x2de/0x320
  register_pernet_operations+0x2e4/0x750
  register_pernet_subsys+0x24/0x40
  tcf_register_action+0x9f/0x560
  do_one_initcall+0xf9/0x570
  do_init_module+0x190/0x650
  load_module+0x1fa5/0x23c0
  __do_sys_finit_module+0x10d/0x1b0
  do_syscall_64+0x58/0x80
  entry_SYSCALL_64_after_hwframe+0x72/0xdc
 RIP: 0033:0x7f42518f778d
 Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48
       89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
       ff 73 01 c3 48 8b 0d cb 56 2c 00 f7 d8 64 89 01 48
 RSP: 002b:00007fff96869688 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
 RAX: ffffffffffffffda RBX: 00005568ef7f7c90 RCX: 00007f42518f778d
 RDX: 0000000000000000 RSI: 00005568ef41d796 RDI: 0000000000000003
 RBP: 00005568ef41d796 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
 R13: 00005568ef7f7d30 R14: 0000000000040000 R15: 0000000000000000
  </TASK>

This change addresses the issue by skipping the gen pointer
de-reference in the mentioned error-path.

Found by code inspection and verified with explicit error injection
on a kasan-enabled kernel.

Fixes: d266935ac4 ("net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/cec4e0f3bb2c77ac03a6154a8508d3930beb5f0f.1674154348.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-20 18:51:18 -08:00
Magnus Karlsson
68e5b6aa27 xdp: document xdp_do_flush() before napi_complete_done()
Document in the XDP_REDIRECT manual section that drivers must call
xdp_do_flush() before napi_complete_done(). The two reasons behind
this can be found following the links below.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20221220185903.1105011-1-sbohrer@cloudflare.com
Link: https://lore.kernel.org/all/20210624160609.292325-1-toke@redhat.com/
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-18 14:33:34 +00:00
Jesper Dangaard Brouer
eedade12f4 net: kfree_skb_list use kmem_cache_free_bulk
The kfree_skb_list function walks SKB (via skb->next) and frees them
individually to the SLUB/SLAB allocator (kmem_cache). It is more
efficient to bulk free them via the kmem_cache_free_bulk API.

This patches create a stack local array with SKBs to bulk free while
walking the list. Bulk array size is limited to 16 SKBs to trade off
stack usage and efficiency. The SLUB kmem_cache "skbuff_head_cache"
uses objsize 256 bytes usually in an order-1 page 8192 bytes that is
32 objects per slab (can vary on archs and due to SLUB sharing). Thus,
for SLUB the optimal bulk free case is 32 objects belonging to same
slab, but runtime this isn't likely to occur.

The expected gain from using kmem_cache bulk alloc and free API
have been assessed via a microbencmark kernel module[1].

The module 'slab_bulk_test01' results at bulk 16 element:
 kmem-in-loop Per elem: 109 cycles(tsc) 30.532 ns (step:16)
 kmem-bulk    Per elem: 64 cycles(tsc) 17.905 ns (step:16)

More detailed description of benchmarks avail in [2].

[1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm
[2] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/kfree_skb_list01.org

V2: rename function to kfree_skb_add_bulk.

Reviewed-by: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-17 10:27:29 +01:00
Jesper Dangaard Brouer
a4650da2a2 net: fix call location in kfree_skb_list_reason
The SKB drop reason uses __builtin_return_address(0) to give the call
"location" to trace_kfree_skb() tracepoint skb:kfree_skb.

To keep this stable for compilers kfree_skb_reason() is annotated with
__fix_address (noinline __noclone) as fixed in commit c205cc7534
("net: skb: prevent the split of kfree_skb_reason() by gcc").

The function kfree_skb_list_reason() invoke kfree_skb_reason(), which
cause the __builtin_return_address(0) "location" to report the
unexpected address of kfree_skb_list_reason.

Example output from 'perf script':
 kpktgend_0  1337 [000]    81.002597: skb:kfree_skb: skbaddr=0xffff888144824700 protocol=2048 location=kfree_skb_list_reason+0x1e reason: QDISC_DROP

Patch creates an __always_inline __kfree_skb_reason() helper call that
is called from both kfree_skb_list() and kfree_skb_list_reason().
Suggestions for solutions that shares code better are welcome.

As preparation for next patch move __kfree_skb() invocation out of
this helper function.

Reviewed-by: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-17 10:27:29 +01:00
Ziyang Xuan
d219df60a7 bpf: Add ipip6 and ip6ip decap support for bpf_skb_adjust_room()
Add ipip6 and ip6ip decap support for bpf_skb_adjust_room().
Main use case is for using cls_bpf on ingress hook to decapsulate
IPv4 over IPv6 and IPv6 over IPv4 tunnel packets.

Add two new flags BPF_F_ADJ_ROOM_DECAP_L3_IPV{4,6} to indicate the
new IP header version after decapsulating the outer IP header.

Suggested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/b268ec7f0ff9431f4f43b1b40ab856ebb28cb4e1.1673574419.git.william.xuanziyang@huawei.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-15 12:56:17 -08:00
Jon Maxwell
af6d10345c ipv6: remove max_size check inline with ipv4
In ip6_dst_gc() replace:

  if (entries > gc_thresh)

With:

  if (entries > ops->gc_thresh)

Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
these warnings:

[1]   99.187805] dst_alloc: 7728 callbacks suppressed
[2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
.
.
[300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.

When this happens the packet is dropped and sendto() gets a network is
unreachable error:

remaining pkt 200557 errno 101
remaining pkt 196462 errno 101
.
.
remaining pkt 126821 errno 101

Implement David Aherns suggestion to remove max_size check seeing that Ipv6
has a GC to manage memory usage. Ipv4 already does not check max_size.

Here are some memory comparisons for Ipv4 vs Ipv6 with the patch:

Test by running 5 instances of a program that sends UDP packets to a raw
socket 5000000 times. Compare Ipv4 and Ipv6 performance with a similar
program.

Ipv4:

Before test:

MemFree:        29427108 kB
Slab:             237612 kB

ip6_dst_cache       1912   2528    256   32    2 : tunables    0    0    0
xfrm_dst_cache         0      0    320   25    2 : tunables    0    0    0
ip_dst_cache        2881   3990    192   42    2 : tunables    0    0    0

During test:

MemFree:        29417608 kB
Slab:             247712 kB

ip6_dst_cache       1912   2528    256   32    2 : tunables    0    0    0
xfrm_dst_cache         0      0    320   25    2 : tunables    0    0    0
ip_dst_cache       44394  44394    192   42    2 : tunables    0    0    0

After test:

MemFree:        29422308 kB
Slab:             238104 kB

ip6_dst_cache       1912   2528    256   32    2 : tunables    0    0    0
xfrm_dst_cache         0      0    320   25    2 : tunables    0    0    0
ip_dst_cache        3048   4116    192   42    2 : tunables    0    0    0

Ipv6 with patch:

Errno 101 errors are not observed anymore with the patch.

Before test:

MemFree:        29422308 kB
Slab:             238104 kB

ip6_dst_cache       1912   2528    256   32    2 : tunables    0    0    0
xfrm_dst_cache         0      0    320   25    2 : tunables    0    0    0
ip_dst_cache        3048   4116    192   42    2 : tunables    0    0    0

During Test:

MemFree:        29431516 kB
Slab:             240940 kB

ip6_dst_cache      11980  12064    256   32    2 : tunables    0    0    0
xfrm_dst_cache         0      0    320   25    2 : tunables    0    0    0
ip_dst_cache        3048   4116    192   42    2 : tunables    0    0    0

After Test:

MemFree:        29441816 kB
Slab:             238132 kB

ip6_dst_cache       1902   2432    256   32    2 : tunables    0    0    0
xfrm_dst_cache         0      0    320   25    2 : tunables    0    0    0
ip_dst_cache        3048   4116    192   42    2 : tunables    0    0    0

Tested-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230112012532.311021-1-jmaxwell37@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-13 20:59:14 -08:00
Jakub Kicinski
a99da46ac0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/usb/r8152.c
  be53771c87 ("r8152: add vendor/device ID pair for Microsoft Devkit")
  ec51fbd1b8 ("r8152: add USB device driver for config selection")
https://lore.kernel.org/all/20230113113339.658c4723@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-12 19:59:56 -08:00
Haiyue Wang
66cf99b55e bpf: Remove the unnecessary insn buffer comparison
The variable 'insn' is initialized to 'insn_buf' without being changed, only
some helper macros are defined, so the insn buffer comparison is unnecessary.
Just remove it. This missed removal back in 2377b81de5 ("bpf: split shared
bpf_tcp_sock and bpf_sock_ops implementation").

Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230108151258.96570-1-haiyue.wang@intel.com
2023-01-10 23:09:29 +01:00
Jakub Kicinski
12c1604ae1 net: skb: remove old comments about frag_size for build_skb()
Since commit ce098da149 ("skbuff: Introduce slab_build_skb()")
drivers trying to build skb around slab-backed buffers should
go via slab_build_skb() rather than passing frag_size = 0 to
the main build_skb().

Remove the copy'n'pasted comments about 0 meaning slab.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-09 08:15:04 +00:00
Eric Dumazet
7871f54e3d gro: take care of DODGY packets
Jaroslav reported a recent throughput regression with virtio_net
caused by blamed commit.

It is unclear if DODGY GSO packets coming from user space
can be accepted by GRO engine in the future with minimal
changes, and if there is any expected gain from it.

In the meantime, make sure to detect and flush DODGY packets.

Fixes: 5eddb24901 ("gro: add support of (hw)gro packets to gro stack")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-and-bisected-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Cc: Coco Li <lixiaoyan@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-09 07:37:07 +00:00
Mahesh Bandewar
6b754d7bd0 sysctl: expose all net/core sysctls inside netns
All were not visible to the non-priv users inside netns. However,
with 4ecb90090c ("sysctl: allow override of /proc/sys/net with
CAP_NET_ADMIN"), these vars are protected from getting modified.
A proc with capable(CAP_NET_ADMIN) can change the values so
not having them visible inside netns is just causing nuisance to
process that check certain values (e.g. net.core.somaxconn) and
see different behavior in root-netns vs. other-netns

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-06 12:41:52 +00:00
Jakub Kicinski
f05bd8ebeb devlink: move code to a dedicated directory
The devlink code is hard to navigate with 13kLoC in one file.
I really like the way Michal split the ethtool into per-command
files and core. It'd probably be too much to split it all up,
but we can at least separate the core parts out of the per-cmd
implementations and put it in a directory so that new commands
can be separate files.

Move the code, subsequent commit will do a partial split.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-05 22:12:00 -08:00
Jakub Kicinski
4aea86b403 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-05 15:34:11 -08:00
Linus Torvalds
50011c32f4 Including fixes from bpf, wifi, and netfilter.
Current release - regressions:
 
  - bpf: fix nullness propagation for reg to reg comparisons,
    avoid null-deref
 
  - inet: control sockets should not use current thread task_frag
 
  - bpf: always use maximal size for copy_array()
 
  - eth: bnxt_en: don't link netdev to a devlink port for VFs
 
 Current release - new code bugs:
 
  - rxrpc: fix a couple of potential use-after-frees
 
  - netfilter: conntrack: fix IPv6 exthdr error check
 
  - wifi: iwlwifi: fw: skip PPAG for JF, avoid FW crashes
 
  - eth: dsa: qca8k: various fixes for the in-band register access
 
  - eth: nfp: fix schedule in atomic context when sync mc address
 
  - eth: renesas: rswitch: fix getting mac address from device tree
 
  - mobile: ipa: use proper endpoint mask for suspend
 
 Previous releases - regressions:
 
  - tcp: add TIME_WAIT sockets in bhash2, fix regression caught
    by Jiri / python tests
 
  - net: tc: don't intepret cls results when asked to drop, fix
    oob-access
 
  - vrf: determine the dst using the original ifindex for multicast
 
  - eth: bnxt_en:
    - fix XDP RX path if BPF adjusted packet length
    - fix HDS (header placement) and jumbo thresholds for RX packets
 
  - eth: ice: xsk: do not use xdp_return_frame() on tx_buf->raw_buf,
    avoid memory corruptions
 
 Previous releases - always broken:
 
  - ulp: prevent ULP without clone op from entering the LISTEN status
 
  - veth: fix race with AF_XDP exposing old or uninitialized descriptors
 
  - bpf:
    - pull before calling skb_postpull_rcsum() (fix checksum support
      and avoid a WARN())
    - fix panic due to wrong pageattr of im->image (when livepatch
      and kretfunc coexist)
    - keep a reference to the mm, in case the task is dead
 
  - mptcp: fix deadlock in fastopen error path
 
  - netfilter:
    - nf_tables: perform type checking for existing sets
    - nf_tables: honor set timeout and garbage collection updates
    - ipset: fix hash:net,port,net hang with /0 subnet
    - ipset: avoid hung task warning when adding/deleting entries
 
  - selftests: net:
    - fix cmsg_so_mark.sh test hang on non-x86 systems
    - fix the arp_ndisc_evict_nocarrier test for IPv6
 
  - usb: rndis_host: secure rndis_query check against int overflow
 
  - eth: r8169: fix dmar pte write access during suspend/resume with WOL
 
  - eth: lan966x: fix configuration of the PCS
 
  - eth: sparx5: fix reading of the MAC address
 
  - eth: qed: allow sleep in qed_mcp_trace_dump()
 
  - eth: hns3:
    - fix interrupts re-initialization after VF FLR
    - fix handling of promisc when MAC addr table gets full
    - refine the handling for VF heartbeat
 
  - eth: mlx5:
    - properly handle ingress QinQ-tagged packets on VST
    - fix io_eq_size and event_eq_size params validation on big endian
    - fix RoCE setting at HCA level if not supported at all
    - don't turn CQE compression on by default for IPoIB
 
  - eth: ena:
    - fix toeplitz initial hash key value
    - account for the number of XDP-processed bytes in interface stats
    - fix rx_copybreak value update
 
 Misc:
 
  - ethtool: harden phy stat handling against buggy drivers
 
  - docs: netdev: convert maintainer's doc from FAQ to a normal document
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmO3MLcACgkQMUZtbf5S
 IrsEQBAAijPrpxsGMfX+VMqZ8RPKA3Qg8XF3ji2fSp4c0kiKv6lYI7PzPTR3u/fj
 CAlhQMHv7z53uM6Zd7FdUVl23paaEycu8YnlwSubg9z+wSeh/RQ6iq94mSk1PV+K
 LLVR/yop2N35Yp/oc5KZMb9fMLkxRG9Ci73QUVVYgvIrSd4Zdm13FjfVjL2C1MZH
 Yp003wigMs9IkIHOpHjNqwn/5s//0yXsb1PgKxCsaMdMQsG0yC+7eyDmxshCqsji
 xQm15mkGMjvWEYJaa4Tj4L3JW6lWbQzCu9nqPUX16KpmrnScr8S8Is+aifFZIBeW
 GZeDYgvjSxNWodeOrJnD3X+fnbrR9+qfx7T9y7XighfytAz5DNm1LwVOvZKDgPFA
 s+LlxOhzkDNEqbIsusK/LW+04EFc5gJyTI2iR6s4SSqmH3c3coJZQJeyRFWDZy/x
 1oqzcCcq8SwGUTJ9g6HAmDQoVkhDWDT/ZcRKhpWG0nJub972lB2iwM7LrAu+HoHI
 r8hyCkHpOi5S3WZKI9gPiGD+yOlpVAuG2wHg2IpjhKQvtd9DFUChGDhFeoB2rqJf
 9uI3RJBBYTDkeNu3kpfy5uMh2XhvbIZntK5kwpJ4VettZWFMaOAzn7KNqk8iT4gJ
 ASMrUrX59X0TAN0MgpJJm7uGtKbKZOu4lHNm74TUxH7V7bYn7dk=
 =TlcN
 -----END PGP SIGNATURE-----

Merge tag 'net-6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf, wifi, and netfilter.

  Current release - regressions:

   - bpf: fix nullness propagation for reg to reg comparisons, avoid
     null-deref

   - inet: control sockets should not use current thread task_frag

   - bpf: always use maximal size for copy_array()

   - eth: bnxt_en: don't link netdev to a devlink port for VFs

  Current release - new code bugs:

   - rxrpc: fix a couple of potential use-after-frees

   - netfilter: conntrack: fix IPv6 exthdr error check

   - wifi: iwlwifi: fw: skip PPAG for JF, avoid FW crashes

   - eth: dsa: qca8k: various fixes for the in-band register access

   - eth: nfp: fix schedule in atomic context when sync mc address

   - eth: renesas: rswitch: fix getting mac address from device tree

   - mobile: ipa: use proper endpoint mask for suspend

  Previous releases - regressions:

   - tcp: add TIME_WAIT sockets in bhash2, fix regression caught by
     Jiri / python tests

   - net: tc: don't intepret cls results when asked to drop, fix
     oob-access

   - vrf: determine the dst using the original ifindex for multicast

   - eth: bnxt_en:
      - fix XDP RX path if BPF adjusted packet length
      - fix HDS (header placement) and jumbo thresholds for RX packets

   - eth: ice: xsk: do not use xdp_return_frame() on tx_buf->raw_buf,
     avoid memory corruptions

  Previous releases - always broken:

   - ulp: prevent ULP without clone op from entering the LISTEN status

   - veth: fix race with AF_XDP exposing old or uninitialized
     descriptors

   - bpf:
      - pull before calling skb_postpull_rcsum() (fix checksum support
        and avoid a WARN())
      - fix panic due to wrong pageattr of im->image (when livepatch and
        kretfunc coexist)
      - keep a reference to the mm, in case the task is dead

   - mptcp: fix deadlock in fastopen error path

   - netfilter:
      - nf_tables: perform type checking for existing sets
      - nf_tables: honor set timeout and garbage collection updates
      - ipset: fix hash:net,port,net hang with /0 subnet
      - ipset: avoid hung task warning when adding/deleting entries

   - selftests: net:
      - fix cmsg_so_mark.sh test hang on non-x86 systems
      - fix the arp_ndisc_evict_nocarrier test for IPv6

   - usb: rndis_host: secure rndis_query check against int overflow

   - eth: r8169: fix dmar pte write access during suspend/resume with
     WOL

   - eth: lan966x: fix configuration of the PCS

   - eth: sparx5: fix reading of the MAC address

   - eth: qed: allow sleep in qed_mcp_trace_dump()

   - eth: hns3:
      - fix interrupts re-initialization after VF FLR
      - fix handling of promisc when MAC addr table gets full
      - refine the handling for VF heartbeat

   - eth: mlx5:
      - properly handle ingress QinQ-tagged packets on VST
      - fix io_eq_size and event_eq_size params validation on big endian
      - fix RoCE setting at HCA level if not supported at all
      - don't turn CQE compression on by default for IPoIB

   - eth: ena:
      - fix toeplitz initial hash key value
      - account for the number of XDP-processed bytes in interface stats
      - fix rx_copybreak value update

  Misc:

   - ethtool: harden phy stat handling against buggy drivers

   - docs: netdev: convert maintainer's doc from FAQ to a normal
     document"

* tag 'net-6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (112 commits)
  caif: fix memory leak in cfctrl_linkup_request()
  inet: control sockets should not use current thread task_frag
  net/ulp: prevent ULP without clone op from entering the LISTEN status
  qed: allow sleep in qed_mcp_trace_dump()
  MAINTAINERS: Update maintainers for ptp_vmw driver
  usb: rndis_host: Secure rndis_query check against int overflow
  net: dpaa: Fix dtsec check for PCS availability
  octeontx2-pf: Fix lmtst ID used in aura free
  drivers/net/bonding/bond_3ad: return when there's no aggregator
  netfilter: ipset: Rework long task execution when adding/deleting entries
  netfilter: ipset: fix hash:net,port,net hang with /0 subnet
  net: sparx5: Fix reading of the MAC address
  vxlan: Fix memory leaks in error path
  net: sched: htb: fix htb_classify() kernel-doc
  net: sched: cbq: dont intepret cls results when asked to drop
  net: sched: atm: dont intepret cls results when asked to drop
  dt-bindings: net: marvell,orion-mdio: Fix examples
  dt-bindings: net: sun8i-emac: Add phy-supply property
  net: ipa: use proper endpoint mask for suspend
  selftests: net: return non-zero for failures reported in arp_ndisc_evict_nocarrier
  ...
2023-01-05 12:40:50 -08:00
Jakub Kicinski
d75858ef10 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY7X/4wAKCRDbK58LschI
 g7gzAQCjKsLtAWg1OplW+B7pvEPwkQ8g3O1+PYWlToCUACTlzQD+PEMrqGnxB573
 oQAk6I2yOTwLgvlHkrm+TIdKSouI4gs=
 =2hUY
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
bpf-next 2023-01-04

We've added 45 non-merge commits during the last 21 day(s) which contain
a total of 50 files changed, 1454 insertions(+), 375 deletions(-).

The main changes are:

1) Fixes, improvements and refactoring of parts of BPF verifier's
   state equivalence checks, from Andrii Nakryiko.

2) Fix a few corner cases in libbpf's BTF-to-C converter in particular
   around padding handling and enums, also from Andrii Nakryiko.

3) Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better
  support decap on GRE tunnel devices not operating in collect metadata,
  from Christian Ehrig.

4) Improve x86 JIT's codegen for PROBE_MEM runtime error checks,
   from Dave Marchevsky.

5) Remove the need for trace_printk_lock for bpf_trace_printk
   and bpf_trace_vprintk helpers, from Jiri Olsa.

6) Add proper documentation for BPF_MAP_TYPE_SOCK{MAP,HASH} maps,
   from Maryam Tahhan.

7) Improvements in libbpf's btf_parse_elf error handling, from Changbin Du.

8) Bigger batch of improvements to BPF tracing code samples,
   from Daniel T. Lee.

9) Add LoongArch support to libbpf's bpf_tracing helper header,
   from Hengqi Chen.

10) Fix a libbpf compiler warning in perf_event_open_probe on arm32,
    from Khem Raj.

11) Optimize bpf_local_storage_elem by removing 56 bytes of padding,
    from Martin KaFai Lau.

12) Use pkg-config to locate libelf for resolve_btfids build,
    from Shen Jiamin.

13) Various libbpf improvements around API documentation and errno
    handling, from Xin Liu.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits)
  libbpf: Return -ENODATA for missing btf section
  libbpf: Add LoongArch support to bpf_tracing.h
  libbpf: Restore errno after pr_warn.
  libbpf: Added the description of some API functions
  libbpf: Fix invalid return address register in s390
  samples/bpf: Use BPF_KSYSCALL macro in syscall tracing programs
  samples/bpf: Fix tracex2 by using BPF_KSYSCALL macro
  samples/bpf: Change _kern suffix to .bpf with syscall tracing program
  samples/bpf: Use vmlinux.h instead of implicit headers in syscall tracing program
  samples/bpf: Use kyscall instead of kprobe in syscall tracing program
  bpf: rename list_head -> graph_root in field info types
  libbpf: fix errno is overwritten after being closed.
  bpf: fix regs_exact() logic in regsafe() to remap IDs correctly
  bpf: perform byte-by-byte comparison only when necessary in regsafe()
  bpf: reject non-exact register type matches in regsafe()
  bpf: generalize MAYBE_NULL vs non-MAYBE_NULL rule
  bpf: reorganize struct bpf_reg_state fields
  bpf: teach refsafe() to take into account ID remapping
  bpf: Remove unused field initialization in bpf's ctl_table
  selftests/bpf: Add jit probe_mem corner case tests to s390x denylist
  ...
====================

Link: https://lore.kernel.org/r/20230105000926.31350-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-04 20:21:25 -08:00
Steven Rostedt (Google)
292a089d78 treewide: Convert del_timer*() to timer_shutdown*()
Due to several bugs caused by timers being re-armed after they are
shutdown and just before they are freed, a new state of timers was added
called "shutdown".  After a timer is set to this state, then it can no
longer be re-armed.

The following script was run to find all the trivial locations where
del_timer() or del_timer_sync() is called in the same function that the
object holding the timer is freed.  It also ignores any locations where
the timer->function is modified between the del_timer*() and the free(),
as that is not considered a "trivial" case.

This was created by using a coccinelle script and the following
commands:

    $ cat timer.cocci
    @@
    expression ptr, slab;
    identifier timer, rfield;
    @@
    (
    -       del_timer(&ptr->timer);
    +       timer_shutdown(&ptr->timer);
    |
    -       del_timer_sync(&ptr->timer);
    +       timer_shutdown_sync(&ptr->timer);
    )
      ... when strict
          when != ptr->timer
    (
            kfree_rcu(ptr, rfield);
    |
            kmem_cache_free(slab, ptr);
    |
            kfree(ptr);
    )

    $ spatch timer.cocci . > /tmp/t.patch
    $ patch -p1 < /tmp/t.patch

Link: https://lore.kernel.org/lkml/20221123201306.823305113@linutronix.de/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Pavel Machek <pavel@ucw.cz> [ LED ]
Acked-by: Kalle Valo <kvalo@kernel.org> [ wireless ]
Acked-by: Paolo Abeni <pabeni@redhat.com> [ networking ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-25 13:38:09 -08:00
David S. Miller
be1236fce5 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY6YkXgAKCRDbK58LschI
 g25kAP4jYi+YomSlmGUzN/fUbEIHkXXyh85Yh2/yHGYdVuIuvwEA0uXeC7JHQTca
 dkcyYvgY6zJwFBV0lAVnhTRzFirFkQk=
 =THs1
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
The following pull-request contains BPF updates for your *net* tree.

We've added 7 non-merge commits during the last 5 day(s) which contain
a total of 11 files changed, 231 insertions(+), 3 deletions(-).

The main changes are:

1) Fix a splat in bpf_skb_generic_pop() under CHECKSUM_PARTIAL due to
   misuse of skb_postpull_rcsum(), from Jakub Kicinski with test case
   from Martin Lau.

2) Fix BPF verifier's nullness propagation when registers are of
   type PTR_TO_BTF_ID, from Hao Sun.

3) Fix bpftool build for JIT disassembler under statically built
   libllvm, from Anton Protopopov.

4) Fix warnings reported by resolve_btfids when building vmlinux
   with CONFIG_SECURITY_NETWORK disabled, from Hou Tao.

5) Minor fix up for BPF selftest gitignore, from Stanislav Fomichev.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-24 09:39:02 +00:00
Jakub Kicinski
c183e6c3ec Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-21 12:15:21 -08:00
Linus Torvalds
609d3bc623 Including fixes from bpf, netfilter and can.
Current release - regressions:
 
  - bpf: synchronize dispatcher update with bpf_dispatcher_xdp_func
 
  - rxrpc:
   - fix security setting propagation
   - fix null-deref in rxrpc_unuse_local()
   - fix switched parameters in peer tracing
 
 Current release - new code bugs:
 
  - rxrpc:
    - fix I/O thread startup getting skipped
    - fix locking issues in rxrpc_put_peer_locked()
    - fix I/O thread stop
    - fix uninitialised variable in rxperf server
    - fix the return value of rxrpc_new_incoming_call()
 
  - microchip: vcap: fix initialization of value and mask
 
  - nfp: fix unaligned io read of capabilities word
 
 Previous releases - regressions:
 
  - stop in-kernel socket users from corrupting socket's task_frag
 
  - stream: purge sk_error_queue in sk_stream_kill_queues()
 
  - openvswitch: fix flow lookup to use unmasked key
 
  - dsa: mv88e6xxx: avoid reg_lock deadlock in mv88e6xxx_setup_port()
 
  - devlink:
    - hold region lock when flushing snapshots
    - protect devlink dump by the instance lock
 
 Previous releases - always broken:
 
  - bpf:
    - prevent leak of lsm program after failed attach
    - resolve fext program type when checking map compatibility
 
  - skbuff: account for tail adjustment during pull operations
 
  - macsec: fix net device access prior to holding a lock
 
  - bonding: switch back when high prio link up
 
  - netfilter: flowtable: really fix NAT IPv6 offload
 
  - enetc: avoid buffer leaks on xdp_do_redirect() failure
 
  - unix: fix race in SOCK_SEQPACKET's unix_dgram_sendmsg()
 
  - dsa: microchip: remove IRQF_TRIGGER_FALLING in request_threaded_irq
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmOiGa4ACgkQMUZtbf5S
 IrvetBAAg/AjgG51gboLsuGjgRSwAi5T6ijgVR+pW+kMuoOdaamOF+h/zC1ox/H9
 QrWvTBipy+EqSD8bM4Xz0FNgidch8X4iWYhKGZuBht/4NP5FOzPUG2mNlUy5ANGq
 QZcCw6CUsir8HTb+IJpFEIq0JMwzKCm3WyAkYjEj4iuft0Y93cAgjkMVwoX0RERO
 o/pslC5dsozCLJxEglpw1aJq7aoroNuRSGSXl95nv8fU3UxmUXajnA3HNscXImdV
 6uqSIuyPIaGocpCBPRKUQd0sctkTY4cm8wmxxMCDVsBRVusoaq5eg1VRvxJm9Rxj
 gvDvHvfhnEuSigFF5A+paBp4c+i3C8g/UTBJTtptdAC+Y2tt4UT3Q5aaazYUOAqd
 W4TSJ3bk5zhkhpRF9clb0fNQaM1HOT4rkDEEGTfVN62dtHfPKpNwYufQKaYHdVj1
 RJ3ooH6c7TMVaRs6ZgEWNYToKZj94SIfPhfEhuqWXdNMDBkUMp2BXFFOp9fZDWju
 PsMQrRD7n6+XXpNvScYtnJDORqfIL9yHGZE9kxZA5QSDl9cnPA3SUbNruQPlXHrl
 w0yQlYuG3gcciua4dXaLfz1iN4rPdenuYhVBHhztEwDKl+b61CVQYlOHGkXPVURp
 oft74qCCFbva+Hf/7jENQotjT1tLfxAGdUARuFeDBueJgDRAPsw=
 =goV5
 -----END PGP SIGNATURE-----

Merge tag 'net-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf, netfilter and can.

  Current release - regressions:

   - bpf: synchronize dispatcher update with bpf_dispatcher_xdp_func

   - rxrpc:
      - fix security setting propagation
      - fix null-deref in rxrpc_unuse_local()
      - fix switched parameters in peer tracing

  Current release - new code bugs:

   - rxrpc:
      - fix I/O thread startup getting skipped
      - fix locking issues in rxrpc_put_peer_locked()
      - fix I/O thread stop
      - fix uninitialised variable in rxperf server
      - fix the return value of rxrpc_new_incoming_call()

   - microchip: vcap: fix initialization of value and mask

   - nfp: fix unaligned io read of capabilities word

  Previous releases - regressions:

   - stop in-kernel socket users from corrupting socket's task_frag

   - stream: purge sk_error_queue in sk_stream_kill_queues()

   - openvswitch: fix flow lookup to use unmasked key

   - dsa: mv88e6xxx: avoid reg_lock deadlock in mv88e6xxx_setup_port()

   - devlink:
      - hold region lock when flushing snapshots
      - protect devlink dump by the instance lock

  Previous releases - always broken:

   - bpf:
      - prevent leak of lsm program after failed attach
      - resolve fext program type when checking map compatibility

   - skbuff: account for tail adjustment during pull operations

   - macsec: fix net device access prior to holding a lock

   - bonding: switch back when high prio link up

   - netfilter: flowtable: really fix NAT IPv6 offload

   - enetc: avoid buffer leaks on xdp_do_redirect() failure

   - unix: fix race in SOCK_SEQPACKET's unix_dgram_sendmsg()

   - dsa: microchip: remove IRQF_TRIGGER_FALLING in
     request_threaded_irq"

* tag 'net-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (64 commits)
  net: fec: check the return value of build_skb()
  net: simplify sk_page_frag
  Treewide: Stop corrupting socket's task_frag
  net: Introduce sk_use_task_frag in struct sock.
  mctp: Remove device type check at unregister
  net: dsa: microchip: remove IRQF_TRIGGER_FALLING in request_threaded_irq
  can: kvaser_usb: hydra: help gcc-13 to figure out cmd_len
  can: flexcan: avoid unbalanced pm_runtime_enable warning
  Documentation: devlink: add missing toc entry for etas_es58x devlink doc
  mctp: serial: Fix starting value for frame check sequence
  nfp: fix unaligned io read of capabilities word
  net: stream: purge sk_error_queue in sk_stream_kill_queues()
  myri10ge: Fix an error handling path in myri10ge_probe()
  net: microchip: vcap: Fix initialization of value and mask
  rxrpc: Fix the return value of rxrpc_new_incoming_call()
  rxrpc: rxperf: Fix uninitialised variable
  rxrpc: Fix I/O thread stop
  rxrpc: Fix switched parameters in peer tracing
  rxrpc: Fix locking issues in rxrpc_put_peer_locked()
  rxrpc: Fix I/O thread startup getting skipped
  ...
2022-12-21 08:41:32 -08:00
Jakub Kicinski
54c3f1a814 bpf: pull before calling skb_postpull_rcsum()
Anand hit a BUG() when pulling off headers on egress to a SW tunnel.
We get to skb_checksum_help() with an invalid checksum offset
(commit d7ea0d9df2 ("net: remove two BUG() from skb_checksum_help()")
converted those BUGs to WARN_ONs()).
He points out oddness in how skb_postpull_rcsum() gets used.
Indeed looks like we should pull before "postpull", otherwise
the CHECKSUM_PARTIAL fixup from skb_postpull_rcsum() will not
be able to do its job:

	if (skb->ip_summed == CHECKSUM_PARTIAL &&
	    skb_checksum_start_offset(skb) < 0)
		skb->ip_summed = CHECKSUM_NONE;

Reported-by: Anand Parthasarathy <anpartha@meta.com>
Fixes: 6578171a7f ("bpf: add bpf_skb_change_proto helper")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20221220004701.402165-1-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-20 15:58:35 -08:00
Guillaume Nault
fb87bd4751 net: Introduce sk_use_task_frag in struct sock.
Sockets that can be used while recursing into memory reclaim, like
those used by network block devices and file systems, mustn't use
current->task_frag: if the current process is already using it, then
the inner memory reclaim call would corrupt the task_frag structure.

To avoid this, sk_page_frag() uses ->sk_allocation to detect sockets
that mustn't use current->task_frag, assuming that those used during
memory reclaim had their allocation constraints reflected in
->sk_allocation.

This unfortunately doesn't cover all cases: in an attempt to remove all
usage of GFP_NOFS and GFP_NOIO, sunrpc stopped setting these flags in
->sk_allocation, and used memalloc_nofs critical sections instead.
This breaks the sk_page_frag() heuristic since the allocation
constraints are now stored in current->flags, which sk_page_frag()
can't read without risking triggering a cache miss and slowing down
TCP's fast path.

This patch creates a new field in struct sock, named sk_use_task_frag,
which sockets with memory reclaim constraints can set to false if they
can't safely use current->task_frag. In such cases, sk_page_frag() now
always returns the socket's page_frag (->sk_frag). The first user is
sunrpc, which needs to avoid using current->task_frag but can keep
->sk_allocation set to GFP_KERNEL otherwise.

Eventually, it might be possible to simplify sk_page_frag() by only
testing ->sk_use_task_frag and avoid relying on the ->sk_allocation
heuristic entirely (assuming other sockets will set ->sk_use_task_frag
according to their constraints in the future).

The new ->sk_use_task_frag field is placed in a hole in struct sock and
belongs to a cache line shared with ->sk_shutdown. Therefore it should
be hot and shouldn't have negative performance impacts on TCP's fast
path (sk_shutdown is tested just before the while() loop in
tcp_sendmsg_locked()).

Link: https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-19 17:28:49 -08:00
Christian Ehrig
e26aa600ba bpf: Add flag BPF_F_NO_TUNNEL_KEY to bpf_skb_set_tunnel_key()
This patch allows to remove TUNNEL_KEY from the tunnel flags bitmap
when using bpf_skb_set_tunnel_key by providing a BPF_F_NO_TUNNEL_KEY
flag. On egress, the resulting tunnel header will not contain a tunnel
key if the protocol and implementation supports it.

At the moment bpf_tunnel_key wants a user to specify a numeric tunnel
key. This will wrap the inner packet into a tunnel header with the key
bit and value set accordingly. This is problematic when using a tunnel
protocol that supports optional tunnel keys and a receiving tunnel
device that is not expecting packets with the key bit set. The receiver
won't decapsulate and drop the packet.

RFC 2890 and RFC 2784 GRE tunnels are examples where this flag is
useful. It allows for generating packets, that can be decapsulated by
a GRE tunnel device not operating in collect metadata mode or not
expecting the key bit set.

Signed-off-by: Christian Ehrig <cehrig@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20221218051734.31411-1-cehrig@cloudflare.com
2022-12-19 23:53:15 +01:00
Eric Dumazet
e0c8bccd40 net: stream: purge sk_error_queue in sk_stream_kill_queues()
Changheon Lee reported TCP socket leaks, with a nice repro.

It seems we leak TCP sockets with the following sequence:

1) SOF_TIMESTAMPING_TX_ACK is enabled on the socket.

   Each ACK will cook an skb put in error queue, from __skb_tstamp_tx().
   __skb_tstamp_tx() is using skb_clone(), unless
   SOF_TIMESTAMPING_OPT_TSONLY was also requested.

2) If the application is also using MSG_ZEROCOPY, then we put in the
   error queue cloned skbs that had a struct ubuf_info attached to them.

   Whenever an struct ubuf_info is allocated, sock_zerocopy_alloc()
   does a sock_hold().

   As long as the cloned skbs are still in sk_error_queue,
   socket refcount is kept elevated.

3) Application closes the socket, while error queue is not empty.

Since tcp_close() no longer purges the socket error queue,
we might end up with a TCP socket with at least one skb in
error queue keeping the socket alive forever.

This bug can be (ab)used to consume all kernel memory
and freeze the host.

We need to purge the error queue, with proper synchronization
against concurrent writers.

Fixes: 24bcbe1cc6 ("net: stream: don't purge sk_error_queue in sk_stream_kill_queues()")
Reported-by: Changheon Lee <darklight2357@icloud.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-19 12:33:16 +00:00
Miaoqian Lin
9054b41c4e net: Fix documentation for unregister_netdevice_notifier_net
unregister_netdevice_notifier_net() is used for unregister a notifier
registered by register_netdevice_notifier_net(). Also s/into/from/.

Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-19 10:42:41 +00:00
Subash Abhinov Kasiviswanathan
2d7afdcbc9 skbuff: Account for tail adjustment during pull operations
Extending the tail can have some unexpected side effects if a program uses
a helper like BPF_FUNC_skb_pull_data to read partial content beyond the
head skb headlen when all the skbs in the gso frag_list are linear with no
head_frag -

  kernel BUG at net/core/skbuff.c:4219!
  pc : skb_segment+0xcf4/0xd2c
  lr : skb_segment+0x63c/0xd2c
  Call trace:
   skb_segment+0xcf4/0xd2c
   __udp_gso_segment+0xa4/0x544
   udp4_ufo_fragment+0x184/0x1c0
   inet_gso_segment+0x16c/0x3a4
   skb_mac_gso_segment+0xd4/0x1b0
   __skb_gso_segment+0xcc/0x12c
   udp_rcv_segment+0x54/0x16c
   udp_queue_rcv_skb+0x78/0x144
   udp_unicast_rcv_skb+0x8c/0xa4
   __udp4_lib_rcv+0x490/0x68c
   udp_rcv+0x20/0x30
   ip_protocol_deliver_rcu+0x1b0/0x33c
   ip_local_deliver+0xd8/0x1f0
   ip_rcv+0x98/0x1a4
   deliver_ptype_list_skb+0x98/0x1ec
   __netif_receive_skb_core+0x978/0xc60

Fix this by marking these skbs as GSO_DODGY so segmentation can handle
the tail updates accordingly.

Fixes: 3dcbdb134f ("net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list")
Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com>
Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/1671084718-24796-1-git-send-email-quic_subashab@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-16 21:18:58 -08:00
Jakub Kicinski
214964a13a devlink: protect devlink dump by the instance lock
Take the instance lock around devlink_nl_fill() when dumping,
doit takes it already.

We are only dumping basic info so in the worst case we were risking
data races around the reload statistics. Until the big devlink mutex
was removed all relevant code was protected by it, so the missing
instance lock was not exposed.

Fixes: d3efc2a6a6 ("net: devlink: remove devlink_mutex")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20221216044122.1863550-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-16 21:16:28 -08:00
Linus Torvalds
71a7507afb Driver Core changes for 6.2-rc1
Here is the set of driver core and kernfs changes for 6.2-rc1.
 
 The "big" change in here is the addition of a new macro,
 container_of_const() that will preserve the "const-ness" of a pointer
 passed into it.
 
 The "problem" of the current container_of() macro is that if you pass in
 a "const *", out of it can comes a non-const pointer unless you
 specifically ask for it.  For many usages, we want to preserve the
 "const" attribute by using the same call.  For a specific example, this
 series changes the kobj_to_dev() macro to use it, allowing it to be used
 no matter what the const value is.  This prevents every subsystem from
 having to declare 2 different individual macros (i.e.
 kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce
 the const value at build time, which having 2 macros would not do
 either.
 
 The driver for all of this have been discussions with the Rust kernel
 developers as to how to properly mark driver core, and kobject, objects
 as being "non-mutable".  The changes to the kobject and driver core in
 this pull request are the result of that, as there are lots of paths
 where kobjects and device pointers are not modified at all, so marking
 them as "const" allows the compiler to enforce this.
 
 So, a nice side affect of the Rust development effort has been already
 to clean up the driver core code to be more obvious about object rules.
 
 All of this has been bike-shedded in quite a lot of detail on lkml with
 different names and implementations resulting in the tiny version we
 have in here, much better than my original proposal.  Lots of subsystem
 maintainers have acked the changes as well.
 
 Other than this change, included in here are smaller stuff like:
   - kernfs fixes and updates to handle lock contention better
   - vmlinux.lds.h fixes and updates
   - sysfs and debugfs documentation updates
   - device property updates
 
 All of these have been in the linux-next tree for quite a while with no
 problems, OTHER than some merge issues with other trees that should be
 obvious when you hit them (block tree deletes a driver that this tree
 modifies, iommufd tree modifies code that this tree also touches).  If
 there are merge problems with these trees, please let me know.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCY5wz3A8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yks0ACeKYUlVgCsER8eYW+x18szFa2QTXgAn2h/VhZe
 1Fp53boFaQkGBjl8mGF8
 =v+FB
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the set of driver core and kernfs changes for 6.2-rc1.

  The "big" change in here is the addition of a new macro,
  container_of_const() that will preserve the "const-ness" of a pointer
  passed into it.

  The "problem" of the current container_of() macro is that if you pass
  in a "const *", out of it can comes a non-const pointer unless you
  specifically ask for it. For many usages, we want to preserve the
  "const" attribute by using the same call. For a specific example, this
  series changes the kobj_to_dev() macro to use it, allowing it to be
  used no matter what the const value is. This prevents every subsystem
  from having to declare 2 different individual macros (i.e.
  kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce
  the const value at build time, which having 2 macros would not do
  either.

  The driver for all of this have been discussions with the Rust kernel
  developers as to how to properly mark driver core, and kobject,
  objects as being "non-mutable". The changes to the kobject and driver
  core in this pull request are the result of that, as there are lots of
  paths where kobjects and device pointers are not modified at all, so
  marking them as "const" allows the compiler to enforce this.

  So, a nice side affect of the Rust development effort has been already
  to clean up the driver core code to be more obvious about object
  rules.

  All of this has been bike-shedded in quite a lot of detail on lkml
  with different names and implementations resulting in the tiny version
  we have in here, much better than my original proposal. Lots of
  subsystem maintainers have acked the changes as well.

  Other than this change, included in here are smaller stuff like:

   - kernfs fixes and updates to handle lock contention better

   - vmlinux.lds.h fixes and updates

   - sysfs and debugfs documentation updates

   - device property updates

  All of these have been in the linux-next tree for quite a while with
  no problems"

* tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (58 commits)
  device property: Fix documentation for fwnode_get_next_parent()
  firmware_loader: fix up to_fw_sysfs() to preserve const
  usb.h: take advantage of container_of_const()
  device.h: move kobj_to_dev() to use container_of_const()
  container_of: add container_of_const() that preserves const-ness of the pointer
  driver core: fix up missed drivers/s390/char/hmcdrv_dev.c class.devnode() conversion.
  driver core: fix up missed scsi/cxlflash class.devnode() conversion.
  driver core: fix up some missing class.devnode() conversions.
  driver core: make struct class.devnode() take a const *
  driver core: make struct class.dev_uevent() take a const *
  cacheinfo: Remove of_node_put() for fw_token
  device property: Add a blank line in Kconfig of tests
  device property: Rename goto label to be more precise
  device property: Move PROPERTY_ENTRY_BOOL() a bit down
  device property: Get rid of __PROPERTY_ENTRY_ARRAY_EL*SIZE*()
  kernfs: fix all kernel-doc warnings and multiple typos
  driver core: pass a const * into of_device_uevent()
  kobject: kset_uevent_ops: make name() callback take a const *
  kobject: kset_uevent_ops: make filter() callback take a const *
  kobject: make kobject_namespace take a const *
  ...
2022-12-16 03:54:54 -08:00
Jakub Kicinski
b4cafb3d2c devlink: hold region lock when flushing snapshots
Netdevsim triggers a splat on reload, when it destroys regions
with snapshots pending:

  WARNING: CPU: 1 PID: 787 at net/core/devlink.c:6291 devlink_region_snapshot_del+0x12e/0x140
  CPU: 1 PID: 787 Comm: devlink Not tainted 6.1.0-07460-g7ae9888d6e1c #580
  RIP: 0010:devlink_region_snapshot_del+0x12e/0x140
  Call Trace:
   <TASK>
   devl_region_destroy+0x70/0x140
   nsim_dev_reload_down+0x2f/0x60 [netdevsim]
   devlink_reload+0x1f7/0x360
   devlink_nl_cmd_reload+0x6ce/0x860
   genl_family_rcv_msg_doit.isra.0+0x145/0x1c0

This is the locking assert in devlink_region_snapshot_del(),
we're supposed to be holding the region->snapshot_lock here.

Fixes: 2dec18ad82 ("net: devlink: remove region snapshots list dependency on devlink->lock")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-16 10:21:36 +00:00
Linus Torvalds
7e68dd7d07 Networking changes for 6.2.
Core
 ----
  - Allow live renaming when an interface is up
 
  - Add retpoline wrappers for tc, improving considerably the
    performances of complex queue discipline configurations.
 
  - Add inet drop monitor support.
 
  - A few GRO performance improvements.
 
  - Add infrastructure for atomic dev stats, addressing long standing
    data races.
 
  - De-duplicate common code between OVS and conntrack offloading
    infrastructure.
 
  - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements.
 
  - Netfilter: introduce packet parser for tunneled packets
 
  - Replace IPVS timer-based estimators with kthreads to scale up
    the workload with the number of available CPUs.
 
  - Add the helper support for connection-tracking OVS offload.
 
 BPF
 ---
  - Support for user defined BPF objects: the use case is to allocate
    own objects, build own object hierarchies and use the building
    blocks to build own data structures flexibly, for example, linked
    lists in BPF.
 
  - Make cgroup local storage available to non-cgroup attached BPF
    programs.
 
  - Avoid unnecessary deadlock detection and failures wrt BPF task
    storage helpers.
 
  - A relevant bunch of BPF verifier fixes and improvements.
 
  - Veristat tool improvements to support custom filtering, sorting,
    and replay of results.
 
  - Add LLVM disassembler as default library for dumping JITed code.
 
  - Lots of new BPF documentation for various BPF maps.
 
  - Add bpf_rcu_read_{,un}lock() support for sleepable programs.
 
  - Add RCU grace period chaining to BPF to wait for the completion
    of access from both sleepable and non-sleepable BPF programs.
 
  - Add support storing struct task_struct objects as kptrs in maps.
 
  - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
    values.
 
  - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions.
 
 Protocols
 ---------
  - TCP: implement Protective Load Balancing across switch links.
 
  - TCP: allow dynamically disabling TCP-MD5 static key, reverting
    back to fast[er]-path.
 
  - UDP: Introduce optional per-netns hash lookup table.
 
  - IPv6: simplify and cleanup sockets disposal.
 
  - Netlink: support different type policies for each generic
    netlink operation.
 
  - MPTCP: add MSG_FASTOPEN and FastOpen listener side support.
 
  - MPTCP: add netlink notification support for listener sockets
    events.
 
  - SCTP: add VRF support, allowing sctp sockets binding to VRF
    devices.
 
  - Add bridging MAC Authentication Bypass (MAB) support.
 
  - Extensions for Ethernet VPN bridging implementation to better
    support multicast scenarios.
 
  - More work for Wi-Fi 7 support, comprising conversion of all
    the existing drivers to internal TX queue usage.
 
  - IPSec: introduce a new offload type (packet offload) allowing
    complete header processing and crypto offloading.
 
  - IPSec: extended ack support for more descriptive XFRM error
    reporting.
 
  - RXRPC: increase SACK table size and move processing into a
    per-local endpoint kernel thread, reducing considerably the
    required locking.
 
  - IEEE 802154: synchronous send frame and extended filtering
    support, initial support for scanning available 15.4 networks.
 
  - Tun: bump the link speed from 10Mbps to 10Gbps.
 
  - Tun/VirtioNet: implement UDP segmentation offload support.
 
 Driver API
 ----------
 
  - PHY/SFP: improve power level switching between standard
    level 1 and the higher power levels.
 
  - New API for netdev <-> devlink_port linkage.
 
  - PTP: convert existing drivers to new frequency adjustment
    implementation.
 
  - DSA: add support for rx offloading.
 
  - Autoload DSA tagging driver when dynamically changing protocol.
 
  - Add new PCP and APPTRUST attributes to Data Center Bridging.
 
  - Add configuration support for 800Gbps link speed.
 
  - Add devlink port function attribute to enable/disable RoCE and
    migratable.
 
  - Extend devlink-rate to support strict prioriry and weighted fair
    queuing.
 
  - Add devlink support to directly reading from region memory.
 
  - New device tree helper to fetch MAC address from nvmem.
 
  - New big TCP helper to simplify temporary header stripping.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Marvel Octeon CNF95N and CN10KB Ethernet Switches.
    - Marvel Prestera AC5X Ethernet Switch.
    - WangXun 10 Gigabit NIC.
    - Motorcomm yt8521 Gigabit Ethernet.
    - Microchip ksz9563 Gigabit Ethernet Switch.
    - Microsoft Azure Network Adapter.
    - Linux Automation 10Base-T1L adapter.
 
  - PHY:
    - Aquantia AQR112 and AQR412.
    - Motorcomm YT8531S.
 
  - PTP:
    - Orolia ART-CARD.
 
  - WiFi:
    - MediaTek Wi-Fi 7 (802.11be) devices.
    - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
      devices.
 
  - Bluetooth:
    - Broadcom BCM4377/4378/4387 Bluetooth chipsets.
    - Realtek RTL8852BE and RTL8723DS.
    - Cypress.CYW4373A0 WiFi + Bluetooth combo device.
 
 Drivers
 -------
  - CAN:
    - gs_usb: bus error reporting support.
    - kvaser_usb: listen only and bus error reporting support.
 
  - Ethernet NICs:
    - Intel (100G):
      - extend action skbedit to RX queue mapping.
      - implement devlink-rate support.
      - support direct read from memory.
    - nVidia/Mellanox (mlx5):
      - SW steering improvements, increasing rules update rate.
      - Support for enhanced events compression.
      - extend H/W offload packet manipulation capabilities.
      - implement IPSec packet offload mode.
    - nVidia/Mellanox (mlx4):
      - better big TCP support.
    - Netronome Ethernet NICs (nfp):
      - IPsec offload support.
      - add support for multicast filter.
    - Broadcom:
      - RSS and PTP support improvements.
    - AMD/SolarFlare:
      - netlink extened ack improvements.
      - add basic flower matches to offload, and related stats.
    - Virtual NICs:
      - ibmvnic: introduce affinity hint support.
    - small / embedded:
      - FreeScale fec: add initial XDP support.
      - Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood.
      - TI am65-cpsw: add suspend/resume support.
      - Mediatek MT7986: add RX wireless wthernet dispatch support.
      - Realtek 8169: enable GRO software interrupt coalescing per
        default.
 
  - Ethernet high-speed switches:
    - Microchip (sparx5):
      - add support for Sparx5 TC/flower H/W offload via VCAP.
    - Mellanox mlxsw:
      - add 802.1X and MAC Authentication Bypass offload support.
      - add ip6gre support.
 
  - Embedded Ethernet switches:
    - Mediatek (mtk_eth_soc):
      - improve PCS implementation, add DSA untag support.
      - enable flow offload support.
    - Renesas:
      - add rswitch R-Car Gen4 gPTP support.
    - Microchip (lan966x):
      - add full XDP support.
      - add TC H/W offload via VCAP.
      - enable PTP on bridge interfaces.
    - Microchip (ksz8):
      - add MTU support for KSZ8 series.
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - support configuring channel dwell time during scan.
 
  - MediaTek WiFi (mt76):
    - enable Wireless Ethernet Dispatch (WED) offload support.
    - add ack signal support.
    - enable coredump support.
    - remain_on_channel support.
 
  - Intel WiFi (iwlwifi):
    - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities.
    - 320 MHz channels support.
 
  - RealTek WiFi (rtw89):
    - new dynamic header firmware format support.
    - wake-over-WLAN support.
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmOYXUcSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOk8zQP/R7BZtbJMTPiWkRnSoKHnAyupDVwrz5U
 ktukLkwPsCyJuEbAjgxrxf4EEEQ9uq2FFlxNSYuKiiQMqIpFxV6KED7LCUygn4Tc
 kxtkp0Q+5XiqisWlQmtfExf2OjuuPqcjV9tWCDBI6GebKUbfNwY/eI44RcMu4BSv
 DzIlW5GkX/kZAPqnnuqaLsN3FudDTJHGEAD7NbA++7wJ076RWYSLXlFv0Z+SCSPS
 H8/PEG0/ZK/65rIWMAFRClJ9BNIDwGVgp0GrsIvs1gqbRUOlA1hl1rDM21TqtNFf
 5QPQT7sIfTcCE/nerxKJD5JE3JyP+XRlRn96PaRw3rt4MgI6I/EOj/HOKQ5tMCNc
 oPiqb7N70+hkLZyr42qX+vN9eDPjp2koEQm7EO2Zs+/534/zWDs24Zfk/Aa1ps0I
 Fa82oGjAgkBhGe/FZ6i5cYoLcyxqRqZV1Ws9XQMl72qRC7/BwvNbIW6beLpCRyeM
 yYIU+0e9dEm+wHQEdh2niJuVtR63hy8tvmPx56lyh+6u0+pondkwbfSiC5aD3kAC
 ikKsN5DyEsdXyiBAlytCEBxnaOjQy4RAz+3YXSiS0eBNacXp03UUrNGx4Pzpu/D0
 QLFJhBnMFFCgy5to8/DvKnrTPgZdSURwqbIUcZdvU21f1HLR8tUTpaQnYffc/Whm
 V8gnt1EL+0cc
 =CbJC
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Allow live renaming when an interface is up

   - Add retpoline wrappers for tc, improving considerably the
     performances of complex queue discipline configurations

   - Add inet drop monitor support

   - A few GRO performance improvements

   - Add infrastructure for atomic dev stats, addressing long standing
     data races

   - De-duplicate common code between OVS and conntrack offloading
     infrastructure

   - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements

   - Netfilter: introduce packet parser for tunneled packets

   - Replace IPVS timer-based estimators with kthreads to scale up the
     workload with the number of available CPUs

   - Add the helper support for connection-tracking OVS offload

  BPF:

   - Support for user defined BPF objects: the use case is to allocate
     own objects, build own object hierarchies and use the building
     blocks to build own data structures flexibly, for example, linked
     lists in BPF

   - Make cgroup local storage available to non-cgroup attached BPF
     programs

   - Avoid unnecessary deadlock detection and failures wrt BPF task
     storage helpers

   - A relevant bunch of BPF verifier fixes and improvements

   - Veristat tool improvements to support custom filtering, sorting,
     and replay of results

   - Add LLVM disassembler as default library for dumping JITed code

   - Lots of new BPF documentation for various BPF maps

   - Add bpf_rcu_read_{,un}lock() support for sleepable programs

   - Add RCU grace period chaining to BPF to wait for the completion of
     access from both sleepable and non-sleepable BPF programs

   - Add support storing struct task_struct objects as kptrs in maps

   - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
     values

   - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions

  Protocols:

   - TCP: implement Protective Load Balancing across switch links

   - TCP: allow dynamically disabling TCP-MD5 static key, reverting back
     to fast[er]-path

   - UDP: Introduce optional per-netns hash lookup table

   - IPv6: simplify and cleanup sockets disposal

   - Netlink: support different type policies for each generic netlink
     operation

   - MPTCP: add MSG_FASTOPEN and FastOpen listener side support

   - MPTCP: add netlink notification support for listener sockets events

   - SCTP: add VRF support, allowing sctp sockets binding to VRF devices

   - Add bridging MAC Authentication Bypass (MAB) support

   - Extensions for Ethernet VPN bridging implementation to better
     support multicast scenarios

   - More work for Wi-Fi 7 support, comprising conversion of all the
     existing drivers to internal TX queue usage

   - IPSec: introduce a new offload type (packet offload) allowing
     complete header processing and crypto offloading

   - IPSec: extended ack support for more descriptive XFRM error
     reporting

   - RXRPC: increase SACK table size and move processing into a
     per-local endpoint kernel thread, reducing considerably the
     required locking

   - IEEE 802154: synchronous send frame and extended filtering support,
     initial support for scanning available 15.4 networks

   - Tun: bump the link speed from 10Mbps to 10Gbps

   - Tun/VirtioNet: implement UDP segmentation offload support

  Driver API:

   - PHY/SFP: improve power level switching between standard level 1 and
     the higher power levels

   - New API for netdev <-> devlink_port linkage

   - PTP: convert existing drivers to new frequency adjustment
     implementation

   - DSA: add support for rx offloading

   - Autoload DSA tagging driver when dynamically changing protocol

   - Add new PCP and APPTRUST attributes to Data Center Bridging

   - Add configuration support for 800Gbps link speed

   - Add devlink port function attribute to enable/disable RoCE and
     migratable

   - Extend devlink-rate to support strict prioriry and weighted fair
     queuing

   - Add devlink support to directly reading from region memory

   - New device tree helper to fetch MAC address from nvmem

   - New big TCP helper to simplify temporary header stripping

  New hardware / drivers:

   - Ethernet:
      - Marvel Octeon CNF95N and CN10KB Ethernet Switches
      - Marvel Prestera AC5X Ethernet Switch
      - WangXun 10 Gigabit NIC
      - Motorcomm yt8521 Gigabit Ethernet
      - Microchip ksz9563 Gigabit Ethernet Switch
      - Microsoft Azure Network Adapter
      - Linux Automation 10Base-T1L adapter

   - PHY:
      - Aquantia AQR112 and AQR412
      - Motorcomm YT8531S

   - PTP:
      - Orolia ART-CARD

   - WiFi:
      - MediaTek Wi-Fi 7 (802.11be) devices
      - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
        devices

   - Bluetooth:
      - Broadcom BCM4377/4378/4387 Bluetooth chipsets
      - Realtek RTL8852BE and RTL8723DS
      - Cypress.CYW4373A0 WiFi + Bluetooth combo device

  Drivers:

   - CAN:
      - gs_usb: bus error reporting support
      - kvaser_usb: listen only and bus error reporting support

   - Ethernet NICs:
      - Intel (100G):
         - extend action skbedit to RX queue mapping
         - implement devlink-rate support
         - support direct read from memory
      - nVidia/Mellanox (mlx5):
         - SW steering improvements, increasing rules update rate
         - Support for enhanced events compression
         - extend H/W offload packet manipulation capabilities
         - implement IPSec packet offload mode
      - nVidia/Mellanox (mlx4):
         - better big TCP support
      - Netronome Ethernet NICs (nfp):
         - IPsec offload support
         - add support for multicast filter
      - Broadcom:
         - RSS and PTP support improvements
      - AMD/SolarFlare:
         - netlink extened ack improvements
         - add basic flower matches to offload, and related stats
      - Virtual NICs:
         - ibmvnic: introduce affinity hint support
      - small / embedded:
         - FreeScale fec: add initial XDP support
         - Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood
         - TI am65-cpsw: add suspend/resume support
         - Mediatek MT7986: add RX wireless wthernet dispatch support
         - Realtek 8169: enable GRO software interrupt coalescing per
           default

   - Ethernet high-speed switches:
      - Microchip (sparx5):
         - add support for Sparx5 TC/flower H/W offload via VCAP
      - Mellanox mlxsw:
         - add 802.1X and MAC Authentication Bypass offload support
         - add ip6gre support

   - Embedded Ethernet switches:
      - Mediatek (mtk_eth_soc):
         - improve PCS implementation, add DSA untag support
         - enable flow offload support
      - Renesas:
         - add rswitch R-Car Gen4 gPTP support
      - Microchip (lan966x):
         - add full XDP support
         - add TC H/W offload via VCAP
         - enable PTP on bridge interfaces
      - Microchip (ksz8):
         - add MTU support for KSZ8 series

   - Qualcomm 802.11ax WiFi (ath11k):
      - support configuring channel dwell time during scan

   - MediaTek WiFi (mt76):
      - enable Wireless Ethernet Dispatch (WED) offload support
      - add ack signal support
      - enable coredump support
      - remain_on_channel support

   - Intel WiFi (iwlwifi):
      - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities
      - 320 MHz channels support

   - RealTek WiFi (rtw89):
      - new dynamic header firmware format support
      - wake-over-WLAN support"

* tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2002 commits)
  ipvs: fix type warning in do_div() on 32 bit
  net: lan966x: Remove a useless test in lan966x_ptp_add_trap()
  net: ipa: add IPA v4.7 support
  dt-bindings: net: qcom,ipa: Add SM6350 compatible
  bnxt: Use generic HBH removal helper in tx path
  IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver
  selftests: forwarding: Add bridge MDB test
  selftests: forwarding: Rename bridge_mdb test
  bridge: mcast: Support replacement of MDB port group entries
  bridge: mcast: Allow user space to specify MDB entry routing protocol
  bridge: mcast: Allow user space to add (*, G) with a source list and filter mode
  bridge: mcast: Add support for (*, G) with a source list and filter mode
  bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source
  bridge: mcast: Add a flag for user installed source entries
  bridge: mcast: Expose __br_multicast_del_group_src()
  bridge: mcast: Expose br_multicast_new_group_src()
  bridge: mcast: Add a centralized error path
  bridge: mcast: Place netlink policy before validation functions
  bridge: mcast: Split (*, G) and (S, G) addition into different functions
  bridge: mcast: Do not derive entry type from its filter mode
  ...
2022-12-13 15:47:48 -08:00