Commit graph

1217910 commits

Author SHA1 Message Date
Florian Westphal
e0d4593140 netfilter: make nftables drops visible in net dropmonitor
net_dropmonitor blames core.c:nf_hook_slow.
Add NF_DROP_REASON() helper and use it in nft_do_chain().

The helper releases the skb, so exact drop location becomes
available. Calling code will observe the NF_STOLEN verdict
instead.

Adjust nf_hook_slow so we can embed an erro value wih
NF_STOLEN verdicts, just like we do for NF_DROP.

After this, drop in nftables can be pinpointed to a drop due
to a rule or the chain policy.

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18 10:26:43 +02:00
Florian Westphal
35c038b0a4 netfilter: nf_nat: mask out non-verdict bits when checking return value
Same as previous change: we need to mask out the non-verdict bits, as
upcoming patches may embed an errno value in NF_STOLEN verdicts too.

NF_DROP could already do this, but not all called functions do this.

Checks that only test ret vs NF_ACCEPT are fine, the 'errno parts'
are always 0 for those.

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18 10:26:43 +02:00
Florian Westphal
6291b3a67a netfilter: conntrack: convert nf_conntrack_update to netfilter verdicts
This function calls helpers that can return nf-verdicts, but then
those get converted to -1/0 as thats what the caller expects.

Theoretically NF_DROP could have an errno number set in the upper 24
bits of the return value. Or any of those helpers could return
NF_STOLEN, which would result in use-after-free.

This is fine as-is, the called functions don't do this yet.

But its better to avoid possible future problems if the upcoming
patchset to add NF_DROP_REASON() support gains further users, so remove
the 0/-1 translation from the picture and pass the verdicts down to
the caller.

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18 10:26:43 +02:00
Florian Westphal
4d26ab0086 netfilter: nf_tables: mask out non-verdict bits when checking return value
nftables trace infra must mask out the non-verdict bit parts of the
return value, else followup changes that 'return errno << 8 | NF_STOLEN'
will cause breakage.

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18 10:26:43 +02:00
Florian Westphal
e15e502710 netfilter: xt_mangle: only check verdict part of return value
These checks assume that the caller only returns NF_DROP without
any errno embedded in the upper bits.

This is fine right now, but followup patches will start to propagate
such errors to allow kfree_skb_drop_reason() in the called functions,
those would then indicate 'errno << 8 | NF_STOLEN'.

To not break things we have to mask those parts out.

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18 10:26:43 +02:00
David S. Miller
a0a8602247 Merge branch 'devlink-deadlock'
Jiri Pirko says:

====================
devlink: fix a deadlock when taking devlink instance lock while holding RTNL lock

devlink_port_fill() may be called sometimes with RTNL lock held.
When putting the nested port function devlink instance attrs,
current code takes nested devlink instance lock. In that case lock
ordering is wrong.

Patch #1 is a dependency of patch #2.
Patch #2 converts the peernet2id_alloc() call to rely in RCU so it could
         called without devlink instance lock.
Patch #3 takes device reference for devlink instance making sure that
         device does not disappear before devlink_release() is called.
Patch #4 benefits from the preparations done in patches #2 and #3 and
         removes the problematic nested devlink lock aquisition.
Patched #5-#7 improve documentation to reflect this issue so it is
              avoided in the future.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:23:02 +01:00
Jiri Pirko
5d77371e8c devlink: document devlink_rel_nested_in_notify() function
Add a documentation for devlink_rel_nested_in_notify() describing the
devlink instance locking consequences.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:23:01 +01:00
Jiri Pirko
bb11cf9b2c Documentation: devlink: add a note about RTNL lock into locking section
Add a note describing the locking order of taking RTNL lock with devlink
instance lock.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:23:01 +01:00
Jiri Pirko
b6f23b319a Documentation: devlink: add nested instance section
Add a part talking about nested devlink instances describing
the helpers and locking ordering.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:23:01 +01:00
Jiri Pirko
b5f4e37133 devlink: don't take instance lock for nested handle put
Lockdep reports following issue:

WARNING: possible circular locking dependency detected
------------------------------------------------------
devlink/8191 is trying to acquire lock:
ffff88813f32c250 (&devlink->lock_key#14){+.+.}-{3:3}, at: devlink_rel_devlink_handle_put+0x11e/0x2d0

                           but task is already holding lock:
ffffffff8511eca8 (rtnl_mutex){+.+.}-{3:3}, at: unregister_netdev+0xe/0x20

                           which lock already depends on the new lock.

                           the existing dependency chain (in reverse order) is:

                           -> #3 (rtnl_mutex){+.+.}-{3:3}:
       lock_acquire+0x1c3/0x500
       __mutex_lock+0x14c/0x1b20
       register_netdevice_notifier_net+0x13/0x30
       mlx5_lag_add_mdev+0x51c/0xa00 [mlx5_core]
       mlx5_load+0x222/0xc70 [mlx5_core]
       mlx5_init_one_devl_locked+0x4a0/0x1310 [mlx5_core]
       mlx5_init_one+0x3b/0x60 [mlx5_core]
       probe_one+0x786/0xd00 [mlx5_core]
       local_pci_probe+0xd7/0x180
       pci_device_probe+0x231/0x720
       really_probe+0x1e4/0xb60
       __driver_probe_device+0x261/0x470
       driver_probe_device+0x49/0x130
       __driver_attach+0x215/0x4c0
       bus_for_each_dev+0xf0/0x170
       bus_add_driver+0x21d/0x590
       driver_register+0x133/0x460
       vdpa_match_remove+0x89/0xc0 [vdpa]
       do_one_initcall+0xc4/0x360
       do_init_module+0x22d/0x760
       load_module+0x51d7/0x6750
       init_module_from_file+0xd2/0x130
       idempotent_init_module+0x326/0x5a0
       __x64_sys_finit_module+0xc1/0x130
       do_syscall_64+0x3d/0x90
       entry_SYSCALL_64_after_hwframe+0x46/0xb0

                           -> #2 (mlx5_intf_mutex){+.+.}-{3:3}:
       lock_acquire+0x1c3/0x500
       __mutex_lock+0x14c/0x1b20
       mlx5_register_device+0x3e/0xd0 [mlx5_core]
       mlx5_init_one_devl_locked+0x8fa/0x1310 [mlx5_core]
       mlx5_devlink_reload_up+0x147/0x170 [mlx5_core]
       devlink_reload+0x203/0x380
       devlink_nl_cmd_reload+0xb84/0x10e0
       genl_family_rcv_msg_doit+0x1cc/0x2a0
       genl_rcv_msg+0x3c9/0x670
       netlink_rcv_skb+0x12c/0x360
       genl_rcv+0x24/0x40
       netlink_unicast+0x435/0x6f0
       netlink_sendmsg+0x7a0/0xc70
       sock_sendmsg+0xc5/0x190
       __sys_sendto+0x1c8/0x290
       __x64_sys_sendto+0xdc/0x1b0
       do_syscall_64+0x3d/0x90
       entry_SYSCALL_64_after_hwframe+0x46/0xb0

                           -> #1 (&dev->lock_key#8){+.+.}-{3:3}:
       lock_acquire+0x1c3/0x500
       __mutex_lock+0x14c/0x1b20
       mlx5_init_one_devl_locked+0x45/0x1310 [mlx5_core]
       mlx5_devlink_reload_up+0x147/0x170 [mlx5_core]
       devlink_reload+0x203/0x380
       devlink_nl_cmd_reload+0xb84/0x10e0
       genl_family_rcv_msg_doit+0x1cc/0x2a0
       genl_rcv_msg+0x3c9/0x670
       netlink_rcv_skb+0x12c/0x360
       genl_rcv+0x24/0x40
       netlink_unicast+0x435/0x6f0
       netlink_sendmsg+0x7a0/0xc70
       sock_sendmsg+0xc5/0x190
       __sys_sendto+0x1c8/0x290
       __x64_sys_sendto+0xdc/0x1b0
       do_syscall_64+0x3d/0x90
       entry_SYSCALL_64_after_hwframe+0x46/0xb0

                           -> #0 (&devlink->lock_key#14){+.+.}-{3:3}:
       check_prev_add+0x1af/0x2300
       __lock_acquire+0x31d7/0x4eb0
       lock_acquire+0x1c3/0x500
       __mutex_lock+0x14c/0x1b20
       devlink_rel_devlink_handle_put+0x11e/0x2d0
       devlink_nl_port_fill+0xddf/0x1b00
       devlink_port_notify+0xb5/0x220
       __devlink_port_type_set+0x151/0x510
       devlink_port_netdevice_event+0x17c/0x220
       notifier_call_chain+0x97/0x240
       unregister_netdevice_many_notify+0x876/0x1790
       unregister_netdevice_queue+0x274/0x350
       unregister_netdev+0x18/0x20
       mlx5e_vport_rep_unload+0xc5/0x1c0 [mlx5_core]
       __esw_offloads_unload_rep+0xd8/0x130 [mlx5_core]
       mlx5_esw_offloads_rep_unload+0x52/0x70 [mlx5_core]
       mlx5_esw_offloads_unload_rep+0x85/0xc0 [mlx5_core]
       mlx5_eswitch_unload_sf_vport+0x41/0x90 [mlx5_core]
       mlx5_devlink_sf_port_del+0x120/0x280 [mlx5_core]
       genl_family_rcv_msg_doit+0x1cc/0x2a0
       genl_rcv_msg+0x3c9/0x670
       netlink_rcv_skb+0x12c/0x360
       genl_rcv+0x24/0x40
       netlink_unicast+0x435/0x6f0
       netlink_sendmsg+0x7a0/0xc70
       sock_sendmsg+0xc5/0x190
       __sys_sendto+0x1c8/0x290
       __x64_sys_sendto+0xdc/0x1b0
       do_syscall_64+0x3d/0x90
       entry_SYSCALL_64_after_hwframe+0x46/0xb0

                           other info that might help us debug this:

Chain exists of:
                             &devlink->lock_key#14 --> mlx5_intf_mutex --> rtnl_mutex

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(rtnl_mutex);
                               lock(mlx5_intf_mutex);
                               lock(rtnl_mutex);
  lock(&devlink->lock_key#14);

Problem is taking the devlink instance lock of nested instance when RTNL
is already held.

To fix this, don't take the devlink instance lock when putting nested
handle. Instead, rely on the preparations done by previous two patches
to be able to access device pointer and obtain netns id without devlink
instance lock held.

Fixes: c137743bce ("devlink: introduce object and nested devlink relationship infra")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:23:01 +01:00
Jiri Pirko
a380687200 devlink: take device reference for devlink object
In preparation to allow to access device pointer without devlink
instance lock held, make sure the device pointer is usable until
devlink_release() is called.

Fixes: c137743bce ("devlink: introduce object and nested devlink relationship infra")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:23:01 +01:00
Jiri Pirko
c503bc7df6 devlink: call peernet2id_alloc() with net pointer under RCU read lock
peernet2id_alloc() allows to be called lockless with peer net pointer
obtained in RCU critical section and makes sure to return ns ID if net
namespaces is not being removed concurrently. Benefit from
read_pnet_rcu() helper addition, use it to obtain net pointer under RCU
read lock and pass it to peernet2id_alloc() to get ns ID.

Fixes: c137743bce ("devlink: introduce object and nested devlink relationship infra")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:23:01 +01:00
Jiri Pirko
2034d90ae4 net: treat possible_net_t net pointer as an RCU one and add read_pnet_rcu()
Make the net pointer stored in possible_net_t structure annotated as
an RCU pointer. Change the access helpers to treat it as such.
Introduce read_pnet_rcu() helper to allow caller to dereference
the net pointer under RCU read lock.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:23:01 +01:00
Jakub Kicinski
ee2a35fedb mlx5-updates-2023-10-10
1) Adham Faris, Increase max supported channels number to 256
 
 2) Leon Romanovsky, Allow IPsec soft/hard limits in bytes
 
 3) Shay Drory, Replace global mlx5_intf_lock with
    HCA devcom component lock
 
 4) Wei Zhang, Optimize SF creation flow
 
 During SF creation, HCA state gets changed from INVALID to
 IN_USE step by step. Accordingly, FW sends vhca event to
 driver to inform about this state change asynchronously.
 Each vhca event is critical because all related SW/FW
 operations are triggered by it.
 
 Currently there is only a single mlx5 general event handler
 which not only handles vhca event but many other events.
 This incurs huge bottleneck because all events are forced
 to be handled in serial manner.
 
 Moreover, all SFs share same table_lock which inevitably
 impacts each other when they are created in parallel.
 
 This series will solve this issue by:
 
 1. A dedicated vhca event handler is introduced to eliminate
    the mutual impact with other mlx5 events.
 2. Max FW threads work queues are employed in the vhca event
    handler to fully utilize FW capability.
 3. Redesign SF active work logic to completely remove
    table_lock.
 
 With above optimization, SF creation time is reduced by 25%,
 i.e. from 80s to 60s when creating 100 SFs.
 
 Patches summary:
 
 Patch 1 - implement dedicated vhca event handler with max FW
           cmd threads of work queues.
 Patch 2 - remove table_lock by redesigning SF active work
           logic.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmUqzPEACgkQSD+KveBX
 +j52rwf/c/LnKjPdcWztDIj4YQys9VfyWm9rkYmuiQ9XTYjzY7Y9RTW9XvrJSTdA
 GsD8EmWnqysr0HyRqksuhR6QHydU+fOrOIYxA0OeIbqXXmhIoDqnqTPQftK1q8Cq
 xZ4wDWowwoOGs5BXYsj6m+53ukyB91/dVHS8qiqL6SzDhm6pEEqjeoum7bxM1lKF
 HxLHNLWp3wolLn361Qlvd7SyOcDu3/c7+DYUJ04Hc2TcJEc7G5ZkBtqMNUZ599Z8
 06lXXj1FgIGtqzuddHLpS71gvp5yvtiw/2ujmIX7YuH8zHDqGRhKUYiJ1eRk80iw
 aYPESXC0OsQDWgNFc13f5eTQgc2p/Q==
 =Qaae
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2023-10-10' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2023-10-10

1) Adham Faris, Increase max supported channels number to 256

2) Leon Romanovsky, Allow IPsec soft/hard limits in bytes

3) Shay Drory, Replace global mlx5_intf_lock with
   HCA devcom component lock

4) Wei Zhang, Optimize SF creation flow

During SF creation, HCA state gets changed from INVALID to
IN_USE step by step. Accordingly, FW sends vhca event to
driver to inform about this state change asynchronously.
Each vhca event is critical because all related SW/FW
operations are triggered by it.

Currently there is only a single mlx5 general event handler
which not only handles vhca event but many other events.
This incurs huge bottleneck because all events are forced
to be handled in serial manner.

Moreover, all SFs share same table_lock which inevitably
impacts each other when they are created in parallel.

This series will solve this issue by:

1. A dedicated vhca event handler is introduced to eliminate
   the mutual impact with other mlx5 events.
2. Max FW threads work queues are employed in the vhca event
   handler to fully utilize FW capability.
3. Redesign SF active work logic to completely remove
   table_lock.

With above optimization, SF creation time is reduced by 25%,
i.e. from 80s to 60s when creating 100 SFs.

Patches summary:

Patch 1 - implement dedicated vhca event handler with max FW
          cmd threads of work queues.
Patch 2 - remove table_lock by redesigning SF active work
          logic.

* tag 'mlx5-updates-2023-10-10' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5e: Allow IPsec soft/hard limits in bytes
  net/mlx5e: Increase max supported channels number to 256
  net/mlx5e: Preparations for supporting larger number of channels
  net/mlx5e: Refactor mlx5e_rss_init() and mlx5e_rss_free() API's
  net/mlx5e: Refactor mlx5e_rss_set_rxfh() and mlx5e_rss_get_rxfh()
  net/mlx5e: Refactor rx_res_init() and rx_res_free() APIs
  net/mlx5e: Use PTR_ERR_OR_ZERO() to simplify code
  net/mlx5: Use PTR_ERR_OR_ZERO() to simplify code
  net/mlx5: fix config name in Kconfig parameter documentation
  net/mlx5: Remove unused declaration
  net/mlx5: Replace global mlx5_intf_lock with HCA devcom component lock
  net/mlx5: Refactor LAG peer device lookout bus logic to mlx5 devcom
  net/mlx5: Avoid false positive lockdep warning by adding lock_class_key
  net/mlx5: Redesign SF active work to remove table_lock
  net/mlx5: Parallelize vhca event handling
====================

Link: https://lore.kernel.org/r/20231014171908.290428-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 18:27:27 -07:00
Justin Stitt
d4b14c1da5 hamradio: replace deprecated strncpy with strscpy_pad
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

We expect both hi.data.modename and hi.data.drivername to be
NUL-terminated based on its usage with sprintf:
|       sprintf(hi.data.modename, "%sclk,%smodem,fclk=%d,bps=%d%s",
|               bc->cfg.intclk ? "int" : "ext",
|               bc->cfg.extmodem ? "ext" : "int", bc->cfg.fclk, bc->cfg.bps,
|               bc->cfg.loopback ? ",loopback" : "");

Note that this data is copied out to userspace with:
|       if (copy_to_user(data, &hi, sizeof(hi)))
... however, the data was also copied FROM the user here:
|       if (copy_from_user(&hi, data, sizeof(hi)))

Considering the above, a suitable replacement is strscpy_pad() as it
guarantees NUL-termination on the destination buffer while also
NUL-padding (which is good+wanted behavior when copying data to
userspace).

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231016-strncpy-drivers-net-hamradio-baycom_epp-c-v2-1-39f72a72de30@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:59:55 -07:00
Jakub Kicinski
5294df643b docs: netlink: clean up after deprecating version
Jiri moved version to legacy specs in commit 0f07415ebb ("netlink:
specs: don't allow version to be specified for genetlink").
Update the documentation.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231016214540.1822392-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:59:51 -07:00
Jakub Kicinski
9fea94d3a8 tools: ynl: fix converting flags to names after recent cleanup
I recently cleaned up specs to not specify enum-as-flags
when target enum is already defined as flags.
YNL Python library did not convert flags, unfortunately,
so this caused breakage for Stan and Willem.

Note that the nlspec.py abstraction already hides the differences
between flags and enums (value vs user_value), so the changes
are pretty trivial.

Fixes: 0629f22ec1 ("ynl: netdev: drop unnecessary enum-as-flags")
Reported-and-tested-by: Willem de Bruijn <willemb@google.com>
Reported-and-tested-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/all/ZS10NtQgd_BJZ3RU@google.com/
Link: https://lore.kernel.org/r/20231016213937.1820386-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:59:46 -07:00
Jakub Kicinski
9fe1450f6d Merge branch 'net-remove-last-of-the-phylink-validate-methods-and-clean-up'
Russell King says:

====================
net: remove last of the phylink validate methods and clean up

This four patch series removes the last of the phylink MAC .validate
methods which can be found in the Freescale fman driver. fman has a
requirement that half duplex may not be supported in RGMII mode,
which is currently handled in its .validate method.

In order to keep this functionality when removing the .validate method,
we need to replace that with equivalent functionality, for which I
propose the optional .mac_get_caps method in the first patch.

The advantage of this approach over the .validate callback is that MAC
drivers only have to deal with the MAC_* capabilities, and don't need
to call back into phylink functions to do the masking of the ethtool
linkmodes etc - which then becomes internal to phylink. This can be
seen in the fourth patch where we make a load of these methods static.
====================

Link: https://lore.kernel.org/r/ZS1Z5DDfHyjMryYu@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:51:53 -07:00
Russell King (Oracle)
743f639762 net: phylink: remove a bunch of unused validation methods
Remove exports for phylink_caps_to_linkmodes(),
phylink_get_capabilities(), phylink_validate_mask_caps() and
phylink_generic_validate(). Also, as phylink_generic_validate() is no
longer called, we can remove its implementation as well.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/E1qsPkK-009wip-W9@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:51:53 -07:00
Russell King (Oracle)
da5f6b80ad net: phylink: remove .validate() method
The MAC .validate() method is no longer used, so remove it from the
phylink_mac_ops structure, and remove the callsite in
phylink_validate_mac_and_pcs().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/E1qsPkF-009wij-QM@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:51:53 -07:00
Russell King (Oracle)
2141297d42 net: fman: convert to .mac_get_caps()
Convert fman to use the .mac_get_caps() method rather than the
.validate() method.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Sean Anderson <sean.anderson@seco.com>
Link: https://lore.kernel.org/r/E1qsPkA-009wid-Kv@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:51:19 -07:00
Russell King (Oracle)
b6f9774719 net: phylink: provide mac_get_caps() method
Provide a new method, mac_get_caps() to get the MAC capabilities for
the specified interface mode. This is for MACs which have special
requirements, such as not supporting half-duplex in certain interface
modes, and will replace the validate() method.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/E1qsPk5-009wiX-G5@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:51:02 -07:00
Jakub Kicinski
73b24e7ce8 eth: bnxt: fix backward compatibility with older devices
Recent FW interface update bumped the size of struct hwrm_func_cfg_input
above 128B which is the max some devices support.

Probe on Stratus (BCM957452) with FW 20.8.3.11 fails with:

   bnxt_en ...: Unable to reserve tx rings
   bnxt_en ...: 2nd rings reservation failed.
   bnxt_en ...: Not enough rings available.

Once probe is fixed other errors pop up:

   bnxt_en ...: Failed to set async event completion ring.

This is because __hwrm_send() rejects requests larger than
bp->hwrm_max_ext_req_len with -E2BIG. Since the driver doesn't
actually access any of the new fields, yet, trim the length.
It should be safe.

Similar workaround exists for backing_store_cfg_input.
Although that one mins() to a constant of 256, not 128
we'll effectively use here. Michael explains: "the backing
store cfg command is supported by relatively newer firmware
that will accept 256 bytes at least."

To make debugging easier in the future add a warning
for oversized requests.

Fixes: 754fbf604f ("bnxt_en: Update firmware interface to 1.10.2.171")
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231016171640.1481493-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:50:55 -07:00
Jakub Kicinski
99e79b677b Merge branch 'bridge-add-a-limit-on-learned-fdb-entries'
Johannes Nixdorf says:

====================
bridge: Add a limit on learned FDB entries

Introduce a limit on the amount of learned FDB entries on a bridge,
configured by netlink with a build time default on bridge creation in
the kernel config.

For backwards compatibility the kernel config default is disabling the
limit (0).

Without any limit a malicious actor may OOM a kernel by spamming packets
with changing MAC addresses on their bridge port, so allow the bridge
creator to limit the number of entries.

Currently the manual entries are identified by the bridge flags
BR_FDB_LOCAL or BR_FDB_ADDED_BY_USER, atomically bundled under the new
flag BR_FDB_DYNAMIC_LEARNED. This means the limit also applies to
entries created with BR_FDB_ADDED_BY_EXT_LEARN but none of BR_FDB_LOCAL
or BR_FDB_ADDED_BY_USER, e.g. ones added by SWITCHDEV_FDB_ADD_TO_BRIDGE.

Link to the corresponding iproute2 changes:
https://lore.kernel.org/r/20230919-fdb_limit-v4-1-b4d2dc4df30f@avm.de

v4: https://lore.kernel.org/r/20230919-fdb_limit-v4-0-39f0293807b8@avm.de/
v3: https://lore.kernel.org/r/20230905-fdb_limit-v3-0-7597cd500a82@avm.de/
v2: https://lore.kernel.org/netdev/20230619071444.14625-1-jnixdorf-oss@avm.de/
v1: https://lore.kernel.org/netdev/20230515085046.4457-1-jnixdorf-oss@avm.de/
====================

Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-0-32cddff87758@avm.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:39:04 -07:00
Johannes Nixdorf
6f84090333 selftests: forwarding: bridge_fdb_learning_limit: Add a new selftest
Add a suite covering the fdb_n_learned and fdb_max_learned bridge
features, touching all special cases in accounting at least once.

Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de>
Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-5-32cddff87758@avm.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:39:02 -07:00
Johannes Nixdorf
19297c3ab2 net: bridge: Set strict_start_type for br_policy
Set any new attributes added to br_policy to be parsed strictly, to
prevent userspace from passing garbage.

Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-4-32cddff87758@avm.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:39:02 -07:00
Johannes Nixdorf
ddd1ad6882 net: bridge: Add netlink knobs for number / max learned FDB entries
The previous patch added accounting and a limit for the number of
dynamically learned FDB entries per bridge. However it did not provide
means to actually configure those bounds or read back the count. This
patch does that.

Two new netlink attributes are added for the accounting and limit of
dynamically learned FDB entries:
 - IFLA_BR_FDB_N_LEARNED (RO) for the number of entries accounted for
   a single bridge.
 - IFLA_BR_FDB_MAX_LEARNED (RW) for the configured limit of entries for
   the bridge.

The new attributes are used like this:

 # ip link add name br up type bridge fdb_max_learned 256
 # ip link add name v1 up master br type veth peer v2
 # ip link set up dev v2
 # mausezahn -a rand -c 1024 v2
 0.01 seconds (90877 packets per second
 # bridge fdb | grep -v permanent | wc -l
 256
 # ip -d link show dev br
 13: br: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 [...]
     [...] fdb_n_learned 256 fdb_max_learned 256

Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-3-32cddff87758@avm.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:39:01 -07:00
Johannes Nixdorf
bdb4dfda3b net: bridge: Track and limit dynamically learned FDB entries
A malicious actor behind one bridge port may spam the kernel with packets
with a random source MAC address, each of which will create an FDB entry,
each of which is a dynamic allocation in the kernel.

There are roughly 2^48 different MAC addresses, further limited by the
rhashtable they are stored in to 2^31. Each entry is of the type struct
net_bridge_fdb_entry, which is currently 128 bytes big. This means the
maximum amount of memory allocated for FDB entries is 2^31 * 128B =
256GiB, which is too much for most computers.

Mitigate this by maintaining a per bridge count of those automatically
generated entries in fdb_n_learned, and a limit in fdb_max_learned. If
the limit is hit new entries are not learned anymore.

For backwards compatibility the default setting of 0 disables the limit.

User-added entries by netlink or from bridge or bridge port addresses
are never blocked and do not count towards that limit.

Introduce a new fdb entry flag BR_FDB_DYNAMIC_LEARNED to keep track of
whether an FDB entry is included in the count. The flag is enabled for
dynamically learned entries, and disabled for all other entries. This
should be equivalent to BR_FDB_ADDED_BY_USER and BR_FDB_LOCAL being unset,
but contrary to the two flags it can be toggled atomically.

Atomicity is required here, as there are multiple callers that modify the
flags, but are not under a common lock (br_fdb_update is the exception
for br->hash_lock, br_fdb_external_learn_add for RTNL).

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de>
Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-2-32cddff87758@avm.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:39:01 -07:00
Johannes Nixdorf
cbf51acbc5 net: bridge: Set BR_FDB_ADDED_BY_USER early in fdb_add_entry
In preparation of the following fdb limit for dynamically learned entries,
allow fdb_create to detect that the entry was added by the user. This
way it can skip applying the limit in this case.

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de>
Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-1-32cddff87758@avm.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:39:01 -07:00
Jakub Kicinski
56a7bb12c7 wireless-next patches for v6.7
The second pull request for v6.7, with only driver changes this time.
 We have now support for mt7925 PCIe and USB variants, few new features
 and of course some fixes.
 
 Major changes:
 
 mt76
 
 * mt7925 support
 
 ath12k
 
 * read board data variant name from SMBIOS
 
 wfx
 
 * Remain-On-Channel (ROC) support
 -----BEGIN PGP SIGNATURE-----
 
 iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmUtSjcRHGt2YWxvQGtl
 cm5lbC5vcmcACgkQbhckVSbrbZu3CQf/ed0tYeZy/X6lBf1ZvFzB2upEEnY7bPXt
 PGKASGQVf0livFfwGI4o1YDI53e1fL58CdNam8CY496g7JzuRBaaOZ8jpSuYHQ36
 M5MsF7kI+k/5ja560qgn6CRJEyU89zKqGwiQUuvhPsSsCRE/VdMhGPdWKzUyBoon
 djuZQu1hayOCAuMk29TScvYisWbdUF9hKfqYBmz4xTnR5oBczP/qyknT/19pdNZZ
 6fyAcBJcTUzgASRcKYAejm1wQ6RUUS+wbqH1U2xjH6RyzCFEm0iSzF3wko7VHiiA
 f6T136N6aokMp+a/nfnxX+WNzBlhEF+IOKEw2QJs26fNipERwTpakw==
 =FIFP
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2023-10-16' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v6.7

The second pull request for v6.7, with only driver changes this time.
We have now support for mt7925 PCIe and USB variants, few new features
and of course some fixes.

Major changes:

mt76
 - mt7925 support

ath12k
 - read board data variant name from SMBIOS

wfx
 - Remain-On-Channel (ROC) support

* tag 'wireless-next-2023-10-16' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (109 commits)
  wifi: rtw89: mac: do bf_monitor only if WiFi 6 chips
  wifi: rtw89: mac: set bf_assoc capabilities according to chip gen
  wifi: rtw89: mac: set bfee_ctrl() according to chip gen
  wifi: rtw89: mac: add registers of MU-EDCA parameters for WiFi 7 chips
  wifi: rtw89: mac: generalize register of MU-EDCA switch according to chip gen
  wifi: rtw89: mac: update RTS threshold according to chip gen
  wifi: rtlwifi: simplify TX command fill callbacks
  wifi: hostap: remove unused ioctl function
  wifi: atmel: remove unused ioctl function
  wifi: rtw89: coex: add annotation __counted_by() to struct rtw89_btc_btf_set_mon_reg
  wifi: rtw89: coex: add annotation __counted_by() for struct rtw89_btc_btf_set_slot_table
  wifi: rtw89: add EHT radiotap in monitor mode
  wifi: rtw89: show EHT rate in debugfs
  wifi: rtw89: parse TX EHT rate selected by firmware from RA C2H report
  wifi: rtw89: Add EHT rate mask as parameters of RA H2C command
  wifi: rtw89: parse EHT information from RX descriptor and PPDU status packet
  wifi: radiotap: add bandwidth definition of EHT U-SIG
  wifi: rtlwifi: use convenient list_count_nodes()
  wifi: p54: Annotate struct p54_cal_database with __counted_by
  wifi: brcmfmac: fweh: Add __counted_by for struct brcmf_fweh_queue_item and use struct_size()
  ...
====================

Link: https://lore.kernel.org/r/20231016143822.880D8C433C8@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 16:52:54 -07:00
Christophe JAILLET
7713ec8447 net: openvswitch: Annotate struct mask_array with __counted_by
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/ca5c8049f58bb933f231afd0816e30a5aaa0eddd.1697264974.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-17 13:56:03 +02:00
Christophe JAILLET
df3bf90fef net: openvswitch: Use struct_size()
Use struct_size() instead of hand writing it.
This is less verbose and more robust.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/e5122b4ff878cbf3ed72653a395ad5c4da04dc1e.1697264974.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-17 13:56:03 +02:00
Paolo Abeni
53c6b86cd0 Merge branch 'i3c-mctp-net-driver'
Matt Johnston says:

====================
I3C MCTP net driver

This series adds an I3C transport for the kernel's MCTP network
protocol. MCTP is a communication protocol between system components
(BMCs, drives, NICs etc), with higher level protocols such as NVMe-MI or
PLDM built on top of it (in userspace). It runs over various transports
such as I2C, PCIe, or I3C.

The mctp-i3c driver follows a similar approach to the kernel's existing
mctp-i2c driver, creating a "mctpi3cX" network interface for each
numbered I3C bus. Busses opt in to support by adding a "mctp-controller"
property to the devicetree:

&i3c0 {
        mctp-controller;
}

The driver will bind to MCTP class devices (DCR 0xCC) that are on a
supported I3C bus. Each bus is represented by a `struct mctp_i3c_bus`
that keeps state for the network device. An individual I3C device
(struct mctp_i3c_device) performs operations using the "parent"
mctp_i3c_bus object. The I3C notify/enumeration patch is needed so that
the mctp-i3c driver can handle creating/removing mctp_i3c_bus objects as
required.

The mctp-i3c driver is using the Provisioned ID as an identifier for
target I3C devices (the neighbour address), as that will be more stable
than the I3C dynamic address. The driver internally translates that to a
dynamic address for bus operations.

The driver has been tested using an AST2600 platform. A remote endpoint
has been tested against QEMU, as well as using the target mode support
in Aspeed's vendor tree.

I3C maintainers have acked merging this through net-next tree.
====================

Link: https://lore.kernel.org/r/20231013040628.354323-1-matt@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-17 12:47:17 +02:00
Matt Johnston
c8755b29b5 mctp i3c: MCTP I3C driver
Provides MCTP network transport over an I3C bus, as specified in
DMTF DSP0233.

Each I3C bus (with "mctp-controller" devicetree property) gets an
"mctpi3cX" net device created. I3C devices are reachable as remote
endpoints through that net device. Link layer addressing uses the
I3C PID as a fixed hardware address for neighbour table entries.

The driver matches I3C devices that have the MIPI assigned DCR 0xCC for
MCTP.

Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-17 12:45:03 +02:00
Jeremy Kerr
0ac6486e5c i3c: Add support for bus enumeration & notification
This allows other drivers to be notified when new i3c busses are
attached, referring to a whole i3c bus as opposed to individual
devices.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-17 12:45:03 +02:00
Matt Johnston
ee71d6d5f1 dt-bindings: i3c: Add mctp-controller property
This property is used to describe a I3C bus with attached MCTP I3C
target devices.

Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-17 12:45:02 +02:00
Florian Westphal
1b2d3b45c1 net: gso_test: release each segment individually
consume_skb() doesn't walk the segment list, so segments other than
the first are leaked.

Move this skb_consume call into the loop.

Cc: Willem de Bruijn <willemb@google.com>
Fixes: b3098d32ed ("net: add skb_segment kunit test")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-17 11:32:05 +01:00
Jakub Kicinski
a3c2dd9648 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZS1d4wAKCRDbK58LschI
 g4DSAP441CdKh8fd+wNKUSKHFbpCQ6EvocR6Nf+Sj2DFUx/w/QEA7mfju7Abqjc3
 xwDEx0BuhrjMrjV5MmEpxc7lYl9XcQU=
 =vuWk
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-10-16

We've added 90 non-merge commits during the last 25 day(s) which contain
a total of 120 files changed, 3519 insertions(+), 895 deletions(-).

The main changes are:

1) Add missed stats for kprobes to retrieve the number of missed kprobe
   executions and subsequent executions of BPF programs, from Jiri Olsa.

2) Add cgroup BPF sockaddr hooks for unix sockets. The use case is
   for systemd to reimplement the LogNamespace feature which allows
   running multiple instances of systemd-journald to process the logs
   of different services, from Daan De Meyer.

3) Implement BPF CPUv4 support for s390x BPF JIT, from Ilya Leoshkevich.

4) Improve BPF verifier log output for scalar registers to better
   disambiguate their internal state wrt defaults vs min/max values
   matching, from Andrii Nakryiko.

5) Extend the BPF fib lookup helpers for IPv4/IPv6 to support retrieving
   the source IP address with a new BPF_FIB_LOOKUP_SRC flag,
   from Martynas Pumputis.

6) Add support for open-coded task_vma iterator to help with symbolization
   for BPF-collected user stacks, from Dave Marchevsky.

7) Add libbpf getters for accessing individual BPF ring buffers which
   is useful for polling them individually, for example, from Martin Kelly.

8) Extend AF_XDP selftests to validate the SHARED_UMEM feature,
   from Tushar Vyavahare.

9) Improve BPF selftests cross-building support for riscv arch,
   from Björn Töpel.

10) Add the ability to pin a BPF timer to the same calling CPU,
   from David Vernet.

11) Fix libbpf's bpf_tracing.h macros for riscv to use the generic
   implementation of PT_REGS_SYSCALL_REGS() to access syscall arguments,
   from Alexandre Ghiti.

12) Extend libbpf to support symbol versioning for uprobes, from Hengqi Chen.

13) Fix bpftool's skeleton code generation to guarantee that ELF data
    is 8 byte aligned, from Ian Rogers.

14) Inherit system-wide cpu_mitigations_off() setting for Spectre v1/v4
    security mitigations in BPF verifier, from Yafang Shao.

15) Annotate struct bpf_stack_map with __counted_by attribute to prepare
    BPF side for upcoming __counted_by compiler support, from Kees Cook.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (90 commits)
  bpf: Ensure proper register state printing for cond jumps
  bpf: Disambiguate SCALAR register state output in verifier logs
  selftests/bpf: Make align selftests more robust
  selftests/bpf: Improve missed_kprobe_recursion test robustness
  selftests/bpf: Improve percpu_alloc test robustness
  selftests/bpf: Add tests for open-coded task_vma iter
  bpf: Introduce task_vma open-coded iterator kfuncs
  selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c
  bpf: Don't explicitly emit BTF for struct btf_iter_num
  bpf: Change syscall_nr type to int in struct syscall_tp_t
  net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set
  bpf: Avoid unnecessary audit log for CPU security mitigations
  selftests/bpf: Add tests for cgroup unix socket address hooks
  selftests/bpf: Make sure mount directory exists
  documentation/bpf: Document cgroup unix socket address hooks
  bpftool: Add support for cgroup unix socket address hooks
  libbpf: Add support for cgroup unix socket address hooks
  bpf: Implement cgroup sockaddr hooks for unix sockets
  bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf
  bpf: Propagate modified uaddrlen from cgroup sockaddr programs
  ...
====================

Link: https://lore.kernel.org/r/20231016204803.30153-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16 21:05:33 -07:00
Yunsheng Lin
90de47f020 page_pool: fragment API support for 32-bit arch with 64-bit DMA
Currently page_pool_alloc_frag() is not supported in 32-bit
arch with 64-bit DMA because of the overlap issue between
pp_frag_count and dma_addr_upper in 'struct page' for those
arches, which seems to be quite common, see [1], which means
driver may need to handle it when using fragment API.

It is assumed that the combination of the above arch with an
address space >16TB does not exist, as all those arches have
64b equivalent, it seems logical to use the 64b version for a
system with a large address space. It is also assumed that dma
address is page aligned when we are dma mapping a page aligned
buffer, see [2].

That means we're storing 12 bits of 0 at the lower end for a
dma address, we can reuse those bits for the above arches to
support 32b+12b, which is 16TB of memory.

If we make a wrong assumption, a warning is emitted so that
user can report to us.

1. https://lore.kernel.org/all/20211117075652.58299-1-linyunsheng@huawei.com/
2. https://lore.kernel.org/all/20230818145145.4b357c89@kernel.org/

Tested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Lorenzo Bianconi <lorenzo@kernel.org>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Liang Chen <liangchen.linux@gmail.com>
CC: Guillaume Tucker <guillaume.tucker@collabora.com>
CC: Matthew Wilcox <willy@infradead.org>
CC: Linux-MM <linux-mm@kvack.org>
Link: https://lore.kernel.org/r/20231013064827.61135-2-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16 18:28:59 -07:00
Jacob Keller
e411a8e3bb net: stub tcp_gro_complete if CONFIG_INET=n
A few networking drivers including bnx2x, bnxt, qede, and idpf call
tcp_gro_complete as part of offloading TCP GRO. The function is only
defined if CONFIG_INET is true, since its TCP specific and is meaningless
if the kernel lacks IP networking support.

The combination of trying to use the complex network drivers with
CONFIG_NET but not CONFIG_INET is rather unlikely in practice: most use
cases are going to need IP networking.

The tcp_gro_complete function just sets some data in the socket buffer for
use in processing the TCP packet in the event that the GRO was offloaded to
the device. If the kernel lacks TCP support, such setup will simply go
unused.

The bnx2x, bnxt, and qede drivers wrap their TCP offload support in
CONFIG_INET checks and skip handling on such kernels.

The idpf driver did not check CONFIG_INET and thus fails to link if the
kernel is configured  with CONFIG_NET=y, CONFIG_IDPF=(m|y), and
CONFIG_INET=n.

While checking CONFIG_INET does allow the driver to bypass significantly
more instructions in the event that we know TCP networking isn't supported,
the configuration is unlikely to be used widely.

Rather than require driver authors to care about this, stub the
tcp_gro_complete function when CONFIG_INET=n. This allows drivers to be
left as-is. It does mean the idpf driver will perform slightly more work
than strictly necessary when CONFIG_INET=n, since it will still execute
some of the skb setup in idpf_rx_rsc. However, that work would be performed
in the case where CONFIG_INET=y anyways.

I did not change the existing drivers, since they appear to wrap a
significant portion of code when CONFIG_INET=n. There is little benefit in
trashing these drivers just to unwrap and remove the CONFIG_INET check.

Using a stub for tcp_gro_complete is still beneficial, as it means future
drivers no longer need to worry about this case of CONFIG_NET=y and
CONFIG_INET=n, which should reduce noise from buildbots that check such a
configuration.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/r/20231013185502.1473541-1-jacob.e.keller@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16 17:34:06 -07:00
Muhammad Muzammil
97ddc25a36 drivers: net: wwan: wwan_core.c: resolved spelling mistake
resolved typing mistake from devce to device

Signed-off-by: Muhammad Muzammil <m.muzzammilashraf@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231013042304.7881-1-m.muzzammilashraf@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16 16:39:00 -07:00
Liansen Zhai
c60991f8e1 cgroup, netclassid: on modifying netclassid in cgroup, only consider the main process.
When modifying netclassid, the command("echo 0x100001 > net_cls.classid")
will take more time on many threads of one process, because the process
create many fds.
for example, one process exists 28000 fds and 60000 threads, echo command
will task 45 seconds.
Now, we only consider the main process when exec "iterate_fd", and the
time is about 52 milliseconds.

Signed-off-by: Liansen Zhai <zhailiansen@kuaishou.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231012090330.29636-1-zhailiansen@kuaishou.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16 16:36:53 -07:00
Justin Stitt
1cfce8261d net: usb: replace deprecated strncpy with strscpy
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

Other implementations of .*get_drvinfo use strscpy so this patch brings
sr_get_drvinfo() in line as well:

igb/igb_ethtool.c +851
static void igb_get_drvinfo(struct net_device *netdev,

igbvf/ethtool.c
167:static void igbvf_get_drvinfo(struct net_device *netdev,

i40e/i40e_ethtool.c
1999:static void i40e_get_drvinfo(struct net_device *netdev,

e1000/e1000_ethtool.c
529:static void e1000_get_drvinfo(struct net_device *netdev,

ixgbevf/ethtool.c
211:static void ixgbevf_get_drvinfo(struct net_device *netdev,

...

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20231012-strncpy-drivers-net-usb-sr9800-c-v1-1-5540832c8ec2@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16 16:16:30 -07:00
Justin Stitt
2242f22ae5 lan78xx: replace deprecated strncpy with strscpy
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

Other implementations of .*get_drvinfo use strscpy so this patch brings
lan78xx_get_drvinfo() in line as well:

igb/igb_ethtool.c +851
static void igb_get_drvinfo(struct net_device *netdev,

igbvf/ethtool.c
167:static void igbvf_get_drvinfo(struct net_device *netdev,

i40e/i40e_ethtool.c
1999:static void i40e_get_drvinfo(struct net_device *netdev,

e1000/e1000_ethtool.c
529:static void e1000_get_drvinfo(struct net_device *netdev,

ixgbevf/ethtool.c
211:static void ixgbevf_get_drvinfo(struct net_device *netdev,

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20231012-strncpy-drivers-net-usb-lan78xx-c-v1-1-99d513061dfc@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16 16:16:26 -07:00
Justin Stitt
4ddc1f1f73 net: phy: smsc: replace deprecated strncpy with ethtool_sprintf
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

ethtool_sprintf() is designed specifically for get_strings() usage.
Let's replace strncpy in favor of this dedicated helper function.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231012-strncpy-drivers-net-phy-smsc-c-v1-1-00528f7524b3@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16 16:15:00 -07:00
Justin Stitt
eb7fa2eb96 net: netcp: replace deprecated strncpy with strscpy
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

Considering the above, a suitable replacement is `strscpy` [2] due to
the fact that it guarantees NUL-termination on the destination buffer
without unnecessarily NUL-padding.

Other implementations of .*get_drvinfo also use strscpy so this patch
brings keystone_get_drvinfo() in line as well:

igb/igb_ethtool.c +851
static void igb_get_drvinfo(struct net_device *netdev,

igbvf/ethtool.c
167:static void igbvf_get_drvinfo(struct net_device *netdev,

i40e/i40e_ethtool.c
1999:static void i40e_get_drvinfo(struct net_device *netdev,

e1000/e1000_ethtool.c
529:static void e1000_get_drvinfo(struct net_device *netdev,

ixgbevf/ethtool.c
211:static void ixgbevf_get_drvinfo(struct net_device *netdev,

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20231012-strncpy-drivers-net-ethernet-ti-netcp_ethss-c-v1-1-93142e620864@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16 16:14:09 -07:00
Haiyang Zhang
562b1fdf06 tcp: Set pingpong threshold via sysctl
TCP pingpong threshold is 1 by default. But some applications, like SQL DB
may prefer a higher pingpong threshold to activate delayed acks in quick
ack mode for better performance.

The pingpong threshold and related code were changed to 3 in the year
2019 in:
  commit 4a41f453be ("tcp: change pingpong threshold to 3")
And reverted to 1 in the year 2022 in:
  commit 4d8f24eeed ("Revert "tcp: change pingpong threshold to 3"")

There is no single value that fits all applications.
Add net.ipv4.tcp_pingpong_thresh sysctl tunable, so it can be tuned for
optimal performance based on the application needs.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/1697056244-21888-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16 14:55:32 -07:00
Daniel Borkmann
39d08b9164 net, sched: Add tcf_set_drop_reason for {__,}tcf_classify
Add an initial user for the newly added tcf_set_drop_reason() helper to set the
drop reason for internal errors leading to TC_ACT_SHOT inside {__,}tcf_classify().

Right now this only adds a very basic SKB_DROP_REASON_TC_ERROR as a generic
fallback indicator to mark drop locations. Where needed, such locations can be
converted to more specific codes, for example, when hitting the reclassification
limit, etc.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Victor Nogueira <victor@mojatatu.com>
Link: https://lore.kernel.org/r/20231009092655.22025-2-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16 10:07:37 -07:00
Daniel Borkmann
54a59aed39 net, sched: Make tc-related drop reason more flexible
Currently, the kfree_skb_reason() in sch_handle_{ingress,egress}() can only
express a basic SKB_DROP_REASON_TC_INGRESS or SKB_DROP_REASON_TC_EGRESS reason.

Victor kicked-off an initial proposal to make this more flexible by disambiguating
verdict from return code by moving the verdict into struct tcf_result and
letting tcf_classify() return a negative error. If hit, then two new drop
reasons were added in the proposal, that is SKB_DROP_REASON_TC_INGRESS_ERROR
as well as SKB_DROP_REASON_TC_EGRESS_ERROR. Further analysis of the actual
error codes would have required to attach to tcf_classify via kprobe/kretprobe
to more deeply debug skb and the returned error.

In order to make the kfree_skb_reason() in sch_handle_{ingress,egress}() more
extensible, it can be addressed in a more straight forward way, that is: Instead
of placing the verdict into struct tcf_result, we can just put the drop reason
in there, which does not require changes throughout various classful schedulers
given the existing verdict logic can stay as is.

Then, SKB_DROP_REASON_TC_ERROR{,_*} can be added to the enum skb_drop_reason
to disambiguate between an error or an intentional drop. New drop reason error
codes can be added successively to the tc code base.

For internal error locations which have not yet been annotated with a
SKB_DROP_REASON_TC_ERROR{,_*}, the fallback is SKB_DROP_REASON_TC_INGRESS and
SKB_DROP_REASON_TC_EGRESS, respectively. Generic errors could be marked with a
SKB_DROP_REASON_TC_ERROR code until they are converted to more specific ones
if it is found that they would be useful for troubleshooting.

While drop reasons have infrastructure for subsystem specific error codes which
are currently used by mac80211 and ovs, Jakub mentioned that it is preferred
for tc to use the enum skb_drop_reason core codes given it is a better fit and
currently the tooling support is better, too.

With regards to the latter:

  [...] I think Alastair (bpftrace) is working on auto-prettifying enums when
  bpftrace outputs maps. So we can do something like:

  $ bpftrace -e 'tracepoint:skb:kfree_skb { @[args->reason] = count(); }'
  Attaching 1 probe...
  ^C

  @[SKB_DROP_REASON_TC_INGRESS]: 2
  @[SKB_CONSUMED]: 34

  ^^^^^^^^^^^^ names!!

  Auto-magically. [...]

Add a small helper tcf_set_drop_reason() which can be used to set the drop reason
into the tcf_result.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Victor Nogueira <victor@mojatatu.com>
Link: https://lore.kernel.org/netdev/20231006063233.74345d36@kernel.org
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20231009092655.22025-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16 10:07:36 -07:00
Daniel Borkmann
99c9991f4e Merge branch 'bpf-log-improvements'
Andrii Nakryiko says:

====================
This patch set fixes ambiguity in BPF verifier log output of SCALAR register
in the parts that emit umin/umax, smin/smax, etc ranges. See patch #4 for
details.

Also, patch #5 fixes an issue with verifier log missing instruction context
(state) output for conditionals that trigger precision marking. See details in
the patch.

First two patches are just improvements to two selftests that are very flaky
locally when run in parallel mode.

Patch #3 changes 'align' selftest to be less strict about exact verifier log
output (which patch #4 changes, breaking lots of align tests as written). Now
test does more of a register substate checks, mostly around expected var_off()
values. This 'align' selftests is one of the more brittle ones and requires
constant adjustment when verifier log output changes, without really catching
any new issues. So hopefully these changes can minimize future support efforts
for this specific set of tests.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2023-10-16 13:49:41 +02:00