Commit graph

2527 commits

Author SHA1 Message Date
Jakub Kicinski
449f6bc17a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/sched/sch_taprio.c
  d636fc5dd6 ("net: sched: add rcu annotations around qdisc->qdisc_sleeping")
  dced11ef84 ("net/sched: taprio: don't overwrite "sch" variable in taprio_dump_class_stats()")

net/ipv4/sysctl_net_ipv4.c
  e209fee411 ("net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294")
  ccce324dab ("tcp: make the first N SYN RTO backoffs linear")
https://lore.kernel.org/all/20230605100816.08d41a7b@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-08 11:35:14 -07:00
Lama Kayal
a33682e4e7 net/mlx5e: Expose catastrophic steering error counters
Add generated_pkt_steering_fail and handled_pkt_steering_fail to devlink
heatlth reporter.
generated_pkt_steering_fail indicates the number of packets dropped due to
illegal steering operation within the vport steering domain.
handled_pkt_steering_fail indicates the number of packets dropped due to
illegal steering operation, originated by the vport.

Also, update devlink reporter functionality documentation with the newly
exposed counters.

Signed-off-by: Lama Kayal <lkayal@nvidia.com>
Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-07 14:00:43 -07:00
Akihiro Suda
e209fee411 net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294
With this commit, all the GIDs ("0 4294967294") can be written to the
"net.ipv4.ping_group_range" sysctl.

Note that 4294967295 (0xffffffff) is an invalid GID (see gid_valid() in
include/linux/uidgid.h), and an attempt to register this number will cause
-EINVAL.

Prior to this commit, only up to GID 2147483647 could be covered.
Documentation/networking/ip-sysctl.rst had "0 4294967295" as an example
value, but this example was wrong and causing -EINVAL.

Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Co-developed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-02 09:55:22 +01:00
Jakub Kicinski
a03a91bd68 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/sfc/tc.c
  622ab65634 ("sfc: fix error unwinds in TC offload")
  b6583d5e9e ("sfc: support TC decap rules matching on enc_src_port")

net/mptcp/protocol.c
  5b825727d0 ("mptcp: add annotations around msk->subflow accesses")
  e76c8ef5cc ("mptcp: refactor mptcp_stream_accept()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-01 15:38:26 -07:00
Jakub Kicinski
aa866ee4b1 mlx5-fixes-2023-05-24
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmRu2ZQACgkQSD+KveBX
 +j4sQAgAtCLJfYADpU7yWPZ3SMAnTelc65PfawdjEyBlRjE2CxsQ/bnRiWUXuR5i
 twsGvAJxQfpDBV0z/TxcdYznpLxHSIhosg0pRH+Tc4Lk6PjdFK7Sviqyw9JTrCJX
 rVaLJ5N877Pu5gH++OqFs8C+2gkrppQKWJtrmUaAPPdXixkr2q2pwlJiCOLpGYTN
 9yGKBO29wtvD0AJsssS8VJgy6uHMaOsTTACR4fYuV/huTKa5QfYBdJY1HqEs7Fe2
 LhGyLjVx6nHD8kt+fL9J2LJ6CLnG9a50reKuZnCOBa42obahUBCAQ5mr1mQlzNr5
 cTOmAZM+OWOa7bXJAY9AbRorlr1A4w==
 =DRJZ
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-fixes-2023-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5 fixes 2023-05-24

This series includes bug fixes for the mlx5 driver.

* tag 'mlx5-fixes-2023-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  Documentation: net/mlx5: Wrap notes in admonition blocks
  Documentation: net/mlx5: Add blank line separator before numbered lists
  Documentation: net/mlx5: Use bullet and definition lists for vnic counters description
  Documentation: net/mlx5: Wrap vnic reporter devlink commands in code blocks
  net/mlx5: Fix check for allocation failure in comp_irqs_request_pci()
  net/mlx5: DR, Add missing mutex init/destroy in pattern manager
  net/mlx5e: Move Ethernet driver debugfs to profile init callback
  net/mlx5e: Don't attach netdev profile while handling internal error
  net/mlx5: Fix post parse infra to only parse every action once
  net/mlx5e: Use query_special_contexts cmd only once per mdev
  net/mlx5: fw_tracer, Fix event handling
  net/mlx5: SF, Drain health before removing device
  net/mlx5: Drain health before unregistering devlink
  net/mlx5e: Do not update SBCM when prio2buffer command is invalid
  net/mlx5e: Consider internal buffers size in port buffer calculations
  net/mlx5e: Prevent encap offload when neigh update is running
  net/mlx5e: Extract remaining tunnel encap code to dedicated file
====================

Link: https://lore.kernel.org/r/20230525034847.99268-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-25 21:09:41 -07:00
Chuck Lever
26fb5480a2 net/handshake: Enable the SNI extension to work properly
Enable the upper layer protocol to specify the SNI peername. This
avoids the need for tlshd to use a DNS lookup, which can return a
hostname that doesn't match the incoming certificate's SubjectName.

Fixes: 2fd5532044 ("net/handshake: Add a kernel API for requesting a TLSv1.3 handshake")
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24 22:05:24 -07:00
Bagas Sanjaya
bb72b94c65 Documentation: net/mlx5: Wrap notes in admonition blocks
Wrap note paragraphs in note:: directive as it better fit for the
purpose of noting devlink commands.

Fixes: f2d51e5793 ("net/mlx5: Separate mlx5 driver documentation into multiple pages")
Fixes: cf14af140a ("net/mlx5e: Add vnic devlink health reporter to representors")
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-24 20:44:20 -07:00
Bagas Sanjaya
92059da65d Documentation: net/mlx5: Add blank line separator before numbered lists
The doc forgets to add separator before numbered lists, which causes the
lists to be appended to previous paragraph inline instead.

Add the missing separator.

Fixes: f2d51e5793 ("net/mlx5: Separate mlx5 driver documentation into multiple pages")
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-24 20:44:20 -07:00
Bagas Sanjaya
565b8c60e9 Documentation: net/mlx5: Use bullet and definition lists for vnic counters description
"vnic reporter" section contains unformatted description for vnic
counters, which is rendered as one long paragraph instead of list.

Use bullet and definition lists to match other lists.

Fixes: b0bc615df4 ("net/mlx5: Add vnic devlink health reporter to PFs/VFs")
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-24 20:44:19 -07:00
Bagas Sanjaya
c3acc46c60 Documentation: net/mlx5: Wrap vnic reporter devlink commands in code blocks
Sphinx reports htmldocs warnings:

Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst:287: WARNING: Unexpected indentation.
Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst:288: WARNING: Block quote ends without a blank line; unexpected unindent.
Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst:290: WARNING: Unexpected indentation.

Fix above warnings by wrapping diagnostic devlink commands in "vnic
reporter" section in code blocks to be consistent with other devlink
command snippets.

Fixes: b0bc615df4 ("net/mlx5: Add vnic devlink health reporter to PFs/VFs")
Fixes: cf14af140a ("net/mlx5e: Add vnic devlink health reporter to representors")
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-24 20:44:19 -07:00
Dave Ertman
1c769b1a30 ice: Remove LAG+SRIOV mutual exclusion
There was a change previously to stop SR-IOV and LAG from existing on the
same interface.  This was to prevent the violation of LACP (Link
Aggregation Control Protocol).  The method to achieve this was to add a
no-op Rx handler onto the netdev when SR-IOV VFs were present, thus
blocking bonding, bridging, etc from claiming the interface by adding
its own Rx handler.  Also, when an interface was added into a aggregate,
then the SR-IOV capability was set to false.

There are some users that have in house solutions using both SR-IOV and
bridging/bonding that this method interferes with (e.g. creating duplicate
VFs on the bonded interfaces and failing between them when the interface
fails over).

It makes more sense to provide the most functionality
possible, the restriction on co-existence of these features will be
removed.  No additional functionality is currently being provided beyond
what existed before the co-existence restriction was put into place.  It is
up to the end user to not implement a solution that would interfere with
existing network protocols.

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-05-17 08:43:47 -07:00
Jakub Kicinski
e7480a44d7 Revert "net: Remove low_thresh in ip defrag"
This reverts commit b2cbac9b9b.

We have multiple reports of obvious breakage from this patch.

Reported-by: Ido Schimmel <idosch@idosch.org>
Link: https://lore.kernel.org/all/ZGIRWjNcfqI8yY8W@shredder/
Link: https://lore.kernel.org/all/CADJHv_sDK=0RrMA2FTZQV5fw7UQ+qY=HG21Wu5qb0V9vvx5w6A@mail.gmail.com/
Reported-by: syzbot+a5e719ac7c268e414c95@syzkaller.appspotmail.com
Reported-by: syzbot+a03fd670838d927d9cd8@syzkaller.appspotmail.com
Fixes: b2cbac9b9b ("net: Remove low_thresh in ip defrag")
Link: https://lore.kernel.org/r/20230517034112.1261835-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-16 20:46:30 -07:00
Hariprasad Kelam
efe103065c docs: octeontx2: Add Documentation for QOS
Add QOS example configuration along with tc-htb commands

Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15 09:31:08 +01:00
Angus Chen
b2cbac9b9b net: Remove low_thresh in ip defrag
As low_thresh has no work in fragment reassembles,del it.
And Mark it deprecated in sysctl Document.

Signed-off-by: Angus Chen <angus.chen@jaguarmicro.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15 08:42:07 +01:00
Chuck Lever
eefca7ec51 net/handshake: Enable the SNI extension to work properly
Enable the upper layer protocol to specify the SNI peername. This
avoids the need for tlshd to use a DNS lookup, which can return a
hostname that doesn't match the incoming certificate's SubjectName.

Fixes: 2fd5532044 ("net/handshake: Add a kernel API for requesting a TLSv1.3 handshake")
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12 09:24:08 +01:00
Jakub Kicinski
bc88ba0cad Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes. No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-11 09:06:55 -07:00
David Morley
ccce324dab tcp: make the first N SYN RTO backoffs linear
Currently the SYN RTO schedule follows an exponential backoff
scheme, which can be unnecessarily conservative in cases where
there are link failures. In such cases, it's better to
aggressively try to retransmit packets, so it takes routers
less time to find a repath with a working link.

We chose a default value for this sysctl of 4, to follow
the macOS and IOS backoff scheme of 1,1,1,1,1,2,4,8, ...
MacOS and IOS have used this backoff schedule for over
a decade, since before this 2009 IETF presentation
discussed the behavior:
https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf

This commit makes the SYN RTO schedule start with a number of
linear backoffs given by the following sysctl:
* tcp_syn_linear_timeouts

This changes the SYN RTO scheme to be: init_rto_val for
tcp_syn_linear_timeouts, exp backoff starting at init_rto_val

For example if init_rto_val = 1 and tcp_syn_linear_timeouts = 2, our
backoff scheme would be: 1, 1, 1, 2, 4, 8, 16, ...

Signed-off-by: David Morley <morleyd@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Tested-by: David Morley <morleyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230509180558.2541885-1-morleyd.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-11 10:31:16 +02:00
Randy Dunlap
77c964dad9 docs: networking: fix x25-iface.rst heading & index order
Fix the chapter heading for "X.25 Device Driver Interface" so that it
does not contain a trailing '-' character, which makes Sphinx
omit this heading from the contents.

Reverse the order of the x25.rst and x25-iface.rst files in the index
so that the project introduction (x25.rst) comes first.

Fixes: 883780af72 ("docs: networking: convert x25-iface.txt to ReST")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-doc@vger.kernel.org
Cc: Martin Schiller <ms@dev.tdt.de>
Cc: linux-x25@vger.kernel.org
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-10 10:31:46 +01:00
Hangbin Liu
84df83e0ec Documentation: bonding: fix the doc of peer_notif_delay
Bonding only supports setting peer_notif_delay with miimon set.

Fixes: 0307d589c4 ("bonding: add documentation for peer_notif_delay")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-10 09:27:20 +01:00
Paolo Abeni
c248b27cfc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-26 10:17:46 +02:00
Jakub Kicinski
00d0f31a1e net: ethtool: coalesce: try to make user settings stick twice
SET_COALESCE may change operation mode and parameters in one call.
Changing operation mode may cause the driver to reset the parameter
values to what is a reasonable default for new operation mode.

Since driver does not know which parameters come from user and which
are echoed back from ->get, driver may ignore the parameters when
switching operation modes.

This used to be inevitable for ioctl() but in netlink we know which
parameters are actually specified by the user.

We could inform which parameters were set by the user but this would
lead to a lot of code duplication in the drivers. Instead try to call
the drivers twice if both mode and params are changed. The set method
already checks if any params need updating so in case the driver did
the right thing the first time around - there will be no second call
to it's ->set method (only an extra call to ->get()).

For mlx5 for example before this patch we'd see:

 # ethtool -C eth0 adaptive-rx on  adaptive-tx on
 # ethtool -C eth0 adaptive-rx off adaptive-tx off \
		   tx-usecs 123 rx-usecs 123
 Adaptive RX: off  TX: off
 rx-usecs: 3
 rx-frames: 32
 tx-usecs: 16
 tx-frames: 32
 [...]

After the change:

 # ethtool -C eth0 adaptive-rx on  adaptive-tx on
 # ethtool -C eth0 adaptive-rx off adaptive-tx off \
		   tx-usecs 123 rx-usecs 123
 Adaptive RX: off  TX: off
 rx-usecs: 123
 rx-frames: 32
 tx-usecs: 123
 tx-frames: 32
 [...]

This only works for netlink, so it's a small discrepancy between
netlink and ioctl(). Since we anticipate most users to move to
netlink I believe it's worth making their lives easier.

Link: https://lore.kernel.org/r/20230420233302.944382-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24 18:09:49 -07:00
David Howells
e0416e7d33 rxrpc: Fix potential race in error handling in afs_make_call()
If the rxrpc call set up by afs_make_call() receives an error whilst it is
transmitting the request, there's the possibility that it may get to the
point the rxrpc call is ended (after the error_kill_call label) just as the
call is queued for async processing.

This could manifest itself as call->rxcall being seen as NULL in
afs_deliver_to_call() when it tries to lock the call.

Fix this by splitting rxrpc_kernel_end_call() into a function to shut down
an rxrpc call and a function to release the caller's reference and calling
the latter only when we get to afs_put_call().

Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: kafs-testing+fedora36_64checkkafs-build-306@auristor.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-22 15:16:39 +01:00
Jakub Kicinski
fbc1449d38 mlx5-updates-2023-04-20
1) Dragos Improves RX page pool, and provides some fixes to his previous series:
  1.1) Fix releasing page_pool for striding RQ and legacy RQ nonlinear case
  1.2) Hook NAPIs to page pools to gain more performance.
 
 2) From Roi, Some cleanups to TC and eswitch modules.
 
 3) Maher migrates vnic diagnostic counters reporting from debugfs to a
     dedicated devlink health reporter
 
 Maher Says:
 ===========
  net/mlx5: Expose vnic diagnostic counters using devlink
 
 Currently, vnic diagnostic counters are exposed through the following
 debugfs:
 
 $ ls /sys/kernel/debug/mlx5/0000:08:00.0/esw/vf_0/vnic_diag/
 cq_overrun
 quota_exceeded_command
 total_q_under_processor_handle
 invalid_command
 send_queue_priority_update_flow
 nic_receive_steering_discard
 
 The current design does not allow the hypervisor to view the diagnostic
 counters of its VFs, in case the VFs get bound to a VM. In other words,
 the counters are not exposed for representor interfaces.
 Furthermore, the debugfs design is inconvenient future-wise, in case more
 counters need to be reported by the driver in the future.
 
 As these counters pertain to vNIC health, it is more appropriate to
 utilize the devlink health reporter to expose them.
 
 Thus, this patchest includes the following changes:
 
 * Drop the current vnic diagnostic counters debugfs interface.
 * Add a vnic devlink health reporter for PFs/VFs core devices, which
   when diagnosed will dump vnic diagnostic counter values that are
   queried from FW.
 * Add a vnic devlink health reporter for the representor interface, which
   serves the same purpose listed in the previous point, in addition to
   allowing the hypervisor to view its VFs diagnostic counters, even when
   the VFs are bounded to external VMs.
 
 Example of devlink health reporter usage is:
 $devlink health diagnose pci/0000:08:00.0 reporter vnic
  vNIC env counters:
     total_error_queues: 0 send_queue_priority_update_flow: 0
     comp_eq_overrun: 0 async_eq_overrun: 0 cq_overrun: 0
     invalid_command: 0 quota_exceeded_command: 0
     nic_receive_steering_discard: 0
 
 ===========
 
 4) SW steering fixes and improvements
 
 Yevgeny Kliteynik Says:
 =======================
 These short patch series are just small fixes / improvements for
 SW steering:
 
  - Patch 1: Fix dumping of legacy modify_hdr in debug dump to
    align to what is expected by parser
  - Patch 2: Have separate threshold for ICM sync per ICM type
  - Patch 3: Add more info to the steering debug dump - Linux
    version and device name
  - Patch 4: Keep track of number of buddies that are currently
    in use per domain per buddy type
 
 =======================
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmRB6HcACgkQSD+KveBX
 +j4+nAf/cCJ7poDQQOp/Oug6H3xn/COYPv3Iasl0dhT8yu+LlNMQTF0bRrU4UpAQ
 aIKcja2biOBAnD96EhA1nJoo9bJUTtKLokUDDyK/xRHS+wIyr8Lia6vxTz1yjj3C
 jDqX3+ZP4rFuhAvh+92AT1I0JvS0g+ocokPVKmm+Pwf4y7sG69CZ7phVGSc0iFfT
 y+gnP4C6cdIr7kNLByeeX6alDHL/q83vfNFWrugRPna2uXjcSR5Gtp03pJ0OVkI5
 qHxG7Bz0BE/hcMYwNcNVTu/5e02+5PS6B8kN/ho5DkhVFhp+h17XWWvOMZxC3jfI
 k0ijRSWocG1jIgRgkioBRgzXmc6nNw==
 =jUe8
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2023-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2023-04-20

1) Dragos Improves RX page pool, and provides some fixes to his previous
   series:
 1.1) Fix releasing page_pool for striding RQ and legacy RQ nonlinear case
 1.2) Hook NAPIs to page pools to gain more performance.

2) From Roi, Some cleanups to TC and eswitch modules.

3) Maher migrates vnic diagnostic counters reporting from debugfs to a
    dedicated devlink health reporter

Maher Says:
===========
 net/mlx5: Expose vnic diagnostic counters using devlink

Currently, vnic diagnostic counters are exposed through the following
debugfs:

$ ls /sys/kernel/debug/mlx5/0000:08:00.0/esw/vf_0/vnic_diag/
cq_overrun
quota_exceeded_command
total_q_under_processor_handle
invalid_command
send_queue_priority_update_flow
nic_receive_steering_discard

The current design does not allow the hypervisor to view the diagnostic
counters of its VFs, in case the VFs get bound to a VM. In other words,
the counters are not exposed for representor interfaces.
Furthermore, the debugfs design is inconvenient future-wise, in case more
counters need to be reported by the driver in the future.

As these counters pertain to vNIC health, it is more appropriate to
utilize the devlink health reporter to expose them.

Thus, this patchest includes the following changes:

* Drop the current vnic diagnostic counters debugfs interface.
* Add a vnic devlink health reporter for PFs/VFs core devices, which
  when diagnosed will dump vnic diagnostic counter values that are
  queried from FW.
* Add a vnic devlink health reporter for the representor interface, which
  serves the same purpose listed in the previous point, in addition to
  allowing the hypervisor to view its VFs diagnostic counters, even when
  the VFs are bounded to external VMs.

Example of devlink health reporter usage is:
$devlink health diagnose pci/0000:08:00.0 reporter vnic
 vNIC env counters:
    total_error_queues: 0 send_queue_priority_update_flow: 0
    comp_eq_overrun: 0 async_eq_overrun: 0 cq_overrun: 0
    invalid_command: 0 quota_exceeded_command: 0
    nic_receive_steering_discard: 0

===========

4) SW steering fixes and improvements

Yevgeny Kliteynik Says:
=======================
These short patch series are just small fixes / improvements for
SW steering:

 - Patch 1: Fix dumping of legacy modify_hdr in debug dump to
   align to what is expected by parser
 - Patch 2: Have separate threshold for ICM sync per ICM type
 - Patch 3: Add more info to the steering debug dump - Linux
   version and device name
 - Patch 4: Keep track of number of buddies that are currently
   in use per domain per buddy type

=======================

* tag 'mlx5-updates-2023-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5: Update op_mode to op_mod for port selection
  net/mlx5: E-Switch, Remove unused mlx5_esw_offloads_vport_metadata_set()
  net/mlx5: E-Switch, Remove redundant dev arg from mlx5_esw_vport_alloc()
  net/mlx5: Include linux/pci.h for pci_msix_can_alloc_dyn()
  net/mlx5e: RX, Hook NAPIs to page pools
  net/mlx5e: RX, Fix XDP_TX page release for legacy rq nonlinear case
  net/mlx5e: RX, Fix releasing page_pool pages twice for striding RQ
  net/mlx5e: Add vnic devlink health reporter to representors
  net/mlx5: Add vnic devlink health reporter to PFs/VFs
  Revert "net/mlx5: Expose vnic diagnostic counters for eswitch managed vports"
  Revert "net/mlx5: Expose steering dropped packets counter"
  net/mlx5: DR, Add memory statistics for domain object
  net/mlx5: DR, Add more info in domain dbg dump
  net/mlx5: DR, Calculate sync threshold of each pool according to its type
  net/mlx5: DR, Fix dumping of legacy modify_hdr in debug dump
====================

Link: https://lore.kernel.org/r/20230421013850.349646-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-21 20:47:05 -07:00
Shannon Nelson
ddbcb22055 pds_core: Kconfig and pds_core.rst
Remaining documentation and Kconfig hook for building the driver.

Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:29:14 +01:00
Shannon Nelson
40ced89445 pds_core: devlink params for enabling VIF support
Add the devlink parameter switches so the user can enable
the features supported by the VFs.  The only feature supported
at the moment is vDPA.

Example:
    devlink dev param set pci/0000:2b:00.0 \
	    name enable_vnet cmode runtime value true

Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:29:13 +01:00
Shannon Nelson
49ce92fbee pds_core: add FW update feature to devlink
Add in the support for doing firmware updates.  Of the two
main banks available, a and b, this updates the one not in
use and then selects it for the next boot.

Example:
    devlink dev flash pci/0000:b2:00.0 \
	    file pensando/dsc_fw_1.63.0-22.tar

Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:29:13 +01:00
Shannon Nelson
45d76f4929 pds_core: set up device and adminq
Set up the basic adminq and notifyq queue structures.  These are
used mostly by the client drivers for feature configuration.
These are essentially the same adminq and notifyq as in the
ionic driver.

Part of this includes querying for device identity and FW
information, so we can make that available to devlink dev info.

  $ devlink dev info pci/0000:b5:00.0
  pci/0000:b5:00.0:
    driver pds_core
    serial_number FLM18420073
    versions:
        fixed:
          asic.id 0x0
          asic.rev 0x0
        running:
          fw 1.51.0-73
        stored:
          fw.goldfw 1.15.9-C-22
          fw.mainfwa 1.60.0-73
          fw.mainfwb 1.60.0-57

Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:29:12 +01:00
Shannon Nelson
25b450c05a pds_core: add devlink health facilities
Add devlink health reporting on top of our fw watchdog.

Example:
  # devlink health show pci/0000:2b:00.0 reporter fw
  pci/0000:2b:00.0:
    reporter fw
      state healthy error 0 recover 0
  # devlink health diagnose pci/0000:2b:00.0 reporter fw
   Status: healthy State: 1 Generation: 0 Recoveries: 0

Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:29:12 +01:00
Shannon Nelson
55435ea772 pds_core: initial framework for pds_core PF driver
This is the initial PCI driver framework for the new pds_core device
driver and its family of devices.  This does the very basics of
registering for the new PF PCI device 1dd8:100c, setting up debugfs
entries, and registering with devlink.

Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:29:12 +01:00
Mahesh Bandewar
7ab75456be ipv6: add icmpv6_error_anycast_as_unicast for ICMPv6
ICMPv6 error packets are not sent to the anycast destinations and this
prevents things like traceroute from working. So create a setting similar
to ECHO when dealing with Anycast sources (icmpv6_echo_ignore_anycast).

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Link: https://lore.kernel.org/r/20230419013238.2691167-1-maheshb@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20 20:07:50 -07:00
Maher Sanalla
cf14af140a net/mlx5e: Add vnic devlink health reporter to representors
Create a new devlink health reporter for representor interface, which
reports the values of representor vnic diagnostic counters when diagnosed.

This patch will allow admins to monitor VF diagnostic counters through
the representor-interface vnic reporter.

Example of usage:
$ devlink health diagnose pci/0000:08:00.0/65537 reporter vnic
  vNIC env counters:
    total_error_queues: 0 send_queue_priority_update_flow: 0
    comp_eq_overrun: 0 async_eq_overrun: 0 cq_overrun: 0
    invalid_command: 0 quota_exceeded_command: 0
    nic_receive_steering_discard: 0

Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-20 18:35:49 -07:00
Maher Sanalla
b0bc615df4 net/mlx5: Add vnic devlink health reporter to PFs/VFs
Create a vnic devlink health reporter for PFs/VFs interfaces.
The reporter's diagnose callback displays the values of vNIC/vport
transport debug counters of PFs/VFs, as follows:

$ devlink health diagnose pci/0000:08:00.0 reporter vnic
 vNIC env counters:
    total_error_queues: 0 send_queue_priority_update_flow: 0
    comp_eq_overrun: 0 async_eq_overrun: 0 cq_overrun: 0
    invalid_command: 0 quota_exceeded_command: 0
    nic_receive_steering_discard: 0

Moreover, add documentation on the reporter functionality and the
counters description.

While at it, expose the vNIC counters diagnose function to be used by
the downstream patch, which will reveal the counters for representor
interfaces.

Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-20 18:35:49 -07:00
Jakub Kicinski
681c5b51dc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Adjacent changes:

net/mptcp/protocol.h
  63740448a3 ("mptcp: fix accept vs worker race")
  2a6a870e44 ("mptcp: stops worker on unaccepted sockets at listener close")
  ddb1a072f8 ("mptcp: move first subflow allocation at mpc access time")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20 16:29:51 -07:00
Chuck Lever
2fd5532044 net/handshake: Add a kernel API for requesting a TLSv1.3 handshake
To enable kernel consumers of TLS to request a TLS handshake, add
support to net/handshake/ to request a handshake upcall.

This patch also acts as a template for adding handshake upcall
support for other kernel transport layer security providers.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-19 18:48:48 -07:00
Jacob Keller
1a2bd3bd72 ice: document RDMA devlink parameters
Commit e523af4ee5 ("net/ice: Add support for enable_iwarp and enable_roce
devlink param") added support for the enable_roce and enable_iwarp
parameters in the ice driver. It didn't document these parameters in the
ice devlink documentation file. Add this documentation, including a note
about the mutual exclusion between the two modes.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20230414162614.571861-1-jacob.e.keller@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-17 18:53:13 -07:00
Jakub Kicinski
8c48eea3ad page_pool: allow caching from safely localized NAPI
Recent patches to mlx5 mentioned a regression when moving from
driver local page pool to only using the generic page pool code.
Page pool has two recycling paths (1) direct one, which runs in
safe NAPI context (basically consumer context, so producing
can be lockless); and (2) via a ptr_ring, which takes a spin
lock because the freeing can happen from any CPU; producer
and consumer may run concurrently.

Since the page pool code was added, Eric introduced a revised version
of deferred skb freeing. TCP skbs are now usually returned to the CPU
which allocated them, and freed in softirq context. This places the
freeing (producing of pages back to the pool) enticingly close to
the allocation (consumer).

If we can prove that we're freeing in the same softirq context in which
the consumer NAPI will run - lockless use of the cache is perfectly fine,
no need for the lock.

Let drivers link the page pool to a NAPI instance. If the NAPI instance
is scheduled on the same CPU on which we're freeing - place the pages
in the direct cache.

With that and patched bnxt (XDP enabled to engage the page pool, sigh,
bnxt really needs page pool work :() I see a 2.6% perf boost with
a TCP stream test (app on a different physical core than softirq).

The CPU use of relevant functions decreases as expected:

  page_pool_refill_alloc_cache   1.17% -> 0%
  _raw_spin_lock                 2.41% -> 0.98%

Only consider lockless path to be safe when NAPI is scheduled
- in practice this should cover majority if not all of steady state
workloads. It's usually the NAPI kicking in that causes the skb flush.

The main case we'll miss out on is when application runs on the same
CPU as NAPI. In that case we don't use the deferred skb free path.

Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-14 18:56:12 -07:00
Jakub Kicinski
800e68c44f Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

tools/testing/selftests/net/config
  62199e3f16 ("selftests: net: Add VXLAN MDB test")
  3a0385be13 ("selftests: add the missing CONFIG_IP_SCTP in net config")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 16:04:28 -07:00
Jakub Kicinski
50762d9af3 net: docs: update the sample code in driver.rst
The sample code talks about single-queue devices and uses locks.
Update it to something resembling more modern code.
Make sure we mention use of READ_ONCE() / WRITE_ONCE().

Change the comment which talked about consumer on the xmit side.
AFAIU xmit is the producer and completions are a consumer.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-13 13:30:21 +02:00
Jakub Kicinski
c91c46de6b net: provide macros for commonly copied lockless queue stop/wake code
A lot of drivers follow the same scheme to stop / start queues
without introducing locks between xmit and NAPI tx completions.
I'm guessing they all copy'n'paste each other's code.
The original code dates back all the way to e1000 and Linux 2.6.19.

Smaller drivers shy away from the scheme and introduce a lock
which may cause deadlocks in netpoll.

Provide macros which encapsulate the necessary logic.

The macros do not prevent false wake ups, the extra barrier
required to close that race is not worth it. See discussion in:
https://lore.kernel.org/all/c39312a2-4537-14b4-270c-9fe1fbb91e89@gmail.com/

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-10 17:56:18 -07:00
Jakub Kicinski
8336462539 docs: net: use C syntax highlight in driver.rst
Use syntax highlight, comment out the "..." since they are
not valid C.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-10 17:56:17 -07:00
Jakub Kicinski
da4f0f82ee docs: net: move the probe and open/close sections of driver.rst up
Somehow it feels more right to start from the probe then open,
then tx... Much like the lifetime of the driver itself.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-10 17:56:17 -07:00
Jakub Kicinski
d2f5c68e3f docs: net: reformat driver.rst from a list to sections
driver.rst had a historical form of list of common problems.
In the age os Sphinx and rendered documentation it's better
to use the more usual title + text format.

This will allow us to render kdoc into the output more naturally.

No changes to the actual text.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-10 17:56:17 -07:00
YueHaibing
dc5110c2d9 tcp: restrict net.ipv4.tcp_app_win
UBSAN: shift-out-of-bounds in net/ipv4/tcp_input.c:555:23
shift exponent 255 is too large for 32-bit type 'int'
CPU: 1 PID: 7907 Comm: ssh Not tainted 6.3.0-rc4-00161-g62bad54b26db-dirty #206
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x136/0x150
 __ubsan_handle_shift_out_of_bounds+0x21f/0x5a0
 tcp_init_transfer.cold+0x3a/0xb9
 tcp_finish_connect+0x1d0/0x620
 tcp_rcv_state_process+0xd78/0x4d60
 tcp_v4_do_rcv+0x33d/0x9d0
 __release_sock+0x133/0x3b0
 release_sock+0x58/0x1b0

'maxwin' is int, shifting int for 32 or more bits is undefined behaviour.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-07 08:19:11 +01:00
Tony Nguyen
79d872c62b Documentation/eth/intel: Remove references to SourceForge
The out-of-tree driver is hosted on SourceForge, as this does not apply
to the kernel driver remove references to it. Also do some minor
formatting changes around this section.

Suggested-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
2023-03-30 09:35:04 -07:00
Tony Nguyen
8ba732befd Documentation/eth/intel: Update address for driver support
Update the email address for support to use Intel Wired LAN, the mailing
list used for kernel development.

Suggested-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
2023-03-30 09:34:59 -07:00
Dragos Tatulea
3905f8d64c net/mlx5e: RX, Remove unnecessary recycle parameter and page_cache stats
The recycle parameter used during page release is no longer
necessary: the page pool can detect when the page cannot be
recycled to the cache or ring without any outside hint.

The page pool will also take care of cleaning up after itself
once all the inflight pages have been released. So no need to
explicitly release pages to the system.

Remove the internal page_cache stats as the mlx5e_page_cache
struct no longer exists.

Delete the documentation entries along with the stats.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-28 13:43:59 -07:00
Shay Agroskin
233eb4e786 ethtool: Add support for configuring tx_push_buf_len
This attribute, which is part of ethtool's ring param configuration
allows the user to specify the maximum number of the packet's payload
that can be written directly to the device.

Example usage:
    # ethtool -G [interface] tx-push-buf-len [number of bytes]

Co-developed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-27 19:49:58 -07:00
Jakub Kicinski
dc0a7b5200 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
  6e9d51b1a5 ("net/mlx5e: Initialize link speed to zero")
  1bffcea429 ("net/mlx5e: Add devlink hairpin queues parameters")
https://lore.kernel.org/all/20230324120623.4ebbc66f@canb.auug.org.au/
https://lore.kernel.org/all/20230321211135.47711-1-saeed@kernel.org/

Adjacent changes:

drivers/net/phy/phy.c
  323fe43cf9 ("net: phy: Improved PHY error reporting in state machine")
  4203d84032 ("net: phy: Ensure state transitions are processed from phy_stop()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-24 10:10:20 -07:00
Jakub Kicinski
3eb8eea2a4 docs: networking: document NAPI
Add basic documentation about NAPI. We can stop linking to the ancient
doc on the LF wiki.

Link: https://lore.kernel.org/all/20230315223044.471002-1-kuba@kernel.org/
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Pavel Pisa <pisa@cmp.felk.cvut.cz> # for ctucanfd-driver.rst
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20230322053848.198452-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-23 19:47:40 -07:00
Jesper Dangaard Brouer
915efd8a44 xdp: bpf_xdp_metadata use EOPNOTSUPP for no driver support
When driver doesn't implement a bpf_xdp_metadata kfunc the fallback
implementation returns EOPNOTSUPP, which indicate device driver doesn't
implement this kfunc.

Currently many drivers also return EOPNOTSUPP when the hint isn't
available, which is ambiguous from an API point of view. Instead
change drivers to return ENODATA in these cases.

There can be natural cases why a driver doesn't provide any hardware
info for a specific hint, even on a frame to frame basis (e.g. PTP).
Lets keep these cases as separate return codes.

When describing the return values, adjust the function kernel-doc layout
to get proper rendering for the return values.

Fixes: ab46182d0d ("net/mlx4_en: Support RX XDP metadata")
Fixes: bc8d405b1b ("net/mlx5e: Support RX XDP metadata")
Fixes: 306531f024 ("veth: Support RX XDP metadata")
Fixes: 3d76a4d3d4 ("bpf: XDP metadata RX kfuncs")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/167940675120.2718408.8176058626864184420.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-22 09:11:09 -07:00
Tony Nguyen
e485f3a6ea ixgb: Remove ixgb driver
There are likely no users of this driver as the hardware has been
discontinued since 2010. Remove the driver and all references to it
in documentation.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-19 10:51:07 +00:00
Gal Pressman
1bffcea429 net/mlx5e: Add devlink hairpin queues parameters
We refer to a TC NIC rule that involves forwarding as "hairpin".
Hairpin queues are mlx5 hardware specific implementation for hardware
forwarding of such packets.

Per the discussion in [1], move the hairpin queues control (number and
size) from debugfs to devlink.

Expose two devlink params:
- hairpin_num_queues: control the number of hairpin queues
- hairpin_queue_size: control the size (in packets) of the hairpin queues

[1] https://lore.kernel.org/all/20230111194608.7f15b9a1@kernel.org/

Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20230314054234.267365-12-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-15 22:12:08 -07:00
Linus Torvalds
5ca26d6039 Including fixes from wireless and netfilter.
Current release - regressions:
 
  - phy: multiple fixes for EEE rework
 
  - wifi: wext: warn about usage only once
 
  - wifi: ath11k: allow system suspend to survive ath11k
 
 Current release - new code bugs:
 
  - mlx5: Fix memory leak in IPsec RoCE creation
 
  - ibmvnic: assign XPS map to correct queue index
 
 Previous releases - regressions:
 
  - netfilter: ip6t_rpfilter: Fix regression with VRF interfaces
 
  - netfilter: ctnetlink: make event listener tracking global
 
  - nf_tables: allow to fetch set elements when table has an owner
 
  - mlx5:
    - fix skb leak while fifo resync and push
    - fix possible ptp queue fifo use-after-free
 
 Previous releases - always broken:
 
  - sched: fix action bind logic
 
  - ptp: vclock: use mutex to fix "sleep on atomic" bug if driver
    also uses a mutex
 
  - netfilter: conntrack: fix rmmod double-free race
 
  - netfilter: xt_length: use skb len to match in length_mt6,
    avoid issues with BIG TCP
 
 Misc:
 
  - ice: remove unnecessary CONFIG_ICE_GNSS
 
  - mlx5e: remove hairpin write debugfs files
 
  - sched: act_api: move TCA_EXT_WARN_MSG to the correct hierarchy
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmP9JgYACgkQMUZtbf5S
 IrsIRRAApy4Hjb8z5z3k4HOM2lA3b/3OWD301I5YtoU3FC4L938yETAPFYUGbWrX
 rKN4YOTNh2Fvkgbni7vz9hbC84C6i86Q9+u7dT1U+kCk3kbyQPFZlEDj5fY0I8zK
 1xweCRrC1CcG74S2M5UO3UnWz1ypWQpTnHfWZqq0Duh1j9Xc+MHjHC2IKrGnzM6U
 1/ODk9FrtsWC+KGJlWwiV+yJMYUA4nCKIS/NrmdRlBa7eoP0oC1xkA8g0kz3/P3S
 O+xMyhExcZbMYY5VMkiGBZ5l8Ve3t6lHcMXq7jWlSCOeXd4Ut6zzojHlGZjzlCy9
 RQQJzva2wlltqB9rECUQixpZbVS6ubf5++zvACOKONhSIEdpWjZW9K/qsV8igbfM
 Xx0hsG1jCBt/xssRw2UBsq73vjNf1AkdksvqJgcggAvBJU8cV3MxRRB4/9lyPdmB
 NNFqehwCeE3aU0FSBKoxZVYpfg+8J/XhwKT63Cc2d1ENetsWk/LxvkYm24aokpW+
 nn+jUH9AYk3rFlBVQG1xsCwU4VlGk/yZgRwRMYFBqPkAGcXLZOnqdoSviBPN3yN0
 Habs1hxToMt3QBgLJcMVn8CYdWCJgnZpxs8Mfo+PGoWKHzQ9kXBdyYyIZm1GyesD
 BN/2QN38yMGXRALd2NXS2Va4ygX7KptB7+HsitdkzKCqcp1Ao+I=
 =Ko4p
 -----END PGP SIGNATURE-----

Merge tag 'net-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from wireless and netfilter.

  The notable fixes here are the EEE fix which restores boot for many
  embedded platforms (real and QEMU); WiFi warning suppression and the
  ICE Kconfig cleanup.

  Current release - regressions:

   - phy: multiple fixes for EEE rework

   - wifi: wext: warn about usage only once

   - wifi: ath11k: allow system suspend to survive ath11k

  Current release - new code bugs:

   - mlx5: Fix memory leak in IPsec RoCE creation

   - ibmvnic: assign XPS map to correct queue index

  Previous releases - regressions:

   - netfilter: ip6t_rpfilter: Fix regression with VRF interfaces

   - netfilter: ctnetlink: make event listener tracking global

   - nf_tables: allow to fetch set elements when table has an owner

   - mlx5:
      - fix skb leak while fifo resync and push
      - fix possible ptp queue fifo use-after-free

  Previous releases - always broken:

   - sched: fix action bind logic

   - ptp: vclock: use mutex to fix "sleep on atomic" bug if driver also
     uses a mutex

   - netfilter: conntrack: fix rmmod double-free race

   - netfilter: xt_length: use skb len to match in length_mt6, avoid
     issues with BIG TCP

  Misc:

   - ice: remove unnecessary CONFIG_ICE_GNSS

   - mlx5e: remove hairpin write debugfs files

   - sched: act_api: move TCA_EXT_WARN_MSG to the correct hierarchy"

* tag 'net-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (53 commits)
  tcp: tcp_check_req() can be called from process context
  net: phy: c45: fix network interface initialization failures on xtensa, arm:cubieboard
  xen-netback: remove unused variables pending_idx and index
  net/sched: act_api: move TCA_EXT_WARN_MSG to the correct hierarchy
  net: dsa: ocelot_ext: remove unnecessary phylink.h include
  net: mscc: ocelot: fix duplicate driver name error
  net: dsa: felix: fix internal MDIO controller resource length
  net: dsa: seville: ignore mscc-miim read errors from Lynx PCS
  net/sched: act_sample: fix action bind logic
  net/sched: act_mpls: fix action bind logic
  net/sched: act_pedit: fix action bind logic
  wifi: wext: warn about usage only once
  wifi: mt76: usb: fix use-after-free in mt76u_free_rx_queue
  qede: avoid uninitialized entries in coal_entry array
  nfc: fix memory leak of se_io context in nfc_genl_se_io
  ice: remove unnecessary CONFIG_ICE_GNSS
  net/sched: cls_api: Move call to tcf_exts_miss_cookie_base_destroy()
  ibmvnic: Assign XPS map to correct queue index
  docs: net: fix inaccuracies in msg_zerocopy.rst
  tools: net: add __pycache__ to gitignore
  ...
2023-02-27 14:05:08 -08:00
nick black
bce9045993 docs: net: fix inaccuracies in msg_zerocopy.rst
Replace "sendpage" with "sendfile". Remove comment about
ENOBUFS when the sockopt hasn't been set; experimentation
indicates that this is not true.

Signed-off-by: nick black <dankamongmen@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/Y/gg/EhIIjugLdd3@schwarzgerat.orthanc
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-24 18:31:31 -08:00
Linus Torvalds
70756b49be It has been a moderately calm cycle for documentation; the significant
changes include:
 
 - Some significant additions to the memory-management documentation
 
 - Some improvements to navigation in the HTML-rendered docs
 
 - More Spanish and Chinese translations
 
 ...and the usual set of typo fixes and such.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmPzkQUPHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5YC0QH/09u10xV3N+RuveNE/tArVxKcQi7JZd/xugQ
 toSXygh64WY10lzwi7Ms1bHZzpPYB0fOrqTGNqNQuhrVTjQzaZB0BBJqm8lwt2w/
 S/Z5wj+IicJTmQ7+0C2Hc/dcK5SCPfY3CgwqOUVdr3dEm1oU+4QaBy31fuIJJ0Hx
 NdbXBco8BZqJX9P67jwp9vbrFrSGBjPI0U4HNHVjrWlcBy8JT0aAnf0fyWFy3orA
 T86EzmEw8drA1mXsHa5pmVwuHDx2X+D+eRurG9llCBrlIG9EDSmnalY4BeGqR4LS
 oDrEH6M91I5+9iWoJ0rBheD8rPclXO2HpjXLApXzTjrORgEYZsM=
 =MCdX
 -----END PGP SIGNATURE-----

Merge tag 'docs-6.3' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "It has been a moderately calm cycle for documentation; the significant
  changes include:

   - Some significant additions to the memory-management documentation

   - Some improvements to navigation in the HTML-rendered docs

   - More Spanish and Chinese translations

  ... and the usual set of typo fixes and such"

* tag 'docs-6.3' of git://git.lwn.net/linux: (68 commits)
  Documentation/watchdog/hpwdt: Fix Format
  Documentation/watchdog/hpwdt: Fix Reference
  Documentation: core-api: padata: correct spelling
  docs/mm: Physical Memory: correct spelling in reference to CONFIG_PAGE_EXTENSION
  docs: Use HTML comments for the kernel-toc SPDX line
  docs: Add more information to the HTML sidebar
  Documentation: KVM: Update AMD memory encryption link
  printk: Document that CONFIG_BOOT_PRINTK_DELAY required for boot_delay=
  Documentation: userspace-api: correct spelling
  Documentation: sparc: correct spelling
  Documentation: driver-api: correct spelling
  Documentation: admin-guide: correct spelling
  docs: add workload-tracing document to admin-guide
  docs/admin-guide/mm: remove useless markup
  docs/mm: remove useless markup
  docs/mm: Physical Memory: remove useless markup
  docs/sp_SP: Add process magic-number translation
  docs: ftrace: always use canonical ftrace path
  Doc/damon: fix the data path error
  dma-buf: Add "dma-buf" to title of documentation
  ...
2023-02-22 12:00:20 -08:00
Alejandro Lucero
14743ddd24 sfc: add devlink info support for ef100
Add devlink info support for ef100. The information reported is obtained
through the MCDI interface with the specific meaning defined in new
documentation file.

Signed-off-by: Alejandro Lucero <alejandro.lucero-palau@amd.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-16 12:03:12 +01:00
Jakub Kicinski
0f19f514de mlx5-updates-2023-02-10
1) From Roi and Mark: MultiPort eswitch support
 
 MultiPort E-Switch builds on newer hardware's capabilities and introduces
 a mode where a single E-Switch is used and all the vports and physical
 ports on the NIC are connected to it.
 
 The new mode will allow in the future a decrease in the memory used by the
 driver and advanced features that aren't possible today.
 
 This represents a big change in the current E-Switch implantation in mlx5.
 Currently, by default, each E-Switch manager manages its E-Switch.
 Steering rules in each E-Switch can only forward traffic to the native
 physical port associated with that E-Switch. While there are ways to target
 non-native physical ports, for example using a bond or via special TC
 rules. None of the ways allows a user to configure the driver
 to operate by default in such a mode nor can the driver decide
 to move to this mode by default as it's user configuration-driven right now.
 
 While MultiPort E-Switch single FDB mode is the preferred mode, older
 generations of ConnectX hardware couldn't support this mode so it was never
 implemented. Now that there is capable hardware present, start the
 transition to having this mode by default.
 
 Introduce a devlink parameter to control MultiPort Eswitch single FDB mode.
 This will allow users to select this mode on their system right now
 and in the future will allow the driver to move to this mode by default.
 
 2) From Jiri: Improvements and fixes for mlx5 netdev's devlink logic
  2.1) Cleanups related to mlx5's devlink port logic
  2.2) Move devlink port registration to be done before netdev alloc
  2.3) Create auxdev devlink instance in the same ns as parent devlink
  2.4) Suspend auxiliary devices only in case of PCI device suspend
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmPsBlsACgkQSD+KveBX
 +j6izAgAtvzimjzOqI8J0FKPW3piRH3VtNWUrlGAf4HHPI0lZMbEV5GIaSI15JeH
 wrIpNXKzaA+6+NZ7TzTwqVO+zyKCQVpRw7xPANIJ2A5dId36jU2AgW06zgzFuYWa
 mXKG8rAUls6E5a5DHTaY/SueWMO9KMOfLzHFXHNsLAKTAw+FoDdlGso5vgvsy44g
 2/N/kDBdzC5n8og934sM7JymthnTVyuGe/Au2w1UUR1MMZ46OQgmhStDwqS5MrzI
 pIi+tWFG9YFM5/J8d13y7X/xMCq26Aw6SZSUhsRfTk0goy0uKaU6yYm8vSbO0mci
 +e0c8bpWL/MzAhMzWj8CwBPeO6knOw==
 =tYc2
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2023-02-10' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2023-02-10

1) From Roi and Mark: MultiPort eswitch support

MultiPort E-Switch builds on newer hardware's capabilities and introduces
a mode where a single E-Switch is used and all the vports and physical
ports on the NIC are connected to it.

The new mode will allow in the future a decrease in the memory used by the
driver and advanced features that aren't possible today.

This represents a big change in the current E-Switch implantation in mlx5.
Currently, by default, each E-Switch manager manages its E-Switch.
Steering rules in each E-Switch can only forward traffic to the native
physical port associated with that E-Switch. While there are ways to target
non-native physical ports, for example using a bond or via special TC
rules. None of the ways allows a user to configure the driver
to operate by default in such a mode nor can the driver decide
to move to this mode by default as it's user configuration-driven right now.

While MultiPort E-Switch single FDB mode is the preferred mode, older
generations of ConnectX hardware couldn't support this mode so it was never
implemented. Now that there is capable hardware present, start the
transition to having this mode by default.

Introduce a devlink parameter to control MultiPort Eswitch single FDB mode.
This will allow users to select this mode on their system right now
and in the future will allow the driver to move to this mode by default.

2) From Jiri: Improvements and fixes for mlx5 netdev's devlink logic
 2.1) Cleanups related to mlx5's devlink port logic
 2.2) Move devlink port registration to be done before netdev alloc
 2.3) Create auxdev devlink instance in the same ns as parent devlink
 2.4) Suspend auxiliary devices only in case of PCI device suspend

* tag 'mlx5-updates-2023-02-10' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5: Suspend auxiliary devices only in case of PCI device suspend
  net/mlx5: Remove "recovery" arg from mlx5_load_one() function
  net/mlx5e: Create auxdev devlink instance in the same ns as parent devlink
  net/mlx5e: Move devlink port registration to be done before netdev alloc
  net/mlx5e: Move dl_port to struct mlx5e_dev
  net/mlx5e: Replace usage of mlx5e_devlink_get_dl_port() by netdev->devlink_port
  net/mlx5e: Pass mdev to mlx5e_devlink_port_register()
  net/mlx5: Remove outdated comment
  net/mlx5e: TC, Remove redundant parse_attr argument
  net/mlx5e: Use a simpler comparison for uplink rep
  net/mlx5: Lag, Add single RDMA device in multiport mode
  net/mlx5: Lag, set different uplink vport metadata in multiport eswitch mode
  net/mlx5: E-Switch, rename bond update function to be reused
  net/mlx5e: TC, Add peer flow in mpesw mode
  net/mlx5: Lag, Control MultiPort E-Switch single FDB mode
====================

Link: https://lore.kernel.org/r/20230214221239.159033-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-15 19:24:52 -08:00
Moshe Shemesh
c745cfb27a devlink: Update devlink health documentation
Update devlink-health.rst file:
- Add devlink formatted message (fmsg) API documentation.
- Add auto-dump as a condition to do dump once error reported.
- Expand OOB to clarify this acronym.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-15 19:15:44 -08:00
Roi Dayan
a32327a3a0 net/mlx5: Lag, Control MultiPort E-Switch single FDB mode
MultiPort E-Switch builds on newer hardware's capabilities and introduces
a mode where a single E-Switch is used and all the vports and physical
ports on the NIC are connected to it.

The new mode will allow in the future a decrease in the memory used by the
driver and advanced features that aren't possible today.

This represents a big change in the current E-Switch implantation in mlx5.
Currently, by default, each E-Switch manager manages its E-Switch.
Steering rules in each E-Switch can only forward traffic to the native
physical port associated with that E-Switch. While there are ways to target
non-native physical ports, for example using a bond or via special TC
rules. None of the ways allows a user to configure the driver
to operate by default in such a mode nor can the driver decide
to move to this mode by default as it's user configuration-driven right now.

While MultiPort E-Switch single FDB mode is the preferred mode, older
generations of ConnectX hardware couldn't support this mode so it was never
implemented. Now that there is capable hardware present, start the
transition to having this mode by default.

Introduce a devlink parameter to control MultiPort E-Switch single FDB mode.
This will allow users to select this mode on their system right now
and in the future will allow the driver to move to this mode by default.

Example:
    $ devlink dev param set pci/0000:00:0b.0 name esw_multiport value 1 \
                  cmode runtime

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-14 14:08:24 -08:00
Shannon Nelson
5b4e9a7a71 net: ethtool: extend ringparam set/get APIs for rx_push
Similar to what was done for TX_PUSH, add an RX_PUSH concept
to the ethtool interfaces.

Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-13 11:05:12 +00:00
Jakub Kicinski
8697a258ae Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
net/devlink/leftover.c / net/core/devlink.c:
  565b4824c3 ("devlink: change port event netdev notifier from per-net to global")
  f05bd8ebeb ("devlink: move code to a dedicated directory")
  687125b579 ("devlink: split out core code")
https://lore.kernel.org/all/20230208094657.379f2b1a@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09 12:25:40 -08:00
Jiawen Wu
363d7c2298 net: txgbe: Update support email address
Update new email address for Wangxun 10Gb NIC support team.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://lore.kernel.org/r/20230208023035.3371250-1-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-08 20:48:37 -08:00
Rahul Rameshbabu
04937a0f68 net/mlx5: Document support for RoCE HCA disablement capability
Some mlx5 devices are capable of disabling RoCE. In this situation,
disablement does not need to be handled at the driver level.

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:03 -08:00
Rahul Rameshbabu
8ce3b586fa net/mlx5: Add counter information to mlx5 driver documentation
Update rst file to contain general information about statistics counters
for the mlx5 driver. Add specifics about individual counters in list
tables.

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:03 -08:00
Rahul Rameshbabu
e12ebbf0cc net/mlx5: Document previously implemented mlx5 tracepoints
Tracepoints were previously implemented but not documented till this patch
series.

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:03 -08:00
Rahul Rameshbabu
a12ba19269 net/mlx5: Update Kconfig parameter documentation
Provide information for Kconfig flags defined but not documented till this
patch series for the mlx5 driver.

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:03 -08:00
Rahul Rameshbabu
f2d51e5793 net/mlx5: Separate mlx5 driver documentation into multiple pages
The mlx5 device driver documentation page has grown in size and should be
split into multiple subpages. This change also contains a table of contents
for these new subpages.

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:02 -08:00
Jakub Kicinski
82b4a9412b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
net/core/gro.c
  7d2c89b325 ("skb: Do mix page pool and page referenced frags in GRO")
  b1a78b9b98 ("net: add support for ipv4 big tcp")
https://lore.kernel.org/all/20230203094454.5766f160@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-02 14:49:55 -08:00
Brian Haley
62e395f82d neighbor: fix proxy_delay usage when it is zero
When set to zero, the neighbor sysctl proxy_delay value
does not cause an immediate reply for ARP/ND requests
as expected, it instead causes a random delay between
[0, U32_MAX). Looking at this comment from
__get_random_u32_below() explains the reason:

/*
 * This function is technically undefined for ceil == 0, and in fact
 * for the non-underscored constant version in the header, we build bug
 * on that. But for the non-constant case, it's convenient to have that
 * evaluate to being a straight call to get_random_u32(), so that
 * get_random_u32_inclusive() can work over its whole range without
 * undefined behavior.
 */

Added helper function that does not call get_random_u32_below()
if proxy_delay is zero and just uses the current value of
jiffies instead, causing pneigh_enqueue() to respond
immediately.

Also added definition of proxy_delay to ip-sysctl.txt since
it was missing.

Signed-off-by: Brian Haley <haleyb.dev@gmail.com>
Link: https://lore.kernel.org/r/20230130171428.367111-1-haleyb.dev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-01 21:02:54 -08:00
Ross Zwisler
2abfcd293b docs: ftrace: always use canonical ftrace path
The canonical location for the tracefs filesystem is at /sys/kernel/tracing.

But, from Documentation/trace/ftrace.rst:

  Before 4.1, all ftrace tracing control files were within the debugfs
  file system, which is typically located at /sys/kernel/debug/tracing.
  For backward compatibility, when mounting the debugfs file system,
  the tracefs file system will be automatically mounted at:

  /sys/kernel/debug/tracing

Many parts of Documentation still reference this older debugfs path, so
let's update them to avoid confusion.

Signed-off-by: Ross Zwisler <zwisler@google.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20230125213251.2013791-1-zwisler@google.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2023-01-31 14:02:30 -07:00
Randy Dunlap
a266ef69b8 Documentation: networking: correct spelling
Correct spelling problems for Documentation/networking/ as reported
by codespell.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-doc@vger.kernel.org
Cc: Jiri Pirko <jiri@nvidia.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/r/20230129231053.20863-5-rdunlap@infradead.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-31 13:00:47 +01:00
David S. Miller
5dd3beba22 This feature/cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
 
  - drop prandom.h includes, by Sven Eckelmann
 
  - fix mailing list address, by Sven Eckelmann
 
  - multicast feature preparation, by Linus Lüssing (2 patches)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAmPTpRsWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoZWoEAC7F92qmyNAye5ylC4O7XeG5KXL
 tPHzOSp5n5Q98HOmLGrqgOTEgfoX8q/GkAQh4lauiG5LPhs8iOYfHhiz5u0rXXnv
 Dm26PCLTKxOd+nif2H28rh4ahzgZUSizQgtg3S6tiYAvvTc8TM67UbJpPkU3sXf6
 Xh9gT7W5YvZ8lvEQF/l/otdhuE3jpiBgjzyfDKTyLlVweWgli4XZSZ9BqIhv8c/O
 gSa6vp72FGyJEZJCHyXh4jD1yNU/S+CRFpj6Gy1OsqLVMnNVlJviwEyVJ+ZuzSw2
 KSOeeUXxN4XjeDulIOPgvvxFRsAPCeNoGqOzHtlLgEugyZGowKY9q5Bh33K63A7t
 YzayKnfq2CKAVg9VujePurGkiMTY5gjrtuS7KlcWxXn30V8rry3bN6Y8XLphr80A
 p1fVYjPQtFhbjjrrmsKbGSCmJiNo+jx2Rs6y1aQ86ykNVK3Et6rSUrISaMRjZEEG
 OIMGOkzE0ESH2yK99sdaHOKiauZIMPMHTK7rB53cSMCL2eyP+6wP1jGQFwiytaA6
 9GzMD8w8LVrJQc7PjkYP6f75iFKxtWzpf8RzwePhZFiRGVBROW0+6ECv6NAx2ImG
 PoMYcqy0EV3ffwUhNnWtML+f4keqBaOKXa63GyWGBKymHLr5yLBzyvjgsr9jMJNb
 3Sj/qf7JQy908jSxQQ==
 =yGYm
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-pullrequest-20230127' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This feature/cleanup patchset includes the following patches:

 - bump version strings, by Simon Wunderlich

 - drop prandom.h includes, by Sven Eckelmann

 - fix mailing list address, by Sven Eckelmann

 - multicast feature preparation, by Linus Lüssing (2 patches)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-30 07:33:06 +00:00
Jakub Kicinski
2d104c390f bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY9RqJgAKCRDbK58LschI
 gw2IAP9G5uhFO5abBzYLupp6SY3T5j97MUvPwLfFqUEt7EXmuwEA2lCUEWeW0KtR
 QX+QmzCa6iHxrW7WzP4DUYLue//FJQY=
 =yYqA
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
bpf-next 2023-01-28

We've added 124 non-merge commits during the last 22 day(s) which contain
a total of 124 files changed, 6386 insertions(+), 1827 deletions(-).

The main changes are:

1) Implement XDP hints via kfuncs with initial support for RX hash and
   timestamp metadata kfuncs, from Stanislav Fomichev and
   Toke Høiland-Jørgensen.
   Measurements on overhead: https://lore.kernel.org/bpf/875yellcx6.fsf@toke.dk

2) Extend libbpf's bpf_tracing.h support for tracing arguments of
   kprobes/uprobes and syscall as a special case, from Andrii Nakryiko.

3) Significantly reduce the search time for module symbols by livepatch
   and BPF, from Jiri Olsa and Zhen Lei.

4) Enable cpumasks to be used as kptrs, which is useful for tracing
   programs tracking which tasks end up running on which CPUs
   in different time intervals, from David Vernet.

5) Fix several issues in the dynptr processing such as stack slot liveness
   propagation, missing checks for PTR_TO_STACK variable offset, etc,
   from Kumar Kartikeya Dwivedi.

6) Various performance improvements, fixes, and introduction of more
   than just one XDP program to XSK selftests, from Magnus Karlsson.

7) Big batch to BPF samples to reduce deprecated functionality,
   from Daniel T. Lee.

8) Enable struct_ops programs to be sleepable in verifier,
   from David Vernet.

9) Reduce pr_warn() noise on BTF mismatches when they are expected under
   the CONFIG_MODULE_ALLOW_BTF_MISMATCH config anyway, from Connor O'Brien.

10) Describe modulo and division by zero behavior of the BPF runtime
    in BPF's instruction specification document, from Dave Thaler.

11) Several improvements to libbpf API documentation in libbpf.h,
    from Grant Seltzer.

12) Improve resolve_btfids header dependencies related to subcmd and add
    proper support for HOSTCC, from Ian Rogers.

13) Add ipip6 and ip6ip decapsulation support for bpf_skb_adjust_room()
    helper along with BPF selftests, from Ziyang Xuan.

14) Simplify the parsing logic of structure parameters for BPF trampoline
    in the x86-64 JIT compiler, from Pu Lehui.

15) Get BTF working for kernels with CONFIG_RUST enabled by excluding
    Rust compilation units with pahole, from Martin Rodriguez Reboredo.

16) Get bpf_setsockopt() working for kTLS on top of TCP sockets,
    from Kui-Feng Lee.

17) Disable stack protection for BPF objects in bpftool given BPF backends
    don't support it, from Holger Hoffstätte.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (124 commits)
  selftest/bpf: Make crashes more debuggable in test_progs
  libbpf: Add documentation to map pinning API functions
  libbpf: Fix malformed documentation formatting
  selftests/bpf: Properly enable hwtstamp in xdp_hw_metadata
  selftests/bpf: Calls bpf_setsockopt() on a ktls enabled socket.
  bpf: Check the protocol of a sock to agree the calls to bpf_setsockopt().
  bpf/selftests: Verify struct_ops prog sleepable behavior
  bpf: Pass const struct bpf_prog * to .check_member
  libbpf: Support sleepable struct_ops.s section
  bpf: Allow BPF_PROG_TYPE_STRUCT_OPS programs to be sleepable
  selftests/bpf: Fix vmtest static compilation error
  tools/resolve_btfids: Alter how HOSTCC is forced
  tools/resolve_btfids: Install subcmd headers
  bpf/docs: Document the nocast aliasing behavior of ___init
  bpf/docs: Document how nested trusted fields may be defined
  bpf/docs: Document cpumask kfuncs in a new file
  selftests/bpf: Add selftest suite for cpumask kfuncs
  selftests/bpf: Add nested trust selftests suite
  bpf: Enable cpumasks to be queried and used as kptrs
  bpf: Disallow NULLable pointers for trusted kfuncs
  ...
====================

Link: https://lore.kernel.org/r/20230128004827.21371-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-28 00:00:14 -08:00
Jakub Kicinski
b568d3072a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

drivers/net/ethernet/intel/ice/ice_main.c
  418e53401e ("ice: move devlink port creation/deletion")
  643ef23bd9 ("ice: Introduce local var for readability")
https://lore.kernel.org/all/20230127124025.0dacef40@canb.auug.org.au/
https://lore.kernel.org/all/20230124005714.3996270-1-anthony.l.nguyen@intel.com/

drivers/net/ethernet/engleder/tsnep_main.c
  3d53aaef43 ("tsnep: Fix TX queue stop/wake for multiple queues")
  25faa6a4c5 ("tsnep: Replace TX spin_lock with __netif_tx_lock")
https://lore.kernel.org/all/20230127123604.36bb3e99@canb.auug.org.au/

net/netfilter/nf_conntrack_proto_sctp.c
  13bd9b31a9 ("Revert "netfilter: conntrack: add sctp DATA_SENT state"")
  a44b765148 ("netfilter: conntrack: unify established states for SCTP paths")
  f71cb8f45d ("netfilter: conntrack: sctp: use nf log infrastructure for invalid packets")
https://lore.kernel.org/all/20230127125052.674281f9@canb.auug.org.au/
https://lore.kernel.org/all/d36076f3-6add-a442-6d4b-ead9f7ffff86@tessares.net/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-27 22:56:18 -08:00
Michal Wilczynski
53b9b77dcf ice: Fix broken link in ice NAPI doc
Current link for NAPI documentation in ice driver doesn't work - it
returns 404. Update the link to the working one.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-01-27 11:32:18 -08:00
Leon Romanovsky
7681a4f58f xfrm: extend add state callback to set failure reason
Almost all validation logic is in the drivers, but they are
missing reliable way to convey failure reason to userspace
applications.

Let's use extack to return this information to users.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26 16:28:48 -08:00
Leon Romanovsky
3089386db0 xfrm: extend add policy callback to set failure reason
Almost all validation logic is in the drivers, but they are
missing reliable way to convey failure reason to userspace
applications.

Let's use extack to return this information to users.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26 16:28:48 -08:00
Ivan Vecera
aee2770d19 docs: networking: Fix bridge documentation URL
Current documentation URL [1] is no longer valid.

[1] https://www.linuxfoundation.org/collaborate/workgroups/networking/bridge

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Link: https://lore.kernel.org/r/20230124145127.189221-1-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-25 22:44:27 -08:00
Sriram Yagnaraman
a44b765148 netfilter: conntrack: unify established states for SCTP paths
An SCTP endpoint can start an association through a path and tear it
down over another one. That means the initial path will not see the
shutdown sequence, and the conntrack entry will remain in ESTABLISHED
state for 5 days.

By merging the HEARTBEAT_ACKED and ESTABLISHED states into one
ESTABLISHED state, there remains no difference between a primary or
secondary path. The timeout for the merged ESTABLISHED state is set to
210 seconds (hb_interval * max_path_retrans + rto_max). So, even if a
path doesn't see the shutdown sequence, it will expire in a reasonable
amount of time.

With this change in place, there is now more than one state from which
we can transition to ESTABLISHED, COOKIE_ECHOED and HEARTBEAT_SENT, so
handle the setting of ASSURED bit whenever a state change has happened
and the new state is ESTABLISHED. Removed the check for dir==REPLY since
the transition to ESTABLISHED can happen only in the reply direction.

Fixes: 9fb9cbb108 ("[NETFILTER]: Add nf_conntrack subsystem.")
Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-01-24 09:52:52 +01:00
Jon Maxwell
695a376b59 ipv6: Document that max_size sysctl is deprecated
v4: fix deprecated typo.

Document that max_size is deprecated due to:

commit af6d10345c ("ipv6: remove max_size check inline with ipv4")

Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Link: https://lore.kernel.org/r/20230120232331.1273881-1-jmaxwell37@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-23 22:07:21 -08:00
Stanislav Fomichev
a4aeb9d656 bpf: Document XDP RX metadata
Document all current use-cases and assumptions.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-2-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-23 09:38:10 -08:00
Vladimir Oltean
c319df10a4 docs: ethtool: document ETHTOOL_A_STATS_SRC and ETHTOOL_A_PAUSE_STATS_SRC
Two new netlink attributes were added to PAUSE_GET and STATS_GET and
their replies. Document them.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-23 12:44:18 +00:00
Vladimir Oltean
3700000479 docs: ethtool-netlink: document interface for MAC Merge layer
Show details about the structures passed back and forth related to MAC
Merge layer configuration, state and statistics. The rendered htmldocs
will be much more verbose due to the kerneldoc references.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-23 12:44:18 +00:00
Sven Eckelmann
8f6bc45837 batman-adv: Fix mailing list address
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2023-01-21 19:01:59 +01:00
Arkadiusz Kubalewski
c7ef8221ca ice: use GNSS subsystem instead of TTY
Previously support for GNSS was implemented as a TTY driver, it allowed
to access GNSS receiver on /dev/ttyGNSS_<bus><func>.

Use generic GNSS subsystem API instead of implementing own TTY driver.
The receiver is accessible on /dev/gnss<id>. In case of multiple receivers
in the OS, correct device can be found by enumerating either:
- /sys/class/net/<eth port>/device/gnss/
- /sys/class/gnss/gnss<id>/device/

Using GNSS subsystem is superior to implementing own TTY driver, as the
GNSS subsystem was designed solely for this purpose. It also implements
TTY driver but in a common and defined way.

From user perspective, there is no difference in communicating with a
device, except new path to the device shall be used. The device will
provide same information to the userspace as the old one, and can be used
in the same way, i.e.:
old # gpsmon /dev/ttyGNSS_2100_0
new # gpsmon /dev/gnss0
There is no other impact on userspace tools.

User expecting onboard GNSS receiver support is required to enable
CONFIG_GNSS=y/m in kernel config.

Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Michal Michalik <michal.michalik@intel.com>
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-20 13:27:17 +00:00
Daniele Palmas
31de284239 ethtool: add tx aggregation parameters
Add the following ethtool tx aggregation parameters:

ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES
Maximum size in bytes of a tx aggregated block of frames.

ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES
Maximum number of frames that can be aggregated into a block.

ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS
Time in usecs after the first packet arrival in an aggregated
block for the block to be sent.

Signed-off-by: Daniele Palmas <dnlplm@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-13 10:23:52 +00:00
Jakub Kicinski
a99da46ac0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/usb/r8152.c
  be53771c87 ("r8152: add vendor/device ID pair for Microsoft Devkit")
  ec51fbd1b8 ("r8152: add USB device driver for config selection")
https://lore.kernel.org/all/20230113113339.658c4723@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-12 19:59:56 -08:00
Piergiorgio Beruto
8580e16c28 net/ethtool: add netlink interface for the PLCA RS
Add support for configuring the PLCA Reconciliation Sublayer on
multi-drop PHYs that support IEEE802.3cg-2019 Clause 148 (e.g.,
10BASE-T1S). This patch adds the appropriate netlink interface
to ethtool.

Signed-off-by: Piergiorgio Beruto <piergiorgio.beruto@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-11 08:35:02 +00:00
David Howells
2d689424b6 rxrpc: Move call state changes from sendmsg to I/O thread
Move all the call state changes that are made in rxrpc_sendmsg() to the I/O
thread.  This is a step towards removing the call state lock.

This requires the switch to the RXRPC_CALL_CLIENT_AWAIT_REPLY and
RXRPC_CALL_SERVER_SEND_REPLY states to be done when the last packet is
decanted from ->tx_sendmsg to ->tx_buffer in the I/O thread, not when it is
added to ->tx_sendmsg by sendmsg().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
2023-01-06 09:43:33 +00:00
Vincent Mailhol
115dd54690 Documentation: devlink: add missing toc entry for etas_es58x devlink doc
toc entry is missing for etas_es58x devlink doc and triggers this warning:

  Documentation/networking/devlink/etas_es58x.rst: WARNING: document isn't included in any toctree

Add the missing toc entry.

Fixes: 9f63f96aac ("Documentation: devlink: add devlink documentation for the etas_es58x driver")
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Link: https://lore.kernel.org/all/20221213051136.721887-1-mailhol.vincent@wanadoo.fr
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-12-19 16:08:27 +01:00
Jakub Kicinski
7ae9888d6e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

1) Fix NAT IPv6 flowtable hardware offload, from Qingfang DENG.

2) Add a safety check to IPVS socket option interface report a
   warning if unsupported command is seen, this. From Li Qiong.

3) Document SCTP conntrack timeouts, from Sriram Yagnaraman.

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: conntrack: document sctp timeouts
  ipvs: add a 'default' case in do_ip_vs_set_ctl()
  netfilter: flowtable: really fix NAT IPv6 offload
====================

Link: https://lore.kernel.org/r/20221213140923.154594-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-13 19:32:53 -08:00
Sriram Yagnaraman
f9645abe42 netfilter: conntrack: document sctp timeouts
Exposed through sysctl, update documentation to describe sctp states and
their default timeouts.

Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-12-13 12:25:45 +01:00
Jakub Kicinski
95d1815f09 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

1) Incorrect error check in nft_expr_inner_parse(), from Dan Carpenter.

2) Add DATA_SENT state to SCTP connection tracking helper, from
   Sriram Yagnaraman.

3) Consolidate nf_confirm for ipv4 and ipv6, from Florian Westphal.

4) Add bitmask support for ipset, from Vishwanath Pai.

5) Handle icmpv6 redirects as RELATED, from Florian Westphal.

6) Add WARN_ON_ONCE() to impossible case in flowtable datapath,
   from Li Qiong.

7) A large batch of IPVS updates to replace timer-based estimators by
   kthreads to scale up wrt. CPUs and workload (millions of estimators).

Julian Anastasov says:

	This patchset implements stats estimation in kthread context.
It replaces the code that runs on single CPU in timer context every 2
seconds and causing latency splats as shown in reports [1], [2], [3].
The solution targets setups with thousands of IPVS services,
destinations and multi-CPU boxes.

	Spread the estimation on multiple (configured) CPUs and multiple
time slots (timer ticks) by using multiple chains organized under RCU
rules.  When stats are not needed, it is recommended to use
run_estimation=0 as already implemented before this change.

RCU Locking:

- As stats are now RCU-locked, tot_stats, svc and dest which
hold estimator structures are now always freed from RCU
callback. This ensures RCU grace period after the
ip_vs_stop_estimator() call.

Kthread data:

- every kthread works over its own data structure and all
such structures are attached to array. For now we limit
kthreads depending on the number of CPUs.

- even while there can be a kthread structure, its task
may not be running, eg. before first service is added or
while the sysctl var is set to an empty cpulist or
when run_estimation is set to 0 to disable the estimation.

- the allocated kthread context may grow from 1 to 50
allocated structures for timer ticks which saves memory for
setups with small number of estimators

- a task and its structure may be released if all
estimators are unlinked from its chains, leaving the
slot in the array empty

- every kthread data structure allows limited number
of estimators. Kthread 0 is also used to initially
calculate the max number of estimators to allow in every
chain considering a sub-100 microsecond cond_resched
rate. This number can be from 1 to hundreds.

- kthread 0 has an additional job of optimizing the
adding of estimators: they are first added in
temp list (est_temp_list) and later kthread 0
distributes them to other kthreads. The optimization
is based on the fact that newly added estimator
should be estimated after 2 seconds, so we have the
time to offload the adding to chain from controlling
process to kthread 0.

- to add new estimators we use the last added kthread
context (est_add_ktid). The new estimators are linked to
the chains just before the estimated one, based on add_row.
This ensures their estimation will start after 2 seconds.
If estimators are added in bursts, common case if all
services and dests are initially configured, we may
spread the estimators to more chains and as result,
reducing the initial delay below 2 seconds.

Many thanks to Jiri Wiesner for his valuable comments
and for spending a lot of time reviewing and testing
the changes on different platforms with 48-256 CPUs and
1-8 NUMA nodes under different cpufreq governors.

The new IPVS estimators do not use workqueue infrastructure
because:

- The estimation can take long time when using multiple IPVS rules (eg.
  millions estimator structures) and especially when box has multiple
  CPUs due to the for_each_possible_cpu usage that expects packets from
  any CPU. With est_nice sysctl we have more control how to prioritize the
  estimation kthreads compared to other processes/kthreads that have
  latency requirements (such as servers). As a benefit, we can see these
  kthreads in top and decide if we will need some further control to limit
  their CPU usage (max number of structure to estimate per kthread).

- with kthreads we run code that is read-mostly, no write/lock
  operations to process the estimators in 2-second intervals.

- work items are one-shot: as estimators are processed every
  2 seconds, they need to be re-added every time. This again
  loads the timers (add_timer) if we use delayed works, as there are
  no kthreads to do the timings.

[1] Report from Yunhong Jiang:
    https://lore.kernel.org/netdev/D25792C1-1B89-45DE-9F10-EC350DC04ADC@gmail.com/
[2] https://marc.info/?l=linux-virtual-server&m=159679809118027&w=2
[3] Report from Dust:
    https://archive.linuxvirtualserver.org/html/lvs-devel/2020-12/msg00000.html

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  ipvs: run_estimation should control the kthread tasks
  ipvs: add est_cpulist and est_nice sysctl vars
  ipvs: use kthreads for stats estimation
  ipvs: use u64_stats_t for the per-cpu counters
  ipvs: use common functions for stats allocation
  ipvs: add rcu protection to stats
  netfilter: flowtable: add a 'default' case to flowtable datapath
  netfilter: conntrack: set icmpv6 redirects as RELATED
  netfilter: ipset: Add support for new bitmask parameter
  netfilter: conntrack: merge ipv4+ipv6 confirm functions
  netfilter: conntrack: add sctp DATA_SENT state
  netfilter: nft_inner: fix IS_ERR() vs NULL check
====================

Link: https://lore.kernel.org/r/20221211101204.1751-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-12 14:45:36 -08:00
Vincent Mailhol
9f63f96aac Documentation: devlink: add devlink documentation for the etas_es58x driver
List all the version information reported by the etas_es58x driver
through devlink. Also, update MAINTAINERS with the newly created file.

Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Link: https://lore.kernel.org/all/20221130174658.29282-8-mailhol.vincent@wanadoo.fr
[mkl: fixed version information table: "bl" -> "fw.bootloader"
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-12-12 11:39:13 +01:00
Vincent Mailhol
01d8053229 net: devlink: add DEVLINK_INFO_VERSION_GENERIC_FW_BOOTLOADER
As discussed in [1], abbreviating the bootloader to "bl" might not be
well understood. Instead, a bootloader technically being a firmware,
name it "fw.bootloader".

Add a new macro to devlink.h to formalize this new info attribute name
and update the documentation.

[1] https://lore.kernel.org/netdev/20221128142723.2f826d20@kernel.org/

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Link: https://lore.kernel.org/all/20221130174658.29282-5-mailhol.vincent@wanadoo.fr
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-12-12 11:39:13 +01:00
Julian Anastasov
144361c194 ipvs: run_estimation should control the kthread tasks
Change the run_estimation flag to start/stop the kthread tasks.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Cc: yunhong-cgl jiang <xintian1976@gmail.com>
Cc: "dust.li" <dust.li@linux.alibaba.com>
Reviewed-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-12-10 22:44:43 +01:00
Julian Anastasov
f0be83d542 ipvs: add est_cpulist and est_nice sysctl vars
Allow the kthreads for stats to be configured for
specific cpulist (isolation) and niceness (scheduling
priority).

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Cc: yunhong-cgl jiang <xintian1976@gmail.com>
Cc: "dust.li" <dust.li@linux.alibaba.com>
Reviewed-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-12-10 22:44:43 +01:00
Jakub Kicinski
dd8b3a802b ipsec-next-2022-12-09
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmOS/ooACgkQrB3Eaf9P
 W7cOVA/+L8rHwLe78DDz/PNESyShtTVCYBDF/ngYMV8AIvjSfPresMbFV3NKqO5E
 3qbMl199QH2eWI7dhQaQ+edynSG0QCx5FmPai0UuHPLxATct1pNPJPpvBryO/4jC
 ZouYBIVjdMbq6Y8vD2gJ8UtA7TZpncP0HYOKTvYyDL9kQ+nUmu9KUYxcEcNHL5w+
 TjL9jJafR+GqczCRiwAoMKIFV7lUrTFzh7slfINNN5DVTuzN33H7Tp70z6IKOfVL
 1LATlZv7mqpLVF6dQuMXOt6kd/BEBl1y4ZHTHow5nstJvwu99P96iKwEfIXuOvWK
 fulhDU61eIik8D9QJWeM7TuZDbYewWI77plwVY/R/zRt0At4VLpq7I1m33CmLLMY
 Fb5fMxJPkM8YAtDID+BknYPrSAcxo8ji04BWFrVqQ6InPmtGfnP83XSSkYfxY7FB
 3hUfz4igsJpV5vrS1EFRhjklNwI+jY2yAvIggQtdkJ97ubSUY3E4ACfNqlJ5lJbv
 2KqWnSKlG21F9ZTR68VzcQVhFIQF6j/EuQqro+4TQUIdZswcml2iK32zrel0rs9C
 iAsgQQaMV9a2vEaScRZqdOJ4HENTbm9wD7Mso/i5vr+lnpr1ThKjQo8osU8YUlbC
 SDTMeWRRos+esFML6SP+YZ7SM/qXMluou204x/llJ/VDMXQ5e8k=
 =enQp
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-next-2022-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
ipsec-next 2022-12-09

1) Add xfrm packet offload core API.
   From Leon Romanovsky.

2) Add xfrm packet offload support for mlx5.
   From Leon Romanovsky and Raed Salem.

3) Fix a typto in a error message.
   From Colin Ian King.

* tag 'ipsec-next-2022-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: (38 commits)
  xfrm: Fix spelling mistake "oflload" -> "offload"
  net/mlx5e: Open mlx5 driver to accept IPsec packet offload
  net/mlx5e: Handle ESN update events
  net/mlx5e: Handle hardware IPsec limits events
  net/mlx5e: Update IPsec soft and hard limits
  net/mlx5e: Store all XFRM SAs in Xarray
  net/mlx5e: Provide intermediate pointer to access IPsec struct
  net/mlx5e: Skip IPsec encryption for TX path without matching policy
  net/mlx5e: Add statistics for Rx/Tx IPsec offloaded flows
  net/mlx5e: Improve IPsec flow steering autogroup
  net/mlx5e: Configure IPsec packet offload flow steering
  net/mlx5e: Use same coding pattern for Rx and Tx flows
  net/mlx5e: Add XFRM policy offload logic
  net/mlx5e: Create IPsec policy offload tables
  net/mlx5e: Generalize creation of default IPsec miss group and rule
  net/mlx5e: Group IPsec miss handles into separate struct
  net/mlx5e: Make clear what IPsec rx_err does
  net/mlx5e: Flatten the IPsec RX add rule path
  net/mlx5e: Refactor FTE setup code to be more clear
  net/mlx5e: Move IPsec flow table creation to separate function
  ...
====================

Link: https://lore.kernel.org/r/20221209093310.4018731-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09 20:06:35 -08:00
Willem de Bruijn
b534dc46c8 net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP
Add an option to initialize SOF_TIMESTAMPING_OPT_ID for TCP from
write_seq sockets instead of snd_una.

This should have been the behavior from the start. Because processes
may now exist that rely on the established behavior, do not change
behavior of the existing option, but add the right behavior with a new
flag. It is encouraged to always set SOF_TIMESTAMPING_OPT_ID_TCP on
stream sockets along with the existing SOF_TIMESTAMPING_OPT_ID.

Intuitively the contract is that the counter is zero after the
setsockopt, so that the next write N results in a notification for
the last byte N - 1.

On idle sockets snd_una == write_seq and this holds for both. But on
sockets with data in transmission, snd_una records the unacked offset
in the stream. This depends on the ACK response from the peer. A
process cannot learn this in a race free manner (ioctl SIOCOUTQ is one
racy approach).

write_seq records the offset at the last byte written by the process.
This is a better starting point. It matches the intuitive contract in
all circumstances, unaffected by external behavior.

The new timestamp flag necessitates increasing sk_tsflags to 32 bits.
Move the field in struct sock to avoid growing the socket (for some
common CONFIG variants). The UAPI interface so_timestamping.flags is
already int, so 32 bits wide.

Reported-by: Sotirios Delimanolis <sotodel@meta.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20221207143701.29861-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-08 19:49:21 -08:00
Shay Drory
e5b9642a33 net/mlx5: E-Switch, Implement devlink port function cmds to control migratable
Implement devlink port function commands to enable / disable migratable.
This is used to control the migratable capability of the device.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Acked-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-07 20:09:18 -08:00