Commit Graph

9713 Commits

Author SHA1 Message Date
Ido Schimmel cc19439f70 mlxsw: core_thermal: Simplify transceiver module get_temp() callback
The get_temp() callback of a thermal zone associated with a transceiver
module no longer needs to read the temperature thresholds of the module.
Therefore, simplify the callback by only reading the temperature.
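A minimal sketch of the simplified callback shape, with approximated
mlxsw helper names (not the driver's exact code):

static int mlxsw_thermal_module_temp_get(struct thermal_zone_device *tzdev,
					 int *p_temp)
{
	struct mlxsw_thermal_module *module_tz = tzdev->devdata;

	/* Read only the current temperature; thresholds are no longer
	 * fetched here. */
	return mlxsw_thermal_module_temp_read(module_tz, p_temp);
}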

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-02 13:42:30 +01:00
Ido Schimmel c1536d856e mlxsw: core_thermal: Make mlxsw_thermal_module_init() void
The function can no longer fail so make it void and remove the
associated error path.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-02 13:42:30 +01:00
Ido Schimmel 5601ef91fb mlxsw: core_thermal: Use static trip points for transceiver modules
The driver registers a thermal zone for each transceiver module and
tries to set the trip point temperatures according to the thresholds
read from the transceiver. If a threshold cannot be read or if a
transceiver is unplugged, the trip point temperature is set to zero,
which means that it is disabled as far as the thermal subsystem is
concerned.

A recent change in the thermal core made it so that such trip points are
no longer marked as disabled, which led the thermal subsystem to
incorrectly set the associated cooling devices to their maximum
state [1]. A fix to restore this behavior was merged in commit
f1b80a3878 ("thermal: core: Restore behavior regarding invalid trip
points"). However, the thermal maintainer suggested not relying on this
behavior and instead always registering a valid array of trip points [2].

Therefore, create a static array of trip points with sane defaults
(suggested by Vadim) and register it with the thermal zone of each
transceiver module. User space can choose to override these defaults
using the thermal zone sysfs interface since these files are writeable.
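A sketch of such a static trip array, using the "after" values shown
below as the assumed defaults (the driver's actual array and trip types
may differ):

static const struct thermal_trip mlxsw_thermal_module_trips[] = {
	{ .type = THERMAL_TRIP_ACTIVE, .temperature = 55000 },	/* millicelsius */
	{ .type = THERMAL_TRIP_ACTIVE, .temperature = 65000 },
	{ .type = THERMAL_TRIP_HOT,    .temperature = 80000 },
};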

Before:

 $ cat /sys/class/thermal/thermal_zone11/type
 mlxsw-module11
 $ cat /sys/class/thermal/thermal_zone11/trip_point_*_temp
 65000
 75000
 80000

After:

 $ cat /sys/class/thermal/thermal_zone11/type
 mlxsw-module11
 $ cat /sys/class/thermal/thermal_zone11/trip_point_*_temp
 55000
 65000
 80000

Also tested by reverting commit f1b80a3878 ("thermal: core: Restore
behavior regarding invalid trip points") and making sure that the
associated cooling devices are not set to their maximum state.

[1] https://lore.kernel.org/linux-pm/ZA3CFNhU4AbtsP4G@shredder/
[2] https://lore.kernel.org/linux-pm/f78e6b70-a963-c0ca-a4b2-0d4c6aeef1fb@linaro.org/

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-02 13:42:30 +01:00
Jakub Kicinski 7079d5e61a mlx5-updates-2023-03-28

Merge tag 'mlx5-updates-2023-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2023-03-28

Dragos Tatulea says:
====================

net/mlx5e: RX, Drop page_cache and fully use page_pool

For page allocation on the rx path, the mlx5e driver has been using an
internal page cache in tandem with the page pool. The internal page
cache uses a queue for page recycling which has the issue of head of
queue blocking.

This patch series drops the internal page_cache altogether and uses the
page_pool to implement everything that was done by the page_cache
before:
* Let the page_pool handle dma mapping and unmapping.
* Use fragmented pages with fragment counter instead of tracking via
  page ref.
* Enable skb recycling.

The patch series has the following effects on the rx path:

* Improved performance for the cases when there was low page recycling
  due to head of queue blocking in the internal page_cache. The test
  for this was running a single iperf TCP stream to an rx queue
  bound to the same cpu as the application.

  |-------------+--------+--------+------+---------|
  | rq type     | before | after  | unit |   diff  |
  |-------------+--------+--------+------+---------|
  | striding rq |  30.1  |  31.4  | Gbps |  4.14 % |
  | legacy rq   |  30.2  |  33.0  | Gbps |  8.48 % |
  |-------------+--------+--------+------+---------|

* Small XDP performance degradation. The test was an XDP drop
  program running on a single rx queue with small incoming packets:

  |-------------+----------+----------+------+---------|
  | rq type     | before   | after    | unit |   diff  |
  |-------------+----------+----------+------+---------|
  | striding rq | 19725449 | 18544617 | pps  | -6.37 % |
  | legacy rq   | 19879931 | 18631841 | pps  | -6.70 % |
  |-------------+----------+----------+------+---------|

  This will be handled in a different patch series by adding support for
  multi-packet per page.

* For other cases the performance is roughly the same.

The above numbers were obtained on the following system:
  24 core Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
  32 GB RAM
  ConnectX-7 single port

The breakdown of the patch series is the following:
* Preparations for introducing the mlx5e_frag_page struct.
* Delete the mlx5e_page_cache struct.
* Enable dma mapping from page_pool.
* Enable skb recycling and fragment counting.
* Do deferred release of pages (just before alloc) to ensure better
  page_pool cache utilization.

====================

* tag 'mlx5-updates-2023-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5e: RX, Remove unnecessary recycle parameter and page_cache stats
  net/mlx5e: RX, Break the wqe bulk refill in smaller chunks
  net/mlx5e: RX, Increase WQE bulk size for legacy rq
  net/mlx5e: RX, Split off release path for xsk buffers for legacy rq
  net/mlx5e: RX, Defer page release in legacy rq for better recycling
  net/mlx5e: RX, Change wqe last_in_page field from bool to bit flags
  net/mlx5e: RX, Defer page release in striding rq for better recycling
  net/mlx5e: RX, Rename xdp_xmit_bitmap to a more generic name
  net/mlx5e: RX, Enable skb page recycling through the page_pool
  net/mlx5e: RX, Enable dma map and sync from page_pool allocator
  net/mlx5e: RX, Remove internal page_cache
  net/mlx5e: RX, Store SHAMPO header pages in array
  net/mlx5e: RX, Remove alloc unit layout constraint for striding rq
  net/mlx5e: RX, Remove alloc unit layout constraint for legacy rq
  net/mlx5e: RX, Remove mlx5e_alloc_unit argument in page allocation
====================

Link: https://lore.kernel.org/r/20230328205623.142075-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-29 22:15:24 -07:00
Kuniyuki Iwashima 8cdc3223e7 ipv6: Remove in6addr_any alternatives.
Some code defines the IPv6 wildcard address as a local variable and
uses it with memcmp() or ipv6_addr_equal().

Let's use in6addr_any and ipv6_addr_any() instead.
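A minimal sketch of the conversion, using the real helpers from
<net/ipv6.h>; the wrapper function itself is hypothetical:

static bool laddr_is_any(const struct in6_addr *addr)
{
	/* Before: an open-coded wildcard address compared by hand:
	 *
	 *	struct in6_addr zero = {};
	 *	return !memcmp(addr, &zero, sizeof(zero));
	 */

	/* After: the shared helper (or ipv6_addr_equal(addr, &in6addr_any)). */
	return ipv6_addr_any(addr);
}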

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-29 08:22:52 +01:00
Jakub Kicinski de7494524d mlx5-updates-2023-03-20

Merge tag 'mlx5-updates-2023-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2023-03-20

mlx5 dynamic msix

This patch series adds support for dynamic msix vectors allocation in mlx5.

Eli Cohen Says:

================

The following series of patches modifies mlx5_core to work with the
dynamic MSIX API. Currently, mlx5_core allocates all the interrupt
vectors it needs and distributes them amongst the consumers. With the
introduction of dynamic MSIX support, which allows for allocation of
interrupts more than once, we now allocate vectors as we need them.
This allows other drivers running on top of mlx5_core to allocate
interrupt vectors for their own use. An example for this is mlx5_vdpa,
which uses these vectors to propagate interrupts directly from the
hardware to the vCPU [1].

As a preparation for using this series, a use after free issue is fixed
in lib/cpu_rmap.c and the allocator for rmap entries has been modified.
A complementary API for irq_cpu_rmap_add() has also been introduced.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/patch/?id=0f2bf1fcae96a83b8c5581854713c9fc3407556e

================

* tag 'mlx5-updates-2023-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5: Provide external API for allocating vectors
  net/mlx5: Use one completion vector if eth is disabled
  net/mlx5: Refactor calculation of required completion vectors
  net/mlx5: Move devlink registration before mlx5_load
  net/mlx5: Use dynamic msix vectors allocation
  net/mlx5: Refactor completion irq request/release code
  net/mlx5: Improve naming of pci function vectors
  net/mlx5: Use newer affinity descriptor
  net/mlx5: Modify struct mlx5_irq to use struct msi_map
  net/mlx5: Fix wrong comment
  net/mlx5e: Coding style fix, add empty line
  lib: cpu_rmap: Add irq_cpu_rmap_remove to complement irq_cpu_rmap_add
  lib: cpu_rmap: Use allocator for rmap entries
  lib: cpu_rmap: Avoid use after free on rmap->obj array entries
====================

Link: https://lore.kernel.org/r/20230324231341.29808-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28 23:52:12 -07:00
Saeed Mahameed 163c2c7059 net/mlx5e: Fix build break on 32bit
The cited commit caused the following build break in mlx5 due to a change
in size of MAX_SKB_FRAGS.

error: format '%lu' expects argument of type 'long unsigned int',
       but argument 7 has type 'unsigned int' [-Werror=format=]

Fix this by explicit casting.

Fixes: 3948b05950 ("net: introduce a config option to tweak MAX_SKB_FRAGS")
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20230328200723.125122-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28 16:18:59 -07:00
Dragos Tatulea 3905f8d64c net/mlx5e: RX, Remove unnecessary recycle parameter and page_cache stats
The recycle parameter used during page release is no longer
necessary: the page pool can detect when the page cannot be
recycled to the cache or ring without any outside hint.

The page pool will also take care of cleaning up after itself
once all the inflight pages have been released. So no need to
explicitly release pages to the system.

Remove the internal page_cache stats as the mlx5e_page_cache
struct no longer exists.

Delete the documentation entries along with the stats.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-28 13:43:59 -07:00
Dragos Tatulea cd640b0503 net/mlx5e: RX, Break the wqe bulk refill in smaller chunks
To avoid overflowing the page pool's cache, don't release the
whole bulk which is usually larger than the cache refill size.
Group release+alloc instead into cache refill units that
allow releasing to the cache and then allocating from the cache.

A refill_unit variable is added as an iteration unit over the
wqe_bulk when doing release+alloc.
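A hedged sketch of the grouped release+alloc loop; refill_unit comes
from the commit text, the helpers are hypothetical stand-ins:

	int i;

	for (i = 0; i < wqe_bulk; i += refill_unit) {
		int chunk = min(refill_unit, wqe_bulk - i);

		/* Release one cache-sized chunk to the page_pool cache,
		 * then immediately refill the same slots from that cache. */
		mlx5e_release_wqes(rq, ix + i, chunk);	/* hypothetical */
		mlx5e_alloc_wqes(rq, ix + i, chunk);	/* hypothetical */
	}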

For a single ring, single core, default MTU (1500) TCP stream
test the number of pages allocated from the cache directly
(rx_pp_recycle_cached) increases from 0% to 52%:

+-------------------------+---------+---------+
| Page Pool stats (/sec)  |  Before |   After |
+-------------------------+---------+---------+
|rx_pp_alloc_fast         | 2145422 | 2193802 |
|rx_pp_alloc_slow         |       2 |       0 |
|rx_pp_alloc_empty        |       2 |       0 |
|rx_pp_alloc_refill       |   34059 |   16634 |
|rx_pp_alloc_waive        |       0 |       0 |
|rx_pp_recycle_cached     |       0 | 1145818 |
|rx_pp_recycle_cache_full |       0 |       0 |
|rx_pp_recycle_ring       | 2179361 | 1064616 |
|rx_pp_recycle_ring_full  |     121 |       0 |
+-------------------------+---------+---------+

With this patch, the performance for legacy rq for the above test is
back to baseline.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-28 13:43:59 -07:00
Dragos Tatulea 4ba2b4988c net/mlx5e: RX, Increase WQE bulk size for legacy rq
Deferred page release was added to legacy rq but its desired effect
(driver releases last fragment to page pool cache) is not yet visible
due to the WQE bulks being too small.

This patch increases the WQE bulk size to span 512 KB and clip it to
one quarter of the rx queue size.
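As a rough illustration of the stated sizing rule (names are
assumptions, not the driver's actual computation):

	/* Bulk spans 512 KB worth of rx fragments, clipped to a quarter
	 * of the rx queue size. */
	wqe_bulk = min_t(u32, DIV_ROUND_UP(512 * 1024, frag_size),
			 rq_size / 4);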

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-28 13:43:59 -07:00
Dragos Tatulea 76238d0fbd net/mlx5e: RX, Split off release path for xsk buffers for legacy rq
Don't mix xsk buffer releases with page releases anymore. This is
needed for handling of deferred page release.

Add a new bulk free function for xsk buffers from wqe frags.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-28 13:43:59 -07:00
Dragos Tatulea 3f93f82988 net/mlx5e: RX, Defer page release in legacy rq for better recycling
Currently, fragmented pages from the page pool can be released
in two ways:

1) In the mlx5e driver when trimming off the unused fragments AND the
associated skb fragments have been released. This path allows
recycling of pages to the page pool cache (allow_direct == true).

2) On the skb release path (last fragment release), which
will always release pages to the page pool ring
(allow_direct == false).

Whichever path releases the last fragment decides where the page
gets released: the cache or the ring. So we obviously want to
maximize releases via path 1).

This patch does that by deferring the release of page fragments
right before requesting new ones from the page pool. A flag is
added to make sure that there's no release before first alloc
and that XDP_TX fragments are not released prematurely.

This is a preparation patch that doesn't unlock the performance
improvements yet. A followup patch will do that.
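A minimal sketch of the defer-then-alloc cycle with the guarding flag;
names are illustrative, not the final driver code:

static void mlx5e_frag_cycle(struct mlx5e_rq *rq,
			     struct mlx5e_wqe_frag_info *frag)
{
	/* Skip the release on the very first cycle and for fragments
	 * still owned by XDP_TX; both are signalled through the flag. */
	if (frag->flags & FRAG_SKIP_RELEASE)
		frag->flags &= ~FRAG_SKIP_RELEASE;
	else
		mlx5e_release_frag_page(rq, frag);	/* allow_direct == true */

	mlx5e_alloc_frag_page(rq, frag);	/* refill right after release */
}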

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-28 13:43:59 -07:00
Dragos Tatulea 625dff29df net/mlx5e: RX, Change wqe last_in_page field from bool to bit flags
Change the bool flag to a bitfield as we'll use it in a downstream patch
in the series to add signaling about skipping a fragment release.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-28 13:43:58 -07:00
Dragos Tatulea 4c2a132368 net/mlx5e: RX, Defer page release in striding rq for better recycling
Currently, for striding RQ, fragmented pages from the page pool can
get released in two ways:

1) In the mlx5e driver when trimming off the unused fragments AND the
   associated skb fragments have been released. This path allows
   recycling of pages to the page pool cache (allow_direct == true).

2) On the skb release path (last fragment release), which
   will always release pages to the page pool ring
   (allow_direct == false).

Whichever path releases the last fragment decides where the page
gets released: the cache or the ring. So we obviously want to
maximize releases via path 1).

This patch does that by deferring the release of page fragments
right before requesting new ones from the page pool. Extra care
needs to be taken for the corner cases:

* On first call, make sure that release is not called. The
  skip_release_bitmap is used for this purpose.

* On rq shutdown, make sure that all wqes that were not
  in the linked list are released.

For a single ring, single core, default MTU (1500) TCP stream
test the number of pages allocated from the cache directly
(rx_pp_recycle_cached) increases from 31% to 98%:

+-------------------------+---------+----------+
| Page Pool stats (/sec)  |  Before |   After  |
+-------------------------+---------+----------+
|rx_pp_alloc_fast         | 2137754 |  2261033 |
|rx_pp_alloc_slow         |      47 |        9 |
|rx_pp_alloc_empty        |      47 |        9 |
|rx_pp_alloc_refill       |   23230 |      819 |
|rx_pp_alloc_waive        |       0 |        0 |
|rx_pp_recycle_cached     |  672182 |  2209015 |
|rx_pp_recycle_cache_full |    1789 |        0 |
|rx_pp_recycle_ring       | 1485848 |    52259 |
|rx_pp_recycle_ring_full  |    3003 |      584 |
+-------------------------+---------+----------+

With this patch, the performance in striding rq for the above test is
back to baseline.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-28 13:43:58 -07:00
Dragos Tatulea 38a36efccd net/mlx5e: RX, Rename xdp_xmit_bitmap to a more generic name
The xdp_xmit_bitmap currently serves only one purpose: to avoid
releasing pages that are still in use due to XDP TX.

A following patch will use this bitmap in a slightly different context
but for the same purpose. So rename the bitmap to a more generic name
that reflects the purpose not the context.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-28 13:43:58 -07:00
Dragos Tatulea 6f57428460 net/mlx5e: RX, Enable skb page recycling through the page_pool
Start using the page_pool skb recycling api to recycle all pages back to
the page pool and stop using atomic page reference counting.

The mlx5e driver used to manage in-flight pages using page refcounting:
for each fragment there were 2 atomic write operations happening (one
for building the skb and one on skb release).

The page_pool api introduced a method to track page fragments more
optimally:
* The page's pp_fragment_count is set to a large bias on page alloc
  (1 x atomic write operation).
* The driver tracks the actual page fragments in a non atomic variable.
* When the skb is recycled, pp_fragment_count is decremented
  (atomic write operation).
* When page is released in the driver, the unused number of fragments
  (relative to the bias) is deducted from pp_fragment_count (atomic
  write operation).
* Last page defragmentation will only be an atomic read.

So in total there are `number of fragments + 1` atomic write ops, as
opposed to the previous `2 * frags` atomic write ops.

Pages are wrapped in a mlx5e_frag_page structure which also contains the
number of fragments. This makes it easy to count the fragments in the
driver.
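A sketch of that wrapper; the field names are assumptions based on the
commit text:

struct mlx5e_frag_page {
	struct page *page;
	u16 frags;	/* fragments handed out, tracked non-atomically */
};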

This change brings performance improvements for the case when the old rx
page_cache had low recycling rates due to head of queue blocking. For an
iperf3 TCP test with a single stream, on a single core (iperf and receive
queue running on the same core), the following improvements can be noticed:

* Striding rq:
  - before (net-next baseline): bitrate = 30.1 Gbits/sec
  - after                     : bitrate = 31.4 Gbits/sec (diff: 4.14 %)

* Legacy rq:
  - before (net-next baseline): bitrate = 30.2 Gbits/sec
  - after                     : bitrate = 33.0 Gbits/sec (diff: 8.48 %)

There are 2 temporary performance degradations introduced:

1) TCP streams that had a good recycling rate with the old page_cache
   have a degradation for both striding and linear rq. This is due to
   very low page pool cache recycling: the pages are released during skb
   recycle which will release pages to the page pool ring for safety.
   The following patches in this series will tackle this problem by
   deferring the page release in the driver to increase the
   chance of having pages recycled to the cache.

2) XDP performance is now lower (4-5 %) due to the higher number of
   atomic operations used for fragment management. But this opens the
   door for supporting multiple packets per page in XDP, which will
   bring a big gain.

Otherwise, performance is similar to baseline.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-28 13:43:58 -07:00
Dragos Tatulea 4a5c5e2500 net/mlx5e: RX, Enable dma map and sync from page_pool allocator
Remove driver dma mapping and unmapping of pages. Let the
page_pool api do it.
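A hedged sketch of the page_pool parameters that hand the dma work to
the allocator (field values are assumptions):

	struct page_pool_params pp_params = {
		.flags	 = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
		.dev	 = rq->pdev,		/* device pages are mapped for */
		.dma_dir = rq->buff.map_dir,	/* sync direction on recycle */
		.max_len = PAGE_SIZE,		/* bytes to sync on alloc */
	};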

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-28 13:43:58 -07:00
Dragos Tatulea 08c9b61b07 net/mlx5e: RX, Remove internal page_cache
This patch removes the internal rx page_cache and uses the generic
page_pool api only. It used to be that the page_pool couldn't handle all
the mlx5 driver usecases, but with the introduction of skb recycling and
page fragmentaton in the page_pool full switch can now be made. Some
benfits of this transition:
* Better page recycling in the cases when the page_cache was suffering
  from head of queue blocking. The page_pool doesn't have this issue.
* DMA mapping/unmapping can be managed by the page_pool.
* mlx5e_rq size reduced by more than 50% due to the page_cache array
  being deleted.

This patch only removes the page_cache. Downstream patches will enable
the required page_pool features and will add further fine-tuning.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-28 13:43:57 -07:00
Dragos Tatulea ca6ef9f031 net/mlx5e: RX, Store SHAMPO header pages in array
Save allocated SHAMPO header pages to an array to which the
mlx5e_dma_info page will point.

This change is a preparation for introducing mlx5e_frag_page structure
in a downstream patch. There's no new functionality introduced.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-28 13:43:57 -07:00
Dragos Tatulea d39092caae net/mlx5e: RX, Remove alloc unit layout constraint for striding rq
This change removes the usage of mlx5e_alloc_unit union for
striding rq. The change is more straightforward than legacy rq as
the alloc units union is already in place.

This patch only moves things around: instead of an array of unions make
it a union of arrays. This has the effect that each mlx5e_mpw_info will
allocate the largest possible size of the array member. For xsk this
means that the array of xdp_buff pointers for the wqe will still be
contiguous, but there will be some extra unused bytes at the end of the
array.

A further patch in the series will add the mlx5e_frag_page struct, for
which the described size constraint will no longer hold.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-28 13:43:57 -07:00
Dragos Tatulea 8fb1814f58 net/mlx5e: RX, Remove alloc unit layout constraint for legacy rq
The mlx5e_alloc_unit union is conveniently used to store arrays of
pointers to struct page or struct xdp_buff (for xsk). The union is
currently expected to have the size of a pointer for xsk batch
allocations to work. This is convenient for the current state of the
code but makes it impossible to add a structure of a different size
to the alloc unit.

A further patch in the series will add the mlx5e_frag_page struct for
which the described size constraint will no longer hold.

This change removes the usage of mlx5e_alloc_unit union for legacy rq:

- A union of arrays is introduced (mlx5e_alloc_units) to replace the
  array of unions to allow structures of different sizes, as sketched
  after this list.

- Each fragment has a pointer to a unit in the mlx5e_alloc_units array.
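A sketch of the union-of-arrays layout named above; the member set is
approximated from the commit text:

union mlx5e_alloc_units {
	DECLARE_FLEX_ARRAY(struct page *, pages);
	DECLARE_FLEX_ARRAY(struct xdp_buff *, xsk_buffs);
};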

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-28 13:43:57 -07:00
Dragos Tatulea 09df037017 net/mlx5e: RX, Remove mlx5e_alloc_unit argument in page allocation
Change internal page cache and page pool api to use a struct page **
instead of a mlx5e_alloc_unit *.

This is the first change in a series which is meant to remove the
mlx5e_alloc_unit altogether.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-28 13:43:56 -07:00
Eli Cohen fb0a6a268d net/mlx5: Provide external API for allocating vectors
Provide an external API, to be used by other drivers relying on
mlx5_core, for allocating MSIX vectors. An example of such a driver
would be mlx5_vdpa.
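A hedged consumer-side sketch; the exported names and signatures are
approximated from this series and should be treated as assumptions:

	struct msi_map map;

	/* Allocate one dedicated vector; the returned msi_map carries
	 * the vector index and the Linux irq number. */
	map = mlx5_msix_alloc(mdev, handler, &affd, "mlx5_vdpa");
	if (map.index < 0)
		return map.index;

	/* use map.virq with request_irq() etc., then release it */
	mlx5_msix_free(mdev, map);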

Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
2023-03-24 16:04:30 -07:00
Eli Cohen b637ac5db0 net/mlx5: Use one completion vector if eth is disabled
If eth is disabled by devlink, use only a single completion vector so
that all users of completion vectors get a minimum level of
performance. This also affects InfiniBand performance.

The rest of the vectors can be used by other consumers on a first come
first served basis.

mlx5_vdpa will make use of this to allocate dedicated vectors for its
own use.

Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
2023-03-24 16:04:30 -07:00
Eli Cohen 1dc85133c2 net/mlx5: Refactor calculation of required completion vectors
Move the calculation to a separate function. We will add more
functionality to it in a follow up patch.

Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
2023-03-24 16:04:30 -07:00
Eli Cohen fe578cbb2f net/mlx5: Move devlink registration before mlx5_load
In order to allow referencing devlink parameters during driver load,
move the devlink registration before mlx5_load. A subsequent patch will
use it to control the number of completion vectors required based on
whether eth is enabled or not.

Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
2023-03-24 16:04:30 -07:00
Eli Cohen 3354822cde net/mlx5: Use dynamic msix vectors allocation
The current implementation calculates the number and the partitioning of
available interrupt vectors and then allocates all the interrupt
vectors up front.

Here, whenever dynamic msix allocation is supported, we change this to
use msix vectors dynamically, so a vector is actually allocated only
when needed. The current pool logic is kept in place to take care of
partitioning the vectors between the consumers and of reference
counting. However, the vectors are allocated only when needed.

Subsequent patches will make use of this to allocate vectors for VDPA.

Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
2023-03-24 16:04:29 -07:00
Eli Cohen b48a0f72bc net/mlx5: Refactor completion irq request/release code
Break the request and release functions into PCI and sub-function
device handling for better readability, eventually making the code
symmetric in terms of request/release.

Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
2023-03-24 16:04:29 -07:00
Eli Cohen 8bebfd7679 net/mlx5: Improve naming of pci function vectors
The variable pf_vec is used to denote the number of vectors required for
the pci function's own use. To avoid confusion interpreting pf as
physical function, change the name to pcif_vec.

Same reasoning goes for pf_pool which is really pci function pool.

Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
2023-03-24 16:04:29 -07:00
Eli Cohen bbac70c741 net/mlx5: Use newer affinity descriptor
Use the more refined struct irq_affinity_desc to describe the required
IRQ affinity. For the async IRQs, request unmanaged affinity; for
completion queues, use managed affinity.

No functionality changes introduced. It will be used in a subsequent
patch when we use dynamic MSIX allocation.

Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
2023-03-24 16:04:29 -07:00
Eli Cohen 235a25fe28 net/mlx5: Modify struct mlx5_irq to use struct msi_map
Use the standard struct msi_map to store the vector number and irq
number pair in struct mlx5_irq.

Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
2023-03-24 16:04:29 -07:00
Eli Cohen 40a252c123 net/mlx5: Fix wrong comment
A control irq may be allocated from the parent device's pool in case
there is no SF dedicated pool. This could happen when there are not
enough vectors available for SFs.

Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
2023-03-24 16:04:29 -07:00
Eli Cohen b94616d9c6 net/mlx5e: Coding style fix, add empty line
Add an empty line between two function definitions.

Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
2023-03-24 16:04:29 -07:00
Jakub Kicinski dc0a7b5200 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
  6e9d51b1a5 ("net/mlx5e: Initialize link speed to zero")
  1bffcea429 ("net/mlx5e: Add devlink hairpin queues parameters")
https://lore.kernel.org/all/20230324120623.4ebbc66f@canb.auug.org.au/
https://lore.kernel.org/all/20230321211135.47711-1-saeed@kernel.org/

Adjacent changes:

drivers/net/phy/phy.c
  323fe43cf9 ("net: phy: Improved PHY error reporting in state machine")
  4203d84032 ("net: phy: Ensure state transitions are processed from phy_stop()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-24 10:10:20 -07:00
Cai Huoqing 5b6f4bd24c net/mlx5: Remove redundant pci_clear_master
Remove pci_clear_master to simplify the code; bus-mastering is also
cleared in do_pci_disable_device, like this:
./drivers/pci/pci.c:2197
static void do_pci_disable_device(struct pci_dev *dev)
{
	u16 pci_command;

	pci_read_config_word(dev, PCI_COMMAND, &pci_command);
	if (pci_command & PCI_COMMAND_MASTER) {
		pci_command &= ~PCI_COMMAND_MASTER;
		pci_write_config_word(dev, PCI_COMMAND, pci_command);
	}

	pcibios_disable_device(dev);
}
And dev->is_busmaster is set to 0 in pci_disable_device.

Signed-off-by: Cai Huoqing <cai.huoqing@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-24 09:09:27 +00:00
Jakub Kicinski 1b4ae19e43 bpf-for-netdev

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2023-03-23

We've added 8 non-merge commits during the last 13 day(s) which contain
a total of 21 files changed, 238 insertions(+), 161 deletions(-).

The main changes are:

1) Fix verification issues in some BPF programs due to their stack usage
   patterns, from Eduard Zingerman.

2) Fix to add missing overflow checks in xdp_umem_reg and return an error
   in such case, from Kal Conley.

3) Fix and undo poisoning of strlcpy in libbpf given it broke builds for
   libcs which provided the former like uClibc-ng, from Jesus Sanchez-Palencia.

4) Fix insufficient bpf_jit_limit default to avoid users running into hard
   to debug seccomp BPF errors, from Daniel Borkmann.

5) Fix driver return code when they don't support a bpf_xdp_metadata kfunc
   to make it unambiguous from other errors, from Jesper Dangaard Brouer.

6) Two BPF selftest fixes to address compilation errors from recent changes
   in kernel structures, from Alexei Starovoitov.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  xdp: bpf_xdp_metadata use EOPNOTSUPP for no driver support
  bpf: Adjust insufficient default bpf_jit_limit
  xsk: Add missing overflow check in xdp_umem_reg
  selftests/bpf: Fix progs/test_deny_namespace.c issues.
  selftests/bpf: Fix progs/find_vma_fail1.c build error.
  libbpf: Revert poisoning of strlcpy
  selftests/bpf: Tests for uninitialized stack reads
  bpf: Allow reads from uninit stack
====================

Link: https://lore.kernel.org/r/20230323225221.6082-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-23 16:03:33 -07:00
Jakub Kicinski fb63d217e6 mlx5-fixes-2023-03-21

Merge tag 'mlx5-fixes-2023-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5 fixes 2023-03-21

This series provides bug fixes to mlx5 driver.

* tag 'mlx5-fixes-2023-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5: E-Switch, Fix an Oops in error handling code
  net/mlx5: Read the TC mapping of all priorities on ETS query
  net/mlx5e: Overcome slow response for first macsec ASO WQE
  net/mlx5e: Initialize link speed to zero
  net/mlx5: Fix steering rules cleanup
  net/mlx5e: Block entering switchdev mode with ns inconsistency
  net/mlx5e: Set uplink rep as NETNS_LOCAL
====================

Link: https://lore.kernel.org/r/20230321211135.47711-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-22 22:50:01 -07:00
Jakub Kicinski e4d264e87a Extend packet offload to fully support libreswan

Merge tag 'ipsec-libreswan-mlx5' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Leon Romanovsky says:

====================
Extend packet offload to fully support libreswan

The following patches are an outcome of Raed's work to add packet
offload support to libreswan [1].

The series includes:
 * Priority support to IPsec policies
 * Statistics per-SA (visible through "ip -s xfrm state ..." command)
 * Support to IKE policy holes
 * Fine tuning to acquire logic.

[1] https://github.com/libreswan/libreswan/pull/986
Link: https://lore.kernel.org/all/cover.1678714336.git.leon@kernel.org

* tag 'ipsec-libreswan-mlx5' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  net/mlx5e: Update IPsec per SA packets/bytes count
  net/mlx5e: Use one rule to count all IPsec Tx offloaded traffic
  net/mlx5e: Support IPsec acquire default SA
  net/mlx5e: Allow policies with reqid 0, to support IKE policy holes
  xfrm: copy_to_user_state fetch offloaded SA packets/bytes statistics
  xfrm: add new device offload acquire flag
  net/mlx5e: Use chains for IPsec policy priority offload
  net/mlx5: fs_core: Allow ignore_flow_level on TX dest
  net/mlx5: fs_chains: Refactor to detach chains from tc usage
====================

Link: https://lore.kernel.org/r/20230320094722.1009304-1-leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-22 21:56:33 -07:00
Jesper Dangaard Brouer 915efd8a44 xdp: bpf_xdp_metadata use EOPNOTSUPP for no driver support
When a driver doesn't implement a bpf_xdp_metadata kfunc, the fallback
implementation returns EOPNOTSUPP, which indicates the device driver
doesn't implement this kfunc.

Currently many drivers also return EOPNOTSUPP when the hint isn't
available, which is ambiguous from an API point of view. Instead
change drivers to return ENODATA in these cases.

There can be natural reasons why a driver doesn't provide any hardware
info for a specific hint, even on a frame-to-frame basis (e.g. PTP).
Let's keep these cases as separate return codes.
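An illustrative driver-side pattern for the convention above; the
function and helpers are generic stand-ins, not a specific driver:

static int drv_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp)
{
	/* The hint is implemented, but this frame carries no timestamp. */
	if (!frame_has_hw_timestamp(ctx))
		return -ENODATA;

	*timestamp = frame_hw_timestamp(ctx);
	return 0;
}

/* Drivers that implement no kfunc at all never reach such code; the
 * core fallback keeps returning -EOPNOTSUPP for them. */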

When describing the return values, adjust the function kernel-doc layout
to get proper rendering for the return values.

Fixes: ab46182d0d ("net/mlx4_en: Support RX XDP metadata")
Fixes: bc8d405b1b ("net/mlx5e: Support RX XDP metadata")
Fixes: 306531f024 ("veth: Support RX XDP metadata")
Fixes: 3d76a4d3d4 ("bpf: XDP metadata RX kfuncs")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/167940675120.2718408.8176058626864184420.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-22 09:11:09 -07:00
Ido Schimmel bb765a7433 mlxsw: spectrum_fid: Fix incorrect local port type
Local port is a 10-bit number, but it was mistakenly stored in a u8,
resulting in firmware errors when using a netdev corresponding to a
local port higher than 255.

Fix by storing the local port in u16, as is done in the rest of the
code.

Fixes: bf73904f5f ("mlxsw: Add support for 802.1Q FID family")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/eace1f9d96545ab8a2775db857cb7e291a9b166b.1679398549.git.petrm@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-22 15:50:32 +01:00
Dan Carpenter 640fcdbcf2 net/mlx5: E-Switch, Fix an Oops in error handling code
The error handling dereferences "vport".  There is nothing we can do if
it is an error pointer except returning the error code.
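A minimal sketch of that rule (mlx5_eswitch_get_vport is the real
lookup; the surrounding context is illustrative):

	vport = mlx5_eswitch_get_vport(esw, vport_num);
	if (IS_ERR(vport))
		return PTR_ERR(vport);	/* nothing to unwind, never dereference */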

Fixes: 133dcfc577 ("net/mlx5: E-Switch, Alloc and free unique metadata for match")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-21 14:06:32 -07:00
Maher Sanalla 44d553188c net/mlx5: Read the TC mapping of all priorities on ETS query
When ETS configurations are queried by the user to get the mapping
assignment between packet priority and traffic class, only priorities up
to the maximum number of TCs are queried from the QTCT register in FW to
retrieve their assigned TC, leaving the rest of the priorities mapped to
the default TC #0, which might be misleading.

Fix by querying the TC mapping of all priorities on each ETS query,
regardless of the maximum number of TCs configured in FW.

Fixes: 820c2c5e77 ("net/mlx5e: Read ETS settings directly from firmware")
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-21 14:06:32 -07:00
Emeel Hakim 7e3fce82d9 net/mlx5e: Overcome slow response for first macsec ASO WQE
The first ASO WQE poll causes a cache miss in hardware, hence the result
is delayed. This leads to a situation where such a WQE is polled earlier
than it is needed.

Add logic to retry ASO CQ polling operation.

Fixes: 739cfa3451 ("net/mlx5: Make ASO poll CQ usable in atomic context") 
Signed-off-by: Emeel Hakim <ehakim@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-21 14:06:32 -07:00
Roy Novich 6e9d51b1a5 net/mlx5e: Initialize link speed to zero
mlx5e_port_max_linkspeed does not guarantee value assignment for speed.
Avoid cases where link_speed might be used uninitialized. In case
mlx5e_port_max_linkspeed fails, a default link speed of 50000 will be
used for the calculations.
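A hedged sketch of the described fallback; the flow is an assumption:

	u32 link_speed = 0;	/* never left uninitialized */

	mlx5e_port_max_linkspeed(mdev, &link_speed);
	if (!link_speed)
		link_speed = 50000;	/* default 50000 Mbps for the calculation */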

Fixes: 3f6d08d196 ("net/mlx5e: Add RSS support for hairpin")
Signed-off-by: Roy Novich <royno@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-21 14:06:31 -07:00
Lama Kayal 922f56e9a7 net/mlx5: Fix steering rules cleanup
vport's mc, uc and multicast rules are not deleted in the teardown path
when EEH happens. Since the vport's promisc settings (uc, mc and all) in
firmware are reset after EEH, the mlx5 driver will try to delete the above
rules in the initialization path. This causes a kernel crash because these
software rules are no longer valid.

Fix by nullifying these rules right after delete to avoid accessing any dangling
pointers.
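A minimal sketch of the nullify-after-delete pattern; the field name is
illustrative:

	mlx5_del_flow_rules(vport->ingress.allow_rule);
	vport->ingress.allow_rule = NULL;	/* EEH re-init must not see a stale rule */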

Call Trace:
__list_del_entry_valid+0xcc/0x100 (unreliable)
tree_put_node+0xf4/0x1b0 [mlx5_core]
tree_remove_node+0x30/0x70 [mlx5_core]
mlx5_del_flow_rules+0x14c/0x1f0 [mlx5_core]
esw_apply_vport_rx_mode+0x10c/0x200 [mlx5_core]
esw_update_vport_rx_mode+0xb4/0x180 [mlx5_core]
esw_vport_change_handle_locked+0x1ec/0x230 [mlx5_core]
esw_enable_vport+0x130/0x260 [mlx5_core]
mlx5_eswitch_enable_sriov+0x2a0/0x2f0 [mlx5_core]
mlx5_device_enable_sriov+0x74/0x440 [mlx5_core]
mlx5_load_one+0x114c/0x1550 [mlx5_core]
mlx5_pci_resume+0x68/0xf0 [mlx5_core]
eeh_report_resume+0x1a4/0x230
eeh_pe_dev_traverse+0x98/0x170
eeh_handle_normal_event+0x3e4/0x640
eeh_handle_event+0x4c/0x370
eeh_event_handler+0x14c/0x210
kthread+0x168/0x1b0
ret_from_kernel_thread+0x5c/0x84

Fixes: a35f71f27a ("net/mlx5: E-Switch, Implement promiscuous rx modes vf request handling")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Lama Kayal <lkayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-21 14:06:31 -07:00
Gavin Li 662404b24a net/mlx5e: Block entering switchdev mode with ns inconsistency
Upon entering switchdev mode, VF/SF representors are spawned in the
devlink instance's net namespace, whereas the PF net device transforms
into the uplink representor, remaining in the net namespace the PF net
device was in. Therefore, if a PF net device's namespace is different from
its parent devlink net namespace, entering switchdev mode can create an
illegal situation where all representors sharing the same core device
are NOT in the same net namespace.

To avoid this issue, block entering switchdev mode for devices whose child
netdev net namespace has diverged from the parent devlink's.
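An illustrative form of the check; net_eq(), dev_net() and
devlink_net() are real helpers, their placement here is an assumption:

	/* Refuse switchdev mode when the PF netdev's netns has diverged
	 * from the parent devlink's netns. */
	if (!net_eq(dev_net(netdev), devlink_net(devlink)))
		return -EPERM;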

Fixes: 7768d1971d ("net/mlx5: E-Switch, Add control for encapsulation")
Signed-off-by: Gavin Li <gavinl@nvidia.com>
Reviewed-by: Gavi Teitz <gavi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-21 14:06:31 -07:00
Gavin Li c83172b063 net/mlx5e: Set uplink rep as NETNS_LOCAL
Previously, NETNS_LOCAL was not set for uplink representors, inconsistent
with VF representors, and allowed the uplink representor to be moved
between net namespaces and separated from the VF representors it shares
the core device with. Such usage would break the isolation model of
namespaces, as devices in different namespaces would have access to
shared memory.

To solve this issue, set NETNS_LOCAL for uplink representors if eswitch is
in switchdev mode.

Fixes: 7a9fb35e8c ("net/mlx5e: Do not reload ethernet ports when changing eswitch mode")
Signed-off-by: Gavin Li <gavinl@nvidia.com>
Reviewed-by: Gavi Teitz <gavi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-21 14:06:31 -07:00
Raed Salem 5a6cddb89b net/mlx5e: Update IPsec per SA packets/bytes count
Providing per-SA packets/bytes statistics mandates creating a unique
counter per SA flow for Rx/Tx. Whenever offloaded SA statistics are
desired, query the specific SA counter to provide the stack with the
needed data.

Signed-off-by: Raed Salem <raeds@nvidia.com>
Link: https://lore.kernel.org/r/7d5ce20ac495f3054afb633128700e7b7eeeb3cd.1678714336.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-03-20 11:29:52 +02:00
Raed Salem d0c19a310e net/mlx5e: Use one rule to count all IPsec Tx offloaded traffic
Currently one counter is shared between all IPsec Tx offloaded
rules to count the total amount of packets/bytes that was IPsec
Tx offloaded. Replace this scheme by adding a new flow table (ft)
with one rule that counts all flows that pass through this
table (like the Rx status ft); this ft is pointed to by all IPsec Tx
offloaded rules. The above allows having a counter per Tx flow
rule while keeping a separate global counter that stores the
aggregated outcome of all these per-flow counters.

Signed-off-by: Raed Salem <raeds@nvidia.com>
Link: https://lore.kernel.org/r/09b9119d1deb6e482fd2d17e1f5760d7c5be1e48.1678714336.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-03-20 11:29:48 +02:00
Raed Salem aa8bd0c951 net/mlx5e: Support IPsec acquire default SA
During the XFRM stack acquire flow, a default SA is created to be updated
later, once the acquire netlink message is handled in user space.

This SA is also passed to IPsec offload supporting driver, however this
SA acts only as placeholder and does not have context suitable for
offloading in HW yet. Identify this kind of SA by special offload flag
(XFRM_DEV_OFFLOAD_FLAG_ACQ), and create a SW only context.

In such cases with special mark so it won't be installed in HW in addition
flow and on remove/delete free this SW only context.

Signed-off-by: Raed Salem <raeds@nvidia.com>
Link: https://lore.kernel.org/r/8f36d6b61631dcd73fef0a0ac623456030bc9db0.1678714336.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-03-20 11:29:45 +02:00