The get_temp() callback of a thermal zone associated with a transceiver
module no longer needs to read the temperature thresholds of the module.
Therefore, simplify the callback by only reading the temperature.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function can no longer fail so make it void and remove the
associated error path.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver registers a thermal zone for each transceiver module and
tries to set the trip point temperatures according to the thresholds
read from the transceiver. If a threshold cannot be read or if a
transceiver is unplugged, the trip point temperature is set to zero,
which means that it is disabled as far as the thermal subsystem is
concerned.
A recent change in the thermal core made it so that such trip points are
no longer marked as disabled, which led the thermal subsystem to
incorrectly set the associated cooling devices to their maximum
state [1]. A fix to restore this behavior was merged in commit
f1b80a3878 ("thermal: core: Restore behavior regarding invalid trip
points"). However, the thermal maintainer suggested to not rely on this
behavior and instead always register a valid array of trip points [2].
Therefore, create a static array of trip points with sane defaults
(suggested by Vadim) and register it with the thermal zone of each
transceiver module. User space can choose to override these defaults
using the thermal zone sysfs interface since these files are writeable.
Before:
$ cat /sys/class/thermal/thermal_zone11/type
mlxsw-module11
$ cat /sys/class/thermal/thermal_zone11/trip_point_*_temp
65000
75000
80000
After:
$ cat /sys/class/thermal/thermal_zone11/type
mlxsw-module11
$ cat /sys/class/thermal/thermal_zone11/trip_point_*_temp
55000
65000
80000
Also tested by reverting commit f1b80a3878 ("thermal: core: Restore
behavior regarding invalid trip points") and making sure that the
associated cooling devices are not set to their maximum state.
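The idea of always registering a valid trip point array can be sketched in plain C as follows. This is a minimal user-space illustration, not the driver code: struct module_trip, module_trips_valid() and the hysteresis value are assumptions made for the example, and only the three temperatures are taken from the "After" output above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a trip point; not the kernel's own structure. */
struct module_trip {
	int temperature; /* millidegrees Celsius; 0 would mean "disabled" */
	int hysteresis;  /* millidegrees Celsius; value below is an assumption */
};

/* Temperatures mirror the "After" sysfs output; hysteresis is illustrative. */
static const struct module_trip module_trips[] = {
	{ .temperature = 55000, .hysteresis = 5000 },
	{ .temperature = 65000, .hysteresis = 5000 },
	{ .temperature = 80000, .hysteresis = 5000 },
};

#define MODULE_TRIPS_NUM (sizeof(module_trips) / sizeof(module_trips[0]))

/*
 * Treat an array as valid when it is non-empty and every temperature is
 * non-zero and strictly higher than the previous one.
 */
static bool module_trips_valid(const struct module_trip *trips, size_t num)
{
	int prev = 0;

	assert(trips != NULL);
	for (size_t i = 0; i < num; i++) {
		if (trips[i].temperature <= prev)
			return false;
		prev = trips[i].temperature;
	}
	return num > 0;
}
```

Calling module_trips_valid(module_trips, MODULE_TRIPS_NUM) accepts the static defaults, while an array containing a zero temperature is rejected.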
[1] https://lore.kernel.org/linux-pm/ZA3CFNhU4AbtsP4G@shredder/
[2] https://lore.kernel.org/linux-pm/f78e6b70-a963-c0ca-a4b2-0d4c6aeef1fb@linaro.org/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dragos Tatulea says:
====================
net/mlx5e: RX, Drop page_cache and fully use page_pool
For page allocation on the rx path, the mlx5e driver has been using an
internal page cache in tandem with the page pool. The internal page
cache uses a queue for page recycling which has the issue of head of
queue blocking.
This patch series drops the internal page_cache altogether and uses the
page_pool to implement everything that was done by the page_cache
before:
* Let the page_pool handle dma mapping and unmapping.
* Use fragmented pages with fragment counter instead of tracking via
page ref.
* Enable skb recycling.
The patch series has the following effects on the rx path:
* Improved performance for the cases when there was low page recycling
due to head of queue blocking in the internal page_cache. The test
for this was running a single iperf TCP stream to a rx queue
which is bound on the same cpu as the application.
|-------------+--------+--------+------+---------|
| rq type     | before | after  | unit | diff    |
|-------------+--------+--------+------+---------|
| striding rq | 30.1   | 31.4   | Gbps | 4.14 %  |
| legacy rq   | 30.2   | 33.0   | Gbps | 8.48 %  |
|-------------+--------+--------+------+---------|
* Small XDP performance degradation. The test was an XDP drop
program running on a single rx queue with small incoming packets.
The results look like this:
|-------------+----------+----------+------+---------|
| rq type     | before   | after    | unit | diff    |
|-------------+----------+----------+------+---------|
| striding rq | 19725449 | 18544617 | pps  | -6.37 % |
| legacy rq   | 19879931 | 18631841 | pps  | -6.70 % |
|-------------+----------+----------+------+---------|
This will be handled in a different patch series by adding support for
multi-packet per page.
* For other cases the performance is roughly the same.
The above numbers were obtained on the following system:
24 core Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
32 GB RAM
ConnectX-7 single port
The breakdown on the patch series is the following:
* Preparations for introducing the mlx5e_frag_page struct.
* Delete the mlx5e_page_cache struct.
* Enable dma mapping from page_pool.
* Enable skb recycling and fragment counting.
* Do deferred release of pages (just before alloc) to ensure better
page_pool cache utilization.
====================
Merge tag 'mlx5-updates-2023-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-03-28
* tag 'mlx5-updates-2023-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5e: RX, Remove unnecessary recycle parameter and page_cache stats
net/mlx5e: RX, Break the wqe bulk refill in smaller chunks
net/mlx5e: RX, Increase WQE bulk size for legacy rq
net/mlx5e: RX, Split off release path for xsk buffers for legacy rq
net/mlx5e: RX, Defer page release in legacy rq for better recycling
net/mlx5e: RX, Change wqe last_in_page field from bool to bit flags
net/mlx5e: RX, Defer page release in striding rq for better recycling
net/mlx5e: RX, Rename xdp_xmit_bitmap to a more generic name
net/mlx5e: RX, Enable skb page recycling through the page_pool
net/mlx5e: RX, Enable dma map and sync from page_pool allocator
net/mlx5e: RX, Remove internal page_cache
net/mlx5e: RX, Store SHAMPO header pages in array
net/mlx5e: RX, Remove alloc unit layout constraint for striding rq
net/mlx5e: RX, Remove alloc unit layout constraint for legacy rq
net/mlx5e: RX, Remove mlx5e_alloc_unit argument in page allocation
====================
Link: https://lore.kernel.org/r/20230328205623.142075-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some code defines the IPv6 wildcard address as a local variable and
uses it with memcmp() or ipv6_addr_equal().
Let's use in6addr_any and ipv6_addr_any() instead.
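A user-space analogue of this cleanup, as a minimal sketch: the kernel helper ipv6_addr_any() is not available outside the kernel, so the example uses the POSIX constant in6addr_any and the IN6_IS_ADDR_UNSPECIFIED() macro instead. The function names is_wildcard_memcmp() and is_wildcard() are made up for the illustration.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Old pattern: build a zeroed address locally and compare with memcmp(). */
static bool is_wildcard_memcmp(const struct in6_addr *addr)
{
	struct in6_addr any;

	assert(addr != NULL);
	memset(&any, 0, sizeof(any));
	return memcmp(addr, &any, sizeof(any)) == 0;
}

/* Preferred pattern: rely on the standard test instead of a local copy. */
static bool is_wildcard(const struct in6_addr *addr)
{
	assert(addr != NULL);
	return IN6_IS_ADDR_UNSPECIFIED(addr) != 0;
}
```

Both helpers agree on in6addr_any and in6addr_loopback; the second one simply avoids carrying a private copy of the wildcard address.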
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
mlx5 dynamic msix
This patch series adds support for dynamic msix vectors allocation in mlx5.
Eli Cohen Says:
================
The following series of patches modifies mlx5_core to work with the
dynamic MSIX API. Currently, mlx5_core allocates all the interrupt
vectors it needs and distributes them amongst the consumers. With the
introduction of dynamic MSIX support, which allows for allocation of
interrupts more than once, we now allocate vectors as we need them.
This allows other drivers running on top of mlx5_core to allocate
interrupt vectors for their own use. An example for this is mlx5_vdpa,
which uses these vectors to propagate interrupts directly from the
hardware to the vCPU [1].
As a preparation for using this series, a use after free issue is fixed
in lib/cpu_rmap.c and the allocator for rmap entries has been modified.
A complementary API for irq_cpu_rmap_add() has also been introduced.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/patch/?id=0f2bf1fcae96a83b8c5581854713c9fc3407556e
================
Merge tag 'mlx5-updates-2023-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-03-20
* tag 'mlx5-updates-2023-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: Provide external API for allocating vectors
net/mlx5: Use one completion vector if eth is disabled
net/mlx5: Refactor calculation of required completion vectors
net/mlx5: Move devlink registration before mlx5_load
net/mlx5: Use dynamic msix vectors allocation
net/mlx5: Refactor completion irq request/release code
net/mlx5: Improve naming of pci function vectors
net/mlx5: Use newer affinity descriptor
net/mlx5: Modify struct mlx5_irq to use struct msi_map
net/mlx5: Fix wrong comment
net/mlx5e: Coding style fix, add empty line
lib: cpu_rmap: Add irq_cpu_rmap_remove to complement irq_cpu_rmap_add
lib: cpu_rmap: Use allocator for rmap entries
lib: cpu_rmap: Avoid use after free on rmap->obj array entries
====================
Link: https://lore.kernel.org/r/20230324231341.29808-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The cited commit caused the following build break in mlx5 due to a change
in size of MAX_SKB_FRAGS.
error: format '%lu' expects argument of type 'long unsigned int',
but argument 7 has type 'unsigned int' [-Werror=format=]
Fix this by explicit casting.
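The kind of fix described here can be illustrated with a small stand-alone sketch. EXAMPLE_MAX_FRAGS and format_frags() are hypothetical names used only for this example; they are not the mlx5 code that was changed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical constant standing in for a value whose type is unsigned int. */
#define EXAMPLE_MAX_FRAGS 17U

/* Format the constant with "%lu"; returns the number of characters written. */
static int format_frags(char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	/*
	 * "%lu" expects unsigned long. The explicit cast keeps the format
	 * string valid regardless of the underlying type of the constant.
	 */
	return snprintf(buf, len, "max frags: %lu",
			(unsigned long)EXAMPLE_MAX_FRAGS);
}
```

Without the cast, a compiler with -Werror=format would reject the call once the constant stops being an unsigned long.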
Fixes: 3948b05950 ("net: introduce a config option to tweak MAX_SKB_FRAGS")
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20230328200723.125122-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The recycle parameter used during page release is no longer
necessary: the page pool can detect when the page cannot be
recycled to the cache or ring without any outside hint.
The page pool will also take care of cleaning up after itself
once all the inflight pages have been released. So no need to
explicitly release pages to the system.
Remove the internal page_cache stats as the mlx5e_page_cache
struct no longer exists.
Delete the documentation entries along with the stats.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
To avoid overflowing the page pool's cache, don't release the
whole bulk which is usually larger than the cache refill size.
Group release+alloc instead into cache refill units that
allow releasing to the cache and then allocating from the cache.
A refill_unit variable is added as an iteration unit over the
wqe_bulk when doing release+alloc.
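The effect of grouping release+alloc into refill units can be shown with a toy model. This is only a sketch under simplifying assumptions: struct pool_model, its fields and the capacity used in the usage note are invented for the illustration, and a real page pool has a ring and a slow path that are not modelled here.

```c
#include <assert.h>

/* Toy model of a page pool cache; all names and fields are hypothetical. */
struct pool_model {
	unsigned int cache_cap;  /* cache capacity in pages */
	unsigned int cached;     /* pages currently held in the cache */
	unsigned int to_ring;    /* releases that did not fit in the cache */
	unsigned int from_cache; /* allocations served from the cache */
};

/* Release n pages: fill the cache first, count the overflow. */
static void model_release(struct pool_model *p, unsigned int n)
{
	unsigned int room = p->cache_cap - p->cached;
	unsigned int fit = n < room ? n : room;

	p->cached += fit;
	p->to_ring += n - fit;
}

/* Allocate n pages: count how many the cache could serve. */
static void model_alloc(struct pool_model *p, unsigned int n)
{
	unsigned int hit = n < p->cached ? n : p->cached;

	p->cached -= hit;
	p->from_cache += hit;
}

/*
 * Release and re-allocate wqe_bulk pages, refill_unit pages at a time.
 * Returns the number of allocations served from the cache.
 */
static unsigned int refill_in_units(struct pool_model *p,
				    unsigned int wqe_bulk,
				    unsigned int refill_unit)
{
	assert(refill_unit > 0);
	for (unsigned int done = 0; done < wqe_bulk; done += refill_unit) {
		unsigned int left = wqe_bulk - done;
		unsigned int n = left < refill_unit ? left : refill_unit;

		model_release(p, n);
		model_alloc(p, n);
	}
	return p->from_cache;
}
```

With a cache of 64 pages and a bulk of 256, releasing the whole bulk at once serves only 64 allocations from the cache and pushes 192 releases past it, while a refill unit of 64 serves all 256 allocations from the cache.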
For a single ring, single core, default MTU (1500) TCP stream
test the number of pages allocated from the cache directly
(rx_pp_recycle_cached) increases from 0% to 52%:
+---------------------------------------------+
| Page Pool stats (/sec) | Before | After |
+-------------------------+---------+---------+
|rx_pp_alloc_fast | 2145422 | 2193802 |
|rx_pp_alloc_slow | 2 | 0 |
|rx_pp_alloc_empty | 2 | 0 |
|rx_pp_alloc_refill | 34059 | 16634 |
|rx_pp_alloc_waive | 0 | 0 |
|rx_pp_recycle_cached | 0 | 1145818 |
|rx_pp_recycle_cache_full | 0 | 0 |
|rx_pp_recycle_ring | 2179361 | 1064616 |
|rx_pp_recycle_ring_full | 121 | 0 |
+---------------------------------------------+
With this patch, the performance for legacy rq for the above test is
back to baseline.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Deferred page release was added to legacy rq but its desired effect
(driver releases last fragment to page pool cache) is not yet visible
due to the WQE bulks being too small.
This patch increases the WQE bulk size to span 512 KB and clip it to
one quarter of the rx queue size.
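The sizing rule can be written down as a short calculation. This is a sketch only: wqe_bulk_size() and its bytes_per_wqe parameter (how much buffer memory one WQE covers) are assumptions for the example, since the message does not say how the driver derives that quantity.

```c
#include <assert.h>

#define BULK_SPAN_BYTES (512u * 1024u) /* the 512 KB span from the text */

/*
 * Number of WQEs per bulk: enough to span 512 KB, clipped to one quarter
 * of the rx queue size. bytes_per_wqe is a hypothetical input.
 */
static unsigned int wqe_bulk_size(unsigned int bytes_per_wqe,
				  unsigned int rq_size)
{
	unsigned int bulk, max_bulk;

	assert(bytes_per_wqe > 0 && rq_size >= 4);
	bulk = BULK_SPAN_BYTES / bytes_per_wqe;
	max_bulk = rq_size / 4;
	if (bulk > max_bulk)
		bulk = max_bulk;
	return bulk > 0 ? bulk : 1;
}
```

For example, with 2048 bytes per WQE the unclipped bulk is 256 WQEs; a queue of 512 entries clips it to 128.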
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Don't mix xsk buffer releases with page releases anymore. This is
needed for handling of deferred page release.
Add a new bulk free function for xsk buffers from wqe frags.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently, fragmented pages from the page pool can be released
in two ways:
1) In the mlx5e driver when trimming off the unused fragments AND the
associated skb fragments have been released. This path allows
recycling of pages to the page pool cache (allow_direct == true).
2) On the skb release path (last fragment release), which
will always release pages to the page pool ring
(allow_direct == false).
Whichever is releasing the last fragment will be decisive on
where the page gets released: the cache or the ring. So we
obviously want to maximize for doing the release from 1.
This patch does that by deferring the release of page fragments
right before requesting new ones from the page pool. A flag is
added to make sure that there's no release before first alloc
and that XDP_TX fragments are not released prematurely.
This is a preparation patch that doesn't unlock the performance
improvements yet. A followup patch will do that.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Change the bool flag to a bitfield as we'll use it in a downstream patch
in the series to add signaling about skipping a fragment release.
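A minimal sketch of such a conversion, assuming hypothetical names: the flag names and struct wqe_frag below are illustrative and are not copied from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flags replacing a single "bool last_in_page" field. */
enum frag_flag {
	FRAG_FLAG_LAST_IN_PAGE = 1u << 0,
	FRAG_FLAG_SKIP_RELEASE = 1u << 1, /* room for a later signal */
};

struct wqe_frag {
	uint8_t flags;
};

static void frag_set(struct wqe_frag *f, enum frag_flag flag)
{
	assert(f != NULL);
	f->flags |= (uint8_t)flag;
}

static void frag_clear(struct wqe_frag *f, enum frag_flag flag)
{
	assert(f != NULL);
	f->flags &= (uint8_t)~(unsigned int)flag;
}

static bool frag_test(const struct wqe_frag *f, enum frag_flag flag)
{
	assert(f != NULL);
	return (f->flags & flag) != 0;
}
```

The field keeps the same storage footprint as the bool, and the second bit can be set or cleared without disturbing the first.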
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently, for striding RQ, fragmented pages from the page pool can
get released in two ways:
1) In the mlx5e driver when trimming off the unused fragments AND the
associated skb fragments have been released. This path allows
recycling of pages to the page pool cache (allow_direct == true).
2) On the skb release path (last fragment release), which
will always release pages to the page pool ring
(allow_direct == false).
Whichever is releasing the last fragment will be decisive on
where the page gets released: the cache or the ring. So we
obviously want to maximize for doing the release from 1.
This patch does that by deferring the release of page fragments
right before requesting new ones from the page pool. Extra care
needs to be taken for the corner cases:
* On first call, make sure that release is not called. The
skip_release_bitmap is used for this purpose.
* On rq shutdown, make sure that all wqes that were not
in the linked list are released.
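The deferred release and its two corner cases can be modelled in a few lines. This is a simplified sketch: struct rq_model and its helpers are hypothetical, one page per WQE is assumed, and only the bookkeeping around a skip-release bitmap is shown.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_WQES 8

/* Toy receive queue; names and layout are assumptions for illustration. */
struct rq_model {
	uint32_t skip_release_bitmap; /* bit i set: slot i has nothing to release */
	unsigned int releases;
	unsigned int allocs;
};

static void rq_model_init(struct rq_model *rq)
{
	rq->skip_release_bitmap = (1u << NUM_WQES) - 1; /* nothing allocated yet */
	rq->releases = 0;
	rq->allocs = 0;
}

/* Refill slot ix: the old page is released only now, right before alloc. */
static void rq_model_refill(struct rq_model *rq, unsigned int ix)
{
	uint32_t bit;

	assert(ix < NUM_WQES);
	bit = 1u << ix;
	if (rq->skip_release_bitmap & bit)
		rq->skip_release_bitmap &= ~bit; /* first call: skip the release */
	else
		rq->releases++;                  /* deferred release of the old page */
	rq->allocs++;
}

/* Shutdown: release every slot that still holds a page. */
static void rq_model_shutdown(struct rq_model *rq)
{
	for (unsigned int ix = 0; ix < NUM_WQES; ix++) {
		uint32_t bit = 1u << ix;

		if (!(rq->skip_release_bitmap & bit)) {
			rq->releases++;
			rq->skip_release_bitmap |= bit;
		}
	}
}
```

In this model every allocation is matched by exactly one release: none on the first refill of a slot, one on each later refill, and the remainder at shutdown.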
For a single ring, single core, default MTU (1500) TCP stream
test the number of pages allocated from the cache directly
(rx_pp_recycle_cached) increases from 31 % to 98 %:
+----------------------------------------------+
| Page Pool stats (/sec) | Before | After |
+-------------------------+---------+----------+
|rx_pp_alloc_fast | 2137754 | 2261033 |
|rx_pp_alloc_slow | 47 | 9 |
|rx_pp_alloc_empty | 47 | 9 |
|rx_pp_alloc_refill | 23230 | 819 |
|rx_pp_alloc_waive | 0 | 0 |
|rx_pp_recycle_cached | 672182 | 2209015 |
|rx_pp_recycle_cache_full | 1789 | 0 |
|rx_pp_recycle_ring | 1485848 | 52259 |
|rx_pp_recycle_ring_full | 3003 | 584 |
+----------------------------------------------+
With this patch, the performance in striding rq for the above test is
back to baseline.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The xdp_xmit_bitmap currently serves only one purpose: to avoid
releasing pages that are still in use due to XDP TX.
A following patch will use this bitmap in a slightly different context
but for the same purpose. So rename the bitmap to a more generic name
that reflects the purpose not the context.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Start using the page_pool skb recycling api to recycle all pages back to
the page pool and stop using atomic page reference counting.
The mlx5e driver used to manage in-flight pages using page refcounting:
for each fragment there were 2 atomic write operations happening (one
for building the skb and one on skb release).
The page_pool api introduced a method to track page fragments more
optimally:
* The page's pp_fragment_count is set to a large bias on page alloc
(1 x atomic write operation).
* The driver tracks the actual page fragments in a non atomic variable.
* When the skb is recycled, pp_fragment_count is decremented
(atomic write operation).
* When page is released in the driver, the unused number of fragments
(relative to the bias) is deducted from pp_fragment_count (atomic
write operation).
* Last page defragmentation will only be an atomic read.
So in total there are `number of fragments + 1` atomic write ops, as
opposed to `2 * frags` atomic write ops previously.
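The bias scheme listed above can be modelled with C11 atomics. This is a user-space sketch, not the page_pool implementation: struct frag_page_model, the helper names and the FRAG_BIAS value are assumptions chosen for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define FRAG_BIAS 1024L /* illustrative bias, larger than any fragment count */

struct frag_page_model {
	atomic_long pp_frag_count; /* shared counter, models pp_fragment_count */
	long frags;                /* fragments handed out; driver-private */
};

/* Page alloc: set the shared counter to the bias once. */
static void model_page_alloc(struct frag_page_model *fp)
{
	atomic_store(&fp->pp_frag_count, FRAG_BIAS);
	fp->frags = 0;
}

/* Handing out a fragment only touches the non-atomic driver counter. */
static void model_get_frag(struct frag_page_model *fp)
{
	assert(fp->frags < FRAG_BIAS);
	fp->frags++;
}

/* skb recycle: drop one reference; true when the page became free. */
static bool model_skb_recycle(struct frag_page_model *fp)
{
	return atomic_fetch_sub(&fp->pp_frag_count, 1) == 1;
}

/* Driver release: give back the unused part of the bias in one step. */
static bool model_driver_release(struct frag_page_model *fp)
{
	long unused = FRAG_BIAS - fp->frags;

	return atomic_fetch_sub(&fp->pp_frag_count, unused) == unused;
}
```

With three fragments handed out, the driver release leaves the counter at 3 and the third skb recycle is the one that frees the page. By the formulas in the text, that is 3 + 1 = 4 atomic writes instead of 2 * 3 = 6.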
Pages are wrapped in a mlx5e_frag_page structure which also contains the
number of fragments. This makes it easy to count the fragments in the
driver.
This change brings performance improvements for the case when the old rx
page_cache had low recycling rates due to head of queue blocking. For an
iperf3 TCP test with a single stream, on a single core (iperf and receive
queue running on same core), the following improvements can be noticed:
* Striding rq:
- before (net-next baseline): bitrate = 30.1 Gbits/sec
- after : bitrate = 31.4 Gbits/sec (diff: 4.14 %)
* Legacy rq:
- before (net-next baseline): bitrate = 30.2 Gbits/sec
- after : bitrate = 33.0 Gbits/sec (diff: 8.48 %)
There are 2 temporary performance degradations introduced:
1) TCP streams that had a good recycling rate with the old page_cache
have a degradation for both striding and linear rq. This is due to
very low page pool cache recycling: the pages are released during skb
recycle which will release pages to the page pool ring for safety.
The following patches in this series will tackle this problem by
deferring the page release in the driver to increase the
chance of having pages recycled to the cache.
2) XDP performance is now lower (4-5 %) due to the higher number of
atomic operations used for fragment management. But this opens the
door for supporting multiple packets per page in XDP, which will
bring a big gain.
Otherwise, performance is similar to baseline.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Remove driver dma mapping and unmapping of pages. Let the
page_pool api do it.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This patch removes the internal rx page_cache and uses the generic
page_pool api only. It used to be that the page_pool couldn't handle all
the mlx5 driver usecases, but with the introduction of skb recycling and
page fragmentation in the page_pool, a full switch can now be made. Some
benefits of this transition:
* Better page recycling in the cases when the page_cache was suffering
from head of queue blocking. The page_pool doesn't have this issue.
* DMA mapping/unmapping can be managed by the page_pool.
* mlx5e_rq size reduced by more than 50% due to the page_cache array
being deleted.
This patch only removes the page_cache. Downstream patches will enable
the required page_pool features and will add further fine-tuning.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Save allocated SHAMPO header pages to an array to which the
mlx5e_dma_info page will point.
This change is a preparation for introducing mlx5e_frag_page structure
in a downstream patch. There's no new functionality introduced.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This change removes the usage of mlx5e_alloc_unit union for
striding rq. The change is more straightforward than legacy rq as
the alloc units union is already in place.
This patch only moves things around: instead of an array of unions make
it a union of arrays. This has the effect that each mlx5e_mpw_info will
allocate the largest possible size of the array member. For xsk this
means that the array of xdp_buff pointers for the wqe will still be
contiguous, but there will be some extra unused bytes at the end of the
array.
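The difference between the two layouts can be checked with sizeof. This is a sketch with made-up types: the *_model and *_sketch names, UNITS_PER_WQE and the fields of struct frag_page_sketch are assumptions and do not mirror the driver's definitions.

```c
#include <assert.h>
#include <stddef.h>

#define UNITS_PER_WQE 8 /* illustrative count */

struct xdp_buff_model; /* opaque stand-in for the xsk buffer type */

/* Hypothetical element that is larger than a pointer. */
struct frag_page_sketch {
	void *page;
	unsigned short frags;
};

/* Before: an array of unions, where every element must be pointer sized. */
union alloc_unit_model {
	void *page;
	struct xdp_buff_model *xsk;
};

static_assert(sizeof(union alloc_unit_model) == sizeof(void *),
	      "array-of-unions layout relies on pointer-sized elements");

/* After: a union of arrays, so members may have different element sizes. */
union alloc_units_model {
	struct frag_page_sketch frag_pages[UNITS_PER_WQE];
	struct xdp_buff_model *xsk_buffs[UNITS_PER_WQE];
};

/* Bytes left unused after the xsk pointer array inside the union. */
static size_t xsk_unused_tail_bytes(void)
{
	return sizeof(union alloc_units_model) -
	       sizeof(((union alloc_units_model *)0)->xsk_buffs);
}
```

The xsk pointer array stays contiguous at the start of the union, and the union as a whole is sized by its largest member, which is where the unused tail bytes come from.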
A further patch in the series will add the mlx5e_frag_page struct for
which the described size constraint will no longer hold.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The mlx5e_alloc_unit union is conveniently used to store arrays of
pointers to struct page or struct xdp_buff (for xsk). The union is
currently expected to have the size of a pointer for xsk batch
allocations to work. This is convenient for the current state of the
code but makes it impossible to add a structure of a different size
to the alloc unit.
A further patch in the series will add the mlx5e_frag_page struct for
which the described size constraint will no longer hold.
This change removes the usage of mlx5e_alloc_unit union for legacy rq:
- A union of arrays is introduced (mlx5e_alloc_units) to replace the
array of unions to allow structures of different sizes.
- Each fragment has a pointer to a unit in the mlx5e_alloc_units array.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Change internal page cache and page pool api to use a struct page **
instead of a mlx5e_alloc_unit *.
This is the first change in a series which is meant to remove the
mlx5e_alloc_unit altogether.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Provide external API to be used by other drivers relying on mlx5_core,
for allocating MSIX vectors. An example for such a driver would be
mlx5_vdpa.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
If eth is disabled by devlink, use only a single completion vector,
which gives all users of completion vectors minimum performance. This
also affects Infiniband performance.
The rest of the vectors can be used by other consumers on a first come
first served basis.
mlx5_vdpa will make use of this to allocate dedicated vectors for its
own use.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Move the calculation to a separate function. We will add more
functionality to it in a follow up patch.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
In order to allow reference to devlink parameters during driver load,
move the devlink registration before mlx5_load. Subsequent patch will
use it to control the number of completion vectors required based on
whether eth is enabled or not.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
The current implementation calculates the number and the partitioning of
available interrupt vectors and then allocates all the interrupt
vectors.
Here, whenever dynamic msix allocation is supported, we change this to
use msix vectors dynamically so that a vector is actually allocated only
when needed. The current pool logic is kept in place to take care of
partitioning the vectors between the consumers and of reference
counting. However, the vectors are allocated only when needed.
Subsequent patches will make use of this to allocate vectors for VDPA.
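Allocation on first use combined with reference counting can be sketched as follows. This is a simplified model: struct vec_pool, its helpers and the hw_allocs counter are hypothetical, and the real vector allocation call is replaced by a counter increment.

```c
#include <assert.h>
#include <stdbool.h>

#define POOL_SIZE 4 /* illustrative pool size */

struct vec_slot {
	bool allocated;
	unsigned int refcount;
};

struct vec_pool {
	struct vec_slot slots[POOL_SIZE];
	unsigned int hw_allocs; /* how many times a vector was really allocated */
};

/* Take a reference on vector ix; allocate it only on first use. */
static bool vec_pool_get(struct vec_pool *pool, unsigned int ix)
{
	struct vec_slot *s;

	if (ix >= POOL_SIZE)
		return false;
	s = &pool->slots[ix];
	if (!s->allocated) {
		s->allocated = true; /* stands in for the actual allocation */
		pool->hw_allocs++;
	}
	s->refcount++;
	return true;
}

/* Drop a reference; free the vector when the last user goes away. */
static void vec_pool_put(struct vec_pool *pool, unsigned int ix)
{
	struct vec_slot *s;

	assert(ix < POOL_SIZE && pool->slots[ix].refcount > 0);
	s = &pool->slots[ix];
	if (--s->refcount == 0)
		s->allocated = false;
}
```

Two consumers sharing the same index trigger a single allocation, and the slot is freed only after both have dropped their reference.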
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Break the request and release functions into separate handling for pci
and sub-function devices for better readability, eventually making the
code symmetric in terms of request/release.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
The variable pf_vec is used to denote the number of vectors required for
the pci function's own use. To avoid confusion interpreting pf as
physical function, change the name to pcif_vec.
Same reasoning goes for pf_pool which is really pci function pool.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Use the more refined struct irq_affinity_desc to describe the required
IRQ affinity. For the async IRQs request unmanaged affinity and for
completion queues use managed affinity.
No functionality changes introduced. It will be used in a subsequent
patch when we use dynamic MSIX allocation.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Use the standard struct msi_map to store the vector number and irq
number pair in struct mlx5_irq.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
A control irq may be allocated from the parent device's pool in case
there is no SF dedicated pool. This could happen when there are not
enough vectors available for SFs.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Add empty line between two function definitions.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Remove pci_clear_master() to simplify the code. Bus mastering is
already cleared in do_pci_disable_device():
./drivers/pci/pci.c:2197
static void do_pci_disable_device(struct pci_dev *dev)
{
	u16 pci_command;

	pci_read_config_word(dev, PCI_COMMAND, &pci_command);
	if (pci_command & PCI_COMMAND_MASTER) {
		pci_command &= ~PCI_COMMAND_MASTER;
		pci_write_config_word(dev, PCI_COMMAND, pci_command);
	}

	pcibios_disable_device(dev);
}
In addition, dev->is_busmaster is set to 0 in pci_disable_device().
Signed-off-by: Cai Huoqing <cai.huoqing@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-03-23
We've added 8 non-merge commits during the last 13 day(s) which contain
a total of 21 files changed, 238 insertions(+), 161 deletions(-).
The main changes are:
1) Fix verification issues in some BPF programs due to their stack usage
patterns, from Eduard Zingerman.
2) Fix to add missing overflow checks in xdp_umem_reg and return an error
in such case, from Kal Conley.
3) Fix and undo poisoning of strlcpy in libbpf given it broke builds for
libcs which provided the former like uClibc-ng, from Jesus Sanchez-Palencia.
4) Fix insufficient bpf_jit_limit default to avoid users running into hard
to debug seccomp BPF errors, from Daniel Borkmann.
5) Fix driver return code when they don't support a bpf_xdp_metadata kfunc
to make it unambiguous from other errors, from Jesper Dangaard Brouer.
6) Two BPF selftest fixes to address compilation errors from recent changes
in kernel structures, from Alexei Starovoitov.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
xdp: bpf_xdp_metadata use EOPNOTSUPP for no driver support
bpf: Adjust insufficient default bpf_jit_limit
xsk: Add missing overflow check in xdp_umem_reg
selftests/bpf: Fix progs/test_deny_namespace.c issues.
selftests/bpf: Fix progs/find_vma_fail1.c build error.
libbpf: Revert poisoning of strlcpy
selftests/bpf: Tests for uninitialized stack reads
bpf: Allow reads from uninit stack
====================
Link: https://lore.kernel.org/r/20230323225221.6082-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'mlx5-fixes-2023-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2023-03-21
This series provides bug fixes to mlx5 driver.
* tag 'mlx5-fixes-2023-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: E-Switch, Fix an Oops in error handling code
net/mlx5: Read the TC mapping of all priorities on ETS query
net/mlx5e: Overcome slow response for first macsec ASO WQE
net/mlx5e: Initialize link speed to zero
net/mlx5: Fix steering rules cleanup
net/mlx5e: Block entering switchdev mode with ns inconsistency
net/mlx5e: Set uplink rep as NETNS_LOCAL
====================
Link: https://lore.kernel.org/r/20230321211135.47711-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The following patches are an outcome of Raed's work to add packet
offload support to libreswan [1].
The series includes:
* Priority support to IPsec policies
* Statistics per-SA (visible through "ip -s xfrm state ..." command)
* Support to IKE policy holes
* Fine tuning to acquire logic.
Thanks
[1] https://github.com/libreswan/libreswan/pull/986
Link: https://lore.kernel.org/all/cover.1678714336.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Merge tag 'ipsec-libreswan-mlx5' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Leon Romanovsky says:
====================
Extend packet offload to fully support libreswan
The following patches are an outcome of Raed's work to add packet
offload support to libreswan [1].
The series includes:
* Priority support to IPsec policies
* Statistics per-SA (visible through "ip -s xfrm state ..." command)
* Support to IKE policy holes
* Fine tuning to acquire logic.
[1] https://github.com/libreswan/libreswan/pull/986
Link: https://lore.kernel.org/all/cover.1678714336.git.leon@kernel.org
* tag 'ipsec-libreswan-mlx5' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5e: Update IPsec per SA packets/bytes count
net/mlx5e: Use one rule to count all IPsec Tx offloaded traffic
net/mlx5e: Support IPsec acquire default SA
net/mlx5e: Allow policies with reqid 0, to support IKE policy holes
xfrm: copy_to_user_state fetch offloaded SA packets/bytes statistics
xfrm: add new device offload acquire flag
net/mlx5e: Use chains for IPsec policy priority offload
net/mlx5: fs_core: Allow ignore_flow_level on TX dest
net/mlx5: fs_chains: Refactor to detach chains from tc usage
====================
Link: https://lore.kernel.org/r/20230320094722.1009304-1-leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a driver doesn't implement a bpf_xdp_metadata kfunc, the fallback
implementation returns EOPNOTSUPP, which indicates that the device
driver doesn't implement this kfunc.
Currently many drivers also return EOPNOTSUPP when the hint isn't
available, which is ambiguous from an API point of view. Instead
change drivers to return ENODATA in these cases.
There can be legitimate reasons why a driver doesn't provide any
hardware info for a specific hint, even on a frame-to-frame basis
(e.g. PTP). Let's keep these cases as separate return codes.
When describing the return values, adjust the function kernel-doc layout
to get proper rendering for the return values.
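The distinction between the two return codes can be illustrated with a
minimal user-space sketch. The structure and function names below
(demo_rx_ctx, demo_rx_timestamp) are hypothetical and are not the
kernel's kfunc API; the sketch only shows how a caller can tell "driver
does not implement the kfunc" apart from "no data for this frame".

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-frame context, for illustration only. */
struct demo_rx_ctx {
	bool driver_has_kfunc;	/* driver implements the timestamp hint */
	bool hw_ts_valid;	/* hardware supplied a timestamp for this frame */
	uint64_t hw_ts;
};

/*
 * Returns 0 on success, -EOPNOTSUPP when the driver does not implement
 * the hint at all, and -ENODATA when the hint is implemented but not
 * available for this particular frame.
 */
int demo_rx_timestamp(const struct demo_rx_ctx *ctx, uint64_t *timestamp)
{
	if (!ctx->driver_has_kfunc)
		return -EOPNOTSUPP;

	if (!ctx->hw_ts_valid)
		return -ENODATA;

	*timestamp = ctx->hw_ts;
	return 0;
}
```

With separate codes, a caller may stop asking for the hint after
-EOPNOTSUPP, but should keep asking on later frames after -ENODATA.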
Fixes: ab46182d0d ("net/mlx4_en: Support RX XDP metadata")
Fixes: bc8d405b1b ("net/mlx5e: Support RX XDP metadata")
Fixes: 306531f024 ("veth: Support RX XDP metadata")
Fixes: 3d76a4d3d4 ("bpf: XDP metadata RX kfuncs")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/167940675120.2718408.8176058626864184420.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Local port is a 10-bit number, but it was mistakenly stored in a u8,
resulting in firmware errors when using a netdev corresponding to a
local port higher than 255.
Fix by storing the local port in u16, as is done in the rest of the
code.
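The truncation can be reproduced in user space. This is only a sketch:
the helper names and the mask macro are made up for illustration and do
not appear in the driver.

```c
#include <stdint.h>

/* A local port is a 10-bit number (0..1023). */
#define DEMO_LOCAL_PORT_MASK 0x3ffu

/* Buggy variant: an 8-bit type silently drops the two upper bits. */
uint8_t demo_store_local_port_u8(unsigned int local_port)
{
	return (uint8_t)(local_port & DEMO_LOCAL_PORT_MASK);
}

/* Fixed variant: a 16-bit type holds every 10-bit value. */
uint16_t demo_store_local_port_u16(unsigned int local_port)
{
	return (uint16_t)(local_port & DEMO_LOCAL_PORT_MASK);
}
```

For local port 300, the u8 variant yields 44 (300 modulo 256), so a
different port would be passed to the firmware, while the u16 variant
preserves 300.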
Fixes: bf73904f5f ("mlxsw: Add support for 802.1Q FID family")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/eace1f9d96545ab8a2775db857cb7e291a9b166b.1679398549.git.petrm@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The error handling dereferences "vport". There is nothing we can do if
it is an error pointer except returning the error code.
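The pattern can be sketched in user space as follows. The demo_*
helpers imitate the idea behind the kernel's error pointer helpers
(ERR_PTR/IS_ERR/PTR_ERR) but are simplified stand-ins, and the vport
table and function names are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *demo_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* True when the pointer lies in the top DEMO_MAX_ERRNO addresses. */
static inline int demo_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

static inline long demo_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

struct demo_vport {
	int metadata;
};

static struct demo_vport demo_vports[2];

struct demo_vport *demo_get_vport(int num)
{
	if (num < 0 || num >= 2)
		return demo_err_ptr(-EINVAL);
	return &demo_vports[num];
}

int demo_set_metadata(int num, int metadata)
{
	struct demo_vport *vport = demo_get_vport(num);

	/* Nothing to unwind here: return without touching vport. */
	if (demo_is_err(vport))
		return (int)demo_ptr_err(vport);

	vport->metadata = metadata;
	return 0;
}
```

The key point is that the error branch returns directly instead of
jumping to a cleanup label that would dereference the error pointer.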
Fixes: 133dcfc577 ("net/mlx5: E-Switch, Alloc and free unique metadata for match")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When ETS configurations are queried by the user to get the mapping
assignment between packet priority and traffic class, only priorities up
to maximum TCs are queried from QTCT register in FW to retrieve their
assigned TC, leaving the rest of the priorities mapped to the default
TC #0 which might be misleading.
Fix by querying the TC mapping of all priorities on each ETS query,
regardless of the maximum number of TCs configured in FW.
Fixes: 820c2c5e77 ("net/mlx5e: Read ETS settings directly from firmware")
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The first ASO WQE poll causes a cache miss in hardware, hence the result
is delayed. This leads to a situation where such a WQE is polled before
its result is ready.
Add logic to retry the ASO CQ polling operation.
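A bounded retry loop of this kind can be sketched as below. This is not
the driver code: the callback type, the demo_cq structure and the
try-count bound are assumptions made for illustration, and the real
change may bound the retries differently (for example by time).

```c
#include <errno.h>

/* Hypothetical poll callback: returns 0 once a completion is available. */
typedef int (*demo_poll_fn)(void *ctx);

/* Retry the poll up to max_tries times and stop at the first success. */
int demo_poll_with_retry(demo_poll_fn poll, void *ctx, unsigned int max_tries)
{
	int err = -ETIMEDOUT;
	unsigned int i;

	for (i = 0; i < max_tries; i++) {
		err = poll(ctx);
		if (!err)
			break;
	}

	return err;
}

/* Simulated CQ whose first polls_until_ready polls find nothing. */
struct demo_cq {
	unsigned int polls_until_ready;
	unsigned int polls;
};

int demo_poll_cq(void *ctx)
{
	struct demo_cq *cq = ctx;

	cq->polls++;
	return cq->polls > cq->polls_until_ready ? 0 : -EAGAIN;
}
```

A single poll against a CQ that needs two extra attempts would fail,
whereas the retrying wrapper succeeds on the third attempt.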
Fixes: 739cfa3451 ("net/mlx5: Make ASO poll CQ usable in atomic context")
Signed-off-by: Emeel Hakim <ehakim@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
mlx5e_port_max_linkspeed does not guarantee value assignment for speed.
Avoid cases where link_speed might be used uninitialized. In case
mlx5e_port_max_linkspeed fails, a default link speed of 50000 will be
used for the calculations.
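The idea can be sketched in user space as follows. The function names
and the simulated query are hypothetical; only the zero initialization
and the 50000 fallback are taken from the description above.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_DEFAULT_LINK_SPEED 50000u

/*
 * Simulated query: like the function described above, it leaves *speed
 * untouched when it fails.
 */
int demo_port_max_linkspeed(int simulate_failure, uint32_t *speed)
{
	if (simulate_failure)
		return -EIO;

	*speed = 100000;
	return 0;
}

uint32_t demo_hairpin_link_speed(int simulate_failure)
{
	/* Zero-initialize so a failed query cannot leave garbage behind. */
	uint32_t link_speed = 0;

	demo_port_max_linkspeed(simulate_failure, &link_speed);

	/* Fall back to the default when no speed was reported. */
	if (!link_speed)
		link_speed = DEMO_DEFAULT_LINK_SPEED;

	return link_speed;
}
```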
Fixes: 3f6d08d196 ("net/mlx5e: Add RSS support for hairpin")
Signed-off-by: Roy Novich <royno@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
vport's mc, uc and multicast rules are not deleted in the teardown path
when EEH happens. Since the vport's promisc settings (uc, mc and all) in
firmware are reset after EEH, the mlx5 driver will try to delete the
above rules in the initialization path. This causes a kernel crash
because these software rules are no longer valid.
Fix by nullifying these rules right after deletion to avoid accessing
any dangling pointers.
Call Trace:
__list_del_entry_valid+0xcc/0x100 (unreliable)
tree_put_node+0xf4/0x1b0 [mlx5_core]
tree_remove_node+0x30/0x70 [mlx5_core]
mlx5_del_flow_rules+0x14c/0x1f0 [mlx5_core]
esw_apply_vport_rx_mode+0x10c/0x200 [mlx5_core]
esw_update_vport_rx_mode+0xb4/0x180 [mlx5_core]
esw_vport_change_handle_locked+0x1ec/0x230 [mlx5_core]
esw_enable_vport+0x130/0x260 [mlx5_core]
mlx5_eswitch_enable_sriov+0x2a0/0x2f0 [mlx5_core]
mlx5_device_enable_sriov+0x74/0x440 [mlx5_core]
mlx5_load_one+0x114c/0x1550 [mlx5_core]
mlx5_pci_resume+0x68/0xf0 [mlx5_core]
eeh_report_resume+0x1a4/0x230
eeh_pe_dev_traverse+0x98/0x170
eeh_handle_normal_event+0x3e4/0x640
eeh_handle_event+0x4c/0x370
eeh_event_handler+0x14c/0x210
kthread+0x168/0x1b0
ret_from_kernel_thread+0x5c/0x84
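The "nullify right after delete" pattern described above can be
sketched in user space as follows. The structures and helpers are
hypothetical and use malloc/free in place of the driver's flow rule
API.

```c
#include <stdlib.h>

struct demo_rule {
	int id;
};

struct demo_vport {
	struct demo_rule *promisc_rule;
	struct demo_rule *allmulti_rule;
};

/* Delete a rule and clear the owner's pointer in one step. */
static void demo_del_rule(struct demo_rule **rule)
{
	if (!*rule)
		return;

	free(*rule);
	*rule = NULL;	/* a later cleanup pass sees NULL, not a stale pointer */
}

void demo_vport_cleanup(struct demo_vport *vport)
{
	demo_del_rule(&vport->promisc_rule);
	demo_del_rule(&vport->allmulti_rule);
}
```

Because the pointers are cleared, calling demo_vport_cleanup() a second
time (as the initialization path effectively does after EEH) is a
no-op rather than a double free.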
Fixes: a35f71f27a ("net/mlx5: E-Switch, Implement promiscuous rx modes vf request handling")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Lama Kayal <lkayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Upon entering switchdev mode, VF/SF representors are spawned in the
devlink instance's net namespace, whereas the PF net device transforms
into the uplink representor, remaining in the net namespace the PF net
device was in. Therefore, if a PF net device's namespace is different from
its parent devlink net namespace, entering switchdev mode can create an
illegal situation where all representors sharing the same core device
are NOT in the same net namespace.
To avoid this issue, block entering switchdev mode for devices whose child
netdev net namespace has diverged from the parent devlink's.
Fixes: 7768d1971d ("net/mlx5: E-Switch, Add control for encapsulation")
Signed-off-by: Gavin Li <gavinl@nvidia.com>
Reviewed-by: Gavi Teitz <gavi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Previously, NETNS_LOCAL was not set for uplink representors, which was
inconsistent with VF representors and allowed the uplink representor to
be moved between net namespaces and separated from the VF representors
it shares the core device with. Such usage would break the isolation
model of namespaces, as devices in different namespaces would have
access to shared memory.
To solve this issue, set NETNS_LOCAL for uplink representors if eswitch is
in switchdev mode.
Fixes: 7a9fb35e8c ("net/mlx5e: Do not reload ethernet ports when changing eswitch mode")
Signed-off-by: Gavin Li <gavinl@nvidia.com>
Reviewed-by: Gavi Teitz <gavi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Providing per-SA packets/bytes statistics mandates creating a unique
counter per SA flow for Rx/Tx. Whenever offloaded SA statistics are
desired, query the specific SA counter to provide the stack with the
needed data.
Signed-off-by: Raed Salem <raeds@nvidia.com>
Link: https://lore.kernel.org/r/7d5ce20ac495f3054afb633128700e7b7eeeb3cd.1678714336.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Currently, one counter is shared between all IPsec Tx offloaded rules
to count the total amount of packets/bytes that were IPsec Tx
offloaded. Replace this scheme by adding a new flow table (ft) with one
rule that counts all flows that pass through this table (like the Rx
status ft). This ft is pointed to by all IPsec Tx offloaded rules. This
allows having a counter per Tx flow rule while keeping a separate
global counter that stores the aggregate of all these per-flow
counters.
Signed-off-by: Raed Salem <raeds@nvidia.com>
Link: https://lore.kernel.org/r/09b9119d1deb6e482fd2d17e1f5760d7c5be1e48.1678714336.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
During the XFRM stack acquire flow, a default SA is created to be
updated later, once the acquire netlink message is handled in user
space. This SA is also passed to the driver supporting IPsec offload;
however, it acts only as a placeholder and does not yet have a context
suitable for offloading in HW. Identify this kind of SA by a special
offload flag (XFRM_DEV_OFFLOAD_FLAG_ACQ) and create a SW-only context.
Mark such a context so that it is not installed in HW in the addition
flow, and free this SW-only context on remove/delete.
Signed-off-by: Raed Salem <raeds@nvidia.com>
Link: https://lore.kernel.org/r/8f36d6b61631dcd73fef0a0ac623456030bc9db0.1678714336.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>