Increase max supported number of actions in the same rule.
Signed-off-by: Hamdan Igbaria <hamdani@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Move all the vport capabilities to a separate struct and store vport
caps in an XArray: SF vport numbers will not come in the same range as
VF vports, so the existing implementation of vport capabilities as a
fixed-size array is not suitable here.
XArray is a perfect fit: it is efficient when the indices used are
densely clustered. In addition to being a perfect fit as a dynamic data
structure, XArray also provides locking - it uses RCU and an internal
spinlock to synchronise access, so no additional protection is needed.
Now, except for the eswitch manager vport, all other vports (including
the uplink vport) are handled in the same way: when a new go-to-vport
action is added, this vport's caps are loaded from the XArray. If it is
the first time for this particular vport number, its capabilities are
queried from FW and filled into the appropriate entry.
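As an illustration, the lookup-or-query flow could look roughly like
the sketch below. Only xa_load()/xa_store()/xa_err() are the real
XArray API; the struct layout and the dr_query_vport_cap_from_fw()
helper are illustrative assumptions, not the actual driver symbols.

    static struct mlx5dr_cmd_vport_cap *
    dr_vport_cap_get(struct mlx5dr_domain *dmn, u16 vport)
    {
            struct mlx5dr_cmd_vport_cap *caps;
            int err;

            /* RCU-protected lookup - no extra locking needed */
            caps = xa_load(&dmn->vport_caps, vport);
            if (caps)
                    return caps;

            /* First time this vport number is used - query FW and cache */
            caps = kzalloc(sizeof(*caps), GFP_KERNEL);
            if (!caps)
                    return NULL;

            err = dr_query_vport_cap_from_fw(dmn, vport, caps); /* hypothetical */
            if (!err)
                    err = xa_err(xa_store(&dmn->vport_caps, vport, caps,
                                          GFP_KERNEL));
            if (err) {
                    kfree(caps);
                    return NULL;
            }
            return caps;
    }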
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Implement csum recalculation flow tables in an XArray instead of a
fixed-size array, thus adding support for a csum recalc table on any
valid vport number, which enables this support for SFs.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Print similar error messages when an invalid vport number is
provided during action creation and during STEv0/1 creation.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently, vport 0 capabilities are not set.
To fix this, we now query both the eswitch manager and vport 0.
The eswitch manager has access to all the vports - for an eswitch
manager PF, all vports can be referred to as other vports. The
exception is embedded CPU mode, where there is vport 0 of the ECPF and
the PF vport 0.
Here is how the vports are queried:
For Connect-X5/6:
PF vport (0) and vports 1..n: vport number, other = true
esw_manager is vport 0 (PF)
For BlueField (in embedded CPU mode):
ECPF vport: vport = 0, other = false
PF vport (0) and 1..n: vport number, other = true
esw_manager = vport 0 (ECPF)
Also, note that there is no need for the other_vport parameter in
dr_domain_query_vport - this value is now deduced locally in the
function.
Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
SW steering defines its own macro for the uplink vport number.
Replace this macro with an already existing mlx5 macro.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
According to the HW spec, the vport number is a 16-bit value.
Fix vport usage all over the code to use the u16 data type.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
TX-port-TS hijacks the PTP traffic to a specific HW TX-queue. This
conflicts with MQPRIO in channel mode, which explicitly specifies which
TC accepts the packet. This patch makes these two configurations
mutually exclusive.
Fixes: ec60c4581b ("net/mlx5e: Support MQPRIO channel mode")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When setting the number of completion EQs of the SF, consider the
number of online CPUs.
Without this consideration, when the number of online CPUs is less
than 8, 8 completion EQs are allocated unnecessarily.
Fixes: c36326d38d ("net/mlx5: Round-Robin EQs over IRQs")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The maximum irq_index can be 2047. This means irq_name should have 4
characters reserved for the irq_index. Hence, increase it to 4.
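For illustration only (the exact buffer name and format string in the
driver may differ), the sizing argument is:

    /* "mlx5_comp" needs up to 4 extra digits for irq_index <= 2047;
     * the terminating NUL is already counted by sizeof("mlx5_comp") */
    char irq_name[sizeof("mlx5_comp") + 4];

    snprintf(irq_name, sizeof(irq_name), "mlx5_comp%u", irq_index);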
Fixes: 3af26495a2 ("net/mlx5: Enlarge interrupt field in CREATE_EQ")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When in real-time mode, the HW clock is synced with the PTP daemon.
Hence, the driver should not re-calibrate the next pulse (via the
MTPPSE repetitive events mechanism).
This patch arms repetitive events only in free-running mode.
Fixes: 432119de33 ("net/mlx5: Add cyc2time HW translation mode support")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Allow configuration of the 1PPS start time only with a time-stamp
representing a round second. Prior to this patch, the driver allowed
setting a non-round-second start time, which is not supported by the
device. Avoid unexpected behavior by restricting start-time
configuration to a round second.
Fixes: 4272f9b88d ("net/mlx5e: Change 1PPS out scheme")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* Add netdev->tc_to_txq rollback in case of failure in
mlx5e_update_netdev_queues().
* Fix broken transition between the two modes:
MQPRIO DCB mode with tc==8, and MQPRIO channel mode.
* Disable MQPRIO channel mode if re-attaching with a different number
of channels.
* Improve code sharing.
Fixes: ec60c4581b ("net/mlx5e: Support MQPRIO channel mode")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The value for maximum number of channels is first calculated based
on the netdev's profile and current function resources (specifically,
number of MSIX vectors, which depends among other things on the number
of online cores in the system).
This value is then used to calculate the netdev's number of rxqs/txqs.
Once created (by alloc_etherdev_mqs), the number of netdev's rxqs/txqs
is constant and we must not exceed it.
To achieve this, keep the maximum number of channels in sync upon any
netdevice re-attach.
Use mlx5e_get_max_num_channels() for calculating the number of netdev's
rxqs/txqs. After the netdev is created, use mlx5e_calc_max_nch() (which
considers core device resources, profile, and netdev) to init or
update priv->max_nch.
Before this patch, the value of priv->max_nch might get out of sync,
mistakenly allowing accesses to out-of-bounds objects, which would
crash the system.
Track the number of channel stats structures used in a separate
field, as they are persistent across suspend/resume operations. All the
collected stats of every channel index that ever existed should be
preserved. They are reset only when struct mlx5e_priv is,
in mlx5e_priv_cleanup(), which is part of the profile changing flow.
There is no point anymore in blocking a profile change due to max_nch
mismatch in mlx5e_netdev_change_profile(). Remove the limitation.
Fixes: a1f240f180 ("net/mlx5e: Adjust to max number of channles when re-attaching")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently, in the Rx data path, IPsec crypto offloaded packets use the
csum_none flag, so the checksum is handled by the stack. This naturally
has some performance/CPU utilization impact on such flows. As NVIDIA
NICs starting from ConnectX-6 Dx provide a checksum complete value out
of the box also for such flows, there is no reason to take the
csum_none path; furthermore, the stack (xfrm) has the means to handle
checksum complete corrections for such flows, i.e. IPsec trailer
removal and the consequent checksum value adjustment.
Because of the above, and since ConnectX-6 Dx is the first HW to
support IPsec crypto offload, it is safe to report checksum complete
for IPsec offloaded traffic.
Fixes: b2ac7541e3 ("net/mlx5e: IPsec: Add Connect-X IPsec Rx data path offload")
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add counters for XDP REDIRECT success and failure. This brings the
redirect path in line with metrics gathered via the other XDP paths.
Signed-off-by: Joshua Roys <roysjosh@gmail.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use array_size() helper instead of the open-coded version in
copy_to_user(). These sorts of multiplication factors need
to be wrapped in array_size().
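A sketch of the pattern (variable names are illustrative; array_size()
lives in <linux/overflow.h> and saturates to SIZE_MAX on overflow
instead of wrapping):

    /* before: open-coded multiplication that can overflow
     *   copy_to_user(buf, entries, num * sizeof(*entries));
     * after: overflow-safe size calculation */
    if (copy_to_user(buf, entries, array_size(num, sizeof(*entries))))
            return -EFAULT;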
Link: https://github.com/KSPP/linux/issues/160
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A UID field was added to the alloc_uar and dealloc_uar PRM commands to
specify the DevX UID for the UAR. This change enables firmware to
validate user access to its own UAR resources.
For kernel-allocated UARs, the UID will stay 0, as it is today.
Signed-off-by: Meir Lichtinger <meirl@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
When rhashtable_init() fails, it returns -EINVAL.
However, since the error return value of rhashtable_init() is not
checked, this can lead to the use of uninitialized pointers.
So, fix unhandled errors of rhashtable_init().
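The fix follows the usual pattern - a minimal sketch, where 'ht' and
'params' stand for whatever table the affected code initializes:

    err = rhashtable_init(&ht, &params);
    if (err)
            return err; /* don't continue with an uninitialized table */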
Signed-off-by: MichelleJin <shjy180909@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure that devlink is open to receive user input only after all
parameters are initialized.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor the code to make sure that devlink_register() is the last
command during initialization stage.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support offloading of TC rules that filter ingress traffic from a
MACVLAN device, which is attached to the uplink representor.
Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Support offloading of TC rules that mirror/redirect egress traffic to
a MACVLAN device, which is attached to a mlx5 representor net device.
Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In switchdev mode we insert steering rules into the eswitch that
make sure packets can't be looped back.
Modify the self tests infra and have flags per test.
Add a flag for tests that needs to be skipped in switchdev mode.
Before this commit:
$ ethtool --test enp8s0f0
The test result is FAIL
The test extra info:
Link Test 0
Speed Test 0
Health Test 0
Loopback Test 1
After this commit:
$ ethtool --test enp8s0f0
The test result is PASS
The test extra info:
Link Test 0
Speed Test 0
Health Test 0
Example output in dmesg:
enp8s0f0: Self test begin..
enp8s0f0: [0] Link Test start..
enp8s0f0: [0] Link Test end: result(0)
enp8s0f0: [1] Speed Test start..
enp8s0f0: [1] Speed Test end: result(0)
enp8s0f0: [2] Health Test start..
enp8s0f0: [2] Health Test end: result(0)
enp8s0f0: Self test out: status flags(0x1)
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This is done for consistency and adds the module name to the error
message.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Instead of having sparse ifdefs in source files, use a single
ifdef in the tc sample header file and use stubs.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The priv argument is not being used. Remove it.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The driver should add offloaded rules with either a fwd or drop action.
The check existed in parsing fdb flows but not when parsing nic flows.
Move the test into actions_match_supported() which is called for
checking nic flows and fdb flows.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Do it when parsing, like in other actions, instead of when
checking if goto is supported in the current scenario.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
A user is expected to explicitly request a fwd or drop action.
It is not correct to implicitly add a fwd action for the user
when the modify header action flag exists.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
modify_header_match_supported() should return type bool, but it
returns the value returned by is_action_keys_supported(), which is of
type int.
is_action_keys_supported() always returns either -EOPNOTSUPP or 0, and
this shouldn't change, as the purpose of the function is checking for
support. So just make the function return a bool.
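A minimal sketch of the change ('args' is a placeholder for the
function's real parameters):

    /* before: a bool-typed function returned the int error code;
     * -EOPNOTSUPP (non-zero) would even coerce to 'true' */
    return is_action_keys_supported(args);

    /* after: explicit boolean meaning "modify header is supported" */
    return !is_action_keys_supported(args);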
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Prior to this patch, ethtool -X could fail while the user still
received a success status. Try to roll back when failing and return
the status accordingly.
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently, the mlxsw driver supports IP-in-IP only with IPv4 underlay.
Add support for IPv6 underlay for Spectrum-2 and above.
Most of the configuration is the same as for IPv4; the main difference
between IPv4 and IPv6 is related to storing IP addresses. IPv6
addresses are stored as part of the KVD and the relevant registers hold
pointers to them.
Add an API for that as part of ipip_ops, so that only Spectrum-2 and
above will store IPv6 addresses in this way.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Spectrum ASIC has a configurable limit on how deep into the packet
it parses. By default, the limit is 96 bytes.
For IP-in-IP packets, with IPv6 outer and inner headers, the default
parsing depth is not enough and without increasing it such packets cannot
be properly decapsulated.
Use the existing API to set parsing depth, calling it once for each
decapsulation entry when it is created/destroyed.
There is no need to protect the code with a new mutex because
'router->lock' is already taken in these code paths.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for allocating and freeing KVD entries for IPv6 addresses.
These addresses are programmed by the RIPS register and referenced by
the RATR and RTDP registers for IPv6 underlay encapsulation and
decapsulation, respectively.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add operations for IP-in-IP GRE6. For now the function can_offload()
returns false and the other functions warn as they should never be called.
A later patch will add dedicated operations for Spectrum-2 which will
support IPv6 underlay, so use 'mlxsw_sp1_' prefix for the new operations.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, there is support for IP-in-IP only with IPv4 underlay for all
supported Spectrum ASICs.
The next patches will add support for IPv6 underlay only for Spectrum-2
and above.
Add infrastructure for splitting IP-in-IP support between different
ASICs - create a separate ipip_ops_arr and add an ipips_init function
to set the right ops.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The RITR register is used to configure the router interface table.
For IP-in-IP, it stores the underlay source IP address for encapsulation
and also the ingress RIF for the underlay lookup.
Add support for IPv6 IP-in-IP configuration.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The RATR register is used to configure the Router Adjacency (next-hop)
Table.
For an IP-in-IP entry, an underlay destination IPv4 address is stored
as part of this register, while an underlay destination IPv6 address is
stored via the RIPS register and RATR holds a pointer to it.
Add a function for setting the IPv6 IP-in-IP configuration.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The RTDP register is used for configuring the tunnel decapsulation
properties of NVE and IP-in-IP.
Linux tunnels verify packets before decapsulation based on the packet's
source IP, which must match the tunnel remote IP.
RTDP is used to configure decapsulation so that it filters out packets
that are not IPv6 or that have the wrong source IP or wrong GRE key.
For an IP-in-IP entry, a source IPv4 address is stored as part of this
register, while a source IPv6 address is stored via the RIPS register
and RTDP holds a pointer to it.
Create a common function for configuring both IPv4 and IPv6 and add
dedicated functions for each protocol.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The RIPS register is used to store IPv6 addresses for use by NVE and
IP-in-IP.
For IPv6 underlay support, the RATR register needs to hold a pointer to
the remote IPv6 address for encapsulation, and the RTDP register needs
to hold a pointer to the local IPv6 address for the decapsulation
check.
Add the required register for storing IPv6 addresses.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function __mlxsw_sp_ipip_netdev_ul_dev_get() returns the underlay
device that corresponds to the overlay device that it gets.
Currently, this function assumes that the tunnel is IPv4 GRE, because
it is the only one supported by the mlxsw driver.
This assumption will no longer be correct when IPv6 GRE support is
added, resulting in the wrong underlay device being returned.
Instead, check 'ol_dev->type' and return the underlay device
accordingly.
Move the function to spectrum_ipip.c because spectrum_router.c should
not be aware of the tunnel type.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function mlxsw_sp_ipip_ol_netdev_change_gre4() contains code that
can be shared between IPv4 and IPv6.
The only difference is the way that arguments are taken from the tunnel
parameters, which differ between IPv4 and IPv6.
For that, add a structure 'mlxsw_sp_ipip_parms' to hold all the
required parameters for the function and save it as part of
'struct mlxsw_sp_ipip_entry' instead of the existing structure that is
not shared between IPv4 and IPv6. Add a new operation as part of
'mlxsw_sp_ipip_ops' to initialize the new structure.
Then mlxsw_sp_ipip_ol_netdev_change_gre{4,6}() will prepare the new
structure and both will call the same function.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suppress the following checkpatch.pl check [1] by adding a variable to
store the IP-in-IP options. Noticed while adding equivalent IPv6 code in
subsequent patches.
[1]
CHECK: Alignment should match open parenthesis
+ mlxsw_reg_ritr_loopback_ipip4_pack(ritr_pl, lb_cf.lb_ipipt,
+
+ MLXSW_REG_RITR_LOOPBACK_IPIP_OPTIONS_GRE_KEY_PRESET,
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, there are several functions that handle 'struct ip_tunnel_parm'
and 'struct __ip6_tnl_parm'.
Change the mentioned functions to get IP tunnel parameters by reference
and as 'const'.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mlxsw_sp_fib4_entry_type_unset() is not specific to IPv4 FIB entries.
Move the code to mlxsw_sp_fib_entry_type_unset() and call that function
from mlxsw_sp_fib4_entry_type_unset(), so that it can later be used for
IPv6 as well.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/mptcp/protocol.c
977d293e23 ("mptcp: ensure tx skbs always have the MPTCP ext")
efe686ffce ("mptcp: ensure tx skbs always have the MPTCP ext")
same patch merged in both trees, keep net-next.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The driver doesn't support aRFS for encapsulated packets; return an
error early in such a case.
Fixes: 1eb8c695bd ("net/mlx4_en: Add accelerated RFS support")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Start using the trap adjacency entry that was added in the previous
patch and remove the existing one which is no longer needed.
Note that the name of the old entry was inaccurate as the entry did not
discard packets, but trapped them.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 0c3cbbf96d ("mlxsw: Add specific trap for packets routed via
invalid nexthops"), mlxsw started allocating a new adjacency entry
during driver initialization, to trap packets routed via invalid
nexthops.
This behavior was later altered in commit 983db6198f ("mlxsw:
spectrum_router: Allocate discard adjacency entry when needed") to only
allocate the entry upon the first route that requires it. The motivation
for the change is explained in the commit message.
The problem with the current behavior is that the entry shows up as a
"leak" in a new BPF resource monitoring tool [1]. This is caused by the
asymmetry of the allocation/free scheme. While the entry is allocated
upon the first route that requires it, it is only freed during
de-initialization of the driver.
Instead, track the number of active nexthop groups and allocate the
adjacency entry upon the creation of the first group. Free it when the
number of active groups reaches zero.
The next patch will convert mlxsw to start using the new entry and
remove the old one.
[1] https://github.com/Mellanox/mlxsw/tree/master/Debugging/libbpf-tools/resmon
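The symmetric alloc/free scheme can be sketched as follows; the helper
and field names here are illustrative, not the exact driver symbols:

    static int nexthop_group_inc(struct mlxsw_sp *mlxsw_sp)
    {
            int err;

            if (mlxsw_sp->router->num_groups++ == 0) {
                    /* first group: allocate the trap adjacency entry */
                    err = trap_adj_entry_alloc(mlxsw_sp); /* hypothetical */
                    if (err) {
                            mlxsw_sp->router->num_groups--;
                            return err;
                    }
            }
            return 0;
    }

    static void nexthop_group_dec(struct mlxsw_sp *mlxsw_sp)
    {
            /* last group gone: free the entry, so nothing looks leaked */
            if (--mlxsw_sp->router->num_groups == 0)
                    trap_adj_entry_free(mlxsw_sp); /* hypothetical */
    }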
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
devlink_register() can't fail and always returns success, but all
drivers are obligated to check the returned status anyway. This adds a
lot of boilerplate code to handle an impossible flow.
Make devlink_register() void and simplify the drivers that use that
API call.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Vladimir Oltean <olteanv@gmail.com> # dsa
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Any link state change that is done prior to net device registration
isn't reflected on the state, thus the operational state is left stale,
with 'UNKNOWN' status.
To resolve the issue, query link state from FW upon open operations
to ensure operational state is updated.
Fixes: c27a02cd94 ("mlx4_en: Add driver for Mellanox ConnectX 10GbE NIC")
Signed-off-by: Lama Kayal <lkayal@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The network interface managed by the mlxbf_gige driver can
get into a problem state where traffic does not flow.
In this state, the interface will be up and enabled, but
will stop processing received packets. This problem state
happens if three specific conditions occur:
1) the driver has received more than (N * RxRingSize) packets but
less than ((N+1) * RxRingSize) packets, where N is an odd number
Note: the command "ethtool -g <interface>" will display the
current receive ring size, which currently defaults to 128
2) the driver's interface was disabled via "ifconfig oob_net0 down"
during the window described in #1
3) the driver's interface is re-enabled via "ifconfig oob_net0 up"
This patch ensures that the driver's "valid_polarity" field is
cleared during the open() method so that it always matches the
receive polarity used by hardware. Without this fix, the driver
needs to be unloaded and reloaded to correct this problem state.
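Conceptually, the hardware flips the RX descriptor "valid" sense on
every ring wrap and the driver mirrors it in valid_polarity; the fix
resyncs that mirror in open(). A rough sketch (bit and field names are
illustrative, not the exact driver code):

    /* mlxbf_gige open(): resync SW polarity with HW before enabling RX */
    priv->valid_polarity = 0;

    /* RX poll: a descriptor is ready only when its valid bit matches the
     * expected polarity; a stale mismatch is what froze packet processing */
    pkt_ready = ((le64_to_cpu(rx_desc) >> RX_VALID_SHIFT) & 1) ==
                priv->valid_polarity;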
Fixes: f92e1869d7 ("Add Mellanox BlueField Gigabit Ethernet driver")
Reviewed-by: Asmaa Mnebhi <asmaa@nvidia.com>
Signed-off-by: David Thompson <davthompson@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the assert from the callback priv lookup function since it does
not require RTNL lock and is already protected by flow_indr_block_lock.
This will avoid warnings from being emitted to dmesg if the driver
registers its callback after an ingress qdisc was created for a
netdevice.
The warnings started after the following patch was merged:
commit 74fc4f8287 ("net: Fix offloading indirect devices dependency on qdisc order creation")
Signed-off-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement support for ethtool_ops::reset in order to reset transceiver
modules. The module backing the netdev is reset when the 'ETH_RESET_PHY'
flag is set. After a successful reset, the flag is cleared by the driver
and other flags are ignored. This is in accordance with the interface
documentation:
"The reset() operation must clear the flags for the components which
were actually reset. On successful return, the flags indicate the
components which were not reset, either because they do not exist in the
hardware or because they cannot be reset independently. The driver must
never reset any components that were not requested."
Reset is useful in order to allow a module to transition out of a fault
state. From section 6.3.2.12 in CMIS 5.0: "Except for a power cycle, the
only exit path from the ModuleFault state is to perform a module reset
by taking an action that causes the ResetS transition signal to become
TRUE (see Table 6-11)".
An error is returned when the netdev is administratively up:
# ip link set dev swp11 up
# ethtool --reset swp11 phy
ETHTOOL_RESET 0x40
Cannot issue ETHTOOL_RESET: Invalid argument
# ip link set dev swp11 down
# ethtool --reset swp11 phy
ETHTOOL_RESET 0x40
Components reset: 0x40
An error is returned when the module is shared by multiple ports (split
ports) and the "phy-shared" flag is not set:
# devlink port split swp11 count 4
# ethtool --reset swp11s0 phy
ETHTOOL_RESET 0x40
Cannot issue ETHTOOL_RESET: Invalid argument
# ethtool --reset swp11s0 phy-shared
ETHTOOL_RESET 0x400000
Components reset: 0x400000
# devlink port unsplit swp11s0
# ethtool --reset swp11 phy
ETHTOOL_RESET 0x40
Components reset: 0x40
An error is also returned when one of the ports using the module is
administratively up:
# devlink port split swp11 count 4
# ip link set dev swp11s1 up
# ethtool --reset swp11s0 phy-shared
ETHTOOL_RESET 0x400000
Cannot issue ETHTOOL_RESET: Invalid argument
# ip link set dev swp11s1 down
# ethtool --reset swp11s0 phy-shared
ETHTOOL_RESET 0x400000
Components reset: 0x400000
Reset is performed by writing to the "rst" bit of the PMAOS register,
which instructs the firmware to assert the reset signal connected to the
module for a fixed amount of time.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The PMAOS register has enable bits (e.g., PMAOS.ee) that allow changing
only a subset of the fields, which is exactly what subsequent patches
will need to do. Instead of passing multiple arguments to its pack
function, only pass the module index and let the rest be set by the
different callers.
No functional changes intended.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Ports Module Administrative and Operational Status (PMAOS) register
configures and retrieves the per-module status. Extend it with fields
required to support various module settings such as reset and power
mode.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the common port module core, track the number of logical ports that
are mapped to the port module and the number of logical ports using it
that are administratively up.
This will be used by later patches to potentially veto and control
certain operations on the module, such as reset and setting its power
mode.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The return value is never checked. This allows us to simplify a later
patch.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The return value is not checked by the networking stack. This allows us
to simplify a later patch.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the previous patch, the lock is always taken in process context so
it can be converted to a mutex. It is needed for future changes where we
will need to be able to sleep when holding the lock.
Convert the lock to a mutex.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Module temperature events are currently handled in softIRQ context,
requiring the 'module_info_lock' to be a spin lock. In future patchsets
we will need to be able to hold the lock while sleeping.
Therefore, defer handling of these events using a work queue so that the
next patch will be able to convert the lock to a mutex.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the previous patch, the switch driver is always initialized last,
making this function redundant.
Remove it.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 961cf99a07 ("mlxsw: core: Re-order initialization sequence")
changed the initialization sequence so that the switch driver (e.g.,
mlxsw_spectrum) is initialized before registration with the hwmon and
thermal subsystems.
This was done in order to avoid situations where hwmon/thermal code uses
features not supported by current firmware version, which is only
validated as part of switch driver initialization.
Later, commit b79cb787ac ("mlxsw: Move fw flashing code into core.c")
moved firmware validation and flashing code from the switch driver to
mlxsw_core so that it is performed before driver initialization.
Therefore, change the initialization sequence back to its original form.
In addition to being more straightforward, it will allow us to simplify
parts of the code in subsequent patches and future patchsets.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The devlink parameters were published in two steps despite being static
and known in advance.
The first step was to use devlink_params_publish(), which iterated over
all parameters known up to that point and sent notification messages.
In the second step, the call was devlink_param_publish(), which looped
over the same parameters list and sent notifications for the new
parameters.
In order to simplify the API, move devlink_params_publish() to be
called when all parameters have already been added, saving the need to
iterate over the parameters list again.
As a side effect, this change fixes the error unwind flow in which
parameters were not marked as unpublished.
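The resulting ordering can be sketched as below, using the devlink API
names as they existed at the time (error handling abbreviated):

    err = devlink_params_register(devlink, params, ARRAY_SIZE(params));
    if (err)
            return err;

    /* ... register any remaining driver-specific parameters ... */

    devlink_params_publish(devlink); /* one publish, after all params exist */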
Fixes: 82e6c96f04 ("net/mlx5: Register to devlink ingress VLAN filter trap")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not used anymore, remove it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The newly introduced PMTDB register provides all the needed info about
a particular requested port split configuration. Use it instead of
figuring the info out manually in the driver.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The PMTDB register allows querying the possible module<->local port
mappings that can be used in PMLP. It does not represent the
actual/current mapping of the local port to the module. The actual
mapping is only defined by PMLP.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of relying on the values coming from the PMLP register, use PLLP
to get the information about port front panel number and split number.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The PLLP register returns the mapping from Local Port to Label Port.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
During port creation, mlxsw_core_port_init() is called with the front
panel port number and the split port sub-number. Currently, this
information is determined by the driver without firmware assistance.
Subsequent patches are going to query this information from firmware,
but this requires the port to be assigned to a SWID.
Therefore, move port SWID assignment before mlxsw_core_port_init().
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
During port creation, mlxsw_core_port_init() is called with the front
panel port number and the split port sub-number. Currently, this
information is determined by the driver without firmware assistance.
Subsequent patches are going to query this information from firmware,
but this requires the port to be mapped to a module.
Therefore, move port mapping before mlxsw_core_port_init().
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add latest verified version of Nvidia Spectrum-family switch firmware,
for Spectrum (13.2008.3326), Spectrum-2 (29.2008.3326) and Spectrum-3
(30.2008.3326).
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When activating the PTP-RQ, redirect the RQT from the drop-RQ to the
PTP-RQ. Use mlx5e_channels_get_ptp_rqn to retrieve the rqn. This helper
returns a boolean (not a status), hence the caller should treat a
return value of 0 as a failure. Change the caller's interpretation of
the return value.
Fixes: 43ec0f41fa ("net/mlx5e: Hide all implementation details of mlx5e_rx_res")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Some profiles of the driver don't support a dedicated PTP-RQ, hence
they can't support HW TS and CQE compression simultaneously. When HW TS
is enabled, CQE compression is disabled, and it should be restored when
HW TS is turned off. Add rx_filter as an input to modifying CQE
compression to enforce this restriction.
Fixes: 256f79d13c ("net/mlx5e: Fix HW TS with CQE compression according to profile")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Fix the below flow of sleeping in atomic context by releasing
the RCU lock before calling free_match_list().
build_match_list() <- disables preempt
-> free_match_list()
-> tree_put_node()
-> down_write_ref_node() <- take write lock
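The shape of the fix, based on the call chain above (control flow
simplified; the list-building details are elided):

    rcu_read_lock();
    /* ... walk the flow-group hash table and build the match list ... */
    rcu_read_unlock();

    /* Safe only here: free_match_list() -> tree_put_node() ->
     * down_write_ref_node() takes a sleeping lock */
    if (err)
            free_match_list(match_head);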
Fixes: 693c6883bb ("net/mlx5: Add hash table for flow groups in flow table")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In NICs that don't support LAG, the LAG control structure won't be
allocated. If it wasn't allocated, it means LAG doesn't exist and can
be skipped.
Fixes: cac1eb2cf2 ("net/mlx5: Lag, properly lock eswitch if needed")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
RDMA auxdev parameter registration was skipped for the eswitch manager
PCI PF. Due to this, when the devlink parameter is read, it reads as
false in the below code flow.
$ devlink dev reload pci/0000:06:00.0
devlink_reload()
mlx5_load_one()
mlx5_attach_device()
is_ib_enabled()
devlink_param_driverinit_value_get()
Due to this, is_ib_enabled() returns false for the RDMA auxiliary
device. This results in skipping RDMA auxiliary device creation on
reload.
There is no need to check for the eswitch manager capability to support
the RDMA auxiliary device. Hence, fix it by skipping the eswitch
manager capability check.
Fixes: 87158cedf0 ("net/mlx5: Support enable_rdma devlink dev param")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In some conditions, the variable 'err' is not assigned a value in the
mlx5_esw_bridge_port_obj_attr_set() and mlx5_esw_bridge_port_changeupper()
functions after recent changes to support LAG. Initialize the variable
with zero in both cases.
Reported-by: Colin King <colin.king@canonical.com>
Reported-by: Tim Gardner <tim.gardner@canonical.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
CC: linux-kernel@vger.kernel.org
Fixes: ff9b752146 ("net/mlx5: Bridge, support LAG")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Use the devm_platform_ioremap_resource_byname() helper instead of
calling platform_get_resource_byname() and devm_ioremap_resource()
separately.
Use the devm_platform_ioremap_resource() helper instead of
calling platform_get_resource() and devm_ioremap_resource()
separately.
Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In HTB offload mode, qdiscs of leaf classes are grafted to netdev
queues. sch_htb expects the dev_queue field of these qdiscs to point to
the corresponding queues. However, qdisc creation may fail, and in that
case noop_qdisc is used instead. Its dev_queue doesn't point to the
right queue, so sch_htb can lose track of used netdev queues, which will
cause internal inconsistencies.
This commit fixes this bug by keeping track of the netdev queue inside
struct htb_class. All reads of cl->leaf.q->dev_queue are replaced by the
new field, the two values are synced on writes, and WARNs are added to
assert equality of the two values.
The driver API has changed: when TC_HTB_LEAF_DEL needs to move a queue,
the driver used to pass the old and new queue IDs to sch_htb. Now that
there is a new field (offload_queue) in struct htb_class that needs to
be updated on this operation, the driver will pass the old class ID to
sch_htb instead (it already knows the new class ID).
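A sketch of the bookkeeping described above; only the offload_queue
field name comes from this change, its exact placement and the
surrounding code are abbreviated:

    struct htb_class {
            /* ... existing fields ... */
            struct netdev_queue *offload_queue; /* queue this leaf is grafted to */
    };

    /* Reads go through the new field instead of cl->leaf.q->dev_queue,
     * which is meaningless when cl->leaf.q is &noop_qdisc; writes update
     * both, and WARNs assert that the two stay equal for real qdiscs. */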
Fixes: d03b195b5a ("sch_htb: Hierarchical QoS hardware offload")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20210826115425.1744053-1-maximmi@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for updating an FTE, which is needed for cases where there
are multiple rules with the same match. In such a case, fs_core will
merge the actions and call update FTE to update the current FTE. Since
we don't want to disrupt the traffic, we will add the new duplicate
rule, and only then remove the old duplicate rule.
Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
To track each STE of the rule, a rule member was allocated; each
member would point to one STE. This means that we would allocate
40B (rule member) * number of STEs per rule.
To reduce this per-rule allocation, we use the STE tree pointers
for next_htbl and the pointing STE to navigate the tree, which allows
us to keep only the pointer to the last STE of the rule (always
unique). From the last rule STE we are able to traverse and rebuild all
of the STEs that construct the rule.
In our testing with 8M rules, each consisting of 7 STEs, we were able
to reduce memory consumption by 1.6GB.
Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The calculations to decide for the maximum allowed collision threshold
are simple and there is no reason to save them on the htbl struct.
Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Instead of using the HW-specific STEv0 type, it is better to use
an enum to indicate whether this is an RX or TX nic domain.
This means that now we will need to convert the nic domain type
to the corresponding STE type.
Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Merge DR_STE_SIZE enums - no need for a separate enum for reduced STE size.
Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The FDB RX pipe is connected to the wire, and the source port of all
incoming packets equals the wire - a single uplink port per PF. This
means there is no point in matching on the source port in such a case.
Once we recognize such a case, we optimize the RX steering rule.
Note that in such a case we clear both source_eswitch_owner_vhca_id
and source_port.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When creating an FTE, we might need to create a multi-destination flow
table, which is eventually created by FW. In such a case, this FW table
should include all the FTE properties as requested by the upper layer,
including the ability to point to another flow table with a level lower
than or equal to that of the current table - indicated by the
"ignore_flow_level" property.
Signed-off-by: Chris Mi <cmi@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Call the DR API only when it is a DR table.
To update a FW-owned table, the driver should call the FW API.
Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add missing support for matching on IPv6 flow label for STEv0.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
There are use cases with Connection Tracking that have such a
connection as the default; printing this warning in dmesg confuses the
user.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In the event of the SW steering QP entering an error state, SW
steering cannot insert more rules, and will silently ignore insertions
after issuing a warning.
Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Under high stress, SW steering might get stuck polling for a
completion that never comes.
For such cases, the QP needs to have a protocol retransmission
mechanism enabled. Currently the retransmission timeout is defined as 0
(unlimited). Fix this by defining a real timeout.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Enable the pop VLAN action in TX and push VLAN in RX.
These actions are supported only on STEv1.
On TX: when a host sends a packet, the VLAN is popped at the beginning.
On RX: just before passing the packet to the host, the VLAN is pushed.
Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Split the modify vlan state in the actions state machine into pop vlan
and push vlan states. This enables using pop/push vlan without
restrictions (e.g. pop vlan on TX in STEv1).
Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
ConnectX supports offloading of various encapsulations and
decapsulations (e.g. VXLAN), which are performed by the 'Packet
Reformat' action. Starting with ConnectX-6 Dx, a new reformat type is
supported - REMOVE_HEADER - which allows deleting an arbitrary-size
chunk at a selected position in the packet.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In line 849 (#1), "mlx5dr_htbl_put(cur_htbl);" drops the reference to
cur_htbl and may cause cur_htbl to be freed.
However, cur_htbl is subsequently used in the next line, which may
result in a use-after-free bug.
Fix this by calling mlx5dr_err() before cur_htbl is put.
Signed-off-by: Wentao_Liang <Wentao_Liang_g@163.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
If link aggregation is used within stacked devices, the driver rejects
encap rules if the PF of the VF tunnel device is down. This happens
because the route is resolved for the other PF, and its eswitch
instance is used to determine the correct vport.
To fix that, use the devcom feature to retrieve the other eswitch
instance if we failed to find the vport in the first eswitch and LAG is
active.
Fixes: 10742efc20 ("net/mlx5e: VF tunnel TX traffic offloading")
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When an indirect forward group is created, the flow is added with a
vhca id but without setting the vhca id valid flag, which violates the
PRM.
Fix by setting the missing flag, vhca id valid.
Fixes: 34ca65352d ("net/mlx5: E-Switch, Indirect table infrastructure")
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
After a neigh-update-add failure we are still left with a slow path
rule, but the driver always assumes the rule is an fdb rule.
Fix neigh-update-del by checking the slow path tc flag on the flow.
Also fix neigh-update-add for when neigh-update-del fails the same way.
Fixes: 5dbe906ff1 ("net/mlx5e: Use a slow path rule instead if vxlan neighbour isn't available")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The call to mlx5_unregister_device() means that the mlx5_core driver
is removed. In such a scenario, we need to disregard all other flags
like attach/detach and forcibly remove all auxiliary devices.
Fixes: a5ae8fc905 ("net/mlx5e: Don't create devices during unload flow")
Tested-and-Reported-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When handling FIB_EVENT_ENTRY_REPLACE event for a new multipath route,
lag activation can be missed if a stale (struct lag_mp)->mfi pointer
exists, which was associated with an older multipath route that had been
removed.
Normally, when a route is removed, it triggers mlx5_lag_fib_event(),
which handles FIB_EVENT_ENTRY_DEL and clears mfi pointer. But, if
mlx5_lag_check_prereq() condition isn't met, for example when eswitch is
in legacy mode, the fib event is skipped and mfi pointer becomes stale.
Fix by resetting mfi pointer to NULL in mlx5_deactivate_lag().
Fixes: 8a66e45859 ("net/mlx5: Change ownership model for lag")
Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In order to support more coalesce parameters through netlink,
add two new parameters, kernel_coal and extack, to .set_coalesce
and .get_coalesce; then some extra info can be returned to the user
with the netlink API.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 01848e05f8 ("mlxsw: spectrum_router: Add support for inner
layer 3 multipath hash policy") and commit daeabf89eb ("mlxsw:
spectrum_router: Add support for custom multipath hash policy") added
support for multipath hash policies where the hash is calculated based
on inner packet fields.
For IPv6-in-IPv6 packets, the default parsing depth (96 bytes) is not
enough when these policies are used.
Therefore, for such cases, call the new API to increase / decrease the
parsing depth as necessary. Care is taken to ensure the API is not
called multiple times.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous patches added new API to handle parsing depth and converted
the existing code to use it.
Remove the old infrastructure which is not needed anymore.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert the VxLAN and PTP modules to increase parsing depth using the
new API that was added in the previous patch.
Separate the MPRS register's configuration into VxLAN-related
configuration and parsing depth configuration. Handle each one using
the appropriate API.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spectrum ASICs have a configurable limit on how deep into the packet
they parse. By default, the limit is 96 bytes.
There are several cases where this parsing depth is not enough and there
is a need to increase it. Currently, increasing parsing depth is
maintained as part of VxLAN module, because the MPRS register which
configures parsing depth also configures UDP destination port number
used for VxLAN encapsulation and decapsulation.
Add an API for increasing parsing depth as part of spectrum.c code, so
that it will be possible to use it from other modules. In addition, add
an API for setting UDP destination port and protect it using a dedicated
lock for saving parsing configurations. The lock is needed as not all
the callers hold RTNL lock.
Maintain a counter for increased parsing depth consumers. For first
consumer subscription, increase the parsing depth and for last consumer
unsubscription, set parsing depth to default value.
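The consumer counting can be sketched like this; the lock/field names
and the depth values are illustrative, not the exact mlxsw symbols:

    int mlxsw_sp_parsing_depth_inc(struct mlxsw_sp *mlxsw_sp)
    {
            int err = 0;

            mutex_lock(&mlxsw_sp->parsing.lock); /* RTNL not assumed */
            if (mlxsw_sp->parsing.depth_ref++ == 0)
                    err = parsing_depth_set(mlxsw_sp, INCREASED_DEPTH);
            if (err)
                    mlxsw_sp->parsing.depth_ref--;
            mutex_unlock(&mlxsw_sp->parsing.lock);
            return err;
    }

    void mlxsw_sp_parsing_depth_dec(struct mlxsw_sp *mlxsw_sp)
    {
            mutex_lock(&mlxsw_sp->parsing.lock);
            if (--mlxsw_sp->parsing.depth_ref == 0)
                    parsing_depth_set(mlxsw_sp, DEFAULT_DEPTH);
            mutex_unlock(&mlxsw_sp->parsing.lock);
    }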
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement eswitch API that allows updating rate groups. If the group
pointer is NULL, then move the vport to the internal unlimited group
zero.
Implement the devlink_ops->rate_parent_node_set() callback in terms of
the new eswitch group update API.
Enable QoS for all of the group's elements if the group has an
allocated BW share.
Co-developed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Huy Nguyen <huyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Provide eswitch API to allow controlling group rate limits. Use it to
implement devlink_ops->mlx5_devlink_rate_node_tx_{share|max}_set().
The share rate will create relative bandwidth share on the groups level
while within the group the user can set shared rate on the member vports
of that group and this rate will be relative to the group's share rate.
The group with the highest shared rate will get a BW share of 100 and
the rest of the groups will get a value that reflects the ratio between
their share rate and the maximum share rate.
Example:
Created four rate groups with tx_share limits:
$ devlink port function rate add \
pci/0000:06:00.0/group_1 tx_share 30gbit
$ devlink port function rate add \
pci/0000:06:00.0/group_2 tx_share 20gbit
$ devlink port function rate add \
pci/0000:06:00.0/group_3 tx_share 20gbit
$ devlink port function rate add \
pci/0000:06:00.0/group_4 tx_share 10gbit
Assuming link speed is 50 Gbit/sec ratio divider will be
50 / (30+20+20+10) = 0.625. Normalized rate values for the groups:
<group_1> 30 * 0.625 = 18.75 Gbit/sec
<group_2> 20 * 0.625 = 12.5 Gbit/sec
<group_3> 20 * 0.625 = 12.5 Gbit/sec
<group_4> 10 * 0.625 = 6.25 Gbit/sec
A rate group with an unlimited tx_share rate will receive the minimum
BW value (1Mbit/sec) if any group with a tx_share rate limit is
present. This allows not dropping all packets in case of heavy traffic.
Co-developed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Huy Nguyen <huyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Extend eswitch API with rate limiting groups:
- Define new struct mlx5_esw_rate_group that is used to hold all
internal group data.
- Implement functions that allow creation, destruction and cleanup of
groups.
- Assign all vports to internal unlimited zero group by default.
This commit lays the groundwork for group rate limiting by implementing
devlink_ops->rate_node_{new|del}() callbacks to support creating and
deleting groups through devlink rate node objects. APIs that allows
setting rates and adding/removing members are implemented in following
patches.
Co-developed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Huy Nguyen <huyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Register devlink rate leaf object for every eswitch vport.
Implement devlink ops that enable setting shared and max tx rates
through devlink API.
Extract common eswitch code from the existing tx rate set function that
is accessed through the NDO, to be reused for devlink. Values
configured with the NDO API are not visible to the devlink API, and
therefore the two shouldn't be used simultaneously.
When normalizing the BW share value, dividing the desired minimum rate
by the common divider results in losing information since the quotient
is rounded down. This has a significant affect on configurations of low
rate where the round down eliminates a large percentage of the total
rate. To improve the formula, round up the division result to make sure
that the BW share is at least the value it was supposed to be and won't
lost a significant amount of the expected value.
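A minimal sketch of the change, using the kernel's DIV_ROUND_UP()
helper (variable names here are illustrative):
  /* before: rounding down can wipe out low rates entirely */
  bw_share = min_rate / divider;              /* e.g. 99 / 100 = 0  */
  /* after: round up so the share is at least the intended value */
  bw_share = DIV_ROUND_UP(min_rate, divider); /* e.g. 99 / 100 -> 1 */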
Co-developed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Huy Nguyen <huyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Move eswitch QoS related code into a dedicated file. Provide an eswitch
API to access this code, so it is isolated and restricted to use by
eswitch.c. The exception is the legacy NDO vf set rate, which moved to
esw/legacy.c.
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Huy Nguyen <huyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Currently the sample offload actions send the encapsulated packet
to software. This commit decapsulates the packet before performing
the sampling and sets the tunnel properties on the skb metadata
fields to make the behavior consistent with OVS sFlow.
Since we decapsulate first, we can't use the same match as before in
the default table, so instantiate a post action instance to continue
processing the action list. If HW can preserve reg_c, also use the
post action instance.
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently the sample offload actions send the encapsulated packet
to software. sFlow expects tunneled packets to be decapsulated while
having the tunnel properties on the skb metadata fields.
Reuse the functions used by connection tracking to map the outer
header properties to a unique id. The next patch will use that id
to restore the tunnel information of decapsulated packets onto the
skb.
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
CONFIG_NET_TC_SKB_EXT controls the SKB extension support for
restoring chain ids. SKB extension is not required for tunnel
restoration.
Remove the CONFIG_NET_TC_SKB_EXT dependency as a pre-step for
using the tunnel restore methods for sample offload use cases.
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Move post action table management to a common library providing an
add/del/get API. Refactor the ct action offload to use the common
API.
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Some tc actions are modeled in hardware using multiple tables
causing a tc action list split. For example, CT action is modeled
by jumping to a ct table which is controlled by nf flowtable.
sFlow jumps in hardware to a sample table, which continues to a
"default table" where it should continue processing the action list.
Multi-table actions are modeled in hardware using a unique fte_id.
The fte_id is set before jumping to a table. Split actions continue
to a post-action table, where the matched fte_id value continues the
execution of the tc action list.
Currently the post-action design is implemented only by the ct
action. Introduce post action infrastructure as a pre-step for
reusing it with the sFlow offload feature. Init and destroy the
common post action table. The ct offload is refactored to use the
common post table infrastructure in the next patch.
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
IDR is deprecated. Use xarray instead.
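For reference, the shape of such a conversion (a generic sketch, not
the actual patch hunks):
  /* before: IDR-based id allocation */
  id = idr_alloc(&idr, entry, 1, 0, GFP_KERNEL);

  /* after: equivalent xarray-based allocation */
  err = xa_alloc(&xa, &id, entry, XA_LIMIT(1, U32_MAX), GFP_KERNEL);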
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently it is in the eswitch attribute. Move it to the flow attribute
to reflect the change in the previous patch.
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The sample module belongs to en/tc rather than esw. Move it and rename
it accordingly.
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
mlx5/esw/sample.c doesn't really need mlx5e_priv object, we can remove
this redundant dependency by passing the eswitch object directly to
the sample object constructor.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
The ARRAY_SIZE macro is the more compact and more formal way in the
Linux source to get an array's size. Thus, we can replace the long
sizeof(arr)/sizeof(arr[0]) with the compact ARRAY_SIZE.
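For example:
  int arr[10];
  /* before */
  n = sizeof(arr) / sizeof(arr[0]);
  /* after */
  n = ARRAY_SIZE(arr);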
Signed-off-by: Jason Wang <wangborong@cdjrlc.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20210817121106.44189-1-wangborong@cdjrlc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Allow adding bond net devices to the mlx5 bridge with the following
changes:
- Modify bridge representor code to obtain the uplink representor that
belongs to the eswitch that is registered for notification. Require the
representor to be in shared FDB mode. If the representor is the lag
master, then consider its port as local, otherwise treat it as peer.
- Use devcom to match on paired eswitch metadata in peer FDB entries.
This is necessary for shared FDB LAG to function, since packets are
always received on the active eswitch instance as opposed to the parent
eswitch of the port.
- Support for deleting peer flows when receiving a
SWITCHDEV_FDB_DEL_TO_BRIDGE notification was implemented in one of the
previous patches in the series. Now also implement support for handling
SWITCHDEV_FDB_ADD_TO_BRIDGE, which can be generated on the peer by the
bridge update workqueue task in a LAG configuration. Refresh the flow
'lastuse' timestamp to current jiffies when receiving such a
notification on the eswitch that manages the local FDB entry. This
allows peer entries to prevent ageing of the FDB.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Allow connectivity between representors of different eswitch instances
that are attached to the same bridge when the merged_eswitch capability
is enabled. Add ports of the peer eswitch to the bridge instance and
mark them with MLX5_ESW_BRIDGE_PORT_FLAG_PEER. Mark FDBs offloaded on
peer ports with the MLX5_ESW_BRIDGE_FLAG_PEER flag. Such FDBs can only
be aged out on their local eswitch instance, which then sends a
SWITCHDEV_FDB_DEL_TO_BRIDGE event. Listen to the event in the mlx5
bridge implementation and delete peer FDBs in the event handler.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The SWITCHDEV_FDB_DEL_TO_BRIDGE notification is generated in multiple
places in the bridge code. A following patch in the series changes the
condition for the notification. Extract the notification into the
dedicated helper function mlx5_esw_bridge_fdb_del_notify() so future
changes only need to modify it in a single place.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Following patches in the series allow traffic between vports of
different eswitch instances, which requires addressing a bridge port by
the vport_num+esw_owner_vhca_id pair, since vport_num is only unique
per eswitch. As a preparation, extend struct mlx5_esw_bridge_port with
an 'esw_owner_vhca_id' field and use it as part of the key for the
mlx5_esw_bridge->vports xarray.
With this change we can't rely on the switchdev_handle_port_obj_add()
helper to get the mlx5 representor from the stacked device, because we
specifically need the representor from the parent eswitch that
registered the callback in order to obtain the correct
esw_owner_vhca_id. The helper doesn't allow passing additional
parameters to the predicate function and doesn't provide access to the
notifier block to obtain the eswitch through br_offloads. Implement
custom helpers to obtain the mlx5 representor and use them in the
mlx5_esw_bridge_port_obj_{add|del|attr_set}() implementations.
Remove the direct pointer to the parent bridge from struct mlx5_vport,
as it is no longer needed.
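A hypothetical sketch of how such a composite xarray key could be
built (helper name and bit layout assumed, not taken from the patch):
  /* pack the two 16-bit values into a single unsigned long index */
  static unsigned long mlx5_esw_bridge_port_key(u16 vport_num,
                                                u16 esw_owner_vhca_id)
  {
          return vport_num | ((unsigned long)esw_owner_vhca_id << 16);
  }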
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Following patches in the series will pass a bond device to the bridge,
which means the code can't assume the device is an mlx5 representor.
Moreover, the core device can be easily obtained from the eswitch
instance, so there is no reason for more complex code that obtains
struct mlx5_priv from the net_device in order to use its mdev. Refactor
the code to use esw->dev instead of priv->mdev.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Refactor mlx5_esw_bridge_vport_link() to release the bridge instance if
mlx5_esw_bridge_vport_init() returned an error, instead of relying on
the latter to release the bridge. This improves the design, because the
object instance is taken and released in the same layer, and simplifies
following patches that add more logic to mlx5_esw_bridge_vport_link().
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add support for MQPRIO channel mode, in which a partition into TCs
is defined over the channels. We allow partitions with contiguous
queue indices and no holes within. We do not allow modification of
the number of channels while this MQPRIO mode is active. An example
configuration is sketched below.
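For example, a channel-mode configuration over 8 queues might look
like this (device name and TC mapping are illustrative):
$ tc qdisc add dev eth0 root mqprio num_tc 2 \
      map 0 0 0 0 1 1 1 1 queues 4@0 4@4 hw 1 mode channel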
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This is in preparation for supporting MQPRIO CHANNEL mode in a
downstream patch, in addition to the DCB mode that's supported today.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Abstract the MQPRIO params into a struct.
Use a getter for DCB mode num_tcs.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Extend the existing flow classification support to steer flows not
only directly to a receive ring, but also into the new RSS contexts.
Create the needed TIR objects on demand, and hold a reference on the
RSS context.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add support for multiple RSS contexts. Resources of the non-default
RSS contexts are allocated and created on demand. Each RSS context
can be controlled and configured separately, via the implemented
ethtool ops. Here we limit the total number of contexts to 16.
We do not enforce any kind of new limitation over the indirection table
content. More specifically, two separate contexts can be configured to
fully or partially point to the same set of receive rings.
The default RSS context (index 0) is created with its full set of TIRs.
All other contexts are created with an empty set, then TIRs are added
upon first usage when steering rules are added.
We use a reference counting mechanism to make sure an RSS context is
not removed while rules still point to it.
Block ethtool set_channels operations when multiple RSS contexts exist,
as currently the kernel doesn't protect against inconsistent channel
configs that break non-default RSS contexts.
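As an illustration (device name and rule are hypothetical), a
non-default context can be created and then referenced by a steering
rule:
$ ethtool -X eth0 context new equal 4
New RSS context is 1
$ ethtool -N eth0 flow-type tcp4 dst-port 80 context 1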
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Move from static to dynamic memory allocations for TIR.
This is in preparation for supporting on-demand TIR operations in
downstream patches, where every RSS context will be initialized with an
empty set of TIRs.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Code related to RSS is now encapsulated into a dedicated object and put
into new files en/rss.{c,h}. All usages are converted.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Bring all fields that define and maintain RSS behavior together
into a new structure.
Align all usages with this new structure. Keep it hidden within
rx_res.c.
This helps in supporting multiple RSS contexts in a downstream patch.
Use dynamic allocations for the RSS context.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Factor the TIR control operations in rx_res into functions.
This is in preparation for supporting on-demand TIR operations in
downstream patches.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
All calls to mlx5e_rx_res_rss_set_indir_uniform() occur while the RSS
state is inactive, i.e. the RQT is pointing to the drop RQ, not to the
channels' RQs.
It means that the "apply" part of the function is not called.
Remove this part from the function, and document the change. This will
be useful for the next patches in the series, as it allows code
simplifications when multiple RSS contexts are introduced.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The 'imply' keyword does not do what most people think it does; it only
politely asks Kconfig to turn on another symbol, but does not prevent
it from being disabled manually or built as a loadable module when its
user is built-in. In the ICE driver, the latter now causes a link failure:
aarch64-linux-ld: drivers/net/ethernet/intel/ice/ice_main.o: in function `ice_eth_ioctl':
ice_main.c:(.text+0x13b0): undefined reference to `ice_ptp_get_ts_config'
ice_main.c:(.text+0x13b0): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `ice_ptp_get_ts_config'
aarch64-linux-ld: ice_main.c:(.text+0x13bc): undefined reference to `ice_ptp_set_ts_config'
ice_main.c:(.text+0x13bc): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `ice_ptp_set_ts_config'
aarch64-linux-ld: drivers/net/ethernet/intel/ice/ice_main.o: in function `ice_prepare_for_reset':
ice_main.c:(.text+0x31fc): undefined reference to `ice_ptp_release'
ice_main.c:(.text+0x31fc): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `ice_ptp_release'
aarch64-linux-ld: drivers/net/ethernet/intel/ice/ice_main.o: in function `ice_rebuild':
This is a recurring problem in many drivers, and we have discussed
it several times before, without reaching a consensus. I'm providing
a link to the previous email thread for reference, which discusses
some related problems.
To solve the dependency issue better than the 'imply' keyword, introduce a
separate Kconfig symbol "CONFIG_PTP_1588_CLOCK_OPTIONAL" that any driver
can depend on if it is able to use PTP support when available, but works
fine without it. Whenever CONFIG_PTP_1588_CLOCK=m, those drivers are
then prevented from being built-in, the same way as with a 'depends on
PTP_1588_CLOCK || !PTP_1588_CLOCK' dependency that does the same trick,
but that can be rather confusing when you first see it.
Since this should cover the dependencies correctly, the IS_REACHABLE()
hack in the header is no longer needed now, and can be turned back
into a normal IS_ENABLED() check. Any driver that gets the dependency
wrong will now cause a link time failure rather than being unable to use
PTP support when that is in a loadable module.
However, the two recently added ptp_get_vclocks_index() and
ptp_convert_timestamp() interfaces are only called from builtin code with
ethtool and socket timestamps, so keep the current behavior by stubbing
those out completely when PTP is in a loadable module. This should be
addressed properly in a follow-up.
As Richard suggested, we may want to actually turn PTP support into a
'bool' option later on, preventing it from being a loadable module
altogether, which would be one way to solve the problem with the ethtool
interface.
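For reference, one plausible way to express such a helper symbol in
Kconfig (a sketch only; the exact upstream definition may differ):
  config PTP_1588_CLOCK_OPTIONAL
          tristate
          default y
          depends on PTP_1588_CLOCK || !PTP_1588_CLOCK
A driver that can use PTP but works without it then simply does
'depends on PTP_1588_CLOCK_OPTIONAL'.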
Fixes: 06c16d89d2 ("ice: register 1588 PTP clock device object for E810 devices")
Link: https://lore.kernel.org/netdev/20210804121318.337276-1-arnd@kernel.org/
Link: https://lore.kernel.org/netdev/CAK8P3a06enZOf=XyZ+zcAwBczv41UuCTz+=0FMf2gBz1_cOnZQ@mail.gmail.com/
Link: https://lore.kernel.org/netdev/CAK8P3a3=eOxE-K25754+fB_-i_0BZzf9a9RfPTX3ppSwu9WZXw@mail.gmail.com/
Link: https://lore.kernel.org/netdev/20210726084540.3282344-1-arnd@kernel.org/
Acked-by: Shannon Nelson <snelson@pensando.io>
Acked-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20210812183509.1362782-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace printk(KERN_WARNING ...) with netdev_warn().
Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Fix the following smatch warning:
wait_func_handle_exec_timeout() warn: should '1 << ent->idx' be a 64 bit type?
Use 1ULL so the shift is performed on a 64-bit type.
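For illustration:
  /* before: a 32-bit int shift that overflows for idx >= 32 */
  vector = 1 << ent->idx;
  /* after */
  vector = 1ULL << ent->idx;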
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Subsequent patches make use of numa node affinity for memory
allocations. Initialize it for PCI PF, VF and SF devices.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently mlx5_core_dev contains an array of capabilities. It contains
19 valid capabilities of the device, 2 reserved entries and 12 holes.
Due to this, for the 14 unused entries, mlx5_core_dev allocates
14 * 8K = 112K bytes of memory which is never used, and the
mlx5_core_dev structure size is 270K bytes odd. This allocation is
further aligned to the next power of 2, to 512K bytes.
By skipping the non-existent entries,
(a) 112K bytes are saved,
(b) mlx5_core_dev is reduced to 8KB with alignment,
(c) 350KB are saved in alignment.
In the future, individual capability allocation can be used to skip the
allocation when such a capability is disabled at the device level. This
patch prepares mlx5_core_dev to hold capabilities using pointers
instead of an inline array, as sketched below.
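A rough sketch of the layout change (struct and macro names
approximate, not verbatim from the patch):
  /* before: two big inline arrays, indexed by capability type
   *   u32 hca_cur[MLX5_CAP_NUM][MLX5_UN_SZ_DW(hca_cap_union)];
   *   u32 hca_max[MLX5_CAP_NUM][MLX5_UN_SZ_DW(hca_cap_union)];
   *
   * after: an array of pointers, each allocated only when the
   * capability actually exists on the device
   */
  struct mlx5_hca_cap *hca[MLX5_CAP_NUM];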
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In the current code, the current and maximal capabilities are
maintained in separate arrays which are both per type. In order to
allow the creation of such a basic structure as a dynamically
allocated array, move the curr and max fields to a unified structure
so that specific capabilities can be allocated as one unit.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently, all access to mlx5 IRQs is done under a lock. Hence, there
isn't a reason to have a kref in struct mlx5_irq.
Switch it to an integer.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When the allocated MSI-X vectors are not enough for SFs to have
dedicated MSI-X, the kernel log buffer gets too many entries.
Hence only enable such logging at debug level.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The new mlx5_core device structure is allocated through devlink_alloc()
with kzalloc, which ensures that all fields are zero, including
->state.
That means that the checks of that field in mlx5_init_one() are
completely redundant, because that function is called only once at the
beginning of the mlx5_core_dev lifetime.
PCI:
.probe()
-> probe_one()
-> mlx5_init_one()
The recovery flow can't run at that time or before it, because the
relevant work is initialized later, in mlx5_init_once().
Such an initialization flow ensures that dev->state can't be
MLX5_DEVICE_STATE_UNINITIALIZED at all, so remove these impossible
checks.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Fix a typo in the cited commit, which called mlx5_create_ttc_table
instead of mlx5_create_inner_ttc_table.
Fixes: f4b45940e9 ("net/mlx5: Embed mlx5_ttc_table")
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Enable the user to disable the VDPA net auxiliary device when it is not
required.
For example,
$ devlink dev param set pci/0000:06:00.0 \
name enable_vnet value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0
At this point the devlink instance does not create the auxiliary device
mlx5_core.vnet.2 for the VDPA net functionality.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enable the user to disable the RDMA auxiliary device when it is not
required.
For example,
$ devlink dev param set pci/0000:06:00.0 \
name enable_rdma value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0
At this point the devlink instance does not create the auxiliary device
mlx5_core.rdma.2 for the RDMA functionality.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enable the user to disable the Ethernet auxiliary device when it is not
required.
For example,
$ devlink dev param set pci/0000:06:00.0 \
name enable_eth value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0
At this point the devlink instance does not create the mlx5_core.eth.2
auxiliary device for the Ethernet functionality.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The cleanup routine missed unpublishing the parameters. Add the missing call.
Fixes: e890acd5ff ("net/mlx5: Add devlink flow_steering_mode parameter")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The blamed commit added a new field to struct switchdev_notifier_fdb_info,
but did not make sure that all call paths set it to something valid.
For example, a switchdev driver may emit a SWITCHDEV_FDB_ADD_TO_BRIDGE
notifier, and since the 'is_local' flag is not set, it contains junk
from the stack, so the bridge might interpret those notifications as
being for local FDB entries when that was not intended.
To avoid that now and in the future, zero-initialize all
switchdev_notifier_fdb_info structures created by drivers, so that
newly added fields do not require touching drivers again.
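For example:
  /* before: fields not explicitly set (e.g. 'is_local') hold stack junk
   *   struct switchdev_notifier_fdb_info info;
   * after: zero-initialize, so future fields are covered too
   */
  struct switchdev_notifier_fdb_info info = {};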
Fixes: 2c4eca3ef7 ("net: bridge: switchdev: include local flag in FDB notifications")
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20210810115024.1629983-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Saeed Mahameed says:
====================
pull-request: mlx5-next 2021-08-09
This pulls the mlx5-next branch, which includes patches already
reviewed on the net-next and rdma mailing lists.
1) mlx5 single E-Switch FDB for lag
2) IB/mlx5: Rename is_apu_thread_cq function to is_apu_cq
3) Add DCS caps & fields support
[1] https://patchwork.kernel.org/project/netdevbpf/cover/20210803231959.26513-1-saeed@kernel.org/
[2] 0e3364dab7.1626609184.git.leonro@nvidia.com/
[3] 55e1d69bef.1624258894.git.leonro@nvidia.com/
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Lag, Create shared FDB when in switchdev mode
net/mlx5: E-Switch, add logic to enable shared FDB
net/mlx5: Lag, move lag destruction to a workqueue
net/mlx5: Lag, properly lock eswitch if needed
net/mlx5: Add send to vport rules on paired device
net/mlx5: E-Switch, Add event callback for representors
net/mlx5e: Use shared mappings for restoring from metadata
net/mlx5e: Add an option to create a shared mapping
net/mlx5: E-Switch, set flow source for send to uplink rule
RDMA/mlx5: Add shared FDB support
{net, RDMA}/mlx5: Extend send to vport rules
RDMA/mlx5: Fill port info based on the relevant eswitch
net/mlx5: Lag, add initial logic for shared FDB
net/mlx5: Return mdev from eswitch
IB/mlx5: Rename is_apu_thread_cq function to is_apu_cq
net/mlx5: Add DCS caps & fields support
====================
Link: https://lore.kernel.org/r/20210809202522.316930-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The CQ destroy is performed based on the IRQ number that is stored in
cq->irqn. That number wasn't set explicitly during CQ creation, and as
expected, some of the API users of mlx5_core_create_cq() forgot to
update it.
This caused synchronization calls against the wrong IRQ, number 0,
instead of the real one.
As a fix, set the IRQ number directly in mlx5_core_create_cq() and
update all users accordingly.
Fixes: 1a86b377aa ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices")
Fixes: ef1659ade3 ("IB/mlx5: Add DEVX support for CQ events")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Free the offload sample action on error.
Fixes: f94d6389f6 ("net/mlx5e: TC, Add support to offload sample action")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently irq->pool is set after the irq is inserted into the xarray.
Set irq->pool before the irq is inserted into the xarray.
Fixes: 71e084e264 ("net/mlx5: Allocating a pool of MSI-X vectors for SFs")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Change the order of functions in mlx5_irq_detach_nb() so it is a
mirror of mlx5_irq_attach_nb().
Fixes: 71e084e264 ("net/mlx5: Allocating a pool of MSI-X vectors for SFs")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Since switchdev mode can't support devlink traps, verify there are
no active devlink traps before moving eswitch to switchdev mode. If
there are active traps, prevent the switchdev mode configuration.
Fixes: eb3862a052 ("net/mlx5e: Enable traps according to link state")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
mlx5e_close_xdpsq does the cleanup: it calls mlx5e_free_xdpsq_descs to
free the outstanding descriptors, which relies on
mlx5e_page_release_dynamic and page_pool_release_page. However,
page_pool_destroy is already called by this point, because
mlx5e_close_rq runs before mlx5e_close_xdpsq.
This commit fixes the use-after-free by swapping mlx5e_close_xdpsq and
mlx5e_close_rq.
The commit cited below started calling page_pool_destroy directly from
the driver. Previously, the page pool was destroyed under a call_rcu
from xdp_rxq_info_unreg_mem_model, which would defer the deallocation
until after the XDPSQ is cleaned up.
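The fixed teardown order, roughly (assuming the usual mlx5e_channel
fields):
  /* drain XDP SQ descriptors while the RQ's page pool still exists */
  mlx5e_close_xdpsq(&c->rq_xdpsq);
  /* only then destroy the RQ, which destroys its page pool */
  mlx5e_close_rq(&c->rq);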
Fixes: 1da4bbeffe ("net: core: page_pool: add user refcnt and reintroduce page_pool_destroy")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Ageing time is not converted from clock_t to jiffies, which results in
an incorrect ageing timeout calculation in the workqueue update task.
Fix it by applying clock_t_to_jiffies() to the provided value.
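That is, roughly (variable names illustrative):
  /* the bridge passes ageing time as clock_t; convert before use */
  ageing_jiffies = clock_t_to_jiffies(ageing_time);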
Fixes: c636a0f0f3 ("net/mlx5: Bridge, dynamic entry ageing")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
It could be that local and remote are on the same machine, in which
case the route result will be a local route, which results in creating
an encap id with src/dst mac addresses of 0.
Fixes: a54e20b4fc ("net/mlx5e: Add basic TC tunnel set action for SRIOV offloads")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
While processing an encapsulated packet on RX, one of the fields that
is checked is the inner packet length. If the length as specified in
the header doesn't match the actual inner packet length, the packet is
invalid and should be dropped. However, such a packet caused the NIC
to hang.
This patch turns on a 'fail_on_error' HW bit which allows HW to drop
such an invalid packet while processing the RX packet and trying to
decap it.
Fixes: ad17dc8cf9 ("net/mlx5: DR, Move STEv0 action apply logic")
Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
All kernel devlink implementations call devlink_alloc() during the
initialization routine for the specific device, which is used later as
a parent device for devlink_register().
Such late device assignment forces us to call device_register() before
setting other parameters, but that call opens devlink to the world and
makes it accessible to netlink users.
Any attempt to move devlink_register() to be the last call generates
the following error due to access to the devlink->dev pointer.
[ 8.758862] devlink_nl_param_fill+0x2e8/0xe50
[ 8.760305] devlink_param_notify+0x6d/0x180
[ 8.760435] __devlink_params_register+0x2f1/0x670
[ 8.760558] devlink_params_register+0x1e/0x20
The simple change of API to set the devlink device in devlink_alloc()
instead of devlink_register() fixes all of the above and ensures that
everything is already set prior to the call to devlink_register().
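The API change, sketched from a driver caller's point of view (ops and
priv names illustrative):
  /* before: parent device only supplied at registration time */
  devlink = devlink_alloc(&ops, sizeof(*priv));
  /* ... setup ... */
  devlink_register(devlink, dev);

  /* after: parent device bound at allocation */
  devlink = devlink_alloc(&ops, sizeof(*priv), dev);
  /* ... setup ... */
  devlink_register(devlink);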
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A devlink port already has a pointer to the devlink instance, and all
API calls that forward devlink ports to the drivers perform the same
"devlink_port->devlink" assignment before the actual call.
This patch removes the useless parameter and allows us in the future
to create a specific devlink_port_ops to manage user space access with
reliable ops assignment.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If both eswitches are in switchdev mode and the uplink representors
are enslaved to the same bond device, create a shared FDB
configuration.
When moving to shared FDB mode, not only does the hardware need to be
configured, but the RDMA driver also needs to reconfigure itself. When
such a change is done, unload the RDMA devices, configure the hardware
and load the RDMA representors.
When destroying the lag (which can happen if a PCI function is unbound,
the driver is unloaded, or simply by removing a netdev from the bond),
make sure to restore the system to the previous state, but only if
possible. For example, if a PCI function is unbound there is no need to
load the representors, as the device is going away.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Shared FDB allows directing traffic from all the vports in the HCA to
a single eswitch. In order to do that, three things are needed:
1) Point the ingress ACL of the slave uplink to that of the master.
With this, wire traffic from both uplinks will reach the same eswitch
with the same metadata, where a single steering rule can catch traffic
from both ports.
2) Set the FDB root flow table of the slave's eswitch to that of the
master. As this flow table can change dynamically, make sure to
sync it on any set root flow table FDB command.
This will make sure traffic from SFs, VFs, ECPFs and PFs reaches the
master eswitch.
3) Split wire traffic at the eswitch manager egress ACL so that it's
directed to the native eswitch manager. We only treat wire traffic
from both ports the same at the eswitch level. If such traffic wasn't
handled in the eswitch, it needs to reach the right representor to be
processed by software. For example, LACP packets should *always*
reach the right uplink representor for correct operation.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
If a netdev is removed from the lag, the lag should be destroyed.
With downstream patches this might trigger a reconfiguration of
representors on a different eswitch, and as such we don't have the
proper locking to do so from this path. Move the destruction to be
done by the workqueue.
As the destruction won't affect the netdev side, it is okay to do so.
The RDMA side will be reconfigured, and it is already coded to handle
such reconfiguration.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently when doing hardware lag we check the eswitch mode,
but as this isn't done under a lock the check isn't valid.
As the code needs to sync between two different devices, extra care is
needed.
- When going to change eswitch mode, if hardware lag is active, destroy
it.
- While changing eswitch modes, block any hardware bond creation.
- Delay handling bonding events until there are no mode changes in
progress.
- When attaching a new mdev to lag, block until there is no mode change
in progress. In order for the mode change to finish, the interface lock
will have to be taken. Release the lock and sleep for 100ms to
allow forward progress. As this is a very rare condition (it can happen
if the user unbinds and binds a PCI function while also changing the
eswitch mode of the other PCI function), it has no real world impact.
As taking multiple eswitch mode locks is now required, lockdep will
complain about a possible deadlock. Register a key per eswitch to make
lockdep happy.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When two mlx5 devices are paired in switchdev mode, always offload the
send-to-vport rule to the peer E-Switch. This abstracts away the logic
of when this is really necessary (single FDB) and combines both cases
into one.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This callback will allow notifying representors about relevant events
when in OFFLOADS mode. In downstream patches, this will be used to
notify about PAIR/UNPAIR devcom events.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
FTEs are added with mapped metadata which is saved per eswitch.
When uplink reps are bonded and we are in single FDB mode,
we could fail to find metadata which was stored in one eswitch's
mapping but not the other's, or stored with a different id.
To resolve this issue, use a shared mapping between eswitch ports.
We do not have any conflict using a single mapping, for a given type,
between the ports.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Set the flow source param to local vport for the uplink rep
send-to-vport rule.
This complies with the recent changes in SW steering that use the
flow source as an indication of the rule type - rx or tx.
Since the uplink send-to-vport rule forwards traffic to the wire, it
has to indicate that it is a tx rule and can't use the 'any port'
value in the flow source.
Signed-off-by: Ariel Levkovich <lariel@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In shared FDB there is only one eswitch which is active, and it
receives traffic from all representors and all vports in the HCA.
While the Ethernet representor will always reside on its native PF,
the IB representor will not. Extend send-to-vport rule creation to
support such flows. We need to account for the source vport that sends
the traffic (on which the representor resides) and the target eswitch
that the traffic should reach.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
As shared FDB requires changes in two subsystems, first expose the
needed core functions so the RDMA side can be changed.
mlx5_lag_is_master(): Returns true if a given mlx5 device is the lag
master.
mlx5_lag_is_shared_fdb(): Returns true if the lag mode is shared FDB.
mlx5_lag_get_peer_mdev(): Returns the peer mdev in the lag.
The mentioned functions will be used by downstream patches in order
to add support for shared FDB for the RDMA side.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Export a function so users can retrieve the mlx5 core device that
manages the eswitch from the eswitch device.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Merge tag 'mlx5-updates-2021-08-02' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
This patch-set changes the TTC (Traffic Type Classification) logic
to be independent from the mlx5 ethernet driver, by renaming the
traffic type enums and making the TTC API generic to the mlx5 core
driver. It allows decoupling the TTC logic from mlx5e so it can be
reused by other parts of the mlx5 driver, namely ADQ and lag TX
steering hashing.
Patches overview:
1 - Rename traffic type enums to be mlx5 generic.
2 - Rename related TTC arguments and functions.
3 - Remove dependency in the mlx5e driver from the TTC implementation.
4 - Move TTC logic to fs_ttc.
5 - Embed struct mlx5_ttc_table in fs_ttc.
The refactoring series is followed by misc cleanup patches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
A value for 'err' is missing in this code path; set 'err' to 0 for
this case.
Eliminate the following smatch warning:
drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:3083
mlx5_devlink_eswitch_inline_mode_set() warn: missing error code 'err'.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Fixes: 8e0aa4bc95 ("net/mlx5: E-switch, Protect eswitch mode changes")
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The 'counter' variable is initialized before being used.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Passing parse_attr is redundant in parse_tc_nic_actions() and
mlx5e_tc_add_nic_flow() as we can get it from flow.
This is the same as with parse_tc_fdb_actions() and mlx5e_tc_add_fdb_flow().
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The cap is very old and today always exists.
The cap is not checked anywhere else. Remove the check from the
drop action when parsing tc rules in nic mode.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
filter_dev is saved in parse_attr and is used from there in the other
cases. Use it for the leftover case as well.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Since the code changed to use the flow action infra, there is no usage
of tcf values from those includes. Remove them.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The mlx5_ttc_table struct shouldn't be exposed to users, so this patch
makes it internal to ttc.
In addition, add a getter function to get the TTC flow table for users
that need to add a rule which points at it.
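A plausible shape for such a getter (internal field name assumed):
  struct mlx5_flow_table *mlx5_get_ttc_flow_table(struct mlx5_ttc_table *ttc)
  {
          return ttc->t;
  }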
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Now that the TTC logic is not dependent on mlx5e structs, move it to
lib/fs_ttc.c so it can be used by other parts of the mlx5 driver.
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Remove the mlx5e driver's dependency on the TTC implementation
by changing the TTC related functions to receive mlx5 generic
arguments. This allows decoupling the TTC logic from mlx5e so it can
be reused by other parts of the mlx5 driver.
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Since the TTC logic is going to be moved to a separate file, make the
relevant functions and arguments used by TTC mlx5 generic.
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Rename traffic type enums as part of the preparation for moving
the traffic type logic to a separate file.
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The channels array in struct mlx5e_rx_res is converted to a dynamic
one, which uses the dynamic value of max_nch instead of the
implementation-defined maximum MLX5E_MAX_NUM_CHANNELS.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This commit moves all implementation details of struct mlx5e_rx_res
under en/rx_res.c. All access to RX resources is now done using methods.
Encapsulating RX resources into an object allows for better
manageability, because all the implementation details are now in a
single place, and external code can use only a limited set of API
methods to init/teardown the whole thing, reconfigure RSS and LRO
parameters, connect TIRs to flow steering and activate/deactivate TIRs.
mlx5e_rx_res is self-contained and doesn't depend on struct mlx5e_priv
or include en.h.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently, struct mlx5e_channels is defined in en.h, along with a lot of
other stuff. In the following commit mlx5e_rx_res will need to get RQNs
(RQ hardware IDs), given a pointer to mlx5e_channels and the channel
index. In order to make it possible without including the whole en.h,
this commit introduces functions that will hide the implementation
details of mlx5e_channels.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Replace mlx5e_build_default_indir_rqt with a new initializer of struct
mlx5e_rss_params_indir that works directly with the struct, rather than
its internals.
The new initializer is called mlx5e_rss_params_indir_init_uniform, which
also reflects the purpose (uniform spreading) better.
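A minimal sketch of such a uniform initializer (constant and field
names assumed, not verbatim from the patch):
  void mlx5e_rss_params_indir_init_uniform(struct mlx5e_rss_params_indir *indir,
                                           unsigned int num_channels)
  {
          unsigned int i;

          /* spread table entries evenly across the active channels */
          for (i = 0; i < MLX5E_INDIR_RQT_SIZE; i++)
                  indir->table[i] = i % num_channels;
  }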
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Don't populate the array 'states' on the stack; instead make it
static const. This makes the object code 79 bytes smaller.
Before:
text data bss dec hex filename
21309 8304 192 29805 746d drivers/net/ethernet/mellanox/mlx4/qp.o
After:
text data bss dec hex filename
21166 8368 192 29726 741e drivers/net/ethernet/mellanox/mlx4/qp.o
(gcc version 10.2.0)
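The pattern, illustrated with placeholder values:
  /* before: the array is rebuilt on the stack on every call */
  u16 states[] = { 0, 1, 2, 3 };
  /* after: a single read-only copy in .rodata */
  static const u16 states[] = { 0, 1, 2, 3 };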
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20210801153742.147304-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Devlink is an integral part of the mlx5 driver, and all flows ensure
that devlink_*_register() will succeed. That makes the ->registered
check obsolete.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The result of __dev_get_by_index() is not checked for NULL and then gets
dereferenced immediately.
Also, __dev_get_by_index() must be called while holding either RTNL lock
or @dev_base_lock, which isn't satisfied by mlx5e_hairpin_get_mdev() or
its callers. This makes the underlying hlist_for_each_entry() loop not
safe, and can have adverse effects in itself.
Fix by using dev_get_by_index() and handling nullptr return value when
ifindex device is not found. Update mlx5e_hairpin_get_mdev() callers to
check for possible PTR_ERR() result.
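A sketch of the fixed lookup (error handling shape assumed):
  struct net_device *netdev = dev_get_by_index(net, ifindex);

  if (!netdev)
          return ERR_PTR(-ENODEV);
  /* ... use netdev ... */
  dev_put(netdev); /* dev_get_by_index() takes a reference */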
Fixes: 77ab67b7f0 ("net/mlx5e: Basic setup of hairpin object")
Addresses-Coverity: ("Dereference null return value")
Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When the fw_fatal reporter reports an error, the firmware is not
responding. Unload the device to ensure that the driver closes all its
resources, even if recovery is not due (the user disabled auto-recovery
or the reporter is in its grace period). On successful recovery the
device is loaded back up.
Fixes: b3bd076f75 ("net/mlx5: Report devlink health on FW fatal issues")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Set the correct pci-device pointer on the ptp-RQ. This allows access to
dma_mask and avoids allocation requests with the wrong pci-device.
Fixes: a099da8ffc ("net/mlx5e: Add RQ to PTP channel")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add the PTP-RQ to the loop when setting the rx-vlan-offload feature via
ethtool. On PTP-RQ creation, set rx-vlan-offload in its parameters.
Fixes: a099da8ffc ("net/mlx5e: Add RQ to PTP channel")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
If a feature flag is only present in features, but not in hw_features,
the user can't reset it. Although hw_features may contain NETIF_F_HW_TC
by the point where the driver checks whether HTB offload is supported,
this flag is controlled by another condition that may not hold. Set it
explicitly to make sure the user can disable it.
Fixes: 214baf2287 ("net/mlx5e: Support HTB offload")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When HW aggregates packets for an LRO session, it writes the payload
of two consecutive packets of a flow contiguously, so that they usually
share a cacheline.
The first byte of a packet's payload is written immediately after
the last byte of the preceding packet.
In this flow, there are two consecutive write requests to the shared
cacheline:
1. Regular write for the earlier packet.
2. Read-modify-write for the following packet.
In case of relaxed-ordering on, these two writes might be re-ordered.
Using the end padding optimization (to avoid partial write for the last
cacheline of a packet) becomes problematic if the two writes occur
out-of-order, as the padding would overwrite payload that belongs to
the following packet, causing data corruption.
Avoid this by disabling the end padding optimization when both
LRO and relaxed-ordering are enabled.
Fixes: 17347d5430 ("net/mlx5e: Add support for PCI relaxed ordering")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This is the same check LAG mode uses to decide whether to enable lag.
This fixes adding peer miss rules when lag is not supported, and even
incorrect rules in socket direct mode.
Also fix the incorrect comment on mlx5_get_next_phys_dev(), as flow #1
doesn't exist.
Fixes: ac004b8321 ("net/mlx5e: E-Switch, Add peer miss rules")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The destination vport vhca id valid flag is set only when merged
eswitch is supported. Change the destination vport vhca id value to
also be set only when merged eswitch is supported.
Fixes: e4ad91f23f ("net/mlx5e: Split offloaded eswitch TC rules for port mirroring")
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Fix a bug when a flow table is created in a priority that already
has other flow tables, as shown in the diagram below.
If the new flow table (FT-B) has the lowest level in the priority,
we need to connect the flow tables from the previous priority (p0)
to this new table. In addition, when this flow table is destroyed
(FT-B), we need to connect the flow tables from the previous
priority (p0) to the next-level flow table (FT-C) in the same
priority as the destroyed table (if it exists).
---------
|root_ns|
---------
|
--------------------------------
| | |
---------- ---------- ---------
|p(prio)-x| | p-y | | p-n |
---------- ---------- ---------
| |
---------------- ------------------
|ns(e.g bypass)| |ns(e.g. kernel) |
---------------- ------------------
| | |
------- ------ ----
| p0 | | p1 | |p2|
------- ------ ----
| | \
-------- ------- ------
| FT-A | |FT-B | |FT-C|
-------- ------- ------
Fixes: f90edfd279 ("net/mlx5_core: Connect flow tables")
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Most users of ndo_do_ioctl are ethernet drivers that implement
the MII commands SIOCGMIIPHY/SIOCGMIIREG/SIOCSMIIREG, or hardware
timestamping with SIOCSHWTSTAMP/SIOCGHWTSTAMP.
Separate these from the few drivers that use ndo_do_ioctl to
implement SIOCBOND, SIOCBR and SIOCWANDEV commands.
This is a purely cosmetic change intended to help readers find
their way through the implementation.
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Vivien Didelot <vivien.didelot@gmail.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Vladimir Oltean <olteanv@gmail.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'mlx5-updates-2021-07-24' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
mlx5-updates-2021-07-24
This series aims to reduce coupling in mlx5e, particularly between RX
resources (TIRs, RQTs) and numerous code units that use them.
This refactoring is required for upcoming features, ADQ and TX lag
hashing.
The issue with the current code is that TIRs and RQTs are unmanaged,
different places all over the driver create, destroy, track and
configure them, often in an uncoordinated way. The responsibilities of
different units become vague, leading to a lot of hidden dependencies
between numerous units and tight coupling between them, which is prone
to bugs and hard to maintain.
The result of this refactoring is:
1. Creating a manager for RX resources, that controls their lifecycle
and provides a clear API, which restricts the set of actions that other
units can do.
2. Using object-oriented approach for TIRs, RQTs and RX resource
manager (struct mlx5e_rx_res).
3. Fixing a few bugs and misbehaviors found during the refactoring.
4. Reducing the amount of dependencies, removing hidden dependencies,
making them one-directional and organizing the code in clear abstraction
layers.
5. Explicitly exposing the remaining weird dependencies.
6. Simplifying and organizing code that creates and modifies TIRs and
RQTs.
Saeed Mahameed says:
====================
mlx5 updates 2021-07-24
This series provides some refactoring to mlx5e RX resource management,
it is required for upcoming ADQ and TX lag hashing features.
The first two patches in this series:
net/mlx5e: Prohibit inner indir TIRs in IPoIB
net/mlx5e: Block LRO if firmware asks for tunneled LRO
were supposed to go to net, but due to dependency and timing they were
included here.
I would appreciate it if you'd apply them to net and mark for -stable.
For more information please see tag log below.
Please pull and let me know if there is any problem.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
One of the previous commits introduced a dedicated object for a TIR.
kTLS code creates a TIR per connection using the low-level mlx5_core
API. This commit converts it to the new mlx5e_tir API.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This commit moves the responsibility of keeping the RSS configuration
for different traffic types to en/rx_res.{c,h}, hiding the
implementation details behind new getters, and abandons all usage of
struct mlx5e_tirc_config, which is superseded by
struct mlx5e_rss_params_traffic_type.
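As a rough illustration (the field and function names here are best
guesses rather than the exact en/rx_res.h API), the per-traffic-type
parameters and their getter might look like this:

/* Plausible contents, mirroring what mlx5e_tirc_config used to hold:
 * L3/L4 protocol selectors plus the RX hash field mask. */
struct mlx5e_rss_params_traffic_type {
	u8  l3_prot_type;
	u8  l4_prot_type;
	u32 rx_hash_fields;
};

/* Callers receive a value copy through a getter and can no longer
 * reach into the rx_res internals directly. */
struct mlx5e_rss_params_traffic_type
mlx5e_rx_res_rss_get_tt_config(struct mlx5e_rx_res *res, int tt);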
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Code related to TIR is now encapsulated into a dedicated object and put
into new files en/tir.{c,h}. All usages are converted.
The Builder pattern is used to initialize a TIR. It makes it possible
to create a multitude of configurations, turning specific features on
and off in different combinations, without long parameter lists,
per-usage initializers and repeated code in them.
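A minimal sketch of the builder idea (names and fields are simplified
stand-ins for the actual en/tir.{c,h} API):

struct mlx5e_tir_builder {
	u32 tdn;	/* transport domain */
	u32 rqtn;	/* RQT the TIR will point to */
	bool lro_en;
	bool inner_rss;
};

/* Each setter switches on one optional feature; call sites chain only
 * the setters they need, so no long parameter list is required. */
static inline void tir_builder_rqt(struct mlx5e_tir_builder *b,
				   u32 tdn, u32 rqtn)
{
	b->tdn = tdn;
	b->rqtn = rqtn;
}

static inline void tir_builder_lro(struct mlx5e_tir_builder *b, bool en)
{
	b->lro_en = en;
}

/* Usage sketch: a TIR with LRO on and inner RSS left off. */
static inline void example_build(u32 tdn, u32 rqtn)
{
	struct mlx5e_tir_builder b = {};

	tir_builder_rqt(&b, tdn, rqtn);
	tir_builder_lro(&b, true);
	/* ... a final create step turns b into a firmware TIR ... */
}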
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This commit introduces a new struct to store RSS hash parameters: hash
function and hash key. The existing usages are changed to use the new
struct.
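The struct plausibly boils down to the hash function selector plus the
Toeplitz key (a sketch; the exact field names in the driver may
differ):

struct mlx5e_rss_params_hash {
	u8 hfunc;			/* ETH_RSS_HASH_TOP or _XOR */
	u8 toeplitz_hash_key[40];	/* used when hfunc is Toeplitz */
};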
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In order to drop the dependency on mdev and make the function more
universal, stop passing mdev to mlx5e_build_indir_tir_ctx_common() and
pass transport domain directly instead. It also prepares this function
to be used in other contexts that need a custom transport domain, such
as hairpin.
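Roughly, the signature changes like this (a sketch, not the literal
diff):

/* Before: mdev was passed only so the function could fetch the
 * transport domain number internally. */
void mlx5e_build_indir_tir_ctx_common(struct mlx5e_priv *priv,
				      struct mlx5_core_dev *mdev,
				      void *tirc);

/* After: the caller resolves the transport domain, so a custom TD
 * (e.g. a hairpin one) can be plugged in. */
void mlx5e_build_indir_tir_ctx_common(struct mlx5e_priv *priv,
				      u32 tdn, void *tirc);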
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In order to reduce the list of parameters and to define clearer
responsibility for mlx5e_build_indir_tir_ctx_common(), stop passing
lro_param and instead call mlx5e_build_tir_ctx_lro() directly where
needed.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The functions that build TIR context for TIR create and modify commands
used to depend on struct mlx5e_priv and fetch some values directly from
different places. This increased code coupling and the chance of weird
misbehavior due to hidden, complex dependencies.
As the first step, this commit removes the priv parameter from these
functions. Instead, the necessary values are passed directly.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In order to abstract from implementation details of mlx5e_rqt, use the
mlx5e_rqt_get_rqtn getter instead of accessing the field directly.
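The getter itself is a one-liner; the point is that the field stops
being part of the public surface (sketch):

struct mlx5e_rqt {
	u32 rqtn;	/* firmware RQT number, now an implementation detail */
};

static inline u32 mlx5e_rqt_get_rqtn(struct mlx5e_rqt *rqt)
{
	return rqt->rqtn;
}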
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
An RQT is not part of a TIR: multiple TIRs may point to the same RQT,
as happens with indir_tir and inner_indir_tir. Such instances of a TIR
don't use the embedded RQT.
This commit takes RQT out of TIR, making them independent. The RQTs are
placed into struct mlx5e_rx_res, and items in that struct are regrouped
by functionality: RSS, channels and PTP.
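The regrouped container can be pictured like this (a sketch; the sizes
and some names are illustrative):

struct mlx5e_rx_res {
	/* RSS */
	struct mlx5e_rqt indir_rqt;
	struct mlx5e_tir indir_tirs[MLX5E_NUM_INDIR_TIRS];
	struct mlx5e_tir inner_indir_tirs[MLX5E_NUM_INDIR_TIRS];

	/* channels */
	struct mlx5e_rqt direct_rqts[MLX5E_MAX_NUM_CHANNELS];
	struct mlx5e_tir direct_tirs[MLX5E_MAX_NUM_CHANNELS];

	/* PTP */
	struct mlx5e_rqt ptp_rqt;
	struct mlx5e_tir ptp_tir;
};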
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This commit moves RQTs and TIRs to a separate struct that is allocated
dynamically in profiles that support these RX resources (all profiles,
except IPoIB PKey). It also allows removing the rqt_enabled flags, as RQTs
are always enabled in profiles that support RX resources.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
RSS params belong to the RX side initialization. Move them from
profile->init to profile->init_rx stage to allow the next commit to move
rss_params out of priv to a dynamically-allocated struct.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Code related to RQT is now encapsulated into a dedicated object and put
into new files en/rqt.{c,h}. All usages are converted.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Move the mlx5e_tunnel_inner_ft_supported() check for inner flow table
support outside of mlx5e_create_inner_ttc_table() and
mlx5e_destroy_inner_ttc_table(). This avoids accessing invalid
TIRNs of inner indirect TIRs. In a later commit these accesses will be
replaced by getters that will WARN if inner indirect TIRs don't exist.
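The caller side then looks roughly like this (sketch):

if (mlx5e_tunnel_inner_ft_supported(priv->mdev)) {
	err = mlx5e_create_inner_ttc_table(priv);
	if (err)
		goto err_destroy_ttc_table;	/* label illustrative */
}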
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
res->td.list_lock protects the list of TIRs. There is no point in
calling mlx5_core_destroy_tir(), which invokes a firmware command,
while holding this lock.
This commit moves this call outside of the lock and puts it after
deleting the TIR from the list to ensure that TIRs are always alive
while in the list.
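The reordering, roughly (lock and field names approximate):

mutex_lock(&res->td.list_lock);
list_del(&tir->list);			/* TIR no longer reachable */
mutex_unlock(&res->td.list_lock);

/* The slow firmware command now runs without holding the lock; the
 * TIR is already off the list, so nobody can observe a dead TIR. */
mlx5_core_destroy_tir(mdev, tir->tirn);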
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This commit does a cleanup in LRO configuration.
LRO is a parameter of an RQ, but its state is changed by modifying a TIR
related to the RQ.
The current status: LRO for tunneled packets is not supported in the
driver; inner TIRs may enable LRO on creation, but the LRO state of
inner TIRs isn't changed by mlx5e_modify_tirs_lro(). This is
inconsistent, but
as long as the firmware doesn't declare support for tunneled LRO, it
works, because the same RQs are shared between the inner and outer TIRs.
This commit does two fixes:
1. If the firmware has the tunneled LRO capability, LRO is blocked
altogether, because it's not possible to block it for inner TIRs only,
when the same RQs are shared between inner and outer TIRs, and the
driver won't be able to handle tunneled LRO traffic.
2. mlx5e_modify_tirs_lro() is patched to modify LRO state for all TIRs,
including inner ones, because all TIRs related to an RQ should agree on
their LRO state.
Fixes: 7b3722fa9e ("net/mlx5e: Support RSS for GRE tunneled packets")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
TIR's rx_hash_field_selector_inner can be enabled only when
tunneled_offload_en = 1. tunneled_offload_en is filled according to the
tunneled_offload_en field in struct mlx5e_params, which is false in the
IPoIB profile. On the other hand, the IPoIB profile passes inner_ttc =
true to mlx5e_create_indirect_tirs, which potentially allows the latter
function to attempt to create inner indirect TIRs without having
tunneled_offload_en set.
This commit prohibits this behavior by passing inner_ttc = false to
mlx5e_create_indirect_tirs, so that the latter won't attempt to create
inner indirect TIRs.
As inner indirect TIRs are not created in the IPoIB profile (this commit
blocks it explicitly, and even before they would have failed to be
created), the call to mlx5e_create_inner_ttc_table in
mlx5i_create_flow_steering is a no-op and can be removed.
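In diff form, the gist of the change (a sketch based on the inner_ttc
parameter named above, not the literal hunk):

-	err = mlx5e_create_indirect_tirs(priv, true);
+	/* tunneled_offload_en is false in IPoIB: no inner TIRs */
+	err = mlx5e_create_indirect_tirs(priv, false);

plus the deletion of the now-pointless inner TTC call in
mlx5i_create_flow_steering.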
Fixes: 46dc933cee ("net/mlx5e: Provide explicit directive if to create inner indirect tirs")
Fixes: 458821c72b ("net/mlx5e: IPoIB, Add inner TTC table to IPoIB flow steering")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The error code is missing in this code path; set the return value
'err' to '-EINVAL'.
Eliminate the following smatch warning:
drivers/net/ethernet/mellanox/mlx4/main.c:3538 mlx4_load_one() warn:
missing error code 'err'.
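The generic shape of this class of smatch findings (illustrative code,
not the actual mlx4 hunk):

int load_one(void)	/* hypothetical stand-in for mlx4_load_one() */
{
	int err = 0;

	if (!some_precondition()) {
		/* the fix: without this assignment the function would
		 * take the error path yet still return 0 */
		err = -EINVAL;
		goto err_out;
	}
	return 0;

err_out:
	cleanup();
	return err;
}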
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Fixes: 7ae0e400cd ("net/mlx4_core: Flexible (asymmetric) allocation of EQs and MSI-X vectors for PF/VFs")
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
is_apu_thread_cq() used to detect CQs which are attached to APU
threads. This was extended to support other elements as well,
so the function was renamed to is_apu_cq().
c_eqn_or_apu_element was extended from 8 bits to 32 bits, which wasn't
reflected when the APU support was first introduced.
Acked-by: Michael S. Tsirkin <mst@redhat.com> # vdpa
Signed-off-by: Tal Gilboa <talgi@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Allow switchdevs to forward frames from the CPU in accordance with the
bridge configuration in the same way as is done between bridge
ports. This means that the bridge will only send a single skb towards
one of the ports under the switchdev's control, and expects the driver
to deliver the packet to all eligible ports in its domain.
Primarily this improves the performance of multicast flows with
multiple subscribers, as it allows the hardware to perform the frame
replication.
The basic flow between the driver and the bridge is as follows:
- When joining a bridge port, the switchdev driver calls
switchdev_bridge_port_offload() with tx_fwd_offload = true.
- The bridge sends offloadable skbs to one of the ports under the
switchdev's control using skb->offload_fwd_mark = true.
- The switchdev driver checks the skb->offload_fwd_mark field and lets
its FDB lookup select the destination port mask for this packet (see
the sketch below).
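On the driver side, the check can be as simple as the sketch below.
The foo_ names are hypothetical; only skb->offload_fwd_mark and the
switchdev_bridge_port_offload() registration with tx_fwd_offload = true
come from this series:

static netdev_tx_t foo_port_xmit(struct sk_buff *skb,
				 struct net_device *dev)
{
	struct foo_port *port = netdev_priv(dev);

	if (skb->offload_fwd_mark)
		/* The bridge sent one skb for the whole domain: let
		 * the FDB lookup pick the destination port mask and
		 * let hardware replicate the frame. */
		return foo_xmit_fdb_flood(port, skb);

	/* Ordinary case: transmit out of this port only. */
	return foo_xmit_single(port, skb);
}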
v1->v2:
- convert br_input_skb_cb::fwd_hwdoms to a plain unsigned long
- introduce a static key "br_switchdev_fwd_offload_used" to minimize the
impact of the newly introduced feature on all the setups which don't
have hardware that can make use of it
- introduce a check for nbp->flags & BR_FWD_OFFLOAD to optimize cache
line access
- reorder nbp_switchdev_frame_mark_accel() and br_handle_vlan() in
__br_forward()
- do not strip VLAN on egress if forwarding offload on VLAN-aware bridge
is being used
- propagate errors from .ndo_dfwd_add_station() if not EOPNOTSUPP
v2->v3:
- replace the solution based on .ndo_dfwd_add_station with a solution
based on switchdev_bridge_port_offload
- rename BR_FWD_OFFLOAD to BR_TX_FWD_OFFLOAD
v3->v4: rebase
v4->v5:
- make sure the static key is decremented on bridge port unoffload
- more function and variable renaming and comments for them:
br_switchdev_fwd_offload_used to br_switchdev_tx_fwd_offload
br_switchdev_accels_skb to br_switchdev_frame_uses_tx_fwd_offload
nbp_switchdev_frame_mark_tx_fwd to nbp_switchdev_frame_mark_tx_fwd_to_hwdom
nbp_switchdev_frame_mark_accel to nbp_switchdev_frame_mark_tx_fwd_offload
fwd_accel to tx_fwd_offload
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>