linux-stable

Commit Graph

Author	SHA1	Message	Date
Rahul Rameshbabu	b8ac3415e2	net/mlx5e: Do not produce metadata freelist entries in Tx port ts WQE xmit [ Upstream commit `86b0ca5b11` ] Free Tx port timestamping metadata entries in the NAPI poll context and consume metadata enties in the WQE xmit path. Do not free a Tx port timestamping metadata entry in the WQE xmit path even in the error path to avoid a race between two metadata entry producers. Fixes: `3178308ad4` ("net/mlx5e: Make tx_port_ts logic resilient to out-of-order CQEs") Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240409190820.227554-10-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-04-17 11:23:33 +02:00
Rahul Rameshbabu	b7cf07586c	net/mlx5e: Use a memory barrier to enforce PTP WQ xmit submission tracking occurs after populating the metadata_map Just simply reordering the functions mlx5e_ptp_metadata_map_put and mlx5e_ptpsq_track_metadata in the mlx5e_txwqe_complete context is not good enough since both the compiler and CPU are free to reorder these two functions. If reordering does occur, the issue that was supposedly fixed by `7e3f3ba97e` ("net/mlx5e: Track xmit submission to PTP WQ after populating metadata map") will be seen. This will lead to NULL pointer dereferences in mlx5e_ptpsq_mark_ts_cqes_undelivered in the NAPI polling context due to the tracking list being populated before the metadata map. Fixes: `7e3f3ba97e` ("net/mlx5e: Track xmit submission to PTP WQ after populating metadata map") Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> CC: Vadim Fedorenko <vadfed@meta.com>	2024-03-01 23:02:26 -08:00
Tariq Toukan	db52aa6df8	net/mlx5e: Decouple CQ from priv Make CQ struct and methods independent of "priv", use more basic arguments instead. This will ease the transition to netdev with multiple mdevs. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-12-13 18:03:32 -08:00
Rahul Rameshbabu	7e3f3ba97e	net/mlx5e: Track xmit submission to PTP WQ after populating metadata map Ensure the skb is available in metadata mapping to skbs before tracking the metadata index for detecting undelivered CQEs. If the metadata index is put in the tracking list before putting the skb in the map, the metadata index might be used for detecting undelivered CQEs before the relevant skb is available in the map, which can lead to a null-ptr-deref. Log: general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f] CPU: 0 PID: 1243 Comm: kworker/0:2 Not tainted 6.6.0-rc4+ #108 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Workqueue: events mlx5e_rx_dim_work [mlx5_core] RIP: 0010:mlx5e_ptp_napi_poll+0x9a4/0x2290 [mlx5_core] Code: 8c 24 38 cc ff ff 4c 8d 3c c1 4c 89 f9 48 c1 e9 03 42 80 3c 31 00 0f 85 97 0f 00 00 4d 8b 3f 49 8d 7f 28 48 89 f9 48 c1 e9 03 <42> 80 3c 31 00 0f 85 8b 0f 00 00 49 8b 47 28 48 85 c0 0f 84 05 07 RSP: 0018:ffff8884d3c09c88 EFLAGS: 00010206 RAX: 0000000000000069 RBX: ffff8881160349d8 RCX: 0000000000000005 RDX: ffffed10218f48cf RSI: 0000000000000004 RDI: 0000000000000028 RBP: ffff888122707700 R08: 0000000000000001 R09: ffffed109a781383 R10: 0000000000000003 R11: 0000000000000003 R12: ffff88810c7a7a40 R13: ffff888122707700 R14: dffffc0000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8884d3c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f4f878dd6e0 CR3: 000000014d108002 CR4: 0000000000370eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> ? die_addr+0x3c/0xa0 ? exc_general_protection+0x144/0x210 ? asm_exc_general_protection+0x22/0x30 ? mlx5e_ptp_napi_poll+0x9a4/0x2290 [mlx5_core] ? mlx5e_ptp_napi_poll+0x8f6/0x2290 [mlx5_core] __napi_poll.constprop.0+0xa4/0x580 net_rx_action+0x460/0xb80 ? _raw_spin_unlock_irqrestore+0x32/0x60 ? __napi_poll.constprop.0+0x580/0x580 ? tasklet_action_common.isra.0+0x2ef/0x760 __do_softirq+0x26c/0x827 irq_exit_rcu+0xc2/0x100 common_interrupt+0x7f/0xa0 </IRQ> <TASK> asm_common_interrupt+0x22/0x40 RIP: 0010:__kmem_cache_alloc_node+0xb/0x330 Code: 41 5d 41 5e 41 5f c3 8b 44 24 14 8b 4c 24 10 09 c8 eb d5 e8 b7 43 ca 01 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 <41> 56 41 89 d6 41 55 41 89 f5 41 54 49 89 fc 53 48 83 e4 f0 48 83 RSP: 0018:ffff88812c4079c0 EFLAGS: 00000246 RAX: 1ffffffff083c7fe RBX: ffff888100042dc0 RCX: 0000000000000218 RDX: 00000000ffffffff RSI: 0000000000000dc0 RDI: ffff888100042dc0 RBP: ffff88812c4079c8 R08: ffffffffa0289f96 R09: ffffed1025880ea9 R10: ffff888138839f80 R11: 0000000000000002 R12: 0000000000000dc0 R13: 0000000000000100 R14: 000000000000008c R15: ffff8881271fc450 ? cmd_exec+0x796/0x2200 [mlx5_core] kmalloc_trace+0x26/0xc0 cmd_exec+0x796/0x2200 [mlx5_core] mlx5_cmd_do+0x22/0xc0 [mlx5_core] mlx5_cmd_exec+0x17/0x30 [mlx5_core] mlx5_core_modify_cq_moderation+0x139/0x1b0 [mlx5_core] ? mlx5_add_cq_to_tasklet+0x280/0x280 [mlx5_core] ? lockdep_set_lock_cmp_fn+0x190/0x190 ? process_one_work+0x659/0x1220 mlx5e_rx_dim_work+0x9d/0x100 [mlx5_core] process_one_work+0x730/0x1220 ? lockdep_hardirqs_on_prepare+0x400/0x400 ? max_active_store+0xf0/0xf0 ? assign_work+0x168/0x240 worker_thread+0x70f/0x12d0 ? __kthread_parkme+0xd1/0x1d0 ? process_one_work+0x1220/0x1220 kthread+0x2d9/0x3b0 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x2d/0x70 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay mlx5_ib ib_uverbs ib_core zram zsmalloc mlx5_core fuse ---[ end trace 0000000000000000 ]--- Fixes: `3178308ad4` ("net/mlx5e: Make tx_port_ts logic resilient to out-of-order CQEs") Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20231114215846.5902-11-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-15 22:34:31 -08:00
Rahul Rameshbabu	64f14d16ee	net/mlx5e: Avoid referencing skb after free-ing in drop path of mlx5e_sq_xmit_wqe When SQ is a port timestamping SQ for PTP, do not access tx flags of skb after free-ing the skb. Free the skb only after all references that depend on it have been handled in the dropped WQE path. Fixes: `3178308ad4` ("net/mlx5e: Make tx_port_ts logic resilient to out-of-order CQEs") Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20231114215846.5902-10-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-15 22:34:31 -08:00
Rahul Rameshbabu	3178308ad4	net/mlx5e: Make tx_port_ts logic resilient to out-of-order CQEs Use a map structure for associating CQEs containing port timestamping information with the appropriate skb. Track order of WQEs submitted using a FIFO. Check if the corresponding port timestamping CQEs from the lookup values in the FIFO are considered dropped due to time elapsed. Return the lookup value to a freelist after consuming the skb. Reuse the freed lookup in future WQE submission iterations. The map structure uses an integer identifier for the key and returns an skb corresponding to that identifier. Embed the integer identifier in the WQE submitted to the WQ for the transmit path when the SQ is a PTP (port timestamping) SQ. The embedded identifier can then be queried using a field in the CQE of the corresponding port timestamping CQ. In the port timestamping napi_poll context, the identifier is queried from the CQE polled from CQ and used to lookup the corresponding skb from the WQE submit path. The skb reference is removed from map and then embedded with the port HW timestamp information from the CQE and eventually consumed. The metadata freelist FIFO is an array containing integer identifiers that can be pushed and popped in the FIFO. The purpose of this structure is bookkeeping what identifier values can safely be used in a subsequent WQE submission and should not contain identifiers that have still not been reaped by processing a corresponding CQE completion on the port timestamping CQ. The ts_cqe_pending_list structure is a combination of an array and linked list. The array is pre-populated with the nodes that will be added and removed from the head of the linked list. Each node contains the unique identifier value associated with the values submitted in the WQEs and retrieved in the port timestamping CQEs. When a WQE is submitted, the node in the array corresponding to the identifier popped from the metadata freelist is added to the end of the CQE pending list and is marked as "in-use". The node is removed from the linked list under two conditions. The first condition is that the corresponding port timestamping CQE is polled in the PTP napi_poll context. The second condition is that more than a second has elapsed since the DMA timestamp value corresponding to the WQE submission. When the first condition occurs, the "in-use" bit in the linked list node is cleared, and the resources corresponding to the WQE submission are then released. The second condition, however, indicates that the port timestamping CQE will likely never be delivered. It's not impossible for the device to post a CQE after an infinite amount of time though highly improbable. In order to be resilient to this improbable case, resources related to the corresponding WQE submission are still kept, the identifier value is not returned to the freelist, and the "in-use" bit is cleared on the node to indicate that it's no longer part of the linked list of "likely to be delivered" port timestamping CQE identifiers. A count for the number of port timestamping CQEs considered highly likely to never be delivered by the device is maintained. This count gets decremented in the unlikely event a port timestamping CQE considered unlikely to ever be delivered is polled in the PTP napi_poll context. Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-08-14 14:40:20 -07:00
Rahul Rameshbabu	7aa5038019	net/mlx5e: Fix SQ wake logic in ptp napi_poll context Check in the mlx5e_ptp_poll_ts_cq context if the ptp tx sq should be woken up. Before change, the ptp tx sq may never wake up if the ptp tx ts skb fifo is full when mlx5e_poll_tx_cq checks if the queue should be woken up. Fixes: `1880bc4e4a` ("net/mlx5e: Add TX port timestamp support") Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-05-22 22:38:05 -07:00
Tariq Toukan	b5618a6b19	net/mlx5e: Remove unused function mlx5e_sq_xmit_simple The last usage was removed as part of commit `40379a0084` ("net/mlx5_fpga: Drop INNOVA TLS support"). Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-02-18 01:01:34 -08:00
Maxim Mikityanskiy	f9c955b4fe	net/mlx5e: Add missing sanity checks for max TX WQE size The commit cited below started using the firmware capability for the maximum TX WQE size. This commit adds an important check to verify that the driver doesn't attempt to exceed this capability, and also restores another check mistakenly removed in the cited commit (a WQE must not exceed the page size). Fixes: `c27bd1718c` ("net/mlx5e: Read max WQEBBs on the SQ from firmware") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-11-09 10:30:43 -08:00
Aya Levin	19b43a432e	net/mlx5e: Extend SKB room check to include PTP-SQ When tx_port_ts is set, the driver diverts all UPD traffic over PTP port to a dedicated PTP-SQ. The SKBs are cached until the wire-CQE arrives. When the packet size is greater then MTU, the firmware might drop it and the packet won't be transmitted to the wire, hence the wire-CQE won't reach the driver. In this case the SKBs are accumulated in the SKB fifo. Add room check to consider the PTP-SQ SKB fifo, when the SKB fifo is full, driver stops the queue resulting in a TX timeout. Devlink TX-reporter can recover from it. Fixes: `1880bc4e4a` ("net/mlx5e: Add TX port timestamp support") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20221026135153.154807-5-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-10-27 11:06:55 -07:00
Lior Nahmanson	9515978eee	net/mlx5e: Implement MACsec Tx data path using MACsec skb_metadata_dst MACsec driver marks Tx packets for device offload using a dedicated skb_metadata_dst which holds a 64 bits SCI number. A previously set rule will match on this number so the correct SA is used for the MACsec operation. As device driver can only provide 32 bits of metadata to flow tables, need to used a mapping from 64 bit to 32 bits marker or id, which is can be achieved by provide a 32 bit unique flow id in the control path, and used a hash table to map 64 bit to the unique id in the datapath. Signed-off-by: Lior Nahmanson <liorna@nvidia.com> Reviewed-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-09-07 14:02:08 +01:00
Gal Pressman	ec082d31c1	net/mlx5e: Fix wrong use of skb_tcp_all_headers() with encapsulation Use skb_inner_tcp_all_headers() instead of skb_tcp_all_headers() when transmitting an encapsulated packet in mlx5e_tx_get_gso_ihs(). Fixes: `504148fedb` ("net: add skb_[inner_]tcp_all_headers helpers") Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-07-28 13:55:25 -07:00
Aya Levin	58a518948f	net/mlx5e: Add resiliency for PTP TX port timestamp PTP TX port timestamp relies on receiving 2 CQEs for each outgoing packet (WQE). The regular CQE has a less accurate timestamp than the wire CQE. On link change, the wire CQE may get lost. Let the driver detect and restore the relation between the CQEs, and re-sync after timeout. Add resiliency for this as follows: add id (producer counter) into the WQE's metadata. This id will be received in the wire CQE (in wqe_counter field). On handling the wire CQE, if there is no match, replay the PTP application with the time-stamp from the regular CQE and restore the sync between the CQEs and their SKBs. This patch adds 2 ptp counters: 1) ptp_cq0_resync_event: number of times a mismatch was detected between the regular CQE and the wire CQE. 2) ptp_cq0_resync_cqe: total amount of missing wire CQEs. Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-07-19 13:32:54 -07:00
Jakub Kicinski	816cd16883	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net include/net/sock.h `310731e2f1` ("net: Fix data-races around sysctl_mem.") `e70f3c7012` ("Revert "net: set SK_MEM_QUANTUM to 4096"") https://lore.kernel.org/all/20220711120211.7c8b7cba@canb.auug.org.au/ net/ipv4/fib_semantics.c `747c143072` ("ip: fix dflt addr selection for connected nexthop") `d62607c3fe` ("net: rename reference+tracking helpers") net/tls/tls.h include/net/tls.h `3d8c51b25a` ("net/tls: Check for errors in tls_device_init") `5879031423` ("tls: create an internal header") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-14 15:27:35 -07:00
Maxim Mikityanskiy	5b759bf2f9	net/mlx5e: Ring the TX doorbell on DMA errors TX doorbells may be postponed, because sometimes the driver knows that another packet follows (for example, when xmit_more is true, or when a MPWQE session is closed before transmitting a packet). However, the DMA mapping may fail for the next packet, in which case a new WQE is not posted, the doorbell isn't updated either, and the transmission of the previous packet will be delayed indefinitely. This commit fixes the described rare error flow by posting a NOP and ringing the doorbell on errors to flush all the previous packets. The MPWQE session is closed before that. DMA mapping in the MPWQE flow is moved to the beginning of mlx5e_sq_xmit_mpwqe, because empty sessions are not allowed. Stop room always has enough space for a NOP, because the actual TX WQE is not posted. Fixes: `e586b3b0ba` ("net/mlx5: Ethernet Datapath files") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-07-06 16:11:56 -07:00
Eric Dumazet	504148fedb	net: add skb_[inner_]tcp_all_headers helpers Most drivers use "skb_transport_offset(skb) + tcp_hdrlen(skb)" to compute headers length for a TCP packet, but others use more convoluted (but equivalent) ways. Add skb_tcp_all_headers() and skb_inner_tcp_all_headers() helpers to harmonize this a bit. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-07-02 16:22:25 +01:00
Eric Dumazet	de78960e02	mlx5: support BIG TCP packets mlx5 supports LSOv2. IPv6 gro/tcp stacks insert a temporary Hop-by-Hop header with JUMBO TLV for big packets. We need to ignore/skip this HBH header when populating TX descriptor. Note that ipv6_has_hopopt_jumbo() only recognizes very specific packet layout, thus mlx5e_sq_xmit_wqe() is taking care of this layout only. v7: adopt unsafe_memcpy() and MLX5_UNSAFE_MEMCPY_DISCLAIMER v2: clear hopbyhop in mlx5e_tx_get_gso_ihs() v4: fix compile error for CONFIG_MLX5_CORE_IPOIB=y Signed-off-by: Coco Li <lixiaoyan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Leon Romanovsky <leon@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-05-16 10:18:56 +01:00
Kees Cook	43213daed6	fortify: Provide a memcpy trap door for sharp corners As we continue to narrow the scope of what the FORTIFY memcpy() will accept and build alternative APIs that give the compiler appropriate visibility into more complex memcpy scenarios, there is a need for "unfortified" memcpy use in rare cases where combinations of compiler behaviors, source code layout, etc, result in cases where the stricter memcpy checks need to be bypassed until appropriate solutions can be developed (i.e. fix compiler bugs, code refactoring, new API, etc). The intention is for this to be used only if there's no other reasonable solution, for its use to include a justification that can be used to assess future solutions, and for it to be temporary. Example usage included, based on analysis and discussion from: https://lore.kernel.org/netdev/CANn89iLS_2cshtuXPyNUGDPaic=sJiYfvTb_wNLgWrZRyBxZ_g@mail.gmail.com Cc: Jakub Kicinski <kuba@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Coco Li <lixiaoyan@google.com> Cc: Tariq Toukan <tariqt@nvidia.com> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: netdev@vger.kernel.org Cc: linux-hardening@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220511025301.3636666-1-keescook@chromium.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-05-12 10:49:23 +02:00
Maxim Mikityanskiy	6b23f6ab86	net/mlx5e: Move mlx5e_select_queue to en/selq.c This commit moves mlx5e_select_queue and all stuff related to ndo_select_queue to en/selq.c to put all stuff working with selq into a separate file. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-02-14 22:30:50 -08:00
Maxim Mikityanskiy	17c84cb46e	net/mlx5e: Sync txq2sq updates with mlx5e_xmit for HTB queues This commit makes necessary changes to guarantee that txq2sq remains stable while mlx5e_xmit is running. Proper synchronization is added for HTB TX queues. All updates to txq2sq are performed while the corresponding queue is disabled (i.e. mlx5e_xmit doesn't run on that queue). smp_wmb after each change guarantees that mlx5e_xmit can see the updated value after the queue is enabled. Comments explaining this mechanism are added to mlx5e_xmit. When an HTB SQ can be deleted (after deleting an HTB node), synchronize with RCU to wait for mlx5e_select_queue to finish and stop selecting that queue, before we re-enable it to avoid TX timeout watchdog alarms. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-02-14 22:30:49 -08:00
Aya Levin	76c31e5f75	net/mlx5e: Use FW limitation for max MPW WQEBBs Calculate maximal count of MPW WQEBBs on SQ's creation and store it there. Remove MLX5E_TX_MPW_MAX_NUM_DS and MLX5E_TX_MPW_MAX_WQEBBS. Update mlx5e_tx_mpwqe_is_full() and mlx5e_xdp_mpqwe_is_full() . Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-02-14 22:30:48 -08:00
Kees Cook	6d5c900eb6	net/mlx5e: Use struct_group() for memcpy() region In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memcpy(), memmove(), and memset(), avoid intentionally writing across neighboring fields. Use struct_group() in struct vlan_ethhdr around members h_dest and h_source, so they can be referenced together. This will allow memcpy() and sizeof() to more easily reason about sizes, improve readability, and avoid future warnings about writing beyond the end of h_dest. "pahole" shows no size nor member offset changes to struct vlan_ethhdr. "objdump -d" shows no object code changes. Fixes: `34802a42b3` ("net/mlx5e: Do not modify the TX SKB") Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-02-01 20:59:43 -08:00
Raed Salem	428ffea071	net/mlx5e: IPsec: Refactor checksum code in tx data path Part of code that is related solely to IPsec is always compiled in the driver code regardless if the IPsec functionality is enabled or disabled in the driver code, this will add unnecessary branch in case IPsec is disabled at Tx data path. Move IPsec related code to IPsec related file such that in case of IPsec is disabled and because of unlikely macro the compiler should be able to optimize and omit the checksum IPsec code all together from Tx data path Signed-off-by: Raed Salem <raeds@nvidia.com> Reviewed-by: Emeel Hakim <ehakim@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-10-29 13:53:28 -07:00
Emeel Hakim	1d00032394	net/mlx5e: IPsec: Fix work queue entry ethernet segment checksum flags Current Work Queue Entry (WQE) checksum (csum) flags in the ethernet segment (eseg) in case of IPsec crypto offload datapath are not aligned with PRM/HW expectations. Currently the driver always sets the l3_inner_csum flag in case of IPsec because of the wrong usage of skb->encapsulation as indicator for inner IPsec header since skb->encapsulation is always ON for IPsec packets since IPsec itself is an encapsulation protocol. The above forced a failing attempts of calculating csum of non-existing segments (like in the IP\|ESP\|TCP packet case which does not have an l3_inner) which led to lots of packet drops hence the low throughput. Fix by using xo->inner_ipproto as indicator for inner IPsec header instead of skb->encapsulation in addition to setting the csum flags as following: * Tunnel Mode: * Pkt: MAC IP ESP IP L4 * CSUM: l3_cs \| l3_inner_cs \| l4_inner_cs * * Transport Mode: * Pkt: MAC IP ESP L4 * CSUM: l3_cs [ \| l4_cs (checksum partial case)] * * Tunnel(VXLAN TCP/UDP) over Transport Mode * Pkt: MAC IP ESP UDP VXLAN IP L4 * CSUM: l3_cs \| l3_inner_cs \| l4_inner_cs Fixes: `f1267798c9` ("net/mlx5: Fix checksum issue of VXLAN and IPsec crypto offload") Signed-off-by: Emeel Hakim <ehakim@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-10-20 10:42:51 -07:00
Jakub Kicinski	adc2e56ebe	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Trivial conflicts in net/can/isotp.c and tools/testing/selftests/net/mptcp/mptcp_connect.sh scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c to include/linux/ptp_clock_kernel.h in -next so re-apply the fix there. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-06-18 19:47:02 -07:00
Aya Levin	a6ee6f5f10	net/mlx5e: Fix select queue to consider SKBTX_HW_TSTAMP Steering packets to PTP-SQ should be done only if the SKB has SKBTX_HW_TSTAMP set in the tx_flags. While here, take the function into a header and inline it. Set the whole condition to select the PTP-SQ to unlikely. Fixes: `24c22dd091` ("net/mlx5e: Add states to PTP channel") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 17:20:05 -07:00
Vladyslav Tarasiuk	f68406ca3b	net/mlx5e: Remove unreachable code in mlx5e_xmit() After some commits, mlx5e_txwqe_build_eseg() lost its ability to return boolean value and became effectively void. Change its return type to void and remove unreachable branches. Signed-off-by: Vladyslav Tarasiuk <vladyslavt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-03 13:10:21 -07:00
Aya Levin	24c22dd091	net/mlx5e: Add states to PTP channel Add PTP TX state to PTP channel, which indicates the corresponding SQ is available. Further patches in the set extend PTP channel to include RQ. The PTP channel state will be used for separation and coexistence of RX and TX PTP. Enhance conditions to verify the TX PTP state is set. Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-03-29 21:21:50 -07:00
Aya Levin	b0d35de441	net/mlx5e: Generalize PTP implementation Following patches in the set add support for RX PTP. Rename PTP prefix from %s/port_ptp/ptp/g to include RX PTP too. In addition rename indication (used in statistics context) that PTP-SQ was opened: %s/port_ptp_opened/tx_ptp_opened/g. This will simplify adding indication that PTP-RQ was opened. Signed-off-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-03-25 19:50:11 -07:00
Maxim Mikityanskiy	991b265460	net/mlx5e: Use net_prefetchw instead of prefetchw in MPWQE TX datapath Commit `e20f0dbf20` ("net/mlx5e: RX, Add a prefetch command for small L1_CACHE_BYTES") switched to using net_prefetchw at all places in mlx5e. In the same time frame, commit `5af75c747e` ("net/mlx5e: Enhanced TX MPWQE for SKBs") added one more usage of prefetchw. When these two changes were merged, this new occurrence of prefetchw wasn't replaced with net_prefetchw. This commit fixes this last occurrence of prefetchw in mlx5e_tx_mpwqe_session_start, making the same change that was done in mlx5e_xdp_mpwqe_session_start. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-03-12 15:29:31 -08:00
David S. Miller	44c3203975	Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Saeed Mahameed says: ==================== pull-request: mlx5-next 2021-02-16 The patches in this pr are already submitted and reviewed through the netdev and rdma mailing lists. The series includes mlx5 HW bits and definitions for mlx5 real time clock translation and handling in the mlx5 driver clock module to enable and support such mode [1] [1] https://patchwork.kernel.org/project/netdevbpf/patch/20210212223042.449816-7-saeed@kernel.org/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-16 14:53:30 -08:00
Aya Levin	432119de33	net/mlx5: Add cyc2time HW translation mode support Device timestamp can be in real time mode (cycles to time translation is offloaded into the Hardware). With real time mode, HW provides timestamp which is already translated into nanoseconds. With this mode, driver adjusts both the HW and timecounter (to keep clock_info_page updated) using callbacks: adjfreq, adjtime and settime. HW clock modifications are done via MTUTC access reg commands. Driver is allowed to modify HW real time clock only if MCAM ptpcyc2realtime_modify capability is set. Add MTUTC set function to be used for configuring the HW real time clock. Modify existing code to support both internal timer (with conversion via timecounter_cyc2time() and real time (no conversions). Align the signatures of the helpers converting from timestamp to nanoseconds. With that, when allocating a queue assign the corresponding callback with respect to the capability. Adjust 1PPS timestamp calculation flows based on the timestamp mode. Cyc2time offload brings two major advantages: - Improve MTAE (Max Time Absolute Error) for HW TS by up to 160 ns over a 100% loaded CPU. - Faster data-path timestamp to nanoseconds, as translation is lock-less and done in HW. On real time mode, timestamp format is 32 high bits of seconds and 32 low bits of nanoseconds. On some flows, driver shall convert this format into nanoseconds wall-clock with REAL_TIME_TO_NS macro. HW supports a single clock, and it is shared by all functions on a device. In case real time clock is used, it is recommended to use a single GM to all device's functions. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-02-16 14:04:54 -08:00
Maxim Mikityanskiy	214baf2287	net/mlx5e: Support HTB offload This commit adds support for HTB offload in the mlx5e driver. Performance: NIC: Mellanox ConnectX-6 Dx CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (24 cores with HT) 100 Gbit/s line rate, 500 UDP streams @ ~200 Mbit/s each 48 traffic classes, flower used for steering No shaping (rate limits set to 4 Gbit/s per TC) - checking for max throughput. Baseline: 98.7 Gbps, 8.25 Mpps HTB: 6.7 Gbps, 0.56 Mpps HTB offload: 95.6 Gbps, 8.00 Mpps Limitations: 1. 256 leaf nodes, 3 levels of depth. 2. Granularity for ceil is 1 Mbit/s. Rates are converted to weights, and the bandwidth is split among the siblings according to these weights. Other parameters for classes are not supported. Ethtool statistics support for QoS SQs are also added. The counters are called qos_txN_, where N is the QoS queue number (starting from 0, the numeration is separate from the normal SQs), and is the counter name (the counters are the same as for the normal SQs). Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-22 20:41:29 -08:00
Tariq Toukan	3a28eda94c	net/mlx5e: IPsec, Enclose csum logic under ipsec config All IPsec logic should be wrapped under the compile flag, including its checksum logic. Introduce an inline function in ipsec datapath header, with a corresponding stub. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Raed Salem <raeds@nvidia.com> Reviewed-by: Huy Nguyen <huyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-13 15:45:58 -08:00
Moshe Shemesh	b544011f0e	net/mlx5e: Fix SWP offsets when vlan inserted by driver In case WQE includes inline header the vlan is inserted by driver even if vlan offload is set. On geneve over vlan interface where software parser is used the SWP offsets should be updated according to the added vlan. Fixes: `e3cfc7e6b7` ("net/mlx5e: TX, Add geneve tunnel stateless offload support") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-01-07 12:22:49 -08:00
Eran Ben Elisha	1880bc4e4a	net/mlx5e: Add TX port timestamp support Transmitted packet timestamping accuracy can be improved when using timestamp from the port, instead of packet CQE creation timestamp, as it better reflects the actual time of a packet's transmit. TX port timestamping is supported starting from ConnectX6-DX hardware. Although at the original completion, only CQE timestamp can be attached, we are able to get TX port timestamping via an additional completion over a special CQ associated with the SQ (in addition to the regular CQ). Driver to ignore the original packet completion timestamp, and report back the timestamp of the special CQ completion. If the absolute timestamp diff between the two completions is greater than 1 / 128 second, ignore the TX port timestamp as it has a jitter which is too big. No skb will be generate out of the extra completion. Allocate additional CQ per ptpsq, to receive the TX port timestamp. Driver to hold an skb FIFO in order to map between transmitted skb to the two expected completions. When using ptpsq, hold double refcount on the skb, to gaurantee it will not get released before both completions arrive. Expose dedicated counters of the ptp additional CQ and connect it to the TX health reporter. This patch improves TX Hardware timestamping offset to be less than 40ns at a 100Gbps line rate, compared to 600ns before. With that, making our HW compliant with G.8273.2 class C, and allow Linux systems to be deployed in the 5G telco edge, where this standard is a must. Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2020-12-08 11:28:47 -08:00
Eran Ben Elisha	145e5637d9	net/mlx5e: Add TX PTP port object support Add TX PTP port object support for better TX timestamping accuracy. Currently, driver supports CQE based TX port timestamp. Device also offers TX port timestamp, which has less jitter and better reflects the actual time of a packet's transmit. Define new driver layout called ptpsq, on which driver will create SQs that will support TX port timestamp for their transmitted packets. Driver to identify PTP TX skbs and steer them to these dedicated SQs as part of the select queue ndo. Driver to hold ptpsq per TC and report them at netif_set_real_num_tx_queues(). Add support for all needed functionality in order to xmit and poll completions received via ptpsq. Add ptpsq to the TX reporter recover, diagnose and dump methods. Creation of ptpsqs is disabled by default, and can be enabled via tx_port_ts private flag. This patch steer all timestamp related packets to a ptpsq, but it does not open the port timestamp support for it. The support will be added in the following patch. Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2020-12-08 11:28:46 -08:00
Eran Ben Elisha	0b676aaecc	net/mlx5e: Change skb fifo push/pop API to be used without SQ The skb fifo push/pop API used pre-defined attributes within the mlx5e_txqsq. In order to share the skb fifo API with other non-SQ use cases, change the API input to get newly defined mlx5e_skb_fifo struct. Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2020-12-08 11:28:46 -08:00
Aya Levin	4d0b7ef909	net/mlx5e: Allow CQ outside of channel context In order to be able to create a CQ outside of a channel context, remove cq->channel direct pointer. This requires adding a direct pointer to channel statistics, netdevice, priv and to mlx5_core in order to support CQs that are a part of mlx5e_channel. In addition, parameters the were previously derived from the channel like napi, NUMA node, channel stats and index are now assembled in struct mlx5e_create_cq_param which is given to mlx5e_open_cq() instead of channel pointer. Generalizing mlx5e_open_cq() allows opening CQ outside of channel context which will be used in following patches in the patch-set. Signed-off-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2020-12-08 11:28:45 -08:00
Tariq Toukan	b336e6b25e	net/mlx5e: kTLS, Enforce HW TX csum offload with kTLS Checksum calculation cannot be done in SW for TX kTLS HW offloaded packets. Offload it to the device, disregard the declared state of the TX csum offload feature. Fixes: `d2ead1f360` ("net/mlx5e: Add kTLS TX HW offload support") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Boris Pismenny <borisp@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-03 11:18:35 -08:00
Huy Nguyen	5cfb540ef2	net/mlx5e: Set IPsec WAs only in IP's non checksum partial case. The IP's checksum partial still requires L4 csum flag on Ethernet WQE. Make the IPsec WAs only for the IP's non checksum partial case (for example icmd packet) Fixes: `5be019040c` ("net/mlx5e: IPsec: Add Connect-X IPsec Tx data path offload") Signed-off-by: Huy Nguyen <huyn@mellanox.com> Reviewed-by: Raed Salem <raeds@nvidia.com> Reviewed-by: Alaa Hleihel <alaa@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2020-11-17 11:50:52 -08:00
Raed Salem	5be019040c	net/mlx5e: IPsec: Add Connect-X IPsec Tx data path offload In the TX data path, spot packets with xfrm stack IPsec offload indication. Fill Software-Parser segment in TX descriptor so that the hardware may parse the ESP protocol, and perform TX checksum offload on the inner payload. Support GSO, by providing the trailer data and ICV placeholder so HW can fill it post encryption operation. Padding alignment cannot be performed in HW (ConnectX-6Dx) due to a bug. Software can overcome this limitation by adding NETIF_F_HW_ESP to the gso_partial_features field in netdev so the packets being aligned by the stack. l4_inner_checksum cannot be offloaded by HW for IPsec tunnel type packet. Note that for GSO SKBs, the stack does not include an ESP trailer, unlike the non-GSO case. Below is the iperf3 performance report on two server of 24 cores Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz with ConnectX6-DX. All the bandwidth test uses iperf3 TCP traffic with packet size 128KB. Each tunnel uses one iperf3 stream with one thread (option -P1). TX crypto offload shows improvements on both bandwidth and CPU utilization. ---------------------------------------------------------------------- Mode \| Num tunnel \| BW \| Send CPU util \| Recv CPU util \| \| (Gbps) \| (Average %) \| (Average %) ---------------------------------------------------------------------- Cryto offload \| \| \| \| (RX only) \| 1 \| 4.7 \| 4.2 \| 3.5 ---------------------------------------------------------------------- Cryto offload \| \| \| \| (RX only) \| 24 \| 15.6 \| 20 \| 10 ---------------------------------------------------------------------- Non-offload \| 1 \| 4.6 \| 4 \| 5 ---------------------------------------------------------------------- Non-offload \| 24 \| 11.9 \| 16 \| 12 ---------------------------------------------------------------------- Cryto offload \| \| \| \| (TX & RX) \| 1 \| 11.9 \| 2.1 \| 5.9 ---------------------------------------------------------------------- Cryto offload \| \| \| \| (TX & RX) \| 24 \| 38 \| 9.5 \| 27.5 ---------------------------------------------------------------------- Cryto offload \| \| \| \| (TX only) \| 1 \| 4.7 \| 0.7 \| 5 ---------------------------------------------------------------------- Cryto offload \| \| \| \| (TX only) \| 24 \| 14.5 \| 6 \| 20 Regression tests show no degradation on non-ipsec and non-offload-ipsec traffics. The packet rate test uses pktgen UDP to transmit on single CPU, the instructions and cycles are measured on the transmit CPU. before: ---------------------------------------------------------------------- Non-offload \| 1 \| 4.7 \| 4.2 \| 5.1 ---------------------------------------------------------------------- Non-offload \| 24 \| 11.2 \| 14 \| 15 ---------------------------------------------------------------------- Non-ipsec \| 1 \| 28 \| 4 \| 5.7 ---------------------------------------------------------------------- Non-ipsec \| 24 \| 68.3 \| 17.8 \| 39.7 ---------------------------------------------------------------------- Non-ipsec packet rate(BURST=1000 BC=5 NCPUS=1 SIZE=60) 13.56Mpps, 456 instructions/pkt, 191 cycles/pkt after: ---------------------------------------------------------------------- Non-offload \| 1 \| 4.69 \| 4.2 \| 5 ---------------------------------------------------------------------- Non-offload \| 24 \| 11.9 \| 13.5 \| 15.1 ---------------------------------------------------------------------- Non-ipsec \| 1 \| 29 \| 3.2 \| 5.5 ---------------------------------------------------------------------- Non-ipsec \| 24 \| 68.2 \| 18.5 \| 39.8 ---------------------------------------------------------------------- Non-ipsec packet rate: 13.56Mpps, 472 instructions/pkt, 191 cycles/pkt Signed-off-by: Raed Salem <raeds@mellanox.com> Signed-off-by: Huy Nguyen <huyn@mellanox.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2020-10-12 15:37:45 -07:00
Maxim Mikityanskiy	5af75c747e	net/mlx5e: Enhanced TX MPWQE for SKBs This commit adds support for Enhanced TX MPWQE feature in the regular (SKB) data path. A MPWQE (multi-packet work queue element) can serve multiple packets, reducing the PCI bandwidth on control traffic. Two new stats (tx_mpwqe_blks and tx_mpwqe_pkts) are added. The feature is on by default and controlled by the skb_tx_mpwqe private flag. In a MPWQE, eseg is shared among all packets, so eseg-based offloads (IPSEC, GENEVE, checksum) run on a separate eseg that is compared to the eseg of the current MPWQE session to decide if the new packet can be added to the same session. MPWQE is not compatible with certain offloads and features, such as TLS offload, TSO, nonlinear SKBs. If such incompatible features are in use, the driver gracefully falls back to non-MPWQE. This change has no performance impact in TCP single stream test and XDP_TX single stream test. UDP pktgen, 64-byte packets, single stream, MPWQE off: Packet rate: 16.96 Mpps (±0.12 Mpps) -> 17.01 Mpps (±0.20 Mpps) Instructions per packet: 421 -> 429 Cycles per packet: 156 -> 161 Instructions per cycle: 2.70 -> 2.67 UDP pktgen, 64-byte packets, single stream, MPWQE on: Packet rate: 16.96 Mpps (±0.12 Mpps) -> 20.94 Mpps (±0.33 Mpps) Instructions per packet: 421 -> 329 Cycles per packet: 156 -> 123 Instructions per cycle: 2.70 -> 2.67 Enabling MPWQE can reduce PCI bandwidth: PCI Gen2, pktgen at fixed rate of 36864000 pps on 24 CPU cores: Inbound PCI utilization with MPWQE off: 80.3% Inbound PCI utilization with MPWQE on: 59.0% PCI Gen3, pktgen at fixed rate of 56064000 pps on 24 CPU cores: Inbound PCI utilization with MPWQE off: 65.4% Inbound PCI utilization with MPWQE on: 49.3% Enabling MPWQE can also reduce CPU load, increasing the packet rate in case of CPU bottleneck: PCI Gen2, pktgen at full rate on 24 CPU cores: Packet rate with MPWQE off: 37.5 Mpps Packet rate with MPWQE on: 49.0 Mpps PCI Gen3, pktgen at full rate on 24 CPU cores: Packet rate with MPWQE off: 57.0 Mpps Packet rate with MPWQE on: 66.8 Mpps Burst size in all pktgen tests is 32. CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64) NIC: Mellanox ConnectX-6 Dx GCC 10.2.0 Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2020-09-21 19:41:16 -07:00
Maxim Mikityanskiy	67044a88aa	net/mlx5e: Move TX code into functions to be used by MPWQE mlx5e_txwqe_complete performs some actions that can be taken to separate functions: 1. Update the flags needed for hardware timestamping. 2. Stop the TX queue if it's full. Take these actions into separate functions to be reused by the MPWQE code in the following commit and to maintain clear responsibilities of functions. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2020-09-21 19:41:16 -07:00
Maxim Mikityanskiy	338c46c636	net/mlx5e: Support multiple SKBs in a TX WQE TX MPWQE support for SKBs is coming in one of the following patches, and a single MPWQE can send multiple SKBs. This commit prepares the TX path code to handle such cases: 1. An additional FIFO for SKBs is added, just like the FIFO for DMA chunks. 2. struct mlx5e_tx_wqe_info will contain num_fifo_pkts. If a given WQE contains only one packet, num_fifo_pkts will be zero, and the SKB will be stored in mlx5e_tx_wqe_info, as usual. If num_fifo_pkts > 0, the SKB pointer will be NULL, and the SKBs will be stored in the FIFO. This change has no performance impact in TCP single stream test and XDP_TX single stream test. When compiled with a recent GCC, this change shows no visible performance impact on UDP pktgen (burst 32) single stream test either: Packet rate: 16.95 Mpps (±0.15 Mpps) -> 16.96 Mpps (±0.12 Mpps) Instructions per packet: 429 -> 421 Cycles per packet: 160 -> 156 Instructions per cycle: 2.69 -> 2.70 CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64) NIC: Mellanox ConnectX-6 Dx GCC 10.2.0 Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2020-09-21 19:41:15 -07:00
Maxim Mikityanskiy	56e4da669a	net/mlx5e: Move the TLS resync check out of the function Before this patch, mlx5e_ktls_tx_handle_resync_dump_comp checked for resync_dump_frag_page. It happened for all WQEs without an SKB, including padding WQEs, and required a function call. Normally, padding WQEs happen more often than TLS resyncs. Take this check out of the function and put it to an inline function to save a call on all padding WQEs. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2020-09-21 19:41:15 -07:00
Maxim Mikityanskiy	97e3afd64d	net/mlx5e: Unify constants for WQE_EMPTY_DS_COUNT A constant for the number of DS in an empty WQE (i.e. a WQE without data segments) is needed in multiple places (normal TX data path, MPWQE in XDP), but currently we have a constant for XDP and an inline formula in normal TX. This patch introduces a common constant. Additionally, mlx5e_xdp_mpwqe_session_start is converted to use struct assignment, because the code nearby is touched. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2020-09-21 19:41:15 -07:00
Maxim Mikityanskiy	8e4b53f60f	net/mlx5e: Refactor xmit functions A huge function mlx5e_sq_xmit was split into several to achieve multiple goals: 1. Reuse the code in IPoIB. 2. Better intergrate with TLS, IPSEC, GENEVE and checksum offloads. Now it's possible to reserve space in the WQ before running eseg-based offloads, so: 2.1. It's not needed to copy cseg and eseg after mlx5e_fill_sq_frag_edge anymore. 2.2. mlx5e_txqsq_get_next_pi will be used instead of the legacy mlx5e_fill_sq_frag_edge for better code maintainability and reuse. 3. Prepare for the upcoming TX MPWQE for SKBs. It will intervene after mlx5e_sq_calc_wqe_attr to check if it's possible to use MPWQE, and the code flow will split into two paths: MPWQE and non-MPWQE. Two high-level functions are provided to send packets: * mlx5e_xmit is called by the networking stack, runs offloads and sends the packet. In one of the following patches, MPWQE support will be added to this flow. * mlx5e_sq_xmit_simple is called by the TLS offload, runs only the checksum offload and sends the packet. This change has no performance impact in TCP single stream test and XDP_TX single stream test. When compiled with a recent GCC, this change shows no visible performance impact on UDP pktgen (burst 32) single stream test either: Packet rate: 16.86 Mpps (±0.15 Mpps) -> 16.95 Mpps (±0.15 Mpps) Instructions per packet: 434 -> 429 Cycles per packet: 158 -> 160 Instructions per cycle: 2.75 -> 2.69 CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64) NIC: Mellanox ConnectX-6 Dx GCC 10.2.0 Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2020-09-21 19:41:14 -07:00
Maxim Mikityanskiy	d02dfcd51f	net/mlx5e: Move mlx5e_tx_wqe_inline_mode to en_tx.c Move mlx5e_tx_wqe_inline_mode from en/txrx.h to en_tx.c as it's only used there. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2020-09-21 19:41:14 -07:00
Maxim Mikityanskiy	8ba6f18399	net/mlx5e: Use struct assignment to initialize mlx5e_tx_wqe_info Struct assignment guarantees that all fields of the structure are initialized (those that are not mentioned are zeroed). It makes code mode robust and reduces chances for unpredictable behavior when one forgets to reset some field and it holds an old value from previous iterations of using the structure. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2020-09-21 19:41:13 -07:00

1 2 3 4

183 Commits