Commit Graph

5691 Commits

Author SHA1 Message Date
Dan Carpenter d5972f1000 net/mlx5: Fix check for allocation failure in comp_irqs_request_pci()
This function accidentally dereferences "cpus" instead of returning
directly.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202305200354.KV3jU94w-lkp@intel.com/
Fixes: b48a0f72bc ("net/mlx5: Refactor completion irq request/release code")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-24 20:44:19 -07:00
Yevgeny Kliteynik fe5c2d3aef net/mlx5: DR, Add missing mutex init/destroy in pattern manager
Add missing mutex init/destroy as caught by the lock's debug warning:
 DEBUG_LOCKS_WARN_ON(lock->magic != lock)

Fixes: da5d0027d6 ("net/mlx5: DR, Add cache for modify header pattern")
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-24 20:44:19 -07:00
Jianbo Liu c4c24fc30c net/mlx5e: Move Ethernet driver debugfs to profile init callback
As priv->dfs_root is cleared, and therefore missed, when change
eswitch mode, move the creation of the root debugfs to the init
callback of mlx5e_nic_profile and mlx5e_uplink_rep_profile, and
the destruction to the cleanup callback for symmeter.

Fixes: 288eca60cc ("net/mlx5e: Add Ethernet driver debugfs")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-24 20:44:18 -07:00
Dmytro Linkin bdf274750f net/mlx5e: Don't attach netdev profile while handling internal error
As part of switchdev mode disablement, driver changes port netdevice
profile from uplink to nic. If this process is triggered by health
recovery flow (PCI reset, for ex.) profile attach would fail because all
fw commands aborted when internal error flag is set. As a result, nic
netdevice profile is not attached and driver fails to rollback to uplink
profile, which leave driver in broken state and cause crash later.

To handle broken state do netdevice profile initialization only instead
of full attachment and release mdev resources on driver suspend as
expected. Actual netdevice attachment is done during driver load.

Fixes: c4d7eb5768 ("net/mxl5e: Add change profile method")
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-24 20:44:18 -07:00
Vlad Buslov 5d862ec631 net/mlx5: Fix post parse infra to only parse every action once
Caller of mlx5e_tc_act_post_parse() needs it to parse only the subset of
actions starting after previous split and ending at the current action.
However, that range is not provided as arguments and
mlx5e_tc_act_post_parse() uses generic flow_action_for_each() that iterates
over all flow actions. Not only this is redundant, it also causes a bug
when mlx5e_tc_act->post_parse() callback is not idempotent since it will be
called for every split. For example, ct action tc_act_post_parse_ct()
callback obtains a reference to mlx5_ct_ft instance and calling it several
times during parsing stage will cause reference counter imbalance.

Fix the issue by providing a proper action range of the current split
subset to mlx5e_tc_act_post_parse() and only calling
mlx5e_tc_act->post_parse() for actions inside the subset range.

Fixes: 8300f22526 ("net/mlx5e: Create new flow attr for multi table actions")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-24 20:44:18 -07:00
Dragos Tatulea 1db1f21cae net/mlx5e: Use query_special_contexts cmd only once per mdev
Don't query the firmware so many times (num rqs * num wqes * wqe frags)
because it slows down linearly the interface creation time when the
product is larger. Do it only once per mdev and store the result in
mlx5e_param.

Due to helper function being called from different files, move it to
an appropriate location. Rename the function with a proper prefix and
add a small cleanup.

This fix applies only for legacy rq.

Fixes: 1b1e486883 ("net/mlx5e: Use query_special_contexts for mkeys")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-24 20:44:18 -07:00
Shay Drory 341a80de24 net/mlx5: fw_tracer, Fix event handling
mlx5 driver needs to parse traces with event_id inside the range of
first_string_trace and num_string_trace. However, mlx5 is parsing all
events with event_id >= first_string_trace.

Fix it by checking for the correct range.

Fixes: c71ad41ccb ("net/mlx5: FW tracer, events handling")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-24 20:44:17 -07:00
Shay Drory b4646da057 net/mlx5: SF, Drain health before removing device
There is no point in recovery during device removal. Also, if health
work started need to wait for it to avoid races and NULL pointer
access.

Hence, drain health WQ before removing device.

Fixes: 1958fc2f07 ("net/mlx5: SF, Add auxiliary device driver")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-24 20:44:17 -07:00
Shay Drory 824c8dc4a4 net/mlx5: Drain health before unregistering devlink
mlx5 health mechanism is using devlink APIs, which are using devlink
notify APIs. After the cited patch, using devlink notify APIs after
devlink is unregistered triggers a WARN_ON().
Hence, drain health WQ before devlink is unregistered.

Fixes: cf53021740 ("devlink: Notify users when objects are accessible")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-24 20:44:17 -07:00
Maher Sanalla 623efc4cbd net/mlx5e: Do not update SBCM when prio2buffer command is invalid
The shared buffer pools configuration which are stored in the SBCM
register are updated when the user changes the prio2buffer mapping.

However, in case the user desired prio2buffer change is invalid,
which can occur due to mapping a lossless priority to a not large enough
buffer, the SBCM update should not be performed, as the user command is
failed.

Thus, Perform the SBCM update only after xoff threshold calculation is
performed and the user prio2buffer mapping is validated.

Fixes: a440030d89 ("net/mlx5e: Update shared buffer along with device buffer changes")
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-24 20:44:17 -07:00
Maher Sanalla 81fe2be062 net/mlx5e: Consider internal buffers size in port buffer calculations
Currently, when a user triggers a change in port buffer headroom
(buffers 0-7), the driver checks that the requested headroom does
not exceed the total port buffer size. However, this check does not
take into account the internal buffers (buffers 8-9), which are also
part of the total port buffer. This can result in treating invalid port
buffer change requests as valid, causing unintended changes to the shared
buffer.

To address this, include the internal buffers size in the calculation of
available port buffer space which ensures that port buffer requests do not
exceed the correct limit.

Furthermore, remove internal buffers (8-9) size from the total_size
calculation as these buffers are reserved for internal use and are not
exposed to the user.

While at it, add verbosity to the debug prints in
mlx5e_port_query_buffer() function to ease future debugging.

Fixes: ecdf2dadee ("net/mlx5e: Receive buffer support for DCBX")
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-24 20:44:16 -07:00
Chris Mi 37c3b9fa7c net/mlx5e: Prevent encap offload when neigh update is running
The cited commit adds a compeletion to remove dependency on rtnl
lock. But it causes a deadlock for multiple encapsulations:

 crash> bt ffff8aece8a64000
 PID: 1514557  TASK: ffff8aece8a64000  CPU: 3    COMMAND: "tc"
  #0 [ffffa6d14183f368] __schedule at ffffffffb8ba7f45
  #1 [ffffa6d14183f3f8] schedule at ffffffffb8ba8418
  #2 [ffffa6d14183f418] schedule_preempt_disabled at ffffffffb8ba8898
  #3 [ffffa6d14183f428] __mutex_lock at ffffffffb8baa7f8
  #4 [ffffa6d14183f4d0] mutex_lock_nested at ffffffffb8baabeb
  #5 [ffffa6d14183f4e0] mlx5e_attach_encap at ffffffffc0f48c17 [mlx5_core]
  #6 [ffffa6d14183f628] mlx5e_tc_add_fdb_flow at ffffffffc0f39680 [mlx5_core]
  #7 [ffffa6d14183f688] __mlx5e_add_fdb_flow at ffffffffc0f3b636 [mlx5_core]
  #8 [ffffa6d14183f6f0] mlx5e_tc_add_flow at ffffffffc0f3bcdf [mlx5_core]
  #9 [ffffa6d14183f728] mlx5e_configure_flower at ffffffffc0f3c1d1 [mlx5_core]
 #10 [ffffa6d14183f790] mlx5e_rep_setup_tc_cls_flower at ffffffffc0f3d529 [mlx5_core]
 #11 [ffffa6d14183f7a0] mlx5e_rep_setup_tc_cb at ffffffffc0f3d714 [mlx5_core]
 #12 [ffffa6d14183f7b0] tc_setup_cb_add at ffffffffb8931bb8
 #13 [ffffa6d14183f810] fl_hw_replace_filter at ffffffffc0dae901 [cls_flower]
 #14 [ffffa6d14183f8d8] fl_change at ffffffffc0db5c57 [cls_flower]
 #15 [ffffa6d14183f970] tc_new_tfilter at ffffffffb8936047
 #16 [ffffa6d14183fac8] rtnetlink_rcv_msg at ffffffffb88c7c31
 #17 [ffffa6d14183fb50] netlink_rcv_skb at ffffffffb8942853
 #18 [ffffa6d14183fbc0] rtnetlink_rcv at ffffffffb88c1835
 #19 [ffffa6d14183fbd0] netlink_unicast at ffffffffb8941f27
 #20 [ffffa6d14183fc18] netlink_sendmsg at ffffffffb8942245
 #21 [ffffa6d14183fc98] sock_sendmsg at ffffffffb887d482
 #22 [ffffa6d14183fcb8] ____sys_sendmsg at ffffffffb887d81a
 #23 [ffffa6d14183fd38] ___sys_sendmsg at ffffffffb88806e2
 #24 [ffffa6d14183fe90] __sys_sendmsg at ffffffffb88807a2
 #25 [ffffa6d14183ff28] __x64_sys_sendmsg at ffffffffb888080f
 #26 [ffffa6d14183ff38] do_syscall_64 at ffffffffb8b9b6a8
 #27 [ffffa6d14183ff50] entry_SYSCALL_64_after_hwframe at ffffffffb8c0007c
 crash> bt 0xffff8aeb07544000
 PID: 1110766  TASK: ffff8aeb07544000  CPU: 0    COMMAND: "kworker/u20:9"
  #0 [ffffa6d14e6b7bd8] __schedule at ffffffffb8ba7f45
  #1 [ffffa6d14e6b7c68] schedule at ffffffffb8ba8418
  #2 [ffffa6d14e6b7c88] schedule_timeout at ffffffffb8baef88
  #3 [ffffa6d14e6b7d10] wait_for_completion at ffffffffb8ba968b
  #4 [ffffa6d14e6b7d60] mlx5e_take_all_encap_flows at ffffffffc0f47ec4 [mlx5_core]
  #5 [ffffa6d14e6b7da0] mlx5e_rep_update_flows at ffffffffc0f3e734 [mlx5_core]
  #6 [ffffa6d14e6b7df8] mlx5e_rep_neigh_update at ffffffffc0f400bb [mlx5_core]
  #7 [ffffa6d14e6b7e50] process_one_work at ffffffffb80acc9c
  #8 [ffffa6d14e6b7ed0] worker_thread at ffffffffb80ad012
  #9 [ffffa6d14e6b7f10] kthread at ffffffffb80b615d
 #10 [ffffa6d14e6b7f50] ret_from_fork at ffffffffb8001b2f

After the first encap is attached, flow will be added to encap
entry's flows list. If neigh update is running at this time, the
following encaps of the flow can't hold the encap_tbl_lock and
sleep. If neigh update thread is waiting for that flow's init_done,
deadlock happens.

Fix it by holding lock outside of the for loop. If neigh update is
running, prevent encap flows from offloading. Since the lock is held
outside of the for loop, concurrent creation of encap entries is not
allowed. So remove unnecessary wait_for_completion call for res_ready.

Fixes: 95435ad799 ("net/mlx5e: Only access fully initialized flows in neigh update")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-24 20:44:16 -07:00
Chris Mi e2ab5aa11f net/mlx5e: Extract remaining tunnel encap code to dedicated file
Move set_encap_dests() and clean_encap_dests() to the tunnel encap
dedicated file. And rename them to mlx5e_tc_tun_encap_dests_set()
and mlx5e_tc_tun_encap_dests_unset().

No functional change in this patch. It is needed in the next patch.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-24 20:44:16 -07:00
Shay Drory 1da438c0ae net/mlx5: Fix indexing of mlx5_irq
After the cited patch, mlx5_irq xarray index can be different then
mlx5_irq MSIX table index.
Fix it by storing both mlx5_irq xarray index and MSIX table index.

Fixes: 3354822cde ("net/mlx5: Use dynamic msix vectors allocation")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-22 22:38:06 -07:00
Shay Drory ef8c063cf8 net/mlx5: Fix irq affinity management
The cited patch deny the user of changing the affinity of mlx5 irqs,
which break backward compatibility.
Hence, allow the user to change the affinity of mlx5 irqs.

Fixes: bbac70c741 ("net/mlx5: Use newer affinity descriptor")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-22 22:38:06 -07:00
Shay Drory 9c2d080109 net/mlx5: Free irqs only on shutdown callback
Whenever a shutdown is invoked, free irqs only and keep mlx5_irq
synthetic wrapper intact in order to avoid use-after-free on
system shutdown.

for example:
==================================================================
BUG: KASAN: use-after-free in _find_first_bit+0x66/0x80
Read of size 8 at addr ffff88823fc0d318 by task kworker/u192:0/13608

CPU: 25 PID: 13608 Comm: kworker/u192:0 Tainted: G    B   W  O  6.1.21-cloudflare-kasan-2023.3.21 #1
Hardware name: GIGABYTE R162-R2-GEN0/MZ12-HD2-CD, BIOS R14 05/03/2021
Workqueue: mlx5e mlx5e_tx_timeout_work [mlx5_core]
Call Trace:
  <TASK>
  dump_stack_lvl+0x34/0x48
  print_report+0x170/0x473
  ? _find_first_bit+0x66/0x80
  kasan_report+0xad/0x130
  ? _find_first_bit+0x66/0x80
  _find_first_bit+0x66/0x80
  mlx5e_open_channels+0x3c5/0x3a10 [mlx5_core]
  ? console_unlock+0x2fa/0x430
  ? _raw_spin_lock_irqsave+0x8d/0xf0
  ? _raw_spin_unlock_irqrestore+0x42/0x80
  ? preempt_count_add+0x7d/0x150
  ? __wake_up_klogd.part.0+0x7d/0xc0
  ? vprintk_emit+0xfe/0x2c0
  ? mlx5e_trigger_napi_sched+0x40/0x40 [mlx5_core]
  ? dev_attr_show.cold+0x35/0x35
  ? devlink_health_do_dump.part.0+0x174/0x340
  ? devlink_health_report+0x504/0x810
  ? mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
  ? mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
  ? process_one_work+0x680/0x1050
  mlx5e_safe_switch_params+0x156/0x220 [mlx5_core]
  ? mlx5e_switch_priv_channels+0x310/0x310 [mlx5_core]
  ? mlx5_eq_poll_irq_disabled+0xb6/0x100 [mlx5_core]
  mlx5e_tx_reporter_timeout_recover+0x123/0x240 [mlx5_core]
  ? __mutex_unlock_slowpath.constprop.0+0x2b0/0x2b0
  devlink_health_reporter_recover+0xa6/0x1f0
  devlink_health_report+0x2f7/0x810
  ? vsnprintf+0x854/0x15e0
  mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
  ? mlx5e_reporter_tx_err_cqe+0x1a0/0x1a0 [mlx5_core]
  ? mlx5e_tx_reporter_timeout_dump+0x50/0x50 [mlx5_core]
  ? mlx5e_tx_reporter_dump_sq+0x260/0x260 [mlx5_core]
  ? newidle_balance+0x9b7/0xe30
  ? psi_group_change+0x6a7/0xb80
  ? mutex_lock+0x96/0xf0
  ? __mutex_lock_slowpath+0x10/0x10
  mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
  process_one_work+0x680/0x1050
  worker_thread+0x5a0/0xeb0
  ? process_one_work+0x1050/0x1050
  kthread+0x2a2/0x340
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x22/0x30
  </TASK>

Freed by task 1:
  kasan_save_stack+0x23/0x50
  kasan_set_track+0x21/0x30
  kasan_save_free_info+0x2a/0x40
  ____kasan_slab_free+0x169/0x1d0
  slab_free_freelist_hook+0xd2/0x190
  __kmem_cache_free+0x1a1/0x2f0
  irq_pool_free+0x138/0x200 [mlx5_core]
  mlx5_irq_table_destroy+0xf6/0x170 [mlx5_core]
  mlx5_core_eq_free_irqs+0x74/0xf0 [mlx5_core]
  shutdown+0x194/0x1aa [mlx5_core]
  pci_device_shutdown+0x75/0x120
  device_shutdown+0x35c/0x620
  kernel_restart+0x60/0xa0
  __do_sys_reboot+0x1cb/0x2c0
  do_syscall_64+0x3b/0x90
  entry_SYSCALL_64_after_hwframe+0x4b/0xb5

The buggy address belongs to the object at ffff88823fc0d300
  which belongs to the cache kmalloc-192 of size 192
The buggy address is located 24 bytes inside of
  192-byte region [ffff88823fc0d300, ffff88823fc0d3c0)

The buggy address belongs to the physical page:
page:0000000010139587 refcount:1 mapcount:0 mapping:0000000000000000
index:0x0 pfn:0x23fc0c
head:0000000010139587 order:1 compound_mapcount:0 compound_pincount:0
flags: 0x2ffff800010200(slab|head|node=0|zone=2|lastcpupid=0x1ffff)
raw: 002ffff800010200 0000000000000000 dead000000000122 ffff88810004ca00
raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff88823fc0d200: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff88823fc0d280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
 >ffff88823fc0d300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                             ^
  ffff88823fc0d380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
  ffff88823fc0d400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
general protection fault, probably for non-canonical address
0xdffffc005c40d7ac: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: probably user-memory-access in range [0x00000002e206bd60-0x00000002e206bd67]
CPU: 25 PID: 13608 Comm: kworker/u192:0 Tainted: G    B   W  O  6.1.21-cloudflare-kasan-2023.3.21 #1
Hardware name: GIGABYTE R162-R2-GEN0/MZ12-HD2-CD, BIOS R14 05/03/2021
Workqueue: mlx5e mlx5e_tx_timeout_work [mlx5_core]
RIP: 0010:__alloc_pages+0x141/0x5c0
Call Trace:
  <TASK>
  ? sysvec_apic_timer_interrupt+0xa0/0xc0
  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
  ? __alloc_pages_slowpath.constprop.0+0x1ec0/0x1ec0
  ? _raw_spin_unlock_irqrestore+0x3d/0x80
  __kmalloc_large_node+0x80/0x120
  ? kvmalloc_node+0x4e/0x170
  __kmalloc_node+0xd4/0x150
  kvmalloc_node+0x4e/0x170
  mlx5e_open_channels+0x631/0x3a10 [mlx5_core]
  ? console_unlock+0x2fa/0x430
  ? _raw_spin_lock_irqsave+0x8d/0xf0
  ? _raw_spin_unlock_irqrestore+0x42/0x80
  ? preempt_count_add+0x7d/0x150
  ? __wake_up_klogd.part.0+0x7d/0xc0
  ? vprintk_emit+0xfe/0x2c0
  ? mlx5e_trigger_napi_sched+0x40/0x40 [mlx5_core]
  ? dev_attr_show.cold+0x35/0x35
  ? devlink_health_do_dump.part.0+0x174/0x340
  ? devlink_health_report+0x504/0x810
  ? mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
  ? mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
  ? process_one_work+0x680/0x1050
  mlx5e_safe_switch_params+0x156/0x220 [mlx5_core]
  ? mlx5e_switch_priv_channels+0x310/0x310 [mlx5_core]
  ? mlx5_eq_poll_irq_disabled+0xb6/0x100 [mlx5_core]
  mlx5e_tx_reporter_timeout_recover+0x123/0x240 [mlx5_core]
  ? __mutex_unlock_slowpath.constprop.0+0x2b0/0x2b0
  devlink_health_reporter_recover+0xa6/0x1f0
  devlink_health_report+0x2f7/0x810
  ? vsnprintf+0x854/0x15e0
  mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
  ? mlx5e_reporter_tx_err_cqe+0x1a0/0x1a0 [mlx5_core]
  ? mlx5e_tx_reporter_timeout_dump+0x50/0x50 [mlx5_core]
  ? mlx5e_tx_reporter_dump_sq+0x260/0x260 [mlx5_core]
  ? newidle_balance+0x9b7/0xe30
  ? psi_group_change+0x6a7/0xb80
  ? mutex_lock+0x96/0xf0
  ? __mutex_lock_slowpath+0x10/0x10
  mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
  process_one_work+0x680/0x1050
  worker_thread+0x5a0/0xeb0
  ? process_one_work+0x1050/0x1050
  kthread+0x2a2/0x340
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x22/0x30
  </TASK>
---[ end trace 0000000000000000  ]---
RIP: 0010:__alloc_pages+0x141/0x5c0
Code: e0 39 a3 96 89 e9 b8 22 01 32 01 83 e1 0f 48 89 fa 01 c9 48 c1 ea
03 d3 f8 83 e0 03 89 44 24 6c 48 b8 00 00 00 00 00 fc ff df <80> 3c 02
00 0f 85 fc 03 00 00 89 e8 4a 8b 14 f5 e0 39 a3 96 4c 89
RSP: 0018:ffff888251f0f438 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 1ffff1104a3e1e8b RCX: 0000000000000000
RDX: 000000005c40d7ac RSI: 0000000000000003 RDI: 00000002e206bd60
RBP: 0000000000052dc0 R08: ffff8882b0044218 R09: ffff8882b0045e8a
R10: fffffbfff300fefc R11: ffff888167af4000 R12: 0000000000000003
R13: 0000000000000000 R14: 00000000696c7070 R15: ffff8882373f4380
FS:  0000000000000000(0000) GS:ffff88bf2be80000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005641d031eee8 CR3: 0000002e7ca14000 CR4: 0000000000350ee0
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range:
0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception  ]---]

Reported-by: Frederick Lawler <fred@cloudflare.com>
Link: https://lore.kernel.org/netdev/be5b9271-7507-19c5-ded1-fa78f1980e69@cloudflare.com
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-22 22:38:06 -07:00
Shay Drory 1f893f57a3 net/mlx5: Devcom, serialize devcom registration
From one hand, mlx5 driver is allowing to probe PFs in parallel.
From the other hand, devcom, which is a share resource between PFs, is
registered without any lock. This might resulted in memory problems.

Hence, use the global mlx5_dev_list_lock in order to serialize devcom
registration.

Fixes: fadd59fc50 ("net/mlx5: Introduce inter-device communication mechanism")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-22 22:38:06 -07:00
Shay Drory af87194352 net/mlx5: Devcom, fix error flow in mlx5_devcom_register_device
In case devcom allocation is failed, mlx5 is always freeing the priv.
However, this priv might have been allocated by a different thread,
and freeing it might lead to use-after-free bugs.
Fix it by freeing the priv only in case it was allocated by the
running thread.

Fixes: fadd59fc50 ("net/mlx5: Introduce inter-device communication mechanism")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-22 22:38:06 -07:00
Shay Drory 8c253dfc89 net/mlx5: E-switch, Devcom, sync devcom events and devcom comp register
devcom events are sent to all registered component. Following the
cited patch, it is possible for two components, e.g.: two eswitches,
to send devcom events, while both components are registered. This
means eswitch layer will do double un/pairing, which is double
allocation and free of resources, even though only one un/pairing is
needed. flow example:

	cpu0					cpu1
	----					----

 mlx5_devlink_eswitch_mode_set(dev0)
  esw_offloads_devcom_init()
   mlx5_devcom_register_component(esw0)
                                         mlx5_devlink_eswitch_mode_set(dev1)
                                          esw_offloads_devcom_init()
                                           mlx5_devcom_register_component(esw1)
                                           mlx5_devcom_send_event()
   mlx5_devcom_send_event()

Hence, check whether the eswitches are already un/paired before
free/allocation of resources.

Fixes: 09b278462f ("net: devlink: enable parallel ops on netlink interface")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-22 22:38:06 -07:00
Paul Blakey dfa1e46d60 net/mlx5e: TC, Fix using eswitch mapping in nic mode
Cited patch is using the eswitch object mapping pool while
in nic mode where it isn't initialized. This results in the
trace below [0].

Fix that by using either nic or eswitch object mapping pool
depending if eswitch is enabled or not.

[0]:
[  826.446057] ==================================================================
[  826.446729] BUG: KASAN: slab-use-after-free in mlx5_add_flow_rules+0x30/0x490 [mlx5_core]
[  826.447515] Read of size 8 at addr ffff888194485830 by task tc/6233

[  826.448243] CPU: 16 PID: 6233 Comm: tc Tainted: G        W          6.3.0-rc6+ #1
[  826.448890] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[  826.449785] Call Trace:
[  826.450052]  <TASK>
[  826.450302]  dump_stack_lvl+0x33/0x50
[  826.450650]  print_report+0xc2/0x610
[  826.450998]  ? __virt_addr_valid+0xb1/0x130
[  826.451385]  ? mlx5_add_flow_rules+0x30/0x490 [mlx5_core]
[  826.451935]  kasan_report+0xae/0xe0
[  826.452276]  ? mlx5_add_flow_rules+0x30/0x490 [mlx5_core]
[  826.452829]  mlx5_add_flow_rules+0x30/0x490 [mlx5_core]
[  826.453368]  ? __kmalloc_node+0x5a/0x120
[  826.453733]  esw_add_restore_rule+0x20f/0x270 [mlx5_core]
[  826.454288]  ? mlx5_eswitch_add_send_to_vport_meta_rule+0x260/0x260 [mlx5_core]
[  826.455011]  ? mutex_unlock+0x80/0xd0
[  826.455361]  ? __mutex_unlock_slowpath.constprop.0+0x210/0x210
[  826.455862]  ? mapping_add+0x2cb/0x440 [mlx5_core]
[  826.456425]  mlx5e_tc_action_miss_mapping_get+0x139/0x180 [mlx5_core]
[  826.457058]  ? mlx5e_tc_update_skb_nic+0xb0/0xb0 [mlx5_core]
[  826.457636]  ? __kasan_kmalloc+0x77/0x90
[  826.458000]  ? __kmalloc+0x57/0x120
[  826.458336]  mlx5_tc_ct_flow_offload+0x325/0xe40 [mlx5_core]
[  826.458916]  ? ct_kernel_enter.constprop.0+0x48/0xa0
[  826.459360]  ? mlx5_tc_ct_parse_action+0xf0/0xf0 [mlx5_core]
[  826.459933]  ? mlx5e_mod_hdr_attach+0x491/0x520 [mlx5_core]
[  826.460507]  ? mlx5e_mod_hdr_get+0x12/0x20 [mlx5_core]
[  826.461046]  ? mlx5e_tc_attach_mod_hdr+0x154/0x170 [mlx5_core]
[  826.461635]  mlx5e_configure_flower+0x969/0x2110 [mlx5_core]
[  826.462217]  ? _raw_spin_lock_bh+0x85/0xe0
[  826.462597]  ? __mlx5e_add_fdb_flow+0x750/0x750 [mlx5_core]
[  826.463163]  ? kasan_save_stack+0x2e/0x40
[  826.463534]  ? down_read+0x115/0x1b0
[  826.463878]  ? down_write_killable+0x110/0x110
[  826.464288]  ? tc_setup_action.part.0+0x9f/0x3b0
[  826.464701]  ? mlx5e_is_uplink_rep+0x4c/0x90 [mlx5_core]
[  826.465253]  ? mlx5e_tc_reoffload_flows_work+0x130/0x130 [mlx5_core]
[  826.465878]  tc_setup_cb_add+0x112/0x250
[  826.466247]  fl_hw_replace_filter+0x230/0x310 [cls_flower]
[  826.466724]  ? fl_hw_destroy_filter+0x1a0/0x1a0 [cls_flower]
[  826.467212]  fl_change+0x14e1/0x2030 [cls_flower]
[  826.467636]  ? sock_def_readable+0x89/0x120
[  826.468019]  ? fl_tmplt_create+0x2d0/0x2d0 [cls_flower]
[  826.468509]  ? kasan_unpoison+0x23/0x50
[  826.468873]  ? get_random_u16+0x180/0x180
[  826.469244]  ? __radix_tree_lookup+0x2b/0x130
[  826.469640]  ? fl_get+0x7b/0x140 [cls_flower]
[  826.470042]  ? fl_mask_put+0x200/0x200 [cls_flower]
[  826.470478]  ? __mutex_unlock_slowpath.constprop.0+0x210/0x210
[  826.470973]  ? fl_tmplt_create+0x2d0/0x2d0 [cls_flower]
[  826.471427]  tc_new_tfilter+0x644/0x1050
[  826.471795]  ? tc_get_tfilter+0x860/0x860
[  826.472170]  ? __thaw_task+0x130/0x130
[  826.472525]  ? arch_stack_walk+0x98/0xf0
[  826.472892]  ? cap_capable+0x9f/0xd0
[  826.473235]  ? security_capable+0x47/0x60
[  826.473608]  rtnetlink_rcv_msg+0x1d5/0x550
[  826.473985]  ? rtnl_calcit.isra.0+0x1f0/0x1f0
[  826.474383]  ? __stack_depot_save+0x35/0x4c0
[  826.474779]  ? kasan_save_stack+0x2e/0x40
[  826.475149]  ? kasan_save_stack+0x1e/0x40
[  826.475518]  ? __kasan_record_aux_stack+0x9f/0xb0
[  826.475939]  ? task_work_add+0x77/0x1c0
[  826.476305]  netlink_rcv_skb+0xe0/0x210
[  826.476661]  ? rtnl_calcit.isra.0+0x1f0/0x1f0
[  826.477057]  ? netlink_ack+0x7c0/0x7c0
[  826.477412]  ? rhashtable_jhash2+0xef/0x150
[  826.477796]  ? _copy_from_iter+0x105/0x770
[  826.484386]  netlink_unicast+0x346/0x490
[  826.484755]  ? netlink_attachskb+0x400/0x400
[  826.485145]  ? kernel_text_address+0xc2/0xd0
[  826.485535]  netlink_sendmsg+0x3b0/0x6c0
[  826.485902]  ? kernel_text_address+0xc2/0xd0
[  826.486296]  ? netlink_unicast+0x490/0x490
[  826.486671]  ? iovec_from_user.part.0+0x7a/0x1a0
[  826.487083]  ? netlink_unicast+0x490/0x490
[  826.487461]  sock_sendmsg+0x73/0xc0
[  826.487803]  ____sys_sendmsg+0x364/0x380
[  826.488186]  ? import_iovec+0x7/0x10
[  826.488531]  ? kernel_sendmsg+0x30/0x30
[  826.488893]  ? __copy_msghdr+0x180/0x180
[  826.489258]  ? kasan_save_stack+0x2e/0x40
[  826.489629]  ? kasan_save_stack+0x1e/0x40
[  826.490002]  ? __kasan_record_aux_stack+0x9f/0xb0
[  826.490424]  ? __call_rcu_common.constprop.0+0x46/0x580
[  826.490876]  ___sys_sendmsg+0xdf/0x140
[  826.491231]  ? copy_msghdr_from_user+0x110/0x110
[  826.491649]  ? fget_raw+0x120/0x120
[  826.491988]  ? ___sys_recvmsg+0xd9/0x130
[  826.492355]  ? folio_batch_add_and_move+0x80/0xa0
[  826.492776]  ? _raw_spin_lock+0x7a/0xd0
[  826.493137]  ? _raw_spin_lock+0x7a/0xd0
[  826.493500]  ? _raw_read_lock_irq+0x30/0x30
[  826.493880]  ? kasan_set_track+0x21/0x30
[  826.494249]  ? kasan_save_free_info+0x2a/0x40
[  826.494650]  ? do_sys_openat2+0xff/0x270
[  826.495016]  ? __fget_light+0x1b5/0x200
[  826.495377]  ? __virt_addr_valid+0xb1/0x130
[  826.495763]  __sys_sendmsg+0xb2/0x130
[  826.496118]  ? __sys_sendmsg_sock+0x20/0x20
[  826.496501]  ? __x64_sys_rseq+0x2e0/0x2e0
[  826.496874]  ? do_user_addr_fault+0x276/0x820
[  826.497273]  ? fpregs_assert_state_consistent+0x52/0x60
[  826.497727]  ? exit_to_user_mode_prepare+0x30/0x120
[  826.498158]  do_syscall_64+0x3d/0x90
[  826.498502]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  826.498949] RIP: 0033:0x7f9b67f4f887
[  826.499294] Code: 0a 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
[  826.500742] RSP: 002b:00007fff5d1a5498 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  826.501395] RAX: ffffffffffffffda RBX: 0000000064413ce6 RCX: 00007f9b67f4f887
[  826.501975] RDX: 0000000000000000 RSI: 00007fff5d1a5500 RDI: 0000000000000003
[  826.502556] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001
[  826.503135] R10: 00007f9b67e08708 R11: 0000000000000246 R12: 0000000000000001
[  826.503714] R13: 0000000000000001 R14: 00007fff5d1a9800 R15: 0000000000485400
[  826.504304]  </TASK>

[  826.504753] Allocated by task 3764:
[  826.505090]  kasan_save_stack+0x1e/0x40
[  826.505453]  kasan_set_track+0x21/0x30
[  826.505810]  __kasan_kmalloc+0x77/0x90
[  826.506164]  __mlx5_create_flow_table+0x16d/0xbb0 [mlx5_core]
[  826.506742]  esw_offloads_enable+0x60d/0xfb0 [mlx5_core]
[  826.507292]  mlx5_eswitch_enable_locked+0x4d3/0x680 [mlx5_core]
[  826.507885]  mlx5_devlink_eswitch_mode_set+0x2a3/0x580 [mlx5_core]
[  826.508513]  devlink_nl_cmd_eswitch_set_doit+0xdf/0x1f0
[  826.508969]  genl_family_rcv_msg_doit.isra.0+0x146/0x1c0
[  826.509427]  genl_rcv_msg+0x28d/0x3e0
[  826.509772]  netlink_rcv_skb+0xe0/0x210
[  826.510133]  genl_rcv+0x24/0x40
[  826.510448]  netlink_unicast+0x346/0x490
[  826.510810]  netlink_sendmsg+0x3b0/0x6c0
[  826.511179]  sock_sendmsg+0x73/0xc0
[  826.511519]  __sys_sendto+0x18d/0x220
[  826.511867]  __x64_sys_sendto+0x72/0x80
[  826.512232]  do_syscall_64+0x3d/0x90
[  826.512576]  entry_SYSCALL_64_after_hwframe+0x46/0xb0

[  826.513220] Freed by task 5674:
[  826.513535]  kasan_save_stack+0x1e/0x40
[  826.513893]  kasan_set_track+0x21/0x30
[  826.514245]  kasan_save_free_info+0x2a/0x40
[  826.514629]  ____kasan_slab_free+0x11a/0x1b0
[  826.515021]  __kmem_cache_free+0x14d/0x280
[  826.515399]  tree_put_node+0x109/0x1c0 [mlx5_core]
[  826.515907]  mlx5_destroy_flow_table+0x119/0x630 [mlx5_core]
[  826.516481]  esw_offloads_steering_cleanup+0xe7/0x150 [mlx5_core]
[  826.517084]  esw_offloads_disable+0xe0/0x160 [mlx5_core]
[  826.517632]  mlx5_eswitch_disable_locked+0x26c/0x290 [mlx5_core]
[  826.518225]  mlx5_devlink_eswitch_mode_set+0x128/0x580 [mlx5_core]
[  826.518834]  devlink_nl_cmd_eswitch_set_doit+0xdf/0x1f0
[  826.519286]  genl_family_rcv_msg_doit.isra.0+0x146/0x1c0
[  826.519748]  genl_rcv_msg+0x28d/0x3e0
[  826.520101]  netlink_rcv_skb+0xe0/0x210
[  826.520458]  genl_rcv+0x24/0x40
[  826.520771]  netlink_unicast+0x346/0x490
[  826.521137]  netlink_sendmsg+0x3b0/0x6c0
[  826.521505]  sock_sendmsg+0x73/0xc0
[  826.521842]  __sys_sendto+0x18d/0x220
[  826.522191]  __x64_sys_sendto+0x72/0x80
[  826.522554]  do_syscall_64+0x3d/0x90
[  826.522894]  entry_SYSCALL_64_after_hwframe+0x46/0xb0

[  826.523540] Last potentially related work creation:
[  826.523969]  kasan_save_stack+0x1e/0x40
[  826.524331]  __kasan_record_aux_stack+0x9f/0xb0
[  826.524739]  insert_work+0x30/0x130
[  826.525078]  __queue_work+0x34b/0x690
[  826.525426]  queue_work_on+0x48/0x50
[  826.525766]  __rhashtable_remove_fast_one+0x4af/0x4d0 [mlx5_core]
[  826.526365]  del_sw_flow_group+0x1b5/0x270 [mlx5_core]
[  826.526898]  tree_put_node+0x109/0x1c0 [mlx5_core]
[  826.527407]  esw_offloads_steering_cleanup+0xd3/0x150 [mlx5_core]
[  826.528009]  esw_offloads_disable+0xe0/0x160 [mlx5_core]
[  826.528616]  mlx5_eswitch_disable_locked+0x26c/0x290 [mlx5_core]
[  826.529218]  mlx5_devlink_eswitch_mode_set+0x128/0x580 [mlx5_core]
[  826.529823]  devlink_nl_cmd_eswitch_set_doit+0xdf/0x1f0
[  826.530276]  genl_family_rcv_msg_doit.isra.0+0x146/0x1c0
[  826.530733]  genl_rcv_msg+0x28d/0x3e0
[  826.531079]  netlink_rcv_skb+0xe0/0x210
[  826.531439]  genl_rcv+0x24/0x40
[  826.531755]  netlink_unicast+0x346/0x490
[  826.532123]  netlink_sendmsg+0x3b0/0x6c0
[  826.532487]  sock_sendmsg+0x73/0xc0
[  826.532825]  __sys_sendto+0x18d/0x220
[  826.533175]  __x64_sys_sendto+0x72/0x80
[  826.533533]  do_syscall_64+0x3d/0x90
[  826.533877]  entry_SYSCALL_64_after_hwframe+0x46/0xb0

[  826.534521] The buggy address belongs to the object at ffff888194485800
                which belongs to the cache kmalloc-512 of size 512
[  826.535506] The buggy address is located 48 bytes inside of
                freed 512-byte region [ffff888194485800, ffff888194485a00)

[  826.536666] The buggy address belongs to the physical page:
[  826.537138] page:00000000d75841dd refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x194480
[  826.537915] head:00000000d75841dd order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  826.538595] flags: 0x200000000010200(slab|head|node=0|zone=2)
[  826.539089] raw: 0200000000010200 ffff888100042c80 ffffea0004523800 dead000000000002
[  826.539755] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
[  826.540417] page dumped because: kasan: bad access detected

[  826.541095] Memory state around the buggy address:
[  826.541519]  ffff888194485700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  826.542149]  ffff888194485780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  826.542773] >ffff888194485800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  826.543400]                                      ^
[  826.543822]  ffff888194485880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  826.544452]  ffff888194485900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  826.545079] ==================================================================

Fixes: 6702782845 ("net/mlx5e: TC, Set CT miss to the specific ct action instance")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-22 22:38:06 -07:00
Rahul Rameshbabu 7aa5038019 net/mlx5e: Fix SQ wake logic in ptp napi_poll context
Check in the mlx5e_ptp_poll_ts_cq context if the ptp tx sq should be woken
up. Before change, the ptp tx sq may never wake up if the ptp tx ts skb
fifo is full when mlx5e_poll_tx_cq checks if the queue should be woken up.

Fixes: 1880bc4e4a ("net/mlx5e: Add TX port timestamp support")
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-22 22:38:05 -07:00
Vlad Buslov 691c041bf2 net/mlx5e: Fix deadlock in tc route query code
Cited commit causes ABBA deadlock[0] when peer flows are created while
holding the devcom rw semaphore. Due to peer flows offload implementation
the lock is taken much higher up the call chain and there is no obvious way
to easily fix the deadlock. Instead, since tc route query code needs the
peer eswitch structure only to perform a lookup in xarray and doesn't
perform any sleeping operations with it, refactor the code for lockless
execution in following ways:

- RCUify the devcom 'data' pointer. When resetting the pointer
synchronously wait for RCU grace period before returning. This is fine
since devcom is currently only used for synchronization of
pairing/unpairing of eswitches which is rare and already expensive as-is.

- Wrap all usages of 'paired' boolean in {READ|WRITE}_ONCE(). The flag has
already been used in some unlocked contexts without proper
annotations (e.g. users of mlx5_devcom_is_paired() function), but it wasn't
an issue since all relevant code paths checked it again after obtaining the
devcom semaphore. Now it is also used by mlx5_devcom_get_peer_data_rcu() as
"best effort" check to return NULL when devcom is being unpaired. Note that
while RCU read lock doesn't prevent the unpaired flag from being changed
concurrently it still guarantees that reader can continue to use 'data'.

- Refactor mlx5e_tc_query_route_vport() function to use new
mlx5_devcom_get_peer_data_rcu() API which fixes the deadlock.

[0]:

[  164.599612] ======================================================
[  164.600142] WARNING: possible circular locking dependency detected
[  164.600667] 6.3.0-rc3+ #1 Not tainted
[  164.601021] ------------------------------------------------------
[  164.601557] handler1/3456 is trying to acquire lock:
[  164.601998] ffff88811f1714b0 (&esw->offloads.encap_tbl_lock){+.+.}-{3:3}, at: mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
[  164.603078]
               but task is already holding lock:
[  164.603617] ffff88810137fc98 (&comp->sem){++++}-{3:3}, at: mlx5_devcom_get_peer_data+0x37/0x80 [mlx5_core]
[  164.604459]
               which lock already depends on the new lock.

[  164.605190]
               the existing dependency chain (in reverse order) is:
[  164.605848]
               -> #1 (&comp->sem){++++}-{3:3}:
[  164.606380]        down_read+0x39/0x50
[  164.606772]        mlx5_devcom_get_peer_data+0x37/0x80 [mlx5_core]
[  164.607336]        mlx5e_tc_query_route_vport+0x86/0xc0 [mlx5_core]
[  164.607914]        mlx5e_tc_tun_route_lookup+0x1a4/0x1d0 [mlx5_core]
[  164.608495]        mlx5e_attach_decap_route+0xc6/0x1e0 [mlx5_core]
[  164.609063]        mlx5e_tc_add_fdb_flow+0x1ea/0x360 [mlx5_core]
[  164.609627]        __mlx5e_add_fdb_flow+0x2d2/0x430 [mlx5_core]
[  164.610175]        mlx5e_configure_flower+0x952/0x1a20 [mlx5_core]
[  164.610741]        tc_setup_cb_add+0xd4/0x200
[  164.611146]        fl_hw_replace_filter+0x14c/0x1f0 [cls_flower]
[  164.611661]        fl_change+0xc95/0x18a0 [cls_flower]
[  164.612116]        tc_new_tfilter+0x3fc/0xd20
[  164.612516]        rtnetlink_rcv_msg+0x418/0x5b0
[  164.612936]        netlink_rcv_skb+0x54/0x100
[  164.613339]        netlink_unicast+0x190/0x250
[  164.613746]        netlink_sendmsg+0x245/0x4a0
[  164.614150]        sock_sendmsg+0x38/0x60
[  164.614522]        ____sys_sendmsg+0x1d0/0x1e0
[  164.614934]        ___sys_sendmsg+0x80/0xc0
[  164.615320]        __sys_sendmsg+0x51/0x90
[  164.615701]        do_syscall_64+0x3d/0x90
[  164.616083]        entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  164.616568]
               -> #0 (&esw->offloads.encap_tbl_lock){+.+.}-{3:3}:
[  164.617210]        __lock_acquire+0x159e/0x26e0
[  164.617638]        lock_acquire+0xc2/0x2a0
[  164.618018]        __mutex_lock+0x92/0xcd0
[  164.618401]        mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
[  164.618943]        post_process_attr+0x153/0x2d0 [mlx5_core]
[  164.619471]        mlx5e_tc_add_fdb_flow+0x164/0x360 [mlx5_core]
[  164.620021]        __mlx5e_add_fdb_flow+0x2d2/0x430 [mlx5_core]
[  164.620564]        mlx5e_configure_flower+0xe33/0x1a20 [mlx5_core]
[  164.621125]        tc_setup_cb_add+0xd4/0x200
[  164.621531]        fl_hw_replace_filter+0x14c/0x1f0 [cls_flower]
[  164.622047]        fl_change+0xc95/0x18a0 [cls_flower]
[  164.622500]        tc_new_tfilter+0x3fc/0xd20
[  164.622906]        rtnetlink_rcv_msg+0x418/0x5b0
[  164.623324]        netlink_rcv_skb+0x54/0x100
[  164.623727]        netlink_unicast+0x190/0x250
[  164.624138]        netlink_sendmsg+0x245/0x4a0
[  164.624544]        sock_sendmsg+0x38/0x60
[  164.624919]        ____sys_sendmsg+0x1d0/0x1e0
[  164.625340]        ___sys_sendmsg+0x80/0xc0
[  164.625731]        __sys_sendmsg+0x51/0x90
[  164.626117]        do_syscall_64+0x3d/0x90
[  164.626502]        entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  164.626995]
               other info that might help us debug this:

[  164.627725]  Possible unsafe locking scenario:

[  164.628268]        CPU0                    CPU1
[  164.628683]        ----                    ----
[  164.629098]   lock(&comp->sem);
[  164.629421]                                lock(&esw->offloads.encap_tbl_lock);
[  164.630066]                                lock(&comp->sem);
[  164.630555]   lock(&esw->offloads.encap_tbl_lock);
[  164.630993]
                *** DEADLOCK ***

[  164.631575] 3 locks held by handler1/3456:
[  164.631962]  #0: ffff888124b75130 (&block->cb_lock){++++}-{3:3}, at: tc_setup_cb_add+0x5b/0x200
[  164.632703]  #1: ffff888116e512b8 (&esw->mode_lock){++++}-{3:3}, at: mlx5_esw_hold+0x39/0x50 [mlx5_core]
[  164.633552]  #2: ffff88810137fc98 (&comp->sem){++++}-{3:3}, at: mlx5_devcom_get_peer_data+0x37/0x80 [mlx5_core]
[  164.634435]
               stack backtrace:
[  164.634883] CPU: 17 PID: 3456 Comm: handler1 Not tainted 6.3.0-rc3+ #1
[  164.635431] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[  164.636340] Call Trace:
[  164.636616]  <TASK>
[  164.636863]  dump_stack_lvl+0x47/0x70
[  164.637217]  check_noncircular+0xfe/0x110
[  164.637601]  __lock_acquire+0x159e/0x26e0
[  164.637977]  ? mlx5_cmd_set_fte+0x5b0/0x830 [mlx5_core]
[  164.638472]  lock_acquire+0xc2/0x2a0
[  164.638828]  ? mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
[  164.639339]  ? lock_is_held_type+0x98/0x110
[  164.639728]  __mutex_lock+0x92/0xcd0
[  164.640074]  ? mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
[  164.640576]  ? __lock_acquire+0x382/0x26e0
[  164.640958]  ? mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
[  164.641468]  ? mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
[  164.641965]  mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
[  164.642454]  ? lock_release+0xbf/0x240
[  164.642819]  post_process_attr+0x153/0x2d0 [mlx5_core]
[  164.643318]  mlx5e_tc_add_fdb_flow+0x164/0x360 [mlx5_core]
[  164.643835]  __mlx5e_add_fdb_flow+0x2d2/0x430 [mlx5_core]
[  164.644340]  mlx5e_configure_flower+0xe33/0x1a20 [mlx5_core]
[  164.644862]  ? lock_acquire+0xc2/0x2a0
[  164.645219]  tc_setup_cb_add+0xd4/0x200
[  164.645588]  fl_hw_replace_filter+0x14c/0x1f0 [cls_flower]
[  164.646067]  fl_change+0xc95/0x18a0 [cls_flower]
[  164.646488]  tc_new_tfilter+0x3fc/0xd20
[  164.646861]  ? tc_del_tfilter+0x810/0x810
[  164.647236]  rtnetlink_rcv_msg+0x418/0x5b0
[  164.647621]  ? rtnl_setlink+0x160/0x160
[  164.647982]  netlink_rcv_skb+0x54/0x100
[  164.648348]  netlink_unicast+0x190/0x250
[  164.648722]  netlink_sendmsg+0x245/0x4a0
[  164.649090]  sock_sendmsg+0x38/0x60
[  164.649434]  ____sys_sendmsg+0x1d0/0x1e0
[  164.649804]  ? copy_msghdr_from_user+0x6d/0xa0
[  164.650213]  ___sys_sendmsg+0x80/0xc0
[  164.650563]  ? lock_acquire+0xc2/0x2a0
[  164.650926]  ? lock_acquire+0xc2/0x2a0
[  164.651286]  ? __fget_files+0x5/0x190
[  164.651644]  ? find_held_lock+0x2b/0x80
[  164.652006]  ? __fget_files+0xb9/0x190
[  164.652365]  ? lock_release+0xbf/0x240
[  164.652723]  ? __fget_files+0xd3/0x190
[  164.653079]  __sys_sendmsg+0x51/0x90
[  164.653435]  do_syscall_64+0x3d/0x90
[  164.653784]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  164.654229] RIP: 0033:0x7f378054f8bd
[  164.654577] Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 6a c3 f4 ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 44 24 08 e8 be c3 f4 ff 48
[  164.656041] RSP: 002b:00007f377fa114b0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[  164.656701] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f378054f8bd
[  164.657297] RDX: 0000000000000000 RSI: 00007f377fa11540 RDI: 0000000000000014
[  164.657885] RBP: 00007f377fa12278 R08: 0000000000000000 R09: 000000000000015c
[  164.658472] R10: 00007f377fa123d0 R11: 0000000000000293 R12: 0000560962d99bd0
[  164.665317] R13: 0000000000000000 R14: 0000560962d99bd0 R15: 00007f377fa11540

Fixes: f9d196bd63 ("net/mlx5e: Use correct eswitch for stack devices with lag")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-22 22:38:05 -07:00
Roi Dayan a65735148e net/mlx5: Fix error message when failing to allocate device memory
Fix spacing for the error and also the correct error code pointer.

Fixes: c9b9dcb430 ("net/mlx5: Move device memory management to mlx5_core")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-22 22:38:05 -07:00
Vlad Buslov be071cdb16 net/mlx5e: Use correct encap attribute during invalidation
With introduction of post action infrastructure most of the users of encap
attribute had been modified in order to obtain the correct attribute by
calling mlx5e_tc_get_encap_attr() helper instead of assuming encap action
is always on default attribute. However, the cited commit didn't modify
mlx5e_invalidate_encap() which prevents it from destroying correct modify
header action which leads to a warning [0]. Fix the issue by using correct
attribute.

[0]:

Feb 21 09:47:35 c-237-177-40-045 kernel: WARNING: CPU: 17 PID: 654 at drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:684 mlx5e_tc_attach_mod_hdr+0x1cc/0x230 [mlx5_core]
Feb 21 09:47:35 c-237-177-40-045 kernel: RIP: 0010:mlx5e_tc_attach_mod_hdr+0x1cc/0x230 [mlx5_core]
Feb 21 09:47:35 c-237-177-40-045 kernel: Call Trace:
Feb 21 09:47:35 c-237-177-40-045 kernel:  <TASK>
Feb 21 09:47:35 c-237-177-40-045 kernel:  mlx5e_tc_fib_event_work+0x8e3/0x1f60 [mlx5_core]
Feb 21 09:47:35 c-237-177-40-045 kernel:  ? mlx5e_take_all_encap_flows+0xe0/0xe0 [mlx5_core]
Feb 21 09:47:35 c-237-177-40-045 kernel:  ? lock_downgrade+0x6d0/0x6d0
Feb 21 09:47:35 c-237-177-40-045 kernel:  ? lockdep_hardirqs_on_prepare+0x273/0x3f0
Feb 21 09:47:35 c-237-177-40-045 kernel:  ? lockdep_hardirqs_on_prepare+0x273/0x3f0
Feb 21 09:47:35 c-237-177-40-045 kernel:  process_one_work+0x7c2/0x1310
Feb 21 09:47:35 c-237-177-40-045 kernel:  ? lockdep_hardirqs_on_prepare+0x3f0/0x3f0
Feb 21 09:47:35 c-237-177-40-045 kernel:  ? pwq_dec_nr_in_flight+0x230/0x230
Feb 21 09:47:35 c-237-177-40-045 kernel:  ? rwlock_bug.part.0+0x90/0x90
Feb 21 09:47:35 c-237-177-40-045 kernel:  worker_thread+0x59d/0xec0
Feb 21 09:47:35 c-237-177-40-045 kernel:  ? __kthread_parkme+0xd9/0x1d0

Fixes: 8300f22526 ("net/mlx5e: Create new flow attr for multi table actions")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-22 22:38:05 -07:00
Yevgeny Kliteynik c7dd225bc2 net/mlx5: DR, Check force-loopback RC QP capability independently from RoCE
SW Steering uses RC QP for writing STEs to ICM. This writingis done in LB
(loopback), and FL (force-loopback) QP is preferred for performance. FL is
available when RoCE is enabled or disabled based on RoCE caps.
This patch adds reading of FL capability from HCA caps in addition to the
existing reading from RoCE caps, thus fixing the case where we didn't
have loopback enabled when RoCE was disabled.

Fixes: 7304d603a5 ("net/mlx5: DR, Add support for force-loopback QP")
Signed-off-by: Itamar Gozlan <igozlan@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-22 22:38:05 -07:00
Erez Shitrit 1e5daf5565 net/mlx5: DR, Fix crc32 calculation to work on big-endian (BE) CPUs
When calculating crc for hash index we use the function crc32 that
calculates for little-endian (LE) arch.
Then we convert it to network endianness using htonl(), but it's wrong
to do the conversion in BE archs since the crc32 value is already LE.

The solution is to switch the bytes from the crc result for all types
of arc.

Fixes: 40416d8ede ("net/mlx5: DR, Replace CRC32 implementation to use kernel lib")
Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-22 22:38:05 -07:00
Shay Drory 2be5bd42a5 net/mlx5: Handle pairing of E-switch via uplink un/load APIs
In case user switch a device from switchdev mode to legacy mode, mlx5
first unpair the E-switch and afterwards unload the uplink vport.
From the other hand, in case user remove or reload a device, mlx5
first unload the uplink vport and afterwards unpair the E-switch.

The latter is causing a bug[1], hence, handle pairing of E-switch as
part of uplink un/load APIs.

[1]
In case VF_LAG is used, every tc fdb flow is duplicated to the peer
esw. However, the original esw keeps a pointer to this duplicated
flow, not the peer esw.
e.g.: if user create tc fdb flow over esw0, the flow is duplicated
over esw1, in FW/HW, but in SW, esw0 keeps a pointer to the duplicated
flow.
During module unload while a peer tc fdb flow is still offloaded, in
case the first device to be removed is the peer device (esw1 in the
example above), the peer net-dev is destroyed, and so the mlx5e_priv
is memset to 0.
Afterwards, the peer device is trying to unpair himself from the
original device (esw0 in the example above). Unpair API invoke the
original device to clear peer flow from its eswitch (esw0), but the
peer flow, which is stored over the original eswitch (esw0), is
trying to use the peer mlx5e_priv, which is memset to 0 and result in
bellow kernel-oops.

[  157.964081 ] BUG: unable to handle page fault for address: 000000000002ce60
[  157.964662 ] #PF: supervisor read access in kernel mode
[  157.965123 ] #PF: error_code(0x0000) - not-present page
[  157.965582 ] PGD 0 P4D 0
[  157.965866 ] Oops: 0000 [#1] SMP
[  157.967670 ] RIP: 0010:mlx5e_tc_del_fdb_flow+0x48/0x460 [mlx5_core]
[  157.976164 ] Call Trace:
[  157.976437 ]  <TASK>
[  157.976690 ]  __mlx5e_tc_del_fdb_peer_flow+0xe6/0x100 [mlx5_core]
[  157.977230 ]  mlx5e_tc_clean_fdb_peer_flows+0x67/0x90 [mlx5_core]
[  157.977767 ]  mlx5_esw_offloads_unpair+0x2d/0x1e0 [mlx5_core]
[  157.984653 ]  mlx5_esw_offloads_devcom_event+0xbf/0x130 [mlx5_core]
[  157.985212 ]  mlx5_devcom_send_event+0xa3/0xb0 [mlx5_core]
[  157.985714 ]  esw_offloads_disable+0x5a/0x110 [mlx5_core]
[  157.986209 ]  mlx5_eswitch_disable_locked+0x152/0x170 [mlx5_core]
[  157.986757 ]  mlx5_eswitch_disable+0x51/0x80 [mlx5_core]
[  157.987248 ]  mlx5_unload+0x2a/0xb0 [mlx5_core]
[  157.987678 ]  mlx5_uninit_one+0x5f/0xd0 [mlx5_core]
[  157.988127 ]  remove_one+0x64/0xe0 [mlx5_core]
[  157.988549 ]  pci_device_remove+0x31/0xa0
[  157.988933 ]  device_release_driver_internal+0x18f/0x1f0
[  157.989402 ]  driver_detach+0x3f/0x80
[  157.989754 ]  bus_remove_driver+0x70/0xf0
[  157.990129 ]  pci_unregister_driver+0x34/0x90
[  157.990537 ]  mlx5_cleanup+0xc/0x1c [mlx5_core]
[  157.990972 ]  __x64_sys_delete_module+0x15a/0x250
[  157.991398 ]  ? exit_to_user_mode_prepare+0xea/0x110
[  157.991840 ]  do_syscall_64+0x3d/0x90
[  157.992198 ]  entry_SYSCALL_64_after_hwframe+0x46/0xb0

Fixes: 04de7dda73 ("net/mlx5e: Infrastructure for duplicated offloading of TC flows")
Fixes: 1418ddd96a ("net/mlx5e: Duplicate offloaded TC eswitch rules under uplink LAG")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-22 22:38:04 -07:00
Shay Drory 2a0a935fb6 net/mlx5: Collect command failures data only for known commands
DEVX can issue a general command, which is not used by mlx5 driver.
In case such command is failed, mlx5 is trying to collect the failure
data, However, mlx5 doesn't create a storage for this command, since
mlx5 doesn't use it. This lead to array-index-out-of-bounds error.

Fix it by checking whether the command is known before collecting the
failure data.

Fixes: 34f46ae0d4 ("net/mlx5: Add command failures data to debugfs")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-22 22:38:04 -07:00
Jakub Kicinski afbed3f748 net/mlx5e: do as little as possible in napi poll when budget is 0
NAPI gets called with budget of 0 from netpoll, which has interrupts
disabled. We should try to free some space on Tx rings and nothing
else.

Specifically do not try to handle XDP TX or try to refill Rx buffers -
we can't use the page pool from IRQ context. Don't check if IRQs moved,
either, that makes no sense in netpoll. Netpoll calls _all_ the rings
from whatever CPU it happens to be invoked on.

In general do as little as possible, the work quickly adds up when
there's tens of rings to poll.

The immediate stack trace I was seeing is:

    __do_softirq+0xd1/0x2c0
    __local_bh_enable_ip+0xc7/0x120
    </IRQ>
    <TASK>
    page_pool_put_defragged_page+0x267/0x320
    mlx5e_free_xdpsq_desc+0x99/0xd0
    mlx5e_poll_xdpsq_cq+0x138/0x3b0
    mlx5e_napi_poll+0xc3/0x8b0
    netpoll_poll_dev+0xce/0x150

AFAIU page pool takes a BH lock, releases it and since BH is now
enabled tries to run softirqs.

Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Fixes: 60bbf7eeef ("mlx5: use page_pool for xdp_return_frame call")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-19 08:39:14 +01:00
Linus Torvalds af3877265d v6.4 merge window RDMA pull request
Usual wide collection of unrelated items in drivers:
 
 - Driver bug fixes and treewide cleanups in hfi1, siw, qib, mlx5, rxe,
   usnic, usnic, bnxt_re, ocrdma, iser
    * Unnecessary NULL checks
    * kmap obsolescence
    * pci_enable_pcie_error_reporting() obsolescence
    * Unused variables and macros
    * trace event related warnings
    * casting warnings
 
 - Code cleanups for irdm and erdma
 
 - EFA reporting of 128 byte PCIe TLP support
 
 - mlx5 more agressively uses the out of order HW feature
 
 - Big rework of how state machines and tasks work in rxe
 
 - Fix a syzkaller found crash netdev refcount leak in siw
 
 - bnxt_re revises their HW description header
 
 - Congestion control for bnxt_re
 
 - Use mmu_notifiers more safely in hfi1
 
 - mlx5 gets better support for PCIe relaxed ordering inside VMs
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZEva5wAKCRCFwuHvBreF
 YZFmAQC9T3b/XQ3bRknYciuzbatC98o9xB0FTqmEFYGj+Y2lVAD9EEVe3HKfHfi3
 t/GxXYB5r22oxg5bgsblZfEdEdTVCg8=
 =akMm
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Usual wide collection of unrelated items in drivers:

   - Driver bug fixes and treewide cleanups in hfi1, siw, qib, mlx5,
     rxe, usnic, usnic, bnxt_re, ocrdma, iser:
       - remove unnecessary NULL checks
       - kmap obsolescence
       - pci_enable_pcie_error_reporting() obsolescence
       - unused variables and macros
       - trace event related warnings
       - casting warnings

   - Code cleanups for irdm and erdma

   - EFA reporting of 128 byte PCIe TLP support

   - mlx5 more agressively uses the out of order HW feature

   - Big rework of how state machines and tasks work in rxe

   - Fix a syzkaller found crash netdev refcount leak in siw

   - bnxt_re revises their HW description header

   - Congestion control for bnxt_re

   - Use mmu_notifiers more safely in hfi1

   - mlx5 gets better support for PCIe relaxed ordering inside VMs"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (81 commits)
  RDMA/efa: Add rdma write capability to device caps
  RDMA/mlx5: Use correct device num_ports when modify DC
  RDMA/irdma: Drop spurious WQ_UNBOUND from alloc_ordered_workqueue() call
  RDMA/rxe: Fix spinlock recursion deadlock on requester
  RDMA/mlx5: Fix flow counter query via DEVX
  RDMA/rxe: Protect QP state with qp->state_lock
  RDMA/rxe: Move code to check if drained to subroutine
  RDMA/rxe: Remove qp->req.state
  RDMA/rxe: Remove qp->comp.state
  RDMA/rxe: Remove qp->resp.state
  RDMA/mlx5: Allow relaxed ordering read in VFs and VMs
  net/mlx5: Update relaxed ordering read HCA capabilities
  RDMA/mlx5: Check pcie_relaxed_ordering_enabled() in UMR
  RDMA/mlx5: Remove pcie_relaxed_ordering_enabled() check for RO write
  RDMA: Add ib_virt_dma_to_page()
  RDMA/rxe: Fix the error "trying to register non-static key in rxe_cleanup_task"
  RDMA/irdma: Slightly optimize irdma_form_ah_cm_frame()
  RDMA/rxe: Fix incorrect TASKLET_STATE_SCHED check in rxe_task.c
  IB/hfi1: Place struct mmu_rb_handler on cache line start
  IB/hfi1: Fix bugs with non-PAGE_SIZE-end multi-iovec user SDMA requests
  ...
2023-04-29 17:21:24 -07:00
Linus Torvalds 6e98b09da9 Networking changes for 6.4.
Core
 ----
 
  - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
    default value allows for better BIG TCP performances.
 
  - Reduce compound page head access for zero-copy data transfers.
 
  - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when possible.
 
  - Threaded NAPI improvements, adding defer skb free support and unneeded
    softirq avoidance.
 
  - Address dst_entry reference count scalability issues, via false
    sharing avoidance and optimize refcount tracking.
 
  - Add lockless accesses annotation to sk_err[_soft].
 
  - Optimize again the skb struct layout.
 
  - Extends the skb drop reasons to make it usable by multiple
    subsystems.
 
  - Better const qualifier awareness for socket casts.
 
 BPF
 ---
 
  - Add skb and XDP typed dynptrs which allow BPF programs for more
    ergonomic and less brittle iteration through data and variable-sized
    accesses.
 
  - Add a new BPF netfilter program type and minimal support to hook
    BPF programs to netfilter hooks such as prerouting or forward.
 
  - Add more precise memory usage reporting for all BPF map types.
 
  - Adds support for using {FOU,GUE} encap with an ipip device operating
    in collect_md mode and add a set of BPF kfuncs for controlling encap
    params.
 
  - Allow BPF programs to detect at load time whether a particular kfunc
    exists or not, and also add support for this in light skeleton.
 
  - Bigger batch of BPF verifier improvements to prepare for upcoming BPF
    open-coded iterators allowing for less restrictive looping capabilities.
 
  - Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF
    programs to NULL-check before passing such pointers into kfunc.
 
  - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in
    local storage maps.
 
  - Enable RCU semantics for task BPF kptrs and allow referenced kptr
    tasks to be stored in BPF maps.
 
  - Add support for refcounted local kptrs to the verifier for allowing
    shared ownership, useful for adding a node to both the BPF list and
    rbtree.
 
  - Add BPF verifier support for ST instructions in convert_ctx_access()
    which will help new -mcpu=v4 clang flag to start emitting them.
 
  - Add ARM32 USDT support to libbpf.
 
  - Improve bpftool's visual program dump which produces the control
    flow graph in a DOT format by adding C source inline annotations.
 
 Protocols
 ---------
 
  - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
    indicates the provenance of the IP address.
 
  - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition.
 
  - Add the handshake upcall mechanism, allowing the user-space
    to implement generic TLS handshake on kernel's behalf.
 
  - Bridge: support per-{Port, VLAN} neighbor suppression, increasing
    resilience to nodes failures.
 
  - SCTP: add support for Fair Capacity and Weighted Fair Queueing
    schedulers.
 
  - MPTCP: delay first subflow allocation up to its first usage. This
    will allow for later better LSM interaction.
 
  - xfrm: Remove inner/outer modes from input/output path. These are
    not needed anymore.
 
  - WiFi:
    - reduced neighbor report (RNR) handling for AP mode
    - HW timestamping support
    - support for randomized auth/deauth TA for PASN privacy
    - per-link debugfs for multi-link
    - TC offload support for mac80211 drivers
    - mac80211 mesh fast-xmit and fast-rx support
    - enable Wi-Fi 7 (EHT) mesh support
 
 Netfilter
 ---------
 
  - Add nf_tables 'brouting' support, to force a packet to be routed
    instead of being bridged.
 
  - Update bridge netfilter and ovs conntrack helpers to handle
    IPv6 Jumbo packets properly, i.e. fetch the packet length
    from hop-by-hop extension header. This is needed for BIT TCP
    support.
 
  - The iptables 32bit compat interface isn't compiled in by default
    anymore.
 
  - Move ip(6)tables builtin icmp matches to the udptcp one.
    This has the advantage that icmp/icmpv6 match doesn't load the
    iptables/ip6tables modules anymore when iptables-nft is used.
 
  - Extended netlink error report for netdevice in flowtables and
    netdev/chains. Allow for incrementally add/delete devices to netdev
    basechain. Allow to create netdev chain without device.
 
 Driver API
 ----------
 
  - Remove redundant Device Control Error Reporting Enable, as PCI core
    has already error reporting enabled at enumeration time.
 
  - Move Multicast DB netlink handlers to core, allowing devices other
    then bridge to use them.
 
  - Allow the page_pool to directly recycle the pages from safely
    localized NAPI.
 
  - Implement lockless TX queue stop/wake combo macros, allowing for
    further code de-duplication and sanitization.
 
  - Add YNL support for user headers and struct attrs.
 
  - Add partial YNL specification for devlink.
 
  - Add partial YNL specification for ethtool.
 
  - Add tc-mqprio and tc-taprio support for preemptible traffic classes.
 
  - Add tx push buf len param to ethtool, specifies the maximum number
    of bytes of a transmitted packet a driver can push directly to the
    underlying device.
 
  - Add basic LED support for switch/phy.
 
  - Add NAPI documentation, stop relaying on external links.
 
  - Convert dsa_master_ioctl() to netdev notifier. This is a preparatory
    work to make the hardware timestamping layer selectable by user
    space.
 
  - Add transceiver support and improve the error messages for CAN-FD
    controllers.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - AMD/Pensando core device support
    - MediaTek MT7981 SoC
    - MediaTek MT7988 SoC
    - Broadcom BCM53134 embedded switch
    - Texas Instruments CPSW9G ethernet switch
    - Qualcomm EMAC3 DWMAC ethernet
    - StarFive JH7110 SoC
    - NXP CBTX ethernet PHY
 
  - WiFi:
    - Apple M1 Pro/Max devices
    - RealTek rtl8710bu/rtl8188gu
    - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset
 
  - Bluetooth:
    - Realtek RTL8821CS, RTL8851B, RTL8852BS
    - Mediatek MT7663, MT7922
    - NXP w8997
    - Actions Semi ATS2851
    - QTI WCN6855
    - Marvell 88W8997
 
  - Can:
    - STMicroelectronics bxcan stm32f429
 
 Drivers
 -------
  - Ethernet NICs:
    - Intel (1G, icg):
      - add tracking and reporting of QBV config errors.
      - add support for configuring max SDU for each Tx queue.
    - Intel (100G, ice):
      - refactor mailbox overflow detection to support Scalable IOV
      - GNSS interface optimization
    - Intel (i40e):
      - support XDP multi-buffer
    - nVidia/Mellanox:
      - add the support for linux bridge multicast offload
      - enable TC offload for egress and engress MACVLAN over bond
      - add support for VxLAN GBP encap/decap flows offload
      - extend packet offload to fully support libreswan
      - support tunnel mode in mlx5 IPsec packet offload
      - extend XDP multi-buffer support
      - support MACsec VLAN offload
      - add support for dynamic msix vectors allocation
      - drop RX page_cache and fully use page_pool
      - implement thermal zone to report NIC temperature
    - Netronome/Corigine:
      - add support for multi-zone conntrack offload
    - Solarflare/Xilinx:
      - support offloading TC VLAN push/pop actions to the MAE
      - support TC decap rules
      - support unicast PTP
 
  - Other NICs:
    - Broadcom (bnxt): enforce software based freq adjustments only
 		on shared PHC NIC
    - RealTek (r8169): refactor to addess ASPM issues during NAPI poll.
    - Micrel (lan8841): add support for PTP_PF_PEROUT
    - Cadence (macb): enable PTP unicast
    - Engleder (tsnep): add XDP socket zero-copy support
    - virtio-net: implement exact header length guest feature
    - veth: add page_pool support for page recycling
    - vxlan: add MDB data path support
    - gve: add XDP support for GQI-QPL format
    - geneve: accept every ethertype
    - macvlan: allow some packets to bypass broadcast queue
    - mana: add support for jumbo frame
 
  - Ethernet high-speed switches:
    - Microchip (sparx5): Add support for TC flower templates.
 
  - Ethernet embedded switches:
    - Broadcom (b54):
      - configure 6318 and 63268 RGMII ports
    - Marvell (mv88e6xxx):
      - faster C45 bus scan
    - Microchip:
      - lan966x:
        - add support for IS1 VCAP
        - better TX/RX from/to CPU performances
      - ksz9477: add ETS Qdisc support
      - ksz8: enhance static MAC table operations and error handling
      - sama7g5: add PTP capability
    - NXP (ocelot):
      - add support for external ports
      - add support for preemptible traffic classes
    - Texas Instruments:
      - add CPSWxG SGMII support for J7200 and J721E
 
  - Intel WiFi (iwlwifi):
    - preparation for Wi-Fi 7 EHT and multi-link support
    - EHT (Wi-Fi 7) sniffer support
    - hardware timestamping support for some devices/firwmares
    - TX beacon protection on newer hardware
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - MU-MIMO parameters support
    - ack signal support for management packets
 
  - RealTek WiFi (rtw88):
    - SDIO bus support
    - better support for some SDIO devices
      (e.g. MAC address from efuse)
 
  - RealTek WiFi (rtw89):
    - HW scan support for 8852b
    - better support for 6 GHz scanning
    - support for various newer firmware APIs
    - framework firmware backwards compatibility
 
  - MediaTek WiFi (mt76):
    - P2P support
    - mesh A-MSDU support
    - EHT (Wi-Fi 7) support
    - coredump support
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmRI/mUSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkgO0QAJGxpuN67YgYV0BIM+/atWKEEexJYG7B
 9MMpU4jMO3EW/pUS5t7VRsBLUybLYVPmqCZoHodObDfnu59jiPOegb6SikJv/ZwJ
 Zw62PVk5MvDnQjlu4e6kDcGwkplteN08TlgI+a49BUTedpdFitrxHAYGW8f2fRO6
 cK2XSld+ZucMoym5vRwf8yWS1BwdxnslPMxDJ+/8ZbWBZv44qAnG2vMB/kIx7ObC
 Vel/4m6MzTwVsLYBsRvcwMVbNNlZ9GuhztlTzEbfGA4ZhTadIAMgb5VTWXB84Ws7
 Aic5wTdli+q+x6/2cxhbyeoVuB9HHObYmLBAciGg4GNljP5rnQBY3X3+KVZ/x9TI
 HQB7CmhxmAZVrO9pLARFV+ECrMTH2/dy3NyrZ7uYQ3WPOXJi8hJZjOTO/eeEGL7C
 eTjdz0dZBWIBK2gON/6s4nExXVQUTEF2ZsPi52jTTClKjfe5pz/ddeFQIWaY1DTm
 pInEiWPAvd28JyiFmhFNHsuIBCjX/Zqe2JuMfMBeBibDAC09o/OGdKJYUI15AiRf
 F46Pdb7use/puqfrYW44kSAfaPYoBiE+hj1RdeQfen35xD9HVE4vdnLNeuhRlFF9
 aQfyIRHYQofkumRDr5f8JEY66cl9NiKQ4IVW1xxQfYDNdC6wQqREPG1md7rJVMrJ
 vP7ugFnttneg
 =ITVa
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
     default value allows for better BIG TCP performances

   - Reduce compound page head access for zero-copy data transfers

   - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when
     possible

   - Threaded NAPI improvements, adding defer skb free support and
     unneeded softirq avoidance

   - Address dst_entry reference count scalability issues, via false
     sharing avoidance and optimize refcount tracking

   - Add lockless accesses annotation to sk_err[_soft]

   - Optimize again the skb struct layout

   - Extends the skb drop reasons to make it usable by multiple
     subsystems

   - Better const qualifier awareness for socket casts

  BPF:

   - Add skb and XDP typed dynptrs which allow BPF programs for more
     ergonomic and less brittle iteration through data and
     variable-sized accesses

   - Add a new BPF netfilter program type and minimal support to hook
     BPF programs to netfilter hooks such as prerouting or forward

   - Add more precise memory usage reporting for all BPF map types

   - Adds support for using {FOU,GUE} encap with an ipip device
     operating in collect_md mode and add a set of BPF kfuncs for
     controlling encap params

   - Allow BPF programs to detect at load time whether a particular
     kfunc exists or not, and also add support for this in light
     skeleton

   - Bigger batch of BPF verifier improvements to prepare for upcoming
     BPF open-coded iterators allowing for less restrictive looping
     capabilities

   - Rework RCU enforcement in the verifier, add kptr_rcu and enforce
     BPF programs to NULL-check before passing such pointers into kfunc

   - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and
     in local storage maps

   - Enable RCU semantics for task BPF kptrs and allow referenced kptr
     tasks to be stored in BPF maps

   - Add support for refcounted local kptrs to the verifier for allowing
     shared ownership, useful for adding a node to both the BPF list and
     rbtree

   - Add BPF verifier support for ST instructions in
     convert_ctx_access() which will help new -mcpu=v4 clang flag to
     start emitting them

   - Add ARM32 USDT support to libbpf

   - Improve bpftool's visual program dump which produces the control
     flow graph in a DOT format by adding C source inline annotations

  Protocols:

   - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
     indicates the provenance of the IP address

   - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition

   - Add the handshake upcall mechanism, allowing the user-space to
     implement generic TLS handshake on kernel's behalf

   - Bridge: support per-{Port, VLAN} neighbor suppression, increasing
     resilience to nodes failures

   - SCTP: add support for Fair Capacity and Weighted Fair Queueing
     schedulers

   - MPTCP: delay first subflow allocation up to its first usage. This
     will allow for later better LSM interaction

   - xfrm: Remove inner/outer modes from input/output path. These are
     not needed anymore

   - WiFi:
      - reduced neighbor report (RNR) handling for AP mode
      - HW timestamping support
      - support for randomized auth/deauth TA for PASN privacy
      - per-link debugfs for multi-link
      - TC offload support for mac80211 drivers
      - mac80211 mesh fast-xmit and fast-rx support
      - enable Wi-Fi 7 (EHT) mesh support

  Netfilter:

   - Add nf_tables 'brouting' support, to force a packet to be routed
     instead of being bridged

   - Update bridge netfilter and ovs conntrack helpers to handle IPv6
     Jumbo packets properly, i.e. fetch the packet length from
     hop-by-hop extension header. This is needed for BIT TCP support

   - The iptables 32bit compat interface isn't compiled in by default
     anymore

   - Move ip(6)tables builtin icmp matches to the udptcp one. This has
     the advantage that icmp/icmpv6 match doesn't load the
     iptables/ip6tables modules anymore when iptables-nft is used

   - Extended netlink error report for netdevice in flowtables and
     netdev/chains. Allow for incrementally add/delete devices to netdev
     basechain. Allow to create netdev chain without device

  Driver API:

   - Remove redundant Device Control Error Reporting Enable, as PCI core
     has already error reporting enabled at enumeration time

   - Move Multicast DB netlink handlers to core, allowing devices other
     then bridge to use them

   - Allow the page_pool to directly recycle the pages from safely
     localized NAPI

   - Implement lockless TX queue stop/wake combo macros, allowing for
     further code de-duplication and sanitization

   - Add YNL support for user headers and struct attrs

   - Add partial YNL specification for devlink

   - Add partial YNL specification for ethtool

   - Add tc-mqprio and tc-taprio support for preemptible traffic classes

   - Add tx push buf len param to ethtool, specifies the maximum number
     of bytes of a transmitted packet a driver can push directly to the
     underlying device

   - Add basic LED support for switch/phy

   - Add NAPI documentation, stop relaying on external links

   - Convert dsa_master_ioctl() to netdev notifier. This is a
     preparatory work to make the hardware timestamping layer selectable
     by user space

   - Add transceiver support and improve the error messages for CAN-FD
     controllers

  New hardware / drivers:

   - Ethernet:
      - AMD/Pensando core device support
      - MediaTek MT7981 SoC
      - MediaTek MT7988 SoC
      - Broadcom BCM53134 embedded switch
      - Texas Instruments CPSW9G ethernet switch
      - Qualcomm EMAC3 DWMAC ethernet
      - StarFive JH7110 SoC
      - NXP CBTX ethernet PHY

   - WiFi:
      - Apple M1 Pro/Max devices
      - RealTek rtl8710bu/rtl8188gu
      - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset

   - Bluetooth:
      - Realtek RTL8821CS, RTL8851B, RTL8852BS
      - Mediatek MT7663, MT7922
      - NXP w8997
      - Actions Semi ATS2851
      - QTI WCN6855
      - Marvell 88W8997

   - Can:
      - STMicroelectronics bxcan stm32f429

  Drivers:

   - Ethernet NICs:
      - Intel (1G, icg):
         - add tracking and reporting of QBV config errors
         - add support for configuring max SDU for each Tx queue
      - Intel (100G, ice):
         - refactor mailbox overflow detection to support Scalable IOV
         - GNSS interface optimization
      - Intel (i40e):
         - support XDP multi-buffer
      - nVidia/Mellanox:
         - add the support for linux bridge multicast offload
         - enable TC offload for egress and engress MACVLAN over bond
         - add support for VxLAN GBP encap/decap flows offload
         - extend packet offload to fully support libreswan
         - support tunnel mode in mlx5 IPsec packet offload
         - extend XDP multi-buffer support
         - support MACsec VLAN offload
         - add support for dynamic msix vectors allocation
         - drop RX page_cache and fully use page_pool
         - implement thermal zone to report NIC temperature
      - Netronome/Corigine:
         - add support for multi-zone conntrack offload
      - Solarflare/Xilinx:
         - support offloading TC VLAN push/pop actions to the MAE
         - support TC decap rules
         - support unicast PTP

   - Other NICs:
      - Broadcom (bnxt): enforce software based freq adjustments only on
        shared PHC NIC
      - RealTek (r8169): refactor to addess ASPM issues during NAPI poll
      - Micrel (lan8841): add support for PTP_PF_PEROUT
      - Cadence (macb): enable PTP unicast
      - Engleder (tsnep): add XDP socket zero-copy support
      - virtio-net: implement exact header length guest feature
      - veth: add page_pool support for page recycling
      - vxlan: add MDB data path support
      - gve: add XDP support for GQI-QPL format
      - geneve: accept every ethertype
      - macvlan: allow some packets to bypass broadcast queue
      - mana: add support for jumbo frame

   - Ethernet high-speed switches:
      - Microchip (sparx5): Add support for TC flower templates

   - Ethernet embedded switches:
      - Broadcom (b54):
         - configure 6318 and 63268 RGMII ports
      - Marvell (mv88e6xxx):
         - faster C45 bus scan
      - Microchip:
         - lan966x:
            - add support for IS1 VCAP
            - better TX/RX from/to CPU performances
         - ksz9477: add ETS Qdisc support
         - ksz8: enhance static MAC table operations and error handling
         - sama7g5: add PTP capability
      - NXP (ocelot):
         - add support for external ports
         - add support for preemptible traffic classes
      - Texas Instruments:
         - add CPSWxG SGMII support for J7200 and J721E

   - Intel WiFi (iwlwifi):
      - preparation for Wi-Fi 7 EHT and multi-link support
      - EHT (Wi-Fi 7) sniffer support
      - hardware timestamping support for some devices/firwmares
      - TX beacon protection on newer hardware

   - Qualcomm 802.11ax WiFi (ath11k):
      - MU-MIMO parameters support
      - ack signal support for management packets

   - RealTek WiFi (rtw88):
      - SDIO bus support
      - better support for some SDIO devices (e.g. MAC address from
        efuse)

   - RealTek WiFi (rtw89):
      - HW scan support for 8852b
      - better support for 6 GHz scanning
      - support for various newer firmware APIs
      - framework firmware backwards compatibility

   - MediaTek WiFi (mt76):
      - P2P support
      - mesh A-MSDU support
      - EHT (Wi-Fi 7) support
      - coredump support"

* tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits)
  net: phy: hide the PHYLIB_LEDS knob
  net: phy: marvell-88x2222: remove unnecessary (void*) conversions
  tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
  net: amd: Fix link leak when verifying config failed
  net: phy: marvell: Fix inconsistent indenting in led_blink_set
  lan966x: Don't use xdp_frame when action is XDP_TX
  tsnep: Add XDP socket zero-copy TX support
  tsnep: Add XDP socket zero-copy RX support
  tsnep: Move skb receive action to separate function
  tsnep: Add functions for queue enable/disable
  tsnep: Rework TX/RX queue initialization
  tsnep: Replace modulo operation with mask
  net: phy: dp83867: Add led_brightness_set support
  net: phy: Fix reading LED reg property
  drivers: nfc: nfcsim: remove return value check of `dev_dir`
  net: phy: dp83867: Remove unnecessary (void*) conversions
  net: ethtool: coalesce: try to make user settings stick twice
  net: mana: Check if netdev/napi_alloc_frag returns single page
  net: mana: Rename mana_refill_rxoob and remove some empty lines
  net: veth: add page_pool stats
  ...
2023-04-26 16:07:23 -07:00
Paolo Abeni c248b27cfc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-26 10:17:46 +02:00
Linus Torvalds 5dfb75e842 RCU Changes for 6.4:
o  MAINTAINERS files additions and changes.
  o  Fix hotplug warning in nohz code.
  o  Tick dependency changes by Zqiang.
  o  Lazy-RCU shrinker fixes by Zqiang.
  o  rcu-tasks stall reporting improvements by Neeraj.
  o  Initial changes for renaming of k[v]free_rcu() to its new k[v]free_rcu_mightsleep()
     name for robustness.
  o  Documentation Updates:
  o  Significant changes to srcu_struct size.
  o  Deadlock detection for srcu_read_lock() vs synchronize_srcu() from Boqun.
  o  rcutorture and rcu-related tool, which are targeted for v6.4 from Boqun's tree.
  o  Other misc changes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEcoCIrlGe4gjE06JJqA4nf2o45hAFAmQuBnIACgkQqA4nf2o4
 5hACVRAAoXu7/gfh5Pjw9O4E4pCdPJKsZZVYrcrVGrq6NAxRn6M1SgurAdC5grj2
 96x0waoGaiO82V0H5iJMcKdAVu67x9R8WaQ1JoxN75Efn8h9W4TguB87TV1gk0xS
 eZ18b/CyEaM5mNb80DFFF4FLohy5737p/kNTMqXQdUyR1BsDl16iRMgjiBiFhNUx
 yPo8Y2kC2U2OTbldZgaE7s9bQO3xxEcifx93sGWsAex/gx54FYNisiwSlCOSgOE+
 XkYo/OKk8Xvr82tLVX8XQVEPCMJ+rxea8T5zSs8/alvsPq7gA8wW3y6fsoa3vUU/
 +Gd+W+Q/OsONIDtp8rQAY1qsD0ScDpaR8052RSH0zTa7pj8HsQgE5PjZ+cJW0SEi
 cKN+Oe8+ETqKald+xZ6PDf58O212VLrru3RpQWrOQcJ7fmKmfT4REK0RcbLgg4qT
 CBgOo6eg+ub4pxq2y11LZJBNTv1/S7xAEzFE0kArew64KB2gyVud0VJRZVAJnEfe
 93QQVDFrwK2bhgWQZ6J6IbTvGeQW0L93IibuaU6jhZPR283VtUIIvM7vrOylN7Fq
 4jsae0T7YGYfKUhgTpm7rCnm8A/D3Ni8MY0sKYYgDSyKmZUsnpI5wpx1xke4lwwV
 ErrY46RCFa+k8wscc6iWfB4cGXyyFHyu+wtyg0KpFn5JAzcfz4A=
 =Rgbj
 -----END PGP SIGNATURE-----

Merge tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux

Pull RCU updates from Joel Fernandes:

 - Updates and additions to MAINTAINERS files, with Boqun being added to
   the RCU entry and Zqiang being added as an RCU reviewer.

   I have also transitioned from reviewer to maintainer; however, Paul
   will be taking over sending RCU pull-requests for the next merge
   window.

 - Resolution of hotplug warning in nohz code, achieved by fixing
   cpu_is_hotpluggable() through interaction with the nohz subsystem.

   Tick dependency modifications by Zqiang, focusing on fixing usage of
   the TICK_DEP_BIT_RCU_EXP bitmask.

 - Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n
   kernels, fixed by Zqiang.

 - Improvements to rcu-tasks stall reporting by Neeraj.

 - Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for
   increased robustness, affecting several components like mac802154,
   drbd, vmw_vmci, tracing, and more.

   A report by Eric Dumazet showed that the API could be unknowingly
   used in an atomic context, so we'd rather make sure they know what
   they're asking for by being explicit:

      https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/

 - Documentation updates, including corrections to spelling,
   clarifications in comments, and improvements to the srcu_size_state
   comments.

 - Better srcu_struct cache locality for readers, by adjusting the size
   of srcu_struct in support of SRCU usage by Christoph Hellwig.

 - Teach lockdep to detect deadlocks between srcu_read_lock() vs
   synchronize_srcu() contributed by Boqun.

   Previously lockdep could not detect such deadlocks, now it can.

 - Integration of rcutorture and rcu-related tools, targeted for v6.4
   from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis
   module parameter, and more

 - Miscellaneous changes, various code cleanups and comment improvements

* tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits)
  checkpatch: Error out if deprecated RCU API used
  mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep()
  rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep()
  ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep()
  net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep()
  net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
  rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
  rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan()
  rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early
  rcu: Remove never-set needwake assignment from rcu_report_qs_rdp()
  rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels
  rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
  rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race
  rcu/trace: use strscpy() to instead of strncpy()
  tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
  ...
2023-04-24 12:16:14 -07:00
Jakub Kicinski fbc1449d38 mlx5-updates-2023-04-20
1) Dragos Improves RX page pool, and provides some fixes to his previous series:
  1.1) Fix releasing page_pool for striding RQ and legacy RQ nonlinear case
  1.2) Hook NAPIs to page pools to gain more performance.
 
 2) From Roi, Some cleanups to TC and eswitch modules.
 
 3) Maher migrates vnic diagnostic counters reporting from debugfs to a
     dedicated devlink health reporter
 
 Maher Says:
 ===========
  net/mlx5: Expose vnic diagnostic counters using devlink
 
 Currently, vnic diagnostic counters are exposed through the following
 debugfs:
 
 $ ls /sys/kernel/debug/mlx5/0000:08:00.0/esw/vf_0/vnic_diag/
 cq_overrun
 quota_exceeded_command
 total_q_under_processor_handle
 invalid_command
 send_queue_priority_update_flow
 nic_receive_steering_discard
 
 The current design does not allow the hypervisor to view the diagnostic
 counters of its VFs, in case the VFs get bound to a VM. In other words,
 the counters are not exposed for representor interfaces.
 Furthermore, the debugfs design is inconvenient future-wise, in case more
 counters need to be reported by the driver in the future.
 
 As these counters pertain to vNIC health, it is more appropriate to
 utilize the devlink health reporter to expose them.
 
 Thus, this patchest includes the following changes:
 
 * Drop the current vnic diagnostic counters debugfs interface.
 * Add a vnic devlink health reporter for PFs/VFs core devices, which
   when diagnosed will dump vnic diagnostic counter values that are
   queried from FW.
 * Add a vnic devlink health reporter for the representor interface, which
   serves the same purpose listed in the previous point, in addition to
   allowing the hypervisor to view its VFs diagnostic counters, even when
   the VFs are bounded to external VMs.
 
 Example of devlink health reporter usage is:
 $devlink health diagnose pci/0000:08:00.0 reporter vnic
  vNIC env counters:
     total_error_queues: 0 send_queue_priority_update_flow: 0
     comp_eq_overrun: 0 async_eq_overrun: 0 cq_overrun: 0
     invalid_command: 0 quota_exceeded_command: 0
     nic_receive_steering_discard: 0
 
 ===========
 
 4) SW steering fixes and improvements
 
 Yevgeny Kliteynik Says:
 =======================
 These short patch series are just small fixes / improvements for
 SW steering:
 
  - Patch 1: Fix dumping of legacy modify_hdr in debug dump to
    align to what is expected by parser
  - Patch 2: Have separate threshold for ICM sync per ICM type
  - Patch 3: Add more info to the steering debug dump - Linux
    version and device name
  - Patch 4: Keep track of number of buddies that are currently
    in use per domain per buddy type
 
 =======================
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmRB6HcACgkQSD+KveBX
 +j4+nAf/cCJ7poDQQOp/Oug6H3xn/COYPv3Iasl0dhT8yu+LlNMQTF0bRrU4UpAQ
 aIKcja2biOBAnD96EhA1nJoo9bJUTtKLokUDDyK/xRHS+wIyr8Lia6vxTz1yjj3C
 jDqX3+ZP4rFuhAvh+92AT1I0JvS0g+ocokPVKmm+Pwf4y7sG69CZ7phVGSc0iFfT
 y+gnP4C6cdIr7kNLByeeX6alDHL/q83vfNFWrugRPna2uXjcSR5Gtp03pJ0OVkI5
 qHxG7Bz0BE/hcMYwNcNVTu/5e02+5PS6B8kN/ho5DkhVFhp+h17XWWvOMZxC3jfI
 k0ijRSWocG1jIgRgkioBRgzXmc6nNw==
 =jUe8
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2023-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2023-04-20

1) Dragos Improves RX page pool, and provides some fixes to his previous
   series:
 1.1) Fix releasing page_pool for striding RQ and legacy RQ nonlinear case
 1.2) Hook NAPIs to page pools to gain more performance.

2) From Roi, Some cleanups to TC and eswitch modules.

3) Maher migrates vnic diagnostic counters reporting from debugfs to a
    dedicated devlink health reporter

Maher Says:
===========
 net/mlx5: Expose vnic diagnostic counters using devlink

Currently, vnic diagnostic counters are exposed through the following
debugfs:

$ ls /sys/kernel/debug/mlx5/0000:08:00.0/esw/vf_0/vnic_diag/
cq_overrun
quota_exceeded_command
total_q_under_processor_handle
invalid_command
send_queue_priority_update_flow
nic_receive_steering_discard

The current design does not allow the hypervisor to view the diagnostic
counters of its VFs, in case the VFs get bound to a VM. In other words,
the counters are not exposed for representor interfaces.
Furthermore, the debugfs design is inconvenient future-wise, in case more
counters need to be reported by the driver in the future.

As these counters pertain to vNIC health, it is more appropriate to
utilize the devlink health reporter to expose them.

Thus, this patchest includes the following changes:

* Drop the current vnic diagnostic counters debugfs interface.
* Add a vnic devlink health reporter for PFs/VFs core devices, which
  when diagnosed will dump vnic diagnostic counter values that are
  queried from FW.
* Add a vnic devlink health reporter for the representor interface, which
  serves the same purpose listed in the previous point, in addition to
  allowing the hypervisor to view its VFs diagnostic counters, even when
  the VFs are bounded to external VMs.

Example of devlink health reporter usage is:
$devlink health diagnose pci/0000:08:00.0 reporter vnic
 vNIC env counters:
    total_error_queues: 0 send_queue_priority_update_flow: 0
    comp_eq_overrun: 0 async_eq_overrun: 0 cq_overrun: 0
    invalid_command: 0 quota_exceeded_command: 0
    nic_receive_steering_discard: 0

===========

4) SW steering fixes and improvements

Yevgeny Kliteynik Says:
=======================
These short patch series are just small fixes / improvements for
SW steering:

 - Patch 1: Fix dumping of legacy modify_hdr in debug dump to
   align to what is expected by parser
 - Patch 2: Have separate threshold for ICM sync per ICM type
 - Patch 3: Add more info to the steering debug dump - Linux
   version and device name
 - Patch 4: Keep track of number of buddies that are currently
   in use per domain per buddy type

=======================

* tag 'mlx5-updates-2023-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5: Update op_mode to op_mod for port selection
  net/mlx5: E-Switch, Remove unused mlx5_esw_offloads_vport_metadata_set()
  net/mlx5: E-Switch, Remove redundant dev arg from mlx5_esw_vport_alloc()
  net/mlx5: Include linux/pci.h for pci_msix_can_alloc_dyn()
  net/mlx5e: RX, Hook NAPIs to page pools
  net/mlx5e: RX, Fix XDP_TX page release for legacy rq nonlinear case
  net/mlx5e: RX, Fix releasing page_pool pages twice for striding RQ
  net/mlx5e: Add vnic devlink health reporter to representors
  net/mlx5: Add vnic devlink health reporter to PFs/VFs
  Revert "net/mlx5: Expose vnic diagnostic counters for eswitch managed vports"
  Revert "net/mlx5: Expose steering dropped packets counter"
  net/mlx5: DR, Add memory statistics for domain object
  net/mlx5: DR, Add more info in domain dbg dump
  net/mlx5: DR, Calculate sync threshold of each pool according to its type
  net/mlx5: DR, Fix dumping of legacy modify_hdr in debug dump
====================

Link: https://lore.kernel.org/r/20230421013850.349646-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-21 20:47:05 -07:00
Leon Romanovsky 45fd01f2fb net/mlx5e: Refactor duplicated code in mlx5e_ipsec_init_macs
ARP discovery code has same logic for RX and TX flows, but with
different source and destination fields. Instead of duplicating
same code in mlx5e_ipsec_init_macs, let's refactor.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 11:49:47 +01:00
Leon Romanovsky 94edec4484 net/mlx5e: Properly release work data structure
There are some flows in which work structure is not allocated at all
and it is needed to be checked prior release of data structure.

 general protection fault, probably for non-canonical address 0xdffffc000000000a: 0000 [#1] SMP KASAN
 KASAN: null-ptr-deref in range [0x0000000000000050-0x0000000000000057]
 CPU: 6 PID: 3486 Comm: kworker/6:0 Not tainted 6.3.0-rc5_for_upstream_debug_2023_04_06_11_01 #1
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Workqueue: events xfrm_state_gc_task
 RIP: 0010:mlx5e_xfrm_free_state+0x177/0x260 [mlx5_core]
 Code: c1 ea 03 80 3c 02 00 0f 85 f5 00 00 00 4c 8b a5 08 01 00 00 48 b8 00 00 00 00 00 fc ff df 49 8d 7c 24 50 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 b7 00 00 00 49 8b 7c 24 50 e8 85 7c 09 e0 4c 89
 RSP: 0018:ffff888137a8fc50 EFLAGS: 00010206
 RAX: dffffc0000000000 RBX: ffff888180398000 RCX: 0000000000000000
 RDX: 000000000000000a RSI: ffffffffa1878227 RDI: 0000000000000050
 RBP: ffff88812a0c8000 R08: ffff888137a8fb60 R09: 0000000000000000
 R10: fffffbfff09aba0c R11: 0000000000000001 R12: 0000000000000000
 R13: ffff88812a0c8108 R14: ffffffff84c63480 R15: ffff8881acb63118
 FS:  0000000000000000(0000) GS:ffff88881eb00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f667e8bc000 CR3: 0000000004693006 CR4: 0000000000370ea0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:

  ___xfrm_state_destroy+0x3c8/0x5e0
  xfrm_state_gc_task+0xf6/0x140
  ? ___xfrm_state_destroy+0x5e0/0x5e0
  process_one_work+0x7c2/0x1340
  ? lockdep_hardirqs_on_prepare+0x3f0/0x3f0
  ? pwq_dec_nr_in_flight+0x230/0x230
  ? spin_bug+0x1d0/0x1d0
  worker_thread+0x59d/0xec0
  ? __kthread_parkme+0xd9/0x1d0
  ? process_one_work+0x1340/0x1340
  kthread+0x28f/0x330
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30

 Modules linked in: sch_ingress openvswitch nsh mlx5_vdpa vringh vhost_iotlb vdpa mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_umad ib_ipoib ib_cm ib_uverbs ib_core vfio_pci vfio_pci_core vfio_iommu_type1 vfio cuse overlay zram zsmalloc fuse [last unloaded: mlx5_core]
 ---[ end trace 0000000000000000 ]---

Fixes: 4562116f8a ("net/mlx5e: Generalize IPsec work structs")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 11:49:47 +01:00
Leon Romanovsky 3198ae7d42 net/mlx5e: Compare all fields in IPv6 address
Fix size argument in memcmp to compare whole IPv6 address.

Fixes: b3beba1fb4 ("net/mlx5e: Allow policies with reqid 0, to support IKE policy holes")
Reviewed-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Emeel Hakim <ehakim@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 11:49:47 +01:00
Leon Romanovsky 697b3518eb net/mlx5e: Don't overwrite extack message returned from IPsec SA validator
Addition of new err_xfrm label caused to error messages be overwritten.
Fix it by using proper NL_SET_ERR_MSG_WEAK_MOD macro together with change
in a default message.

Fixes: aa8bd0c951 ("net/mlx5e: Support IPsec acquire default SA")
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 11:49:47 +01:00
Leon Romanovsky e239e31ae8 net/mlx5e: Fix FW error while setting IPsec policy block action
When trying to set IPsec policy block action the following error is
generated:

 mlx5_cmd_out_err:803:(pid 3426): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed,
	status bad parameter(0x3), syndrome (0x8708c3), err(-22)

This error means that drop action is not allowed when modify action is
set, so update the code to skip modify header for XFRM_POLICY_BLOCK action.

Fixes: 6721239672 ("net/mlx5e: Skip IPsec encryption for TX path without matching policy")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 11:49:47 +01:00
Emeel Hakim 765f974c7d net/mlx5: Consider VLAN interface in MACsec TX steering rules
Offloading MACsec when its configured over VLAN with current MACsec
TX steering rules will wrongly insert MACsec sec tag after inserting
the VLAN header leading to a ETHERNET | SECTAG | VLAN packet when
ETHERNET | VLAN | SECTAG is configured.

The above issue is due to adding the SECTAG by HW which is a later
stage compared to the VLAN header insertion stage.

Detect such a case and adjust TX steering rules to insert the
SECTAG in the correct place by using reformat_param_0 field in
the packet reformat to indicate the offset of SECTAG from end of
the MAC header to account for VLANs in granularity of 4Bytes.

Signed-off-by: Emeel Hakim <ehakim@nvidia.com>
Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:22:15 +01:00
Emeel Hakim 4bba492b04 net/mlx5: Support MACsec over VLAN
MACsec device may have a VLAN device on top of it.
Detect MACsec state correctly under this condition,
and return the correct net device accordingly.

Signed-off-by: Emeel Hakim <ehakim@nvidia.com>
Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:22:15 +01:00
Emeel Hakim 339ccec8d4 net/mlx5: Enable MACsec offload feature for VLAN interface
Enable MACsec offload feature over VLAN by adding NETIF_F_HW_MACSEC
to the device vlan_features.

Signed-off-by: Emeel Hakim <ehakim@nvidia.com>
Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:22:15 +01:00
Jakub Kicinski 61718206ee eth: mlx5: avoid iterator use outside of a loop
Fix the following warning about risky iterator use:

drivers/net/ethernet/mellanox/mlx5/core/eq.c:1010 mlx5_comp_irq_get_affinity_mask() warn: iterator used outside loop: 'eq'

Acked-by: Saeed Mahameed <saeed@kernel.org>
Link: https://lore.kernel.org/r/20230420015802.815362-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20 19:25:05 -07:00
Vlad Buslov 081abcacaf Revert "net/mlx5e: Don't use termination table when redundant"
This reverts commit 14624d7247.

The termination table usage is requires for DMFS steering mode as firmware
doesn't support mixed table destinations list which causes following
syndrome with hairpin rules:

[81922.283225] mlx5_core 0000:08:00.0: mlx5_cmd_out_err:803:(pid 25977): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0xaca205), err(-22)

Fixes: 14624d7247 ("net/mlx5e: Don't use termination table when redundant")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-20 18:47:33 -07:00
Aya Levin 1b540decd0 net/mlx5e: Nullify table pointer when failing to create
On failing to create promisc flow steering table, the pointer is
returned with an error. Nullify it so unloading the driver won't try to
destroy a non existing table.

Failing to create promisc table may happen over BF devices when the ARM
side is going through a firmware tear down. The host side start a
reload flow. While the driver unloads, it tries to remove the promisc
table. Remove WARN in this state as it is a valid error flow.

Fixes: 1c46d7409f ("net/mlx5e: Optimize promiscuous mode")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-20 18:47:33 -07:00
Moshe Shemesh dfad99750c net/mlx5: Use recovery timeout on sync reset flow
Use the same timeout for sync reset flow and health recovery flow, since
the former involves driver's recovery from firmware reset, which is
similar to health recovery. Otherwise, in some cases, such as a firmware
upgrade on the DPU, the firmware pre-init bit may not be ready within
current timeout and the driver will abort loading back after reset.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Fixes: 37ca95e62e ("net/mlx5: Increase FW pre-init timeout for health recovery")
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-20 18:47:33 -07:00
Moshe Shemesh 21608a2cf3 Revert "net/mlx5: Remove "recovery" arg from mlx5_load_one() function"
This reverts commit 5977ac3910.

Revert this patch as we need the "recovery" arg back in mlx5_load_one()
function. This arg will be used in the next patch for using recovery
timeout during sync reset flow.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-20 18:47:33 -07:00
Roi Dayan 0a6b069cc6 net/mlx5e: Fix error flow in representor failing to add vport rx rule
On representor init rx error flow the flow steering pointer is being
released so mlx5e_attach_netdev() doesn't have a valid fs pointer
in its error flow. Make sure the pointer is nullified when released
and add a check in mlx5e_fs_cleanup() to verify fs is not null
as representor cleanup callback would be called anyway.

Fixes: af8bbf7300 ("net/mlx5e: Convert mlx5e_flow_steering member of mlx5e_priv to pointer")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-20 18:47:32 -07:00
Chris Mi 4fbef0f8ea net/mlx5: Release tunnel device after tc update skb
The cited commit causes a regression. Tunnel device is not released
after tc update skb if skb needs to be freed. The following error
message will be printed:

  unregister_netdevice: waiting for vxlan1 to become free. Usage count = 11

Fix it by releasing tunnel device if skb needs to be freed.

Fixes: 93a1ab2c54 ("net/mlx5: Refactor tc miss handling to a single function")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-20 18:47:32 -07:00
Chris Mi 4c81893025 net/mlx5: E-switch, Don't destroy indirect table in split rule
Source port rewrite (forward to ovs internal port or statck device) isn't
supported in the rule of split action. So there is no indirect table in
split rule. The cited commit destroyes indirect table in split rule. The
indirect table for other rules will be destroyed wrongly. It will cause
traffic loss.

Fix it by removing the destroy function in split rule. And also remove
the destroy function in error flow.

Fixes: 10742efc20 ("net/mlx5e: VF tunnel TX traffic offloading")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-20 18:47:32 -07:00