Commit graph

1137097 commits

Author SHA1 Message Date
Linus Torvalds
7f9a7cd690 media fixes for v6.1-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+QmuaPwR3wnBdVwACF8+vY7k4RUFAmNaSe8ACgkQCF8+vY7k
 4RWc+g//dWz903vryxpRlC2ap7W3Ml9A9KC2NONyR9ryvX71rXaqsVe5/rg4+kKK
 HJ5xbY6cFZVNWWvSEfW2sfNolAwdkEuDem1VDrOIvJRV+1fxCip4frwkD4bhY/V0
 H8I50tgSAtzpTTwEX8w5KHocUYXqtn0T9SX6CA5ll9ijJwVdU2hUZFzUWG2cbx1r
 shSv99HQApflUfD6McwhqFK8r1EpyzKUZypM3zmpSinvEinT+5naV6QBXlLMsO9F
 mmGnQKyOch1a9tTHV9MAojEVN7wTWZbdT7hwwwMP1Fj8zhdt6UqncjK6eaHhbpYd
 WExZEGhN1l+ZDxZZ1kY/VX/pE93uLaq16WkJH6HftiTYjdXpZe6IjBZnJsSIPktO
 BCoEYJfmCmfC/9AkqrM9/TdFBJ3MRgZwfrhZ8j6dcEgvZ5OYpQLOaWIR0cZ0YYPE
 iw+HooXlv3gf1JiMLb8KVFpC4UrD1RU8HfIFD2KaMx1UKUs3NjVzv5g8V+IDUa1i
 ky80MvEXXH6Eg91QNypQEY6EH6G6c2Mk8yVj6WVFWTEC9mqNo/A1egL6DFDKbfZd
 OuP3bl/hjdNU1oQ9ajBq/GurUJQoFtnCie5M2Sqy3gyKfD92F7nufJusfICaPgz7
 SztFKjPcaomLNJl/IH2ALox11+fs2HfTeiCX0zZwun8ddPqOWBM=
 =WXL5
 -----END PGP SIGNATURE-----

Merge tag 'media/v6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media fixes from Mauro Carvalho Chehab:
 "A bunch of patches addressing issues in the vivid driver and adding
  new checks in V4L2 to validate the input parameters from some ioctls"

* tag 'media/v6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  media: vivid.rst: loop_video is set on the capture devnode
  media: vivid: set num_in/outputs to 0 if not supported
  media: vivid: drop GFP_DMA32
  media: vivid: fix control handler mutex deadlock
  media: videodev2.h: V4L2_DV_BT_BLANKING_HEIGHT should check 'interlaced'
  media: v4l2-dv-timings: add sanity checks for blanking values
  media: vivid: dev->bitmap_cap wasn't freed in all cases
  media: vivid: s_fbuf: add more sanity checks
2022-10-27 12:21:57 -07:00
Linus Torvalds
200204f56f fscrypt fix for 6.1-rc3
Fix a memory leak that was introduced by a change that went into -rc1.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCY1oM6BQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK3ixAP9IY1TdJu64uKTofFdYvO/wBASpdszm
 GkY1QnEFxATA9AEAwRswZgaGiuKj4hFBeIWmu9+luT4T7kVIcaumslTyTg8=
 =YinC
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fscrypt fix from Eric Biggers:
 "Fix a memory leak that was introduced by a change that went into -rc1"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fscrypt: fix keyring memory leak on mount failure
2022-10-27 11:44:18 -07:00
Vladimir Oltean
84ce1ca3fe net: enetc: survive memory pressure without crashing
Under memory pressure, enetc_refill_rx_ring() may fail, and when called
during the enetc_open() -> enetc_setup_rxbdr() procedure, this is not
checked for.

An extreme case of memory pressure will result in exactly zero buffers
being allocated for the RX ring, and in such a case it is expected that
hardware drops all RX packets due to lack of buffers.

This does not happen, because the reset-default value of the consumer
and produces index is 0, and this makes the ENETC think that all buffers
have been initialized and that it owns them (when in reality none were).

The hardware guide explains this best:

| Configure the receive ring producer index register RBaPIR with a value
| of 0. The producer index is initially configured by software but owned
| by hardware after the ring has been enabled. Hardware increments the
| index when a frame is received which may consume one or more BDs.
| Hardware is not allowed to increment the producer index to match the
| consumer index since it is used to indicate an empty condition. The ring
| can hold at most RBLENR[LENGTH]-1 received BDs.
|
| Configure the receive ring consumer index register RBaCIR. The
| consumer index is owned by software and updated during operation of the
| of the BD ring by software, to indicate that any receive data occupied
| in the BD has been processed and it has been prepared for new data.
| - If consumer index and producer index are initialized to the same
|   value, it indicates that all BDs in the ring have been prepared and
|   hardware owns all of the entries.
| - If consumer index is initialized to producer index plus N, it would
|   indicate N BDs have been prepared. Note that hardware cannot start if
|   only a single buffer is prepared due to the restrictions described in
|   (2).
| - Software may write consumer index to match producer index anytime
|   while the ring is operational to indicate all received BDs prior have
|   been processed and new BDs prepared for hardware.

Normally, the value of rx_ring->rcir (consumer index) is brought in sync
with the rx_ring->next_to_use software index, but this only happens if
page allocation ever succeeded.

When PI==CI==0, the hardware appears to receive frames and write them to
DMA address 0x0 (?!), then set the READY bit in the BD.

The enetc_clean_rx_ring() function (and its XDP derivative) is naturally
not prepared to handle such a condition. It will attempt to process
those frames using the rx_swbd structure associated with index i of the
RX ring, but that structure is not fully initialized (enetc_new_page()
does all of that). So what happens next is undefined behavior.

To operate using no buffer, we must initialize the CI to PI + 1, which
will block the hardware from advancing the CI any further, and drop
everything.

The issue was seen while adding support for zero-copy AF_XDP sockets,
where buffer memory comes from user space, which can even decide to
supply no buffers at all (example: "xdpsock --txonly"). However, the bug
is present also with the network stack code, even though it would take a
very determined person to trigger a page allocation failure at the
perfect time (a series of ifup/ifdown under memory pressure should
eventually reproduce it given enough retries).

Fixes: d4fd0404c1 ("enetc: Introduce basic PF and VF ENETC ethernet drivers")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Link: https://lore.kernel.org/r/20221027182925.3256653-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 11:32:25 -07:00
Eric Dumazet
ee15e1f38d kcm: do not sense pfmemalloc status in kcm_sendpage()
Similar to changes done in TCP in blamed commit.
We should not sense pfmemalloc status in sendpage() methods.

Fixes: 3261400639 ("tcp: TX zerocopy should not sense pfmemalloc status")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221027040637.1107703-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 11:25:13 -07:00
Eric Dumazet
228ebc41df net: do not sense pfmemalloc status in skb_append_pagefrags()
skb_append_pagefrags() is used by af_unix and udp sendpage()
implementation so far.

In commit 3261400639 ("tcp: TX zerocopy should not sense
pfmemalloc status") we explained why we should not sense
pfmemalloc status for pages owned by user space.

We should also use skb_fill_page_desc_noacc()
in skb_append_pagefrags() to avoid following KCSAN report:

BUG: KCSAN: data-race in lru_add_fn / skb_append_pagefrags

write to 0xffffea00058fc1c8 of 8 bytes by task 17319 on cpu 0:
__list_add include/linux/list.h:73 [inline]
list_add include/linux/list.h:88 [inline]
lruvec_add_folio include/linux/mm_inline.h:323 [inline]
lru_add_fn+0x327/0x410 mm/swap.c:228
folio_batch_move_lru+0x1e1/0x2a0 mm/swap.c:246
lru_add_drain_cpu+0x73/0x250 mm/swap.c:669
lru_add_drain+0x21/0x60 mm/swap.c:773
free_pages_and_swap_cache+0x16/0x70 mm/swap_state.c:311
tlb_batch_pages_flush mm/mmu_gather.c:59 [inline]
tlb_flush_mmu_free mm/mmu_gather.c:256 [inline]
tlb_flush_mmu+0x5b2/0x640 mm/mmu_gather.c:263
tlb_finish_mmu+0x86/0x100 mm/mmu_gather.c:363
exit_mmap+0x190/0x4d0 mm/mmap.c:3098
__mmput+0x27/0x1b0 kernel/fork.c:1185
mmput+0x3d/0x50 kernel/fork.c:1207
copy_process+0x19fc/0x2100 kernel/fork.c:2518
kernel_clone+0x166/0x550 kernel/fork.c:2671
__do_sys_clone kernel/fork.c:2812 [inline]
__se_sys_clone kernel/fork.c:2796 [inline]
__x64_sys_clone+0xc3/0xf0 kernel/fork.c:2796
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

read to 0xffffea00058fc1c8 of 8 bytes by task 17325 on cpu 1:
page_is_pfmemalloc include/linux/mm.h:1817 [inline]
__skb_fill_page_desc include/linux/skbuff.h:2432 [inline]
skb_fill_page_desc include/linux/skbuff.h:2453 [inline]
skb_append_pagefrags+0x210/0x600 net/core/skbuff.c:3974
unix_stream_sendpage+0x45e/0x990 net/unix/af_unix.c:2338
kernel_sendpage+0x184/0x300 net/socket.c:3561
sock_sendpage+0x5a/0x70 net/socket.c:1054
pipe_to_sendpage+0x128/0x160 fs/splice.c:361
splice_from_pipe_feed fs/splice.c:415 [inline]
__splice_from_pipe+0x222/0x4d0 fs/splice.c:559
splice_from_pipe fs/splice.c:594 [inline]
generic_splice_sendpage+0x89/0xc0 fs/splice.c:743
do_splice_from fs/splice.c:764 [inline]
direct_splice_actor+0x80/0xa0 fs/splice.c:931
splice_direct_to_actor+0x305/0x620 fs/splice.c:886
do_splice_direct+0xfb/0x180 fs/splice.c:974
do_sendfile+0x3bf/0x910 fs/read_write.c:1255
__do_sys_sendfile64 fs/read_write.c:1323 [inline]
__se_sys_sendfile64 fs/read_write.c:1309 [inline]
__x64_sys_sendfile64+0x10c/0x150 fs/read_write.c:1309
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0x0000000000000000 -> 0xffffea00058fc188

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 17325 Comm: syz-executor.0 Not tainted 6.1.0-rc1-syzkaller-00158-g440b7895c990-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022

Fixes: 3261400639 ("tcp: TX zerocopy should not sense pfmemalloc status")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221027040346.1104204-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 11:25:13 -07:00
Raed Salem
12ba40ba3d net/mlx5e: Fix macsec sci endianness at rx sa update
The cited commit at rx sa update operation passes the sci object
attribute, in the wrong endianness and not as expected by the HW
effectively create malformed hw sa context in case of update rx sa
consequently, HW produces unexpected MACsec packets which uses this
sa.

Fix by passing sci to create macsec object with the correct endianness,
while at it add __force u64 to prevent sparse check error of type
"sparse: error: incorrect type in assignment".

Fixes: aae3454e4d ("net/mlx5e: Add MACsec offload Rx command support")
Signed-off-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221026135153.154807-16-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 11:06:57 -07:00
Raed Salem
d550956458 net/mlx5e: Fix wrong bitwise comparison usage in macsec_fs_rx_add_rule function
The cited commit produces a sparse check error of type
"sparse: error: restricted __be64 degrades to integer". The
offending line wrongly did a bitwise operation between two different
storage types one of 64 bit when the other smaller side is 16 bit
which caused the above sparse error, furthermore bitwise operation
usage here is wrong in the first place as the constant MACSEC_PORT_ES
is not a bitwise field.

Fix by using the right mask to get the lower 16 bit if the sci number,
and use comparison operator '==' instead of bitwise '&' operator.

Fixes: 3b20949cb2 ("net/mlx5e: Add MACsec RX steering rules")
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221026135153.154807-15-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 11:06:56 -07:00
Raed Salem
74573e38e9 net/mlx5e: Fix macsec rx security association (SA) update/delete
The cited commit adds the support for update/delete MACsec Rx SA,
naturally, these operations need to check if the SA in question exists
to update/delete the SA and return error code otherwise, however they
do just the opposite i.e. return with error if the SA exists

Fix by change the check to return error in case the SA in question does
not exist, adjust error message and code accordingly.

Fixes: aae3454e4d ("net/mlx5e: Add MACsec offload Rx command support")
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221026135153.154807-14-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 11:06:56 -07:00
Raed Salem
d3ecf03756 net/mlx5e: Fix macsec coverity issue at rx sa update
The cited commit at update rx sa operation passes object attributes
to MACsec object create function without initializing/setting all
attributes fields leaving some of them with garbage values, therefore
violating the implicit assumption at create object function, which
assumes that all input object attributes fields are set.

Fix by initializing the object attributes struct to zero, thus leaving
unset fields with the legal zero value.

Fixes: aae3454e4d ("net/mlx5e: Add MACsec offload Rx command support")
Signed-off-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Lior Nahmanson <liorna@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221026135153.154807-13-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 11:06:56 -07:00
Suresh Devarakonda
aefb62a998 net/mlx5: Fix crash during sync firmware reset
When setting Bluefield to DPU NIC mode using mlxconfig tool +  sync
firmware reset flow, we run into scenario where the host was not
eswitch manager at the time of mlx5 driver load but becomes eswitch manager
after the sync firmware reset flow. This results in null pointer
access of mpfs structure during mac filter add. This change prevents null
pointer access but mpfs table entries will not be added.

Fixes: 5ec697446f ("net/mlx5: Add support for devlink reload action fw activate")
Signed-off-by: Suresh Devarakonda <ramad@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Bodong Wang <bodong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221026135153.154807-12-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 11:06:56 -07:00
Roy Novich
416ef71363 net/mlx5: Update fw fatal reporter state on PCI handlers successful recover
Update devlink health fw fatal reporter state to "healthy" is needed by
strictly calling devlink_health_reporter_state_update() after recovery
was done by PCI error handler. This is needed when fw_fatal reporter was
triggered due to PCI error. Poll health is called and set reporter state
to error. Health recovery failed (since EEH didn't re-enable the PCI).
PCI handlers keep on recover flow and succeed later without devlink
acknowledgment. Fix this by adding devlink state update at the end of
the PCI handler recovery process.

Fixes: 6181e5cb75 ("devlink: add support for reporter recovery completion")
Signed-off-by: Roy Novich <royno@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221026135153.154807-11-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 11:06:56 -07:00
Roi Dayan
94d651739e net/mlx5e: TC, Fix cloned flow attr instance dests are not zeroed
On multi table split the driver creates a new attr instance with
data being copied from prev attr instance zeroing action flags.
Also need to reset dests properties to avoid incorrect dests per attr.

Fixes: 8300f22526 ("net/mlx5e: Create new flow attr for multi table actions")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221026135153.154807-10-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 11:06:56 -07:00
Ariel Levkovich
f382a2413d net/mlx5e: TC, Reject forwarding from internal port to internal port
Reject TC rules that forward from internal port to internal port
as it is not supported.

This include rules that are explicitly have internal port as
the filter device as well as rules that apply on tunnel interfaces
as the route device for the tunnel interface can be an internal
port.

Fixes: 27484f7170 ("net/mlx5e: Offload tc rules that redirect to ovs internal port")
Signed-off-by: Ariel Levkovich <lariel@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221026135153.154807-9-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 11:06:56 -07:00
Tariq Toukan
bacd22df95 net/mlx5: Fix possible use-after-free in async command interface
mlx5_cmd_cleanup_async_ctx should return only after all its callback
handlers were completed. Before this patch, the below race between
mlx5_cmd_cleanup_async_ctx and mlx5_cmd_exec_cb_handler was possible and
lead to a use-after-free:

1. mlx5_cmd_cleanup_async_ctx is called while num_inflight is 2 (i.e.
   elevated by 1, a single inflight callback).
2. mlx5_cmd_cleanup_async_ctx decreases num_inflight to 1.
3. mlx5_cmd_exec_cb_handler is called, decreases num_inflight to 0 and
   is about to call wake_up().
4. mlx5_cmd_cleanup_async_ctx calls wait_event, which returns
   immediately as the condition (num_inflight == 0) holds.
5. mlx5_cmd_cleanup_async_ctx returns.
6. The caller of mlx5_cmd_cleanup_async_ctx frees the mlx5_async_ctx
   object.
7. mlx5_cmd_exec_cb_handler goes on and calls wake_up() on the freed
   object.

Fix it by syncing using a completion object. Mark it completed when
num_inflight reaches 0.

Trace:

BUG: KASAN: use-after-free in do_raw_spin_lock+0x23d/0x270
Read of size 4 at addr ffff888139cd12f4 by task swapper/5/0

CPU: 5 PID: 0 Comm: swapper/5 Not tainted 6.0.0-rc3_for_upstream_debug_2022_08_30_13_10 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
 <IRQ>
 dump_stack_lvl+0x57/0x7d
 print_report.cold+0x2d5/0x684
 ? do_raw_spin_lock+0x23d/0x270
 kasan_report+0xb1/0x1a0
 ? do_raw_spin_lock+0x23d/0x270
 do_raw_spin_lock+0x23d/0x270
 ? rwlock_bug.part.0+0x90/0x90
 ? __delete_object+0xb8/0x100
 ? lock_downgrade+0x6e0/0x6e0
 _raw_spin_lock_irqsave+0x43/0x60
 ? __wake_up_common_lock+0xb9/0x140
 __wake_up_common_lock+0xb9/0x140
 ? __wake_up_common+0x650/0x650
 ? destroy_tis_callback+0x53/0x70 [mlx5_core]
 ? kasan_set_track+0x21/0x30
 ? destroy_tis_callback+0x53/0x70 [mlx5_core]
 ? kfree+0x1ba/0x520
 ? do_raw_spin_unlock+0x54/0x220
 mlx5_cmd_exec_cb_handler+0x136/0x1a0 [mlx5_core]
 ? mlx5_cmd_cleanup_async_ctx+0x220/0x220 [mlx5_core]
 ? mlx5_cmd_cleanup_async_ctx+0x220/0x220 [mlx5_core]
 mlx5_cmd_comp_handler+0x65a/0x12b0 [mlx5_core]
 ? dump_command+0xcc0/0xcc0 [mlx5_core]
 ? lockdep_hardirqs_on_prepare+0x400/0x400
 ? cmd_comp_notifier+0x7e/0xb0 [mlx5_core]
 cmd_comp_notifier+0x7e/0xb0 [mlx5_core]
 atomic_notifier_call_chain+0xd7/0x1d0
 mlx5_eq_async_int+0x3ce/0xa20 [mlx5_core]
 atomic_notifier_call_chain+0xd7/0x1d0
 ? irq_release+0x140/0x140 [mlx5_core]
 irq_int_handler+0x19/0x30 [mlx5_core]
 __handle_irq_event_percpu+0x1f2/0x620
 handle_irq_event+0xb2/0x1d0
 handle_edge_irq+0x21e/0xb00
 __common_interrupt+0x79/0x1a0
 common_interrupt+0x78/0xa0
 </IRQ>
 <TASK>
 asm_common_interrupt+0x22/0x40
RIP: 0010:default_idle+0x42/0x60
Code: c1 83 e0 07 48 c1 e9 03 83 c0 03 0f b6 14 11 38 d0 7c 04 84 d2 75 14 8b 05 eb 47 22 02 85 c0 7e 07 0f 00 2d e0 9f 48 00 fb f4 <c3> 48 c7 c7 80 08 7f 85 e8 d1 d3 3e fe eb de 66 66 2e 0f 1f 84 00
RSP: 0018:ffff888100dbfdf0 EFLAGS: 00000242
RAX: 0000000000000001 RBX: ffffffff84ecbd48 RCX: 1ffffffff0afe110
RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffff835cc9bc
RBP: 0000000000000005 R08: 0000000000000001 R09: ffff88881dec4ac3
R10: ffffed1103bd8958 R11: 0000017d0ca571c9 R12: 0000000000000005
R13: ffffffff84f024e0 R14: 0000000000000000 R15: dffffc0000000000
 ? default_idle_call+0xcc/0x450
 default_idle_call+0xec/0x450
 do_idle+0x394/0x450
 ? arch_cpu_idle_exit+0x40/0x40
 ? do_idle+0x17/0x450
 cpu_startup_entry+0x19/0x20
 start_secondary+0x221/0x2b0
 ? set_cpu_sibling_map+0x2070/0x2070
 secondary_startup_64_no_verify+0xcd/0xdb
 </TASK>

Allocated by task 49502:
 kasan_save_stack+0x1e/0x40
 __kasan_kmalloc+0x81/0xa0
 kvmalloc_node+0x48/0xe0
 mlx5e_bulk_async_init+0x35/0x110 [mlx5_core]
 mlx5e_tls_priv_tx_list_cleanup+0x84/0x3e0 [mlx5_core]
 mlx5e_ktls_cleanup_tx+0x38f/0x760 [mlx5_core]
 mlx5e_cleanup_nic_tx+0xa7/0x100 [mlx5_core]
 mlx5e_detach_netdev+0x1ca/0x2b0 [mlx5_core]
 mlx5e_suspend+0xdb/0x140 [mlx5_core]
 mlx5e_remove+0x89/0x190 [mlx5_core]
 auxiliary_bus_remove+0x52/0x70
 device_release_driver_internal+0x40f/0x650
 driver_detach+0xc1/0x180
 bus_remove_driver+0x125/0x2f0
 auxiliary_driver_unregister+0x16/0x50
 mlx5e_cleanup+0x26/0x30 [mlx5_core]
 cleanup+0xc/0x4e [mlx5_core]
 __x64_sys_delete_module+0x2b5/0x450
 do_syscall_64+0x3d/0x90
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

Freed by task 49502:
 kasan_save_stack+0x1e/0x40
 kasan_set_track+0x21/0x30
 kasan_set_free_info+0x20/0x30
 ____kasan_slab_free+0x11d/0x1b0
 kfree+0x1ba/0x520
 mlx5e_tls_priv_tx_list_cleanup+0x2e7/0x3e0 [mlx5_core]
 mlx5e_ktls_cleanup_tx+0x38f/0x760 [mlx5_core]
 mlx5e_cleanup_nic_tx+0xa7/0x100 [mlx5_core]
 mlx5e_detach_netdev+0x1ca/0x2b0 [mlx5_core]
 mlx5e_suspend+0xdb/0x140 [mlx5_core]
 mlx5e_remove+0x89/0x190 [mlx5_core]
 auxiliary_bus_remove+0x52/0x70
 device_release_driver_internal+0x40f/0x650
 driver_detach+0xc1/0x180
 bus_remove_driver+0x125/0x2f0
 auxiliary_driver_unregister+0x16/0x50
 mlx5e_cleanup+0x26/0x30 [mlx5_core]
 cleanup+0xc/0x4e [mlx5_core]
 __x64_sys_delete_module+0x2b5/0x450
 do_syscall_64+0x3d/0x90
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

Fixes: e355477ed9 ("net/mlx5: Make mlx5_cmd_exec_cb() a safe API")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221026135153.154807-8-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 11:06:56 -07:00
Saeed Mahameed
0f3caaa2c6 net/mlx5: ASO, Create the ASO SQ with the correct timestamp format
mlx5 SQs must select the timestamp format explicitly according to the
active clock mode, select the current active timestamp mode so ASO SQ create
will succeed.

This fixes the following error prints when trying to create ipsec ASO SQ
while the timestamp format is real time mode.

mlx5_cmd_out_err:778:(pid 34874): CREATE_SQ(0x904) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0xd61c0b), err(-22)
mlx5_aso_create_sq:285:(pid 34874): Failed to open aso wq sq, err=-22
mlx5e_ipsec_init:436:(pid 34874): IPSec initialization failed, -22

Fixes: cdd04f4d4d ("net/mlx5: Add support to create SQ and CQ for ASO")
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reported-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221026135153.154807-7-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 11:06:55 -07:00
Paul Blakey
8dc47c0527 net/mlx5e: Update restore chain id for slow path packets
Currently encap slow path rules just forward to software without
setting the chain id miss register, so driver doesn't restore
the chain, and packets hitting this rule will restart from tc chain
0 instead of continuing to the chain the encap rule was on.

Fix this by setting the chain id miss register to the chain id mapping.

Fixes: 8f1e0b97cc ("net/mlx5: E-Switch, Mark miss packets with new chain id mapping")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221026135153.154807-6-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 11:06:55 -07:00
Aya Levin
19b43a432e net/mlx5e: Extend SKB room check to include PTP-SQ
When tx_port_ts is set, the driver diverts all UPD traffic over PTP port
to a dedicated PTP-SQ. The SKBs are cached until the wire-CQE arrives.
When the packet size is greater then MTU, the firmware might drop it and
the packet won't be transmitted to the wire, hence the wire-CQE won't
reach the driver. In this case the SKBs are accumulated in the SKB fifo.
Add room check to consider the PTP-SQ SKB fifo, when the SKB fifo is
full, driver stops the queue resulting in a TX timeout. Devlink
TX-reporter can recover from it.

Fixes: 1880bc4e4a ("net/mlx5e: Add TX port timestamp support")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221026135153.154807-5-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 11:06:55 -07:00
Rongwei Liu
4ea9891d66 net/mlx5: DR, Fix matcher disconnect error flow
When 2nd flow rules arrives, it will merge together with the
1st one if matcher criteria is the same.

If merge fails, driver will rollback the merge contents, and
reject the 2nd rule. At rollback stage, matcher can't be
disconnected unconditionally, otherise the 1st rule can't be
hit anymore.

Add logic to check if the matcher should be disconnected or not.

Fixes: cc2295cd54 ("net/mlx5: DR, Improve steering for empty or RX/TX-only matchers")
Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221026135153.154807-4-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 11:06:50 -07:00
Moshe Shemesh
212b4d7251 net/mlx5: Wait for firmware to enable CRS before pci_restore_state
After firmware reset driver should verify firmware already enabled CRS
and became responsive to pci config cycles before restoring pci state.
Fix that by waiting till device_id is readable through PCI again.

Fixes: eabe8e5e88 ("net/mlx5: Handle sync reset now event")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221026135153.154807-3-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 11:06:35 -07:00
Hyong Youb Kim
888be6b279 net/mlx5e: Do not increment ESN when updating IPsec ESN state
An offloaded SA stops receiving after about 2^32 + replay_window
packets. For example, when SA reaches <seq-hi 0x1, seq 0x2c>, all
subsequent packets get dropped with SA-icv-failure (integrity_failed).

To reproduce the bug:
- ConnectX-6 Dx with crypto enabled (FW 22.30.1004)
- ipsec.conf:
  nic-offload = yes
  replay-window = 32
  esn = yes
  salifetime=24h
- Run netperf for a long time to send more than 2^32 packets
  netperf -H <device-under-test> -t TCP_STREAM -l 20000

When 2^32 + replay_window packets are received, the replay window
moves from the 2nd half of subspace (overlap=1) to the 1st half
(overlap=0). The driver then updates the 'esn' value in NIC
(i.e. seq_hi) as follows.

 seq_hi = xfrm_replay_seqhi(seq_bottom)
 new esn in NIC = seq_hi + 1

The +1 increment is wrong, as seq_hi already contains the correct
seq_hi. For example, when seq_hi=1, the driver actually tells NIC to
use seq_hi=2 (esn). This incorrect esn value causes all subsequent
packets to fail integrity checks (SA-icv-failure). So, do not
increment.

Fixes: cb01008390 ("net/mlx5: IPSec, Add support for ESN")
Signed-off-by: Hyong Youb Kim <hyonkim@cisco.com>
Acked-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221026135153.154807-2-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 11:06:34 -07:00
Jakub Kicinski
9ddcead06d Merge branch 'fix-some-issues-in-netdevsim-driver'
Zhengchao Shao says:

====================
fix some issues in netdevsim driver

When strace tool is used to perform memory injection, memory leaks and
files not removed issues are found. Fix them.
====================

Link: https://lore.kernel.org/r/20221026014642.116261-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 10:47:32 -07:00
Zhengchao Shao
a6aa8d0ce2 netdevsim: remove dir in nsim_dev_debugfs_init() when creating ports dir failed
Remove dir in nsim_dev_debugfs_init() when creating ports dir failed.
Otherwise, the netdevsim device will not be created next time. Kernel
reports an error: debugfs: Directory 'netdevsim1' with parent 'netdevsim'
already present!

Fixes: ab1d0cc004 ("netdevsim: change debugfs tree topology")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 10:47:29 -07:00
Zhengchao Shao
6b1da9f712 netdevsim: fix memory leak in nsim_drv_probe() when nsim_dev_resources_register() failed
If some items in nsim_dev_resources_register() fail, memory leak will
occur. The following is the memory leak information.

unreferenced object 0xffff888074c02600 (size 128):
  comm "echo", pid 8159, jiffies 4294945184 (age 493.530s)
  hex dump (first 32 bytes):
    40 47 ea 89 ff ff ff ff 01 00 00 00 00 00 00 00  @G..............
    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  backtrace:
    [<0000000011a31c98>] kmalloc_trace+0x22/0x60
    [<0000000027384c69>] devl_resource_register+0x144/0x4e0
    [<00000000a16db248>] nsim_drv_probe+0x37a/0x1260
    [<000000007d1f448c>] really_probe+0x20b/0xb10
    [<00000000c416848a>] __driver_probe_device+0x1b3/0x4a0
    [<00000000077e0351>] driver_probe_device+0x49/0x140
    [<0000000054f2465a>] __device_attach_driver+0x18c/0x2a0
    [<000000008538f359>] bus_for_each_drv+0x151/0x1d0
    [<0000000038e09747>] __device_attach+0x1c9/0x4e0
    [<00000000dd86e533>] bus_probe_device+0x1d5/0x280
    [<00000000839bea35>] device_add+0xae0/0x1cb0
    [<000000009c2abf46>] new_device_store+0x3b6/0x5f0
    [<00000000fb823d7f>] bus_attr_store+0x72/0xa0
    [<000000007acc4295>] sysfs_kf_write+0x106/0x160
    [<000000005f50cb4d>] kernfs_fop_write_iter+0x3a8/0x5a0
    [<0000000075eb41bf>] vfs_write+0x8f0/0xc80

Fixes: 37923ed6b8 ("netdevsim: Add simple FIB resource controller via devlink")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 10:47:24 -07:00
Zhengchao Shao
cf2010aa1c netdevsim: fix memory leak in nsim_bus_dev_new()
If device_register() failed in nsim_bus_dev_new(), the value of reference
in nsim_bus_dev->dev is 1. obj->name in nsim_bus_dev->dev will not be
released.

unreferenced object 0xffff88810352c480 (size 16):
  comm "echo", pid 5691, jiffies 4294945921 (age 133.270s)
  hex dump (first 16 bytes):
    6e 65 74 64 65 76 73 69 6d 31 00 00 00 00 00 00  netdevsim1......
  backtrace:
    [<000000005e2e5e26>] __kmalloc_node_track_caller+0x3a/0xb0
    [<0000000094ca4fc8>] kvasprintf+0xc3/0x160
    [<00000000aad09bcc>] kvasprintf_const+0x55/0x180
    [<000000009bac868d>] kobject_set_name_vargs+0x56/0x150
    [<000000007c1a5d70>] dev_set_name+0xbb/0xf0
    [<00000000ad0d126b>] device_add+0x1f8/0x1cb0
    [<00000000c222ae24>] new_device_store+0x3b6/0x5e0
    [<0000000043593421>] bus_attr_store+0x72/0xa0
    [<00000000cbb1833a>] sysfs_kf_write+0x106/0x160
    [<00000000d0dedb8a>] kernfs_fop_write_iter+0x3a8/0x5a0
    [<00000000770b66e2>] vfs_write+0x8f0/0xc80
    [<0000000078bb39be>] ksys_write+0x106/0x210
    [<00000000005e55a4>] do_syscall_64+0x35/0x80
    [<00000000eaa40bbc>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

Fixes: 40e4fe4ce1 ("netdevsim: move device registration and related code to bus.c")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://lore.kernel.org/r/20221026015405.128795-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 10:47:09 -07:00
Jakub Kicinski
de90869a1b linux-can-fixes-for-6.1-20221027
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEBsvAIBsPu6mG7thcrX5LkNig010FAmNabUQTHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRCtfkuQ2KDTXaTBB/9xLKgfBKQtTKpLJ8HoDLESWbHr6v9T
 8GmOY0+UPbKxOqfMknYfwV57ca9agIw5IOFx9ry7w6PrytDF2S5ojYHtmQvvbmQJ
 2HyuP8k4Qt3MIS2O1fxHg6fe4qMhdDBJ2vkp5AGsfFWG3i189gPAGQFOtI19Zt+j
 Se547s5WzGlgcCQulbveyJsTO85Z6xhBr7VOy8mxSkkxivrYzKRk7YjWlulHCCFF
 A/Q15dXFyBtH+oG+Gl3Nnj0ttgiE7X6oOzwTH5JkUmUDnTujPZQX+tcJYNHKB+xw
 m3fwqRM8mWb1+m1gCgDKVT0Yaurbhbmv1phdZcPgFjCVU6PR1miFFG8M
 =YUyU
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-6.1-20221027' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2022-10-27

Anssi Hannula fixes the use of the completions in the kvaser_usb
driver.

Biju Das contributes 2 patches for the rcar_canfd driver. A IRQ storm
that can be triggered by high CAN bus load and channel specific IRQ
handlers are fixed.

Yang Yingliang fixes the j1939 transport protocol by moving a
kfree_skb() out of a spin_lock_irqsave protected section.

* tag 'linux-can-fixes-for-6.1-20221027' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
  can: j1939: transport: j1939_session_skb_drop_old(): spin_unlock_irqrestore() before kfree_skb()
  can: rcar_canfd: fix channel specific IRQ handling for RZ/G2L
  can: rcar_canfd: rcar_canfd_handle_global_receive(): fix IRQ storm on global FIFO receive
  can: kvaser_usb: Fix possible completions during init_completion
====================

Link: https://lore.kernel.org/r/20221027114356.1939821-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 10:30:42 -07:00
Rafał Miłecki
ef3556ee16 net: broadcom: bcm4908_enet: update TX stats after actual transmission
Queueing packets doesn't guarantee their transmission. Update TX stats
after hardware confirms consuming submitted data.

This also fixes a possible race and NULL dereference.
bcm4908_enet_start_xmit() could try to access skb after freeing it in
the bcm4908_enet_poll_tx().

Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Fixes: 4feffeadbc ("net: broadcom: bcm4908enet: add BCM4908 controller driver")
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20221027112430.8696-1-zajec5@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 10:27:53 -07:00
Jakub Kicinski
bc520f5e25 Merge branch 'ip-rework-the-fix-for-dflt-addr-selection-for-connected-nexthop'
Nicolas Dichtel says:

====================
ip: rework the fix for dflt addr selection for connected nexthop"

This series reworks the fix that is reverted in the second commit.
As Julian explained, nhc_scope is related to nhc_gw, it's not the scope of
the route.
====================

Link: https://lore.kernel.org/r/20221020100952.8748-1-nicolas.dichtel@6wind.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 10:17:44 -07:00
Nicolas Dichtel
bac0f937c3 nh: fix scope used to find saddr when adding non gw nh
As explained by Julian, fib_nh_scope is related to fib_nh_gw4, but
fib_info_update_nhc_saddr() needs the scope of the route, which is
the scope "before" fib_nh_scope, ie fib_nh_scope - 1.

This patch fixes the problem described in commit 747c143072 ("ip: fix
dflt addr selection for connected nexthop").

Fixes: 597cfe4fc3 ("nexthop: Add support for IPv4 nexthops")
Link: https://lore.kernel.org/netdev/6c8a44ba-c2d5-cdf-c5c7-5baf97cba38@ssi.bg/
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 10:17:40 -07:00
Nicolas Dichtel
e021c329ee Revert "ip: fix dflt addr selection for connected nexthop"
This reverts commit 747c143072.

As explained by Julian, nhc_scope is related to nhc_gw, not to the route.
Revert the original patch. The initial problem is fixed differently in the
next commit.

Link: https://lore.kernel.org/netdev/6c8a44ba-c2d5-cdf-c5c7-5baf97cba38@ssi.bg/
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 10:17:36 -07:00
Nicolas Dichtel
745b913a59 Revert "ip: fix triggering of 'icmp redirect'"
This reverts commit eb55dc09b5.

The patch that introduces this bug is reverted right after this one.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 10:17:32 -07:00
Jakub Kicinski
ce48ebdd56 genetlink: limit the use of validation workarounds to old ops
During review of previous change another thing came up - we should
limit the use of validation workarounds to old commands.
Don't list the workarounds one by one, as we're rejecting all existing
ones. We can deal with the masking in the unlikely event that new flag
is added.

Link: https://lore.kernel.org/all/6ba9f727e555fd376623a298d5d305ad408c3d47.camel@sipsolutions.net/
Link: https://lore.kernel.org/r/20221026001524.1892202-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 08:20:21 -07:00
Yevgeny Kliteynik
edaea00144 net/mlx5: DR, Remove the buddy used_list
No need to have the used_list - we don't need to keep track of the
used chunks, we only need to know the amount of used memory.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-10-27 15:50:40 +01:00
Yevgeny Kliteynik
4519fc45be net/mlx5: DR, Keep track of hot ICM chunks in an array instead of list
When ICM chunk is freed, it might still be accessed by HW until we do
sync with HW. This sync is expensive operation, so we don't do it often.
Instead, when the chunk is freed, it is moved to the buddy's "hot memory"
list. Once sync is done, we traverse the hot list and finally free all
the chunks.

It appears that traversing a long list takes unusually long time due to cache
misses on many entries, which causes a big "hiccup" during rule insertion.

This patch deals with this issue the following way:
 - Move hot chunks list from buddy to pool, so that the pool will
   keep track of all its hot memory.
 - Replace the list with pre-allocated array on the memory pool struct,
   and store only the information that is needed to later free this
   chunk in its buddy allocator.
   This cost additional memory for the array that is dynamically
   allocated, but it allows not to save long list of hot chunks,
   so at peak times it actually saves memory due to the fact that
   each array entry is much smaller than the chunk struct.

This way an overhead of traversing the long list is virtually removed:
the loop of freeing hot chunks takes ~27 msec instead of ~70 msec, where
most of it are the actual freeing activities.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-10-27 15:50:40 +01:00
Yevgeny Kliteynik
133ea373a0 net/mlx5: DR, Lower sync threshold for ICM hot memory
Instead of hiding the math in the code, define a value that sets the
fraction of allowed hot memory of ICM pool.
Set the threshold for sync of ICM hot chunks to 1/4 of the pool
instead of 1/2 of the pool. Although we will have more syncs, each
sync will be shorter and will help with insertion rate stability.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-10-27 15:50:40 +01:00
Yevgeny Kliteynik
fb628b71fb net/mlx5: DR, Allocate htbl from its own slab allocator
SW steering allocates/frees lots of htbl structs. Create a
separate kmem_cache and allocate htbls from this allocator.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-10-27 15:50:39 +01:00
Yevgeny Kliteynik
fd785e5213 net/mlx5: DR, Allocate icm_chunks from their own slab allocator
SW steering allocates/frees lots of icm_chunk structs. To make this more
efficiently, create a separate kmem_cache and allocate these chunks from
this allocator.
By doing this we observe that the alloc/free "hiccups" frequency has
become much lower, which allows for a more steady rule insersion rate.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-10-27 15:50:39 +01:00
Yevgeny Kliteynik
17b56073a0 net/mlx5: DR, Manage STE send info objects in pool
Instead of allocating/freeing send info objects dynamically, manage them
in pool. The number of send info objects doesn't depend on rules, so after
pre-populating the pool with an initial batch of send info objects, the
pool is not expected to grow.
This way we save alloc/free during writing STEs to ICM, which can
sometimes take up to 40msec.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-10-27 15:50:39 +01:00
Yevgeny Kliteynik
1bea2dc7f4 net/mlx5: DR, In rehash write the line in the entry immediately
Don't wait for the whole table to be ready - write each row immediately.
This way we save allocations of the ste_send_info structure and improve
performance.

Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-10-27 15:50:38 +01:00
Yevgeny Kliteynik
183a6706a0 net/mlx5: DR, Handle domain memory resources init/uninit separately
Handle creation/destruction of all the domain's memory pools and other
memory-related fields in a separate init/uninit functions.
This simplifies error flow and allows cleaner addition of new pools.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-10-27 15:50:38 +01:00
Yevgeny Kliteynik
06ab4a4089 net/mlx5: DR, Initialize chunk's ste_arrays at chunk creation
Rather than cleaning the corresponding chunk's section of ste_arrays on
chunk deletion, initialize these areas upon chunk creation.

Chunk destruction tend to come in large batches (during pool syncing).
To reduce the "hiccup" in such cases, moving ste_arrays init from chunk
destruction to initialization.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-10-27 15:50:38 +01:00
Yevgeny Kliteynik
b9b81e1e93 net/mlx5: DR, For short chains of STEs, avoid allocating ste_arr dynamically
While creating rule, ste_arr is an array that is allocated at the start
of the function and freed at the end.
This memory allocation can sometimes lead to "hiccups" of up to 10ms.
However, the common use case is short chains of STEs. For such cases,
we can use a local buffer on stack instead.

Changes in v2:
  Use small local array for short rules, allocate dynamically for long rules

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-10-27 15:50:38 +01:00
Yevgeny Kliteynik
d277b55f0f net/mlx5: DR, Remove unneeded argument from dr_icm_chunk_destroy
Remove an argument that can be extracted in the function.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-10-27 15:50:37 +01:00
Yevgeny Kliteynik
5fd08f6539 net/mlx5: DR, Check device state when polling CQ
Calling fast teardown as part of the normal unloading caused
a problem with SW steering - SW steering still needs to clear
its tables, write to ICM and poll for completions.
When teardown has been done, SW steering keeps polling the CQ
forever, because nobody flushes it.

This patch fixes the issue by checking the device state in
cases where no CQE was returned.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-10-27 15:50:37 +01:00
Yevgeny Kliteynik
c4193a1281 net/mlx5: DR, Fix the SMFS sync_steering for fast teardown
If sync happens when the device is in fast teardown, just bail
and don't do anything, because the PCI device is not there any more.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-10-27 15:50:37 +01:00
Yevgeny Kliteynik
14335d54e7 net/mlx5: DR, In destroy flow, free resources even if FW command failed
Otherwise resources will never be freed and refcount will not be decreased.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-10-27 15:50:36 +01:00
Florian Fainelli
9f172134dd net: bcmsysport: Indicate MAC is in charge of PHY PM
Avoid the PHY library call unnecessarily into the suspend/resume
functions by setting phydev->mac_managed_pm to true. The SYSTEMPORT
driver essentially does exactly what mdio_bus_phy_resume() does by
calling phy_resume().

Fixes: fba863b816 ("net: phy: make PHY PM ops a no-op if MAC driver manages PHY PM")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20221025234201.2549360-1-f.fainelli@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-27 15:56:07 +02:00
Kees Cook
12d6c1d3a2 skbuff: Proactively round up to kmalloc bucket size
Instead of discovering the kmalloc bucket size _after_ allocation, round
up proactively so the allocation is explicitly made for the full size,
allowing the compiler to correctly reason about the resulting size of
the buffer through the existing __alloc_size() hint.

This will allow for kernels built with CONFIG_UBSAN_BOUNDS or the
coming dynamic bounds checking under CONFIG_FORTIFY_SOURCE to gain
back the __alloc_size() hints that were temporarily reverted in commit
93dd04ab0b ("slab: remove __alloc_size attribute from __kmalloc_track_caller")

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20221021234713.you.031-kees@kernel.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221025223811.up.360-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-27 15:48:19 +02:00
Paolo Abeni
99c8eb4677 Merge branch 'net-ipa-don-t-use-fixed-table-sizes'
Alex Elder says:

====================
net: ipa: don't use fixed table sizes

Currently, routing and filter tables are assumed to have a fixed
size for all platforms.  In fact, these tables can support many more
entries than what has been assumed; the only limitation is the size
of the IPA-resident memory regions that contain them.

This series rearranges things so that the size of the table is
determined from the memory region size defined in configuration
data, rather than assuming it is fixed.

This will required for IPA versions 5.0+, where the number of
entries in a routing table is larger.
====================

Link: https://lore.kernel.org/r/20221025195143.255934-1-elder@linaro.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-27 13:38:15 +02:00
Alex Elder
f787d84830 net: ipa: determine filter table size from memory region
Currently we assume that any filter table contains a fixed number
of entries.  Like routing tables, the number of entries in a filter
table is limited only by the size of the IPA-local memory region
used to hold the table.

Stop assuming that a filter table has exactly 14 entries.  Instead,
determine the number of entries in a routing table by dividing its
memory region size by the size of an entry.  (Note that the first
"entry" in a filter table contains an endpoint bitmap.)

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-27 13:38:13 +02:00
Alex Elder
8defab8bdf net: ipa: don't assume 8 modem routing table entries
Currently all platforms are assumed allot 8 routing table entries
for use by the modem.  Instead, add a new configuration data entry
that defines the number of modem routing table entries, and record
that in the IPA structure.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-27 13:38:13 +02:00