Commit graph

1215127 commits

Author SHA1 Message Date
Martin KaFai Lau
55d49f750b bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc
The commit c83597fa5d ("bpf: Refactor some inode/task/sk storage functions
for reuse"), refactored the bpf_{sk,task,inode}_storage_free() into
bpf_local_storage_unlink_nolock() which then later renamed to
bpf_local_storage_destroy(). The commit accidentally passed the
"bool uncharge_mem = false" argument to bpf_selem_unlink_storage_nolock()
which then stopped the uncharge from happening to the sk->sk_omem_alloc.

This missing uncharge only happens when the sk is going away (during
__sk_destruct).

This patch fixes it by always passing "uncharge_mem = true". It is a
noop to the task/inode/cgroup storage because they do not have the
map_local_storage_(un)charge enabled in the map_ops. A followup patch
will be done in bpf-next to remove the uncharge_mem argument.

A selftest is added in the next patch.

Fixes: c83597fa5d ("bpf: Refactor some inode/task/sk storage functions for reuse")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230901231129.578493-3-martin.lau@linux.dev
2023-09-06 11:08:14 +02:00
Martin KaFai Lau
a96a44aba5 bpf: bpf_sk_storage: Fix invalid wait context lockdep report
'./test_progs -t test_local_storage' reported a splat:

[   27.137569] =============================
[   27.138122] [ BUG: Invalid wait context ]
[   27.138650] 6.5.0-03980-gd11ae1b16b0a #247 Tainted: G           O
[   27.139542] -----------------------------
[   27.140106] test_progs/1729 is trying to lock:
[   27.140713] ffff8883ef047b88 (stock_lock){-.-.}-{3:3}, at: local_lock_acquire+0x9/0x130
[   27.141834] other info that might help us debug this:
[   27.142437] context-{5:5}
[   27.142856] 2 locks held by test_progs/1729:
[   27.143352]  #0: ffffffff84bcd9c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x40
[   27.144492]  #1: ffff888107deb2c0 (&storage->lock){..-.}-{2:2}, at: bpf_local_storage_update+0x39e/0x8e0
[   27.145855] stack backtrace:
[   27.146274] CPU: 0 PID: 1729 Comm: test_progs Tainted: G           O       6.5.0-03980-gd11ae1b16b0a #247
[   27.147550] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[   27.149127] Call Trace:
[   27.149490]  <TASK>
[   27.149867]  dump_stack_lvl+0x130/0x1d0
[   27.152609]  dump_stack+0x14/0x20
[   27.153131]  __lock_acquire+0x1657/0x2220
[   27.153677]  lock_acquire+0x1b8/0x510
[   27.157908]  local_lock_acquire+0x29/0x130
[   27.159048]  obj_cgroup_charge+0xf4/0x3c0
[   27.160794]  slab_pre_alloc_hook+0x28e/0x2b0
[   27.161931]  __kmem_cache_alloc_node+0x51/0x210
[   27.163557]  __kmalloc+0xaa/0x210
[   27.164593]  bpf_map_kzalloc+0xbc/0x170
[   27.165147]  bpf_selem_alloc+0x130/0x510
[   27.166295]  bpf_local_storage_update+0x5aa/0x8e0
[   27.167042]  bpf_fd_sk_storage_update_elem+0xdb/0x1a0
[   27.169199]  bpf_map_update_value+0x415/0x4f0
[   27.169871]  map_update_elem+0x413/0x550
[   27.170330]  __sys_bpf+0x5e9/0x640
[   27.174065]  __x64_sys_bpf+0x80/0x90
[   27.174568]  do_syscall_64+0x48/0xa0
[   27.175201]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[   27.175932] RIP: 0033:0x7effb40e41ad
[   27.176357] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d8
[   27.179028] RSP: 002b:00007ffe64c21fc8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
[   27.180088] RAX: ffffffffffffffda RBX: 00007ffe64c22768 RCX: 00007effb40e41ad
[   27.181082] RDX: 0000000000000020 RSI: 00007ffe64c22008 RDI: 0000000000000002
[   27.182030] RBP: 00007ffe64c21ff0 R08: 0000000000000000 R09: 00007ffe64c22788
[   27.183038] R10: 0000000000000064 R11: 0000000000000202 R12: 0000000000000000
[   27.184006] R13: 00007ffe64c22788 R14: 00007effb42a1000 R15: 0000000000000000
[   27.184958]  </TASK>

It complains about acquiring a local_lock while holding a raw_spin_lock.
It means it should not allocate memory while holding a raw_spin_lock
since it is not safe for RT.

raw_spin_lock is needed because bpf_local_storage supports tracing
context. In particular for task local storage, it is easy to
get a "current" task PTR_TO_BTF_ID in tracing bpf prog.
However, task (and cgroup) local storage has already been moved to
bpf mem allocator which can be used after raw_spin_lock.

The splat is for the sk storage. For sk (and inode) storage,
it has not been moved to bpf mem allocator. Using raw_spin_lock or not,
kzalloc(GFP_ATOMIC) could theoretically be unsafe in tracing context.
However, the local storage helper requires a verifier accepted
sk pointer (PTR_TO_BTF_ID), it is hypothetical if that (mean running
a bpf prog in a kzalloc unsafe context and also able to hold a verifier
accepted sk pointer) could happen.

This patch avoids kzalloc after raw_spin_lock to silent the splat.
There is an existing kzalloc before the raw_spin_lock. At that point,
a kzalloc is very likely required because a lookup has just been done
before. Thus, this patch always does the kzalloc before acquiring
the raw_spin_lock and remove the later kzalloc usage after the
raw_spin_lock. After this change, it will have a charge and then
uncharge during the syscall bpf_map_update_elem() code path.
This patch opts for simplicity and not continue the old
optimization to save one charge and uncharge.

This issue is dated back to the very first commit of bpf_sk_storage
which had been refactored multiple times to create task, inode, and
cgroup storage. This patch uses a Fixes tag with a more recent
commit that should be easier to do backport.

Fixes: b00fa38a9c ("bpf: Enable non-atomic allocations in local storage")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230901231129.578493-2-martin.lau@linux.dev
2023-09-06 11:07:54 +02:00
Ilya Leoshkevich
a192103a11 s390/bpf: Pass through tail call counter in trampolines
s390x eBPF programs use the following extension to the s390x calling
convention: tail call counter is passed on stack at offset
STK_OFF_TCCNT, which callees otherwise use as scratch space.

Currently trampoline does not respect this and clobbers tail call
counter. This breaks enforcing tail call limits in eBPF programs, which
have trampolines attached to them.

Fix by forwarding a copy of the tail call counter to the original eBPF
program in the trampoline (for fexit), and by restoring it at the end
of the trampoline (for fentry).

Fixes: 528eb2cb87 ("s390/bpf: Implement arch_prepare_bpf_trampoline()")
Reported-by: Leon Hwang <hffilwlqm@gmail.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230906004448.111674-1-iii@linux.ibm.com
2023-09-06 10:48:14 +02:00
Sebastian Andrzej Siewior
6764e767f4 bpf: Assign bpf_tramp_run_ctx::saved_run_ctx before recursion check.
__bpf_prog_enter_recur() assigns bpf_tramp_run_ctx::saved_run_ctx before
performing the recursion check which means in case of a recursion
__bpf_prog_exit_recur() uses the previously set bpf_tramp_run_ctx::saved_run_ctx
value.

__bpf_prog_enter_sleepable_recur() assigns bpf_tramp_run_ctx::saved_run_ctx
after the recursion check which means in case of a recursion
__bpf_prog_exit_sleepable_recur() uses an uninitialized value. This does not
look right. If I read the entry trampoline code right, then bpf_tramp_run_ctx
isn't initialized upfront.

Align __bpf_prog_enter_sleepable_recur() with __bpf_prog_enter_recur() and
set bpf_tramp_run_ctx::saved_run_ctx before the recursion check is made.
Remove the assignment of saved_run_ctx in kern_sys_bpf() since it happens
a few cycles later.

Fixes: e384c7b7b4 ("bpf, x86: Create bpf_tramp_run_ctx on the caller thread's stack")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230830080405.251926-3-bigeasy@linutronix.de
2023-09-06 10:44:28 +02:00
Sebastian Andrzej Siewior
7645629f7d bpf: Invoke __bpf_prog_exit_sleepable_recur() on recursion in kern_sys_bpf().
If __bpf_prog_enter_sleepable_recur() detects recursion then it returns
0 without undoing rcu_read_lock_trace(), migrate_disable() or
decrementing the recursion counter. This is fine in the JIT case because
the JIT code will jump in the 0 case to the end and invoke the matching
exit trampoline (__bpf_prog_exit_sleepable_recur()).

This is not the case in kern_sys_bpf() which returns directly to the
caller with an error code.

Add __bpf_prog_exit_sleepable_recur() as clean up in the recursion case.

Fixes: b1d18a7574 ("bpf: Extend sys_bpf commands for bpf_syscall programs.")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230830080405.251926-2-bigeasy@linutronix.de
2023-09-06 10:39:31 +02:00
Jakub Kicinski
1a961e74d5 net: phylink: fix sphinx complaint about invalid literal
sphinx complains about the use of "%PHYLINK_PCS_NEG_*":

Documentation/networking/kapi:144: ./include/linux/phylink.h:601: WARNING: Inline literal start-string without end-string.
Documentation/networking/kapi:144: ./include/linux/phylink.h:633: WARNING: Inline literal start-string without end-string.

These are not valid symbols so drop the '%' prefix.

Alternatively we could use %PHYLINK_PCS_NEG_\* (escape the *)
or use normal literal ``PHYLINK_PCS_NEG_*`` but there is already
a handful of un-adorned DEFINE_* in this file.

Fixes: f99d471afa ("net: phylink: add PCS negotiation mode")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: https://lore.kernel.org/all/20230626162908.2f149f98@canb.auug.org.au/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-06 07:46:49 +01:00
David S. Miller
f8fdd54ee6 Merge branch 'sja1105-fixes'
Vladimir Oltean says:

====================
tc-cbs offload fixes for SJA1105 DSA

Yanan Yang has pointed out to me that certain tc-cbs offloaded
configurations do not appear to do any shaping on the LS1021A-TSN board
(SJA1105T).

This is due to an apparent documentation error that also made its way
into the driver, which patch 1/3 now fixes.

While investigating and then testing, I've found 2 more bugs, which are
patches 2/3 and 3/3.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-06 06:23:05 +01:00
Vladimir Oltean
180a7419fe net: dsa: sja1105: complete tc-cbs offload support on SJA1110
The blamed commit left this delta behind:

  struct sja1105_cbs_entry {
 -	u64 port;
 -	u64 prio;
 +	u64 port; /* Not used for SJA1110 */
 +	u64 prio; /* Not used for SJA1110 */
  	u64 credit_hi;
  	u64 credit_lo;
  	u64 send_slope;
  	u64 idle_slope;
  };

but did not actually implement tc-cbs offload fully for the new switch.
The offload is accepted, but it doesn't work.

The difference compared to earlier switch generations is that now, the
table of CBS shapers is sparse, because there are many more shapers, so
the mapping between a {port, prio} and a table index is static, rather
than requiring us to store the port and prio into the sja1105_cbs_entry.

So, the problem is that the code programs the CBS shaper parameters at a
dynamic table index which is incorrect.

All that needs to be done for SJA1110 CBS shapers to work is to bypass
the logic which allocates shapers in a dense manner, as for SJA1105, and
use the fixed mapping instead.

Fixes: 3e77e59bf8 ("net: dsa: sja1105: add support for the SJA1110 switch family")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-06 06:23:05 +01:00
Vladimir Oltean
894cafc5c6 net: dsa: sja1105: fix -ENOSPC when replacing the same tc-cbs too many times
After running command [2] too many times in a row:

[1] $ tc qdisc add dev sw2p0 root handle 1: mqprio num_tc 8 \
	map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
[2] $ tc qdisc replace dev sw2p0 parent 1:1 cbs offload 1 \
	idleslope 120000 sendslope -880000 locredit -1320 hicredit 180

(aka more than priv->info->num_cbs_shapers times)

we start seeing the following error message:

Error: Specified device failed to setup cbs hardware offload.

This comes from the fact that ndo_setup_tc(TC_SETUP_QDISC_CBS) presents
the same API for the qdisc create and replace cases, and the sja1105
driver fails to distinguish between the 2. Thus, it always thinks that
it must allocate the same shaper for a {port, queue} pair, when it may
instead have to replace an existing one.

Fixes: 4d7525085a ("net: dsa: sja1105: offload the Credit-Based Shaper qdisc")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-06 06:23:05 +01:00
Vladimir Oltean
954ad9bf13 net: dsa: sja1105: fix bandwidth discrepancy between tc-cbs software and offload
More careful measurement of the tc-cbs bandwidth shows that the stream
bandwidth (effectively idleslope) increases, there is a larger and
larger discrepancy between the rate limit obtained by the software
Qdisc, and the rate limit obtained by its offloaded counterpart.

The discrepancy becomes so large, that e.g. at an idleslope of 40000
(40Mbps), the offloaded cbs does not actually rate limit anything, and
traffic will pass at line rate through a 100 Mbps port.

The reason for the discrepancy is that the hardware documentation I've
been following is incorrect. UM11040.pdf (for SJA1105P/Q/R/S) states
about IDLE_SLOPE that it is "the rate (in unit of bytes/sec) at which
the credit counter is increased".

Cross-checking with UM10944.pdf (for SJA1105E/T) and UM11107.pdf
(for SJA1110), the wording is different: "This field specifies the
value, in bytes per second times link speed, by which the credit counter
is increased".

So there's an extra scaling for link speed that the driver is currently
not accounting for, and apparently (empirically), that link speed is
expressed in Kbps.

I've pondered whether to pollute the sja1105_mac_link_up()
implementation with CBS shaper reprogramming, but I don't think it is
worth it. IMO, the UAPI exposed by tc-cbs requires user space to
recalculate the sendslope anyway, since the formula for that depends on
port_transmit_rate (see man tc-cbs), which is not an invariant from tc's
perspective.

So we use the offload->sendslope and offload->idleslope to deduce the
original port_transmit_rate from the CBS formula, and use that value to
scale the offload->sendslope and offload->idleslope to values that the
hardware understands.

Some numerical data points:

 40Mbps stream, max interfering frame size 1500, port speed 100M
 ---------------------------------------------------------------

 tc-cbs parameters:
 idleslope 40000 sendslope -60000 locredit -900 hicredit 600

 which result in hardware values:

 Before (doesn't work)           After (works)
 credit_hi    600                600
 credit_lo    900                900
 send_slope   7500000            75
 idle_slope   5000000            50

 40Mbps stream, max interfering frame size 1500, port speed 1G
 -------------------------------------------------------------

 tc-cbs parameters:
 idleslope 40000 sendslope -960000 locredit -1440 hicredit 60

 which result in hardware values:

 Before (doesn't work)           After (works)
 credit_hi    60                 60
 credit_lo    1440               1440
 send_slope   120000000          120
 idle_slope   5000000            5

 5.12Mbps stream, max interfering frame size 1522, port speed 100M
 -----------------------------------------------------------------

 tc-cbs parameters:
 idleslope 5120 sendslope -94880 locredit -1444 hicredit 77

 which result in hardware values:

 Before (doesn't work)           After (works)
 credit_hi    77                 77
 credit_lo    1444               1444
 send_slope   11860000           118
 idle_slope   640000             6

Tested on SJA1105T, SJA1105S and SJA1110A, at 1Gbps and 100Mbps.

Fixes: 4d7525085a ("net: dsa: sja1105: offload the Credit-Based Shaper qdisc")
Reported-by: Yanan Yang <yanan.yang@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-06 06:23:04 +01:00
David S. Miller
ca7cfd73d0 Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:

====================
Change MIN_TXD and MIN_RXD to allow set rx/tx value between 64 and 80

Olga Zaborska says:

Change the minimum value of RX/TX descriptors to 64 to enable setting the rx/tx
value between 64 and 80. All igb, igbvf and igc devices can use as low as 64
descriptors.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-06 06:20:33 +01:00
Bodong Wang
344134609a mlx5/core: E-Switch, Create ACL FT for eswitch manager in switchdev mode
ACL flow table is required in switchdev mode when metadata is enabled,
driver creates such table when loading each vport. However, not every
vport is loaded in switchdev mode. Such as ECPF if it's the eswitch manager.
In this case, ACL flow table is still needed.

To make it modularized, create ACL flow table for eswitch manager as
default and skip such operations when loading manager vport.

Also, there is no need to load the eswitch manager vport in switchdev mode.
This means there is no need to load it on regular connect-x HCAs where
the PF is the eswitch manager. This will avoid creating duplicate ACL
flow table for host PF vport.

Fixes: 29bcb6e4fe ("net/mlx5e: E-Switch, Use metadata for vport matching in send-to-vport rules")
Fixes: eb8e9fae0a ("mlx5/core: E-Switch, Allocate ECPF vport if it's an eswitch manager")
Fixes: 5019833d66 ("net/mlx5: E-switch, Introduce helper function to enable/disable vports")
Signed-off-by: Bodong Wang <bodong@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-06 06:18:05 +01:00
Jianbo Liu
b7558a7752 net/mlx5e: Clear mirred devices array if the rule is split
In the cited commit, the mirred devices are recorded and checked while
parsing the actions. In order to avoid system crash, the duplicate
action in a single rule is not allowed.

But the rule is actually break down into several FTEs in different
tables, for either mirroring, or the specified types of actions which
use post action infrastructure.

It will reject certain action list by mistake, for example:
    actions:enp8s0f0_1,set(ipv4(ttl=63)),enp8s0f0_0,enp8s0f0_1.
Here the rule is split to two FTEs because of pedit action.

To fix this issue, when parsing the rule actions, reset if_count to
clear the mirred devices array if the rule is split to multiple
FTEs, and then the duplicate checking is restarted.

Fixes: 554fe75c1b ("net/mlx5e: Avoid duplicating rule destinations")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-06 06:18:05 +01:00
Eric Dumazet
9b271ebaf9 ip_tunnels: use DEV_STATS_INC()
syzbot/KCSAN reported data-races in iptunnel_xmit_stats() [1]

This can run from multiple cpus without mutual exclusion.

Adopt SMP safe DEV_STATS_INC() to update dev->stats fields.

[1]
BUG: KCSAN: data-race in iptunnel_xmit / iptunnel_xmit

read-write to 0xffff8881353df170 of 8 bytes by task 30263 on cpu 1:
iptunnel_xmit_stats include/net/ip_tunnels.h:493 [inline]
iptunnel_xmit+0x432/0x4a0 net/ipv4/ip_tunnel_core.c:87
ip_tunnel_xmit+0x1477/0x1750 net/ipv4/ip_tunnel.c:831
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:662
__netdev_start_xmit include/linux/netdevice.h:4889 [inline]
netdev_start_xmit include/linux/netdevice.h:4903 [inline]
xmit_one net/core/dev.c:3544 [inline]
dev_hard_start_xmit+0x11b/0x3f0 net/core/dev.c:3560
__dev_queue_xmit+0xeee/0x1de0 net/core/dev.c:4340
dev_queue_xmit include/linux/netdevice.h:3082 [inline]
__bpf_tx_skb net/core/filter.c:2129 [inline]
__bpf_redirect_no_mac net/core/filter.c:2159 [inline]
__bpf_redirect+0x723/0x9c0 net/core/filter.c:2182
____bpf_clone_redirect net/core/filter.c:2453 [inline]
bpf_clone_redirect+0x16c/0x1d0 net/core/filter.c:2425
___bpf_prog_run+0xd7d/0x41e0 kernel/bpf/core.c:1954
__bpf_prog_run512+0x74/0xa0 kernel/bpf/core.c:2195
bpf_dispatcher_nop_func include/linux/bpf.h:1181 [inline]
__bpf_prog_run include/linux/filter.h:609 [inline]
bpf_prog_run include/linux/filter.h:616 [inline]
bpf_test_run+0x15d/0x3d0 net/bpf/test_run.c:423
bpf_prog_test_run_skb+0x77b/0xa00 net/bpf/test_run.c:1045
bpf_prog_test_run+0x265/0x3d0 kernel/bpf/syscall.c:3996
__sys_bpf+0x3af/0x780 kernel/bpf/syscall.c:5353
__do_sys_bpf kernel/bpf/syscall.c:5439 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5437 [inline]
__x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5437
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

read-write to 0xffff8881353df170 of 8 bytes by task 30249 on cpu 0:
iptunnel_xmit_stats include/net/ip_tunnels.h:493 [inline]
iptunnel_xmit+0x432/0x4a0 net/ipv4/ip_tunnel_core.c:87
ip_tunnel_xmit+0x1477/0x1750 net/ipv4/ip_tunnel.c:831
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:662
__netdev_start_xmit include/linux/netdevice.h:4889 [inline]
netdev_start_xmit include/linux/netdevice.h:4903 [inline]
xmit_one net/core/dev.c:3544 [inline]
dev_hard_start_xmit+0x11b/0x3f0 net/core/dev.c:3560
__dev_queue_xmit+0xeee/0x1de0 net/core/dev.c:4340
dev_queue_xmit include/linux/netdevice.h:3082 [inline]
__bpf_tx_skb net/core/filter.c:2129 [inline]
__bpf_redirect_no_mac net/core/filter.c:2159 [inline]
__bpf_redirect+0x723/0x9c0 net/core/filter.c:2182
____bpf_clone_redirect net/core/filter.c:2453 [inline]
bpf_clone_redirect+0x16c/0x1d0 net/core/filter.c:2425
___bpf_prog_run+0xd7d/0x41e0 kernel/bpf/core.c:1954
__bpf_prog_run512+0x74/0xa0 kernel/bpf/core.c:2195
bpf_dispatcher_nop_func include/linux/bpf.h:1181 [inline]
__bpf_prog_run include/linux/filter.h:609 [inline]
bpf_prog_run include/linux/filter.h:616 [inline]
bpf_test_run+0x15d/0x3d0 net/bpf/test_run.c:423
bpf_prog_test_run_skb+0x77b/0xa00 net/bpf/test_run.c:1045
bpf_prog_test_run+0x265/0x3d0 kernel/bpf/syscall.c:3996
__sys_bpf+0x3af/0x780 kernel/bpf/syscall.c:5353
__do_sys_bpf kernel/bpf/syscall.c:5439 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5437 [inline]
__x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5437
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0x0000000000018830 -> 0x0000000000018831

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 30249 Comm: syz-executor.4 Not tainted 6.5.0-syzkaller-11704-g3f86ed6ec0b3 #0

Fixes: 039f50629b ("ip_tunnel: Move stats update to iptunnel_xmit()")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-06 06:05:59 +01:00
Taehee Yoo
39285e124e net: team: do not use dynamic lockdep key
team interface has used a dynamic lockdep key to avoid false-positive
lockdep deadlock detection. Virtual interfaces such as team usually
have their own lock for protecting private data.
These interfaces can be nested.
team0
  |
team1

Each interface's lock is actually different(team0->lock and team1->lock).
So,
mutex_lock(&team0->lock);
mutex_lock(&team1->lock);
mutex_unlock(&team1->lock);
mutex_unlock(&team0->lock);
The above case is absolutely safe. But lockdep warns about deadlock.
Because the lockdep understands these two locks are same. This is a
false-positive lockdep warning.

So, in order to avoid this problem, the team interfaces started to use
dynamic lockdep key. The false-positive problem was fixed, but it
introduced a new problem.

When the new team virtual interface is created, it registers a dynamic
lockdep key(creates dynamic lockdep key) and uses it. But there is the
limitation of the number of lockdep keys.
So, If so many team interfaces are created, it consumes all lockdep keys.
Then, the lockdep stops to work and warns about it.

In order to fix this problem, team interfaces use the subclass instead
of the dynamic key. So, when a new team interface is created, it doesn't
register(create) a new lockdep, but uses existed subclass key instead.
It is already used by the bonding interface for a similar case.

As the bonding interface does, the subclass variable is the same as
the 'dev->nested_level'. This variable indicates the depth in the stacked
interface graph.

The 'dev->nested_level' is protected by RTNL and RCU.
So, 'mutex_lock_nested()' for 'team->lock' requires RTNL or RCU.
In the current code, 'team->lock' is usually acquired under RTNL, there is
no problem with using 'dev->nested_level'.

The 'team_nl_team_get()' and The 'lb_stats_refresh()' functions acquire
'team->lock' without RTNL.
But these don't iterate their own ports nested so they don't need nested
lock.

Reproducer:
   for i in {0..1000}
   do
           ip link add team$i type team
           ip link add dummy$i master team$i type dummy
           ip link set dummy$i up
           ip link set team$i up
   done

Splat looks like:
   BUG: MAX_LOCKDEP_ENTRIES too low!
   turning off the locking correctness validator.
   Please attach the output of /proc/lock_stat to the bug report
   CPU: 0 PID: 4104 Comm: ip Not tainted 6.5.0-rc7+ #45
   Call Trace:
    <TASK>
   dump_stack_lvl+0x64/0xb0
   add_lock_to_list+0x30d/0x5e0
   check_prev_add+0x73a/0x23a0
   ...
   sock_def_readable+0xfe/0x4f0
   netlink_broadcast+0x76b/0xac0
   nlmsg_notify+0x69/0x1d0
   dev_open+0xed/0x130
   ...

Reported-by: syzbot+9bbbacfbf1e04d5221f7@syzkaller.appspotmail.com
Fixes: 369f61bee0 ("team: fix nested locking lockdep warning")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-06 06:04:13 +01:00
Quan Tian
a5e2151ff9 net/ipv6: SKB symmetric hash should incorporate transport ports
__skb_get_hash_symmetric() was added to compute a symmetric hash over
the protocol, addresses and transport ports, by commit eb70db8756
("packet: Use symmetric hash for PACKET_FANOUT_HASH."). It uses
flow_keys_dissector_symmetric_keys as the flow_dissector to incorporate
IPv4 addresses, IPv6 addresses and ports. However, it should not specify
the flag as FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL, which stops further
dissection when an IPv6 flow label is encountered, making transport
ports not being incorporated in such case.

As a consequence, the symmetric hash is based on 5-tuple for IPv4 but
3-tuple for IPv6 when flow label is present. It caused a few problems,
e.g. when nft symhash and openvswitch l4_sym rely on the symmetric hash
to perform load balancing as different L4 flows between two given IPv6
addresses would always get the same symmetric hash, leading to uneven
traffic distribution.

Removing the use of FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL makes sure the
symmetric hash is based on 5-tuple for both IPv4 and IPv6 consistently.

Fixes: eb70db8756 ("packet: Use symmetric hash for PACKET_FANOUT_HASH.")
Reported-by: Lars Ekman <uablrek@gmail.com>
Closes: https://github.com/antrea-io/antrea/issues/5457
Signed-off-by: Quan Tian <qtian@vmware.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-06 06:02:27 +01:00
Alexandre Ghiti
b7ac4b8ee7
riscv: libstub: Implement KASLR by using generic functions
We can now use arm64 functions to handle the move of the kernel physical
mapping: if KASLR is enabled, we will try to get a random seed from the
firmware, if not possible, the kernel will be moved to a location that
suits its alignment constraints.

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Tested-by: Song Shuai <songshuaishuai@tinylab.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20230722123850.634544-6-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-05 19:49:31 -07:00
Alexandre Ghiti
3c35d1a03c
libstub: Fix compilation warning for rv32
Fix the following warning which appears when compiled for rv32 by using
unsigned long type instead of u64.

../drivers/firmware/efi/libstub/efi-stub-helper.c: In function 'efi_kaslr_relocate_kernel':
../drivers/firmware/efi/libstub/efi-stub-helper.c:846:28: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
  846 |                            (u64)_end < EFI_ALLOC_LIMIT) {

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Tested-by: Song Shuai <songshuaishuai@tinylab.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20230722123850.634544-5-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-05 19:49:30 -07:00
Alexandre Ghiti
6b56beb5f6
arm64: libstub: Move KASLR handling functions to kaslr.c
This prepares for riscv to use the same functions to handle the pĥysical
kernel move when KASLR is enabled.

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Tested-by: Song Shuai <songshuaishuai@tinylab.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20230722123850.634544-4-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-05 19:49:29 -07:00
Alexandre Ghiti
54a519e6af
riscv: Dump out kernel offset information on panic
Dump out the KASLR virtual kernel offset when panic to help debug kernel.

Signed-off-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Tested-by: Song Shuai <songshuaishuai@tinylab.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20230722123850.634544-3-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-05 19:49:28 -07:00
Alexandre Ghiti
84fe419dc7
riscv: Introduce virtual kernel mapping KASLR
KASLR implementation relies on a relocatable kernel so that we can move
the kernel mapping.

The seed needed to virtually move the kernel is taken from the device tree,
so we rely on the bootloader to provide a correct seed. Zkr could be used
unconditionnally instead if implemented, but that's for another patch.

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Tested-by: Song Shuai <songshuaishuai@tinylab.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20230722123850.634544-2-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-05 19:49:27 -07:00
Fabio Estevam
ce413486c9 dt-bindings: rtc: ds3231: Remove text binding
The "maxim,ds3231" compatible is described in the rtc-ds1307.yaml, so
there is no need to keep the text bindings version.

Remove the maxim,ds3231.txt file in favor of the rtc-ds1307.yaml binding.

Signed-off-by: Fabio Estevam <festevam@denx.de>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20230902134407.2589099-1-festevam@gmail.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2023-09-06 01:28:24 +02:00
Alexandre Belloni
d844c64bbc rtc: wm8350: remove unnecessary messages
The RTC core already prints a message when the RTC is registered and when
registering fails, it is not necessary to have more in the driver.

Acked-by: Charles Keepax <ckeepax@opensource.cirrus.com>
Link: https://lore.kernel.org/r/20230827221643.544259-3-alexandre.belloni@bootlin.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2023-09-06 01:26:04 +02:00
Alexandre Belloni
8805baceb0 rtc: twl: remove unnecessary messages
The RTC core already prints a message when the RTC is registered and when
registering fails, it is not necessary to have more in the driver.

Link: https://lore.kernel.org/r/20230827221643.544259-2-alexandre.belloni@bootlin.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2023-09-06 01:26:04 +02:00
Alexandre Belloni
94ec1f06d0 rtc: sun6i: remove unnecessary message
The core already print a message once the rtc is successfully registered,
it is not necessary to print an other one.

Acked-by: Jernej Skrabec <jernej.skrabec@gmail.com>
Link: https://lore.kernel.org/r/20230827221643.544259-1-alexandre.belloni@bootlin.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2023-09-06 01:26:04 +02:00
Alexandre Belloni
348c11a7d7 rtc: stop warning for invalid alarms when the alarm is disabled
When the alarm is not enabled, it may never have been set and so we can't
expect it to be valid. This will prevent the apparition of boot messages
like this one:

rtc rtc0: invalid alarm value: 2023-7-8 45:85:85

Link: https://lore.kernel.org/r/20230827221532.543353-1-alexandre.belloni@bootlin.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2023-09-06 01:25:15 +02:00
Frank Li
6e13d6528b i3c: master: svc: fix probe failure when no i3c device exist
I3C masters are expected to support hot-join. This means at initialization
time we might not yet discover any device and this should not be treated
as a fatal error.

During the DAA procedure which happens at probe time, if no device has
joined, all CCC will be NACKed (from a bus perspective). This leads to an
early return with an error code which fails the probe of the master.

Let's avoid this by just telling the core through an I3C_ERROR_M2
return command code that no device was discovered, which is a valid
situation. This way the master will no longer bail out and fail to probe
for a wrong reason.

Cc: stable@vger.kernel.org
Fixes: dd3c52846d ("i3c: master: svc: Add Silvaco I3C master driver")
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Acked-by: Miquel Raynal <miquel.raynal@bootlin.com>
Link: https://lore.kernel.org/r/20230831141324.2841525-1-Frank.Li@nxp.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2023-09-06 01:21:47 +02:00
Ariel Marcovitch
2a15de80dd idr: fix param name in idr_alloc_cyclic() doc
The relevant parameter is 'start' and not 'nextid'

Fixes: 460488c58c ("idr: Remove idr_alloc_ext")
Signed-off-by: Ariel Marcovitch <arielmarcovitch@gmail.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2023-09-05 19:01:38 -04:00
Philipp Stanner
e7716c74e3 xarray: Document necessary flag in alloc functions
Adds a new line to the docstrings of functions wrapping __xa_alloc() and
__xa_alloc_cyclic(), informing about the necessity of flag XA_FLAGS_ALLOC
being set previously.

The documentation so far says that functions wrapping __xa_alloc() and
__xa_alloc_cyclic() are supposed to return either -ENOMEM or -EBUSY in
case of an error. If the xarray has been initialized without the flag
XA_FLAGS_ALLOC, however, they fail with a different, undocumented error
code.

As hinted at in Documentation/core-api/xarray.rst, wrappers around these
functions should only be invoked when the flag has been set. The
functions' documentation should reflect that as well.

Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2023-09-05 19:01:38 -04:00
Raphael Gallais-Pou
591b00cc4f dt-bindings: irqchip: convert st,stih407-irq-syscfg to DT schema
Convert deprecated format to DT schema format.

Signed-off-by: Raphael Gallais-Pou <rgallaispou@gmail.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20230905072740.23859-1-rgallaispou@gmail.com
Signed-off-by: Rob Herring <robh@kernel.org>
2023-09-05 16:15:09 -05:00
Rob Herring
274e480982 media: dt-bindings: Convert Omnivision OV7251 to DT schema
Convert the OmniVision OV7251 Image Sensor binding to DT schema format.

vddd-supply was listed as required, but the example and actual user
don't have it. Also, the data brief says it has an internal regulator,
so perhaps it is truly optional.

Add missing common "link-frequencies" which is used and required by the
Linux driver.

Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Acked-by: Wolfram Sang <wsa@kernel.org> # for I2C
Link: https://lore.kernel.org/r/20230817202713.2180195-1-robh@kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
2023-09-05 15:54:28 -05:00
Rob Herring
44ade291b7 media: dt-bindings: Merge OV5695 into OV5693 binding
The OV5695 binding is almost the same as the OV5693 binding. The only
difference is 'clock-names' is defined for OV5695. However, the lack of
clock-names is an omission as the Linux OV5693 driver expects the same
'xvclk' clock name.

'link-frequencies' is required by OV5693, but not OV5695, so make that
conditional. Really, this shouldn't vary by device, but we're stuck with
the existing binding use.

The rockchip-isp1 binding example is missing required properties, so it
has to be updated as well.

Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20230817202647.2179609-1-robh@kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
2023-09-05 15:54:15 -05:00
Linus Torvalds
65d6e954e3 gfs2 fixes
- Fix a glock state (non-)transition bug when a dlm request times out
   and is canceled, and we have locking requests that can now be granted
   immediately.
 
 - Various fixes and cleanups in how the logd and quotad daemons are
   woken up and terminated.
 
 - Fix several bugs in the quota data reference counting and shrinking.
   Free quota data objects synchronously in put_super() instead of
   letting call_rcu() run wild.
 
 - Make sure not to deallocate quota data during a withdraw; rather, defer
   quota data deallocation to put_super().  Withdraws can happen in
   contexts in which callers on the stack are holding quota data references.
 
 - Many minor quota fixes and cleanups by Bob.
 
 - Update the the mailing list address for gfs2 and dlm.  (It's the same
   list for both and we are moving it to gfs2@lists.linux.dev.)
 
 - Various other minor cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmT3T7UUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTqxhw/+IWp+4cY4htNkTRG7xkheTeQ+5whG
 NU40mp7Hj+WY5GoHqsk676q1pBkVAq5mNN1kt9S/oC6lLHrdu1HLpdIkgFow2nAC
 nDqlEqx9/Da9Q4H/+K442usO90S4o1MmOXOE9xcGcvJLqK4FLDOVDXbUWa43OXrK
 4HxgjgGSNPF4itD+U0o/V4f19jQ+4cNwmo6hGLyhsYillaUHiIXJezQlH7XycByM
 qGJqlG6odJ56wE38NG8Bt9Lj+91PsLLqO1TJxSYyzpf0h9QGQ2ySvu6/esTXwxWO
 XRuT4db7yjyAUhJoJMw+YU77xWQTz0/jriIDS7VqzvR1ns3GPaWdtb31TdUTBG4H
 IvBA8ep3oxHtcYFoPzCLBXgOIDej6KjAgS3RSv51yLeaZRHFUBc21fTSXbcDTIUs
 gkusZlRNQ9ANdBCVyf8hZxbE54HnaBJ8dKMZtynOXJEHs0EtGV8YKCNIpkFLxOvE
 vZkKcRsmVtuZ9fVhX1iH7dYmcsCMPI8RNo47k7hHk2EG8dU+eqyPSbi4QCmErNFf
 DlqX+fIuiDtOkbmWcrb2qdphn6j6bMLhDaOMJGIBOmgOPi+AU9dNAfmtu1cG4u1b
 2TFyUISayiwqHJQgguzvDed15fxexYdgoLB7O9t9TMbCENxisguNa5TsAN6ZkiLQ
 0hY6h80xSR2kCPU=
 =EonA
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v6.5-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Andreas Gruenbacher:

 - Fix a glock state (non-)transition bug when a dlm request times out
   and is canceled, and we have locking requests that can now be granted
   immediately

 - Various fixes and cleanups in how the logd and quotad daemons are
   woken up and terminated

 - Fix several bugs in the quota data reference counting and shrinking.
   Free quota data objects synchronously in put_super() instead of
   letting call_rcu() run wild

 - Make sure not to deallocate quota data during a withdraw; rather,
   defer quota data deallocation to put_super(). Withdraws can happen in
   contexts in which callers on the stack are holding quota data
   references

 - Many minor quota fixes and cleanups by Bob

 - Update the the mailing list address for gfs2 and dlm. (It's the same
   list for both and we are moving it to gfs2@lists.linux.dev)

 - Various other minor cleanups

* tag 'gfs2-v6.5-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (51 commits)
  MAINTAINERS: Update dlm mailing list
  MAINTAINERS: Update gfs2 mailing list
  gfs2: change qd_slot_count to qd_slot_ref
  gfs2: check for no eligible quota changes
  gfs2: Remove useless assignment
  gfs2: simplify slot_get
  gfs2: Simplify qd2offset
  gfs2: introduce qd_bh_get_or_undo
  gfs2: Remove quota allocation info from quota file
  gfs2: use constant for array size
  gfs2: Set qd_sync_gen in do_sync
  gfs2: Remove useless err set
  gfs2: Small gfs2_quota_lock cleanup
  gfs2: move qdsb_put and reduce redundancy
  gfs2: improvements to sysfs status
  gfs2: Don't try to sync non-changes
  gfs2: Simplify function need_sync
  gfs2: remove unneeded pg_oflow variable
  gfs2: remove unneeded variable done
  gfs2: pass sdp to gfs2_write_buf_to_page
  ...
2023-09-05 13:00:28 -07:00
Jerome Neanne
ca0e36e3e3
regulator: tps6594-regulator: Fix random kernel crash
Random kernel crash detected in TI CICD when regulator driver is added.
This is root caused to irq index increment being done twice causing
irq_data being allocated outside of the range.

- Rework tps6594_request_reg_irqs with correct index increment
- Adjust irq_data kmalloc size to the exact size needed for the device

This has been reported on TI mainline. No public bug report associated.

Reported-by: Udit Kumar <u-kumar1@ti.com>
Fixes: f17ccc5deb ("regulator: tps6594-regulator: Add driver for TI TPS6594 regulators")
Signed-off-by: Jerome Neanne <jneanne@baylibre.com>
Link: https://lore.kernel.org/r/20230828-tps6594_random_boot_crash_fix-v1-1-f29cbf9ddb37@baylibre.com
Signed-off-by: Mark Brown <broonie@kernel.org>
2023-09-05 20:58:41 +01:00
Kan Liang
6f7f984fa8 perf/x86/uncore: Correct the number of CHAs on EMR
Starting from SPR, the basic uncore PMON information is retrieved from
the discovery table (resides in an MMIO space populated by BIOS). It is
called the discovery method. The existing value of the type->num_boxes
is from the discovery table.

On some SPR variants, there is a firmware bug that makes the value from the
discovery table incorrect. We use the value from the
SPR_MSR_UNC_CBO_CONFIG MSR to replace the one from the discovery table:

   38776cc45e ("perf/x86/uncore: Correct the number of CHAs on SPR")

Unfortunately, the SPR_MSR_UNC_CBO_CONFIG isn't available for the EMR
XCC (Always returns 0), but the above firmware bug doesn't impact the
EMR XCC.

Don't let the value from the MSR replace the existing value from the
discovery table.

Fixes: 38776cc45e ("perf/x86/uncore: Correct the number of CHAs on SPR")
Reported-by: Stephane Eranian <eranian@google.com>
Reported-by: Yunying Sun <yunying.sun@intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Yunying Sun <yunying.sun@intel.com>
Link: https://lore.kernel.org/r/20230905134248.496114-1-kan.liang@linux.intel.com
2023-09-05 21:50:21 +02:00
Linus Torvalds
9e310ea5c8 fuse update for 6.6
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCZPYlzAAKCRDh3BK/laaZ
 PEcxAP4suFAlonGntKJ5ltR+7ZN+WYdiraQ+5c6ISBFc+pFXgQD7B0xhztV4umSF
 III+pbD6lE5gP5u7+Kw/pOnTI42yTQ8=
 =aPjn
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

 - Revert non-waiting FLUSH due to a regression

 - Fix a lookup counter leak in readdirplus

 - Add an option to allow shared mmaps in no-cache mode

 - Add btime support and statx intrastructure to the protocol

 - Invalidate positive/negative dentry on failed create/delete

* tag 'fuse-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: conditionally fill kstat in fuse_do_statx()
  fuse: invalidate dentry on EEXIST creates or ENOENT deletes
  fuse: cache btime
  fuse: implement statx
  fuse: add ATTR_TIMEOUT macro
  fuse: add STATX request
  fuse: handle empty request_mask in statx
  fuse: write back dirty pages before direct write in direct_io_relax mode
  fuse: add a new fuse init flag to relax restrictions in no cache mode
  fuse: invalidate page cache pages before direct write
  fuse: nlookup missing decrement in fuse_direntplus_link
  Revert "fuse: in fuse_flush only wait if someone wants the return code"
2023-09-05 12:45:55 -07:00
Rafael J. Wysocki
edd220b33f thermal: core: Drop thermal_zone_device_register()
There are no more users of thermal_zone_device_register(), so drop it
from the core.

Note that thermal_zone_device_register_with_trips() may be renamed to
thermal_zone_device_register() in the future, but only after a grace
period allowing all of the possible work in progress that may be using
the latter to adjust.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-09-05 21:42:18 +02:00
Rafael J. Wysocki
cbcd51e822 thermal: Use thermal_tripless_zone_device_register()
All of the remaining callers of thermal_zone_device_register()
can use thermal_tripless_zone_device_register(), so make them
do so in order to allow the former to be dropped.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Miquel Raynal <miquel.raynal@bootlin.com>
2023-09-05 21:42:18 +02:00
Rafael J. Wysocki
d332db8fc1 thermal: core: Add function for registering tripless thermal zones
Multiple callers of thermal_zone_device_register() don't pass any trips
to it and they might use a shortened argument list for that, so add
a special function with fewer arguments for this purpose.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-09-05 21:42:18 +02:00
Rafael J. Wysocki
9ffa7b92bc thermal: core: Clean up headers of thermal zone registration functions
For consistency, add a missing thermal_zone_device_register_with_trips()
stub for the CONFIG_THERMAL unset case, specify argument names in all of
the thermal zone registration and unregistration function headers and
make all of them use white space consistently.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-09-05 21:42:18 +02:00
Linus Torvalds
4b3d6e0c6c ata changes for 6.6
- Fix OF include file for ata platform drivers (Rob).
 
  - Simplify various ahci, sata and pata platform drivers using the
    function devm_platform_ioremap_resource() (Yangtao).
 
  - Cleanup libata time related argument types (e.g. timeouts values)
    (Sergey).
 
  - Cleanup libata code around error handling as all ata drivers now
    define a error_handler operation (Hannes and Niklas).
 
  - Remove functions intended for libsas that are in fact unused
    (Niklas).
 
  - Change the remove device callback of platform drivers to a null
    function (Uwe).
 
  - Simplify the pata_imx driver using devm_clk_get_enabled() (Li).
 
  - Remove old and uinused remnants of the ide code in arm, parisc,
    powerpc, sparc and m68k architectures and associated drivers
    (pata_buddha, pata_falcon and pata_gayle) (Geert).
 
  - Add missing MODULE_DESCRIPTION() in the sata_gemini and pata_ftide010
    drivers (me).
 
  - Several fixes for the pata_ep93xx and pata_falcon drivers (Nikita,
    Michael).
 
  - Add Elkhart Lake AHCI controller support to the ahci driver (Werner).
 
  - Disable NCQ trim on Micron 1100 drives (Pawel).
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCZPbUdQAKCRDdoc3SxdoY
 du6NAP9G10EhjgXDNIM8f1AN8gHrZk2RGSxSuqIKKROxl2i+ugD+Mt8/UjRpf0JO
 nVdKTfgeXaLfCXHN/fmkIYNmk+I7Twg=
 =9Ety
 -----END PGP SIGNATURE-----

Merge tag 'ata-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata

Pull ata updates from Damien Le Moal:

 - Fix OF include file for ata platform drivers (Rob)

 - Simplify various ahci, sata and pata platform drivers using the
   function devm_platform_ioremap_resource() (Yangtao)

 - Cleanup libata time related argument types (e.g. timeouts values)
   (Sergey)

 - Cleanup libata code around error handling as all ata drivers now
   define a error_handler operation (Hannes and Niklas)

 - Remove functions intended for libsas that are in fact unused (Niklas)

 - Change the remove device callback of platform drivers to a null
   function (Uwe)

 - Simplify the pata_imx driver using devm_clk_get_enabled() (Li)

 - Remove old and uinused remnants of the ide code in arm, parisc,
   powerpc, sparc and m68k architectures and associated drivers
   (pata_buddha, pata_falcon and pata_gayle) (Geert)

 - Add missing MODULE_DESCRIPTION() in the sata_gemini and pata_ftide010
   drivers (me)

 - Several fixes for the pata_ep93xx and pata_falcon drivers (Nikita,
   Michael)

 - Add Elkhart Lake AHCI controller support to the ahci driver (Werner)

 - Disable NCQ trim on Micron 1100 drives (Pawel)

* tag 'ata-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata: (60 commits)
  ata: libata-core: Disable NCQ_TRIM on Micron 1100 drives
  ata: ahci: Add Elkhart Lake AHCI controller
  ata: pata_falcon: add data_swab option to byte-swap disk data
  ata: pata_falcon: fix IO base selection for Q40
  ata: pata_ep93xx: use soc_device_match for UDMA modes
  ata: pata_ep93xx: fix error return code in probe
  ata: sata_gemini: Add missing MODULE_DESCRIPTION
  ata: pata_ftide010: Add missing MODULE_DESCRIPTION
  m68k: Remove <asm/ide.h>
  ata: pata_gayle: Remove #include <asm/ide.h>
  ata: pata_falcon: Remove #include <asm/ide.h>
  ata: pata_buddha: Remove #include <asm/ide.h>
  asm-generic: Remove ide_iops.h
  sparc: Remove <asm/ide.h>
  powerpc: Remove <asm/ide.h>
  parisc: Remove <asm/ide.h>
  ARM: Remove <asm/ide.h>
  ata: pata_imx: Use helper function devm_clk_get_enabled()
  ata: sata_rcar: Convert to platform remove callback returning void
  ata: sata_mv: Convert to platform remove callback returning void
  ...
2023-09-05 12:37:28 -07:00
Linus Torvalds
7733171926 qcom: fix incorrect num_chans counting
mhu: Remove redundant dev_err
 bcm: fix comments
 common changes:
 convert to use devm_platform_get_and_ioremap_resource
 correct DT includes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6EwehDt/SOnwFyTyf9lkf8eYP5UFAmT3R1kACgkQf9lkf8eY
 P5UeIBAAk10at74VxwaasFXy6/83umaIygNSlXJ7+FyYJRsg5ta2wfNfMD/Jgoec
 +ACRQxWo9/GtM4OLnkwouN/xWyyDlod6J1afidpfstZ//pc4H6pOBcmRbBCgsHco
 72yloPY1bnq53dLwdYaLT+Gi/gLN5ipFZZvIByXJECPIdOp9wJGGEvNny553j561
 etnCvfaWe+8GdeQpv3vOQ8nb+mMy1iYyKc+suJNpYXH1lgsk1cDGMfbeGibS7r9u
 t0Oa4vLA+/avbxRI25PAveVfdIg7DWfCy1WOeBtGzsYTN4T109L8eFzRv7GO58v/
 /+v2JkzSWBip7ZIfNGGOk1L2iTvR8Lo9otWNaq64wXFAx8khanjN6ko/ZRL6ocd7
 AR4nDdK3pkVGKDgV75K6NnkgUpgMWrCYR6kg5x4f29S44aaZyfukqAJZe4mP134x
 nDEqUYpquscUv3vnizI5QqwPp1nCjRW1wBtnZE5/qh4F6WfiQ3BCkHudLvZQQrMF
 1+PCICHMmhAqYXwQyMlQyRBtk7p2Z00az47EXV7lMLS04YdBPySuQ5tVo0PUTLgR
 JWq9gw0YCqP/sKQfr7vraGOZ/nm9CbEKRorpXrJAuIKe2zso8A69Ej5XOpoqoWG7
 7ALpxx7iLRdMDnT+7JduAzBwrp6vX8kPzFgwJOC4pMGxLqsAMiU=
 =WEVp
 -----END PGP SIGNATURE-----

Merge tag 'mailbox-v6.6' of git://git.linaro.org/landing-teams/working/fujitsu/integration

Pull mailbox updates from Jassi Brar:

 - qcom: fix incorrect num_chans counting

 - mhu: Remove redundant dev_err

 - bcm: fix comments

 - common changes:
    - convert to use devm_platform_get_and_ioremap_resource
    - correct DT includes

* tag 'mailbox-v6.6' of git://git.linaro.org/landing-teams/working/fujitsu/integration:
  mailbox: qcom-ipcc: fix incorrect num_chans counting
  mailbox: Explicitly include correct DT includes
  mailbox: ti-msgmgr: Use devm_platform_ioremap_resource_byname()
  mailbox: platform-mhu: Remove redundant dev_err()
  mailbox: bcm-pdc: Fix some kernel-doc comments
  mailbox: mailbox-test: Fix an error check in mbox_test_probe()
  mailbox: tegra-hsp: Convert to devm_platform_ioremap_resource()
  mailbox: rockchip: Use devm_platform_get_and_ioremap_resource()
  mailbox: mailbox-test: Use devm_platform_get_and_ioremap_resource()
  mailbox: bcm-pdc: Use devm_platform_get_and_ioremap_resource()
  mailbox: bcm-ferxrm-mailbox: Use devm_platform_get_and_ioremap_resource()
2023-09-05 12:31:07 -07:00
Linus Torvalds
3c5c9b7cfd Seven hotfixes. Four are cc:stable and the remainder pertain to issues
which were introduced in the current merge window.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZPd5KAAKCRDdBJ7gKXxA
 jqIrAPoCqnQwOA577hJ3B1iEZnbYC0dlf5Rsk+uS/2HFnVeLhAD6A0uFOIE11ZQR
 I9AU7NDtu8NYkh9Adz+cRDeLNWbRSAo=
 =EFfq
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-09-05-11-51' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "Seven hotfixes. Four are cc:stable and the remainder pertain to issues
  which were introduced in the current merge window"

* tag 'mm-hotfixes-stable-2023-09-05-11-51' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  sparc64: add missing initialization of folio in tlb_batch_add()
  mm: memory-failure: use rcu lock instead of tasklist_lock when collect_procs()
  revert "memfd: improve userspace warnings for missing exec-related flags".
  rcu: dump vmalloc memory info safely
  mm/vmalloc: add a safer version of find_vm_area() for debug
  tools/mm: fix undefined reference to pthread_once
  memcontrol: ensure memcg acquired by id is properly set up
2023-09-05 12:22:39 -07:00
Linus Torvalds
6155a3b885 Hi,
This pull request contains two more bug fixes for tpm_crb, in other
 words categorically disabling rng for AMD CPU's in the tpm_crb driver,
 discarding the earlier probing approach.
 
 BR, Jarkko
 -----BEGIN PGP SIGNATURE-----
 
 iIgEABYIADAWIQRE6pSOnaBC00OEHEIaerohdGur0gUCZPYxGRIcamFya2tvQGtl
 cm5lbC5vcmcACgkQGnq6IXRrq9IRiwD+Lly+vi9Z6qouF72XUnNMohg0PFeHp/Ra
 7yD1LRALWSgBAMqUlFjA62/OEbJDWK+WS3gUx+e20y7L2OeSLPZVfqYB
 =DbwG
 -----END PGP SIGNATURE-----

Merge tag 'tpmdd-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd

Pull more tpm updates from Jarkko Sakkinen:
 "Two more bug fixes for tpm_crb, categorically disabling rng for AMD
  CPU's in the tpm_crb driver, discarding the earlier probing approach"

* tag 'tpmdd-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
  tpm: Enable hwrng only for Pluton on AMD CPUs
  tpm_crb: Fix an error handling path in crb_acpi_add()
2023-09-05 11:15:59 -07:00
Alexander Gordeev
06fc3b0d22 s390/vmem: do not silently ignore mapping limit
The only interface that allows drivers establishing
liner mappings is vmem_add_mapping(). It does check
a requested range against allowed limits and a call
to modify_pagetable() with an invalid mapping range
is impossible.

Hence, an attempt to map an address range outside of
the identity mapping or vmemmap array could only be
kernel bug.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-09-05 20:12:52 +02:00
Andy Shevchenko
f59ec04d38 s390/zcrypt: utilize dev_set_name() ability to use a formatted string
With the dev_set_name() prototype it's not obvious that it takes
a formatted string as a parameter. Use its facility instead of
duplicating the same with strncpy()/snprintf() calls.

With this, also prevent return error code to be shadowed.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20230831110000.24279-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-09-05 20:12:52 +02:00
Andy Shevchenko
6252f47b78 s390/zcrypt: don't leak memory if dev_set_name() fails
When dev_set_name() fails, zcdn_create() doesn't free the newly
allocated resources. Do it.

Fixes: 00fab2350e ("s390/zcrypt: multiple zcrypt device nodes support")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20230831110000.24279-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-09-05 20:12:51 +02:00
Alexander Gordeev
08d90f46c7 s390/mm: fix MAX_DMA_ADDRESS physical vs virtual confusion
MAX_DMA_ADDRESS is defined and treated as a physical address,
whereas it should be virtual.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-09-05 20:12:51 +02:00
Mike Rapoport (IBM)
f4b4f3ec1a sparc64: add missing initialization of folio in tlb_batch_add()
Commit 1a10a44dfc ("sparc64: implement the new page table range API")
missed initialization of folio variable in tlb_batch_add() which causes
boot tests to crash.

Add missing initialization.

Link: https://lkml.kernel.org/r/20230904174350.GF3223@kernel.org
Fixes: 1a10a44dfc ("sparc64: implement the new page table range API")
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-05 11:11:52 -07:00
Tong Tiangen
d256d1cd8d mm: memory-failure: use rcu lock instead of tasklist_lock when collect_procs()
We found a softlock issue in our test, analyzed the logs, and found that
the relevant CPU call trace as follows:

CPU0:
  _do_fork
    -> copy_process()
      -> write_lock_irq(&tasklist_lock)  //Disable irq,waiting for
      					 //tasklist_lock

CPU1:
  wp_page_copy()
    ->pte_offset_map_lock()
      -> spin_lock(&page->ptl);        //Hold page->ptl
    -> ptep_clear_flush()
      -> flush_tlb_others() ...
        -> smp_call_function_many()
          -> arch_send_call_function_ipi_mask()
            -> csd_lock_wait()         //Waiting for other CPUs respond
	                               //IPI

CPU2:
  collect_procs_anon()
    -> read_lock(&tasklist_lock)       //Hold tasklist_lock
      ->for_each_process(tsk)
        -> page_mapped_in_vma()
          -> page_vma_mapped_walk()
	    -> map_pte()
              ->spin_lock(&page->ptl)  //Waiting for page->ptl

We can see that CPU1 waiting for CPU0 respond IPI,CPU0 waiting for CPU2
unlock tasklist_lock, CPU2 waiting for CPU1 unlock page->ptl. As a result,
softlockup is triggered.

For collect_procs_anon(), what we're doing is task list iteration, during
the iteration, with the help of call_rcu(), the task_struct object is freed
only after one or more grace periods elapse. the logic as follows:

release_task()
  -> __exit_signal()
    -> __unhash_process()
      -> list_del_rcu()

  -> put_task_struct_rcu_user()
    -> call_rcu(&task->rcu, delayed_put_task_struct)

delayed_put_task_struct()
  -> put_task_struct()
  -> if (refcount_sub_and_test())
     	__put_task_struct()
          -> free_task()

Therefore, under the protection of the rcu lock, we can safely use
get_task_struct() to ensure a safe reference to task_struct during the
iteration.

By removing the use of tasklist_lock in task list iteration, we can break
the softlock chain above.

The same logic can also be applied to:
 - collect_procs_file()
 - collect_procs_fsdax()
 - collect_procs_ksm()

Link: https://lkml.kernel.org/r/20230828022527.241693-1-tongtiangen@huawei.com
Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-05 11:11:52 -07:00