Commit Graph

44323 Commits

Author SHA1 Message Date
Linus Torvalds 80f8b450bf Fix suspicious RCU usage in __do_softirq().
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmY3SxERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1irFBAAkF7nMNof2kDXmHqeINNp0ZreVYEcVnTM
 S0xTUCvJ1C0UQgxPqOOlpODfOJLANqBS/xpwWTxzvdDemXDTAEeaiZz2wmiS77qG
 8Q98k39AOH1gynSIoZE9df4tniw2WxYaU5CMveT85YeMIW8rE3B0i/uNyrsCPJDw
 P9Bv0rBc96hbrFs32alVcix6YN1QySo8O9oZW+rRQndh8zd1lBCKVKC2QCGGLh7b
 pS45F0vJt6mVmdVURWvGtoaIh5PKNPBP1exfJow79AgogMuLgXm9JHltErgWc55L
 b508AjH29pKGb0a54hUaLAnXk1Fmu7xGZkQWIwUO7/U2ZYUR+3/eQ8UVoGhcole+
 nS/jew1er4W4/KLqhThKnNSuJaQeLljKbbsOK0bk4Dv1NTfiu83WIxgwVBZfR5Dx
 zZSG+PNcLxqVQDUz+bicy0l31x2bwGEjBnop9llPz/h+eeJHD7i3LVi+wVtrIyeP
 iLaRQVvFSgkFECJglq4aPBZ30bqU387hE9oKx+FW0WCUO6CWMg+rjqs8/MSAB31H
 8HKk9WxAWxlOdlAoESJawVLJxuAKHnVdgfilKjiBH5j5nUUB59cLNEcK+nA6W9t2
 ooGsIEiFNB1Uvt01awcSDOPUaE47H490gdZS4uuz93dTtBX6uPc+wYX0elrR8t7p
 /JRDKNBhlIg=
 =ZLyW
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2024-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fix from Ingo Molnar:
 "Fix suspicious RCU usage in __do_softirq()"

* tag 'irq-urgent-2024-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  softirq: Fix suspicious RCU usage in __do_softirq()
2024-05-05 10:12:32 -07:00
Linus Torvalds 2c17a1cd90 Probes fixes for v6.9-rc6:
- probe-events: Fix memory leak in parsing probe argument. There is a
   memory leak (forget to free an allocated buffer) in a memory allocation
   failure path. Fixes it to jump to the correct error handling code.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmY2NRQbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bIacH/RmSQaraWiwQmMaWT8Pp
 wotOxtMYnl2uLNeVx3vn55+G1Xr/rJP3E9EBGTa+HMPky3trea07eBM5B3UnwT2y
 Y75Nhm6z3SFaLBygdKmQZgyIJF1W9w6J1cfqPwPlfR3h08a/9rNojd/DKBo7fLjk
 uwGAUHsB6sNhTvRF64wtr+I7V+8CGwNnApyQvf/mLnHsELerzm86nxDhXcfIvb1P
 UbM4nupqrV3QYCLYdXmma34PFFJzS3ioINGn692QtHFOSEdSwJfqsNv6AU/w98zD
 8o2rlSadc64Yl74vMLFRtBVS3K49VQXNgUUXjx2Gpj9/v80qn+B41HwaNSl1Lagx
 lIY=
 =tob5
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fix from Masami Hiramatsu:

 - probe-events: Fix memory leak in parsing probe argument.

   There is a memory leak (forget to free an allocated buffer) in a
   memory allocation failure path. Fix it to jump to the correct error
   handling code.

* tag 'probes-fixes-v6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/probes: Fix memory leak in traceprobe_parse_probe_arg_body()
2024-05-05 09:56:50 -07:00
Linus Torvalds e92b99ae82 tracing and tracefs fixes for v6.9
- Fix RCU callback of freeing an eventfs_inode.
   The freeing of the eventfs_inode from the kref going to zero
   freed the contents of the eventfs_inode and then used kfree_rcu()
   to free the inode itself. But the contents should also be protected
   by RCU. Switch to a call_rcu() that calls a function to free all
   of the eventfs_inode after the RCU synchronization.
 
 - The tracing subsystem maps its own descriptor to a file represented by
   eventfs. The freeing of this descriptor needs to know when the
   last reference of an eventfs_inode is released, but currently
   there is no interface for that. Add a "release" callback to
   the eventfs_inode entry array that allows for freeing of data
   that can be referenced by the eventfs_inode being opened.
   Then increment the ref counter for this descriptor when the
   eventfs_inode file is created, and decrement/free it when the
   last reference to the eventfs_inode is released and the file
   is removed. This prevents races between freeing the descriptor
   and the opening of the eventfs file.
 
 - Fix the permission processing of eventfs.
   The change to make the permissions of eventfs default to the mount
   point but keep track of when changes were made had a side effect
   that could cause security concerns. When the tracefs is remounted
   with a given gid or uid, all the files within it should inherit
   that gid or uid. But if the admin had changed the permission of
   some file within the tracefs file system, it would not get updated
   by the remount. This caused the kselftest of file permissions
   to fail the second time it is run. The first time, all changes
   would look fine, but the second time, because the changes were
   "saved", the remount did not reset them.
 
   Create a link list of all existing tracefs inodes, and clear the
   saved flags on them on a remount if the remount changes the
   corresponding gid or uid fields.
 
   This also simplifies the code by removing the distinction between the
   toplevel eventfs and an instance eventfs. They should both act the
   same. They were different because of a misconception due to the
   remount not resetting the flags. Now that remount resets all the
   files and directories to default to the root node if a uid/gid is
   specified, it makes the logic simpler to implement.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZjXxzxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qqzGAQCX8g7gtngGgwSsWqPW5GmecCifwFja
 k7cVEDhMYPnDeAEAkYi2ZBgJRkPsWPfMRClDK/DXP4woOo58asxtIxfTMgg=
 =mCkt
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.9-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing and tracefs fixes from Steven Rostedt:

 - Fix RCU callback of freeing an eventfs_inode.

   The freeing of the eventfs_inode from the kref going to zero freed
   the contents of the eventfs_inode and then used kfree_rcu() to free
   the inode itself. But the contents should also be protected by RCU.
   Switch to a call_rcu() that calls a function to free all of the
   eventfs_inode after the RCU synchronization.

 - The tracing subsystem maps its own descriptor to a file represented
   by eventfs. The freeing of this descriptor needs to know when the
   last reference of an eventfs_inode is released, but currently there
   is no interface for that.

   Add a "release" callback to the eventfs_inode entry array that allows
   for freeing of data that can be referenced by the eventfs_inode being
   opened. Then increment the ref counter for this descriptor when the
   eventfs_inode file is created, and decrement/free it when the last
   reference to the eventfs_inode is released and the file is removed.
   This prevents races between freeing the descriptor and the opening of
   the eventfs file.

 - Fix the permission processing of eventfs.

   The change to make the permissions of eventfs default to the mount
   point but keep track of when changes were made had a side effect that
   could cause security concerns. When the tracefs is remounted with a
   given gid or uid, all the files within it should inherit that gid or
   uid. But if the admin had changed the permission of some file within
   the tracefs file system, it would not get updated by the remount.

   This caused the kselftest of file permissions to fail the second time
   it is run. The first time, all changes would look fine, but the
   second time, because the changes were "saved", the remount did not
   reset them.

   Create a link list of all existing tracefs inodes, and clear the
   saved flags on them on a remount if the remount changes the
   corresponding gid or uid fields.

   This also simplifies the code by removing the distinction between the
   toplevel eventfs and an instance eventfs. They should both act the
   same. They were different because of a misconception due to the
   remount not resetting the flags. Now that remount resets all the
   files and directories to default to the root node if a uid/gid is
   specified, it makes the logic simpler to implement.

* tag 'trace-v6.9-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  eventfs: Have "events" directory get permissions from its parent
  eventfs: Do not treat events directory different than other directories
  eventfs: Do not differentiate the toplevel events directory
  tracefs: Still use mount point as default permissions for instances
  tracefs: Reset permissions on remount if permissions are options
  eventfs: Free all of the eventfs_inode after RCU
  eventfs/tracing: Add callback for release of an eventfs_inode
2024-05-05 09:53:09 -07:00
Linus Torvalds 4fbcf58590 dma-mapping fix for Linux 6.9
- fix the combination of restricted pools and dynamic swiotlb
    (Will Deacon)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmY1uVELHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYP3ZBAAi+aGmWWnpF6ujgGTjLSABztNWimuyr8GgZwCiRL4
 otvp/u6Iq6kHQJvPvDpJVUYV80unqz4NV67JYnsOi1kX2QHVME8ActHrQ/tpbiyz
 QrIRQ75iQUH4PVlBubUHHT0/zZoHn5RB1D8rB1vRBIxR+ApN2LIUq74d5W6YMcoE
 LcatCYLbomKovRFEornQ7+a9rHkiZvUPwbXqpxPUVAUnpaS2cTy6Tc5EmKOu00yi
 iMEvx5Hzmb2we0oHTwTNnrjzpmSTNww8geNOKBYRij+3VWBeb1weapJEl/EJ3hRh
 B7xkSNvFPMDMVlTUwO4+Bb6W76xbVXteiFsCatGV+2EUmJUlpw50uEUmA/smACuV
 Aw9oz6MZEj0VZjY+2kliYxO5sfgeU2Is/ZS2iTPB2pNcYlHppG4Fn4Bob9E+MJ9p
 aR+D4NbcrjM/PS4yIgto9/lyjQKu/Vs2T2c8eblE9Vp+io0/ZLI1dguOspRx2eAd
 sWSNZBSTPjrFQJuuszS+skws+s6j9hKCwi6N4Neb39+HNWvjJa0SYvBDFjoXBbd6
 kfwMWvMwRNDd0YhGAzfapPguy+FEtAoJ6s7SSSLG1XQ3BfKoC2YTQKjfG9aid+n4
 MmoAL+UGnXw31IAsITAQGMFC6h41mhNDlKPXJIm4/n8PEW8P7GIQugHN5SvuNXnN
 qk8=
 =05zN
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.9-2024-05-04' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fix from Christoph Hellwig:

 - fix the combination of restricted pools and dynamic swiotlb
   (Will Deacon)

* tag 'dma-mapping-6.9-2024-05-04' of git://git.infradead.org/users/hch/dma-mapping:
  swiotlb: initialise restricted pool list_head when SWIOTLB_DYNAMIC=y
2024-05-05 09:49:21 -07:00
Steven Rostedt (Google) b63db58e2f eventfs/tracing: Add callback for release of an eventfs_inode
Synthetic events create and destroy tracefs files when they are created
and removed. The tracing subsystem has its own file descriptor
representing the state of the events attached to the tracefs files.
There's a race between the eventfs files and this file descriptor of the
tracing system where the following can cause an issue:

With two scripts 'A' and 'B' doing:

  Script 'A':
    echo "hello int aaa" > /sys/kernel/tracing/synthetic_events
    while :
    do
      echo 0 > /sys/kernel/tracing/events/synthetic/hello/enable
    done

  Script 'B':
    echo > /sys/kernel/tracing/synthetic_events

Script 'A' creates a synthetic event "hello" and then just writes zero
into its enable file.

Script 'B' removes all synthetic events (including the newly created
"hello" event).

What happens is that the opening of the "enable" file has:

 {
	struct trace_event_file *file = inode->i_private;
	int ret;

	ret = tracing_check_open_get_tr(file->tr);
 [..]

But deleting the events frees the "file" descriptor, and a "use after
free" happens with the dereference at "file->tr".

The file descriptor does have a reference counter, but there needs to be a
way to decrement it from the eventfs when the eventfs_inode is removed
that represents this file descriptor.

Add an optional "release" callback to the eventfs_entry array structure,
that gets called when the eventfs file is about to be removed. This allows
for the creating on the eventfs file to increment the tracing file
descriptor ref counter. When the eventfs file is deleted, it can call the
release function that will call the put function for the tracing file
descriptor.

This will protect the tracing file from being freed while a eventfs file
that references it is being opened.

Link: https://lore.kernel.org/linux-trace-kernel/20240426073410.17154-1-Tze-nan.Wu@mediatek.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240502090315.448cba46@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 5790b1fb3d ("eventfs: Remove eventfs_file and just use eventfs_inode")
Reported-by: Tze-nan wu <Tze-nan.Wu@mediatek.com>
Tested-by: Tze-nan Wu (吳澤南) <Tze-nan.Wu@mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04 04:25:37 -04:00
Linus Torvalds 545c494465 Including fixes from bpf.
Relatively calm week, likely due to public holiday in most places.
 No known outstanding regressions.
 
 Current release - regressions:
 
   - rxrpc: fix wrong alignmask in __page_frag_alloc_align()
 
   - eth: e1000e: change usleep_range to udelay in PHY mdic access
 
 Previous releases - regressions:
 
   - gro: fix udp bad offset in socket lookup
 
   - bpf: fix incorrect runtime stat for arm64
 
   - tipc: fix UAF in error path
 
   - netfs: fix a potential infinite loop in extract_user_to_sg()
 
   - eth: ice: ensure the copied buf is NUL terminated
 
   - eth: qeth: fix kernel panic after setting hsuid
 
 Previous releases - always broken:
 
   - bpf:
     - verifier: prevent userspace memory access
     - xdp: use flags field to disambiguate broadcast redirect
 
   - bridge: fix multicast-to-unicast with fraglist GSO
 
   - mptcp: ensure snd_nxt is properly initialized on connect
 
   - nsh: fix outer header access in nsh_gso_segment().
 
   - eth: bcmgenet: fix racing registers access
 
   - eth: vxlan: fix stats counters.
 
 Misc:
 
   - a bunch of MAINTAINERS file updates
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmYzaRsSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkh70P/jzsTsvzHspu3RUwcsyvWpSoJPcxP2tF
 5SKR66o8sbSjB5I26zUi/LtRZgbPO32GmLN2Y8GvP74h9lwKdDo4AY4volZKCT6f
 lRG6GohvMa0lSPSn1fti7CKVzDOsaTHvLz3uBBr+Xb9ITCKh+I+zGEEDGj/47SQN
 tmDWHPF8OMs2ezmYS5NqRIQ3CeRz6uyLmEoZhVm4SolypZ18oEg7GCtL3u6U48n+
 e3XB3WwKl0ZxK8ipvPgUDwGIDuM5hEyAaeNon3zpYGoqitRsRITUjULpb9dT4DtJ
 Jma3OkarFJNXgm4N/p/nAtQ9AdiAloF9ivZXs2t0XCdrrUZJUh05yuikoX+mLfpw
 GedG2AbaVl6mdqNkrHeyf5SXKuiPgeCLVfF2xMjS0l1kFbY+Bt8BqnRSdOrcoUG0
 zlSzBeBtajttMdnalWv2ZshjP8uo/NjXydUjoVNwuq8xGO5wP+zhNnwhOvecNyUg
 t7q2PLokahlz4oyDqyY/7SQ0hSEndqxOlt43I6CthoWH0XkS83nTPdQXcTKQParD
 ntJUk5QYwefUT1gimbn/N8GoP7a1+ysWiqcf/7+SNm932gJGiDt36+HOEmyhIfIG
 IDWTWJJW64SnPBIUw59MrG7hMtbfaiZiFQqeUJQpFVrRr+tg5z5NUZ5thA+EJVd8
 qiVDvmngZFiv
 =f6KY
 -----END PGP SIGNATURE-----

Merge tag 'net-6.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from bpf.

  Relatively calm week, likely due to public holiday in most places. No
  known outstanding regressions.

  Current release - regressions:

   - rxrpc: fix wrong alignmask in __page_frag_alloc_align()

   - eth: e1000e: change usleep_range to udelay in PHY mdic access

  Previous releases - regressions:

   - gro: fix udp bad offset in socket lookup

   - bpf: fix incorrect runtime stat for arm64

   - tipc: fix UAF in error path

   - netfs: fix a potential infinite loop in extract_user_to_sg()

   - eth: ice: ensure the copied buf is NUL terminated

   - eth: qeth: fix kernel panic after setting hsuid

  Previous releases - always broken:

   - bpf:
       - verifier: prevent userspace memory access
       - xdp: use flags field to disambiguate broadcast redirect

   - bridge: fix multicast-to-unicast with fraglist GSO

   - mptcp: ensure snd_nxt is properly initialized on connect

   - nsh: fix outer header access in nsh_gso_segment().

   - eth: bcmgenet: fix racing registers access

   - eth: vxlan: fix stats counters.

  Misc:

   - a bunch of MAINTAINERS file updates"

* tag 'net-6.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (45 commits)
  MAINTAINERS: mark MYRICOM MYRI-10G as Orphan
  MAINTAINERS: remove Ariel Elior
  net: gro: add flush check in udp_gro_receive_segment
  net: gro: fix udp bad offset in socket lookup by adding {inner_}network_offset to napi_gro_cb
  ipv4: Fix uninit-value access in __ip_make_skb()
  s390/qeth: Fix kernel panic after setting hsuid
  vxlan: Pull inner IP header in vxlan_rcv().
  tipc: fix a possible memleak in tipc_buf_append
  tipc: fix UAF in error path
  rxrpc: Clients must accept conn from any address
  net: core: reject skb_copy(_expand) for fraglist GSO skbs
  net: bridge: fix multicast-to-unicast with fraglist GSO
  mptcp: ensure snd_nxt is properly initialized on connect
  e1000e: change usleep_range to udelay in PHY mdic access
  net: dsa: mv88e6xxx: Fix number of databases for 88E6141 / 88E6341
  cxgb4: Properly lock TX queue for the selftest.
  rxrpc: Fix using alignmask being zero for __page_frag_alloc_align()
  vxlan: Add missing VNI filter counter update in arp_reduce().
  vxlan: Fix racy device stats updates.
  net: qede: use return from qede_parse_actions()
  ...
2024-05-02 08:51:47 -07:00
Will Deacon 75961ffb5c swiotlb: initialise restricted pool list_head when SWIOTLB_DYNAMIC=y
Using restricted DMA pools (CONFIG_DMA_RESTRICTED_POOL=y) in conjunction
with dynamic SWIOTLB (CONFIG_SWIOTLB_DYNAMIC=y) leads to the following
crash when initialising the restricted pools at boot-time:

  | Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
  | Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
  | pc : rmem_swiotlb_device_init+0xfc/0x1ec
  | lr : rmem_swiotlb_device_init+0xf0/0x1ec
  | Call trace:
  |  rmem_swiotlb_device_init+0xfc/0x1ec
  |  of_reserved_mem_device_init_by_idx+0x18c/0x238
  |  of_dma_configure_id+0x31c/0x33c
  |  platform_dma_configure+0x34/0x80

faddr2line reveals that the crash is in the list validation code:

  include/linux/list.h:83
  include/linux/rculist.h:79
  include/linux/rculist.h:106
  kernel/dma/swiotlb.c:306
  kernel/dma/swiotlb.c:1695

because add_mem_pool() is trying to list_add_rcu() to a NULL
'mem->pools'.

Fix the crash by initialising the 'mem->pools' list_head in
rmem_swiotlb_device_init() before calling add_mem_pool().

Reported-by: Nikita Ioffe <ioffe@google.com>
Tested-by: Nikita Ioffe <ioffe@google.com>
Fixes: 1aaa736815 ("swiotlb: allocate a new memory pool when existing pools are full")
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-02 14:57:04 +02:00
Linus Torvalds 98369dccd2 workqueue: Fixes for v6.9-rc6
Two doc update patches and the following three fixes:
 
 - On single node systems, the default pool is used but the node_nr_active
   for the default pool was set to min_active. This effectively limited the
   max concurrency of unbound pools on single node systems to 8 causing
   performance regressions on some workloads. Fixed by setting the default
   pool's node_nr_active to max_active.
 
 - wq_update_node_max_active() could trigger divide-by-zero if the
   intersection between the allowed CPUs for an unbound workqueue and online
   CPUs becomes empty.
 
 - When kick_pool() was trying to repatriate a worker to a CPU in its pod by
   setting task->wake_cpu, it didn't consider whether the CPU being selected
   is online or not which obviously can lead to subobtimal behaviors. On
   s390, this triggered a crash in arch code. The workqueue patch removes the
   gross misbehavior but doesn't fix the crash completely as there's a race
   window in which CPUs can go down after wake_cpu is set. Need to decide
   whether the fix should be on the core or arch side.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZjAaug4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGT4fAQC5d8dNCDrAJmMgI0OBCwVgGGISTPalI+/ix4zu
 5muBLwEAszuSZ4hEmg4L/jseTk+gZV0vIi4/IHjOzWwYczzLxQA=
 =SeX1
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-6.9-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fixes from Tejun Heo:
 "Two doc update patches and the following three fixes:

   - On single node systems, the default pool is used but the
     node_nr_active for the default pool was set to min_active. This
     effectively limited the max concurrency of unbound pools on single
     node systems to 8 causing performance regressions on some
     workloads. Fixed by setting the default pool's node_nr_active to
     max_active.

   - wq_update_node_max_active() could trigger divide-by-zero if the
     intersection between the allowed CPUs for an unbound workqueue and
     online CPUs becomes empty.

   - When kick_pool() was trying to repatriate a worker to a CPU in its
     pod by setting task->wake_cpu, it didn't consider whether the CPU
     being selected is online or not which obviously can lead to
     subobtimal behaviors. On s390, this triggered a crash in arch code.
     The workqueue patch removes the gross misbehavior but doesn't fix
     the crash completely as there's a race window in which CPUs can go
     down after wake_cpu is set. Need to decide whether the fix should
     be on the core or arch side"

* tag 'wq-for-6.9-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Fix divide error in wq_update_node_max_active()
  workqueue: The default node_nr_active should have its max set to max_active
  workqueue: Fix selection of wake_cpu in kick_pool()
  docs/zh_CN: core-api: Update translation of workqueue.rst to 6.9-rc1
  Documentation/core-api: Update events_freezable_power references.
2024-04-29 15:57:37 -07:00
Matthew Wilcox (Oracle) 5af385f5f4 bounds: Use the right number of bits for power-of-two CONFIG_NR_CPUS
bits_per() rounds up to the next power of two when passed a power of
two.  This causes crashes on some machines and configurations.

Reported-by: Михаил Новоселов <m.novosyolov@rosalinux.ru>
Tested-by: Ильфат Гаптрахманов <i.gaptrakhmanov@rosalinux.ru>
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3347
Link: https://lore.kernel.org/all/1c978cf1-2934-4e66-e4b3-e81b04cb3571@rosalinux.ru/
Fixes: f2d5dcb48f (bounds: support non-power-of-two CONFIG_NR_CPUS)
Cc:  <stable@vger.kernel.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-04-29 08:29:29 -07:00
LuMingYin dce3696271 tracing/probes: Fix memory leak in traceprobe_parse_probe_arg_body()
If traceprobe_parse_probe_arg_body() failed to allocate 'parg->fmt',
it jumps to the label 'out' instead of 'fail' by mistake.In the result,
the buffer 'tmp' is not freed in this case and leaks its memory.

Thus jump to the label 'fail' in that error case.

Link: https://lore.kernel.org/all/20240427072347.1421053-1-lumingyindetect@126.com/

Fixes: 032330abd0 ("tracing/probes: Cleanup probe argument parser")
Signed-off-by: LuMingYin <lumingyindetect@126.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-04-29 22:30:46 +09:00
Zqiang 1dd1eff161 softirq: Fix suspicious RCU usage in __do_softirq()
Currently, the condition "__this_cpu_read(ksoftirqd) == current" is used to
invoke rcu_softirq_qs() in ksoftirqd tasks context for non-RT kernels.

This works correctly as long as the context is actually task context but
this condition is wrong when:

     - the current task is ksoftirqd
     - the task is interrupted in a RCU read side critical section
     - __do_softirq() is invoked on return from interrupt

Syzkaller triggered the following scenario:

  -> finish_task_switch()
    -> put_task_struct_rcu_user()
      -> call_rcu(&task->rcu, delayed_put_task_struct)
        -> __kasan_record_aux_stack()
          -> pfn_valid()
            -> rcu_read_lock_sched()
              <interrupt>
                __irq_exit_rcu()
                -> __do_softirq)()
                   -> if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
                     __this_cpu_read(ksoftirqd) == current)
                     -> rcu_softirq_qs()
                       -> RCU_LOCKDEP_WARN(lock_is_held(&rcu_sched_lock_map))

The rcu quiescent state is reported in the rcu-read critical section, so
the lockdep warning is triggered.

Fix this by splitting out the inner working of __do_softirq() into a helper
function which takes an argument to distinguish between ksoftirqd task
context and interrupted context and invoke it from the relevant call sites
with the proper context information and use that for the conditional
invocation of rcu_softirq_qs().

Reported-by: syzbot+dce04ed6d1438ad69656@syzkaller.appspotmail.com
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240427102808.29356-1-qiang.zhang1211@gmail.com
Link: https://lore.kernel.org/lkml/8f281a10-b85a-4586-9586-5bbc12dc784f@paulmck-laptop/T/#mea8aba4abfcb97bbf499d169ce7f30c4cff1b0e3
2024-04-29 05:03:51 +02:00
Linus Torvalds 245c8e8174 Misc fixes:
- Fix EEVDF corner cases
 
  - Fix two nohz_full= related bugs that can cause boot crashes
    and warnings.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYuBxcRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1im6A/+JfNAwxPghp9zM43ERLadl3MUbH0hsdV9
 54xhQm58Fi8wzXhxhRiOcLqrhFDNsy91mRWHxt9/PjvdFXhp9GiNpehMsHCmTsS8
 7ywJKcAeKTM1+7nq4RFbDFSSpr1J5aUYKXfuhWwr0QVF3mNoRkmZaLdlnVjxebbA
 sKXXtEKbn0yCeIsdPwmZlLxzNyOV2j0p8Xck8DKrLjW57pbebiBHyt2N59PsARb4
 Yt9wNbyb48DqGNy2FaRCWlm/8OL0BLMB0tMnXkIDtW89uVuP4V6fQF0vau3re+vy
 A8+OMD8gpeYjNV5WKrT5r3+EQyFJGI7nr6PbWTY8KLIGCjSSu9iGojn0hdVMGTj7
 rQe6LJNSMe6xW53ZrecMh6OGZ3esgkaZKafXrMcczcSq/CCX0wSVSAbANCkhyANx
 VFZsCgxX/zdRSwSRZiyiHLnP/3/lw0sOxoBS/m0hDSJulJF7fbQGLAfLx+Zccnoe
 2KBra2DXk/49OH+jehrj2C1m2ozWp2+4Kb7mwYISrTJVp0ylgjNiznAKkmB5R8XN
 UOfio5nr09KJWpRKW3UoR2CpaPu/BXUB249DDm36zK1I9V/ljYzrCHKjw+TTWgdS
 nPEVVYR9aj4t/De8wPm0gk/Orv9KaQkpdsOCgezRB0hJGuLpABcA9FGlTJntQ+n9
 UPLMOgN36Q4=
 =Zhc/
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2024-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:

 - Fix EEVDF corner cases

 - Fix two nohz_full= related bugs that can cause boot crashes
   and warnings

* tag 'sched-urgent-2024-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/isolation: Fix boot crash when maxcpus < first housekeeping CPU
  sched/isolation: Prevent boot crash when the boot CPU is nohz_full
  sched/eevdf: Prevent vlag from going out of bounds in reweight_eevdf()
  sched/eevdf: Fix miscalculation in reweight_entity() when se is not curr
  sched/eevdf: Always update V if se->on_rq when reweighting
2024-04-28 12:11:26 -07:00
Linus Torvalds aec147c188 Misc fixes:
- Make the CPU_MITIGATIONS=n interaction with conflicting
    mitigation-enabling boot parameters a bit saner.
 
  - Re-enable CPU mitigations by default on non-x86
 
  - Fix TDX shared bit propagation on mprotect()
 
  - Fix potential show_regs() system hang when PKE
    initialization is not fully finished yet.
 
  - Add the 0x10-0x1f model IDs to the Zen5 range
 
  - Harden #VC instruction emulation some more
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYuCVMRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1h0Hw/1HVlmRGTrQQBvVMlzt6Y3GlUk2uHSiSh0
 pO57sh9tMu/3kWdcrUi4xkEVHmfBjMxXY5sw/7VXQ9mG7wv+SVgF3gAaAl+5q73K
 JKPPAhkPqUmXP3Sm1rqTt8iZtTViY3ilP6QEZaOIfL2Pwa7X3QP8TJRBKAJCrXEM
 hOEMXSd1W1Escs/uPlhCXHx8TRVTr9f4bv8TdHBXZGHTida5vejj+yhMSdaM94qw
 ywZ4an1NOnLGcNEMMYhOQ6Kbh9Ckj46JRjpodTfmjodLd/jOhVU5C7nTZfHRXSRU
 3UQBZtTZIYYCs8Urg2l/W5IhywWV3P9Jg+D+vl/bdEKJ+yINLAnOgVhVPqeG2GWt
 Ww3FelgRz0AkQKTegRCK2jQWnHActSrYmkr4M24wa/cVkMrcpXT3LHj8PgRnllx5
 q5JqQ37G3QYHMzslbBqyUHzJv8KzgdZdgyFTN3dX1q9n5FPy7Ul9Ue1Zp2SoId8i
 K6u+IjCkftWwIbv8AhXiEVo0ynfBkmV4UNVGJks1xIPA3lmNv3ax5nQMJLvZzJ48
 n+Id8ALEWxyOrKR6bdWdPtJqd0Nw/q4e6AOTzVYE94X8+uVuug4m4X7QPo+Ctbz1
 IkhTxmBbHzgKylbddK6LkdnXnHCGidOmXsF3VS6TRfz7ALaMUgpaHw34reEhiOlT
 xsIw+XVOKg==
 =AfRR
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2024-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:

 - Make the CPU_MITIGATIONS=n interaction with conflicting
   mitigation-enabling boot parameters a bit saner.

 - Re-enable CPU mitigations by default on non-x86

 - Fix TDX shared bit propagation on mprotect()

 - Fix potential show_regs() system hang when PKE initialization
   is not fully finished yet.

 - Add the 0x10-0x1f model IDs to the Zen5 range

 - Harden #VC instruction emulation some more

* tag 'x86-urgent-2024-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu: Ignore "mitigations" kernel parameter if CPU_MITIGATIONS=n
  cpu: Re-enable CPU mitigations by default for !X86 architectures
  x86/tdx: Preserve shared bit on mprotect()
  x86/cpu: Fix check for RDPKRU in __show_regs()
  x86/CPU/AMD: Add models 0x10-0x1f to the Zen5 range
  x86/sev: Check for MWAITX and MONITORX opcodes in the #VC handler
2024-04-28 11:58:16 -07:00
Oleg Nesterov 257bf89d84 sched/isolation: Fix boot crash when maxcpus < first housekeeping CPU
housekeeping_setup() checks cpumask_intersects(present, online) to ensure
that the kernel will have at least one housekeeping CPU after smp_init(),
but this doesn't work if the maxcpus= kernel parameter limits the number of
processors available after bootup.

For example, a kernel with "maxcpus=2 nohz_full=0-2" parameters crashes at
boot time on a virtual machine with 4 CPUs.

Change housekeeping_setup() to use cpumask_first_and() and check that the
returned CPU number is valid and less than setup_max_cpus.

Another corner case is "nohz_full=0" on a machine with a single CPU or with
the maxcpus=1 kernel argument. In this case non_housekeeping_mask is empty
and tick_nohz_full_setup() makes no sense. And indeed, the kernel hits the
WARN_ON(tick_nohz_full_running) in tick_sched_do_timer().

And how should the kernel interpret the "nohz_full=" parameter? It should
be silently ignored, but currently cpulist_parse() happily returns the
empty cpumask and this leads to the same problem.

Change housekeeping_setup() to check cpumask_empty(non_housekeeping_mask)
and do nothing in this case.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240413141746.GA10008@redhat.com
2024-04-28 10:08:21 +02:00
Oleg Nesterov 5097cbcb38 sched/isolation: Prevent boot crash when the boot CPU is nohz_full
Documentation/timers/no_hz.rst states that the "nohz_full=" mask must not
include the boot CPU, which is no longer true after:

  08ae95f4fd ("nohz_full: Allow the boot CPU to be nohz_full").

However after:

  aae17ebb53 ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work")

the kernel will crash at boot time in this case; housekeeping_any_cpu()
returns an invalid CPU number until smp_init() brings the first
housekeeping CPU up.

Change housekeeping_any_cpu() to check the result of cpumask_any_and() and
return smp_processor_id() in this case.

This is just the simple and backportable workaround which fixes the
symptom, but smp_processor_id() at boot time should be safe at least for
type == HK_TYPE_TIMER, this more or less matches the tick_do_timer_boot_cpu
logic.

There is no worry about cpu_down(); tick_nohz_cpu_down() will not allow to
offline tick_do_timer_cpu (the 1st online housekeeping CPU).

Fixes: aae17ebb53 ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work")
Reported-by: Chris von Recklinghausen <crecklin@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240411143905.GA19288@redhat.com
Closes: https://lore.kernel.org/all/20240402105847.GA24832@redhat.com/
2024-04-28 10:07:12 +02:00
Tetsuo Handa 2e5449f4f2 profiling: Remove create_prof_cpu_mask().
create_prof_cpu_mask() is no longer used after commit 1f44a22577 ("s390:
convert interrupt handling to use generic hardirq").

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-04-27 11:17:48 -07:00
Jakub Kicinski b2ff42c6d3 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZiwdfQAKCRDbK58LschI
 g1oqAP9mjayeIHCfYMQZa2eevy1PmVlgdNdFdMDWZFS/pHv9cgD/ZdmGzbUDKCAQ
 Y/KiTajitZw3kxtHX45v8/Ugtlsh9Qg=
 =Ewiw
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2024-04-26

We've added 12 non-merge commits during the last 22 day(s) which contain
a total of 14 files changed, 168 insertions(+), 72 deletions(-).

The main changes are:

1) Fix BPF_PROBE_MEM in verifier and JIT to skip loads from vsyscall page,
   from Puranjay Mohan.

2) Fix a crash in XDP with devmap broadcast redirect when the latter map
   is in process of being torn down, from Toke Høiland-Jørgensen.

3) Fix arm64 and riscv64 BPF JITs to properly clear start time for BPF
   program runtime stats, from Xu Kuohai.

4) Fix a sockmap KCSAN-reported data race in sk_psock_skb_ingress_enqueue,
    from Jason Xing.

5) Fix BPF verifier error message in resolve_pseudo_ldimm64,
   from Anton Protopopov.

6) Fix missing DEBUG_INFO_BTF_MODULES Kconfig menu item,
   from Andrii Nakryiko.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Test PROBE_MEM of VSYSCALL_ADDR on x86-64
  bpf, x86: Fix PROBE_MEM runtime load check
  bpf: verifier: prevent userspace memory access
  xdp: use flags field to disambiguate broadcast redirect
  arm32, bpf: Reimplement sign-extension mov instruction
  riscv, bpf: Fix incorrect runtime stats
  bpf, arm64: Fix incorrect runtime stats
  bpf: Fix a verifier verbose message
  bpf, skmsg: Fix NULL pointer dereference in sk_psock_skb_ingress_enqueue
  MAINTAINERS: bpf: Add Lehui and Puranjay as riscv64 reviewers
  MAINTAINERS: Update email address for Puranjay Mohan
  bpf, kconfig: Fix DEBUG_INFO_BTF_MODULES Kconfig definition
====================

Link: https://lore.kernel.org/r/20240426224248.26197-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-26 17:36:53 -07:00
Linus Torvalds e6ebf01172 11 hotfixes. 8 are cc:stable and the remaining 3 (nice ratio!) address
post-6.8 issues or aren't considered suitable for backporting.
 
 All except one of these are for MM.  I see no particular theme - it's
 singletons all over.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZiwPZwAKCRDdBJ7gKXxA
 jmcQAPkB6UT/rBUMvFZb1dom9R6SDYl5ZBr20Vj1HvfakCLxmQEAqEd0N7QoWvKS
 hKNCMDujiEKqDUWeUaJen4cqXFFE2Qg=
 =1wP7
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-04-26-13-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "11 hotfixes. 8 are cc:stable and the remaining 3 (nice ratio!) address
  post-6.8 issues or aren't considered suitable for backporting.

  All except one of these are for MM. I see no particular theme - it's
  singletons all over"

* tag 'mm-hotfixes-stable-2024-04-26-13-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio()
  selftests: mm: protection_keys: save/restore nr_hugepages value from launch script
  stackdepot: respect __GFP_NOLOCKDEP allocation flag
  hugetlb: check for anon_vma prior to folio allocation
  mm: zswap: fix shrinker NULL crash with cgroup_disable=memory
  mm: turn folio_test_hugetlb into a PageType
  mm: support page_mapcount() on page_has_type() pages
  mm: create FOLIO_FLAG_FALSE and FOLIO_TYPE_OPS macros
  mm/hugetlb: fix missing hugetlb_lock for resv uncharge
  selftests: mm: fix unused and uninitialized variable warning
  selftests/harness: remove use of LINE_MAX
2024-04-26 13:48:03 -07:00
Puranjay Mohan 66e13b615a bpf: verifier: prevent userspace memory access
With BPF_PROBE_MEM, BPF allows de-referencing an untrusted pointer. To
thwart invalid memory accesses, the JITs add an exception table entry
for all such accesses. But in case the src_reg + offset is a userspace
address, the BPF program might read that memory if the user has
mapped it.

Make the verifier add guard instructions around such memory accesses and
skip the load if the address falls into the userspace region.

The JITs need to implement bpf_arch_uaddress_limit() to define where
the userspace addresses end for that architecture or TASK_SIZE is taken
as default.

The implementation is as follows:

REG_AX =  SRC_REG
if(offset)
	REG_AX += offset;
REG_AX >>= 32;
if (REG_AX <= (uaddress_limit >> 32))
	DST_REG = 0;
else
	DST_REG = *(size *)(SRC_REG + offset);

Comparing just the upper 32 bits of the load address with the upper
32 bits of uaddress_limit implies that the values are being aligned down
to a 4GB boundary before comparison.

The above means that all loads with address <= uaddress_limit + 4GB are
skipped. This is acceptable because there is a large hole (much larger
than 4GB) between userspace and kernel space memory, therefore a
correctly functioning BPF program should not access this 4GB memory
above the userspace.

Let's analyze what this patch does to the following fentry program
dereferencing an untrusted pointer:

  SEC("fentry/tcp_v4_connect")
  int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk)
  {
                *(volatile long *)sk;
                return 0;
  }

    BPF Program before              |           BPF Program after
    ------------------              |           -----------------

  0: (79) r1 = *(u64 *)(r1 +0)          0: (79) r1 = *(u64 *)(r1 +0)
  -----------------------------------------------------------------------
  1: (79) r1 = *(u64 *)(r1 +0) --\      1: (bf) r11 = r1
  ----------------------------\   \     2: (77) r11 >>= 32
  2: (b7) r0 = 0               \   \    3: (b5) if r11 <= 0x8000 goto pc+2
  3: (95) exit                  \   \-> 4: (79) r1 = *(u64 *)(r1 +0)
                                 \      5: (05) goto pc+1
                                  \     6: (b7) r1 = 0
                                   \--------------------------------------
                                        7: (b7) r0 = 0
                                        8: (95) exit

As you can see from above, in the best case (off=0), 5 extra instructions
are emitted.

Now, we analyze the same program after it has gone through the JITs of
ARM64 and RISC-V architectures. We follow the single load instruction
that has the untrusted pointer and see what instrumentation has been
added around it.

                                x86-64 JIT
                                ==========
     JIT's Instrumentation
          (upstream)
     ---------------------

   0:   nopl   0x0(%rax,%rax,1)
   5:   xchg   %ax,%ax
   7:   push   %rbp
   8:   mov    %rsp,%rbp
   b:   mov    0x0(%rdi),%rdi
  ---------------------------------
   f:   movabs $0x800000000000,%r11
  19:   cmp    %r11,%rdi
  1c:   jb     0x000000000000002a
  1e:   mov    %rdi,%r11
  21:   add    $0x0,%r11
  28:   jae    0x000000000000002e
  2a:   xor    %edi,%edi
  2c:   jmp    0x0000000000000032
  2e:   mov    0x0(%rdi),%rdi
  ---------------------------------
  32:   xor    %eax,%eax
  34:   leave
  35:   ret

The x86-64 JIT already emits some instructions to protect against user
memory access. This patch doesn't make any changes for the x86-64 JIT.

                                  ARM64 JIT
                                  =========

        No Intrumentation                       Verifier's Instrumentation
           (upstream)                                  (This patch)
        -----------------                       --------------------------

   0:   add     x9, x30, #0x0                0:   add     x9, x30, #0x0
   4:   nop                                  4:   nop
   8:   paciasp                              8:   paciasp
   c:   stp     x29, x30, [sp, #-16]!        c:   stp     x29, x30, [sp, #-16]!
  10:   mov     x29, sp                     10:   mov     x29, sp
  14:   stp     x19, x20, [sp, #-16]!       14:   stp     x19, x20, [sp, #-16]!
  18:   stp     x21, x22, [sp, #-16]!       18:   stp     x21, x22, [sp, #-16]!
  1c:   stp     x25, x26, [sp, #-16]!       1c:   stp     x25, x26, [sp, #-16]!
  20:   stp     x27, x28, [sp, #-16]!       20:   stp     x27, x28, [sp, #-16]!
  24:   mov     x25, sp                     24:   mov     x25, sp
  28:   mov     x26, #0x0                   28:   mov     x26, #0x0
  2c:   sub     x27, x25, #0x0              2c:   sub     x27, x25, #0x0
  30:   sub     sp, sp, #0x0                30:   sub     sp, sp, #0x0
  34:   ldr     x0, [x0]                    34:   ldr     x0, [x0]
--------------------------------------------------------------------------------
  38:   ldr     x0, [x0] ----------\        38:   add     x9, x0, #0x0
-----------------------------------\\       3c:   lsr     x9, x9, #32
  3c:   mov     x7, #0x0            \\      40:   cmp     x9, #0x10, lsl #12
  40:   mov     sp, sp               \\     44:   b.ls    0x0000000000000050
  44:   ldp     x27, x28, [sp], #16   \\--> 48:   ldr     x0, [x0]
  48:   ldp     x25, x26, [sp], #16    \    4c:   b       0x0000000000000054
  4c:   ldp     x21, x22, [sp], #16     \   50:   mov     x0, #0x0
  50:   ldp     x19, x20, [sp], #16      \---------------------------------------
  54:   ldp     x29, x30, [sp], #16         54:   mov     x7, #0x0
  58:   add     x0, x7, #0x0                58:   mov     sp, sp
  5c:   autiasp                             5c:   ldp     x27, x28, [sp], #16
  60:   ret                                 60:   ldp     x25, x26, [sp], #16
  64:   nop                                 64:   ldp     x21, x22, [sp], #16
  68:   ldr     x10, 0x0000000000000070     68:   ldp     x19, x20, [sp], #16
  6c:   br      x10                         6c:   ldp     x29, x30, [sp], #16
                                            70:   add     x0, x7, #0x0
                                            74:   autiasp
                                            78:   ret
                                            7c:   nop
                                            80:   ldr     x10, 0x0000000000000088
                                            84:   br      x10

There are 6 extra instructions added in ARM64 in the best case. This will
become 7 in the worst case (off != 0).

                           RISC-V JIT (RISCV_ISA_C Disabled)
                           ==========

        No Intrumentation           Verifier's Instrumentation
           (upstream)                      (This patch)
        -----------------           --------------------------

   0:   nop                            0:   nop
   4:   nop                            4:   nop
   8:   li      a6, 33                 8:   li      a6, 33
   c:   addi    sp, sp, -16            c:   addi    sp, sp, -16
  10:   sd      s0, 8(sp)             10:   sd      s0, 8(sp)
  14:   addi    s0, sp, 16            14:   addi    s0, sp, 16
  18:   ld      a0, 0(a0)             18:   ld      a0, 0(a0)
---------------------------------------------------------------
  1c:   ld      a0, 0(a0) --\         1c:   mv      t0, a0
--------------------------\  \        20:   srli    t0, t0, 32
  20:   li      a5, 0      \  \       24:   lui     t1, 4096
  24:   ld      s0, 8(sp)   \  \      28:   sext.w  t1, t1
  28:   addi    sp, sp, 16   \  \     2c:   bgeu    t1, t0, 12
  2c:   sext.w  a0, a5        \  \--> 30:   ld      a0, 0(a0)
  30:   ret                    \      34:   j       8
                                \     38:   li      a0, 0
                                 \------------------------------
                                      3c:   li      a5, 0
                                      40:   ld      s0, 8(sp)
                                      44:   addi    sp, sp, 16
                                      48:   sext.w  a0, a5
                                      4c:   ret

There are 7 extra instructions added in RISC-V.

Fixes: 8008342853 ("bpf, arm64: Add BPF exception tables")
Reported-by: Breno Leitao <leitao@debian.org>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20240424100210.11982-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-26 09:45:18 -07:00
Sean Christopherson ce0abef6a1 cpu: Ignore "mitigations" kernel parameter if CPU_MITIGATIONS=n
Explicitly disallow enabling mitigations at runtime for kernels that were
built with CONFIG_CPU_MITIGATIONS=n, as some architectures may omit code
entirely if mitigations are disabled at compile time.

E.g. on x86, a large pile of Kconfigs are buried behind CPU_MITIGATIONS,
and trying to provide sane behavior for retroactively enabling mitigations
is extremely difficult, bordering on impossible.  E.g. page table isolation
and call depth tracking require build-time support, BHI mitigations will
still be off without additional kernel parameters, etc.

  [ bp: Touchups. ]

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240420000556.2645001-3-seanjc@google.com
2024-04-25 15:47:39 +02:00
Sean Christopherson fe42754b94 cpu: Re-enable CPU mitigations by default for !X86 architectures
Rename x86's to CPU_MITIGATIONS, define it in generic code, and force it
on for all architectures exception x86.  A recent commit to turn
mitigations off by default if SPECULATION_MITIGATIONS=n kinda sorta
missed that "cpu_mitigations" is completely generic, whereas
SPECULATION_MITIGATIONS is x86-specific.

Rename x86's SPECULATIVE_MITIGATIONS instead of keeping both and have it
select CPU_MITIGATIONS, as having two configs for the same thing is
unnecessary and confusing.  This will also allow x86 to use the knob to
manage mitigations that aren't strictly related to speculative
execution.

Use another Kconfig to communicate to common code that CPU_MITIGATIONS
is already defined instead of having x86's menu depend on the common
CPU_MITIGATIONS.  This allows keeping a single point of contact for all
of x86's mitigations, and it's not clear that other architectures *want*
to allow disabling mitigations at compile-time.

Fixes: f337a6a21e ("x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n")
Closes: https://lkml.kernel.org/r/20240413115324.53303a68%40canb.auug.org.au
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240420000556.2645001-2-seanjc@google.com
2024-04-25 15:47:35 +02:00
Matthew Wilcox (Oracle) d99e3140a4 mm: turn folio_test_hugetlb into a PageType
The current folio_test_hugetlb() can be fooled by a concurrent folio split
into returning true for a folio which has never belonged to hugetlbfs. 
This can't happen if the caller holds a refcount on it, but we have a few
places (memory-failure, compaction, procfs) which do not and should not
take a speculative reference.

Since hugetlb pages do not use individual page mapcounts (they are always
fully mapped and use the entire_mapcount field to record the number of
mappings), the PageType field is available now that page_mapcount()
ignores the value in this field.

In compaction and with CONFIG_DEBUG_VM enabled, the current implementation
can result in an oops, as reported by Luis. This happens since 9c5ccf2db0
("mm: remove HUGETLB_PAGE_DTOR") effectively added some VM_BUG_ON() checks
in the PageHuge() testing path.

[willy@infradead.org: update vmcoreinfo]
  Link: https://lkml.kernel.org/r/ZgGZUvsdhaT1Va-T@casper.infradead.org
Link: https://lkml.kernel.org/r/20240321142448.1645400-6-willy@infradead.org
Fixes: 9c5ccf2db0 ("mm: remove HUGETLB_PAGE_DTOR")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Luis Chamberlain <mcgrof@kernel.org>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218227
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-24 19:34:26 -07:00
Lai Jiangshan 91f098704c workqueue: Fix divide error in wq_update_node_max_active()
Yue Sun and xingwei lee reported a divide error bug in
wq_update_node_max_active():

divide error: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 1 PID: 21 Comm: cpuhp/1 Not tainted 6.9.0-rc5 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:wq_update_node_max_active+0x369/0x6b0 kernel/workqueue.c:1605
Code: 24 bf 00 00 00 80 44 89 fe e8 83 27 33 00 41 83 fc ff 75 0d 41
81 ff 00 00 00 80 0f 84 68 01 00 00 e8 fb 22 33 00 44 89 f8 99 <41> f7
fc 89 c5 89 c7 44 89 ee e8 a8 24 33 00 89 ef 8b 5c 24 04 89
RSP: 0018:ffffc9000018fbb0 EFLAGS: 00010293
RAX: 00000000000000ff RBX: 0000000000000001 RCX: ffff888100ada500
RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000080000000
RBP: 0000000000000001 R08: ffffffff815b1fcd R09: 1ffff1100364ad72
R10: dffffc0000000000 R11: ffffed100364ad73 R12: 0000000000000000
R13: 0000000000000100 R14: 0000000000000000 R15: 00000000000000ff
FS:  0000000000000000(0000) GS:ffff888135c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb8c06ca6f8 CR3: 000000010d6c6000 CR4: 0000000000750ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 workqueue_offline_cpu+0x56f/0x600 kernel/workqueue.c:6525
 cpuhp_invoke_callback+0x4e1/0x870 kernel/cpu.c:194
 cpuhp_thread_fun+0x411/0x7d0 kernel/cpu.c:1092
 smpboot_thread_fn+0x544/0xa10 kernel/smpboot.c:164
 kthread+0x2ed/0x390 kernel/kthread.c:388
 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:244
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---

After analysis, it happens when all of the CPUs in a workqueue's affinity
get offine.

The problem can be easily reproduced by:

 # echo 8 > /sys/devices/virtual/workqueue/<any-wq-name>/cpumask
 # echo 0 > /sys/devices/system/cpu/cpu3/online

Use the default max_actives for nodes when all of the CPUs in the
workqueue's affinity get offline to fix the problem.

Reported-by: Yue Sun <samsun1006219@gmail.com>
Reported-by: xingwei lee <xrivendell7@gmail.com>
Link: https://lore.kernel.org/lkml/CAEkJfYPGS1_4JqvpSo0=FM0S1ytB8CEbyreLTtWpR900dUZymw@mail.gmail.com/
Fixes: 5797b1c189 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
Cc: stable@vger.kernel.org
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-24 07:23:06 -10:00
Tejun Heo d40f92020c workqueue: The default node_nr_active should have its max set to max_active
The default nna (node_nr_active) is used when the pool isn't tied to a
specific NUMA node. This can happen in the following cases:

 1. On NUMA, if per-node pwq init failure and the fallback pwq is used.
 2. On NUMA, if a pool is configured to span multiple nodes.
 3. On single node setups.

5797b1c189 ("workqueue: Implement system-wide nr_active enforcement for
unbound workqueues") set the default nna->max to min_active because only #1
was being considered. For #2 and #3, using min_active means that the max
concurrency in normal operation is pushed down to min_active which is
currently 8, which can obviously lead to performance issues.

exact value nna->max is set to doesn't really matter. #2 can only happen if
the workqueue is intentionally configured to ignore NUMA boundaries and
there's no good way to distribute max_active in this case. #3 is the default
behavior on single node machines.

Let's set it the default nna->max to max_active. This fixes the artificially
lowered concurrency problem on single node machines and shouldn't hurt
anything for other cases.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 5797b1c189 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
Link: https://lore.kernel.org/dm-devel/20240410084531.2134621-1-shinichiro.kawasaki@wdc.com/
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-23 17:32:59 -10:00
Sven Schnelle 57a01eafdc workqueue: Fix selection of wake_cpu in kick_pool()
With cpu_possible_mask=0-63 and cpu_online_mask=0-7 the following
kernel oops was observed:

smp: Bringing up secondary CPUs ...
smp: Brought up 1 node, 8 CPUs
Unable to handle kernel pointer dereference in virtual kernel address space
Failing address: 0000000000000000 TEID: 0000000000000803
[..]
 Call Trace:
arch_vcpu_is_preempted+0x12/0x80
select_idle_sibling+0x42/0x560
select_task_rq_fair+0x29a/0x3b0
try_to_wake_up+0x38e/0x6e0
kick_pool+0xa4/0x198
__queue_work.part.0+0x2bc/0x3a8
call_timer_fn+0x36/0x160
__run_timers+0x1e2/0x328
__run_timer_base+0x5a/0x88
run_timer_softirq+0x40/0x78
__do_softirq+0x118/0x388
irq_exit_rcu+0xc0/0xd8
do_ext_irq+0xae/0x168
ext_int_handler+0xbe/0xf0
psw_idle_exit+0x0/0xc
default_idle_call+0x3c/0x110
do_idle+0xd4/0x158
cpu_startup_entry+0x40/0x48
rest_init+0xc6/0xc8
start_kernel+0x3c4/0x5e0
startup_continue+0x3c/0x50

The crash is caused by calling arch_vcpu_is_preempted() for an offline
CPU. To avoid this, select the cpu with cpumask_any_and_distribute()
to mask __pod_cpumask with cpu_online_mask. In case no cpu is left in
the pool, skip the assignment.

tj: This doesn't fully fix the bug as CPUs can still go down between picking
the target CPU and the wake call. Fixing that likely requires adding
cpu_online() test to either the sched or s390 arch code. However, regardless
of how that is fixed, workqueue shouldn't be picking a CPU which isn't
online as that would result in unpredictable and worse behavior.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Fixes: 8639ecebc9 ("workqueue: Implement non-strict affinity scope for unbound workqueues")
Cc: stable@vger.kernel.org # v6.6+
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-23 06:22:40 -10:00
Xuewen Yan 1560d1f6eb sched/eevdf: Prevent vlag from going out of bounds in reweight_eevdf()
It was possible to have pick_eevdf() return NULL, which then causes a
NULL-deref. This turned out to be due to entity_eligible() returning
falsely negative because of a s64 multiplcation overflow.

Specifically, reweight_eevdf() computes the vlag without considering
the limit placed upon vlag as update_entity_lag() does, and then the
scaling multiplication (remember that weight is 20bit fixed point) can
overflow. This then leads to the new vruntime being weird which then
causes the above entity_eligible() to go side-ways and claim nothing
is eligible.

Thus limit the range of vlag accordingly.

All this was quite rare, but fatal when it does happen.

Closes: https://lore.kernel.org/all/ZhuYyrh3mweP_Kd8@nz.home/
Closes: https://lore.kernel.org/all/CA+9S74ih+45M_2TPUY_mPPVDhNvyYfy1J1ftSix+KjiTVxg8nw@mail.gmail.com/
Closes: https://lore.kernel.org/lkml/202401301012.2ed95df0-oliver.sang@intel.com/
Fixes: eab03c23c2 ("sched/eevdf: Fix vruntime adjustment on reweight")
Reported-by: Sergei Trofimovich <slyich@gmail.com>
Reported-by: Igor Raits <igor@gooddata.com>
Reported-by: Breno Leitao <leitao@debian.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Yujie Liu <yujie.liu@intel.com>
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Reviewed-and-tested-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240422082238.5784-1-xuewen.yan@unisoc.com
2024-04-22 13:01:27 +02:00
Tianchen Ding afae8002b4 sched/eevdf: Fix miscalculation in reweight_entity() when se is not curr
reweight_eevdf() only keeps V unchanged inside itself. When se !=
cfs_rq->curr, it would be dequeued from rb tree first. So that V is
changed and the result is wrong. Pass the original V to reweight_eevdf()
to fix this issue.

Fixes: eab03c23c2 ("sched/eevdf: Fix vruntime adjustment on reweight")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
[peterz: flip if() condition for clarity]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Abel Wu <wuyun.abel@bytedance.com>
Link: https://lkml.kernel.org/r/20240306022133.81008-3-dtcccc@linux.alibaba.com
2024-04-22 13:01:26 +02:00
Tianchen Ding 11b1b8bc2b sched/eevdf: Always update V if se->on_rq when reweighting
reweight_eevdf() needs the latest V to do accurate calculation for new
ve and vd. So update V unconditionally when se is runnable.

Fixes: eab03c23c2 ("sched/eevdf: Fix vruntime adjustment on reweight")
Suggested-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Abel Wu <wuyun.abel@bytedance.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Link: https://lore.kernel.org/r/20240306022133.81008-2-dtcccc@linux.alibaba.com
2024-04-22 13:01:26 +02:00
Linus Torvalds 3b68086599 - Add a missing memory barrier in the concurrency ID mm switching
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmYk064ACgkQEsHwGGHe
 VUo+Gg/9GvZD7InmNHQx+TZORbCphljcx110JxB0wh6l1hTdqy57VbLjZtnztCVt
 EkFGku/5DEDLQVDD1VHvR6w8w3rayqBAhbGApnJfVGEDJfqyYBGXG4Nhx6qsJD5F
 YBtgk+HZjZ7G5kmGlokgNKXmh4NadUwADbkBaV+iSXBQs+Teld1THJFylO7ju+rZ
 jUttExna3YhsiHdcRwBhlxFwHm0SXsd591n9slTxNoIRg19SVAvfkYmcoL2W70c0
 qtHtaVS37sHInFuVzI53W82hNfGLLgPt+vkQEtiG00CwR4956I2LkorsfAG3JwAA
 CPlbgO3ypwWSJjSdRrvYkjn7pw9JtlMQMpVh+ypaMgQ1zucb6FpDEzHnAA4lbgNW
 u31ALXlqhUcnKC/3HZC+vPQosRNSyBP2rWtNNdMZ3YKrW4vU/5zII+4foZj9kv3u
 irH27Suh6aKMc1y6THNDZmwcgJIFRtvr3K1fm7WwM0z5qz5uNq3sOW5RIQTTL3ti
 +lW7pK1NqUHZhTsOs09+kt/M7vfxyrMqk5qma18kgdgkaRCwCSJLcPGICVCM5WmD
 mnLsbs15PIKdN4cwfqeC0KzWr/xroPn1VHiWgU9zZeXcEVss0+UDcMbopQERsXH0
 6AJV5A0/H/PVfDFKs5aTdwzQXB3LKG2AGdGoX7KpN8vtHf5TFt4=
 =YLBT
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v6.9_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Borislav Petkov:

 - Add a missing memory barrier in the concurrency ID mm switching

* tag 'sched_urgent_for_v6.9_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Add missing memory barrier in switch_mm_cid
2024-04-21 09:39:36 -07:00
Linus Torvalds 2d412262cc hardening fixes for v6.9-rc5
- Correctly disable UBSAN configs in configs/hardening (Nathan Chancellor)
 
 - Add missing signed integer overflow trap types to arm64 handler
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmYi0M8WHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJjPrEACrLlPZUPLiJPlPdYC5bW4lLSgZ
 v6z5XjeEVWVvIlzW3DKPKvzMmIl6D3CTN6KbgdjHR+s5VYGYVlkoQJw09SSBu1OX
 yFC2i0lyqUKuAmh6jK1T46kXvbrgK/3ClO3nQk7KotTfvuRcorAcGEmayTYnaWqd
 JX3qyry2oEiQG9pWDHl9bRQ1ZgbNdNkxR2YYIhy88lrMWORdVNG7PFkgNnVsEbnb
 UAbXl817//TSTuXUwklTllz0UNInmDmQTjrmMRUhiwTKEs8aRS5VX6biSyc1Fucz
 KYXNeK9ciV80mQYnj7jDxgXC5jNThtrjEokzht8vvZGHcBp3WMr6CJLmwj9aaSXE
 edib7mJf/YveJTCPN17xAvIMHZAFZyoyeiVIOE1Ys2lWSj8rXH5TvWnn/E4QPxHK
 77lOKGZNwNMYmIa+L6gb3OOWpiZpOMTLCGMuJh6VSDf5BcA0i45yTxAlAe5JYpgw
 txxDscFu5MtrabR4Z28J+VY/wnWqQAC89D6qYsJOPH8kL0o3XhELCDKPNUZoY094
 LV7XuhAB+xDqNdvZi7SHTmTtZSLPqRBlNrOUqQSmXrwjp11naya26l7fn1Y0cpQM
 K8o3ioUkSg0PJNox/kGxryouHXXMqtN/k52JPotkfa6XEQDpN82uo0xJD9r21Viu
 qhA7A8vcQ3KIb0cUbw==
 =Q6Hg
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening fixes from Kees Cook:

 - Correctly disable UBSAN configs in configs/hardening (Nathan
   Chancellor)

 - Add missing signed integer overflow trap types to arm64 handler

* tag 'hardening-v6.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  ubsan: Add awareness of signed integer overflow traps
  configs/hardening: Disable CONFIG_UBSAN_SIGNED_WRAP
  configs/hardening: Fix disabling UBSAN configurations
2024-04-19 14:10:11 -07:00
Miaohe Lin 35e351780f fork: defer linking file vma until vma is fully initialized
Thorvald reported a WARNING [1]. And the root cause is below race:

 CPU 1					CPU 2
 fork					hugetlbfs_fallocate
  dup_mmap				 hugetlbfs_punch_hole
   i_mmap_lock_write(mapping);
   vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
   i_mmap_unlock_write(mapping);
   hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!
					 i_mmap_lock_write(mapping);
   					 hugetlb_vmdelete_list
					  vma_interval_tree_foreach
					   hugetlb_vma_trylock_write -- Vma_lock is cleared.
   tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
					   hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
					 i_mmap_unlock_write(mapping);

hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside
i_mmap_rwsem lock while vma lock can be used in the same time.  Fix this
by deferring linking file vma until vma is fully initialized.  Those vmas
should be initialized first before they can be used.

Link: https://lkml.kernel.org/r/20240410091441.3539905-1-linmiaohe@huawei.com
Fixes: 8d9bfb2608 ("hugetlb: add vma based lock for pmd sharing")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reported-by: Thorvald Natvig <thorvald@google.com>
Closes: https://lore.kernel.org/linux-mm/20240129161735.6gmjsswx62o4pbja@revolver/T/ [1]
Reviewed-by: Jane Chu <jane.chu@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Tycho Andersen <tandersen@netflix.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-16 15:39:51 -07:00
Mathieu Desnoyers fe90f3967b sched: Add missing memory barrier in switch_mm_cid
Many architectures' switch_mm() (e.g. arm64) do not have an smp_mb()
which the core scheduler code has depended upon since commit:

    commit 223baf9d17 ("sched: Fix performance regression introduced by mm_cid")

If switch_mm() doesn't call smp_mb(), sched_mm_cid_remote_clear() can
unset the actively used cid when it fails to observe active task after it
sets lazy_put.

There *is* a memory barrier between storing to rq->curr and _return to
userspace_ (as required by membarrier), but the rseq mm_cid has stricter
requirements: the barrier needs to be issued between store to rq->curr
and switch_mm_cid(), which happens earlier than:

  - spin_unlock(),
  - switch_to().

So it's fine when the architecture switch_mm() happens to have that
barrier already, but less so when the architecture only provides the
full barrier in switch_to() or spin_unlock().

It is a bug in the rseq switch_mm_cid() implementation. All architectures
that don't have memory barriers in switch_mm(), but rather have the full
barrier either in finish_lock_switch() or switch_to() have them too late
for the needs of switch_mm_cid().

Introduce a new smp_mb__after_switch_mm(), defined as smp_mb() in the
generic barrier.h header, and use it in switch_mm_cid() for scheduler
transitions where switch_mm() is expected to provide a memory barrier.

Architectures can override smp_mb__after_switch_mm() if their
switch_mm() implementation provides an implicit memory barrier.
Override it with a no-op on x86 which implicitly provide this memory
barrier by writing to CR3.

Fixes: 223baf9d17 ("sched: Fix performance regression introduced by mm_cid")
Reported-by: levi.yun <yeoreum.yun@arm.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> # for arm64
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> # for x86
Cc: <stable@vger.kernel.org> # 6.4.x
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240415152114.59122-2-mathieu.desnoyers@efficios.com
2024-04-16 13:59:45 +02:00
Nathan Chancellor 7fcb91d94e configs/hardening: Disable CONFIG_UBSAN_SIGNED_WRAP
kernel/configs/hardening.config turns on UBSAN for the bounds sanitizer,
as that in combination with trapping can stop the exploitation of buffer
overflows within the kernel. At the same time, hardening.config turns
off every other UBSAN sanitizer because trapping means all UBSAN reports
will be fatal and the problems brought up by other sanitizers generally
do not have security implications.

The signed integer overflow sanitizer was recently added back to the
kernel and it is default on with just CONFIG_UBSAN=y, meaning that it
gets enabled when merging hardening.config into another configuration.
While this sanitizer does have security implications like the array
bounds sanitizer, work to clean up enough instances to allow this to run
in production environments is still ramping up, which means regular
users and testers may be broken by these instances with
CONFIG_UBSAN_TRAP=y. Disable CONFIG_UBSAN_SIGNED_WRAP in
hardening.config to avoid this situation.

Fixes: 557f8c582a ("ubsan: Reintroduce signed overflow sanitizer")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240411-fix-ubsan-in-hardening-config-v1-2-e0177c80ffaa@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2024-04-15 11:08:24 -07:00
Nathan Chancellor e048d668f2 configs/hardening: Fix disabling UBSAN configurations
The initial change that added kernel/configs/hardening.config attempted
to disable all UBSAN sanitizers except for the array bounds one while
turning on UBSAN_TRAP. Unfortunately, it only got the syntax for
CONFIG_UBSAN_SHIFT correct, so configurations that are on by default
with CONFIG_UBSAN=y such as CONFIG_UBSAN_{BOOL,ENUM} do not get disabled
properly.

  CONFIG_ARCH_HAS_UBSAN=y
  CONFIG_UBSAN=y
  CONFIG_UBSAN_TRAP=y
  CONFIG_CC_HAS_UBSAN_BOUNDS_STRICT=y
  CONFIG_UBSAN_BOUNDS=y
  CONFIG_UBSAN_BOUNDS_STRICT=y
  # CONFIG_UBSAN_SHIFT is not set
  # CONFIG_UBSAN_DIV_ZERO is not set
  # CONFIG_UBSAN_UNREACHABLE is not set
  CONFIG_UBSAN_SIGNED_WRAP=y
  CONFIG_UBSAN_BOOL=y
  CONFIG_UBSAN_ENUM=y
  # CONFIG_TEST_UBSAN is not set

Add the missing 'is not set' to each configuration that needs it so that
they get disabled as intended.

  CONFIG_ARCH_HAS_UBSAN=y
  CONFIG_UBSAN=y
  CONFIG_UBSAN_TRAP=y
  CONFIG_CC_HAS_UBSAN_BOUNDS_STRICT=y
  CONFIG_UBSAN_BOUNDS=y
  CONFIG_UBSAN_BOUNDS_STRICT=y
  # CONFIG_UBSAN_SHIFT is not set
  # CONFIG_UBSAN_DIV_ZERO is not set
  # CONFIG_UBSAN_UNREACHABLE is not set
  CONFIG_UBSAN_SIGNED_WRAP=y
  # CONFIG_UBSAN_BOOL is not set
  # CONFIG_UBSAN_ENUM is not set
  # CONFIG_TEST_UBSAN is not set

Fixes: 215199e3d9 ("hardening: Provide Kconfig fragments for basic options")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240411-fix-ubsan-in-hardening-config-v1-1-e0177c80ffaa@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2024-04-15 11:08:24 -07:00
Linus Torvalds 27fd80851d Misc x86 fixes:
- Follow up fixes for the BHI mitigations code.
 
  - Fix !SPECULATION_MITIGATIONS bug not turning off
    mitigations as expected.
 
  - Work around an APIC emulation bug when the kernel is built with
    Clang and run as a SEV guest.
 
  - Follow up x86 topology fixes.
 
 Note that there's minor cleanups included in the BHI fixes,
 which we'd normally delay to the next merge window, but the
 BHI mitigations code is new and will be backported widely,
 so we thought it would be better to have a unified codebase
 at this stage. (Let me know if that assumption is wrong and
 I'll rebase it.)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYbmmkRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iArg//R2VYSVgAfUf4cN3rTm8RuX8U/8V4A1C/
 FklDWziEWTw8DhQCidiIt7ETuMjvOf7HFzVTfLP+oOiBxHGMCEjxtf74mHRaeZ+f
 avPvwOfp3mrIdn/h2+t7pSYxZDf310EDvarliIXq0z2hmNjmQjKXo3dyWNiWJe9i
 LFzST8dRyR0Tg4MjLuY2g9ZEauRHpY6aAVk9UrRi7uaiLyoHnLBASn1DCL1pmMhT
 cqKfHhilkeUMYwTXbDTu1iIQwBHqcpCmUp2h6VPpxPkTxJXwzo0E4lVuFVu7tzUM
 yrMZrNxhtC5Cg9NVF56AqZocKjutlJfWXnmuvpc+dM7z1dF/EwFbx2ZM+3PDuw8Z
 uODs6bVzYlhzWxTMb9Obsp3RvHe1B7ZCFCZ8uo3G9lMFXqu047UwfuwqnZw4YpX6
 CDEKz3zLbV4s64HPlvbju3CpX+m9CXhg5cR6HfCJiwYCytb1bxZU0ottnPnYxVCb
 9Th7wi2f1sCZvtQ/T8aZpF7FhYe7abt/CDvlDoDxiRg1f+Z2nduznfDJH81nn1KS
 duZEicG0O+iYi9OCPVBTVutRxSWA6D8D4F0VN7cDRr4QHyXpOFJ1KsZyHBbZo9I6
 ubp9RiFNaocgdWiWVgEpvvmlYpVPDnR38oVHNPpNAyIYHu3mlCoS76znZTNX8WfJ
 GlvHqOp+EM0=
 =2ig7
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 fixes from Ingo Molnar:

 - Follow up fixes for the BHI mitigations code

 - Fix !SPECULATION_MITIGATIONS bug not turning off mitigations as
   expected

 - Work around an APIC emulation bug when the kernel is built with Clang
   and run as a SEV guest

 - Follow up x86 topology fixes

* tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu/amd: Move TOPOEXT enablement into the topology parser
  x86/cpu/amd: Make the NODEID_MSR union actually work
  x86/cpu/amd: Make the CPUID 0x80000008 parser correct
  x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI
  x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto
  x86/bugs: Clarify that syscall hardening isn't a BHI mitigation
  x86/bugs: Fix BHI handling of RRSBA
  x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr'
  x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES
  x86/bugs: Fix BHI documentation
  x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n
  x86/topology: Don't update cpu_possible_map in topo_set_cpuids()
  x86/bugs: Fix return type of spectre_bhi_state()
  x86/apic: Force native_apic_mem_read() to use the MOV instruction
2024-04-14 10:48:51 -07:00
Linus Torvalds c748fc3b1f Misc timer fixes:
- Address a (valid) W=1 build warning
 
  - Fix timer self-tests
 
  - Annotate a KCSAN warning wrt. accesses to the
    tick_do_timer_cpu global variable.
 
  - Address a !CONFIG_BUG build warning
 
 Heads up for the !CONFIG_BUG warning patch, which we
 addressed with:
 
    5284984a4f bug: Fix no-return-statement warning with !CONFIG_BUG
 
 Not everyone agreed though, see:
 
   https://lore.kernel.org/all/20240410153212.127477-1-adrian.hunter@intel.com
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYblgkRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jYWg//eNeJkdzJVdbj6g4n2t3WDDuX7dxuRqdG
 AQHJdZctG+kNZBp+U2Zvbb8BDZfDRSQDBDfQI0ck3xG314pzXzNg92YMJB95r/Zf
 aRcxMSFc3a2dN3vW97UDKquPuCarCPsZQvbQKmZ55OmgW6ZRhhsjed0f18Nq63xR
 oWrQ0rotNhMJ98dpSOfPqrMoCXza78P/7nA49LxVIQcuDb+dtyqVTuAbENOOkFYq
 nqAkvuieZGzLb4nKH2d1rK4agYuXwnMLJ71MOcCNWFp8njuRRx+Yc+3gyoNl7e9E
 ipd6DcelOEl/DaYRao9rRy3ij0veJoUvshKZBTEWPw9FQU24odwqX4p/Mj2vF1iN
 KExtF+S7LBxdJAdivHyuPtt9B0rKRmgIp/Q8Ytgzuxu9rZ3LNev+7l80qDOIM8MF
 Mozv6JsJN2sVOMWvnzF9B1WNjVSikcyuvd2JRPbQYh1zy8aCpFHhZY+LcvK3vYBQ
 qdzY8o5dmIW0JrtHZw4H7tqKByUKEbJMsslPefD9qNIq5bpAUgHi7HFOMTU0kOvx
 2rFDnC6cJk39CXyJrpLMyKDqZzDTHGV/J4nV7/L7vQzy3iIOcfVcszfGESaM/txk
 6cgdncf9pr8aOE34A6/5Kr4L45vgh7B6YGc4oqHpdlvFLR0ve0gi+BIjNja8Jy7C
 IwGsS2uloCA=
 =oEym
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Ingo Molnar:

 - Address a (valid) W=1 build warning

 - Fix timer self-tests

 - Annotate a KCSAN warning wrt. accesses to the tick_do_timer_cpu
   global variable

 - Address a !CONFIG_BUG build warning

* tag 'timers-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  selftests: kselftest: Fix build failure with NOLIBC
  selftests: timers: Fix abs() warning in posix_timers test
  selftests: kselftest: Mark functions that unconditionally call exit() as __noreturn
  selftests: timers: Fix posix_timers ksft_print_msg() warning
  selftests: timers: Fix valid-adjtimex signed left-shift undefined behavior
  bug: Fix no-return-statement warning with !CONFIG_BUG
  timekeeping: Use READ/WRITE_ONCE() for tick_do_timer_cpu
  selftests/timers/posix_timers: Reimplement check_timer_distribution()
  irqflags: Explicitly ignore lockdep_hrtimer_exit() argument
2024-04-14 10:32:22 -07:00
Linus Torvalds ddd7ad5cf1 dma-mapping fixes for Linux 6.9
- fix up swiotlb buffer padding even more (Petr Tesarik)
  - fix for partial dma_sync on swiotlb (Michael Kelley)
  - swiotlb debugfs fix (Dexuan Cui)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmYbZMsLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYMzpw/9EA2Hfo0QRokyvYzHA9ks1JeMZMeV5fzJEb/NYb+V
 6cX2U0jRsdRqsZZXKl9c/oXwhiGLnQtuA9pIsMc18hqZooWxzoFWuc438UUihfvk
 6BVmciBl13gm0Addd5ajDVV7tPLWGmFY/rSNyD0yhPRlRy4PmVYMtaWpX+ec0jnW
 YUTSI+UTjknl7H5TXVDg/JmyHB7xTjGCAI75kMgxNK4PCX5oF5uqVpXZcF1YH5sj
 STQuUrLU/jE54G6X3sM44GJntDWq3RBRI2pTsgQC70MAdrrPm2g60BorccA8qx38
 Kz5CII8KuUvkIOtvncNmEzlGlfwflx91nF5ItQtJNobrd5GMSNtzaSLDR0LyfzLF
 w+awHYZyReEwE5oYcN8VQQR4mOZnqi3VRK+vTlTjcHXyXY3k0LdW+Jxzd6TkAeqr
 49ECkVHpw4QaQBKH6RWkED1Xd0CDAABx2HJ2QkHYEwwLL/KPvaehExY9LjDpQut4
 qJaAqLPKzxvAJdFjvH4HlJDhpw1BH0Pj3H6i4y0UgEzOKReMGun1FoqCseSJhhGc
 laFtFrIpooLqOD/h0e21XpVfEcRDkLwEjAtpkZXrfBZenZF8rFDm+vTOpBiyAK2a
 2KiU0+z/f1Flx5E0gLtxn5yPsWo2cuIjKfsVSewIwjrh6p/XFBF+iQi4dwISFxKz
 JrE=
 =U10z
 -----END PGP SIGNATURE-----

Merge tag 'dma-maping-6.9-2024-04-14' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:

 - fix up swiotlb buffer padding even more (Petr Tesarik)

 - fix for partial dma_sync on swiotlb (Michael Kelley)

 - swiotlb debugfs fix (Dexuan Cui)

* tag 'dma-maping-6.9-2024-04-14' of git://git.infradead.org/users/hch/dma-mapping:
  swiotlb: do not set total_used to 0 in swiotlb_create_debugfs_files()
  swiotlb: fix swiotlb_bounce() to do partial sync's correctly
  swiotlb: extend buffer pre-padding to alloc_align_mask if necessary
2024-04-14 10:02:40 -07:00
Anton Protopopov 37eacb9f6e bpf: Fix a verifier verbose message
Long ago a map file descriptor in a pseudo ldimm64 instruction could
only be present as an immediate value insn[0].imm, and thus this value
was used in a verbose verifier message printed when the file descriptor
wasn't valid. Since addition of BPF_PSEUDO_MAP_IDX_VALUE/BPF_PSEUDO_MAP_IDX
the insn[0].imm field can also contain an index pointing to the file
descriptor in the attr.fd_array array. However, if the file descriptor
is invalid, the verifier still prints the verbose message containing
value of insn[0].imm. Patch the verifier message to always print the
actual file descriptor value.

Fixes: 387544bfa2 ("bpf: Introduce fd_idx")
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240412141100.3562942-1-aspsk@isovalent.com
2024-04-12 18:37:20 +02:00
Linus Torvalds 5939d45155 Tracing fixes for 6.9:
- Fix the buffer_percent accounting as it is dependent on three variables:
   1) pages_read - number of subbuffers read
   2) pages_lost - number of subbuffers lost due to overwrite
   3) pages_touched - number of pages that a writer entered
   These three counters only increment, and to know how many active pages
   there are on the buffer at any given time, the pages_read and
   pages_lost are subtracted from pages_touched. But the pages touched
   was incremented whenever any writer went to the next subbuffer even
   if it wasn't the only one, so it was incremented more than it should
   be causing the counter for how many subbuffers currently have content
   incorrect, which caused the buffer_percent that holds waiters until
   the ring buffer is filled to a given percentage to wake up early.
 
 - Fix warning of unused functions when PERF_EVENTS is not configured in
 
 - Replace bad tab with space in Kconfig for FTRACE_RECORD_RECURSION_SIZE
 
 - Fix to some kerneldoc function comments in eventfs code.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZhk+khQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qvs0AP98c226UFU6Dha4wvgSulC/wKVvHN3X
 jeclMTdn8RGs2gD/b9OULKNv1//6fP16ZRun7ntRQkotVhlNhf9Ee0smiwU=
 =UYrk
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix the buffer_percent accounting as it is dependent on three
   variables:

     1) pages_read - number of subbuffers read
     2) pages_lost - number of subbuffers lost due to overwrite
     3) pages_touched - number of pages that a writer entered

   These three counters only increment, and to know how many active
   pages there are on the buffer at any given time, the pages_read and
   pages_lost are subtracted from pages_touched.

   But the pages touched was incremented whenever any writer went to the
   next subbuffer even if it wasn't the only one, so it was incremented
   more than it should be causing the counter for how many subbuffers
   currently have content incorrect, which caused the buffer_percent
   that holds waiters until the ring buffer is filled to a given
   percentage to wake up early.

 - Fix warning of unused functions when PERF_EVENTS is not configured in

 - Replace bad tab with space in Kconfig for FTRACE_RECORD_RECURSION_SIZE

 - Fix to some kerneldoc function comments in eventfs code.

* tag 'trace-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Only update pages_touched when a new page is touched
  tracing: hide unused ftrace_event_id_fops
  tracing: Fix FTRACE_RECORD_RECURSION_SIZE Kconfig entry
  eventfs: Fix kernel-doc comments to functions
2024-04-12 09:02:24 -07:00
Steven Rostedt (Google) ffe3986fec ring-buffer: Only update pages_touched when a new page is touched
The "buffer_percent" logic that is used by the ring buffer splice code to
only wake up the tasks when there's no data after the buffer is filled to
the percentage of the "buffer_percent" file is dependent on three
variables that determine the amount of data that is in the ring buffer:

 1) pages_read - incremented whenever a new sub-buffer is consumed
 2) pages_lost - incremented every time a writer overwrites a sub-buffer
 3) pages_touched - incremented when a write goes to a new sub-buffer

The percentage is the calculation of:

  (pages_touched - (pages_lost + pages_read)) / nr_pages

Basically, the amount of data is the total number of sub-bufs that have been
touched, minus the number of sub-bufs lost and sub-bufs consumed. This is
divided by the total count to give the buffer percentage. When the
percentage is greater than the value in the "buffer_percent" file, it
wakes up splice readers waiting for that amount.

It was observed that over time, the amount read from the splice was
constantly decreasing the longer the trace was running. That is, if one
asked for 60%, it would read over 60% when it first starts tracing, but
then it would be woken up at under 60% and would slowly decrease the
amount of data read after being woken up, where the amount becomes much
less than the buffer percent.

This was due to an accounting of the pages_touched incrementation. This
value is incremented whenever a writer transfers to a new sub-buffer. But
the place where it was incremented was incorrect. If a writer overflowed
the current sub-buffer it would go to the next one. If it gets preempted
by an interrupt at that time, and the interrupt performs a trace, it too
will end up going to the next sub-buffer. But only one should increment
the counter. Unfortunately, that was not the case.

Change the cmpxchg() that does the real switch of the tail-page into a
try_cmpxchg(), and on success, perform the increment of pages_touched. This
will only increment the counter once for when the writer moves to a new
sub-buffer, and not when there's a race and is incremented for when a
writer and its preempting writer both move to the same new sub-buffer.

Link: https://lore.kernel.org/linux-trace-kernel/20240409151309.0d0e5056@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 2c2b0a78b3 ("ring-buffer: Add percentage of ring buffer full to wake up reader")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-04-11 17:49:57 -04:00
Arnd Bergmann 5281ec8345 tracing: hide unused ftrace_event_id_fops
When CONFIG_PERF_EVENTS, a 'make W=1' build produces a warning about the
unused ftrace_event_id_fops variable:

kernel/trace/trace_events.c:2155:37: error: 'ftrace_event_id_fops' defined but not used [-Werror=unused-const-variable=]
 2155 | static const struct file_operations ftrace_event_id_fops = {

Hide this in the same #ifdef as the reference to it.

Link: https://lore.kernel.org/linux-trace-kernel/20240403080702.3509288-7-arnd@kernel.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ajay Kaher <akaher@vmware.com>
Cc: Jinjie Ruan <ruanjinjie@huawei.com>
Cc: Clément Léger <cleger@rivosinc.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com>
Fixes: 620a30e97f ("tracing: Don't pass file_operations array to event_create_dir()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-04-11 17:46:55 -04:00
Prasad Pandit d96c36004e tracing: Fix FTRACE_RECORD_RECURSION_SIZE Kconfig entry
Fix FTRACE_RECORD_RECURSION_SIZE entry, replace tab with
a space character. It helps Kconfig parsers to read file
without error.

Link: https://lore.kernel.org/linux-trace-kernel/20240322121801.1803948-1-ppandit@redhat.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 773c167050 ("ftrace: Add recording of functions that caused recursion")
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-04-11 17:45:18 -04:00
Linus Torvalds 136eb5fd6a Power management fix for 6.9-rc4
Fix the suspend-to-idle core code to guarantee that timers queued on
 CPUs other than the one that has first left the idle state, which should
 expire directly after resume, will be handled (Anna-Maria Behnsen).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmYYIvYSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxq5oP/RS98YTjjrhqieQ2mfKN+thVpgTEeYBa
 C4NoRbQf0x8kyhHbqEmjEdxlIUUmLDpolWMxFtlYdhTSZB7+k0z9GL3fZfTKRTVq
 8lYSuxEzyicylve+b+gVcQ6m61kaV3M2jwP23Myaf5MOIx2OvQy7SnNS02p3yfJ8
 NhBTqqTXbupWnnRuSPZvejnZaU/3wW1UvVmbkBUNwhR0O+6J0bxC5fNrvgip//kG
 Iyx04j88787nwS9rUL68A9xSA9WVRuLNZRMb7RE2mZ9YVR6qLhxhUxfA8WrKignl
 SxFPPmzGNkWdwtlWeIsQvcAOOhTKnWqOm7aggBfcf7jGoYBOJ7Uo4gfxvEvA1gnc
 36UTpJXc/5HQDeVKYcSx5O9VSm7/cKx4sP5jiDKT/iBPLnSXAKoAflgl1RvkchvA
 Gg0aJ37MRIvoP53OidNACkHN2VsikTMuYA4b21G+we/ib7q8kIbdI9yqDF4aPcCU
 vV0rMlSGSfM5+PGn8fDei2sbZ4E6Mk3XRwzF386xJDIyrvnqr6hhmVxDmWOx7oEb
 Jpv80971cbr1lmtP2SKFVdsRxElhRWfXk3OwLOGRbLbQI1BP+SAWeyICAYUanPJI
 W6vmJwDqylCUFRy4mhufDe7fceXHnvFUYMhi0XWkGvCgWEgLffKHTzFNup12hU1Y
 65Rcv19bRP8h
 =8DDE
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fix from Rafael Wysocki:
 "Fix the suspend-to-idle core code to guarantee that timers queued on
  CPUs other than the one that has first left the idle state, which
  should expire directly after resume, will be handled (Anna-Maria
  Behnsen)"

* tag 'pm-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: s2idle: Make sure CPUs will wakeup directly on resume
2024-04-11 12:00:25 -07:00
Zheng Yejian 325f3fb551 kprobes: Fix possible use-after-free issue on kprobe registration
When unloading a module, its state is changing MODULE_STATE_LIVE ->
 MODULE_STATE_GOING -> MODULE_STATE_UNFORMED. Each change will take
a time. `is_module_text_address()` and `__module_text_address()`
works with MODULE_STATE_LIVE and MODULE_STATE_GOING.
If we use `is_module_text_address()` and `__module_text_address()`
separately, there is a chance that the first one is succeeded but the
next one is failed because module->state becomes MODULE_STATE_UNFORMED
between those operations.

In `check_kprobe_address_safe()`, if the second `__module_text_address()`
is failed, that is ignored because it expected a kernel_text address.
But it may have failed simply because module->state has been changed
to MODULE_STATE_UNFORMED. In this case, arm_kprobe() will try to modify
non-exist module text address (use-after-free).

To fix this problem, we should not use separated `is_module_text_address()`
and `__module_text_address()`, but use only `__module_text_address()`
once and do `try_module_get(module)` which is only available with
MODULE_STATE_LIVE.

Link: https://lore.kernel.org/all/20240410015802.265220-1-zhengyejian1@huawei.com/

Fixes: 28f6c37a29 ("kprobes: Forbid probing on trampoline and BPF code areas")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-04-10 23:35:51 +09:00
Sean Christopherson f337a6a21e x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n
Initialize cpu_mitigations to CPU_MITIGATIONS_OFF if the kernel is built
with CONFIG_SPECULATION_MITIGATIONS=n, as the help text quite clearly
states that disabling SPECULATION_MITIGATIONS is supposed to turn off all
mitigations by default.

  │ If you say N, all mitigations will be disabled. You really
  │ should know what you are doing to say so.

As is, the kernel still defaults to CPU_MITIGATIONS_AUTO, which results in
some mitigations being enabled in spite of SPECULATION_MITIGATIONS=n.

Fixes: f43b9876e8 ("x86/retbleed: Add fine grained Kconfig knobs")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Cc: stable@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240409175108.1512861-2-seanjc@google.com
2024-04-10 16:22:47 +02:00
Thomas Gleixner f87cbcb345 timekeeping: Use READ/WRITE_ONCE() for tick_do_timer_cpu
tick_do_timer_cpu is used lockless to check which CPU needs to take care
of the per tick timekeeping duty. This is done to avoid a thundering
herd problem on jiffies_lock.

The read and writes are not annotated so KCSAN complains about data races:

  BUG: KCSAN: data-race in tick_nohz_idle_stop_tick / tick_nohz_next_event

  write to 0xffffffff8a2bda30 of 4 bytes by task 0 on cpu 26:
   tick_nohz_idle_stop_tick+0x3b1/0x4a0
   do_idle+0x1e3/0x250

  read to 0xffffffff8a2bda30 of 4 bytes by task 0 on cpu 16:
   tick_nohz_next_event+0xe7/0x1e0
   tick_nohz_get_sleep_length+0xa7/0xe0
   menu_select+0x82/0xb90
   cpuidle_select+0x44/0x60
   do_idle+0x1c2/0x250

  value changed: 0x0000001a -> 0xffffffff

Annotate them with READ/WRITE_ONCE() to document the intentional data race.

Reported-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Sean Anderson <sean.anderson@seco.com>
Link: https://lore.kernel.org/r/87cyqy7rt3.ffs@tglx
2024-04-10 10:13:42 +02:00
Anna-Maria Behnsen 3c89a068bf PM: s2idle: Make sure CPUs will wakeup directly on resume
s2idle works like a regular suspend with freezing processes and freezing
devices. All CPUs except the control CPU go into idle. Once this is
completed the control CPU kicks all other CPUs out of idle, so that they
reenter the idle loop and then enter s2idle state. The control CPU then
issues an swait() on the suspend state and therefore enters the idle loop
as well.

Due to being kicked out of idle, the other CPUs leave their NOHZ states,
which means the tick is active and the corresponding hrtimer is programmed
to the next jiffie.

On entering s2idle the CPUs shut down their local clockevent device to
prevent wakeups. The last CPU which enters s2idle shuts down its local
clockevent and freezes timekeeping.

On resume, one of the CPUs receives the wakeup interrupt, unfreezes
timekeeping and its local clockevent and starts the resume process. At that
point all other CPUs are still in s2idle with their clockevents switched
off. They only resume when they are kicked by another CPU or after resuming
devices and then receiving a device interrupt.

That means there is no guarantee that all CPUs will wakeup directly on
resume. As a consequence there is no guarantee that timers which are queued
on those CPUs and should expire directly after resume, are handled. Also
timer list timers which are remotely queued to one of those CPUs after
resume will not result in a reprogramming IPI as the tick is
active. Queueing a hrtimer will also not result in a reprogramming IPI
because the first hrtimer event is already in the past.

The recent introduction of the timer pull model (7ee9887703 ("timers:
Implement the hierarchical pull model")) amplifies this problem, if the
current migrator is one of the non woken up CPUs. When a non pinned timer
list timer is queued and the queuing CPU goes idle, it relies on the still
suspended migrator CPU to expire the timer which will happen by chance.

The problem exists since commit 8d89835b04 ("PM: suspend: Do not pause
cpuidle in the suspend-to-idle path"). There the cpuidle_pause() call which
in turn invoked a wakeup for all idle CPUs was moved to a later point in
the resume process. This might not be reached or reached very late because
it waits on a timer of a still suspended CPU.

Address this by kicking all CPUs out of idle after the control CPU returns
from swait() so that they resume their timers and restore consistent system
state.

Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218641
Fixes: 8d89835b04 ("PM: suspend: Do not pause cpuidle in the suspend-to-idle path")
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mario Limonciello <mario.limonciello@amd.com>
Cc: 5.16+ <stable@kernel.org> # 5.16+
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-04-08 15:36:54 +02:00
Linus Torvalds 3520c35e5f Fix various timer bugs:
- Fix a timer migration bug that may result in missed events
  - Fix timer migration group hierarchy event updates
  - Fix a PowerPC64 build warning
  - Fix a handful of DocBook annotation bugs
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYSUpsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1h+/RAAlbYzlotBMM0cqxCng5jgetTT7EfQHXl1
 zaqhx2FzEjoyhZ++kpBP03A42LumWz0TXTqRK+BicZIHWvIWz16w7xNr0dHo3+L8
 PfPTZEPb1IwSP1FKHyzEZbVWPnHtokyJBky5Qp5IG5FoNqV1pArqeadyaSbd3hIw
 A3l77wHCtXINkxjROs5EoJiOwVcJWigm4M7189EXDUKKr5nzE0hemNAKGnluQZxj
 O5gF9vv40B38MLuo3xLDxFCrY8WDcq9yhv/AtBk+952FsceSZbH29zOt1a5l2HPb
 yvBR4pMaS6x4UdzJeZTbdqDs8v9QWsCUc+qqeNYuFEJSBu9y7Qo5wec8c+Ptiu0E
 1we/g4nWRaRnXvGyS1uj448jUZgnGu61KFbCCF+guDl94zKY6TBZfVpeWrF/Xjdr
 Jq1K8zYMM/+hxlzqsVhoaL+2zAddUeWnwPcSC5J8mnVlyLJUd55Cd0OGcHimz3PV
 QcimajOcE7e/pkw0eQnRQ6qAVeWXcJY4hWoJS9Nk8F9InfDC7I8T5NgsNVb6Edyx
 fj2wE/K9lAfKevz49ieJ8ItIIus3Lzmi09pbfDmDP5J9iMyL6UMk2VXj8XAUvCdL
 qpgigP1zcluwAFqHmaym6mUsej+VL/WqsKfy6Q8LI5yNvdYtUuzfQuqGqyOyGXX0
 zJg6+qU7OAE=
 =4VkW
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Ingo Molnar:
 "Fix various timer bugs:

   - Fix a timer migration bug that may result in missed events

   - Fix timer migration group hierarchy event updates

   - Fix a PowerPC64 build warning

   - Fix a handful of DocBook annotation bugs"

* tag 'timers-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/migration: Return early on deactivation
  timers/migration: Fix ignored event due to missing CPU update
  vdso: Use CONFIG_PAGE_SHIFT in vdso/datapage.h
  timers: Fix text inconsistencies and spelling
  tick/sched: Fix struct tick_sched doc warnings
  tick/sched: Fix various kernel-doc warnings
  timers: Fix kernel-doc format and add Return values
  time/timekeeping: Fix kernel-doc warnings and typos
  time/timecounter: Fix inline documentation
2024-04-07 09:20:50 -07:00
Anna-Maria Behnsen 7a96a84bfb timers/migration: Return early on deactivation
Commit 4b6f4c5a67 ("timer/migration: Remove buggy early return on
deactivation") removed the logic to return early in tmigr_update_events()
on deactivation. With this the problem with a not properly updated first
global event in a hierarchy containing only a single group was fixed.

But when having a look at this code path with a hierarchy with more than a
single level, now unnecessary work is done (example is partially copied
from the message of the commit mentioned above):

                            [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = T0:0i, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = 0              migrator = NONE
           active   = 0              active   = NONE
           nextevt  = T0i, T1        nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
      active         idle            idle       idle

0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2

                            [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = T0:0i, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = NONE           migrator = NONE
           active   = NONE           active   = NONE
           nextevt  = T1             nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
        idle         idle            idle         idle

1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". Without this
early return the following steps happen in tmigr_update_events() when
child = null and group = GRP0:0 :

  lock(GRP0:0->lock);
  timerqueue_del(GRP0:0, T0i);
  unlock(GRP0:0->lock);


                            [GRP1:0]
                         migrator = NONE
                         active   = NONE
                         nextevt  = T0:0, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = NONE           migrator = NONE
           active   = NONE           active   = NONE
           nextevt  = T1             nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
        idle         idle            idle         idle

2) The change now propagates up to the top. Then tmigr_update_events()
updates the group event of GRP0:0 and executes the following steps
(child = GRP0:0 and group = GRP0:0):

  lock(GRP0:0->lock);
  lock(GRP1:0->lock);
  evt = tmigr_next_groupevt(GRP0:0); -> this removes the ignored events
					in GRP0:0
  ... update GRP1:0 group event and timerqueue ...
  unlock(GRP1:0->lock);
  unlock(GRP0:0->lock);

So the dance in 1) with locking the GRP0:0->lock and removing the T0i from
the timerqueue is redundand as this is done nevertheless in 2) when
tmigr_next_groupevt(GRP0:0) is executed.

Revert commit 4b6f4c5a67 ("timer/migration: Remove buggy early return on
deactivation") and add a condition into return path to skip the return
only, when hierarchy contains a single group. Adapt comments accordingly.

Fixes: 4b6f4c5a67 ("timer/migration: Remove buggy early return on deactivation")
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/87cyr49on2.fsf@somnus
2024-04-05 11:05:16 +02:00
Frederic Weisbecker 61f7fdf8fd timers/migration: Fix ignored event due to missing CPU update
When a group event is updated with its expiry unchanged but a different
CPU, that target change may go unnoticed and the event may be propagated
up with a stale CPU value. The following depicts a scenario that has
been actually observed:

                       [GRP2:0]
                   migrator = GRP1:1
                   active   = GRP1:1
                   nextevt  = TGRP1:0 (T0)
                    /              \
               [GRP1:0]           [GRP1:1]
            migrator = NONE       [...]
            active   = NONE
            nextevt  = TGRP0:0 (T0)
            /           \
        [GRP0:0]       [...]
      migrator = NONE
      active   = NONE
      nextevt  = T0
      /         \
    0 (T0)       1 (T1)
    idle         idle

0) The hierarchy has 3 levels. The left part (GRP1:0) is all idle,
including CPU 0 and CPU 1 which have a timer each: T0 and T1. They have
the same expiry value.

                       [GRP2:0]
                   migrator = GRP1:1
                   active   = GRP1:1
                   nextevt  = KTIME_MAX
                    /              \
               [GRP1:0]           [GRP1:1]
            migrator = NONE       [...]
            active   = NONE
            nextevt  = TGRP0:0 (T0)
            /           \
        [GRP0:0]       [...]
      migrator = NONE
      active   = NONE
      nextevt  = T0
      /         \
    0 (T0)       1 (T1)
    idle         idle

1) The migrator in GRP1:1 handles remotely T0. The event is dequeued
from the top and T0 executed.

                       [GRP2:0]
                   migrator = GRP1:1
                   active   = GRP1:1
                   nextevt  = KTIME_MAX
                    /              \
               [GRP1:0]           [GRP1:1]
            migrator = NONE       [...]
            active   = NONE
            nextevt  = TGRP0:0 (T0)
            /           \
        [GRP0:0]       [...]
      migrator = NONE
      active   = NONE
      nextevt  = T1
      /         \
    0            1 (T1)
    idle         idle

2) The migrator in GRP1:1 fetches the next timer for CPU 0 and finds
none. But it updates the events from its groups, starting with GRP0:0
which now has T1 as its next event. So far so good.

                       [GRP2:0]
                   migrator = GRP1:1
                   active   = GRP1:1
                   nextevt  = KTIME_MAX
                    /              \
               [GRP1:0]           [GRP1:1]
            migrator = NONE       [...]
            active   = NONE
            nextevt  = TGRP0:0 (T0)
            /           \
        [GRP0:0]       [...]
      migrator = NONE
      active   = NONE
      nextevt  = T1
      /         \
    0            1 (T1)
    idle         idle

3) The migrator in GRP1:1 proceeds upward and updates the events in
GRP1:0. The child event TGRP0:0 is found queued with the same expiry
as before. And therefore it is left unchanged. However the target CPU
is not the same but that fact is ignored so TGRP0:0 still points to
CPU 0 when it should point to CPU 1.

                       [GRP2:0]
                   migrator = GRP1:1
                   active   = GRP1:1
                   nextevt  = TGRP1:0 (T0)
                    /              \
               [GRP1:0]           [GRP1:1]
            migrator = NONE       [...]
            active   = NONE
            nextevt  = TGRP0:0 (T0)
            /           \
        [GRP0:0]       [...]
      migrator = NONE
      active   = NONE
      nextevt  = T1
      /         \
    0            1 (T1)
    idle         idle

4) The propagation has reached the top level and TGRP1:0, having TGRP0:0
as its first event, also wrongly points to CPU 0. TGRP1:0 is added to
the top level group.

                       [GRP2:0]
                   migrator = GRP1:1
                   active   = GRP1:1
                   nextevt  = KTIME_MAX
                    /              \
               [GRP1:0]           [GRP1:1]
            migrator = NONE       [...]
            active   = NONE
            nextevt  = TGRP0:0 (T0)
            /           \
        [GRP0:0]       [...]
      migrator = NONE
      active   = NONE
      nextevt  = T1
      /         \
    0            1 (T1)
    idle         idle

5) The migrator in GRP1:1 dequeues the next event in top level pointing
to CPU 0. But since it actually doesn't see any real event in CPU 0, it
early returns.

6) T1 is left unhandled until either CPU 0 or CPU 1 wake up.

Some other bad scenario may involve trees with just two levels.

Fix this with unconditionally updating the CPU of the child event before
considering to early return while updating a queued event with an
unchanged expiry value.

Fixes: 7ee9887703 ("timers: Implement the hierarchical pull model")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/r/Zg2Ct6M2RJAYHgCB@localhost.localdomain
2024-04-05 11:05:16 +02:00