Commit Graph

37802 Commits

Author SHA1 Message Date
Davidlohr Bueso 7f8ca0edfe kernel/sys.c: only take tasklist_lock for get/setpriority(PRIO_PGRP)
PRIO_PGRP needs the tasklist_lock mainly to serialize vs setpgid(2), to
protect against any concurrent change_pid(PIDTYPE_PGID) that can move
the task from one hlist to another while iterating.

However, the remaining can only rely only on RCU:

PRIO_PROCESS only does the task lookup and never iterates over tasklist
and we already have an rcu-aware stable pointer.

PRIO_USER is already racy vs setuid(2) so with creds being rcu
protected, we can end up seeing stale data.  When removing the
tasklist_lock there can be a race with (i) fork but this is benign as
the child's nice is inherited and the new task is not observable by the
user yet either, hence the return semantics do not differ.  And (ii) a
race with exit, which is a small window and can cause us to miss a task
which was removed from the list and it had the highest nice.

Similarly change the buggy do_each_thread/while_each_thread combo in
PRIO_USER for the rcu-safe for_each_process_thread flavor, which doesn't
make use of next_thread/p->thread_group.

[akpm@linux-foundation.org: coding style fixes]

Link: https://lkml.kernel.org/r/20211210182250.43734-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-20 08:52:53 +02:00
Yafang Shao d6986ce24f kthread: dynamically allocate memory to store kthread's full name
When I was implementing a new per-cpu kthread cfs_migration, I found the
comm of it "cfs_migration/%u" is truncated due to the limitation of
TASK_COMM_LEN.  For example, the comm of the percpu thread on CPU10~19
all have the same name "cfs_migration/1", which will confuse the user.
This issue is not critical, because we can get the corresponding CPU
from the task's Cpus_allowed.  But for kthreads corresponding to other
hardware devices, it is not easy to get the detailed device info from
task comm, for example,

    jbd2/nvme0n1p2-
    xfs-reclaim/sdf

Currently there are so many truncated kthreads:

    rcu_tasks_kthre
    rcu_tasks_rude_
    rcu_tasks_trace
    poll_mpt3sas0_s
    ext4-rsv-conver
    xfs-reclaim/sd{a, b, c, ...}
    xfs-blockgc/sd{a, b, c, ...}
    xfs-inodegc/sd{a, b, c, ...}
    audit_send_repl
    ecryptfs-kthrea
    vfio-irqfd-clea
    jbd2/nvme0n1p2-
    ...

We can shorten these names to work around this problem, but it may be
not applied to all of the truncated kthreads.  Take 'jbd2/nvme0n1p2-'
for example, it is a nice name, and it is not a good idea to shorten it.

One possible way to fix this issue is extending the task comm size, but
as task->comm is used in lots of places, that may cause some potential
buffer overflows.  Another more conservative approach is introducing a
new pointer to store kthread's full name if it is truncated, which won't
introduce too much overhead as it is in the non-critical path.  Finally
we make a dicision to use the second approach.  See also the discussions
in this thread:
https://lore.kernel.org/lkml/20211101060419.4682-1-laoar.shao@gmail.com/

After this change, the full name of these truncated kthreads will be
displayed via /proc/[pid]/comm:

    rcu_tasks_kthread
    rcu_tasks_rude_kthread
    rcu_tasks_trace_kthread
    poll_mpt3sas0_statu
    ext4-rsv-conversion
    xfs-reclaim/sdf1
    xfs-blockgc/sdf1
    xfs-inodegc/sdf1
    audit_send_reply
    ecryptfs-kthread
    vfio-irqfd-cleanup
    jbd2/nvme0n1p2-8

Link: https://lkml.kernel.org/r/20211120112850.46047-1-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Suggested-by: Petr Mladek <pmladek@suse.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Arnaldo Carvalho de Melo <arnaldo.melo@gmail.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-20 08:52:53 +02:00
Linus Torvalds d1587f7bfe Merge branch 'for-5.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "This contains the cgroup.procs permission check fixes so that they use
  the credentials at the time of open rather than write, which also
  fixes the cgroup namespace lifetime bug"

* 'for-5.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  selftests: cgroup: Test open-time cgroup namespace usage for migration checks
  selftests: cgroup: Test open-time credential usage for migration checks
  selftests: cgroup: Make cg_create() use 0755 for permission instead of 0644
  cgroup: Use open-time cgroup namespace for process migration perm checks
  cgroup: Allocate cgroup_file_ctx for kernfs_open_file->priv
  cgroup: Use open-time credentials for process migraton perm checks
2022-01-07 15:58:06 -08:00
Linus Torvalds b2b436ec02 Three minor tracing fixes:
- Fix missing prototypes in sample module for direct functions
 
 - Fix check of valid buffer in get_trace_buf()
 
 - Fix annotations of percpu pointers.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYddVnBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qg2PAQDVhSODIERza+YwP4AkMYBLWukngdi4
 2fvFOJa1qdGQ1AD/YMSsJzbqfUk5YL9LNElL37TFH0fyWzU85tXRHVwf4As=
 =KKJx
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.16-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Three minor tracing fixes:

   - Fix missing prototypes in sample module for direct functions

   - Fix check of valid buffer in get_trace_buf()

   - Fix annotations of percpu pointers"

* tag 'trace-v5.16-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Tag trace_percpu_buffer as a percpu pointer
  tracing: Fix check for trace_percpu_buffer validity in get_trace_buf()
  ftrace/samples: Add missing prototypes direct functions
2022-01-06 15:00:43 -08:00
Tejun Heo e574576416 cgroup: Use open-time cgroup namespace for process migration perm checks
cgroup process migration permission checks are performed at write time as
whether a given operation is allowed or not is dependent on the content of
the write - the PID. This currently uses current's cgroup namespace which is
a potential security weakness as it may allow scenarios where a less
privileged process tricks a more privileged one into writing into a fd that
it created.

This patch makes cgroup remember the cgroup namespace at the time of open
and uses it for migration permission checks instad of current's. Note that
this only applies to cgroup2 as cgroup1 doesn't have namespace support.

This also fixes a use-after-free bug on cgroupns reported in

 https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com

Note that backporting this fix also requires the preceding patch.

Reported-by: "Eric W. Biederman" <ebiederm@xmission.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reported-by: syzbot+50f5cf33a284ce738b62@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com
Fixes: 5136f6365c ("cgroup: implement "nsdelegate" mount option")
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-06 11:02:29 -10:00
Tejun Heo 0d2b5955b3 cgroup: Allocate cgroup_file_ctx for kernfs_open_file->priv
of->priv is currently used by each interface file implementation to store
private information. This patch collects the current two private data usages
into struct cgroup_file_ctx which is allocated and freed by the common path.
This allows generic private data which applies to multiple files, which will
be used to in the following patch.

Note that cgroup_procs iterator is now embedded as procs.iter in the new
cgroup_file_ctx so that it doesn't need to be allocated and freed
separately.

v2: union dropped from cgroup_file_ctx and the procs iterator is embedded in
    cgroup_file_ctx as suggested by Linus.

v3: Michal pointed out that cgroup1's procs pidlist uses of->priv too.
    Converted. Didn't change to embedded allocation as cgroup1 pidlists get
    stored for caching.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
2022-01-06 11:02:29 -10:00
Tejun Heo 1756d7994a cgroup: Use open-time credentials for process migraton perm checks
cgroup process migration permission checks are performed at write time as
whether a given operation is allowed or not is dependent on the content of
the write - the PID. This currently uses current's credentials which is a
potential security weakness as it may allow scenarios where a less
privileged process tricks a more privileged one into writing into a fd that
it created.

This patch makes both cgroup2 and cgroup1 process migration interfaces to
use the credentials saved at the time of open (file->f_cred) instead of
current's.

Reported-by: "Eric W. Biederman" <ebiederm@xmission.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Fixes: 187fe84067 ("cgroup: require write perm on common ancestor when moving processes on the default hierarchy")
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-06 11:02:28 -10:00
Naveen N. Rao f28439db47 tracing: Tag trace_percpu_buffer as a percpu pointer
Tag trace_percpu_buffer as a percpu pointer to resolve warnings
reported by sparse:
  /linux/kernel/trace/trace.c:3218:46: warning: incorrect type in initializer (different address spaces)
  /linux/kernel/trace/trace.c:3218:46:    expected void const [noderef] __percpu *__vpp_verify
  /linux/kernel/trace/trace.c:3218:46:    got struct trace_buffer_struct *
  /linux/kernel/trace/trace.c:3234:9: warning: incorrect type in initializer (different address spaces)
  /linux/kernel/trace/trace.c:3234:9:    expected void const [noderef] __percpu *__vpp_verify
  /linux/kernel/trace/trace.c:3234:9:    got int *

Link: https://lkml.kernel.org/r/ebabd3f23101d89cb75671b68b6f819f5edc830b.1640255304.git.naveen.n.rao@linux.vnet.ibm.com

Cc: stable@vger.kernel.org
Reported-by: kernel test robot <lkp@intel.com>
Fixes: 07d777fe8c ("tracing: Add percpu buffers for trace_printk()")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2022-01-05 18:53:49 -05:00
Naveen N. Rao 823e670f7e tracing: Fix check for trace_percpu_buffer validity in get_trace_buf()
With the new osnoise tracer, we are seeing the below splat:
    Kernel attempted to read user page (c7d880000) - exploit attempt? (uid: 0)
    BUG: Unable to handle kernel data access on read at 0xc7d880000
    Faulting instruction address: 0xc0000000002ffa10
    Oops: Kernel access of bad area, sig: 11 [#1]
    LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
    ...
    NIP [c0000000002ffa10] __trace_array_vprintk.part.0+0x70/0x2f0
    LR [c0000000002ff9fc] __trace_array_vprintk.part.0+0x5c/0x2f0
    Call Trace:
    [c0000008bdd73b80] [c0000000001c49cc] put_prev_task_fair+0x3c/0x60 (unreliable)
    [c0000008bdd73be0] [c000000000301430] trace_array_printk_buf+0x70/0x90
    [c0000008bdd73c00] [c0000000003178b0] trace_sched_switch_callback+0x250/0x290
    [c0000008bdd73c90] [c000000000e70d60] __schedule+0x410/0x710
    [c0000008bdd73d40] [c000000000e710c0] schedule+0x60/0x130
    [c0000008bdd73d70] [c000000000030614] interrupt_exit_user_prepare_main+0x264/0x270
    [c0000008bdd73de0] [c000000000030a70] syscall_exit_prepare+0x150/0x180
    [c0000008bdd73e10] [c00000000000c174] system_call_vectored_common+0xf4/0x278

osnoise tracer on ppc64le is triggering osnoise_taint() for negative
duration in get_int_safe_duration() called from
trace_sched_switch_callback()->thread_exit().

The problem though is that the check for a valid trace_percpu_buffer is
incorrect in get_trace_buf(). The check is being done after calculating
the pointer for the current cpu, rather than on the main percpu pointer.
Fix the check to be against trace_percpu_buffer.

Link: https://lkml.kernel.org/r/a920e4272e0b0635cf20c444707cbce1b2c8973d.1640255304.git.naveen.n.rao@linux.vnet.ibm.com

Cc: stable@vger.kernel.org
Fixes: e2ace00117 ("tracing: Choose static tp_printk buffer by explicit nesting count")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2022-01-05 18:51:25 -05:00
Linus Torvalds d0cc67b278 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "9 patches.

  Subsystems affected by this patch series: mm (kfence, mempolicy,
  memory-failure, pagemap, pagealloc, damon, and memory-failure),
  core-kernel, and MAINTAINERS"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm/hwpoison: clear MF_COUNT_INCREASED before retrying get_any_page()
  mm/damon/dbgfs: protect targets destructions with kdamond_lock
  mm/page_alloc: fix __alloc_size attribute for alloc_pages_exact_nid
  mm: delete unsafe BUG from page_cache_add_speculative()
  mm, hwpoison: fix condition in free hugetlb page path
  MAINTAINERS: mark more list instances as moderated
  kernel/crash_core: suppress unknown crashkernel parameter warning
  mm: mempolicy: fix THP allocations escaping mempolicy restrictions
  kfence: fix memory leak when cat kfence objects
2021-12-25 12:30:03 -08:00
Philipp Rudo 71d2bcec2d kernel/crash_core: suppress unknown crashkernel parameter warning
When booting with crashkernel= on the kernel command line a warning
similar to

    Kernel command line: ro console=ttyS0 crashkernel=256M
    Unknown kernel command line parameters "crashkernel=256M", will be passed to user space.

is printed.

This comes from crashkernel= being parsed independent from the kernel
parameter handling mechanism.  So the code in init/main.c doesn't know
that crashkernel= is a valid kernel parameter and prints this incorrect
warning.

Suppress the warning by adding a dummy early_param handler for
crashkernel=.

Link: https://lkml.kernel.org/r/20211208133443.6867-1-prudo@redhat.com
Fixes: 86d1919a4f ("init: print out unknown kernel parameters")
Signed-off-by: Philipp Rudo <prudo@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-25 12:20:55 -08:00
Linus Torvalds 7fe2bc1b64 Merge branch 'ucount-rlimit-fixes-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ucount fix from Eric Biederman:
 "This fixes a silly logic bug in the ucount rlimits code, where it was
  comparing against the wrong limit"

* 'ucount-rlimit-fixes-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  ucounts: Fix rlimit max values check
2021-12-23 15:27:02 -08:00
Linus Torvalds e1fe1b10e6 - Make sure the CLOCK_REALTIME to CLOCK_MONOTONIC offset is never positive
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmG/OTsACgkQEsHwGGHe
 VUrBWxAAhCQ5rFc5WkVxN3Lr2JLtY2bNUAOrdWNVXXmuKIZhbCgnXZ6a7NH9Ins/
 zkLS7YL1gaZtcK+sYnPbO7Z6oTVEqV5UZnxuUH8DF8Q2U7cVdGvQSeHx5ghx4O35
 13P0RSrj0++Q03dc5mf7+OA7RTuH00JpFCvRavpHNJDYFIN+gl1pPDjM/0g+j90W
 PwFa/Hr8vOH7vpPRwygZ+yWfMunb7nTpY7Pa7toSQtE4NR6L2+A49+0/scjD5i9n
 wQCFI4Md49DRV8qvC04YmN4XC72PBKo59z0ptw1LP1yYuD3n0IjjxhRmkaEGLS/x
 abSs3DfwDDD3Bkl/CprJ6ZfoNez5jOsgdPgPH+c5QdHYk837JAgiLZL0M5YK+Gqf
 azuYSv0XfSA6Jg4ioaqsw5gq2QhJS0/ej3VN9qLIspDLncx0BHHr99inrmuvONbl
 cgtm24xQx8ezG8iEK4Ij05bg/sflwP8czTx4La8tnK2p1VK+xHeezKRLjEFqmXCr
 NV8nZEPO7QVbNinViHnEcvz4fur1lYHpCJnG2UbNPipYT2XHsAkaVEZ8uvmg+Ovy
 alcAasSVq9YdQbgWyYFmwWXVoPeG87z53MDA7kPk2TihJaOz2jaY0me5J6fOgSqh
 QFETA8Hcd3Do9hf0MRY9HTX18/uKinW8HclVw2yZxdztGfOfAxA=
 =8tah
 -----END PGP SIGNATURE-----

Merge tag 'timers_urgent_for_v5.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Borislav Petkov:

 - Make sure the CLOCK_REALTIME to CLOCK_MONOTONIC offset is never
   positive

* tag 'timers_urgent_for_v5.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Really make sure wall_to_monotonic isn't positive
2021-12-19 12:23:18 -08:00
Linus Torvalds 909e1d166c - Fix the condition checking when the optimistic spinning of a waiter needs
to be terminated
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmG/NxMACgkQEsHwGGHe
 VUpzQw/+OQ6cDj41E+482w3iQDdnQWTyWV29ukyBbR+QRDmi7IyPIR6YQ3mEz0Wu
 qiG76aO3R7+y0mc84ISaZPhbZ1pTCvOPaBiE91rachc1w9bLH1J/HIy2veKvPw29
 8Vhn6sB2lUoh8y8Cy8AHgD0D6u/imBuBrVyO+qT22r1ZUlnZj02fT1U/XD2e3WNO
 Id9JXhzu6S2leRqg5hSS6WodXbtGBsM4k5jDscu3s4Akv0JS7dxaeVaEGLw5oqyJ
 +sIL6V6BwbfLEe4UOgvVzVgwzXnyhqtVF8ldaqj3PpdjhqUtzqGEmirUq4WVjZ+R
 A1mHZ3bgPQNqmdhhWNtz1IFSJcuVXGEgXSS98LStyLyxVPiAByo5wHWJxF3jx/UW
 ag2boT/MyoKP3iRclUKOgRqeDFsDH4HCNF9YEyqu5uSrvJhMNwhhCttCDFKu3cAl
 vSEXmgNr1gcL1IAUlm3w4ZQIU8x/eznfhZiVpoWqtGhSxQPmTShV/YT4S7SY7mtf
 0kxhK/Y1nS4nQqDTyuyVzJDFVX1ZoS0SJXe1L9TnMiD7VLO9wEblgdaDfp8DxCrY
 YPCpnpmnV9tOyGVmbAJU+Xz9Pbuoahr0h7JoslPDKMJQTO30vc0reF2Z5gV05FCM
 SgFUExL9a3TGLplMPmz96MhtTnN/a884txQOCCpvDygubXVLnTs=
 =85I3
 -----END PGP SIGNATURE-----

Merge tag 'locking_urgent_for_v5.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Borislav Petkov:

 - Fix the rtmutex condition checking when the optimistic spinning of a
   waiter needs to be terminated

* tag 'locking_urgent_for_v5.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rtmutex: Fix incorrect condition in rtmutex_spin_on_owner()
2021-12-19 12:17:26 -08:00
Linus Torvalds c36d891d78 - Prevent lock contention on the new sigaltstack lock on the common-case
path, when no changes have been made to the alternative signal stack.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmG/NZ0ACgkQEsHwGGHe
 VUqwEQ/8CCQ2KKBbDZOYIr1wnl+FDIycgq7tnz+q9SomzxQODdDWLREBoTPsOtoE
 NZgXZEQxX4Wh/+4rvvdSMCVT3nz2GvSSasVKGrPZyLpDDyL3coRO0Ngx9iRUd1kF
 j67e9oMuNboPC5jJfP9cC4T+GgDQDnXAjjT3jX7aiIXnNjnOCTZ5Z7W8GKw7d2qH
 4L2SJwAPOkuRicdQiRMJhVLsowsDIZtC8q8OZHhwu0dqM3/JVJCIxKKGKV69j5uk
 TUP6M0ZdyR30VrDfKYlm3m5fY0YFsBY/algphP41Hz5sUe9Xsw6F5+8sL3nCqLz1
 BBUFr/00qVruM3jWmIag/OQ8/4cAFZjrx+8ewdF61OEOWya9Mq7VxINjT8R77B0i
 AuA6Bkv1LArJyfvywbbD6JzAj7TQFPuhFPc0BUFwZfn+B1rvxm88JK2mjR9aO/wZ
 ZHgDJ5hOSIKKNJ2W9g2fhW0MTMUELxKqxHZqOmQU/8ydVxYHZtD2GLHLDAU3XBoe
 9PTntBvv7+qxqNQyY70k4jzIRfOFB8XuYxeWCbg10LqkbFFm2otYN2orsjVVBY7u
 9wPQhFvJo6pHBx+dNIV6be56SnIeTCdIWBqlUcAto5mCVbmIxQoIMoNLo6rGBrhA
 7UdhVCFJJki/Bs92aEQxl09volI9Ec7yXvmpU74LfKD+Gc8TxQo=
 =if9T
 -----END PGP SIGNATURE-----

Merge tag 'core_urgent_for_v5.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull signal handlign fix from Borislav Petkov:

 - Prevent lock contention on the new sigaltstack lock on the
   common-case path, when no changes have been made to the alternative
   signal stack.

* tag 'core_urgent_for_v5.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  signal: Skip the altstack update when not needed
2021-12-19 11:46:54 -08:00
Zqiang 8f556a326c locking/rtmutex: Fix incorrect condition in rtmutex_spin_on_owner()
Optimistic spinning needs to be terminated when the spinning waiter is not
longer the top waiter on the lock, but the condition is negated. It
terminates if the waiter is the top waiter, which is defeating the whole
purpose.

Fixes: c3123c4314 ("locking/rtmutex: Dont dereference waiter lockless")
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211217074207.77425-1-qiang1.zhang@intel.com
2021-12-18 10:55:51 +01:00
Yu Liao 4e8c11b6b3 timekeeping: Really make sure wall_to_monotonic isn't positive
Even after commit e1d7ba8735 ("time: Always make sure wall_to_monotonic
isn't positive") it is still possible to make wall_to_monotonic positive
by running the following code:

    int main(void)
    {
        struct timespec time;

        clock_gettime(CLOCK_MONOTONIC, &time);
        time.tv_nsec = 0;
        clock_settime(CLOCK_REALTIME, &time);
        return 0;
    }

The reason is that the second parameter of timespec64_compare(), ts_delta,
may be unnormalized because the delta is calculated with an open coded
substraction which causes the comparison of tv_sec to yield the wrong
result:

  wall_to_monotonic = { .tv_sec = -10, .tv_nsec =  900000000 }
  ts_delta 	    = { .tv_sec =  -9, .tv_nsec = -900000000 }

That makes timespec64_compare() claim that wall_to_monotonic < ts_delta,
but actually the result should be wall_to_monotonic > ts_delta.

After normalization, the result of timespec64_compare() is correct because
the tv_sec comparison is not longer misleading:

  wall_to_monotonic = { .tv_sec = -10, .tv_nsec =  900000000 }
  ts_delta 	    = { .tv_sec = -10, .tv_nsec =  100000000 }

Use timespec64_sub() to ensure that ts_delta is normalized, which fixes the
issue.

Fixes: e1d7ba8735 ("time: Always make sure wall_to_monotonic isn't positive")
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211213135727.1656662-1-liaoyu15@huawei.com
2021-12-17 23:06:22 +01:00
Linus Torvalds 6441998e2e audit/stable-5.16 PR 20211216
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmG7vm8UHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXOCYw//Z7N53pFP1Ci1ToZWTgjdwBAV1lM/
 52uG1aEg/TxAVHt/3STNXEmsUc3BaxpYQxBIevjkGYbxe3MRvE9ZJlSQdFpyjXOs
 DrXxCC38TrcJ2wJpOPUidbokMSoyyJSX3dfSOwD566q1RCK1z9O7G544eh1DW651
 ewYLVClOFuoyxiQiBQwSPPjaOV8vTmFWl+omsoZS74CcshPglAngqqZcLRNJ14RV
 6TpnKZ1q4az7GQY1lqad1YmEwmMEgH32qfz/pFUvQ3s8omi3JhC1+IBggW2iE76G
 Ssdw62sqrn3dEoSG5TADc8NxDH+MFLauF2XgRP9ct3eKFG3X3Z605eWEpDFJ1i8S
 1FhOyherjQ1uSc6EOMMKfoyo7thrhoQ92wyCQBt4EkZxW8hULVuhqSX8KDs2p1+l
 0epQmlpCrzAzbPSMHlC5LATga8zzaUbyoVj03AcDAb+I+29v5fNRmzAbJrKZruwM
 dJosdAsJ9tlVE6GqyCIBLeC3PQxJ5Xjw3jpsrutD/aoFYkgKASve+Y927OWIj24r
 KpFqjdLOS3dTKmxEQr97iF5w1IaW80lGykaQAjW2JZVp2CWOCUxQOtqTaUQYzQAp
 H4D2aYzy9RJVHxvK0HYceT+FhrB+yIPKBMOaLz+UjDWopIkYzuJZ3AbaxLGVdGIh
 pEMYpVR3XXm87z0=
 =jWtt
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20211216' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit fix from Paul Moore:
 "A single patch to fix a problem where the audit queue could grow
  unbounded when the audit daemon is forcibly stopped"

* tag 'audit-pr-20211216' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: improve robustness of the audit queue handling
2021-12-16 15:24:46 -08:00
Linus Torvalds 180f3bcfe3 Networking fixes for 5.16-rc6, including fixes from mac80211, wifi, bpf.
Current release - regressions:
 
  - dpaa2-eth: fix buffer overrun when reporting ethtool statistics
 
 Current release - new code bugs:
 
  - bpf: fix incorrect state pruning for <8B spill/fill
 
  - iavf:
      - add missing unlocks in iavf_watchdog_task()
      - do not override the adapter state in the watchdog task (again)
 
  - mlxsw: spectrum_router: consolidate MAC profiles when possible
 
 Previous releases - regressions:
 
  - mac80211, fix:
      - rate control, avoid driver crash for retransmitted frames
      - regression in SSN handling of addba tx
      - a memory leak where sta_info is not freed
      - marking TX-during-stop for TX in in_reconfig, prevent stall
 
  - cfg80211: acquire wiphy mutex on regulatory work
 
  - wifi drivers: fix build regressions and LED config dependency
 
  - virtio_net: fix rx_drops stat for small pkts
 
  - dsa: mv88e6xxx: unforce speed & duplex in mac_link_down()
 
 Previous releases - always broken:
 
  - bpf, fix:
     - kernel address leakage in atomic fetch
     - kernel address leakage in atomic cmpxchg's r0 aux reg
     - signed bounds propagation after mov32
     - extable fixup offset
     - extable address check
 
  - mac80211:
      - fix the size used for building probe request
      - send ADDBA requests using the tid/queue of the aggregation
        session
      - agg-tx: don't schedule_and_wake_txq() under sta->lock,
        avoid deadlocks
      - validate extended element ID is present
 
  - mptcp:
      - never allow the PM to close a listener subflow (null-defer)
      - clear 'kern' flag from fallback sockets, prevent crash
      - fix deadlock in __mptcp_push_pending()
 
  - inet_diag: fix kernel-infoleak for UDP sockets
 
  - xsk: do not sleep in poll() when need_wakeup set
 
  - smc: avoid very long waits in smc_release()
 
  - sch_ets: don't remove idle classes from the round-robin list
 
  - netdevsim:
      - zero-initialize memory for bpf map's value, prevent info leak
      - don't let user space overwrite read only (max) ethtool parms
 
  - ixgbe: set X550 MDIO speed before talking to PHY
 
  - stmmac:
      - fix null-deref in flower deletion w/ VLAN prio Rx steering
      - dwmac-rk: fix oob read in rk_gmac_setup
 
  - ice: time stamping fixes
 
  - systemport: add global locking for descriptor life cycle
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmG7rdUACgkQMUZtbf5S
 IrtRvw//etsgeg2+zxe+fBSbe7ZihcCB4yzWUoRDdNzPrLNLsnWxKT1wYblDcZft
 b1f/SpTy9ycfg+fspn2qET8gzydn4m9xHkjmlQPzmXB9tdIDF6mECFTAXYlar1hQ
 RQIijpfZYyrZeGdgHpsyq72YC4dpNdbZrxmQFVdpMr3cK8P2N0Dn32bBVa//+jb+
 LCv3Uw9C0yNbqhtRIiukkWIE20+/pXtKm0uErDVmvonqFMWPo6mYD0C2PwC20PwR
 Kv5ok6jH+44fCSwDoLChbB+Wes0AtrIQdUvUwXGXaF3MDfZl+24oLkX5xJl3EHWT
 90Mh0k0NhRORgBZ3NItwK7OliohrRHCYxlAXPjg1Dicxl+kxl0wPlva8v64eAA+u
 ZhwXwaQpCrZNdKoxHJw9kQ/CmbggtxcWkVolbZp3TzDjYY1E7qxuwg51YMhGmGT1
 FPjradYGvHKi+thizJiEdiZaMKRc8bpaL0hbpROxFQvfjNwFOwREQhtnXYP3W5Kd
 lK88fWaH86dxqL+ABvbrMnSZKuNlSL8R/CROWpZuF+vyLRXaxhAvYRrL79bgmkKq
 zvImnh1mFovdyKGJhibFMdy92X14z8FzoyX3VQuFcl9EB+2NQXnNZ6abDLJlufZX
 A0jQ5r46Ce/yyaXXmS61PrP7Pf5sxhs/69fqAIDQfSSzpyUKHd4=
 =VIbd
 -----END PGP SIGNATURE-----

Merge tag 'net-5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes, including fixes from mac80211, wifi, bpf.

  Relatively large batches of fixes from BPF and the WiFi stack, calm in
  general networking.

  Current release - regressions:

   - dpaa2-eth: fix buffer overrun when reporting ethtool statistics

  Current release - new code bugs:

   - bpf: fix incorrect state pruning for <8B spill/fill

   - iavf:
       - add missing unlocks in iavf_watchdog_task()
       - do not override the adapter state in the watchdog task (again)

   - mlxsw: spectrum_router: consolidate MAC profiles when possible

  Previous releases - regressions:

   - mac80211 fixes:
       - rate control, avoid driver crash for retransmitted frames
       - regression in SSN handling of addba tx
       - a memory leak where sta_info is not freed
       - marking TX-during-stop for TX in in_reconfig, prevent stall

   - cfg80211: acquire wiphy mutex on regulatory work

   - wifi drivers: fix build regressions and LED config dependency

   - virtio_net: fix rx_drops stat for small pkts

   - dsa: mv88e6xxx: unforce speed & duplex in mac_link_down()

  Previous releases - always broken:

   - bpf fixes:
       - kernel address leakage in atomic fetch
       - kernel address leakage in atomic cmpxchg's r0 aux reg
       - signed bounds propagation after mov32
       - extable fixup offset
       - extable address check

   - mac80211:
       - fix the size used for building probe request
       - send ADDBA requests using the tid/queue of the aggregation
         session
       - agg-tx: don't schedule_and_wake_txq() under sta->lock, avoid
         deadlocks
       - validate extended element ID is present

   - mptcp:
       - never allow the PM to close a listener subflow (null-defer)
       - clear 'kern' flag from fallback sockets, prevent crash
       - fix deadlock in __mptcp_push_pending()

   - inet_diag: fix kernel-infoleak for UDP sockets

   - xsk: do not sleep in poll() when need_wakeup set

   - smc: avoid very long waits in smc_release()

   - sch_ets: don't remove idle classes from the round-robin list

   - netdevsim:
       - zero-initialize memory for bpf map's value, prevent info leak
       - don't let user space overwrite read only (max) ethtool parms

   - ixgbe: set X550 MDIO speed before talking to PHY

   - stmmac:
       - fix null-deref in flower deletion w/ VLAN prio Rx steering
       - dwmac-rk: fix oob read in rk_gmac_setup

   - ice: time stamping fixes

   - systemport: add global locking for descriptor life cycle"

* tag 'net-5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (89 commits)
  bpf, selftests: Fix racing issue in btf_skc_cls_ingress test
  selftest/bpf: Add a test that reads various addresses.
  bpf: Fix extable address check.
  bpf: Fix extable fixup offset.
  bpf, selftests: Add test case trying to taint map value pointer
  bpf: Make 32->64 bounds propagation slightly more robust
  bpf: Fix signed bounds propagation after mov32
  sit: do not call ipip6_dev_free() from sit_init_net()
  net: systemport: Add global locking for descriptor lifecycle
  net/smc: Prevent smc_release() from long blocking
  net: Fix double 0x prefix print in SKB dump
  virtio_net: fix rx_drops stat for small pkts
  dsa: mv88e6xxx: fix debug print for SPEED_UNFORCED
  sfc_ef100: potential dereference of null pointer
  net: stmmac: dwmac-rk: fix oob read in rk_gmac_setup
  net: usb: lan78xx: add Allied Telesis AT29M2-AF
  net/packet: rx_owner_map depends on pg_vec
  netdevsim: Zero-initialize memory for new map's value in function nsim_bpf_map_alloc
  dpaa2-eth: fix ethtool statistics
  ixgbe: set X550 MDIO speed before talking to PHY
  ...
2021-12-16 15:02:14 -08:00
Daniel Borkmann e572ff80f0 bpf: Make 32->64 bounds propagation slightly more robust
Make the bounds propagation in __reg_assign_32_into_64() slightly more
robust and readable by aligning it similarly as we did back in the
__reg_combine_64_into_32() counterpart. Meaning, only propagate or
pessimize them as a smin/smax pair.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-12-16 19:45:56 +01:00
Daniel Borkmann 3cf2b61eb0 bpf: Fix signed bounds propagation after mov32
For the case where both s32_{min,max}_value bounds are positive, the
__reg_assign_32_into_64() directly propagates them to their 64 bit
counterparts, otherwise it pessimises them into [0,u32_max] universe and
tries to refine them later on by learning through the tnum as per comment
in mentioned function. However, that does not always happen, for example,
in mov32 operation we call zext_32_to_64(dst_reg) which invokes the
__reg_assign_32_into_64() as is without subsequent bounds update as
elsewhere thus no refinement based on tnum takes place.

Thus, not calling into the __update_reg_bounds() / __reg_deduce_bounds() /
__reg_bound_offset() triplet as we do, for example, in case of ALU ops via
adjust_scalar_min_max_vals(), will lead to more pessimistic bounds when
dumping the full register state:

Before fix:

  0: (b4) w0 = -1
  1: R0_w=invP4294967295
     (id=0,imm=ffffffff,
      smin_value=4294967295,smax_value=4294967295,
      umin_value=4294967295,umax_value=4294967295,
      var_off=(0xffffffff; 0x0),
      s32_min_value=-1,s32_max_value=-1,
      u32_min_value=-1,u32_max_value=-1)

  1: (bc) w0 = w0
  2: R0_w=invP4294967295
     (id=0,imm=ffffffff,
      smin_value=0,smax_value=4294967295,
      umin_value=4294967295,umax_value=4294967295,
      var_off=(0xffffffff; 0x0),
      s32_min_value=-1,s32_max_value=-1,
      u32_min_value=-1,u32_max_value=-1)

Technically, the smin_value=0 and smax_value=4294967295 bounds are not
incorrect, but given the register is still a constant, they break assumptions
about const scalars that smin_value == smax_value and umin_value == umax_value.

After fix:

  0: (b4) w0 = -1
  1: R0_w=invP4294967295
     (id=0,imm=ffffffff,
      smin_value=4294967295,smax_value=4294967295,
      umin_value=4294967295,umax_value=4294967295,
      var_off=(0xffffffff; 0x0),
      s32_min_value=-1,s32_max_value=-1,
      u32_min_value=-1,u32_max_value=-1)

  1: (bc) w0 = w0
  2: R0_w=invP4294967295
     (id=0,imm=ffffffff,
      smin_value=4294967295,smax_value=4294967295,
      umin_value=4294967295,umax_value=4294967295,
      var_off=(0xffffffff; 0x0),
      s32_min_value=-1,s32_max_value=-1,
      u32_min_value=-1,u32_max_value=-1)

Without the smin_value == smax_value and umin_value == umax_value invariant
being intact for const scalars, it is possible to leak out kernel pointers
from unprivileged user space if the latter is enabled. For example, when such
registers are involved in pointer arithmtics, then adjust_ptr_min_max_vals()
will taint the destination register into an unknown scalar, and the latter
can be exported and stored e.g. into a BPF map value.

Fixes: 3f50f132d8 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Reported-by: Kuee K1r0a <liulin063@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-12-16 19:45:46 +01:00
Paul Moore f4b3ee3c85 audit: improve robustness of the audit queue handling
If the audit daemon were ever to get stuck in a stopped state the
kernel's kauditd_thread() could get blocked attempting to send audit
records to the userspace audit daemon.  With the kernel thread
blocked it is possible that the audit queue could grow unbounded as
certain audit record generating events must be exempt from the queue
limits else the system enter a deadlock state.

This patch resolves this problem by lowering the kernel thread's
socket sending timeout from MAX_SCHEDULE_TIMEOUT to HZ/10 and tweaks
the kauditd_send_queue() function to better manage the various audit
queues when connection problems occur between the kernel and the
audit daemon.  With this patch, the backlog may temporarily grow
beyond the defined limits when the audit daemon is stopped and the
system is under heavy audit pressure, but kauditd_thread() will
continue to make progress and drain the queues as it would for other
connection problems.  For example, with the audit daemon put into a
stopped state and the system configured to audit every syscall it
was still possible to shutdown the system without a kernel panic,
deadlock, etc.; granted, the system was slow to shutdown but that is
to be expected given the extreme pressure of recording every syscall.

The timeout value of HZ/10 was chosen primarily through
experimentation and this developer's "gut feeling".  There is likely
no one perfect value, but as this scenario is limited in scope (root
privileges would be needed to send SIGSTOP to the audit daemon), it
is likely not worth exposing this as a tunable at present.  This can
always be done at a later date if it proves necessary.

Cc: stable@vger.kernel.org
Fixes: 5b52330bbf ("audit: fix auditd/kernel connection state tracking")
Reported-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Tested-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-12-15 13:16:39 -05:00
Daniel Borkmann a82fe085f3 bpf: Fix kernel address leakage in atomic cmpxchg's r0 aux reg
The implementation of BPF_CMPXCHG on a high level has the following parameters:

  .-[old-val]                                          .-[new-val]
  BPF_R0 = cmpxchg{32,64}(DST_REG + insn->off, BPF_R0, SRC_REG)
                          `-[mem-loc]          `-[old-val]

Given a BPF insn can only have two registers (dst, src), the R0 is fixed and
used as an auxilliary register for input (old value) as well as output (returning
old value from memory location). While the verifier performs a number of safety
checks, it misses to reject unprivileged programs where R0 contains a pointer as
old value.

Through brute-forcing it takes about ~16sec on my machine to leak a kernel pointer
with BPF_CMPXCHG. The PoC is basically probing for kernel addresses by storing the
guessed address into the map slot as a scalar, and using the map value pointer as
R0 while SRC_REG has a canary value to detect a matching address.

Fix it by checking R0 for pointers, and reject if that's the case for unprivileged
programs.

Fixes: 5ffa25502b ("bpf: Add instructions for atomic_[cmp]xchg")
Reported-by: Ryota Shiga (Flatt Security)
Acked-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-12-14 19:33:06 -08:00
Daniel Borkmann 7d3baf0afa bpf: Fix kernel address leakage in atomic fetch
The change in commit 37086bfdc7 ("bpf: Propagate stack bounds to registers
in atomics w/ BPF_FETCH") around check_mem_access() handling is buggy since
this would allow for unprivileged users to leak kernel pointers. For example,
an atomic fetch/and with -1 on a stack destination which holds a spilled
pointer will migrate the spilled register type into a scalar, which can then
be exported out of the program (since scalar != pointer) by dumping it into
a map value.

The original implementation of XADD was preventing this situation by using
a double call to check_mem_access() one with BPF_READ and a subsequent one
with BPF_WRITE, in both cases passing -1 as a placeholder value instead of
register as per XADD semantics since it didn't contain a value fetch. The
BPF_READ also included a check in check_stack_read_fixed_off() which rejects
the program if the stack slot is of __is_pointer_value() if dst_regno < 0.
The latter is to distinguish whether we're dealing with a regular stack spill/
fill or some arithmetical operation which is disallowed on non-scalars, see
also 6e7e63cbb0 ("bpf: Forbid XADD on spilled pointers for unprivileged
users") for more context on check_mem_access() and its handling of placeholder
value -1.

One minimally intrusive option to fix the leak is for the BPF_FETCH case to
initially check the BPF_READ case via check_mem_access() with -1 as register,
followed by the actual load case with non-negative load_reg to propagate
stack bounds to registers.

Fixes: 37086bfdc7 ("bpf: Propagate stack bounds to registers in atomics w/ BPF_FETCH")
Reported-by: <n4ke4mry@gmail.com>
Acked-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-12-14 19:33:06 -08:00
Chang S. Bae 6c3118c321 signal: Skip the altstack update when not needed
== Background ==

Support for large, "dynamic" fpstates was recently merged.  This
included code to ensure that sigaltstacks are sufficiently sized for
these large states.  A new lock was added to remove races between
enabling large features and setting up sigaltstacks.

== Problem ==

The new lock (sigaltstack_lock()) is acquired in the sigreturn path
before restoring the old sigaltstack.  Unfortunately, contention on the
new lock causes a measurable signal handling performance regression [1].
However, the common case is that no *changes* are made to the
sigaltstack state at sigreturn.

== Solution ==

do_sigaltstack() acquires sigaltstack_lock() and is used for both
sys_sigaltstack() and restoring the sigaltstack in sys_sigreturn().
Check for changes to the sigaltstack before taking the lock.  If no
changes were made, return before acquiring the lock.

This removes lock contention from the common-case sigreturn path.

[1] https://lore.kernel.org/lkml/20211207012128.GA16074@xsang-OptiPlex-9020/

Fixes: 3aac3ebea0 ("x86/signal: Implement sigaltstack size validation")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20211210225503.12734-1-chang.seok.bae@intel.com
2021-12-14 13:08:36 -08:00
Linus Torvalds df442a4ec7 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "21 patches.

  Subsystems affected by this patch series: MAINTAINERS, mailmap, and mm
  (mlock, pagecache, damon, slub, memcg, hugetlb, and pagecache)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (21 commits)
  mm: bdi: initialize bdi_min_ratio when bdi is unregistered
  hugetlbfs: fix issue of preallocation of gigantic pages can't work
  mm/memcg: relocate mod_objcg_mlstate(), get_obj_stock() and put_obj_stock()
  mm/slub: fix endianness bug for alloc/free_traces attributes
  selftests/damon: split test cases
  selftests/damon: test debugfs file reads/writes with huge count
  selftests/damon: test wrong DAMOS condition ranges input
  selftests/damon: test DAMON enabling with empty target_ids case
  selftests/damon: skip test if DAMON is running
  mm/damon/vaddr-test: remove unnecessary variables
  mm/damon/vaddr-test: split a test function having >1024 bytes frame size
  mm/damon/vaddr: remove an unnecessary warning message
  mm/damon/core: remove unnecessary error messages
  mm/damon/dbgfs: remove an unnecessary error message
  mm/damon/core: use better timer mechanisms selection threshold
  mm/damon/core: fix fake load reports due to uninterruptible sleeps
  timers: implement usleep_idle_range()
  filemap: remove PageHWPoison check from next_uptodate_page()
  mailmap: update email address for Guo Ren
  MAINTAINERS: update kdump maintainers
  ...
2021-12-11 08:46:52 -08:00
SeongJae Park e4779015fd timers: implement usleep_idle_range()
Patch series "mm/damon: Fix fake /proc/loadavg reports", v3.

This patchset fixes DAMON's fake load report issue.  The first patch
makes yet another variant of usleep_range() for this fix, and the second
patch fixes the issue of DAMON by making it using the newly introduced
function.

This patch (of 2):

Some kernel threads such as DAMON could need to repeatedly sleep in
micro seconds level.  Because usleep_range() sleeps in uninterruptible
state, however, such threads would make /proc/loadavg reports fake load.

To help such cases, this commit implements a variant of usleep_range()
called usleep_idle_range().  It is same to usleep_range() but sets the
state of the current task as TASK_IDLE while sleeping.

Link: https://lkml.kernel.org/r/20211126145015.15862-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211126145015.15862-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: John Stultz <john.stultz@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-10 17:10:55 -08:00
Linus Torvalds 257dcf2923 tracing, ftrace and tracefs fixes:
- Have tracefs honor the gid mount option
 
  - Have new files in tracefs inherit the parent ownership
 
  - Have direct_ops unregister when it has no more functions
 
  - Properly clean up the ops when unregistering multi direct ops
 
  - Add a sample module to test the multiple direct ops
 
  - Fix memory leak in error path of __create_synth_event()
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYbOgPBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qgOtAP0YD+cRLxnRKA376oQVB8zmuZ3mZ/4x
 6M1hqruSDlno3AEA19PyHpxl7flFwnBb6Gnfo9VeefcMS5ENDH9p1aHX4wU=
 =Tr6t
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Tracing, ftrace and tracefs fixes:

   - Have tracefs honor the gid mount option

   - Have new files in tracefs inherit the parent ownership

   - Have direct_ops unregister when it has no more functions

   - Properly clean up the ops when unregistering multi direct ops

   - Add a sample module to test the multiple direct ops

   - Fix memory leak in error path of __create_synth_event()"

* tag 'trace-v5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix possible memory leak in __create_synth_event() error path
  ftrace/samples: Add module to test multi direct modify interface
  ftrace: Add cleanup to unregister_ftrace_direct_multi
  ftrace: Use direct_ops hash in unregister_ftrace_direct
  tracefs: Set all files to the same group ownership as the mount option
  tracefs: Have new files inherit the ownership of their parent
2021-12-10 14:24:05 -08:00
Linus Torvalds 0d21e66847 aio poll fixes for 5.16-rc5
Fix three bugs in aio poll, and one issue with POLLFREE more broadly:
 
   - aio poll didn't handle POLLFREE, causing a use-after-free.
   - aio poll could block while the file is ready.
   - aio poll called eventfd_signal() when it isn't allowed.
   - POLLFREE didn't handle multiple exclusive waiters correctly.
 
 This has been tested with the libaio test suite, as well as with test
 programs I wrote that reproduce the first two bugs.  I am sending this
 pull request myself as no one seems to be maintaining this code.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYbObthQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK+3mAQC9W8ApzBleEPI6FXzIIo5AiQT/2jGl
 7FbO1MtkdUBU4QEAzf+VWl4Z4BJTgxl44avRdVDpXGAMnbWkd7heY+e3HwA=
 =mp+r
 -----END PGP SIGNATURE-----

Merge tag 'aio-poll-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux

Pull aio poll fixes from Eric Biggers:
 "Fix three bugs in aio poll, and one issue with POLLFREE more broadly:

   - aio poll didn't handle POLLFREE, causing a use-after-free.

   - aio poll could block while the file is ready.

   - aio poll called eventfd_signal() when it isn't allowed.

   - POLLFREE didn't handle multiple exclusive waiters correctly.

  This has been tested with the libaio test suite, as well as with test
  programs I wrote that reproduce the first two bugs. I am sending this
  pull request myself as no one seems to be maintaining this code"

* tag 'aio-poll-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
  aio: Fix incorrect usage of eventfd_signal_allowed()
  aio: fix use-after-free due to missing POLLFREE handling
  aio: keep poll requests on waitqueue until completed
  signalfd: use wake_up_pollfree()
  binder: use wake_up_pollfree()
  wait: add wake_up_pollfree()
2021-12-10 14:15:39 -08:00
Paul Chaignon 345e004d02 bpf: Fix incorrect state pruning for <8B spill/fill
Commit 354e8f1970 ("bpf: Support <8-byte scalar spill and refill")
introduced support in the verifier to track <8B spill/fills of scalars.
The backtracking logic for the precision bit was however skipping
spill/fills of less than 8B. That could cause state pruning to consider
two states equivalent when they shouldn't be.

As an example, consider the following bytecode snippet:

  0:  r7 = r1
  1:  call bpf_get_prandom_u32
  2:  r6 = 2
  3:  if r0 == 0 goto pc+1
  4:  r6 = 3
  ...
  8: [state pruning point]
  ...
  /* u32 spill/fill */
  10: *(u32 *)(r10 - 8) = r6
  11: r8 = *(u32 *)(r10 - 8)
  12: r0 = 0
  13: if r8 == 3 goto pc+1
  14: r0 = 1
  15: exit

The verifier first walks the path with R6=3. Given the support for <8B
spill/fills, at instruction 13, it knows the condition is true and skips
instruction 14. At that point, the backtracking logic kicks in but stops
at the fill instruction since it only propagates the precision bit for
8B spill/fill. When the verifier then walks the path with R6=2, it will
consider it safe at instruction 8 because R6 is not marked as needing
precision. Instruction 14 is thus never walked and is then incorrectly
removed as 'dead code'.

It's also possible to lead the verifier to accept e.g. an out-of-bound
memory access instead of causing an incorrect dead code elimination.

This regression was found via Cilium's bpf-next CI where it was causing
a conntrack map update to be silently skipped because the code had been
removed by the verifier.

This commit fixes it by enabling support for <8B spill/fills in the
bactracking logic. In case of a <8B spill/fill, the full 8B stack slot
will be marked as needing precision. Then, in __mark_chain_precision,
any tracked register spilled in a marked slot will itself be marked as
needing precision, regardless of the spill size. This logic makes two
assumptions: (1) only 8B-aligned spill/fill are tracked and (2) spilled
registers are only tracked if the spill and fill sizes are equal. Commit
ef979017b8 ("bpf: selftest: Add verifier tests for <8-byte scalar
spill and refill") covers the first assumption and the next commit in
this patchset covers the second.

Fixes: 354e8f1970 ("bpf: Support <8-byte scalar spill and refill")
Signed-off-by: Paul Chaignon <paul@isovalent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-12-10 09:13:19 -08:00
Alexey Gladkov 59ec71575a ucounts: Fix rlimit max values check
The semantics of the rlimit max values differs from ucounts itself. When
creating a new userns, we store the current rlimit of the process in
ucount_max. Thus, the value of the limit in the parent userns is saved
in the created one.

The problem is that now we are taking the maximum value for counter from
the same userns. So for init_user_ns it will always be RLIM_INFINITY.

To fix the problem we need to check the counter value with the max value
stored in userns.

Reproducer:

su - test -c "ulimit -u 3; sleep 5 & sleep 6 & unshare -U --map-root-user sh -c 'sleep 7 & sleep 8 & date; wait'"

Before:

[1] 175
[2] 176
Fri Nov 26 13:48:20 UTC 2021
[1]-  Done                    sleep 5
[2]+  Done                    sleep 6

After:

[1] 167
[2] 168
sh: fork: retry: Resource temporarily unavailable
sh: fork: retry: Resource temporarily unavailable
sh: fork: retry: Resource temporarily unavailable
sh: fork: retry: Resource temporarily unavailable
sh: fork: retry: Resource temporarily unavailable
sh: fork: retry: Resource temporarily unavailable
sh: fork: retry: Resource temporarily unavailable
sh: fork: Interrupted system call
[1]-  Done                    sleep 5
[2]+  Done                    sleep 6

Fixes: c54b245d01 ("Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
Reported-by: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/024ec805f6e16896f0b23e094773790d171d2c1c.1638218242.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-12-09 15:37:18 -06:00
Linus Torvalds ded746bfc9 Networking fixes for 5.16-rc5, including fixes from bpf, can and netfilter.
Current release - regressions:
 
  - bpf, sockmap: re-evaluate proto ops when psock is removed from sockmap
 
 Current release - new code bugs:
 
  - bpf: fix bpf_check_mod_kfunc_call for built-in modules
 
  - ice: fixes for TC classifier offloads
 
  - vrf: don't run conntrack on vrf with !dflt qdisc
 
 Previous releases - regressions:
 
  - bpf: fix the off-by-two error in range markings
 
  - seg6: fix the iif in the IPv6 socket control block
 
  - devlink: fix netns refcount leak in devlink_nl_cmd_reload()
 
  - dsa: mv88e6xxx: fix "don't use PHY_DETECT on internal PHY's"
 
  - dsa: mv88e6xxx: allow use of PHYs on CPU and DSA ports
 
 Previous releases - always broken:
 
  - ethtool: do not perform operations on net devices being unregistered
 
  - udp: use datalen to cap max gso segments
 
  - ice: fix races in stats collection
 
  - fec: only clear interrupt of handling queue in fec_enet_rx_queue()
 
  - m_can: pci: fix incorrect reference clock rate
 
  - m_can: disable and ignore ELO interrupt
 
  - mvpp2: fix XDP rx queues registering
 
 Misc:
 
  - treewide: add missing includes masked by cgroup -> bpf.h dependency
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmGyN1AACgkQMUZtbf5S
 IrtgMA/8D0qk3c75ts0hCzGXwdNdEBs+e7u1bJVPqdyU8x/ZLAp2c0EKB/7IWuxA
 CtsnbanPcmibqvQJDI1hZEBdafi43BmF5VuFSIxYC4EM/1vLoRprurXlIwL2YWki
 aWi//tyOIGBl6/ClzJ9Vm51HTJQwDmdv8GRnKAbsC1eOTM3pmmcg+6TLbDhycFEQ
 F9kkDCvyB9kWIH645QyJRH+Y5qQOvneCyQCPkkyjTgEADzV5i7YgtRol6J3QIbw3
 umPHSckCBTjMacYcCLsbhQaF2gTMgPV1basNLPMjCquJVrItE0ZaeX3MiD6nBFae
 yY5+Wt5KAZDzjERhneX8AINHoRPA/tNIahC1+ytTmsTA8Hj230FHE5hH1ajWiJ9+
 GSTBCBqjtZXce3r2Efxfzy0Kb9JwL3vDi7LS2eKQLv0zBLfYp2ry9Sp9qe4NhPkb
 OYrxws9kl5GOPvrFB5BWI9XBINciC9yC3PjIsz1noi0vD8/Hi9dPwXeAYh36fXU3
 rwRg9uAt6tvFCpwbuQ9T2rsMST0miur2cDYd8qkJtuJ7zFvc+suMXwBZyI29nF2D
 uyymIC2XStHJfAjUkFsGVUSXF5FhML9OQsqmisdQ8KdH26jMnDeMjIWJM7UWK+zY
 E/fqWT8UyS3mXWqaggid4ZbotipCwA0gxiDHuqqUGTM+dbKrzmk=
 =F6rS
 -----END PGP SIGNATURE-----

Merge tag 'net-5.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf, can and netfilter.

  Current release - regressions:

   - bpf, sockmap: re-evaluate proto ops when psock is removed from
     sockmap

  Current release - new code bugs:

   - bpf: fix bpf_check_mod_kfunc_call for built-in modules

   - ice: fixes for TC classifier offloads

   - vrf: don't run conntrack on vrf with !dflt qdisc

  Previous releases - regressions:

   - bpf: fix the off-by-two error in range markings

   - seg6: fix the iif in the IPv6 socket control block

   - devlink: fix netns refcount leak in devlink_nl_cmd_reload()

   - dsa: mv88e6xxx: fix "don't use PHY_DETECT on internal PHY's"

   - dsa: mv88e6xxx: allow use of PHYs on CPU and DSA ports

  Previous releases - always broken:

   - ethtool: do not perform operations on net devices being
     unregistered

   - udp: use datalen to cap max gso segments

   - ice: fix races in stats collection

   - fec: only clear interrupt of handling queue in fec_enet_rx_queue()

   - m_can: pci: fix incorrect reference clock rate

   - m_can: disable and ignore ELO interrupt

   - mvpp2: fix XDP rx queues registering

  Misc:

   - treewide: add missing includes masked by cgroup -> bpf.h
     dependency"

* tag 'net-5.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (82 commits)
  net: dsa: mv88e6xxx: allow use of PHYs on CPU and DSA ports
  net: wwan: iosm: fixes unable to send AT command during mbim tx
  net: wwan: iosm: fixes net interface nonfunctional after fw flash
  net: wwan: iosm: fixes unnecessary doorbell send
  net: dsa: felix: Fix memory leak in felix_setup_mmio_filtering
  MAINTAINERS: s390/net: remove myself as maintainer
  net/sched: fq_pie: prevent dismantle issue
  net: mana: Fix memory leak in mana_hwc_create_wq
  seg6: fix the iif in the IPv6 socket control block
  nfp: Fix memory leak in nfp_cpp_area_cache_add()
  nfc: fix potential NULL pointer deref in nfc_genl_dump_ses_done
  nfc: fix segfault in nfc_genl_dump_devices_done
  udp: using datalen to cap max gso segments
  net: dsa: mv88e6xxx: error handling for serdes_power functions
  can: kvaser_usb: get CAN clock frequency from device
  can: kvaser_pciefd: kvaser_pciefd_rx_error_frame(): increase correct stats->{rx,tx}_errors counter
  net: mvpp2: fix XDP rx queues registering
  vmxnet3: fix minimum vectors alloc issue
  net, neigh: clear whole pneigh_entry at alloc time
  net: dsa: mv88e6xxx: fix "don't use PHY_DETECT on internal PHY's"
  ...
2021-12-09 11:26:44 -08:00
Eric Biggers 42288cb44c wait: add wake_up_pollfree()
Several ->poll() implementations are special in that they use a
waitqueue whose lifetime is the current task, rather than the struct
file as is normally the case.  This is okay for blocking polls, since a
blocking poll occurs within one task; however, non-blocking polls
require another solution.  This solution is for the queue to be cleared
before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'.

However, that has a bug: wake_up_poll() calls __wake_up() with
nr_exclusive=1.  Therefore, if there are multiple "exclusive" waiters,
and the wakeup function for the first one returns a positive value, only
that one will be called.  That's *not* what's needed for POLLFREE;
POLLFREE is special in that it really needs to wake up everyone.

Considering the three non-blocking poll systems:

- io_uring poll doesn't handle POLLFREE at all, so it is broken anyway.

- aio poll is unaffected, since it doesn't support exclusive waits.
  However, that's fragile, as someone could add this feature later.

- epoll doesn't appear to be broken by this, since its wakeup function
  returns 0 when it sees POLLFREE.  But this is fragile.

Although there is a workaround (see epoll), it's better to define a
function which always sends POLLFREE to all waiters.  Add such a
function.  Also make it verify that the queue really becomes empty after
all waiters have been woken up.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211209010455.42744-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-12-09 10:49:56 -08:00
Miaoqian Lin c24be24aed tracing: Fix possible memory leak in __create_synth_event() error path
There's error paths in __create_synth_event() after the argv is allocated
that fail to free it. Add a jump to free it when necessary.

Link: https://lkml.kernel.org/r/20211209024317.11783-1-linmq006@gmail.com

Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
[ Fixed up the patch and change log ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-09 13:03:05 -05:00
Jakub Kicinski 6efcdadc15 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
bpf 2021-12-08

We've added 12 non-merge commits during the last 22 day(s) which contain
a total of 29 files changed, 659 insertions(+), 80 deletions(-).

The main changes are:

1) Fix an off-by-two error in packet range markings and also add a batch of
   new tests for coverage of these corner cases, from Maxim Mikityanskiy.

2) Fix a compilation issue on MIPS JIT for R10000 CPUs, from Johan Almbladh.

3) Fix two functional regressions and a build warning related to BTF kfunc
   for modules, from Kumar Kartikeya Dwivedi.

4) Fix outdated code and docs regarding BPF's migrate_disable() use on non-
   PREEMPT_RT kernels, from Sebastian Andrzej Siewior.

5) Add missing includes in order to be able to detangle cgroup vs bpf header
   dependencies, from Jakub Kicinski.

6) Fix regression in BPF sockmap tests caused by missing detachment of progs
   from sockets when they are removed from the map, from John Fastabend.

7) Fix a missing "no previous prototype" warning in x86 JIT caused by BPF
   dispatcher, from Björn Töpel.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Add selftests to cover packet access corner cases
  bpf: Fix the off-by-two error in range markings
  treewide: Add missing includes masked by cgroup -> bpf dependency
  tools/resolve_btfids: Skip unresolved symbol warning for empty BTF sets
  bpf: Fix bpf_check_mod_kfunc_call for built-in modules
  bpf: Make CONFIG_DEBUG_INFO_BTF depend upon CONFIG_BPF_SYSCALL
  mips, bpf: Fix reference to non-existing Kconfig symbol
  bpf: Make sure bpf_disable_instrumentation() is safe vs preemption.
  Documentation/locking/locktypes: Update migrate_disable() bits.
  bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap
  bpf, sockmap: Attach map progs to psock early for feature probes
  bpf, x86: Fix "no previous prototype" warning
====================

Link: https://lore.kernel.org/r/20211208155125.11826-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-08 16:06:44 -08:00
Jiri Olsa fea3ffa48c ftrace: Add cleanup to unregister_ftrace_direct_multi
Adding ops cleanup to unregister_ftrace_direct_multi,
so it can be reused in another register call.

Link: https://lkml.kernel.org/r/20211206182032.87248-3-jolsa@kernel.org

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Fixes: f64dd4627e ("ftrace: Add multi direct register/unregister interface")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-08 12:12:02 -05:00
Jiri Olsa 7d5b7cad79 ftrace: Use direct_ops hash in unregister_ftrace_direct
Now when we have *direct_multi interface the direct_functions
hash is no longer owned just by direct_ops. It's also used by
any other ftrace_ops passed to *direct_multi interface.

Thus to find out that we are unregistering the last function
from direct_ops, we need to check directly direct_ops's hash.

Link: https://lkml.kernel.org/r/20211206182032.87248-2-jolsa@kernel.org

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Fixes: f64dd4627e ("ftrace: Add multi direct register/unregister interface")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-08 12:12:02 -05:00
Linus Torvalds 7587a4a5a4 - Prevent a tick storm when a dedicated timekeeper CPU in nohz_full
mode runs for prolonged periods with interrupts disabled and ends up
 programming the next tick in the past, leading to that storm
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmGsp+EACgkQEsHwGGHe
 VUpmAA/6A8W0Nb6Doc8B3emuy9qv3NeqLGWqSIKcJnOz0GYhlWuFKGmH6zWQ/ZKZ
 ihjw5fP7aOEytLhLagnn1k2weRZrgBavHaxQskuL3HBFD0mT6Gz1TfJC9JlE5s2Q
 KxaDjRLx5RGJb/KHZDiZv6Kz61Ouh14KfHHymVhZndcPNZ7UjsCgacyUkctGKcoc
 DtNW0Z6tjUGbp1MXyGcOiTiM7hUS8SWsdJbMfn0Eu+/NKvnkua7vwTgEMTwYwrK0
 88sLYyVygL+NHjE9LpSGrRj1HjEV4dSMC3r18UYuWQYkzBvA+/SQbIKD5QoeFmZU
 st5dMBD8Q3KvAWQ8mXE5ymaYaIZxv21PaL1J7lZ3J3osMASH0LkMWXLYoMVtO5rq
 OIpZlODSGLiamGcC5uieoBR/f4Zzn+sEZZ6TyoXWOBv4Cap2XnlJP5WjJ4ARJvzT
 MLX2u8MPPMTL7vtd2Xb4kPZcWH5irrCENXlbz0UG08ZHj4CvBFb+a87f+E4aNUs4
 uBsTf/kS5SihE1ripSCJEnFsc/QgVPr/9jBXQehRcuI4NgT4pUg85LWDj3gSIcH8
 wMRbiX2ND0ZWk89RYaoiDQ6JPGrsnwKvGLRk9ZhFNtUfpycv5JWKwepVbmAKfos+
 JtmG/6kcFQKBofR7EA4Xuh7DHv7LKCRf3MMlAR6Gzx/3K2kyIoQ=
 =Ft9k
 -----END PGP SIGNATURE-----

Merge tag 'timers_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Borislav Petkov:

 - Prevent a tick storm when a dedicated timekeeper CPU in nohz_full
   mode runs for prolonged periods with interrupts disabled and ends up
   programming the next tick in the past, leading to that storm

* tag 'timers_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/nohz: Last resort update jiffies on nohz_full IRQ entry
2021-12-05 08:58:52 -08:00
Linus Torvalds 1d213767dc - Properly init uclamp_flags of a runqueue, on first enqueuing
- Fix preempt= callback return values
 
 - Correct utime/stime resource usage reporting on nohz_full to return
 the proper times instead of shorter ones
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmGspTcACgkQEsHwGGHe
 VUqHhBAAhEd9DMoJwREKDCDMqc3pttNYpTpSVo1K6oBTsOh7mwEilPdlmsTl239V
 jRocVJST+/JmJ424j7t0Sp42tREMKNlbyf+ddvr0oUwi0mLUnN6J83NU4WK4Jisf
 gyXFIkeMR+/W6/LO7gDdq/+rlRDtJcllwHoOm1yyiy5Zc0qDrcy6CjgP5/9hEsh6
 xvRvPOXbeZZVA+a+n+G9xGN836aBe1VptoABbdAlOSTiOvAVkS95UCb9rfPTvMtq
 /71jjZMmhTxGUhg5oLpgvfRRZE608X6b2RCTcAPKa5mfMpN5YMQLcD9G0f8XZjkq
 iOO/+arE6XQJlTzhAEsGxkSXaVweYxRHHP1yAlWYlWV/xGhoaAyq/tXE1KusAnng
 16/eTbrPb1eawpI6p1AAScCQuF/TlYZCMqjbFVhViXM5Rkd6jrii9vz/JnkdokGR
 3TH0n4WAJkdZeg18WS3B0eIt6zDTvxbR9g5ap2/10xYnYHMNdHXGH8A+5Grw9/Ln
 Qsv0V43OjdUK2tVuIHYblx1X9dOlLdpTEg9FCfjiZTQVor1pTwcbG62qNMozanlf
 lQqI6f63E0jugHqhrqrfBvl4lUuoajN5SvXfBNFDIzxwWBGSdr+hJQXstUatfSZZ
 MdmJX+Dk5cAk4CpQQ1ofPvYkS3Ade0vxaL4H++KHYtRvpPvxCXA=
 =XQFF
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Properly init uclamp_flags of a runqueue, on first enqueuing

 - Fix preempt= callback return values

 - Correct utime/stime resource usage reporting on nohz_full to return
   the proper times instead of shorter ones

* tag 'sched_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/uclamp: Fix rq->uclamp_max not set on first enqueue
  preempt/dynamic: Fix setup_preempt_mode() return value
  sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full
2021-12-05 08:53:31 -08:00
Qais Yousef 315c4f8848 sched/uclamp: Fix rq->uclamp_max not set on first enqueue
Commit d81ae8aac8 ("sched/uclamp: Fix initialization of struct
uclamp_rq") introduced a bug where uclamp_max of the rq is not reset to
match the woken up task's uclamp_max when the rq is idle.

The code was relying on rq->uclamp_max initialized to zero, so on first
enqueue

	static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
					    enum uclamp_id clamp_id)
	{
		...

		if (uc_se->value > READ_ONCE(uc_rq->value))
			WRITE_ONCE(uc_rq->value, uc_se->value);
	}

was actually resetting it. But since commit d81ae8aac8 changed the
default to 1024, this no longer works. And since rq->uclamp_flags is
also initialized to 0, neither above code path nor uclamp_idle_reset()
update the rq->uclamp_max on first wake up from idle.

This is only visible from first wake up(s) until the first dequeue to
idle after enabling the static key. And it only matters if the
uclamp_max of this task is < 1024 since only then its uclamp_max will be
effectively ignored.

Fix it by properly initializing rq->uclamp_flags = UCLAMP_FLAG_IDLE to
ensure uclamp_idle_reset() is called which then will update the rq
uclamp_max value as expected.

Fixes: d81ae8aac8 ("sched/uclamp: Fix initialization of struct uclamp_rq")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20211202112033.1705279-1-qais.yousef@arm.com
2021-12-04 10:56:18 +01:00
Andrew Halaney 9ed20bafc8 preempt/dynamic: Fix setup_preempt_mode() return value
__setup() callbacks expect 1 for success and 0 for failure. Correct the
usage here to reflect that.

Fixes: 826bfeb37b ("preempt/dynamic: Support dynamic preempt with preempt= boot option")
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211203233203.133581-1-ahalaney@redhat.com
2021-12-04 10:56:18 +01:00
Maxim Mikityanskiy 2fa7d94afc bpf: Fix the off-by-two error in range markings
The first commit cited below attempts to fix the off-by-one error that
appeared in some comparisons with an open range. Due to this error,
arithmetically equivalent pieces of code could get different verdicts
from the verifier, for example (pseudocode):

  // 1. Passes the verifier:
  if (data + 8 > data_end)
      return early
  read *(u64 *)data, i.e. [data; data+7]

  // 2. Rejected by the verifier (should still pass):
  if (data + 7 >= data_end)
      return early
  read *(u64 *)data, i.e. [data; data+7]

The attempted fix, however, shifts the range by one in a wrong
direction, so the bug not only remains, but also such piece of code
starts failing in the verifier:

  // 3. Rejected by the verifier, but the check is stricter than in #1.
  if (data + 8 >= data_end)
      return early
  read *(u64 *)data, i.e. [data; data+7]

The change performed by that fix converted an off-by-one bug into
off-by-two. The second commit cited below added the BPF selftests
written to ensure than code chunks like #3 are rejected, however,
they should be accepted.

This commit fixes the off-by-two error by adjusting new_range in the
right direction and fixes the tests by changing the range into the
one that should actually fail.

Fixes: fb2a311a31 ("bpf: fix off by one for range markings with L{T, E} patterns")
Fixes: b37242c773 ("bpf: add test cases to bpf selftests to cover all access tests")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211130181607.593149-1-maximmi@nvidia.com
2021-12-03 21:44:42 +01:00
Kumar Kartikeya Dwivedi b12f031043 bpf: Fix bpf_check_mod_kfunc_call for built-in modules
When module registering its set is built-in, THIS_MODULE will be NULL,
hence we cannot return early in case owner is NULL.

Fixes: 14f267d95f ("bpf: btf: Introduce helpers for dynamic BTF set registration")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211122144742.477787-3-memxor@gmail.com
2021-12-02 13:39:46 -08:00
Kumar Kartikeya Dwivedi d9847eb8be bpf: Make CONFIG_DEBUG_INFO_BTF depend upon CONFIG_BPF_SYSCALL
Vinicius Costa Gomes reported [0] that build fails when
CONFIG_DEBUG_INFO_BTF is enabled and CONFIG_BPF_SYSCALL is disabled.
This leads to btf.c not being compiled, and then no symbol being present
in vmlinux for the declarations in btf.h. Since BTF is not useful
without enabling BPF subsystem, disallow this combination.

However, theoretically disabling both now could still fail, as the
symbol for kfunc_btf_id_list variables is not available. This isn't a
problem as the compiler usually optimizes the whole register/unregister
call, but at lower optimization levels it can fail the build in linking
stage.

Fix that by adding dummy variables so that modules taking address of
them still work, but the whole thing is a noop.

  [0]: https://lore.kernel.org/bpf/20211110205418.332403-1-vinicius.gomes@intel.com

Fixes: 14f267d95f ("bpf: btf: Introduce helpers for dynamic BTF set registration")
Reported-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211122144742.477787-2-memxor@gmail.com
2021-12-02 13:39:46 -08:00
Frederic Weisbecker e7f2be115f sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full
getrusage(RUSAGE_THREAD) with nohz_full may return shorter utime/stime
than the actual time.

task_cputime_adjusted() snapshots utime and stime and then adjust their
sum to match the scheduler maintained cputime.sum_exec_runtime.
Unfortunately in nohz_full, sum_exec_runtime is only updated once per
second in the worst case, causing a discrepancy against utime and stime
that can be updated anytime by the reader using vtime.

To fix this situation, perform an update of cputime.sum_exec_runtime
when the cputime snapshot reports the task as actually running while
the tick is disabled. The related overhead is then contained within the
relevant situations.

Reported-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Acked-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20211026141055.57358-3-frederic@kernel.org
2021-12-02 15:08:22 +01:00
Frederic Weisbecker 53e87e3cdc timers/nohz: Last resort update jiffies on nohz_full IRQ entry
When at least one CPU runs in nohz_full mode, a dedicated timekeeper CPU
is guaranteed to stay online and to never stop its tick.

Meanwhile on some rare case, the dedicated timekeeper may be running
with interrupts disabled for a while, such as in stop_machine.

If jiffies stop being updated, a nohz_full CPU may end up endlessly
programming the next tick in the past, taking the last jiffies update
monotonic timestamp as a stale base, resulting in an tick storm.

Here is a scenario where it matters:

0) CPU 0 is the timekeeper and CPU 1 a nohz_full CPU.

1) A stop machine callback is queued to execute somewhere.

2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in
   MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty. CPU 1
   can still take IRQs.

3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward.

4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking
   last_jiffies_update as a base. But last_jiffies_update hasn't been
   updated for 2 jiffies since the timekeeper has interrupts disabled.

5) clockevents_program_event(), which relies on ktime_get(), observes
   that the expiration is in the past and therefore programs the min
   delta event on the clock.

6) The tick fires immediately, goto 3)

7) Tick storm, the nohz_full CPU is drown and takes ages to reach
   MULTI_STOP_DISABLE_IRQ, which is the only way out of this situation.

Solve this with unconditionally updating jiffies if the value is stale
on nohz_full IRQ entry. IRQs and other disturbances are expected to be
rare enough on nohz_full for the unconditional call to ktime_get() to
actually matter.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20211026141055.57358-2-frederic@kernel.org
2021-12-02 15:07:22 +01:00
Masami Hiramatsu 6bbfa44116 kprobes: Limit max data_size of the kretprobe instances
The 'kprobe::data_size' is unsigned, thus it can not be negative.  But if
user sets it enough big number (e.g. (size_t)-8), the result of 'data_size
+ sizeof(struct kretprobe_instance)' becomes smaller than sizeof(struct
kretprobe_instance) or zero. In result, the kretprobe_instance are
allocated without enough memory, and kretprobe accesses outside of
allocated memory.

To avoid this issue, introduce a max limitation of the
kretprobe::data_size. 4KB per instance should be OK.

Link: https://lkml.kernel.org/r/163836995040.432120.10322772773821182925.stgit@devnote2

Cc: stable@vger.kernel.org
Fixes: f47cd9b553 ("kprobes: kretprobe user entry-handler")
Reported-by: zhangyue <zhangyue1@kylinos.cn>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-01 21:04:34 -05:00
Chen Jun f25667e598 tracing: Fix a kmemleak false positive in tracing_map
Doing the command:
  echo 'hist:key=common_pid.execname,common_timestamp' > /sys/kernel/debug/tracing/events/xxx/trigger

Triggers many kmemleak reports:

unreferenced object 0xffff0000c7ea4980 (size 128):
  comm "bash", pid 338, jiffies 4294912626 (age 9339.324s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000f3469921>] kmem_cache_alloc_trace+0x4c0/0x6f0
    [<0000000054ca40c3>] hist_trigger_elt_data_alloc+0x140/0x178
    [<00000000633bd154>] tracing_map_init+0x1f8/0x268
    [<000000007e814ab9>] event_hist_trigger_func+0xca0/0x1ad0
    [<00000000bf8520ed>] trigger_process_regex+0xd4/0x128
    [<00000000f549355a>] event_trigger_write+0x7c/0x120
    [<00000000b80f898d>] vfs_write+0xc4/0x380
    [<00000000823e1055>] ksys_write+0x74/0xf8
    [<000000008a9374aa>] __arm64_sys_write+0x24/0x30
    [<0000000087124017>] do_el0_svc+0x88/0x1c0
    [<00000000efd0dcd1>] el0_svc+0x1c/0x28
    [<00000000dbfba9b3>] el0_sync_handler+0x88/0xc0
    [<00000000e7399680>] el0_sync+0x148/0x180
unreferenced object 0xffff0000c7ea4980 (size 128):
  comm "bash", pid 338, jiffies 4294912626 (age 9339.324s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000f3469921>] kmem_cache_alloc_trace+0x4c0/0x6f0
    [<0000000054ca40c3>] hist_trigger_elt_data_alloc+0x140/0x178
    [<00000000633bd154>] tracing_map_init+0x1f8/0x268
    [<000000007e814ab9>] event_hist_trigger_func+0xca0/0x1ad0
    [<00000000bf8520ed>] trigger_process_regex+0xd4/0x128
    [<00000000f549355a>] event_trigger_write+0x7c/0x120
    [<00000000b80f898d>] vfs_write+0xc4/0x380
    [<00000000823e1055>] ksys_write+0x74/0xf8
    [<000000008a9374aa>] __arm64_sys_write+0x24/0x30
    [<0000000087124017>] do_el0_svc+0x88/0x1c0
    [<00000000efd0dcd1>] el0_svc+0x1c/0x28
    [<00000000dbfba9b3>] el0_sync_handler+0x88/0xc0
    [<00000000e7399680>] el0_sync+0x148/0x180

The reason is elts->pages[i] is alloced by get_zeroed_page.
and kmemleak will not scan the area alloced by get_zeroed_page.
The address stored in elts->pages will be regarded as leaked.

That is, the elts->pages[i] will have pointers loaded onto it as well, and
without telling kmemleak about it, those pointers will look like memory
without a reference.

To fix this, call kmemleak_alloc to tell kmemleak to scan elts->pages[i]

Link: https://lkml.kernel.org/r/20211124140801.87121-1-chenjun102@huawei.com

Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-01 21:04:34 -05:00
Steven Rostedt (VMware) 450fec13d9 tracing/histograms: String compares should not care about signed values
When comparing two strings for the "onmatch" histogram trigger, fields
that are strings use string comparisons, which do not care about being
signed or not.

Do not fail to match two string fields if one is unsigned char array and
the other is a signed char array.

Link: https://lore.kernel.org/all/20211129123043.5cfd687a@gandalf.local.home/

Cc: stable@vgerk.kernel.org
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Yafang Shao <laoar.shao@gmail.com>
Fixes: b05e89ae7c ("tracing: Accept different type for synthetic event fields")
Reviewed-by: Masami Hiramatsu <mhiramatsu@kernel.org>
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-01 21:04:22 -05:00
Linus Torvalds 97891bbf38 A single scheduler fix to ensure that there is no stale KASAN shadow state
left on the idle task's stack when a CPU is brought up after it was brought
 down before.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmGjr0UTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoadaD/0Q3hMjI+N3AigZiBToGccafOfsmiMH
 fJ6fUM7gh4pTrGuoDQSGt02zYNYx9Zx7X8PpiuWAAIKbppiKmvniCgPMgMGARUBn
 UQ/W2XWUiu/wtleRf4JtE6VwHciNVgLdnWIazRWsjDryUXVcJwhn8J1o5K6LnwjD
 Rof/aYuVR47DprYG03OI0FD1GwlSPWMbAgB6OlJS6ZRvpq+7ergVKA0PQAY7ZZko
 vBlDU7Sq4dJ2CE4aiRGLyLNhZfrubmfeMP2UVmVSpMBta7zs+YmaYjZvKfgO3KZT
 OVbyFfDbL8FJgUmTSI1WBKq+W44o1D1e8VrKiCFj+y5w9diHW9OQEg2wqQdsMB6a
 QgNgDZjg8UHancF5O2kNYjnUVGgxUww7PftWbxkg4VAUmlCzhbZAAegspZHow0mU
 zcqDaMTky0FbcbB/Ukik/HG6J3KrR34GYjui3fe0wZHZlDim6azZucRTd+x9jRsB
 jPUlE3DW0JfNFKcMnlLLNvS8h3j7iCbb3XDv1y4BW0+EB76IsCThjqFO0dIPpiju
 T9ituTr6p4+B4U37Cz5qOMgUSha+f9/6blYG8NgCeHyD2l5HDnavO9lGhoP3jsZJ
 LJRa8mWd+oZbZlpBtTkaSOA55cTxonsIuCseTdXlfsVtzuJBmLKwdRPuDSRCEo0G
 xH1vNNUba86+6A==
 =ne0K
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Thomas Gleixner:
 "A single scheduler fix to ensure that there is no stale KASAN shadow
  state left on the idle task's stack when a CPU is brought up after it
  was brought down before"

* tag 'sched-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/scs: Reset task stack state in bringup_cpu()
2021-11-28 09:15:34 -08:00