Commit Graph

677 Commits

Author SHA1 Message Date
Cyril Hrubis 0766e7317a sched/rt: Disallow writing invalid values to sched_rt_period_us
commit 079be8fc63 upstream.

The validation of the value written to sched_rt_period_us was broken
because:

  - the sysclt_sched_rt_period is declared as unsigned int
  - parsed by proc_do_intvec()
  - the range is asserted after the value parsed by proc_do_intvec()

Because of this negative values written to the file were written into a
unsigned integer that were later on interpreted as large positive
integers which did passed the check:

  if (sysclt_sched_rt_period <= 0)
	return EINVAL;

This commit fixes the parsing by setting explicit range for both
perid_us and runtime_us into the sched_rt_sysctls table and processes
the values with proc_dointvec_minmax() instead.

Alternatively if we wanted to use full range of unsigned int for the
period value we would have to split the proc_handler and use
proc_douintvec() for it however even the
Documentation/scheduller/sched-rt-group.rst describes the range as 1 to
INT_MAX.

As far as I can tell the only problem this causes is that the sysctl
file allows writing negative values which when read back may confuse
userspace.

There is also a LTP test being submitted for these sysctl files at:

  http://patchwork.ozlabs.org/project/ltp/patch/20230901144433.2526-1-chrubis@suse.cz/

Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231002115553.3007-2-chrubis@suse.cz
[ pvorel: rebased for 5.15, 5.10 ]
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-03-01 13:21:43 +01:00
tangmeng 191f1f1f6a kernel/panic: move panic sysctls to its own file
commit 9df9186984 upstream.

kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong.  The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

All filesystem syctls now get reviewed by fs folks. This commit
follows the commit of fs, move the oops_all_cpu_backtrace sysctl to
its own file, kernel/panic.c.

Signed-off-by: tangmeng <tangmeng@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-01 08:27:20 +01:00
Linus Torvalds 48642f9431 proc: proc_skip_spaces() shouldn't think it is working on C strings
commit bce9332220 upstream.

proc_skip_spaces() seems to think it is working on C strings, and ends
up being just a wrapper around skip_spaces() with a really odd calling
convention.

Instead of basing it on skip_spaces(), it should have looked more like
proc_skip_char(), which really is the exact same function (except it
skips a particular character, rather than whitespace).  So use that as
inspiration, odd coding and all.

Now the calling convention actually makes sense and works for the
intended purpose.

Reported-and-tested-by: Kyle Zeng <zengyhkyle@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-08 11:28:45 +01:00
Linus Torvalds 3eb9213f66 proc: avoid integer type confusion in get_proc_long
commit e6cfaf34be upstream.

proc_get_long() is passed a size_t, but then assigns it to an 'int'
variable for the length.  Let's not do that, even if our IO paths are
limited to MAX_RW_COUNT (exactly because of these kinds of type errors).

So do the proper test in the rigth type.

Reported-by: Kyle Zeng <zengyhkyle@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-08 11:28:45 +01:00
Xiaoming Ni 9cb4959493 sysctl: move some boundary constants from sysctl.c to sysctl_vals
[ Upstream commit 78e36f3b0d ]

sysctl has helpers which let us specify boundary values for a min or max
int value.  Since these are used for a boundary check only they don't
change, so move these variables to sysctl_vals to avoid adding duplicate
variables.  This will help with our cleanup of kernel/sysctl.c.

[akpm@linux-foundation.org: update it for "mm/pagealloc: sysctl: change watermark_scale_factor max limit to 30%"]
[mcgrof@kernel.org: major rebase]

Link: https://lkml.kernel.org/r/20211123202347.818157-3-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-29 17:25:11 +02:00
Suren Baghdasaryan e2b6c5f7ee mm/pagealloc: sysctl: change watermark_scale_factor max limit to 30%
[ Upstream commit 39c65a94cd ]

For embedded systems with low total memory, having to run applications
with relatively large memory requirements, 10% max limitation for
watermark_scale_factor poses an issue of triggering direct reclaim every
time such application is started.  This results in slow application
startup times and bad end-user experience.

By increasing watermark_scale_factor max limit we allow vendors more
flexibility to choose the right level of kswapd aggressiveness for their
device and workload requirements.

Link: https://lkml.kernel.org/r/20211124193604.2758863-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Fengfei Xi <xi.fengfei@h3c.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-29 17:25:11 +02:00
Muchun Song a06248fa62 mm: sysctl: fix missing numa_stat when !CONFIG_HUGETLB_PAGE
[ Upstream commit 43b5240ca6 ]

"numa_stat" should not be included in the scope of CONFIG_HUGETLB_PAGE, if
CONFIG_HUGETLB_PAGE is not configured even if CONFIG_NUMA is configured,
"numa_stat" is missed form /proc. Move it out of CONFIG_HUGETLB_PAGE to
fix it.

Fixes: 4518085e12 ("mm, sysctl: make NUMA stats configurable")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-21 21:24:32 +02:00
Kuniyuki Iwashima a716a3846c sysctl: Fix data-races in proc_dointvec_ms_jiffies().
[ Upstream commit 7d1025e559 ]

A sysctl variable is accessed concurrently, and there is always a chance
of data-race.  So, all readers and writers need some basic protection to
avoid load/store-tearing.

This patch changes proc_dointvec_ms_jiffies() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side.  For now,
proc_dointvec_ms_jiffies() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-21 21:24:26 +02:00
Kuniyuki Iwashima e58b02e445 sysctl: Fix data-races in proc_dou8vec_minmax().
[ Upstream commit 7dee5d7747 ]

A sysctl variable is accessed concurrently, and there is always a chance
of data-race.  So, all readers and writers need some basic protection to
avoid load/store-tearing.

This patch changes proc_dou8vec_minmax() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side.  For now,
proc_dou8vec_minmax() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.

Fixes: cb94441306 ("sysctl: add proc_dou8vec_minmax()")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-21 21:24:26 +02:00
Kuniyuki Iwashima 67623d290d sysctl: Fix data races in proc_dointvec_jiffies().
[ Upstream commit e877820877 ]

A sysctl variable is accessed concurrently, and there is always a chance
of data-race.  So, all readers and writers need some basic protection to
avoid load/store-tearing.

This patch changes proc_dointvec_jiffies() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side.  For now,
proc_dointvec_jiffies() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-21 21:24:20 +02:00
Kuniyuki Iwashima dcdf3c3c58 sysctl: Fix data races in proc_doulongvec_minmax().
[ Upstream commit c31bcc8fb8 ]

A sysctl variable is accessed concurrently, and there is always a chance
of data-race.  So, all readers and writers need some basic protection to
avoid load/store-tearing.

This patch changes proc_doulongvec_minmax() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side.  For now,
proc_doulongvec_minmax() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-21 21:24:20 +02:00
Kuniyuki Iwashima 40e0477a73 sysctl: Fix data races in proc_douintvec_minmax().
[ Upstream commit 2d3b559df3 ]

A sysctl variable is accessed concurrently, and there is always a chance
of data-race.  So, all readers and writers need some basic protection to
avoid load/store-tearing.

This patch changes proc_douintvec_minmax() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side.  For now,
proc_douintvec_minmax() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.

Fixes: 61d9b56a89 ("sysctl: add unsigned int range support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-21 21:24:20 +02:00
Kuniyuki Iwashima 32d7f8da82 sysctl: Fix data races in proc_dointvec_minmax().
[ Upstream commit f613d86d01 ]

A sysctl variable is accessed concurrently, and there is always a chance
of data-race.  So, all readers and writers need some basic protection to
avoid load/store-tearing.

This patch changes proc_dointvec_minmax() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side.  For now,
proc_dointvec_minmax() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-21 21:24:20 +02:00
Kuniyuki Iwashima d335db59f7 sysctl: Fix data races in proc_douintvec().
[ Upstream commit 4762b532ec ]

A sysctl variable is accessed concurrently, and there is always a chance
of data-race.  So, all readers and writers need some basic protection to
avoid load/store-tearing.

This patch changes proc_douintvec() to use READ_ONCE() and WRITE_ONCE()
internally to fix data-races on the sysctl side.  For now, proc_douintvec()
itself is tolerant to a data-race, but we still need to add annotations on
the other subsystem's side.

Fixes: e7d316a02f ("sysctl: handle error writing UINT_MAX to u32 fields")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-21 21:24:20 +02:00
Kuniyuki Iwashima 279bf2a909 sysctl: Fix data races in proc_dointvec().
[ Upstream commit 1f1be04b4d ]

A sysctl variable is accessed concurrently, and there is always a chance
of data-race.  So, all readers and writers need some basic protection to
avoid load/store-tearing.

This patch changes proc_dointvec() to use READ_ONCE() and WRITE_ONCE()
internally to fix data-races on the sysctl side.  For now, proc_dointvec()
itself is tolerant to a data-race, but we still need to add annotations on
the other subsystem's side.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-21 21:24:19 +02:00
Josh Poimboeuf 316e4a1652 x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reporting
commit 44a3918c82 upstream.

With unprivileged eBPF enabled, eIBRS (without retpoline) is vulnerable
to Spectre v2 BHB-based attacks.

When both are enabled, print a warning message and report it in the
'spectre_v2' sysfs vulnerabilities file.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
[fllinden@amazon.com: backported to 5.15]
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11 12:22:31 +01:00
Linus Torvalds 14726903c8 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "173 patches.

  Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
  pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
  bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
  hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
  oom-kill, migration, ksm, percpu, vmstat, and madvise)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
  mm/madvise: add MADV_WILLNEED to process_madvise()
  mm/vmstat: remove unneeded return value
  mm/vmstat: simplify the array size calculation
  mm/vmstat: correct some wrong comments
  mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
  selftests: vm: add COW time test for KSM pages
  selftests: vm: add KSM merging time test
  mm: KSM: fix data type
  selftests: vm: add KSM merging across nodes test
  selftests: vm: add KSM zero page merging test
  selftests: vm: add KSM unmerge test
  selftests: vm: add KSM merge test
  mm/migrate: correct kernel-doc notation
  mm: wire up syscall process_mrelease
  mm: introduce process_mrelease system call
  memblock: make memblock_find_in_range method private
  mm/mempolicy.c: use in_task() in mempolicy_slab_node()
  mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
  mm/mempolicy: advertise new MPOL_PREFERRED_MANY
  mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
  ...
2021-09-03 10:08:28 -07:00
Charan Teja Reddy 65d759c8f9 mm: compaction: support triggering of proactive compaction by user
The proactive compaction[1] gets triggered for every 500msec and run
compaction on the node for COMPACTION_HPAGE_ORDER (usually order-9) pages
based on the value set to sysctl.compaction_proactiveness.  Triggering the
compaction for every 500msec in search of COMPACTION_HPAGE_ORDER pages is
not needed for all applications, especially on the embedded system
usecases which may have few MB's of RAM.  Enabling the proactive
compaction in its state will endup in running almost always on such
systems.

Other side, proactive compaction can still be very much useful for getting
a set of higher order pages in some controllable manner(controlled by
using the sysctl.compaction_proactiveness).  So, on systems where enabling
the proactive compaction always may proove not required, can trigger the
same from user space on write to its sysctl interface.  As an example, say
app launcher decide to launch the memory heavy application which can be
launched fast if it gets more higher order pages thus launcher can prepare
the system in advance by triggering the proactive compaction from
userspace.

This triggering of proactive compaction is done on a write to
sysctl.compaction_proactiveness by user.

[1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=facdaa917c4d5a376d09d25865f5a863f906234a

[akpm@linux-foundation.org: tweak vm.rst, per Mike]

Link: https://lkml.kernel.org/r/1627653207-12317-1-git-send-email-charante@codeaurora.org
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:17 -07:00
Jia He a2071573d6 sysctl: introduce new proc handler proc_dobool
This is to let bool variable could be correctly displayed in
big/little endian sysctl procfs. sizeof(bool) is arch dependent,
proc_dobool should work in all arches.

Suggested-by: Pan Xinhui <xinhui@linux.vnet.ibm.com>
Signed-off-by: Jia He <hejianet@gmail.com>
[thuth: rebased the patch to the current kernel version]
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-17 11:47:53 -04:00
Linus Torvalds 71bd934101 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "190 patches.

  Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
  vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
  migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
  zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
  core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
  signals, exec, kcov, selftests, compress/decompress, and ipc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
  ipc/util.c: use binary search for max_idx
  ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
  ipc: use kmalloc for msg_queue and shmid_kernel
  ipc sem: use kvmalloc for sem_undo allocation
  lib/decompressors: remove set but not used variabled 'level'
  selftests/vm/pkeys: exercise x86 XSAVE init state
  selftests/vm/pkeys: refill shadow register after implicit kernel write
  selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
  selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
  kcov: add __no_sanitize_coverage to fix noinstr for all architectures
  exec: remove checks in __register_bimfmt()
  x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
  hfsplus: report create_date to kstat.btime
  hfsplus: remove unnecessary oom message
  nilfs2: remove redundant continue statement in a while-loop
  kprobes: remove duplicated strong free_insn_page in x86 and s390
  init: print out unknown kernel parameters
  checkpatch: do not complain about positive return values starting with EPOLL
  checkpatch: improve the indented label test
  checkpatch: scripts/spdxcheck.py now requires python3
  ...
2021-07-02 12:08:10 -07:00
Andy Shevchenko f39650de68 kernel.h: split out panic and oops helpers
kernel.h is being used as a dump for all kinds of stuff for a long time.
Here is the attempt to start cleaning it up by splitting out panic and
oops helpers.

There are several purposes of doing this:
- dropping dependency in bug.h
- dropping a loop by moving out panic_notifier.h
- unload kernel.h from something which has its own domain

At the same time convert users tree-wide to use new headers, although for
the time being include new header back to kernel.h to avoid twisted
indirected includes for existing users.

[akpm@linux-foundation.org: thread_info.h needs limits.h]
[andriy.shevchenko@linux.intel.com: ia64 fix]
  Link: https://lkml.kernel.org/r/20210520130557.55277-1-andriy.shevchenko@linux.intel.com

Link: https://lkml.kernel.org/r/20210511074137.33666-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Co-developed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Sebastian Reichel <sre@kernel.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:04 -07:00
Jiapeng Chong 9a52c5f3c8 sysctl: remove redundant assignment to first
Variable first is set to '0', but this value is never read as it is not
used later on, hence it is a redundant assignment and can be removed.

Clean up the following clang-analyzer warning:

kernel/sysctl.c:1562:4: warning: Value stored to 'first' is never read
[clang-analyzer-deadcode.DeadStores].

Link: https://lkml.kernel.org/r/1620469990-22182-1-git-send-email-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:04 -07:00
Linus Torvalds df668a5fe4 for-5.14/block-2021-06-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDbXAwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr0HEADDJaSgjpnWQwH1RVLNagJa9KnktxZYsEs+
 as3QmDdpKRG3rEC9bdE7FLe/xq3WBaO5j1hTQ9P6IguqLyS1Df72DtTlKyaCrZoe
 zv9eIlY4lZUfksE2nzWmlN9uG0FBVXeEQpHCLSNbUZeK1zvV6+NNhQqw2kc0sEqu
 hReUFeMUbsMcu/w5T3XMVJNsTMCql9wta2H0q5hONQyJQSrIwa1D+sUdE5I8fO4j
 bnoYX9yxHX26EztX1UJiGRgoq5Trz7LY7hAfljKSkewpFwiHE2vBdq2L0C2RKsIV
 tTs2DjMCMQyPNeA7WAG8HlR4aPG+7+/fuBP1KJHkykjWXglWN7OqISuBv6rrBgQs
 gNRnZ4qmb1CzD6aLEBk59nHt6po6eMxXIW856YktKy8rKcrgK29qP44Z+oomkPKo
 ZjQ0wqN5CvpObM/dIKxl9bAJ4zQDHBt49d5nTTQLfWl/mgevu6ZNWD/hONyCQmFy
 zKKqQ/wkxWHutOsjC5/MKNb3ZRNH9tt9X+HfULO2DU6IqqifYw/ex4z4MVsBopJC
 7pPfd81kgC73TgXe1AaCwHqNWsrqYCuTK0ew1CtGudlS3lucMwtap4GBiCgg5gbu
 M8pEgwO4OcCLHyRUc8zdfqI7HumbprbFmojPkwGSEe0ofVD74lMhzbUj5jvTYY2B
 t8D2XcgyOA==
 =lhon
 -----END PGP SIGNATURE-----

Merge tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:

 - disk events cleanup (Christoph)

 - gendisk and request queue allocation simplifications (Christoph)

 - bdev_disk_changed cleanups (Christoph)

 - IO priority improvements (Bart)

 - Chained bio completion trace fix (Edward)

 - blk-wbt fixes (Jan)

 - blk-wbt enable/disable fix (Zhang)

 - Scheduler dispatch improvements (Jan, Ming)

 - Shared tagset scheduler improvements (John)

 - BFQ updates (Paolo, Luca, Pietro)

 - BFQ lock inversion fix (Jan)

 - Documentation improvements (Kir)

 - CLONE_IO block cgroup fix (Tejun)

 - Remove of ancient and deprecated block dump feature (zhangyi)

 - Discard merge fix (Ming)

 - Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas,
   Yang)

* tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits)
  block: fix discard request merge
  block/mq-deadline: Remove a WARN_ON_ONCE() call
  blk-mq: update hctx->dispatch_busy in case of real scheduler
  blk: Fix lock inversion between ioc lock and bfqd lock
  bfq: Remove merged request already in bfq_requests_merged()
  block: pass a gendisk to bdev_disk_changed
  block: move bdev_disk_changed
  block: add the events* attributes to disk_attrs
  block: move the disk events code to a separate file
  block: fix trace completion for chained bio
  block/partitions/msdos: Fix typo inidicator -> indicator
  block, bfq: reset waker pointer with shared queues
  block, bfq: check waker only for queues with no in-flight I/O
  block, bfq: avoid delayed merge of async queues
  block, bfq: boost throughput by extending queue-merging times
  block, bfq: consider also creation time in delayed stable merge
  block, bfq: fix delayed stable merge check
  block, bfq: let also stably merged queues enjoy weight raising
  blk-wbt: make sure throttle is enabled properly
  blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled()
  ...
2021-06-30 12:12:56 -07:00
Linus Torvalds 65090f30ab Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "191 patches.

  Subsystems affected by this patch series: kthread, ia64, scripts,
  ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
  slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
  mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
  pagealloc, and memory-failure)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits)
  mm,hwpoison: make get_hwpoison_page() call get_any_page()
  mm,hwpoison: send SIGBUS with error virutal address
  mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
  mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
  mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
  mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
  docs: remove description of DISCONTIGMEM
  arch, mm: remove stale mentions of DISCONIGMEM
  mm: remove CONFIG_DISCONTIGMEM
  m68k: remove support for DISCONTIGMEM
  arc: remove support for DISCONTIGMEM
  arc: update comment about HIGHMEM implementation
  alpha: remove DISCONTIGMEM and NUMA
  mm/page_alloc: move free_the_page
  mm/page_alloc: fix counting of managed_pages
  mm/page_alloc: improve memmap_pages dbg msg
  mm: drop SECTION_SHIFT in code comments
  mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
  mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
  mm/page_alloc: scale the number of pages that are batch freed
  ...
2021-06-29 17:29:11 -07:00
Mel Gorman 74f4482209 mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
This introduces a new sysctl vm.percpu_pagelist_high_fraction.  It is
similar to the old vm.percpu_pagelist_fraction.  The old sysctl increased
both pcp->batch and pcp->high with the higher pcp->high potentially
reducing zone->lock contention.  However, the higher pcp->batch value also
potentially increased allocation latency while the PCP was refilled.  This
sysctl only adjusts pcp->high so that zone->lock contention is potentially
reduced but allocation latency during a PCP refill remains the same.

  # grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  649
              batch: 63

  # sysctl vm.percpu_pagelist_high_fraction=8
  # grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  35071
              batch: 63

  # sysctl vm.percpu_pagelist_high_fraction=64
              high:  4383
              batch: 63

  # sysctl vm.percpu_pagelist_high_fraction=0
              high:  649
              batch: 63

[mgorman@techsingularity.net: fix documentation]
  Link: https://lkml.kernel.org/r/20210528151010.GQ30378@techsingularity.net

Link: https://lkml.kernel.org/r/20210525080119.5455-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:55 -07:00
Mel Gorman bbbecb35a4 mm/page_alloc: delete vm.percpu_pagelist_fraction
Patch series "Calculate pcp->high based on zone sizes and active CPUs", v2.

The per-cpu page allocator (PCP) is meant to reduce contention on the zone
lock but the sizing of batch and high is archaic and neither takes the
zone size into account or the number of CPUs local to a zone.  With larger
zones and more CPUs per node, the contention is getting worse.
Furthermore, the fact that vm.percpu_pagelist_fraction adjusts both batch
and high values means that the sysctl can reduce zone lock contention but
also increase allocation latencies.

This series disassociates pcp->high from pcp->batch and then scales
pcp->high based on the size of the local zone with limited impact to
reclaim and accounting for active CPUs but leaves pcp->batch static.  It
also adapts the number of pages that can be on the pcp list based on
recent freeing patterns.

The motivation is partially to adjust to larger memory sizes but is also
driven by the fact that large batches of page freeing via release_pages()
often shows zone contention as a major part of the problem.  Another is a
bug report based on an older kernel where a multi-terabyte process can
takes several minutes to exit.  A workaround was to use
vm.percpu_pagelist_fraction to increase the pcp->high value but testing
indicated that a production workload could not use the same values because
of an increase in allocation latencies.  Unfortunately, I cannot reproduce
this test case myself as the multi-terabyte machines are in active use but
it should alleviate the problem.

The series aims to address both and partially acts as a pre-requisite.
pcp only works with order-0 which is useless for SLUB (when using high
orders) and THP (unconditionally).  To store high-order pages on PCP, the
pcp->high values need to be increased first.

This patch (of 6):

The vm.percpu_pagelist_fraction is used to increase the batch and high
limits for the per-cpu page allocator (PCP).  The intent behind the sysctl
is to reduce zone lock acquisition when allocating/freeing pages but it
has a problem.  While it can decrease contention, it can also increase
latency on the allocation side due to unreasonably large batch sizes.
This leads to games where an administrator adjusts
percpu_pagelist_fraction on the fly to work around contention and
allocation latency problems.

This series aims to alleviate the problems with zone lock contention while
avoiding the allocation-side latency problems.  For the purposes of
review, it's easier to remove this sysctl now and reintroduce a similar
sysctl later in the series that deals only with pcp->high.

Link: https://lkml.kernel.org/r/20210525080119.5455-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210525080119.5455-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:54 -07:00
Ingo Molnar a9e906b71f Merge branch 'sched/urgent' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-06-03 19:00:49 +02:00
zhangyi (F) 3af3d772f7 block_dump: remove block_dump feature
We have already delete block_dump feature in mark_inode_dirty() because
it can be replaced by tracepoints, now we also remove the part in
submit_bio() for the same reason. The part of block dump feature in
submit_bio() dump the write process, write region and sectors on the
target disk into kernel message. it can be replaced by
block_bio_queue tracepoint in submit_bio_checks(), so we do not need
block_dump anymore, remove the whole block_dump feature.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210313030146.2882027-3-yi.zhang@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-24 06:47:21 -06:00
Peter Zijlstra 0cd7c741f0 delayacct: Add sysctl to enable at runtime
Just like sched_schedstats, allow runtime enabling (and disabling) of
delayacct. This is useful if one forgot to add the delayacct boot time
option.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YJkhebGJAywaZowX@hirez.programming.kicks-ass.net
2021-05-12 11:43:25 +02:00
David S. Miller df6f823703 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2021-05-11

The following pull-request contains BPF updates for your *net* tree.

We've added 13 non-merge commits during the last 8 day(s) which contain
a total of 21 files changed, 817 insertions(+), 382 deletions(-).

The main changes are:

1) Fix multiple ringbuf bugs in particular to prevent writable mmap of
   read-only pages, from Andrii Nakryiko & Thadeu Lima de Souza Cascardo.

2) Fix verifier alu32 known-const subregister bound tracking for bitwise
   operations and/or/xor, from Daniel Borkmann.

3) Reject trampoline attachment for functions with variable arguments,
   and also add a deny list of other forbidden functions, from Jiri Olsa.

4) Fix nested bpf_bprintf_prepare() calls used by various helpers by
   switching to per-CPU buffers, from Florent Revest.

5) Fix kernel compilation with BTF debug info on ppc64 due to pahole
   missing TCP-CC functions like cubictcp_init, from Martin KaFai Lau.

6) Add a kconfig entry to provide an option to disallow unprivileged
   BPF by default, from Daniel Borkmann.

7) Fix libbpf compilation for older libelf when GELF_ST_VISIBILITY()
   macro is not available, from Arnaldo Carvalho de Melo.

8) Migrate test_tc_redirect to test_progs framework as prep work
   for upcoming skb_change_head() fix & selftest, from Jussi Maki.

9) Fix a libbpf segfault in add_dummy_ksym_var() if BTF is not
   present, from Ian Rogers.

10) Fix tx_only micro-benchmark in xdpsock BPF sample with proper frame
    size, from Magnus Karlsson.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-11 16:05:56 -07:00
Daniel Borkmann 08389d8882 bpf: Add kconfig knob for disabling unpriv bpf by default
Add a kconfig knob which allows for unprivileged bpf to be disabled by default.
If set, the knob sets /proc/sys/kernel/unprivileged_bpf_disabled to value of 2.

This still allows a transition of 2 -> {0,1} through an admin. Similarly,
this also still keeps 1 -> {1} behavior intact, so that once set to permanently
disabled, it cannot be undone aside from a reboot.

We've also added extra2 with max of 2 for the procfs handler, so that an admin
still has a chance to toggle between 0 <-> 2.

Either way, as an additional alternative, applications can make use of CAP_BPF
that we added a while ago.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/74ec548079189e4e4dffaeb42b8987bb3c852eee.1620765074.git.daniel@iogearbox.net
2021-05-11 13:56:16 -07:00
Pintu Kumar ef49843841 mm/compaction: remove unused variable sysctl_compact_memory
The sysctl_compact_memory is mostly unused in mm/compaction.c It just
acts as a place holder for sysctl to store .data.

But the .data itself is not needed here.

So we can get ride of this variable completely and make .data as NULL.
This will also eliminate the extern declaration from header file.  No
functionality is broken or changed this way.

Link: https://lkml.kernel.org/r/1614852224-14671-1-git-send-email-pintu@codeaurora.org
Signed-off-by: Pintu Kumar <pintu@codeaurora.org>
Signed-off-by: Pintu Agarwal <pintu.ping@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:24 -07:00
Linus Torvalds 9d31d23389 Networking changes for 5.13.
Core:
 
  - bpf:
 	- allow bpf programs calling kernel functions (initially to
 	  reuse TCP congestion control implementations)
 	- enable task local storage for tracing programs - remove the
 	  need to store per-task state in hash maps, and allow tracing
 	  programs access to task local storage previously added for
 	  BPF_LSM
 	- add bpf_for_each_map_elem() helper, allowing programs to
 	  walk all map elements in a more robust and easier to verify
 	  fashion
 	- sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
 	  redirection
 	- lpm: add support for batched ops in LPM trie
 	- add BTF_KIND_FLOAT support - mostly to allow use of BTF
 	  on s390 which has floats in its headers files
 	- improve BPF syscall documentation and extend the use of kdoc
 	  parsing scripts we already employ for bpf-helpers
 	- libbpf, bpftool: support static linking of BPF ELF files
 	- improve support for encapsulation of L2 packets
 
  - xdp: restructure redirect actions to avoid a runtime lookup,
 	improving performance by 4-8% in microbenchmarks
 
  - xsk: build skb by page (aka generic zerocopy xmit) - improve
 	performance of software AF_XDP path by 33% for devices
 	which don't need headers in the linear skb part (e.g. virtio)
 
  - nexthop: resilient next-hop groups - improve path stability
 	on next-hops group changes (incl. offload for mlxsw)
 
  - ipv6: segment routing: add support for IPv4 decapsulation
 
  - icmp: add support for RFC 8335 extended PROBE messages
 
  - inet: use bigger hash table for IP ID generation
 
  - tcp: deal better with delayed TX completions - make sure we don't
 	give up on fast TCP retransmissions only because driver is
 	slow in reporting that it completed transmitting the original
 
  - tcp: reorder tcp_congestion_ops for better cache locality
 
  - mptcp:
 	- add sockopt support for common TCP options
 	- add support for common TCP msg flags
 	- include multiple address ids in RM_ADDR
 	- add reset option support for resetting one subflow
 
  - udp: GRO L4 improvements - improve 'forward' / 'frag_list'
 	co-existence with UDP tunnel GRO, allowing the first to take
 	place correctly	even for encapsulated UDP traffic
 
  - micro-optimize dev_gro_receive() and flow dissection, avoid
 	retpoline overhead on VLAN and TEB GRO
 
  - use less memory for sysctls, add a new sysctl type, to allow using
 	u8 instead of "int" and "long" and shrink networking sysctls
 
  - veth: allow GRO without XDP - this allows aggregating UDP
 	packets before handing them off to routing, bridge, OvS, etc.
 
  - allow specifing ifindex when device is moved to another namespace
 
  - netfilter:
 	- nft_socket: add support for cgroupsv2
 	- nftables: add catch-all set element - special element used
 	  to define a default action in case normal lookup missed
 	- use net_generic infra in many modules to avoid allocating
 	  per-ns memory unnecessarily
 
  - xps: improve the xps handling to avoid potential out-of-bound
 	accesses and use-after-free when XPS change race with other
 	re-configuration under traffic
 
  - add a config knob to turn off per-cpu netdev refcnt to catch
 	underflows in testing
 
 Device APIs:
 
  - add WWAN subsystem to organize the WWAN interfaces better and
    hopefully start driving towards more unified and vendor-
    -independent APIs
 
  - ethtool:
 	- add interface for reading IEEE MIB stats (incl. mlx5 and
 	  bnxt support)
 	- allow network drivers to dump arbitrary SFP EEPROM data,
 	  current offset+length API was a poor fit for modern SFP
 	  which define EEPROM in terms of pages (incl. mlx5 support)
 
  - act_police, flow_offload: add support for packet-per-second
 	policing (incl. offload for nfp)
 
  - psample: add additional metadata attributes like transit delay
 	for packets sampled from switch HW (and corresponding egress
 	and policy-based sampling in the mlxsw driver)
 
  - dsa: improve support for sandwiched LAGs with bridge and DSA
 
  - netfilter:
 	- flowtable: use direct xmit in topologies with IP
 	  forwarding, bridging, vlans etc.
 	- nftables: counter hardware offload support
 
  - Bluetooth:
 	- improvements for firmware download w/ Intel devices
 	- add support for reading AOSP vendor capabilities
 	- add support for virtio transport driver
 
  - mac80211:
 	- allow concurrent monitor iface and ethernet rx decap
 	- set priority and queue mapping for injected frames
 
  - phy: add support for Clause-45 PHY Loopback
 
  - pci/iov: add sysfs MSI-X vector assignment interface
 	to distribute MSI-X resources to VFs (incl. mlx5 support)
 
 New hardware/drivers:
 
  - dsa: mv88e6xxx: add support for Marvell mv88e6393x -
 	11-port Ethernet switch with 8x 1-Gigabit Ethernet
 	and 3x 10-Gigabit interfaces.
 
  - dsa: support for legacy Broadcom tags used on BCM5325, BCM5365
 	and BCM63xx switches
 
  - Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches
 
  - ath11k: support for QCN9074 a 802.11ax device
 
  - Bluetooth: Broadcom BCM4330 and BMC4334
 
  - phy: Marvell 88X2222 transceiver support
 
  - mdio: add BCM6368 MDIO mux bus controller
 
  - r8152: support RTL8153 and RTL8156 (USB Ethernet) chips
 
  - mana: driver for Microsoft Azure Network Adapter (MANA)
 
  - Actions Semi Owl Ethernet MAC
 
  - can: driver for ETAS ES58X CAN/USB interfaces
 
 Pure driver changes:
 
  - add XDP support to: enetc, igc, stmmac
  - add AF_XDP support to: stmmac
 
  - virtio:
 	- page_to_skb() use build_skb when there's sufficient tailroom
 	  (21% improvement for 1000B UDP frames)
 	- support XDP even without dedicated Tx queues - share the Tx
 	  queues with the stack when necessary
 
  - mlx5:
 	- flow rules: add support for mirroring with conntrack,
 	  matching on ICMP, GTP, flex filters and more
 	- support packet sampling with flow offloads
 	- persist uplink representor netdev across eswitch mode
 	  changes
 	- allow coexistence of CQE compression and HW time-stamping
 	- add ethtool extended link error state reporting
 
  - ice, iavf: support flow filters, UDP Segmentation Offload
 
  - dpaa2-switch:
 	- move the driver out of staging
 	- add spanning tree (STP) support
 	- add rx copybreak support
 	- add tc flower hardware offload on ingress traffic
 
  - ionic:
 	- implement Rx page reuse
 	- support HW PTP time-stamping
 
  - octeon: support TC hardware offloads - flower matching on ingress
 	and egress ratelimitting.
 
  - stmmac:
 	- add RX frame steering based on VLAN priority in tc flower
 	- support frame preemption (FPE)
 	- intel: add cross time-stamping freq difference adjustment
 
  - ocelot:
 	- support forwarding of MRP frames in HW
 	- support multiple bridges
 	- support PTP Sync one-step timestamping
 
  - dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
 	learning, flooding etc.
 
  - ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
 	SC7280 SoCs)
 
  - mt7601u: enable TDLS support
 
  - mt76:
 	- add support for 802.3 rx frames (mt7915/mt7615)
 	- mt7915 flash pre-calibration support
 	- mt7921/mt7663 runtime power management fixes
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmCKFPIACgkQMUZtbf5S
 Irtw0g/+NA8bWdHNgG4H5rya0pv2z3IieLRmSdDfKRQQXcJpklawc5MKVVaTee/Q
 5/QqgPdCsu1LAU6JXBKsKmyDDaMlQKdWuKbOqDSiAQKoMesZStTEHf9d851ZzgxA
 Cdb6O7BD3lBl/IN+oxNG+KcmD1LKquTPKGySq2mQtEdLO12ekAsranzmj4voKffd
 q9tBShpXQ7Dq77DLYfiQXVCvsizNcbbJFuxX0o9Lpb9+61ZyYAbogZSa9ypiZZwR
 I/9azRBtJg7UV1aD/cLuAfy66Qh7t63+rCxVazs5Os8jVO26P/jQdisnnOe/x+p9
 wYEmKm3GSu0V4SAPxkWW+ooKusflCeqDoMIuooKt6kbP6BRj540veGw3Ww/m5YFr
 7pLQkTSP/tSjuGQIdBE1LOP5LBO8DZeC8Kiop9V0fzAW9hFSZbEq25WW0bPj8QQO
 zA4Z7yWlslvxcfY2BdJX3wD8klaINkl/8fDWZFFsBdfFX2VeLtm7Xfduw34BJpvU
 rYT3oWr6PhtkPAKR32SUcemSfeWgIVU41eSshzRz3kez1NngBUuLlSGGSEaKbes5
 pZVt6pYFFVByyf6MTHFEoQvafZfEw04JILZpo4R5V8iTHzom0kD3Py064sBiXEw2
 B6t+OW4qgcxGblpFkK2lD4kR2s1TPUs0ckVO6sAy1x8q60KKKjY=
 =vcbA
 -----END PGP SIGNATURE-----

Merge tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - bpf:
        - allow bpf programs calling kernel functions (initially to
          reuse TCP congestion control implementations)
        - enable task local storage for tracing programs - remove the
          need to store per-task state in hash maps, and allow tracing
          programs access to task local storage previously added for
          BPF_LSM
        - add bpf_for_each_map_elem() helper, allowing programs to walk
          all map elements in a more robust and easier to verify fashion
        - sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
          redirection
        - lpm: add support for batched ops in LPM trie
        - add BTF_KIND_FLOAT support - mostly to allow use of BTF on
          s390 which has floats in its headers files
        - improve BPF syscall documentation and extend the use of kdoc
          parsing scripts we already employ for bpf-helpers
        - libbpf, bpftool: support static linking of BPF ELF files
        - improve support for encapsulation of L2 packets

   - xdp: restructure redirect actions to avoid a runtime lookup,
     improving performance by 4-8% in microbenchmarks

   - xsk: build skb by page (aka generic zerocopy xmit) - improve
     performance of software AF_XDP path by 33% for devices which don't
     need headers in the linear skb part (e.g. virtio)

   - nexthop: resilient next-hop groups - improve path stability on
     next-hops group changes (incl. offload for mlxsw)

   - ipv6: segment routing: add support for IPv4 decapsulation

   - icmp: add support for RFC 8335 extended PROBE messages

   - inet: use bigger hash table for IP ID generation

   - tcp: deal better with delayed TX completions - make sure we don't
     give up on fast TCP retransmissions only because driver is slow in
     reporting that it completed transmitting the original

   - tcp: reorder tcp_congestion_ops for better cache locality

   - mptcp:
        - add sockopt support for common TCP options
        - add support for common TCP msg flags
        - include multiple address ids in RM_ADDR
        - add reset option support for resetting one subflow

   - udp: GRO L4 improvements - improve 'forward' / 'frag_list'
     co-existence with UDP tunnel GRO, allowing the first to take place
     correctly even for encapsulated UDP traffic

   - micro-optimize dev_gro_receive() and flow dissection, avoid
     retpoline overhead on VLAN and TEB GRO

   - use less memory for sysctls, add a new sysctl type, to allow using
     u8 instead of "int" and "long" and shrink networking sysctls

   - veth: allow GRO without XDP - this allows aggregating UDP packets
     before handing them off to routing, bridge, OvS, etc.

   - allow specifing ifindex when device is moved to another namespace

   - netfilter:
        - nft_socket: add support for cgroupsv2
        - nftables: add catch-all set element - special element used to
          define a default action in case normal lookup missed
        - use net_generic infra in many modules to avoid allocating
          per-ns memory unnecessarily

   - xps: improve the xps handling to avoid potential out-of-bound
     accesses and use-after-free when XPS change race with other
     re-configuration under traffic

   - add a config knob to turn off per-cpu netdev refcnt to catch
     underflows in testing

  Device APIs:

   - add WWAN subsystem to organize the WWAN interfaces better and
     hopefully start driving towards more unified and vendor-
     independent APIs

   - ethtool:
        - add interface for reading IEEE MIB stats (incl. mlx5 and bnxt
          support)
        - allow network drivers to dump arbitrary SFP EEPROM data,
          current offset+length API was a poor fit for modern SFP which
          define EEPROM in terms of pages (incl. mlx5 support)

   - act_police, flow_offload: add support for packet-per-second
     policing (incl. offload for nfp)

   - psample: add additional metadata attributes like transit delay for
     packets sampled from switch HW (and corresponding egress and
     policy-based sampling in the mlxsw driver)

   - dsa: improve support for sandwiched LAGs with bridge and DSA

   - netfilter:
        - flowtable: use direct xmit in topologies with IP forwarding,
          bridging, vlans etc.
        - nftables: counter hardware offload support

   - Bluetooth:
        - improvements for firmware download w/ Intel devices
        - add support for reading AOSP vendor capabilities
        - add support for virtio transport driver

   - mac80211:
        - allow concurrent monitor iface and ethernet rx decap
        - set priority and queue mapping for injected frames

   - phy: add support for Clause-45 PHY Loopback

   - pci/iov: add sysfs MSI-X vector assignment interface to distribute
     MSI-X resources to VFs (incl. mlx5 support)

  New hardware/drivers:

   - dsa: mv88e6xxx: add support for Marvell mv88e6393x - 11-port
     Ethernet switch with 8x 1-Gigabit Ethernet and 3x 10-Gigabit
     interfaces.

   - dsa: support for legacy Broadcom tags used on BCM5325, BCM5365 and
     BCM63xx switches

   - Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches

   - ath11k: support for QCN9074 a 802.11ax device

   - Bluetooth: Broadcom BCM4330 and BMC4334

   - phy: Marvell 88X2222 transceiver support

   - mdio: add BCM6368 MDIO mux bus controller

   - r8152: support RTL8153 and RTL8156 (USB Ethernet) chips

   - mana: driver for Microsoft Azure Network Adapter (MANA)

   - Actions Semi Owl Ethernet MAC

   - can: driver for ETAS ES58X CAN/USB interfaces

  Pure driver changes:

   - add XDP support to: enetc, igc, stmmac

   - add AF_XDP support to: stmmac

   - virtio:
        - page_to_skb() use build_skb when there's sufficient tailroom
          (21% improvement for 1000B UDP frames)
        - support XDP even without dedicated Tx queues - share the Tx
          queues with the stack when necessary

   - mlx5:
        - flow rules: add support for mirroring with conntrack, matching
          on ICMP, GTP, flex filters and more
        - support packet sampling with flow offloads
        - persist uplink representor netdev across eswitch mode changes
        - allow coexistence of CQE compression and HW time-stamping
        - add ethtool extended link error state reporting

   - ice, iavf: support flow filters, UDP Segmentation Offload

   - dpaa2-switch:
        - move the driver out of staging
        - add spanning tree (STP) support
        - add rx copybreak support
        - add tc flower hardware offload on ingress traffic

   - ionic:
        - implement Rx page reuse
        - support HW PTP time-stamping

   - octeon: support TC hardware offloads - flower matching on ingress
     and egress ratelimitting.

   - stmmac:
        - add RX frame steering based on VLAN priority in tc flower
        - support frame preemption (FPE)
        - intel: add cross time-stamping freq difference adjustment

   - ocelot:
        - support forwarding of MRP frames in HW
        - support multiple bridges
        - support PTP Sync one-step timestamping

   - dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
     learning, flooding etc.

   - ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
     SC7280 SoCs)

   - mt7601u: enable TDLS support

   - mt76:
        - add support for 802.3 rx frames (mt7915/mt7615)
        - mt7915 flash pre-calibration support
        - mt7921/mt7663 runtime power management fixes"

* tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2451 commits)
  net: selftest: fix build issue if INET is disabled
  net: netrom: nr_in: Remove redundant assignment to ns
  net: tun: Remove redundant assignment to ret
  net: phy: marvell: add downshift support for M88E1240
  net: dsa: ksz: Make reg_mib_cnt a u8 as it never exceeds 255
  net/sched: act_ct: Remove redundant ct get and check
  icmp: standardize naming of RFC 8335 PROBE constants
  bpf, selftests: Update array map tests for per-cpu batched ops
  bpf: Add batched ops support for percpu array
  bpf: Implement formatted output helpers with bstr_printf
  seq_file: Add a seq_bprintf function
  sfc: adjust efx->xdp_tx_queue_count with the real number of initialized queues
  net:nfc:digital: Fix a double free in digital_tg_recv_dep_req
  net: fix a concurrency bug in l2tp_tunnel_register()
  net/smc: Remove redundant assignment to rc
  mpls: Remove redundant assignment to err
  llc2: Remove redundant assignment to rc
  net/tls: Remove redundant initialization of record
  rds: Remove redundant assignment to nr_sig
  dt-bindings: net: mdio-gpio: add compatible for microchip,mdio-smi0
  ...
2021-04-29 11:57:23 -07:00
Linus Torvalds 3644286f6c \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmCJUfIACgkQnJ2qBz9k
 QNkStAf8CA7beya7LZ/GGN7HzXhv2cs+IpUFhRkynLklEM0lxKsOEagLFSZxkoMD
 IBSRSo4odkkderqI9W/yp+9OYhOd9+BQCq4isg1Gh9Tf5xANJEpLvBAPnWVhooJs
 9CrYZQY9Bdf+fF/8GHbKlrMAYm56vBCmWqyWTEtWUyPBOA12in2ZHQJmCa+5+nge
 zTT/B5cvuhN5K7uYhGM4YfeCU5DBmmvD4sV6YBTkQOgCU0bEF0f9R3JjHDo34a1s
 yqna3ypqKNRhsJVs8F+aOGRieUYxFoRqtYNHZK3qI9i07v7ndoTm5jzGN6OFlKs3
 U3rF9/+cBgeESahWG6IjHIqhXGXNhg==
 =KjNm
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:

 - support for limited fanotify functionality for unpriviledged users

 - faster merging of fanotify events

 - a few smaller fsnotify improvements

* tag 'fsnotify_for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  shmem: allow reporting fanotify events with file handles on tmpfs
  fs: introduce a wrapper uuid_to_fsid()
  fanotify_user: use upper_32_bits() to verify mask
  fanotify: support limited functionality for unprivileged users
  fanotify: configurable limits via sysfs
  fanotify: limit number of event merge attempts
  fsnotify: use hash table for faster events merge
  fanotify: mix event info and pid into merge key hash
  fanotify: reduce event objectid to 29-bit hash
  fsnotify: allow fsnotify_{peek,remove}_first_event with empty queue
2021-04-29 11:06:13 -07:00
Peter Zijlstra 8a99b6833c sched: Move SCHED_DEBUG sysctl to debugfs
Stop polluting sysctl with undocumented knobs that really are debug
only, move them all to /debug/sched/ along with the existing
/debug/sched_* files that already exist.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210412102001.287610138@infradead.org
2021-04-16 17:06:34 +02:00
Peter Zijlstra 1d1c2509de sched: Remove sched_schedstats sysctl out from under SCHED_DEBUG
CONFIG_SCHEDSTATS does not depend on SCHED_DEBUG, it is inconsistent
to have the sysctl depend on it.

Suggested-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210412102001.161151631@infradead.org
2021-04-16 17:06:33 +02:00
Mel Gorman b7cc6ec744 sched/numa: Allow runtime enabling/disabling of NUMA balance without SCHED_DEBUG
The ability to enable/disable NUMA balancing is not a debugging feature
and should not depend on CONFIG_SCHED_DEBUG.  For example, machines within
a HPC cluster may disable NUMA balancing temporarily for some jobs and
re-enable it for other jobs without needing to reboot.

This patch removes the dependency on CONFIG_SCHED_DEBUG for
kernel.numa_balancing sysctl. The other numa balancing related sysctls
are left as-is because if they need to be tuned then it is more likely
that NUMA balancing needs to be fixed instead.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210324133916.GQ15768@suse.de
2021-04-16 17:06:33 +02:00
Eric Dumazet cb94441306 sysctl: add proc_dou8vec_minmax()
Networking has many sysctls that could fit in one u8.

This patch adds proc_dou8vec_minmax() for this purpose.

Note that the .extra1 and .extra2 fields are pointing
to integers, because it makes conversions easier.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 17:39:33 -07:00
Amir Goldstein 5b8fea65d1 fanotify: configurable limits via sysfs
fanotify has some hardcoded limits. The only APIs to escape those limits
are FAN_UNLIMITED_QUEUE and FAN_UNLIMITED_MARKS.

Allow finer grained tuning of the system limits via sysfs tunables under
/proc/sys/fs/fanotify, similar to tunables under /proc/sys/fs/inotify,
with some minor differences.

- max_queued_events - global system tunable for group queue size limit.
  Like the inotify tunable with the same name, it defaults to 16384 and
  applies on initialization of a new group.

- max_user_marks - user ns tunable for marks limit per user.
  Like the inotify tunable named max_user_watches, on a machine with
  sufficient RAM and it defaults to 1048576 in init userns and can be
  further limited per containing user ns.

- max_user_groups - user ns tunable for number of groups per user.
  Like the inotify tunable named max_user_instances, it defaults to 128
  in init userns and can be further limited per containing user ns.

The slightly different tunable names used for fanotify are derived from
the "group" and "mark" terminology used in the fanotify man pages and
throughout the code.

Considering the fact that the default value for max_user_instances was
increased in kernel v5.10 from 8192 to 1048576, leaving the legacy
fanotify limit of 8192 marks per group in addition to the max_user_marks
limit makes little sense, so the per group marks limit has been removed.

Note that when a group is initialized with FAN_UNLIMITED_MARKS, its own
marks are not accounted in the per user marks account, so in effect the
limit of max_user_marks is only for the collection of groups that are
not initialized with FAN_UNLIMITED_MARKS.

Link: https://lore.kernel.org/r/20210304112921.3996419-2-amir73il@gmail.com
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-03-16 16:49:31 +01:00
Lin Feng 3b3376f222 sysctl.c: fix underflow value setting risk in vm_table
Apart from subsystem specific .proc_handler handler, all ctl_tables with
extra1 and extra2 members set should use proc_dointvec_minmax instead of
proc_dointvec, or the limit set in extra* never work and potentially echo
underflow values(negative numbers) is likely make system unstable.

Especially vfs_cache_pressure and zone_reclaim_mode, -1 is apparently not
a valid value, but we can set to them.  And then kernel may crash.

# echo -1 > /proc/sys/vm/vfs_cache_pressure

Link: https://lkml.kernel.org/r/20201223105535.2875-1-linf@wangsu.com
Signed-off-by: Lin Feng <linf@wangsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26 09:41:03 -08:00
chao dfe564045c rcu: Panic after fixed number of stalls
Some stalls are transient, so that system fully recovers.  This commit
therefore allows users to configure the number of stalls that must happen
in order to trigger kernel panic.

Signed-off-by: chao <chao@eero.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:37:16 -08:00
Linus Torvalds 5ef64cc898 mm: allow a controlled amount of unfairness in the page lock
Commit 2a9127fcf2 ("mm: rewrite wait_on_page_bit_common() logic") made
the page locking entirely fair, in that if a waiter came in while the
lock was held, the lock would be transferred to the lockers strictly in
order.

That was intended to finally get rid of the long-reported watchdog
failures that involved the page lock under extreme load, where a process
could end up waiting essentially forever, as other page lockers stole
the lock from under it.

It also improved some benchmarks, but it ended up causing huge
performance regressions on others, simply because fair lock behavior
doesn't end up giving out the lock as aggressively, causing better
worst-case latency, but potentially much worse average latencies and
throughput.

Instead of reverting that change entirely, this introduces a controlled
amount of unfairness, with a sysctl knob to tune it if somebody needs
to.  But the default value should hopefully be good for any normal load,
allowing a few rounds of lock stealing, but enforcing the strict
ordering before the lock has been stolen too many times.

There is also a hint from Matthieu Baerts that the fair page coloring
may end up exposing an ABBA deadlock that is hidden by the usual
optimistic lock stealing, and while the unfairness doesn't fix the
fundamental issue (and I'm still looking at that), it avoids it in
practice.

The amount of unfairness can be modified by writing a new value to the
'sysctl_page_lock_unfairness' variable (default value of 5, exposed
through /proc/sys/vm/page_lock_unfairness), but that is hopefully
something we'd use mainly for debugging rather than being necessary for
any deep system tuning.

This whole issue has exposed just how critical the page lock can be, and
how contended it gets under certain locks.  And the main contention
doesn't really seem to be anything related to IO (which was the origin
of this lock), but for things like just verifying that the page file
mapping is stable while faulting in the page into a page table.

Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-17 10:26:41 -07:00
Tobias Klauser 7787b6fc93 bpf, sysctl: Let bpf_stats_handler take a kernel pointer buffer
Commit 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
changed ctl_table.proc_handler to take a kernel pointer. Adjust the
signature of bpf_stats_handler to match ctl_table.proc_handler which
fixes the following sparse warning:

kernel/sysctl.c:226:49: warning: incorrect type in argument 3 (different address spaces)
kernel/sysctl.c:226:49:    expected void *
kernel/sysctl.c:226:49:    got void [noderef] __user *buffer
kernel/sysctl.c:2640:35: warning: incorrect type in initializer (incompatible argument 3 (different address spaces))
kernel/sysctl.c:2640:35:    expected int ( [usertype] *proc_handler )( ... )
kernel/sysctl.c:2640:35:    got int ( * )( ... )

Fixes: 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/bpf/20200824142047.22043-1-tklauser@distanz.ch
2020-08-24 21:11:40 -07:00
Nitin Gupta d34c0a7599 mm: use unsigned types for fragmentation score
Proactive compaction uses per-node/zone "fragmentation score" which is
always in range [0, 100], so use unsigned type of these scores as well as
for related constants.

Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:56 -07:00
Nitin Gupta facdaa917c mm: proactive compaction
For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented.  Linux kernel currently does on-demand compaction as
we request more hugepages, but this style of compaction incurs very high
latency.  Experiments with one-time full memory compaction (followed by
hugepage allocations) show that kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within <1 sec
for a 32G system.  Such data suggests that a more proactive compaction can
help us allocate a large fraction of memory as hugepages keeping
allocation latencies low.

For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
external fragmentation which kcompactd tries to maintain.

The tunable takes a value in range [0, 100], with a default of 20.

Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}), but this one reduces them to
just one sysctl.  Also, the new tunable is an opaque value instead of
asking for specific bounds of "external fragmentation", which would have
been difficult to estimate.  The internal interpretation of this opaque
value allows for future fine-tuning.

Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
The score for a node is defined as weighted mean of per-zone external
fragmentation.  A zone's present_pages determines its weight.

To periodically check per-node score, we reuse per-node kcompactd threads,
which are woken up every 500 milliseconds to check the same.  If a node's
score exceeds its high threshold (as derived from user-provided
proactiveness value), proactive compaction is started until its score
reaches its low threshold value.  By default, proactiveness is set to 20,
which implies threshold values of low=80 and high=90.

This patch is largely based on ideas from Michal Hocko [2].  See also the
LWN article [3].

Performance data
================

System: x64_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch

echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

Before starting the driver, the system was fragmented from a userspace
program that allocates all memory and then for each 2M aligned section,
frees 3/4 of base pages using munmap.  The workload is mainly anonymous
userspace pages, which are easy to move around.  I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.

1. Kernel hugepage allocation latencies

With the system in such a fragmented state, a kernel driver then allocates
as many hugepages as possible and measures allocation latency:

(all latency values are in microseconds)

- With vanilla 5.6.0-rc3

  percentile latency
  –––––––––– –––––––
	   5    7894
	  10    9496
	  25   12561
	  30   15295
	  40   18244
	  50   21229
	  60   27556
	  75   30147
	  80   31047
	  90   32859
	  95   33799

Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)

- With 5.6.0-rc3 + this patch, with proactiveness=20

sysctl -w vm.compaction_proactiveness=20

  percentile latency
  –––––––––– –––––––
	   5       2
	  10       2
	  25       3
	  30       3
	  40       3
	  50       4
	  60       4
	  75       4
	  80       4
	  90       5
	  95     429

Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)

2. JAVA heap allocation

In this test, we first fragment memory using the same method as for (1).

Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages.  We also set THP to madvise to
allow hugepage backing of this heap.

/usr/bin/time
 java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

The above command allocates 700G of Java heap using hugepages.

- With vanilla 5.6.0-rc3

17.39user 1666.48system 27:37.89elapsed

- With 5.6.0-rc3 + this patch, with proactiveness=20

8.35user 194.58system 3:19.62elapsed

Elapsed time remains around 3:15, as proactiveness is further increased.

Note that proactive compaction happens throughout the runtime of these
workloads.  The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.

In the above Java workload, proactiveness is set to 20.  The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction.  As t he benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80).  Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark.  kcompactd threads consume 100% of one of
the CPUs while it tries to bring a node's score within thresholds.

Backoff behavior
================

Above workloads produce a memory state which is easy to compact.  However,
if memory is filled with unmovable pages, proactive compaction should
essentially back off.  To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
  with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nitin Gupta <ngupta@nitingupta.dev>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:56 -07:00
Feng Tang 56f3547bfa mm: adjust vm_committed_as_batch according to vm overcommit policy
When checking a performance change for will-it-scale scalability mmap test
[1], we found very high lock contention for spinlock of percpu counter
'vm_committed_as':

    94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
    48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
    45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;

Actually this heavy lock contention is not always necessary.  The
'vm_committed_as' needs to be very precise when the strict
OVERCOMMIT_NEVER policy is set, which requires a rather small batch number
for the percpu counter.

So keep 'batch' number unchanged for strict OVERCOMMIT_NEVER policy, and
lift it to 64X for OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS policies.  Also
add a sysctl handler to adjust it when the policy is reconfigured.

Benchmark with the same testcase in [1] shows 53% improvement on a 8C/16T
desktop, and 2097%(20X) on a 4S/72C/144T server.  We tested with test
platforms in 0day (server, desktop and laptop), and 80%+ platforms shows
improvements with that test.  And whether it shows improvements depends on
if the test mmap size is bigger than the batch number computed.

And if the lift is 16X, 1/3 of the platforms will show improvements,
though it should help the mmap/unmap usage generally, as Michal Hocko
mentioned:

: I believe that there are non-synthetic worklaods which would benefit from
: a larger batch.  E.g.  large in memory databases which do large mmaps
: during startups from multiple threads.

[1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/

Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Qian Cai <cai@lca.pw>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: kernel test robot <rong.a.chen@intel.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1589611660-89854-4-git-send-email-feng.tang@intel.com
Link: http://lkml.kernel.org/r/1592725000-73486-4-git-send-email-feng.tang@intel.com
Link: http://lkml.kernel.org/r/1594389708-60781-5-git-send-email-feng.tang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 11:33:26 -07:00
Qais Yousef 13685c4a08 sched/uclamp: Add a new sysctl to control RT default boost value
RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
value.

This is also referred to as 'the default boost value of RT tasks'.

See commit 1a00d99997 ("sched/uclamp: Set default clamps for RT tasks").

On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.

For example, a mobile device manufacturer where big.LITTLE architecture
is dominant, the performance of the little cores varies across SoCs, and
on high end ones the big cores could be too power hungry.

Given the diversity of SoCs, the new knob allows manufactures to tune
the best performance/power for RT tasks for the particular hardware they
run on.

They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.

The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.

Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which created by the Android framework.

To control the default behavior globally by system admins and device
integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default
to change the default boost value of the RT tasks.

I anticipate this to be mostly in the form of modifying the init script
of a particular device.

To avoid polluting the fast path with unnecessary code, the approach
taken is to synchronously do the update by traversing all the existing
tasks in the system. This could race with a concurrent fork(), which is
dealt with by introducing sched_post_fork() function which will ensure
the racy fork will get the right update applied.

Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.

With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.

[1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com
2020-07-29 13:51:47 +02:00
Peter Zijlstra b4098bfc5e sched/deadline: Impose global limits on sched_attr::sched_period
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726161357.397880775@infradead.org
2020-06-15 14:10:04 +02:00
Rafael Aquini e77132e758 kernel/sysctl.c: ignore out-of-range taint bits introduced via kernel.tainted
Users with SYS_ADMIN capability can add arbitrary taint flags to the
running kernel by writing to /proc/sys/kernel/tainted or issuing the
command 'sysctl -w kernel.tainted=...'.  This interface, however, is
open for any integer value and this might cause an invalid set of flags
being committed to the tainted_mask bitset.

This patch introduces a simple way for proc_taint() to ignore any
eventual invalid bit coming from the user input before committing those
bits to the kernel tainted_mask.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Link: http://lkml.kernel.org/r/20200512223946.888020-1-aquini@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:56 -07:00
Guilherme G. Piccoli 60c958d8df panic: add sysctl to dump all CPUs backtraces on oops event
Usually when the kernel reaches an oops condition, it's a point of no
return; in case not enough debug information is available in the kernel
splat, one of the last resorts would be to collect a kernel crash dump
and analyze it.  The problem with this approach is that in order to
collect the dump, a panic is required (to kexec-load the crash kernel).
When in an environment of multiple virtual machines, users may prefer to
try living with the oops, at least until being able to properly shutdown
their VMs / finish their important tasks.

This patch implements a way to collect a bit more debug details when an
oops event is reached, by printing all the CPUs backtraces through the
usage of NMIs (on architectures that support that).  The sysctl added
(and documented) here was called "oops_all_cpu_backtrace", and when set
will (as the name suggests) dump all CPUs backtraces.

Far from ideal, this may be the last option though for users that for
some reason cannot panic on oops.  Most of times oopses are clear enough
to indicate the kernel portion that must be investigated, but in virtual
environments it's possible to observe hypervisor/KVM issues that could
lead to oopses shown in other guests CPUs (like virtual APIC crashes).
This patch hence aims to help debug such complex issues without
resorting to kdump.

Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200327224116.21030-1-gpiccoli@canonical.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:56 -07:00