Pull vfs mount updates from Al Viro:
"The first part of mount updates.
Convert filesystems to use the new mount API"
* 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
mnt_init(): call shmem_init() unconditionally
constify ksys_mount() string arguments
don't bother with registering rootfs
init_rootfs(): don't bother with init_ramfs_fs()
vfs: Convert smackfs to use the new mount API
vfs: Convert selinuxfs to use the new mount API
vfs: Convert securityfs to use the new mount API
vfs: Convert apparmorfs to use the new mount API
vfs: Convert openpromfs to use the new mount API
vfs: Convert xenfs to use the new mount API
vfs: Convert gadgetfs to use the new mount API
vfs: Convert oprofilefs to use the new mount API
vfs: Convert ibmasmfs to use the new mount API
vfs: Convert qib_fs/ipathfs to use the new mount API
vfs: Convert efivarfs to use the new mount API
vfs: Convert configfs to use the new mount API
vfs: Convert binfmt_misc to use the new mount API
convenience helper: get_tree_single()
convenience helper get_tree_nodev()
vfs: Kill sget_userns()
...
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 7635d9cbe8 ("mm, thp, proc: report THP eligibility for each
vma") introduced THPeligible bit for processes' smaps. But, when
checking the eligibility for shmem vma, __transparent_hugepage_enabled()
is called to override the result from shmem_huge_enabled(). It may
result in the anonymous vma's THP flag override shmem's. For example,
running a simple test which create THP for shmem, but with anonymous THP
disabled, when reading the process's smaps, it may show:
7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
Size: 4096 kB
...
[snip]
...
ShmemPmdMapped: 4096 kB
...
[snip]
...
THPeligible: 0
And, /proc/meminfo does show THP allocated and PMD mapped too:
ShmemHugePages: 4096 kB
ShmemPmdMapped: 4096 kB
This doesn't make too much sense. The shmem objects should be treated
separately from anonymous THP. Calling shmem_huge_enabled() with
checking MMF_DISABLE_THP sounds good enough. And, we could skip stack
and dax vma check since we already checked if the vma is shmem already.
Also check if vma is suitable for THP by calling
transhuge_vma_suitable().
And minor fix to smaps output format and documentation.
Link: http://lkml.kernel.org/r/1560401041-32207-3-git-send-email-yang.shi@linux.alibaba.com
Fixes: 7635d9cbe8 ("mm, thp, proc: report THP eligibility for each vma")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge more updates from Andrew Morton:
"VM:
- z3fold fixes and enhancements by Henry Burns and Vitaly Wool
- more accurate reclaimed slab caches calculations by Yafang Shao
- fix MAP_UNINITIALIZED UAPI symbol to not depend on config, by
Christoph Hellwig
- !CONFIG_MMU fixes by Christoph Hellwig
- new novmcoredd parameter to omit device dumps from vmcore, by
Kairui Song
- new test_meminit module for testing heap and pagealloc
initialization, by Alexander Potapenko
- ioremap improvements for huge mappings, by Anshuman Khandual
- generalize kprobe page fault handling, by Anshuman Khandual
- device-dax hotplug fixes and improvements, by Pavel Tatashin
- enable synchronous DAX fault on powerpc, by Aneesh Kumar K.V
- add pte_devmap() support for arm64, by Robin Murphy
- unify locked_vm accounting with a helper, by Daniel Jordan
- several misc fixes
core/lib:
- new typeof_member() macro including some users, by Alexey Dobriyan
- make BIT() and GENMASK() available in asm, by Masahiro Yamada
- changed LIST_POISON2 on x86_64 to 0xdead000000000122 for better
code generation, by Alexey Dobriyan
- rbtree code size optimizations, by Michel Lespinasse
- convert struct pid count to refcount_t, by Joel Fernandes
get_maintainer.pl:
- add --no-moderated switch to skip moderated ML's, by Joe Perches
misc:
- ptrace PTRACE_GET_SYSCALL_INFO interface
- coda updates
- gdb scripts, various"
[ Using merge message suggestion from Vlastimil Babka, with some editing - Linus ]
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (100 commits)
fs/select.c: use struct_size() in kmalloc()
mm: add account_locked_vm utility function
arm64: mm: implement pte_devmap support
mm: introduce ARCH_HAS_PTE_DEVMAP
mm: clean up is_device_*_page() definitions
mm/mmap: move common defines to mman-common.h
mm: move MAP_SYNC to asm-generic/mman-common.h
device-dax: "Hotremove" persistent memory that is used like normal RAM
mm/hotplug: make remove_memory() interface usable
device-dax: fix memory and resource leak if hotplug fails
include/linux/lz4.h: fix spelling and copy-paste errors in documentation
ipc/mqueue.c: only perform resource calculation if user valid
include/asm-generic/bug.h: fix "cut here" for WARN_ON for __WARN_TAINT architectures
scripts/gdb: add helpers to find and list devices
scripts/gdb: add lx-genpd-summary command
drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl
kernel/pid.c: convert struct pid count to refcount_t
drivers/rapidio/devices/rio_mport_cdev.c: NUL terminate some strings
select: shift restore_saved_sigmask_unless() into poll_select_copy_remaining()
select: change do_poll() to return -ERESTARTNOHAND rather than -EINTR
...
Normally, the inode's i_uid/i_gid are translated relative to s_user_ns,
but this is not a correct behavior for proc. Since sysctl permission
check in test_perm is done against GLOBAL_ROOT_[UG]ID, it makes more
sense to use these values in u_[ug]id of proc inodes. In other words:
although uid/gid in the inode is not read during test_perm, the inode
logically belongs to the root of the namespace. I have confirmed this
with Eric Biederman at LPC and in this thread:
https://lore.kernel.org/lkml/87k1kzjdff.fsf@xmission.com
Consequences
============
Since the i_[ug]id values of proc nodes are not used for permissions
checks, this change usually makes no functional difference. However, it
causes an issue in a setup where:
* a namespace container is created without root user in container -
hence the i_[ug]id of proc nodes are set to INVALID_[UG]ID
* container creator tries to configure it by writing /proc/sys files,
e.g. writing /proc/sys/kernel/shmmax to configure shared memory limit
Kernel does not allow to open an inode for writing if its i_[ug]id are
invalid, making it impossible to write shmmax and thus - configure the
container.
Using a container with no root mapping is apparently rare, but we do use
this configuration at Google. Also, we use a generic tool to configure
the container limits, and the inability to write any of them causes a
failure.
History
=======
The invalid uids/gids in inodes first appeared due to 8175435777 (fs:
Update i_[ug]id_(read|write) to translate relative to s_user_ns).
However, AFAIK, this did not immediately cause any issues. The
inability to write to these "invalid" inodes was only caused by a later
commit 0bd23d09b8 (vfs: Don't modify inodes with a uid or gid unknown
to the vfs).
Tested: Used a repro program that creates a user namespace without any
mapping and stat'ed /proc/$PID/root/proc/sys/kernel/shmmax from outside.
Before the change, it shows the overflow uid, with the change it's 0.
The overflow uid indicates that the uid in the inode is not correct and
thus it is not possible to open the file for writing.
Link: http://lkml.kernel.org/r/20190708115130.250149-1-rburny@google.com
Fixes: 0bd23d09b8 ("vfs: Don't modify inodes with a uid or gid unknown to the vfs")
Signed-off-by: Radoslaw Burny <rburny@google.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: John Sperbeck <jsperbeck@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't repeat function signatures twice.
This is a kind-of-precursor for "struct proc_ops".
Note:
typeof(pde->proc_fops->...) ...;
can't be used because ->proc_fops is "const struct file_operations *".
"const" prevents assignment down the code and it can't be deleted in the
type system.
Link: http://lkml.kernel.org/r/20190529191110.GB5703@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 2724273e8f ("vmcore: add API to collect hardware dump in
second kernel"), drivers are allowed to add device related dump data to
vmcore as they want by using the device dump API. This has a potential
issue, the data is stored in memory, drivers may append too much data
and use too much memory. The vmcore is typically used in a kdump kernel
which runs in a pre-reserved small chunk of memory. So as a result it
will make kdump unusable at all due to OOM issues.
So introduce new 'novmcoredd' command line option. User can disable
device dump to reduce memory usage. This is helpful if device dump is
using too much memory, disabling device dump could make sure a regular
vmcore without device dump data is still available.
[akpm@linux-foundation.org: tweak documentation]
[akpm@linux-foundation.org: vmcore.c needs moduleparam.h]
Link: http://lkml.kernel.org/r/20190528111856.7276-1-kasong@redhat.com
Signed-off-by: Kairui Song <kasong@redhat.com>
Acked-by: Dave Young <dyoung@redhat.com>
Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Cc: "David S . Miller" <davem@davemloft.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+QmuaPwR3wnBdVwACF8+vY7k4RUFAl0tpocACgkQCF8+vY7k
4RWoxA//b/fmDXP3WPzrjjSmpyB9ml0/epKzPbT5S2j0lftqKBmet29k+PCjVrTx
Nq2QauehY9ug5h8UMVUCmzPr95F0tSIGRoqk1vrn7z0K3q6k1SHrtvqbY1Bgb2Uk
Qvh2YFU4fQLJg8WAbExCjxCdbdmBKQVGKTwCtM+tP5OMxwAFOmQrjGaUaKCKIIA2
7Wzrx8CpSji+bJ3uK/d36c+4M9oDly5eaxBhoboL3BI0y+GqwiSASGwTO7BxrPOg
0wq5IZHnqS8+bprT9xQdDOqf+UOY9U1cxE/+sqsHxblfUEx9gfLy/R+FLmJn+SS9
Z3yLy4SqVHQMpWBjEAGodohikF60PAuTdymSC11jqFaKCUxWrIZg5xO+0blMrxPF
7vYIexutCkaBMHBlNaNsHIqB7B/2FGGKoN7QW64hwvwJCGvF7OmJcV+R4bROGvh4
nFuis9/Nm66Fq7I3aw37ThyZ0aWZdaQ0QJTH9ksxU/ZCz2hhMNYu/rXggrDvkS4U
nr77ZT5Gd7nj4b110zf8+99uiGiinY6hTfzPAuTCLBhaxwrv4/xDHAhpwdEB5T4j
8gOkxV8c0XWtL7sKqhGJvs/RRe2za0Y9XH6fyxsYfWcfuLjEvug8ouXMad9gxFWH
DL3WnKJEMGLScei2wux4kGOwEbkR1bUf2cHJfh3GpCB/y8vgLOc=
=smxY
-----END PGP SIGNATURE-----
Merge tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull rst conversion of docs from Mauro Carvalho Chehab:
"As agreed with Jon, I'm sending this big series directly to you, c/c
him, as this series required a special care, in order to avoid
conflicts with other trees"
* tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (77 commits)
docs: kbuild: fix build with pdf and fix some minor issues
docs: block: fix pdf output
docs: arm: fix a breakage with pdf output
docs: don't use nested tables
docs: gpio: add sysfs interface to the admin-guide
docs: locking: add it to the main index
docs: add some directories to the main documentation index
docs: add SPDX tags to new index files
docs: add a memory-devices subdir to driver-api
docs: phy: place documentation under driver-api
docs: serial: move it to the driver-api
docs: driver-api: add remaining converted dirs to it
docs: driver-api: add xilinx driver API documentation
docs: driver-api: add a series of orphaned documents
docs: admin-guide: add a series of orphaned documents
docs: cgroup-v1: add it to the admin-guide book
docs: aoe: add it to the driver-api book
docs: add some documentation dirs to the driver-api book
docs: driver-model: move it to the driver-api book
docs: lp855x-driver.rst: add it to the driver-api book
...
This fixes two problems reported with the cmdline simplification and
cleanup last year:
- the setproctitle() special cases didn't quite match the original
semantics, and it can be noticeable:
https://lore.kernel.org/lkml/alpine.LNX.2.21.1904052326230.3249@kich.toxcorp.com/
- it could leak an uninitialized byte from the temporary buffer under
the right (wrong) circustances:
https://lore.kernel.org/lkml/20190712160913.17727-1-izbyshev@ispras.ru/
It rewrites the logic entirely, splitting it into two separate commits
(and two separate functions) for the two different cases ("unedited
cmdline" vs "setproctitle() has been used to change the command line").
* proc-cmdline:
/proc/<pid>/cmdline: add back the setproctitle() special case
/proc/<pid>/cmdline: remove all the special cases
This makes the setproctitle() special case very explicit indeed, and
handles it with a separate helper function entirely. In the process, it
re-instates the original semantics of simply stopping at the first NUL
character when the original last NUL character is no longer there.
[ The original semantics can still be seen in mm/util.c: get_cmdline()
that is limited to a fixed-size buffer ]
This makes the logic about when we use the string lengths etc much more
obvious, and makes it easier to see what we do and what the two very
different cases are.
Note that even when we allow walking past the end of the argument array
(because the setproctitle() might have overwritten and overflowed the
original argv[] strings), we only allow it when it overflows into the
environment region if it is immediately adjacent.
[ Fixed for missing 'count' checks noted by Alexey Izbyshev ]
Link: https://lore.kernel.org/lkml/alpine.LNX.2.21.1904052326230.3249@kich.toxcorp.com/
Fixes: 5ab8271899 ("fs/proc: simplify and clarify get_mm_cmdline() function")
Cc: Jakub Jankowski <shasta@toxcorp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alexey Izbyshev <izbyshev@ispras.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Start off with a clean slate that only reads exactly from arg_start to
arg_end, without any oddities. This simplifies the code and in the
process removes the case that caused us to potentially leak an
uninitialized byte from the temporary kernel buffer.
Note that in order to start from scratch with an understandable base,
this simplifies things _too_ much, and removes all the legacy logic to
handle setproctitle() having changed the argument strings.
We'll add back those special cases very differently in the next commit.
Link: https://lore.kernel.org/lkml/20190712160913.17727-1-izbyshev@ispras.ru/
Fixes: f5b65348fd ("proc: fix missing final NUL in get_mm_cmdline() rewrite")
Cc: Alexey Izbyshev <izbyshev@ispras.ru>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The stuff under sysctl describes /sys interface from userspace
point of view. So, add it to the admin-guide and remove the
:orphan: from its index file.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Improvements and bug fixes for the hmm interface in the kernel:
- Improve clarity, locking and APIs related to the 'hmm mirror' feature
merged last cycle. In linux-next we now see AMDGPU and nouveau to be
using this API.
- Remove old or transitional hmm APIs. These are hold overs from the past
with no users, or APIs that existed only to manage cross tree conflicts.
There are still a few more of these cleanups that didn't make the merge
window cut off.
- Improve some core mm APIs:
* export alloc_pages_vma() for driver use
* refactor into devm_request_free_mem_region() to manage
DEVICE_PRIVATE resource reservations
* refactor duplicative driver code into the core dev_pagemap
struct
- Remove hmm wrappers of improved core mm APIs, instead have drivers use
the simplified API directly
- Remove DEVICE_PUBLIC
- Simplify the kconfig flow for the hmm users and core code
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl0k1zkACgkQOG33FX4g
mxrO+w//QF/yI/9Hh30RWEBq8W107cODkDlaT0Z/7cVEXfGetZzIUpqzxnJofRfQ
xTw1XmYkc9WpJe/mTTuFZFewNQwWuMM6X0Xi25fV438/Y64EclevlcJTeD49TIH1
CIMsz8bX7CnCEq5sz+UypLg9LPnaD9L/JLyuSbyjqjms/o+yzqa7ji7p/DSINuhZ
Qva9OZL1ZSEDJfNGi8uGpYBqryHoBAonIL12R9sCF5pbJEnHfWrH7C06q7AWOAjQ
4vjN/p3F4L9l/v2IQ26Kn/S0AhmN7n3GT//0K66e2gJPfXa8fxRKGuFn/Kd79EGL
YPASn5iu3cM23up1XkbMNtzacL8yiIeTOcMdqw26OaOClojy/9OJduv5AChe6qL/
VUQIAn1zvPsJTyC5U7mhmkrGuTpP6ivHpxtcaUp+Ovvi1cyK40nLCmSNvLnbN5ES
bxbb0SjE4uupDG5qU6Yct/hFp6uVMSxMqXZOb9Xy8ZBkbMsJyVOLj71G1/rVIfPU
hO1AChX5CRG1eJoMo6oBIpiwmSvcOaPp3dqIOQZvwMOqrO869LR8qv7RXyh/g9gi
FAEKnwLl4GK3YtEO4Kt/1YI5DXYjSFUbfgAs0SPsRKS6hK2+RgRk2M/B/5dAX0/d
lgOf9WPODPwiSXBYLtJB8qHVDX0DIY8faOyTx6BYIKClUtgbBI8=
=wKvp
-----END PGP SIGNATURE-----
Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull HMM updates from Jason Gunthorpe:
"Improvements and bug fixes for the hmm interface in the kernel:
- Improve clarity, locking and APIs related to the 'hmm mirror'
feature merged last cycle. In linux-next we now see AMDGPU and
nouveau to be using this API.
- Remove old or transitional hmm APIs. These are hold overs from the
past with no users, or APIs that existed only to manage cross tree
conflicts. There are still a few more of these cleanups that didn't
make the merge window cut off.
- Improve some core mm APIs:
- export alloc_pages_vma() for driver use
- refactor into devm_request_free_mem_region() to manage
DEVICE_PRIVATE resource reservations
- refactor duplicative driver code into the core dev_pagemap
struct
- Remove hmm wrappers of improved core mm APIs, instead have drivers
use the simplified API directly
- Remove DEVICE_PUBLIC
- Simplify the kconfig flow for the hmm users and core code"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
mm: remove the HMM config option
mm: sort out the DEVICE_PRIVATE Kconfig mess
mm: simplify ZONE_DEVICE page private data
mm: remove hmm_devmem_add
mm: remove hmm_vma_alloc_locked_page
nouveau: use devm_memremap_pages directly
nouveau: use alloc_page_vma directly
PCI/P2PDMA: use the dev_pagemap internal refcount
device-dax: use the dev_pagemap internal refcount
memremap: provide an optional internal refcount in struct dev_pagemap
memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
memremap: remove the data field in struct dev_pagemap
memremap: add a migrate_to_ram method to struct dev_pagemap_ops
memremap: lift the devmap_enable manipulation into devm_memremap_pages
memremap: pass a struct dev_pagemap to ->kill and ->cleanup
memremap: move dev_pagemap callbacks into a separate structure
memremap: validate the pagemap type passed to devm_memremap_pages
mm: factor out a devm_request_free_mem_region helper
mm: export alloc_pages_vma
...
oom_unkillable_task() can be called from three different contexts i.e.
global OOM, memcg OOM and oom_score procfs interface. At the moment
oom_unkillable_task() does a task_in_mem_cgroup() check on the given
process. Since there is no reason to perform task_in_mem_cgroup()
check for global OOM and oom_score procfs interface, those contexts
provide NULL memcg and skips the task_in_mem_cgroup() check. However
for memcg OOM context, the oom_unkillable_task() is always called from
mem_cgroup_scan_tasks() and thus task_in_mem_cgroup() check becomes
redundant and effectively dead code. So, just remove the
task_in_mem_cgroup() check altogether.
Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vmalloc() is getting more and more used these days (kernel stacks, bpf and
percpu allocator are new top users), and the total % of memory consumed by
vmalloc() can be pretty significant and changes dynamically.
/proc/meminfo is the best place to display this information: its top goal
is to show top consumers of the memory.
Since the VmallocUsed field in /proc/meminfo is not in use for quite a
long time (it has been defined to 0 by a5ad88ce8c ("mm: get rid of
'vmalloc_info' from /proc/meminfo")), let's reuse it for showing the
actual physical memory consumption of vmalloc().
Link: http://lkml.kernel.org/r/20190417194002.12369-3-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Report separate components (anon, file, and shmem) for PSS in
smaps_rollup.
This helps understand and tune the memory manager behavior in consumer
devices, particularly mobile devices. Many of them (e.g. chromebooks and
Android-based devices) use zram for anon memory, and perform disk reads
for discarded file pages. The difference in latency is large (e.g.
reading a single page from SSD is 30 times slower than decompressing a
zram page on one popular device), thus it is useful to know how much of
the PSS is anon vs. file.
All the information is already present in /proc/pid/smaps, but much more
expensive to obtain because of the large size of that procfs entry.
This patch also removes a small code duplication in smaps_account, which
would have gotten worse otherwise.
Also updated Documentation/filesystems/proc.txt (the smaps section was a
bit stale, and I added a smaps_rollup section) and
Documentation/ABI/testing/procfs-smaps_rollup.
[semenzato@chromium.org: v5]
Link: http://lkml.kernel.org/r/20190626234333.44608-1-semenzato@chromium.org
Link: http://lkml.kernel.org/r/20190626180429.174569-1-semenzato@chromium.org
Signed-off-by: Luigi Semenzato <semenzato@chromium.org>
Acked-by: Yu Zhao <yuzhao@chromium.org>
Cc: Sonny Rao <sonnyrao@chromium.org>
Cc: Yu Zhao <yuzhao@chromium.org>
Cc: Brian Geffon <bgeffon@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do not remain stuck forever if something goes wrong. Using a killable
lock permits cleanup of stuck tasks and simplifies investigation.
It seems ->d_revalidate() could return any error (except ECHILD) to abort
validation and pass error as result of lookup sequence.
[akpm@linux-foundation.org: fix proc_map_files_lookup() return value, per Andrei]
Link: http://lkml.kernel.org/r/156007493995.3335.9595044802115356911.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do not remain stuck forever if something goes wrong. Using a killable
lock permits cleanup of stuck tasks and simplifies investigation.
Replace the only unkillable mmap_sem lock in clear_refs_write().
Link: http://lkml.kernel.org/r/156007493826.3335.5424884725467456239.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do not remain stuck forever if something goes wrong. Using a killable
lock permits cleanup of stuck tasks and simplifies investigation.
This function is also used for /proc/pid/smaps.
Link: http://lkml.kernel.org/r/156007493160.3335.14447544314127417266.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x865 kdump updates from Thomas Gleixner:
"Yet more kexec/kdump updates:
- Properly support kexec when AMD's memory encryption (SME) is
enabled
- Pass reserved e820 ranges to the kexec kernel so both PCI and SME
can work"
* 'x86-kdump-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fs/proc/vmcore: Enable dumping of encrypted memory when SEV was active
x86/kexec: Set the C-bit in the identity map page table when SEV is active
x86/kexec: Do not map kexec area as decrypted when SEV is active
x86/crash: Add e820 reserved ranges to kdump kernel's e820 table
x86/mm: Rework ioremap resource mapping determination
x86/e820, ioport: Add a new I/O resource descriptor IORES_DESC_RESERVED
x86/mm: Create a workarea in the kernel for SME early encryption
x86/mm: Identify the end of the kernel area to be reserved
Pull x86 AVX512 status update from Ingo Molnar:
"This adds a new ABI that the main scheduler probably doesn't want to
deal with but HPC job schedulers might want to use: the
AVX512_elapsed_ms field in the new /proc/<pid>/arch_status task status
file, which allows the user-space job scheduler to cluster such tasks,
to avoid turbo frequency drops"
* 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation/filesystems/proc.txt: Add arch_status file
x86/process: Add AVX-512 usage elapsed time to /proc/pid/arch_status
proc: Add /proc/<pid>/arch_status
Pull scheduler updates from Ingo Molnar:
- Remove the unused per rq load array and all its infrastructure, by
Dietmar Eggemann.
- Add utilization clamping support by Patrick Bellasi. This is a
refinement of the energy aware scheduling framework with support for
boosting of interactive and capping of background workloads: to make
sure critical GUI threads get maximum frequency ASAP, and to make
sure background processing doesn't unnecessarily move to cpufreq
governor to higher frequencies and less energy efficient CPU modes.
- Add the bare minimum of tracepoints required for LISA EAS regression
testing, by Qais Yousef - which allows automated testing of various
power management features, including energy aware scheduling.
- Restructure the former tsk_nr_cpus_allowed() facility that the -rt
kernel used to modify the scheduler's CPU affinity logic such as
migrate_disable() - introduce the task->cpus_ptr value instead of
taking the address of &task->cpus_allowed directly - by Sebastian
Andrzej Siewior.
- Misc optimizations, fixes, cleanups and small enhancements - see the
Git log for details.
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
sched/uclamp: Add uclamp support to energy_compute()
sched/uclamp: Add uclamp_util_with()
sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks
sched/uclamp: Set default clamps for RT tasks
sched/uclamp: Reset uclamp values on RESET_ON_FORK
sched/uclamp: Extend sched_setattr() to support utilization clamping
sched/core: Allow sched_setattr() to use the current policy
sched/uclamp: Add system default clamps
sched/uclamp: Enforce last task's UCLAMP_MAX
sched/uclamp: Add bucket local max tracking
sched/uclamp: Add CPU's clamp buckets refcounting
sched/fair: Rename weighted_cpuload() to cpu_runnable_load()
sched/debug: Export the newly added tracepoints
sched/debug: Add sched_overutilized tracepoint
sched/debug: Add new tracepoint to track PELT at se level
sched/debug: Add new tracepoints to track PELT at rq level
sched/debug: Add a new sched_trace_*() helper functions
sched/autogroup: Make autogroup_path() always available
sched/wait: Deduplicate code with do-while
sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
...
The code hasn't been used since it was added to the tree, and doesn't
appear to actually be usable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
0a1eb2d474 ("fs/proc: Stop reporting eip and esp in /proc/PID/stat")
stopped reporting eip/esp and fd7d56270b ("fs/proc: Report eip/esp in
/prod/PID/stat for coredumping") reintroduced the feature to fix a
regression with userspace core dump handlers (such as minicoredumper).
Because PF_DUMPCORE is only set for the primary thread, this didn't fix
the original problem for secondary threads. Allow reporting the eip/esp
for all threads by checking for PF_EXITING as well. This is set for all
the other threads when they are killed. coredump_wait() waits for all the
tasks to become inactive before proceeding to invoke a core dumper.
Link: http://lkml.kernel.org/r/87y32p7i7a.fsf@linutronix.de
Link: http://lkml.kernel.org/r/20190522161614.628-1-jlu@pengutronix.de
Fixes: fd7d56270b ("fs/proc: Report eip/esp in /prod/PID/stat for coredumping")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reported-by: Jan Luebbe <jlu@pengutronix.de>
Tested-by: Jan Luebbe <jlu@pengutronix.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the d_is_dir() check from tgid_pidfd_to_pid().
It is pointless since you should never get &proc_tgid_base_operations
for f_op on a non-directory.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <christian@brauner.io>
In the kdump kernel, the memory of the first kernel gets to be dumped
into a vmcore file.
Similarly to SME kdump, if SEV was enabled in the first kernel, the old
memory has to be remapped encrypted in order to access it properly.
Commit
992b649a3f ("kdump, proc/vmcore: Enable kdumping encrypted memory with SME enabled")
took care of the SME case but it uses sme_active() which checks for SME
only. Use mem_encrypt_active() instead, which returns true when either
SME or SEV is active.
Unlike SME, the second kernel images (kernel and initrd) are loaded into
encrypted memory when SEV is active, hence the kernel elf header must be
remapped as encrypted in order to access it properly.
[ bp: Massage commit message. ]
Co-developed-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: bhe@redhat.com
Cc: dyoung@redhat.com
Cc: Ganesh Goudar <ganeshgr@chelsio.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: kexec@lists.infradead.org
Cc: linux-fsdevel@vger.kernel.org
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: mingo@redhat.com
Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190430074421.7852-4-lijiang@redhat.com
Exposing architecture specific per process information is useful for
various reasons. An example is the AVX512 usage on x86 which is important
for task placement for power/performance optimizations.
Adding this information to the existing /prcc/pid/status file would be the
obvious choise, but it has been agreed on that a explicit arch_status file
is better in separating the generic and architecture specific information.
[ tglx: Massage changelog ]
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: peterz@infradead.org
Cc: hpa@zytor.com
Cc: ak@linux.intel.com
Cc: tim.c.chen@linux.intel.com
Cc: dave.hansen@intel.com
Cc: arjan@linux.intel.com
Cc: adobriyan@gmail.com
Cc: aubrey.li@intel.com
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Linux API <linux-api@vger.kernel.org>
Link: https://lkml.kernel.org/r/20190606012236.9391-1-aubrey.li@linux.intel.com
In commit:
4b53a3412d ("sched/core: Remove the tsk_nr_cpus_allowed() wrapper")
the tsk_nr_cpus_allowed() wrapper was removed. There was not
much difference in !RT but in RT we used this to implement
migrate_disable(). Within a migrate_disable() section the CPU mask is
restricted to single CPU while the "normal" CPU mask remains untouched.
As an alternative implementation Ingo suggested to use:
struct task_struct {
const cpumask_t *cpus_ptr;
cpumask_t cpus_mask;
};
with
t->cpus_ptr = &t->cpus_mask;
In -RT we then can switch the cpus_ptr to:
t->cpus_ptr = &cpumask_of(task_cpu(p));
in a migration disabled region. The rules are simple:
- Code that 'uses' ->cpus_allowed would use the pointer.
- Code that 'modifies' ->cpus_allowed would use the direct mask.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190423142636.14347-1-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Based on 1 normalized pattern(s):
licensed under gplv2
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 99 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190528170027.163048684@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 3029 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Proc filesystem has special locking rules for various files. Thus
fanotify which opens files on event delivery can easily deadlock
against another process that waits for fanotify permission event to be
handled. Since permission events on /proc have doubtful value anyway,
just disallow them.
Link: https://lore.kernel.org/linux-fsdevel/20190320131642.GE9485@quack2.suse.cz/
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Add SPDX license identifiers to all Make/Kconfig files which:
- Have no license information of any form
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The name clear_all_latency_tracing is misleading, in fact which only
clear per task's latency_record[], and we do have another function named
clear_global_latency_tracing which clear the global latency_record[]
buffer.
Link: http://lkml.kernel.org/r/20190226114602.16902-1-linf@wangsu.com
Signed-off-by: Lin Feng <linf@wangsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This updates each existing invalidation to use the correct mmu notifier
event that represent what is happening to the CPU page table. See the
patch which introduced the events to see the rational behind this.
Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CPU page table update can happens for many reasons, not only as a result
of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
a result of kernel activities (memory compression, reclaim, migration,
...).
Users of mmu notifier API track changes to the CPU page table and take
specific action for them. While current API only provide range of virtual
address affected by the change, not why the changes is happening.
This patchset do the initial mechanical convertion of all the places that
calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
event as well as the vma if it is know (most invalidation happens against
a given vma). Passing down the vma allows the users of mmu notifier to
inspect the new vma page protection.
The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier
should assume that every for the range is going away when that event
happens. A latter patch do convert mm call path to use a more appropriate
events for each call.
This is done as 2 patches so that no call site is forgotten especialy
as it uses this following coccinelle patch:
%<----------------------------------------------------------------------
@@
identifier I1, I2, I3, I4;
@@
static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
+enum mmu_notifier_event event,
+unsigned flags,
+struct vm_area_struct *vma,
struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }
@@
@@
-#define mmu_notifier_range_init(range, mm, start, end)
+#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)
@@
expression E1, E3, E4;
identifier I1;
@@
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, I1,
I1->vm_mm, E3, E4)
...>
@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(..., struct vm_area_struct *VMA, ...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }
@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(...) {
struct vm_area_struct *VMA;
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }
@@
expression E1, E2, E3, E4;
identifier FN;
@@
FN(...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, NULL,
E2, E3, E4)
...> }
---------------------------------------------------------------------->%
Applied with:
spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
spatch --sp-file mmu-notifier.spatch --dir mm --in-place
Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking updates from David Miller:
"Highlights:
1) Support AES128-CCM ciphers in kTLS, from Vakul Garg.
2) Add fib_sync_mem to control the amount of dirty memory we allow to
queue up between synchronize RCU calls, from David Ahern.
3) Make flow classifier more lockless, from Vlad Buslov.
4) Add PHY downshift support to aquantia driver, from Heiner
Kallweit.
5) Add SKB cache for TCP rx and tx, from Eric Dumazet. This reduces
contention on SLAB spinlocks in heavy RPC workloads.
6) Partial GSO offload support in XFRM, from Boris Pismenny.
7) Add fast link down support to ethtool, from Heiner Kallweit.
8) Use siphash for IP ID generator, from Eric Dumazet.
9) Pull nexthops even further out from ipv4/ipv6 routes and FIB
entries, from David Ahern.
10) Move skb->xmit_more into a per-cpu variable, from Florian
Westphal.
11) Improve eBPF verifier speed and increase maximum program size,
from Alexei Starovoitov.
12) Eliminate per-bucket spinlocks in rhashtable, and instead use bit
spinlocks. From Neil Brown.
13) Allow tunneling with GUE encap in ipvs, from Jacky Hu.
14) Improve link partner cap detection in generic PHY code, from
Heiner Kallweit.
15) Add layer 2 encap support to bpf_skb_adjust_room(), from Alan
Maguire.
16) Remove SKB list implementation assumptions in SCTP, your's truly.
17) Various cleanups, optimizations, and simplifications in r8169
driver. From Heiner Kallweit.
18) Add memory accounting on TX and RX path of SCTP, from Xin Long.
19) Switch PHY drivers over to use dynamic featue detection, from
Heiner Kallweit.
20) Support flow steering without masking in dpaa2-eth, from Ioana
Ciocoi.
21) Implement ndo_get_devlink_port in netdevsim driver, from Jiri
Pirko.
22) Increase the strict parsing of current and future netlink
attributes, also export such policies to userspace. From Johannes
Berg.
23) Allow DSA tag drivers to be modular, from Andrew Lunn.
24) Remove legacy DSA probing support, also from Andrew Lunn.
25) Allow ll_temac driver to be used on non-x86 platforms, from Esben
Haabendal.
26) Add a generic tracepoint for TX queue timeouts to ease debugging,
from Cong Wang.
27) More indirect call optimizations, from Paolo Abeni"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1763 commits)
cxgb4: Fix error path in cxgb4_init_module
net: phy: improve pause mode reporting in phy_print_status
dt-bindings: net: Fix a typo in the phy-mode list for ethernet bindings
net: macb: Change interrupt and napi enable order in open
net: ll_temac: Improve error message on error IRQ
net/sched: remove block pointer from common offload structure
net: ethernet: support of_get_mac_address new ERR_PTR error
net: usb: smsc: fix warning reported by kbuild test robot
staging: octeon-ethernet: Fix of_get_mac_address ERR_PTR check
net: dsa: support of_get_mac_address new ERR_PTR error
net: dsa: sja1105: Fix status initialization in sja1105_get_ethtool_stats
vrf: sit mtu should not be updated when vrf netdev is the link
net: dsa: Fix error cleanup path in dsa_init_module
l2tp: Fix possible NULL pointer dereference
taprio: add null check on sched_nest to avoid potential null pointer dereference
net: mvpp2: cls: fix less than zero check on a u32 variable
net_sched: sch_fq: handle non connected flows
net_sched: sch_fq: do not assume EDT packets are ordered
net: hns3: use devm_kcalloc when allocating desc_cb
net: hns3: some cleanup for struct hns3_enet_ring
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAlzRrxsUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXPhlw/9EQVpaHZ62ruzY9a2POvhpAsiRzcB
hELj15iLf12EUKGhxgihDaBc7uQOlOWcFbQO8xtw7YxV7KlOtAx5ijsM9OSeczVk
MhCz7hIUnZwgS4/sJ4HDLNKvgq2xSl4MMjZCZ+0SGfNrfvOo0yidj3w6CLrtKCD2
qhUyX0FtGPHKZEQnEULUHm92U//0+iKtK/5fEX7hXTwpujwzRS+E0kSwnnY18lx8
VW1/fgElqixwHpQvKsUFMi4MkdWD3YydGXSaePVur6GpKGFbA+ooHng49HpMwiOH
33RkbnXp/MxD8MLX/eMpFwMAt92rss6Sf8MPE+XJ+SeN193R8PGguNt7F6f2SR62
W051tsDJ4p97L+7FEw5Y5i0HDxGQintp/tlYLWStXCa/0yntMEyjZHichPr3IteN
G9qg3iSqI+TzhYf7rxFk1lmnyOAj11UGAy9HhRva6pTmXrwlJ12amEbMzbMae1Of
+h0hj4+p/mINGV7v38Igy015b3qMMaIwe9cnAstYnz7MZgjm5YhEWPlJMqus9nS2
XfRh5x8Dhy9Q9NRXusbZltJHAjSAtyKXvcjN7vCKFE0r/7qWQ6nkzp7PD0CVQqLV
FKSQ4MSq2TDfQ/Oq7iQc9jEIMomud5FBPNnEjLCndR05jsQzSxCYKUvonM3wob/B
rCsoxkDZwSivsdo=
=Ts2E
-----END PGP SIGNATURE-----
Merge tag 'selinux-pr-20190507' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull selinux updates from Paul Moore:
"We've got a few SELinux patches for the v5.2 merge window, the
highlights are below:
- Add LSM hooks, and the SELinux implementation, for proper labeling
of kernfs. While we are only including the SELinux implementation
here, the rest of the LSM folks have given the hooks a thumbs-up.
- Update the SELinux mdp (Make Dummy Policy) script to actually work
on a modern system.
- Disallow userspace to change the LSM credentials via
/proc/self/attr when the task's credentials are already overridden.
The change was made in procfs because all the LSM folks agreed this
was the Right Thing To Do and duplicating it across each LSM was
going to be annoying"
* tag 'selinux-pr-20190507' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
proc: prevent changes to overridden credentials
selinux: Check address length before reading address family
kernfs: fix xattr name handling in LSM helpers
MAINTAINERS: update SELinux file patterns
selinux: avoid uninitialized variable warning
selinux: remove useless assignments
LSM: lsm_hooks.h - fix missing colon in docstring
selinux: Make selinux_kernfs_init_security static
kernfs: initialize security of newly created nodes
selinux: implement the kernfs_init_security hook
LSM: add new hook for kernfs node initialization
kernfs: use simple_xattrs for security attributes
selinux: try security xattr after genfs for kernfs filesystems
kernfs: do not alloc iattrs in kernfs_xattr_get
kernfs: clean up struct kernfs_iattrs
scripts/selinux: fix build
selinux: use kernel linux/socket.h for genheaders and mdp
scripts/selinux: modernize mdp
Pull vfs inode freeing updates from Al Viro:
"Introduction of separate method for RCU-delayed part of
->destroy_inode() (if any).
Pretty much as posted, except that destroy_inode() stashes
->free_inode into the victim (anon-unioned with ->i_fops) before
scheduling i_callback() and the last two patches (sockfs conversion
and folding struct socket_wq into struct socket) are excluded - that
pair should go through netdev once davem reopens his tree"
* 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (58 commits)
orangefs: make use of ->free_inode()
shmem: make use of ->free_inode()
hugetlb: make use of ->free_inode()
overlayfs: make use of ->free_inode()
jfs: switch to ->free_inode()
fuse: switch to ->free_inode()
ext4: make use of ->free_inode()
ecryptfs: make use of ->free_inode()
ceph: use ->free_inode()
btrfs: use ->free_inode()
afs: switch to use of ->free_inode()
dax: make use of ->free_inode()
ntfs: switch to ->free_inode()
securityfs: switch to ->free_inode()
apparmor: switch to ->free_inode()
rpcpipe: switch to ->free_inode()
bpf: switch to ->free_inode()
mqueue: switch to ->free_inode()
ufs: switch to ->free_inode()
coda: switch to ->free_inode()
...
Pull stack trace updates from Ingo Molnar:
"So Thomas looked at the stacktrace code recently and noticed a few
weirdnesses, and we all know how such stories of crummy kernel code
meeting German engineering perfection end: a 45-patch series to clean
it all up! :-)
Here's the changes in Thomas's words:
'Struct stack_trace is a sinkhole for input and output parameters
which is largely pointless for most usage sites. In fact if embedded
into other data structures it creates indirections and extra storage
overhead for no benefit.
Looking at all usage sites makes it clear that they just require an
interface which is based on a storage array. That array is either on
stack, global or embedded into some other data structure.
Some of the stack depot usage sites are outright wrong, but
fortunately the wrongness just causes more stack being used for
nothing and does not have functional impact.
Another oddity is the inconsistent termination of the stack trace
with ULONG_MAX. It's pointless as the number of entries is what
determines the length of the stored trace. In fact quite some call
sites remove the ULONG_MAX marker afterwards with or without nasty
comments about it. Not all architectures do that and those which do,
do it inconsistenly either conditional on nr_entries == 0 or
unconditionally.
The following series cleans that up by:
1) Removing the ULONG_MAX termination in the architecture code
2) Removing the ULONG_MAX fixups at the call sites
3) Providing plain storage array based interfaces for stacktrace
and stackdepot.
4) Cleaning up the mess at the callsites including some related
cleanups.
5) Removing the struct stack_trace based interfaces
This is not changing the struct stack_trace interfaces at the
architecture level, but it removes the exposure to the generic
code'"
* 'core-stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
x86/stacktrace: Use common infrastructure
stacktrace: Provide common infrastructure
lib/stackdepot: Remove obsolete functions
stacktrace: Remove obsolete functions
livepatch: Simplify stack trace retrieval
tracing: Remove the last struct stack_trace usage
tracing: Simplify stack trace retrieval
tracing: Make ftrace_trace_userstack() static and conditional
tracing: Use percpu stack trace buffer more intelligently
tracing: Simplify stacktrace retrieval in histograms
lockdep: Simplify stack trace handling
lockdep: Remove save argument from check_prev_add()
lockdep: Remove unused trace argument from print_circular_bug()
drm: Simplify stacktrace handling
dm persistent data: Simplify stack trace handling
dm bufio: Simplify stack trace retrieval
btrfs: ref-verify: Simplify stack trace retrieval
dma/debug: Simplify stracktrace retrieval
fault-inject: Simplify stacktrace retrieval
mm/page_owner: Simplify stack trace handling
...
Prevent userspace from changing the the /proc/PID/attr values if the
task's credentials are currently overriden. This not only makes sense
conceptually, it also prevents some really bizarre error cases caused
when trying to commit credentials to a task with overridden
credentials.
Cc: <stable@vger.kernel.org>
Reported-by: "chengjian (D)" <cj.chengjian@huawei.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Acked-by: John Johansen <john.johansen@canonical.com>
Acked-by: James Morris <james.morris@microsoft.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
The core dumping code has always run without holding the mmap_sem for
writing, despite that is the only way to ensure that the entire vma
layout will not change from under it. Only using some signal
serialization on the processes belonging to the mm is not nearly enough.
This was pointed out earlier. For example in Hugh's post from Jul 2017:
https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils
"Not strictly relevant here, but a related note: I was very surprised
to discover, only quite recently, how handle_mm_fault() may be called
without down_read(mmap_sem) - when core dumping. That seems a
misguided optimization to me, which would also be nice to correct"
In particular because the growsdown and growsup can move the
vm_start/vm_end the various loops the core dump does around the vma will
not be consistent if page faults can happen concurrently.
Pretty much all users calling mmget_not_zero()/get_task_mm() and then
taking the mmap_sem had the potential to introduce unexpected side
effects in the core dumping code.
Adding mmap_sem for writing around the ->core_dump invocation is a
viable long term fix, but it requires removing all copy user and page
faults and to replace them with get_dump_page() for all binary formats
which is not suitable as a short term fix.
For the time being this solution manually covers the places that can
confuse the core dump either by altering the vma layout or the vma flags
while it runs. Once ->core_dump runs under mmap_sem for writing the
function mmget_still_valid() can be dropped.
Allowing mmap_sem protected sections to run in parallel with the
coredump provides some minor parallelism advantage to the swapoff code
(which seems to be safe enough by never mangling any vma field and can
keep doing swapins in parallel to the core dumping) and to some other
corner case.
In order to facilitate the backporting I added "Fixes: 86039bd3b4e6"
however the side effect of this same race condition in /proc/pid/mem
should be reproducible since before 2.6.12-rc2 so I couldn't add any
other "Fixes:" because there's no hash beyond the git genesis commit.
Because find_extend_vma() is the only location outside of the process
context that could modify the "mm" structures under mmap_sem for
reading, by adding the mmget_still_valid() check to it, all other cases
that take the mmap_sem for reading don't need the new check after
mmget_not_zero()/get_task_mm(). The expand_stack() in page fault
context also doesn't need the new check, because all tasks under core
dumping are frozen.
Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
Fixes: 86039bd3b4 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jann Horn <jannh@google.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No architecture terminates the stack trace with ULONG_MAX anymore. The
consumer terminates on the first zero entry or at the number of entries, so
no functional change.
Remove the cruft.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Link: https://lkml.kernel.org/r/20190410103644.853527514@linutronix.de
Add file_pos field to bpf_sysctl context to read and write sysctl file
position at which sysctl is being accessed (read or written).
The field can be used to e.g. override whole sysctl value on write to
sysctl even when sys_write is called by user space with file_pos > 0. Or
BPF program may reject such accesses.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add helpers to work with new value being written to sysctl by user
space.
bpf_sysctl_get_new_value() copies value being written to sysctl into
provided buffer.
bpf_sysctl_set_new_value() overrides new value being written by user
space with a one from provided buffer. Buffer should contain string
representation of the value, similar to what can be seen in /proc/sys/.
Both helpers can be used only on sysctl write.
File position matters and can be managed by an interface that will be
introduced separately. E.g. if user space calls sys_write to a file in
/proc/sys/ at file position = X, where X > 0, then the value set by
bpf_sysctl_set_new_value() will be written starting from X. If program
wants to override whole value with specified buffer, file position has
to be set to zero.
Documentation for the new helpers is provided in bpf.h UAPI.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Containerized applications may run as root and it may create problems
for whole host. Specifically such applications may change a sysctl and
affect applications in other containers.
Furthermore in existing infrastructure it may not be possible to just
completely disable writing to sysctl, instead such a process should be
gradual with ability to log what sysctl are being changed by a
container, investigate, limit the set of writable sysctl to currently
used ones (so that new ones can not be changed) and eventually reduce
this set to zero.
The patch introduces new program type BPF_PROG_TYPE_CGROUP_SYSCTL and
attach type BPF_CGROUP_SYSCTL to solve these problems on cgroup basis.
New program type has access to following minimal context:
struct bpf_sysctl {
__u32 write;
};
Where @write indicates whether sysctl is being read (= 0) or written (=
1).
Helpers to access sysctl name and value will be introduced separately.
BPF_CGROUP_SYSCTL attach point is added to sysctl code right before
passing control to ctl_table->proc_handler so that BPF program can
either allow or deny access to sysctl.
Suggested-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
task_current_syscall() has a single user that passes in 6 for maxargs, which
is the maximum arguments that can be used to get system calls from
syscall_get_arguments(). Instead of passing in a number of arguments to
grab, just get 6 arguments. The args argument even specifies that it's an
array of 6 items.
This will also allow changing syscall_get_arguments() to not get a variable
number of arguments, but always grab 6.
Linus also suggested not passing in a bunch of arguments to
task_current_syscall() but to instead pass in a pointer to a structure, and
just fill the structure. struct seccomp_data has almost all the parameters
that is needed except for the stack pointer (sp). As seccomp_data is part of
uapi, and I'm afraid to change it, a new structure was created
"syscall_info", which includes seccomp_data and adds the "sp" field.
Link: http://lkml.kernel.org/r/20161107213233.466776454@goodmis.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
On machines where the GART aperture is mapped over physical RAM,
/proc/kcore contains the GART aperture range. Accessing the GART range via
/proc/kcore results in a kernel crash.
vmcore used to have the same issue, until it was fixed with commit
2a3e83c6f9 ("x86/gart: Exclude GART aperture from vmcore")', leveraging
existing hook infrastructure in vmcore to let /proc/vmcore return zeroes
when attempting to read the aperture region, and so it won't read from the
actual memory.
Apply the same workaround for kcore. First implement the same hook
infrastructure for kcore, then reuse the hook functions introduced in the
previous vmcore fix. Just with some minor adjustment, rename some functions
for more general usage, and simplify the hook infrastructure a bit as there
is no module usage yet.
Suggested-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Kairui Song <kasong@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jiri Bohac <jbohac@suse.cz>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Dave Young <dyoung@redhat.com>
Link: https://lkml.kernel.org/r/20190308030508.13548-1-kasong@redhat.com
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE7btrcuORLb1XUhEwjrBW1T7ssS0FAlx+nn4ACgkQjrBW1T7s
sS2kwg//aJUCwLIhV91gXUFN2jHTCf0/+5fnigEk7JhAT5wmAykxLM8tprLlIlyp
HtwNQx54hq/6p010Ulo9K50VS6JRii+2lNSpC6IkqXXdHXXm0ViH+5I9Nru8SVJ+
avRCYWNjW9Gn1EtcB2yv6KP3XffgnQ6ZLIr4QJwglOxgAqUaWZ68woSUlrIR5yFj
j48wAxjsC3g2qwGLvXPeiwYZHwk6VnYmrZ3eWXPDthWRDC4zkjyBdchZZzFJagSC
6sX8T9s5ua5juZMokEJaWjuBQQyfg0NYu41hupSdVjV7/0D3E+5/DiReInvLmSup
63bZ85uKRqWTNgl4cmJ1W3aVe2RYYemMZCXVVYYvU+IKpvTSzzYY7us+FyMAIRUV
bT+XPGzTWcGrChzv9bHZcBrkL91XGqyxRJz56jLl6EhRtqxmzmywf6mO6pS2WK4N
r+aBDgXeJbG39KguCzwUgVX8hC6YlSxSP8Md+2sK+UoAdfTUvFtdCYnjhuACofCt
saRvDIPF8N9qn4Ch3InzCKkrUTL/H3BZKBl2jo6tYQ9smUsFZW7lQoip5Ui/0VS+
qksJ91djOc9facGoOorPazojY5fO5Lj3Hg+cGIoxUV0jPH483z7hWH0ALynb0f6z
EDsgNyEUpIO2nJMJJfm37ysbU/j1gOpzQdaAEaWeknwtfecFPzM=
=yOWp
-----END PGP SIGNATURE-----
Merge tag 'pidfd-v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull pidfd system call from Christian Brauner:
"This introduces the ability to use file descriptors from /proc/<pid>/
as stable handles on struct pid. Even if a pid is recycled the handle
will not change. For a start these fds can be used to send signals to
the processes they refer to.
With the ability to use /proc/<pid> fds as stable handles on struct
pid we can fix a long-standing issue where after a process has exited
its pid can be reused by another process. If a caller sends a signal
to a reused pid it will end up signaling the wrong process.
With this patchset we enable a variety of use cases. One obvious
example is that we can now safely delegate an important part of
process management - sending signals - to processes other than the
parent of a given process by sending file descriptors around via scm
rights and not fearing that the given process will have been recycled
in the meantime. It also allows for easy testing whether a given
process is still alive or not by sending signal 0 to a pidfd which is
quite handy.
There has been some interest in this feature e.g. from systems
management (systemd, glibc) and container managers. I have requested
and gotten comments from glibc to make sure that this syscall is
suitable for their needs as well. In the future I expect it to take on
most other pid-based signal syscalls. But such features are left for
the future once they are needed.
This has been sitting in linux-next for quite a while and has not
caused any issues. It comes with selftests which verify basic
functionality and also test that a recycled pid cannot be signaled via
a pidfd.
Jon has written about a prior version of this patchset. It should
cover the basic functionality since not a lot has changed since then:
https://lwn.net/Articles/773459/
The commit message for the syscall itself is extensively documenting
the syscall, including it's functionality and extensibility"
* tag 'pidfd-v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
selftests: add tests for pidfd_send_signal()
signal: add pidfd_send_signal() syscall
Pull vfs mount infrastructure updates from Al Viro:
"The rest of core infrastructure; no new syscalls in that pile, but the
old parts are switched to new infrastructure. At that point
conversions of individual filesystems can happen independently; some
are done here (afs, cgroup, procfs, etc.), there's also a large series
outside of that pile dealing with NFS (quite a bit of option-parsing
stuff is getting used there - it's one of the most convoluted
filesystems in terms of mount-related logics), but NFS bits are the
next cycle fodder.
It got seriously simplified since the last cycle; documentation is
probably the weakest bit at the moment - I considered dropping the
commit introducing Documentation/filesystems/mount_api.txt (cutting
the size increase by quarter ;-), but decided that it would be better
to fix it up after -rc1 instead.
That pile allows to do followup work in independent branches, which
should make life much easier for the next cycle. fs/super.c size
increase is unpleasant; there's a followup series that allows to
shrink it considerably, but I decided to leave that until the next
cycle"
* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits)
afs: Use fs_context to pass parameters over automount
afs: Add fs_context support
vfs: Add some logging to the core users of the fs_context log
vfs: Implement logging through fs_context
vfs: Provide documentation for new mount API
vfs: Remove kern_mount_data()
hugetlbfs: Convert to fs_context
cpuset: Use fs_context
kernfs, sysfs, cgroup, intel_rdt: Support fs_context
cgroup: store a reference to cgroup_ns into cgroup_fs_context
cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helper
cgroup_do_mount(): massage calling conventions
cgroup: stash cgroup_root reference into cgroup_fs_context
cgroup2: switch to option-by-option parsing
cgroup1: switch to option-by-option parsing
cgroup: take options parsing into ->parse_monolithic()
cgroup: fold cgroup1_mount() into cgroup1_get_tree()
cgroup: start switching to fs_context
ipc: Convert mqueue fs to fs_context
proc: Add fs_context support to procfs
...
The new generic radix trees have a simpler API and implementation, and
no limitations on number of elements, so all flex_array users are being
converted
Link: http://lkml.kernel.org/r/20181217131929.11727-6-kent.overstreet@gmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pravin B Shelar <pshelar@ovn.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Compilers like to transform loops like
for (i = 0; i < n; i++) {
[use p[i]]
}
into
for (p = p0; p < end; p++) {
...
}
Do it by hand, so that it results in overall simpler loop
and smaller code.
Space savings:
$ ./scripts/bloat-o-meter ../vmlinux-001 ../obj/vmlinux
add/remove: 0/0 grow/shrink: 2/1 up/down: 4/-9 (-5)
Function old new delta
proc_tid_base_lookup 17 19 +2
proc_tgid_base_lookup 17 19 +2
proc_pident_lookup 179 170 -9
The same could be done to proc_pident_readdir(), but the code becomes
bigger for some reason.
[sfr@canb.auug.org.au: merge fix for proc_pident_lookup() API change]
Link: http://lkml.kernel.org/r/20190131160135.4a8ae70b@canb.auug.org.au
Link: http://lkml.kernel.org/r/20190114200422.GB9680@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: James Morris <jmorris@namei.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This has been a slightly more active cycle than normal with ongoing core
changes and quite a lot of collected driver updates.
- Various driver fixes for bnxt_re, cxgb4, hns, mlx5, pvrdma, rxe
- A new data transfer mode for HFI1 giving higher performance
- Significant functional and bug fix update to the mlx5 On-Demand-Paging MR
feature
- A chip hang reset recovery system for hns
- Change mm->pinned_vm to an atomic64
- Update bnxt_re to support a new 57500 chip
- A sane netlink 'rdma link add' method for creating rxe devices and fixing
the various unregistration race conditions in rxe's unregister flow
- Allow lookup up objects by an ID over netlink
- Various reworking of the core to driver interface:
* Drivers should not assume umem SGLs are in PAGE_SIZE chunks
* ucontext is accessed via udata not other means
* Start to make the core code responsible for object memory
allocation
* Drivers should convert struct device to struct ib_device
via a helper
* Drivers have more tools to avoid use after unregister problems
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlyAJYYACgkQOG33FX4g
mxrWwQ/+OyAx4Moru7Aix0C6GWxTJp/wKgw21CS3reZxgLai6x81xNYG/s2wCNjo
IccObVd7mvzyqPdxOeyHBsJBbQDqWvoD6O2duH8cqGMgBRgh3CSdUep2zLvPpSAx
2W1SvWYCLDnCuarboFrCA8c4AN3eCZiqD7z9lHyFQGjy3nTUWzk1uBaOP46uaiMv
w89N8EMdXJ/iY6ONzihvE05NEYbMA8fuvosKLLNdghRiHIjbMQU8SneY23pvyPDd
ZziPu9NcO3Hw9OVbkwtJp47U3KCBgvKHmnixyZKkikjiD+HVoABw2IMwcYwyBZwP
Bic/ddONJUvAxMHpKRnQaW7znAiHARk21nDG28UAI7FWXH/wMXgicMp6LRcNKqKF
vqXdxHTKJb0QUR4xrYI+eA8ihstss7UUpgSgByuANJ0X729xHiJtlEvPb1DPo1Dz
9CB4OHOVRl5O8sA5Jc6PSusZiKEpvWoyWbdmw0IiwDF5pe922VLl5Nv88ta+sJ38
v2Ll5AgYcluk7F3599Uh9D7gwp5hxW2Ph3bNYyg2j3HP4/dKsL9XvIJPXqEthgCr
3KQS9rOZfI/7URieT+H+Mlf+OWZhXsZilJG7No0fYgIVjgJ00h3SF1/299YIq6Qp
9W7ZXBfVSwLYA2AEVSvGFeZPUxgBwHrSZ62wya4uFeB1jyoodPk=
=p12E
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"This has been a slightly more active cycle than normal with ongoing
core changes and quite a lot of collected driver updates.
- Various driver fixes for bnxt_re, cxgb4, hns, mlx5, pvrdma, rxe
- A new data transfer mode for HFI1 giving higher performance
- Significant functional and bug fix update to the mlx5
On-Demand-Paging MR feature
- A chip hang reset recovery system for hns
- Change mm->pinned_vm to an atomic64
- Update bnxt_re to support a new 57500 chip
- A sane netlink 'rdma link add' method for creating rxe devices and
fixing the various unregistration race conditions in rxe's
unregister flow
- Allow lookup up objects by an ID over netlink
- Various reworking of the core to driver interface:
- drivers should not assume umem SGLs are in PAGE_SIZE chunks
- ucontext is accessed via udata not other means
- start to make the core code responsible for object memory
allocation
- drivers should convert struct device to struct ib_device via a
helper
- drivers have more tools to avoid use after unregister problems"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (280 commits)
net/mlx5: ODP support for XRC transport is not enabled by default in FW
IB/hfi1: Close race condition on user context disable and close
RDMA/umem: Revert broken 'off by one' fix
RDMA/umem: minor bug fix in error handling path
RDMA/hns: Use GFP_ATOMIC in hns_roce_v2_modify_qp
cxgb4: kfree mhp after the debug print
IB/rdmavt: Fix concurrency panics in QP post_send and modify to error
IB/rdmavt: Fix loopback send with invalidate ordering
IB/iser: Fix dma_nents type definition
IB/mlx5: Set correct write permissions for implicit ODP MR
bnxt_re: Clean cq for kernel consumers only
RDMA/uverbs: Don't do double free of allocated PD
RDMA: Handle ucontext allocations by IB/core
RDMA/core: Fix a WARN() message
bnxt_re: fix the regression due to changes in alloc_pbl
IB/mlx4: Increase the timeout for CM cache
IB/core: Abort page fault handler silently during owning process exit
IB/mlx5: Validate correct PD before prefetch MR
IB/mlx5: Protect against prefetch of invalid MR
RDMA/uverbs: Store PR pointer before it is overwritten
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAlx+8ZgUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXOlDhAAiGlirQ9syyG2fYzaARZZ2QoU/GGD
PSAeiNmP3jvJzXArCvugRCw+YSNDdQOBM3SrLQC+cM0MAIDRYXN0NdcrsbTchlMA
51Fx1egZ9Fyj+Ehgida3muh2lRUy7DQwMCL6tAVqwz7vYkSTGDUf+MlYqOqXDka5
74pEExOS3Jdi7560BsE8b6QoW9JIJqEJnirXGkG9o2qC0oFHCR6PKxIyQ7TJrLR1
F23aFTqLTH1nbPUQjnox2PTf13iQVh4j2gwzd+9c9KBfxoGSge3dmxId7BJHy2aG
M27fPdCYTNZAGWpPVujsCPAh1WPQ9NQqg3mA9+g14PEbiLqPcqU+kWmnDU7T7bEw
Qx0kt6Y8GiknwCqq8pDbKYclgRmOjSGdfutzd0z8uDpbaeunS4/NqnDb/FUaDVcr
jA4d6ep7qEgHpYbL8KgOeZCexfaTfz6mcwRWNq3Uu9cLZbZqSSQ7PXolMADHvoRs
LS7VH2jcP7q4p4GWmdfjv67xyUUo9HG5HHX74h5pLfQSYXiBWo4ht0UOAzX/6EcE
CJNHAFHv+OanI5Rg/6JQ8b3/bJYxzAJVyLZpCuMtlKk6lYBGNeADk9BezEDIYsm8
tSe4/GqqyR9+Qz8rSdpAZ0KKkfqS535IcHUPUJau7Bzg1xqSEP5gzZN6QsjdXg0+
5wFFfdFICTfJFXo=
=57/1
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20190305' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"A lucky 13 audit patches for v5.1.
Despite the rather large diffstat, most of the changes are from two
bug fix patches that move code from one Kconfig option to another.
Beyond that bit of churn, the remaining changes are largely cleanups
and bug-fixes as we slowly march towards container auditing. It isn't
all boring though, we do have a couple of new things: file
capabilities v3 support, and expanded support for filtering on
filesystems to solve problems with remote filesystems.
All changes pass the audit-testsuite. Please merge for v5.1"
* tag 'audit-pr-20190305' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: mark expected switch fall-through
audit: hide auditsc_get_stamp and audit_serial prototypes
audit: join tty records to their syscall
audit: remove audit_context when CONFIG_ AUDIT and not AUDITSYSCALL
audit: remove unused actx param from audit_rule_match
audit: ignore fcaps on umount
audit: clean up AUDITSYSCALL prototypes and stubs
audit: more filter PATH records keyed on filesystem magic
audit: add support for fcaps v3
audit: move loginuid and sessionid from CONFIG_AUDITSYSCALL to CONFIG_AUDIT
audit: add syscall information to CONFIG_CHANGE records
audit: hand taken context to audit_kill_trees for syscall logging
audit: give a clue what CONFIG_CHANGE op was involved
Pull security subsystem updates from James Morris:
- Extend LSM stacking to allow sharing of cred, file, ipc, inode, and
task blobs. This paves the way for more full-featured LSMs to be
merged, and is specifically aimed at LandLock and SARA LSMs. This
work is from Casey and Kees.
- There's a new LSM from Micah Morton: "SafeSetID gates the setid
family of syscalls to restrict UID/GID transitions from a given
UID/GID to only those approved by a system-wide whitelist." This
feature is currently shipping in ChromeOS.
* 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (62 commits)
keys: fix missing __user in KEYCTL_PKEY_QUERY
LSM: Update list of SECURITYFS users in Kconfig
LSM: Ignore "security=" when "lsm=" is specified
LSM: Update function documentation for cap_capable
security: mark expected switch fall-throughs and add a missing break
tomoyo: Bump version.
LSM: fix return value check in safesetid_init_securityfs()
LSM: SafeSetID: add selftest
LSM: SafeSetID: remove unused include
LSM: SafeSetID: 'depend' on CONFIG_SECURITY
LSM: Add 'name' field for SafeSetID in DEFINE_LSM
LSM: add SafeSetID module that gates setid calls
LSM: add SafeSetID module that gates setid calls
tomoyo: Allow multiple use_group lines.
tomoyo: Coding style fix.
tomoyo: Swicth from cred->security to task_struct->security.
security: keys: annotate implicit fall throughs
security: keys: annotate implicit fall throughs
security: keys: annotate implicit fall through
capabilities:: annotate implicit fall through
...
Merge misc updates from Andrew Morton:
- a few misc things
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (159 commits)
tools/testing/selftests/proc/proc-self-syscall.c: remove duplicate include
proc: more robust bulk read test
proc: test /proc/*/maps, smaps, smaps_rollup, statm
proc: use seq_puts() everywhere
proc: read kernel cpu stat pointer once
proc: remove unused argument in proc_pid_lookup()
fs/proc/thread_self.c: code cleanup for proc_setup_thread_self()
fs/proc/self.c: code cleanup for proc_setup_self()
proc: return exit code 4 for skipped tests
mm,mremap: bail out earlier in mremap_to under map pressure
mm/sparse: fix a bad comparison
mm/memory.c: do_fault: avoid usage of stale vm_area_struct
writeback: fix inode cgroup switching comment
mm/huge_memory.c: fix "orig_pud" set but not used
mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC
mm/memcontrol.c: fix bad line in comment
mm/cma.c: cma_declare_contiguous: correct err handling
mm/page_ext.c: fix an imbalance with kmemleak
mm/compaction: pass pgdat to too_many_isolated() instead of zone
mm: remove zone_lru_lock() function, access ->lru_lock directly
...
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- refcount conversions
- Solve the rq->leaf_cfs_rq_list can of worms for real.
- improve power-aware scheduling
- add sysctl knob for Energy Aware Scheduling
- documentation updates
- misc other changes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
kthread: Do not use TIMER_IRQSAFE
kthread: Convert worker lock to raw spinlock
sched/fair: Use non-atomic cpumask_{set,clear}_cpu()
sched/fair: Remove unused 'sd' parameter from select_idle_smt()
sched/wait: Use freezable_schedule() when possible
sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block
sched/fair: Explain LLC nohz kick condition
sched/fair: Simplify nohz_balancer_kick()
sched/topology: Fix percpu data types in struct sd_data & struct s_data
sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument
sched/fair: Fix O(nr_cgroups) in the load balancing path
sched/fair: Optimize update_blocked_averages()
sched/fair: Fix insertion in rq->leaf_cfs_rq_list
sched/fair: Add tmp_alone_branch assertion
sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity
sched/fair: Update scale invariance of PELT
sched/fair: Move the rq_of() helper function
sched/core: Convert task_struct.stack_refcount to refcount_t
...
Architectures like ppc64 require to do a conditional tlb flush based on
the old and new value of pte. Enable that by passing old pte value as
the arg.
Link: http://lkml.kernel.org/r/20190116085035.29729-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "NestMMU pte upgrade workaround for mprotect", v5.
We can upgrade pte access (R -> RW transition) via mprotect. We need to
make sure we follow the recommended pte update sequence as outlined in
commit bd5050e38a ("powerpc/mm/radix: Change pte relax sequence to
handle nest MMU hang") for such updates. This patch series does that.
This patch (of 5):
Some architectures may want to call flush_tlb_range from these helpers.
Link: http://lkml.kernel.org/r/20190116085035.29729-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PG_balloon was introduced to implement page migration/compaction for
pages inflated in virtio-balloon. Nowadays, it is only a marker that a
page is part of virtio-balloon and therefore logically offline.
We also want to make use of this flag in other balloon drivers - for
inflated pages or when onlining a section but keeping some pages offline
(e.g. used right now by XEN and Hyper-V via set_online_page_callback()).
We are going to expose this flag to dump tools like makedumpfile. But
instead of exposing PG_balloon, let's generalize the concept of marking
pages as logically offline, so it can be reused for other purposes later
on.
Rename PG_balloon to PG_offline. This is an indicator that the page is
logically offline, the content stale and that it should not be touched
(e.g. a hypervisor would have to allocate backing storage in order for
the guest to dump an unused page). We can then e.g. exclude such pages
from dumps.
We replace and reuse KPF_BALLOON (23), as this shouldn't really harm
(and for now the semantics stay the same). In following patches, we
will make use of this bit also in other balloon drivers. While at it,
document PGTABLE.
[akpm@linux-foundation.org: fix comment text, per David]
Link: http://lkml.kernel.org/r/20181119101616.8901-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Pankaj gupta <pagupta@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Christian Hansen <chansen3@cisco.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Miles Chen <miles.chen@mediatek.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Julien Freche <jfreche@vmware.com>
Cc: Kairui Song <kasong@redhat.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Lianbo Jiang <lijiang@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Xavier Deguillard <xdeguillard@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull irq updates from Thomas Gleixner:
"The interrupt departement delivers this time:
- New infrastructure to manage NMIs on platforms which have a sane
NMI delivery, i.e. identifiable NMI vectors instead of a single
lump.
- Simplification of the interrupt affinity management so drivers
don't have to implement ugly loops around the PCI/MSI enablement.
- Speedup for interrupt statistics in /proc/stat
- Provide a function to retrieve the default irq domain
- A new interrupt controller for the Loongson LS1X platform
- Affinity support for the SiFive PLIC
- Better support for the iMX irqsteer driver
- NUMA aware memory allocations for GICv3
- The usual small fixes, improvements and cleanups all over the
place"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
irqchip/imx-irqsteer: Add multi output interrupts support
irqchip/imx-irqsteer: Change to use reg_num instead of irq_group
dt-bindings: irq: imx-irqsteer: Add multi output interrupts support
dt-binding: irq: imx-irqsteer: Use irq number instead of group number
irqchip/brcmstb-l2: Use _irqsave locking variants in non-interrupt code
irqchip/gicv3-its: Use NUMA aware memory allocation for ITS tables
irqdomain: Allow the default irq domain to be retrieved
irqchip/sifive-plic: Implement irq_set_affinity() for SMP host
irqchip/sifive-plic: Differentiate between PLIC handler and context
irqchip/sifive-plic: Add warning in plic_init() if handler already present
irqchip/sifive-plic: Pre-compute context hart base and enable base
PCI/MSI: Remove obsolete sanity checks for multiple interrupt sets
genirq/affinity: Remove the leftovers of the original set support
nvme-pci: Simplify interrupt allocation
genirq/affinity: Add new callback for (re)calculating interrupt sets
genirq/affinity: Store interrupt sets size in struct irq_affinity
genirq/affinity: Code consolidation
irqchip/irq-sifive-plic: Check and continue in case of an invalid cpuid.
irqchip/i8259: Fix shutdown order by moving syscore_ops registration
dt-bindings: interrupt-controller: loongson ls1x intc
...
The kill() syscall operates on process identifiers (pid). After a process
has exited its pid can be reused by another process. If a caller sends a
signal to a reused pid it will end up signaling the wrong process. This
issue has often surfaced and there has been a push to address this problem [1].
This patch uses file descriptors (fd) from proc/<pid> as stable handles on
struct pid. Even if a pid is recycled the handle will not change. The fd
can be used to send signals to the process it refers to.
Thus, the new syscall pidfd_send_signal() is introduced to solve this
problem. Instead of pids it operates on process fds (pidfd).
/* prototype and argument /*
long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags);
/* syscall number 424 */
The syscall number was chosen to be 424 to align with Arnd's rework in his
y2038 to minimize merge conflicts (cf. [25]).
In addition to the pidfd and signal argument it takes an additional
siginfo_t and flags argument. If the siginfo_t argument is NULL then
pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it
is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
The flags argument is added to allow for future extensions of this syscall.
It currently needs to be passed as 0. Failing to do so will cause EINVAL.
/* pidfd_send_signal() replaces multiple pid-based syscalls */
The pidfd_send_signal() syscall currently takes on the job of
rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a
positive pid is passed to kill(2). It will however be possible to also
replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.
/* sending signals to threads (tid) and process groups (pgid) */
Specifically, the pidfd_send_signal() syscall does currently not operate on
process groups or threads. This is left for future extensions.
In order to extend the syscall to allow sending signal to threads and
process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and
PIDFD_TYPE_TID) should be added. This implies that the flags argument will
determine what is signaled and not the file descriptor itself. Put in other
words, grouping in this api is a property of the flags argument not a
property of the file descriptor (cf. [13]). Clarification for this has been
requested by Eric (cf. [19]).
When appropriate extensions through the flags argument are added then
pidfd_send_signal() can additionally replace the part of kill(2) which
operates on process groups as well as the tgkill(2) and
rt_tgsigqueueinfo(2) syscalls.
How such an extension could be implemented has been very roughly sketched
in [14], [15], and [16]. However, this should not be taken as a commitment
to a particular implementation. There might be better ways to do it.
Right now this is intentionally left out to keep this patchset as simple as
possible (cf. [4]).
/* naming */
The syscall had various names throughout iterations of this patchset:
- procfd_signal()
- procfd_send_signal()
- taskfd_send_signal()
In the last round of reviews it was pointed out that given that if the
flags argument decides the scope of the signal instead of different types
of fds it might make sense to either settle for "procfd_" or "pidfd_" as
prefix. The community was willing to accept either (cf. [17] and [18]).
Given that one developer expressed strong preference for the "pidfd_"
prefix (cf. [13]) and with other developers less opinionated about the name
we should settle for "pidfd_" to avoid further bikeshedding.
The "_send_signal" suffix was chosen to reflect the fact that the syscall
takes on the job of multiple syscalls. It is therefore intentional that the
name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
fomer because it might imply that pidfd_send_signal() is a replacement for
kill(2), and not the latter because it is a hassle to remember the correct
spelling - especially for non-native speakers - and because it is not
descriptive enough of what the syscall actually does. The name
"pidfd_send_signal" makes it very clear that its job is to send signals.
/* zombies */
Zombies can be signaled just as any other process. No special error will be
reported since a zombie state is an unreliable state (cf. [3]). However,
this can be added as an extension through the @flags argument if the need
ever arises.
/* cross-namespace signals */
The patch currently enforces that the signaler and signalee either are in
the same pid namespace or that the signaler's pid namespace is an ancestor
of the signalee's pid namespace. This is done for the sake of simplicity
and because it is unclear to what values certain members of struct
siginfo_t would need to be set to (cf. [5], [6]).
/* compat syscalls */
It became clear that we would like to avoid adding compat syscalls
(cf. [7]). The compat syscall handling is now done in kernel/signal.c
itself by adding __copy_siginfo_from_user_generic() which lets us avoid
compat syscalls (cf. [8]). It should be noted that the addition of
__copy_siginfo_from_user_any() is caused by a bug in the original
implementation of rt_sigqueueinfo(2) (cf. 12).
With upcoming rework for syscall handling things might improve
significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
any additional callers.
/* testing */
This patch was tested on x64 and x86.
/* userspace usage */
An asciinema recording for the basic functionality can be found under [9].
With this patch a process can be killed via:
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
unsigned int flags)
{
#ifdef __NR_pidfd_send_signal
return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
#else
return -ENOSYS;
#endif
}
int main(int argc, char *argv[])
{
int fd, ret, saved_errno, sig;
if (argc < 3)
exit(EXIT_FAILURE);
fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
if (fd < 0) {
printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
exit(EXIT_FAILURE);
}
sig = atoi(argv[2]);
printf("Sending signal %d to process %s\n", sig, argv[1]);
ret = do_pidfd_send_signal(fd, sig, NULL, 0);
saved_errno = errno;
close(fd);
errno = saved_errno;
if (ret < 0) {
printf("%s - Failed to send signal %d to process %s\n",
strerror(errno), sig, argv[1]);
exit(EXIT_FAILURE);
}
exit(EXIT_SUCCESS);
}
/* Q&A
* Given that it seems the same questions get asked again by people who are
* late to the party it makes sense to add a Q&A section to the commit
* message so it's hopefully easier to avoid duplicate threads.
*
* For the sake of progress please consider these arguments settled unless
* there is a new point that desperately needs to be addressed. Please make
* sure to check the links to the threads in this commit message whether
* this has not already been covered.
*/
Q-01: (Florian Weimer [20], Andrew Morton [21])
What happens when the target process has exited?
A-01: Sending the signal will fail with ESRCH (cf. [22]).
Q-02: (Andrew Morton [21])
Is the task_struct pinned by the fd?
A-02: No. A reference to struct pid is kept. struct pid - as far as I
understand - was created exactly for the reason to not require to
pin struct task_struct (cf. [22]).
Q-03: (Andrew Morton [21])
Does the entire procfs directory remain visible? Just one entry
within it?
A-03: The same thing that happens right now when you hold a file descriptor
to /proc/<pid> open (cf. [22]).
Q-04: (Andrew Morton [21])
Does the pid remain reserved?
A-04: No. This patchset guarantees a stable handle not that pids are not
recycled (cf. [22]).
Q-05: (Andrew Morton [21])
Do attempts to signal that fd return errors?
A-05: See {Q,A}-01.
Q-06: (Andrew Morton [22])
Is there a cleaner way of obtaining the fd? Another syscall perhaps.
A-06: Userspace can already trivially retrieve file descriptors from procfs
so this is something that we will need to support anyway. Hence,
there's no immediate need to add another syscalls just to make
pidfd_send_signal() not dependent on the presence of procfs. However,
adding a syscalls to get such file descriptors is planned for a
future patchset (cf. [22]).
Q-07: (Andrew Morton [21] and others)
This fd-for-a-process sounds like a handy thing and people may well
think up other uses for it in the future, probably unrelated to
signals. Are the code and the interface designed to permit such
future applications?
A-07: Yes (cf. [22]).
Q-08: (Andrew Morton [21] and others)
Now I think about it, why a new syscall? This thing is looking
rather like an ioctl?
A-08: This has been extensively discussed. It was agreed that a syscall is
preferred for a variety or reasons. Here are just a few taken from
prior threads. Syscalls are safer than ioctl()s especially when
signaling to fds. Processes are a core kernel concept so a syscall
seems more appropriate. The layout of the syscall with its four
arguments would require the addition of a custom struct for the
ioctl() thereby causing at least the same amount or even more
complexity for userspace than a simple syscall. The new syscall will
replace multiple other pid-based syscalls (see description above).
The file-descriptors-for-processes concept introduced with this
syscall will be extended with other syscalls in the future. See also
[22], [23] and various other threads already linked in here.
Q-09: (Florian Weimer [24])
What happens if you use the new interface with an O_PATH descriptor?
A-09:
pidfds opened as O_PATH fds cannot be used to send signals to a
process (cf. [2]). Signaling processes through pidfds is the
equivalent of writing to a file. Thus, this is not an operation that
operates "purely at the file descriptor level" as required by the
open(2) manpage. See also [4].
/* References */
[1]: https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
[2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
[3]: https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
[4]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
[5]: https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
[6]: https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
[7]: https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
[8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
[9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
[11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
[12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
[13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
[14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/
[15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/
[16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/
[17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
[18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/
[19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/
[20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/
[22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
[23]: https://lwn.net/Articles/773459/
[24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Tycho Andersen <tycho@tycho.ws>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Aleksa Sarai <cyphar@cyphar.com>
Add fs_context support to procfs.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move proc_fill_super() to fs/proc/root.c as that's where the other
superblock stuff is.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tetsuo has reported that creating a thousands of processes sharing MM
without SIGHAND (aka alien threads) and setting
/proc/<pid>/oom_score_adj will swamp the kernel log and takes ages [1]
to finish. This is especially worrisome that all that printing is done
under RCU lock and this can potentially trigger RCU stall or softlockup
detector.
The primary reason for the printk was to catch potential users who might
depend on the behavior prior to 44a70adec9 ("mm, oom_adj: make sure
processes sharing mm have same view of oom_score_adj") but after more
than 2 years without a single report I guess it is safe to simply remove
the printk altogether.
The next step should be moving oom_score_adj over to the mm struct and
remove all the tasks crawling as suggested by [2]
[1] http://lkml.kernel.org/r/97fce864-6f75-bca5-14bc-12c9f890e740@i-love.sakura.ne.jp
[2] http://lkml.kernel.org/r/20190117155159.GA4087@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20190212102129.26288-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Yong-Taek Lee <ytk.lee@samsung.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The 'pss_locked' field of smaps_rollup was being calculated incorrectly.
It accumulated the current pss everytime a locked VMA was found. Fix
that by adding to 'pss_locked' the same time as that of 'pss' if the vma
being walked is locked.
Link: http://lkml.kernel.org/r/20190203065425.14650-1-sspatil@android.com
Fixes: 493b0e9d94 ("mm: add /proc/pid/smaps_rollup")
Signed-off-by: Sandeep Patil <sspatil@android.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: <stable@vger.kernel.org> [4.14.x, 4.19.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Waiman reported that on large systems with a large amount of interrupts the
readout of /proc/stat takes a long time to sum up the interrupt
statistics. In principle this is not a problem. but for unknown reasons
some enterprise quality software reads /proc/stat with a high frequency.
The reason for this is that interrupt statistics are accounted per cpu. So
the /proc/stat logic has to sum up the interrupt stats for each interrupt.
The interrupt core provides now a per interrupt summary counter which can
be used to avoid the summation loops completely except for interrupts
marked PER_CPU which are only a small fraction of the interrupt space if at
all.
Another simplification is to iterate only over the active interrupts and
skip the potentially large gaps in the interrupt number space and just
print zeros for the gaps without going into the interrupt core in the first
place.
Waiman provided test results from a 4-socket IvyBridge-EX system (60-core
120-thread, 3016 irqs) excuting a test program which reads /proc/stat
50,000 times:
Before: 18.436s (sys 18.380s)
After: 3.769s (sys 3.742s)
Reported-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: https://lkml.kernel.org/r/20190208135021.013828701@linutronix.de
Taking a sleeping lock to _only_ increment a variable is quite the
overkill, and pretty much all users do this. Furthermore, some drivers
(ie: infiniband and scif) that need pinned semantics can go to quite
some trouble to actually delay via workqueue (un)accounting for pinned
pages when not possible to acquire it.
By making the counter atomic we no longer need to hold the mmap_sem and
can simply some code around it for pinned_vm users. The counter is 64-bit
such that we need not worry about overflows such as rdma user input
controlled from userspace.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.
The variable sighand_struct.count is used as pure reference counter.
Convert it to refcount_t and fix up the operations.
** Important note for maintainers:
Some functions from refcount_t API defined in lib/refcount.c
have different memory ordering guarantees than their atomic
counterparts.
The full comparison can be seen in
https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
in state to be merged to the documentation tree.
Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in
some rare cases it might matter.
Please double check that you don't have some undocumented
memory guarantees for this variable usage.
For the sighand_struct.count it might make a difference
in following places:
- __cleanup_sighand: decrement in refcount_dec_and_test() only
provides RELEASE ordering and control dependency on success
vs. fully ordered atomic counterpart
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: viro@zeniv.linux.org.uk
Link: https://lkml.kernel.org/r/1547814450-18902-2-git-send-email-elena.reshetova@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
loginuid and sessionid (and audit_log_session_info) should be part of
CONFIG_AUDIT scope and not CONFIG_AUDITSYSCALL since it is used in
CONFIG_CHANGE, ANOM_LINK, FEATURE_CHANGE (and INTEGRITY_RULE), none of
which are otherwise dependent on AUDITSYSCALL.
Please see github issue
https://github.com/linux-audit/audit-kernel/issues/104
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: tweaked subject line for better grep'ing]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Back in 2007 I made what turned out to be a rather serious
mistake in the implementation of the Smack security module.
The SELinux module used an interface in /proc to manipulate
the security context on processes. Rather than use a similar
interface, I used the same interface. The AppArmor team did
likewise. Now /proc/.../attr/current will tell you the
security "context" of the process, but it will be different
depending on the security module you're using.
This patch provides a subdirectory in /proc/.../attr for
Smack. Smack user space can use the "current" file in
this subdirectory and never have to worry about getting
SELinux attributes by mistake. Programs that use the
old interface will continue to work (or fail, as the case
may be) as before.
The proposed S.A.R.A security module is dependent on
the mechanism to create its own attr subdirectory.
The original implementation is by Kees Cook.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Header of /proc/*/limits is a fixed string, so print it directly without
formatting specifiers.
Link: http://lkml.kernel.org/r/20181203164242.GB6904@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
name_to_int() is defined in fs/proc/util.c and declared in
fs/proc/internal.h, but the declaration isn't included at the point of
the definition. Include the header to enforce that the definition
matches the declaration.
This addresses a gcc warning when -Wmissing-prototypes is enabled.
Link: http://lkml.kernel.org/r/20181115001833.49371-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Access to timerslack_ns is controlled by a process having CAP_SYS_NICE
in its effective capability set, but the current check looks in the root
namespace instead of the process' user namespace. Since a process is
allowed to do other activities controlled by CAP_SYS_NICE inside a
namespace, it should also be able to adjust timerslack_ns.
Link: http://lkml.kernel.org/r/20181030180012.232896-1-bmgordon@google.com
Signed-off-by: Benjamin Gordon <bmgordon@google.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Colin Cross <ccross@android.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Cc: Elliott Hughes <enh@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes has reported that commit 1860033237 ("mm: make
PR_SET_THP_DISABLE immediately active") has changed the way how we
report THPable VMAs to the userspace. Their monitoring tool is
triggering false alarms on PR_SET_THP_DISABLE tasks because it considers
an insufficient THP usage as a memory fragmentation resp. memory
pressure issue.
Before the said commit each newly created VMA inherited VM_NOHUGEPAGE
flag and that got exposed to the userspace via /proc/<pid>/smaps file.
This implementation had its downsides as explained in the commit message
but it is true that the userspace doesn't have any means to query for
the process wide THP enabled/disabled status.
PR_SET_THP_DISABLE is a process wide flag so it makes a lot of sense to
export in the process wide context rather than per-vma. Introduce a new
field to /proc/<pid>/status which export this status. If
PR_SET_THP_DISABLE is used then it reports false same as when the THP is
not compiled in. It doesn't consider the global THP status because we
already export that information via sysfs
Link: http://lkml.kernel.org/r/20181211143641.3503-4-mhocko@kernel.org
Fixes: 1860033237 ("mm: make PR_SET_THP_DISABLE immediately active")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Oppenheimer <bepvte@gmail.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Userspace falls short when trying to find out whether a specific memory
range is eligible for THP. There are usecases that would like to know
that
http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
: This is used to identify heap mappings that should be able to fault thp
: but do not, and they normally point to a low-on-memory or fragmentation
: issue.
The only way to deduce this now is to query for hg resp. nh flags and
confronting the state with the global setting. Except that there is also
PR_SET_THP_DISABLE that might change the picture. So the final logic is
not trivial. Moreover the eligibility of the vma depends on the type of
VMA as well. In the past we have supported only anononymous memory VMAs
but things have changed and shmem based vmas are supported as well these
days and the query logic gets even more complicated because the
eligibility depends on the mount option and another global configuration
knob.
Simplify the current state and report the THP eligibility in
/proc/<pid>/smaps for each existing vma. Reuse
transparent_hugepage_enabled for this purpose. The original
implementation of this function assumes that the caller knows that the vma
itself is supported for THP so make the core checks into
__transparent_hugepage_enabled and use it for existing callers.
__show_smap just use the new transparent_hugepage_enabled which also
checks the vma support status (please note that this one has to be out of
line due to include dependency issues).
[mhocko@kernel.org: fix oops with NULL ->f_mapping]
Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Oppenheimer <bepvte@gmail.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To avoid having to change many call sites everytime we want to add a
parameter use a structure to group all parameters for the mmu_notifier
invalidate_range_start/end cakks. No functional changes with this patch.
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
From: Jérôme Glisse <jglisse@redhat.com>
Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3
fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n
Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Certain pages that are never mapped to userspace have a type indicated in
the page_type field of their struct pages (e.g. PG_buddy). page_type
overlaps with _mapcount so set the count to 0 and avoid calling
page_mapcount() for these pages.
[anthony.yznaga@oracle.com: incorporate feedback from Matthew Wilcox]
Link: http://lkml.kernel.org/r/1544481313-27318-1-git-send-email-anthony.yznaga@oracle.com
Link: http://lkml.kernel.org/r/1543963526-27917-1-git-send-email-anthony.yznaga@oracle.com
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Miles Chen <miles.chen@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
totalram_pages and totalhigh_pages are made static inline function.
Main motivation was that managed_page_count_lock handling was complicating
things. It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
better to remove the lock and convert variables to atomic, with preventing
poteintial store-to-read tearing as a bonus.
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_sys_lookup can fail with ENOMEM instead of ENOENT when the
corresponding sysctl table is being unregistered. In our case we see
this upon opening /proc/sys/net/*/conf files while network interfaces
are being deleted, which confuses our configuration daemon.
The problem was successfully reproduced and this fix tested on v4.9.122
and v4.20-rc6.
v2: return ERR_PTRs in all cases when proc_sys_make_inode fails instead
of mixing them with NULL. Thanks Al Viro for the feedback.
Fixes: ace0c791e6 ("proc/sysctl: Don't grab i_lock under sysctl_lock.")
Cc: stable@vger.kernel.org
Signed-off-by: Ivan Delalande <colona@arista.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- Introduces the stackleak gcc plugin ported from grsecurity by Alexander
Popov, with x86 and arm64 support.
-----BEGIN PGP SIGNATURE-----
Comment: Kees Cook <kees@outflux.net>
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlvQvn4WHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJpSfD/sErFreuPT1beSw994Lr9Zx4k9v
ERsuXxWBENaJOJXbOOHMfVEcEeG/1uhPSp7hlw/dpHfh0anATTrcYqm8RNKbfK+k
o06+JK14OJfpm5Ghq/7OizhdNLCMT8wMU3XZtWfy65VSJGjEFx8Y48vMeQtpWtUK
ylSzi9JV6j2iUBF9oibtiT53+yqsqAtX80X1G7HRCgv9kxuKMhZr+Q5oGV6+ViyQ
Azj8mNn06iRnhHKd17WxDJr0GjSibzz4weS/9XgP3t3EcNWJo1EgBlD2KV3tOfP5
nzmqfqTqrcjxs/tyjdh6vVCSlYucNtyCQGn63qyShQYSg6mZwclR2fY8YSTw6PWw
GfYWFOWru9z+qyQmwFkQ9bSQS2R+JIT0oBCj9VmtF9XmPCy7K2neJsQclzSPBiCW
wPgXVQS4IA4684O5CmDOVMwmDpGvhdBNUR6cqSzGLxQOHY1csyXubMNUsqU3g9xk
Ob4pEy/xrrIw4WpwHcLHSEW5gV1/OLhsT0fGRJJiC947L3cN5s9EZp7FLbIS0zlk
qzaXUcLmn6AgcfkYwg5cI3RMLaN2V0eDCMVTWZJ1wbrmUV9chAaOnTPTjNqLOTht
v3b1TTxXG4iCpMmOFf59F8pqgAwbBDlfyNSbySZ/Pq5QH69udz3Z9pIUlYQnSJHk
u6q++2ReDpJXF81rBw==
=Ks6B
-----END PGP SIGNATURE-----
Merge tag 'stackleak-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull stackleak gcc plugin from Kees Cook:
"Please pull this new GCC plugin, stackleak, for v4.20-rc1. This plugin
was ported from grsecurity by Alexander Popov. It provides efficient
stack content poisoning at syscall exit. This creates a defense
against at least two classes of flaws:
- Uninitialized stack usage. (We continue to work on improving the
compiler to do this in other ways: e.g. unconditional zero init was
proposed to GCC and Clang, and more plugin work has started too).
- Stack content exposure. By greatly reducing the lifetime of valid
stack contents, exposures via either direct read bugs or unknown
cache side-channels become much more difficult to exploit. This
complements the existing buddy and heap poisoning options, but
provides the coverage for stacks.
The x86 hooks are included in this series (which have been reviewed by
Ingo, Dave Hansen, and Thomas Gleixner). The arm64 hooks have already
been merged through the arm64 tree (written by Laura Abbott and
reviewed by Mark Rutland and Will Deacon).
With VLAs having been removed this release, there is no need for
alloca() protection, so it has been removed from the plugin"
* tag 'stackleak-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
arm64: Drop unneeded stackleak_check_alloca()
stackleak: Allow runtime disabling of kernel stack erasing
doc: self-protection: Add information about STACKLEAK feature
fs/proc: Show STACKLEAK metrics in the /proc file system
lkdtm: Add a test for STACKLEAK
gcc-plugins: Add STACKLEAK plugin for tracking the kernel stack
x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls
Move remaining definitions and declarations from include/linux/bootmem.h
into include/linux/memblock.h and remove the redundant header.
The includes were replaced with the semantic patch below and then
semi-automated removal of duplicated '#include <linux/memblock.h>
@@
@@
- #include <linux/bootmem.h>
+ #include <linux/memblock.h>
[sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
[sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
[sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull XArray conversion from Matthew Wilcox:
"The XArray provides an improved interface to the radix tree data
structure, providing locking as part of the API, specifying GFP flags
at allocation time, eliminating preloading, less re-walking the tree,
more efficient iterations and not exposing RCU-protected pointers to
its users.
This patch set
1. Introduces the XArray implementation
2. Converts the pagecache to use it
3. Converts memremap to use it
The page cache is the most complex and important user of the radix
tree, so converting it was most important. Converting the memremap
code removes the only other user of the multiorder code, which allows
us to remove the radix tree code that supported it.
I have 40+ followup patches to convert many other users of the radix
tree over to the XArray, but I'd like to get this part in first. The
other conversions haven't been in linux-next and aren't suitable for
applying yet, but you can see them in the xarray-conv branch if you're
interested"
* 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
radix tree: Remove multiorder support
radix tree test: Convert multiorder tests to XArray
radix tree tests: Convert item_delete_rcu to XArray
radix tree tests: Convert item_kill_tree to XArray
radix tree tests: Move item_insert_order
radix tree test suite: Remove multiorder benchmarking
radix tree test suite: Remove __item_insert
memremap: Convert to XArray
xarray: Add range store functionality
xarray: Move multiorder_check to in-kernel tests
xarray: Move multiorder_shrink to kernel tests
xarray: Move multiorder account test in-kernel
radix tree test suite: Convert iteration test to XArray
radix tree test suite: Convert tag_tagged_items to XArray
radix tree: Remove radix_tree_clear_tags
radix tree: Remove radix_tree_maybe_preload_order
radix tree: Remove split/join code
radix tree: Remove radix_tree_update_node_t
page cache: Finish XArray conversion
dax: Convert page fault handlers to XArray
...
The page cache and most shrinkable slab caches hold data that has been
read from disk, but there are some caches that only cache CPU work, such
as the dentry and inode caches of procfs and sysfs, as well as the subset
of radix tree nodes that track non-resident page cache.
Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for
the shrinker's seeks setting tells the reclaim algorithm that for every
two page cache pages scanned it should scan one slab object.
This is a bogus setting. A virtual inode that required no IO to create is
not twice as valuable as a page cache page; shadow cache entries with
eviction distances beyond the size of memory aren't either.
In most cases, the behavior in practice is still fine. Such virtual
caches don't tend to grow and assert themselves aggressively, and usually
get picked up before they cause problems. But there are scenarios where
that's not true.
Our database workloads suffer from two of those. For one, their file
workingset is several times bigger than available memory, which has the
kernel aggressively create shadow page cache entries for the non-resident
parts of it. The workingset code does tell the VM that most of these are
expendable, but the VM ends up balancing them 2:1 to cache pages as per
the seeks setting. This is a huge waste of memory.
These workloads also deal with tens of thousands of open files and use
/proc for introspection, which ends up growing the proc_inode_cache to
absurdly large sizes - again at the cost of valuable cache space, which
isn't a reasonable trade-off, given that proc inodes can be re-created
without involving the disk.
This patch implements a "zero-seek" setting for shrinkers that results in
a target ratio of 0:1 between their objects and IO-backed caches. This
allows such virtual caches to grow when memory is available (they do
cache/avoid CPU work after all), but effectively disables them as soon as
IO-backed objects are under pressure.
It then switches the shrinkers for procfs and sysfs metadata, as well as
excess page cache shadow nodes, to the new zero-seek setting.
Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Domas Mituzas <dmituzas@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are several definitions of those functions/macros in places that
mess with fixed-point load averages. Provide an official version.
[akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vmstat NR_KERNEL_MISC_RECLAIMABLE counter is for kernel non-slab
allocations that can be reclaimed via shrinker. In /proc/meminfo, we can
show the sum of all reclaimable kernel allocations (including slab) as
"KReclaimable". Add the same counter also to per-node meminfo under /sys
With this counter, users will have more complete information about kernel
memory usage. Non-slab reclaimable pages (currently just the ION
allocator) will not be missing from /proc/meminfo, making users wonder
where part of their memory went. More precisely, they already appear in
MemAvailable, but without the new counter, it's not obvious why the value
in MemAvailable doesn't fully correspond with the sum of other counters
participating in it.
Link: http://lkml.kernel.org/r/20180731090649.16028-6-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86 mm updates from Ingo Molnar:
"Lots of changes in this cycle:
- Lots of CPA (change page attribute) optimizations and related
cleanups (Thomas Gleixner, Peter Zijstra)
- Make lazy TLB mode even lazier (Rik van Riel)
- Fault handler cleanups and improvements (Dave Hansen)
- kdump, vmcore: Enable kdumping encrypted memory with AMD SME
enabled (Lianbo Jiang)
- Clean up VM layout documentation (Baoquan He, Ingo Molnar)
- ... plus misc other fixes and enhancements"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
x86/stackprotector: Remove the call to boot_init_stack_canary() from cpu_startup_entry()
x86/mm: Kill stray kernel fault handling comment
x86/mm: Do not warn about PCI BIOS W+X mappings
resource: Clean it up a bit
resource: Fix find_next_iomem_res() iteration issue
resource: Include resource end in walk_*() interfaces
x86/kexec: Correct KEXEC_BACKUP_SRC_END off-by-one error
x86/mm: Remove spurious fault pkey check
x86/mm/vsyscall: Consider vsyscall page part of user address space
x86/mm: Add vsyscall address helper
x86/mm: Fix exception table comments
x86/mm: Add clarifying comments for user addr space
x86/mm: Break out user address space handling
x86/mm: Break out kernel address space handling
x86/mm: Clarify hardware vs. software "error_code"
x86/mm/tlb: Make lazy TLB mode lazier
x86/mm/tlb: Add freed_tables element to flush_tlb_info
x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range
smp,cpumask: introduce on_each_cpu_cond_mask
smp: use __cpumask_set_cpu in on_each_cpu_cond
...
Lianbo reported a build error with a particular 32-bit config, see Link
below for details.
Provide a weak copy_oldmem_page_encrypted() function which architectures
can override, in the same manner other functionality in that file is
supplied.
Reported-by: Lianbo Jiang <lijiang@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: x86@kernel.org
Link: http://lkml.kernel.org/r/710b9d95-2f70-eadf-c4a1-c3dc80ee4ebb@redhat.com
Currently, you can use /proc/self/task/*/stack to cause a stack walk on
a task you control while it is running on another CPU. That means that
the stack can change under the stack walker. The stack walker does
have guards against going completely off the rails and into random
kernel memory, but it can interpret random data from your kernel stack
as instruction pointers and stack pointers. This can cause exposure of
kernel stack contents to userspace.
Restrict the ability to inspect kernel stacks of arbitrary tasks to root
in order to prevent a local attacker from exploiting racy stack unwinding
to leak kernel task stack contents. See the added comment for a longer
rationale.
There don't seem to be any users of this userspace API that can't
gracefully bail out if reading from the file fails. Therefore, I believe
that this change is unlikely to break things. In the case that this patch
does end up needing a revert, the next-best solution might be to fake a
single-entry stack based on wchan.
Link: http://lkml.kernel.org/r/20180927153316.200286-1-jannh@google.com
Fixes: 2ec220e27f ("proc: add /proc/*/stack")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Introduce xarray value entries and tagged pointers to replace radix
tree exceptional entries. This is a slight change in encoding to allow
the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
value entry). It is also a change in emphasis; exceptional entries are
intimidating and different. As the comment explains, you can choose
to store values or pointers in the xarray and they are both first-class
citizens.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
The 'm' kcore_list item could point to kclist_head, and it is incorrect to
look at m->addr / m->size in this case.
There is no choice but to run through the list of entries for every
address if we did not find any entry in the previous iteration
Reset 'm' to NULL in that case at Omar Sandoval's suggestion.
[akpm@linux-foundation.org: add comment]
Link: http://lkml.kernel.org/r/1536100702-28706-1-git-send-email-asmadeus@codewreck.org
Fixes: bf991c2231 ("proc/kcore: optimize multiple page reads")
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Omar Sandoval <osandov@osandov.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Introduce CONFIG_STACKLEAK_METRICS providing STACKLEAK information about
tasks via the /proc file system. In particular, /proc/<pid>/stack_depth
shows the maximum kernel stack consumption for the current and previous
syscalls. Although this information is not precise, it can be useful for
estimating the STACKLEAK performance impact for your workloads.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Pull perf updates from Thomas Gleixner:
"Kernel:
- Improve kallsyms coverage
- Add x86 entry trampolines to kcore
- Fix ARM SPE handling
- Correct PPC event post processing
Tools:
- Make the build system more robust
- Small fixes and enhancements all over the place
- Update kernel ABI header copies
- Preparatory work for converting libtraceevnt to a shared library
- License cleanups"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
tools arch: Update arch/x86/lib/memcpy_64.S copy used in 'perf bench mem memcpy'
tools arch x86: Update tools's copy of cpufeatures.h
perf python: Fix pyrf_evlist__read_on_cpu() interface
perf mmap: Store real cpu number in 'struct perf_mmap'
perf tools: Remove ext from struct kmod_path
perf tools: Add gzip_is_compressed function
perf tools: Add lzma_is_compressed function
perf tools: Add is_compressed callback to compressions array
perf tools: Move the temp file processing into decompress_kmodule
perf tools: Use compression id in decompress_kmodule()
perf tools: Store compression id into struct dso
perf tools: Add compression id into 'struct kmod_path'
perf tools: Make is_supported_compression() static
perf tools: Make decompress_to_file() function static
perf tools: Get rid of dso__needs_decompress() call in __open_dso()
perf tools: Get rid of dso__needs_decompress() call in symbol__disassemble()
perf tools: Get rid of dso__needs_decompress() call in read_object_code()
tools lib traceevent: Change to SPDX License format
perf llvm: Allow passing options to llc in addition to clang
perf parser: Improve error message for PMU address filters
...
Without CONFIG_MMU, we get a build warning:
fs/proc/vmcore.c:228:12: error: 'vmcoredd_mmap_dumps' defined but not used [-Werror=unused-function]
static int vmcoredd_mmap_dumps(struct vm_area_struct *vma, unsigned long dst,
The function is only referenced from an #ifdef'ed caller, so
this uses the same #ifdef around it.
Link: http://lkml.kernel.org/r/20180525213526.2117790-1-arnd@arndb.de
Fixes: 7efe48df8a ("vmcore: append device dumps to vmcore as elf notes")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Ganesh Goudar <ganeshgr@chelsio.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vmcoreinfo information is useful for runtime debugging tools, not just
for crash dumps. A lot of this information can be determined by other
means, but this is much more convenient, and it only adds a page at most
to the file.
Link: http://lkml.kernel.org/r/fddbcd08eed76344863303878b12de1c1e2a04b6.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current code does a full search of the segment list every time for
every page. This is wasteful, since it's almost certain that the next
page will be in the same segment. Instead, check if the previous segment
covers the current page before doing the list search.
Link: http://lkml.kernel.org/r/fd346c11090cf93d867e01b8d73a6567c5ac6361.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the ELF file header, program headers, and note segment are
allocated all at once, in some icky code dating back to 2.3. Programs
tend to read the file header, then the program headers, then the note
segment, all separately, so this is a waste of effort. It's cleaner and
more efficient to handle the three separately.
Link: http://lkml.kernel.org/r/19c92cbad0e11f6103ff3274b2e7a7e51a1eb74b.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we're using an rwsem, we can hold it during the entirety of
read_kcore() and have a common return path. This is preparation for the
next change.
[akpm@linux-foundation.org: fix locking bug reported by Tetsuo Handa]
Link: http://lkml.kernel.org/r/d7cfbc1e8a76616f3b699eaff9df0a2730380534.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a theoretical race condition that will cause /proc/kcore to miss
a memory hotplug event:
CPU0 CPU1
// hotplug event 1
kcore_need_update = 1
open_kcore() open_kcore()
kcore_update_ram() kcore_update_ram()
// Walk RAM // Walk RAM
__kcore_update_ram() __kcore_update_ram()
kcore_need_update = 0
// hotplug event 2
kcore_need_update = 1
kcore_need_update = 0
Note that CPU1 set up the RAM kcore entries with the state after hotplug
event 1 but cleared the flag for hotplug event 2. The RAM entries will
therefore be stale until there is another hotplug event.
This is an extremely unlikely sequence of events, but the fix makes the
synchronization saner, anyways: we serialize the entire update sequence,
which means that whoever clears the flag will always succeed in replacing
the kcore list.
Link: http://lkml.kernel.org/r/6106c509998779730c12400c1b996425df7d7089.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now we only need kclist_lock from user context and at fs init time, and
the following changes need to sleep while holding the kclist_lock.
Link: http://lkml.kernel.org/r/521ba449ebe921d905177410fee9222d07882f0d.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memory hotplug notifier kcore_callback() only needs kclist_lock to
prevent races with __kcore_update_ram(), but we can easily eliminate that
race by using an atomic xchg() in __kcore_update_ram(). This is
preparation for converting kclist_lock to an rwsem.
Link: http://lkml.kernel.org/r/0a4bc89f4dbde8b5b2ea309f7b4fb6a85fe29df2.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "/proc/kcore improvements", v4.
This series makes a few improvements to /proc/kcore. It fixes a couple of
small issues in v3 but is otherwise the same. Patches 1, 2, and 3 are
prep patches. Patch 4 is a fix/cleanup. Patch 5 is another prep patch.
Patches 6 and 7 are optimizations to ->read(). Patch 8 makes it possible
to enable CRASH_CORE on any architecture, which is needed for patch 9.
Patch 9 adds vmcoreinfo to /proc/kcore.
This patch (of 9):
kclist_add() is only called at init time, so there's no point in grabbing
any locks. We're also going to replace the rwlock with a rwsem, which we
don't want to try grabbing during early boot.
While we're here, mark kclist_add() with __init so that we'll get a
warning if it's called from non-init code.
Link: http://lkml.kernel.org/r/98208db1faf167aa8b08eebfa968d95c70527739.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com>
Tested-by: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
elf_kcore_store_hdr() uses __pa() to find the physical address of
KCORE_RAM or KCORE_TEXT entries exported as program headers.
This trips CONFIG_DEBUG_VIRTUAL's checks, as the KCORE_TEXT entries are
not in the linear map.
Handle these two cases separately, using __pa_symbol() for the KCORE_TEXT
entries.
Link: http://lkml.kernel.org/r/20180711131944.15252-1-james.morse@arm.com
Signed-off-by: James Morse <james.morse@arm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct. For now, this is just documenting that the function
returns a VM_FAULT value rather than an errno. Once all instances are
converted, vm_fault_t will become a distinct type.
See 1c8f422059 ("mm: change return type to vm_fault_t") for reference.
Link: http://lkml.kernel.org/r/20180702153325.GA3875@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ganesh Goudar <ganeshgr@chelsio.com>
Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Number of CPUs is never high enough to force 64-bit arithmetic.
Save couple of bytes on x86_64.
Link: http://lkml.kernel.org/r/20180627200710.GC18434@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
->latency_record is defined as
struct latency_record[LT_SAVECOUNT];
so use the same macro whie iterating.
Link: http://lkml.kernel.org/r/20180627200534.GA18434@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Code checks if write is done by current to its own attributes.
For that get/put pair is unnecessary as it can be done under RCU.
Note: rcu_read_unlock() can be done even earlier since pointer to a task
is not dereferenced. It depends if /proc code should look scary or not:
rcu_read_lock();
task = pid_task(...);
rcu_read_unlock();
if (!task)
return -ESRCH;
if (task != current)
return -EACCESS:
P.S.: rename "length" variable. Code like this
length = -EINVAL;
should not exist.
Link: http://lkml.kernel.org/r/20180627200218.GF18113@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Readdir context is thread local, so ->pos is thread local,
move it out of readlock.
Link: http://lkml.kernel.org/r/20180627195339.GD18113@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_monotonic_boottime() is deprecated and uses the old timespec type.
Let's convert /proc/uptime to use ktime_get_boottime_ts64().
Link: http://lkml.kernel.org/r/20180620081746.282742-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
24074a35c5 ("proc: Make inline name size calculation automatic")
started to put PDE allocations into kmalloc-256 which is unnecessary as
~40 character names are very rare.
Put allocation back into kmalloc-192 cache for 64-bit non-debug builds.
Put BUILD_BUG_ON to know when PDE size has gotten out of control.
[adobriyan@gmail.com: fix BUILD_BUG_ON breakage on powerpc64]
Link: http://lkml.kernel.org/r/20180703191602.GA25521@avx2
Link: http://lkml.kernel.org/r/20180617215732.GA24688@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, percpu memory only exposes allocation and utilization
information via debugfs. This more or less is only really useful for
understanding the fragmentation and allocation information at a per-chunk
level with a few global counters. This is also gated behind a config.
BPF and cgroup, for example, have seen an increase in use causing
increased use of percpu memory. Let's make it easier for someone to
identify how much memory is being used.
This patch adds the "Percpu" stat to meminfo to more easily look up how
much percpu memory is in use. This number includes the cost for all
allocated backing pages and not just insight at the per a unit, per chunk
level. Metadata is excluded. I think excluding metadata is fair because
the backing memory scales with the numbere of cpus and can quickly
outweigh the metadata. It also makes this calculation light.
Link: http://lkml.kernel.org/r/20180807184723.74919-1-dennisszhou@gmail.com
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The /proc/pid/smaps_rollup file is currently implemented via the
m_start/m_next/m_stop seq_file iterators shared with the other maps files,
that iterate over vma's. However, the rollup file doesn't print anything
for each vma, only accumulate the stats.
There are some issues with the current code as reported in [1] - the
accumulated stats can get skewed if seq_file start()/stop() op is called
multiple times, if show() is called multiple times, and after seeks to
non-zero position.
Patch [1] fixed those within existing design, but I believe it is
fundamentally wrong to expose the vma iterators to the seq_file mechanism
when smaps_rollup shows logically a single set of values for the whole
address space.
This patch thus refactors the code to provide a single "value" at offset
0, with vma iteration to gather the stats done internally. This fixes the
situations where results are skewed, and simplifies the code, especially
in show_smap(), at the expense of somewhat less code reuse.
[1] https://marc.info/?l=linux-mm&m=151927723128134&w=2
[vbabka@suse.c: use seq_file infrastructure]
Link: http://lkml.kernel.org/r/bf4525b0-fd5b-4c4c-2cb3-adee3dd95a48@suse.cz
Link: http://lkml.kernel.org/r/20180723111933.15443-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Daniel Colascione <dancol@google.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To prepare for handling /proc/pid/smaps_rollup differently from
/proc/pid/smaps factor out from show_smap() printing the parts of output
that are common for both variants, which is the bulk of the gathered
memory stats.
[vbabka@suse.cz: add const, per Alexey]
Link: http://lkml.kernel.org/r/b45f319f-cd04-337b-37f8-77f99786aa8a@suse.cz
Link: http://lkml.kernel.org/r/20180723111933.15443-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To prepare for handling /proc/pid/smaps_rollup differently from
/proc/pid/smaps factor out vma mem stats gathering from show_smap() - it
will be used by both.
Link: http://lkml.kernel.org/r/20180723111933.15443-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "cleanups and refactor of /proc/pid/smaps*".
The recent regression in /proc/pid/smaps made me look more into the code.
Especially the issues with smaps_rollup reported in [1] as explained in
Patch 4, which fixes them by refactoring the code. Patches 2 and 3 are
preparations for that. Patch 1 is me realizing that there's a lot of
boilerplate left from times where we tried (unsuccessfuly) to mark thread
stacks in the output.
Originally I had also plans to rework the translation from
/proc/pid/*maps* file offsets to the internal structures. Now the offset
means "vma number", which is not really stable (vma's can come and go
between read() calls) and there's an extra caching of last vma's address.
My idea was that offsets would be interpreted directly as addresses, which
would also allow meaningful seeks (see the ugly seek_to_smaps_entry() in
tools/testing/selftests/vm/mlock2.h). However loff_t is (signed) long
long so that might be insufficient somewhere for the unsigned long
addresses.
So the result is fixed issues with skewed /proc/pid/smaps_rollup results,
simpler smaps code, and a lot of unused code removed.
[1] https://marc.info/?l=linux-mm&m=151927723128134&w=2
This patch (of 4):
Commit b76437579d ("procfs: mark thread stack correctly in
proc/<pid>/maps") introduced differences between /proc/PID/maps and
/proc/PID/task/TID/maps to mark thread stacks properly, and this was
also done for smaps and numa_maps. However it didn't work properly and
was ultimately removed by commit b18cb64ead ("fs/proc: Stop trying to
report thread stacks").
Now the is_pid parameter for the related show_*() functions is unused
and we can remove it together with wrapper functions and ops structures
that differ for PID and TID cases only in this parameter.
Link: http://lkml.kernel.org/r/20180723111933.15443-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Without program headers for PTI entry trampoline pages, the trampoline
virtual addresses do not map to anything.
Example before:
sudo gdb --quiet vmlinux /proc/kcore
Reading symbols from vmlinux...done.
[New process 1]
Core was generated by `BOOT_IMAGE=/boot/vmlinuz-4.16.0 root=UUID=a6096b83-b763-4101-807e-f33daff63233'.
#0 0x0000000000000000 in irq_stack_union ()
(gdb) x /21ib 0xfffffe0000006000
0xfffffe0000006000: Cannot access memory at address 0xfffffe0000006000
(gdb) quit
After:
sudo gdb --quiet vmlinux /proc/kcore
[sudo] password for ahunter:
Reading symbols from vmlinux...done.
[New process 1]
Core was generated by `BOOT_IMAGE=/boot/vmlinuz-4.16.0-fix-4-00005-gd6e65a8b4072 root=UUID=a6096b83-b7'.
#0 0x0000000000000000 in irq_stack_union ()
(gdb) x /21ib 0xfffffe0000006000
0xfffffe0000006000: swapgs
0xfffffe0000006003: mov %rsp,-0x3e12(%rip) # 0xfffffe00000021f8
0xfffffe000000600a: xchg %ax,%ax
0xfffffe000000600c: mov %cr3,%rsp
0xfffffe000000600f: bts $0x3f,%rsp
0xfffffe0000006014: and $0xffffffffffffe7ff,%rsp
0xfffffe000000601b: mov %rsp,%cr3
0xfffffe000000601e: mov -0x3019(%rip),%rsp # 0xfffffe000000300c
0xfffffe0000006025: pushq $0x2b
0xfffffe0000006027: pushq -0x3e35(%rip) # 0xfffffe00000021f8
0xfffffe000000602d: push %r11
0xfffffe000000602f: pushq $0x33
0xfffffe0000006031: push %rcx
0xfffffe0000006032: push %rdi
0xfffffe0000006033: mov $0xffffffff91a00010,%rdi
0xfffffe000000603a: callq 0xfffffe0000006046
0xfffffe000000603f: pause
0xfffffe0000006041: lfence
0xfffffe0000006044: jmp 0xfffffe000000603f
0xfffffe0000006046: mov %rdi,(%rsp)
0xfffffe000000604a: retq
(gdb) quit
In addition, entry trampolines all map to the same page. Represent that
by giving the corresponding program headers in kcore the same offset.
This has the benefit that, when perf tools uses /proc/kcore as a source
for kernel object code, samples from different CPU trampolines are
aggregated together. Note, such aggregation is normal for profiling
i.e. people want to profile the object code, not every different virtual
address the object code might be mapped to (across different processes
for example).
Notes by PeterZ:
This also adds the KCORE_REMAP functionality.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/1528289651-4113-4-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Thomas reports:
"While looking around in /proc on my v4.14.52 system I noticed that all
processes got a lot of "Locked" memory in /proc/*/smaps. A lot more
memory than a regular user can usually lock with mlock().
Commit 493b0e9d94 (in v4.14-rc1) seems to have changed the behavior
of "Locked".
Before that commit the code was like this. Notice the VM_LOCKED check.
(vma->vm_flags & VM_LOCKED) ?
(unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);
After that commit Locked is now the same as Pss:
(unsigned long)(mss->pss >> (10 + PSS_SHIFT)));
This looks like a mistake."
Indeed, the commit has added mss->pss_locked with the correct value that
depends on VM_LOCKED, but forgot to actually use it. Fix it.
Link: http://lkml.kernel.org/r/ebf6c7fb-fec3-6a26-544f-710ed193c154@suse.cz
Fixes: 493b0e9d94 ("mm: add /proc/pid/smaps_rollup")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs fix from Al Viro:
"Followup to procfs-seq_file series this window"
This fixes a memory leak by making sure that proc seq files release any
private data on close. The 'proc_seq_open' has to be properly paired
with 'proc_seq_release' that releases the extra private data.
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
proc: add proc_seq_release
kmemleak reported some memory leak on reading proc files. After adding
some debug lines, find that proc_seq_fops is using seq_release as
release handler, which won't handle the free of 'private' field of
seq_file, while in fact the open handler proc_seq_open could create
the private data with __seq_open_private when state_size is greater
than zero. So after reading files created with proc_create_seq_private,
such as /proc/timer_list and /proc/vmallocinfo, the private mem of a
seq_file is not freed. Fix it by adding the paired proc_seq_release
as the default release handler of proc_seq_ops instead of seq_release.
Fixes: 44414d82cf ("proc: introduce proc_create_seq_private")
Reviewed-by: Christoph Hellwig <hch@lst.de>
CC: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The rewrite of the cmdline fetching missed the fact that we used to also
return the final terminating NUL character of the last argument. I
hadn't noticed, and none of the tools I tested cared, but something
obviously must care, because Michal Kubecek noticed the change in
behavior.
Tweak the "find the end" logic to actually include the NUL character,
and once past the eend of argv, always start the strnlen() at the
expected (original) argument end.
This whole "allow people to rewrite their arguments in place" is a nasty
hack and requires that odd slop handling at the end of the argv array,
but it's our traditional model, so we continue to support it.
Repored-and-bisected-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-and-tested-by: Michal Kubecek <mkubecek@suse.cz>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull AFS updates from Al Viro:
"Assorted AFS stuff - ended up in vfs.git since most of that consists
of David's AFS-related followups to Christoph's procfs series"
* 'afs-proc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
afs: Optimise callback breaking by not repeating volume lookup
afs: Display manually added cells in dynamic root mount
afs: Enable IPv6 DNS lookups
afs: Show all of a server's addresses in /proc/fs/afs/servers
afs: Handle CONFIG_PROC_FS=n
proc: Make inline name size calculation automatic
afs: Implement network namespacing
afs: Mark afs_net::ws_cell as __rcu and set using rcu functions
afs: Fix a Sparse warning in xdr_decode_AFSFetchStatus()
proc: Add a way to make network proc files writable
afs: Rearrange fs/afs/proc.c to remove remaining predeclarations.
afs: Rearrange fs/afs/proc.c to move the show routines up
afs: Rearrange fs/afs/proc.c by moving fops and open functions down
afs: Move /proc management functions to the end of the file
Make calculation of the size of the inline name in struct proc_dir_entry
automatic, rather than having to manually encode the numbers and failing to
allow for lockdep.
Require a minimum inline name size of 33+1 to allow for names that look
like two hex numbers with a dash between.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge more updates from Andrew Morton:
- MM remainders
- various misc things
- kcov updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (27 commits)
lib/test_printf.c: call wait_for_random_bytes() before plain %p tests
hexagon: drop the unused variable zero_page_mask
hexagon: fix printk format warning in setup.c
mm: fix oom_kill event handling
treewide: use PHYS_ADDR_MAX to avoid type casting ULLONG_MAX
mm: use octal not symbolic permissions
ipc: use new return type vm_fault_t
sysvipc/sem: mitigate semnum index against spectre v1
fault-injection: reorder config entries
arm: port KCOV to arm
sched/core / kcov: avoid kcov_area during task switch
kcov: prefault the kcov_area
kcov: ensure irq code sees a valid area
kernel/relay.c: change return type to vm_fault_t
exofs: avoid VLA in structures
coredump: fix spam with zero VMA process
fat: use fat_fs_error() instead of BUG_ON() in __fat_get_block()
proc: skip branch in /proc/*/* lookup
mremap: remove LATENCY_LIMIT from mremap to reduce the number of TLB shootdowns
mm/memblock: add missing include <linux/bootmem.h>
...
Code is structured like this:
for ( ... p < last; p++) {
if (memcmp == 0)
break;
}
if (p >= last)
ERROR
OK
gcc doesn't see that if if lookup succeeds than post loop branch will
never be taken and skip it.
[akpm@linux-foundation.org: proc_pident_instantiate() no longer takes an inode*]
Link: http://lkml.kernel.org/r/20180423213954.GD9043@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
There were no conflicts between this and the contents of linux-next
until just before the merge window, when we saw multiple problems:
- A minor conflict with my own y2038 fixes, which I could address
by adding another patch on top here.
- One semantic conflict with late changes to the NFS tree. I addressed
this by merging Deepa's original branch on top of the changes that
now got merged into mainline and making sure the merge commit includes
the necessary changes as produced by coccinelle.
- A trivial conflict against the removal of staging/lustre.
- Multiple conflicts against the VFS changes in the overlayfs tree.
These are still part of linux-next, but apparently this is no longer
intended for 4.18 [1], so I am ignoring that part.
As Deepa writes:
The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64 timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual
replacement becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions.
Thomas Gleixner adds:
I think there is no point to drag that out for the next merge window.
The whole thing needs to be done in one go for the core changes which
means that you're going to play that catchup game forever. Let's get
over with it towards the end of the merge window.
[1] https://www.spinics.net/lists/linux-fsdevel/msg128294.html
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbInZAAAoJEGCrR//JCVInReoQAIlVIIMt5ZX6wmaKbrjy9Itf
MfgbFihQ/djLnuSPVQ3nztcxF0d66BKHZ9puVjz6+mIHqfDvJTRwZs9nU+sOF/T1
g78fRkM1cxq6ZCkGYAbzyjyo5aC4PnSMP/NQLmwqvi0MXqqrbDoq5ZdP9DHJw39h
L9lD8FM/P7T29Fgp9tq/pT5l9X8VU8+s5KQG1uhB5hii4VL6pD6JyLElDita7rg+
Z7/V7jkxIGEUWF7vGaiR1QTFzEtpUA/exDf9cnsf51OGtK/LJfQ0oiZPPuq3oA/E
LSbt8YQQObc+dvfnGxwgxEg1k5WP5ekj/Wdibv/+rQKgGyLOTz6Q4xK6r8F2ahxs
nyZQBdXqHhJYyKr1H1reUH3mrSgQbE5U5R1i3My0xV2dSn+vtK5vgF21v2Ku3A1G
wJratdtF/kVBzSEQUhsYTw14Un+xhBLRWzcq0cELonqxaKvRQK9r92KHLIWNE7/v
c0TmhFbkZA+zR8HdsaL3iYf1+0W/eYy8PcvepyldKNeW2pVk3CyvdTfY2Z87G2XK
tIkK+BUWbG3drEGG3hxZ3757Ln3a9qWyC5ruD3mBVkuug/wekbI8PykYJS7Mx4s/
WNXl0dAL0Eeu1M8uEJejRAe1Q3eXoMWZbvCYZc+wAm92pATfHVcKwPOh8P7NHlfy
A3HkjIBrKW5AgQDxfgvm
=CZX2
-----END PGP SIGNATURE-----
Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground
Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
"This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
As Deepa writes:
'The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64
timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual replacement
becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions'
Thomas Gleixner adds:
'I think there is no point to drag that out for the next merge
window. The whole thing needs to be done in one go for the core
changes which means that you're going to play that catchup game
forever. Let's get over with it towards the end of the merge window'"
* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
pstore: Remove bogus format string definition
vfs: change inode times to use struct timespec64
pstore: Convert internal records to timespec64
udf: Simplify calls to udf_disk_stamp_to_time
fs: nfs: get rid of memcpys for inode times
ceph: make inode time prints to be long long
lustre: Use long long type to print inode time
fs: add timespec64_truncate()
Pull the timespec64 conversion from Deepa Dinamani:
"The series aims to switch vfs timestamps to use
struct timespec64. Currently vfs uses struct timespec,
which is not y2038 safe.
The flag patch applies cleanly. I've not seen the timestamps
update logic change often. The series applies cleanly on 4.17-rc6
and linux-next tip (top commit: next-20180517).
I'm not sure how to merge this kind of a series with a flag patch.
We are targeting 4.18 for this.
Let me know if you have other suggestions.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64 timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual
replacement becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
I've tried to keep the conversions with the script simple, to
aid in the reviews. I've kept all the internal filesystem data
structures and function signatures the same.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions."
I've pulled it into a branch based on top of the NFS changes that
are now in mainline, so I could resolve the non-obvious conflict
between the two while merging.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Merge proc_cmdline simplifications.
This re-writes the get_mm_cmdline() logic to be rather simpler than it
used to be, and makes the semantics for "cmdline goes past the end of
the original area" more natural.
You _can_ use prctl(PR_SET_MM) to just point your command line somewhere
else entirely, but the traditional model is to just edit things in place
and that still needs to continue to work. At least this way the code
makes some sense.
* proc-cmdline:
fs/proc: simplify and clarify get_mm_cmdline() function
fs/proc: re-factor proc_pid_cmdline_read() a bit
Pull proc_fill_cache regression fix from Al Viro:
"Regression fix for proc_fill_cache() braino introduced when switching
instantiate() callback to d_splice_alias()"
* 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix proc_fill_cache() in case of d_alloc_parallel() failure
If d_alloc_parallel() returns ERR_PTR(...), we don't want to dput()
that. Small reorganization allows to have all error-in-lookup
cases rejoin the main codepath after dput(child), avoiding the
entire problem.
Spotted-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Fixes: 0168b9e38c "procfs: switch instantiate_t to d_splice_alias()"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
struct stack_trace::nr_entries is defined as "unsigned int" (YAY!) so
the iterator should be unsigned as well.
It saves 1 byte of code or something like that.
Link: http://lkml.kernel.org/r/20180423215248.GG9043@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's defined as atomic_t and really long signal queues are unheard of.
Link: http://lkml.kernel.org/r/20180423215119.GF9043@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All those lengths are unsigned as they should be.
Link: http://lkml.kernel.org/r/20180423213751.GC9043@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Code can be sonsolidated if a dummy region of 0 length is used in normal
case of \0-separated command line:
1) [arg_start, arg_end) + [dummy len=0]
2) [arg_start, arg_end) + [env_start, env_end)
Link: http://lkml.kernel.org/r/20180221193335.GB28678@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"rv" variable is used both as a counter of bytes transferred and an
error value holder but it can be reduced solely to error values if
original start of userspace buffer is stashed and used at the very end.
[akpm@linux-foundation.org: simplify cleanup code]
Link: http://lkml.kernel.org/r/20180221193009.GA28678@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"final" variable is OK but we can get away with less lines.
Link: http://lkml.kernel.org/r/20180221192751.GC28548@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
access_remote_vm() doesn't return negative errors, it returns number of
bytes read/written (0 if error occurs). This allows to delete some
comparisons which never trigger.
Reuse "nr_read" variable while I'm at it.
Link: http://lkml.kernel.org/r/20180221192605.GB28548@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Define a new PageTable bit in the page_type and use it to mark pages in
use as page tables. This can be helpful when debugging crashdumps or
analysing memory fragmentation. Add a KPF flag to report these pages to
userspace and update page-types.c to interpret that flag.
Note that only pages currently accounted as NR_PAGETABLES are tracked as
PageTable; this does not include pgd/p4d/pud/pmd pages. Those will be the
subject of a later patch.
Link: http://lkml.kernel.org/r/20180518194519.3820-4-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit ab676b7d6f ("pagemap: do not leak physical addresses to
non-privileged userspace"), the /proc/PID/pagemap is restricted to be
readable only by CAP_SYS_ADMIN to address some security issue.
In commit 1c90308e7a ("pagemap: hide physical addresses from
non-privileged users"), the restriction is relieved to make
/proc/PID/pagemap readable, but hide the physical addresses for
non-privileged users.
But the swap entries are readable for non-privileged users too. This
has some security issues. For example, for page under migrating, the
swap entry has physical address information. So, in this patch, the
swap entries are hided for non-privileged users too.
Link: http://lkml.kernel.org/r/20180508012745.7238-1-ying.huang@intel.com
Fixes: 1c90308e7a ("pagemap: hide physical addresses from non-privileged users")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mmap_sem is on the hot path of kernel, and it very contended, but it is
abused too. It is used to protect arg_start|end and evn_start|end when
reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
sense since those proc files just expect to read 4 values atomically and
not related to VM, they could be set to arbitrary values by C/R.
And, the mmap_sem contention may cause unexpected issue like below:
INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
ps D 0 14018 1 0x00000004
Call Trace:
schedule+0x36/0x80
rwsem_down_read_failed+0xf0/0x150
call_rwsem_down_read_failed+0x18/0x30
down_read+0x20/0x40
proc_pid_cmdline_read+0xd9/0x4e0
__vfs_read+0x37/0x150
vfs_read+0x96/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x1a/0xc5
Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
for them to mitigate the abuse of mmap_sem.
So, introduce a new spinlock in mm_struct to protect the concurrent
access to arg_start|end, env_start|end and others, as well as replace
write map_sem to read to protect the race condition between prctl and
sys_brk which might break check_data_rlimit(), and makes prctl more
friendly to other VM operations.
This patch just eliminates the abuse of mmap_sem, but it can't resolve
the above hung task warning completely since the later
access_remote_vm() call needs acquire mmap_sem. The mmap_sem
scalability issue will be solved in the future.
[yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Notable changes:
- Support for split PMD page table lock on 64-bit Book3S (Power8/9).
- Add support for HAVE_RELIABLE_STACKTRACE, so we properly support live
patching again.
- Add support for patching barrier_nospec in copy_from_user() and syscall entry.
- A couple of fixes for our data breakpoints on Book3S.
- A series from Nick optimising TLB/mm handling with the Radix MMU.
- Numerous small cleanups to squash sparse/gcc warnings from Mathieu Malaterre.
- Several series optimising various parts of the 32-bit code from Christophe Leroy.
- Removal of support for two old machines, "SBC834xE" and "C2K" ("GEFanuc,C2K"),
which is why the diffstat has so many deletions.
And many other small improvements & fixes.
There's a few out-of-area changes. Some minor ftrace changes OK'ed by Steve, and
a fix to our powernv cpuidle driver. Then there's a series touching mm, x86 and
fs/proc/task_mmu.c, which cleans up some details around pkey support. It was
ack'ed/reviewed by Ingo & Dave and has been in next for several weeks.
Thanks to:
Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al Viro, Andrew
Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd Bergmann, Balbir Singh,
Cédric Le Goater, Christophe Leroy, Christophe Lombard, Colin Ian King, Dave
Hansen, Fabio Estevam, Finn Thain, Frederic Barrat, Gautham R. Shenoy, Haren
Myneni, Hari Bathini, Ingo Molnar, Jonathan Neuschäfer, Josh Poimboeuf,
Kamalesh Babulal, Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu
Malaterre, Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao,
Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul
Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica Gupta, Ravi
Bangoria, Russell Currey, Sam Bobroff, Samuel Mendoza-Jonas, Segher
Boessenkool, Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stewart Smith,
Thiago Jung Bauermann, Torsten Duwe, Vaibhav Jain, Wei Yongjun, Wolfram Sang,
Yisheng Xie, YueHaibing.
-----BEGIN PGP SIGNATURE-----
iQIwBAABCAAaBQJbGQKBExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYBq
TRAAioK7rz5xYMkxaM3Ng3ybobEeNAwQqOolz98xvmnB9SfDWNuc99vf8cGu0/fQ
zc8AKZ5RcnwipOjyGlxW9oa1ZhVq0xtYnQPiYLEKMdLQmh5D+C7+KpvAd1UElweg
ub40/xDySWfMujfuMSF9JDCWPIXyojt4Xg5nJKIVRrAm/3YMe/+i5Am7NWHuMCEb
aQmZtlYW5Mz81XY0968hjpUO6eKFRmsaM7yFAhGTXx6+oLRpGj1PZB4AwdRIKS2L
Ak7q/VgxtE4W+s3a0GK2s+eXIhGKeFuX9AVnx3nti+8/K1OqrqhDcLMUC/9JpCpv
EvOtO7dxPnZujHjdu4Eai/xNoo4h6zRy7bWqve9LoBM40CP5jljKzu1lwqqb5yO0
jC7/aXhgiSIxxcRJLjoI/TYpZPu40MifrkydmczykdPyPCnMIWEJDcj4KsRL/9Y8
9SSbJzRNC/SgQNTbUYPZFFi6G0QaMmlcbCb628k8QT+Gn3Xkdf/ZtxzqEyoF4Irq
46kFBsiSSK4Bu0rVlcUtJQLgdqytWULO6NKEYnD67laxYcgQd8pGFQ8SjZhRZLgU
q5LA3HIWhoAI4M0wZhOnKXO6JfiQ1UbO8gUJLsWsfF0Fk5KAcdm+4kb4jbI1H4Qk
Vol9WNRZwEllyaiqScZN9RuVVuH0GPOZeEH1dtWK+uWi0lM=
=ZlBf
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- Support for split PMD page table lock on 64-bit Book3S (Power8/9).
- Add support for HAVE_RELIABLE_STACKTRACE, so we properly support
live patching again.
- Add support for patching barrier_nospec in copy_from_user() and
syscall entry.
- A couple of fixes for our data breakpoints on Book3S.
- A series from Nick optimising TLB/mm handling with the Radix MMU.
- Numerous small cleanups to squash sparse/gcc warnings from Mathieu
Malaterre.
- Several series optimising various parts of the 32-bit code from
Christophe Leroy.
- Removal of support for two old machines, "SBC834xE" and "C2K"
("GEFanuc,C2K"), which is why the diffstat has so many deletions.
And many other small improvements & fixes.
There's a few out-of-area changes. Some minor ftrace changes OK'ed by
Steve, and a fix to our powernv cpuidle driver. Then there's a series
touching mm, x86 and fs/proc/task_mmu.c, which cleans up some details
around pkey support. It was ack'ed/reviewed by Ingo & Dave and has
been in next for several weeks.
Thanks to: Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al
Viro, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd
Bergmann, Balbir Singh, Cédric Le Goater, Christophe Leroy, Christophe
Lombard, Colin Ian King, Dave Hansen, Fabio Estevam, Finn Thain,
Frederic Barrat, Gautham R. Shenoy, Haren Myneni, Hari Bathini, Ingo
Molnar, Jonathan Neuschäfer, Josh Poimboeuf, Kamalesh Babulal,
Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu Malaterre,
Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao,
Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul
Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica
Gupta, Ravi Bangoria, Russell Currey, Sam Bobroff, Samuel
Mendoza-Jonas, Segher Boessenkool, Shilpasri G Bhat, Simon Guo,
Souptick Joarder, Stewart Smith, Thiago Jung Bauermann, Torsten Duwe,
Vaibhav Jain, Wei Yongjun, Wolfram Sang, Yisheng Xie, YueHaibing"
* tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (251 commits)
powerpc/64s/radix: Fix missing ptesync in flush_cache_vmap
cpuidle: powernv: Fix promotion from snooze if next state disabled
powerpc: fix build failure by disabling attribute-alias warning in pci_32
ocxl: Fix missing unlock on error in afu_ioctl_enable_p9_wait()
powerpc-opal: fix spelling mistake "Uniterrupted" -> "Uninterrupted"
powerpc: fix spelling mistake: "Usupported" -> "Unsupported"
powerpc/pkeys: Detach execute_only key on !PROT_EXEC
powerpc/powernv: copy/paste - Mask SO bit in CR
powerpc: Remove core support for Marvell mv64x60 hostbridges
powerpc/boot: Remove core support for Marvell mv64x60 hostbridges
powerpc/boot: Remove support for Marvell mv64x60 i2c controller
powerpc/boot: Remove support for Marvell MPSC serial controller
powerpc/embedded6xx: Remove C2K board support
powerpc/lib: optimise PPC32 memcmp
powerpc/lib: optimise 32 bits __clear_user()
powerpc/time: inline arch_vtime_task_switch()
powerpc/Makefile: set -mcpu=860 flag for the 8xx
powerpc: Implement csum_ipv6_magic in assembly
powerpc/32: Optimise __csum_partial()
powerpc/lib: Adjust .balign inside string functions for PPC32
...
Pull networking updates from David Miller:
1) Add Maglev hashing scheduler to IPVS, from Inju Song.
2) Lots of new TC subsystem tests from Roman Mashak.
3) Add TCP zero copy receive and fix delayed acks and autotuning with
SO_RCVLOWAT, from Eric Dumazet.
4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
Brouer.
5) Add ttl inherit support to vxlan, from Hangbin Liu.
6) Properly separate ipv6 routes into their logically independant
components. fib6_info for the routing table, and fib6_nh for sets of
nexthops, which thus can be shared. From David Ahern.
7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
messages from XDP programs. From Nikita V. Shirokov.
8) Lots of long overdue cleanups to the r8169 driver, from Heiner
Kallweit.
9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.
10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.
11) Plumb extack down into fib_rules, from Roopa Prabhu.
12) Add Flower classifier offload support to igb, from Vinicius Costa
Gomes.
13) Add UDP GSO support, from Willem de Bruijn.
14) Add documentation for eBPF helpers, from Quentin Monnet.
15) Add TLS tx offload to mlx5, from Ilya Lesokhin.
16) Allow applications to be given the number of bytes available to read
on a socket via a control message returned from recvmsg(), from
Soheil Hassas Yeganeh.
17) Add x86_32 eBPF JIT compiler, from Wang YanQing.
18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
From Björn Töpel.
19) Remove indirect load support from all of the BPF JITs and handle
these operations in the verifier by translating them into native BPF
instead. From Daniel Borkmann.
20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.
21) Allow XDP programs to do lookups in the main kernel routing tables
for forwarding. From David Ahern.
22) Allow drivers to store hardware state into an ELF section of kernel
dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.
23) Various RACK and loss detection improvements in TCP, from Yuchung
Cheng.
24) Add TCP SACK compression, from Eric Dumazet.
25) Add User Mode Helper support and basic bpfilter infrastructure, from
Alexei Starovoitov.
26) Support ports and protocol values in RTM_GETROUTE, from Roopa
Prabhu.
27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
Brouer.
28) Add lots of forwarding selftests, from Petr Machata.
29) Add generic network device failover driver, from Sridhar Samudrala.
* ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
strparser: Add __strp_unpause and use it in ktls.
rxrpc: Fix terminal retransmission connection ID to include the channel
net: hns3: Optimize PF CMDQ interrupt switching process
net: hns3: Fix for VF mailbox receiving unknown message
net: hns3: Fix for VF mailbox cannot receiving PF response
bnx2x: use the right constant
Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
enic: fix UDP rss bits
netdev-FAQ: clarify DaveM's position for stable backports
rtnetlink: validate attributes in do_setlink()
mlxsw: Add extack messages for port_{un, }split failures
netdevsim: Add extack error message for devlink reload
devlink: Add extack to reload and port_{un, }split operations
net: metrics: add proper netlink validation
ipmr: fix error path when ipmr_new_table fails
ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
net: hns3: remove unused hclgevf_cfg_func_mta_filter
netfilter: provide udp*_lib_lookup for nf_tproxy
qed*: Utilize FW 8.37.2.0
...
Pull workqueue updates from Tejun Heo:
- make kworkers report the workqueue it is executing or has executed
most recently in /proc/PID/comm (so they show up in ps/top)
- CONFIG_SMP shuffle to move stuff which isn't necessary for UP builds
inside CONFIG_SMP.
* 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: move function definitions within CONFIG_SMP block
workqueue: Make sure struct worker is accessible for wq_worker_comm()
workqueue: Show the latest workqueue name in /proc/PID/{comm,stat,status}
proc: Consolidate task->comm formatting into proc_task_name()
workqueue: Set worker->desc to workqueue name by default
workqueue: Make worker_attach/detach_pool() update worker->pool
workqueue: Replace pool->attach_mutex with global wq_pool_attach_mutex
Pull dcache lookup cleanups from Al Viro:
"Cleaning ->lookup() instances up - mostly d_splice_alias() conversions"
* 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (29 commits)
switch the rest of procfs lookups to d_splice_alias()
procfs: switch instantiate_t to d_splice_alias()
don't bother with tid_fd_revalidate() in lookups
proc_lookupfd_common(): don't bother with instantiate unless the file is open
procfs: get rid of ancient BS in pid_revalidate() uses
cifs_lookup(): switch to d_splice_alias()
cifs_lookup(): cifs_get_inode_...() never returns 0 with *inode left NULL
9p: unify paths in v9fs_vfs_lookup()
ncp_lookup(): use d_splice_alias()
hfsplus: switch to d_splice_alias()
hfs: don't allow mounting over .../rsrc
hfs: use d_splice_alias()
omfs_lookup(): report IO errors, use d_splice_alias()
orangefs_lookup: simplify
openpromfs: switch to d_splice_alias()
xfs_vn_lookup: simplify a bit
adfs_lookup: do not fail with ENOENT on negatives, use d_splice_alias()
adfs_lookup_byname: .. *is* taken care of in fs/namei.c
romfs_lookup: switch to d_splice_alias()
qnx6_lookup: switch to d_splice_alias()
...
including:
- Extensive RST conversions and organizational work in the
memory-management docs thanks to Mike Rapoport.
- An update of Documentation/features from Andrea Parri and a script to
keep it updated.
- Various LICENSES updates from Thomas, along with a script to check SPDX
tags.
- Work to fix dangling references to documentation files; this involved a
fair number of one-liner comment changes outside of Documentation/
...and the usual list of documentation improvements, typo fixes, etc.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbFTkKAAoJEI3ONVYwIuV6t24P/0K9qltHkLwsBo2fbGu/emem
mb1QrZCFZGebKVrCIvET3YcT0q0xPW+ZldwMQYEUeCcu/vD3cGHGXlDbVJCa1fFD
2OS10W/sEObPnREtlHO/zAzpapKP9DO1/f6NhO55iBJLGOCgoLL5xvSqgsI8MTGd
vcJDXLitkh4CJEcfNLkQt8dEZzq9Tb6wdSFIvZBBXRNon2ItVN92D5xoQ0wtB+qt
KmcGYofajK9bjtZpnC4iNg3i+zdwkd80bGTEN9f0hJTRZK5emCILk8fip8CMhRuB
iwmcqb2RnMLydNLyK9RSs6OS5z3G4fYu9llRtLlZBAupcjRVpalWaBGxLOVO6jBG
mvkqdKPMtxV4c7NvwKwFQL9dcjtxsxO4RDRYVWN82dS1L6WKKk8UvTuJUBLH0YA5
af7ZKn7mJVhJ1cxPblaEBOBM3oQuk57LLkjmcpMOXyJ/IOkTIuV1Ezht+XzFyFQv
VWSyekiKo+8D6WHACPTaWiizjW15e8CyP+WIhKzJyn7VQQrZwhsOS+R//ITsuvQ0
vRdZ20lwUeBhR+mnXd5NfIo2w7G+OiqiREVAgxjgRrS0PnkzWG7lzzcSVU8HTfT4
S7VXqval2a9Xg+N8aU2JUe49W858J8hKvIa98hBxGoZa84wxOGtEo7pIKhnMwMSe
Uhkh/1/bQMxsK3fBEF74
=I6FG
-----END PGP SIGNATURE-----
Merge tag 'docs-4.18' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"There's been a fair amount of work in the docs tree this time around,
including:
- Extensive RST conversions and organizational work in the
memory-management docs thanks to Mike Rapoport.
- An update of Documentation/features from Andrea Parri and a script
to keep it updated.
- Various LICENSES updates from Thomas, along with a script to check
SPDX tags.
- Work to fix dangling references to documentation files; this
involved a fair number of one-liner comment changes outside of
Documentation/
... and the usual list of documentation improvements, typo fixes, etc"
* tag 'docs-4.18' of git://git.lwn.net/linux: (103 commits)
Documentation: document hung_task_panic kernel parameter
docs/admin-guide/mm: add high level concepts overview
docs/vm: move ksm and transhuge from "user" to "internals" section.
docs: Use the kerneldoc comments for memalloc_no*()
doc: document scope NOFS, NOIO APIs
docs: update kernel versions and dates in tables
docs/vm: transhuge: split userspace bits to admin-guide/mm/transhuge
docs/vm: transhuge: minor updates
docs/vm: transhuge: change sections order
Documentation: arm: clean up Marvell Berlin family info
Documentation: gpio: driver: Fix a typo and some odd grammar
docs: ranoops.rst: fix location of ramoops.txt
scripts/documentation-file-ref-check: rewrite it in perl with auto-fix mode
docs: uio-howto.rst: use a code block to solve a warning
mm, THP, doc: Add document for thp_swpout/thp_swpout_fallback
w1: w1_io.c: fix a kernel-doc warning
Documentation/process/posting: wrap text at 80 cols
docs: admin-guide: add cgroup-v2 documentation
Revert "Documentation/features/vm: Remove arch support status file for 'pte_special'"
Documentation: refcount-vs-atomic: Update reference to LKMM doc.
...
what we want it for is actually updating inode metadata;
take _that_ into a separate helper and use it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and take the "check if file is open, pick ->f_mode" into a helper;
tid_fd_revalidate() can use it.
The next patch will get rid of tid_fd_revalidate() calls in instantiate
callbacks.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
First of all, calling pid_revalidate() in the end of <pid>/* lookups
is *not* about closing any kind of races; that used to be true once
upon a time, but these days those comments are actively misleading.
Especially since pid_revalidate() doesn't even do d_drop() on
failure anymore. It doesn't matter, anyway, since once
pid_revalidate() starts returning false, ->d_delete() of those
dentries starts saying "don't keep"; they won't get stuck in
dcache any longer than they are pinned.
These calls cannot be just removed, though - the side effect of
pid_revalidate() (updating i_uid/i_gid/etc.) is what we are calling
it for here.
Let's separate the "update ownership" into a new helper (pid_update_inode())
and use it, both in lookups and in pid_revalidate() itself.
The comments in pid_revalidate() are also out of date - they refer to
the time when pid_revalidate() used to call d_drop() directly...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
S390 bpf_jit.S is removed in net-next and had changes in 'net',
since that code isn't used any more take the removal.
TLS data structures split the TX and RX components in 'net-next',
put the new struct members from the bug fix in 'net' into the RX
part.
The 'net-next' tree had some reworking of how the ERSPAN code works in
the GRE tunneling code, overlapping with a one-line headroom
calculation fix in 'net'.
Overlapping changes in __sock_map_ctx_update_elem(), keep the bits
that read the prog members via READ_ONCE() into local variables
before using them.
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge speculative store buffer bypass fixes from Thomas Gleixner:
- rework of the SPEC_CTRL MSR management to accomodate the new fancy
SSBD (Speculative Store Bypass Disable) bit handling.
- the CPU bug and sysfs infrastructure for the exciting new Speculative
Store Bypass 'feature'.
- support for disabling SSB via LS_CFG MSR on AMD CPUs including
Hyperthread synchronization on ZEN.
- PRCTL support for dynamic runtime control of SSB
- SECCOMP integration to automatically disable SSB for sandboxed
processes with a filter flag for opt-out.
- KVM integration to allow guests fiddling with SSBD including the new
software MSR VIRT_SPEC_CTRL to handle the LS_CFG based oddities on
AMD.
- BPF protection against SSB
.. this is just the core and x86 side, other architecture support will
come separately.
* 'speck-v20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits)
bpf: Prevent memory disambiguation attack
x86/bugs: Rename SSBD_NO to SSB_NO
KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBD
x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFG
x86/bugs: Rework spec_ctrl base and mask logic
x86/bugs: Remove x86_spec_ctrl_set()
x86/bugs: Expose x86_spec_ctrl_base directly
x86/bugs: Unify x86_spec_ctrl_{set_guest,restore_host}
x86/speculation: Rework speculative_store_bypass_update()
x86/speculation: Add virtualized speculative store bypass disable support
x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRL
x86/speculation: Handle HT correctly on AMD
x86/cpufeatures: Add FEATURE_ZEN
x86/cpufeatures: Disentangle SSBD enumeration
x86/cpufeatures: Disentangle MSR_SPEC_CTRL enumeration from IBRS
x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP
KVM: SVM: Move spec control call after restore of GS
x86/cpu: Make alternative_msr_write work for 32-bit code
x86/bugs: Fix the parameters alignment and missing void
x86/bugs: Make cpu_show_common() static
...
Fix below build warning:
WARNING: vmlinux.o(.text+0x422bb8): Section mismatch in reference from
the function vmcore_add_device_dump() to the function
.init.text:get_vmcore_size.constprop.5()
The function vmcore_add_device_dump() references
the function __init get_vmcore_size.constprop.5().
This is often because vmcore_add_device_dump lacks a __init
annotation or the annotation of get_vmcore_size.constprop.5 is wrong.
Fixes: 7efe48df8a ("vmcore: append device dumps to vmcore as elf notes")
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There can be a lot of workqueue workers and they all show up with the
cryptic kworker/* names making it difficult to understand which is
doing what and how they came to be.
# ps -ef | grep kworker
root 4 2 0 Feb25 ? 00:00:00 [kworker/0:0H]
root 6 2 0 Feb25 ? 00:00:00 [kworker/u112:0]
root 19 2 0 Feb25 ? 00:00:00 [kworker/1:0H]
root 25 2 0 Feb25 ? 00:00:00 [kworker/2:0H]
root 31 2 0 Feb25 ? 00:00:00 [kworker/3:0H]
...
This patch makes workqueue workers report the latest workqueue it was
executing for through /proc/PID/{comm,stat,status}. The extra
information is appended to the kthread name with intervening '+' if
currently executing, otherwise '-'.
# cat /proc/25/comm
kworker/2:0-events_power_efficient
# cat /proc/25/stat
25 (kworker/2:0-events_power_efficient) I 2 0 0 0 -1 69238880 0 0...
# grep Name /proc/25/status
Name: kworker/2:0-events_power_efficient
Unfortunately, ps(1) truncates comm to 15 characters,
# ps 25
PID TTY STAT TIME COMMAND
25 ? I 0:00 [kworker/2:0-eve]
making it a lot less useful; however, this should be an easy fix from
ps(1) side.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Craig Small <csmall@enc.com.au>
proc shows task->comm in three places - comm, stat, status - and each
is fetching and formatting task->comm slighly differently. This patch
renames task_name() to proc_task_name(), makes it more generic, and
updates all three paths to use it.
This will enable expanding comm reporting for workqueue workers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Provide two extra functions, proc_create_net_data_write() and
proc_create_net_single_write() that act like their non-write versions but
also set a write method in the proc_dir_entry struct.
An internal simple write function is provided that will copy its buffer and
hand it to the pde->write() method if available (or give an error if not).
The buffer may be modified by the write method.
Signed-off-by: David Howells <dhowells@redhat.com>
We have some very odd semantics for reading the command line through
/proc, because we allow people to rewrite their own command line pretty
much at will, and things get positively funky when you extend your
command line past the point that used to be the end of the command line,
and is now in the environment variable area.
But our weird semantics doesn't mean that we should write weird and
complex code to handle them.
So re-write get_mm_cmdline() to be much simpler, and much more explicit
about what it is actually doing and why. And avoid the extra check for
"is there a NUL character at the end of the command line where I expect
one to be", by simply making the NUL character handling be part of the
normal "once you hit the end of the command line, stop at the first NUL
character" logic.
It's quite possible that we should stop the crazy "walk into
environment" entirely, but happily it's not really the usual case.
NOTE! We tried to really simplify and limit our odd cmdline parsing some
time ago, but people complained. See commit c2c0bb4462 ("proc: fix
PAGE_SIZE limit of /proc/$PID/cmdline") for details about why we have
this complexity.
Cc: Tejun Heo <tj@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a pure refactoring of the function, preparing for some further
cleanups. The thing was pretty illegible, and the core functionality
still is, but now the core loop is a bit more isolated from the thing
that goes on around it.
This was "inspired" by the confluence of kworker workqueue name cleanups
by Tejun, currently scheduled for 4.18, and commit 7f7ccc2ccc ("proc:
do not access cmdline nor environ from file-backed areas").
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_pid_cmdline_read() and environ_read() directly access the target
process' VM to retrieve the command line and environment. If this
process remaps these areas onto a file via mmap(), the requesting
process may experience various issues such as extra delays if the
underlying device is slow to respond.
Let's simply refuse to access file-backed areas in these functions.
For this we add a new FOLL_ANON gup flag that is passed to all calls
to access_remote_vm(). The code already takes care of such failures
(including unmapped areas). Accesses via /proc/pid/mem were not
changed though.
This was assigned CVE-2018-1120.
Note for stable backports: the patch may apply to kernels prior to 4.11
but silently miss one location; it must be checked that no call to
access_remote_vm() keeps zero as the last argument.
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just set up the show callback in the tty_operations, and use
proc_create_single_data to create the file without additional
boilerplace code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Variant of proc_create_data that directly take a seq_file show
callback and deals with network namespaces in ->open and ->release.
All callers of proc_create + single_open_net converted over, and
single_{open,release}_net are removed entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Variants of proc_create{,_data} that directly take a struct seq_operations
and deal with network namespaces in ->open and ->release. All callers of
proc_create + seq_open_net converted over, and seq_{open,release}_net are
removed entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Variants of proc_create{,_data} that directly take a seq_file show
callback and drastically reduces the boilerplate code in the callers.
All trivial callers converted over.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Variant of proc_create_data that directly take a struct seq_operations
argument + a private state size and drastically reduces the boilerplate
code in the callers.
All trivial callers converted over.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Variants of proc_create{,_data} that directly take a struct seq_operations
argument and drastically reduces the boilerplate code in the callers.
All trivial callers converted over.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Common code for creating a regular file. Factor out of proc_create_data, to
be reused by other functions soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Return registered entry on success, return NULL on failure and free the
passed in entry. Also expose it in internal.h as we'll start using it
in proc_net.c soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Factor out retrieving the per-sb pid namespaces from the sb private data
into an easier to understand helper.
Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Update read and mmap logic to append device dumps as additional notes
before the other elf notes. We add device dumps before other elf notes
because the other elf notes may not fill the elf notes buffer
completely and we will end up with zero-filled data between the elf
notes and the device dumps. Tools will then try to decode this
zero-filled data as valid notes and we don't want that. Hence, adding
device dumps before the other elf notes ensure that zero-filled data
can be avoided. This also ensures that the device dumps and the
other elf notes can be properly mmaped at page aligned address.
Incorporate device dump size into the total vmcore size. Also update
offsets for other program headers after the device dumps are added.
Suggested-by: Eric Biederman <ebiederm@xmission.com>.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sequence of actions done by device drivers to append their device
specific hardware/firmware logs to /proc/vmcore are as follows:
1. During probe (before hardware is initialized), device drivers
register to the vmcore module (via vmcore_add_device_dump()), with
callback function, along with buffer size and log name needed for
firmware/hardware log collection.
2. vmcore module allocates the buffer with requested size. It adds
an Elf note and invokes the device driver's registered callback
function.
3. Device driver collects all hardware/firmware logs into the buffer
and returns control back to vmcore module.
Ensure that the device dump buffer size is always aligned to page size
so that it can be mmaped.
Also, rename alloc_elfnotes_buf() to vmcore_alloc_buf() to make it more
generic and reserve NT_VMCOREDD note type to indicate vmcore device
dump.
Suggested-by: Eric Biederman <ebiederm@xmission.com>.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The existing kcore code checks for bad addresses against __va(0) with
the assumption that this is the lowest address on the system. This may
not hold true on some systems (e.g. arm64) and produce overflows and
crashes. Switch to using other functions to validate the address range.
It's currently only seen on arm64 and it's not clear if anyone wants to
use that particular combination on a stable release. So this is not
urgent for stable.
Link: http://lkml.kernel.org/r/20180501201143.15121-1-labbott@redhat.com
Signed-off-by: Laura Abbott <labbott@redhat.com>
Tested-by: Dave Anderson <anderson@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>a
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The style for the 'status' file is CamelCase or this. _.
Fixes: fae1fa0fc ("proc: Provide details on speculation flaw mitigations")
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently the architecture specific code is expected to display the
protection keys in smap for a given vma. This can lead to redundant
code and possibly to divergent formats in which the key gets
displayed.
This patch changes the implementation. It displays the pkey only if
the architecture support pkeys, i.e arch_pkeys_enabled() returns true.
x86 arch_show_smap() function is not needed anymore, delete it.
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[mpe: Split out from larger patch, rebased on header changes]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Currently only 4bits are allocated in the vma flags to hold 16
keys. This is sufficient for x86. PowerPC supports 32 keys,
which needs 5bits. This patch allocates an additional bit.
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[mpe: Fold in #if VM_PKEY_BIT4 as noticed by Dave Hansen]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
VM_PKEY_BITx are defined only if CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
is enabled. Powerpc also needs these bits. Hence lets define the
VM_PKEY_BITx bits for any architecture that enables
CONFIG_ARCH_HAS_PKEYS.
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For certain use cases it is desired to enforce mitigations so they cannot
be undone afterwards. That's important for loader stubs which want to
prevent a child from disabling the mitigation again. Will also be used for
seccomp(). The extra state preserving of the prctl state for SSB is a
preparatory step for EBPF dymanic speculation control.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
As done with seccomp and no_new_privs, also show speculation flaw
mitigation state in /proc/$pid/status.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlrdQu4eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGVjEIAJqS+sFJCAL8rNAv
tiVJHuAjogVdZGJJFBUWyb4yNZw7nRSKfitaSe875WdF55IGEhnMDbAGe7IMEb5j
1F8Ml2bzJzMWxfBWAzeU+wj6FaQksbIsI1gVM8tqk/Wtu121pB32VW8R82oHg+Hr
sjsFTKFicNsqih+7QTVujaRjSmabKf0/JdyYM6p1cqWrxZQ0pmFaGDu0rwet9PFx
lJsewOmnoZ0GV/Qzn40E304Xf+Vv2gVDVbC5wY86ejNigFt+5qN+gtDqDu7UkftR
ZfD4vJuiKCigNfUrpbJWfpbegBiQc0JMvjLWWhgo/AYdGhNGMlwjQanh2oZcXlrw
VmrNduo=
=/j3z
-----END PGP SIGNATURE-----
Merge tag 'v4.17-rc2' into docs-next
Merge -rc2 to pick up the changes to
Documentation/core-api/kernel-api.rst that hit mainline via the
networking tree. In their absence, subsequent patches cannot be
applied.
Several documents in Documentation/vm fit quite well into the "admin/user
guide" category. The documents that don't overload the reader with lots of
implementation details and provide coherent description of certain feature
can be moved to Documentation/admin-guide/mm.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Commit 95846ecf9d ("pid: replace pid bitmap implementation with IDR
API") changed last field of /proc/loadavg (last pid allocated) to be off
by one:
# unshare -p -f --mount-proc cat /proc/loadavg
0.00 0.00 0.00 1/60 2 <===
It should be 1 after first fork into pid namespace.
This is formally a regression but given how useless this field is I
don't think anyone is affected.
Bug was found by /proc testsuite!
Link: http://lkml.kernel.org/r/20180413175408.GA27246@avx2
Fixes: 95846ecf9d ("pid: replace pid bitmap implementation with IDR API")
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gargi Sharma <gs051095@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_dump_owner() has the following code:
mm = task->mm;
if (mm) {
if (get_dumpable(mm) != SUID_DUMP_USER) {
uid = ...
}
}
Check for ->mm is buggy -- kernel thread might be borrowing mm
and inode will go to some random uid:gid pair.
Link: http://lkml.kernel.org/r/20180412220109.GA20978@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The swap offset reported by /proc/<pid>/pagemap may be not correct for
PMD migration entries. If addr passed into pagemap_pmd_range() isn't
aligned with PMD start address, the swap offset reported doesn't
reflect this. And in the loop to report information of each sub-page,
the swap offset isn't increased accordingly as that for PFN.
This may happen after opening /proc/<pid>/pagemap and seeking to a page
whose address doesn't align with a PMD start address. I have verified
this with a simple test program.
BTW: migration swap entries have PFN information, do we need to restrict
whether to show them?
[akpm@linux-foundation.org: fix typo, per Huang, Ying]
Link: http://lkml.kernel.org/r/20180408033737.10897-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Jerome Glisse" <jglisse@redhat.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport says:
These patches convert files in Documentation/vm to ReST format, add an
initial index and link it to the top level documentation.
There are no contents changes in the documentation, except few spelling
fixes. The relatively large diffstat stems from the indentation and
paragraph wrapping changes.
I've tried to keep the formatting as consistent as possible, but I could
miss some places that needed markup and add some markup where it was not
necessary.
[jc: significant conflicts in vm/hmm.rst]
If module removes proc directory while another process pins it by
chdir'ing to it, then subsequent recreation of proc entry and all
entries down the tree will not be visible to any process until pinning
process unchdir from directory and unpins everything.
Steps to reproduce:
proc_mkdir("aaa", NULL);
proc_create("aaa/bbb", ...);
chdir("/proc/aaa");
remove_proc_entry("aaa/bbb", NULL);
remove_proc_entry("aaa", NULL);
proc_mkdir("aaa", NULL);
# inaccessible because "aaa" dentry still points
# to the original "aaa".
proc_create("aaa/bbb", ...);
Fix is to implement ->d_revalidate and ->d_delete.
Link: http://lkml.kernel.org/r/20180312201938.GA4871@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "ipc: Clamp *mni to the real IPCMNI limit", v3.
The sysctl parameters msgmni, shmmni and semmni have an inherent limit
of IPC_MNI (32k). However, users may not be aware of that because they
can write a value much higher than that without getting any error or
notification. Reading the parameters back will show the newly written
values which are not real.
Enforcing the limit by failing sysctl parameter write, however, can
break existing user applications. To address this delemma, a new flags
field is introduced into the ctl_table. The value CTL_FLAGS_CLAMP_RANGE
can be added to any ctl_table entries to enable a looser range clamping
without returning any error. For example,
.flags = CTL_FLAGS_CLAMP_RANGE,
This flags value are now used for the range checking of shmmni, msgmni
and semmni without breaking existing applications. If any out of range
value is written to those sysctl parameters, the following warning will
be printed instead.
Kernel parameter "shmmni" was set out of range [0, 32768], clamped to 32768.
Reading the values back will show 32768 instead of some fake values.
This patch (of 6):
Fix a typo.
Link: http://lkml.kernel.org/r/1519926220-7453-2-git-send-email-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In a typical for /proc "open+read+close" usecase, dentry is looked up
successfully on open only to be killed in dput() on close. In fact
dentries which aren't /proc/*/... and /proc/sys/* were almost NEVER
CACHED. Simple printk in proc_lookup_de() shows that.
Now that ->delete hook intelligently picks which dentries should live in
dcache and which should not, rbtree caching is not necessary as dcache
does it job, at last!
As a side effect, struct proc_dir_entry shrinks by one pointer which can
go into inline name.
Link: http://lkml.kernel.org/r/20180314231032.GA15854@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Various subsystems can create files and directories in /proc with names
directly controlled by userspace.
Which means "/", "." and ".." are no-no.
"/" split is already taken care of, do the other 2 prohibited names.
Link: http://lkml.kernel.org/r/20180310001223.GB12443@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm_struct is not needed while printing as all the data was already
extracted.
Link: http://lkml.kernel.org/r/20180309223120.GC3843@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use seq_puts() and skip format string processing.
Link: http://lkml.kernel.org/r/20180309222948.GB3843@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As soon as register_filesystem() exits, filesystem can be mounted. It
is better to present fully operational /proc.
Of course it doesn't matter because /proc is not modular but do it
anyway.
Drop error check, it should be handled by panicking.
Link: http://lkml.kernel.org/r/20180309222709.GA3843@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I totally forgot that _parse_integer() accepts arbitrary amount of
leading zeroes leading to the following lookups:
OK
# readlink /proc/1/map_files/56427ecba000-56427eddc000
/lib/systemd/systemd
bogus
# readlink /proc/1/map_files/00000000000056427ecba000-56427eddc000
/lib/systemd/systemd
# readlink /proc/1/map_files/56427ecba000-00000000000056427eddc000
/lib/systemd/systemd
Link: http://lkml.kernel.org/r/20180303215130.GA23480@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"struct proc_dir_entry" is variable sized because of 0-length trailing
array for name, however, because of SLAB padding allocations it is
possible to make "struct proc_dir_entry" fixed sized and allocate same
amount of memory.
It buys fine-grained debugging with poisoning and usercopy protection
which is not possible with kmalloc-* caches.
Currently, on 32-bit 91+ byte allocations go into kmalloc-128 and on
64-bit 147+ byte allocations go to kmalloc-192 anyway.
Additional memory is allocated only for 38/46+ byte long names which are
rare or may not even exist in the wild.
Link: http://lkml.kernel.org/r/20180223205504.GA17139@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_sys_link_fill_cache() does not need to check whether we're called
for a link - it's already done by scan().
Link: http://lkml.kernel.org/r/20180228013506.4915-2-danilokrummrich@dk-develop.de
Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_sys_link_fill_cache() does not take currently unregistering sysctl
tables into account, which might result into a page fault in
sysctl_follow_link() - add a check to fix it.
This bug has been present since v3.4.
Link: http://lkml.kernel.org/r/20180228013506.4915-1-danilokrummrich@dk-develop.de
Fixes: 0e47c99d7f ("sysctl: Replace root_list with links between sysctl_table_sets")
Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_wchan() accesses stack page before permissions are checked, let's
not play this game.
Link: http://lkml.kernel.org/r/20180217071923.GA16074@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
seq_printf() works slower than seq_puts, seq_puts, etc.
== test_proc.c
int main(int argc, char **argv)
{
int n, i, fd;
char buf[16384];
n = atoi(argv[1]);
for (i = 0; i < n; i++) {
fd = open(argv[2], O_RDONLY);
if (fd < 0)
return 1;
if (read(fd, buf, sizeof(buf)) <= 0)
return 1;
close(fd);
}
return 0;
}
==
$ time ./test_proc 1000000 /proc/1/status
== Before path ==
real 0m5.171s
user 0m0.328s
sys 0m4.783s
== After patch ==
real 0m4.761s
user 0m0.334s
sys 0m4.366s
Link: http://lkml.kernel.org/r/20180212074931.7227-4-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
seq_put_decimal_ull_w(m, str, val, width) prints a decimal number with a
specified minimal field width.
It is equivalent of seq_printf(m, "%s%*d", str, width, val), but it
works much faster.
== test_smaps.py
num = 0
with open("/proc/1/smaps") as f:
for x in xrange(10000):
data = f.read()
f.seek(0, 0)
==
== Before patch ==
$ time python test_smaps.py
real 0m4.593s
user 0m0.398s
sys 0m4.158s
== After patch ==
$ time python test_smaps.py
real 0m3.828s
user 0m0.413s
sys 0m3.408s
$ perf -g record python test_smaps.py
== Before patch ==
- 79.01% 3.36% python [kernel.kallsyms] [k] show_smap.isra.33
- 75.65% show_smap.isra.33
+ 48.85% seq_printf
+ 15.75% __walk_page_range
+ 9.70% show_map_vma.isra.23
0.61% seq_puts
== After patch ==
- 75.51% 4.62% python [kernel.kallsyms] [k] show_smap.isra.33
- 70.88% show_smap.isra.33
+ 24.82% seq_put_decimal_ull_w
+ 19.78% __walk_page_range
+ 12.74% seq_printf
+ 11.08% show_map_vma.isra.23
+ 1.68% seq_puts
[akpm@linux-foundation.org: fix drivers/of/unittest.c build]
Link: http://lkml.kernel.org/r/20180212074931.7227-1-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The allocation is persistent in fact as any fool can open a file in
/proc and sit on it.
Link: http://lkml.kernel.org/r/20180214082409.GC17157@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"struct pde_opener" is fixed size and we can have more granular approach
to debugging.
For those who don't know, per cache SLUB poisoning and red zoning don't
work if there is at least one object allocated which is hopeless in case
of kmalloc-64 but not in case of standalone cache. Although systemd
opens 2 files from the get go, so it is hopeless after all.
Link: http://lkml.kernel.org/r/20180214082306.GB17157@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The whole point of code in fs/proc/inode.c is to make sure ->release
hook is called either at close() or at rmmod time.
All if it is unnecessary if there is no ->release hook.
Save allocation+list manipulations under spinlock in that case.
Link: http://lkml.kernel.org/r/20180214063033.GA15579@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_task_umask locks/unlocks the task on its own. The only caller does
the same thing immediately after.
Utilize the fact the task has to be locked anyway and just do it once.
Since there are no other users and the code is short, fold it in.
Link: http://lkml.kernel.org/r/1517995608-23683-1-git-send-email-mguzik@redhat.com
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hardwall is a tile specific feature, and with the removal of the
tile architecture, this has become dead code, so let's remove it.
Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Pull x86 fixes from Ingo Molnar:
"Misc fixes all across the map:
- /proc/kcore vsyscall related fixes
- LTO fix
- build warning fix
- CPU hotplug fix
- Kconfig NR_CPUS cleanups
- cpu_has() cleanups/robustification
- .gitignore fix
- memory-failure unmapping fix
- UV platform fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm, mm/hwpoison: Don't unconditionally unmap kernel 1:1 pages
x86/error_inject: Make just_return_func() globally visible
x86/platform/UV: Fix GAM Range Table entries less than 1GB
x86/build: Add arch/x86/tools/insn_decoder_test to .gitignore
x86/smpboot: Fix uncore_pci_remove() indexing bug when hot-removing a physical CPU
x86/mm/kcore: Add vsyscall page to /proc/kcore conditionally
vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when dumping vsyscall user page
x86/Kconfig: Further simplify the NR_CPUS config
x86/Kconfig: Simplify NR_CPUS config
x86/MCE: Fix build warning introduced by "x86: do not use print_symbol()"
x86/cpufeature: Update _static_cpu_has() to use all named variables
x86/cpufeature: Reindent _static_cpu_has()
Commit:
df04abfd18 ("fs/proc/kcore.c: Add bounce buffer for ktext data")
... introduced a bounce buffer to work around CONFIG_HARDENED_USERCOPY=y.
However, accessing the vsyscall user page will cause an SMAP fault.
Replace memcpy() with copy_from_user() to fix this bug works, but adding
a common way to handle this sort of user page may be useful for future.
Currently, only vsyscall page requires KCORE_USER.
Signed-off-by: Jia Zhang <zhang.jia@linux.alibaba.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jolsa@redhat.com
Link: http://lkml.kernel.org/r/1518446694-21124-2-git-send-email-zhang.jia@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:
for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done
with de-mangling cleanups yet to come.
NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.
The next patch from Al will sort out the final differences, and we
should be all done.
Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A single character (line break) should be put into a sequence. Thus use
the corresponding function "seq_putc".
This issue was detected by using the Coccinelle software.
Link: http://lkml.kernel.org/r/04fb69fe-d820-9141-820f-07e9a48f4635@users.sourceforge.net
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rearrange args for smaller code.
lookup revolves around memcmp() which gets len 3rd arg, so propagate
length as 3rd arg.
readdir and lookup add additional arg to VFS ->readdir and ->lookup, so
better add it to the end.
Space savings on x86_64:
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-18 (-18)
Function old new delta
proc_readdir 22 13 -9
proc_lookup 18 9 -9
proc_match() is smaller if not inlined, I promise!
Link: http://lkml.kernel.org/r/20180104175958.GB5204@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
use_pde() is used at every open/read/write/... of every random /proc
file. Negative refcount happens only if PDE is being deleted by module
(read: never). So it gets "likely".
unuse_pde() gets "unlikely" for the same reason.
close_pdeo() gets unlikely as the completion is filled only if there is a
race between PDE removal and close() (read: never ever).
It even saves code on x86_64 defconfig:
add/remove: 0/0 grow/shrink: 1/2 up/down: 2/-20 (-18)
Function old new delta
close_pdeo 183 185 +2
proc_reg_get_unmapped_area 119 111 -8
proc_reg_poll 85 73 -12
Link: http://lkml.kernel.org/r/20180104175657.GA5204@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/self inode numbers, value of proc_inode_cache and st_nlink of
/proc/$TGID are fixed constants.
Link: http://lkml.kernel.org/r/20180103184707.GA31849@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>