Commit graph

2027 commits

Author SHA1 Message Date
Eric W. Biederman
3147d8aaa0 proc: Use PIDTYPE_TGID in next_tgid
Combine the pid_task and thes test has_group_leader_pid into a single
dereference by using pid_task(PIDTYPE_TGID).

This makes the code simpler and proof against needing to even think
about any shenanigans that de_thread might get up to.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-24 17:16:35 -05:00
Eric W. Biederman
0fb5ce62c5 proc: modernize proc to support multiple private instances
Alexey Gladkov <gladkov.alexey@gmail.com> writes:
 Procfs modernization:
 ---------------------
 Historically procfs was always tied to pid namespaces, during pid
 namespace creation we internally create a procfs mount for it. However,
 this has the effect that all new procfs mounts are just a mirror of the
 internal one, any change, any mount option update, any new future
 introduction will propagate to all other procfs mounts that are in the
 same pid namespace.

 This may have solved several use cases in that time. However today we
 face new requirements, and making procfs able to support new private
 instances inside same pid namespace seems a major point. If we want to
 to introduce new features and security mechanisms we have to make sure
 first that we do not break existing usecases. Supporting private procfs
 instances will allow to support new features and behaviour without
 propagating it to all other procfs mounts.

 Today procfs is more of a burden especially to some Embedded, IoT,
 sandbox, container use cases. In user space we are over-mounting null
 or inaccessible files on top to hide files and information. If we want
 to hide pids we have to create PID namespaces otherwise mount options
 propagate to all other proc mounts, changing a mount option value in one
 mount will propagate to all other proc mounts. If we want to introduce
 new features, then they will propagate to all other mounts too, resulting
 either maybe new useful functionality or maybe breaking stuff. We have
 also to note that userspace should not workaround procfs, the kernel
 should just provide a sane simple interface.

 In this regard several developers and maintainers pointed out that
 there are problems with procfs and it has to be modernized:

 "Here's another one: split up and modernize /proc." by Andy Lutomirski [1]

 Discussion about kernel pointer leaks:

 "And yes, as Kees and Daniel mentioned, it's definitely not just dmesg.
 In fact, the primary things tend to be /proc and /sys, not dmesg
 itself." By Linus Torvalds [2]

 Lot of other areas in the kernel and filesystems have been updated to be
 able to support private instances, devpts is one major example [3].

 Which will be used for:

 1) Embedded systems and IoT: usually we have one supervisor for
 apps, we have some lightweight sandbox support, however if we create
 pid namespaces we have to manage all the processes inside too,
 where our goal is to be able to run a bunch of apps each one inside
 its own mount namespace, maybe use network namespaces for vlans
 setups, but right now we only want mount namespaces, without all the
 other complexity. We want procfs to behave more like a real file system,
 and block access to inodes that belong to other users. The 'hidepid=' will
 not work since it is a shared mount option.

 2) Containers, sandboxes and Private instances of file systems - devpts case
 Historically, lot of file systems inside Linux kernel view when instantiated
 were just a mirror of an already created and mounted filesystem. This was the
 case of devpts filesystem, it seems at that time the requirements were to
 optimize things and reuse the same memory, etc. This design used to work but not
 anymore with today's containers, IoT, hostile environments and all the privacy
 challenges that Linux faces.

 In that regards, devpts was updated so that each new mounts is a total
 independent file system by the following patches:

 "devpts: Make each mount of devpts an independent filesystem" by
 Eric W. Biederman [3] [4]

 3) Linux Security Modules have multiple ptrace paths inside some
 subsystems, however inside procfs, the implementation does not guarantee
 that the ptrace() check which triggers the security_ptrace_check() hook
 will always run. We have the 'hidepid' mount option that can be used to
 force the ptrace_may_access() check inside has_pid_permissions() to run.
 The problem is that 'hidepid' is per pid namespace and not attached to
 the mount point, any remount or modification of 'hidepid' will propagate
 to all other procfs mounts.

 This also does not allow to support Yama LSM easily in desktop and user
 sessions. Yama ptrace scope which restricts ptrace and some other
 syscalls to be allowed only on inferiors, can be updated to have a
 per-task context, where the context will be inherited during fork(),
 clone() and preserved across execve(). If we support multiple private
 procfs instances, then we may force the ptrace_may_access() on
 /proc/<pids>/ to always run inside that new procfs instances. This will
 allow to specifiy on user sessions if we should populate procfs with
 pids that the user can ptrace or not.

 By using Yama ptrace scope, some restricted users will only be able to see
 inferiors inside /proc, they won't even be able to see their other
 processes. Some software like Chromium, Firefox's crash handler, Wine
 and others are already using Yama to restrict which processes can be
 ptracable. With this change this will give the possibility to restrict
 /proc/<pids>/ but more importantly this will give desktop users a
 generic and usuable way to specifiy which users should see all processes
 and which user can not.

 Side notes:

 * This covers the lack of seccomp where it is not able to parse
 arguments, it is easy to install a seccomp filter on direct syscalls
 that operate on pids, however /proc/<pid>/ is a Linux ABI using
 filesystem syscalls. With this change all LSMs should be able to analyze
 open/read/write/close... on /proc/<pid>/

 4) This will allow to implement new features either in kernel or
 userspace without having to worry about procfs.
 In containers, sandboxes, etc we have workarounds to hide some /proc
 inodes, this should be supported natively without doing extra complex
 work, the kernel should be able to support sane options that work with
 today and future Linux use cases.

 5) Creation of new superblock with all procfs options for each procfs
 mount will fix the ignoring of mount options. The problem is that the
 second mount of procfs in the same pid namespace ignores the mount
 options. The mount options are ignored without error until procfs is
 remounted.

 Before:

 proc /proc proc rw,relatime,hidepid=2 0 0

 mount("proc", "/tmp/proc", "proc", 0, "hidepid=1") = 0
 +++ exited with 0 +++

 proc /proc proc rw,relatime,hidepid=2 0 0
 proc /tmp/proc proc rw,relatime,hidepid=2 0 0

 proc /proc proc rw,relatime,hidepid=1 0 0
 proc /tmp/proc proc rw,relatime,hidepid=1 0 0

 After:

 proc /proc proc rw,relatime,hidepid=ptraceable 0 0

 proc /proc proc rw,relatime,hidepid=ptraceable 0 0
 proc /tmp/proc proc rw,relatime,hidepid=invisible 0 0

 Introduced changes:
 -------------------
 Each mount of procfs creates a separate procfs instance with its own
 mount options.

 This series adds few new mount options:

 * New 'hidepid=ptraceable' or 'hidepid=4' mount option to show only ptraceable
 processes in the procfs. This allows to support lightweight sandboxes in
 Embedded Linux, also solves the case for LSM where now with this mount option,
 we make sure that they have a ptrace path in procfs.

 * 'subset=pid' that allows to hide non-pid inodes from procfs. It can be used
 in containers and sandboxes, as these are already trying to hide and block
 access to procfs inodes anyway.

 ChangeLog:
 ----------
 * Rebase on top of v5.7-rc1.
 * Fix a resource leak if proc is not mounted or if proc is simply reconfigured.
 * Add few selftests.

 * After a discussion with Eric W. Biederman, the numerical values for hidepid
   parameter have been removed from uapi.
 * Remove proc_self and proc_thread_self from the pid_namespace struct.
 * I took into account the comment of Kees Cook.
 * Update Reviewed-by tags.

 * 'subset=pidfs' renamed to 'subset=pid' as suggested by Alexey Dobriyan.
 * Include Reviewed-by tags.

 * Rebase on top of Eric W. Biederman's procfs changes.
 * Add human readable values of 'hidepid' as suggested by Andy Lutomirski.

 * Started using RCU lock to clean dcache entries as suggested by Linus Torvalds.

 * 'pidonly=1' renamed to 'subset=pidfs' as suggested by Alexey Dobriyan.
 * HIDEPID_* moved to uapi/ as they are user interface to mount().
   Suggested-by Alexey Dobriyan <adobriyan@gmail.com>

 * 'hidepid=' and 'gid=' mount options are moved from pid namespace to superblock.
 * 'newinstance' mount option removed as suggested by Eric W. Biederman.
    Mount of procfs always creates a new instance.
 * 'limit_pids' renamed to 'hidepid=3'.
 * I took into account the comment of Linus Torvalds [7].
 * Documentation added.

 * Fixed a bug that caused a problem with the Fedora boot.
 * The 'pidonly' option is visible among the mount options.

 * Renamed mount options to 'newinstance' and 'pids='
    Suggested-by: Andy Lutomirski <luto@kernel.org>
 * Fixed order of commit, Suggested-by: Andy Lutomirski <luto@kernel.org>
 * Many bug fixes.

 * Removed 'unshared' mount option and replaced it with 'limit_pids'
    which is attached to the current procfs mount.
    Suggested-by Andy Lutomirski <luto@kernel.org>
 * Do not fill dcache with pid entries that we can not ptrace.
 * Many bug fixes.

 References:
 -----------
 [1] https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-January/004215.html
 [2] http://www.openwall.com/lists/kernel-hardening/2017/10/05/5
 [3] https://lwn.net/Articles/689539/
 [4] http://lxr.free-electrons.com/source/Documentation/filesystems/devpts.txt?v=3.14
 [5] https://lkml.org/lkml/2017/5/2/407
 [6] https://lkml.org/lkml/2017/5/3/357
 [7] https://lkml.org/lkml/2018/5/11/505

 Alexey Gladkov (7):
   proc: rename struct proc_fs_info to proc_fs_opts
   proc: allow to mount many instances of proc in one pid namespace
   proc: instantiate only pids that we can ptrace on 'hidepid=4' mount
     option
   proc: add option to mount only a pids subset
   docs: proc: add documentation for "hidepid=4" and "subset=pid" options
     and new mount behavior
  proc: use human-readable values for hidepid
   proc: use named enums for better readability

  Documentation/filesystems/proc.rst            |  92 +++++++++---
  fs/proc/base.c                                |  48 +++++--
  fs/proc/generic.c                             |   9 ++
  fs/proc/inode.c                               |  30 +++-
  fs/proc/root.c                                | 131 +++++++++++++-----
  fs/proc/self.c                                |   6 +-
  fs/proc/thread_self.c                         |   6 +-
  fs/proc_namespace.c                           |  14 +-
  include/linux/pid_namespace.h                 |  12 --
  include/linux/proc_fs.h                       |  30 +++-
  tools/testing/selftests/proc/.gitignore       |   2 +
  tools/testing/selftests/proc/Makefile         |   2 +
  .../selftests/proc/proc-fsconfig-hidepid.c    |  50 +++++++
  .../selftests/proc/proc-multiple-procfs.c     |  48 +++++++
  14 files changed, 384 insertions(+), 96 deletions(-)
  create mode 100644 tools/testing/selftests/proc/proc-fsconfig-hidepid.c
  create mode 100644 tools/testing/selftests/proc/proc-multiple-procfs.c

Link: https://lore.kernel.org/lkml/20200419141057.621356-1-gladkov.alexey@gmail.com/
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-24 17:04:06 -05:00
Eric W. Biederman
6ade99ec61 proc: Put thread_pid in release_task not proc_flush_pid
Oleg pointed out that in the unlikely event the kernel is compiled
with CONFIG_PROC_FS unset that release_task will now leak the pid.

Move the put_pid out of proc_flush_pid into release_task to fix this
and to guarantee I don't make that mistake again.

When possible it makes sense to keep get and put in the same function
so it can easily been seen how they pair up.

Fixes: 7bc3e6e55a ("proc: Use a list of inodes to flush from proc")
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-24 15:49:00 -05:00
Alexey Gladkov
e61bb8b36a proc: use named enums for better readability
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-04-22 10:51:22 -05:00
Alexey Gladkov
1c6c4d112e proc: use human-readable values for hidepid
The hidepid parameter values are becoming more and more and it becomes
difficult to remember what each new magic number means.

Backward compatibility is preserved since it is possible to specify
numerical value for the hidepid parameter. This does not break the
fsconfig since it is not possible to specify a numerical value through
it. All numeric values are converted to a string. The type
FSCONFIG_SET_BINARY cannot be used to indicate a numerical value.

Selftest has been added to verify this behavior.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-04-22 10:51:22 -05:00
Alexey Gladkov
6814ef2d99 proc: add option to mount only a pids subset
This allows to hide all files and directories in the procfs that are not
related to tasks.

Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-04-22 10:51:22 -05:00
Alexey Gladkov
24a71ce5c4 proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option
If "hidepid=4" mount option is set then do not instantiate pids that
we can not ptrace. "hidepid=4" means that procfs should only contain
pids that the caller can ptrace.

Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-04-22 10:51:21 -05:00
Alexey Gladkov
fa10fed30f proc: allow to mount many instances of proc in one pid namespace
This patch allows to have multiple procfs instances inside the
same pid namespace. The aim here is lightweight sandboxes, and to allow
that we have to modernize procfs internals.

1) The main aim of this work is to have on embedded systems one
supervisor for apps. Right now we have some lightweight sandbox support,
however if we create pid namespacess we have to manages all the
processes inside too, where our goal is to be able to run a bunch of
apps each one inside its own mount namespace without being able to
notice each other. We only want to use mount namespaces, and we want
procfs to behave more like a real mount point.

2) Linux Security Modules have multiple ptrace paths inside some
subsystems, however inside procfs, the implementation does not guarantee
that the ptrace() check which triggers the security_ptrace_check() hook
will always run. We have the 'hidepid' mount option that can be used to
force the ptrace_may_access() check inside has_pid_permissions() to run.
The problem is that 'hidepid' is per pid namespace and not attached to
the mount point, any remount or modification of 'hidepid' will propagate
to all other procfs mounts.

This also does not allow to support Yama LSM easily in desktop and user
sessions. Yama ptrace scope which restricts ptrace and some other
syscalls to be allowed only on inferiors, can be updated to have a
per-task context, where the context will be inherited during fork(),
clone() and preserved across execve(). If we support multiple private
procfs instances, then we may force the ptrace_may_access() on
/proc/<pids>/ to always run inside that new procfs instances. This will
allow to specifiy on user sessions if we should populate procfs with
pids that the user can ptrace or not.

By using Yama ptrace scope, some restricted users will only be able to see
inferiors inside /proc, they won't even be able to see their other
processes. Some software like Chromium, Firefox's crash handler, Wine
and others are already using Yama to restrict which processes can be
ptracable. With this change this will give the possibility to restrict
/proc/<pids>/ but more importantly this will give desktop users a
generic and usuable way to specifiy which users should see all processes
and which users can not.

Side notes:
* This covers the lack of seccomp where it is not able to parse
arguments, it is easy to install a seccomp filter on direct syscalls
that operate on pids, however /proc/<pid>/ is a Linux ABI using
filesystem syscalls. With this change LSMs should be able to analyze
open/read/write/close...

In the new patch set version I removed the 'newinstance' option
as suggested by Eric W. Biederman.

Selftest has been added to verify new behavior.

Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-04-22 10:51:21 -05:00
Eric W. Biederman
4fa3b1c417 proc: Handle umounts cleanly
syzbot writes:
> KASAN: use-after-free Read in dput (2)
>
> proc_fill_super: allocate dentry failed
> ==================================================================
> BUG: KASAN: use-after-free in fast_dput fs/dcache.c:727 [inline]
> BUG: KASAN: use-after-free in dput+0x53e/0xdf0 fs/dcache.c:846
> Read of size 4 at addr ffff88808a618cf0 by task syz-executor.0/8426
>
> CPU: 0 PID: 8426 Comm: syz-executor.0 Not tainted 5.6.0-next-20200412-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x188/0x20d lib/dump_stack.c:118
>  print_address_description.constprop.0.cold+0xd3/0x315 mm/kasan/report.c:382
>  __kasan_report.cold+0x35/0x4d mm/kasan/report.c:511
>  kasan_report+0x33/0x50 mm/kasan/common.c:625
>  fast_dput fs/dcache.c:727 [inline]
>  dput+0x53e/0xdf0 fs/dcache.c:846
>  proc_kill_sb+0x73/0xf0 fs/proc/root.c:195
>  deactivate_locked_super+0x8c/0xf0 fs/super.c:335
>  vfs_get_super+0x258/0x2d0 fs/super.c:1212
>  vfs_get_tree+0x89/0x2f0 fs/super.c:1547
>  do_new_mount fs/namespace.c:2813 [inline]
>  do_mount+0x1306/0x1b30 fs/namespace.c:3138
>  __do_sys_mount fs/namespace.c:3347 [inline]
>  __se_sys_mount fs/namespace.c:3324 [inline]
>  __x64_sys_mount+0x18f/0x230 fs/namespace.c:3324
>  do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
>  entry_SYSCALL_64_after_hwframe+0x49/0xb3
> RIP: 0033:0x45c889
> Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:00007ffc1930ec48 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
> RAX: ffffffffffffffda RBX: 0000000001324914 RCX: 000000000045c889
> RDX: 0000000020000140 RSI: 0000000020000040 RDI: 0000000000000000
> RBP: 000000000076bf00 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
> R13: 0000000000000749 R14: 00000000004ca15a R15: 0000000000000013

Looking at the code now that it the internal mount of proc is no
longer used it is possible to unmount proc.   If proc is unmounted
the fields of the pid namespace that were used for filesystem
specific state are not reinitialized.

Which means that proc_self and proc_thread_self can be pointers to
already freed dentries.

The reported user after free appears to be from mounting and
unmounting proc followed by mounting proc again and using error
injection to cause the new root dentry allocation to fail.  This in
turn results in proc_kill_sb running with proc_self and
proc_thread_self still retaining their values from the previous mount
of proc.  Then calling dput on either proc_self of proc_thread_self
will result in double put.  Which KASAN sees as a use after free.

Solve this by always reinitializing the filesystem state stored
in the struct pid_namespace, when proc is unmounted.

Reported-by: syzbot+72868dd424eb66c6b95f@syzkaller.appspotmail.com
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Fixes: 69879c01a0 ("proc: Remove the now unnecessary internal mount of proc")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-15 23:52:29 -05:00
Linus Torvalds
87ad46e601 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull proc fix from Eric Biederman:
 "A brown paper bag slipped through my proc changes, and syzcaller
  caught it when the code ended up in your tree.

  I have opted to fix it the simplest cleanest way I know how, so there
  is no reasonable chance for the bug to repeat"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: Use a dedicated lock in struct pid
2020-04-10 12:59:56 -07:00
Eric W. Biederman
63f818f46a proc: Use a dedicated lock in struct pid
syzbot wrote:
> ========================================================
> WARNING: possible irq lock inversion dependency detected
> 5.6.0-syzkaller #0 Not tainted
> --------------------------------------------------------
> swapper/1/0 just changed the state of lock:
> ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
> but this lock took another, SOFTIRQ-unsafe lock in the past:
>  (&pid->wait_pidfd){+.+.}-{2:2}
>
>
> and interrupts could create inverse lock ordering between them.
>
>
> other info that might help us debug this:
>  Possible interrupt unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&pid->wait_pidfd);
>                                local_irq_disable();
>                                lock(tasklist_lock);
>                                lock(&pid->wait_pidfd);
>   <Interrupt>
>     lock(tasklist_lock);
>
>  *** DEADLOCK ***
>
> 4 locks held by swapper/1/0:

The problem is that because wait_pidfd.lock is taken under the tasklist
lock.  It must always be taken with irqs disabled as tasklist_lock can be
taken from interrupt context and if wait_pidfd.lock was already taken this
would create a lock order inversion.

Oleg suggested just disabling irqs where I have added extra calls to
wait_pidfd.lock.  That should be safe and I think the code will eventually
do that.  It was rightly pointed out by Christian that sharing the
wait_pidfd.lock was a premature optimization.

It is also true that my pre-merge window testing was insufficient.  So
remove the premature optimization and give struct pid a dedicated lock of
it's own for struct pid things.  I have verified that lockdep sees all 3
paths where we take the new pid->lock and lockdep does not complain.

It is my current day dream that one day pid->lock can be used to guard the
task lists as well and then the tasklist_lock won't need to be held to
deliver signals.  That will require taking pid->lock with irqs disabled.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com
Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com
Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com
Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com
Fixes: 7bc3e6e55a ("proc: Use a list of inodes to flush from proc")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-09 12:15:35 -05:00
Matthew Wilcox (Oracle)
fad955009c proc: inline m_next_vma into m_next
It's clearer to just put this inline.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200317193201.9924-5-adobriyan@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:42 -07:00
Matthew Wilcox (Oracle)
4781f2c3ab proc: use ppos instead of m->version
The ppos is a private cursor, just like m->version.  Use the canonical
cursor, not a special one.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200317193201.9924-3-adobriyan@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:42 -07:00
Matthew Wilcox (Oracle)
c2e88d22e8 proc: remove m_cache_vma
Instead of setting m->version in the show method, set it in m_next(),
where it should be.  Also remove the fallback code for failing to find a
vma, or version being zero.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200317193201.9924-2-adobriyan@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:42 -07:00
Matthew Wilcox (Oracle)
d07ded611e proc: inline vma_stop into m_stop
Instead of calling vma_stop() from m_start() and m_next(), do its work
in m_stop().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200317193201.9924-1-adobriyan@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:42 -07:00
Alexey Dobriyan
5c5ab9714c proc: speed up /proc/*/statm
top(1) reads all /proc/*/statm files but kernel threads will always have
zeros.  Print those zeroes directly without going through
seq_put_decimal_ull().

Speed up reading /proc/2/statm (which is kthreadd) is like 3%.

My system has more kernel threads than normal processes after booting KDE.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200307154435.GA2788@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:42 -07:00
Alexey Dobriyan
d919b33daf proc: faster open/read/close with "permanent" files
Now that "struct proc_ops" exist we can start putting there stuff which
could not fly with VFS "struct file_operations"...

Most of fs/proc/inode.c file is dedicated to make open/read/.../close
reliable in the event of disappearing /proc entries which usually happens
if module is getting removed.  Files like /proc/cpuinfo which never
disappear simply do not need such protection.

Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
"permanent" files.

Enable "permanent" flag for

	/proc/cpuinfo
	/proc/kmsg
	/proc/modules
	/proc/slabinfo
	/proc/stat
	/proc/sysvipc/*
	/proc/swaps

More will come once I figure out foolproof way to prevent out module
authors from marking their stuff "permanent" for performance reasons
when it is not.

This should help with scalability: benchmark is "read /proc/cpuinfo R times
by N threads scattered over the system".

	N	R	t, s (before)	t, s (after)
	-----------------------------------------------------
	64	4096	1.582458	1.530502	-3.2%
	256	4096	6.371926	6.125168	-3.9%
	1024	4096	25.64888	24.47528	-4.6%

Benchmark source:

#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
int N;
const char *filename;
int R;

int xxx = 0;

int glue(int n)
{
	cpu_set_t m;
	CPU_ZERO(&m);
	CPU_SET(n, &m);
	return sched_setaffinity(0, sizeof(cpu_set_t), &m);
}

void f(int n)
{
	glue(n % NR_CPUS);

	while (*(volatile int *)&xxx == 0) {
	}

	for (int i = 0; i < R; i++) {
		int fd = open(filename, O_RDONLY);
		char buf[4096];
		ssize_t rv = read(fd, buf, sizeof(buf));
		asm volatile ("" :: "g" (rv));
		close(fd);
	}
}

int main(int argc, char *argv[])
{
	if (argc < 4) {
		std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R
";
		return 1;
	}

	N = atoi(argv[1]);
	filename = argv[2];
	R = atoi(argv[3]);

	for (int i = 0; i < NR_CPUS; i++) {
		if (glue(i) == 0)
			break;
	}

	std::vector<std::thread> T;
	T.reserve(N);
	for (int i = 0; i < N; i++) {
		T.emplace_back(f, i);
	}

	auto t0 = std::chrono::system_clock::now();
	{
		*(volatile int *)&xxx = 1;
		for (auto& t: T) {
			t.join();
		}
	}
	auto t1 = std::chrono::system_clock::now();
	std::chrono::duration<double> dt = t1 - t0;
	std::cout << dt.count() << '
';

	return 0;
}

P.S.:
Explicit randomization marker is added because adding non-function pointer
will silently disable structure layout randomization.

[akpm@linux-foundation.org: coding style fixes]
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joe Perches <joe@perches.com>
Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:42 -07:00
Jules Irenge
904f394e2e fs/proc/inode.c: annotate close_pdeo() for sparse
Fix sparse locking imbalance warning:

	warning: context imbalance in close_pdeo() - unexpected unlock

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200227201538.GA30462@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:42 -07:00
Bernd Edlinger
76518d3798 proc: io_accounting: Use new infrastructure to fix deadlocks in execve
This changes do_io_accounting to use the new exec_update_mutex
instead of cred_guard_mutex.

This fixes possible deadlocks when the trace is accessing
/proc/$pid/io for instance.

This should be safe, as the credentials are only used for reading.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-03-25 10:04:01 -05:00
Bernd Edlinger
2db9dbf71b proc: Use new infrastructure to fix deadlocks in execve
This changes lock_trace to use the new exec_update_mutex
instead of cred_guard_mutex.

This fixes possible deadlocks when the trace is accessing
/proc/$pid/stack for instance.

This should be safe, as the credentials are only used for reading,
and task->mm is updated on execve under the new exec_update_mutex.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-03-25 10:04:01 -05:00
Eric W. Biederman
69879c01a0 proc: Remove the now unnecessary internal mount of proc
There remains no more code in the kernel using pids_ns->proc_mnt,
therefore remove it from the kernel.

The big benefit of this change is that one of the most error prone and
tricky parts of the pid namespace implementation, maintaining kernel
mounts of proc is removed.

In addition removing the unnecessary complexity of the kernel mount
fixes a regression that caused the proc mount options to be ignored.
Now that the initial mount of proc comes from userspace, those mount
options are again honored.  This fixes Android's usage of the proc
hidepid option.

Reported-by: Alistair Strachan <astrachan@google.com>
Fixes: e94591d0d9 ("proc: Convert proc_mount to use mount_ns.")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-02-28 12:06:14 -06:00
Eric W. Biederman
7bc3e6e55a proc: Use a list of inodes to flush from proc
Rework the flushing of proc to use a list of directory inodes that
need to be flushed.

The list is kept on struct pid not on struct task_struct, as there is
a fixed connection between proc inodes and pids but at least for the
case of de_thread the pid of a task_struct changes.

This removes the dependency on proc_mnt which allows for different
mounts of proc having different mount options even in the same pid
namespace and this allows for the removal of proc_mnt which will
trivially the first mount of proc to honor it's mount options.

This flushing remains an optimization.  The functions
pid_delete_dentry and pid_revalidate ensure that ordinary dcache
management will not attempt to use dentries past the point their
respective task has died.  When unused the shrinker will
eventually be able to remove these dentries.

There is a case in de_thread where proc_flush_pid can be
called early for a given pid.  Which winds up being
safe (if suboptimal) as this is just an optiimization.

Only pid directories are put on the list as the other
per pid files are children of those directories and
d_invalidate on the directory will get them as well.

So that the pid can be used during flushing it's reference count is
taken in release_task and dropped in proc_flush_pid.  Further the call
of proc_flush_pid is moved after the tasklist_lock is released in
release_task so that it is certain that the pid has already been
unhashed when flushing it taking place.  This removes a small race
where a dentry could recreated.

As struct pid is supposed to be small and I need a per pid lock
I reuse the only lock that currently exists in struct pid the
the wait_pidfd.lock.

The net result is that this adds all of this functionality
with just a little extra list management overhead and
a single extra pointer in struct pid.

v2: Initialize pid->inodes.  I somehow failed to get that
    initialization into the initial version of the patch.  A boot
    failure was reported by "kernel test robot <lkp@intel.com>", and
    failure to initialize that pid->inodes matches all of the reported
    symptoms.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-02-24 10:14:44 -06:00
Eric W. Biederman
71448011ea proc: Clear the pieces of proc_inode that proc_evict_inode cares about
This just keeps everything tidier, and allows for using flags like
SLAB_TYPESAFE_BY_RCU where slabs are not always cleared before reuse.
I don't see reuse without reinitializing happening with the proc_inode
but I had a false alarm while reworking flushing of proc dentries and
indoes when a process dies that caused me to tidy this up.

The code is a little easier to follow and reason about this
way so I figured the changes might as well be kept.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-02-24 09:51:25 -06:00
Eric W. Biederman
f90f3cafe8 proc: Use d_invalidate in proc_prune_siblings_dcache
The function d_prune_aliases has the problem that it will only prune
aliases thare are completely unused.  It will not remove aliases for
the dcache or even think of removing mounts from the dcache.  For that
behavior d_invalidate is needed.

To use d_invalidate replace d_prune_aliases with d_find_alias followed
by d_invalidate and dput.

For completeness the directory and the non-directory cases are
separated because in theory (although not in currently in practice for
proc) directories can only ever have a single dentry while
non-directories can have hardlinks and thus multiple dentries.
As part of this separation use d_find_any_alias for directories
to spare d_find_alias the extra work of doing that.

Plus the differences between d_find_any_alias and d_find_alias makes
it clear why the directory and non-directory code and not share code.

To make it clear these routines now invalidate dentries rename
proc_prune_siblings_dache to proc_invalidate_siblings_dcache, and rename
proc_sys_prune_dcache proc_sys_invalidate_dcache.

V2: Split the directory and non-directory cases.  To make this
    code robust to future changes in proc.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-02-24 09:50:04 -06:00
Eric W. Biederman
080f6276fc proc: In proc_prune_siblings_dcache cache an aquired super block
Because there are likely to be several sysctls in a row on the
same superblock cache the super_block after the count has
been raised and don't deactivate it until we are processing
another super_block.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-02-21 14:06:42 -06:00
Eric W. Biederman
26dbc60f38 proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache
This prepares the way for allowing the pid part of proc to use this
dcache pruning code as well.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-02-20 10:06:02 -06:00
Eric W. Biederman
0afa5ca822 proc: Rename in proc_inode rename sysctl_inodes sibling_inodes
I about to need and use the same functionality for pid based
inodes and there is no point in adding a second field when
this field is already here and serving the same purporse.

Just give the field a generic name so it is clear that
it is no longer sysctl specific.

Also for good measure initialize sibling_inodes when
proc_inode is initialized.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-02-20 08:14:32 -06:00
Linus Torvalds
c9d35ee049 Merge branch 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs file system parameter updates from Al Viro:
 "Saner fs_parser.c guts and data structures. The system-wide registry
  of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
  the horror switch() in fs_parse() that would have to grow another case
  every time something got added to that system-wide registry.

  New syntax types can be added by filesystems easily now, and their
  namespace is that of functions - not of system-wide enum members. IOW,
  they can be shared or kept private and if some turn out to be widely
  useful, we can make them common library helpers, etc., without having
  to do anything whatsoever to fs_parse() itself.

  And we already get that kind of requests - the thing that finally
  pushed me into doing that was "oh, and let's add one for timeouts -
  things like 15s or 2h". If some filesystem really wants that, let them
  do it. Without somebody having to play gatekeeper for the variants
  blessed by direct support in fs_parse(), TYVM.

  Quite a bit of boilerplate is gone. And IMO the data structures make a
  lot more sense now. -200LoC, while we are at it"

* 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
  tmpfs: switch to use of invalfc()
  cgroup1: switch to use of errorfc() et.al.
  procfs: switch to use of invalfc()
  hugetlbfs: switch to use of invalfc()
  cramfs: switch to use of errofc() et.al.
  gfs2: switch to use of errorfc() et.al.
  fuse: switch to use errorfc() et.al.
  ceph: use errorfc() and friends instead of spelling the prefix out
  prefix-handling analogues of errorf() and friends
  turn fs_param_is_... into functions
  fs_parse: handle optional arguments sanely
  fs_parse: fold fs_parameter_desc/fs_parameter_spec
  fs_parser: remove fs_parameter_description name field
  add prefix to fs_context->log
  ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
  new primitive: __fs_parse()
  switch rbd and libceph to p_log-based primitives
  struct p_log, variants of warnf() et.al. taking that one instead
  teach logfc() to handle prefices, give it saner calling conventions
  get rid of cg_invalf()
  ...
2020-02-08 13:26:41 -08:00
Al Viro
bf45f7fcc4 procfs: switch to use of invalfc()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:42 -05:00
Al Viro
d7167b1499 fs_parse: fold fs_parameter_desc/fs_parameter_spec
The former contains nothing but a pointer to an array of the latter...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:37 -05:00
Eric Sandeen
96cafb9ccb fs_parser: remove fs_parameter_description name field
Unused now.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:36 -05:00
Linus Torvalds
e310396bb8 Tracing updates:
- Added new "bootconfig".
    Looks for a file appended to initrd to add boot config options.
    This has been discussed thoroughly at Linux Plumbers.
    Very useful for adding kprobes at bootup.
    Only enabled if "bootconfig" is on the real kernel command line.
 
  - Created dynamic event creation.
    Merges common code between creating synthetic events and
      kprobe events.
 
  - Rename perf "ring_buffer" structure to "perf_buffer"
 
  - Rename ftrace "ring_buffer" structure to "trace_buffer"
    Had to rename existing "trace_buffer" to "array_buffer"
 
  - Allow trace_printk() to work withing (some) tracing code.
 
  - Sort of tracing configs to be a little better organized
 
  - Fixed bug where ftrace_graph hash was not being protected properly
 
  - Various other small fixes and clean ups
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXjtAURQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qshOAQDzopQmvAVrrI6oogghr8JQA30Z2yqT
 i+Ld7vPWL2MV9wEA1S+zLGDSYrj8f/vsCq6BxRYT1ApO+YtmY6LTXiUejwg=
 =WNds
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:

 - Added new "bootconfig".

   This looks for a file appended to initrd to add boot config options,
   and has been discussed thoroughly at Linux Plumbers.

   Very useful for adding kprobes at bootup.

   Only enabled if "bootconfig" is on the real kernel command line.

 - Created dynamic event creation.

   Merges common code between creating synthetic events and kprobe
   events.

 - Rename perf "ring_buffer" structure to "perf_buffer"

 - Rename ftrace "ring_buffer" structure to "trace_buffer"

   Had to rename existing "trace_buffer" to "array_buffer"

 - Allow trace_printk() to work withing (some) tracing code.

 - Sort of tracing configs to be a little better organized

 - Fixed bug where ftrace_graph hash was not being protected properly

 - Various other small fixes and clean ups

* tag 'trace-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (88 commits)
  bootconfig: Show the number of nodes on boot message
  tools/bootconfig: Show the number of bootconfig nodes
  bootconfig: Add more parse error messages
  bootconfig: Use bootconfig instead of boot config
  ftrace: Protect ftrace_graph_hash with ftrace_sync
  ftrace: Add comment to why rcu_dereference_sched() is open coded
  tracing: Annotate ftrace_graph_notrace_hash pointer with __rcu
  tracing: Annotate ftrace_graph_hash pointer with __rcu
  bootconfig: Only load bootconfig if "bootconfig" is on the kernel cmdline
  tracing: Use seq_buf for building dynevent_cmd string
  tracing: Remove useless code in dynevent_arg_pair_add()
  tracing: Remove check_arg() callbacks from dynevent args
  tracing: Consolidate some synth_event_trace code
  tracing: Fix now invalid var_ref_vals assumption in trace action
  tracing: Change trace_boot to use synth_event interface
  tracing: Move tracing selftests to bottom of menu
  tracing: Move mmio tracer config up with the other tracers
  tracing: Move tracing test module configs together
  tracing: Move all function tracing configs together
  tracing: Documentation for in-kernel synthetic event API
  ...
2020-02-06 07:12:11 +00:00
Alexey Dobriyan
97a32539b9 proc: convert everything to "struct proc_ops"
The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
seq_file.h.

Conversion rule is:

	llseek		=> proc_lseek
	unlocked_ioctl	=> proc_ioctl

	xxx		=> proc_xxx

	delete ".owner = THIS_MODULE" line

[akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
[sfr@canb.auug.org.au: fix kernel/sched/psi.c]
  Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04 03:05:26 +00:00
Alexey Dobriyan
d56c0d45f0 proc: decouple proc from VFS with "struct proc_ops"
Currently core /proc code uses "struct file_operations" for custom hooks,
however, VFS doesn't directly call them.  Every time VFS expands
file_operations hook set, /proc code bloats for no reason.

Introduce "struct proc_ops" which contains only those hooks which /proc
allows to call into (open, release, read, write, ioctl, mmap, poll).  It
doesn't contain module pointer as well.

Save ~184 bytes per usage:

	add/remove: 26/26 grow/shrink: 1/4 up/down: 1922/-6674 (-4752)
	Function                                     old     new   delta
	sysvipc_proc_ops                               -      72     +72
				...
	config_gz_proc_ops                             -      72     +72
	proc_get_inode                               289     339     +50
	proc_reg_get_unmapped_area                   110     107      -3
	close_pdeo                                   227     224      -3
	proc_reg_open                                289     284      -5
	proc_create_data                              60      53      -7
	rt_cpu_seq_fops                              256       -    -256
				...
	default_affinity_proc_fops                   256       -    -256
	Total: Before=5430095, After=5425343, chg -0.09%

Link: http://lkml.kernel.org/r/20191225172228.GA13378@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04 03:05:26 +00:00
Steven Price
b7a16c7ad7 mm: pagewalk: add 'depth' parameter to pte_hole
The pte_hole() callback is called at multiple levels of the page tables.
Code dumping the kernel page tables needs to know what at what depth the
missing entry is.  Add this is an extra parameter to pte_hole().  When the
depth isn't know (e.g.  processing a vma) then -1 is passed.

The depth that is reported is the actual level where the entry is missing
(ignoring any folding that is in place), i.e.  any levels where
PTRS_PER_P?D is set to 1 are ignored.

Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
natural numbers as levels 2/3/4.

Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Tested-by: Zong Li <zong.li@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04 03:05:25 +00:00
David Hildenbrand
abec749fac fs/proc/page.c: allow inspection of last section and fix end detection
If max_pfn does not fall onto a section boundary, it is possible to
inspect PFNs up to max_pfn, and PFNs above max_pfn, however, max_pfn
itself can't be inspected.  We can have a valid (and online) memmap at and
above max_pfn if max_pfn is not aligned to a section boundary.  The whole
early section has a memmap and is marked online.  Being able to inspect
the state of these PFNs is valuable for debugging, especially because
max_pfn can change on memory hotplug and expose these memmaps.

Also, querying page flags via "./page-types -r -a 0x144001,"
(tools/vm/page-types.c) inside a x86-64 guest with 4160MB under QEMU
results in an (almost) endless loop in user space, because the end is not
detected properly when starting after max_pfn.

Instead, let's allow to inspect all pages in the highest section and
return 0 directly if we try to access pages above that section.

While at it, check the count before adjusting it, to avoid masking user
errors.

Link: http://lkml.kernel.org/r/20191211163201.17179-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04 03:05:23 +00:00
Linus Torvalds
6aee4badd8 Merge branch 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull openat2 support from Al Viro:
 "This is the openat2() series from Aleksa Sarai.

  I'm afraid that the rest of namei stuff will have to wait - it got
  zero review the last time I'd posted #work.namei, and there had been a
  leak in the posted series I'd caught only last weekend. I was going to
  repost it on Monday, but the window opened and the odds of getting any
  review during that... Oh, well.

  Anyway, openat2 part should be ready; that _did_ get sane amount of
  review and public testing, so here it comes"

From Aleksa's description of the series:
 "For a very long time, extending openat(2) with new features has been
  incredibly frustrating. This stems from the fact that openat(2) is
  possibly the most famous counter-example to the mantra "don't silently
  accept garbage from userspace" -- it doesn't check whether unknown
  flags are present[1].

  This means that (generally) the addition of new flags to openat(2) has
  been fraught with backwards-compatibility issues (O_TMPFILE has to be
  defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
  kernels gave errors, since it's insecure to silently ignore the
  flag[2]). All new security-related flags therefore have a tough road
  to being added to openat(2).

  Furthermore, the need for some sort of control over VFS's path
  resolution (to avoid malicious paths resulting in inadvertent
  breakouts) has been a very long-standing desire of many userspace
  applications.

  This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset
  (which was a variant of David Drysdale's O_BENEATH patchset[4] which
  was a spin-off of the Capsicum project[5]) with a few additions and
  changes made based on the previous discussion within [6] as well as
  others I felt were useful.

  In line with the conclusions of the original discussion of
  AT_NO_JUMPS, the flag has been split up into separate flags. However,
  instead of being an openat(2) flag it is provided through a new
  syscall openat2(2) which provides several other improvements to the
  openat(2) interface (see the patch description for more details). The
  following new LOOKUP_* flags are added:

  LOOKUP_NO_XDEV:

     Blocks all mountpoint crossings (upwards, downwards, or through
     absolute links). Absolute pathnames alone in openat(2) do not
     trigger this. Magic-link traversal which implies a vfsmount jump is
     also blocked (though magic-link jumps on the same vfsmount are
     permitted).

  LOOKUP_NO_MAGICLINKS:

     Blocks resolution through /proc/$pid/fd-style links. This is done
     by blocking the usage of nd_jump_link() during resolution in a
     filesystem. The term "magic-links" is used to match with the only
     reference to these links in Documentation/, but I'm happy to change
     the name.

     It should be noted that this is different to the scope of
     ~LOOKUP_FOLLOW in that it applies to all path components. However,
     you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
     will *not* fail (assuming that no parent component was a
     magic-link), and you will have an fd for the magic-link.

     In order to correctly detect magic-links, the introduction of a new
     LOOKUP_MAGICLINK_JUMPED state flag was required.

  LOOKUP_BENEATH:

     Disallows escapes to outside the starting dirfd's
     tree, using techniques such as ".." or absolute links. Absolute
     paths in openat(2) are also disallowed.

     Conceptually this flag is to ensure you "stay below" a certain
     point in the filesystem tree -- but this requires some additional
     to protect against various races that would allow escape using
     "..".

     Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
     can trivially beam you around the filesystem (breaking the
     protection). In future, there might be similar safety checks done
     as in LOOKUP_IN_ROOT, but that requires more discussion.

  In addition, two new flags are added that expand on the above ideas:

  LOOKUP_NO_SYMLINKS:

     Does what it says on the tin. No symlink resolution is allowed at
     all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this
     can still be used with NOFOLLOW to open an fd for the symlink as
     long as no parent path had a symlink component.

  LOOKUP_IN_ROOT:

     This is an extension of LOOKUP_BENEATH that, rather than blocking
     attempts to move past the root, forces all such movements to be
     scoped to the starting point. This provides chroot(2)-like
     protection but without the cost of a chroot(2) for each filesystem
     operation, as well as being safe against race attacks that
     chroot(2) is not.

     If a race is detected (as with LOOKUP_BENEATH) then an error is
     generated, and similar to LOOKUP_BENEATH it is not permitted to
     cross magic-links with LOOKUP_IN_ROOT.

     The primary need for this is from container runtimes, which
     currently need to do symlink scoping in userspace[7] when opening
     paths in a potentially malicious container.

     There is a long list of CVEs that could have bene mitigated by
     having RESOLVE_THIS_ROOT (such as CVE-2017-1002101,
     CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a
     few).

  In order to make all of the above more usable, I'm working on
  libpathrs[8] which is a C-friendly library for safe path resolution.
  It features a userspace-emulated backend if the kernel doesn't support
  openat2(2). Hopefully we can get userspace to switch to using it, and
  thus get openat2(2) support for free once it's ready.

  Future work would include implementing things like
  RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow
  programs to be sure they don't hit DoSes though stale NFS handles)"

* 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  Documentation: path-lookup: include new LOOKUP flags
  selftests: add openat2(2) selftests
  open: introduce openat2(2) syscall
  namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
  namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
  namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
  namei: LOOKUP_NO_XDEV: block mountpoint crossing
  namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
  namei: LOOKUP_NO_SYMLINKS: block symlink resolution
  namei: allow set_root() to produce errors
  namei: allow nd_jump_link() to produce errors
  nsfs: clean-up ns_get_path() signature to return int
  namei: only return -ECHILD from follow_dotdot_rcu()
2020-01-29 11:20:24 -08:00
Linus Torvalds
4244057c3d Merge branch 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Ingo Molnar:
 "The main change in this tree is the extension of the resctrl procfs
  ABI with a new file that helps tooling to navigate from tasks back to
  resctrl groups: /proc/{pid}/cpu_resctrl_groups.

  Also fix static key usage for certain feature combinations and
  simplify the task exit resctrl case"

* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Add task resctrl information display
  x86/resctrl: Check monitoring static key in the MBM overflow handler
  x86/resctrl: Do not reconfigure exiting tasks
2020-01-28 12:00:29 -08:00
Chen Yu
e79f15a459 x86/resctrl: Add task resctrl information display
Monitoring tools that want to find out which resctrl control and monitor
groups a task belongs to must currently read the "tasks" file in every
group until they locate the process ID.

Add an additional file /proc/{pid}/cpu_resctrl_groups to provide this
information:

1)   res:
     mon:

resctrl is not available.

2)   res:/
     mon:

Task is part of the root resctrl control group, and it is not associated
to any monitor group.

3)  res:/
    mon:mon0

Task is part of the root resctrl control group and monitor group mon0.

4)  res:group0
    mon:

Task is part of resctrl control group group0, and it is not associated
to any monitor group.

5) res:group0
   mon:mon1

Task is part of resctrl control group group0 and monitor group mon1.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jinshi Chen <jinshi.chen@intel.com>
Link: https://lkml.kernel.org/r/20200115092851.14761-1-yu.c.chen@intel.com
2020-01-20 16:19:10 +01:00
Andrei Vagin
04a8682a71 fs/proc: Introduce /proc/pid/timens_offsets
API to set time namespace offsets for children processes, i.e.:
echo "$clockid $offset_sec $offset_nsec" > /proc/self/timens_offsets

Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-28-dima@arista.com
2020-01-14 12:20:59 +01:00
Dmitry Safonov
0efc8bb0bb fs/proc: Respect boottime inside time namespace for /proc/uptime
Make sure that /proc/uptime is adjusted to the tasks time namespace.

Co-developed-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-19-dima@arista.com
2020-01-14 12:20:56 +01:00
Andrei Vagin
769071ac9f ns: Introduce Time Namespace
Time Namespace isolates clock values.

The kernel provides access to several clocks CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.

CLOCK_REALTIME
      System-wide clock that measures real (i.e., wall-clock) time.

CLOCK_MONOTONIC
      Clock that cannot be set and represents monotonic time since
      some unspecified starting point.

CLOCK_BOOTTIME
      Identical to CLOCK_MONOTONIC, except it also includes any time
      that the system is suspended.

For many users, the time namespace means the ability to changes date and
time in a container (CLOCK_REALTIME). Providing per namespace notions of
CLOCK_REALTIME would be complex with a massive overhead, but has a dubious
value.

But in the context of checkpoint/restore functionality, monotonic and
boottime clocks become interesting. Both clocks are monotonic with
unspecified starting points. These clocks are widely used to measure time
slices and set timers. After restoring or migrating processes, it has to be
guaranteed that they never go backward. In an ideal case, the behavior of
these clocks should be the same as for a case when a whole system is
suspended. All this means that it is required to set CLOCK_MONOTONIC and
CLOCK_BOOTTIME clocks, which can be achieved by adding per-namespace
offsets for clocks.

A time namespace is similar to a pid namespace in the way how it is
created: unshare(CLONE_NEWTIME) system call creates a new time namespace,
but doesn't set it to the current process. Then all children of the process
will be born in the new time namespace, or a process can use the setns()
system call to join a namespace.

This scheme allows setting clock offsets for a namespace, before any
processes appear in it.

All available clone flags have been used, so CLONE_NEWTIME uses the highest
bit of CSIGNAL. It means that it can be used only with the unshare() and
the clone3() system calls.

[ tglx: Adjusted paragraph about clone3() to reality and massaged the
  	changelog a bit. ]

Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://criu.org/Time_namespace
Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
Link: https://lore.kernel.org/r/20191112012724.250792-4-dima@arista.com
2020-01-14 12:20:48 +01:00
Masami Hiramatsu
c1a3c36017 proc: bootconfig: Add /proc/bootconfig to show boot config list
Add /proc/bootconfig which shows the list of key-value pairs
in boot config. Since after boot, all boot configs and tree
are removed, this interface just keep a copy of key-value
pairs in text.

Link: http://lkml.kernel.org/r/157867225967.17873.12155805787236073787.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13 13:19:39 -05:00
Flavio Leitner
346da4d2c7 sched/cputime, proc/stat: Fix incorrect guest nice cpustat value
The value being used for guest_nice should be CPUTIME_GUEST_NICE
and not CPUTIME_USER.

Fixes: 26dae145a7 ("procfs: Use all-in-one vtime aware kcpustat accessor")
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191205020344.14940-1-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-11 07:09:58 +01:00
Aleksa Sarai
1bc82070fa namei: allow nd_jump_link() to produce errors
In preparation for LOOKUP_NO_MAGICLINKS, it's necessary to add the
ability for nd_jump_link() to return an error which the corresponding
get_link() caller must propogate back up to the VFS.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-12-08 19:09:38 -05:00
Aleksa Sarai
ce623f8987 nsfs: clean-up ns_get_path() signature to return int
ns_get_path() and ns_get_path_cb() only ever return either NULL or an
ERR_PTR. It is far more idiomatic to simply return an integer, and it
makes all of the callers of ns_get_path() more straightforward to read.

Fixes: e149ed2b80 ("take the targets of /proc/*/ns/* symlinks to separate fs")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-12-08 19:09:37 -05:00
Krzysztof Kozlowski
3d82191c22 fs/proc/Kconfig: fix indentation
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
        $ sed -e 's/^        /	/' -i */Kconfig

[adobriyan@gmail.com: add two spaces where necessary]
Link: http://lkml.kernel.org/r/20191124133936.GA5655@avx2
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:11 -08:00
Alexey Dobriyan
70a731c0e3 fs/proc/internal.h: shuffle "struct pde_opener"
List iteration takes more code than anything else which means embedded
list_head should be the first element of the structure.

Space savings:

	add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-18 (-18)
	Function                                     old     new   delta
	close_pdeo                                   228     227      -1
	proc_reg_release                              86      82      -4
	proc_entry_rundown                           143     139      -4
	proc_reg_open                                298     289      -9

Link: http://lkml.kernel.org/r/20191004234753.GB30246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:11 -08:00
Alexey Dobriyan
5f6354eaa5 fs/proc/generic.c: delete useless "len" variable
Pointer to next '/' encodes length of path element and next start
position.  Subtraction and increment are redundant.

Link: http://lkml.kernel.org/r/20191004234521.GA30246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:11 -08:00
Alexey Dobriyan
e06689bf57 proc: change ->nlink under proc_subdir_lock
Currently gluing PDE into global /proc tree is done under lock, but
changing ->nlink is not.  Additionally struct proc_dir_entry::nlink is
not atomic so updates can be lost.

Link: http://lkml.kernel.org/r/20190925202436.GA17388@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:11 -08:00