linux-stable/fs/proc
Zhihao Cheng d919a1e79b proc: fix a dentry lock race between release_task and lookup
Commit 7bc3e6e55a ("proc: Use a list of inodes to flush from proc")
moved proc_flush_task() behind __exit_signal().  Since then, systemd can
show high CPU usage for a long time while releasing a task when the
following two processes run concurrently:

  systemd                                 ps
kernel_waitid                 stat(/proc/tgid)
  do_wait                       filename_lookup
    wait_consider_task            lookup_fast
      release_task
        __exit_signal
          __unhash_process
            detach_pid
              __change_pid // remove task->pid_links
                                     d_revalidate -> pid_revalidate  // 0
                                     d_invalidate(/proc/tgid)
                                       shrink_dcache_parent(/proc/tgid)
                                         d_walk(/proc/tgid)
                                           spin_lock_nested(/proc/tgid/fd)
                                           // iterating opened fd
        proc_flush_pid                                    |
           d_invalidate (/proc/tgid/fd)                   |
              shrink_dcache_parent(/proc/tgid/fd)         |
                shrink_dentry_list(subdirs)               ↓
                  shrink_lock_dentry(/proc/tgid/fd) --> race on dentry lock

d_invalidate() first removes the dentry from the hash, so why does
proc_flush_pid() process dentry '/proc/tgid/fd' before dentry
'/proc/tgid'?  Because proc_pid_make_inode() adds proc inodes in
reverse order by invoking hlist_add_head_rcu().  However, proc should not
add any inode under '/proc/tgid' except '/proc/tgid/task/pid'.  Fix it by
adding an inode to 'pid->inodes' only if the inode is /proc/tgid or
/proc/tgid/task/pid.
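
The ordering argument above can be illustrated with a small user-space
model.  This is only a sketch, not the kernel code: the model_* names, the
plain singly linked list standing in for the RCU hlist, and the 'fixed'
and 'is_base_dir' flags are all hypothetical and exist only for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a proc inode linked into pid->inodes. */
struct model_inode {
	const char *path;
	struct model_inode *next;
};

/* Hypothetical stand-in for struct pid; 'inodes' models pid->inodes. */
struct model_pid {
	struct model_inode *inodes;
};

/* Head insertion: the most recently added inode becomes the first entry. */
static void model_add_head(struct model_pid *pid, struct model_inode *ino)
{
	assert(pid && ino);
	ino->next = pid->inodes;
	pid->inodes = ino;
}

/*
 * Models inode creation.  With fixed == 0 every inode is tracked (the
 * behaviour described as the problem).  With fixed != 0 only base
 * directories, i.e. /proc/tgid and /proc/tgid/task/pid, are tracked.
 */
static void model_make_inode(struct model_pid *pid, struct model_inode *ino,
			     int is_base_dir, int fixed)
{
	if (!fixed || is_base_dir)
		model_add_head(pid, ino);
}

/* Returns the path of the n-th inode in flush (iteration) order, or NULL. */
static const char *model_flush_order(const struct model_pid *pid, size_t n)
{
	const struct model_inode *ino = pid->inodes;

	while (ino && n--)
		ino = ino->next;
	return ino ? ino->path : NULL;
}
```

In this model, creating '/proc/tgid' and then '/proc/tgid/fd' without the
filter puts '/proc/tgid/fd' first in flush order, which matches the
reverse ordering described above.  With the filter enabled, only
'/proc/tgid' is on the list.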

Performance regression:
Create 200 tasks, each of which opens one file 50,000 times. Kill all
tasks once the number of open files exceeds 10,000,000 (cat
/proc/sys/fs/file-nr).
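
The test program itself is not included here.  A minimal sketch of the
per-task step might look like the following.  The helper names open_many()
and close_many() are hypothetical, and the sketch assumes the file
descriptor limit (RLIMIT_NOFILE) has already been raised far enough to
allow 50,000 open files per task.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Opens 'path' read-only up to 'count' times and stores the descriptors
 * in 'fds'.  Stops at the first failure (for example when the descriptor
 * limit is reached) and returns how many opens succeeded.
 */
static size_t open_many(const char *path, size_t count, int *fds)
{
	size_t n;

	for (n = 0; n < count; n++) {
		fds[n] = open(path, O_RDONLY);
		if (fds[n] < 0)
			break;
	}
	return n;
}

/* Closes the first 'n' descriptors stored in 'fds'. */
static void close_many(const int *fds, size_t n)
{
	while (n > 0)
		close(fds[--n]);
}
```

Each of the 200 tasks would call open_many() with a count of 50,000 and
then block (for instance in pause()) until it is killed, so that all of
its descriptors are still open when release_task() runs.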

Before fix:
$ time killall -wq aa
  real    4m40.946s   # During this period, both 'ps' and 'systemd' show
			high CPU usage.

After fix:
$ time killall -wq aa
  real    1m20.732s   # During this period, only 'systemd' shows high
			CPU usage.

Link: https://lkml.kernel.org/r/20220713130029.4133533-1-chengzhihao1@huawei.com
Fixes: 7bc3e6e55a ("proc: Use a list of inodes to flush from proc")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216054
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Suggested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17 17:31:43 -07:00
array.c proc: delete unused <linux/uaccess.h> includes 2022-07-17 17:31:39 -07:00
base.c proc: fix a dentry lock race between release_task and lookup 2022-07-17 17:31:43 -07:00
bootconfig.c proc: bootconfig: Add null pointer check 2022-04-02 08:40:09 -04:00
cmdline.c
consoles.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 191 2019-05-30 11:29:21 -07:00
cpuinfo.c x86/aperfmperf: Replace aperfmperf_get_khz() 2022-04-27 20:22:19 +02:00
devices.c block: move block-related definitions out of fs.h 2020-06-24 09:16:02 -06:00
fd.c procfs: prevent unprivileged processes accessing fdinfo dir 2022-05-09 17:34:28 -07:00
fd.h fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
generic.c proc: fix dentry/inode overinstantiating under /proc/${pid}/net 2022-05-09 18:29:19 -07:00
inode.c proc: delete unused <linux/uaccess.h> includes 2022-07-17 17:31:39 -07:00
internal.h fs: proc: store PDE()->data into inode->i_private 2022-01-22 08:33:37 +02:00
interrupts.c
Kconfig treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
kcore.c fs/proc/kcore.c: remove check of list iterator against head past the loop body 2022-04-29 14:37:59 -07:00
kmsg.c proc: delete unused <linux/uaccess.h> includes 2022-07-17 17:31:39 -07:00
loadavg.c sched: Make nr_running() return 32-bit value 2021-05-12 21:34:14 +02:00
Makefile proc: bootconfig: Add /proc/bootconfig to show boot config list 2020-01-13 13:19:39 -05:00
meminfo.c mm: zswap: add basic meminfo and vmstat coverage 2022-05-19 14:08:53 -07:00
namespaces.c Merge branch 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-01-29 11:20:24 -08:00
nommu.c proc: delete unused <linux/uaccess.h> includes 2022-07-17 17:31:39 -07:00
page.c mm: don't include <linux/memremap.h> in <linux/mm.h> 2022-03-03 12:47:33 -05:00
proc_net.c proc: delete unused <linux/uaccess.h> includes 2022-07-17 17:31:39 -07:00
proc_sysctl.c sysctl changes for v5.19-rc1 2022-05-26 16:57:20 -07:00
proc_tty.c proc: delete unused <linux/uaccess.h> includes 2022-07-17 17:31:39 -07:00
root.c proc: delete unused <linux/uaccess.h> includes 2022-07-17 17:31:39 -07:00
self.c Revert "proc: don't allow async path resolution of /proc/self components" 2021-02-23 20:32:11 -07:00
softirqs.c
stat.c fs/proc/uptime.c: Fix idle time reporting in /proc/uptime 2021-10-05 15:51:35 +02:00
task_mmu.c mm/pagemap: recognize uffd-wp bit for shmem/hugetlbfs 2022-05-13 07:20:11 -07:00
task_nommu.c mmap locking API: use coccinelle to convert mmap_sem rwsem call sites 2020-06-09 09:39:14 -07:00
thread_self.c Revert "proc: don't allow async path resolution of /proc/thread-self components" 2021-02-23 20:32:11 -07:00
uptime.c fs/proc/uptime.c: Fix idle time reporting in /proc/uptime 2021-10-05 15:51:35 +02:00
util.c fs/proc/util.c: include fs/proc/internal.h for name_to_int() 2019-01-04 13:13:45 -08:00
version.c
vmcore.c proc: delete unused <linux/uaccess.h> includes 2022-07-17 17:31:39 -07:00