linux-stable/fs
Imran Khan 9caf696142 kernfs: Introduce separate rwsem to protect inode attributes.
Right now a global per-fs rwsem (kernfs_rwsem) synchronizes multiple
kernfs operations. On a large system with few hundred CPUs and few
hundred applications simultaneoulsy trying to access sysfs, this
results in multiple sys_open(s) contending on kernfs_rwsem via
kernfs_iop_permission and kernfs_dop_revalidate.

For example on a system with 384 cores, if I run 200 instances of an
application which is mostly executing the following loop:

  for (int loop = 0; loop <100 ; loop++)
  {
    for (int port_num = 1; port_num < 2; port_num++)
    {
      for (int gid_index = 0; gid_index < 254; gid_index++ )
      {
        char ret_buf[64], ret_buf_lo[64];
        char gid_file_path[1024];

        int      ret_len;
        int      ret_fd;
        ssize_t  ret_rd;

        ub4  i, saved_errno;

        memset(ret_buf, 0, sizeof(ret_buf));
        memset(gid_file_path, 0, sizeof(gid_file_path));

        ret_len = snprintf(gid_file_path, sizeof(gid_file_path),
                           "/sys/class/infiniband/%s/ports/%d/gids/%d",
                           dev_name,
                           port_num,
                           gid_index);

        ret_fd = open(gid_file_path, O_RDONLY | O_CLOEXEC);
        if (ret_fd < 0)
        {
          printf("Failed to open %s\n", gid_file_path);
          continue;
        }

        /* Read the GID */
        ret_rd = read(ret_fd, ret_buf, 40);

        if (ret_rd == -1)
        {
          printf("Failed to read from file %s, errno: %u\n",
                 gid_file_path, saved_errno);

          continue;
        }

        close(ret_fd);
      }
    }

I see contention around kernfs_rwsem as follows:

path_openat
|
|----link_path_walk.part.0.constprop.0
|      |
|      |--49.92%--inode_permission
|      |          |
|      |           --48.69%--kernfs_iop_permission
|      |                     |
|      |                     |--18.16%--down_read
|      |                     |
|      |                     |--15.38%--up_read
|      |                     |
|      |                      --14.58%--_raw_spin_lock
|      |                                |
|      |                                 -----
|      |
|      |--29.08%--walk_component
|      |          |
|      |           --29.02%--lookup_fast
|      |                     |
|      |                     |--24.26%--kernfs_dop_revalidate
|      |                     |          |
|      |                     |          |--14.97%--down_read
|      |                     |          |
|      |                     |           --9.01%--up_read
|      |                     |
|      |                      --4.74%--__d_lookup
|      |                                |
|      |                                 --4.64%--_raw_spin_lock
|      |                                           |
|      |                                            ----

Having a separate per-fs rwsem to protect kernfs inode attributes,
will avoid the above mentioned contention and result in better
performance as can bee seen below:

path_openat
|
|----link_path_walk.part.0.constprop.0
|     |
|     |
|     |--27.06%--inode_permission
|     |          |
|     |           --25.84%--kernfs_iop_permission
|     |                     |
|     |                     |--9.29%--up_read
|     |                     |
|     |                     |--8.19%--down_read
|     |                     |
|     |                      --7.89%--_raw_spin_lock
|     |                                |
|     |                                 ----
|     |
|     |--22.42%--walk_component
|     |          |
|     |           --22.36%--lookup_fast
|     |                     |
|     |                     |--16.07%--__d_lookup
|     |                     |          |
|     |                     |           --16.01%--_raw_spin_lock
|     |                     |                     |
|     |                     |                      ----
|     |                     |
|     |                      --6.28%--kernfs_dop_revalidate
|     |                                |
|     |                                |--3.76%--down_read
|     |                                |
|     |                                 --2.26%--up_read

As can be seen from the above data the overhead due to both
kerfs_iop_permission and kernfs_dop_revalidate have gone down and
this also reduces overall run time of the earlier mentioned loop.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Link: https://lore.kernel.org/r/20230309110932.2889010-2-imran.f.khan@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-29 12:23:45 +02:00
..
9p 9p patches for 6.3 merge window (part 1) 2023-03-01 08:52:49 -08:00
adfs fs: port ->setattr() to pass mnt_idmap 2023-01-19 09:24:02 +01:00
affs for-6.3/dio-2023-02-16 2023-02-20 14:10:36 -08:00
afs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
autofs fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
befs
bfs fs: port inode_init_owner() to mnt_idmap 2023-01-19 09:24:28 +01:00
btrfs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
cachefiles fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
ceph Two small fixes from Xiubo and myself, marked for stable. 2023-03-02 10:48:30 -08:00
cifs cifs: Fix memory leak in direct I/O 2023-03-01 18:18:25 -06:00
coda driver core: class: remove module * from class_create() 2023-03-17 15:16:33 +01:00
configfs fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
cramfs fs/cramfs/inode.c: initialize file_ra_state 2023-03-02 21:54:23 -08:00
crypto fsverity updates for 6.3 2023-02-20 12:33:41 -08:00
debugfs ARM: SoC drivers for 6.3 2023-02-27 10:04:49 -08:00
devpts
dlm Driver core changes for 6.3-rc1 2023-02-24 12:58:55 -08:00
ecryptfs This update includes the following changes: 2023-02-21 18:10:50 -08:00
efivarfs A healthy mix of EFI contributions this time: 2023-02-23 14:41:48 -08:00
efs
erofs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
exfat Description for this pull request: 2023-03-01 08:42:27 -08:00
exportfs fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
ext2 for-6.3/dio-2023-02-16 2023-02-20 14:10:36 -08:00
ext4 Improve performance for ext4 by allowing multiple process to perform 2023-02-28 09:05:47 -08:00
f2fs f2fs-for-6.3-rc1 2023-02-27 16:18:51 -08:00
fat There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
freevxfs There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
fscache fscache: Use clear_and_wake_up_bit() in fscache_create_volume_work() 2023-01-30 12:51:54 +00:00
fuse driver core: class: remove module * from class_create() 2023-03-17 15:16:33 +01:00
gfs2 Driver core changes for 6.3-rc1 2023-02-24 12:58:55 -08:00
hfs There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
hfsplus fs: hfsplus: fix UAF issue in hfsplus_put_super 2023-03-02 21:54:23 -08:00
hostfs This pull request contains the following changes for UML: 2023-03-01 09:13:00 -08:00
hpfs fs: port ->rename() to pass mnt_idmap 2023-01-19 09:24:26 +01:00
hugetlbfs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
iomap - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
isofs - hfs and hfsplus kmap API modernization from Fabio Francesco 2022-10-12 11:00:22 -07:00
jbd2 Improve performance for ext4 by allowing multiple process to perform 2023-02-28 09:05:47 -08:00
jffs2 This pull request contains updates for JFFS2, UBI and UBIFS 2023-03-01 09:06:51 -08:00
jfs Just one simple sanity check 2023-03-01 08:47:19 -08:00
kernfs kernfs: Introduce separate rwsem to protect inode attributes. 2023-03-29 12:23:45 +02:00
ksmbd driver core: class: mark the struct class for sysfs callbacks as constant 2023-03-29 07:54:58 +02:00
lockd sysctl-6.3-rc1 2023-02-23 14:16:56 -08:00
minix Merge branch 'work.minix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2023-02-24 19:01:15 -08:00
netfs iov: Fix netfs_extract_user_to_sg() 2023-03-01 18:18:25 -06:00
nfs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
nfs_common filelock: move file locking definitions to separate header file 2023-01-11 06:52:32 -05:00
nfsd NFSD 6.3 Release Notes 2023-02-22 14:21:40 -08:00
nilfs2 There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
nls
notify RCU pull request for v6.3 2023-02-21 10:45:51 -08:00
ntfs There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
ntfs3 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
ocfs2 ocfs2: fix non-auto defrag path not working issue 2023-02-27 17:00:15 -08:00
omfs fs: port inode_init_owner() to mnt_idmap 2023-01-19 09:24:28 +01:00
openpromfs
orangefs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
overlayfs fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
proc capability: just use a 'u64' instead of a 'u32[2]' array 2023-03-01 10:01:22 -08:00
pstore driver core: class: remove module * from class_create() 2023-03-17 15:16:33 +01:00
qnx4
qnx6 fs/qnx6: delete unnecessary checks before brelse() 2022-09-11 21:55:07 -07:00
quota RCU pull request for v6.3 2023-02-21 10:45:51 -08:00
ramfs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
reiserfs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
romfs mm/nommu: factor out check for NOMMU shared mappings into is_nommu_shared_mapping() 2023-01-18 17:12:56 -08:00
smbfs_common smb3: Replace smb2pdu 1-element arrays with flex-arrays 2023-02-20 17:25:43 -06:00
squashfs revert "squashfs: harden sanity check in squashfs_read_xattr_id_table" 2023-02-03 17:52:25 -08:00
sysfs
sysv Merge branch 'work.sysv' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2023-02-24 19:03:26 -08:00
tracefs fs: port ->mkdir() to pass mnt_idmap 2023-01-19 09:24:26 +01:00
ubifs This pull request contains updates for JFFS2, UBI and UBIFS 2023-03-01 09:06:51 -08:00
udf - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
ufs fs: port inode_init_owner() to mnt_idmap 2023-01-19 09:24:28 +01:00
unicode
vboxsf fs: port ->rename() to pass mnt_idmap 2023-01-19 09:24:26 +01:00
verity fsverity: support verifying data from large folios 2023-01-27 14:46:31 -08:00
xfs New code for 6.3-rc1, part 2: 2023-02-28 16:08:30 -08:00
zonefs zonefs changes for 6.3-rc1 2023-02-22 14:11:54 -08:00
aio.c Merge branch 'mm-hotfixes-stable' into mm-stable 2023-02-10 15:34:48 -08:00
anon_inodes.c
attr.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
bad_inode.c fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
binfmt_elf.c Linux 6.2-rc6 2023-01-31 15:01:20 +01:00
binfmt_elf_fdpic.c elfcore: Add a cprm parameter to elf_core_extra_{phdrs,data_size} 2023-01-05 15:12:12 +00:00
binfmt_elf_test.c
binfmt_flat.c
binfmt_misc.c binfmt_misc: fix shift-out-of-bounds in check_special_flags 2022-12-02 13:57:04 -08:00
binfmt_script.c
buffer.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
char_dev.c chardev: fix error handling in cdev_device_add() 2022-12-02 17:48:59 +01:00
compat_binfmt_elf.c
coredump.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
d_path.c
dax.c fsdax: dax_unshare_iter() should return a valid length 2023-02-03 17:52:24 -08:00
dcache.c tmpfile API change 2022-10-10 19:45:17 -07:00
direct-io.c fs: move sb_init_dio_done_wq out of direct-io.c 2023-01-26 10:30:56 -07:00
drop_caches.c
eventfd.c eventfd: provide a eventfd_signal_mask() helper 2022-11-22 06:07:55 -07:00
eventpoll.c eventpoll: add EPOLL_URING_WAKE poll wakeup flag 2022-11-21 07:45:29 -07:00
exec.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
fcntl.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
fhandle.c do_sys_name_to_handle(): constify path 2022-09-01 17:36:39 -04:00
file.c fs: use acquire ordering in __fget_light() 2022-10-31 15:30:11 -04:00
file_table.c filelock: move file locking definitions to separate header file 2023-01-11 06:52:32 -05:00
filesystems.c
fs-writeback.c mm: convert mem_cgroup_css_from_page() to mem_cgroup_css_from_folio() 2023-02-02 22:33:19 -08:00
fs_context.c
fs_parser.c ext4: journal_path mount options should follow links 2022-12-01 10:46:54 -05:00
fs_pin.c
fs_struct.c
fs_types.c
fsopen.c
init.c fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
inode.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
internal.h for-6.3/dio-2023-02-16 2023-02-20 14:10:36 -08:00
ioctl.c fs: port inode_owner_or_capable() to mnt_idmap 2023-01-19 09:24:29 +01:00
Kconfig fs: build the legacy direct I/O code conditionally 2023-01-26 10:30:56 -07:00
Kconfig.binfmt Xtensa updates for v6.1 2022-10-10 14:21:11 -07:00
kernel_read_file.c
libfs.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
locks.c RCU pull request for v6.3 2023-02-21 10:45:51 -08:00
Makefile for-6.3/dio-2023-02-16 2023-02-20 14:10:36 -08:00
mbcache.c ext4: fix deadlock due to mbcache entry corruption 2022-12-08 21:49:25 -05:00
mnt_idmapping.c fs: move mnt_idmap 2023-01-19 09:24:30 +01:00
mount.h
mpage.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
namei.c NFSD 6.3 Release Notes 2023-02-22 14:21:40 -08:00
namespace.c Merge branch 'work.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2023-02-24 19:20:07 -08:00
no-block.c
nsfs.c nsfs: repair kernel-doc for ns_match() 2023-01-11 15:47:40 -05:00
open.c vfs: avoid duplicating creds in faccessat if possible 2023-02-27 16:39:19 -08:00
pipe.c
pnode.c pnode: terminate at peers of source 2022-12-21 14:45:25 +01:00
pnode.h
posix_acl.c fs.acl.v6.3 2023-02-20 12:14:33 -08:00
proc_namespace.c
read_write.c iov_iter work; most of that is about getting rid of 2022-12-12 18:29:54 -08:00
readdir.c
remap_range.c fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap 2023-01-19 09:24:29 +01:00
select.c
seq_file.c use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
signalfd.c
splice.c 46 fs/cifs (smb3 client) changesets, 37 in fs/cifs and 9 for related helper functions and cleanup outside from Dave Howells and Willy 2023-02-22 17:12:44 -08:00
stack.c
stat.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
statfs.c
super.c for-6.3/dio-2023-02-16 2023-02-20 14:10:36 -08:00
sync.c
sysctls.c
timerfd.c
userfaultfd.c mm: replace vma->vm_flags direct modifications with modifier calls 2023-02-09 16:51:39 -08:00
utimes.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
xattr.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00