Commit graph

80887 commits

Author SHA1 Message Date
Yuezhang Mo
6cb5d1a16a exfat: fix unexpected EOF while reading dir
If the position is not aligned with the dentry size, the return
value of readdir() will be NULL and errno is 0, which means the
end of the directory stream is reached.

If the position is aligned with dentry size, but there is no file
or directory at the position, exfat_readdir() will continue to
get dentry from the next dentry. So the dentry gotten by readdir()
may not be at the position.

After this commit, if the position is not aligned with the dentry
size, round the position up to the dentry size and continue to get
the dentry.

Fixes: ca06197382 ("exfat: add directory operations")
Cc: stable@vger.kernel.org # v5.7+
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2023-02-27 21:14:44 +09:00
Linus Torvalds
498a1cf902 Kbuild updates for v6.3
- Change V=1 option to print both short log and full command log.
 
  - Allow V=1 and V=2 to be combined as V=12.
 
  - Make W=1 detect wrong .gitignore files.
 
  - Tree-wide cleanups for unused command line arguments passed to Clang.
 
  - Stop using -Qunused-arguments with Clang.
 
  - Make scripts/setlocalversion handle only correct release tags instead
    of any arbitrary annotated tag.
 
  - Create Debian and RPM source packages without cleaning the source tree.
 
  - Various cleanups for packaging.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmP7iHoVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGL/cQAK9q5rsNL5a2LgTbm89ORA+UV+ST
 hrAoGo5DkJHUbVH53oPzyLynFBZPvUzLK8yjApjXkyAzy2hXYnj+vbTs0s+JVCFL
 owS4NB0YP+tpHGuy8bGpWI0GMZSMwmspUteqxk86zuH8uQVAhnCaeV1/Cr6Aqj1h
 2jk1FZid3/h7qEkEgu5U8soeyFnV6VhAT6Ie5yfZ2O2RdsSqPUh6vfKrgdyW4RWz
 gito0SOUwvjIDfSmTnIIacUibisPRv2OW29OvmDp1aXj5rMhe3UfOznVE3NR86yl
 ZbWDAIm6KYT8V1ASOoAUR80qent9IPKytThLK9BVEQCT6bsujCZMvhYhhEvO30TF
 Lzsdr+FrES//xag3+hgc63FEied2xxWGQG1cRtzAhfRL9tJ03+mY1omoW6SyKqW/
 Gc9PIcTgQbCIrkeL0HuAI1q3I1vkvHXInJKtGkoHh1J9aJ8v5gQpwGA+DDRUnA+A
 LQSeEbT2Hf3MoF4CqZRnConvfhlMuLI+j5v54YPrhokxXmv7u807kjfwMFTiZ/+m
 CJFlEMf9YRv3pi8g/AYyGAg5ZQigCwzOCRUC5kguFqzZdgnjiI907GEL804lm1Mg
 lpx/HtYPyxwWEd2XyU6/C9AEIl3gm7MBd6b1tD54Tb/VmE+AvjS/O9jFYXZqnAnM
 Llv4BfK/cQKwHb6o
 =HpFZ
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - Change V=1 option to print both short log and full command log

 - Allow V=1 and V=2 to be combined as V=12

 - Make W=1 detect wrong .gitignore files

 - Tree-wide cleanups for unused command line arguments passed to Clang

 - Stop using -Qunused-arguments with Clang

 - Make scripts/setlocalversion handle only correct release tags instead
   of any arbitrary annotated tag

 - Create Debian and RPM source packages without cleaning the source
   tree

 - Various cleanups for packaging

* tag 'kbuild-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (74 commits)
  kbuild: rpm-pkg: remove unneeded KERNELRELEASE from modules/headers_install
  docs: kbuild: remove description of KBUILD_LDS_MODULE
  .gitattributes: use 'dts' diff driver for *.dtso files
  kbuild: deb-pkg: improve the usability of source package
  kbuild: deb-pkg: fix binary-arch and clean in debian/rules
  kbuild: tar-pkg: use tar rules in scripts/Makefile.package
  kbuild: make perf-tar*-src-pkg work without relying on git
  kbuild: deb-pkg: switch over to source format 3.0 (quilt)
  kbuild: deb-pkg: make .orig tarball a hard link if possible
  kbuild: deb-pkg: hide KDEB_SOURCENAME from Makefile
  kbuild: srcrpm-pkg: create source package without cleaning
  kbuild: rpm-pkg: build binary packages from source rpm
  kbuild: deb-pkg: create source package without cleaning
  kbuild: add a tool to list files ignored by git
  Documentation/llvm: add Chimera Linux, Google and Meta datacenters
  setlocalversion: use only the correct release tag for git-describe
  setlocalversion: clean up the construction of version output
  .gitignore: ignore *.cover and *.mbx
  kbuild: remove --include-dir MAKEFLAG from top Makefile
  kbuild: fix trivial typo in comment
  ...
2023-02-26 11:53:25 -08:00
Xiubo Li
e027253c4b ceph: update the time stamps and try to drop the suid/sgid
The fallocate will try to clear the suid/sgid if a unprevileged user
changed the file.

There is no POSIX item requires that we should clear the suid/sgid
in fallocate code path but this is the default behaviour for most of
the filesystems and the VFS layer. And also the same for the write
code path, which have already support it.

And also we need to update the time stamps since the fallocate will
change the file contents.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/58054
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-02-26 20:03:14 +01:00
Zhang Yi
e3645d72f8 ext4: fix incorrect options show of original mount_opt and extend mount_opt2
Current _ext4_show_options() do not distinguish MOPT_2 flag, so it mixed
extend sbi->s_mount_opt2 options with sbi->s_mount_opt, it could lead to
show incorrect options, e.g. show fc_debug_force if we mount with
errors=continue mode and miss it if we set.

  $ mkfs.ext4 /dev/pmem0
  $ mount -o errors=remount-ro /dev/pmem0 /mnt
  $ cat /proc/fs/ext4/pmem0/options | grep fc_debug_force
    #empty
  $ mount -o remount,errors=continue /mnt
  $ cat /proc/fs/ext4/pmem0/options | grep fc_debug_force
    fc_debug_force
  $ mount -o remount,errors=remount-ro,fc_debug_force /mnt
  $ cat /proc/fs/ext4/pmem0/options | grep fc_debug_force
    #empty

Fixes: 995a3ed67f ("ext4: add fast_commit feature and handling for extended mount options")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230129034939.3702550-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-02-25 15:39:08 -05:00
Jan Kara
0813299c58 ext4: Fix possible corruption when moving a directory
When we are renaming a directory to a different directory, we need to
update '..' entry in the moved directory. However nothing prevents moved
directory from being modified and even converted from the inline format
to the normal format. When such race happens the rename code gets
confused and we crash. Fix the problem by locking the moved directory.

CC: stable@vger.kernel.org
Fixes: 32f7f22c0b ("ext4: let ext4_rename handle inline dir")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230126112221.11866-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-02-25 15:39:07 -05:00
Ye Bin
172e344e6f ext4: init error handle resource before init group descriptors
Now, 's_err_report' timer is init after ext4_group_desc_init() when fill
super. Theoretically, ext4_group_desc_init() may access to error handle
as follows:
__ext4_fill_super
  ext4_group_desc_init
    ext4_check_descriptors
      ext4_get_group_desc
        ext4_error
          ext4_handle_error
            ext4_commit_super
              ext4_update_super
                if (!es->s_error_count)
                  mod_timer(&sbi->s_err_report, jiffies + 24*60*60*HZ);
		  --> Accessing Uninitialized Variables
timer_setup(&sbi->s_err_report, print_daily_error_info, 0);

Maybe above issue is just theoretical, as ext4_check_descriptors() didn't
judge 'gpd' which get from ext4_get_group_desc(), if access to error handle
ext4_get_group_desc() will return NULL, then will trigger null-ptr-deref in
ext4_check_descriptors().
However, from the perspective of pure code, it is better to initialize
resource that may need to be used first.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230119013711.86680-1-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-02-25 15:39:07 -05:00
Linus Torvalds
489fa31ea8 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted stuff that didn't fit anywhere else"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  nsfs: repair kernel-doc for ns_match()
  nsfs: add compat ioctl handler
  fs/cramfs: Convert kmap() to kmap_local_data()
2023-02-24 19:27:55 -08:00
Linus Torvalds
3df88c6a17 Merge branch 'work.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ipc namespace update from Al Viro:
 "Rik's patches reducing the amount of synchronize_rcu() triggered by
  ipc namespace destruction.

  I've some pending stuff reducing that on the normal umount side, but
  it's nowhere near ready and Rik's stuff shouldn't be held back due to
  conflicts - I'll just redo the parts of my series that stray into
  ipc/*"

* 'work.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ipc,namespace: batch free ipc_namespace structures
  ipc,namespace: make ipc namespace allocation wait for pending free
2023-02-24 19:20:07 -08:00
Linus Torvalds
d6b9cf417c Merge branch 'work.sysv' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull sysv updates from Al Viro:
 "Fabio's 'switch to kmap_local_page()' patchset (originally after the
  ext2 counterpart, with a lot of cleaning up done to it; as the matter
  of fact, ext2 side is in need of similar cleanups - calling
  conventions there are bloody awful).

  Plus the equivalents of minix stuff..."

* 'work.sysv' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  sysv: fix handling of delete_entry and set_link failures
  fs/sysv: Replace kmap() with kmap_local_page()
  fs/sysv: Use dir_put_page() in sysv_rename()
  fs/sysv: Change the signature of dir_get_page()
  fs/sysv: Use the offset_in_page() helper
  sysv: don't flush page immediately for DIRSYNC directories
2023-02-24 19:03:26 -08:00
Linus Torvalds
397aa6b63f Merge branch 'work.minix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull minix updates from Al Viro:
 "Assorted fixes - mostly Christoph's"

* 'work.minix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  minix_rename(): minix_delete_entry() might fail
  minix: don't flush page immediately for DIRSYNC directories
  minix: fix error handling in minix_set_link
  minix: fix error handling in minix_delete_entry
  minix: move releasing pages into unlink and rename
  minix: make minix_new_inode() return error as ERR_PTR(-E...)
2023-02-24 19:01:15 -08:00
Linus Torvalds
a93e884edf Driver core changes for 6.3-rc1
Here is the large set of driver core changes for 6.3-rc1.
 
 There's a lot of changes this development cycle, most of the work falls
 into two different categories:
   - fw_devlink fixes and updates.  This has gone through numerous review
     cycles and lots of review and testing by lots of different devices.
     Hopefully all should be good now, and Saravana will be keeping a
     watch for any potential regression on odd embedded systems.
   - driver core changes to work to make struct bus_type able to be moved
     into read-only memory (i.e. const)  The recent work with Rust has
     pointed out a number of areas in the driver core where we are
     passing around and working with structures that really do not have
     to be dynamic at all, and they should be able to be read-only making
     things safer overall.  This is the contuation of that work (started
     last release with kobject changes) in moving struct bus_type to be
     constant.  We didn't quite make it for this release, but the
     remaining patches will be finished up for the release after this
     one, but the groundwork has been laid for this effort.
 
 Other than that we have in here:
   - debugfs memory leak fixes in some subsystems
   - error path cleanups and fixes for some never-able-to-be-hit
     codepaths.
   - cacheinfo rework and fixes
   - Other tiny fixes, full details are in the shortlog
 
 All of these have been in linux-next for a while with no reported
 problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCY/ipdg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynL3gCgwzbcWu0So3piZyLiJKxsVo9C2EsAn3sZ9gN6
 6oeFOjD3JDju3cQsfGgd
 =Su6W
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the large set of driver core changes for 6.3-rc1.

  There's a lot of changes this development cycle, most of the work
  falls into two different categories:

   - fw_devlink fixes and updates. This has gone through numerous review
     cycles and lots of review and testing by lots of different devices.
     Hopefully all should be good now, and Saravana will be keeping a
     watch for any potential regression on odd embedded systems.

   - driver core changes to work to make struct bus_type able to be
     moved into read-only memory (i.e. const) The recent work with Rust
     has pointed out a number of areas in the driver core where we are
     passing around and working with structures that really do not have
     to be dynamic at all, and they should be able to be read-only
     making things safer overall. This is the contuation of that work
     (started last release with kobject changes) in moving struct
     bus_type to be constant. We didn't quite make it for this release,
     but the remaining patches will be finished up for the release after
     this one, but the groundwork has been laid for this effort.

  Other than that we have in here:

   - debugfs memory leak fixes in some subsystems

   - error path cleanups and fixes for some never-able-to-be-hit
     codepaths.

   - cacheinfo rework and fixes

   - Other tiny fixes, full details are in the shortlog

  All of these have been in linux-next for a while with no reported
  problems"

[ Geert Uytterhoeven points out that that last sentence isn't true, and
  that there's a pending report that has a fix that is queued up - Linus ]

* tag 'driver-core-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (124 commits)
  debugfs: drop inline constant formatting for ERR_PTR(-ERROR)
  OPP: fix error checking in opp_migrate_dentry()
  debugfs: update comment of debugfs_rename()
  i3c: fix device.h kernel-doc warnings
  dma-mapping: no need to pass a bus_type into get_arch_dma_ops()
  driver core: class: move EXPORT_SYMBOL_GPL() lines to the correct place
  Revert "driver core: add error handling for devtmpfs_create_node()"
  Revert "devtmpfs: add debug info to handle()"
  Revert "devtmpfs: remove return value of devtmpfs_delete_node()"
  driver core: cpu: don't hand-override the uevent bus_type callback.
  devtmpfs: remove return value of devtmpfs_delete_node()
  devtmpfs: add debug info to handle()
  driver core: add error handling for devtmpfs_create_node()
  driver core: bus: update my copyright notice
  driver core: bus: add bus_get_dev_root() function
  driver core: bus: constify bus_unregister()
  driver core: bus: constify some internal functions
  driver core: bus: constify bus_get_kset()
  driver core: bus: constify bus_register/unregister_notifier()
  driver core: remove private pointer from struct bus_type
  ...
2023-02-24 12:58:55 -08:00
David Howells
ab7362d04d cifs: Fix cifs_writepages_region()
Fix the cifs_writepages_region() to just jump over members of the batch
that have been cleaned up rather than counting them as skipped.

Unlike the other "skip_write" cases, this situation happens even for
WB_SYNC_ALL, simply because the page has either been cleaned by somebody
else, or was truncated.

So in this case we're not "skipping" the write, we simply no longer need
any write at all, so it's very different from the other skip_write cases.

And we definitely shouldn't stop writing the rest just because of too
many of these cases (or because we want to be rescheduled).

Fixes: 3822a7c409 ("Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/lkml/2213409.1677249075@warthog.procyon.org.uk/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-02-24 11:01:58 -08:00
Eric Van Hensbergen
89c58cb395
fs/9p: fix error reporting in v9fs_dir_release
Checking the p9_fid_put value allows us to pass back errors
involved if we end up clunking the fid as part of dir_release.

This can help with more graceful response to errors in writeback
among other things.

Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
Reviewed-by: Dominique Martinet <asmadeus@codewreck.org>
2023-02-24 13:42:40 +00:00
Linus Torvalds
d2980d8d82 There is no particular theme here - mainly quick hits all over the tree.
Most notable is a set of zlib changes from Mikhail Zaslonko which enhances
 and fixes zlib's use of S390 hardware support: "lib/zlib: Set of s390
 DFLTCC related patches for kernel zlib".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/QC4QAKCRDdBJ7gKXxA
 jtKdAQCbDCBdY8H45d1fONzQW2UDqCPnOi77MpVUxGL33r+1SAEA807C7rvDEmlf
 yP1Ft+722fFU5jogVU8ZFh+vapv2/gI=
 =Q9YK
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2023-02-20-15-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "There is no particular theme here - mainly quick hits all over the
  tree.

  Most notable is a set of zlib changes from Mikhail Zaslonko which
  enhances and fixes zlib's use of S390 hardware support: 'lib/zlib: Set
  of s390 DFLTCC related patches for kernel zlib'"

* tag 'mm-nonmm-stable-2023-02-20-15-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (55 commits)
  Update CREDITS file entry for Jesper Juhl
  sparc: allow PM configs for sparc32 COMPILE_TEST
  hung_task: print message when hung_task_warnings gets down to zero.
  arch/Kconfig: fix indentation
  scripts/tags.sh: fix the Kconfig tags generation when using latest ctags
  nilfs2: prevent WARNING in nilfs_dat_commit_end()
  lib/zlib: remove redundation assignement of avail_in dfltcc_gdht()
  lib/Kconfig.debug: do not enable DEBUG_PREEMPT by default
  lib/zlib: DFLTCC always switch to software inflate for Z_PACKET_FLUSH option
  lib/zlib: DFLTCC support inflate with small window
  lib/zlib: Split deflate and inflate states for DFLTCC
  lib/zlib: DFLTCC not writing header bits when avail_out == 0
  lib/zlib: fix DFLTCC ignoring flush modes when avail_in == 0
  lib/zlib: fix DFLTCC not flushing EOBS when creating raw streams
  lib/zlib: implement switching between DFLTCC and software
  lib/zlib: adjust offset calculation for dfltcc_state
  nilfs2: replace WARN_ONs for invalid DAT metadata block requests
  scripts/spelling.txt: add "exsits" pattern and fix typo instances
  fs: gracefully handle ->get_block not mapping bh in __mpage_writepage
  cramfs: Kconfig: fix spelling & punctuation
  ...
2023-02-23 17:55:40 -08:00
Linus Torvalds
3822a7c409 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X bit.
 
 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.
 
 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes
 
 - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
   does perform some memcg maintenance and cleanup work.
 
 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".  These filters provide users
   with finer-grained control over DAMOS's actions.  SeongJae has also done
   some DAMON cleanup work.
 
 - Kairui Song adds a series ("Clean up and fixes for swap").
 
 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".
 
 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series.  It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.
 
 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".
 
 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".
 
 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".
 
 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".
 
 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series "mm:
   support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
   PTEs".
 
 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".
 
 - Sergey Senozhatsky has improved zsmalloc's memory utilization with his
   series "zsmalloc: make zspage chain size configurable".
 
 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.  The previous BPF-based approach had
   shortcomings.  See "mm: In-kernel support for memory-deny-write-execute
   (MDWE)".
 
 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
 
 - T.J.  Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".
 
 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a per-node
   basis.  See the series "Introduce per NUMA node memory error
   statistics".
 
 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage during
   compaction".
 
 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".
 
 - Christoph Hellwig has removed block_device_operations.rw_page() in ths
   series "remove ->rw_page".
 
 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".
 
 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier functions".
 
 - Some pagemap cleanup and generalization work in Mike Rapoport's series
   "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
   "fixups for generic implementation of pfn_valid()"
 
 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
 
 - Jason Gunthorpe rationalized the GUP system's interface to the rest of
   the kernel in the series "Simplify the external interface for GUP".
 
 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface.  To support this, we'll temporarily be
   printing warnings when people use the debugfs interface.  See the series
   "mm/damon: deprecate DAMON debugfs interface".
 
 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.
 
 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".
 
 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
 jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
 DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
 =MlGs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
   F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X
   bit.

 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.

 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes

 - Johannes Weiner has a series ("mm: push down lock_page_memcg()")
   which does perform some memcg maintenance and cleanup work.

 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".

   These filters provide users with finer-grained control over DAMOS's
   actions. SeongJae has also done some DAMON cleanup work.

 - Kairui Song adds a series ("Clean up and fixes for swap").

 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".

 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.

 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".

 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".

 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".

 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".

 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series
   "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
   swap PTEs".

 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".

 - Sergey Senozhatsky has improved zsmalloc's memory utilization with
   his series "zsmalloc: make zspage chain size configurable".

 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.

   The previous BPF-based approach had shortcomings. See "mm: In-kernel
   support for memory-deny-write-execute (MDWE)".

 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".

 - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".

 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a
   per-node basis. See the series "Introduce per NUMA node memory error
   statistics".

 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage
   during compaction".

 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".

 - Christoph Hellwig has removed block_device_operations.rw_page() in
   ths series "remove ->rw_page".

 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".

 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier
   functions".

 - Some pagemap cleanup and generalization work in Mike Rapoport's
   series "mm, arch: add generic implementation of pfn_valid() for
   FLATMEM" and "fixups for generic implementation of pfn_valid()"

 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".

 - Jason Gunthorpe rationalized the GUP system's interface to the rest
   of the kernel in the series "Simplify the external interface for
   GUP".

 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface. To support this, we'll temporarily be
   printing warnings when people use the debugfs interface. See the
   series "mm/damon: deprecate DAMON debugfs interface".

 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.

 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".

 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".

* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
  include/linux/migrate.h: remove unneeded externs
  mm/memory_hotplug: cleanup return value handing in do_migrate_range()
  mm/uffd: fix comment in handling pte markers
  mm: change to return bool for isolate_movable_page()
  mm: hugetlb: change to return bool for isolate_hugetlb()
  mm: change to return bool for isolate_lru_page()
  mm: change to return bool for folio_isolate_lru()
  objtool: add UACCESS exceptions for __tsan_volatile_read/write
  kmsan: disable ftrace in kmsan core code
  kasan: mark addr_has_metadata __always_inline
  mm: memcontrol: rename memcg_kmem_enabled()
  sh: initialize max_mapnr
  m68k/nommu: add missing definition of ARCH_PFN_OFFSET
  mm: percpu: fix incorrect size in pcpu_obj_full_size()
  maple_tree: reduce stack usage with gcc-9 and earlier
  mm: page_alloc: call panic() when memoryless node allocation fails
  mm: multi-gen LRU: avoid futile retries
  migrate_pages: move THP/hugetlb migration support check to simplify code
  migrate_pages: batch flushing TLB
  migrate_pages: share more code between _unmap and _move
  ...
2023-02-23 17:09:35 -08:00
Linus Torvalds
06e1a81c48 A healthy mix of EFI contributions this time:
- Performance tweaks for efifb earlycon by Andy
 
 - Preparatory refactoring and cleanup work in the efivar layer by Johan,
   which is needed to accommodate the Snapdragon arm64 laptops that
   expose their EFI variable store via a TEE secure world API.
 
 - Enhancements to the EFI memory map handling so that Xen dom0 can
   safely access EFI configuration tables (Demi Marie)
 
 - Wire up the newly introduced IBT/BTI flag in the EFI memory attributes
   table, so that firmware that is generated with ENDBR/BTI landing pads
   will be mapped with enforcement enabled.
 
 - Clean up how we check and print the EFI revision exposed by the
   firmware.
 
 - Incorporate EFI memory attributes protocol definition contributed by
   Evgeniy and wire it up in the EFI zboot code. This ensures that these
   images can execute under new and stricter rules regarding the default
   memory permissions for EFI page allocations. (More work is in progress
   here)
 
 - CPER header cleanup by Dan Williams
 
 - Use a raw spinlock to protect the EFI runtime services stack on arm64
   to ensure the correct semantics under -rt. (Pierre)
 
 - EFI framebuffer quirk for Lenovo Ideapad by Darrell.
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE+9lifEBpyUIVN1cpw08iOZLZjyQFAmPzuwsACgkQw08iOZLZ
 jyS7dwwAm95DlDxFIQi4FmTm2mqJws9PyDrkfaAK1CoyqCgeOLQT2FkVolgr8jne
 pwpwCTXtYP8y0BZvdQEIjpAq/BHKaD3GJSPfl7lo+pnUu68PpsFWaV6EdT33KKfj
 QeF0MnUvrqUeTFI77D+S0ZW2zxdo9eCcahF3TPA52/bEiiDHWBF8Qm9VHeQGklik
 zoXA15ft3mgITybgjEA0ncGrVZiBMZrYoMvbdkeoedfw02GN/eaQn8d2iHBtTDEh
 3XNlo7ONX0v50cjt0yvwFEA0AKo0o7R1cj+ziKH/bc4KjzIiCbINhy7blroSq+5K
 YMlnPHuj8Nhv3I+MBdmn/nxRCQeQsE4RfRru04hfNfdcqjAuqwcBvRXvVnjWKZHl
 CmUYs+p/oqxrQ4BjiHfw0JKbXRsgbFI6o3FeeLH9kzI9IDUPpqu3Ma814FVok9Ai
 zbOCrJf5tEtg5tIavcUESEMBuHjEafqzh8c7j7AAqbaNjlihsqosDy9aYoarEi5M
 f/tLec86
 =+pOz
 -----END PGP SIGNATURE-----

Merge tag 'efi-next-for-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI updates from Ard Biesheuvel:
 "A healthy mix of EFI contributions this time:

   - Performance tweaks for efifb earlycon (Andy)

   - Preparatory refactoring and cleanup work in the efivar layer, which
     is needed to accommodate the Snapdragon arm64 laptops that expose
     their EFI variable store via a TEE secure world API (Johan)

   - Enhancements to the EFI memory map handling so that Xen dom0 can
     safely access EFI configuration tables (Demi Marie)

   - Wire up the newly introduced IBT/BTI flag in the EFI memory
     attributes table, so that firmware that is generated with ENDBR/BTI
     landing pads will be mapped with enforcement enabled

   - Clean up how we check and print the EFI revision exposed by the
     firmware

   - Incorporate EFI memory attributes protocol definition and wire it
     up in the EFI zboot code (Evgeniy)

     This ensures that these images can execute under new and stricter
     rules regarding the default memory permissions for EFI page
     allocations (More work is in progress here)

   - CPER header cleanup (Dan Williams)

   - Use a raw spinlock to protect the EFI runtime services stack on
     arm64 to ensure the correct semantics under -rt (Pierre)

   - EFI framebuffer quirk for Lenovo Ideapad (Darrell)"

* tag 'efi-next-for-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: (24 commits)
  firmware/efi sysfb_efi: Add quirk for Lenovo IdeaPad Duet 3
  arm64: efi: Make efi_rt_lock a raw_spinlock
  efi: Add mixed-mode thunk recipe for GetMemoryAttributes
  efi: x86: Wire up IBT annotation in memory attributes table
  efi: arm64: Wire up BTI annotation in memory attributes table
  efi: Discover BTI support in runtime services regions
  efi/cper, cxl: Remove cxl_err.h
  efi: Use standard format for printing the EFI revision
  efi: Drop minimum EFI version check at boot
  efi: zboot: Use EFI protocol to remap code/data with the right attributes
  efi/libstub: Add memory attribute protocol definitions
  efi: efivars: prevent double registration
  efi: verify that variable services are supported
  efivarfs: always register filesystem
  efi: efivars: add efivars printk prefix
  efi: Warn if trying to reserve memory under Xen
  efi: Actually enable the ESRT under Xen
  efi: Apply allowlist to EFI configuration tables when running under Xen
  efi: xen: Implement memory descriptor lookup based on hypercall
  efi: memmap: Disregard bogus entries instead of returning them
  ...
2023-02-23 14:41:48 -08:00
Eric Van Hensbergen
344504e912
fs/9p: Expand setup of writeback cache to all levels
If cache is enabled, make sure we are putting the right things
in place (mainly impacts mmap).  This also sets us up for more
cache levels.

Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
Reviewed-by: Dominique Martinet <asmadeus@codewreck.org>
2023-02-23 22:39:36 +00:00
Linus Torvalds
fcc77d7c8e sysctl-6.3-rc1
Just one fix which just came in, this just hit linux-next just yesterday
 with a success build report. But since its a fix and reviewed I think its
 good to take in.
 
 Sadly the eager beavers willing to help with the sysctl moves have slowed.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmP2ndUSHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinyLYQAM8HBVhPcXi4gp2DHbUY6Bd6MeXsQ4Mk
 EulLxYZXJFIfgXcKYN9IMNcPtdYJ4xnGWLNzPoixiMb61KY6Cjw+l62AtNBiwvxw
 /GliVtrwz0EU1Bw7vilm4UBn/BUH7vXk7HMRExBwMXLwY7y11TYWMtb6+V+7Zzf1
 VImyPZ1MlfOYlvgRUdaNUnuvHLE19x2pdVG/oRDLez9gs38FYEAiBd34adUKDf2U
 FxZBQPd+4xaEht2sdTp0ws52YoHHx3k9i8mSiRwQqsiydKMk32iD59hXxeA4r3Oc
 lzf10VgN7EdSDCTzdfDYhpIxq04RuA1s1gtXU/eePMqfJZenR0FbFHhcHmIhon4j
 D5EuON5eMISzzT9h9DSI5k7kXPDv0iIZPb/vunAfklC1UPVv8uZ2/RV6M1jNR1Bn
 S5WYURzdlwpS+4+uwl479061W1YnnhmtwVAMNw+AzfTxv9eQQ3oa2X70L0xirfDx
 xJxZS8lzc1rLgL2y3lO2/8W1yI6hA0xhME+pxuopVTMWcXt6aG4efq3tIrxEx3Kf
 0HukI1KbN6O/Rmf7hqprHGsfIq4EP4IEOk2XIvMSOrDyf2fOXVGCQQQ0eMOw4ejR
 WTtHp1TW2byA8lSOp4K+rMZamt9Mokci0tOEc/52OXFA2jDtq4KZNHvHAOdZv4/Y
 kLMflEpOmU++
 =Txc3
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull sysctl update from Luis Chamberlain:
 "Just one fix which just came in.

  Sadly the eager beavers willing to help with the sysctl moves have
  slowed"

* tag 'sysctl-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  sysctl: fix proc_dobool() usability
2023-02-23 14:16:56 -08:00
Linus Torvalds
307e14c039 46 fs/cifs (smb3 client) changesets, 37 in fs/cifs and 9 for related helper functions and cleanup outside from Dave Howells and Willy
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmP2kaAACgkQiiy9cAdy
 T1Eergv9FHVs7hS0anJF0xgRghR4+g0m5UUo08iJazgJdDgcS5JY+ZasIpYpEsG3
 QmsIT33XVYZypXoOzjMSsPlwo6esTCJQScVLz85e4ebedCbCBDks+wVQcbfTzD5/
 KrwmUoTBLU0L/ppFhqRk9k53nrSf1SXCWPthjdfWa3mTHdIVM4kQJruTWwUDiJXp
 mdYwTx6FnTNer3QWetNzYOwdUgLu3rk0zLcBwQNCo6g5LOpA44iFfEAO4zeiOuZT
 LMDPbDj0nWQyWPLLdcbtsn2laYyEBDBLZevLirSaqPQ/KCtGcw0mBt6dCAzg8/CM
 ONqHHxdEpvPON8Sxujcn4CxpXhl0nCLwwtKtWU4rt7IevI9U+PynNl57TtJJ16/s
 b3XD2QVbFjlcdAMTmArvqnogdzoC3mZu1R1IRs+jukhLAOqZiLN6o/E2HAllt47i
 krzXeXIzQr10w9fnJ7LtIc/7IUFgtUfrOkg4TKyNcnRVHQaSSxv+JLRgqMPOr/M0
 I7zt0G0j
 =4hIT
 -----END PGP SIGNATURE-----

Merge tag '6.3-rc-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs client updates from Steve French:
 "The largest subset of this is from David Howells et al: making the
  cifs/smb3 driver pass iov_iters down to the lowest layers, directly to
  the network transport rather than passing lists of pages around,
  helping multiple areas:

   - Pin user pages, thereby fixing the race between concurrent DIO read
     and fork, where the pages containing the DIO read buffer may end up
     belonging to the child process and not the parent - with the result
     that the parent might not see the retrieved data.

   - cifs shouldn't take refs on pages extracted from non-user-backed
     iterators (eg. KVEC). With these changes, cifs will apply the
     appropriate cleanup.

   - Making it easier to transition to using folios in cifs rather than
     pages by dealing with them through BVEC and XARRAY iterators.

   - Allowing cifs to use the new splice function

  The remainder are:

   - fixes for stable, including various fixes for uninitialized memory,
     wrong length field causing mount issue to very old servers,
     important directory lease fixes and reconnect fixes

   - cleanups (unused code removal, change one element array usage, and
     a change form strtobool to kstrtobool, and Kconfig cleanups)

   - SMBDIRECT (RDMA) fixes including iov_iter integration and UAF fixes

   - reconnect fixes

   - multichannel fixes, including improving channel allocation (to
     least used channel)

   - remove the last use of lock_page_killable by moving to
     folio_lock_killable"

* tag '6.3-rc-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: (46 commits)
  update internal module version number for cifs.ko
  cifs: update ip_addr for ses only for primary chan setup
  cifs: use tcon allocation functions even for dummy tcon
  cifs: use the least loaded channel for sending requests
  cifs: DIO to/from KVEC-type iterators should now work
  cifs: Remove unused code
  cifs: Build the RDMA SGE list directly from an iterator
  cifs: Change the I/O paths to use an iterator rather than a page list
  cifs: Add a function to read into an iter from a socket
  cifs: Add some helper functions
  cifs: Add a function to Hash the contents of an iterator
  cifs: Add a function to build an RDMA SGE list from an iterator
  netfs: Add a function to extract an iterator into a scatterlist
  netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator
  cifs: Implement splice_read to pass down ITER_BVEC not ITER_PIPE
  splice: Export filemap/direct_splice_read()
  iov_iter: Add a function to extract a page list from an iterator
  iov_iter: Define flags to qualify page extraction.
  splice: Add a func to do a splice from an O_DIRECT file without ITER_PIPE
  splice: Add a func to do a splice from a buffered file without ITER_PIPE
  ...
2023-02-22 17:12:44 -08:00
Linus Torvalds
d8ca6dbb8d NFS Client Updates for Linux 6.3
New Features:
   * Convert the read and write paths to use folios
 
 Bugfixes and Cleanups:
   * Fix tracepoint state manager flag printing
   * Fix disabling swap files
   * Fix NFSv4 client identifier sysfs path in the documentation
   * Don't clear NFS_CAP_COPY if server returns NFS4ERR_OFFLOAD_DENIED
   * Treat GETDEVICEINFO errors as a layout failure
   * Replace kmap_atomic() calls with kmap_local_page()
   * Constify sunrpc sysfs kobj_type structures
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmP2lSQACgkQ18tUv7Cl
 QOuaPw//TNLuEMiiMtM6EXFPc2p2tNMfM+EeMjYKZkS5WnvszVOZRzQfXz4dvvif
 R2BIHT+C6AWkPOxnbuhIk86acy/wRuAPD9OMyp+3PNa4oEqjyIk5QaTvgYeeRBNe
 H7haAqCiEBx/g8F9DladWLAwpilTSPpupdmiXwTS7Q7wB6NkKqc1eAJ+BTuRYAsm
 19p2nGTzenWrut4HMOOmuAVrA2OcWtzgYzdJlY19lWRmzfCvX1gh3i2JxGutmnl3
 tduviVBdu5QGnTKHJiP853kzs3VwEHIFSO+5z137JWKTm3IgPxW/u3zf+ijmw22t
 vjbtFAhb4+u5juchvVKyIX4lClPSKZincQ6dn9soOm7qJcxYWNm40DzxAF21GUAq
 d4tp+zBoe9pfKxZrylp6YxIchLQYhU+dL0hhqbKzfOAFOg28k1JsJubl/wRvWKVe
 LGbvFOrUYo8arPOKfxBWMf4HCouu/vGq/qng+U/Ppjw3h8gD32aniyP3qN9Fm1JA
 rFAMKTnD4I3r3x1E4HX3icUMwil36zBr9cPglQBYD5DW54ConEW8e6hvQmjTdQHP
 EgN9+PSYzYXuKR/Fij7BYOymUD9grT1+5hVaOPzlM4SY3jMy2novcQGoGiLVOd0t
 PGQ5047qbESFD7VkW4GBAYftrSMLiU+l5KL3aN1JRtprxu6NaYA=
 =39ue
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.3-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "New Features:

   - Convert the read and write paths to use folios

  Bugfixes and Cleanups:

   - Fix tracepoint state manager flag printing

   - Fix disabling swap files

   - Fix NFSv4 client identifier sysfs path in the documentation

   - Don't clear NFS_CAP_COPY if server returns NFS4ERR_OFFLOAD_DENIED

   - Treat GETDEVICEINFO errors as a layout failure

   - Replace kmap_atomic() calls with kmap_local_page()

   - Constify sunrpc sysfs kobj_type structures"

* tag 'nfs-for-6.3-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (25 commits)
  fs/nfs: Replace kmap_atomic() with kmap_local_page() in dir.c
  pNFS/filelayout: treat GETDEVICEINFO errors as layout failure
  Documentation: Fix sysfs path for the NFSv4 client identifier
  nfs42: do not fail with EIO if ssc returns NFS4ERR_OFFLOAD_DENIED
  NFS: fix disabling of swap
  SUNRPC: make kobj_type structures constant
  nfs4trace: fix state manager flag printing
  NFS: Remove unnecessary check in nfs_read_folio()
  NFS: Improve tracing of nfs_wb_folio()
  NFS: Enable tracing of nfs_invalidate_folio() and nfs_launder_folio()
  NFS: fix up nfs_release_folio() to try to release the page
  NFS: Clean up O_DIRECT request allocation
  NFS: Fix up nfs_vm_page_mkwrite() for folios
  NFS: Convert nfs_write_begin/end to use folios
  NFS: Remove unused function nfs_wb_page()
  NFS: Convert buffered writes to use folios
  NFS: Convert the function nfs_wb_page() to use folios
  NFS: Convert buffered reads to use folios
  NFS: Add a helper nfs_wb_folio()
  NFS: Convert the remaining pagelist helper functions to support folios
  ...
2023-02-22 14:47:20 -08:00
Linus Torvalds
9fc2f99030 NFSD 6.3 Release Notes
Two significant security enhancements are part of this release:
 
 * NFSD's RPC header encoding and decoding, including RPCSEC GSS
   and gssproxy header parsing, has been overhauled to make it
   more memory-safe.
 
 * Support for Kerberos AES-SHA2-based encryption types has been
   added for both the NFS client and server. This provides a clean
   path for deprecating and removing insecure encryption types
   based on DES and SHA-1. AES-SHA2 is also FIPS-140 compliant, so
   that NFS with Kerberos may now be used on systems with fips
   enabled.
 
 In addition to these, NFSD is now able to handle crossing into an
 auto-mounted mount point on an exported NFS mount. A number of
 fixes have been made to NFSD's server-side copy implementation.
 
 RPC metrics have been converted to per-CPU variables. This helps
 reduce unnecessary cross-CPU and cross-node memory bus traffic,
 and significantly reduces noise when KCSAN is enabled.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmPzgiYACgkQM2qzM29m
 f5dB2A//eqjpj+FgAN+UjygrwMC4ahAsPX3Sc3FG8/lTAiao3NFVFY2gxAiCPyVE
 CFk+tUyfL23oXvbyfIBe3LhxSBOf621xU6up2OzqAzJqh1Q9iUWB6as3I14to8ZU
 sWpxXo5ofwk1hzkbrvOAVkyfY0emwsr00iBeWMawkpBe8FZEQA31OYj3/xHr6bBI
 zEVlZPBZAZlp0DZ74tb+bBLs/EOnqKj+XLWcogCH13JB3sn2umF6cQNkYgsxvHGa
 TNQi4LEdzWZGme242LfBRiGGwm1xuVIjlAhYV/R1wIjaknE3QBzqfXc6lJx74WII
 HaqpRJGrKqdo7B+1gaXCl/AMS7YluED1CBrxuej0wBG7l2JEB7m2MFMQ4LTQjgsn
 nrr3P70DgbB4LuPCPyUS7dtsMmUXabIqP7niiCR4T1toH6lBmHAgEi4cFmkzg7Cd
 EoFzn888mtDpfx4fghcsRWS5oKXEzbPJfu5+IZOD63+UB+NGpi0Xo2s23sJPK8vz
 kqK/X63JYOUxWUvK0zkj/c/wW1cLqIaBwnSKbShou5/BL+cZVI+uJYrnEesgpoB2
 5fh/cZv3hdcoOPO7OfcjCLQYy4J6RCWajptnk/hcS3lMvBTBrnq697iAqCVURDKU
 Xfmlf7XbBwje+sk4eHgqVGEqqVjrEmoqbmA2OS44WSS5LDvxXdI=
 =ZG/7
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "Two significant security enhancements are part of this release:

   - NFSD's RPC header encoding and decoding, including RPCSEC GSS and
     gssproxy header parsing, has been overhauled to make it more
     memory-safe.

   - Support for Kerberos AES-SHA2-based encryption types has been added
     for both the NFS client and server. This provides a clean path for
     deprecating and removing insecure encryption types based on DES and
     SHA-1. AES-SHA2 is also FIPS-140 compliant, so that NFS with
     Kerberos may now be used on systems with fips enabled.

  In addition to these, NFSD is now able to handle crossing into an
  auto-mounted mount point on an exported NFS mount. A number of fixes
  have been made to NFSD's server-side copy implementation.

  RPC metrics have been converted to per-CPU variables. This helps
  reduce unnecessary cross-CPU and cross-node memory bus traffic, and
  significantly reduces noise when KCSAN is enabled"

* tag 'nfsd-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (121 commits)
  NFSD: Clean up nfsd_symlink()
  NFSD: copy the whole verifier in nfsd_copy_write_verifier
  nfsd: don't fsync nfsd_files on last close
  SUNRPC: Fix occasional warning when destroying gss_krb5_enctypes
  nfsd: fix courtesy client with deny mode handling in nfs4_upgrade_open
  NFSD: fix problems with cleanup on errors in nfsd4_copy
  nfsd: fix race to check ls_layouts
  nfsd: don't hand out delegation on setuid files being opened for write
  SUNRPC: Remove ->xpo_secure_port()
  SUNRPC: Clean up the svc_xprt_flags() macro
  nfsd: remove fs/nfsd/fault_inject.c
  NFSD: fix leaked reference count of nfsd4_ssc_umount_item
  nfsd: clean up potential nfsd_file refcount leaks in COPY codepath
  nfsd: zero out pointers after putting nfsd_files on COPY setup error
  SUNRPC: Fix whitespace damage in svcauth_unix.c
  nfsd: eliminate __nfs4_get_fd
  nfsd: add some kerneldoc comments for stateid preprocessing functions
  nfsd: eliminate find_deleg_file_locked
  nfsd: don't take nfsd4_copy ref for OP_OFFLOAD_STATUS
  SUNRPC: Add encryption self-tests
  ...
2023-02-22 14:21:40 -08:00
Linus Torvalds
25ac8c12ff ten smb3 server fixes, including three for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmP1kvMACgkQiiy9cAdy
 T1GwlQwAvFBHn8oUwDTj2bIlZRKsMMWGP61qdtgrsKPXcg2hZzw2jDmM6nk12MrS
 AqYCnT1YdjKzDnEgBhHnaPLe+q2M27feJ5Y+FTqv1Hb9VwNY9OJT4hH3II3cuHfm
 zLcnAAdVmtIh7FRYKrwGxqyESNik9N3g4evXcafKKq99sxaGnrnbTWQmOvTqM7kF
 2AeBh6AHlHkoJNfd3BjlB8158Y/7Sfkul1AA7xXm6GWhgnWt2rmWNNqAEpTPXWyj
 xkcb3Jy0wez1D8IXRqyrZdqdFXIfri6mvJoun3zAll6f29T/xxz0CYZUDA881q6E
 EBANy/mLeMMKC4g3c2LPdEUB9ijwA9N+w8E7BSefCIMtN46cOQBTxYrbDfbCEs/m
 ElGo2AcOM25FrkMdl9XSX0Phru3OAOrOivVLCBqbGbHPa0lgBhIZOdy38HCK3ZGm
 0eH6ttyE0p0Mpt1+kex1e6t8XfdyFkHk5gjPElWbsrhSQYK1AAWuBhETWZ79CSlx
 fCou/JZ1
 =HBuN
 -----END PGP SIGNATURE-----

Merge tag '6.3-rc-ksmbd-fixes' of git://git.samba.org/ksmbd

Pull ksmbd server updates from Steve French:

 - Fix for memory leak

 - Two important fixes for frame length checks (which are also now
   stricter)

 - four minor cleanup fixes

 - Fix to clarify ksmbd/Kconfig to indent properl

 - Conversion of the channel list and rpc handle list to xarrays

* tag '6.3-rc-ksmbd-fixes' of git://git.samba.org/ksmbd:
  ksmbd: fix possible memory leak in smb2_lock()
  ksmbd: do not allow the actual frame length to be smaller than the rfc1002 length
  ksmbd: fix wrong data area length for smb2 lock request
  ksmbd: Fix parameter name and comment mismatch
  ksmbd: Fix spelling mistake "excceed" -> "exceeded"
  ksmbd: update Kconfig to note Kerberos support and fix indentation
  ksmbd: Remove duplicated codes
  ksmbd: fix typo, syncronous->synchronous
  ksmbd: Implements sess->rpc_handle_list as xarray
  ksmbd: Implements sess->ksmbd_chann_list as xarray
2023-02-22 14:17:27 -08:00
Linus Torvalds
232dd59906 zonefs changes for 6.3-rc1
* Reorganize zonefs code to split file related operations to a new
     fs/zonefs/file.c file. From me.
 
   * Modify zonefs to use dynamically allocated inodes and dentries
     (using the inode and dentry caches) instead of statically allocating
     everything on mount. This saves a significant amount of memory for
     very large zoned block devices with 10s of thousands of zones. From
     me.
 
   * Make zonefs_sb_ktype a const struct kobj_type, from Thomas.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCY/VokQAKCRDdoc3SxdoY
 dgCVAP427Q2wBrphU1//iW6R/7OZMz24nwmJcYPBAVLTngQO5gD+IVX+ru4xhD4Q
 W1VKFN9QIDbjEiEc+Lmt9iQuDz+roAI=
 =e/FD
 -----END PGP SIGNATURE-----

Merge tag 'zonefs-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull zonefs updates from Damien Le Moal:

 - Reorganize zonefs code to split file related operations to a new
   fs/zonefs/file.c file (me)

 - Modify zonefs to use dynamically allocated inodes and dentries (using
   the inode and dentry caches) instead of statically allocating
   everything on mount. This saves a significant amount of memory for
   very large zoned block devices with 10s of thousands of zones (me)

 - Make zonefs_sb_ktype a const struct kobj_type (Thomas)

* tag 'zonefs-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: make kobj_type structure constant
  zonefs: Cache zone group directory inodes
  zonefs: Dynamically create file inodes when needed
  zonefs: Separate zone information from inode information
  zonefs: Reduce struct zonefs_inode_info size
  zonefs: Simplify IO error handling
  zonefs: Reorganize code
2023-02-22 14:11:54 -08:00
Linus Torvalds
b7ee881282 gfs2 fixes
- Fix a race when disassociating inodes from their glocks after
   iget_failed().
 
 - On filesystems with a block size smaller than the page size, make
   sure that ->writepages() writes out all buffers of journaled inodes.
 
 - Various improvements to the way the delete workqueue is drained to
   speed up unmount and prevent leftover inodes.  At unmount time, evict
   deleted inodes cooperatively across the cluster to avoid unnecessary
   timeouts.
 
 - Various minor cleanups and fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmP2ASMUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTodnQ//XSt/sKo6W4y56Pxz5AHO2cTdypLk
 b+ki1uWfguBm/3o8nFtXedoCcTsWZ8EICPL9bZKFBwSLSRChZuc6kKDz07wwseKR
 t8Lbi9G8tSwyDeQaiGTBd1UBWFAjNGqXniyqx/ki+RZZ3QMVJnwcb1Bjtl9DJ5DE
 lAXcvz0DZSQQwtpsdG+qpme65XZSziS0uDgkaz5Pio/1NDfbZ/U28HBMLNkS/Ef7
 RnF05PRaM07OGn2rXmvIqwIwoxjH2eF38x5EyMI3xpr//b3x/mVbwL+QFECfgf7r
 iuISCtL46n/gs4NmroPfT5LbCDhkOw513mmkdJNKXwHsz4s8hS+BuZVtTa8b02hn
 0K5Ova63qz3TIoZ+P6n44xiRFEVjqz0eqn0XhOr+HRljRXhn2ihxQLl4yGNgrmB2
 KTC+xMqZHXs2J3H97OTZDJVHYe5k6HqzvBUN6BnGRZ1lSEJbt5Fe0b7Web2aaLY2
 X2jFXplWVisTPKcusSuG3kP4WrEJq7ic8YLX6BgKU5DBbS69NETssUuMGUIxsd6k
 P+A4wfrUWac+X+DHFRPJu1yNL2UsW237AX75sNNOqLNRX04ZjXGFxymnEw//t2Qr
 2sPOEkU4O61o7tYWlK8PXTDVEbteZO3pBCdj7ARsmDEY401QuT9ZlmpmhDxxP/hP
 TGLDXXMbG+eireU=
 =3v/p
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v6.2-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Andreas Gruenbacher:

 - Fix a race when disassociating inodes from their glocks after
   iget_failed()

 - On filesystems with a block size smaller than the page size, make
   sure that ->writepages() writes out all buffers of journaled inodes

 - Various improvements to the way the delete workqueue is drained to
   speed up unmount and prevent leftover inodes. At unmount time, evict
   deleted inodes cooperatively across the cluster to avoid unnecessary
   timeouts

 - Various minor cleanups and fixes

* tag 'gfs2-v6.2-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Convert gfs2_page_add_databufs to folios
  gfs2: jdata writepage fix
  gfs2: Improve gfs2_make_fs_rw error handling
  Revert "GFS2: free disk inode which is deleted by remote node -V2"
  gfs2: Evict inodes cooperatively
  gfs2: Flush delete work before shrinking inode cache
  gfs2: Cease delete work during unmount
  gfs2: Add SDF_DEACTIVATING super block flag
  gfs2: check gl_object in rgrp glops
  gfs2: Split the two kinds of glock "delete" work
  gfs2: Move delete workqueue into super block
  gfs2: Get rid of GLF_PENDING_DELETE flag
  gfs2: Make glock lru list scanning safer
  gfs2: Clean up gfs2_scan_glock_lru
  gfs2: Improve gfs2_upgrade_iopen_glock comment
  gfs2: gl_object races fix
2023-02-22 14:00:53 -08:00
Linus Torvalds
28e335208c New code for 6.3-rc1:
* Eliminate repeated boxing and unboxing of log item parameters.
  * Clean up some confusing variable names in the log item code.
  * Fix a deadlock when doing unwritten extent conversion that causes a
    bmbt split when there are sustained memory shortages and the worker
    pool runs out of worker threads.
  * Fix the panic_mask debug knob not being able to trigger on verifier
    errors.
  * Constify kobj_type objects.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCY+Z6BwAKCRBKO3ySh0YR
 pkQJAQCjkzXqZuj8WH/g22S01smT51QhmX+1ubLdzMYSvRvrKQD+MlH74EcgurQD
 GhgCWJh6dBTx1nICKpCXYgVD9Glvowc=
 =J2Xw
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.3-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "There's a couple of bug fixes, some cleanups for inconsistent variable
  names and reduction of struct boxing and unboxing in the logging code.

  More work is pending, which will begin reworking allocation group
  lifetimes and finally replace confusing indirect calls to the
  allocator with actual ... function calls. But I want to let that
  experience another week of testing.

  Summary:

   - Eliminate repeated boxing and unboxing of log item parameters

   - Clean up some confusing variable names in the log item code

   - Fix a deadlock when doing unwritten extent conversion that causes a
     bmbt split when there are sustained memory shortages and the worker
     pool runs out of worker threads

   - Fix the panic_mask debug knob not being able to trigger on verifier
     errors

   - Constify kobj_type objects"

* tag 'xfs-6.3-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: revert commit 8954c44ff4
  xfs: make kobj_type structures constant
  xfs: allow setting full range of panic tags
  xfs: don't use BMBT btree split workers for IO completion
  xfs: fix confusing variable names in xfs_refcount_item.c
  xfs: pass refcount intent directly through the log intent code
  xfs: fix confusing variable names in xfs_rmap_item.c
  xfs: pass rmap space mapping directly through the log intent code
  xfs: fix confusing xfs_extent_item variable names
  xfs: pass xfs_extent_free_item directly through the log intent code
  xfs: fix confusing variable names in xfs_bmap_item.c
  xfs: pass the xfs_bmbt_irec directly through the log intent code
  xfs: use strscpy() to instead of strncpy()
2023-02-22 13:55:51 -08:00
Linus Torvalds
d151e8bea1 New code for 6.3:
- Change when the iomap page_done function is called so that we still
    have a locked folio in the success case.  This fixes a writeback race
    in gfs2.
  - Change when the iomap page_prepare function is called so that gfs2
    can recover from OOM scenarios more gracefully.
  - Rename the iomap page_ops to folio_ops, since they operate on folios
    now.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCY8g/FwAKCRBKO3ySh0YR
 pi19AQDCatxkzguJGV9BY52Bf8iDxCgdL34RatKXAzkZC3Y6UQEAsNdb88rkWkNK
 qPlXgsZm9cNlFb8c7mFvA9JAL9IPxgE=
 =ubh6
 -----END PGP SIGNATURE-----

Merge tag 'iomap-6.3-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap updates from Darrick Wong:
 "This is mostly rearranging things to make life easier for gfs2,
  nothing all that mindblowing for this release.

   - Change when the iomap page_done function is called so that we still
     have a locked folio in the success case. This fixes a writeback
     race in gfs2

   - Change when the iomap page_prepare function is called so that gfs2
     can recover from OOM scenarios more gracefully

   - Rename the iomap page_ops to folio_ops, since they operate on
     folios now"

* tag 'iomap-6.3-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: Rename page_ops to folio_ops
  iomap: Rename page_prepare handler to get_folio
  iomap: Add __iomap_get_folio helper
  iomap/gfs2: Get page in page_prepare handler
  iomap: Add iomap_get_folio helper
  iomap: Rename page_done handler to put_folio
  iomap/gfs2: Unlock and put folio in page_done handler
  iomap: Add __iomap_put_folio helper
2023-02-22 13:50:13 -08:00
Andreas Gruenbacher
c1b0c3cfcb gfs2: Convert gfs2_page_add_databufs to folios
Convert gfs2_page_add_databufs() to folios and rename it to
gfs2_trans_add_databufs().

Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-02-22 12:06:20 +01:00
Andreas Gruenbacher
cbb60951ce gfs2: jdata writepage fix
The ->writepage() and ->writepages() operations are supposed to write
entire pages.  However, on filesystems with a block size smaller than
PAGE_SIZE, __gfs2_jdata_writepage() only adds the first block to the
current transaction instead of adding the entire page.  Fix that.

Fixes: 18ec7d5c3f ("[GFS2] Make journaled data files identical to normal files on disk")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-02-22 12:06:20 +01:00
Linus Torvalds
5b7c4cabbb Networking changes for 6.3.
Core
 ----
 
  - Add dedicated kmem_cache for typical/small skb->head, avoid having
    to access struct page at kfree time, and improve memory use.
 
  - Introduce sysctl to set default RPS configuration for new netdevs.
 
  - Define Netlink protocol specification format which can be used
    to describe messages used by each family and auto-generate parsers.
    Add tools for generating kernel data structures and uAPI headers.
 
  - Expose all net/core sysctls inside netns.
 
  - Remove 4s sleep in netpoll if carrier is instantly detected on boot.
 
  - Add configurable limit of MDB entries per port, and port-vlan.
 
  - Continue populating drop reasons throughout the stack.
 
  - Retire a handful of legacy Qdiscs and classifiers.
 
 Protocols
 ---------
 
  - Support IPv4 big TCP (TSO frames larger than 64kB).
 
  - Add IP_LOCAL_PORT_RANGE socket option, to control local port range
    on socket by socket basis.
 
  - Track and report in procfs number of MPTCP sockets used.
 
  - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP
    path manager.
 
  - IPv6: don't check net.ipv6.route.max_size and rely on garbage
    collection to free memory (similarly to IPv4).
 
  - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
 
  - ICMP: add per-rate limit counters.
 
  - Add support for user scanning requests in ieee802154.
 
  - Remove static WEP support.
 
  - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
    reporting.
 
  - WiFi 7 EHT channel puncturing support (client & AP).
 
 BPF
 ---
 
  - Add a rbtree data structure following the "next-gen data structure"
    precedent set by recently added linked list, that is, by using
    kfunc + kptr instead of adding a new BPF map type.
 
  - Expose XDP hints via kfuncs with initial support for RX hash and
    timestamp metadata.
 
  - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key
    to better support decap on GRE tunnel devices not operating
    in collect metadata.
 
  - Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
 
  - Remove the need for trace_printk_lock for bpf_trace_printk
    and bpf_trace_vprintk helpers.
 
  - Extend libbpf's bpf_tracing.h support for tracing arguments of
    kprobes/uprobes and syscall as a special case.
 
  - Significantly reduce the search time for module symbols
    by livepatch and BPF.
 
  - Enable cpumasks to be used as kptrs, which is useful for tracing
    programs tracking which tasks end up running on which CPUs in
    different time intervals.
 
  - Add support for BPF trampoline on s390x and riscv64.
 
  - Add capability to export the XDP features supported by the NIC.
 
  - Add __bpf_kfunc tag for marking kernel functions as kfuncs.
 
  - Add cgroup.memory=nobpf kernel parameter option to disable BPF
    memory accounting for container environments.
 
 Netfilter
 ---------
 
  - Remove the CLUSTERIP target. It has been marked as obsolete
    for years, and we still have WARN splats wrt. races of
    the out-of-band /proc interface installed by this target.
 
  - Add 'destroy' commands to nf_tables. They are identical to
    the existing 'delete' commands, but do not return an error if
    the referenced object (set, chain, rule...) did not exist.
 
 Driver API
 ----------
 
  - Improve cpumask_local_spread() locality to help NICs set the right
    IRQ affinity on AMD platforms.
 
  - Separate C22 and C45 MDIO bus transactions more clearly.
 
  - Introduce new DCB table to control DSCP rewrite on egress.
 
  - Support configuration of Physical Layer Collision Avoidance (PLCA)
    Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
    shared medium Ethernet.
 
  - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
    preemption of low priority frames by high priority frames.
 
  - Add support for controlling MACSec offload using netlink SET.
 
  - Rework devlink instance refcounts to allow registration and
    de-registration under the instance lock. Split the code into multiple
    files, drop some of the unnecessarily granular locks and factor out
    common parts of netlink operation handling.
 
  - Add TX frame aggregation parameters (for USB drivers).
 
  - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
    messages with notifications for debug.
 
  - Allow offloading of UDP NEW connections via act_ct.
 
  - Add support for per action HW stats in TC.
 
  - Support hardware miss to TC action (continue processing in SW from
    a specific point in the action chain).
 
  - Warn if old Wireless Extension user space interface is used with
    modern cfg80211/mac80211 drivers. Do not support Wireless Extensions
    for Wi-Fi 7 devices at all. Everyone should switch to using nl80211
    interface instead.
 
  - Improve the CAN bit timing configuration. Use extack to return error
    messages directly to user space, update the SJW handling, including
    the definition of a new default value that will benefit CAN-FD
    controllers, by increasing their oscillator tolerance.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - nVidia BlueField-3 support (control traffic driver)
    - Ethernet support for imx93 SoCs
    - Motorcomm yt8531 gigabit Ethernet PHY
    - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
    - Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
    - Amlogic gxl MDIO mux
 
  - WiFi:
    - RealTek RTL8188EU (rtl8xxxu)
    - Qualcomm Wi-Fi 7 devices (ath12k)
 
  - CAN:
    - Renesas R-Car V4H
 
 Drivers
 -------
 
  - Bluetooth:
    - Set Per Platform Antenna Gain (PPAG) for Intel controllers.
 
  - Ethernet NICs:
    - Intel (1G, igc):
      - support TSN / Qbv / packet scheduling features of i226 model
    - Intel (100G, ice):
      - use GNSS subsystem instead of TTY
      - multi-buffer XDP support
      - extend support for GPIO pins to E823 devices
    - nVidia/Mellanox:
      - update the shared buffer configuration on PFC commands
      - implement PTP adjphase function for HW offset control
      - TC support for Geneve and GRE with VF tunnel offload
      - more efficient crypto key management method
      - multi-port eswitch support
    - Netronome/Corigine:
      - add DCB IEEE support
      - support IPsec offloading for NFP3800
    - Freescale/NXP (enetc):
      - enetc: support XDP_REDIRECT for XDP non-linear buffers
      - enetc: improve reconfig, avoid link flap and waiting for idle
      - enetc: support MAC Merge layer
    - Other NICs:
      - sfc/ef100: add basic devlink support for ef100
      - ionic: rx_push mode operation (writing descriptors via MMIO)
      - bnxt: use the auxiliary bus abstraction for RDMA
      - r8169: disable ASPM and reset bus in case of tx timeout
      - cpsw: support QSGMII mode for J721e CPSW9G
      - cpts: support pulse-per-second output
      - ngbe: add an mdio bus driver
      - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
      - r8152: handle devices with FW with NCM support
      - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
      - virtio-net: support multi buffer XDP
      - virtio/vsock: replace virtio_vsock_pkt with sk_buff
      - tsnep: XDP support
 
  - Ethernet high-speed switches:
    - nVidia/Mellanox (mlxsw):
      - add support for latency TLV (in FW control messages)
    - Microchip (sparx5):
      - separate explicit and implicit traffic forwarding rules, make
        the implicit rules always active
      - add support for egress DSCP rewrite
      - IS0 VCAP support (Ingress Classification)
      - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.)
      - ES2 VCAP support (Egress Access Control)
      - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1)
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - add MAB (port auth) offload support
      - enable PTP receive for mv88e6390
    - NXP (ocelot):
      - support MAC Merge layer
      - support for the the vsc7512 internal copper phys
    - Microchip:
      - lan9303: convert to PHYLINK
      - lan966x: support TC flower filter statistics
      - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
      - lan937x: support Credit Based Shaper configuration
      - ksz9477: support Energy Efficient Ethernet
    - other:
      - qca8k: convert to regmap read/write API, use bulk operations
      - rswitch: Improve TX timestamp accuracy
 
  - Intel WiFi (iwlwifi):
    - EHT (Wi-Fi 7) rate reporting
    - STEP equalizer support: transfer some STEP (connection to radio
      on platforms with integrated wifi) related parameters from the
      BIOS to the firmware.
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - IPQ5018 support
    - Fine Timing Measurement (FTM) responder role support
    - channel 177 support
 
  - MediaTek WiFi (mt76):
    - per-PHY LED support
    - mt7996: EHT (Wi-Fi 7) support
    - Wireless Ethernet Dispatch (WED) reset support
    - switch to using page pool allocator
 
  - RealTek WiFi (rtw89):
    - support new version of Bluetooth co-existance
 
  - Mobile:
    - rmnet: support TX aggregation.
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmP1VIYACgkQMUZtbf5S
 IrvsChAApz0rNL/sPKxXTEfxZ1tN7D3sYxYKQPomxvl5BV+MvicrLddJy3KmzEFK
 nnJNO3nuRNuH422JQ/ylZ4mGX1opa6+5QJb0UINImXUI7Fm8HHBIuPGkv7d5CheZ
 7JexFqjPJXUy9nPyh1Rra+IA9AcRd2U7jeGEZR38wb99bHJQj5Bzdk20WArEB0el
 n44aqg49LXH71bSeXRz77x5SjkwVtYiccQxLcnmTbjLU2xVraLvI2J+wAhHnVXWW
 9lrU1+V4Ex2Xcd1xR0L0cHeK+meP1TrPRAeF+JDpVI3a/zJiE7cZjfHdG/jH5xWl
 leZJqghVozrZQNtewWWO7XhUFhMDgFu3W/1vNLjSHPZEqaz1JpM67J1+ql6s63l4
 LMWoXbcYZz+SL9ZRCoPkbGue/5fKSHv8/Jl9Sh58+eTS+c/zgN8uFGRNFXLX1+EP
 n8uvt985PxMd6x1+dHumhOUzxnY4Sfi1vjitSunTsNFQ3Cmp4SO0IfBVJWfLUCuC
 xz5hbJGJJbSpvUsO+HWyCg83E5OWghRE/Onpt2jsQSZCrO9HDg4FRTEf3WAMgaqc
 edb5KfbRZPTJQM08gWdluXzSk1nw3FNP2tXW4XlgUrEbjb+fOk0V9dQg2gyYTxQ1
 Nhvn8ZQPi6/GMMELHAIPGmmW1allyOGiAzGlQsv8EmL+OFM6WDI=
 =xXhC
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - Add dedicated kmem_cache for typical/small skb->head, avoid having
     to access struct page at kfree time, and improve memory use.

   - Introduce sysctl to set default RPS configuration for new netdevs.

   - Define Netlink protocol specification format which can be used to
     describe messages used by each family and auto-generate parsers.
     Add tools for generating kernel data structures and uAPI headers.

   - Expose all net/core sysctls inside netns.

   - Remove 4s sleep in netpoll if carrier is instantly detected on
     boot.

   - Add configurable limit of MDB entries per port, and port-vlan.

   - Continue populating drop reasons throughout the stack.

   - Retire a handful of legacy Qdiscs and classifiers.

  Protocols:

   - Support IPv4 big TCP (TSO frames larger than 64kB).

   - Add IP_LOCAL_PORT_RANGE socket option, to control local port range
     on socket by socket basis.

   - Track and report in procfs number of MPTCP sockets used.

   - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
     manager.

   - IPv6: don't check net.ipv6.route.max_size and rely on garbage
     collection to free memory (similarly to IPv4).

   - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).

   - ICMP: add per-rate limit counters.

   - Add support for user scanning requests in ieee802154.

   - Remove static WEP support.

   - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
     reporting.

   - WiFi 7 EHT channel puncturing support (client & AP).

  BPF:

   - Add a rbtree data structure following the "next-gen data structure"
     precedent set by recently added linked list, that is, by using
     kfunc + kptr instead of adding a new BPF map type.

   - Expose XDP hints via kfuncs with initial support for RX hash and
     timestamp metadata.

   - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
     better support decap on GRE tunnel devices not operating in collect
     metadata.

   - Improve x86 JIT's codegen for PROBE_MEM runtime error checks.

   - Remove the need for trace_printk_lock for bpf_trace_printk and
     bpf_trace_vprintk helpers.

   - Extend libbpf's bpf_tracing.h support for tracing arguments of
     kprobes/uprobes and syscall as a special case.

   - Significantly reduce the search time for module symbols by
     livepatch and BPF.

   - Enable cpumasks to be used as kptrs, which is useful for tracing
     programs tracking which tasks end up running on which CPUs in
     different time intervals.

   - Add support for BPF trampoline on s390x and riscv64.

   - Add capability to export the XDP features supported by the NIC.

   - Add __bpf_kfunc tag for marking kernel functions as kfuncs.

   - Add cgroup.memory=nobpf kernel parameter option to disable BPF
     memory accounting for container environments.

  Netfilter:

   - Remove the CLUSTERIP target. It has been marked as obsolete for
     years, and we still have WARN splats wrt races of the out-of-band
     /proc interface installed by this target.

   - Add 'destroy' commands to nf_tables. They are identical to the
     existing 'delete' commands, but do not return an error if the
     referenced object (set, chain, rule...) did not exist.

  Driver API:

   - Improve cpumask_local_spread() locality to help NICs set the right
     IRQ affinity on AMD platforms.

   - Separate C22 and C45 MDIO bus transactions more clearly.

   - Introduce new DCB table to control DSCP rewrite on egress.

   - Support configuration of Physical Layer Collision Avoidance (PLCA)
     Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
     shared medium Ethernet.

   - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
     preemption of low priority frames by high priority frames.

   - Add support for controlling MACSec offload using netlink SET.

   - Rework devlink instance refcounts to allow registration and
     de-registration under the instance lock. Split the code into
     multiple files, drop some of the unnecessarily granular locks and
     factor out common parts of netlink operation handling.

   - Add TX frame aggregation parameters (for USB drivers).

   - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
     messages with notifications for debug.

   - Allow offloading of UDP NEW connections via act_ct.

   - Add support for per action HW stats in TC.

   - Support hardware miss to TC action (continue processing in SW from
     a specific point in the action chain).

   - Warn if old Wireless Extension user space interface is used with
     modern cfg80211/mac80211 drivers. Do not support Wireless
     Extensions for Wi-Fi 7 devices at all. Everyone should switch to
     using nl80211 interface instead.

   - Improve the CAN bit timing configuration. Use extack to return
     error messages directly to user space, update the SJW handling,
     including the definition of a new default value that will benefit
     CAN-FD controllers, by increasing their oscillator tolerance.

  New hardware / drivers:

   - Ethernet:
      - nVidia BlueField-3 support (control traffic driver)
      - Ethernet support for imx93 SoCs
      - Motorcomm yt8531 gigabit Ethernet PHY
      - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
      - Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
      - Amlogic gxl MDIO mux

   - WiFi:
      - RealTek RTL8188EU (rtl8xxxu)
      - Qualcomm Wi-Fi 7 devices (ath12k)

   - CAN:
      - Renesas R-Car V4H

  Drivers:

   - Bluetooth:
      - Set Per Platform Antenna Gain (PPAG) for Intel controllers.

   - Ethernet NICs:
      - Intel (1G, igc):
         - support TSN / Qbv / packet scheduling features of i226 model
      - Intel (100G, ice):
         - use GNSS subsystem instead of TTY
         - multi-buffer XDP support
         - extend support for GPIO pins to E823 devices
      - nVidia/Mellanox:
         - update the shared buffer configuration on PFC commands
         - implement PTP adjphase function for HW offset control
         - TC support for Geneve and GRE with VF tunnel offload
         - more efficient crypto key management method
         - multi-port eswitch support
      - Netronome/Corigine:
         - add DCB IEEE support
         - support IPsec offloading for NFP3800
      - Freescale/NXP (enetc):
         - support XDP_REDIRECT for XDP non-linear buffers
         - improve reconfig, avoid link flap and waiting for idle
         - support MAC Merge layer
      - Other NICs:
         - sfc/ef100: add basic devlink support for ef100
         - ionic: rx_push mode operation (writing descriptors via MMIO)
         - bnxt: use the auxiliary bus abstraction for RDMA
         - r8169: disable ASPM and reset bus in case of tx timeout
         - cpsw: support QSGMII mode for J721e CPSW9G
         - cpts: support pulse-per-second output
         - ngbe: add an mdio bus driver
         - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
         - r8152: handle devices with FW with NCM support
         - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
         - virtio-net: support multi buffer XDP
         - virtio/vsock: replace virtio_vsock_pkt with sk_buff
         - tsnep: XDP support

   - Ethernet high-speed switches:
      - nVidia/Mellanox (mlxsw):
         - add support for latency TLV (in FW control messages)
      - Microchip (sparx5):
         - separate explicit and implicit traffic forwarding rules, make
           the implicit rules always active
         - add support for egress DSCP rewrite
         - IS0 VCAP support (Ingress Classification)
         - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
           etc.)
         - ES2 VCAP support (Egress Access Control)
         - support for Per-Stream Filtering and Policing (802.1Q,
           8.6.5.1)

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - add MAB (port auth) offload support
         - enable PTP receive for mv88e6390
      - NXP (ocelot):
         - support MAC Merge layer
         - support for the the vsc7512 internal copper phys
      - Microchip:
         - lan9303: convert to PHYLINK
         - lan966x: support TC flower filter statistics
         - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
         - lan937x: support Credit Based Shaper configuration
         - ksz9477: support Energy Efficient Ethernet
      - other:
         - qca8k: convert to regmap read/write API, use bulk operations
         - rswitch: Improve TX timestamp accuracy

   - Intel WiFi (iwlwifi):
      - EHT (Wi-Fi 7) rate reporting
      - STEP equalizer support: transfer some STEP (connection to radio
        on platforms with integrated wifi) related parameters from the
        BIOS to the firmware.

   - Qualcomm 802.11ax WiFi (ath11k):
      - IPQ5018 support
      - Fine Timing Measurement (FTM) responder role support
      - channel 177 support

   - MediaTek WiFi (mt76):
      - per-PHY LED support
      - mt7996: EHT (Wi-Fi 7) support
      - Wireless Ethernet Dispatch (WED) reset support
      - switch to using page pool allocator

   - RealTek WiFi (rtw89):
      - support new version of Bluetooth co-existance

   - Mobile:
      - rmnet: support TX aggregation"

* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
  page_pool: add a comment explaining the fragment counter usage
  net: ethtool: fix __ethtool_dev_mm_supported() implementation
  ethtool: pse-pd: Fix double word in comments
  xsk: add linux/vmalloc.h to xsk.c
  sefltests: netdevsim: wait for devlink instance after netns removal
  selftest: fib_tests: Always cleanup before exit
  net/mlx5e: Align IPsec ASO result memory to be as required by hardware
  net/mlx5e: TC, Set CT miss to the specific ct action instance
  net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
  net/mlx5: Refactor tc miss handling to a single function
  net/mlx5: Kconfig: Make tc offload depend on tc skb extension
  net/sched: flower: Support hardware miss to tc action
  net/sched: flower: Move filter handle initialization earlier
  net/sched: cls_api: Support hardware miss to tc action
  net/sched: Rename user cookie and act cookie
  sfc: fix builds without CONFIG_RTC_LIB
  sfc: clean up some inconsistent indentings
  net/mlx4_en: Introduce flexible array to silence overflow warning
  net: lan966x: Fix possible deadlock inside PTP
  net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
  ...
2023-02-21 18:24:12 -08:00
Linus Torvalds
36289a03bc This update includes the following changes:
API:
 
 - Use kmap_local instead of kmap_atomic.
 - Change request callback to take void pointer.
 - Print FIPS status in /proc/crypto (when enabled).
 
 Algorithms:
 
 - Add rfc4106/gcm support on arm64.
 - Add ARIA AVX2/512 support on x86.
 
 Drivers:
 
 - Add TRNG driver for StarFive SoC.
 - Delete ux500/hash driver (subsumed by stm32/hash).
 - Add zlib support in qat.
 - Add RSA support in aspeed.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmPzAiwACgkQxycdCkmx
 i6et8xAAoO3w5MZFGXMzWsYhfSZFdceXBEQfDR7JOCdHxpMIQhw0FLlb0uttFk6m
 SeWrdP9wiifBDoCmw7qffFJml8ZftPL/XeXjob2d9v7jKbPyw3lDSIdsNfN/5EEL
 oIc9915zwrgawvahPAa+PQ4Ue03qRjUyOcV42dpd1W3NYhzDVHoK5OUU+mEFYDvx
 Sgw/YUugKf0VXkVDFzG5049+CPcheyRZqclAo9jyl2eZiXujgUyV33nxRCtqIA+t
 7jlHKwi+6QzFHY0CX5BvShR8xyEuH5MLoU3H/jYGXnRb3nEpRYAEO4VZchIHqF0F
 Y6pKIKc6Q8OyIVY8RsjQY3hioCqYnQFZ5Xtc1zGtOYEitVLbkmItMG0mVn0XOfyt
 gJDi6gkEw5uPUbEQdI4R1xEgJ8eCckMsOJ+uRxqTm+uLqNDxPbsB9bohKniMogXV
 lDlVXjU23AA9VeKtqU8FvWjfgqsN47X4aoq1j4/4aI7X9F7P9FOP21TZloP7+ssj
 PFrzNaRXUrMEsvyS1wqPegIh987lj6WkH4hyU0wjzaIq4IQELidHsSXFS12iWIPH
 kTEoC/trAVoYSr0zXKWUCs4h/x0FztVNbjs4KiDP2FLXX1RzeVZ0WlaXZhryHr+n
 1+8yCuS6tVofAbSX0wNkZdf0x5+3CIBw4kqSIvjKDPYYEfIDaT0=
 =dMYe
 -----END PGP SIGNATURE-----

Merge tag 'v6.3-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto update from Herbert Xu:
 "API:
   - Use kmap_local instead of kmap_atomic
   - Change request callback to take void pointer
   - Print FIPS status in /proc/crypto (when enabled)

  Algorithms:
   - Add rfc4106/gcm support on arm64
   - Add ARIA AVX2/512 support on x86

  Drivers:
   - Add TRNG driver for StarFive SoC
   - Delete ux500/hash driver (subsumed by stm32/hash)
   - Add zlib support in qat
   - Add RSA support in aspeed"

* tag 'v6.3-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (156 commits)
  crypto: x86/aria-avx - Do not use avx2 instructions
  crypto: aspeed - Fix modular aspeed-acry
  crypto: hisilicon/qm - fix coding style issues
  crypto: hisilicon/qm - update comments to match function
  crypto: hisilicon/qm - change function names
  crypto: hisilicon/qm - use min() instead of min_t()
  crypto: hisilicon/qm - remove some unused defines
  crypto: proc - Print fips status
  crypto: crypto4xx - Call dma_unmap_page when done
  crypto: octeontx2 - Fix objects shared between several modules
  crypto: nx - Fix sparse warnings
  crypto: ecc - Silence sparse warning
  tls: Pass rec instead of aead_req into tls_encrypt_done
  crypto: api - Remove completion function scaffolding
  tls: Remove completion function scaffolding
  tipc: Remove completion function scaffolding
  net: ipv6: Remove completion function scaffolding
  net: ipv4: Remove completion function scaffolding
  net: macsec: Remove completion function scaffolding
  dm: Remove completion function scaffolding
  ...
2023-02-21 18:10:50 -08:00
Ondrej Mosnacek
f1aa2eb5ea sysctl: fix proc_dobool() usability
Currently proc_dobool expects a (bool *) in table->data, but sizeof(int)
in table->maxsize, because it uses do_proc_dointvec() directly.

This is unsafe for at least two reasons:
1. A sysctl table definition may use { .data = &variable, .maxsize =
   sizeof(variable) }, not realizing that this makes the sysctl unusable
   (see the Fixes: tag) and that they need to use the completely
   counterintuitive sizeof(int) instead.
2. proc_dobool() will currently try to parse an array of values if given
   .maxsize >= 2*sizeof(int), but will try to write values of type bool
   by offsets of sizeof(int), so it will not work correctly with neither
   an (int *) nor a (bool *). There is no .maxsize validation to prevent
   this.

Fix this by:
1. Constraining proc_dobool() to allow only one value and .maxsize ==
   sizeof(bool).
2. Wrapping the original struct ctl_table in a temporary one with .data
   pointing to a local int variable and .maxsize set to sizeof(int) and
   passing this one to proc_dointvec(), converting the value to/from
   bool as needed (using proc_dou8vec_minmax() as an example).
3. Extending sysctl_check_table() to enforce proc_dobool() expectations.
4. Fixing the proc_dobool() docstring (it was just copy-pasted from
   proc_douintvec, apparently...).
5. Converting all existing proc_dobool() users to set .maxsize to
   sizeof(bool) instead of sizeof(int).

Fixes: 83efeeeb3d ("tty: Allow TIOCSTI to be disabled")
Fixes: a2071573d6 ("sysctl: introduce new proc handler proc_dobool")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-02-21 13:34:07 -08:00
Linus Torvalds
4a7d37e824 hardening updates for v6.3-rc1
- Replace 0-length and 1-element arrays with flexible arrays in various
   subsystems (Paulo Miguel Almeida, Stephen Rothwell, Kees Cook)
 
 - randstruct: Disable Clang 15 support (Eric Biggers)
 
 - GCC plugins: Drop -std=gnu++11 flag (Sam James)
 
 - strpbrk(): Refactor to use strchr() (Andy Shevchenko)
 
 - LoadPin LSM: Allow root filesystem switching when non-enforcing
 
 - fortify: Use dynamic object size hints when available
 
 - ext4: Fix CFI function prototype mismatch
 
 - Nouveau: Fix DP buffer size arguments
 
 - hisilicon: Wipe entire crypto DMA pool on error
 
 - coda: Fully allocate sig_inputArgs
 
 - UBSAN: Improve arm64 trap code reporting
 
 - copy_struct_from_user(): Add minimum bounds check on kernel buffer size
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmPv1Y8WHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJg5UD/9x3Lx0EG3iL4qPtjmohaXd899r
 AzP1ysoxYnmo/cY0//W3DPCJrUaVlTm7M2xXOpzi7YPVD8Jcofzy6Uxm9BiG/OJ9
 bla7uQixlDMA2MBmWzAXhM7337WgEtBcr6kbXk6rHFnzmk8CdAY3wjmLmiefxEWT
 gkdeJlbkBFynssSF2nejgCvr/ZyiWQr2V9hRdEavLQH/MDS785bmNwbLyUNqK+eo
 gOtuyjyV90t+cSIN0bF7gOCFGf1ivKA/+GNFrob0jY0Fy2kGx1I2wQMn9yzjzerC
 o6Majz9r+7Z7xIaz2Pm9nDaWyZDI05RfoRpQZ9dSEJ+zYgbFBFpDpJShcJvSpNa0
 POqeR400n/6VWBcbk7UU0s7VCVU13IsOFhBSVMQM5FfzIcUkj0/VBm0Jm0ODrpM9
 13/nKyAkvHkH0uSJbQjn79rXvEvqQyi5f28emm2CuhiHHUiDEUdsmMD7fE8UXo4r
 U8dgfwTOLLQBKmOQJcgiLo8iLDPhatZKYQAZ7LMY9kbHLsJlRVxfzY9PriNCuI5o
 XuMLJG33TrlUDfqQrKeSJ9srVRiiIBAzoWnIfIVE3Xb46LqFNXVRdJCt4A2678jn
 gYIzkQ2HbVe2chUhUyjsjGTjmmeX9qZG0UOlhRQ0RvWFxi390wwYqhkSaOEGtDGv
 QbVh0Lb86m3H/G+M9g==
 =XnVa
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening updates from Kees Cook:
 "Beyond some specific LoadPin, UBSAN, and fortify features, there are
  other fixes scattered around in various subsystems where maintainers
  were okay with me carrying them in my tree or were non-responsive but
  the patches were reviewed by others:

   - Replace 0-length and 1-element arrays with flexible arrays in
     various subsystems (Paulo Miguel Almeida, Stephen Rothwell, Kees
     Cook)

   - randstruct: Disable Clang 15 support (Eric Biggers)

   - GCC plugins: Drop -std=gnu++11 flag (Sam James)

   - strpbrk(): Refactor to use strchr() (Andy Shevchenko)

   - LoadPin LSM: Allow root filesystem switching when non-enforcing

   - fortify: Use dynamic object size hints when available

   - ext4: Fix CFI function prototype mismatch

   - Nouveau: Fix DP buffer size arguments

   - hisilicon: Wipe entire crypto DMA pool on error

   - coda: Fully allocate sig_inputArgs

   - UBSAN: Improve arm64 trap code reporting

   - copy_struct_from_user(): Add minimum bounds check on kernel buffer
     size"

* tag 'hardening-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  randstruct: disable Clang 15 support
  uaccess: Add minimum bounds check on kernel buffer size
  arm64: Support Clang UBSAN trap codes for better reporting
  coda: Avoid partial allocation of sig_inputArgs
  gcc-plugins: drop -std=gnu++11 to fix GCC 13 build
  lib/string: Use strchr() in strpbrk()
  crypto: hisilicon: Wipe entire pool on error
  net/i40e: Replace 0-length array with flexible array
  io_uring: Replace 0-length array with flexible array
  ext4: Fix function prototype mismatch for ext4_feat_ktype
  i915/gvt: Replace one-element array with flexible-array member
  drm/nouveau/disp: Fix nvif_outp_acquire_dp() argument size
  LoadPin: Allow filesystem switch when not enforcing
  LoadPin: Move pin reporting cleanly out of locking
  LoadPin: Refactor sysctl initialization
  LoadPin: Refactor read-only check into a helper
  ARM: ixp4xx: Replace 0-length arrays with flexible arrays
  fortify: Use __builtin_dynamic_object_size() when available
  rxrpc: replace zero-lenth array with DECLARE_FLEX_ARRAY() helper
2023-02-21 11:07:23 -08:00
Linus Torvalds
8cc01d43f8 RCU pull request for v6.3
This pull request contains the following branches:
 
 doc.2023.01.05a: Documentation updates.
 
 fixes.2023.01.23a: Miscellaneous fixes, perhaps most notably:
 
 o	Throttling callback invocation based on the number of callbacks
 	that are now ready to invoke instead of on the total number
 	of callbacks.
 
 o	Several patches that suppress false-positive boot-time
 	diagnostics, for example, due to lockdep not yet being
 	initialized.
 
 o	Make expedited RCU CPU stall warnings dump stacks of any tasks
 	that are blocking the stalled grace period.  (Normal RCU CPU
 	stall warnings have doen this for mnay years.)
 
 o	Lazy-callback fixes to avoid delays during boot, suspend, and
 	resume.  (Note that lazy callbacks must be explicitly enabled,
 	so this should not (yet) affect production use cases.)
 
 kvfree.2023.01.03a: Cause kfree_rcu() and friends to take advantage of
 	polled grace periods, thus reducing memory footprint by almost
 	two orders of magnitude, admittedly on a microbenchmark.
 	This series also begins the transition from kfree_rcu(p) to
 	kfree_rcu_mightsleep(p).  This transition was motivated by bugs
 	where kfree_rcu(p), which can block, was typed instead of the
 	intended kfree_rcu(p, rh).
 
 srcu.2023.01.03a: SRCU updates, perhaps most notably fixing a bug that
 	causes SRCU to fail when booted on a system with a non-zero boot
 	CPU.  This surprising situation actually happens for kdump kernels
 	on the powerpc architecture.  It also adds an srcu_down_read()
 	and srcu_up_read(), which act like srcu_read_lock() and
 	srcu_read_unlock(), but allow an SRCU read-side critical section
 	to be handed off from one task to another.
 
 srcu-always.2023.02.02a: Cleans up the now-useless SRCU Kconfig option.
 	There are a few more commits that are not yet acked or pulled
 	into maintainer trees, and these will be in a pull request for
 	a later merge window.
 
 tasks.2023.01.03a: RCU-tasks updates, perhaps most notably these fixes:
 
 o	A strange interaction between PID-namespace unshare and the
 	RCU-tasks grace period that results in a low-probability but
 	very real hang.
 
 o	A race between an RCU tasks rude grace period on a single-CPU
 	system and CPU-hotplug addition of the second CPU that can result
 	in a too-short grace period.
 
 o	A race between shrinking RCU tasks down to a single callback list
 	and queuing a new callback to some other CPU, but where that
 	queuing is delayed for more than an RCU grace period.  This can
 	result in that callback being stranded on the non-boot CPU.
 
 torture.2023.01.05a: Torture-test updates and fixes.
 
 torturescript.2023.01.03a: Torture-test scripting updates and fixes.
 
 stall.2023.01.09a: Provide additional RCU CPU stall-warning information
 	in kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y, and
 	restore the full five-minute timeout limit for expedited RCU
 	CPU stall warnings.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmPq29UTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jAhVEACEAKJY1VJ9IUqz7CwzAYkzgRJfiygh
 oDUXmlqtm6ew9pr2GdLUVCVsUSldzBc0K7Djb/G1niv4JPs+v7YwupIV33+UbStU
 Qxt6ztTdxc4lKospLm1+2vF9ZdzVEmiP4wVCc4iDarv5FM3FpWSTNc8+L7qmlC+X
 myjv+GqMTxkXZBvYJOgJGFjDwN8noTd7Fr3mCCVLFm3PXMDa7tcwD6HRP5AqD2N8
 qC5M6LEqepKVGmz0mYMLlSN1GPaqIsEcexIFEazRsPEivPh/iafyQCQ/cqxwhXmV
 vEt7u+dXGZT/oiDq9cJ+/XRDS2RyKIS6dUE14TiiHolDCn1ONESahfA/gXWKykC2
 BaGPfjWXrWv/hwbeZ+8xEdkAvTIV92tGpXir9Fby1Z5PjP3balvrnn6hs5AnQBJb
 NdhRPLzy/dCnEF+CweAYYm1qvTo8cd5nyiNwBZHn7rEAIu3Axrecag1rhFl3AJ07
 cpVMQXZtkQVa2X8aIRTUC+ijX6yIqNaHlu0HqNXgIUTDzL4nv5cMjOMzpNQP9/dZ
 FwAMZYNiOk9IlMiKJ8ZiVcxeiA8ouIBlkYM3k6vGrmiONZ7a/EV/mSHoJqI8bvqr
 AxUIJ2Ayhg3bxPboL5oKgCiLql0A7ZVvz6quX6McitWGMgaSvel1fDzT3TnZd41e
 4AFBFd/+VedUGg==
 =bBYK
 -----END PGP SIGNATURE-----

Merge tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU updates from Paul McKenney:

 - Documentation updates

 - Miscellaneous fixes, perhaps most notably:

      - Throttling callback invocation based on the number of callbacks
        that are now ready to invoke instead of on the total number of
        callbacks

      - Several patches that suppress false-positive boot-time
        diagnostics, for example, due to lockdep not yet being
        initialized

      - Make expedited RCU CPU stall warnings dump stacks of any tasks
        that are blocking the stalled grace period. (Normal RCU CPU
        stall warnings have done this for many years)

      - Lazy-callback fixes to avoid delays during boot, suspend, and
        resume. (Note that lazy callbacks must be explicitly enabled, so
        this should not (yet) affect production use cases)

 - Make kfree_rcu() and friends take advantage of polled grace periods,
   thus reducing memory footprint by almost two orders of magnitude,
   admittedly on a microbenchmark

   This also begins the transition from kfree_rcu(p) to
   kfree_rcu_mightsleep(p). This transition was motivated by bugs where
   kfree_rcu(p), which can block, was typed instead of the intended
   kfree_rcu(p, rh)

 - SRCU updates, perhaps most notably fixing a bug that causes SRCU to
   fail when booted on a system with a non-zero boot CPU. This
   surprising situation actually happens for kdump kernels on the
   powerpc architecture

   This also adds an srcu_down_read() and srcu_up_read(), which act like
   srcu_read_lock() and srcu_read_unlock(), but allow an SRCU read-side
   critical section to be handed off from one task to another

 - Clean up the now-useless SRCU Kconfig option

   There are a few more commits that are not yet acked or pulled into
   maintainer trees, and these will be in a pull request for a later
   merge window

 - RCU-tasks updates, perhaps most notably these fixes:

      - A strange interaction between PID-namespace unshare and the
        RCU-tasks grace period that results in a low-probability but
        very real hang

      - A race between an RCU tasks rude grace period on a single-CPU
        system and CPU-hotplug addition of the second CPU that can
        result in a too-short grace period

      - A race between shrinking RCU tasks down to a single callback
        list and queuing a new callback to some other CPU, but where
        that queuing is delayed for more than an RCU grace period. This
        can result in that callback being stranded on the non-boot CPU

 - Torture-test updates and fixes

 - Torture-test scripting updates and fixes

 - Provide additional RCU CPU stall-warning information in kernels built
   with CONFIG_RCU_CPU_STALL_CPUTIME=y, and restore the full five-minute
   timeout limit for expedited RCU CPU stall warnings

* tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (80 commits)
  rcu/kvfree: Add kvfree_rcu_mightsleep() and kfree_rcu_mightsleep()
  kernel/notifier: Remove CONFIG_SRCU
  init: Remove "select SRCU"
  fs/quota: Remove "select SRCU"
  fs/notify: Remove "select SRCU"
  fs/btrfs: Remove "select SRCU"
  fs: Remove CONFIG_SRCU
  drivers/pci/controller: Remove "select SRCU"
  drivers/net: Remove "select SRCU"
  drivers/md: Remove "select SRCU"
  drivers/hwtracing/stm: Remove "select SRCU"
  drivers/dax: Remove "select SRCU"
  drivers/base: Remove CONFIG_SRCU
  rcu: Disable laziness if lazy-tracking says so
  rcu: Track laziness during boot and suspend
  rcu: Remove redundant call to rcu_boost_kthread_setaffinity()
  rcu: Allow up to five minutes expedited RCU CPU stall-warning timeouts
  rcu: Align the output of RCU CPU stall warning messages
  rcu: Add RCU stall diagnosis information
  sched: Add helper nr_context_switches_cpu()
  ...
2023-02-21 10:45:51 -08:00
Steve French
fdbf807215 update internal module version number for cifs.ko
From 2.41 to 2.42

Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-21 01:25:44 -06:00
Shyam Prasad N
e77978de47 cifs: update ip_addr for ses only for primary chan setup
We update ses->ip_addr whenever we do a session setup.
But this should happen only for primary channel in mchan
scenario.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-21 01:25:11 -06:00
Shyam Prasad N
df57109bd5 cifs: use tcon allocation functions even for dummy tcon
In smb2_reconnect_server, we allocate a dummy tcon for
calling reconnect for just the session. This should be
allocated using tconInfoAlloc, and not kmalloc.

Fixes: 3663c9045f ("cifs: check reconnects for channels of active tcons too")
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-21 01:25:07 -06:00
Shyam Prasad N
ea90708d3c cifs: use the least loaded channel for sending requests
Till now, we've used a simple round robin approach to
distribute the requests between the channels. This does
not work well if the channels consume the requests at
different speeds, even if the advertised speeds are the
same.

This change will allow the client to pick the channel
with least number of requests currently in-flight. This
will disregard the link speed, and select a channel
based on the current load of the channels.

For cases when all the channels are equally loaded,
fall back to the old round robin method.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-21 01:24:48 -06:00
Linus Torvalds
1f2d9ffc7a Scheduler updates in this cycle are:
- Improve the scalability of the CFS bandwidth unthrottling logic
    with large number of CPUs.
 
  - Fix & rework various cpuidle routines, simplify interaction with
    the generic scheduler code. Add __cpuidle methods as noinstr to
    objtool's noinstr detection and fix boatloads of cpuidle bugs & quirks.
 
  - Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS,
    to query previously issued registrations.
 
  - Limit scheduler slice duration to the sysctl_sched_latency period,
    to improve scheduling granularity with a large number of SCHED_IDLE
    tasks.
 
  - Debuggability enhancement on sys_exit(): warn about disabled IRQs,
    but also enable them to prevent a cascade of followup problems and
    repeat warnings.
 
  - Fix the rescheduling logic in prio_changed_dl().
 
  - Micro-optimize cpufreq and sched-util methods.
 
  - Micro-optimize ttwu_runnable()
 
  - Micro-optimize the idle-scanning in update_numa_stats(),
    select_idle_capacity() and steal_cookie_task().
 
  - Update the RSEQ code & self-tests
 
  - Constify various scheduler methods
 
  - Remove unused methods
 
  - Refine __init tags
 
  - Documentation updates
 
  - ... Misc other cleanups, fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzbJwRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iIvA//ZcEaB8Z6ChLRQjM+bsaudKJu3pdLQbPK
 iYbP8Da+LsAfxbEfYuGV3m+jIp0LlBOtsI/EezxQrXV+V7FvNyAX9Y00eEu/zlj8
 7Jn3LMy/DBYTwH7LwVdcU0MyIVI8ZPc6WNnkx0LOtGZn8n+qfHPSDzcP3CW+a5AV
 UvllPYpYyEmsX0Eby7CF4Ue8mSmbViw/xR3rNr8ZSve0c25XzKabw8O9kE3jiHxP
 d/zERJoAYeDyYUEuZqhfn5dTlB4an4IjNEkAfRE5SQ09RA8Gkxsa5Ar8gob9e9M1
 eQsdd4/bdhnrkM8L5qDZczqmgCTZ2bukQrxkBXhRDhLgoFxwAn77b+2ZjmIW3Lae
 AyGqRcDSg1q2oxaYm5ZiuO/t26aDOZu9vPHyHRDGt95EGbZlrp+GgeePyfCigJYz
 UmPdZAAcHdSymnnnlcvdG37WVvaVkpgWZzd8LbtBi23QR+Zc4WQ2IlgnUS5WKNNf
 VOBcAcP6E1IslDotZDQCc2dPFFQoQQEssVooyUc5oMytm7BsvxXLOeHG+Ncu/8uc
 H+U8Qn8jnqTxJbC5hkWQIJlhVKCq2FJrHxxySYTKROfUNcDgCmxboFeAcXTCIU1K
 T0S+sdoTS/CvtLklRkG0j6B8N4N98mOd9cFwUV3tX+/gMLMep3hCQs5L76JagvC5
 skkQXoONNaM=
 =l1nN
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Improve the scalability of the CFS bandwidth unthrottling logic with
   large number of CPUs.

 - Fix & rework various cpuidle routines, simplify interaction with the
   generic scheduler code. Add __cpuidle methods as noinstr to objtool's
   noinstr detection and fix boatloads of cpuidle bugs & quirks.

 - Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS, to query
   previously issued registrations.

 - Limit scheduler slice duration to the sysctl_sched_latency period, to
   improve scheduling granularity with a large number of SCHED_IDLE
   tasks.

 - Debuggability enhancement on sys_exit(): warn about disabled IRQs,
   but also enable them to prevent a cascade of followup problems and
   repeat warnings.

 - Fix the rescheduling logic in prio_changed_dl().

 - Micro-optimize cpufreq and sched-util methods.

 - Micro-optimize ttwu_runnable()

 - Micro-optimize the idle-scanning in update_numa_stats(),
   select_idle_capacity() and steal_cookie_task().

 - Update the RSEQ code & self-tests

 - Constify various scheduler methods

 - Remove unused methods

 - Refine __init tags

 - Documentation updates

 - Misc other cleanups, fixes

* tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
  sched/rt: pick_next_rt_entity(): check list_entry
  sched/deadline: Add more reschedule cases to prio_changed_dl()
  sched/fair: sanitize vruntime of entity being placed
  sched/fair: Remove capacity inversion detection
  sched/fair: unlink misfit task from cpu overutilized
  objtool: mem*() are not uaccess safe
  cpuidle: Fix poll_idle() noinstr annotation
  sched/clock: Make local_clock() noinstr
  sched/clock/x86: Mark sched_clock() noinstr
  x86/pvclock: Improve atomic update of last_value in pvclock_clocksource_read()
  x86/atomics: Always inline arch_atomic64*()
  cpuidle: tracing, preempt: Squash _rcuidle tracing
  cpuidle: tracing: Warn about !rcu_is_watching()
  cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG
  cpuidle: drivers: firmware: psci: Dont instrument suspend code
  KVM: selftests: Fix build of rseq test
  exit: Detect and fix irq disabled state in oops
  cpuidle, arm64: Fix the ARM64 cpuidle logic
  cpuidle: mvebu: Fix duplicate flags assignment
  sched/fair: Limit sched slice duration
  ...
2023-02-20 17:41:08 -08:00
David Howells
e7388b8a1a cifs: DIO to/from KVEC-type iterators should now work
DIO to/from KVEC-type iterators should now work as the iterator is passed
down to the socket in non-RDMA/non-crypto mode and in RDMA or crypto mode
care is taken to handle vmap/vmalloc correctly and not take page refs when
building a scatterlist.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Tom Talpey <tom@talpey.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 18:36:02 -06:00
David Howells
607aea3cc2 cifs: Remove unused code
Remove a bunch of functions that are no longer used and are commented out
after the conversion to use iterators throughout the I/O path.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org

Link: https://lore.kernel.org/r/164928621823.457102.8777804402615654773.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165211421039.3154751.15199634443157779005.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165348881165.2106726.2993852968344861224.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165364827876.3334034.9331465096417303889.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/166126396915.708021.2010212654244139442.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/166697261080.61150.17513116912567922274.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732033255.3186319.5527423437137895940.stgit@warthog.procyon.org.uk/ # rfc
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 18:36:02 -06:00
David Howells
3d78fe73fa cifs: Build the RDMA SGE list directly from an iterator
In the depths of the cifs RDMA code, extract part of an iov iterator
directly into an SGE list without going through an intermediate
scatterlist.

Note that this doesn't support extraction from an IOBUF- or UBUF-type
iterator (ie. user-supplied buffer).  The assumption is that the higher
layers will extract those to a BVEC-type iterator first and do whatever is
required to stop the pages from going away.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Tom Talpey <tom@talpey.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-rdma@vger.kernel.org

Link: https://lore.kernel.org/r/166697260361.61150.5064013393408112197.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732032518.3186319.1859601819981624629.stgit@warthog.procyon.org.uk/ # rfc
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 18:36:02 -06:00
David Howells
d08089f649 cifs: Change the I/O paths to use an iterator rather than a page list
Currently, the cifs I/O paths hand lists of pages from the VM interface
routines at the top all the way through the intervening layers to the
socket interface at the bottom.

This is a problem, however, for interfacing with netfslib which passes an
iterator through to the ->issue_read() method (and will pass an iterator
through to the ->issue_write() method in future).  Netfslib takes over
bounce buffering for direct I/O, async I/O and encrypted content, so cifs
doesn't need to do that.  Netfslib also converts IOVEC-type iterators into
BVEC-type iterators if necessary.

Further, cifs needs foliating - and folios may come in a variety of sizes,
so a page list pointing to an array of heterogeneous pages may cause
problems in places such as where crypto is done.

Change the cifs I/O paths to hand iov_iter iterators all the way through
instead.

Notes:

 (1) Some old routines are #if'd out to be removed in a follow up patch so
     as to avoid confusing diff, thereby making the diff output easier to
     follow.  I've removed functions that don't overlap with anything
     added.

 (2) struct smb_rqst loses rq_pages, rq_offset, rq_npages, rq_pagesz and
     rq_tailsz which describe the pages forming the buffer; instead there's
     an rq_iter describing the source buffer and an rq_buffer which is used
     to hold the buffer for encryption.

 (3) struct cifs_readdata and cifs_writedata are similarly modified to
     smb_rqst.  The ->read_into_pages() and ->copy_into_pages() are then
     replaced with passing the iterator directly to the socket.

     The iterators are stored in these structs so that they are persistent
     and don't get deallocated when the function returns (unlike if they
     were stack variables).

 (4) Buffered writeback is overhauled, borrowing the code from the afs
     filesystem to gather up contiguous runs of folios.  The XARRAY-type
     iterator is then used to refer directly to the pagecache and can be
     passed to the socket to transmit data directly from there.

     This includes:

	cifs_extend_writeback()
	cifs_write_back_from_locked_folio()
	cifs_writepages_region()
	cifs_writepages()

 (5) Pages are converted to folios.

 (6) Direct I/O uses netfs_extract_user_iter() to create a BVEC-type
     iterator from an IOBUF/UBUF-type source iterator.

 (7) smb2_get_aead_req() uses netfs_extract_iter_to_sg() to extract page
     fragments from the iterator into the scatterlists that the crypto
     layer prefers.

 (8) smb2_init_transform_rq() attached pages to smb_rqst::rq_buffer, an
     xarray, to use as a bounce buffer for encryption.  An XARRAY-type
     iterator can then be used to pass the bounce buffer to lower layers.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Paulo Alcantara <pc@cjr.nz>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org

Link: https://lore.kernel.org/r/164311907995.2806745.400147335497304099.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/164928620163.457102.11602306234438271112.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165211420279.3154751.15923591172438186144.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165348880385.2106726.3220789453472800240.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165364827111.3334034.934805882842932881.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/166126396180.708021.271013668175370826.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/166697259595.61150.5982032408321852414.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732031756.3186319.12528413619888902872.stgit@warthog.procyon.org.uk/ # rfc
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 18:36:02 -06:00
David Howells
16541195c6 cifs: Add a function to read into an iter from a socket
Add a helper function to read data from a socket into the given iterator.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org

Link: https://lore.kernel.org/r/164928617874.457102.10021662143234315566.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165211419563.3154751.18431990381145195050.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165348879662.2106726.16881134187242702351.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165364826398.3334034.12541600783145647319.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/166126395495.708021.12328677373159554478.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/166697258876.61150.3530237818849429372.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732031039.3186319.10691316510079412635.stgit@warthog.procyon.org.uk/ # rfc
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:44 -06:00
David Howells
b8713c4dbf cifs: Add some helper functions
Add some helper functions to manipulate the folio marks by iterating
through a list of folios held in an xarray rather than using a page list.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org

Link: https://lore.kernel.org/r/164928616583.457102.15157033997163988344.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165211418840.3154751.3090684430628501879.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165348878940.2106726.204291614267188735.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165364825674.3334034.3356201708659748648.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/166126394799.708021.10637797063862600488.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/166697258147.61150.9940790486999562110.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732030314.3186319.9209944805565413627.stgit@warthog.procyon.org.uk/ # rfc
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:44 -06:00
David Howells
39bc58203f cifs: Add a function to Hash the contents of an iterator
Add a function to push the contents of a BVEC-, KVEC- or XARRAY-type
iterator into a synchronous hash algorithm.

UBUF- and IOBUF-type iterators are not supported on the assumption that
either we're doing buffered I/O, in which case we won't see them, or we're
doing direct I/O, in which case the iterator will have been extracted into
a BVEC-type iterator higher up.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-crypto@vger.kernel.org

Link: https://lore.kernel.org/r/166697257423.61150.12070648579830206483.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732029577.3186319.17162612653237909961.stgit@warthog.procyon.org.uk/ # rfc
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:44 -06:00
David Howells
e5fbdde430 cifs: Add a function to build an RDMA SGE list from an iterator
Add a function to add elements onto an RDMA SGE list representing page
fragments extracted from a BVEC-, KVEC- or XARRAY-type iterator and DMA
mapped until the maximum number of elements is reached.

Nothing is done to make sure the pages remain present - that must be done
by the caller.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Tom Talpey <tom@talpey.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-rdma@vger.kernel.org

Link: https://lore.kernel.org/r/166697256704.61150.17388516338310645808.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732028840.3186319.8512284239779728860.stgit@warthog.procyon.org.uk/ # rfc
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:44 -06:00
David Howells
0185846975 netfs: Add a function to extract an iterator into a scatterlist
Provide a function for filling in a scatterlist from the list of pages
contained in an iterator.

If the iterator is UBUF- or IOBUF-type, the pages have a pin taken on them
(as FOLL_PIN).

If the iterator is BVEC-, KVEC- or XARRAY-type, no pin is taken on the
pages and it is left to the caller to manage their lifetime.  It cannot be
assumed that a ref can be validly taken, particularly in the case of a KVEC
iterator.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:43 -06:00
David Howells
85dd2c8ff3 netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator
Add a function to extract the pages from a user-space supplied iterator
(UBUF- or IOVEC-type) into a BVEC-type iterator, retaining the pages by
getting a pin on them (as FOLL_PIN) as we go.

This is useful in three situations:

 (1) A userspace thread may have a sibling that unmaps or remaps the
     process's VM during the operation, changing the assignment of the
     pages and potentially causing an error.  Retaining the pages keeps
     some pages around, even if this occurs; futher, we find out at the
     point of extraction if EFAULT is going to be incurred.

 (2) Pages might get swapped out/discarded if not retained, so we want to
     retain them to avoid the reload causing a deadlock due to a DIO
     from/to an mmapped region on the same file.

 (3) The iterator may get passed to sendmsg() by the filesystem.  If a
     fault occurs, we may get a short write to a TCP stream that's then
     tricky to recover from.

We don't deal with other types of iterator here, leaving it to other
mechanisms to retain the pages (eg. PG_locked, PG_writeback and the pipe
lock).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:43 -06:00
David Howells
4e260a8fd7 cifs: Implement splice_read to pass down ITER_BVEC not ITER_PIPE
Provide cifs_splice_read() to use a bvec rather than an pipe iterator as
the latter cannot so easily be split and advanced, which is necessary to
pass an iterator down to the bottom levels.  Upstream cifs gets around this
problem by using iov_iter_get_pages() to prefill the pipe and then passing
the list of pages down.

This is done by:

 (1) Bulk-allocate a bunch of pages to carry as much of the requested
     amount of data as possible, but without overrunning the available
     slots in the pipe and add them to an ITER_BVEC.

 (2) Synchronously call ->read_iter() to read into the buffer.

 (3) Discard any unused pages.

 (4) Load the remaining pages into the pipe in order and advance the head
     pointer.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-cifs@vger.kernel.org

Link: https://lore.kernel.org/r/166732028113.3186319.1793644937097301358.stgit@warthog.procyon.org.uk/ # rfc
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:43 -06:00
David Howells
7c8e01ebf2 splice: Export filemap/direct_splice_read()
filemap_splice_read() and direct_splice_read() should be exported.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-cifs@vger.kernel.org
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:43 -06:00
David Howells
33b3b04154 splice: Add a func to do a splice from an O_DIRECT file without ITER_PIPE
Implement a function, direct_file_splice(), that deals with this by using
an ITER_BVEC iterator instead of an ITER_PIPE iterator as the former won't
free its buffers when reverted.  The function bulk allocates all the
buffers it thinks it is going to use in advance, does the read
synchronously and only then trims the buffer down.  The pages we did use
get pushed into the pipe.

This fixes a problem with the upcoming iov_iter_extract_pages() function,
whereby pages extracted from a non-user-backed iterator such as ITER_PIPE
aren't pinned.  __iomap_dio_rw(), however, calls iov_iter_revert() to
shorten the iterator to just the bufferage it is going to use - which has
the side-effect of freeing the excess pipe buffers, even though they're
attached to a bio and may get written to by DMA (thanks to Hillf Danton for
spotting this[1]).

This then causes memory corruption that is particularly noticeable when the
syzbot test[2] is run.  The test boils down to:

	out = creat(argv[1], 0666);
	ftruncate(out, 0x800);
	lseek(out, 0x200, SEEK_SET);
	in = open(argv[1], O_RDONLY | O_DIRECT | O_NOFOLLOW);
	sendfile(out, in, NULL, 0x1dd00);

run repeatedly in parallel.  What I think is happening is that ftruncate()
occasionally shortens the DIO read that's about to be made by sendfile's
splice core by reducing i_size.

This should be more efficient for DIO read by virtue of doing a bulk page
allocation, but slightly less efficient by ignoring any partial page in the
pipe.

Reported-by: syzbot+a440341a59e3b7142895@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230207094731.1390-1-hdanton@sina.com/ [1]
Link: https://lore.kernel.org/r/000000000000b0b3c005f3a09383@google.com/ [2]
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:43 -06:00
Kees Cook
eb3e28c1e8 smb3: Replace smb2pdu 1-element arrays with flex-arrays
The kernel is globally removing the ambiguous 0-length and 1-element
arrays in favor of flexible arrays, so that we can gain both compile-time
and run-time array bounds checking[1].

Replace the trailing 1-element array with a flexible array in the
following structures:

	struct smb2_err_rsp
	struct smb2_tree_connect_req
	struct smb2_negotiate_rsp
	struct smb2_sess_setup_req
	struct smb2_sess_setup_rsp
	struct smb2_read_req
	struct smb2_read_rsp
	struct smb2_write_req
	struct smb2_write_rsp
	struct smb2_query_directory_req
	struct smb2_query_directory_rsp
	struct smb2_set_info_req
	struct smb2_change_notify_rsp
	struct smb2_create_rsp
	struct smb2_query_info_req
	struct smb2_query_info_rsp

Replace the trailing 1-element array with a flexible array, but leave
the existing structure padding:

	struct smb2_file_all_info
	struct smb2_lock_req

Adjust all related size calculations to match the changes to sizeof().

No machine code output or .data section differences are produced after
these changes.

[1] For lots of details, see both:
    https://docs.kernel.org/process/deprecated.html#zero-length-and-one-element-arrays
    https://people.kernel.org/kees/bounded-flexible-arrays-in-c

Cc: Steve French <sfrench@samba.org>
Cc: Paulo Alcantara <pc@cjr.nz>
Cc: Ronnie Sahlberg <lsahlber@redhat.com>
Cc: Shyam Prasad N <sprasad@microsoft.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:43 -06:00
Paulo Alcantara
0e9bd27b2a cifs: get rid of dns resolve worker
We already upcall to resolve hostnames during reconnect by calling
reconn_set_ipaddr_from_hostname(), so there is no point in having a
worker to periodically call it.

Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Reviewed-by <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:43 -06:00
Linus Torvalds
5b0ed59649 for-6.3/block-2023-02-16
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPvfncQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpob2EADXJxcr2jjYHm/7cjKkyuVX8fr80dNdMeuY
 JFdsjG1k6Uj73BVhQQWYTcs/PsrWBHWRsv6uz4WgOELj55eXmf5Q0kJszyUeJW33
 /DjqLvtoppVcYf80xE13wKvCfn73BjwQo6xkGM0qAYn15eaXiD/Ax3xC6eJlsBeK
 PEw7EJyhacbSxZa/1D2B6+mqII1jUQWProTCc3udZ4JHi3WvdWa3Rda0qCqHl4a1
 +K2aP2YTFIRPxBzfMNa/CafWVIFubTdht+4Ds6R60RImzB9e0VUBfcsiUyW5Zg7L
 Fwv7ptXuWrALwVNdW56Oz1QikBxn2pdRR2HMLwKJW1MD8kP9r8LMm2jV5Rhiwe0B
 OQsGRYkOzBvw+bxeP5fvk0iPGVMz6ActH4gkraA5QdLqayDaFYOadlhqz0uRo5SH
 Fb42Vl658K/MHDSIk8U58TNkmrsIJsBGohXI9DOGINPPvv3XOPi4Q1HmXkGRmii0
 y+lNU/QEGh7xXXew29SPP76uQpQaYfC7NxXCMw/OpOMwehzjsjshmM2lpxi8zsgt
 PJUmfHv5qxCplNmTJXmUpmX7sS7550HUdu9FJb13DM+gzKg8bk9jWVuLrzqrVlG5
 1hKWEl1+heg1heRfaIuJVLbPI0au6Sb4uqhih/PHyrP9TWIoAruDbDJM65GKTxyE
 2uEgcHzHQw==
 =poRc
 -----END PGP SIGNATURE-----

Merge tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe updates via Christoph:
      - Small improvements to the logging functionality (Amit Engel)
      - Authentication cleanups (Hannes Reinecke)
      - Cleanup and optimize the DMA mapping cod in the PCIe driver
        (Keith Busch)
      - Work around the command effects for Format NVM (Keith Busch)
      - Misc cleanups (Keith Busch, Christoph Hellwig)
      - Fix and cleanup freeing single sgl (Keith Busch)

 - MD updates via Song:
      - Fix a rare crash during the takeover process
      - Don't update recovery_cp when curr_resync is ACTIVE
      - Free writes_pending in md_stop
      - Change active_io to percpu

 - Updates to drbd, inching us closer to unifying the out-of-tree driver
   with the in-tree one (Andreas, Christoph, Lars, Robert)

 - BFQ update adding support for multi-actuator drives (Paolo, Federico,
   Davide)

 - Make brd compliant with REQ_NOWAIT (me)

 - Fix for IOPOLL and queue entering, fixing stalled IO waiting on
   timeouts (me)

 - Fix for REQ_NOWAIT with multiple bios (me)

 - Fix memory leak in blktrace cleanup (Greg)

 - Clean up sbitmap and fix a potential hang (Kemeng)

 - Clean up some bits in BFQ, and fix a bug in the request injection
   (Kemeng)

 - Clean up the request allocation and issue code, and fix some bugs
   related to that (Kemeng)

 - ublk updates and fixes:
      - Add support for unprivileged ublk (Ming)
      - Improve device deletion handling (Ming)
      - Misc (Liu, Ziyang)

 - s390 dasd fixes (Alexander, Qiheng)

 - Improve utility of request caching and fixes (Anuj, Xiao)

 - zoned cleanups (Pankaj)

 - More constification for kobjs (Thomas)

 - blk-iocost cleanups (Yu)

 - Remove bio splitting from drivers that don't need it (Christoph)

 - Switch blk-cgroups to use struct gendisk. Some of this is now
   incomplete as select late reverts were done. (Christoph)

 - Add bvec initialization helpers, and convert callers to use that
   rather than open-coding it (Christoph)

 - Misc fixes and cleanups (Jinke, Keith, Arnd, Bart, Li, Martin,
   Matthew, Ulf, Zhong)

* tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux: (169 commits)
  brd: use radix_tree_maybe_preload instead of radix_tree_preload
  block: use proper return value from bio_failfast()
  block: bio-integrity: Copy flags when bio_integrity_payload is cloned
  block: Fix io statistics for cgroup in throttle path
  brd: mark as nowait compatible
  brd: check for REQ_NOWAIT and set correct page allocation mask
  brd: return 0/-error from brd_insert_page()
  block: sync mixed merged request's failfast with 1st bio's
  Revert "blk-cgroup: pin the gendisk in struct blkcg_gq"
  Revert "blk-cgroup: pass a gendisk to blkg_lookup"
  Revert "blk-cgroup: delay blk-cgroup initialization until add_disk"
  Revert "blk-cgroup: delay calling blkcg_exit_disk until disk_release"
  Revert "blk-cgroup: move the cgroup information to struct gendisk"
  nvme-pci: remove iod use_sgls
  nvme-pci: fix freeing single sgl
  block: ublk: check IO buffer based on flag need_get_data
  s390/dasd: Fix potential memleak in dasd_eckd_init()
  s390/dasd: sort out physical vs virtual pointers usage
  block: Remove the ALLOC_CACHE_SLACK constant
  block: make kobj_type structures constant
  ...
2023-02-20 14:27:21 -08:00
Linus Torvalds
553637f73c for-6.3/dio-2023-02-16
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPueAQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplopEACo17a4Z2p2xCedA0NCqX2ggtsSIdYiluPm
 pgdBzIEsgwKo1XVLGRgGiC8VdMRuzO4Zh/NGn4iRF1a68wjgjnwGWY7r052TUoSr
 q1yya739BpffnkXjj15x86cwl+5rHv2RQkm15+2HqBgcruA63/ZgdKBtjj+EtVKs
 zYOlmgyfFbkn8AdULMGiDKP4lixV8gUelv6vWneBwNrj4iSLnuN1+8nJNsl4wxwg
 ImSpx63AzhUoeL6byc+fmiA8fZhDhSvwON2tCyyCmOjlFM/TLrsm5t1juWiDid1O
 UROkQwQtsmjSUq3ow5fRJfjbZ3HLa1uGQr95DYHy0OBRAteAhDY5Upv0DXNL0ZBh
 uNNg8AXtJbyc+pLHWnncyiTzi+3eWs7WiMn04/a5eDhFvcJ0PZjLIgRi5j1ezUS1
 bWqoPaAIxoMD83WoMxjnKvBpGeMzPHvNTijeZjkGOu0vOk8JhXqNmLTjNG9aLtzf
 1Nvp0o8AqtQAW7cgFazZSWtw4bPk/wZ7mW0zHtqLDHIzXkc7A/Uo0ftdv84G08aW
 pvakNz4aNLwSPf7hxgPP9SgS9CeHhxK8PS6uk3V788SI8qGiew11+EcTNGkQNmvw
 /ItCo93UaWD/7SZLObTLslmet7rFHzz6PXaXrMxrSvaeZMkgr7DWEy9XS+ueOtXO
 fS8QhJX11w==
 =IU45
 -----END PGP SIGNATURE-----

Merge tag 'for-6.3/dio-2023-02-16' of git://git.kernel.dk/linux

Pull legacy dio update from Jens Axboe:
 "We only have a few file systems that use the old dio code, make them
  select it rather than build it unconditionally"

* tag 'for-6.3/dio-2023-02-16' of git://git.kernel.dk/linux:
  fs: build the legacy direct I/O code conditionally
  fs: move sb_init_dio_done_wq out of direct-io.c
2023-02-20 14:10:36 -08:00
Linus Torvalds
eca3a04f14 dlm for 6.3
This patch set fixes some races in the lowcomms startup and shutdown code
 that were found by targetted stress testing that quickly and repeatedly
 joins and leaves lockspaces.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJj870xAAoJEDgbc8f8gGmqGasP/2fy7GIe/NNaTDVfRPzJMVsh
 v5GP42q7pwuaR2rTZxHOpALfDJXx3wu9zsA3A6luSaYRRHxdntztaVT2vaAFa13o
 KI2IDB2DIdJ1PqDVgq8E9I6+X/y5gFpeq9Wyo2FdYATncQ370wN42uIk3EECPTZh
 73HZHBkACQJ28ljt4zWg9ANn1rSiZQjaKayFwbnaH0AjbWrTehccyKR0BE2rNd6c
 EqoRX0tRJLH1RVaOHKW+m4EZaMDo2m5hQH/lhcwth/uqxOqQYOSyMPneKVj/TsI8
 8ToXRCqOYwV3uNZVhgRxu3hHPd2e0l3OgU+mlvNqURTSANr8/D3kEZp90AWDV2yg
 yGtaSYspKY7tjgdIWZbTffxXu+5CZs0xTSkw53xjWyCZ4++c/MO4WO3w8/QIsmWd
 Ow8l29YQc5t7XI0zjpsgyQMaKHQLXE06y9F1jZlZpxa0PfvDOevhl+8YQrKGIJYh
 B+QzjtkmQLpD9RyBRdmxIrS5dl/Wv+wa4zhstD9jkUAkKvZ5ndYvCVtjRu5F8PZ7
 HJzUorZFqDgKfXqTLt91irtS4OdMY4xPWr/V1K8NekTQ/+GY7Q1Qu1MlBOW2aqIk
 UHfn1VaxdFom+tdFF83WG+kmnUu2+YZK83vtnFoZlTsbvfY//zveR1Hbt4MnTj6G
 Z00nx5eb2SaKJFHD0J8z
 =4Def
 -----END PGP SIGNATURE-----

Merge tag 'dlm-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:
 "This fixes some races in the lowcomms startup and shutdown code that
  were found by targeted stress testing that quickly and repeatedly
  joins and leaves lockspaces"

* tag 'dlm-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  fs: dlm: remove unnecessary waker_up() calls
  fs: dlm: move state change into else branch
  fs: dlm: remove newline in log_print
  fs: dlm: reduce the shutdown timeout to 5 secs
  fs: dlm: make dlm sequence id more robust
  fs: dlm: wait until all midcomms nodes detect version
  fs: dlm: ignore unexpected non dlm opts msgs
  fs: dlm: bring back previous shutdown handling
  fs: dlm: send FIN ack back in right cases
  fs: dlm: move sending fin message into state change handling
  fs: dlm: don't set stop rx flag after node reset
  fs: dlm: fix race setting stop tx flag
  fs: dlm: be sure to call dlm_send_queue_flush()
  fs: dlm: fix use after free in midcomms commit
  fs: dlm: start midcomms before scand
  fs/dlm: Remove "select SRCU"
  fs: dlm: fix return value check in dlm_memory_init()
2023-02-20 13:05:24 -08:00
Linus Torvalds
885ce48739 for-6.3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmPzxWcACgkQxWXV+ddt
 WDt+fRAAg5pz7gWNMtIK30gp/uojjAkCWXymxRtK2tZU3naI+6IYSAKxuKq8Iz1Y
 drdlpSvTX/Gv3XlGB9QuoH6digTjQzeVzjAm0eP6w8t8354KGSRUYdtoFp8I8E5Z
 q0JUuZ6w/KvpZfOIsmcgpOScgcl+8+UlOxs2iuSrOvAqP8Dg1VCt5vBm7htIb0tm
 5ClbgmIacxWrOII55XGuY0mWuZSlS4hdyWdYMelvtM8aPPG+e8eEzKjscVOOueLz
 Smi1kN5QU3o+m4oKjN1OJlKfeURdbcZUwva9zOsegSbPHUzNwIao44cQ5cQhMR0r
 kI3nCpJwGKdUd6IblEdcqBN5F4V64edLSruOLuGYzxySnEWhFE2YU2xW/v5b1eQW
 GHurI52FGrPqcX9FgQNzfTjQzk341iQ0QIs5exycJH7xeohEZnlaK2yNUngKSo1C
 naqczEMMMcxNjQaooUuxRkL/zz36D/Dkyo2YOCODtWyu61XY9LqvaxMvClFI20lL
 40dzzYnnMQwkXJrQ/MVQhz1BBaPVqizt8+ErL7GQp2CWr9miD6mcA5b2pyZm5Q3r
 hHadzeTXXS7P9g9UnuDxpZqkhvadGC2Sy4l/D6jURyKFzr8mtplaRRwUS2gSuP3z
 zxavvP4UukwNWXxDz755NAhiGbA+xpSMATKCrZ/Sdogvxe8IhRg=
 =NCpw
 -----END PGP SIGNATURE-----

Merge tag 'for-6.3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "The usual mix of performance improvements and new features.

  The core change is reworking how checksums are processed, with
  followup cleanups and simplifications. There are two minor changes in
  block layer and iomap code.

  Features:

   - block group allocation class heuristics:
      - pack files by size (up to 128k, up to 8M, more) to avoid
        fragmentation in block groups, assuming that file size and life
        time is correlated, in particular this may help during balance
      - with tracepoints and extensible in the future

  Performance:

   - send: cache directory utimes and only emit the command when
     necessary
      - speedup up to 10x
      - smaller final stream produced (no redundant utimes commands
        issued)
      - compatibility not affected

   - fiemap: skip backref checks for shared leaves
      - speedup 3x on sample filesystem with all leaves shared (e.g. on
        snapshots)

   - micro optimized b-tree key lookup, speedup in metadata operations
     (sample benchmark: fs_mark +10% of files/sec)

  Core changes:

   - change where checksumming is done in the io path:
      - checksum and read repair does verification at lower layer
      - cascaded cleanups and simplifications

   - raid56 refactoring and cleanups

  Fixes:

   - sysfs: make sure that a run-time change of a feature is correctly
     tracked by the feature files

   - scrub: better reporting of tree block errors

  Other:

   - locally enable -Wmaybe-uninitialized after fixing all warnings

   - misc cleanups, spelling fixes

  Other code:

   - block: export bio_split_rw

   - iomap: remove IOMAP_F_ZONE_APPEND"

* tag 'for-6.3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (109 commits)
  btrfs: make kobj_type structures constant
  btrfs: remove the bdev argument to btrfs_rmap_block
  btrfs: don't rely on unchanging ->bi_bdev for zone append remaps
  btrfs: never return true for reads in btrfs_use_zone_append
  btrfs: pass a btrfs_bio to btrfs_use_append
  btrfs: set bbio->file_offset in alloc_new_bio
  btrfs: use file_offset to limit bios size in calc_bio_boundaries
  btrfs: do unsigned integer division in the extent buffer binary search loop
  btrfs: eliminate extra call when doing binary search on extent buffer
  btrfs: raid56: handle endio in scrub_rbio
  btrfs: raid56: handle endio in recover_rbio
  btrfs: raid56: handle endio in rmw_rbio
  btrfs: raid56: submit the read bios from scrub_assemble_read_bios
  btrfs: raid56: fold rmw_read_wait_recover into rmw_read_bios
  btrfs: raid56: fold recover_assemble_read_bios into recover_rbio
  btrfs: raid56: add a bio_list_put helper
  btrfs: raid56: wait for I/O completion in submit_read_bios
  btrfs: raid56: simplify code flow in rmw_rbio
  btrfs: raid56: simplify error handling and code flow in raid56_parity_write
  btrfs: replace btrfs_wait_tree_block_writeback by wait_on_extent_buffer_writeback
  ...
2023-02-20 12:54:27 -08:00
Linus Torvalds
274978f173 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmPvZWkACgkQnJ2qBz9k
 QNkVtQf/V515KIb7lEkwOjlF7AQg5MS/c8zuodHISzWYhZyd0KSTb+qnF/QzLvVm
 cJ3cIPXrIyVw4Eeqh0qQvukOYCcBvUa1IBW5kePiy3mHiHRD2PRgaxSGBXTXqqPG
 xXagwllrn3/mG4ZKXlNYFrzgoshSFFeBdkLGEi+/L6DAe0B+mG+FIHON1eWylgOT
 j1D+/k9RNvRhRU8WtStcI4u9mnVPqUI2RSWUpjxuNzUPtyFflPVNCz+bkXXovPQ0
 ZQY2HeZcs7jsorCmeSHUzTt5bbj3BfhO3uWL4/wnHgp+88OBRUUyvMrNbOF97xd6
 KFqbJVQbSevasUSEvCS+3+EChzjGoA==
 =DaJi
 -----END PGP SIGNATURE-----

Merge tag 'fixes_for_v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull UDF and ext2 fixes from Jan Kara:

 - Rewrite of udf directory iteration code to address multiple syzbot
   reports

 - Fixes to udf extent handling and block mapping code to address
   several syzbot reports and filesystem corruption issues uncovered by
   fsx & fsstress

 - Convert udf to kmap_local()

 - Add sanity checks when loading udf bitmaps

 - Drop old VARCONV support which I've never seen used and which was
   broken for quite some years without anybody noticing

 - Finish conversion of ext2 to kmap_local()

 - One fix to mpage_writepages() on which other udf fixes depend

* tag 'fixes_for_v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (78 commits)
  udf: Avoid directory type conversion failure due to ENOMEM
  udf: Use unsigned variables for size calculations
  udf: remove reporting loc in debug output
  udf: Check consistency of Space Bitmap Descriptor
  udf: Fix file counting in LVID
  udf: Limit file size to 4TB
  udf: Don't return bh from udf_expand_dir_adinicb()
  udf: Convert udf_expand_file_adinicb() to avoid kmap_atomic()
  udf: Convert udf_adinicb_writepage() to memcpy_to_page()
  udf: Switch udf_adinicb_readpage() to kmap_local_page()
  udf: Move udf_adinicb_readpage() to inode.c
  udf: Mark aops implementation static
  udf: Switch to single address_space_operations
  udf: Add handling of in-ICB files to udf_bmap()
  udf: Convert all file types to use udf_write_end()
  udf: Convert in-ICB files to use udf_write_begin()
  udf: Convert in-ICB files to use udf_direct_IO()
  udf: Convert in-ICB files to use udf_writepages()
  udf: Unify .read_folio for normal and in-ICB files
  udf: Fix off-by-one error when discarding preallocation
  ...
2023-02-20 12:44:08 -08:00
Linus Torvalds
cd776a4342 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmPvZJwACgkQnJ2qBz9k
 QNlPcAf/UL7DDv37vnvfcFTa9lRyC0dXsgxnVZUwMU0hJs/ewbmueYGnJSBRTVLG
 7ad7bKYQVWsjhas4YulofgRrFWxVDcR32qbC+pDo/X6vGjo4tDl2CNPYREY3n3kN
 xR6Ca7nPxBH5AVYwwOqBJSTqhWGy1TSDeuskndS0P+YtTv6Y4Zvm4UEiNAXJ4nwo
 5Nd+bsPpkrEgQqO/NK2rCXfBfkJr4jAMcp+Nn2zAP44icZAXJYn8QrN3gVL6OZlN
 RKq36MGQf52lxyufVyFCulWKRbxhEKUS0nURZgAG+Sv87DlSuBJgRVG7xJ1baPpK
 2g7wG2jaT7YMfA4PWms/rwAj/CkGLA==
 =NRh0
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:
 "Support for auditing decisions regarding fanotify permission events"

* tag 'fsnotify_for_v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fanotify,audit: Allow audit to use the full permission event response
  fanotify: define struct members to hold response decision context
  fanotify: Ensure consistent variable type for response
2023-02-20 12:38:27 -08:00
Linus Torvalds
6639c3ce7f fsverity updates for 6.3
Fix the longstanding implementation limitation that fsverity was only
 supported when the Merkle tree block size, filesystem block size, and
 PAGE_SIZE were all equal.  Specifically, add support for Merkle tree
 block sizes less than PAGE_SIZE, and make ext4 support fsverity on
 filesystems where the filesystem block size is less than PAGE_SIZE.
 
 Effectively, this means that fsverity can now be used on systems with
 non-4K pages, at least on ext4.  These changes have been tested using
 the verity group of xfstests, newly updated to cover the new code paths.
 
 Also update fs/verity/ to support verifying data from large folios.
 There's also a similar patch for fs/crypto/, to support decrypting data
 from large folios, which I'm including in this pull request to avoid a
 merge conflict between the fscrypt and fsverity branches.
 
 There will be a merge conflict in fs/buffer.c with some of the foliation
 work in the mm tree.  Please use the merge resolution from linux-next.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCY/KJtRQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK/A/AP0RUlCClBRuHwXPRG0we8R1L153ga4s
 Vl+xRpCr+SswXwEAiOEpYN5cXoVKzNgxbEXo2pQzxi5lrpjZgUI6CL3DuQs=
 =ZRFX
 -----END PGP SIGNATURE-----

Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux

Pull fsverity updates from Eric Biggers:
 "Fix the longstanding implementation limitation that fsverity was only
  supported when the Merkle tree block size, filesystem block size, and
  PAGE_SIZE were all equal.

  Specifically, add support for Merkle tree block sizes less than
  PAGE_SIZE, and make ext4 support fsverity on filesystems where the
  filesystem block size is less than PAGE_SIZE.

  Effectively, this means that fsverity can now be used on systems with
  non-4K pages, at least on ext4. These changes have been tested using
  the verity group of xfstests, newly updated to cover the new code
  paths.

  Also update fs/verity/ to support verifying data from large folios.

  There's also a similar patch for fs/crypto/, to support decrypting
  data from large folios, which I'm including in here to avoid a merge
  conflict between the fscrypt and fsverity branches"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
  fscrypt: support decrypting data from large folios
  fsverity: support verifying data from large folios
  fsverity.rst: update git repo URL for fsverity-utils
  ext4: allow verity with fs block size < PAGE_SIZE
  fs/buffer.c: support fsverity in block_read_full_folio()
  f2fs: simplify f2fs_readpage_limit()
  ext4: simplify ext4_readpage_limit()
  fsverity: support enabling with tree block size < PAGE_SIZE
  fsverity: support verification with tree block size < PAGE_SIZE
  fsverity: replace fsverity_hash_page() with fsverity_hash_block()
  fsverity: use EFBIG for file too large to enable verity
  fsverity: store log2(digest_size) precomputed
  fsverity: simplify Merkle tree readahead size calculation
  fsverity: use unsigned long for level_start
  fsverity: remove debug messages and CONFIG_FS_VERITY_DEBUG
  fsverity: pass pos and size to ->write_merkle_tree_block
  fsverity: optimize fsverity_cleanup_inode() on non-verity files
  fsverity: optimize fsverity_prepare_setattr() on non-verity files
  fsverity: optimize fsverity_file_open() on non-verity files
2023-02-20 12:33:41 -08:00
Linus Torvalds
f18f9845f2 fscrypt updates for 6.3
Simplify the implementation of the test_dummy_encryption mount option by
 adding the "test dummy key" on-demand.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCY/J7NRQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK3UVAP9Likiuy47D/RM4mOsPMwLAlQRx5uW6
 iGxT6DutekA7DwEA4hNjEQQ/EKO+UxFb+fBCX+xpTDbS3LB7CxGsqHzZJQM=
 =SiNJ
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux

Pull fscrypt updates from Eric Biggers:
 "Simplify the implementation of the test_dummy_encryption mount option
  by adding the 'test dummy key' on-demand"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
  fscrypt: clean up fscrypt_add_test_dummy_key()
  fs/super.c: stop calling fscrypt_destroy_keyring() from __put_super()
  f2fs: stop calling fscrypt_add_test_dummy_key()
  ext4: stop calling fscrypt_add_test_dummy_key()
  fscrypt: add the test dummy encryption key on-demand
2023-02-20 12:29:27 -08:00
Linus Torvalds
dc483c851f Changes since last update:
- Add per-cpu kthreads for low-latency decompression for Android
    use cases;
 
  - Get rid of tagged pointer helpers since they are rarely used now;
 
  - Several code cleanups to reduce codebase;
 
  - Documentation and MAINTAINERS updates.
 -----BEGIN PGP SIGNATURE-----
 
 iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCY/IDjhEceGlhbmdAa2Vy
 bmVsLm9yZwAKCRA5NzHcH7XmBNbTAQDT2njll/B2JSYbVC2I2HYTZSyFXEaHhH+M
 6gHRbEhTWAD/VNiAcdE600IkUwut/78tDvwlz/XJSd2JQMMwkTSviwc=
 =oroQ
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs updates from Gao Xiang:
 "The most noticeable feature for this cycle is per-CPU kthread
  decompression since Android use cases need low-latency I/O handling in
  order to ensure the app runtime performance, currently unbounded
  workqueue latencies are not quite good for production on many aarch64
  hardwares and thus we need to introduce a deterministic expectation
  for these. Decompression is CPU-intensive and it is sleepable for
  EROFS, so other alternatives like decompression under softirq contexts
  are not considered. More details are in the corresponding commit
  message.

  Others are random cleanups around the whole codebase and we will
  continue to clean up further in the next few months.

  Due to Lunar New Year holidays, some other new features were not
  completely reviewed and solidified as expected and we may delay them
  into the next version.

  Summary:

   - Add per-cpu kthreads for low-latency decompression for Android use
     cases

   - Get rid of tagged pointer helpers since they are rarely used now

   - Several code cleanups to reduce codebase

   - Documentation and MAINTAINERS updates"

* tag 'erofs-for-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: (21 commits)
  erofs: fix an error code in z_erofs_init_zip_subsystem()
  erofs: unify anonymous inodes for blob
  erofs: relinquish volume with mutex held
  erofs: maintain cookies of share domain in self-contained list
  erofs: remove unused device mapping in meta routine
  MAINTAINERS: erofs: Add Documentation/ABI/testing/sysfs-fs-erofs
  Documentation/ABI: sysfs-fs-erofs: update supported features
  erofs: remove unused EROFS_GET_BLOCKS_RAW flag
  erofs: update print symbols for various flags in trace
  erofs: make kobj_type structures constant
  erofs: add per-cpu threads for decompression as an option
  erofs: tidy up internal.h
  erofs: get rid of z_erofs_do_map_blocks() forward declaration
  erofs: move zdata.h into zdata.c
  erofs: remove tagged pointer helpers
  erofs: avoid tagged pointers to mark sync decompression
  erofs: get rid of erofs_inode_datablocks()
  erofs: simplify iloc()
  erofs: get rid of debug_one_dentry()
  erofs: remove linux/buffer_head.h dependency
  ...
2023-02-20 12:23:40 -08:00
Linus Torvalds
91bc559d8d fs.acl.v6.3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY+5NzQAKCRCRxhvAZXjc
 otv2AP9wJtg+RL01iYiUE2mRMYxq4R79yWrtPEyuDEZIq5tQSwEA/H4yk7EHgHMS
 aKnEfny/P9JjKPtZzsxhMQcpiIVewQs=
 =+Q0C
 -----END PGP SIGNATURE-----

Merge tag 'fs.acl.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull vfs acl update from Christian Brauner:
 "This contains a single update to the internal get acl method and
  replaces an open-coded cmpxchg() comparison with with try_cmpxchg().

  It's clearer and also beneficial on some architectures"

* tag 'fs.acl.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
  posix_acl: Use try_cmpxchg in get_acl
2023-02-20 12:14:33 -08:00
Linus Torvalds
ea5aac6fae fs.v6.3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY+5SogAKCRCRxhvAZXjc
 orVwAP4jJ1dPZYx1xHip9TfB5fv5xHz3euhvWns6qGJdVYoHzwEAhVxgYUpqWdXX
 L/+VKRFFujYxsSXP4BbS3xDPUJeQFAI=
 =ccK2
 -----END PGP SIGNATURE-----

Merge tag 'fs.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull vfs hardening update from Christian Brauner:
 "Jan pointed out that during shutdown both filp_close() and super block
  destruction will use basic printk logging when bugs are detected. This
  causes issues in a few scenarios:

   - Tools like syzkaller cannot figure out that the logged message
     indicates a bug.

   - Users that explicitly opt in to have the kernel bug on data
     corruption by selecting CONFIG_BUG_ON_DATA_CORRUPTION should see
     the kernel crash when they did actually select that option.

   - When there are busy inodes after the superblock is shut down later
     access to such a busy inodes walks through freed memory. It would
     be better to cleanly crash instead.

  All of this can be addressed by using the already existing
  CHECK_DATA_CORRUPTION() macro in these places when kernel bugs are
  detected. Its logging improvement is useful for all users.

  Otherwise this only has a meaningful behavioral effect when users do
  select CONFIG_BUG_ON_DATA_CORRUPTION which means this is backward
  compatible for regular users"

* tag 'fs.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
  fs: Use CHECK_DATA_CORRUPTION() when kernel bugs are detected
2023-02-20 12:03:55 -08:00
Linus Torvalds
05e6295f7b fs.idmapped.v6.3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY+5NlQAKCRCRxhvAZXjc
 orOaAP9i2h3OJy95nO2Fpde0Bt2UT+oulKCCcGlvXJ8/+TQpyQD/ZQq47gFQ0EAz
 Br5NxeyGeecAb0lHpFz+CpLGsxMrMwQ=
 =+BG5
 -----END PGP SIGNATURE-----

Merge tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull vfs idmapping updates from Christian Brauner:

 - Last cycle we introduced the dedicated struct mnt_idmap type for
   mount idmapping and the required infrastucture in 256c8aed2b ("fs:
   introduce dedicated idmap type for mounts"). As promised in last
   cycle's pull request message this converts everything to rely on
   struct mnt_idmap.

   Currently we still pass around the plain namespace that was attached
   to a mount. This is in general pretty convenient but it makes it easy
   to conflate namespaces that are relevant on the filesystem with
   namespaces that are relevant on the mount level. Especially for
   non-vfs developers without detailed knowledge in this area this was a
   potential source for bugs.

   This finishes the conversion. Instead of passing the plain namespace
   around this updates all places that currently take a pointer to a
   mnt_userns with a pointer to struct mnt_idmap.

   Now that the conversion is done all helpers down to the really
   low-level helpers only accept a struct mnt_idmap argument instead of
   two namespace arguments.

   Conflating mount and other idmappings will now cause the compiler to
   complain loudly thus eliminating the possibility of any bugs. This
   makes it impossible for filesystem developers to mix up mount and
   filesystem idmappings as they are two distinct types and require
   distinct helpers that cannot be used interchangeably.

   Everything associated with struct mnt_idmap is moved into a single
   separate file. With that change no code can poke around in struct
   mnt_idmap. It can only be interacted with through dedicated helpers.
   That means all filesystems are and all of the vfs is completely
   oblivious to the actual implementation of idmappings.

   We are now also able to extend struct mnt_idmap as we see fit. For
   example, we can decouple it completely from namespaces for users that
   don't require or don't want to use them at all. We can also extend
   the concept of idmappings so we can cover filesystem specific
   requirements.

   In combination with the vfs{g,u}id_t work we finished in v6.2 this
   makes this feature substantially more robust and thus difficult to
   implement wrong by a given filesystem and also protects the vfs.

 - Enable idmapped mounts for tmpfs and fulfill a longstanding request.

   A long-standing request from users had been to make it possible to
   create idmapped mounts for tmpfs. For example, to share the host's
   tmpfs mount between multiple sandboxes. This is a prerequisite for
   some advanced Kubernetes cases. Systemd also has a range of use-cases
   to increase service isolation. And there are more users of this.

   However, with all of the other work going on this was way down on the
   priority list but luckily someone other than ourselves picked this
   up.

   As usual the patch is tiny as all the infrastructure work had been
   done multiple kernel releases ago. In addition to all the tests that
   we already have I requested that Rodrigo add a dedicated tmpfs
   testsuite for idmapped mounts to xfstests. It is to be included into
   xfstests during the v6.3 development cycle. This should add a slew of
   additional tests.

* tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits)
  shmem: support idmapped mounts for tmpfs
  fs: move mnt_idmap
  fs: port vfs{g,u}id helpers to mnt_idmap
  fs: port fs{g,u}id helpers to mnt_idmap
  fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
  fs: port i_{g,u}id_{needs_}update() to mnt_idmap
  quota: port to mnt_idmap
  fs: port privilege checking helpers to mnt_idmap
  fs: port inode_owner_or_capable() to mnt_idmap
  fs: port inode_init_owner() to mnt_idmap
  fs: port acl to mnt_idmap
  fs: port xattr to mnt_idmap
  fs: port ->permission() to pass mnt_idmap
  fs: port ->fileattr_set() to pass mnt_idmap
  fs: port ->set_acl() to pass mnt_idmap
  fs: port ->get_acl() to pass mnt_idmap
  fs: port ->tmpfile() to pass mnt_idmap
  fs: port ->rename() to pass mnt_idmap
  fs: port ->mknod() to pass mnt_idmap
  fs: port ->mkdir() to pass mnt_idmap
  ...
2023-02-20 11:53:11 -08:00
Linus Torvalds
de630176bd i_version handling changes for v6.3
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmPuDAUTHGpsYXl0b25A
 a2VybmVsLm9yZwAKCRAADmhBGVaCFQu9D/9ZEYVvoY3s7tK5Cw+My+1U0sNQFUW0
 D46kVF0bLxv09G8prI6FZFQORguSMZli3TEsdyUbU12ak3kNHYkTW6AxxnN0jQVE
 ur6f5iOauh9mWcv2S9/gulz9KH2uiG1rGHWfClfoZgQI1QCatfaKh9nrEcNrPy71
 qpTbcN8AOhhYDP2ZAHajuwTuak1jLBHYN7ihfaUPyex9kAxUalbw5wIsXYfTB+Mn
 /xPuCAN9YL5KpuUzVSF7QfWWVEiqErSFeFip+NCS/srihURKr14JuVSwUwkTn8rR
 ONWBawXDw+eiWZ1nQmzvPTazbvsQGO1cKx8A20nOLmJjtRqAGlfngDQrCchkoSg7
 SwYDiDpDqkg5PTWSfkDMG1wY/tEm6dO89P912kY+jFuGJIrG4HPkOR20StmKkuz5
 La4bUs5UwucxxS9Hlu5+aFGrYFtuixWpR/1Wu5ITRtoq+j5vzHHrQ1wkbezNXs0w
 VxS7YqW1TDDUUXgBzYOctmQ50ADVVC63J6AbhA7w7Qh+1vpEvyqYcWZw/Vd54bPP
 tDh3FFnEGJjX0QqtdpHpyIiOm/BXiJ/8y96vEAxVzXtOmb59AQDCPq99HgWa4G8L
 62ltF7qkgXKOsWCaKk07uycE0zISnx1ZXk9bvLFeN02XhRkPxevV93Yn/YiJ7nZe
 xcR23eGRsseA/w==
 =EKA6
 -----END PGP SIGNATURE-----

Merge tag 'iversion-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull i_version updates from Jeff Layton:
 "This overhauls how we handle i_version queries from nfsd.

  Instead of having special routines and grabbing the i_version field
  directly out of the inode in some cases, we've moved most of the
  handling into the various filesystems' getattr operations. As a bonus,
  this makes ceph's change attribute usable by knfsd as well.

  This should pave the way for future work to make this value queryable
  by userland, and to make it more resilient against rolling back on a
  crash"

* tag 'iversion-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  nfsd: remove fetch_iversion export operation
  nfsd: use the getattr operation to fetch i_version
  nfsd: move nfsd4_change_attribute to nfsfh.c
  ceph: report the inode version in getattr if requested
  nfs: report the inode version in getattr if requested
  vfs: plumb i_version handling into struct kstat
  fs: clarify when the i_version counter must be updated
  fs: uninline inode_query_iversion
2023-02-20 11:21:02 -08:00
Linus Torvalds
575a7e0f81 File locking changes for v6.3
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmPuC5kTHGpsYXl0b25A
 a2VybmVsLm9yZwAKCRAADmhBGVaCFdXpEADFN/loTtcANLPQmLmgmJLDuZr2zKrf
 aziMJRjqGMx6BLdwBgX8/XGBNwpG4tkVbI+zdRoVHkpcayMDpLq0dnrvi79a/dGU
 fBrI72ZDMd/S9lzbodObHMziLqvgFthsPm9ldVAZ2Kt400KKNE+ozcveiC3yVGy0
 n1k5BSt/78abzpqut5whVgJBooHtUMCh3XvBJPKwgOneHfAXCm+jqaXlKKpKlpZj
 s2OUyn8BLfNkTgpAZ88L5Rkf0mftjziL6C8KOMy1hvOsyiP0IkwLuQ/kO+2H0Ate
 p3tbOGvUT+n1gYpFYBDLnuWB4G8+CVPxfoO6KGhwT4OlCJpPlNCM8O+w/A/dKn4I
 858spkpYPMy91lEkcrRLRkg/MARWGTgZ3k76fp3OWNnfruWd6ekMlYKx9n6CIy34
 Aoc3Svy9KeA7oRrbRDltmw3UVmz53GcDo337ZL1J6Jph3s86dMG7AwGYvoDfKuKK
 b0oNK5db5v50scBnRHX6UWejE5fSnnHvgC7pU57u08odCVEUALB+r8f04vmkxcVJ
 Qed7lolQdFzn9ddaOXzpg5KeCe/cX3p4IPZSTHad7CPr8gswmC135DfXCr64x2hC
 5jyNzKbe/x+7B2xCweHmEk4ojt8IU3UaYxLJoQkNeVr8rEGC9gqZgkSDe7BnTpOf
 wT0ijzhy2u5RKg==
 =Zhf3
 -----END PGP SIGNATURE-----

Merge tag 'locks-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull file locking updates from Jeff Layton:
 "The main change here is that I've broken out most of the file locking
  definitions into a new header file. I also went ahead and completed
  the removal of locks_inode function"

* tag 'locks-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fs: remove locks_inode
  filelock: move file locking definitions to separate header file
2023-02-20 11:10:38 -08:00
Zhang Xiaoxu
3e161c2791 cifs: Fix warning and UAF when destroy the MR list
If the MR allocate failed, the MR recovery work not initialized
and list not cleared. Then will be warning and UAF when release
the MR:

  WARNING: CPU: 4 PID: 824 at kernel/workqueue.c:3066 __flush_work.isra.0+0xf7/0x110
  CPU: 4 PID: 824 Comm: mount.cifs Not tainted 6.1.0-rc5+ #82
  RIP: 0010:__flush_work.isra.0+0xf7/0x110
  Call Trace:
   <TASK>
   __cancel_work_timer+0x2ba/0x2e0
   smbd_destroy+0x4e1/0x990
   _smbd_get_connection+0x1cbd/0x2110
   smbd_get_connection+0x21/0x40
   cifs_get_tcp_session+0x8ef/0xda0
   mount_get_conns+0x60/0x750
   cifs_mount+0x103/0xd00
   cifs_smb3_do_mount+0x1dd/0xcb0
   smb3_get_tree+0x1d5/0x300
   vfs_get_tree+0x41/0xf0
   path_mount+0x9b3/0xdd0
   __x64_sys_mount+0x190/0x1d0
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x46/0xb0

  BUG: KASAN: use-after-free in smbd_destroy+0x4fc/0x990
  Read of size 8 at addr ffff88810b156a08 by task mount.cifs/824
  CPU: 4 PID: 824 Comm: mount.cifs Tainted: G        W          6.1.0-rc5+ #82
  Call Trace:
   dump_stack_lvl+0x34/0x44
   print_report+0x171/0x472
   kasan_report+0xad/0x130
   smbd_destroy+0x4fc/0x990
   _smbd_get_connection+0x1cbd/0x2110
   smbd_get_connection+0x21/0x40
   cifs_get_tcp_session+0x8ef/0xda0
   mount_get_conns+0x60/0x750
   cifs_mount+0x103/0xd00
   cifs_smb3_do_mount+0x1dd/0xcb0
   smb3_get_tree+0x1d5/0x300
   vfs_get_tree+0x41/0xf0
   path_mount+0x9b3/0xdd0
   __x64_sys_mount+0x190/0x1d0
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x46/0xb0

  Allocated by task 824:
   kasan_save_stack+0x1e/0x40
   kasan_set_track+0x21/0x30
   __kasan_kmalloc+0x7a/0x90
   _smbd_get_connection+0x1b6f/0x2110
   smbd_get_connection+0x21/0x40
   cifs_get_tcp_session+0x8ef/0xda0
   mount_get_conns+0x60/0x750
   cifs_mount+0x103/0xd00
   cifs_smb3_do_mount+0x1dd/0xcb0
   smb3_get_tree+0x1d5/0x300
   vfs_get_tree+0x41/0xf0
   path_mount+0x9b3/0xdd0
   __x64_sys_mount+0x190/0x1d0
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x46/0xb0

  Freed by task 824:
   kasan_save_stack+0x1e/0x40
   kasan_set_track+0x21/0x30
   kasan_save_free_info+0x2a/0x40
   ____kasan_slab_free+0x143/0x1b0
   __kmem_cache_free+0xc8/0x330
   _smbd_get_connection+0x1c6a/0x2110
   smbd_get_connection+0x21/0x40
   cifs_get_tcp_session+0x8ef/0xda0
   mount_get_conns+0x60/0x750
   cifs_mount+0x103/0xd00
   cifs_smb3_do_mount+0x1dd/0xcb0
   smb3_get_tree+0x1d5/0x300
   vfs_get_tree+0x41/0xf0
   path_mount+0x9b3/0xdd0
   __x64_sys_mount+0x190/0x1d0
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x46/0xb0

Let's initialize the MR recovery work before MR allocate to prevent
the warning, remove the MRs from the list to prevent the UAF.

Fixes: c739858334 ("CIFS: SMBD: Implement RDMA memory registration")
Acked-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:48 -06:00
Zhang Xiaoxu
e9d3401d95 cifs: Fix lost destroy smbd connection when MR allocate failed
If the MR allocate failed, the smb direct connection info is NULL,
then smbd_destroy() will directly return, then the connection info
will be leaked.

Let's set the smb direct connection info to the server before call
smbd_destroy().

Fixes: c739858334 ("CIFS: SMBD: Implement RDMA memory registration")
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Acked-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:48 -06:00
Ronnie Sahlberg
8e843bf38f cifs: return a single-use cfid if we did not get a lease
If we did not get a lease we can still return a single use cfid to the caller.
The cfid will not have has_lease set and will thus not be shared with any
other concurrent users and will be freed immediately when the caller
drops the handle.

This avoids extra roundtrips for servers that do not support directory leases
where they would first fail to get a cfid with a lease and then fallback
to try a normal SMB2_open()

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:48 -06:00
Ronnie Sahlberg
66d45ca135 cifs: Check the lease context if we actually got a lease
Some servers may return that we got a lease in rsp->OplockLevel
but then in the lease context contradict this and say we got no lease
at all.  Thus we need to check the context if we have a lease.
Additionally, If we do not get a lease we need to make sure we close
the handle before we return an error to the caller.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:48 -06:00
Kees Cook
35235e19b3 cifs: Replace remaining 1-element arrays
The kernel is globally removing the ambiguous 0-length and 1-element
arrays in favor of flexible arrays, so that we can gain both compile-time
and run-time array bounds checking[1].

Replace the trailing 1-element array with a flexible array in the
following structures:

	struct cifs_spnego_msg
	struct cifs_quota_data
	struct get_dfs_referral_rsp
	struct file_alt_name_info
	NEGOTIATE_RSP
	SESSION_SETUP_ANDX
	TCONX_REQ
	TCONX_RSP
	TCONX_RSP_EXT
	ECHO_REQ
	ECHO_RSP
	OPEN_REQ
	OPENX_REQ
	LOCK_REQ
	RENAME_REQ
	COPY_REQ
	COPY_RSP
	NT_RENAME_REQ
	DELETE_FILE_REQ
	DELETE_DIRECTORY_REQ
	CREATE_DIRECTORY_REQ
	QUERY_INFORMATION_REQ
	SETATTR_REQ
	TRANSACT_IOCTL_REQ
	TRANSACT_CHANGE_NOTIFY_REQ
	TRANSACTION2_QPI_REQ
	TRANSACTION2_SPI_REQ
	TRANSACTION2_FFIRST_REQ
	TRANSACTION2_GET_DFS_REFER_REQ
	FILE_UNIX_LINK_INFO
	FILE_DIRECTORY_INFO
	FILE_FULL_DIRECTORY_INFO
	SEARCH_ID_FULL_DIR_INFO
	FILE_BOTH_DIRECTORY_INFO
	FIND_FILE_STANDARD_INFO

Replace the trailing 1-element array with a flexible array, but leave
the existing structure padding:

	FILE_ALL_INFO
	FILE_UNIX_INFO

Remove unused structures:

	struct gea
	struct gealist

Adjust all related size calculations to match the changes to sizeof().

No machine code output differences are produced after these changes.

[1] For lots of details, see both:
    https://docs.kernel.org/process/deprecated.html#zero-length-and-one-element-arrays
    https://people.kernel.org/kees/bounded-flexible-arrays-in-c

Cc: Steve French <sfrench@samba.org>
Cc: Paulo Alcantara <pc@cjr.nz>
Cc: Ronnie Sahlberg <lsahlber@redhat.com>
Cc: Shyam Prasad N <sprasad@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:48 -06:00
Kees Cook
398d5843c0 cifs: Convert struct fealist away from 1-element array
The kernel is globally removing the ambiguous 0-length and 1-element
arrays in favor of flexible arrays, so that we can gain both compile-time
and run-time array bounds checking[1].

While struct fealist is defined as a "fake" flexible array (via a
1-element array), it is only used for examination of the first array
element. Walking the list is performed separately, so there is no reason
to treat the "list" member of struct fealist as anything other than a
single entry. Adjust the struct and code to match.

Additionally, struct fea uses the "name" member either as a dynamic
string, or is manually calculated from the start of the struct. Redefine
the member as a flexible array.

No machine code output differences are produced after these changes.

[1] For lots of details, see both:
    https://docs.kernel.org/process/deprecated.html#zero-length-and-one-element-arrays
    https://people.kernel.org/kees/bounded-flexible-arrays-in-c

Cc: Steve French <sfrench@samba.org>
Cc: Paulo Alcantara <pc@cjr.nz>
Cc: Ronnie Sahlberg <lsahlber@redhat.com>
Cc: Shyam Prasad N <sprasad@microsoft.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:48 -06:00
Paulo Alcantara
d99e86ebde cifs: fix mount on old smb servers
The client was sending rfc1002 session request packet with a wrong
length field set, therefore failing to mount shares against old SMB
servers over port 139.

Fix this by calculating the correct length as specified in rfc1002.

Fixes: d7173623bf ("cifs: use ALIGN() and round_up() macros")
Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:48 -06:00
Volker Lendecke
de036dcaca cifs: Fix uninitialized memory reads for oparms.mode
Use a struct assignment with implicit member initialization

Signed-off-by: Volker Lendecke <vl@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:48 -06:00
Namjae Jeon
5574920c7a cifs: remove unneeded 2bytes of padding from smb2 tree connect
Due to the 2bytes of padding from the smb2 tree connect request,
there is an unneeded difference between the rfc1002 length and the actual
frame length. In the case of windows client, it is sent by matching it
exactly.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:48 -06:00
Matthew Wilcox (Oracle)
c191bc070e cifs: Use a folio in cifs_page_mkwrite()
Avoids many calls to compound_head() and removes calls to various
compat functions.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:48 -06:00
Volker Lendecke
d447e794a3 cifs: Fix uninitialized memory read in smb3_qfs_tcon()
oparms was not fully initialized

Signed-off-by: Volker Lendecke <vl@samba.org>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:48 -06:00
Stefan Metzmacher
3891f6c765 cifs: don't try to use rdma offload on encrypted connections
The aim of using encryption on a connection is to keep
the data confidential, so we must not use plaintext rdma offload
for that data!

It seems that current windows servers and ksmbd would allow
this, but that's no reason to expose the users data in plaintext!
And servers hopefully reject this in future.

Note modern windows servers support signed or encrypted offload,
see MS-SMB2 2.2.3.1.6 SMB2_RDMA_TRANSFORM_CAPABILITIES, but we don't
support that yet.

Signed-off-by: Stefan Metzmacher <metze@samba.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:48 -06:00
Stefan Metzmacher
a6559cc1d3 cifs: split out smb3_use_rdma_offload() helper
We should have the logic to decide if we want rdma offload
in a single spot in order to advance it in future.

Signed-off-by: Stefan Metzmacher <metze@samba.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:48 -06:00
Stefan Metzmacher
d643a8a446 cifs: introduce cifs_io_parms in smb2_async_writev()
This will simplify the following changes and makes it easy to get
in passed in from the caller in future.

Signed-off-by: Stefan Metzmacher <metze@samba.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:48 -06:00
Paulo Alcantara
8954278173 cifs: get rid of unneeded conditional in cifs_get_num_sgs()
Just have @skip set to 0 after first iterations of the two nested
loops.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:48 -06:00
Paulo Alcantara
3c0070f54b cifs: prevent data race in smb2_reconnect()
Make sure to get an up-to-date TCP_Server_Info::nr_targets value prior
to waiting the server to be reconnected in smb2_reconnect().  It is
set in cifs_tcp_ses_needs_reconnect() and protected by
TCP_Server_Info::srv_lock.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:48 -06:00
Steve French
77e3f338df cifs: fix indentation in make menuconfig options
The options that are displayed for the smb3.1.1/cifs client
in "make menuconfig" are confusing because some of them are
not indented making them not appear to be related to cifs.ko
Fix that by adding an if/endif (similar to what ceph and 9pm did)
if fs/cifs/Kconfig

Suggested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:47 -06:00
Steve French
ed2f1d9cea cifs: update Kconfig description
There were various outdated or missing things in fs/cifs/Kconfig
e.g. mention of support for insecure NTLM which has been removed,
and lack of mention of some important features. This also shortens
it slightly, and fixes some confusing text (e.g. the SMB1 POSIX
extensions option).

Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Acked-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:47 -06:00
Andy Shevchenko
d4fba63fe1 cifs: Get rid of unneeded conditional in the smb2_get_aead_req()
In the smb2_get_aead_req() the skip variable is used only for
the very first iteration of the two nested loops, which means
it's basically in invariant to those loops. Hence, instead of
using conditional on each iteration, unconditionally assign
the 'skip' variable before the loops and at the end of the
inner loop.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:47 -06:00
Shyam Prasad N
05844bd661 cifs: print last update time for interface list
We store the last updated time for interface list while
parsing the interfaces. This change is to just print that
info in DebugData.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:47 -06:00
Gustavo A. R. Silva
4e551dbdba cifs: Replace zero-length arrays with flexible-array members
Zero-length arrays are deprecated[1] and we are moving towards
adopting C99 flexible-array members instead. So, replace zero-length
arrays in a couple of structures with flex-array members.

This helps with the ongoing efforts to tighten the FORTIFY_SOURCE
routines on memcpy() and help us make progress towards globally
enabling -fstrict-flex-arrays=3 [2].

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays [1]
Link: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602902.html [2]
Link: https://github.com/KSPP/linux/issues/78
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Paulo Alcantara <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:47 -06:00
Christophe JAILLET
9f5d1a8cf8 cifs: Use kstrtobool() instead of strtobool()
strtobool() is the same as kstrtobool().
However, the latter is more used within the kernel.

In order to remove strtobool() and slightly simplify kstrtox.h, switch to
the other function name.

While at it, include the corresponding header file (<linux/kstrtox.h>)

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Paulo Alcantara <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 11:48:47 -06:00
Chuck Lever
4b471a8b84 NFSD: Clean up nfsd_symlink()
The pointer dentry is assigned a value that is never read, the
assignment is redundant and can be removed.

Cleans up clang-scan warning:
fs/nfsd/nfsctl.c:1231:2: warning: Value stored to 'dentry' is
never read [deadcode.DeadStores]
       dentry = ERR_PTR(ret);

No need to initialize "int ret = -ENOMEM;" either.

These are vestiges of nfsd_mkdir(), from whence I copied
nfsd_symlink().

Reported-by: Colin Ian King <colin.i.king@gmail.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:59 -05:00
Chuck Lever
90d2175572 NFSD: copy the whole verifier in nfsd_copy_write_verifier
Currently, we're only memcpy'ing the first __be32. Ensure we copy into
both words.

Fixes: 91d2e9b56c ("NFSD: Clean up the nfsd_net::nfssvc_boot field")
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:59 -05:00
Jeff Layton
4c475eee02 nfsd: don't fsync nfsd_files on last close
Most of the time, NFSv4 clients issue a COMMIT before the final CLOSE of
an open stateid, so with NFSv4, the fsync in the nfsd_file_free path is
usually a no-op and doesn't block.

We have a customer running knfsd over very slow storage (XFS over Ceph
RBD). They were using the "async" export option because performance was
more important than data integrity for this application. That export
option turns NFSv4 COMMIT calls into no-ops. Due to the fsync in this
codepath however, their final CLOSE calls would still stall (since a
CLOSE effectively became a COMMIT).

I think this fsync is not strictly necessary. We only use that result to
reset the write verifier. Instead of fsync'ing all of the data when we
free an nfsd_file, we can just check for writeback errors when one is
acquired and when it is freed.

If the client never comes back, then it'll never see the error anyway
and there is no point in resetting it. If an error occurs after the
nfsd_file is removed from the cache but before the inode is evicted,
then it will reset the write verifier on the next nfsd_file_acquire,
(since there will be an unseen error).

The only exception here is if something else opens and fsyncs the file
during that window. Given that local applications work with this
limitation today, I don't see that as an issue.

Link: https://bugzilla.redhat.com/show_bug.cgi?id=2166658
Fixes: ac3a2585f0 ("nfsd: rework refcounting in filecache")
Reported-and-tested-by: Pierguido Lambri <plambri@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:59 -05:00
Jeff Layton
dcd779dc46 nfsd: fix courtesy client with deny mode handling in nfs4_upgrade_open
The nested if statements here make no sense, as you can never reach
"else" branch in the nested statement. Fix the error handling for
when there is a courtesy client that holds a conflicting deny mode.

Fixes: 3d69427151 ("NFSD: add support for share reservation conflict to courteous server")
Reported-by: 張智諺 <cc85nod@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:57 -05:00
Dai Ngo
81e722978a NFSD: fix problems with cleanup on errors in nfsd4_copy
When nfsd4_copy fails to allocate memory for async_copy->cp_src, or
nfs4_init_copy_state fails, it calls cleanup_async_copy to do the
cleanup for the async_copy which causes page fault since async_copy
is not yet initialized.

This patche rearranges the order of initializing the fields in
async_copy and adds checks in cleanup_async_copy to skip un-initialized
fields.

Fixes: ce0887ac96 ("NFSD add nfs4 inter ssc to nfsd4_copy")
Fixes: 87689df694 ("NFSD: Shrink size of struct nfsd4_copy")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:57 -05:00
Benjamin Coddington
fb610c4dbc nfsd: fix race to check ls_layouts
Its possible for __break_lease to find the layout's lease before we've
added the layout to the owner's ls_layouts list.  In that case, setting
ls_recalled = true without actually recalling the layout will cause the
server to never send a recall callback.

Move the check for ls_layouts before setting ls_recalled.

Fixes: c5c707f96f ("nfsd: implement pNFS layout recalls")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:56 -05:00
Jeff Layton
826b67e637 nfsd: don't hand out delegation on setuid files being opened for write
We had a bug report that xfstest generic/355 was failing on NFSv4.0.
This test sets various combinations of setuid/setgid modes and tests
whether DIO writes will cause them to be stripped.

What I found was that the server did properly strip those bits, but
the client didn't notice because it held a delegation that was not
recalled. The recall didn't occur because the client itself was the
one generating the activity and we avoid recalls in that case.

Clearing setuid bits is an "implicit" activity. The client didn't
specifically request that we do that, so we need the server to issue a
CB_RECALL, or avoid the situation entirely by not issuing a delegation.

The easiest fix here is to simply not give out a delegation if the file
is being opened for write, and the mode has the setuid and/or setgid bit
set. Note that there is a potential race between the mode and lease
being set, so we test for this condition both before and after setting
the lease.

This patch fixes generic/355, generic/683 and generic/684 for me. (Note
that 355 fails only on v4.0, and 683 and 684 require NFSv4.2 to run and
fail).

Reported-by: Boyang Xue <bxue@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:55 -05:00
Jeff Layton
4dbca1c3c6 nfsd: remove fs/nfsd/fault_inject.c
This file is no longer built at all.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:54 -05:00
Dai Ngo
34e8f9ec4c NFSD: fix leaked reference count of nfsd4_ssc_umount_item
The reference count of nfsd4_ssc_umount_item is not decremented
on error conditions. This prevents the laundromat from unmounting
the vfsmount of the source file.

This patch decrements the reference count of nfsd4_ssc_umount_item
on error.

Fixes: f4e44b3933 ("NFSD: delay unmount source's export after inter-server copy completed.")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:52 -05:00
Jeff Layton
6ba434cb1a nfsd: clean up potential nfsd_file refcount leaks in COPY codepath
There are two different flavors of the nfsd4_copy struct. One is
embedded in the compound and is used directly in synchronous copies. The
other is dynamically allocated, refcounted and tracked in the client
struture. For the embedded one, the cleanup just involves releasing any
nfsd_files held on its behalf. For the async one, the cleanup is a bit
more involved, and we need to dequeue it from lists, unhash it, etc.

There is at least one potential refcount leak in this code now. If the
kthread_create call fails, then both the src and dst nfsd_files in the
original nfsd4_copy object are leaked.

The cleanup in this codepath is also sort of weird. In the async copy
case, we'll have up to four nfsd_file references (src and dst for both
flavors of copy structure). They are both put at the end of
nfsd4_do_async_copy, even though the ones held on behalf of the embedded
one outlive that structure.

Change it so that we always clean up the nfsd_file refs held by the
embedded copy structure before nfsd4_copy returns. Rework
cleanup_async_copy to handle both inter and intra copies. Eliminate
nfsd4_cleanup_intra_ssc since it now becomes a no-op.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:52 -05:00
Jeff Layton
1f0001d43d nfsd: zero out pointers after putting nfsd_files on COPY setup error
At first, I thought this might be a source of nfsd_file overputs, but
the current callers seem to avoid an extra put when nfsd4_verify_copy
returns an error.

Still, it's "bad form" to leave the pointers filled out when we don't
have a reference to them anymore, and that might lead to bugs later.
Zero them out as a defensive coding measure.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:52 -05:00
Jeff Layton
edd2f5526e nfsd: eliminate __nfs4_get_fd
This is wrapper is pointless, and just obscures what's going on.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:52 -05:00
Jeff Layton
ee97e73015 nfsd: add some kerneldoc comments for stateid preprocessing functions
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:52 -05:00
Jeff Layton
45ba66cc2c nfsd: eliminate find_deleg_file_locked
We really don't need an accessor function here.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:51 -05:00
Jeff Layton
fcb530973b nfsd: don't take nfsd4_copy ref for OP_OFFLOAD_STATUS
We're not doing any blocking operations for OP_OFFLOAD_STATUS, so taking
and putting a reference is a waste of effort. Take the client lock,
search for the copy and fetch the wr_bytes_written field and return.

Also, make find_async_copy a static function.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:51 -05:00
Chuck Lever
4df750c924 NFSD: Replace /proc/fs/nfsd/supported_krb5_enctypes with a symlink
Now that I've added a file under /proc/net/rpc that is managed by
the SunRPC's Kerberos mechanism, replace NFSD's
supported_krb5_enctypes file with a symlink to the new SunRPC proc
file, which contains exactly the same content.

Remarkably, commit b0b0c0a26e ("nfsd: add proc file listing
kernel's gss_krb5 enctypes") added the nfsd_supported_krb5_enctypes
file in 2011, but this file has never been documented in nfsd(7).

Tested-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Simo Sorce <simo@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:37 -05:00
Chuck Lever
65ba3d2425 SUNRPC: Use per-CPU counters to tally server RPC counts
- Improves counting accuracy
 - Reduces cross-CPU memory traffic

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:32 -05:00
Jeff Layton
f5f9d4a314 nfsd: move reply cache initialization into nfsd startup
There's no need to start the reply cache before nfsd is up and running,
and doing so means that we register a shrinker for every net namespace
instead of just the ones where nfsd is running.

Move it to the per-net nfsd startup instead.

Reported-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:32 -05:00
Chuck Lever
cee4db1945 SUNRPC: Refactor RPC server dispatch method
Currently, svcauth_gss_accept() pre-reserves response buffer space
for the RPC payload length and GSS sequence number before returning
to the dispatcher, which then adds the header's accept_stat field.

The problem is the accept_stat field is supposed to go before the
length and seq_num fields. So svcauth_gss_release() has to relocate
the accept_stat value (see svcauth_gss_prepare_to_wrap()).

To enable these fields to be added to the response buffer in the
correct (final) order, the pointer to the accept_stat has to be made
available to svcauth_gss_accept() so that it can set it before
reserving space for the length and seq_num fields.

As a first step, move the pointer to the location of the accept_stat
field into struct svc_rqst.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:31 -05:00
Chuck Lever
8dd41d70f3 SUNRPC: Push svcxdr_init_encode() into svc_process_common()
Now that all vs_dispatch functions invoke svcxdr_init_encode(), it
is common code and can be pushed down into the generic RPC server.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:27 -05:00
Jeff Layton
bd6aaf781d nfsd: fix potential race in nfs4_find_file
The WARN_ON_ONCE check is not terribly useful. It also seems possible
for nfs4_find_file to race with the destruction of an fi_deleg_file
while trying to take a reference to it.

Now that it's safe to pass nfs_get_file a NULL pointer, remove the WARN
and NULL pointer check. Take the fi_lock when fetching fi_deleg_file.

Cc: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:22 -05:00
Jeff Layton
70f62231cd nfsd: allow nfsd_file_get to sanely handle a NULL pointer
...and remove some now-useless NULL pointer checks in its callers.

Suggested-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:21 -05:00
Dai Ngo
df24ac7a2e NFSD: enhance inter-server copy cleanup
Currently nfsd4_setup_inter_ssc returns the vfsmount of the source
server's export when the mount completes. After the copy is done
nfsd4_cleanup_inter_ssc is called with the vfsmount of the source
server and it searches nfsd_ssc_mount_list for a matching entry
to do the clean up.

The problems with this approach are (1) the need to search the
nfsd_ssc_mount_list and (2) the code has to handle the case where
the matching entry is not found which looks ugly.

The enhancement is instead of nfsd4_setup_inter_ssc returning the
vfsmount, it returns the nfsd4_ssc_umount_item which has the
vfsmount embedded in it. When nfsd4_cleanup_inter_ssc is called
it's passed with the nfsd4_ssc_umount_item directly to do the
clean up so no searching is needed and there is no need to handle
the 'not found' case.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
[ cel: adjusted whitespace and variable/function names ]
Reviewed-by: Olga Kornievskaia <kolga@netapp.com>
2023-02-20 09:20:21 -05:00
Chuck Lever
dba5eaa46b SUNRPC: Push svcxdr_init_decode() into svc_process_common()
Now that all vs_dispatch functions invoke svcxdr_init_decode(), it
is common code and can be pushed down into the generic RPC server.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:09 -05:00
Richard Weinberger
f78e445458 NFS: nfs_encode_fh: Remove S_AUTOMOUNT check
Now with NFSD being able to cross into auto mounts,
the check can be removed.

Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Ian Kent <raven@themaw.net>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-02-20 09:20:09 -05:00
Richard Weinberger
e1f19857f9 fs: namei: Allow follow_down() to uncover auto mounts
This function is only used by NFSD to cross mount points.
If a mount point is of type auto mount, follow_down() will
not uncover it. Add LOOKUP_AUTOMOUNT to the lookup flags
to have ->d_automount() called when NFSD walks down the
mount tree.

Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Ian Kent <raven@themaw.net>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:08 -05:00
Richard Weinberger
50f5fdaea9 NFSD: Teach nfsd_mountpoint() auto mounts
Currently nfsd_mountpoint() tests for mount points using d_mountpoint(),
this works only when a mount point is already uncovered.
In our case the mount point is of type auto mount and can be coverted.
i.e. ->d_automount() was not called.

Using d_managed() nfsd_mountpoint() can test whether a mount point is
either already uncovered or can be uncovered later.

Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Ian Kent <raven@themaw.net>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:08 -05:00
Bagas Sanjaya
88cd618dcc debugfs: drop inline constant formatting for ERR_PTR(-ERROR)
Stephen Rothwell reported htmldocs warning when merging driver-core tree
for linux-next:

Documentation/filesystems/api-summary:146: fs/debugfs/inode.c:804: WARNING: Inline literal start-string without end-string.

The warning is due to inline constant formatting (``%CONST``) doesn't play
nice with complex-name constants like ERR_PTR(-ERROR).

Drop the formatting for that constant above to be consistent with similar
error constants and also to fix the above warning.

Link: https://lore.kernel.org/lkml/20230220163133.481e43d8@canb.auug.org.au/
Fixes: d3002468cb ("debugfs: update comment of debugfs_rename()")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://lore.kernel.org/r/20230220124721.11657-1-bagasdotme@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-20 14:14:56 +01:00
Hangyu Hua
d3ca9f7aeb ksmbd: fix possible memory leak in smb2_lock()
argv needs to be free when setup_async_work fails or when the current
process is woken up.

Fixes: e2f34481b2 ("cifsd: add server-side procedures for SMB3")
Cc: stable@vger.kernel.org
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-19 17:57:25 -06:00
Baokun Li
0f7bfd6f81 ext4: fix task hung in ext4_xattr_delete_inode
Syzbot reported a hung task problem:
==================================================================
INFO: task syz-executor232:5073 blocked for more than 143 seconds.
      Not tainted 6.2.0-rc2-syzkaller-00024-g512dee0c00ad #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-exec232 state:D stack:21024 pid:5073 ppid:5072 flags:0x00004004
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5244 [inline]
 __schedule+0x995/0xe20 kernel/sched/core.c:6555
 schedule+0xcb/0x190 kernel/sched/core.c:6631
 __wait_on_freeing_inode fs/inode.c:2196 [inline]
 find_inode_fast+0x35a/0x4c0 fs/inode.c:950
 iget_locked+0xb1/0x830 fs/inode.c:1273
 __ext4_iget+0x22e/0x3ed0 fs/ext4/inode.c:4861
 ext4_xattr_inode_iget+0x68/0x4e0 fs/ext4/xattr.c:389
 ext4_xattr_inode_dec_ref_all+0x1a7/0xe50 fs/ext4/xattr.c:1148
 ext4_xattr_delete_inode+0xb04/0xcd0 fs/ext4/xattr.c:2880
 ext4_evict_inode+0xd7c/0x10b0 fs/ext4/inode.c:296
 evict+0x2a4/0x620 fs/inode.c:664
 ext4_orphan_cleanup+0xb60/0x1340 fs/ext4/orphan.c:474
 __ext4_fill_super fs/ext4/super.c:5516 [inline]
 ext4_fill_super+0x81cd/0x8700 fs/ext4/super.c:5644
 get_tree_bdev+0x400/0x620 fs/super.c:1282
 vfs_get_tree+0x88/0x270 fs/super.c:1489
 do_new_mount+0x289/0xad0 fs/namespace.c:3145
 do_mount fs/namespace.c:3488 [inline]
 __do_sys_mount fs/namespace.c:3697 [inline]
 __se_sys_mount+0x2d3/0x3c0 fs/namespace.c:3674
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fa5406fd5ea
RSP: 002b:00007ffc7232f968 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fa5406fd5ea
RDX: 0000000020000440 RSI: 0000000020000000 RDI: 00007ffc7232f970
RBP: 00007ffc7232f970 R08: 00007ffc7232f9b0 R09: 0000000000000432
R10: 0000000000804a03 R11: 0000000000000202 R12: 0000000000000004
R13: 0000555556a7a2c0 R14: 00007ffc7232f9b0 R15: 0000000000000000
 </TASK>
==================================================================

The problem is that the inode contains an xattr entry with ea_inum of 15
when cleaning up an orphan inode <15>. When evict inode <15>, the reference
counting of the corresponding EA inode is decreased. When EA inode <15> is
found by find_inode_fast() in __ext4_iget(), it is found that the EA inode
holds the I_FREEING flag and waits for the EA inode to complete deletion.
As a result, when inode <15> is being deleted, we wait for inode <15> to
complete the deletion, resulting in an infinite loop and triggering Hung
Task. To solve this problem, we only need to check whether the ino of EA
inode and parent is the same before getting EA inode.

Link: https://syzkaller.appspot.com/bug?extid=77d6fcc37bbb92f26048
Reported-by: syzbot+77d6fcc37bbb92f26048@syzkaller.appspotmail.com
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230110133436.996350-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-02-19 00:16:22 -05:00
Zhihao Cheng
e6b9bd7290 jbd2: fix data missing when reusing bh which is ready to be checkpointed
Following process will make data lost and could lead to a filesystem
corrupted problem:

1. jh(bh) is inserted into T1->t_checkpoint_list, bh is dirty, and
   jh->b_transaction = NULL
2. T1 is added into journal->j_checkpoint_transactions.
3. Get bh prepare to write while doing checkpoing:
           PA				    PB
   do_get_write_access             jbd2_log_do_checkpoint
    spin_lock(&jh->b_state_lock)
     if (buffer_dirty(bh))
      clear_buffer_dirty(bh)   // clear buffer dirty
       set_buffer_jbddirty(bh)
				    transaction =
				    journal->j_checkpoint_transactions
				    jh = transaction->t_checkpoint_list
				    if (!buffer_dirty(bh))
		                      __jbd2_journal_remove_checkpoint(jh)
				      // bh won't be flushed
		                    jbd2_cleanup_journal_tail
    __jbd2_journal_file_buffer(jh, transaction, BJ_Reserved)
4. Aborting journal/Power-cut before writing latest bh on journal area.

In this way we get a corrupted filesystem with bh's data lost.

Fix it by moving the clearing of buffer_dirty bit just before the call
to __jbd2_journal_file_buffer(), both bit clearing and jh->b_transaction
assignment are under journal->j_list_lock locked, so that
jbd2_log_do_checkpoint() will wait until jh's new transaction fininshed
even bh is currently not dirty. And journal_shrink_one_cp_list() won't
remove jh from checkpoint list if the buffer head is reused in
do_get_write_access().

Fetch a reproducer in [Link].

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216898
Cc: <stable@kernel.org>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: zhanchengbin <zhanchengbin1@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230110015327.1181863-1-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-02-19 00:09:59 -05:00
Baokun Li
3039d8b869 ext4: update s_journal_inum if it changes after journal replay
When mounting a crafted ext4 image, s_journal_inum may change after journal
replay, which is obviously unreasonable because we have successfully loaded
and replayed the journal through the old s_journal_inum. And the new
s_journal_inum bypasses some of the checks in ext4_get_journal(), which
may trigger a null pointer dereference problem. So if s_journal_inum
changes after the journal replay, we ignore the change, and rewrite the
current journal_inum to the superblock.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216541
Reported-by: Luís Henriques <lhenriques@suse.de>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230107032126.4165860-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-02-19 00:09:59 -05:00
Baokun Li
5cd740287a ext4: fail ext4_iget if special inode unallocated
In ext4_fill_super(), EXT4_ORPHAN_FS flag is cleared after
ext4_orphan_cleanup() is executed. Therefore, when __ext4_iget() is
called to get an inode whose i_nlink is 0 when the flag exists, no error
is returned. If the inode is a special inode, a null pointer dereference
may occur. If the value of i_nlink is 0 for any inodes (except boot loader
inodes) got by using the EXT4_IGET_SPECIAL flag, the current file system
is corrupted. Therefore, make the ext4_iget() function return an error if
it gets such an abnormal special inode.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199179
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216541
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216539
Reported-by: Luís Henriques <lhenriques@suse.de>
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230107032126.4165860-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-02-19 00:09:59 -05:00
Kees Cook
d99a55a94a ext4: fix function prototype mismatch for ext4_feat_ktype
With clang's kernel control flow integrity (kCFI, CONFIG_CFI_CLANG),
indirect call targets are validated against the expected function
pointer prototype to make sure the call target is valid to help mitigate
ROP attacks. If they are not identical, there is a failure at run time,
which manifests as either a kernel panic or thread getting killed.

ext4_feat_ktype was setting the "release" handler to "kfree", which
doesn't have a matching function prototype. Add a simple wrapper
with the correct prototype.

This was found as a result of Clang's new -Wcast-function-type-strict
flag, which is more sensitive than the simpler -Wcast-function-type,
which only checks for type width mismatches.

Note that this code is only reached when ext4 is a loadable module and
it is being unloaded:

 CFI failure at kobject_put+0xbb/0x1b0 (target: kfree+0x0/0x180; expected type: 0x7c4aa698)
 ...
 RIP: 0010:kobject_put+0xbb/0x1b0
 ...
 Call Trace:
  <TASK>
  ext4_exit_sysfs+0x14/0x60 [ext4]
  cleanup_module+0x67/0xedb [ext4]

Fixes: b99fee58a2 ("ext4: create ext4_feat kobject dynamically")
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: stable@vger.kernel.org
Build-tested-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20230103234616.never.915-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230104210908.gonna.388-kees@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-02-19 00:09:59 -05:00
XU pengfei
7fc51f923e ext4: remove unnecessary variable initialization
Variables are assigned first and then used. Initialization is not required.

Signed-off-by: XU pengfei <xupengfei@nfschina.com>
Link: https://lore.kernel.org/r/20230104055229.3663-1-xupengfei@nfschina.com
2023-02-19 00:09:59 -05:00
zhanchengbin
3f5424790d ext4: fix inode tree inconsistency caused by ENOMEM
If ENOMEM fails when the extent is splitting, we need to restore the length
of the split extent.
In the ext4_split_extent_at function, only in ext4_ext_create_new_leaf will
it alloc memory and change the shape of the extent tree,even if an ENOMEM
is returned at this time, the extent tree is still self-consistent, Just
restore the split extent lens in the function ext4_split_extent_at.

ext4_split_extent_at
 ext4_ext_insert_extent
  ext4_ext_create_new_leaf
   1)ext4_ext_split
     ext4_find_extent
   2)ext4_ext_grow_indepth
     ext4_find_extent

Signed-off-by: zhanchengbin <zhanchengbin1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230103022812.130603-1-zhanchengbin1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-02-18 23:58:28 -05:00
Jun Nie
f31173c199 ext4: refuse to create ea block when umounted
The ea block expansion need to access s_root while it is
already set as NULL when umount is triggered. Refuse this
request to avoid panic.

Reported-by: syzbot+2dacb8f015bf1420155f@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=3613786cb88c93aa1c6a279b1df6a7b201347d08
Link: https://lore.kernel.org/r/20230103014517.495275-3-jun.nie@linaro.org
Cc: stable@kernel.org
Signed-off-by: Jun Nie <jun.nie@linaro.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-02-18 23:58:18 -05:00
Jun Nie
1e9d62d252 ext4: optimize ea_inode block expansion
Copy ea data from inode entry when expanding ea block if possible.
Then remove the ea entry if expansion success. Thus memcpy to a
temporary buffer may be avoided.

If the expansion fails, we do not need to recovery the removed ea
entry neither in this way.

Reported-by: syzbot+2dacb8f015bf1420155f@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=3613786cb88c93aa1c6a279b1df6a7b201347d08
Link: https://lore.kernel.org/r/20230103014517.495275-2-jun.nie@linaro.org
Cc: stable@kernel.org
Signed-off-by: Jun Nie <jun.nie@linaro.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-02-18 23:57:37 -05:00
Tanmay Bhushan
08abd0466e ext4: remove dead code in updating backup sb
ext4_update_backup_sb checks for err having some value
after unlocking buffer. But err has not been updated
till that point in any code which will lead execution
of the code in question.

Signed-off-by: Tanmay Bhushan <007047221b@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221230141858.3828-1-007047221b@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-02-18 22:57:25 -05:00
Linus Torvalds
38f8ccde04 Six hotfixes. Five are cc:stable: four for MM, one for nilfs2. Also a
MAINTAINERS update.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/AK0AAKCRDdBJ7gKXxA
 jg4SAQCw/Udkt+UgtFzQ+oXg8FAw3ivrniGnOwaMfDDbiVz3KgD+Mkvnw6nb7PMT
 G9iFA5ZRBISCv0ahXxnNrxbtmcFcewQ=
 =fFg9
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-02-17-15-16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "Six hotfixes. Five are cc:stable: four for MM, one for nilfs2.

  Also a MAINTAINERS update"

* tag 'mm-hotfixes-stable-2023-02-17-15-16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  nilfs2: fix underflow in second superblock position calculations
  hugetlb: check for undefined shift on 32 bit architectures
  mm/migrate: fix wrongly apply write bit after mkdirty on sparc64
  MAINTAINERS: update FPU EMULATOR web page
  mm/MADV_COLLAPSE: set EAGAIN on unexpected page refcount
  mm/filemap: fix page end in filemap_get_read_batch
2023-02-17 17:51:40 -08:00
Ryusuke Konishi
99b9402a36 nilfs2: fix underflow in second superblock position calculations
Macro NILFS_SB2_OFFSET_BYTES, which computes the position of the second
superblock, underflows when the argument device size is less than 4096
bytes.  Therefore, when using this macro, it is necessary to check in
advance that the device size is not less than a lower limit, or at least
that underflow does not occur.

The current nilfs2 implementation lacks this check, causing out-of-bound
block access when mounting devices smaller than 4096 bytes:

 I/O error, dev loop0, sector 36028797018963960 op 0x0:(READ) flags 0x0
 phys_seg 1 prio class 2
 NILFS (loop0): unable to read secondary superblock (blocksize = 1024)

In addition, when trying to resize the filesystem to a size below 4096
bytes, this underflow occurs in nilfs_resize_fs(), passing a huge number
of segments to nilfs_sufile_resize(), corrupting parameters such as the
number of segments in superblocks.  This causes excessive loop iterations
in nilfs_sufile_resize() during a subsequent resize ioctl, causing
semaphore ns_segctor_sem to block for a long time and hang the writer
thread:

 INFO: task segctord:5067 blocked for more than 143 seconds.
      Not tainted 6.2.0-rc8-syzkaller-00015-gf6feea56f66d #0
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:segctord        state:D stack:23456 pid:5067  ppid:2
 flags:0x00004000
 Call Trace:
  <TASK>
  context_switch kernel/sched/core.c:5293 [inline]
  __schedule+0x1409/0x43f0 kernel/sched/core.c:6606
  schedule+0xc3/0x190 kernel/sched/core.c:6682
  rwsem_down_write_slowpath+0xfcf/0x14a0 kernel/locking/rwsem.c:1190
  nilfs_transaction_lock+0x25c/0x4f0 fs/nilfs2/segment.c:357
  nilfs_segctor_thread_construct fs/nilfs2/segment.c:2486 [inline]
  nilfs_segctor_thread+0x52f/0x1140 fs/nilfs2/segment.c:2570
  kthread+0x270/0x300 kernel/kthread.c:376
  ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
  </TASK>
 ...
 Call Trace:
  <TASK>
  folio_mark_accessed+0x51c/0xf00 mm/swap.c:515
  __nilfs_get_page_block fs/nilfs2/page.c:42 [inline]
  nilfs_grab_buffer+0x3d3/0x540 fs/nilfs2/page.c:61
  nilfs_mdt_submit_block+0xd7/0x8f0 fs/nilfs2/mdt.c:121
  nilfs_mdt_read_block+0xeb/0x430 fs/nilfs2/mdt.c:176
  nilfs_mdt_get_block+0x12d/0xbb0 fs/nilfs2/mdt.c:251
  nilfs_sufile_get_segment_usage_block fs/nilfs2/sufile.c:92 [inline]
  nilfs_sufile_truncate_range fs/nilfs2/sufile.c:679 [inline]
  nilfs_sufile_resize+0x7a3/0x12b0 fs/nilfs2/sufile.c:777
  nilfs_resize_fs+0x20c/0xed0 fs/nilfs2/super.c:422
  nilfs_ioctl_resize fs/nilfs2/ioctl.c:1033 [inline]
  nilfs_ioctl+0x137c/0x2440 fs/nilfs2/ioctl.c:1301
  ...

This fixes these issues by inserting appropriate minimum device size
checks or anti-underflow checks, depending on where the macro is used.

Link: https://lkml.kernel.org/r/0000000000004e1dfa05f4a48e6b@google.com
Link: https://lkml.kernel.org/r/20230214224043.24141-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: <syzbot+f0c4082ce5ebebdac63b@syzkaller.appspotmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-17 15:07:05 -08:00
Anna Schumaker
896e090eef Revert "NFSv4.2: Change the default KConfig value for READ_PLUS"
This reverts commit 7fd461c47c.

Unfortunately, it has come to our attention that there is still a bug
somewhere in the READ_PLUS code that can result in nfsroot systems on
ARM to crash during boot.

Let's do the right thing and revert this change so we don't break
people's nfsroot setups.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-02-17 09:07:19 -05:00
David S. Miller
675f176b4d Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net
Some of the devlink bits were tricky, but I think I got it right.

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-17 11:06:39 +00:00
Dan Carpenter
8d1b80a794 erofs: fix an error code in z_erofs_init_zip_subsystem()
Return -ENOMEM if alloc_workqueue() fails.  Don't return success.

Fixes: d8a650adf429 ("erofs: add per-cpu threads for decompression as an option")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/Y+4d0FRsUq8jPoOu@kili
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-02-16 22:51:53 +08:00
Qi Zheng
d3002468cb debugfs: update comment of debugfs_rename()
Since commit ff9fb72bc0 ("debugfs: return error values, not NULL")
changed return value of debugfs_rename() in error cases from %NULL to
%ERR_PTR(-ERROR), the comment of debugfs_rename should also be updated
so as not to mislead readers.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Link: https://lore.kernel.org/r/20230208035634.58095-1-zhengqi.arch@bytedance.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-16 13:43:05 +01:00
Linus Torvalds
3402351a5a nfsd-6.2 fixes:
- Fix a teardown bug in the new nfs4_file hashtable
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmPqS30ACgkQM2qzM29m
 f5fV2BAAgkmq0MF6nTjjVcFYSUyJvPO4jCcT5ZPqJcv2CHCWp/EkhOrBMHtli3y2
 zUspjZfHpuV5oi2j7J1/PnlLBbov7oXjg9vZ2xGazo1Zkr5wQHxwyuFXdVzte2Qk
 fGcCEgLA3aPiluafzWHMYeoZUqVSQCT41UaUW4iauc/byRZmNs0nwVsYlfhWzg1U
 dsQsK03S357mmqhevxIbBVvTRNakxJwqALBVZcEgfjulESwgQ6gIJpo8srrAkZ4v
 2k5bPVVTRhyDCv+8fOe7UqIyW7HwT5WDvMJg7C883wcP8xk7NHqg54t7NVoPNv5R
 oZLkwUBr5G75OThCmrm8zDMt6zqT76Wyc29fBQ0hjyNml0HOiB131yYv7zGW6qqD
 2e+jQpAKDkc6IIy8IzqxgHan9lMvtYfz7uIQjGGXb+vPTF5S7BIdWJjSEvgo/7a7
 RNlou48UAF4NeZHcDESXFArcQ0MXd9rBnawxRVDqMzQ3owMyn3Nt8FRx7K/dNGKl
 6qnIR+H7G65KQgQCxiqrGvvO3NssCBurv/9BRNRmVshVzI0pdn/seZff4TngvDfI
 GWvEE7nMFE1jDBrN5o5ecLUVI+zTDuNY3AAnaH7afiRCBQauq+AziCfvXoLtfGw2
 +FktTeZRgZQzMFA62H3eEDvy5dKDp/a0ivP1QglIid6oBRH/muw=
 =pQWv
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.2-6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fix from Chuck Lever:

 - Fix a teardown bug in the new nfs4_file hashtable

* tag 'nfsd-6.2-6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: don't destroy global nfs4_file table in per-net shutdown
2023-02-15 11:48:56 -08:00
Thomas Weißschuh
964a54e5e1 btrfs: make kobj_type structures constant
Since commit ee6d3dd4ed ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.

Take advantage of this to constify the structure definitions to prevent
modification at runtime.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:55 +01:00
Christoph Hellwig
1eb82ef873 btrfs: remove the bdev argument to btrfs_rmap_block
The only user in the zoned remap code is gone now, so remove the argument.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:55 +01:00
Christoph Hellwig
04f0847c45 btrfs: don't rely on unchanging ->bi_bdev for zone append remaps
btrfs_record_physical_zoned relies on a bio->bi_bdev samples in the
bio_end_io handler to find the reverse map for remapping the zone append
write, but stacked block device drivers can and usually do change bi_bdev
when sending on the bio to a lower device.  This can happen e.g. with the
nvme-multipath driver when a NVMe SSD sets the shared namespace bit.

But there is no real need for the bdev in btrfs_record_physical_zoned,
as it is only passed to btrfs_rmap_block, which uses it to pick the
mapping to report if there are multiple reverse mappings.  As zone
writes can only do simple non-mirror writes right now, and anything
more complex will use the stripe tree there is no chance of the multiple
mappings case actually happening.

Instead open code the subset of btrfs_rmap_block in
btrfs_record_physical_zoned, which also removes a memory allocation and
remove the bdev field in the ordered extent.

Fixes: d8e3fb106f ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:55 +01:00
Christoph Hellwig
fdf9a37dcf btrfs: never return true for reads in btrfs_use_zone_append
Using Zone Append only makes sense for writes to the device, so check
that in btrfs_use_zone_append.  This avoids the possibility of
artificially limited read size on zoned file systems.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:55 +01:00
Christoph Hellwig
921603c762 btrfs: pass a btrfs_bio to btrfs_use_append
struct btrfs_bio has all the information needed for btrfs_use_append, so
pass that instead of a btrfs_inode and file_offset.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:55 +01:00
Christoph Hellwig
0d495430db btrfs: set bbio->file_offset in alloc_new_bio
Instead of digging into the bio_vec in submit_one_bio, set file_offset at
bio allocation time from the provided parameter.  This also ensures that
the file_offset is available all the time when building up the bio
payload.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:55 +01:00
Christoph Hellwig
72fcf1a47b btrfs: use file_offset to limit bios size in calc_bio_boundaries
btrfs_ordered_extent->disk_bytenr can be rewritten by the zoned I/O
completion handler, and thus in general is not a good idea to limit I/O
size.  But the maximum bio size calculation can easily be done using the
file_offset fields in the btrfs_ordered_extent and btrfs_bio structures,
so switch to that instead.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:55 +01:00
Filipe Manana
a724f313f8 btrfs: do unsigned integer division in the extent buffer binary search loop
In the search loop of the binary search function, we are doing a division
by 2 of the sum of the high and low slots. Because the slots are integers,
the generated assembly code for it is the following on x86_64:

   0x00000000000141f1 <+145>:	mov    %eax,%ebx
   0x00000000000141f3 <+147>:	shr    $0x1f,%ebx
   0x00000000000141f6 <+150>:	add    %eax,%ebx
   0x00000000000141f8 <+152>:	sar    %ebx

It's a few more instructions than a simple right shift, because signed
integer division needs to round towards zero. However we know that slots
can never be negative (btrfs_header_nritems() returns an u32), so we
can instead use unsigned types for the low and high slots and therefore
use unsigned integer division, which results in a single instruction on
x86_64:

   0x00000000000141f0 <+144>:	shr    %ebx

So use unsigned types for the slots and therefore unsigned division.

This is part of a small patchset comprised of the following two patches:

  btrfs: eliminate extra call when doing binary search on extent buffer
  btrfs: do unsigned integer division in the extent buffer binary search loop

The following fs_mark test was run on a non-debug kernel (Debian's default
kernel config) before and after applying the patchset:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-O no-holes -R free-space-tree"
  FILES=100000
  THREADS=$(nproc --all)
  FILE_SIZE=0

  umount $DEV &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  OPTS="-S 0 -L 6 -n $FILES -s $FILE_SIZE -t $THREADS -k"
  for ((i = 1; i <= $THREADS; i++)); do
      OPTS="$OPTS -d $MNT/d$i"
  done

  fs_mark $OPTS

  umount $MNT

Results before applying patchset:

  FSUse%        Count         Size    Files/sec     App Overhead
       2      1200000            0     174472.0         11549868
       4      2400000            0     253503.0         11694618
       4      3600000            0     257833.1         11611508
       6      4800000            0     247089.5         11665983
       6      6000000            0     211296.1         12121244
      10      7200000            0     187330.6         12548565

Results after applying patchset:

  FSUse%        Count         Size    Files/sec     App Overhead
       2      1200000            0     207556.0         11393252
       4      2400000            0     266751.1         11347909
       4      3600000            0     274397.5         11270058
       6      4800000            0     259608.4         11442250
       6      6000000            0     238895.8         11635921
       8      7200000            0     211942.2         11873825

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:55 +01:00
Filipe Manana
7b00dfffeb btrfs: eliminate extra call when doing binary search on extent buffer
The function btrfs_bin_search() is just a wrapper around the function
generic_bin_search(), which passes the same arguments plus a default
low slot with a value of 0. This adds an unnecessary extra function
call, since btrfs_bin_search() is not static. So improve on this by
making btrfs_bin_search() an inline function that calls
generic_bin_search(), renaming the later to btrfs_generic_bin_search()
and exporting it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:55 +01:00
Christoph Hellwig
08241d3c74 btrfs: raid56: handle endio in scrub_rbio
The only caller of scrub_rbio calls rbio_orig_end_io right after it,
move it into scrub_rbio to match the other work item helpers.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:55 +01:00
Christoph Hellwig
40f87ddb5d btrfs: raid56: handle endio in recover_rbio
Both callers of recover_rbio call rbio_orig_end_io right after it, so
move the call into the shared function.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:54 +01:00
Christoph Hellwig
1d0ef1ca11 btrfs: raid56: handle endio in rmw_rbio
Both callers of rmv_rbio call rbio_orig_end_io right after it, so
move the call into the shared function.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:54 +01:00
Christoph Hellwig
52f0c19864 btrfs: raid56: submit the read bios from scrub_assemble_read_bios
Instead of filling in a bio_list and submitting the bios in the only
caller, do that in scrub_assemble_read_bios.  This removes the
need to pass the bio_list, and also makes it clear that the extra
bio_list cleanup in the caller is entirely pointless.  Rename the
function to scrub_read_bios to make it clear that the bios are not
only assembled.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:54 +01:00
Christoph Hellwig
02efa3a6ba btrfs: raid56: fold rmw_read_wait_recover into rmw_read_bios
There is very little extra code in rmw_read_bios, and a large part of it
is the superfluous extra cleanup of the bio list.  Merge the two
functions, and only clean up the bio list after it has been added to
but before it has been emptied again by submit_read_wait_bio_list.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:54 +01:00
Christoph Hellwig
d838d05ea5 btrfs: raid56: fold recover_assemble_read_bios into recover_rbio
There is very little extra code in recover_rbio, and a large part of it
is the superfluous extra cleanup of the bio list.  Merge the two
functions, and only clean up the bio list after it has been added to
but before it has been emptied again by submit_read_wait_bio_list.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:54 +01:00
Christoph Hellwig
801fcfc5d7 btrfs: raid56: add a bio_list_put helper
Add a helper to put all bios in a list. This does not need to be added
to block layer as there are no other users of such code.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:54 +01:00
Christoph Hellwig
1c76fb7b31 btrfs: raid56: wait for I/O completion in submit_read_bios
In addition to setting up the end_io handler and submitting the bios in
submit_read_bios, also wait for them to be completed instead of waiting
for the completion manually in all three callers.

Rename submit_read_bios to submit_read_wait_bio_list to make it clear
it waits for the bios as well.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:54 +01:00
Christoph Hellwig
4d7627010b btrfs: raid56: simplify code flow in rmw_rbio
Remove the write goto label by moving the data page allocation and data
read into the branch.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:54 +01:00
Christoph Hellwig
abb49e8742 btrfs: raid56: simplify error handling and code flow in raid56_parity_write
Handle the error return on alloc_rbio failure directly instead of using
a goto and remove the queue_rbio goto label by moving the plugged
check into the if branch.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:54 +01:00
Josef Bacik
79b02ec1d8 btrfs: replace btrfs_wait_tree_block_writeback by wait_on_extent_buffer_writeback
This is used in the tree-log code and is a holdover from previous
iterations of extent buffer writeback.  We can simply use
wait_on_extent_buffer_writeback here, and remove
btrfs_wait_tree_block_writeback completely as it's equivalent (waiting
on page write writeback).

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:54 +01:00
Josef Bacik
98c8d683c2 btrfs: combine btrfs_clear_buffer_dirty and clear_extent_buffer_dirty
btrfs_clear_buffer_dirty just does the test_clear_bit() and then calls
clear_extent_buffer_dirty and does the dirty metadata accounting.
Combine this into clear_extent_buffer_dirty and make the result
btrfs_clear_buffer_dirty.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:54 +01:00
Josef Bacik
190a83391b btrfs: rename btrfs_clean_tree_block to btrfs_clear_buffer_dirty
btrfs_clean_tree_block is a misnomer, it's just
clear_extent_buffer_dirty with some extra accounting around it.  Rename
this to btrfs_clear_buffer_dirty to make it more clear it belongs with
it's setter, btrfs_mark_buffer_dirty.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:53 +01:00
Josef Bacik
f88fd65043 btrfs: do not increment dirty_metadata_bytes in set_btree_ioerr
We only add if we set the extent buffer dirty, and we subtract when we
clear the extent buffer dirty.  If we end up in set_btree_ioerr we have
already cleared the buffer dirty, and we aren't resetting dirty on the
extent buffer, so this is simply wrong.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:53 +01:00
Josef Bacik
c4e54a6571 btrfs: replace clearing extent buffer dirty bit with btrfs_clean_block
Now that we're passing in the trans into btrfs_clean_tree_block, we can
easily roll in the handling of the !trans case and replace all
occurrences of

	if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags))
		clear_extent_buffer_dirty(eb);

with

	btrfs_tree_lock(eb);
	btrfs_clean_tree_block(eb);
	btrfs_tree_unlock(eb);

We need the lock because if we are actually dirty we need to make sure
we aren't racing with anything that's starting writeout currently.  This
also makes sure that we're accounting fs_info->dirty_metadata_bytes
appropriately.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:53 +01:00
Josef Bacik
ed25dab3a0 btrfs: add trans argument to btrfs_clean_tree_block
We check the header generation in the extent buffer against the current
running transaction id to see if it's safe to clear DIRTY on this
buffer.  Generally speaking if we're clearing the buffer dirty we're
holding the transaction open, but in the case of cleaning up an aborted
transaction we don't, so we have extra checks in that path to check the
transid.  To allow for a future cleanup go ahead and pass in the trans
handle so we don't have to rely on ->running_transaction being set.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:53 +01:00
Josef Bacik
d3fb66150c btrfs: always lock the block before calling btrfs_clean_tree_block
We want to clean up the dirty handling for extent buffers so it's a
little more consistent, so skip the check for generation == transid and
simply always lock the extent buffer before calling btrfs_clean_tree_block.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:53 +01:00
Christoph Hellwig
8e81aa16a4 iomap: remove IOMAP_F_ZONE_APPEND
No users left now that btrfs takes REQ_OP_WRITE bios from iomap and
splits and converts them to REQ_OP_ZONE_APPEND internally.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:53 +01:00
Christoph Hellwig
d5e4377d50 btrfs: split zone append bios in btrfs_submit_bio
The current btrfs zoned device support is a little cumbersome in the data
I/O path as it requires the callers to not issue I/O larger than the
supported ZONE_APPEND size of the underlying device.  This leads to a lot
of extra accounting.  Instead change btrfs_submit_bio so that it can take
write bios of arbitrary size and form from the upper layers, and just
split them internally to the ZONE_APPEND queue limits.  Then remove all
the upper layer warts catering to limited write sized on zoned devices,
including the extra refcount in the compressed_bio.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:53 +01:00
Christoph Hellwig
243cf8d1b6 btrfs: calculate file system wide queue limit for zoned mode
To be able to split a write into properly sized zone append commands,
we need a queue_limits structure that contains the least common
denominator suitable for all devices.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:53 +01:00
Christoph Hellwig
35a8d7da3c btrfs: remove now spurious bio submission helpers
Call btrfs_submit_bio and btrfs_submit_compressed_read directly from
submit_one_bio now that all additional functionality has moved into
btrfs_submit_bio.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:53 +01:00
Christoph Hellwig
285599b6fe btrfs: remove the fs_info argument to btrfs_submit_bio
btrfs_submit_bio can derive it trivially from bbio->inode, so stop
bothering in the callers.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:53 +01:00
Christoph Hellwig
48253076c3 btrfs: open code submit_encoded_read_bio
Open code the functionality in the only caller and remove the now
superfluous error handling there.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:53 +01:00
Christoph Hellwig
f8a02dc6fd btrfs: remove struct btrfs_io_geometry
Now that btrfs_get_io_geometry has a single caller, we can massage it
into a form that is more suitable for that caller and remove the
marshalling into and out of struct btrfs_io_geometry.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:52 +01:00
Qu Wenruo
a34e4c3f88 btrfs: remove stripe boundary calculation for encoded I/O
Stop looking at the stripe boundary in
btrfs_encoded_read_regular_fill_pages() now that btrfs_submit_bio can
split bios.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:52 +01:00
Qu Wenruo
30493ff49f btrfs: remove stripe boundary calculation for compressed I/O
Stop looking at the stripe boundary in alloc_compressed_bio() now that
that btrfs_submit_bio can split bios, open code the now trivial code
from alloc_compressed_bio() in btrfs_submit_compressed_read and stop
maintaining the pending_ios count for reads as there is always just
a single bio now.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[hch: remove more cruft in btrfs_submit_compressed_read,
      use btrfs_zoned_get_device in alloc_compressed_bio]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:52 +01:00
Qu Wenruo
2380220e1e btrfs: remove stripe boundary calculation for buffered I/O
Remove btrfs_bio_ctrl::len_to_stripe_boundary, so that buffer
I/O will no longer limit its bio size according to stripe length
now that btrfs_submit_bio can split bios at stripe boundaries.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[hch: simplify calc_bio_boundaries a little more]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:52 +01:00
Christoph Hellwig
67d6698250 btrfs: pass the iomap bio to btrfs_submit_bio
Now that btrfs_submit_bio splits the bio when crossing stripe boundaries,
there is no need for the higher level code to do that manually.

For direct I/O this is really helpful, as btrfs_submit_io can now simply
take the bio allocated by iomap and send it on to btrfs_submit_bio
instead of allocating clones.

For that to work, the bio embedded into struct btrfs_dio_private needs to
become a full btrfs_bio as expected by btrfs_submit_bio.

With this change there is a single work item to offload the entire iomap
bio so the heuristics to skip async processing for bios that were split
isn't needed anymore either.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:52 +01:00
Christoph Hellwig
852eee62d3 btrfs: allow btrfs_submit_bio to split bios
Currently the I/O submitters have to split bios according to the chunk
stripe boundaries.  This leads to extra lookups in the extent trees and
a lot of boilerplate code.

To drop this requirement, split the bio when __btrfs_map_block returns a
mapping that is smaller than the requested size and keep a count of
pending bios in the original btrfs_bio so that the upper level
completion is only invoked when all clones have completed.

Based on a patch from Qu Wenruo.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:52 +01:00
Christoph Hellwig
542e300e83 btrfs: support cloned bios in btree_csum_one_bio
To allow splitting bios in btrfs_submit_bio, btree_csum_one_bio needs to
be able to handle cloned bios.  As btree_csum_one_bio is always called
before handing the bio to the block layer that is trivially done by using
bio_for_each_segment instead of bio_for_each_segment_all.  Also switch
the function to take a btrfs_bio and use that to derive the fs_info.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:52 +01:00
Christoph Hellwig
69ccf3f424 btrfs: handle recording of zoned writes in the storage layer
Move the code that splits the ordered extents and records the physical
location for them to the storage layer so that the higher level consumers
don't have to care about physical block numbers at all.  This will also
allow to eventually remove accounting for the zone append write sizes in
the upper layer with a little bit more block layer work.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:52 +01:00
Christoph Hellwig
f8a53bb58e btrfs: handle checksum generation in the storage layer
Instead of letting the callers of btrfs_submit_bio deal with checksumming
the (meta)data in the bio and making decisions on when to offload the
checksumming to the bio, leave that to btrfs_submit_bio.  Do do so the
existing btrfs_submit_bio function is split into an upper and a lower
half, so that the lower half can be offloaded to a workqueue.

Note that this changes the behavior for direct writes to raid56 volumes so
that async checksum offloading is not skipped when more I/O is expected.
This runs counter to the argument explaining why it was done, although I
can't measure any affects of the change.  Commits later in this series
will make sure the entire direct writes is offloaded to the workqueue
at once and thus make sure it is sent to the raid56 code from a single
thread.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:52 +01:00
Christoph Hellwig
f8c44673e5 btrfs: simplify the btrfs_csum_one_bio calling convention
To prepare for further bio submission changes btrfs_csum_one_bio
should be able to take all it's arguments from the btrfs_bio structure.
It can always use the bbio->inode already, and once the compression code
is updated to set ->file_offset that one can be used unconditionally
as well instead of looking at the page mapping now that btrfs doesn't
allow ordered extents to span discontiguous data ranges.

The only slightly tricky bit is the one_ordered flag set by the
compressed writes.  Replace that one with the driver private bio
flag, which gets cleared before the bio is handed off to the block layer
so that we don't get in the way of driver use.

Note: this leaves an argument and a flag to btrfs_wq_submit_bio unused.
But that whole mechanism will be removed in its current form in the
next patch.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:52 +01:00
Christoph Hellwig
deb6216fa0 btrfs: open code the submit_bio_start helpers
The submit helpers are now trivial and can be called directly.  Note
that btree_csum_one_bio has to be moved up in the file a bit to avoid a
forward declaration.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:52 +01:00
Christoph Hellwig
295fe46ff1 btrfs: remove struct btrfs_bio::is_metadata flag
This flag is unused now, so remove it.  Re-expand the mirror_num field
to 8 bits, and move it to the I/O completion internal section of the
structure.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:52 +01:00
Christoph Hellwig
0d3acb25e7 btrfs: rename btrfs_bio::iter field
Rename iter to saved_iter and move it next to the repair internals
and nothing outside of bio.c should be touching it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:51 +01:00
Christoph Hellwig
0571b6357c btrfs: remove the io_failure_record infrastructure
struct io_failure_record and the io_failure_tree tree are unused now,
so remove them. This in turn makes struct btrfs_inode smaller by 16
bytes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:51 +01:00
Christoph Hellwig
860c8c4516 btrfs: remove struct btrfs_bio::device field
The device field is only used by the simple end I/O handler, and for
that it can simply be stored in the bi_private field of the bio,
which is currently used for the fs_info that can be retrieved through
bbio->inode as well.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:51 +01:00
Christoph Hellwig
3d49d0d312 btrfs: remove now unused checksumming helpers
Remove the unused btrfs_verify_data_csum helper, and fold
btrfs_check_data_csum into its only caller.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:51 +01:00
Christoph Hellwig
ac9f942e86 btrfs: remove btrfs_bio_for_each_sector
btrfs_bio_for_each_sector is unused now, so remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:51 +01:00
Christoph Hellwig
7ab0fdfc81 btrfs: open code btrfs_bio_free_csum
btrfs_bio_free_csum has only one caller left, and that caller is always
for an data inode and doesn't need zeroing of the csum pointer as that
pointer will never be touched again.  Just open code the conditional
kfree there.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:51 +01:00
Christoph Hellwig
7609afac67 btrfs: handle checksum validation and repair at the storage layer
Currently btrfs handles checksum validation and repair in the end I/O
handler for the btrfs_bio.  This leads to a lot of duplicate code
plus issues with varying semantics or bugs, e.g.

- the until recently broken repair for compressed extents
- the fact that encoded reads validate the checksums but do not kick
  of read repair
- the inconsistent checking of the BTRFS_FS_STATE_NO_CSUMS flag

This commit revamps the checksum validation and repair code to instead
work below the btrfs_submit_bio interfaces.

In case of a checksum failure (or a plain old I/O error), the repair
is now kicked off before the upper level ->end_io handler is invoked.

Progress of an in-progress repair is tracked by a small structure
that is allocated using a mempool for each original bio with failed
sectors, which holds a reference to the original bio.   This new
structure is allocated using a mempool to guarantee forward progress
even under memory pressure.  The mempool will be replenished when
the repair completes, just as the mempools backing the bios.

There is one significant behavior change here:  If repair fails or
is impossible to start with, the whole bio will be failed to the
upper layer.  This is the behavior that all I/O submitters except
for buffered I/O already emulated in their end_io handler.  For
buffered I/O this now means that a large readahead request can
fail due to a single bad sector, but as readahead errors are ignored
the following readpage if the sector is actually accessed will
still be able to read.  This also matches the I/O failure handling
in other file systems.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:51 +01:00
Christoph Hellwig
e52190441b btrfs: add a btrfs_data_csum_ok helper
Add a new checksumming helper that wraps btrfs_check_data_csum and
does all the checks to if we're dealing with some form of nodatacsum
I/O.  This helper will be used by the new storage layer checksum
validation and repair code.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:51 +01:00
Christoph Hellwig
1c2b3ee3b0 btrfs: pre-load data checksum for reads in btrfs_submit_bio
Instead of calling btrfs_lookup_bio_sums in every caller of
btrfs_submit_bio that reads data, do the call once in btrfs_submit_bio.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:51 +01:00
Christoph Hellwig
7276aa7d38 btrfs: save the bio iter for checksum validation in common code
All callers of btrfs_submit_bio that want to validate checksums
currently have to store a copy of the iter in the btrfs_bio.  Move
the assignment into common code.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:51 +01:00
Christoph Hellwig
9ba0004bd9 btrfs: refactor error handling in btrfs_submit_bio
Add a bbio local variable and to prepare for calling functions that
return a blk_status_t, rename the existing int used for error handling
so that ret can be reused for the blk_status_t, and a label that can be
reused for failing the passed in bio.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:51 +01:00
Christoph Hellwig
4ae2edf12d btrfs: simplify parameters of btrfs_lookup_bio_sums
The csums argument is always NULL now, so remove it and always allocate
the csums array in the btrfs_bio.  Also pass the btrfs_bio instead of
inode + bio to document that this function requires a btrfs_bio and
not just any bio.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:50 +01:00
Christoph Hellwig
5fa356531e btrfs: remove the direct I/O read checksum lookup optimization
To prepare for pending changes drop the optimization to only look up
csums once per bio that is submitted from the iomap layer.  In the
short run this does cause additional lookups for fragmented direct
reads, but later in the series, the bio based lookup will be used on
the entire bio submitted from iomap, restoring the old behavior
in common code.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:50 +01:00
Christoph Hellwig
d0e5cb2be7 btrfs: add a btrfs_inode pointer to struct btrfs_bio
All btrfs_bio I/Os are associated with an inode.  Add a pointer to that
inode, which will allow to simplify a lot of calling conventions, and
which will be needed in the I/O completion path in the future.

This grow the btrfs_bio structure by a pointer, but that grows will
be offset by the removal of the device pointer soon.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:50 +01:00
Christoph Hellwig
e0cfbb2cca btrfs: better document struct btrfs_bio
Update the comments on btrfs_bio to better describe the structure.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:50 +01:00
Qu Wenruo
c9a43aaf09 btrfs: raid56: reduce overhead to calculate the bio length
In rbio_update_error_bitmap(), we need to calculate the length of the
rbio.  As since it's called in the endio function, we can not directly
grab the length from bi_iter.

Currently we call bio_for_each_segment_all(), which will always return a
range inside a page.  But that's not necessary as we don't really care
about anything inside the page.

So use bio_for_each_bvec_all(), which can return a bvec across multiple
continuous pages thus reduce the loops.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:50 +01:00
Colin Ian King
67da05b3f2 btrfs: fix spelling mistakes found using codespell
There quite a few spelling mistakes as found using codespell. Fix them.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:50 +01:00
Filipe Manana
e2fd83064a btrfs: skip backref walking during fiemap if we know the leaf is shared
During fiemap, when checking if a data extent is shared we are doing the
backref walking even if we already know the leaf is shared, which is a
waste of time since if the leaf shared then the data extent is also
shared. So skip the backref walking when we know we are in a shared leaf.

The following test was measures the gains for a case where all leaves
are shared due to a snapshot:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdj
   MNT=/mnt/sdj

   umount $DEV &> /dev/null
   mkfs.btrfs -f $DEV
   # Use compression to quickly create files with a lot of extents
   # (each with a size of 128K).
   mount -o compress=lzo $DEV $MNT

   # 40G gives 327680 extents, each with a size of 128K.
   xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar

   # Add some more files to increase the size of the fs and extent
   # trees (in the real world there's a lot of files and extents
   # from other files).
   xfs_io -f -c "pwrite -S 0xcd -b 1M 0 20G" $MNT/file1
   xfs_io -f -c "pwrite -S 0xef -b 1M 0 20G" $MNT/file2
   xfs_io -f -c "pwrite -S 0x73 -b 1M 0 20G" $MNT/file3

   # Create a snapshot so all the extents become indirectly shared
   # through subtrees, with a generation less than or equals to the
   # generation used to create the snapshot.
   btrfs subvolume snapshot -r $MNT $MNT/snap1

   # Unmount and mount again to clear cached metadata.
   umount $MNT
   mount -o compress=lzo $DEV $MNT

   start=$(date +%s%N)
   # The filefrag tool  uses the fiemap ioctl.
   filefrag $MNT/foobar
   end=$(date +%s%N)
   dur=$(( (end - start) / 1000000 ))
   echo "fiemap took $dur milliseconds (metadata not cached)"
   echo

   start=$(date +%s%N)
   filefrag $MNT/foobar
   end=$(date +%s%N)
   dur=$(( (end - start) / 1000000 ))
   echo "fiemap took $dur milliseconds (metadata cached)"

   umount $MNT

The results were the following on a non-debug kernel (Debian's default
kernel config).

Before this patch:

   (...)
   /mnt/sdi/foobar: 327680 extents found
   fiemap took 1821 milliseconds (metadata not cached)

   /mnt/sdi/foobar: 327680 extents found
   fiemap took 399 milliseconds (metadata cached)

After this patch:

   (...)
   /mnt/sdi/foobar: 327680 extents found
   fiemap took 591 milliseconds (metadata not cached)

   /mnt/sdi/foobar: 327680 extents found
   fiemap took 123 milliseconds (metadata cached)

That's a speedup of 3.1x and 3.2x.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:50 +01:00
Filipe Manana
4e4488d4ef btrfs: assert commit root semaphore is held when accessing backref cache
During fiemap, when accessing the cache that stores the sharedness of an
extent, we need to either be holding a transaction handle or the commit
root semaphore. I left comments about this in the comment that precedes
store_backref_shared_cache() and lookup_backref_shared_cache(), but have
actually not enforced it through assertions. So assert that the commit
root semaphore is held if we are not holding a transaction handle.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:50 +01:00
Boris Burkov
2b5463fcbd btrfs: hold block group refcount during async discard
Async discard does not acquire the block group reference count while it
holds a reference on the discard list. This is generally OK, as the
paths which destroy block groups tend to try to synchronize on
cancelling async discard work. However, relying on cancelling work
requires careful analysis to be sure it is safe from races with
unpinning scheduling more work.

While I am unable to find a race with unpinning in the current code for
either the unused bgs or relocation paths, I believe we have one in an
older version of auto relocation in a Meta internal build. This suggests
that this is in fact an error prone model, and could be fragile to
future changes to these bg deletion paths.

To make this ownership more clear, add a refcount for async discard. If
work is queued for a block group, its refcount should be incremented,
and when work is completed or canceled, it should be decremented.

CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:50 +01:00
Filipe Manana
3e49363be6 btrfs: send: cache utimes operations for directories if possible
Whenever we add or remove an entry to a directory, we issue an utimes
command for the directory. If we add 1000 entries to a directory (create
1000 files under it or move 1000 files to it), then we issue the same
utimes command 1000 times, which increases the send stream size, results
in more pipe IO, one search in the send b+tree, allocating one path for
the search, etc, as well as making the receiver do a system call for each
duplicated utimes command.

We also issue an utimes command when we create a new directory, but later
we might add entries to it corresponding to inodes with an higher inode
number, so it's pointless to issue the utimes command before we create
the last inode under the directory.

So use a lru cache to track directories for which we must send a utimes
command. When we need to remove an entry from the cache, we issue the
utimes command for the respective directory. When finishing the send
operation, we go over each cache element and issue the respective utimes
command. Finally the caching is entirely optional, just a performance
optimization, meaning that if we fail to cache (due to memory allocation
failure), we issue the utimes command right away, that is, we fallback
to the previous, unoptimized, behaviour.

This patch belongs to a patchset comprised of the following patches:

  btrfs: send: directly return from did_overwrite_ref() and simplify it
  btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
  btrfs: send: directly return from will_overwrite_ref() and simplify it
  btrfs: send: avoid extra b+tree searches when checking reference overrides
  btrfs: send: remove send_progress argument from can_rmdir()
  btrfs: send: avoid duplicated orphan dir allocation and initialization
  btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
  btrfs: send: reduce searches on parent root when checking if dir can be removed
  btrfs: send: iterate waiting dir move rbtree only once when processing refs
  btrfs: send: initialize all the red black trees earlier
  btrfs: send: genericize the backref cache to allow it to be reused
  btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
  btrfs: send: cache information about created directories
  btrfs: allow a generation number to be associated with lru cache entries
  btrfs: add an api to delete a specific entry from the lru cache
  btrfs: send: use the lru cache to implement the name cache
  btrfs: send: update size of roots array for backref cache entries
  btrfs: send: cache utimes operations for directories if possible

The following test was run before and after applying the whole patchset,
and on a non-debug kernel (Debian's default kernel config):

   #!/bin/bash

   MNT=/mnt/sdi
   DEV=/dev/sdi

   mkfs.btrfs -f $DEV > /dev/null
   mount $DEV $MNT

   mkdir $MNT/A
   for ((i = 1; i <= 20000; i++)); do
       echo -n > $MNT/A/file_$i
   done

   btrfs subvolume snapshot -r $MNT $MNT/snap1

   mkdir $MNT/B
   for ((i = 20000; i <= 40000; i++)); do
       echo -n > $MNT/B/file_$i
   done

   mv $MNT/A/file_* $MNT/B/

   btrfs subvolume snapshot -r $MNT $MNT/snap2

   start=$(date +%s%N)
   btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
   end=$(date +%s%N)

   dur=$(( (end - start) / 1000000 ))
   echo "Incremental send took $dur milliseconds"

   umount $MNT

Before the whole patchset: 18408 milliseconds
After the whole patchset:   1942 milliseconds  (9.5x speedup)

Using 60000 files instead of 40000:

Before the whole patchset: 39764 milliseconds
After the whole patchset:   3076 milliseconds  (12.9x speedup)

Using 20000 files instead of 40000:

Before the whole patchset:  5072 milliseconds
After the whole patchset:    916 milliseconds  (5.5x speedup)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:50 +01:00