Commit Graph

89252 Commits

Author SHA1 Message Date
Gang Li 26d1dc6bb2 hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA
Allow hugetlb use padata_do_multithreaded for parallel initialization. 
Select CONFIG_PADATA in this case.

Link: https://lkml.kernel.org/r/20240222140422.393911-7-gang.li@linux.dev
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-06 13:04:17 -08:00
Baoquan He 443cbaf9e2 crash: split vmcoreinfo exporting code out from crash_core.c
Now move the relevant codes into separate files:
kernel/crash_reserve.c, include/linux/crash_reserve.h.

And add config item CRASH_RESERVE to control its enabling.

And also update the old ifdeffery of CONFIG_CRASH_CORE, including of
<linux/crash_core.h> and config item dependency on CRASH_CORE
accordingly.

And also do renaming as follows:
 - arch/xxx/kernel/{crash_core.c => vmcore_info.c}
because they are only related to vmcoreinfo exporting on x86, arm64,
riscv.

And also Remove config item CRASH_CORE, and rely on CONFIG_KEXEC_CORE to
decide if build in crash_core.c.

[yang.lee@linux.alibaba.com: remove duplicated include in vmcore_info.c]
  Link: https://lkml.kernel.org/r/20240126005744.16561-1-yang.lee@linux.alibaba.com
Link: https://lkml.kernel.org/r/20240124051254.67105-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pingfan Liu <piliu@redhat.com>
Cc: Klara Modin <klarasmodin@gmail.com>
Cc: Michael Kelley <mhklinux@outlook.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23 17:48:22 -08:00
Lokesh Gidra 867a43a34f userfaultfd: use per-vma locks in userfaultfd operations
All userfaultfd operations, except write-protect, opportunistically use
per-vma locks to lock vmas.  On failure, attempt again inside mmap_lock
critical section.

Write-protect operation requires mmap_lock as it iterates over multiple
vmas.

Link: https://lkml.kernel.org/r/20240215182756.3448972-5-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:20 -08:00
Lokesh Gidra 5e4c24a57b userfaultfd: protect mmap_changing with rw_sem in userfaulfd_ctx
Increments and loads to mmap_changing are always in mmap_lock critical
section.  This ensures that if userspace requests event notification for
non-cooperative operations (e.g.  mremap), userfaultfd operations don't
occur concurrently.

This can be achieved by using a separate read-write semaphore in
userfaultfd_ctx such that increments are done in write-mode and loads in
read-mode, thereby eliminating the dependency on mmap_lock for this
purpose.

This is a preparatory step before we replace mmap_lock usage with per-vma
locks in fill/move ioctls.

Link: https://lkml.kernel.org/r/20240215182756.3448972-3-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:20 -08:00
Lokesh Gidra f91e6b41dd userfaultfd: move userfaultfd_ctx struct to header file
Patch series "per-vma locks in userfaultfd", v7.

Performing userfaultfd operations (like copy/move etc.) in critical
section of mmap_lock (read-mode) causes significant contention on the lock
when operations requiring the lock in write-mode are taking place
concurrently.  We can use per-vma locks instead to significantly reduce
the contention issue.

Android runtime's Garbage Collector uses userfaultfd for concurrent
compaction.  mmap-lock contention during compaction potentially causes
jittery experience for the user.  During one such reproducible scenario,
we observed the following improvements with this patch-set:

- Wall clock time of compaction phase came down from ~3s to <500ms
- Uninterruptible sleep time (across all threads in the process) was
  ~10ms (none in mmap_lock) during compaction, instead of >20s


This patch (of 4):

Move the struct to userfaultfd_k.h to be accessible from mm/userfaultfd.c.
There are no other changes in the struct.

This is required to prepare for using per-vma locks in userfaultfd
operations.

Link: https://lkml.kernel.org/r/20240215182756.3448972-1-lokeshgidra@google.com
Link: https://lkml.kernel.org/r/20240215182756.3448972-2-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:20 -08:00
Mathieu Desnoyers 1df4ca0155 dax: check for data cache aliasing at runtime
Replace the following fs/Kconfig:FS_DAX dependency:

  depends on !(ARM || MIPS || SPARC)

By a runtime check within alloc_dax(). This runtime check returns
ERR_PTR(-EOPNOTSUPP) if the @ops parameter is non-NULL (which means
the kernel is using an aliased mapping) on an architecture which
has data cache aliasing.

Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for
CONFIG_DAX=n for consistency.

This is done in preparation for using cpu_dcache_is_aliasing() in a
following change which will properly support architectures which detect
data cache aliasing at runtime.

Link: https://lkml.kernel.org/r/20240215144633.96437-8-mathieu.desnoyers@efficios.com
Fixes: d92576f116 ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Michael Sclafani <dm-devel@lists.linux.dev>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:19 -08:00
Mathieu Desnoyers 562ce8285b virtio: treat alloc_dax() -EOPNOTSUPP failure as non-fatal
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of virtio
virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as
non-fatal.

Link: https://lkml.kernel.org/r/20240215144633.96437-7-mathieu.desnoyers@efficios.com
Co-developed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Fixes: d92576f116 ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Michael Sclafani <dm-devel@lists.linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:19 -08:00
Lokesh Gidra 6ca03f1bb5 userfaultfd: fix return error if mmap_changing is non-zero in MOVE ioctl
To be consistent with other uffd ioctl's returning EAGAIN when
mmap_changing is detected, we should change UFFDIO_MOVE to do the same.

Link: https://lkml.kernel.org/r/20240117223922.1445327-1-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 10:24:38 -08:00
Hui Zhu cabbb6d51e fs/proc/task_mmu.c: add_to_pagemap: remove useless parameter addr
Function parameter addr of add_to_pagemap() is useless.  Remove it.

Link: https://lkml.kernel.org/r/20240111084533.40038-1-teawaterz@linux.alibaba.com
Signed-off-by: Hui Zhu <teawater@antgroup.com>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21 16:00:04 -08:00
Matthew Wilcox (Oracle) 7101422464 proc: use pfn_swap_entry_folio where obvious
These callers only pass the result to PageAnon(), so we can save the extra
call to compound_head() by using pfn_swap_entry_folio().

Link: https://lkml.kernel.org/r/20240111152429.3374566-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21 16:00:03 -08:00
Linus Torvalds f2667e0c32 bcachefs fixes for v6.8-rc5
Mostly pretty trivial, the user visible ones are:
  - don't barf when replicas_required > replicas
  - fix check_version_upgrade() so it doesn't do something nonsensical
    when we're downgrading
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmXQ7/0ACgkQE6szbY3K
 bnbAVQ//V+25RfGRdb7JW7aXZJHyWdlpJE2Om0Vhbd3kYsEeMwqtugkuS4eE7nLV
 Xbi70de7Y7STXKrrTCAYZp7QIN0HY3VHtMnmh1IzbF31JZtI69AfZDRDKC9ITNhp
 Cf3jwmD9OypanVhdAlH/PIecxFdqglJ3e0V5r8zFibb4H2phn3rGPooNl9NCK+KN
 VZOz2W0roCslPutT89SWBRZl7og8BNky1ZYS3cjE+4gFg+/TTeCqgZXteoJkYR70
 0oS+e33AdhxdLoPV6ePN0faFp4rpwh3mpy2gL4MhddSEqTEoUd0/Is4jmBUJtCps
 pgnFz+D2L4zRg2NnyAMVt4DrjyXNhjZDkbd43aK11yyCYSLmG87KW/95K7ocU60i
 /ZqUWWJ9IQO92ihQ+FxJcFw4pNTTWuzNNjZiGKvSad2/rU6btMKW4g+JbTNt2mqn
 3X43s9pdx1sKvR5HNwvwm3R0mrT1baP+/mdfbgbmVPMS6ijD7KQ6JLhB3wGCFSKA
 /yNxasBt+Rfmn9gVsqRTBaH39GWbVwSwJO+gthLAdqylB/j8xne414BxPdhCXeOW
 5KLvUUoJmIx7VW9NsJms4ES/yMd3xmlsyjOnRLqs4Q1WvxCIoyJnM/i02jRHfrjA
 49qlZIR0Gm8rQw7qmjA/R6LzaY9T70uC1xCAHTQ1C+bvpzQW3M0=
 =XWHd
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-02-17' of https://evilpiepirate.org/git/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Mostly pretty trivial, the user visible ones are:

   - don't barf when replicas_required > replicas

   - fix check_version_upgrade() so it doesn't do something nonsensical
     when we're downgrading"

* tag 'bcachefs-2024-02-17' of https://evilpiepirate.org/git/bcachefs:
  bcachefs: Fix missing va_end()
  bcachefs: Fix check_version_upgrade()
  bcachefs: Clamp replicas_required to replicas
  bcachefs: fix missing endiannes conversion in sb_members
  bcachefs: fix kmemleak in __bch2_read_super error handling path
  bcachefs: Fix missing bch2_err_class() calls
2024-02-17 13:17:32 -08:00
Linus Torvalds 55f626f2d0 Five smb3 client fixes, most also for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmXP2xEACgkQiiy9cAdy
 T1EGMwwAiclDnZEtlHRK80Kmncnk7JaLnmHqLLfgIocVo0MDnaUrJVftj43F2OP/
 za1uJfI9uM+lgM+vbEjldLxMf8yVnt6MpgzwXBZftppFpVPAwiwvGh4m4RAKreUj
 SnE/WEINjj3p3dTJdnKup3Ff2hDXm9VBLpGPWwUTQ0RNWROO7hUtnTyeGRCZKEgy
 uCVVU8AS5DAvBpuWZ+K0zG/omts42fP399GkQUD5Pz8DgOaFCMfQU33/zPFQEkPX
 idnXH3KL4SHTFvytTTRCgHGg2x5MLersC6zYqwlgTD9YkGpr1UbuIRYvVO793Bek
 VqhQTsnSnQSTuJ5lCH7elcLhF8XPF/qM+sRw/OoidgUwMMDjN0jMkwxBk+MX8Gw2
 6lJfWAjTjRhanF6s501fEGd4YPFIBAysmW1XLiS12ot05s5k9D7sp+EjklPsSxvX
 VrHHtcElY9rwh5sk7gZsmBl2U2x8k4ZvZeq6dIqQAvg+oQAbi9mTFrGuxsC3jD/x
 LP2cFzhW
 =v3wq
 -----END PGP SIGNATURE-----

Merge tag '6.8-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:
 "Five smb3 client fixes, most also for stable:

   - Two multichannel fixes (one to fix potential handle leak on retry)

   - Work around possible serious data corruption (due to change in
     folios in 6.3, for cases when non standard maximum write size
     negotiated)

   - Symlink creation fix

   - Multiuser automount fix"

* tag '6.8-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: Fix regression in writes when non-standard maximum write size negotiated
  smb: client: handle path separator of created SMB symlinks
  smb: client: set correct id, uid and cruid for multiuser automounts
  cifs: update the same create_guid on replay
  cifs: fix underflow in parse_server_interfaces()
2024-02-17 07:56:10 -08:00
Linus Torvalds 3f9c1b315d Additional cap handling fixes from Xiubo to avoid "client isn't
responding to mclientcaps(revoke)" stalls on the MDS side.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmXPnBgTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi33XCACBiglCuzqv5/MTU7W/CaWOGYUL9OT2
 dcP6lkFyexuVl7yjbiAwnBbAiefMr5jgBK27+20ZdT7VDzrtBeDB18al/QMv7r+0
 TSIbUW3nLIph2LdodgKypJ6WOHPEpi4OTncFTlkfERDNQR3GXRDWJkI9pQWcRiYr
 DTz0FvvMkDNitoHlXdD3RhEQ8M2gdoT5HXyns4YdCjc7aZekkwjkoG4Yf+/BWLUy
 3v/2lcTdW6e6u6Pqu5I9xq+bnir6F9FIsERW1TaZfFwksQr/IMdJs0DTWzfwh26v
 wJlyYYguSAC2/kJg52HWfVvtszjWvlpDj81AZn8HxgU4+MQoBKGd18FY
 =YWdh
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.8-rc5' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Additional cap handling fixes from Xiubo to avoid "client isn't
  responding to mclientcaps(revoke)" stalls on the MDS side"

* tag 'ceph-for-6.8-rc5' of https://github.com/ceph/ceph-client:
  ceph: add ceph_cap_unlink_work to fire check_caps() immediately
  ceph: always queue a writeback when revoking the Fb caps
2024-02-16 13:37:15 -08:00
Linus Torvalds efb0b63afc zonefs fixes for 6.8.0-rc5
- Fix direct write error handling to avoid a race between failed IO
    completion and the submission path itself which can result in an
    invalid file size exposed to the user after the failed IO.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCZc9S/AAKCRDdoc3SxdoY
 dgilAQDhQeRxzZLXO5lh5LGeqveo88kXuQclCK9VeqnCr0cnHQD/RTXvo464Vf4c
 DAuDtLxRA16sj8WlLkUVkvjMKdjYaQ8=
 =n1Tp
 -----END PGP SIGNATURE-----

Merge tag 'zonefs-6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull zonefs fix from Damien Le Moal:

 - Fix direct write error handling to avoid a race between failed IO
   completion and the submission path itself which can result in an
   invalid file size exposed to the user after the failed IO.

* tag 'zonefs-6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: Improve error handling
2024-02-16 09:29:26 -08:00
Steve French 4860abb91f smb: Fix regression in writes when non-standard maximum write size negotiated
The conversion to netfs in the 6.3 kernel caused a regression when
maximum write size is set by the server to an unexpected value which is
not a multiple of 4096 (similarly if the user overrides the maximum
write size by setting mount parm "wsize", but sets it to a value that
is not a multiple of 4096).  When negotiated write size is not a
multiple of 4096 the netfs code can skip the end of the final
page when doing large sequential writes, causing data corruption.

This section of code is being rewritten/removed due to a large
netfs change, but until that point (ie for the 6.3 kernel until now)
we can not support non-standard maximum write sizes.

Add a warning if a user specifies a wsize on mount that is not
a multiple of 4096 (and round down), also add a change where we
round down the maximum write size if the server negotiates a value
that is not a multiple of 4096 (we also have to check to make sure that
we do not round it down to zero).

Reported-by: R. Diez" <rdiez-2006@rd10.de>
Fixes: d08089f649 ("cifs: Change the I/O paths to use an iterator rather than a page list")
Suggested-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Acked-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Tested-by: Matthew Ruffell <matthew.ruffell@canonical.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: stable@vger.kernel.org # v6.3+
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-15 22:19:23 -06:00
Damien Le Moal 14db5f64a9 zonefs: Improve error handling
Write error handling is racy and can sometime lead to the error recovery
path wrongly changing the inode size of a sequential zone file to an
incorrect value  which results in garbage data being readable at the end
of a file. There are 2 problems:

1) zonefs_file_dio_write() updates a zone file write pointer offset
   after issuing a direct IO with iomap_dio_rw(). This update is done
   only if the IO succeed for synchronous direct writes. However, for
   asynchronous direct writes, the update is done without waiting for
   the IO completion so that the next asynchronous IO can be
   immediately issued. However, if an asynchronous IO completes with a
   failure right before the i_truncate_mutex lock protecting the update,
   the update may change the value of the inode write pointer offset
   that was corrected by the error path (zonefs_io_error() function).

2) zonefs_io_error() is called when a read or write error occurs. This
   function executes a report zone operation using the callback function
   zonefs_io_error_cb(), which does all the error recovery handling
   based on the current zone condition, write pointer position and
   according to the mount options being used. However, depending on the
   zoned device being used, a report zone callback may be executed in a
   context that is different from the context of __zonefs_io_error(). As
   a result, zonefs_io_error_cb() may be executed without the inode
   truncate mutex lock held, which can lead to invalid error processing.

Fix both problems as follows:
- Problem 1: Perform the inode write pointer offset update before a
  direct write is issued with iomap_dio_rw(). This is safe to do as
  partial direct writes are not supported (IOMAP_DIO_PARTIAL is not
  set) and any failed IO will trigger the execution of zonefs_io_error()
  which will correct the inode write pointer offset to reflect the
  current state of the one on the device.
- Problem 2: Change zonefs_io_error_cb() into zonefs_handle_io_error()
  and call this function directly from __zonefs_io_error() after
  obtaining the zone information using blkdev_report_zones() with a
  simple callback function that copies to a local stack variable the
  struct blk_zone obtained from the device. This ensures that error
  handling is performed holding the inode truncate mutex.
  This change also simplifies error handling for conventional zone files
  by bypassing the execution of report zones entirely. This is safe to
  do because the condition of conventional zones cannot be read-only or
  offline and conventional zone files are always fully mapped with a
  constant file size.

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 8dcc1a9d90 ("fs: New zonefs file system")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
2024-02-16 10:20:35 +09:00
Linus Torvalds 1f3a3e2aae for-6.8-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmXMewsACgkQxWXV+ddt
 WDtFUBAAkEU/hxB4YsLn2JEdp3wc80w5/qKkPaYHsI2ncvc3RFiG+tqSY7BakMgE
 Kkdl8ouNX3p/S62ykIBQTKZnOTk7FgKlClAQtgKn1afexqABsP2mifnh40Dzf7eA
 VvEl7chnRT6oeivtQkB+BtgOzaOUp4j/8oAivRN8NKNwTxojV4g9PErKSOWfVQSq
 3zlrLJbe6era43SpnexkjZHn4Fy4CN+C7FMm+pT/yKzZi2oBZs9BvNZGhIkdnzcK
 MftrY9dSGO3CDD2Kvrz3lEm7ZB83wCpm+GTDN7iJx2y+yeW+aHjshFkJr1ApEZQa
 lsWTnj3hk3yHoOPUuLlchw5JcFb/dFZ1Ztdwkunf8nmt5a3O/5Zf+Csgze8c+Iii
 MJQKi0B/bNQ7cSEwRt36s75kROBItZmHCZmSBlOpT1LXSDQMJ9lvEnv/fPQdcHHF
 WMEmk5O5IoGYv5kx5wIoWv27HKE/bDwH6RjkxEd/n17XP+PcfHY4K0o0CGtfwS8g
 hdy9RI9X8dbf3ZPrxtsgQ2T8btWs68A4S6nwcSuY5HK0WNmvRh47eLfCI6S6XGJs
 hHkppLcc+WTXOskCA+ABdm9hgeAPZkCSpuQSmC2HBt8gRv8XqO7z4cZ/up2N+tES
 ZOJSrJb97nusOcxY0pLexnD6eI3pQxzGMiPONlC1Re8CdjZ0l+4=
 =RRGT
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few regular fixes and one fix for space reservation regression since
  6.7 that users have been reporting:

   - fix over-reservation of metadata chunks due to not keeping proper
     balance between global block reserve and delayed refs reserve; in
     practice this leaves behind empty metadata block groups, the
     workaround is to reclaim them by using the '-musage=1' balance
     filter

   - other space reservation fixes:
      - do not delete unused block group if it may be used soon
      - do not reserve space for checksums for NOCOW files

   - fix extent map assertion failure when writing out free space inode

   - reject encoded write if inode has nodatasum flag set

   - fix chunk map leak when loading block group zone info"

* tag 'for-6.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: don't refill whole delayed refs block reserve when starting transaction
  btrfs: zoned: fix chunk map leak when loading block group zone info
  btrfs: reject encoded write if inode has nodatasum flag set
  btrfs: don't reserve space for checksums when writing to nocow files
  btrfs: add new unused block groups to the list of unused block groups
  btrfs: do not delete unused block group if it may be used soon
  btrfs: add and use helper to check if block group is used
  btrfs: don't drop extent_map for free space inode on write error
2024-02-14 15:47:02 -08:00
Kent Overstreet 816054f40a bcachefs: Fix missing va_end()
Fixes: https://lore.kernel.org/linux-bcachefs/202402131603.E953E2CF@keescook/T/#u
Reported-by: coverity scan
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-13 21:59:27 -05:00
Kent Overstreet 2eeccee86d bcachefs: Fix check_version_upgrade()
When also downgrading, check_version_upgrade() could pick a new version
greater than the latest supported version.

Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-13 21:58:37 -05:00
Kent Overstreet 4e07447503 bcachefs: Clamp replicas_required to replicas
This prevents going emergency read only when the user has specified
replicas_required > replicas.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-13 20:33:38 -05:00
Filipe Manana 2f6397e448 btrfs: don't refill whole delayed refs block reserve when starting transaction
Since commit 28270e25c6 ("btrfs: always reserve space for delayed refs
when starting transaction") we started not only to reserve metadata space
for the delayed refs a caller of btrfs_start_transaction() might generate
but also to try to fully refill the delayed refs block reserve, because
there are several case where we generate delayed refs and haven't reserved
space for them, relying on the global block reserve. Relying too much on
the global block reserve is not always safe, and can result in hitting
-ENOSPC during transaction commits or worst, in rare cases, being unable
to mount a filesystem that needs to do orphan cleanup or anything that
requires modifying the filesystem during mount, and has no more
unallocated space and the metadata space is nearly full. This was
explained in detail in that commit's change log.

However the gap between the reserved amount and the size of the delayed
refs block reserve can be huge, so attempting to reserve space for such
a gap can result in allocating many metadata block groups that end up
not being used. After a recent patch, with the subject:

  "btrfs: add new unused block groups to the list of unused block groups"

We started to add new block groups that are unused to the list of unused
block groups, to avoid having them around for a very long time in case
they are never used, because a block group is only added to the list of
unused block groups when we deallocate the last extent or when mounting
the filesystem and the block group has 0 bytes used. This is not a problem
introduced by the commit mentioned earlier, it always existed as our
metadata space reservations are, most of the time, pessimistic and end up
not using all the space they reserved, so we can occasionally end up with
one or two unused metadata block groups for a long period. However after
that commit mentioned earlier, we are just more pessimistic in the
metadata space reservations when starting a transaction and therefore the
issue is more likely to happen.

This however is not always enough because we might create unused metadata
block groups when reserving metadata space at a high rate if there's
always a gap in the delayed refs block reserve and the cleaner kthread
isn't triggered often enough or is busy with other work (running delayed
iputs, cleaning deleted roots, etc), not to mention the block group's
allocated space is only usable for a new block group after the transaction
used to remove it is committed.

A user reported that he's getting a lot of allocated metadata block groups
but the usage percentage of metadata space was very low compared to the
total allocated space, specially after running a series of block group
relocations.

So for now stop trying to refill the gap in the delayed refs block reserve
and reserve space only for the delayed refs we are expected to generate
when starting a transaction.

CC: stable@vger.kernel.org # 6.7+
Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/9cdbf0ca9cdda1b4c84e15e548af7d7f9f926382.camel@intelfx.name/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H6802ayLHUJFztzZAVzBLJAGdFx=6FHNNy87+obZXXZpQ@mail.gmail.com/
Tested-by: Ivan Shapovalov <intelfx@intelfx.name>
Reported-by: Heddxh <g311571057@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAE93xANEby6RezOD=zcofENYZOT-wpYygJyauyUAZkLv6XVFOA@mail.gmail.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-13 18:39:09 +01:00
Filipe Manana 88e81a6777 btrfs: zoned: fix chunk map leak when loading block group zone info
At btrfs_load_block_group_zone_info() we never drop a reference on the
chunk map we have looked up, therefore leaking a reference on it. So
add the missing btrfs_free_chunk_map() at the end of the function.

Fixes: 7dc66abb5a ("btrfs: use a dedicated data structure for chunk maps")
Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-13 18:38:19 +01:00
Filipe Manana 1bd96c92c6 btrfs: reject encoded write if inode has nodatasum flag set
Currently we allow an encoded write against inodes that have the NODATASUM
flag set, either because they are NOCOW files or they were created while
the filesystem was mounted with "-o nodatasum". This results in having
compressed extents without corresponding checksums, which is a filesystem
inconsistency reported by 'btrfs check'.

For example, running btrfs/281 with MOUNT_OPTIONS="-o nodatacow" triggers
this and 'btrfs check' errors out with:

   [1/7] checking root items
   [2/7] checking extents
   [3/7] checking free space tree
   [4/7] checking fs roots
   root 256 inode 257 errors 1040, bad file extent, some csum missing
   root 256 inode 258 errors 1040, bad file extent, some csum missing
   ERROR: errors found in fs roots
   (...)

So reject encoded writes if the target inode has NODATASUM set.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-13 18:38:05 +01:00
Filipe Manana feefe1f49d btrfs: don't reserve space for checksums when writing to nocow files
Currently when doing a write to a file we always reserve metadata space
for inserting data checksums. However we don't need to do it if we have
a nodatacow file (-o nodatacow mount option or chattr +C) or if checksums
are disabled (-o nodatasum mount option), as in that case we are only
adding unnecessary pressure to metadata reservations.

For example on x86_64, with the default node size of 16K, a 4K buffered
write into a nodatacow file is reserving 655360 bytes of metadata space,
as it's accounting for checksums. After this change, which stops reserving
space for checksums if we have a nodatacow file or checksums are disabled,
we only need to reserve 393216 bytes of metadata.

CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-13 18:36:35 +01:00
Xiubo Li dbc347ef7f ceph: add ceph_cap_unlink_work to fire check_caps() immediately
When unlinking a file the check caps could be delayed for more than
5 seconds, but in MDS side it maybe waiting for the clients to
release caps.

This will use the cap_wq work queue and a dedicated list to help
fire the check_caps() and dirty buffer flushing immediately.

Link: https://tracker.ceph.com/issues/50223
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-02-13 11:22:54 +01:00
Xiubo Li 902d6d013f ceph: always queue a writeback when revoking the Fb caps
In case there is 'Fw' dirty caps and 'CHECK_CAPS_FLUSH' is set we
will always ignore queue a writeback. Queue a writeback is very
important because it will block kclient flushing the snapcaps to
MDS and which will block MDS waiting for revoking the 'Fb' caps.

Link: https://tracker.ceph.com/issues/50223
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-02-13 11:22:35 +01:00
Paulo Alcantara 8bde59b20d smb: client: handle path separator of created SMB symlinks
Convert path separator to CIFS_DIR_SEP(cifs_sb) from symlink target
before sending it over the wire otherwise the created SMB symlink may
become innaccesible from server side.

Fixes: 514d793e27 ("smb: client: allow creating symlinks via reparse points")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-12 12:47:21 -06:00
Paulo Alcantara 4508ec1735 smb: client: set correct id, uid and cruid for multiuser automounts
When uid, gid and cruid are not specified, we need to dynamically
set them into the filesystem context used for automounting otherwise
they'll end up reusing the values from the parent mount.

Fixes: 9fd29a5bae ("cifs: use fs_context for automounts")
Reported-by: Shane Nehring <snehring@iastate.edu>
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2259257
Cc: stable@vger.kernel.org # 6.2+
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-12 12:46:49 -06:00
Linus Torvalds 716f4aaa7b vfs-6.8-rc5.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZcoMdAAKCRCRxhvAZXjc
 ogy4AQDVp4huR6BBnRMhOCZbIsmkuHmq6ynpIZNTTAM0DdMn5AEAlJ03aEIaG9WS
 RQMdaYajeVpZfR/vIUg8UdVkHQxOEgw=
 =akNF
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:

 - Fix performance regression introduced by moving the security
   permission hook out of do_clone_file_range() and into its caller
   vfs_clone_file_range().

   This causes the security hook to be called in situation were it
   wasn't called before as the fast permission checks were left in
   do_clone_file_range().

   Fix this by merging the two implementations back together and
   restoring the old ordering: fast permission checks first, expensive
   ones later.

 - Tweak mount_setattr() permission checking so that mount properties on
   the real rootfs can be changed.

   When we added mount_setattr() we added additional checks compared to
   legacy mount(2). If the mount had a parent then verify that the
   caller and the mount namespace the mount is attached to match and if
   not make sure that it's an anonymous mount.

   But the real rootfs falls into neither category. It is neither an
   anoymous mount because it is obviously attached to the initial mount
   namespace but it also obviously doesn't have a parent mount. So that
   means legacy mount(2) allows changing mount properties on the real
   rootfs but mount_setattr(2) blocks this. This causes regressions (See
   the commit for details).

   Fix this by relaxing the check. If the mount has a parent or if it
   isn't a detached mount, verify that the mount namespaces of the
   caller and the mount are the same. Technically, we could probably
   write this even simpler and check that the mount namespaces match if
   it isn't a detached mount. But the slightly longer check makes it
   clearer what conditions one needs to think about.

* tag 'vfs-6.8-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: relax mount_setattr() permission checks
  remap_range: merge do_clone_file_range() into vfs_clone_file_range()
2024-02-12 07:15:45 -08:00
Shyam Prasad N 79520587fe cifs: update the same create_guid on replay
File open requests made to the server contain a
CreateGuid, which is used by the server to identify
the open request. If the same request needs to be
replayed, it needs to be sent with the same CreateGuid
in the durable handle v2 context.

Without doing so, we could end up leaking handles on
the server when:
1. multichannel is used AND
2. connection goes down, but not for all channels

This is because the replayed open request would have a
new CreateGuid and the server will treat this as a new
request and open a new handle.

This change fixes this by reusing the existing create_guid
stored in the cached fid struct.

REF: MS-SMB2 4.9 Replay Create Request on an Alternate Channel

Fixes: 4f1fffa237 ("cifs: commands that are retried should have replay flag set")
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-11 19:07:08 -06:00
Dan Carpenter cffe487026 cifs: fix underflow in parse_server_interfaces()
In this loop, we step through the buffer and after each item we check
if the size_left is greater than the minimum size we need.  However,
the problem is that "bytes_left" is type ssize_t while sizeof() is type
size_t.  That means that because of type promotion, the comparison is
done as an unsigned and if we have negative bytes left the loop
continues instead of ending.

Fixes: fe856be475 ("CIFS: parse and store info on iface queries")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-11 19:07:08 -06:00
Linus Torvalds 7521f258ea 21 hotfixes. 12 are cc:stable and the remainder pertain to post-6.7
issues or aren't considered to be needed in earlier kernel versions.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZcfLvgAKCRDdBJ7gKXxA
 joCTAP4/XdBXA7Sj3GyjSAkYjg2U0quwX9oRhsx2Qy9duPDaLAD+NRl9XG14YSOB
 f/7OiTQoDfnwVgHAOVBHY/ylrcgZRQg=
 =2wdS
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-02-10-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "21 hotfixes. 12 are cc:stable and the remainder pertain to post-6.7
  issues or aren't considered to be needed in earlier kernel versions"

* tag 'mm-hotfixes-stable-2024-02-10-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
  nilfs2: fix potential bug in end_buffer_async_write
  mm/damon/sysfs-schemes: fix wrong DAMOS tried regions update timeout setup
  nilfs2: fix hang in nilfs_lookup_dirty_data_buffers()
  MAINTAINERS: Leo Yan has moved
  mm/zswap: don't return LRU_SKIP if we have dropped lru lock
  fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super
  mailmap: switch email address for John Moon
  mm: zswap: fix objcg use-after-free in entry destruction
  mm/madvise: don't forget to leave lazy MMU mode in madvise_cold_or_pageout_pte_range()
  arch/arm/mm: fix major fault accounting when retrying under per-VMA lock
  selftests: core: include linux/close_range.h for CLOSE_RANGE_* macros
  mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page
  mm: memcg: optimize parent iteration in memcg_rstat_updated()
  nilfs2: fix data corruption in dsync block recovery for small block sizes
  mm/userfaultfd: UFFDIO_MOVE implementation should use ptep_get()
  exit: wait_task_zombie: kill the no longer necessary spin_lock_irq(siglock)
  fs/proc: do_task_stat: use sig->stats_lock to gather the threads/children stats
  fs/proc: do_task_stat: move thread_group_cputime_adjusted() outside of lock_task_sighand()
  getrusage: use sig->stats_lock rather than lock_task_sighand()
  getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand()
  ...
2024-02-10 15:28:07 -08:00
Kent Overstreet 04eb57930e bcachefs: fix missing endiannes conversion in sb_members
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-10 17:37:34 -05:00
Su Yue 7dcfb87af9 bcachefs: fix kmemleak in __bch2_read_super error handling path
During xfstest tests, there are some kmemleak reports e.g. generic/051 with
if USE_KMEMLEAK=yes:

====================================================================
EXPERIMENTAL kmemleak reported some memory leaks!  Due to the way kmemleak
works, the leak might be from an earlier test, or something totally unrelated.
unreferenced object 0xffff9ef905aaf778 (size 8):
  comm "mount.bcachefs", pid 169844, jiffies 4295281209 (age 87.040s)
  hex dump (first 8 bytes):
    a5 cc cc cc cc cc cc cc                          ........
  backtrace:
    [<ffffffff87fd9a43>] __kmem_cache_alloc_node+0x1f3/0x2c0
    [<ffffffff87f49b66>] kmalloc_trace+0x26/0xb0
    [<ffffffffc0a3fefe>] __bch2_read_super+0xfe/0x4e0 [bcachefs]
    [<ffffffffc0a3ad22>] bch2_fs_open+0x262/0x1710 [bcachefs]
    [<ffffffffc09c9e24>] bch2_mount+0x4c4/0x640 [bcachefs]
    [<ffffffff88080c90>] legacy_get_tree+0x30/0x60
    [<ffffffff8802c748>] vfs_get_tree+0x28/0xf0
    [<ffffffff88061fe5>] path_mount+0x475/0xb60
    [<ffffffff880627e5>] __x64_sys_mount+0x105/0x140
    [<ffffffff88932642>] do_syscall_64+0x42/0xf0
    [<ffffffff88a000e6>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
unreferenced object 0xffff9ef96cdc4fc0 (size 32):
  comm "mount.bcachefs", pid 169844, jiffies 4295281209 (age 87.040s)
  hex dump (first 32 bytes):
    2f 64 65 76 2f 6d 61 70 70 65 72 2f 74 65 73 74  /dev/mapper/test
    2d 31 00 cc cc cc cc cc cc cc cc cc cc cc cc cc  -1..............
  backtrace:
    [<ffffffff87fd9a43>] __kmem_cache_alloc_node+0x1f3/0x2c0
    [<ffffffff87f4a081>] __kmalloc_node_track_caller+0x51/0x150
    [<ffffffff87f3adc2>] kstrdup+0x32/0x60
    [<ffffffffc0a3ff1a>] __bch2_read_super+0x11a/0x4e0 [bcachefs]
    [<ffffffffc0a3ad22>] bch2_fs_open+0x262/0x1710 [bcachefs]
    [<ffffffffc09c9e24>] bch2_mount+0x4c4/0x640 [bcachefs]
    [<ffffffff88080c90>] legacy_get_tree+0x30/0x60
    [<ffffffff8802c748>] vfs_get_tree+0x28/0xf0
    [<ffffffff88061fe5>] path_mount+0x475/0xb60
    [<ffffffff880627e5>] __x64_sys_mount+0x105/0x140
    [<ffffffff88932642>] do_syscall_64+0x42/0xf0
    [<ffffffff88a000e6>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
====================================================================

The leak happens if bdev_open_by_path() failed to open a block device
then it goes label 'out' directly without call of bch2_free_super().

Fix it by going to label 'err' instead of 'out' if bdev_open_by_path()
fails.

Signed-off-by: Su Yue <glass.su@suse.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-10 17:37:34 -05:00
Kent Overstreet 1a1c93e7f8 bcachefs: Fix missing bch2_err_class() calls
We aren't supposed to be leaking our private error codes outside of
fs/bcachefs/.

Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-10 17:37:32 -05:00
Linus Torvalds 5a7ec87063 Two ksmbd server fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmXGtzoACgkQiiy9cAdy
 T1G9UQv/b1rOI+u7Cr5RDnO0O4sbL7bJ7pfJHEK0KKpat0BFtsrGZFRwDsuSDmkc
 BMIdeENnM1aoGjGEzvyGJmzUEZUcusy2zFdLBDiW1zPBb5D5HLRmr7fN02ZwPwj9
 5vnuvM5/Iql/dSMBjDcm7M5NuiVlp9+SmN27OqXbfc0e6xHxnzhwu6A3x3Ryaz/J
 0LzxNt++UUkZfK6FrePDdRyWlvBHsMy4RfTmjIO432bhNjsx90YHrPtKj2ph4xi5
 /92QuLJvSaYyj1IrZIV6v0UBJBKtnoGek8UJ7k3Mz/BkHBXvvZTR0MYL/tKW80eK
 Bfck2qcRVauLPseGRnn5GTkvF+itTb5RXksXzVSomveAzQ7TAle/qx7EL93QKCLC
 vPJLAXK00T0JvE0zyVxGuPWvl9iWBUwbR4uwUL4XNnJksIXsTYci7TZ0ELyAA7hJ
 bdn/4DyRTS5KXC60JwE9hcGXpjYstD6w8Jz+UseADsS+qE3zuX0UwnynNCQc0zjy
 iTboUnA1
 =exTN
 -----END PGP SIGNATURE-----

Merge tag '6.8-rc3-ksmbd-server-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:
 "Two ksmbd server fixes:

   - memory leak fix

   - a minor kernel-doc fix"

* tag '6.8-rc3-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: free aux buffer if ksmbd_iov_pin_rsp_read fails
  ksmbd: Add kernel-doc for ksmbd_extract_sharename() function
2024-02-10 07:53:41 -08:00
Linus Torvalds ca00c700c5 Five smb3 client fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmXGuQsACgkQiiy9cAdy
 T1FJHgwAgvfgXuEjLXzjcEUg7bZ7z76BDC4Qq0NJImkJ6GUm9eQssCF8xbOZlfb/
 bxUiZATGGbRso3cLKIcZtgBOkyKx8v9ZTVbjxpYJQqqliAvYNHXfi3TlMsjDwGlM
 Nd4kVboZmHJSMirR3O915Il4iOt36/RygDiLHqqE6jG+BM74I3fpOI+wtphUIEdG
 KHRczjbTlmKoZDH99Np/CYGYKiQOcFLOj7FetiYBW3AS1H2qSol5PDO0vOgvSDFq
 3QOIRN1Km5tRogHx/hgr993DYamvDHTI3GbSEwDT45zP1m13AHLFf7tPrPW9vNwQ
 1LIRqTFp7UYJElHGUZMYkPwY9ryqU1GHNekiV/JqUJyzvZ6wLC0mHJnY4Sh+xpdP
 wUyTCdZITZ6KPw+bQnITISkQvB3Z8lp6xX4OQZYusICLzAqfsGVI8qM2AMWM5Ywv
 9BEQb1cY6iolcQZ1nYl1/yJTVy9KZHK+N/f/8GCrF/7MDEJn2CKyIUT+Zz8f1K+F
 lS6Dm8XU
 =jOZH
 -----END PGP SIGNATURE-----

Merge tag '6.8-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - reconnect fix

 - multichannel channel selection fix

 - minor mount warning fix

 - reparse point fix

 - null pointer check improvement

* tag '6.8-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: clarify mount warning
  cifs: handle cases where multiple sessions share connection
  cifs: change tcon status when need_reconnect is set on it
  smb: client: set correct d_type for reparse points under DFS mounts
  smb3: add missing null server pointer check
2024-02-09 17:09:30 -08:00
Linus Torvalds e1e3f530a1 Some fscrypt-related fixups (sparse reads are used only for encrypted
files) and two cap handling fixes from Xiubo and Rishabh.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmXGRJATHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi+/0B/4pEweAm2W0UUaaS59DecNySBFobwed
 m7bBDBGIAQ/I3duN46a13FzsGNclho967TeB0ig1jrQxnoo3HEMiXpZz5xfG9spe
 fyvrIk3R8cSqgd7YsyITnUjGGd2UBvZVrbWOCbWrKofSoflS6IjcGDQF7ZrgEsff
 0KkMaWHvO6poIU2mAToV//UkWUk6RrtAUNlSdjLpizXnUrrAQ+vUA3OU9SSp6Klf
 xmFaIiAiVZC6M8qFpXJtnIf8Ba7PrpW5InAXgCOkxDKciE9fLaPsIu0B3H9lUVKZ
 TJwjEJ0nB+akh0tRO5bZKyM8j0D3lhgxphJwNtUoYjQsV3m7LcGQV+Il
 =u953
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.8-rc4' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Some fscrypt-related fixups (sparse reads are used only for encrypted
  files) and two cap handling fixes from Xiubo and Rishabh"

* tag 'ceph-for-6.8-rc4' of https://github.com/ceph/ceph-client:
  ceph: always check dir caps asynchronously
  ceph: prevent use-after-free in encode_cap_msg()
  ceph: always set initial i_blkbits to CEPH_FSCRYPT_BLOCK_SHIFT
  libceph: just wait for more data to be available on the socket
  libceph: rename read_sparse_msg_*() to read_partial_sparse_msg_*()
  libceph: fail sparse-read if the data length doesn't match
2024-02-09 17:05:02 -08:00
Linus Torvalds a2343df3fb driver ntfs3 for linux 6.8
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEh0DEKNP0I9IjwfWEqbAzH4MkB7YFAmXGIMMACgkQqbAzH4Mk
 B7b07xAAw/VcHqPdhcVg2SttGy1D0rpjXvSK9Na3pulD83M3AptvjqXToP1xZHUP
 fpZ32rAWoeaJieTbS+xJUEyRj+VGN9iEgUMBtoaIYIv9ozmC7IU0xyDvJCPU07F1
 X2n+IXxpBsU/Y1rQiJIijzum+BQYgXgsifdwkZU50QQjQWWcFdMU9VPjN2Saw/Sx
 8gd1rzKVKIclErpReDyuZpTqDweM4BxiuwKhLodzlMtfO2MEqXxwFXnLQDX2xJLh
 zJPMepM/3mzHSBjKrHQ+xZHZCDuP393UUJK+sd9PaETR8xHR4ew9yqSu1Ajg/o64
 xoQ8rkpT9g6AZS+JNKtKN52rw5rn4ZCi/VZ61HgqiLGTxOkVnHpynmDr3IKzfgn+
 j7ZD33HteBJLxnR3YTi7fJA8DF9d0vHUv+HtH351WVibJn9DrzWzIkp6uDdaVfoa
 YvVE+ODynLVvpDKVTm4QOmIRnVMFDZwNo7C2sURy6nqQYf+ufYYRbe5btrvhSZ8k
 uazJLhLSLFCHiT6WlbmykntTo15sub/yIF5juVRcWthi2jWj0qII549jtSkZquQR
 pEVcitMTrr6RqwEB/B5nsz2azQ4m/+JgO0se1kWvxa+6erVV0wCdB7STW77zbRmx
 m8Xyr8Pf+ZxM+IhP4cpSxgcc5olhvUjcrkNBtilQc0vLqf535k0=
 =4Gkn
 -----END PGP SIGNATURE-----

Merge tag 'ntfs3_for_6.8' of https://github.com/Paragon-Software-Group/linux-ntfs3

Pull ntfs3 fixes from Konstantin Komarov:
 "Fixed:
   - size update for compressed file
   - some logic errors, overflows
   - memory leak
   - some code was refactored

  Added:
   - implement super_operations::shutdown

  Improved:
   - alternative boot processing
   - reduced stack usage"

* tag 'ntfs3_for_6.8' of https://github.com/Paragon-Software-Group/linux-ntfs3: (28 commits)
  fs/ntfs3: Slightly simplify ntfs_inode_printk()
  fs/ntfs3: Add ioctl operation for directories (FITRIM)
  fs/ntfs3: Fix oob in ntfs_listxattr
  fs/ntfs3: Fix an NULL dereference bug
  fs/ntfs3: Update inode->i_size after success write into compressed file
  fs/ntfs3: Fixed overflow check in mi_enum_attr()
  fs/ntfs3: Correct function is_rst_area_valid
  fs/ntfs3: Use i_size_read and i_size_write
  fs/ntfs3: Prevent generic message "attempt to access beyond end of device"
  fs/ntfs3: use non-movable memory for ntfs3 MFT buffer cache
  fs/ntfs3: Use kvfree to free memory allocated by kvmalloc
  fs/ntfs3: Disable ATTR_LIST_ENTRY size check
  fs/ntfs3: Fix c/mtime typo
  fs/ntfs3: Add NULL ptr dereference checking at the end of attr_allocate_frame()
  fs/ntfs3: Add and fix comments
  fs/ntfs3: ntfs3_forced_shutdown use int instead of bool
  fs/ntfs3: Implement super_operations::shutdown
  fs/ntfs3: Drop suid and sgid bits as a part of fpunch
  fs/ntfs3: Add file_modified
  fs/ntfs3: Correct use bh_read
  ...
2024-02-09 16:59:49 -08:00
Steve French a5cc98eba2 smb3: clarify mount warning
When a user tries to use the "sec=krb5p" mount parameter to encrypt
data on connection to a server (when authenticating with Kerberos), we
indicate that it is not supported, but do not note the equivalent
recommended mount parameter ("sec=krb5,seal") which turns on encryption
for that mount (and uses Kerberos for auth).  Update the warning message.

Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-09 14:43:27 -06:00
Shyam Prasad N a39c757bf0 cifs: handle cases where multiple sessions share connection
Based on our implementation of multichannel, it is entirely
possible that a server struct may not be found in any channel
of an SMB session.

In such cases, we should be prepared to move on and search for
the server struct in the next session.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-09 14:43:25 -06:00
Shyam Prasad N c6e02eefd6 cifs: change tcon status when need_reconnect is set on it
When a tcon is marked for need_reconnect, the intention
is to have it reconnected.

This change adjusts tcon->status in cifs_tree_connect
when need_reconnect is set. Also, this change has a minor
correction in resetting need_reconnect on success. It makes
sure that it is done with tc_lock held.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-09 14:43:23 -06:00
Filipe Manana 12c5128f10 btrfs: add new unused block groups to the list of unused block groups
Space reservations for metadata are, most of the time, pessimistic as we
reserve space for worst possible cases - where tree heights are at the
maximum possible height (8), we need to COW every extent buffer in a tree
path, need to split extent buffers, etc.

For data, we generally reserve the exact amount of space we are going to
allocate. The exception here is when using compression, in which case we
reserve space matching the uncompressed size, as the compression only
happens at writeback time and in the worst possible case we need that
amount of space in case the data is not compressible.

This means that when there's not available space in the corresponding
space_info object, we may need to allocate a new block group, and then
that block group might not be used after all. In this case the block
group is never added to the list of unused block groups and ends up
never being deleted - except if we unmount and mount again the fs, as
when reading block groups from disk we add unused ones to the list of
unused block groups (fs_info->unused_bgs). Otherwise a block group is
only added to the list of unused block groups when we deallocate the
last extent from it, so if no extent is ever allocated, the block group
is kept around forever.

This also means that if we have a bunch of tasks reserving space in
parallel we can end up allocating many block groups that end up never
being used or kept around for too long without being used, which has
the potential to result in ENOSPC failures in case for example we over
allocate too many metadata block groups and then end up in a state
without enough unallocated space to allocate a new data block group.

This is more likely to happen with metadata reservations as of kernel
6.7, namely since commit 28270e25c6 ("btrfs: always reserve space for
delayed refs when starting transaction"), because we started to always
reserve space for delayed references when starting a transaction handle
for a non-zero number of items, and also to try to reserve space to fill
the gap between the delayed block reserve's reserved space and its size.

So to avoid this, when finishing the creation a new block group, add the
block group to the list of unused block groups if it's still unused at
that time. This way the next time the cleaner kthread runs, it will delete
the block group if it's still unused and not needed to satisfy existing
space reservations.

Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/9cdbf0ca9cdda1b4c84e15e548af7d7f9f926382.camel@intelfx.name/
CC: stable@vger.kernel.org # 6.7+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-09 20:29:22 +01:00
Filipe Manana f4a9f21941 btrfs: do not delete unused block group if it may be used soon
Before deleting a block group that is in the list of unused block groups
(fs_info->unused_bgs), we check if the block group became used before
deleting it, as extents from it may have been allocated after it was added
to the list.

However even if the block group was not yet used, there may be tasks that
have only reserved space and have not yet allocated extents, and they
might be relying on the availability of the unused block group in order
to allocate extents. The reservation works first by increasing the
"bytes_may_use" field of the corresponding space_info object (which may
first require flushing delayed items, allocating a new block group, etc),
and only later a task does the actual allocation of extents.

For metadata we usually don't end up using all reserved space, as we are
pessimistic and typically account for the worst cases (need to COW every
single node in a path of a tree at maximum possible height, etc). For
data we usually reserve the exact amount of space we're going to allocate
later, except when using compression where we always reserve space based
on the uncompressed size, as compression is only triggered when writeback
starts so we don't know in advance how much space we'll actually need, or
if the data is compressible.

So don't delete an unused block group if the total size of its space_info
object minus the block group's size is less then the sum of used space and
space that may be used (space_info->bytes_may_use), as that means we have
tasks that reserved space and may need to allocate extents from the block
group. In this case, besides skipping the deletion, re-add the block group
to the list of unused block groups so that it may be reconsidered later,
in case the tasks that reserved space end up not needing to allocate
extents from it.

Allowing the deletion of the block group while we have reserved space, can
result in tasks failing to allocate metadata extents (-ENOSPC) while under
a transaction handle, resulting in a transaction abort, or failure during
writeback for the case of data extents.

CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-09 20:29:16 +01:00
Filipe Manana 1693d5442c btrfs: add and use helper to check if block group is used
Add a helper function to determine if a block group is being used and make
use of it at btrfs_delete_unused_bgs(). This helper will also be used in
future code changes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-09 20:29:14 +01:00
Josef Bacik 5571e41ec6 btrfs: don't drop extent_map for free space inode on write error
While running the CI for an unrelated change I hit the following panic
with generic/648 on btrfs_holes_spacecache.

assertion failed: block_start != EXTENT_MAP_HOLE, in fs/btrfs/extent_io.c:1385
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:1385!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 2695096 Comm: fsstress Kdump: loaded Tainted: G        W          6.8.0-rc2+ #1
RIP: 0010:__extent_writepage_io.constprop.0+0x4c1/0x5c0
Call Trace:
 <TASK>
 extent_write_cache_pages+0x2ac/0x8f0
 extent_writepages+0x87/0x110
 do_writepages+0xd5/0x1f0
 filemap_fdatawrite_wbc+0x63/0x90
 __filemap_fdatawrite_range+0x5c/0x80
 btrfs_fdatawrite_range+0x1f/0x50
 btrfs_write_out_cache+0x507/0x560
 btrfs_write_dirty_block_groups+0x32a/0x420
 commit_cowonly_roots+0x21b/0x290
 btrfs_commit_transaction+0x813/0x1360
 btrfs_sync_file+0x51a/0x640
 __x64_sys_fdatasync+0x52/0x90
 do_syscall_64+0x9c/0x190
 entry_SYSCALL_64_after_hwframe+0x6e/0x76

This happens because we fail to write out the free space cache in one
instance, come back around and attempt to write it again.  However on
the second pass through we go to call btrfs_get_extent() on the inode to
get the extent mapping.  Because this is a new block group, and with the
free space inode we always search the commit root to avoid deadlocking
with the tree, we find nothing and return a EXTENT_MAP_HOLE for the
requested range.

This happens because the first time we try to write the space cache out
we hit an error, and on an error we drop the extent mapping.  This is
normal for normal files, but the free space cache inode is special.  We
always expect the extent map to be correct.  Thus the second time
through we end up with a bogus extent map.

Since we're deprecating this feature, the most straightforward way to
fix this is to simply skip dropping the extent map range for this failed
range.

I shortened the test by using error injection to stress the area to make
it easier to reproduce.  With this patch in place we no longer panic
with my error injection test.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-09 20:28:28 +01:00
Paulo Alcantara 55c7788c37 smb: client: set correct d_type for reparse points under DFS mounts
Send query dir requests with an info level of
SMB_FIND_FILE_FULL_DIRECTORY_INFO rather than
SMB_FIND_FILE_DIRECTORY_INFO when the client is generating its own
inode numbers (e.g. noserverino) so that reparse tags still
can be parsed directly from the responses, but server won't
send UniqueId (server inode number)

Signed-off-by: Paulo Alcantara <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-08 10:50:40 -06:00
Steve French 45be0882c5 smb3: add missing null server pointer check
Address static checker warning in cifs_ses_get_chan_index():
    warn: variable dereferenced before check 'server'
To be consistent, and reduce risk, we should add another check
for null server pointer.

Fixes: 88675b22d3 ("cifs: do not search for channel if server is terminating")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-08 10:50:40 -06:00
Ryusuke Konishi 5bc09b397c nilfs2: fix potential bug in end_buffer_async_write
According to a syzbot report, end_buffer_async_write(), which handles the
completion of block device writes, may detect abnormal condition of the
buffer async_write flag and cause a BUG_ON failure when using nilfs2.

Nilfs2 itself does not use end_buffer_async_write().  But, the async_write
flag is now used as a marker by commit 7f42ec3941 ("nilfs2: fix issue
with race condition of competition between segments for dirty blocks") as
a means of resolving double list insertion of dirty blocks in
nilfs_lookup_dirty_data_buffers() and nilfs_lookup_node_buffers() and the
resulting crash.

This modification is safe as long as it is used for file data and b-tree
node blocks where the page caches are independent.  However, it was
irrelevant and redundant to also introduce async_write for segment summary
and super root blocks that share buffers with the backing device.  This
led to the possibility that the BUG_ON check in end_buffer_async_write
would fail as described above, if independent writebacks of the backing
device occurred in parallel.

The use of async_write for segment summary buffers has already been
removed in a previous change.

Fix this issue by removing the manipulation of the async_write flag for
the remaining super root block buffer.

Link: https://lkml.kernel.org/r/20240203161645.4992-1-konishi.ryusuke@gmail.com
Fixes: 7f42ec3941 ("nilfs2: fix issue with race condition of competition between segments for dirty blocks")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+5c04210f7c7f897c1e7f@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/00000000000019a97c05fd42f8c8@google.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-07 21:20:37 -08:00
Ryusuke Konishi 38296afe3c nilfs2: fix hang in nilfs_lookup_dirty_data_buffers()
Syzbot reported a hang issue in migrate_pages_batch() called by mbind()
and nilfs_lookup_dirty_data_buffers() called in the log writer of nilfs2.

While migrate_pages_batch() locks a folio and waits for the writeback to
complete, the log writer thread that should bring the writeback to
completion picks up the folio being written back in
nilfs_lookup_dirty_data_buffers() that it calls for subsequent log
creation and was trying to lock the folio.  Thus causing a deadlock.

In the first place, it is unexpected that folios/pages in the middle of
writeback will be updated and become dirty.  Nilfs2 adds a checksum to
verify the validity of the log being written and uses it for recovery at
mount, so data changes during writeback are suppressed.  Since this is
broken, an unclean shutdown could potentially cause recovery to fail.

Investigation revealed that the root cause is that the wait for writeback
completion in nilfs_page_mkwrite() is conditional, and if the backing
device does not require stable writes, data may be modified without
waiting.

Fix these issues by making nilfs_page_mkwrite() wait for writeback to
finish regardless of the stable write requirement of the backing device.

Link: https://lkml.kernel.org/r/20240131145657.4209-1-konishi.ryusuke@gmail.com
Fixes: 1d1d1a7672 ("mm: only enforce stable page writes if the backing device requires it")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+ee2ae68da3b22d04cd8d@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/00000000000047d819061004ad6c@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-07 21:20:36 -08:00