Commit graph

81951 commits

Author SHA1 Message Date
Josef Bacik
4aec05fa5a btrfs: remove level argument from btrfs_set_block_flags
We just pass in btrfs_header_level(eb) for the level, and we're passing
in the eb already, so simply get the level from the eb inside of
btrfs_set_block_flags.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Josef Bacik
54d687c13a btrfs: move btrfs_check_trunc_cache_free_space into block-rsv.c
This is completely related to block rsv's, move it out of the free space
cache code and into block-rsv.c.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Qu Wenruo
94ead93e63 btrfs: scrub: use recovered data stripes as cache to avoid unnecessary read
For P/Q stripe scrub, we have quite some duplicated read IO:

- Data stripes read for verification
  This is triggered by the scrub_submit_initial_read() inside
  scrub_raid56_parity_stripe().

- Data stripes read (again) for P/Q stripe verification
  This is triggered by scrub_assemble_read_bios() from scrub_rbio().

  Although we can have hit rbio cache and avoid unnecessary read, the
  chance is very low, as scrub would easily flush the whole rbio cache.

This means, even we're just scrubbing a single P/Q stripe, we would read
the data stripes twice for the best case scenario.  If we need to
recover some data stripes, it would cause more reads on the same data
stripes, again and again.

However before we call raid56_parity_submit_scrub_rbio() we already
have all data stripes repaired and their contents ready to use.
But RAID56 cache is unaware about the scrub cache, thus RAID56 layer
itself still needs to re-read the data stripes.

To avoid such cache miss, this patch would:

- Introduce a new helper, raid56_parity_cache_data_pages()
  This function would grab the pages from an array, and copy the content
  to the rbio, marking all the involved sectors uptodate.

  The page copy is unavoidable because of the cache pages of rbio are all
  self managed, thus can not utilize outside pages without screwing up
  the lifespan.

- Use the repaired data stripes as cache inside
  scrub_raid56_parity_stripe()

By this, we ensure all the data sectors of the scrub rbio are already
uptodate, and no need to read them again from disk.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
7e5ba55994 btrfs: assert tree lock is held when removing free space entries
Removing a free space entry from an in memory space cache requires having
the corresponding btrfs_free_space_ctl's 'tree_lock' held. We have several
code paths that remove an entry, so add assertions where appropriate to
verify we are holding the lock, as the lock is acquired by some other
function up in the call chain, which makes it easy to miss in the future.

Note: for this to work we need to lock the local btrfs_free_space_ctl at
load_free_space_cache(), which was not being done because it's local,
declared on the stack, so no other task has access to it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
9649bd9a29 btrfs: assert tree lock is held when linking free space
When linking a free space entry, at link_free_space(), the caller should
be holding the spinlock 'tree_lock' of the given btrfs_free_space_ctl
argument, which is necessary for manipulating the red black tree of free
space entries (done by tree_insert_offset(), which already asserts the
lock is held) and for manipulating the 'free_space', 'free_extents',
'discardable_extents' and 'discardable_bytes' counters of the given
struct btrfs_free_space_ctl.

So assert that the spinlock 'tree_lock' of the given btrfs_free_space_ctl
is held by the current task. We have multiple code paths that end up
calling link_free_space(), and all currently take the lock before calling
it.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
91de9e978d btrfs: assert tree lock is held when searching for free space entries
When searching for a free space entry by offset, at tree_search_offset(),
we are supposed to have the btrfs_free_space_ctl's 'tree_lock' held, so
assert that. We have multiple callers of tree_search_offset(), and all
currently hold the necessary lock before calling it.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
13c2018fcc btrfs: assert proper locks are held at tree_insert_offset()
There are multiple code paths leading to tree_insert_offset(), and each
path takes the necessary locks before tree_insert_offset() is called,
since they do other things that require those locks to be held. This makes
it easy to miss the locking somewhere, so make tree_insert_offset() assert
that the required locks are being held by the calling task.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
0d6bac4d30 btrfs: simplify arguments to tree_insert_offset()
For the in-memory component of space caching (free space cache and free
space tree), three of the arguments passed to tree_insert_offset() can
always be taken from the new free space entry that we are about to add.

So simplify tree_insert_offset() to take the new entry instead of the
'offset', 'node' and 'bitmap' arguments. This will also allow to make
further changes simpler.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
b77433b144 btrfs: use precomputed end offsets at do_trimming()
The are two computations of end offsets at do_trimming() that are not
necessary, as they were previously computed and stored in local const
variables. So just use the variables instead, to make the source code
shorter and easier to read.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
9085f42571 btrfs: avoid searching twice for previous node when merging free space entries
At try_merge_free_space(), avoid calling twice rb_prev() to find the
previous node, as that requires looping through the red black tree, so
store the result of the rb_prev() call and then use it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Filipe Manana
fbb2e654d8 btrfs: avoid extra memory allocation when copying free space cache
At copy_free_space_cache(), we add a new entry to the block group's ctl
before we free the entry from the temporary ctl. Adding a new entry
requires the allocation of a new struct btrfs_free_space, so we can
avoid a temporary extra allocation by freeing the entry from the
temporary ctl before we add a new entry to the main ctl, which possibly
also reduces the chances for a memory allocation failure in case of very
high memory pressure. So just do that.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Tom Rix
12df6a622e btrfs: simplify transid initialization in btrfs_ioctl_wait_sync
A small code simplification, move the default value of transid to its
initialization and remove the else-statement.

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Qu Wenruo
b9a9a85059 btrfs: output affected files when relocation fails
[PROBLEM]
When relocation fails (mostly due to checksum mismatch), we only got
very cryptic error messages like:

  BTRFS info (device dm-4): relocating block group 13631488 flags data
  BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 csum 0x373e1ae3 expected csum 0x98757625 mirror 1
  BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  BTRFS info (device dm-4): balance: ended with status: -5

The end user has to decipher the above messages and use various tools to
locate the affected files and find a way to fix the problem (mostly
deleting the file).  This is not an easy work even for experienced
developer, not to mention the end users.

[SCRUB IS DOING BETTER]
By contrast, scrub is providing much better error messages:

  BTRFS error (device dm-4): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
  BTRFS warning (device dm-4): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
  BTRFS info (device dm-4): scrub: finished on devid 1 with status: 0

Which provides the affected files directly to the end user.

[IMPROVEMENT]
Instead of the generic data checksum error messages, which is not doing
a good job for data reloc inodes, this patch introduce a scrub like
backref walking based solution.

When a sector fails its checksum for data reloc inode, we go the
following workflow:

- Get the real logical bytenr
  For data reloc inode, the file offset is the offset inside the block
  group.
  Thus the real logical bytenr is @file_off + @block_group->start.

- Do an extent type check
  If it's tree blocks it's much easier to handle, just go through
  all the tree block backref.

- Do a backref walk and inode path resolution for data extents
  This is mostly the same as scrub.
  But unfortunately we can not reuse the same function as the output
  format is different.

Now the new output would be more user friendly:

  BTRFS info (device dm-4): relocating block group 13631488 flags data
  BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 logical 13631488 csum 0x373e1ae3 expected csum 0x98757625 mirror 1
  BTRFS warning (device dm-4): checksum error at logical 13631488 mirror 1 root 5 inode 257 offset 0 length 4096 links 1 (path: file)
  BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  BTRFS info (device dm-4): balance: ended with status: -5

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Christoph Hellwig
8bfec2e426 btrfs: remove hipri_workers workqueue
Now that btrfs_wq_submit_bio is never called for synchronous I/O,
the hipri_workers workqueue is not used anymore and can be removed.

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Christoph Hellwig
e917ff56c8 btrfs: determine synchronous writers from bio or writeback control
The writeback_control structure already passes down the information about
a writeback being synchronous from the core VM code, and thus information
is propagated into the bio REQ_SYNC flag through the wbc_to_write_flags
helper.

Use that information to decide if checksums calculation is offloaded to
a workqueue instead of btrfs_inode::sync_writers field that not only
bloats the inode but also has too wide scope, being inode wide instead
of limited to the actual writeback request.

The sync writes were set in:

- btrfs_do_write_iter - regular IO, sync status is set
- start_ordered_ops - ordered write start, writeback with WB_SYNC_ALL
  mode
- btrfs_write_marked_extents - write marked extents, writeback with
  WB_SYNC_ALL mode

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Christoph Hellwig
da02361807 btrfs: submit IO synchronously for fast checksum implementations
Most modern hardware supports very fast accelerated crc32c calculation.
If that is supported the CPU overhead of the checksum calculation is
very limited, and offloading the calculation to special worker threads
has a lot of overhead for no gain.

E.g. on an Intel Optane device is actually very much slows down even
1M buffered writes with fio:

Unpatched:

write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets

With synchronous CRCs:

write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets

With a lot of variation during the unpatched run going down as low as
1100MB/s, while the synchronous CRC version has about the same peak write
speed but much lower dips, and fewer kworkers churning around.
Both tests had fio saturated at 100% CPU.

(thanks to Jens Axboe via Chris Mason for the benchmarking)

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Anand Jain
adbe7e388e btrfs: use SECTOR_SHIFT to convert LBA to physical offset
Using SECTOR_SHIFT to convert LBA to physical address makes it more
readable.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Anand Jain
29e70be261 btrfs: use SECTOR_SHIFT to convert physical offset to LBA
Use SECTOR_SHIFT while converting a physical address to an LBA, makes
it more readable.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Qu Wenruo
eee3b81178 btrfs: improve leaf dump and error handling
Improve the leaf dump behavior by:

- Always dump the leaf first, then the error message

- Output the slot number if possible
  Especially in __btrfs_free_extent() the leaf dump of extent tree can
  be pretty large.
  With an extra slot number it's much easier to locate the problem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Qu Wenruo
6c75a589cb btrfs: print-tree: pass const extent buffer pointer
Since print-tree infrastructure only prints the content of a tree block,
we can make them to accept const extent buffer pointer.

This removes a forced type convert in extent-tree, where we convert a
const extent buffer pointer to regular one, just to avoid compiler
warning.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Naohiro Aota
b5345d6cee btrfs: export bitmap_test_range_all_{set,zero}
bitmap_test_range_all_{set,zero} defined in subpage.c are useful for other
components. Move them to misc.h and use them in zoned.c. Also, as
find_next{,_zero}_bit take/return "unsigned long" instead of "unsigned
int", convert the type to "unsigned long".

While at it, also rewrite the "if (...) return true; else return false;"
pattern and add const to the input bitmap.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Filipe Manana
88ad95b055 btrfs: tag as unlikely the key comparison when checking sibling keys
When checking siblings keys, before moving keys from one node/leaf to a
sibling node/leaf, it's very unexpected to have the last key of the left
sibling greater than or equals to the first key of the right sibling, as
that means we have a (serious) corruption that breaks the key ordering
properties of a b+tree. Since this is unexpected, surround the comparison
with the unlikely macro, which helps the compiler generate better code
for the most expected case (no existing b+tree corruption). This is also
what we do for other unexpected cases of invalid key ordering (like at
btrfs_set_item_key_safe()).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Filipe Manana
f2db4d5cb4 btrfs: make btrfs_free_device() static
The function btrfs_free_device() is never used outside of volumes.c, so
make it static and remove its prototype declaration at volumes.h.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Sweet Tea Dorminy
1b53e51a4a btrfs: don't commit transaction for every subvol create
Recently a Meta-internal workload encountered subvolume creation taking
up to 2s each, significantly slower than directory creation. As they
were hoping to be able to use subvolumes instead of directories, and
were looking to create hundreds, this was a significant issue. After
Josef investigated, it turned out to be due to the transaction commit
currently performed at the end of subvolume creation.

This change improves the workload by not doing transaction commit for every
subvolume creation, and merely requiring a transaction commit on fsync.
In the worst case, of doing a subvolume create and fsync in a loop, this
should require an equal amount of time to the current scheme; and in the
best case, the internal workload creating hundreds of subvolumes before
fsyncing is greatly improved.

While it would be nice to be able to use the log tree and use the normal
fsync path, log tree replay can't deal with new subvolume inodes
presently.

It's possible that there's some reason that the transaction commit is
necessary for correctness during subvolume creation; however,
git logs indicate that the commit dates back to the beginning of
subvolume creation, and there are no notes on why it would be necessary.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Filipe Manana
f469c8bd90 btrfs: unexport btrfs_prev_leaf()
btrfs_prev_leaf() is not used outside ctree.c, so there's no need to
export it at ctree.h - just make it static at ctree.c and move its
definition above btrfs_search_slot_for_read(), since that function
calls it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
David Howells
ba00b19067 afs: Fix vlserver probe RTT handling
In the same spirit as commit ca57f02295 ("afs: Fix fileserver probe
RTT handling"), don't rule out using a vlserver just because there
haven't been enough packets yet to calculate a real rtt.  Always set the
server's probe rtt from the estimate provided by rxrpc_kernel_get_srtt,
which is capped at 1 second.

This could lead to EDESTADDRREQ errors when accessing a cell for the
first time, even though the vl servers are known and have responded to a
probe.

Fixes: 1d4adfaf65 ("rxrpc: Make rxrpc_kernel_get_srtt() indicate validity")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: http://lists.infradead.org/pipermail/linux-afs/2023-June/006746.html
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-16 14:43:41 -07:00
Linus Torvalds
4973ca2955 for-6.4-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmSMg4YACgkQxWXV+ddt
 WDvNxg/9G45Lcn3YPYXicbzKcrrz4fpg4gqx9IX226DfJX78iZskl3LN1w+gFcj0
 gAKSC73ZZCGhIqrHOuWIbH5+BRO3FzTB9zr7tfx4H+pFWHs0BgYPqcoBjLTHZ/Pn
 2RYu+F922tGaPW7LZ2LtGlv+8Y4IDtWVe6uRyxSqv3dtF1jcgUfnJk2zJXG5z41R
 h1BSX7mcWUxUXbSJqTzAij7jyvbpnmy1BjsGDRG2G2J/AmvpUBtx1Gc3aKWhD2Up
 vNLQkl4OxbaW1t8CV9u6iGduS5mUAetOXoT2DTr3sSQMeA56Gpues/qb6qQVTbwb
 2cBnwQugZyz39yZkyvvopy6z2rasMmw6V/aPLKTLvPN/P+DYwU+bfcFuNa+LFxz4
 KJqGvZdrwDlhGc80+xjKhly4zLahAt0H+Y1yKjRK2RRx/TsXl4ufVc5hpq9rj8eK
 AoNvoZw9W3/L0juMUfZILhMbD2f7XGbUXlNhIXHCZsOZzuZBqNMNNv9d8b5ncbWE
 q6a5EJXzQzk13kiurVBZJoZokYxsUzEBsKeij4aaP1Rkw8r/62GvEt79Nu8X+67+
 cQyZ6CQ6eZ2PsPx9DtooCbAnH6huIPf9yagn5J2Li6H6VdvOlP6zIi7Tp33AhPdp
 1BMfaNq46l6Gxiu1pnclzSb8abVLb71ZxXNItEK/EkbH/uktaro=
 =NAyd
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Two fixes for NOCOW files, a regression fix in scrub and an assertion
  fix:

   - NOCOW fixes:
      - keep length of iomap direct io request in case of a failure
      - properly pass mode of extent reference checking, this can break
        some cases for swapfile

   - fix error value confusion when scrubbing a stripe

   - convert assertion to a proper error handling when loading global
     roots, reported by syzbot"

* tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: scrub: fix a return value overwrite in scrub_stripe()
  btrfs: do not ASSERT() on duplicated global roots
  btrfs: can_nocow_file_extent should pass down args->strict from callers
  btrfs: fix iomap_begin length for nocow writes
2023-06-16 12:41:56 -07:00
Linus Torvalds
62d8779610 Fix two regressions in ext4, one report by syzkaller[1], and reported by
multiple users (and tracked by regzbot[2]).
 
 [1] https://syzkaller.appspot.com/bug?extid=4acc7d910e617b360859
 [2] https://linux-regtracking.leemhuis.info/regzbot/regression/ZIauBR7YiV3rVAHL@glitch/
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmSLPr8ACgkQ8vlZVpUN
 gaO08wgAlk1HqZub9iJm7ZQRx4qVb4zGYGL8So4IE/971FT8Vq3suJsQlTWpPVBB
 8O6H5bsq5F54wfjmY44aUyBjc1SEJL2zfrEeMxcsxH+1tEoEe8i+FT3BXVr6+Y7k
 qdZYrqmtbYpM2dyNW7LY4UDRQBE6ug+r6Cq1z2Mo7cfwfiBcYYe7lncbl3xJe7D8
 8DZLgjQg48wxvU5EOK2yHMHw0YfJySnmaCo7J7sU5TLZj91IEnlNGwAjYIy1GUBV
 3I+/+U70QR1VtZig7v4q6TI3LDHfegRc8zIIOtM3u1qs2bzLFgyoNURtut2vhF2+
 /XyXxFabmlpG7yp3UZVetCECKVa2Kg==
 =WJ4O
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix two regressions in ext4, one report by syzkaller[1], and reported
  by multiple users (and tracked by regzbot[2])"

[1] https://syzkaller.appspot.com/bug?extid=4acc7d910e617b360859
[2] https://linux-regtracking.leemhuis.info/regzbot/regression/ZIauBR7YiV3rVAHL@glitch/

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: drop the call to ext4_error() from ext4_get_group_info()
  Revert "ext4: remove unnecessary check in ext4_bg_num_gdb_nometa"
2023-06-15 15:40:58 -07:00
Linus Torvalds
7a043feb6c eight, mostly small, smb3 client fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmSKVUsACgkQiiy9cAdy
 T1EwhAv/bazgxHAjxmoZ8hxw21uH/Ufvf2Cvt49yFTjxiaOCoSPLovA5AyB1FlTz
 xsw/Cf6haO+RIYpLQ1KAIg7Qt3zwUu50Io2s/yyaHmQu7d+eNUPISajrJIjYOnqB
 tAVHav0XcvTaPAzyr9luqmBZnLaAeatK5CbrK+gT4xstW/cuTu/F4JUF8fOL++ww
 /IDTvo6UKqEKuPD8dpszXZnJjuj8aIsRIh8CBGI73QrDVExTzhrdmRULWqwSj0Lr
 nHsh8hgUZ25dSe/9jIagPqgvzaizvcyzopJuO8mpQAMDww9o1kBlud9GLdgNGnfM
 bDHHbn3Y9Fu5VaxMJ/n9lozMXT+OB+3ZrrBFP8J0unSidzsEzWKfYjE0gPXGKru8
 YhvjWKdJZjIOEu9QRuD4KK+MOF20cuq3OmvKpoZiEr2niB7oX25B+JddxW2CJD5a
 oAiP1+1/wiVe79KUlLLRe+OTtujUXtMiXPePzHh15dYetg+P4qJCBqWjDPdzRmdr
 MfMhsZpq
 =peSC
 -----END PGP SIGNATURE-----

Merge tag '6.4-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:
 "Eight, mostly small, smb3 client fixes:

   - important fix for deferred close oops (race with unmount) found
     with xfstest generic/098 to some servers

   - important reconnect fix

   - fix problem with max_credits mount option

   - two multichannel (interface related) fixes

   - one trivial removal of confusing comment

   - two small debugging improvements (to better spot crediting
     problems)"

* tag '6.4-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: add a warning when the in-flight count goes negative
  cifs: fix lease break oops in xfstest generic/098
  cifs: fix max_credits implementation
  cifs: fix sockaddr comparison in iface_cmp
  smb/client: print "Unknown" instead of bogus link speed value
  cifs: print all credit counters in DebugData
  cifs: fix status checks in cifs_tree_connect
  smb: remove obsolete comment
2023-06-15 15:24:33 -07:00
Fabio M. De Francesco
f451fd97dd ext4: drop the call to ext4_error() from ext4_get_group_info()
A recent patch added a call to ext4_error() which is problematic since
some callers of the ext4_get_group_info() function may be holding a
spinlock, whereas ext4_error() must never be called in atomic context.

This triggered a report from Syzbot: "BUG: sleeping function called from
invalid context in ext4_update_super" (see the link below).

Therefore, drop the call to ext4_error() from ext4_get_group_info(). In
the meantime use eight characters tabs instead of nine characters ones.

Reported-by: syzbot+4acc7d910e617b360859@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/00000000000070575805fdc6cdb2@google.com/
Fixes: 5354b2af34 ("ext4: allow ext4_get_group_info() to fail")
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Link: https://lore.kernel.org/r/20230614100446.14337-1-fmdefrancesco@gmail.com
2023-06-14 22:24:05 -04:00
Kemeng Shi
1948279211 Revert "ext4: remove unnecessary check in ext4_bg_num_gdb_nometa"
This reverts commit ad3f09be6c.

The reverted commit was intended to simpfy the code to get group
descriptor block number in non-meta block group by assuming
s_gdb_count is block number used for all non-meta block group descriptors.
However s_gdb_count is block number used for all meta *and* non-meta
group descriptors. So s_gdb_group will be > actual group descriptor block
number used for all non-meta block group which should be "total non-meta
block group" / "group descriptors per block", e.g. s_first_meta_bg.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230613225025.3859522-1-shikemeng@huaweicloud.com
Fixes: ad3f09be6c ("ext4: remove unnecessary check in ext4_bg_num_gdb_nometa")
Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-14 22:22:27 -04:00
Qu Wenruo
b50f2d048e btrfs: scrub: fix a return value overwrite in scrub_stripe()
[RETURN VALUE OVERWRITE]
Inside scrub_stripe(), we would submit all the remaining stripes after
iterating all extents.

But since flush_scrub_stripes() can return error, we need to avoid
overwriting the existing @ret if there is any error.

However the existing check is doing the wrong check:

	ret2 = flush_scrub_stripes();
	if (!ret2)
		ret = ret2;

This would overwrite the existing @ret to 0 as long as the final flush
detects no critical errors.

[FIX]
We should check @ret other than @ret2 in that case.

Fixes: 8eb3dd17ea ("btrfs: dev-replace: error out if we have unrepaired metadata error during")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-14 18:30:30 +02:00
Shyam Prasad N
e4645cc2f1 cifs: add a warning when the in-flight count goes negative
We've seen the in-flight count go into negative with some
internal stress testing in Microsoft.

Adding a WARN when this happens, in hope of understanding
why this happens when it happens.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-14 10:15:05 -05:00
Steve French
c774e6779f cifs: fix lease break oops in xfstest generic/098
umount can race with lease break so need to check if
tcon->ses->server is still valid to send the lease
break response.

Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Fixes: 59a556aebc ("SMB3: drop reference to cfile before sending oplock break")
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-14 10:15:01 -05:00
Qu Wenruo
745806fb45 btrfs: do not ASSERT() on duplicated global roots
[BUG]
Syzbot reports a reproducible ASSERT() when using rescue=usebackuproot
mount option on a corrupted fs.

The full report can be found here:
https://syzkaller.appspot.com/bug?extid=c4614eae20a166c25bf0

  BTRFS error (device loop0: state C): failed to load root csum
  assertion failed: !tmp, in fs/btrfs/disk-io.c:1103
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.h:3664!
  invalid opcode: 0000 [#1] PREEMPT SMP KASAN
  CPU: 1 PID: 3608 Comm: syz-executor356 Not tainted 6.0.0-rc7-syzkaller-00029-g3800a713b607 #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
  RIP: 0010:assertfail+0x1a/0x1c fs/btrfs/ctree.h:3663
  RSP: 0018:ffffc90003aaf250 EFLAGS: 00010246
  RAX: 0000000000000032 RBX: 0000000000000000 RCX: f21c13f886638400
  RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
  RBP: ffff888021c640a0 R08: ffffffff816bd38d R09: ffffed10173667f1
  R10: ffffed10173667f1 R11: 1ffff110173667f0 R12: dffffc0000000000
  R13: ffff8880229c21f7 R14: ffff888021c64060 R15: ffff8880226c0000
  FS:  0000555556a73300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000055a2637d7a00 CR3: 00000000709c4000 CR4: 00000000003506e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   btrfs_global_root_insert+0x1a7/0x1b0 fs/btrfs/disk-io.c:1103
   load_global_roots_objectid+0x482/0x8c0 fs/btrfs/disk-io.c:2467
   load_global_roots fs/btrfs/disk-io.c:2501 [inline]
   btrfs_read_roots fs/btrfs/disk-io.c:2528 [inline]
   init_tree_roots+0xccb/0x203c fs/btrfs/disk-io.c:2939
   open_ctree+0x1e53/0x33df fs/btrfs/disk-io.c:3574
   btrfs_fill_super+0x1c6/0x2d0 fs/btrfs/super.c:1456
   btrfs_mount_root+0x885/0x9a0 fs/btrfs/super.c:1824
   legacy_get_tree+0xea/0x180 fs/fs_context.c:610
   vfs_get_tree+0x88/0x270 fs/super.c:1530
   fc_mount fs/namespace.c:1043 [inline]
   vfs_kern_mount+0xc9/0x160 fs/namespace.c:1073
   btrfs_mount+0x3d3/0xbb0 fs/btrfs/super.c:1884

[CAUSE]
Since the introduction of global roots, we handle
csum/extent/free-space-tree roots as global roots, even if no
extent-tree-v2 feature is enabled.

So for regular csum/extent/fst roots, we load them into
fs_info::global_root_tree rb tree.

And we should not expect any conflicts in that rb tree, thus we have an
ASSERT() inside btrfs_global_root_insert().

But rescue=usebackuproot can break the assumption, as we will try to
load those trees again and again as long as we have bad roots and have
backup roots slot remaining.

So in that case we can have conflicting roots in the rb tree, and
triggering the ASSERT() crash.

[FIX]
We can safely remove that ASSERT(), as the caller will properly put the
offending root.

To make further debugging easier, also add two explicit error messages:

- Error message for conflicting global roots
- Error message when using backup roots slot

Reported-by: syzbot+a694851c6ab28cbcfb9c@syzkaller.appspotmail.com
Fixes: abed4aaae4 ("btrfs: track the csum, extent, and free space trees in a rb tree")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-13 01:21:16 +02:00
Linus Torvalds
fb054096ae 19 hotfixes. 14 are cc:stable and the remainder address issues which were
introduced during this -rc cycle or which were considered inappropriate
 for a backport.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZIdw7QAKCRDdBJ7gKXxA
 jki4AQCygi1UoqVPq4N/NzJbv2GaNDXNmcJIoLvPpp3MYFhucAEAtQNzAYO9z6CT
 iLDMosnuh+1KLTaKNGL5iak3NAxnxQw=
 =mTdI
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "19 hotfixes. 14 are cc:stable and the remainder address issues which
  were introduced during this development cycle or which were considered
  inappropriate for a backport"

* tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  zswap: do not shrink if cgroup may not zswap
  page cache: fix page_cache_next/prev_miss off by one
  ocfs2: check new file size on fallocate call
  mailmap: add entry for John Keeping
  mm/damon/core: fix divide error in damon_nr_accesses_to_accesses_bp()
  epoll: ep_autoremove_wake_function should use list_del_init_careful
  mm/gup_test: fix ioctl fail for compat task
  nilfs2: reject devices with insufficient block count
  ocfs2: fix use-after-free when unmounting read-only filesystem
  lib/test_vmalloc.c: avoid garbage in page array
  nilfs2: fix possible out-of-bounds segment allocation in resize ioctl
  riscv/purgatory: remove PGO flags
  powerpc/purgatory: remove PGO flags
  x86/purgatory: remove PGO flags
  kexec: support purgatories with .text.hot sections
  mm/uffd: allow vma to merge as much as possible
  mm/uffd: fix vma operation where start addr cuts part of vma
  radix-tree: move declarations to header
  nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()
2023-06-12 16:14:34 -07:00
Chris Mason
deccae40e4 btrfs: can_nocow_file_extent should pass down args->strict from callers
Commit 619104ba45 ("btrfs: move common NOCOW checks against a file
extent into a helper") changed our call to btrfs_cross_ref_exist() to
always pass false for the 'strict' parameter.  We're passing this down
through the stack so that we can do a full check for cross references
during swapfile activation.

With strict always false, this test fails:

  btrfs subvol create swappy
  chattr +C swappy
  fallocate -l1G swappy/swapfile
  chmod 600 swappy/swapfile
  mkswap swappy/swapfile

  btrfs subvol snap swappy swapsnap
  btrfs subvol del -C swapsnap

  btrfs fi sync /
  sync;sync;sync

  swapon swappy/swapfile

The fix is to just use args->strict, and everyone except swapfile
activation is passing false.

Fixes: 619104ba45 ("btrfs: move common NOCOW checks against a file extent into a helper")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-13 00:01:08 +02:00
Christoph Hellwig
7833b86595 btrfs: fix iomap_begin length for nocow writes
can_nocow_extent can reduce the len passed in, which needs to be
propagated to btrfs_dio_iomap_begin so that iomap does not submit
more data then is mapped.

This problems exists since the btrfs_get_blocks_direct helper was added
in commit c5794e5178 ("btrfs: Factor out write portion of
btrfs_get_blocks_direct"), but the ordered_extent splitting added in
commit b73a6fd1b1 ("btrfs: split partial dio bios before submit")
added a WARN_ON that made a syzkaller test fail.

Reported-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com
Fixes: c5794e5178 ("btrfs: Factor out write portion of btrfs_get_blocks_direct")
CC: stable@vger.kernel.org # 6.1+
Tested-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-13 00:01:00 +02:00
Luís Henriques
26a6ffff7d ocfs2: check new file size on fallocate call
When changing a file size with fallocate() the new size isn't being
checked.  In particular, the FSIZE ulimit isn't being checked, which makes
fstest generic/228 fail.  Simply adding a call to inode_newsize_ok() fixes
this issue.

Link: https://lkml.kernel.org/r/20230529152645.32680-1-lhenriques@suse.de
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Mark Fasheh <mark@fasheh.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:52 -07:00
Benjamin Segall
2192bba03d epoll: ep_autoremove_wake_function should use list_del_init_careful
autoremove_wake_function uses list_del_init_careful, so should epoll's
more aggressive variant.  It only doesn't because it was copied from an
older wait.c rather than the most recent.

[bsegall@google.com: add comment]
  Link: https://lkml.kernel.org/r/xm26bki0ulsr.fsf_-_@google.com
Link: https://lkml.kernel.org/r/xm26pm6hvfer.fsf@google.com
Fixes: a16ceb1396 ("epoll: autoremove wakers even more aggressively")
Signed-off-by: Ben Segall <bsegall@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:52 -07:00
Ryusuke Konishi
92c5d1b860 nilfs2: reject devices with insufficient block count
The current sanity check for nilfs2 geometry information lacks checks for
the number of segments stored in superblocks, so even for device images
that have been destructively truncated or have an unusually high number of
segments, the mount operation may succeed.

This causes out-of-bounds block I/O on file system block reads or log
writes to the segments, the latter in particular causing
"a_ops->writepages" to repeatedly fail, resulting in sync_inodes_sb() to
hang.

Fix this issue by checking the number of segments stored in the superblock
and avoiding mounting devices that can cause out-of-bounds accesses.  To
eliminate the possibility of overflow when calculating the number of
blocks required for the device from the number of segments, this also adds
a helper function to calculate the upper bound on the number of segments
and inserts a check using it.

Link: https://lkml.kernel.org/r/20230526021332.3431-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+7d50f1e54a12ba3aeae2@syzkaller.appspotmail.com
  Link: https://syzkaller.appspot.com/bug?extid=7d50f1e54a12ba3aeae2
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:51 -07:00
Luís Henriques
50d927880e ocfs2: fix use-after-free when unmounting read-only filesystem
It's trivial to trigger a use-after-free bug in the ocfs2 quotas code using
fstest generic/452.  After a read-only remount, quotas are suspended and
ocfs2_mem_dqinfo is freed through ->ocfs2_local_free_info().  When unmounting
the filesystem, an UAF access to the oinfo will eventually cause a crash.
 
BUG: KASAN: slab-use-after-free in timer_delete+0x54/0xc0
Read of size 8 at addr ffff8880389a8208 by task umount/669
...
Call Trace:
 <TASK>
 ...
 timer_delete+0x54/0xc0
 try_to_grab_pending+0x31/0x230
 __cancel_work_timer+0x6c/0x270
 ocfs2_disable_quotas.isra.0+0x3e/0xf0 [ocfs2]
 ocfs2_dismount_volume+0xdd/0x450 [ocfs2]
 generic_shutdown_super+0xaa/0x280
 kill_block_super+0x46/0x70
 deactivate_locked_super+0x4d/0xb0
 cleanup_mnt+0x135/0x1f0
 ...
 </TASK>

Allocated by task 632:
 kasan_save_stack+0x1c/0x40
 kasan_set_track+0x21/0x30
 __kasan_kmalloc+0x8b/0x90
 ocfs2_local_read_info+0xe3/0x9a0 [ocfs2]
 dquot_load_quota_sb+0x34b/0x680
 dquot_load_quota_inode+0xfe/0x1a0
 ocfs2_enable_quotas+0x190/0x2f0 [ocfs2]
 ocfs2_fill_super+0x14ef/0x2120 [ocfs2]
 mount_bdev+0x1be/0x200
 legacy_get_tree+0x6c/0xb0
 vfs_get_tree+0x3e/0x110
 path_mount+0xa90/0xe10
 __x64_sys_mount+0x16f/0x1a0
 do_syscall_64+0x43/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

Freed by task 650:
 kasan_save_stack+0x1c/0x40
 kasan_set_track+0x21/0x30
 kasan_save_free_info+0x2a/0x50
 __kasan_slab_free+0xf9/0x150
 __kmem_cache_free+0x89/0x180
 ocfs2_local_free_info+0x2ba/0x3f0 [ocfs2]
 dquot_disable+0x35f/0xa70
 ocfs2_susp_quotas.isra.0+0x159/0x1a0 [ocfs2]
 ocfs2_remount+0x150/0x580 [ocfs2]
 reconfigure_super+0x1a5/0x3a0
 path_mount+0xc8a/0xe10
 __x64_sys_mount+0x16f/0x1a0
 do_syscall_64+0x43/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

Link: https://lkml.kernel.org/r/20230522102112.9031-1-lhenriques@suse.de
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:51 -07:00
Ryusuke Konishi
fee5eaecca nilfs2: fix possible out-of-bounds segment allocation in resize ioctl
Syzbot reports that in its stress test for resize ioctl, the log writing
function nilfs_segctor_do_construct hits a WARN_ON in
nilfs_segctor_truncate_segments().

It turned out that there is a problem with the current implementation of
the resize ioctl, which changes the writable range on the device (the
range of allocatable segments) at the end of the resize process.

This order is necessary for file system expansion to avoid corrupting the
superblock at trailing edge.  However, in the case of a file system
shrink, if log writes occur after truncating out-of-bounds trailing
segments and before the resize is complete, segments may be allocated from
the truncated space.

The userspace resize tool was fine as it limits the range of allocatable
segments before performing the resize, but it can run into this issue if
the resize ioctl is called alone.

Fix this issue by changing nilfs_sufile_resize() to update the range of
allocatable segments immediately after successful truncation of segment
space in case of file system shrink.

Link: https://lkml.kernel.org/r/20230524094348.3784-1-konishi.ryusuke@gmail.com
Fixes: 4e33f9eab0 ("nilfs2: implement resize ioctl")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+33494cd0df2ec2931851@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/0000000000005434c405fbbafdc5@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:51 -07:00
Peter Xu
5543d3c43c mm/uffd: allow vma to merge as much as possible
We used to not pass in the pgoff correctly when register/unregister uffd
regions, it caused incorrect behavior on vma merging and can cause
mergeable vmas being separate after ioctls return.

For example, when we have:

  vma1(range 0-9, with uffd), vma2(range 10-19, no uffd)

Then someone unregisters uffd on range (5-9), it should logically become:

  vma1(range 0-4, with uffd), vma2(range 5-19, no uffd)

But with current code we'll have:

  vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd)

This patch allows such merge to happen correctly before ioctl returns.

This behavior seems to have existed since the 1st day of uffd.  Since
pgoff for vma_merge() is only used to identify the possibility of vma
merging, meanwhile here what we did was always passing in a pgoff smaller
than what we should, so there should have no other side effect besides not
merging it.  Let's still tentatively copy stable for this, even though I
don't see anything will go wrong besides vma being split (which is mostly
not user visible).

Link: https://lkml.kernel.org/r/20230517190916.3429499-3-peterx@redhat.com
Fixes: 86039bd3b4 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:50 -07:00
Peter Xu
270aa01062 mm/uffd: fix vma operation where start addr cuts part of vma
Patch series "mm/uffd: Fix vma merge/split", v2.

This series contains two patches that fix vma merge/split for userfaultfd
on two separate issues.

Patch 1 fixes a regression since 6.1+ due to something we overlooked when
converting to maple tree apis.  The plan is we use patch 1 to replace the
commit "2f628010799e (mm: userfaultfd: avoid passing an invalid range to
vma_merge())" in mm-hostfixes-unstable tree if possible, so as to bring
uffd vma operations back aligned with the rest code again.

Patch 2 fixes a long standing issue that vma can be left unmerged even if
we can for either uffd register or unregister.

Many thanks to Lorenzo on either noticing this issue from the assert
movement patch, looking at this problem, and also provided a reproducer on
the unmerged vma issue [1].

[1] https://gist.github.com/lorenzo-stoakes/a11a10f5f479e7a977fc456331266e0e


This patch (of 2):

It seems vma merging with uffd paths is broken with either
register/unregister, where right now we can feed wrong parameters to
vma_merge() and it's found by recent patch which moved asserts upwards in
vma_merge() by Lorenzo Stoakes:

https://lore.kernel.org/all/ZFunF7DmMdK05MoF@FVFF77S0Q05N.cambridge.arm.com/

It's possible that "start" is contained within vma but not clamped to its
start.  We need to convert this into either "cannot merge" case or "can
merge" case 4 which permits subdivision of prev by assigning vma to prev. 
As we loop, each subsequent VMA will be clamped to the start.

This patch will eliminate the report and make sure vma_merge() calls will
become legal again.

One thing to mention is that the "Fixes: 29417d292bd0" below is there only
to help explain where the warning can start to trigger, the real commit to
fix should be 69dbe6daf1.  Commit 29417d292b helps us to identify the
issue, but unfortunately we may want to keep it in Fixes too just to ease
kernel backporters for easier tracking.

Link: https://lkml.kernel.org/r/20230517190916.3429499-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20230517190916.3429499-2-peterx@redhat.com
Fixes: 69dbe6daf1 ("userfaultfd: use maple tree iterator to iterate VMAs")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Closes: https://lore.kernel.org/all/ZFunF7DmMdK05MoF@FVFF77S0Q05N.cambridge.arm.com/
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:50 -07:00
Ryusuke Konishi
2f012f2bac nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()
A syzbot fault injection test reported that nilfs_btnode_create_block, a
helper function that allocates a new node block for b-trees, causes a
kernel BUG for disk images where the file system block size is smaller
than the page size.

This was due to unexpected flags on the newly allocated buffer head, and
it turned out to be because the buffer flags were not cleared by
nilfs_btnode_abort_change_key() after an error occurred during a b-tree
update operation and the buffer was later reused in that state.

Fix this issue by using nilfs_btnode_delete() to abandon the unused
preallocated buffer in nilfs_btnode_abort_change_key().

Link: https://lkml.kernel.org/r/20230513102428.10223-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+b0a35a5c1f7e846d3b09@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/000000000000d1d6c205ebc4d512@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:49 -07:00
Linus Torvalds
ace9e12da2 for-6.4-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmSHScAACgkQxWXV+ddt
 WDvoLA/8CxGfC9i/zO2odxbV1id8JiubGyi2Q28ANE3ygwRBI2dh7u2TBTv9aKPF
 Bzm6VsafG2OwMuwu08jO3t98+QrxU9vb6YCzCPL4t+8IDLJhwpz6zdH/Lvl3RnyV
 nz+aKHi2vfTRKt1Cf4uB5dVzPM3QVHYi3vidt15Suf2nhKnXimu0FVGXabQfd44z
 cCE4ep8IkLshcrsEOwVQj44isRXztJza3D6P7zPfu0NB5Bue7VJNBI4JoGOAT8UQ
 8c+V1U6EbMARWcdbk4Vm34IoAAxcQW6MNnHG83+ie2OpuKJ9g7oNXMTPL73gntNr
 DtC38Vr8gbpXJFmqOCwD8+9f3jP2pX6LjJT0IR6eGJbCleWd6JPlvnfJ+QHdb/vE
 LblDjH84O0Js+0iPKOSKzglfrKZPYDEnIBUwbZQICj/8+aHPU1Y4eTRcv52bVnpa
 1umdz19Sjh0HjuX4k44E/fLgGnLw+ezxhe6WQ7RdDrnr4+9tXpz0z/ZsatIgl1Pc
 wfS5Y2XBIdzKBIF8FxAEL3xCXd6byOsMMhSRu6J7W8Tgw5dnvKiQLRCK+FIpBRru
 WZ7vrNKz67marmqcIp0Hpoipd5+ib6pAdZs69GAvk4bWvVoLZ0Vuyb3lQr5fg6Vm
 Xn1iwcYoWjlAYrpVW31dlaVCfoewm96qbzNa3XqA87I/6frGFcc=
 =ABpK
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A  more fixes and regression fixes:

   - in subpage mode, fix crash when repairing metadata at the end of
     a stripe

   - properly enable async discard when remounting from read-only to
     read-write

   - scrub regression fixes:
      - respect read-only scrub when attempting to do a repair
      - fix reporting of found errors, the stats don't get properly
        accounted after a stripe repair"

* tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: scrub: also report errors hit during the initial read
  btrfs: scrub: respect the read-only flag during repair
  btrfs: properly enable async discard when switching from RO->RW
  btrfs: subpage: fix a crash in metadata repair path
2023-06-12 10:53:35 -07:00
Shyam Prasad N
5e90aa21eb cifs: fix max_credits implementation
The current implementation of max_credits on the client does
not work because the CreditRequest logic for several commands
does not take max_credits into account.

Still, we can end up asking the server for more credits, depending
on the number of credits in flight. For this, we need to
limit the credits while parsing the responses too.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-11 20:52:49 -05:00
Shyam Prasad N
2991b77409 cifs: fix sockaddr comparison in iface_cmp
iface_cmp used to simply do a memcmp of the two
provided struct sockaddrs. The comparison needs to do more
based on the address family. Similar logic was already
present in cifs_match_ipaddr. Doing something similar now.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-11 20:52:49 -05:00
Enzo Matsumiya
50e63d6db6 smb/client: print "Unknown" instead of bogus link speed value
The virtio driver for Linux guests will not set a link speed to its
paravirtualized NICs.  This will be seen as -1 in the ethernet layer, and
when some servers (e.g. samba) fetches it, it's converted to an unsigned
value (and multiplied by 1000 * 1000), so in client side we end up with:

1)      Speed: 4294967295000000 bps

in DebugData.

This patch introduces a helper that returns a speed string (in Mbps or
Gbps) if interface speed is valid (>= SPEED_10 and <= SPEED_800000), or
"Unknown" otherwise.

The reason to not change the value in iface->speed is because we don't
know the real speed of the HW backing the server NIC, so let's keep
considering these as the fastest NICs available.

Also print "Capabilities: None" when the interface doesn't support any.

Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-11 20:52:49 -05:00