Commit graph

12064 commits

Author SHA1 Message Date
Josef Bacik
4aec05fa5a btrfs: remove level argument from btrfs_set_block_flags
We just pass in btrfs_header_level(eb) for the level, and we're passing
in the eb already, so simply get the level from the eb inside of
btrfs_set_block_flags.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Josef Bacik
54d687c13a btrfs: move btrfs_check_trunc_cache_free_space into block-rsv.c
This is completely related to block rsv's, move it out of the free space
cache code and into block-rsv.c.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Qu Wenruo
94ead93e63 btrfs: scrub: use recovered data stripes as cache to avoid unnecessary read
For P/Q stripe scrub, we have quite some duplicated read IO:

- Data stripes read for verification
  This is triggered by the scrub_submit_initial_read() inside
  scrub_raid56_parity_stripe().

- Data stripes read (again) for P/Q stripe verification
  This is triggered by scrub_assemble_read_bios() from scrub_rbio().

  Although we can have hit rbio cache and avoid unnecessary read, the
  chance is very low, as scrub would easily flush the whole rbio cache.

This means, even we're just scrubbing a single P/Q stripe, we would read
the data stripes twice for the best case scenario.  If we need to
recover some data stripes, it would cause more reads on the same data
stripes, again and again.

However before we call raid56_parity_submit_scrub_rbio() we already
have all data stripes repaired and their contents ready to use.
But RAID56 cache is unaware about the scrub cache, thus RAID56 layer
itself still needs to re-read the data stripes.

To avoid such cache miss, this patch would:

- Introduce a new helper, raid56_parity_cache_data_pages()
  This function would grab the pages from an array, and copy the content
  to the rbio, marking all the involved sectors uptodate.

  The page copy is unavoidable because of the cache pages of rbio are all
  self managed, thus can not utilize outside pages without screwing up
  the lifespan.

- Use the repaired data stripes as cache inside
  scrub_raid56_parity_stripe()

By this, we ensure all the data sectors of the scrub rbio are already
uptodate, and no need to read them again from disk.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
7e5ba55994 btrfs: assert tree lock is held when removing free space entries
Removing a free space entry from an in memory space cache requires having
the corresponding btrfs_free_space_ctl's 'tree_lock' held. We have several
code paths that remove an entry, so add assertions where appropriate to
verify we are holding the lock, as the lock is acquired by some other
function up in the call chain, which makes it easy to miss in the future.

Note: for this to work we need to lock the local btrfs_free_space_ctl at
load_free_space_cache(), which was not being done because it's local,
declared on the stack, so no other task has access to it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
9649bd9a29 btrfs: assert tree lock is held when linking free space
When linking a free space entry, at link_free_space(), the caller should
be holding the spinlock 'tree_lock' of the given btrfs_free_space_ctl
argument, which is necessary for manipulating the red black tree of free
space entries (done by tree_insert_offset(), which already asserts the
lock is held) and for manipulating the 'free_space', 'free_extents',
'discardable_extents' and 'discardable_bytes' counters of the given
struct btrfs_free_space_ctl.

So assert that the spinlock 'tree_lock' of the given btrfs_free_space_ctl
is held by the current task. We have multiple code paths that end up
calling link_free_space(), and all currently take the lock before calling
it.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
91de9e978d btrfs: assert tree lock is held when searching for free space entries
When searching for a free space entry by offset, at tree_search_offset(),
we are supposed to have the btrfs_free_space_ctl's 'tree_lock' held, so
assert that. We have multiple callers of tree_search_offset(), and all
currently hold the necessary lock before calling it.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
13c2018fcc btrfs: assert proper locks are held at tree_insert_offset()
There are multiple code paths leading to tree_insert_offset(), and each
path takes the necessary locks before tree_insert_offset() is called,
since they do other things that require those locks to be held. This makes
it easy to miss the locking somewhere, so make tree_insert_offset() assert
that the required locks are being held by the calling task.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
0d6bac4d30 btrfs: simplify arguments to tree_insert_offset()
For the in-memory component of space caching (free space cache and free
space tree), three of the arguments passed to tree_insert_offset() can
always be taken from the new free space entry that we are about to add.

So simplify tree_insert_offset() to take the new entry instead of the
'offset', 'node' and 'bitmap' arguments. This will also allow to make
further changes simpler.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
b77433b144 btrfs: use precomputed end offsets at do_trimming()
The are two computations of end offsets at do_trimming() that are not
necessary, as they were previously computed and stored in local const
variables. So just use the variables instead, to make the source code
shorter and easier to read.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
9085f42571 btrfs: avoid searching twice for previous node when merging free space entries
At try_merge_free_space(), avoid calling twice rb_prev() to find the
previous node, as that requires looping through the red black tree, so
store the result of the rb_prev() call and then use it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Filipe Manana
fbb2e654d8 btrfs: avoid extra memory allocation when copying free space cache
At copy_free_space_cache(), we add a new entry to the block group's ctl
before we free the entry from the temporary ctl. Adding a new entry
requires the allocation of a new struct btrfs_free_space, so we can
avoid a temporary extra allocation by freeing the entry from the
temporary ctl before we add a new entry to the main ctl, which possibly
also reduces the chances for a memory allocation failure in case of very
high memory pressure. So just do that.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Tom Rix
12df6a622e btrfs: simplify transid initialization in btrfs_ioctl_wait_sync
A small code simplification, move the default value of transid to its
initialization and remove the else-statement.

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Qu Wenruo
b9a9a85059 btrfs: output affected files when relocation fails
[PROBLEM]
When relocation fails (mostly due to checksum mismatch), we only got
very cryptic error messages like:

  BTRFS info (device dm-4): relocating block group 13631488 flags data
  BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 csum 0x373e1ae3 expected csum 0x98757625 mirror 1
  BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  BTRFS info (device dm-4): balance: ended with status: -5

The end user has to decipher the above messages and use various tools to
locate the affected files and find a way to fix the problem (mostly
deleting the file).  This is not an easy work even for experienced
developer, not to mention the end users.

[SCRUB IS DOING BETTER]
By contrast, scrub is providing much better error messages:

  BTRFS error (device dm-4): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
  BTRFS warning (device dm-4): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
  BTRFS info (device dm-4): scrub: finished on devid 1 with status: 0

Which provides the affected files directly to the end user.

[IMPROVEMENT]
Instead of the generic data checksum error messages, which is not doing
a good job for data reloc inodes, this patch introduce a scrub like
backref walking based solution.

When a sector fails its checksum for data reloc inode, we go the
following workflow:

- Get the real logical bytenr
  For data reloc inode, the file offset is the offset inside the block
  group.
  Thus the real logical bytenr is @file_off + @block_group->start.

- Do an extent type check
  If it's tree blocks it's much easier to handle, just go through
  all the tree block backref.

- Do a backref walk and inode path resolution for data extents
  This is mostly the same as scrub.
  But unfortunately we can not reuse the same function as the output
  format is different.

Now the new output would be more user friendly:

  BTRFS info (device dm-4): relocating block group 13631488 flags data
  BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 logical 13631488 csum 0x373e1ae3 expected csum 0x98757625 mirror 1
  BTRFS warning (device dm-4): checksum error at logical 13631488 mirror 1 root 5 inode 257 offset 0 length 4096 links 1 (path: file)
  BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  BTRFS info (device dm-4): balance: ended with status: -5

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Christoph Hellwig
8bfec2e426 btrfs: remove hipri_workers workqueue
Now that btrfs_wq_submit_bio is never called for synchronous I/O,
the hipri_workers workqueue is not used anymore and can be removed.

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Christoph Hellwig
e917ff56c8 btrfs: determine synchronous writers from bio or writeback control
The writeback_control structure already passes down the information about
a writeback being synchronous from the core VM code, and thus information
is propagated into the bio REQ_SYNC flag through the wbc_to_write_flags
helper.

Use that information to decide if checksums calculation is offloaded to
a workqueue instead of btrfs_inode::sync_writers field that not only
bloats the inode but also has too wide scope, being inode wide instead
of limited to the actual writeback request.

The sync writes were set in:

- btrfs_do_write_iter - regular IO, sync status is set
- start_ordered_ops - ordered write start, writeback with WB_SYNC_ALL
  mode
- btrfs_write_marked_extents - write marked extents, writeback with
  WB_SYNC_ALL mode

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Christoph Hellwig
da02361807 btrfs: submit IO synchronously for fast checksum implementations
Most modern hardware supports very fast accelerated crc32c calculation.
If that is supported the CPU overhead of the checksum calculation is
very limited, and offloading the calculation to special worker threads
has a lot of overhead for no gain.

E.g. on an Intel Optane device is actually very much slows down even
1M buffered writes with fio:

Unpatched:

write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets

With synchronous CRCs:

write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets

With a lot of variation during the unpatched run going down as low as
1100MB/s, while the synchronous CRC version has about the same peak write
speed but much lower dips, and fewer kworkers churning around.
Both tests had fio saturated at 100% CPU.

(thanks to Jens Axboe via Chris Mason for the benchmarking)

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Anand Jain
adbe7e388e btrfs: use SECTOR_SHIFT to convert LBA to physical offset
Using SECTOR_SHIFT to convert LBA to physical address makes it more
readable.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Anand Jain
29e70be261 btrfs: use SECTOR_SHIFT to convert physical offset to LBA
Use SECTOR_SHIFT while converting a physical address to an LBA, makes
it more readable.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Qu Wenruo
eee3b81178 btrfs: improve leaf dump and error handling
Improve the leaf dump behavior by:

- Always dump the leaf first, then the error message

- Output the slot number if possible
  Especially in __btrfs_free_extent() the leaf dump of extent tree can
  be pretty large.
  With an extra slot number it's much easier to locate the problem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Qu Wenruo
6c75a589cb btrfs: print-tree: pass const extent buffer pointer
Since print-tree infrastructure only prints the content of a tree block,
we can make them to accept const extent buffer pointer.

This removes a forced type convert in extent-tree, where we convert a
const extent buffer pointer to regular one, just to avoid compiler
warning.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Naohiro Aota
b5345d6cee btrfs: export bitmap_test_range_all_{set,zero}
bitmap_test_range_all_{set,zero} defined in subpage.c are useful for other
components. Move them to misc.h and use them in zoned.c. Also, as
find_next{,_zero}_bit take/return "unsigned long" instead of "unsigned
int", convert the type to "unsigned long".

While at it, also rewrite the "if (...) return true; else return false;"
pattern and add const to the input bitmap.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Filipe Manana
88ad95b055 btrfs: tag as unlikely the key comparison when checking sibling keys
When checking siblings keys, before moving keys from one node/leaf to a
sibling node/leaf, it's very unexpected to have the last key of the left
sibling greater than or equals to the first key of the right sibling, as
that means we have a (serious) corruption that breaks the key ordering
properties of a b+tree. Since this is unexpected, surround the comparison
with the unlikely macro, which helps the compiler generate better code
for the most expected case (no existing b+tree corruption). This is also
what we do for other unexpected cases of invalid key ordering (like at
btrfs_set_item_key_safe()).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Filipe Manana
f2db4d5cb4 btrfs: make btrfs_free_device() static
The function btrfs_free_device() is never used outside of volumes.c, so
make it static and remove its prototype declaration at volumes.h.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Sweet Tea Dorminy
1b53e51a4a btrfs: don't commit transaction for every subvol create
Recently a Meta-internal workload encountered subvolume creation taking
up to 2s each, significantly slower than directory creation. As they
were hoping to be able to use subvolumes instead of directories, and
were looking to create hundreds, this was a significant issue. After
Josef investigated, it turned out to be due to the transaction commit
currently performed at the end of subvolume creation.

This change improves the workload by not doing transaction commit for every
subvolume creation, and merely requiring a transaction commit on fsync.
In the worst case, of doing a subvolume create and fsync in a loop, this
should require an equal amount of time to the current scheme; and in the
best case, the internal workload creating hundreds of subvolumes before
fsyncing is greatly improved.

While it would be nice to be able to use the log tree and use the normal
fsync path, log tree replay can't deal with new subvolume inodes
presently.

It's possible that there's some reason that the transaction commit is
necessary for correctness during subvolume creation; however,
git logs indicate that the commit dates back to the beginning of
subvolume creation, and there are no notes on why it would be necessary.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Filipe Manana
f469c8bd90 btrfs: unexport btrfs_prev_leaf()
btrfs_prev_leaf() is not used outside ctree.c, so there's no need to
export it at ctree.h - just make it static at ctree.c and move its
definition above btrfs_search_slot_for_read(), since that function
calls it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Linus Torvalds
4973ca2955 for-6.4-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmSMg4YACgkQxWXV+ddt
 WDvNxg/9G45Lcn3YPYXicbzKcrrz4fpg4gqx9IX226DfJX78iZskl3LN1w+gFcj0
 gAKSC73ZZCGhIqrHOuWIbH5+BRO3FzTB9zr7tfx4H+pFWHs0BgYPqcoBjLTHZ/Pn
 2RYu+F922tGaPW7LZ2LtGlv+8Y4IDtWVe6uRyxSqv3dtF1jcgUfnJk2zJXG5z41R
 h1BSX7mcWUxUXbSJqTzAij7jyvbpnmy1BjsGDRG2G2J/AmvpUBtx1Gc3aKWhD2Up
 vNLQkl4OxbaW1t8CV9u6iGduS5mUAetOXoT2DTr3sSQMeA56Gpues/qb6qQVTbwb
 2cBnwQugZyz39yZkyvvopy6z2rasMmw6V/aPLKTLvPN/P+DYwU+bfcFuNa+LFxz4
 KJqGvZdrwDlhGc80+xjKhly4zLahAt0H+Y1yKjRK2RRx/TsXl4ufVc5hpq9rj8eK
 AoNvoZw9W3/L0juMUfZILhMbD2f7XGbUXlNhIXHCZsOZzuZBqNMNNv9d8b5ncbWE
 q6a5EJXzQzk13kiurVBZJoZokYxsUzEBsKeij4aaP1Rkw8r/62GvEt79Nu8X+67+
 cQyZ6CQ6eZ2PsPx9DtooCbAnH6huIPf9yagn5J2Li6H6VdvOlP6zIi7Tp33AhPdp
 1BMfaNq46l6Gxiu1pnclzSb8abVLb71ZxXNItEK/EkbH/uktaro=
 =NAyd
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Two fixes for NOCOW files, a regression fix in scrub and an assertion
  fix:

   - NOCOW fixes:
      - keep length of iomap direct io request in case of a failure
      - properly pass mode of extent reference checking, this can break
        some cases for swapfile

   - fix error value confusion when scrubbing a stripe

   - convert assertion to a proper error handling when loading global
     roots, reported by syzbot"

* tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: scrub: fix a return value overwrite in scrub_stripe()
  btrfs: do not ASSERT() on duplicated global roots
  btrfs: can_nocow_file_extent should pass down args->strict from callers
  btrfs: fix iomap_begin length for nocow writes
2023-06-16 12:41:56 -07:00
Qu Wenruo
b50f2d048e btrfs: scrub: fix a return value overwrite in scrub_stripe()
[RETURN VALUE OVERWRITE]
Inside scrub_stripe(), we would submit all the remaining stripes after
iterating all extents.

But since flush_scrub_stripes() can return error, we need to avoid
overwriting the existing @ret if there is any error.

However the existing check is doing the wrong check:

	ret2 = flush_scrub_stripes();
	if (!ret2)
		ret = ret2;

This would overwrite the existing @ret to 0 as long as the final flush
detects no critical errors.

[FIX]
We should check @ret other than @ret2 in that case.

Fixes: 8eb3dd17ea ("btrfs: dev-replace: error out if we have unrepaired metadata error during")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-14 18:30:30 +02:00
Qu Wenruo
745806fb45 btrfs: do not ASSERT() on duplicated global roots
[BUG]
Syzbot reports a reproducible ASSERT() when using rescue=usebackuproot
mount option on a corrupted fs.

The full report can be found here:
https://syzkaller.appspot.com/bug?extid=c4614eae20a166c25bf0

  BTRFS error (device loop0: state C): failed to load root csum
  assertion failed: !tmp, in fs/btrfs/disk-io.c:1103
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.h:3664!
  invalid opcode: 0000 [#1] PREEMPT SMP KASAN
  CPU: 1 PID: 3608 Comm: syz-executor356 Not tainted 6.0.0-rc7-syzkaller-00029-g3800a713b607 #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
  RIP: 0010:assertfail+0x1a/0x1c fs/btrfs/ctree.h:3663
  RSP: 0018:ffffc90003aaf250 EFLAGS: 00010246
  RAX: 0000000000000032 RBX: 0000000000000000 RCX: f21c13f886638400
  RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
  RBP: ffff888021c640a0 R08: ffffffff816bd38d R09: ffffed10173667f1
  R10: ffffed10173667f1 R11: 1ffff110173667f0 R12: dffffc0000000000
  R13: ffff8880229c21f7 R14: ffff888021c64060 R15: ffff8880226c0000
  FS:  0000555556a73300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000055a2637d7a00 CR3: 00000000709c4000 CR4: 00000000003506e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   btrfs_global_root_insert+0x1a7/0x1b0 fs/btrfs/disk-io.c:1103
   load_global_roots_objectid+0x482/0x8c0 fs/btrfs/disk-io.c:2467
   load_global_roots fs/btrfs/disk-io.c:2501 [inline]
   btrfs_read_roots fs/btrfs/disk-io.c:2528 [inline]
   init_tree_roots+0xccb/0x203c fs/btrfs/disk-io.c:2939
   open_ctree+0x1e53/0x33df fs/btrfs/disk-io.c:3574
   btrfs_fill_super+0x1c6/0x2d0 fs/btrfs/super.c:1456
   btrfs_mount_root+0x885/0x9a0 fs/btrfs/super.c:1824
   legacy_get_tree+0xea/0x180 fs/fs_context.c:610
   vfs_get_tree+0x88/0x270 fs/super.c:1530
   fc_mount fs/namespace.c:1043 [inline]
   vfs_kern_mount+0xc9/0x160 fs/namespace.c:1073
   btrfs_mount+0x3d3/0xbb0 fs/btrfs/super.c:1884

[CAUSE]
Since the introduction of global roots, we handle
csum/extent/free-space-tree roots as global roots, even if no
extent-tree-v2 feature is enabled.

So for regular csum/extent/fst roots, we load them into
fs_info::global_root_tree rb tree.

And we should not expect any conflicts in that rb tree, thus we have an
ASSERT() inside btrfs_global_root_insert().

But rescue=usebackuproot can break the assumption, as we will try to
load those trees again and again as long as we have bad roots and have
backup roots slot remaining.

So in that case we can have conflicting roots in the rb tree, and
triggering the ASSERT() crash.

[FIX]
We can safely remove that ASSERT(), as the caller will properly put the
offending root.

To make further debugging easier, also add two explicit error messages:

- Error message for conflicting global roots
- Error message when using backup roots slot

Reported-by: syzbot+a694851c6ab28cbcfb9c@syzkaller.appspotmail.com
Fixes: abed4aaae4 ("btrfs: track the csum, extent, and free space trees in a rb tree")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-13 01:21:16 +02:00
Chris Mason
deccae40e4 btrfs: can_nocow_file_extent should pass down args->strict from callers
Commit 619104ba45 ("btrfs: move common NOCOW checks against a file
extent into a helper") changed our call to btrfs_cross_ref_exist() to
always pass false for the 'strict' parameter.  We're passing this down
through the stack so that we can do a full check for cross references
during swapfile activation.

With strict always false, this test fails:

  btrfs subvol create swappy
  chattr +C swappy
  fallocate -l1G swappy/swapfile
  chmod 600 swappy/swapfile
  mkswap swappy/swapfile

  btrfs subvol snap swappy swapsnap
  btrfs subvol del -C swapsnap

  btrfs fi sync /
  sync;sync;sync

  swapon swappy/swapfile

The fix is to just use args->strict, and everyone except swapfile
activation is passing false.

Fixes: 619104ba45 ("btrfs: move common NOCOW checks against a file extent into a helper")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-13 00:01:08 +02:00
Christoph Hellwig
7833b86595 btrfs: fix iomap_begin length for nocow writes
can_nocow_extent can reduce the len passed in, which needs to be
propagated to btrfs_dio_iomap_begin so that iomap does not submit
more data then is mapped.

This problems exists since the btrfs_get_blocks_direct helper was added
in commit c5794e5178 ("btrfs: Factor out write portion of
btrfs_get_blocks_direct"), but the ordered_extent splitting added in
commit b73a6fd1b1 ("btrfs: split partial dio bios before submit")
added a WARN_ON that made a syzkaller test fail.

Reported-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com
Fixes: c5794e5178 ("btrfs: Factor out write portion of btrfs_get_blocks_direct")
CC: stable@vger.kernel.org # 6.1+
Tested-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-13 00:01:00 +02:00
Linus Torvalds
ace9e12da2 for-6.4-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmSHScAACgkQxWXV+ddt
 WDvoLA/8CxGfC9i/zO2odxbV1id8JiubGyi2Q28ANE3ygwRBI2dh7u2TBTv9aKPF
 Bzm6VsafG2OwMuwu08jO3t98+QrxU9vb6YCzCPL4t+8IDLJhwpz6zdH/Lvl3RnyV
 nz+aKHi2vfTRKt1Cf4uB5dVzPM3QVHYi3vidt15Suf2nhKnXimu0FVGXabQfd44z
 cCE4ep8IkLshcrsEOwVQj44isRXztJza3D6P7zPfu0NB5Bue7VJNBI4JoGOAT8UQ
 8c+V1U6EbMARWcdbk4Vm34IoAAxcQW6MNnHG83+ie2OpuKJ9g7oNXMTPL73gntNr
 DtC38Vr8gbpXJFmqOCwD8+9f3jP2pX6LjJT0IR6eGJbCleWd6JPlvnfJ+QHdb/vE
 LblDjH84O0Js+0iPKOSKzglfrKZPYDEnIBUwbZQICj/8+aHPU1Y4eTRcv52bVnpa
 1umdz19Sjh0HjuX4k44E/fLgGnLw+ezxhe6WQ7RdDrnr4+9tXpz0z/ZsatIgl1Pc
 wfS5Y2XBIdzKBIF8FxAEL3xCXd6byOsMMhSRu6J7W8Tgw5dnvKiQLRCK+FIpBRru
 WZ7vrNKz67marmqcIp0Hpoipd5+ib6pAdZs69GAvk4bWvVoLZ0Vuyb3lQr5fg6Vm
 Xn1iwcYoWjlAYrpVW31dlaVCfoewm96qbzNa3XqA87I/6frGFcc=
 =ABpK
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A  more fixes and regression fixes:

   - in subpage mode, fix crash when repairing metadata at the end of
     a stripe

   - properly enable async discard when remounting from read-only to
     read-write

   - scrub regression fixes:
      - respect read-only scrub when attempting to do a repair
      - fix reporting of found errors, the stats don't get properly
        accounted after a stripe repair"

* tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: scrub: also report errors hit during the initial read
  btrfs: scrub: respect the read-only flag during repair
  btrfs: properly enable async discard when switching from RO->RW
  btrfs: subpage: fix a crash in metadata repair path
2023-06-12 10:53:35 -07:00
Qu Wenruo
79b8ee702c btrfs: scrub: also report errors hit during the initial read
[BUG]
After the recent scrub rework introduced in commit e02ee89baa ("btrfs:
scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure"),
btrfs scrub no longer reports repaired errors any more:

  # mkfs.btrfs -f $dev -d DUP
  # mount $dev $mnt
  # xfs_io -f -d -c "pwrite -b 64K -S 0xaa 0 64" $mnt/file
  # umount $dev
  # xfs_io -f -c "pwrite -S 0xff $phy1 64K" $dev # Corrupt the first mirror
  # mount $dev $mnt
  # btrfs scrub start -BR $mnt
  scrub done for 725e7cb7-8a4a-4c77-9f2a-86943619e218
  Scrub started:    Tue Jun  6 14:56:50 2023
  Status:           finished
  Duration:         0:00:00
  	data_extents_scrubbed: 2
  	tree_extents_scrubbed: 18
  	data_bytes_scrubbed: 131072
  	tree_bytes_scrubbed: 294912
  	read_errors: 0
  	csum_errors: 0 <<< No errors here
  	verify_errors: 0
         [...]
  	uncorrectable_errors: 0
  	unverified_errors: 0
  	corrected_errors: 16		<<< Only corrected errors
  	last_physical: 2723151872

This can confuse btrfs-progs, as it relies on the csum_errors to
determine if there is anything wrong.

While on v6.3.x kernels, the report is different:

 	csum_errors: 16			<<<
 	verify_errors: 0
	[...]
 	uncorrectable_errors: 0
 	unverified_errors: 0
 	corrected_errors: 16 <<<

[CAUSE]
In the reworked scrub, we update the scrub progress inside
scrub_stripe_report_errors(), using various bitmaps to update the
result.

For example for csum_errors, we use bitmap_weight() of
stripe->csum_error_bitmap.

Unfortunately at that stage, all error bitmaps (except
init_error_bitmap) are the result of the latest repair attempt, thus if
the stripe is fully repaired, those error bitmaps will all be empty,
resulting the above output mismatch.

To fix this, record the number of errors into stripe->init_nr_*_errors.
Since we don't really care about where those errors are, we only need to
record the number of errors.

Then in scrub_stripe_report_errors(), use those initial numbers to
update the progress other than using the latest error bitmaps.

Fixes: e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-08 14:34:01 +02:00
Qu Wenruo
1f2030ff6e btrfs: scrub: respect the read-only flag during repair
[BUG]
With recent scrub rework, the scrub operation no longer respects the
read-only flag passed by "-r" option of "btrfs scrub start" command.

  # mkfs.btrfs -f -d raid1 $dev1 $dev2
  # mount $dev1 $mnt
  # xfs_io -f -d -c "pwrite -b 128K -S 0xaa 0 128k" $mnt/file
  # sync
  # xfs_io -c "pwrite -S 0xff $phy1 64k" $dev1
  # xfs_io -c "pwrite -S 0xff $((phy2 + 65536)) 64k" $dev2
  # mount $dev1 $mnt -o ro
  # btrfs scrub start -BrRd $mnt
  Scrub device $dev1 (id 1) done
  Scrub started:    Tue Jun  6 09:59:14 2023
  Status:           finished
  Duration:         0:00:00
         [...]
  	corrected_errors: 16 <<< Still has corrupted sectors
  	last_physical: 1372585984

  Scrub device $dev2 (id 2) done
  Scrub started:    Tue Jun  6 09:59:14 2023
  Status:           finished
  Duration:         0:00:00
         [...]
  	corrected_errors: 16 <<< Still has corrupted sectors
  	last_physical: 1351614464

  # btrfs scrub start -BrRd $mnt
  Scrub device $dev1 (id 1) done
  Scrub started:    Tue Jun  6 10:00:17 2023
  Status:           finished
  Duration:         0:00:00
         [...]
  	corrected_errors: 0 <<< No more errors
  	last_physical: 1372585984

  Scrub device $dev2 (id 2) done
         [...]
  	corrected_errors: 0 <<< No more errors
  	last_physical: 1372585984

[CAUSE]
In the newly reworked scrub code, repair is always submitted no matter
if we're doing a read-only scrub.

[FIX]
Fix it by skipping the write submission if the scrub is a read-only one.

Unfortunately for the report part, even for a read-only scrub we will
still report it as corrected errors, as we know it's repairable, even we
won't really submit the write.

Fixes: e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-08 13:56:38 +02:00
Chris Mason
981a37bab5 btrfs: properly enable async discard when switching from RO->RW
The async discard uses the BTRFS_FS_DISCARD_RUNNING bit in the fs_info
to force discards off when the filesystem has aborted or we're generally
not able to run discards.  This gets flipped on when we're mounted rw,
and also when we go from ro->rw.

Commit 63a7cb1307 ("btrfs: auto enable discard=async when possible")
enabled async discard by default, and this meant
"mount -o ro /dev/xxx /yyy" had async discards turned on.

Unfortunately, this meant our check in btrfs_remount_cleanup() would see
that discards are already on:

    /* If we toggled discard async */
    if (!btrfs_raw_test_opt(old_opts, DISCARD_ASYNC) &&
	btrfs_test_opt(fs_info, DISCARD_ASYNC))
	    btrfs_discard_resume(fs_info);

So, we'd never call btrfs_discard_resume() when remounting the root
filesystem from ro->rw.

drgn shows this really nicely:

import os
import sys

from drgn.helpers.linux.fs import path_lookup
from drgn import NULL, Object, Type, cast

def btrfs_sb(sb):
    return cast("struct btrfs_fs_info *", sb.s_fs_info)

if len(sys.argv) == 1:
    path = "/"
else:
    path = sys.argv[1]

fs_info = cast("struct btrfs_fs_info *", path_lookup(prog, path).mnt.mnt_sb.s_fs_info)

BTRFS_FS_DISCARD_RUNNING = 1 << prog['BTRFS_FS_DISCARD_RUNNING']
if fs_info.flags & BTRFS_FS_DISCARD_RUNNING:
    print("discard running flag is on")
else:
    print("discard running flag is off")

[root]# mount | grep nvme
/dev/nvme0n1p3 on / type btrfs
(rw,relatime,compress-force=zstd:3,ssd,discard=async,space_cache=v2,subvolid=5,subvol=/)

[root]# ./discard_running.drgn
discard running flag is off

[root]# mount -o remount,discard=sync /
[root]# mount -o remount,discard=async /
[root]# ./discard_running.drgn
discard running flag is on

The fix is to call btrfs_discard_resume() when we're going from ro->rw.
It already checks to make sure the async discard flag is on, so it'll do
the right thing.

Fixes: 63a7cb1307 ("btrfs: auto enable discard=async when possible")
CC: stable@vger.kernel.org # 6.3+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-06 19:44:22 +02:00
Qu Wenruo
917ac77846 btrfs: subpage: fix a crash in metadata repair path
[BUG]
Test case btrfs/027 would crash with subpage (64K page size, 4K
sectorsize) with the following dying messages:

  debug: map_length=16384 length=65536 type=metadata|raid6(0x104)
  assertion failed: map_length >= length, in fs/btrfs/volumes.c:8093
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/messages.c:259!
  Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
  Call trace:
   btrfs_assertfail+0x28/0x2c [btrfs]
   btrfs_map_repair_block+0x150/0x2b8 [btrfs]
   btrfs_repair_io_failure+0xd4/0x31c [btrfs]
   btrfs_read_extent_buffer+0x150/0x16c [btrfs]
   read_tree_block+0x38/0xbc [btrfs]
   read_tree_root_path+0xfc/0x1bc [btrfs]
   btrfs_get_root_ref.part.0+0xd4/0x3a8 [btrfs]
   open_ctree+0xa30/0x172c [btrfs]
   btrfs_mount_root+0x3c4/0x4a4 [btrfs]
   legacy_get_tree+0x30/0x60
   vfs_get_tree+0x28/0xec
   vfs_kern_mount.part.0+0x90/0xd4
   vfs_kern_mount+0x14/0x28
   btrfs_mount+0x114/0x418 [btrfs]
   legacy_get_tree+0x30/0x60
   vfs_get_tree+0x28/0xec
   path_mount+0x3e0/0xb64
   __arm64_sys_mount+0x200/0x2d8
   invoke_syscall+0x48/0x114
   el0_svc_common.constprop.0+0x60/0x11c
   do_el0_svc+0x38/0x98
   el0_svc+0x40/0xa8
   el0t_64_sync_handler+0xf4/0x120
   el0t_64_sync+0x190/0x194
  Code: aa0403e2 b0fff060 91010000 959c2024 (d4210000)

[CAUSE]
In btrfs/027 we test RAID6 with missing devices, in this particular
case, we're repairing a metadata at the end of a data stripe.

But at btrfs_repair_io_failure(), we always pass a full PAGE for repair,
and for subpage case this can cross stripe boundary and lead to the
above BUG_ON().

This metadata repair code is always there, since the introduction of
subpage support, but this can trigger BUG_ON() after the bio split
ability at btrfs_map_bio().

[FIX]
Instead of passing the old PAGE_SIZE, we calculate the correct length
based on the eb size and page size for both regular and subpage cases.

CC: stable@vger.kernel.org # 6.3+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-05 19:21:57 +02:00
Linus Torvalds
e0178b546d for-6.4-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmR6NA4ACgkQxWXV+ddt
 WDuySw//TLkn3Q2UXZrxbcC9npTvVtIl8bm/UeRNY14Q4/ImC/HHNgAmIlO33J0c
 6/kqoujHLkXWhOyLME9QfqgMwhOEWz1kluU6vXpNQ0i3CE/4T9jceAphqxLcLhjr
 TtnV5SkGbgs+tsAyADfoFB/659JNo+zC4ZN1tSa/TFoZ7xbx7CkCGaAt4V8kkrQw
 BdcKMHBoN9CJE3waatAEcZPqUobEi0Wc+3W38fNOmFJoo3CQXobc5Rb5+1dEOy2G
 nEdfe/HUYVfT4PaSHS4ollQ2ajG+BXOOjd2X4ux2w7dk3iSkcIJFSu942vdtgM6Y
 ygeuhd4cZu6VCYN7lz0qbl8+t5rcRgErKMT5KiJ9fFQ7JDgRGTb6Mr+loPzxlbZ0
 bOgXvqb4mCNrPiQjzuNqUnr5AzD0X2ObTX0g9IsInJaiH7BtGRwBL/FWeX2XMxLQ
 SKBnFETJ1kqxg5/0YY1a9rCfciiDrSOZ1YgY74CEOh/JsJA+4fwx6ojV7uAdnGTg
 hjPhmwK3PjgjvoYcUEN7hIini2mSqyyw9+QynZ611HHV8dy2z4fG0xoubO2cUWsP
 e8JizBiUZWiVqj7UHXvLD7XkDFBJDXjD6iTopaZVz6ae4w4S9Dn3QroNvWshWmGC
 suukX3ZFASpeIJlftrrTzf1r8zvyfgGbS7sZ6ZwhIRx3wr1FFZw=
 =O3yC
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "One regression fix.

  The rewrite of scrub code in 6.4 broke device replace in zoned mode,
  some of the writes could happen out of order so this had to be
  adjusted for all cases"

* tag 'for-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: fix dev-replace after the scrub rework
2023-06-02 17:16:19 -04:00
Qu Wenruo
b675df0257 btrfs: zoned: fix dev-replace after the scrub rework
[BUG]
After commit e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror()
to scrub_stripe infrastructure"), scrub no longer works for zoned device
at all.

Even an empty zoned btrfs cannot be replaced:

  # mkfs.btrfs -f /dev/nvme0n1
  # mount /dev/nvme0n1 /mnt/btrfs
  # btrfs replace start -Bf 1 /dev/nvme0n2 /mnt/btrfs
  Resetting device zones /dev/nvme1n1 (160 zones) ...
  ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt/btrfs/": Input/output error

And we can hit kernel crash related to that:

  BTRFS info (device nvme1n1): host-managed zoned block device /dev/nvme3n1, 160 zones of 134217728 bytes
  BTRFS info (device nvme1n1): dev_replace from /dev/nvme2n1 (devid 2) to /dev/nvme3n1 started
  nvme3n1: Zone Management Append(0x7d) @ LBA 65536, 4 blocks, Zone Is Full (sct 0x1 / sc 0xb9) DNR
  I/O error, dev nvme3n1, sector 786432 op 0xd:(ZONE_APPEND) flags 0x4000 phys_seg 3 prio class 2
  BTRFS error (device nvme1n1): bdev /dev/nvme3n1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
  BUG: kernel NULL pointer dereference, address: 00000000000000a8
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
  RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40
  Call Trace:
   <IRQ>
   btrfs_lookup_ordered_extent+0x31/0x190
   btrfs_record_physical_zoned+0x18/0x40
   btrfs_simple_end_io+0xaf/0xc0
   blk_update_request+0x153/0x4c0
   blk_mq_end_request+0x15/0xd0
   nvme_poll_cq+0x1d3/0x360
   nvme_irq+0x39/0x80
   __handle_irq_event_percpu+0x3b/0x190
   handle_irq_event+0x2f/0x70
   handle_edge_irq+0x7c/0x210
   __common_interrupt+0x34/0xa0
   common_interrupt+0x7d/0xa0
   </IRQ>
   <TASK>
   asm_common_interrupt+0x22/0x40

[CAUSE]
Dev-replace reuses scrub code to iterate all extents and write the
existing content back to the new device.

And for zoned devices, we call fill_writer_pointer_gap() to make sure
all the writes into the zoned device is sequential, even if there may be
some gaps between the writes.

However we have several different bugs all related to zoned dev-replace:

- We are using ZONE_APPEND operation for metadata style write back
  For zoned devices, btrfs has two ways to write data:

  * ZONE_APPEND for data
    This allows higher queue depth, but will not be able to know where
    the write would land.
    Thus needs to grab the real on-disk physical location in it's endio.

  * WRITE for metadata
    This requires single queue depth (new writes can only be submitted
    after previous one finished), and all writes must be sequential.

  For scrub, we go single queue depth, but still goes with ZONE_APPEND,
  which requires btrfs_bio::inode being populated.
  This is the cause of that crash.

- No correct tracing of write_pointer
  After a write finished, we should forward sctx->write_pointer, or
  fill_writer_pointer_gap() would not work properly and cause more
  than necessary zero out, and fill the whole zone prematurely.

- Incorrect physical bytenr passed to fill_writer_pointer_gap()
  In scrub_write_sectors(), one call site passes logical address, which
  is completely wrong.

  The other call site passes physical address of current sector, but
  we should pass the physical address of the btrfs_bio we're submitting.

  This is the cause of the -EIO errors.

[FIX]
- Do not use ZONE_APPEND for btrfs_submit_repair_write().

- Manually forward sctx->write_pointer after successful writeback

- Use the physical address of the to-be-submitted btrfs_bio for
  fill_writer_pointer_gap()

Now zoned device replace would work as expected.

Reported-by: Christoph Hellwig <hch@lst.de>
Fixes: e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-01 15:12:02 +02:00
Linus Torvalds
48b1320a67 for-6.4-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmR2TDwACgkQxWXV+ddt
 WDsMvQ/+KgUXW+Liu5BaOyD5UzPL4BgHWiPTmJyRpsWTkGm8LE/yRCRoxqp1XbU+
 nOjQpjkxI+ziRgKpDTAGFK/w51TV9ECM5wyZiXx93TO6iaTOuYCtSnSsWylzEC1H
 q9I3znLJSWrnBPTktwTZ29rvKvXj1k3th8ypyI9ho7N+3H0Uzt2VIPxrH2oVXZNz
 f2vkjSX9pKGN5zxM2ahd3Nde4Ma6yAlJLD+pnlYK20zH/30cAXdJsUCsUqQLXDL1
 sUR++Br7qym3Wqn9Qa5R71IPJ1FieW2NaHgAz4dBBFfqe5PR7YCGL/Md6G+CFJ1E
 qLLFOWpELpqkeQdvivBnMZWqgpw+54Pdfuqxg7VylEmUc1y6CK4ab5XctpXIf75h
 6bK0RPZ7D9jZl6JukkWftoS4XnW2cseyEfHneDMZDty4v1bxwR6g7i4ZTym413Gx
 Td1Z+G6BN5O5ih0Pc0CgSS3QnndWTUl3LAHiuxRErrK4dxpeuQlDTGWWY7YVyRPJ
 O9yC24GbHyWYBYHtNACEn6/GlXQjtswhjlHxqONmQfnstZL7Fz8si9EQEOWwssJE
 PIlb022a1mvR42yHr64TE0SzpDZbMY8mnULAsSrWgPXh3IAt1ztUuJajcFs84MZr
 qWewi4F/3wDAB0m1lUbAOmeBbpAw5gSGHhwBrjdK3EWJr2kxQ50=
 =viyP
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "One bug fix and two build warning fixes:

   - call proper end bio callback for metadata RAID0 in a rare case of
     an unaligned block

   - fix uninitialized variable (reported by gcc 10.2)

   - fix warning about potential access beyond array bounds on mips64
     with 64k pages (runtime check would not allow that)"

* tag 'for-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix csum_tree_block page iteration to avoid tripping on -Werror=array-bounds
  btrfs: fix an uninitialized variable warning in btrfs_log_inode
  btrfs: call btrfs_orig_bbio_end_io in btrfs_end_bio_work
2023-05-30 17:23:50 -04:00
pengfuyuan
5ad9b4719f btrfs: fix csum_tree_block page iteration to avoid tripping on -Werror=array-bounds
When compiling on a MIPS 64-bit machine we get these warnings:

    In file included from ./arch/mips/include/asm/cacheflush.h:13,
	             from ./include/linux/cacheflush.h:5,
	             from ./include/linux/highmem.h:8,
		     from ./include/linux/bvec.h:10,
		     from ./include/linux/blk_types.h:10,
                     from ./include/linux/blkdev.h:9,
	             from fs/btrfs/disk-io.c:7:
    fs/btrfs/disk-io.c: In function ‘csum_tree_block’:
    fs/btrfs/disk-io.c💯34: error: array subscript 1 is above array bounds of ‘struct page *[1]’ [-Werror=array-bounds]
      100 |   kaddr = page_address(buf->pages[i]);
          |                        ~~~~~~~~~~^~~
    ./include/linux/mm.h:2135:48: note: in definition of macro ‘page_address’
     2135 | #define page_address(page) lowmem_page_address(page)
          |                                                ^~~~
    cc1: all warnings being treated as errors

We can check if i overflows to solve the problem. However, this doesn't make
much sense, since i == 1 and num_pages == 1 doesn't execute the body of the loop.
In addition, i < num_pages can also ensure that buf->pages[i] will not cross
the boundary. Unfortunately, this doesn't help with the problem observed here:
gcc still complains.

To fix this add a compile-time condition for the extent buffer page
array size limit, which would eventually lead to eliminating the whole
for loop.

CC: stable@vger.kernel.org # 5.10+
Signed-off-by: pengfuyuan <pengfuyuan@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-26 23:24:55 +02:00
Shida Zhang
8fd9f4232d btrfs: fix an uninitialized variable warning in btrfs_log_inode
This fixes the following warning reported by gcc 10.2.1 under x86_64:

../fs/btrfs/tree-log.c: In function ‘btrfs_log_inode’:
../fs/btrfs/tree-log.c:6211:9: error: ‘last_range_start’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
 6211 |   ret = insert_dir_log_key(trans, log, path, key.objectid,
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 6212 |       first_dir_index, last_dir_index);
      |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../fs/btrfs/tree-log.c:6161:6: note: ‘last_range_start’ was declared here
 6161 |  u64 last_range_start;
      |      ^~~~~~~~~~~~~~~~

This might be a false positive fixed in later compiler versions but we
want to have it fixed.

Reported-by: k2ci <kernel-bot@kylinos.cn>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-26 23:24:04 +02:00
Christoph Hellwig
45c2f36871 btrfs: call btrfs_orig_bbio_end_io in btrfs_end_bio_work
When I implemented the storage layer bio splitting, I was under the
assumption that we'll never split metadata bios.  But Qu reminded me that
this can actually happen with very old file systems with unaligned
metadata chunks and RAID0.

I still haven't seen such a case in practice, but we better handled this
case, especially as it is fairly easily to do not calling the ->end_іo
method directly in btrfs_end_io_work, and using the proper
btrfs_orig_bbio_end_io helper instead.

In addition to the old file system with unaligned metadata chunks case
documented in the commit log, the combination of the new scrub code
with Johannes pending raid-stripe-tree also triggers this case.  We
spent some time debugging it and found that this patch solves
the problem.

Fixes: 103c19723c ("btrfs: split the bio submission path into a separate file")
CC: stable@vger.kernel.org # 6.3+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-26 23:24:01 +02:00
Linus Torvalds
b158dd941b for-6.4-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmRwqRsACgkQxWXV+ddt
 WDuCPQ//T8JVY6usnGF/Fw/3zbtDNvrdQLDfp3HovIg7gmLIBda0bT05w4Q46FUU
 l4BV0bHyTUNWPlXUmrrSmt8HipRe2z4Wjwc16azdLmSs5zf0FO1LbsCKDmM8Ncid
 LTi2jzyyb3E44ZzC/i7RCaBt+vYRb2ZmtZ/glh3K4H0GgTAYl1GxZoAoYgBnvmlG
 nvmlWWDaM2cRKaUREm75il37LKLIlW5jvdUFQrqwWNgUH72ay5/7SZxHywlk8x6b
 qwhhp+s6bMUNzi6CqE2SLnESjI9yl0l/0gLebhDXVulo0BiCrti+YLpueP4eQs1B
 yYXX3PvHOXhoN4tUQ4yDF9G57To4Gw1aiQOnWOOLcbyGG1ZgyekpoRRXh6r74LKt
 FDyWT+u/xd78by1km3VzqmvKtqHnRFNMYfP+MMDIhyhy5prKCWeVo7bC+2FP+89o
 kv9+0Z0w0lkLycFfLaewZkEv0/WY8GMuT7kptHQ2Ao6ulAvG+j97sgVBFGXJjeCr
 B1OAGdeTF79IV139bCxPA62cat87Zrh15mZN+y7U32Vs2JkOqbT0LTQGKoVs/TCI
 AyHCDb8oOfGiebibnEDrDNtubz7NFCq4ntZRmuv5FJ+l2d1wl6ZvsI+DoYP7Zide
 DLR7ZtPs1Yvm27xDjs+fVmMx4nuNGikEbPZPxJro1CjLVzCEt7k=
 =elHB
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - handle memory allocation error in checksumming helper (reported by
   syzbot)

 - fix lockdep splat when aborting a transaction, add NOFS protection
   around invalidate_inode_pages2 that could allocate with GFP_KERNEL

 - reduce chances to hit an ENOSPC during scrub with RAID56 profiles

* tag 'for-6.4-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: use nofs when cleaning up aborted transactions
  btrfs: handle memory allocation failure in btrfs_csum_one_bio
  btrfs: scrub: try harder to mark RAID56 block groups read-only
2023-05-26 13:21:38 -07:00
Josef Bacik
597441b343 btrfs: use nofs when cleaning up aborted transactions
Our CI system caught a lockdep splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  6.3.0-rc7+ #1167 Not tainted
  ------------------------------------------------------
  kswapd0/46 is trying to acquire lock:
  ffff8c6543abd650 (sb_internal#2){++++}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x5f/0x120

  but task is already holding lock:
  ffffffffabe61b40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x4aa/0x7a0

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (fs_reclaim){+.+.}-{0:0}:
	 fs_reclaim_acquire+0xa5/0xe0
	 kmem_cache_alloc+0x31/0x2c0
	 alloc_extent_state+0x1d/0xd0
	 __clear_extent_bit+0x2e0/0x4f0
	 try_release_extent_mapping+0x216/0x280
	 btrfs_release_folio+0x2e/0x90
	 invalidate_inode_pages2_range+0x397/0x470
	 btrfs_cleanup_dirty_bgs+0x9e/0x210
	 btrfs_cleanup_one_transaction+0x22/0x760
	 btrfs_commit_transaction+0x3b7/0x13a0
	 create_subvol+0x59b/0x970
	 btrfs_mksubvol+0x435/0x4f0
	 __btrfs_ioctl_snap_create+0x11e/0x1b0
	 btrfs_ioctl_snap_create_v2+0xbf/0x140
	 btrfs_ioctl+0xa45/0x28f0
	 __x64_sys_ioctl+0x88/0xc0
	 do_syscall_64+0x38/0x90
	 entry_SYSCALL_64_after_hwframe+0x72/0xdc

  -> #0 (sb_internal#2){++++}-{0:0}:
	 __lock_acquire+0x1435/0x21a0
	 lock_acquire+0xc2/0x2b0
	 start_transaction+0x401/0x730
	 btrfs_commit_inode_delayed_inode+0x5f/0x120
	 btrfs_evict_inode+0x292/0x3d0
	 evict+0xcc/0x1d0
	 inode_lru_isolate+0x14d/0x1e0
	 __list_lru_walk_one+0xbe/0x1c0
	 list_lru_walk_one+0x58/0x80
	 prune_icache_sb+0x39/0x60
	 super_cache_scan+0x161/0x1f0
	 do_shrink_slab+0x163/0x340
	 shrink_slab+0x1d3/0x290
	 shrink_node+0x300/0x720
	 balance_pgdat+0x35c/0x7a0
	 kswapd+0x205/0x410
	 kthread+0xf0/0x120
	 ret_from_fork+0x29/0x50

  other info that might help us debug this:

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(fs_reclaim);
				 lock(sb_internal#2);
				 lock(fs_reclaim);
    lock(sb_internal#2);

   *** DEADLOCK ***

  3 locks held by kswapd0/46:
   #0: ffffffffabe61b40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x4aa/0x7a0
   #1: ffffffffabe50270 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x113/0x290
   #2: ffff8c6543abd0e0 (&type->s_umount_key#44){++++}-{3:3}, at: super_cache_scan+0x38/0x1f0

  stack backtrace:
  CPU: 0 PID: 46 Comm: kswapd0 Not tainted 6.3.0-rc7+ #1167
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x58/0x90
   check_noncircular+0xd6/0x100
   ? save_trace+0x3f/0x310
   ? add_lock_to_list+0x97/0x120
   __lock_acquire+0x1435/0x21a0
   lock_acquire+0xc2/0x2b0
   ? btrfs_commit_inode_delayed_inode+0x5f/0x120
   start_transaction+0x401/0x730
   ? btrfs_commit_inode_delayed_inode+0x5f/0x120
   btrfs_commit_inode_delayed_inode+0x5f/0x120
   btrfs_evict_inode+0x292/0x3d0
   ? lock_release+0x134/0x270
   ? __pfx_wake_bit_function+0x10/0x10
   evict+0xcc/0x1d0
   inode_lru_isolate+0x14d/0x1e0
   __list_lru_walk_one+0xbe/0x1c0
   ? __pfx_inode_lru_isolate+0x10/0x10
   ? __pfx_inode_lru_isolate+0x10/0x10
   list_lru_walk_one+0x58/0x80
   prune_icache_sb+0x39/0x60
   super_cache_scan+0x161/0x1f0
   do_shrink_slab+0x163/0x340
   shrink_slab+0x1d3/0x290
   shrink_node+0x300/0x720
   balance_pgdat+0x35c/0x7a0
   kswapd+0x205/0x410
   ? __pfx_autoremove_wake_function+0x10/0x10
   ? __pfx_kswapd+0x10/0x10
   kthread+0xf0/0x120
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x29/0x50
   </TASK>

This happens because when we abort the transaction in the transaction
commit path we call invalidate_inode_pages2_range on our block group
cache inodes (if we have space cache v1) and any delalloc inodes we may
have.  The plain invalidate_inode_pages2_range() call passes through
GFP_KERNEL, which makes sense in most cases, but not here.  Wrap these
two invalidate callees with memalloc_nofs_save/memalloc_nofs_restore to
make sure we don't end up with the fs reclaim dependency under the
transaction dependency.

CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-17 13:08:28 +02:00
Johannes Thumshirn
806570c0bb btrfs: handle memory allocation failure in btrfs_csum_one_bio
Since f8a53bb58e ("btrfs: handle checksum generation in the storage
layer") the failures of btrfs_csum_one_bio() are handled via
bio_end_io().

This means, we can return BLK_STS_RESOURCE from btrfs_csum_one_bio() in
case the allocation of the ordered sums fails.

This also fixes a syzkaller report, where injecting a failure into the
kvzalloc() call results in a BUG_ON().

Reported-by: syzbot+d8941552e21eac774778@syzkaller.appspotmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-17 13:08:28 +02:00
Qu Wenruo
7561551e7b btrfs: scrub: try harder to mark RAID56 block groups read-only
Currently we allow a block group not to be marked read-only for scrub.

But for RAID56 block groups if we require the block group to be
read-only, then we're allowed to use cached content from scrub stripe to
reduce unnecessary RAID56 reads.

So this patch would:

- Make btrfs_inc_block_group_ro() try harder
  During my tests, for cases like btrfs/061 and btrfs/064, we can hit
  ENOSPC from btrfs_inc_block_group_ro() calls during scrub.

  The reason is if we only have one single data chunk, and trying to
  scrub it, we won't have any space left for any newer data writes.

  But this check should be done by the caller, especially for scrub
  cases we only temporarily mark the chunk read-only.
  And newer data writes would always try to allocate a new data chunk
  when needed.

- Return error for scrub if we failed to mark a RAID56 chunk read-only

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-17 11:59:46 +02:00
Linus Torvalds
76c7f8873a for-6.4-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmRebDIACgkQxWXV+ddt
 WDu3vA//RNyRGjEz0HgfhTc1119DXJLwK6j544waYLrzRcMtBK4xKByiaFkAA4tL
 PQidGX+nAQPm+pZl0jcK30cBMObik5GXJwoSOZGl7/ectx4O7aFfXqiSfwPTyqZU
 3fTavoqoJxbxJCVbifcXOPNhsUxMlEGYJmA3CVRsllLviXY+3HMpX2ZpWZ7vch+N
 MLENNBfUo1HVdWaxOYfQif/qT5iR9G7D8dBjX9DUK0kVwrbwBB0rolJy4fPrY6z5
 gBLED9Ks3FBgyU3mYq4qrfPmbfF8mPiaU0+1j+B46vw3PdPtIwjIForR+91GsZ1v
 iHojbykf6VWTQV+gO78mgv4O4vRtn3C+UJaGxLL86OMOaiQQHFYdSETn9arPmoho
 p1wCBidI82tvfIOGYXgrTGorLN27hhyPJinHe/2Bqo+1wUL8/J8mwCWunIox7a8z
 rxO5QhDIDFX7gamsvYjkW3tBkYuGiGvBjx+Ic2cBHTkVp9wSPL9PCvqNNru2qexA
 t0BpAL9DxvN+T1xO1thC3qsm2Ogx0QEmgdDfRglbEVASnRZKZZsJEMO90FzFbkFg
 vLbs0KnT7yS7mTwq4NklDrgHZ0eiiJLZVCb8bR8xkzVW+ADrUmZuDM8WOcCgJAUp
 fUoMmFsJZi5zsdAOygDWr1bBHorLV5szrY0bSB5L2eHwJjYZ6KE=
 =uWUN
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull more btrfs fixes from David Sterba:

 - fix incorrect number of bitmap entries for space cache if loading is
   interrupted by some error

 - fix backref walking, this breaks a mode of LOGICAL_INO_V2 ioctl that
   is used in deduplication tools

 - zoned mode fixes:
      - properly finish zone reserved for relocation
      - correctly calculate super block zone end on ZNS
      - properly initialize new extent buffer for redirty

 - make mount option clear_cache work with block-group-tree, to rebuild
   free-space-tree instead of temporarily disabling it that would lead
   to a forced read-only mount

 - fix alignment check for offset when printing extent item

* tag 'for-6.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: make clear_cache mount option to rebuild FST without disabling it
  btrfs: zero the buffer before marking it dirty in btrfs_redirty_list_add
  btrfs: zoned: fix full zone super block reading on ZNS
  btrfs: zoned: zone finish data relocation BG with last IO
  btrfs: fix backref walking not returning all inode refs
  btrfs: fix space cache inconsistency after error loading it from disk
  btrfs: print-tree: parent bytenr must be aligned to sector size
2023-05-12 17:10:32 -05:00
Qu Wenruo
1d6a4fc857 btrfs: make clear_cache mount option to rebuild FST without disabling it
Previously clear_cache mount option would simply disable free-space-tree
feature temporarily then re-enable it to rebuild the whole free space
tree.

But this is problematic for block-group-tree feature, as we have an
artificial dependency on free-space-tree feature.

If we go the existing method, after clearing the free-space-tree
feature, we would flip the filesystem to read-only mode, as we detect a
super block write with block-group-tree but no free-space-tree feature.

This patch would change the behavior by properly rebuilding the free
space tree without disabling this feature, thus allowing clear_cache
mount option to work with block group tree.

Now we can mount a filesystem with block-group-tree feature and
clear_mount option:

  $ mkfs.btrfs  -O block-group-tree /dev/test/scratch1  -f
  $ sudo mount /dev/test/scratch1 /mnt/btrfs -o clear_cache
  $ sudo dmesg -t | head -n 5
  BTRFS info (device dm-1): force clearing of disk cache
  BTRFS info (device dm-1): using free space tree
  BTRFS info (device dm-1): auto enabling async discard
  BTRFS info (device dm-1): rebuilding free space tree
  BTRFS info (device dm-1): checking UUID tree

CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-10 14:51:27 +02:00
Christoph Hellwig
c83b56d1dd btrfs: zero the buffer before marking it dirty in btrfs_redirty_list_add
btrfs_redirty_list_add zeroes the buffer data and sets the
EXTENT_BUFFER_NO_CHECK to make sure writeback is fine with a bogus
header.  But it does that after already marking the buffer dirty, which
means that writeback could already be looking at the buffer.

Switch the order of operations around so that the buffer is only marked
dirty when we're ready to write it.

Fixes: d3575156f6 ("btrfs: zoned: redirty released extent buffers")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-10 14:50:29 +02:00
Naohiro Aota
02ca9e6fb5 btrfs: zoned: fix full zone super block reading on ZNS
When both of the superblock zones are full, we need to check which
superblock is newer. The calculation of last superblock position is wrong
as it does not consider zone_capacity and uses the length.

Fixes: 9658b72ef3 ("btrfs: zoned: locate superblock position using zone capacity")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-10 14:50:22 +02:00
Naohiro Aota
f84353c7c2 btrfs: zoned: zone finish data relocation BG with last IO
For data block groups, we zone finish a zone (or, just deactivate it) when
seeing the last IO in btrfs_finish_ordered_io(). That is only called for
IOs using ZONE_APPEND, but we use a regular WRITE command for data
relocation IOs. Detect it and call btrfs_zone_finish_endio() properly.

Fixes: be1a1d7a5d ("btrfs: zoned: finish fully written block group")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-10 14:50:12 +02:00