Commit graph

1171208 commits

Author SHA1 Message Date
Boris Burkov
f263a7c3a5 btrfs: reinterpret async discard iops_limit=0 as no delay
Currently, a limit of 0 results in a hard coded metering over 6 hours.
Since the default is a set limit, I suspect no one truly depends on this
rather arbitrary setting. Repurpose it for an arguably more useful
"unlimited" mode, where the delay is 0.

Note that if block groups are too new, or go fully empty, there is still
a delay associated with those conditions. Those delays implement
heuristics for not trimming a region we are relatively likely to fully
overwrite soon.

CC: stable@vger.kernel.org # 6.2+
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 19:52:19 +02:00
Boris Burkov
cfe3445a58 btrfs: set default discard iops_limit to 1000
Previously, the default was a relatively conservative 10. This results
in a 100ms delay, so with ~300 discards in a commit, it takes the full
30s till the next commit to finish the discards. On a workstation, this
results in the disk never going idle, wasting power/battery, etc.

Set the default to 1000, which results in using the smallest possible
delay, currently, which is 1ms. This has shown to not pathologically
keep the disk busy by the original reporter.

Link: https://lore.kernel.org/linux-btrfs/Y%2F+n1wS%2F4XAH7X1p@nz/
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2182228
CC: stable@vger.kernel.org # 6.2+
Reviewed-by: Neal Gompa <neal@gompa.dev
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 19:52:18 +02:00
Qu Wenruo
aca43fe839 btrfs: remove unused raid56 functions which were dedicated for scrub
Since the scrub rework, the following RAID56 functions are no longer
called:

- raid56_add_scrub_pages()
- raid56_alloc_missing_rbio()
- raid56_submit_missing_rbio()

Those functions are all utilized by scrub to handle missing device cases
for RAID56.

However the new scrub code handle them in a completely different way:

- If it's data stripe, go recovery path through btrfs_submit_bio()
- If it's P/Q stripe, it would be handled through
  raid56_parity_submit_scrub_rbio()
  And that function would handle dev-replace and repair properly.

Thus we can safely remove those functions.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 19:52:18 +02:00
Qu Wenruo
13a62fd997 btrfs: scrub: remove scrub_bio structure
Since scrub path has been fully moved to scrub_stripe based facilities,
no more scrub_bio would be submitted.
Thus we can remove it completely, this involves:

- SCRUB_SECTORS_PER_BIO macro
- SCRUB_BIOS_PER_SCTX macro
- SCRUB_MAX_PAGES macro
- BTRFS_MAX_MIRRORS macro
- scrub_bio structure
- scrub_ctx::bios member
- scrub_ctx::curr member
- scrub_ctx::bios_in_flight member
- scrub_ctx::workers_pending member
- scrub_ctx::list_lock member
- scrub_ctx::list_wait member

- function scrub_bio_end_io_worker()
- function scrub_pending_bio_inc()
- function scrub_pending_bio_dec()
- function scrub_throttle()
- function scrub_submit()

- function scrub_find_csum()
- function drop_csum_range()

- Some unnecessary flush and scrub pauses

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:24 +02:00
Qu Wenruo
001e3fc263 btrfs: scrub: remove scrub_block and scrub_sector structures
Those two structures are used to represent a bunch of sectors for scrub,
but now they are fully replaced by scrub_stripe in one go, so we can
remove them. This involves:

- structure scrub_block
- structure scrub_sector

- structure scrub_page_private
- function attach_scrub_page_private()
- function detach_scrub_page_private()
  Now we no longer need to use page::private to handle subpage.

- function alloc_scrub_block()
- function alloc_scrub_sector()
- function scrub_sector_get_page()
- function scrub_sector_get_page_offset()
- function scrub_sector_get_kaddr()
- function bio_add_scrub_sector()

- function scrub_checksum_data()
- function scrub_checksum_tree_block()
- function scrub_checksum_super()
- function scrub_check_fsid()
- function scrub_block_get()
- function scrub_block_put()
- function scrub_sector_get()
- function scrub_sector_put()
- function scrub_bio_end_io()
- function scrub_block_complete()
- function scrub_add_sector_to_rd_bio()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:24 +02:00
Qu Wenruo
e9255d6c40 btrfs: scrub: remove the old scrub recheck code
The old scrub code has different entrance to verify the content, and
since we have removed the writeback path, now we can start removing the
re-check part, including:

- scrub_recover structure
- scrub_sector::recover member
- function scrub_setup_recheck_block()
- function scrub_recheck_block()
- function scrub_recheck_block_checksum()
- function scrub_repair_block_group_good_copy()
- function scrub_repair_sector_from_good_copy()
- function scrub_is_page_on_raid56()

- function full_stripe_lock()
- function search_full_stripe_lock()
- function get_full_stripe_logical()
- function insert_full_stripe_lock()
- function lock_full_stripe()
- function unlock_full_stripe()
- btrfs_block_group::full_stripe_locks_root member
- btrfs_full_stripe_locks_tree structure
  This infrastructure is to ensure RAID56 scrub is properly handling
  recovery and P/Q scrub correctly.

  This is no longer needed, before P/Q scrub we will wait for all
  the involved data stripes to be scrubbed first, and RAID56 code has
  internal lock to ensure no race in the same full stripe.

- function scrub_print_warning()
- function scrub_get_recover()
- function scrub_put_recover()
- function scrub_handle_errored_block()
- function scrub_setup_recheck_block()
- function scrub_bio_wait_endio()
- function scrub_submit_raid56_bio_wait()
- function scrub_recheck_block_on_raid56()
- function scrub_recheck_block()
- function scrub_recheck_block_checksum()
- function scrub_repair_block_from_good_copy()
- function scrub_repair_sector_from_good_copy()

And two more functions exported temporarily for later cleanup:

- alloc_scrub_sector()
- alloc_scrub_block()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:24 +02:00
Qu Wenruo
16f9399349 btrfs: scrub: remove the old writeback infrastructure
Since the whole scrub path has been switched to scrub_stripe based
solution, the old writeback path can be removed completely, which
involves:

- scrub_ctx::wr_curr_bio member
- scrub_ctx::flush_all_writes member
- function scrub_write_block_to_dev_replace()
- function scrub_write_sector_to_dev_replace()
- function scrub_add_sector_to_wr_bio()
- function scrub_wr_submit()
- function scrub_wr_bio_end_io()
- function scrub_wr_bio_end_io_worker()

And one more function needs to be exported temporarily:

- scrub_sector_get()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:24 +02:00
Qu Wenruo
5dc96f8d5d btrfs: scrub: remove scrub_parity structure
The structure scrub_parity is used to indicate that some extents are
scrubbed for the purpose of RAID56 P/Q scrubbing.

Since the whole RAID56 P/Q scrubbing path has been replaced with new
scrub_stripe infrastructure, and we no longer need to use scrub_parity
to modify the behavior of data stripes, we can remove it completely.

This removal involves:

- scrub_parity_workers
  Now only one worker would be utilized, scrub_workers, to do the read
  and repair.
  All writeback would happen at the main scrub thread.

- scrub_block::sparity member
- scrub_parity structure
- function scrub_parity_get()
- function scrub_parity_put()
- function scrub_free_parity()

- function __scrub_mark_bitmap()
- function scrub_parity_mark_sectors_error()
- function scrub_parity_mark_sectors_data()
  These helpers are no longer needed, scrub_stripe has its bitmaps and
  we can use bitmap helpers to get the error/data status.

- scrub_parity_bio_endio()
- scrub_parity_check_and_repair()
- function scrub_sectors_for_parity()
- function scrub_extent_for_parity()
- function scrub_raid56_data_stripe_for_parity()
- function scrub_raid56_parity()
  The new code would reuse the scrub read-repair and writeback path.
  Just skip the dev-replace phase.
  And scrub_stripe infrastructure allows us to submit and wait for those
  data stripes before scrubbing P/Q, without extra infrastructure.

The following two functions are temporarily exported for later cleanup:

- scrub_find_csum()
- scrub_add_sector_to_rd_bio()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:24 +02:00
Qu Wenruo
1009254bf2 btrfs: scrub: use scrub_stripe to implement RAID56 P/Q scrub
Implement the only missing part for scrub: RAID56 P/Q stripe scrub.

The workflow is pretty straightforward for the new function,
scrub_raid56_parity_stripe():

- Go through the regular scrub path for each data stripe

- Wait for the verification and repair to finish

- Writeback the repaired sectors to data stripes

- Make sure all stripes are properly repaired
  If we have sectors unrepaired, we cannot continue, or we could further
  corrupt the P/Q stripe.

- Submit the rbio for P/Q stripe
  The dev-replace would be handled inside
  raid56_parity_submit_scrub_rbio() path.

- Wait for the above bio to finish

Although the old code is no longer used, we still keep the declaration,
as the cleanup can be several times larger than this patch itself.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:24 +02:00
Qu Wenruo
e02ee89baa btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure
Switch scrub_simple_mirror() to the new scrub_stripe infrastructure.

Since scrub_simple_mirror() is the core part of scrub (only RAID56
P/Q stripes don't utilize it), we can get rid of a big chunk of code,
mostly scrub_extent(), scrub_sectors() and directly called functions.

There is a functionality change:

- Scrub speed throttle now only affects read on the scrubbing device
  Writes (for repair and replace), and reads from other mirrors won't
  be limited by the set limits.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:24 +02:00
Qu Wenruo
54765392a1 btrfs: scrub: introduce helper to queue a stripe for scrub
The new helper, queue_scrub_stripe(), would try to queue a stripe for
scrub.  If all stripes are already in use, we will submit all the
existing ones and wait for them to finish.

Currently we would queue up to 8 stripes, to enlarge the blocksize to
512KiB to improve the performance. Sectors repaired on zoned need to be
relocated instead of in-place fix.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:24 +02:00
Qu Wenruo
0096580713 btrfs: scrub: introduce error reporting functionality for scrub_stripe
The new helper, scrub_stripe_report_errors(), will report the result of
the scrub to system log.

The main reporting is done by introducing a new helper,
scrub_print_common_warning(), which is mostly the same content from
scrub_print_wanring(), but without the need for a scrub_block.

Since we're reporting the errors, it's the perfect time to update the
scrub stats too.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:24 +02:00
Qu Wenruo
058e09e6fe btrfs: scrub: introduce a writeback helper for scrub_stripe
Add a new helper, scrub_write_sectors(), to submit write bios for
specified sectors to the target disk.

There are several differences compared to read path:

- Utilize btrfs_submit_scrub_write()
  Now we still rely on the @mirror_num based writeback, but the
  requirement is also a little different than regular writeback or read,
  thus we have to call btrfs_submit_scrub_write().

- We cannot write the full stripe back
  We can only write the sectors we have.  There will be two call sites
  later, one for repaired sectors, one for all utilized sectors of
  dev-replace.

  Thus the callers should specify their own write_bitmap.

This function only submit the bios, will not wait for them unless for
zoned case.

Caller must explicitly wait for the IO to finish.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Qu Wenruo
9ecb5ef543 btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:

- Wait for the previous submitted read IO to finish

- Verify the contents of the stripe

- Go through the remaining mirrors, using as large blocksize as possible
  At this stage, we just read out all the failed sectors from each
  mirror and re-verify.
  If no more failed sector, we can exit.

- Go through all mirrors again, sector-by-sector
  This time, we read sector by sector, this is to address cases where
  one bad sector mismatches the drive's internal checksum, and cause the
  whole read range to fail.

  We put this recovery method as the last resort, as sector-by-sector
  reading is slow, and reading from other mirrors may have already fixed
  the errors.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Qu Wenruo
97cf8f3754 btrfs: scrub: introduce a helper to verify one scrub_stripe
The new helper, scrub_verify_stripe(), shares the same main workflow of
the old scrub code.

The major differences are:

- How pages/page_offset is grabbed
  Everything can be grabbed from scrub_stripe easily.

- When error report happens
  Currently the helper only verifies the sectors, not really doing any
  error reporting.
  The error reporting would be done after we have done the repair.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Qu Wenruo
a3ddbaebc7 btrfs: scrub: introduce a helper to verify one metadata block
The new helper, scrub_verify_one_metadata(), is almost the same as
scrub_checksum_tree_block().

The difference is in how we grab the pages from other structures.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Qu Wenruo
b979547513 btrfs: scrub: introduce helper to find and fill sector info for a scrub_stripe
The new helper will search the extent tree to find the first extent of a
logical range, then fill the sectors array by two loops:

- Loop 1 to fill common bits and metadata generation

- Loop 2 to fill csum data (only for data bgs)
  This loop will use the new btrfs_lookup_csums_bitmap() to fill
  the full csum buffer, and set scrub_sector_verification::csum.

With all the needed info filled by this function, later we only need to
submit and verify the stripe.

Here we temporarily export the helper to avoid warning on unused static
function.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Qu Wenruo
2af2aaf982 btrfs: scrub: introduce structure for new BTRFS_STRIPE_LEN based interface
This patch introduces the following structures:

- scrub_sector_verification
  Contains all the needed info to verify one sector (data or metadata).

- scrub_stripe
  Contains all needed members (mostly bitmap based) to scrub one stripe
  (with a length of BTRFS_STRIPE_LEN).

The basic idea is, we keep the existing per-device scrub behavior, but
merge all the scrub_bio/scrub_bio into one generic structure, and read
the full BTRFS_STRIPE_LEN stripe on the first try.

This means we will read some sectors which are not scrub target, but
that's fine. At dev-replace time we only writeback the utilized and good
sectors, and for read-repair we only writeback the repaired sectors.

With every read submitted in BTRFS_STRIPE_LEN, the need for complex bio
form shaping would be gone.
Although to get the same performance of the old scrub behavior, we would
need to submit the initial read for two stripes at once.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Qu Wenruo
4886ff7b50 btrfs: introduce a new helper to submit write bio for repair
Both scrub and read-repair are utilizing a special repair writes that:

- Only writes back to a single device
  Even for read-repair on RAID56, we only update the corrupted data
  stripe itself, not triggering the full RMW path.

- Requires a valid @mirror_num
  For RAID56 case, only @mirror_num == 1 is valid.
  For non-RAID56 cases, we need @mirror_num to locate our stripe.

- No data csum generation needed

These two call sites still have some differences though:

- Read-repair goes plain bio
  It doesn't need a full btrfs_bio, and goes submit_bio_wait().

- New scrub repair would go btrfs_bio
  To simplify both read and write path.

So here this patch would:

- Introduce a common helper, btrfs_map_repair_block()
  Due to the single device nature, we can use an on-stack
  btrfs_io_stripe to pass device and its physical bytenr.

- Introduce a new interface, btrfs_submit_repair_bio(), for later scrub
  code
  This is for the incoming scrub code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Qu Wenruo
4317ff0056 btrfs: introduce btrfs_bio::fs_info member
Currently we're doing a lot of work for btrfs_bio:

- Checksum verification for data read bios
- Bio splits if it crosses stripe boundary
- Read repair for data read bios

However for the incoming scrub patches, we don't want this extra
functionality at all, just plain logical + mirror -> physical mapping
ability.

Thus here we do the following changes:

- Introduce btrfs_bio::fs_info
  This is for the new scrub specific btrfs_bio, which would not populate
  btrfs_bio::inode.
  Thus we need such new member to grab a fs_info

  This new member will always be populated.

- Replace @inode argument with @fs_info for btrfs_bio_init() and its
  caller
  Since @inode is no longer a mandatory member, replace it with
  @fs_info, and let involved users populate @inode.

- Skip checksum verification and generation if @bbio->inode is NULL

- Add extra ASSERT()s
  To make sure:

  * bbio->inode is properly set for involved read repair path
  * if @file_offset is set, bbio->inode is also populated

- Grab @fs_info from @bbio directly
  We can no longer go @bbio->inode->root->fs_info, as bbio->inode can be
  NULL. This involves:

  * btrfs_simple_end_io()
  * should_async_write()
  * btrfs_wq_submit_bio()
  * btrfs_use_zone_append()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Qu Wenruo
2a2dc22f7e btrfs: scrub: use dedicated super block verification function to scrub one super block
There is really no need to go through the super complex scrub_sectors()
to just handle super blocks.  Introduce a dedicated function to handle
super block scrubbing.

This new function will introduce a behavior change, instead of using the
complex but concurrent scrub_bio system, here we just go submit-and-wait.

There is really not much sense to care the performance of super block
scrubbing. It only has 3 super blocks at most, and they are all
scattered around the devices already.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Anand Jain
f0bb5474cf btrfs: remove redundant release of btrfs_device::alloc_state
Commit 321f69f86a ("btrfs: reset device back to allocation state when
removing") included adding extent_io_tree_release(&device->alloc_state)
to btrfs_close_one_device(), which had already been called in
btrfs_free_device().

The alloc_state tree (IO_TREE_DEVICE_ALLOC_STATE), is created in
btrfs_alloc_device() and released in btrfs_close_one_device(). Therefore,
the additional call to extent_io_tree_release(&device->alloc_state) in
btrfs_free_device() is unnecessary and can be removed.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Anand Jain
1f16033c99 btrfs: warn for any missed cleanup at btrfs_close_one_device
During my recent search for the root cause of a reported bug, I realized
that it's a good idea to issue a warning for missed cleanup instead of
using debug-only assertions. Since most installations run with debug off,
missed cleanups and premature calls to close could go unnoticed. However,
these issues are serious enough to warrant reporting and fixing.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Christoph Hellwig
7533583e12 libcrc32c: remove crc32c_impl
This was only ever used by btrfs, and the usage just went away.
This effectively reverts df91f56adc ("libcrc32c: Add crc32c_impl
function").

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Christoph Hellwig
6e7a367e1a btrfs: don't print the crc32c implementation at module load time
Btrfs can use various different checksumming algorithms, and prints
the one used for a given file system at mount time.  Don't bother
printing the crc32c implementation at module load time, the information
is available in /sys/fs/btrfs/FSID/checksum.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig
e6b430f817 btrfs: tree-log: factor out a clean_log_buffer helper
The tree-log code has three almost identical copies for the accounting on
an extent_buffer that doesn't need to be written any more.  The only
difference is that walk_down_log_tree passed the bytenr used to find the
buffer instead of extent_buffer.start and calculates the length using the
nodesize, while the other two callers look at the extent_buffer.len
field that must always be equivalent to the nodesize.

Factor the code into a common helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig
2c275afeb6 block: make blkcg_punt_bio_submit optional
Guard all the code to punt bios to a per-cgroup submission helper by a
new CONFIG_BLK_CGROUP_PUNT_BIO symbol that is selected by btrfs.
This way non-btrfs kernel builds don't need to have this code.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig
12be09fe18 block: async_bio_lock does not need to be bh-safe
async_bio_lock is only taken from bio submission and workqueue context,
both are never in bottom halves.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig
3480373ebd btrfs, block: move REQ_CGROUP_PUNT to btrfs
REQ_CGROUP_PUNT is a bit annoying as it is hard to follow and adds
a branch to the bio submission hot path.  To fix this, export
blkcg_punt_bio_submit and let btrfs call it directly.  Add a new
REQ_FS_PRIVATE flag for btrfs to indicate to it's own low-level
bio submission code that a punt to the cgroup submission helper
is required.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig
0a0596fbbe btrfs, mm: remove the punt_to_cgroup field in struct writeback_control
punt_to_cgroup is only used by extent_write_locked_range, but that
function also directly controls the bio flags for the actual submission.
Remove th punt_to_cgroup field, and just set REQ_CGROUP_PUNT directly
in extent_write_locked_range.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig
896d7c1a90 btrfs: also use kthread_associate_blkcg for uncompressible ranges
submit_one_async_extent needs to use submit_one_async_extent no matter
if the range it handles ends up beeing compressed or not as the deadlock
risk due to cgroup thottling is the same.  Call kthread_associate_blkcg
earlier to cover submit_uncompressed_range case as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig
e43a6210b7 btrfs: don't free the async_extent in submit_uncompressed_range
Let submit_one_async_extent, which is the only caller of
submit_uncompressed_range handle freeing of the async_extent in one
central place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig
05d06a5c9d btrfs: move kthread_associate_blkcg out of btrfs_submit_compressed_write
btrfs_submit_compressed_write should not have to care if it is called
from a helper thread or not.  Move the kthread_associate_blkcg handling
into submit_one_async_extent, as that is the one caller that needs it.
Also move the assignment of REQ_CGROUP_PUNT into cow_file_range_async,
as that is the routine that sets up the helper thread offload.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Filipe Manana
0f69d1f4d6 btrfs: correctly calculate delayed ref bytes when starting transaction
When starting a transaction, we are assuming the number of bytes used for
each delayed ref update matches the number of bytes used for each item
update, that is the return value of:

   btrfs_calc_insert_metadata_size(fs_info, num_items)

However that is not correct when we are using the free space tree, as we
need to multiply that value by 2, since delayed ref updates need to modify
the free space tree besides the extent tree.

So fix this by using btrfs_calc_delayed_ref_bytes() to get the correct
number of bytes used for delayed ref updates.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Filipe Manana
e4773b57b8 btrfs: make btrfs_block_rsv_full() check more boolean when starting transaction
When starting a transaction we are comparing the result of a call to
btrfs_block_rsv_full() with 0, but the function returns a boolean. While
in practice it is not incorrect, as 0 is equivalent to false, it makes it
a bit odd and less readable. So update the check to not compare against 0
and instead use the logical not (!) operator.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Boris Burkov
b73a6fd1b1 btrfs: split partial dio bios before submit
If an application is doing direct io to a btrfs file and experiences a
page fault reading from the write buffer, iomap will issue a partial
bio, and allow the fs to keep going. However, there was a subtle bug in
this code path in the btrfs dio iomap implementation that led to the
partial write ending up as a gap in the file's extents and to be read
back as zeros.

The sequence of events in a partial write, lightly summarized and
trimmed down for brevity is as follows:

==== WRITING TASK ====
 btrfs_direct_write
 __iomap_dio_write
 iomap_iter
 btrfs_dio_iomap_begin # create full ordered extent
 iomap_dio_bio_iter
 bio_iov_iter_get_pages # page fault; partial read
 submit_bio # partial bio
 iomap_iter
 btrfs_dio_iomap_end
 btrfs_mark_ordered_io_finished # sets BTRFS_ORDERED_IOERR;
				# submit to finish_ordered_fn wq
 fault_in_iov_iter_readable # btrfs_direct_write detects partial write
 __iomap_dio_write
 iomap_iter
 btrfs_dio_iomap_begin # create second partial ordered extent
 iomap_dio_bio_iter
 bio_iov_iter_get_pages # read all of remainder
 submit_bio # partial bio with all of remainder
 iomap_iter
 btrfs_dio_iomap_end # nothing exciting to do with ordered io

==== DIO ENDIO ====
== FIRST PARTIAL BIO ==
 btrfs_dio_end_io
 btrfs_mark_ordered_io_finished # bytes_left > 0
			        # don't submit to finish_ordered_fn wq
== SECOND PARTIAL BIO ==
 btrfs_dio_end_io
 btrfs_mark_ordered_io_finished # bytes_left == 0
			        # submit to finish_ordered_fn wq

==== BTRFS FINISH ORDERED WQ ====
== FIRST PARTIAL BIO ==
 btrfs_finish_ordered_io # called by dio_iomap_end_io, sees
		         # BTRFS_ORDERED_IOERR, just drops the
		         # ordered_extent
==SECOND PARTIAL BIO==
 btrfs_finish_ordered_io # called by btrfs_dio_end_io, writes out file
		         # extents, csums, etc...

The essence of the problem is that while btrfs_direct_write and iomap
properly interact to submit all the correct bios, there is insufficient
logic in the btrfs dio functions (btrfs_dio_iomap_begin,
btrfs_dio_submit_io, btrfs_dio_end_io, and btrfs_dio_iomap_end) to
ensure that every bio is at least a part of a completed ordered_extent.
And it is completing an ordered_extent that results in crucial
functionality like writing out a file extent for the range.

More specifically, btrfs_dio_end_io treats the ordered extent as
unfinished but btrfs_dio_iomap_end sets BTRFS_ORDERED_IOERR on it.
Thus, the finish io work doesn't result in file extents, csums, etc.
In the aftermath, such a file behaves as though it has a hole in it,
instead of the purportedly written data.

We considered a few options for fixing the bug:

  1. treat the partial bio as if we had truncated the file, which would
     result in properly finishing it.
  2. split the ordered extent when submitting a partial bio.
  3. cache the ordered extent across calls to __iomap_dio_rw in
     iter->private, so that we could reuse it and correctly apply
     several bios to it.

I had trouble with 1, and it felt the most like a hack, so I tried 2
and 3. Since 3 has the benefit of also not creating an extra file
extent, and avoids an ordered extent lookup during bio submission, it
felt like the best option. However, that turned out to re-introduce a
deadlock which this code discarding the ordered_extent between faults
was meant to fix in the first place. (Link to an explanation of the
deadlock below.)

Therefore, go with fix 2, which requires a bit more setup work but fixes
the corruption without introducing the deadlock, which is fundamentally
caused by the ordered extent existing when we attempt to fault in a
range that overlaps with it.

Put succinctly, what this patch does is: when we submit a dio bio, check
if it is partial against the ordered extent stored in dio_data, and if it
is, extract the ordered_extent that matches the bio exactly out of the
larger ordered_extent. Keep the remaining ordered_extent around in dio_data
for cancellation in iomap_end.

Thanks to Josef, Christoph, and Filipe with their help figuring out the
bug and the fix.

Fixes: 51bd9563b6 ("btrfs: fix deadlock due to page faults during direct IO reads and writes")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2169947
Link: https://lore.kernel.org/linux-btrfs/aa1fb69e-b613-47aa-a99e-a0a2c9ed273f@app.fastmail.com/
Link: https://pastebin.com/3SDaH8C6
Link: https://lore.kernel.org/linux-btrfs/20230315195231.GW10580@twin.jikos.cz/T/#t
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
[ hch: refactored the ordered_extent extraction ]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Boris Burkov
f0f5329a00 btrfs: don't split NOCOW extent_maps in btrfs_extract_ordered_extent
NOCOW writes just overwrite an existing extent map, which thus should
not be split in btrfs_extract_ordered_extent.  The NOCOW case can't
currently happen as btrfs_extract_ordered_extent is only used on zoned
devices that do not support NOCOW writes, but this will change soon.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
[ hch: split from a larger patch, wrote a commit log ]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Christoph Hellwig
7edd339c8a btrfs: pass an ordered_extent to btrfs_extract_ordered_extent
To prepare for a new caller that already has the ordered_extent
available, change btrfs_extract_ordered_extent to take an argument
for it.  Add a wrapper for the bio case that still has to do the
lookup (for now).

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Christoph Hellwig
2e38a84bc6 btrfs: simplify extent map splitting and rename split_zoned_em
split_zoned_em is only ever asked to split out the beginning of an extent
map.  Change it to only take a len to split out instead of a pre and post
region.

Also rename the function to split_extent_map as there is nothing zoned
device specific about it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Christoph Hellwig
f0792b792d btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extent
The function btrfs_clone_ordered_extent is very specific to the usage in
btrfs_split_ordered_extent.  Now that only a single call to
btrfs_clone_ordered_extent is left, just fold it into
btrfs_split_ordered_extent to make the operation more clear.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Christoph Hellwig
8f4af4b8e1 btrfs: sink parameter len to btrfs_split_ordered_extent
btrfs_split_ordered_extent is only ever asked to split out the beginning
of an ordered_extent (i.e. post == 0).  Change it to only take a len to
split out, and switch it to allocate the new extent for the beginning,
as that helps with callers that want to keep a pointer to the
ordered_extent that it is stealing from.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Christoph Hellwig
11d33ab6c1 btrfs: simplify splitting logic in btrfs_extract_ordered_extent
btrfs_extract_ordered_extent is always used to split an ordered_extent
and extent_map into two parts, so it doesn't need to deal with a three
way split.

Simplify it by only allowing for a single split point, and always split
out the beginning of the extent, as that is what we'll later need to
be able to hold on to a reference to the original ordered_extent that
the first part is split off for submission.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Christoph Hellwig
e44ca71cfe btrfs: move ordered_extent internal sanity checks into btrfs_split_ordered_extent
Move the three checks that are about ordered extent internal sanity
checking into btrfs_split_ordered_extent instead of doing them in the
higher level btrfs_extract_ordered_extent routine.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Boris Burkov
53f2c20687 btrfs: stash ordered extent in dio_data during iomap dio
While it is not feasible for an ordered extent to survive across the
calls btrfs_direct_write makes into __iomap_dio_rw, it is still helpful
to stash it on the dio_data in between creating it in iomap_begin and
finishing it in either end_io or iomap_end.

The specific use I have in mind is that we can check if a particular bio
is partial in submit_io without unconditionally looking up the ordered
extent. This is a preparatory patch for a later patch which does just
that.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Boris Burkov
8725bddf30 btrfs: pass flags as unsigned long to btrfs_add_ordered_extent
The ordered_extent flags are declared as unsigned long, so pass them as
such to btrfs_add_ordered_extent.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
[ hch: split from a larger patch ]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Boris Burkov
cf6d1aa482 btrfs: add function to create and return an ordered extent
Currently, btrfs_add_ordered_extent allocates a new ordered extent, adds
it to the rb_tree, but doesn't return a referenced pointer to the
caller. There are cases where it is useful for the creator of a new
ordered_extent to hang on to such a pointer, so add a new function
btrfs_alloc_ordered_extent which is the same as
btrfs_add_ordered_extent, except it takes an additional reference count
and returns a pointer to the ordered_extent. Implement
btrfs_add_ordered_extent as btrfs_alloc_ordered_extent followed by
dropping the new reference and handling the IS_ERR case.

The type of flags in btrfs_alloc_ordered_extent and
btrfs_add_ordered_extent is changed from unsigned int to unsigned long
so it's unified with the other ordered extent functions.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Johannes Thumshirn
cf32e41fa5 btrfs: use __bio_add_page to add single a page in rbio_add_io_sector
The btrfs raid56 sector submission code uses bio_add_page() to add a
page to a newly created bio. bio_add_page() can fail, but the return
value is never checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:20 +02:00
Johannes Thumshirn
078e4cf5db btrfs: use __bio_add_page for adding a single page in repair_one_sector
The btrfs repair bio submission code uses bio_add_page() to add a page
to a newly created bio. bio_add_page() can fail, but the return value is
never checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:20 +02:00
Anand Jain
7e812f2054 btrfs: use test_and_clear_bit() in wait_dev_flush()
The function wait_dev_flush() tests for the BTRFS_DEV_STATE_FLUSH_SENT
bit and then clears it separately. Instead, use test_and_clear_bit().
Though we don't need to do the atomic test and clear, it's following a
common pattern.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:20 +02:00
Anand Jain
1b465784dc btrfs: change wait_dev_flush() return type to bool
The flush error code is maintained in btrfs_device::last_flush_error, so
there is no point in returning it in wait_dev_flush() when it is not being
used. Instead, we can return a boolean value.

Note that even though btrfs_device::last_flush_error may not be used, we
will keep it for now.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:20 +02:00