Commit graph

1132 commits

Author SHA1 Message Date
Qu Wenruo
5467abba1c btrfs: move end_io_func argument to btrfs_bio_ctrl structure
For function submit_extent_page() and alloc_new_bio(), we have an
argument @end_io_func to indicate the end io function.

But that function never change inside any call site of them, thus no
need to pass the pointer around everywhere.

There is a better match for the lifespan of all the call sites, as we
have btrfs_bio_ctrl structure, thus we can put the endio function
pointer there, and grab the pointer every time we allocate a new bio.

Also add extra ASSERT()s to make sure every call site of
submit_extent_page() and alloc_new_bio() has properly set the pointer
inside btrfs_bio_ctrl.

This removes one argument from the already long argument list of
submit_extent_page().

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:07 +02:00
Qu Wenruo
209ecde55c btrfs: switch page and disk_bytenr argument position for submit_extent_page()
Normally we put (page, pg_len, pg_offset) arguments together, just like
what __bio_add_page() does.

But in submit_extent_page(), what we got is, (page, disk_bytenr, pg_len,
pg_offset), which sometimes can be confusing.

Change the order to (disk_bytenr, page, pg_len, pg_offset) to make it
to follow the common schema.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:07 +02:00
Qu Wenruo
814b6f9158 btrfs: update the comment for submit_extent_page()
Since commit 390ed29b81 ("btrfs: refactor submit_extent_page() to make
bio and its flag tracing easier"), we are using bio_ctrl structure to
replace some of arguments of submit_extent_page().

But unfortunately that commit didn't update the comment for
submit_extent_page(), thus some arguments are stale like:

- bio_ret
- mirror_num
  Those are all contained in bio_ctrl now.

- prev_bio_flags
  We no longer use this flag to determine if we can merge bios.

Update the comment for submit_extent_page() to keep it up-to-date.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:07 +02:00
Josef Bacik
ee8ba05cbb btrfs: open code and remove btrfs_inode_sectorsize helper
This is defined in btrfs_inode.h, and dereferences btrfs_root and
btrfs_fs_info, both of which aren't defined in btrfs_inode.h.
Additionally, in many places we already have root or fs_info, so this
helper often makes the code harder to read.  So delete the helper and
simply open code it in the few places that we use it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:06 +02:00
Christoph Hellwig
bd86a532b2 btrfs: stop tracking failed reads in the I/O tree
There is a separate I/O failure tree to track the fail reads, so remove
the extra EXTENT_DAMAGED bit in the I/O tree as it's set but never used.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:05 +02:00
Josef Bacik
bd015294af btrfs: replace delete argument with EXTENT_CLEAR_ALL_BITS
Instead of taking up a whole argument to indicate we're clearing
everything in a range, simply add another EXTENT bit to control this,
and then update all the callers to drop this argument from the
clear_extent_bit variants.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:05 +02:00
Josef Bacik
b71fb16b2f btrfs: don't clear CTL bits when trying to release extent state
When trying to release the extent states due to memory pressure we'll
set all the bits except LOCKED, NODATASUM, and DELALLOC_NEW.  This
includes some of the CTL bits, which isn't really a problem but isn't
correct either.  Exclude the CTL bits from this clearing.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:05 +02:00
Josef Bacik
570eb97bac btrfs: unify the lock/unlock extent variants
We have two variants of lock/unlock extent, one set that takes a cached
state, another that does not.  This is slightly annoying, and generally
speaking there are only a few places where we don't have a cached state.
Simplify this by making lock_extent/unlock_extent the only variant and
make it take a cached state, then convert all the callers appropriately.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:05 +02:00
Josef Bacik
dbbf49928f btrfs: remove the wake argument from clear_extent_bits
This is only used in the case that we are clearing EXTENT_LOCKED, so
infer this value from the bits passed in instead of taking it as an
argument.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:04 +02:00
Josef Bacik
e3974c6694 btrfs: move core extent_io_tree functions to extent-io-tree.c
This is still huge, but unfortunately I cannot make it smaller without
renaming tree_search() and changing all the callers to use the new name,
then moving those chunks and then changing the name back.  This feels
like too much churn for code movement, so I've limited this to only
things that called tree_search().  With this patch all of the
extent_io_tree code is now in extent-io-tree.c.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:03 +02:00
Josef Bacik
3883001838 btrfs: move a few exported extent_io_tree helpers to extent-io-tree.c
These are the last few helpers that do not rely on tree_search() and
who's other helpers are exported and in extent-io-tree.c already.  Move
these across now in order to make the core move smaller.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:03 +02:00
Josef Bacik
04eba89323 btrfs: temporarily export and then move extent state helpers
In order to avoid moving all of the related code at once temporarily
export all of the extent state related helpers.  Then move these helpers
into extent-io-tree.c.  We will clean up the exports and make them
static in followup patches.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:03 +02:00
Josef Bacik
91af24e484 btrfs: temporarily export and move core extent_io_tree tree functions
A lot of the various internals of extent_io_tree call these two
functions for insert or searching the rb tree for entries, so
temporarily export them and then move them to extent-io-tree.c.  We
can't move tree_search() without renaming it, and I don't want to
introduce a bunch of churn just to do that, so move these functions
first and then we can move a few big functions and then the remaining
users of tree_search().

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:03 +02:00
Josef Bacik
6962541e96 btrfs: move btrfs_debug_check_extent_io_range into extent-io-tree.c
This helper is used by a lot of the core extent_io_tree helpers, so
temporarily export it and move it into extent-io-tree.c in order to make
it straightforward to migrate the helpers in batches.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:03 +02:00
Josef Bacik
ec39e39bbf btrfs: export wait_extent_bit
This is used by the subpage code in addition to lock_extent_bits, so
export it so we can move it out of extent_io.c

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:03 +02:00
Josef Bacik
a66318872c btrfs: move simple extent bit helpers out of extent_io.c
These are just variants and wrappers around the actual work horses of
the extent state.  Extract these out of extent_io.c.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:03 +02:00
Josef Bacik
ad79532957 btrfs: convert BUG_ON(EXTENT_BIT_LOCKED) checks to ASSERT's
We only call these functions from the qgroup code which doesn't call
with EXTENT_BIT_LOCKED.  These are BUG_ON()'s that exist to keep us
developers from using these functions with EXTENT_BIT_LOCKED, so convert
them to ASSERT()'s.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:03 +02:00
Josef Bacik
83cf709a89 btrfs: move extent state init and alloc functions to their own file
Start cleaning up extent_io.c by moving the extent state code out of it.
This patch starts with the extent state allocation code and the
extent_io_tree init code.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:03 +02:00
Josef Bacik
c45379a20f btrfs: temporarily export alloc_extent_state helpers
We're going to move this code in stages, but while we're doing that we
need to export these helpers so we can more easily move the code into
the new file.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:03 +02:00
Josef Bacik
a40246e8af btrfs: separate out the eb and extent state leak helpers
Currently we have the add/del functions generic so that we can use them
for both extent buffers and extent states.  We want to separate this
code however, so separate these helpers into per-object helpers in
anticipation of the split.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:02 +02:00
Josef Bacik
a62a3bd954 btrfs: separate out the extent state and extent buffer init code
In order to help separate the extent buffer from the extent io tree code
we need to break up the init functions.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:02 +02:00
Josef Bacik
cdca85b092 btrfs: use find_first_extent_bit in btrfs_clean_io_failure
Currently we're using find_first_extent_bit_state to check if our state
contains the given failrec range, however this is more of an internal
extent_io_tree helper, and is technically unsafe to use because we're
accessing the state outside of the extent_io_tree lock.

Instead use the normal helper find_first_extent_bit which returns the
range of the extent state we find in find_first_extent_bit_state and use
that to do our sanity checking.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:02 +02:00
Josef Bacik
87c11705cc btrfs: convert the io_failure_tree to a plain rb_tree
We still have this oddity of stashing the io_failure_record in the
extent state for the io_failure_tree, which is leftover from when we
used to stuff private pointers in extent_io_trees.

However this doesn't make a lot of sense for the io failure records, we
can simply use a normal rb_tree for this.  This will allow us to further
simplify the extent_io_tree code by removing the io_failure_rec pointer
from the extent state.

Convert the io_failure_tree to an rb tree + spinlock in the inode, and
then use our rb tree simple helpers to insert and find failed records.
This greatly cleans up this code and makes it easier to separate out the
extent_io_tree code.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:02 +02:00
Josef Bacik
a206174805 btrfs: unexport internal failrec functions
These are internally used functions and are not used outside of
extent_io.c.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:02 +02:00
Josef Bacik
0d0a762c41 btrfs: rename clean_io_failure and remove extraneous args
This is exported, so rename it to btrfs_clean_io_failure.  Additionally
we are passing in the io tree's and such from the inode, so instead of
doing all that simply pass in the inode itself and get all the
components we need directly inside of btrfs_clean_io_failure.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:02 +02:00
Filipe Manana
ac3c0d36a2 btrfs: make fiemap more efficient and accurate reporting extent sharedness
The current fiemap implementation does not scale very well with the number
of extents a file has. This is both because the main algorithm to find out
the extents has a high algorithmic complexity and because for each extent
we have to check if it's shared. This second part, checking if an extent
is shared, is significantly improved by the two previous patches in this
patchset, while the first part is improved by this specific patch. Every
now and then we get reports from users mentioning fiemap is too slow or
even unusable for files with a very large number of extents, such as the
two recent reports referred to by the Link tags at the bottom of this
change log.

To understand why the part of finding which extents a file has is very
inefficient, consider the example of doing a full ranged fiemap against
a file that has over 100K extents (normal for example for a file with
more than 10G of data and using compression, which limits the extent size
to 128K). When we enter fiemap at extent_fiemap(), the following happens:

1) Before entering the main loop, we call get_extent_skip_holes() to get
   the first extent map. This leads us to btrfs_get_extent_fiemap(), which
   in turn calls btrfs_get_extent(), to find the first extent map that
   covers the file range [0, LLONG_MAX).

   btrfs_get_extent() will first search the inode's extent map tree, to
   see if we have an extent map there that covers the range. If it does
   not find one, then it will search the inode's subvolume b+tree for a
   fitting file extent item. After finding the file extent item, it will
   allocate an extent map, fill it in with information extracted from the
   file extent item, and add it to the inode's extent map tree (which
   requires a search for insertion in the tree).

2) Then we enter the main loop at extent_fiemap(), emit the details of
   the extent, and call again get_extent_skip_holes(), with a start
   offset matching the end of the extent map we previously processed.

   We end up at btrfs_get_extent() again, will search the extent map tree
   and then search the subvolume b+tree for a file extent item if we could
   not find an extent map in the extent tree. We allocate an extent map,
   fill it in with the details in the file extent item, and then insert
   it into the extent map tree (yet another search in this tree).

3) The second step is repeated over and over, until we have processed the
   whole file range. Each iteration ends at btrfs_get_extent(), which
   does a red black tree search on the extent map tree, then searches the
   subvolume b+tree, allocates an extent map and then does another search
   in the extent map tree in order to insert the extent map.

   In the best scenario we have all the extent maps already in the extent
   tree, and so for each extent we do a single search on a red black tree,
   so we have a complexity of O(n log n).

   In the worst scenario we don't have any extent map already loaded in
   the extent map tree, or have very few already there. In this case the
   complexity is much higher since we do:

   - A red black tree search on the extent map tree, which has O(log n)
     complexity, initially very fast since the tree is empty or very
     small, but as we end up allocating extent maps and adding them to
     the tree when we don't find them there, each subsequent search on
     the tree gets slower, since it's getting bigger and bigger after
     each iteration.

   - A search on the subvolume b+tree, also O(log n) complexity, but it
     has items for all inodes in the subvolume, not just items for our
     inode. Plus on a filesystem with concurrent operations on other
     inodes, we can block doing the search due to lock contention on
     b+tree nodes/leaves.

   - Allocate an extent map - this can block, and can also fail if we
     are under serious memory pressure.

   - Do another search on the extent maps red black tree, with the goal
     of inserting the extent map we just allocated. Again, after every
     iteration this tree is getting bigger by 1 element, so after many
     iterations the searches are slower and slower.

   - We will not need the allocated extent map anymore, so it's pointless
     to add it to the extent map tree. It's just wasting time and memory.

   In short we end up searching the extent map tree multiple times, on a
   tree that is growing bigger and bigger after each iteration. And
   besides that we visit the same leaf of the subvolume b+tree many times,
   since a leaf with the default size of 16K can easily have more than 200
   file extent items.

This is very inefficient overall. This patch changes the algorithm to
instead iterate over the subvolume b+tree, visiting each leaf only once,
and only searching in the extent map tree for file ranges that have holes
or prealloc extents, in order to figure out if we have delalloc there.
It will never allocate an extent map and add it to the extent map tree.
This is very similar to what was previously done for the lseek's hole and
data seeking features.

Also, the current implementation relying on extent maps for figuring out
which extents we have is not correct. This is because extent maps can be
merged even if they represent different extents - we do this to minimize
memory utilization and keep extent map trees smaller. For example if we
have two extents that are contiguous on disk, once we load the two extent
maps, they get merged into a single one - however if only one of the
extents is shared, we end up reporting both as shared or both as not
shared, which is incorrect.

This reproducer triggers that bug:

    $ cat fiemap-bug.sh
    #!/bin/bash

    DEV=/dev/sdj
    MNT=/mnt/sdj

    mkfs.btrfs -f $DEV
    mount $DEV $MNT

    # Create a file with two 256K extents.
    # Since there is no other write activity, they will be contiguous,
    # and their extent maps merged, despite having two distinct extents.
    xfs_io -f -c "pwrite -S 0xab 0 256K" \
              -c "fsync" \
              -c "pwrite -S 0xcd 256K 256K" \
              -c "fsync" \
              $MNT/foo

    # Now clone only the second extent into another file.
    xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar

    # Filefrag will report a single 512K extent, and say it's not shared.
    echo
    filefrag -v $MNT/foo

    umount $MNT

Running the reproducer:

    $ ./fiemap-bug.sh
    wrote 262144/262144 bytes at offset 0
    256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
    wrote 262144/262144 bytes at offset 262144
    256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
    linked 262144/262144 bytes at offset 0
    256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)

    Filesystem type is: 9123683e
    File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..     127:       3328..      3455:    128:             last,eof
    /mnt/sdj/foo: 1 extent found

We end up reporting that we have a single 512K that is not shared, however
we have two 256K extents, and the second one is shared. Changing the
reproducer to clone instead the first extent into file 'bar', makes us
report a single 512K extent that is shared, which is algo incorrect since
we have two 256K extents and only the first one is shared.

This patch is part of a larger patchset that is comprised of the following
patches:

    btrfs: allow hole and data seeking to be interruptible
    btrfs: make hole and data seeking a lot more efficient
    btrfs: remove check for impossible block start for an extent map at fiemap
    btrfs: remove zero length check when entering fiemap
    btrfs: properly flush delalloc when entering fiemap
    btrfs: allow fiemap to be interruptible
    btrfs: rename btrfs_check_shared() to a more descriptive name
    btrfs: speedup checking for extent sharedness during fiemap
    btrfs: skip unnecessary extent buffer sharedness checks during fiemap
    btrfs: make fiemap more efficient and accurate reporting extent sharedness

The patchset was tested on a machine running a non-debug kernel (Debian's
default config) and compared the tests below on a branch without the
patchset versus the same branch with the whole patchset applied.

The following test for a large compressed file without holes:

    $ cat fiemap-perf-test.sh
    #!/bin/bash

    DEV=/dev/sdi
    MNT=/mnt/sdi

    mkfs.btrfs -f $DEV
    mount -o compress=lzo $DEV $MNT

    # 40G gives 327680 128K file extents (due to compression).
    xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar

    umount $MNT
    mount -o compress=lzo $DEV $MNT

    start=$(date +%s%N)
    filefrag $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "fiemap took $dur milliseconds (metadata not cached)"

    start=$(date +%s%N)
    filefrag $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "fiemap took $dur milliseconds (metadata cached)"

    umount $MNT

Before patchset:

    $ ./fiemap-perf-test.sh
    (...)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 3597 milliseconds (metadata not cached)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 2107 milliseconds (metadata cached)

After patchset:

    $ ./fiemap-perf-test.sh
    (...)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 1214 milliseconds (metadata not cached)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 684 milliseconds (metadata cached)

That's a speedup of about 3x for both cases (no metadata cached and all
metadata cached).

The test provided by Pavel (first Link tag at the bottom), which uses
files with a large number of holes, was also used to measure the gains,
and it consists on a small C program and a shell script to invoke it.
The C program is the following:

    $ cat pavels-test.c
    #include <stdio.h>
    #include <unistd.h>
    #include <stdlib.h>
    #include <fcntl.h>

    #include <sys/stat.h>
    #include <sys/time.h>
    #include <sys/ioctl.h>

    #include <linux/fs.h>
    #include <linux/fiemap.h>

    #define FILE_INTERVAL (1<<13) /* 8Kb */

    long long interval(struct timeval t1, struct timeval t2)
    {
        long long val = 0;
        val += (t2.tv_usec - t1.tv_usec);
        val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
        return val;
    }

    int main(int argc, char **argv)
    {
        struct fiemap fiemap = {};
        struct timeval t1, t2;
        char data = 'a';
        struct stat st;
        int fd, off, file_size = FILE_INTERVAL;

        if (argc != 3 && argc != 2) {
                printf("usage: %s <path> [size]\n", argv[0]);
                return 1;
        }

        if (argc == 3)
                file_size = atoi(argv[2]);
        if (file_size < FILE_INTERVAL)
                file_size = FILE_INTERVAL;
        file_size -= file_size % FILE_INTERVAL;

        fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        for (off = 0; off < file_size; off += FILE_INTERVAL) {
            if (pwrite(fd, &data, 1, off) != 1) {
                perror("pwrite");
                close(fd);
                return 1;
            }
        }

        if (ftruncate(fd, file_size)) {
            perror("ftruncate");
            close(fd);
            return 1;
        }

        if (fstat(fd, &st) < 0) {
            perror("fstat");
            close(fd);
            return 1;
        }

        printf("size: %ld\n", st.st_size);
        printf("actual size: %ld\n", st.st_blocks * 512);

        fiemap.fm_length = FIEMAP_MAX_OFFSET;
        gettimeofday(&t1, NULL);
        if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
            perror("fiemap");
            close(fd);
            return 1;
        }
        gettimeofday(&t2, NULL);

        printf("fiemap: fm_mapped_extents = %d\n",
               fiemap.fm_mapped_extents);
        printf("time = %lld us\n", interval(t1, t2));

        close(fd);
        return 0;
    }

    $ gcc -o pavels_test pavels_test.c

And the wrapper shell script:

    $ cat fiemap-pavels-test.sh

    #!/bin/bash

    DEV=/dev/sdi
    MNT=/mnt/sdi

    mkfs.btrfs -f -O no-holes $DEV
    mount $DEV $MNT

    echo
    echo "*********** 256M ***********"
    echo

    ./pavels-test $MNT/testfile $((1 << 28))
    echo
    ./pavels-test $MNT/testfile $((1 << 28))

    echo
    echo "*********** 512M ***********"
    echo

    ./pavels-test $MNT/testfile $((1 << 29))
    echo
    ./pavels-test $MNT/testfile $((1 << 29))

    echo
    echo "*********** 1G ***********"
    echo

    ./pavels-test $MNT/testfile $((1 << 30))
    echo
    ./pavels-test $MNT/testfile $((1 << 30))

    umount $MNT

Running his reproducer before applying the patchset:

    *********** 256M ***********

    size: 268435456
    actual size: 134217728
    fiemap: fm_mapped_extents = 32768
    time = 4003133 us

    size: 268435456
    actual size: 134217728
    fiemap: fm_mapped_extents = 32768
    time = 4895330 us

    *********** 512M ***********

    size: 536870912
    actual size: 268435456
    fiemap: fm_mapped_extents = 65536
    time = 30123675 us

    size: 536870912
    actual size: 268435456
    fiemap: fm_mapped_extents = 65536
    time = 33450934 us

    *********** 1G ***********

    size: 1073741824
    actual size: 536870912
    fiemap: fm_mapped_extents = 131072
    time = 224924074 us

    size: 1073741824
    actual size: 536870912
    fiemap: fm_mapped_extents = 131072
    time = 217239242 us

Running it after applying the patchset:

    *********** 256M ***********

    size: 268435456
    actual size: 134217728
    fiemap: fm_mapped_extents = 32768
    time = 29475 us

    size: 268435456
    actual size: 134217728
    fiemap: fm_mapped_extents = 32768
    time = 29307 us

    *********** 512M ***********

    size: 536870912
    actual size: 268435456
    fiemap: fm_mapped_extents = 65536
    time = 58996 us

    size: 536870912
    actual size: 268435456
    fiemap: fm_mapped_extents = 65536
    time = 59115 us

    *********** 1G ***********

    size: 1073741824
    actual size: 536870912
    fiemap: fm_mapped_extents = 116251
    time = 124141 us

    size: 1073741824
    actual size: 536870912
    fiemap: fm_mapped_extents = 131072
    time = 119387 us

The speedup is massive, both on the first fiemap call and on the second
one as well, as his test creates files with many holes and small extents
(every extent follows a hole and precedes another hole).

For the 256M file we go from 4 seconds down to 29 milliseconds in the
first run, and then from 4.9 seconds down to 29 milliseconds again in the
second run, a speedup of 138x and 169x, respectively.

For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
first run, and then from 33.5 seconds down to 59 milliseconds again in the
second run, a speedup of 510x and 568x, respectively.

For the 1G file, we go from 225 seconds down to 124 milliseconds in the
first run, and then from 217 seconds down to 119 milliseconds in the
second run, a speedup of 1815x and 1824x, respectively.

Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:01 +02:00
Filipe Manana
b8f164e3e6 btrfs: skip unnecessary extent buffer sharedness checks during fiemap
During fiemap, for each file extent we find, we must check if it's shared
or not. The sharedness check starts by verifying if the extent is directly
shared (its refcount in the extent tree is > 1), and if it is not directly
shared, then we will check if every node in the subvolume b+tree leading
from the root to the leaf that has the file extent item (in reverse order),
is shared (through snapshots).

However this second step is not needed if our extent was created in a
transaction more recent than the last transaction where a snapshot of the
inode's root happened, because it can't be shared indirectly (through
shared subtrees) without a snapshot created in a more recent transaction.

So grab the generation of the extent from the extent map and pass it to
btrfs_is_data_extent_shared(), which will skip this second phase when the
generation is more recent than the root's last snapshot value. Note that
we skip this optimization if the extent map is the result of merging 2
or more extent maps, because in this case its generation is the maximum
of the generations of all merged extent maps.

The fact the we use extent maps and they can be merged despite the
underlying extents being distinct (different file extent items in the
subvolume b+tree and different extent items in the extent b+tree), can
result in some bugs when reporting shared extents. But this is a problem
of the current implementation of fiemap relying on extent maps.
One example where we get incorrect results is:

    $ cat fiemap-bug.sh
    #!/bin/bash

    DEV=/dev/sdj
    MNT=/mnt/sdj

    mkfs.btrfs -f $DEV
    mount $DEV $MNT

    # Create a file with two 256K extents.
    # Since there is no other write activity, they will be contiguous,
    # and their extent maps merged, despite having two distinct extents.
    xfs_io -f -c "pwrite -S 0xab 0 256K" \
              -c "fsync" \
              -c "pwrite -S 0xcd 256K 256K" \
              -c "fsync" \
              $MNT/foo

    # Now clone only the second extent into another file.
    xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar

    # Filefrag will report a single 512K extent, and say it's not shared.
    echo
    filefrag -v $MNT/foo

    umount $MNT

Running the reproducer:

    $ ./fiemap-bug.sh
    wrote 262144/262144 bytes at offset 0
    256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
    wrote 262144/262144 bytes at offset 262144
    256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
    linked 262144/262144 bytes at offset 0
    256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)

    Filesystem type is: 9123683e
    File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..     127:       3328..      3455:    128:             last,eof
    /mnt/sdj/foo: 1 extent found

We end up reporting that we have a single 512K that is not shared, however
we have two 256K extents, and the second one is shared. Changing the
reproducer to clone instead the first extent into file 'bar', makes us
report a single 512K extent that is shared, which is algo incorrect since
we have two 256K extents and only the first one is shared.

This is z problem that existed before this change, and remains after this
change, as it can't be easily fixed. The next patch in the series reworks
fiemap to primarily use file extent items instead of extent maps (except
for checking for delalloc ranges), with the goal of improving its
scalability and performance, but it also ends up fixing this particular
bug caused by extent map merging.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:01 +02:00
Filipe Manana
12a824dc67 btrfs: speedup checking for extent sharedness during fiemap
One of the most expensive tasks performed during fiemap is to check if
an extent is shared. This task has two major steps:

1) Check if the data extent is shared. This implies checking the extent
   item in the extent tree, checking delayed references, etc. If we
   find the data extent is directly shared, we terminate immediately;

2) If the data extent is not directly shared (its extent item has a
   refcount of 1), then it may be shared if we have snapshots that share
   subtrees of the inode's subvolume b+tree. So we check if the leaf
   containing the file extent item is shared, then its parent node, then
   the parent node of the parent node, etc, until we reach the root node
   or we find one of them is shared - in which case we stop immediately.

During fiemap we process the extents of a file from left to right, from
file offset 0 to EOF. This means that we iterate b+tree leaves from left
to right, and has the implication that we keep repeating that second step
above several times for the same b+tree path of the inode's subvolume
b+tree.

For example, if we have two file extent items in leaf X, and the path to
leaf X is A -> B -> C -> X, then when we try to determine if the data
extent referenced by the first extent item is shared, we check if the data
extent is shared - if it's not, then we check if leaf X is shared, if not,
then we check if node C is shared, if not, then check if node B is shared,
if not than check if node A is shared. When we move to the next file
extent item, after determining the data extent is not shared, we repeat
the checks for X, C, B and A - doing all the expensive searches in the
extent tree, delayed refs, etc. If we have thousands of tile extents, then
we keep repeating the sharedness checks for the same paths over and over.

On a file that has no shared extents or only a small portion, it's easy
to see that this scales terribly with the number of extents in the file
and the sizes of the extent and subvolume b+trees.

This change eliminates the repeated sharedness check on extent buffers
by caching the results of the last path used. The results can be used as
long as no snapshots were created since they were cached (for not shared
extent buffers) or no roots were dropped since they were cached (for
shared extent buffers). This greatly reduces the time spent by fiemap for
files with thousands of extents and/or large extent and subvolume b+trees.

Example performance test:

    $ cat fiemap-perf-test.sh
    #!/bin/bash

    DEV=/dev/sdi
    MNT=/mnt/sdi

    mkfs.btrfs -f $DEV
    mount -o compress=lzo $DEV $MNT

    # 40G gives 327680 128K file extents (due to compression).
    xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar

    umount $MNT
    mount -o compress=lzo $DEV $MNT

    start=$(date +%s%N)
    filefrag $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "fiemap took $dur milliseconds (metadata not cached)"

    start=$(date +%s%N)
    filefrag $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "fiemap took $dur milliseconds (metadata cached)"

    umount $MNT

Before this patch:

    $ ./fiemap-perf-test.sh
    (...)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 3597 milliseconds (metadata not cached)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 2107 milliseconds (metadata cached)

After this patch:

    $ ./fiemap-perf-test.sh
    (...)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 1646 milliseconds (metadata not cached)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 698 milliseconds (metadata cached)

That's about 2.2x faster when no metadata is cached, and about 3x faster
when all metadata is cached. On a real filesystem with many other files,
data, directories, etc, the b+trees will be 2 or 3 levels higher,
therefore this optimization will have a higher impact.

Several reports of a slow fiemap show up often, the two Link tags below
refer to two recent reports of such slowness. This patch, together with
the next ones in the series, is meant to address that.

Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:01 +02:00
Filipe Manana
8eedaddaab btrfs: rename btrfs_check_shared() to a more descriptive name
The function btrfs_check_shared() is supposed to be used to check if a
data extent is shared, but its name is too generic, may easily cause
confusion in the sense that it may be used for metadata extents.

So rename it to btrfs_is_data_extent_shared(), which will also make it
less confusing after the next change that adds a backref lookup cache for
the b+tree nodes that lead to the leaf that contains the file extent item
that points to the target data extent.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:01 +02:00
Filipe Manana
09fbc1c8e7 btrfs: allow fiemap to be interruptible
Doing fiemap on a file with a very large number of extents can take a very
long time, and we have reports of it being too slow (two recent examples
in the Link tags below), so make it interruptible.

Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:00 +02:00
Filipe Manana
9a42bbaeff btrfs: remove zero length check when entering fiemap
There's no point to check for a 0 length at extent_fiemap(), as before
calling it, we called fiemap_prep() at btrfs_fiemap(), which already
checks for a zero length and returns the same -EINVAL error. So remove
the pointless check.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:00 +02:00
Filipe Manana
f12eec9a26 btrfs: remove check for impossible block start for an extent map at fiemap
During fiemap we are testing if an extent map has a block start with a
value of EXTENT_MAP_LAST_BYTE, but that is never set on an extent map,
and never was according to git history. So remove that useless check.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:00 +02:00
Christoph Hellwig
917f32a235 btrfs: give struct btrfs_bio a real end_io handler
Currently btrfs_bio end I/O handling is a bit of a mess.  The bi_end_io
handler and bi_private pointer of the embedded struct bio are both used
to handle the completion of the high-level btrfs_bio and for the I/O
completion for the low-level device that the embedded bio ends up being
sent to.

To support this bi_end_io and bi_private are saved into the
btrfs_io_context structure and then restored after the bio sent to the
underlying device has completed the actual I/O.

Untangle this by adding an end I/O handler and private data to struct
btrfs_bio for the high-level btrfs_bio based completions, and leave the
actual bio bi_end_io handler and bi_private pointer entirely to the
low-level device I/O.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:59 +02:00
Christoph Hellwig
6b42f5e343 btrfs: pass the operation to btrfs_bio_alloc
Pass the operation to btrfs_bio_alloc, matching what bio_alloc_bioset
set does.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:58 +02:00
Christoph Hellwig
d45cfb883b btrfs: move btrfs_bio allocation to volumes.c
volumes.c is the place that implements the storage layer using the
btrfs_bio structure, so move the bio_set and allocation helpers there
as well.

To make up for the new initialization boilerplate, merge the two
init/exit helpers in extent_io.c into a single one.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:58 +02:00
Christoph Hellwig
1e408af31b btrfs: don't create integrity bioset for btrfs_bioset
btrfs never uses bio integrity data itself, so don't allocate
the integrity pools for btrfs_bioset.

This patch is a revert of the commit b208c2f7ce ("btrfs: Fix crash due
to not allocating integrity data for a set").  The integrity data pool
is not needed, the bio-integrity code now handles allocating the
integrity payload without that.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:58 +02:00
Ethan Lien
52b029f427 btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path
After we copied data to page cache in buffered I/O, we
1. Insert a EXTENT_UPTODATE state into inode's io_tree, by
   endio_readpage_release_extent(), set_extent_delalloc() or
   set_extent_defrag().
2. Set page uptodate before we unlock the page.

But the only place we check io_tree's EXTENT_UPTODATE state is in
btrfs_do_readpage(). We know we enter btrfs_do_readpage() only when we
have a non-uptodate page, so it is unnecessary to set EXTENT_UPTODATE.

For example, when performing a buffered random read:

	fio --rw=randread --ioengine=libaio --direct=0 --numjobs=4 \
		--filesize=32G --size=4G --bs=4k --name=job \
		--filename=/mnt/file --name=job

Then check how many extent_state in io_tree:

	cat /proc/slabinfo | grep btrfs_extent_state | awk '{print $2}'

w/o this patch, we got 640567 btrfs_extent_state.
w/  this patch, we got    204 btrfs_extent_state.

Maintaining such a big tree brings overhead since every I/O needs to insert
EXTENT_LOCKED, insert EXTENT_UPTODATE, then remove EXTENT_LOCKED. And in
every insert or remove, we need to lock io_tree, do tree search, alloc or
dealloc extent states. By removing unnecessary EXTENT_UPTODATE, we keep
io_tree in a minimal size and reduce overhead when performing buffered I/O.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:57 +02:00
Uros Bizjak
e5677f0560 btrfs: use atomic_try_cmpxchg in free_extent_buffer
Use `atomic_try_cmpxchg(ptr, &old, new)` instead of
`atomic_cmpxchg(ptr, old, new) == old` in free_extent_buffer. This
has two benefits:

- The x86 cmpxchg instruction returns success in the ZF flag, so this
  change saves a compare after cmpxchg, as well as a related move
  instruction in the front of cmpxchg.

- atomic_try_cmpxchg implicitly assigns the *ptr value to &old when
  cmpxchg fails, enabling further code simplifications.

This patch has no functional change.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:55 +02:00
Linus Torvalds
8379c0b31f for-6.0-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmMLY9oACgkQxWXV+ddt
 WDue/w/8C3ZF8nLAI/sMrUpef2vSD62bvkKRRS45wzR2uod6yc0Fle9upzBssJQZ
 qO3mQ53+QV+imCq7dY5mmtmwCUJNmbV5gbiMoF1OoV9TYtpZb/NIDklSX8se2eJX
 drdAWQr2pYwU2M4duA4IEW08TvQ2TFh0JiqMi0aYM5apyL80uv3WniOu+xpRipA3
 CMFAnDqayIgQ5OIsedqNy2MBLBopodUL5PZv/H7/g6KSKIuAZP9zgg1eKPfaz2t3
 HO183ubmMbVtxgxeu+EnvCkg/iQ5hQiuGmyi0FLYMs/A6/NglwBnIJU5jCMQhcp6
 HO5+FSUn6lHQetVzt2uHb9Lo+gX4FtCaHqVv1bXT62lnmDsZO1D7RVSg1Fra+CY+
 jJmi8vvIbfbYlSZPZlJANoWe8ODOMVPk+pM4SFHlxOWGAY6HViX2RfHnIjNj5x9O
 iDSTGvH6++nBF1Wu2/Xja/VKZ1avxRyTu2srW8JOF62j/tTU/EoPJcO9rxXOBBmC
 Hi4UmJ690p3h5xZeeiyE8CmaSlPtfdCcnc/97FnusEjBao9O7THX0PCDVJX6VBkm
 hVk01Z6+az1UNcD18KecvCpKYF/At4WpjaUGgf7q+LBfJXuXA6jfzOVDJMKV3TFd
 n1yMFg+duGj90l8gT0aa/VQiBlUlnzQKz6ceqyKkPccwveNis6I=
 =p8YV
 -----END PGP SIGNATURE-----

Merge tag 'for-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Fixes:

   - check that subvolume is writable when changing xattrs from security
     namespace

   - fix memory leak in device lookup helper

   - update generation of hole file extent item when merging holes

   - fix space cache corruption and potential double allocations; this
     is a rare bug but can be serious once it happens, stable backports
     and analysis tool will be provided

   - fix error handling when deleting root references

   - fix crash due to assert when attempting to cancel suspended device
     replace, add message what to do if mount fails due to missing
     replace item

  Regressions:

   - don't merge pages into bio if their page offset is not contiguous

   - don't allow large NOWAIT direct reads, this could lead to short
     reads eg. in io_uring"

* tag 'for-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: add info when mount fails due to stale replace target
  btrfs: replace: drop assert for suspended replace
  btrfs: fix silent failure when deleting root reference
  btrfs: fix space cache corruption and potential double allocations
  btrfs: don't allow large NOWAIT direct reads
  btrfs: don't merge pages into bio if their page offset is not contiguous
  btrfs: update generation of hole file extent item when merging holes
  btrfs: fix possible memory leak in btrfs_get_dev_args_from_path()
  btrfs: check if root is readonly while setting security xattr
2022-08-28 10:44:04 -07:00
Qu Wenruo
4a445b7b61 btrfs: don't merge pages into bio if their page offset is not contiguous
[BUG]
Zygo reported on latest development branch, he could hit
ASSERT()/BUG_ON() caused crash when doing RAID5 recovery (intentionally
corrupt one disk, and let btrfs to recover the data during read/scrub).

And The following minimal reproducer can cause extent state leakage at
rmmod time:

  mkfs.btrfs -f -d raid5 -m raid5 $dev1 $dev2 $dev3 -b 1G > /dev/null
  mount $dev1 $mnt
  fsstress -w -d $mnt -n 25 -s 1660807876
  sync
  fssum -A -f -w /tmp/fssum.saved $mnt
  umount $mnt

  # Wipe the dev1 but keeps its super block
  xfs_io -c "pwrite -S 0x0 1m 1023m" $dev1
  mount $dev1 $mnt
  fssum -r /tmp/fssum.saved $mnt > /dev/null
  umount $mnt
  rmmod btrfs

This will lead to the following extent states leakage:

  BTRFS: state leak: start 499712 end 503807 state 5 in tree 1 refs 1
  BTRFS: state leak: start 495616 end 499711 state 5 in tree 1 refs 1
  BTRFS: state leak: start 491520 end 495615 state 5 in tree 1 refs 1
  BTRFS: state leak: start 487424 end 491519 state 5 in tree 1 refs 1
  BTRFS: state leak: start 483328 end 487423 state 5 in tree 1 refs 1
  BTRFS: state leak: start 479232 end 483327 state 5 in tree 1 refs 1
  BTRFS: state leak: start 475136 end 479231 state 5 in tree 1 refs 1
  BTRFS: state leak: start 471040 end 475135 state 5 in tree 1 refs 1

[CAUSE]
Since commit 7aa51232e2 ("btrfs: pass a btrfs_bio to
btrfs_repair_one_sector"), we always use btrfs_bio->file_offset to
determine the file offset of a page.

But that usage assume that, one bio has all its page having a continuous
page offsets.

Unfortunately that's not true, btrfs only requires the logical bytenr
contiguous when assembling its bios.

From above script, we have one bio looks like this:

  fssum-27671  submit_one_bio: bio logical=217739264 len=36864
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=466944 <<<
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=724992 <<<
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=729088
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=733184
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=737280
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=741376
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=745472
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=749568
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=753664

Note that the 1st and the 2nd page has non-contiguous page offsets.

This means, at repair time, we will have completely wrong file offset
passed in:

   kworker/u32:2-19927  btrfs_repair_one_sector: r/i=5/261 page_off=729088 file_off=475136 bio_offset=8192

Since the file offset is incorrect, we latter incorrectly set the extent
states, and no way to really release them.

Thus later it causes the leakage.

In fact, this can be even worse, since the file offset is incorrect, we
can hit cases like the incorrect file offset belongs to a HOLE, and
later cause btrfs_num_copies() to trigger error, finally hit
BUG_ON()/ASSERT() later.

[FIX]
Add an extra condition in btrfs_bio_add_page() for uncompressed IO.

Now we will have more strict requirement for bio pages:

- They should all have the same mapping
  (the mapping check is already implied by the call chain)

- Their logical bytenr should be adjacent
  This is the same as the old condition.

- Their page_offset() (file offset) should be adjacent
  This is the new check.
  This would result a slightly increased amount of bios from btrfs
  (needs holes and inside the same stripe boundary to trigger).

  But this would greatly reduce the confusion, as it's pretty common
  to assume a btrfs bio would only contain continuous page cache.

Later we may need extra cleanups, as we no longer needs to handle gaps
between page offsets in endio functions.

Currently this should be the minimal patch to fix commit 7aa51232e2
("btrfs: pass a btrfs_bio to btrfs_repair_one_sector").

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: 7aa51232e2 ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-22 18:06:58 +02:00
Linus Torvalds
42c54d5491 for-6.0-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmL/dJgACgkQxWXV+ddt
 WDtUmg/9Fje7L+jAtQLqvJYvvGCFCCs8gkm+Rwh9MLYHcEDnBtRWzDWYTKcfO9WW
 mPmiNlNPAbnAw/9apeJDvFOrd4Vqr8ZD3Vr0flXl5EJ9QXSTL6JXfaiLS1o6rQ0q
 OXqxVTh5B5aEmfsEhyJBLzsZGdISJbr60dAE8/ZDX9wo8cDavZ6YxToIroUeizGO
 dx2DY5A7OQRKTEf4aJgu4zcm1Sq+U5A8M1pcfU4Rhb/YapVgcCc2wrdrw4NOkNJu
 B/NZFcD0BcF2bZW7uHT5vr8rGzj2xZUCkuggBQ6/0/h7OdRXIzHCanMw4lPy602v
 IfZASF7eYkq7FHRbANj/WAYHOFl41vS+whqZ+sEI/qrW+ZyrODIzS67sbU/Bsa7+
 ZoL6RSlIbcMgz7XVqcc5d8bKNK/Hc3MCyMWlLT0XScm05BiJy5O2iEZ7UOJ4r8E+
 6J/pFD1bNdacp6UrzwEjyXuSufvp4pJdNWn5ttIWYTBygXMT8AJ403yIheiV3KA3
 SkoMj4A54tF3G7NhkzfR5sC7hlgcA0njJLzloicmNgP7E12vmcDL2rTdTt/Yl0cw
 3w6ztJS+1sWocFSJ43lE2muHlj7wW7QDYPvul1t6yWgO9wtWECo7c/ASeRl0zPkP
 atAJtsr3uV9/aA2ae9QeWzut1W3hbE5NFQOLPcF/iXN+kxMmesI=
 =HgV0
 -----END PGP SIGNATURE-----

Merge tag 'for-6.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few short fixes and a lockdep warning fix (needs moving some code):

   - tree-log replay fixes:
      - fix error handling when looking up extent refs
      - fix warning when setting inode number of links

   - relocation fixes:
      - reset block group read-only status when relocation fails
      - unset control structure if transaction fails when starting
        to process a block group
      - add lockdep annotations to fix a warning during relocation
        where blocks temporarily belong to another tree and can lead
        to reversed dependencies

   - tree-checker verifies that extent items don't overlap"

* tag 'for-6.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: tree-checker: check for overlapping extent items
  btrfs: fix warning during log replay when bumping inode link count
  btrfs: fix lost error handling when looking up extended ref on log replay
  btrfs: fix lockdep splat with reloc root extent buffers
  btrfs: move lockdep class helpers to locking.c
  btrfs: unset reloc control if transaction commit fails in prepare_to_relocate()
  btrfs: reset RO counter on block group if we fail to relocate
2022-08-19 13:33:48 -07:00
Josef Bacik
b40130b23c btrfs: fix lockdep splat with reloc root extent buffers
We have been hitting the following lockdep splat with btrfs/187 recently

  WARNING: possible circular locking dependency detected
  5.19.0-rc8+ #775 Not tainted
  ------------------------------------------------------
  btrfs/752500 is trying to acquire lock:
  ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110

  but task is already holding lock:
  ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (btrfs-tree-01/1){+.+.}-{3:3}:
	 down_write_nested+0x41/0x80
	 __btrfs_tree_lock+0x24/0x110
	 btrfs_init_new_buffer+0x7d/0x2c0
	 btrfs_alloc_tree_block+0x120/0x3b0
	 __btrfs_cow_block+0x136/0x600
	 btrfs_cow_block+0x10b/0x230
	 btrfs_search_slot+0x53b/0xb70
	 btrfs_lookup_inode+0x2a/0xa0
	 __btrfs_update_delayed_inode+0x5f/0x280
	 btrfs_async_run_delayed_root+0x24c/0x290
	 btrfs_work_helper+0xf2/0x3e0
	 process_one_work+0x271/0x590
	 worker_thread+0x52/0x3b0
	 kthread+0xf0/0x120
	 ret_from_fork+0x1f/0x30

  -> #1 (btrfs-tree-01){++++}-{3:3}:
	 down_write_nested+0x41/0x80
	 __btrfs_tree_lock+0x24/0x110
	 btrfs_search_slot+0x3c3/0xb70
	 do_relocation+0x10c/0x6b0
	 relocate_tree_blocks+0x317/0x6d0
	 relocate_block_group+0x1f1/0x560
	 btrfs_relocate_block_group+0x23e/0x400
	 btrfs_relocate_chunk+0x4c/0x140
	 btrfs_balance+0x755/0xe40
	 btrfs_ioctl+0x1ea2/0x2c90
	 __x64_sys_ioctl+0x88/0xc0
	 do_syscall_64+0x38/0x90
	 entry_SYSCALL_64_after_hwframe+0x63/0xcd

  -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
	 __lock_acquire+0x1122/0x1e10
	 lock_acquire+0xc2/0x2d0
	 down_write_nested+0x41/0x80
	 __btrfs_tree_lock+0x24/0x110
	 btrfs_lock_root_node+0x31/0x50
	 btrfs_search_slot+0x1cb/0xb70
	 replace_path+0x541/0x9f0
	 merge_reloc_root+0x1d6/0x610
	 merge_reloc_roots+0xe2/0x260
	 relocate_block_group+0x2c8/0x560
	 btrfs_relocate_block_group+0x23e/0x400
	 btrfs_relocate_chunk+0x4c/0x140
	 btrfs_balance+0x755/0xe40
	 btrfs_ioctl+0x1ea2/0x2c90
	 __x64_sys_ioctl+0x88/0xc0
	 do_syscall_64+0x38/0x90
	 entry_SYSCALL_64_after_hwframe+0x63/0xcd

  other info that might help us debug this:

  Chain exists of:
    btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(btrfs-tree-01/1);
				 lock(btrfs-tree-01);
				 lock(btrfs-tree-01/1);
    lock(btrfs-treloc-02#2);

   *** DEADLOCK ***

  7 locks held by btrfs/752500:
   #0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90
   #1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40
   #2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400
   #3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610
   #4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
   #5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
   #6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110

  stack backtrace:
  CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:

   dump_stack_lvl+0x56/0x73
   check_noncircular+0xd6/0x100
   ? lock_is_held_type+0xe2/0x140
   __lock_acquire+0x1122/0x1e10
   lock_acquire+0xc2/0x2d0
   ? __btrfs_tree_lock+0x24/0x110
   down_write_nested+0x41/0x80
   ? __btrfs_tree_lock+0x24/0x110
   __btrfs_tree_lock+0x24/0x110
   btrfs_lock_root_node+0x31/0x50
   btrfs_search_slot+0x1cb/0xb70
   ? lock_release+0x137/0x2d0
   ? _raw_spin_unlock+0x29/0x50
   ? release_extent_buffer+0x128/0x180
   replace_path+0x541/0x9f0
   merge_reloc_root+0x1d6/0x610
   merge_reloc_roots+0xe2/0x260
   relocate_block_group+0x2c8/0x560
   btrfs_relocate_block_group+0x23e/0x400
   btrfs_relocate_chunk+0x4c/0x140
   btrfs_balance+0x755/0xe40
   btrfs_ioctl+0x1ea2/0x2c90
   ? lock_is_held_type+0xe2/0x140
   ? lock_is_held_type+0xe2/0x140
   ? __x64_sys_ioctl+0x88/0xc0
   __x64_sys_ioctl+0x88/0xc0
   do_syscall_64+0x38/0x90
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

This isn't necessarily new, it's just tricky to hit in practice.  There
are two competing things going on here.  With relocation we create a
snapshot of every fs tree with a reloc tree.  Any extent buffers that
get initialized here are initialized with the reloc root lockdep key.
However since it is a snapshot, any blocks that are currently in cache
that originally belonged to the fs tree will have the normal tree
lockdep key set.  This creates the lock dependency of

  reloc tree -> normal tree

for the extent buffer locking during the first phase of the relocation
as we walk down the reloc root to relocate blocks.

However this is problematic because the final phase of the relocation is
merging the reloc root into the original fs root.  This involves
searching down to any keys that exist in the original fs root and then
swapping the relocated block and the original fs root block.  We have to
search down to the fs root first, and then go search the reloc root for
the block we need to replace.  This creates the dependency of

  normal tree -> reloc tree

which is why lockdep complains.

Additionally even if we were to fix this particular mismatch with a
different nesting for the merge case, we're still slotting in a block
that has a owner of the reloc root objectid into a normal tree, so that
block will have its lockdep key set to the tree reloc root, and create a
lockdep splat later on when we wander into that block from the fs root.

Unfortunately the only solution here is to make sure we do not set the
lockdep key to the reloc tree lockdep key normally, and then reset any
blocks we wander into from the reloc root when we're doing the merged.

This solves the problem of having mixed tree reloc keys intermixed with
normal tree keys, and then allows us to make sure in the merge case we
maintain the lock order of

  normal tree -> reloc tree

We handle this by setting a bit on the reloc root when we do the search
for the block we want to relocate, and any block we search into or COW
at that point gets set to the reloc tree key.  This works correctly
because we only ever COW down to the parent node, so we aren't resetting
the key for the block we're linking into the fs root.

With this patch we no longer have the lockdep splat in btrfs/187.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-17 16:19:12 +02:00
Linus Torvalds
353767e4aa for-5.20-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmLnyNUACgkQxWXV+ddt
 WDt9vA/9HcF+v5EkknyW07tatTap/Hm/ZB86Z5OZi6ikwIEcHsWhp3rUICejm88e
 GecDPIluDtCtyD6x4stuqkwOm22aDP5q2T9H6+gyw92ozyb436OV1Z8IrmftzXKY
 EpZO70PHZT+E6E/WYvyoTmmoCrjib7YlqCWZZhSLUFpsqqlOInmHEH49PW6KvM4r
 acUZ/RxHurKdmI3kNY6ECbAQl6CASvtTdYcVCx8fT2zN0azoLIQxpYa7n/9ca1R6
 8WnYilCbLbNGtcUXvO2M3tMZ4/5kvxrwQsUn93ccCJYuiN0ASiDXbLZ2g4LZ+n56
 JGu+y5v5oBwjpVf+46cuvnENP5BQ61594WPseiVjrqODWnPjN28XkcVC0XmPsiiZ
 lszeHO2cuIrIFoCah8ELMl8usu8+qxfXmPxIXtPu9rEyKsDtOjxVYc8SMXqLp0qQ
 qYtBoFm0JcZHqtZRpB+dhQ37/xXtH4ljUi/mI6x8iALVujeR273URs7yO9zgIdeW
 uZoFtbwpHFLUk+TL7Ku82/zOXp3fCwtDpNmlYbxeMbea/be3ShjncM4+mYzvHYri
 dYON2LFrq+mnRDqtIXTCaAYwX7zU8Y18Ev9QwlNll8dKlKwS89+jpqLoa+eVYy3c
 /HitHFza70KxmOj4dvDVZlzDpPvl7kW1UBkmskg4u3jnNWzedkM=
 =sS1q
 -----END PGP SIGNATURE-----

Merge tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "This brings some long awaited changes, the send protocol bump,
  otherwise lots of small improvements and fixes. The main core part is
  reworking bio handling, cleaning up the submission and endio and
  improving error handling.

  There are some changes outside of btrfs adding helpers or updating
  API, listed at the end of the changelog.

  Features:

   - sysfs:
      - export chunk size, in debug mode add tunable for setting its size
      - show zoned among features (was only in debug mode)
      - show commit stats (number, last/max/total duration)

   - send protocol updated to 2
      - new commands:
         - ability write larger data chunks than 64K
         - send raw compressed extents (uses the encoded data ioctls),
           ie. no decompression on send side, no compression needed on
           receive side if supported
         - send 'otime' (inode creation time) among other timestamps
         - send file attributes (a.k.a file flags and xflags)
      - this is first version bump, backward compatibility on send and
        receive side is provided
      - there are still some known and wanted commands that will be
        implemented in the near future, another version bump will be
        needed, however we want to minimize that to avoid causing
        usability issues

   - print checksum type and implementation at mount time

   - don't print some messages at mount (mentioned as people asked about
     it), we want to print messages namely for new features so let's
     make some space for that
      - big metadata - this has been supported for a long time and is
        not a feature that's worth mentioning
      - skinny metadata - same reason, set by default by mkfs

  Performance improvements:

   - reduced amount of reserved metadata for delayed items
      - when inserted items can be batched into one leaf
      - when deleting batched directory index items
      - when deleting delayed items used for deletion
      - overall improved count of files/sec, decreased subvolume lock
        contention

   - metadata item access bounds checker micro-optimized, with a few
     percent of improved runtime for metadata-heavy operations

   - increase direct io limit for read to 256 sectors, improved
     throughput by 3x on sample workload

  Notable fixes:

   - raid56
      - reduce parity writes, skip sectors of stripe when there are no
        data updates
      - restore reading from on-disk data instead of using stripe cache,
        this reduces chances to damage correct data due to RMW cycle

   - refuse to replay log with unknown incompat read-only feature bit
     set

   - zoned
      - fix page locking when COW fails in the middle of allocation
      - improved tracking of active zones, ZNS drives may limit the
        number and there are ENOSPC errors due to that limit and not
        actual lack of space
      - adjust maximum extent size for zone append so it does not cause
        late ENOSPC due to underreservation

   - mirror reading error messages show the mirror number

   - don't fallback to buffered IO for NOWAIT direct IO writes, we don't
     have the NOWAIT semantics for buffered io yet

   - send, fix sending link commands for existing file paths when there
     are deleted and created hardlinks for same files

   - repair all mirrors for profiles with more than 1 copy (raid1c34)

   - fix repair of compressed extents, unify where error detection and
     repair happen

  Core changes:

   - bio completion cleanups
      - don't double defer compression bios
      - simplify endio workqueues
      - add more data to btrfs_bio to avoid allocation for read requests
      - rework bio error handling so it's same what block layer does,
        the submission works and errors are consumed in endio
      - when asynchronous bio offload fails fall back to synchronous
        checksum calculation to avoid errors under writeback or memory
        pressure

   - new trace points
      - raid56 events
      - ordered extent operations

   - super block log_root_transid deprecated (never used)

   - mixed_backref and big_metadata sysfs feature files removed, they've
     been default for sufficiently long time, there are no known users
     and mixed_backref could be confused with mixed_groups

  Non-btrfs changes, API updates:

   - minor highmem API update to cover const arguments

   - switch all kmap/kmap_atomic to kmap_local

   - remove redundant flush_dcache_page()

   - address_space_operations::writepage callback removed

   - add bdev_max_segments() helper"

* tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (163 commits)
  btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_read
  btrfs: fix repair of compressed extents
  btrfs: remove the start argument to check_data_csum and export
  btrfs: pass a btrfs_bio to btrfs_repair_one_sector
  btrfs: simplify the pending I/O counting in struct compressed_bio
  btrfs: repair all known bad mirrors
  btrfs: merge btrfs_dev_stat_print_on_error with its only caller
  btrfs: join running log transaction when logging new name
  btrfs: simplify error handling in btrfs_lookup_dentry
  btrfs: send: always use the rbtree based inode ref management infrastructure
  btrfs: send: fix sending link commands for existing file paths
  btrfs: send: introduce recorded_ref_alloc and recorded_ref_free
  btrfs: zoned: wait until zone is finished when allocation didn't progress
  btrfs: zoned: write out partially allocated region
  btrfs: zoned: activate necessary block group
  btrfs: zoned: activate metadata block group on flush_space
  btrfs: zoned: disable metadata overcommit for zoned
  btrfs: zoned: introduce space_info->active_total_bytes
  btrfs: zoned: finish least available block group on data bg allocation
  btrfs: let can_allocate_chunk return error
  ...
2022-08-03 14:54:52 -07:00
Linus Torvalds
c013d0af81 for-5.20/block-2022-07-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLko3gQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmQaD/90NKFj4v8I456TUQyg1jimXEsL+e84E6o2
 ALWVb6JzQvlPVQXNLnK5YKIunMWOTtTMz0nyB8sVRwVJVJO0P5d7QopAkZM8fkyU
 MK5OCzoryENw4DTc2wJS4in6cSbGylIuN74wMzlf7+M67JTImfoZQhbTMcjwzZfn
 b3OlL6sID7zMXwGcuOJPZyUJICCpDhzdSF9JXqKma5PQuG2SBmQyvFxJAcsoFBPc
 YetnoRIOIN6yBvsIZaPaYq7XI9MIvF0e67EQtyCEHj4tHpyVnyDWkeObVFULsISU
 gGEKbkYPvNUzRAU5Q1NBBHh1tTfkf/MaUxTuZwoEwZ/s04IGBGMmrZGyfvdfzYo6
 M7NwSEg/TrUSNfTwn65mQi7uOXu1pGkJrqz84Flm8u9Qid9Vd7LExLG5p/ggnWdH
 5th93MDEmtEg29e9DXpEAuS5d0t3TtSvosflaKpyfNNfr+P0rWCN6GM/uW62VUTK
 ls69SQh/AQJRbg64jU4xper6WhaYtSXK7TKEnxJycoEn9gYNyCcdot2uekth0xRH
 ChHGmRlteiqe/y4uFWn/2dcxWjoleiHbFjTaiRL75WVl8wIDEjw02LGuoZ61Ss9H
 WOV+MT7KqNjBGe6lreUY+O/PO02dzmoR6heJXN19p8zr/pBuLCTGX7UpO7rzgaBR
 4N1HEozvIw==
 =celk
 -----END PGP SIGNATURE-----

Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Improve the type checking of request flags (Bart)

 - Ensure queue mapping for a single queues always picks the right queue
   (Bart)

 - Sanitize the io priority handling (Jan)

 - rq-qos race fix (Jinke)

 - Reserved tags handling improvements (John)

 - Separate memory alignment from file/disk offset aligment for O_DIRECT
   (Keith)

 - Add new ublk driver, userspace block driver using io_uring for
   communication with the userspace backend (Ming)

 - Use try_cmpxchg() to cleanup the code in various spots (Uros)

 - Finally remove bdevname() (Christoph)

 - Clean up the zoned device handling (Christoph)

 - Clean up independent access range support (Christoph)

 - Clean up and improve block sysfs handling (Christoph)

 - Clean up and improve teardown of block devices.

   This turns the usual two step process into something that is simpler
   to implement and handle in block drivers (Christoph)

 - Clean up chunk size handling (Christoph)

 - Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu,
   Ming, Sebastian, Yang, Ying)

* tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits)
  ublk_drv: fix double shift bug
  ublk_drv: make sure that correct flags(features) returned to userspace
  ublk_drv: fix error handling of ublk_add_dev
  ublk_drv: fix lockdep warning
  block: remove __blk_get_queue
  block: call blk_mq_exit_queue from disk_release for never added disks
  blk-mq: fix error handling in __blk_mq_alloc_disk
  ublk: defer disk allocation
  ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask
  ublk: fold __ublk_create_dev into ublk_ctrl_add_dev
  ublk: cleanup ublk_ctrl_uring_cmd
  ublk: simplify ublk_ch_open and ublk_ch_release
  ublk: remove the empty open and release block device operations
  ublk: remove UBLK_IO_F_PREFLUSH
  ublk: add a MAINTAINERS entry
  block: don't allow the same type rq_qos add more than once
  mmc: fix disk/queue leak in case of adding disk failure
  ublk_drv: fix an IS_ERR() vs NULL check
  ublk: remove UBLK_IO_F_INTEGRITY
  ublk_drv: remove unneeded semicolon
  ...
2022-08-02 13:46:35 -07:00
Christoph Hellwig
81bd9328ab btrfs: fix repair of compressed extents
Currently the checksum of compressed extents is verified based on the
compressed data and the lower btrfs_bio, but the actual repair process
is driven by end_bio_extent_readpage on the upper btrfs_bio for the
decompressed data.

This has a bunch of issues, including not being able to properly
communicate the failed mirror up in case that the I/O submission got
preempted, a general loss of if an error was an I/O error or a checksum
verification failure, but most importantly that this design causes
btrfs_clean_io_failure to eventually write back the uncompressed good
data onto the disk sectors that are supposed to contain compressed data.

Fix this by moving the repair to the lower btrfs_bio.  To do so, a fair
amount of code has to be reshuffled:

 a) the lower btrfs_bio now needs a valid csum pointer.  The easiest way
    to achieve that is to pass NULL btrfs_lookup_bio_sums and just use
    the btrfs_bio management of csums.  For a compressed_bio that is
    split into multiple btrfs_bios this means additional memory
    allocations, but the code becomes a lot more regular.
 b) checksum verification now runs directly on the lower btrfs_bio instead
    of the compressed_bio.  This actually nicely simplifies the end I/O
    processing.
 c) btrfs_repair_one_sector can't just look up the logical address for
    the file offset any more, as there is no corresponding relative
    offsets that apply to the file offset and the logic address for
    compressed extents.  Instead require that the saved bvec_iter in the
    btrfs_bio is filled out for all read bios and use that, which again
    removes a fair amount of code.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 19:56:16 +02:00
Christoph Hellwig
7aa51232e2 btrfs: pass a btrfs_bio to btrfs_repair_one_sector
Pass the btrfs_bio instead of the plain bio to btrfs_repair_one_sector,
and remove the start and failed_mirror arguments in favor of deriving
them from the btrfs_bio.  For this to work ensure that the file_offset
field is also initialized for buffered I/O.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 19:55:19 +02:00
Christoph Hellwig
c144c63fd3 btrfs: repair all known bad mirrors
When there is more than a single level of redundancy there can also be
multiple bad mirrors, and the current read repair code only repairs the
last bad one.

Restructure btrfs_repair_one_sector so that it records the originally
failed mirror and the number of copies, and then repair all known bad
copies until we reach the originally failed copy in clean_io_failure.
Note that this also means the read repair reads will always start from
the next bad mirror and not mirror 0.

This fixes btrfs/265 in xfstests.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:43 +02:00
Naohiro Aota
f7b12a62f0 btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size
On zoned filesystem, data write out is limited by max_zone_append_size,
and a large ordered extent is split according the size of a bio. OTOH,
the number of extents to be written is calculated using
BTRFS_MAX_EXTENT_SIZE, and that estimated number is used to reserve the
metadata bytes to update and/or create the metadata items.

The metadata reservation is done at e.g, btrfs_buffered_write() and then
released according to the estimation changes. Thus, if the number of extent
increases massively, the reserved metadata can run out.

The increase of the number of extents easily occurs on zoned filesystem
if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size. And, it causes the
following warning on a small RAM environment with disabling metadata
over-commit (in the following patch).

[75721.498492] ------------[ cut here ]------------
[75721.505624] BTRFS: block rsv 1 returned -28
[75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G        W         5.18.0-rc2-BTRFS-ZNS+ #109
[75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
[75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
[75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
[75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
[75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
[75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
[75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
[75721.701878] FS:  0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
[75721.712601] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
[75721.730499] Call Trace:
[75721.735166]  <TASK>
[75721.739886]  btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
[75721.747545]  ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
[75721.756145]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.762852]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.769520]  ? push_leaf_left+0x420/0x620 [btrfs]
[75721.776431]  ? memcpy+0x4e/0x60
[75721.781931]  split_leaf+0x433/0x12d0 [btrfs]
[75721.788392]  ? btrfs_get_token_32+0x580/0x580 [btrfs]
[75721.795636]  ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
[75721.803759]  ? leaf_space_used+0x15d/0x1a0 [btrfs]
[75721.811156]  btrfs_search_slot+0x1bc3/0x2790 [btrfs]
[75721.818300]  ? lock_downgrade+0x7c0/0x7c0
[75721.824411]  ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
[75721.832456]  ? split_leaf+0x12d0/0x12d0 [btrfs]
[75721.839149]  ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
[75721.846945]  ? free_extent_buffer+0x13/0x20 [btrfs]
[75721.853960]  ? btrfs_release_path+0x4b/0x190 [btrfs]
[75721.861429]  btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
[75721.869313]  ? rcu_read_lock_sched_held+0x16/0x80
[75721.876085]  ? lock_release+0x552/0xf80
[75721.881957]  ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
[75721.888886]  ? __kasan_check_write+0x14/0x20
[75721.895152]  ? do_raw_read_unlock+0x44/0x80
[75721.901323]  ? _raw_write_lock_irq+0x60/0x80
[75721.907983]  ? btrfs_global_root+0xb9/0xe0 [btrfs]
[75721.915166]  ? btrfs_csum_root+0x12b/0x180 [btrfs]
[75721.921918]  ? btrfs_get_global_root+0x820/0x820 [btrfs]
[75721.929166]  ? _raw_write_unlock+0x23/0x40
[75721.935116]  ? unpin_extent_cache+0x1e3/0x390 [btrfs]
[75721.942041]  btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
[75721.949906]  ? try_to_wake_up+0x30/0x14a0
[75721.955700]  ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
[75721.962661]  ? rcu_read_lock_sched_held+0x16/0x80
[75721.969111]  ? lock_acquire+0x41b/0x4c0
[75721.974982]  finish_ordered_fn+0x15/0x20 [btrfs]
[75721.981639]  btrfs_work_helper+0x1af/0xa80 [btrfs]
[75721.988184]  ? _raw_spin_unlock_irq+0x28/0x50
[75721.994643]  process_one_work+0x815/0x1460
[75722.000444]  ? pwq_dec_nr_in_flight+0x250/0x250
[75722.006643]  ? do_raw_spin_trylock+0xbb/0x190
[75722.013086]  worker_thread+0x59a/0xeb0
[75722.018511]  kthread+0x2ac/0x360
[75722.023428]  ? process_one_work+0x1460/0x1460
[75722.029431]  ? kthread_complete_and_exit+0x30/0x30
[75722.036044]  ret_from_fork+0x22/0x30
[75722.041255]  </TASK>
[75722.045047] irq event stamp: 0
[75722.049703] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
[75722.067533] softirqs last  enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
[75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
[75722.085335] ---[ end trace 0000000000000000 ]---

To fix the estimation, we need to introduce fs_info->max_extent_size to
replace BTRFS_MAX_EXTENT_SIZE, which allow setting the different size for
regular vs zoned filesystem.

Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On zoned
filesystem, it is set to fs_info->max_zone_append_size.

CC: stable@vger.kernel.org # 5.12+
Fixes: d8e3fb106f ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:41 +02:00
Christoph Hellwig
f3e90c1ca9 btrfs: remove extent writepage address space operation
Same as in commit 21b4ee7029 ("xfs: drop ->writepage completely"): we
can remove the callback as it's only used in one place - single page
writeback from memory reclaim and is not called for cgroup writeback at
all.

We only allow such writeback from kswapd, not from direct memory
reclaim, and so it is rarely used. When it comes from kswapd, it is
effectively random dirty page shoot-down, which is horrible for IO
patterns. We can rely on background writeback to clean all dirty pages
in an efficient way and not let it be interrupted by kswapd.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:37 +02:00
David Sterba
9db33891c7 btrfs: unify tree search helper returning prev and next nodes
Simplify helper to return only next and prev pointers, we don't need all
the node/parent/prev/next pointers of __etree_search as there are now
other specialized helpers. Rename parameters so they follow the naming.

Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:36 +02:00