Commit graph

1266877 commits

Author SHA1 Message Date
Josef Bacik
0586d0a89e btrfs: move extent bit and page cleanup into cow_file_range_inline
We duplicate the extent cleanup for cow_file_range_inline() in the cow
and compressed case.  The encoded case doesn't need to do cleanup the
same way, so rename cow_file_range_inline to __cow_file_range_inline and
then make cow_file_range_inline handle the extent cleanup appropriately,
and update the callers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:09 +02:00
Josef Bacik
0332967b4d btrfs: unlock all the pages with successful inline extent creation
Since 4750af3bbe ("btrfs: prevent extent_clear_unlock_delalloc() to
unlock page not locked by __process_pages_contig()") we have been
unlocking the locked page manually instead of via
extent_clear_unlock_delalloc() because of subpage blocksize support.
However we actually disable inline extent creation for subpage blocksize
support, so this behavior isn't necessary.  Remove this code and
comment, if at some point the subpage blocksize code grows support for
inline extents this can be re-evaluated.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:09 +02:00
Josef Bacik
6eecfa2240 btrfs: push all inline logic into cow_file_range
Currently we have a lot of duplicated checks of

if (start == 0 && fs_info->sectorsize == PAGE_SIZE)
	cow_file_range_inline();

Instead of duplicating this check everywhere, consolidate all of the
inline extent logic into a helper which documents all of the checks and
then use that helper inside of cow_file_range_inline().  With this we
can clean up all of the calls to either unconditionally call
cow_file_range_inline(), or at least reduce the checks we're doing
before we call cow_file_range_inline();

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:09 +02:00
Josef Bacik
aa5ccf2917 btrfs: handle errors in btrfs_reloc_clone_csums properly
In the cow path we will clone the reloc csums for relocated data
extents, and if there's an error we already have an ordered extent and
rely on the ordered extent finishing to clean everything up.

There's a problem however, we don't mark the ordered extent with an
error, we pretend like everything was just fine.  If we were at the end
of our range we won't actually bubble up this error anywhere, and we
could end up inserting an extent that doesn't have csums where it should
have them.

Fix this by adding a helper to mark the ordered extent with an error,
and then use this when we fail to lookup the csums in
btrfs_reloc_clone_csums.  Use this helper in the other place where we
use the same pattern while we're here.

This will prevent us from erroneously inserting the extent that doesn't
have the required checksums.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:09 +02:00
Qu Wenruo
e98bf64f7a btrfs: add extra sanity checks for create_io_em()
The function create_io_em() is called before we submit an IO, to update
the in-memory extent map for the involved range.

This patch changes the following aspects:

- Does not allow BTRFS_ORDERED_NOCOW type
  For real NOCOW (excluding NOCOW writes into preallocated ranges)
  writes, we never call create_io_em(), as we does not need to update
  the extent map at all.

  So remove the sanity check allowing BTRFS_ORDERED_NOCOW type.

- Add extra sanity checks
  * PREALLOC
    - @block_len == len
      For uncompressed writes.

  * REGULAR
    - @block_len == @orig_block_len == @ram_bytes == @len
      We're creating a new uncompressed extent, and referring all of it.

    - @orig_start == @start
      We haven no offset inside the extent.

  * COMPRESSED
    - valid @compress_type
    - @len <= @ram_bytes
      This is to co-operate with encoded writes, which can cause a new
      file extent referring only part of a uncompressed extent.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:08 +02:00
Qu Wenruo
4bdc558bf9 btrfs: simplify the inline extent map creation
With the tree-checker ensuring all inline file extents starts at file
offset 0 and has a length no larger than sectorsize, we can simplify the
calculation to assigned those fixes values directly.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:08 +02:00
Qu Wenruo
319d91ee72 btrfs: add extra comments on extent_map members
The extent_map structure is very critical to btrfs, as it is involved
for both read and write paths.

Unfortunately the structure is not properly explained, making it pretty
hard to understand nor to do further improvement.

This patch adds extra comments explaining the major members based on my
code reading.  Hopefully we can find more members to cleanup in the
future.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:08 +02:00
Naohiro Aota
30704a0d56 btrfs: drop unused argument of calcu_metadata_size()
calcu_metadata_size() has a "reserve" argument, but the only caller always
set it to "1". The other usage (reserve = 0) is dropped by a commit
0647bf564f ("Btrfs: improve forever loop when doing balance relocation"),
which is more than 10 years ago. Drop the argument and simplify the code.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:08 +02:00
Anand Jain
33a44f3760 btrfs: simplify return variables in btrfs_drop_subtree()
There's another return variable wret that is only passed to ret on
error, we can simply use ret.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:08 +02:00
Anand Jain
1618aa3c2e btrfs: simplify return variables in lookup_extent_data_ref()
First, drop err instead reuse ret, choose to return the error instead of
goto fail and then return the same error. Do not initialize the ret
until where it has to be initialized. Slight logic change in handling
the btrfs_search_slot() and btrfs_next_leaf() return value.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:08 +02:00
Anand Jain
6e812a9c65 btrfs: rename return variables in btrfs_qgroup_rescan_worker()
Rename ret to ret2 compile and then err to ret. Also, new ret2 is found
to be localized within the 'if (trans)' statement, so move its
declaration there.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:08 +02:00
Anand Jain
5e8fb9b84b btrfs: drop variable err in quick_update_accounting()
In quick_update_accounting() err is used as 2nd return value, which could
be achieved just with ret.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:08 +02:00
Anand Jain
acde0e8609 btrfs: reuse ret instead of err in relocate_tree_blocks()
Coding style fixes the function relocate_tree_blocks().  After the fix,
ret is the return value variable.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:08 +02:00
Anand Jain
2daca1e419 btrfs: rename err and ret to ret in build_backref_tree()
Code style fix in the function build_backref_tree().  Drop the ret
initialization 0, as we don't need it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:08 +02:00
Anand Jain
1e8a42375f btrfs: rename werr and err to ret in __btrfs_wait_marked_extents()
Rename the function's local return variables err and werr to ret.
Also, align the variable declarations with the other declarations in
the function for better function space alignment.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:08 +02:00
Anand Jain
ce87531120 btrfs: rename werr and err to ret in btrfs_write_marked_extents()
Rename the function's local variable werr and err to ret.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:07 +02:00
Anand Jain
9a7b68d32a btrfs: report filemap_fdata<write|wait>_range() error
In the function btrfs_write_marked_extents() and in __btrfs_wait_marked_extents()
return the actual error if when filemap_fdata<write|wait>_range() fails.

Suggested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:07 +02:00
David Sterba
fef998d1a0 btrfs: use btrfs_is_testing() everywhere
There are open coded tests of BTRFS_FS_STATE_DUMMY_FS_INFO and we have a
wrapper for that that's a compile-time constant when self-tests are not
built in. As this is only for development we can save some bytes and
conditions on release configs by using the helper in the remaining
cases.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:07 +02:00
Filipe Manana
905a95f3dd btrfs: initialize delayed inodes xarray without GFP_ATOMIC
There's no need to initialize the delayed inodes xarray with a GFP_ATOMIC
flag because that actually does nothing on the xarray operations. That was
needed for radix trees, but for xarrays the allocation flags are passed as
the last argument to xa_store() (which we are using correctly).

So initialize the delayed inodes xarray with a simple xa_init().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:07 +02:00
Filipe Manana
de6f14e83e btrfs: make try_release_extent_mapping() return a bool
Currently try_release_extent_mapping() as an int return type, but we
use it as a boolean. Its only caller, the release folio callback, also
returns a boolean which corresponds to try_release_extent_mapping()'s
return value. So change its return value type to bool as well as its
helper try_release_extent_state().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:07 +02:00
Filipe Manana
2e504418e4 btrfs: be better releasing extent maps at try_release_extent_mapping()
At try_release_extent_mapping(), called during the release folio callback
(btrfs_release_folio() callchain), we don't release any extent maps in the
range if the GFP flags don't allow blocking. This behaviour is exaggerated
because:

1) Both searching for extent maps and removing them are not blocking
   operations. The only thing that it is the cond_resched() call at the
   end of the loop that searches for and removes extent maps;

2) We currently only operate on a single page, so for the case where
   block size matches the page size, we can only have one extent map,
   and for the case where the block size is smaller than the page size,
   we can have at most 16 extent maps.

So it's very unlikely the cond_resched() call will ever block even in the
block size smaller than page size scenario.

So instead of not removing any extent maps at all in case the GFP glags
don't allow blocking, keep removing extent maps while we don't need to
reschedule. This makes it safe for the subpage case and for a future
where we can process folios with a size larger than a page.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:07 +02:00
Filipe Manana
433a3e01dd btrfs: remove i_size restriction at try_release_extent_mapping()
Currently we don't attempt to release extent maps if the inode has an
i_size that is not greater than 16M. This condition was added way back
in 2008 by commit 70dec8079d ("Btrfs: extent_io and extent_state
optimizations"), without any explanation about it. A quick chat with
Chris on slack revealed that the goal was probably to release the extent
maps for small files only when closing the inode. This however can be
harmful in case we have tons of such files being kept open for very long
periods of time, since we will consume more and more pages for extent
maps.

So remove the condition.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:07 +02:00
Filipe Manana
85d288309a btrfs: use btrfs_get_fs_generation() at try_release_extent_mapping()
Nowadays we have the btrfs_get_fs_generation() to get the current
generation of the filesystem, so there's no need anymore to lock the
transaction spinlock to read it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:07 +02:00
Filipe Manana
078b981aaa btrfs: rename some variables at try_release_extent_mapping()
Rename the following variables:

1) "btrfs_inode" to "inode", because it's shorter to type and clear, and
   we don't have a VFS inode here as well, so there's no confusion;

2) "tree" to "io_tree", to be clear which tree we are dealing with, since
   we use 2 different trees in the function;

3) "map" to "extent_tree" since "map" gives the idea we are dealing with
   an extent map for example, but we are dealing with the inode's extent
   tree (the tree which stores extent maps).

These also make the next patches simpler.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:07 +02:00
Filipe Manana
0d89a15e1a btrfs: add tracepoints for extent map shrinker events
Add some tracepoints for the extent map shrinker to help debug and analyse
main events. These have proved useful during development of the shrinker.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:07 +02:00
Filipe Manana
65bb9fb00b btrfs: update comment for btrfs_set_inode_full_sync() about locking
Nowadays we have a lock used to synchronize mmap writes with reflink and
fsync operations (struct btrfs_inode::i_mmap_lock), so update the comment
for btrfs_set_inode_full_sync() to mention that it can also be called
while holding that mmap lock. Besides being a valid alternative to the
inode's VFS lock, we already have the extent map shrinker using that mmap
lock instead.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:07 +02:00
Filipe Manana
956a17d9d0 btrfs: add a shrinker for extent maps
Extent maps are used either to represent existing file extent items, or to
represent new extents that are going to be written and the respective file
extent items are created when the ordered extent completes.

We currently don't have any limit for how many extent maps we can have,
neither per inode nor globally. Most of the time this not too noticeable
because extent maps are removed in the following situations:

1) When evicting an inode;

2) When releasing folios (pages) through the btrfs_release_folio() address
   space operation callback.

   However we won't release extent maps in the folio range if the folio is
   either dirty or under writeback or if the inode's i_size is less than
   or equals to 16M (see try_release_extent_mapping(). This 16M i_size
   constraint was added back in 2008 with commit 70dec8079d ("Btrfs:
   extent_io and extent_state optimizations"), but there's no explanation
   about why we have it or why the 16M value.

This means that for buffered IO we can reach an OOM situation due to too
many extent maps if either of the following happens:

1) There's a set of tasks constantly doing IO on many files with a size
   not larger than 16M, specially if they keep the files open for very
   long periods, therefore preventing inode eviction.

   This requires a really high number of such files, and having many non
   mergeable extent maps (due to random 4K writes for example) and a
   machine with very little memory;

2) There's a set tasks constantly doing random write IO (therefore
   creating many non mergeable extent maps) on files and keeping them
   open for long periods of time, so inode eviction doesn't happen and
   there's always a lot of dirty pages or pages under writeback,
   preventing btrfs_release_folio() from releasing the respective extent
   maps.

This second case was actually reported in the thread pointed by the Link
tag below, and it requires a very large file under heavy IO and a machine
with very little amount of RAM, which is probably hard to happen in
practice in a real world use case.

However when using direct IO this is not so hard to happen, because the
page cache is not used, and therefore btrfs_release_folio() is never
called. Which means extent maps are dropped only when evicting the inode,
and that means that if we have tasks that keep a file descriptor open and
keep doing IO on a very large file (or files), we can exhaust memory due
to an unbounded amount of extent maps. This is especially easy to happen
if we have a huge file with millions of small extents and their extent
maps are not mergeable (non contiguous offsets and disk locations).
This was reported in that thread with the following fio test:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdj
   MNT=/mnt/sdj
   MOUNT_OPTIONS="-o ssd"
   MKFS_OPTIONS=""

   cat <<EOF > /tmp/fio-job.ini
   [global]
   name=fio-rand-write
   filename=$MNT/fio-rand-write
   rw=randwrite
   bs=4K
   direct=1
   numjobs=16
   fallocate=none
   time_based
   runtime=90000

   [file1]
   size=300G
   ioengine=libaio
   iodepth=16

   EOF

   umount $MNT &> /dev/null
   mkfs.btrfs -f $MKFS_OPTIONS $DEV
   mount $MOUNT_OPTIONS $DEV $MNT

   fio /tmp/fio-job.ini
   umount $MNT

Monitoring the btrfs_extent_map slab while running the test with:

   $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
                        /sys/kernel/slab/btrfs_extent_map/total_objects'

Shows the number of active and total extent maps skyrocketing to tens of
millions, and on systems with a short amount of memory it's easy and quick
to get into an OOM situation, as reported in that thread.

So to avoid this issue add a shrinker that will remove extents maps, as
long as they are not pinned, and takes proper care with any concurrent
fsync to avoid missing extents (setting the full sync flag while in the
middle of a fast fsync). This shrinker is triggered through the callbacks
nr_cached_objects and free_cached_objects of struct super_operations.

The shrinker will iterate over all roots and over all inodes of each
root, and keeps track of the last scanned root and inode, so that the
next time it runs, it starts from that root and from the next inode.
This is similar to what xfs does for its inode reclaim (implements those
callbacks, and cycles through inodes by starting from where it ended
last time).

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:06 +02:00
Filipe Manana
f1d97e7691 btrfs: add a global per cpu counter to track number of used extent maps
Add a per cpu counter that tracks the total number of extent maps that are
in extent trees of inodes that belong to fs trees. This is going to be
used in an upcoming change that adds a shrinker for extent maps. Only
extent maps for fs trees are considered, because for special trees such as
the data relocation tree we don't want to evict their extent maps which
are critical for the relocation to work, and since those are limited, it's
not a concern to have them in memory during the relocation of a block
group. Another case are extent maps for free space cache inodes, which
must always remain in memory, but those are limited (there's only one per
free space cache inode, which means one per block group).

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:06 +02:00
Filipe Manana
5fa8a6baff btrfs: pass the extent map tree's inode to try_merge_map()
Extent maps are always associated to an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to try_merge_map().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change try_merge_map() to receive the inode instead of its extent
map tree.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:06 +02:00
Filipe Manana
e778724a5e btrfs: pass the extent map tree's inode to setup_extent_mapping()
Extent maps are always associated to an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to
setup_extent_mapping().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change setup_extent_mapping() to receive the inode instead of its
extent map tree.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:06 +02:00
Filipe Manana
6a3a9113ae btrfs: pass the extent map tree's inode to replace_extent_mapping()
Extent maps are always associated to an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to
replace_extent_mapping().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change replace_extent_mapping() to receive the inode instead of its
extent map tree.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:06 +02:00
Filipe Manana
c2fbd812d7 btrfs: pass the extent map tree's inode to remove_extent_mapping()
Extent maps are always associated to an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to
remove_extent_mapping().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change remove_extent_mapping() to receive the inode instead of its
extent map tree.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:06 +02:00
Filipe Manana
002f3a2ce8 btrfs: pass the extent map tree's inode to clear_em_logging()
Extent maps are always associated to an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to
clear_em_logging().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change clear_em_logging() to receive the inode instead of its extent
map tree.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:06 +02:00
Filipe Manana
6c566def95 btrfs: pass the extent map tree's inode to add_extent_mapping()
Extent maps are always added to an inode's extent map tree, so there's no
need to pass the extent map tree explicitly to add_extent_mapping().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change add_extent_mapping() to receive the inode instead of its
extent map tree.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:06 +02:00
Josef Bacik
e094f48040 btrfs: change root->root_key.objectid to btrfs_root_id()
A comment from Filipe on one of my previous cleanups brought my
attention to a new helper we have for getting the root id of a root,
which makes it easier to read in the code.

The changes where made with the following Coccinelle semantic patch:

// <smpl>
@@
expression E,E1;
@@
(
 E->root_key.objectid = E1
|
- E->root_key.objectid
+ btrfs_root_id(E)
)
// </smpl>

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:06 +02:00
Josef Bacik
53e2415868 btrfs: set start on clone before calling copy_extent_buffer_full
Our subpage testing started hanging on generic/560 and I bisected it
down to 1cab1375ba ("btrfs: reuse cloned extent buffer during
fiemap to avoid re-allocations").  This is subtle because we use
eb->start to figure out where in the folio we're copying to when we're
subpage, as our ->start may refer to an area inside of the folio.

For example, assume a 16K page size machine with a 4K node size, and
assume that we already have a cloned extent buffer when we cloned the
previous search.

copy_extent_buffer_full() will do the following when copying the extent
buffer path->nodes[0] (src) into cloned (dest):

  src->start = 8k; // this is the new leaf we're cloning
  cloned->start = 4k; // this is left over from the previous clone

  src_addr = folio_address(src->folios[0]);
  dest_addr = folio_address(dest->folios[0]);

  memcpy(dest_addr + get_eb_offset_in_folio(dst, 0),
	 src_addr + get_eb_offset_in_folio(src, 0), src->len);

Now get_eb_offset_in_folio() is where the problems occur, because for
sub-pagesize blocksize we can have multiple eb's per folio, the code for
this is as follows

  size_t get_eb_offset_in_folio(eb, offset) {
	  return (eb->start + offset & (folio_size(eb->folio[0]) - 1));
  }

So in the above example we are copying into offset 4K inside the folio.
However once we update cloned->start to 8K to match the src the math for
get_eb_offset_in_folio() changes, and any subsequent reads (i.e.
btrfs_item_key_to_cpu()) will start reading from the offset 8K instead
of 4K where we copied to, giving us garbage.

Fix this by setting start before we co copy_extent_buffer_full() to make
sure that we're copying into the same offset inside of the folio that we
will read from later.

All other sites of copy_extent_buffer_full() are correct because we
either set ->start beforehand or we simply don't change it in the case
of the tree-log usage.

With this fix we now pass generic/560 on our subpage tests.

Fixes: 1cab1375ba ("btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:06 +02:00
Josef Bacik
99f2be1522 btrfs: replace btrfs_delayed_*_ref with btrfs_*_ref
Now that these two structs are the same, move the btrfs_data_ref and
btrfs_tree_ref up and use these in the btrfs_delayed_ref_node.  Then
remove the btrfs_delayed_*_ref structs.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
7f6af7c434 btrfs: remove the btrfs_delayed_ref_node container helpers
Now that we don't use these helpers anywhere, remove them.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
efc7d5dbf8 btrfs: stop referencing btrfs_delayed_tree_ref directly
We only ever need to use this to get the level of the tree block ref, so
use the btrfs_delayed_ref_owner() helper, which returns the level for
the given reference.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
44cc2e38e6 btrfs: stop referencing btrfs_delayed_data_ref directly
Now that most of our elements are inside of btrfs_delayed_ref_node
directly and we have helpers for the delayed_data_ref bits, go ahead and
remove all direct usage of btrfs_delayed_data_ref and use the helpers
where needed.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
b4b5934ac1 btrfs: make the insert backref helpers take a btrfs_delayed_ref_node
We don't need to pass in all the elements for the backrefs as function
arguments, simply pass through the btrfs_delayed_ref_node and then
extract the values we need from that.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
85bb9f544e btrfs: drop unnecessary arguments from __btrfs_free_extent
We have all the information we need in our btrfs_delayed_ref_node, which
we already pass into __btrfs_free_extent.  Drop the extra arguments and
just extract the values from btrfs_delayed_ref_node.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
a502f112ad btrfs: make __btrfs_inc_extent_ref take a btrfs_delayed_ref_node
We're just extracting the values from btrfs_delayed_ref_node and passing
them through, simply pass the btrfs_delayed_ref_node into
__btrfs_inc_extent_ref and shrink the function arguments.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
5366763446 btrfs: rename btrfs_data_ref->ino to ->objectid
This is how we refer to it in the rest of the extent reference related
code, make it consistent.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
cf4f04325b btrfs: move ->parent and ->ref_root into btrfs_delayed_ref_node
These two members are shared by both the tree refs and data refs, so
move them into btrfs_delayed_ref_node proper.  This allows us to greatly
simplify the comparison code, as the shared refs always only sort on
parent, and the non shared refs always sort first on ref_root, and then
only data refs sort on their specific fields.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
12390e42b6 btrfs: rename ->len to ->num_bytes in btrfs_ref
We consistently use ->num_bytes everywhere through the delayed ref code,
except in btrfs_ref.  Rename btrfs_ref to match all the other code.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
f75464f7bb btrfs: unify the btrfs_add_delayed_*_ref helpers into one helper
Now that these helpers are identical, create a helper function that
handles everything properly and strip the individual helpers down to use
just the common helper. This cleans up a significant amount of
duplicated code.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:05 +02:00
Josef Bacik
1bff6d4f87 btrfs: simplify delayed ref tracepoints
Now that all of the delayed ref information is in the delayed ref node,
drastically simplify the delayed ref tracepoints by simply passing in
the btrfs_delayed_ref_node and populating the tracepoints with the
values from the structure itself.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:04 +02:00
Josef Bacik
0ea4703cc2 btrfs: move ref specific initialization into init_delayed_ref_common
Now that the btrfs_delayed_ref_node contains a union of the data and
metadata specific information we can move the initialization into
init_delayed_ref_common and just use the btrfs_ref to initialize the
correct fields of the reference.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:04 +02:00
Josef Bacik
0509cc5661 btrfs: initialize btrfs_delayed_ref_head with btrfs_ref
We are calling init_delayed_ref_head with all of the elements from
btrfs_ref, clean this up to simply pass in the btrfs_ref and initialize
the btrfs_delayed_ref_head with the values from the btrfs_ref directly.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:04 +02:00