There is a potential parallel list adding for retrying in
btrfs_reclaim_bgs_work and adding to the unused list. Since the block
group is removed from the reclaim list and it is on a relocation work,
it can be added into the unused list in parallel. When that happens,
adding it to the reclaim list will corrupt the list head and trigger
list corruption like below.
Fix it by taking fs_info->unused_bgs_lock.
[177.504][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104
[177.514][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0)
[177.529][T2585409] ------------[ cut here ]------------
[177.537][T2585409] kernel BUG at lib/list_debug.c:65!
[177.545][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
[177.555][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G W 6.10.0-rc5-kts #1
[177.568][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022
[177.579][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs]
[177.589][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72
[177.624][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286
[177.633][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000
[177.644][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40
[177.655][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08
[177.665][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0
[177.676][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000
[177.687][T2585409] FS: 0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000
[177.700][T2585409] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[177.709][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0
[177.720][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000
[177.731][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400
[177.742][T2585409] PKRU: 55555554
[177.748][T2585409] Call Trace:
[177.753][T2585409] <TASK>
[177.759][T2585409] ? __die_body.cold+0x19/0x27
[177.766][T2585409] ? die+0x2e/0x50
[177.772][T2585409] ? do_trap+0x1ea/0x2d0
[177.779][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.788][T2585409] ? do_error_trap+0xa3/0x160
[177.795][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.805][T2585409] ? handle_invalid_op+0x2c/0x40
[177.812][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.820][T2585409] ? exc_invalid_op+0x2d/0x40
[177.827][T2585409] ? asm_exc_invalid_op+0x1a/0x20
[177.834][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.843][T2585409] btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs]
There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is
safe, AFAICS. Since the block group was in the unused list, the used bytes
should be 0 when it was added to the unused list. Then, it checks
block_group->{used,reserved,pinned} are still 0 under the
block_group->lock. So, they should be still eligible for the unused list,
not the reclaim list.
The reason it is safe there it's because because we're holding
space_info->groups_sem in write mode.
That means no other task can allocate from the block group, so while we
are at deleted_unused_bgs() it's not possible for other tasks to
allocate and deallocate extents from the block group, so it can't be
added to the unused list or the reclaim list by anyone else.
The bug can be reproduced by btrfs/166 after a few rounds. In practice
this can be hit when relocation cannot find more chunk space and ends
with ENOSPC.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Fixes: 4eb4e85c4f ("btrfs: retry block group reclaim without infinite loop")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If inc_block_group_ro systematically fails (e.g. due to ETXTBUSY from
swap) or btrfs_relocate_chunk systematically fails (from lack of
space), then this worker becomes an infinite loop.
At the very least, this strands the cleaner thread, but can also result
in hung tasks/RCU stalls on PREEMPT_NONE kernels and if the
reclaim_bgs_lock mutex is not contended.
I believe the best long term fix is to manage reclaim via work queue,
where we queue up a relocation on the triggering condition and re-queue
on failure. In the meantime, this is an easy fix to apply to avoid the
immediate pain.
Fixes: 7e27180994 ("btrfs: reinsert BGs failed to reclaim")
CC: stable@vger.kernel.org # 6.6+
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit f4a9f21941 ("btrfs: do not delete unused block group if it may be
used soon") changed the behaviour of deleting unused block-groups on zoned
filesystems. Starting with this commit, we're using
btrfs_space_info_used() to calculate the number of used bytes in a
space_info. But btrfs_space_info_used() also accounts
btrfs_space_info::bytes_zone_unusable as used bytes.
So if a block group is 100% zone_unusable it is skipped from the deletion
step.
In order not to skip fully zone_unusable block-groups, also check if the
block-group has bytes left that can be used on a zoned filesystem.
Fixes: f4a9f21941 ("btrfs: do not delete unused block group if it may be used soon")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_put_caching_control() is only used in block-group.c, so mark it
static.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Lijuan Li <lilijuan@iscas.ac.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The unlikely case of lookup error in btrfs_remove_block_group() can be
handled properly, in its caller this would lead to a transaction abort.
We can't do anything else, a block group must have been loaded first.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_delete_unused_bgs(), the use of the list_is_singular() check on
a block group may not be immediately obvious. It is there to prevent
losing raid profile information for a block group type (data, metadata or
system), as that information is removed from
fs_info->avail_[data|metadata|system]_alloc_bits when the last block group
of a given type is deleted. So deleting the block group would later result
in creating block groups of that type with a single profile (because
fs_info->avail_*_alloc_bits would have a value of 0).
This check was added in commit aefbe9a633 ("btrfs: Fix lost-data-profile
caused by auto removing bg").
So add a comment mentioning the need for the check.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Handle the lookup failure of the block group to unpin, this is a logic
error as the block group must exist at this point. If not, something else
must have freed it, like clean_pinned_extents() would do without locking
the unused_bg_unpin_mutex.
Push the errors to the callers, proper handling will be done in followup
patches.
Signed-off-by: David Sterba <dsterba@suse.com>
Space reservations for metadata are, most of the time, pessimistic as we
reserve space for worst possible cases - where tree heights are at the
maximum possible height (8), we need to COW every extent buffer in a tree
path, need to split extent buffers, etc.
For data, we generally reserve the exact amount of space we are going to
allocate. The exception here is when using compression, in which case we
reserve space matching the uncompressed size, as the compression only
happens at writeback time and in the worst possible case we need that
amount of space in case the data is not compressible.
This means that when there's not available space in the corresponding
space_info object, we may need to allocate a new block group, and then
that block group might not be used after all. In this case the block
group is never added to the list of unused block groups and ends up
never being deleted - except if we unmount and mount again the fs, as
when reading block groups from disk we add unused ones to the list of
unused block groups (fs_info->unused_bgs). Otherwise a block group is
only added to the list of unused block groups when we deallocate the
last extent from it, so if no extent is ever allocated, the block group
is kept around forever.
This also means that if we have a bunch of tasks reserving space in
parallel we can end up allocating many block groups that end up never
being used or kept around for too long without being used, which has
the potential to result in ENOSPC failures in case for example we over
allocate too many metadata block groups and then end up in a state
without enough unallocated space to allocate a new data block group.
This is more likely to happen with metadata reservations as of kernel
6.7, namely since commit 28270e25c6 ("btrfs: always reserve space for
delayed refs when starting transaction"), because we started to always
reserve space for delayed references when starting a transaction handle
for a non-zero number of items, and also to try to reserve space to fill
the gap between the delayed block reserve's reserved space and its size.
So to avoid this, when finishing the creation a new block group, add the
block group to the list of unused block groups if it's still unused at
that time. This way the next time the cleaner kthread runs, it will delete
the block group if it's still unused and not needed to satisfy existing
space reservations.
Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/9cdbf0ca9cdda1b4c84e15e548af7d7f9f926382.camel@intelfx.name/
CC: stable@vger.kernel.org # 6.7+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before deleting a block group that is in the list of unused block groups
(fs_info->unused_bgs), we check if the block group became used before
deleting it, as extents from it may have been allocated after it was added
to the list.
However even if the block group was not yet used, there may be tasks that
have only reserved space and have not yet allocated extents, and they
might be relying on the availability of the unused block group in order
to allocate extents. The reservation works first by increasing the
"bytes_may_use" field of the corresponding space_info object (which may
first require flushing delayed items, allocating a new block group, etc),
and only later a task does the actual allocation of extents.
For metadata we usually don't end up using all reserved space, as we are
pessimistic and typically account for the worst cases (need to COW every
single node in a path of a tree at maximum possible height, etc). For
data we usually reserve the exact amount of space we're going to allocate
later, except when using compression where we always reserve space based
on the uncompressed size, as compression is only triggered when writeback
starts so we don't know in advance how much space we'll actually need, or
if the data is compressible.
So don't delete an unused block group if the total size of its space_info
object minus the block group's size is less then the sum of used space and
space that may be used (space_info->bytes_may_use), as that means we have
tasks that reserved space and may need to allocate extents from the block
group. In this case, besides skipping the deletion, re-add the block group
to the list of unused block groups so that it may be reconsidered later,
in case the tasks that reserved space end up not needing to allocate
extents from it.
Allowing the deletion of the block group while we have reserved space, can
result in tasks failing to allocate metadata extents (-ENOSPC) while under
a transaction handle, resulting in a transaction abort, or failure during
writeback for the case of data extents.
CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a helper function to determine if a block group is being used and make
use of it at btrfs_delete_unused_bgs(). This helper will also be used in
future code changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not needed to have a local variable to store the stripe size at
insert_dev_extents(), we can just take from the chunk map as it's only
used once and typing 'map->stripe_size' is not much more verbose than
simply typing 'stripe_size'. So remove the local variable.
This was added before the recent addition of a dedicated structure for
chunk mappings because the stripe size was encoded in the 'orig_block_len'
field of an extent_map structure, so the use of the local variable made
things more readable.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When starting a transaction to remove a block group we have one ASSERT
that checks we found an extent map and that the extent map's start offset
matches the desired chunk offset. In case one of the conditions fails, we
get a stack trace that point to the respective line of code, however we
can't tell which condition failed: either there's no extent map or we got
one with an unexpected start offset. To make such an issue easier to debug
and analyse, split the assertion into two, one for each condition. This
was actually triggered during development of another upcoming change.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Space for block group item insertions, necessary after allocating a new
block group, is reserved in the delayed refs block reserve. Currently we
do this by incrementing the transaction handle's delayed_ref_updates
counter and then calling btrfs_update_delayed_refs_rsv(), which will
increase the size of the delayed refs block reserve by an amount that
corresponds to the same amount we use for delayed refs, given by
btrfs_calc_delayed_ref_bytes().
That is an excessive amount because it corresponds to the amount of space
needed to insert one item in a btree (btrfs_calc_insert_metadata_size())
times 2 when the free space tree feature is enabled. All we need is an
amount as given by btrfs_calc_insert_metadata_size(), since we only need to
insert a block group item in the extent tree (or block group tree if this
feature is enabled). By using btrfs_calc_insert_metadata_size() we will
need to reserve 2 times less space when using the free space tree, putting
less pressure on space reservation.
So use helpers to reserve and release space for block group item
insertions that use btrfs_calc_insert_metadata_size() for calculation of
the space.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Space for block group item updates, necessary after allocating or
deallocating an extent from a block group, is reserved in the delayed
refs block reserve. Currently we do this by incrementing the transaction
handle's delayed_ref_updates counter and then calling
btrfs_update_delayed_refs_rsv(), which will increase the size of the
delayed refs block reserve by an amount that corresponds to the same
amount we use for delayed refs, given by btrfs_calc_delayed_ref_bytes().
That is an excessive amount because it corresponds to the amount of space
needed to insert one item in a btree (btrfs_calc_insert_metadata_size())
times 2 when the free space tree feature is enabled. All we need is an
amount as given by btrfs_calc_metadata_size(), since we only need to
update an existing block group item in the extent tree (or block group
tree if this feature is enabled). By using btrfs_calc_metadata_size() we
will need to reserve 4 times less space when using the free space tree
and 2 times less space when not using it, putting less pressure on space
reservation.
So use helpers to reserve and release space for block group item updates
that use btrfs_calc_metadata_size() for calculation of the space.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The root argument for btrfs_update_inode() always matches the root of the
given inode, so remove the root argument and get it from the inode
argument.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two callbacks defined in btrfs_work but only two actually make
use of them, otherwise there are NULLs. We can get rid of the freeing
callback making it a special case of the normal work. This reduces the
size of btrfs_work by 8 bytes, final layout:
struct btrfs_work {
btrfs_func_t func; /* 0 8 */
btrfs_ordered_func_t ordered_func; /* 8 8 */
struct work_struct normal_work; /* 16 32 */
struct list_head ordered_list; /* 48 16 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct btrfs_workqueue * wq; /* 64 8 */
long unsigned int flags; /* 72 8 */
/* size: 80, cachelines: 2, members: 6 */
/* last cacheline: 16 bytes */
};
This in turn reduces size of other structures (on a release config):
- async_chunk 160 -> 152
- async_submit_bio 152 -> 144
- btrfs_async_delayed_work 104 -> 96
- btrfs_caching_control 176 -> 168
- btrfs_delalloc_work 144 -> 136
- btrfs_fs_info 3608 -> 3600
- btrfs_ordered_extent 440 -> 424
- btrfs_writepage_fixup 104 -> 96
Signed-off-by: David Sterba <dsterba@suse.com>
When an extent is allocated or freed, we call btrfs_update_block_group()
to update its block group and space info. An extent always belongs to a
single block group, it can never span multiple block groups, so the loop
we have at btrfs_update_block_group() is pointless, as it always has a
single iteration. The loop was added in the very early days, 2007, when
the block group code was added in commit 9078a3e1e4 ("Btrfs: start of
block group code"), but even back then it seemed pointless.
So remove the loop and assert the block group containing the start offset
of the extent also contains the whole extent.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(),
we check if its generation matches the running transaction and if not we
just print a warning. Such mismatch is an indicator that something really
went wrong and only printing a warning message (and stack trace) is not
enough to prevent a corruption. Allowing a transaction to commit with such
an extent buffer will trigger an error if we ever try to read it from disk
due to a generation mismatch with its parent generation.
So abort the current transaction with -EUCLEAN if we notice a generation
mismatch. For this we need to pass a transaction handle to
btrfs_mark_buffer_dirty() which is always available except in test code,
in which case we can pass NULL since it operates on dummy extent buffers
and all test roots have a single node/leaf (root node at level 0).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently when reserving space for deleting the csum items for a data
extent, when adding or updating a delayed ref head, we determine how
many leaves of csum items we can have and then pass that number to the
helper btrfs_calc_delayed_ref_bytes(). This helper is used for calculating
space for all tree modifications we need when running delayed references,
however the amount of space it computes is excessive for deleting csum
items because:
1) It uses btrfs_calc_insert_metadata_size() which is excessive because
we only need to delete csum items from the csum tree, we don't need
to insert any items, so btrfs_calc_metadata_size() is all we need (as
it computes space needed to delete an item);
2) If the free space tree is enabled, it doubles the amount of space,
which is pointless for csum deletion since we don't need to touch the
free space tree or any other tree other than the csum tree.
So improve on this by tracking how many csum deletions we have and using
a new helper to calculate space for csum deletions (just a wrapper around
btrfs_calc_metadata_size() with a comment). This reduces the amount of
space we need to reserve for csum deletions by a factor of 4, and it helps
reduce the number of times we have to block space reservations and have
the reclaim task enter the space flushing algorithm (flush delayed items,
flush delayed refs, etc) in order to satisfy tickets.
For example this results in a total time decrease when unlinking (or
truncating) files with many extents, as we end up having to block on space
metadata reservations less often. Example test:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/test
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
# Use compression to quickly create files with a lot of extents
# (each with a size of 128K).
mount -o compress=lzo $DEV $MNT
# 100G gives at least 983040 extents with a size of 128K.
xfs_io -f -c "pwrite -S 0xab -b 1M 0 120G" $MNT/foobar
# Flush all delalloc and clear all metadata from memory.
umount $MNT
mount -o compress=lzo $DEV $MNT
start=$(date +%s%N)
rm -f $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "rm took $dur milliseconds"
umount $MNT
Before this change rm took: 7504 milliseconds
After this change rm took: 6574 milliseconds (-12.4%)
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When running delayed references, through btrfs_run_delayed_refs(), we can
specify how many to run, run all existing delayed references and keep
running delayed references while we can find any. This is controlled with
the value of the 'count' argument, where a value of 0 means to run all
delayed references that exist by the time btrfs_run_delayed_refs() is
called, (unsigned long)-1 means to keep running delayed references while
we are able find any, and any other value to run that exact number of
delayed references.
Typically a specific value other than 0 or -1 is used when flushing space
to try to release a certain amount of bytes for a ticket. In this case
we just simply calculate how many delayed reference heads correspond to a
specific amount of bytes, with calc_delayed_refs_nr(). However that only
takes into account the space reserved for the reference heads themselves,
and does not account for the space reserved for deleting checksums from
the csum tree (see add_delayed_ref_head() and update_existing_head_ref())
in case we are going to delete a data extent. This means we may end up
running more delayed references than necessary in case we process delayed
references for deleting a data extent.
So change the logic of btrfs_run_delayed_refs() to take a bytes argument
to specify how many bytes of delayed references to run/release, using the
special values of 0 to mean all existing delayed references and U64_MAX
(or (u64)-1) to keep running delayed references while we can find any.
This prevents running more delayed references than necessary, when we have
delayed references for deleting data extents, but also makes the upcoming
changes/patches simpler and it's preparatory work for them.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 675dfe1223 ("btrfs: fix block group item corruption after
inserting new block group") fixed one race that resulted in not persisting
a block group's item when its "used" bytes field decreases to zero.
However there's another race that can happen in a much shorter time window
that results in the same problem. The following sequence of steps explains
how it can happen:
1) Task A creates a metadata block group X, its "used" and "commit_used"
fields are initialized to 0;
2) Two extents are allocated from block group X, so its "used" field is
updated to 32K, and its "commit_used" field remains as 0;
3) Transaction commit starts, by some task B, and it enters
btrfs_start_dirty_block_groups(). There it tries to update the block
group item for block group X, which currently has its "used" field with
a value of 32K and its "commit_used" field with a value of 0. However
that fails since the block group item was not yet inserted, so at
update_block_group_item(), the btrfs_search_slot() call returns 1, and
then we set 'ret' to -ENOENT. Before jumping to the label 'fail'...
4) The block group item is inserted by task A, when for example
btrfs_create_pending_block_groups() is called when releasing its
transaction handle. This results in insert_block_group_item() inserting
the block group item in the extent tree (or block group tree), with a
"used" field having a value of 32K and setting "commit_used", in struct
btrfs_block_group, to the same value (32K);
5) Task B jumps to the 'fail' label and then resets the "commit_used"
field to 0. At btrfs_start_dirty_block_groups(), because -ENOENT was
returned from update_block_group_item(), we add the block group again
to the list of dirty block groups, so that we will try again in the
critical section of the transaction commit when calling
btrfs_write_dirty_block_groups();
6) Later the two extents from block group X are freed, so its "used" field
becomes 0;
7) If no more extents are allocated from block group X before we get into
btrfs_write_dirty_block_groups(), then when we call
update_block_group_item() again for block group X, we will not update
the block group item to reflect that it has 0 bytes used, because the
"used" and "commit_used" fields in struct btrfs_block_group have the
same value, a value of 0.
As a result after committing the transaction we have an empty block
group with its block group item having a 32K value for its "used" field.
This will trigger errors from fsck ("btrfs check" command) and after
mounting again the fs, the cleaner kthread will not automatically delete
the empty block group, since its "used" field is not 0. Possibly there
are other issues due to this inconsistency.
When this issue happens, the error reported by fsck is like this:
[1/7] checking root items
[2/7] checking extents
block group [1104150528 1073741824] used 39796736 but extent items used 0
ERROR: errors found in extent allocation tree or chunk allocation
(...)
So fix this by not resetting the "commit_used" field of a block group when
we don't find the block group item at update_block_group_item().
Fixes: 7248e0cebb ("btrfs: skip update of block group item if used bytes are the same")
CC: stable@vger.kernel.org # 6.2+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that a non-DATA block group is activated at write time, don't
activate it on allocation time.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the current implementation, block groups are activated at reservation
time to ensure that all reserved bytes can be written to an active metadata
block group. However, this approach has proven to be less efficient, as it
activates block groups more frequently than necessary, putting pressure on
the active zone resource and leading to potential issues such as early
ENOSPC or hung_task.
Another drawback of the current method is that it hampers metadata
over-commit, and necessitates additional flush operations and block group
allocations, resulting in decreased overall performance.
To address these issues, this commit introduces a write-time activation of
metadata and system block group. This involves reserving at least one
active block group specifically for a metadata and system block group.
Since metadata write-out is always allocated sequentially, when we need to
write to a non-active block group, we can wait for the ongoing IOs to
complete, activate a new block group, and then proceed with writing to the
new block group.
Fixes: b093151391 ("btrfs: zoned: activate metadata block group on flush_space")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_free_excluded_extents() is only used by block-group.c,
so move it into block-group.c and make it static. Also removed unnecessary
variables that are used only once.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The code for btrfs_add_excluded_extent() is trivial, it's just a
set_extent_bit() call. However it's defined in extent-tree.c but it is
only used (twice) in block-group.c. So open code it in block-group.c,
reducing the need to export a trivial function.
Also since the only caller btrfs_add_excluded_extent() is prepared to
deal with errors, stop ignoring errors from the set_extent_bit() call.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently find_first_extent_bit() returns a 0 if it found a range in the
given io tree and 1 if it didn't find any. There's no need to return any
errors, so make the return value a boolean and invert the logic to make
more sense: return true if it found a range and false if it didn't find
any range.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since add_new_free_space() is exported, used outside block-group.c, rename
it to include the 'btrfs_' prefix.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The documentation for add_new_free_space() is stale and no longer correct:
1) It's no longer used only when caching a block group. It's also called
when creating a block group (btrfs_make_block_group()), when reading
a block group at mount time (read_one_block_group()) and when reading
the free space tree for a block group (typically the first time we
attempt to allocate from the block group);
2) It has nothing to do with pinned extents. It only deals with the
excluded extents io tree, which is used to track the locations of
super blocks in order to make sure we never add the location of a
super block to the free space cache of a block group.
So update the documention and also add a description of the arguments
and return values.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmTXzuUACgkQxWXV+ddt
WDvQVg/+PwDYtfsFBBxWboR/Ehu+nGj+PGRGH5kUumCt03760GtVMYqJzakinAoA
TUg7N+SvC0i6STUQ1LxkqdyU+eHxk0D1qwK7HJtbqNQJ+kwaEPlilHwsptMmuuM/
xaei+C8gLmQreL5ZH6ZsnLfV4aaFR7Ur8KSiAq28H6dKXGnh4q9yio2BspeFoc4Y
8cD2Q8eOUxLBbGkAy9RHeMWf6OMOv2jyzdA761NZrjxUe23bDWSdRM6cRhfdJIh+
gfwW1IVH2EVOwo+FeaIpMSf4dpnenOYOKOftTncrz7XS0VEN/wJYQXGjNbLa7u4d
RxV2RujzRPePAUKDbLRakfXotcuKdSQuX2epLSYkQTfGQ0KRYu5YIDQgkm3r7Yky
cF5mkyEyI8lFCiop7Bgi3MqnzoY5ZgWAkWSy9/TzjQ4yRhjiZ3fmk5JgoJ8gwUc3
Fle4czcmKvk6ZqQAn90b0qGtW9FXzVAekZjLAH26O7+dgEn+CCAfwT9GuG7h+ATM
9Bh+5U5PWxWmNPTYU8Sn+WR9HpVL6+1maxrax/Ftb8/FuFlQXFHxK+OnTcKx9K+y
OGsv0r/4Zv517k1qqlHvf397Jvz7MmYLyOwkqu5xyomCGtrKIBkkEGF/9sHrZJVM
YokgphDZL8AILrnnPwCOgt4lsph1VKS/Sgvu7XKovnZbvvh8S+M=
=csAj
-----END PGP SIGNATURE-----
Merge tag 'for-6.5-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"More fixes, some of them going back to older releases and there are
fixes for hangs in stress tests regarding space caching:
- fixes and progress tracking for hangs in free space caching, found
by test generic/475
- writeback fixes, write pages in integrity mode and skip writing
pages that have been written meanwhile
- properly clear end of extent range after an error
- relocation fixes:
- fix race betwen qgroup tree creation and relocation
- detect and report invalid reloc roots"
* tag 'for-6.5-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: set cache_block_group_error if we find an error
btrfs: reject invalid reloc tree root keys with stack dump
btrfs: exit gracefully if reloc roots don't match
btrfs: avoid race between qgroup tree creation and relocation
btrfs: properly clear end of the unreserved range in cow_file_range
btrfs: don't wait for writeback on clean pages in extent_write_cache_pages
btrfs: don't stop integrity writeback too early
btrfs: wait for actual caching progress during allocation
Recently we've been having mysterious hangs while running generic/475 on
the CI system. This turned out to be something like this:
Task 1
dmsetup suspend --nolockfs
-> __dm_suspend
-> dm_wait_for_completion
-> dm_wait_for_bios_completion
-> Unable to complete because of IO's on a plug in Task 2
Task 2
wb_workfn
-> wb_writeback
-> blk_start_plug
-> writeback_sb_inodes
-> Infinite loop unable to make an allocation
Task 3
cache_block_group
->read_extent_buffer_pages
->Waiting for IO to complete that can't be submitted because Task 1
suspended the DM device
The problem here is that we need Task 2 to be scheduled completely for
the blk plug to flush. Normally this would happen, we normally wait for
the block group caching to finish (Task 3), and this schedule would
result in the block plug flushing.
However if there's enough free space available from the current caching
to satisfy the allocation we won't actually wait for the caching to
complete. This check however just checks that we have enough space, not
that we can make the allocation. In this particular case we were trying
to allocate 9MiB, and we had 10MiB of free space, but we didn't have
9MiB of contiguous space to allocate, and thus the allocation failed and
we looped.
We specifically don't cycle through the FFE loop until we stop finding
cached block groups because we don't want to allocate new block groups
just because we're caching, so we short circuit the normal loop once we
hit LOOP_CACHING_WAIT and we found a caching block group.
This is normally fine, except in this particular case where the caching
thread can't make progress because the DM device has been suspended.
Fix this by not only waiting for free space to >= the amount of space we
want to allocate, but also that we make some progress in caching from
the time we start waiting. This will keep us from busy looping when the
caching is taking a while but still theoretically has enough space for
us to allocate from, and fixes this particular case by forcing us to
actually sleep and wait for forward progress, which will flush the plug.
With this fix we're no longer hanging with generic/475.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmTCTAsACgkQxWXV+ddt
WDvhYhAAluWrfM2ZzhY/tdDeKUNpf0NIAFGZIV4QP/2E43yIPC2+xMPqW/AnBnIP
k28gOhgoH7LBP/cr0IrFHMz8Glges3cHz1UxFjJZgjiU3mAA0mgkIttPpzms7vqi
3SVUxL2bJkebJy53nOpZcHlrcWveg+q0hTUslquCYBb3dA4gb61HBwzA2e0wKFeB
wYw/gQtEy3TkHQPAxVjUF28ASoaroNKsE9QjfLZV0FDn0u0zBFxqpqj7bFUay++i
sG3nPVZsqKcgIX7sUSwrpv4XAFu8fHz+GAQqCNqTxKCJ0ZZzsgzJtKs+12rv7dZC
EvRt0jEt+DgwvmEy7j250TEbcI9rMaQuny8yt2j9sNKH/m9bW0BjptCtwghDoL89
0D6qqicHbA+dJNq8/kDyxV6xC2Git2Ck0fpOfiU7YzhAFECZc/DkidvXa1keMUay
usspO+YHOjDtlq0zJ0xixbxCseJfrj4habieVKZ/CnAvb84082ZiLcMxFqop/ewB
WHKNB0O2+P78xoa7/Be6tp/w1HaaW8ZHvkPicD9d4khKJrXAKLNc/Xny4OqRT14z
sWWaFuNjC7kIUT15EAQNj0wgymA7XcTL9gM1uuSO95PN+M3j4CleApzEvR3dn9FX
gmoxuwVfVsJKKcwo6WFByqzu03kuSladEFasSHQAJbh3jyU9LUY=
=Y/po
-----END PGP SIGNATURE-----
Merge tag 'for-6.5-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix accounting of global block reserve size when block group tree is
enabled
- the async discard has been enabled in 6.2 unconditionally, but for
zoned mode it does not make that much sense to do it asynchronously
as the zones are reset as needed
- error handling and proper error value propagation fixes
* tag 'for-6.5-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: check for commit error at btrfs_attach_transaction_barrier()
btrfs: check if the transaction was aborted at btrfs_wait_for_commit()
btrfs: remove BUG_ON()'s in add_new_free_space()
btrfs: account block group tree when calculating global reserve size
btrfs: zoned: do not enable async discard
At add_new_free_space() we have these BUG_ON()'s that are there to deal
with any failure to add free space to the in memory free space cache.
Such failures are mostly -ENOMEM that should be very rare. However there's
no need to have these BUG_ON()'s, we can just return any error to the
caller and all callers and their upper call chain are already dealing with
errors.
So just make add_new_free_space() return any errors, while removing the
BUG_ON()'s, and returning the total amount of added free space to an
optional u64 pointer argument.
Reported-by: syzbot+3ba856e07b7127889d8c@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000e9cb8305ff4e8327@google.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmS4TjYACgkQxWXV+ddt
WDt8yw//b6E+Ab0QBBS6Dl9R3euv+aqQUa35Lg8z2L89oLjNlX/yZSj4ZIZ/467A
omDtCat1uwYSlzPgM7RtfvXew7ujULmK9UP5IdwEQgQ5qf+pIE50h78x64zCLG1R
g2C9OMGvNxU8kOR2sRFaBJ+Lf8Mth8ipOzDl8XPE4m7t87tLYdpmVoHJKzNj/AcF
UOruS8HilFYlc8Oqb/yxDPYeMmdUQaqahhq4tHkoIB/09aePaJwAiyhQ53v3TaAX
oBvBBrV65WdHu3epfB9WcPQRyd09ov+e14UXAX7CepkZ0RdIpU1CcszPGHTtOP94
fRaup2zFygolHSTMPFU3drtnI3CRIGntdfHFqDEH+M1ysvRxG4VUnAjWL6w9KZZC
3z2G6cYoVxpcbCch5wpn/97cr6PowRABK0AoirrreU0VA2Vkdi45N5eoT/2DTEVT
s5VirJsfC5IXCrvQf/h67jHlqk+23e+/uP8in+GWTgPEmddTSWNKCi9ZcHKQ/skf
5U9Y3vkFFl68cQEwcB6vYMlbCiCxCuimz8AvGFrxkU9qmah6gcgTnXwKqneAbr3f
jFwZvgh+E5ueK76/KLLWIKyfDx5cU8N/D5TlmZ+n/TAoWUFVs5V2ctB4EOsIotev
EN7sriJGUkfGp+EAs4Xux4OKGx6IinyogdZmQN9HmNIlGks02qk=
=yO9Y
-----END PGP SIGNATURE-----
Merge tag 'for-6.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Stable fixes:
- fix race between balance and cancel/pause
- various iput() fixes
- fix use-after-free of new block group that became unused
- fix warning when putting transaction with qgroups enabled after
abort
- fix crash in subpage mode when page could be released between map
and map read
- when scrubbing raid56 verify the P/Q stripes unconditionally
- fix minor memory leak in zoned mode when a block group with an
unexpected superblock is found
Regression fixes:
- fix ordered extent split error handling when submitting direct IO
- user irq-safe locking when adding delayed iputs"
* tag 'for-6.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix warning when putting transaction with qgroups enabled after abort
btrfs: fix ordered extent split error handling in btrfs_dio_submit_io
btrfs: set_page_extent_mapped after read_folio in btrfs_cont_expand
btrfs: raid56: always verify the P/Q contents for scrub
btrfs: use irq safe locking when running and adding delayed iputs
btrfs: fix iput() on error pointer after error during orphan cleanup
btrfs: fix double iput() on inode after an error during orphan cleanup
btrfs: zoned: fix memory leak after finding block group with super blocks
btrfs: fix use-after-free of new block group that became unused
btrfs: be a bit more careful when setting mirror_num_ret in btrfs_map_block
btrfs: fix race between balance and cancel/pause
At exclude_super_stripes(), if we happen to find a block group that has
super blocks mapped to it and we are on a zoned filesystem, we error out
as this is not supposed to happen, indicating either a bug or maybe some
memory corruption for example. However we are exiting the function without
freeing the memory allocated for the logical address of the super blocks.
Fix this by freeing the logical address.
Fixes: 12659251ca ("btrfs: implement log-structured superblock for ZONED mode")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If a task creates a new block group and that block group becomes unused
before we finish its creation, at btrfs_create_pending_block_groups(),
then when btrfs_mark_bg_unused() is called against the block group, we
assume that the block group is currently in the list of block groups to
reclaim, and we move it out of the list of new block groups and into the
list of unused block groups. This has two consequences:
1) We move it out of the list of new block groups associated to the
current transaction. So the block group creation is not finished and
if we attempt to delete the bg because it's unused, we will not find
the block group item in the extent tree (or the new block group tree),
its device extent items in the device tree etc, resulting in the
deletion to fail due to the missing items;
2) We don't increment the reference count on the block group when we
move it to the list of unused block groups, because we assumed the
block group was on the list of block groups to reclaim, and in that
case it already has the correct reference count. However the block
group was on the list of new block groups, in which case no extra
reference was taken because it's local to the current task. This
later results in doing an extra reference count decrement when
removing the block group from the unused list, eventually leading the
reference count to 0.
This second case was caught when running generic/297 from fstests, which
produced the following assertion failure and stack trace:
[589.559] assertion failed: refcount_read(&block_group->refs) == 1, in fs/btrfs/block-group.c:4299
[589.559] ------------[ cut here ]------------
[589.559] kernel BUG at fs/btrfs/block-group.c:4299!
[589.560] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[589.560] CPU: 8 PID: 2819134 Comm: umount Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1
[589.560] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[589.560] RIP: 0010:btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.561] Code: 68 62 da c0 (...)
[589.561] RSP: 0018:ffffa55a8c3b3d98 EFLAGS: 00010246
[589.561] RAX: 0000000000000058 RBX: ffff8f030d7f2000 RCX: 0000000000000000
[589.562] RDX: 0000000000000000 RSI: ffffffff953f0878 RDI: 00000000ffffffff
[589.562] RBP: ffff8f030d7f2088 R08: 0000000000000000 R09: ffffa55a8c3b3c50
[589.562] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8f05850b4c00
[589.562] R13: ffff8f030d7f2090 R14: ffff8f05850b4cd8 R15: dead000000000100
[589.563] FS: 00007f497fd2e840(0000) GS:ffff8f09dfc00000(0000) knlGS:0000000000000000
[589.563] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[589.563] CR2: 00007f497ff8ec10 CR3: 0000000271472006 CR4: 0000000000370ee0
[589.563] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[589.564] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[589.564] Call Trace:
[589.564] <TASK>
[589.565] ? __die_body+0x1b/0x60
[589.565] ? die+0x39/0x60
[589.565] ? do_trap+0xeb/0x110
[589.565] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.566] ? do_error_trap+0x6a/0x90
[589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.566] ? exc_invalid_op+0x4e/0x70
[589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] ? asm_exc_invalid_op+0x16/0x20
[589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] close_ctree+0x35d/0x560 [btrfs]
[589.568] ? fsnotify_sb_delete+0x13e/0x1d0
[589.568] ? dispose_list+0x3a/0x50
[589.568] ? evict_inodes+0x151/0x1a0
[589.568] generic_shutdown_super+0x73/0x1a0
[589.569] kill_anon_super+0x14/0x30
[589.569] btrfs_kill_super+0x12/0x20 [btrfs]
[589.569] deactivate_locked_super+0x2e/0x70
[589.569] cleanup_mnt+0x104/0x160
[589.570] task_work_run+0x56/0x90
[589.570] exit_to_user_mode_prepare+0x160/0x170
[589.570] syscall_exit_to_user_mode+0x22/0x50
[589.570] ? __x64_sys_umount+0x12/0x20
[589.571] do_syscall_64+0x48/0x90
[589.571] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[589.571] RIP: 0033:0x7f497ff0a567
[589.571] Code: af 98 0e (...)
[589.572] RSP: 002b:00007ffc98347358 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[589.572] RAX: 0000000000000000 RBX: 00007f49800b8264 RCX: 00007f497ff0a567
[589.572] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000557f558abfa0
[589.573] RBP: 0000557f558a6ba0 R08: 0000000000000000 R09: 00007ffc98346100
[589.573] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[589.573] R13: 0000557f558abfa0 R14: 0000557f558a6cb0 R15: 0000557f558a6dd0
[589.573] </TASK>
[589.574] Modules linked in: dm_snapshot dm_thin_pool (...)
[589.576] ---[ end trace 0000000000000000 ]---
Fix this by adding a runtime flag to the block group to tell that the
block group is still in the list of new block groups, and therefore it
should not be moved to the list of unused block groups, at
btrfs_mark_bg_unused(), until the flag is cleared, when we finish the
creation of the block group at btrfs_create_pending_block_groups().
Fixes: a9f189716c ("btrfs: move out now unused BG from the reclaim list")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmSZ0YUACgkQxWXV+ddt
WDuF9g/+OsEZflGYIZj11trG0l5HWApnINKqhZ524J0TNy9KxY0KOTqPCOg0O41E
Vt7uJCMG06+ifvVEby8srjzUZxUutIuMeGIP91VyXUbt+CleAWunxnL8aKcTPS3T
QxyGx0VmukO2UHYCXhXQLRXo/+zPcBxnk6UgVzcBCIOecOMTB1KCeblBmqd3q86f
NVFse+BTkmtm86u/1rzEDqgY6lMHQ+jNoHhpRVGpzmSnSX8GELyOW3QnixS0LCo2
no0vvW0QXkRJ+S68V5zBWlqa3xr21jOYcmON2Ra2G8Etsjx7W5XKS3I9k/4uonxb
LbITmBwEZWt/aTzmLFT16S5M9BlRqH5Ffmsw7Ls+NDmdvH/f1zM8XeNAb7kpFTrn
T3aALjkcd65/JFmgyVmzdt4BSmrUkYm0EEmLirQec86HJ4NlQJpJ2B7cfMWKPyal
+VgaT4S+fLTc/HJD3nObMXTCxZrMf0sBUhU4/QXqL7TTjqqosSn26mlGNUocw7Ty
HaESk7j2L9TMPt640r1G98j9ND7sWmyBmiYsah8F3MKZCIS892qhtFs0m5g2tA1F
sjPv9u6M5Pi6ie5Eo8xs+SqKa7TPPVsbZ9XcMRBuzDc5AtUPAm6ii9QVwref8wTq
qO379jDepgPj4HZkXMzQKxd6rw6wrF854304XhjHZefk+ChhIc4=
=nN4X
-----END PGP SIGNATURE-----
Merge tag 'for-6.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Mainly core changes, refactoring and optimizations.
Performance is improved in some areas, overall there may be a
cumulative improvement due to refactoring that removed lookups in the
IO path or simplified IO submission tracking.
Core:
- submit IO synchronously for fast checksums (crc32c and xxhash),
remove high priority worker kthread
- read extent buffer in one go, simplify IO tracking, bio submission
and locking
- remove additional tracking of redirtied extent buffers, originally
added for zoned mode but actually not needed
- track ordered extent pointer in bio to avoid rbtree lookups during
IO
- scrub, use recovered data stripes as cache to avoid unnecessary
read
- in zoned mode, optimize logical to physical mappings of extents
- remove PageError handling, not set by VFS nor writeback
- cleanups, refactoring, better structure packing
- lots of error handling improvements
- more assertions, lockdep annotations
- print assertion failure with the exact line where it happens
- tracepoint updates
- more debugging prints
Performance:
- speedup in fsync(), better tracking of inode logged status can
avoid transaction commit
- IO path structures track logical offsets in data structures and
does not need to look it up
User visible changes:
- don't commit transaction for every created subvolume, this can
reduce time when many subvolumes are created in a batch
- print affected files when relocation fails
- trigger orphan file cleanup during START_SYNC ioctl
Notable fixes:
- fix crash when disabling quota and relocation
- fix crashes when removing roots from drity list
- fix transacion abort during relocation when converting from newer
profiles not covered by fallback
- in zoned mode, stop reclaiming block groups if filesystem becomes
read-only
- fix rare race condition in tree mod log rewind that can miss some
btree node slots
- with enabled fsverity, drop up-to-date page bit in case the
verification fails"
* tag 'for-6.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (194 commits)
btrfs: fix race between quota disable and relocation
btrfs: add comment to struct btrfs_fs_info::dirty_cowonly_roots
btrfs: fix race when deleting free space root from the dirty cow roots list
btrfs: fix race when deleting quota root from the dirty cow roots list
btrfs: tracepoints: also show actual number of the outstanding extents
btrfs: update i_version in update_dev_time
btrfs: make btrfs_compressed_bioset static
btrfs: add handling for RAID1C23/DUP to btrfs_reduce_alloc_profile
btrfs: scrub: remove btrfs_fs_info::scrub_wr_completion_workers
btrfs: scrub: remove scrub_ctx::csum_list member
btrfs: do not BUG_ON after failure to migrate space during truncation
btrfs: do not BUG_ON on failure to get dir index for new snapshot
btrfs: send: do not BUG_ON() on unexpected symlink data extent
btrfs: do not BUG_ON() when dropping inode items from log root
btrfs: replace BUG_ON() at split_item() with proper error handling
btrfs: do not BUG_ON() on tree mod log failures at btrfs_del_ptr()
btrfs: do not BUG_ON() on tree mod log failures at insert_ptr()
btrfs: do not BUG_ON() on tree mod log failure at insert_new_root()
btrfs: do not BUG_ON() on tree mod log failures at push_nodes_for_insert()
btrfs: abort transaction at update_ref_for_cow() when ref count is zero
...
There was regression caused by a97699d1d6 ("btrfs: replace
map_lookup->stripe_len by BTRFS_STRIPE_LEN") and supposedly fixed by
a7299a18a1 ("btrfs: fix u32 overflows when left shifting stripe_nr").
To avoid code churn the fix was open coding the type casts but
unfortunately missed one which was still possible to hit [1].
The missing place was assignment of bioc->full_stripe_logical inside
btrfs_map_block().
Fix it by adding a helper that does the safe calculation of the offset
and use it everywhere even though it may not be strictly necessary due
to already using u64 types. This replaces all remaining
"<< BTRFS_STRIPE_LEN_SHIFT" calls.
[1] https://lore.kernel.org/linux-btrfs/20230622065438.86402-1-wqu@suse.com/
Fixes: a7299a18a1 ("btrfs: fix u32 overflows when left shifting stripe_nr")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Callers of `btrfs_reduce_alloc_profile` expect it to return exactly
one allocation profile flag, and failing to do so may ultimately
result in a WARN_ON and remount-ro when allocating new blocks, like
the below transaction abort on 6.1.
`btrfs_reduce_alloc_profile` has two ways of determining the profile,
first it checks if a conversion balance is currently running and
uses the profile we're converting to. If no balance is currently
running, it returns the max-redundancy profile which at least one
block in the selected block group has.
This works by simply checking each known allocation profile bit in
redundancy order. However, `btrfs_reduce_alloc_profile` has not been
updated as new flags have been added - first with the `DUP` profile
and later with the RAID1C34 profiles.
Because of the way it checks, if we have blocks with different
profiles and at least one is known, that profile will be selected.
However, if none are known we may return a flag set with multiple
allocation profiles set.
This is currently only possible when a balance from one of the three
unhandled profiles to another of the unhandled profiles is canceled
after allocating at least one block using the new profile.
In that case, a transaction abort like the below will occur and the
filesystem will need to be mounted with -o skip_balance to get it
mounted rw again (but the balance cannot be resumed without a
similar abort).
[770.648] ------------[ cut here ]------------
[770.648] BTRFS: Transaction aborted (error -22)
[770.648] WARNING: CPU: 43 PID: 1159593 at fs/btrfs/extent-tree.c:4122 find_free_extent+0x1d94/0x1e00 [btrfs]
[770.648] CPU: 43 PID: 1159593 Comm: btrfs Tainted: G W 6.1.0-0.deb11.7-powerpc64le #1 Debian 6.1.20-2~bpo11+1a~test
[770.648] Hardware name: T2P9D01 REV 1.00 POWER9 0x4e1202 opal:skiboot-bc106a0 PowerNV
[770.648] NIP: c00800000f6784fc LR: c00800000f6784f8 CTR: c000000000d746c0
[770.648] REGS: c000200089afe9a0 TRAP: 0700 Tainted: G W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
[770.648] MSR: 9000000002029033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 28848282 XER: 20040000
[770.648] CFAR: c000000000135110 IRQMASK: 0
GPR00: c00800000f6784f8 c000200089afec40 c00800000f7ea800 0000000000000026
GPR04: 00000001004820c2 c000200089afea00 c000200089afe9f8 0000000000000027
GPR08: c000200ffbfe7f98 c000000002127f90 ffffffffffffffd8 0000000026d6a6e8
GPR12: 0000000028848282 c000200fff7f3800 5deadbeef0000122 c00000002269d000
GPR16: c0002008c7797c40 c000200089afef17 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000001 c000200008bc5a98 0000000000000001
GPR24: 0000000000000000 c0000003c73088d0 c000200089afef17 c000000016d3a800
GPR28: c0000003c7308800 c00000002269d000 ffffffffffffffea 0000000000000001
[770.648] NIP [c00800000f6784fc] find_free_extent+0x1d94/0x1e00 [btrfs]
[770.648] LR [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs]
[770.648] Call Trace:
[770.648] [c000200089afec40] [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs] (unreliable)
[770.648] [c000200089afed30] [c00800000f681398] btrfs_reserve_extent+0x1a0/0x2f0 [btrfs]
[770.648] [c000200089afeea0] [c00800000f681bf0] btrfs_alloc_tree_block+0x108/0x670 [btrfs]
[770.648] [c000200089afeff0] [c00800000f66bd68] __btrfs_cow_block+0x170/0x850 [btrfs]
[770.648] [c000200089aff100] [c00800000f66c58c] btrfs_cow_block+0x144/0x288 [btrfs]
[770.648] [c000200089aff1b0] [c00800000f67113c] btrfs_search_slot+0x6b4/0xcb0 [btrfs]
[770.648] [c000200089aff2a0] [c00800000f679f60] lookup_inline_extent_backref+0x128/0x7c0 [btrfs]
[770.648] [c000200089aff3b0] [c00800000f67b338] lookup_extent_backref+0x70/0x190 [btrfs]
[770.648] [c000200089aff470] [c00800000f67b54c] __btrfs_free_extent+0xf4/0x1490 [btrfs]
[770.648] [c000200089aff5a0] [c00800000f67d770] __btrfs_run_delayed_refs+0x328/0x1530 [btrfs]
[770.648] [c000200089aff740] [c00800000f67ea2c] btrfs_run_delayed_refs+0xb4/0x3e0 [btrfs]
[770.648] [c000200089aff800] [c00800000f699aa4] btrfs_commit_transaction+0x8c/0x12b0 [btrfs]
[770.648] [c000200089aff8f0] [c00800000f6dc628] reset_balance_state+0x1c0/0x290 [btrfs]
[770.648] [c000200089aff9a0] [c00800000f6e2f7c] btrfs_balance+0x1164/0x1500 [btrfs]
[770.648] [c000200089affb40] [c00800000f6f8e4c] btrfs_ioctl+0x2b54/0x3100 [btrfs]
[770.648] [c000200089affc80] [c00000000053be14] sys_ioctl+0x794/0x1310
[770.648] [c000200089affd70] [c00000000002af98] system_call_exception+0x138/0x250
[770.648] [c000200089affe10] [c00000000000c654] system_call_common+0xf4/0x258
[770.648] --- interrupt: c00 at 0x7fff94126800
[770.648] NIP: 00007fff94126800 LR: 0000000107e0b594 CTR: 0000000000000000
[770.648] REGS: c000200089affe80 TRAP: 0c00 Tainted: G W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
[770.648] MSR: 900000000000d033 <SF,HV,EE,PR,ME,IR,DR,RI,LE> CR: 24002848 XER: 00000000
[770.648] IRQMASK: 0
GPR00: 0000000000000036 00007fffc9439da0 00007fff94217100 0000000000000003
GPR04: 00000000c4009420 00007fffc9439ee8 0000000000000000 0000000000000000
GPR08: 00000000803c7416 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fff9467d120 0000000107e64c9c 0000000107e64d0a
GPR16: 0000000107e64d06 0000000107e64cf1 0000000107e64cc4 0000000107e64c73
GPR20: 0000000107e64c31 0000000107e64bf1 0000000107e64be7 0000000000000000
GPR24: 0000000000000000 00007fffc9439ee0 0000000000000003 0000000000000001
GPR28: 00007fffc943f713 0000000000000000 00007fffc9439ee8 0000000000000000
[770.648] NIP [00007fff94126800] 0x7fff94126800
[770.648] LR [0000000107e0b594] 0x107e0b594
[770.648] --- interrupt: c00
[770.648] Instruction dump:
[770.648] 3b00ffe4 e8898828 481175f5 60000000 4bfff4fc 3be00000 4bfff570 3d220000
[770.648] 7fc4f378 e8698830 4811cd95 e8410018 <0fe00000> f9c10060 f9e10068 fa010070
[770.648] ---[ end trace 0000000000000000 ]---
[770.648] BTRFS: error (device dm-2: state A) in find_free_extent_update_loop:4122: errno=-22 unknown
[770.648] BTRFS info (device dm-2: state EA): forced readonly
[770.648] BTRFS: error (device dm-2: state EA) in __btrfs_free_extent:3070: errno=-22 unknown
[770.648] BTRFS error (device dm-2: state EA): failed to run delayed ref for logical 17838685708288 num_bytes 24576 type 184 action 2 ref_mod 1: -22
[770.648] BTRFS: error (device dm-2: state EA) in btrfs_run_delayed_refs:2144: errno=-22 unknown
[770.648] BTRFS: error (device dm-2: state EA) in reset_balance_state:3599: errno=-22 unknown
Fixes: 47e6f7423b ("btrfs: add support for 3-copy replication (raid1c3)")
Fixes: 8d6fac0087 ("btrfs: add support for 4-copy replication (raid1c4)")
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Matt Corallo <blnxfsl@bluematt.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The reclaim process can temporarily fail. For example, if the space is
getting tight, it fails to make the block group read-only. If there are no
further writes on that block group, the block group will never get back to
the reclaim list, and the BG never gets reclaimed. In a certain workload,
we can leave many such block groups never reclaimed.
So, let's get it back to the list and give it a chance to be reclaimed.
Fixes: 18bb8bbf13 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a filesystem is read-only, we cannot reclaim a block group as it
cannot rewrite the data. Just bail out in that case.
Note that it can drop block groups in this case. As we did
sb_start_write(), read-only filesystem means we got a fatal error and
forced read-only. There is no chance to reclaim them again.
Fixes: 18bb8bbf13 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
An unused block group is easy to remove to free up space and should be
reclaimed fast. Such block group can often already be a target of the
reclaim process. As we check list_empty(&bg->bg_list), we keep it in the
reclaim list. That block group is never reclaimed until the file system
is filled e.g. up to 75%.
Instead, we can move unused block group to the unused list and delete it
fast.
Fixes: 18bb8bbf13 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The reclaiming process only starts after the filesystem volumes are
allocated to a certain level (75% by default). Thus, the list of
reclaiming target block groups can build up so huge at the time the
reclaim process kicks in. On a test run, there were over 1000 BGs in the
reclaim list.
As the reclaim involves rewriting the data, it takes really long time to
reclaim the BGs. While the reclaim is running, btrfs_delete_unused_bgs()
won't proceed because the reclaim side is holding
fs_info->reclaim_bgs_lock. As a result, we will have a large number of
unused BGs kept in the unused list. On my test run, I got 1057 unused BGs.
Since deleting a block group is relatively easy and fast work, we can call
btrfs_delete_unused_bgs() while it reclaims BGs, to avoid building up
unused BGs.
Fixes: 18bb8bbf13 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that all extent state bit helpers effectively take the GFP_NOFS mask
(and GFP_NOWAIT is encoded in the bits) we can remove the parameter.
This reduces stack consumption in many functions and simplifies a lot of
code.
Net effect on module on a release build:
text data bss dec hex filename
1250432 20985 16088 1287505 13a551 pre/btrfs.ko
1247074 20985 16088 1284147 139833 post/btrfs.ko
DELTA: -3358
Signed-off-by: David Sterba <dsterba@suse.com>
The __GFP_NOFAIL passed to set_extent_bit first appeared in 2010
(commit f0486c68e4 ("Btrfs: Introduce contexts for metadata
reservation")), without any explanation why it would be needed.
Meanwhile we've updated the semantics of set_extent_bit to handle failed
allocations and do unlock, sleep and retry if needed. The use of the
NOFAIL flag is also an outlier, we never want any of the set/clear
extent bit helpers to fail, they're used for many critical changes like
extent locking, besides the extent state bit changes.
Signed-off-by: David Sterba <dsterba@suse.com>
The helper is used a few times, that it's setting the DIRTY extent bit
is still clear.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we allow a block group not to be marked read-only for scrub.
But for RAID56 block groups if we require the block group to be
read-only, then we're allowed to use cached content from scrub stripe to
reduce unnecessary RAID56 reads.
So this patch would:
- Make btrfs_inc_block_group_ro() try harder
During my tests, for cases like btrfs/061 and btrfs/064, we can hit
ENOSPC from btrfs_inc_block_group_ro() calls during scrub.
The reason is if we only have one single data chunk, and trying to
scrub it, we won't have any space left for any newer data writes.
But this check should be done by the caller, especially for scrub
cases we only temporarily mark the chunk read-only.
And newer data writes would always try to allocate a new data chunk
when needed.
- Return error for scrub if we failed to mark a RAID56 chunk read-only
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old scrub code has different entrance to verify the content, and
since we have removed the writeback path, now we can start removing the
re-check part, including:
- scrub_recover structure
- scrub_sector::recover member
- function scrub_setup_recheck_block()
- function scrub_recheck_block()
- function scrub_recheck_block_checksum()
- function scrub_repair_block_group_good_copy()
- function scrub_repair_sector_from_good_copy()
- function scrub_is_page_on_raid56()
- function full_stripe_lock()
- function search_full_stripe_lock()
- function get_full_stripe_logical()
- function insert_full_stripe_lock()
- function lock_full_stripe()
- function unlock_full_stripe()
- btrfs_block_group::full_stripe_locks_root member
- btrfs_full_stripe_locks_tree structure
This infrastructure is to ensure RAID56 scrub is properly handling
recovery and P/Q scrub correctly.
This is no longer needed, before P/Q scrub we will wait for all
the involved data stripes to be scrubbed first, and RAID56 code has
internal lock to ensure no race in the same full stripe.
- function scrub_print_warning()
- function scrub_get_recover()
- function scrub_put_recover()
- function scrub_handle_errored_block()
- function scrub_setup_recheck_block()
- function scrub_bio_wait_endio()
- function scrub_submit_raid56_bio_wait()
- function scrub_recheck_block_on_raid56()
- function scrub_recheck_block()
- function scrub_recheck_block_checksum()
- function scrub_repair_block_from_good_copy()
- function scrub_repair_sector_from_good_copy()
And two more functions exported temporarily for later cleanup:
- alloc_scrub_sector()
- alloc_scrub_block()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only caller of btrfs_make_block_group() always passes 0 as the value
for the bytes_used argument, so remove it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>