Commit graph

1665 commits

Author SHA1 Message Date
Naohiro Aota
760e69c4c2 btrfs: zoned: activate block group only for extent allocation
In btrfs_make_block_group(), we activate the allocated block group,
expecting that the block group is soon used for allocation. However, the
chunk allocation from flush_space() context broke the assumption. There
can be a large time gap between the chunk allocation time and the extent
allocation time from the chunk.

Activating the empty block groups pre-allocated from flush_space()
context can exhaust the active zone counter of a device. Once we use all
the active zone counts for empty pre-allocated block groups, we cannot
activate new block group for the other things: metadata, tree-log, or
data relocation block group.  That failure results in a fake -ENOSPC.

This patch introduces CHUNK_ALLOC_FORCE_FOR_EXTENT to distinguish the
chunk allocation from find_free_extent(). Now, the new block group is
activated only in that context.

Fixes: eb66a010d5 ("btrfs: zoned: activate new block group")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-06 00:50:41 +02:00
Josef Bacik
8f8aa4c7a9 btrfs: factor out do_free_extent_accounting helper
__btrfs_free_extent() does all of the hard work of updating the extent
ref items, and then at the end if we dropped the extent completely it
does the cleanup accounting work.  We're going to only want to do that
work for metadata with extent tree v2, so extract this bit into its own
helper.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:53 +01:00
Josef Bacik
5b2a54bb7c btrfs: remove last_ref from the extent freeing code
This is a remnant of the work I did for qgroups a long time ago to only
run for a block when we had dropped the last ref.  We haven't done that
for years, but the code remains.  Drop this remnant.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:53 +01:00
Josef Bacik
3466670558 btrfs: add a alloc_reserved_extent helper
We duplicate this logic for both data and metadata, at this point we've
already done our type specific extent root operations, this is just
doing the accounting and removing the space from the free space tree.
Extract this common logic out into a helper.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:53 +01:00
Josef Bacik
b3c958a369 btrfs: remove BUG_ON(ret) in alloc_reserved_tree_block
Switch this to an ASSERT() and return the error in the normal case.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:53 +01:00
Jiapeng Chong
c4bf190999 btrfs: zoned: remove redundant initialization of to_add
to_add is being initialized to len but this is never read as to_add is
overwritten later on. Remove the redundant initialization.

Cleans up the following clang-analyzer warning:

fs/btrfs/extent-tree.c:2769:8: warning: Value stored to 'to_add' during
its initialization is never read [clang-analyzer-deadcode.DeadStores].

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:47 +01:00
Josef Bacik
b4be6aefa7 btrfs: do not start relocation until in progress drops are done
We hit a bug with a recovering relocation on mount for one of our file
systems in production.  I reproduced this locally by injecting errors
into snapshot delete with balance running at the same time.  This
presented as an error while looking up an extent item

  WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680
  CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8
  RIP: 0010:lookup_inline_extent_backref+0x647/0x680
  RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202
  RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000
  RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001
  R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000
  R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000
  FS:  0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0
  Call Trace:
   <TASK>
   insert_inline_extent_backref+0x46/0xd0
   __btrfs_inc_extent_ref.isra.0+0x5f/0x200
   ? btrfs_merge_delayed_refs+0x164/0x190
   __btrfs_run_delayed_refs+0x561/0xfa0
   ? btrfs_search_slot+0x7b4/0xb30
   ? btrfs_update_root+0x1a9/0x2c0
   btrfs_run_delayed_refs+0x73/0x1f0
   ? btrfs_update_root+0x1a9/0x2c0
   btrfs_commit_transaction+0x50/0xa50
   ? btrfs_update_reloc_root+0x122/0x220
   prepare_to_merge+0x29f/0x320
   relocate_block_group+0x2b8/0x550
   btrfs_relocate_block_group+0x1a6/0x350
   btrfs_relocate_chunk+0x27/0xe0
   btrfs_balance+0x777/0xe60
   balance_kthread+0x35/0x50
   ? btrfs_balance+0xe60/0xe60
   kthread+0x16b/0x190
   ? set_kthread_struct+0x40/0x40
   ret_from_fork+0x22/0x30
   </TASK>

Normally snapshot deletion and relocation are excluded from running at
the same time by the fs_info->cleaner_mutex.  However if we had a
pending balance waiting to get the ->cleaner_mutex, and a snapshot
deletion was running, and then the box crashed, we would come up in a
state where we have a half deleted snapshot.

Again, in the normal case the snapshot deletion needs to complete before
relocation can start, but in this case relocation could very well start
before the snapshot deletion completes, as we simply add the root to the
dead roots list and wait for the next time the cleaner runs to clean up
the snapshot.

Fix this by setting a bit on the fs_info if we have any DEAD_ROOT's that
had a pending drop_progress key.  If they do then we know we were in the
middle of the drop operation and set a flag on the fs_info.  Then
balance can wait until this flag is cleared to start up again.

If there are DEAD_ROOT's that don't have a drop_progress set then we're
safe to start balance right away as we'll be properly protected by the
cleaner_mutex.

CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-02 16:52:39 +01:00
Naohiro Aota
82187d2ecd btrfs: zoned: fix chunk allocation condition for zoned allocator
The ZNS specification defines a limit on the number of "active"
zones. That limit impose us to limit the number of block groups which
can be used for an allocation at the same time. Not to exceed the
limit, we reuse the existing active block groups as much as possible
when we can't activate any other zones without sacrificing an already
activated block group in commit a85f05e59b ("btrfs: zoned: avoid
chunk allocation if active block group has enough space").

However, the check is wrong in two ways. First, it checks the
condition for every raid index (ffe_ctl->index). Even if it reaches
the condition and "ffe_ctl->max_extent_size >=
ffe_ctl->min_alloc_size" is met, there can be other block groups
having enough space to hold ffe_ctl->num_bytes. (Actually, this won't
happen in the current zoned code as it only supports SINGLE
profile. But, it can happen once it enables other RAID types.)

Second, it checks the active zone availability depending on the
raid index. The raid index is just an index for
space_info->block_groups, so it has nothing to do with chunk allocation.

These mistakes are causing a faulty allocation in a certain
situation. Consider we are running zoned btrfs on a device whose
max_active_zone == 0 (no limit). And, suppose no block group have a
room to fit ffe_ctl->num_bytes but some room to meet
ffe_ctl->min_alloc_size (i.e. max_extent_size > num_bytes >=
min_alloc_size).

In this situation, the following occur:

- With SINGLE raid_index, it reaches the chunk allocation checking
  code
- The check returns true because we can activate a new zone (no limit)
- But, before allocating the chunk, it iterates to the next raid index
  (RAID5)
- Since there are no RAID5 block groups on zoned mode, it again
  reaches the check code
- The check returns false because of btrfs_can_activate_zone()'s "if
  (raid_index != BTRFS_RAID_SINGLE)" part
- That results in returning -ENOSPC without allocating a new chunk

As a result, we end up hitting -ENOSPC too early.

Move the check to the right place in the can_allocate_chunk() hook,
and do the active zone check depending on the allocation flag, not on
the raid index.

CC: stable@vger.kernel.org # 5.16
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:26 +01:00
Naohiro Aota
50475cd577 btrfs: add extent allocator hook to decide to allocate chunk or not
Introduce a new hook for an extent allocator policy. With the new
hook, a policy can decide to allocate a new block group or not. If
not, it will return -ENOSPC, so btrfs_reserve_extent() will cut the
allocation size in half and retry the allocation if min_alloc_size is
large enough.

The hook has a place holder and will be replaced with the real
implementation in the next patch.

CC: stable@vger.kernel.org # 5.16
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:26 +01:00
Naohiro Aota
1ada69f61c btrfs: zoned: unset dedicated block group on allocation failure
Allocating an extent from a block group can fail for various reasons.
When an allocation from a dedicated block group (for tree-log or
relocation data) fails, we need to unregister it as a dedicated one so
that we can allocate a new block group for the dedicated one.

However, we are returning early when the block group in case it is
read-only, fully used, or not be able to activate the zone. As a result,
we keep the non-usable block group as a dedicated one, leading to
further allocation failure. With many block groups, the allocator will
iterate hopeless loop to find a free extent, results in a hung task.

Fix the issue by delaying the return and doing the proper cleanups.

CC: stable@vger.kernel.org # 5.16
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:26 +01:00
Josef Bacik
abed4aaae4 btrfs: track the csum, extent, and free space trees in a rb tree
In the future we are going to have multiple copies of these trees.  To
facilitate this we need a way to lookup the different roots we are
looking for.  Handle this by adding a global root rb tree that is
indexed on the root->root_key.  Then instead of loading the roots at
mount time with individually targeted keys, simply search the tree_root
for anything with the specific objectid we want.  This will make it
straightforward to support both old style and new style file systems.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03 15:09:50 +01:00
Josef Bacik
fc28b25e1f btrfs: stop accessing ->csum_root directly
We are going to have multiple csum roots in the future, so convert all
users of ->csum_root to btrfs_csum_root() and rename ->csum_root to
->_csum_root so we can easily find remaining users in the future.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03 15:09:49 +01:00
Josef Bacik
29cbcf4017 btrfs: stop accessing ->extent_root directly
When we start having multiple extent roots we'll need to use a helper to
get to the correct extent_root.  Rename fs_info->extent_root to
_extent_root and convert all of the users of the extent root to using
the btrfs_extent_root() helper.  This will allow us to easily clean up
the remaining direct accesses in the future.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03 15:09:49 +01:00
Josef Bacik
8e1d029091 btrfs: use chunk_root in find_free_extent_update_loop
We're only using this to start the transaction with to possibly allocate
a chunk.  It doesn't really matter which root to use, but with extent
tree v2 we'll need a bytenr to look up a extent root which makes the
usage of the extent_root awkward here.  Simply change it to the
chunk_root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03 15:09:48 +01:00
Josef Bacik
76d76e789d btrfs: make remove_extent_backref pass the root
With extent tree v2 we'll have a different extent root based on where
the bytenr is located, so adjust the remove_extent_backref() helper and
it's helpers to pass the extent_root around.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03 15:09:48 +01:00
Josef Bacik
3212fa14e7 btrfs: drop the _nr from the item helpers
Now that all call sites are using the slot number to modify item values,
rename the SETGET helpers to raw_item_*(), and then rework the _nr()
helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then
rename all of the callers to the new helpers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03 15:09:43 +01:00
Filipe Manana
7a1636089a btrfs: fix invalid delayed ref after subvolume creation failure
When creating a subvolume, at ioctl.c:create_subvol(), if we fail to
insert the new root's root item into the root tree, we are freeing the
metadata extent we reserved for the new root to prevent a metadata
extent leak, as we don't abort the transaction at that point (since
there is nothing at that point that is irreversible).

However we allocated the metadata extent for the new root which we are
creating for the new subvolume, so its delayed reference refers to the
ID of this new root. But when we free the metadata extent we pass the
root of the subvolume where the new subvolume is located to
btrfs_free_tree_block() - this is incorrect because this will generate
a delayed reference that refers to the ID of the parent subvolume's root,
and not to ID of the new root.

This results in a failure when running delayed references that leads to
a transaction abort and a trace like the following:

[3868.738042] RIP: 0010:__btrfs_free_extent+0x709/0x950 [btrfs]
[3868.739857] Code: 68 0f 85 e6 fb ff (...)
[3868.742963] RSP: 0018:ffffb0e9045cf910 EFLAGS: 00010246
[3868.743908] RAX: 00000000fffffffe RBX: 00000000fffffffe RCX: 0000000000000002
[3868.745312] RDX: 00000000fffffffe RSI: 0000000000000002 RDI: ffff90b0cd793b88
[3868.746643] RBP: 000000000e5d8000 R08: 0000000000000000 R09: ffff90b0cd793b88
[3868.747979] R10: 0000000000000002 R11: 00014ded97944d68 R12: 0000000000000000
[3868.749373] R13: ffff90b09afe4a28 R14: 0000000000000000 R15: ffff90b0cd793b88
[3868.750725] FS:  00007f281c4a8b80(0000) GS:ffff90b3ada00000(0000) knlGS:0000000000000000
[3868.752275] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3868.753515] CR2: 00007f281c6a5000 CR3: 0000000108a42006 CR4: 0000000000370ee0
[3868.754869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[3868.756228] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[3868.757803] Call Trace:
[3868.758281]  <TASK>
[3868.758655]  ? btrfs_merge_delayed_refs+0x178/0x1c0 [btrfs]
[3868.759827]  __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs]
[3868.761047]  btrfs_run_delayed_refs+0x86/0x210 [btrfs]
[3868.762069]  ? lock_acquired+0x19f/0x420
[3868.762829]  btrfs_commit_transaction+0x69/0xb20 [btrfs]
[3868.763860]  ? _raw_spin_unlock+0x29/0x40
[3868.764614]  ? btrfs_block_rsv_release+0x1c2/0x1e0 [btrfs]
[3868.765870]  create_subvol+0x1d8/0x9a0 [btrfs]
[3868.766766]  btrfs_mksubvol+0x447/0x4c0 [btrfs]
[3868.767669]  ? preempt_count_add+0x49/0xa0
[3868.768444]  __btrfs_ioctl_snap_create+0x123/0x190 [btrfs]
[3868.769639]  ? _copy_from_user+0x66/0xa0
[3868.770391]  btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs]
[3868.771495]  btrfs_ioctl+0xd1e/0x35c0 [btrfs]
[3868.772364]  ? __slab_free+0x10a/0x360
[3868.773198]  ? rcu_read_lock_sched_held+0x12/0x60
[3868.774121]  ? lock_release+0x223/0x4a0
[3868.774863]  ? lock_acquired+0x19f/0x420
[3868.775634]  ? rcu_read_lock_sched_held+0x12/0x60
[3868.776530]  ? trace_hardirqs_on+0x1b/0xe0
[3868.777373]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
[3868.778280]  ? kmem_cache_free+0x321/0x3c0
[3868.779011]  ? __x64_sys_ioctl+0x83/0xb0
[3868.779718]  __x64_sys_ioctl+0x83/0xb0
[3868.780387]  do_syscall_64+0x3b/0xc0
[3868.781059]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[3868.781953] RIP: 0033:0x7f281c59e957
[3868.782585] Code: 3c 1c 48 f7 d8 4c (...)
[3868.785867] RSP: 002b:00007ffe1f83e2b8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[3868.787198] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281c59e957
[3868.788450] RDX: 00007ffe1f83e2c0 RSI: 0000000050009418 RDI: 0000000000000003
[3868.789748] RBP: 00007ffe1f83f300 R08: 0000000000000000 R09: 00007ffe1f83fe36
[3868.791214] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003
[3868.792468] R13: 0000000000000003 R14: 00007ffe1f83e2c0 R15: 00000000000003cc
[3868.793765]  </TASK>
[3868.794037] irq event stamp: 0
[3868.794548] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[3868.795670] hardirqs last disabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040
[3868.797086] softirqs last  enabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040
[3868.798309] softirqs last disabled at (0): [<0000000000000000>] 0x0
[3868.799284] ---[ end trace be24c7002fe27747 ]---
[3868.799928] BTRFS info (device dm-0): leaf 241188864 gen 1268 total ptrs 214 free space 469 owner 2
[3868.801133] BTRFS info (device dm-0): refs 2 lock_owner 225627 current 225627
[3868.802056]  item 0 key (237436928 169 0) itemoff 16250 itemsize 33
[3868.802863]          extent refs 1 gen 1265 flags 2
[3868.803447]          ref#0: tree block backref root 1610
(...)
[3869.064354]  item 114 key (241008640 169 0) itemoff 12488 itemsize 33
[3869.065421]          extent refs 1 gen 1268 flags 2
[3869.066115]          ref#0: tree block backref root 1689
(...)
[3869.403834] BTRFS error (device dm-0): unable to find ref byte nr 241008640 parent 0 root 1622  owner 0 offset 0
[3869.405641] BTRFS: error (device dm-0) in __btrfs_free_extent:3076: errno=-2 No such entry
[3869.407138] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2159: errno=-2 No such entry

Fix this by passing the new subvolume's root ID to btrfs_free_tree_block().
This requires changing the root argument of btrfs_free_tree_block() from
struct btrfs_root * to a u64, since at this point during the subvolume
creation we have not yet created the struct btrfs_root for the new
subvolume, and btrfs_free_tree_block() only needs a root ID and nothing
else from a struct btrfs_root.

This was triggered by test case generic/475 from fstests.

Fixes: 67addf2900 ("btrfs: fix metadata extent leak after failure to create subvolume")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-15 17:07:33 +01:00
Josef Bacik
f981fec12c btrfs: fail if fstrim_range->start == U64_MAX
We've always been failing generic/260 because it's testing things we
actually don't care about and thus won't fail for.  However we probably
should fail for fstrim_range->start == U64_MAX since we clearly can't
trim anything past that.  This in combination with an update to
generic/260 will allow us to pass this test properly.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-08 15:42:24 +01:00
Anand Jain
11b66fa6ee btrfs: reduce btrfs_update_block_group alloc argument to bool
btrfs_update_block_group() accounts for the number of bytes allocated or
freed. Argument @alloc specifies whether the call is for alloc or free.
Convert the argument @alloc type from int to bool.

Reviewed-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:06 +02:00
Nikolay Borisov
681145d4ac btrfs: pull up qgroup checks from delayed-ref core to init time
Instead of checking whether qgroup processing for a dealyed ref has to
happen in the core of delayed ref, simply pull the check at init time of
respective delayed ref structures. This eliminates the final use of
real_root in delayed-ref core paving the way to making this member
optional.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:06 +02:00
Nikolay Borisov
f42c5da6c1 btrfs: add additional parameters to btrfs_init_tree_ref/btrfs_init_data_ref
In order to make 'real_root' used only in ref-verify it's required to
have the necessary context to perform the same checks that this member
is used for. So add 'mod_root' which will contain the root on behalf of
which a delayed ref was created and a 'skip_group' parameter which
will contain callsite-specific override of skip_qgroup.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:06 +02:00
Nikolay Borisov
113479d5b8 btrfs: rename root fields in delayed refs structs
Both data and metadata delayed ref structures have fields named
root/ref_root respectively. Those are somewhat cryptic and don't really
convey the real meaning. In fact those roots are really the original
owners of the respective block (i.e in case of a snapshot a data delayed
ref will contain the original root that owns the given block). Rename
those fields accordingly and adjust comments.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:06 +02:00
Filipe Manana
49d0c6424c btrfs: assert that extent buffers are write locked instead of only locked
We currently use lockdep_assert_held() at btrfs_assert_tree_locked(), and
that checks that we hold a lock either in read mode or write mode.

However in all contexts we use btrfs_assert_tree_locked(), we actually
want to check if we are holding a write lock on the extent buffer's rw
semaphore - it would be a bug if in any of those contexts we were holding
a read lock instead.

So change btrfs_assert_tree_locked() to use lockdep_assert_held_write()
instead and, to make it more explicit, rename btrfs_assert_tree_locked()
to btrfs_assert_tree_write_locked(), so that it's clear we want to check
we are holding a write lock.

For now there are no contexts where we want to assert that we must have
a read lock, but in case that is needed in the future, we can add a new
helper function that just calls out lockdep_assert_held_read().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:02 +02:00
Qu Wenruo
4c66461179 btrfs: rename btrfs_bio to btrfs_io_context
The structure btrfs_bio is used by two different sites:

- bio->bi_private for mirror based profiles
  For those profiles (SINGLE/DUP/RAID1*/RAID10), this structures records
  how many mirrors are still pending, and save the original endio
  function of the bio.

- RAID56 code
  In that case, RAID56 only utilize the stripes info, and no long uses
  that to trace the pending mirrors.

So btrfs_bio is not always bind to a bio, and contains more info for IO
context, thus renaming it will make the naming less confusing.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:02 +02:00
Johannes Thumshirn
2d81eb1c3f btrfs: zoned: let the for_treelog test in the allocator stand out
The statement which decides if an extent allocation on a zoned device is
for the dedicated tree-log block group or not and if we can use the block
group we picked for this allocation is not easy to read but an important
part of the allocator.

Rewrite into an if condition instead of a plain boolean test to make it
stand out more, like the version which tests for the dedicated
data-relocation block group.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:02 +02:00
Johannes Thumshirn
c2707a2556 btrfs: zoned: add a dedicated data relocation block group
Relocation in a zoned filesystem can fail with a transaction abort with
error -22 (EINVAL). This happens because the relocation code assumes that
the extents we relocated the data to have the same size the source extents
had and ensures this by preallocating the extents.

But in a zoned filesystem we currently can't preallocate the extents as
this would break the sequential write required rule. Therefore it can
happen that the writeback process kicks in while we're still adding pages
to a delalloc range and starts writing out dirty pages.

This then creates destination extents that are smaller than the source
extents, triggering the following safety check in get_new_location():

 1034         if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) {
 1035                 ret = -EINVAL;
 1036                 goto out;
 1037         }

Temporarily create a dedicated block group for the relocation process, so
no non-relocation data writes can interfere with the relocation writes.

This is needed that we can switch the relocation process on a zoned
filesystem from the REQ_OP_ZONE_APPEND writing we use for data to a scheme
like in a non-zoned filesystem using REQ_OP_WRITE and preallocation.

Fixes: 32430c6148 ("btrfs: zoned: enable relocation on a zoned filesystem")
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Johannes Thumshirn
37f00a6d2e btrfs: introduce btrfs_is_data_reloc_root
There are several places in our codebase where we check if a root is the
root of the data reloc tree and subsequent patches will introduce more.

Factor out the check into a small helper function instead of open coding
it multiple times.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Naohiro Aota
a85f05e59b btrfs: zoned: avoid chunk allocation if active block group has enough space
The current extent allocator tries to allocate a new block group when the
existing block groups do not have enough space. On a ZNS device, a new
block group means a new active zone. If the number of active zones has
already reached the max_active_zones, activating a new zone needs to finish
an existing zone, leading to wasting the free space there.

So, instead, it should reuse the existing active block groups as much as
possible when we can't activate any other zones without sacrificing an
already activated block group.

While at it, I converted find_free_extent_update_loop() to check the
found_extent() case early and made the other conditions simpler.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota
a12b0dc0aa btrfs: move ffe_ctl one level up
We are passing too many variables as it is from btrfs_reserve_extent() to
find_free_extent(). The next commit will add min_alloc_size to ffe_ctl, and
that means another pass-through argument. Take this opportunity to move
ffe_ctl one level up and drop the redundant arguments.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota
2e654e4bb9 btrfs: zoned: activate block group on allocation
Activate a block group when trying to allocate an extent from it. We check
read-only case and no space left case before trying to activate a block
group not to consume the number of active zones uselessly.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota
98173255bd btrfs: zoned: calculate free space from zone capacity
Now that we introduced capacity in a block group, we need to calculate free
space using the capacity instead of the length. Thus, bytes we account
capacity - alloc_pointer as free, and account bytes [capacity, length] as
zone unusable.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:58 +02:00
Qu Wenruo
19ea40dddf btrfs: unlock newly allocated extent buffer after error
[BUG]
There is a bug report that injected ENOMEM error could leave a tree
block locked while we return to user-space:

  BTRFS info (device loop0): enabling ssd optimizations
  FAULT_INJECTION: forcing a failure.
  name failslab, interval 1, probability 0, space 0, times 0
  CPU: 0 PID: 7579 Comm: syz-executor Not tainted 5.15.0-rc1 #16
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
  rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
  Call Trace:
   __dump_stack lib/dump_stack.c:88 [inline]
   dump_stack_lvl+0x8d/0xcf lib/dump_stack.c:106
   fail_dump lib/fault-inject.c:52 [inline]
   should_fail+0x13c/0x160 lib/fault-inject.c:146
   should_failslab+0x5/0x10 mm/slab_common.c:1328
   slab_pre_alloc_hook.constprop.99+0x4e/0xc0 mm/slab.h:494
   slab_alloc_node mm/slub.c:3120 [inline]
   slab_alloc mm/slub.c:3214 [inline]
   kmem_cache_alloc+0x44/0x280 mm/slub.c:3219
   btrfs_alloc_delayed_extent_op fs/btrfs/delayed-ref.h:299 [inline]
   btrfs_alloc_tree_block+0x38c/0x670 fs/btrfs/extent-tree.c:4833
   __btrfs_cow_block+0x16f/0x7d0 fs/btrfs/ctree.c:415
   btrfs_cow_block+0x12a/0x300 fs/btrfs/ctree.c:570
   btrfs_search_slot+0x6b0/0xee0 fs/btrfs/ctree.c:1768
   btrfs_insert_empty_items+0x80/0xf0 fs/btrfs/ctree.c:3905
   btrfs_new_inode+0x311/0xa60 fs/btrfs/inode.c:6530
   btrfs_create+0x12b/0x270 fs/btrfs/inode.c:6783
   lookup_open+0x660/0x780 fs/namei.c:3282
   open_last_lookups fs/namei.c:3352 [inline]
   path_openat+0x465/0xe20 fs/namei.c:3557
   do_filp_open+0xe3/0x170 fs/namei.c:3588
   do_sys_openat2+0x357/0x4a0 fs/open.c:1200
   do_sys_open+0x87/0xd0 fs/open.c:1216
   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
   do_syscall_64+0x34/0xb0 arch/x86/entry/common.c:80
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  RIP: 0033:0x46ae99
  Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48
  89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
  01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
  RSP: 002b:00007f46711b9c48 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
  RAX: ffffffffffffffda RBX: 000000000078c0a0 RCX: 000000000046ae99
  RDX: 0000000000000000 RSI: 00000000000000a1 RDI: 0000000020005800
  RBP: 00007f46711b9c80 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000017
  R13: 0000000000000000 R14: 000000000078c0a0 R15: 00007ffc129da6e0

  ================================================
  WARNING: lock held when returning to user space!
  5.15.0-rc1 #16 Not tainted
  ------------------------------------------------
  syz-executor/7579 is leaving the kernel with locks still held!
  1 lock held by syz-executor/7579:
   #0: ffff888104b73da8 (btrfs-tree-01/1){+.+.}-{3:3}, at:
  __btrfs_tree_lock+0x2e/0x1a0 fs/btrfs/locking.c:112

[CAUSE]
In btrfs_alloc_tree_block(), after btrfs_init_new_buffer(), the new
extent buffer @buf is locked, but if later operations like adding
delayed tree ref fail, we just free @buf without unlocking it,
resulting above warning.

[FIX]
Unlock @buf in out_free_buf: label.

Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsZ9O6Zr0KK1yGn=1rQi6Crh1yeCRdTSBxx9R99L4xdn-Q@mail.gmail.com/
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-07 22:04:20 +02:00
Marcos Paulo de Souza
6534c0c99d btrfs: pass NULL as trans to btrfs_search_slot if we only want to search
Using a transaction in btrfs_search_slot is only useful when we are
searching to add or modify the tree. When the function is used for
searching, insert length and mod arguments are 0, there is no need to
use a transaction.

No functional changes, changing for consistency.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:00 +02:00
Anand Jain
23608d51a3 btrfs: cleanup fs_devices pointer usage in btrfs_trim_fs
Drop variable 'devices' (used only once) and add new variable for
the fs_devices, so it is used at two locations within btrfs_trim_fs()
function and also helps to access fs_devices->devices.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:18:59 +02:00
Anand Jain
16a200f66e btrfs: check for missing device in btrfs_trim_fs
A fstrim on a degraded raid1 can trigger the following null pointer
dereference:

  BTRFS info (device loop0): allowing degraded mounts
  BTRFS info (device loop0): disk space caching is enabled
  BTRFS info (device loop0): has skinny extents
  BTRFS warning (device loop0): devid 2 uuid 97ac16f7-e14d-4db1-95bc-3d489b424adb is missing
  BTRFS warning (device loop0): devid 2 uuid 97ac16f7-e14d-4db1-95bc-3d489b424adb is missing
  BTRFS info (device loop0): enabling ssd optimizations
  BUG: kernel NULL pointer dereference, address: 0000000000000620
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP NOPTI
  CPU: 0 PID: 4574 Comm: fstrim Not tainted 5.13.0-rc7+ #31
  Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
  RIP: 0010:btrfs_trim_fs+0x199/0x4a0 [btrfs]
  RSP: 0018:ffff959541797d28 EFLAGS: 00010293
  RAX: 0000000000000000 RBX: ffff946f84eca508 RCX: a7a67937adff8608
  RDX: ffff946e8122d000 RSI: 0000000000000000 RDI: ffffffffc02fdbf0
  RBP: ffff946ea4615000 R08: 0000000000000001 R09: 0000000000000000
  R10: 0000000000000000 R11: ffff946e8122d960 R12: 0000000000000000
  R13: ffff959541797db8 R14: ffff946e8122d000 R15: ffff959541797db8
  FS:  00007f55917a5080(0000) GS:ffff946f9bc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000620 CR3: 000000002d2c8001 CR4: 00000000000706f0
  Call Trace:
  btrfs_ioctl_fitrim+0x167/0x260 [btrfs]
  btrfs_ioctl+0x1c00/0x2fe0 [btrfs]
  ? selinux_file_ioctl+0x140/0x240
  ? syscall_trace_enter.constprop.0+0x188/0x240
  ? __x64_sys_ioctl+0x83/0xb0
  __x64_sys_ioctl+0x83/0xb0

Reproducer:

  $ mkfs.btrfs -fq -d raid1 -m raid1 /dev/loop0 /dev/loop1
  $ mount /dev/loop0 /btrfs
  $ umount /btrfs
  $ btrfs dev scan --forget
  $ mount -o degraded /dev/loop0 /btrfs

  $ fstrim /btrfs

The reason is we call btrfs_trim_free_extents() for the missing device,
which uses device->bdev (NULL for missing device) to find if the device
supports discard.

Fix is to check if the device is missing before calling
btrfs_trim_free_extents().

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-22 15:49:49 +02:00
Josef Bacik
138a12d865 btrfs: rip out btrfs_space_info::total_bytes_pinned
We used this in may_commit_transaction() in order to determine if we
needed to commit the transaction.  However we no longer have that logic
and thus have no use of this counter anymore, so delete it.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22 14:55:25 +02:00
David Sterba
1a9fd4172d btrfs: fix typos in comments
Fix typos that have snuck in since the last round. Found by codespell.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22 14:11:57 +02:00
Josef Bacik
5963ffcaf3 btrfs: always abort the transaction if we abort a trans handle
While stress testing our error handling I noticed that sometimes we
would still commit the transaction even though we had aborted the
transaction.

Currently we track if a trans handle has dirtied any metadata, and if it
hasn't we mark the filesystem as having an error (so no new transactions
can be started), but we will allow the current transaction to complete
as we do not mark the transaction itself as having been aborted.

This sounds good in theory, but we were not properly tracking IO errors
in btrfs_finish_ordered_io, and thus committing the transaction with
bogus free space data.  This isn't necessarily a problem per-se with the
free space cache, as the other guards in place would have kept us from
accepting the free space cache as valid, but highlights a real world
case where we had a bug and could have corrupted the filesystem because
of it.

This "skip abort on empty trans handle" is nice in theory, but assumes
we have perfect error handling everywhere, which we clearly do not.
Also we do not allow further transactions to be started, so all this
does is save the last transaction that was happening, which doesn't
necessarily gain us anything other than the potential for real
corruption.

Remove this particular bit of code, if we decide we need to abort the
transaction then abort the current one and keep us from doing real harm
to the file system, regardless of whether this specific trans handle
dirtied anything or not.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:06 +02:00
Josef Bacik
856bd270dc btrfs: return errors from btrfs_del_csums in cleanup_ref_head
We are unconditionally returning 0 in cleanup_ref_head, despite the fact
that btrfs_del_csums could fail.  We need to return the error so the
transaction gets aborted properly, fix this by returning ret from
btrfs_del_csums in cleanup_ref_head.

Reviewed-by: Qu Wenruo <wqu@suse.com>
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-27 23:30:55 +02:00
Anand Jain
5e753a817b btrfs: fix unmountable seed device after fstrim
The following test case reproduces an issue of wrongly freeing in-use
blocks on the readonly seed device when fstrim is called on the rw sprout
device. As shown below.

Create a seed device and add a sprout device to it:

  $ mkfs.btrfs -fq -dsingle -msingle /dev/loop0
  $ btrfstune -S 1 /dev/loop0
  $ mount /dev/loop0 /btrfs
  $ btrfs dev add -f /dev/loop1 /btrfs
  BTRFS info (device loop0): relocating block group 290455552 flags system
  BTRFS info (device loop0): relocating block group 1048576 flags system
  BTRFS info (device loop0): disk added /dev/loop1
  $ umount /btrfs

Mount the sprout device and run fstrim:

  $ mount /dev/loop1 /btrfs
  $ fstrim /btrfs
  $ umount /btrfs

Now try to mount the seed device, and it fails:

  $ mount /dev/loop0 /btrfs
  mount: /btrfs: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.

Block 5292032 is missing on the readonly seed device:

 $ dmesg -kt | tail
 <snip>
 BTRFS error (device loop0): bad tree block start, want 5292032 have 0
 BTRFS warning (device loop0): couldn't read-tree root
 BTRFS error (device loop0): open_ctree failed

From the dump-tree of the seed device (taken before the fstrim). Block
5292032 belonged to the block group starting at 5242880:

  $ btrfs inspect dump-tree -e /dev/loop0 | grep -A1 BLOCK_GROUP
  <snip>
  item 3 key (5242880 BLOCK_GROUP_ITEM 8388608) itemoff 16169 itemsize 24
  	block group used 114688 chunk_objectid 256 flags METADATA
  <snip>

From the dump-tree of the sprout device (taken before the fstrim).
fstrim used block-group 5242880 to find the related free space to free:

  $ btrfs inspect dump-tree -e /dev/loop1 | grep -A1 BLOCK_GROUP
  <snip>
  item 1 key (5242880 BLOCK_GROUP_ITEM 8388608) itemoff 16226 itemsize 24
  	block group used 32768 chunk_objectid 256 flags METADATA
  <snip>

BPF kernel tracing the fstrim command finds the missing block 5292032
within the range of the discarded blocks as below:

  kprobe:btrfs_discard_extent {
  	printf("freeing start %llu end %llu num_bytes %llu:\n",
  		arg1, arg1+arg2, arg2);
  }

  freeing start 5259264 end 5406720 num_bytes 147456
  <snip>

Fix this by avoiding the discard command to the readonly seed device.

Reported-by: Chris Murphy <lists@colorremedies.com>
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-04 14:41:37 +02:00
Filipe Manana
888dd18339 btrfs: use the new bit BTRFS_FS_TREE_MOD_LOG_USERS at btrfs_free_tree_block()
Instead of exposing implementation details of the tree mod log to check
if there are active tree mod log users at btrfs_free_tree_block(), use
the new bit BTRFS_FS_TREE_MOD_LOG_USERS for fs_info->flags instead. This
way extent-tree.c does not need to known about any of the internals of
the tree mod log and avoids taking a lock unnecessarily as well.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:16 +02:00
Anand Jain
05947ae186 btrfs: unexport btrfs_extent_readonly() and make it static
btrfs_extent_readonly() is used by can_nocow_extent() in inode.c. So
move it from extent-tree.c to inode.c and declare it as static.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:14 +02:00
Filipe Manana
485df75554 btrfs: always pin deleted leaves when there are active tree mod log users
When freeing a tree block we may end up adding its extent back to the
free space cache/tree, as long as there are no more references for it,
it was created in the current transaction and writeback for it never
happened. This is generally fine, however when we have tree mod log
operations it can result in inconsistent versions of a btree after
unwinding extent buffers with the recorded tree mod log operations.

This is because:

* We only log operations for nodes (adding and removing key/pointers),
  for leaves we don't do anything;

* This means that we can log a MOD_LOG_KEY_REMOVE_WHILE_FREEING operation
  for a node that points to a leaf that was deleted;

* Before we apply the logged operation to unwind a node, we can have
  that leaf's extent allocated again, either as a node or as a leaf, and
  possibly for another btree. This is possible if the leaf was created in
  the current transaction and writeback for it never started, in which
  case btrfs_free_tree_block() returns its extent back to the free space
  cache/tree;

* Then, before applying the tree mod log operation, some task allocates
  the metadata extent just freed before, and uses it either as a leaf or
  as a node for some btree (can be the same or another one, it does not
  matter);

* After applying the MOD_LOG_KEY_REMOVE_WHILE_FREEING operation we now
  get the target node with an item pointing to the metadata extent that
  now has content different from what it had before the leaf was deleted.
  It might now belong to a different btree and be a node and not a leaf
  anymore.

  As a consequence, the results of searches after the unwinding can be
  unpredictable and produce unexpected results.

So make sure we pin extent buffers corresponding to leaves when there
are tree mod log users.

CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-16 20:32:22 +01:00
Naohiro Aota
40ab3be102 btrfs: zoned: extend zoned allocator to use dedicated tree-log block group
This is the 1/3 patch to enable tree log on zoned filesystems.

The tree-log feature does not work on a zoned filesystem as is. Blocks for
a tree-log tree are allocated mixed with other metadata blocks and btrfs
writes and syncs the tree-log blocks to devices at the time of fsync(),
which has a different timing than a global transaction commit. As a
result, both writing tree-log blocks and writing other metadata blocks
become non-sequential writes that zoned filesystems must avoid.

Introduce a dedicated block group for tree-log blocks, so that tree-log
blocks and other metadata blocks can be separate write streams.  As a
result, each write stream can now be written to devices separately.
"fs_info->treelog_bg" tracks the dedicated block group and assigns
"treelog_bg" on-demand on tree-log block allocation time.

This commit extends the zoned block allocator to use the block group.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:08 +01:00
Naohiro Aota
6143c23ccc btrfs: zoned: implement cloning for zoned device-replace
This is 2/4 patch to implement device replace for zoned filesystems.

In zoned mode, a block group must be either copied (from the source
device to the target device) or cloned (to both devices).

Implement the cloning part. If a block group targeted by an IO is marked
to copy, we should not clone the IO to the destination device, because
the block group is eventually copied by the replace process.

This commit also handles cloning of device reset.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota
dcba6e48b5 btrfs: zoned: reset zones of unused block groups
We must reset the zones of a deleted unused block group to rewind the
zones' write pointers to the zones' start.

To do this, we can use the DISCARD_SYNC code to do the reset when the
filesystem is running on zoned devices.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:04 +01:00
Naohiro Aota
d3575156f6 btrfs: zoned: redirty released extent buffers
Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Such nodes are cleaned so that pages in the
node are not uselessly written out. On zoned volumes, however, such
optimization blocks the following IOs as the cancellation of the write
out of the freed blocks breaks the sequential write sequence expected by
the device.

Introduce a list of clean and unwritten extent buffers that have been
released in a transaction. Redirty the buffers so that
btree_write_cache_pages() can send proper bios to the devices.

Besides it clears the entire content of the extent buffer not to confuse
raw block scanners e.g. 'btrfs check'. By clearing the content,
csum_dirty_buffer() complains about bytenr mismatch, so avoid the
checking and checksum using newly introduced buffer flag
EXTENT_BUFFER_NO_CHECK.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:04 +01:00
Naohiro Aota
2eda57089e btrfs: zoned: implement sequential extent allocation
Implement a sequential extent allocator for zoned filesystems. This
allocator only needs to check if there is enough space in the block group
after the allocation pointer to satisfy the extent allocation request.
Therefore the allocator never manages bitmaps or clusters. Also, add
assertions to the corresponding functions.

As zone append writing is used, it would be unnecessary to track the
allocation offset, as the allocator only needs to check available space.
But by tracking and returning the offset as an allocated region, we can
skip modification of ordered extents and checksum information when there
is no IO reordering.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:03 +01:00
Naohiro Aota
169e0da91a btrfs: zoned: track unusable bytes for zones
In a zoned filesystem a once written then freed region is not usable
until the underlying zone has been reset. So we need to distinguish such
unusable space from usable free space.

Therefore we need to introduce the "zone_unusable" field to the block
group structure, and "bytes_zone_unusable" to the space_info structure
to track the unusable space.

Pinned bytes are always reclaimed to the unusable space. But, when an
allocated region is returned before using e.g., the block group becomes
read-only between allocation time and reservation time, we can safely
return the region to the block group. For the situation, this commit
introduces "btrfs_add_free_space_unused". This behaves the same as
btrfs_add_free_space() on regular filesystem. On zoned filesystems, it
rewinds the allocation offset.

Because the read-only bytes tracks free but unusable bytes when the block
group is read-only, we need to migrate the zone_unusable bytes to
read-only bytes when a block group is marked read-only.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:03 +01:00
Josef Bacik
b7774425e0 btrfs: remove bogus BUG_ON in alloc_reserved_tree_block
The fix 361048f586 ("Btrfs: fix full backref problem when inserting
shared block reference") added a delayed ref flushing at subvolume
creation time in order to avoid hitting this particular BUG_ON().

Before this fix, we were tripping the BUG_ON() by

1. Modify snapshot A, which creates blocks with a normal reference for
   snapshot A, as A is the owner of these blocks.  We now have delayed
   refs for these blocks.
2. Create a snapshot of A named B, which pushes references for the
   children blocks of the root node for the new root B, thus creating
   more delayed refs for newly allocated blocks.
3. A is modified, and because the metadata blocks can now be shared, it
   must push FULL_BACKREF references to the children of any block that A
   COWs down it's path to its target key.
4. Delayed refs are run.  Because these are newly allocated blocks, we
   have ->must_insert_reserved reserved set on the delayed ref head, we
   call into alloc_reserved_tree_block() to add the extent item, and
   then add our ref.  At the time of this fix, we were ordering
   FULL_BACKREF delayed ref operations first, so we'd go to add this
   reference and then BUG_ON() because we didn't have the FULL_BACKREF
   flag set.

The patch fixed this problem by making sure we ran the delayed refs
before we had the chance to modify A.  This meant that any *new* blocks
would have had their extent items created _before_ we would ever
actually COW down and generate FULL_BACKREF entries.  Thus the problem
went away.

However this BUG_ON() is actually completely bogus.  The existence of a
full backref doesn't necessarily mean that FULL_BACKREF must be set on
that block, it must only be set on the actual parent itself.  Consider
the example provided above.  If we COW down one path from A, any nodes
are going to have a FULL_BACKREF ref pushed down to _all_ of their
children, but not all of the children are going to have FULL_BACKREF
set.  It is completely valid to have an extent item with normal and full
backrefs without FULL_BACKREF actually set on the block itself.

As a final note, I have been testing with the patch (applied after this
one)

  btrfs: stop running all delayed refs during snapshot

which removed this flushing.  My test was a torture test which did a lot
of operations while snapshotting and deleting snapshots as well as
relocation, and I never tripped this BUG_ON().  This is actually because
at the time of 361048f586, we ordered SHARED keys _before_ normal
references, and thus they would get run first.  However currently they
are ordered _after_ normal references, so we'd do the initial creation
without having a shared reference, and thus not hit this BUG_ON(), which
explains why I didn't start hitting this problem during my testing with
my other patch applied.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:57 +01:00