Commit graph

1216533 commits

Author SHA1 Message Date
Filipe Manana
0a325e620e btrfs: remove redundant root argument from maybe_insert_hole()
The root argument for maybe_insert_hole() always matches the root of the
given inode, so remove the root argument and get it from the inode
argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Filipe Manana
04bd8e9410 btrfs: remove redundant root argument from btrfs_delayed_update_inode()
The root argument for btrfs_delayed_update_inode() always matches the root
of the given inode, so remove the root argument and get it from the inode
argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Filipe Manana
07a274a886 btrfs: remove redundant root argument from btrfs_update_inode_item()
The root argument for btrfs_update_inode_item() always matches the root of
the given inode, so remove the root argument and get it from the inode
argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Filipe Manana
8b9d032225 btrfs: remove redundant root argument from btrfs_update_inode()
The root argument for btrfs_update_inode() always matches the root of the
given inode, so remove the root argument and get it from the inode
argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Filipe Manana
0a5d0dc55f btrfs: remove redundant root argument from btrfs_update_inode_fallback()
The root argument for btrfs_update_inode_fallback() always matches the
root of the given inode, so remove the root argument and get it from the
inode argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Filipe Manana
cddaaacca9 btrfs: remove noinline from btrfs_update_inode()
The noinline attribute of btrfs_update_inode() is pointless as the
function is exported and widely used, so remove it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Filipe Manana
2199cb0f5e btrfs: simplify error check condition at btrfs_dirty_inode()
The following condition at btrfs_dirty_inode() is redundant:

  if (ret && (ret == -ENOSPC || ret == -EDQUOT))

The first check for a non-zero 'ret' value is pointless, we can simplify
this to simply:

  if (ret == -ENOSPC || ret == -EDQUOT)

Not only this makes it easier to read, it also slightly reduces the text
size of the btrfs kernel module:

  $ size fs/btrfs/btrfs.ko.before
     text	   data	    bss	    dec	    hex	filename
  1641400	 168265	  16864	1826529	 1bdee1	fs/btrfs/btrfs.ko.before

  $ size fs/btrfs/btrfs.ko.after
     text	   data	    bss	    dec	    hex	filename
  1641224	 168181	  16864	1826269	 1bdddd	fs/btrfs/btrfs.ko.after

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Boris Burkov
e076145115 btrfs: qgroup: only set QUOTA_ENABLED when done reading qgroups
In open_ctree, we set BTRFS_FS_QUOTA_ENABLED as soon as we see a
quota_root, as opposed to after we are done setting up the qgroup
structures. In the quota_enable path, we wait until after the structures
are set up. Likewise, in disable, we clear the bit before tearing down
the structures. I feel that this organization is less surprising for the
open_ctree path.

I don't believe this fixes any actual bug, but avoids potential
confusion when using btrfs_qgroup_mode in an intermediate state where we
are enabled but haven't yet setup the qgroup status flags. It also
avoids any risk of calling a qgroup function and attempting to use the
qgroup rbtrees before they exist/are setup.

This all occurs before we do rw setup, so I believe it should be mostly
a no-op.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Boris Burkov
2672a051e3 btrfs: track data relocation with simple quota
Relocation data allocations are quite tricky for simple quotas. The
basic data relocation sequence is (ignoring details that aren't relevant
to this fix):

- create a fake relocation data fs root
- create a fake relocation inode in that root
- for each data extent:
  - preallocate a data extent on behalf of the fake inode
  - copy over the data
- for each extent
  - swap the refs so that the original file extent now refers to the new
    extent item
- drop the fake root, dropping its refs on the old extents, which lets
  us delete them.

Done naively, this results in storing an extent item in the extent tree
whose owner_ref points at the relocation data root and a no-op squota
recording, since the reloc root is not a legit fstree. So far, that's
OK. The problem comes when you do the swap, and leave an extent item
owned by this bogus root as the real permanent extents of the file. If
the file then drops that ref, we free it and no-op account that against
the fake relocation root. Essentially, this means that relocation is
simple quota "extent laundering", since we re-own the extents into a
fake root.

Simple quotas very intentionally doesn't have a mechanism for
transferring ownership of extents, as that is exactly the complicated
thing we are trying to avoid with the new design. Further, it cannot be
correctly done in this case, since at the time you create the new
"real" refs, there is no way to know which was the original owner before
relocation unless we track it.

Therefore, it makes more sense to trick the preallocation to handle
relocation as a special case and note the proper owner ref from the
beginning. That way, we never write out an extent item without the
correct owner ref that it will eventually have.

This could be done by wiring a special root parameter all the way
through the allocation code path, but to avoid that special case
touching all the code, take advantage of the serial nature of relocation
to store the src root on the relocation root object. Then when we finish
the prealloc, if it happens to be this case, prepare the delayed ref
appropriately.

We must also add logic to handle relocating adjacent extents with
different owning roots. Those cannot be preallocated together in a
cluster as it would lose the separate ownership information.

This is obviously a smelly bit of code, but I think it is the best
solution to the problem, given the relocation implementation.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Boris Burkov
60ea105a0f btrfs: qgroup: track metadata relocation COW with simple quota
Relocation COWs metadata blocks in two cases for the reloc root:

- copying the subvolume root item when creating the reloc root
- copying a btree node when there is a COW during relocation

In both cases, the resulting btree node hits an abnormal code path with
respect to the owner field in its btrfs_header. It first creates the
root item for the new objectid, which populates the reloc root id, and
it at this point that delayed refs are created.

Later, it fully copies the old node into the new node (including the
original owner field) which overwrites it. This results in a simple
quotas mismatch where we run the delayed ref for the reloc root which
has no simple quota effect (reloc root is not an fstree) but when we
ultimately delete the node, the owner is the real original fstree and we
do free the space.

To work around this without tampering with the behavior of relocation,
add a parameter to btrfs_add_tree_block that lets the relocation code
path specify a different owning root than the "operating" root (in this
case, owning root is the real root and the operating root is the reloc
root). These can naturally be plumbed into delayed refs that have the
same concept.

Note that this is a double count in some sense, but a relatively natural
one, as there are really two extents, and the old one will be deleted
soon. This is consistent with how data relocation extents are accounted
by simple quotas.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Boris Burkov
bd7c1ea3a3 btrfs: qgroup: check generation when recording simple quota delta
Simple quotas count extents only from the moment the feature is enabled.
Therefore, if we do something like:

1. create subvol S
2. write F in S
3. enable quotas
4. remove F
5. write G in S

then after 3. and 4. we would expect the simple quota usage of S to be 0
(putting aside some metadata extents that might be written) and after
5., it should be the size of G plus metadata. Therefore, we need to be
able to determine whether a particular quota delta we are processing
predates simple quota enablement.

To do this, store the transaction id when quotas were enabled. In
fs_info for immediate use and in the quota status item to make it
recoverable on mount. When we see a delta, check if the generation of
the extent item is less than that of quota enablement. If so, we should
ignore the delta from this extent.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:11 +02:00
Boris Burkov
5343cd9364 btrfs: qgroup: simple quota auto hierarchy for nested subvolumes
Consider the following sequence:

- enable quotas
- create subvol S id 256 at dir outer/
- create a qgroup 1/100
- add 0/256 (S's auto qgroup) to 1/100
- create subvol T id 257 at dir outer/inner/

With full qgroups, there is no relationship between 0/257 and either of
0/256 or 1/100. There is an inherit feature that the creator of inner/
can use to specify it ought to be in 1/100.

Simple quotas are targeted at container isolation, where such automatic
inheritance for not necessarily trusted/controlled nested subvol
creation would be quite helpful. Therefore, add a new default behavior
for simple quotas: when you create a nested subvol, automatically
inherit as parents any parents of the qgroup of the subvol the new inode
is going in.

In our example, 257/0 would also be under 1/100, allowing easy control
of a total quota over an arbitrary hierarchy of subvolumes.

I think this _might_ be a generally useful behavior, so it could be
interesting to put it behind a new inheritance flag that simple quotas
always use while traditional quotas let the user specify, but this is a
minimally intrusive change to start.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:11 +02:00
Boris Burkov
cecbb533b5 btrfs: record simple quota deltas in delayed refs
At the moment that we run delayed refs, we make the final ref-count
based decision on creating/removing extent (and metadata) items.
Therefore, it is exactly the spot to hook up simple quotas.

There are a few important subtleties to the fields we must collect to
accurately track simple quotas, particularly when removing an extent.
When removing a data extent, the ref could be in any tree (due to
reflink, for example) and so we need to recover the owning root id from
the owner ref item. When removing a metadata extent, we know the owning
root from the owner field in the header when we create the delayed ref,
so we can recover it from there.

We must also be careful to handle reservations properly to not leaked
reserved space. The happy path is freeing the reservation when the
simple quota delta runs on a data extent. If that doesn't happen, due to
refs canceling out or some error, the ref head already has the
must_insert_reserved machinery to handle this, so we piggy back on that
and use it to clean up the reserved data.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:11 +02:00
Boris Burkov
8d29909140 btrfs: add helper for inline owner ref lookup
Inline ref parsing is a bit tricky and relies on a decent amount of
implicit information, so I think it is beneficial to have a helper
function for reading the owner ref, if only to "document" the format,
along with the write path.

The main subtlety of note which I was missing by open-coding this was
that it is important to check whether or not inline refs are present
*at all*. i.e., if we are writing out a new extent under squotas, we
will always use a big enough item for the inline ref and have it.
However, it is possible that some random item predating squotas will not
have any inline refs. In that case, trying to read the "type" field of
the first inline ref will just be reading garbage in the form of
whatever is in the next item.

This will be used by the extent free-ing path, which looks up data
extent owners as well as a relocation path which needs to grab the owner
before relocating an extent.

Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:11 +02:00
Boris Burkov
d9a620f77e btrfs: new inline ref storing owning subvol of data extents
In order to implement simple quota groups, we need to be able to
associate a data extent with the subvolume that created it. Once you
account for reflink, this information cannot be recovered without
explicitly storing it. Options for storing it are:

- a new key/item
- a new extent inline ref item

The former is backwards compatible, but wastes space, the latter is
incompat, but is efficient in space and reuses the existing inline ref
machinery, while only abusing it a tiny amount -- specifically, the new
item is not a ref, per-se.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:11 +02:00
Boris Burkov
cf79ac4793 btrfs: track original extent owner in head_ref
Simple quotas requires tracking the original creating root of any given
extent. This gets complicated when multiple subvolumes create
overlapping/contradictory refs in the same transaction. For example,
due to modifying or deleting an extent while also snapshotting it.

To resolve this in a general way, take advantage of the fact that we are
essentially already tracking this for handling releasing reservations.
The head ref coalesces the various refs and uses must_insert_reserved to
check if it needs to create an extent/free reservation. Store the ref
that set must_insert_reserved as the owning ref on the head ref.

Note that this can result in writing an extent for the very first time
with an owner different from its only ref, but it will look the same as
if you first created it with the original owning ref, then added the
other ref, then removed the owning ref.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:11 +02:00
Boris Burkov
457cb1ddf5 btrfs: track owning root in btrfs_ref
While data extents require us to store additional inline refs to track
the original owner on free, this information is available implicitly for
metadata. It is found in the owner field of the header of the tree
block. Even if other trees refer to this block and the original ref goes
away, we will not rewrite that header field, so it will reliably give the
original owner.

In addition, there is a relocation case where a new data extent needs to
have an owning root separate from the referring root wired through
delayed refs.

To use it for recording simple quota deltas, we need to wire this root
id through from when we create the delayed ref until we fully process
it. Store it in the generic btrfs_ref struct of the delayed ref.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:11 +02:00
Boris Burkov
610647d7ef btrfs: rename tree_ref and data_ref owning_root
commit 113479d5b8 ("btrfs: rename root fields in delayed refs structs")
changed these from ref_root to owning_root. However, there are many
circumstances where that name is not really accurate and the root on the
ref struct _is_ the referring root. In general, these are not the owning
root, though it does happen in some ref merging cases involving
overwrites during snapshots and similar.

Simple quotas cares quite a bit about tracking the original owner of an
extent through delayed refs, so rename these back to free up the name
for the real owning root (which will live on the generic btrfs_ref and
the head ref)

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:11 +02:00
Boris Burkov
1e0e9d5771 btrfs: add helper for recording simple quota deltas
Rather than re-computing shared/exclusive ownership based on backrefs
and walking roots for implicit backrefs, simple quotas does an increment
when creating an extent and a decrement when deleting it. Add the API
for the extent item code to use to track those events.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:11 +02:00
Boris Burkov
6ed05643dd btrfs: create qgroup earlier in snapshot creation
Pull creating the qgroup earlier in the snapshot. This allows simple
quotas qgroups to see all the metadata writes related to the snapshot
being created and to be born with the root node accounted.

Note this has an impact on transaction commit where the qgroup creation
can do a lot of work, allocate memory and take locks. The change is done
for correctness, potential performance issues will be fixed in the
future.

Signed-off-by: Boris Burkov <boris@bur.io>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:10 +02:00
Boris Burkov
af0e2aab3b btrfs: qgroup: flush reservations during quota disable
The following sequence:

  enable simple quotas
  do some writes
      reserve space
      create ordered_extent
	  release rsv (store rsv_bytes in OE, mark QGROUP_RESERVED bits)
  disable quotas
  enable simple quotas
      set qgroup rsv to 0 on all subvolumes
  ordered_extent finishes
      create delayed ref with rsv_bytes from before
  run delayed ref
      record_simple_quota_delta
	  free rsv_bytes (0 -> -rsv_delta)

results in us reliably underflowing the subvolume's qgroup rsv counter,
because disabling/re-enabling quotas toggles reservation counters down
to 0, but does not remove other file system state which represents
successful acquisition of qgroup rsv space. Specifically metadata rsv
counters on the root object and rsv_bytes on ordered_extent objects that
have released their reservation as well as the corresponding
QGROUP_RESERVED extent bits.

Normal qgroups gets away with this, I believe because it forces more
work to happen on transaction commit, but I am not certain it is totally
safe from the ordered_extent/leaked extent bit variant. Simple quotas
hits this reliably.

The intent of the fix is to make disable take the time to clear that
external to qgroups state as well: after flipping off the quota bit on
fs_info, flush delalloc and ordered extents, clearing the extent bits
along the way. This makes it so there are no ordered extents or meta
prealloc hanging around from the first enablement period during the second.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:10 +02:00
Boris Burkov
a744986ac4 btrfs: sysfs: add simple_quota incompat feature entry
Add an entry in the features directory for the new incompat flag

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:10 +02:00
Boris Burkov
0182764a21 btrfs: sysfs: expose quota mode via sysfs
Add a new sysfs file /sys/fs/btrfs/<uuid>/qgroups/mode
which prints out the mode qgroups is running in. The possible modes are
qgroup, and squota.

If quotas are not enabled, then the qgroups directory will not exist,
so don't handle that mode.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:10 +02:00
Boris Burkov
182940f4f4 btrfs: qgroup: add new quota mode for simple quotas
Add a new quota mode called "simple quotas". It can be enabled by the
existing quota enable ioctl via a new command, and sets an incompat
bit, as the implementation of simple quotas will make backwards
incompatible changes to the disk format of the extent tree.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:10 +02:00
Boris Burkov
6b0cd63bc7 btrfs: qgroup: introduce quota mode
In preparation for introducing simple quotas, change from a binary
setting for quotas to an enum based mode. Initially, the possible modes
are disabled/full. Full quotas is normal btrfs qgroups.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:10 +02:00
David Sterba
078b8b90b8 btrfs: merge ordered work callbacks in btrfs_work into one
There are two callbacks defined in btrfs_work but only two actually make
use of them, otherwise there are NULLs. We can get rid of the freeing
callback making it a special case of the normal work. This reduces the
size of btrfs_work by 8 bytes, final layout:

struct btrfs_work {
        btrfs_func_t               func;                 /*     0     8 */
        btrfs_ordered_func_t       ordered_func;         /*     8     8 */
        struct work_struct         normal_work;          /*    16    32 */
        struct list_head           ordered_list;         /*    48    16 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        struct btrfs_workqueue *   wq;                   /*    64     8 */
        long unsigned int          flags;                /*    72     8 */

        /* size: 80, cachelines: 2, members: 6 */
        /* last cacheline: 16 bytes */
};

This in turn reduces size of other structures (on a release config):

- async_chunk			 160 ->  152
- async_submit_bio		 152 ->  144
- btrfs_async_delayed_work	 104 ->   96
- btrfs_caching_control		 176 ->  168
- btrfs_delalloc_work		 144 ->  136
- btrfs_fs_info			3608 -> 3600
- btrfs_ordered_extent		 440 ->  424
- btrfs_writepage_fixup		 104 ->   96

Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:10 +02:00
Johannes Thumshirn
e9b9b911e0 btrfs: add raid stripe tree to features enabled with debug config
Until the raid stripe tree code is well enough tested and feature
complete, "hide" it behind CONFIG_BTRFS_DEBUG so only people who
want to use it are actually using it.

The scrub support may still fail some tests (btrfs/060 and up) and will
be fixed, RAID5/6 is not supported.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:10 +02:00
Johannes Thumshirn
e0b4077fcc btrfs: tree-checker: add support for raid stripe tree
Add a tree checker support for RAID stripe tree items, verify:

- alignment
- presence of the incompat bit
- supported encoding

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:10 +02:00
Johannes Thumshirn
b5e2c2ff67 btrfs: tracepoints: add events for raid stripe tree
Add trace events for raid-stripe-tree operations.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:10 +02:00
Johannes Thumshirn
9f9918a801 btrfs: sysfs: announce presence of raid-stripe-tree
If a filesystem with a raid-stripe-tree is mounted, show the RST feature
in sysfs, currently still under the CONFIG_BTRFS_DEBUG option.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:09 +02:00
Johannes Thumshirn
edde81f1ab btrfs: add raid stripe tree pretty printer
Decode raid-stripe-tree entries on btrfs_print_tree().

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:09 +02:00
Johannes Thumshirn
568220fa96 btrfs: zoned: support RAID0/1/10 on top of raid stripe tree
When we have a raid-stripe-tree, we can do RAID0/1/10 on zoned devices
for data block groups. For metadata block groups, we don't actually
need anything special, as all metadata I/O is protected by the
btrfs_zoned_meta_io_lock() already.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:09 +02:00
Johannes Thumshirn
9acaa64187 btrfs: scrub: implement raid stripe tree support
A filesystem that uses the raid stripe tree for logical to physical
address translation can't use the regular scrub path, that reads all
stripes and then checks if a sector is unused afterwards.

When using the raid stripe tree, this will result in lookup errors, as
the stripe tree doesn't know the requested logical addresses.

In case we're scrubbing a filesystem which uses the RAID stripe tree for
multi-device logical to physical address translation, perform an extra
block mapping step to get the real on-disk stripe length from the stripe
tree when scrubbing the sectors.

This prevents a double completion of the btrfs_bio caused by splitting the
underlying bio and ultimately a use-after-free.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:09 +02:00
Johannes Thumshirn
10e27980f2 btrfs: lookup physical address from stripe extent
Lookup the physical address from the raid stripe tree when a read on an
RAID volume formatted with the raid stripe tree was attempted.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:09 +02:00
Johannes Thumshirn
ca41504efd btrfs: delete stripe extent on extent deletion
As each stripe extent is tied to an extent item, delete the stripe extent
once the corresponding extent item is deleted.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:09 +02:00
Johannes Thumshirn
02c372e1f0 btrfs: add support for inserting raid stripe extents
Add support for inserting stripe extents into the raid stripe tree on
completion of every write that needs an extra logical-to-physical
translation when using RAID.

Inserting the stripe extents happens after the data I/O has completed,
this is done to

  a) support zone-append and
  b) rule out the possibility of a RAID-write-hole.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:09 +02:00
Johannes Thumshirn
515020900d btrfs: read raid stripe tree from disk
If we find the raid-stripe-tree on mount, read it from disk. This is
a backward incompatible feature. The rescue=ignorebadroots mount option
will skip this tree.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:09 +02:00
Johannes Thumshirn
ee1293308e btrfs: add raid stripe tree definitions
Add definitions for the raid stripe tree. This tree will hold information
about the on-disk layout of the stripes in a RAID set.

Each stripe extent has a 1:1 relationship with an on-disk extent item and
is doing the logical to per-drive physical address translation for the
extent item in question.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:09 +02:00
Qu Wenruo
6d3a61945b btrfs: warn on tree blocks which are not nodesize aligned
A long time ago, we had some metadata chunks which started at sector
boundary but not aligned to nodesize boundary.

This led to some older filesystems which can have tree blocks only
aligned to sectorsize, but not nodesize.

Later 'btrfs check' gained the ability to detect and warn about such tree
blocks, and kernel fixed the chunk allocation behavior, nowadays those
tree blocks should be pretty rare.

But in the future, if we want to migrate metadata to folio, we cannot
have such tree blocks, as filemap_add_folio() requires the page index to
be aligned with the folio number of pages.  Such unaligned tree blocks
can lead to VM_BUG_ON().

So this patch adds extra warning for those unaligned tree blocks, as a
preparation for the future folio migration.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:09 +02:00
Josef Bacik
11aeb97b45 btrfs: don't arbitrarily slow down delalloc if we're committing
We have a random schedule_timeout() if the current transaction is
committing, which seems to be a holdover from the original delalloc
reservation code.

Remove this, we have the proper flushing stuff, we shouldn't be hoping
for random timing things to make everything work.  This just induces
latency for no reason.

CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:08 +02:00
Filipe Manana
c967c19ea6 btrfs: remove useless comment from btrfs_pin_extent_for_log_replay()
The comment on top of btrfs_pin_extent_for_log_replay() mentioning that
the function must be called within a transaction is pointless as of
commit 9fce570454 ("btrfs: Make btrfs_pin_extent_for_log_replay take
transaction handle"), since the function now takes a transaction handle
as its first argument. So remove the comment because it's completely
useless now.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:08 +02:00
Filipe Manana
df423ee23b btrfs: remove stale comment from btrfs_free_extent()
A comment at btrfs_free_extent() mentions the call to btrfs_pin_extent()
unlocks the pinned mutex, however that mutex is long gone, it was removed
in 2009 by commit 04018de5d4 ("Btrfs: kill the pinned_mutex"). So just
delete the comment.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:08 +02:00
Christoph Hellwig
87463f7e02 btrfs: zoned: factor out DUP bg handling from btrfs_load_block_group_zone_info
Split the code handling a type DUP block group from
btrfs_load_block_group_zone_info to make the code more readable.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:08 +02:00
Christoph Hellwig
9e0e3e74dc btrfs: zoned: factor out single bg handling from btrfs_load_block_group_zone_info
Split the code handling a type single block group from
btrfs_load_block_group_zone_info to make the code more readable.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:08 +02:00
Christoph Hellwig
09a46725cc btrfs: zoned: factor out per-zone logic from btrfs_load_block_group_zone_info
Split out a helper for the body of the per-zone loop in
btrfs_load_block_group_zone_info to make the function easier to read and
modify.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:08 +02:00
Christoph Hellwig
15c12fcc50 btrfs: zoned: introduce a zone_info struct in btrfs_load_block_group_zone_info
Add a new zone_info structure to hold per-zone information in
btrfs_load_block_group_zone_info and prepare for breaking out helpers
from it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:08 +02:00
Filipe Manana
4d20c1def9 btrfs: remove pointless loop from btrfs_update_block_group()
When an extent is allocated or freed, we call btrfs_update_block_group()
to update its block group and space info. An extent always belongs to a
single block group, it can never span multiple block groups, so the loop
we have at btrfs_update_block_group() is pointless, as it always has a
single iteration. The loop was added in the very early days, 2007, when
the block group code was added in commit 9078a3e1e4 ("Btrfs: start of
block group code"), but even back then it seemed pointless.

So remove the loop and assert the block group containing the start offset
of the extent also contains the whole extent.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:07 +02:00
Filipe Manana
4ebe8d4788 btrfs: mark transaction id check as unlikely at btrfs_mark_buffer_dirty()
At btrfs_mark_buffer_dirty(), having a transaction id mismatch is never
expected to happen and it usually means there's a bug or some memory
corruption due to a bitflip for example. So mark the condition as unlikely
to optimize code generation as well as to make it obvious for human
readers that it is a very unexpected condition.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:07 +02:00
Filipe Manana
20cbe46035 btrfs: use btrfs_crit at btrfs_mark_buffer_dirty()
There's no need to use WARN() at btrfs_mark_buffer_dirty() to print an
error message, as we have the fs_info pointer we can use btrfs_crit()
which prints device information and makes the message have a more uniform
format. As we are already aborting the transaction we already have a stack
trace printed as well. So replace the use of WARN() with btrfs_crit().

Also slightly reword the message to use 'logical' instead of 'block' as
it's what is used in other error/warning messages.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:07 +02:00
Filipe Manana
50564b651d btrfs: abort transaction on generation mismatch when marking eb as dirty
When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(),
we check if its generation matches the running transaction and if not we
just print a warning. Such mismatch is an indicator that something really
went wrong and only printing a warning message (and stack trace) is not
enough to prevent a corruption. Allowing a transaction to commit with such
an extent buffer will trigger an error if we ever try to read it from disk
due to a generation mismatch with its parent generation.

So abort the current transaction with -EUCLEAN if we notice a generation
mismatch. For this we need to pass a transaction handle to
btrfs_mark_buffer_dirty() which is always available except in test code,
in which case we can pass NULL since it operates on dummy extent buffers
and all test roots have a single node/leaf (root node at level 0).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:07 +02:00