Commit graph

726 commits

Author SHA1 Message Date
Filipe Manana
a7ddeeb079 btrfs: prevent transaction block reserve underflow when starting transaction
When starting a transaction, with a non-zero number of items, we reserve
metadata space for that number of items and for delayed refs by doing a
call to btrfs_block_rsv_add(), with the transaction block reserve passed
as the block reserve argument. This reserves metadata space and adds it
to the transaction block reserve. Later we migrate the space we reserved
for delayed references from the transaction block reserve into the delayed
refs block reserve, by calling btrfs_migrate_to_delayed_refs_rsv().

btrfs_migrate_to_delayed_refs_rsv() decrements the number of bytes to
migrate from the source block reserve, and this however may result in an
underflow in case the space added to the transaction block reserve ended
up being used by another task that has not reserved enough space for its
own use - examples are tasks doing reflinks or hole punching because they
end up calling btrfs_replace_file_extents() -> btrfs_drop_extents() and
may need to modify/COW a variable number of leaves/paths, so they keep
trying to use space from the transaction block reserve when they need to
COW an extent buffer, and may end up trying to use more space then they
have reserved (1 unit/path only for removing file extent items).

This can be avoided by simply reserving space first without adding it to
the transaction block reserve, then add the space for delayed refs to the
delayed refs block reserve and finally add the remaining reserved space
to the transaction block reserve. This also makes the code a bit shorter
and simpler. So just do that.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-20 20:42:18 +02:00
Josef Bacik
77d20c685b btrfs: do not block starts waiting on previous transaction commit
Internally I got a report of very long stalls on normal operations like
creating a new file when auto relocation was running.  The reporter used
the 'bpf offcputime' tracer to show that we would get stuck in
start_transaction for 5 to 30 seconds, and were always being woken up by
the transaction commit.

Using my timing-everything script, which times how long a function takes
and what percentage of that total time is taken up by its children, I
saw several traces like this

1083 took 32812902424 ns
        29929002926 ns 91.2110% wait_for_commit_duration
        25568 ns 7.7920e-05% commit_fs_roots_duration
        1007751 ns 0.00307% commit_cowonly_roots_duration
        446855602 ns 1.36182% btrfs_run_delayed_refs_duration
        271980 ns 0.00082% btrfs_run_delayed_items_duration
        2008 ns 6.1195e-06% btrfs_apply_pending_changes_duration
        9656 ns 2.9427e-05% switch_commit_roots_duration
        1598 ns 4.8700e-06% btrfs_commit_device_sizes_duration
        4314 ns 1.3147e-05% btrfs_free_log_root_tree_duration

Here I was only tracing functions that happen where we are between
START_COMMIT and UNBLOCKED in order to see what would be keeping us
blocked for so long.  The wait_for_commit() we do is where we wait for a
previous transaction that hasn't completed it's commit.  This can
include all of the unpin work and other cleanups, which tends to be the
longest part of our transaction commit.

There is no reason we should be blocking new things from entering the
transaction at this point, it just adds to random latency spikes for no
reason.

Fix this by adding a PREP stage.  This allows us to properly deal with
multiple committers coming in at the same time, we retain the behavior
that the winner waits on the previous transaction and the losers all
wait for this transaction commit to occur.  Nothing else is blocked
during the PREP stage, and then once the wait is complete we switch to
COMMIT_START and all of the same behavior as before is maintained.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-08 14:10:49 +02:00
Filipe Manana
19288951ff btrfs: update comment for btrfs_join_transaction_nostart()
Update the comment for btrfs_join_transaction_nostart() to be more clear
about how it works and how it's different from btrfs_attach_transaction().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:17 +02:00
Filipe Manana
4490e803e1 btrfs: don't start transaction when joining with TRANS_JOIN_NOSTART
When joining a transaction with TRANS_JOIN_NOSTART, if we don't find a
running transaction we end up creating one. This goes against the purpose
of TRANS_JOIN_NOSTART which is to join a running transaction if its state
is at or below the state TRANS_STATE_COMMIT_START, otherwise return an
-ENOENT error and don't start a new transaction. So fix this to not create
a new transaction if there's no running transaction at or below that
state.

CC: stable@vger.kernel.org # 4.14+
Fixes: a6d155d2e3 ("Btrfs: fix deadlock between fiemap and transaction commits")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:17 +02:00
Boris Burkov
a649684967 btrfs: fix start transaction qgroup rsv double free
btrfs_start_transaction reserves metadata space of the PERTRANS type
before it identifies a transaction to start/join. This allows flushing
when reserving that space without a deadlock. However, it results in a
race which temporarily breaks qgroup rsv accounting.

T1                                              T2
start_transaction
do_stuff
                                            start_transaction
                                                qgroup_reserve_meta_pertrans
commit_transaction
    qgroup_free_meta_all_pertrans
                                            hit an error starting txn
                                            goto reserve_fail
                                            qgroup_free_meta_pertrans (already freed!)

The basic issue is that there is nothing preventing another commit from
committing before start_transaction finishes (in fact sometimes we
intentionally wait for it) so any error path that frees the reserve is
at risk of this race.

While this exact space was getting freed anyway, and it's not a huge
deal to double free it (just a warning, the free code catches this), it
can result in incorrectly freeing some other pertrans reservation in
this same reservation, which could then lead to spuriously granting
reservations we might not have the space for. Therefore, I do believe it
is worth fixing.

To fix it, use the existing prealloc->pertrans conversion mechanism.
When we first reserve the space, we reserve prealloc space and only when
we are sure we have a transaction do we convert it to pertrans. This way
any racing commits do not blow away our reservation, but we still get a
pertrans reservation that is freed when _this_ transaction gets committed.

This issue can be reproduced by running generic/269 with either qgroups
or squotas enabled via mkfs on the scratch device.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:13 +02:00
Filipe Manana
e5860f8207 btrfs: make find_first_extent_bit() return a boolean
Currently find_first_extent_bit() returns a 0 if it found a range in the
given io tree and 1 if it didn't find any. There's no need to return any
errors, so make the return value a boolean and invert the logic to make
more sense: return true if it found a range and false if it didn't find
any range.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:12 +02:00
Filipe Manana
b28ff3a7d7 btrfs: check for commit error at btrfs_attach_transaction_barrier()
btrfs_attach_transaction_barrier() is used to get a handle pointing to the
current running transaction if the transaction has not started its commit
yet (its state is < TRANS_STATE_COMMIT_START). If the transaction commit
has started, then we wait for the transaction to commit and finish before
returning - however we completely ignore if the transaction was aborted
due to some error during its commit, we simply return ERR_PT(-ENOENT),
which makes the caller assume everything is fine and no errors happened.

This could make an fsync return success (0) to user space when in fact we
had a transaction abort and the target inode changes were therefore not
persisted.

Fix this by checking for the return value from btrfs_wait_for_commit(),
and if it returned an error, return it back to the caller.

Fixes: d4edf39bd5 ("Btrfs: fix uncompleted transaction")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-26 13:57:47 +02:00
Filipe Manana
bf7ecbe987 btrfs: check if the transaction was aborted at btrfs_wait_for_commit()
At btrfs_wait_for_commit() we wait for a transaction to finish and then
always return 0 (success) without checking if it was aborted, in which
case the transaction didn't happen due to some critical error. Fix this
by checking if the transaction was aborted.

Fixes: 462045928b ("Btrfs: add START_SYNC, WAIT_SYNC ioctls")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-24 18:06:27 +02:00
Filipe Manana
df9f278239 btrfs: do not BUG_ON on failure to get dir index for new snapshot
During the transaction commit path, at create_pending_snapshot(), there
is no need to BUG_ON() in case we fail to get a dir index for the snapshot
in the parent directory. This should fail very rarely because the parent
inode should be loaded in memory already, with the respective delayed
inode created and the parent inode's index_cnt field already initialized.

However if it fails, it may be -ENOMEM like the comment at
create_pending_snapshot() says or any error returned by
btrfs_search_slot() through btrfs_set_inode_index_count(), which can be
pretty much anything such as -EIO or -EUCLEAN for example. So the comment
is not correct when it says it can only be -ENOMEM.

However doing a BUG_ON() here is overkill, since we can instead abort
the transaction and return the error. Note that any error returned by
create_pending_snapshot() will eventually result in a transaction
abort at cleanup_transaction(), called from btrfs_commit_transaction(),
but we can explicitly abort the transaction at this point instead so that
we get a stack trace to tell us that the call to btrfs_set_inode_index()
failed.

So just abort the transaction and return in case btrfs_set_inode_index()
returned an error at create_pending_snapshot().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:39 +02:00
Christoph Hellwig
f880fe6e0b btrfs: don't hold an extra reference for redirtied buffers
When btrfs_redirty_list_add redirties a buffer, it also acquires
an extra reference that is released on transaction commit.  But
this is not required as buffers that are dirty or under writeback
are never freed (look for calls to extent_buffer_under_io())).

Remove the extra reference and the infrastructure used to drop it
again.

History behind redirty logic:

In the first place, it used releasing_list to hold all the
to-be-released extent buffers, and decided which buffers to re-dirty at
the commit time. Then, in a later version, the behaviour got changed to
re-dirty a necessary buffer and add re-dirtied one to the list in
btrfs_free_tree_block(). In short, the list was there mostly for the
patch series' historical reason.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ add Naohiro's comment regarding history ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:26 +02:00
Christoph Hellwig
e917ff56c8 btrfs: determine synchronous writers from bio or writeback control
The writeback_control structure already passes down the information about
a writeback being synchronous from the core VM code, and thus information
is propagated into the bio REQ_SYNC flag through the wbc_to_write_flags
helper.

Use that information to decide if checksums calculation is offloaded to
a workqueue instead of btrfs_inode::sync_writers field that not only
bloats the inode but also has too wide scope, being inode wide instead
of limited to the actual writeback request.

The sync writes were set in:

- btrfs_do_write_iter - regular IO, sync status is set
- start_ordered_ops - ordered write start, writeback with WB_SYNC_ALL
  mode
- btrfs_write_marked_extents - write marked extents, writeback with
  WB_SYNC_ALL mode

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Filipe Manana
0f69d1f4d6 btrfs: correctly calculate delayed ref bytes when starting transaction
When starting a transaction, we are assuming the number of bytes used for
each delayed ref update matches the number of bytes used for each item
update, that is the return value of:

   btrfs_calc_insert_metadata_size(fs_info, num_items)

However that is not correct when we are using the free space tree, as we
need to multiply that value by 2, since delayed ref updates need to modify
the free space tree besides the extent tree.

So fix this by using btrfs_calc_delayed_ref_bytes() to get the correct
number of bytes used for delayed ref updates.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Filipe Manana
e4773b57b8 btrfs: make btrfs_block_rsv_full() check more boolean when starting transaction
When starting a transaction we are comparing the result of a call to
btrfs_block_rsv_full() with 0, but the function returns a boolean. While
in practice it is not incorrect, as 0 is equivalent to false, it makes it
a bit odd and less readable. So update the check to not compare against 0
and instead use the logical not (!) operator.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Filipe Manana
04fb3285a4 btrfs: collapse should_end_transaction() into btrfs_should_end_transaction()
The function should_end_transaction() is very short and only has one
caller, which is btrfs_should_end_transaction(). So move the code from
should_end_transaction() into btrfs_should_end_transaction().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:19 +02:00
Filipe Manana
2d82a40aa7 btrfs: fix deadlock when aborting transaction during relocation with scrub
Before relocating a block group we pause scrub, then do the relocation and
then unpause scrub. The relocation process requires starting and committing
a transaction, and if we have a failure in the critical section of the
transaction commit path (transaction state >= TRANS_STATE_COMMIT_START),
we will deadlock if there is a paused scrub.

That results in stack traces like the following:

  [42.479] BTRFS info (device sdc): relocating block group 53876686848 flags metadata|raid6
  [42.936] BTRFS warning (device sdc): Skipping commit of aborted transaction.
  [42.936] ------------[ cut here ]------------
  [42.936] BTRFS: Transaction aborted (error -28)
  [42.936] WARNING: CPU: 11 PID: 346822 at fs/btrfs/transaction.c:1977 btrfs_commit_transaction+0xcc8/0xeb0 [btrfs]
  [42.936] Modules linked in: dm_flakey dm_mod loop btrfs (...)
  [42.936] CPU: 11 PID: 346822 Comm: btrfs Tainted: G        W          6.3.0-rc2-btrfs-next-127+ #1
  [42.936] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  [42.936] RIP: 0010:btrfs_commit_transaction+0xcc8/0xeb0 [btrfs]
  [42.936] Code: ff ff 45 8b (...)
  [42.936] RSP: 0018:ffffb58649633b48 EFLAGS: 00010282
  [42.936] RAX: 0000000000000000 RBX: ffff8be6ef4d5bd8 RCX: 0000000000000000
  [42.936] RDX: 0000000000000002 RSI: ffffffffb35e7782 RDI: 00000000ffffffff
  [42.936] RBP: ffff8be6ef4d5c98 R08: 0000000000000000 R09: ffffb586496339e8
  [42.936] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8be6d38c7c00
  [42.936] R13: 00000000ffffffe4 R14: ffff8be6c268c000 R15: ffff8be6ef4d5cf0
  [42.936] FS:  00007f381a82b340(0000) GS:ffff8beddfcc0000(0000) knlGS:0000000000000000
  [42.936] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [42.936] CR2: 00007f1e35fb7638 CR3: 0000000117680006 CR4: 0000000000370ee0
  [42.936] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [42.936] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [42.936] Call Trace:
  [42.936]  <TASK>
  [42.936]  ? start_transaction+0xcb/0x610 [btrfs]
  [42.936]  prepare_to_relocate+0x111/0x1a0 [btrfs]
  [42.936]  relocate_block_group+0x57/0x5d0 [btrfs]
  [42.936]  ? btrfs_wait_nocow_writers+0x25/0xb0 [btrfs]
  [42.936]  btrfs_relocate_block_group+0x248/0x3c0 [btrfs]
  [42.936]  ? __pfx_autoremove_wake_function+0x10/0x10
  [42.936]  btrfs_relocate_chunk+0x3b/0x150 [btrfs]
  [42.936]  btrfs_balance+0x8ff/0x11d0 [btrfs]
  [42.936]  ? __kmem_cache_alloc_node+0x14a/0x410
  [42.936]  btrfs_ioctl+0x2334/0x32c0 [btrfs]
  [42.937]  ? mod_objcg_state+0xd2/0x360
  [42.937]  ? refill_obj_stock+0xb0/0x160
  [42.937]  ? seq_release+0x25/0x30
  [42.937]  ? __rseq_handle_notify_resume+0x3b5/0x4b0
  [42.937]  ? percpu_counter_add_batch+0x2e/0xa0
  [42.937]  ? __x64_sys_ioctl+0x88/0xc0
  [42.937]  __x64_sys_ioctl+0x88/0xc0
  [42.937]  do_syscall_64+0x38/0x90
  [42.937]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  [42.937] RIP: 0033:0x7f381a6ffe9b
  [42.937] Code: 00 48 89 44 24 (...)
  [42.937] RSP: 002b:00007ffd45ecf060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [42.937] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f381a6ffe9b
  [42.937] RDX: 00007ffd45ecf150 RSI: 00000000c4009420 RDI: 0000000000000003
  [42.937] RBP: 0000000000000003 R08: 0000000000000013 R09: 0000000000000000
  [42.937] R10: 00007f381a60c878 R11: 0000000000000246 R12: 00007ffd45ed0423
  [42.937] R13: 00007ffd45ecf150 R14: 0000000000000000 R15: 00007ffd45ecf148
  [42.937]  </TASK>
  [42.937] ---[ end trace 0000000000000000 ]---
  [42.937] BTRFS: error (device sdc: state A) in cleanup_transaction:1977: errno=-28 No space left
  [59.196] INFO: task btrfs:346772 blocked for more than 120 seconds.
  [59.196]       Tainted: G        W          6.3.0-rc2-btrfs-next-127+ #1
  [59.196] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [59.196] task:btrfs           state:D stack:0     pid:346772 ppid:1      flags:0x00004002
  [59.196] Call Trace:
  [59.196]  <TASK>
  [59.196]  __schedule+0x392/0xa70
  [59.196]  ? __pv_queued_spin_lock_slowpath+0x165/0x370
  [59.196]  schedule+0x5d/0xd0
  [59.196]  __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
  [59.197]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.197]  scrub_pause_off+0x21/0x50 [btrfs]
  [59.197]  scrub_simple_mirror+0x1c7/0x950 [btrfs]
  [59.197]  ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
  [59.198]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.198]  scrub_stripe+0x20d/0x740 [btrfs]
  [59.198]  scrub_chunk+0xc4/0x130 [btrfs]
  [59.198]  scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
  [59.198]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.198]  btrfs_scrub_dev+0x236/0x6a0 [btrfs]
  [59.199]  ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
  [59.199]  ? _copy_from_user+0x7b/0x80
  [59.199]  btrfs_ioctl+0xde1/0x32c0 [btrfs]
  [59.199]  ? refill_stock+0x33/0x50
  [59.199]  ? should_failslab+0xa/0x20
  [59.199]  ? kmem_cache_alloc_node+0x151/0x460
  [59.199]  ? alloc_io_context+0x1b/0x80
  [59.199]  ? preempt_count_add+0x70/0xa0
  [59.199]  ? __x64_sys_ioctl+0x88/0xc0
  [59.199]  __x64_sys_ioctl+0x88/0xc0
  [59.199]  do_syscall_64+0x38/0x90
  [59.199]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  [59.199] RIP: 0033:0x7f82ffaffe9b
  [59.199] RSP: 002b:00007f82ff9fcc50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [59.199] RAX: ffffffffffffffda RBX: 000055b191e36310 RCX: 00007f82ffaffe9b
  [59.199] RDX: 000055b191e36310 RSI: 00000000c400941b RDI: 0000000000000003
  [59.199] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
  [59.199] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82ff9fd640
  [59.199] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
  [59.199]  </TASK>
  [59.199] INFO: task btrfs:346773 blocked for more than 120 seconds.
  [59.200]       Tainted: G        W          6.3.0-rc2-btrfs-next-127+ #1
  [59.200] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [59.201] task:btrfs           state:D stack:0     pid:346773 ppid:1      flags:0x00004002
  [59.201] Call Trace:
  [59.201]  <TASK>
  [59.201]  __schedule+0x392/0xa70
  [59.201]  ? __pv_queued_spin_lock_slowpath+0x165/0x370
  [59.201]  schedule+0x5d/0xd0
  [59.201]  __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
  [59.201]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.201]  scrub_pause_off+0x21/0x50 [btrfs]
  [59.202]  scrub_simple_mirror+0x1c7/0x950 [btrfs]
  [59.202]  ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
  [59.202]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.202]  scrub_stripe+0x20d/0x740 [btrfs]
  [59.202]  scrub_chunk+0xc4/0x130 [btrfs]
  [59.203]  scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
  [59.203]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.203]  btrfs_scrub_dev+0x236/0x6a0 [btrfs]
  [59.203]  ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
  [59.203]  ? _copy_from_user+0x7b/0x80
  [59.203]  btrfs_ioctl+0xde1/0x32c0 [btrfs]
  [59.204]  ? should_failslab+0xa/0x20
  [59.204]  ? kmem_cache_alloc_node+0x151/0x460
  [59.204]  ? alloc_io_context+0x1b/0x80
  [59.204]  ? preempt_count_add+0x70/0xa0
  [59.204]  ? __x64_sys_ioctl+0x88/0xc0
  [59.204]  __x64_sys_ioctl+0x88/0xc0
  [59.204]  do_syscall_64+0x38/0x90
  [59.204]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  [59.204] RIP: 0033:0x7f82ffaffe9b
  [59.204] RSP: 002b:00007f82ff1fbc50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [59.204] RAX: ffffffffffffffda RBX: 000055b191e36790 RCX: 00007f82ffaffe9b
  [59.204] RDX: 000055b191e36790 RSI: 00000000c400941b RDI: 0000000000000003
  [59.204] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
  [59.204] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82ff1fc640
  [59.204] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
  [59.204]  </TASK>
  [59.204] INFO: task btrfs:346774 blocked for more than 120 seconds.
  [59.205]       Tainted: G        W          6.3.0-rc2-btrfs-next-127+ #1
  [59.205] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [59.206] task:btrfs           state:D stack:0     pid:346774 ppid:1      flags:0x00004002
  [59.206] Call Trace:
  [59.206]  <TASK>
  [59.206]  __schedule+0x392/0xa70
  [59.206]  schedule+0x5d/0xd0
  [59.206]  __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
  [59.206]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.206]  scrub_pause_off+0x21/0x50 [btrfs]
  [59.207]  scrub_simple_mirror+0x1c7/0x950 [btrfs]
  [59.207]  ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
  [59.207]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.207]  scrub_stripe+0x20d/0x740 [btrfs]
  [59.208]  scrub_chunk+0xc4/0x130 [btrfs]
  [59.208]  scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
  [59.208]  ? __mutex_unlock_slowpath.isra.0+0x9a/0x120
  [59.208]  btrfs_scrub_dev+0x236/0x6a0 [btrfs]
  [59.208]  ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
  [59.209]  ? _copy_from_user+0x7b/0x80
  [59.209]  btrfs_ioctl+0xde1/0x32c0 [btrfs]
  [59.209]  ? should_failslab+0xa/0x20
  [59.209]  ? kmem_cache_alloc_node+0x151/0x460
  [59.209]  ? alloc_io_context+0x1b/0x80
  [59.209]  ? preempt_count_add+0x70/0xa0
  [59.209]  ? __x64_sys_ioctl+0x88/0xc0
  [59.209]  __x64_sys_ioctl+0x88/0xc0
  [59.209]  do_syscall_64+0x38/0x90
  [59.209]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  [59.209] RIP: 0033:0x7f82ffaffe9b
  [59.209] RSP: 002b:00007f82fe9fac50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [59.209] RAX: ffffffffffffffda RBX: 000055b191e36c10 RCX: 00007f82ffaffe9b
  [59.209] RDX: 000055b191e36c10 RSI: 00000000c400941b RDI: 0000000000000003
  [59.209] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
  [59.209] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fe9fb640
  [59.209] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
  [59.209]  </TASK>
  [59.209] INFO: task btrfs:346775 blocked for more than 120 seconds.
  [59.210]       Tainted: G        W          6.3.0-rc2-btrfs-next-127+ #1
  [59.210] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [59.211] task:btrfs           state:D stack:0     pid:346775 ppid:1      flags:0x00004002
  [59.211] Call Trace:
  [59.211]  <TASK>
  [59.211]  __schedule+0x392/0xa70
  [59.211]  schedule+0x5d/0xd0
  [59.211]  __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
  [59.211]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.211]  scrub_pause_off+0x21/0x50 [btrfs]
  [59.212]  scrub_simple_mirror+0x1c7/0x950 [btrfs]
  [59.212]  ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
  [59.212]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.212]  scrub_stripe+0x20d/0x740 [btrfs]
  [59.213]  scrub_chunk+0xc4/0x130 [btrfs]
  [59.213]  scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
  [59.213]  ? __mutex_unlock_slowpath.isra.0+0x9a/0x120
  [59.213]  btrfs_scrub_dev+0x236/0x6a0 [btrfs]
  [59.213]  ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
  [59.214]  ? _copy_from_user+0x7b/0x80
  [59.214]  btrfs_ioctl+0xde1/0x32c0 [btrfs]
  [59.214]  ? should_failslab+0xa/0x20
  [59.214]  ? kmem_cache_alloc_node+0x151/0x460
  [59.214]  ? alloc_io_context+0x1b/0x80
  [59.214]  ? preempt_count_add+0x70/0xa0
  [59.214]  ? __x64_sys_ioctl+0x88/0xc0
  [59.214]  __x64_sys_ioctl+0x88/0xc0
  [59.214]  do_syscall_64+0x38/0x90
  [59.214]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  [59.214] RIP: 0033:0x7f82ffaffe9b
  [59.214] RSP: 002b:00007f82fe1f9c50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [59.214] RAX: ffffffffffffffda RBX: 000055b191e37090 RCX: 00007f82ffaffe9b
  [59.214] RDX: 000055b191e37090 RSI: 00000000c400941b RDI: 0000000000000003
  [59.214] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
  [59.214] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fe1fa640
  [59.214] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
  [59.214]  </TASK>
  [59.214] INFO: task btrfs:346776 blocked for more than 120 seconds.
  [59.215]       Tainted: G        W          6.3.0-rc2-btrfs-next-127+ #1
  [59.216] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [59.217] task:btrfs           state:D stack:0     pid:346776 ppid:1      flags:0x00004002
  [59.217] Call Trace:
  [59.217]  <TASK>
  [59.217]  __schedule+0x392/0xa70
  [59.217]  ? __pv_queued_spin_lock_slowpath+0x165/0x370
  [59.217]  schedule+0x5d/0xd0
  [59.217]  __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
  [59.217]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.217]  scrub_pause_off+0x21/0x50 [btrfs]
  [59.217]  scrub_simple_mirror+0x1c7/0x950 [btrfs]
  [59.217]  ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
  [59.218]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.218]  scrub_stripe+0x20d/0x740 [btrfs]
  [59.218]  scrub_chunk+0xc4/0x130 [btrfs]
  [59.218]  scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
  [59.219]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.219]  btrfs_scrub_dev+0x236/0x6a0 [btrfs]
  [59.219]  ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
  [59.219]  ? _copy_from_user+0x7b/0x80
  [59.219]  btrfs_ioctl+0xde1/0x32c0 [btrfs]
  [59.219]  ? should_failslab+0xa/0x20
  [59.219]  ? kmem_cache_alloc_node+0x151/0x460
  [59.219]  ? alloc_io_context+0x1b/0x80
  [59.219]  ? preempt_count_add+0x70/0xa0
  [59.219]  ? __x64_sys_ioctl+0x88/0xc0
  [59.219]  __x64_sys_ioctl+0x88/0xc0
  [59.219]  do_syscall_64+0x38/0x90
  [59.219]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  [59.219] RIP: 0033:0x7f82ffaffe9b
  [59.219] RSP: 002b:00007f82fd9f8c50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [59.219] RAX: ffffffffffffffda RBX: 000055b191e37510 RCX: 00007f82ffaffe9b
  [59.219] RDX: 000055b191e37510 RSI: 00000000c400941b RDI: 0000000000000003
  [59.219] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
  [59.219] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fd9f9640
  [59.219] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
  [59.219]  </TASK>
  [59.219] INFO: task btrfs:346822 blocked for more than 120 seconds.
  [59.220]       Tainted: G        W          6.3.0-rc2-btrfs-next-127+ #1
  [59.221] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [59.222] task:btrfs           state:D stack:0     pid:346822 ppid:1      flags:0x00004002
  [59.222] Call Trace:
  [59.222]  <TASK>
  [59.222]  __schedule+0x392/0xa70
  [59.222]  schedule+0x5d/0xd0
  [59.222]  btrfs_scrub_cancel+0x91/0x100 [btrfs]
  [59.222]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.222]  btrfs_commit_transaction+0x572/0xeb0 [btrfs]
  [59.223]  ? start_transaction+0xcb/0x610 [btrfs]
  [59.223]  prepare_to_relocate+0x111/0x1a0 [btrfs]
  [59.223]  relocate_block_group+0x57/0x5d0 [btrfs]
  [59.223]  ? btrfs_wait_nocow_writers+0x25/0xb0 [btrfs]
  [59.223]  btrfs_relocate_block_group+0x248/0x3c0 [btrfs]
  [59.224]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.224]  btrfs_relocate_chunk+0x3b/0x150 [btrfs]
  [59.224]  btrfs_balance+0x8ff/0x11d0 [btrfs]
  [59.224]  ? __kmem_cache_alloc_node+0x14a/0x410
  [59.224]  btrfs_ioctl+0x2334/0x32c0 [btrfs]
  [59.225]  ? mod_objcg_state+0xd2/0x360
  [59.225]  ? refill_obj_stock+0xb0/0x160
  [59.225]  ? seq_release+0x25/0x30
  [59.225]  ? __rseq_handle_notify_resume+0x3b5/0x4b0
  [59.225]  ? percpu_counter_add_batch+0x2e/0xa0
  [59.225]  ? __x64_sys_ioctl+0x88/0xc0
  [59.225]  __x64_sys_ioctl+0x88/0xc0
  [59.225]  do_syscall_64+0x38/0x90
  [59.225]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  [59.225] RIP: 0033:0x7f381a6ffe9b
  [59.225] RSP: 002b:00007ffd45ecf060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [59.225] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f381a6ffe9b
  [59.225] RDX: 00007ffd45ecf150 RSI: 00000000c4009420 RDI: 0000000000000003
  [59.225] RBP: 0000000000000003 R08: 0000000000000013 R09: 0000000000000000
  [59.225] R10: 00007f381a60c878 R11: 0000000000000246 R12: 00007ffd45ed0423
  [59.225] R13: 00007ffd45ecf150 R14: 0000000000000000 R15: 00007ffd45ecf148
  [59.225]  </TASK>

What happens is the following:

1) A scrub is running, so fs_info->scrubs_running is 1;

2) Task A starts block group relocation, and at btrfs_relocate_chunk() it
   pauses scrub by calling btrfs_scrub_pause(). That increments
   fs_info->scrub_pause_req from 0 to 1 and waits for the scrub task to
   pause (for fs_info->scrubs_paused to be == to fs_info->scrubs_running);

3) The scrub task pauses at scrub_pause_off(), waiting for
   fs_info->scrub_pause_req to decrease to 0;

4) Task A then enters btrfs_relocate_block_group(), and down that call
   chain we start a transaction and then attempt to commit it;

5) When task A calls btrfs_commit_transaction(), it either will do the
   commit itself or wait for some other task that already started the
   commit of the transaction - it doesn't matter which case;

6) The transaction commit enters state TRANS_STATE_COMMIT_START;

7) An error happens during the transaction commit, like -ENOSPC when
   running delayed refs or delayed items for example;

8) This results in calling transaction.c:cleanup_transaction(), where
   we call btrfs_scrub_cancel(), incrementing fs_info->scrub_cancel_req
   from 0 to 1, and blocking this task waiting for fs_info->scrubs_running
   to decrease to 0;

9) From this point on, both the transaction commit and the scrub task
   hang forever:

   1) The transaction commit is waiting for fs_info->scrubs_running to
      be decreased to 0;

   2) The scrub task is at scrub_pause_off() waiting for
      fs_info->scrub_pause_req to decrease to 0 - so it can not proceed
      to stop the scrub and decrement fs_info->scrubs_running from 0 to 1.

   Therefore resulting in a deadlock.

Fix this by having cleanup_transaction(), called if a transaction commit
fails, not call btrfs_scrub_cancel() if relocation is in progress, and
having btrfs_relocate_block_group() call btrfs_scrub_cancel() instead if
the relocation failed and a transaction abort happened.

This was triggered with btrfs/061 from fstests.

Fixes: 55e3a601c8 ("btrfs: Fix data checksum error cause by replace with io-load.")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-28 00:47:00 +02:00
Qu Wenruo
b7625f461d btrfs: sysfs: update fs features directory asynchronously
[BUG]
Since the introduction of per-fs feature sysfs interface
(/sys/fs/btrfs/<UUID>/features/), the content of that directory is never
updated.

Thus for the following case, that directory will not show the new
features like RAID56:

  # mkfs.btrfs -f $dev1 $dev2 $dev3
  # mount $dev1 $mnt
  # btrfs balance start -f -mconvert=raid5 $mnt
  # ls /sys/fs/btrfs/$uuid/features/
  extended_iref  free_space_tree  no_holes  skinny_metadata

While after unmount and mount, we got the correct features:

  # umount $mnt
  # mount $dev1 $mnt
  # ls /sys/fs/btrfs/$uuid/features/
  extended_iref  free_space_tree  no_holes  raid56 skinny_metadata

[CAUSE]
Because we never really try to update the content of per-fs features/
directory.

We had an attempt to update the features directory dynamically in commit
14e46e0495 ("btrfs: synchronize incompat feature bits with sysfs
files"), but unfortunately it get reverted in commit e410e34fad
("Revert "btrfs: synchronize incompat feature bits with sysfs files"").
The problem in the original patch is, in the context of
btrfs_create_chunk(), we can not afford to update the sysfs group.

The exported but never utilized function, btrfs_sysfs_feature_update()
is the leftover of such attempt.  As even if we go sysfs_update_group(),
new files will need extra memory allocation, and we have no way to
specify the sysfs update to go GFP_NOFS.

[FIX]
This patch will address the old problem by doing asynchronous sysfs
update in the cleaner thread.

This involves the following changes:

- Make __btrfs_(set|clear)_fs_(incompat|compat_ro) helpers to set
  BTRFS_FS_FEATURE_CHANGED flag when needed

- Update btrfs_sysfs_feature_update() to use sysfs_update_group()
  And drop unnecessary arguments.

- Call btrfs_sysfs_feature_update() in cleaner_kthread
  If we have the BTRFS_FS_FEATURE_CHANGED flag set.

- Wake up cleaner_kthread in btrfs_commit_transaction if we have
  BTRFS_FS_FEATURE_CHANGED flag

By this, all the previously dangerous call sites like
btrfs_create_chunk() need no new changes, as above helpers would
have already set the BTRFS_FS_FEATURE_CHANGED flag.

The real work happens at cleaner_kthread, thus we pay the cost of
delaying the update to sysfs directory, but the delayed time should be
small enough that end user can not distinguish though it might get
delayed if the cleaner thread is busy with removing subvolumes or
defrag.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13 17:50:35 +01:00
Josef Bacik
fccf0c842e btrfs: move btrfs_abort_transaction to transaction.c
While trying to sync messages.[ch] I ended up with this dependency on
messages.h in the rest of btrfs-progs code base because it's where
btrfs_abort_transaction() was now held.  We want to keep messages.[ch]
limited to the kernel code, and the btrfs_abort_transaction() code
better fits in the transaction code and not in messages.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ move the __cold attributes ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13 17:50:33 +01:00
David Sterba
35da5a7ede btrfs: drop private_data parameter from extent_io_tree_init
All callers except one pass NULL, so the parameter can be dropped and
the inode::io_tree initialization can be open coded.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:54 +01:00
David Sterba
428c8e0310 btrfs: simplify percent calculation helpers, rename div_factor
The div_factor* helpers calculate fraction or percentage fraction. The
name is a bit confusing, we use it only for percentage calculations and
there are two helpers.

There's a helper mult_frac that's for general fractions, that tries to
be accurate but we multiply and divide by small numbers so we can use
the div_u64 helper.

Rename the div_factor* helpers and use 1..100 percentage range, also drop
the case checking for percentage == 100, it's never hit.

The conversions:

* div_factor calculates tenths and the numbers need to be adjusted
* div_factor_fine is direct replacement

Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:48 +01:00
Josef Bacik
2fc6822c99 btrfs: move scrub prototypes into scrub.h
Move these out of ctree.h into scrub.h to cut down on code in ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:47 +01:00
Josef Bacik
677074792a btrfs: move relocation prototypes into relocation.h
Move these out of ctree.h into relocation.h to cut down on code in
ctree.h

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:47 +01:00
Josef Bacik
7572dec8f5 btrfs: move ioctl prototypes into ioctl.h
Move these out of ctree.h into ioctl.h to cut down on code in ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:46 +01:00
Josef Bacik
c7a03b524d btrfs: move uuid tree prototypes to uuid-tree.h
Move these out of ctree.h into uuid-tree.h to cut down on the code in
ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:46 +01:00
Josef Bacik
f2b39277b8 btrfs: move dir-item prototypes into dir-item.h
Move these prototypes out of ctree.h and into their own header file.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:46 +01:00
Josef Bacik
59b818e064 btrfs: move defrag related prototypes to their own header
Now that the defrag code is all in one file, create a defrag.h and move
all the defrag related prototypes and helper out of ctree.h and into
defrag.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:46 +01:00
Josef Bacik
45c40c8f95 btrfs: move root tree prototypes to their own header
Move all the root-tree.c prototypes to root-tree.h, and then update all
the necessary files to include the new header.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:44 +01:00
Josef Bacik
a0231804af btrfs: move extent-tree helpers into their own header file
Move all the extent tree related prototypes to extent-tree.h out of
ctree.h, and then go include it everywhere needed so everything
compiles.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:44 +01:00
Sweet Tea Dorminy
6db7531882 btrfs: use struct fscrypt_str instead of struct qstr
While struct qstr is more natural without fscrypt, since it's provided
by dentries, struct fscrypt_str is provided by the fscrypt handlers
processing dentries, and is thus more natural in the fscrypt world.
Replace all of the struct qstr uses with struct fscrypt_str.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:43 +01:00
Sweet Tea Dorminy
ab3c5c18e8 btrfs: setup qstr from dentrys using fscrypt helper
Most places where we get a struct qstr, we are doing so from a dentry.
With fscrypt, the dentry's name may be encrypted on-disk, so fscrypt
provides a helper to convert a dentry name to the appropriate disk name
if necessary. Convert each of the dentry name accesses to use
fscrypt_setup_filename(), then convert the resulting fscrypt_name back
to an unencrypted qstr. This does not work for nokey names, but the
specific locations that could spawn nokey names are noted.

At present, since there are no encrypted directories, nothing goes down
the filename encryption paths.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:43 +01:00
Sweet Tea Dorminy
e43eec81c5 btrfs: use struct qstr instead of name and namelen pairs
Many functions throughout btrfs take name buffer and name length
arguments. Most of these functions at the highest level are usually
called with these arguments extracted from a supplied dentry's name.
But the entire name can be passed instead, making each function a little
more elegant.

Each function whose arguments are currently the name and length
extracted from a dentry is herein converted to instead take a pointer to
the name in the dentry. The couple of calls to these calls without a
struct dentry are converted to create an appropriate qstr to pass in.
Additionally, every function which is only called with a name/len
extracted directly from a qstr is also converted.

This change has positive effect on stack consumption, frame of many
functions is reduced but this will be used in the future for fscrypt
related structures.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:43 +01:00
Josef Bacik
07e81dc944 btrfs: move accessor helpers into accessors.h
This is a large patch, but because they're all macros it's impossible to
split up.  Simply copy all of the item accessors in ctree.h and paste
them in accessors.h, and then update any files to include the header so
everything compiles.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments, style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:42 +01:00
Josef Bacik
55e5cfd36d btrfs: remove fs_info::pending_changes and related code
Now that we're not using this code anywhere we can remove it as well as
the member from fs_info.

We don't have any mount options or on/off features that would utilize
the pending infrastructure, the last one was inode_cache.
There was a patchset [1] to enable some features from sysfs that would
break things if it would be set immediately. In case we'll need that
kind of logic again the patch can be reverted, but for the current use
it can be replaced by the single state bit to do the commit.

[1] https://lore.kernel.org/linux-btrfs/1422609654-19519-1-git-send-email-quwenruo@cn.fujitsu.com/

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:42 +01:00
Josef Bacik
c52cc7b7ac btrfs: add a BTRFS_FS_NEED_TRANS_COMMIT flag
Currently we are only using fs_info->pending_changes to indicate that we
need a transaction commit.  The original users for this were removed
years ago and we don't have more usage in sight, so this is the only
remaining reason to have this field.  Add a flag so we can remove this
code.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:42 +01:00
Josef Bacik
c7f13d428e btrfs: move fs wide helpers out of ctree.h
We have several fs wide related helpers in ctree.h.  The bulk of these
are the incompat flag test helpers, but there are things such as
btrfs_fs_closing() and the read only helpers that also aren't directly
related to the ctree code.  Move these into a fs.h header, which will
serve as the location for file system wide related helpers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:41 +01:00
Josef Bacik
956504a331 btrfs: move trans_handle_cachep out of ctree.h
This is local to the transaction code, remove it from ctree.h and
inode.c, create new helpers in the transaction to handle the init work
and move the cachep locally to transaction.c.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:37 +01:00
Josef Bacik
efb0645bd9 btrfs: don't init io tree with private data for non-inodes
We only use this for normal inodes, so don't set it if we're not a
normal inode.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:05 +02:00
Josef Bacik
bd015294af btrfs: replace delete argument with EXTENT_CLEAR_ALL_BITS
Instead of taking up a whole argument to indicate we're clearing
everything in a range, simply add another EXTENT bit to control this,
and then update all the callers to drop this argument from the
clear_extent_bit variants.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:05 +02:00
Josef Bacik
dbbf49928f btrfs: remove the wake argument from clear_extent_bits
This is only used in the case that we are clearing EXTENT_LOCKED, so
infer this value from the bits passed in instead of taking it as an
argument.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:04 +02:00
David Sterba
748f553c3c btrfs: add KCSAN annotations for unlocked access to block_rsv->full
KCSAN reports that there's unlocked access mixed with locked access,
which is technically correct but is not a bug.  To avoid false alerts at
least from KCSAN, add annotation and use a wrapper whenever ->full is
accessed for read outside of lock.

It is used as a fast check and only advisory.  In the worst case the
block reserve is found !full and becomes full in the meantime, but
properly handled.

Depending on the value of ->full, btrfs_block_rsv_release decides
where to return the reservation, and block_rsv_release_bytes handles a
NULL pointer for block_rsv and if it's not NULL then it double checks
the full status under a lock.

Link: https://lore.kernel.org/linux-btrfs/CAAwBoOJDjei5Hnem155N_cJwiEkVwJYvgN-tQrwWbZQGhFU=cA@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/YvHU/vsXd7uz5V6j@hungrycats.org
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:02 +02:00
Qu Wenruo
14033b08a0 btrfs: don't save block group root into super block
The extent tree v2 needs a new root for storing all block group items,
the whole feature hasn't been finished yet so we can afford to do some
changes.

My initial proposal years ago just added a new tree rootid, and load it
from tree root, just like what we did for quota/free space tree/uuid/extent
roots.

But the extent tree v2 patches introduced a completely new way to store
block group tree root into super block which is arguably wasteful.

Currently there are only 3 trees stored in super blocks, and they all
have their valid reasons:

- Chunk root
  Needed for bootstrap.

- Tree root
  Really the entry point for all trees.

- Log root
  This is special as log root has to be updated out of existing
  transaction mechanism.

There is not even any reason to put block group root into super blocks,
the block group tree is updated at the same time as the old extent tree,
no need for extra bootstrap/out-of-transaction update.

So just move block group root from super block into tree root.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:00 +02:00
Omar Sandoval
48ff70830b btrfs: get rid of block group caching progress logic
struct btrfs_caching_ctl::progress and struct
btrfs_block_group::last_byte_to_unpin were previously needed to ensure
that unpin_extent_range() didn't return a range to the free space cache
before the caching thread had a chance to cache that range. However, the
commit "btrfs: fix space cache corruption and potential double
allocations" made it so that we always synchronously cache the block
group at the time that we pin the extent, so this machinery is no longer
necessary.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:58 +02:00
Ioannis Angelakopoulos
8b53779eaa btrfs: add lockdep annotations for pending_ordered wait event
In contrast to the num_writers and num_extwriters wait events, the
condition for the pending ordered wait event is signaled in a different
context from the wait event itself. The condition signaling occurs in
btrfs_remove_ordered_extent() in fs/btrfs/ordered-data.c while the wait
event is implemented in btrfs_commit_transaction() in
fs/btrfs/transaction.c

Thus the thread signaling the condition has to acquire the lockdep map
as a reader at the start of btrfs_remove_ordered_extent() and release it
after it has signaled the condition. In this case some dependencies
might be left out due to the placement of the annotation, but it is
better than no annotation at all.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:53 +02:00
Ioannis Angelakopoulos
3e738c531a btrfs: add lockdep annotations for transaction states wait events
Add lockdep annotations for the transaction states that have wait
events;

  1) TRANS_STATE_COMMIT_START
  2) TRANS_STATE_UNBLOCKED
  3) TRANS_STATE_SUPER_COMMITTED
  4) TRANS_STATE_COMPLETED

The new macros introduced here to annotate the transaction states wait
events have the same effect as the generic lockdep annotation macros.

With the exception of the lockdep annotation for TRANS_STATE_COMMIT_START
the transaction thread has to acquire the lockdep maps for the
transaction states as reader after the lockdep map for num_writers is
released so that lockdep does not complain.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:53 +02:00
Ioannis Angelakopoulos
5a9ba6709f btrfs: add lockdep annotations for num_extwriters wait event
Similarly to the num_writers wait event in fs/btrfs/transaction.c add a
lockdep annotation for the num_extwriters wait event.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:53 +02:00
Ioannis Angelakopoulos
e1489b4fe6 btrfs: add lockdep annotations for num_writers wait event
Annotate the num_writers wait event in fs/btrfs/transaction.c with
lockdep in order to catch deadlocks involving this wait event.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:53 +02:00
David Sterba
c1867eb33e btrfs: clean up chained assignments
The chained assignments may be convenient to write, but make readability
a bit worse as it's too easy to overlook that there are several values
set on the same line while this is rather an exception.  Making it
consistent everywhere avoids surprises.

The pattern where inode times are initialized reuses the first value and
the order is mtime, ctime. In other blocks the assignments are expanded
so the order of variables is similar to the neighboring code.

Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:39 +02:00
Ioannis Angelakopoulos
e55958c8a0 btrfs: collect commit stats, count, duration
Track several stats about transaction commit, to be later exported via
sysfs:

- number of commits so far
- duration of the last commit in ns
- maximum commit duration seen so far in ns
- total duration for all commits so far in ns

The update of the commit stats occurs after the commit thread has gone
through all the logic that checks if there is another thread committing
at the same time. This means that we only account for actual commit work
in the commit stats we report and not the time the thread spends waiting
until it is ready to do the commit work.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:37 +02:00
David Sterba
fc7cbcd489 Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray"
This reverts commit 48b36a602a.

Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.

Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.

[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-15 19:14:28 +02:00
Gabriel Niebler
48b36a602a btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray
… rename it to simply fs_roots and adjust all usages of this object to use
the XArray API, because it is notionally easier to use and understand, as
it provides array semantics, and also takes care of locking for us,
further simplifying the code.

Also do some refactoring, esp. where the API change requires largely
rewriting some functions, anyway.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:15:57 +02:00
Filipe Manana
16b0c2581e btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.

Read only searches on the tree are very frequent and done when:

1) Pinning and unpinning extents, as we need to lookup the respective
   block group from the tree;

2) Freeing the last reference of a tree block, regardless if we pin the
   underlying extent or add it back to free space cache/tree;

3) During NOCOW writes, both buffered IO and direct IO, we need to check
   if the block group that contains an extent is read only or not and to
   increment the number of NOCOW writers in the block group. For those
   operations we need to search for the block group in the tree.
   Similarly, after creating the ordered extent for the NOCOW write, we
   need to decrement the number of NOCOW writers from the same block
   group, which requires searching for it in the tree;

4) Decreasing the number of extent reservations in a block group;

5) When allocating extents and freeing reserved extents;

6) Adding and removing free space to the free space tree;

7) When releasing delalloc bytes during ordered extent completion;

8) When relocating a block group;

9) During fitrim, to iterate over the block groups;

10) etc;

Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.

We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.

So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:13 +02:00