linux-stable/fs/btrfs
Filipe Manana ced63fffd6 btrfs: fix race when detecting delalloc ranges during fiemap
[ Upstream commit 978b63f746 ]

For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in a
scenario where the fiemap buffer happens to be a memory mapped range of
the same file. This use case is very unlikely to be useful in practice but
it may be triggered by fuzz testing (syzbot, etc).

This however introduced a race that makes us miss delalloc ranges for
file regions that are currently holes, so the caller of fiemap will not
be aware that there's data for some file regions. This can be quite
serious for some use cases - for example in coreutils versions before 9.0,
the cp program used fiemap to detect holes and data in the source file,
copying only regions with data (extents or delalloc) from the source file
to the destination file in order to preserve holes (see the documentation
for its --sparse command line option). This means that if cp was used
with a source file that had delalloc in a hole, the destination file could
end up without that data, which is effectively a data loss issue, if it
happened to hit the race described below.

The race happens like this:

1) Fiemap is called, without the FIEMAP_FLAG_SYNC flag, for a file that
   has delalloc in the file range [64M, 65M[, which is currently a hole;

2) Fiemap locks the inode in shared mode, then starts iterating the
   inode's subvolume tree searching for file extent items, without having
   the whole fiemap target range locked in the inode's io tree - the
   change introduced recently by commit b0ad381fa7 ("btrfs: fix
   deadlock with fiemap and extent locking"). It only locks ranges in
   the io tree when it finds a hole or prealloc extent since that
   commit;

3) Note that fiemap clones each leaf before using it, and this is to
   avoid deadlocks when locking a file range in the inode's io tree and
   the fiemap buffer is memory mapped to some file, because writing
   to the page with btrfs_page_mkwrite() will wait on any ordered extent
   for the page's range and the ordered extent needs to lock the range
   and may need to modify the same leaf, therefore leading to a deadlock
   on the leaf;

4) While iterating the file extent items in the cloned leaf before
   finding the hole in the range [64M, 65M[, the delalloc in that range
   is flushed and its ordered extent completes - meaning the corresponding
   file extent item is in the inode's subvolume tree, but not present in
   the cloned leaf that fiemap is iterating over;

5) When fiemap finds the hole in the [64M, 65M[ range by seeing the gap in
   the cloned leaf (or a file extent item with disk_bytenr == 0 in case
   the NO_HOLES feature is not enabled), it will lock that file range in
   the inode's io tree and then search for delalloc by checking for the
   EXTENT_DELALLOC bit in the io tree for that range and ordered extents
   (with btrfs_find_delalloc_in_range()). But it finds nothing since the
   delalloc in that range was already flushed and the ordered extent
   completed and is gone - as a result fiemap will not report that there's
   delalloc or an extent for the range [64M, 65M[, so user space will be
   mislead into thinking that there's a hole in that range.

This could actually be sporadically triggered with test case generic/094
from fstests, which reports a missing extent/delalloc range like this:

#  generic/094 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad)
#      --- tests/generic/094.out	2020-06-10 19:29:03.830519425 +0100
#      +++ /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad	2024-02-28 11:00:00.381071525 +0000
#      @@ -1,3 +1,9 @@
#       QA output created by 094
#       fiemap run with sync
#       fiemap run without sync
#      +ERROR: couldn't find extent at 7
#      +map is 'HHDDHPPDPHPH'
#      +logical: [       5..       6] phys:   301517..  301518 flags: 0x800 tot: 2
#      +logical: [       8..       8] phys:   301520..  301520 flags: 0x800 tot: 1
#      ...
#      (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/094.out /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad'  to see the entire diff)

So in order to fix this, while still avoiding deadlocks in the case where
the fiemap buffer is memory mapped to the same file, change fiemap to work
like the following:

1) Always lock the whole range in the inode's io tree before starting to
   iterate the inode's subvolume tree searching for file extent items,
   just like we did before commit b0ad381fa7 ("btrfs: fix deadlock with
   fiemap and extent locking");

2) Now instead of writing to the fiemap buffer every time we have an extent
   to report, write instead to a temporary buffer (1 page), and when that
   buffer becomes full, stop iterating the file extent items, unlock the
   range in the io tree, release the search path, submit all the entries
   kept in that buffer to the fiemap buffer, and then resume the search
   for file extent items after locking again the remainder of the range in
   the io tree.

   The buffer having a size of a page, allows for 146 entries in a system
   with 4K pages. This is a large enough value to have a good performance
   by avoiding too many restarts of the search for file extent items.
   In other words this preserves the huge performance gains made in the
   last two years to fiemap, while avoiding the deadlocks in case the
   fiemap buffer is memory mapped to the same file (useless in practice,
   but possible and exercised by fuzz testing and syzbot).

Fixes: b0ad381fa7 ("btrfs: fix deadlock with fiemap and extent locking")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-26 18:16:51 -04:00
..
tests btrfs: migrate extent_buffer::pages[] to folio 2023-12-15 23:01:04 +01:00
accessors.c btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios 2023-12-15 23:03:58 +01:00
accessors.h btrfs: migrate extent_buffer::pages[] to folio 2023-12-15 23:01:04 +01:00
acl.c
acl.h
async-thread.c btrfs: merge ordered work callbacks in btrfs_work into one 2023-10-12 16:44:10 +02:00
async-thread.h btrfs: merge ordered work callbacks in btrfs_work into one 2023-10-12 16:44:10 +02:00
backref.c for-6.7-tag 2023-10-30 10:42:06 -10:00
backref.h for-6.7-tag 2023-10-30 10:42:06 -10:00
bio.c btrfs: migrate btrfs_repair_io_failure() to folio interfaces 2023-12-15 23:03:58 +01:00
bio.h btrfs: migrate btrfs_repair_io_failure() to folio interfaces 2023-12-15 23:03:58 +01:00
block-group.c btrfs: add new unused block groups to the list of unused block groups 2024-02-09 20:29:22 +01:00
block-group.h btrfs: add and use helper to check if block group is used 2024-02-09 20:29:14 +01:00
block-rsv.c btrfs: fix data race at btrfs_use_block_rsv() when accessing block reserve 2024-02-22 12:15:12 +01:00
block-rsv.h btrfs: fix data race at btrfs_use_block_rsv() when accessing block reserve 2024-02-22 12:15:12 +01:00
btrfs_inode.h btrfs: fix mismatching parameter names for btrfs_get_extent() 2023-12-15 22:59:30 +01:00
compression.c btrfs: zlib: fix and simplify the inline extent decompression 2024-01-18 23:35:26 +01:00
compression.h Revert "btrfs: zstd: fix and simplify the inline extent decompression" 2024-01-22 15:39:01 -08:00
ctree.c btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios 2023-12-15 23:03:58 +01:00
ctree.h btrfs: switch btrfs_root::delayed_nodes_tree to xarray from radix-tree 2023-12-15 23:01:03 +01:00
defrag.c btrfs: defrag: avoid unnecessary defrag caused by incorrect extent size 2024-02-19 11:19:58 +01:00
defrag.h btrfs: move btrfs_defrag_root() to defrag.{c,h} 2023-10-12 16:44:13 +02:00
delalloc-space.c btrfs: don't reserve space for checksums when writing to nocow files 2024-02-13 18:36:35 +01:00
delalloc-space.h
delayed-inode.c btrfs: switch btrfs_root::delayed_nodes_tree to xarray from radix-tree 2023-12-15 23:01:03 +01:00
delayed-inode.h btrfs: remove redundant root argument from btrfs_delayed_update_inode() 2023-10-12 16:44:12 +02:00
delayed-ref.c btrfs: fix qgroup record leaks when using simple quotas 2023-11-09 14:01:59 +01:00
delayed-ref.h btrfs: stop reserving excessive space for block group item insertions 2023-10-12 16:44:16 +02:00
dev-replace.c btrfs: dev-replace: properly validate device names 2024-02-22 12:14:21 +01:00
dev-replace.h
dir-item.c btrfs: abort transaction on generation mismatch when marking eb as dirty 2023-10-12 16:44:07 +02:00
dir-item.h btrfs: add fscrypt related dependencies to respective headers 2023-10-12 16:44:02 +02:00
discard.c btrfs: unexport btrfs_run_discard_work and make it static 2023-06-19 13:59:25 +02:00
discard.h btrfs: unexport btrfs_run_discard_work and make it static 2023-06-19 13:59:25 +02:00
disk-io.c btrfs: fix double free of anonymous device after snapshot creation failure 2024-02-29 22:34:11 +01:00
disk-io.h btrfs: fix double free of anonymous device after snapshot creation failure 2024-02-29 22:34:11 +01:00
export.c
export.h
extent-io-tree.c btrfs: allocate btrfs_inode::file_extent_tree only without NO_HOLES 2023-12-15 22:59:01 +01:00
extent-io-tree.h btrfs: always set extent_io_tree::inode and drop fs_info 2023-12-15 20:27:02 +01:00
extent-tree.c btrfs: don't warn if discard range is not aligned to sector 2024-01-18 23:35:57 +01:00
extent-tree.h btrfs: get correct owning_root when dropping snapshot 2023-11-03 16:39:06 +01:00
extent_io.c btrfs: fix race when detecting delalloc ranges during fiemap 2024-03-26 18:16:51 -04:00
extent_io.h btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios 2023-12-15 23:03:58 +01:00
extent_map.c btrfs: use the flags of an extent map to identify the compression type 2023-12-15 22:59:02 +01:00
extent_map.h btrfs: use the flags of an extent map to identify the compression type 2023-12-15 22:59:02 +01:00
file-item.c btrfs: use the flags of an extent map to identify the compression type 2023-12-15 22:59:02 +01:00
file-item.h btrfs: scrub: avoid unnecessary csum tree search preparing stripes 2023-08-21 14:54:48 +02:00
file.c btrfs: migrate subpage code to folio interfaces 2023-12-15 23:03:58 +01:00
file.h
free-space-cache.c btrfs: migrate subpage code to folio interfaces 2023-12-15 23:03:58 +01:00
free-space-cache.h btrfs: move btrfs_check_trunc_cache_free_space into block-rsv.c 2023-06-19 13:59:24 +02:00
free-space-tree.c btrfs: abort transaction on generation mismatch when marking eb as dirty 2023-10-12 16:44:07 +02:00
free-space-tree.h btrfs: make clear_cache mount option to rebuild FST without disabling it 2023-05-10 14:51:27 +02:00
fs.c btrfs: sysfs: update fs features directory asynchronously 2023-02-13 17:50:35 +01:00
fs.h btrfs: remove old mount API code 2023-12-15 20:27:04 +01:00
inode-item.c btrfs: track owning root in btrfs_ref 2023-10-12 16:44:11 +02:00
inode-item.h btrfs: add fscrypt related dependencies to respective headers 2023-10-12 16:44:02 +02:00
inode.c for-6.8-rc6-tag 2024-03-01 12:29:20 -08:00
ioctl.c for-6.8-rc6-tag 2024-03-01 12:29:20 -08:00
ioctl.h
Kconfig btrfs: check-integrity: remove CONFIG_BTRFS_FS_CHECK_INTEGRITY option 2023-10-12 16:44:05 +02:00
locking.c btrfs: add raid stripe tree definitions 2023-10-12 16:44:09 +02:00
locking.h btrfs: do not block starts waiting on previous transaction commit 2023-09-08 14:10:49 +02:00
lru_cache.c btrfs: fix typos found by codespell 2023-12-15 23:00:04 +01:00
lru_cache.h btrfs: remove btrfs_lru_cache_is_full() inline function 2023-04-17 18:01:18 +02:00
lzo.c btrfs: lzo: fix and simplify the inline extent decompression 2024-01-18 23:35:30 +01:00
Makefile btrfs: add support for inserting raid stripe extents 2023-10-12 16:44:09 +02:00
messages.c btrfs: constify fs_info parameter in __btrfs_panic() 2023-12-15 20:27:02 +01:00
messages.h btrfs: constify fs_info parameter in __btrfs_panic() 2023-12-15 20:27:02 +01:00
misc.h minmax: add in_range() macro 2023-08-24 16:20:18 -07:00
ordered-data.c btrfs: migrate subpage code to folio interfaces 2023-12-15 23:03:58 +01:00
ordered-data.h btrfs: remove unused btrfs_ordered_extent::outstanding_isize 2023-12-15 20:27:01 +01:00
orphan.c
orphan.h
print-tree.c btrfs: new inline ref storing owning subvol of data extents 2023-10-12 16:44:11 +02:00
print-tree.h btrfs: print-tree: pass const extent buffer pointer 2023-06-19 13:59:22 +02:00
props.c btrfs: move btrfs_name_hash to dir-item.h 2023-10-12 16:44:02 +02:00
props.h
qgroup.c btrfs: forbid deleting live subvol qgroup 2024-01-31 08:42:47 +01:00
qgroup.h btrfs: ensure releasing squota reserve on head refs 2023-12-06 22:32:57 +01:00
raid-stripe-tree.c btrfs: directly return 0 on no error code in btrfs_insert_raid_extent() 2023-11-03 16:38:51 +01:00
raid-stripe-tree.h btrfs: zoned: support RAID0/1/10 on top of raid stripe tree 2023-10-12 16:44:09 +02:00
raid56.c btrfs: refactor alloc_extent_buffer() to allocate-then-attach method 2023-12-15 23:01:04 +01:00
raid56.h btrfs: use a dedicated data structure for chunk maps 2023-12-15 20:27:02 +01:00
rcu-string.h
ref-verify.c btrfs: ref-verify: free ref cache before clearing mount opt 2024-01-12 01:59:49 +01:00
ref-verify.h
reflink.c btrfs: migrate subpage code to folio interfaces 2023-12-15 23:03:58 +01:00
reflink.h
relocation.c btrfs: migrate subpage code to folio interfaces 2023-12-15 23:03:58 +01:00
relocation.h btrfs: relocation: constify parameters where possible 2023-10-12 16:44:13 +02:00
root-tree.c btrfs: qgroup: add new quota mode for simple quotas 2023-10-12 16:44:10 +02:00
root-tree.h btrfs: drop __must_check annotations 2023-10-12 16:44:04 +02:00
scrub.c btrfs: scrub: limit RST scrub to chunk boundary 2024-01-18 23:43:08 +01:00
scrub.h btrfs: scrub: remove scrub_bio structure 2023-04-17 18:01:24 +02:00
send.c btrfs: send: don't issue unnecessary zero writes for trailing hole 2024-02-22 12:14:31 +01:00
send.h
space-info.c btrfs: fix data races when accessing the reserved amount of block reserves 2024-02-22 12:15:06 +01:00
space-info.h btrfs: pass a space_info argument to btrfs_reserve_metadata_bytes() 2023-10-12 16:44:05 +02:00
subpage.c for-6.8-rc1-tag 2024-01-22 13:29:42 -08:00
subpage.h btrfs: migrate subpage code to folio interfaces 2023-12-15 23:03:58 +01:00
super.c for-6.8-rc1-tag 2024-01-22 13:29:42 -08:00
super.h btrfs: remove old mount API code 2023-12-15 20:27:04 +01:00
sysfs.c btrfs: sysfs: validate scrub_speed_max value 2023-12-15 23:01:04 +01:00
sysfs.h btrfs: sysfs: update fs features directory asynchronously 2023-02-13 17:50:35 +01:00
transaction.c btrfs: fix double free of anonymous device after snapshot creation failure 2024-02-29 22:34:11 +01:00
transaction.h btrfs: free qgroup pertrans reserve on transaction abort 2023-12-06 22:32:49 +01:00
tree-checker.c btrfs: tree-checker: fix inline ref size in error messages 2024-01-18 23:35:50 +01:00
tree-checker.h btrfs: fix typos found by codespell 2023-12-15 23:00:04 +01:00
tree-log.c btrfs: use the flags of an extent map to identify the compression type 2023-12-15 22:59:02 +01:00
tree-log.h btrfs: change for_rename argument of btrfs_record_unlink_dir() to bool 2023-06-19 13:59:26 +02:00
tree-mod-log.c btrfs: avoid tree mod log ENOMEM failures when we don't need to log 2023-06-19 13:59:38 +02:00
tree-mod-log.h
ulist.c btrfs: reformat remaining kdoc style comments 2023-10-12 16:44:04 +02:00
ulist.h
uuid-tree.c btrfs: abort transaction on generation mismatch when marking eb as dirty 2023-10-12 16:44:07 +02:00
uuid-tree.h
verity.c btrfs: remove redundant root argument from btrfs_update_inode() 2023-10-12 16:44:12 +02:00
verity.h
volumes.c btrfs: fix unbalanced unlock of mapping_tree_lock 2024-01-12 01:59:59 +01:00
volumes.h btrfs: fix typos found by codespell 2023-12-15 23:00:04 +01:00
xattr.c btrfs: cache that we don't have security.capability set 2023-12-15 20:27:05 +01:00
xattr.h btrfs: move btrfs_xattr_handlers to .rodata 2023-10-09 16:24:17 +02:00
zlib.c btrfs: zlib: fix and simplify the inline extent decompression 2024-01-18 23:35:26 +01:00
zoned.c for-6.8-rc6-tag 2024-02-26 11:00:54 -08:00
zoned.h for-6.8/block-2024-01-08 2024-01-11 13:58:04 -08:00
zstd.c Revert "btrfs: zstd: fix and simplify the inline extent decompression" 2024-01-22 15:39:01 -08:00