linux-stable/fs
Filipe Manana 978b63f746 btrfs: fix race when detecting delalloc ranges during fiemap
For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in a
scenario where the fiemap buffer happens to be a memory mapped range of
the same file. This use case is very unlikely to be useful in practice but
it may be triggered by fuzz testing (syzbot, etc).

This however introduced a race that makes us miss delalloc ranges for
file regions that are currently holes, so the caller of fiemap will not
be aware that there's data for some file regions. This can be quite
serious for some use cases - for example in coreutils versions before 9.0,
the cp program used fiemap to detect holes and data in the source file,
copying only regions with data (extents or delalloc) from the source file
to the destination file in order to preserve holes (see the documentation
for its --sparse command line option). This means that if cp was used
with a source file that had delalloc in a hole, the destination file could
end up without that data, which is effectively a data loss issue, if it
happened to hit the race described below.

The race happens like this:

1) Fiemap is called, without the FIEMAP_FLAG_SYNC flag, for a file that
   has delalloc in the file range [64M, 65M[, which is currently a hole;

2) Fiemap locks the inode in shared mode, then starts iterating the
   inode's subvolume tree searching for file extent items, without having
   the whole fiemap target range locked in the inode's io tree - the
   change introduced recently by commit b0ad381fa7 ("btrfs: fix
   deadlock with fiemap and extent locking"). It only locks ranges in
   the io tree when it finds a hole or prealloc extent since that
   commit;

3) Note that fiemap clones each leaf before using it, and this is to
   avoid deadlocks when locking a file range in the inode's io tree and
   the fiemap buffer is memory mapped to some file, because writing
   to the page with btrfs_page_mkwrite() will wait on any ordered extent
   for the page's range and the ordered extent needs to lock the range
   and may need to modify the same leaf, therefore leading to a deadlock
   on the leaf;

4) While iterating the file extent items in the cloned leaf before
   finding the hole in the range [64M, 65M[, the delalloc in that range
   is flushed and its ordered extent completes - meaning the corresponding
   file extent item is in the inode's subvolume tree, but not present in
   the cloned leaf that fiemap is iterating over;

5) When fiemap finds the hole in the [64M, 65M[ range by seeing the gap in
   the cloned leaf (or a file extent item with disk_bytenr == 0 in case
   the NO_HOLES feature is not enabled), it will lock that file range in
   the inode's io tree and then search for delalloc by checking for the
   EXTENT_DELALLOC bit in the io tree for that range and ordered extents
   (with btrfs_find_delalloc_in_range()). But it finds nothing since the
   delalloc in that range was already flushed and the ordered extent
   completed and is gone - as a result fiemap will not report that there's
   delalloc or an extent for the range [64M, 65M[, so user space will be
   mislead into thinking that there's a hole in that range.

This could actually be sporadically triggered with test case generic/094
from fstests, which reports a missing extent/delalloc range like this:

  generic/094 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad)
      --- tests/generic/094.out	2020-06-10 19:29:03.830519425 +0100
      +++ /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad	2024-02-28 11:00:00.381071525 +0000
      @@ -1,3 +1,9 @@
       QA output created by 094
       fiemap run with sync
       fiemap run without sync
      +ERROR: couldn't find extent at 7
      +map is 'HHDDHPPDPHPH'
      +logical: [       5..       6] phys:   301517..  301518 flags: 0x800 tot: 2
      +logical: [       8..       8] phys:   301520..  301520 flags: 0x800 tot: 1
      ...
      (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/094.out /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad'  to see the entire diff)

So in order to fix this, while still avoiding deadlocks in the case where
the fiemap buffer is memory mapped to the same file, change fiemap to work
like the following:

1) Always lock the whole range in the inode's io tree before starting to
   iterate the inode's subvolume tree searching for file extent items,
   just like we did before commit b0ad381fa7 ("btrfs: fix deadlock with
   fiemap and extent locking");

2) Now instead of writing to the fiemap buffer every time we have an extent
   to report, write instead to a temporary buffer (1 page), and when that
   buffer becomes full, stop iterating the file extent items, unlock the
   range in the io tree, release the search path, submit all the entries
   kept in that buffer to the fiemap buffer, and then resume the search
   for file extent items after locking again the remainder of the range in
   the io tree.

   The buffer having a size of a page, allows for 146 entries in a system
   with 4K pages. This is a large enough value to have a good performance
   by avoiding too many restarts of the search for file extent items.
   In other words this preserves the huge performance gains made in the
   last two years to fiemap, while avoiding the deadlocks in case the
   fiemap buffer is memory mapped to the same file (useless in practice,
   but possible and exercised by fuzz testing and syzbot).

Fixes: b0ad381fa7 ("btrfs: fix deadlock with fiemap and extent locking")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05 18:12:37 +01:00
..
9p
adfs
affs affs: free affs_sb_info with kfree_rcu() 2024-02-25 02:10:31 -05:00
afs afs: Fix endless loop in directory parsing 2024-02-27 11:20:43 +01:00
autofs dcache stuff for this cycle 2024-01-11 20:11:35 -08:00
bcachefs bcachefs: fix bch2_save_backtrace() 2024-02-25 15:45:36 -05:00
befs
bfs misc cleanups (the part that hadn't been picked by individual fs trees) 2024-01-11 20:23:50 -08:00
btrfs btrfs: fix race when detecting delalloc ranges during fiemap 2024-03-05 18:12:37 +01:00
cachefiles cachefiles: fix memory leak in cachefiles_add_cache() 2024-02-20 09:46:07 +01:00
ceph ceph: switch to corrected encoding of max_xattr_size in mdsmap 2024-02-26 19:20:30 +01:00
coda dcache stuff for this cycle 2024-01-11 20:11:35 -08:00
configfs
cramfs
crypto
debugfs
devpts
dlm
ecryptfs fix directory locking scheme on rename 2024-01-11 20:00:22 -08:00
efivarfs efivarfs: Drop 'duplicates' bool parameter on efivar_init() 2024-02-25 09:43:39 +01:00
efs
erofs Change since last update: 2024-02-25 09:53:13 -08:00
exfat Description for this pull request: 2024-03-01 12:22:30 -08:00
exportfs
ext2 fix directory locking scheme on rename 2024-01-11 20:00:22 -08:00
ext4 We still have some races in filesystem methods when exposed to RCU 2024-02-25 09:29:05 -08:00
f2fs f2fs: fix double free of f2fs_sb_info 2024-01-12 18:55:09 -08:00
fat
freevxfs
fuse fuse: fix UAF in rcu pathwalks 2024-02-25 02:10:32 -05:00
gfs2 Revert "gfs2: Use GL_NOBLOCK flag for non-blocking lookups" 2024-02-02 17:21:44 +01:00
hfs
hfsplus hfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_info 2024-02-25 02:10:31 -05:00
hostfs
hpfs
hugetlbfs fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super 2024-02-07 21:20:36 -08:00
iomap
isofs
jbd2
jffs2
jfs Revert "jfs: fix shift-out-of-bounds in dbJoin" 2024-01-29 08:45:10 -06:00
kernfs Revert "kernfs: convert kernfs_idr_lock to an irq safe raw spinlock" 2024-01-11 11:51:27 +01:00
lockd sysctl-6.8-rc1 2024-01-10 17:44:36 -08:00
minix minixfs kmap_local_page() switchover and related fixes - very similar to sysv series. 2024-01-11 19:54:18 -08:00
netfs netfs: Fix missing zero-length check in unbuffered write 2024-01-29 14:53:21 +01:00
nfs nfs: fix UAF on pathwalk running into umount 2024-02-25 02:10:32 -05:00
nfs_common
nfsd nfsd-6.8 fixes: 2024-02-07 17:48:15 +00:00
nilfs2 nilfs2: fix potential bug in end_buffer_async_write 2024-02-07 21:20:37 -08:00
nls
notify dcache stuff for this cycle 2024-01-11 20:11:35 -08:00
ntfs sysctl-6.8-rc1 2024-01-10 17:44:36 -08:00
ntfs3 fs/ntfs3: fix build without CONFIG_NTFS3_LZX_XPRESS 2024-02-26 09:32:23 -08:00
ocfs2 misc cleanups (the part that hadn't been picked by individual fs trees) 2024-01-11 20:23:50 -08:00
omfs
openpromfs
orangefs
overlayfs vfs-6.8-rc5.fixes 2024-02-12 07:15:45 -08:00
proc We still have some races in filesystem methods when exposed to RCU 2024-02-25 09:29:05 -08:00
pstore
qnx4
qnx6
quota sysctl-6.8-rc1 2024-01-10 17:44:36 -08:00
ramfs mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER 2024-01-08 15:27:15 -08:00
reiserfs misc cleanups (the part that hadn't been picked by individual fs trees) 2024-01-11 20:23:50 -08:00
romfs
smb We still have some races in filesystem methods when exposed to RCU 2024-02-25 09:29:05 -08:00
squashfs
sysfs
sysv
tracefs eventfs: Keep all directory links at 1 2024-02-01 11:53:53 -05:00
ubifs ubifs: fix kernel-doc warnings 2024-01-06 23:49:50 +01:00
udf misc cleanups (the part that hadn't been picked by individual fs trees) 2024-01-11 20:23:50 -08:00
ufs Many singleton patches against the MM code. The patch series which 2024-01-09 11:18:47 -08:00
unicode
vboxsf
verity Networking changes for 6.8. 2024-01-11 10:07:29 -08:00
xfs xfs: drop experimental warning for FSDAX 2024-02-27 09:53:30 +05:30
zonefs zonefs: Improve error handling 2024-02-16 10:20:35 +09:00
aio.c fs/aio: Make io_cancel() generate completions again 2024-02-27 11:20:44 +01:00
anon_inodes.c
attr.c
backing-file.c
bad_inode.c
binfmt_elf.c
binfmt_elf_fdpic.c
binfmt_elf_test.c
binfmt_flat.c
binfmt_misc.c
binfmt_script.c
buffer.c Many singleton patches against the MM code. The patch series which 2024-01-09 11:18:47 -08:00
char_dev.c
compat_binfmt_elf.c
coredump.c
d_path.c
dax.c
dcache.c Revert "get rid of DCACHE_GENOCIDE" 2024-02-09 23:31:16 -05:00
direct-io.c
drop_caches.c
eventfd.c
eventpoll.c
exec.c execve fixes for v6.8-rc2 2024-01-24 13:32:29 -08:00
fcntl.c
fhandle.c
file.c
file_table.c dcache stuff for this cycle 2024-01-11 20:11:35 -08:00
filesystems.c
fs-writeback.c
fs_context.c
fs_parser.c
fs_pin.c
fs_struct.c
fs_types.c
fsopen.c
init.c
inode.c fix directory locking scheme on rename 2024-01-11 20:00:22 -08:00
internal.h dcache stuff for this cycle 2024-01-11 20:11:35 -08:00
ioctl.c
Kconfig vfs-6.8.netfs 2024-01-19 09:10:23 -08:00
Kconfig.binfmt
kernel_read_file.c
libfs.c dcache stuff for this cycle 2024-01-11 20:11:35 -08:00
locks.c
Makefile vfs-6.8.netfs 2024-01-19 09:10:23 -08:00
mbcache.c
mnt_idmapping.c
mount.h
mpage.c
namei.c rcu pathwalk: prevent bogus hard errors from may_lookup() 2024-02-25 02:10:31 -05:00
namespace.c fs: relax mount_setattr() permission checks 2024-02-07 21:16:29 +01:00
nsfs.c
open.c vfs-6.8.rw 2024-01-08 11:11:51 -08:00
pipe.c sysctl-6.8-rc1 2024-01-10 17:44:36 -08:00
pnode.c
pnode.h
posix_acl.c
proc_namespace.c
read_write.c
readdir.c
remap_range.c remap_range: merge do_clone_file_range() into vfs_clone_file_range() 2024-02-06 17:07:21 +01:00
select.c
seq_file.c
signalfd.c
splice.c
stack.c
stat.c vfs-6.8.mount 2024-01-08 10:57:34 -08:00
statfs.c
super.c fs/super.c: don't drop ->s_user_ns until we free struct super_block itself 2024-02-25 02:10:31 -05:00
sync.c
sysctls.c
timerfd.c
userfaultfd.c Generic: 2024-01-17 13:03:37 -08:00
utimes.c
xattr.c