linux-stable/fs
Filipe Manana a1ce40f8ae btrfs: make send work with concurrent block group relocation
commit d96b34248c upstream.

We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.

The restriction between balance and send was added in commit 9e967495e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e6 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.

Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.

For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.

This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:

1) For all tree searches, send acquires a read lock on the commit root
   semaphore;

2) After each tree search, and before releasing the commit root semaphore,
   the leaf is cloned and placed in the search path (struct btrfs_path);

3) After releasing the commit root semaphore, the changed_cb() callback
   is invoked, which operates on the leaf and writes commands to the pipe
   (or file in case send/receive is not used with a pipe). It's important
   here to not hold a lock on the commit root semaphore, because if we did
   we could deadlock when sending and receiving to the same filesystem
   using a pipe - the send task blocks on the pipe because it's full, the
   receive task, which is the only consumer of the pipe, triggers a
   transaction commit when attempting to create a subvolume or reserve
   space for a write operation for example, but the transaction commit
   blocks trying to write lock the commit root semaphore, resulting in a
   deadlock;

4) Before moving to the next key, or advancing to the next change in case
   of an incremental send, check if a transaction used for relocation was
   committed (or is about to finish its commit). If so, release the search
   path(s) and restart the search, to where we were before, so that we
   don't operate on stale extent buffers. The search restarts are always
   possible because both the send and parent roots are RO, and no one can
   add, remove of update keys (change their offset) in RO trees - the
   only exception is deduplication, but that is still not allowed to run
   in parallel with send;

5) Periodically check if there is contention on the commit root semaphore,
   which means there is a transaction commit trying to write lock it, and
   release the semaphore and reschedule if there is contention, so as to
   avoid causing any significant delays to transaction commits.

This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).

Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.

A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-16 14:23:46 +01:00
..
9p Revert "fs/9p: search open fids first" 2022-02-08 18:34:04 +01:00
adfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
affs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
afs afs: Fix mmap 2021-12-22 09:32:45 +01:00
autofs autofs: fix wait name hash calculation in autofs_wait() 2021-10-20 21:09:02 -04:00
befs isystem: ship and use stdarg.h 2021-08-19 09:02:55 +09:00
bfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
btrfs btrfs: make send work with concurrent block group relocation 2022-03-16 14:23:46 +01:00
cachefiles cachefiles: Change %p in format strings to something else 2021-08-27 13:34:02 +01:00
ceph ceph: put the requests/sessions when it fails to alloc memory 2022-02-01 17:27:14 +01:00
cifs cifs: fix confusing unneeded warning message on smb2.1 and earlier 2022-03-08 19:12:41 +01:00
coda
configfs configfs: fix a race in configfs_{,un}register_subsystem() 2022-03-02 11:48:02 +01:00
cramfs
crypto fscrypt: allow 256-bit master keys with AES-256-XTS 2021-11-18 19:16:11 +01:00
debugfs debugfs: lockdown: Allow reading debugfs files that are not world readable 2022-01-27 11:03:55 +01:00
devpts fsnotify: fix fsnotify hooks in pseudo filesystems 2022-02-01 17:27:01 +01:00
dlm fs: dlm: filter user dlm messages for kernel locks 2022-01-27 11:04:23 +01:00
ecryptfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
efivarfs
efs
erofs erofs: fix deadlock when shrink erofs slab 2021-12-01 09:04:50 +01:00
exfat exfat: fix i_blocks for files truncated over 4 GiB 2022-03-08 19:12:32 +01:00
exportfs
ext2 ext2: fix sleeping in atomic bugs on error 2021-09-22 13:05:23 +02:00
ext4 ext4: fast commit may miss file actions 2022-03-08 19:12:32 +01:00
f2fs f2fs: fix to check available space of CP area correctly in update_ckpt_flags() 2022-01-27 11:05:30 +01:00
fat linux-kselftest-kunit-5.15-rc1 2021-09-02 12:32:12 -07:00
freevxfs
fscache fscache: Remove an unused static variable 2021-10-04 22:13:12 +01:00
fuse fuse: fix pipe buffer lifetime for direct_io 2022-03-16 14:23:42 +01:00
gfs2 gfs2: Fix gfs2_release for non-writers regression 2022-02-16 12:56:18 +01:00
hfs hfs: add lock nesting notation to hfs_find_init 2021-07-15 10:13:49 -07:00
hfsplus hfsplus: report create_date to kstat.btime 2021-07-01 11:06:06 -07:00
hostfs hostfs: support splice_write 2021-08-26 22:28:02 +02:00
hpfs hpfs: use iomap_fiemap to implement ->fiemap 2021-07-27 11:00:36 +02:00
hugetlbfs hugetlbfs: fix off-by-one error in hugetlb_vmdelete_list() 2022-03-08 19:12:38 +01:00
iomap iomap: Fix inline extent handling in iomap_readpage 2021-12-01 09:04:44 +01:00
isofs isofs: Fix out of bound access for corrupted isofs image 2021-11-12 15:05:50 +01:00
jbd2 ext4: fast commit may not fallback for ineligible commit 2022-03-08 19:12:32 +01:00
jffs2 jffs2: GC deadlock reading a page that is used in jffs2_write_begin() 2022-01-27 11:04:49 +01:00
jfs JFS: fix memleak in jfs_mount 2021-11-18 19:16:48 +01:00
kernfs kernfs: don't create a negative dentry if inactive node exists 2021-10-04 10:27:18 +02:00
ksmbd ksmbd: don't align last entry offset in smb2 query directory 2022-02-23 12:03:18 +01:00
lockd lockd: fix failure to cleanup client locks 2022-02-05 12:38:57 +01:00
minix mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
netfs netfs: fix parameter of cleanup() 2021-12-29 12:28:59 +01:00
nfs NFS: Do not report writeback errors in nfs_getattr() 2022-02-23 12:03:15 +01:00
nfs_common nfs: Fix kerneldoc warning shown up by W=1 2021-10-04 22:02:17 +01:00
nfsd nfsd: fix crash on COPY_NOTIFY with special stateid 2022-03-08 19:12:36 +01:00
nilfs2 Merge branch 'akpm' (patches from Andrew) 2021-09-08 12:55:35 -07:00
nls
notify fanotify: Fix stale file descriptor in copy_event_to_user() 2022-02-05 12:38:59 +01:00
ntfs Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-07-03 11:30:04 -07:00
ntfs3 Fixed xfstests generic/016 generic/021 generic/022 generic/041 generic/274 generic/423, 2021-10-15 09:58:11 -04:00
ocfs2 ocfs2: fix a deadlock when commit trans 2022-02-01 17:27:05 +01:00
omfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
openpromfs
orangefs orangefs: Fix the size of a memory allocation in orangefs_bufmap_alloc() 2022-01-20 09:13:13 +01:00
overlayfs ovl: fix NULL pointer dereference in copy up warning 2022-02-05 12:38:59 +01:00
proc proc: fix documentation and description of pagemap 2022-03-08 19:12:54 +01:00
pstore pstore/blk: Use "%lu" to format unsigned long 2021-11-25 09:48:42 +01:00
qnx4 qnx4: work around gcc false positive warning bug 2021-09-21 08:36:48 -07:00
qnx6
quota quota: make dquot_quota_sync return errors from ->sync_fs 2022-02-23 12:03:06 +01:00
ramfs fs: move ramfs_aops to libfs 2021-06-29 10:53:48 -07:00
reiserfs Kbuild updates for v5.15 2021-09-03 15:33:47 -07:00
romfs
smbfs_common cifs: Fix crash on unload of cifs_arc4.ko 2021-12-14 10:57:12 +01:00
squashfs squashfs: use bvec_virt 2021-08-16 10:50:32 -06:00
sysfs sysfs: Allow deferred execution of iomem_get_mapping() 2021-08-06 13:05:28 +02:00
sysv mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
tracefs tracefs: Set the group ownership in apply_options() not parse_options() 2022-03-02 11:48:05 +01:00
ubifs ubifs: Error path in ubifs_remount_rw() seems to wrongly free write buffers 2022-01-27 11:05:07 +01:00
udf udf: Fix NULL ptr deref when converting from inline format 2022-02-01 17:27:00 +01:00
ufs isystem: ship and use stdarg.h 2021-08-19 09:02:55 +09:00
unicode
vboxsf vboxfs: fix broken legacy mount signature checking 2021-09-27 11:26:21 -07:00
verity fs-verity: fix signed integer overflow with i_size near S64_MAX 2021-09-22 10:56:34 -07:00
xfs xfs: map unwritten blocks in XFS_IOC_{ALLOC,FREE}SP just like fallocate 2022-01-11 15:35:16 +01:00
zonefs zonefs: add MODULE_ALIAS_FS 2021-12-22 09:32:48 +01:00
aio.c aio: Fix incorrect usage of eventfd_signal_allowed() 2021-12-14 10:57:22 +01:00
anon_inodes.c
attr.c fs: handle circular mappings correctly 2021-11-25 09:48:46 +01:00
bad_inode.c vfs: add rcu argument to ->get_acl() callback 2021-08-18 22:08:24 +02:00
binfmt_aout.c binfmt: a.out: Fix bogus semicolon 2021-09-05 10:15:05 -07:00
binfmt_elf.c elf: don't use MAP_FIXED_NOREPLACE for elf interpreter mappings 2021-10-03 14:02:58 -07:00
binfmt_elf_fdpic.c binfmt: remove in-tree usage of MAP_DENYWRITE 2021-09-03 18:42:01 +02:00
binfmt_flat.c binfmt: remove in-tree usage of MAP_EXECUTABLE 2021-06-29 10:53:50 -07:00
binfmt_misc.c
binfmt_script.c
buffer.c mm: fs: invalidate bh_lrus for only cold path 2021-09-24 16:13:35 -07:00
char_dev.c
compat_binfmt_elf.c
coredump.c coredump: fix memleak in dump_vma_snapshot() 2021-09-08 11:50:27 -07:00
d_path.c d_path: make 'prepend()' fill up the buffer exactly on overflow 2021-09-02 10:07:29 -07:00
dax.c New code for 5.15: 2021-08-31 11:13:35 -07:00
dcache.c
direct-io.c
drop_caches.c fs: drop_caches: fix skipping over shadow cache inodes 2021-09-03 09:58:10 -07:00
eventfd.c eventfd: Export eventfd_wake_count to modules 2021-09-06 07:20:56 -04:00
eventpoll.c ARM development updates for 5.15: 2021-09-09 13:25:49 -07:00
exec.c signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV) 2021-11-25 09:49:06 +01:00
fcntl.c Merge branch 'akpm' (patches from Andrew) 2021-09-03 10:08:28 -07:00
fhandle.c
file.c fget: clarify and improve __fget_files() implementation 2022-01-16 09:12:42 +01:00
file_table.c
filesystems.c fs: simplify get_filesystem_list / get_all_fs_names 2021-08-23 01:25:40 -04:00
fs-writeback.c Merge branch 'akpm' (patches from Andrew) 2021-09-03 10:08:28 -07:00
fs_context.c vfs: fs_context: fix up param length parsing in legacy_parse_param 2022-01-20 09:13:14 +01:00
fs_parser.c namei: Standardize callers of filename_lookup() 2021-09-07 16:07:47 -04:00
fs_pin.c
fs_struct.c
fs_types.c
fsopen.c
init.c
inode.c fs: export an inode_update_time helper 2021-11-25 09:49:08 +01:00
internal.h block: move fs/block_dev.c to block/bdev.c 2021-09-07 08:39:40 -06:00
io-wq.c io-wq: drop wqe lock before creating new worker 2021-12-22 09:32:51 +01:00
io-wq.h io-wq: provide a way to limit max number of workers 2021-08-29 07:55:55 -06:00
io_uring.c io_uring: fix no lock protection for ctx->cq_extra 2022-03-08 19:12:33 +01:00
ioctl.c New code for 5.15: 2021-08-31 11:06:32 -07:00
Kconfig 4 cifs/smb3 fixes, one for DFS reconnect, and one to begin creating common headers for server and client and the other two to rename the cifs_common directory to smbfs_common to be more consistent ie change use of the name cifs to smb which is more accurate 2021-09-12 10:10:21 -07:00
Kconfig.binfmt binfmt: remove support for em86 (alpha only) 2021-07-25 22:33:03 -07:00
kernel_read_file.c vfs: check fd has read access in kernel_read_file_from_fd() 2021-10-18 20:22:03 -10:00
libfs.c fs: remove noop_set_page_dirty() 2021-06-29 10:53:48 -07:00
locks.c Revert "memcg: enable accounting for file lock caches" 2021-09-07 11:21:48 -07:00
Makefile 4 cifs/smb3 fixes, one for DFS reconnect, and one to begin creating common headers for server and client and the other two to rename the cifs_common directory to smbfs_common to be more consistent ie change use of the name cifs to smb which is more accurate 2021-09-12 10:10:21 -07:00
mbcache.c
mount.h
mpage.c
namei.c fsnotify: invalidate dcache before IN_DELETE event 2022-02-01 17:27:15 +01:00
namespace.c fs/mount_setattr: always cleanup mount_kattr 2022-01-05 12:42:39 +01:00
no-block.c
nsfs.c
open.c mm, thp: fix incorrect unmap behavior for private pages 2021-11-18 19:17:17 +01:00
pipe.c watch_queue: Fix lack of barrier/sync/lock between post and read 2022-03-16 14:23:44 +01:00
pnode.c
pnode.h
posix_acl.c ovl: enable RCU'd ->get_acl() 2021-08-18 22:08:24 +02:00
proc_namespace.c
read_write.c fs: clean up after mandatory file locking support removal 2021-08-24 07:52:45 -04:00
readdir.c
remap_range.c fs: remove mandatory file locking support 2021-08-23 06:15:36 -04:00
select.c select: Fix indefinitely sleeping task in poll_schedule_timeout() 2022-01-29 10:58:25 +01:00
seq_file.c seq_file: disallow extremely large seq buffer allocations 2021-07-19 17:18:48 -07:00
signalfd.c signalfd: use wake_up_pollfree() 2021-12-14 10:57:15 +01:00
splice.c
stack.c
stat.c fs: add generic helper for filling statx attribute flags 2021-08-17 11:47:43 +02:00
statfs.c
super.c vfs: make freeze_super abort when sync_filesystem returns error 2022-02-23 12:03:05 +01:00
sync.c
timerfd.c timerfd: Provide timerfd_resume() 2021-08-10 17:57:22 +02:00
userfaultfd.c userfaultfd: fix a race between writeprotect and exit_mmap() 2021-10-18 20:22:02 -10:00
utimes.c
xattr.c