linux-stable/fs
Qu Wenruo 3efab6d817 btrfs: raid56: don't trust any cached sector in __raid56_parity_recover()
commit f6065f8ede upstream.

[BUG]
There is a small workload which will always fail with recent kernel:
(A simplified version from btrfs/125 test case)

  mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
  mount $dev1 $mnt
  xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
  sync
  umount $mnt
  btrfs dev scan -u $dev3
  mount -o degraded $dev1 $mnt
  xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
  umount $mnt
  btrfs dev scan
  mount $dev1 $mnt
  btrfs balance start --full-balance $mnt
  umount $mnt

The failure is always failed to read some tree blocks:

  BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
  BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
  BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
  ...

[CAUSE]
With the recently added debug output, we can see all RAID56 operations
related to full stripe 38928384:

  56.1183: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
  56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
  56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
  56.1187: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
  56.1188: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
  56.1188: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
  56.1189: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
  56.1218: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
  56.1219: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
  56.2721: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
  56.2723: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
  56.2724: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2

Before we enter raid56_parity_recover(), we have triggered some metadata
write for the full stripe 38928384, this leads to us to read all the
sectors from disk.

Furthermore, btrfs raid56 write will cache its calculated P/Q sectors to
avoid unnecessary read.

This means, for that full stripe, after any partial write, we will have
stale data, along with P/Q calculated using that stale data.

Thankfully due to patch "btrfs: only write the sectors in the vertical stripe
which has data stripes" we haven't submitted all the corrupted P/Q to disk.

When we really need to recover certain range, aka in
raid56_parity_recover(), we will use the cached rbio, along with its
cached sectors (the full stripe is all cached).

This explains why we have no event raid56_scrub_read_recover()
triggered.

Since we have the cached P/Q which is calculated using the stale data,
the recovered one will just be stale.

In our particular test case, it will always return the same incorrect
metadata, thus causing the same error message "parent transid verify
failed on 39010304 wanted 9 found 7" again and again.

[BTRFS DESTRUCTIVE RMW PROBLEM]

Test case btrfs/125 (and above workload) always has its trouble with
the destructive read-modify-write (RMW) cycle:

        0       32K     64K
Data1:  | Good  | Good  |
Data2:  | Bad   | Bad   |
Parity: | Good  | Good  |

In above case, if we trigger any write into Data1, we will use the bad
data in Data2 to re-generate parity, killing the only chance to recovery
Data2, thus Data2 is lost forever.

This destructive RMW cycle is not specific to btrfs RAID56, but there
are some btrfs specific behaviors making the case even worse:

- Btrfs will cache sectors for unrelated vertical stripes.

  In above example, if we're only writing into 0~32K range, btrfs will
  still read data range (32K ~ 64K) of Data1, and (64K~128K) of Data2.
  This behavior is to cache sectors for later update.

  Incidentally commit d4e28d9b5f ("btrfs: raid56: make steal_rbio()
  subpage compatible") has a bug which makes RAID56 to never trust the
  cached sectors, thus slightly improve the situation for recovery.

  Unfortunately, follow up fix "btrfs: update stripe_sectors::uptodate in
  steal_rbio" will revert the behavior back to the old one.

- Btrfs raid56 partial write will update all P/Q sectors and cache them

  This means, even if data at (64K ~ 96K) of Data2 is free space, and
  only (96K ~ 128K) of Data2 is really stale data.
  And we write into that (96K ~ 128K), we will update all the parity
  sectors for the full stripe.

  This unnecessary behavior will completely kill the chance of recovery.

  Thankfully, an unrelated optimization "btrfs: only write the sectors
  in the vertical stripe which has data stripes" will prevent
  submitting the write bio for untouched vertical sectors.

  That optimization will keep the on-disk P/Q untouched for a chance for
  later recovery.

[FIX]
Although we have no good way to completely fix the destructive RMW
(unless we go full scrub for each partial write), we can still limit the
damage.

With patch "btrfs: only write the sectors in the vertical stripe which
has data stripes" now we won't really submit the P/Q of unrelated
vertical stripes, so the on-disk P/Q should still be fine.

Now we really need to do is just drop all the cached sectors when doing
recovery.

By this, we have a chance to read the original P/Q from disk, and have a
chance to recover the stale data, while still keep the cache to speed up
regular write path.

In fact, just dropping all the cache for recovery path is good enough to
allow the test case btrfs/125 along with the small script to pass
reliably.

The lack of metadata write after the degraded mount, and forced metadata
COW is saving us this time.

So this patch will fix the behavior by not trust any cache in
__raid56_parity_recover(), to solve the problem while still keep the
cache useful.

But please note that this test pass DOES NOT mean we have solved the
destructive RMW problem, we just do better damage control a little
better.

Related patches:

- btrfs: only write the sectors in the vertical stripe
- d4e28d9b5f ("btrfs: raid56: make steal_rbio() subpage compatible")
- btrfs: update stripe_sectors::uptodate in steal_rbio

Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-21 15:18:56 +02:00
..
9p 9p: fix EBADF errors in cached mode 2022-06-29 09:04:26 +02:00
adfs Filesystem folio changes for 5.18 2022-03-22 18:26:56 -07:00
affs Filesystem folio changes for 5.18 2022-03-22 18:26:56 -07:00
afs netfs: do not unlock and put the folio twice 2022-07-22 10:21:42 +02:00
autofs
befs fs: allocate inode by using alloc_inode_sb() 2022-03-22 15:57:03 -07:00
bfs Filesystem folio changes for 5.18 2022-03-22 18:26:56 -07:00
btrfs btrfs: raid56: don't trust any cached sector in __raid56_parity_recover() 2022-08-21 15:18:56 +02:00
cachefiles cachefiles: Fix KASAN slab-out-of-bounds in cachefiles_set_volume_xattr 2022-04-08 23:32:40 +01:00
ceph netfs: do not unlock and put the folio twice 2022-07-22 10:21:42 +02:00
cifs SMB3: fix lease break timeout when multiple deferred close handles for the same file. 2022-08-17 14:42:19 +02:00
coda Folio changes for 5.18 2022-03-22 17:03:12 -07:00
configfs configfs: fix a race in configfs_{,un}register_subsystem() 2022-02-22 18:30:28 +01:00
cramfs
crypto fs: Remove ->readpages address space operation 2022-04-01 13:45:33 -04:00
debugfs debugfs: Document that debugfs_create functions need not be error checked 2022-02-25 11:56:13 +01:00
devpts fsnotify: fix fsnotify hooks in pseudo filesystems 2022-01-24 14:17:02 +01:00
dlm dlm: fix pending remove if msg allocation fails 2022-07-29 17:28:15 +02:00
ecryptfs Filesystem folio changes for 5.18 2022-03-22 18:26:56 -07:00
efivarfs
efs fs: allocate inode by using alloc_inode_sb() 2022-03-22 15:57:03 -07:00
erofs erofs: avoid consecutive detection for Highmem memory 2022-08-17 14:40:38 +02:00
exfat exfat: use updated exfat_chain directly during renaming 2022-07-29 17:28:16 +02:00
exportfs exportfs: support idmapped mounts 2022-06-09 10:30:56 +02:00
ext2 ext2: Add more validity checks for inode counts 2022-08-17 14:40:21 +02:00
ext4 ext4: fix race when reusing xattr blocks 2022-08-17 14:42:32 +02:00
f2fs f2fs: fix null-ptr-deref in f2fs_get_dnode_of_data 2022-08-17 14:42:36 +02:00
fat fat: add ratelimit to fat*_ent_bread() 2022-06-09 10:29:49 +02:00
freevxfs fs: allocate inode by using alloc_inode_sb() 2022-03-22 15:57:03 -07:00
fscache fscache: Fix if condition in fscache_wait_on_volume_collision() 2022-07-12 16:42:17 +02:00
fuse fuse: Remove the control interface for virtio-fs 2022-08-17 14:42:09 +02:00
gfs2 gfs2: use i_lock spin_lock for inode qadata 2022-06-09 10:29:47 +02:00
hfs Filesystem folio changes for 5.18 2022-03-22 18:26:56 -07:00
hfsplus Filesystem folio changes for 5.18 2022-03-22 18:26:56 -07:00
hostfs Filesystem folio changes for 5.18 2022-03-22 18:26:56 -07:00
hpfs Filesystem folio changes for 5.18 2022-03-22 18:26:56 -07:00
hugetlbfs hugetlbfs: fix hugetlbfs_statfs() locking 2022-06-09 10:30:31 +02:00
iomap iomap: iomap_write_failed fix 2022-06-09 10:30:08 +02:00
isofs fs: allocate inode by using alloc_inode_sb() 2022-03-22 15:57:03 -07:00
jbd2 jbd2: fix assertion 'jh->b_frozen_data == NULL' failure when journal aborted 2022-08-17 14:41:57 +02:00
jffs2 jffs2: fix memory leak in jffs2_do_fill_super 2022-06-14 18:44:55 +02:00
jfs fs: jfs: fix possible NULL pointer dereference in dbFree() 2022-06-09 10:29:49 +02:00
kernfs kernfs: fix potential NULL dereference in __kernfs_remove 2022-08-17 14:41:42 +02:00
ksmbd ksmbd: prevent out of bound read for SMB2_WRITE 2022-08-17 14:42:30 +02:00
lockd lockd: detect and reject lock arguments that overflow 2022-08-17 14:40:04 +02:00
minix Merge branch 'akpm' (patches from Andrew) 2022-03-24 14:14:07 -07:00
netfs netfs: do not unlock and put the folio twice 2022-07-22 10:21:42 +02:00
nfs pNFS/flexfiles: Report RDMA connection errors to the server 2022-08-17 14:40:02 +02:00
nfs_common
nfsd nfsd: eliminate the NFSD_FILE_BREAK_* flags 2022-08-17 14:40:02 +02:00
nilfs2 nilfs2: fix incorrect masking of permission flags for symlinks 2022-07-22 10:21:21 +02:00
nls
notify fanotify: refine the validation checks on non-dir inode mask 2022-07-07 17:54:57 +02:00
ntfs ntfs: fix use-after-free in ntfs_ucsncmp() 2022-08-03 12:05:16 +02:00
ntfs3 fs/ntfs3: Fix invalid free in log_replay 2022-06-09 10:30:56 +02:00
ocfs2 Revert "ocfs2: mount shared volume without ha stack" 2022-08-03 12:05:16 +02:00
omfs fs: Convert __set_page_dirty_buffers to block_dirty_folio 2022-03-16 13:37:04 -04:00
openpromfs fs: allocate inode by using alloc_inode_sb() 2022-03-22 15:57:03 -07:00
orangefs Filesystem folio changes for 5.18 2022-03-22 18:26:56 -07:00
overlayfs ovl: drop WARN_ON() dentry is NULL in ovl_encode_fh() 2022-08-17 14:40:09 +02:00
proc proc: fix a dentry lock race between release_task and lookup 2022-08-17 14:42:07 +02:00
pstore pstore: Don't use semaphores in always-atomic-context code 2022-03-15 11:08:23 -07:00
qnx4 fs: allocate inode by using alloc_inode_sb() 2022-03-22 15:57:03 -07:00
qnx6 fs: allocate inode by using alloc_inode_sb() 2022-03-22 15:57:03 -07:00
quota quota: Prevent memory allocation recursion while holding dq_lock 2022-06-22 14:27:51 +02:00
ramfs
reiserfs \n 2022-03-25 17:38:15 -07:00
romfs fs: allocate inode by using alloc_inode_sb() 2022-03-22 15:57:03 -07:00
smbfs_common smb3: fix ksmbd bigendian bug in oplock break, and move its struct to smbfs_common 2022-03-31 09:38:53 -05:00
squashfs Merge branch 'akpm' (patches from Andrew) 2022-03-22 16:11:53 -07:00
sysfs kobject: kobj_type: remove default_attrs 2022-04-05 15:39:19 +02:00
sysv Filesystem folio changes for 5.18 2022-03-22 18:26:56 -07:00
tracefs tracefs: Set the group ownership in apply_options() not parse_options() 2022-02-25 21:05:04 -05:00
ubifs This pull request contains fixes for JFFS2, UBI and UBIFS 2022-03-31 16:09:41 -07:00
udf udf: Avoid using stale lengthOfImpUse 2022-05-10 13:30:32 +02:00
ufs Filesystem folio changes for 5.18 2022-03-22 18:26:56 -07:00
unicode kbuild: unify cmd_copy and cmd_shipped 2022-02-14 10:37:32 +09:00
vboxsf Filesystem folio changes for 5.18 2022-03-22 18:26:56 -07:00
verity fs: Remove ->readpages address space operation 2022-04-01 13:45:33 -04:00
xfs xfs: reorder iunlink remove operation in xfs_ifree 2022-04-21 08:45:16 +10:00
zonefs block: add a bdev_max_zone_append_sectors helper 2022-08-17 14:42:25 +02:00
Kconfig Folio changes for 5.18 2022-03-22 17:03:12 -07:00
Kconfig.binfmt execve updates for v5.18-rc1 2022-03-21 19:16:02 -07:00
Makefile io_uring: move to separate directory 2022-08-17 14:40:41 +02:00
aio.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2022-04-01 19:57:03 -07:00
anon_inodes.c
attr.c vfs: Check the truncate maximum size in inode_newsize_ok() 2022-08-17 14:40:08 +02:00
bad_inode.c
binfmt_aout.c
binfmt_elf.c revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE" 2022-04-15 14:49:56 -07:00
binfmt_elf_fdpic.c coredump: Snapshot the vmas in do_coredump 2022-03-08 12:55:29 -06:00
binfmt_elf_test.c binfmt_elf: Introduce KUnit test 2022-03-03 20:38:56 -08:00
binfmt_flat.c binfmt_flat: do not stop relocating GOT entries prematurely on riscv 2022-06-09 10:29:26 +02:00
binfmt_misc.c Fix regression due to "fs: move binfmt_misc sysctl to its own file" 2022-02-09 09:50:02 -08:00
binfmt_script.c
buffer.c filemap: Remove AOP_FLAG_CONT_EXPAND 2022-04-01 14:40:44 -04:00
char_dev.c
compat_binfmt_elf.c binfmt_elf: Introduce KUnit test 2022-03-03 20:38:56 -08:00
coredump.c ptrace: Cleanups for v5.18 2022-03-28 17:29:53 -07:00
d_path.c
dax.c dax: fix cache flush on PMD-mapped pages 2022-06-09 10:30:28 +02:00
dcache.c mm: dcache: use kmem_cache_alloc_lru() to allocate dentry 2022-03-22 15:57:03 -07:00
direct-io.c block: remove the per-bio/request write hint 2022-03-07 12:45:57 -07:00
drop_caches.c
eventfd.c
eventpoll.c epoll: autoremove wakers even more aggressively 2022-08-17 14:40:20 +02:00
exec.c posix-cpu-timers: Cleanup CPU timers before freeing them during exec 2022-08-17 14:42:19 +02:00
fcntl.c fs: remove fs.f_write_hint 2022-03-08 17:55:03 -07:00
fhandle.c
file.c fs: fix fd table size alignment properly 2022-03-29 23:29:18 -07:00
file_table.c SUNRPC: Ensure we flush any closed sockets before xs_xprt_free() 2022-04-07 16:19:47 -04:00
filesystems.c
fs-writeback.c writeback: Fix inode->i_io_list not be protected by inode->i_lock error 2022-06-14 18:45:18 +02:00
fs_context.c vfs: fs_context: fix up param length parsing in legacy_parse_param 2022-01-18 09:23:19 +02:00
fs_parser.c
fs_pin.c
fs_struct.c
fs_types.c
fsopen.c
init.c
inode.c writeback: Fix inode->i_io_list not be protected by inode->i_lock error 2022-06-14 18:45:18 +02:00
internal.h Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2022-04-01 19:57:03 -07:00
ioctl.c Fixes for 5.18-rc1: 2022-04-01 19:35:56 -07:00
kernel_read_file.c
libfs.c fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio 2022-03-16 13:37:05 -04:00
locks.c fs: move locking sysctls where they are used 2022-01-22 08:33:36 +02:00
mbcache.c mbcache: add functions to delete entry if unused 2022-08-17 14:40:17 +02:00
mount.h
mpage.c for-5.18/alloc-cleanups-2022-03-25 2022-03-26 11:59:30 -07:00
namei.c __follow_mount_rcu(): verify that mount_lock remains unchanged 2022-08-17 14:42:20 +02:00
namespace.c fs: hold writers when changing mount's idmapping 2022-06-09 10:29:42 +02:00
no-block.c
nsfs.c
open.c fs: remove fs.f_write_hint 2022-03-08 17:55:03 -07:00
pipe.c pipe: Fix missing lock in pipe_resize_ring() 2022-06-06 08:48:53 +02:00
pnode.c
pnode.h
posix_acl.c fs: fix acl translation 2022-04-19 10:19:02 -07:00
proc_namespace.c
read_write.c fs: sendfile handles O_NONBLOCK of out_fd 2022-08-03 12:05:16 +02:00
readdir.c
remap_range.c fs/remap: constrain dedupe of EOF blocks 2022-07-22 10:21:21 +02:00
select.c
seq_file.c rxrpc: Fix locking issue 2022-06-09 10:30:19 +02:00
signalfd.c Merge branch 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2022-01-17 05:49:30 +02:00
splice.c fs: check FMODE_LSEEK to control internal pipe splicing 2022-08-17 14:41:05 +02:00
stack.c
stat.c stat: fix inconsistency between struct stat and struct compat_stat 2022-04-12 13:35:08 -10:00
statfs.c
super.c vfs: make freeze_super abort when sync_filesystem returns error 2022-01-30 08:59:47 -08:00
sync.c vfs: make sync_filesystem return errors from ->sync_fs 2022-01-30 08:59:47 -08:00
sysctls.c fs: move namespace sysctls and declare fs base directory 2022-01-22 08:33:36 +02:00
timerfd.c
userfaultfd.c userfaultfd: provide properly masked address for huge-pages 2022-08-03 12:05:16 +02:00
utimes.c
xattr.c fs: fix acl translation 2022-04-19 10:19:02 -07:00