linux-stable/fs/btrfs
Filipe Manana 0151049a68 Btrfs: fix missing data checksums after replaying a log tree
commit 40e046acbd upstream.

When logging a file that has shared extents (reflinked with other files or
with itself), we can end up logging multiple checksum items that cover
overlapping ranges. This confuses the search for checksums at log replay
time causing some checksums to never be added to the fs/subvolume tree.

Consider the following example of a file that shares the same extent at
offsets 0 and 256Kb:

   [ bytenr 13893632, offset 64Kb, len 64Kb  ]
   0                                         64Kb

   [ bytenr 13631488, offset 64Kb, len 192Kb ]
   64Kb                                      256Kb

   [ bytenr 13893632, offset 0, len 256Kb    ]
   256Kb                                     512Kb

When logging the inode, at tree-log.c:copy_items(), when processing the
file extent item at offset 0, we log a checksum item covering the range
13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
64Kb + 64Kb, respectively.

Later when processing the extent item at offset 256K, we log the checksums
for the range from 13893632 to 14155776 (which corresponds to 13893632 +
256Kb). These checksums get merged with the checksum item for the range
from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
So after this we get the two following checksum items in the log tree:

   (...)
   item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
           range start 13631488 end 14155776 length 524288
   item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
           range start 13959168 end 14024704 length 65536

The first one covers the range from the second one, they overlap.

So far this does not cause a problem after replaying the log, because
when replaying the file extent item for offset 256K, we copy all the
checksums for the extent 13893632 from the log tree to the fs/subvolume
tree, since searching for an checksum item for bytenr 13893632 leaves us
at the first checksum item, which covers the whole range of the extent.

However if we write 64Kb to file offset 256Kb for example, we will
not be able to find and copy the checksums for the last 128Kb of the
extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.

After writing 64Kb into file offset 256Kb we get the following extent
layout for our file:

   [ bytenr 13893632, offset 64K, len 64Kb   ]
   0                                         64Kb

   [ bytenr 13631488, offset 64Kb, len 192Kb ]
   64Kb                                      256Kb

   [ bytenr 14155776, offset 0, len 64Kb     ]
   256Kb                                     320Kb

   [ bytenr 13893632, offset 64Kb, len 192Kb ]
   320Kb                                     512Kb

After fsync'ing the file, if we have a power failure and then mount
the filesystem to replay the log, the following happens:

1) When replaying the file extent item for file offset 320Kb, we
   lookup for the checksums for the extent range from 13959168
   (13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
   to btrfs_lookup_csums_range();

2) btrfs_lookup_csums_range() finds the checksum item that starts
   precisely at offset 13959168 (item 7 in the log tree, shown before);

3) However that checksum item only covers 64Kb of data, and not 192Kb
   of data;

4) As a result only the checksums for the first 64Kb of data referenced
   by the file extent item are found and copied to the fs/subvolume tree.
   The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
   the corresponding data checksums found and copied to the fs/subvolume
   tree.

5) After replaying the log userspace will not be able to read the file
   range from 384Kb to 512Kb, because the checksums are missing and
   resulting in an -EIO error.

The following steps reproduce this scenario:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt/sdc

  $ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
  $ xfs_io -c "fsync" /mnt/sdc/foobar
  $ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar

  $ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
  $ xfs_io -c "fsync" /mnt/sdc/foobar

  $ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
  $ xfs_io -c "fsync" /mnt/sdc/foobar

  <power failure>

  $ mount /dev/sdc /mnt/sdc
  $ md5sum /mnt/sdc/foobar
  md5sum: /mnt/sdc/foobar: Input/output error

  $ dmesg | tail
  [165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
  [165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
  [165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
  [165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
  [165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
  [165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
  [165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
  [165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
  [165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
  [165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1

Fix this simply by deleting first any checksums, from the log tree, for the
range of the extent we are logging at copy_items(). This ensures we do not
get checksum items in the log tree that have overlapping ranges.

This is a long time issue that has been present since we have the clone
(and deduplication) ioctl, and can happen both when an extent is shared
between different files and within the same file.

A test case for fstests follows soon.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31 16:34:42 +01:00
..
tests btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_account_extent 2018-08-06 13:12:52 +02:00
acl.c Btrfs: setup a nofs context for memory allocation at __btrfs_set_acl 2019-03-23 20:10:00 +01:00
async-thread.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
async-thread.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
backref.c Btrfs: fix deadlock between fiemap and transaction commits 2019-08-25 10:47:54 +02:00
backref.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
btrfs_inode.h btrfs: use tagged writepage to mitigate livelock of snapshot 2019-02-12 19:47:11 +01:00
check-integrity.c btrfs: open-code bio_set_op_attrs 2018-08-06 13:12:44 +02:00
check-integrity.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
compression.c btrfs: correctly validate compression type 2019-09-16 08:22:19 +02:00
compression.h btrfs: correctly validate compression type 2019-09-16 08:22:19 +02:00
ctree.c btrfs: handle error of get_old_root 2019-12-01 09:16:22 +01:00
ctree.h Btrfs: fix missing data checksums after replaying a log tree 2019-12-31 16:34:42 +01:00
dedupe.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
delayed-inode.c btrfs: use refcount_inc_not_zero in kill_all_nodes 2019-12-17 20:34:46 +01:00
delayed-inode.h btrfs: Remove fs_info from btrfs_delete_delayed_dir_index 2018-08-06 13:13:00 +02:00
delayed-ref.c btrfs: only track ref_heads in delayed_ref_updates 2019-12-05 09:20:23 +01:00
delayed-ref.h btrfs: Remove fs_info from btrfs_add_delayed_data_ref 2018-08-06 13:12:34 +02:00
dev-replace.c btrfs: dev-replace: set result code of cancel by status of scrub 2019-12-05 09:20:21 +01:00
dev-replace.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
dir-item.c btrfs: Remove fs_info from btrfs_insert_delayed_dir_index 2018-08-06 13:13:00 +02:00
disk-io.c Btrfs: allow clear_extent_dirty() to receive a cached extent state record 2019-12-05 09:20:22 +01:00
disk-io.h btrfs: Check the first key and level for cached extent buffer 2019-05-22 07:37:42 +02:00
export.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
export.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
extent-tree.c Btrfs: fix missing data checksums after replaying a log tree 2019-12-31 16:34:42 +01:00
extent_io.c btrfs: Avoid getting stuck during cyclic writebacks 2019-12-17 20:34:47 +01:00
extent_io.h Btrfs: allow clear_extent_dirty() to receive a cached extent state record 2019-12-05 09:20:22 +01:00
extent_map.c btrfs: use fs_info for btrfs_handle_em_exist tracepoint 2018-05-28 18:07:17 +02:00
extent_map.h btrfs: use fs_info for btrfs_handle_em_exist tracepoint 2018-05-28 18:07:17 +02:00
file-item.c Btrfs: fix missing data checksums after replaying a log tree 2019-12-31 16:34:42 +01:00
file.c Btrfs: fix negative subv_writers counter and data space leak after buffered write 2019-12-17 20:34:47 +01:00
free-space-cache.c btrfs: check page->mapping when loading free space cache 2019-12-17 20:34:45 +01:00
free-space-cache.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
free-space-tree.c btrfs: Remove fs_info from btrfs_del_root 2018-08-06 13:13:00 +02:00
free-space-tree.h btrfs: Remove fs_info argument from add_to_free_space_tree 2018-05-28 18:07:36 +02:00
inode-item.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
inode-map.c btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() 2019-11-06 13:05:13 +01:00
inode-map.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
inode.c btrfs: do not call synchronize_srcu() in inode_tree_del 2019-12-31 16:34:41 +01:00
ioctl.c btrfs: defrag: use btrfs_mod_outstanding_extents in cluster_pages_for_defrag 2019-12-01 09:16:21 +01:00
Kconfig btrfs: add SPDX header to Kconfig 2018-04-12 16:29:55 +02:00
locking.c btrfs: replace waitqueue_actvie with cond_wake_up 2018-05-28 18:23:09 +02:00
locking.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
lzo.c btrfs: lzo: Harden inline lzo compressed extent decompression 2018-05-30 16:46:43 +02:00
Makefile btrfs: Remove custom crc32c init code 2018-03-26 15:09:39 +02:00
math.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
ordered-data.c btrfs: prune unused includes 2018-08-06 13:12:43 +02:00
ordered-data.h btrfs: remove remaing full_sync logic from btrfs_sync_file 2018-08-06 13:12:31 +02:00
orphan.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
print-tree.c btrfs: annotate unlikely branches after V0 extent type removal 2018-08-06 13:12:41 +02:00
print-tree.h btrfs: print-tree: debugging output enhancement 2018-04-20 19:18:16 +02:00
props.c btrfs: correctly validate compression type 2019-09-16 08:22:19 +02:00
props.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
qgroup.c btrfs: tracepoints: Fix wrong parameter order for qgroup events 2019-11-06 13:05:13 +01:00
qgroup.h btrfs: qgroup: Avoid calling qgroup functions if qgroup is not enabled 2018-11-13 11:08:56 -08:00
raid56.c btrfs: raid56: properly unmap parity page in finish_parity_scrub() 2019-04-03 06:26:21 +02:00
raid56.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
rcu-string.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
reada.c btrfs: start readahead also in seed devices 2019-06-25 11:35:59 +08:00
ref-verify.c btrfs: fix uninitialized ret in ref-verify 2019-10-17 13:45:27 -07:00
ref-verify.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
relocation.c btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() 2019-11-06 13:05:13 +01:00
root-tree.c btrfs: Don't panic when we can't find a root key 2019-05-31 06:46:13 -07:00
scrub.c btrfs: init csum_list before possible free 2019-09-16 08:22:08 +02:00
send.c Btrfs: send, skip backreference walking for extents with many references 2019-12-17 20:34:48 +01:00
send.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
struct-funcs.c btrfs: prune unused includes 2018-08-06 13:12:43 +02:00
super.c btrfs: avoid link error with CONFIG_NO_AUTO_INLINE 2019-12-01 09:17:19 +01:00
sysfs.c btrfs: sysfs: don't leak memory when failing add fsid 2019-05-31 06:46:02 -07:00
sysfs.h btrfs: sysfs: Use enum/define value for feature array definitions 2018-05-28 18:23:39 +02:00
transaction.c Btrfs: fix deadlock between fiemap and transaction commits 2019-08-25 10:47:54 +02:00
transaction.h Btrfs: fix deadlock between fiemap and transaction commits 2019-08-25 10:47:54 +02:00
tree-checker.c btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable 2018-12-08 12:59:10 +01:00
tree-checker.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
tree-defrag.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
tree-log.c Btrfs: fix missing data checksums after replaying a log tree 2019-12-31 16:34:42 +01:00
tree-log.h Btrfs: sync log after logging new name 2018-08-23 17:37:26 +02:00
ulist.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
ulist.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
uuid-tree.c btrfs: Remove fs_info argument from btrfs_uuid_tree_rem 2018-05-30 16:46:53 +02:00
volumes.c btrfs: fix ncopies raid_attr for RAID56 2019-12-05 09:20:21 +01:00
volumes.h btrfs: Remove btrfs_bio::flags member 2019-12-17 20:34:48 +01:00
xattr.c Btrfs: use nofs context when initializing security xattrs to avoid deadlock 2019-01-16 22:04:37 +01:00
xattr.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
zlib.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
zstd.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00