Commit graph

1355 commits

Author SHA1 Message Date
Qu Wenruo
5958253cf6 btrfs: qgroup: catch reserved space leaks at unmount time
Before this patch, qgroup completely relies on per-inode extent io tree
to detect reserved data space leak.

However previous bug has already shown how release page before
btrfs_finish_ordered_io() could lead to leak, and since it's
QGROUP_RESERVED bit cleared without triggering qgroup rsv, it can't be
detected by per-inode extent io tree.

So this patch adds another (and hopefully the final) safety net to catch
qgroup data reserved space leak.  At least the new safety net catches
all the leaks during development, so it should be pretty useful in the
real world.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:24 +02:00
Linus Torvalds
aa27b32b76 for-5.8-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl8EdTkACgkQxWXV+ddt
 WDv6xA/9Hguo/k6oj/7Nl9n3UUZ7gp44R/jy37fhMuNcwuEDuqIEfAgGXupdJVaj
 pYDorUMRUQfI2yLB1iHAnPgBMKBidSroDsdrRHKuimnhABSO2/KX/KXPianIIRGi
 wPvqZR04L565LNpRlDQx7OYkJWey7b6xf47UZqDglivnKY1OwCJlXgfCj/9FApr0
 Y+PVlgEU78ExTeAHs/h8ofZ/f5T2eqiluBSFVykzCg1NngaQVOKpN3gnWEatUAvM
 ekm6U4E1ZR9oOprdhlf6V96ztGzVTRKB1vFIeCvJLqLNIe+0pxlRfRn2aOj8vzEO
 DRjgOlhyAIgypp78SwCspjhvejvVneSFdEGSVvHOw1ombB//OJ1qBb5G/lIcwCj3
 PZ3OnQJV7+/Ty7Xt/X26W841zvnu90K0di0CsOPehtbkgkR4txgHCJB9mSlsMugN
 awN5Ryy1rw1cAM5GspXG9EEOvJmnSizQf4BcK649IG5eUKThYYLc5mp68jiMljs0
 NHFPg5P4yTRjk7Yqgxq5VvTPLLJo5j5xxqtY/1zDWuguRa40wIoy/JUJaJoPg9Vd
 221/qRG4R4xGyZXGx6XTiWK+M3qjTlS9My9tGoWygwlExRkr7Uli9Ikef3U0tBoF
 bjTcfCNOuCp+JECHNcnMZ9fhhFaMwIL1V4OflB1iicBAtXxo8Lk=
 =+4BZ
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - regression fix of a leak in global block reserve accounting

 - fix a (hard to hit) race of readahead vs releasepage that could lead
   to crash

 - convert all remaining uses of comment fall through annotations to the
   pseudo keyword

 - fix crash when mounting a fuzzed image with -o recovery

* tag 'for-5.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: reset tree root pointer after error in init_tree_roots
  btrfs: fix reclaim_size counter leak after stealing from global reserve
  btrfs: fix fatal extent_buffer readahead vs releasepage race
  btrfs: convert comments to fallthrough annotations
2020-07-07 14:10:33 -07:00
Josef Bacik
0465337c55 btrfs: reset tree root pointer after error in init_tree_roots
Eric reported an issue where mounting -o recovery with a fuzzed fs
resulted in a kernel panic.  This is because we tried to free the tree
node, except it was an error from the read.  Fix this by properly
resetting the tree_root->node == NULL in this case.  The panic was the
following

  BTRFS warning (device loop0): failed to read tree root
  BUG: kernel NULL pointer dereference, address: 000000000000001f
  RIP: 0010:free_extent_buffer+0xe/0x90 [btrfs]
  Call Trace:
   free_root_extent_buffers.part.0+0x11/0x30 [btrfs]
   free_root_pointers+0x1a/0xa2 [btrfs]
   open_ctree+0x1776/0x18a5 [btrfs]
   btrfs_mount_root.cold+0x13/0xfa [btrfs]
   ? selinux_fs_context_parse_param+0x37/0x80
   legacy_get_tree+0x27/0x40
   vfs_get_tree+0x25/0xb0
   fc_mount+0xe/0x30
   vfs_kern_mount.part.0+0x71/0x90
   btrfs_mount+0x147/0x3e0 [btrfs]
   ? cred_has_capability+0x7c/0x120
   ? legacy_get_tree+0x27/0x40
   legacy_get_tree+0x27/0x40
   vfs_get_tree+0x25/0xb0
   do_mount+0x735/0xa40
   __x64_sys_mount+0x8e/0xd0
   do_syscall_64+0x4d/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

Nik says: this is problematic only if we fail on the last iteration of
the loop as this results in init_tree_roots returning err value with
tree_root->node = -ERR. Subsequently the caller does: fail_tree_roots
which calls free_root_pointers on the bogus value.

Reported-by: Eric Sandeen <sandeen@redhat.com>
Fixes: b8522a1e5f ("btrfs: Factor out tree roots initialization during mount")
CC: stable@vger.kernel.org # 5.5+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add details how the pointer gets dereferenced ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-02 10:27:12 +02:00
Linus Torvalds
f3cdc8ae11 for-5.8-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl7U50AACgkQxWXV+ddt
 WDtK1g//RXeNsTguYQr1N9R5eUPThjLEI0+4J0l4SYfCPU8Ou3C7nqpOEJJQgm8F
 ezZE+16cWi9U5uGueOc+w0rfyz4AuIXKgzoz+c0/GG2+yV5jp6DsAMbWqojAb96L
 V/N3HxEzR66jqwgVUBE/x5okb2SyY7//B1l/O0amc66XDO7KTMImpIwThere6zWZ
 o2SNpYpHAPQeUYJQx8h+FAW3w1CxrCZmnifazU9Jqe9J7QeQLg7rbUlJDV38jySm
 ZOA8ohKN9U1gPZy+dTU3kdyyuBIq1etkIaSPJANyTo5TczPKiC0IMg75cXtS4ae/
 NSxhccMpSIjVMcIHARzSFGYKNP3sGNRsmaTUg/2Cx/9GoHOhYMiCAVc8qtBBpwJO
 UI0siexrCe64RuTBMRRc128GdFv7IjmSImcdi8xaR62bCcUiNdEa3zvjRe/9tOEH
 ET7Z85oBnKpSzpC3MdhSUU4dtHY5XLawP8z3oUU1VSzSWM2DVjlHf79/VzbOfp18
 miCVpt94lCn/gUX7el6qcnbuvMAjDyeC6HmfD+TwzQgGwyV6TLgKN9lRXeH/Oy6/
 VgjGQSavGHMll3zIGURmrBCXKudjJg0J+IP4wN1TimmSEMfwKH+7tnekQd8y5qlF
 eXEIqlWNykKeDzEnmV9QJy+/cV83hVWM/mUslcTx39tLN/3B/Us=
 =qTt8
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Highlights:

   - speedup dead root detection during orphan cleanup, eg. when there
     are many deleted subvolumes waiting to be cleaned, the trees are
     now looked up in radix tree instead of a O(N^2) search

   - snapshot creation with inherited qgroup will mark the qgroup
     inconsistent, requires a rescan

   - send will emit file capabilities after chown, this produces a
     stream that does not need postprocessing to set the capabilities
     again

   - direct io ported to iomap infrastructure, cleaned up and simplified
     code, notably removing last use of struct buffer_head in btrfs code

  Core changes:

   - factor out backreference iteration, to be used by ordinary
     backreferences and relocation code

   - improved global block reserve utilization
      * better logic to serialize requests
      * increased maximum available for unlink
      * improved handling on large pages (64K)

   - direct io cleanups and fixes
      * simplify layering, where cloned bios were unnecessarily created
        for some cases
      * error handling fixes (submit, endio)
      * remove repair worker thread, used to avoid deadlocks during
        repair

   - refactored block group reading code, preparatory work for new type
     of block group storage that should improve mount time on large
     filesystems

  Cleanups:

   - cleaned up (and slightly sped up) set/get helpers for metadata data
     structure members

   - root bit REF_COWS got renamed to SHAREABLE to reflect the that the
     blocks of the tree get shared either among subvolumes or with the
     relocation trees

  Fixes:

   - when subvolume deletion fails due to ENOSPC, the filesystem is not
     turned read-only

   - device scan deals with devices from other filesystems that changed
     ownership due to overwrite (mkfs)

   - fix a race between scrub and block group removal/allocation

   - fix long standing bug of a runaway balance operation, printing the
     same line to the syslog, caused by a stale status bit on a reloc
     tree that prevented progress

   - fix corrupt log due to concurrent fsync of inodes with shared
     extents

   - fix space underflow for NODATACOW and buffered writes when it for
     some reason needs to fallback to COW mode"

* tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (133 commits)
  btrfs: fix space_info bytes_may_use underflow during space cache writeout
  btrfs: fix space_info bytes_may_use underflow after nocow buffered write
  btrfs: fix wrong file range cleanup after an error filling dealloc range
  btrfs: remove redundant local variable in read_block_for_search
  btrfs: open code key_search
  btrfs: split btrfs_direct_IO to read and write part
  btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
  fs: remove dio_end_io()
  btrfs: switch to iomap_dio_rw() for dio
  iomap: remove lockdep_assert_held()
  iomap: add a filesystem hook for direct I/O bio submission
  fs: export generic_file_buffered_read()
  btrfs: turn space cache writeout failure messages into debug messages
  btrfs: include error on messages about failure to write space/inode caches
  btrfs: remove useless 'fail_unlock' label from btrfs_csum_file_blocks()
  btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums
  btrfs: make checksum item extension more efficient
  btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents
  btrfs: unexport btrfs_compress_set_level()
  btrfs: simplify iget helpers
  ...
2020-06-02 19:59:25 -07:00
Guoqing Jiang
d1b89bc042 btrfs: use attach/detach_page_private
Since the new pair function is introduced, we can call them to clean the
code in btrfs.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Link: http://lkml.kernel.org/r/20200517214718.468-4-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Filipe Manana
e289f03ea7 btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents
When we have extents shared amongst different inodes in the same subvolume,
if we fsync them in parallel we can end up with checksum items in the log
tree that represent ranges which overlap.

For example, consider we have inodes A and B, both sharing an extent that
covers the logical range from X to X + 64KiB:

1) Task A starts an fsync on inode A;

2) Task B starts an fsync on inode B;

3) Task A calls btrfs_csum_file_blocks(), and the first search in the
   log tree, through btrfs_lookup_csum(), returns -EFBIG because it
   finds an existing checksum item that covers the range from X - 64KiB
   to X;

4) Task A checks that the checksum item has not reached the maximum
   possible size (MAX_CSUM_ITEMS) and then releases the search path
   before it does another path search for insertion (through a direct
   call to btrfs_search_slot());

5) As soon as task A releases the path and before it does the search
   for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG
   too, because there is an existing checksum item that has an end
   offset that matches the start offset (X) of the checksum range we want
   to log;

6) Task B releases the path;

7) Task A does the path search for insertion (through btrfs_search_slot())
   and then verifies that the checksum item that ends at offset X still
   exists and extends its size to insert the checksums for the range from
   X to X + 64KiB;

8) Task A releases the path and returns from btrfs_csum_file_blocks(),
   having inserted the checksums into an existing checksum item that got
   its size extended. At this point we have one checksum item in the log
   tree that covers the logical range from X - 64KiB to X + 64KiB;

9) Task B now does a search for insertion using btrfs_search_slot() too,
   but it finds that the previous checksum item no longer ends at the
   offset X, it now ends at an of offset X + 64KiB, so it leaves that item
   untouched.

   Then it releases the path and calls btrfs_insert_empty_item()
   that inserts a checksum item with a key offset corresponding to X and
   a size for inserting a single checksum (4 bytes in case of crc32c).
   Subsequent iterations end up extending this new checksum item so that
   it contains the checksums for the range from X to X + 64KiB.

   So after task B returns from btrfs_csum_file_blocks() we end up with
   two checksum items in the log tree that have overlapping ranges, one
   for the range from X - 64KiB to X + 64KiB, and another for the range
   from X to X + 64KiB.

Having checksum items that represent ranges which overlap, regardless of
being in the log tree or in the chekcsums tree, can lead to problems where
checksums for a file range end up not being found. This type of problem
has happened a few times in the past and the following commits fixed them
and explain in detail why having checksum items with overlapping ranges is
problematic:

  27b9a8122f "Btrfs: fix csum tree corruption, duplicate and outdated checksums"
  b84b8390d6 "Btrfs: fix file read corruption after extent cloning and fsync"
  40e046acbd "Btrfs: fix missing data checksums after replaying a log tree"

Since this specific instance of the problem can only happen when logging
inodes, because it is the only case where concurrent attempts to insert
checksums for the same range can happen, fix the issue by using an extent
io tree as a range lock to serialize checksum insertion during inode
logging.

This issue could often be reproduced by the test case generic/457 from
fstests. When it happens it produces the following trace:

 BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item
 BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610
 BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884
      item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4
      item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4
      item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4
      item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4
      item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4
      item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4
 (...)
 BTRFS error (device dm-0): block=30625792 write time tree block corruption detected
 ------------[ cut here ]------------
 WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs]
 Modules linked in: btrfs dm_thin_pool ...
 CPU: 1 PID: 15884 Comm: fsx Tainted: G        W         5.6.0-rc7-btrfs-next-58 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
 RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs]
 Code: c7 c7 ...
 RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296
 RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001
 RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001
 R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0
 R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000
 FS:  00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  btree_submit_bio_hook+0x67/0xc0 [btrfs]
  submit_one_bio+0x31/0x50 [btrfs]
  btree_write_cache_pages+0x2db/0x4b0 [btrfs]
  ? __filemap_fdatawrite_range+0xb1/0x110
  do_writepages+0x23/0x80
  __filemap_fdatawrite_range+0xd2/0x110
  btrfs_write_marked_extents+0x15e/0x180 [btrfs]
  btrfs_sync_log+0x206/0x10a0 [btrfs]
  ? kmem_cache_free+0x315/0x3b0
  ? btrfs_log_inode+0x1e8/0xf90 [btrfs]
  ? __mutex_unlock_slowpath+0x45/0x2a0
  ? lockref_put_or_lock+0x9/0x30
  ? dput+0x2d/0x580
  ? dput+0xb5/0x580
  ? btrfs_sync_file+0x464/0x4d0 [btrfs]
  btrfs_sync_file+0x464/0x4d0 [btrfs]
  do_fsync+0x38/0x60
  __x64_sys_fsync+0x10/0x20
  do_syscall_64+0x5c/0x280
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 RIP: 0033:0x7fb41953a6d0
 Code: 48 3d ...
 RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
 RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0
 RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003
 RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009
 R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060
 R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420
 irq event stamp: 0
 hardirqs last  enabled at (0): [<0000000000000000>] 0x0
 hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
 softirqs last  enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
 softirqs last disabled at (0): [<0000000000000000>] 0x0
 ---[ end trace d543fc76f5ad7fd8 ]---

In that trace the tree checker detected the overlapping checksum items at
the time when we triggered writeback for the log tree when syncing the
log.

Another trace that can happen is due to BUG_ON() when deleting checksum
items while logging an inode:

 BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584)
 BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610
 BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473
  item 0 key (257 1 0) itemoff 16123 itemsize 160
          inode generation 7 size 262144 mode 100600
  item 1 key (257 12 256) itemoff 16103 itemsize 20
  item 2 key (257 108 0) itemoff 16050 itemsize 53
          extent data disk bytenr 13631488 nr 4096
          extent data offset 0 nr 131072 ram 131072
 (...)
 ------------[ cut here ]------------
 kernel BUG at fs/btrfs/ctree.c:3153!
 invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
 CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
 RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs]
 Code: 0f b6 ...
 RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282
 RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001
 RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001
 R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08
 R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace
 FS:  00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  btrfs_del_csums+0x2f4/0x540 [btrfs]
  copy_items+0x4b5/0x560 [btrfs]
  btrfs_log_inode+0x910/0xf90 [btrfs]
  btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
  ? dget_parent+0x5/0x370
  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
  btrfs_sync_file+0x42b/0x4d0 [btrfs]
  __x64_sys_msync+0x199/0x200
  do_syscall_64+0x5c/0x280
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 RIP: 0033:0x7fe586c65760
 Code: 00 f7 ...
 RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
 RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760
 RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000
 RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61
 R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1
 R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420
 Modules linked in: dm_log_writes ...
 ---[ end trace c92a7f447a8515f5 ]---

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:37 +02:00
David Sterba
56e9357a1e btrfs: simplify root lookup by id
The main function to lookup a root by its id btrfs_get_fs_root takes the
whole key, while only using the objectid. The value of offset is preset
to (u64)-1 but not actually used until btrfs_find_root that does the
actual search.

Switch btrfs_get_fs_root to use only objectid and remove all local
variables that existed just for the lookup. The actual key for search is
set up in btrfs_get_fs_root, reusing another key variable.

Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:36 +02:00
Qu Wenruo
1dae7e0e58 btrfs: reloc: clear DEAD_RELOC_TREE bit for orphan roots to prevent runaway balance
[BUG]
There are several reported runaway balance, that balance is flooding the
log with "found X extents" where the X never changes.

[CAUSE]
Commit d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after
merge_reloc_roots") introduced BTRFS_ROOT_DEAD_RELOC_TREE bit to
indicate that one subvolume has finished its tree blocks swap with its
reloc tree.

However if balance is canceled or hits ENOSPC halfway, we didn't clear
the BTRFS_ROOT_DEAD_RELOC_TREE bit, leaving that bit hanging forever
until unmount.

Any subvolume root with that bit, would cause backref cache to skip this
tree block, as it has finished its tree block swap.  This would cause
all tree blocks of that root be ignored by balance, leading to runaway
balance.

[FIX]
Fix the problem by also clearing the BTRFS_ROOT_DEAD_RELOC_TREE bit for
the original subvolume of orphan reloc root.

Add an umount check for the stale bit still set.

Fixes: d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:36 +02:00
Qu Wenruo
aeb935a455 btrfs: don't set SHAREABLE flag for data reloc tree
SHAREABLE flag is set for subvolumes because users can create snapshot
for subvolumes, thus sharing tree blocks of them.

But data reloc tree is not exposed to user space, as it's only an
internal tree for data relocation, thus it doesn't need the full path
replacement handling at all.

This patch will make data reloc tree a non-shareable tree, and add
btrfs_fs_info::data_reloc_root for data reloc tree, so relocation code
can grab it from fs_info directly.

This would slightly improve tree relocation, as now data reloc tree
can go through regular COW routine to get relocated, without bothering
the complex tree reloc tree routine.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:35 +02:00
Qu Wenruo
92a7cc4252 btrfs: rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE
The name BTRFS_ROOT_REF_COWS is not very clear about the meaning.

In fact, that bit can only be set to those trees:

- Subvolume roots
- Data reloc root
- Reloc roots for above roots

All other trees won't get this bit set.  So just by the result, it is
obvious that, roots with this bit set can have tree blocks shared with
other trees.  Either shared by snapshots, or by reloc roots (an special
snapshot created by relocation).

This patch will rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to
make it easier to understand, and update all comment mentioning
"reference counted" to follow the rename.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:35 +02:00
Eric Biggers
fd08001f17 btrfs: use crypto_shash_digest() instead of open coding
Use crypto_shash_digest() instead of crypto_shash_init() +
crypto_shash_update() + crypto_shash_final().  This is more efficient.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:27 +02:00
Omar Sandoval
5c047a699a btrfs: get rid of endio_repair_workers
This was originally added in commit 8b110e393c ("Btrfs: implement
repair function when direct read fails") to avoid a deadlock. In that
commit, the direct I/O read endio executes on the endio_workers
workqueue, submits a repair bio, and waits for it to complete. The
repair bio endio must execute on a different workqueue, otherwise it
could block on the endio_workers workqueue becoming available, which
won't happen because the original endio is blocked on the repair bio.

As of the previous commit, the original endio doesn't wait for the
repair bio, so this separate workqueue is unnecessary.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:27 +02:00
Josef Bacik
ef67963dac btrfs: drop logs when we've aborted a transaction
Dave reported a problem where we were panicing with generic/475 with
misc-5.7.  This is because we were doing IO after we had stopped all of
the worker threads, because we do the log tree cleanup on roots at drop
time.  Cleaning up the log tree will always need to do reads if we
happened to have evicted the blocks from memory.

Because of this simply add a helper to btrfs_cleanup_transaction() that
will go through and drop all of the log roots.  This gets run before we
do the close_ctree() work, and thus we are allowed to do any reads that
we would need.  I ran this through many iterations of generic/475 with
constrained memory and I did not see the issue.

  general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  CPU: 2 PID: 12359 Comm: umount Tainted: G        W 5.6.0-rc7-btrfs-next-58 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_queue_work+0x33/0x1c0 [btrfs]
  RSP: 0018:ffff9cfb015937d8 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff8eb5e339ed80 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffff8eb5eb33b770 RDI: ffff8eb5e37a0460
  RBP: ffff8eb5eb33b770 R08: 000000000000020c R09: ffffffff9fc09ac0
  R10: 0000000000000007 R11: 0000000000000000 R12: 6b6b6b6b6b6b6b6b
  R13: ffff9cfb00229040 R14: 0000000000000008 R15: ffff8eb5d3868000
  FS:  00007f167ea022c0(0000) GS:ffff8eb5fae00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f167e5e0cb1 CR3: 0000000138c18004 CR4: 00000000003606e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   btrfs_end_bio+0x81/0x130 [btrfs]
   __split_and_process_bio+0xaf/0x4e0 [dm_mod]
   ? percpu_counter_add_batch+0xa3/0x120
   dm_process_bio+0x98/0x290 [dm_mod]
   ? generic_make_request+0xfb/0x410
   dm_make_request+0x4d/0x120 [dm_mod]
   ? generic_make_request+0xfb/0x410
   generic_make_request+0x12a/0x410
   ? submit_bio+0x38/0x160
   submit_bio+0x38/0x160
   ? percpu_counter_add_batch+0xa3/0x120
   btrfs_map_bio+0x289/0x570 [btrfs]
   ? kmem_cache_alloc+0x24d/0x300
   btree_submit_bio_hook+0x79/0xc0 [btrfs]
   submit_one_bio+0x31/0x50 [btrfs]
   read_extent_buffer_pages+0x2fe/0x450 [btrfs]
   btree_read_extent_buffer_pages+0x7e/0x170 [btrfs]
   walk_down_log_tree+0x343/0x690 [btrfs]
   ? walk_log_tree+0x3d/0x380 [btrfs]
   walk_log_tree+0xf7/0x380 [btrfs]
   ? plist_requeue+0xf0/0xf0
   ? delete_node+0x4b/0x230
   free_log_tree+0x4c/0x130 [btrfs]
   ? wait_log_commit+0x140/0x140 [btrfs]
   btrfs_free_log+0x17/0x30 [btrfs]
   btrfs_drop_and_free_fs_root+0xb0/0xd0 [btrfs]
   btrfs_free_fs_roots+0x10c/0x190 [btrfs]
   ? do_raw_spin_unlock+0x49/0xc0
   ? _raw_spin_unlock+0x29/0x40
   ? release_extent_buffer+0x121/0x170 [btrfs]
   close_ctree+0x289/0x2e6 [btrfs]
   generic_shutdown_super+0x6c/0x110
   kill_anon_super+0xe/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x3a/0x70

Reported-by: David Sterba <dsterba@suse.com>
Fixes: 8c38938c7b ("btrfs: move the root freeing stuff into btrfs_put_root")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-04-23 17:23:11 +02:00
Josef Bacik
c75e839414 btrfs: kill the subvol_srcu
Now that we have proper root ref counting everywhere we can kill the
subvol_srcu.

* removal of fs_info::subvol_srcu reduces size of fs_info by 1176 bytes

* the refcount_t used for the references checks for accidental 0->1
  in cases where the root lifetime would not be properly protected

* there's a leak detector for roots to catch unfreed roots at umount
  time

* SRCU served us well over the years but is was not a proper
  synchronization mechanism for some cases

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:02:00 +01:00
Josef Bacik
efc3453494 btrfs: make btrfs_cleanup_fs_roots use the radix tree lock
The radix root is primarily protected by the fs_roots_radix_lock, so use
that to lookup and get a ref on all of our fs roots in
btrfs_cleanup_fs_roots. The tree reference is taken in the protected
section as before.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:59 +01:00
Josef Bacik
4785e24fa5 btrfs: don't take an extra root ref at allocation time
Now that all the users of roots take references for them we can drop the
extra root ref we've been taking.  Before we had roots at 2 refs for the
life of the file system, one for the radix tree, and one simply for
existing.  Now that we have proper ref accounting in all places that use
roots we can drop this extra ref simply for existing as we no longer
need it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:59 +01:00
Josef Bacik
dc9492c14c btrfs: hold a ref on the root on the dead roots list
At the point we add a root to the dead roots list we have no open inodes
for that root, so we need to hold a ref on that root to keep it from
disappearing.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:59 +01:00
Josef Bacik
5c8fd99fec btrfs: make inodes hold a ref on their roots
If we make sure all the inodes have refs on their root we don't have to
worry about the root disappearing while we have open inodes.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:59 +01:00
Josef Bacik
8c38938c7b btrfs: move the root freeing stuff into btrfs_put_root
There are a few different ways to free roots, either you allocated them
yourself and you just do

free_extent_buffer(root->node);
free_extent_buffer(root->commit_node);
btrfs_put_root(root);

Which is the pattern for log roots.  Or for snapshots/subvolumes that
are being dropped you simply call btrfs_free_fs_root() which does all
the cleanup for you.

Unify this all into btrfs_put_root(), so that we don't free up things
associated with the root until the last reference is dropped.  This
makes the root freeing code much more significant.

The only caveat is at close_ctree() time we have to free the extent
buffers for all of our main roots (extent_root, chunk_root, etc) because
we have to drop the btree_inode and we'll run into issues if we hold
onto those nodes until ->kill_sb() time.  This will be addressed in the
future when we kill the btree_inode.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:59 +01:00
Josef Bacik
0e996e7fcf btrfs: move ino_cache_inode dropping out of btrfs_free_fs_root
We are going to make root life be controlled soley by refcounting, and
inodes will be one of the things that hold a ref on the root.  This
means we need to handle dropping the ino_cache_inode outside of the root
freeing logic, so move it into btrfs_drop_and_free_fs_root() so it is
cleaned up properly on unmount.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:58 +01:00
Josef Bacik
3fd6372758 btrfs: make the extent buffer leak check per fs info
I'm going to make the entire destruction of btrfs_root's controlled by
their refcount, so it will be helpful to notice if we're leaking their
eb's on umount.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:58 +01:00
David Sterba
c67b38925b btrfs: return void from csum_tree_block
Now that csum_tree_block is not returning any errors, we can make
csum_tree_block return void and simplify callers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:52 +01:00
David Sterba
e9be5a303d btrfs: simplify tree block checksumming loop
Thw whole point of csum_tree_block is to iterate over all extent buffer
pages and pass it to checksumming functions. The bytes where checksum is
stored must be skipped, thus map_private_extent_buffer. This complicates
further offset calculations.

As the first page will be always present, checksum the relevant bytes
unconditionally and then do a simple iteration over the remaining pages.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:52 +01:00
Filipe Manana
f0cc2cd701 Btrfs: fix crash during unmount due to race with delayed inode workers
During unmount we can have a job from the delayed inode items work queue
still running, that can lead to at least two bad things:

1) A crash, because the worker can try to create a transaction just
   after the fs roots were freed;

2) A transaction leak, because the worker can create a transaction
   before the fs roots are freed and just after we committed the last
   transaction and after we stopped the transaction kthread.

A stack trace example of the crash:

 [79011.691214] kernel BUG at lib/radix-tree.c:982!
 [79011.692056] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
 [79011.693180] CPU: 3 PID: 1394 Comm: kworker/u8:2 Tainted: G        W         5.6.0-rc2-btrfs-next-54 #2
 (...)
 [79011.696789] Workqueue: btrfs-delayed-meta btrfs_work_helper [btrfs]
 [79011.697904] RIP: 0010:radix_tree_tag_set+0xe7/0x170
 (...)
 [79011.702014] RSP: 0018:ffffb3c84a317ca0 EFLAGS: 00010293
 [79011.702949] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
 [79011.704202] RDX: ffffb3c84a317cb0 RSI: ffffb3c84a317ca8 RDI: ffff8db3931340a0
 [79011.705463] RBP: 0000000000000005 R08: 0000000000000005 R09: ffffffff974629d0
 [79011.706756] R10: ffffb3c84a317bc0 R11: 0000000000000001 R12: ffff8db393134000
 [79011.708010] R13: ffff8db3931340a0 R14: ffff8db393134068 R15: 0000000000000001
 [79011.709270] FS:  0000000000000000(0000) GS:ffff8db3b6a00000(0000) knlGS:0000000000000000
 [79011.710699] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [79011.711710] CR2: 00007f22c2a0a000 CR3: 0000000232ad4005 CR4: 00000000003606e0
 [79011.712958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [79011.714205] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [79011.715448] Call Trace:
 [79011.715925]  record_root_in_trans+0x72/0xf0 [btrfs]
 [79011.716819]  btrfs_record_root_in_trans+0x4b/0x70 [btrfs]
 [79011.717925]  start_transaction+0xdd/0x5c0 [btrfs]
 [79011.718829]  btrfs_async_run_delayed_root+0x17e/0x2b0 [btrfs]
 [79011.719915]  btrfs_work_helper+0xaa/0x720 [btrfs]
 [79011.720773]  process_one_work+0x26d/0x6a0
 [79011.721497]  worker_thread+0x4f/0x3e0
 [79011.722153]  ? process_one_work+0x6a0/0x6a0
 [79011.722901]  kthread+0x103/0x140
 [79011.723481]  ? kthread_create_worker_on_cpu+0x70/0x70
 [79011.724379]  ret_from_fork+0x3a/0x50
 (...)

The following diagram shows a sequence of steps that lead to the crash
during ummount of the filesystem:

        CPU 1                                             CPU 2                                CPU 3

 btrfs_punch_hole()
   btrfs_btree_balance_dirty()
     btrfs_balance_delayed_items()
       --> sees
           fs_info->delayed_root->items
           with value 200, which is greater
           than
           BTRFS_DELAYED_BACKGROUND (128)
           and smaller than
           BTRFS_DELAYED_WRITEBACK (512)
       btrfs_wq_run_delayed_node()
         --> queues a job for
             fs_info->delayed_workers to run
             btrfs_async_run_delayed_root()

                                                                                            btrfs_async_run_delayed_root()
                                                                                              --> job queued by CPU 1

                                                                                              --> starts picking and running
                                                                                                  delayed nodes from the
                                                                                                  prepare_list list

                                                 close_ctree()

                                                   btrfs_delete_unused_bgs()

                                                   btrfs_commit_super()

                                                     btrfs_join_transaction()
                                                       --> gets transaction N

                                                     btrfs_commit_transaction(N)
                                                       --> set transaction state
                                                        to TRANTS_STATE_COMMIT_START

                                                                                             btrfs_first_prepared_delayed_node()
                                                                                               --> picks delayed node X through
                                                                                                   the prepared_list list

                                                       btrfs_run_delayed_items()

                                                         btrfs_first_delayed_node()
                                                           --> also picks delayed node X
                                                               but through the node_list
                                                               list

                                                         __btrfs_commit_inode_delayed_items()
                                                            --> runs all delayed items from
                                                                this node and drops the
                                                                node's item count to 0
                                                                through call to
                                                                btrfs_release_delayed_inode()

                                                         --> finishes running any remaining
                                                             delayed nodes

                                                       --> finishes transaction commit

                                                   --> stops cleaner and transaction threads

                                                   btrfs_free_fs_roots()
                                                     --> frees all roots and removes them
                                                         from the radix tree
                                                         fs_info->fs_roots_radix

                                                                                             btrfs_join_transaction()
                                                                                               start_transaction()
                                                                                                 btrfs_record_root_in_trans()
                                                                                                   record_root_in_trans()
                                                                                                     radix_tree_tag_set()
                                                                                                       --> crashes because
                                                                                                           the root is not in
                                                                                                           the radix tree
                                                                                                           anymore

If the worker is able to call btrfs_join_transaction() before the unmount
task frees the fs roots, we end up leaking a transaction and all its
resources, since after the call to btrfs_commit_super() and stopping the
transaction kthread, we don't expect to have any transaction open anymore.

When this situation happens the worker has a delayed node that has no
more items to run, since the task calling btrfs_run_delayed_items(),
which is doing a transaction commit, picks the same node and runs all
its items first.

We can not wait for the worker to complete when running delayed items
through btrfs_run_delayed_items(), because we call that function in
several phases of a transaction commit, and that could cause a deadlock
because the worker calls btrfs_join_transaction() and the task doing the
transaction commit may have already set the transaction state to
TRANS_STATE_COMMIT_DOING.

Also it's not possible to get into a situation where only some of the
items of a delayed node are added to the fs/subvolume tree in the current
transaction and the remaining ones in the next transaction, because when
running the items of a delayed inode we lock its mutex, effectively
waiting for the worker if the worker is running the items of the delayed
node already.

Since this can only cause issues when unmounting a filesystem, fix it in
a simple way by waiting for any jobs on the delayed workers queue before
calling btrfs_commit_supper() at close_ctree(). This works because at this
point no one can call btrfs_btree_balance_dirty() or
btrfs_balance_delayed_items(), and if we end up waiting for any worker to
complete, btrfs_commit_super() will commit the transaction created by the
worker.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:51 +01:00
Andy Shevchenko
807fc790aa btrfs: switch to use new generic UUID API
There are new types and helpers that are supposed to be used in new code.

As a preparation to get rid of legacy types and API functions do
the conversion here.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:47 +01:00
David Sterba
15b6e8a83e btrfs: reduce pointer intdirections in btree_readpage_end_io_hook
All we need to read is checksum size from fs_info superblock, and
fs_info is provided by extent buffer so we can get rid of the wild
pointer indirections from page/inode/root.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:45 +01:00
David Sterba
b79ce3dddd btrfs: adjust delayed refs message level
The message seems to be for debugging and has little value for users.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:45 +01:00
David Sterba
c4ac754198 btrfs: open code trivial helper btrfs_header_chunk_tree_uuid
The helper btrfs_header_chunk_tree_uuid follows naming convention of
other struct accessors but does something compeletly different. As the
offsetof calculation is clear in the context of extent buffer operations
we can remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:44 +01:00
David Sterba
9a8658e33d btrfs: open code trivial helper btrfs_header_fsid
The helper btrfs_header_fsid follows naming convention of other struct
accessors but does something compeletly different. As the offsetof
calculation is clear in the context of extent buffer operations we can
remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:44 +01:00
Nikolay Borisov
dcc3eb9638 btrfs: convert snapshot/nocow exlcusion to drew lock
This patch removes all haphazard code implementing nocow writers
exclusion from pending snapshot creation and switches to using the drew
lock to ensure this invariant still holds.

'Readers' are snapshot creators from create_snapshot and 'writers' are
nocow writers from buffered write path or btrfs_setsize. This locking
scheme allows for multiple snapshots to happen while any nocow writers
are blocked, since writes to page cache in the nocow path will make
snapshots inconsistent.

So for performance reasons we'd like to have the ability to run multiple
concurrent snapshots and also favors readers in this case. And in case
there aren't pending snapshots (which will be the majority of the cases)
we rely on the percpu's writers counter to avoid cacheline contention.

The main gain from using the drew lock is it's now a lot easier to
reason about the guarantees of the locking scheme and whether there is
some silent breakage lurking.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:44 +01:00
Josef Bacik
75ec1db871 btrfs: set update the uuid generation as soon as possible
In my EIO stress testing I noticed I was getting forced to rescan the
uuid tree pretty often, which was weird.  This is because my error
injection stuff would sometimes inject an error after log replay but
before we loaded the UUID tree.  If log replay committed the transaction
it wouldn't have updated the uuid tree generation, but the tree was
valid and didn't change, so there's no reason to not update the
generation here.

Fix this by setting the BTRFS_FS_UPDATE_UUID_TREE_GEN bit immediately
after reading all the fs roots if the uuid tree generation matches the
fs generation.  Then any transaction commits that happen during mount
won't screw up our uuid tree state, forcing us to do needless uuid
rescans.

Fixes: 70f8017547 ("Btrfs: check UUID tree during mount if required")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:41 +01:00
Josef Bacik
c94bec2c61 btrfs: bail out of uuid tree scanning if we're closing
In doing my fsstress+EIO stress testing I started running into issues
where umount would get stuck forever because the uuid checker was
chewing through the thousands of subvolumes I had created.

We shouldn't block umount on this, simply bail if we're unmounting the
fs.  We need to make sure we don't mark the UUID tree as ok, so we only
set that bit if we made it through the whole rescan operation, but
otherwise this is completely safe.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:41 +01:00
Nikolay Borisov
97f4dd09da btrfs: make btrfs_check_uuid_tree private to disk-io.c
It's used only during filesystem mount as such it can be made private to
disk-io.c file. Also use the occasion to move btrfs_uuid_rescan_kthread
as btrfs_check_uuid_tree is its sole caller.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:41 +01:00
Johannes Thumshirn
314b6dd0ee btrfs: use bios instead of buffer_heads from super block writeout
Similar to the superblock read path, change the write path to using bios
and pages instead of buffer_heads. This allows us to skip over the
buffer_head code, for writing the superblock to disk.

This is based on a patch originally authored by Nikolay Borisov.

Co-developed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:39 +01:00
Johannes Thumshirn
8f32380d3f btrfs: use the page cache for super block reading
Super-block reading in BTRFS is done using buffer_heads. Buffer_heads
have some drawbacks, like not being able to propagate errors from the
lower layers.

Directly use the page cache for reading the super blocks from disk or
invalidating an on-disk super block. We have to use the page cache so to
avoid races between mkfs and udev. See also 6f60cbd3ae ("btrfs: access
superblock via pagecache in scan_one_device").

This patch unwraps the buffer head API and does not change the way the
super block is actually read.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:39 +01:00
Nikolay Borisov
fe119a6eeb btrfs: switch to per-transaction pinned extents
This commit flips the switch to start tracking/processing pinned extents
on a per-transaction basis. It mostly replaces all references from
btrfs_fs_info::(pinned_extents|freed_extents[]) to
btrfs_transaction::pinned_extents.

Two notable modifications that warrant explicit mention are changing
clean_pinned_extents to get a reference to the previously running
transaction. The other one is removal of call to
btrfs_destroy_pinned_extent since transactions are going to be cleaned
in btrfs_cleanup_one_transaction.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:38 +01:00
Nikolay Borisov
f603bb94ab btrfs: Perform pinned cleanup directly in btrfs_destroy_delayed_refs
Having btrfs_destroy_delayed_refs call btrfs_pin_extent is problematic
for making pinned extents tracking per-transaction since
btrfs_trans_handle cannot be passed to btrfs_pin_extent in this context.
Additionally delayed refs heads pinned in btrfs_destroy_delayed_refs
are going to be handled very closely, in btrfs_destroy_pinned_extent.

To enable btrfs_pin_extent to take btrfs_trans_handle simply open code
it in btrfs_destroy_delayed_refs and call btrfs_error_unpin_extent_range
on the range. This enables us to do less work in
btrfs_destroy_pinned_extent and leaves btrfs_pin_extent being called in
contexts which have a valid btrfs_trans_handle.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:36 +01:00
David Sterba
71ad38b44e btrfs: sink argument tree to extent_read_full_page
The tree pointer can be safely read from the page's inode, use it and
drop the redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:35 +01:00
Josef Bacik
0024652895 btrfs: rename btrfs_put_fs_root and btrfs_grab_fs_root
We are now using these for all roots, rename them to btrfs_put_root()
and btrfs_grab_root();

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:33 +01:00
Josef Bacik
bd647ce385 btrfs: add a leak check for roots
Now that we're going to start relying on getting ref counting right for
roots, add a list to track allocated roots and print out any roots that
aren't freed up at free_fs_info time.

Hide this behind CONFIG_BTRFS_DEBUG because this will just be used for
developers to verify they aren't breaking things.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:33 +01:00
Josef Bacik
8260edba67 btrfs: make the init of static elements in fs_info separate
In adding things like eb leak checking and root leak checking there were
a lot of weird corner cases that come from the fact that

  1) We do not init the fs_info until we get to open_ctree time in the
     normal case and

  2) The test infrastructure half-init's the fs_info for things that it
     needs.

This makes it really annoying to make changes because you have to add
init in two different places, have special cases for testing fs_info's
that may not have certain things initialized, and cases for fs_info's
that didn't make it to open_ctree and thus are not fully set up.

Fix this by extracting out the non-allocating init of the fs info into
it's own public function and use that to make sure we're all getting
consistent views of an allocated fs_info.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:33 +01:00
Josef Bacik
ae18c37ad5 btrfs: move fs_info init work into it's own helper function
open_ctree mixes initialization of fs stuff and fs_info stuff, which
makes it confusing when doing things like adding the root leak
detection.  Make a separate function that inits all the static
structures inside of the fs_info needed for the fs to operate, and then
call that before we start setting up the fs_info to be mounted.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:33 +01:00
Josef Bacik
141386e1a5 btrfs: free more things in btrfs_free_fs_info
Things like the percpu_counters, the mapping_tree, and the csum hash can
all be freed at btrfs_free_fs_info time, since the helpers all check if
the structure has been initialized already.  This significantly cleans
up the error cases in open_ctree.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:32 +01:00
Josef Bacik
bc44d7c4b2 btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root
Now that all callers of btrfs_get_fs_root are subsequently calling
btrfs_grab_fs_root and handling dropping the ref when they are done
appropriately, go ahead and push btrfs_grab_fs_root up into
btrfs_get_fs_root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:32 +01:00
Josef Bacik
81f096edf0 btrfs: use btrfs_put_fs_root to free roots always
If we are going to track leaked roots we need to free them all the same
way, so don't kfree() roots directly, use btrfs_put_fs_root.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:32 +01:00
Josef Bacik
4c78e9f596 btrfs: hold a ref on the root in open_ctree
We lookup the fs_root and put it in our fs_info directly, we should hold
a ref on this root for the lifetime of the fs_info.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:32 +01:00
Josef Bacik
0d4b046301 btrfs: export and rename free_fs_info
We're going to start freeing roots and doing other complicated things in
free_fs_info, so we need to move it to disk-io.c and export it in order
to use things lik btrfs_put_fs_root().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:32 +01:00
Josef Bacik
af01d2e53f btrfs: hold a ref on fs roots while they're in the radix tree
If the root is sitting in the radix tree, we should probably have a ref
for the radix tree.  Grab a ref on the root when we insert it, and drop
it when it gets deleted.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:27 +01:00
Josef Bacik
a98db0f304 btrfs: make the fs root init functions static
Now that the orphan cleanup stuff doesn't use this directly we can just
make them static.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:26 +01:00
Josef Bacik
3619c94f07 btrfs: open code btrfs_read_fs_root_no_name
All this does is call btrfs_get_fs_root() with check_ref == true.  Just
use btrfs_get_fs_root() so we don't have a bunch of different helpers
that do the same thing.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:26 +01:00