Commit graph

1113 commits

Author SHA1 Message Date
Omar Sandoval
1ec2bf44c3 Btrfs: fix missing delayed iputs on unmount
[ Upstream commit d6fd0ae25c ]

There's a race between close_ctree() and cleaner_kthread().
close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
sees it set, but this is racy; the cleaner might have already checked
the bit and could be cleaning stuff. In particular, if it deletes unused
block groups, it will create delayed iputs for the free space cache
inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
longer running delayed iputs after a commit. Therefore, if the cleaner
creates more delayed iputs after delayed iputs are run in
btrfs_commit_super(), we will leak inodes on unmount and get a busy
inode crash from the VFS.

Fix it by parking the cleaner before we actually close anything. Then,
any remaining delayed iputs will always be handled in
btrfs_commit_super(). This also ensures that the commit in close_ctree()
is really the last commit, so we can get rid of the commit in
cleaner_kthread().

The fstest/generic/475 followed by 476 can trigger a crash that
manifests as a slab corruption caused by accessing the freed kthread
structure by a wake up function. Sample trace:

[ 5657.077612] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
[ 5657.079432] PGD 1c57a067 P4D 1c57a067 PUD da10067 PMD 0
[ 5657.080661] Oops: 0000 [#1] PREEMPT SMP
[ 5657.081592] CPU: 1 PID: 5157 Comm: fsstress Tainted: G        W         4.19.0-rc8-default+ #323
[ 5657.083703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
[ 5657.086577] RIP: 0010:shrink_page_list+0x2f9/0xe90
[ 5657.091937] RSP: 0018:ffffb5c745c8f728 EFLAGS: 00010287
[ 5657.092953] RAX: 0000000000000074 RBX: ffffb5c745c8f830 RCX: 0000000000000000
[ 5657.094590] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a8747fdf3d0
[ 5657.095987] RBP: ffffb5c745c8f9e0 R08: 0000000000000000 R09: 0000000000000000
[ 5657.097159] R10: ffff9a8747fdf5e8 R11: 0000000000000000 R12: ffffb5c745c8f788
[ 5657.098513] R13: ffff9a877f6ff2c0 R14: ffff9a877f6ff2c8 R15: dead000000000200
[ 5657.099689] FS:  00007f948d853b80(0000) GS:ffff9a877d600000(0000) knlGS:0000000000000000
[ 5657.101032] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5657.101953] CR2: 00000000000000cc CR3: 00000000684bd000 CR4: 00000000000006e0
[ 5657.103159] Call Trace:
[ 5657.103776]  shrink_inactive_list+0x194/0x410
[ 5657.104671]  shrink_node_memcg.constprop.84+0x39a/0x6a0
[ 5657.105750]  shrink_node+0x62/0x1c0
[ 5657.106529]  try_to_free_pages+0x1a4/0x500
[ 5657.107408]  __alloc_pages_slowpath+0x2c9/0xb20
[ 5657.108418]  __alloc_pages_nodemask+0x268/0x2b0
[ 5657.109348]  kmalloc_large_node+0x37/0x90
[ 5657.110205]  __kmalloc_node+0x236/0x310
[ 5657.111014]  kvmalloc_node+0x3e/0x70

Fixes: 30928e9baa ("btrfs: don't run delayed_iputs in commit")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add trace ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2019-05-16 19:42:28 +02:00
Josef Bacik
f97fd2926e btrfs: wait on ordered extents on abort cleanup
commit 74d5d229b1 upstream.

If we flip read-only before we initiate writeback on all dirty pages for
ordered extents we've created then we'll have ordered extents left over
on umount, which results in all sorts of bad things happening.  Fix this
by making sure we wait on ordered extents if we have to do the aborted
transaction cleanup stuff.

generic/475 can produce this warning:

 [ 8531.177332] WARNING: CPU: 2 PID: 11997 at fs/btrfs/disk-io.c:3856 btrfs_free_fs_root+0x95/0xa0 [btrfs]
 [ 8531.183282] CPU: 2 PID: 11997 Comm: umount Tainted: G        W 5.0.0-rc1-default+ #394
 [ 8531.185164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
 [ 8531.187851] RIP: 0010:btrfs_free_fs_root+0x95/0xa0 [btrfs]
 [ 8531.193082] RSP: 0018:ffffb1ab86163d98 EFLAGS: 00010286
 [ 8531.194198] RAX: ffff9f3449494d18 RBX: ffff9f34a2695000 RCX:0000000000000000
 [ 8531.195629] RDX: 0000000000000002 RSI: 0000000000000001 RDI:0000000000000000
 [ 8531.197315] RBP: ffff9f344e930000 R08: 0000000000000001 R09:0000000000000000
 [ 8531.199095] R10: 0000000000000000 R11: ffff9f34494d4ff8 R12:ffffb1ab86163dc0
 [ 8531.200870] R13: ffff9f344e9300b0 R14: ffffb1ab86163db8 R15:0000000000000000
 [ 8531.202707] FS:  00007fc68e949fc0(0000) GS:ffff9f34bd800000(0000)knlGS:0000000000000000
 [ 8531.204851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [ 8531.205942] CR2: 00007ffde8114dd8 CR3: 000000002dfbd000 CR4:00000000000006e0
 [ 8531.207516] Call Trace:
 [ 8531.208175]  btrfs_free_fs_roots+0xdb/0x170 [btrfs]
 [ 8531.210209]  ? wait_for_completion+0x5b/0x190
 [ 8531.211303]  close_ctree+0x157/0x350 [btrfs]
 [ 8531.212412]  generic_shutdown_super+0x64/0x100
 [ 8531.213485]  kill_anon_super+0x14/0x30
 [ 8531.214430]  btrfs_kill_super+0x12/0xa0 [btrfs]
 [ 8531.215539]  deactivate_locked_super+0x29/0x60
 [ 8531.216633]  cleanup_mnt+0x3b/0x70
 [ 8531.217497]  task_work_run+0x98/0xc0
 [ 8531.218397]  exit_to_usermode_loop+0x83/0x90
 [ 8531.219324]  do_syscall_64+0x15b/0x180
 [ 8531.220192]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 [ 8531.221286] RIP: 0033:0x7fc68e5e4d07
 [ 8531.225621] RSP: 002b:00007ffde8116608 EFLAGS: 00000246 ORIG_RAX:00000000000000a6
 [ 8531.227512] RAX: 0000000000000000 RBX: 00005580c2175970 RCX:00007fc68e5e4d07
 [ 8531.229098] RDX: 0000000000000001 RSI: 0000000000000000 RDI:00005580c2175b80
 [ 8531.230730] RBP: 0000000000000000 R08: 00005580c2175ba0 R09:00007ffde8114e80
 [ 8531.232269] R10: 0000000000000000 R11: 0000000000000246 R12:00005580c2175b80
 [ 8531.233839] R13: 00007fc68eac61c4 R14: 00005580c2175a68 R15:0000000000000000

Leaving a tree in the rb-tree:

3853 void btrfs_free_fs_root(struct btrfs_root *root)
3854 {
3855         iput(root->ino_cache_inode);
3856         WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));

CC: stable@vger.kernel.org
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add stacktrace ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-23 08:09:48 +01:00
Nikolay Borisov
335cbe4df8 btrfs: Always try all copies when reading extent buffers
commit f8397d69da upstream.

When a metadata read is served the endio routine btree_readpage_end_io_hook
is called which eventually runs the tree-checker. If tree-checker fails
to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
leads to btree_read_extent_buffer_pages wrongly assuming that all
available copies of this extent buffer are wrong and failing prematurely.
Fix this modify btree_read_extent_buffer_pages to read all copies of
the data.

This failure was exhibitted in xfstests btrfs/124 which would
spuriously fail its balance operations. The reason was that when balance
was run following re-introduction of the missing raid1 disk
__btrfs_map_block would map the read request to stripe 0, which
corresponded to devid 2 (the disk which is being removed in the test):

    item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
	io_align 65536 io_width 65536 sector_size 4096
	num_stripes 2 sub_stripes 1
		stripe 0 devid 2 offset 2156920832
		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
		stripe 1 devid 1 offset 3553624064
		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca

This caused read requests for a checksum item that to be routed to the
stale disk which triggered the aforementioned logic involving
EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
the balance operation.

Fixes: a826d6dcb3 ("Btrfs: check items for correctness as we search")
CC: stable@vger.kernel.org # 4.4+
Suggested-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-08 13:03:39 +01:00
Qu Wenruo
b6a07f9035 btrfs: tree-checker: Fix false panic for sanity test
commit 69fc6cbbac upstream.

[BUG]
If we run btrfs with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y, it will
instantly cause kernel panic like:

------
...
assertion failed: 0, file: fs/btrfs/disk-io.c, line: 3853
...
Call Trace:
 btrfs_mark_buffer_dirty+0x187/0x1f0 [btrfs]
 setup_items_for_insert+0x385/0x650 [btrfs]
 __btrfs_drop_extents+0x129a/0x1870 [btrfs]
...
-----

[Cause]
Btrfs will call btrfs_check_leaf() in btrfs_mark_buffer_dirty() to check
if the leaf is valid with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y.

However quite some btrfs_mark_buffer_dirty() callers(*) don't really
initialize its item data but only initialize its item pointers, leaving
item data uninitialized.

This makes tree-checker catch uninitialized data as error, causing
such panic.

*: These callers include but not limited to
setup_items_for_insert()
btrfs_split_item()
btrfs_expand_item()

[Fix]
Add a new parameter @check_item_data to btrfs_check_leaf().
With @check_item_data set to false, item data check will be skipped and
fallback to old btrfs_check_leaf() behavior.

So we can still get early warning if we screw up item pointers, and
avoid false panic.

Cc: Filipe Manana <fdmanana@gmail.com>
Reported-by: Lakshmipathi.G <lakshmipathi.g@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>

Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-05 19:41:12 +01:00
Qu Wenruo
eb3493e247 btrfs: Move leaf and node validation checker to tree-checker.c
commit 557ea5dd00 upstream.

It's no doubt the comprehensive tree block checker will become larger,
so moving them into their own files is quite reasonable.

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ wording adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-05 19:41:12 +01:00
Qu Wenruo
64948fd63f btrfs: Add checker for EXTENT_CSUM
commit 4b865cab96 upstream.

EXTENT_CSUM checker is a relatively easy one, only needs to check:

1) Objectid
   Fixed to BTRFS_EXTENT_CSUM_OBJECTID

2) Key offset alignment
   Must be aligned to sectorsize

3) Item size alignedment
   Must be aligned to csum size

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-05 19:41:11 +01:00
Qu Wenruo
fa5d29e6d7 btrfs: Add sanity check for EXTENT_DATA when reading out leaf
commit 40c3c40947 upstream.

Add extra checks for item with EXTENT_DATA type.  This checks the
following thing:

0) Key offset
   All key offsets must be aligned to sectorsize.
   Inline extent must have 0 for key offset.

1) Item size
   Uncompressed inline file extent size must match item size.
   (Compressed inline file extent has no information about its on-disk size.)
   Regular/preallocated file extent size must be a fixed value.

2) Every member of regular file extent item
   Including alignment for bytenr and offset, possible value for
   compression/encryption/type.

3) Type/compression/encode must be one of the valid values.

This should be the most comprehensive and strict check in the context
of btrfs_item for EXTENT_DATA.

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ switch to BTRFS_FILE_EXTENT_TYPES, similar to what
  BTRFS_COMPRESS_TYPES does ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-05 19:41:11 +01:00
Qu Wenruo
ac6ea50bb6 btrfs: Check if item pointer overlaps with the item itself
commit 7f43d4affb upstream.

Function check_leaf() checks if any item pointer points outside of the
leaf, but it doesn't check if the pointer overlaps with the item itself.

Normally only the last item may be the victim, but adding such check is
never a bad idea anyway.

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-05 19:41:11 +01:00
Qu Wenruo
a5cc85fe13 btrfs: Refactor check_leaf function for later expansion
commit c3267bbaa9 upstream.

Current check_leaf() function does a good job checking key order and
item offset/size.

However it only checks from slot 0 to the last but one slot, this is
good but makes later expansion hard.

So this refactoring iterates from slot 0 to the last slot.
For key comparison, it uses a key with all 0 as initial key, so all
valid keys should be larger than that.

And for item size/offset checks, it compares current item end with
previous item offset.
For slot 0, use leaf end as a special case.

This makes later item/key offset checks and item size checks easier to
be implemented.

Also, makes check_leaf() to return -EUCLEAN other than -EIO to indicate
error.

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-05 19:41:11 +01:00
Lu Fengqi
84e7fe7bd0 btrfs: fix pinned underflow after transaction aborted
commit fcd5e74288 upstream.

When running generic/475, we may get the following warning in dmesg:

[ 6902.102154] WARNING: CPU: 3 PID: 18013 at fs/btrfs/extent-tree.c:9776 btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
[ 6902.109160] CPU: 3 PID: 18013 Comm: umount Tainted: G        W  O      4.19.0-rc8+ #8
[ 6902.110971] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 6902.112857] RIP: 0010:btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
[ 6902.118921] RSP: 0018:ffffc9000459bdb0 EFLAGS: 00010286
[ 6902.120315] RAX: ffff880175050bb0 RBX: ffff8801124a8000 RCX: 0000000000170007
[ 6902.121969] RDX: 0000000000000002 RSI: 0000000000170007 RDI: ffffffff8125fb74
[ 6902.123716] RBP: ffff880175055d10 R08: 0000000000000000 R09: 0000000000000000
[ 6902.125417] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880175055d88
[ 6902.127129] R13: ffff880175050bb0 R14: 0000000000000000 R15: dead000000000100
[ 6902.129060] FS:  00007f4507223780(0000) GS:ffff88017ba00000(0000) knlGS:0000000000000000
[ 6902.130996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6902.132558] CR2: 00005623599cac78 CR3: 000000014b700001 CR4: 00000000003606e0
[ 6902.134270] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6902.135981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6902.137836] Call Trace:
[ 6902.138939]  close_ctree+0x171/0x330 [btrfs]
[ 6902.140181]  ? kthread_stop+0x146/0x1f0
[ 6902.141277]  generic_shutdown_super+0x6c/0x100
[ 6902.142517]  kill_anon_super+0x14/0x30
[ 6902.143554]  btrfs_kill_super+0x13/0x100 [btrfs]
[ 6902.144790]  deactivate_locked_super+0x2f/0x70
[ 6902.146014]  cleanup_mnt+0x3b/0x70
[ 6902.147020]  task_work_run+0x9e/0xd0
[ 6902.148036]  do_syscall_64+0x470/0x600
[ 6902.149142]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 6902.150375]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 6902.151640] RIP: 0033:0x7f45077a6a7b
[ 6902.157324] RSP: 002b:00007ffd589f3e68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 6902.159187] RAX: 0000000000000000 RBX: 000055e8eec732b0 RCX: 00007f45077a6a7b
[ 6902.160834] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055e8eec73490
[ 6902.162526] RBP: 0000000000000000 R08: 000055e8eec734b0 R09: 00007ffd589f26c0
[ 6902.164141] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e8eec73490
[ 6902.165815] R13: 00007f4507ac61a4 R14: 0000000000000000 R15: 00007ffd589f40d8
[ 6902.167553] irq event stamp: 0
[ 6902.168998] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
[ 6902.170731] hardirqs last disabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
[ 6902.172773] softirqs last  enabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
[ 6902.174671] softirqs last disabled at (0): [<0000000000000000>]           (null)
[ 6902.176407] ---[ end trace 463138c2986b275c ]---
[ 6902.177636] BTRFS info (device dm-3): space_info 4 has 273465344 free, is not full
[ 6902.179453] BTRFS info (device dm-3): space_info total=276824064, used=4685824, pinned=18446744073708158976, reserved=0, may_use=0, readonly=65536

In the above line there's "pinned=18446744073708158976" which is an
unsigned u64 value of -1392640, an obvious underflow.

When transaction_kthread is running cleanup_transaction(), another
fsstress is running btrfs_commit_transaction(). The
btrfs_finish_extent_commit() may get the same range as
btrfs_destroy_pinned_extent() got, which causes the pinned underflow.

Fixes: d4b450cd4b ("Btrfs: fix race between transaction commit and empty block group removal")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-21 09:24:10 +01:00
Ethan Lien
770025cc4b btrfs: use correct compare function of dirty_metadata_bytes
commit d814a49198 upstream.

We use customized, nodesize batch value to update dirty_metadata_bytes.
We should also use batch version of compare function or we will easily
goto fast path and get false result from percpu_counter_compare().

Fixes: e2d845211e ("Btrfs: use percpu counter for dirty metadata count")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-05 09:26:33 +02:00
Anand Jain
058dd233b5 btrfs: define SUPER_FLAG_METADUMP_V2
commit e2731e5588 upstream.

btrfs-progs uses super flag bit BTRFS_SUPER_FLAG_METADUMP_V2 (1ULL << 34).
So just define that in kernel so that we know its been used.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-11 22:49:18 +02:00
Jeff Mahoney
de00d57294 btrfs: fix lockdep splat in btrfs_alloc_subvolume_writers
[ Upstream commit 8a5a916d9a ]

While running btrfs/011, I hit the following lockdep splat.

This is the important bit:
   pcpu_alloc+0x1ac/0x5e0
   __percpu_counter_init+0x4e/0xb0
   btrfs_init_fs_root+0x99/0x1c0 [btrfs]
   btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
   resolve_indirect_refs+0x130/0x830 [btrfs]
   find_parent_nodes+0x69e/0xff0 [btrfs]
   btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
   btrfs_find_all_roots+0x50/0x70 [btrfs]
   btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
   btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]

The percpu_counter_init call in btrfs_alloc_subvolume_writers
uses GFP_KERNEL, which we can't do during transaction commit.

This switches it to GFP_NOFS.

========================================================
WARNING: possible irq lock inversion dependency detected
4.12.14-kvmsmall #8 Tainted: G        W
--------------------------------------------------------
kswapd0/50 just changed the state of lock:
 (&delayed_node->mutex){+.+.-.}, at: [<ffffffffc06994fa>] __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
but this lock took another, RECLAIM_FS-unsafe lock in the past:
 (pcpu_alloc_mutex){+.+.+.}

and interrupts could create inverse lock ordering between them.

other info that might help us debug this:
Chain exists of:
  &delayed_node->mutex --> &found->groups_sem --> pcpu_alloc_mutex

 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(pcpu_alloc_mutex);
                               local_irq_disable();
                               lock(&delayed_node->mutex);
                               lock(&found->groups_sem);
  <Interrupt>
    lock(&delayed_node->mutex);

 *** DEADLOCK ***

2 locks held by kswapd0/50:
 #0:  (shrinker_rwsem){++++..}, at: [<ffffffff811dc11f>] shrink_slab+0x7f/0x5b0
 #1:  (&type->s_umount_key#30){+++++.}, at: [<ffffffff8126dec6>] trylock_super+0x16/0x50

the shortest dependencies between 2nd lock and 1st lock:
   -> (pcpu_alloc_mutex){+.+.+.} ops: 4904 {
      HARDIRQ-ON-W at:
                          __mutex_lock+0x4e/0x8c0
                          pcpu_alloc+0x1ac/0x5e0
                          alloc_kmem_cache_cpus.isra.70+0x25/0xa0
                          __do_tune_cpucache+0x2c/0x220
                          do_tune_cpucache+0x26/0xc0
                          enable_cpucache+0x6d/0xf0
                          kmem_cache_init_late+0x42/0x75
                          start_kernel+0x343/0x4cb
                          x86_64_start_kernel+0x127/0x134
                          secondary_startup_64+0xa5/0xb0
      SOFTIRQ-ON-W at:
                          __mutex_lock+0x4e/0x8c0
                          pcpu_alloc+0x1ac/0x5e0
                          alloc_kmem_cache_cpus.isra.70+0x25/0xa0
                          __do_tune_cpucache+0x2c/0x220
                          do_tune_cpucache+0x26/0xc0
                          enable_cpucache+0x6d/0xf0
                          kmem_cache_init_late+0x42/0x75
                          start_kernel+0x343/0x4cb
                          x86_64_start_kernel+0x127/0x134
                          secondary_startup_64+0xa5/0xb0
      RECLAIM_FS-ON-W at:
                             __kmalloc+0x47/0x310
                             pcpu_extend_area_map+0x2b/0xc0
                             pcpu_alloc+0x3ec/0x5e0
                             alloc_kmem_cache_cpus.isra.70+0x25/0xa0
                             __do_tune_cpucache+0x2c/0x220
                             do_tune_cpucache+0x26/0xc0
                             enable_cpucache+0x6d/0xf0
                             __kmem_cache_create+0x1bf/0x390
                             create_cache+0xba/0x1b0
                             kmem_cache_create+0x1f8/0x2b0
                             ksm_init+0x6f/0x19d
                             do_one_initcall+0x50/0x1b0
                             kernel_init_freeable+0x201/0x289
                             kernel_init+0xa/0x100
                             ret_from_fork+0x3a/0x50
      INITIAL USE at:
                         __mutex_lock+0x4e/0x8c0
                         pcpu_alloc+0x1ac/0x5e0
                         alloc_kmem_cache_cpus.isra.70+0x25/0xa0
                         setup_cpu_cache+0x2f/0x1f0
                         __kmem_cache_create+0x1bf/0x390
                         create_boot_cache+0x8b/0xb1
                         kmem_cache_init+0xa1/0x19e
                         start_kernel+0x270/0x4cb
                         x86_64_start_kernel+0x127/0x134
                         secondary_startup_64+0xa5/0xb0
    }
    ... key      at: [<ffffffff821d8e70>] pcpu_alloc_mutex+0x70/0xa0
    ... acquired at:
   pcpu_alloc+0x1ac/0x5e0
   __percpu_counter_init+0x4e/0xb0
   btrfs_init_fs_root+0x99/0x1c0 [btrfs]
   btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
   resolve_indirect_refs+0x130/0x830 [btrfs]
   find_parent_nodes+0x69e/0xff0 [btrfs]
   btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
   btrfs_find_all_roots+0x50/0x70 [btrfs]
   btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
   btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]
   transaction_kthread+0x176/0x1b0 [btrfs]
   kthread+0x102/0x140
   ret_from_fork+0x3a/0x50

  -> (&fs_info->commit_root_sem){++++..} ops: 1566382 {
     HARDIRQ-ON-W at:
                        down_write+0x3e/0xa0
                        cache_block_group+0x287/0x420 [btrfs]
                        find_free_extent+0x106c/0x12d0 [btrfs]
                        btrfs_reserve_extent+0xd8/0x170 [btrfs]
                        cow_file_range.isra.66+0x133/0x470 [btrfs]
                        run_delalloc_range+0x121/0x410 [btrfs]
                        writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
                        __extent_writepage+0x19a/0x360 [btrfs]
                        extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
                        extent_writepages+0x4d/0x60 [btrfs]
                        do_writepages+0x1a/0x70
                        __filemap_fdatawrite_range+0xa7/0xe0
                        btrfs_rename+0x5ee/0xdb0 [btrfs]
                        vfs_rename+0x52a/0x7e0
                        SyS_rename+0x351/0x3b0
                        do_syscall_64+0x79/0x1e0
                        entry_SYSCALL_64_after_hwframe+0x42/0xb7
     HARDIRQ-ON-R at:
                        down_read+0x35/0x90
                        caching_thread+0x57/0x560 [btrfs]
                        normal_work_helper+0x1c0/0x5e0 [btrfs]
                        process_one_work+0x1e0/0x5c0
                        worker_thread+0x44/0x390
                        kthread+0x102/0x140
                        ret_from_fork+0x3a/0x50
     SOFTIRQ-ON-W at:
                        down_write+0x3e/0xa0
                        cache_block_group+0x287/0x420 [btrfs]
                        find_free_extent+0x106c/0x12d0 [btrfs]
                        btrfs_reserve_extent+0xd8/0x170 [btrfs]
                        cow_file_range.isra.66+0x133/0x470 [btrfs]
                        run_delalloc_range+0x121/0x410 [btrfs]
                        writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
                        __extent_writepage+0x19a/0x360 [btrfs]
                        extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
                        extent_writepages+0x4d/0x60 [btrfs]
                        do_writepages+0x1a/0x70
                        __filemap_fdatawrite_range+0xa7/0xe0
                        btrfs_rename+0x5ee/0xdb0 [btrfs]
                        vfs_rename+0x52a/0x7e0
                        SyS_rename+0x351/0x3b0
                        do_syscall_64+0x79/0x1e0
                        entry_SYSCALL_64_after_hwframe+0x42/0xb7
     SOFTIRQ-ON-R at:
                        down_read+0x35/0x90
                        caching_thread+0x57/0x560 [btrfs]
                        normal_work_helper+0x1c0/0x5e0 [btrfs]
                        process_one_work+0x1e0/0x5c0
                        worker_thread+0x44/0x390
                        kthread+0x102/0x140
                        ret_from_fork+0x3a/0x50
     INITIAL USE at:
                       down_write+0x3e/0xa0
                       cache_block_group+0x287/0x420 [btrfs]
                       find_free_extent+0x106c/0x12d0 [btrfs]
                       btrfs_reserve_extent+0xd8/0x170 [btrfs]
                       cow_file_range.isra.66+0x133/0x470 [btrfs]
                       run_delalloc_range+0x121/0x410 [btrfs]
                       writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
                       __extent_writepage+0x19a/0x360 [btrfs]
                       extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
                       extent_writepages+0x4d/0x60 [btrfs]
                       do_writepages+0x1a/0x70
                       __filemap_fdatawrite_range+0xa7/0xe0
                       btrfs_rename+0x5ee/0xdb0 [btrfs]
                       vfs_rename+0x52a/0x7e0
                       SyS_rename+0x351/0x3b0
                       do_syscall_64+0x79/0x1e0
                       entry_SYSCALL_64_after_hwframe+0x42/0xb7
   }
   ... key      at: [<ffffffffc0729578>] __key.61970+0x0/0xfffffffffff9aa88 [btrfs]
   ... acquired at:
   cache_block_group+0x287/0x420 [btrfs]
   find_free_extent+0x106c/0x12d0 [btrfs]
   btrfs_reserve_extent+0xd8/0x170 [btrfs]
   btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs]
   btrfs_create_tree+0xbb/0x2a0 [btrfs]
   btrfs_create_uuid_tree+0x37/0x140 [btrfs]
   open_ctree+0x23c0/0x2660 [btrfs]
   btrfs_mount+0xd36/0xf90 [btrfs]
   mount_fs+0x3a/0x160
   vfs_kern_mount+0x66/0x150
   btrfs_mount+0x18c/0xf90 [btrfs]
   mount_fs+0x3a/0x160
   vfs_kern_mount+0x66/0x150
   do_mount+0x1c1/0xcc0
   SyS_mount+0x7e/0xd0
   do_syscall_64+0x79/0x1e0
   entry_SYSCALL_64_after_hwframe+0x42/0xb7

 -> (&found->groups_sem){++++..} ops: 2134587 {
    HARDIRQ-ON-W at:
                      down_write+0x3e/0xa0
                      __link_block_group+0x34/0x130 [btrfs]
                      btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
                      open_ctree+0x2054/0x2660 [btrfs]
                      btrfs_mount+0xd36/0xf90 [btrfs]
                      mount_fs+0x3a/0x160
                      vfs_kern_mount+0x66/0x150
                      btrfs_mount+0x18c/0xf90 [btrfs]
                      mount_fs+0x3a/0x160
                      vfs_kern_mount+0x66/0x150
                      do_mount+0x1c1/0xcc0
                      SyS_mount+0x7e/0xd0
                      do_syscall_64+0x79/0x1e0
                      entry_SYSCALL_64_after_hwframe+0x42/0xb7
    HARDIRQ-ON-R at:
                      down_read+0x35/0x90
                      btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs]
                      open_ctree+0x207b/0x2660 [btrfs]
                      btrfs_mount+0xd36/0xf90 [btrfs]
                      mount_fs+0x3a/0x160
                      vfs_kern_mount+0x66/0x150
                      btrfs_mount+0x18c/0xf90 [btrfs]
                      mount_fs+0x3a/0x160
                      vfs_kern_mount+0x66/0x150
                      do_mount+0x1c1/0xcc0
                      SyS_mount+0x7e/0xd0
                      do_syscall_64+0x79/0x1e0
                      entry_SYSCALL_64_after_hwframe+0x42/0xb7
    SOFTIRQ-ON-W at:
                      down_write+0x3e/0xa0
                      __link_block_group+0x34/0x130 [btrfs]
                      btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
                      open_ctree+0x2054/0x2660 [btrfs]
                      btrfs_mount+0xd36/0xf90 [btrfs]
                      mount_fs+0x3a/0x160
                      vfs_kern_mount+0x66/0x150
                      btrfs_mount+0x18c/0xf90 [btrfs]
                      mount_fs+0x3a/0x160
                      vfs_kern_mount+0x66/0x150
                      do_mount+0x1c1/0xcc0
                      SyS_mount+0x7e/0xd0
                      do_syscall_64+0x79/0x1e0
                      entry_SYSCALL_64_after_hwframe+0x42/0xb7
    SOFTIRQ-ON-R at:
                      down_read+0x35/0x90
                      btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs]
                      open_ctree+0x207b/0x2660 [btrfs]
                      btrfs_mount+0xd36/0xf90 [btrfs]
                      mount_fs+0x3a/0x160
                      vfs_kern_mount+0x66/0x150
                      btrfs_mount+0x18c/0xf90 [btrfs]
                      mount_fs+0x3a/0x160
                      vfs_kern_mount+0x66/0x150
                      do_mount+0x1c1/0xcc0
                      SyS_mount+0x7e/0xd0
                      do_syscall_64+0x79/0x1e0
                      entry_SYSCALL_64_after_hwframe+0x42/0xb7
    INITIAL USE at:
                     down_write+0x3e/0xa0
                     __link_block_group+0x34/0x130 [btrfs]
                     btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
                     open_ctree+0x2054/0x2660 [btrfs]
                     btrfs_mount+0xd36/0xf90 [btrfs]
                     mount_fs+0x3a/0x160
                     vfs_kern_mount+0x66/0x150
                     btrfs_mount+0x18c/0xf90 [btrfs]
                     mount_fs+0x3a/0x160
                     vfs_kern_mount+0x66/0x150
                     do_mount+0x1c1/0xcc0
                     SyS_mount+0x7e/0xd0
                     do_syscall_64+0x79/0x1e0
                     entry_SYSCALL_64_after_hwframe+0x42/0xb7
  }
  ... key      at: [<ffffffffc0729488>] __key.59101+0x0/0xfffffffffff9ab78 [btrfs]
  ... acquired at:
   find_free_extent+0xcb4/0x12d0 [btrfs]
   btrfs_reserve_extent+0xd8/0x170 [btrfs]
   btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs]
   __btrfs_cow_block+0x110/0x5b0 [btrfs]
   btrfs_cow_block+0xd7/0x290 [btrfs]
   btrfs_search_slot+0x1f6/0x960 [btrfs]
   btrfs_lookup_inode+0x2a/0x90 [btrfs]
   __btrfs_update_delayed_inode+0x65/0x210 [btrfs]
   btrfs_commit_inode_delayed_inode+0x121/0x130 [btrfs]
   btrfs_evict_inode+0x3fe/0x6a0 [btrfs]
   evict+0xc4/0x190
   __dentry_kill+0xbf/0x170
   dput+0x2ae/0x2f0
   SyS_rename+0x2a6/0x3b0
   do_syscall_64+0x79/0x1e0
   entry_SYSCALL_64_after_hwframe+0x42/0xb7

-> (&delayed_node->mutex){+.+.-.} ops: 5580204 {
   HARDIRQ-ON-W at:
                    __mutex_lock+0x4e/0x8c0
                    btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
                    btrfs_update_inode+0x83/0x110 [btrfs]
                    btrfs_dirty_inode+0x62/0xe0 [btrfs]
                    touch_atime+0x8c/0xb0
                    do_generic_file_read+0x818/0xb10
                    __vfs_read+0xdc/0x150
                    vfs_read+0x8a/0x130
                    SyS_read+0x45/0xa0
                    do_syscall_64+0x79/0x1e0
                    entry_SYSCALL_64_after_hwframe+0x42/0xb7
   SOFTIRQ-ON-W at:
                    __mutex_lock+0x4e/0x8c0
                    btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
                    btrfs_update_inode+0x83/0x110 [btrfs]
                    btrfs_dirty_inode+0x62/0xe0 [btrfs]
                    touch_atime+0x8c/0xb0
                    do_generic_file_read+0x818/0xb10
                    __vfs_read+0xdc/0x150
                    vfs_read+0x8a/0x130
                    SyS_read+0x45/0xa0
                    do_syscall_64+0x79/0x1e0
                    entry_SYSCALL_64_after_hwframe+0x42/0xb7
   IN-RECLAIM_FS-W at:
                       __mutex_lock+0x4e/0x8c0
                       __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
                       btrfs_evict_inode+0x22c/0x6a0 [btrfs]
                       evict+0xc4/0x190
                       dispose_list+0x35/0x50
                       prune_icache_sb+0x42/0x50
                       super_cache_scan+0x139/0x190
                       shrink_slab+0x262/0x5b0
                       shrink_node+0x2eb/0x2f0
                       kswapd+0x2eb/0x890
                       kthread+0x102/0x140
                       ret_from_fork+0x3a/0x50
   INITIAL USE at:
                   __mutex_lock+0x4e/0x8c0
                   btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
                   btrfs_update_inode+0x83/0x110 [btrfs]
                   btrfs_dirty_inode+0x62/0xe0 [btrfs]
                   touch_atime+0x8c/0xb0
                   do_generic_file_read+0x818/0xb10
                   __vfs_read+0xdc/0x150
                   vfs_read+0x8a/0x130
                   SyS_read+0x45/0xa0
                   do_syscall_64+0x79/0x1e0
                   entry_SYSCALL_64_after_hwframe+0x42/0xb7
 }
 ... key      at: [<ffffffffc072d488>] __key.56935+0x0/0xfffffffffff96b78 [btrfs]
 ... acquired at:
   __lock_acquire+0x264/0x11c0
   lock_acquire+0xbd/0x1e0
   __mutex_lock+0x4e/0x8c0
   __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
   btrfs_evict_inode+0x22c/0x6a0 [btrfs]
   evict+0xc4/0x190
   dispose_list+0x35/0x50
   prune_icache_sb+0x42/0x50
   super_cache_scan+0x139/0x190
   shrink_slab+0x262/0x5b0
   shrink_node+0x2eb/0x2f0
   kswapd+0x2eb/0x890
   kthread+0x102/0x140
   ret_from_fork+0x3a/0x50

stack backtrace:
CPU: 1 PID: 50 Comm: kswapd0 Tainted: G        W        4.12.14-kvmsmall #8 SLE15 (unreleased)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
 dump_stack+0x78/0xb7
 print_irq_inversion_bug.part.38+0x19f/0x1aa
 check_usage_forwards+0x102/0x120
 ? ret_from_fork+0x3a/0x50
 ? check_usage_backwards+0x110/0x110
 mark_lock+0x16c/0x270
 __lock_acquire+0x264/0x11c0
 ? pagevec_lookup_entries+0x1a/0x30
 ? truncate_inode_pages_range+0x2b3/0x7f0
 lock_acquire+0xbd/0x1e0
 ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
 __mutex_lock+0x4e/0x8c0
 ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
 ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
 ? btrfs_evict_inode+0x1f6/0x6a0 [btrfs]
 __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
 btrfs_evict_inode+0x22c/0x6a0 [btrfs]
 evict+0xc4/0x190
 dispose_list+0x35/0x50
 prune_icache_sb+0x42/0x50
 super_cache_scan+0x139/0x190
 shrink_slab+0x262/0x5b0
 shrink_node+0x2eb/0x2f0
 kswapd+0x2eb/0x890
 kthread+0x102/0x140
 ? mem_cgroup_shrink_node+0x2c0/0x2c0
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x3a/0x50

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:52:26 +02:00
Liu Bo
f2924e32dc Btrfs: clean up resources during umount after trans is aborted
[ Upstream commit af72273381 ]

Currently if some fatal errors occur, like all IO get -EIO, resources
would be cleaned up when
a) transaction is being committed or
b) BTRFS_FS_STATE_ERROR is set

However, in some rare cases, resources may be left alone after transaction
gets aborted and umount may run into some ASSERT(), e.g.
ASSERT(list_empty(&block_group->dirty_list));

For case a), in btrfs_commit_transaciton(), there're several places at the
beginning where we just call btrfs_end_transaction() without cleaning up
resources.  For case b), it is possible that the trans handle doesn't have
any dirty stuff, then only trans hanlde is marked as aborted while
BTRFS_FS_STATE_ERROR is not set, so resources remain in memory.

This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that
all resources won't stay in memory after umount.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:52:22 +02:00
Nikolay Borisov
7ea5cff55c btrfs: Fix delalloc inodes invalidation during transaction abort
commit fe816d0f1d upstream.

When a transaction is aborted btrfs_cleanup_transaction is called to
cleanup all the various in-flight bits and pieces which migth be
active. One of those is delalloc inodes - inodes which have dirty
pages which haven't been persisted yet. Currently the process of
freeing such delalloc inodes in exceptional circumstances such as
transaction abort boiled down to calling btrfs_invalidate_inodes whose
sole job is to invalidate the dentries for all inodes related to a
root. This is in fact wrong and insufficient since such delalloc inodes
will likely have pending pages or ordered-extents and will be linked to
the sb->s_inode_list. This means that unmounting a btrfs instance with
an aborted transaction could potentially lead inodes/their pages
visible to the system long after their superblock has been freed. This
in turn leads to a "use-after-free" situation once page shrink is
triggered. This situation could be simulated by running generic/019
which would cause such inodes to be left hanging, followed by
generic/176 which causes memory pressure and page eviction which lead
to touching the freed super block instance. This situation is
additionally detected by the unmount code of VFS with the following
message:

"VFS: Busy inodes after unmount of Self-destruct in 5 seconds.  Have a nice day..."

Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
in free_fs_root for the same reason.

This patch aims to rectify the sitaution by doing the following:

1. Change btrfs_destroy_delalloc_inodes so that it calls
invalidate_inode_pages2 for every inode on the delalloc list, this
ensures that all the pages of the inode are released. This function
boils down to calling btrfs_releasepage. During test I observed cases
where inodes on the delalloc list were having an i_count of 0, so this
necessitates using igrab to be sure we are working on a non-freed inode.

2. Since calling btrfs_releasepage might queue delayed iputs move the
call out to btrfs_cleanup_transaction in btrfs_error_commit_super before
calling run_delayed_iputs for the last time. This is necessary to ensure
that delayed iputs are run.

Note: this patch is tagged for 4.14 stable but the fix applies to older
versions too but needs to be backported manually due to conflicts.

CC: stable@vger.kernel.org # 4.14.x: 2b87733134: btrfs: Split btrfs_del_delalloc_inode into 2 functions
CC: stable@vger.kernel.org # 4.14.x
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment to igrab ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-22 18:54:01 +02:00
Anand Jain
c231cec825 btrfs: fail mount when sb flag is not in BTRFS_SUPER_FLAG_SUPP
[ Upstream commit 6f794e3c5c ]

It appears from the original commit [1] that there isn't any design
specific reason not to fail the mount instead of just warning. This
patch will change it to fail.

[1]
 commit 319e4d0661
    btrfs: Enhance super validation check

Fixes: 319e4d0661 ("btrfs: Enhance super validation check")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26 11:02:09 +02:00
Omar Sandoval
27b0dc3168 Btrfs: disable FUA if mounted with nobarrier
[ Upstream commit 1b9e619c5b ]

I was seeing disk flushes still happening when I mounted a Btrfs
filesystem with nobarrier for testing. This is because we use FUA to
write out the first super block, and on devices without FUA support, the
block layer translates FUA to a flush. Even on devices supporting true
FUA, using FUA when we asked for no barriers is surprising.

Fixes: 387125fc72 ("Btrfs: fix barrier flushes")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25 11:08:00 +01:00
Linus Torvalds
5ba88cd6e9 Merge branch 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
 "We've collected a bunch of isolated fixes, for crashes, user-visible
  behaviour or missing bits from other subsystem cleanups from the past.

  The overall number is not small but I was not able to make it
  significantly smaller. Most of the patches are supposed to go to
  stable"

* 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: log csums for all modified extents
  Btrfs: fix unexpected result when dio reading corrupted blocks
  btrfs: Report error on removing qgroup if del_qgroup_item fails
  Btrfs: skip checksum when reading compressed data if some IO have failed
  Btrfs: fix kernel oops while reading compressed data
  Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block
  Btrfs: do not backup tree roots when fsync
  btrfs: remove BTRFS_FS_QUOTA_DISABLING flag
  btrfs: propagate error to btrfs_cmp_data_prepare caller
  btrfs: prevent to set invalid default subvolid
  Btrfs: send: fix error number for unknown inode types
  btrfs: fix NULL pointer dereference from free_reloc_roots()
  btrfs: finish ordered extent cleaning if no progress is found
  btrfs: clear ordered flag on cleaning up ordered extents
  Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO
  Btrfs: do not reset bio->bi_ops while writing bio
  Btrfs: use the new helper wbc_to_write_flags
2017-09-29 12:57:35 -07:00
Liu Bo
fed3b38114 Btrfs: do not backup tree roots when fsync
It doesn't make sense to backup tree roots when doing fsync, since
during fsync those tree roots have not been consistent on disk.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:53:04 +02:00
Linus Torvalds
0f0d12728e Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mount flag updates from Al Viro:
 "Another chunk of fmount preparations from dhowells; only trivial
  conflicts for that part. It separates MS_... bits (very grotty
  mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
  only a small subset of MS_... stuff).

  This does *not* convert the filesystems to new constants; only the
  infrastructure is done here. The next step in that series is where the
  conflicts would be; that's the conversion of filesystems. It's purely
  mechanical and it's better done after the merge, so if you could run
  something like

	list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

	sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
	        -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
	        -e 's/\<MS_NODEV\>/SB_NODEV/g' \
	        -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
	        -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
	        -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
	        -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
	        -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
	        -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
	        -e 's/\<MS_SILENT\>/SB_SILENT/g' \
	        -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
	        -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
	        -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
	        -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \
	        $list

  and commit it with something along the lines of 'convert filesystems
  away from use of MS_... constants' as commit message, it would save a
  quite a bit of headache next cycle"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: Differentiate mount flags (MS_*) from internal superblock flags
  VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
  vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
2017-09-14 18:54:01 -07:00
Linus Torvalds
e7cdb60fd2 Merge branch 'zstd-minimal' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull zstd support from Chris Mason:
 "Nick Terrell's patch series to add zstd support to the kernel has been
  floating around for a while. After talking with Dave Sterba, Herbert
  and Phillip, we decided to send the whole thing in as one pull
  request.

  zstd is a big win in speed over zlib and in compression ratio over
  lzo, and the compression team here at FB has gotten great results
  using it in production. Nick will continue to update the kernel side
  with new improvements from the open source zstd userland code.

  Nick has a number of benchmarks for the main zstd code in his lib/zstd
  commit:

      I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB
      of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel
      Core i7 processor, 16 GB of RAM, and a SSD. I benchmarked using
      `silesia.tar` [3], which is 211,988,480 B large. Run the following
      commands for the benchmark:

        sudo modprobe zstd_compress_test
        sudo mknod zstd_compress_test c 245 0
        sudo cp silesia.tar zstd_compress_test

      The time is reported by the time of the userland `cp`.
      The MB/s is computed with

        1,536,217,008 B / time(buffer size, hash)

      which includes the time to copy from userland.
      The Adjusted MB/s is computed with

        1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)).

      The memory reported is the amount of memory the compressor
      requests.

        | Method   | Size (B) | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
        |----------|----------|----------|-------|---------|----------|----------|
        | none     | 11988480 |    0.100 |     1 | 2119.88 |        - |        - |
        | zstd -1  | 73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
        | zstd -3  | 66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
        | zstd -5  | 65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
        | zstd -10 | 60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
        | zstd -15 | 58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
        | zstd -19 | 54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
        | zlib -1  | 77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
        | zlib -3  | 72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
        | zlib -6  | 68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
        | zlib -9  | 67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |

      I benchmarked zstd decompression using the same method on the same
      machine. The benchmark file is located in the upstream zstd repo
      under `contrib/linux-kernel/zstd_decompress_test.c` [4]. The
      memory reported is the amount of memory required to decompress
      data compressed with the given compression level. If you know the
      maximum size of your input, you can reduce the memory usage of
      decompression irrespective of the compression level.

        | Method   | Time (s) | MB/s    | Adjusted MB/s | Memory (MB) |
        |----------|----------|---------|---------------|-------------|
        | none     |    0.025 | 8479.54 |             - |           - |
        | zstd -1  |    0.358 |  592.15 |        636.60 |        0.84 |
        | zstd -3  |    0.396 |  535.32 |        571.40 |        1.46 |
        | zstd -5  |    0.396 |  535.32 |        571.40 |        1.46 |
        | zstd -10 |    0.374 |  566.81 |        607.42 |        2.51 |
        | zstd -15 |    0.379 |  559.34 |        598.84 |        4.61 |
        | zstd -19 |    0.412 |  514.54 |        547.77 |        8.80 |
        | zlib -1  |    0.940 |  225.52 |        231.68 |        0.04 |
        | zlib -3  |    0.883 |  240.08 |        247.07 |        0.04 |
        | zlib -6  |    0.844 |  251.17 |        258.84 |        0.04 |
        | zlib -9  |    0.837 |  253.27 |        287.64 |        0.04 |

  I ran a long series of tests and benchmarks on the btrfs side and the
  gains are very similar to the core benchmarks Nick ran"

* 'zstd-minimal' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  squashfs: Add zstd support
  btrfs: Add zstd support
  lib: Add zstd modules
  lib: Add xxhash module
2017-09-14 17:30:49 -07:00
Linus Torvalds
66ba772ee3 Merge branch 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "The changes range through all types: cleanups, core chagnes, sanity
  checks, fixes, other user visible changes, detailed list below:

   - deprecated: user transaction ioctl

   - mount option ssd does not change allocation alignments

   - degraded read-write mount is allowed if all the raid profile
     constraints are met, now based on more accurate check

   - defrag: do not reset compression afterwards; the NOCOMPRESS flag
     can be now overriden by defrag

   - prep work for better extent reference tracking (related to the
     qgroup slowness with balance)

   - prep work for compression heuristics

   - memory allocation reductions (may help latencies on a loaded
     system)

   - better accounting for io waiting states

   - error handling improvements (removed BUGs)

   - added more sanity checks for shared refs

   - fix readdir vs pagefault deadlock under some circumstances

   - fix for 'no-hole' mode, certain combination of compressed and
     inline extents

   - send: fix emission of invalid clone operations

   - fixup file mode if setting acls fail

   - more fixes from fuzzing

   - oher cleanups"

* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (104 commits)
  btrfs: submit superblock io with REQ_META and REQ_PRIO
  btrfs: remove unnecessary memory barrier in btrfs_direct_IO
  btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent
  btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent
  btrfs: pass fs_info to btrfs_del_root instead of tree_root
  Btrfs: add one more sanity check for shared ref type
  Btrfs: remove BUG_ON in __add_tree_block
  Btrfs: remove BUG() in add_data_reference
  Btrfs: remove BUG() in print_extent_item
  Btrfs: remove BUG() in btrfs_extent_inline_ref_size
  Btrfs: convert to use btrfs_get_extent_inline_ref_type
  Btrfs: add a helper to retrive extent inline ref type
  btrfs: scrub: simplify scrub worker initialization
  btrfs: scrub: clean up division in scrub_find_csum
  btrfs: scrub: clean up division in __scrub_mark_bitmap
  btrfs: scrub: use bool for flush_all_writes
  btrfs: preserve i_mode if __btrfs_set_acl() fails
  btrfs: Remove extraneous chunk_objectid variable
  btrfs: Remove chunk_objectid argument from btrfs_make_block_group
  btrfs: Remove extra parentheses from condition in copy_items()
  ...
2017-09-09 13:27:51 -07:00
Linus Torvalds
a0725ab0c7 Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the first pull request for 4.14, containing most of the code
  changes. It's a quiet series this round, which I think we needed after
  the churn of the last few series. This contains:

   - Fix for a registration race in loop, from Anton Volkov.

   - Overflow complaint fix from Arnd for DAC960.

   - Series of drbd changes from the usual suspects.

   - Conversion of the stec/skd driver to blk-mq. From Bart.

   - A few BFQ improvements/fixes from Paolo.

   - CFQ improvement from Ritesh, allowing idling for group idle.

   - A few fixes found by Dan's smatch, courtesy of Dan.

   - A warning fixup for a race between changing the IO scheduler and
     device remova. From David Jeffery.

   - A few nbd fixes from Josef.

   - Support for cgroup info in blktrace, from Shaohua.

   - Also from Shaohua, new features in the null_blk driver to allow it
     to actually hold data, among other things.

   - Various corner cases and error handling fixes from Weiping Zhang.

   - Improvements to the IO stats tracking for blk-mq from me. Can
     drastically improve performance for fast devices and/or big
     machines.

   - Series from Christoph removing bi_bdev as being needed for IO
     submission, in preparation for nvme multipathing code.

   - Series from Bart, including various cleanups and fixes for switch
     fall through case complaints"

* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
  kernfs: checking for IS_ERR() instead of NULL
  drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
  drbd: Fix allyesconfig build, fix recent commit
  drbd: switch from kmalloc() to kmalloc_array()
  drbd: abort drbd_start_resync if there is no connection
  drbd: move global variables to drbd namespace and make some static
  drbd: rename "usermode_helper" to "drbd_usermode_helper"
  drbd: fix race between handshake and admin disconnect/down
  drbd: fix potential deadlock when trying to detach during handshake
  drbd: A single dot should be put into a sequence.
  drbd: fix rmmod cleanup, remove _all_ debugfs entries
  drbd: Use setup_timer() instead of init_timer() to simplify the code.
  drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
  drbd: new disk-option disable-write-same
  drbd: Fix resource role for newly created resources in events2
  drbd: mark symbols static where possible
  drbd: Send P_NEG_ACK upon write error in protocol != C
  drbd: add explicit plugging when submitting batches
  drbd: change list_for_each_safe to while(list_first_entry_or_null)
  drbd: introduce drbd_recv_header_maybe_unplug
  ...
2017-09-07 11:59:42 -07:00
Omar Sandoval
58efbc9f54 Btrfs: fix blk_status_t/errno confusion
This fixes several instances of blk_status_t and bare errno ints being
mixed up, some of which are real bugs.

In the normal case, 0 matches BLK_STS_OK, so we don't observe any
effects of the missing conversion, but in case of errors or passes
through the repair/retry paths, the errors get mixed up.

The changes were identified using 'sparse', we don't have reports of the
buggy behaviour.

Fixes: 4e4cbee93d ("block: switch bios to blk_status_t")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-24 17:19:02 +02:00
Christoph Hellwig
74d46992e0 block: replace bi_bdev with a gendisk pointer and partitions index
This way we don't need a block_device structure to submit I/O.  The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open.  Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device.  But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:55 -06:00
David Sterba
db95c876c5 btrfs: submit superblock io with REQ_META and REQ_PRIO
The superblock is also metadata of the filesystem so the relevant IO
should be tagged as such. We also tag it as high priority, as it's the
last block committed for metadata from a given transaction. Any delays
would effectively block the whole transaction, also blocking any other
operation holding the device_list_mutex.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-22 13:22:05 +02:00
Hans van Kranenburg
583b723151 btrfs: Do not use data_alloc_cluster in ssd mode
This patch provides a band aid to improve the 'out of the box'
behaviour of btrfs for disks that are detected as being an ssd.  In a
general purpose mixed workload scenario, the current ssd mode causes
overallocation of available raw disk space for data, while leaving
behind increasing amounts of unused fragmented free space. This
situation leads to early ENOSPC problems which are harming user
experience and adoption of btrfs as a general purpose filesystem.

This patch modifies the data extent allocation behaviour of the ssd mode
to make it behave identical to nossd mode.  The metadata behaviour and
additional ssd_spread option stay untouched so far.

Recommendations for future development are to reconsider the current
oversimplified nossd / ssd distinction and the broken detection
mechanism based on the rotational attribute in sysfs and provide
experienced users with a more flexible way to choose allocator behaviour
for data and metadata, optimized for certain use cases, while keeping
sane 'out of the box' default settings.  The internals of the current
btrfs code have more potential than what currently gets exposed to the
user to choose from.

    The SSD story...

    In the first year of btrfs development, around early 2008, btrfs
gained a mount option which enables specific functionality for
filesystems on solid state devices. The first occurance of this
functionality is in commit e18e4809, labeled "Add mount -o ssd, which
includes optimizations for seek free storage".

The effect on allocating free space for doing (data) writes is to
'cluster' writes together, writing them out in contiguous space, as
opposed to a 'tetris' way of putting all separate writes into any free
space fragment that fits (which is what the -o nossd behaviour does).

A somewhat simplified explanation of what happens is that, when for
example, the 'cluster' size is set to 2MiB, when we do some writes, the
data allocator will search for a free space block that is 2MiB big, and
put the writes in there. The ssd mode itself might allow a 2MiB cluster
to be composed of multiple free space extents with some existing data in
between, while the additional ssd_spread mount option kills off this
option and requires fully free space.

The idea behind this is (commit 536ac8ae): "The [...] clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
block. So, effectively this means applying a "locality based algorithm"
and trying to outsmart the actual ssd.

Since then, various changes have been made to the involved code, but the
basic idea is still present, and gets activated whenever the ssd mount
option is active. This also happens by default, when the rotational flag
as seen at /sys/block/<device>/queue/rotational is set to 0.

    However, there's a number of problems with this approach.

    First, what the optimization is trying to do is outsmart the ssd by
assuming there is a relation between the physical address space of the
block device as seen by btrfs and the actual physical storage of the
ssd, and then adjusting data placement. However, since the introduction
of the Flash Translation Layer (FTL) which is a part of the internal
controller of an ssd, these attempts are futile. The use of good quality
FTL in consumer ssd products might have been limited in 2008, but this
situation has changed drastically soon after that time. Today, even the
flash memory in your automatic cat feeding machine or your grandma's
wheelchair has a full featured one.

Second, the behaviour as described above results in the filesystem being
filled up with badly fragmented free space extents because of relatively
small pieces of space that are freed up by deletes, but not selected
again as part of a 'cluster'. Since the algorithm prefers allocating a
new chunk over going back to tetris mode, the end result is a filesystem
in which all raw space is allocated, but which is composed of
underutilized chunks with a 'shotgun blast' pattern of fragmented free
space. Usually, the next problematic thing that happens is the
filesystem wanting to allocate new space for metadata, which causes the
filesystem to fail in spectacular ways.

Third, the default mount options you get for an ssd ('ssd' mode enabled,
'discard' not enabled), in combination with spreading out writes over
the full address space and ignoring freed up space leads to worst case
behaviour in providing information to the ssd itself, since it will
never learn that all the free space left behind is actually free.  There
are two ways to let an ssd know previously written data does not have to
be preserved, which are sending explicit signals using discard or
fstrim, or by simply overwriting the space with new data.  The worst
case behaviour is the btrfs ssd_spread mount option in combination with
not having discard enabled. It has a side effect of minimizing the reuse
of free space previously written in.

Fourth, the rotational flag in /sys/ does not reliably indicate if the
device is a locally attached ssd. For example, iSCSI or NBD displays as
non-rotational, while a loop device on an ssd shows up as rotational.

The combination of the second and third problem effectively means that
despite all the good intentions, the btrfs ssd mode reliably causes the
ssd hardware and the filesystem structures and performance to be choked
to death. The clickbait version of the title of this story would have
been "Btrfs ssd optimizations considered harmful for ssds".

The current nossd 'tetris' mode (even still without discard) allows a
pattern of overwriting much more previously used space, causing many
more implicit discards to happen because of the overwrite information
the ssd gets. The actual location in the physical address space, as seen
from the point of view of btrfs is irrelevant, because the actual writes
to the low level flash are reordered anyway thanks to the FTL.

    Changes made in the code

1. Make ssd mode data allocation identical to tetris mode, like nossd.
2. Adjust and clean up filesystem mount messages so that we can easily
identify if a kernel has this patch applied or not, when providing
support to end users. Also, make better use of the *_and_info helpers to
only trigger messages on actual state changes.

    Backporting notes

Notes for whoever wants to backport this patch to their 4.9 LTS kernel:
* First apply commit 951e7966 "btrfs: drop the nossd flag when
  remounting with -o ssd", or fixup the differences manually.
* The rest of the conflicts are because of the fs_info refactoring. So,
  for example, instead of using fs_info, it's root->fs_info in
  extent-tree.c

Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Lu Fengqi
43a0111103 btrfs: use btrfsic_submit_bio instead of submit_bio in write_dev_flush
Although this bio has no data attached, it will reach this condition
(bio->bi_opf & REQ_PREFLUSH) and then update the flush_gen of dev_state
in __btrfsic_submit_bio. So we should still submit it through integrity
checker. Otherwise, the integrity checker will throw the following warning
when I mount a newly created btrfs filesystem.

[10264.755497] btrfs: attempt to write superblock which references block M @29523968 (sdb1/1111654400/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)!
[10264.755498] btrfs: attempt to write superblock which references block M @29523968 (sdb1/37912576/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)!

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Anand Jain
44880fdc91 btrfs: use appropriate define for the fsid
Though BTRFS_FSID_SIZE and BTRFS_UUID_SIZE are of the same size, we
should use the matching constant for the fsid buffer.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
David Sterba
4958aa6821 btrfs: drop chunk locks at the end of close_ctree
The pinned chunks might be left over so we clean them but at this point
of close_ctree, there's noone to race with, the locking can be removed.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
ea14b57fd1 btrfs: fix spelling of snapshotting
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
9f6d251033 btrfs: use named constant for bdev blocksize
Superblock is read and written using buffer heads, we need to set the
bdev blocksize. The magic constant has been hardcoded in several places,
so replace it with a named constant.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
abbb3b8ebf btrfs: split write_dev_supers to two functions
There are two independent parts, one that writes the superblocks and
another that waits for completion. No functional changes, but cleanups,
reformatting and comment updates.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
a4f78750ef btrfs: get fs_info from eb in btrfs_print_leaf, remove argument
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
Qu Wenruo
bc3cce2378 btrfs: Cleanup num_tolerated_disk_barrier_failures
As we use per-chunk degradable check, the global
num_tolerated_disk_barrier_failures is of no use.

We can now remove it.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Qu Wenruo
d10b82fe29 btrfs: Allow barrier_all_devices to do chunk level device check
The last user of num_tolerated_disk_barrier_failures is
barrier_all_devices().
But it can be easily changed to the new per-chunk degradable check
framework.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Qu Wenruo
4330e183c9 btrfs: Do chunk level check for degraded rw mount
Now use the btrfs_check_rw_degradable() to check if we can mount in the
degraded mode.

With this patch, we can mount in the following case:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -a /dev/sdc
 # mount /dev/sdb /mnt/btrfs -o degraded
 As the single data chunk is only on sdb, so it's OK to mount as
 degraded, as missing one device is OK for RAID1.

But still fail in the following case as expected:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -a /dev/sdb
 # mount /dev/sdc /mnt/btrfs -o degraded
 As the data chunk is only in sdb, so it's not OK to mount it as
 degraded.

Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reported-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Nick Terrell
5c1aab1dd5 btrfs: Add zstd support
Add zstd compression and decompression support to BtrFS. zstd at its
fastest level compresses almost as well as zlib, while offering much
faster compression and decompression, approaching lzo speeds.

I benchmarked btrfs with zstd compression against no compression, lzo
compression, and zlib compression. I benchmarked two scenarios. Copying
a set of files to btrfs, and then reading the files. Copying a tarball
to btrfs, extracting it to btrfs, and then reading the extracted files.
After every operation, I call `sync` and include the sync time.
Between every pair of operations I unmount and remount the filesystem
to avoid caching. The benchmark files can be found in the upstream
zstd source repository under
`contrib/linux-kernel/{btrfs-benchmark.sh,btrfs-extract-benchmark.sh}`
[1] [2].

I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD.

The first compression benchmark is copying 10 copies of the unzipped
Silesia corpus [3] into a BtrFS filesystem mounted with
`-o compress-force=Method`. The decompression benchmark times how long
it takes to `tar` all 10 copies into `/dev/null`. The compression ratio is
measured by comparing the output of `df` and `du`. See the benchmark file
[1] for details. I benchmarked multiple zstd compression levels, although
the patch uses zstd level 1.

| Method  | Ratio | Compression MB/s | Decompression speed |
|---------|-------|------------------|---------------------|
| None    |  0.99 |              504 |                 686 |
| lzo     |  1.66 |              398 |                 442 |
| zlib    |  2.58 |               65 |                 241 |
| zstd 1  |  2.57 |              260 |                 383 |
| zstd 3  |  2.71 |              174 |                 408 |
| zstd 6  |  2.87 |               70 |                 398 |
| zstd 9  |  2.92 |               43 |                 406 |
| zstd 12 |  2.93 |               21 |                 408 |
| zstd 15 |  3.01 |               11 |                 354 |

The next benchmark first copies `linux-4.11.6.tar` [4] to btrfs. Then it
measures the compression ratio, extracts the tar, and deletes the tar.
Then it measures the compression ratio again, and `tar`s the extracted
files into `/dev/null`. See the benchmark file [2] for details.

| Method | Tar Ratio | Extract Ratio | Copy (s) | Extract (s)| Read (s) |
|--------|-----------|---------------|----------|------------|----------|
| None   |      0.97 |          0.78 |    0.981 |      5.501 |    8.807 |
| lzo    |      2.06 |          1.38 |    1.631 |      8.458 |    8.585 |
| zlib   |      3.40 |          1.86 |    7.750 |     21.544 |   11.744 |
| zstd 1 |      3.57 |          1.85 |    2.579 |     11.479 |    9.389 |

[1] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-benchmark.sh
[2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-extract-benchmark.sh
[3] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
[4] https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.11.6.tar.xz

zstd source repository: https://github.com/facebook/zstd

Signed-off-by: Nick Terrell <terrelln@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-08-15 09:02:09 -07:00
David Howells
bc98a42c1f VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
Firstly by applying the following with coccinelle's spatch:

	@@ expression SB; @@
	-SB->s_flags & MS_RDONLY
	+sb_rdonly(SB)

to effect the conversion to sb_rdonly(sb), then by applying:

	@@ expression A, SB; @@
	(
	-(!sb_rdonly(SB)) && A
	+!sb_rdonly(SB) && A
	|
	-A != (sb_rdonly(SB))
	+A != sb_rdonly(SB)
	|
	-A == (sb_rdonly(SB))
	+A == sb_rdonly(SB)
	|
	-!(sb_rdonly(SB))
	+!sb_rdonly(SB)
	|
	-A && (sb_rdonly(SB))
	+A && sb_rdonly(SB)
	|
	-A || (sb_rdonly(SB))
	+A || sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) != A
	+sb_rdonly(SB) != A
	|
	-(sb_rdonly(SB)) == A
	+sb_rdonly(SB) == A
	|
	-(sb_rdonly(SB)) && A
	+sb_rdonly(SB) && A
	|
	-(sb_rdonly(SB)) || A
	+sb_rdonly(SB) || A
	)

	@@ expression A, B, SB; @@
	(
	-(sb_rdonly(SB)) ? 1 : 0
	+sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) ? A : B
	+sb_rdonly(SB) ? A : B
	)

to remove left over excess bracketage and finally by applying:

	@@ expression A, SB; @@
	(
	-(A & MS_RDONLY) != sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
	|
	-(A & MS_RDONLY) == sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
	)

to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-17 08:45:34 +01:00
Linus Torvalds
bc243704fb Merge branch 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
 "We've identified and fixed a silent corruption (introduced by code in
  the first pull), a fixup after the blk_status_t merge and two fixes to
  incremental send that Filipe has been hunting for some time"

* 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix unexpected return value of bio_readpage_error
  btrfs: btrfs_create_repair_bio never fails, skip error handling
  btrfs: cloned bios must not be iterated by bio_for_each_segment_all
  Btrfs: fix write corruption due to bio cloning on raid5/6
  Btrfs: incremental send, fix invalid memory access
  Btrfs: incremental send, fix invalid path for link commands
2017-07-14 22:55:52 -07:00
David Sterba
c09abff87f btrfs: cloned bios must not be iterated by bio_for_each_segment_all
We've started using cloned bios more in 4.13, there are some specifics
regarding the iteration.  Filipe found [1] that the raid56 iterated a
cloned bio using bio_for_each_segment_all, which is incorrect. The
cloned bios have wrong bi_vcnt and this could lead to silent
corruptions.  This patch adds assertions to all remaining
bio_for_each_segment_all cases.

[1] https://patchwork.kernel.org/patch/9838535/

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-14 20:39:31 +02:00
Linus Torvalds
a4c20b9a57 Merge branch 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu updates from Tejun Heo:
 "These are the percpu changes for the v4.13-rc1 merge window. There are
  a couple visibility related changes - tracepoints and allocator stats
  through debugfs, along with __ro_after_init markings and a cosmetic
  rename in percpu_counter.

  Please note that the simple O(#elements_in_the_chunk) area allocator
  used by percpu allocator is again showing scalability issues,
  primarily with bpf allocating and freeing large number of counters.
  Dennis is working on the replacement allocator and the percpu
  allocator will be seeing increased churns in the coming cycles"

* 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu: fix static checker warnings in pcpu_destroy_chunk
  percpu: fix early calls for spinlock in pcpu_stats
  percpu: resolve err may not be initialized in pcpu_alloc
  percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch
  percpu: add tracepoint support for percpu memory
  percpu: expose statistics about percpu memory via debugfs
  percpu: migrate percpu data structures to internal header
  percpu: add missing lockdep_assert_held to func pcpu_free_area
  mark most percpu globals as __ro_after_init
2017-07-06 08:59:41 -07:00
Linus Torvalds
8c27cb3566 Merge branch 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "The core updates improve error handling (mostly related to bios), with
  the usual incremental work on the GFP_NOFS (mis)use removal,
  refactoring or cleanups. Except the two top patches, all have been in
  for-next for an extensive amount of time.

  User visible changes:

   - statx support

   - quota override tunable

   - improved compression thresholds

   - obsoleted mount option alloc_start

  Core updates:

   - bio-related updates:
       - faster bio cloning
       - no allocation failures
       - preallocated flush bios

   - more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates

   - prep work for btree_inode removal

   - dir-item validation

   - qgoup fixes and updates

   - cleanups:
       - removed unused struct members, unused code, refactoring
       - argument refactoring (fs_info/root, caller -> callee sink)
       - SEARCH_TREE ioctl docs"

* 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits)
  btrfs: Remove false alert when fiemap range is smaller than on-disk extent
  btrfs: Don't clear SGID when inheriting ACLs
  btrfs: fix integer overflow in calc_reclaim_items_nr
  btrfs: scrub: fix target device intialization while setting up scrub context
  btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
  btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
  btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
  btrfs: qgroup: Return actually freed bytes for qgroup release or free data
  btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
  btrfs: qgroup: Add quick exit for non-fs extents
  Btrfs: rework delayed ref total_bytes_pinned accounting
  Btrfs: return old and new total ref mods when adding delayed refs
  Btrfs: always account pinned bytes when dropping a tree block ref
  Btrfs: update total_bytes_pinned when pinning down extents
  Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT()
  Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64
  btrfs: fix validation of XATTR_ITEM dir items
  btrfs: Verify dir_item in iterate_object_props
  btrfs: Check name_len before in btrfs_del_root_ref
  btrfs: Check name_len before reading btrfs_get_name
  ...
2017-07-05 16:41:23 -07:00
David Sterba
66b4993e95 btrfs: move dev stats accounting out of wait_dev_flush
We should really just wait in wait_dev_flush and let the caller decide
what to do with the error value.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:03:39 +02:00
David Sterba
2980d5745f btrfs: account as waiting for IO, while waiting fot the flush bio completion
Similar to what submit_bio_wait does, we should account for IO while
waiting for a bio completion. This has marginal visible effects, flush
bio is short-lived.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:03:39 +02:00
David Sterba
e0ae999414 btrfs: preallocate device flush bio
For devices that support flushing, we allocate a bio, submit, wait for
it and then free it. The bio allocation does not fail so ENOMEM is not a
problem but we still may unnecessarily stress the allocation subsystem.

Instead, we can allocate the bio at the same time we allocate the device
and reuse it each time we need to flush the barriers. The bio is reset
before each use. Reference counting is simplified to just device
allocation (get) and freeing (put).

The bio used to be submitted through the integrity checker which will
find out that bio has no data attached and call submit_bio.

Status of the bio in flight needs to be tracked separately in case the
device caches get switched off between write and wait.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:03:38 +02:00
Nikolay Borisov
104b4e5139 percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch
Currently, percpu_counter_add is a wrapper around __percpu_counter_add
which is preempt safe due to explicit calls to preempt_disable.  Given
how __ prefix is used in percpu related interfaces, the naming
unfortunately creates the false sense that __percpu_counter_add is
less safe than percpu_counter_add.  In terms of context-safety,
they're equivalent.  The only difference is that the __ version takes
a batch parameter.

Make this a bit more explicit by just renaming __percpu_counter_add to
percpu_counter_add_batch.

This patch doesn't cause any functional changes.

tj: Minor updates to patch description for clarity.  Cosmetic
    indentation updates.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: linux-mm@kvack.org
Cc: "David S. Miller" <davem@davemloft.net>
2017-06-20 15:42:32 -04:00
David Sterba
fac03c8dae btrfs: move fs_info::fs_frozen to the flags
We can keep the state among the other fs_info flags, there's no reason
why fs_frozen would need to be separate.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-20 14:22:42 +02:00
Anand Jain
4fc6441aac btrfs: wait part of the write_dev_flush() can be separated out
Submit and wait parts of write_dev_flush() can be split into two
separate functions for better readability.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
Anand Jain
cea7c8bf77 btrfs: remove redundant null bdev counting during flush submission
There is no extra benefit to count null bdev during the submit loop,
as these null devices will be anyway checked during command
completion device loop just after the submit loop. We are holding the
device_list_mutex, the device->bdev status won't change in between.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00