Commit graph

768611 commits

Author SHA1 Message Date
Josef Bacik
3c4276936f Btrfs: fix btrfs_write_inode vs delayed iput deadlock
We recently ran into the following deadlock involving
btrfs_write_inode():

[  +0.005066]  __schedule+0x38e/0x8c0
[  +0.007144]  schedule+0x36/0x80
[  +0.006447]  bit_wait+0x11/0x60
[  +0.006446]  __wait_on_bit+0xbe/0x110
[  +0.007487]  ? bit_wait_io+0x60/0x60
[  +0.007319]  __inode_wait_for_writeback+0x96/0xc0
[  +0.009568]  ? autoremove_wake_function+0x40/0x40
[  +0.009565]  inode_wait_for_writeback+0x21/0x30
[  +0.009224]  evict+0xb0/0x190
[  +0.006099]  iput+0x1a8/0x210
[  +0.006103]  btrfs_run_delayed_iputs+0x73/0xc0
[  +0.009047]  btrfs_commit_transaction+0x799/0x8c0
[  +0.009567]  btrfs_write_inode+0x81/0xb0
[  +0.008008]  __writeback_single_inode+0x267/0x320
[  +0.009569]  writeback_sb_inodes+0x25b/0x4e0
[  +0.008702]  wb_writeback+0x102/0x2d0
[  +0.007487]  wb_workfn+0xa4/0x310
[  +0.006794]  ? wb_workfn+0xa4/0x310
[  +0.007143]  process_one_work+0x150/0x410
[  +0.008179]  worker_thread+0x6d/0x520
[  +0.007490]  kthread+0x12c/0x160
[  +0.006620]  ? put_pwq_unlocked+0x80/0x80
[  +0.008185]  ? kthread_park+0xa0/0xa0
[  +0.007484]  ? do_syscall_64+0x53/0x150
[  +0.007837]  ret_from_fork+0x29/0x40

Writeback calls:

btrfs_write_inode
  btrfs_commit_transaction
    btrfs_run_delayed_iputs

If iput() is called on that same inode, evict() will wait for writeback
forever.

btrfs_write_inode() was originally added way back in 4730a4bc5b
("btrfs_dirty_inode") to support O_SYNC writes. However, ->write_inode()
hasn't been used for O_SYNC since 148f948ba8 ("vfs: Introduce new
helpers for syncing after writing to O_SYNC file or IS_SYNC inode"), so
btrfs_write_inode() is actually unnecessary (and leads to a bunch of
unnecessary commits). Get rid of it, which also gets rid of the
deadlock.

CC: stable@vger.kernel.org # 3.2+
Signed-off-by: Josef Bacik <jbacik@fb.com>
[Omar: new commit message]
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:58 +02:00
Nikolay Borisov
97aff912a2 btrfs: Remove fs_info from btrfs_finish_chunk_alloc
It can be referenced from the passed transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:58 +02:00
Nikolay Borisov
f4208794d0 btrfs: Remove fs_info form btrfs_free_chunk
It can be referenced from the passed transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:57 +02:00
Nikolay Borisov
4f5ad7bd63 btrfs: Remove fs_info from btrfs_destroy_dev_replace_tgtdev
This function is always passed a well-formed tgtdevice so the fs_info
can be referenced from there.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:57 +02:00
Nikolay Borisov
d6507cf1e2 btrfs: Remove fs_info from btrfs_assign_next_active_device
It can be referenced from the passed 'device' argument which is always
a well-formed device.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:57 +02:00
Nikolay Borisov
5495f195fc btrfs: remove fs_info argument from update_dev_stat_item
It can be referenced from the passed transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:57 +02:00
Nikolay Borisov
68a9db5f23 btrfs: Remove fs_info from btrfs_rm_dev_replace_remove_srcdev
It can be referenced from the passed srcdev argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:57 +02:00
Nikolay Borisov
8e87e85627 btrfs: Remove fs_info argument from btrfs_add_dev_item
It can be referenced form the passed transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:56 +02:00
Qu Wenruo
5e23a6fea6 btrfs: extent-tree: Remove dead alignment check
In find_free_extent() under checks: label, we have the following code:

		search_start = ALIGN(offset, fs_info->stripesize);
		/* move on to the next group */
		if (search_start + num_bytes >
		    block_group->key.objectid + block_group->key.offset) {
			btrfs_add_free_space(block_group, offset, num_bytes);
			goto loop;
		}
		if (offset < search_start)
			btrfs_add_free_space(block_group, offset,
					     search_start - offset);
		BUG_ON(offset > search_start);

However ALIGN() is rounding up, thus @search_start >= @offset and that
BUG_ON() will never be triggered.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:56 +02:00
Filipe Manana
ca5d2ba1ae Btrfs: remove unused key assignment when doing a full send
At send.c:full_send_tree() we were setting the 'key' variable in the loop
while never using it later. We were also using two btrfs_key variables
to store the initial key for search and the key found in every iteration
of the loop. So remove this useless key assignment and use the same
btrfs_key variable to store the initial search key and the key found in
each iteration. This was introduced in the initial send commit but was
never used (commit 31db9f7c23 ("Btrfs: introduce BTRFS_IOC_SEND for
btrfs send/receive").

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:56 +02:00
David Sterba
5cdc84bfde btrfs: drop extent_io_ops::set_range_writeback callback
The data and metadata callback implementation both use the same
function. We can remove the call indirection and intermediate helper
completely.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:56 +02:00
David Sterba
00032d38ea btrfs: drop extent_io_ops::merge_bio_hook callback
The data and metadata callback implementation both use the same
function. We can remove the call indirection completely.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:56 +02:00
David Sterba
05912a3c04 btrfs: drop extent_io_ops::tree_fs_info callback
All implementations of the callback are trivial and do the same and
there's only one user. Merge everything together.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:55 +02:00
David Sterba
e288c080dd btrfs: unify end_io callbacks of async_submit_bio
The end_io callbacks passed to btrfs_wq_submit_bio
(btrfs_submit_bio_done and btree_submit_bio_done) are effectively the
same code, there's no point to do the indirection. Export
btrfs_submit_bio_done and call it directly.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:55 +02:00
David Sterba
d7cbfafc4b btrfs: remove unused member async_submit_bio::bio_flags
After splitting the start and end hooks in a758781d4b ("btrfs:
separate types for submit_bio_start and submit_bio_done"), some of
the function arguments were dropped but not removed from the structure.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:55 +02:00
David Sterba
d7e8555b1d btrfs: remove unused member async_submit_bio::fs_info
Introduced by c6100a4b4e ("Btrfs: replace tree->mapping with
tree->private_data") to be used in run_one_async_done where it got
unused after 736cd52e0c ("Btrfs: remove nr_async_submits and
async_submit_draining").

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:55 +02:00
Gu Jinxiang
315409b009 btrfs: validate type when reading a chunk
Reported in https://bugzilla.kernel.org/show_bug.cgi?id=199839, with an
image that has an invalid chunk type but does not return an error.

Add chunk type check in btrfs_check_chunk_valid, to detect the wrong
type combinations.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199839
Reported-by: Xu Wen <wen.xu@gatech.edu>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:55 +02:00
Nikolay Borisov
b0132a3be5 btrfs: Rename EXTENT_BUFFER_DUMMY to EXTENT_BUFFER_UNMAPPED
EXTENT_BUFFER_DUMMY is an awful name for this flag. Buffers which have
this flag set are not in any way dummy. Rather, they are private in the
sense that are not mapped and linked to the global buffer tree. This
flag has subtle implications to the way free_extent_buffer works for
example, as well as controls whether page->mapping->private_lock is held
during extent_buffer release. Pages for an unmapped buffer cannot be
under io, nor can they be written by a 3rd party so taking the lock is
unnecessary.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ EXTENT_BUFFER_UNMAPPED, update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:55 +02:00
Nikolay Borisov
07e21c4dad btrfs: Document locking requirement via lockdep_assert_held
Remove stale comment since there is no longer an eb->eb_lock and
document the locking expectation with a lockdep_assert_held statement.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:54 +02:00
David Sterba
55ac01396a btrfs: rename btrfs_release_extent_buffer_page
The function used to release one page (and always the first one), but
not anymore since a50924e3a4 ("btrfs: drop constant param
from btrfs_release_extent_buffer_page").  Update the name and comment.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:54 +02:00
Nikolay Borisov
d64766fdf9 btrfs: Refactor loop in btrfs_release_extent_buffer_page
The purpose of the function is to free all the pages comprising an
extent buffer. This can be achieved with a simple for loop rather than
the slightly more involved 'do {} while' construct. So rewrite the
loop using a 'for' construct. Additionally we can never have an
extent_buffer that has 0 pages so remove the check for index == 0. No
functional changes.

The reversed order used to have a meaning in the past where the first
page served as a blocking point for several callers. See eg
4f2de97ace ("Btrfs: set page->private to the eb").

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:54 +02:00
Nikolay Borisov
b16d011e79 btrfs: Reword dodgy comments in alloc_extent_buffer
Commit eb14ab8ed2 ("Btrfs: fix page->private races") fixed a genuine
race between extent buffer initialisation and btree_releasepage.
Unfortunately as the code has evolved the comments weren't changed which
made them slightly wrong and they weren't very clear in the fist place.
Fix this by (hopefully) rewording them in a more approachable manner.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:54 +02:00
Nikolay Borisov
28187ae569 btrfs: Simplify page unlocking in alloc_extent_buffer
Current version of the page unlocking code was added in
727011e07c ("Btrfs: allow metadata blocks larger than the page size")
but even in this commit that particular flag was never used per-se. In
fact, btrfs only uses PageChecked for data pages to identify pages
which have been dirtied but don't have ORDERED bit set. For more
information see 247e743cbe ("Btrfs: Use async helpers to deal with
pages that have been improperly dirtied").

However, this doesn't apply to extent buffer pages. The important bit
here is that the pages are unlocked AFTER the extent buffer has been
properly recorded in the radix tree to avoid races with
btree_releasepage. Let's exploit this fact and simplify the page
unlocking sequence by unlocking the pages in-order and removing the
redundant PageChecked flag setting/clearing.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:54 +02:00
Qu Wenruo
8b9b6f2554 btrfs: scrub: cleanup the remaining nodatasum fixup code
Remove the remaining code that misused the page cache pages during
device replace and could cause data corruption for compressed nodatasum
extents. Such files do not normally exist but there's a bug that allows
this combination and the corruption was exposed by device replace fixup
code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:53 +02:00
David Sterba
46df06b85e btrfs: refactor block group replication factor calculation to a helper
There are many places that open code the duplicity factor of the block
group profiles, create a common helper. This can be easily extended for
more copies.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:53 +02:00
Anand Jain
321a4bf72b btrfs: use the assigned fs_devices instead of the dereference
We have assigned the %fs_info->fs_devices in %fs_devices as its not
modified just use it for the mutex_lock().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:53 +02:00
Lu Fengqi
62088ca742 btrfs: qgroup: Drop fs_info parameter from qgroup_rescan_leaf
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:53 +02:00
Lu Fengqi
a937742250 btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_inherit
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:53 +02:00
Lu Fengqi
280f8bd2cb btrfs: qgroup: Drop fs_info parameter from btrfs_run_qgroups
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:52 +02:00
Lu Fengqi
8696d76045 btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_account_extent
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:52 +02:00
Lu Fengqi
deb4062743 btrfs: qgroup: Drop root parameter from btrfs_qgroup_trace_subtree
The fs_info can be fetched from the transaction handle directly.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:52 +02:00
Lu Fengqi
8d38d7eb7b btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_trace_leaf_items
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:52 +02:00
Lu Fengqi
a95f3aafd6 btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_trace_extent
It can be fetched from the transaction handle. In addition, remove the
WARN_ON(trans == NULL) because it's not possible to hit this condition.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:52 +02:00
Lu Fengqi
f0042d5e92 btrfs: qgroup: Drop fs_info parameter from btrfs_limit_qgroup
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:51 +02:00
Lu Fengqi
3efbee1d00 btrfs: qgroup: Drop fs_info parameter from btrfs_remove_qgroup
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:51 +02:00
Lu Fengqi
49a05ecde3 btrfs: qgroup: Drop fs_info parameter from btrfs_create_qgroup
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:51 +02:00
Lu Fengqi
39616c2735 btrfs: qgroup: Drop fs_info parameter from btrfs_del_qgroup_relation
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:51 +02:00
Lu Fengqi
6b36f1aa5c btrfs: qgroup: Drop fs_info parameter from __del_qgroup_relation
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:51 +02:00
Lu Fengqi
9f8a6ce6ba btrfs: qgroup: Drop fs_info parameter from btrfs_add_qgroup_relation
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:51 +02:00
Lu Fengqi
2e980acdd8 btrfs: qgroup: Drop quota_root and fs_info parameters from update_qgroup_status_item
They can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:50 +02:00
Lu Fengqi
3e07e9a09f btrfs: qgroup: Drop root parameter from update_qgroup_info_item
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:50 +02:00
Lu Fengqi
ac8a866af1 btrfs: qgroup: Drop root parameter from update_qgroup_limit_item
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:50 +02:00
Lu Fengqi
69104618f4 btrfs: qgroup: Drop quota_root parameter from del_qgroup_item
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:50 +02:00
Lu Fengqi
99d7f09ac0 btrfs: qgroup: Drop quota_root parameter from del_qgroup_relation_item
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:50 +02:00
Lu Fengqi
711169c40f btrfs: qgroup: Drop quota_root parameter from add_qgroup_relation_item
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:49 +02:00
Anand Jain
fa59f27c8c btrfs: rename btrfs_parse_early_options
Rename btrfs_parse_early_options() to btrfs_parse_device_options(). As
btrfs_parse_early_options() parses the -o device options and scan the
device provided. So this rename specifies its action. Also the function
name is in line with btrfs_parse_subvol_options().
No functional changes.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:49 +02:00
Lu Fengqi
c8389d4c0d btrfs: qgroup: cleanup the unused srcroot from btrfs_qgroup_inherit
Since commit 0b246afa62 ("btrfs: root->fs_info cleanup, add fs_info
convenience variables"), the srcroot is no longer used to get
fs_info::nodesize.  In fact, it can be dropped after commit 707e8a0715
("btrfs: use nodesize everywhere, kill leafsize").

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:49 +02:00
Qu Wenruo
031f24da2c btrfs: Use btrfs_mark_bg_unused to replace open code
Introduce a small helper, btrfs_mark_bg_unused(), to acquire locks and
add a block group to unused_bgs list.

No functional modification, and only 3 callers are involved.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:49 +02:00
Nikolay Borisov
2556fbb0be btrfs: Rewrite retry logic in do_chunk_alloc
do_chunk_alloc implements logic to detect whether there is currently
pending chunk allocation (by means of space_info->chunk_alloc being
set) and if so it loops around to the 'again' label. Additionally,
based on the state of the space_info (e.g. whether it's full or not)
and the return value of should_alloc_chunk() it decides whether this
is a "hard" error (ENOSPC) or we can just return 0.

This patch refactors all of this:

1. Put order to the scattered ifs handling the various cases in an
easy-to-read if {} else if{} branches. This makes clear the various
cases we are interested in handling.

2. Call should_alloc_chunk only once and use the result in the
if/else if constructs. All of this is done under space_info->lock, so
even before multiple calls of should_alloc_chunk were unnecessary.

3. Rewrite the "do {} while()" loop currently implemented via label
into an explicit loop construct.

4. Move the mutex locking for the case where the caller is the one doing
the allocation. For the case where the caller needs to wait a concurrent
allocation, introduce a pair of mutex_lock/mutex_unlock to act as a
barrier and reword the comment.

5. Switch local vars to bool type where pertinent.

All in all this shouldn't introduce any functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:49 +02:00
Ethan Lien
dec59fa3a7 btrfs: use customized batch size for total_bytes_pinned
In commit b150a4f10d ("Btrfs: use a percpu to keep track of possibly
pinned bytes") we use total_bytes_pinned to track how many bytes we are
going to free in this transaction. When we are close to ENOSPC, we check it
and know if we can make the allocation by commit the current transaction.
For every data/metadata extent we are going to free, we add
total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and
release it in unpin_extent_range() when we finish the transaction. So this
is a variable we frequently update but rarely read - just the suitable
use of percpu_counter. But in previous commit we update total_bytes_pinned
by default 32 batch size, making every update essentially a spin lock
protected update. Since every spin lock/unlock operation involves syncing
a globally used variable and some kind of barrier in a SMP system, this is
more expensive than using total_bytes_pinned as a simple atomic64_t.

So fix this by using a customized batch size. Since we only read
total_bytes_pinned when we are close to ENOSPC and fail to allocate new
chunk, we can use a really large batch size and have nearly no penalty
in most cases.

[Test]
We tested the patch on a 4-cores x86 machine:

1. fallocate a 16GiB size test file
2. take snapshot (so all following writes will be COW)
3. run a 180 sec, 4 jobs, 4K random write fio on test file

We also added a temporary lockdep class on percpu_counter's spin lock
used by total_bytes_pinned to track it by lock_stat.

[Results]
unpatched:
lock_stat version 0.4
-----------------------------------------------------------------------
                              class name    con-bounces    contentions
waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces
acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg

               total_bytes_pinned_percpu:            82             82
        0.21           0.61          29.46           0.36         298340
      635973           0.09          11.01      173476.25           0.27

patched:
lock_stat version 0.4
-----------------------------------------------------------------------
                              class name    con-bounces    contentions
waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces
acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg

               total_bytes_pinned_percpu:             1              1
        0.62           0.62           0.62           0.62          13601
       31542           0.14           9.61       11016.90           0.35

[Analysis]
Since the spin lock only protects a single in-memory variable, the
contentions (number of lock acquisitions that had to wait) in both
unpatched and patched version are low. But when we see acquisitions and
acq-bounces, we get much lower counts in patched version. Here the most
important metric is acq-bounces. It means how many times the lock gets
transferred between different cpus, so the patch can really reduce
cacheline bouncing of spin lock (also the global counter of percpu_counter)
in a SMP system.

Fixes: b150a4f10d ("Btrfs: use a percpu to keep track of possibly pinned bytes")
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:48 +02:00