linux-stable/fs/btrfs
Ethan Lien e73e81b6d0 btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
[Problem description and how we fix it]
We should balance dirty metadata pages at the end of
btrfs_finish_ordered_io, since a small, unmergeable random write can
potentially produce dirty metadata which is multiple times larger than
the data itself. For example, a small, unmergeable 4KiB write may
produce:

    16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree

Although we do call balance dirty pages in write side, but in the
buffered write path, most metadata are dirtied only after we reach the
dirty background limit (which by far only counts dirty data pages) and
wakeup the flusher thread. If there are many small, unmergeable random
writes spread in a large btree, we'll find a burst of dirty pages
exceeds the dirty_bytes limit after we wakeup the flusher thread - which
is not what we expect. In our machine, it caused out-of-memory problem
since a page cannot be dropped if it is marked dirty.

Someone may worry about we may sleep in btrfs_btree_balance_dirty_nodelay,
but since we do btrfs_finish_ordered_io in a separate worker, it will not
stop the flusher consuming dirty pages. Also, we use different worker for
metadata writeback endio, sleep in btrfs_finish_ordered_io help us throttle
the size of dirty metadata pages.

[Reproduce steps]
To reproduce the problem, we need to do 4KiB write randomly spread in a
large btree. In our 2GiB RAM machine:

1) Create 4 subvolumes.
2) Run fio on each subvolume:

   [global]
   direct=0
   rw=randwrite
   ioengine=libaio
   bs=4k
   iodepth=16
   numjobs=1
   group_reporting
   size=128G
   runtime=1800
   norandommap
   time_based
   randrepeat=0

3) Take snapshot on each subvolume and repeat fio on existing files.
4) Repeat step (3) until we get large btrees.
   In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of
   metadata in each subvolume tree and 12GiB of metadata in extent tree.
5) Stop all fio, take snapshot again, and wait until all delayed work is
   completed.
6) Start all fio. Few seconds later we hit OOM when the flusher starts
   to work.

It can be reproduced even when using nocow write.

Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:53 +02:00
..
tests btrfs: tests: drop newline from test_msg strings 2018-05-29 18:12:52 +02:00
acl.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
async-thread.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
async-thread.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
backref.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
backref.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
btrfs_inode.h Btrfs: renumber BTRFS_INODE_ runtime flags and switch to enums 2018-05-28 18:23:59 +02:00
check-integrity.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
check-integrity.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
compression.c btrfs: replace waitqueue_actvie with cond_wake_up 2018-05-28 18:23:09 +02:00
compression.h btrfs: compression: Add linux/sizes.h for compression.h 2018-05-29 18:13:00 +02:00
ctree.c Btrfs: remove unused check of skip_locking 2018-05-30 16:46:52 +02:00
ctree.h btrfs: Remove fs_info argument from btrfs_uuid_tree_rem 2018-05-30 16:46:53 +02:00
dedupe.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
delayed-inode.c btrfs: replace waitqueue_actvie with cond_wake_up 2018-05-28 18:23:09 +02:00
delayed-inode.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
delayed-ref.c btrfs: split delayed ref head initialization and addition 2018-05-28 18:07:32 +02:00
delayed-ref.h btrfs: Drop fs_info parameter from btrfs_merge_delayed_refs 2018-05-28 18:07:20 +02:00
dev-replace.c btrfs: drop uuid_mutex in btrfs_dev_replace_finishing 2018-05-28 18:23:16 +02:00
dev-replace.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
dir-item.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
disk-io.c Btrfs: add parent_transid parameter to veirfy_level_key 2018-05-30 16:46:43 +02:00
disk-io.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
export.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
export.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
extent-tree.c btrfs: drop unused space_info parameter from create_space_info 2018-05-30 16:46:43 +02:00
extent_io.c btrfs: Remove tree argument from extent_writepages 2018-05-28 18:07:20 +02:00
extent_io.h btrfs: Remove tree argument from extent_writepages 2018-05-28 18:07:20 +02:00
extent_map.c btrfs: use fs_info for btrfs_handle_em_exist tracepoint 2018-05-28 18:07:17 +02:00
extent_map.h btrfs: use fs_info for btrfs_handle_em_exist tracepoint 2018-05-28 18:07:17 +02:00
file-item.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
file.c btrfs: Fix wrong btrfs_delalloc_release_extents parameter 2018-04-18 16:46:57 +02:00
free-space-cache.c Btrfs: stop creating orphan items for truncate 2018-05-28 18:23:46 +02:00
free-space-cache.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
free-space-tree.c btrfs: Remove fs_info argument from populate_free_space_tree 2018-05-28 18:07:36 +02:00
free-space-tree.h btrfs: Remove fs_info argument from add_to_free_space_tree 2018-05-28 18:07:36 +02:00
inode-item.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
inode-map.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
inode-map.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
inode.c btrfs: balance dirty metadata pages in btrfs_finish_ordered_io 2018-05-30 16:46:53 +02:00
ioctl.c btrfs: Remove fs_info argument from btrfs_uuid_tree_rem 2018-05-30 16:46:53 +02:00
Kconfig btrfs: add SPDX header to Kconfig 2018-04-12 16:29:55 +02:00
locking.c btrfs: replace waitqueue_actvie with cond_wake_up 2018-05-28 18:23:09 +02:00
locking.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
lzo.c btrfs: lzo: Harden inline lzo compressed extent decompression 2018-05-30 16:46:43 +02:00
Makefile btrfs: Remove custom crc32c init code 2018-03-26 15:09:39 +02:00
math.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
ordered-data.c btrfs: replace waitqueue_actvie with cond_wake_up 2018-05-28 18:23:09 +02:00
ordered-data.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
orphan.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
print-tree.c btrfs: print-tree: Add eb locking status output for debug build 2018-05-28 18:07:26 +02:00
print-tree.h btrfs: print-tree: debugging output enhancement 2018-04-20 19:18:16 +02:00
props.c btrfs: property: Set incompat flag if lzo/zstd compression is set 2018-05-17 14:18:25 +02:00
props.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
qgroup.c btrfs: qgroup: show more meaningful qgroup_rescan_init error message 2018-05-30 16:46:43 +02:00
qgroup.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
raid56.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
raid56.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
rcu-string.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
reada.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
ref-verify.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
ref-verify.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
relocation.c btrfs: fix describe_relocation when printing unknown flags 2018-05-28 18:24:11 +02:00
root-tree.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
scrub.c btrfs: trace: Add trace points for unused block groups 2018-05-28 18:07:28 +02:00
send.c btrfs: incremental send, improve rmdir performance for large directory 2018-05-28 18:07:32 +02:00
send.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
struct-funcs.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
super.c Btrfs: allow empty subvol= again 2018-05-28 18:24:13 +02:00
sysfs.c btrfs: sysfs: Add entry which shows if rmdir can work on subvolumes 2018-05-28 18:23:41 +02:00
sysfs.h btrfs: sysfs: Use enum/define value for feature array definitions 2018-05-28 18:23:39 +02:00
transaction.c btrfs: Remove fs_info argument from btrfs_uuid_tree_add 2018-05-30 16:46:52 +02:00
transaction.h btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT 2018-04-18 16:46:47 +02:00
tree-checker.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
tree-checker.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
tree-defrag.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
tree-log.c btrfs: replace waitqueue_actvie with cond_wake_up 2018-05-28 18:23:09 +02:00
tree-log.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
ulist.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
ulist.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
uuid-tree.c btrfs: Remove fs_info argument from btrfs_uuid_tree_rem 2018-05-30 16:46:53 +02:00
volumes.c btrfs: Remove fs_info argument from btrfs_uuid_tree_add 2018-05-30 16:46:52 +02:00
volumes.h btrfs: remove redundant btrfs_balance_control::fs_info 2018-05-28 18:07:30 +02:00
xattr.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
xattr.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
zlib.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
zstd.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00