linux-stable/fs
Ethan Lien e73e81b6d0 btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
[Problem description and how we fix it]
We should balance dirty metadata pages at the end of
btrfs_finish_ordered_io, since a small, unmergeable random write can
potentially produce dirty metadata which is multiple times larger than
the data itself. For example, a small, unmergeable 4KiB write may
produce:

    16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
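
As a rough worked example of the figures above, the following sketch adds up the dirty metadata one 4KiB write can leave behind. It assumes the 16KiB tree block size from the example and that the write touches a distinct leaf (and optionally a distinct node) in each of the three trees; the names are illustrative only and do not come from the kernel.

```python
KIB = 1024
DATA_WRITE = 4 * KIB      # size of the small, unmergeable write
NODESIZE = 16 * KIB       # tree block size used in the example above
TREES = ("subvolume", "checksum", "extent")


def dirty_metadata_bytes(include_nodes: bool) -> int:
    """Dirty metadata for one write: one leaf per tree, plus one node if asked."""
    blocks_per_tree = 2 if include_nodes else 1
    return len(TREES) * blocks_per_tree * NODESIZE


leaf_only = dirty_metadata_bytes(include_nodes=False)
with_nodes = dirty_metadata_bytes(include_nodes=True)

# Amplification relative to the 4KiB of data actually written.
print(leaf_only // KIB, "KiB,", leaf_only // DATA_WRITE, "x the data")
print(with_nodes // KIB, "KiB,", with_nodes // DATA_WRITE, "x the data")
```

Under these assumptions the write dirties 48KiB of metadata (12 times the data) when only leaves are touched, and 96KiB (24 times) when a node in each tree is dirtied as well.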

Although we do call balance dirty pages on the write side, in the
buffered write path most metadata is dirtied only after we reach the
dirty background limit (which so far counts only dirty data pages) and
wake up the flusher thread. If there are many small, unmergeable random
writes spread across a large btree, a burst of dirty pages exceeds the
dirty_bytes limit after the flusher thread is woken up, which is not
what we expect. On our machine this caused an out-of-memory problem,
since a page cannot be dropped while it is marked dirty.
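
The burst described above can be illustrated with a deliberately crude model. It is a sketch under stated assumptions, not a description of the kernel's accounting: every completed 4KiB write dirties 48KiB of metadata, nothing else cleans metadata in the meantime, and a balance call at the end of each completion writes back all dirty metadata once an illustrative 32MiB threshold is exceeded. The function name and all parameters are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def peak_dirty_metadata(n_writes, meta_per_write, meta_limit, balance_in_endio):
    """Return the peak amount of dirty metadata (bytes) in this toy model."""
    dirty = 0
    peak = 0
    for _ in range(n_writes):
        dirty += meta_per_write          # completing one write dirties metadata
        peak = max(peak, dirty)
        if balance_in_endio and dirty > meta_limit:
            dirty = 0                    # model: balancing writes metadata back
    return peak


N_WRITES = 65536            # 256MiB worth of 4KiB writes (assumed workload)
META_PER_WRITE = 48 * KIB   # three 16KiB leaves per write, as in the example
META_LIMIT = 32 * MIB       # illustrative threshold, an assumption of this sketch

unthrottled = peak_dirty_metadata(N_WRITES, META_PER_WRITE, META_LIMIT, False)
throttled = peak_dirty_metadata(N_WRITES, META_PER_WRITE, META_LIMIT, True)

print(unthrottled // MIB, "MiB peak without balancing")
print(throttled // MIB, "MiB peak with balancing")
```

In this model the unthrottled peak grows linearly with the number of completed writes, while the throttled peak stays within one write's worth of metadata above the threshold.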

One may worry that we can sleep in btrfs_btree_balance_dirty_nodelay,
but since btrfs_finish_ordered_io runs in a separate worker, it will not
stop the flusher from consuming dirty pages. Also, because we use a
different worker for metadata writeback endio, sleeping in
btrfs_finish_ordered_io helps us throttle the amount of dirty metadata
pages.

[Reproduce steps]
To reproduce the problem, we need to issue 4KiB writes randomly spread
across a large btree. On our machine with 2GiB of RAM:

1) Create 4 subvolumes.
2) Run fio on each subvolume:

   [global]
   direct=0
   rw=randwrite
   ioengine=libaio
   bs=4k
   iodepth=16
   numjobs=1
   group_reporting
   size=128G
   runtime=1800
   norandommap
   time_based
   randrepeat=0

3) Take a snapshot of each subvolume and repeat fio on the existing files.
4) Repeat step (3) until we get large btrees.
   In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of
   metadata in each subvolume tree and 12GiB of metadata in the extent tree.
5) Stop all fio, take a snapshot again, and wait until all delayed work is
   completed.
6) Start all fio. A few seconds later we hit OOM when the flusher starts
   to work.

It can be reproduced even when using nocow writes.
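
The setup in steps 1 to 3 could be scripted along the following lines. This is only a sketch: it builds the command strings and the fio job text without executing anything. The mount point, the subvolume and snapshot names, the `[randwrite]` job section and the `directory=` option are assumptions added for illustration; the `[global]` options are copied from step 2.

```python
from typing import List

# fio options copied from the [global] section in step 2.
FIO_GLOBAL = """[global]
direct=0
rw=randwrite
ioengine=libaio
bs=4k
iodepth=16
numjobs=1
group_reporting
size=128G
runtime=1800
norandommap
time_based
randrepeat=0
"""


def job_file(subvol_path: str) -> str:
    """fio job text for one subvolume (job section and directory= are assumed)."""
    return FIO_GLOBAL + "\n[randwrite]\ndirectory=" + subvol_path + "\n"


def subvol_paths(mount: str, count: int = 4) -> List[str]:
    """Hypothetical subvolume paths under the given mount point."""
    return [f"{mount}/subvol{i}" for i in range(count)]


def create_commands(mount: str, count: int = 4) -> List[str]:
    """Step 1: one 'btrfs subvolume create' command per subvolume."""
    return [f"btrfs subvolume create {p}" for p in subvol_paths(mount, count)]


def snapshot_commands(mount: str, round_no: int, count: int = 4) -> List[str]:
    """Step 3: snapshot every subvolume; snapshot names are made up here."""
    return [
        f"btrfs subvolume snapshot {p} {mount}/snap{round_no}_{i}"
        for i, p in enumerate(subvol_paths(mount, count))
    ]


if __name__ == "__main__":
    for cmd in create_commands("/mnt/btrfs"):
        print(cmd)
    for cmd in snapshot_commands("/mnt/btrfs", round_no=1):
        print(cmd)
    print(job_file("/mnt/btrfs/subvol0"))
```

Each job text would be saved to a file and passed to fio, one instance per subvolume; repeating the snapshot and fio rounds grows the btrees as described in step 4.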

Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:53 +02:00
9p fscache development 2018-04-07 09:08:24 -07:00
adfs
affs affs_lookup: switch to d_splice_alias() 2018-05-21 14:29:12 -04:00
afs afs: Fix the non-encryption of calls 2018-05-14 15:15:19 +01:00
autofs4 autofs: mount point create should honour passed in mode 2018-04-20 17:18:35 -07:00
befs befs_lookup(): use d_splice_alias() 2018-05-21 14:30:07 -04:00
bfs
btrfs btrfs: balance dirty metadata pages in btrfs_finish_ordered_io 2018-05-30 16:46:53 +02:00
cachefiles cachefiles: vfs_mkdir() might succeed leaving dentry negative unhashed 2018-05-21 14:30:10 -04:00
ceph ceph: fix iov_iter issues in ceph_direct_read_write() 2018-05-10 10:15:12 +02:00
cifs Merge candidates for 4.17-rc 2018-05-24 14:12:05 -07:00
coda vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
configfs
cramfs cramfs: Fix IS_ENABLED typo 2018-05-21 14:30:08 -04:00
crypto fscrypt: fix build with pre-4.6 gcc versions 2018-02-01 10:51:18 -05:00
debugfs debugfs_lookup(): switch to lookup_one_len_unlocked() 2018-03-29 15:07:47 -04:00
devpts devpts: comment devpts_mntget() 2018-03-14 13:31:23 +01:00
dlm net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
ecryptfs Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-05-21 11:54:57 -07:00
efivarfs efivarfs: Limit the rate for non-root to read files 2018-02-22 10:21:02 -08:00
efs
exofs iversion.h related cleanup for v4.16 2018-02-07 14:25:22 -08:00
exportfs ovl: do not try to reconnect a disconnected origin dentry 2018-04-12 12:04:49 +02:00
ext2 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-05-21 11:54:57 -07:00
ext4 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-05-21 11:54:57 -07:00
f2fs do d_instantiate/unlock_new_inode combinations safely 2018-05-11 15:36:37 -04:00
fat iversion: Rename make inode_cmp_iversion{+raw} to inode_eq_iversion{+raw} 2018-02-01 08:15:25 -05:00
freevxfs vxfs: Define usercopy region in vxfs_inode slab cache 2018-01-15 12:07:57 -08:00
fscache fscache: use appropriate radix tree accessors 2018-04-11 10:28:39 -07:00
fuse fuse: define the filesystem as untrusted 2018-03-23 06:31:37 -04:00
gfs2 GFS2: Minor improvements to comments and documentation 2018-04-12 10:07:51 -07:00
hfs
hfsplus hfsplus: stop workqueue when fill_super() failed 2018-05-18 17:17:12 -07:00
hostfs hostfs: rename do_rmdir() to hostfs_do_rmdir() 2018-04-02 20:15:53 +02:00
hpfs
hugetlbfs hugetlbfs: fix bug in pgoff overflow checking 2018-04-05 21:36:21 -07:00
isofs isofs: fix potential memory leak in mount option parsing 2018-04-16 09:47:41 +02:00
jbd2 ext4: set h_journal if there is a failure starting a reserved handle 2018-04-18 11:49:31 -04:00
jffs2 do d_instantiate/unlock_new_inode combinations safely 2018-05-11 15:36:37 -04:00
jfs do d_instantiate/unlock_new_inode combinations safely 2018-05-11 15:36:37 -04:00
kernfs kernfs: deal with kernfs_fill_super() failures 2018-05-21 14:30:08 -04:00
lockd net: Drop pernet_operations::async 2018-03-27 13:18:09 -04:00
minix treewide: simplify Kconfig dependencies for removed archs 2018-03-26 15:55:57 +02:00
nfs NFS client updates for Linux 4.17 2018-04-12 12:55:50 -07:00
nfs_common net: Drop pernet_operations::async 2018-03-27 13:18:09 -04:00
nfsd nfsd: vfs_mkdir() might succeed leaving dentry negative unhashed 2018-05-21 14:30:10 -04:00
nilfs2 do d_instantiate/unlock_new_inode combinations safely 2018-05-11 15:36:37 -04:00
nls
notify fsnotify: fix ignore mask logic in send_to_group() 2018-04-13 15:52:49 +02:00
ntfs ntfs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call 2018-03-28 01:39:02 -04:00
ocfs2 ocfs2: revert "ocfs2/o2hb: check len for bio_add_page() to avoid getting incorrect bio" 2018-05-25 18:12:10 -07:00
omfs
openpromfs
orangefs do d_instantiate/unlock_new_inode combinations safely 2018-05-11 15:36:37 -04:00
overlayfs ovl: add support for "xino" mount and config options 2018-04-12 12:04:50 +02:00
proc Merge branch 'speck-v20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-05-21 11:23:26 -07:00
pstore pstore: fix crypto dependencies without compression 2018-04-06 15:45:33 -07:00
qnx4
qnx6
quota fs: quota: Replace GFP_ATOMIC with GFP_KERNEL in dquot_init 2018-04-09 17:48:54 +02:00
ramfs
reiserfs do d_instantiate/unlock_new_inode combinations safely 2018-05-11 15:36:37 -04:00
romfs
squashfs
sysfs unfuck sysfs_mount() 2018-05-21 14:30:09 -04:00
sysv
tracefs
ubifs This pull request contains updates for both UBI and UBIFS: 2018-04-11 16:39:34 -07:00
udf Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-05-21 11:54:57 -07:00
ufs do d_instantiate/unlock_new_inode combinations safely 2018-05-11 15:36:37 -04:00
xfs xfs: cap the length of deduplication requests 2018-05-02 09:21:33 -07:00
aio.c aio: fix io_destroy(2) vs. lookup_ioctx() race 2018-05-21 14:30:11 -04:00
anon_inodes.c
attr.c
bad_inode.c
binfmt_aout.c exec: introduce finalize_exec() before start_thread() 2018-04-11 10:28:37 -07:00
binfmt_elf.c fs, elf: don't complain MAP_FIXED_NOREPLACE unless -EEXIST error 2018-04-20 17:18:36 -07:00
binfmt_elf_fdpic.c exec: introduce finalize_exec() before start_thread() 2018-04-11 10:28:37 -07:00
binfmt_em86.c
binfmt_flat.c exec: introduce finalize_exec() before start_thread() 2018-04-11 10:28:37 -07:00
binfmt_misc.c fs: add ksys_close() wrapper; remove in-kernel calls to sys_close() 2018-04-02 20:16:00 +02:00
binfmt_script.c
block_dev.c libnvdimm for 4.17 2018-04-10 10:25:57 -07:00
buffer.c Merge branch 'work.thaw' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-04-12 12:28:32 -07:00
char_dev.c block, char_dev: Use correct format specifier for unsigned ints 2018-03-15 17:59:24 +01:00
compat.c
compat_binfmt_elf.c
compat_ioctl.c
coredump.c
d_path.c split d_path() and friends into a separate file 2018-03-29 15:07:46 -04:00
dax.c page cache: use xa_lock 2018-04-11 10:28:39 -07:00
dcache.c do d_instantiate/unlock_new_inode combinations safely 2018-05-11 15:36:37 -04:00
dcookies.c fs: add do_lookup_dcookie() helper; remove in-kernel call to syscall 2018-04-02 20:15:39 +02:00
direct-io.c Merge branch 'akpm' (patches from Andrew) 2018-04-06 14:19:26 -07:00
drop_caches.c
eventfd.c fs: add do_eventfd() helper; remove internal call to sys_eventfd() 2018-04-02 20:15:39 +02:00
eventpoll.c fs: add do_epoll_*() helpers; remove internal calls to sys_epoll_*() 2018-04-02 20:15:37 +02:00
exec.c exec: pin stack limit during exec 2018-04-11 10:28:37 -07:00
fcntl.c fs: add do_compat_fcntl64() helper; remove in-kernel call to compat syscall 2018-04-02 20:15:42 +02:00
fhandle.c vfs: Copy struct mount.mnt_id to userspace using put_user() 2018-01-15 12:07:51 -08:00
file.c fs: add ksys_close() wrapper; remove in-kernel calls to sys_close() 2018-04-02 20:16:00 +02:00
file_table.c
filesystems.c
fs-writeback.c bdi: Fix oops in wb_workfn() 2018-05-03 16:11:37 -06:00
fs_pin.c
fs_struct.c
inode.c page cache: use xa_lock 2018-04-11 10:28:39 -07:00
internal.h Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-04-06 11:07:08 -07:00
ioctl.c fs: add ksys_ioctl() helper; remove in-kernel calls to sys_ioctl() 2018-04-02 20:16:03 +02:00
iomap.c iomap: warn on zero-length mappings 2018-01-29 07:27:24 -08:00
Kconfig libnvdimm for 4.16 2018-02-06 10:41:33 -08:00
Kconfig.binfmt treewide: simplify Kconfig dependencies for removed archs 2018-03-26 15:55:57 +02:00
libfs.c fs, dax: prepare for dax-specific address_space_operations 2018-03-30 11:34:55 -07:00
locks.c treewide: Align function definition open/close braces 2018-03-26 11:13:09 +02:00
Makefile split d_path() and friends into a separate file 2018-03-29 15:07:46 -04:00
mbcache.c mbcache: make sure c_entry_count is not decremented past zero 2018-01-09 23:57:52 -05:00
mount.h
mpage.c
namei.c Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-04-09 12:48:05 -07:00
namespace.c vfs: Undo an overly zealous MS_RDONLY -> SB_RDONLY conversion 2018-04-20 09:59:33 -07:00
no-block.c
nsfs.c net: Export open_related_ns() 2018-02-15 15:34:42 -05:00
open.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-04-06 11:07:08 -07:00
pipe.c fs: add do_pipe2() helper; remove internal call to sys_pipe2() 2018-04-02 20:15:35 +02:00
pnode.c
pnode.h
posix_acl.c
proc_namespace.c vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
read_write.c fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls 2018-04-02 20:16:09 +02:00
readdir.c fs: add ksys_getdents64() helper; remove in-kernel calls to sys_getdents64() 2018-04-02 20:16:02 +02:00
select.c fs: add do_compat_select() helper; remove in-kernel call to compat syscall 2018-04-02 20:15:42 +02:00
seq_file.c proc: fix smaps and meminfo alignment 2018-05-25 18:12:11 -07:00
signalfd.c fs: add do_compat_signalfd4() helper; remove in-kernel call to compat syscall 2018-04-02 20:15:43 +02:00
splice.c fs: add do_vmsplice() helper; remove in-kernel call to syscall 2018-04-02 20:15:40 +02:00
stack.c
stat.c fs: add do_readlinkat() helper; remove internal call to sys_readlinkat() 2018-04-02 20:15:34 +02:00
statfs.c
super.c fs: don't scan the inode cache before SB_BORN is set 2018-05-11 15:37:57 -04:00
sync.c Changes for this release: 2018-04-04 12:44:02 -07:00
timerfd.c vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
userfaultfd.c vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
utimes.c fs: add do_compat_futimesat() helper; remove in-kernel call to compat syscall 2018-04-02 20:15:44 +02:00
xattr.c