linux-stable/fs
Dave Chinner cf11da9c5d xfs: refine the allocation stack switch
The allocation stack switch at xfs_bmapi_allocate() has served its
purpose, but is no longer a sufficient solution to the stack usage
problem we have in the XFS allocation path.

Whilst the kernel stack size is now 16k, that is not a valid reason
for undoing all our "keep stack usage down" modifications. What it
does allow us to do is have the freedom to refine and perfect the
modifications knowing that if we get it wrong it won't blow up in
our faces - we have a safety net now.

This is important because older kernels with smaller stacks are
still supported and are demonstrating a wide range of different
stack overflows.  Red Hat has several open bugs for allocation based
stack overflows from directory modifications and direct IO block
allocation, and these problems still need to be solved. If we can
solve them upstream, then distros won't need to bake their own
unique solutions.

To that end, I've observed that every allocation based stack
overflow report has had a specific characteristic - it has happened
during or directly after a bmap btree block split. That event
requires a new block to be allocated to the tree, and so we
effectively stack one allocation stack on top of another, and that's
when we get into trouble.

A further observation is that bmap btree block splits are much rarer
than writeback allocation - over a range of different workloads I've
observed the ratio of bmap btree inserts to splits ranges from 100:1
(xfstests run) to 10000:1 (local VM image server with sparse files
that range in the hundreds of thousands to millions of extents).
Either way, bmap btree split events are much, much rarer than
allocation events.

Finally, we have to move the kswapd state to the allocation workqueue
work item when allocation is done on behalf of kswapd. This is proving to
cause significant perturbation in performance under memory pressure
and appears to be generating allocation deadlock warnings under some
workloads, so avoiding the use of a workqueue for the majority of
kswapd writeback allocation will minimise the impact of such
behaviour.

Hence it makes sense to move the stack switch to xfs_btree_split()
and only do it for bmap btree splits. Stack switches during
allocation will be much rarer, so there won't be significant
performance overhead caused by switching stacks. The worst case
stack from all allocation paths will be split, not just writeback.
And the majority of memory allocations will be done in the correct
context (e.g. kswapd) without causing additional latency, and so we
simplify the memory reclaim interactions between processes,
workqueues and kswapd.
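
The shape of the change described above can be sketched in userspace
C. This is only an illustration under stated assumptions, not the
kernel code: a POSIX thread stands in for the allocation workqueue,
and struct split_args, do_split(), split_worker() and btree_split()
are hypothetical names that do not come from the XFS sources. Only
bmap btree splits are handed to a worker with a fresh stack; every
other split runs directly on the caller's stack.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical argument block handed to the worker; not the kernel's struct. */
struct split_args {
    int             level;   /* in: btree level to split */
    int             result;  /* out: 0 on success */
    bool            done;    /* completion flag, protected by lock */
    pthread_mutex_t lock;
    pthread_cond_t  cond;
};

/* Placeholder for the real split work; always succeeds in this sketch. */
static int do_split(int level)
{
    (void)level;
    return 0;
}

/* Runs on the worker thread's own stack, then signals completion. */
static void *split_worker(void *p)
{
    struct split_args *args = p;
    int result = do_split(args->level);

    pthread_mutex_lock(&args->lock);
    args->result = result;
    args->done = true;
    pthread_cond_signal(&args->cond);
    pthread_mutex_unlock(&args->lock);
    return NULL;
}

/*
 * Switch stacks only for bmap btree splits. *switched reports whether
 * the split was handed off to a worker.
 */
static int btree_split(int level, bool is_bmap_btree, bool *switched)
{
    struct split_args args = { .level = level, .result = -1, .done = false };
    pthread_t worker;

    *switched = false;
    if (!is_bmap_btree)
        return do_split(level);   /* common case: stay on the caller's stack */

    pthread_mutex_init(&args.lock, NULL);
    pthread_cond_init(&args.cond, NULL);
    if (pthread_create(&worker, NULL, split_worker, &args) != 0) {
        pthread_cond_destroy(&args.cond);
        pthread_mutex_destroy(&args.lock);
        return -1;
    }
    *switched = true;

    /* Block until the worker has finished the split. */
    pthread_mutex_lock(&args.lock);
    while (!args.done)
        pthread_cond_wait(&args.cond, &args.lock);
    pthread_mutex_unlock(&args.lock);

    pthread_join(worker, NULL);
    pthread_cond_destroy(&args.cond);
    pthread_mutex_destroy(&args.lock);
    return args.result;
}
```

Calling btree_split(level, true, &switched) pays for one handoff and
one wait, while btree_split(level, false, &switched) has no extra
cost. On older toolchains the sketch may need to be built with
-pthread.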

The worst stack I've been able to generate with this patch in place
is 5600 bytes deep. It's very revealing because we exit XFS at:

37)     1768      64   kmem_cache_alloc+0x13b/0x170

about 1800 bytes of stack consumed, and the remaining 3800 bytes
(and 36 functions) is memory reclaim, swap and the IO stack. And
this occurs in the inode allocation from an open(O_CREAT) syscall,
not writeback.
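
The split between the two regions can be checked from the trace line
itself. The sketch below assumes the columns are entry index, depth
and frame size, as printed by the ftrace stack tracer; struct
stack_entry, parse_stack_entry() and bytes_above() are hypothetical
helper names used only for illustration.

```c
#include <stdio.h>

/* One line of stack tracer output (assumed column layout). */
struct stack_entry {
    int      index;          /* position in the trace, 0 = deepest frame */
    unsigned depth;          /* bytes of stack in use at this frame */
    unsigned size;           /* bytes used by this frame alone */
    char     location[128];  /* symbol+offset/length */
};

/* Parse a line such as "37)     1768      64   func+0x13b/0x170". */
static int parse_stack_entry(const char *line, struct stack_entry *e)
{
    int n = sscanf(line, " %d) %u %u %127s",
                   &e->index, &e->depth, &e->size, e->location);
    return n == 4 ? 0 : -1;
}

/* Bytes consumed by the frames deeper than entry e. */
static unsigned bytes_above(unsigned total_depth, const struct stack_entry *e)
{
    return total_depth - e->depth;
}
```

With the line above and a total depth of 5600 bytes, bytes_above()
gives 5600 - 1768 = 3832, which matches the "remaining 3800 bytes"
figure.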

The amount of stack being used is much less than I've previously been
able to generate - fs_mark testing has been able to generate stack
usage of around 7k without too much trouble; with this patch it's
only just getting to 5.5k. This is primarily because the metadata
allocation paths (e.g. directory blocks) are no longer causing
double splits on the same stack, and hence now stack tracing is
showing swapping being the worst stack consumer rather than XFS.

Performance of fs_mark inode create workloads is unchanged.
Performance of fs_mark async fsync workloads is consistently good
with context switches reduced by around 150,000/s (30%).
Performance of dbench, streaming IO and postmark is unchanged.
Allocation deadlock warnings have not been seen on the workloads
that generated them since adding this patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 07:08:24 +10:00
..
9p Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
adfs write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
affs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
afs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
autofs4 fs/autofs4/dev-ioctl.c: add __init to autofs_dev_ioctl_init 2014-06-04 16:54:21 -07:00
befs fs/befs: kernel-doc fixes 2014-06-06 16:08:09 -07:00
bfs write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
btrfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs 2014-06-21 14:21:43 -10:00
cachefiles fs/cachefiles: replace kerror by pr_err 2014-06-06 16:08:14 -07:00
ceph Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client 2014-06-12 23:06:23 -07:00
cifs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
coda coda: convert use of typedef ctl_table to struct ctl_table 2014-06-06 16:08:16 -07:00
configfs fs/configfs: use pr_fmt 2014-06-04 16:53:53 -07:00
cramfs Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
debugfs Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
devpts fs/devpts/inode.c: convert printk to pr_foo() 2014-06-06 16:08:14 -07:00
dlm dlm: keep listening connection alive with sctp mode 2014-06-12 10:26:14 -05:00
ecryptfs write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
efivarfs fs/efivarfs/super.c: use static const for dentry_operations 2014-06-04 16:54:14 -07:00
efs fs/efs: convert printk(KERN_DEBUG to pr_debug 2014-06-04 16:54:21 -07:00
exofs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
exportfs fs/exportfs/expfs.c: kernel-doc warning fixes 2014-06-04 16:54:14 -07:00
ext2 ->splice_write() via ->write_iter() 2014-06-12 00:18:51 -04:00
ext3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
ext4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
f2fs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
fat Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
freevxfs Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
fscache fscache: convert use of typedef ctl_table to struct ctl_table 2014-06-06 16:08:16 -07:00
fuse Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
gfs2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
hfs write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
hfsplus Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
hostfs write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
hpfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
hppfs
hugetlbfs fs/hugetlbfs/inode.c: remove null test before kfree 2014-06-04 16:54:11 -07:00
isofs Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2014-04-07 17:59:17 -07:00
jbd fs/jbd/revoke.c: replace shift loop by ilog2 2014-05-21 10:26:13 +02:00
jbd2 arch: Mass conversion of smp_mb__*() 2014-04-18 14:20:48 +02:00
jffs2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
jfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
kernfs kernfs: move the last knowledge of sysfs out from kernfs 2014-06-03 08:11:18 -07:00
lockd Merge branch 'for-3.16' of git://linux-nfs.org/~bfields/linux 2014-06-10 11:50:57 -07:00
logfs write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
minix write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
ncpfs fs/ncpfs/getopt.c: replace simple_strtoul by kstrtoul 2014-06-04 16:54:21 -07:00
nfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
nfs_common
nfsd NFSD: fix bug for readdir of pseudofs 2014-06-17 16:42:48 -04:00
nilfs2 write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
nls nls: have register_nls() set ->owner 2014-01-25 03:14:05 -05:00
notify inotify: convert use of typedef ctl_table to struct ctl_table 2014-06-06 16:08:16 -07:00
ntfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
ocfs2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
omfs write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
openpromfs fs: push sync_filesystem() down to the file system's remount_fs() 2014-03-13 10:14:33 -04:00
proc Merge branch 'next' (accumulated 3.16 merge window patches) into master 2014-06-08 11:31:16 -07:00
pstore fs/pstore: logging clean-up 2014-06-06 16:08:13 -07:00
qnx4 fs: push sync_filesystem() down to the file system's remount_fs() 2014-03-13 10:14:33 -04:00
qnx6 fs: push sync_filesystem() down to the file system's remount_fs() 2014-03-13 10:14:33 -04:00
quota xfs: fix Q_XQUOTARM ioctl 2014-05-05 17:25:50 +10:00
ramfs ->splice_write() via ->write_iter() 2014-06-12 00:18:51 -04:00
reiserfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
romfs switch simple generic_file_aio_read() users to ->read_iter() 2014-05-06 17:37:55 -04:00
squashfs fs/squashfs/squashfs.h: replace pr_warning by pr_warn 2014-06-04 16:53:52 -07:00
sysfs kernfs: move the last knowledge of sysfs out from kernfs 2014-06-03 08:11:18 -07:00
sysv write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
ubifs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
udf udf: switch to ->write_iter() 2014-05-06 17:39:36 -04:00
ufs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
xfs xfs: refine the allocation stack switch 2014-07-15 07:08:24 +10:00
Kconfig kernfs: add CONFIG_KERNFS 2014-02-07 16:08:57 -08:00
Kconfig.binfmt
Makefile block: move ioprio.c from fs/ to block/ 2014-05-19 11:02:18 -06:00
aio.c Merge git://git.kvack.org/~bcrl/aio-next 2014-06-14 19:43:27 -05:00
anon_inodes.c vfs: Allocate anon_inode_inode in anon_inode_init() 2014-03-27 09:52:54 -07:00
attr.c fs,userns: Change inode_capable to capable_wrt_inode_uidgid 2014-06-10 13:57:22 -07:00
bad_inode.c
binfmt_aout.c
binfmt_elf.c Merge branch 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next 2014-06-05 08:05:29 -07:00
binfmt_elf_fdpic.c
binfmt_em86.c
binfmt_flat.c fs/binfmt_flat.c: make old_reloc() static 2014-06-04 16:54:21 -07:00
binfmt_misc.c binfmt_misc: add missing 'break' statement 2014-04-03 16:21:16 -07:00
binfmt_script.c
binfmt_som.c
block_dev.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
buffer.c mm: non-atomically mark page accessed during page cache allocation where possible 2014-06-04 16:54:10 -07:00
char_dev.c
compat.c locks: rename file-private locks to "open file description locks" 2014-04-22 08:23:58 -04:00
compat_binfmt_elf.c binfmt_elf: add ELF_HWCAP2 to compat auxv entries 2014-03-04 08:05:21 +00:00
compat_ioctl.c fs/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types 2014-03-06 16:30:44 +01:00
coredump.c coredump: fix va_list corruption 2014-04-19 13:23:31 -07:00
dcache.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
dcookies.c fs/compat: fix lookup_dcookie() parameter handling 2014-01-29 16:22:40 -08:00
direct-io.c new helper: iov_iter_npages() 2014-05-06 17:32:52 -04:00
drop_caches.c fs: convert use of typedef ctl_table to struct ctl_table 2014-06-06 16:08:16 -07:00
eventfd.c eventfd_ctx_fdget(): use fdget() instead of fget() 2014-01-25 03:13:04 -05:00
eventpoll.c epoll: fix use-after-free in eventpoll_release_file 2014-06-16 17:21:59 -10:00
exec.c perf: Differentiate exec() and non-exec() comm events 2014-06-06 07:56:22 +02:00
fcntl.c locks: rename file-private locks to "open file description locks" 2014-04-22 08:23:58 -04:00
fhandle.c
file.c fs/file.c: don't open-code kvfree() 2014-05-06 17:31:10 -04:00
file_table.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
filesystems.c sys_sysfs: Add CONFIG_SYSFS_SYSCALL 2014-04-03 16:21:05 -07:00
fs-writeback.c One of the main highlights this time, is not the patches themselves 2014-04-04 14:49:16 -07:00
fs_struct.c
inode.c fs,userns: Change inode_capable to capable_wrt_inode_uidgid 2014-06-10 13:57:22 -07:00
internal.h
ioctl.c
libfs.c fs/libfs.c: add generic data flush to fsync 2014-06-04 16:53:55 -07:00
locks.c locks: set fl_owner for leases back to current->files 2014-06-10 12:29:05 -04:00
mbcache.c ext4: each filesystem creates and uses its own mb_cache 2014-03-18 19:24:49 -04:00
mount.h reduce m_start() cost... 2014-04-01 23:19:09 -04:00
mpage.c fs/block_dev.c: add bdev_read_page() and bdev_write_page() 2014-06-04 16:54:02 -07:00
namei.c fs,userns: Change inode_capable to capable_wrt_inode_uidgid 2014-06-10 13:57:22 -07:00
namespace.c VFS: Make delayed_free() call free_vfsmnt() 2014-04-01 23:19:18 -04:00
no-block.c
open.c new methods: ->read_iter() and ->write_iter() 2014-05-06 17:36:00 -04:00
pipe.c new helper: copy_page_from_iter() 2014-05-06 17:39:42 -04:00
pnode.c smarter propagate_mnt() 2014-04-01 23:19:08 -04:00
pnode.h smarter propagate_mnt() 2014-04-01 23:19:08 -04:00
posix_acl.c posix_acl: handle NULL ACL in posix_acl_equiv_mode 2014-05-06 13:58:42 -04:00
proc_namespace.c reduce m_start() cost... 2014-04-01 23:19:09 -04:00
read_write.c switch simple generic_file_aio_read() users to ->read_iter() 2014-05-06 17:37:55 -04:00
readdir.c fanotify: create FAN_ACCESS event for readdir 2014-06-04 16:53:52 -07:00
select.c
seq_file.c
signalfd.c
splice.c Merge commit '9f12600fe425bc28f0ccba034a77783c09c15af4' into for-linus 2014-06-12 00:28:09 -04:00
stack.c
stat.c
statfs.c
super.c fs/superblock: avoid locking counting inodes and dentries before reclaiming them 2014-06-04 16:54:11 -07:00
sync.c Revert "writeback: do not sync data dirtied after sync start" 2014-02-22 02:02:28 +01:00
timerfd.c timerfd: support CLOCK_BOOTTIME clock 2014-01-23 16:57:40 -08:00
utimes.c
xattr.c