linux-stable/fs
Dave Chinner eef983ffea xfs: journal IO cache flush reductions
Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
guarantee the ordering requirements the journal has w.r.t. metadata
writeback. THe two ordering constraints are:

1. we cannot overwrite metadata in the journal until we guarantee
that the dirty metadata has been written back in place and is
stable.

2. we cannot write back dirty metadata until it has been written to
the journal and guaranteed to be stable (and hence recoverable) in
the journal.

The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
causes the journal IO to issue a cache flush and wait for it to
complete before issuing the write IO to the journal. Hence all
completed metadata IO is guaranteed to be stable before the journal
overwrites the old metadata.

The ordering guarantees of #2 are provided by the REQ_FUA, which
ensures the journal writes do not complete until they are on stable
storage. Hence by the time the last journal IO in a checkpoint
completes, we know that the entire checkpoint is on stable storage
and we can unpin the dirty metadata and allow it to be written back.

This is the mechanism by which ordering was first implemented in XFS
way back in 2002 by commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
("Add support for drive write cache flushing") in the xfs-archive
tree.

A lot has changed since then, most notably we now use delayed
logging to checkpoint the filesystem to the journal rather than
write each individual transaction to the journal. Cache flushes on
journal IO are necessary when individual transactions are wholly
contained within a single iclog. However, CIL checkpoints are single
transactions that typically span hundreds to thousands of individual
journal writes, and so the requirements for device cache flushing
have changed.

That is, the ordering rules I state above apply to ordering of
atomic transactions recorded in the journal, not to the journal IO
itself. Hence we need to ensure metadata is stable before we start
writing a new transaction to the journal (guarantee #1), and we need
to ensure the entire transaction is stable in the journal before we
start metadata writeback (guarantee #2).

Hence we only need a REQ_PREFLUSH on the journal IO that starts a
new journal transaction to provide #1, and it is not on any other
journal IO done within the context of that journal transaction.

The CIL checkpoint already issues a cache flush before it starts
writing to the log, so we no longer need the iclog IO to issue a
REQ_REFLUSH for us. Hence if XLOG_START_TRANS is passed
to xlog_write(), we no longer need to mark the first iclog in
the log write with REQ_PREFLUSH for this case. As an added bonus,
this ordering mechanism works for both internal and external logs,
meaning we can remove the explicit data device cache flushes from
the iclog write code when using external logs.

Given the new ordering semantics of commit records for the CIL, we
need iclogs containing commit records to issue a REQ_PREFLUSH. We
also require unmount records to do this. Hence for both
XLOG_COMMIT_TRANS and XLOG_UNMOUNT_TRANS xlog_write() calls we need
to mark the first iclog being written with REQ_PREFLUSH.

For both commit records and unmount records, we also want them
immediately on stable storage, so we want to also mark the iclogs
that contain these records to be marked REQ_FUA. That means if a
record is split across multiple iclogs, they are all marked REQ_FUA
and not just the last one so that when the transaction is completed
all the parts of the record are on stable storage.

And for external logs, unmount records need a pre-write data device
cache flush similar to the CIL checkpoint cache pre-flush as the
internal iclog write code does not do this implicitly anymore.

As an optimisation, when the commit record lands in the same iclog
as the journal transaction starts, we don't need to wait for
anything and can simply use REQ_FUA to provide guarantee #2.  This
means that for fsync() heavy workloads, the cache flush behaviour is
completely unchanged and there is no degradation in performance as a
result of optimise the multi-IO transaction case.

The most notable sign that there is less IO latency on my test
machine (nvme SSDs) is that the "noiclogs" rate has dropped
substantially. This metric indicates that the CIL push is blocking
in xlog_get_iclog_space() waiting for iclog IO completion to occur.
With 8 iclogs of 256kB, the rate is appoximately 1 noiclog event to
every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space()
is blocking waiting for log IO. With the changes in this patch, this
drops to 1 noiclog event for every 100 iclog writes. Hence it is
clear that log IO is completing much faster than it was previously,
but it is also clear that for large iclog sizes, this isn't the
performance limiting factor on this hardware.

With smaller iclogs (32kB), however, there is a substantial
difference. With the cache flush modifications, the journal is now
running at over 4000 write IOPS, and the journal throughput is
largely identical to the 256kB iclogs and the noiclog event rate
stays low at about 1:50 iclog writes. The existing code tops out at
about 2500 IOPS as the number of cache flushes dominate performance
and latency. The noiclog event rate is about 1:4, and the
performance variance is quite large as the journal throughput can
fall to less than half the peak sustained rate when the cache flush
rate prevents metadata writeback from keeping up and the log runs
out of space and throttles reservations.

As a result:

	logbsize	fsmark create rate	rm -rf
before	32kb		152851+/-5.3e+04	5m28s
patched	32kb		221533+/-1.1e+04	5m24s

before	256kb		220239+/-6.2e+03	4m58s
patched	256kb		228286+/-9.2e+03	5m06s

The rm -rf times are included because I ran them, but the
differences are largely noise. This workload is largely metadata
read IO latency bound and the changes to the journal cache flushing
doesn't really make any noticable difference to behaviour apart from
a reduction in noiclog events from background CIL pushing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-21 10:06:08 -07:00
..
9p 9p for 5.13-rc1 2021-05-07 11:18:52 -07:00
adfs
affs
afs afs: Fix the nlink handling of dir-over-dir rename 2021-05-27 06:23:58 -10:00
autofs autofs: should_expire() argument is guaranteed to be positive 2021-03-24 14:14:27 -04:00
befs fs/befs: Delete obsolete TODO file 2021-03-30 16:54:49 -07:00
bfs
btrfs for-5.13-rc2-tag 2021-05-21 13:24:12 -10:00
cachefiles fscache, cachefiles: Add alternate API to use kiocb for read/write to cache 2021-04-23 10:14:32 +01:00
ceph Notable items here are a series to take advantage of David Howells' 2021-05-06 10:27:02 -07:00
cifs cifs: change format of CIFS_FULL_KEY_DUMP ioctl 2021-05-27 15:26:32 -05:00
coda coda: fix reference counting in coda_file_mmap error path 2021-04-23 14:42:39 -07:00
configfs treewide: remove editor modelines and cruft 2021-05-07 00:26:34 -07:00
cramfs
crypto Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2021-04-26 08:51:23 -07:00
debugfs debugfs: fix security_locked_down() call for SELinux 2021-05-18 18:05:59 +02:00
devpts
dlm fs: dlm: fix missing unlock on error in accept_from_sock() 2021-03-29 13:28:18 -05:00
ecryptfs fs: ecryptfs: remove BUG_ON from crypt_scatterlist 2021-05-13 18:32:26 +02:00
efivarfs efivars: convert to fileattr 2021-04-12 15:04:29 +02:00
efs
erofs erofs: fix 1 lcluster-sized pcluster for big pcluster 2021-05-13 15:58:46 +08:00
exfat exfat: speed up iterate/lookup by fixing start point of traversing cluster chain 2021-04-27 20:45:07 +09:00
exportfs
ext2 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-05-02 09:14:01 -07:00
ext4 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-05-02 09:14:01 -07:00
f2fs f2fs: return EINVAL for hole cases in swap file 2021-05-12 07:38:00 -07:00
fat fs: fat: fix spelling typo of values 2021-05-07 00:26:34 -07:00
freevxfs
fscache fscache, cachefiles: Add alternate API to use kiocb for read/write to cache 2021-04-23 10:14:32 +01:00
fuse Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-05-02 09:14:01 -07:00
gfs2 mm: introduce and use mapping_empty() 2021-05-05 11:27:19 -07:00
hfs
hfsplus hfsplus: prevent corruption in shrinking truncate 2021-05-14 19:41:32 -07:00
hostfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-05-02 09:14:01 -07:00
hpfs hpfs: replace one-element array with flexible-array member 2021-05-06 19:24:13 -07:00
hugetlbfs userfaultfd: hugetlbfs: fix new flag usage in error path 2021-05-22 15:09:07 -10:00
iomap mm/filemap: fix readahead return types 2021-05-14 19:41:32 -07:00
isofs isofs: fix fall-through warnings for Clang 2021-05-06 19:24:13 -07:00
jbd2 ext4: fix debug format string warning 2021-04-09 23:32:16 -04:00
jffs2 This pull request contains changes for JFFS2, UBI and UBIFS 2021-05-04 18:08:40 -07:00
jfs jfs: convert to fileattr 2021-04-12 15:04:29 +02:00
kernfs
lockd
minix
netfs netfs: Make CONFIG_NETFS_SUPPORT auto-selected rather than manual 2021-05-25 13:48:04 +01:00
nfs nfs: Remove trailing semicolon in macros 2021-05-27 09:19:33 -04:00
nfs_common NFSD: Add an xdr_stream-based encoder for NFSv2/3 ACLs 2021-03-22 10:19:00 -04:00
nfsd NFS client updates for Linux 5.13 2021-05-07 11:23:41 -07:00
nilfs2 Merge branch 'akpm' (patches from Andrew) 2021-05-07 00:34:51 -07:00
nls
notify fanotify_user: use upper_32_bits() to verify mask 2021-03-25 15:33:45 +01:00
ntfs
ocfs2 treewide: remove editor modelines and cruft 2021-05-07 00:26:34 -07:00
omfs
openpromfs openpromfs: don't do unlock_new_inode() until the new inode is set up 2021-03-12 22:15:22 -05:00
orangefs orangefs: leave files in the page cache for a few micro seconds at least 2021-04-29 08:06:05 -04:00
overlayfs overlayfs update for 5.13 2021-04-30 15:17:08 -07:00
proc proc: Check /proc/$pid/attr/ writes against file opener 2021-05-25 10:24:41 -10:00
pstore printk changes for 5.13 2021-04-27 18:09:44 -07:00
qnx4
qnx6
quota quota: Use 'hlist_for_each_entry' to simplify code 2021-05-10 16:27:49 +02:00
ramfs
reiserfs treewide: remove editor modelines and cruft 2021-05-07 00:26:34 -07:00
romfs
squashfs squashfs: fix divide error in calculate_skip() 2021-05-14 19:41:32 -07:00
sysfs
sysv
tracefs tracing: Fix various typos in comments 2021-03-23 14:08:18 -04:00
ubifs This pull request contains changes for JFFS2, UBI and UBIFS 2021-05-04 18:08:40 -07:00
udf useful constants: struct qstr for ".." 2021-04-15 22:36:45 -04:00
ufs useful constants: struct qstr for ".." 2021-04-15 22:36:45 -04:00
unicode .gitignore: prefix local generated files with a slash 2021-05-02 00:43:35 +09:00
vboxsf vboxsf: don't allow to change the inode type 2021-03-12 22:15:00 -05:00
verity fsverity: relax build time dependency on CRYPTO_SHA256 2021-04-22 17:31:32 +10:00
xfs xfs: journal IO cache flush reductions 2021-06-21 10:06:08 -07:00
zonefs \n 2021-04-29 11:06:13 -07:00
aio.c Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio" 2021-04-30 11:20:39 -07:00
anon_inodes.c
attr.c
bad_inode.c
binfmt_aout.c
binfmt_elf.c
binfmt_elf_fdpic.c
binfmt_em86.c
binfmt_flat.c binfmt_flat: allow not offsetting data start 2021-04-19 09:56:37 +10:00
binfmt_misc.c binfmt_misc: fix possible deadlock in bm_register_write 2021-03-13 11:27:30 -08:00
binfmt_script.c
block_dev.c block-5.13-2021-05-22 2021-05-22 07:40:34 -10:00
buffer.c Merge branch 'akpm' (patches from Andrew) 2021-05-05 13:50:15 -07:00
char_dev.c
compat_binfmt_elf.c
coredump.c
d_path.c constify dentry argument of dentry_path()/dentry_path_raw() 2021-03-21 11:43:58 -04:00
dax.c dax fixes for 5.13-rc2 2021-05-15 08:28:08 -07:00
dcache.c useful constants: struct qstr for ".." 2021-04-15 22:36:45 -04:00
direct-io.c fs: direct-io: fix missing sdio->boundary 2021-04-09 14:54:23 -07:00
drop_caches.c
eventfd.c
eventpoll.c fs/epoll: restore waking from ep_done_scan() 2021-05-06 19:24:13 -07:00
exec.c
fcntl.c
fhandle.c
file.c Merge branch 'work.file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-05-03 11:05:28 -07:00
file_table.c
filesystems.c
fs-writeback.c
fs_context.c
fs_parser.c vfs: fs_parser: clean up kernel-doc warnings 2021-04-30 11:20:35 -07:00
fs_pin.c
fs_struct.c
fs_types.c
fsopen.c
init.c
inode.c mm: remove nrexceptional from inode: remove BUG_ON 2021-05-05 11:27:20 -07:00
internal.h
io-wq.c io-wq: Fix UAF when wakeup wqe in hash waitqueue 2021-05-26 09:03:56 -06:00
io-wq.h io_uring/io-wq: close io-wq full-stop gap 2021-05-25 19:39:58 -06:00
io_uring.c io_uring: fix data race to avoid potential NULL-deref 2021-05-27 07:44:49 -06:00
ioctl.c vfs: add fileattr ops 2021-04-12 15:04:23 +02:00
Kconfig NFS client updates for Linux 5.13 2021-05-07 11:23:41 -07:00
Kconfig.binfmt binfmt_flat: allow not offsetting data start 2021-04-19 09:56:37 +10:00
kernel_read_file.c
libfs.c libfs: fix kernel-doc for mnt_userns 2021-03-23 11:20:25 +01:00
locks.c Additional fixes and clean-ups for NFSD since tags/nfsd-5.13, 2021-05-05 13:44:19 -07:00
Makefile netfs: Provide readahead and readpage netfs helpers 2021-04-23 10:14:32 +01:00
mbcache.c
mount.h
mpage.c
namei.c fs.idmapped.helpers.v5.13 2021-04-27 12:49:42 -07:00
namespace.c fs/mount_setattr: tighten permission checks 2021-05-12 14:13:16 +02:00
no-block.c
nsfs.c
open.c
pipe.c
pnode.c
pnode.h
posix_acl.c
proc_namespace.c
read_write.c
readdir.c readdir: make sure to verify directory entry for legacy interfaces too 2021-04-17 11:39:49 -07:00
remap_range.c
select.c kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data() 2021-03-16 22:13:10 +01:00
seq_file.c seq_file: Add a seq_bprintf function 2021-04-27 15:50:15 -07:00
signalfd.c signalfd: Remove SIL_PERF_EVENT fields from signalfd_siginfo 2021-05-18 16:20:54 -05:00
splice.c
stack.c
stat.c fs: fix reporting supported extra file attributes for statx() 2021-04-17 23:03:50 -04:00
statfs.c
super.c fs,security: Add sb_delete hook 2021-04-22 12:22:11 -07:00
sync.c
timerfd.c
userfaultfd.c userfaultfd: add UFFDIO_CONTINUE ioctl 2021-05-05 11:27:22 -07:00
utimes.c
xattr.c xattr: fix kernel-doc for mnt_userns and vfs xattr helpers 2021-03-23 11:20:26 +01:00