linux-stable/fs
Dave Chinner e2f6ad4624 xfs: make xfs_writepage_map extent map centric
xfs_writepage_map() iterates over the bufferheads on a page to decide
what sort of IO to do and what actions to take.  However, when it comes
to reflink and deciding when it needs to execute a COW operation, we no
longer look at the bufferhead state but instead we ignore than and look
up internal state held in the COW fork extent list.

This means xfs_writepage_map() is somewhat confused. It does stuff, then
ignores it, then tries to handle the impedence mismatch by shovelling the
results inside the existing mapping code.  It works, but it's a bit of a
mess and it makes it hard to fix the cached map bug that the writepage
code currently has.

To unify the two different mechanisms, we first have to choose a direction.
That's already been set - we're de-emphasising bufferheads so they are no
longer a control structure as we need to do taht to allow for eventual
removal.  Hence we need to move away from looking at bufferhead state to
determine what operations we need to perform.

We can't completely get rid of bufferheads yet - they do contain some
state that is absolutely necessary, such as whether that part of the page
contains valid data or not (buffer_uptodate()).  Other state in the
bufferhead is redundant:

	BH_dirty - the page is dirty, so we can ignore this and just
		write it
	BH_delay - we have delalloc extent info in the DATA fork extent
		tree
	BH_unwritten - same as BH_delay
	BH_mapped - indicates we've already used it once for IO and it is
		mapped to a disk address. Needs to be ignored for COW
		blocks.

The BH_mapped flag is an interesting case - it's supposed to indicate that
it's already mapped to disk and so we can just use it "as is".  In theory,
we don't even have to do an extent lookup to find where to write it too,
but we have to do that anyway to determine we are actually writing over a
valid extent.  Hence it's not even serving the purpose of avoiding a an
extent lookup during writeback, and so we can pretty much ignore it.
Especially as we have to ignore it for COW operations...

Therefore, use the extent map as the source of information to tell us
what actions we need to take and what sort of IO we should perform.  The
first step is to have xfs_map_blocks() set the io type according to what
it looks up.  This means it can easily handle both normal overwrite and
COW cases.  The only thing we also need to add is the ability to return
hole mappings.

We need to return and cache hole mappings now for the case of multiple
blocks per page.  We no longer use the BH_mapped to indicate a block over
a hole, so we have to get that info from xfs_map_blocks().  We cache it so
that holes that span two pages don't need separate lookups.  This allows us
to avoid ever doing write IO over a hole, too.

Now that we have xfs_map_blocks() returning both a cached map and the type
of IO we need to perform, we can rewrite xfs_writepage_map() to drop all
the bufferhead control. It's also much simplified because it doesn't need
to explicitly handle COW operations.  Instead of iterating bufferheads, it
iterates blocks within the page and then looks up what per-block state is
required from the appropriate bufferhead.  It then validates the cached
map, and if it's not valid, we get a new map.  If we don't get a valid map
or it's over a hole, we skip the block.

At this point, we have to remap the bufferhead via xfs_map_at_offset().
As previously noted, we had to do this even if the buffer was already
mapped as the mapping would be stale for XFS_IO_DELALLOC, XFS_IO_UNWRITTEN
and XFS_IO_COW IO types.  With xfs_map_blocks() now controlling the type,
even XFS_IO_OVERWRITE types need remapping, as converted-but-not-yet-
written delalloc extents beyond EOF can be reported at XFS_IO_OVERWRITE.
Bufferheads that span such regions still need their BH_Delay flags cleared
and their block numbers calculated, so we now unconditionally map each
bufferhead before submission.

But wait! There's more - remember the old "treat unwritten extents as
holes on read" hack?  Yeah, that means we can have a dirty page with
unmapped, unwritten bufferheads that contain data!  What makes these so
special is that the unwritten "hole" bufferheads do not have a valid block
device pointer, so if we attempt to write them xfs_add_to_ioend() blows
up. So we make xfs_map_at_offset() do the "realtime or data device"
lookup from the inode and ignore what was or wasn't put into the
bufferhead when the buffer was instantiated.

The astute reader will have realised by now that this code treats
unwritten extents in multiple-blocks-per-page situations differently.
If we get any combination of unwritten blocks on a dirty page that contain
valid data in the page, we're going to convert them to real extents.  This
can actually be a win, because it means that pages with interleaving
unwritten and written blocks will get converted to a single written extent
with zeros replacing the interspersed unwritten blocks.  This is actually
good for reducing extent list and conversion overhead, and it means we
issue a contiguous IO instead of lots of little ones.  The downside is
that we use up a little extra IO bandwidth.  Neither of these seem like a
bad thing given that spinning disks are seek sensitive, and SSDs/pmem have
bandwidth to burn and the lower Io latency/CPU overhead of fewer, larger
IOs will result in better performance on them...

As a result of all this, the only state we actually care about from the
bufferhead is a single flag - BH_Uptodate. We still use the bufferhead to
pass some information to the bio via xfs_add_to_ioend(), but that is
trivial to separate and pass explicitly.  This means we really only need
1 bit of state per block per page from the buffered write path in the
writeback path.  Everything else we do with the bufferhead is purely to
make the buffered IO front end continue to work correctly. i.e we've
pretty much marginalised bufferheads in the writeback path completely.

Signed-off-By: Dave Chinner <dchinner@redhat.com>
[hch: forward port, refactor and split off bits into other commits]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:00 -07:00
..
9p treewide: kmalloc() -> kmalloc_array() 2018-06-12 16:19:22 -07:00
adfs vfs/y2038: inode timestamps conversion to timespec64 2018-06-15 07:31:07 +09:00
affs affs: fix potential memory leak when parsing option 'prefix' 2018-05-28 12:36:41 +02:00
afs Merge branch 'afs-proc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-16 16:32:04 +09:00
autofs autofs: rename 'autofs' module back to 'autofs4' 2018-07-05 11:35:04 -07:00
befs fix a series of Documentation/ broken file name references 2018-06-15 18:10:01 -03:00
bfs bfs_add_entry: pass name/len as qstr pointer 2018-05-22 14:27:50 -04:00
btrfs for-4.18-rc2-tag 2018-07-01 12:38:16 -07:00
cachefiles Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-04 10:00:01 -07:00
ceph ceph: fix dentry leak in splice_dentry() 2018-06-26 18:42:44 +02:00
cifs cifs: Fix stack out-of-bounds in smb{2,3}_create_lease_buf() 2018-07-05 13:48:25 -05:00
coda vfs: change inode times to use struct timespec64 2018-06-05 16:57:31 -07:00
configfs vfs: change inode times to use struct timespec64 2018-06-05 16:57:31 -07:00
cramfs vfs/y2038: inode timestamps conversion to timespec64 2018-06-15 07:31:07 +09:00
crypto f2fs-for-4.18-rc1 2018-06-11 10:16:13 -07:00
debugfs Revert "debugfs: inode: debugfs_create_dir uses mode permission from parent" 2018-06-12 20:52:16 -07:00
devpts
dlm treewide: Use array_size() in vmalloc() 2018-06-12 16:19:22 -07:00
ecryptfs Merge branch 'fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into aio-base 2018-05-26 09:16:25 +02:00
efivarfs
efs
exofs exofs: avoid VLA in structures 2018-06-15 07:55:24 +09:00
exportfs ovl: do not try to reconnect a disconnected origin dentry 2018-04-12 12:04:49 +02:00
ext2 ext2: add warning when specifying nocheck option 2018-06-20 11:04:26 +02:00
ext4 Bug fixes for ext4; most of which relate to vulnerabilities where a 2018-07-08 11:10:30 -07:00
f2fs vfs/y2038: inode timestamps conversion to timespec64 2018-06-15 07:31:07 +09:00
fat Merge branch 'akpm' (patches from Andrew) 2018-06-15 08:51:42 +09:00
freevxfs freevxfs_lookup(): use d_splice_alias() 2018-05-22 14:27:51 -04:00
fscache proc: introduce proc_create_single{,_data} 2018-05-16 07:23:35 +02:00
fuse vfs/y2038: inode timestamps conversion to timespec64 2018-06-15 07:31:07 +09:00
gfs2 vfs/y2038: inode timestamps conversion to timespec64 2018-06-15 07:31:07 +09:00
hfs vfs/y2038: inode timestamps conversion to timespec64 2018-06-15 07:31:07 +09:00
hfsplus vfs/y2038: inode timestamps conversion to timespec64 2018-06-15 07:31:07 +09:00
hostfs vfs: change inode times to use struct timespec64 2018-06-05 16:57:31 -07:00
hpfs treewide: kmalloc() -> kmalloc_array() 2018-06-12 16:19:22 -07:00
hugetlbfs hugetlbfs: fix bug in pgoff overflow checking 2018-04-05 21:36:21 -07:00
isofs isofs: fix potential memory leak in mount option parsing 2018-04-16 09:47:41 +02:00
jbd2 Bug fixes for ext4; most of which relate to vulnerabilities where a 2018-07-08 11:10:30 -07:00
jffs2 vfs/y2038: inode timestamps conversion to timespec64 2018-06-15 07:31:07 +09:00
jfs This fixes a too-small allocation in the xattr code. 2018-06-19 07:47:32 +09:00
kernfs vfs/y2038: inode timestamps conversion to timespec64 2018-06-15 07:31:07 +09:00
lockd
minix minix_lookup: use d_splice_alias() 2018-05-22 14:27:52 -04:00
nfs NFS client bugfixes for Linux 4.18 2018-06-22 06:21:34 +09:00
nfs_common
nfsd vfs/y2038: inode timestamps conversion to timespec64 2018-06-15 07:31:07 +09:00
nilfs2 do d_instantiate/unlock_new_inode combinations safely 2018-05-11 15:36:37 -04:00
nls
notify fsnotify: add fsnotify_add_inode_mark() wrappers 2018-05-18 14:58:22 +02:00
ntfs vfs/y2038: inode timestamps conversion to timespec64 2018-06-15 07:31:07 +09:00
ocfs2 vfs/y2038: inode timestamps conversion to timespec64 2018-06-15 07:31:07 +09:00
omfs omfs_lookup(): report IO errors, use d_splice_alias() 2018-05-22 14:27:58 -04:00
openpromfs openpromfs: switch to d_splice_alias() 2018-05-22 14:27:57 -04:00
orangefs Solve a series of broken links for files under Documentation: 2018-06-17 05:25:18 +09:00
overlayfs vfs/y2038: inode timestamps conversion to timespec64 2018-06-15 07:31:07 +09:00
proc Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-07-01 12:32:19 -07:00
pstore pstore: Remove bogus format string definition 2018-06-14 14:57:24 +02:00
qnx4 qnx4_lookup: use d_splice_alias() 2018-05-22 14:27:52 -04:00
qnx6 qnx6_lookup: switch to d_splice_alias() 2018-05-22 14:27:54 -04:00
quota quota: Cleanup list iteration in dqcache_shrink_scan() 2018-06-20 11:04:26 +02:00
ramfs
reiserfs vfs/y2038: inode timestamps conversion to timespec64 2018-06-15 07:31:07 +09:00
romfs romfs_lookup: switch to d_splice_alias() 2018-05-22 14:27:55 -04:00
squashfs
sysfs unfuck sysfs_mount() 2018-05-21 14:30:09 -04:00
sysv sysv_lookup: use d_splice_alias() 2018-05-22 14:27:53 -04:00
tracefs
ubifs vfs/y2038: inode timestamps conversion to timespec64 2018-06-15 07:31:07 +09:00
udf udf: Drop unused arguments of udf_delete_aext() 2018-06-20 11:05:49 +02:00
ufs treewide: kmalloc() -> kmalloc_array() 2018-06-12 16:19:22 -07:00
xfs xfs: make xfs_writepage_map extent map centric 2018-07-11 22:26:00 -07:00
aio.c Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL 2018-06-28 10:40:47 -07:00
anon_inodes.c
attr.c vfs/y2038: inode timestamps conversion to timespec64 2018-06-15 07:31:07 +09:00
bad_inode.c vfs: change inode times to use struct timespec64 2018-06-05 16:57:31 -07:00
binfmt_aout.c exec: introduce finalize_exec() before start_thread() 2018-04-11 10:28:37 -07:00
binfmt_elf.c coredump: fix spam with zero VMA process 2018-06-15 07:55:24 +09:00
binfmt_elf_fdpic.c treewide: kmalloc() -> kmalloc_array() 2018-06-12 16:19:22 -07:00
binfmt_em86.c
binfmt_flat.c exec: introduce finalize_exec() before start_thread() 2018-04-11 10:28:37 -07:00
binfmt_misc.c docs: Fix more broken references 2018-06-15 18:11:26 -03:00
binfmt_script.c
block_dev.c treewide: kmalloc() -> kmalloc_array() 2018-06-12 16:19:22 -07:00
buffer.c iomap: mark newly allocated buffer heads as new 2018-06-19 15:10:55 -07:00
char_dev.c
compat.c ncpfs: remove compat functionality 2018-06-05 19:23:26 +02:00
compat_binfmt_elf.c
compat_ioctl.c autofs: clean up includes 2018-06-07 17:34:40 -07:00
coredump.c
d_path.c
dax.c libnvdimm for 4.18 2018-06-08 17:21:52 -07:00
dcache.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-04 10:14:28 -07:00
dcookies.c
direct-io.c block: consistently use GFP_NOIO instead of __GFP_NORECLAIM 2018-05-14 08:55:18 -06:00
drop_caches.c
eventfd.c Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL 2018-06-28 10:40:47 -07:00
eventpoll.c Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL 2018-06-28 10:40:47 -07:00
exec.c Merge branch 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-06-10 10:17:09 -07:00
fcntl.c mm: restructure memfd code 2018-06-07 17:34:35 -07:00
fhandle.c
file.c fs: add ksys_close() wrapper; remove in-kernel calls to sys_close() 2018-04-02 20:16:00 +02:00
file_table.c
filesystems.c proc: introduce proc_create_single{,_data} 2018-05-16 07:23:35 +02:00
fs-writeback.c bdi: Fix oops in wb_workfn() 2018-05-03 16:11:37 -06:00
fs_pin.c
fs_struct.c
inode.c Fix up non-directory creation in SGID directories 2018-07-05 12:36:36 -07:00
internal.h fs: factor out a __generic_write_end helper 2018-06-19 15:10:55 -07:00
ioctl.c fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems 2018-05-24 12:04:28 -05:00
iomap.c iomap: add inline data support to iomap_readpage_actor 2018-07-03 09:07:47 -07:00
Kconfig autofs: remove left-over autofs4 stubs 2018-06-11 08:22:34 -07:00
Kconfig.binfmt docs: Fix more broken references 2018-06-15 18:11:26 -03:00
libfs.c
locks.c vfs/y2038: inode timestamps conversion to timespec64 2018-06-15 07:31:07 +09:00
Makefile autofs: remove left-over autofs4 stubs 2018-06-11 08:22:34 -07:00
mbcache.c treewide: kmalloc() -> kmalloc_array() 2018-06-12 16:19:22 -07:00
mount.h
mpage.c
namei.c Merge branch 'afs-proc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-16 16:32:04 +09:00
namespace.c fs: Allow superblock owner to access do_remount_sb() 2018-05-24 12:02:25 -05:00
no-block.c
nsfs.c
open.c Revert "fs: fold open_check_o_direct into do_dentry_open" 2018-06-03 10:58:23 -07:00
pipe.c Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL 2018-06-28 10:40:47 -07:00
pnode.c
pnode.h
posix_acl.c
proc_namespace.c
read_write.c treewide: kmalloc() -> kmalloc_array() 2018-06-12 16:19:22 -07:00
readdir.c fs: add ksys_getdents64() helper; remove in-kernel calls to sys_getdents64() 2018-04-02 20:16:02 +02:00
select.c Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL 2018-06-28 10:40:47 -07:00
seq_file.c proc: fix smaps and meminfo alignment 2018-05-25 18:12:11 -07:00
signalfd.c Merge branch 'work.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-16 16:21:50 +09:00
splice.c Merge branch 'work.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-16 16:21:50 +09:00
stack.c
stat.c
statfs.c
super.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-04 10:14:28 -07:00
sync.c Changes for this release: 2018-04-04 12:44:02 -07:00
timerfd.c Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL 2018-06-28 10:40:47 -07:00
userfaultfd.c userfaultfd: hugetlbfs: fix userfaultfd_huge_must_wait() pte access 2018-07-03 17:32:18 -07:00
utimes.c
xattr.c vfs: delete unnecessary assignment in vfs_listxattr 2018-05-29 13:22:41 -04:00