From a805c111650cdba6ee880f528abdd03c1af82089 Mon Sep 17 00:00:00 2001
From: Qian Cai
Date: Thu, 10 Sep 2020 08:26:15 -0700
Subject: [PATCH 01/17] iomap: fix WARN_ON_ONCE() from unprivileged users

It is trivial to trigger a WARN_ON_ONCE(1) in iomap_dio_actor() by
unprivileged users which would taint the kernel, or worse - panic if
panic_on_warn or panic_on_taint is set.  Hence, just convert it to
pr_warn_ratelimited() to let users know their workloads are racing.
Thanks to Dave Chinner for the initial analysis of the racing
reproducers.

Signed-off-by: Qian Cai
Reviewed-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
---
 fs/iomap/direct-io.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index c1aafb2ab990..9519113ebc35 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -388,6 +388,16 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 		return iomap_dio_bio_actor(inode, pos, length, dio, iomap);
 	case IOMAP_INLINE:
 		return iomap_dio_inline_actor(inode, pos, length, dio, iomap);
+	case IOMAP_DELALLOC:
+		/*
+		 * DIO is not serialised against mmap() access at all, and so
+		 * if the page_mkwrite occurs between the writeback and the
+		 * iomap_apply() call in the DIO path, then it will see the
+		 * DELALLOC block that the page-mkwrite allocated.
+		 */
+		pr_warn_ratelimited("Direct I/O collision with buffered writes! File: %pD4 Comm: %.20s\n",
+				    dio->iocb->ki_filp, current->comm);
+		return -EIO;
 	default:
 		WARN_ON_ONCE(1);
 		return -EIO;

From c114bbc6c423a449a00780d2d6633432c0a0505f Mon Sep 17 00:00:00 2001
From: Andreas Gruenbacher
Date: Thu, 10 Sep 2020 08:26:16 -0700
Subject: [PATCH 02/17] iomap: Fix direct I/O write consistency check

When a direct I/O write falls back to buffered I/O entirely, dio->size
will be 0 in iomap_dio_complete.  Function invalidate_inode_pages2_range
will try to invalidate the rest of the address space.  If there are any
dirty pages in that range, the write will fail and a "Page cache
invalidation failure on direct I/O" error will be logged.

On gfs2, this can be reproduced as follows:

  xfs_io \
    -c "open -ft foo" -c "pwrite 4k 4k" -c "close" \
    -c "open -d foo" -c "pwrite 0 4k"

Fix this by recognizing 0-length writes.

Signed-off-by: Andreas Gruenbacher
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
Reviewed-by: Christoph Hellwig
---
 fs/iomap/direct-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9519113ebc35..024d4bb3028e 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -108,7 +108,7 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 	 * ->end_io() when necessary, otherwise a racing buffer read would cache
 	 * zeros from unwritten extents.
 	 */
-	if (!dio->error &&
+	if (!dio->error && dio->size &&
 	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) {
 		int err;
 		err = invalidate_inode_pages2_range(inode->i_mapping,
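A note on patch 01 above: the commit message alludes to racing
reproducers.  The rough userspace sketch below shows the shape of such
a race: one thread keeps dirtying a shared mapping (driving repeated
->page_mkwrite delalloc allocations) while another hammers the same
range with O_DIRECT reads.  The file name, sizes, and the complete lack
of error handling are illustrative only; this is not the reproducer
Dave Chinner analysed.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char *map;

    /* Dirty a mapped page forever, forcing repeated ->page_mkwrite calls. */
    static void *mmap_writer(void *arg)
    {
            for (;;)
                    map[0]++;
            return NULL;
    }

    int main(void)
    {
            int fd = open("testfile", O_CREAT | O_RDWR | O_DIRECT, 0644);
            void *buf;
            pthread_t t;

            ftruncate(fd, 1 << 20);
            map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            posix_memalign(&buf, 4096, 4096);
            pthread_create(&t, NULL, mmap_writer, NULL);

            /* Each O_DIRECT read may now see the DELALLOC block; run until
             * the rate-limited warning fires, interrupt with ^C. */
            for (;;)
                    pread(fd, buf, 4096, 0);
    }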
From e6e7ca92623a43156100306861272e04d46385fc Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Thu, 10 Sep 2020 08:26:17 -0700
Subject: [PATCH 03/17] iomap: Clear page error before beginning a write

If we find a page in write_begin which is !Uptodate, we need to clear
any error on the page before starting to read data into it.  This
matches how filemap_fault(), do_read_cache_page() and
generic_file_buffered_read() handle PageError on !Uptodate pages.  When
calling iomap_set_range_uptodate() in __iomap_write_begin(), blocks
were not being marked as uptodate.

This was found with generic/127 and a specially modified kernel which
would fail (some) readahead I/Os.  The test read some bytes in a prior
page which caused readahead to extend into page 0x34.  There was a
subsequent write to page 0x34, followed by a read of page 0x34.
Because the blocks were still marked as !Uptodate, the read caused all
blocks to be re-read, overwriting the write.  With this change, and the
next one, the bytes which were written are marked as being Uptodate, so
even though the page is still marked as !Uptodate, the blocks
containing the written data are not re-read from storage.

Fixes: 9dc55f1389f9 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
Signed-off-by: Matthew Wilcox (Oracle)
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
Reviewed-by: Christoph Hellwig
---
 fs/iomap/buffered-io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index bcfc288dba3f..c95454784df4 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -578,6 +578,7 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
 
 	if (PageUptodate(page))
 		return 0;
+	ClearPageError(page);
 
 	do {
 		iomap_adjust_read_range(inode, iop, &block_start,
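For readers following the PageError logic in patch 03, here is a
simplified, kernel-style sketch of the __iomap_write_begin() prologue
after the patch.  It illustrates the ordering only; it is not the
actual function body, and the name is made up for illustration.

    static int __iomap_write_begin_prologue_sketch(struct page *page)
    {
            if (PageUptodate(page))
                    return 0;       /* fully uptodate: no read, stale error moot */

            /*
             * A previously failed readahead may have left PageError set.
             * We are about to (re)read the missing blocks ourselves, so a
             * stale error must not be confused with the outcome of the
             * I/O issued below.
             */
            ClearPageError(page);
            return 1;               /* caller reads the !uptodate blocks */
    }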
From 14284fedf59f1647264f4603d64418cf1fcd3eb0 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Thu, 10 Sep 2020 08:26:18 -0700
Subject: [PATCH 04/17] iomap: Mark read blocks uptodate in write_begin

When bringing (portions of) a page uptodate, we were marking blocks
that were zeroed as being uptodate, but not blocks that were read from
storage.  Like the previous commit, this problem was found with
generic/127 and a kernel which failed readahead I/Os.  This bug causes
writes to be silently lost when working with flaky storage.

Fixes: 9dc55f1389f9 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
Signed-off-by: Matthew Wilcox (Oracle)
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
Reviewed-by: Christoph Hellwig
---
 fs/iomap/buffered-io.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index c95454784df4..897ab9a26a74 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -574,7 +574,6 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
 	loff_t block_start = pos & ~(block_size - 1);
 	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
 	unsigned from = offset_in_page(pos), to = from + len, poff, plen;
-	int status;
 
 	if (PageUptodate(page))
 		return 0;
@@ -595,14 +594,13 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
 			if (WARN_ON_ONCE(flags & IOMAP_WRITE_F_UNSHARE))
 				return -EIO;
 			zero_user_segments(page, poff, from, to, poff + plen);
-			iomap_set_range_uptodate(page, poff, plen);
-			continue;
+		} else {
+			int status = iomap_read_page_sync(block_start, page,
+					poff, plen, srcmap);
+			if (status)
+				return status;
 		}
-
-		status = iomap_read_page_sync(block_start, page, poff, plen,
-				srcmap);
-		if (status)
-			return status;
+		iomap_set_range_uptodate(page, poff, plen);
 	} while ((block_start += plen) < block_end);
 
 	return 0;

From 6cc19c5fad0908e4e5ed99c2236d740668985364 Mon Sep 17 00:00:00 2001
From: Nikolay Borisov
Date: Thu, 10 Sep 2020 08:38:06 -0700
Subject: [PATCH 05/17] iomap: Use round_down/round_up macros in __iomap_write_begin

Signed-off-by: Nikolay Borisov
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
---
 fs/iomap/buffered-io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 897ab9a26a74..2b4757dba825 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -571,8 +571,8 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
 {
 	struct iomap_page *iop = iomap_page_create(inode, page);
 	loff_t block_size = i_blocksize(inode);
-	loff_t block_start = pos & ~(block_size - 1);
-	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
+	loff_t block_start = round_down(pos, block_size);
+	loff_t block_end = round_up(pos + len, block_size);
 	unsigned from = offset_in_page(pos), to = from + len, poff, plen;
 
 	if (PageUptodate(page))
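Patch 05 above is purely mechanical.  The standalone C check below
demonstrates that round_down()/round_up() agree with the open-coded
mask arithmetic for a power-of-two block size (the kernel macros are
re-stated here as functions for the sake of a self-contained test).

    #include <assert.h>

    typedef long long loff_t;

    static loff_t old_start(loff_t pos, loff_t bs) { return pos & ~(bs - 1); }
    static loff_t old_end(loff_t pos, loff_t len, loff_t bs)
    {
            return (pos + len + bs - 1) & ~(bs - 1);
    }
    /* What the kernel's round_down()/round_up() expand to for powers of two. */
    static loff_t round_down(loff_t x, loff_t y) { return x & ~(y - 1); }
    static loff_t round_up(loff_t x, loff_t y)   { return (x + y - 1) & ~(y - 1); }

    int main(void)
    {
            for (loff_t pos = 0; pos < 65536; pos += 517)
                    for (loff_t len = 1; len < 9000; len += 911) {
                            assert(old_start(pos, 4096) == round_down(pos, 4096));
                            assert(old_end(pos, len, 4096) ==
                                   round_up(pos + len, 4096));
                    }
            return 0;
    }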
From 7ed3cd1a69e3aedb29d4edf0b02624a1833a952d Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Mon, 21 Sep 2020 08:58:38 -0700
Subject: [PATCH 06/17] iomap: Fix misplaced page flushing

If iomap_unshare_actor() unshares to an inline iomap, the page was not
being flushed.  block_write_end() and __iomap_write_end() already
contain flushes, so adding it to iomap_write_end_inline() seems like
the best place.  That means we can remove it from iomap_write_actor().

Signed-off-by: Matthew Wilcox (Oracle)
Reviewed-by: Dave Chinner
Reviewed-by: Darrick J. Wong
Reviewed-by: Christoph Hellwig
Signed-off-by: Darrick J. Wong
---
 fs/iomap/buffered-io.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 2b4757dba825..d8013f38a317 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -717,6 +717,7 @@ iomap_write_end_inline(struct inode *inode, struct page *page,
 	WARN_ON_ONCE(!PageUptodate(page));
 	BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
 
+	flush_dcache_page(page);
 	addr = kmap_atomic(page);
 	memcpy(iomap->inline_data + pos, addr + pos, copied);
 	kunmap_atomic(addr);
@@ -810,8 +811,6 @@ again:
 
 		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
 
-		flush_dcache_page(page);
-
 		status = iomap_write_end(inode, pos, bytes, copied, page, iomap,
 				srcmap);
 		if (unlikely(status < 0))
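The subtlety in patch 06 is cache aliasing: the user copy dirtied the
page through its userspace mapping, and the inline-data memcpy() reads
it back through a kernel mapping.  Below is a simplified sketch of the
patched iomap_write_end_inline() with that ordering spelled out in
comments; the name carries a _sketch suffix because it abridges the
real function shown in the diff above.

    static size_t iomap_write_end_inline_sketch(struct inode *inode,
                    struct page *page, struct iomap *iomap, loff_t pos,
                    size_t copied)
    {
            void *addr;

            /*
             * The data arrived via the page's userspace alias.  On
             * architectures with aliasing D-caches, flush before reading
             * it back through the kernel mapping so the memcpy() sees
             * what userspace actually wrote.
             */
            flush_dcache_page(page);
            addr = kmap_atomic(page);
            memcpy(iomap->inline_data + pos, addr + pos, copied);
            kunmap_atomic(addr);

            mark_inode_dirty(inode);
            return copied;
    }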
From 24addd848a45747bcda68418710c72fdc8e145e4 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Mon, 21 Sep 2020 08:58:39 -0700
Subject: [PATCH 07/17] fs: Introduce i_blocks_per_page

This helper is useful for both THPs and for supporting block size
larger than page size.  Convert all users that I could find (we have a
few different ways of writing this idiom, and I may have missed some).

Signed-off-by: Matthew Wilcox (Oracle)
Reviewed-by: Christoph Hellwig
Reviewed-by: Dave Chinner
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
Acked-by: Dave Kleikamp
---
 fs/iomap/buffered-io.c  |  8 ++++----
 fs/jfs/jfs_metapage.c   |  2 +-
 fs/xfs/xfs_aops.c       |  2 +-
 include/linux/pagemap.h | 16 ++++++++++++++++
 4 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d8013f38a317..dd0739606dce 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -46,7 +46,7 @@ iomap_page_create(struct inode *inode, struct page *page)
 {
 	struct iomap_page *iop = to_iomap_page(page);
 
-	if (iop || i_blocksize(inode) == PAGE_SIZE)
+	if (iop || i_blocks_per_page(inode, page) <= 1)
 		return iop;
 
 	iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
@@ -147,7 +147,7 @@ iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
 	unsigned int i;
 
 	spin_lock_irqsave(&iop->uptodate_lock, flags);
-	for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
+	for (i = 0; i < i_blocks_per_page(inode, page); i++) {
 		if (i >= first && i <= last)
 			set_bit(i, iop->uptodate);
 		else if (!test_bit(i, iop->uptodate))
@@ -1077,7 +1077,7 @@ iomap_finish_page_writeback(struct inode *inode, struct page *page,
 		mapping_set_error(inode->i_mapping, -EIO);
 	}
 
-	WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE && !iop);
+	WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
 	WARN_ON_ONCE(iop && atomic_read(&iop->write_count) <= 0);
 
 	if (!iop || atomic_dec_and_test(&iop->write_count))
@@ -1373,7 +1373,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	int error = 0, count = 0, i;
 	LIST_HEAD(submit_list);
 
-	WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE && !iop);
+	WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
 	WARN_ON_ONCE(iop && atomic_read(&iop->write_count) != 0);
 
 	/*
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index a2f5338a5ea1..176580f54af9 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -473,7 +473,7 @@ static int metapage_readpage(struct file *fp, struct page *page)
 	struct inode *inode = page->mapping->host;
 	struct bio *bio = NULL;
 	int block_offset;
-	int blocks_per_page = PAGE_SIZE >> inode->i_blkbits;
+	int blocks_per_page = i_blocks_per_page(inode, page);
 	sector_t page_start;	/* address of page in fs blocks */
 	sector_t pblock;
 	int xlen;
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b35611882ff9..55d126d4e096 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -544,7 +544,7 @@ xfs_discard_page(
 			page, ip->i_ino, offset);
 
 	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-			PAGE_SIZE / i_blocksize(inode));
+			i_blocks_per_page(inode, page));
 	if (error && !XFS_FORCED_SHUTDOWN(mp))
 		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
 out_invalidate:
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 7de11dcd534d..853733286138 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -899,4 +899,20 @@ static inline int page_mkwrite_check_truncate(struct page *page,
 	return offset;
 }
 
+/**
+ * i_blocks_per_page - How many blocks fit in this page.
+ * @inode: The inode which contains the blocks.
+ * @page: The page (head page if the page is a THP).
+ *
+ * If the block size is larger than the size of this page, return zero.
+ *
+ * Context: The caller should hold a refcount on the page to prevent it
+ * from being split.
+ * Return: The number of filesystem blocks covered by this page.
+ */
+static inline
+unsigned int i_blocks_per_page(struct inode *inode, struct page *page)
+{
+	return thp_size(page) >> inode->i_blkbits;
+}
 #endif /* _LINUX_PAGEMAP_H */
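Worked values for the new helper, restated as standalone C (the
kernel's thp_size() is folded into a page_size parameter here, so this
is a model of the arithmetic rather than the kernel function itself):

    #include <stdio.h>

    /* i_blocks_per_page() == thp_size(page) >> inode->i_blkbits */
    static unsigned int blocks_per_page(unsigned long page_size,
                                        unsigned int blkbits)
    {
            return page_size >> blkbits;
    }

    int main(void)
    {
            printf("%u\n", blocks_per_page(4096, 10));    /* 4kB page, 1kB blocks: 4 */
            printf("%u\n", blocks_per_page(2 << 20, 12)); /* 2MB THP, 4kB blocks: 512 */
            printf("%u\n", blocks_per_page(4096, 16));    /* 64kB blocks, 4kB page: 0 */
            return 0;
    }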
From a6901d4d148dcbad7efb3174afbdf68c995618c2 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Mon, 21 Sep 2020 08:58:39 -0700
Subject: [PATCH 08/17] iomap: Use kzalloc to allocate iomap_page

We can skip most of the initialisation, although spinlocks still need
explicit initialisation as architectures may use a non-zero value to
indicate unlocked.  The comment is no longer useful as
attach_page_private() handles the refcount now.

Signed-off-by: Matthew Wilcox (Oracle)
Reviewed-by: Christoph Hellwig
Reviewed-by: Dave Chinner
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
---
 fs/iomap/buffered-io.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index dd0739606dce..d1caea4c5fba 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -49,16 +49,8 @@ iomap_page_create(struct inode *inode, struct page *page)
 	if (iop || i_blocks_per_page(inode, page) <= 1)
 		return iop;
 
-	iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
-	atomic_set(&iop->read_count, 0);
-	atomic_set(&iop->write_count, 0);
+	iop = kzalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
 	spin_lock_init(&iop->uptodate_lock);
-	bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
-
-	/*
-	 * migrate_page_move_mapping() assumes that pages with private data have
-	 * their count elevated by 1.
-	 */
 	attach_page_private(page, iop);
 	return iop;
 }

From b21866f514cb50bd75614851a8d066398a114cab Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Mon, 21 Sep 2020 08:58:40 -0700
Subject: [PATCH 09/17] iomap: Use bitmap ops to set uptodate bits

Now that the bitmap is protected by a spinlock, we can use the more
efficient bitmap ops instead of individual test/set bit ops.

Signed-off-by: Matthew Wilcox (Oracle)
Reviewed-by: Christoph Hellwig
Reviewed-by: Dave Chinner
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
---
 fs/iomap/buffered-io.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d1caea4c5fba..eab839b2be33 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -134,19 +134,11 @@ iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
 	struct inode *inode = page->mapping->host;
 	unsigned first = off >> inode->i_blkbits;
 	unsigned last = (off + len - 1) >> inode->i_blkbits;
-	bool uptodate = true;
 	unsigned long flags;
-	unsigned int i;
 
 	spin_lock_irqsave(&iop->uptodate_lock, flags);
-	for (i = 0; i < i_blocks_per_page(inode, page); i++) {
-		if (i >= first && i <= last)
-			set_bit(i, iop->uptodate);
-		else if (!test_bit(i, iop->uptodate))
-			uptodate = false;
-	}
-
-	if (uptodate)
+	bitmap_set(iop->uptodate, first, last - first + 1);
+	if (bitmap_full(iop->uptodate, i_blocks_per_page(inode, page)))
 		SetPageUptodate(page);
 	spin_unlock_irqrestore(&iop->uptodate_lock, flags);
 }
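To see that patch 09 preserves behaviour, the standalone model below
runs the old per-bit loop and the new set-then-test-full approach over
every block range of a small page and checks that they agree (a bool
array stands in for the kernel bitmap; this is a test of the logic,
not of the kernel bitmap implementation).

    #include <assert.h>
    #include <stdbool.h>
    #include <string.h>

    #define NBLOCKS 8

    /* Old approach: walk every block, set the range, check the rest. */
    static bool set_range_old(bool *uptodate, int first, int last)
    {
            bool all = true;
            for (int i = 0; i < NBLOCKS; i++) {
                    if (i >= first && i <= last)
                            uptodate[i] = true;
                    else if (!uptodate[i])
                            all = false;
            }
            return all;
    }

    /* New approach: bitmap_set() the range, then bitmap_full(). */
    static bool set_range_new(bool *uptodate, int first, int last)
    {
            memset(uptodate + first, true, last - first + 1);
            for (int i = 0; i < NBLOCKS; i++)
                    if (!uptodate[i])
                            return false;
            return true;
    }

    int main(void)
    {
            for (int first = 0; first < NBLOCKS; first++)
                    for (int last = first; last < NBLOCKS; last++) {
                            bool a[NBLOCKS] = {0}, b[NBLOCKS] = {0};
                            assert(set_range_old(a, first, last) ==
                                   set_range_new(b, first, last));
                    }
            return 0;
    }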
From 0a195b91e8991367a94ee199f3af7faa7607e7db Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Mon, 21 Sep 2020 08:58:40 -0700
Subject: [PATCH 10/17] iomap: Support arbitrarily many blocks per page

Size the uptodate array dynamically to support larger pages in the
page cache.  With a 64kB page, we're only saving 8 bytes per page
today, but with a 2MB maximum page size, we'd have to allocate more
than 4kB per page.  Add a few debugging assertions.

Signed-off-by: Matthew Wilcox (Oracle)
Reviewed-by: Dave Chinner
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
Reviewed-by: Christoph Hellwig
---
 fs/iomap/buffered-io.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index eab839b2be33..9f0fa495ab69 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -22,18 +22,25 @@
 #include "../internal.h"
 
 /*
- * Structure allocated for each page when block size < PAGE_SIZE to track
- * sub-page uptodate status and I/O completions.
+ * Structure allocated for each page or THP when block size < page size
+ * to track sub-page uptodate status and I/O completions.
  */
 struct iomap_page {
 	atomic_t		read_count;
 	atomic_t		write_count;
 	spinlock_t		uptodate_lock;
-	DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
+	unsigned long		uptodate[];
 };
 
 static inline struct iomap_page *to_iomap_page(struct page *page)
 {
+	/*
+	 * per-block data is stored in the head page.  Callers should
+	 * not be dealing with tail pages (and if they are, they can
+	 * call thp_head() first).
+	 */
+	VM_BUG_ON_PGFLAGS(PageTail(page), page);
+
 	if (page_has_private(page))
 		return (struct iomap_page *)page_private(page);
 	return NULL;
@@ -45,11 +52,13 @@ static struct iomap_page *
 iomap_page_create(struct inode *inode, struct page *page)
 {
 	struct iomap_page *iop = to_iomap_page(page);
+	unsigned int nr_blocks = i_blocks_per_page(inode, page);
 
-	if (iop || i_blocks_per_page(inode, page) <= 1)
+	if (iop || nr_blocks <= 1)
 		return iop;
 
-	iop = kzalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
+	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
+			GFP_NOFS | __GFP_NOFAIL);
 	spin_lock_init(&iop->uptodate_lock);
 	attach_page_private(page, iop);
 	return iop;
@@ -59,11 +68,14 @@ static void
 iomap_page_release(struct page *page)
 {
 	struct iomap_page *iop = detach_page_private(page);
+	unsigned int nr_blocks = i_blocks_per_page(page->mapping->host, page);
 
 	if (!iop)
 		return;
 	WARN_ON_ONCE(atomic_read(&iop->read_count));
 	WARN_ON_ONCE(atomic_read(&iop->write_count));
+	WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
+			PageUptodate(page));
 	kfree(iop);
 }
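The allocation in patch 10 sizes the trailing bitmap with
struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)).  The standalone
program below restates that arithmetic with a stand-in struct (plain
ints replace the kernel's atomic_t and spinlock_t, so the absolute
byte counts are illustrative; the scaling with nr_blocks is the point).

    #include <stdio.h>

    #define BITS_PER_LONG    (8 * sizeof(long))
    #define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

    struct iomap_page_model {
            int read_count;
            int write_count;
            int uptodate_lock;
            unsigned long uptodate[];   /* flexible array, sized at alloc */
    };

    int main(void)
    {
            /* equivalent of struct_size(iop, uptodate, BITS_TO_LONGS(n)) */
            for (unsigned int n = 1; n <= 512; n *= 8)
                    printf("nr_blocks=%3u -> %zu bytes\n", n,
                           sizeof(struct iomap_page_model) +
                           BITS_TO_LONGS(n) * sizeof(unsigned long));
            return 0;
    }

For a 2MB THP with 4kB blocks (nr_blocks = 512) this stays within one
cacheline-sized tail, versus the 4kB-plus static bitmap the commit
message warns a fixed-size array would have required.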
From 7d636676d2841ba5d92462dfa99a89c2f2da8919 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Mon, 21 Sep 2020 08:58:40 -0700
Subject: [PATCH 11/17] iomap: Convert read_count to read_bytes_pending

Instead of counting bio segments, count the number of bytes submitted.
This insulates us from the block layer's definition of what a 'same
page' is, which is not necessarily clear once THPs are involved.

Signed-off-by: Matthew Wilcox (Oracle)
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
Reviewed-by: Christoph Hellwig
---
 fs/iomap/buffered-io.c | 41 ++++++++++++-----------------------------
 1 file changed, 12 insertions(+), 29 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9f0fa495ab69..46370cea30c8 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -26,7 +26,7 @@
  * to track sub-page uptodate status and I/O completions.
  */
 struct iomap_page {
-	atomic_t		read_count;
+	atomic_t		read_bytes_pending;
 	atomic_t		write_count;
 	spinlock_t		uptodate_lock;
 	unsigned long		uptodate[];
@@ -72,7 +72,7 @@ iomap_page_release(struct page *page)
 
 	if (!iop)
 		return;
-	WARN_ON_ONCE(atomic_read(&iop->read_count));
+	WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
 	WARN_ON_ONCE(atomic_read(&iop->write_count));
 	WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
 			PageUptodate(page));
@@ -167,13 +167,6 @@ iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
 		SetPageUptodate(page);
 }
 
-static void
-iomap_read_finish(struct iomap_page *iop, struct page *page)
-{
-	if (!iop || atomic_dec_and_test(&iop->read_count))
-		unlock_page(page);
-}
-
 static void
 iomap_read_page_end_io(struct bio_vec *bvec, int error)
 {
@@ -187,7 +180,8 @@ iomap_read_page_end_io(struct bio_vec *bvec, int error)
 		iomap_set_range_uptodate(page, bvec->bv_offset, bvec->bv_len);
 	}
 
-	iomap_read_finish(iop, page);
+	if (!iop || atomic_sub_and_test(bvec->bv_len, &iop->read_bytes_pending))
+		unlock_page(page);
 }
 
 static void
@@ -267,30 +261,19 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	}
 
 	ctx->cur_page_in_bio = true;
+	if (iop)
+		atomic_add(plen, &iop->read_bytes_pending);
 
-	/*
-	 * Try to merge into a previous segment if we can.
-	 */
+	/* Try to merge into a previous segment if we can */
 	sector = iomap_sector(iomap, pos);
-	if (ctx->bio && bio_end_sector(ctx->bio) == sector)
+	if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
+		if (__bio_try_merge_page(ctx->bio, page, plen, poff,
+				&same_page))
+			goto done;
 		is_contig = true;
-
-	if (is_contig &&
-	    __bio_try_merge_page(ctx->bio, page, plen, poff, &same_page)) {
-		if (!same_page && iop)
-			atomic_inc(&iop->read_count);
-		goto done;
 	}
-
-	/*
-	 * If we start a new segment we need to increase the read count, and we
-	 * need to do so before submitting any previous full bio to make sure
-	 * that we don't prematurely unlock the page.
-	 */
-	if (iop)
-		atomic_inc(&iop->read_count);
-
-	if (!ctx->bio || !is_contig || bio_full(ctx->bio, plen)) {
+	if (!is_contig || bio_full(ctx->bio, plen)) {
 		gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
 		gfp_t orig_gfp = gfp;
 		int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
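The byte-counting model introduced in patch 11 (and applied to the
write side in the next patch) is easy to state in isolation: the page
is unlocked when all bytes submitted for I/O have completed, no matter
how the block layer split or merged them into bios.  A small C11 model
of the unlock condition, using atomic_fetch_sub() to mimic the
kernel's atomic_sub_and_test():

    #include <assert.h>
    #include <stdatomic.h>

    static atomic_int read_bytes_pending;
    static int page_locked = 1;

    static void submit(int bytes)
    {
            atomic_fetch_add(&read_bytes_pending, bytes);
    }

    static void complete(int bytes)
    {
            /* previous value == bytes means the counter just hit zero */
            if (atomic_fetch_sub(&read_bytes_pending, bytes) == bytes)
                    page_locked = 0;    /* last completion unlocks the page */
    }

    int main(void)
    {
            submit(512);            /* first block of the page */
            submit(3584);           /* remainder, merged into another bio */
            complete(3584);
            assert(page_locked);    /* 512 bytes still outstanding */
            complete(512);
            assert(!page_locked);
            return 0;
    }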
From 0fb2d7209d66a84d58d9337ffe9f609fd552b07b Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Mon, 21 Sep 2020 08:58:41 -0700
Subject: [PATCH 12/17] iomap: Convert write_count to write_bytes_pending

Instead of counting bio segments, count the number of bytes submitted.
This insulates us from the block layer's definition of what a 'same
page' is, which is not necessarily clear once THPs are involved.

Signed-off-by: Matthew Wilcox (Oracle)
Reviewed-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
---
 fs/iomap/buffered-io.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 46370cea30c8..fdcdf1704e38 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -27,7 +27,7 @@
  */
 struct iomap_page {
 	atomic_t		read_bytes_pending;
-	atomic_t		write_count;
+	atomic_t		write_bytes_pending;
 	spinlock_t		uptodate_lock;
 	unsigned long		uptodate[];
 };
@@ -73,7 +73,7 @@ iomap_page_release(struct page *page)
 	if (!iop)
 		return;
 	WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
-	WARN_ON_ONCE(atomic_read(&iop->write_count));
+	WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
 	WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
 			PageUptodate(page));
 	kfree(iop);
@@ -1047,7 +1047,7 @@ EXPORT_SYMBOL_GPL(iomap_page_mkwrite);
 
 static void
 iomap_finish_page_writeback(struct inode *inode, struct page *page,
-		int error)
+		int error, unsigned int len)
 {
 	struct iomap_page *iop = to_iomap_page(page);
 
@@ -1057,9 +1057,9 @@ iomap_finish_page_writeback(struct inode *inode, struct page *page,
 	}
 
 	WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
-	WARN_ON_ONCE(iop && atomic_read(&iop->write_count) <= 0);
+	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) <= 0);
 
-	if (!iop || atomic_dec_and_test(&iop->write_count))
+	if (!iop || atomic_sub_and_test(len, &iop->write_bytes_pending))
 		end_page_writeback(page);
 }
 
@@ -1093,7 +1093,8 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 
 		/* walk each page on bio, ending page IO on them */
 		bio_for_each_segment_all(bv, bio, iter_all)
-			iomap_finish_page_writeback(inode, bv->bv_page, error);
+			iomap_finish_page_writeback(inode, bv->bv_page, error,
+					bv->bv_len);
 		bio_put(bio);
 	}
 	/* The ioend has been freed by bio_put() */
@@ -1309,8 +1310,8 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
 
 	merged = __bio_try_merge_page(wpc->ioend->io_bio, page, len, poff,
 			&same_page);
-	if (iop && !same_page)
-		atomic_inc(&iop->write_count);
+	if (iop)
+		atomic_add(len, &iop->write_bytes_pending);
 
 	if (!merged) {
 		if (bio_full(wpc->ioend->io_bio, len)) {
@@ -1353,7 +1354,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	LIST_HEAD(submit_list);
 
 	WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
-	WARN_ON_ONCE(iop && atomic_read(&iop->write_count) != 0);
+	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
 
 	/*
 	 * Walk through the page to find areas to write back. If we run off the
From e25ba8cbfd16bb10df8d90a138c743676fb8ee40 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Mon, 21 Sep 2020 08:58:41 -0700
Subject: [PATCH 13/17] iomap: Convert iomap_write_end types

iomap_write_end cannot return an error, so switch it to return size_t
instead of int and remove the error checking from the callers.  Also
convert the arguments to size_t from unsigned int, in case anyone ever
wants to support a page size larger than 2GB.

Signed-off-by: Matthew Wilcox (Oracle)
Reviewed-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
---
 fs/iomap/buffered-io.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index fdcdf1704e38..f26b24f39de4 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -663,9 +663,8 @@ iomap_set_page_dirty(struct page *page)
 }
 EXPORT_SYMBOL_GPL(iomap_set_page_dirty);
 
-static int
-__iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
-		unsigned copied, struct page *page)
+static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
+		size_t copied, struct page *page)
 {
 	flush_dcache_page(page);
 
@@ -687,9 +686,8 @@ __iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
 	return copied;
 }
 
-static int
-iomap_write_end_inline(struct inode *inode, struct page *page,
-		struct iomap *iomap, loff_t pos, unsigned copied)
+static size_t iomap_write_end_inline(struct inode *inode, struct page *page,
+		struct iomap *iomap, loff_t pos, size_t copied)
 {
 	void *addr;
 
@@ -705,13 +703,14 @@ iomap_write_end_inline(struct inode *inode, struct page *page,
 	return copied;
 }
 
-static int
-iomap_write_end(struct inode *inode, loff_t pos, unsigned len, unsigned copied,
-		struct page *page, struct iomap *iomap, struct iomap *srcmap)
+/* Returns the number of bytes copied.  May be 0.  Cannot be an errno. */
+static size_t iomap_write_end(struct inode *inode, loff_t pos, size_t len,
+		size_t copied, struct page *page, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	const struct iomap_page_ops *page_ops = iomap->page_ops;
 	loff_t old_size = inode->i_size;
-	int ret;
+	size_t ret;
 
 	if (srcmap->type == IOMAP_INLINE) {
 		ret = iomap_write_end_inline(inode, page, iomap, pos, copied);
@@ -790,11 +789,8 @@ again:
 
 		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
 
-		status = iomap_write_end(inode, pos, bytes, copied, page, iomap,
+		copied = iomap_write_end(inode, pos, bytes, copied, page, iomap,
 				srcmap);
-		if (unlikely(status < 0))
-			break;
-		copied = status;
 
 		cond_resched();
 
@@ -868,11 +864,8 @@ iomap_unshare_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 
 		status = iomap_write_end(inode, pos, bytes, bytes, page, iomap,
 				srcmap);
-		if (unlikely(status <= 0)) {
-			if (WARN_ON_ONCE(status == 0))
-				return -EIO;
-			return status;
-		}
+		if (WARN_ON_ONCE(status == 0))
+			return -EIO;
 
 		cond_resched();
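Patch 13's contract change reads best as a loop invariant:
iomap_write_end() now reports progress in bytes and never failure, so
a zero return simply means "short copy, fault the user page in and
retry".  A toy standalone model of that invariant (copy_step() is an
invented stand-in for the copy-plus-write_end step, not a kernel
function):

    #include <assert.h>
    #include <stddef.h>

    /* Returns bytes committed; 0 models a short copy, never an errno. */
    static size_t copy_step(size_t want, int *faulted)
    {
            if (!*faulted) {        /* simulate one page fault mid-write */
                    *faulted = 1;
                    return 0;
            }
            return want;
    }

    int main(void)
    {
            size_t pos = 0, remaining = 4096;
            int faulted = 0;

            while (remaining) {
                    size_t copied = copy_step(remaining, &faulted);
                    /* no error branch: a size_t cannot carry -EFAULT */
                    pos += copied;
                    remaining -= copied;
            }
            assert(pos == 4096);
            return 0;
    }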
From 81ee8e52a71c712dbe04994f1430cb4c88b87ad6 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Mon, 21 Sep 2020 08:58:42 -0700
Subject: [PATCH 14/17] iomap: Change calling convention for zeroing

Pass the full length to iomap_zero() and dax_iomap_zero(), and have
them return how many bytes they actually handled.  This is preparatory
work for handling THP, although it looks like DAX could actually take
advantage of it if there's a larger contiguous area.

Signed-off-by: Matthew Wilcox (Oracle)
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
Reviewed-by: Christoph Hellwig
---
 fs/dax.c               | 13 ++++++-------
 fs/iomap/buffered-io.c | 33 +++++++++++++++------------------
 include/linux/dax.h    |  3 +--
 3 files changed, 22 insertions(+), 27 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 994ab66a9907..6ad346352a8c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1037,18 +1037,18 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
 	return ret;
 }
 
-int dax_iomap_zero(loff_t pos, unsigned offset, unsigned size,
-		   struct iomap *iomap)
+s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
 {
 	sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
 	pgoff_t pgoff;
 	long rc, id;
 	void *kaddr;
 	bool page_aligned = false;
-
+	unsigned offset = offset_in_page(pos);
+	unsigned size = min_t(u64, PAGE_SIZE - offset, length);
 
 	if (IS_ALIGNED(sector << SECTOR_SHIFT, PAGE_SIZE) &&
-	    IS_ALIGNED(size, PAGE_SIZE))
+	    (size == PAGE_SIZE))
 		page_aligned = true;
 
 	rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, &pgoff);
@@ -1058,8 +1058,7 @@ int dax_iomap_zero(loff_t pos, unsigned offset, unsigned size,
 	id = dax_read_lock();
 
 	if (page_aligned)
-		rc = dax_zero_page_range(iomap->dax_dev, pgoff,
-					 size >> PAGE_SHIFT);
+		rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
 	else
 		rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
 	if (rc < 0) {
@@ -1072,7 +1071,7 @@ int dax_iomap_zero(loff_t pos, unsigned offset, unsigned size,
 		dax_flush(iomap->dax_dev, kaddr + offset, size);
 	}
 	dax_read_unlock(id);
-	return 0;
+	return size;
 }
 
 static loff_t
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f26b24f39de4..8b6cca7e34e4 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -898,11 +898,13 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 }
 EXPORT_SYMBOL_GPL(iomap_file_unshare);
 
-static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
-		unsigned bytes, struct iomap *iomap, struct iomap *srcmap)
+static s64 iomap_zero(struct inode *inode, loff_t pos, u64 length,
+		struct iomap *iomap, struct iomap *srcmap)
 {
 	struct page *page;
 	int status;
+	unsigned offset = offset_in_page(pos);
+	unsigned bytes = min_t(u64, PAGE_SIZE - offset, length);
 
 	status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap, srcmap);
 	if (status)
@@ -914,38 +916,33 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
 	return iomap_write_end(inode, pos, bytes, bytes, page, iomap, srcmap);
 }
 
-static loff_t
-iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+static loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos,
+		loff_t length, void *data, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	bool *did_zero = data;
 	loff_t written = 0;
-	int status;
 
 	/* already zeroed?  we're done. */
 	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
-		return count;
+		return length;
 
 	do {
-		unsigned offset, bytes;
-
-		offset = offset_in_page(pos);
-		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
+		s64 bytes;
 
 		if (IS_DAX(inode))
-			status = dax_iomap_zero(pos, offset, bytes, iomap);
+			bytes = dax_iomap_zero(pos, length, iomap);
 		else
-			status = iomap_zero(inode, pos, offset, bytes, iomap,
-					srcmap);
-		if (status < 0)
-			return status;
+			bytes = iomap_zero(inode, pos, length, iomap, srcmap);
+		if (bytes < 0)
+			return bytes;
 
 		pos += bytes;
-		count -= bytes;
+		length -= bytes;
 		written += bytes;
 		if (did_zero)
 			*did_zero = true;
-	} while (count > 0);
+	} while (length > 0);
 
 	return written;
 }
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 6904d4e0b2e0..951a851a0481 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -214,8 +214,7 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
-int dax_iomap_zero(loff_t pos, unsigned offset, unsigned size,
-		   struct iomap *iomap);
+s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap);
 static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
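Under patch 14's convention, each callee clamps itself to the current
page and reports how much it handled; the actor loop just advances by
the return value.  A worked example of that chunking as a standalone
program (pos and length values are arbitrary):

    #include <stdio.h>

    #define PAGE_SIZE 4096ULL

    int main(void)
    {
            unsigned long long pos = 1536, length = 10000;

            while (length > 0) {
                    unsigned long long offset = pos % PAGE_SIZE;
                    /* what iomap_zero()/dax_iomap_zero() now compute and return */
                    unsigned long long bytes = PAGE_SIZE - offset;

                    if (bytes > length)
                            bytes = length;
                    printf("zero %llu bytes at %llu\n", bytes, pos);
                    pos += bytes;
                    length -= bytes;
            }
            return 0;
    }

This prints 2560, 4096, and 3344 bytes: a partial head page, a full
(and in the DAX case, fast-path) page, and a partial tail.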
From 4595a298d5563cf76c1d852970f162051fd1a7a6 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Fri, 25 Sep 2020 11:16:53 -0700
Subject: [PATCH 15/17] iomap: Set all uptodate bits for an Uptodate page

For filesystems with block size < page size, we need to set all the
per-block uptodate bits if the page was already uptodate at the time
we create the per-block metadata.  This can happen if the page is
invalidated (eg by a write to drop_caches) but ultimately not removed
from the page cache.

This is a data corruption issue as page writeback skips blocks which
are marked !uptodate.

Fixes: 9dc55f1389f9 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
Signed-off-by: Matthew Wilcox (Oracle)
Reported-by: Qian Cai
Cc: Brian Foster
Reviewed-by: Gao Xiang
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
Reviewed-by: Christoph Hellwig
---
 fs/iomap/buffered-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8b6cca7e34e4..8180061b9e16 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -60,6 +60,8 @@ iomap_page_create(struct inode *inode, struct page *page)
 	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
 			GFP_NOFS | __GFP_NOFAIL);
 	spin_lock_init(&iop->uptodate_lock);
+	if (PageUptodate(page))
+		bitmap_fill(iop->uptodate, nr_blocks);
 	attach_page_private(page, iop);
 	return iop;
 }
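The invariant patch 15 restores is that whenever an iomap_page is
(re)created, bitmap_full(uptodate) matches PageUptodate.  The toy model
below replays the corruption scenario from the commit message:
invalidation frees the iop but leaves the uptodate page cached, and a
later re-creation must bitmap_fill() or writeback will skip every
block.  The struct and helper are inventions of this note, not kernel
code.

    #include <assert.h>
    #include <stdbool.h>

    #define NBLOCKS 4

    struct page_model {
            bool page_uptodate;             /* survives the invalidation */
            bool block_uptodate[NBLOCKS];   /* freed and re-created with the iop */
    };

    static void iop_create(struct page_model *p, bool patched)
    {
            for (int i = 0; i < NBLOCKS; i++)   /* kzalloc: all bits clear... */
                    p->block_uptodate[i] = patched ? p->page_uptodate : false;
    }

    int main(void)
    {
            struct page_model p = { .page_uptodate = true };

            iop_create(&p, false);          /* pre-patch behaviour */
            assert(!p.block_uptodate[0]);   /* writeback would skip this block */

            iop_create(&p, true);           /* patched: bitmap_fill() on create */
            for (int i = 0; i < NBLOCKS; i++)
                    assert(p.block_uptodate[i]);
            return 0;
    }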
From c3d4ed1abecfcfc801199cfadb71f5b80e025d9e Mon Sep 17 00:00:00 2001
From: Christoph Hellwig
Date: Mon, 28 Sep 2020 08:51:08 -0700
Subject: [PATCH 16/17] iomap: Allow filesystem to call iomap_dio_complete without i_rwsem

This is to avoid the deadlock caused in btrfs because of O_DIRECT |
O_DSYNC.  Filesystems such as btrfs require i_rwsem while performing
sync on a file.  iomap_dio_rw() is called under i_rwsem.  This leads
to a deadlock because of:

  iomap_dio_complete()
    generic_write_sync()
      btrfs_sync_file()

Separate out iomap_dio_complete() from iomap_dio_rw(), so filesystems
can call iomap_dio_complete() after unlocking i_rwsem.

Signed-off-by: Christoph Hellwig
Reviewed-by: Josef Bacik
Signed-off-by: Goldwyn Rodrigues
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
---
 fs/iomap/direct-io.c  | 35 ++++++++++++++++++++++++++---------
 include/linux/iomap.h |  5 +++++
 2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 024d4bb3028e..b005ef5f554e 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -76,7 +76,7 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
 		dio->submit.cookie = submit_bio(bio);
 }
 
-static ssize_t iomap_dio_complete(struct iomap_dio *dio)
+ssize_t iomap_dio_complete(struct iomap_dio *dio)
 {
 	const struct iomap_dio_ops *dops = dio->dops;
 	struct kiocb *iocb = dio->iocb;
@@ -130,6 +130,7 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(iomap_dio_complete);
 
 static void iomap_dio_complete_work(struct work_struct *work)
 {
@@ -416,8 +417,8 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
  * Returns -ENOTBLK In case of a page invalidation invalidation failure for
  * writes.  The callers needs to fall back to buffered I/O in this case.
  */
-ssize_t
-iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
+struct iomap_dio *
+__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		bool wait_for_completion)
 {
@@ -431,14 +432,14 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	struct iomap_dio *dio;
 
 	if (!count)
-		return 0;
+		return NULL;
 
 	if (WARN_ON(is_sync_kiocb(iocb) && !wait_for_completion))
-		return -EIO;
+		return ERR_PTR(-EIO);
 
 	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
 	if (!dio)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	dio->iocb = iocb;
 	atomic_set(&dio->ref, 1);
@@ -568,7 +569,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	dio->wait_for_completion = wait_for_completion;
 	if (!atomic_dec_and_test(&dio->ref)) {
 		if (!wait_for_completion)
-			return -EIOCBQUEUED;
+			return ERR_PTR(-EIOCBQUEUED);
 
 		for (;;) {
 			set_current_state(TASK_UNINTERRUPTIBLE);
@@ -584,10 +585,26 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		__set_current_state(TASK_RUNNING);
 	}
 
-	return iomap_dio_complete(dio);
+	return dio;
 
 out_free_dio:
 	kfree(dio);
-	return ret;
+	if (ret)
+		return ERR_PTR(ret);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(__iomap_dio_rw);
+
+ssize_t
+iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
+		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
+		bool wait_for_completion)
+{
+	struct iomap_dio *dio;
+
+	dio = __iomap_dio_rw(iocb, iter, ops, dops, wait_for_completion);
+	if (IS_ERR_OR_NULL(dio))
+		return PTR_ERR_OR_ZERO(dio);
+	return iomap_dio_complete(dio);
 }
 EXPORT_SYMBOL_GPL(iomap_dio_rw);
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4d1d3c3469e9..172b3397a1a3 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -13,6 +13,7 @@
 struct address_space;
 struct fiemap_extent_info;
 struct inode;
+struct iomap_dio;
 struct iomap_writepage_ctx;
 struct iov_iter;
 struct kiocb;
@@ -258,6 +259,10 @@ struct iomap_dio_ops {
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		bool wait_for_completion);
+struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
+		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
+		bool wait_for_completion);
+ssize_t iomap_dio_complete(struct iomap_dio *dio);
 int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
 
 #ifdef CONFIG_SWAP
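A hedged sketch of how a filesystem such as btrfs might consume the
split API from patch 16: run the DIO under i_rwsem, drop the lock,
then complete.  The function name and the surrounding locking are
illustrative only; the eventual btrfs code differs.

    static ssize_t example_dio_write(struct kiocb *iocb, struct iov_iter *from,
                    const struct iomap_ops *ops, const struct iomap_dio_ops *dops)
    {
            struct inode *inode = file_inode(iocb->ki_filp);
            struct iomap_dio *dio;

            inode_lock(inode);
            /* ... per-filesystem checks and write preparation ... */
            dio = __iomap_dio_rw(iocb, from, ops, dops, is_sync_kiocb(iocb));
            inode_unlock(inode);

            /* NULL means a zero-byte I/O; ERR_PTR covers -EIOCBQUEUED etc. */
            if (IS_ERR_OR_NULL(dio))
                    return PTR_ERR_OR_ZERO(dio);

            /*
             * i_rwsem is no longer held, so the generic_write_sync() issued
             * by iomap_dio_complete() may take it (e.g. via btrfs_sync_file())
             * without deadlocking.
             */
            return iomap_dio_complete(dio);
    }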
From 1a31182edd0083bb9f26e582ed39f92f898c4d0a Mon Sep 17 00:00:00 2001
From: Goldwyn Rodrigues
Date: Mon, 28 Sep 2020 08:51:08 -0700
Subject: [PATCH 17/17] iomap: Call inode_dio_end() before generic_write_sync()

The iomap completion routine can deadlock with btrfs_fallocate()
because of the call to generic_write_sync():

  P0                        P1
  inode_lock()              fallocate(FALLOC_FL_ZERO_RANGE)
  __iomap_dio_rw()            inode_lock()
  inode_unlock()              inode_dio_wait()
  iomap_dio_complete()
    generic_write_sync()
      btrfs_file_fsync()
        inode_lock()

inode_dio_end() is used to notify the end of DIO data in order to
synchronize with truncate.  Call inode_dio_end() before calling
generic_write_sync(), so filesystems can lock i_rwsem during a sync.

This matches the way it is done in fs/direct-io.c:dio_complete().

Signed-off-by: Goldwyn Rodrigues
Reviewed-by: Christoph Hellwig
Reviewed-by: Josef Bacik
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
---
 fs/iomap/direct-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b005ef5f554e..933f234d5bec 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -118,6 +118,7 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 			dio_warn_stale_pagecache(iocb->ki_filp);
 	}
 
+	inode_dio_end(file_inode(iocb->ki_filp));
 	/*
 	 * If this is a DSYNC write, make sure we push it to stable storage now
 	 * that we've written data.
@@ -125,7 +126,6 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 	if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
 		ret = generic_write_sync(iocb, ret);
 
-	inode_dio_end(file_inode(iocb->ki_filp));
 	kfree(dio);
 
 	return ret;