From 2ae4db5647d807efb6a87c09efaa6d1db9c905d7 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Thu, 13 Jun 2024 11:38:14 +0200 Subject: [PATCH 1/8] fs: don't misleadingly warn during thaw operations The block device may have been frozen before it was claimed by a filesystem. Concurrently another process might try to mount that frozen block device and has temporarily claimed the block device for that purpose causing a concurrent fs_bdev_thaw() to end up here. The mounter is already about to abort mounting because they still saw an elevanted bdev->bd_fsfreeze_count so get_bdev_super() will return NULL in that case. For example, P1 calls dm_suspend() which calls into bdev_freeze() before the block device has been claimed by the filesystem. This brings bdev->bd_fsfreeze_count to 1 and no call into fs_bdev_freeze() is required. Now P2 tries to mount that frozen block device. It claims it and checks bdev->bd_fsfreeze_count. As it's elevated it aborts mounting. In the meantime P3 called dm_resume(). P3 sees that the block device is already claimed by a filesystem and calls into fs_bdev_thaw(). P3 takes a passive reference and realizes that the filesystem isn't ready yet. P3 puts itself to sleep to wait for the filesystem to become ready. P2 now puts the last active reference to the filesystem and marks it as dying. P3 gets woken, sees that the filesystem is dying and get_bdev_super() fails. Fixes: 49ef8832fb1a ("bdev: implement freeze and thaw holder operations") Cc: Reported-by: Theodore Ts'o Link: https://lore.kernel.org/r/20240611085210.GA1838544@mit.edu Link: https://lore.kernel.org/r/20240613-lackmantel-einsehen-90f0d727358d@brauner Reviewed-by: Darrick J. Wong Signed-off-by: Christian Brauner --- fs/super.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/fs/super.c b/fs/super.c index b72f1d288e95..095ba793e10c 100644 --- a/fs/super.c +++ b/fs/super.c @@ -1502,8 +1502,17 @@ static int fs_bdev_thaw(struct block_device *bdev) lockdep_assert_held(&bdev->bd_fsfreeze_mutex); + /* + * The block device may have been frozen before it was claimed by a + * filesystem. Concurrently another process might try to mount that + * frozen block device and has temporarily claimed the block device for + * that purpose causing a concurrent fs_bdev_thaw() to end up here. The + * mounter is already about to abort mounting because they still saw an + * elevanted bdev->bd_fsfreeze_count so get_bdev_super() will return + * NULL in that case. + */ sb = get_bdev_super(bdev); - if (WARN_ON_ONCE(!sb)) + if (!sb) return -EINVAL; if (sb->s_op->thaw_super) From 702eb71fd6501b3566283f8c96d7ccc6ddd662e9 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Mon, 17 Jun 2024 18:23:00 +0200 Subject: [PATCH 2/8] fsnotify: Do not generate events for O_PATH file descriptors Currently we will not generate FS_OPEN events for O_PATH file descriptors but we will generate FS_CLOSE events for them. This is asymmetry is confusing. Arguably no fsnotify events should be generated for O_PATH file descriptors as they cannot be used to access or modify file content, they are just convenient handles to file objects like paths. So fix the asymmetry by stopping to generate FS_CLOSE for O_PATH file descriptors. Cc: Signed-off-by: Jan Kara Link: https://lore.kernel.org/r/20240617162303.1596-1-jack@suse.cz Reviewed-by: Amir Goldstein Signed-off-by: Christian Brauner --- include/linux/fsnotify.h | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h index 4da80e92f804..278620e063ab 100644 --- a/include/linux/fsnotify.h +++ b/include/linux/fsnotify.h @@ -112,7 +112,13 @@ static inline int fsnotify_file(struct file *file, __u32 mask) { const struct path *path; - if (file->f_mode & FMODE_NONOTIFY) + /* + * FMODE_NONOTIFY are fds generated by fanotify itself which should not + * generate new events. We also don't want to generate events for + * FMODE_PATH fds (involves open & close events) as they are just + * handle creation / destruction events and not "real" file events. + */ + if (file->f_mode & (FMODE_NONOTIFY | FMODE_PATH)) return 0; path = &file->f_path; From 7d1cf5e624ef5d81b933e8b7f4927531166c0f7a Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Mon, 17 Jun 2024 18:23:01 +0200 Subject: [PATCH 3/8] vfs: generate FS_CREATE before FS_OPEN when ->atomic_open used. When a file is opened and created with open(..., O_CREAT) we get both the CREATE and OPEN fsnotify events and would expect them in that order. For most filesystems we get them in that order because open_last_lookups() calls fsnofify_create() and then do_open() (from path_openat()) calls vfs_open()->do_dentry_open() which calls fsnotify_open(). However when ->atomic_open is used, the do_dentry_open() -> fsnotify_open() call happens from finish_open() which is called from the ->atomic_open handler in lookup_open() which is called *before* open_last_lookups() calls fsnotify_create. So we get the "open" notification before "create" - which is backwards. ltp testcase inotify02 tests this and reports the inconsistency. This patch lifts the fsnotify_open() call out of do_dentry_open() and places it higher up the call stack. There are three callers of do_dentry_open(). For vfs_open() and kernel_file_open() the fsnotify_open() is placed directly in that caller so there should be no behavioural change. For finish_open() there are two cases: - finish_open is used in ->atomic_open handlers. For these we add a call to fsnotify_open() at open_last_lookups() if FMODE_OPENED is set - which means do_dentry_open() has been called. - finish_open is used in ->tmpfile() handlers. For these a similar call to fsnotify_open() is added to vfs_tmpfile() With this patch NFSv3 is restored to its previous behaviour (before ->atomic_open support was added) of generating CREATE notifications before OPEN, and NFSv4 now has that same correct ordering that is has not had before. I haven't tested other filesystems. Fixes: 7c6c5249f061 ("NFS: add atomic_open for NFSv3 to handle O_TRUNC correctly.") Reported-by: James Clark Closes: https://lore.kernel.org/all/01c3bf2e-eb1f-4b7f-a54f-d2a05dd3d8c8@arm.com Signed-off-by: NeilBrown Link: https://lore.kernel.org/r/171817619547.14261.975798725161704336@noble.neil.brown.name Fixes: 7b8c9d7bb457 ("fsnotify: move fsnotify_open() hook into do_dentry_open()") Tested-by: James Clark Signed-off-by: Jan Kara Link: https://lore.kernel.org/r/20240617162303.1596-2-jack@suse.cz Reviewed-by: Amir Goldstein Signed-off-by: Christian Brauner --- fs/namei.c | 10 ++++++++-- fs/open.c | 22 +++++++++++++++------- 2 files changed, 23 insertions(+), 9 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 37fb0a8aa09a..1e05a0f3f04d 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3572,8 +3572,12 @@ static const char *open_last_lookups(struct nameidata *nd, else inode_lock_shared(dir->d_inode); dentry = lookup_open(nd, file, op, got_write); - if (!IS_ERR(dentry) && (file->f_mode & FMODE_CREATED)) - fsnotify_create(dir->d_inode, dentry); + if (!IS_ERR(dentry)) { + if (file->f_mode & FMODE_CREATED) + fsnotify_create(dir->d_inode, dentry); + if (file->f_mode & FMODE_OPENED) + fsnotify_open(file); + } if (open_flag & O_CREAT) inode_unlock(dir->d_inode); else @@ -3700,6 +3704,8 @@ int vfs_tmpfile(struct mnt_idmap *idmap, mode = vfs_prepare_mode(idmap, dir, mode, mode, mode); error = dir->i_op->tmpfile(idmap, dir, file, mode); dput(child); + if (file->f_mode & FMODE_OPENED) + fsnotify_open(file); if (error) return error; /* Don't check for other permissions, the inode was just created */ diff --git a/fs/open.c b/fs/open.c index 89cafb572061..f1607729acb9 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1004,11 +1004,6 @@ static int do_dentry_open(struct file *f, } } - /* - * Once we return a file with FMODE_OPENED, __fput() will call - * fsnotify_close(), so we need fsnotify_open() here for symmetry. - */ - fsnotify_open(f); return 0; cleanup_all: @@ -1085,8 +1080,19 @@ EXPORT_SYMBOL(file_path); */ int vfs_open(const struct path *path, struct file *file) { + int ret; + file->f_path = *path; - return do_dentry_open(file, NULL); + ret = do_dentry_open(file, NULL); + if (!ret) { + /* + * Once we return a file with FMODE_OPENED, __fput() will call + * fsnotify_close(), so we need fsnotify_open() here for + * symmetry. + */ + fsnotify_open(file); + } + return ret; } struct file *dentry_open(const struct path *path, int flags, @@ -1177,8 +1183,10 @@ struct file *kernel_file_open(const struct path *path, int flags, error = do_dentry_open(f, NULL); if (error) { fput(f); - f = ERR_PTR(error); + return ERR_PTR(error); } + + fsnotify_open(f); return f; } EXPORT_SYMBOL_GPL(kernel_file_open); From d98b7d7dda721ca009b6dc5dd3beeeb7fd46f4b4 Mon Sep 17 00:00:00 2001 From: David Howells Date: Mon, 20 May 2024 16:12:56 +0100 Subject: [PATCH 4/8] netfs: Fix io_uring based write-through [This was included in v2 of 9b038d004ce95551cb35381c49fe896c5bc11ffe, but v1 got pushed instead] Fix netfs_unbuffered_write_iter_locked() to set the total request length in the netfs_io_request struct rather than leaving it as zero. Fixes: 288ace2f57c9 ("netfs: New writeback implementation") Signed-off-by: David Howells cc: Jeff Layton cc: Steve French cc: Enzo Matsumiya cc: Christian Brauner cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240620173137.610345-2-dhowells@redhat.com Signed-off-by: Christian Brauner --- fs/netfs/direct_write.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/netfs/direct_write.c b/fs/netfs/direct_write.c index e14cd53ac9fd..88f2adfab75e 100644 --- a/fs/netfs/direct_write.c +++ b/fs/netfs/direct_write.c @@ -92,8 +92,9 @@ ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter * __set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags); if (async) wreq->iocb = iocb; + wreq->len = iov_iter_count(&wreq->io_iter); wreq->cleanup = netfs_cleanup_dio_write; - ret = netfs_unbuffered_write(wreq, is_sync_kiocb(iocb), iov_iter_count(&wreq->io_iter)); + ret = netfs_unbuffered_write(wreq, is_sync_kiocb(iocb), wreq->len); if (ret < 0) { _debug("begin = %zd", ret); goto out; From 6470e0bc6fe1948dcc2dfe7264c5a6c7a4a6788a Mon Sep 17 00:00:00 2001 From: David Howells Date: Wed, 5 Jun 2024 22:18:04 +0100 Subject: [PATCH 5/8] netfs: Fix early issue of write op on partial write to folio tail During the writeback procedure, at the end of netfs_write_folio(), pending write operations are flushed if the amount of write-streaming data stored in a page is less than the size of the folio because if we haven't modified a folio to the end, it cannot be contiguous with the following folio... except if the dirty region of the folio is right at the end of the folio space. Fix the test to take the offset into the folio into account as well, such that if the dirty region runs right up to the end of the folio, we leave the flushing for later. Fixes: 288ace2f57c9 ("netfs: New writeback implementation") Signed-off-by: David Howells cc: Jeff Layton cc: Eric Van Hensbergen cc: Latchesar Ionkov cc: Dominique Martinet cc: Christian Schoenebeck cc: Marc Dionne cc: Steve French cc: Paulo Alcantara (DFS, global name space) cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240620173137.610345-4-dhowells@redhat.com Signed-off-by: Christian Brauner --- fs/netfs/write_issue.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c index 3aa86e268f40..ec6cf8707fb0 100644 --- a/fs/netfs/write_issue.c +++ b/fs/netfs/write_issue.c @@ -483,7 +483,7 @@ static int netfs_write_folio(struct netfs_io_request *wreq, if (!debug) kdebug("R=%x: No submit", wreq->debug_id); - if (flen < fsize) + if (foff + flen < fsize) for (int s = 0; s < NR_IO_STREAMS; s++) netfs_issue_write(wreq, &wreq->io_streams[s]); From 84dfbc9cad7d86984f2b5814bf36e61ff492f306 Mon Sep 17 00:00:00 2001 From: David Howells Date: Thu, 6 Jun 2024 11:10:44 +0100 Subject: [PATCH 6/8] netfs: Delete some xarray-wangling functions that aren't used Delete some xarray-based buffer wangling functions that are intended for use with bounce buffering, but aren't used because bounce-buffering got deferred to a later patch series. Now, however, the intention is to use something other than an xarray to do this. Signed-off-by: David Howells cc: Jeff Layton cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240620173137.610345-9-dhowells@redhat.com Signed-off-by: Christian Brauner --- fs/netfs/internal.h | 9 ----- fs/netfs/misc.c | 81 --------------------------------------------- 2 files changed, 90 deletions(-) diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h index 95e281a8af78..acd9ca14e264 100644 --- a/fs/netfs/internal.h +++ b/fs/netfs/internal.h @@ -63,15 +63,6 @@ static inline void netfs_proc_del_rreq(struct netfs_io_request *rreq) {} /* * misc.c */ -#define NETFS_FLAG_PUT_MARK BIT(0) -#define NETFS_FLAG_PAGECACHE_MARK BIT(1) -int netfs_xa_store_and_mark(struct xarray *xa, unsigned long index, - struct folio *folio, unsigned int flags, - gfp_t gfp_mask); -int netfs_add_folios_to_buffer(struct xarray *buffer, - struct address_space *mapping, - pgoff_t index, pgoff_t to, gfp_t gfp_mask); -void netfs_clear_buffer(struct xarray *buffer); /* * objects.c diff --git a/fs/netfs/misc.c b/fs/netfs/misc.c index bc1fc54fb724..83e644bd518f 100644 --- a/fs/netfs/misc.c +++ b/fs/netfs/misc.c @@ -8,87 +8,6 @@ #include #include "internal.h" -/* - * Attach a folio to the buffer and maybe set marks on it to say that we need - * to put the folio later and twiddle the pagecache flags. - */ -int netfs_xa_store_and_mark(struct xarray *xa, unsigned long index, - struct folio *folio, unsigned int flags, - gfp_t gfp_mask) -{ - XA_STATE_ORDER(xas, xa, index, folio_order(folio)); - -retry: - xas_lock(&xas); - for (;;) { - xas_store(&xas, folio); - if (!xas_error(&xas)) - break; - xas_unlock(&xas); - if (!xas_nomem(&xas, gfp_mask)) - return xas_error(&xas); - goto retry; - } - - if (flags & NETFS_FLAG_PUT_MARK) - xas_set_mark(&xas, NETFS_BUF_PUT_MARK); - if (flags & NETFS_FLAG_PAGECACHE_MARK) - xas_set_mark(&xas, NETFS_BUF_PAGECACHE_MARK); - xas_unlock(&xas); - return xas_error(&xas); -} - -/* - * Create the specified range of folios in the buffer attached to the read - * request. The folios are marked with NETFS_BUF_PUT_MARK so that we know that - * these need freeing later. - */ -int netfs_add_folios_to_buffer(struct xarray *buffer, - struct address_space *mapping, - pgoff_t index, pgoff_t to, gfp_t gfp_mask) -{ - struct folio *folio; - int ret; - - if (to + 1 == index) /* Page range is inclusive */ - return 0; - - do { - /* TODO: Figure out what order folio can be allocated here */ - folio = filemap_alloc_folio(readahead_gfp_mask(mapping), 0); - if (!folio) - return -ENOMEM; - folio->index = index; - ret = netfs_xa_store_and_mark(buffer, index, folio, - NETFS_FLAG_PUT_MARK, gfp_mask); - if (ret < 0) { - folio_put(folio); - return ret; - } - - index += folio_nr_pages(folio); - } while (index <= to && index != 0); - - return 0; -} - -/* - * Clear an xarray buffer, putting a ref on the folios that have - * NETFS_BUF_PUT_MARK set. - */ -void netfs_clear_buffer(struct xarray *buffer) -{ - struct folio *folio; - XA_STATE(xas, buffer, 0); - - rcu_read_lock(); - xas_for_each_marked(&xas, folio, ULONG_MAX, NETFS_BUF_PUT_MARK) { - folio_put(folio); - } - rcu_read_unlock(); - xa_destroy(buffer); -} - /** * netfs_dirty_folio - Mark folio dirty and pin a cache object for writeback * @mapping: The mapping the folio belongs to. From a81c98bfa40c11f8ea79b5a9b3f5fda73bfbb4d2 Mon Sep 17 00:00:00 2001 From: David Howells Date: Tue, 25 Jun 2024 13:29:06 +0100 Subject: [PATCH 7/8] netfs: Fix netfs_page_mkwrite() to check folio->mapping is valid Fix netfs_page_mkwrite() to check that folio->mapping is valid once it has taken the folio lock (as filemap_page_mkwrite() does). Without this, generic/247 occasionally oopses with something like the following: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page RIP: 0010:trace_event_raw_event_netfs_folio+0x61/0xc0 ... Call Trace: ? __die_body+0x1a/0x60 ? page_fault_oops+0x6e/0xa0 ? exc_page_fault+0xc2/0xe0 ? asm_exc_page_fault+0x22/0x30 ? trace_event_raw_event_netfs_folio+0x61/0xc0 trace_netfs_folio+0x39/0x40 netfs_page_mkwrite+0x14c/0x1d0 do_page_mkwrite+0x50/0x90 do_pte_missing+0x184/0x200 __handle_mm_fault+0x42d/0x500 handle_mm_fault+0x121/0x1f0 do_user_addr_fault+0x23e/0x3c0 exc_page_fault+0xc2/0xe0 asm_exc_page_fault+0x22/0x30 This is due to the invalidate_inode_pages2_range() issued at the end of the DIO write interfering with the mmap'd writes. Fixes: 102a7e2c598c ("netfs: Allow buffered shared-writeable mmap through netfs_page_mkwrite()") Signed-off-by: David Howells Link: https://lore.kernel.org/r/780211.1719318546@warthog.procyon.org.uk Reviewed-by: Jeff Layton cc: Matthew Wilcox cc: Jeff Layton cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner --- fs/netfs/buffered_write.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c index 07bc1fd43530..270f8ebf8328 100644 --- a/fs/netfs/buffered_write.c +++ b/fs/netfs/buffered_write.c @@ -523,6 +523,7 @@ vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_gr struct netfs_group *group; struct folio *folio = page_folio(vmf->page); struct file *file = vmf->vma->vm_file; + struct address_space *mapping = file->f_mapping; struct inode *inode = file_inode(file); struct netfs_inode *ictx = netfs_inode(inode); vm_fault_t ret = VM_FAULT_RETRY; @@ -534,6 +535,11 @@ vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_gr if (folio_lock_killable(folio) < 0) goto out; + if (folio->mapping != mapping) { + folio_unlock(folio); + ret = VM_FAULT_NOPAGE; + goto out; + } if (folio_wait_writeback_killable(folio)) { ret = VM_FAULT_LOCKED; @@ -549,7 +555,7 @@ vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_gr group = netfs_folio_group(folio); if (group != netfs_group && group != NETFS_FOLIO_COPY_TO_CACHE) { folio_unlock(folio); - err = filemap_fdatawait_range(inode->i_mapping, + err = filemap_fdatawait_range(mapping, folio_pos(folio), folio_pos(folio) + folio_size(folio)); switch (err) { From 9d66154f73b7c7007c3be1113dfb50b99b791f8f Mon Sep 17 00:00:00 2001 From: David Howells Date: Mon, 24 Jun 2024 12:24:03 +0100 Subject: [PATCH 8/8] netfs: Fix netfs_page_mkwrite() to flush conflicting data, not wait Fix netfs_page_mkwrite() to use filemap_fdatawrite_range(), not filemap_fdatawait_range() to flush conflicting data. Fixes: 102a7e2c598c ("netfs: Allow buffered shared-writeable mmap through netfs_page_mkwrite()") Signed-off-by: David Howells Link: https://lore.kernel.org/r/614300.1719228243@warthog.procyon.org.uk cc: Matthew Wilcox cc: Jeff Layton cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner --- fs/netfs/buffered_write.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c index 270f8ebf8328..d583af7a2209 100644 --- a/fs/netfs/buffered_write.c +++ b/fs/netfs/buffered_write.c @@ -555,9 +555,9 @@ vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_gr group = netfs_folio_group(folio); if (group != netfs_group && group != NETFS_FOLIO_COPY_TO_CACHE) { folio_unlock(folio); - err = filemap_fdatawait_range(mapping, - folio_pos(folio), - folio_pos(folio) + folio_size(folio)); + err = filemap_fdatawrite_range(mapping, + folio_pos(folio), + folio_pos(folio) + folio_size(folio)); switch (err) { case 0: ret = VM_FAULT_RETRY;