Commit Graph

361 Commits

Author SHA1 Message Date
NeilBrown 7c6c5249f0 NFS: add atomic_open for NFSv3 to handle O_TRUNC correctly.
With two clients, each with NFSv3 mounts of the same directory, the sequence:

   client1            client2
  ls -l afile
                      echo hello there > afile
  echo HELLO > afile
  cat afile

will show
   HELLO
   there

because the O_TRUNC requested in the final 'echo' doesn't take effect.
This is because the "Negative dentry, just create a file" section in
lookup_open() assumes that the file *does* get created since the dentry
was negative, so it sets FMODE_CREATED, and this causes do_open() to
clear O_TRUNC and so the file doesn't get truncated.

Even mounting with -o lookupcache=none does not help as
nfs_neg_need_reval() always returns false if LOOKUP_CREATE is set.

This patch fixes the problem by providing an atomic_open inode operation
for NFSv3 (and v2).  The code is largely the code from the branch in
lookup_open() when atomic_open is not provided.  The significant change
is that the O_TRUNC flag is passed a new nfs_do_create() which add
'trunc' handling to nfs_create().

With this change we also optimise away an unnecessary LOOKUP before the
file is created.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-20 11:09:20 -04:00
Josef Bacik 17f46b803d nfs: fix UAF in direct writes
In production we have been hitting the following warning consistently

------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 17 PID: 1800359 at lib/refcount.c:28 refcount_warn_saturate+0x9c/0xe0
Workqueue: nfsiod nfs_direct_write_schedule_work [nfs]
RIP: 0010:refcount_warn_saturate+0x9c/0xe0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __warn+0x9f/0x130
 ? refcount_warn_saturate+0x9c/0xe0
 ? report_bug+0xcc/0x150
 ? handle_bug+0x3d/0x70
 ? exc_invalid_op+0x16/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? refcount_warn_saturate+0x9c/0xe0
 nfs_direct_write_schedule_work+0x237/0x250 [nfs]
 process_one_work+0x12f/0x4a0
 worker_thread+0x14e/0x3b0
 ? ZSTD_getCParams_internal+0x220/0x220
 kthread+0xdc/0x120
 ? __btf_name_valid+0xa0/0xa0
 ret_from_fork+0x1f/0x30

This is because we're completing the nfs_direct_request twice in a row.

The source of this is when we have our commit requests to submit, we
process them and send them off, and then in the completion path for the
commit requests we have

if (nfs_commit_end(cinfo.mds))
	nfs_direct_write_complete(dreq);

However since we're submitting asynchronous requests we sometimes have
one that completes before we submit the next one, so we end up calling
complete on the nfs_direct_request twice.

The only other place we use nfs_generic_commit_list() is in
__nfs_commit_inode, which wraps this call in a

nfs_commit_begin();
nfs_commit_end();

Which is a common pattern for this style of completion handling, one
that is also repeated in the direct code with get_dreq()/put_dreq()
calls around where we process events as well as in the completion paths.

Fix this by using the same pattern for the commit requests.

Before with my 200 node rocksdb stress running this warning would pop
every 10ish minutes.  With my patch the stress test has been running for
several hours without popping.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09 09:14:51 -05:00
Matthew Wilcox (Oracle) 12fc0a9631 nfs: Remove writepage
NFS already has writepages and migrate_folio, so it does not need to
implement writepage.  The writepage operation is deprecated as it leads
to worse performance under high memory pressure due to folios being
written out in LRU order rather than sequentially within a file.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-01-04 10:47:56 -05:00
NeilBrown 3db63daabe NFSv3: handle out-of-order write replies.
NFSv3 includes pre/post wcc attributes which allow the client to
determine if all changes to the file have been made by the client
itself, or if any might have been made by some other client.

If there are gaps in the pre/post ctime sequence it must be assumed that
some other client changed the file in that gap and the local cache must
be suspect.  The next time the file is opened the cache should be
invalidated.

Since Commit 1c341b7775 ("NFS: Add deferred cache invalidation for
close-to-open consistency violations") in linux 5.3 the Linux client has
been triggering this invalidation.  The chunk in nfs_update_inode() in
particularly triggers.

Unfortunately Linux NFS assumes that all replies will be processed in
the order sent, and will arrive in the order processed.  This is not
true in general.  Consequently Linux NFS might ignore the wcc info in a
WRITE reply because the reply is in response to a WRITE that was sent
before some other request for which a reply has already been seen.  This
is detected by Linux using the gencount tests in nfs_inode_attr_cmp().

Also, when the gencount tests pass it is still possible that the request
were processed on the server in a different order, and a gap seen in
the ctime sequence might be filled in by a subsequent reply, so gaps
should not immediately trigger delayed invalidation.

The net result is that writing to a server and then reading the file
back can result in going to the server for the read rather than serving
it from cache - all because a couple of replies arrived out-of-order.
This is a performance regression over kernels before 5.3, though the
change in 5.3 is a correctness improvement.

This has been seen with Linux writing to a Netapp server which
occasionally re-orders requests.  In testing the majority of requests
were in-order, but a few (maybe 2 or three at a time) could be
re-ordered.

This patch addresses the problem by recording any gaps seen in the
pre/post ctime sequence and not triggering invalidation until either
there are too many gaps to fit in the table, or until there are no more
active writes and the remaining gaps cannot be resolved.

We allocate a table of 16 gaps on demand.  If the allocation fails we
revert to current behaviour which is of little cost as we are unlikely
to be able to cache the writes anyway.

In the table we store "start->end" pair when iversion is updated and
"end<-start" pairs pre/post pairs reported by the server.  Usually these
exactly cancel out and so nothing is stored.  When there are
out-of-order replies we do store gaps and these will eventually be
cancelled against later replies when this client is the only writer.

If the final write is out-of-order there may be one gap remaining when
the file is closed.  This will be noticed and if there is precisely on
gap and if the iversion can be advanced to match it, then we do so.

This patch makes no attempt to handle directories correctly.  The same
problem potentially exists in the out-of-order replies to create/unlink
requests can cause future lookup requires to be sent to the server
unnecessarily.  A similar scheme using the same primitives could be used
to notice and handle out-of-order replies.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-04-11 16:13:21 -04:00
Dave Wysochanski 03f5bd75a4 NFS: Remove fscache specific trace points and NFS_INO_FSCACHE bit
The NFS specific trace points are no longer needed as tracing is well
covered by netfs and fscache.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Daire Byrne <daire@dneg.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-04-11 13:08:27 -04:00
Dave Wysochanski 88a4d7bdee NFS: Configure support for netfs when NFS fscache is configured
As first steps for support of the netfs library when NFS_FSCACHE is
configured, add NETFS_SUPPORT to Kconfig and add the required netfs_inode
into struct nfs_inode.

Using netfs requires we move the VFS inode structure to be stored
inside struct netfs_inode, along with the fscache_cookie.
Thus, if NFS_FSCACHE is configured, place netfs_inode inside an
anonymous union so the vfs_inode memory is the same and we do
not need to modify other non-fscache areas of NFS.
In addition, inside the NFS fscache code, use the new helpers,
netfs_inode() and netfs_i_cookie() helpers, and remove our own
helper, nfs_i_fscache().

Later patches will convert NFS fscache to fully use netfs.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Tested-by: Daire Byrne <daire@dneg.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-04-11 13:00:02 -04:00
Linus Torvalds d8ca6dbb8d NFS Client Updates for Linux 6.3
New Features:
   * Convert the read and write paths to use folios
 
 Bugfixes and Cleanups:
   * Fix tracepoint state manager flag printing
   * Fix disabling swap files
   * Fix NFSv4 client identifier sysfs path in the documentation
   * Don't clear NFS_CAP_COPY if server returns NFS4ERR_OFFLOAD_DENIED
   * Treat GETDEVICEINFO errors as a layout failure
   * Replace kmap_atomic() calls with kmap_local_page()
   * Constify sunrpc sysfs kobj_type structures
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmP2lSQACgkQ18tUv7Cl
 QOuaPw//TNLuEMiiMtM6EXFPc2p2tNMfM+EeMjYKZkS5WnvszVOZRzQfXz4dvvif
 R2BIHT+C6AWkPOxnbuhIk86acy/wRuAPD9OMyp+3PNa4oEqjyIk5QaTvgYeeRBNe
 H7haAqCiEBx/g8F9DladWLAwpilTSPpupdmiXwTS7Q7wB6NkKqc1eAJ+BTuRYAsm
 19p2nGTzenWrut4HMOOmuAVrA2OcWtzgYzdJlY19lWRmzfCvX1gh3i2JxGutmnl3
 tduviVBdu5QGnTKHJiP853kzs3VwEHIFSO+5z137JWKTm3IgPxW/u3zf+ijmw22t
 vjbtFAhb4+u5juchvVKyIX4lClPSKZincQ6dn9soOm7qJcxYWNm40DzxAF21GUAq
 d4tp+zBoe9pfKxZrylp6YxIchLQYhU+dL0hhqbKzfOAFOg28k1JsJubl/wRvWKVe
 LGbvFOrUYo8arPOKfxBWMf4HCouu/vGq/qng+U/Ppjw3h8gD32aniyP3qN9Fm1JA
 rFAMKTnD4I3r3x1E4HX3icUMwil36zBr9cPglQBYD5DW54ConEW8e6hvQmjTdQHP
 EgN9+PSYzYXuKR/Fij7BYOymUD9grT1+5hVaOPzlM4SY3jMy2novcQGoGiLVOd0t
 PGQ5047qbESFD7VkW4GBAYftrSMLiU+l5KL3aN1JRtprxu6NaYA=
 =39ue
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.3-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "New Features:

   - Convert the read and write paths to use folios

  Bugfixes and Cleanups:

   - Fix tracepoint state manager flag printing

   - Fix disabling swap files

   - Fix NFSv4 client identifier sysfs path in the documentation

   - Don't clear NFS_CAP_COPY if server returns NFS4ERR_OFFLOAD_DENIED

   - Treat GETDEVICEINFO errors as a layout failure

   - Replace kmap_atomic() calls with kmap_local_page()

   - Constify sunrpc sysfs kobj_type structures"

* tag 'nfs-for-6.3-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (25 commits)
  fs/nfs: Replace kmap_atomic() with kmap_local_page() in dir.c
  pNFS/filelayout: treat GETDEVICEINFO errors as layout failure
  Documentation: Fix sysfs path for the NFSv4 client identifier
  nfs42: do not fail with EIO if ssc returns NFS4ERR_OFFLOAD_DENIED
  NFS: fix disabling of swap
  SUNRPC: make kobj_type structures constant
  nfs4trace: fix state manager flag printing
  NFS: Remove unnecessary check in nfs_read_folio()
  NFS: Improve tracing of nfs_wb_folio()
  NFS: Enable tracing of nfs_invalidate_folio() and nfs_launder_folio()
  NFS: fix up nfs_release_folio() to try to release the page
  NFS: Clean up O_DIRECT request allocation
  NFS: Fix up nfs_vm_page_mkwrite() for folios
  NFS: Convert nfs_write_begin/end to use folios
  NFS: Remove unused function nfs_wb_page()
  NFS: Convert buffered writes to use folios
  NFS: Convert the function nfs_wb_page() to use folios
  NFS: Convert buffered reads to use folios
  NFS: Add a helper nfs_wb_folio()
  NFS: Convert the remaining pagelist helper functions to support folios
  ...
2023-02-22 14:47:20 -08:00
Trond Myklebust 4cbf76948c NFS: Remove unused function nfs_wb_page()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-02-14 14:22:32 -05:00
Trond Myklebust 0c493b5cf1 NFS: Convert buffered writes to use folios
Mostly mechanical conversion of struct page and functions into struct
folio equivalents.
The lack of support for folios in write_cache_pages(), means we still
only support order 0 folio allocations. However the rest of the
writeback code should now be ready for order n > 0.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-02-14 14:22:32 -05:00
Trond Myklebust 4b27232a6e NFS: Add a helper nfs_wb_folio()
...and use it in nfs_launder_folio().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-02-14 14:22:32 -05:00
Christian Brauner 4609e1f18e
fs: port ->permission() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:28 +01:00
Christian Brauner b74d24f7a7
fs: port ->getattr() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:25 +01:00
Christian Brauner c1632a0f11
fs: port ->setattr() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:02 +01:00
Trond Myklebust 0eb43812c0 NFS: Clear the file access cache upon login
POSIX typically only refreshes the user's supplementary group
information upon login. Since NFS servers may often refresh their
concept of the user supplementary group membership at their own cadence,
it is possible for the NFS client's access cache to become stale due to
the user's group membership changing on the server after the user has
already logged in on the client.
While it is reasonable to expect that such group membership changes are
rare, and that we do not want to optimise the cache to accommodate them,
it is also not unreasonable for the user to expect that if they log out
and log back in again, that the staleness would clear up.

Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-11-27 22:09:59 -05:00
Trond Myklebust 5f6277a0c1 NFS: Cleanup to remove unused flag NFS_CONTEXT_RESEND_WRITES
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-13 13:02:14 -04:00
Trond Myklebust 67f4b5dc49 NFS: Fix another fsync() issue after a server reboot
Currently, when the writeback code detects a server reboot, it redirties
any pages that were not committed to disk, and it sets the flag
NFS_CONTEXT_RESEND_WRITES in the nfs_open_context of the file descriptor
that dirtied the file. While this allows the file descriptor in question
to redrive its own writes, it violates the fsync() requirement that we
should be synchronising all writes to disk.
While the problem is infrequent, we do see corner cases where an
untimely server reboot causes the fsync() call to abandon its attempt to
sync data to disk and causing data corruption issues due to missed error
conditions or similar.

In order to tighted up the client's ability to deal with this situation
without introducing livelocks, add a counter that records the number of
times pages are redirtied due to a server reboot-like condition, and use
that in fsync() to redrive the sync to disk.

Fixes: 2197e9b06c ("NFS: Fix up fsync() when the server rebooted")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-13 13:02:13 -04:00
NeilBrown 3c59366c20 NFS: don't unhash dentry during unlink/rename
NFS unlink() (and rename over existing target) must determine if the
file is open, and must perform a "silly rename" instead of an unlink (or
before rename) if it is.  Otherwise the client might hold a file open
which has been removed on the server.

Consequently if it determines that the file isn't open, it must block
any subsequent opens until the unlink/rename has been completed on the
server.

This is currently achieved by unhashing the dentry.  This forces any
open attempt to the slow-path for lookup which will block on i_rwsem on
the directory until the unlink/rename completes.  A future patch will
change the VFS to only get a shared lock on i_rwsem for unlink, so this
will no longer work.

Instead we introduce an explicit interlock.  A special value is stored
in dentry->d_fsdata while the unlink/rename is running and
->d_revalidate blocks while that value is present.  When ->d_revalidate
unblocks, the dentry will be invalid.  This closes the race
without requiring exclusion on i_rwsem.

d_fsdata is already used in two different ways.
1/ an IS_ROOT directory dentry might have a "devname" stored in
   d_fsdata.  Such a dentry doesn't have a name and so cannot be the
   target of unlink or rename.  For safety we check if an old devname
   is still stored, and remove it if it is.
2/ a dentry with DCACHE_NFSFS_RENAMED set will have a 'struct
   nfs_unlinkdata' stored in d_fsdata.  While this is set maydelete()
   will fail, so an unlink or rename will never proceed on such
   a dentry.

Neither of these can be in effect when a dentry is the target of unlink
or rename.  So we can expect d_fsdata to be NULL, and store a special
value ((void*)1) which is given the name NFS_FSDATA_BLOCKED to indicate
that any lookup will be blocked.

The d_count() is incremented under d_lock() when a lookup finds the
dentry, so we check d_count() is low, and set NFS_FSDATA_BLOCKED under
the same lock to avoid any races.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-08 16:25:56 -04:00
Linus Torvalds 98931dd95f Yang Shi has improved the behaviour of khugepaged collapsing of readonly
file-backed transparent hugepages.
 
 Johannes Weiner has arranged for zswap memory use to be tracked and
 managed on a per-cgroup basis.
 
 Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for runtime
 enablement of the recent huge page vmemmap optimization feature.
 
 Baolin Wang contributes a series to fix some issues around hugetlb
 pagetable invalidation.
 
 Zhenwei Pi has fixed some interactions between hwpoisoned pages and
 virtualization.
 
 Tong Tiangen has enabled the use of the presently x86-only
 page_table_check debugging feature on arm64 and riscv.
 
 David Vernet has done some fixup work on the memcg selftests.
 
 Peter Xu has taught userfaultfd to handle write protection faults against
 shmem- and hugetlbfs-backed files.
 
 More DAMON development from SeongJae Park - adding online tuning of the
 feature and support for monitoring of fixed virtual address ranges.  Also
 easier discovery of which monitoring operations are available.
 
 Nadav Amit has done some optimization of TLB flushing during mprotect().
 
 Neil Brown continues to labor away at improving our swap-over-NFS support.
 
 David Hildenbrand has some fixes to anon page COWing versus
 get_user_pages().
 
 Peng Liu fixed some errors in the core hugetlb code.
 
 Joao Martins has reduced the amount of memory consumed by device-dax's
 compound devmaps.
 
 Some cleanups of the arch-specific pagemap code from Anshuman Khandual.
 
 Muchun Song has found and fixed some errors in the TLB flushing of
 transparent hugepages.
 
 Roman Gushchin has done more work on the memcg selftests.
 
 And, of course, many smaller fixes and cleanups.  Notably, the customary
 million cleanup serieses from Miaohe Lin.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYo52xQAKCRDdBJ7gKXxA
 jtJFAQD238KoeI9z5SkPMaeBRYSRQmNll85mxs25KapcEgWgGQD9FAb7DJkqsIVk
 PzE+d9hEfirUGdL6cujatwJ6ejYR8Q8=
 =nFe6
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Almost all of MM here. A few things are still getting finished off,
  reviewed, etc.

   - Yang Shi has improved the behaviour of khugepaged collapsing of
     readonly file-backed transparent hugepages.

   - Johannes Weiner has arranged for zswap memory use to be tracked and
     managed on a per-cgroup basis.

   - Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for
     runtime enablement of the recent huge page vmemmap optimization
     feature.

   - Baolin Wang contributes a series to fix some issues around hugetlb
     pagetable invalidation.

   - Zhenwei Pi has fixed some interactions between hwpoisoned pages and
     virtualization.

   - Tong Tiangen has enabled the use of the presently x86-only
     page_table_check debugging feature on arm64 and riscv.

   - David Vernet has done some fixup work on the memcg selftests.

   - Peter Xu has taught userfaultfd to handle write protection faults
     against shmem- and hugetlbfs-backed files.

   - More DAMON development from SeongJae Park - adding online tuning of
     the feature and support for monitoring of fixed virtual address
     ranges. Also easier discovery of which monitoring operations are
     available.

   - Nadav Amit has done some optimization of TLB flushing during
     mprotect().

   - Neil Brown continues to labor away at improving our swap-over-NFS
     support.

   - David Hildenbrand has some fixes to anon page COWing versus
     get_user_pages().

   - Peng Liu fixed some errors in the core hugetlb code.

   - Joao Martins has reduced the amount of memory consumed by
     device-dax's compound devmaps.

   - Some cleanups of the arch-specific pagemap code from Anshuman
     Khandual.

   - Muchun Song has found and fixed some errors in the TLB flushing of
     transparent hugepages.

   - Roman Gushchin has done more work on the memcg selftests.

  ... and, of course, many smaller fixes and cleanups. Notably, the
  customary million cleanup serieses from Miaohe Lin"

* tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (381 commits)
  mm: kfence: use PAGE_ALIGNED helper
  selftests: vm: add the "settings" file with timeout variable
  selftests: vm: add "test_hmm.sh" to TEST_FILES
  selftests: vm: check numa_available() before operating "merge_across_nodes" in ksm_tests
  selftests: vm: add migration to the .gitignore
  selftests/vm/pkeys: fix typo in comment
  ksm: fix typo in comment
  selftests: vm: add process_mrelease tests
  Revert "mm/vmscan: never demote for memcg reclaim"
  mm/kfence: print disabling or re-enabling message
  include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace"
  include/trace/events/mmflags.h: cleanup for "tracing: incorrect gfp_t conversion"
  mm: fix a potential infinite loop in start_isolate_page_range()
  MAINTAINERS: add Muchun as co-maintainer for HugeTLB
  zram: fix Kconfig dependency warning
  mm/shmem: fix shmem folio swapoff hang
  cgroup: fix an error handling path in alloc_pagecache_max_30M()
  mm: damon: use HPAGE_PMD_SIZE
  tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate
  nodemask.h: fix compilation error with GCC12
  ...
2022-05-26 12:32:41 -07:00
NeilBrown eb79f3af93 nfs: rename nfs_direct_IO and use as ->swap_rw
The nfs_direct_IO() exists to support SWAP IO, but hasn't worked for a
while.  We now need a ->swap_rw function which behaves slightly
differently, returning zero for success rather than a byte count.

So modify nfs_direct_IO accordingly, rename it, and use it as the
->swap_rw function.

Link: https://lkml.kernel.org/r/165119301493.15698.7491285551903597618.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> (on Renesas RSK+RZA1 with 32 MiB of SDRAM)
Cc: David Howells <dhowells@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-09 18:20:48 -07:00
Matthew Wilcox (Oracle) 65d023af7f nfs: Convert nfs to read_folio
This is a "weak" conversion which converts straight back to using pages.
A full conversion should be performed at some point, hopefully by
someone familiar with the filesystem.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-05-09 16:21:46 -04:00
Linus Torvalds 965181d7ef NFS client updates for Linux 5.18
Highlights include:
 
 Features:
 - Switch NFS to use readahead instead of the obsolete readpages.
 - Readdir fixes to improve cacheability of large directories when there
   are multiple readers and writers.
 - Readdir performance improvements when doing a seekdir() immediately
   after opening the directory (common when re-exporting NFS).
 - NFS swap improvements from Neil Brown.
 - Loosen up memory allocation to permit direct reclaim and write back
   in cases where there is no danger of deadlocking the writeback code or
   NFS swap.
 - Avoid sillyrename when the NFSv4 server claims to support the
   necessary features to recover the unlinked but open file after reboot.
 
 Bugfixes:
 - Patch from Olga to add a mount option to control NFSv4.1 session
   trunking discovery, and default it to being off.
 - Fix a lockup in nfs_do_recoalesce().
 - Two fixes for list iterator variables being used when pointing to the
   list head.
 - Fix a kernel memory scribble when reading from a non-socket transport
   in /sys/kernel/sunrpc.
 - Fix a race where reconnecting to a server could leave the TCP socket
   stuck forever in the connecting state.
 - Patch from Neil to fix a shutdown race which can leave the SUNRPC
   transport timer primed after we free the struct xprt itself.
 - Patch from Xin Xiong to fix reference count leaks in the NFSv4.2 copy
   offload.
 - Sunrpc patch from Olga to avoid resending a task on an offlined
   transport.
 
 Cleanups:
 - Patches from Dave Wysochanski to clean up the fscache code
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmJDLLUACgkQZwvnipYK
 APJZ5Q/9FH5S38RcUOxpFS+XaVdmxltJTMQv9Bz5mcyljq6PwRjTn73Qs7MfBQfD
 k90O9Pn/EXgQpFWNW4o/saQGLAO1MbVaesdNNF92Le9NziMz7lr95fsFXOqpg9EQ
 rEVOImWinv/N0+P0yv5tgjx6Fheu70L4dnwsxepnmtv2R1VjXOVchK7WjsV+9qey
 Xvn15OzhtVCGhRvvKfmGuGW+2I0D29tRNn2X0Pi+LjxAEQ0We/rcsJeGDMGYtbeC
 2U4ADk7KB0kJq04WYCH8k8tTrljL3PUONzO6QP1lltmAZ0FOwXf/AjyaD3SML2Xu
 A1yc2bEn3/S6wCQ3BiMxoE7UCSewg/ZyQD7B9GjZ/nWYNXREhLKvtVZt+ULTCA6o
 UKAI2vlEHljzQI2YfbVVZFyL8KCZqi+BFcJ/H/dtFKNazIkiefkaF+703Pflazis
 HjhA5G6IT0Pk6qZ6OBp+Z/4kCDYZYXXi9ysiNuMPWCiYEfZzt7ocUjm44uRAi+vR
 Qpf0V01Rg4iplTj3YFXqb9TwxrudzC7AcQ7hbLJia0kD/8djG+aGZs3SJWeO+VuE
 Xa41oxAx9jqOSRvSKPvj7nbexrwn8PtQI2NtDRqjPXlg6+Jbk23u8lIEVhsCH877
 xoMUHtTKLyqECskzRN4nh7qoBD3QfiLuI74aDcPtGEuXzU1+Ox8=
 =jeGt
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Features:

   - Switch NFS to use readahead instead of the obsolete readpages.

   - Readdir fixes to improve cacheability of large directories when
     there are multiple readers and writers.

   - Readdir performance improvements when doing a seekdir() immediately
     after opening the directory (common when re-exporting NFS).

   - NFS swap improvements from Neil Brown.

   - Loosen up memory allocation to permit direct reclaim and write back
     in cases where there is no danger of deadlocking the writeback code
     or NFS swap.

   - Avoid sillyrename when the NFSv4 server claims to support the
     necessary features to recover the unlinked but open file after
     reboot.

  Bugfixes:

   - Patch from Olga to add a mount option to control NFSv4.1 session
     trunking discovery, and default it to being off.

   - Fix a lockup in nfs_do_recoalesce().

   - Two fixes for list iterator variables being used when pointing to
     the list head.

   - Fix a kernel memory scribble when reading from a non-socket
     transport in /sys/kernel/sunrpc.

   - Fix a race where reconnecting to a server could leave the TCP
     socket stuck forever in the connecting state.

   - Patch from Neil to fix a shutdown race which can leave the SUNRPC
     transport timer primed after we free the struct xprt itself.

   - Patch from Xin Xiong to fix reference count leaks in the NFSv4.2
     copy offload.

   - Sunrpc patch from Olga to avoid resending a task on an offlined
     transport.

  Cleanups:

   - Patches from Dave Wysochanski to clean up the fscache code"

* tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (91 commits)
  NFSv4/pNFS: Fix another issue with a list iterator pointing to the head
  NFS: Don't loop forever in nfs_do_recoalesce()
  SUNRPC: Don't return error values in sysfs read of closed files
  SUNRPC: Do not dereference non-socket transports in sysfs
  NFSv4.1: don't retry BIND_CONN_TO_SESSION on session error
  SUNRPC don't resend a task on an offlined transport
  NFS: replace usage of found with dedicated list iterator variable
  SUNRPC: avoid race between mod_timer() and del_timer_sync()
  pNFS/files: Ensure pNFS allocation modes are consistent with nfsiod
  pNFS/flexfiles: Ensure pNFS allocation modes are consistent with nfsiod
  NFSv4/pnfs: Ensure pNFS allocation modes are consistent with nfsiod
  NFS: Avoid writeback threads getting stuck in mempool_alloc()
  NFS: nfsiod should not block forever in mempool_alloc()
  SUNRPC: Make the rpciod and xprtiod slab allocation modes consistent
  SUNRPC: Fix unx_lookup_cred() allocation
  NFS: Fix memory allocation in rpc_alloc_task()
  NFS: Fix memory allocation in rpc_malloc()
  SUNRPC: Improve accuracy of socket ENOBUFS determination
  SUNRPC: Replace internal use of SOCKWQ_ASYNC_NOSPACE
  SUNRPC: Fix socket waits for write buffer space
  ...
2022-03-29 18:55:37 -07:00
Trond Myklebust 515dcdcd48 NFS: nfsiod should not block forever in mempool_alloc()
The concern is that since nfsiod is sometimes required to kick off a
commit, it can get locked up waiting forever in mempool_alloc() instead
of failing gracefully and leaving the commit until later.

Try to allocate from the slab first, with GFP_KERNEL | __GFP_NORETRY,
then fall back to a non-blocking attempt to allocate from the memory
pool.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22 15:52:56 -04:00
Matthew Wilcox (Oracle) 6d740c76ea nfs: Convert from invalidatepage to invalidate_folio
Print the folio index instead of the pointer, since this is more
useful.  We also don't need to use page_file_mapping() as we do not
invalidate swapcache pages.  Since this is the only caller of
nfs_wb_page_cancel(), convert it to nfs_wb_folio_cancel().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15 08:23:30 -04:00
NeilBrown 64158668ac NFS: swap IO handling is slightly different for O_DIRECT IO
1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding
   possible deadlocks with "fs_reclaim".  These deadlocks could, I believe,
   eventuate if a buffered read on the swapfile was attempted.

   We don't need coherence with the page cache for a swap file, and
   buffered writes are forbidden anyway.  There is no other need for
   i_rwsem during direct IO.  So never take it for swap_rw()

2/ generic_write_checks() explicitly forbids writes to swap, and
   performs checks that are not needed for swap.  So bypass it
   for swap_rw().

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13 12:59:35 -04:00
NeilBrown 89c2be8a95 NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
NFS_RPC_SWAPFLAGS is only used for READ requests.
It sets RPC_TASK_SWAPPER which gives some memory-allocation priority to
requests.  This is not needed for swap READ - though it is for writes
where it is set via a different mechanism.

RPC_TASK_ROOTCREDS causes the 'machine' credential to be used.
This is not needed as the root credential is saved when the swap file is
opened, and this is used for all IO.

So NFS_RPC_SWAPFLAGS isn't needed, and as it is the only user of
RPC_TASK_ROOTCREDS, that isn't needed either.

Remove both.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13 12:59:35 -04:00
Trond Myklebust b0365ccb07 NFS: Fix up forced readdirplus
Avoid clearing the entire readdir page cache if we're just doing forced
readdirplus for the 'ls -l' heuristic.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02 08:43:39 -05:00
Trond Myklebust f648022faa NFS: Convert readdir page cache to use a cookie based index
Instead of using a linear index to address the pages, use the cookie of
the first entry, since that is what we use to match the page anyway.

This allows us to avoid re-reading the entire cache on a seekdir() type
of operation. The latter is very common when re-exporting NFS, and is a
major performance drain.

The change does affect our duplicate cookie detection, since we can no
longer rely on the page index as a linear offset for detecting whether
we looped backwards. However since we no longer do a linear search
through all the pages on each call to nfs_readdir(), this is less of a
concern than it was previously.
The other downside is that invalidate_mapping_pages() no longer can use
the page index to avoid clearing pages that have been read. A subsequent
patch will restore the functionality this provides to the 'ls -l'
heuristic.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02 08:43:39 -05:00
Trond Myklebust 230bc98f7a NFS: Improve heuristic for readdirplus
The heuristic for readdirplus is designed to try to detect 'ls -l' and
similar patterns. It does so by looking for cache hit/miss patterns in
both the attribute cache and in the dcache of the files in a given
directory, and then sets a flag for the readdirplus code to interpret.

The problem with this approach is that a single attribute or dcache miss
can cause the NFS code to force a refresh of the attributes for the
entire set of files contained in the directory.

To be able to make a more nuanced decision, let's sample the number of
hits and misses in the set of open directory descriptors. That allows us
to set thresholds at which we start preferring READDIRPLUS over regular
READDIR, or at which we start to force a re-read of the remaining
readdir cache using READDIRPLUS.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02 08:43:38 -05:00
Trond Myklebust 580f236737 NFS: Adjust the amount of readahead performed by NFS readdir
The current NFS readdir code will always try to maximise the amount of
readahead it performs on the assumption that we can cache anything that
isn't immediately read by the process.
There are several cases where this assumption breaks down, including
when the 'ls -l' heuristic kicks in to try to force use of readdirplus
as a batch replacement for lookup/getattr.

This patch therefore tries to tone down the amount of readahead we
perform, and adjust it to try to match the amount of data being
requested by user space.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02 08:43:38 -05:00
Trond Myklebust 728dd0ab37 NFS: Don't re-read the entire page cache to find the next cookie
If the page cache entry that was last read gets invalidated for some
reason, then make sure we can re-create it on the next call to readdir.
This, combined with the cache page validation, allows us to reuse the
cached value of page-index on successive calls to nfs_readdir.

Credit is due to Benjamin Coddington for showing that the concept works,
and that it allows for improved cache sharing between processes even in
the case where pages are lost due to LRU or active invalidation.

Suggested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-02 08:43:38 -05:00
Trond Myklebust 1a93b82c59 NFS: constify nfs_server_capable() and nfs_have_writebacks()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-02-28 10:11:32 -05:00
Trond Myklebust 41e97b7f8a NFS: Remove unused flag NFS_INO_REVAL_PAGECACHE
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-02-25 18:50:13 -05:00
Trond Myklebust 88a6099fc3 NFS: Replace last uses of NFS_INO_REVAL_PAGECACHE
Now that we have more fine grained attribute revalidation, let's just
get rid of NFS_INO_REVAL_PAGECACHE.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-02-25 18:50:13 -05:00
Olga Kornievskaia 43245eca6e NFSv4.1 support for NFS4_RESULT_PRESERVER_UNLINKED
In 4.1+, the server is allowed to set a flag
NFS4_RESULT_PRESERVE_UNLINKED in reply to the OPEN, that tells
the client that it does not need to do a silly rename of an
opened file when it's being removed.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-02-25 18:50:12 -05:00
Matthew Wilcox (Oracle) 8786fde842 Convert NFS from readpages to readahead
NFS is one of the last two users of the deprecated ->readpages aop.
This conversion looks straightforward, but I have only compile-tested
it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-02-25 15:07:07 -05:00
Trond Myklebust e1d2699b96 NFS: Avoid duplicate uncached readdir calls on eof
If we've reached the end of the directory, then cache that information
in the context so that we don't need to do an uncached readdir in order
to rediscover that fact.

Fixes: 794092c57f ("NFS: Do uncached readdir when we're seeking a cookie in an empty page cache")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-02-02 10:47:33 -05:00
Linus Torvalds 0280e3c58f NFS Client Updates for Linux 5.17
- New Features:
   - Basic handling for case insensitive filesystems
   - Initial support for fs_locations and server trunking
 
 - Bugfixes and Cleanups:
   - Cleanups to how the "struct cred *" is handled for the nfs_access_entry
   - Ensure the server has an up to date ctimes before hardlinking or renaming
   - Update 'blocks used' after writeback, fallocate, and clone
   - nfs_atomic_open() fixes
   - Improvements to sunrpc tracing
   - Various null check & indenting related cleanups
   - Some improvements to the sunrpc sysfs code
     - Use default_groups in kobj_type
     - Fix some potential races and reference leaks
   - A few tracepoint cleanups in xprtrdma
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmHodsIACgkQ18tUv7Cl
 QOt3xQ//c9JPmMJZZoZtaD5UrHg28iyxaJpOUUpwC/jxQhLOETCf+nU1cELYgLq5
 4W06NBYEmjDJ/tihUvcGMKLvbCtQR9Zl9HepFKDTLTQpGmRFD4enwSmMNvW/AV+h
 I7PoN6J1DX/TZ5InOHH9asyoC2MjwrNHMn3bbQVT0qy+i3T76zJiBF79eWTnPR48
 kKPnF1I0p4LKGJy+y+y/z2mdCsz7tzFkhssxVhot0nafxXzbUOp1H9aiwxroRiUC
 ljbBA0TX8FWkGpGFt3y2QK2fMD7ovDpRhLFYiJClmeERXJVH5mXL9O5XfN5AL0xe
 W/QqT5lbWfeHLkpm2j87yTyaHASC7hGKsAyPD0zWLDcNZws61l1Sy4BHymSE5ZVh
 zt7sJjBnOWAtntyUGBg78G2vhBsd63GzrtcqAOlrngwA5ohJ8f32qvBQGyw4MQu9
 75CjRcO8K8mnf9BJ6I1vYPycjkUh9RSFfNdnUEAI9ZwiTEC/hfEvH/omvEtZsNol
 jBgv2SItTkdMZlEppEL4gxuaYT2wiZf2C6Gco215iPAqLC6dudoroN6yoLk/LRd0
 OWZLl5XTr3j6m5QDm22k5CG080vl6XiAxmAFaFSLza6Q34Jmuluc0gLAZZxvqXk9
 Ay7dQt9PQQk6mXD5Hreb0E5N9zcm2LkfvWpyGJ7mTV7sSHjA2DU=
 =wcVT
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "New Features:

   - Basic handling for case insensitive filesystems

   - Initial support for fs_locations and server trunking

  Bugfixes and Cleanups:

   - Cleanups to how the "struct cred *" is handled for the
     nfs_access_entry

   - Ensure the server has an up to date ctimes before hardlinking or
     renaming

   - Update 'blocks used' after writeback, fallocate, and clone

   - nfs_atomic_open() fixes

   - Improvements to sunrpc tracing

   - Various null check & indenting related cleanups

   - Some improvements to the sunrpc sysfs code:
      - Use default_groups in kobj_type
      - Fix some potential races and reference leaks

   - A few tracepoint cleanups in xprtrdma"

[ This should have gone in during the merge window, but didn't. The
  original pull request - sent during the merge window - had gotten
  marked as spam and discarded due missing DKIM headers in the email
  from Anna.   - Linus ]

* tag 'nfs-for-5.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (35 commits)
  SUNRPC: Don't dereference xprt->snd_task if it's a cookie
  xprtrdma: Remove definitions of RPCDBG_FACILITY
  xprtrdma: Remove final dprintk call sites from xprtrdma
  sunrpc: Fix potential race conditions in rpc_sysfs_xprt_state_change()
  net/sunrpc: fix reference count leaks in rpc_sysfs_xprt_state_change
  NFSv4.1 test and add 4.1 trunking transport
  SUNRPC allow for unspecified transport time in rpc_clnt_add_xprt
  NFSv4 handle port presence in fs_location server string
  NFSv4 expose nfs_parse_server_name function
  NFSv4.1 query for fs_location attr on a new file system
  NFSv4 store server support for fs_location attribute
  NFSv4 remove zero number of fs_locations entries error check
  NFSv4: nfs_atomic_open() can race when looking up a non-regular file
  NFSv4: Handle case where the lookup of a directory fails
  NFSv42: Fallocate and clone should also request 'blocks used'
  NFSv4: Allow writebacks to request 'blocks used'
  SUNRPC: use default_groups in kobj_type
  NFS: use default_groups in kobj_type
  NFS: Fix the verifier for case sensitive filesystem in nfs_atomic_open()
  NFS: Add a helper to remove case-insensitive aliases
  ...
2022-01-25 20:16:03 +02:00
Dave Wysochanski a6b5a28eb5 nfs: Convert to new fscache volume/cookie API
Change the nfs filesystem to support fscache's indexing rewrite and
reenable caching in nfs.

The following changes have been made:

 (1) The fscache_netfs struct is no more, and there's no need to register
     the filesystem as a whole.

 (2) The session cookie is now an fscache_volume cookie, allocated with
     fscache_acquire_volume().  That takes three parameters: a string
     representing the "volume" in the index, a string naming the cache to
     use (or NULL) and a u64 that conveys coherency metadata for the
     volume.

     For nfs, I've made it render the volume name string as:

        "nfs,<ver>,<family>,<address>,<port>,<fsidH>,<fsidL>*<,param>[,<uniq>]"

 (3) The fscache_cookie_def is no more and needed information is passed
     directly to fscache_acquire_cookie().  The cache no longer calls back
     into the filesystem, but rather metadata changes are indicated at
     other times.

     fscache_acquire_cookie() is passed the same keying and coherency
     information as before.

 (4) fscache_enable/disable_cookie() have been removed.

     Call fscache_use_cookie() and fscache_unuse_cookie() when a file is
     opened or closed to prevent a cache file from being culled and to keep
     resources to hand that are needed to do I/O.

     If a file is opened for writing, we invalidate it with
     FSCACHE_INVAL_DIO_WRITE in lieu of doing writeback to the cache,
     thereby making it cease caching until all currently open files are
     closed.  This should give the same behaviour as the uptream code.
     Making the cache store local modifications isn't straightforward for
     NFS, so that's left for future patches.

 (5) fscache_invalidate() now needs to be given uptodate auxiliary data and
     a file size.  It also takes a flag to indicate if this was due to a
     DIO write.

 (6) Call nfs_fscache_invalidate() with FSCACHE_INVAL_DIO_WRITE on a file
     to which a DIO write is made.

 (7) Call fscache_note_page_release() from nfs_release_page().

 (8) Use a killable wait in nfs_vm_page_mkwrite() when waiting for
     PG_fscache to be cleared.

 (9) The functions to read and write data to/from the cache are stubbed out
     pending a conversion to use netfslib.

Changes
=======
ver #3:
 - Added missing =n fallback for nfs_fscache_release_file()[1][2].

ver #2:
 - Use gfpflags_allow_blocking() rather than using flag directly.
 - fscache_acquire_volume() now returns errors.
 - Remove NFS_INO_FSCACHE as it's no longer used.
 - Need to unuse a cookie on file-release, not inode-clear.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Co-developed-by: David Howells <dhowells@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna.schumaker@netapp.com>
cc: linux-nfs@vger.kernel.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/202112100804.nksO8K4u-lkp@intel.com/ [1]
Link: https://lore.kernel.org/r/202112100957.2oEDT20W-lkp@intel.com/ [2]
Link: https://lore.kernel.org/r/163819668938.215744.14448852181937731615.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906979003.143852.2601189243864854724.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967182112.1823006.7791504655391213379.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021575950.640689.12069642327533368467.stgit@warthog.procyon.org.uk/ # v4
2022-01-10 11:53:25 +00:00
NeilBrown 6238aec83f NFS: don't store 'struct cred *' in struct nfs_access_entry
Storing the 'struct cred *' in nfs_access_entry is problematic.
An active 'cred' can keep a 'struct key *' active, and a quota is
imposed on the number of such keys that a user can maintain.
Cached 'nfs_access_entry' structs have indefinite lifetime, and having
these keep 'struct key's alive imposes on that quota.

So remove the 'struct cred *' and replace it with the fields we need:
  kuid_t, kgid_t, and struct group_info *

This makes the 'struct nfs_access_entry' 64 bits larger.

New function "access_cmp" is introduced which is identical to
cred_fscmp() except that the second arg is an 'nfs_access_entry', rather
than a 'cred'

Fixes: b68572e07c ("NFS: change access cache to use 'struct cred'.")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06 14:00:20 -05:00
NeilBrown 73fbb3fa64 NFS: pass cred explicitly for access tests
Storing the 'struct cred *' in nfs_access_entry is problematic.
An active 'cred' can keep a 'struct key *' active, and a quota is
imposed on the number of such keys that a user can maintain.
Cached 'nfs_access_entry' structs have indefinite lifetime, and having
these keep 'struct key's alive imposes on that quota.

So a future patch will remove the ->cred ref from nfs_access_entry.

To prepare, change various functions to not assume there is a 'cred' in
the nfs_access_entry, but to pass the cred around explicitly.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06 14:00:20 -05:00
NeilBrown b5e7b59c34 NFS: change nfs_access_get_cached to only report the mask
Currently the nfs_access_get_cached family of functions report a
'struct nfs_access_entry' as the result, with both .mask and .cred set.
However the .cred is never used.  This is probably good and there is no
guarantee that it won't be freed before use.

Change to only report the 'mask' - as this is all that is used or needed.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06 14:00:19 -05:00
Anna Schumaker dd225cb3b0 NFS: Remove the nfs4_label argument from nfs_setsecurity
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-11-05 14:54:40 -04:00
Anna Schumaker cf7ab00aab NFS: Remove the nfs4_label argument from nfs_fhget()
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-11-05 14:54:40 -04:00
Anna Schumaker cc6f32989c NFS: Remove the nfs4_label argument from nfs_add_or_obtain()
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-11-05 14:54:39 -04:00
Anna Schumaker d91bfc4642 NFS: Remove the nfs4_label argument from nfs_instantiate()
Pull the label from the fattr instead.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-11-05 14:54:39 -04:00
Anna Schumaker d755ad8dc7 NFS: Create a new nfs_alloc_fattr_with_label() function
For creating fattrs with the label field already allocated for us. I
also update nfs_free_fattr() to free the label in the end.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-11-05 14:54:38 -04:00
Trond Myklebust e591b298d7 NFS: Save some space in the inode
Save some space in the nfs_inode by setting up an anonymous union with
the fields that are peculiar to a specific type of filesystem object.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:54 -04:00
Dave Wysochanski 0ebeebcf59 NFS: Fix WARN_ON due to unionization of nfs_inode.nrequests
Fixes the following WARN_ON
WARNING: CPU: 2 PID: 18678 at fs/nfs/inode.c:123 nfs_clear_inode+0x3b/0x50 [nfs]
...
Call Trace:
  nfs4_evict_inode+0x57/0x70 [nfsv4]
  evict+0xd1/0x180
  dispose_list+0x48/0x60
  evict_inodes+0x156/0x190
  generic_shutdown_super+0x37/0x110
  nfs_kill_super+0x1d/0x40 [nfs]
  deactivate_locked_super+0x36/0xa0

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:54 -04:00
Trond Myklebust 133a48abf6 NFS: Fix up commit deadlocks
If O_DIRECT bumps the commit_info rpcs_out field, then that could lead
to fsync() hangs. The fix is to ensure that O_DIRECT calls
nfs_commit_end().

Fixes: 723c921e7d ("sched/wait, fs/nfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:45 -04:00
Trond Myklebust ff81dfb5d7 NFS: Further optimisations for 'ls -l'
If a user is doing 'ls -l', we have a heuristic in GETATTR that tells
the readdir code to try to use READDIRPLUS in order to refresh the inode
attributes. In certain cirumstances, we also try to invalidate the
remaining directory entries in order to ensure this refresh.

If there are multiple readers of the directory, we probably should avoid
invalidating the page cache, since the heuristic breaks down in that
situation anyway.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
2021-10-03 20:49:07 -04:00