If write_ports_addfd or write_ports_addxprt fail, they call nfsd_put()
without calling nfsd_last_thread(). This leaves nn->nfsd_serv pointing
to a structure that has been freed.
So remove 'static' from nfsd_last_thread() and call it when the
nfsd_serv is about to be destroyed.
Fixes: ec52361df9 ("SUNRPC: stop using ->sv_nrthreads as a refcount")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Patch series "workload-specific and memory pressure-driven zswap
writeback", v8.
There are currently several issues with zswap writeback:
1. There is only a single global LRU for zswap, making it impossible to
perform worload-specific shrinking - an memcg under memory pressure
cannot determine which pages in the pool it owns, and often ends up
writing pages from other memcgs. This issue has been previously
observed in practice and mitigated by simply disabling
memcg-initiated shrinking:
https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u
But this solution leaves a lot to be desired, as we still do not
have an avenue for an memcg to free up its own memory locked up in
the zswap pool.
2. We only shrink the zswap pool when the user-defined limit is hit.
This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed
ahead of time, as this depends on the workload (specifically, on
factors such as memory access patterns and compressibility of the
memory pages).
This patch series solves these issues by separating the global zswap LRU
into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e
memcg- and NUMA-aware) zswap writeback under memory pressure. The new
shrinker does not have any parameter that must be tuned by the user, and
can be opted in or out on a per-memcg basis.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
This patch (of 6):
The interface of list_lru is based on the assumption that the list node
and the data it represents belong to the same allocated on the correct
node/memcg. While this assumption is valid for existing slab objects LRU
such as dentries and inodes, it is undocumented, and rather inflexible for
certain potential list_lru users (such as the upcoming zswap shrinker and
the THP shrinker). It has caused us a lot of issues during our
development.
This patch changes list_lru interface so that the caller must explicitly
specify numa node and memcg when adding and removing objects. The old
list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and
list_lru_del_obj(), respectively.
It also extends the list_lru API with a new function, list_lru_putback,
which undoes a previous list_lru_isolate call. Unlike list_lru_add, it
does not increment the LRU node count (as list_lru_isolate does not
decrement the node count). list_lru_putback also allows for explicit
memcg and NUMA node selection.
Link: https://lkml.kernel.org/r/20231130194023.4102148-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
... and fix the directory locking documentation and proof of correctness.
Holding ->s_vfs_rename_mutex *almost* prevents ->d_parent changes; the
case where we really don't want it is splicing the root of disconnected
tree to somewhere.
In other words, ->s_vfs_rename_mutex is sufficient to stabilize "X is an
ancestor of Y" only if X and Y are already in the same tree. Otherwise
it can go from false to true, and one can construct a deadlock on that.
Make lock_two_directories() report an error in such case and update the
callers of lock_rename()/lock_rename_child() to handle such errors.
And yes, such conditions are not impossible to create ;-/
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All the callers of vfs_iter_write() call file_start_write() just before
calling vfs_iter_write() except for target_core_file's fd_do_rw().
Move file_start_write() from the callers into vfs_iter_write().
fd_do_rw() calls vfs_iter_write() with a non-regular file, so
file_start_write() is a no-op.
This is needed for fanotify "pre content" events.
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231122122715.2561213-11-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
vfs_splice_read() has a permission hook inside rw_verify_area() and
it is called from do_splice_direct() -> splice_direct_to_actor().
The callers of do_splice_direct() (e.g. vfs_copy_file_range()) already
call rw_verify_area() for the entire range, but the other caller of
splice_direct_to_actor() (nfsd) does not.
Add the rw_verify_area() checks in nfsd_splice_read() and use a
variant of vfs_splice_read() without rw_verify_area() check in
splice_direct_to_actor() to avoid the redundant rw_verify_area() checks.
This is needed for fanotify "pre content" events.
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231122122715.2561213-4-amir73il@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
nfsd_client_rmdir() open-codes a subset of simple_recursive_removal().
Conversion to calling simple_recursive_removal() allows to clean things
up quite a bit.
While we are at it, nfsdfs_create_files() doesn't need to mess with "pick
the reference to struct nfsdfs_client from the already created parent" -
the caller already knows it (that's where the parent got it from,
after all), so we might as well just pass it as an explicit argument.
So __get_nfsdfs_client() is only needed in get_nfsdfs_client() and
can be folded in there.
Incidentally, the locking in get_nfsdfs_client() is too heavy - we don't
need ->i_rwsem for that, ->i_lock serves just fine.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- Fix several long-standing bugs in the duplicate reply cache
- Fix a memory leak
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmVY6iwACgkQM2qzM29m
f5fDqhAAnsHqZNG2I6asqh/5pPoqcp7kXkEe1l+6jr4jz/00R0lg+oLbsC6/S6eY
tzGVkQtIkl2OpL8lt4JTgUL/xiO2JaqbdiIuHelnT62l97r7kbKWJ0ALHahLafiX
hCQdJWOdud86kZ5x6/cYVfqyO08bhqDLUFvWd7zSLmzW/9U3bG4v6yXvHT3qqnnE
dCJtuM9+DUfnDJKHe6+BFIobkyta8+Tpsg4QSgSAu4hg+dTcqtPCOxMeT+YwgNQd
uZY1xiIjPLkufsBF86xzC3tyoFNaZc5QhIwv7ZBtmzNUw3906SbuST9hJiWeHeWq
m7p0YeWDJrygiyIrvYxv6NDUCqnkoOxbuKTUAniTEj1SsE2gcQCfij0bU1OSRk6r
CpT8TdJ6j2zP78+xxMsSIiA/gR7uJtCKs7LABru7DX25+sDkKK2Te9+PXINarJ1k
fraeDeuXMQ1cu71WRXUJ3QKGn1/bC8DYGHFVQqcB+gVXqcm3BuKNYGt01nFyCp5s
+1jL+lRxUjPydI2J1VKw5g+5jW9rBgLOT0T9xlr+TZDnYADBAakwpbujrBH5Ey+W
BswGoNzqsW9B0U4N6o8OP0FA6IxcddX6Uan2czienmqj217w4AaPPT/vuA5EyC9W
zNzBAi87rnaqQs4LnWiS09K+RvYIedQl1lJIzwIdQZfnMKmCSEI=
=GOU4
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Fix several long-standing bugs in the duplicate reply cache
- Fix a memory leak
* tag 'nfsd-6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: Fix checksum mismatches in the duplicate reply cache
NFSD: Fix "start of NFS reply" pointer passed to nfsd_cache_update()
NFSD: Update nfsd_cache_append() to use xdr_stream
nfsd: fix file memleak on client_opens_release
nfsd_cache_csum() currently assumes that the server's RPC layer has
been advancing rq_arg.head[0].iov_base as it decodes an incoming
request, because that's the way it used to work. On entry, it
expects that buf->head[0].iov_base points to the start of the NFS
header, and excludes the already-decoded RPC header.
These days however, head[0].iov_base now points to the start of the
RPC header during all processing. It no longer points at the NFS
Call header when execution arrives at nfsd_cache_csum().
In a retransmitted RPC the XID and the NFS header are supposed to
be the same as the original message, but the contents of the
retransmitted RPC header can be different. For example, for krb5,
the GSS sequence number will be different between the two. Thus if
the RPC header is always included in the DRC checksum computation,
the checksum of the retransmitted message might not match the
checksum of the original message, even though the NFS part of these
messages is identical.
The result is that, even if a matching XID is found in the DRC,
the checksum mismatch causes the server to execute the
retransmitted RPC transaction again.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The "statp + 1" pointer that is passed to nfsd_cache_update() is
supposed to point to the start of the egress NFS Reply header. In
fact, it does point there for AUTH_SYS and RPCSEC_GSS_KRB5 requests.
But both krb5i and krb5p add fields between the RPC header's
accept_stat field and the start of the NFS Reply header. In those
cases, "statp + 1" points at the extra fields instead of the Reply.
The result is that nfsd_cache_update() caches what looks to the
client like garbage.
A connection break can occur for a number of reasons, but the most
common reason when using krb5i/p is a GSS sequence number window
underrun. When an underrun is detected, the server is obliged to
drop the RPC and the connection to force a retransmit with a fresh
GSS sequence number. The client presents the same XID, it hits in
the server's DRC, and the server returns the garbage cache entry.
The "statp + 1" argument has been used since the oldest changeset
in the kernel history repo, so it has been in nfsd_dispatch()
literally since before history began. The problem arose only when
the server-side GSS implementation was added twenty years ago.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When inserting a DRC-cached response into the reply buffer, ensure
that the reply buffer's xdr_stream is updated properly. Otherwise
the server will send a garbage response.
Cc: stable@vger.kernel.org # v6.3+
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
seq_release should be called to free the allocated seq_file
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Mahmoud Adam <mngyadam@amazon.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Fixes: 78599c42ae ("nfsd4: add file to display list of client's opens")
Reviewed-by: NeilBrown <neilb@suse.de>
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZUpEaAAKCRCRxhvAZXjc
ounBAQCAoS66gnOZ+k4kOWwB2zZ1Ueh3dPFC7IcEZ+pwFS8hpAEAxUQxV0TSWf5l
W/1oKRtAJyuSYvehHeMUSJmHVBiM8w4=
=bNm0
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.fsid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fanotify fsid updates from Christian Brauner:
"This work is part of the plan to enable fanotify to serve as a drop-in
replacement for inotify. While inotify is availabe on all filesystems,
fanotify currently isn't.
In order to support fanotify on all filesystems two things are needed:
(1) all filesystems need to support AT_HANDLE_FID
(2) all filesystems need to report a non-zero f_fsid
This contains (1) and allows filesystems to encode non-decodable file
handlers for fanotify without implementing any exportfs operations by
encoding a file id of type FILEID_INO64_GEN from i_ino and
i_generation.
Filesystems that want to opt out of encoding non-decodable file ids
for fanotify that don't support NFS export can do so by providing an
empty export_operations struct.
This also partially addresses (2) by generating f_fsid for simple
filesystems as well as freevxfs. Remaining filesystems will be dealt
with by separate patches.
Finally, this contains the patch from the current exportfs maintainers
which moves exportfs under vfs with Chuck, Jeff, and Amir as
maintainers and vfs.git as tree"
* tag 'vfs-6.7.fsid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
MAINTAINERS: create an entry for exportfs
fs: fix build error with CONFIG_EXPORTFS=m or not defined
freevxfs: derive f_fsid from bdev->bd_dev
fs: report f_fsid from s_dev for "simple" filesystems
exportfs: support encoding non-decodeable file handles by default
exportfs: define FILEID_INO64_GEN* file handle types
exportfs: make ->encode_fh() a mandatory method for NFS export
exportfs: add helpers to check if filesystem can encode/decode file handles
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series "Fixes and cleanups to compaction".
- Joel Fernandes has a patchset ("Optimize mremap during mutual
alignment within PMD") which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested.
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i the
following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series "Do not try to access unaccepted memory" Adrian Hunter
provides some fixups for the recently-added "unaccepted memory' feature.
To increase the feature's checking coverage. "Plug a few gaps where
RAM is exposed without checking if it is unaccepted memory".
- In the series "cleanups for lockless slab shrink" Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code.
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series "use refcount+RCU method to implement
lockless slab shrink".
- David Hildenbrand contributes some maintenance work for the rmap code
in the series "Anon rmap cleanups".
- Kefeng Wang does more folio conversions and some maintenance work in
the migration code. Series "mm: migrate: more folio conversion and
unification".
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series "Add and use bdev_getblk()".
- In the series "Use nth_page() in place of direct struct page
manipulation" Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames.
- In the series "mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO" has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of gigantic
pages are in use.
- Matthew Wilcox has sent the series "Small hugetlb cleanups" - code
rationalization and folio conversions in the hugetlb code.
- Yin Fengwei has improved mlock()'s handling of large folios in the
series "support large folio for mlock"
- In the series "Expose swapcache stat for memcg v1" Liu Shixin has
added statistics for memcg v1 users which are available (and useful)
under memcg v2.
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named "MDWE
without inheritance".
- Kefeng Wang has provided the series "mm: convert numa balancing
functions to use a folio" which does what it says.
- In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch
makes is possible for a process to propagate KSM treatment across
exec().
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use "high
bandwidth memory" in addition to Optane Data Center Persistent Memory
Modules (DCPMM). The series is named "memory tiering: calculate
abstract distance based on ACPI HMAT"
- In the series "Smart scanning mode for KSM" Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans.
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in the
series "mm: memcg: fix tracking of pending stats updates values".
- In the series "Implement IOCTL to get and optionally clear info about
PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits
us to atomically read-then-clear page softdirty state. This is mainly
used by CRIU.
- Hugh Dickins contributed the series "shmem,tmpfs: general maintenance"
- a bunch of relatively minor maintenance tweaks to this code.
- Matthew Wilcox has increased the use of the VMA lock over file-backed
page faults in the series "Handle more faults under the VMA lock". Some
rationalizations of the fault path became possible as a result.
- In the series "mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups
and folio conversions.
- In the series "various improvements to the GUP interface" Lorenzo
Stoakes has simplified and improved the GUP interface with an eye to
providing groundwork for future improvements.
- Andrey Konovalov has sent along the series "kasan: assorted fixes and
improvements" which does those things.
- Some page allocator maintenance work from Kemeng Shi in the series
"Two minor cleanups to break_down_buddy_pages".
- In thes series "New selftest for mm" Breno Leitao has developed
another MM self test which tickles a race we had between madvise() and
page faults.
- In the series "Add folio_end_read" Matthew Wilcox provides cleanups
and an optimization to the core pagecache code.
- Nhat Pham has added memcg accounting for hugetlb memory in the series
"hugetlb memcg accounting".
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series "Abstract vma_merge() and split_vma()".
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series "Fix page_owner's use of free timestamps".
- Lorenzo Stoakes has fixed the handling of new mappings of sealed files
in the series "permit write-sealed memfd read-only shared mappings".
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series "Batch hugetlb vmemmap modification operations".
- Some buffer_head folio conversions and cleanups from Matthew Wilcox in
the series "Finish the create_empty_buffers() transition".
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the series
"mm: PCP high auto-tuning".
- Roman Gushchin has contributed the patchset "mm: improve performance
of accounted kernel memory allocations" which improves their performance
by ~30% as measured by a micro-benchmark.
- folio conversions from Kefeng Wang in the series "mm: convert page
cpupid functions to folios".
- Some kmemleak fixups in Liu Shixin's series "Some bugfix about
kmemleak".
- Qi Zheng has improved our handling of memoryless nodes by keeping them
off the allocation fallback list. This is done in the series "handle
memoryless nodes more appropriately".
- khugepaged conversions from Vishal Moola in the series "Some
khugepaged folio conversions".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA
jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y
FgeUPAD1oasg6CP+INZvCj34waNxwAc=
=E+Y4
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series 'Fixes and cleanups to compaction'
- Joel Fernandes has a patchset ('Optimize mremap during mutual
alignment within PMD') which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i
the following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series 'Do not try to access unaccepted memory' Adrian
Hunter provides some fixups for the recently-added 'unaccepted
memory' feature. To increase the feature's checking coverage. 'Plug
a few gaps where RAM is exposed without checking if it is
unaccepted memory'
- In the series 'cleanups for lockless slab shrink' Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series 'use refcount+RCU method to
implement lockless slab shrink'
- David Hildenbrand contributes some maintenance work for the rmap
code in the series 'Anon rmap cleanups'
- Kefeng Wang does more folio conversions and some maintenance work
in the migration code. Series 'mm: migrate: more folio conversion
and unification'
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series 'Add and use bdev_getblk()'
- In the series 'Use nth_page() in place of direct struct page
manipulation' Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames
- In the series 'mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO' has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of
gigantic pages are in use
- Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
rationalization and folio conversions in the hugetlb code
- Yin Fengwei has improved mlock()'s handling of large folios in the
series 'support large folio for mlock'
- In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
added statistics for memcg v1 users which are available (and
useful) under memcg v2
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named 'MDWE
without inheritance'
- Kefeng Wang has provided the series 'mm: convert numa balancing
functions to use a folio' which does what it says
- In the series 'mm/ksm: add fork-exec support for prctl' Stefan
Roesch makes is possible for a process to propagate KSM treatment
across exec()
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use 'high
bandwidth memory' in addition to Optane Data Center Persistent
Memory Modules (DCPMM). The series is named 'memory tiering:
calculate abstract distance based on ACPI HMAT'
- In the series 'Smart scanning mode for KSM' Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in
the series 'mm: memcg: fix tracking of pending stats updates
values'
- In the series 'Implement IOCTL to get and optionally clear info
about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
which permits us to atomically read-then-clear page softdirty
state. This is mainly used by CRIU
- Hugh Dickins contributed the series 'shmem,tmpfs: general
maintenance', a bunch of relatively minor maintenance tweaks to
this code
- Matthew Wilcox has increased the use of the VMA lock over
file-backed page faults in the series 'Handle more faults under the
VMA lock'. Some rationalizations of the fault path became possible
as a result
- In the series 'mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()' David Hildenbrand has implemented some
cleanups and folio conversions
- In the series 'various improvements to the GUP interface' Lorenzo
Stoakes has simplified and improved the GUP interface with an eye
to providing groundwork for future improvements
- Andrey Konovalov has sent along the series 'kasan: assorted fixes
and improvements' which does those things
- Some page allocator maintenance work from Kemeng Shi in the series
'Two minor cleanups to break_down_buddy_pages'
- In thes series 'New selftest for mm' Breno Leitao has developed
another MM self test which tickles a race we had between madvise()
and page faults
- In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
and an optimization to the core pagecache code
- Nhat Pham has added memcg accounting for hugetlb memory in the
series 'hugetlb memcg accounting'
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series 'Abstract vma_merge() and split_vma()'
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series 'Fix page_owner's use of free timestamps'
- Lorenzo Stoakes has fixed the handling of new mappings of sealed
files in the series 'permit write-sealed memfd read-only shared
mappings'
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series 'Batch hugetlb vmemmap modification operations'
- Some buffer_head folio conversions and cleanups from Matthew Wilcox
in the series 'Finish the create_empty_buffers() transition'
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the
series 'mm: PCP high auto-tuning'
- Roman Gushchin has contributed the patchset 'mm: improve
performance of accounted kernel memory allocations' which improves
their performance by ~30% as measured by a micro-benchmark
- folio conversions from Kefeng Wang in the series 'mm: convert page
cpupid functions to folios'
- Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
kmemleak'
- Qi Zheng has improved our handling of memoryless nodes by keeping
them off the allocation fallback list. This is done in the series
'handle memoryless nodes more appropriately'
- khugepaged conversions from Vishal Moola in the series 'Some
khugepaged folio conversions'"
[ bcachefs conflicts with the dynamically allocated shrinkers have been
resolved as per Stephen Rothwell in
https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/
with help from Qi Zheng.
The clone3 test filtering conflict was half-arsed by yours truly ]
* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
mm/damon/sysfs: update monitoring target regions for online input commit
mm/damon/sysfs: remove requested targets when online-commit inputs
selftests: add a sanity check for zswap
Documentation: maple_tree: fix word spelling error
mm/vmalloc: fix the unchecked dereference warning in vread_iter()
zswap: export compression failure stats
Documentation: ubsan: drop "the" from article title
mempolicy: migration attempt to match interleave nodes
mempolicy: mmap_lock is not needed while migrating folios
mempolicy: alloc_pages_mpol() for NUMA policy without vma
mm: add page_rmappable_folio() wrapper
mempolicy: remove confusing MPOL_MF_LAZY dead code
mempolicy: mpol_shared_policy_init() without pseudo-vma
mempolicy trivia: use pgoff_t in shared mempolicy tree
mempolicy trivia: slightly more consistent naming
mempolicy trivia: delete those ancient pr_debug()s
mempolicy: fix migrate_pages(2) syscall return nr_failed
kernfs: drop shared NUMA mempolicy hooks
hugetlbfs: drop shared NUMA mempolicy pretence
mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
...
This release completes the SunRPC thread scheduler work that was
begun in v6.6. The scheduler can now find an svc thread to wake in
constant time and without a list walk. Thanks again to Neil Brown
for this overhaul.
Lorenzo Bianconi contributed infrastructure for a netlink-based
NFSD control plane. The long-term plan is to provide the same
functionality as found in /proc/fs/nfsd, plus some interesting
additions, and then migrate the NFSD user space utilities to
netlink.
A long series to overhaul NFSD's NFSv4 operation encoding was
applied in this release. The goals are to bring this family of
encoding functions in line with the matching NFSv4 decoding
functions and with the NFSv2 and NFSv3 XDR functions, preparing
the way for better memory safety and maintainability.
A further improvement to NFSD's write delegation support was
contributed by Dai Ngo. This adds a CB_GETATTR callback,
enabling the server to retrieve cached size and mtime data from
clients holding write delegations. If the server can retrieve
this information, it does not have to recall the delegation in
some cases.
The usual panoply of bug fixes and minor improvements round out
this release. As always I am grateful to all contributors,
reviewers, and testers.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmU5IuoACgkQM2qzM29m
f5eVsg//bVp8S93ci/oDlKfzOwH2fO5e5rna91wrDpJxkd51h6KTx55dSRG5sjAZ
EywIVOann6xCtsixAPyff5Cweg2dWvzQRsy1ZnvWQ1qZBzD5KAJY5LPkeSFUCKBo
Zani/qTOYbxzgFMjZx+yDSXDPKG68WYZBQK59SI7mURu4SYdk8aRyNY8mjHfr0Vh
Aqrcny4oVtXV4sL5P5G/2FUW7WKT3olA3jSYlRRNMhbs2qpEemRCCrspOEMMad+b
t1+ZCg+U27PMranvOJnof4RU7peZbaxDWA0gyiUbivVXVtZn9uOs0ffhktkvechL
ePc33dqdp2ITdKIPA6JlaRv5WflKXQw0YYM9Kv5mcR4A2el7owL4f/pMlPhtbYwJ
IOJv15KdKVN979G2e6WMYiKK+iHfaUUguhMEXnfnGoAajHOZNQiUEo3iFQAD7LDc
DvMF8d9QqYmB9IW8FOYaRRfZGJOQHf3TL79Nd08z/bn5swvlvfj77leux9Sb+0/m
Luk2Xvz2AJVSXE31wzabaGHkizN+BtH+e4MMbXUHBPW5jE9v7XOnEUFr4UdZyr9P
Gl87A7NcrzNjJWT5TrnzM4sOslNsx46Aeg+VuNt2fSRn2dm6iBu2B8s0N4imx6dV
PX1y9VSLq5WRhjrFZ1qeiZdsuTaQtrEiNDoRIQR6nCJPAV80iFk=
=B4wJ
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"This release completes the SunRPC thread scheduler work that was begun
in v6.6. The scheduler can now find an svc thread to wake in constant
time and without a list walk. Thanks again to Neil Brown for this
overhaul.
Lorenzo Bianconi contributed infrastructure for a netlink-based NFSD
control plane. The long-term plan is to provide the same functionality
as found in /proc/fs/nfsd, plus some interesting additions, and then
migrate the NFSD user space utilities to netlink.
A long series to overhaul NFSD's NFSv4 operation encoding was applied
in this release. The goals are to bring this family of encoding
functions in line with the matching NFSv4 decoding functions and with
the NFSv2 and NFSv3 XDR functions, preparing the way for better memory
safety and maintainability.
A further improvement to NFSD's write delegation support was
contributed by Dai Ngo. This adds a CB_GETATTR callback, enabling the
server to retrieve cached size and mtime data from clients holding
write delegations. If the server can retrieve this information, it
does not have to recall the delegation in some cases.
The usual panoply of bug fixes and minor improvements round out this
release. As always I am grateful to all contributors, reviewers, and
testers"
* tag 'nfsd-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (127 commits)
svcrdma: Fix tracepoint printk format
svcrdma: Drop connection after an RDMA Read error
NFSD: clean up alloc_init_deleg()
NFSD: Fix frame size warning in svc_export_parse()
NFSD: Rewrite synopsis of nfsd_percpu_counters_init()
nfsd: Clean up errors in nfs3proc.c
nfsd: Clean up errors in nfs4state.c
NFSD: Clean up errors in stats.c
NFSD: simplify error paths in nfsd_svc()
NFSD: Clean up nfsd4_encode_seek()
NFSD: Clean up nfsd4_encode_offset_status()
NFSD: Clean up nfsd4_encode_copy_notify()
NFSD: Clean up nfsd4_encode_copy()
NFSD: Clean up nfsd4_encode_test_stateid()
NFSD: Clean up nfsd4_encode_exchange_id()
NFSD: Clean up nfsd4_do_encode_secinfo()
NFSD: Clean up nfsd4_encode_access()
NFSD: Clean up nfsd4_encode_readdir()
NFSD: Clean up nfsd4_encode_entry4()
NFSD: Add an nfsd4_encode_nfs_cookie4() helper
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppYgAKCRCRxhvAZXjc
okIHAP9anLz1QDyMLH12ASuHjgBc0Of3jcB6NB97IWGpL4O21gEA46ohaD+vcJuC
YkBLU3lXqQ87nfu28ExFAzh10hG2jwM=
=m4pB
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode time accessor updates from Christian Brauner:
"This finishes the conversion of all inode time fields to accessor
functions as discussed on list. Changing timestamps manually as we
used to do before is error prone. Using accessors function makes this
robust.
It does not contain the switch of the time fields to discrete 64 bit
integers to replace struct timespec and free up space in struct inode.
But after this, the switch can be trivially made and the patch should
only affect the vfs if we decide to do it"
* tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits)
fs: rename inode i_atime and i_mtime fields
security: convert to new timestamp accessors
selinux: convert to new timestamp accessors
apparmor: convert to new timestamp accessors
sunrpc: convert to new timestamp accessors
mm: convert to new timestamp accessors
bpf: convert to new timestamp accessors
ipc: convert to new timestamp accessors
linux: convert to new timestamp accessors
zonefs: convert to new timestamp accessors
xfs: convert to new timestamp accessors
vboxsf: convert to new timestamp accessors
ufs: convert to new timestamp accessors
udf: convert to new timestamp accessors
ubifs: convert to new timestamp accessors
tracefs: convert to new timestamp accessors
sysv: convert to new timestamp accessors
squashfs: convert to new timestamp accessors
server: convert to new timestamp accessors
client: convert to new timestamp accessors
...
The logic of whether filesystem can encode/decode file handles is open
coded in many places.
In preparation to changing the logic, move the open coded logic into
inline helpers.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231023180801.2953446-2-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
... checking that after lock_rename() is too late. Incidentally,
NFSv2 had no nfserr_xdev...
Fixes: aa387d6ce1 "nfsd: fix EXDEV checking in rename"
Cc: stable@vger.kernel.org # v3.9+
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Modify the conditional statement for null pointer check in the function
'alloc_init_deleg' to make this function more robust and clear. Otherwise,
this function may have potential pointer dereference problem in the future,
when modifying or expanding the nfs4_delegation structure.
Signed-off-by: Sicong Huang <huangsicong@iie.ac.cn>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
fs/nfsd/export.c: In function 'svc_export_parse':
fs/nfsd/export.c:737:1: warning: the frame size of 1040 bytes is larger than 1024 bytes [-Wframe-larger-than=]
737 | }
On my systems, svc_export_parse() has a stack frame of over 800
bytes, not 1040, but nonetheless, it could do with some reduction.
When a struct svc_export is on the stack, it's a temporary structure
used as an argument, and not visible as an actual exported FS. No
need to reserve space for export_stats in such cases.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202310012359.YEw5IrK6-lkp@intel.com/
Cc: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
In function ‘export_stats_init’,
inlined from ‘svc_export_alloc’ at fs/nfsd/export.c:866:6:
fs/nfsd/export.c:337:16: warning: ‘nfsd_percpu_counters_init’ accessing 40 bytes in a region of size 0 [-Wstringop-overflow=]
337 | return nfsd_percpu_counters_init(&stats->counter, EXP_STATS_COUNTERS_NUM);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/nfsd/export.c:337:16: note: referencing argument 1 of type ‘struct percpu_counter[0]’
fs/nfsd/stats.h: In function ‘svc_export_alloc’:
fs/nfsd/stats.h:40:5: note: in a call to function ‘nfsd_percpu_counters_init’
40 | int nfsd_percpu_counters_init(struct percpu_counter counters[], int num);
| ^~~~~~~~~~~~~~~~~~~~~~~~~
Cc: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Fix the following errors reported by checkpatch:
ERROR: need consistent spacing around '+' (ctx:WxV)
ERROR: spaces required around that '?' (ctx:VxW)
Signed-off-by: KaiLong Wang <wangkailong@jari.cn>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Fix the following errors reported by checkpatch:
ERROR: spaces required around that '=' (ctx:VxW)
ERROR: space required after that ',' (ctx:VxO)
ERROR: space required before that '~' (ctx:OxV)
Signed-off-by: KaiLong Wang <wangkailong@jari.cn>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Fix the following errors reported by checkpatch:
ERROR: space required after that ',' (ctx:VxV)
Signed-off-by: KaiLong Wang <wangkailong@jari.cn>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The error paths in nfsd_svc() are needlessly complex and can result in a
final call to svc_put() without nfsd_last_thread() being called. This
results in the listening sockets not being closed properly.
The per-netns setup provided by nfsd_startup_new() and removed by
nfsd_shutdown_net() is needed precisely when there are running threads.
So we don't need nfsd_up_before. We don't need to know if it *was* up.
We only need to know if any threads are left. If none are, then we must
call nfsd_shutdown_net(). But we don't need to do that explicitly as
nfsd_last_thread() does that for us.
So simply call nfsd_last_thread() before the last svc_put() if there are
no running threads. That will always do the right thing.
Also discard:
pr_info("nfsd: last server has exited, flushing export cache\n");
It may not be true if an attempt to start the first server failed, and
it isn't particularly helpful and it simply reports normal behaviour.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Replace open-coded encoding logic with the use of conventional XDR
utility functions.
Note that if we replace the cpn_sec and cpn_nsec fields with a
single struct timespec64 field, the encoder can use
nfsd4_encode_nfstime4(), as that is the data type specified by the
XDR spec.
NFS4ERR_INVAL seems inappropriate if the encoder doesn't support
encoding the response. Instead use NFS4ERR_SERVERFAULT, since this
condition is a software bug on the server.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Restructure this function using conventional XDR utility functions
and so it aligns better with the XDR in the specification.
I've also moved nfsd4_encode_copy() closer to the data type encoders
that only it uses.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Restructure nfsd4_encode_exchange_id() so that it will be more
straightforward to add support for SSV one day. Also, adopt the use
of the conventional XDR utility functions.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor nfsd4_encode_secinfo() so it is more clear what XDR data
item is being encoded by which piece of code.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Convert nfsd4_encode_access() to use modern XDR utility functions.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Untangle nfsd4_encode_readdir() so it is more clear what XDR data
item is being encoded by which piece of code.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reshape nfsd4_encode_entry4() to be more like the legacy dirent
encoders, which were recently rewritten to use xdr_stream.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
De-duplicate the entry4 cookie encoder, similar to the arrangement
for the NFSv2 and NFSv3 directory entry encoders.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
No need for specialized code here, as this function is invoked only
rarely. Convert it to encode to xdr_stream using conventional XDR
helpers.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Rename nfsd4_encode_dirent() to match the naming convention already
used in the NFSv2 and NFSv3 readdir paths. The new name reflects the
name of the spec-defined XDR data type for an NFSv4 directory entry.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
De-duplicate open-coded encoding of the sessionid, and convert the
rest of the function to use conventional XDR utility functions.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Convert nfsd4_encode_create_session() to use the conventional XDR
encoding utilities.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
De-duplicate the encoding of the fore channel and backchannel
attributes.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
There is more than one NFSv4 operation that needs to encode a
sessionid4, so extract that data type into a separate helper.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
To better align our implementation with the XDR specification,
refactor the part of nfsd4_encode_open() that encodes delegation
metadata.
As part of that refactor, remove an unnecessary BUG() call site and
a comment that appears to be stale.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
To better align our implementation with the XDR specification,
refactor the part of nfsd4_encode_open() that encodes the
open_none_delegation4 type.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Make it easier to adjust the XDR encoder to handle new features
related to write delegations.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor nfsd4_encode_open() so the open_read_delegation4 type is
encoded in a separate function. This makes it more straightforward
to later add support for returning an nfsace4 in OPEN responses that
offer a delegation.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Use the modern XDR utility functions.
The LOCK and LOCKT encoder functions need to return nfserr_denied
when a lock is denied, but nfsd4_encode_lock4denied() should return
a status code that is consistent with other XDR encoders.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
To improve readability and better align the LOCK encoders with the
XDR specification, add an explicit encoder named for the lock_owner4
type.
In particular, to avoid code duplication, use
nfsd4_encode_clientid4() to encode the clientid in the lock owner
rather than open-coding it.
It looks to me like nfs4_set_lock_denied() already clears the
clientid if it won't return an owner (cf: the nevermind: label). The
code in the XDR encoder appears to be redundant and can safely be
removed.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
An XDR encoder is responsible for marshaling results, not releasing
memory that was allocated by the upper layer. We have .op_release
for that purpose.
Move the release of the ld_owner.data string to op_release functions
for LOCK and LOCKT.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Adopt the conventional XDR utility functions. Also, restructure to
make the function align more closely with the spec -- there doesn't
seem to be a performance need for speciality code, so prioritize
readability.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This enables callers to be passed const pointer parameters.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Adopt the use of conventional XDR utility functions. Restructure
the encoder to better align with the XDR definition of the result.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Adopt the use of conventional XDR utility functions. Restructure
the encoder to better align with the XDR definition of the result.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
De-duplicate the open-coded stateid4 encoder. Adopt the use of the
conventional current XDR encoding helpers. Refactor the encoder to
align with the XDR specification.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This enables callers to be passed const pointer parameters.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Update the encoder function name to match the type name, as is the
convention with other such encoder utility functions, and with
nfsd4_decode_stateid4().
Make the @stateid argument a const so that callers of
nfsd4_encode_stateid4() in the future can be passed const pointers
to structures.
Since the compiler is allowed to add padding to structs, use the
wire (spec-defined) size when reserving buffer space.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This is a synonym for nfsd4_encode_uint32_t() that matches the
name of the XDR type. It will get at least one more use in a
subsequent patch.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
For better alignment with the specification, NFSD's encoder function
name should match the name of the XDR data type.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The fattr4 encoder is now structured like the COMPOUND op encoder:
one function for each individual attribute, called by bit number.
Benefits include:
- The individual attributes are now guaranteed to be encoded in
bitmask order into the send buffer
- There can be no unwanted side effects between attribute encoders
- The code now clearly documents which attributes are /not/
implemented on this server
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_XATTR_SUPPORT into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_SEC_LABEL into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_SUPPATTR_EXCLCREAT into a helper. In
a subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_LAYOUT_BLKSIZE into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_LAYOUT_TYPES into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_FS_LAYOUT_TYPES into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_MOUNTED_ON_FILEID into a helper. In
a subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_TIME_MODIFY into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_TIME_METADATA into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_TIME_DELTA into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
fattr4_time_delta is specified as an nfstime4, so de-duplicate this
encoder.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_TIME_CREATE into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_TIME_ACCESS into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_SPACE_USED into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_SPACE_TOTAL into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_SPACE_FREE into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_SPACE_AVAIL into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_RAWDEV into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_OWNER_GROUP into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_OWNER into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_NUMLINKS into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_MODE into a helper. In a subsequent
patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_MAXWRITE into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_MAXREAD into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_MAXNAME into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_MAXLINK into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_MAXFILESIZE into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_FS_LOCATIONS into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_FILES_TOTAL into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_FILES_FREE into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_FILES_AVAIL into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_FILEID into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_FILEHANDLE into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
We can de-duplicate the other filehandle encoder (in GETFH) using
our new helper.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_ACL into a helper. In a subsequent
patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the ACE encoding helper so that it can eventually be reused
for encoding OPEN results that contain delegation ACEs.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_ACLSUPPORT into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_RDATTR_ERROR into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_LEASE_TIME into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_FSID into a helper. In a subsequent
patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_SIZE into a helper. In a subsequent
patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_CHANGE into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
The code is restructured a bit to use the modern xdr_stream flow,
and the encoded cinfo value is made const so that callers of the
encoders can be passed a const cinfo.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_FH_EXPIRE_TYPE into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_TYPE into a helper. In a subsequent
patch, this helper will be called from a bitmask loop.
In addition, restructure the code so that byte-swapping is done on
constant values rather than at run time. Run-time swapping can be
costly on some platforms, and "type" is a frequently-requested
attribute.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor the encoder for FATTR4_SUPPORTED_ATTRS into a helper. In a
subsequent patch, this helper will be called from a bitmask loop.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add an encoding helper that encodes a single boolean "false" value.
Attributes that always return "false" can use this helper.
In a subsequent patch, this helper will be called from a bitmask
loop, so it is given a standardized synopsis.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add an encoding helper that encodes a single boolean "true" value.
Attributes that always return "true" can use this helper.
In a subsequent patch, this helper will be called from a bitmask
loop, so it is given a standardized synopsis.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
I'm about to split nfsd4_encode_fattr() into a number of smaller
functions. Instead of passing a large number of arguments to each of
the smaller functions, create a struct that can gather the common
argument variables into something with a convenient handle on it.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
De-duplicate the encoding of bitmap4 results in
nfsd4_encode_setattr().
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
For alignment with the specification, the name of NFSD's encoder
function should match the name of the XDR type.
I've also replaced a few "naked integers" with symbolic constants
that better reflect the usage of these values.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The generic XDR encoders return a length or a negative errno. NFSv4
encoders want to know simply whether the encode ran out of stream
buffer space. The return values for server-side encoding are either
nfs_ok or nfserr_resource.
So far I've found it adds a lot of duplicate code to try to use the
generic XDR encoder utilities when encoding the simple data types in
the NFSv4 operation encoders.
Add a set of NFSv4-specific utilities that handle the basic XDR data
types. These are added in xdr4.h so they might eventually be used by
the callback server and pNFS driver encoders too.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Introduce rpc_status netlink support for NFSD in order to dump pending
RPC requests debugging information from userspace.
Closes: https://bugzilla.linux-nfs.org/show_bug.cgi?id=366
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Generate stubs and uAPI for nfsd netlink protocol. For the moment,
the new protocol has one operation: rpc_status.
The generated header and source files are created by running:
tools/net/ynl/ynl-regen.sh
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If the GETATTR request on a file that has write delegation in effect
and the request attributes include the change info and size attribute
then the request is handled as below:
Server sends CB_GETATTR to client to get the latest change info and file
size. If these values are the same as the server's cached values then
the GETATTR proceeds as normal.
If either the change info or file size is different from the server's
cached values, or the file was already marked as modified, then:
. update time_modify and time_metadata into file's metadata
with current time
. encode GETATTR as normal except the file size is encoded with
the value returned from CB_GETATTR
. mark the file as modified
If the CB_GETATTR fails for any reasons, the delegation is recalled
and NFS4ERR_DELAY is returned for the GETATTR.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Includes:
. CB_GETATTR proc for nfs4_cb_procedures[]
. XDR encoding and decoding function for CB_GETATTR request/reply
. add nfs4_cb_fattr to nfs4_delegation for sending CB_GETATTR
and store file attributes from client's reply.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Using an atomic_t avoids the need to take a spinlock (which can soon be
removed).
Choosing a thread to kill needs to be careful as we cannot set the "die
now" bit atomically with the test on the count. Instead we temporarily
increase the count.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc threads are currently stopped using kthread_stop(). This requires
identifying a specific thread. However we don't care which thread
stops, just as long as one does.
So instead, set a flag in the svc_pool to say that a thread needs to
die, and have each thread check this flag instead of calling
kthread_should_stop(). The first thread to find and clear this flag
then moves towards exiting.
This removes an explicit dependency on sp_all_threads which will make a
future patch simpler.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This patch reverts mostly commit 40595cdc93 ("nfs: block notification
on fs with its own ->lock") and introduces an EXPORT_OP_ASYNC_LOCK
export flag to signal that the "own ->lock" implementation supports
async lock requests. The only main user is DLM that is used by GFS2 and
OCFS2 filesystem. Those implement their own lock() implementation and
return FILE_LOCK_DEFERRED as return value. Since commit 40595cdc93
("nfs: block notification on fs with its own ->lock") the DLM
implementation were never updated. This patch should prepare for DLM
to set the EXPORT_OP_ASYNC_LOCK export flag and update the DLM
plock implementation regarding to it.
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If fsync() is returning EAGAIN, then we can assume that the filesystem
being exported is something like NFS with the 'softerr' mount option
enabled, and that it is just asking us to replay the fsync() operation
at a later date.
If we see an ESTALE, then ditto: the file is gone, so there is no danger
of losing the error.
For those cases, do not reset the write verifier. A write verifier
change has a global effect, causing retransmission by all clients of
all uncommitted unstable writes for all files, so it is worth
mitigating where possible.
Link: https://lore.kernel.org/linux-nfs/20230911184357.11739-1-trond.myklebust@hammerspace.com/
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The nfsd_open code handles EOPENSTALE correctly, by retrying the call to
fh_verify() and __nfsd_open(). However the filecache just drops the
error on the floor, and immediately returns nfserr_stale to the caller.
This patch ensures that we propagate the EOPENSTALE code back to
nfsd_file_do_acquire, and that we handle it correctly.
Fixes: 65294c1f2c ("nfsd: add a new struct file caching facility to nfsd")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230911183027.11372-1-trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add trace points on destination server to track inter and intra
server copy operations.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Tested-by: Chen Hanxiao <chenhx.fnst@fujitsu.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Prepare for adding server copy trace points.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Tested-by: Chen Hanxiao <chenhx.fnst@fujitsu.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
In preparation for implementing lockless slab shrink, use new APIs to
dynamically allocate the nfsd-reply shrinker, so that it can be freed
asynchronously via RCU. Then it doesn't need to wait for RCU read-side
critical section when releasing the struct nfsd_net.
Link: https://lkml.kernel.org/r/20230911094444.68966-34-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Coly Li <colyli@suse.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sean Paul <sean@poorly.run>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In preparation for implementing lockless slab shrink, use new APIs to
dynamically allocate the nfsd-client shrinker, so that it can be freed
asynchronously via RCU. Then it doesn't need to wait for RCU read-side
critical section when releasing the struct nfsd_net.
Link: https://lkml.kernel.org/r/20230911094444.68966-33-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Coly Li <colyli@suse.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sean Paul <sean@poorly.run>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- Fix NFSv4 READ corner case
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmUYSnEACgkQM2qzM29m
f5dtSxAAk/n3CzeIJFALIcP+JmHW6Zh2KbpygcqhxFoO1FAyRBJgOIEhtdD/c65Q
fzfCQDZUzdTSirklBS5m0lIMhqaYMe1LUC8OIq/hh309vQ6WnlLltIU3fsaZFva5
kAaDeI/oGq2YnbjQENWCGHvKC7RR1jOTWASI/+52oYg/fwRkt14so/wDl54mxhNP
q6Gt0Xw/mXkslAkQNQ+7a5eAa4TtepzkjkNWL4hSqboHQ4QFT2tmXqDV3E+V9tzX
F/tlEN3EJo3nDNcUYAh6ec/YXo0tBh4DsnJxA9bRYloU40EUGP19ewXLEQgXPw9M
IgSNluxTmDTQGz6kSUkQpepevWDUN0+fDt1kVqACIRXuvuMcGuDUCbD4VMa8HNnc
bCMg8R/SkrBnzTamHJpPhbV/F/C0Iwp80RpqDMAnt4splrQUCJeB+Wl1qJAnYUW5
TdJxSJKCvypMIcNCd1ZMSVbeByUgZ/qXKKsS2YNS4l8+DGdnEJQ2mnFKw3Nk5bEF
byWoyW6RMXCIwaqQa8hNsO5kW4MTxURxLXcgqzN7+5L2ht7pAYpNjnOc0vIBPe5t
9z+/dSk8a9QXBwRTxPL4QOgludXPqvokGBtCIV1LD5xQA9OZuXivnBz2DWmCt+IR
GOYfjs3gSlvVrv7+EGffYlOh9f7GN/wlUfUxERKcB6HPweGOtqs=
=aD10
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fix from Chuck Lever:
- Fix NFSv4 READ corner case
* tag 'nfsd-6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: Fix zero NFSv4 READ results when RQ_SPLICE_OK is not set
nfsd4_encode_readv() uses xdr->buf->page_len as a starting point for
the nfsd_iter_read() sink buffer -- page_len is going to be offset
by the parts of the COMPOUND that have already been encoded into
xdr->buf->pages.
However, that value must be captured /before/
xdr_reserve_space_vec() advances page_len by the expected size of
the read payload. Otherwise, the whole front part of the first
page of the payload in the reply will be uninitialized.
Mantas hit this because sec=krb5i forces RQ_SPLICE_OK off, which
invokes the readv part of the nfsd4_encode_read() path. Also,
older Linux NFS clients appear to send shorter READ requests
for files smaller than a page, whereas newer clients just send
page-sized requests and let the server send as many bytes as
are in the file.
Reported-by: Mantas Mikulėnas <grawity@gmail.com>
Closes: https://lore.kernel.org/linux-nfs/f1d0b234-e650-0f6e-0f5d-126b3d51d1eb@gmail.com/
Fixes: 703d752155 ("NFSD: Hoist rq_vec preparation into nfsd_read() [step two]")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
- Use correct order when encoding NFSv4 RENAME change_info
- Fix a potential oops during NFSD shutdown
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmUEf90ACgkQM2qzM29m
f5dcrhAAjz3SLoz4nAlo441JMATp/auwqTBKlWvNUWbK/ZgYjUcTXH8RfPbBQ/Yl
5NfsThubmPTSr/bNvV+C963C4c/25d45SYwabAdLoZWKIpWf1mCDNhE1JkGQoM83
KyZcrLnEknh/fN6jAxueFhWPCaavGDK8eI8H6Ls33GvX0cG4S7IYseGZ1x2dMWpE
XFnV8oAWSD3cjKrTGwPYja2W0ShISGn4UMnj+lPWEpfCDydnQzKh87dhWzvlAxCf
4Ckikmv3vcCfm15Ja0JZX3K1cvh0A3oISbgN/QNqGpzDzQ1qyugsx9uNltDa76N8
NzGENnvO2/WURkw4DHJYIsiNt82uB04NYSAL6mLat3GU4DeAKOl3r+9W0jZPFS17
7mOLh9dqg3ubSxOr6BpeY4I6+Uq2enUTwh1+VkvC7hQ906ZY8kzaj5MtLddvkjCw
p4BTDzH280AwtnwiJ7q0WvBR5TxdGc3fqqsLJ+yMX2aGdhM1iMOjjENoSc3+v+Cy
pEK1tju/lDzcMZrl5A08Za30L5boHp11n5SLia7tctKiFs4biL3WwNu527Y/Klq6
04DtkcXdF/tW702388sKL1UnnXhw4KA8k7Us4HrQL+zXDZ+/UJ+AzY6s8ls2ytbN
0NESe/ntsKjoijrRphAp+RgC8fhC7iT0GMJP8OHyHUqEOU2Wrbw=
=ZTRy
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Use correct order when encoding NFSv4 RENAME change_info
- Fix a potential oops during NFSD shutdown
* tag 'nfsd-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: fix possible oops when nfsd/pool_stats is closed.
nfsd: fix change_info in NFSv4 RENAME replies
If /proc/fs/nfsd/pool_stats is open when the last nfsd thread exits, then
when the file is closed a NULL pointer is dereferenced.
This is because nfsd_pool_stats_release() assumes that the
pointer to the svc_serv cannot become NULL while a reference is held.
This used to be the case but a recent patch split nfsd_last_thread() out
from nfsd_put(), and clearing the pointer is done in nfsd_last_thread().
This is easily reproduced by running
rpc.nfsd 8 ; ( rpc.nfsd 0;true) < /proc/fs/nfsd/pool_stats
Fortunately nfsd_pool_stats_release() has easy access to the svc_serv
pointer, and so can call svc_put() on it directly.
Fixes: 9f28a971ee ("nfsd: separate nfsd_last_thread() from nfsd_put()")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
nfsd sends the transposed directory change info in the RENAME reply. The
source directory is in save_fh and the target is in current_fh.
Reported-by: Zhi Li <yieli@redhat.com>
Reported-by: Benjamin Coddington <bcodding@redhat.com>
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2218844
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
I'm thrilled to announce that the Linux in-kernel NFS server now
offers NFSv4 write delegations. A write delegation enables a client
to cache data and metadata for a single file more aggressively,
reducing network round trips and server workload. Many thanks to Dai
Ngo for contributing this facility, and to Jeff Layton and Neil
Brown for reviewing and testing it.
This release also sees the removal of all support for DES- and
triple-DES-based Kerberos encryption types in the kernel's SunRPC
implementation. These encryption types have been deprecated by the
Internet community for years and are considered insecure. This
change affects both the in-kernel NFS client and server.
The server's UDP and TCP socket transports have now fully adopted
David Howells' new bio_vec iterator so that no more than one
sendmsg() call is needed to transmit each RPC message. In
particular, this helps kTLS optimize record boundaries when sending
RPC-with-TLS replies, and it takes the server a baby step closer to
handling file I/O via folios.
We've begun work on overhauling the SunRPC thread scheduler to
remove a costly linked-list walk when looking for an idle RPC
service thread to wake. The pre-requisites are included in this
release. Thanks to Neil Brown for his ongoing work on this
improvement.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmTwoa0ACgkQM2qzM29m
f5cZvw/8CmFVNC27aMrJEhRRhwwrXbLzUkWh9GCYkG98PHYiLxLTvZ6qELXAax/a
UjSgIDSRcWl4z8M/tyBQtgsw7NADr+7XWqEoXR7HZ5pEEC/KNGM0oQWQ92ojjKYy
JmHdB02uaDJfcd9ioFTU13cw7q2BQfoe2xLI8yqis2vcVSu92AM7aIw+cvJIpwQB
inA3TIIsYTV/gPByXSfEtvmYACadoFiMvfvYwaWhjFS9MdSzFmcVG0Dp3EFIig29
odmWEofcz6uIvUWvUswWEGdoSu7uOKIztSuAI4PlTwaofUaSKG6e5kmtpr3cLERD
Uhg2lm5JgqkXBd7QHObNimJ4DtQzFwHmhA08qo8rd/zba75mn/Hr5IF0q3Rxs99J
SRYHcAeP8afKn5Ge0yzoTgCNcqhfz8KLRfoCQX49mljr+muNxld4nMklD2KdUwJi
XEB512/q3E4nUgopXZiSJYQYAq1CfdR5WpGipZ9X0XK9HZBDF/qhXGtk1YQuNWyj
ZxJS3bfBza4oVIvP5/ehjCIQwOvqkcrC5zZGDIgDvw9Q6L3L1wqmVntsdCLCLRcJ
jB4IOsj+DECfJ6w2vP2SZ3GeMtFnyuTQjhUTkjPuAKTBBiKo4Tj0o/agpfDYbWZy
1l3a2yH5jqJgkm4MaVh3YHRJGc0ub0ccpIrs3QQ4jvjMLQ/3Gcs=
=XGHs
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"I'm thrilled to announce that the Linux in-kernel NFS server now
offers NFSv4 write delegations. A write delegation enables a client to
cache data and metadata for a single file more aggressively, reducing
network round trips and server workload. Many thanks to Dai Ngo for
contributing this facility, and to Jeff Layton and Neil Brown for
reviewing and testing it.
This release also sees the removal of all support for DES- and
triple-DES-based Kerberos encryption types in the kernel's SunRPC
implementation. These encryption types have been deprecated by the
Internet community for years and are considered insecure. This change
affects both the in-kernel NFS client and server.
The server's UDP and TCP socket transports have now fully adopted
David Howells' new bio_vec iterator so that no more than one sendmsg()
call is needed to transmit each RPC message. In particular, this helps
kTLS optimize record boundaries when sending RPC-with-TLS replies, and
it takes the server a baby step closer to handling file I/O via
folios.
We've begun work on overhauling the SunRPC thread scheduler to remove
a costly linked-list walk when looking for an idle RPC service thread
to wake. The pre-requisites are included in this release. Thanks to
Neil Brown for his ongoing work on this improvement"
* tag 'nfsd-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (56 commits)
Documentation: Add missing documentation for EXPORT_OP flags
SUNRPC: Remove unused declaration rpc_modcount()
SUNRPC: Remove unused declarations
NFSD: da_addr_body field missing in some GETDEVICEINFO replies
SUNRPC: Remove return value of svc_pool_wake_idle_thread()
SUNRPC: make rqst_should_sleep() idempotent()
SUNRPC: Clean up svc_set_num_threads
SUNRPC: Count ingress RPC messages per svc_pool
SUNRPC: Deduplicate thread wake-up code
SUNRPC: Move trace_svc_xprt_enqueue
SUNRPC: Add enum svc_auth_status
SUNRPC: change svc_xprt::xpt_flags bits to enum
SUNRPC: change svc_rqst::rq_flags bits to enum
SUNRPC: change svc_pool::sp_flags bits to enum
SUNRPC: change cache_head.flags bits to enum
SUNRPC: remove timeout arg from svc_recv()
SUNRPC: change svc_recv() to return void.
SUNRPC: call svc_process() from svc_recv().
nfsd: separate nfsd_last_thread() from nfsd_put()
nfsd: Simplify code around svc_exit_thread() call in nfsd()
...
The XDR specification in RFC 8881 looks like this:
struct device_addr4 {
layouttype4 da_layout_type;
opaque da_addr_body<>;
};
struct GETDEVICEINFO4resok {
device_addr4 gdir_device_addr;
bitmap4 gdir_notification;
};
union GETDEVICEINFO4res switch (nfsstat4 gdir_status) {
case NFS4_OK:
GETDEVICEINFO4resok gdir_resok4;
case NFS4ERR_TOOSMALL:
count4 gdir_mincount;
default:
void;
};
Looking at nfsd4_encode_getdeviceinfo() ....
When the client provides a zero gd_maxcount, then the Linux NFS
server implementation encodes the da_layout_type field and then
skips the da_addr_body field completely, proceeding directly to
encode gdir_notification field.
There does not appear to be an option in the specification to skip
encoding da_addr_body. Moreover, Section 18.40.3 says:
> If the client wants to just update or turn off notifications, it
> MAY send a GETDEVICEINFO operation with gdia_maxcount set to zero.
> In that event, if the device ID is valid, the reply's da_addr_body
> field of the gdir_device_addr field will be of zero length.
Since the layout drivers are responsible for encoding the
da_addr_body field, put this fix inside the ->encode_getdeviceinfo
methods.
Fixes: 9cf514ccfa ("nfsd: implement pNFS operations")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Tom Haynes <loghyr@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Most svc threads have no interest in a timeout.
nfsd sets it to 1 hour, but this is a wart of no significance.
lockd uses the timeout so that it can call nlmsvc_retry_blocked().
It also sometimes calls svc_wake_up() to ensure this is called.
So change lockd to be consistent and always use svc_wake_up() to trigger
nlmsvc_retry_blocked() - using a timer instead of a timeout to
svc_recv().
And change svc_recv() to not take a timeout arg.
This makes the sp_threads_timedout counter always zero.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_recv() currently returns a 0 on success or one of two errors:
- -EAGAIN means no message was successfully received
- -EINTR means the thread has been told to stop
Previously nfsd would stop as the result of a signal as well as
following kthread_stop(). In that case the difference was useful: EINTR
means stop unconditionally. EAGAIN means stop if kthread_should_stop(),
continue otherwise.
Now threads only exit when kthread_should_stop() so we don't need the
distinction.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
All callers of svc_recv() go on to call svc_process() on success.
Simplify callers by having svc_recv() do that for them.
This loses one call to validate_process_creds() in nfsd. That was
debugging code added 14 years ago. I don't think we need to keep it.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that the last nfsd thread is stopped by an explicit act of calling
svc_set_num_threads() with a count of zero, we only have a limited
number of places that can happen, and don't need to call
nfsd_last_thread() in nfsd_put()
So separate that out and call it at the two places where the number of
threads is set to zero.
Move the clearing of ->nfsd_serv and the call to svc_xprt_destroy_all()
into nfsd_last_thread(), as they are really part of the same action.
nfsd_put() is now a thin wrapper around svc_put(), so make it a static
inline.
nfsd_put() cannot be called after nfsd_last_thread(), so in a couple of
places we have to use svc_put() instead.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Previously a thread could exit asynchronously (due to a signal) so some
care was needed to hold nfsd_mutex over the last svc_put() call. Now a
thread can only exit when svc_set_num_threads() is called, and this is
always called under nfsd_mutex. So no care is needed.
Not only is the mutex held when a thread exits now, but the svc refcount
is elevated, so the svc_put() in svc_exit_thread() will never be a final
put, so the mutex isn't even needed at this point in the code.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The original implementation of nfsd used signals to stop threads during
shutdown.
In Linux 2.3.46pre5 nfsd gained the ability to shutdown threads
internally it if was asked to run "0" threads. After this user-space
transitioned to using "rpc.nfsd 0" to stop nfsd and sending signals to
threads was no longer an important part of the API.
In commit 3ebdbe5203 ("SUNRPC: discard svo_setup and rename
svc_set_num_threads_sync()") (v5.17-rc1~75^2~41) we finally removed the
use of signals for stopping threads, using kthread_stop() instead.
This patch makes the "obvious" next step and removes the ability to
signal nfsd threads - or any svc threads. nfsd stops allowing signals
and we don't check for their delivery any more.
This will allow for some simplification in later patches.
A change worth noting is in nfsd4_ssc_setup_dul(). There was previously
a signal_pending() check which would only succeed when the thread was
being shut down. It should really have tested kthread_should_stop() as
well. Now it just does the latter, not the former.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
A well-formed NFSv4 ACL will always contain OWNER@/GROUP@/EVERYONE@
ACEs, but there is no requirement for inheritable entries for those
entities. POSIX ACLs must always have owner/group/other entries, even for a
default ACL.
nfsd builds the default ACL from inheritable ACEs, but the current code
just leaves any unspecified ACEs zeroed out. The result is that adding a
default user or group ACE to an inode can leave it with unwanted deny
entries.
For instance, a newly created directory with no acl will look something
like this:
# NFSv4 translation by server
A::OWNER@:rwaDxtTcCy
A::GROUP@:rxtcy
A::EVERYONE@:rxtcy
# POSIX ACL of underlying file
user::rwx
group::r-x
other::r-x
...if I then add new v4 ACE:
nfs4_setfacl -a A:fd:1000:rwx /mnt/local/test
...I end up with a result like this today:
user::rwx
user:1000:rwx
group::r-x
mask::rwx
other::r-x
default:user::---
default:user:1000:rwx
default:group::---
default😷:rwx
default:other::---
A::OWNER@:rwaDxtTcCy
A::1000:rwaDxtcy
A::GROUP@:rxtcy
A::EVERYONE@:rxtcy
D:fdi:OWNER@:rwaDx
A:fdi:OWNER@:tTcCy
A:fdi:1000:rwaDxtcy
A:fdi:GROUP@:tcy
A:fdi:EVERYONE@:tcy
...which is not at all expected. Adding a single inheritable allow ACE
should not result in everyone else losing access.
The setfacl command solves a silimar issue by copying owner/group/other
entries from the effective ACL when none of them are set:
"If a Default ACL entry is created, and the Default ACL contains no
owner, owning group, or others entry, a copy of the ACL owner,
owning group, or others entry is added to the Default ACL.
Having nfsd do the same provides a more sane result (with no deny ACEs
in the resulting set):
user::rwx
user:1000:rwx
group::r-x
mask::rwx
other::r-x
default:user::rwx
default:user:1000:rwx
default:group::r-x
default😷:rwx
default:other::r-x
A::OWNER@:rwaDxtTcCy
A::1000:rwaDxtcy
A::GROUP@:rxtcy
A::EVERYONE@:rxtcy
A:fdi:OWNER@:rwaDxtTcCy
A:fdi:1000:rwaDxtcy
A:fdi:GROUP@:rxtcy
A:fdi:EVERYONE@:rxtcy
Reported-by: Ondrej Valousek <ondrej.valousek@diasemi.com>
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2136452
Suggested-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
In the event that we can't fetch post_op_attr attributes, we still need
to set a value for the after_change. The operation has already happened,
so we're not able to return an error at that point, but we do want to
ensure that the client knows that its cache should be invalidated.
If we weren't able to fetch post-op attrs, then just set the
after_change to before_change + 1. The atomic flag should already be
clear in this case.
Suggested-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
At one time, nfsd would scrape inode information directly out of struct
inode in order to populate the change_info4. At that time, the BUG_ON in
set_change_info made some sense, since having it unset meant a coding
error.
More recently, it calls vfs_getattr to get this information, which can
fail. If that fails, fh_pre_saved can end up not being set. While this
situation is unfortunate, we don't need to crash the box.
Move set_change_info to nfs4proc.c since all of the callers are there.
Revise the condition for setting "atomic" to also check for
fh_pre_saved. Drop the BUG_ON and just have it zero out both
change_attr4s when this occurs.
Reported-by: Boyang Xue <bxue@redhat.com>
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2223560
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Collecting pre_op_attrs can fail, in which case it's probably best to
fail the whole operation.
Change fh_fill_pre_attrs and fh_fill_both_attrs to return __be32, and
have the callers check the return code and abort the operation if it's
not nfs_ok.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
I got this today from modpost:
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nfsd/nfsd.o
Add a module description.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The svc_ prefix is identified with the SunRPC layer. Although the
duplicate reply cache caches RPC replies, it is only for the NFS
protocol. Rename the struct to better reflect its purpose.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Over time I'd like to see NFS-specific fields moved out of struct
svc_rqst, which is an RPC layer object. These fields are layering
violations.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Avoid holding the bucket lock while freeing cache entries. This
change also caps the number of entries that are freed when the
shrinker calls to reduce the shrinker's impact on the cache's
effectiveness.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Enable nfsd_prune_bucket() to drop the bucket lock while calling
kfree(). Use the same pattern that Jeff recently introduced in the
NFSD filecache.
A few percpu operations are moved outside the lock since they
temporarily disable local IRQs which is expensive and does not
need to be done while the lock is held.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
To reduce contention on the bucket locks, we must avoid calling
kfree() while each bucket lock is held.
Start by refactoring nfsd_reply_cache_free_locked() into a helper
that removes an entry from the bucket (and must therefore run under
the lock) and a second helper that frees the entry (which does not
need to hold the lock).
For readability, rename the helpers nfsd_cacherep_<verb>.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This patch grants write delegations for OPEN with NFS4_SHARE_ACCESS_WRITE
if there is no conflict with other OPENs.
Write delegation conflicts with another OPEN, REMOVE, RENAME and SETATTR
are handled the same as read delegation using notify_change,
try_break_deleg.
The NFSv4.0 protocol does not enable a server to determine that a
conflicting GETATTR originated from the client holding the
delegation versus coming from some other client. With NFSv4.1 and
later, the SEQUENCE operation that begins each COMPOUND contains a
client ID, so delegation recall can be safely squelched in this case.
With NFSv4.0, however, the server must recall or send a CB_GETATTR
(per RFC 7530 Section 16.7.5) even when the GETATTR originates from
the client holding that delegation.
An NFSv4.0 client can trigger a pathological situation if it always
sends a DELEGRETURN preceded by a conflicting GETATTR in the same
COMPOUND. COMPOUND execution will always stop at the GETATTR and the
DELEGRETURN will never get executed. The server eventually revokes
the delegation, which can result in loss of open or lock state.
Tracepoint added to track whether read or write delegation is granted.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Replace the -1 (no limit) with a zero (no reserved space).
This prevents certain non-determinant client behavior, such as
silly-renaming a file when the only open reference is a write
delegation. Such a rename can leave unexpected .nfs files in a
directory that is otherwise supposed to be empty.
Note that other server implementations that support write delegation
also set this field to zero.
Suggested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If the GETATTR request on a file that has write delegation in effect and
the request attributes include the change info and size attribute then
the write delegation is recalled. If the delegation is returned within
30ms then the GETATTR is serviced as normal otherwise the NFS4ERR_DELAY
error is returned for the GETATTR.
Add counter for write delegation recall due to conflict GETATTR. This is
used to evaluate the need to implement CB_GETATTR to adoid recalling the
delegation with conflit GETATTR.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>