Commit Graph

4839 Commits

Author SHA1 Message Date
Benjamin Coddington 57331a59ac NFSv4.1: Use the nfs_client's rpc timeouts for backchannel
For backchannel requests that lookup the appropriate nfs_client, use the
state-management rpc_clnt's rpc_timeout parameters for the backchannel's
response.  When the nfs_client cannot be found, fall back to using the
xprt's default timeout parameters.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-01-04 17:01:01 -05:00
Benjamin Coddington e6f533b615 SUNRPC: Fixup v4.1 backchannel request timeouts
After commit 59464b262f ("SUNRPC: SOFTCONN tasks should time out when on
the sending list"), any 4.1 backchannel tasks placed on the sending queue
would immediately return with -ETIMEDOUT since their req timers are zero.

Initialize the backchannel's rpc_rqst timeout parameters from the xprt's
default timeout settings.

Fixes: 59464b262f ("SUNRPC: SOFTCONN tasks should time out when on the sending list")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-01-04 17:00:52 -05:00
Olga Kornievskaia 98b4e51375 SUNRPC: fix _xprt_switch_find_current_entry logic
Fix the logic for picking current transport entry.

Fixes: 95d0d30c66 ("SUNRPC create an iterator to list only OFFLINE xprts")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-01-04 10:47:56 -05:00
Anna Schumaker 31b6290869 SUNRPC: Fix a suspicious RCU usage warning
I received the following warning while running cthon against an ontap
server running pNFS:

[   57.202521] =============================
[   57.202522] WARNING: suspicious RCU usage
[   57.202523] 6.7.0-rc3-g2cc14f52aeb7 #41492 Not tainted
[   57.202525] -----------------------------
[   57.202525] net/sunrpc/xprtmultipath.c:349 RCU-list traversed in non-reader section!!
[   57.202527]
               other info that might help us debug this:

[   57.202528]
               rcu_scheduler_active = 2, debug_locks = 1
[   57.202529] no locks held by test5/3567.
[   57.202530]
               stack backtrace:
[   57.202532] CPU: 0 PID: 3567 Comm: test5 Not tainted 6.7.0-rc3-g2cc14f52aeb7 #41492 5b09971b4965c0aceba19f3eea324a4a806e227e
[   57.202534] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[   57.202536] Call Trace:
[   57.202537]  <TASK>
[   57.202540]  dump_stack_lvl+0x77/0xb0
[   57.202551]  lockdep_rcu_suspicious+0x154/0x1a0
[   57.202556]  rpc_xprt_switch_has_addr+0x17c/0x190 [sunrpc ebe02571b9a8ceebf7d98e71675af20c19bdb1f6]
[   57.202596]  rpc_clnt_setup_test_and_add_xprt+0x50/0x180 [sunrpc ebe02571b9a8ceebf7d98e71675af20c19bdb1f6]
[   57.202621]  ? rpc_clnt_add_xprt+0x254/0x300 [sunrpc ebe02571b9a8ceebf7d98e71675af20c19bdb1f6]
[   57.202646]  rpc_clnt_add_xprt+0x27a/0x300 [sunrpc ebe02571b9a8ceebf7d98e71675af20c19bdb1f6]
[   57.202671]  ? __pfx_rpc_clnt_setup_test_and_add_xprt+0x10/0x10 [sunrpc ebe02571b9a8ceebf7d98e71675af20c19bdb1f6]
[   57.202696]  nfs4_pnfs_ds_connect+0x345/0x760 [nfsv4 c716d88496ded0ea6d289bbea684fa996f9b57a9]
[   57.202728]  ? __pfx_nfs4_test_session_trunk+0x10/0x10 [nfsv4 c716d88496ded0ea6d289bbea684fa996f9b57a9]
[   57.202754]  nfs4_fl_prepare_ds+0x75/0xc0 [nfs_layout_nfsv41_files e3a4187f18ae8a27b630f9feae6831b584a9360a]
[   57.202760]  filelayout_write_pagelist+0x4a/0x200 [nfs_layout_nfsv41_files e3a4187f18ae8a27b630f9feae6831b584a9360a]
[   57.202765]  pnfs_generic_pg_writepages+0xbe/0x230 [nfsv4 c716d88496ded0ea6d289bbea684fa996f9b57a9]
[   57.202788]  __nfs_pageio_add_request+0x3fd/0x520 [nfs 6c976fa593a7c2976f5a0aeb4965514a828e6902]
[   57.202813]  nfs_pageio_add_request+0x18b/0x390 [nfs 6c976fa593a7c2976f5a0aeb4965514a828e6902]
[   57.202831]  nfs_do_writepage+0x116/0x1e0 [nfs 6c976fa593a7c2976f5a0aeb4965514a828e6902]
[   57.202849]  nfs_writepages_callback+0x13/0x30 [nfs 6c976fa593a7c2976f5a0aeb4965514a828e6902]
[   57.202866]  write_cache_pages+0x265/0x450
[   57.202870]  ? __pfx_nfs_writepages_callback+0x10/0x10 [nfs 6c976fa593a7c2976f5a0aeb4965514a828e6902]
[   57.202891]  nfs_writepages+0x141/0x230 [nfs 6c976fa593a7c2976f5a0aeb4965514a828e6902]
[   57.202913]  do_writepages+0xd2/0x230
[   57.202917]  ? filemap_fdatawrite_wbc+0x5c/0x80
[   57.202921]  filemap_fdatawrite_wbc+0x67/0x80
[   57.202924]  filemap_write_and_wait_range+0xd9/0x170
[   57.202930]  nfs_wb_all+0x49/0x180 [nfs 6c976fa593a7c2976f5a0aeb4965514a828e6902]
[   57.202947]  nfs4_file_flush+0x72/0xb0 [nfsv4 c716d88496ded0ea6d289bbea684fa996f9b57a9]
[   57.202969]  __se_sys_close+0x46/0xd0
[   57.202972]  do_syscall_64+0x68/0x100
[   57.202975]  ? do_syscall_64+0x77/0x100
[   57.202976]  ? do_syscall_64+0x77/0x100
[   57.202979]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[   57.202982] RIP: 0033:0x7fe2b12e4a94
[   57.202985] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 80 3d d5 18 0e 00 00 74 13 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 44 c3 0f 1f 00 48 83 ec 18 89 7c 24 0c e8 c3
[   57.202987] RSP: 002b:00007ffe857ddb38 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
[   57.202989] RAX: ffffffffffffffda RBX: 00007ffe857dfd68 RCX: 00007fe2b12e4a94
[   57.202991] RDX: 0000000000002000 RSI: 00007ffe857ddc40 RDI: 0000000000000003
[   57.202992] RBP: 00007ffe857dfc50 R08: 7fffffffffffffff R09: 0000000065650f49
[   57.202993] R10: 00007fe2b11f8300 R11: 0000000000000202 R12: 0000000000000000
[   57.202994] R13: 00007ffe857dfd80 R14: 00007fe2b1445000 R15: 0000000000000000
[   57.202999]  </TASK>

The problem seems to be that two out of three callers aren't taking the
rcu_read_lock() before calling the list_for_each_entry_rcu() function in
rpc_xprt_switch_has_addr(). I fix this by having
rpc_xprt_switch_has_addr() unconditionaly take the rcu_read_lock(),
which is okay to do recursively in the case that the lock has already
been taken by a caller.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-01-04 10:47:56 -05:00
Anna Schumaker a902f3dec7 SUNRPC: Create a helper function for accessing the rpc_clnt's xprt_switch
This function takes the necessary rcu read lock to dereference the
client's rpc_xprt_switch and bump the reference count so it doesn't
disappear underneath us before returning. This does mean that callers
are responsible for calling xprt_switch_put() on the returned object
when they are done with it.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-01-04 10:47:56 -05:00
Anna Schumaker 5f1e77b228 SUNRPC: Remove unused function rpc_clnt_xprt_switch_put()
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-01-04 10:47:56 -05:00
Anna Schumaker ec677b58f6 SUNRPC: Clean up unused variable in rpc_xprt_probe_trunked()
We don't use the rpc_xprt_switch anywhere in this function, so let's not
take an extra reference to in unnecessarily.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-01-04 10:47:56 -05:00
Linus Torvalds ac1c13e257 nfsd-6.7 fixes:
- Address a few recently-introduced issues
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmWB+F4ACgkQM2qzM29m
 f5dZKBAAoAxQA9FTIcb+uMIviGvMFBylKv9mhQxKSkhWhhP9XX2ESXAl+6pcBQJP
 ikjWw09WO/6ttvxg4hY7exk6FgkDuCALnx1xB6azEq5Ndf4T2aauuFUTEERfQ8th
 mtGSKEX1a98bGlkBIVQgfr3VFbQ0MRwcr0iGKfbHC3X5GiuFZOHn5fFE+0LjXWyJ
 7MjI0Du3lXDlrG5mJ88T5ySMcbaOWBTzqlMY3kbwcf1dWy3TZ6uh6faXqbemrDPg
 Zixj6JGi+oLlrbYYjQV1Sm5QLW3e882QCQ0U9g2sOEjmLRN/hbCE90i6l1rVEe9a
 E1ZkAY5qtl+iFK6cbAYUt6lOcZaF8lNeEtBW5JhcFm7f7CJAUSmMb05klZ6rwpXR
 l6UDwQtnAbmghqAwuKaWdfxys/yqFTvKNJRida7+qK1fs8H8q0y/acKsIv/5PoWz
 cJvWXova4wIT7Q3xneaYXhHk1aWTQTooErQ0VrgnsE0Ch7Vc4sONhQ3SLajllChD
 x3JL0DgDJ6mO/JaFZCfIy3ihEaprV0uL3bqgXaad1SszEWJt4HCCE1fQpI0cSbXH
 Vyq5H0R74EoLh8TGc6dbJ5wZwU81P6Rc0nHz/VU+gUPrEeNYKv8P2bahnXfR7Nsc
 F93sl2DqQOEv/Q/h/9UolwL7AGl6TkzdPKEoYKNafagUWT031y0=
 =F1lB
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Address a few recently-introduced issues

* tag 'nfsd-6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  SUNRPC: Revert 5f7fc5d69f
  NFSD: Revert 738401a9bd
  NFSD: Revert 6c41d9a9bd
  nfsd: hold nfsd_mutex across entire netlink operation
  nfsd: call nfsd_last_thread() before final nfsd_put()
2023-12-20 11:16:50 -08:00
Chuck Lever bd018b98ba SUNRPC: Revert 5f7fc5d69f
Guillaume says:
> I believe commit 5f7fc5d69f ("SUNRPC: Resupply rq_pages from
> node-local memory") in Linux 6.5+ is incorrect. It passes
> unconditionally rq_pool->sp_id as the NUMA node.
>
> While the comment in the svc_pool declaration in sunrpc/svc.h says
> that sp_id is also the NUMA node id, it might not be the case if
> the svc is created using svc_create_pooled(). svc_created_pooled()
> can use the per-cpu pool mode therefore in this case sp_id would
> be the cpu id.

Fix this by reverting now. At a later point this minor optimization,
and the deceptive labeling of the sp_id field, can be revisited.

Reported-by: Guillaume Morin <guillaume@morinfr.org>
Closes: https://lore.kernel.org/linux-nfs/ZYC9rsno8qYggVt9@bender.morinfr.org/T/#u
Fixes: 5f7fc5d69f ("SUNRPC: Resupply rq_pages from node-local memory")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-12-18 17:10:52 -05:00
Jens Axboe ae1914174a cred: get rid of CONFIG_DEBUG_CREDENTIALS
This code is rarely (never?) enabled by distros, and it hasn't caught
anything in decades. Let's kill off this legacy debug code.

Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-12-15 14:19:48 -08:00
Kees Cook cb6d2fd30d SUNRPC: Replace strlcpy() with strscpy()
strlcpy() reads the entire source buffer first. This read may exceed
the destination size limit. This is both inefficient and can lead
to linear read overflows if a source string is not NUL-terminated[1].
Additionally, it returns the size of the source string, not the
resulting size of the destination string. In an effort to remove strlcpy()
completely[2], replace strlcpy() here with strscpy().

Explicitly handle the truncation case by returning the size of the
resulting string.

If "nodename" was ever longer than sizeof(clnt->cl_nodename) - 1, this
change will fix a bug where clnt->cl_nodelen would end up thinking there
were more characters in clnt->cl_nodename than there actually were,
which might have lead to kernel memory content exposures.

Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: linux-nfs@vger.kernel.org
Cc: netdev@vger.kernel.org
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [1]
Link: https://github.com/KSPP/linux/issues/89 [2]
Co-developed-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20231114175407.work.410-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-11-30 12:18:36 -08:00
Linus Torvalds 6bc986ab83 NFS client updates for Linux 6.7
Highlights include:
 
 Bugfixes:
  - SUNRPC: A fix to re-probe the target RPC port after an ECONNRESET error
  - SUNRPC: Handle allocation errors from rpcb_call_async()
  - SUNRPC: Fix a use-after-free condition in rpc_pipefs
  - SUNRPC: fix up various checks for timeouts
  - NFSv4.1: Handle NFS4ERR_DELAY errors during session trunking
  - NFSv4.1: fix SP4_MACH_CRED protection for pnfs IO
  - NFSv4: Ensure that we test all delegations when the server notifies
    us that it may have revoked some of them
 
 Features:
  - Allow knfsd processes to break out of NFS4ERR_DELAY loops when
    re-exporting NFSv4.x by setting appropriate values for the
    'delay_retrans' module parameter.
  - nfs: Convert nfs_symlink() to use a folio
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmVL62IACgkQZwvnipYK
 API4Ww/+I75hmEdI4i/6v8WnrLWLWzmCgybez2AfrKYtuYmEcDZTf2K4pNVEJxFG
 PJ4TYRpaSgwxCXqoup5INOdL5gS8g3JbAlTqrDQ8nYGeBXETN9tN5n2Xj8liHk2l
 OrnYzgTjC2pwpWDZq/W1LCQMatCbs9XhpmyBvqZH5r7tAOVGkrk1ICZ/r+/nGiN8
 LAbm/I0M1Jp0iJisN/i/0CsEgLQMCfeQVrtEMCZGsoVS79Mr/W1hF3KiGognI/xz
 FvEXnZKauw9npu7U7ckhHZcHd8oQxby0Q0Xny/IpgiO2Z1YqfKCvJSK1sjBt/lFu
 7fe2HGMFfcTMx/bn/qdUJR351607rBi4h3t1OfK4KIxV1gSLUyS/RR7ayFx5A0gM
 pFJcC0ZnKoiVr2vSNMMguennbyjScqNOC5ECb+bpMAyDV7suF9e/khK12CdYwqFm
 cpeUUD35GZT+RjP3htI92Pj0bqBOHx+o7qEi+MEel3t90us8PPBrWywHk7Tw40vJ
 eUxH0XQtZVswH6TvekNdoMXx8TXSAbx4I6Hw3fmNGWLRp494p3GVSmWQmctcvCi7
 Y6E1P3YT8G1OeI9+fIQr5Wp2F9sOdFPrb3BrAojj9ndJ3ZqQAcx+gY5z2RVnvvur
 PTsLInFxS8+WvvqPtQCZZ5UxnrcPFCk1js0rg6EqdjZmq6+OqJU=
 =TNRl
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Bugfixes:

   - SUNRPC:
       - re-probe the target RPC port after an ECONNRESET error
       - handle allocation errors from rpcb_call_async()
       - fix a use-after-free condition in rpc_pipefs
       - fix up various checks for timeouts

   - NFSv4.1:
       - Handle NFS4ERR_DELAY errors during session trunking
       - fix SP4_MACH_CRED protection for pnfs IO

   - NFSv4:
       - Ensure that we test all delegations when the server notifies
         us that it may have revoked some of them

  Features:

   - Allow knfsd processes to break out of NFS4ERR_DELAY loops when
     re-exporting NFSv4.x by setting appropriate values for the
     'delay_retrans' module parameter

   - nfs: Convert nfs_symlink() to use a folio"

* tag 'nfs-for-6.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  nfs: Convert nfs_symlink() to use a folio
  SUNRPC: Fix RPC client cleaned up the freed pipefs dentries
  NFSv4.1: fix SP4_MACH_CRED protection for pnfs IO
  SUNRPC: Add an IS_ERR() check back to where it was
  NFSv4.1: fix handling NFS4ERR_DELAY when testing for session trunking
  nfs41: drop dependency between flexfiles layout driver and NFSv3 modules
  NFSv4: fairly test all delegations on a SEQ4_ revocation
  SUNRPC: SOFTCONN tasks should time out when on the sending list
  SUNRPC: Force close the socket when a hard error is reported
  SUNRPC: Don't skip timeout checks in call_connect_status()
  SUNRPC: ECONNRESET might require a rebind
  NFSv4/pnfs: Allow layoutget to return EAGAIN for softerr mounts
  NFSv4: Add a parameter to limit the number of retries after NFS4ERR_DELAY
2023-11-08 13:39:16 -08:00
Linus Torvalds ecae0bd517 Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
 
 - Kemeng Shi has contributed some compation maintenance work in the
   series "Fixes and cleanups to compaction".
 
 - Joel Fernandes has a patchset ("Optimize mremap during mutual
   alignment within PMD") which fixes an obscure issue with mremap()'s
   pagetable handling during a subsequent exec(), based upon an
   implementation which Linus suggested.
 
 - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the
   following patch series:
 
 	mm/damon: misc fixups for documents, comments and its tracepoint
 	mm/damon: add a tracepoint for damos apply target regions
 	mm/damon: provide pseudo-moving sum based access rate
 	mm/damon: implement DAMOS apply intervals
 	mm/damon/core-test: Fix memory leaks in core-test
 	mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
 
 - In the series "Do not try to access unaccepted memory" Adrian Hunter
   provides some fixups for the recently-added "unaccepted memory' feature.
   To increase the feature's checking coverage.  "Plug a few gaps where
   RAM is exposed without checking if it is unaccepted memory".
 
 - In the series "cleanups for lockless slab shrink" Qi Zheng has done
   some maintenance work which is preparation for the lockless slab
   shrinking code.
 
 - Qi Zheng has redone the earlier (and reverted) attempt to make slab
   shrinking lockless in the series "use refcount+RCU method to implement
   lockless slab shrink".
 
 - David Hildenbrand contributes some maintenance work for the rmap code
   in the series "Anon rmap cleanups".
 
 - Kefeng Wang does more folio conversions and some maintenance work in
   the migration code.  Series "mm: migrate: more folio conversion and
   unification".
 
 - Matthew Wilcox has fixed an issue in the buffer_head code which was
   causing long stalls under some heavy memory/IO loads.  Some cleanups
   were added on the way.  Series "Add and use bdev_getblk()".
 
 - In the series "Use nth_page() in place of direct struct page
   manipulation" Zi Yan has fixed a potential issue with the direct
   manipulation of hugetlb page frames.
 
 - In the series "mm: hugetlb: Skip initialization of gigantic tail
   struct pages if freed by HVO" has improved our handling of gigantic
   pages in the hugetlb vmmemmep optimizaton code.  This provides
   significant boot time improvements when significant amounts of gigantic
   pages are in use.
 
 - Matthew Wilcox has sent the series "Small hugetlb cleanups" - code
   rationalization and folio conversions in the hugetlb code.
 
 - Yin Fengwei has improved mlock()'s handling of large folios in the
   series "support large folio for mlock"
 
 - In the series "Expose swapcache stat for memcg v1" Liu Shixin has
   added statistics for memcg v1 users which are available (and useful)
   under memcg v2.
 
 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
   prctl so that userspace may direct the kernel to not automatically
   propagate the denial to child processes.  The series is named "MDWE
   without inheritance".
 
 - Kefeng Wang has provided the series "mm: convert numa balancing
   functions to use a folio" which does what it says.
 
 - In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch
   makes is possible for a process to propagate KSM treatment across
   exec().
 
 - Huang Ying has enhanced memory tiering's calculation of memory
   distances.  This is used to permit the dax/kmem driver to use "high
   bandwidth memory" in addition to Optane Data Center Persistent Memory
   Modules (DCPMM).  The series is named "memory tiering: calculate
   abstract distance based on ACPI HMAT"
 
 - In the series "Smart scanning mode for KSM" Stefan Roesch has
   optimized KSM by teaching it to retain and use some historical
   information from previous scans.
 
 - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the
   series "mm: memcg: fix tracking of pending stats updates values".
 
 - In the series "Implement IOCTL to get and optionally clear info about
   PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits
   us to atomically read-then-clear page softdirty state.  This is mainly
   used by CRIU.
 
 - Hugh Dickins contributed the series "shmem,tmpfs: general maintenance"
   - a bunch of relatively minor maintenance tweaks to this code.
 
 - Matthew Wilcox has increased the use of the VMA lock over file-backed
   page faults in the series "Handle more faults under the VMA lock".  Some
   rationalizations of the fault path became possible as a result.
 
 - In the series "mm/rmap: convert page_move_anon_rmap() to
   folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups
   and folio conversions.
 
 - In the series "various improvements to the GUP interface" Lorenzo
   Stoakes has simplified and improved the GUP interface with an eye to
   providing groundwork for future improvements.
 
 - Andrey Konovalov has sent along the series "kasan: assorted fixes and
   improvements" which does those things.
 
 - Some page allocator maintenance work from Kemeng Shi in the series
   "Two minor cleanups to break_down_buddy_pages".
 
 - In thes series "New selftest for mm" Breno Leitao has developed
   another MM self test which tickles a race we had between madvise() and
   page faults.
 
 - In the series "Add folio_end_read" Matthew Wilcox provides cleanups
   and an optimization to the core pagecache code.
 
 - Nhat Pham has added memcg accounting for hugetlb memory in the series
   "hugetlb memcg accounting".
 
 - Cleanups and rationalizations to the pagemap code from Lorenzo
   Stoakes, in the series "Abstract vma_merge() and split_vma()".
 
 - Audra Mitchell has fixed issues in the procfs page_owner code's new
   timestamping feature which was causing some misbehaviours.  In the
   series "Fix page_owner's use of free timestamps".
 
 - Lorenzo Stoakes has fixed the handling of new mappings of sealed files
   in the series "permit write-sealed memfd read-only shared mappings".
 
 - Mike Kravetz has optimized the hugetlb vmemmap optimization in the
   series "Batch hugetlb vmemmap modification operations".
 
 - Some buffer_head folio conversions and cleanups from Matthew Wilcox in
   the series "Finish the create_empty_buffers() transition".
 
 - As a page allocator performance optimization Huang Ying has added
   automatic tuning to the allocator's per-cpu-pages feature, in the series
   "mm: PCP high auto-tuning".
 
 - Roman Gushchin has contributed the patchset "mm: improve performance
   of accounted kernel memory allocations" which improves their performance
   by ~30% as measured by a micro-benchmark.
 
 - folio conversions from Kefeng Wang in the series "mm: convert page
   cpupid functions to folios".
 
 - Some kmemleak fixups in Liu Shixin's series "Some bugfix about
   kmemleak".
 
 - Qi Zheng has improved our handling of memoryless nodes by keeping them
   off the allocation fallback list.  This is done in the series "handle
   memoryless nodes more appropriately".
 
 - khugepaged conversions from Vishal Moola in the series "Some
   khugepaged folio conversions".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA
 jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y
 FgeUPAD1oasg6CP+INZvCj34waNxwAc=
 =E+Y4
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Kemeng Shi has contributed some compation maintenance work in the
     series 'Fixes and cleanups to compaction'

   - Joel Fernandes has a patchset ('Optimize mremap during mutual
     alignment within PMD') which fixes an obscure issue with mremap()'s
     pagetable handling during a subsequent exec(), based upon an
     implementation which Linus suggested

   - More DAMON/DAMOS maintenance and feature work from SeongJae Park i
     the following patch series:

	mm/damon: misc fixups for documents, comments and its tracepoint
	mm/damon: add a tracepoint for damos apply target regions
	mm/damon: provide pseudo-moving sum based access rate
	mm/damon: implement DAMOS apply intervals
	mm/damon/core-test: Fix memory leaks in core-test
	mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval

   - In the series 'Do not try to access unaccepted memory' Adrian
     Hunter provides some fixups for the recently-added 'unaccepted
     memory' feature. To increase the feature's checking coverage. 'Plug
     a few gaps where RAM is exposed without checking if it is
     unaccepted memory'

   - In the series 'cleanups for lockless slab shrink' Qi Zheng has done
     some maintenance work which is preparation for the lockless slab
     shrinking code

   - Qi Zheng has redone the earlier (and reverted) attempt to make slab
     shrinking lockless in the series 'use refcount+RCU method to
     implement lockless slab shrink'

   - David Hildenbrand contributes some maintenance work for the rmap
     code in the series 'Anon rmap cleanups'

   - Kefeng Wang does more folio conversions and some maintenance work
     in the migration code. Series 'mm: migrate: more folio conversion
     and unification'

   - Matthew Wilcox has fixed an issue in the buffer_head code which was
     causing long stalls under some heavy memory/IO loads. Some cleanups
     were added on the way. Series 'Add and use bdev_getblk()'

   - In the series 'Use nth_page() in place of direct struct page
     manipulation' Zi Yan has fixed a potential issue with the direct
     manipulation of hugetlb page frames

   - In the series 'mm: hugetlb: Skip initialization of gigantic tail
     struct pages if freed by HVO' has improved our handling of gigantic
     pages in the hugetlb vmmemmep optimizaton code. This provides
     significant boot time improvements when significant amounts of
     gigantic pages are in use

   - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
     rationalization and folio conversions in the hugetlb code

   - Yin Fengwei has improved mlock()'s handling of large folios in the
     series 'support large folio for mlock'

   - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
     added statistics for memcg v1 users which are available (and
     useful) under memcg v2

   - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
     prctl so that userspace may direct the kernel to not automatically
     propagate the denial to child processes. The series is named 'MDWE
     without inheritance'

   - Kefeng Wang has provided the series 'mm: convert numa balancing
     functions to use a folio' which does what it says

   - In the series 'mm/ksm: add fork-exec support for prctl' Stefan
     Roesch makes is possible for a process to propagate KSM treatment
     across exec()

   - Huang Ying has enhanced memory tiering's calculation of memory
     distances. This is used to permit the dax/kmem driver to use 'high
     bandwidth memory' in addition to Optane Data Center Persistent
     Memory Modules (DCPMM). The series is named 'memory tiering:
     calculate abstract distance based on ACPI HMAT'

   - In the series 'Smart scanning mode for KSM' Stefan Roesch has
     optimized KSM by teaching it to retain and use some historical
     information from previous scans

   - Yosry Ahmed has fixed some inconsistencies in memcg statistics in
     the series 'mm: memcg: fix tracking of pending stats updates
     values'

   - In the series 'Implement IOCTL to get and optionally clear info
     about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
     which permits us to atomically read-then-clear page softdirty
     state. This is mainly used by CRIU

   - Hugh Dickins contributed the series 'shmem,tmpfs: general
     maintenance', a bunch of relatively minor maintenance tweaks to
     this code

   - Matthew Wilcox has increased the use of the VMA lock over
     file-backed page faults in the series 'Handle more faults under the
     VMA lock'. Some rationalizations of the fault path became possible
     as a result

   - In the series 'mm/rmap: convert page_move_anon_rmap() to
     folio_move_anon_rmap()' David Hildenbrand has implemented some
     cleanups and folio conversions

   - In the series 'various improvements to the GUP interface' Lorenzo
     Stoakes has simplified and improved the GUP interface with an eye
     to providing groundwork for future improvements

   - Andrey Konovalov has sent along the series 'kasan: assorted fixes
     and improvements' which does those things

   - Some page allocator maintenance work from Kemeng Shi in the series
     'Two minor cleanups to break_down_buddy_pages'

   - In thes series 'New selftest for mm' Breno Leitao has developed
     another MM self test which tickles a race we had between madvise()
     and page faults

   - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
     and an optimization to the core pagecache code

   - Nhat Pham has added memcg accounting for hugetlb memory in the
     series 'hugetlb memcg accounting'

   - Cleanups and rationalizations to the pagemap code from Lorenzo
     Stoakes, in the series 'Abstract vma_merge() and split_vma()'

   - Audra Mitchell has fixed issues in the procfs page_owner code's new
     timestamping feature which was causing some misbehaviours. In the
     series 'Fix page_owner's use of free timestamps'

   - Lorenzo Stoakes has fixed the handling of new mappings of sealed
     files in the series 'permit write-sealed memfd read-only shared
     mappings'

   - Mike Kravetz has optimized the hugetlb vmemmap optimization in the
     series 'Batch hugetlb vmemmap modification operations'

   - Some buffer_head folio conversions and cleanups from Matthew Wilcox
     in the series 'Finish the create_empty_buffers() transition'

   - As a page allocator performance optimization Huang Ying has added
     automatic tuning to the allocator's per-cpu-pages feature, in the
     series 'mm: PCP high auto-tuning'

   - Roman Gushchin has contributed the patchset 'mm: improve
     performance of accounted kernel memory allocations' which improves
     their performance by ~30% as measured by a micro-benchmark

   - folio conversions from Kefeng Wang in the series 'mm: convert page
     cpupid functions to folios'

   - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
     kmemleak'

   - Qi Zheng has improved our handling of memoryless nodes by keeping
     them off the allocation fallback list. This is done in the series
     'handle memoryless nodes more appropriately'

   - khugepaged conversions from Vishal Moola in the series 'Some
     khugepaged folio conversions'"

[ bcachefs conflicts with the dynamically allocated shrinkers have been
  resolved as per Stephen Rothwell in

     https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/

  with help from Qi Zheng.

  The clone3 test filtering conflict was half-arsed by yours truly ]

* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
  mm/damon/sysfs: update monitoring target regions for online input commit
  mm/damon/sysfs: remove requested targets when online-commit inputs
  selftests: add a sanity check for zswap
  Documentation: maple_tree: fix word spelling error
  mm/vmalloc: fix the unchecked dereference warning in vread_iter()
  zswap: export compression failure stats
  Documentation: ubsan: drop "the" from article title
  mempolicy: migration attempt to match interleave nodes
  mempolicy: mmap_lock is not needed while migrating folios
  mempolicy: alloc_pages_mpol() for NUMA policy without vma
  mm: add page_rmappable_folio() wrapper
  mempolicy: remove confusing MPOL_MF_LAZY dead code
  mempolicy: mpol_shared_policy_init() without pseudo-vma
  mempolicy trivia: use pgoff_t in shared mempolicy tree
  mempolicy trivia: slightly more consistent naming
  mempolicy trivia: delete those ancient pr_debug()s
  mempolicy: fix migrate_pages(2) syscall return nr_failed
  kernfs: drop shared NUMA mempolicy hooks
  hugetlbfs: drop shared NUMA mempolicy pretence
  mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
  ...
2023-11-02 19:38:47 -10:00
Linus Torvalds bc3012f4e3 This update includes the following changes:
API:
 
 - Add virtual-address based lskcipher interface.
 - Optimise ahash/shash performance in light of costly indirect calls.
 - Remove ahash alignmask attribute.
 
 Algorithms:
 
 - Improve AES/XTS performance of 6-way unrolling for ppc.
 - Remove some uses of obsolete algorithms (md4, md5, sha1).
 - Add FIPS 202 SHA-3 support in pkcs1pad.
 - Add fast path for single-page messages in adiantum.
 - Remove zlib-deflate.
 
 Drivers:
 
 - Add support for S4 in meson RNG driver.
 - Add STM32MP13x support in stm32.
 - Add hwrng interface support in qcom-rng.
 - Add support for deflate algorithm in hisilicon/zip.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmVB3vgACgkQxycdCkmx
 i6dsOBAAykbnX8BpnpnOXYywE9ZWrl98rAk51MK0N9olZNfg78zRPIv7fFxFdC20
 SDJrDSNPmn0Qvaa5e0EfoAdklsm0k2GkXL/BwPKMKWUsyIoJVYI3WrBMnjBy9xMp
 yfME+h0bKoXJCZKnYkIUSGUejmUPSyRlEylrXoFlH/VWYwAaii/x9zwreQoF+0LR
 KI24A1q8AYs6Dw9HSfndaAub9GOzrqKYs6fSaMG+77Y4UC5aoi5J9Bp2G3uVyHay
 x/0bZtIxKXS9wn+LeG/3GspX23x/I5VwBOdAoMigrYmAIaIg5qgyMszudltTAs4R
 zF1Kh7WsnM5+vpnBSeigzo+/GGOU3QTz8y3tBTg+3ZR7GWGOwQLiizhOYqCyOfAH
 pIm6c++sZw/OOHiL69Nt4HeLKzGNYYWk3s4X/B/6cqoouPfOsfBaQobZNx9zfy7q
 ZNEvSVBjrFX/L6wDSotny1LTWLUNjHbmLaMV5uQZ/SQKEtv19fp2Dl7SsLkHH+3v
 ldOAwfoJR6QcSwz3Ez02TUAvQhtP172Hnxi7u44eiZu2aUboLhCFr7aEU6kVdBCx
 1rIRVHD1oqlOEDRwPRXzhF3I8R4QDORJIxZ6UUhg7yueuI+XCGDsBNC+LqBrBmSR
 IbdjqmSDUBhJyM5yMnt1VFYhqKQ/ZzwZ3JQviwW76Es9pwEIolM=
 =IZmR
 -----END PGP SIGNATURE-----

Merge tag 'v6.7-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto updates from Herbert Xu:
 "API:
   - Add virtual-address based lskcipher interface
   - Optimise ahash/shash performance in light of costly indirect calls
   - Remove ahash alignmask attribute

  Algorithms:
   - Improve AES/XTS performance of 6-way unrolling for ppc
   - Remove some uses of obsolete algorithms (md4, md5, sha1)
   - Add FIPS 202 SHA-3 support in pkcs1pad
   - Add fast path for single-page messages in adiantum
   - Remove zlib-deflate

  Drivers:
   - Add support for S4 in meson RNG driver
   - Add STM32MP13x support in stm32
   - Add hwrng interface support in qcom-rng
   - Add support for deflate algorithm in hisilicon/zip"

* tag 'v6.7-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (283 commits)
  crypto: adiantum - flush destination page before unmapping
  crypto: testmgr - move pkcs1pad(rsa,sha3-*) to correct place
  Documentation/module-signing.txt: bring up to date
  module: enable automatic module signing with FIPS 202 SHA-3
  crypto: asymmetric_keys - allow FIPS 202 SHA-3 signatures
  crypto: rsa-pkcs1pad - Add FIPS 202 SHA-3 support
  crypto: FIPS 202 SHA-3 register in hash info for IMA
  x509: Add OIDs for FIPS 202 SHA-3 hash and signatures
  crypto: ahash - optimize performance when wrapping shash
  crypto: ahash - check for shash type instead of not ahash type
  crypto: hash - move "ahash wrapping shash" functions to ahash.c
  crypto: talitos - stop using crypto_ahash::init
  crypto: chelsio - stop using crypto_ahash::init
  crypto: ahash - improve file comment
  crypto: ahash - remove struct ahash_request_priv
  crypto: ahash - remove crypto_ahash_alignmask
  crypto: gcm - stop using alignmask of ahash
  crypto: chacha20poly1305 - stop using alignmask of ahash
  crypto: ccm - stop using alignmask of ahash
  net: ipv6: stop checking crypto_ahash_alignmask
  ...
2023-11-02 16:15:30 -10:00
felix bfca5fb4e9 SUNRPC: Fix RPC client cleaned up the freed pipefs dentries
RPC client pipefs dentries cleanup is in separated rpc_remove_pipedir()
workqueue,which takes care about pipefs superblock locking.
In some special scenarios, when kernel frees the pipefs sb of the
current client and immediately alloctes a new pipefs sb,
rpc_remove_pipedir function would misjudge the existence of pipefs
sb which is not the one it used to hold. As a result,
the rpc_remove_pipedir would clean the released freed pipefs dentries.

To fix this issue, rpc_remove_pipedir should check whether the
current pipefs sb is consistent with the original pipefs sb.

This error can be catched by KASAN:
=========================================================
[  250.497700] BUG: KASAN: slab-use-after-free in dget_parent+0x195/0x200
[  250.498315] Read of size 4 at addr ffff88800a2ab804 by task kworker/0:18/106503
[  250.500549] Workqueue: events rpc_free_client_work
[  250.501001] Call Trace:
[  250.502880]  kasan_report+0xb6/0xf0
[  250.503209]  ? dget_parent+0x195/0x200
[  250.503561]  dget_parent+0x195/0x200
[  250.503897]  ? __pfx_rpc_clntdir_depopulate+0x10/0x10
[  250.504384]  rpc_rmdir_depopulate+0x1b/0x90
[  250.504781]  rpc_remove_client_dir+0xf5/0x150
[  250.505195]  rpc_free_client_work+0xe4/0x230
[  250.505598]  process_one_work+0x8ee/0x13b0
...
[   22.039056] Allocated by task 244:
[   22.039390]  kasan_save_stack+0x22/0x50
[   22.039758]  kasan_set_track+0x25/0x30
[   22.040109]  __kasan_slab_alloc+0x59/0x70
[   22.040487]  kmem_cache_alloc_lru+0xf0/0x240
[   22.040889]  __d_alloc+0x31/0x8e0
[   22.041207]  d_alloc+0x44/0x1f0
[   22.041514]  __rpc_lookup_create_exclusive+0x11c/0x140
[   22.041987]  rpc_mkdir_populate.constprop.0+0x5f/0x110
[   22.042459]  rpc_create_client_dir+0x34/0x150
[   22.042874]  rpc_setup_pipedir_sb+0x102/0x1c0
[   22.043284]  rpc_client_register+0x136/0x4e0
[   22.043689]  rpc_new_client+0x911/0x1020
[   22.044057]  rpc_create_xprt+0xcb/0x370
[   22.044417]  rpc_create+0x36b/0x6c0
...
[   22.049524] Freed by task 0:
[   22.049803]  kasan_save_stack+0x22/0x50
[   22.050165]  kasan_set_track+0x25/0x30
[   22.050520]  kasan_save_free_info+0x2b/0x50
[   22.050921]  __kasan_slab_free+0x10e/0x1a0
[   22.051306]  kmem_cache_free+0xa5/0x390
[   22.051667]  rcu_core+0x62c/0x1930
[   22.051995]  __do_softirq+0x165/0x52a
[   22.052347]
[   22.052503] Last potentially related work creation:
[   22.052952]  kasan_save_stack+0x22/0x50
[   22.053313]  __kasan_record_aux_stack+0x8e/0xa0
[   22.053739]  __call_rcu_common.constprop.0+0x6b/0x8b0
[   22.054209]  dentry_free+0xb2/0x140
[   22.054540]  __dentry_kill+0x3be/0x540
[   22.054900]  shrink_dentry_list+0x199/0x510
[   22.055293]  shrink_dcache_parent+0x190/0x240
[   22.055703]  do_one_tree+0x11/0x40
[   22.056028]  shrink_dcache_for_umount+0x61/0x140
[   22.056461]  generic_shutdown_super+0x70/0x590
[   22.056879]  kill_anon_super+0x3a/0x60
[   22.057234]  rpc_kill_sb+0x121/0x200

Fixes: 0157d021d2 ("SUNRPC: handle RPC client pipefs dentries by network namespace aware routines")
Signed-off-by: felix <fuzhen5@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-11-01 15:40:44 -04:00
Dan Carpenter 4f3ed83718 SUNRPC: Add an IS_ERR() check back to where it was
This IS_ERR() check was deleted during in a cleanup because, at the time,
the rpcb_call_async() function could not return an error pointer.  That
changed in commit 25cf32ad5d ("SUNRPC: Handle allocation failure in
rpc_new_task()") and now it can return an error pointer.  Put the check
back.

A related revert was done in commit 13bd901418 ("Revert "SUNRPC:
Remove unreachable error condition"").

Fixes: 037e910b52 ("SUNRPC: Remove unreachable error condition in rpcb_getport_async()")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-11-01 15:40:44 -04:00
Linus Torvalds 8b16da681e NFSD 6.7 Release Notes
This release completes the SunRPC thread scheduler work that was
 begun in v6.6. The scheduler can now find an svc thread to wake in
 constant time and without a list walk. Thanks again to Neil Brown
 for this overhaul.
 
 Lorenzo Bianconi contributed infrastructure for a netlink-based
 NFSD control plane. The long-term plan is to provide the same
 functionality as found in /proc/fs/nfsd, plus some interesting
 additions, and then migrate the NFSD user space utilities to
 netlink.
 
 A long series to overhaul NFSD's NFSv4 operation encoding was
 applied in this release. The goals are to bring this family of
 encoding functions in line with the matching NFSv4 decoding
 functions and with the NFSv2 and NFSv3 XDR functions, preparing
 the way for better memory safety and maintainability.
 
 A further improvement to NFSD's write delegation support was
 contributed by Dai Ngo. This adds a CB_GETATTR callback,
 enabling the server to retrieve cached size and mtime data from
 clients holding write delegations. If the server can retrieve
 this information, it does not have to recall the delegation in
 some cases.
 
 The usual panoply of bug fixes and minor improvements round out
 this release. As always I am grateful to all contributors,
 reviewers, and testers.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmU5IuoACgkQM2qzM29m
 f5eVsg//bVp8S93ci/oDlKfzOwH2fO5e5rna91wrDpJxkd51h6KTx55dSRG5sjAZ
 EywIVOann6xCtsixAPyff5Cweg2dWvzQRsy1ZnvWQ1qZBzD5KAJY5LPkeSFUCKBo
 Zani/qTOYbxzgFMjZx+yDSXDPKG68WYZBQK59SI7mURu4SYdk8aRyNY8mjHfr0Vh
 Aqrcny4oVtXV4sL5P5G/2FUW7WKT3olA3jSYlRRNMhbs2qpEemRCCrspOEMMad+b
 t1+ZCg+U27PMranvOJnof4RU7peZbaxDWA0gyiUbivVXVtZn9uOs0ffhktkvechL
 ePc33dqdp2ITdKIPA6JlaRv5WflKXQw0YYM9Kv5mcR4A2el7owL4f/pMlPhtbYwJ
 IOJv15KdKVN979G2e6WMYiKK+iHfaUUguhMEXnfnGoAajHOZNQiUEo3iFQAD7LDc
 DvMF8d9QqYmB9IW8FOYaRRfZGJOQHf3TL79Nd08z/bn5swvlvfj77leux9Sb+0/m
 Luk2Xvz2AJVSXE31wzabaGHkizN+BtH+e4MMbXUHBPW5jE9v7XOnEUFr4UdZyr9P
 Gl87A7NcrzNjJWT5TrnzM4sOslNsx46Aeg+VuNt2fSRn2dm6iBu2B8s0N4imx6dV
 PX1y9VSLq5WRhjrFZ1qeiZdsuTaQtrEiNDoRIQR6nCJPAV80iFk=
 =B4wJ
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "This release completes the SunRPC thread scheduler work that was begun
  in v6.6. The scheduler can now find an svc thread to wake in constant
  time and without a list walk. Thanks again to Neil Brown for this
  overhaul.

  Lorenzo Bianconi contributed infrastructure for a netlink-based NFSD
  control plane. The long-term plan is to provide the same functionality
  as found in /proc/fs/nfsd, plus some interesting additions, and then
  migrate the NFSD user space utilities to netlink.

  A long series to overhaul NFSD's NFSv4 operation encoding was applied
  in this release. The goals are to bring this family of encoding
  functions in line with the matching NFSv4 decoding functions and with
  the NFSv2 and NFSv3 XDR functions, preparing the way for better memory
  safety and maintainability.

  A further improvement to NFSD's write delegation support was
  contributed by Dai Ngo. This adds a CB_GETATTR callback, enabling the
  server to retrieve cached size and mtime data from clients holding
  write delegations. If the server can retrieve this information, it
  does not have to recall the delegation in some cases.

  The usual panoply of bug fixes and minor improvements round out this
  release. As always I am grateful to all contributors, reviewers, and
  testers"

* tag 'nfsd-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (127 commits)
  svcrdma: Fix tracepoint printk format
  svcrdma: Drop connection after an RDMA Read error
  NFSD: clean up alloc_init_deleg()
  NFSD: Fix frame size warning in svc_export_parse()
  NFSD: Rewrite synopsis of nfsd_percpu_counters_init()
  nfsd: Clean up errors in nfs3proc.c
  nfsd: Clean up errors in nfs4state.c
  NFSD: Clean up errors in stats.c
  NFSD: simplify error paths in nfsd_svc()
  NFSD: Clean up nfsd4_encode_seek()
  NFSD: Clean up nfsd4_encode_offset_status()
  NFSD: Clean up nfsd4_encode_copy_notify()
  NFSD: Clean up nfsd4_encode_copy()
  NFSD: Clean up nfsd4_encode_test_stateid()
  NFSD: Clean up nfsd4_encode_exchange_id()
  NFSD: Clean up nfsd4_do_encode_secinfo()
  NFSD: Clean up nfsd4_encode_access()
  NFSD: Clean up nfsd4_encode_readdir()
  NFSD: Clean up nfsd4_encode_entry4()
  NFSD: Add an nfsd4_encode_nfs_cookie4() helper
  ...
2023-10-30 10:12:29 -10:00
Trond Myklebust 59464b262f SUNRPC: SOFTCONN tasks should time out when on the sending list
SOFTCONN tasks need to periodically check if the transport is still
connected, so that they can time out if that is not the case.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-10-22 19:47:57 -04:00
Trond Myklebust f663507e29 SUNRPC: Force close the socket when a hard error is reported
Fix up xs_wake_error() to close the socket when a hard error is being
reported. Usually, that means an ECONNRESET was received on a connection
attempt.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-10-22 19:47:57 -04:00
Trond Myklebust caa388f7e5 SUNRPC: Don't skip timeout checks in call_connect_status()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-10-22 19:47:56 -04:00
Trond Myklebust 4b09ca1508 SUNRPC: ECONNRESET might require a rebind
If connect() is returning ECONNRESET, it usually means that nothing is
listening on that port. If so, a rebind might be required in order to
obtain the new port on which the RPC service is listening.

Fixes: fd01b25979 ("SUNRPC: ECONNREFUSED should cause a rebind.")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-10-22 19:47:56 -04:00
Jeff Layton 5feb4b4ac4
sunrpc: convert to new timestamp accessors
Convert to using the new inode timestamp accessor functions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20231004185347.80880-81-jlayton@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18 14:08:30 +02:00
Chuck Lever 197115ebf3 svcrdma: Drop connection after an RDMA Read error
When an RPC Call message cannot be pulled from the client, that
is a message loss, by definition. Close the connection to trigger
the client to resend.

Cc: <stable@vger.kernel.org>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:40 -04:00
Chuck Lever 789ce196a3 SUNRPC: Remove BUG_ON call sites
There is no need to take down the whole system for these assertions.

I'd rather not attempt a heroic save here, as some bug has occurred
that has left the transport data structures in an unknown state.
Just warn and then leak the left-over resources.

Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:10 -04:00
NeilBrown 15d39883ee SUNRPC: change the back-channel queue to lwq
This removes the need to store and update back-links in the list.
It also remove the need for the _bh version of spin_lock().

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:08 -04:00
NeilBrown 580a25756a SUNRPC: discard sp_lock
sp_lock is now only used to protect sp_all_threads.  This isn't needed
as sp_all_threads is only manipulated through svc_set_num_threads(),
which is already serialized.  Read-acccess only requires rcu_read_lock().
So no more locking is needed.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:08 -04:00
NeilBrown 2e8fc923fe SUNRPC: change sp_nrthreads to atomic_t
Using an atomic_t avoids the need to take a spinlock (which can soon be
removed).

Choosing a thread to kill needs to be careful as we cannot set the "die
now" bit atomically with the test on the count.  Instead we temporarily
increase the count.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:07 -04:00
NeilBrown 9a0e6accc0 SUNRPC: use lwq for sp_sockets - renamed to sp_xprts
lwq avoids using back pointers in lists, and uses less locking.
This introduces a new spinlock, but the other one will be removed in a
future patch.

For svc_clean_up_xprts(), we now dequeue the entire queue, walk it to
remove and process the xprts that need cleaning up, then re-enqueue the
remaining queue.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:07 -04:00
NeilBrown 5b80147e0c SUNRPC: only have one thread waking up at a time
Currently if several items of work become available in quick succession,
that number of threads (if available) will be woken.  By the time some
of them wake up another thread that was already cache-warm might have
come along and completed the work.  Anecdotal evidence suggests as many
as 15% of wakes find nothing to do once they get to the point of
looking.

This patch changes svc_pool_wake_idle_thread() to wake the first thread
on the queue but NOT remove it.  Subsequent calls will wake the same
thread.  Once that thread starts it will dequeue itself and after
dequeueing some work to do, it will wake the next thread if there is more
work ready.  This results in a more orderly increase in the number of
busy threads.

As a bonus, this allows us to reduce locking around the idle queue.
svc_pool_wake_idle_thread() no longer needs to take a lock (beyond
rcu_read_lock()) as it doesn't manipulate the queue, it just looks at
the first item.

The thread itself can avoid locking by using the new
llist_del_first_this() interface.  This will safely remove the thread
itself if it is the head.  If it isn't the head, it will do nothing.
If multiple threads call this concurrently only one will succeed.  The
others will do nothing, so no corruption can result.

If a thread wakes up and finds that it cannot dequeue itself that means
either
- that it wasn't woken because it was the head of the queue.  Maybe the
  freezer woke it.  In that case it can go back to sleep (after trying
  to freeze of course).
- some other thread found there was nothing to do very recently, and
  placed itself on the head of the queue in front of this thread.
  It must check again after placing itself there, so it can be deemed to
  be responsible for any pending work, and this thread can go back to
  sleep until woken.

No code ever tests for busy threads any more.  Only each thread itself
cares if it is busy.  So svc_thread_busy() is no longer needed.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:07 -04:00
NeilBrown d7926ee8b7 SUNRPC: rename some functions from rqst_ to svc_thread_
Functions which directly manipulate a 'struct rqst', such as
svc_rqst_alloc() or svc_rqst_release_pages(), can reasonably
have "rqst" in there name.
However functions that act on the running thread, such as
XX_should_sleep() or XX_wait_for_work() should seem more
natural with a "svc_thread_" prefix.

So make those changes.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:06 -04:00
NeilBrown 9bd4161c59 SUNRPC: change service idle list to be an llist
With an llist we don't need to take a lock to add a thread to the list,
though we still need a lock to remove it.  That will go in the next
patch.

Unlike double-linked lists, a thread cannot reliably remove itself from
the list.  Only the first thread can be removed, and that can change
asynchronously.  So some care is needed.

We already check if there is pending work to do, so we are unlikely to
add ourselves to the idle list and then want to remove ourselves again.

If we DO find something needs to be done after adding ourselves to the
list, we simply wake up the first thread on the list.  If that was us,
we successfully removed ourselves and can continue.  If it was some
other thread, they will do the work that needs to be done.  We can
safely sleep until woken.

We also remove the test on freezing() from rqst_should_sleep().  Instead
we set TASK_FREEZABLE before scheduling.  This makes is safe to
schedule() when a freeze is pending.  As we now loop waiting to be
removed from the idle queue, this is a cleaner way to handle freezing.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:05 -04:00
NeilBrown 2b65a22684 SUNRPC: discard SP_CONGESTED
We can tell if a pool is congested by checking if the idle list is
empty.  We don't need a separate flag.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:05 -04:00
NeilBrown 5ff817b235 SUNRPC: add list of idle threads
Rather than searching a list of threads to find an idle one, having a
list of idle threads allows an idle thread to be found immediately.

This adds some spin_lock calls which is not ideal, but as the hold-time
is tiny it is still faster than searching a list.  A future patch will
remove them using llist.h.  This involves some subtlety and so is left
to a separate patch.

This removes the need for the RQ_BUSY flag.  The rqst is "busy"
precisely when it is not on the "idle" list.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:04 -04:00
NeilBrown fa341560ca SUNRPC: change how svc threads are asked to exit.
svc threads are currently stopped using kthread_stop().  This requires
identifying a specific thread.  However we don't care which thread
stops, just as long as one does.

So instead, set a flag in the svc_pool to say that a thread needs to
die, and have each thread check this flag instead of calling
kthread_should_stop().  The first thread to find and clear this flag
then moves towards exiting.

This removes an explicit dependency on sp_all_threads which will make a
future patch simpler.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:04 -04:00
NeilBrown 063ab935a4 SUNRPC: integrate back-channel processing with svc_recv()
Using svc_recv() for (NFSv4.1) back-channel handling means we have just
one mechanism for waking threads.

Also change kthread_freezable_should_stop() in nfs4_callback_svc() to
kthread_should_stop() as used elsewhere.
kthread_freezable_should_stop() effectively adds a try_to_freeze() call,
and svc_recv() already contains that at an appropriate place.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:03 -04:00
Chuck Lever 6ed8cdf967 SUNRPC: Clean up bc_svc_process()
The test robot complained that, in some build configurations, the
@error variable in bc_svc_process's only caller is set but never
used. This happens because dprintk() is the only consumer of that
value.

 - Remove the dprintk() call sites in favor of the svc_process
   tracepoint
 - The @error variable and the return value of bc_svc_process() are
   now unused, so get rid of them.
 - The @serv parameter is set to rqstp->rq_serv by the only caller,
   and bc_svc_process() then uses it only to set rqstp->rq_serv. It
   can be removed.
 - Rename bc_svc_process() according to the convention that
   globally-visible RPC server functions have names that begin with
   "svc_"; and because it is globally-visible, give it a proper
   kdoc comment.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202308121314.HA8Rq2XG-lkp@intel.com/
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:03 -04:00
NeilBrown 7b31f4daeb SUNRPC: rename and refactor svc_get_next_xprt()
svc_get_next_xprt() does a lot more than just get an xprt.  It also
decides if it needs to sleep, depending not only on the availability of
xprts but also on the need to exit or handle external work.

So rename it to svc_rqst_wait_for_work() and only do the testing and
waiting.  Move all the waiting-related code out of svc_recv() into the
new svc_rqst_wait_for_work().

Move the dequeueing code out of svc_get_next_xprt() into svc_recv().

Previously svc_xprt_dequeue() would be called twice, once before waiting
and possibly once after.  Now instead rqst_should_sleep() is called
twice.  Once to decide if waiting is needed, and once to check against
after setting the task state do see if we might have missed a wakeup.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:03 -04:00
NeilBrown e3274026e2 SUNRPC: move all of xprt handling into svc_xprt_handle()
svc_xprt_handle() does lots of things itself, but leaves some to the
caller - svc_recv().  This isn't elegant.

Move that code out of svc_recv() into svc_xprt_handle()

Move the calls to svc_xprt_release() from svc_send() and svc_drop()
(the two possible final steps in svc_process()) and from svc_recv() (in
the case where svc_process() wasn't called) into svc_xprt_handle().

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:02 -04:00
Qi Zheng abe0c269f9 sunrpc: dynamically allocate the sunrpc_cred shrinker
Use new APIs to dynamically allocate the sunrpc_cred shrinker.

Link: https://lkml.kernel.org/r/20230911094444.68966-19-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Coly Li <colyli@suse.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sean Paul <sean@poorly.run>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:24 -07:00
Anna Schumaker 26e8bfa30d SUNRPC/TLS: Lock the lower_xprt during the tls handshake
Otherwise we run the risk of having the lower_xprt freed from underneath
us, causing an oops that looks like this:

[  224.150698] BUG: kernel NULL pointer dereference, address: 0000000000000018
[  224.150951] #PF: supervisor read access in kernel mode
[  224.151117] #PF: error_code(0x0000) - not-present page
[  224.151278] PGD 0 P4D 0
[  224.151361] Oops: 0000 [#1] PREEMPT SMP NOPTI
[  224.151499] CPU: 2 PID: 99 Comm: kworker/u10:6 Not tainted 6.6.0-rc3-g6465e260f487 #41264 a00b0960990fb7bc6d6a330ee03588b67f08a47b
[  224.151977] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[  224.152216] Workqueue: xprtiod xs_tcp_tls_setup_socket [sunrpc]
[  224.152434] RIP: 0010:xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc]
[  224.152643] Code: 00 00 48 8b 7c 24 08 e9 f3 01 00 00 48 83 7b c0 00 0f 85 d2 01 00 00 49 8d 84 24 f8 05 00 00 48 89 44 24 10 48 8b 00 48 89 c5 <4c> 8b 68 18 66 41 83 3f 0a 75 71 45 31 ff 4c 89 ef 31 f6 e8 5c 76
[  224.153246] RSP: 0018:ffffb00ec060fd18 EFLAGS: 00010246
[  224.153427] RAX: 0000000000000000 RBX: ffff8c06c2e53e40 RCX: 0000000000000001
[  224.153652] RDX: ffff8c073bca2408 RSI: 0000000000000282 RDI: ffff8c06c259ee00
[  224.153868] RBP: 0000000000000000 R08: ffffffff9da55aa0 R09: 0000000000000001
[  224.154084] R10: 00000034306c30f1 R11: 0000000000000002 R12: ffff8c06c2e51800
[  224.154300] R13: ffff8c06c355d400 R14: 0000000004208160 R15: ffff8c06c2e53820
[  224.154521] FS:  0000000000000000(0000) GS:ffff8c073bd00000(0000) knlGS:0000000000000000
[  224.154763] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  224.154940] CR2: 0000000000000018 CR3: 0000000062c1e000 CR4: 0000000000750ee0
[  224.155157] PKRU: 55555554
[  224.155244] Call Trace:
[  224.155325]  <TASK>
[  224.155395]  ? __die_body+0x68/0xb0
[  224.155507]  ? page_fault_oops+0x34c/0x3a0
[  224.155635]  ? _raw_spin_unlock_irqrestore+0xe/0x40
[  224.155793]  ? exc_page_fault+0x7a/0x1b0
[  224.155916]  ? asm_exc_page_fault+0x26/0x30
[  224.156047]  ? xs_tcp_tls_setup_socket+0x3cc/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[  224.156367]  ? xs_tcp_tls_setup_socket+0x2fe/0x7e0 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[  224.156697]  ? __pfx_xs_tls_handshake_done+0x10/0x10 [sunrpc ae3a15912ae37fd51dafbdbc2dbd069117f8f5c8]
[  224.157013]  process_scheduled_works+0x24e/0x450
[  224.157158]  worker_thread+0x21c/0x2d0
[  224.157275]  ? __pfx_worker_thread+0x10/0x10
[  224.157409]  kthread+0xe8/0x110
[  224.157510]  ? __pfx_kthread+0x10/0x10
[  224.157628]  ret_from_fork+0x37/0x50
[  224.157741]  ? __pfx_kthread+0x10/0x10
[  224.157859]  ret_from_fork_asm+0x1b/0x30
[  224.157983]  </TASK>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-27 15:16:40 -04:00
Trond Myklebust a275ab6260 Revert "SUNRPC dont update timeout value on connection reset"
This reverts commit 88428cc4ae.

The problem this commit is intended to fix was comprehensively fixed
in commit 7de62bc09f ("SUNRPC dont update timeout value on connection
reset").
Since then, this commit has been preventing the correct timeout of soft
mounted requests.

Cc: stable@vger.kernel.org # 5.9.x: 09252177d5f9: SUNRPC: Handle major timeout in xprt_adjust_timeout()
Cc: stable@vger.kernel.org # 5.9.x: 7de62bc09fe6: SUNRPC dont update timeout value on connection reset
Cc: stable@vger.kernel.org # 5.9.x
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-27 15:16:40 -04:00
Chuck Lever 5623ecfcbe SUNRPC: Fail quickly when server does not recognize TLS
rpcauth_checkverf() should return a distinct error code when a
server recognizes the AUTH_TLS probe but does not support TLS so
that the client's header decoder can respond appropriately and
quickly. No retries are necessary is in this case, since the server
has already affirmatively answered "TLS is unsupported".

Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-27 15:16:40 -04:00
Herbert Xu 7b5fad7cff SUNRPC: Do not include crypto/algapi.h
The header file crypto/algapi.h is for internal use only.  Use the
header file crypto/utils.h instead.

Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-09-15 18:30:43 +08:00
Olga Kornievskaia 806a3bc421 NFSv4.1: fix pnfs MDS=DS session trunking
Currently, when GETDEVICEINFO returns multiple locations where each
is a different IP but the server's identity is same as MDS, then
nfs4_set_ds_client() finds the existing nfs_client structure which
has the MDS's max_connect value (and if it's 1), then the 1st IP
on the DS's list will get dropped due to MDS trunking rules. Other
IPs would be added as they fall under the pnfs trunking rules.

For the list of IPs the 1st goes thru calling nfs4_set_ds_client()
which will eventually call nfs4_add_trunk() and call into
rpc_clnt_test_and_add_xprt() which has the check for MDS trunking.
The other IPs (after the 1st one), would call rpc_clnt_add_xprt()
which doesn't go thru that check.

nfs4_add_trunk() is called when MDS trunking is happening and it
needs to enforce the usage of max_connect mount option of the
1st mount. However, this shouldn't be applied to pnfs flow.

Instead, this patch proposed to treat MDS=DS as DS trunking and
make sure that MDS's max_connect limit does not apply to the
1st IP returned in the GETDEVICEINFO list. It does so by
marking the newly created client with a new flag NFS_CS_PNFS
which then used to pass max_connect value to use into the
rpc_clnt_test_and_add_xprt() instead of the existing rpc
client's max_connect value set by the MDS connection.

For example, mount was done without max_connect value set
so MDS's rpc client has cl_max_connect=1. Upon calling into
rpc_clnt_test_and_add_xprt() and using rpc client's value,
the caller passes in max_connect value which is previously
been set in the pnfs path (as a part of handling
GETDEVICEINFO list of IPs) in nfs4_set_ds_client().

However, when NFS_CS_PNFS flag is not set and we know we
are doing MDS trunking, comparing a new IP of the same
server, we then set the max_connect value to the
existing MDS's value and pass that into
rpc_clnt_test_and_add_xprt().

Fixes: dc48e0abee ("SUNRPC enforce creation of no more than max_connect xprts")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13 11:51:11 -04:00
Trond Myklebust e86fcf0820 Revert "SUNRPC: Fail faster on bad verifier"
This reverts commit 0701214cd6.

The premise of this commit was incorrect. There are exactly 2 cases
where rpcauth_checkverf() will return an error:

1) If there was an XDR decode problem (i.e. garbage data).
2) If gss_validate() had a problem verifying the RPCSEC_GSS MIC.

In the second case, there are again 2 subcases:

a) The GSS context expires, in which case gss_validate() will force a
   new context negotiation on retry by invalidating the cred.
b) The sequence number check failed because an RPC call timed out, and
   the client retransmitted the request using a new sequence number,
   as required by RFC2203.

In neither subcase is this a fatal error.

Reported-by: Russell Cattelan <cattelan@thebarn.com>
Fixes: 0701214cd6 ("SUNRPC: Fail faster on bad verifier")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13 11:51:11 -04:00
Trond Myklebust 611fa42dfa SUNRPC: Mark the cred for revalidation if the server rejects it
If the server rejects the credential as being stale, or bad, then we
should mark it for revalidation before retransmitting.

Fixes: 7f5667a5f8 ("SUNRPC: Clean up rpc_verify_header()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13 11:51:11 -04:00
Linus Torvalds 99d99825fc NFS CLient Updates for Linux 6.6
New Features:
   * Enable the NFS v4.2 READ_PLUS operation by default
 
 Stable Fixes:
   * NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info
   * NFS: Fix a potential data corruption
 
 Bugfixes:
   * Fix various READ_PLUS issues including:
     * smatch warnings
     * xdr size calculations
     * scratch buffer handling
     * 32bit / highmem xdr page handling
   * Fix checkpatch errors in file.c
   * Fix redundant readdir request after an EOF
   * Fix handling of COPY ERR_OFFLOAD_NO_REQ
   * Fix assignment of xprtdata.cred
 
 Cleanups:
   * Remove unused xprtrdma function declarations
   * Clean up an integer overflow check to avoid a warning
   * Clean up #includes in dns_resolve.c
   * Clean up nfs4_get_device_info so we don't pass a NULL pointer to __free_page()
   * Clean up sunrpc TCP socket timeout configuration
   * Guard against READDIR loops when entry names are too long
   * Use EXCHID4_FLAG_USE_PNFS_DS for DS servers
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmTwzwYACgkQ18tUv7Cl
 QOtIhBAA+BOh7MB6yjlyctFxABJiXz2x2Dehxy7Ox15LfnyStqQAUEpk35CXWvjC
 iNxpZJ486+WrzM76WGEaRbECK9nTQLK1yacR3V1zpnDwHWIJA6VHN6qU4JAfSMu7
 XbhWkHWry6d7PXhvqHlaiYvPX2pF39wUzfH+vLlzS2QLIkpT6LnG0zVRJTQvLCmq
 zE5xD+NCQ1Dpo9VnouuzW7VVfm532hI7GQNrpo0E0vWKgeQD+/fOpDu23MW8A1Ua
 ZgVMAc7vScgDZH8/20Ze5PH4jAEB4gwEIzjreQlIXr7Tf+mE7qn435lgOuvdMQCW
 eHhdNriZ2X6HMLhNFFpup8bkRKGCCTooTHC1W66n9CuxIAuVT5DNwBbakpagHSZf
 J4ho81hEgBfc5zppISVjV6eFK4brM0rF9AliaIw8r/qGcMmO1CILi9tLGiheiJcT
 LuId7U2sE/vfIa6SiBt7rx37/MkrgLlAgjpk4dCRJQW+gKVBi09sMGnDlgaRvCZz
 T0WCsK4DgI9q2rScpwJYJbNWbC2Q8qUtYWW9LSRvwhbNdm/VbRnEHWA7eOwqqm8r
 KkkF4chyoTJqpnF3SjxT/lyFk6GwsD+wXafOmEeuFA6Si3dHDU9i3aUf+cCXhwRI
 uUzCUHYcCKnv4QVGPuAbIdxMgueNCuLoeWgTClVlqidv7GRyz7Y=
 =rjmq
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "New Features:
   - Enable the NFS v4.2 READ_PLUS operation by default

  Stable Fixes:
   - NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info
   - NFS: Fix a potential data corruption

  Bugfixes:
   - Fix various READ_PLUS issues including:
      - smatch warnings
      - xdr size calculations
      - scratch buffer handling
      - 32bit / highmem xdr page handling
   - Fix checkpatch errors in file.c
   - Fix redundant readdir request after an EOF
   - Fix handling of COPY ERR_OFFLOAD_NO_REQ
   - Fix assignment of xprtdata.cred

  Cleanups:
   - Remove unused xprtrdma function declarations
   - Clean up an integer overflow check to avoid a warning
   - Clean up #includes in dns_resolve.c
   - Clean up nfs4_get_device_info so we don't pass a NULL pointer
     to __free_page()
   - Clean up sunrpc TCP socket timeout configuration
   - Guard against READDIR loops when entry names are too long
   - Use EXCHID4_FLAG_USE_PNFS_DS for DS servers"

* tag 'nfs-for-6.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (22 commits)
  pNFS: Fix assignment of xprtdata.cred
  NFSv4.2: fix handling of COPY ERR_OFFLOAD_NO_REQ
  NFS: Guard against READDIR loop when entry names exceed MAXNAMELEN
  NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server
  NFS/pNFS: Set the connect timeout for the pNFS flexfiles driver
  SUNRPC: Don't override connect timeouts in rpc_clnt_add_xprt()
  SUNRPC: Allow specification of TCP client connect timeout at setup
  SUNRPC: Refactor and simplify connect timeout
  SUNRPC: Set the TCP_SYNCNT to match the socket timeout
  NFS: Fix a potential data corruption
  nfs: fix redundant readdir request after get eof
  nfs/blocklayout: Use the passed in gfp flags
  filemap: Fix errors in file.c
  NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info
  NFS: Move common includes outside ifdef
  SUNRPC: clean up integer overflow check
  xprtrdma: Remove unused function declaration rpcrdma_bc_post_recv()
  NFS: Enable the READ_PLUS operation by default
  SUNRPC: kmap() the xdr pages during decode
  NFSv4.2: Rework scratch handling for READ_PLUS (again)
  ...
2023-08-31 15:36:41 -07:00
Linus Torvalds f35d170615 NFSD 6.6 Release Notes
I'm thrilled to announce that the Linux in-kernel NFS server now
 offers NFSv4 write delegations. A write delegation enables a client
 to cache data and metadata for a single file more aggressively,
 reducing network round trips and server workload. Many thanks to Dai
 Ngo for contributing this facility, and to Jeff Layton and Neil
 Brown for reviewing and testing it.
 
 This release also sees the removal of all support for DES- and
 triple-DES-based Kerberos encryption types in the kernel's SunRPC
 implementation. These encryption types have been deprecated by the
 Internet community for years and are considered insecure. This
 change affects both the in-kernel NFS client and server.
 
 The server's UDP and TCP socket transports have now fully adopted
 David Howells' new bio_vec iterator so that no more than one
 sendmsg() call is needed to transmit each RPC message. In
 particular, this helps kTLS optimize record boundaries when sending
 RPC-with-TLS replies, and it takes the server a baby step closer to
 handling file I/O via folios.
 
 We've begun work on overhauling the SunRPC thread scheduler to
 remove a costly linked-list walk when looking for an idle RPC
 service thread to wake. The pre-requisites are included in this
 release. Thanks to Neil Brown for his ongoing work on this
 improvement.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmTwoa0ACgkQM2qzM29m
 f5cZvw/8CmFVNC27aMrJEhRRhwwrXbLzUkWh9GCYkG98PHYiLxLTvZ6qELXAax/a
 UjSgIDSRcWl4z8M/tyBQtgsw7NADr+7XWqEoXR7HZ5pEEC/KNGM0oQWQ92ojjKYy
 JmHdB02uaDJfcd9ioFTU13cw7q2BQfoe2xLI8yqis2vcVSu92AM7aIw+cvJIpwQB
 inA3TIIsYTV/gPByXSfEtvmYACadoFiMvfvYwaWhjFS9MdSzFmcVG0Dp3EFIig29
 odmWEofcz6uIvUWvUswWEGdoSu7uOKIztSuAI4PlTwaofUaSKG6e5kmtpr3cLERD
 Uhg2lm5JgqkXBd7QHObNimJ4DtQzFwHmhA08qo8rd/zba75mn/Hr5IF0q3Rxs99J
 SRYHcAeP8afKn5Ge0yzoTgCNcqhfz8KLRfoCQX49mljr+muNxld4nMklD2KdUwJi
 XEB512/q3E4nUgopXZiSJYQYAq1CfdR5WpGipZ9X0XK9HZBDF/qhXGtk1YQuNWyj
 ZxJS3bfBza4oVIvP5/ehjCIQwOvqkcrC5zZGDIgDvw9Q6L3L1wqmVntsdCLCLRcJ
 jB4IOsj+DECfJ6w2vP2SZ3GeMtFnyuTQjhUTkjPuAKTBBiKo4Tj0o/agpfDYbWZy
 1l3a2yH5jqJgkm4MaVh3YHRJGc0ub0ccpIrs3QQ4jvjMLQ/3Gcs=
 =XGHs
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "I'm thrilled to announce that the Linux in-kernel NFS server now
  offers NFSv4 write delegations. A write delegation enables a client to
  cache data and metadata for a single file more aggressively, reducing
  network round trips and server workload. Many thanks to Dai Ngo for
  contributing this facility, and to Jeff Layton and Neil Brown for
  reviewing and testing it.

  This release also sees the removal of all support for DES- and
  triple-DES-based Kerberos encryption types in the kernel's SunRPC
  implementation. These encryption types have been deprecated by the
  Internet community for years and are considered insecure. This change
  affects both the in-kernel NFS client and server.

  The server's UDP and TCP socket transports have now fully adopted
  David Howells' new bio_vec iterator so that no more than one sendmsg()
  call is needed to transmit each RPC message. In particular, this helps
  kTLS optimize record boundaries when sending RPC-with-TLS replies, and
  it takes the server a baby step closer to handling file I/O via
  folios.

  We've begun work on overhauling the SunRPC thread scheduler to remove
  a costly linked-list walk when looking for an idle RPC service thread
  to wake. The pre-requisites are included in this release. Thanks to
  Neil Brown for his ongoing work on this improvement"

* tag 'nfsd-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (56 commits)
  Documentation: Add missing documentation for EXPORT_OP flags
  SUNRPC: Remove unused declaration rpc_modcount()
  SUNRPC: Remove unused declarations
  NFSD: da_addr_body field missing in some GETDEVICEINFO replies
  SUNRPC: Remove return value of svc_pool_wake_idle_thread()
  SUNRPC: make rqst_should_sleep() idempotent()
  SUNRPC: Clean up svc_set_num_threads
  SUNRPC: Count ingress RPC messages per svc_pool
  SUNRPC: Deduplicate thread wake-up code
  SUNRPC: Move trace_svc_xprt_enqueue
  SUNRPC: Add enum svc_auth_status
  SUNRPC: change svc_xprt::xpt_flags bits to enum
  SUNRPC: change svc_rqst::rq_flags bits to enum
  SUNRPC: change svc_pool::sp_flags bits to enum
  SUNRPC: change cache_head.flags bits to enum
  SUNRPC: remove timeout arg from svc_recv()
  SUNRPC: change svc_recv() to return void.
  SUNRPC: call svc_process() from svc_recv().
  nfsd: separate nfsd_last_thread() from nfsd_put()
  nfsd: Simplify code around svc_exit_thread() call in nfsd()
  ...
2023-08-31 15:32:18 -07:00
NeilBrown 2a4557452a SUNRPC: Remove return value of svc_pool_wake_idle_thread()
The returned value is not used (any more), so don't return it.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
NeilBrown 6859d1f290 SUNRPC: make rqst_should_sleep() idempotent()
Based on its name you would think that rqst_should_sleep() would be
read-only, not changing anything.  But in fact it will clear
SP_TASK_PENDING if that was set.  This is surprising, and it blurs the
line between "check for work to do" and "dequeue work to do".

So change the "test_and_clear" to simple "test" and clear the bit once
the thread has decided to wake up and return to the caller.

With this, it makes sense to *always* set SP_TASK_PENDING when asked,
rather than to set it only if no thread could be woken up.

[ cel: Previously TASK_PENDING indicated there is work waiting but no
idle threads were found to pick up that work. After this patch, it acts
as an XPT_BUSY flag for wake-ups that have no associated xprt. ]

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever d2f0ef1cbf SUNRPC: Clean up svc_set_num_threads
Document the API contract and remove stale or obvious comments.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever f208e9508a SUNRPC: Count ingress RPC messages per svc_pool
svc_xprt_enqueue() can be costly, since it involves selecting and
waking up a process.

More than one enqueue is done per incoming RPC. For example,
svc_data_ready() enqueues, and so does svc_xprt_receive(). Also, if
an RPC message requires more than one call to ->recvfrom() to
receive it fully, each one of those calls does an enqueue.

To get a sense of the average number of transport enqueue operations
needed to process an incoming RPC message, re-use the "packets" pool
stat. Track the number of complete RPC messages processed by each
thread pool.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever 850bac3ae4 SUNRPC: Deduplicate thread wake-up code
Refactor: Extract the loop that finds an idle service thread from
svc_xprt_enqueue() and svc_wake_up(). Both functions do just about
the same thing.

Note that svc_wake_up() currently does not hold the RCU read lock
while waking the target thread. It indeed should hold the lock, just
as svc_xprt_enqueue() does, to ensure the rqstp does not vanish
during the wake-up. This patch adds the RCU lock for svc_wake_up().

Note that shrinking the pool thread count is rare, and calls to
svc_wake_up() are also quite infrequent. In practice, this race is
very unlikely to be hit, so we are not marking the lock fix for
stable backport at this time.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever 82e5d82a45 SUNRPC: Move trace_svc_xprt_enqueue
The xpt_flags field frequently changes between the time that
svc_xprt_ready() grabs a copy and execution flow arrives at the
tracepoint at the tail of svc_xprt_enqueue(). In fact, there's
usually a sleep/wake-up in there, so those flags are almost
guaranteed to be different.

It would be more useful to record the exact flags that were used to
decide whether the transport is ready, so move the tracepoint.

Moving it means the tracepoint can't pick up the waker's pid. That
can be added to struct svc_rqst if it turns out that is important.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever 78c542f916 SUNRPC: Add enum svc_auth_status
In addition to the benefits of using an enum rather than a set of
macros, we now have a named type that can improve static type
checking of function return values.

As part of this change, I removed a stale comment from svcauth.h;
the return values from current implementations of the
auth_ops::release method are all zero/negative errno, not the SVC_OK
enum values as the old comment suggested.

Suggested-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
NeilBrown c743b4259c SUNRPC: remove timeout arg from svc_recv()
Most svc threads have no interest in a timeout.
nfsd sets it to 1 hour, but this is a wart of no significance.

lockd uses the timeout so that it can call nlmsvc_retry_blocked().
It also sometimes calls svc_wake_up() to ensure this is called.

So change lockd to be consistent and always use svc_wake_up() to trigger
nlmsvc_retry_blocked() - using a timer instead of a timeout to
svc_recv().

And change svc_recv() to not take a timeout arg.

This makes the sp_threads_timedout counter always zero.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
NeilBrown 7b719e2bf3 SUNRPC: change svc_recv() to return void.
svc_recv() currently returns a 0 on success or one of two errors:
 - -EAGAIN means no message was successfully received
 - -EINTR means the thread has been told to stop

Previously nfsd would stop as the result of a signal as well as
following kthread_stop().  In that case the difference was useful: EINTR
means stop unconditionally.  EAGAIN means stop if kthread_should_stop(),
continue otherwise.

Now threads only exit when kthread_should_stop() so we don't need the
distinction.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
NeilBrown f78116d3bf SUNRPC: call svc_process() from svc_recv().
All callers of svc_recv() go on to call svc_process() on success.
Simplify callers by having svc_recv() do that for them.

This loses one call to validate_process_creds() in nfsd.  That was
debugging code added 14 years ago.  I don't think we need to keep it.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
NeilBrown 3903902401 nfsd: don't allow nfsd threads to be signalled.
The original implementation of nfsd used signals to stop threads during
shutdown.
In Linux 2.3.46pre5 nfsd gained the ability to shutdown threads
internally it if was asked to run "0" threads.  After this user-space
transitioned to using "rpc.nfsd 0" to stop nfsd and sending signals to
threads was no longer an important part of the API.

In commit 3ebdbe5203 ("SUNRPC: discard svo_setup and rename
svc_set_num_threads_sync()") (v5.17-rc1~75^2~41) we finally removed the
use of signals for stopping threads, using kthread_stop() instead.

This patch makes the "obvious" next step and removes the ability to
signal nfsd threads - or any svc threads.  nfsd stops allowing signals
and we don't check for their delivery any more.

This will allow for some simplification in later patches.

A change worth noting is in nfsd4_ssc_setup_dul().  There was previously
a signal_pending() check which would only succeed when the thread was
being shut down.  It should really have tested kthread_should_stop() as
well.  Now it just does the latter, not the former.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever 2b877fc53e SUNRPC: Reduce thread wake-up rate when receiving large RPC messages
With large NFS WRITE requests on TCP, I measured 5-10 thread wake-
ups to receive each request. This is because the socket layer
calls ->sk_data_ready() frequently, and each call triggers a
thread wake-up. Each recvmsg() seems to pull in less than 100KB.

Have the socket layer hold ->sk_data_ready() calls until the full
incoming message has arrived to reduce the wake-up rate.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever 89d2d9fbea SUNRPC: Revert e0a912e8dd
Flamegraph analysis showed that the cork/uncork calls consume
nearly a third of the CPU time spent in svc_tcp_sendto(). The
other two consumers are mutex lock/unlock and svc_tcp_sendmsg().

Now that svc_tcp_sendto() coalesces RPC messages properly, there
is no need to introduce artificial delays to prevent sending
partial messages.

After applying this change, I measured a 1.2K read IOPS increase
for 8KB random I/O (several percent) on 56Gb IP over IB.

Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever baabf59c24 SUNRPC: Convert svc_udp_sendto() to use the per-socket bio_vec array
Commit da1661b93b ("SUNRPC: Teach server to use xprt_sock_sendmsg
for socket sends") modified svc_udp_sendto() to use xprt_sock_sendmsg()
because we originally believed xprt_sock_sendmsg() would be needed
for TLS support. That does not actually appear to be the case.

In addition, the linkage between the client and server send code has
been a bit of a maintenance headache because of the distinct ways
that the client and server handle memory allocation.

Going forward, eventually the XDR layer will deal with its buffers
in the form of bio_vec arrays, so convert this function accordingly.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever e18e157bb5 SUNRPC: Send RPC message on TCP with a single sock_sendmsg() call
There is now enough infrastructure in place to combine the stream
record marker into the biovec array used to send each outgoing RPC
message on TCP. The whole message can be more efficiently sent with
a single call to sock_sendmsg() using a bio_vec iterator.

Note that this also helps with RPC-with-TLS: the TLS implementation
can now clearly see where the upper layer message boundaries are.
Before, it would send each component of the xdr_buf (record marker,
head, page payload, tail) in separate TLS records.

Suggested-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever 2eb2b93581 SUNRPC: Convert svc_tcp_sendmsg to use bio_vecs directly
Add a helper to convert a whole xdr_buf directly into an array of
bio_vecs, then send this array instead of iterating piecemeal over
the xdr_buf containing the outbound RPC message.

Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever 2a9893f796 SUNRPC: Remove net/sunrpc/auth_gss/gss_krb5_seqnum.c
These functions are no longer used.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever cfb6b328c4 SUNRPC: Remove the ->import_ctx method
All supported encryption types now use the same context import
function.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever 6c922ea711 SUNRPC: Remove CONFIG_RPCSEC_GSS_KRB5_CRYPTOSYSTEM
This code is now always on, so the ifdef can be removed.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever da33d635bb SUNRPC: Remove gss_import_v1_context()
We no longer support importing v1 contexts.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever 6964629f4c SUNRPC: Remove krb5_derive_key_v1()
This function is no longer used.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever ec596aaf9b SUNRPC: Remove code behind CONFIG_RPCSEC_GSS_KRB5_SIMPLIFIED
None of this code can be enabled any more.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever 2024b89d90 SUNRPC: Remove DES and DES3 enctypes from the supported enctypes list
These enctypes can no longer be enabled via CONFIG.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever 649879561d SUNRPC: Remove Kunit tests for the DES3 encryption type
The DES3 encryption type is no longer implemented.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever 788849b64d SUNRPC: Remove RPCSEC_GSS_KRB5_ENCTYPES_DES
Make it impossible to enable support for the DES or DES3 Kerberos
encryption types in SunRPC. These enctypes were deprecated by RFCs
6649 and 8429 because they are known to be insecure.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever f16ff1cafb SUNRPC: Fix the recent bv_offset fix
Jeff confirmed his original fix addressed his pynfs test failure,
but this same bug also impacted qemu: accessing qcow2 virtual disks
using direct I/O was failing. Jeff's fix missed that you have to
shorten the bio_vec element by the same amount as you increased
the page offset.

Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Fixes: c96e2a695e ("sunrpc: set the bv_offset of first bvec in svc_tcp_sendmsg")
Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:44:12 -04:00
Linus Torvalds bd6c11bc43 Networking changes for 6.6.
Core
 ----
 
  - Increase size limits for to-be-sent skb frag allocations. This
    allows tun, tap devices and packet sockets to better cope with large
    writes operations.
 
  - Store netdevs in an xarray, to simplify iterating over netdevs.
 
  - Refactor nexthop selection for multipath routes.
 
  - Improve sched class lifetime handling.
 
  - Add backup nexthop ID support for bridge.
 
  - Implement drop reasons support in openvswitch.
 
  - Several data races annotations and fixes.
 
  - Constify the sk parameter of routing functions.
 
  - Prepend kernel version to netconsole message.
 
 Protocols
 ---------
 
  - Implement support for TCP probing the peer being under memory
    pressure.
 
  - Remove hard coded limitation on IPv6 specific info placement
    inside the socket struct.
 
  - Get rid of sysctl_tcp_adv_win_scale and use an auto-estimated
    per socket scaling factor.
 
  - Scaling-up the IPv6 expired route GC via a separated list of
    expiring routes.
 
  - In-kernel support for the TLS alert protocol.
 
  - Better support for UDP reuseport with connected sockets.
 
  - Add NEXT-C-SID support for SRv6 End.X behavior, reducing the SR
    header size.
 
  - Get rid of additional ancillary per MPTCP connection struct socket.
 
  - Implement support for BPF-based MPTCP packet schedulers.
 
  - Format MPTCP subtests selftests results in TAP.
 
  - Several new SMC 2.1 features including unique experimental options,
    max connections per lgr negotiation, max links per lgr negotiation.
 
 BPF
 ---
 
  - Multi-buffer support in AF_XDP.
 
  - Add multi uprobe BPF links for attaching multiple uprobes
    and usdt probes, which is significantly faster and saves extra fds.
 
  - Implement an fd-based tc BPF attach API (TCX) and BPF link support on
    top of it.
 
  - Add SO_REUSEPORT support for TC bpf_sk_assign.
 
  - Support new instructions from cpu v4 to simplify the generated code and
    feature completeness, for x86, arm64, riscv64.
 
  - Support defragmenting IPv(4|6) packets in BPF.
 
  - Teach verifier actual bounds of bpf_get_smp_processor_id()
    and fix perf+libbpf issue related to custom section handling.
 
  - Introduce bpf map element count and enable it for all program types.
 
  - Add a BPF hook in sys_socket() to change the protocol ID
    from IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy.
 
  - Introduce bpf_me_mcache_free_rcu() and fix OOM under stress.
 
  - Add uprobe support for the bpf_get_func_ip helper.
 
  - Check skb ownership against full socket.
 
  - Support for up to 12 arguments in BPF trampoline.
 
  - Extend link_info for kprobe_multi and perf_event links.
 
 Netfilter
 ---------
 
  - Speed-up process exit by aborting ruleset validation if a
    fatal signal is pending.
 
  - Allow NLA_POLICY_MASK to be used with BE16/BE32 types.
 
 Driver API
 ----------
 
  - Page pool optimizations, to improve data locality and cache usage.
 
  - Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set() to avoid the need
    for raw ioctl() handling in drivers.
 
  - Simplify genetlink dump operations (doit/dumpit) providing them
    the common information already populated in struct genl_info.
 
  - Extend and use the yaml devlink specs to [re]generate the split ops.
 
  - Introduce devlink selective dumps, to allow SF filtering SF based on
    handle and other attributes.
 
  - Add yaml netlink spec for netlink-raw families, allow route, link and
    address related queries via the ynl tool.
 
  - Remove phylink legacy mode support.
 
  - Support offload LED blinking to phy.
 
  - Add devlink port function attributes for IPsec.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Broadcom ASP 2.0 (72165) ethernet controller
    - MediaTek MT7988 SoC
    - Texas Instruments AM654 SoC
    - Texas Instruments IEP driver
    - Atheros qca8081 phy
    - Marvell 88Q2110 phy
    - NXP TJA1120 phy
 
  - WiFi:
    - MediaTek mt7981 support
 
  - Can:
    - Kvaser SmartFusion2 PCI Express devices
    - Allwinner T113 controllers
    - Texas Instruments tcan4552/4553 chips
 
  - Bluetooth:
    - Intel Gale Peak
    - Qualcomm WCN3988 and WCN7850
    - NXP AW693 and IW624
    - Mediatek MT2925
 
 Drivers
 -------
 
  - Ethernet NICs:
    - nVidia/Mellanox:
      - mlx5:
        - support UDP encapsulation in packet offload mode
        - IPsec packet offload support in eswitch mode
        - improve aRFS observability by adding new set of counters
        - extends MACsec offload support to cover RoCE traffic
        - dynamic completion EQs
      - mlx4:
        - convert to use auxiliary bus instead of custom interface logic
    - Intel
      - ice:
        - implement switchdev bridge offload, even for LAG interfaces
        - implement SRIOV support for LAG interfaces
      - igc:
        - add support for multiple in-flight TX timestamps
    - Broadcom:
      - bnxt:
        - use the unified RX page pool buffers for XDP and non-XDP
        - use the NAPI skb allocation cache
    - OcteonTX2:
      - support Round Robin scheduling HTB offload
      - TC flower offload support for SPI field
    - Freescale:
      -  add XDP_TX feature support
    - AMD:
      - ionic: add support for PCI FLR event
      - sfc:
        - basic conntrack offload
        - introduce eth, ipv4 and ipv6 pedit offloads
    - ST Microelectronics:
      - stmmac: maximze PTP timestamping resolution
 
  - Virtual NICs:
    - Microsoft vNIC:
      - batch ringing RX queue doorbell on receiving packets
      - add page pool for RX buffers
    - Virtio vNIC:
      - add per queue interrupt coalescing support
    - Google vNIC:
      - add queue-page-list mode support
 
  - Ethernet high-speed switches:
    - nVidia/Mellanox (mlxsw):
      - add port range matching tc-flower offload
      - permit enslavement to netdevices with uppers
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - convert to phylink_pcs
    - Renesas:
      - r8A779fx: add speed change support
      - rzn1: enables vlan support
 
  - Ethernet PHYs:
    - convert mv88e6xxx to phylink_pcs
 
  - WiFi:
    - Qualcomm Wi-Fi 7 (ath12k):
      - extremely High Throughput (EHT) PHY support
    - RealTek (rtl8xxxu):
      - enable AP mode for: RTL8192FU, RTL8710BU (RTL8188GU),
        RTL8192EU and RTL8723BU
    - RealTek (rtw89):
      - Introduce Time Averaged SAR (TAS) support
 
  - Connector:
    - support for event filtering
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmTt1ZoSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkgFUP/REFaYWdWUvAzmWeezyx9dqgZMfSOjWq
 9QvySiA94OAOcjIYkb7wfzQ5BBAZqaBQ/f8XqWwS1EDDDEBs8sP1cxmABKwW7Hsr
 qFRu2sOqLzKBk223d0jIgEocfQaFpGbF71gXoTlDivBjBi5UxWm9bF0XnbYWcKgO
 /QEvzNosi9uNdi85Fzmv62J6YzAdidEpwGsM7X2CfejwNRmStxAEg/NwvRR0Hyiq
 OJCo97omEgTRaUle8nc64PDx33u4h5kQ1BkaeHEv0rbE3hftFC2YPKn/InmqSFGz
 6ew2xnrGPR37LCuAiCcIIv6yR7K0eu0iYJ7jXwZxBDqxGavEPuwWGBoCP6qFiitH
 ZLWhIrAUrdmSbySkTOCONhJ475qFAuQoYHYpZnX/bJZUHlSsb/9lwDJYJQGpVfd1
 /daqJVSb7lhaifmNO1iNd/ibCIXq9zapwtkRwA897M8GkZBTsnVvazFld1Em+Se3
 Bx6DSDUVBqVQ9fpZG2IAGD6odDwOzC1lF2IoceFvK9Ff6oE0psI+A0qNLMkHxZbW
 Qlo7LsNe53hpoCC+yHTfXX7e/X8eNt0EnCGOQJDusZ0Nr3K7H4LKFA0i8UBUK05n
 4lKnnaSQW7GQgdofLWt103OMDR9GoDxpFsm7b1X9+AEk6Fz6tq50wWYeMZETUKYP
 DCW8VGFOZjZM
 =9CsR
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Increase size limits for to-be-sent skb frag allocations. This
     allows tun, tap devices and packet sockets to better cope with
     large writes operations

   - Store netdevs in an xarray, to simplify iterating over netdevs

   - Refactor nexthop selection for multipath routes

   - Improve sched class lifetime handling

   - Add backup nexthop ID support for bridge

   - Implement drop reasons support in openvswitch

   - Several data races annotations and fixes

   - Constify the sk parameter of routing functions

   - Prepend kernel version to netconsole message

  Protocols:

   - Implement support for TCP probing the peer being under memory
     pressure

   - Remove hard coded limitation on IPv6 specific info placement inside
     the socket struct

   - Get rid of sysctl_tcp_adv_win_scale and use an auto-estimated per
     socket scaling factor

   - Scaling-up the IPv6 expired route GC via a separated list of
     expiring routes

   - In-kernel support for the TLS alert protocol

   - Better support for UDP reuseport with connected sockets

   - Add NEXT-C-SID support for SRv6 End.X behavior, reducing the SR
     header size

   - Get rid of additional ancillary per MPTCP connection struct socket

   - Implement support for BPF-based MPTCP packet schedulers

   - Format MPTCP subtests selftests results in TAP

   - Several new SMC 2.1 features including unique experimental options,
     max connections per lgr negotiation, max links per lgr negotiation

  BPF:

   - Multi-buffer support in AF_XDP

   - Add multi uprobe BPF links for attaching multiple uprobes and usdt
     probes, which is significantly faster and saves extra fds

   - Implement an fd-based tc BPF attach API (TCX) and BPF link support
     on top of it

   - Add SO_REUSEPORT support for TC bpf_sk_assign

   - Support new instructions from cpu v4 to simplify the generated code
     and feature completeness, for x86, arm64, riscv64

   - Support defragmenting IPv(4|6) packets in BPF

   - Teach verifier actual bounds of bpf_get_smp_processor_id() and fix
     perf+libbpf issue related to custom section handling

   - Introduce bpf map element count and enable it for all program types

   - Add a BPF hook in sys_socket() to change the protocol ID from
     IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy

   - Introduce bpf_me_mcache_free_rcu() and fix OOM under stress

   - Add uprobe support for the bpf_get_func_ip helper

   - Check skb ownership against full socket

   - Support for up to 12 arguments in BPF trampoline

   - Extend link_info for kprobe_multi and perf_event links

  Netfilter:

   - Speed-up process exit by aborting ruleset validation if a fatal
     signal is pending

   - Allow NLA_POLICY_MASK to be used with BE16/BE32 types

  Driver API:

   - Page pool optimizations, to improve data locality and cache usage

   - Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set() to avoid the
     need for raw ioctl() handling in drivers

   - Simplify genetlink dump operations (doit/dumpit) providing them the
     common information already populated in struct genl_info

   - Extend and use the yaml devlink specs to [re]generate the split ops

   - Introduce devlink selective dumps, to allow SF filtering SF based
     on handle and other attributes

   - Add yaml netlink spec for netlink-raw families, allow route, link
     and address related queries via the ynl tool

   - Remove phylink legacy mode support

   - Support offload LED blinking to phy

   - Add devlink port function attributes for IPsec

  New hardware / drivers:

   - Ethernet:
      - Broadcom ASP 2.0 (72165) ethernet controller
      - MediaTek MT7988 SoC
      - Texas Instruments AM654 SoC
      - Texas Instruments IEP driver
      - Atheros qca8081 phy
      - Marvell 88Q2110 phy
      - NXP TJA1120 phy

   - WiFi:
      - MediaTek mt7981 support

   - Can:
      - Kvaser SmartFusion2 PCI Express devices
      - Allwinner T113 controllers
      - Texas Instruments tcan4552/4553 chips

   - Bluetooth:
      - Intel Gale Peak
      - Qualcomm WCN3988 and WCN7850
      - NXP AW693 and IW624
      - Mediatek MT2925

  Drivers:

   - Ethernet NICs:
      - nVidia/Mellanox:
         - mlx5:
            - support UDP encapsulation in packet offload mode
            - IPsec packet offload support in eswitch mode
            - improve aRFS observability by adding new set of counters
            - extends MACsec offload support to cover RoCE traffic
            - dynamic completion EQs
         - mlx4:
            - convert to use auxiliary bus instead of custom interface
              logic
      - Intel
         - ice:
            - implement switchdev bridge offload, even for LAG
              interfaces
            - implement SRIOV support for LAG interfaces
         - igc:
            - add support for multiple in-flight TX timestamps
      - Broadcom:
         - bnxt:
            - use the unified RX page pool buffers for XDP and non-XDP
            - use the NAPI skb allocation cache
      - OcteonTX2:
         - support Round Robin scheduling HTB offload
         - TC flower offload support for SPI field
      - Freescale:
         - add XDP_TX feature support
      - AMD:
         - ionic: add support for PCI FLR event
         - sfc:
            - basic conntrack offload
            - introduce eth, ipv4 and ipv6 pedit offloads
      - ST Microelectronics:
         - stmmac: maximze PTP timestamping resolution

   - Virtual NICs:
      - Microsoft vNIC:
         - batch ringing RX queue doorbell on receiving packets
         - add page pool for RX buffers
      - Virtio vNIC:
         - add per queue interrupt coalescing support
      - Google vNIC:
         - add queue-page-list mode support

   - Ethernet high-speed switches:
      - nVidia/Mellanox (mlxsw):
         - add port range matching tc-flower offload
         - permit enslavement to netdevices with uppers

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - convert to phylink_pcs
      - Renesas:
         - r8A779fx: add speed change support
         - rzn1: enables vlan support

   - Ethernet PHYs:
      - convert mv88e6xxx to phylink_pcs

   - WiFi:
      - Qualcomm Wi-Fi 7 (ath12k):
         - extremely High Throughput (EHT) PHY support
      - RealTek (rtl8xxxu):
         - enable AP mode for: RTL8192FU, RTL8710BU (RTL8188GU),
           RTL8192EU and RTL8723BU
      - RealTek (rtw89):
         - Introduce Time Averaged SAR (TAS) support

   - Connector:
      - support for event filtering"

* tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1806 commits)
  net: ethernet: mtk_wed: minor change in wed_{tx,rx}info_show
  net: ethernet: mtk_wed: add some more info in wed_txinfo_show handler
  net: stmmac: clarify difference between "interface" and "phy_interface"
  r8152: add vendor/device ID pair for D-Link DUB-E250
  devlink: move devlink_notify_register/unregister() to dev.c
  devlink: move small_ops definition into netlink.c
  devlink: move tracepoint definitions into core.c
  devlink: push linecard related code into separate file
  devlink: push rate related code into separate file
  devlink: push trap related code into separate file
  devlink: use tracepoint_enabled() helper
  devlink: push region related code into separate file
  devlink: push param related code into separate file
  devlink: push resource related code into separate file
  devlink: push dpipe related code into separate file
  devlink: move and rename devlink_dpipe_send_and_alloc_skb() helper
  devlink: push shared buffer related code into separate file
  devlink: push port related code into separate file
  devlink: push object register/unregister notifications into separate helpers
  inet: fix IP_TRANSPARENT error handling
  ...
2023-08-29 11:33:01 -07:00
Linus Torvalds 615e95831e v6.6-vfs.ctime
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTKAAKCRCRxhvAZXjc
 oifJAQCzi/p+AdQu8LA/0XvR7fTwaq64ZDCibU4BISuLGT2kEgEAuGbuoFZa0rs2
 XYD/s4+gi64p9Z01MmXm2XO1pu3GPg0=
 =eJz5
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs timestamp updates from Christian Brauner:
 "This adds VFS support for multi-grain timestamps and converts tmpfs,
  xfs, ext4, and btrfs to use them. This carries acks from all relevant
  filesystems.

  The VFS always uses coarse-grained timestamps when updating the ctime
  and mtime after a change. This has the benefit of allowing filesystems
  to optimize away a lot of metadata updates, down to around 1 per
  jiffy, even when a file is under heavy writes.

  Unfortunately, this has always been an issue when we're exporting via
  NFSv3, which relies on timestamps to validate caches. A lot of changes
  can happen in a jiffy, so timestamps aren't sufficient to help the
  client decide to invalidate the cache.

  Even with NFSv4, a lot of exported filesystems don't properly support
  a change attribute and are subject to the same problems with timestamp
  granularity. Other applications have similar issues with timestamps
  (e.g., backup applications).

  If we were to always use fine-grained timestamps, that would improve
  the situation, but that becomes rather expensive, as the underlying
  filesystem would have to log a lot more metadata updates.

  This introduces fine-grained timestamps that are used when they are
  actively queried.

  This uses the 31st bit of the ctime tv_nsec field to indicate that
  something has queried the inode for the mtime or ctime. When this flag
  is set, on the next mtime or ctime update, the kernel will fetch a
  fine-grained timestamp instead of the usual coarse-grained one.

  As POSIX generally mandates that when the mtime changes, the ctime
  must also change the kernel always stores normalized ctime values, so
  only the first 30 bits of the tv_nsec field are ever used.

  Filesytems can opt into this behavior by setting the FS_MGTIME flag in
  the fstype. Filesystems that don't set this flag will continue to use
  coarse-grained timestamps.

  Various preparatory changes, fixes and cleanups are included:

   - Fixup all relevant places where POSIX requires updating ctime
     together with mtime. This is a wide-range of places and all
     maintainers provided necessary Acks.

   - Add new accessors for inode->i_ctime directly and change all
     callers to rely on them. Plain accesses to inode->i_ctime are now
     gone and it is accordingly rename to inode->__i_ctime and commented
     as requiring accessors.

   - Extend generic_fillattr() to pass in a request mask mirroring in a
     sense the statx() uapi. This allows callers to pass in a request
     mask to only get a subset of attributes filled in.

   - Rework timestamp updates so it's possible to drop the @now
     parameter the update_time() inode operation and associated helpers.

   - Add inode_update_timestamps() and convert all filesystems to it
     removing a bunch of open-coding"

* tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
  btrfs: convert to multigrain timestamps
  ext4: switch to multigrain timestamps
  xfs: switch to multigrain timestamps
  tmpfs: add support for multigrain timestamps
  fs: add infrastructure for multigrain timestamps
  fs: drop the timespec64 argument from update_time
  xfs: have xfs_vn_update_time gets its own timestamp
  fat: make fat_update_time get its own timestamp
  fat: remove i_version handling from fat_update_time
  ubifs: have ubifs_update_time use inode_update_timestamps
  btrfs: have it use inode_update_timestamps
  fs: drop the timespec64 arg from generic_update_time
  fs: pass the request_mask to generic_fillattr
  fs: remove silly warning from current_time
  gfs2: fix timestamp handling on quota inodes
  fs: rename i_ctime field to __i_ctime
  selinux: convert to ctime accessor functions
  security: convert to ctime accessor functions
  apparmor: convert to ctime accessor functions
  sunrpc: convert to ctime accessor functions
  ...
2023-08-28 09:31:32 -07:00
Jakub Kicinski 57ce6427e0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

include/net/inet_sock.h
  f866fbc842 ("ipv4: fix data-races around inet->inet_id")
  c274af2242 ("inet: introduce inet->inet_flags")
https://lore.kernel.org/all/679ddff6-db6e-4ff6-b177-574e90d0103d@tessares.net/

Adjacent changes:

drivers/net/bonding/bond_alb.c
  e74216b8de ("bonding: fix macvlan over alb bond support")
  f11e5bd159 ("bonding: support balance-alb with openvswitch")

drivers/net/ethernet/broadcom/bgmac.c
  d6499f0b7c ("net: bgmac: Return PTR_ERR() for fixed_phy_register()")
  23a14488ea ("net: bgmac: Fix return value check for fixed_phy_register()")

drivers/net/ethernet/broadcom/genet/bcmmii.c
  32bbe64a13 ("net: bcmgenet: Fix return value check for fixed_phy_register()")
  acf50d1adb ("net: bcmgenet: Return PTR_ERR() for fixed_phy_register()")

net/sctp/socket.c
  f866fbc842 ("ipv4: fix data-races around inet->inet_id")
  b09bde5c35 ("inet: move inet->mc_loop to inet->inet_frags")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-24 10:51:39 -07:00
Trond Myklebust cd18f24085 SUNRPC: Don't override connect timeouts in rpc_clnt_add_xprt()
If the caller specifies the connect timeouts in the arguments to
rpc_clnt_add_xprt(), then we shouldn't override them.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-24 13:24:15 -04:00
Trond Myklebust d2ee413884 SUNRPC: Allow specification of TCP client connect timeout at setup
When we create a TCP transport, the connect timeout parameters are
currently fixed to be 90s. This is problematic in the pNFS flexfiles
case, where we may have multiple mirrors, and we would like to fail over
quickly to the next mirror if a data server is down.

This patch adds the ability to specify the connection parameters at RPC
client creation time.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-24 13:24:15 -04:00
Trond Myklebust 3e6ff89d2e SUNRPC: Refactor and simplify connect timeout
Instead of requiring the requests to redrive the connection several
times, just let the TCP connect code manage it now that we've adjusted
the TCP_SYNCNT value.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-24 13:24:15 -04:00
Trond Myklebust 3a107f0740 SUNRPC: Set the TCP_SYNCNT to match the socket timeout
Set the TCP SYN count so that we abort the connection attempt at around
the expected timeout value.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-24 13:24:15 -04:00
Yue Haibing f9597ba887 xprtrdma: Remove unused function declaration rpcrdma_bc_post_recv()
rpcrdma_bc_post_recv() is never implemented since introduction in
commit f531a5dbc4 ("xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffers").

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-23 15:58:47 -04:00
Anna Schumaker 61182c796d SUNRPC: kmap() the xdr pages during decode
If the pages are in HIGHMEM then we need to make sure they're mapped
before trying to read data off of them, otherwise we could end up with a
NULL pointer dereference.

The downside to this is that we need an extra cleanup step at the end of
decode to kunmap() the last page. I introduced an xdr_finish_decode()
function to do this. Right now this function only calls the
unmap_current_page() function, but other generic cleanup steps could be
added in the future if we come across anything else.

Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-23 15:58:47 -04:00
Linus Torvalds 53663f4103 NFS client fixes for Linux 6.5
Highlights include:
 
 Stable fixes
  - NFS: Fix a use after free in nfs_direct_join_group()
 
 Bugfixes
  - NFS: Fix a sysfs server name memory leak
  - NFS: Fix a lock recovery hang in NFSv4.0
  - NFS: Fix page free in the error path for nfs42_proc_getxattr
  - NFS: Fix page free in the error path for __nfs4_get_acl_uncached
  - SUNRPC/rdma: Fix receive buffer dma-mapping after a server disconnect
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmTjgEEACgkQZwvnipYK
 APItFA//WzGcKbujlMXpiRdvUg6k6CfG/ikBRB1UwQEyZjK/tVZ96qt6UuHGNMbz
 b8GaGls7NRYJKezAcMSW9QMMPYVyG0PLwxOW6BPwsZS61Zn6HMeM1YRboaZEid7f
 JrUNhbUXHl6bVWrBNEtcr3IN/5ERU4sGCAa4A3uWdNxGyffD/avrK06/bfmE/SJi
 +7LVPp0M9rM5X5Z1c407TbWfg+L81Q9t0tTz7II3Ba9i2BzQ0uhQhyVUQAGF767u
 Vua4XWTRoqG1es+tA4iuwZ3KtaqXoaMRDWPLGTkmBrY+pAo+u4IPzY5LCwfUu6kI
 vttkZU5b0b05+UomJ1d+Muzr8uEjRmBhIHZsP6lgVVmuNzqkDb0gCGkfix87J+RO
 0QmDZ9D0ftJxsb8fSdp8iy8NqmqJ6X4FhsylRtANEuCrf8+zrkUlBJi47CCwpYDD
 8gq6SoTfA8MmiSgzrBuYkJe2HSx7c2csDl3xp5KrJX2IHODjbzlHC05fNadTWc6W
 0jQvq1cJ2xBYDNSxkG0Trsd3lTTao3rZC4M7imVVjTTOHS8X1LNCLkbZ7LVnA8rn
 0F+lp/h1qs/daXSp0aMG5wyvZNkx5rsJ23o+InNCjiCh3cDvoi9mg6DN5bQK8Foy
 Iqd2MTgxrMaF/FUbdGLdnFX4GQkgFPng8TpdX8sqqm1JHUprpqg=
 =nd41
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:

 - fix a use after free in nfs_direct_join_group() (Cc: stable)

 - fix sysfs server name memory leak

 - fix lock recovery hang in NFSv4.0

 - fix page free in the error path for nfs42_proc_getxattr() and
   __nfs4_get_acl_uncached()

 - SUNRPC/rdma: fix receive buffer dma-mapping after a server disconnect

* tag 'nfs-for-6.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  xprtrdma: Remap Receive buffers after a reconnect
  NFSv4: fix out path in __nfs4_get_acl_uncached
  NFSv4.2: fix error handling in nfs42_proc_getxattr
  NFS: Fix sysfs server name memory leak
  NFS: Fix a use after free in nfs_direct_join_group()
  NFSv4: Fix dropped lock for racing OPEN and delegation return
2023-08-22 10:50:17 -07:00
Chuck Lever 895cedc179 xprtrdma: Remap Receive buffers after a reconnect
On server-initiated disconnect, rpcrdma_xprt_disconnect() was DMA-
unmapping the Receive buffers, but rpcrdma_post_recvs() neglected
to remap them after a new connection had been established. The
result was immediate failure of the new connection with the Receives
flushing with LOCAL_PROT_ERR.

Fixes: 671c450b6f ("xprtrdma: Fix oops in Receive handler after device removal")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-08-19 10:26:29 -04:00
Jakub Kicinski 7ff57803d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/sfc/tc.c
  fa165e1949 ("sfc: don't unregister flow_indr if it was never registered")
  3bf969e88a ("sfc: add MAE table machinery for conntrack table")
https://lore.kernel.org/all/20230818112159.7430e9b4@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-18 12:44:56 -07:00
Jeff Layton c96e2a695e sunrpc: set the bv_offset of first bvec in svc_tcp_sendmsg
svc_tcp_sendmsg used to factor in the xdr->page_base when sending pages,
but commit 5df5dd03a8 ("sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather
then sendpage") dropped that part of the handling. Fix it by setting
the bv_offset of the first bvec.

Fixes: 5df5dd03a8 ("sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-14 15:02:25 -04:00
Chuck Lever 39067dda1d SUNRPC: Use new helpers to handle TLS Alerts
Use the helpers to parse the level and description fields in
incoming alerts. "Warning" alerts are discarded, and "fatal"
alerts mean the session is no longer valid.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/169047944747.5241.1974889594004407123.stgit@oracle-102.nfsv4bat.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28 14:07:59 -07:00
Chuck Lever 5dd5ad682c SUNRPC: Send TLS Closure alerts before closing a TCP socket
Before closing a TCP connection, the TLS protocol wants peers to
send session close Alert notifications. Add those in both the RPC
client and server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/169047939404.5241.14392506226409865832.stgit@oracle-102.nfsv4bat.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28 14:07:59 -07:00
Chuck Lever 6a7eccef47 net/tls: Move TLS protocol elements to a separate header
Kernel TLS consumers will need definitions of various parts of the
TLS protocol, but often do not need the function declarations and
other infrastructure provided in <net/tls.h>.

Break out existing standardized protocol elements into a separate
header, and make room for a few more elements in subsequent patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/169047931374.5241.7713175865185969309.stgit@oracle-102.nfsv4bat.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28 14:07:59 -07:00
Jeff Layton bc734e8ec1 sunrpc: convert to ctime accessor functions
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-86-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-24 10:30:07 +02:00
Linus Torvalds dfab92f27c NFS client updates for Linux 6.5
Highlights include:
 
 Stable fixes and other bugfixes:
  - nfs: don't report STATX_BTIME in ->getattr
  - Revert "NFSv4: Retry LOCK on OLD_STATEID during delegation return"
    since it breaks NFSv4 state recovery.
  - NFSv4.1: freeze the session table upon receiving NFS4ERR_BADSESSION
  - Fix the NFSv4.2 xattr cache shrinker_id
  - Force a ctime update after a NFSv4.2 SETXATTR call
 
 Features and cleanups:
  - NFS and RPC over TLS client code from Chuck Lever.
  - Support for use of abstract unix socket addresses with the rpcbind
    daemon.
  - Sysfs API to allow shutdown of the kernel RPC client and prevent
    umount() hangs if the server is known to be permanently down.
  - XDR cleanups from Anna.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmSgmmUACgkQZwvnipYK
 APJwUA/+J6uEjJFoigSDU5dpCwQr4pHZPgUn3T2heplcyalGMxLo1VjDTVuFXb+a
 NZqdUZF2ePmYqss/UYzJC7R6/z9OanVBcpiGqp66foJt9ncs9BSm5AzdV5Gvi4VX
 6SrBM98nSqvD47l45LQ90bqIdR6WgMP9OiDC257PzYnaMZJcB0xObD4HWXh1zbIz
 3xynJTSQnRGbv9I5EjJJGVIHDWLfSKY61NUXjrUcmMZ2L39ITNy0CRi8sIdj3oY/
 A2Iz52IHtAhE77+EetThPskbTLa07raQSWRo3X6XJqCKiJIXa5giNDoG/zLq6sOT
 hi1AV7Tdxaed2EYibeRWzsSVQIClBb7T/hdro5dWs5u/bxM6Bt+yY90ZWUMZVOAQ
 /kGTYQXhI31vUgRaEN+2xci0wKDy9wqyAWcD8u8Gz01KaK09sfJSIvvYn+srSeaz
 wEUQHZCdBGtNFVP2q18q4x8BN27uObh1DdMvNhrxrA7YraXSQvL/rIIsD0jmDInb
 6olMm9g9nZSHgq62+CYs2v7J/AJKQzE7PsWrTMJDX1rso+/Lyc6x7oUGxv2IFt5H
 VZVZNdstKeNzfcnNKsGG2ZbufhasKHqiHJxJTdNOuOi0YBi+ixtJVRpupId3+6aZ
 ysng0IfzqiWSuiq5Axjreva+480IDSMW+7cqcw5urKEfYY5uVcc=
 =leGh
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
"Stable fixes and other bugfixes:

   - nfs: don't report STATX_BTIME in ->getattr

   - Revert 'NFSv4: Retry LOCK on OLD_STATEID during delegation return'
     since it breaks NFSv4 state recovery.

   - NFSv4.1: freeze the session table upon receiving NFS4ERR_BADSESSION

   - Fix the NFSv4.2 xattr cache shrinker_id

   - Force a ctime update after a NFSv4.2 SETXATTR call

  Features and cleanups:

   - NFS and RPC over TLS client code from Chuck Lever

   - Support for use of abstract unix socket addresses with the rpcbind
     daemon

   - Sysfs API to allow shutdown of the kernel RPC client and prevent
     umount() hangs if the server is known to be permanently down

   - XDR cleanups from Anna"

* tag 'nfs-for-6.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (33 commits)
  Revert "NFSv4: Retry LOCK on OLD_STATEID during delegation return"
  NFS: Don't cleanup sysfs superblock entry if uninitialized
  nfs: don't report STATX_BTIME in ->getattr
  NFSv4.1: freeze the session table upon receiving NFS4ERR_BADSESSION
  NFSv4.2: fix wrong shrinker_id
  NFSv4: Clean up some shutdown loops
  NFS: Cancel all existing RPC tasks when shutdown
  NFS: add sysfs shutdown knob
  NFS: add a sysfs link to the acl rpc_client
  NFS: add a sysfs link to the lockd rpc_client
  NFS: Add sysfs links to sunrpc clients for nfs_clients
  NFS: add superblock sysfs entries
  NFS: Make all of /sys/fs/nfs network-namespace unique
  NFS: Open-code the nfs_kset kset_create_and_add()
  NFS: rename nfs_client_kobj to nfs_net_kobj
  NFS: rename nfs_client_kset to nfs_kset
  NFS: Add an "xprtsec=" NFS mount option
  NFS: Have struct nfs_client carry a TLS policy field
  SUNRPC: Add a TCP-with-TLS RPC transport class
  SUNRPC: Capture CMSG metadata on client-side receive
  ...
2023-07-01 14:38:25 -07:00
Linus Torvalds 3a8a670eee Networking changes for 6.5.
Core
 ----
 
  - Rework the sendpage & splice implementations. Instead of feeding
    data into sockets page by page extend sendmsg handlers to support
    taking a reference on the data, controlled by a new flag called
    MSG_SPLICE_PAGES. Rework the handling of unexpected-end-of-file
    to invoke an additional callback instead of trying to predict what
    the right combination of MORE/NOTLAST flags is.
    Remove the MSG_SENDPAGE_NOTLAST flag completely.
 
  - Implement SCM_PIDFD, a new type of CMSG type analogous to
    SCM_CREDENTIALS, but it contains pidfd instead of plain pid.
 
  - Enable socket busy polling with CONFIG_RT.
 
  - Improve reliability and efficiency of reporting for ref_tracker.
 
  - Auto-generate a user space C library for various Netlink families.
 
 Protocols
 ---------
 
  - Allow TCP to shrink the advertised window when necessary, prevent
    sk_rcvbuf auto-tuning from growing the window all the way up to
    tcp_rmem[2].
 
  - Use per-VMA locking for "page-flipping" TCP receive zerocopy.
 
  - Prepare TCP for device-to-device data transfers, by making sure
    that payloads are always attached to skbs as page frags.
 
  - Make the backoff time for the first N TCP SYN retransmissions
    linear. Exponential backoff is unnecessarily conservative.
 
  - Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO).
 
  - Avoid waking up applications using TLS sockets until we have
    a full record.
 
  - Allow using kernel memory for protocol ioctl callbacks, paving
    the way to issuing ioctls over io_uring.
 
  - Add nolocalbypass option to VxLAN, forcing packets to be fully
    encapsulated even if they are destined for a local IP address.
 
  - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
    in-kernel ECMP implementation (e.g. Open vSwitch) select the same
    link for all packets. Support L4 symmetric hashing in Open vSwitch.
 
  - PPPoE: make number of hash bits configurable.
 
  - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
    (ipconfig).
 
  - Add layer 2 miss indication and filtering, allowing higher layers
    (e.g. ACL filters) to make forwarding decisions based on whether
    packet matched forwarding state in lower devices (bridge).
 
  - Support matching on Connectivity Fault Management (CFM) packets.
 
  - Hide the "link becomes ready" IPv6 messages by demoting their
    printk level to debug.
 
  - HSR: don't enable promiscuous mode if device offloads the proto.
 
  - Support active scanning in IEEE 802.15.4.
 
  - Continue work on Multi-Link Operation for WiFi 7.
 
 BPF
 ---
 
  - Add precision propagation for subprogs and callbacks. This allows
    maintaining verification efficiency when subprograms are used,
    or in fact passing the verifier at all for complex programs,
    especially those using open-coded iterators.
 
  - Improve BPF's {g,s}setsockopt() length handling. Previously BPF
    assumed the length is always equal to the amount of written data.
    But some protos allow passing a NULL buffer to discover what
    the output buffer *should* be, without writing anything.
 
  - Accept dynptr memory as memory arguments passed to helpers.
 
  - Add routing table ID to bpf_fib_lookup BPF helper.
 
  - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands.
 
  - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
    maps as read-only).
 
  - Show target_{obj,btf}_id in tracing link fdinfo.
 
  - Addition of several new kfuncs (most of the names are self-explanatory):
    - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
      bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
      and bpf_dynptr_clone().
    - bpf_task_under_cgroup()
    - bpf_sock_destroy() - force closing sockets
    - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
 
 Netfilter
 ---------
 
  - Relax set/map validation checks in nf_tables. Allow checking
    presence of an entry in a map without using the value.
 
  - Increase ip_vs_conn_tab_bits range for 64BIT builds.
 
  - Allow updating size of a set.
 
  - Improve NAT tuple selection when connection is closing.
 
 Driver API
 ----------
 
  - Integrate netdev with LED subsystem, to allow configuring HW
    "offloaded" blinking of LEDs based on link state and activity
    (i.e. packets coming in and out).
 
  - Support configuring rate selection pins of SFP modules.
 
  - Factor Clause 73 auto-negotiation code out of the drivers, provide
    common helper routines.
 
  - Add more fool-proof helpers for managing lifetime of MDIO devices
    associated with the PCS layer.
 
  - Allow drivers to report advanced statistics related to Time Aware
    scheduler offload (taprio).
 
  - Allow opting out of VF statistics in link dump, to allow more VFs
    to fit into the message.
 
  - Split devlink instance and devlink port operations.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Synopsys EMAC4 IP support (stmmac)
    - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
    - Marvell 88E6250 7 port switches
    - Microchip LAN8650/1 Rev.B0 PHYs
    - MediaTek MT7981/MT7988 built-in 1GE PHY driver
 
  - WiFi:
    - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
    - Realtek RTL8723DS (SDIO variant)
    - Realtek RTL8851BE
 
  - CAN:
    - Fintek F81604
 
 Drivers
 -------
 
  - Ethernet NICs:
    - Intel (100G, ice):
      - support dynamic interrupt allocation
      - use meta data match instead of VF MAC addr on slow-path
    - nVidia/Mellanox:
      - extend link aggregation to handle 4, rather than just 2 ports
      - spawn sub-functions without any features by default
    - OcteonTX2:
      - support HTB (Tx scheduling/QoS) offload
      - make RSS hash generation configurable
      - support selecting Rx queue using TC filters
    - Wangxun (ngbe/txgbe):
      - add basic Tx/Rx packet offloads
      - add phylink support (SFP/PCS control)
    - Freescale/NXP (enetc):
      - report TAPRIO packet statistics
    - Solarflare/AMD:
      - support matching on IP ToS and UDP source port of outer header
      - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
      - add devlink dev info support for EF10
 
  - Virtual NICs:
    - Microsoft vNIC:
      - size the Rx indirection table based on requested configuration
      - support VLAN tagging
    - Amazon vNIC:
      - try to reuse Rx buffers if not fully consumed, useful for ARM
        servers running with 16kB pages
    - Google vNIC:
      - support TCP segmentation of >64kB frames
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - enable USXGMII (88E6191X)
    - Microchip:
     - lan966x: add support for Egress Stage 0 ACL engine
     - lan966x: support mapping packet priority to internal switch
       priority (based on PCP or DSCP)
 
  - Ethernet PHYs:
    - Broadcom PHYs:
      - support for Wake-on-LAN for BCM54210E/B50212E
      - report LPI counter
    - Microsemi PHYs: support RGMII delay configuration (VSC85xx)
    - Micrel PHYs: receive timestamp in the frame (LAN8841)
    - Realtek PHYs: support optional external PHY clock
    - Altera TSE PCS: merge the driver into Lynx PCS which it is
      a variant of
 
  - CAN: Kvaser PCIEcan:
    - support packet timestamping
 
  - WiFi:
    - Intel (iwlwifi):
      - major update for new firmware and Multi-Link Operation (MLO)
      - configuration rework to drop test devices and split
        the different families
      - support for segmented PNVM images and power tables
      - new vendor entries for PPAG (platform antenna gain) feature
    - Qualcomm 802.11ax (ath11k):
      - Multiple Basic Service Set Identifier (MBSSID) and
        Enhanced MBSSID Advertisement (EMA) support in AP mode
      - support factory test mode
    - RealTek (rtw89):
      - add RSSI based antenna diversity
      - support U-NII-4 channels on 5 GHz band
    - RealTek (rtl8xxxu):
      - AP mode support for 8188f
      - support USB RX aggregation for the newer chips
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmSbJM4ACgkQMUZtbf5S
 IrtoDhAAhEim1+LBIKf4lhPcVdZ2p/TkpnwTz5jsTwSeRBAxTwuNJ2fQhFXg13E3
 MnRq6QaEp8G4/tA/gynLvQop+FEZEnv+horP0zf/XLcC8euU7UrKdrpt/4xxdP07
 IL/fFWsoUGNO+L9LNaHwBo8g7nHvOkPscHEBHc2Xrvzab56TJk6vPySfLqcpKlNZ
 CHWDwTpgRqNZzSKiSpoMVd9OVMKUXcPYHpDmfEJ5l+e8vTXmZzOLHrSELHU5nP5f
 mHV7gxkDCTshoGcaed7UTiOvgu1p6E5EchDJxiLaSUbgsd8SZ3u4oXwRxgj33RK/
 fB2+UaLrRt/DdlHvT/Ph8e8Ygu77yIXMjT49jsfur/zVA0HEA2dFb7V6QlsYRmQp
 J25pnrdXmE15llgqsC0/UOW5J1laTjII+T2T70UOAqQl4LWYAQDG4WwsAqTzU0KY
 dueydDouTp9XC2WYrRUEQxJUzxaOaazskDUHc5c8oHp/zVBT+djdgtvVR9+gi6+7
 yy4elI77FlEEqL0ItdU/lSWINayAlPLsIHkMyhSGKX0XDpKjeycPqkNx4UterXB/
 JKIR5RBWllRft+igIngIkKX0tJGMU0whngiw7d1WLw25wgu4sB53hiWWoSba14hv
 tXMxwZs5iGaPcT38oRVMZz8I1kJM4Dz3SyI7twVvi4RUut64EG4=
 =9i4I
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking changes from Jakub Kicinski:
 "WiFi 7 and sendpage changes are the biggest pieces of work for this
  release. The latter will definitely require fixes but I think that we
  got it to a reasonable point.

  Core:

   - Rework the sendpage & splice implementations

     Instead of feeding data into sockets page by page extend sendmsg
     handlers to support taking a reference on the data, controlled by a
     new flag called MSG_SPLICE_PAGES

     Rework the handling of unexpected-end-of-file to invoke an
     additional callback instead of trying to predict what the right
     combination of MORE/NOTLAST flags is

     Remove the MSG_SENDPAGE_NOTLAST flag completely

   - Implement SCM_PIDFD, a new type of CMSG type analogous to
     SCM_CREDENTIALS, but it contains pidfd instead of plain pid

   - Enable socket busy polling with CONFIG_RT

   - Improve reliability and efficiency of reporting for ref_tracker

   - Auto-generate a user space C library for various Netlink families

  Protocols:

   - Allow TCP to shrink the advertised window when necessary, prevent
     sk_rcvbuf auto-tuning from growing the window all the way up to
     tcp_rmem[2]

   - Use per-VMA locking for "page-flipping" TCP receive zerocopy

   - Prepare TCP for device-to-device data transfers, by making sure
     that payloads are always attached to skbs as page frags

   - Make the backoff time for the first N TCP SYN retransmissions
     linear. Exponential backoff is unnecessarily conservative

   - Create a new MPTCP getsockopt to retrieve all info
     (MPTCP_FULL_INFO)

   - Avoid waking up applications using TLS sockets until we have a full
     record

   - Allow using kernel memory for protocol ioctl callbacks, paving the
     way to issuing ioctls over io_uring

   - Add nolocalbypass option to VxLAN, forcing packets to be fully
     encapsulated even if they are destined for a local IP address

   - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
     in-kernel ECMP implementation (e.g. Open vSwitch) select the same
     link for all packets. Support L4 symmetric hashing in Open vSwitch

   - PPPoE: make number of hash bits configurable

   - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
     (ipconfig)

   - Add layer 2 miss indication and filtering, allowing higher layers
     (e.g. ACL filters) to make forwarding decisions based on whether
     packet matched forwarding state in lower devices (bridge)

   - Support matching on Connectivity Fault Management (CFM) packets

   - Hide the "link becomes ready" IPv6 messages by demoting their
     printk level to debug

   - HSR: don't enable promiscuous mode if device offloads the proto

   - Support active scanning in IEEE 802.15.4

   - Continue work on Multi-Link Operation for WiFi 7

  BPF:

   - Add precision propagation for subprogs and callbacks. This allows
     maintaining verification efficiency when subprograms are used, or
     in fact passing the verifier at all for complex programs,
     especially those using open-coded iterators

   - Improve BPF's {g,s}setsockopt() length handling. Previously BPF
     assumed the length is always equal to the amount of written data.
     But some protos allow passing a NULL buffer to discover what the
     output buffer *should* be, without writing anything

   - Accept dynptr memory as memory arguments passed to helpers

   - Add routing table ID to bpf_fib_lookup BPF helper

   - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands

   - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
     maps as read-only)

   - Show target_{obj,btf}_id in tracing link fdinfo

   - Addition of several new kfuncs (most of the names are
     self-explanatory):
      - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
        bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
        and bpf_dynptr_clone().
      - bpf_task_under_cgroup()
      - bpf_sock_destroy() - force closing sockets
      - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs

  Netfilter:

   - Relax set/map validation checks in nf_tables. Allow checking
     presence of an entry in a map without using the value

   - Increase ip_vs_conn_tab_bits range for 64BIT builds

   - Allow updating size of a set

   - Improve NAT tuple selection when connection is closing

  Driver API:

   - Integrate netdev with LED subsystem, to allow configuring HW
     "offloaded" blinking of LEDs based on link state and activity
     (i.e. packets coming in and out)

   - Support configuring rate selection pins of SFP modules

   - Factor Clause 73 auto-negotiation code out of the drivers, provide
     common helper routines

   - Add more fool-proof helpers for managing lifetime of MDIO devices
     associated with the PCS layer

   - Allow drivers to report advanced statistics related to Time Aware
     scheduler offload (taprio)

   - Allow opting out of VF statistics in link dump, to allow more VFs
     to fit into the message

   - Split devlink instance and devlink port operations

  New hardware / drivers:

   - Ethernet:
      - Synopsys EMAC4 IP support (stmmac)
      - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
      - Marvell 88E6250 7 port switches
      - Microchip LAN8650/1 Rev.B0 PHYs
      - MediaTek MT7981/MT7988 built-in 1GE PHY driver

   - WiFi:
      - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
      - Realtek RTL8723DS (SDIO variant)
      - Realtek RTL8851BE

   - CAN:
      - Fintek F81604

  Drivers:

   - Ethernet NICs:
      - Intel (100G, ice):
         - support dynamic interrupt allocation
         - use meta data match instead of VF MAC addr on slow-path
      - nVidia/Mellanox:
         - extend link aggregation to handle 4, rather than just 2 ports
         - spawn sub-functions without any features by default
      - OcteonTX2:
         - support HTB (Tx scheduling/QoS) offload
         - make RSS hash generation configurable
         - support selecting Rx queue using TC filters
      - Wangxun (ngbe/txgbe):
         - add basic Tx/Rx packet offloads
         - add phylink support (SFP/PCS control)
      - Freescale/NXP (enetc):
         - report TAPRIO packet statistics
      - Solarflare/AMD:
         - support matching on IP ToS and UDP source port of outer
           header
         - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
         - add devlink dev info support for EF10

   - Virtual NICs:
      - Microsoft vNIC:
         - size the Rx indirection table based on requested
           configuration
         - support VLAN tagging
      - Amazon vNIC:
         - try to reuse Rx buffers if not fully consumed, useful for ARM
           servers running with 16kB pages
      - Google vNIC:
         - support TCP segmentation of >64kB frames

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - enable USXGMII (88E6191X)
      - Microchip:
         - lan966x: add support for Egress Stage 0 ACL engine
         - lan966x: support mapping packet priority to internal switch
           priority (based on PCP or DSCP)

   - Ethernet PHYs:
      - Broadcom PHYs:
         - support for Wake-on-LAN for BCM54210E/B50212E
         - report LPI counter
      - Microsemi PHYs: support RGMII delay configuration (VSC85xx)
      - Micrel PHYs: receive timestamp in the frame (LAN8841)
      - Realtek PHYs: support optional external PHY clock
      - Altera TSE PCS: merge the driver into Lynx PCS which it is a
        variant of

   - CAN: Kvaser PCIEcan:
      - support packet timestamping

   - WiFi:
      - Intel (iwlwifi):
         - major update for new firmware and Multi-Link Operation (MLO)
         - configuration rework to drop test devices and split the
           different families
         - support for segmented PNVM images and power tables
         - new vendor entries for PPAG (platform antenna gain) feature
      - Qualcomm 802.11ax (ath11k):
         - Multiple Basic Service Set Identifier (MBSSID) and Enhanced
           MBSSID Advertisement (EMA) support in AP mode
         - support factory test mode
      - RealTek (rtw89):
         - add RSSI based antenna diversity
         - support U-NII-4 channels on 5 GHz band
      - RealTek (rtl8xxxu):
         - AP mode support for 8188f
         - support USB RX aggregation for the newer chips"

* tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits)
  net: scm: introduce and use scm_recv_unix helper
  af_unix: Skip SCM_PIDFD if scm->pid is NULL.
  net: lan743x: Simplify comparison
  netlink: Add __sock_i_ino() for __netlink_diag_dump().
  net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses
  Revert "af_unix: Call scm_recv() only after scm_set_cred()."
  phylink: ReST-ify the phylink_pcs_neg_mode() kdoc
  libceph: Partially revert changes to support MSG_SPLICE_PAGES
  net: phy: mscc: fix packet loss due to RGMII delays
  net: mana: use vmalloc_array and vcalloc
  net: enetc: use vmalloc_array and vcalloc
  ionic: use vmalloc_array and vcalloc
  pds_core: use vmalloc_array and vcalloc
  gve: use vmalloc_array and vcalloc
  octeon_ep: use vmalloc_array and vcalloc
  net: usb: qmi_wwan: add u-blox 0x1312 composition
  perf trace: fix MSG_SPLICE_PAGES build error
  ipvlan: Fix return value of ipvlan_queue_xmit()
  netfilter: nf_tables: fix underflow in chain reference counter
  netfilter: nf_tables: unbind non-anonymous set if rule construction fails
  ...
2023-06-28 16:43:10 -07:00
Linus Torvalds 6e17c6de3d - Yosry Ahmed brought back some cgroup v1 stats in OOM logs.
- Yosry has also eliminated cgroup's atomic rstat flushing.
 
 - Nhat Pham adds the new cachestat() syscall.  It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability.
 
 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning.
 
 - Lorenzo Stoakes has done some maintanance work on the get_user_pages()
   interface.
 
 - Liam Howlett continues with cleanups and maintenance work to the maple
   tree code.  Peng Zhang also does some work on maple tree.
 
 - Johannes Weiner has done some cleanup work on the compaction code.
 
 - David Hildenbrand has contributed additional selftests for
   get_user_pages().
 
 - Thomas Gleixner has contributed some maintenance and optimization work
   for the vmalloc code.
 
 - Baolin Wang has provided some compaction cleanups,
 
 - SeongJae Park continues maintenance work on the DAMON code.
 
 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting.
 
 - Christoph Hellwig has some cleanups for the filemap/directio code.
 
 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the provided
   APIs rather than open-coding accesses.
 
 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings.
 
 - John Hubbard has a series of fixes to the MM selftesting code.
 
 - ZhangPeng continues the folio conversion campaign.
 
 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock.
 
 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from
   128 to 8.
 
 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management.
 
 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code.
 
 - Vishal Moola also has done some folio conversion work.
 
 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA
 joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh
 J0qXCzqaaN8+BuEyLGDVPaXur9KirwY=
 =B7yQ
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull mm updates from Andrew Morton:

 - Yosry Ahmed brought back some cgroup v1 stats in OOM logs

 - Yosry has also eliminated cgroup's atomic rstat flushing

 - Nhat Pham adds the new cachestat() syscall. It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability

 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning

 - Lorenzo Stoakes has done some maintanance work on the
   get_user_pages() interface

 - Liam Howlett continues with cleanups and maintenance work to the
   maple tree code. Peng Zhang also does some work on maple tree

 - Johannes Weiner has done some cleanup work on the compaction code

 - David Hildenbrand has contributed additional selftests for
   get_user_pages()

 - Thomas Gleixner has contributed some maintenance and optimization
   work for the vmalloc code

 - Baolin Wang has provided some compaction cleanups,

 - SeongJae Park continues maintenance work on the DAMON code

 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting

 - Christoph Hellwig has some cleanups for the filemap/directio code

 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the
   provided APIs rather than open-coding accesses

 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings

 - John Hubbard has a series of fixes to the MM selftesting code

 - ZhangPeng continues the folio conversion campaign

 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock

 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
   from 128 to 8

 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management

 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code

 - Vishal Moola also has done some folio conversion work

 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch

* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
  mm/hugetlb: remove hugetlb_set_page_subpool()
  mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
  hugetlb: revert use of page_cache_next_miss()
  Revert "page cache: fix page_cache_next/prev_miss off by one"
  mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
  mm: memcg: rename and document global_reclaim()
  mm: kill [add|del]_page_to_lru_list()
  mm: compaction: convert to use a folio in isolate_migratepages_block()
  mm: zswap: fix double invalidate with exclusive loads
  mm: remove unnecessary pagevec includes
  mm: remove references to pagevec
  mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
  mm: remove struct pagevec
  net: convert sunrpc from pagevec to folio_batch
  i915: convert i915_gpu_error to use a folio_batch
  pagevec: rename fbatch_count()
  mm: remove check_move_unevictable_pages()
  drm: convert drm_gem_put_pages() to use a folio_batch
  i915: convert shmem_sg_free_table() to use a folio_batch
  scatterlist: add sg_set_folio()
  ...
2023-06-28 10:28:11 -07:00
Matthew Wilcox (Oracle) 76fa884290 net: convert sunrpc from pagevec to folio_batch
Remove the last usage of pagevecs.  There is a slight change here; we now
free the folio_batch as soon as it fills up instead of freeing the
folio_batch when we try to add a page to a full batch.  This should have
no effect in practice.

Link: https://lkml.kernel.org/r/20230621164557.3510324-10-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23 16:59:30 -07:00
Benjamin Coddington d9615d166c NFS: add sysfs shutdown knob
Within each nfs_server sysfs tree, add an entry named "shutdown".  Writing
1 to this file will set the cl_shutdown bit on the rpc_clnt structs
associated with that mount.  If cl_shutdown is set, the task scheduler
immediately returns -EIO for new tasks.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:08:12 -04:00
Benjamin Coddington e13b549319 NFS: Add sysfs links to sunrpc clients for nfs_clients
For the general and state management nfs_client under each mount, create
symlinks to their respective rpc_client sysfs entries.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:04:13 -04:00
Chuck Lever 75eb6af7ac SUNRPC: Add a TCP-with-TLS RPC transport class
Use the new TLS handshake API to enable the SunRPC client code
to request a TLS handshake. This implements support for RFC 9289,
only on TCP sockets.

Upper layers such as NFS use RPC-with-TLS to protect in-transit
traffic.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:28:10 -04:00
Chuck Lever dea034b963 SUNRPC: Capture CMSG metadata on client-side receive
kTLS sockets use CMSG to report decryption errors and the need
for session re-keying.

For RPC-with-TLS, an "application data" message contains a ULP
payload, and that is passed along to the RPC client. An "alert"
message triggers connection reset. Everything else is discarded.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:26:54 -04:00
Chuck Lever 0d3ca07ffd SUNRPC: Ignore data_ready callbacks during TLS handshakes
The RPC header parser doesn't recognize TLS handshake traffic, so it
will close the connection prematurely with an error. To avoid that,
shunt the transport's data_ready callback when there is a TLS
handshake in progress.

The XPRT_SOCK_IGNORE_RECV flag will be toggled by code added in a
subsequent patch.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:19:16 -04:00
Chuck Lever 120726526e SUNRPC: Add RPC client support for the RPC_AUTH_TLS auth flavor
The new authentication flavor is used only to discover peer support
for RPC-over-TLS.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:18:36 -04:00
Chuck Lever 97d1c83c3f SUNRPC: Trace the rpc_create_args
Pass the upper layer's rpc_create_args to the rpc_clnt_new()
tracepoint so additional parts of the upper layer's request can be
recorded.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:17:56 -04:00
Chuck Lever 5000531912 SUNRPC: Plumb an API for setting transport layer security
Add an initial set of policies along with fields for upper layers to
pass the requested policy down to the transport layer.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:16:59 -04:00
NeilBrown 626590ea4c SUNRPC: attempt to reach rpcbind with an abstract socket name
NFS is primarily name-spaced using network namespaces.  However it
contacts rpcbind (and gss_proxy) using AF_UNIX sockets which are
name-spaced using the mount namespaces.  This requires a container using
NFSv3 (the form that requires rpcbind) to manage both network and mount
namespaces, which can seem an unnecessary burden.

As NFS is primarily a network service it makes sense to use network
namespaces as much as possible, and to prefer to communicate with an
rpcbind running in the same network namespace.  This can be done, while
preserving the benefits of AF_UNIX sockets, by using an abstract socket
address.

An abstract address has a nul at the start of sun_path, and a length
that is exactly the complete size of the sockaddr_un up to the end of
the name, NOT including any trailing nul (which is not part of the
address).
Abstract addresses are local to a network namespace - regular AF_UNIX
path names a resolved in the mount namespace ignoring the network
namespace.

This patch causes rpcb to first try an abstract address before
continuing with regular AF_UNIX and then IP addresses.  This ensures
backwards compatibility.

Choosing the name needs some care as the same address will be configured
for rpcbind, and needs to be built in to libtirpc for this enhancement
to be fully successful.  There is no formal standard for choosing
abstract addresses.  The defacto standard appears to be to use a path
name similar to what would be used for a filesystem AF_UNIX address -
but with a leading nul.

In that case
   "\0/var/run/rpcbind.sock"
seems like the best choice.  However at this time /var/run is deprecated
in favour of /run, so
   "\0/run/rpcbind.sock"
might be better.
Though as we are deliberately moving away from using the filesystem it
might seem more sensible to explicitly break the connection and just
have
   "\0rpcbind.socket"
using the same name as the systemd unit file..

This patch chooses the second option, which seems least likely to raise
objections.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:12:22 -04:00
NeilBrown 4388ce05fa SUNRPC: support abstract unix socket addresses
An "abtract" address for an AF_UNIX socket start with a nul and can
contain any bytes for the given length, but traditionally doesn't
contain other nuls.  When reported, the leading nul is replaced by '@'.

sunrpc currently rejects connections to these addresses and reports them
as an empty string.  To provide support for future use of these
addresses, allow them for outgoing connections and report them more
usefully.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:12:22 -04:00
Chuck Lever 88770b8de3 svcrdma: Fix stale comment
Commit 7d81ee8722 ("svcrdma: Single-stage RDMA Read") changed the
behavior of svc_rdma_recvfrom() but neglected to update the
documenting comment.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-18 12:09:08 -04:00
Chuck Lever 00a87e5d1d SUNRPC: Address RCU warning in net/sunrpc/svc.c
$ make C=1 W=1 net/sunrpc/svc.o
make[1]: Entering directory 'linux/obj/manet.1015granger.net'
  GEN     Makefile
  CALL    linux/server-development/scripts/checksyscalls.sh
  DESCEND objtool
  INSTALL libsubcmd_headers
  DESCEND bpf/resolve_btfids
  INSTALL libsubcmd_headers
  CC [M]  net/sunrpc/svc.o
  CHECK   linux/server-development/net/sunrpc/svc.c
linux/server-development/net/sunrpc/svc.c:1225:9: warning: incorrect type in argument 1 (different address spaces)
linux/server-development/net/sunrpc/svc.c:1225:9:    expected struct spinlock [usertype] *lock
linux/server-development/net/sunrpc/svc.c:1225:9:    got struct spinlock [noderef] __rcu *
linux/server-development/net/sunrpc/svc.c:1227:40: warning: incorrect type in argument 1 (different address spaces)
linux/server-development/net/sunrpc/svc.c:1227:40:    expected struct spinlock [usertype] *lock
linux/server-development/net/sunrpc/svc.c:1227:40:    got struct spinlock [noderef] __rcu *
make[1]: Leaving directory 'linux/obj/manet.1015granger.net'

Warning introduced by commit 913292c97d ("sched.h: Annotate
sighand_struct with __rcu").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17 13:18:07 -04:00
Azeem Shaikh a9156d7e7d SUNRPC: Use sysfs_emit in place of strlcpy/sprintf
Part of an effort to remove strlcpy() tree-wide [1].

Direct replacement is safe here since the getter in kernel_params_ops
handles -errno return [2].

[1] https://github.com/KSPP/linux/issues/89
[2] https://elixir.bootlin.com/linux/v6.4-rc6/source/include/linux/moduleparam.h#L52

Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17 13:18:07 -04:00
Chuck Lever 6c53da5d66 SUNRPC: Remove transport class dprintk call sites
Remove a couple of dprintk call sites that are of little value.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17 13:18:07 -04:00
Chuck Lever 02cea33f56 SUNRPC: Fix comments for transport class registration
The preceding block comment before svc_register_xprt_class() is
not related to that function.

While we're here, add proper documenting comments for these two
publicly-visible functions.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17 13:18:07 -04:00
Chuck Lever b55c63332e svcrdma: Remove an unused argument from __svc_rdma_put_rw_ctxt()
Clean up.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17 13:18:07 -04:00
Chuck Lever a23c76e92d svcrdma: trace cc_release calls
This event brackets the svcrdma_post_* trace points. If this trace
event is enabled but does not appear as expected, that indicates a
chunk_ctxt leak.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17 13:18:06 -04:00
Chuck Lever 91f8ce2846 svcrdma: Convert "might sleep" comment into a code annotation
Try to catch incorrect calling contexts mechanically rather than by
code review.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17 13:18:06 -04:00
Chuck Lever f8335a212a SUNRPC: Move initialization of rq_stime
Micro-optimization: Call ktime_get() only when ->xpo_recvfrom() has
given us a full RPC message to process. rq_stime isn't used
otherwise, so this avoids pointless work.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17 13:18:06 -04:00
Chuck Lever 5581cf8efc SUNRPC: Optimize page release in svc_rdma_sendto()
Now that we have bulk page allocation and release APIs, it's more
efficient to use those than it is for nfsd threads to wait for send
completions. Previous patches have eliminated the calls to
wait_for_completion() and complete(), in order to avoid scheduler
overhead.

Now release pages-under-I/O in the send completion handler using
the efficient bulk release API.

I've measured a 7% reduction in cumulative CPU utilization in
svc_rdma_sendto(), svc_rdma_wc_send(), and svc_xprt_release(). In
particular, using release_pages() instead of complete() cuts the
time per svc_rdma_wc_send() call by two-thirds. This helps improve
scalability because svc_rdma_wc_send() is single-threaded per
connection.

Reviewed-by: Tom Talpey <tom@talpey.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17 13:18:06 -04:00
Chuck Lever baf6d18b11 svcrdma: Prevent page release when nothing was received
I noticed that svc_rqst_release_pages() was still unnecessarily
releasing a page when svc_rdma_recvfrom() returns zero.

Fixes: a53d5cb064 ("svcrdma: Avoid releasing a page in svc_xprt_release()")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17 13:18:04 -04:00
David Howells 5df5dd03a8 sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
When transmitting data, call down into TCP using sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
performing sendpage calls to transmit header, data pages and trailer.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna@kernel.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-12 21:13:23 -07:00
Chuck Lever c4b50cdf9d svcrdma: Revert 2a1e4f21d8 ("svcrdma: Normalize Send page handling")
Get rid of the completion wait in svc_rdma_sendto(), and release
pages in the send completion handler again. A subsequent patch will
handle releasing those pages more efficiently.

Reverted by hand: patch -R would not apply cleanly.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12 12:16:36 -04:00
Chuck Lever a944209c11 SUNRPC: Revert 579900670a ("svcrdma: Remove unused sc_pages field")
Pre-requisite for releasing pages in the send completion handler.
Reverted by hand: patch -R would not apply cleanly.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12 12:16:36 -04:00
Chuck Lever 6be7afcd92 SUNRPC: Revert cc93ce9529 ("svcrdma: Retain the page backing rq_res.head[0].iov_base")
Pre-requisite for releasing pages in the send completion handler.
Reverted by hand: patch -R would not apply cleanly.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12 12:16:35 -04:00
Chuck Lever ac3c32bbdb svcrdma: Clean up allocation of svc_rdma_rw_ctxt
The physical device's favored NUMA node ID is available when
allocating a rw_ctxt. Use that value instead of relying on the
assumption that the memory allocation happens to be running on a
node close to the device.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12 12:16:35 -04:00
Chuck Lever ed51b42610 svcrdma: Clean up allocation of svc_rdma_send_ctxt
The physical device's favored NUMA node ID is available when
allocating a send_ctxt. Use that value instead of relying on the
assumption that the memory allocation happens to be running on a
node close to the device.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12 12:16:35 -04:00
Chuck Lever c5d68d25bd svcrdma: Clean up allocation of svc_rdma_recv_ctxt
The physical device's favored NUMA node ID is available when
allocating a recv_ctxt. Use that value instead of relying on the
assumption that the memory allocation happens to be running on a
node close to the device.

This clean up eliminates the hack of destroying recv_ctxts that
were not created by the receive CQ thread -- recv_ctxts are now
always allocated on a "good" node.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12 12:16:35 -04:00
Chuck Lever fe2b401e55 svcrdma: Allocate new transports on device's NUMA node
The physical device's NUMA node ID is available when allocating an
svc_xprt for an incoming connection. Use that value to ensure the
svc_xprt structure is allocated on the NUMA node closest to the
device.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12 12:16:34 -04:00
Chuck Lever 703d752155 NFSD: Hoist rq_vec preparation into nfsd_read() [step two]
Now that the preparation of an rq_vec has been removed from the
generic read path, nfsd_splice_read() no longer needs to reset
rq_next_page.

nfsd4_encode_read() calls nfsd_splice_read() directly. As far as I
can ascertain, resetting rq_next_page for NFSv4 splice reads is
unnecessary because rq_next_page is already set correctly.

Moreover, resetting it might even be incorrect if previous
operations in the COMPOUND have already consumed at least a page of
the send buffer. I would expect that the result would be encoding
the READ payload over previously-encoded results.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-11 16:37:46 -04:00
Chuck Lever 88e4d41a26 SUNRPC: Use __alloc_bulk_pages() in svc_init_buffer()
Clean up: Use the bulk page allocator when filling a server thread's
buffer page array.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05 09:01:44 -04:00
Chuck Lever 5f7fc5d69f SUNRPC: Resupply rq_pages from node-local memory
svc_init_buffer() is careful to allocate the initial set of server
thread buffer pages from memory on the local NUMA node.
svc_alloc_arg() should also be that careful.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05 09:01:44 -04:00
Chuck Lever c42bebca96 SUNRPC: Trace struct svc_sock lifetime events
Capture a timestamp and pointer address during the creation and
destruction of struct svc_sock to record its lifetime. This helps
to diagnose transport reference counting issues.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05 09:01:42 -04:00
Chuck Lever d7900daea0 SUNRPC: Improve observability in svc_tcp_accept()
The -ENOMEM arm could fire repeatedly if the system runs low on
memory, so remove it.

Don't bother to trace -EAGAIN error events, since those fire after
a listener is created (with no work done) and once again after an
accept has been handled successfully (again, with no work done).

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05 09:01:42 -04:00
Chuck Lever cce4ee9c78 SUNRPC: Remove dprintk() in svc_handle_xprt()
When enabled, this dprintk() fires for every incoming RPC, which is
an enormous amount of log traffic. These days, after the first few
hundred log messages, the system journald is just going to mute it,
along with all other NFSD debug output.

Let's rely on trace points for this high-traffic information
instead.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05 09:01:42 -04:00
Chuck Lever e8277327d7 SUNRPC: Fix an incorrect comment
The correct function name is svc_tcp_listen_data_ready().

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05 09:01:42 -04:00
Ding Hui fc80fc2d4e SUNRPC: Fix UAF in svc_tcp_listen_data_ready()
After the listener svc_sock is freed, and before invoking svc_tcp_accept()
for the established child sock, there is a window that the newsock
retaining a freed listener svc_sock in sk_user_data which cloning from
parent. In the race window, if data is received on the newsock, we will
observe use-after-free report in svc_tcp_listen_data_ready().

Reproduce by two tasks:

1. while :; do rpc.nfsd 0 ; rpc.nfsd; done
2. while :; do echo "" | ncat -4 127.0.0.1 2049 ; done

KASAN report:

  ==================================================================
  BUG: KASAN: slab-use-after-free in svc_tcp_listen_data_ready+0x1cf/0x1f0 [sunrpc]
  Read of size 8 at addr ffff888139d96228 by task nc/102553
  CPU: 7 PID: 102553 Comm: nc Not tainted 6.3.0+ #18
  Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
  Call Trace:
   <IRQ>
   dump_stack_lvl+0x33/0x50
   print_address_description.constprop.0+0x27/0x310
   print_report+0x3e/0x70
   kasan_report+0xae/0xe0
   svc_tcp_listen_data_ready+0x1cf/0x1f0 [sunrpc]
   tcp_data_queue+0x9f4/0x20e0
   tcp_rcv_established+0x666/0x1f60
   tcp_v4_do_rcv+0x51c/0x850
   tcp_v4_rcv+0x23fc/0x2e80
   ip_protocol_deliver_rcu+0x62/0x300
   ip_local_deliver_finish+0x267/0x350
   ip_local_deliver+0x18b/0x2d0
   ip_rcv+0x2fb/0x370
   __netif_receive_skb_one_core+0x166/0x1b0
   process_backlog+0x24c/0x5e0
   __napi_poll+0xa2/0x500
   net_rx_action+0x854/0xc90
   __do_softirq+0x1bb/0x5de
   do_softirq+0xcb/0x100
   </IRQ>
   <TASK>
   ...
   </TASK>

  Allocated by task 102371:
   kasan_save_stack+0x1e/0x40
   kasan_set_track+0x21/0x30
   __kasan_kmalloc+0x7b/0x90
   svc_setup_socket+0x52/0x4f0 [sunrpc]
   svc_addsock+0x20d/0x400 [sunrpc]
   __write_ports_addfd+0x209/0x390 [nfsd]
   write_ports+0x239/0x2c0 [nfsd]
   nfsctl_transaction_write+0xac/0x110 [nfsd]
   vfs_write+0x1c3/0xae0
   ksys_write+0xed/0x1c0
   do_syscall_64+0x38/0x90
   entry_SYSCALL_64_after_hwframe+0x72/0xdc

  Freed by task 102551:
   kasan_save_stack+0x1e/0x40
   kasan_set_track+0x21/0x30
   kasan_save_free_info+0x2a/0x50
   __kasan_slab_free+0x106/0x190
   __kmem_cache_free+0x133/0x270
   svc_xprt_free+0x1e2/0x350 [sunrpc]
   svc_xprt_destroy_all+0x25a/0x440 [sunrpc]
   nfsd_put+0x125/0x240 [nfsd]
   nfsd_svc+0x2cb/0x3c0 [nfsd]
   write_threads+0x1ac/0x2a0 [nfsd]
   nfsctl_transaction_write+0xac/0x110 [nfsd]
   vfs_write+0x1c3/0xae0
   ksys_write+0xed/0x1c0
   do_syscall_64+0x38/0x90
   entry_SYSCALL_64_after_hwframe+0x72/0xdc

Fix the UAF by simply doing nothing in svc_tcp_listen_data_ready()
if state != TCP_LISTEN, that will avoid dereferencing svsk for all
child socket.

Link: https://lore.kernel.org/lkml/20230507091131.23540-1-dinghui@sangfor.com.cn/
Fixes: fa9251afc3 ("SUNRPC: Call the default socket callbacks instead of open coding")
Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05 09:01:42 -04:00
Linus Torvalds a746ca666a nfsd-6.4 fixes:
- Two minor bug fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmR3f/AACgkQM2qzM29m
 f5dgCQ/+KiS7B1u8O8VrsAy3D2G309UZUf2QMr982pDimaXBR65ggKI5GWTjjwrW
 ob0P18prMetT6SGUAIJ3nlEsjDfs+CgKPq/zE2uqB0Sgv4ROmWoboygVsCk+JiTp
 xlystF9XBMSjqu23LazFE6+yTAMWcX7ddB5NrJCZr+yMffVZhN6onM38YtQNKh1F
 fdY+tY7NgEu24+60Heb8ZAmo2+mX8fSe+lVXuwV3h1HRKand1463NNSMGH0t805P
 nBdytIBuwWbOHv8mgL3FB042PSGiHsaL5Yevq42Je+7nN92olHS0ML2IBgcwzHsw
 Duc8E8MvElUIADXIfOg5SODjUJ8C5Avclyj7tbcnc9H5jdvolWshbJa8CkhTkvC2
 mdVTxDlOpWlBK4fLEtEJQzHldppGZqmnS85ppl4Ipjdu6s6WGyFs6kKXAyKGnsvc
 ydNFyhqv88BY6H3S681nLdTdwMFpxa8RAyJqXkcOBmbJd/8naDpDq7bi6GxNUvWs
 36ymLklJzWxKP/B0tayvqA+dTiKbs/480Xz37vXrcJigBeWpwyNQG2NUeGeqP2kd
 xvEiCZHYq0gmTjGKnxeyvzAeKAlzRKSo4BiI5tUYBOBhV0OHIozz7TMa+Br91sEl
 fNrjnr/d8XAxAVmJcWuSlqfggOVFS1D8FxYIhcbY1NV6zPWogu0=
 =2AMA
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Two minor bug fixes

* tag 'nfsd-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: fix double fget() bug in __write_ports_addfd()
  nfsd: make a copy of struct iattr before calling notify_change
2023-06-02 13:38:55 -04:00
Dan Carpenter c034203b6a nfsd: fix double fget() bug in __write_ports_addfd()
The bug here is that you cannot rely on getting the same socket
from multiple calls to fget() because userspace can influence
that.  This is a kind of double fetch bug.

The fix is to delete the svc_alien_sock() function and instead do
the checking inside the svc_addsock() function.

Fixes: 3064639423 ("nfsd: check passed socket's net matches NFSd superblock's one")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-31 09:57:14 -04:00
Linus Torvalds 421ca22e31 NFS Client Bugfixes for Linux 6.4-rc
Stable Fix:
   * Don't change task->tk_status after the call to rpc_exit_task
 
 Other Bugfixes:
   * Convert kmap_atomic() to kmap_local_folio()
   * Fix a potential double free with READ_PLUS
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmRrttUACgkQ18tUv7Cl
 QOuhaA//QFHklXZk/vCkQnNQMYWL11GJliWawLoDfcZal6uQ/a2QCQV1Cbmav62B
 FR2BmXDxzM2PRdLu2VHGpkn0CQW3M1tvgaNjGD1xdOxpyIkn47T5lfAd/4X2XPiU
 M1ck2Usc258UB1yoKV+jbUD3ptn2BvC+VMWJInaA578hv8TA6Ouh7lP7rPJfDHoJ
 OfoLxx9/VqGqMWzfExAHnGw328oieXNnOwynETAdapVwjQeiEcYAED82pJmVsD7+
 m++6dRVQRA2bMIMRFWmW8HsO08sR32wzy76XgKws4Xu59Fiy+TQ8PoeUjCtTNq6/
 9ibPwH4R7VbcxXa2eT23EbtO2nSkZw/dFiL0s5VNYqeVrBwwlzyklU1uSvIEPegk
 zHamqxMMlVLkoMwJa83wIKB8/viPKwV5zcF9UjmrJy67+wXZet6M0c7S9HyiTj9U
 NzVbqyK3KhMtsD4ps/EGVWsgGKAIeWbE8wPlP7GF7PHwEw+hWa9pHir6L6BizNqG
 DJ/2zfZxDvOGy2r5OvSqGn07/zsj+0URixzEq0IOn1Li/osFZpvK3EVFncd/qsvW
 NwPRoF+70skFRdXhbdWa/HEUZlyN2uiIU24luraMrN0U4b4X7aw+EMnMekBi+Vec
 bEtWEUJ/vK3mlsOde4gVW0PZBhe8JE6PHlqkQBn5zobV3/cXXCw=
 =6xFZ
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.4-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "Stable Fix:

   - Don't change task->tk_status after the call to rpc_exit_task

  Other Bugfixes:

   - Convert kmap_atomic() to kmap_local_folio()

   - Fix a potential double free with READ_PLUS"

* tag 'nfs-for-6.4-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4.2: Fix a potential double free with READ_PLUS
  SUNRPC: Don't change task->tk_status after the call to rpc_exit_task
  NFS: Convert kmap_atomic() to kmap_local_folio()
2023-05-22 12:01:13 -07:00
Trond Myklebust d180891fba SUNRPC: Don't change task->tk_status after the call to rpc_exit_task
Some calls to rpc_exit_task() may deliberately change the value of
task->tk_status, for instance because it gets checked by the RPC call's
rpc_release() callback. That makes it wrong to reset the value to
task->tk_rpc_status.
In particular this causes a bug where the rpc_call_done() callback tries
to fail over a set of pNFS/flexfiles writes to a different IP address,
but the reset of task->tk_status causes nfs_commit_release_pages() to
immediately mark the file as having a fatal error.

Fixes: 39494194f9 ("SUNRPC: Fix races with rpc_killall_tasks()")
Cc: stable@vger.kernel.org # 6.1.x
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-05-19 16:50:05 -04:00
Linus Torvalds 1b66c114d1 nfsd-6.4 fixes:
- A collection of minor bug fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmRiQVEACgkQM2qzM29m
 f5eYfBAAg5Qz45PL+fo1qWxkJ1ZKaNV1vPdi4tqCt9NEItDTjAnjj0am+rKNGZAz
 EOM2yFt4xaZGyMYgXe4VnYl0N+rSpbI/H+Rk/wOq4OHPURQD5EO9VeP86qZ7rmGl
 ECPqb39TFTwAiRomC/DHO4eNpoe1rQuXu0tW9+GmqDGxeuh8xdxTk33g17ZXwCFN
 tdPkkPjxVPdWd8X7HQg9kWm8AWfV+GyuzE2rKAoOjbs6Wv6d9GCY8Cb5HXkRsQhF
 4Zh0PVQuTuXurZwtPXwnS0k4kfvQwjlTIKHlXuo0ZLh+SuFbrWHzv0fVyD+kUpSK
 HtWbJ8JcruUvz0WGMtZatzRLHCZLDguV6oVXPp7rtmuxTj4szzHSFpEeAV901sIm
 Nkvuomvd02K/fiTo7s3yr6t1VG2vju9LDwhBe197iA3leHAlockfbbxE3NJMGbzQ
 NoOPd+lu95cfsanOM1LZZLNfbLrZofoSLK9K1+HD0yAVdyq7u47FyHRrymvCaMrj
 GiheuqrBfBMEq+2mCwUn37aM0FblYEXQM0xTVXPQcHtBBN/nGZxPJukmpr7ScNlR
 aqMtDoOLu4OEFuo6fe2/94eNi+N5XAZgWmx/mSyaytE8Xw9LJxeQ83UTigaGcYKc
 3YIuG1YXg9IyIoIdLkghB+Aj/6fivsGFK9Gud6g7I3xw4f15noA=
 =PiNG
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - A collection of minor bug fixes

* tag 'nfsd-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: Remove open coding of string copy
  SUNRPC: Fix trace_svc_register() call site
  SUNRPC: always free ctxt when freeing deferred request
  SUNRPC: double free xprt_ctxt while still in use
  SUNRPC: Fix error handling in svc_setup_socket()
  SUNRPC: Fix encoding of accepted but unsuccessful RPC replies
  lockd: define nlm_port_min,max with CONFIG_SYSCTL
  nfsd: define exports_proc_ops with CONFIG_PROC_FS
  SUNRPC: Avoid relying on crypto API to derive CBC-CTS output IV
2023-05-17 09:56:01 -07:00
Chuck Lever 07a2730593 SUNRPC: Fix trace_svc_register() call site
The trace event recorded incorrect values for the registered family,
protocol, and port because the arguments are in the wrong order.

Fixes: b4af59328c ("SUNRPC: Trace server-side rpcbind registration events")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-14 15:58:14 -04:00
NeilBrown 948f072ada SUNRPC: always free ctxt when freeing deferred request
Since the ->xprt_ctxt pointer was added to svc_deferred_req, it has not
been sufficient to use kfree() to free a deferred request.  We may need
to free the ctxt as well.

As freeing the ctxt is all that ->xpo_release_rqst() does, we repurpose
it to explicit do that even when the ctxt is not stored in an rqst.
So we now have ->xpo_release_ctxt() which is given an xprt and a ctxt,
which may have been taken either from an rqst or from a dreq.  The
caller is now responsible for clearing that pointer after the call to
->xpo_release_ctxt.

We also clear dr->xprt_ctxt when the ctxt is moved into a new rqst when
revisiting a deferred request.  This ensures there is only one pointer
to the ctxt, so the risk of double freeing in future is reduced.  The
new code in svc_xprt_release which releases both the ctxt and any
rq_deferred depends on this.

Fixes: 773f91b2cf ("SUNRPC: Fix NFSD's request deferral on RDMA transports")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-14 15:55:02 -04:00
NeilBrown eb8d3a2c80 SUNRPC: double free xprt_ctxt while still in use
When an RPC request is deferred, the rq_xprt_ctxt pointer is moved out
of the svc_rqst into the svc_deferred_req.
When the deferred request is revisited, the pointer is copied into
the new svc_rqst - and also remains in the svc_deferred_req.

In the (rare?) case that the request is deferred a second time, the old
svc_deferred_req is reused - it still has all the correct content.
However in that case the rq_xprt_ctxt pointer is NOT cleared so that
when xpo_release_xprt is called, the ctxt is freed (UDP) or possible
added to a free list (RDMA).
When the deferred request is revisited for a second time, it will
reference this ctxt which may be invalid, and the free the object a
second time which is likely to oops.

So change svc_defer() to *always* clear rq_xprt_ctxt, and assert that
the value is now stored in the svc_deferred_req.

Fixes: 773f91b2cf ("SUNRPC: Fix NFSD's request deferral on RDMA transports")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-14 15:55:01 -04:00
Chuck Lever 319050d430 SUNRPC: Fix error handling in svc_setup_socket()
Dan points out that sock_alloc_file() releases @sock on error, but
so do all of svc_setup_socket's callers, resulting in a double-
release if sock_alloc_file() returns an error.

Rather than allocating a struct file for all new sockets, allocate
one only for sockets created during a TCP accept. For the moment,
those are the only ones that will ever be used with RPC-with-TLS.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: ae0d77708a ("SUNRPC: Ensure server-side sockets have a sock->file")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-05 20:51:11 -04:00
Chuck Lever 29cd2927fb SUNRPC: Fix encoding of accepted but unsuccessful RPC replies
Jiri Slaby says:
> I bisected to this ... as it breaks nfs3-only servers in 6.3.
> I.e. /etc/nfs.conf containing:
> [nfsd]
> vers4=no
>
> The client sees:
>  mount("10.0.2.15:/tmp", "/mnt", "nfs", 0, "vers=4.2,addr=10.0.2.15,clientad"...) = -1 EIO (Input/output error)
>  write(2, "mount.nfs: mount system call fai"..., 45
>  mount.nfs: mount system call failed for /mnt
>
> And the kernel says:
>  nfs4_discover_server_trunking unhandled error -5. Exiting with error EIO

Reported-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://bugzilla.suse.com/show_bug.cgi?id=1210995
Fixes: 4bcf0343e8 ("SUNRPC: Set rq_accept_statp inside ->accept methods")
Tested-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-03 09:35:51 -04:00
Ard Biesheuvel af97b7dfb0 SUNRPC: Avoid relying on crypto API to derive CBC-CTS output IV
Scott reports SUNRPC self-test failures regarding the output IV on arm64
when using the SIMD accelerated implementation of AES in CBC mode with
ciphertext stealing ("cts(cbc(aes))" in crypto API speak).

These failures are the result of the fact that, while RFC 3962 does
specify what the output IV should be and includes test vectors for it,
the general concept of an output IV is poorly defined, and generally,
not specified by the various algorithms implemented by the crypto API.
Only algorithms that support transparent chaining (e.g., CBC mode on a
block boundary) have requirements on the output IV, but ciphertext
stealing (CTS) is fundamentally about how to encapsulate CBC in a way
where the length of the entire message may not be an integral multiple
of the cipher block size, and the concept of an output IV does not exist
here because it has no defined purpose past the end of the message.

The generic CTS template takes advantage of this chaining capability of
the CBC implementations, and as a result, happens to return an output
IV, simply because it passes its IV buffer directly to the encapsulated
CBC implementation, which operates on full blocks only, and always
returns an IV. This output IV happens to match how RFC 3962 defines it,
even though the CTS template itself does not contain any output IV logic
whatsoever, and, for this reason, lacks any test vectors that exercise
this accidental output IV generation.

The arm64 SIMD implementation of cts(cbc(aes)) does not use the generic
CTS template at all, but instead, implements the CBC mode and ciphertext
stealing directly, and therefore does not encapsule a CBC implementation
that returns an output IV in the same way. The arm64 SIMD implementation
complies with the specification and passes all internal tests, but when
invoked by the SUNRPC code, fails to produce the expected output IV and
causes its selftests to fail.

Given that the output IV is defined as the penultimate block (where the
final block may smaller than the block size), we can quite easily derive
it in the caller by copying the appropriate slice of ciphertext after
encryption.

Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Jeff Layton <jlayton@kernel.org>
Reported-by: Scott Mayhew <smayhew@redhat.com>
Tested-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-02 11:35:04 -04:00
Linus Torvalds 4e1c80ae5c NFSD 6.4 Release Notes
The big ticket item for this release is support for RPC-with-TLS
 [RFC 9289] has been added to the Linux NFS server. The goal is to
 provide a simple-to-deploy, low-overhead in-transit confidentiality
 and peer authentication mechanism. It can supplement NFS Kerberos
 and it can protect the use of legacy non-cryptographic user
 authentication flavors such as AUTH_SYS. The TLS Record protocol is
 handled entirely by kTLS, meaning it can use either software
 encryption or offload encryption to smart NICs.
 
 Work continues on improving NFSD's open file cache. Among the many
 clean-ups in that area is a patch to convert the rhashtable to use
 the list-hashing version of that data structure.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmRK/JMACgkQM2qzM29m
 f5cF5A/+JZFRSPlfSYt0YHzUQQSDdYn5n/IG9TwJQd62xheu083WuKRaCOYYoOhg
 06nZd6p7nuF1E0n2ZWOKSE6YkBSE6z4M6KrQlm6lCe/nmxYCR87IYfJCXuL+Yf0e
 /LdL4OTvDHzY5ec1DreERldPIUJ8GFzwChH8/z4XwbNDR7qJkF/gf8YxpFr+8K+j
 Cfyl8woZiEze/Nvxy1YtAqa7HMEpitt0aWJN55rHwTh9c3b0nmDzziYFcVqXgybJ
 /qUHfHBak66ll8RqhcQ3BMuyfszwASERbPsaZ2a5F/RaxLL5ZWfFyhgQwm+PZWT+
 J5DdSBwLEQYtKQGD41A1aorP6v/u2QelfWrl4S7/qjRpREp8Ba2IU4fYLjGb1499
 Imk68BA7NwFp87tdMi/7en1VVgina4U/S3X71aUYWe+C0g48BfTrVwq4SVbQSAo4
 1638vbZnrJbsJMr9OaaysKWfv4KZB020Ji1KVwuqmgy5F8kdfJCCQ2UR/fHuJ3DY
 R0Zrd1Ryjwr83viP+Xj0ERiW405gPdCT0RJqoA7rznRPCqT5M42tf5z65uO7iZeE
 C1udgDaoQOtioKlem6FcDXLkryf986slGA7V91lat/Jt8A5jLKQfjVe3Q+kaaqXP
 ka1DQnYelzMzILQQs39cqW5pShrH8e3tfRZ7JhdBgrpxVXz9ZZM=
 =lA2+
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "The big ticket item for this release is that support for RPC-with-TLS
  [RFC 9289] has been added to the Linux NFS server.

  The goal is to provide a simple-to-deploy, low-overhead in-transit
  confidentiality and peer authentication mechanism. It can supplement
  NFS Kerberos and it can protect the use of legacy non-cryptographic
  user authentication flavors such as AUTH_SYS. The TLS Record protocol
  is handled entirely by kTLS, meaning it can use either software
  encryption or offload encryption to smart NICs.

  Aside from that, work continues on improving NFSD's open file cache.
  Among the many clean-ups in that area is a patch to convert the
  rhashtable to use the list-hashing version of that data structure"

* tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (31 commits)
  NFSD: Handle new xprtsec= export option
  SUNRPC: Support TLS handshake in the server-side TCP socket code
  NFSD: Clean up xattr memory allocation flags
  NFSD: Fix problem of COMMIT and NFS4ERR_DELAY in infinite loop
  SUNRPC: Clear rq_xid when receiving a new RPC Call
  SUNRPC: Recognize control messages in server-side TCP socket code
  SUNRPC: Be even lazier about releasing pages
  SUNRPC: Convert svc_xprt_release() to the release_pages() API
  SUNRPC: Relocate svc_free_res_pages()
  nfsd: simplify the delayed disposal list code
  SUNRPC: Ignore return value of ->xpo_sendto
  SUNRPC: Ensure server-side sockets have a sock->file
  NFSD: Watch for rq_pages bounds checking errors in nfsd_splice_actor()
  sunrpc: simplify two-level sysctl registration for svcrdma_parm_table
  SUNRPC: return proper error from get_expiry()
  lockd: add some client-side tracepoints
  nfs: move nfs_fhandle_hash to common include file
  lockd: server should unlock lock if client rejects the grant
  lockd: fix races in client GRANTED_MSG wait logic
  lockd: move struct nlm_wait to lockd.h
  ...
2023-04-29 11:04:14 -07:00
Linus Torvalds 0127f25b5d NFS Client Updates for Linux 6.4
New Features:
   * Convert the readdir path to use folios
   * Convert the NFS fscache code to use netfs
 
 Bugfixes and Cleanups:
   * Always send a RECLAIM_COMPLETE after establishing a lease
   * Simplify sysctl registrations and other cleanups
   * Handle out-of-order write replies on NFS v3
   * Have sunrpc call_bind_status use standard hard/soft task semantics
   * Other minor cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmRMI04ACgkQ18tUv7Cl
 QOuCNQ//SkQm8aOM4DkYFeDIObye6xMzgtWrB25grYNG4a/DcYqb5kNcbmI5l1tE
 Tus8KMZAWSpwa0m8ALctzp+pZQWQkY/svsqqHrKIGUHBI8F0OinVCqc2MzNN75WX
 m/1wELW6ek9RBL5BoJtAPt+Qu8/jP6KD64Zot7snBeUrzreaZDcz0HM+EcQhi7X7
 qd5XS0/cA2eLEBBQcQdFpRhHvgW12BMYM/zp3/ER5H52L2iAlZunGWw+Nqs8ueOR
 D7K2+CF1sV1k6hYbLWNoaF2J6PZr5dRpc6gSq4fLP4WUKjqQwmQp8cm9iLpf6jGa
 a+Y7t8aj7vup8jVCVGWYWZA2G2gi6jWmxxWudkJwfAa1E45t1B4/C0udwlxR20OO
 XI2Bhe5YwTURgSOvOS9QTZJpQN4qfpEL0NoAmAT5fAHBQ2CXDrMlSIxPS7U6LO9q
 YqwIHcAHvYVnbD45IUh2Zjbp65mRb1VkU6WzOyK1/sNHEyYpubIWXB/yLaA3oGge
 V3xUgvlTzLVzzyQfwiRfzAD1P5/USaXE/B36c4itfCr5rJnAfsiBP3gk0o9yq18J
 3Yb6olrmc9CzeA7PN88uEus4VZHbaE9OktRFIjJ22jlLQEY4xougdE5asY1XX8F+
 OKLLLeeCrsbvrANB9XcLVsLqdMYvsd0VaCX9HtN3UP+7Lod5T10=
 =gpBC
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "New Features:

   - Convert the readdir path to use folios

   - Convert the NFS fscache code to use netfs

  Bugfixes and Cleanups:

   - Always send a RECLAIM_COMPLETE after establishing a lease

   - Simplify sysctl registrations and other cleanups

   - Handle out-of-order write replies on NFS v3

   - Have sunrpc call_bind_status use standard hard/soft task semantics

   - Other minor cleanups"

* tag 'nfs-for-6.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4.2: Rework scratch handling for READ_PLUS
  NFS: Cleanup unused rpc_clnt variable
  NFS: set varaiable nfs_netfs_debug_id storage-class-specifier to static
  SUNRPC: remove the maximum number of retries in call_bind_status
  NFS: Convert readdir page array functions to use a folio
  NFS: Convert the readdir array-of-pages into an array-of-folios
  NFSv3: handle out-of-order write replies.
  NFS: Remove fscache specific trace points and NFS_INO_FSCACHE bit
  NFS: Remove all NFSIOS_FSCACHE counters due to conversion to netfs API
  NFS: Convert buffered read paths to use netfs when fscache is enabled
  NFS: Configure support for netfs when NFS fscache is configured
  NFS: Rename readpage_async_filler to nfs_read_add_folio
  sunrpc: simplify one-level sysctl registration for debug_table
  sunrpc: move sunrpc_table and proc routines above
  sunrpc: simplify one-level sysctl registration for xs_tunables_table
  sunrpc: simplify one-level sysctl registration for xr_tunables_table
  nfs: simplify two-level sysctl registration for nfs_cb_sysctls
  nfs: simplify two-level sysctl registration for nfs4_cb_sysctls
  lockd: simplify two-level sysctl registration for nlm_sysctls
  NFSv4.1: Always send a RECLAIM_COMPLETE after establishing lease
2023-04-29 10:58:44 -07:00
Chuck Lever b3cbf98e2f SUNRPC: Support TLS handshake in the server-side TCP socket code
This patch adds opportunitistic RPC-with-TLS to the Linux in-kernel
NFS server. If the client requests RPC-with-TLS and the user space
handshake agent is running, the server will set up a TLS session.

There are no policy settings yet. For example, the server cannot
yet require the use of RPC-with-TLS to access its data.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-27 18:49:24 -04:00
Chuck Lever 695bc1f32c SUNRPC: Clear rq_xid when receiving a new RPC Call
This is an eye-catcher for tracepoints that record the XID: it means
svc_rqst() has not received a full RPC Call with an XID yet.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-27 18:49:24 -04:00
Chuck Lever 5e052dda12 SUNRPC: Recognize control messages in server-side TCP socket code
To support kTLS, the server-side TCP socket receive path needs to
watch for CMSGs.

Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-27 18:49:24 -04:00
Chuck Lever 6a0cdf56bf SUNRPC: Be even lazier about releasing pages
A single RPC transaction that touches only a couple of pages means
rq_pvec will not be even close to full in svc_xpt_release(). This is
a common case.

Instead, just leave the pages in rq_pvec until it is completely
full. This improves the efficiency of the batch release mechanism
on workloads that involve small RPC messages.

The rq_pvec is also fully emptied just before thread exit.

Reviewed-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-27 18:49:23 -04:00
Chuck Lever 647a2a6428 SUNRPC: Convert svc_xprt_release() to the release_pages() API
Instead of invoking put_page() one-at-a-time, pass the "response"
portion of rq_pages directly to release_pages() to reduce the number
of times each nfsd thread invokes a page allocator API.

Since svc_xprt_release() is not invoked while a client is waiting
for an RPC Reply, this is not expected to directly impact mean
request latencies on a lightly or moderately loaded server. However
as workload intensity increases, I expect somewhat better
scalability: the same number of server threads should be able to
handle more work.

Reviewed-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:05:02 -04:00