Commit graph

4457 commits

Author SHA1 Message Date
Jakub Kicinski
9b19e57a3c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Build issue in drivers/net/ethernet/sfc/ptp.c
  54fccfdd7c ("sfc: efx_default_channel_type APIs can be static")
  49e6123c65 ("net: sfc: fix memory leak due to ptp channel")
https://lore.kernel.org/all/20220510130556.52598fe2@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-12 16:15:30 -07:00
Trond Myklebust
fd13359f54 SUNRPC: Ensure that the gssproxy client can start in a connected state
Ensure that the gssproxy client connects to the server from the gssproxy
daemon process context so that the AF_LOCAL socket connection is done
using the correct path and namespaces.

Fixes: 1d658336b0 ("SUNRPC: Add RPC based upcall mechanism for RPCGSS auth")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-05-07 14:31:15 -04:00
Trond Myklebust
3d1b0d3514 Revert "SUNRPC: Ensure gss-proxy connects on setup"
This reverts commit 892de36fd4.

The gssproxy server is unresponsive when it calls into the kernel to
start the upcall service, so it will not reply to our RPC ping at all.

Reported-by: "J.Bruce Fields" <bfields@fieldses.org>
Fixes: 892de36fd4 ("SUNRPC: Ensure gss-proxy connects on setup")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-05-07 14:30:40 -04:00
Trond Myklebust
a3d0562d4d Revert "SUNRPC: attempt AF_LOCAL connect on setup"
This reverts commit 7073ea8799.

We must not try to connect the socket while the transport is under
construction, because the mechanisms to safely tear it down are not in
place. As the code stands, we end up leaking the sockets on a connection
error.

Reported-by: wanghai (M) <wanghai38@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-29 20:38:27 -04:00
Trond Myklebust
892de36fd4 SUNRPC: Ensure gss-proxy connects on setup
For reasons best known to the author, gss-proxy does not implement a
NULL procedure, and returns RPC_PROC_UNAVAIL. However we still want to
ensure that we connect to the service at setup time.
So add a quirk-flag specially for this case.

Fixes: 1d658336b0 ("SUNRPC: Add RPC based upcall mechanism for RPCGSS auth")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-29 12:29:31 -04:00
Trond Myklebust
efce2d0ba6 SUNRPC: Ensure timely close of disconnected AF_LOCAL sockets
When the rpcbind server closes the socket, we need to ensure that the
socket is closed by the kernel as soon as feasible, so add a
sk_state_change callback to trigger this close.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-29 08:39:15 -04:00
Trond Myklebust
aad41a7d7c SUNRPC: Don't leak sockets in xs_local_connect()
If there is still a closed socket associated with the transport, then we
need to trigger an autoclose before we can set up a new connection.

Reported-by: wanghai (M) <wanghai38@huawei.com>
Fixes: f00432063d ("SUNRPC: Ensure we flush any closed sockets before xs_xprt_free()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-28 11:28:37 -04:00
Olga Kornievskaia
e13433b441 SUNRPC release the transport of a relocated task with an assigned transport
A relocated task must release its previous transport.

Fixes: 82ee41b85c ("SUNRPC don't resend a task on an offlined transport")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-22 11:11:31 -04:00
Paolo Abeni
edf45f007a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-04-15 09:26:00 +02:00
Linus Torvalds
c1488c9751 NFSD bug fixes for 5.18-rc:
- Fix a write performance regression
 - Fix crashes during request deferral on RDMA transports
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmJVnIgACgkQM2qzM29m
 f5cdQA//SS2pbiAcjePL6zuVrd/eeALWRJIdCj8fP9eQ4dfi7OM0xQE3qHyLDC3n
 2fjciUSnoE+lIYiRXD7Ii+GqYThSrrxlR7aCT+vrIAL/ksnzTeUiUSuEihPOOYjn
 ZEz9vkP4wlCTFh4SdeBoETEjksvRU3ql37l4S04nxxgidpY/NkF5wZH5iUlV8Z14
 Q8OxJJk1tFJARPz6RqGrRkMws24NhYzeuXQktVA+AOoKjUeQSp4bN1ZSCJv3eX/E
 EjAPDFMYVdenp8OY9RJoP3Xpxb6e1mv5flXYQa7YpJUVH3AccMJ8aKuFadjGfGNS
 2tEzoipJR0fmxkxpDX1g0Bzm8fAIc467l4tskVzUxLnqNS/FjXv2P85h8p9U/R06
 BWbF5CxRu89h1tZ/ZIh4lfw/Ro94XvYaYCDeu9V1TndP+QroNDDY9Ypl13+uy6uS
 zLwEEo5nMUbc7FVmT7UidHqeEukwbNzmXEXYrgQD5hRaT9L+85R0L5/Y+gi+ODDK
 7SKu1Bomi3WN7WKzjvZspHMivVT6JNy/ngHlKSYkWYl/dJzMy5I4Z4sNRVCHKoKX
 17SpKHfKkZAt5oS54dZ40O1PhICZTMclAB7Vb8bs/yggrHTsqwqf21v8FTJW79K9
 MjnYigEWgwi62GC5WdZr5LqAkqRHi2K2rLnFwVSLtZvpb2sxPdM=
 =3AeS
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Fix a write performance regression

 - Fix crashes during request deferral on RDMA transports

* tag 'nfsd-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  SUNRPC: Fix the svc_deferred_event trace class
  SUNRPC: Fix NFSD's request deferral on RDMA transports
  nfsd: Clean up nfsd_file_put()
  nfsd: Fix a write performance regression
  SUNRPC: Return true/false (not 1/0) from bool functions
2022-04-12 14:23:19 -10:00
Oliver Hartkopp
ec095263a9 net: remove noblock parameter from recvmsg() entities
The internal recvmsg() functions have two parameters 'flags' and 'noblock'
that were merged inside skb_recv_datagram(). As a follow up patch to commit
f4b41f062c ("net: remove noblock parameter from skb_recv_datagram()")
this patch removes the separate 'noblock' parameter for recvmsg().

Analogue to the referenced patch for skb_recv_datagram() the 'flags' and
'noblock' parameters are unnecessarily split up with e.g.

err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
                           flags & ~MSG_DONTWAIT, &addr_len);

or in

err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
                      sk, msg, size, flags & MSG_DONTWAIT,
                      flags & ~MSG_DONTWAIT, &addr_len);

instead of simply using only flags all the time and check for MSG_DONTWAIT
where needed (to preserve for the formerly separated no(n)block condition).

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-12 15:00:25 +02:00
Linus Torvalds
1a3b1bba7c NFS client bugfixes for Linux 5.18
Highlights include:
 
 Stable fixes:
 - SUNRPC: Ensure we flush any closed sockets before xs_xprt_free()
 
 Bugfixes:
 - Fix an Oopsable condition due to SLAB_ACCOUNT setting in the NFSv4.2
   xattr code.
 - Fix for open() using an file open mode of '3' in NFSv4
 - Replace readdir's use of xxhash() with hash_64()
 - Several patches to handle malloc() failure in SUNRPC
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmJQb5IACgkQZwvnipYK
 APLVeRAAheQX422gIrjrWsQrZxJWmMDBBj9qwiV14XMs6g5KF/DtNbIJm1BGjA9Y
 aLgNVknSDcGHMmvdh5MCVF7R61sqkCNbQWP3Bx6OBP6ei3FEgQBPyujcFXAaZ7UK
 T4fBzfPoEYaxVkEG/L5BsAUnh6TvyCfNm1oqZZv2bv9M0U6UlXnRLWZN9I6cAtcw
 GI9HgnufWOxJWIqPd9dY45aF44Ru+aJ953eecPh0G83Reoc99EAU+PvJJD18Wl0H
 BZqM6ar5pNush50yVIjnPFXg+sM97dKGlLvYL11Uy7f8valSaBXPZLZQ/jG4Vx/a
 m/l8FVwBEV89BG7z6jKNju7ERbO+xbPgXP8lSwkj69fXOpuvzo/G6VAxS6ZJww12
 p6TJnhCMSEF7qrQc5mejA803dT4MiJWo4i2th482Ws/tRN5H+y9pDYLsvBPM8iHo
 sVkJBco04tBHEH3qKgDTu8EY7/shH+GVO3Wmtcjz47T2rbmQB7gJtfipHcLmV2qd
 Jy1tiz1T7rhGs7KNtWpu8210MY9iFhIt3rAvdGdeVmgnNjrmRex0Sxqx8vgKUAFE
 aQjXkkpwHA3rHWnuGPMKu5i48BPFQ8m4QgdVC9F6ylNT6RIbImqplQiwIYNIxYC+
 cBohUG1yYZRjFEMkXBdfK5cImkTLW7RkLCatRh5v8OU8xqpRVL8=
 =edfs
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:
 "Stable fixes:

   - SUNRPC: Ensure we flush any closed sockets before xs_xprt_free()

  Bugfixes:

   - Fix an Oopsable condition due to SLAB_ACCOUNT setting in the
     NFSv4.2 xattr code.

   - Fix for open() using an file open mode of '3' in NFSv4

   - Replace readdir's use of xxhash() with hash_64()

   - Several patches to handle malloc() failure in SUNRPC"

* tag 'nfs-for-5.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  SUNRPC: Move the call to xprt_send_pagedata() out of xprt_sock_sendmsg()
  SUNRPC: svc_tcp_sendmsg() should handle errors from xdr_alloc_bvec()
  SUNRPC: Handle allocation failure in rpc_new_task()
  NFS: Ensure rpc_run_task() cannot fail in nfs_async_rename()
  NFSv4/pnfs: Handle RPC allocation errors in nfs4_proc_layoutget
  SUNRPC: Handle low memory situations in call_status()
  SUNRPC: Handle ENOMEM in call_transmit_status()
  NFSv4.2: Fix missing removal of SLAB_ACCOUNT on kmem_cache allocation
  SUNRPC: Ensure we flush any closed sockets before xs_xprt_free()
  NFS: Replace readdir's use of xxhash() with hash_64()
  SUNRPC: handle malloc failure in ->request_prepare
  NFSv4: fix open failure with O_ACCMODE flag
  Revert "NFSv4: Handle the special Linux file open access mode"
2022-04-08 07:39:17 -10:00
Trond Myklebust
ff053dbbaf SUNRPC: Move the call to xprt_send_pagedata() out of xprt_sock_sendmsg()
The client and server have different requirements for their memory
allocation, so move the allocation of the send buffer out of the socket
send code that is common to both.

Reported-by: NeilBrown <neilb@suse.de>
Fixes: b2648015d4 ("SUNRPC: Make the rpciod and xprtiod slab allocation modes consistent")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-07 16:20:01 -04:00
Trond Myklebust
b056fa0708 SUNRPC: svc_tcp_sendmsg() should handle errors from xdr_alloc_bvec()
The allocation is done with GFP_KERNEL, but it could still fail in a low
memory situation.

Fixes: 4a85a6a332 ("SUNRPC: Handle TCP socket sends with kernel_sendpage() again")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-07 16:20:01 -04:00
Trond Myklebust
25cf32ad5d SUNRPC: Handle allocation failure in rpc_new_task()
If the call to rpc_alloc_task() fails, then ensure that the calldata is
released, and that rpc_run_task() and rpc_run_bc_task() bail out early.

Reported-by: NeilBrown <neilb@suse.de>
Fixes: 910ad38697 ("NFS: Fix memory allocation in rpc_alloc_task()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-07 16:20:00 -04:00
Trond Myklebust
9d82819d5b SUNRPC: Handle low memory situations in call_status()
We need to handle ENFILE, ENOBUFS, and ENOMEM, because
xprt_wake_pending_tasks() can be called with any one of these due to
socket creation failures.

Fixes: b61d59fffd ("SUNRPC: xs_tcp_connect_worker{4,6}: merge common code")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-07 16:20:00 -04:00
Trond Myklebust
d3c15033b2 SUNRPC: Handle ENOMEM in call_transmit_status()
Both call_transmit() and call_bc_transmit() can now return ENOMEM, so
let's make sure that we handle the errors gracefully.

Fixes: 0472e47660 ("SUNRPC: Convert socket page send code to use iov_iter()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-07 16:20:00 -04:00
Trond Myklebust
f00432063d SUNRPC: Ensure we flush any closed sockets before xs_xprt_free()
We must ensure that all sockets are closed before we call xprt_free()
and release the reference to the net namespace. The problem is that
calling fput() will defer closing the socket until delayed_fput() gets
called.
Let's fix the situation by allowing rpciod and the transport teardown
code (which runs on the system wq) to call __fput_sync(), and directly
close the socket.

Reported-by: Felix Fu <foyjog@gmail.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: a73881c96d ("SUNRPC: Fix an Oops in udp_poll()")
Cc: stable@vger.kernel.org # 5.1.x: 3be232f11a: SUNRPC: Prevent immediate close+reconnect
Cc: stable@vger.kernel.org # 5.1.x: 89f42494f9: SUNRPC: Don't call connect() more than once on a TCP socket
Cc: stable@vger.kernel.org # 5.1.x
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-07 16:19:47 -04:00
Jason Gunthorpe
e945c653c8 RDMA: Split kernel-only global device caps from uverbs device caps
Split out flags from ib_device::device_cap_flags that are only used
internally to the kernel into kernel_cap_flags that is not part of the
uapi. This limits the device_cap_flags to being the same bitmap that will
be copied to userspace.

This cleanly splits out the uverbs flags from the kernel flags to avoid
confusion in the flags bitmap.

Add some short comments describing which each of the kernel flags is
connected to. Remove unused kernel flags.

Link: https://lore.kernel.org/r/0-v2-22c19e565eef+139a-kern_caps_jgg@nvidia.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-04-06 15:02:13 -03:00
Chuck Lever
773f91b2cf SUNRPC: Fix NFSD's request deferral on RDMA transports
Trond Myklebust reports an NFSD crash in svc_rdma_sendto(). Further
investigation shows that the crash occurred while NFSD was handling
a deferred request.

This patch addresses two inter-related issues that prevent request
deferral from working correctly for RPC/RDMA requests:

1. Prevent the crash by ensuring that the original
   svc_rqst::rq_xprt_ctxt value is available when the request is
   revisited. Otherwise svc_rdma_sendto() does not have a Receive
   context available with which to construct its reply.

2. Possibly since before commit 71641d99ce ("svcrdma: Properly
   compute .len and .buflen for received RPC Calls"),
   svc_rdma_recvfrom() did not include the transport header in the
   returned xdr_buf. There should have been no need for svc_defer()
   and friends to save and restore that header, as of that commit.
   This issue is addressed in a backport-friendly way by simply
   having svc_rdma_recvfrom() set rq_xprt_hlen to zero
   unconditionally, just as svc_tcp_recvfrom() does. This enables
   svc_deferred_recv() to correctly reconstruct an RPC message
   received via RPC/RDMA.

Reported-by: Trond Myklebust <trondmy@hammerspace.com>
Link: https://lore.kernel.org/linux-nfs/82662b7190f26fb304eb0ab1bb04279072439d4e.camel@hammerspace.com/
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org>
2022-04-06 14:01:27 -04:00
NeilBrown
eb07d5a4da SUNRPC: handle malloc failure in ->request_prepare
If ->request_prepare() detects an error, it sets ->rq_task->tk_status.
This is easy for callers to ignore.
The only caller is xprt_request_enqueue_receive() and it does ignore the
error, as does call_encode() which calls it.  This can result in a
request being queued to receive a reply without an allocated receive buffer.

So instead of setting rq_task->tk_status, return an error, and store in
->tk_status only in call_encode();

The call to xprt_request_enqueue_receive() is now earlier in
call_encode(), where the error can still be handled.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-29 22:15:46 -04:00
Linus Torvalds
965181d7ef NFS client updates for Linux 5.18
Highlights include:
 
 Features:
 - Switch NFS to use readahead instead of the obsolete readpages.
 - Readdir fixes to improve cacheability of large directories when there
   are multiple readers and writers.
 - Readdir performance improvements when doing a seekdir() immediately
   after opening the directory (common when re-exporting NFS).
 - NFS swap improvements from Neil Brown.
 - Loosen up memory allocation to permit direct reclaim and write back
   in cases where there is no danger of deadlocking the writeback code or
   NFS swap.
 - Avoid sillyrename when the NFSv4 server claims to support the
   necessary features to recover the unlinked but open file after reboot.
 
 Bugfixes:
 - Patch from Olga to add a mount option to control NFSv4.1 session
   trunking discovery, and default it to being off.
 - Fix a lockup in nfs_do_recoalesce().
 - Two fixes for list iterator variables being used when pointing to the
   list head.
 - Fix a kernel memory scribble when reading from a non-socket transport
   in /sys/kernel/sunrpc.
 - Fix a race where reconnecting to a server could leave the TCP socket
   stuck forever in the connecting state.
 - Patch from Neil to fix a shutdown race which can leave the SUNRPC
   transport timer primed after we free the struct xprt itself.
 - Patch from Xin Xiong to fix reference count leaks in the NFSv4.2 copy
   offload.
 - Sunrpc patch from Olga to avoid resending a task on an offlined
   transport.
 
 Cleanups:
 - Patches from Dave Wysochanski to clean up the fscache code
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmJDLLUACgkQZwvnipYK
 APJZ5Q/9FH5S38RcUOxpFS+XaVdmxltJTMQv9Bz5mcyljq6PwRjTn73Qs7MfBQfD
 k90O9Pn/EXgQpFWNW4o/saQGLAO1MbVaesdNNF92Le9NziMz7lr95fsFXOqpg9EQ
 rEVOImWinv/N0+P0yv5tgjx6Fheu70L4dnwsxepnmtv2R1VjXOVchK7WjsV+9qey
 Xvn15OzhtVCGhRvvKfmGuGW+2I0D29tRNn2X0Pi+LjxAEQ0We/rcsJeGDMGYtbeC
 2U4ADk7KB0kJq04WYCH8k8tTrljL3PUONzO6QP1lltmAZ0FOwXf/AjyaD3SML2Xu
 A1yc2bEn3/S6wCQ3BiMxoE7UCSewg/ZyQD7B9GjZ/nWYNXREhLKvtVZt+ULTCA6o
 UKAI2vlEHljzQI2YfbVVZFyL8KCZqi+BFcJ/H/dtFKNazIkiefkaF+703Pflazis
 HjhA5G6IT0Pk6qZ6OBp+Z/4kCDYZYXXi9ysiNuMPWCiYEfZzt7ocUjm44uRAi+vR
 Qpf0V01Rg4iplTj3YFXqb9TwxrudzC7AcQ7hbLJia0kD/8djG+aGZs3SJWeO+VuE
 Xa41oxAx9jqOSRvSKPvj7nbexrwn8PtQI2NtDRqjPXlg6+Jbk23u8lIEVhsCH877
 xoMUHtTKLyqECskzRN4nh7qoBD3QfiLuI74aDcPtGEuXzU1+Ox8=
 =jeGt
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Features:

   - Switch NFS to use readahead instead of the obsolete readpages.

   - Readdir fixes to improve cacheability of large directories when
     there are multiple readers and writers.

   - Readdir performance improvements when doing a seekdir() immediately
     after opening the directory (common when re-exporting NFS).

   - NFS swap improvements from Neil Brown.

   - Loosen up memory allocation to permit direct reclaim and write back
     in cases where there is no danger of deadlocking the writeback code
     or NFS swap.

   - Avoid sillyrename when the NFSv4 server claims to support the
     necessary features to recover the unlinked but open file after
     reboot.

  Bugfixes:

   - Patch from Olga to add a mount option to control NFSv4.1 session
     trunking discovery, and default it to being off.

   - Fix a lockup in nfs_do_recoalesce().

   - Two fixes for list iterator variables being used when pointing to
     the list head.

   - Fix a kernel memory scribble when reading from a non-socket
     transport in /sys/kernel/sunrpc.

   - Fix a race where reconnecting to a server could leave the TCP
     socket stuck forever in the connecting state.

   - Patch from Neil to fix a shutdown race which can leave the SUNRPC
     transport timer primed after we free the struct xprt itself.

   - Patch from Xin Xiong to fix reference count leaks in the NFSv4.2
     copy offload.

   - Sunrpc patch from Olga to avoid resending a task on an offlined
     transport.

  Cleanups:

   - Patches from Dave Wysochanski to clean up the fscache code"

* tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (91 commits)
  NFSv4/pNFS: Fix another issue with a list iterator pointing to the head
  NFS: Don't loop forever in nfs_do_recoalesce()
  SUNRPC: Don't return error values in sysfs read of closed files
  SUNRPC: Do not dereference non-socket transports in sysfs
  NFSv4.1: don't retry BIND_CONN_TO_SESSION on session error
  SUNRPC don't resend a task on an offlined transport
  NFS: replace usage of found with dedicated list iterator variable
  SUNRPC: avoid race between mod_timer() and del_timer_sync()
  pNFS/files: Ensure pNFS allocation modes are consistent with nfsiod
  pNFS/flexfiles: Ensure pNFS allocation modes are consistent with nfsiod
  NFSv4/pnfs: Ensure pNFS allocation modes are consistent with nfsiod
  NFS: Avoid writeback threads getting stuck in mempool_alloc()
  NFS: nfsiod should not block forever in mempool_alloc()
  SUNRPC: Make the rpciod and xprtiod slab allocation modes consistent
  SUNRPC: Fix unx_lookup_cred() allocation
  NFS: Fix memory allocation in rpc_alloc_task()
  NFS: Fix memory allocation in rpc_malloc()
  SUNRPC: Improve accuracy of socket ENOBUFS determination
  SUNRPC: Replace internal use of SOCKWQ_ASYNC_NOSPACE
  SUNRPC: Fix socket waits for write buffer space
  ...
2022-03-29 18:55:37 -07:00
Trond Myklebust
ebbe788731 SUNRPC: Don't return error values in sysfs read of closed files
Instead of returning an error value, which ends up being the return
value for the read() system call, it is more elegant to simply return
the error as a string value.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-25 13:24:12 -04:00
Trond Myklebust
421ab1be43 SUNRPC: Do not dereference non-socket transports in sysfs
Do not cast the struct xprt to a sock_xprt unless we know it is a UDP or
TCP transport. Otherwise the call to lock the mutex will scribble over
whatever structure is actually there. This has been seen to cause hard
system lockups when the underlying transport was RDMA.

Fixes: b49ea673e1 ("SUNRPC: lock against ->sock changing during sysfs read")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-25 13:22:58 -04:00
Linus Torvalds
169e77764a Networking changes for 5.18.
Core
 ----
 
  - Introduce XDP multi-buffer support, allowing the use of XDP with
    jumbo frame MTUs and combination with Rx coalescing offloads (LRO).
 
  - Speed up netns dismantling (5x) and lower the memory cost a little.
    Remove unnecessary per-netns sockets. Scope some lists to a netns.
    Cut down RCU syncing. Use batch methods. Allow netdev registration
    to complete out of order.
 
  - Support distinguishing timestamp types (ingress vs egress) and
    maintaining them across packet scrubbing points (e.g. redirect).
 
  - Continue the work of annotating packet drop reasons throughout
    the stack.
 
  - Switch netdev error counters from an atomic to dynamically
    allocated per-CPU counters.
 
  - Rework a few preempt_disable(), local_irq_save() and busy waiting
    sections problematic on PREEMPT_RT.
 
  - Extend the ref_tracker to allow catching use-after-free bugs.
 
 BPF
 ---
 
  - Introduce "packing allocator" for BPF JIT images. JITed code is
    marked read only, and used to be allocated at page granularity.
    Custom allocator allows for more efficient memory use, lower
    iTLB pressure and prevents identity mapping huge pages from
    getting split.
 
  - Make use of BTF type annotations (e.g. __user, __percpu) to enforce
    the correct probe read access method, add appropriate helpers.
 
  - Convert the BPF preload to use light skeleton and drop
    the user-mode-driver dependency.
 
  - Allow XDP BPF_PROG_RUN test infra to send real packets, enabling
    its use as a packet generator.
 
  - Allow local storage memory to be allocated with GFP_KERNEL if called
    from a hook allowed to sleep.
 
  - Introduce fprobe (multi kprobe) to speed up mass attachment (arch
    bits to come later).
 
  - Add unstable conntrack lookup helpers for BPF by using the BPF
    kfunc infra.
 
  - Allow cgroup BPF progs to return custom errors to user space.
 
  - Add support for AF_UNIX iterator batching.
 
  - Allow iterator programs to use sleepable helpers.
 
  - Support JIT of add, and, or, xor and xchg atomic ops on arm64.
 
  - Add BTFGen support to bpftool which allows to use CO-RE in kernels
    without BTF info.
 
  - Large number of libbpf API improvements, cleanups and deprecations.
 
 Protocols
 ---------
 
  - Micro-optimize UDPv6 Tx, gaining up to 5% in test on dummy netdev.
 
  - Adjust TSO packet sizes based on min_rtt, allowing very low latency
    links (data centers) to always send full-sized TSO super-frames.
 
  - Make IPv6 flow label changes (AKA hash rethink) more configurable,
    via sysctl and setsockopt. Distinguish between server and client
    behavior.
 
  - VxLAN support to "collect metadata" devices to terminate only
    configured VNIs. This is similar to VLAN filtering in the bridge.
 
  - Support inserting IPv6 IOAM information to a fraction of frames.
 
  - Add protocol attribute to IP addresses to allow identifying where
    given address comes from (kernel-generated, DHCP etc.)
 
  - Support setting socket and IPv6 options via cmsg on ping6 sockets.
 
  - Reject mis-use of ECN bits in IP headers as part of DSCP/TOS.
    Define dscp_t and stop taking ECN bits into account in fib-rules.
 
  - Add support for locked bridge ports (for 802.1X).
 
  - tun: support NAPI for packets received from batched XDP buffs,
    doubling the performance in some scenarios.
 
  - IPv6 extension header handling in Open vSwitch.
 
  - Support IPv6 control message load balancing in bonding, prevent
    neighbor solicitation and advertisement from using the wrong port.
    Support NS/NA monitor selection similar to existing ARP monitor.
 
  - SMC
    - improve performance with TCP_CORK and sendfile()
    - support auto-corking
    - support TCP_NODELAY
 
  - MCTP (Management Component Transport Protocol)
    - add user space tag control interface
    - I2C binding driver (as specified by DMTF DSP0237)
 
  - Multi-BSSID beacon handling in AP mode for WiFi.
 
  - Bluetooth:
    - handle MSFT Monitor Device Event
    - add MGMT Adv Monitor Device Found/Lost events
 
  - Multi-Path TCP:
    - add support for the SO_SNDTIMEO socket option
    - lots of selftest cleanups and improvements
 
  - Increase the max PDU size in CAN ISOTP to 64 kB.
 
 Driver API
 ----------
 
  - Add HW counters for SW netdevs, a mechanism for devices which
    offload packet forwarding to report packet statistics back to
    software interfaces such as tunnels.
 
  - Select the default NIC queue count as a fraction of number of
    physical CPU cores, instead of hard-coding to 8.
 
  - Expose devlink instance locks to drivers. Allow device layer of
    drivers to use that lock directly instead of creating their own
    which always runs into ordering issues in devlink callbacks.
 
  - Add header/data split indication to guide user space enabling
    of TCP zero-copy Rx.
 
  - Allow configuring completion queue event size.
 
  - Refactor page_pool to enable fragmenting after allocation.
 
  - Add allocation and page reuse statistics to page_pool.
 
  - Improve Multiple Spanning Trees support in the bridge to allow
    reuse of topologies across VLANs, saving HW resources in switches.
 
  - DSA (Distributed Switch Architecture):
    - replay and offload of host VLAN entries
    - offload of static and local FDB entries on LAG interfaces
    - FDB isolation and unicast filtering
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - LAN937x T1 PHYs
    - Davicom DM9051 SPI NIC driver
    - Realtek RTL8367S, RTL8367RB-VB switch and MDIO
    - Microchip ksz8563 switches
    - Netronome NFP3800 SmartNICs
    - Fungible SmartNICs
    - MediaTek MT8195 switches
 
  - WiFi:
    - mt76: MediaTek mt7916
    - mt76: MediaTek mt7921u USB adapters
    - brcmfmac: Broadcom BCM43454/6
 
  - Mobile:
    - iosm: Intel M.2 7360 WWAN card
 
 Drivers
 -------
 
  - Convert many drivers to the new phylink API built for split PCS
    designs but also simplifying other cases.
 
  - Intel Ethernet NICs:
    - add TTY for GNSS module for E810T device
    - improve AF_XDP performance
    - GTP-C and GTP-U filter offload
    - QinQ VLAN support
 
  - Mellanox Ethernet NICs (mlx5):
    - support xdp->data_meta
    - multi-buffer XDP
    - offload tc push_eth and pop_eth actions
 
  - Netronome Ethernet NICs (nfp):
    - flow-independent tc action hardware offload (police / meter)
    - AF_XDP
 
  - Other Ethernet NICs:
    - at803x: fiber and SFP support
    - xgmac: mdio: preamble suppression and custom MDC frequencies
    - r8169: enable ASPM L1.2 if system vendor flags it as safe
    - macb/gem: ZynqMP SGMII
    - hns3: add TX push mode
    - dpaa2-eth: software TSO
    - lan743x: multi-queue, mdio, SGMII, PTP
    - axienet: NAPI and GRO support
 
  - Mellanox Ethernet switches (mlxsw):
    - source and dest IP address rewrites
    - RJ45 ports
 
  - Marvell Ethernet switches (prestera):
    - basic routing offload
    - multi-chain TC ACL offload
 
  - NXP embedded Ethernet switches (ocelot & felix):
    - PTP over UDP with the ocelot-8021q DSA tagging protocol
    - basic QoS classification on Felix DSA switch using dcbnl
    - port mirroring for ocelot switches
 
  - Microchip high-speed industrial Ethernet (sparx5):
    - offloading of bridge port flooding flags
    - PTP Hardware Clock
 
  - Other embedded switches:
    - lan966x: PTP Hardward Clock
    - qca8k: mdio read/write operations via crafted Ethernet packets
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - add LDPC FEC type and 802.11ax High Efficiency data in radiotap
    - enable RX PPDU stats in monitor co-exist mode
 
  - Intel WiFi (iwlwifi):
    - UHB TAS enablement via BIOS
    - band disablement via BIOS
    - channel switch offload
    - 32 Rx AMPDU sessions in newer devices
 
  - MediaTek WiFi (mt76):
    - background radar detection
    - thermal management improvements on mt7915
    - SAR support for more mt76 platforms
    - MBSSID and 6 GHz band on mt7915
 
  - RealTek WiFi:
    - rtw89: AP mode
    - rtw89: 160 MHz channels and 6 GHz band
    - rtw89: hardware scan
 
  - Bluetooth:
    - mt7921s: wake on Bluetooth, SCO over I2S, wide-band-speed (WBS)
 
  - Microchip CAN (mcp251xfd):
    - multiple RX-FIFOs and runtime configurable RX/TX rings
    - internal PLL, runtime PM handling simplification
    - improve chip detection and error handling after wakeup
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmI7YBcACgkQMUZtbf5S
 IrveSBAAmSNJlUK6vPsnNzs7IhsZnfI/AUjm2TCLZnlhKttbpI4A/4Pohk33V7RS
 FGX7f8kjEfhUwrIiLDgeCnztNHRECrCmk6aZc/jLEvecmTauJ+f6kjShkDY/wix+
 AkPHmrZnQeLPAEVuljDdV+sL6ik08+zQL7PazIYHsaSKKC0MGQptRwcri8PLRAKE
 KPBAhVhleq2rAZ/ntprSN52F4Af6rpFTrPIWuN8Bqdbc9dy5094LT0mpOOWYvgr3
 /DLvvAPuLemwyIQkjWknVKBRUAQcmNPC+BY3J8K3LRaiNhekGqOFan46BfqP+k2J
 6DWu0Qrp2yWt4BMOeEToZR5rA6v5suUAMIBu8PRZIDkINXQMlIxHfGjZyNm0rVfw
 7edNri966yus9OdzwPa32MIG3oC6PnVAwYCJAjjBMNS8sSIkp7wgHLkgWN4UFe2H
 K/e6z8TLF4UQ+zFM0aGI5WZ+9QqWkTWEDF3R3OhdFpGrznna0gxmkOeV2YvtsgxY
 cbS0vV9Zj73o+bYzgBKJsw/dAjyLdXoHUGvus26VLQ78S/VGunVKtItwoxBAYmZo
 krW964qcC89YofzSi8RSKLHuEWtNWZbVm8YXr75u6jpr5GhMBu0CYefLs+BuZcxy
 dw8c69cGneVbGZmY2J3rBhDkchbuICl8vdUPatGrOJAoaFdYKuw=
 =ELpe
 -----END PGP SIGNATURE-----

Merge tag 'net-next-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "The sprinkling of SPI drivers is because we added a new one and Mark
  sent us a SPI driver interface conversion pull request.

  Core
  ----

   - Introduce XDP multi-buffer support, allowing the use of XDP with
     jumbo frame MTUs and combination with Rx coalescing offloads (LRO).

   - Speed up netns dismantling (5x) and lower the memory cost a little.
     Remove unnecessary per-netns sockets. Scope some lists to a netns.
     Cut down RCU syncing. Use batch methods. Allow netdev registration
     to complete out of order.

   - Support distinguishing timestamp types (ingress vs egress) and
     maintaining them across packet scrubbing points (e.g. redirect).

   - Continue the work of annotating packet drop reasons throughout the
     stack.

   - Switch netdev error counters from an atomic to dynamically
     allocated per-CPU counters.

   - Rework a few preempt_disable(), local_irq_save() and busy waiting
     sections problematic on PREEMPT_RT.

   - Extend the ref_tracker to allow catching use-after-free bugs.

  BPF
  ---

   - Introduce "packing allocator" for BPF JIT images. JITed code is
     marked read only, and used to be allocated at page granularity.
     Custom allocator allows for more efficient memory use, lower iTLB
     pressure and prevents identity mapping huge pages from getting
     split.

   - Make use of BTF type annotations (e.g. __user, __percpu) to enforce
     the correct probe read access method, add appropriate helpers.

   - Convert the BPF preload to use light skeleton and drop the
     user-mode-driver dependency.

   - Allow XDP BPF_PROG_RUN test infra to send real packets, enabling
     its use as a packet generator.

   - Allow local storage memory to be allocated with GFP_KERNEL if
     called from a hook allowed to sleep.

   - Introduce fprobe (multi kprobe) to speed up mass attachment (arch
     bits to come later).

   - Add unstable conntrack lookup helpers for BPF by using the BPF
     kfunc infra.

   - Allow cgroup BPF progs to return custom errors to user space.

   - Add support for AF_UNIX iterator batching.

   - Allow iterator programs to use sleepable helpers.

   - Support JIT of add, and, or, xor and xchg atomic ops on arm64.

   - Add BTFGen support to bpftool which allows to use CO-RE in kernels
     without BTF info.

   - Large number of libbpf API improvements, cleanups and deprecations.

  Protocols
  ---------

   - Micro-optimize UDPv6 Tx, gaining up to 5% in test on dummy netdev.

   - Adjust TSO packet sizes based on min_rtt, allowing very low latency
     links (data centers) to always send full-sized TSO super-frames.

   - Make IPv6 flow label changes (AKA hash rethink) more configurable,
     via sysctl and setsockopt. Distinguish between server and client
     behavior.

   - VxLAN support to "collect metadata" devices to terminate only
     configured VNIs. This is similar to VLAN filtering in the bridge.

   - Support inserting IPv6 IOAM information to a fraction of frames.

   - Add protocol attribute to IP addresses to allow identifying where
     given address comes from (kernel-generated, DHCP etc.)

   - Support setting socket and IPv6 options via cmsg on ping6 sockets.

   - Reject mis-use of ECN bits in IP headers as part of DSCP/TOS.
     Define dscp_t and stop taking ECN bits into account in fib-rules.

   - Add support for locked bridge ports (for 802.1X).

   - tun: support NAPI for packets received from batched XDP buffs,
     doubling the performance in some scenarios.

   - IPv6 extension header handling in Open vSwitch.

   - Support IPv6 control message load balancing in bonding, prevent
     neighbor solicitation and advertisement from using the wrong port.
     Support NS/NA monitor selection similar to existing ARP monitor.

   - SMC
      - improve performance with TCP_CORK and sendfile()
      - support auto-corking
      - support TCP_NODELAY

   - MCTP (Management Component Transport Protocol)
      - add user space tag control interface
      - I2C binding driver (as specified by DMTF DSP0237)

   - Multi-BSSID beacon handling in AP mode for WiFi.

   - Bluetooth:
      - handle MSFT Monitor Device Event
      - add MGMT Adv Monitor Device Found/Lost events

   - Multi-Path TCP:
      - add support for the SO_SNDTIMEO socket option
      - lots of selftest cleanups and improvements

   - Increase the max PDU size in CAN ISOTP to 64 kB.

  Driver API
  ----------

   - Add HW counters for SW netdevs, a mechanism for devices which
     offload packet forwarding to report packet statistics back to
     software interfaces such as tunnels.

   - Select the default NIC queue count as a fraction of number of
     physical CPU cores, instead of hard-coding to 8.

   - Expose devlink instance locks to drivers. Allow device layer of
     drivers to use that lock directly instead of creating their own
     which always runs into ordering issues in devlink callbacks.

   - Add header/data split indication to guide user space enabling of
     TCP zero-copy Rx.

   - Allow configuring completion queue event size.

   - Refactor page_pool to enable fragmenting after allocation.

   - Add allocation and page reuse statistics to page_pool.

   - Improve Multiple Spanning Trees support in the bridge to allow
     reuse of topologies across VLANs, saving HW resources in switches.

   - DSA (Distributed Switch Architecture):
      - replay and offload of host VLAN entries
      - offload of static and local FDB entries on LAG interfaces
      - FDB isolation and unicast filtering

  New hardware / drivers
  ----------------------

   - Ethernet:
      - LAN937x T1 PHYs
      - Davicom DM9051 SPI NIC driver
      - Realtek RTL8367S, RTL8367RB-VB switch and MDIO
      - Microchip ksz8563 switches
      - Netronome NFP3800 SmartNICs
      - Fungible SmartNICs
      - MediaTek MT8195 switches

   - WiFi:
      - mt76: MediaTek mt7916
      - mt76: MediaTek mt7921u USB adapters
      - brcmfmac: Broadcom BCM43454/6

   - Mobile:
      - iosm: Intel M.2 7360 WWAN card

  Drivers
  -------

   - Convert many drivers to the new phylink API built for split PCS
     designs but also simplifying other cases.

   - Intel Ethernet NICs:
      - add TTY for GNSS module for E810T device
      - improve AF_XDP performance
      - GTP-C and GTP-U filter offload
      - QinQ VLAN support

   - Mellanox Ethernet NICs (mlx5):
      - support xdp->data_meta
      - multi-buffer XDP
      - offload tc push_eth and pop_eth actions

   - Netronome Ethernet NICs (nfp):
      - flow-independent tc action hardware offload (police / meter)
      - AF_XDP

   - Other Ethernet NICs:
      - at803x: fiber and SFP support
      - xgmac: mdio: preamble suppression and custom MDC frequencies
      - r8169: enable ASPM L1.2 if system vendor flags it as safe
      - macb/gem: ZynqMP SGMII
      - hns3: add TX push mode
      - dpaa2-eth: software TSO
      - lan743x: multi-queue, mdio, SGMII, PTP
      - axienet: NAPI and GRO support

   - Mellanox Ethernet switches (mlxsw):
      - source and dest IP address rewrites
      - RJ45 ports

   - Marvell Ethernet switches (prestera):
      - basic routing offload
      - multi-chain TC ACL offload

   - NXP embedded Ethernet switches (ocelot & felix):
      - PTP over UDP with the ocelot-8021q DSA tagging protocol
      - basic QoS classification on Felix DSA switch using dcbnl
      - port mirroring for ocelot switches

   - Microchip high-speed industrial Ethernet (sparx5):
      - offloading of bridge port flooding flags
      - PTP Hardware Clock

   - Other embedded switches:
      - lan966x: PTP Hardward Clock
      - qca8k: mdio read/write operations via crafted Ethernet packets

   - Qualcomm 802.11ax WiFi (ath11k):
      - add LDPC FEC type and 802.11ax High Efficiency data in radiotap
      - enable RX PPDU stats in monitor co-exist mode

   - Intel WiFi (iwlwifi):
      - UHB TAS enablement via BIOS
      - band disablement via BIOS
      - channel switch offload
      - 32 Rx AMPDU sessions in newer devices

   - MediaTek WiFi (mt76):
      - background radar detection
      - thermal management improvements on mt7915
      - SAR support for more mt76 platforms
      - MBSSID and 6 GHz band on mt7915

   - RealTek WiFi:
      - rtw89: AP mode
      - rtw89: 160 MHz channels and 6 GHz band
      - rtw89: hardware scan

   - Bluetooth:
      - mt7921s: wake on Bluetooth, SCO over I2S, wide-band-speed (WBS)

   - Microchip CAN (mcp251xfd):
      - multiple RX-FIFOs and runtime configurable RX/TX rings
      - internal PLL, runtime PM handling simplification
      - improve chip detection and error handling after wakeup"

* tag 'net-next-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2521 commits)
  llc: fix netdevice reference leaks in llc_ui_bind()
  drivers: ethernet: cpsw: fix panic when interrupt coaleceing is set via ethtool
  ice: don't allow to run ice_send_event_to_aux() in atomic ctx
  ice: fix 'scheduling while atomic' on aux critical err interrupt
  net/sched: fix incorrect vlan_push_eth dest field
  net: bridge: mst: Restrict info size queries to bridge ports
  net: marvell: prestera: add missing destroy_workqueue() in prestera_module_init()
  drivers: net: xgene: Fix regression in CRC stripping
  net: geneve: add missing netlink policy and size for IFLA_GENEVE_INNER_PROTO_INHERIT
  net: dsa: fix missing host-filtered multicast addresses
  net/mlx5e: Fix build warning, detected write beyond size of field
  iwlwifi: mvm: Don't fail if PPAG isn't supported
  selftests/bpf: Fix kprobe_multi test.
  Revert "rethook: x86: Add rethook x86 implementation"
  Revert "arm64: rethook: Add arm64 rethook implementation"
  Revert "powerpc: Add rethook support"
  Revert "ARM: rethook: Add rethook arm implementation"
  netdevice: add missing dm_private kdoc
  net: bridge: mst: prevent NULL deref in br_mst_info_size()
  selftests: forwarding: Use same VRF for port and VLAN upper
  ...
2022-03-24 13:13:26 -07:00
Olga Kornievskaia
82ee41b85c SUNRPC don't resend a task on an offlined transport
When a task is being retried, due to an NFS error, if the assigned
transport has been put offline and the task is relocatable pick a new
transport.

Fixes: 6f081693e7 ("sunrpc: remove an offlined xprt using sysfs")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-24 12:06:07 -04:00
NeilBrown
3848e96edf SUNRPC: avoid race between mod_timer() and del_timer_sync()
xprt_destory() claims XPRT_LOCKED and then calls del_timer_sync().
Both xprt_unlock_connect() and xprt_release() call
 ->release_xprt()
which drops XPRT_LOCKED and *then* xprt_schedule_autodisconnect()
which calls mod_timer().

This may result in mod_timer() being called *after* del_timer_sync().
When this happens, the timer may fire long after the xprt has been freed,
and run_timer_softirq() will probably crash.

The pairing of ->release_xprt() and xprt_schedule_autodisconnect() is
always called under ->transport_lock.  So if we take ->transport_lock to
call del_timer_sync(), we can be sure that mod_timer() will run first
(if it runs at all).

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-23 14:50:04 -04:00
Linus Torvalds
3bf03b9a08 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - A few misc subsystems: kthread, scripts, ntfs, ocfs2, block, and vfs

 - Most the MM patches which precede the patches in Willy's tree: kasan,
   pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
   sparsemem, vmalloc, pagealloc, memory-failure, mlock, hugetlb,
   userfaultfd, vmscan, compaction, mempolicy, oom-kill, migration, thp,
   cma, autonuma, psi, ksm, page-poison, madvise, memory-hotplug, rmap,
   zswap, uaccess, ioremap, highmem, cleanups, kfence, hmm, and damon.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (227 commits)
  mm/damon/sysfs: remove repeat container_of() in damon_sysfs_kdamond_release()
  Docs/ABI/testing: add DAMON sysfs interface ABI document
  Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface
  selftests/damon: add a test for DAMON sysfs interface
  mm/damon/sysfs: support DAMOS stats
  mm/damon/sysfs: support DAMOS watermarks
  mm/damon/sysfs: support schemes prioritization
  mm/damon/sysfs: support DAMOS quotas
  mm/damon/sysfs: support DAMON-based Operation Schemes
  mm/damon/sysfs: support the physical address space monitoring
  mm/damon/sysfs: link DAMON for virtual address spaces monitoring
  mm/damon: implement a minimal stub for sysfs-based DAMON interface
  mm/damon/core: add number of each enum type values
  mm/damon/core: allow non-exclusive DAMON start/stop
  Docs/damon: update outdated term 'regions update interval'
  Docs/vm/damon/design: update DAMON-Idle Page Tracking interference handling
  Docs/vm/damon: call low level monitoring primitives the operations
  mm/damon: remove unnecessary CONFIG_DAMON option
  mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}()
  mm/damon/dbgfs-test: fix is_target_id() change
  ...
2022-03-22 16:11:53 -07:00
Muchun Song
fd60b28842 fs: allocate inode by using alloc_inode_sb()
The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() of all filesystems to alloc_inode_sb().

Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>		[ext4]
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alex Shi <alexs@kernel.org>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:03 -07:00
Trond Myklebust
b2648015d4 SUNRPC: Make the rpciod and xprtiod slab allocation modes consistent
Make sure that rpciod and xprtiod are always using the same slab
allocation modes.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22 15:52:55 -04:00
Trond Myklebust
059ee82b64 SUNRPC: Fix unx_lookup_cred() allocation
Default to the same mempool allocation strategy as for rpc_malloc().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22 15:52:55 -04:00
Trond Myklebust
910ad38697 NFS: Fix memory allocation in rpc_alloc_task()
As for rpc_malloc(), we first try allocating from the slab, then fall
back to a non-waiting allocation from the mempool.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22 15:52:55 -04:00
Trond Myklebust
33e5c765bc NFS: Fix memory allocation in rpc_malloc()
When in a low memory situation, we do want rpciod to kick off direct
reclaim in the case where that helps, however we don't want it looping
forever in mempool_alloc().
So first try allocating from the slab using GFP_KERNEL | __GFP_NORETRY,
and then fall back to a GFP_NOWAIT allocation from the mempool.

Ditto for rpc_alloc_task()

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22 15:52:55 -04:00
Trond Myklebust
d0afde5fc6 SUNRPC: Improve accuracy of socket ENOBUFS determination
The current code checks for whether or not the socket is in a writeable
state after we get an EAGAIN. That is racy, since we've dropped the
socket lock, so the amount of free buffer may have changed.

Instead, let's check whether the socket is writeable before we try to
write to it. If that was the case, we do expect the message to be at
least partially sent unless we're in a low memory situation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22 15:52:55 -04:00
Trond Myklebust
2790a624d4 SUNRPC: Replace internal use of SOCKWQ_ASYNC_NOSPACE
The socket's SOCKWQ_ASYNC_NOSPACE can be cleared by various actors in
the socket layer, so replace it with our own flag in the transport
sock_state field.

Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22 15:52:55 -04:00
Trond Myklebust
7496b59f58 SUNRPC: Fix socket waits for write buffer space
The socket layer requires that we use the socket lock to protect changes
to the sock->sk_write_pending field and others.

Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22 15:52:55 -04:00
Trond Myklebust
3b21f757c3 SUNRPC: Only save the TCP source port after the connection is complete
Since the RPC client uses a non-blocking connect(), we do not expect to
see it return '0' under normal circumstances.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22 15:52:55 -04:00
Trond Myklebust
89f42494f9 SUNRPC: Don't call connect() more than once on a TCP socket
Avoid socket state races due to repeated calls to ->connect() using the
same socket. If connect() returns 0 due to the connection having
completed, but we are in fact in a closing state, then we may leave the
XPRT_CONNECTING flag set on the transport.

Reported-by: Enrico Scholz <enrico.scholz@sigma-chemnitz.de>
Fixes: 3be232f11a ("SUNRPC: Prevent immediate close+reconnect")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22 15:52:55 -04:00
NeilBrown
693486d5f8 SUNRPC: change locking for xs_swap_enable/disable
It is not in general safe to wait for XPRT_LOCKED to clear.
A wakeup is only sent when
 - connection completes
 - sock close completes
so during normal operations, this can wait indefinitely.

The event we need to protect against is ->inet being set to NULL, and
that happens under the recv_mutex lock.

So drop the handlign of XPRT_LOCKED and use recv_mutex instead.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13 12:59:36 -04:00
NeilBrown
4dc73c6791 NFSv4: keep state manager thread active if swap is enabled
If we are swapping over NFSv4, we may not be able to allocate memory to
start the state-manager thread at the time when we need it.
So keep it always running when swap is enabled, and just signal it to
start.

This requires updating and testing the cl_swapper count on the root
rpc_clnt after following all ->cl_parent links.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13 12:59:35 -04:00
NeilBrown
8db55a032a SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC
rpc tasks can be marked as RPC_TASK_SWAPPER.  This causes GFP_MEMALLOC
to be used for some allocations.  This is needed in some cases, but not
in all where it is currently provided, and in some where it isn't
provided.

Currently *all* tasks associated with a rpc_client on which swap is
enabled get the flag and hence some GFP_MEMALLOC support.

GFP_MEMALLOC is provided for ->buf_alloc() but only swap-writes need it.
However xdr_alloc_bvec does not get GFP_MEMALLOC - though it often does
need it.

xdr_alloc_bvec is called while the XPRT_LOCK is held.  If this blocks,
then it blocks all other queued tasks.  So this allocation needs
GFP_MEMALLOC for *all* requests, not just writes, when the xprt is used
for any swap writes.

Similarly, if the transport is not connected, that will block all
requests including swap writes, so memory allocations should get
GFP_MEMALLOC if swap writes are possible.

So with this patch:
 1/ we ONLY set RPC_TASK_SWAPPER for swap writes.
 2/ __rpc_execute() sets PF_MEMALLOC while handling any task
    with RPC_TASK_SWAPPER set, or when handling any task that
    holds the XPRT_LOCKED lock on an xprt used for swap.
    This removes the need for the RPC_IS_SWAPPER() test
    in ->buf_alloc handlers.
 3/ xprt_prepare_transmit() sets PF_MEMALLOC after locking
    any task to a swapper xprt.  __rpc_execute() will clear it.
 3/ PF_MEMALLOC is set for all the connect workers.

Reviewed-by: Chuck Lever <chuck.lever@oracle.com> (for xprtrdma parts)
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13 12:59:35 -04:00
NeilBrown
89c2be8a95 NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
NFS_RPC_SWAPFLAGS is only used for READ requests.
It sets RPC_TASK_SWAPPER which gives some memory-allocation priority to
requests.  This is not needed for swap READ - though it is for writes
where it is set via a different mechanism.

RPC_TASK_ROOTCREDS causes the 'machine' credential to be used.
This is not needed as the root credential is saved when the swap file is
opened, and this is used for all IO.

So NFS_RPC_SWAPFLAGS isn't needed, and as it is the only user of
RPC_TASK_ROOTCREDS, that isn't needed either.

Remove both.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13 12:59:35 -04:00
NeilBrown
a80a846186 SUNRPC: remove scheduling boost for "SWAPPER" tasks.
Currently, tasks marked as "swapper" tasks get put to the front of
non-priority rpc_queues, and are sorted earlier than non-swapper tasks on
the transport's ->xmit_queue.

This is pointless as currently *all* tasks for a mount that has swap
enabled on *any* file are marked as "swapper" tasks.  So the net result
is that the non-priority rpc_queues are reverse-ordered (LIFO).

This scheduling boost is not necessary to avoid deadlocks, and hurts
fairness, so remove it.  If there were a need to expedite some requests,
the tk_priority mechanism is a more appropriate tool.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13 12:59:35 -04:00
NeilBrown
a721035477 SUNRPC/xprt: async tasks mustn't block waiting for memory
When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.  So it
must not block waiting for memory.

xprt_dynamic_alloc_slot can block indefinitely.  This can tie up all
workqueue threads and NFS can deadlock.  So when called from a
workqueue, set __GFP_NORETRY.

The rdma alloc_slot already does not block.  However it sets the error
to -EAGAIN suggesting this will trigger a sleep.  It does not.  As we
can see in call_reserveresult(), only -ENOMEM causes a sleep.  -EAGAIN
causes immediate retry.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13 12:59:35 -04:00
NeilBrown
a41b05edfe SUNRPC/auth: async tasks mustn't block waiting for memory
When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.  So it
must not block waiting for memory.

mempools are particularly a problem as memory can only be released back
to the mempool by an async rpc task running.  If all available workqueue
threads are waiting on the mempool, no thread is available to return
anything.

lookup_cred() can block on a mempool or kmalloc - and this can cause
deadlocks.  So add a new RPCAUTH_LOOKUP flag for async lookups and don't
block on memory.  If the -ENOMEM gets back to call_refreshresult(), wait
a short while and try again.  HZ>>4 is chosen as it is used elsewhere
for -ENOMEM retries.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13 12:59:35 -04:00
NeilBrown
c487216bec SUNRPC/call_alloc: async tasks mustn't block waiting for memory
When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.
So it must not block waiting for memory.

mempools are particularly a problem as memory can only be released back
to the mempool by an async rpc task running.  If all available
workqueue threads are waiting on the mempool, no thread is available to
return anything.

rpc_malloc() can block, and this might cause deadlocks.
So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if
blocking is acceptable.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13 12:59:35 -04:00
Chuck Lever
74aaf96fea SUNRPC: Teach server to recognize RPC_AUTH_TLS
Initial support for the RPC_AUTH_TLS authentication flavor enables
NFSD to eventually accept an RPC_AUTH_TLS probe from clients. This
patch simply prevents NFSD from rejecting these probes completely.

In the meantime, graft this support in now so that RPC_AUTH_TLS
support keeps up with generic code and API changes in the RPC
server.

Down the road, server-side transport implementations will populate
xpo_start_tls when they can support RPC-with-TLS. For example, TCP
will eventually populate it, but RDMA won't.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-02-28 10:26:40 -05:00
Chuck Lever
37902c6313 NFSD: Move svc_serv_ops::svo_function into struct svc_serv
Hoist svo_function back into svc_serv and remove struct
svc_serv_ops, since the struct is now devoid of fields.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-02-28 10:26:40 -05:00
Chuck Lever
f49169c97f NFSD: Remove svc_serv_ops::svo_module
struct svc_serv_ops is about to be removed.

Neil Brown says:
> I suspect svo_module can go as well - I don't think the thread is
> ever the thing that primarily keeps a module active.

A random sample of kthread_create() callers shows sunrpc is the only
one that manages module reference count in this way.

Suggested-by: Neil Brown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-02-28 10:26:40 -05:00
Chuck Lever
c7d7ec8f04 SUNRPC: Remove svc_shutdown_net()
Clean up: svc_shutdown_net() now does nothing but call
svc_close_net(). Replace all external call sites.

svc_close_net() is renamed to be the inverse of svc_xprt_create().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-02-28 10:26:40 -05:00
Chuck Lever
4355d767a2 SUNRPC: Rename svc_close_xprt()
Clean up: Use the "svc_xprt_<task>" function naming convention as
is used for other external APIs.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-02-28 10:26:40 -05:00
Chuck Lever
352ad31448 SUNRPC: Rename svc_create_xprt()
Clean up: Use the "svc_xprt_<task>" function naming convention as
is used for other external APIs.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-02-28 10:26:39 -05:00
Chuck Lever
87cdd8641c SUNRPC: Remove svo_shutdown method
Clean up. Neil observed that "any code that calls svc_shutdown_net()
knows what the shutdown function should be, and so can call it
directly."

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neilb@suse.de>
2022-02-28 10:26:39 -05:00
Chuck Lever
c0219c4997 SUNRPC: Merge svc_do_enqueue_xprt() into svc_enqueue_xprt()
Neil says:
"These functions were separated in commit 0971374e28 ("SUNRPC:
Reduce contention in svc_xprt_enqueue()") so that the XPT_BUSY check
happened before taking any spinlocks.

We have since moved or removed the spinlocks so the extra test is
fairly pointless."

I've made this a separate patch in case the XPT_BUSY change has
unexpected consequences and needs to be reverted.

Suggested-by: Neil Brown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-02-28 10:26:39 -05:00
Chuck Lever
a9ff2e99e9 SUNRPC: Remove the .svo_enqueue_xprt method
We have never been able to track down and address the underlying
cause of the performance issues with workqueue-based service
support. svo_enqueue_xprt is called multiple times per RPC, so
it adds instruction path length, but always ends up at the same
function: svc_xprt_do_enqueue(). We do not anticipate needing
this flexibility for dynamic nfsd thread management support.

As a micro-optimization, remove .svo_enqueue_xprt because
Spectre/Meltdown makes virtual function calls more costly.

This change essentially reverts commit b9e13cdfac ("nfsd/sunrpc:
turn enqueueing a svc_xprt into a svc_serv operation").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-02-28 10:26:39 -05:00
Trond Myklebust
46442b850e SUNRPC/xprtrdma: Convert GFP_NOFS to GFP_KERNEL
Assume that the upper layers have set memalloc_nofs_save/restore as
appropriate.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-02-25 18:50:12 -05:00
Trond Myklebust
4c2883e77c SUNRPC/auth_gss: Convert GFP_NOFS to GFP_KERNEL
Assume that the upper layers have set memalloc_nofs_save/restore as
appropriate.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-02-25 18:50:12 -05:00
Trond Myklebust
0adc879406 SUNRPC: Convert GFP_NOFS to GFP_KERNEL
The sections which should not re-enter the filesystem are already
protected with memalloc_nofs_save/restore calls, so it is better to use
GFP_KERNEL in these calls to allow better performance for synchronous
RPC calls.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-02-25 18:50:12 -05:00
Colin Ian King
ab22e2cbbc SUNRPC: remove redundant pointer plainhdr
[You don't often get email from colin.i.king@gmail.com. Learn why this is important at http://aka.ms/LearnAboutSenderIdentification.]

Pointer plainhdr is being assigned a value that is never read, the
pointer is redundant and can be removed.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-02-25 18:50:12 -05:00
Jakub Kicinski
5b91c5cc0e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-10 17:29:56 -08:00
NeilBrown
b49ea673e1 SUNRPC: lock against ->sock changing during sysfs read
->sock can be set to NULL asynchronously unless ->recv_mutex is held.
So it is important to hold that mutex.  Otherwise a sysfs read can
trigger an oops.
Commit 17f09d3f61 ("SUNRPC: Check if the xprt is connected before
handling sysfs reads") appears to attempt to fix this problem, but it
only narrows the race window.

Fixes: 17f09d3f61 ("SUNRPC: Check if the xprt is connected before handling sysfs reads")
Fixes: a8482488a7 ("SUNRPC query transport's source port")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-02-08 09:14:26 -05:00
Dan Aloni
a9c10b5b3b xprtrdma: fix pointer derefs in error cases of rpcrdma_ep_create
If there are failures then we must not leave the non-NULL pointers with
the error value, otherwise `rpcrdma_ep_destroy` gets confused and tries
free them, resulting in an Oops.

Signed-off-by: Dan Aloni <dan.aloni@vastdata.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-02-08 09:14:26 -05:00
Jakub Kicinski
c59400a68c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-03 17:36:16 -08:00
Linus Torvalds
4897e722b5 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmHz0QsACgkQnJ2qBz9k
 QNkN+AgA6XqWHKYyElfgJFt1UqaoNMz/Faz9H/+PKiBNSTf6/+67D+V7DFz6jJrv
 dDwHNzfDg9kR+pbAAPwhl2jfnQoUlsr019Hrqa5HpWlj5geVpbdunYUzS2WOkwqD
 /m+brcLgPdKb2uIysj6wOh9B7wa8V9ksl3EjQvvwaHaU0p1YLUqidVXucYvs8DUo
 bgXNaj9GmeysxnmU+aILotWuuXH2vOP4Q2Uk4qz3rN6xW9eEXtpQ4y7gWBp/GA8y
 Ia8FtFdQdvlSDOJYMdPOTBu5RB7gY9ElrapvVaWNYtCWI/jRv666nZsLaERYNhLN
 uUsG4MWjYbOqW5XqFDbSOwbDqvMh5Q==
 =mEFA
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify fixes from Jan Kara:
 "Fixes for userspace breakage caused by fsnotify changes ~3 years ago
  and one fanotify cleanup"

* tag 'fsnotify_for_v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: fix fsnotify hooks in pseudo filesystems
  fsnotify: invalidate dcache before IN_DELETE event
  fanotify: remove variable set but not used
2022-01-28 17:51:31 +02:00
Eric Dumazet
b9a0d6d143 SUNRPC: add netns refcount tracker to struct rpc_xprt
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-28 14:47:55 +00:00
Eric Dumazet
9b1831e56c SUNRPC: add netns refcount tracker to struct gss_auth
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-28 14:47:55 +00:00
Eric Dumazet
6cdef8a6ee SUNRPC: add netns refcount tracker to struct svc_xprt
struct svc_xprt holds a long lived reference to a netns,
it is worth tracking it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-28 14:47:55 +00:00
Linus Torvalds
0280e3c58f NFS Client Updates for Linux 5.17
- New Features:
   - Basic handling for case insensitive filesystems
   - Initial support for fs_locations and server trunking
 
 - Bugfixes and Cleanups:
   - Cleanups to how the "struct cred *" is handled for the nfs_access_entry
   - Ensure the server has an up to date ctimes before hardlinking or renaming
   - Update 'blocks used' after writeback, fallocate, and clone
   - nfs_atomic_open() fixes
   - Improvements to sunrpc tracing
   - Various null check & indenting related cleanups
   - Some improvements to the sunrpc sysfs code
     - Use default_groups in kobj_type
     - Fix some potential races and reference leaks
   - A few tracepoint cleanups in xprtrdma
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmHodsIACgkQ18tUv7Cl
 QOt3xQ//c9JPmMJZZoZtaD5UrHg28iyxaJpOUUpwC/jxQhLOETCf+nU1cELYgLq5
 4W06NBYEmjDJ/tihUvcGMKLvbCtQR9Zl9HepFKDTLTQpGmRFD4enwSmMNvW/AV+h
 I7PoN6J1DX/TZ5InOHH9asyoC2MjwrNHMn3bbQVT0qy+i3T76zJiBF79eWTnPR48
 kKPnF1I0p4LKGJy+y+y/z2mdCsz7tzFkhssxVhot0nafxXzbUOp1H9aiwxroRiUC
 ljbBA0TX8FWkGpGFt3y2QK2fMD7ovDpRhLFYiJClmeERXJVH5mXL9O5XfN5AL0xe
 W/QqT5lbWfeHLkpm2j87yTyaHASC7hGKsAyPD0zWLDcNZws61l1Sy4BHymSE5ZVh
 zt7sJjBnOWAtntyUGBg78G2vhBsd63GzrtcqAOlrngwA5ohJ8f32qvBQGyw4MQu9
 75CjRcO8K8mnf9BJ6I1vYPycjkUh9RSFfNdnUEAI9ZwiTEC/hfEvH/omvEtZsNol
 jBgv2SItTkdMZlEppEL4gxuaYT2wiZf2C6Gco215iPAqLC6dudoroN6yoLk/LRd0
 OWZLl5XTr3j6m5QDm22k5CG080vl6XiAxmAFaFSLza6Q34Jmuluc0gLAZZxvqXk9
 Ay7dQt9PQQk6mXD5Hreb0E5N9zcm2LkfvWpyGJ7mTV7sSHjA2DU=
 =wcVT
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "New Features:

   - Basic handling for case insensitive filesystems

   - Initial support for fs_locations and server trunking

  Bugfixes and Cleanups:

   - Cleanups to how the "struct cred *" is handled for the
     nfs_access_entry

   - Ensure the server has an up to date ctimes before hardlinking or
     renaming

   - Update 'blocks used' after writeback, fallocate, and clone

   - nfs_atomic_open() fixes

   - Improvements to sunrpc tracing

   - Various null check & indenting related cleanups

   - Some improvements to the sunrpc sysfs code:
      - Use default_groups in kobj_type
      - Fix some potential races and reference leaks

   - A few tracepoint cleanups in xprtrdma"

[ This should have gone in during the merge window, but didn't. The
  original pull request - sent during the merge window - had gotten
  marked as spam and discarded due missing DKIM headers in the email
  from Anna.   - Linus ]

* tag 'nfs-for-5.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (35 commits)
  SUNRPC: Don't dereference xprt->snd_task if it's a cookie
  xprtrdma: Remove definitions of RPCDBG_FACILITY
  xprtrdma: Remove final dprintk call sites from xprtrdma
  sunrpc: Fix potential race conditions in rpc_sysfs_xprt_state_change()
  net/sunrpc: fix reference count leaks in rpc_sysfs_xprt_state_change
  NFSv4.1 test and add 4.1 trunking transport
  SUNRPC allow for unspecified transport time in rpc_clnt_add_xprt
  NFSv4 handle port presence in fs_location server string
  NFSv4 expose nfs_parse_server_name function
  NFSv4.1 query for fs_location attr on a new file system
  NFSv4 store server support for fs_location attribute
  NFSv4 remove zero number of fs_locations entries error check
  NFSv4: nfs_atomic_open() can race when looking up a non-regular file
  NFSv4: Handle case where the lookup of a directory fails
  NFSv42: Fallocate and clone should also request 'blocks used'
  NFSv4: Allow writebacks to request 'blocks used'
  SUNRPC: use default_groups in kobj_type
  NFS: use default_groups in kobj_type
  NFS: Fix the verifier for case sensitive filesystem in nfs_atomic_open()
  NFS: Add a helper to remove case-insensitive aliases
  ...
2022-01-25 20:16:03 +02:00
Amir Goldstein
29044dae2e fsnotify: fix fsnotify hooks in pseudo filesystems
Commit 49246466a9 ("fsnotify: move fsnotify_nameremove() hook out of
d_delete()") moved the fsnotify delete hook before d_delete() so fsnotify
will have access to a positive dentry.

This allowed a race where opening the deleted file via cached dentry
is now possible after receiving the IN_DELETE event.

To fix the regression in pseudo filesystems, convert d_delete() calls
to d_drop() (see commit 46c46f8df9 ("devpts_pty_kill(): don't bother
with d_delete()") and move the fsnotify hook after d_drop().

Add a missing fsnotify_unlink() hook in nfsdfs that was found during
the audit of fsnotify hooks in pseudo filesystems.

Note that the fsnotify hooks in simple_recursive_removal() follow
d_invalidate(), so they require no change.

Link: https://lore.kernel.org/r/20220120215305.282577-2-amir73il@gmail.com
Reported-by: Ivan Delalande <colona@arista.com>
Link: https://lore.kernel.org/linux-fsdevel/YeNyzoDM5hP5LtGW@visor/
Fixes: 49246466a9 ("fsnotify: move fsnotify_nameremove() hook out of d_delete()")
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2022-01-24 14:17:02 +01:00
Muchun Song
359745d783 proc: remove PDE_DATA() completely
Remove PDE_DATA() completely and replace it with pde_data().

[akpm@linux-foundation.org: fix naming clash in drivers/nubus/proc.c]
[akpm@linux-foundation.org: now fix it properly]

Link: https://lkml.kernel.org/r/20211124081956.87711-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-22 08:33:37 +02:00
Linus Torvalds
175398a097 Highlights:
- Bruce steps down as NFSD maintainer
 - Prepare for dynamic nfsd thread management
 - More work on supporting re-exporting NFS mounts
 - One fs/locks patch on behalf of Jeff Layton
 
 Notable bug fixes:
 - Fix zero-length NFSv3 WRITEs
 - Fix directory cinfo on FS's that do not support iversion
 - Fix WRITE verifiers for stable writes
 - Fix crash on COPY_NOTIFY with a special state ID
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmHcWOMACgkQM2qzM29m
 f5dh0Q/+MjEL0IK551FdChx9Es1JqKRggv9KwJkLIoa1bw/PMSwP2pnKz6eL0Yun
 mdhE9AZQgyFH1IAGdqjeLZKIYRin6bvAdDrnlqQ9SvTviPLWniSUI6AuyUqK6Zyk
 wMcXpyOze0fhpxkYmz8/g7i66w967tmLh5MRvV1dkpOYAe99rYwGhvj+9ZeEWfNI
 TgmptntMG6YEb+xY0E73otXZHMr2DL67ZYvOUYWemJA1uxcX4joaWBg8sx74dB6k
 DUB4BFuoURk6viDD1QYh3qPU3dz9RCJNMz/cWd8+2t7BdaujTSXRIcaFslrQnKfL
 Rm+O7pi5W+XohFDjeuMZ1g0c1ot/aoZSaAz00LoCVhejJ/sK9NiPAN1+LyY91Lja
 cUBMVPNfW7ClIpiZcORP/chNmVn2qlaL2nxzSY/Uegnd5pIIeVD0pFVgx4+NlEat
 mbrrQBcMpBRM0B+RzHS6AusqHrGdSEcwqWoVXWdxsBigJQT/AxWmii3U88k0Z54i
 ooMWLaQ9EBBmygV01JN/OBySW2M/dvbfz3eFROvAVqsIP9JWP3FlUOlRDl8GcjXA
 azi9fTysBom7WtL6NPcxDJbJ2t9hYr2YaztTpdo9YCHOuQbSQT6IWR5PAa3zvwMu
 Bfz6Y8Hoo/KZHCqmkPGYM+x1ENCyDPv788E+erdnw1PFP5F3Pbo=
 =/kX3
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "Bruce has announced he is leaving Red Hat at the end of the month and
  is stepping back from his role as NFSD co-maintainer. As a result,
  this includes a patch removing him from the MAINTAINERS file.

  There is one patch in here that Jeff Layton was carrying in the locks
  tree. Since he had only one for this cycle, he asked us to send it to
  you via the nfsd tree.

  There continues to be 0-day reports from Robert Morris @MIT. This time
  we include a fix for a crash in the COPY_NOTIFY operation.

  Highlights:
   - Bruce steps down as NFSD maintainer
   - Prepare for dynamic nfsd thread management
   - More work on supporting re-exporting NFS mounts
   - One fs/locks patch on behalf of Jeff Layton

  Notable bug fixes:
   - Fix zero-length NFSv3 WRITEs
   - Fix directory cinfo on FS's that do not support iversion
   - Fix WRITE verifiers for stable writes
   - Fix crash on COPY_NOTIFY with a special state ID"

* tag 'nfsd-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (51 commits)
  SUNRPC: Fix sockaddr handling in svcsock_accept_class trace points
  SUNRPC: Fix sockaddr handling in the svc_xprt_create_error trace point
  fs/locks: fix fcntl_getlk64/fcntl_setlk64 stub prototypes
  nfsd: fix crash on COPY_NOTIFY with special stateid
  MAINTAINERS: remove bfields
  NFSD: Move fill_pre_wcc() and fill_post_wcc()
  Revert "nfsd: skip some unnecessary stats in the v4 case"
  NFSD: Trace boot verifier resets
  NFSD: Rename boot verifier functions
  NFSD: Clean up the nfsd_net::nfssvc_boot field
  NFSD: Write verifier might go backwards
  nfsd: Add a tracepoint for errors in nfsd4_clone_file_range()
  NFSD: De-duplicate net_generic(nf->nf_net, nfsd_net_id)
  NFSD: De-duplicate net_generic(SVC_NET(rqstp), nfsd_net_id)
  NFSD: Clean up nfsd_vfs_write()
  nfsd: Replace use of rwsem with errseq_t
  NFSD: Fix verifier returned in stable WRITEs
  nfsd: Retry once in nfsd_open on an -EOPENSTALE return
  nfsd: Add errno mapping for EREMOTEIO
  nfsd: map EBADF
  ...
2022-01-16 07:42:58 +02:00
NeilBrown
4034247a0d mm: introduce memalloc_retry_wait()
Various places in the kernel - largely in filesystems - respond to a
memory allocation failure by looping around and re-trying.  Some of
these cannot conveniently use __GFP_NOFAIL, for reasons such as:

 - a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
 - a need to check for the process being signalled between failures
 - the possibility that other recovery actions could be performed
 - the allocation is quite deep in support code, and passing down an
   extra flag to say if __GFP_NOFAIL is wanted would be clumsy.

Many of these currently use congestion_wait() which (in almost all
cases) simply waits the given timeout - congestion isn't tracked for
most devices.

It isn't clear what the best delay is for loops, but it is clear that
the various filesystems shouldn't be responsible for choosing a timeout.

This patch introduces memalloc_retry_wait() with takes on that
responsibility.  Code that wants to retry a memory allocation can call
this function passing the GFP flags that were used.  It will wait
however is appropriate.

For now, it only considers __GFP_NORETRY and whatever
gfpflags_allow_blocking() tests.  If blocking is allowed without
__GFP_NORETRY, then alloc_page either made some reclaim progress, or
waited for a while, before failing.  So there is no need for much
further waiting.  memalloc_retry_wait() will wait until the current
jiffie ends.  If this condition is not met, then alloc_page() won't have
waited much if at all.  In that case memalloc_retry_wait() waits about
200ms.  This is the delay that most current loops uses.

linux/sched/mm.h needs to be included in some files now,
but linux/backing-dev.h does not.

Link: https://lkml.kernel.org/r/163754371968.13692.1277530886009912421@noble.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15 16:30:29 +02:00
Chuck Lever
c0f26167dd xprtrdma: Remove definitions of RPCDBG_FACILITY
Deprecated. dprintk is no longer used in xprtrdma.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-14 10:35:08 -05:00
Chuck Lever
c03061e7a2 xprtrdma: Remove final dprintk call sites from xprtrdma
Deprecated. This information is available via tracepoints.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-14 10:33:31 -05:00
Anna Schumaker
1a48db3fef sunrpc: Fix potential race conditions in rpc_sysfs_xprt_state_change()
We need to use test_and_set_bit() when changing xprt state flags to
avoid potentially getting xps->xps_nactive out of sync.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-13 09:36:58 -05:00
Xiyu Yang
776d794f28 net/sunrpc: fix reference count leaks in rpc_sysfs_xprt_state_change
The refcount leak issues take place in an error handling path. When the
3rd argument buf doesn't match with "offline", "online" or "remove", the
function simply returns -EINVAL and forgets to decrease the reference
count of a rpc_xprt object and a rpc_xprt_switch object increased by
rpc_sysfs_xprt_kobj_get_xprt() and
rpc_sysfs_xprt_kobj_get_xprt_switch(), causing reference count leaks of
both unused objects.

Fix this issue by jumping to the error handling path labelled with
out_put when buf matches none of "offline", "online" or "remove".

Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Xiong <xiongx18@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-13 09:36:58 -05:00
Olga Kornievskaia
b8a09619a5 SUNRPC allow for unspecified transport time in rpc_clnt_add_xprt
If the supplied argument doesn't specify the transport type, use the
type of the existing rpc clnt and its existing transport.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-13 09:36:58 -05:00
Chuck Lever
dc6c6fb3d6 SUNRPC: Fix sockaddr handling in the svc_xprt_create_error trace point
While testing, I got an unexpected KASAN splat:

Jan 08 13:50:27 oracle-102.nfsv4.dev kernel: BUG: KASAN: stack-out-of-bounds in trace_event_raw_event_svc_xprt_create_err+0x190/0x210 [sunrpc]
Jan 08 13:50:27 oracle-102.nfsv4.dev kernel: Read of size 28 at addr ffffc9000008f728 by task mount.nfs/4628

The memcpy() in the TP_fast_assign section of this trace point
copies the size of the destination buffer in order that the buffer
won't be overrun.

In other similar trace points, the source buffer for this memcpy is
a "struct sockaddr_storage" so the actual length of the source
buffer is always long enough to prevent the memcpy from reading
uninitialized or unallocated memory.

However, for this trace point, the source buffer can be as small as
a "struct sockaddr_in". For AF_INET sockaddrs, the memcpy() reads
memory that follows the source buffer, which is not always valid
memory.

To avoid copying past the end of the passed-in sockaddr, make the
source address's length available to the memcpy(). It would be a
little nicer if the tracing infrastructure was more friendly about
storing socket addresses that are not AF_INET, but I could not find
a way to make printk("%pIS") work with a dynamic array.

Reported-by: KASAN
Fixes: 4b8f380e46 ("SUNRPC: Tracepoint to record errors in svc_xpo_create()")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-01-10 10:57:33 -05:00
Greg Kroah-Hartman
86439fa267 SUNRPC: use default_groups in kobj_type
There are currently 2 ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field.  Move the sunrpc sysfs code to use default_groups field which has
been the preferred way since aa30f47cf6 ("kobject: Add support for
default attribute groups to kobj_type") so that we can soon get rid of
the obsolete default_attrs field.

Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: linux-nfs@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06 14:00:21 -05:00
Jiapeng Chong
c4f0396688 SUNRPC: clean up some inconsistent indenting
Eliminate the follow smatch warning:

net/sunrpc/xprtsock.c:1912 xs_local_connect() warn: inconsistent
indenting.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06 14:00:20 -05:00
Xu Wang
35e0f9a9af sunrpc: Remove unneeded null check
In g_verify_token_header, the null check of 'ret'
is unneeded to be done twice.

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06 14:00:20 -05:00
Chuck Lever
5089f3d975 SUNRPC: Remove low signal-to-noise tracepoints
I'm about to add more information to the server-side SUNRPC
tracepoints, so I'm going to offset the increased trace log
consumption by getting rid of some tracepoints that fire frequently
but don't offer much value.

trace_svc_xprt_received() was useful for debugging, perhaps, but
is not generally informative.

trace_svc_handle_xprt() reports largely the same information as
trace_svc_xdr_recvfrom().

As a clean-up, rename trace_svc_xprt_do_enqueue() to match
svc_xprt_dequeue().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-12-13 13:43:00 -05:00
NeilBrown
6b044fbaab lockd: use svc_set_num_threads() for thread start and stop
svc_set_num_threads() does everything that lockd_start_svc() does, except
set sv_maxconn.  It also (when passed 0) finds the threads and
stops them with kthread_stop().

So move the setting for sv_maxconn, and use svc_set_num_thread()

We now don't need nlmsvc_task.

Now that we use svc_set_num_threads() it makes sense to set svo_module.
This request that the thread exists with module_put_and_exit().
Also fix the documentation for svo_module to make this explicit.

svc_prepare_thread is now only used where it is defined, so it can be
made static.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-12-13 13:42:58 -05:00
NeilBrown
93aa619eb0 SUNRPC: always treat sv_nrpools==1 as "not pooled"
Currently 'pooled' services hold a reference on the pool_map, and
'unpooled' services do not.
svc_destroy() uses the presence of ->svo_function (via
svc_serv_is_pooled()) to determine if the reference should be dropped.
There is no direct correlation between being pooled and the use of
svo_function, though in practice, lockd is the only non-pooled service,
and the only one not to use svo_function.

This is untidy and would cause problems if we changed lockd to use
svc_set_num_threads(), which requires the use of ->svo_function.

So change the test for "is the service pooled" to "is sv_nrpools > 1".

This means that when svc_pool_map_get() returns 1, it must NOT take a
reference to the pool.

We discard svc_serv_is_pooled(), and test sv_nrpools directly.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-12-13 13:42:57 -05:00
NeilBrown
cf0e124e0a SUNRPC: move the pool_map definitions (back) into svc.c
These definitions are not used outside of svc.c, and there is no
evidence that they ever have been.  So move them into svc.c
and make the declarations 'static'.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-12-13 13:42:57 -05:00
NeilBrown
3ebdbe5203 SUNRPC: discard svo_setup and rename svc_set_num_threads_sync()
The ->svo_setup callback serves no purpose.  It is always called from
within the same module that chooses which callback is needed.  So
discard it and call the relevant function directly.

Now that svc_set_num_threads() is no longer used remove it and rename
svc_set_num_threads_sync() to remove the "_sync" suffix.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-12-13 13:42:53 -05:00
NeilBrown
2a36395fac SUNRPC: use sv_lock to protect updates to sv_nrthreads.
Using sv_lock means we don't need to hold the service mutex over these
updates.

In particular,  svc_exit_thread() no longer requires synchronisation, so
threads can exit asynchronously.

Note that we could use an atomic_t, but as there are many more read
sites than writes, that would add unnecessary noise to the code.
Some reads are already racy, and there is no need for them to not be.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-12-13 13:42:52 -05:00
NeilBrown
ec52361df9 SUNRPC: stop using ->sv_nrthreads as a refcount
The use of sv_nrthreads as a general refcount results in clumsy code, as
is seen by various comments needed to explain the situation.

This patch introduces a 'struct kref' and uses that for reference
counting, leaving sv_nrthreads to be a pure count of threads.  The kref
is managed particularly in svc_get() and svc_put(), and also nfsd_put();

svc_destroy() now takes a pointer to the embedded kref, rather than to
the serv.

nfsd allows the svc_serv to exist with ->sv_nrhtreads being zero.  This
happens when a transport is created before the first thread is started.
To support this, a 'keep_active' flag is introduced which holds a ref on
the svc_serv.  This is set when any listening socket is successfully
added (unless there are running threads), and cleared when the number of
threads is set.  So when the last thread exits, the nfs_serv will be
destroyed.
The use of 'keep_active' replaces previous code which checked if there
were any permanent sockets.

We no longer clear ->rq_server when nfsd() exits.  This was done
to prevent svc_exit_thread() from calling svc_destroy().
Instead we take an extra reference to the svc_serv to prevent
svc_destroy() from being called.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-12-13 13:42:51 -05:00
NeilBrown
8c62d12740 SUNRPC/NFSD: clean up get/put functions.
svc_destroy() is poorly named - it doesn't necessarily destroy the svc,
it might just reduce the ref count.
nfsd_destroy() is poorly named for the same reason.

This patch:
 - removes the refcount functionality from svc_destroy(), moving it to
   a new svc_put().  Almost all previous callers of svc_destroy() now
   call svc_put().
 - renames nfsd_destroy() to nfsd_put() and improves the code, using
   the new svc_destroy() rather than svc_put()
 - removes a few comments that explain the important for balanced
   get/put calls.  This should be obvious.

The only non-trivial part of this is that svc_destroy() would call
svc_sock_update() on a non-final decrement.  It can no longer do that,
and svc_put() isn't really a good place of it.  This call is now made
from svc_exit_thread() which seems like a good place.  This makes the
call *before* sv_nrthreads is decremented rather than after.  This
is not particularly important as the call just sets a flag which
causes sv_nrthreads set be checked later.  A subsequent patch will
improve the ordering.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-12-13 13:42:50 -05:00
Linus Torvalds
7413927713 NFS client bugfixes for Linux 5.16
Highlights include:
 
 Stable fixes:
 - NFSv42: Fix pagecache invalidation after COPY/CLONE
 
 Bugfixes:
 - NFSv42: Don't fail clone() just because the server failed to return
   post-op attributes
 - SUNRPC: use different lockdep keys for INET6 and LOCAL
 - NFSv4.1: handle NFS4ERR_NOSPC from CREATE_SESSION
 - SUNRPC: fix header include guard in trace header
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmGiTnMACgkQZwvnipYK
 APJoJQ//VZYSCx/mGaTIj5oUwjBKE/n/9rz2EUGS1cfYjZjPpb5Xgm1tn+1A4c01
 Ztu9/hKgwrDqknqkmtKvP1GsX5vYUqgfqAlc880Q2nXAqaLJBBZgB6BFMmTtcoQx
 C24L0tgxlZVD9Vw0DJEVDVgDxXA/9VmdSQK6uptQRQhcYf4VtR1wAzELHWdkdkfq
 1WrREeAwGWw1BaPTrlPn9XwW9qTaMlBH05XRHh6dM7gFmoIe3td7kq7BOnFxFsnA
 AQZ/nCgeMTE04kQQMzYqUc4YqZvnzUHxueZ6q8s0K1RKJBpNIQNgdWUMa285Qo4a
 JA9oBCPPo0JjmsEge2Km12zyBJoA7lLQDfc6UQJON50ADF0sTu3wszgGuC63KkhE
 V+kUogK2WOlnGky2yYrHmv43mcCcyoJ/g+g+38GXNYGorsFi/XUhvctEpFFFPF71
 0umQwWhA6Dhc52hMj5DN4nfspp//hEuV9o7/zJzlPi0elC+xtVaBWZCorNvznlXW
 C/O5yobJVd89PuIE17Stg+c0Rq3k7RVPPoyS2IaMkZ5cs7DtT5Tz3nKClPAA8Bur
 mPLAMkHSOSLO26cy30SVZCIx1JDMJU9dx/PkFemCexkzYXQxgp9px/8wmM2Xe6Oc
 /hDqi8V7ayJUuYpYuJ6sA8oUqj3j+NkoP4w5HhlnmZDBhFMEBhM=
 =jeHm
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:
 "Highlights include:

  Stable fixes:

   - NFSv42: Fix pagecache invalidation after COPY/CLONE

  Bugfixes:

   - NFSv42: Don't fail clone() just because the server failed to return
     post-op attributes

   - SUNRPC: use different lockdep keys for INET6 and LOCAL

   - NFSv4.1: handle NFS4ERR_NOSPC from CREATE_SESSION

   - SUNRPC: fix header include guard in trace header"

* tag 'nfs-for-5.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  SUNRPC: use different lock keys for INET6 and LOCAL
  sunrpc: fix header include guard in trace header
  NFSv4.1: handle NFS4ERR_NOSPC by CREATE_SESSION
  NFSv42: Fix pagecache invalidation after COPY/CLONE
  NFS: Add a tracepoint to show the results of nfs_set_cache_invalid()
  NFSv42: Don't fail clone() unless the OP_CLONE operation failed
2021-11-27 10:33:55 -08:00
NeilBrown
064a91771f SUNRPC: use different lock keys for INET6 and LOCAL
xprtsock.c reclassifies sock locks based on the protocol.
However there are 3 protocols and only 2 classification keys.
The same key is used for both INET6 and LOCAL.

This causes lockdep complaints.  The complaints started since Commit
ea9afca88b ("SUNRPC: Replace use of socket sk_callback_lock with
sock_lock") which resulted in the sock locks beings used more.

So add another key, and renumber them slightly.

Fixes: ea9afca88b ("SUNRPC: Replace use of socket sk_callback_lock with sock_lock")
Fixes: 176e21ee2e ("SUNRPC: Support for RPC over AF_LOCAL transports")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-11-22 17:44:49 -05:00
Linus Torvalds
38764c7340 A slow cycle for nfsd: mainly cleanup, including Neil's patch dropping
support for a filehandle format deprecated 20 years ago, and further
 xdr-related cleanup from Chuck.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAmGMPYkVHGJmaWVsZHNA
 ZmllbGRzZXMub3JnAAoJECebzXlCjuG+JVwQAKbrpgbzl91u+T6W9MUGgQVzDpeP
 XIy3NxCu/4pZ8SToWF3trz71sskokmkPPaZyuISD2C8e4DxO5LQ3fJLhtS9CjRFB
 x4iZUxH7V2BoWrb5SY6TDWBEqaq4MY9f7tIbvUu5xpa0FIupLqJjYh2CP8vqtsbm
 lblQKXz4ao0jwDzSVimNnPcTccpB25VIzwHsSOszRhN4rTjMgyHoETx2cqJne5IU
 Tx/hH0UlpnwuQ7aVpcjMoKqIyUWDTMejx51pyZhHB47DVKL7HsnZvg59mTpXFcBx
 29edvWT9yy1+w3nGkTYSkOgO9DyHvCbmQzIsvoYlmbZ2sdmTKK8Wuv2Ehcw3OfvL
 MXGmy2EXIhzvTZXyN6pL1bBwwNSxdqJhVSxvrPLz1EymIkxf/IDI8eyUicVXd3Vq
 K2xOn+CXyIbXWCU85ru8UA77r1+x//gSwqcJvtKUavbNJUwNt935CE2n3+o/0OL/
 pToZ89nhcaRyDP1jJKA37K48VLNtBXzZZQlRovyLelNojam/kzZkXX8dI6oV9VD1
 Ymjm0mbdZzwhE3C1HxKlxwZqhN+7YoyxMQuWjFMp28wxH+dkz/USCulKZ3/H+neD
 0YBSgvwe92JqkZTW2AOjipL+beAuKJ4zsfCCl2XZig/rHGutiwOf2GfgdRmJM6AD
 6aiufVWKNNRQef9y
 =yKBl
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "A slow cycle for nfsd: mainly cleanup, including Neil's patch dropping
  support for a filehandle format deprecated 20 years ago, and further
  xdr-related cleanup from Chuck"

* tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux: (26 commits)
  nfsd4: remove obselete comment
  nfsd: document server-to-server-copy parameters
  NFSD:fix boolreturn.cocci warning
  nfsd: update create verifier comment
  SUNRPC: Change return value type of .pc_encode
  SUNRPC: Replace the "__be32 *p" parameter to .pc_encode
  NFSD: Save location of NFSv4 COMPOUND status
  SUNRPC: Change return value type of .pc_decode
  SUNRPC: Replace the "__be32 *p" parameter to .pc_decode
  SUNRPC: De-duplicate .pc_release() call sites
  SUNRPC: Simplify the SVC dispatch code path
  SUNRPC: Capture value of xdr_buf::page_base
  SUNRPC: Add trace event when alloc_pages_bulk() makes no progress
  svcrdma: Split svcrmda_wc_{read,write} tracepoints
  svcrdma: Split the svcrdma_wc_send() tracepoint
  svcrdma: Split the svcrdma_wc_receive() tracepoint
  NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment()
  SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases
  NFSD: Initialize pointer ni with NULL and not plain integer 0
  NFSD: simplify struct nfsfh
  ...
2021-11-10 16:45:54 -08:00
Linus Torvalds
2ec20f4895 NFS client updates for Linux 5.16
Highlights include:
 
 Features:
 - NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN
 - Optimisations for READDIR and the 'ls -l' style workload
 - Further replacements of dprintk() with tracepoints and other tracing
   improvements
 - Ensure we re-probe NFSv4 server capabilities when the user does a
   "mount -o remount"
 
 Bugfixes:
 - Fix an Oops in pnfs_mark_request_commit()
 - Fix up deadlocks in the commit code
 - Fix regressions in NFSv2/v3 attribute revalidation due to the
   change_attr_type optimisations
 - Fix some dentry verifier races
 - Fix some missing dentry verifier settings
 - Fix a performance regression in nfs_set_open_stateid_locked()
 - SUNRPC was sending multiple SYN calls when re-establishing a TCP
   connection.
 - Fix multiple NFSv4 issues due to missing sanity checking of server
   return values
 - Fix a potential Oops when FREE_STATEID races with an unmount
 
 Cleanups:
 - Clean up the labelled NFS code
 - Remove unused header <linux/pnfs_osd_xdr.h>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmGL5c4ACgkQZwvnipYK
 APLFyQ//endoc1HYNpTNpcvlWiAgombBQumjBLrk73Qr+M2Vq9uK6+WmaqYTCHhU
 SfX6kbptiyGrd+f/pdIXCjIfPCnCRPRZYpRx8BxHwNr5vqOQIr9rvT/1Mvg2G9Oi
 IkdwVDmrN3ZjK/dbvyYSxhsLwuwrnaNm0oHkHxDO/EFghqEsesU1Aj1yywbFIZZA
 onRXVXh8r1T9pqL25HyHzZjD1kxvEiKuAMFis2NCKHexSmsvGF4Xs71J3AiCKuc2
 XXLged3ng7WRhNCvvrZmfA0AVkZ+iklpVJQzBeXzxuYB81pRZr99yXuv3FKE5aEl
 UIPv73b2uTq2SlXtZe2ggsVOdB0JDIRx+9jIH0iV3tOOjapfaTGdTwDx8JR1qHza
 wVxB24evk3rW6EFrZNPogaf3JiZmwlVCSUlSZZ3T5c+5l36yZV+WuoSTOe4ajttm
 y/uUkA1p2iFpYb9qNoO6kQ1ue3YO34TCqYPrUipzXWvTG1ZjJ5yGV5LZR0VvB4QT
 bYpInua7SC/t9RwJ1/HWBrk1G9/xufC4WI7xJf6dJzSDSEo8n6x24nxY0OwUIClb
 YzoVWv+bwTHgqkVlTO52XH3VX9E3XBgt5GLtxstQT3hXIndIEoitBqPms0buP/Af
 RveTtV1pNCqhmGrmZJGInH3veIELn3l/pTywqITuhIBNCG3Rj5g=
 =n8lj
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Features:
   - NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN
   - Optimisations for READDIR and the 'ls -l' style workload
   - Further replacements of dprintk() with tracepoints and other
     tracing improvements
   - Ensure we re-probe NFSv4 server capabilities when the user does a
     "mount -o remount"

  Bugfixes:
   - Fix an Oops in pnfs_mark_request_commit()
   - Fix up deadlocks in the commit code
   - Fix regressions in NFSv2/v3 attribute revalidation due to the
     change_attr_type optimisations
   - Fix some dentry verifier races
   - Fix some missing dentry verifier settings
   - Fix a performance regression in nfs_set_open_stateid_locked()
   - SUNRPC was sending multiple SYN calls when re-establishing a TCP
     connection.
   - Fix multiple NFSv4 issues due to missing sanity checking of server
     return values
   - Fix a potential Oops when FREE_STATEID races with an unmount

  Cleanups:
   - Clean up the labelled NFS code
   - Remove unused header <linux/pnfs_osd_xdr.h>"

* tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (84 commits)
  NFSv4: Sanity check the parameters in nfs41_update_target_slotid()
  NFS: Remove the nfs4_label argument from decode_getattr_*() functions
  NFS: Remove the nfs4_label argument from nfs_setsecurity
  NFS: Remove the nfs4_label argument from nfs_fhget()
  NFS: Remove the nfs4_label argument from nfs_add_or_obtain()
  NFS: Remove the nfs4_label argument from nfs_instantiate()
  NFS: Remove the nfs4_label from the nfs_setattrres
  NFS: Remove the nfs4_label from the nfs4_getattr_res
  NFS: Remove the f_label from the nfs4_opendata and nfs_openres
  NFS: Remove the nfs4_label from the nfs4_lookupp_res struct
  NFS: Remove the label from the nfs4_lookup_res struct
  NFS: Remove the nfs4_label from the nfs4_link_res struct
  NFS: Remove the nfs4_label from the nfs4_create_res struct
  NFS: Remove the nfs4_label from the nfs_entry struct
  NFS: Create a new nfs_alloc_fattr_with_label() function
  NFS: Always initialise fattr->label in nfs_fattr_alloc()
  NFSv4.2: alloc_file_pseudo() takes an open flag, not an f_mode
  NFS: Don't allocate nfs_fattr on the stack in __nfs42_ssc_open()
  NFSv4: Remove unnecessary 'minor version' check
  NFSv4: Fix potential Oops in decode_op_map()
  ...
2021-11-10 16:32:46 -08:00
Trond Myklebust
3be232f11a SUNRPC: Prevent immediate close+reconnect
If we have already set up the socket and are waiting for it to connect,
then don't immediately close and retry.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-11-04 19:46:19 -04:00
Trond Myklebust
d896ba8300 SUNRPC: Fix races when closing the socket
Ensure that we bump the xprt->connect_cookie when we set the
XPRT_CLOSE_WAIT flag so that another call to
xprt_conditional_disconnect() won't race with the reconnection.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-11-04 19:46:19 -04:00
Anna Schumaker
17f09d3f61 SUNRPC: Check if the xprt is connected before handling sysfs reads
xprts don't immediately reconnect when changing the "dstaddr" property,
instead this gets handled the next time an operation uses the transport.
This could lead to NULL pointer dereferences when trying to read sysfs
files between the disconnect and reconnect operations. Fix this by
returning an error if the xprt is not connected.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-11-04 19:43:29 -04:00
Benjamin Coddington
cb5a967f7c xprtrdma: Fix a maybe-uninitialized compiler warning
This minor fix-up keeps GCC from complaining that "last' may be used
uninitialized", which breaks some build workflows that have been running
with all warnings treated as errors.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-11-02 16:06:33 -04:00
Trond Myklebust
280254b605 SUNRPC: Clean up xs_tcp_setup_sock()
Move the error handling into a single switch statement for clarity.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-11-01 11:00:30 -04:00
Trond Myklebust
ea9afca88b SUNRPC: Replace use of socket sk_callback_lock with sock_lock
Since we do things like setting flags, etc it really is more appropriate
to use sock_lock().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-11-01 11:00:30 -04:00
Thiago Rafael Becker
023859ce6f sunrpc: remove unnecessary test in rpc_task_set_client()
In rpc_task_set_client(), testing for a NULL clnt is not necessary, as
clnt should always be a valid pointer to a rpc_client.

Signed-off-by: Thiago Rafael Becker <trbecker@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:55 -04:00
Chuck Lever
b40887e10d SUNRPC: Trace calls to .rpc_call_done
Introduce a single tracepoint that can replace simple dprintk call
sites in upper layer "rpc_call_done" callbacks. Example:

   kworker/u24:2-1254  [001]   771.026677: rpc_stats_latency:    task:00000001@00000002 xid=0x16a6f3c0 rpcbindv2 GETPORT backlog=446 rtt=101 execute=555
   kworker/u24:2-1254  [001]   771.026677: rpc_task_call_done:   task:00000001@00000002 flags=ASYNC|DYNAMIC|SOFT|SOFTCONN|SENT runstate=RUNNING|ACTIVE status=0 action=rpcb_getport_done
   kworker/u24:2-1254  [001]   771.026678: rpcb_setport:         task:00000001@00000002 status=0 port=20048

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:54 -04:00
Chuck Lever
7a3d524c4c xprtrdma: Remove rpcrdma_ep::re_implicit_roundup
Clean up: this field is no longer used.

xprt_rdma_pad_optimize is also no longer used, but is left in place
because it is part of the kernel/userspace API.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:54 -04:00
Chuck Lever
21037b8c22 xprtrdma: Provide a buffer to pad Write chunks of unaligned length
This is a buffer to be left persistently registered while a
connection is up. Connection tear-down will automatically DMA-unmap,
invalidate, and dereg the MR. A persistently registered buffer is
lower in cost to provide, and it can never be coalesced into the
RDMA segment that carries the data payload.

An RPC that provisions a Write chunk with a non-aligned length now
uses this MR rather than the tail buffer of the RPC's rq_rcv_buf.

Reviewed-By: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-20 18:09:54 -04:00
Chuck Lever
5b747a594b SUNRPC: De-duplicate .pc_release() call sites
There was some spaghetti in svc_process_common() that had evolved
over time such that there was still one case that needed a call
to .pc_release() but never made it. That issue was removed in
the previous patch.

As additional insurance against missing this important callout,
ensure that the .pc_release() method is always called, no matter
what the reply_stat is.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-10-12 10:13:57 -04:00
Chuck Lever
0ae93b99be SUNRPC: Simplify the SVC dispatch code path
Micro-optimization: The last user of the generic SVC dispatch code
path has been removed, so svc_process_common() can be simplified.
This declutters the hot path so that the by-far most common case
(a dispatch function exists) is made the /only/ path.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-10-12 10:13:57 -04:00
Chuck Lever
0392dd51f9 SUNRPC: Per-rpc_clnt task PIDs
The current range of RPC task PIDs is 0..65535. That's not adequate
for distinguishing tasks across multiple rpc_clnts running high
throughput workloads.

To help relieve this situation and to reduce the bottleneck of
having a single atomic for assigning all RPC task PIDs, assign task
PIDs per rpc_clnt.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-10 11:05:54 +02:00
Linus Torvalds
1da38549dd Bug fixes for NFSD error handling paths
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmFfUVQACgkQM2qzM29m
 f5fW9w/+MthSLnUW4edoq45d9pH7jYJrtSo54NavIknXXWYaSnDydFsV/msAsJH8
 kNmwk0JAmhQ6GIkRLm4gZ2cHT+cCtlU/1gJWamvstUGM6XUpmwODdD8nacmXUh4q
 fgh9yJooe2GERIhv2/04XA8dP7UcqyZeWAGOpUZNlYEBF/Pcp1i8fJHkbJ2zEueH
 AtTwQY5atuJVQYeno7hSd38p7whWMPF37pbL8u72fbJkOefAy0/UW3AdUiMkKTOT
 TT/1bgNhOAEo20F9vspVaYAOhC8rAGaWr4j82N1QvgBtJhGt9bayQEIZQ5e+HdCg
 It4d5qtzE0zZQ/ARYsQxfF7AgNitGYEfjVu6F3hxeHFKJQCSQoxuPbBl2FiVUl7I
 JeVgPRRfYLjOjEG2E3NCWQXuzy0MzPFKqnNrvtfTE41vz1Bzrnx9Feu9GEffAn4l
 K59pIWYcVgSaC1nu8ba/sfZTVjpKShsxcTB/GJl9cgCkenZG1bqbqNCwnzcH1s3u
 zXyJZ8CjncLWHkcm2bi/xZ3jdRAyOwVCth37wI5KTBXvEiPG3yKloQifi9yKU0Zi
 a93l7hs1swcj2GfutWVjVwVsi2d1YSRRGpVgmK5pbOAhSFBU+TXOUfGo5VG5JsUW
 LA3enCmuXrcnrsFABf43mwikLw2w8/rwgXANS6LE8vaZ7A/c07Q=
 =CTLP
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:
 "Bug fixes for NFSD error handling paths"

* tag 'nfsd-5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: Keep existing listeners on portlist error
  SUNRPC: fix sign error causing rpcsec_gss drops
  nfsd: Fix a warning for nfsd_file_close_inode
  nfsd4: Handle the NFSv4 READDIR 'dircount' hint being zero
  nfsd: fix error handling of register_pernet_subsys() in init_nfsd()
2021-10-07 14:11:40 -07:00
Chuck Lever
22a027e8c0 SUNRPC: Add trace event when alloc_pages_bulk() makes no progress
This is an operational low memory situation that needs to be
flagged. The new tracepoint records a timestamp and the nfsd thread
that failed to allocate pages.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-10-04 15:40:15 -04:00
Chuck Lever
45f1358468 svcrdma: Split svcrmda_wc_{read,write} tracepoints
There are currently three separate purposes being served by single
tracepoints. Split them up, as was done with wc_send.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-10-04 15:40:15 -04:00
Chuck Lever
eef2d8d47c svcrdma: Split the svcrdma_wc_send() tracepoint
There are currently three separate purposes being served by a single
tracepoint here. They need to be split up.

svcrdma_wc_send:
 - status is always zero, so there's no value in recording it.
 - vendor_err is meaningless unless status is not zero, so
   there's no value in recording it.
 - This tracepoint is needed only when developing modifications,
   so it should be left disabled most of the time.

svcrdma_wc_send_flush:
 - As above, needed only rarely, and not an error.

svcrdma_wc_send_err:
 - This tracepoint can be left persistently enabled because
   completion errors are run-time problems (except for FLUSHED_ERR).
 - Tracepoint name now ends in _err to reflect its purpose.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-10-04 15:40:15 -04:00
Chuck Lever
8dcc5721da svcrdma: Split the svcrdma_wc_receive() tracepoint
There are currently three separate purposes being served by a single
tracepoint here. They need to be split up.

svcrdma_wc_recv:
 - status is always zero, so there's no value in recording it.
 - vendor_err is meaningless unless status is not zero, so
   there's no value in recording it.
 - This tracepoint is needed only when developing modifications,
   so it should be left disabled most of the time.

svcrdma_wc_recv_flush:
 - As above, needed only rarely, and not an error.

svcrdma_wc_recv_err:
 - received is always zero, so there's no value in recording it.
 - This tracepoint can be left enabled because completion
   errors are run-time problems (except for FLUSHED_ERR).
 - Tracepoint name now ends in _err to reflect its purpose.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-10-04 15:40:15 -04:00
Trond Myklebust
33c3214bf4 SUNRPC: xprt_clear_locked() only needs release memory semantics
The clearing of the XPRT_LOCKED bit has to happen after we clear
xprt->snd_task, but we don't require any extra memory barriers after
that.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-03 20:49:06 -04:00
Trond Myklebust
6dbcbe3f78 SUNRPC: Remove WQ_HIGHPRI from xprtiod
Don't let xprtiod pre-empt softirq.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-03 20:49:05 -04:00
Trond Myklebust
47dd8796a3 SUNRPC: Add cond_resched() at the appropriate point in __rpc_execute()
Allow tasks that need to pre-empt rpciod/xprtiod to do so when it is
safe.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-03 20:49:05 -04:00
Trond Myklebust
ea7a1019d8 SUNRPC: Partial revert of commit 6f9f17287e
The premise of commit 6f9f17287e ("SUNRPC: Mitigate cond_resched() in
xprt_transmit()") was that cond_resched() is expensive and unnecessary
when there has been just a single send.
The point of cond_resched() is to ensure that tasks that should pre-empt
this one get a chance to do so when it is safe to do so. The code prior
to commit 6f9f17287e failed to take into account that it was keeping a
rpc_task pinned for longer than it needed to, and so rather than doing a
full revert, let's just move the cond_resched.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-03 20:49:05 -04:00
Chuck Lever
dae9a6cab8 NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment()
Refactor.

Now that the NFSv2 and NFSv3 XDR decoders have been converted to
use xdr_streams, the WRITE decoder functions can use
xdr_stream_subsegment() to extract the WRITE payload into its own
xdr_buf, just as the NFSv4 WRITE XDR decoder currently does.

That makes it possible to pass the first kvec, pages array + length,
page_base, and total payload length via a single function parameter.

The payload's page_base is not yet assigned or used, but will be in
subsequent patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-10-02 16:10:01 -04:00
Chuck Lever
f49b68ddc9 SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases
xdr_stream_subsegment() was introduced in commit c1346a1216
("NFSD: Replace the internals of the READ_BUF() macro").

There are two call sites for xdr_stream_subsegment(). One is
nfsd4_decode_write(), and the other is nfsd4_decode_setxattr().
Currently neither of these call sites calls this API when
xdr_buf::page_base is a non-zero value.

However, I'm about to add a case where page_base will sometimes not
be zero when nfsd4_decode_write() invokes this API. Replace the
logic in xdr_stream_subsegment() that advances to the next data item
in the xdr_stream with something more generic in order to handle
this new use case.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-10-02 16:10:01 -04:00
J. Bruce Fields
2ba5acfb34 SUNRPC: fix sign error causing rpcsec_gss drops
If sd_max is unsigned, then sd_max - GSS_SEQ_WIN is a very large number
whenever sd_max is less than GSS_SEQ_WIN, and the comparison:

	seq_num <= sd->sd_max - GSS_SEQ_WIN

in gss_check_seq_num is pretty much always true, even when that's
clearly not what was intended.

This was causing pynfs to hang when using krb5, because pynfs uses zero
as the initial gss sequence number.  That's perfectly legal, but this
logic error causes knfsd to drop the rpc in that case.  Out-of-order
sequence IDs in the first GSS_SEQ_WIN (128) calls will also cause this.

Fixes: 10b9d99a3d ("SUNRPC: Augment server-side rpcgss tracepoints")
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-10-01 11:17:42 -04:00
Yang Li
458032fcfa UNRPC: Return specific error code on kmalloc failure
Although the callers of this function only care about whether the
return value is null or not, we should still give a rigorous
error code.

Smatch tool warning:
net/sunrpc/auth_gss/svcauth_gss.c:784 gss_write_verf() warn: returning
-1 instead of -ENOMEM is sloppy

No functional change, just more standardized.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-09-23 16:05:24 -04:00
J. Bruce Fields
9b6e27d01a nfsd: don't alloc under spinlock in rpc_parse_scope_id
Dan Carpenter says:

  The patch d20c11d86d: "nfsd: Protect session creation and client
  confirm using client_lock" from Jul 30, 2014, leads to the following
  Smatch static checker warning:

        net/sunrpc/addr.c:178 rpc_parse_scope_id()
        warn: sleeping in atomic context

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: d20c11d86d ("nfsd: Protect session creation and client...")
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-09-21 17:51:47 -04:00
Linus Torvalds
14e2bc4e8c Critical bug fixes:
- Restore performance on memory-starved servers
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmE49dQACgkQM2qzM29m
 f5fnpA//dqRbqDAmWLJSoIc4WEkr5dEEqBYzRyqeefGRSANdqf7jQfjnWh9MKkIw
 rbBfwWRmiLN/9qsjAuxHJGnGeZG6BLBN3LKEgfvFFj3HNUHekEIsKP3HPhXeo49K
 5U6JVZhcDHPTiVSqDVpumfdnZLRTAR7BMbduWYxdOK+dWdCFZ/zUrf1HWEoFJO6o
 Y+Kb2iGzuWQu+ie3Zh8jp797OUSt5FZXLLWPeOS1giBOWRz2+2z5pEj/tBSZuoOS
 IbzivQHQVWCt1q5CtsYY5sqxtpDgObCdDQQ7Pxo/qsxYv3D+56vll5lbZ513KHkd
 fWnk1q97QpjJI52jQY3kIx33FLVB0BWEGK0mrANQ8wQA7stq11Xc439GOY6CI1zZ
 NHz7VelzoR295s1bSMz1V66ZaP9o9d+CUKgWuT7x99hPbyqp90z8K71l6BrcM05u
 tP2YUObmAGfGusbG3OJvHLJWAo/22u4APowC0ZWVmF3FrCHXIdbDtQOrrb+h1Yqq
 5wmshDQYCuh/sqpxx7VqseFUIIg4XQ0ziVDbVcDNxVkwDElu1Abd/mKf98+K3Q3G
 RYHrGGAEXz9HG4WzVKYl+k0GUV3vUiGH4pvLtBpJfDAGSP6zvsu64lb7IAoZVczm
 O/bQKWnJjYzEO/CM6vsCY15LFwRMC1F83c+8OhskDyvla2azzwQ=
 =BXCy
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Restore performance on memory-starved servers

* tag 'nfsd-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  SUNRPC: improve error response to over-size gss credential
  SUNRPC: don't pause on incomplete allocation
2021-09-08 15:55:42 -07:00
Linus Torvalds
0961f0c00e NFS Client Updates for Linux 5.15
- New Features:
   - Better client responsiveness when server isn't replying
   - Use refcount_t in sunrpc rpc_client refcount tracking
   - Add srcaddr and dst_port to the sunrpc sysfs info files
   - Add basic support for connection sharing between servers with multiple NICs`
 
 - Bugfixes and Cleanups:
   - Sunrpc tracepoint cleanups
   - Disconnect after ib_post_send() errors to avoid deadlocks
   - Fix for tearing down rpcrdma_reps
   - Fix a potential pNFS layoutget livelock loop
   - pNFS layout barrier fixes
   - Fix a potential memory corruption in rpc_wake_up_queued_task_set_status()
   - Fix reconnection locking
   - Fix return value of get_srcport()
   - Remove rpcrdma_post_sends()
   - Remove pNFS dead code
   - Remove copy size restriction for inter-server copies
   - Overhaul the NFS callback service
   - Clean up sunrpc TCP socket shutdowns
   - Always provide aligned buffers to RPC read layers
 -----BEGIN PGP SIGNATURE-----
 
 iQIyBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmExP7AACgkQ18tUv7Cl
 QOshTg/zBz7OfrS23CcLLgNidTJ6S7JOuj1DShG+YzsYXT8f9Nl1DadLM7yAEyok
 6JZzC8rXYzJcmYztHZzRyTuzj1+tGGb0u/MrD0bBk42VEel6eOjH/Y9ybn12Gf/E
 aqlcJh8hPx44U8oo5EFjRJsg2h28O06vywqhJz+sTbkqKN4hlAgMOo5ysAB+1thg
 BrTlR84EKBw5QqxPJ1WPmq9tEyGebU9Yrj1p8f0Uf015IeRNeTOXx3NzmdPshphf
 2yJvjumwEzqkcHXTJFDfP6ikIcGPPMNVAOK8DHb+vDGzNsOXW7dDM7GuWA3U8DlU
 ZHvyyb05Wwe6Wwg8xwx90FEXcYZFfZbSKmI9z2uoOuGFzNG07zWzPDzRft+qrOvU
 VMMwP9oEh71+qesmWTvqIbR2RjxqbCYlTcc8mBrD66DROi6jZ2jznraNC85sxG0Q
 b8GE+2SnYr2Q25yehj2xrRlOXyiYNkeeYmIpIquEqH9o7cSyDNJhBWbzIv6x+ith
 O/S06ZVKMc9X1nH5t5121XcHrSTMMVA/67WMyKfKMxWnrADAWPQALG+ttoTcbRu7
 Txew3Jb+hB8+ZdHAqbPf1l1i+7USQl1CRHMw3GRvNjCL2qcjZb1R7eyJRSQQtUyw
 q6SJRGe6Sn1FTUnn96Hv15Zy8VHx+q0cOL/EQVzL1RzJIXYcag==
 =Ad/3
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "New Features:
   - Better client responsiveness when server isn't replying
   - Use refcount_t in sunrpc rpc_client refcount tracking
   - Add srcaddr and dst_port to the sunrpc sysfs info files
   - Add basic support for connection sharing between servers with multiple NICs`

  Bugfixes and Cleanups:
   - Sunrpc tracepoint cleanups
   - Disconnect after ib_post_send() errors to avoid deadlocks
   - Fix for tearing down rpcrdma_reps
   - Fix a potential pNFS layoutget livelock loop
   - pNFS layout barrier fixes
   - Fix a potential memory corruption in rpc_wake_up_queued_task_set_status()
   - Fix reconnection locking
   - Fix return value of get_srcport()
   - Remove rpcrdma_post_sends()
   - Remove pNFS dead code
   - Remove copy size restriction for inter-server copies
   - Overhaul the NFS callback service
   - Clean up sunrpc TCP socket shutdowns
   - Always provide aligned buffers to RPC read layers"

* tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (39 commits)
  NFS: Always provide aligned buffers to the RPC read layers
  NFSv4.1 add network transport when session trunking is detected
  SUNRPC enforce creation of no more than max_connect xprts
  NFSv4 introduce max_connect mount options
  SUNRPC add xps_nunique_destaddr_xprts to xprt_switch_info in sysfs
  SUNRPC keep track of number of transports to unique addresses
  NFSv3: Delete duplicate judgement in nfs3_async_handle_jukebox
  SUNRPC: Tweak TCP socket shutdown in the RPC client
  SUNRPC: Simplify socket shutdown when not reusing TCP ports
  NFSv4.2: remove restriction of copy size for inter-server copy.
  NFS: Clean up the synopsis of callback process_op()
  NFS: Extract the xdr_init_encode/decode() calls from decode_compound
  NFS: Remove unused callback void decoder
  NFS: Add a private local dispatcher for NFSv4 callback operations
  SUNRPC: Eliminate the RQ_AUTHERR flag
  SUNRPC: Set rq_auth_stat in the pg_authenticate() callout
  SUNRPC: Add svc_rqst::rq_auth_stat
  SUNRPC: Add dst_port to the sysfs xprt info file
  SUNRPC: Add srcaddr as a file in sysfs
  sunrpc: Fix return value of get_srcport()
  ...
2021-09-04 10:25:26 -07:00
NeilBrown
0c217d5066 SUNRPC: improve error response to over-size gss credential
When the NFS server receives a large gss (kerberos) credential and tries
to pass it up to rpc.svcgssd (which is deprecated), it triggers an
infinite loop in cache_read().

cache_request() always returns -EAGAIN, and this causes a "goto again".

This patch:
 - changes the error to -E2BIG to avoid the infinite loop, and
 - generates a WARN_ONCE when rsi_request first sees an over-sized
   credential.  The warning suggests switching to gssproxy.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=196583
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-09-03 13:38:11 -04:00
NeilBrown
e38b3f2005 SUNRPC: don't pause on incomplete allocation
alloc_pages_bulk_array() attempts to allocate at least one page based on
the provided pages, and then opportunistically allocates more if that
can be done without dropping the spinlock.

So if it returns fewer than requested, that could just mean that it
needed to drop the lock.  In that case, try again immediately.

Only pause for a time if no progress could be made.

Reported-and-tested-by: Mike Javorski <mike.javorski@gmail.com>
Reported-and-tested-by: Lothar Paltins <lopa@mailbox.org>
Fixes: f6e70aab9d ("SUNRPC: refresh rq_pages using a bulk page allocator")
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Mel Gorman <mgorman@suse.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-09-01 11:05:07 -04:00
Linus Torvalds
8bda955776 New features:
- Support for server-side disconnect injection via debugfs
 - Protocol definitions for new RPC_AUTH_TLS authentication flavor
 
 Performance improvements:
 - Reduce page allocator traffic in the NFSD splice read actor
 - Reduce CPU utilization in svcrdma's Send completion handler
 
 Notable bug fixes:
 - Stabilize lockd operation when re-exporting NFS mounts
 - Fix the use of %.*s in NFSD tracepoints
 - Fix /proc/sys/fs/nfs/nsm_use_hostnames
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmEqq0AACgkQM2qzM29m
 f5dYig/5AaPN2BWYf4D1VkrAS3+zGS+3IN23WVgpbA54jgfjPEH+Aa00YhEQQa0j
 Y5u/jE5g/tWvenDefq5BmvdRfZMWCVc2JkngctOSflhaREUWK+HgCkH+5DQs6zUM
 rbX7qy0v6wJnEMSlwCKJ2AuZbYw7Bsg2nvOgEbb718/ent3umeoXEK09x3HTWLEp
 eVcMU5uicB5wRRPpROYG792oWzUScQ8kyiRCKJfQDoR7bINhBeVHObAIFMBo1UaH
 x9CMX4RlPYGmoMYUc+AqcOM7hizucHpXqM1r3oVjQ7FyI+pmDLuLL/3OTjtRUX7+
 nYLqNW/PijH9PjFe4BPjGHAUQfKiTIXANAe8VdjQj70D40jYkP+jQ9SPdV+pEgi4
 U4azfK3S+85/bRYYq/1alcLiP1+6dgcL++rVvnKESTH9NRgNoEw2WZHeKxXiYaxU
 p7oOC4XdnYDwcz/3QVWa0sK2kA5IJHzOsCQR7OilD09NAJ+AbJTAp0H3xFXTllzb
 AV2CAEBVZlP+pZYOehuVnKpZPa7YAWx92wRK2anbRUMZN3lF1wWBEOTd6KweIpTx
 l2GJSf3GWBqL1x9PjSet/cBusxYjTA+S1hE7KMrsNPhzbvpIgAZEtSqOfn9apDCV
 uAFIN2DSiHm3Tv0aFSJWo+CMyKkyktuiS8JFKaFdzCp9NtsBM2M=
 =TGkK
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "New features:

   - Support for server-side disconnect injection via debugfs

   - Protocol definitions for new RPC_AUTH_TLS authentication flavor

  Performance improvements:

   - Reduce page allocator traffic in the NFSD splice read actor

   - Reduce CPU utilization in svcrdma's Send completion handler

  Notable bug fixes:

   - Stabilize lockd operation when re-exporting NFS mounts

   - Fix the use of %.*s in NFSD tracepoints

   - Fix /proc/sys/fs/nfs/nsm_use_hostnames"

* tag 'nfsd-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (31 commits)
  nfsd: fix crash on LOCKT on reexported NFSv3
  nfs: don't allow reexport reclaims
  lockd: don't attempt blocking locks on nfs reexports
  nfs: don't atempt blocking locks on nfs reexports
  Keep read and write fds with each nlm_file
  lockd: update nlm_lookup_file reexport comment
  nlm: minor refactoring
  nlm: minor nlm_lookup_file argument change
  lockd: lockd server-side shouldn't set fl_ops
  SUNRPC: Add documentation for the fail_sunrpc/ directory
  SUNRPC: Server-side disconnect injection
  SUNRPC: Move client-side disconnect injection
  SUNRPC: Add a /sys/kernel/debug/fail_sunrpc/ directory
  svcrdma: xpt_bc_xprt is already clear in __svc_rdma_free()
  nfsd4: Fix forced-expiry locking
  rpc: fix gss_svc_init cleanup on failure
  SUNRPC: Add RPC_AUTH_TLS protocol numbers
  lockd: change the proc_handler for nsm_use_hostnames
  sysctl: introduce new proc handler proc_dobool
  SUNRPC: Fix a NULL pointer deref in trace_svc_stats_latency()
  ...
2021-08-31 10:57:06 -07:00
Linus Torvalds
9a1d6c9e3f for-5.15/drivers-2021-08-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs6LsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqnLD/9c8v7WTLjrDR6FLD8fHUmkwk9ss6OeyYJC
 Z62QOyk6BqNOu6FAwBax9wFaboXdUqOdpJU0PVQ7WJ5wBiCQ9DAZY6T+iwW0jE79
 +iOSqdXHVLAIyIM9GplzLH5AH3tx4445bX7fRWwWX1OgmSidkAhb25FusCvpcpHx
 1k+9dSLClLeHPR6jVT3k6tHv2RzPSw+/vYOggeWYA0YYPfoCx/Ft0uwO+PjKpvLQ
 Je5jASlLGYCXazswJBZgfjbroA97EuaLOmHHIHrwhkkFsbV6ewv6mlmanbMEs4fX
 Wh+axTt8so27g6gbw31EOcGsxTi0B37Jx9MOrSla6NdJoZkFE2sn6K+D5k4oeSrg
 QgYXL00U62eSgWmgSB0f0X081cQfI+FUMe5u5S368WdrgCPfaXl11zHw8nXw8gEW
 UvqR4zr3hQd4piXsIWl2bwZrmpPBCeB8iStLq3C92RLPFT6hJO3GM/ZmwTn+0HT0
 lMXzoEdkPywkKWi8aBbSgzXiGknNl8HAYnwMhcQjiHbYQOycGkI9pigJDNY9Ox1l
 fYHFSompmJ/XK8cIiU7QIglXEXJky5jQ89Ni0ryCstOaP20tPxWtkpOCgidXfNGz
 4lmQV8D5aBTUFs6ifPjXfiXUmDiU3SaxiFhAqaEkGII9BbkrNhlibB4LBAU+toi1
 Q0yGhGR/mg==
 =4uWF
 -----END PGP SIGNATURE-----

Merge tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "Sitting on top of the core block changes, here are the driver changes
  for the 5.15 merge window:

   - NVMe updates via Christoph:
       - suspend improvements for devices with an HMB (Keith Busch)
       - handle double completions more gacefull (Sagi Grimberg)
       - cleanup the selects for the nvme core code a bit (Sagi Grimberg)
       - don't update queue count when failing to set io queues (Ruozhu Li)
       - various nvmet connect fixes (Amit Engel)
       - cleanup lightnvm leftovers (Keith Busch, me)
       - small cleanups (Colin Ian King, Hou Pu)
       - add tracing for the Set Features command (Hou Pu)
       - CMB sysfs cleanups (Keith Busch)
       - add a mutex_destroy call (Keith Busch)

   - remove lightnvm subsystem. It's served its purpose and ultimately
     led to zoned nvme support, we no longer need it (Christoph)

   - revert floppy O_NDELAY fix (Denis)

   - nbd fixes (Hou, Pavel, Baokun)

   - nbd locking fixes (Tetsuo)

   - nbd device removal fixes (Christoph)

   - raid10 rcu warning fix (Xiao)

   - raid1 write behind fix (Guoqing)

   - rnbd fixes (Gioh, Md Haris)

   - misc fixes (Colin)"

* tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-block: (42 commits)
  Revert "floppy: reintroduce O_NDELAY fix"
  raid1: ensure write behind bio has less than BIO_MAX_VECS sectors
  md/raid10: Remove unnecessary rcu_dereference in raid10_handle_discard
  nbd: remove nbd->destroy_complete
  nbd: only return usable devices from nbd_find_unused
  nbd: set nbd->index before releasing nbd_index_mutex
  nbd: prevent IDR lookups from finding partially initialized devices
  nbd: reset NBD to NULL when restarting in nbd_genl_connect
  nbd: add missing locking to the nbd_dev_add error path
  nvme: remove the unused NVME_NS_* enum
  nvme: remove nvm_ndev from ns
  nvme: Have NVME_FABRICS select NVME_CORE instead of transport drivers
  block: nbd: add sanity check for first_minor
  nvmet: check that host sqsize does not exceed ctrl MQES
  nvmet: avoid duplicate qid in connect cmd
  nvmet: pass back cntlid on successful completion
  nvme-rdma: don't update queue count when failing to set io queues
  nvme-tcp: don't update queue count when failing to set io queues
  nvme-tcp: pair send_mutex init with destroy
  nvme: allow user toggling hmb usage
  ...
2021-08-30 19:01:46 -07:00
Olga Kornievskaia
dc48e0abee SUNRPC enforce creation of no more than max_connect xprts
If we are adding new transports via rpc_clnt_test_and_add_xprt()
then check if we've reached the limit. Currently only pnfs path
adds transports via that function but this is done in
preparation when the client would add new transports when
session trunking is detected. A warning is logged if the
limit is reached.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-08-27 16:37:29 -04:00
Olga Kornievskaia
df205d0a8e SUNRPC add xps_nunique_destaddr_xprts to xprt_switch_info in sysfs
In sysfs's xprt_switch_info attribute also display the value of
number of transports with unique destination addresses for this
xprt_switch.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-08-27 16:37:03 -04:00
Olga Kornievskaia
3a3f976639 SUNRPC keep track of number of transports to unique addresses
Currently, xprt_switch keeps a number of all xprts (xps_nxprts)
that were added to the switch regardless of whethere it's an
nconnect transport or a transport to a trunkable address.
Introduce a new counter to keep track of transports to unique
destination addresses per xprt_switch.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-08-27 16:36:53 -04:00
Trond Myklebust
7c81e6a9d7 SUNRPC: Tweak TCP socket shutdown in the RPC client
We only really need to call shutdown() if we're in the ESTABLISHED TCP
state, since that is the only case where the client is initiating a
close of an established connection.

If the socket is in FIN_WAIT1 or FIN_WAIT2, then we've already initiated
socket shutdown and are waiting for the server's reply, so do nothing.

In all other cases where we've already received a FIN from the server,
we should be able to just close the socket.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-08-27 16:36:21 -04:00
Trond Myklebust
0a6ff58edb SUNRPC: Simplify socket shutdown when not reusing TCP ports
If we're not required to reuse the TCP port, then we can just
immediately close the socket, and leave the cleanup details to the TCP
layer.

Fixes: e6237b6feb ("NFSv4.1: Don't rebind to the same source port when reconnecting to the server")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-08-27 16:36:21 -04:00
Trond Myklebust
062b829c52 SUNRPC: Fix XPT_BUSY flag leakage in svc_handle_xprt()...
If the attempt to reserve a slot fails, we currently leak the XPT_BUSY
flag on the socket. Among other things, this make it impossible to close
the socket.

Fixes: 82011c80b3 ("SUNRPC: Move svc_xprt_received() call sites")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-08-25 16:58:09 -04:00
Chuck Lever
3a12618059 SUNRPC: Server-side disconnect injection
Disconnect injection stress-tests the ability for both client and
server implementations to behave resiliently in the face of network
instability.

A file called /sys/kernel/debug/fail_sunrpc/ignore-server-disconnect
enables administrators to turn off server-side disconnect injection
while allowing other types of sunrpc errors to be injected. The
default setting is that server-side disconnect injection is enabled
(ignore=false).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-20 13:50:33 -04:00
Chuck Lever
a4ae308143 SUNRPC: Move client-side disconnect injection
Disconnect injection stress-tests the ability for both client and
server implementations to behave resiliently in the face of network
instability.

Convert the existing client-side disconnect injection infrastructure
to use the kernel's generic error injection facility. The generic
facility has a richer set of injection criteria.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-20 13:50:32 -04:00
Chuck Lever
c782af2500 SUNRPC: Add a /sys/kernel/debug/fail_sunrpc/ directory
This directory will contain a set of administrative controls for
enabling error injection for kernel RPC consumers.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-20 13:50:32 -04:00
Chuck Lever
729580ddc5 svcrdma: xpt_bc_xprt is already clear in __svc_rdma_free()
svc_xprt_free() already "puts" the bc_xprt before calling the
transport's "free" method. No need to do it twice.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-19 08:29:32 -04:00
J. Bruce Fields
5a47534462 rpc: fix gss_svc_init cleanup on failure
The failure case here should be rare, but it's obviously wrong.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-17 11:47:53 -04:00
Chuck Lever
5c11720767 SUNRPC: Fix a NULL pointer deref in trace_svc_stats_latency()
Some paths through svc_process() leave rqst->rq_procinfo set to
NULL, which triggers a crash if tracing happens to be enabled.

Fixes: 89ff87494c ("SUNRPC: Display RPC procedure names instead of proc numbers")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-17 11:47:53 -04:00
Chuck Lever
07a92d009f svcrdma: Convert rdma->sc_rw_ctxts to llist
Relieve contention on sc_rw_ctxt_lock by converting rdma->sc_rw_ctxts
to an llist.

The goal is to reduce the average overhead of Send completions,
because a transport's completion handlers are single-threaded on
one CPU core. This change reduces CPU utilization of each Send
completion by 2-3% on my server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Tom Talpey <tom@talpey.com>
2021-08-17 11:47:53 -04:00
Chuck Lever
b6c2bfea09 svcrdma: Relieve contention on sc_send_lock.
/proc/lock_stat indicates the the sc_send_lock is heavily
contended when the server is under load from a single client.

To address this, convert the send_ctxt free list to an llist.
Returning an item to the send_ctxt cache is now waitless, which
reduces the instruction path length in the single-threaded Send
handler (svc_rdma_wc_send).

The goal is to enable the ib_comp_wq worker to handle a higher
RPC/RDMA Send completion rate given the same CPU resources. This
change reduces CPU utilization of Send completion by 2-3% on my
server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Tom Talpey <tom@talpey.com>
2021-08-17 11:47:53 -04:00
Chuck Lever
6c8c84f525 svcrdma: Fewer calls to wake_up() in Send completion handler
Because wake_up() takes an IRQ-safe lock, it can be expensive,
especially to call inside of a single-threaded completion handler.
What's more, the Send wait queue almost never has waiters, so
most of the time, this is an expensive no-op.

As always, the goal is to reduce the average overhead of each
completion, because a transport's completion handlers are single-
threaded on one CPU core. This change reduces CPU utilization of
the Send completion thread by 2-3% on my server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Tom Talpey <tom@talpey.com>
2021-08-17 11:47:53 -04:00
Chuck Lever
2f0f88f42f SUNRPC: Add svc_rqst_replace_page() API
Replacing a page in rq_pages[] requires a get_page(), which is a
bus-locked operation, and a put_page(), which can be even more
costly.

To reduce the cost of replacing a page in rq_pages[], batch the
put_page() operations by collecting "freed" pages in a pagevec,
and then release those pages when the pagevec is full. This
pagevec is also emptied when each RPC completes.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-17 11:47:52 -04:00
Sagi Grimberg
2a14c9ae15 params: lift param_set_uint_minmax to common code
It is a useful helper hence move it to common code so others can enjoy
it.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:22 +02:00
Chuck Lever
9082e1d914 SUNRPC: Eliminate the RQ_AUTHERR flag
Now that there is an alternate method for returning an auth_stat
value, replace the RQ_AUTHERR flag with use of that new method.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-08-10 14:18:35 -04:00
Chuck Lever
5c2465dfd4 SUNRPC: Set rq_auth_stat in the pg_authenticate() callout
In a few moments, rq_auth_stat will need to be explicitly set to
rpc_auth_ok before execution gets to the dispatcher.

svc_authenticate() already sets it, but it often gets reset to
rpc_autherr_badcred right after that call, even when authentication
is successful. Let's ensure that the pg_authenticate callout and
svc_set_client() set it properly in every case.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-08-10 14:18:35 -04:00
Chuck Lever
438623a06b SUNRPC: Add svc_rqst::rq_auth_stat
I'd like to take commit 4532608d71 ("SUNRPC: Clean up generic
dispatcher code") even further by using only private local SVC
dispatchers for all kernel RPC services. This change would enable
the removal of the logic that switches between
svc_generic_dispatch() and a service's private dispatcher, and
simplify the invocation of the service's pc_release method
so that humans can visually verify that it is always invoked
properly.

All that will come later.

First, let's provide a better way to return authentication errors
from SVC dispatcher functions. Instead of overloading the dispatch
method's *statp argument, add a field to struct svc_rqst that can
hold an error value.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-08-10 14:18:35 -04:00
Anna Schumaker
69f2cd6df3 SUNRPC: Add dst_port to the sysfs xprt info file
This is most likely going to be 2049 for NFS, but some servers might be
configured to export on a non-standard port. Let's show this information
just in case somebody needs it.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-08-09 17:25:51 -04:00
Anna Schumaker
e44773daf8 SUNRPC: Add srcaddr as a file in sysfs
I don't support changing it right now, but it could be useful
information for clients with multiple network cards.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-08-09 17:25:44 -04:00
Anna Schumaker
5d46dd04cb sunrpc: Fix return value of get_srcport()
Since bc1c56e9bb transport->srcport may by unset, causing
get_srcport() to return 0 when called. Fix this by querying the port
from the underlying socket instead of the transport.

Fixes: bc1c56e9bb (SUNRPC: prevent port reuse on transports which don't request it)
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-08-09 17:25:36 -04:00
Trond Myklebust
f99fa50880 SUNRPC/xprtrdma: Fix reconnection locking
The xprtrdma client code currently relies on the task that initiated the
connect to hold the XPRT_LOCK for the duration of the connection
attempt. If the task is woken early, due to some other event, then that
lock could get released early.
Avoid races by using the same mechanism that the socket code uses of
transferring lock ownership to the RDMA connect worker itself. That
frees us to call rpcrdma_xprt_disconnect() directly since we're now
guaranteed exclusion w.r.t. other callers.

Fixes: 4cf44be6f1 ("xprtrdma: Fix recursion into rpcrdma_xprt_disconnect()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-08-09 16:57:05 -04:00