linux-stable/net/sunrpc
Chuck Lever 8d4fb8ff42 xprtrdma: Fix disconnect regression
I found that injecting disconnects with v4.18-rc resulted in
random failures of the multi-threaded git regression test.

The root cause appears to be that, after a reconnect, the
RPC/RDMA transport is waking pending RPCs before the transport has
posted enough Receive buffers to receive the Replies. If a Reply
arrives before enough Receive buffers are posted, the connection
is dropped. A few connection drops happen in quick succession as
the client and server struggle to regain credit synchronization.

This regression was introduced with commit 7c8d9e7c88 ("xprtrdma:
Move Receive posting to Receive handler"). The client is supposed to
post a single Receive when a connection is established because
it's not supposed to send more than one RPC Call before it gets
a fresh credit grant in the first RPC Reply [RFC 8166, Section
3.3.3].

Unfortunately there appears to be a longstanding bug in the Linux
client's credit accounting mechanism. On connect, it simply dumps
all pending RPC Calls onto the new connection. It's possible it has
done this ever since the RPC/RDMA transport was added to the kernel
ten years ago.

Servers have so far been tolerant of this bad behavior. Currently no
server implementation ever changes its credit grant over reconnects,
and servers always repost enough Receives before connections are
fully established.

The Linux client implementation used to post a Receive before each
of these Calls. This has covered up the flooding send behavior.

I could try to correct this old bug so that the client sends exactly
one RPC Call and waits for a Reply. Since we are so close to the
next merge window, I'm going to instead provide a simple patch to
post enough Receives before a reconnect completes (based on the
number of credits granted to the previous connection).

The spurious disconnects will be gone, but the client will still
send multiple RPC Calls immediately after a reconnect.

Addressing the latter problem will wait for a merge window because
a) I expect it to be a large change requiring lots of testing, and
b) obviously the Linux client has interoperated successfully since
day zero while still being broken.

Fixes: 7c8d9e7c88 ("xprtrdma: Move Receive posting to ... ")
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-08 16:49:47 -04:00
..
auth_gss sunrpc: whitespace fixes 2018-07-31 12:53:40 -04:00
xprtrdma xprtrdma: Fix disconnect regression 2018-08-08 16:49:47 -04:00
addr.c replace strict_strto calls 2014-07-12 18:45:49 -04:00
auth.c sunrpc: kstrtoul() can also return -ERANGE 2018-07-31 12:53:40 -04:00
auth_generic.c NFS client updates for Linux 4.9 2016-10-13 21:28:20 -07:00
auth_null.c net/sunrpc: Make rpc_auth_create_args a const 2018-07-30 13:19:41 -04:00
auth_unix.c net/sunrpc: Make rpc_auth_create_args a const 2018-07-30 13:19:41 -04:00
backchannel_rqst.c sunrpc: whitespace fixes 2018-07-31 12:53:40 -04:00
cache.c treewide: kzalloc() -> kcalloc() 2018-06-12 16:19:22 -07:00
clnt.c sunrpc: whitespace fixes 2018-07-31 12:53:40 -04:00
debugfs.c net: Use octal not symbolic permissions 2018-03-26 12:07:48 -04:00
Kconfig IB: Revert "remove redundant INFINIBAND kconfig dependencies" 2018-05-28 10:40:16 -06:00
Makefile License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
netns.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
rpc_pipe.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-04 10:14:28 -07:00
rpcb_clnt.c sunrpc: whitespace fixes 2018-07-31 12:53:40 -04:00
sched.c sunrpc: Simplify synopsis of some trace points 2018-04-10 16:06:22 -04:00
socklib.c sunrpc: do not pull udp headers on receive 2016-04-11 15:31:33 -04:00
stats.c sunrpc: whitespace fixes 2018-07-31 12:53:40 -04:00
sunrpc.h sunrpc: whitespace fixes 2018-07-31 12:53:40 -04:00
sunrpc_syms.c net: Drop pernet_operations::async 2018-03-27 13:18:09 -04:00
svc.c NFSD: Clean up legacy NFS SYMLINK argument XDR decoders 2018-04-03 15:08:16 -04:00
svc_xprt.c svc: Report xprt dequeue latency 2018-04-03 15:08:13 -04:00
svcauth.c locking/atomic, kref: Implement kref_put_lock() 2017-01-18 10:03:29 +01:00
svcauth_unix.c kernel: make groups_sort calling a responsibility group_info allocators 2017-12-14 16:00:49 -08:00
svcsock.c Chuck Lever did a bunch of work on nfsd tracepoints, on RDMA, and on 2018-04-05 19:15:29 -07:00
sysctl.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
timer.c
xdr.c SUNRPC: Add helpers for decoding opaque and string types 2018-04-10 16:06:22 -04:00
xprt.c sunrpc: whitespace fixes 2018-07-31 12:53:40 -04:00
xprtmultipath.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
xprtsock.c sunrpc: whitespace fixes 2018-07-31 12:53:40 -04:00