Commit graph

228 commits

Author SHA1 Message Date
Joel Granados
ca5d1fce79 net: sunrpc: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

* Remove sentinel element from ctl_table structs.

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03 13:29:42 +01:00
Luis Chamberlain
17c6d0ce83 sunrpc: simplify one-level sysctl registration for xr_tunables_table
There is no need to declare an extra tables to just create directory,
this can be easily be done with a prefix path with register_sysctl().

Simplify this registration.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-04-11 12:44:49 -04:00
Chuck Lever
6b1eb3b222 SUNRPC: Replace the use of the xprtiod WQ in rpcrdma
While setting up a new lab, I accidentally misconfigured the
Ethernet port for a system that tried an NFS mount using RoCE.
This made the NFS server unreachable. The following WARNING
popped on the NFS client while waiting for the mount attempt to
time out:

kernel: workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing !WQ_MEM_RECLAI>
kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628 check_flush_dependency+0xbf/0xca
kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs 8021q garp stp mrp llc rfkill rpcrdma>
kernel: CPU: 0 PID: 100 Comm: kworker/u8:8 Not tainted 6.0.0-rc1-00002-g6229f8c054e5 #13
kernel: Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b 06/12/2017
kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
kernel: RIP: 0010:check_flush_dependency+0xbf/0xca
kernel: Code: 75 2a 48 8b 55 18 48 8d 8b b0 00 00 00 4d 89 e0 48 81 c6 b0 00 00 00 48 c7 c7 65 33 2e be>
kernel: RSP: 0018:ffffb562806cfcf8 EFLAGS: 00010092
kernel: RAX: 0000000000000082 RBX: ffff97894f8c3c00 RCX: 0000000000000027
kernel: RDX: 0000000000000002 RSI: ffffffffbe3447d1 RDI: 00000000ffffffff
kernel: RBP: ffff978941315840 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 00000000000008b0 R11: 0000000000000001 R12: ffffffffc0ce3731
kernel: R13: ffff978950c00500 R14: ffff97894341f0c0 R15: ffff978951112eb0
kernel: FS:  0000000000000000(0000) GS:ffff97987fc00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f807535eae8 CR3: 000000010b8e4002 CR4: 00000000003706f0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Call Trace:
kernel:  <TASK>
kernel:  __flush_work.isra.0+0xaf/0x188
kernel:  ? _raw_spin_lock_irqsave+0x2c/0x37
kernel:  ? lock_timer_base+0x38/0x5f
kernel:  __cancel_work_timer+0xea/0x13d
kernel:  ? preempt_latency_start+0x2b/0x46
kernel:  rdma_addr_cancel+0x70/0x81 [ib_core]
kernel:  _destroy_id+0x1a/0x246 [rdma_cm]
kernel:  rpcrdma_xprt_connect+0x115/0x5ae [rpcrdma]
kernel:  ? _raw_spin_unlock+0x14/0x29
kernel:  ? raw_spin_rq_unlock_irq+0x5/0x10
kernel:  ? finish_task_switch.isra.0+0x171/0x249
kernel:  xprt_rdma_connect_worker+0x3b/0xc7 [rpcrdma]
kernel:  process_one_work+0x1d8/0x2d4
kernel:  worker_thread+0x18b/0x24f
kernel:  ? rescuer_thread+0x280/0x280
kernel:  kthread+0xf4/0xfc
kernel:  ? kthread_complete_and_exit+0x1b/0x1b
kernel:  ret_from_fork+0x22/0x30
kernel:  </TASK>

SUNRPC's xprtiod workqueue is WQ_MEM_RECLAIM, so any workqueue that
one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
prevent a priority inversion. The internal workqueues in the
RDMA/core are currently non-MEM_RECLAIM.

Jason Gunthorpe says this about the current state of RDMA/core:
> If you attempt to do a reconnection/etc from within a RECLAIM
> context it will deadlock on one of the many allocations that are
> made to support opening the connection.
>
> The general idea of reclaim is that the entire task context
> working under the reclaim is marked with an override of the gfp
> flags to make all allocations under that call chain reclaim safe.
>
> But rdmacm does allocations outside this, eg in the WQs processing
> the CM packets. So this doesn't work and we will deadlock.
>
> Fixing it is a big deal and needs more than poking WQ_MEM_RECLAIM
> here and there.

So we will change the ULP in this case to avoid the use of
WQ_MEM_RECLAIM where possible. Deadlocks that were possible before
are not fixed, but at least we no longer have a false sense of
confidence that the stack won't allocate memory during memory
reclaim.

Suggested-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-05 15:47:16 -04:00
Trond Myklebust
4b8dbdfbc5 SUNRPC: Fix an RPC/RDMA performance regression
Use the standard gfp mask instead of using GFP_NOWAIT. The latter causes
issues when under memory pressure.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-07-10 19:00:53 -04:00
NeilBrown
8db55a032a SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC
rpc tasks can be marked as RPC_TASK_SWAPPER.  This causes GFP_MEMALLOC
to be used for some allocations.  This is needed in some cases, but not
in all where it is currently provided, and in some where it isn't
provided.

Currently *all* tasks associated with a rpc_client on which swap is
enabled get the flag and hence some GFP_MEMALLOC support.

GFP_MEMALLOC is provided for ->buf_alloc() but only swap-writes need it.
However xdr_alloc_bvec does not get GFP_MEMALLOC - though it often does
need it.

xdr_alloc_bvec is called while the XPRT_LOCK is held.  If this blocks,
then it blocks all other queued tasks.  So this allocation needs
GFP_MEMALLOC for *all* requests, not just writes, when the xprt is used
for any swap writes.

Similarly, if the transport is not connected, that will block all
requests including swap writes, so memory allocations should get
GFP_MEMALLOC if swap writes are possible.

So with this patch:
 1/ we ONLY set RPC_TASK_SWAPPER for swap writes.
 2/ __rpc_execute() sets PF_MEMALLOC while handling any task
    with RPC_TASK_SWAPPER set, or when handling any task that
    holds the XPRT_LOCKED lock on an xprt used for swap.
    This removes the need for the RPC_IS_SWAPPER() test
    in ->buf_alloc handlers.
 3/ xprt_prepare_transmit() sets PF_MEMALLOC after locking
    any task to a swapper xprt.  __rpc_execute() will clear it.
 3/ PF_MEMALLOC is set for all the connect workers.

Reviewed-by: Chuck Lever <chuck.lever@oracle.com> (for xprtrdma parts)
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13 12:59:35 -04:00
NeilBrown
a721035477 SUNRPC/xprt: async tasks mustn't block waiting for memory
When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.  So it
must not block waiting for memory.

xprt_dynamic_alloc_slot can block indefinitely.  This can tie up all
workqueue threads and NFS can deadlock.  So when called from a
workqueue, set __GFP_NORETRY.

The rdma alloc_slot already does not block.  However it sets the error
to -EAGAIN suggesting this will trigger a sleep.  It does not.  As we
can see in call_reserveresult(), only -ENOMEM causes a sleep.  -EAGAIN
causes immediate retry.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13 12:59:35 -04:00
NeilBrown
c487216bec SUNRPC/call_alloc: async tasks mustn't block waiting for memory
When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.
So it must not block waiting for memory.

mempools are particularly a problem as memory can only be released back
to the mempool by an async rpc task running.  If all available
workqueue threads are waiting on the mempool, no thread is available to
return anything.

rpc_malloc() can block, and this might cause deadlocks.
So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if
blocking is acceptable.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13 12:59:35 -04:00
Chuck Lever
c0f26167dd xprtrdma: Remove definitions of RPCDBG_FACILITY
Deprecated. dprintk is no longer used in xprtrdma.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-14 10:35:08 -05:00
Trond Myklebust
f99fa50880 SUNRPC/xprtrdma: Fix reconnection locking
The xprtrdma client code currently relies on the task that initiated the
connect to hold the XPRT_LOCK for the duration of the connection
attempt. If the task is woken early, due to some other event, then that
lock could get released early.
Avoid races by using the same mechanism that the socket code uses of
transferring lock ownership to the RDMA connect worker itself. That
frees us to call rpcrdma_xprt_disconnect() directly since we're now
guaranteed exclusion w.r.t. other callers.

Fixes: 4cf44be6f1 ("xprtrdma: Fix recursion into rpcrdma_xprt_disconnect()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-08-09 16:57:05 -04:00
Chuck Lever
8d863b1f05 xprtrdma: Eliminate rpcrdma_post_sends()
Clean up.

Now that there is only one registration mode, there is only one
target "post_send" method: frwr_send(). rpcrdma_post_sends() no
longer adds much value, especially since all of its call sites
ignore the return code value except to check if it's non-zero.

Just have them call frwr_send() directly instead.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-08-09 16:42:25 -04:00
Olga Kornievskaia
d3abc73987 sunrpc: keep track of the xprt_class in rpc_xprt structure
We need to keep track of the type for a given transport.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-07-08 14:03:23 -04:00
Trond Myklebust
e86be3a04b SUNRPC: More fixes for backlog congestion
Ensure that we fix the XPRT_CONGESTED starvation issue for RDMA as well
as socket based transports.
Ensure we always initialise the request after waking up from the backlog
list.

Fixes: e877a88d1f ("SUNRPC in case of backlog, hand free slots directly to waiting task")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-26 06:36:13 -04:00
Chuck Lever
7638e0bfae SUNRPC: Move fault injection call sites
I've hit some crashes that occur in the xprt_rdma_inject_disconnect
path. It appears that, for some provides, rdma_disconnect() can
take so long that the transport can disconnect and release its
hardware resources while rdma_disconnect() is still running,
resulting in a UAF in the provider.

The transport's fault injection method may depend on the stability
of transport data structures. That means it needs to be invoked
only from contexts that hold the transport write lock.

Fixes: 4a06825839 ("SUNRPC: Transport fault injection")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-14 09:36:29 -04:00
Trond Myklebust
edffb84cc8 NFSoRDmA Client updates for Linux 5.11
Cleanups and improvements:
   - Remove use of raw kernel memory addresses in tracepoints
   - Replace dprintk() call sites in ERR_CHUNK path
   - Trace unmap sync calls
   - Optimize MR DMA-unmapping
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl/SlTcACgkQ18tUv7Cl
 QOsc+xAA0qmDLbdZShFAiip/jgFvHkoIJzVcil5++xiRY77xOjR2rgkgBSv4hFYm
 XWBidP/eQm3n5r0S+LG0zudcHnaKBonZ0UV2j32PelMEnvn3H9qlSJYneEm9xv2O
 K1koNcbunJH8JRsLxbStdMnOnNwCmB+HzIjnWr87OgXkYucqDBBxt3RMyZVD46QU
 GypnItAtLns4Oacsw0TyPAPAxstjZ71YRlTMhtfyIYEJfvxUVRYRKvs7nPei9eCa
 0uCzoKb28F7CWy+S9so7wSjfdDu4G+FmAfMvJbQF4irhgZBvzyzj4jq4hCnv377S
 XF6Eygm0Z4BSWNoAnRjgLx3Bo4ps4qZi7iAj8cUAksEKQ3QI2BWTWR3e07YwLfnm
 iwhWougP76zbtC3lNA+4D2yaYwA1LtN4IYp19KmS/0LyRe5EBry39ccE/eE508JA
 BGnCohawI9ya7WT3xTne9lyxNGhjm/jNDKt0ax0ze4hwIqSGekqhUzxQ0bl4suQz
 ZHP2gUEad07jZyy8gCcpEvvxA3fFW241vaG/hN34Dy47eHpAQDQIMk7611YRQXH/
 jCBHoym1dHXc8eNa+gTmrjXdhVdgHo4zI0ppeaQvA2Ss0nPp8N8H2nJV/as536Dp
 on0b1zAF4o5uzGlEzHu+FuFGqUPE9My9/dNDqw3o0Cncu7adffk=
 =R45y
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-5.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs into linux-next

NFSoRDmA Client updates for Linux 5.11

Cleanups and improvements:
  - Remove use of raw kernel memory addresses in tracepoints
  - Replace dprintk() call sites in ERR_CHUNK path
  - Trace unmap sync calls
  - Optimize MR DMA-unmapping

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-15 20:08:41 -05:00
Trond Myklebust
d5aa6b22e2 SUNRPC: xprt_load_transport() needs to support the netid "rdma6"
According to RFC5666, the correct netid for an IPv6 addressed RDMA
transport is "rdma6", which we've supported as a mount option since
Linux-4.7. The problem is when we try to load the module "xprtrdma6",
that will fail, since there is no modulealias of that name.

Fixes: 181342c5eb ("xprtrdma: Add rdma6 option to support NFS/RDMA IPv6")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:52 -05:00
Chuck Lever
8e24e191d4 xprtrdma: Trace unmap_sync calls
->buf_free is called nearly once per RPC. Only rarely does
xprt_rdma_free() have to do anything, thus tracing every one of
these calls seems unnecessary. Instead, just throw a trace event
when that one occasional RPC still has MRs that need to be
released.

xprt_rdma_free() is further micro-optimized to reduce the amount of
work done in the common case.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-11-11 10:49:12 -05:00
Chuck Lever
ac1ae53421 SUNRPC: Hoist trace_xprtrdma_op_setport into generic code
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-09-21 10:21:09 -04:00
Chuck Lever
7806948753 SUNRPC: Remove debugging instrumentation from xprt_release
These instruments don't appear to add any substantial value.

We already have this at the termination of each RPC:

          iozone-2617  [002]   975.713126: rpc_stats_latency:    task:418@5 xid=0x260eab5d nfsv3 LOOKUP backlog=15 rtt=32 execute=58
          iozone-2617  [002]   975.713127: xprt_release_cong:    task:418@5 snd_task:4294967295 cong=256 cwnd=16384
          iozone-2617  [002]   975.713127: xprt_put_cong:        task:418@5 snd_task:4294967295 cong=0 cwnd=16384

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-09-21 10:21:08 -04:00
Chuck Lever
06e234c613 SUNRPC: Hoist trace_xprtrdma_op_allocate into generic code
Introduce a tracepoint in call_allocate that reports the exact
sizes in the RPC buffer allocation request and the status of the
result. This helps catch problems with XDR buffer provisioning,
and replaces transport-specific debugging instrumentation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-09-21 10:21:08 -04:00
Chuck Lever
4cf44be6f1 xprtrdma: Fix recursion into rpcrdma_xprt_disconnect()
Both Dan and I have observed two processes invoking
rpcrdma_xprt_disconnect() concurrently. In my case:

1. The connect worker invokes rpcrdma_xprt_disconnect(), which
   drains the QP and waits for the final completion
2. This causes the newly posted Receive to flush and invoke
   xprt_force_disconnect()
3. xprt_force_disconnect() sets CLOSE_WAIT and wakes up the RPC task
   that is holding the transport lock
4. The RPC task invokes xprt_connect(), which calls ->ops->close
5. xprt_rdma_close() invokes rpcrdma_xprt_disconnect(), which tries
   to destroy the QP.

Deadlock.

To prevent xprt_force_disconnect() from waking anything, handle the
clean up after a failed connection attempt in the xprt's sndtask.

The retry loop is removed from rpcrdma_xprt_connect() to ensure
that the newly allocated ep and id are properly released before
a REJECTED connection attempt can be retried.

Reported-by: Dan Aloni <dan@kernelim.com>
Fixes: e28ce90083 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-07-13 10:50:41 -04:00
Chuck Lever
2d97f46376 xprtrdma: Use re_connect_status safely in rpcrdma_xprt_connect()
Clean up: Sometimes creating a fresh rpcrdma_ep can fail. That's why
xprt_rdma_connect() always checks if the r_xprt->rx_ep pointer is
valid before dereferencing it. Instead, xprt_rdma_connect() can
simply check rpcrdma_xprt_connect()'s return value.

Also, there's no need to set re_connect_status to zero just after
the rpcrdma_ep is created, since it is allocated with kzalloc.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-22 09:34:35 -04:00
Zou Wei
5bffb00621 xprtrdma: Make xprt_rdma_slot_table_entries static
Fix the following sparse warning:

net/sunrpc/xprtrdma/transport.c:71:14: warning: symbol 'xprt_rdma_slot_table_entries'
was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zou Wei <zou_wei@huawei.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-11 13:33:48 -04:00
Chuck Lever
911813d7a1 SUNRPC: Trace transport lifetime events
Refactor: Hoist create/destroy/disconnect tracepoints out of
xprtrdma and into the generic RPC client. Some benefits include:

- Enable tracing of xprt lifetime events for the socket transport
  types

- Expose the different types of disconnect to help run down
  issues with lingering connections

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-11 13:33:48 -04:00
Chuck Lever
e28ce90083 xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt
Change the rpcrdma_xprt_disconnect() function so that it no longer
waits for the DISCONNECTED event.  This prevents blocking if the
remote is unresponsive.

In rpcrdma_xprt_disconnect(), the transport's rpcrdma_ep is
detached. Upon return from rpcrdma_xprt_disconnect(), the transport
(r_xprt) is ready immediately for a new connection.

The RDMA_CM_DEVICE_REMOVAL and RDMA_CM_DISCONNECTED events are now
handled almost identically.

However, because the lifetimes of rpcrdma_xprt structures and
rpcrdma_ep structures are now independent, creating an rpcrdma_ep
needs to take a module ref count. The ep now owns most of the
hardware resources for a transport.

Also, a kref is needed to ensure that rpcrdma_ep sticks around
long enough for the cm_event_handler to finish.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:25 -04:00
Chuck Lever
93aa8e0a9d xprtrdma: Merge struct rpcrdma_ia into struct rpcrdma_ep
I eventually want to allocate rpcrdma_ep separately from struct
rpcrdma_xprt so that on occasion there can be more than one ep per
xprt.

The new struct rpcrdma_ep will contain all the fields currently in
rpcrdma_ia and in rpcrdma_ep. This is all the device and CM settings
for the connection, in addition to per-connection settings
negotiated with the remote.

Take this opportunity to rename the existing ep fields from rep_* to
re_* to disambiguate these from struct rpcrdma_rep.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:25 -04:00
Chuck Lever
897b7be9bc xprtrdma: Remove rpcrdma_ia::ri_flags
Clean up:
The upper layer serializes calls to xprt_rdma_close, so there is no
need for an atomic bit operation, saving 8 bytes in rpcrdma_ia.

This enables merging rpcrdma_ia_remove directly into the disconnect
logic.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:25 -04:00
Chuck Lever
81fe0c57f4 xprtrdma: Invoke rpcrdma_ia_open in the connect worker
Move rdma_cm_id creation into rpcrdma_ep_create() so that it is now
responsible for allocating all per-connection hardware resources.

With this clean-up, all three arms of the switch statement in
rpcrdma_ep_connect are exactly the same now, thus the switch can be
removed.

Because device removal behaves a little differently than
disconnection, there is a little more work to be done before
rpcrdma_ep_destroy() can release the connection's rdma_cm_id. So
it is not quite symmetrical with rpcrdma_ep_create() yet.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:24 -04:00
Chuck Lever
9144a803df xprtrdma: Refactor rpcrdma_ep_connect() and rpcrdma_ep_disconnect()
Clean up: Simplify the synopses of functions in the connect and
disconnect paths in preparation for combining the rpcrdma_ia and
struct rpcrdma_ep structures.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:24 -04:00
Chuck Lever
97d0de8812 xprtrdma: Clean up the post_send path
Clean up: Simplify the synopses of functions in the post_send path
by combining the struct rpcrdma_ia and struct rpcrdma_ep arguments.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:24 -04:00
Chuck Lever
85cd8e2b78 xprtrdma: Invoke rpcrdma_ep_create() in the connect worker
Refactor rpcrdma_ep_create(), rpcrdma_ep_disconnect(), and
rpcrdma_ep_destroy().

rpcrdma_ep_create will be invoked at connect time instead of at
transport set-up time. It will be responsible for allocating per-
connection resources. In this patch it allocates the CQs and
creates a QP. More to come.

rpcrdma_ep_destroy() is the inverse functionality that is
invoked at disconnect time. It will be responsible for releasing
the CQs and QP.

These changes should be safe to do because both connect and
disconnect is guaranteed to be serialized by the transport send
lock.

This takes us another step closer to resolving the address and route
only at connect time so that connection failover to another device
will work correctly.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:24 -04:00
Chuck Lever
18d065a5d4 xprtrdma: Eliminate per-transport "max pages"
To support device hotplug and migrating a connection between devices
of different capabilities, we have to guarantee that all in-kernel
devices can support the same max NFS payload size (1 megabyte).

This means that possibly one or two in-tree devices are no longer
supported for NFS/RDMA because they cannot support 1MB rsize/wsize.
The only one I confirmed was cxgb3, but it has already been removed
from the kernel.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15 10:54:32 -05:00
Chuck Lever
7581d90109 xprtrdma: Refactor initialization of ep->rep_max_requests
Clean up: there is no need to keep two copies of the same value.
Also, in subsequent patches, rpcrdma_ep_create() will be called in
the connect worker rather than at set-up time.

Minor fix: Initialize the transport's sendctx to the value based on
the capabilities of the underlying device, not the maximum setting.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15 10:54:32 -05:00
Chuck Lever
a52c23b8b2 xprtrdma: Replace dprintk in xprt_rdma_set_port
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24 10:30:40 -04:00
Chuck Lever
7b020f17bb xprtrdma: Report the computed connect delay
For debugging, the op_connect trace point should report the computed
connect delay. We can then ensure that the delay is computed at the
proper times, for example.

As a further clean-up, remove a few low-value "heartbeat" trace
points in the connect path.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24 10:30:40 -04:00
Chuck Lever
6cb28687fd xprtrdma: Wake tasks after connect worker fails
Pending tasks are currently never awoken when the connect worker
fails. The reason is that XPRT_CONNECTED is always clear after a
failure return of rpcrdma_ep_connect, thus the
xprt_test_and_clear_connected() check in xprt_rdma_connect_worker()
always fails.

- xprt_rdma_close always clears XPRT_CONNECTED.

- rpcrdma_ep_connect always clears XPRT_CONNECTED.

After reviewing the TCP connect worker, it appears that there's no
need for extra test_and_set paranoia in xprt_rdma_connect_worker.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24 10:30:40 -04:00
Chuck Lever
eea63ca7ff xprtrdma: Initialize rb_credits in one place
Clean up/code de-duplication.

Nit: RPC_CWNDSHIFT is incorrect as the initial value for xprt->cwnd.
This mistake does not appear to have operational consequences, since
the cwnd value is replaced with a valid value upon the first Receive
completion.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24 10:30:39 -04:00
Chuck Lever
a31b2f9392 xprtrdma: Connection becomes unstable after a reconnect
This is because xprt_request_get_cong() is allowing more than one
RPC Call to be transmitted before the first Receive on the new
connection. The first Receive fills the Receive Queue based on the
server's credit grant. Before that Receive, there is only a single
Receive WR posted because the client doesn't know the server's
credit grant.

Solution is to clear rq_cong on all outstanding rpc_rqsts when the
the cwnd is reset. This is because an RPC/RDMA credit is good for
one connection instance only.

Fixes: 75891f502f ("SUNRPC: Support for congestion control ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24 10:30:39 -04:00
Chuck Lever
f9e1afe0fa xprtrdma: Clear xprt->reestablish_timeout on close
Ensure that the re-establishment delay does not grow exponentially
on each good reconnect. This probably should have been part of
commit 675dd90ad0 ("xprtrdma: Modernize ops->connect").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-26 15:34:59 -04:00
Chuck Lever
2a7f77c7be xprtrdma: Clean up xprt_rdma_set_connect_timeout()
Clean up: The function name should match the documenting comment.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-21 11:53:42 -04:00
Chuck Lever
395790566e xprtrdma: Toggle XPRT_CONGESTED in xprtrdma's slot methods
Commit 48be539dd4 ("xprtrdma: Introduce ->alloc_slot call-out for
xprtrdma") added a separate alloc_slot and free_slot to the RPC/RDMA
transport. Later, commit 75891f502f ("SUNRPC: Support for
congestion control when queuing is enabled") modified the generic
alloc/free_slot methods, but neglected the methods in xprtrdma.

Found via code review.

Fixes: 75891f502f ("SUNRPC: Support for congestion control ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-20 14:06:03 -04:00
Linus Torvalds
249be8511b Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:
 "The rest of MM and a kernel-wide procfs cleanup.

  Summary of the more significant patches:

   - Patch series "mm/memory_hotplug: Factor out memory block
     devicehandling", v3. David Hildenbrand.

     Some spring-cleaning of the memory hotplug code, notably in
     drivers/base/memory.c

   - "mm: thp: fix false negative of shmem vma's THP eligibility". Yang
     Shi.

     Fix /proc/pid/smaps output for THP pages used in shmem.

   - "resource: fix locking in find_next_iomem_res()" + 1. Nadav Amit.

     Bugfix and speedup for kernel/resource.c

   - Patch series "mm: Further memory block device cleanups", David
     Hildenbrand.

     More spring-cleaning of the memory hotplug code.

   - Patch series "mm: Sub-section memory hotplug support". Dan
     Williams.

     Generalise the memory hotplug code so that pmem can use it more
     completely. Then remove the hacks from the libnvdimm code which
     were there to work around the memory-hotplug code's constraints.

   - "proc/sysctl: add shared variables for range check", Matteo Croce.

     We have about 250 instances of

          int zero;
          ...
                  .extra1 = &zero,

     in the tree. This is a tree-wide sweep to make all those private
     "zero"s and "one"s use global variables.

     Alas, it isn't practical to make those two global integers const"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (38 commits)
  proc/sysctl: add shared variables for range check
  mm: migrate: remove unused mode argument
  mm/sparsemem: cleanup 'section number' data types
  libnvdimm/pfn: stop padding pmem namespaces to section alignment
  libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields
  mm/devm_memremap_pages: enable sub-section remap
  mm: document ZONE_DEVICE memory-model implications
  mm/sparsemem: support sub-section hotplug
  mm/sparsemem: prepare for sub-section ranges
  mm: kill is_dev_zone() helper
  mm/hotplug: kill is_dev_zone() usage in __remove_pages()
  mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap()
  mm/hotplug: prepare shrink_{zone, pgdat}_span for sub-section removal
  mm/sparsemem: add helpers track active portions of a section at boot
  mm/sparsemem: introduce a SECTION_IS_EARLY flag
  mm/sparsemem: introduce struct mem_section_usage
  drivers/base/memory.c: get rid of find_memory_block_hinted()
  mm/memory_hotplug: move and simplify walk_memory_blocks()
  mm/memory_hotplug: rename walk_memory_range() and pass start+size instead of pfns
  mm: make register_mem_sect_under_node() static
  ...
2019-07-19 09:45:58 -07:00
Matteo Croce
eec4844fae proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range.  This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.

On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.

The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.

This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data                                         old     new   delta
    sysctl_vals                                    -      12     +12
    __kstrtab_sysctl_vals                          -      12     +12
    max                                           14      10      -4
    int_max                                       16       -     -16
    one                                           68       -     -68
    zero                                         128      28    -100
    Total: Before=20583249, After=20583085, chg -0.00%

[mcroce@redhat.com: tipc: remove two unused variables]
  Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
  Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 17:08:07 -07:00
Trond Myklebust
7402a4fedc SUNRPC: Fix up backchannel slot table accounting
Add a per-transport maximum limit in the socket case, and add
helpers to allow the NFSv4 code to discover that limit.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-18 01:12:59 -04:00
Chuck Lever
675dd90ad0 xprtrdma: Modernize ops->connect
Adapt and apply changes that were made to the TCP socket connect
code. See the following commits for details on the purpose of
these changes:

Commit 7196dbb02e ("SUNRPC: Allow changing of the TCP timeout parameters on the fly")
Commit 3851f1cdb2 ("SUNRPC: Limit the reconnect backoff timer to the max RPC message timeout")
Commit 02910177ae ("SUNRPC: Fix reconnection timeouts")

Some common transport code is moved to xprt.c to satisfy the code
duplication police.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09 10:30:25 -04:00
Chuck Lever
5828cebad1 xprtrdma: Remove rpcrdma_req::rl_buffer
Clean up.

There is only one remaining function, rpcrdma_buffer_put(), that
uses this field. Its caller can supply a pointer to the correct
rpcrdma_buffer, enabling the removal of an 8-byte pointer field
from a frequently-allocated shared data structure.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09 10:30:25 -04:00
Chuck Lever
0ab1152370 xprtrdma: Wake RPCs directly in rpcrdma_wc_send path
Eliminate a context switch in the path that handles RPC wake-ups
when a Receive completion has to wait for a Send completion.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09 10:30:25 -04:00
Chuck Lever
5809ea4f7c xprtrdma: Remove the RPCRDMA_REQ_F_PENDING flag
Commit 9590d083c1 ("xprtrdma: Use xprt_pin_rqst in
rpcrdma_reply_handler") pins incoming RPC/RDMA replies so they
can be left in the pending requests queue while they are being
processed without introducing a race between ->buf_free and the
transport's reply handler. Therefore RPCRDMA_REQ_F_PENDING is no
longer necessary.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09 10:30:20 -04:00
Chuck Lever
86c4ccd9b9 xprtrdma: Eliminate struct rpcrdma_create_data_internal
Clean up.

Move the remaining field in rpcrdma_create_data_internal so the
structure can be removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:39:40 -04:00
Chuck Lever
94087e978e xprtrdma: Aggregate the inline settings in struct rpcrdma_ep
Clean up.

The inline settings are actually a characteristic of the endpoint,
and not related to the device. They are also modified after the
transport instance is created, so they do not belong in the cdata
structure either.

Lastly, let's use names that are more natural to RDMA than to NFS:
inline_write -> inline_send and inline_read -> inline_recv. The
/proc files retain their names to avoid breaking user space.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:39:16 -04:00
Chuck Lever
fd5951742d xprtrdma: Remove rpcrdma_create_data_internal::rsize and wsize
Clean up.

xprt_rdma_max_inline_{read,write} cannot be set to large values
by virtue of proc_dointvec_minmax. The current maximum is
RPCRDMA_MAX_INLINE, which is much smaller than RPCRDMA_MAX_SEGS *
PAGE_SIZE.

The .rsize and .wsize fields are otherwise unused in the current
code base, and thus can be removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:38:30 -04:00