linux-stable/net/sunrpc/xprtrdma/rpc_rdma.c

1336 lines
36 KiB
C
Raw Normal View History

/*
* Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
*
* This software is available to you under a choice of one of two
* licenses. You may choose to be licensed under the terms of the GNU
* General Public License (GPL) Version 2, available from the file
* COPYING in the main directory of this source tree, or the BSD-type
* license below:
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
*
* Redistributions in binary form must reproduce the above
* copyright notice, this list of conditions and the following
* disclaimer in the documentation and/or other materials provided
* with the distribution.
*
* Neither the name of the Network Appliance, Inc. nor the names of
* its contributors may be used to endorse or promote products
* derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
* "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
* LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
* A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
* OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/*
* rpc_rdma.c
*
* This file contains the guts of the RPC RDMA protocol, and
* does marshaling/unmarshaling, etc. It is also where interfacing
* to the Linux RPC framework lives.
*/
#include "xprt_rdma.h"
#include <linux/highmem.h>
#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
# define RPCDBG_FACILITY RPCDBG_TRANS
#endif
static const char transfertypes[][12] = {
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
"inline", /* no chunks */
"read list", /* some argument via rdma read */
"*read list", /* entire request via rdma read */
"write list", /* some result via rdma write */
"reply chunk" /* entire reply via rdma write */
};
/* Returns size of largest RPC-over-RDMA header in a Call message
*
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
* The largest Call header contains a full-size Read list and a
* minimal Reply chunk.
*/
static unsigned int rpcrdma_max_call_header_size(unsigned int maxsegs)
{
unsigned int size;
/* Fixed header fields and list discriminators */
size = RPCRDMA_HDRLEN_MIN;
/* Maximum Read list size */
maxsegs += 2; /* segment for head and tail buffers */
size = maxsegs * sizeof(struct rpcrdma_read_chunk);
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
/* Minimal Read chunk size */
size += sizeof(__be32); /* segment count */
size += sizeof(struct rpcrdma_segment);
size += sizeof(__be32); /* list discriminator */
dprintk("RPC: %s: max call header size = %u\n",
__func__, size);
return size;
}
/* Returns size of largest RPC-over-RDMA header in a Reply message
*
* There is only one Write list or one Reply chunk per Reply
* message. The larger list is the Write list.
*/
static unsigned int rpcrdma_max_reply_header_size(unsigned int maxsegs)
{
unsigned int size;
/* Fixed header fields and list discriminators */
size = RPCRDMA_HDRLEN_MIN;
/* Maximum Write list size */
maxsegs += 2; /* segment for head and tail buffers */
size = sizeof(__be32); /* segment count */
size += maxsegs * sizeof(struct rpcrdma_segment);
size += sizeof(__be32); /* list discriminator */
dprintk("RPC: %s: max reply header size = %u\n",
__func__, size);
return size;
}
void rpcrdma_set_max_header_sizes(struct rpcrdma_xprt *r_xprt)
{
struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
unsigned int maxsegs = ia->ri_max_segs;
ia->ri_max_inline_write = cdata->inline_wsize -
rpcrdma_max_call_header_size(maxsegs);
ia->ri_max_inline_read = cdata->inline_rsize -
rpcrdma_max_reply_header_size(maxsegs);
}
/* The client can send a request inline as long as the RPCRDMA header
* plus the RPC call fit under the transport's inline limit. If the
* combined call message size exceeds that limit, the client must use
* a Read chunk for this operation.
*
* A Read chunk is also required if sending the RPC call inline would
* exceed this device's max_sge limit.
*/
static bool rpcrdma_args_inline(struct rpcrdma_xprt *r_xprt,
struct rpc_rqst *rqst)
{
struct xdr_buf *xdr = &rqst->rq_snd_buf;
unsigned int count, remaining, offset;
if (xdr->len > r_xprt->rx_ia.ri_max_inline_write)
return false;
if (xdr->page_len) {
remaining = xdr->page_len;
offset = offset_in_page(xdr->page_base);
count = 0;
while (remaining) {
remaining -= min_t(unsigned int,
PAGE_SIZE - offset, remaining);
offset = 0;
if (++count > r_xprt->rx_ia.ri_max_send_sges)
return false;
}
}
return true;
}
/* The client can't know how large the actual reply will be. Thus it
* plans for the largest possible reply for that particular ULP
* operation. If the maximum combined reply message size exceeds that
* limit, the client must provide a write list or a reply chunk for
* this request.
*/
static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
struct rpc_rqst *rqst)
{
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
return rqst->rq_rcv_buf.buflen <= ia->ri_max_inline_read;
}
xprtrdma: Segment head and tail XDR buffers on page boundaries A single memory allocation is used for the pair of buffers wherein the RPC client builds an RPC call message and decodes its matching reply. These buffers are sized based on the maximum possible size of the RPC call and reply messages for the operation in progress. This means that as the call buffer increases in size, the start of the reply buffer is pushed farther into the memory allocation. RPC requests are growing in size. It used to be that both the call and reply buffers fit inside a single page. But these days, thanks to NFSv4 (and especially security labels in NFSv4.2) the maximum call and reply sizes are large. NFSv4.0 OPEN, for example, now requires a 6KB allocation for a pair of call and reply buffers, and NFSv4 LOOKUP is not far behind. As the maximum size of a call increases, the reply buffer is pushed far enough into the buffer's memory allocation that a page boundary can appear in the middle of it. When the maximum possible reply size is larger than the client's RDMA receive buffers (currently 1KB), the client has to register a Reply chunk for the server to RDMA Write the reply into. The logic in rpcrdma_convert_iovs() assumes that xdr_buf head and tail buffers would always be contained on a single page. It supplies just one segment for the head and one for the tail. FMR, for example, registers up to a page boundary (only a portion of the reply buffer in the OPEN case above). But without additional segments, it doesn't register the rest of the buffer. When the server tries to write the OPEN reply, the RDMA Write fails with a remote access error since the client registered only part of the Reply chunk. rpcrdma_convert_iovs() must split the XDR buffer into multiple segments, each of which are guaranteed not to contain a page boundary. That way fmr_op_map is given the proper number of segments to register the whole reply buffer. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-04 16:27:52 +00:00
/* Split "vec" on page boundaries into segments. FMR registers pages,
* not a byte range. Other modes coalesce these segments into a single
* MR when they can.
*/
static int
rpcrdma_convert_kvec(struct kvec *vec, struct rpcrdma_mr_seg *seg, int n)
xprtrdma: Segment head and tail XDR buffers on page boundaries A single memory allocation is used for the pair of buffers wherein the RPC client builds an RPC call message and decodes its matching reply. These buffers are sized based on the maximum possible size of the RPC call and reply messages for the operation in progress. This means that as the call buffer increases in size, the start of the reply buffer is pushed farther into the memory allocation. RPC requests are growing in size. It used to be that both the call and reply buffers fit inside a single page. But these days, thanks to NFSv4 (and especially security labels in NFSv4.2) the maximum call and reply sizes are large. NFSv4.0 OPEN, for example, now requires a 6KB allocation for a pair of call and reply buffers, and NFSv4 LOOKUP is not far behind. As the maximum size of a call increases, the reply buffer is pushed far enough into the buffer's memory allocation that a page boundary can appear in the middle of it. When the maximum possible reply size is larger than the client's RDMA receive buffers (currently 1KB), the client has to register a Reply chunk for the server to RDMA Write the reply into. The logic in rpcrdma_convert_iovs() assumes that xdr_buf head and tail buffers would always be contained on a single page. It supplies just one segment for the head and one for the tail. FMR, for example, registers up to a page boundary (only a portion of the reply buffer in the OPEN case above). But without additional segments, it doesn't register the rest of the buffer. When the server tries to write the OPEN reply, the RDMA Write fails with a remote access error since the client registered only part of the Reply chunk. rpcrdma_convert_iovs() must split the XDR buffer into multiple segments, each of which are guaranteed not to contain a page boundary. That way fmr_op_map is given the proper number of segments to register the whole reply buffer. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-04 16:27:52 +00:00
{
size_t page_offset;
u32 remaining;
char *base;
base = vec->iov_base;
page_offset = offset_in_page(base);
remaining = vec->iov_len;
while (remaining && n < RPCRDMA_MAX_SEGS) {
xprtrdma: Segment head and tail XDR buffers on page boundaries A single memory allocation is used for the pair of buffers wherein the RPC client builds an RPC call message and decodes its matching reply. These buffers are sized based on the maximum possible size of the RPC call and reply messages for the operation in progress. This means that as the call buffer increases in size, the start of the reply buffer is pushed farther into the memory allocation. RPC requests are growing in size. It used to be that both the call and reply buffers fit inside a single page. But these days, thanks to NFSv4 (and especially security labels in NFSv4.2) the maximum call and reply sizes are large. NFSv4.0 OPEN, for example, now requires a 6KB allocation for a pair of call and reply buffers, and NFSv4 LOOKUP is not far behind. As the maximum size of a call increases, the reply buffer is pushed far enough into the buffer's memory allocation that a page boundary can appear in the middle of it. When the maximum possible reply size is larger than the client's RDMA receive buffers (currently 1KB), the client has to register a Reply chunk for the server to RDMA Write the reply into. The logic in rpcrdma_convert_iovs() assumes that xdr_buf head and tail buffers would always be contained on a single page. It supplies just one segment for the head and one for the tail. FMR, for example, registers up to a page boundary (only a portion of the reply buffer in the OPEN case above). But without additional segments, it doesn't register the rest of the buffer. When the server tries to write the OPEN reply, the RDMA Write fails with a remote access error since the client registered only part of the Reply chunk. rpcrdma_convert_iovs() must split the XDR buffer into multiple segments, each of which are guaranteed not to contain a page boundary. That way fmr_op_map is given the proper number of segments to register the whole reply buffer. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-04 16:27:52 +00:00
seg[n].mr_page = NULL;
seg[n].mr_offset = base;
seg[n].mr_len = min_t(u32, PAGE_SIZE - page_offset, remaining);
remaining -= seg[n].mr_len;
base += seg[n].mr_len;
++n;
page_offset = 0;
}
return n;
}
/*
* Chunk assembly from upper layer xdr_buf.
*
* Prepare the passed-in xdr_buf into representation as RPC/RDMA chunk
* elements. Segments are then coalesced when registered, if possible
* within the selected memreg mode.
*
* Returns positive number of segments converted, or a negative errno.
*/
static int
rpcrdma_convert_iovs(struct rpcrdma_xprt *r_xprt, struct xdr_buf *xdrbuf,
unsigned int pos, enum rpcrdma_chunktype type,
struct rpcrdma_mr_seg *seg)
{
int len, n, p, page_base;
struct page **ppages;
n = 0;
xprtrdma: Segment head and tail XDR buffers on page boundaries A single memory allocation is used for the pair of buffers wherein the RPC client builds an RPC call message and decodes its matching reply. These buffers are sized based on the maximum possible size of the RPC call and reply messages for the operation in progress. This means that as the call buffer increases in size, the start of the reply buffer is pushed farther into the memory allocation. RPC requests are growing in size. It used to be that both the call and reply buffers fit inside a single page. But these days, thanks to NFSv4 (and especially security labels in NFSv4.2) the maximum call and reply sizes are large. NFSv4.0 OPEN, for example, now requires a 6KB allocation for a pair of call and reply buffers, and NFSv4 LOOKUP is not far behind. As the maximum size of a call increases, the reply buffer is pushed far enough into the buffer's memory allocation that a page boundary can appear in the middle of it. When the maximum possible reply size is larger than the client's RDMA receive buffers (currently 1KB), the client has to register a Reply chunk for the server to RDMA Write the reply into. The logic in rpcrdma_convert_iovs() assumes that xdr_buf head and tail buffers would always be contained on a single page. It supplies just one segment for the head and one for the tail. FMR, for example, registers up to a page boundary (only a portion of the reply buffer in the OPEN case above). But without additional segments, it doesn't register the rest of the buffer. When the server tries to write the OPEN reply, the RDMA Write fails with a remote access error since the client registered only part of the Reply chunk. rpcrdma_convert_iovs() must split the XDR buffer into multiple segments, each of which are guaranteed not to contain a page boundary. That way fmr_op_map is given the proper number of segments to register the whole reply buffer. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-04 16:27:52 +00:00
if (pos == 0) {
n = rpcrdma_convert_kvec(&xdrbuf->head[0], seg, n);
if (n == RPCRDMA_MAX_SEGS)
goto out_overflow;
}
len = xdrbuf->page_len;
ppages = xdrbuf->pages + (xdrbuf->page_base >> PAGE_SHIFT);
page_base = offset_in_page(xdrbuf->page_base);
p = 0;
while (len && n < RPCRDMA_MAX_SEGS) {
if (!ppages[p]) {
/* alloc the pagelist for receiving buffer */
ppages[p] = alloc_page(GFP_ATOMIC);
if (!ppages[p])
return -EAGAIN;
}
seg[n].mr_page = ppages[p];
seg[n].mr_offset = (void *)(unsigned long) page_base;
seg[n].mr_len = min_t(u32, PAGE_SIZE - page_base, len);
if (seg[n].mr_len > PAGE_SIZE)
goto out_overflow;
len -= seg[n].mr_len;
++n;
++p;
page_base = 0; /* page offset only applies to first page */
}
/* Message overflows the seg array */
if (len && n == RPCRDMA_MAX_SEGS)
goto out_overflow;
/* When encoding a Read chunk, the tail iovec contains an
* XDR pad and may be omitted.
*/
if (type == rpcrdma_readch && r_xprt->rx_ia.ri_implicit_roundup)
xprtrdma: Fix XDR tail buffer marshalling Currently xprtrdma appends an extra chunk element to the RPC/RDMA read chunk list of each NFSv4 WRITE compound. The extra element contains the final GETATTR operation in the compound. The result is an extra RDMA READ operation to transfer a very short piece of each NFS WRITE compound (typically 16 bytes). This is inefficient. It is also incorrect. The client is sending the trailing GETATTR at the same Position as the preceding WRITE data payload. Whether or not RFC 5667 allows the GETATTR to appear in a read chunk, RFC 5666 requires that these two separate RPC arguments appear at two distinct Positions. It can also be argued that the GETATTR operation is not bulk data, and therefore RFC 5667 forbids its appearance in a read chunk at all. Although RFC 5667 is not precise about when using a read list with NFSv4 COMPOUND is allowed, the intent is that only data arguments not touched by NFS (ie, read and write payloads) are to be sent using RDMA READ or WRITE. The NFS client constructs GETATTR arguments itself, and therefore is required to send the trailing GETATTR operation as additional inline content, not as a data payload. NB: This change is not backwards compatible. Some older servers do not accept inline content following the read list. The Linux NFS server should handle this content correctly as of commit a97c331f9aa9 ("svcrdma: Handle additional inline content"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-03 17:04:17 +00:00
return n;
/* When encoding a Write chunk, some servers need to see an
* extra segment for non-XDR-aligned Write chunks. The upper
* layer provides space in the tail iovec that may be used
* for this purpose.
*/
if (type == rpcrdma_writech && r_xprt->rx_ia.ri_implicit_roundup)
return n;
if (xdrbuf->tail[0].iov_len) {
n = rpcrdma_convert_kvec(&xdrbuf->tail[0], seg, n);
if (n == RPCRDMA_MAX_SEGS)
goto out_overflow;
}
return n;
out_overflow:
pr_err("rpcrdma: segment array overflow\n");
return -EIO;
}
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
static inline __be32 *
xdr_encode_rdma_segment(__be32 *iptr, struct rpcrdma_mw *mw)
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
{
*iptr++ = cpu_to_be32(mw->mw_handle);
*iptr++ = cpu_to_be32(mw->mw_length);
return xdr_encode_hyper(iptr, mw->mw_offset);
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
}
/* XDR-encode the Read list. Supports encoding a list of read
* segments that belong to a single read chunk.
*
* Encoding key for single-list chunks (HLOO = Handle32 Length32 Offset64):
*
* Read chunklist (a linked list):
* N elements, position P (same P for all chunks of same arg!):
* 1 - PHLOO - 1 - PHLOO - ... - 1 - PHLOO - 0
*
* Returns a pointer to the XDR word in the RDMA header following
* the end of the Read list, or an error pointer.
*/
static __be32 *
rpcrdma_encode_read_list(struct rpcrdma_xprt *r_xprt,
struct rpcrdma_req *req, struct rpc_rqst *rqst,
__be32 *iptr, enum rpcrdma_chunktype rtype)
{
struct rpcrdma_mr_seg *seg;
struct rpcrdma_mw *mw;
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
unsigned int pos;
int n, nsegs;
if (rtype == rpcrdma_noch) {
*iptr++ = xdr_zero; /* item not present */
return iptr;
}
pos = rqst->rq_snd_buf.head[0].iov_len;
if (rtype == rpcrdma_areadch)
pos = 0;
seg = req->rl_segments;
nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_snd_buf, pos,
rtype, seg);
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
if (nsegs < 0)
return ERR_PTR(nsegs);
do {
n = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
false, &mw);
if (n < 0)
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
return ERR_PTR(n);
rpcrdma_push_mw(mw, &req->rl_registered);
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
*iptr++ = xdr_one; /* item present */
/* All read segments in this chunk
* have the same "position".
*/
*iptr++ = cpu_to_be32(pos);
iptr = xdr_encode_rdma_segment(iptr, mw);
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
dprintk("RPC: %5u %s: pos %u %u@0x%016llx:0x%08x (%s)\n",
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
rqst->rq_task->tk_pid, __func__, pos,
mw->mw_length, (unsigned long long)mw->mw_offset,
mw->mw_handle, n < nsegs ? "more" : "last");
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
r_xprt->rx_stats.read_chunk_count++;
seg += n;
nsegs -= n;
} while (nsegs);
/* Finish Read list */
*iptr++ = xdr_zero; /* Next item not present */
return iptr;
}
/* XDR-encode the Write list. Supports encoding a list containing
* one array of plain segments that belong to a single write chunk.
*
* Encoding key for single-list chunks (HLOO = Handle32 Length32 Offset64):
*
* Write chunklist (a list of (one) counted array):
* N elements:
* 1 - N - HLOO - HLOO - ... - HLOO - 0
*
* Returns a pointer to the XDR word in the RDMA header following
* the end of the Write list, or an error pointer.
*/
static __be32 *
rpcrdma_encode_write_list(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
struct rpc_rqst *rqst, __be32 *iptr,
enum rpcrdma_chunktype wtype)
{
struct rpcrdma_mr_seg *seg;
struct rpcrdma_mw *mw;
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
int n, nsegs, nchunks;
__be32 *segcount;
if (wtype != rpcrdma_writech) {
*iptr++ = xdr_zero; /* no Write list present */
return iptr;
}
seg = req->rl_segments;
nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_rcv_buf,
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
rqst->rq_rcv_buf.head[0].iov_len,
wtype, seg);
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
if (nsegs < 0)
return ERR_PTR(nsegs);
*iptr++ = xdr_one; /* Write list present */
segcount = iptr++; /* save location of segment count */
nchunks = 0;
do {
n = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
true, &mw);
if (n < 0)
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
return ERR_PTR(n);
rpcrdma_push_mw(mw, &req->rl_registered);
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
iptr = xdr_encode_rdma_segment(iptr, mw);
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
dprintk("RPC: %5u %s: %u@0x016%llx:0x%08x (%s)\n",
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
rqst->rq_task->tk_pid, __func__,
mw->mw_length, (unsigned long long)mw->mw_offset,
mw->mw_handle, n < nsegs ? "more" : "last");
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
r_xprt->rx_stats.write_chunk_count++;
r_xprt->rx_stats.total_rdma_request += seg->mr_len;
nchunks++;
seg += n;
nsegs -= n;
} while (nsegs);
/* Update count of segments in this Write chunk */
*segcount = cpu_to_be32(nchunks);
/* Finish Write list */
*iptr++ = xdr_zero; /* Next item not present */
return iptr;
}
/* XDR-encode the Reply chunk. Supports encoding an array of plain
* segments that belong to a single write (reply) chunk.
*
* Encoding key for single-list chunks (HLOO = Handle32 Length32 Offset64):
*
* Reply chunk (a counted array):
* N elements:
* 1 - N - HLOO - HLOO - ... - HLOO
*
* Returns a pointer to the XDR word in the RDMA header following
* the end of the Reply chunk, or an error pointer.
*/
static __be32 *
rpcrdma_encode_reply_chunk(struct rpcrdma_xprt *r_xprt,
struct rpcrdma_req *req, struct rpc_rqst *rqst,
__be32 *iptr, enum rpcrdma_chunktype wtype)
{
struct rpcrdma_mr_seg *seg;
struct rpcrdma_mw *mw;
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
int n, nsegs, nchunks;
__be32 *segcount;
if (wtype != rpcrdma_replych) {
*iptr++ = xdr_zero; /* no Reply chunk present */
return iptr;
}
seg = req->rl_segments;
nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_rcv_buf, 0, wtype, seg);
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
if (nsegs < 0)
return ERR_PTR(nsegs);
*iptr++ = xdr_one; /* Reply chunk present */
segcount = iptr++; /* save location of segment count */
nchunks = 0;
do {
n = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
true, &mw);
if (n < 0)
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
return ERR_PTR(n);
rpcrdma_push_mw(mw, &req->rl_registered);
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
iptr = xdr_encode_rdma_segment(iptr, mw);
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
dprintk("RPC: %5u %s: %u@0x%016llx:0x%08x (%s)\n",
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
rqst->rq_task->tk_pid, __func__,
mw->mw_length, (unsigned long long)mw->mw_offset,
mw->mw_handle, n < nsegs ? "more" : "last");
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
r_xprt->rx_stats.reply_chunk_count++;
r_xprt->rx_stats.total_rdma_request += seg->mr_len;
nchunks++;
seg += n;
nsegs -= n;
} while (nsegs);
/* Update count of segments in the Reply chunk */
*segcount = cpu_to_be32(nchunks);
return iptr;
}
xprtrdma: Use gathered Send for large inline messages An RPC Call message that is sent inline but that has a data payload (ie, one or more items in rq_snd_buf's page list) must be "pulled up:" - call_allocate has to reserve enough RPC Call buffer space to accommodate the data payload - call_transmit has to memcopy the rq_snd_buf's page list and tail into its head iovec before it is sent As the inline threshold is increased beyond its current 1KB default, however, this means data payloads of more than a few KB are copied by the host CPU. For example, if the inline threshold is increased just to 4KB, then NFS WRITE requests up to 4KB would involve a memcpy of the NFS WRITE's payload data into the RPC Call buffer. This is an undesirable amount of participation by the host CPU. The inline threshold may be much larger than 4KB in the future, after negotiation with a peer server. Instead of copying the components of rq_snd_buf into its head iovec, construct a gather list of these components, and send them all in place. The same approach is already used in the Linux server's RPC-over-RDMA reply path. This mechanism also eliminates the need for rpcrdma_tail_pullup, which is used to manage the XDR pad and trailing inline content when a Read list is present. This requires that the pages in rq_snd_buf's page list be DMA-mapped during marshaling, and unmapped when a data-bearing RPC is completed. This is slightly less efficient for very small I/O payloads, but significantly more efficient as data payload size and inline threshold increase past a kilobyte. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 14:57:24 +00:00
/* Prepare the RPC-over-RDMA header SGE.
*/
xprtrdma: Use gathered Send for large inline messages An RPC Call message that is sent inline but that has a data payload (ie, one or more items in rq_snd_buf's page list) must be "pulled up:" - call_allocate has to reserve enough RPC Call buffer space to accommodate the data payload - call_transmit has to memcopy the rq_snd_buf's page list and tail into its head iovec before it is sent As the inline threshold is increased beyond its current 1KB default, however, this means data payloads of more than a few KB are copied by the host CPU. For example, if the inline threshold is increased just to 4KB, then NFS WRITE requests up to 4KB would involve a memcpy of the NFS WRITE's payload data into the RPC Call buffer. This is an undesirable amount of participation by the host CPU. The inline threshold may be much larger than 4KB in the future, after negotiation with a peer server. Instead of copying the components of rq_snd_buf into its head iovec, construct a gather list of these components, and send them all in place. The same approach is already used in the Linux server's RPC-over-RDMA reply path. This mechanism also eliminates the need for rpcrdma_tail_pullup, which is used to manage the XDR pad and trailing inline content when a Read list is present. This requires that the pages in rq_snd_buf's page list be DMA-mapped during marshaling, and unmapped when a data-bearing RPC is completed. This is slightly less efficient for very small I/O payloads, but significantly more efficient as data payload size and inline threshold increase past a kilobyte. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 14:57:24 +00:00
static bool
rpcrdma_prepare_hdr_sge(struct rpcrdma_ia *ia, struct rpcrdma_req *req,
u32 len)
{
xprtrdma: Use gathered Send for large inline messages An RPC Call message that is sent inline but that has a data payload (ie, one or more items in rq_snd_buf's page list) must be "pulled up:" - call_allocate has to reserve enough RPC Call buffer space to accommodate the data payload - call_transmit has to memcopy the rq_snd_buf's page list and tail into its head iovec before it is sent As the inline threshold is increased beyond its current 1KB default, however, this means data payloads of more than a few KB are copied by the host CPU. For example, if the inline threshold is increased just to 4KB, then NFS WRITE requests up to 4KB would involve a memcpy of the NFS WRITE's payload data into the RPC Call buffer. This is an undesirable amount of participation by the host CPU. The inline threshold may be much larger than 4KB in the future, after negotiation with a peer server. Instead of copying the components of rq_snd_buf into its head iovec, construct a gather list of these components, and send them all in place. The same approach is already used in the Linux server's RPC-over-RDMA reply path. This mechanism also eliminates the need for rpcrdma_tail_pullup, which is used to manage the XDR pad and trailing inline content when a Read list is present. This requires that the pages in rq_snd_buf's page list be DMA-mapped during marshaling, and unmapped when a data-bearing RPC is completed. This is slightly less efficient for very small I/O payloads, but significantly more efficient as data payload size and inline threshold increase past a kilobyte. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 14:57:24 +00:00
struct rpcrdma_regbuf *rb = req->rl_rdmabuf;
struct ib_sge *sge = &req->rl_send_sge[0];
if (unlikely(!rpcrdma_regbuf_is_mapped(rb))) {
if (!__rpcrdma_dma_map_regbuf(ia, rb))
return false;
sge->addr = rdmab_addr(rb);
sge->lkey = rdmab_lkey(rb);
}
sge->length = len;
ib_dma_sync_single_for_device(rdmab_device(rb), sge->addr,
xprtrdma: Use gathered Send for large inline messages An RPC Call message that is sent inline but that has a data payload (ie, one or more items in rq_snd_buf's page list) must be "pulled up:" - call_allocate has to reserve enough RPC Call buffer space to accommodate the data payload - call_transmit has to memcopy the rq_snd_buf's page list and tail into its head iovec before it is sent As the inline threshold is increased beyond its current 1KB default, however, this means data payloads of more than a few KB are copied by the host CPU. For example, if the inline threshold is increased just to 4KB, then NFS WRITE requests up to 4KB would involve a memcpy of the NFS WRITE's payload data into the RPC Call buffer. This is an undesirable amount of participation by the host CPU. The inline threshold may be much larger than 4KB in the future, after negotiation with a peer server. Instead of copying the components of rq_snd_buf into its head iovec, construct a gather list of these components, and send them all in place. The same approach is already used in the Linux server's RPC-over-RDMA reply path. This mechanism also eliminates the need for rpcrdma_tail_pullup, which is used to manage the XDR pad and trailing inline content when a Read list is present. This requires that the pages in rq_snd_buf's page list be DMA-mapped during marshaling, and unmapped when a data-bearing RPC is completed. This is slightly less efficient for very small I/O payloads, but significantly more efficient as data payload size and inline threshold increase past a kilobyte. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 14:57:24 +00:00
sge->length, DMA_TO_DEVICE);
req->rl_send_wr.num_sge++;
return true;
}
/* Prepare the Send SGEs. The head and tail iovec, and each entry
* in the page list, gets its own SGE.
*/
static bool
rpcrdma_prepare_msg_sges(struct rpcrdma_ia *ia, struct rpcrdma_req *req,
struct xdr_buf *xdr, enum rpcrdma_chunktype rtype)
{
unsigned int sge_no, page_base, len, remaining;
struct rpcrdma_regbuf *rb = req->rl_sendbuf;
struct ib_device *device = ia->ri_device;
struct ib_sge *sge = req->rl_send_sge;
u32 lkey = ia->ri_pd->local_dma_lkey;
struct page *page, **ppages;
/* The head iovec is straightforward, as it is already
* DMA-mapped. Sync the content that has changed.
*/
if (!rpcrdma_dma_map_regbuf(ia, rb))
return false;
sge_no = 1;
sge[sge_no].addr = rdmab_addr(rb);
sge[sge_no].length = xdr->head[0].iov_len;
sge[sge_no].lkey = rdmab_lkey(rb);
ib_dma_sync_single_for_device(rdmab_device(rb), sge[sge_no].addr,
xprtrdma: Use gathered Send for large inline messages An RPC Call message that is sent inline but that has a data payload (ie, one or more items in rq_snd_buf's page list) must be "pulled up:" - call_allocate has to reserve enough RPC Call buffer space to accommodate the data payload - call_transmit has to memcopy the rq_snd_buf's page list and tail into its head iovec before it is sent As the inline threshold is increased beyond its current 1KB default, however, this means data payloads of more than a few KB are copied by the host CPU. For example, if the inline threshold is increased just to 4KB, then NFS WRITE requests up to 4KB would involve a memcpy of the NFS WRITE's payload data into the RPC Call buffer. This is an undesirable amount of participation by the host CPU. The inline threshold may be much larger than 4KB in the future, after negotiation with a peer server. Instead of copying the components of rq_snd_buf into its head iovec, construct a gather list of these components, and send them all in place. The same approach is already used in the Linux server's RPC-over-RDMA reply path. This mechanism also eliminates the need for rpcrdma_tail_pullup, which is used to manage the XDR pad and trailing inline content when a Read list is present. This requires that the pages in rq_snd_buf's page list be DMA-mapped during marshaling, and unmapped when a data-bearing RPC is completed. This is slightly less efficient for very small I/O payloads, but significantly more efficient as data payload size and inline threshold increase past a kilobyte. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 14:57:24 +00:00
sge[sge_no].length, DMA_TO_DEVICE);
/* If there is a Read chunk, the page list is being handled
* via explicit RDMA, and thus is skipped here. However, the
* tail iovec may include an XDR pad for the page list, as
* well as additional content, and may not reside in the
* same page as the head iovec.
*/
if (rtype == rpcrdma_readch) {
len = xdr->tail[0].iov_len;
xprtrdma: Use gathered Send for large inline messages An RPC Call message that is sent inline but that has a data payload (ie, one or more items in rq_snd_buf's page list) must be "pulled up:" - call_allocate has to reserve enough RPC Call buffer space to accommodate the data payload - call_transmit has to memcopy the rq_snd_buf's page list and tail into its head iovec before it is sent As the inline threshold is increased beyond its current 1KB default, however, this means data payloads of more than a few KB are copied by the host CPU. For example, if the inline threshold is increased just to 4KB, then NFS WRITE requests up to 4KB would involve a memcpy of the NFS WRITE's payload data into the RPC Call buffer. This is an undesirable amount of participation by the host CPU. The inline threshold may be much larger than 4KB in the future, after negotiation with a peer server. Instead of copying the components of rq_snd_buf into its head iovec, construct a gather list of these components, and send them all in place. The same approach is already used in the Linux server's RPC-over-RDMA reply path. This mechanism also eliminates the need for rpcrdma_tail_pullup, which is used to manage the XDR pad and trailing inline content when a Read list is present. This requires that the pages in rq_snd_buf's page list be DMA-mapped during marshaling, and unmapped when a data-bearing RPC is completed. This is slightly less efficient for very small I/O payloads, but significantly more efficient as data payload size and inline threshold increase past a kilobyte. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 14:57:24 +00:00
/* Do not include the tail if it is only an XDR pad */
if (len < 4)
goto out;
xprtrdma: Use gathered Send for large inline messages An RPC Call message that is sent inline but that has a data payload (ie, one or more items in rq_snd_buf's page list) must be "pulled up:" - call_allocate has to reserve enough RPC Call buffer space to accommodate the data payload - call_transmit has to memcopy the rq_snd_buf's page list and tail into its head iovec before it is sent As the inline threshold is increased beyond its current 1KB default, however, this means data payloads of more than a few KB are copied by the host CPU. For example, if the inline threshold is increased just to 4KB, then NFS WRITE requests up to 4KB would involve a memcpy of the NFS WRITE's payload data into the RPC Call buffer. This is an undesirable amount of participation by the host CPU. The inline threshold may be much larger than 4KB in the future, after negotiation with a peer server. Instead of copying the components of rq_snd_buf into its head iovec, construct a gather list of these components, and send them all in place. The same approach is already used in the Linux server's RPC-over-RDMA reply path. This mechanism also eliminates the need for rpcrdma_tail_pullup, which is used to manage the XDR pad and trailing inline content when a Read list is present. This requires that the pages in rq_snd_buf's page list be DMA-mapped during marshaling, and unmapped when a data-bearing RPC is completed. This is slightly less efficient for very small I/O payloads, but significantly more efficient as data payload size and inline threshold increase past a kilobyte. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 14:57:24 +00:00
page = virt_to_page(xdr->tail[0].iov_base);
page_base = offset_in_page(xdr->tail[0].iov_base);
xprtrdma: Use gathered Send for large inline messages An RPC Call message that is sent inline but that has a data payload (ie, one or more items in rq_snd_buf's page list) must be "pulled up:" - call_allocate has to reserve enough RPC Call buffer space to accommodate the data payload - call_transmit has to memcopy the rq_snd_buf's page list and tail into its head iovec before it is sent As the inline threshold is increased beyond its current 1KB default, however, this means data payloads of more than a few KB are copied by the host CPU. For example, if the inline threshold is increased just to 4KB, then NFS WRITE requests up to 4KB would involve a memcpy of the NFS WRITE's payload data into the RPC Call buffer. This is an undesirable amount of participation by the host CPU. The inline threshold may be much larger than 4KB in the future, after negotiation with a peer server. Instead of copying the components of rq_snd_buf into its head iovec, construct a gather list of these components, and send them all in place. The same approach is already used in the Linux server's RPC-over-RDMA reply path. This mechanism also eliminates the need for rpcrdma_tail_pullup, which is used to manage the XDR pad and trailing inline content when a Read list is present. This requires that the pages in rq_snd_buf's page list be DMA-mapped during marshaling, and unmapped when a data-bearing RPC is completed. This is slightly less efficient for very small I/O payloads, but significantly more efficient as data payload size and inline threshold increase past a kilobyte. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 14:57:24 +00:00
/* If the content in the page list is an odd length,
* xdr_write_pages() has added a pad at the beginning
* of the tail iovec. Force the tail's non-pad content
* to land at the next XDR position in the Send message.
*/
page_base += len & 3;
len -= len & 3;
goto map_tail;
}
xprtrdma: Use gathered Send for large inline messages An RPC Call message that is sent inline but that has a data payload (ie, one or more items in rq_snd_buf's page list) must be "pulled up:" - call_allocate has to reserve enough RPC Call buffer space to accommodate the data payload - call_transmit has to memcopy the rq_snd_buf's page list and tail into its head iovec before it is sent As the inline threshold is increased beyond its current 1KB default, however, this means data payloads of more than a few KB are copied by the host CPU. For example, if the inline threshold is increased just to 4KB, then NFS WRITE requests up to 4KB would involve a memcpy of the NFS WRITE's payload data into the RPC Call buffer. This is an undesirable amount of participation by the host CPU. The inline threshold may be much larger than 4KB in the future, after negotiation with a peer server. Instead of copying the components of rq_snd_buf into its head iovec, construct a gather list of these components, and send them all in place. The same approach is already used in the Linux server's RPC-over-RDMA reply path. This mechanism also eliminates the need for rpcrdma_tail_pullup, which is used to manage the XDR pad and trailing inline content when a Read list is present. This requires that the pages in rq_snd_buf's page list be DMA-mapped during marshaling, and unmapped when a data-bearing RPC is completed. This is slightly less efficient for very small I/O payloads, but significantly more efficient as data payload size and inline threshold increase past a kilobyte. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 14:57:24 +00:00
/* If there is a page list present, temporarily DMA map
* and prepare an SGE for each page to be sent.
*/
if (xdr->page_len) {
ppages = xdr->pages + (xdr->page_base >> PAGE_SHIFT);
page_base = offset_in_page(xdr->page_base);
xprtrdma: Use gathered Send for large inline messages An RPC Call message that is sent inline but that has a data payload (ie, one or more items in rq_snd_buf's page list) must be "pulled up:" - call_allocate has to reserve enough RPC Call buffer space to accommodate the data payload - call_transmit has to memcopy the rq_snd_buf's page list and tail into its head iovec before it is sent As the inline threshold is increased beyond its current 1KB default, however, this means data payloads of more than a few KB are copied by the host CPU. For example, if the inline threshold is increased just to 4KB, then NFS WRITE requests up to 4KB would involve a memcpy of the NFS WRITE's payload data into the RPC Call buffer. This is an undesirable amount of participation by the host CPU. The inline threshold may be much larger than 4KB in the future, after negotiation with a peer server. Instead of copying the components of rq_snd_buf into its head iovec, construct a gather list of these components, and send them all in place. The same approach is already used in the Linux server's RPC-over-RDMA reply path. This mechanism also eliminates the need for rpcrdma_tail_pullup, which is used to manage the XDR pad and trailing inline content when a Read list is present. This requires that the pages in rq_snd_buf's page list be DMA-mapped during marshaling, and unmapped when a data-bearing RPC is completed. This is slightly less efficient for very small I/O payloads, but significantly more efficient as data payload size and inline threshold increase past a kilobyte. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 14:57:24 +00:00
remaining = xdr->page_len;
while (remaining) {
sge_no++;
if (sge_no > RPCRDMA_MAX_SEND_SGES - 2)
goto out_mapping_overflow;
len = min_t(u32, PAGE_SIZE - page_base, remaining);
sge[sge_no].addr = ib_dma_map_page(device, *ppages,
page_base, len,
DMA_TO_DEVICE);
if (ib_dma_mapping_error(device, sge[sge_no].addr))
goto out_mapping_err;
sge[sge_no].length = len;
sge[sge_no].lkey = lkey;
req->rl_mapped_sges++;
ppages++;
remaining -= len;
page_base = 0;
}
}
xprtrdma: Use gathered Send for large inline messages An RPC Call message that is sent inline but that has a data payload (ie, one or more items in rq_snd_buf's page list) must be "pulled up:" - call_allocate has to reserve enough RPC Call buffer space to accommodate the data payload - call_transmit has to memcopy the rq_snd_buf's page list and tail into its head iovec before it is sent As the inline threshold is increased beyond its current 1KB default, however, this means data payloads of more than a few KB are copied by the host CPU. For example, if the inline threshold is increased just to 4KB, then NFS WRITE requests up to 4KB would involve a memcpy of the NFS WRITE's payload data into the RPC Call buffer. This is an undesirable amount of participation by the host CPU. The inline threshold may be much larger than 4KB in the future, after negotiation with a peer server. Instead of copying the components of rq_snd_buf into its head iovec, construct a gather list of these components, and send them all in place. The same approach is already used in the Linux server's RPC-over-RDMA reply path. This mechanism also eliminates the need for rpcrdma_tail_pullup, which is used to manage the XDR pad and trailing inline content when a Read list is present. This requires that the pages in rq_snd_buf's page list be DMA-mapped during marshaling, and unmapped when a data-bearing RPC is completed. This is slightly less efficient for very small I/O payloads, but significantly more efficient as data payload size and inline threshold increase past a kilobyte. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 14:57:24 +00:00
/* The tail iovec is not always constructed in the same
* page where the head iovec resides (see, for example,
* gss_wrap_req_priv). To neatly accommodate that case,
* DMA map it separately.
*/
if (xdr->tail[0].iov_len) {
page = virt_to_page(xdr->tail[0].iov_base);
page_base = offset_in_page(xdr->tail[0].iov_base);
xprtrdma: Use gathered Send for large inline messages An RPC Call message that is sent inline but that has a data payload (ie, one or more items in rq_snd_buf's page list) must be "pulled up:" - call_allocate has to reserve enough RPC Call buffer space to accommodate the data payload - call_transmit has to memcopy the rq_snd_buf's page list and tail into its head iovec before it is sent As the inline threshold is increased beyond its current 1KB default, however, this means data payloads of more than a few KB are copied by the host CPU. For example, if the inline threshold is increased just to 4KB, then NFS WRITE requests up to 4KB would involve a memcpy of the NFS WRITE's payload data into the RPC Call buffer. This is an undesirable amount of participation by the host CPU. The inline threshold may be much larger than 4KB in the future, after negotiation with a peer server. Instead of copying the components of rq_snd_buf into its head iovec, construct a gather list of these components, and send them all in place. The same approach is already used in the Linux server's RPC-over-RDMA reply path. This mechanism also eliminates the need for rpcrdma_tail_pullup, which is used to manage the XDR pad and trailing inline content when a Read list is present. This requires that the pages in rq_snd_buf's page list be DMA-mapped during marshaling, and unmapped when a data-bearing RPC is completed. This is slightly less efficient for very small I/O payloads, but significantly more efficient as data payload size and inline threshold increase past a kilobyte. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 14:57:24 +00:00
len = xdr->tail[0].iov_len;
map_tail:
sge_no++;
sge[sge_no].addr = ib_dma_map_page(device, page,
page_base, len,
DMA_TO_DEVICE);
if (ib_dma_mapping_error(device, sge[sge_no].addr))
goto out_mapping_err;
sge[sge_no].length = len;
sge[sge_no].lkey = lkey;
req->rl_mapped_sges++;
}
xprtrdma: Use gathered Send for large inline messages An RPC Call message that is sent inline but that has a data payload (ie, one or more items in rq_snd_buf's page list) must be "pulled up:" - call_allocate has to reserve enough RPC Call buffer space to accommodate the data payload - call_transmit has to memcopy the rq_snd_buf's page list and tail into its head iovec before it is sent As the inline threshold is increased beyond its current 1KB default, however, this means data payloads of more than a few KB are copied by the host CPU. For example, if the inline threshold is increased just to 4KB, then NFS WRITE requests up to 4KB would involve a memcpy of the NFS WRITE's payload data into the RPC Call buffer. This is an undesirable amount of participation by the host CPU. The inline threshold may be much larger than 4KB in the future, after negotiation with a peer server. Instead of copying the components of rq_snd_buf into its head iovec, construct a gather list of these components, and send them all in place. The same approach is already used in the Linux server's RPC-over-RDMA reply path. This mechanism also eliminates the need for rpcrdma_tail_pullup, which is used to manage the XDR pad and trailing inline content when a Read list is present. This requires that the pages in rq_snd_buf's page list be DMA-mapped during marshaling, and unmapped when a data-bearing RPC is completed. This is slightly less efficient for very small I/O payloads, but significantly more efficient as data payload size and inline threshold increase past a kilobyte. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 14:57:24 +00:00
out:
req->rl_send_wr.num_sge = sge_no + 1;
return true;
out_mapping_overflow:
pr_err("rpcrdma: too many Send SGEs (%u)\n", sge_no);
return false;
out_mapping_err:
pr_err("rpcrdma: Send mapping error\n");
return false;
}
bool
rpcrdma_prepare_send_sges(struct rpcrdma_ia *ia, struct rpcrdma_req *req,
u32 hdrlen, struct xdr_buf *xdr,
enum rpcrdma_chunktype rtype)
{
req->rl_send_wr.num_sge = 0;
req->rl_mapped_sges = 0;
if (!rpcrdma_prepare_hdr_sge(ia, req, hdrlen))
goto out_map;
if (rtype != rpcrdma_areadch)
if (!rpcrdma_prepare_msg_sges(ia, req, xdr, rtype))
goto out_map;
return true;
out_map:
pr_err("rpcrdma: failed to DMA map a Send buffer\n");
return false;
}
void
rpcrdma_unmap_sges(struct rpcrdma_ia *ia, struct rpcrdma_req *req)
{
struct ib_device *device = ia->ri_device;
struct ib_sge *sge;
int count;
sge = &req->rl_send_sge[2];
for (count = req->rl_mapped_sges; count--; sge++)
ib_dma_unmap_page(device, sge->addr, sge->length,
DMA_TO_DEVICE);
req->rl_mapped_sges = 0;
}
/**
* rpcrdma_marshal_req - Marshal and send one RPC request
* @r_xprt: controlling transport
* @rqst: RPC request to be marshaled
*
* For the RPC in "rqst", this function:
* - Chooses the transfer mode (eg., RDMA_MSG or RDMA_NOMSG)
* - Registers Read, Write, and Reply chunks
* - Constructs the transport header
* - Posts a Send WR to send the transport header and request
*
* Returns:
* %0 if the RPC was sent successfully,
* %-ENOTCONN if the connection was lost,
* %-EAGAIN if not enough pages are available for on-demand reply buffer,
* %-ENOBUFS if no MRs are available to register chunks,
* %-EIO if a permanent problem occurred while marshaling.
*/
int
rpcrdma_marshal_req(struct rpcrdma_xprt *r_xprt, struct rpc_rqst *rqst)
{
struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
enum rpcrdma_chunktype rtype, wtype;
struct rpcrdma_msg *headerp;
bool ddp_allowed;
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
ssize_t hdrlen;
__be32 *iptr;
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
if (test_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state))
return rpcrdma_bc_marshal_reply(rqst);
#endif
headerp = rdmab_to_msg(req->rl_rdmabuf);
/* don't byte-swap XID, it's already done in request */
headerp->rm_xid = rqst->rq_xid;
headerp->rm_vers = rpcrdma_version;
headerp->rm_credit = cpu_to_be32(r_xprt->rx_buf.rb_max_requests);
headerp->rm_type = rdma_msg;
/* When the ULP employs a GSS flavor that guarantees integrity
* or privacy, direct data placement of individual data items
* is not allowed.
*/
ddp_allowed = !(rqst->rq_cred->cr_auth->au_flags &
RPCAUTH_AUTH_DATATOUCH);
/*
* Chunks needed for results?
*
* o If the expected result is under the inline threshold, all ops
* return as inline.
* o Large read ops return data as write chunk(s), header as
* inline.
* o Large non-read ops return as a single reply chunk.
*/
if (rpcrdma_results_inline(r_xprt, rqst))
wtype = rpcrdma_noch;
else if (ddp_allowed && rqst->rq_rcv_buf.flags & XDRBUF_READ)
wtype = rpcrdma_writech;
else
wtype = rpcrdma_replych;
/*
* Chunks needed for arguments?
*
* o If the total request is under the inline threshold, all ops
* are sent as inline.
* o Large write ops transmit data as read chunk(s), header as
* inline.
* o Large non-write ops are sent with the entire message as a
* single read chunk (protocol 0-position special case).
*
* This assumes that the upper layer does not present a request
* that both has a data payload, and whose non-data arguments
* by themselves are larger than the inline threshold.
*/
if (rpcrdma_args_inline(r_xprt, rqst)) {
rtype = rpcrdma_noch;
} else if (ddp_allowed && rqst->rq_snd_buf.flags & XDRBUF_WRITE) {
rtype = rpcrdma_readch;
} else {
r_xprt->rx_stats.nomsg_call_count++;
headerp->rm_type = htonl(RDMA_NOMSG);
rtype = rpcrdma_areadch;
}
xprtrdma: Fix client lock-up after application signal fires After a signal, the RPC client aborts synchronous RPCs running on behalf of the signaled application. The server is still executing those RPCs, and will write the results back into the client's memory when it's done. By the time the server writes the results, that memory is likely being used for other purposes. Therefore xprtrdma has to immediately invalidate all memory regions used by those aborted RPCs to prevent the server's writes from clobbering that re-used memory. With FMR memory registration, invalidation takes a relatively long time. In fact, the invalidation is often still running when the server tries to write the results into the memory regions that are being invalidated. This sets up a race between two processes: 1. After the signal, xprt_rdma_free calls ro_unmap_safe. 2. While ro_unmap_safe is still running, the server replies and rpcrdma_reply_handler runs, calling ro_unmap_sync. Both processes invoke ib_unmap_fmr on the same FMR. The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at the same time, but HCAs generally don't tolerate this. Sometimes this can result in a system crash. If the HCA happens to survive, rpcrdma_reply_handler continues. It removes the rpc_rqst from rq_list and releases the transport_lock. This enables xprt_rdma_free to run in another process, and the rpc_rqst is released while rpcrdma_reply_handler is still waiting for the ib_unmap_fmr call to finish. But further down in rpcrdma_reply_handler, the transport_lock is taken again, and "rqst" is dereferenced. If "rqst" has already been released, this triggers a general protection fault. Since bottom- halves are disabled, the system locks up. Address both issues by reversing the order of the xprt_lookup_rqst call and the ro_unmap_sync call. Introduce a separate lookup mechanism for rpcrdma_req's to enable calling ro_unmap_sync before xprt_lookup_rqst. Now the handler takes the transport_lock once and holds it for the XID lookup and RPC completion. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 15:52:20 +00:00
req->rl_xid = rqst->rq_xid;
rpcrdma_insert_req(&r_xprt->rx_buf, req);
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
/* This implementation supports the following combinations
* of chunk lists in one RPC-over-RDMA Call message:
*
* - Read list
* - Write list
* - Reply chunk
* - Read list + Reply chunk
*
* It might not yet support the following combinations:
*
* - Read list + Write list
*
* It does not support the following combinations:
*
* - Write list + Reply chunk
* - Read list + Write list + Reply chunk
*
* This implementation supports only a single chunk in each
* Read or Write list. Thus for example the client cannot
* send a Call message with a Position Zero Read chunk and a
* regular Read chunk at the same time.
*/
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
iptr = headerp->rm_body.rm_chunks;
iptr = rpcrdma_encode_read_list(r_xprt, req, rqst, iptr, rtype);
if (IS_ERR(iptr))
xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs Sriharsha (sriharsha.basavapatna@broadcom.com) reports an occasional double DMA unmap of an FRWR MR when a connection is lost. I see one way this can happen. When a request requires more than one segment or chunk, rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment (MR) in each chunk. Each call posts a FASTREG Work Request to register one MR. Now suppose that the transport connection is lost part-way through marshaling this request. As part of recovering and resetting that req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands all the req's registered FRWRs to the MR recovery thread. But note: FRWR registration is asynchronous. So it's possible that some of these "already registered" FRWRs are fully registered, and some are still waiting for their FASTREG WR to complete. When the connection is lost, the "already registered" frmrs are marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR. But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be running. - If the recovery thread runs last, then the frmr is marked FRMR_IS_INVALID, and life continues. - If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR, but the recovery thread has already DMA unmapped that MR. When ->frwr_op_map later re-uses this frmr, it sees it is not marked FRMR_IS_INVALID, and tries to recover it before using it, resulting in a second DMA unmap of the same MR. The fix is to guarantee in-flight FASTREG WRs have flushed before MR recovery runs on those FRWRs. Thus we depend on ro_unmap_safe (called from xprt_rdma_send_request on retransmit, or from xprt_rdma_free) to clean up old registrations as needed. Reported-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08 22:00:27 +00:00
goto out_err;
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
iptr = rpcrdma_encode_write_list(r_xprt, req, rqst, iptr, wtype);
if (IS_ERR(iptr))
xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs Sriharsha (sriharsha.basavapatna@broadcom.com) reports an occasional double DMA unmap of an FRWR MR when a connection is lost. I see one way this can happen. When a request requires more than one segment or chunk, rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment (MR) in each chunk. Each call posts a FASTREG Work Request to register one MR. Now suppose that the transport connection is lost part-way through marshaling this request. As part of recovering and resetting that req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands all the req's registered FRWRs to the MR recovery thread. But note: FRWR registration is asynchronous. So it's possible that some of these "already registered" FRWRs are fully registered, and some are still waiting for their FASTREG WR to complete. When the connection is lost, the "already registered" frmrs are marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR. But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be running. - If the recovery thread runs last, then the frmr is marked FRMR_IS_INVALID, and life continues. - If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR, but the recovery thread has already DMA unmapped that MR. When ->frwr_op_map later re-uses this frmr, it sees it is not marked FRMR_IS_INVALID, and tries to recover it before using it, resulting in a second DMA unmap of the same MR. The fix is to guarantee in-flight FASTREG WRs have flushed before MR recovery runs on those FRWRs. Thus we depend on ro_unmap_safe (called from xprt_rdma_send_request on retransmit, or from xprt_rdma_free) to clean up old registrations as needed. Reported-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08 22:00:27 +00:00
goto out_err;
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
iptr = rpcrdma_encode_reply_chunk(r_xprt, req, rqst, iptr, wtype);
if (IS_ERR(iptr))
xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs Sriharsha (sriharsha.basavapatna@broadcom.com) reports an occasional double DMA unmap of an FRWR MR when a connection is lost. I see one way this can happen. When a request requires more than one segment or chunk, rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment (MR) in each chunk. Each call posts a FASTREG Work Request to register one MR. Now suppose that the transport connection is lost part-way through marshaling this request. As part of recovering and resetting that req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands all the req's registered FRWRs to the MR recovery thread. But note: FRWR registration is asynchronous. So it's possible that some of these "already registered" FRWRs are fully registered, and some are still waiting for their FASTREG WR to complete. When the connection is lost, the "already registered" frmrs are marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR. But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be running. - If the recovery thread runs last, then the frmr is marked FRMR_IS_INVALID, and life continues. - If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR, but the recovery thread has already DMA unmapped that MR. When ->frwr_op_map later re-uses this frmr, it sees it is not marked FRMR_IS_INVALID, and tries to recover it before using it, resulting in a second DMA unmap of the same MR. The fix is to guarantee in-flight FASTREG WRs have flushed before MR recovery runs on those FRWRs. Thus we depend on ro_unmap_safe (called from xprt_rdma_send_request on retransmit, or from xprt_rdma_free) to clean up old registrations as needed. Reported-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08 22:00:27 +00:00
goto out_err;
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
hdrlen = (unsigned char *)iptr - (unsigned char *)headerp;
dprintk("RPC: %5u %s: %s/%s: hdrlen %zd\n",
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
rqst->rq_task->tk_pid, __func__,
transfertypes[rtype], transfertypes[wtype],
hdrlen);
xprtrdma: Use gathered Send for large inline messages An RPC Call message that is sent inline but that has a data payload (ie, one or more items in rq_snd_buf's page list) must be "pulled up:" - call_allocate has to reserve enough RPC Call buffer space to accommodate the data payload - call_transmit has to memcopy the rq_snd_buf's page list and tail into its head iovec before it is sent As the inline threshold is increased beyond its current 1KB default, however, this means data payloads of more than a few KB are copied by the host CPU. For example, if the inline threshold is increased just to 4KB, then NFS WRITE requests up to 4KB would involve a memcpy of the NFS WRITE's payload data into the RPC Call buffer. This is an undesirable amount of participation by the host CPU. The inline threshold may be much larger than 4KB in the future, after negotiation with a peer server. Instead of copying the components of rq_snd_buf into its head iovec, construct a gather list of these components, and send them all in place. The same approach is already used in the Linux server's RPC-over-RDMA reply path. This mechanism also eliminates the need for rpcrdma_tail_pullup, which is used to manage the XDR pad and trailing inline content when a Read list is present. This requires that the pages in rq_snd_buf's page list be DMA-mapped during marshaling, and unmapped when a data-bearing RPC is completed. This is slightly less efficient for very small I/O payloads, but significantly more efficient as data payload size and inline threshold increase past a kilobyte. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 14:57:24 +00:00
if (!rpcrdma_prepare_send_sges(&r_xprt->rx_ia, req, hdrlen,
&rqst->rq_snd_buf, rtype)) {
iptr = ERR_PTR(-EIO);
xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs Sriharsha (sriharsha.basavapatna@broadcom.com) reports an occasional double DMA unmap of an FRWR MR when a connection is lost. I see one way this can happen. When a request requires more than one segment or chunk, rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment (MR) in each chunk. Each call posts a FASTREG Work Request to register one MR. Now suppose that the transport connection is lost part-way through marshaling this request. As part of recovering and resetting that req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands all the req's registered FRWRs to the MR recovery thread. But note: FRWR registration is asynchronous. So it's possible that some of these "already registered" FRWRs are fully registered, and some are still waiting for their FASTREG WR to complete. When the connection is lost, the "already registered" frmrs are marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR. But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be running. - If the recovery thread runs last, then the frmr is marked FRMR_IS_INVALID, and life continues. - If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR, but the recovery thread has already DMA unmapped that MR. When ->frwr_op_map later re-uses this frmr, it sees it is not marked FRMR_IS_INVALID, and tries to recover it before using it, resulting in a second DMA unmap of the same MR. The fix is to guarantee in-flight FASTREG WRs have flushed before MR recovery runs on those FRWRs. Thus we depend on ro_unmap_safe (called from xprt_rdma_send_request on retransmit, or from xprt_rdma_free) to clean up old registrations as needed. Reported-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08 22:00:27 +00:00
goto out_err;
xprtrdma: Use gathered Send for large inline messages An RPC Call message that is sent inline but that has a data payload (ie, one or more items in rq_snd_buf's page list) must be "pulled up:" - call_allocate has to reserve enough RPC Call buffer space to accommodate the data payload - call_transmit has to memcopy the rq_snd_buf's page list and tail into its head iovec before it is sent As the inline threshold is increased beyond its current 1KB default, however, this means data payloads of more than a few KB are copied by the host CPU. For example, if the inline threshold is increased just to 4KB, then NFS WRITE requests up to 4KB would involve a memcpy of the NFS WRITE's payload data into the RPC Call buffer. This is an undesirable amount of participation by the host CPU. The inline threshold may be much larger than 4KB in the future, after negotiation with a peer server. Instead of copying the components of rq_snd_buf into its head iovec, construct a gather list of these components, and send them all in place. The same approach is already used in the Linux server's RPC-over-RDMA reply path. This mechanism also eliminates the need for rpcrdma_tail_pullup, which is used to manage the XDR pad and trailing inline content when a Read list is present. This requires that the pages in rq_snd_buf's page list be DMA-mapped during marshaling, and unmapped when a data-bearing RPC is completed. This is slightly less efficient for very small I/O payloads, but significantly more efficient as data payload size and inline threshold increase past a kilobyte. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 14:57:24 +00:00
}
return 0;
xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs Sriharsha (sriharsha.basavapatna@broadcom.com) reports an occasional double DMA unmap of an FRWR MR when a connection is lost. I see one way this can happen. When a request requires more than one segment or chunk, rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment (MR) in each chunk. Each call posts a FASTREG Work Request to register one MR. Now suppose that the transport connection is lost part-way through marshaling this request. As part of recovering and resetting that req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands all the req's registered FRWRs to the MR recovery thread. But note: FRWR registration is asynchronous. So it's possible that some of these "already registered" FRWRs are fully registered, and some are still waiting for their FASTREG WR to complete. When the connection is lost, the "already registered" frmrs are marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR. But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be running. - If the recovery thread runs last, then the frmr is marked FRMR_IS_INVALID, and life continues. - If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR, but the recovery thread has already DMA unmapped that MR. When ->frwr_op_map later re-uses this frmr, it sees it is not marked FRMR_IS_INVALID, and tries to recover it before using it, resulting in a second DMA unmap of the same MR. The fix is to guarantee in-flight FASTREG WRs have flushed before MR recovery runs on those FRWRs. Thus we depend on ro_unmap_safe (called from xprt_rdma_send_request on retransmit, or from xprt_rdma_free) to clean up old registrations as needed. Reported-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08 22:00:27 +00:00
out_err:
if (PTR_ERR(iptr) != -ENOBUFS) {
pr_err("rpcrdma: rpcrdma_marshal_req failed, status %ld\n",
PTR_ERR(iptr));
r_xprt->rx_stats.failed_marshal_count++;
}
xprtrdma: Allow Read list and Reply chunk simultaneously rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02 18:41:30 +00:00
return PTR_ERR(iptr);
}
xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup() While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ operations failed. After the client unwrapped the NFS READ reply message, the NFS READ XDR decoder was not able to decode the reply. The message was "Server cheating in reply", with the reported number of received payload bytes being zero. Applications reported a read(2) that returned -1/EIO. The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero when the incoming reply fits entirely in the head iovec. The zero tail.iov_len confused xdr_buf_trim(), which then mangled the actual reply data instead of simply removing the trailing GSS checksum. As near as I can tell, RPC transports are not supposed to update the head.iov_len, page_len, or tail.iov_len fields in the receive XDR buffer when handling an incoming RPC reply message. These fields contain the length of each component of the XDR buffer, and hence the maximum number of bytes of reply data that can be stored in each XDR buffer component. I've concluded this because: - This is how xdr_partial_copy_from_skb() appears to behave - rpcrdma_inline_fixup() already does not alter page_len - call_decode() compares rq_private_buf and rq_rcv_buf and WARNs if they are not exactly the same Unfortunately, as soon as I tried the simple fix to just remove the line that sets tail.iov_len to zero, I saw that the logic that appends the implicit Write chunk pad inline depends on inline_fixup setting tail.iov_len to zero. To address this, re-organize the tail iovec handling logic to use the same approach as with the head iovec: simply point tail.iov_base to the correct bytes in the receive buffer. While I remember all this, write down the conclusion in documenting comments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-29 17:54:41 +00:00
/**
* rpcrdma_inline_fixup - Scatter inline received data into rqst's iovecs
* @rqst: controlling RPC request
* @srcp: points to RPC message payload in receive buffer
* @copy_len: remaining length of receive buffer content
* @pad: Write chunk pad bytes needed (zero for pure inline)
*
* The upper layer has set the maximum number of bytes it can
* receive in each component of rq_rcv_buf. These values are set in
* the head.iov_len, page_len, tail.iov_len, and buflen fields.
*
* Unlike the TCP equivalent (xdr_partial_copy_from_skb), in
* many cases this function simply updates iov_base pointers in
* rq_rcv_buf to point directly to the received reply data, to
* avoid copying reply data.
*
* Returns the count of bytes which had to be memcopied.
*/
static unsigned long
rpcrdma_inline_fixup(struct rpc_rqst *rqst, char *srcp, int copy_len, int pad)
{
unsigned long fixup_copy_count;
int i, npages, curlen;
char *destp;
struct page **ppages;
int page_base;
xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup() While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ operations failed. After the client unwrapped the NFS READ reply message, the NFS READ XDR decoder was not able to decode the reply. The message was "Server cheating in reply", with the reported number of received payload bytes being zero. Applications reported a read(2) that returned -1/EIO. The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero when the incoming reply fits entirely in the head iovec. The zero tail.iov_len confused xdr_buf_trim(), which then mangled the actual reply data instead of simply removing the trailing GSS checksum. As near as I can tell, RPC transports are not supposed to update the head.iov_len, page_len, or tail.iov_len fields in the receive XDR buffer when handling an incoming RPC reply message. These fields contain the length of each component of the XDR buffer, and hence the maximum number of bytes of reply data that can be stored in each XDR buffer component. I've concluded this because: - This is how xdr_partial_copy_from_skb() appears to behave - rpcrdma_inline_fixup() already does not alter page_len - call_decode() compares rq_private_buf and rq_rcv_buf and WARNs if they are not exactly the same Unfortunately, as soon as I tried the simple fix to just remove the line that sets tail.iov_len to zero, I saw that the logic that appends the implicit Write chunk pad inline depends on inline_fixup setting tail.iov_len to zero. To address this, re-organize the tail iovec handling logic to use the same approach as with the head iovec: simply point tail.iov_base to the correct bytes in the receive buffer. While I remember all this, write down the conclusion in documenting comments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-29 17:54:41 +00:00
/* The head iovec is redirected to the RPC reply message
* in the receive buffer, to avoid a memcopy.
*/
rqst->rq_rcv_buf.head[0].iov_base = srcp;
rqst->rq_private_buf.head[0].iov_base = srcp;
xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup() While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ operations failed. After the client unwrapped the NFS READ reply message, the NFS READ XDR decoder was not able to decode the reply. The message was "Server cheating in reply", with the reported number of received payload bytes being zero. Applications reported a read(2) that returned -1/EIO. The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero when the incoming reply fits entirely in the head iovec. The zero tail.iov_len confused xdr_buf_trim(), which then mangled the actual reply data instead of simply removing the trailing GSS checksum. As near as I can tell, RPC transports are not supposed to update the head.iov_len, page_len, or tail.iov_len fields in the receive XDR buffer when handling an incoming RPC reply message. These fields contain the length of each component of the XDR buffer, and hence the maximum number of bytes of reply data that can be stored in each XDR buffer component. I've concluded this because: - This is how xdr_partial_copy_from_skb() appears to behave - rpcrdma_inline_fixup() already does not alter page_len - call_decode() compares rq_private_buf and rq_rcv_buf and WARNs if they are not exactly the same Unfortunately, as soon as I tried the simple fix to just remove the line that sets tail.iov_len to zero, I saw that the logic that appends the implicit Write chunk pad inline depends on inline_fixup setting tail.iov_len to zero. To address this, re-organize the tail iovec handling logic to use the same approach as with the head iovec: simply point tail.iov_base to the correct bytes in the receive buffer. While I remember all this, write down the conclusion in documenting comments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-29 17:54:41 +00:00
/* The contents of the receive buffer that follow
* head.iov_len bytes are copied into the page list.
*/
curlen = rqst->rq_rcv_buf.head[0].iov_len;
xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup() While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ operations failed. After the client unwrapped the NFS READ reply message, the NFS READ XDR decoder was not able to decode the reply. The message was "Server cheating in reply", with the reported number of received payload bytes being zero. Applications reported a read(2) that returned -1/EIO. The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero when the incoming reply fits entirely in the head iovec. The zero tail.iov_len confused xdr_buf_trim(), which then mangled the actual reply data instead of simply removing the trailing GSS checksum. As near as I can tell, RPC transports are not supposed to update the head.iov_len, page_len, or tail.iov_len fields in the receive XDR buffer when handling an incoming RPC reply message. These fields contain the length of each component of the XDR buffer, and hence the maximum number of bytes of reply data that can be stored in each XDR buffer component. I've concluded this because: - This is how xdr_partial_copy_from_skb() appears to behave - rpcrdma_inline_fixup() already does not alter page_len - call_decode() compares rq_private_buf and rq_rcv_buf and WARNs if they are not exactly the same Unfortunately, as soon as I tried the simple fix to just remove the line that sets tail.iov_len to zero, I saw that the logic that appends the implicit Write chunk pad inline depends on inline_fixup setting tail.iov_len to zero. To address this, re-organize the tail iovec handling logic to use the same approach as with the head iovec: simply point tail.iov_base to the correct bytes in the receive buffer. While I remember all this, write down the conclusion in documenting comments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-29 17:54:41 +00:00
if (curlen > copy_len)
curlen = copy_len;
dprintk("RPC: %s: srcp 0x%p len %d hdrlen %d\n",
__func__, srcp, copy_len, curlen);
srcp += curlen;
copy_len -= curlen;
ppages = rqst->rq_rcv_buf.pages +
(rqst->rq_rcv_buf.page_base >> PAGE_SHIFT);
page_base = offset_in_page(rqst->rq_rcv_buf.page_base);
fixup_copy_count = 0;
if (copy_len && rqst->rq_rcv_buf.page_len) {
int pagelist_len;
pagelist_len = rqst->rq_rcv_buf.page_len;
if (pagelist_len > copy_len)
pagelist_len = copy_len;
npages = PAGE_ALIGN(page_base + pagelist_len) >> PAGE_SHIFT;
for (i = 0; i < npages; i++) {
curlen = PAGE_SIZE - page_base;
if (curlen > pagelist_len)
curlen = pagelist_len;
dprintk("RPC: %s: page %d"
" srcp 0x%p len %d curlen %d\n",
__func__, i, srcp, copy_len, curlen);
destp = kmap_atomic(ppages[i]);
memcpy(destp + page_base, srcp, curlen);
flush_dcache_page(ppages[i]);
kunmap_atomic(destp);
srcp += curlen;
copy_len -= curlen;
fixup_copy_count += curlen;
pagelist_len -= curlen;
if (!pagelist_len)
break;
page_base = 0;
}
xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup() While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ operations failed. After the client unwrapped the NFS READ reply message, the NFS READ XDR decoder was not able to decode the reply. The message was "Server cheating in reply", with the reported number of received payload bytes being zero. Applications reported a read(2) that returned -1/EIO. The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero when the incoming reply fits entirely in the head iovec. The zero tail.iov_len confused xdr_buf_trim(), which then mangled the actual reply data instead of simply removing the trailing GSS checksum. As near as I can tell, RPC transports are not supposed to update the head.iov_len, page_len, or tail.iov_len fields in the receive XDR buffer when handling an incoming RPC reply message. These fields contain the length of each component of the XDR buffer, and hence the maximum number of bytes of reply data that can be stored in each XDR buffer component. I've concluded this because: - This is how xdr_partial_copy_from_skb() appears to behave - rpcrdma_inline_fixup() already does not alter page_len - call_decode() compares rq_private_buf and rq_rcv_buf and WARNs if they are not exactly the same Unfortunately, as soon as I tried the simple fix to just remove the line that sets tail.iov_len to zero, I saw that the logic that appends the implicit Write chunk pad inline depends on inline_fixup setting tail.iov_len to zero. To address this, re-organize the tail iovec handling logic to use the same approach as with the head iovec: simply point tail.iov_base to the correct bytes in the receive buffer. While I remember all this, write down the conclusion in documenting comments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-29 17:54:41 +00:00
/* Implicit padding for the last segment in a Write
* chunk is inserted inline at the front of the tail
* iovec. The upper layer ignores the content of
* the pad. Simply ensure inline content in the tail
* that follows the Write chunk is properly aligned.
*/
if (pad)
srcp -= pad;
}
xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup() While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ operations failed. After the client unwrapped the NFS READ reply message, the NFS READ XDR decoder was not able to decode the reply. The message was "Server cheating in reply", with the reported number of received payload bytes being zero. Applications reported a read(2) that returned -1/EIO. The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero when the incoming reply fits entirely in the head iovec. The zero tail.iov_len confused xdr_buf_trim(), which then mangled the actual reply data instead of simply removing the trailing GSS checksum. As near as I can tell, RPC transports are not supposed to update the head.iov_len, page_len, or tail.iov_len fields in the receive XDR buffer when handling an incoming RPC reply message. These fields contain the length of each component of the XDR buffer, and hence the maximum number of bytes of reply data that can be stored in each XDR buffer component. I've concluded this because: - This is how xdr_partial_copy_from_skb() appears to behave - rpcrdma_inline_fixup() already does not alter page_len - call_decode() compares rq_private_buf and rq_rcv_buf and WARNs if they are not exactly the same Unfortunately, as soon as I tried the simple fix to just remove the line that sets tail.iov_len to zero, I saw that the logic that appends the implicit Write chunk pad inline depends on inline_fixup setting tail.iov_len to zero. To address this, re-organize the tail iovec handling logic to use the same approach as with the head iovec: simply point tail.iov_base to the correct bytes in the receive buffer. While I remember all this, write down the conclusion in documenting comments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-29 17:54:41 +00:00
/* The tail iovec is redirected to the remaining data
* in the receive buffer, to avoid a memcopy.
*/
if (copy_len || pad) {
xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup() While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ operations failed. After the client unwrapped the NFS READ reply message, the NFS READ XDR decoder was not able to decode the reply. The message was "Server cheating in reply", with the reported number of received payload bytes being zero. Applications reported a read(2) that returned -1/EIO. The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero when the incoming reply fits entirely in the head iovec. The zero tail.iov_len confused xdr_buf_trim(), which then mangled the actual reply data instead of simply removing the trailing GSS checksum. As near as I can tell, RPC transports are not supposed to update the head.iov_len, page_len, or tail.iov_len fields in the receive XDR buffer when handling an incoming RPC reply message. These fields contain the length of each component of the XDR buffer, and hence the maximum number of bytes of reply data that can be stored in each XDR buffer component. I've concluded this because: - This is how xdr_partial_copy_from_skb() appears to behave - rpcrdma_inline_fixup() already does not alter page_len - call_decode() compares rq_private_buf and rq_rcv_buf and WARNs if they are not exactly the same Unfortunately, as soon as I tried the simple fix to just remove the line that sets tail.iov_len to zero, I saw that the logic that appends the implicit Write chunk pad inline depends on inline_fixup setting tail.iov_len to zero. To address this, re-organize the tail iovec handling logic to use the same approach as with the head iovec: simply point tail.iov_base to the correct bytes in the receive buffer. While I remember all this, write down the conclusion in documenting comments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-29 17:54:41 +00:00
rqst->rq_rcv_buf.tail[0].iov_base = srcp;
rqst->rq_private_buf.tail[0].iov_base = srcp;
}
xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup() While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ operations failed. After the client unwrapped the NFS READ reply message, the NFS READ XDR decoder was not able to decode the reply. The message was "Server cheating in reply", with the reported number of received payload bytes being zero. Applications reported a read(2) that returned -1/EIO. The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero when the incoming reply fits entirely in the head iovec. The zero tail.iov_len confused xdr_buf_trim(), which then mangled the actual reply data instead of simply removing the trailing GSS checksum. As near as I can tell, RPC transports are not supposed to update the head.iov_len, page_len, or tail.iov_len fields in the receive XDR buffer when handling an incoming RPC reply message. These fields contain the length of each component of the XDR buffer, and hence the maximum number of bytes of reply data that can be stored in each XDR buffer component. I've concluded this because: - This is how xdr_partial_copy_from_skb() appears to behave - rpcrdma_inline_fixup() already does not alter page_len - call_decode() compares rq_private_buf and rq_rcv_buf and WARNs if they are not exactly the same Unfortunately, as soon as I tried the simple fix to just remove the line that sets tail.iov_len to zero, I saw that the logic that appends the implicit Write chunk pad inline depends on inline_fixup setting tail.iov_len to zero. To address this, re-organize the tail iovec handling logic to use the same approach as with the head iovec: simply point tail.iov_base to the correct bytes in the receive buffer. While I remember all this, write down the conclusion in documenting comments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-29 17:54:41 +00:00
return fixup_copy_count;
}
/* Caller must guarantee @rep remains stable during this call.
*/
static void
rpcrdma_mark_remote_invalidation(struct list_head *mws,
struct rpcrdma_rep *rep)
{
struct rpcrdma_mw *mw;
if (!(rep->rr_wc_flags & IB_WC_WITH_INVALIDATE))
return;
list_for_each_entry(mw, mws, mw_list)
if (mw->mw_handle == rep->rr_inv_rkey) {
mw->mw_flags = RPCRDMA_MW_F_RI;
break; /* only one invalidated MR per RPC */
}
}
/* By convention, backchannel calls arrive via rdma_msg type
* messages, and never populate the chunk lists. This makes
* the RPC/RDMA header small and fixed in size, so it is
* straightforward to check the RPC header's direction field.
*/
static bool
rpcrdma_is_bcall(struct rpcrdma_xprt *r_xprt, struct rpcrdma_rep *rep,
__be32 xid, __be32 proc)
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
{
struct xdr_stream *xdr = &rep->rr_stream;
__be32 *p;
if (proc != rdma_msg)
return false;
/* Peek at stream contents without advancing. */
p = xdr_inline_decode(xdr, 0);
/* Chunk lists */
if (*p++ != xdr_zero)
return false;
if (*p++ != xdr_zero)
return false;
if (*p++ != xdr_zero)
return false;
/* RPC header */
if (*p++ != xid)
return false;
if (*p != cpu_to_be32(RPC_CALL))
return false;
/* Now that we are sure this is a backchannel call,
* advance to the RPC header.
*/
p = xdr_inline_decode(xdr, 3 * sizeof(*p));
if (unlikely(!p))
goto out_short;
rpcrdma_bc_receive_call(r_xprt, rep);
return true;
out_short:
pr_warn("RPC/RDMA short backward direction call\n");
if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, rep))
xprt_disconnect_done(&r_xprt->rx_xprt);
return true;
}
#else /* CONFIG_SUNRPC_BACKCHANNEL */
{
return false;
}
#endif /* CONFIG_SUNRPC_BACKCHANNEL */
static int decode_rdma_segment(struct xdr_stream *xdr, u32 *length)
{
__be32 *p;
p = xdr_inline_decode(xdr, 4 * sizeof(*p));
if (unlikely(!p))
return -EIO;
ifdebug(FACILITY) {
u64 offset;
u32 handle;
handle = be32_to_cpup(p++);
*length = be32_to_cpup(p++);
xdr_decode_hyper(p, &offset);
dprintk("RPC: %s: segment %u@0x%016llx:0x%08x\n",
__func__, *length, (unsigned long long)offset,
handle);
} else {
*length = be32_to_cpup(p + 1);
}
return 0;
}
static int decode_write_chunk(struct xdr_stream *xdr, u32 *length)
{
u32 segcount, seglength;
__be32 *p;
p = xdr_inline_decode(xdr, sizeof(*p));
if (unlikely(!p))
return -EIO;
*length = 0;
segcount = be32_to_cpup(p);
while (segcount--) {
if (decode_rdma_segment(xdr, &seglength))
return -EIO;
*length += seglength;
}
dprintk("RPC: %s: segcount=%u, %u bytes\n",
__func__, be32_to_cpup(p), *length);
return 0;
}
/* In RPC-over-RDMA Version One replies, a Read list is never
* expected. This decoder is a stub that returns an error if
* a Read list is present.
*/
static int decode_read_list(struct xdr_stream *xdr)
{
__be32 *p;
p = xdr_inline_decode(xdr, sizeof(*p));
if (unlikely(!p))
return -EIO;
if (unlikely(*p != xdr_zero))
return -EIO;
return 0;
}
/* Supports only one Write chunk in the Write list
*/
static int decode_write_list(struct xdr_stream *xdr, u32 *length)
{
u32 chunklen;
bool first;
__be32 *p;
*length = 0;
first = true;
do {
p = xdr_inline_decode(xdr, sizeof(*p));
if (unlikely(!p))
return -EIO;
if (*p == xdr_zero)
break;
if (!first)
return -EIO;
if (decode_write_chunk(xdr, &chunklen))
return -EIO;
*length += chunklen;
first = false;
} while (true);
return 0;
}
static int decode_reply_chunk(struct xdr_stream *xdr, u32 *length)
{
__be32 *p;
p = xdr_inline_decode(xdr, sizeof(*p));
if (unlikely(!p))
return -EIO;
*length = 0;
if (*p != xdr_zero)
if (decode_write_chunk(xdr, length))
return -EIO;
return 0;
}
static int
rpcrdma_decode_msg(struct rpcrdma_xprt *r_xprt, struct rpcrdma_rep *rep,
struct rpc_rqst *rqst)
{
struct xdr_stream *xdr = &rep->rr_stream;
u32 writelist, replychunk, rpclen;
char *base;
/* Decode the chunk lists */
if (decode_read_list(xdr))
return -EIO;
if (decode_write_list(xdr, &writelist))
return -EIO;
if (decode_reply_chunk(xdr, &replychunk))
return -EIO;
/* RDMA_MSG sanity checks */
if (unlikely(replychunk))
return -EIO;
/* Build the RPC reply's Payload stream in rqst->rq_rcv_buf */
base = (char *)xdr_inline_decode(xdr, 0);
rpclen = xdr_stream_remaining(xdr);
r_xprt->rx_stats.fixup_copy_count +=
rpcrdma_inline_fixup(rqst, base, rpclen, writelist & 3);
r_xprt->rx_stats.total_rdma_reply += writelist;
return rpclen + xdr_align_size(writelist);
}
static noinline int
rpcrdma_decode_nomsg(struct rpcrdma_xprt *r_xprt, struct rpcrdma_rep *rep)
{
struct xdr_stream *xdr = &rep->rr_stream;
u32 writelist, replychunk;
/* Decode the chunk lists */
if (decode_read_list(xdr))
return -EIO;
if (decode_write_list(xdr, &writelist))
return -EIO;
if (decode_reply_chunk(xdr, &replychunk))
return -EIO;
/* RDMA_NOMSG sanity checks */
if (unlikely(writelist))
return -EIO;
if (unlikely(!replychunk))
return -EIO;
/* Reply chunk buffer already is the reply vector */
r_xprt->rx_stats.total_rdma_reply += replychunk;
return replychunk;
}
static noinline int
rpcrdma_decode_error(struct rpcrdma_xprt *r_xprt, struct rpcrdma_rep *rep,
struct rpc_rqst *rqst)
{
struct xdr_stream *xdr = &rep->rr_stream;
__be32 *p;
p = xdr_inline_decode(xdr, sizeof(*p));
if (unlikely(!p))
return -EIO;
switch (*p) {
case err_vers:
p = xdr_inline_decode(xdr, 2 * sizeof(*p));
if (!p)
break;
dprintk("RPC: %5u: %s: server reports version error (%u-%u)\n",
rqst->rq_task->tk_pid, __func__,
be32_to_cpup(p), be32_to_cpu(*(p + 1)));
break;
case err_chunk:
dprintk("RPC: %5u: %s: server reports header decoding error\n",
rqst->rq_task->tk_pid, __func__);
break;
default:
dprintk("RPC: %5u: %s: server reports unrecognized error %d\n",
rqst->rq_task->tk_pid, __func__, be32_to_cpup(p));
}
r_xprt->rx_stats.bad_reply_count++;
return -EREMOTEIO;
}
/* Process received RPC/RDMA messages.
*
* Errors must result in the RPC task either being awakened, or
* allowed to timeout, to discover the errors at that time.
*/
void
rpcrdma_reply_handler(struct work_struct *work)
{
struct rpcrdma_rep *rep =
container_of(work, struct rpcrdma_rep, rr_work);
xprtrdma: Fix client lock-up after application signal fires After a signal, the RPC client aborts synchronous RPCs running on behalf of the signaled application. The server is still executing those RPCs, and will write the results back into the client's memory when it's done. By the time the server writes the results, that memory is likely being used for other purposes. Therefore xprtrdma has to immediately invalidate all memory regions used by those aborted RPCs to prevent the server's writes from clobbering that re-used memory. With FMR memory registration, invalidation takes a relatively long time. In fact, the invalidation is often still running when the server tries to write the results into the memory regions that are being invalidated. This sets up a race between two processes: 1. After the signal, xprt_rdma_free calls ro_unmap_safe. 2. While ro_unmap_safe is still running, the server replies and rpcrdma_reply_handler runs, calling ro_unmap_sync. Both processes invoke ib_unmap_fmr on the same FMR. The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at the same time, but HCAs generally don't tolerate this. Sometimes this can result in a system crash. If the HCA happens to survive, rpcrdma_reply_handler continues. It removes the rpc_rqst from rq_list and releases the transport_lock. This enables xprt_rdma_free to run in another process, and the rpc_rqst is released while rpcrdma_reply_handler is still waiting for the ib_unmap_fmr call to finish. But further down in rpcrdma_reply_handler, the transport_lock is taken again, and "rqst" is dereferenced. If "rqst" has already been released, this triggers a general protection fault. Since bottom- halves are disabled, the system locks up. Address both issues by reversing the order of the xprt_lookup_rqst call and the ro_unmap_sync call. Introduce a separate lookup mechanism for rpcrdma_req's to enable calling ro_unmap_sync before xprt_lookup_rqst. Now the handler takes the transport_lock once and holds it for the XID lookup and RPC completion. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 15:52:20 +00:00
struct rpcrdma_xprt *r_xprt = rep->rr_rxprt;
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
struct rpc_xprt *xprt = &r_xprt->rx_xprt;
struct xdr_stream *xdr = &rep->rr_stream;
struct rpcrdma_req *req;
struct rpc_rqst *rqst;
__be32 *p, xid, vers, proc;
unsigned long cwnd;
struct list_head mws;
int status;
dprintk("RPC: %s: incoming rep %p\n", __func__, rep);
if (rep->rr_hdrbuf.head[0].iov_len == 0)
goto out_badstatus;
xdr_init_decode(xdr, &rep->rr_hdrbuf,
rep->rr_hdrbuf.head[0].iov_base);
/* Fixed transport header fields */
p = xdr_inline_decode(xdr, 4 * sizeof(*p));
if (unlikely(!p))
goto out_shortreply;
xid = *p++;
vers = *p++;
p++; /* credits */
proc = *p++;
if (rpcrdma_is_bcall(r_xprt, rep, xid, proc))
return;
/* Match incoming rpcrdma_rep to an rpcrdma_req to
* get context for handling any incoming chunks.
*/
xprtrdma: Fix client lock-up after application signal fires After a signal, the RPC client aborts synchronous RPCs running on behalf of the signaled application. The server is still executing those RPCs, and will write the results back into the client's memory when it's done. By the time the server writes the results, that memory is likely being used for other purposes. Therefore xprtrdma has to immediately invalidate all memory regions used by those aborted RPCs to prevent the server's writes from clobbering that re-used memory. With FMR memory registration, invalidation takes a relatively long time. In fact, the invalidation is often still running when the server tries to write the results into the memory regions that are being invalidated. This sets up a race between two processes: 1. After the signal, xprt_rdma_free calls ro_unmap_safe. 2. While ro_unmap_safe is still running, the server replies and rpcrdma_reply_handler runs, calling ro_unmap_sync. Both processes invoke ib_unmap_fmr on the same FMR. The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at the same time, but HCAs generally don't tolerate this. Sometimes this can result in a system crash. If the HCA happens to survive, rpcrdma_reply_handler continues. It removes the rpc_rqst from rq_list and releases the transport_lock. This enables xprt_rdma_free to run in another process, and the rpc_rqst is released while rpcrdma_reply_handler is still waiting for the ib_unmap_fmr call to finish. But further down in rpcrdma_reply_handler, the transport_lock is taken again, and "rqst" is dereferenced. If "rqst" has already been released, this triggers a general protection fault. Since bottom- halves are disabled, the system locks up. Address both issues by reversing the order of the xprt_lookup_rqst call and the ro_unmap_sync call. Introduce a separate lookup mechanism for rpcrdma_req's to enable calling ro_unmap_sync before xprt_lookup_rqst. Now the handler takes the transport_lock once and holds it for the XID lookup and RPC completion. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 15:52:20 +00:00
spin_lock(&buf->rb_lock);
req = rpcrdma_lookup_req_locked(&r_xprt->rx_buf, xid);
xprtrdma: Fix client lock-up after application signal fires After a signal, the RPC client aborts synchronous RPCs running on behalf of the signaled application. The server is still executing those RPCs, and will write the results back into the client's memory when it's done. By the time the server writes the results, that memory is likely being used for other purposes. Therefore xprtrdma has to immediately invalidate all memory regions used by those aborted RPCs to prevent the server's writes from clobbering that re-used memory. With FMR memory registration, invalidation takes a relatively long time. In fact, the invalidation is often still running when the server tries to write the results into the memory regions that are being invalidated. This sets up a race between two processes: 1. After the signal, xprt_rdma_free calls ro_unmap_safe. 2. While ro_unmap_safe is still running, the server replies and rpcrdma_reply_handler runs, calling ro_unmap_sync. Both processes invoke ib_unmap_fmr on the same FMR. The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at the same time, but HCAs generally don't tolerate this. Sometimes this can result in a system crash. If the HCA happens to survive, rpcrdma_reply_handler continues. It removes the rpc_rqst from rq_list and releases the transport_lock. This enables xprt_rdma_free to run in another process, and the rpc_rqst is released while rpcrdma_reply_handler is still waiting for the ib_unmap_fmr call to finish. But further down in rpcrdma_reply_handler, the transport_lock is taken again, and "rqst" is dereferenced. If "rqst" has already been released, this triggers a general protection fault. Since bottom- halves are disabled, the system locks up. Address both issues by reversing the order of the xprt_lookup_rqst call and the ro_unmap_sync call. Introduce a separate lookup mechanism for rpcrdma_req's to enable calling ro_unmap_sync before xprt_lookup_rqst. Now the handler takes the transport_lock once and holds it for the XID lookup and RPC completion. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 15:52:20 +00:00
if (!req)
goto out_nomatch;
if (req->rl_reply)
goto out_duplicate;
list_replace_init(&req->rl_registered, &mws);
rpcrdma_mark_remote_invalidation(&mws, rep);
xprtrdma: Fix client lock-up after application signal fires After a signal, the RPC client aborts synchronous RPCs running on behalf of the signaled application. The server is still executing those RPCs, and will write the results back into the client's memory when it's done. By the time the server writes the results, that memory is likely being used for other purposes. Therefore xprtrdma has to immediately invalidate all memory regions used by those aborted RPCs to prevent the server's writes from clobbering that re-used memory. With FMR memory registration, invalidation takes a relatively long time. In fact, the invalidation is often still running when the server tries to write the results into the memory regions that are being invalidated. This sets up a race between two processes: 1. After the signal, xprt_rdma_free calls ro_unmap_safe. 2. While ro_unmap_safe is still running, the server replies and rpcrdma_reply_handler runs, calling ro_unmap_sync. Both processes invoke ib_unmap_fmr on the same FMR. The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at the same time, but HCAs generally don't tolerate this. Sometimes this can result in a system crash. If the HCA happens to survive, rpcrdma_reply_handler continues. It removes the rpc_rqst from rq_list and releases the transport_lock. This enables xprt_rdma_free to run in another process, and the rpc_rqst is released while rpcrdma_reply_handler is still waiting for the ib_unmap_fmr call to finish. But further down in rpcrdma_reply_handler, the transport_lock is taken again, and "rqst" is dereferenced. If "rqst" has already been released, this triggers a general protection fault. Since bottom- halves are disabled, the system locks up. Address both issues by reversing the order of the xprt_lookup_rqst call and the ro_unmap_sync call. Introduce a separate lookup mechanism for rpcrdma_req's to enable calling ro_unmap_sync before xprt_lookup_rqst. Now the handler takes the transport_lock once and holds it for the XID lookup and RPC completion. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 15:52:20 +00:00
/* Avoid races with signals and duplicate replies
* by marking this req as matched.
*/
req->rl_reply = rep;
xprtrdma: Fix client lock-up after application signal fires After a signal, the RPC client aborts synchronous RPCs running on behalf of the signaled application. The server is still executing those RPCs, and will write the results back into the client's memory when it's done. By the time the server writes the results, that memory is likely being used for other purposes. Therefore xprtrdma has to immediately invalidate all memory regions used by those aborted RPCs to prevent the server's writes from clobbering that re-used memory. With FMR memory registration, invalidation takes a relatively long time. In fact, the invalidation is often still running when the server tries to write the results into the memory regions that are being invalidated. This sets up a race between two processes: 1. After the signal, xprt_rdma_free calls ro_unmap_safe. 2. While ro_unmap_safe is still running, the server replies and rpcrdma_reply_handler runs, calling ro_unmap_sync. Both processes invoke ib_unmap_fmr on the same FMR. The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at the same time, but HCAs generally don't tolerate this. Sometimes this can result in a system crash. If the HCA happens to survive, rpcrdma_reply_handler continues. It removes the rpc_rqst from rq_list and releases the transport_lock. This enables xprt_rdma_free to run in another process, and the rpc_rqst is released while rpcrdma_reply_handler is still waiting for the ib_unmap_fmr call to finish. But further down in rpcrdma_reply_handler, the transport_lock is taken again, and "rqst" is dereferenced. If "rqst" has already been released, this triggers a general protection fault. Since bottom- halves are disabled, the system locks up. Address both issues by reversing the order of the xprt_lookup_rqst call and the ro_unmap_sync call. Introduce a separate lookup mechanism for rpcrdma_req's to enable calling ro_unmap_sync before xprt_lookup_rqst. Now the handler takes the transport_lock once and holds it for the XID lookup and RPC completion. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 15:52:20 +00:00
spin_unlock(&buf->rb_lock);
dprintk("RPC: %s: reply %p completes request %p (xid 0x%08x)\n",
__func__, rep, req, be32_to_cpu(xid));
xprtrdma: Fix client lock-up after application signal fires After a signal, the RPC client aborts synchronous RPCs running on behalf of the signaled application. The server is still executing those RPCs, and will write the results back into the client's memory when it's done. By the time the server writes the results, that memory is likely being used for other purposes. Therefore xprtrdma has to immediately invalidate all memory regions used by those aborted RPCs to prevent the server's writes from clobbering that re-used memory. With FMR memory registration, invalidation takes a relatively long time. In fact, the invalidation is often still running when the server tries to write the results into the memory regions that are being invalidated. This sets up a race between two processes: 1. After the signal, xprt_rdma_free calls ro_unmap_safe. 2. While ro_unmap_safe is still running, the server replies and rpcrdma_reply_handler runs, calling ro_unmap_sync. Both processes invoke ib_unmap_fmr on the same FMR. The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at the same time, but HCAs generally don't tolerate this. Sometimes this can result in a system crash. If the HCA happens to survive, rpcrdma_reply_handler continues. It removes the rpc_rqst from rq_list and releases the transport_lock. This enables xprt_rdma_free to run in another process, and the rpc_rqst is released while rpcrdma_reply_handler is still waiting for the ib_unmap_fmr call to finish. But further down in rpcrdma_reply_handler, the transport_lock is taken again, and "rqst" is dereferenced. If "rqst" has already been released, this triggers a general protection fault. Since bottom- halves are disabled, the system locks up. Address both issues by reversing the order of the xprt_lookup_rqst call and the ro_unmap_sync call. Introduce a separate lookup mechanism for rpcrdma_req's to enable calling ro_unmap_sync before xprt_lookup_rqst. Now the handler takes the transport_lock once and holds it for the XID lookup and RPC completion. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 15:52:20 +00:00
/* Invalidate and unmap the data payloads before waking the
* waiting application. This guarantees the memory regions
* are properly fenced from the server before the application
* accesses the data. It also ensures proper send flow control:
* waking the next RPC waits until this RPC has relinquished
* all its Send Queue entries.
*/
if (!list_empty(&mws))
r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, &mws);
xprtrdma: Fix client lock-up after application signal fires After a signal, the RPC client aborts synchronous RPCs running on behalf of the signaled application. The server is still executing those RPCs, and will write the results back into the client's memory when it's done. By the time the server writes the results, that memory is likely being used for other purposes. Therefore xprtrdma has to immediately invalidate all memory regions used by those aborted RPCs to prevent the server's writes from clobbering that re-used memory. With FMR memory registration, invalidation takes a relatively long time. In fact, the invalidation is often still running when the server tries to write the results into the memory regions that are being invalidated. This sets up a race between two processes: 1. After the signal, xprt_rdma_free calls ro_unmap_safe. 2. While ro_unmap_safe is still running, the server replies and rpcrdma_reply_handler runs, calling ro_unmap_sync. Both processes invoke ib_unmap_fmr on the same FMR. The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at the same time, but HCAs generally don't tolerate this. Sometimes this can result in a system crash. If the HCA happens to survive, rpcrdma_reply_handler continues. It removes the rpc_rqst from rq_list and releases the transport_lock. This enables xprt_rdma_free to run in another process, and the rpc_rqst is released while rpcrdma_reply_handler is still waiting for the ib_unmap_fmr call to finish. But further down in rpcrdma_reply_handler, the transport_lock is taken again, and "rqst" is dereferenced. If "rqst" has already been released, this triggers a general protection fault. Since bottom- halves are disabled, the system locks up. Address both issues by reversing the order of the xprt_lookup_rqst call and the ro_unmap_sync call. Introduce a separate lookup mechanism for rpcrdma_req's to enable calling ro_unmap_sync before xprt_lookup_rqst. Now the handler takes the transport_lock once and holds it for the XID lookup and RPC completion. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 15:52:20 +00:00
/* Perform XID lookup, reconstruction of the RPC reply, and
* RPC completion while holding the transport lock to ensure
* the rep, rqst, and rq_task pointers remain stable.
*/
spin_lock_bh(&xprt->transport_lock);
rqst = xprt_lookup_rqst(xprt, xid);
xprtrdma: Fix client lock-up after application signal fires After a signal, the RPC client aborts synchronous RPCs running on behalf of the signaled application. The server is still executing those RPCs, and will write the results back into the client's memory when it's done. By the time the server writes the results, that memory is likely being used for other purposes. Therefore xprtrdma has to immediately invalidate all memory regions used by those aborted RPCs to prevent the server's writes from clobbering that re-used memory. With FMR memory registration, invalidation takes a relatively long time. In fact, the invalidation is often still running when the server tries to write the results into the memory regions that are being invalidated. This sets up a race between two processes: 1. After the signal, xprt_rdma_free calls ro_unmap_safe. 2. While ro_unmap_safe is still running, the server replies and rpcrdma_reply_handler runs, calling ro_unmap_sync. Both processes invoke ib_unmap_fmr on the same FMR. The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at the same time, but HCAs generally don't tolerate this. Sometimes this can result in a system crash. If the HCA happens to survive, rpcrdma_reply_handler continues. It removes the rpc_rqst from rq_list and releases the transport_lock. This enables xprt_rdma_free to run in another process, and the rpc_rqst is released while rpcrdma_reply_handler is still waiting for the ib_unmap_fmr call to finish. But further down in rpcrdma_reply_handler, the transport_lock is taken again, and "rqst" is dereferenced. If "rqst" has already been released, this triggers a general protection fault. Since bottom- halves are disabled, the system locks up. Address both issues by reversing the order of the xprt_lookup_rqst call and the ro_unmap_sync call. Introduce a separate lookup mechanism for rpcrdma_req's to enable calling ro_unmap_sync before xprt_lookup_rqst. Now the handler takes the transport_lock once and holds it for the XID lookup and RPC completion. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 15:52:20 +00:00
if (!rqst)
goto out_norqst;
xprt->reestablish_timeout = 0;
if (vers != rpcrdma_version)
goto out_badversion;
switch (proc) {
case rdma_msg:
status = rpcrdma_decode_msg(r_xprt, rep, rqst);
break;
case rdma_nomsg:
status = rpcrdma_decode_nomsg(r_xprt, rep);
break;
case rdma_error:
status = rpcrdma_decode_error(r_xprt, rep, rqst);
break;
default:
status = -EIO;
}
if (status < 0)
goto out_badheader;
out:
cwnd = xprt->cwnd;
xprt->cwnd = atomic_read(&r_xprt->rx_buf.rb_credits) << RPC_CWNDSHIFT;
if (xprt->cwnd > cwnd)
xprt_release_rqst_cong(rqst->rq_task);
xprt_complete_rqst(rqst->rq_task, status);
spin_unlock_bh(&xprt->transport_lock);
dprintk("RPC: %s: xprt_complete_rqst(0x%p, 0x%p, %d)\n",
xprtrdma: Fix client lock-up after application signal fires After a signal, the RPC client aborts synchronous RPCs running on behalf of the signaled application. The server is still executing those RPCs, and will write the results back into the client's memory when it's done. By the time the server writes the results, that memory is likely being used for other purposes. Therefore xprtrdma has to immediately invalidate all memory regions used by those aborted RPCs to prevent the server's writes from clobbering that re-used memory. With FMR memory registration, invalidation takes a relatively long time. In fact, the invalidation is often still running when the server tries to write the results into the memory regions that are being invalidated. This sets up a race between two processes: 1. After the signal, xprt_rdma_free calls ro_unmap_safe. 2. While ro_unmap_safe is still running, the server replies and rpcrdma_reply_handler runs, calling ro_unmap_sync. Both processes invoke ib_unmap_fmr on the same FMR. The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at the same time, but HCAs generally don't tolerate this. Sometimes this can result in a system crash. If the HCA happens to survive, rpcrdma_reply_handler continues. It removes the rpc_rqst from rq_list and releases the transport_lock. This enables xprt_rdma_free to run in another process, and the rpc_rqst is released while rpcrdma_reply_handler is still waiting for the ib_unmap_fmr call to finish. But further down in rpcrdma_reply_handler, the transport_lock is taken again, and "rqst" is dereferenced. If "rqst" has already been released, this triggers a general protection fault. Since bottom- halves are disabled, the system locks up. Address both issues by reversing the order of the xprt_lookup_rqst call and the ro_unmap_sync call. Introduce a separate lookup mechanism for rpcrdma_req's to enable calling ro_unmap_sync before xprt_lookup_rqst. Now the handler takes the transport_lock once and holds it for the XID lookup and RPC completion. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 15:52:20 +00:00
__func__, xprt, rqst, status);
return;
out_badstatus:
rpcrdma_recv_buffer_put(rep);
if (r_xprt->rx_ep.rep_connected == 1) {
r_xprt->rx_ep.rep_connected = -EIO;
rpcrdma_conn_func(&r_xprt->rx_ep);
}
return;
/* If the incoming reply terminated a pending RPC, the next
* RPC call will post a replacement receive buffer as it is
* being marshaled.
*/
out_badversion:
dprintk("RPC: %s: invalid version %d\n",
__func__, be32_to_cpu(vers));
status = -EIO;
r_xprt->rx_stats.bad_reply_count++;
goto out;
out_badheader:
dprintk("RPC: %5u %s: invalid rpcrdma reply (type %u)\n",
rqst->rq_task->tk_pid, __func__, be32_to_cpu(proc));
r_xprt->rx_stats.bad_reply_count++;
status = -EIO;
goto out;
xprtrdma: Fix client lock-up after application signal fires After a signal, the RPC client aborts synchronous RPCs running on behalf of the signaled application. The server is still executing those RPCs, and will write the results back into the client's memory when it's done. By the time the server writes the results, that memory is likely being used for other purposes. Therefore xprtrdma has to immediately invalidate all memory regions used by those aborted RPCs to prevent the server's writes from clobbering that re-used memory. With FMR memory registration, invalidation takes a relatively long time. In fact, the invalidation is often still running when the server tries to write the results into the memory regions that are being invalidated. This sets up a race between two processes: 1. After the signal, xprt_rdma_free calls ro_unmap_safe. 2. While ro_unmap_safe is still running, the server replies and rpcrdma_reply_handler runs, calling ro_unmap_sync. Both processes invoke ib_unmap_fmr on the same FMR. The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at the same time, but HCAs generally don't tolerate this. Sometimes this can result in a system crash. If the HCA happens to survive, rpcrdma_reply_handler continues. It removes the rpc_rqst from rq_list and releases the transport_lock. This enables xprt_rdma_free to run in another process, and the rpc_rqst is released while rpcrdma_reply_handler is still waiting for the ib_unmap_fmr call to finish. But further down in rpcrdma_reply_handler, the transport_lock is taken again, and "rqst" is dereferenced. If "rqst" has already been released, this triggers a general protection fault. Since bottom- halves are disabled, the system locks up. Address both issues by reversing the order of the xprt_lookup_rqst call and the ro_unmap_sync call. Introduce a separate lookup mechanism for rpcrdma_req's to enable calling ro_unmap_sync before xprt_lookup_rqst. Now the handler takes the transport_lock once and holds it for the XID lookup and RPC completion. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 15:52:20 +00:00
/* The req was still available, but by the time the transport_lock
* was acquired, the rqst and task had been released. Thus the RPC
* has already been terminated.
*/
xprtrdma: Fix client lock-up after application signal fires After a signal, the RPC client aborts synchronous RPCs running on behalf of the signaled application. The server is still executing those RPCs, and will write the results back into the client's memory when it's done. By the time the server writes the results, that memory is likely being used for other purposes. Therefore xprtrdma has to immediately invalidate all memory regions used by those aborted RPCs to prevent the server's writes from clobbering that re-used memory. With FMR memory registration, invalidation takes a relatively long time. In fact, the invalidation is often still running when the server tries to write the results into the memory regions that are being invalidated. This sets up a race between two processes: 1. After the signal, xprt_rdma_free calls ro_unmap_safe. 2. While ro_unmap_safe is still running, the server replies and rpcrdma_reply_handler runs, calling ro_unmap_sync. Both processes invoke ib_unmap_fmr on the same FMR. The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at the same time, but HCAs generally don't tolerate this. Sometimes this can result in a system crash. If the HCA happens to survive, rpcrdma_reply_handler continues. It removes the rpc_rqst from rq_list and releases the transport_lock. This enables xprt_rdma_free to run in another process, and the rpc_rqst is released while rpcrdma_reply_handler is still waiting for the ib_unmap_fmr call to finish. But further down in rpcrdma_reply_handler, the transport_lock is taken again, and "rqst" is dereferenced. If "rqst" has already been released, this triggers a general protection fault. Since bottom- halves are disabled, the system locks up. Address both issues by reversing the order of the xprt_lookup_rqst call and the ro_unmap_sync call. Introduce a separate lookup mechanism for rpcrdma_req's to enable calling ro_unmap_sync before xprt_lookup_rqst. Now the handler takes the transport_lock once and holds it for the XID lookup and RPC completion. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 15:52:20 +00:00
out_norqst:
spin_unlock_bh(&xprt->transport_lock);
rpcrdma_buffer_put(req);
dprintk("RPC: %s: race, no rqst left for req %p\n",
__func__, req);
return;
out_shortreply:
dprintk("RPC: %s: short/invalid reply\n", __func__);
goto repost;
out_nomatch:
xprtrdma: Fix client lock-up after application signal fires After a signal, the RPC client aborts synchronous RPCs running on behalf of the signaled application. The server is still executing those RPCs, and will write the results back into the client's memory when it's done. By the time the server writes the results, that memory is likely being used for other purposes. Therefore xprtrdma has to immediately invalidate all memory regions used by those aborted RPCs to prevent the server's writes from clobbering that re-used memory. With FMR memory registration, invalidation takes a relatively long time. In fact, the invalidation is often still running when the server tries to write the results into the memory regions that are being invalidated. This sets up a race between two processes: 1. After the signal, xprt_rdma_free calls ro_unmap_safe. 2. While ro_unmap_safe is still running, the server replies and rpcrdma_reply_handler runs, calling ro_unmap_sync. Both processes invoke ib_unmap_fmr on the same FMR. The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at the same time, but HCAs generally don't tolerate this. Sometimes this can result in a system crash. If the HCA happens to survive, rpcrdma_reply_handler continues. It removes the rpc_rqst from rq_list and releases the transport_lock. This enables xprt_rdma_free to run in another process, and the rpc_rqst is released while rpcrdma_reply_handler is still waiting for the ib_unmap_fmr call to finish. But further down in rpcrdma_reply_handler, the transport_lock is taken again, and "rqst" is dereferenced. If "rqst" has already been released, this triggers a general protection fault. Since bottom- halves are disabled, the system locks up. Address both issues by reversing the order of the xprt_lookup_rqst call and the ro_unmap_sync call. Introduce a separate lookup mechanism for rpcrdma_req's to enable calling ro_unmap_sync before xprt_lookup_rqst. Now the handler takes the transport_lock once and holds it for the XID lookup and RPC completion. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 15:52:20 +00:00
spin_unlock(&buf->rb_lock);
dprintk("RPC: %s: no match for incoming xid 0x%08x\n",
__func__, be32_to_cpu(xid));
goto repost;
out_duplicate:
xprtrdma: Fix client lock-up after application signal fires After a signal, the RPC client aborts synchronous RPCs running on behalf of the signaled application. The server is still executing those RPCs, and will write the results back into the client's memory when it's done. By the time the server writes the results, that memory is likely being used for other purposes. Therefore xprtrdma has to immediately invalidate all memory regions used by those aborted RPCs to prevent the server's writes from clobbering that re-used memory. With FMR memory registration, invalidation takes a relatively long time. In fact, the invalidation is often still running when the server tries to write the results into the memory regions that are being invalidated. This sets up a race between two processes: 1. After the signal, xprt_rdma_free calls ro_unmap_safe. 2. While ro_unmap_safe is still running, the server replies and rpcrdma_reply_handler runs, calling ro_unmap_sync. Both processes invoke ib_unmap_fmr on the same FMR. The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at the same time, but HCAs generally don't tolerate this. Sometimes this can result in a system crash. If the HCA happens to survive, rpcrdma_reply_handler continues. It removes the rpc_rqst from rq_list and releases the transport_lock. This enables xprt_rdma_free to run in another process, and the rpc_rqst is released while rpcrdma_reply_handler is still waiting for the ib_unmap_fmr call to finish. But further down in rpcrdma_reply_handler, the transport_lock is taken again, and "rqst" is dereferenced. If "rqst" has already been released, this triggers a general protection fault. Since bottom- halves are disabled, the system locks up. Address both issues by reversing the order of the xprt_lookup_rqst call and the ro_unmap_sync call. Introduce a separate lookup mechanism for rpcrdma_req's to enable calling ro_unmap_sync before xprt_lookup_rqst. Now the handler takes the transport_lock once and holds it for the XID lookup and RPC completion. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 15:52:20 +00:00
spin_unlock(&buf->rb_lock);
dprintk("RPC: %s: "
"duplicate reply %p to RPC request %p: xid 0x%08x\n",
__func__, rep, req, be32_to_cpu(xid));
xprtrdma: Fix client lock-up after application signal fires After a signal, the RPC client aborts synchronous RPCs running on behalf of the signaled application. The server is still executing those RPCs, and will write the results back into the client's memory when it's done. By the time the server writes the results, that memory is likely being used for other purposes. Therefore xprtrdma has to immediately invalidate all memory regions used by those aborted RPCs to prevent the server's writes from clobbering that re-used memory. With FMR memory registration, invalidation takes a relatively long time. In fact, the invalidation is often still running when the server tries to write the results into the memory regions that are being invalidated. This sets up a race between two processes: 1. After the signal, xprt_rdma_free calls ro_unmap_safe. 2. While ro_unmap_safe is still running, the server replies and rpcrdma_reply_handler runs, calling ro_unmap_sync. Both processes invoke ib_unmap_fmr on the same FMR. The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at the same time, but HCAs generally don't tolerate this. Sometimes this can result in a system crash. If the HCA happens to survive, rpcrdma_reply_handler continues. It removes the rpc_rqst from rq_list and releases the transport_lock. This enables xprt_rdma_free to run in another process, and the rpc_rqst is released while rpcrdma_reply_handler is still waiting for the ib_unmap_fmr call to finish. But further down in rpcrdma_reply_handler, the transport_lock is taken again, and "rqst" is dereferenced. If "rqst" has already been released, this triggers a general protection fault. Since bottom- halves are disabled, the system locks up. Address both issues by reversing the order of the xprt_lookup_rqst call and the ro_unmap_sync call. Introduce a separate lookup mechanism for rpcrdma_req's to enable calling ro_unmap_sync before xprt_lookup_rqst. Now the handler takes the transport_lock once and holds it for the XID lookup and RPC completion. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-06-08 15:52:20 +00:00
/* If no pending RPC transaction was matched, post a replacement
* receive buffer before returning.
*/
repost:
r_xprt->rx_stats.bad_reply_count++;
if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, rep))
rpcrdma_recv_buffer_put(rep);
}