Merge branch 'siw' into rdma.git for-next

Bernard Metzler says:

====================
This patch set contributes the SoftiWarp driver rebased for latest
rdma-next. SoftiWarp (siw) implements the iWarp RDMA protocol over kernel
TCP sockets. The driver integrates with the linux-rdma framework.

A matching userlevel driver is available as PR at
https://github.com/linux-rdma/rdma-core/pull/536

Many thanks for reviewing and testing the driver, especially to Leon,
Jason, Steve, Doug, Olga, Dennis, Gal. You all helped to significantly
improve the driver over the last year.

Please find below a list of changes and comments, compared to older
versions of the siw driver.

Many thanks!
Bernard.

CHANGES:
========

v3 (this version)
-----------------

- Rebased to rdma-next

- Removed unnecessary initialization of enums in siw-abi.h

- Added comment on sizing of all work queues to power of two.
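
One plausible reason for the power-of-two sizing is that the queue indices are free-running 32-bit counters reduced modulo the queue size, which only stays continuous across counter wrap-around when the size divides 2^32. The following is a minimal userspace sketch of that idea, not code from the driver; ring_round_up_pow2() and ring_slot() are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round a requested queue depth up to a power of two. */
static uint32_t ring_round_up_pow2(uint32_t n)
{
	uint32_t size = 1;

	assert(n > 0 && n <= (UINT32_C(1) << 31));
	while (size < n)
		size <<= 1;
	return size;
}

/*
 * Hypothetical helper: map a free-running 32-bit counter to a slot.
 * With a power-of-two size, masking equals "counter % size" and the
 * slot sequence stays continuous when the counter wraps to zero.
 */
static uint32_t ring_slot(uint32_t counter, uint32_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return counter & (size - 1);
}
```

With a size of 100, the slot after counter 0xffffffff (slot 95) would be slot 0 instead of slot 96, so entries would be skipped at wrap-around; with 128 the sequence goes from slot 127 to slot 0 as expected.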

v2
-----------------

- Changed receive path CRC calculation to compute CRC32c not
  on the target buffer after placement, but on the original skbuf.
  This change severely hurts performance if CRC is switched
  on, since the skb must now be walked twice. It is planned to
  work on an extension to skb_copy_bits() to fold in CRC
  computation.

- Moved debugging to using ibdev_dbg().

- Dropped detailed packet debug printing.

- Removed siw_debug.[ch] files.

- Removed resource tracking, code now relies on restrack of
  RDMA midlayer. Only object counting to enforce reported
  device limits is left in place.

- Removed all nested switch-case statements.

- Cleaned up header file #include's

- Moved CQ create/destroy to new semantics,
  where midlayer creates/destroys containing object.

- Set siw's ABI version to 1 (was 0 before)

- Removed all enum initialization where not needed.

- Fixed MAINTAINERS entry for siw driver

- This version stays with the current siw-specific
  management of user memory (siw_umem_get() vs.
  ib_umem_get(), etc.), since the current ib_umem
  implementation is less efficient for user page lookup
  on the fast path, where efficiency is important for a
  SW RDMA driver.
  It is planned to contribute enhancements to the ib_umem
  framework, which make it suitable for SW drivers as well.
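
The CRC item above relies on CRC32c being computable incrementally over the pieces of a buffer, so that checksumming the fragments of an skb in order gives the same value as one pass over the contiguous data. The following is a rough userspace sketch of that property only; it does not use the kernel's skb or crypto APIs, and crc32c_update(), crc32c_frags() and struct frag are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78.
 * Slow but dependency-free; takes and returns the raw running state.
 */
static uint32_t crc32c_update(uint32_t state, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	assert(buf != NULL || len == 0);
	while (len--) {
		state ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			state = (state >> 1) ^
				(0x82F63B78u & (0u - (state & 1u)));
	}
	return state;
}

/* Hypothetical stand-in for one linear piece of a received packet. */
struct frag {
	const void *data;
	size_t len;
};

/* Walk all fragments in order, carrying the CRC state across them. */
static uint32_t crc32c_frags(const struct frag *frags, size_t nfrags)
{
	uint32_t state = 0xFFFFFFFFu;	/* standard initial value */

	for (size_t i = 0; i < nfrags; i++)
		state = crc32c_update(state, frags[i].data, frags[i].len);
	return ~state;			/* standard final inversion */
}
```

Splitting "123456789" into fragments of any sizes yields the well-known CRC32c check value 0xE3069283, the same as for the unsplit string.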

v1 (first version after v9 of siw RFC)
--------------------------------------

- Rebased to 5.2-rc1

- All IDR code got removed.

- Both MR and QP deallocation verbs now synchronously
  free the resources referenced by the RDMA mid-layer.

- IPv6 support was added.

- For compatibility with Chelsio iWarp hardware, the RX
  path was slightly reworked. It now allows packet intersection
  between tagged and untagged RDMAP operations. While not
  a defined behavior as of IETF RFC 5040/5041, some RDMA hardware
  may intersect an ongoing outbound (large) tagged message, such
  as a multisegment RDMA Read Response, with the sending of an
  untagged message, such as an RDMA Send frame. This behavior was only
  detected in an NVMeF setup, where siw was used at the target side
  and RDMA hardware at the client side (during file write). siw now
  implements two input paths, one each for tagged and untagged
  messages, and allows the intersected placement of both messages.

- The siw kernel abi file got renamed from siw_user.h to siw-abi.h.
====================
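
The two input paths described in the v1 notes can be pictured as two independent receive contexts, with the opcode of each incoming frame selecting which one is updated. The following is a simplified userspace sketch under that assumption; struct rx_ctx, struct rx_qp, rx_select() and rx_account() are hypothetical and only loosely mirror the driver's set_rx_fpdu_context().

```c
#include <assert.h>
#include <stdint.h>

/* Opcode values as listed in enum rdma_opcode of iwarp.h. */
enum {
	OP_RDMA_WRITE = 0x0,
	OP_RDMA_READ_RESP = 0x2,
	OP_SEND = 0x3
};

/* Hypothetical per-message receive progress. */
struct rx_ctx {
	uint32_t bytes_placed;
	int in_progress;
};

struct rx_qp {
	struct rx_ctx tagged;	/* RDMA Write, RDMA Read Response */
	struct rx_ctx untagged;	/* Send variants, Read Request, Terminate */
	struct rx_ctx *cur;
};

/* Pick the context by opcode, so both message kinds keep separate state. */
static struct rx_ctx *rx_select(struct rx_qp *qp, uint8_t opcode)
{
	if (opcode == OP_RDMA_WRITE || opcode == OP_RDMA_READ_RESP)
		qp->cur = &qp->tagged;
	else
		qp->cur = &qp->untagged;
	return qp->cur;
}

/* Account one received segment; 'last' marks the final segment. */
static void rx_account(struct rx_qp *qp, uint8_t opcode, uint32_t len,
		       int last)
{
	struct rx_ctx *ctx = rx_select(qp, opcode);

	assert(len > 0);
	ctx->bytes_placed += len;
	ctx->in_progress = !last;
}
```

A Send frame arriving between two segments of a Read Response then only touches the untagged context, and the tagged message resumes where it left off.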

* branch 'siw':
  SIW addition to kernel build environment
  SIW completion queue methods
  SIW receive path
  SIW transmit path
  SIW queue pair methods
  SIW application buffer management
  SIW application interface
  SIW connection management
  SIW network and RDMA core interface
  SIW main include file
  iWarp wire packet format
Jason Gunthorpe 2019-07-02 16:57:54 -03:00
commit c5cfcfcb54
20 changed files with 10773 additions and 0 deletions

@ -14558,6 +14558,13 @@ M: Chris Boot <bootc@bootc.net>
S: Maintained
F: drivers/leds/leds-net48xx.c
SOFT-IWARP DRIVER (siw)
M: Bernard Metzler <bmt@zurich.ibm.com>
L: linux-rdma@vger.kernel.org
S: Supported
F: drivers/infiniband/sw/siw/
F: include/uapi/rdma/siw-abi.h
SOFT-ROCE DRIVER (rxe)
M: Moni Shoua <monis@mellanox.com>
L: linux-rdma@vger.kernel.org

@ -96,6 +96,7 @@ source "drivers/infiniband/hw/hfi1/Kconfig"
source "drivers/infiniband/hw/qedr/Kconfig"
source "drivers/infiniband/sw/rdmavt/Kconfig"
source "drivers/infiniband/sw/rxe/Kconfig"
source "drivers/infiniband/sw/siw/Kconfig"
endif
source "drivers/infiniband/ulp/ipoib/Kconfig"

@ -1,3 +1,4 @@
# SPDX-License-Identifier: GPL-2.0-only
obj-$(CONFIG_INFINIBAND_RDMAVT) += rdmavt/
obj-$(CONFIG_RDMA_RXE) += rxe/
obj-$(CONFIG_RDMA_SIW) += siw/

@ -0,0 +1,17 @@
config RDMA_SIW
tristate "Software RDMA over TCP/IP (iWARP) driver"
depends on INET && INFINIBAND && CRYPTO_CRC32
help
This driver implements the iWARP RDMA transport over
the Linux TCP/IP network stack. It enables a system with a
standard Ethernet adapter to interoperate with an iWARP
adapter or with another system running the SIW driver.
(See also RXE which is a similar software driver for RoCE.)
The driver interfaces with the Linux RDMA stack and
implements both a kernel and user space RDMA verbs API.
The user space verbs API requires a support
library named libsiw which is loaded by the generic user
space verbs API, libibverbs. To implement RDMA over
TCP/IP, the driver further interfaces with the Linux
in-kernel TCP socket layer.

@ -0,0 +1,11 @@
obj-$(CONFIG_RDMA_SIW) += siw.o
siw-y := \
siw_cm.o \
siw_cq.o \
siw_main.o \
siw_mem.o \
siw_qp.o \
siw_qp_tx.o \
siw_qp_rx.o \
siw_verbs.o

@ -0,0 +1,380 @@
/* SPDX-License-Identifier: GPL-2.0 or BSD-3-Clause */
/* Authors: Bernard Metzler <bmt@zurich.ibm.com> */
/* Copyright (c) 2008-2019, IBM Corporation */
#ifndef _IWARP_H
#define _IWARP_H
#include <rdma/rdma_user_cm.h> /* RDMA_MAX_PRIVATE_DATA */
#include <linux/types.h>
#include <asm/byteorder.h>
#define RDMAP_VERSION 1
#define DDP_VERSION 1
#define MPA_REVISION_1 1
#define MPA_REVISION_2 2
#define MPA_MAX_PRIVDATA RDMA_MAX_PRIVATE_DATA
#define MPA_KEY_REQ "MPA ID Req Frame"
#define MPA_KEY_REP "MPA ID Rep Frame"
#define MPA_IRD_ORD_MASK 0x3fff
struct mpa_rr_params {
__be16 bits;
__be16 pd_len;
};
/*
* MPA request/response header bits & fields
*/
enum {
MPA_RR_FLAG_MARKERS = cpu_to_be16(0x8000),
MPA_RR_FLAG_CRC = cpu_to_be16(0x4000),
MPA_RR_FLAG_REJECT = cpu_to_be16(0x2000),
MPA_RR_FLAG_ENHANCED = cpu_to_be16(0x1000),
MPA_RR_FLAG_GSO_EXP = cpu_to_be16(0x0800),
MPA_RR_MASK_REVISION = cpu_to_be16(0x00ff)
};
/*
* MPA request/reply header
*/
struct mpa_rr {
__u8 key[16];
struct mpa_rr_params params;
};
static inline void __mpa_rr_set_revision(__be16 *bits, u8 rev)
{
*bits = (*bits & ~MPA_RR_MASK_REVISION) |
(cpu_to_be16(rev) & MPA_RR_MASK_REVISION);
}
static inline u8 __mpa_rr_revision(__be16 mpa_rr_bits)
{
__be16 rev = mpa_rr_bits & MPA_RR_MASK_REVISION;
return be16_to_cpu(rev);
}
enum mpa_v2_ctrl {
MPA_V2_PEER_TO_PEER = cpu_to_be16(0x8000),
MPA_V2_ZERO_LENGTH_RTR = cpu_to_be16(0x4000),
MPA_V2_RDMA_WRITE_RTR = cpu_to_be16(0x8000),
MPA_V2_RDMA_READ_RTR = cpu_to_be16(0x4000),
MPA_V2_RDMA_NO_RTR = cpu_to_be16(0x0000),
MPA_V2_MASK_IRD_ORD = cpu_to_be16(0x3fff)
};
struct mpa_v2_data {
__be16 ird;
__be16 ord;
};
struct mpa_marker {
__be16 rsvd;
__be16 fpdu_hmd; /* FPDU header-marker distance (= MPA's FPDUPTR) */
};
/*
* maximum MPA trailer
*/
struct mpa_trailer {
__u8 pad[4];
__be32 crc;
};
#define MPA_HDR_SIZE 2
#define MPA_CRC_SIZE 4
/*
* Common portion of iWARP headers (MPA, DDP, RDMAP)
* for any FPDU
*/
struct iwarp_ctrl {
__be16 mpa_len;
__be16 ddp_rdmap_ctrl;
};
/*
* DDP/RDMAP Hdr bits & fields
*/
enum {
DDP_FLAG_TAGGED = cpu_to_be16(0x8000),
DDP_FLAG_LAST = cpu_to_be16(0x4000),
DDP_MASK_RESERVED = cpu_to_be16(0x3C00),
DDP_MASK_VERSION = cpu_to_be16(0x0300),
RDMAP_MASK_VERSION = cpu_to_be16(0x00C0),
RDMAP_MASK_RESERVED = cpu_to_be16(0x0030),
RDMAP_MASK_OPCODE = cpu_to_be16(0x000f)
};
static inline u8 __ddp_get_version(struct iwarp_ctrl *ctrl)
{
return be16_to_cpu(ctrl->ddp_rdmap_ctrl & DDP_MASK_VERSION) >> 8;
}
static inline void __ddp_set_version(struct iwarp_ctrl *ctrl, u8 version)
{
ctrl->ddp_rdmap_ctrl =
(ctrl->ddp_rdmap_ctrl & ~DDP_MASK_VERSION) |
(cpu_to_be16((u16)version << 8) & DDP_MASK_VERSION);
}
static inline u8 __rdmap_get_version(struct iwarp_ctrl *ctrl)
{
__be16 ver = ctrl->ddp_rdmap_ctrl & RDMAP_MASK_VERSION;
return be16_to_cpu(ver) >> 6;
}
static inline void __rdmap_set_version(struct iwarp_ctrl *ctrl, u8 version)
{
ctrl->ddp_rdmap_ctrl = (ctrl->ddp_rdmap_ctrl & ~RDMAP_MASK_VERSION) |
(cpu_to_be16(version << 6) & RDMAP_MASK_VERSION);
}
static inline u8 __rdmap_get_opcode(struct iwarp_ctrl *ctrl)
{
return be16_to_cpu(ctrl->ddp_rdmap_ctrl & RDMAP_MASK_OPCODE);
}
static inline void __rdmap_set_opcode(struct iwarp_ctrl *ctrl, u8 opcode)
{
ctrl->ddp_rdmap_ctrl = (ctrl->ddp_rdmap_ctrl & ~RDMAP_MASK_OPCODE) |
(cpu_to_be16(opcode) & RDMAP_MASK_OPCODE);
}
struct iwarp_rdma_write {
struct iwarp_ctrl ctrl;
__be32 sink_stag;
__be64 sink_to;
};
struct iwarp_rdma_rreq {
struct iwarp_ctrl ctrl;
__be32 rsvd;
__be32 ddp_qn;
__be32 ddp_msn;
__be32 ddp_mo;
__be32 sink_stag;
__be64 sink_to;
__be32 read_size;
__be32 source_stag;
__be64 source_to;
};
struct iwarp_rdma_rresp {
struct iwarp_ctrl ctrl;
__be32 sink_stag;
__be64 sink_to;
};
struct iwarp_send {
struct iwarp_ctrl ctrl;
__be32 rsvd;
__be32 ddp_qn;
__be32 ddp_msn;
__be32 ddp_mo;
};
struct iwarp_send_inv {
struct iwarp_ctrl ctrl;
__be32 inval_stag;
__be32 ddp_qn;
__be32 ddp_msn;
__be32 ddp_mo;
};
struct iwarp_terminate {
struct iwarp_ctrl ctrl;
__be32 rsvd;
__be32 ddp_qn;
__be32 ddp_msn;
__be32 ddp_mo;
#if defined(__LITTLE_ENDIAN_BITFIELD)
__be32 layer : 4;
__be32 etype : 4;
__be32 ecode : 8;
__be32 flag_m : 1;
__be32 flag_d : 1;
__be32 flag_r : 1;
__be32 reserved : 13;
#elif defined(__BIG_ENDIAN_BITFIELD)
__be32 reserved : 13;
__be32 flag_r : 1;
__be32 flag_d : 1;
__be32 flag_m : 1;
__be32 ecode : 8;
__be32 etype : 4;
__be32 layer : 4;
#else
#error "undefined byte order"
#endif
};
/*
* Terminate Hdr bits & fields
*/
enum {
TERM_MASK_LAYER = cpu_to_be32(0xf0000000),
TERM_MASK_ETYPE = cpu_to_be32(0x0f000000),
TERM_MASK_ECODE = cpu_to_be32(0x00ff0000),
TERM_FLAG_M = cpu_to_be32(0x00008000),
TERM_FLAG_D = cpu_to_be32(0x00004000),
TERM_FLAG_R = cpu_to_be32(0x00002000),
TERM_MASK_RESVD = cpu_to_be32(0x00001fff)
};
static inline u8 __rdmap_term_layer(struct iwarp_terminate *term)
{
return term->layer;
}
static inline void __rdmap_term_set_layer(struct iwarp_terminate *term,
u8 layer)
{
term->layer = layer & 0xf;
}
static inline u8 __rdmap_term_etype(struct iwarp_terminate *term)
{
return term->etype;
}
static inline void __rdmap_term_set_etype(struct iwarp_terminate *term,
u8 etype)
{
term->etype = etype & 0xf;
}
static inline u8 __rdmap_term_ecode(struct iwarp_terminate *term)
{
return term->ecode;
}
static inline void __rdmap_term_set_ecode(struct iwarp_terminate *term,
u8 ecode)
{
term->ecode = ecode;
}
/*
* Common portion of iWARP headers (MPA, DDP, RDMAP)
* for an FPDU carrying an untagged DDP segment
*/
struct iwarp_ctrl_untagged {
struct iwarp_ctrl ctrl;
__be32 rsvd;
__be32 ddp_qn;
__be32 ddp_msn;
__be32 ddp_mo;
};
/*
* Common portion of iWARP headers (MPA, DDP, RDMAP)
* for an FPDU carrying a tagged DDP segment
*/
struct iwarp_ctrl_tagged {
struct iwarp_ctrl ctrl;
__be32 ddp_stag;
__be64 ddp_to;
};
union iwarp_hdr {
struct iwarp_ctrl ctrl;
struct iwarp_ctrl_untagged c_untagged;
struct iwarp_ctrl_tagged c_tagged;
struct iwarp_rdma_write rwrite;
struct iwarp_rdma_rreq rreq;
struct iwarp_rdma_rresp rresp;
struct iwarp_terminate terminate;
struct iwarp_send send;
struct iwarp_send_inv send_inv;
};
enum term_elayer {
TERM_ERROR_LAYER_RDMAP = 0x00,
TERM_ERROR_LAYER_DDP = 0x01,
TERM_ERROR_LAYER_LLP = 0x02 /* eg., MPA */
};
enum ddp_etype {
DDP_ETYPE_CATASTROPHIC = 0x0,
DDP_ETYPE_TAGGED_BUF = 0x1,
DDP_ETYPE_UNTAGGED_BUF = 0x2,
DDP_ETYPE_RSVD = 0x3
};
enum ddp_ecode {
/* unspecified, set to zero */
DDP_ECODE_CATASTROPHIC = 0x00,
/* Tagged Buffer Errors */
DDP_ECODE_T_INVALID_STAG = 0x00,
DDP_ECODE_T_BASE_BOUNDS = 0x01,
DDP_ECODE_T_STAG_NOT_ASSOC = 0x02,
DDP_ECODE_T_TO_WRAP = 0x03,
DDP_ECODE_T_VERSION = 0x04,
/* Untagged Buffer Errors */
DDP_ECODE_UT_INVALID_QN = 0x01,
DDP_ECODE_UT_INVALID_MSN_NOBUF = 0x02,
DDP_ECODE_UT_INVALID_MSN_RANGE = 0x03,
DDP_ECODE_UT_INVALID_MO = 0x04,
DDP_ECODE_UT_MSG_TOOLONG = 0x05,
DDP_ECODE_UT_VERSION = 0x06
};
enum rdmap_untagged_qn {
RDMAP_UNTAGGED_QN_SEND = 0,
RDMAP_UNTAGGED_QN_RDMA_READ = 1,
RDMAP_UNTAGGED_QN_TERMINATE = 2,
RDMAP_UNTAGGED_QN_COUNT = 3
};
enum rdmap_etype {
RDMAP_ETYPE_CATASTROPHIC = 0x0,
RDMAP_ETYPE_REMOTE_PROTECTION = 0x1,
RDMAP_ETYPE_REMOTE_OPERATION = 0x2
};
enum rdmap_ecode {
RDMAP_ECODE_INVALID_STAG = 0x00,
RDMAP_ECODE_BASE_BOUNDS = 0x01,
RDMAP_ECODE_ACCESS_RIGHTS = 0x02,
RDMAP_ECODE_STAG_NOT_ASSOC = 0x03,
RDMAP_ECODE_TO_WRAP = 0x04,
RDMAP_ECODE_VERSION = 0x05,
RDMAP_ECODE_OPCODE = 0x06,
RDMAP_ECODE_CATASTROPHIC_STREAM = 0x07,
RDMAP_ECODE_CATASTROPHIC_GLOBAL = 0x08,
RDMAP_ECODE_CANNOT_INVALIDATE = 0x09,
RDMAP_ECODE_UNSPECIFIED = 0xff
};
enum llp_ecode {
LLP_ECODE_TCP_STREAM_LOST = 0x01, /* How to transfer this ?? */
LLP_ECODE_RECEIVED_CRC = 0x02,
LLP_ECODE_FPDU_START = 0x03,
LLP_ECODE_INVALID_REQ_RESP = 0x04,
/* Errors for Enhanced Connection Establishment only */
LLP_ECODE_LOCAL_CATASTROPHIC = 0x05,
LLP_ECODE_INSUFFICIENT_IRD = 0x06,
LLP_ECODE_NO_MATCHING_RTR = 0x07
};
enum llp_etype { LLP_ETYPE_MPA = 0x00 };
enum rdma_opcode {
RDMAP_RDMA_WRITE = 0x0,
RDMAP_RDMA_READ_REQ = 0x1,
RDMAP_RDMA_READ_RESP = 0x2,
RDMAP_SEND = 0x3,
RDMAP_SEND_INVAL = 0x4,
RDMAP_SEND_SE = 0x5,
RDMAP_SEND_SE_INVAL = 0x6,
RDMAP_TERMINATE = 0x7,
RDMAP_NOT_SUPPORTED = RDMAP_TERMINATE + 1
};
#endif
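
The accessors in iwarp.h pack the DDP flags, both protocol versions and the RDMAP opcode into one 16-bit control word kept in network byte order. The following userspace sketch reproduces that bit layout with htons()/ntohs() in place of the kernel's cpu_to_be16()/be16_to_cpu(); unlike the header, it masks in host byte order, and all CTRL_* and ctrl_*() names are hypothetical.

```c
#include <arpa/inet.h>	/* htons(), ntohs() */
#include <assert.h>
#include <stdint.h>

/* Host-order masks matching the DDP/RDMAP bit positions in iwarp.h. */
#define CTRL_FLAG_TAGGED	0x8000u
#define CTRL_FLAG_LAST		0x4000u
#define CTRL_MASK_DDP_VERSION	0x0300u
#define CTRL_MASK_RDMAP_VERSION	0x00C0u
#define CTRL_MASK_OPCODE	0x000Fu

/* Build the control word in host order, return it in network order. */
static uint16_t ctrl_pack(int tagged, int last, uint8_t ddp_ver,
			  uint8_t rdmap_ver, uint8_t opcode)
{
	unsigned int host = 0;

	assert(ddp_ver <= 3 && rdmap_ver <= 3 && opcode <= 0xf);
	if (tagged)
		host |= CTRL_FLAG_TAGGED;
	if (last)
		host |= CTRL_FLAG_LAST;
	host |= ((unsigned int)ddp_ver << 8) & CTRL_MASK_DDP_VERSION;
	host |= ((unsigned int)rdmap_ver << 6) & CTRL_MASK_RDMAP_VERSION;
	host |= opcode & CTRL_MASK_OPCODE;
	return htons((uint16_t)host);
}

static uint8_t ctrl_ddp_version(uint16_t wire)
{
	return (uint8_t)((ntohs(wire) & CTRL_MASK_DDP_VERSION) >> 8);
}

static uint8_t ctrl_rdmap_version(uint16_t wire)
{
	return (uint8_t)((ntohs(wire) & CTRL_MASK_RDMAP_VERSION) >> 6);
}

static uint8_t ctrl_opcode(uint16_t wire)
{
	return (uint8_t)(ntohs(wire) & CTRL_MASK_OPCODE);
}
```

For example, a final tagged RDMA Write segment with both versions set to 1 gives the host-order value 0xC140, and a final Send segment gives 0x4143.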

@ -0,0 +1,745 @@
/* SPDX-License-Identifier: GPL-2.0 or BSD-3-Clause */
/* Authors: Bernard Metzler <bmt@zurich.ibm.com> */
/* Copyright (c) 2008-2019, IBM Corporation */
#ifndef _SIW_H
#define _SIW_H
#include <rdma/ib_verbs.h>
#include <linux/socket.h>
#include <linux/skbuff.h>
#include <crypto/hash.h>
#include <linux/crc32.h>
#include <linux/crc32c.h>
#include <rdma/siw-abi.h>
#include "iwarp.h"
#define SIW_VENDOR_ID 0x626d74 /* ascii 'bmt' for now */
#define SIW_VENDORT_PART_ID 0
#define SIW_MAX_QP (1024 * 100)
#define SIW_MAX_QP_WR (1024 * 32)
#define SIW_MAX_ORD_QP 128
#define SIW_MAX_IRD_QP 128
#define SIW_MAX_SGE_PBL 256 /* max num sge's for PBL */
#define SIW_MAX_SGE_RD 1 /* iwarp limitation. we could relax */
#define SIW_MAX_CQ (1024 * 100)
#define SIW_MAX_CQE (SIW_MAX_QP_WR * 100)
#define SIW_MAX_MR (SIW_MAX_QP * 10)
#define SIW_MAX_PD SIW_MAX_QP
#define SIW_MAX_MW 0 /* to be set if MW's are supported */
#define SIW_MAX_FMR SIW_MAX_MR
#define SIW_MAX_SRQ SIW_MAX_QP
#define SIW_MAX_SRQ_WR (SIW_MAX_QP_WR * 10)
#define SIW_MAX_CONTEXT SIW_MAX_PD
/* Min number of bytes for using zero copy transmit */
#define SENDPAGE_THRESH PAGE_SIZE
/* Maximum number of frames which can be sent in one SQ processing */
#define SQ_USER_MAXBURST 100
/* Maximum number of consecutive IRQ elements which get served
* if SQ has pending work. Prevents starving local SQ processing
* by serving peer Read Requests.
*/
#define SIW_IRQ_MAXBURST_SQ_ACTIVE 4
struct siw_dev_cap {
int max_qp;
int max_qp_wr;
int max_ord; /* max. outbound read queue depth */
int max_ird; /* max. inbound read queue depth */
int max_sge;
int max_sge_rd;
int max_cq;
int max_cqe;
int max_mr;
int max_pd;
int max_mw;
int max_fmr;
int max_srq;
int max_srq_wr;
int max_srq_sge;
};
struct siw_pd {
struct ib_pd base_pd;
};
struct siw_device {
struct ib_device base_dev;
struct net_device *netdev;
struct siw_dev_cap attrs;
u32 vendor_part_id;
int numa_node;
/* physical port state (only one port per device) */
enum ib_port_state state;
spinlock_t lock;
struct xarray qp_xa;
struct xarray mem_xa;
struct list_head cep_list;
struct list_head qp_list;
/* active objects statistics to enforce limits */
atomic_t num_qp;
atomic_t num_cq;
atomic_t num_pd;
atomic_t num_mr;
atomic_t num_srq;
atomic_t num_ctx;
struct work_struct netdev_down;
};
struct siw_uobj {
void *addr;
u32 size;
};
struct siw_ucontext {
struct ib_ucontext base_ucontext;
struct siw_device *sdev;
/* xarray of user mappable objects */
struct xarray xa;
u32 uobj_nextkey;
};
/*
* The RDMA core does not define LOCAL_READ access, which is always
* enabled implicitly.
*/
#define IWARP_ACCESS_MASK \
(IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE | \
IB_ACCESS_REMOTE_READ)
/*
* siw representation of user memory registered as source
* or target of RDMA operations.
*/
struct siw_page_chunk {
struct page **plist;
};
struct siw_umem {
struct siw_page_chunk *page_chunk;
int num_pages;
bool writable;
u64 fp_addr; /* First page base address */
struct mm_struct *owning_mm;
};
struct siw_pble {
u64 addr; /* Address of assigned user buffer */
u64 size; /* Size of this entry */
u64 pbl_off; /* Total offset from start of PBL */
};
struct siw_pbl {
unsigned int num_buf;
unsigned int max_buf;
struct siw_pble pbe[1];
};
struct siw_mr;
/*
* Generic memory representation for registered siw memory.
* Memory lookup always via higher 24 bit of STag (STag index).
*/
struct siw_mem {
struct siw_device *sdev;
struct kref ref;
u64 va; /* VA of memory */
u64 len; /* length of the memory buffer in bytes */
u32 stag; /* iWarp memory access steering tag */
u8 stag_valid; /* VALID or INVALID */
u8 is_pbl; /* PBL or user space mem */
u8 is_mw; /* Memory Region or Memory Window */
enum ib_access_flags perms; /* local/remote READ & WRITE */
union {
struct siw_umem *umem;
struct siw_pbl *pbl;
void *mem_obj;
};
struct ib_pd *pd;
};
struct siw_mr {
struct ib_mr base_mr;
struct siw_mem *mem;
struct rcu_head rcu;
};
/*
* Error codes for local or remote
* access to registered memory
*/
enum siw_access_state {
E_ACCESS_OK,
E_STAG_INVALID,
E_BASE_BOUNDS,
E_ACCESS_PERM,
E_PD_MISMATCH
};
enum siw_wr_state {
SIW_WR_IDLE,
SIW_WR_QUEUED, /* processing has not started yet */
SIW_WR_INPROGRESS /* initiated processing of the WR */
};
/* The WQE currently being processed (RX or TX) */
struct siw_wqe {
/* Copy of application's SQE or RQE */
union {
struct siw_sqe sqe;
struct siw_rqe rqe;
};
struct siw_mem *mem[SIW_MAX_SGE]; /* per sge's resolved mem */
enum siw_wr_state wr_status;
enum siw_wc_status wc_status;
u32 bytes; /* total bytes to process */
u32 processed; /* bytes processed */
};
struct siw_cq {
struct ib_cq base_cq;
spinlock_t lock;
u64 *notify;
struct siw_cqe *queue;
u32 cq_put;
u32 cq_get;
u32 num_cqe;
bool kernel_verbs;
u32 xa_cq_index; /* mmap information for CQE array */
u32 id; /* For debugging only */
};
enum siw_qp_state {
SIW_QP_STATE_IDLE,
SIW_QP_STATE_RTR,
SIW_QP_STATE_RTS,
SIW_QP_STATE_CLOSING,
SIW_QP_STATE_TERMINATE,
SIW_QP_STATE_ERROR,
SIW_QP_STATE_COUNT
};
enum siw_qp_flags {
SIW_RDMA_BIND_ENABLED = (1 << 0),
SIW_RDMA_WRITE_ENABLED = (1 << 1),
SIW_RDMA_READ_ENABLED = (1 << 2),
SIW_SIGNAL_ALL_WR = (1 << 3),
SIW_MPA_CRC = (1 << 4),
SIW_QP_IN_DESTROY = (1 << 5)
};
enum siw_qp_attr_mask {
SIW_QP_ATTR_STATE = (1 << 0),
SIW_QP_ATTR_ACCESS_FLAGS = (1 << 1),
SIW_QP_ATTR_LLP_HANDLE = (1 << 2),
SIW_QP_ATTR_ORD = (1 << 3),
SIW_QP_ATTR_IRD = (1 << 4),
SIW_QP_ATTR_SQ_SIZE = (1 << 5),
SIW_QP_ATTR_RQ_SIZE = (1 << 6),
SIW_QP_ATTR_MPA = (1 << 7)
};
struct siw_srq {
struct ib_srq base_srq;
spinlock_t lock;
u32 max_sge;
u32 limit; /* low watermark for async event */
struct siw_rqe *recvq;
u32 rq_put;
u32 rq_get;
u32 num_rqe; /* max # of wqe's allowed */
u32 xa_srq_index; /* mmap information for SRQ array */
char armed; /* inform user if limit hit */
char kernel_verbs; /* '1' if kernel client */
};
struct siw_qp_attrs {
enum siw_qp_state state;
u32 sq_size;
u32 rq_size;
u32 orq_size;
u32 irq_size;
u32 sq_max_sges;
u32 rq_max_sges;
enum siw_qp_flags flags;
struct socket *sk;
};
enum siw_tx_ctx {
SIW_SEND_HDR, /* start or continue sending HDR */
SIW_SEND_DATA, /* start or continue sending DDP payload */
SIW_SEND_TRAILER, /* start or continue sending TRAILER */
SIW_SEND_SHORT_FPDU/* send whole FPDU hdr|data|trailer at once */
};
enum siw_rx_state {
SIW_GET_HDR, /* await new hdr or within hdr */
SIW_GET_DATA_START, /* start of inbound DDP payload */
SIW_GET_DATA_MORE, /* continuation of (misaligned) DDP payload */
SIW_GET_TRAILER/* await new trailer or within trailer */
};
struct siw_rx_stream {
struct sk_buff *skb;
int skb_new; /* pending unread bytes in skb */
int skb_offset; /* offset in skb */
int skb_copied; /* processed bytes in skb */
union iwarp_hdr hdr;
struct mpa_trailer trailer;
enum siw_rx_state state;
/*
* For each FPDU, main RX loop runs through 3 stages:
* Receiving protocol headers, placing DDP payload and receiving
* trailer information (CRC + possibly padding).
* Next two variables keep state on receive status of the
* current FPDU part (hdr, data, trailer).
*/
int fpdu_part_rcvd; /* bytes in pkt part copied */
int fpdu_part_rem; /* bytes in pkt part not seen */
/*
* Next expected DDP MSN for each QN +
* expected steering tag +
* expected DDP tagged offset (all HBO)
*/
u32 ddp_msn[RDMAP_UNTAGGED_QN_COUNT];
u32 ddp_stag;
u64 ddp_to;
u32 inval_stag; /* Stag to be invalidated */
struct shash_desc *mpa_crc_hd;
u8 rx_suspend : 1;
u8 pad : 2; /* # of pad bytes expected */
u8 rdmap_op : 4; /* opcode of current frame */
};
struct siw_rx_fpdu {
/*
* Local destination memory of inbound RDMA operation.
* Valid, according to wqe->wr_status
*/
struct siw_wqe wqe_active;
unsigned int pbl_idx; /* Index into current PBL */
unsigned int sge_idx; /* current sge in rx */
unsigned int sge_off; /* already rcvd in curr. sge */
char first_ddp_seg; /* this is the first DDP seg */
char more_ddp_segs; /* more DDP segs expected */
u8 prev_rdmap_op : 4; /* opcode of prev frame */
};
/*
* Shorthands for short packets w/o payload
* to be transmitted more efficiently.
*/
struct siw_send_pkt {
struct iwarp_send send;
__be32 crc;
};
struct siw_write_pkt {
struct iwarp_rdma_write write;
__be32 crc;
};
struct siw_rreq_pkt {
struct iwarp_rdma_rreq rreq;
__be32 crc;
};
struct siw_rresp_pkt {
struct iwarp_rdma_rresp rresp;
__be32 crc;
};
struct siw_iwarp_tx {
union {
union iwarp_hdr hdr;
/* Generic part of FPDU header */
struct iwarp_ctrl ctrl;
struct iwarp_ctrl_untagged c_untagged;
struct iwarp_ctrl_tagged c_tagged;
/* FPDU headers */
struct iwarp_rdma_write rwrite;
struct iwarp_rdma_rreq rreq;
struct iwarp_rdma_rresp rresp;
struct iwarp_terminate terminate;
struct iwarp_send send;
struct iwarp_send_inv send_inv;
/* complete short FPDUs */
struct siw_send_pkt send_pkt;
struct siw_write_pkt write_pkt;
struct siw_rreq_pkt rreq_pkt;
struct siw_rresp_pkt rresp_pkt;
} pkt;
struct mpa_trailer trailer;
/* DDP MSN for untagged messages */
u32 ddp_msn[RDMAP_UNTAGGED_QN_COUNT];
enum siw_tx_ctx state;
u16 ctrl_len; /* ddp+rdmap hdr */
u16 ctrl_sent;
int burst;
int bytes_unsent; /* ddp payload bytes */
struct shash_desc *mpa_crc_hd;
u8 do_crc : 1; /* do crc for segment */
u8 use_sendpage : 1; /* send w/o copy */
u8 tx_suspend : 1; /* stop sending DDP segs. */
u8 pad : 2; /* # pad in current fpdu */
u8 orq_fence : 1; /* ORQ full or Send fenced */
u8 in_syscall : 1; /* TX out of user context */
u8 zcopy_tx : 1; /* Use TCP_SENDPAGE if possible */
u8 gso_seg_limit; /* Maximum segments for GSO, 0 = unbound */
u16 fpdu_len; /* len of FPDU to tx */
unsigned int tcp_seglen; /* remaining tcp seg space */
struct siw_wqe wqe_active;
int pbl_idx; /* Index into current PBL */
int sge_idx; /* current sge in tx */
u32 sge_off; /* already sent in curr. sge */
};
struct siw_qp {
struct siw_device *sdev;
struct ib_qp *ib_qp;
struct kref ref;
u32 qp_num;
struct list_head devq;
int tx_cpu;
bool kernel_verbs;
struct siw_qp_attrs attrs;
struct siw_cep *cep;
struct rw_semaphore state_lock;
struct ib_pd *pd;
struct siw_cq *scq;
struct siw_cq *rcq;
struct siw_srq *srq;
struct siw_iwarp_tx tx_ctx; /* Transmit context */
spinlock_t sq_lock;
struct siw_sqe *sendq; /* send queue element array */
uint32_t sq_get; /* consumer index into sq array */
uint32_t sq_put; /* kernel prod. index into sq array */
struct llist_node tx_list;
struct siw_sqe *orq; /* outbound read queue element array */
spinlock_t orq_lock;
uint32_t orq_get; /* consumer index into orq array */
uint32_t orq_put; /* shared producer index for ORQ */
struct siw_rx_stream rx_stream;
struct siw_rx_fpdu *rx_fpdu;
struct siw_rx_fpdu rx_tagged;
struct siw_rx_fpdu rx_untagged;
spinlock_t rq_lock;
struct siw_rqe *recvq; /* recv queue element array */
uint32_t rq_get; /* consumer index into rq array */
uint32_t rq_put; /* kernel prod. index into rq array */
struct siw_sqe *irq; /* inbound read queue element array */
uint32_t irq_get; /* consumer index into irq array */
uint32_t irq_put; /* producer index into irq array */
int irq_burst;
struct { /* information to be carried in TERMINATE pkt, if valid */
u8 valid;
u8 in_tx;
u8 layer : 4, etype : 4;
u8 ecode;
} term_info;
u32 xa_sq_index; /* mmap information for SQE array */
u32 xa_rq_index; /* mmap information for RQE array */
struct rcu_head rcu;
};
struct siw_base_qp {
struct ib_qp base_qp;
struct siw_qp *qp;
};
/* helper macros */
#define rx_qp(rx) container_of(rx, struct siw_qp, rx_stream)
#define tx_qp(tx) container_of(tx, struct siw_qp, tx_ctx)
#define tx_wqe(qp) (&(qp)->tx_ctx.wqe_active)
#define rx_wqe(rctx) (&(rctx)->wqe_active)
#define rx_mem(rctx) ((rctx)->wqe_active.mem[0])
#define tx_type(wqe) ((wqe)->sqe.opcode)
#define rx_type(wqe) ((wqe)->rqe.opcode)
#define tx_flags(wqe) ((wqe)->sqe.flags)
struct iwarp_msg_info {
int hdr_len;
struct iwarp_ctrl ctrl;
int (*rx_data)(struct siw_qp *qp);
};
/* Global siw parameters. Currently set in siw_main.c */
extern const bool zcopy_tx;
extern const bool try_gso;
extern const bool loopback_enabled;
extern const bool mpa_crc_required;
extern const bool mpa_crc_strict;
extern const bool siw_tcp_nagle;
extern u_char mpa_version;
extern const bool peer_to_peer;
extern struct task_struct *siw_tx_thread[];
extern struct crypto_shash *siw_crypto_shash;
extern struct iwarp_msg_info iwarp_pktinfo[RDMAP_TERMINATE + 1];
/* QP general functions */
int siw_qp_modify(struct siw_qp *qp, struct siw_qp_attrs *attr,
enum siw_qp_attr_mask mask);
int siw_qp_mpa_rts(struct siw_qp *qp, enum mpa_v2_ctrl ctrl);
void siw_qp_llp_close(struct siw_qp *qp);
void siw_qp_cm_drop(struct siw_qp *qp, int schedule);
void siw_send_terminate(struct siw_qp *qp);
void siw_qp_get_ref(struct ib_qp *qp);
void siw_qp_put_ref(struct ib_qp *qp);
int siw_qp_add(struct siw_device *sdev, struct siw_qp *qp);
void siw_free_qp(struct kref *ref);
void siw_init_terminate(struct siw_qp *qp, enum term_elayer layer,
u8 etype, u8 ecode, int in_tx);
enum ddp_ecode siw_tagged_error(enum siw_access_state state);
enum rdmap_ecode siw_rdmap_error(enum siw_access_state state);
void siw_read_to_orq(struct siw_sqe *rreq, struct siw_sqe *sqe);
int siw_sqe_complete(struct siw_qp *qp, struct siw_sqe *sqe, u32 bytes,
enum siw_wc_status status);
int siw_rqe_complete(struct siw_qp *qp, struct siw_rqe *rqe, u32 bytes,
u32 inval_stag, enum siw_wc_status status);
void siw_qp_llp_data_ready(struct sock *sk);
void siw_qp_llp_write_space(struct sock *sk);
/* QP TX path functions */
int siw_run_sq(void *arg);
int siw_qp_sq_process(struct siw_qp *qp);
int siw_sq_start(struct siw_qp *qp);
int siw_activate_tx(struct siw_qp *qp);
void siw_stop_tx_thread(int nr_cpu);
int siw_get_tx_cpu(struct siw_device *sdev);
void siw_put_tx_cpu(int cpu);
/* QP RX path functions */
int siw_proc_send(struct siw_qp *qp);
int siw_proc_rreq(struct siw_qp *qp);
int siw_proc_rresp(struct siw_qp *qp);
int siw_proc_write(struct siw_qp *qp);
int siw_proc_terminate(struct siw_qp *qp);
int siw_tcp_rx_data(read_descriptor_t *rd_desc, struct sk_buff *skb,
unsigned int off, size_t len);
static inline void set_rx_fpdu_context(struct siw_qp *qp, u8 opcode)
{
if (opcode == RDMAP_RDMA_WRITE || opcode == RDMAP_RDMA_READ_RESP)
qp->rx_fpdu = &qp->rx_tagged;
else
qp->rx_fpdu = &qp->rx_untagged;
qp->rx_stream.rdmap_op = opcode;
}
static inline struct siw_ucontext *to_siw_ctx(struct ib_ucontext *base_ctx)
{
return container_of(base_ctx, struct siw_ucontext, base_ucontext);
}
static inline struct siw_base_qp *to_siw_base_qp(struct ib_qp *base_qp)
{
return container_of(base_qp, struct siw_base_qp, base_qp);
}
static inline struct siw_qp *to_siw_qp(struct ib_qp *base_qp)
{
return to_siw_base_qp(base_qp)->qp;
}
static inline struct siw_cq *to_siw_cq(struct ib_cq *base_cq)
{
return container_of(base_cq, struct siw_cq, base_cq);
}
static inline struct siw_srq *to_siw_srq(struct ib_srq *base_srq)
{
return container_of(base_srq, struct siw_srq, base_srq);
}
static inline struct siw_device *to_siw_dev(struct ib_device *base_dev)
{
return container_of(base_dev, struct siw_device, base_dev);
}
static inline struct siw_mr *to_siw_mr(struct ib_mr *base_mr)
{
return container_of(base_mr, struct siw_mr, base_mr);
}
static inline struct siw_qp *siw_qp_id2obj(struct siw_device *sdev, int id)
{
struct siw_qp *qp;
rcu_read_lock();
qp = xa_load(&sdev->qp_xa, id);
if (likely(qp && kref_get_unless_zero(&qp->ref))) {
rcu_read_unlock();
return qp;
}
rcu_read_unlock();
return NULL;
}
static inline u32 qp_id(struct siw_qp *qp)
{
return qp->qp_num;
}
static inline void siw_qp_get(struct siw_qp *qp)
{
kref_get(&qp->ref);
}
static inline void siw_qp_put(struct siw_qp *qp)
{
kref_put(&qp->ref, siw_free_qp);
}
static inline int siw_sq_empty(struct siw_qp *qp)
{
struct siw_sqe *sqe = &qp->sendq[qp->sq_get % qp->attrs.sq_size];
return READ_ONCE(sqe->flags) == 0;
}
static inline struct siw_sqe *sq_get_next(struct siw_qp *qp)
{
struct siw_sqe *sqe = &qp->sendq[qp->sq_get % qp->attrs.sq_size];
if (READ_ONCE(sqe->flags) & SIW_WQE_VALID)
return sqe;
return NULL;
}
static inline struct siw_sqe *orq_get_current(struct siw_qp *qp)
{
return &qp->orq[qp->orq_get % qp->attrs.orq_size];
}
static inline struct siw_sqe *orq_get_tail(struct siw_qp *qp)
{
return &qp->orq[qp->orq_put % qp->attrs.orq_size];
}
static inline struct siw_sqe *orq_get_free(struct siw_qp *qp)
{
struct siw_sqe *orq_e = orq_get_tail(qp);
if (orq_e && READ_ONCE(orq_e->flags) == 0)
return orq_e;
return NULL;
}
static inline int siw_orq_empty(struct siw_qp *qp)
{
return qp->orq[qp->orq_get % qp->attrs.orq_size].flags == 0 ? 1 : 0;
}
static inline struct siw_sqe *irq_alloc_free(struct siw_qp *qp)
{
struct siw_sqe *irq_e = &qp->irq[qp->irq_put % qp->attrs.irq_size];
if (READ_ONCE(irq_e->flags) == 0) {
qp->irq_put++;
return irq_e;
}
return NULL;
}
static inline __wsum siw_csum_update(const void *buff, int len, __wsum sum)
{
return (__force __wsum)crc32c((__force __u32)sum, buff, len);
}
static inline __wsum siw_csum_combine(__wsum csum, __wsum csum2, int offset,
int len)
{
return (__force __wsum)__crc32c_le_combine((__force __u32)csum,
(__force __u32)csum2, len);
}
static inline void siw_crc_skb(struct siw_rx_stream *srx, unsigned int len)
{
const struct skb_checksum_ops siw_cs_ops = {
.update = siw_csum_update,
.combine = siw_csum_combine,
};
__wsum crc = *(u32 *)shash_desc_ctx(srx->mpa_crc_hd);
crc = __skb_checksum(srx->skb, srx->skb_offset, len, crc,
&siw_cs_ops);
*(u32 *)shash_desc_ctx(srx->mpa_crc_hd) = crc;
}
#define siw_dbg(ibdev, fmt, ...) \
ibdev_dbg(ibdev, "%s: " fmt, __func__, ##__VA_ARGS__)
#define siw_dbg_qp(qp, fmt, ...) \
ibdev_dbg(&qp->sdev->base_dev, "QP[%u] %s: " fmt, qp_id(qp), __func__, \
##__VA_ARGS__)
#define siw_dbg_cq(cq, fmt, ...) \
ibdev_dbg(cq->base_cq.device, "CQ[%u] %s: " fmt, cq->id, __func__, \
##__VA_ARGS__)
#define siw_dbg_pd(pd, fmt, ...) \
ibdev_dbg(pd->device, "PD[%u] %s: " fmt, pd->res.id, __func__, \
##__VA_ARGS__)
#define siw_dbg_mem(mem, fmt, ...) \
ibdev_dbg(&mem->sdev->base_dev, \
"MEM[0x%08x] %s: " fmt, mem->stag, __func__, ##__VA_ARGS__)
#define siw_dbg_cep(cep, fmt, ...) \
ibdev_dbg(&cep->sdev->base_dev, "CEP[0x%p] %s: " fmt, \
cep, __func__, ##__VA_ARGS__)
void siw_cq_flush(struct siw_cq *cq);
void siw_sq_flush(struct siw_qp *qp);
void siw_rq_flush(struct siw_qp *qp);
int siw_reap_cqe(struct siw_cq *cq, struct ib_wc *wc);
#endif
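
The inline queue helpers in siw.h (siw_sq_empty(), sq_get_next(), irq_alloc_free()) treat a slot's flags field as its ownership marker: zero means free, a valid bit means the consumer may process it. The following single-threaded userspace sketch shows that scheme in isolation; struct slot, struct ring and the ring_*() functions are hypothetical, SLOT_VALID merely stands in for SIW_WQE_VALID, and the READ_ONCE() accesses and memory ordering a concurrent version would need are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_VALID 0x1u	/* hypothetical stand-in for SIW_WQE_VALID */

struct slot {
	uint32_t flags;	/* 0: free, SLOT_VALID: ready for the consumer */
	uint64_t id;
};

struct ring {
	struct slot *slots;
	uint32_t size;	/* number of slots, a power of two */
	uint32_t put;	/* free-running producer counter */
	uint32_t get;	/* free-running consumer counter */
};

/* Producer: fill the slot at 'put' if it is free, then publish it. */
static int ring_post(struct ring *r, uint64_t id)
{
	struct slot *s = &r->slots[r->put % r->size];

	if (s->flags != 0)
		return -1;	/* queue full */
	s->id = id;
	s->flags = SLOT_VALID;	/* set last, so the payload is complete */
	r->put++;
	return 0;
}

/* Consumer: return the next valid slot, or NULL if the queue is empty. */
static struct slot *ring_peek(struct ring *r)
{
	struct slot *s = &r->slots[r->get % r->size];

	return (s->flags & SLOT_VALID) ? s : NULL;
}

/* Consumer: hand the current slot back to the producer. */
static void ring_release(struct ring *r)
{
	struct slot *s = &r->slots[r->get % r->size];

	assert(s->flags & SLOT_VALID);
	s->flags = 0;
	r->get++;
}
```

Because fullness and emptiness are both read from the slot itself, producer and consumer never need to compare each other's counters.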

File diff suppressed because it is too large

@ -0,0 +1,133 @@
/* SPDX-License-Identifier: GPL-2.0 or BSD-3-Clause */
/* Authors: Bernard Metzler <bmt@zurich.ibm.com> */
/* Greg Joyce <greg@opengridcomputing.com> */
/* Copyright (c) 2008-2019, IBM Corporation */
/* Copyright (c) 2017, Open Grid Computing, Inc. */
#ifndef _SIW_CM_H
#define _SIW_CM_H
#include <net/sock.h>
#include <linux/tcp.h>
#include <rdma/iw_cm.h>
enum siw_cep_state {
SIW_EPSTATE_IDLE = 1,
SIW_EPSTATE_LISTENING,
SIW_EPSTATE_CONNECTING,
SIW_EPSTATE_AWAIT_MPAREQ,
SIW_EPSTATE_RECVD_MPAREQ,
SIW_EPSTATE_AWAIT_MPAREP,
SIW_EPSTATE_RDMA_MODE,
SIW_EPSTATE_CLOSED
};
struct siw_mpa_info {
struct mpa_rr hdr; /* peer mpa hdr in host byte order */
struct mpa_v2_data v2_ctrl;
struct mpa_v2_data v2_ctrl_req;
char *pdata;
int bytes_rcvd;
};
struct siw_device;
struct siw_cep {
struct iw_cm_id *cm_id;
struct siw_device *sdev;
struct list_head devq;
spinlock_t lock;
struct kref ref;
int in_use;
wait_queue_head_t waitq;
enum siw_cep_state state;
struct list_head listenq;
struct siw_cep *listen_cep;
struct siw_qp *qp;
struct socket *sock;
struct siw_cm_work *mpa_timer;
struct list_head work_freelist;
struct siw_mpa_info mpa;
int ord;
int ird;
bool enhanced_rdma_conn_est;
/* Saved upcalls of socket */
void (*sk_state_change)(struct sock *sk);
void (*sk_data_ready)(struct sock *sk);
void (*sk_write_space)(struct sock *sk);
void (*sk_error_report)(struct sock *sk);
};
/*
* Connection initiator waits 10 seconds to receive an
* MPA reply after sending out MPA request. Responder waits for
* 5 seconds for MPA request to arrive if new TCP connection
* was set up.
*/
#define MPAREQ_TIMEOUT (HZ * 10)
#define MPAREP_TIMEOUT (HZ * 5)
enum siw_work_type {
SIW_CM_WORK_ACCEPT = 1,
SIW_CM_WORK_READ_MPAHDR,
SIW_CM_WORK_CLOSE_LLP, /* close socket */
SIW_CM_WORK_PEER_CLOSE, /* socket indicated peer close */
SIW_CM_WORK_MPATIMEOUT
};
struct siw_cm_work {
struct delayed_work work;
struct list_head list;
enum siw_work_type type;
struct siw_cep *cep;
};
#define to_sockaddr_in(a) (*(struct sockaddr_in *)(&(a)))
#define to_sockaddr_in6(a) (*(struct sockaddr_in6 *)(&(a)))
static inline int getname_peer(struct socket *s, struct sockaddr_storage *a)
{
return s->ops->getname(s, (struct sockaddr *)a, 1);
}
static inline int getname_local(struct socket *s, struct sockaddr_storage *a)
{
return s->ops->getname(s, (struct sockaddr *)a, 0);
}
static inline int ksock_recv(struct socket *sock, char *buf, size_t size,
int flags)
{
struct kvec iov = { buf, size };
struct msghdr msg = { .msg_name = NULL, .msg_flags = flags };
return kernel_recvmsg(sock, &msg, &iov, 1, size, flags);
}
int siw_connect(struct iw_cm_id *id, struct iw_cm_conn_param *parm);
int siw_accept(struct iw_cm_id *id, struct iw_cm_conn_param *param);
int siw_reject(struct iw_cm_id *id, const void *data, u8 len);
int siw_create_listen(struct iw_cm_id *id, int backlog);
int siw_destroy_listen(struct iw_cm_id *id);
void siw_cep_get(struct siw_cep *cep);
void siw_cep_put(struct siw_cep *cep);
int siw_cm_queue_work(struct siw_cep *cep, enum siw_work_type type);
int siw_cm_init(void);
void siw_cm_exit(void);
/*
* TCP socket interface
*/
#define sk_to_qp(sk) (((struct siw_cep *)((sk)->sk_user_data))->qp)
#define sk_to_cep(sk) ((struct siw_cep *)((sk)->sk_user_data))
#endif

@ -0,0 +1,101 @@
// SPDX-License-Identifier: GPL-2.0 or BSD-3-Clause
/* Authors: Bernard Metzler <bmt@zurich.ibm.com> */
/* Copyright (c) 2008-2019, IBM Corporation */
#include <linux/errno.h>
#include <linux/types.h>
#include <rdma/ib_verbs.h>
#include "siw.h"
static int map_wc_opcode[SIW_NUM_OPCODES] = {
[SIW_OP_WRITE] = IB_WC_RDMA_WRITE,
[SIW_OP_SEND] = IB_WC_SEND,
[SIW_OP_SEND_WITH_IMM] = IB_WC_SEND,
[SIW_OP_READ] = IB_WC_RDMA_READ,
[SIW_OP_READ_LOCAL_INV] = IB_WC_RDMA_READ,
[SIW_OP_COMP_AND_SWAP] = IB_WC_COMP_SWAP,
[SIW_OP_FETCH_AND_ADD] = IB_WC_FETCH_ADD,
[SIW_OP_INVAL_STAG] = IB_WC_LOCAL_INV,
[SIW_OP_REG_MR] = IB_WC_REG_MR,
[SIW_OP_RECEIVE] = IB_WC_RECV,
[SIW_OP_READ_RESPONSE] = -1 /* not used */
};
static struct {
enum siw_wc_status siw;
enum ib_wc_status ib;
} map_cqe_status[SIW_NUM_WC_STATUS] = {
{ SIW_WC_SUCCESS, IB_WC_SUCCESS },
{ SIW_WC_LOC_LEN_ERR, IB_WC_LOC_LEN_ERR },
{ SIW_WC_LOC_PROT_ERR, IB_WC_LOC_PROT_ERR },
{ SIW_WC_LOC_QP_OP_ERR, IB_WC_LOC_QP_OP_ERR },
{ SIW_WC_WR_FLUSH_ERR, IB_WC_WR_FLUSH_ERR },
{ SIW_WC_BAD_RESP_ERR, IB_WC_BAD_RESP_ERR },
{ SIW_WC_LOC_ACCESS_ERR, IB_WC_LOC_ACCESS_ERR },
{ SIW_WC_REM_ACCESS_ERR, IB_WC_REM_ACCESS_ERR },
{ SIW_WC_REM_INV_REQ_ERR, IB_WC_REM_INV_REQ_ERR },
{ SIW_WC_GENERAL_ERR, IB_WC_GENERAL_ERR }
};
/*
* Reap one CQE from the CQ. Only used by kernel clients
* during CQ normal operation. Might be called during CQ
* flush for user mapped CQE array as well.
*/
int siw_reap_cqe(struct siw_cq *cq, struct ib_wc *wc)
{
struct siw_cqe *cqe;
unsigned long flags;
spin_lock_irqsave(&cq->lock, flags);
cqe = &cq->queue[cq->cq_get % cq->num_cqe];
if (READ_ONCE(cqe->flags) & SIW_WQE_VALID) {
memset(wc, 0, sizeof(*wc));
wc->wr_id = cqe->id;
wc->status = map_cqe_status[cqe->status].ib;
wc->opcode = map_wc_opcode[cqe->opcode];
wc->byte_len = cqe->bytes;
/*
* During CQ flush, also user land CQE's may get
* reaped here, which do not hold a QP reference
* and do not qualify for memory extension verbs.
*/
if (likely(cq->kernel_verbs)) {
if (cqe->flags & SIW_WQE_REM_INVAL) {
wc->ex.invalidate_rkey = cqe->inval_stag;
wc->wc_flags = IB_WC_WITH_INVALIDATE;
}
wc->qp = cqe->base_qp;
siw_dbg_cq(cq, "idx %u, type %d, flags %2x, id 0x%p\n",
cq->cq_get % cq->num_cqe, cqe->opcode,
cqe->flags, (void *)cqe->id);
}
WRITE_ONCE(cqe->flags, 0);
cq->cq_get++;
spin_unlock_irqrestore(&cq->lock, flags);
return 1;
}
spin_unlock_irqrestore(&cq->lock, flags);
return 0;
}
/*
* siw_cq_flush()
*
* Flush all CQ elements.
*/
void siw_cq_flush(struct siw_cq *cq)
{
struct ib_wc wc;
while (siw_reap_cqe(cq, &wc))
;
}
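
The reap-and-flush pattern of `siw_reap_cqe()` and `siw_cq_flush()` can be sketched in user space as below. This is only an illustration under simplifying assumptions: there is no locking, no status or opcode mapping, and the types `cq`, `cqe`, `wc` and the constants `NUM_CQE` and `CQE_VALID` are hypothetical stand-ins for the driver's structures.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_CQE 4u     /* hypothetical queue depth */
#define CQE_VALID 0x1u /* hypothetical stand-in for SIW_WQE_VALID */

struct cqe {
	uint32_t flags;
	uint64_t id;
	uint32_t bytes;
};

struct wc {
	uint64_t wr_id;
	uint32_t byte_len;
};

struct cq {
	struct cqe queue[NUM_CQE];
	uint32_t cq_get; /* free-running consumer index */
};

/* Copy one valid completion into *wc and release its slot; 1 if reaped. */
static int reap_one(struct cq *cq, struct wc *wc)
{
	struct cqe *cqe = &cq->queue[cq->cq_get % NUM_CQE];

	if (!(cqe->flags & CQE_VALID))
		return 0;

	memset(wc, 0, sizeof(*wc));
	wc->wr_id = cqe->id;
	wc->byte_len = cqe->bytes;

	cqe->flags = 0; /* hand the slot back to the producer */
	cq->cq_get++;
	return 1;
}

/* Drain the queue by reaping until no valid entry is left. */
static unsigned int flush_all(struct cq *cq)
{
	struct wc wc;
	unsigned int n = 0;

	while (reap_one(cq, &wc))
		n++;
	return n;
}
```

Flushing is nothing more than reaping into a scratch completion until the consumer index reaches a slot without the valid bit.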

@ -0,0 +1,687 @@
// SPDX-License-Identifier: GPL-2.0 or BSD-3-Clause
/* Authors: Bernard Metzler <bmt@zurich.ibm.com> */
/* Copyright (c) 2008-2019, IBM Corporation */
#include <linux/init.h>
#include <linux/errno.h>
#include <linux/netdevice.h>
#include <linux/inetdevice.h>
#include <net/net_namespace.h>
#include <linux/rtnetlink.h>
#include <linux/if_arp.h>
#include <linux/list.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/module.h>
#include <linux/dma-mapping.h>
#include <rdma/ib_verbs.h>
#include <rdma/ib_user_verbs.h>
#include <rdma/rdma_netlink.h>
#include <linux/kthread.h>
#include "siw.h"
#include "siw_verbs.h"
MODULE_AUTHOR("Bernard Metzler");
MODULE_DESCRIPTION("Software iWARP Driver");
MODULE_LICENSE("Dual BSD/GPL");
/* transmit from user buffer, if possible */
const bool zcopy_tx = true;
/* Restrict usage of GSO, if hardware peer iwarp is unable to process
* large packets. try_gso = true lets siw try to use local GSO,
* if peer agrees. Not using GSO severely limits siw maximum tx bandwidth.
*/
const bool try_gso;
/* Attach siw also with loopback devices */
const bool loopback_enabled = true;
/* We try to negotiate CRC on, if true */
const bool mpa_crc_required;
/* MPA CRC on/off enforced */
const bool mpa_crc_strict;
/* Control TCP_NODELAY socket option */
const bool siw_tcp_nagle;
/* Select MPA version to be used during connection setup */
u_char mpa_version = MPA_REVISION_2;
/* Selects MPA P2P mode (additional handshake during connection
* setup), if true.
*/
const bool peer_to_peer;
struct task_struct *siw_tx_thread[NR_CPUS];
struct crypto_shash *siw_crypto_shash;
static int siw_device_register(struct siw_device *sdev, const char *name)
{
struct ib_device *base_dev = &sdev->base_dev;
static int dev_id = 1;
int rv;
rv = ib_register_device(base_dev, name);
if (rv) {
pr_warn("siw: device registration error %d\n", rv);
return rv;
}
sdev->vendor_part_id = dev_id++;
siw_dbg(base_dev, "HWaddr=%pM\n", sdev->netdev->dev_addr);
return 0;
}
static void siw_device_cleanup(struct ib_device *base_dev)
{
struct siw_device *sdev = to_siw_dev(base_dev);
xa_destroy(&sdev->qp_xa);
xa_destroy(&sdev->mem_xa);
}
static int siw_create_tx_threads(void)
{
int cpu, rv, assigned = 0;
for_each_online_cpu(cpu) {
/* Skip HT cores */
if (cpu % cpumask_weight(topology_sibling_cpumask(cpu)))
continue;
siw_tx_thread[cpu] =
kthread_create(siw_run_sq, (unsigned long *)(long)cpu,
"siw_tx/%d", cpu);
if (IS_ERR(siw_tx_thread[cpu])) {
rv = PTR_ERR(siw_tx_thread[cpu]);
siw_tx_thread[cpu] = NULL;
pr_info("Creating TX thread for CPU %d failed\n", cpu);
continue;
}
kthread_bind(siw_tx_thread[cpu], cpu);
wake_up_process(siw_tx_thread[cpu]);
assigned++;
}
return assigned;
}
static int siw_dev_qualified(struct net_device *netdev)
{
/*
* Additional hardware support can be added here
* (e.g. ARPHRD_FDDI, ARPHRD_ATM, ...) - see
* <linux/if_arp.h> for type identifiers.
*/
if (netdev->type == ARPHRD_ETHER || netdev->type == ARPHRD_IEEE802 ||
(netdev->type == ARPHRD_LOOPBACK && loopback_enabled))
return 1;
return 0;
}
static DEFINE_PER_CPU(atomic_t, use_cnt = ATOMIC_INIT(0));
static struct {
struct cpumask **tx_valid_cpus;
int num_nodes;
} siw_cpu_info;
static int siw_init_cpulist(void)
{
int i, num_nodes = num_possible_nodes();
memset(siw_tx_thread, 0, sizeof(siw_tx_thread));
siw_cpu_info.num_nodes = num_nodes;
siw_cpu_info.tx_valid_cpus =
kcalloc(num_nodes, sizeof(struct cpumask *), GFP_KERNEL);
if (!siw_cpu_info.tx_valid_cpus) {
siw_cpu_info.num_nodes = 0;
return -ENOMEM;
}
for (i = 0; i < siw_cpu_info.num_nodes; i++) {
siw_cpu_info.tx_valid_cpus[i] =
kzalloc(sizeof(struct cpumask), GFP_KERNEL);
if (!siw_cpu_info.tx_valid_cpus[i])
goto out_err;
cpumask_clear(siw_cpu_info.tx_valid_cpus[i]);
}
for_each_possible_cpu(i)
cpumask_set_cpu(i, siw_cpu_info.tx_valid_cpus[cpu_to_node(i)]);
return 0;
out_err:
siw_cpu_info.num_nodes = 0;
/* Free entries 0..i-1, which were allocated before the failure */
while (--i >= 0) {
kfree(siw_cpu_info.tx_valid_cpus[i]);
siw_cpu_info.tx_valid_cpus[i] = NULL;
}
kfree(siw_cpu_info.tx_valid_cpus);
siw_cpu_info.tx_valid_cpus = NULL;
return -ENOMEM;
}
static void siw_destroy_cpulist(void)
{
int i = 0;
while (i < siw_cpu_info.num_nodes)
kfree(siw_cpu_info.tx_valid_cpus[i++]);
kfree(siw_cpu_info.tx_valid_cpus);
}
/*
* Choose CPU with least number of active QP's from NUMA node of
* TX interface.
*/
int siw_get_tx_cpu(struct siw_device *sdev)
{
const struct cpumask *tx_cpumask;
int i, num_cpus, cpu, min_use, node = sdev->numa_node, tx_cpu = -1;
if (node < 0)
tx_cpumask = cpu_online_mask;
else
tx_cpumask = siw_cpu_info.tx_valid_cpus[node];
num_cpus = cpumask_weight(tx_cpumask);
if (!num_cpus) {
/* no CPU on this NUMA node */
tx_cpumask = cpu_online_mask;
num_cpus = cpumask_weight(tx_cpumask);
}
if (!num_cpus)
goto out;
cpu = cpumask_first(tx_cpumask);
for (i = 0, min_use = SIW_MAX_QP; i < num_cpus;
i++, cpu = cpumask_next(cpu, tx_cpumask)) {
int usage;
/* Skip any cores which have no TX thread */
if (!siw_tx_thread[cpu])
continue;
usage = atomic_read(&per_cpu(use_cnt, cpu));
if (usage <= min_use) {
tx_cpu = cpu;
min_use = usage;
}
}
siw_dbg(&sdev->base_dev,
"tx cpu %d, node %d, %d qp's\n", tx_cpu, node, min_use);
out:
if (tx_cpu >= 0)
atomic_inc(&per_cpu(use_cnt, tx_cpu));
else
pr_warn("siw: no tx cpu found\n");
return tx_cpu;
}
void siw_put_tx_cpu(int cpu)
{
atomic_dec(&per_cpu(use_cnt, cpu));
}
static struct ib_qp *siw_get_base_qp(struct ib_device *base_dev, int id)
{
struct siw_qp *qp = siw_qp_id2obj(to_siw_dev(base_dev), id);
if (qp) {
/*
* siw_qp_id2obj() increments object reference count
*/
siw_qp_put(qp);
return qp->ib_qp;
}
return NULL;
}
static void siw_verbs_sq_flush(struct ib_qp *base_qp)
{
struct siw_qp *qp = to_siw_qp(base_qp);
down_write(&qp->state_lock);
siw_sq_flush(qp);
up_write(&qp->state_lock);
}
static void siw_verbs_rq_flush(struct ib_qp *base_qp)
{
struct siw_qp *qp = to_siw_qp(base_qp);
down_write(&qp->state_lock);
siw_rq_flush(qp);
up_write(&qp->state_lock);
}
static const struct ib_device_ops siw_device_ops = {
.owner = THIS_MODULE,
.uverbs_abi_ver = SIW_ABI_VERSION,
.driver_id = RDMA_DRIVER_SIW,
.alloc_mr = siw_alloc_mr,
.alloc_pd = siw_alloc_pd,
.alloc_ucontext = siw_alloc_ucontext,
.create_cq = siw_create_cq,
.create_qp = siw_create_qp,
.create_srq = siw_create_srq,
.dealloc_driver = siw_device_cleanup,
.dealloc_pd = siw_dealloc_pd,
.dealloc_ucontext = siw_dealloc_ucontext,
.dereg_mr = siw_dereg_mr,
.destroy_cq = siw_destroy_cq,
.destroy_qp = siw_destroy_qp,
.destroy_srq = siw_destroy_srq,
.drain_rq = siw_verbs_rq_flush,
.drain_sq = siw_verbs_sq_flush,
.get_dma_mr = siw_get_dma_mr,
.get_port_immutable = siw_get_port_immutable,
.iw_accept = siw_accept,
.iw_add_ref = siw_qp_get_ref,
.iw_connect = siw_connect,
.iw_create_listen = siw_create_listen,
.iw_destroy_listen = siw_destroy_listen,
.iw_get_qp = siw_get_base_qp,
.iw_reject = siw_reject,
.iw_rem_ref = siw_qp_put_ref,
.map_mr_sg = siw_map_mr_sg,
.mmap = siw_mmap,
.modify_qp = siw_verbs_modify_qp,
.modify_srq = siw_modify_srq,
.poll_cq = siw_poll_cq,
.post_recv = siw_post_receive,
.post_send = siw_post_send,
.post_srq_recv = siw_post_srq_recv,
.query_device = siw_query_device,
.query_gid = siw_query_gid,
.query_pkey = siw_query_pkey,
.query_port = siw_query_port,
.query_qp = siw_query_qp,
.query_srq = siw_query_srq,
.req_notify_cq = siw_req_notify_cq,
.reg_user_mr = siw_reg_user_mr,
INIT_RDMA_OBJ_SIZE(ib_cq, siw_cq, base_cq),
INIT_RDMA_OBJ_SIZE(ib_pd, siw_pd, base_pd),
INIT_RDMA_OBJ_SIZE(ib_srq, siw_srq, base_srq),
INIT_RDMA_OBJ_SIZE(ib_ucontext, siw_ucontext, base_ucontext),
};
static struct siw_device *siw_device_create(struct net_device *netdev)
{
struct siw_device *sdev = NULL;
struct ib_device *base_dev;
struct device *parent = netdev->dev.parent;
int rv;
if (!parent) {
/*
* The loopback device has no parent device,
* so it appears as a top-level device. To support
* loopback device connectivity, take this device
* as the parent device. Skip all other devices
* w/o parent device.
*/
if (netdev->type != ARPHRD_LOOPBACK) {
pr_warn("siw: device %s error: no parent device\n",
netdev->name);
return NULL;
}
parent = &netdev->dev;
}
sdev = ib_alloc_device(siw_device, base_dev);
if (!sdev)
return NULL;
base_dev = &sdev->base_dev;
sdev->netdev = netdev;
if (netdev->type != ARPHRD_LOOPBACK) {
memcpy(&base_dev->node_guid, netdev->dev_addr, 6);
} else {
/*
* The loopback device does not have a HW address,
* but connection management lib expects gid != 0
*/
size_t gidlen = min_t(size_t, strlen(base_dev->name), 6);
memcpy(&base_dev->node_guid, base_dev->name, gidlen);
}
base_dev->uverbs_cmd_mask =
(1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) |
(1ull << IB_USER_VERBS_CMD_QUERY_PORT) |
(1ull << IB_USER_VERBS_CMD_GET_CONTEXT) |
(1ull << IB_USER_VERBS_CMD_ALLOC_PD) |
(1ull << IB_USER_VERBS_CMD_DEALLOC_PD) |
(1ull << IB_USER_VERBS_CMD_REG_MR) |
(1ull << IB_USER_VERBS_CMD_DEREG_MR) |
(1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) |
(1ull << IB_USER_VERBS_CMD_CREATE_CQ) |
(1ull << IB_USER_VERBS_CMD_POLL_CQ) |
(1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) |
(1ull << IB_USER_VERBS_CMD_DESTROY_CQ) |
(1ull << IB_USER_VERBS_CMD_CREATE_QP) |
(1ull << IB_USER_VERBS_CMD_QUERY_QP) |
(1ull << IB_USER_VERBS_CMD_MODIFY_QP) |
(1ull << IB_USER_VERBS_CMD_DESTROY_QP) |
(1ull << IB_USER_VERBS_CMD_POST_SEND) |
(1ull << IB_USER_VERBS_CMD_POST_RECV) |
(1ull << IB_USER_VERBS_CMD_CREATE_SRQ) |
(1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV) |
(1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) |
(1ull << IB_USER_VERBS_CMD_QUERY_SRQ) |
(1ull << IB_USER_VERBS_CMD_DESTROY_SRQ);
base_dev->node_type = RDMA_NODE_RNIC;
memcpy(base_dev->node_desc, SIW_NODE_DESC_COMMON,
sizeof(SIW_NODE_DESC_COMMON));
/*
* Current model (one-to-one device association):
* One Softiwarp device per net_device or, equivalently,
* per physical port.
*/
base_dev->phys_port_cnt = 1;
base_dev->dev.parent = parent;
base_dev->dev.dma_ops = &dma_virt_ops;
base_dev->num_comp_vectors = num_possible_cpus();
ib_set_device_ops(base_dev, &siw_device_ops);
rv = ib_device_set_netdev(base_dev, netdev, 1);
if (rv)
goto error;
memcpy(base_dev->iw_ifname, netdev->name,
sizeof(base_dev->iw_ifname));
/* Disable TCP port mapping */
base_dev->iw_driver_flags = IW_F_NO_PORT_MAP;
sdev->attrs.max_qp = SIW_MAX_QP;
sdev->attrs.max_qp_wr = SIW_MAX_QP_WR;
sdev->attrs.max_ord = SIW_MAX_ORD_QP;
sdev->attrs.max_ird = SIW_MAX_IRD_QP;
sdev->attrs.max_sge = SIW_MAX_SGE;
sdev->attrs.max_sge_rd = SIW_MAX_SGE_RD;
sdev->attrs.max_cq = SIW_MAX_CQ;
sdev->attrs.max_cqe = SIW_MAX_CQE;
sdev->attrs.max_mr = SIW_MAX_MR;
sdev->attrs.max_pd = SIW_MAX_PD;
sdev->attrs.max_mw = SIW_MAX_MW;
sdev->attrs.max_fmr = SIW_MAX_FMR;
sdev->attrs.max_srq = SIW_MAX_SRQ;
sdev->attrs.max_srq_wr = SIW_MAX_SRQ_WR;
sdev->attrs.max_srq_sge = SIW_MAX_SGE;
xa_init_flags(&sdev->qp_xa, XA_FLAGS_ALLOC1);
xa_init_flags(&sdev->mem_xa, XA_FLAGS_ALLOC1);
INIT_LIST_HEAD(&sdev->cep_list);
INIT_LIST_HEAD(&sdev->qp_list);
atomic_set(&sdev->num_ctx, 0);
atomic_set(&sdev->num_srq, 0);
atomic_set(&sdev->num_qp, 0);
atomic_set(&sdev->num_cq, 0);
atomic_set(&sdev->num_mr, 0);
atomic_set(&sdev->num_pd, 0);
sdev->numa_node = dev_to_node(parent);
spin_lock_init(&sdev->lock);
return sdev;
error:
ib_dealloc_device(base_dev);
return NULL;
}
/*
* Network link becomes unavailable. Mark all
* affected QP's accordingly.
*/
static void siw_netdev_down(struct work_struct *work)
{
struct siw_device *sdev =
container_of(work, struct siw_device, netdev_down);
struct siw_qp_attrs qp_attrs;
struct list_head *pos, *tmp;
memset(&qp_attrs, 0, sizeof(qp_attrs));
qp_attrs.state = SIW_QP_STATE_ERROR;
list_for_each_safe(pos, tmp, &sdev->qp_list) {
struct siw_qp *qp = list_entry(pos, struct siw_qp, devq);
down_write(&qp->state_lock);
WARN_ON(siw_qp_modify(qp, &qp_attrs, SIW_QP_ATTR_STATE));
up_write(&qp->state_lock);
}
ib_device_put(&sdev->base_dev);
}
static void siw_device_goes_down(struct siw_device *sdev)
{
if (ib_device_try_get(&sdev->base_dev)) {
INIT_WORK(&sdev->netdev_down, siw_netdev_down);
schedule_work(&sdev->netdev_down);
}
}
static int siw_netdev_event(struct notifier_block *nb, unsigned long event,
void *arg)
{
struct net_device *netdev = netdev_notifier_info_to_dev(arg);
struct ib_device *base_dev;
struct siw_device *sdev;
dev_dbg(&netdev->dev, "siw: event %lu\n", event);
if (dev_net(netdev) != &init_net)
return NOTIFY_OK;
base_dev = ib_device_get_by_netdev(netdev, RDMA_DRIVER_SIW);
if (!base_dev)
return NOTIFY_OK;
sdev = to_siw_dev(base_dev);
switch (event) {
case NETDEV_UP:
sdev->state = IB_PORT_ACTIVE;
siw_port_event(sdev, 1, IB_EVENT_PORT_ACTIVE);
break;
case NETDEV_GOING_DOWN:
siw_device_goes_down(sdev);
break;
case NETDEV_DOWN:
sdev->state = IB_PORT_DOWN;
siw_port_event(sdev, 1, IB_EVENT_PORT_ERR);
break;
case NETDEV_REGISTER:
/*
* Device registration now handled only by
* rdma netlink commands. So it shall be impossible
* to end up here with a valid siw device.
*/
siw_dbg(base_dev, "unexpected NETDEV_REGISTER event\n");
break;
case NETDEV_UNREGISTER:
ib_unregister_device_queued(&sdev->base_dev);
break;
case NETDEV_CHANGEADDR:
siw_port_event(sdev, 1, IB_EVENT_LID_CHANGE);
break;
/*
* Todo: Below netdev events are currently not handled.
*/
case NETDEV_CHANGEMTU:
case NETDEV_CHANGE:
break;
default:
break;
}
ib_device_put(&sdev->base_dev);
return NOTIFY_OK;
}
static struct notifier_block siw_netdev_nb = {
.notifier_call = siw_netdev_event,
};
static int siw_newlink(const char *basedev_name, struct net_device *netdev)
{
struct ib_device *base_dev;
struct siw_device *sdev = NULL;
int rv = -ENOMEM;
if (!siw_dev_qualified(netdev))
return -EINVAL;
base_dev = ib_device_get_by_netdev(netdev, RDMA_DRIVER_SIW);
if (base_dev) {
ib_device_put(base_dev);
return -EEXIST;
}
sdev = siw_device_create(netdev);
if (sdev) {
dev_dbg(&netdev->dev, "siw: new device\n");
if (netif_running(netdev) && netif_carrier_ok(netdev))
sdev->state = IB_PORT_ACTIVE;
else
sdev->state = IB_PORT_DOWN;
rv = siw_device_register(sdev, basedev_name);
if (rv)
ib_dealloc_device(&sdev->base_dev);
}
return rv;
}
static struct rdma_link_ops siw_link_ops = {
.type = "siw",
.newlink = siw_newlink,
};
/*
* siw_init_module - Initialize Softiwarp module and register with netdev
* subsystem.
*/
static __init int siw_init_module(void)
{
int rv;
int nr_cpu;
if (SENDPAGE_THRESH < SIW_MAX_INLINE) {
pr_info("siw: sendpage threshold too small: %u\n",
(int)SENDPAGE_THRESH);
rv = -EINVAL;
goto out_error;
}
rv = siw_init_cpulist();
if (rv)
goto out_error;
rv = siw_cm_init();
if (rv)
goto out_error;
if (!siw_create_tx_threads()) {
pr_info("siw: Could not start any TX thread\n");
rv = -ENOMEM;
goto out_error;
}
/*
* Locate CRC32c algorithm. If unsuccessful, fail
* loading siw only if CRC is required.
*/
siw_crypto_shash = crypto_alloc_shash("crc32c", 0, 0);
if (IS_ERR(siw_crypto_shash)) {
pr_info("siw: Loading CRC32c failed: %ld\n",
PTR_ERR(siw_crypto_shash));
siw_crypto_shash = NULL;
if (mpa_crc_required) {
rv = -EOPNOTSUPP;
goto out_error;
}
}
rv = register_netdevice_notifier(&siw_netdev_nb);
if (rv)
goto out_error;
rdma_link_register(&siw_link_ops);
pr_info("SoftiWARP attached\n");
return 0;
out_error:
for (nr_cpu = 0; nr_cpu < nr_cpu_ids; nr_cpu++) {
if (siw_tx_thread[nr_cpu]) {
siw_stop_tx_thread(nr_cpu);
siw_tx_thread[nr_cpu] = NULL;
}
}
if (siw_crypto_shash)
crypto_free_shash(siw_crypto_shash);
pr_info("SoftiWARP attach failed. Error: %d\n", rv);
siw_cm_exit();
siw_destroy_cpulist();
return rv;
}
static void __exit siw_exit_module(void)
{
int cpu;
for_each_possible_cpu(cpu) {
if (siw_tx_thread[cpu]) {
siw_stop_tx_thread(cpu);
siw_tx_thread[cpu] = NULL;
}
}
unregister_netdevice_notifier(&siw_netdev_nb);
rdma_link_unregister(&siw_link_ops);
ib_unregister_driver(RDMA_DRIVER_SIW);
siw_cm_exit();
siw_destroy_cpulist();
if (siw_crypto_shash)
crypto_free_shash(siw_crypto_shash);
pr_info("SoftiWARP detached\n");
}
module_init(siw_init_module);
module_exit(siw_exit_module);
MODULE_ALIAS_RDMA_LINK("siw");
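
The selection rule in `siw_get_tx_cpu()` (skip CPUs without a TX thread, pick the least-used remaining CPU, then bump its use count) can be illustrated outside the kernel as follows. This is a sketch under assumptions: `NCPU`, `pick_tx_cpu()` and the plain arrays are hypothetical replacements for the cpumask, the per-CPU atomic counters and the NUMA node lookup of the driver.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NCPU 4 /* hypothetical number of candidate CPUs */

/*
 * Return the candidate CPU with the fewest users, or -1 if no CPU
 * has a TX thread. The chosen CPU's use count is incremented.
 */
static int pick_tx_cpu(const bool has_thread[NCPU], int use_cnt[NCPU])
{
	int cpu, best = -1, min_use = INT_MAX;

	for (cpu = 0; cpu < NCPU; cpu++) {
		/* Skip any CPU that has no TX thread */
		if (!has_thread[cpu])
			continue;
		/* "<=" means a tie goes to the highest-numbered CPU */
		if (use_cnt[cpu] <= min_use) {
			best = cpu;
			min_use = use_cnt[cpu];
		}
	}
	if (best >= 0)
		use_cnt[best]++;
	return best;
}
```

Because the comparison is `<=`, as in the driver, equally loaded CPUs resolve to the last one scanned. A caller would pair each successful pick with a later decrement, which is the role `siw_put_tx_cpu()` plays above.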

@ -0,0 +1,460 @@
// SPDX-License-Identifier: GPL-2.0 or BSD-3-Clause
/* Authors: Bernard Metzler <bmt@zurich.ibm.com> */
/* Copyright (c) 2008-2019, IBM Corporation */
#include <linux/gfp.h>
#include <rdma/ib_verbs.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>
#include <linux/sched/mm.h>
#include <linux/resource.h>
#include "siw.h"
#include "siw_mem.h"
/*
* STag lookup is based on its index part only (24 bits).
* The code avoids the special STag of zero and tries to randomize
* STag values between 1 and SIW_STAG_MAX_INDEX.
*/
int siw_mem_add(struct siw_device *sdev, struct siw_mem *m)
{
struct xa_limit limit = XA_LIMIT(1, 0x00ffffff);
u32 id, next;
get_random_bytes(&next, 4);
next &= 0x00ffffff;
if (xa_alloc_cyclic(&sdev->mem_xa, &id, m, limit, &next,
GFP_KERNEL) < 0)
return -ENOMEM;
/* Set the STag index part */
m->stag = id << 8;
siw_dbg_mem(m, "new MEM object\n");
return 0;
}
/*
* siw_mem_id2obj()
*
* Resolves memory from STag given by id. Might be called from:
* o process context before sending out of sgl, or
* o in softirq when resolving target memory
*/
struct siw_mem *siw_mem_id2obj(struct siw_device *sdev, int stag_index)
{
struct siw_mem *mem;
rcu_read_lock();
mem = xa_load(&sdev->mem_xa, stag_index);
if (likely(mem && kref_get_unless_zero(&mem->ref))) {
rcu_read_unlock();
return mem;
}
rcu_read_unlock();
return NULL;
}
static void siw_free_plist(struct siw_page_chunk *chunk, int num_pages,
bool dirty)
{
struct page **p = chunk->plist;
while (num_pages--) {
if (!PageDirty(*p) && dirty)
put_user_pages_dirty_lock(p, 1);
else
put_user_page(*p);
p++;
}
}
void siw_umem_release(struct siw_umem *umem, bool dirty)
{
struct mm_struct *mm_s = umem->owning_mm;
int i, num_pages = umem->num_pages;
for (i = 0; num_pages; i++) {
int to_free = min_t(int, PAGES_PER_CHUNK, num_pages);
siw_free_plist(&umem->page_chunk[i], to_free,
umem->writable && dirty);
kfree(umem->page_chunk[i].plist);
num_pages -= to_free;
}
atomic64_sub(umem->num_pages, &mm_s->pinned_vm);
mmdrop(mm_s);
kfree(umem->page_chunk);
kfree(umem);
}
int siw_mr_add_mem(struct siw_mr *mr, struct ib_pd *pd, void *mem_obj,
u64 start, u64 len, int rights)
{
struct siw_device *sdev = to_siw_dev(pd->device);
struct siw_mem *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
struct xa_limit limit = XA_LIMIT(1, 0x00ffffff);
u32 id, next;
if (!mem)
return -ENOMEM;
mem->mem_obj = mem_obj;
mem->stag_valid = 0;
mem->sdev = sdev;
mem->va = start;
mem->len = len;
mem->pd = pd;
mem->perms = rights & IWARP_ACCESS_MASK;
kref_init(&mem->ref);
mr->mem = mem;
get_random_bytes(&next, 4);
next &= 0x00ffffff;
if (xa_alloc_cyclic(&sdev->mem_xa, &id, mem, limit, &next,
GFP_KERNEL) < 0) {
kfree(mem);
return -ENOMEM;
}
/* Set the STag index part */
mem->stag = id << 8;
mr->base_mr.lkey = mr->base_mr.rkey = mem->stag;
return 0;
}
void siw_mr_drop_mem(struct siw_mr *mr)
{
struct siw_mem *mem = mr->mem, *found;
mem->stag_valid = 0;
/* make STag invalid visible asap */
smp_mb();
found = xa_erase(&mem->sdev->mem_xa, mem->stag >> 8);
WARN_ON(found != mem);
siw_mem_put(mem);
}
void siw_free_mem(struct kref *ref)
{
struct siw_mem *mem = container_of(ref, struct siw_mem, ref);
siw_dbg_mem(mem, "free mem, pbl: %s\n", mem->is_pbl ? "y" : "n");
if (!mem->is_mw && mem->mem_obj) {
if (mem->is_pbl == 0)
siw_umem_release(mem->umem, true);
else
kfree(mem->pbl);
}
kfree(mem);
}
/*
* siw_check_mem()
*
* Check protection domain, STAG state, access permissions and
* address range for memory object.
*
* @pd: Protection Domain memory should belong to
* @mem: memory to be checked
* @addr: starting addr of mem
* @perms: requested access permissions
* @len: len of memory interval to be checked
*
*/
int siw_check_mem(struct ib_pd *pd, struct siw_mem *mem, u64 addr,
enum ib_access_flags perms, int len)
{
if (!mem->stag_valid) {
siw_dbg_pd(pd, "STag 0x%08x invalid\n", mem->stag);
return -E_STAG_INVALID;
}
if (mem->pd != pd) {
siw_dbg_pd(pd, "STag 0x%08x: PD mismatch\n", mem->stag);
return -E_PD_MISMATCH;
}
/*
* check access permissions
*/
if ((mem->perms & perms) < perms) {
siw_dbg_pd(pd, "permissions 0x%08x < 0x%08x\n",
mem->perms, perms);
return -E_ACCESS_PERM;
}
/*
* Check if access falls into valid memory interval.
*/
if (addr < mem->va || addr + len > mem->va + mem->len) {
siw_dbg_pd(pd, "MEM interval len %d\n", len);
siw_dbg_pd(pd, "[0x%016llx, 0x%016llx] out of bounds\n",
(unsigned long long)addr,
(unsigned long long)(addr + len));
siw_dbg_pd(pd, "[0x%016llx, 0x%016llx] STag=0x%08x\n",
(unsigned long long)mem->va,
(unsigned long long)(mem->va + mem->len),
mem->stag);
return -E_BASE_BOUNDS;
}
return E_ACCESS_OK;
}
/*
* siw_check_sge()
*
* Check SGE for access rights in given interval
*
* @pd: Protection Domain memory should belong to
* @sge: SGE to be checked
* @mem: location of memory reference within array
* @perms: requested access permissions
* @off: starting offset in SGE
* @len: len of memory interval to be checked
*
* NOTE: Function references SGE's memory object (mem->obj)
* if not yet done. New reference is kept if check went ok and
* released if check failed. If mem->obj is already valid, no new
* lookup is being done and mem is not released if check fails.
*/
int siw_check_sge(struct ib_pd *pd, struct siw_sge *sge, struct siw_mem *mem[],
enum ib_access_flags perms, u32 off, int len)
{
struct siw_device *sdev = to_siw_dev(pd->device);
struct siw_mem *new = NULL;
int rv = E_ACCESS_OK;
if (len + off > sge->length) {
rv = -E_BASE_BOUNDS;
goto fail;
}
if (*mem == NULL) {
new = siw_mem_id2obj(sdev, sge->lkey >> 8);
if (unlikely(!new)) {
siw_dbg_pd(pd, "STag unknown: 0x%08x\n", sge->lkey);
rv = -E_STAG_INVALID;
goto fail;
}
*mem = new;
}
/* Check if user re-registered with different STag key */
if (unlikely((*mem)->stag != sge->lkey)) {
siw_dbg_mem((*mem), "STag mismatch: 0x%08x\n", sge->lkey);
rv = -E_STAG_INVALID;
goto fail;
}
rv = siw_check_mem(pd, *mem, sge->laddr + off, perms, len);
if (unlikely(rv))
goto fail;
return 0;
fail:
if (new) {
*mem = NULL;
siw_mem_put(new);
}
return rv;
}
void siw_wqe_put_mem(struct siw_wqe *wqe, enum siw_opcode op)
{
switch (op) {
case SIW_OP_SEND:
case SIW_OP_WRITE:
case SIW_OP_SEND_WITH_IMM:
case SIW_OP_SEND_REMOTE_INV:
case SIW_OP_READ:
case SIW_OP_READ_LOCAL_INV:
if (!(wqe->sqe.flags & SIW_WQE_INLINE))
siw_unref_mem_sgl(wqe->mem, wqe->sqe.num_sge);
break;
case SIW_OP_RECEIVE:
siw_unref_mem_sgl(wqe->mem, wqe->rqe.num_sge);
break;
case SIW_OP_READ_RESPONSE:
siw_unref_mem_sgl(wqe->mem, 1);
break;
default:
/*
* SIW_OP_INVAL_STAG and SIW_OP_REG_MR
* do not hold memory references
*/
break;
}
}
int siw_invalidate_stag(struct ib_pd *pd, u32 stag)
{
struct siw_device *sdev = to_siw_dev(pd->device);
struct siw_mem *mem = siw_mem_id2obj(sdev, stag >> 8);
int rv = 0;
if (unlikely(!mem)) {
siw_dbg_pd(pd, "STag 0x%08x unknown\n", stag);
return -EINVAL;
}
if (unlikely(mem->pd != pd)) {
siw_dbg_pd(pd, "PD mismatch for STag 0x%08x\n", stag);
rv = -EACCES;
goto out;
}
/*
* Per RDMA verbs definition, an STag may already be in invalid
* state if invalidation is requested. So no state check here.
*/
mem->stag_valid = 0;
siw_dbg_pd(pd, "STag 0x%08x now invalid\n", stag);
out:
siw_mem_put(mem);
return rv;
}
/*
* Gets physical address backed by PBL element. Address is referenced
* by linear byte offset into list of variably sized PB elements.
* Optionally, provides remaining len within current element, and
* current PBL index for later resume at same element.
*/
u64 siw_pbl_get_buffer(struct siw_pbl *pbl, u64 off, int *len, int *idx)
{
int i = idx ? *idx : 0;
while (i < pbl->num_buf) {
struct siw_pble *pble = &pbl->pbe[i];
if (pble->pbl_off + pble->size > off) {
u64 pble_off = off - pble->pbl_off;
if (len)
*len = pble->size - pble_off;
if (idx)
*idx = i;
return pble->addr + pble_off;
}
i++;
}
if (len)
*len = 0;
return 0;
}
struct siw_pbl *siw_pbl_alloc(u32 num_buf)
{
struct siw_pbl *pbl;
int buf_size = sizeof(*pbl);
if (num_buf == 0)
return ERR_PTR(-EINVAL);
buf_size += ((num_buf - 1) * sizeof(struct siw_pble));
pbl = kzalloc(buf_size, GFP_KERNEL);
if (!pbl)
return ERR_PTR(-ENOMEM);
pbl->max_buf = num_buf;
return pbl;
}
struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
{
struct siw_umem *umem;
struct mm_struct *mm_s;
u64 first_page_va;
unsigned long mlock_limit;
unsigned int foll_flags = FOLL_WRITE;
int num_pages, num_chunks, i, rv = 0;
if (!can_do_mlock())
return ERR_PTR(-EPERM);
if (!len)
return ERR_PTR(-EINVAL);
first_page_va = start & PAGE_MASK;
num_pages = PAGE_ALIGN(start + len - first_page_va) >> PAGE_SHIFT;
num_chunks = (num_pages >> CHUNK_SHIFT) + 1;
umem = kzalloc(sizeof(*umem), GFP_KERNEL);
if (!umem)
return ERR_PTR(-ENOMEM);
mm_s = current->mm;
umem->owning_mm = mm_s;
umem->writable = writable;
mmgrab(mm_s);
if (!writable)
foll_flags |= FOLL_FORCE;
down_read(&mm_s->mmap_sem);
mlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
if (num_pages + atomic64_read(&mm_s->pinned_vm) > mlock_limit) {
rv = -ENOMEM;
goto out_sem_up;
}
umem->fp_addr = first_page_va;
umem->page_chunk =
kcalloc(num_chunks, sizeof(struct siw_page_chunk), GFP_KERNEL);
if (!umem->page_chunk) {
rv = -ENOMEM;
goto out_sem_up;
}
for (i = 0; num_pages; i++) {
int got, nents = min_t(int, num_pages, PAGES_PER_CHUNK);
umem->page_chunk[i].plist =
kcalloc(nents, sizeof(struct page *), GFP_KERNEL);
if (!umem->page_chunk[i].plist) {
rv = -ENOMEM;
goto out_sem_up;
}
got = 0;
while (nents) {
struct page **plist = &umem->page_chunk[i].plist[got];
rv = get_user_pages(first_page_va, nents,
foll_flags | FOLL_LONGTERM,
plist, NULL);
if (rv < 0)
goto out_sem_up;
umem->num_pages += rv;
atomic64_add(rv, &mm_s->pinned_vm);
first_page_va += rv * PAGE_SIZE;
nents -= rv;
got += rv;
}
num_pages -= got;
}
out_sem_up:
up_read(&mm_s->mmap_sem);
if (rv > 0)
return umem;
siw_umem_release(umem, false);
return ERR_PTR(rv);
}
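
Two small pieces of arithmetic recur throughout this file: an STag carries a 24-bit index in its upper bits (`id << 8` on creation, `stag >> 8` on lookup), and `siw_check_mem()` tests whether an access interval lies inside the registered region. The sketch below restates both in user space. The helper names are hypothetical, and the interval test is a variation rather than the driver's expression: it is rearranged so that `addr + len` cannot wrap around.

```c
#include <assert.h>
#include <stdint.h>

#define STAG_INDEX_MAX 0x00ffffffu /* 24-bit index, matching XA_LIMIT(1, 0x00ffffff) */

/* Build an STag from its index part; the low 8 bits (key) stay zero. */
static uint32_t stag_from_index(uint32_t index)
{
	assert(index >= 1 && index <= STAG_INDEX_MAX);
	return index << 8;
}

/* Recover the lookup index from an STag, ignoring the key byte. */
static uint32_t stag_to_index(uint32_t stag)
{
	return stag >> 8;
}

/*
 * Return 1 if [addr, addr + len) lies within [va, va + mem_len).
 * Written with subtractions only, so the sum addr + len is never formed.
 */
static int in_bounds(uint64_t va, uint64_t mem_len, uint64_t addr,
		     uint64_t len)
{
	if (addr < va || len > mem_len)
		return 0;
	return (addr - va) <= (mem_len - len);
}
```

For example, index 1 yields STag `0x00000100`, and an STag of `0x12345678` resolves to index `0x123456` regardless of its key byte `0x78`.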

@ -0,0 +1,74 @@
/* SPDX-License-Identifier: GPL-2.0 or BSD-3-Clause */
/* Authors: Bernard Metzler <bmt@zurich.ibm.com> */
/* Copyright (c) 2008-2019, IBM Corporation */
#ifndef _SIW_MEM_H
#define _SIW_MEM_H
struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable);
void siw_umem_release(struct siw_umem *umem, bool dirty);
struct siw_pbl *siw_pbl_alloc(u32 num_buf);
u64 siw_pbl_get_buffer(struct siw_pbl *pbl, u64 off, int *len, int *idx);
struct siw_mem *siw_mem_id2obj(struct siw_device *sdev, int stag_index);
int siw_mem_add(struct siw_device *sdev, struct siw_mem *m);
int siw_invalidate_stag(struct ib_pd *pd, u32 stag);
int siw_check_mem(struct ib_pd *pd, struct siw_mem *mem, u64 addr,
enum ib_access_flags perms, int len);
int siw_check_sge(struct ib_pd *pd, struct siw_sge *sge,
struct siw_mem *mem[], enum ib_access_flags perms,
u32 off, int len);
void siw_wqe_put_mem(struct siw_wqe *wqe, enum siw_opcode op);
int siw_mr_add_mem(struct siw_mr *mr, struct ib_pd *pd, void *mem_obj,
u64 start, u64 len, int rights);
void siw_mr_drop_mem(struct siw_mr *mr);
void siw_free_mem(struct kref *ref);

static inline void siw_mem_put(struct siw_mem *mem)
{
	kref_put(&mem->ref, siw_free_mem);
}

static inline struct siw_mr *siw_mem2mr(struct siw_mem *m)
{
	return container_of(m, struct siw_mr, mem);
}

static inline void siw_unref_mem_sgl(struct siw_mem **mem, unsigned int num_sge)
{
	while (num_sge) {
		if (*mem == NULL)
			break;

		siw_mem_put(*mem);
		*mem = NULL;
		mem++;
		num_sge--;
	}
}

#define CHUNK_SHIFT 9 /* sets number of pages per chunk */
#define PAGES_PER_CHUNK (_AC(1, UL) << CHUNK_SHIFT)
#define CHUNK_MASK (~(PAGES_PER_CHUNK - 1))
#define PAGE_CHUNK_SIZE (PAGES_PER_CHUNK * sizeof(struct page *))

/*
 * siw_get_upage()
 *
 * Get page pointer for address on given umem.
 *
 * @umem: two dimensional list of page pointers
 * @addr: user virtual address
 */
static inline struct page *siw_get_upage(struct siw_umem *umem, u64 addr)
{
	unsigned int page_idx = (addr - umem->fp_addr) >> PAGE_SHIFT,
		     chunk_idx = page_idx >> CHUNK_SHIFT,
		     page_in_chunk = page_idx & ~CHUNK_MASK;

	if (likely(page_idx < umem->num_pages))
		return umem->page_chunk[chunk_idx].plist[page_in_chunk];

	return NULL;
}
#endif
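
The index split performed by siw_get_upage() can be illustrated outside the kernel. This is a minimal sketch, assuming 4 KiB pages; struct sk_slot and sk_locate() are hypothetical names introduced here, and the function returns the two indices instead of a struct page pointer.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT		12	/* assumption: 4 KiB pages */
#define SK_CHUNK_SHIFT		9
#define SK_PAGES_PER_CHUNK	(1UL << SK_CHUNK_SHIFT)
#define SK_CHUNK_MASK		(~(SK_PAGES_PER_CHUNK - 1))

/* Hypothetical result type: position of a page in the two-level list. */
struct sk_slot {
	unsigned int chunk_idx;		/* which chunk holds the page */
	unsigned int page_in_chunk;	/* index within that chunk's plist */
};

/*
 * Map a user virtual address to its chunk and slot.
 * Returns 0 on success, -1 if addr lies beyond the registered pages.
 */
static inline int sk_locate(uint64_t fp_addr, unsigned int num_pages,
			    uint64_t addr, struct sk_slot *slot)
{
	unsigned int page_idx =
		(unsigned int)((addr - fp_addr) >> SK_PAGE_SHIFT);

	if (page_idx >= num_pages)
		return -1;

	slot->chunk_idx = page_idx >> SK_CHUNK_SHIFT;
	slot->page_in_chunk = (unsigned int)(page_idx & ~SK_CHUNK_MASK);
	return 0;
}
```

With a first-page address of 0x10000, an address inside page 513 resolves to chunk 1, slot 1, since each chunk holds 512 page pointers.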

[Diffs for 4 files suppressed because they are too large]

@@ -0,0 +1,91 @@
/* SPDX-License-Identifier: GPL-2.0 or BSD-3-Clause */
/* Authors: Bernard Metzler <bmt@zurich.ibm.com> */
/* Copyright (c) 2008-2019, IBM Corporation */
#ifndef _SIW_VERBS_H
#define _SIW_VERBS_H
#include <linux/errno.h>
#include <rdma/iw_cm.h>
#include <rdma/ib_verbs.h>
#include <rdma/ib_user_verbs.h>
#include "siw.h"
#include "siw_cm.h"

/*
 * siw_copy_sgl()
 *
 * Copy SGL from RDMA core representation to local
 * representation.
 */
static inline void siw_copy_sgl(struct ib_sge *sge, struct siw_sge *siw_sge,
				int num_sge)
{
	while (num_sge--) {
		siw_sge->laddr = sge->addr;
		siw_sge->length = sge->length;
		siw_sge->lkey = sge->lkey;

		siw_sge++;
		sge++;
	}
}

int siw_alloc_ucontext(struct ib_ucontext *base_ctx, struct ib_udata *udata);
void siw_dealloc_ucontext(struct ib_ucontext *base_ctx);
int siw_query_port(struct ib_device *base_dev, u8 port,
struct ib_port_attr *attr);
int siw_get_port_immutable(struct ib_device *base_dev, u8 port,
struct ib_port_immutable *port_immutable);
int siw_query_device(struct ib_device *base_dev, struct ib_device_attr *attr,
struct ib_udata *udata);
int siw_create_cq(struct ib_cq *base_cq, const struct ib_cq_init_attr *attr,
struct ib_udata *udata);
int siw_query_pkey(struct ib_device *base_dev, u8 port, u16 idx, u16 *pkey);
int siw_query_gid(struct ib_device *base_dev, u8 port, int idx,
union ib_gid *gid);
int siw_alloc_pd(struct ib_pd *base_pd, struct ib_udata *udata);
void siw_dealloc_pd(struct ib_pd *base_pd, struct ib_udata *udata);
struct ib_qp *siw_create_qp(struct ib_pd *base_pd,
struct ib_qp_init_attr *attr,
struct ib_udata *udata);
int siw_query_qp(struct ib_qp *base_qp, struct ib_qp_attr *qp_attr,
int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr);
int siw_verbs_modify_qp(struct ib_qp *base_qp, struct ib_qp_attr *attr,
int attr_mask, struct ib_udata *udata);
int siw_destroy_qp(struct ib_qp *base_qp, struct ib_udata *udata);
int siw_post_send(struct ib_qp *base_qp, const struct ib_send_wr *wr,
const struct ib_send_wr **bad_wr);
int siw_post_receive(struct ib_qp *base_qp, const struct ib_recv_wr *wr,
const struct ib_recv_wr **bad_wr);
void siw_destroy_cq(struct ib_cq *base_cq, struct ib_udata *udata);
int siw_poll_cq(struct ib_cq *base_cq, int num_entries, struct ib_wc *wc);
int siw_req_notify_cq(struct ib_cq *base_cq, enum ib_cq_notify_flags flags);
struct ib_mr *siw_reg_user_mr(struct ib_pd *base_pd, u64 start, u64 len,
u64 rnic_va, int rights, struct ib_udata *udata);
struct ib_mr *siw_alloc_mr(struct ib_pd *base_pd, enum ib_mr_type mr_type,
u32 max_sge, struct ib_udata *udata);
struct ib_mr *siw_get_dma_mr(struct ib_pd *base_pd, int rights);
int siw_map_mr_sg(struct ib_mr *base_mr, struct scatterlist *sl, int num_sle,
unsigned int *sg_off);
int siw_dereg_mr(struct ib_mr *base_mr, struct ib_udata *udata);
int siw_create_srq(struct ib_srq *base_srq, struct ib_srq_init_attr *attr,
struct ib_udata *udata);
int siw_modify_srq(struct ib_srq *base_srq, struct ib_srq_attr *attr,
enum ib_srq_attr_mask mask, struct ib_udata *udata);
int siw_query_srq(struct ib_srq *base_srq, struct ib_srq_attr *attr);
void siw_destroy_srq(struct ib_srq *base_srq, struct ib_udata *udata);
int siw_post_srq_recv(struct ib_srq *base_srq, const struct ib_recv_wr *wr,
const struct ib_recv_wr **bad_wr);
int siw_mmap(struct ib_ucontext *ctx, struct vm_area_struct *vma);
void siw_qp_event(struct siw_qp *qp, enum ib_event_type type);
void siw_cq_event(struct siw_cq *cq, enum ib_event_type type);
void siw_srq_event(struct siw_srq *srq, enum ib_event_type type);
void siw_port_event(struct siw_device *dev, u8 port, enum ib_event_type type);
#endif
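
The field-by-field copy in siw_copy_sgl() can be exercised in user space. The sketch below is an illustration, not driver code: struct sk_ib_sge and struct sk_siw_sge are hypothetical mirrors of the kernel's struct ib_sge and of struct siw_sge, built from fixed-width types, and the loop guard is tightened to `> 0` so that a negative count copies nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the RDMA core SGE. */
struct sk_ib_sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/* Hypothetical user-space mirror of the siw SGE. */
struct sk_siw_sge {
	uint64_t laddr;
	uint32_t length;
	uint32_t lkey;
};

/* Both mirrors share size and field offsets. */
_Static_assert(sizeof(struct sk_ib_sge) == sizeof(struct sk_siw_sge),
	       "SGE mirrors differ in size");
_Static_assert(offsetof(struct sk_ib_sge, length) ==
	       offsetof(struct sk_siw_sge, length),
	       "length offsets differ");
_Static_assert(offsetof(struct sk_ib_sge, lkey) ==
	       offsetof(struct sk_siw_sge, lkey),
	       "lkey offsets differ");

/* Copy num_sge entries from the core layout into the siw layout. */
static inline void sk_copy_sgl(const struct sk_ib_sge *sge,
			       struct sk_siw_sge *siw_sge, int num_sge)
{
	while (num_sge-- > 0) {
		siw_sge->laddr = sge->addr;
		siw_sge->length = sge->length;
		siw_sge->lkey = sge->lkey;

		siw_sge++;
		sge++;
	}
}
```

A caller would pass the work request's SGE array and a destination array of at least num_sge entries; the static assertions document the layout match that a memcpy-based copy would rely on.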

@@ -103,6 +103,7 @@ enum rdma_driver_id {
	RDMA_DRIVER_HFI1,
	RDMA_DRIVER_QIB,
	RDMA_DRIVER_EFA,
	RDMA_DRIVER_SIW,
};
#endif

include/uapi/rdma/siw-abi.h

@@ -0,0 +1,185 @@
/* SPDX-License-Identifier: GPL-2.0 or BSD-3-Clause */
/* Authors: Bernard Metzler <bmt@zurich.ibm.com> */
/* Copyright (c) 2008-2019, IBM Corporation */
#ifndef _SIW_USER_H
#define _SIW_USER_H
#include <linux/types.h>
#define SIW_NODE_DESC_COMMON "Software iWARP stack"
#define SIW_ABI_VERSION 1
#define SIW_MAX_SGE 6
#define SIW_UOBJ_MAX_KEY 0x08FFFF
#define SIW_INVAL_UOBJ_KEY (SIW_UOBJ_MAX_KEY + 1)

struct siw_uresp_create_cq {
	__u32 cq_id;
	__u32 num_cqe;
	__aligned_u64 cq_key;
};

struct siw_uresp_create_qp {
	__u32 qp_id;
	__u32 num_sqe;
	__u32 num_rqe;
	__u32 pad;
	__aligned_u64 sq_key;
	__aligned_u64 rq_key;
};

struct siw_ureq_reg_mr {
	__u8 stag_key;
	__u8 reserved[3];
	__u32 pad;
};

struct siw_uresp_reg_mr {
	__u32 stag;
	__u32 pad;
};

struct siw_uresp_create_srq {
	__u32 num_rqe;
	__u32 pad;
	__aligned_u64 srq_key;
};

struct siw_uresp_alloc_ctx {
	__u32 dev_id;
	__u32 pad;
};

enum siw_opcode {
	SIW_OP_WRITE,
	SIW_OP_READ,
	SIW_OP_READ_LOCAL_INV,
	SIW_OP_SEND,
	SIW_OP_SEND_WITH_IMM,
	SIW_OP_SEND_REMOTE_INV,

	/* Unsupported */
	SIW_OP_FETCH_AND_ADD,
	SIW_OP_COMP_AND_SWAP,

	SIW_OP_RECEIVE,
	/* provider internal SQE */
	SIW_OP_READ_RESPONSE,
	/*
	 * below opcodes valid for
	 * in-kernel clients only
	 */
	SIW_OP_INVAL_STAG,
	SIW_OP_REG_MR,
	SIW_NUM_OPCODES
};

/* Keep the layout identical to struct ibv_sge so an SGE can be memcpy'd */
struct siw_sge {
	__aligned_u64 laddr;
	__u32 length;
	__u32 lkey;
};

/*
 * Inline data are kept within the work request itself occupying
 * the space of sge[1] .. sge[n]. Therefore, inline data cannot be
 * supported if SIW_MAX_SGE is below 2 elements.
 */
#define SIW_MAX_INLINE (sizeof(struct siw_sge) * (SIW_MAX_SGE - 1))

#if SIW_MAX_SGE < 2
#error "SIW_MAX_SGE must be at least 2"
#endif

enum siw_wqe_flags {
	SIW_WQE_VALID = 1,
	SIW_WQE_INLINE = (1 << 1),
	SIW_WQE_SIGNALLED = (1 << 2),
	SIW_WQE_SOLICITED = (1 << 3),
	SIW_WQE_READ_FENCE = (1 << 4),
	SIW_WQE_REM_INVAL = (1 << 5),
	SIW_WQE_COMPLETED = (1 << 6)
};

/* Send Queue Element */
struct siw_sqe {
	__aligned_u64 id;
	__u16 flags;
	__u8 num_sge;
	/* Contains enum siw_opcode values */
	__u8 opcode;
	__u32 rkey;
	union {
		__aligned_u64 raddr;
		__aligned_u64 base_mr;
	};
	union {
		struct siw_sge sge[SIW_MAX_SGE];
		__aligned_u64 access;
	};
};

/* Receive Queue Element */
struct siw_rqe {
	__aligned_u64 id;
	__u16 flags;
	__u8 num_sge;
	/*
	 * only used by kernel driver,
	 * ignored if set by user
	 */
	__u8 opcode;
	__u32 unused;
	struct siw_sge sge[SIW_MAX_SGE];
};

enum siw_notify_flags {
	SIW_NOTIFY_NOT = (0),
	SIW_NOTIFY_SOLICITED = (1 << 0),
	SIW_NOTIFY_NEXT_COMPLETION = (1 << 1),
	SIW_NOTIFY_MISSED_EVENTS = (1 << 2),
	SIW_NOTIFY_ALL = SIW_NOTIFY_SOLICITED | SIW_NOTIFY_NEXT_COMPLETION |
			 SIW_NOTIFY_MISSED_EVENTS
};

enum siw_wc_status {
	SIW_WC_SUCCESS,
	SIW_WC_LOC_LEN_ERR,
	SIW_WC_LOC_PROT_ERR,
	SIW_WC_LOC_QP_OP_ERR,
	SIW_WC_WR_FLUSH_ERR,
	SIW_WC_BAD_RESP_ERR,
	SIW_WC_LOC_ACCESS_ERR,
	SIW_WC_REM_ACCESS_ERR,
	SIW_WC_REM_INV_REQ_ERR,
	SIW_WC_GENERAL_ERR,
	SIW_NUM_WC_STATUS
};

struct siw_cqe {
	__aligned_u64 id;
	__u8 flags;
	__u8 opcode;
	__u16 status;
	__u32 bytes;
	union {
		__aligned_u64 imm_data;
		__u32 inval_stag;
	};
	/* QP number or QP pointer */
	union {
		struct ib_qp *base_qp;
		__aligned_u64 qp_id;
	};
};

/*
 * Shared structure between user and kernel
 * to control CQ arming.
 */
struct siw_cq_ctrl {
	__aligned_u64 notify;
};
#endif
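
The inline-data limit defined in siw-abi.h follows directly from the SGE size. As a rough check under stated assumptions, the sketch below mirrors struct siw_sge with fixed-width types (struct sk_sge, SK_MAX_SGE, SK_MAX_INLINE and sk_fits_inline() are hypothetical names, not part of the ABI) and evaluates the same formula.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MAX_SGE 6	/* same value as SIW_MAX_SGE above */

/* Hypothetical mirror of struct siw_sge: 8 + 4 + 4 = 16 bytes. */
struct sk_sge {
	uint64_t laddr;
	uint32_t length;
	uint32_t lkey;
};

/* Inline data reuses sge[1] .. sge[SK_MAX_SGE - 1] of the work request. */
#define SK_MAX_INLINE (sizeof(struct sk_sge) * (SK_MAX_SGE - 1))

_Static_assert(sizeof(struct sk_sge) == 16, "unexpected SGE size");
_Static_assert(SK_MAX_INLINE == 80, "unexpected inline capacity");

/* Returns 1 if a payload of the given length fits inline, else 0. */
static inline int sk_fits_inline(size_t payload_len)
{
	return payload_len <= SK_MAX_INLINE;
}
```

With six SGEs of 16 bytes each, five remain for payload, so this mirror yields an inline capacity of 80 bytes; a larger payload would have to be described through regular SGEs.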