Commit graph

123 commits

Author SHA1 Message Date
Gary Leshner
84e3b19a27 IB/hfi1: Remove module parameter for KDETH qpns
The module parameter for KDETH qpns is being removed in favor
of always using the default value of 0x80 as the qpn prefix.
Defines have been added for various KDETH values including
the prefix of 0x80.
The reserved range now starts at the base value for KDETH
qpns (0x80) and extends up to and including the last qpn for
other reserved QP prefixed types.
Adjust other QP prefixed define names to match KDETH defined
names.

Link: https://lore.kernel.org/r/20200511160600.173205.27508.stgit@awfm-01.aw.intel.com
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Gary Leshner <Gary.S.Leshner@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-21 11:23:54 -03:00
Dennis Dalessandro
817a68a658 IB/hfi1, qib: Ensure RCU is locked when accessing list
The packet handling function, specifically the iteration of the qp list
for mad packet processing misses locking RCU before running through the
list. Not only is this incorrect, but the list_for_each_entry_rcu() call
can not be called with a conditional check for lock dependency. Remedy
this by invoking the rcu lock and unlock around the critical section.

This brings MAD packet processing in line with what is done for non-MAD
packets.

Fixes: 7724105686 ("IB/hfi1: add driver files")
Link: https://lore.kernel.org/r/20200225195445.140896.41873.stgit@awfm-01.aw.intel.com
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-02 11:10:21 -04:00
Mike Marciniszyn
22bb136534 IB/hfi1: Use a common pad buffer for 9B and 16B packets
There is no reason for a different pad buffer for the two
packet types.

Expand the current buffer allocation to allow for both
packet types.

Fixes: f8195f3b14 ("IB/hfi1: Eliminate allocation while atomic")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Link: https://lore.kernel.org/r/20191004204934.26838.13099.stgit@awfm-01.aw.intel.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-10-17 16:32:25 -04:00
Denis Efremov
7b0b692594 IB/hfi1: remove unlikely() from IS_ERR*() condition
"unlikely(IS_ERR_OR_NULL(x))" is excessive. IS_ERR_OR_NULL() already uses
unlikely() internally.

Link: http://lkml.kernel.org/r/20190829165025.15750-8-efremov@linux.com
Signed-off-by: Denis Efremov <efremov@linux.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Joe Perches <joe@perches.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-26 10:10:30 -07:00
Jason Gunthorpe
868df536f5 Merge branch 'odp_fixes' into rdma.git for-next
Jason Gunthorpe says:

====================
This is a collection of general cleanups for ODP to clarify some of the
flows around umem creation and use of the interval tree.
====================

The branch is based on v5.3-rc5 due to dependencies

* odp_fixes:
  RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
  RDMA/mlx5: Use ib_umem_start instead of umem.address
  RDMA/core: Make invalidate_range a device operation
  RDMA/odp: Use kvcalloc for the dma_list and page_list
  RDMA/odp: Check for overflow when computing the umem_odp end
  RDMA/odp: Provide ib_umem_odp_release() to undo the allocs
  RDMA/odp: Split creating a umem_odp from ib_umem_get
  RDMA/odp: Make the three ways to create a umem_odp clear
  RMDA/odp: Consolidate umem_odp initialization
  RDMA/odp: Make it clearer when a umem is an implicit ODP umem
  RDMA/odp: Iterate over the whole rbtree directly
  RDMA/odp: Use the common interval tree library instead of generic
  RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21 14:10:36 -03:00
Gustavo A. R. Silva
6497d0a9c5 IB/hfi1: Fix Spectre v1 vulnerability
sl is controlled by user-space, hence leading to a potential
exploitation of the Spectre variant 1 vulnerability.

Fix this by sanitizing sl before using it to index ibp->sl_to_sc.

Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].

[1] https://lore.kernel.org/lkml/20180423164740.GY17484@dhcp22.suse.cz/

Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Link: https://lore.kernel.org/r/20190731175428.GA16736@embeddedor
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-08-01 12:14:15 -04:00
Kaike Wan
b2590bdd0b IB/hfi1: Do not update hcrc for a KDETH packet during fault injection
When a KDETH packet is subject to fault injection during transmission,
HCRC is supposed to be omitted from the packet so that the hardware on the
receiver side would drop the packet. When creating pbc, the PbcInsertHcrc
field is set to be PBC_IHCRC_NONE if the KDETH packet is subject to fault
injection, but overwritten with PBC_IHCRC_LKDETH when update_hcrc() is
called later.

This problem is fixed by not calling update_hcrc() when the packet is
subject to fault injection.

Fixes: 6b6cf9357f ("IB/hfi1: Set PbcInsertHcrc for TID RDMA packets")
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20190715164546.74174.99296.stgit@awfm-01.aw.intel.com
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-22 16:10:48 -03:00
Jason Gunthorpe
371bb62158 Linux 5.2-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0Os1seHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGtx4H/j6i482XzcGFKTBm
 A7mBoQpy+kLtoUov4EtBAR62OuwI8rsahW9di37QKndPoQrczWaKBmr3De6LCdPe
 v3pl3O6wBbvH5ru+qBPFX9PdNbDvimEChh7LHxmMxNQq3M+AjZAZVJyfpoiFnx35
 Fbge+LZaH/k8HMwZmkMr5t9Mpkip715qKg2o9Bua6dkH0AqlcpLlC8d9a+HIVw/z
 aAsyGSU8jRwhoAOJsE9bJf0acQ/pZSqmFp0rDKqeFTSDMsbDRKLGq/dgv4nW0RiW
 s7xqsjb/rdcvirRj3rv9+lcTVkOtEqwk0PVdL9WOf7g4iYrb3SOIZh8ZyViaDSeH
 VTS5zps=
 =huBY
 -----END PGP SIGNATURE-----

Merge tag 'v5.2-rc6' into rdma.git for-next

For dependencies in next patches.

Resolve conflicts:
- Use uverbs_get_cleared_udata() with new cq allocation flow
- Continue to delete nes despite SPDX conflict
- Resolve list appends in mlx5_command_str()
- Use u16 for vport_rule stuff
- Resolve list appends in struct ib_client

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-28 21:18:23 -03:00
Mike Marciniszyn
942a899335 IB/hfi1: Handle port down properly in pio
The call to sc_buffer_alloc currently returns NULL (no buffer) or
a buffer descriptor.

There is a third case when the port is down.  Currently that
returns NULL and this prevents the caller from properly handling the
sc_buffer_alloc() failure.  A verbs code link test after the call is
racy so the indication needs to come from the state check inside the allocation
routine to be valid.

Fix by encoding the ECOMM failure like SDMA.   IS_ERR_OR_NULL() tests
are added at all call sites.  For verbs send, this needs to treat any
error by returning a completion without any MMIO copy.

Fixes: 7724105686 ("IB/hfi1: add driver files")
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17 21:15:40 -04:00
Mike Marciniszyn
4bb02e9572 IB/hfi1: Use aborts to trigger RC throttling
SDMA and pio flushes will cause a lot of packets to be transmitted
after a link has gone down, using a lot of CPU to retransmit
packets.

Fix for RC QPs by recognizing the flush status and:
- Forcing a timer start
- Putting the QP into a "send one" mode

Fixes: 7724105686 ("IB/hfi1: add driver files")
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17 21:15:40 -04:00
Jason Gunthorpe
7a15414252 RDMA: Move owner into struct ib_device_ops
This more closely follows how other subsytems work, with owner being a
member of the structure containing the function pointers.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-10 16:56:03 -03:00
Jason Gunthorpe
b9560a419b RDMA: Move driver_id into struct ib_device_ops
No reason for every driver to emit code to set this, just make it part of
the driver's existing static const ops structure.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-10 16:56:02 -03:00
Mike Marciniszyn
35164f5259 IB/{qib, hfi1, rdmavt}: Correct ibv_devinfo max_mr value
The command 'ibv_devinfo -v' reports 0 for max_mr.

Fix by assigning the query values after the mr lkey_table has been built
rather than early on in the driver.

Fixes: 7b1e2099ad ("IB/rdmavt: Move memory registration into rdmavt")
Reviewed-by: Josh Collier <josh.d.collier@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-29 12:56:05 -03:00
Matthew Wilcox
03b92789e5 hfi1: Convert hfi1_unit_table to XArray
Also remove hfi1_devs_list.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-01 13:27:35 -03:00
Mike Marciniszyn
270a9833b2 IB/hfi1: Add running average for adaptive pio
The adaptive PIO implementation only considers the current packet size
when deciding between SDMA and pio for a packet.

This causes credit return forces if small and large packets are
interleaved.

Add a running average to avoid costly credit forces so that a large
sequence of small packets is required to go below the threshold that
chooses pio.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-26 09:33:21 -03:00
Kaike Wan
34025fb0c4 IB/hfi1: Prioritize the sending of ACK packets
ACK packets are generally associated with request completion and resource
release and therefore should be sent first. This patch optimizes the
send engine by using the following policies:
(1) QPs with RVT_S_ACK_PENDING bit set in qp->s_flags or qpriv->s_flags
should have their priority incremented;
(2) QPs with ACK or TID-ACK packet queued should have their priority
incremented;
(3) When a QP is queued to the wait list due to resource constraints, it
will be queued to the head if it has ACK packet to send;
(4) When selecting qps to run from the wait list, the one with the highest
priority and starve_cnt will be selected; each priority will be equivalent
to a fixed number of starve_cnt (16).

Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:44 -05:00
Kaike Wan
3c6cb20a0d IB/hfi1: Add TID RDMA WRITE functionality into RDMA verbs
This patch integrates TID RDMA WRITE protocol into normal RDMA verbs
framework. The TID RDMA WRITE protocol is an end-to-end protocol
between the hfi1 drivers on two OPA nodes that converts a qualified
RDMA WRITE request into a TID RDMA WRITE request to avoid data copying
on the responder side.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:44 -05:00
Kaike Wan
f5a4a95f4d IB/hfi1: Allow for extra entries in QP's s_ack_queue
The TID RDMA WRITE protocol differs from normal IB RDMA WRITE
in that TID RDMA WRITE requests do require responses, not just
ACKs.

Therefore, TID RDMA WRITE requests need to be treated as RDMA
READ requests from the point of view of the QPs' s_ack_queue.
In other words, the QPs' need to allow for TID RDMA WRITE
requests to be stored in their s_ack_queue.

However, because the user does not know anything about the TID
RDMA capability and/or protocols, these extra entries in the
queue cannot be advertized to the user.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:43 -05:00
Kaike Wan
24b11923da IB/hfi1: Integrate TID RDMA READ protocol into RC protocol
This patch integrates the TID RDMA READ protocol into the IB RC protocol.
This protocol is an end-to-end protocol between the hfi1 drivers on two
OPA nodes that converts a qualified RDMA READ request into a TID RDMA
READ request to avoid data copying on the requester side. The following
codes are added in this patch:
- Send the TID RDMA READ request;
- Complete the TID RDMA READ send request;
- Send the TID RDMA READ response;
- Complete the TID RDMA READ request;

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:53:55 -05:00
Kaike Wan
22d136d756 IB/hfi1: Add TID RDMA handlers
This commit adds the TID RDMA READ pointers to the receiving opcode
handlers. It also adds TID RDMA READ header sizes to header size table.
A function to print the RHF EFLAGS errors is created so that it can be
shared by both IB and TID RDMA receiving functions.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:53:55 -05:00
Kaike Wan
6b6cf9357f IB/hfi1: Set PbcInsertHcrc for TID RDMA packets
All TID RDMA packets are in KDETH packet format and therefore the
PbcInsertHcrc must be set properly before sending the packet to
hardware. Otherwise, the packets will be dropped by the receiver.
By default, HCRC is not inserted for 9B packets without KDETH, and
this patch adds that back for TID RDMA packets.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:53:55 -05:00
Kaike Wan
838b6fd2d9 IB/hfi1: TID RDMA RcvArray programming and TID allocation
TID entries are used by hfi1 hardware to receive data payload from
incoming packets directly into a user buffer and thus avoid data copying
by software. This patch implements the functions for TID allocation,
freeing, and programming TID RcvArray entries in hardware for kernel
clients. TID entries are managed via lists of TID groups similar to PSM.
Furthermore, to track TID resource allocation for each request, software
flows are also allocated and freed as needed. Since software flows
consume large amount of memory for tracking TID allocation and freeing,
it is generally desirable to allocate them dynamically in the send queue
and only for TID RDMA requests, but pre-allocate them for receive queue
because the send queue could have thousands of entries while the receive
queue has only a limited number of entries.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:53:55 -05:00
Kaike Wan
48a615dc00 IB/hfi1: Integrate OPFN into RC transactions
OPFN parameter negotiation allows a pair of connected RC QPs to exchange
a set of parameters in succession. This negotiation does not commence
till the first ULP request. Because OPFN operations are operations
private to the driver, they do not generate user completions or put the
QP into error when they run out of retries. This patch integrates the
OPFN protocol into the transactions of an RC QP.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-01-31 11:37:34 -05:00
Kaike Wan
ddf922c31f IB/hfi1, IB/rdmavt: Allow for extending of QP's s_ack_queue
The OPFN protocol uses the COMPARE_SWAP request to exchange data
between the requester and the responder and therefore needs to
be stored in the QP's s_ack_queue when the request is received
on the responder side. However, because the user does not know
anything about the OPFN protocol, this extra entry in the
queue cannot be advertised to the user. This patch adds an extra
entry in a QP's s_ack_queue.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-01-31 11:36:05 -05:00
Michael J. Ruhl
87fc34b575 IB/{hfi1,qib}: Cleanup open coded sge sizing
Sge sizing is done in several places using an open coded method.

This can cause maintenance issues.  The open coded method is
encapsulated in a helper routine.  The helper was introduced with
commit:

1198fcea8a ("IB/hfi1, rdmavt: Move SGE state helper routines into
rdmavt")

Update all call sites that have the open coded path with the helper
routine.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-01-30 14:22:32 -05:00
Parav Pandit
ea4baf7f11 RDMA: Rename port_callback to init_port
Most provider routines are callback routines which ib core invokes.
_callback suffix doesn't convey information about when such callback is
invoked. Therefore, rename port_callback to init_port.

Additionally, store the init_port function pointer in ib_device_ops, so
that it can be accessed in subsequent patches when binding rdma device to
net namespace.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-14 13:05:14 -07:00
Linus Torvalds
5d24ae67a9 4.21 merge window pull request
This has been a fairly typical cycle, with the usual sorts of driver
 updates. Several series continue to come through which improve and
 modernize various parts of the core code, and we finally are starting to
 get the uAPI command interface cleaned up.
 
 - Various driver fixes for bnxt_re, cxgb3/4, hfi1, hns, i40iw, mlx4, mlx5,
   qib, rxe, usnic
 
 - Rework the entire syscall flow for uverbs to be able to run over
   ioctl(). Finally getting past the historic bad choice to use write()
   for command execution
 
 - More functional coverage with the mlx5 'devx' user API
 
 - Start of the HFI1 series for 'TID RDMA'
 
 - SRQ support in the hns driver
 
 - Support for new IBTA defined 2x lane widths
 
 - A big series to consolidate all the driver function pointers into
   a big struct and have drivers provide a 'static const' version of the
   struct instead of open coding initialization
 
 - New 'advise_mr' uAPI to control device caching/loading of page tables
 
 - Support for inline data in SRPT
 
 - Modernize how umad uses the driver core and creates cdev's and sysfs
   files
 
 - First steps toward removing 'uobject' from the view of the drivers
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlwhV2oACgkQOG33FX4g
 mxpF8A/9EkRCg6wCDC59maA53b5PjuNmD//9hXbycQPQSlxntI2PyYtxrzBqc0+2
 yIaFFMehL41XNN6y1zfkl7ndl62McCH2TpiidU8RyTxVw/e3KsDD5sU6++atfHRo
 M82RNfedDtxPG8TcCPKVLof6JHADApGSR1r4dCYfAnu7KFMyvlLmeYyx4r/2E6yC
 iQPmtKVOdbGkuWGeX+brGEA0vg7FUOAvaysnxddjyh9hyem4h0SUR3Af/Ik0N5ME
 PYzC+hMKbkPVBLoCWyg7QwUaqK37uWwguMQLtI2byF7FgbiK/lBQt6TsidR4Fw3p
 EalL7uqxgCTtLYh918vxLFjdYt6laka9j7xKCX8M8d06sy/Lo8iV4hWjiTESfMFG
 usqs7D6p09gA/y1KISji81j6BI7C92CPVK2drKIEnfyLgY5dBNFcv9m2H12lUCH2
 NGbfCNVaTQVX6bFWPpy2Bt2y/Litsfxw5RviehD7jlG0lQjsXGDkZzsDxrMSSlNU
 S79iiTJyK4kUZkXzrSSlN58pLBlbupJwm5MDjKmM+irsrsCHjGIULvc902qtnC3/
 8ImiTtW6XvqLbgWXyy2Th8/ZgRY234p1ybhog+DFaGKUch0XqB7VXTV2OZm0GjcN
 Fp4PUeBt+/gBgYqjpuffqQc1rI4uwXYSoz7wq9RBiOpw5zBFT1E=
 =T0p1
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "This has been a fairly typical cycle, with the usual sorts of driver
  updates. Several series continue to come through which improve and
  modernize various parts of the core code, and we finally are starting
  to get the uAPI command interface cleaned up.

   - Various driver fixes for bnxt_re, cxgb3/4, hfi1, hns, i40iw, mlx4,
     mlx5, qib, rxe, usnic

   - Rework the entire syscall flow for uverbs to be able to run over
     ioctl(). Finally getting past the historic bad choice to use
     write() for command execution

   - More functional coverage with the mlx5 'devx' user API

   - Start of the HFI1 series for 'TID RDMA'

   - SRQ support in the hns driver

   - Support for new IBTA defined 2x lane widths

   - A big series to consolidate all the driver function pointers into a
     big struct and have drivers provide a 'static const' version of the
     struct instead of open coding initialization

   - New 'advise_mr' uAPI to control device caching/loading of page
     tables

   - Support for inline data in SRPT

   - Modernize how umad uses the driver core and creates cdev's and
     sysfs files

   - First steps toward removing 'uobject' from the view of the drivers"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (193 commits)
  RDMA/srpt: Use kmem_cache_free() instead of kfree()
  RDMA/mlx5: Signedness bug in UVERBS_HANDLER()
  IB/uverbs: Signedness bug in UVERBS_HANDLER()
  IB/mlx5: Allocate the per-port Q counter shared when DEVX is supported
  IB/umad: Start using dev_groups of class
  IB/umad: Use class_groups and let core create class file
  IB/umad: Refactor code to use cdev_device_add()
  IB/umad: Avoid destroying device while it is accessed
  IB/umad: Simplify and avoid dynamic allocation of class
  IB/mlx5: Fix wrong error unwind
  IB/mlx4: Remove set but not used variable 'pd'
  RDMA/iwcm: Don't copy past the end of dev_name() string
  IB/mlx5: Fix long EEH recover time with NVMe offloads
  IB/mlx5: Simplify netdev unbinding
  IB/core: Move query port to ioctl
  RDMA/nldev: Expose port_cap_flags2
  IB/core: uverbs copy to struct or zero helper
  IB/rxe: Reuse code which sets port state
  IB/rxe: Make counters thread safe
  IB/mlx5: Use the correct commands for UMEM and UCTX allocation
  ...
2018-12-28 14:57:10 -08:00
Kamal Heib
e3c320caa1 RDMA/hfi1: Initialize ib_device_ops struct
Initialize ib_device_ops with the supported operations using
ib_set_device_ops().

Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-11 15:15:07 -07:00
Mike Marciniszyn
9aefcabe57 IB/hfi1: Reduce lock contention on iowait_lock for sdma and pio
Commit 4e045572e2 ("IB/hfi1: Add unique txwait_lock for txreq events")
laid the ground work to support per resource waiting locking.

This patch adds that with a lock unique to each sdma engine and pio
sendcontext and makes necessary changes for verbs, PSM, and vnic to use
the new locks.

This is particularly beneficial for smaller messages that will exhaust
resources at a faster rate.

Fixes: 7724105686 ("IB/hfi1: add driver files")
Reviewed-by: Gary Leshner <Gary.S.Leshner@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-06 20:15:36 -07:00
Mike Marciniszyn
5190f052a3 IB/hfi1: Allow the driver to initialize QP priv struct
This patch adds an interface to allow the driver to initialize the QP priv
struct when the QP is created and after the qpn has been assigned.  A
field is added to the QP priv struct to reference the rcd and two new
files are added to contain the function to initialize the rcd field so
that more TID RDMA related code can be added here later.

Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-06 19:59:47 -07:00
Michael J. Ruhl
dbc2970cae IB/hfi1: Incorrect sizing of sge for PIO will OOPs
An incorrect sge sizing in the HFI PIO path will cause an OOPs similar to
this:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [] hfi1_verbs_send_pio+0x3d8/0x530 [hfi1]
PGD 0
Oops: 0000 1 SMP
 Call Trace:
 ? hfi1_verbs_send_dma+0xad0/0xad0 [hfi1]
 hfi1_verbs_send+0xdf/0x250 [hfi1]
 ? make_rc_ack+0xa80/0xa80 [hfi1]
 hfi1_do_send+0x192/0x430 [hfi1]
 hfi1_do_send_from_rvt+0x10/0x20 [hfi1]
 rvt_post_send+0x369/0x820 [rdmavt]
 ib_uverbs_post_send+0x317/0x570 [ib_uverbs]
 ib_uverbs_write+0x26f/0x420 [ib_uverbs]
 ? security_file_permission+0x21/0xa0
 vfs_write+0xbd/0x1e0
 ? mntput+0x24/0x40
 SyS_write+0x7f/0xe0
 system_call_fastpath+0x16/0x1b

Fix by adding the missing sizing check to correctly determine the sge
length.

Fixes: 7724105686 ("IB/hfi1: add driver files")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-12-06 19:50:08 -07:00
Piotr Stankiewicz
36d842194a IB/hfi1: Fix an out-of-bounds access in get_hw_stats
When running with KASAN, the following trace is produced:

[   62.535888]

==================================================================
[   62.544930] BUG: KASAN: slab-out-of-bounds in
gut_hw_stats+0x122/0x230 [hfi1]
[   62.553856] Write of size 8 at addr ffff88080e8d6330 by task
kworker/0:1/14

[   62.565333] CPU: 0 PID: 14 Comm: kworker/0:1 Not tainted
4.19.0-test-build-kasan+ #8
[   62.575087] Hardware name: Intel Corporation S2600KPR/S2600KPR, BIOS
SE5C610.86B.01.01.0019.101220160604 10/12/2016
[   62.587951] Workqueue: events work_for_cpu_fn
[   62.594050] Call Trace:
[   62.598023]  dump_stack+0xc6/0x14c
[   62.603089]  ? dump_stack_print_info.cold.1+0x2f/0x2f
[   62.610041]  ? kmsg_dump_rewind_nolock+0x59/0x59
[   62.616615]  ? get_hw_stats+0x122/0x230 [hfi1]
[   62.622985]  print_address_description+0x6c/0x23c
[   62.629744]  ? get_hw_stats+0x122/0x230 [hfi1]
[   62.636108]  kasan_report.cold.6+0x241/0x308
[   62.642365]  get_hw_stats+0x122/0x230 [hfi1]
[   62.648703]  ? hfi1_alloc_rn+0x40/0x40 [hfi1]
[   62.655088]  ? __kmalloc+0x110/0x240
[   62.660695]  ? hfi1_alloc_rn+0x40/0x40 [hfi1]
[   62.667142]  setup_hw_stats+0xd8/0x430 [ib_core]
[   62.673972]  ? show_hfi+0x50/0x50 [hfi1]
[   62.680026]  ib_device_register_sysfs+0x165/0x180 [ib_core]
[   62.687995]  ib_register_device+0x5a2/0xa10 [ib_core]
[   62.695340]  ? show_hfi+0x50/0x50 [hfi1]
[   62.701421]  ? ib_unregister_device+0x2e0/0x2e0 [ib_core]
[   62.709222]  ? __vmalloc_node_range+0x2d0/0x380
[   62.716131]  ? rvt_driver_mr_init+0x11f/0x2d0 [rdmavt]
[   62.723735]  ? vmalloc_node+0x5c/0x70
[   62.729697]  ? rvt_driver_mr_init+0x11f/0x2d0 [rdmavt]
[   62.737347]  ? rvt_driver_mr_init+0x1f5/0x2d0 [rdmavt]
[   62.744998]  ? __rvt_alloc_mr+0x110/0x110 [rdmavt]
[   62.752315]  ? rvt_rc_error+0x140/0x140 [rdmavt]
[   62.759434]  ? rvt_vma_open+0x30/0x30 [rdmavt]
[   62.766364]  ? mutex_unlock+0x1d/0x40
[   62.772445]  ? kmem_cache_create_usercopy+0x15d/0x230
[   62.780115]  rvt_register_device+0x1f6/0x360 [rdmavt]
[   62.787823]  ? rvt_get_port_immutable+0x180/0x180 [rdmavt]
[   62.796058]  ? __get_txreq+0x400/0x400 [hfi1]
[   62.802969]  ? memcpy+0x34/0x50
[   62.808611]  hfi1_register_ib_device+0xde6/0xeb0 [hfi1]
[   62.816601]  ? hfi1_get_npkeys+0x10/0x10 [hfi1]
[   62.823760]  ? hfi1_init+0x89f/0x9a0 [hfi1]
[   62.830469]  ? hfi1_setup_eagerbufs+0xad0/0xad0 [hfi1]
[   62.838204]  ? pcie_capability_clear_and_set_word+0xcd/0xe0
[   62.846429]  ? pcie_capability_read_word+0xd0/0xd0
[   62.853791]  ? hfi1_pcie_init+0x187/0x4b0 [hfi1]
[   62.860958]  init_one+0x67f/0xae0 [hfi1]
[   62.867301]  ? hfi1_init+0x9a0/0x9a0 [hfi1]
[   62.873876]  ? wait_woken+0x130/0x130
[   62.879860]  ? read_word_at_a_time+0xe/0x20
[   62.886329]  ? strscpy+0x14b/0x280
[   62.891998]  ? hfi1_init+0x9a0/0x9a0 [hfi1]
[   62.898405]  local_pci_probe+0x70/0xd0
[   62.904295]  ? pci_device_shutdown+0x90/0x90
[   62.910833]  work_for_cpu_fn+0x29/0x40
[   62.916750]  process_one_work+0x584/0x960
[   62.922974]  ? rcu_work_rcufn+0x40/0x40
[   62.928991]  ? __schedule+0x396/0xdc0
[   62.934806]  ? __sched_text_start+0x8/0x8
[   62.941020]  ? pick_next_task_fair+0x68b/0xc60
[   62.947674]  ? run_rebalance_domains+0x260/0x260
[   62.954471]  ? __list_add_valid+0x29/0xa0
[   62.960607]  ? move_linked_works+0x1c7/0x230
[   62.967077]  ?
trace_event_raw_event_workqueue_execute_start+0x140/0x140
[   62.976248]  ? mutex_lock+0xa6/0x100
[   62.982029]  ? __mutex_lock_slowpath+0x10/0x10
[   62.988795]  ? __switch_to+0x37a/0x710
[   62.994731]  worker_thread+0x62e/0x9d0
[   63.000602]  ? max_active_store+0xf0/0xf0
[   63.006828]  ? __switch_to_asm+0x40/0x70
[   63.012932]  ? __switch_to_asm+0x34/0x70
[   63.019013]  ? __switch_to_asm+0x40/0x70
[   63.025042]  ? __switch_to_asm+0x34/0x70
[   63.031030]  ? __switch_to_asm+0x40/0x70
[   63.037006]  ? __schedule+0x396/0xdc0
[   63.042660]  ? kmem_cache_alloc_trace+0xf3/0x1f0
[   63.049323]  ? kthread+0x59/0x1d0
[   63.054594]  ? ret_from_fork+0x35/0x40
[   63.060257]  ? __sched_text_start+0x8/0x8
[   63.066212]  ? schedule+0xcf/0x250
[   63.071529]  ? __wake_up_common+0x110/0x350
[   63.077794]  ? __schedule+0xdc0/0xdc0
[   63.083348]  ? wait_woken+0x130/0x130
[   63.088963]  ? finish_task_switch+0x1f1/0x520
[   63.095258]  ? kasan_unpoison_shadow+0x30/0x40
[   63.101792]  ? __init_waitqueue_head+0xa0/0xd0
[   63.108183]  ? replenish_dl_entity.cold.60+0x18/0x18
[   63.115151]  ? _raw_spin_lock_irqsave+0x25/0x50
[   63.121754]  ? max_active_store+0xf0/0xf0
[   63.127753]  kthread+0x1ae/0x1d0
[   63.132894]  ? kthread_bind+0x30/0x30
[   63.138422]  ret_from_fork+0x35/0x40

[   63.146973] Allocated by task 14:
[   63.152077]  kasan_kmalloc+0xbf/0xe0
[   63.157471]  __kmalloc+0x110/0x240
[   63.162804]  init_cntrs+0x34d/0xdf0 [hfi1]
[   63.168883]  hfi1_init_dd+0x29a3/0x2f90 [hfi1]
[   63.175244]  init_one+0x551/0xae0 [hfi1]
[   63.181065]  local_pci_probe+0x70/0xd0
[   63.186759]  work_for_cpu_fn+0x29/0x40
[   63.192310]  process_one_work+0x584/0x960
[   63.198163]  worker_thread+0x62e/0x9d0
[   63.203843]  kthread+0x1ae/0x1d0
[   63.208874]  ret_from_fork+0x35/0x40

[   63.217203] Freed by task 1:
[   63.221844]  __kasan_slab_free+0x12e/0x180
[   63.227844]  kfree+0x92/0x1a0
[   63.232570]  single_release+0x3a/0x60
[   63.238024]  __fput+0x1d9/0x480
[   63.242911]  task_work_run+0x139/0x190
[   63.248440]  exit_to_usermode_loop+0x191/0x1a0
[   63.254814]  do_syscall_64+0x301/0x330
[   63.260283]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[   63.270199] The buggy address belongs to the object at
ffff88080e8d5500
 which belongs to the cache kmalloc-4096 of size 4096
[   63.287247] The buggy address is located 3632 bytes inside of
 4096-byte region [ffff88080e8d5500, ffff88080e8d6500)
[   63.303564] The buggy address belongs to the page:
[   63.310447] page:ffffea00203a3400 count:1 mapcount:0
mapping:ffff88081380e840 index:0x0 compound_mapcount: 0
[   63.323102] flags: 0x2fffff80008100(slab|head)
[   63.329775] raw: 002fffff80008100 0000000000000000 0000000100000001
ffff88081380e840
[   63.340175] raw: 0000000000000000 0000000000070007 00000001ffffffff
0000000000000000
[   63.350564] page dumped because: kasan: bad access detected

[   63.361974] Memory state around the buggy address:
[   63.369137]  ffff88080e8d6200: 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00
[   63.379082]  ffff88080e8d6280: 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00
[   63.389032] >ffff88080e8d6300: 00 00 00 00 00 00 fc fc fc fc fc fc fc
fc fc fc
[   63.398944]                                      ^
[   63.406141]  ffff88080e8d6380: fc fc fc fc fc fc fc fc fc fc fc fc fc
fc fc fc
[   63.416109]  ffff88080e8d6400: fc fc fc fc fc fc fc fc fc fc fc fc fc
fc fc fc
[   63.426099]
==================================================================

The trace happens because get_hw_stats() assumes there is room in the
memory allocated in init_cntrs() to accommodate the driver counters.
Unfortunately, that routine only allocated space for the device
counters.

Fix by insuring the allocation has room for the additional driver
counters.

Cc: <Stable@vger.kernel.org> # v4.14+
Fixes: b7481944b0 ("IB/hfi1: Show statistics counters under IB stats interface")
Reviewed-by: Mike Marciniczyn <mike.marciniszyn@intel.com>
Reviewed-by: Mike Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-12-03 16:05:19 -05:00
Parav Pandit
508a523f6b RDMA/drivers: Use core provided API for registering device attributes
Use rdma_set_device_sysfs_group() to register device attributes and
simplify the driver.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-17 03:45:01 -06:00
Jason Gunthorpe
59bfc59a68 Merge branch 'for-rc' into rdma.git for-next
From git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git

This is required to resolve dependencies of the next series of RDMA
patches.

The code motion conflicts in drivers/infiniband/core/cache.c were
resolved.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-16 00:01:02 -06:00
Venkata Sandeep Dhanalakota
116aa0330e IB/{hfi1, qib, rdmavt}: Move send completion logic to rdmavt
Moving send completion code into rdmavt in order to have shared logic
between qib and hfi1 drivers.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Venkata Sandeep Dhanalakota <venkata.s.dhanalakota@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-03 16:38:28 -06:00
Brian Welty
019f118b94 IB/{hfi1, qib, rdmavt}: Move copy SGE logic into rdmavt
This patch moves hfi1_copy_sge() into rdmavt for sharing with qib.
This patch also moves all the wss_*() functions into rdmavt as
several wss_*() functions are called from hfi1_copy_sge()

When SGE copy mode is adaptive, cacheless copy may be done in some cases
for performance reasons. In those cases, X86 cacheless copy function
is called since the drivers that use rdmavt and may set SGE copy mode
to adaptive are X86 only. For this reason, this patch adds
"depends on X86_64" to rdmavt/Kconfig.

Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-03 16:38:28 -06:00
Dennis Dalessandro
5da0fc9dbf IB/hfi1: Prepare resource waits for dual leg
Current implementation allows each qp to have only one send engine.  As
such, each qp has only one list to queue prebuilt packets when send engine
resources are not available. To improve performance, it is desired to
support multiple send engines for each qp.

This patch creates the framework to support two send engines
(two legs) for each qp for the TID RDMA protocol, which can be easily
extended to support more send engines. It achieves the goal by creating a
leg specific struct, iowait_work in the iowait struct, to hold the
work_struct and the tx_list as well as a pointer to the parent iowait
struct.

The hfi1_pkt_state now has an additional field to record the current legs
work structure and that is now passed to all egress waiters to determine
the leg that needs to wait via a new iowait helper.  The APIs are adjusted
to use the new leg specific struct as required.

Many new and modified helpers are added to support this change.

Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-30 19:21:12 -06:00
Kaike Wan
d205a06a14 IB/rdmavt: Rename check_send_wqe as setup_wqe
The driver-provided function check_send_wqe allows the hardware driver to
check and set up the incoming send wqe before it is inserted into the swqe
ring. This patch will rename it as setup_wqe to better reflect its
usage. In addition, this function is only called when all setup is
complete in rdmavt.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-30 19:21:11 -06:00
Ira Weiny
0dbfaa9f28 IB/hfi1: Fix SL array bounds check
The SL specified by a user needs to be a valid SL.

Add a range check to the user specified SL value which protects from
running off the end of the SL to SC table.

CC: stable@vger.kernel.org
Fixes: 7724105686 ("IB/hfi1: add driver files")
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-20 19:24:51 -06:00
Bart Van Assche
522628ed1a IB/hfi1: Suppress a compiler warning
Avoid that the following compiler warning is reported when building
with gcc 8:

drivers/infiniband/hw/hfi1/verbs.c:1896:2: warning: 'strncpy' output may be truncated copying 64 bytes from a string of length 64 [-Wstringop-truncation]

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-11 14:20:45 -06:00
Jason Gunthorpe
958200ad8e RDMA/hfi1: Move grh_required into update_sm_ah
grh_required is intended to be a global setting where all AV's will
require a GRH, not just the sm_lid. Move the special logic to the creation
of the SM AH.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-07-10 11:13:04 -06:00
Mike Marciniszyn
2e2ba09e48 IB/rdmavt, IB/hfi1: Create device dependent s_flags
Move some s_flags defines out of rdmavt and into hfi1 because they are
hfi1 specific and therefore should remain in the driver instead of
bubbling up to rdmavt.

Document device specific ranges in rdmavt and remap
those in hfi1.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-19 11:49:46 -06:00
Steve Wise
33023fb85a IB/core: add max_send_sge and max_recv_sge attributes
This patch replaces the ib_device_attr.max_sge with max_send_sge and
max_recv_sge. It allows ulps to take advantage of devices that have very
different send and recv sge depths.  For example cxgb4 has a max_recv_sge
of 4, yet a max_send_sge of 16.  Splitting out these attributes allows
much more efficient use of the SQ for cxgb4 with ulps that use the RDMA_RW
API. Consider a large RDMA WRITE that has 16 scattergather entries.
With max_sge of 4, the ulp would send 4 WRITE WRs, but with max_sge of
16, it can be done with 1 WRITE WR.

Acked-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 13:17:28 -06:00
Don Hiatt
81cd3891f0 IB/hfi1: Add support for 16B Management Packets
16B Management Packets (L4=0x08) replace the BTH and DETH
of normal MAD packet packets with a header containing the
the source and destination queue pair numbers; fields that
were originally retrieved from the BTH/DETH are now populated
from this header as well as from the 16B LRH (e.g. pkey).

16B Management Packets are used as an optimized management
format on 16B fabrics.

These management packets have an opcode of IB_OPCODE_UD_SEND_ONLY,
a fixed 3Byte pad, and a header length of 24Bytes.

The decision as to when we send a management packet is based
upon either the source or destination queue pair number being
0 or 1.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-24 09:39:25 -06:00
Sebastian Sanchez
5d18ee67d4 IB/{hfi1, rdmavt, qib}: Implement CQ completion vector support
Currently the driver doesn't support completion vectors. These
are used to indicate which sets of CQs should be grouped together
into the same vector. A vector is a CQ processing thread that
runs on a specific CPU.

If an application has several CQs bound to different completion
vectors, and each completion vector runs on different CPUs, then
the completion queue workload is balanced. This helps scale as more
nodes are used.

Implement CQ completion vector support using a global workqueue
where a CQ entry is queued to the CPU corresponding to the CQ's
completion vector. Since the workqueue is global, it's guaranteed
to always be there when queueing CQ entries; Therefore, the RCU
locking for cq->rdi->worker in the hot path is superfluous.

Each completion vector is assigned to a different CPU. The number of
completion vectors available is computed by taking the number of
online, physical CPUs from the local NUMA node and subtracting the
CPUs used for kernel receive queues and the general interrupt.
Special use cases:

  * If there are no CPUs left for completion vectors, the same CPU
    for the general interrupt is used; Therefore, there would only
    be one completion vector available.

  * For multi-HFI systems, the number of completion vectors available
    for each device is the total number of completion vectors in
    the local NUMA node divided by the number of devices in the same
    NUMA node. If there's a division remainder, the first device to
    get initialized gets an extra completion vector.

Upon a CQ creation, an invalid completion vector could be specified.
Handle it as follows:

  * If the completion vector is less than 0, set it to 0.

  * Set the completion vector to the result of the passed completion
    vector moded with the number of device completion vectors
    available.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-09 15:53:30 -04:00
Mitko Haralanov
a74d5307ca IB/hfi1: Rework fault injection machinery
The packet fault injection code present in the HFI1 driver had some
issues which not only fragment the code but also created user
confusion. Furthermore, it suffered from the following issues:

  1. The fault_packet method only worked for received packets. This
     meant that the only fault injection mode available for sent
     packets is fault_opcode, which did not allow for random packet
     drops on all egressing packets.
  2. The mask available for the fault_opcode mode did not really work
     due to the fact that the opcode values are not bits in a bitmask but
     rather sequential integer values. Creating a opcode/mask pair that
     would successfully capture a set of packets was nearly impossible.
  3. The code was fragmented and used too many debugfs entries to
     operate and control. This was confusing to users.
  4. It did not allow filtering fault injection on a per direction basis -
     egress vs. ingress.

In order to improve or fix the above issues, the following changes have
been made:

   1. The fault injection methods have been combined into a single fault
      injection facility. As such, the fault injection has been plugged
      into both the send and receive code paths. Regardless of method used
      the fault injection will operate on both egress and ingress packets.
   2. The type of fault injection - by packet or by opcode - is now controlled
      by changing the boolean value of the file "opcode_mode". When the value
      is set to True, fault injection is done by opcode. Otherwise, by
      packet.
   2. The masking ability has been removed in favor of a bitmap that holds
      opcodes of interest (one bit per opcode, a total of 256 bits). This
      works in tandem with the "opcode_mode" value. When the value of
      "opcode_mode" is False, this bitmap is ignored. When the value is
      True, the bitmap lists all opcodes to be considered for fault injection.
      By default, the bitmap is empty. When the user wants to filter by opcode,
      the user sets the corresponding bit in the bitmap by echo'ing the bit
      position into the 'opcodes' file. This gets around the issue that the set
      of opcodes does not lend itself to effective masks and allow for extremely
      fine-grained filtering by opcode.
   4. fault_packet and fault_opcode methods have been combined. Hence, there
      is only one debugfs directory controlling the entire operation of the
      fault injection machinery. This reduces the number of debugfs entries
      and provides a more unified user experience.
   5. A new control files - "direction" - is provided to allow the user to
      control the direction of packets, which are subject to fault injection.
   6. A new control file - "skip_usec" - is added that would allow the user
      to specify a "timeout" during which no fault injection will occur.

In addition, the following bug fixes have been applied:

   1. The fault injection code has been split into its own header and source
      files. This was done to better organize the code and support conditional
      compilation without littering the code with #ifdef's.
   2. The method by which the TX PIO packets were being marked for drop
      conflicted with the way send contexts were being setup. As a result,
      the send context was repeatedly being reset.
   3. The fault injection only makes sense when the user can control it
      through the debugfs entries. However, a kernel configuration can
      enable fault injection but keep fault injection debugfs entries
      disabled. Therefore, it makes sense that the HFI fault injection
      code depends on both.
   4. Error suppression did not take into account the method by which PIO
      packets were being dropped. Therefore, even with error suppression
      turned on, errors would still be displayed to the screen. A larger
      enough packet drop percentage would case the kernel to crash because
      the driver would be stuck printing errors.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-09 15:53:30 -04:00
Matan Barak
0ede73bc01 IB/uverbs: Extend uverbs_ioctl header with driver_id
Extending uverbs_ioctl header with driver_id and another reserved
field. driver_id should be used in order to identify the driver.
Since every driver could have its own parsing tree, this is necessary
for strace support.
Downstream patches take off the EXPERIMENTAL flag from the ioctl() IB
support and thus we add some reserved fields for future usage.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-19 14:45:17 -06:00
Mitko Haralanov
9636258f10 IB/hfi1: Remove dependence on qp->s_hdrwords
The s_hdrwords variable was used to indicate whether a
packet was already built on a previous iteration of the
send engine. This variable assumed the protection of the
QP's RVT_S_BUSY flag, which was required since the the
QP's s_lock was dropped just prior to the packet being
queued on the one of the egress mechanisms.

Support for multiple send engine instantiations require
that the field not be used due to concurency issues.
The ps.txreq signals the "already built" without the
potential concurency issues.

Fix by getting rid of all s_hdrword usage.   A wrapper
is added to test for the already built case that used to
use s_hdrwords.

What used to be stored in s_hdrwords is now in the txreq.
The PBC is not counted, but is added in the pio/sdma code
paths prior to posting the packet.

Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:24:32 -07:00
Michael J. Ruhl
06f2597f75 IB/{rdmavt, hfi1, qib}: Remove get_card_name() downcall
rdmavt has a down call to client drivers to retrieve a crafted card
name.

This name should be the IB defined name.

Rather than craft the name each time it is needed, simply retrieve
the IB allocated name from the IB device.

Update the function name to reflect its application.

Clean up driver code to match this change.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-05 13:34:55 -05:00
Michael J. Ruhl
5084c8ff21 IB/{rdmavt, hfi1, qib}: Self determine driver name
Currently the HFI and QIB drivers allow the IB core to assign a unit
number to the driver name string.

If multiple devices exist in a system, there is a possibility that the
device unit number and the IB core number will be mismatched.

Fix by using the driver defined unit number to generate the device
name.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-05 13:34:55 -05:00