Commit graph

364 commits

Author SHA1 Message Date
Parav Pandit
1dfce29457 IB: Replace ib_query_gid/ib_get_cached_gid with rdma_query_gid
If the gid_attr argument is NULL then the functions behave identically to
rdma_query_gid. ib_query_gid just calls ib_get_cached_gid, so everything
can be consolidated to one function.

Now that all callers either use rdma_query_gid() or ib_get_cached_gid(),
ib_query_gid() API is removed.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 11:09:05 -06:00
Parav Pandit
f4df9a7c34 RDMA: Use GID from the ib_gid_attr during the add_gid() callback
Now that ib_gid_attr contains the GID, make use of that in the add_gid()
callback functions for the provider drivers to simplify the add_gid()
implementations.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 11:09:05 -06:00
Parav Pandit
b150c3862d IB/core: Introduce GID entry reference counts
In order to be able to expose pointers to the ib_gid_attrs in the GID
table we need to make it so the value of the pointer cannot be
changed. Thus each GID table entry gets a unique piece of kref'd memory
that is written only during initialization and remains constant for its
lifetime.

This eventually will allow the struct ib_gid_attrs to be returned without
copy from many of query the APIs, but it also provides a way to track when
all users of a HW table index go away.

For roce we no longer allow an in-use HW table index to be re-used for a
new an different entry. When a GID table entry needs to be removed it is
hidden from the find API, but remains as a valid HW index and all
ib_gid_attr points remain valid. The HW index is not relased until all
users put the kref.

Later patches will broadly replace the use of the sgid_index integer with
the kref'd structure.

Ultimately this will prevent security problems where the OS changes the
properties of a HW GID table entry while an active user object is still
using the entry.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 11:09:05 -06:00
Matthew Wilcox
7654cb1ba7 Convert infiniband uverbs to struct_size
The flows were hidden from the C compiler; expose them as a zero-length
array to allow struct_size to work.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Max Gurtovoy
ca24da008f RDMA/core: introduce check masks for T10-PI offload
T10-PI offload capability is currently supported in iSER protocol only,
and the definition of the HCA protection information checks are missing
from the core layer. Add those definition to avoid code duplication in
other drivers (such iSER target and NVMeoF).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-04 09:19:24 -06:00
Jason Gunthorpe
0f45e69d62 Verbs flow counters support
This series comes to allow user space applications to monitor real time
 traffic activity and events of the verbs objects it manages, e.g.:
 ibv_qp, ibv_wq, ibv_flow.
 
 This API enables generic counters creation and define mapping
 to association with a verbs object, current mlx5 driver using
 this API for flow counters.
 
 With this API, an application can monitor the entire life cycle of
 object activity, defined here as a static counters attachment.
 This API also allows dynamic counters monitoring of measurement points
 for a partial period in the verbs object life cycle.
 
 In addition it presents the implementation of the generic counters interface.
 
 This will be achieved by extending flow creation by adding a new flow count
 specification type which allows the user to associate a previously created
 flow counters using the generic verbs counters interface to the created flow,
 once associated the user could read statistics by using the read function of
 the generic counters interface.
 
 The API includes:
 1. create and destroyed API of a new counters objects
 2. read the counters values from HW
 
 Note:
 Attaching API to allow application to define the measurement points per objects
 is a user space only API and this data is passed to kernel when the counted
 object (e.g. flow) is created with the counters object.
 -----BEGIN PGP SIGNATURE-----
 
 iHQEABYIAB0WIQT1m3YD37UfMCUQBNwp8NhrnBAZsQUCWxIiqQAKCRAp8NhrnBAZ
 sWJRAPYl06nEfQjRlW//ZE/pO2oKXbfEevg7nnbpe80ERlxLAQDA2LHAcU7ma/NC
 hS5yxIq1gLSA27N+5qAoFVK8vJ5ZCg==
 =EiAV
 -----END PGP SIGNATURE-----

Merge tag 'verbs_flow_counters' of git://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git into for-next

Pull verbs counters series from Leon Romanovsky:

====================
Verbs flow counters support

This series comes to allow user space applications to monitor real time
traffic activity and events of the verbs objects it manages, e.g.: ibv_qp,
ibv_wq, ibv_flow.

The API enables generic counters creation and define mapping to
association with a verbs object, the current mlx5 driver is using this API
for flow counters.

With this API, an application can monitor the entire life cycle of object
activity, defined here as a static counters attachment.  This API also
allows dynamic counters monitoring of measurement points for a partial
period in the verbs object life cycle.

In addition it presents the implementation of the generic counters
interface.

This will be achieved by extending flow creation by adding a new flow
count specification type which allows the user to associate a previously
created flow counters using the generic verbs counters interface to the
created flow, once associated the user could read statistics by using the
read function of the generic counters interface.

The API includes:
1. create and destroyed API of a new counters objects
2. read the counters values from HW

Note:
Attaching API to allow application to define the measurement points per
objects is a user space only API and this data is passed to kernel when
the counted object (e.g. flow) is created with the counters object.
===================

* tag 'verbs_flow_counters':
  IB/mlx5: Add counters read support
  IB/mlx5: Add flow counters read support
  IB/mlx5: Add flow counters binding support
  IB/mlx5: Add counters create and destroy support
  IB/uverbs: Add support for flow counters
  IB/core: Add support for flow counters
  IB/core: Support passing uhw for create_flow
  IB/uverbs: Add read counters support
  IB/core: Introduce counters read verb
  IB/uverbs: Add create/destroy counters support
  IB/core: Introduce counters object and its create/destroy
  IB/uverbs: Add an ib_uobject getter to ioctl() infrastructure
  net/mlx5: Export flow counter related API
  net/mlx5: Use flow counter pointer as input to the query function
2018-06-04 08:48:11 -06:00
Raed Salem
7eea23a5cd IB/core: Add support for flow counters
A counters object could be attached to flow on creation by providing the
counter specification action.

General counters description which count packets and bytes are introduced,
downstream patches from this series will use them as part of flow counters
binding.

In addition, increase number of flow specifications supported layers to 10
upon adding count specification and for the previously added drop
specification.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Raed Salem <raeds@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-02 07:33:56 +03:00
Matan Barak
59082a327d IB/core: Support passing uhw for create_flow
This is required when user-space drivers need to pass extra information
regarding how to handle this flow steering specification.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-02 07:33:55 +03:00
Raed Salem
51d7a53874 IB/core: Introduce counters read verb
The user supplies counters instance and a reference to an output array of
uint64_t.  The driver reads the hardware counters values and writes them
to the output index location in the user supplied array.  All counters
values are represented as uint64_t types.

To be able to successfully read the data the counters must be first bound
to an IB object.

Downstream patches will present binding method for flow counters.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Raed Salem <raeds@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-02 07:33:54 +03:00
Raed Salem
fa9b1802d1 IB/core: Introduce counters object and its create/destroy
A verbs application may need to get statistics and info on various aspects
of a verb object (e.g. Flow, QP, ...), in general case the application
will state which object's counters its interested in (we refer to this
action as attach), bind this new counters object to the appropriate verb
object and on later stage read their values using the counters object.

This series introduces a general API for counters object that may
accumulate any ib object counters type, bound and read on demand.

Counters instance is allocated on an IB context and belongs to that
context.  Upon successful creation the counters can be bound to a verbs
object so that hardware counter instances can be created and read.

Downstream patches in this series will introduce the attach, bind and the
read functionality.

Counters instance can be de-allocated, upon successful destruction the
related hardware resources are released.

Prior to destroy call the user must first make sure that the counters is
not being used by any IB object, e.g. not attached to any of its counted
type otherwise an EBUSY error is invoked.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Raed Salem <raeds@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-02 07:33:54 +03:00
Jason Gunthorpe
0394808d9e Merge branch 'mr_fix' into git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma for-next
Update mlx4 to support user MR creation against read-only memory, previously
it required the memory to be writable.

Based on rdma for-rc due to dependencies.

* mr_fix: (2 commits)
  IB/mlx4: Mark user MR as writable if actual virtual memory is writable
  IB/core: Make testing MR flags for writability a static inline function
2018-05-28 11:44:35 -06:00
Jack Morgenstein
08bb558ac1 IB/core: Make testing MR flags for writability a static inline function
Make the MR writability flags check, which is performed in umem.c,
a static inline function in file ib_verbs.h

This allows the function to be used by low-level infiniband drivers.

Cc: <stable@vger.kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-05-28 11:41:39 -06:00
Ariel Levkovich
b04f0f036a IB/uverbs: Introduce a MPLS steering match filter
Add a new MPLS steering match filter that can match against
a single MPLS tag field.

Since the MPLS header can reside in different locations in the packet's
protocol stack as well as be encapsulated with a tunnel protocol, it
is required to know the exact location of the header in the protocol
stack.

Therefore, when including the MPLS protocol spec in the specs list,
it is mandatory to provide the list in an ordered manner, so
that it represents the actual header order in a matching packet.

Drivers that process the spec list and apply the matching rule
should treat the position of the MPLS spec in the spec list as the
actual location of the MPLS label in the packet's protocol stack.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-16 21:32:55 -06:00
Ariel Levkovich
d90e5e5038 IB/uverbs: Introduce a GRE steering match filter
Adding a new GRE steering match filter that can match against
key and protocol fields.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-16 21:32:54 -06:00
Linus Torvalds
19fd08b85b Merge candidates for 4.17 merge window
- Fix RDMA uapi headers to actually compile in userspace and be more
   complete
 
 - Three shared with netdev pull requests from Mellanox:
 
    * 7 patches, mostly to net with 1 IB related one at the back). This
      series addresses an IRQ performance issue (patch 1), cleanups related to
      the fix for the IRQ performance problem (patches 2-6), and then extends
      the fragmented completion queue support that already exists in the net
      side of the driver to the ib side of the driver (patch 7).
 
    * Mostly IB, with 5 patches to net that are needed to support the remaining
      10 patches to the IB subsystem. This series extends the current
      'representor' framework when the mlx5 driver is in switchdev mode from
      being a netdev only construct to being a netdev/IB dev construct. The IB
      dev is limited to raw Eth queue pairs only, but by having an IB dev of
      this type attached to the representor for a switchdev port, it enables
      DPDK to work on the switchdev device.
 
    * All net related, but needed as infrastructure for the rdma driver
 
 - Updates for the hns, i40iw, bnxt_re, cxgb3, cxgb4, hns drivers
 
 - SRP performance updates
 
 - IB uverbs write path cleanup patch series from Leon
 
 - Add RDMA_CM support to ib_srpt. This is disabled by default.  Users need to
   set the port for ib_srpt to listen on in configfs in order for it to be
   enabled (/sys/kernel/config/target/srpt/discovery_auth/rdma_cm_port)
 
 - TSO and Scatter FCS support in mlx4
 
 - Refactor of modify_qp routine to resolve problems seen while working on new
   code that is forthcoming
 
 - More refactoring and updates of RDMA CM for containers support from Parav
 
 - mlx5 'fine grained packet pacing', 'ipsec offload' and 'device memory'
   user API features
 
 - Infrastructure updates for the new IOCTL interface, based on increased usage
 
 - ABI compatibility bug fixes to fully support 32 bit userspace on 64 bit
   kernel as was originally intended. See the commit messages for
   extensive details
 
 - Syzkaller bugs and code cleanups motivated by them
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCgAGBQJax5Z0AAoJEDht9xV+IJsacCwQAJBIgmLCvVp5fBu2kJcXMMVI
 y3l2YNzAUJvDDKv1r5yTC9ugBXEkDtgzi/W/C2/5es2yUG/QeT/zzQ3YPrtsnN68
 5FkiXQ35Tt7+PBHMr0cacGRmF4M3Td3MeW0X5aJaBKhqlNKwA+aF18pjGWBmpVYx
 URYCwLb5BZBKVh4+1Leebsk4i0/7jSauAqE5M+9notuAUfBCoY1/Eve3DipEIBBp
 EyrEnMDIdujYRsg4KHlxFKKJ1EFGItknLQbNL1+SEa0Oe0SnEl5Bd53Yxfz7ekNP
 oOWQe5csTcs3Yr4Ob0TC+69CzI71zKbz6qPDILTwXmsPFZJ9ipJs4S8D6F7ra8tb
 D5aT1EdRzh/vAORPC9T3DQ3VsHdvhwpUMG7knnKrVT9X/g7E+gSji1BqaQaTr/xs
 i40GepHT7lM/TWEuee/6LRpqdhuOhud7vfaRFwn2JGRX9suqTcvwhkBkPUDGV5XX
 5RkHcWOb/7KvmpG7S1gaRGK5kO208LgmAZi7REaJFoZB74FqSneMR6NHIH07ha41
 Zou7rnxV68CT2bgu27m+72EsprgmBkVDeEzXgKxVI/+PZ1oadUFpgcZ3pRLOPWVx
 rEqjHu65rlA/YPog4iXQaMfSwt/oRD3cVJS/n8EdJKXi4Qt2RDDGdyOmt74w4prM
 QuLEdvJIFmwrND1KDoqn
 =Ku8g
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Doug and I are at a conference next week so if another PR is sent I
  expect it to only be bug fixes. Parav noted yesterday that there are
  some fringe case behavior changes in his work that he would like to
  fix, and I see that Intel has a number of rc looking patches for HFI1
  they posted yesterday.

  Parav is again the biggest contributor by patch count with his ongoing
  work to enable container support in the RDMA stack, followed by Leon
  doing syzkaller inspired cleanups, though most of the actual fixing
  went to RC.

  There is one uncomfortable series here fixing the user ABI to actually
  work as intended in 32 bit mode. There are lots of notes in the commit
  messages, but the basic summary is we don't think there is an actual
  32 bit kernel user of drivers/infiniband for several good reasons.

  However we are seeing people want to use a 32 bit user space with 64
  bit kernel, which didn't completely work today. So in fixing it we
  required a 32 bit rxe user to upgrade their userspace. rxe users are
  still already quite rare and we think a 32 bit one is non-existing.

   - Fix RDMA uapi headers to actually compile in userspace and be more
     complete

   - Three shared with netdev pull requests from Mellanox:

      * 7 patches, mostly to net with 1 IB related one at the back).
        This series addresses an IRQ performance issue (patch 1),
        cleanups related to the fix for the IRQ performance problem
        (patches 2-6), and then extends the fragmented completion queue
        support that already exists in the net side of the driver to the
        ib side of the driver (patch 7).

      * Mostly IB, with 5 patches to net that are needed to support the
        remaining 10 patches to the IB subsystem. This series extends
        the current 'representor' framework when the mlx5 driver is in
        switchdev mode from being a netdev only construct to being a
        netdev/IB dev construct. The IB dev is limited to raw Eth queue
        pairs only, but by having an IB dev of this type attached to the
        representor for a switchdev port, it enables DPDK to work on the
        switchdev device.

      * All net related, but needed as infrastructure for the rdma
        driver

   - Updates for the hns, i40iw, bnxt_re, cxgb3, cxgb4, hns drivers

   - SRP performance updates

   - IB uverbs write path cleanup patch series from Leon

   - Add RDMA_CM support to ib_srpt. This is disabled by default. Users
     need to set the port for ib_srpt to listen on in configfs in order
     for it to be enabled
     (/sys/kernel/config/target/srpt/discovery_auth/rdma_cm_port)

   - TSO and Scatter FCS support in mlx4

   - Refactor of modify_qp routine to resolve problems seen while
     working on new code that is forthcoming

   - More refactoring and updates of RDMA CM for containers support from
     Parav

   - mlx5 'fine grained packet pacing', 'ipsec offload' and 'device
     memory' user API features

   - Infrastructure updates for the new IOCTL interface, based on
     increased usage

   - ABI compatibility bug fixes to fully support 32 bit userspace on 64
     bit kernel as was originally intended. See the commit messages for
     extensive details

   - Syzkaller bugs and code cleanups motivated by them"

* tag 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (199 commits)
  IB/rxe: Fix for oops in rxe_register_device on ppc64le arch
  IB/mlx5: Device memory mr registration support
  net/mlx5: Mkey creation command adjustments
  IB/mlx5: Device memory support in mlx5_ib
  net/mlx5: Query device memory capabilities
  IB/uverbs: Add device memory registration ioctl support
  IB/uverbs: Add alloc/free dm uverbs ioctl support
  IB/uverbs: Add device memory capabilities reporting
  IB/uverbs: Expose device memory capabilities to user
  RDMA/qedr: Fix wmb usage in qedr
  IB/rxe: Removed GID add/del dummy routines
  RDMA/qedr: Zero stack memory before copying to user space
  IB/mlx5: Add ability to hash by IPSEC_SPI when creating a TIR
  IB/mlx5: Add information for querying IPsec capabilities
  IB/mlx5: Add IPsec support for egress and ingress
  {net,IB}/mlx5: Add ipsec helper
  IB/mlx5: Add modify_flow_action_esp verb
  IB/mlx5: Add implementation for create and destroy action_xfrm
  IB/uverbs: Introduce ESP steering match filter
  IB/uverbs: Add modify ESP flow_action
  ...
2018-04-06 17:35:43 -07:00
Ariel Levkovich
be934cca9e IB/uverbs: Add device memory registration ioctl support
Adding new ioctl method for the MR object - REG_DM_MR.

This command can be used by users to register an allocated
device memory buffer as an MR and receive lkey and rkey
to be used within work requests.

It is added as a new method under the MR object and using a new
ib_device callback - reg_dm_mr.
The command creates a standard ib_mr object which represents the
registered memory.

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-05 11:16:39 -06:00
Ariel Levkovich
bee76d7ab5 IB/uverbs: Add alloc/free dm uverbs ioctl support
This change adds uverbs support for allocation/freeing
of device memory commands.

A new uverbs object is defined of type idr to represent
and track the new resource type allocation per context.

The API requires provider driver to implement 2 new ib_device
callbacks - one for allocation and one for deallocation which
return and accept (respectively) the ib_dm object which represents
the allocated memory on the device.

The support is added via the ioctl command infrastructure
only.

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-05 11:16:39 -06:00
Ariel Levkovich
1d8eeb9f6a IB/uverbs: Add device memory capabilities reporting
This change allows vendors to report device memory capability
max_dm_size - to user via uverbs command.

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-05 11:16:39 -06:00
Matan Barak
56ab0b38b8 IB/uverbs: Introduce ESP steering match filter
Adding a new ESP steering match filter that could match against
spi and seq used in IPSec protocol.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-04 12:06:26 -06:00
Matan Barak
7d12f8d5a1 IB/uverbs: Add modify ESP flow_action
flow_actions of ESP type could be modified during runtime. This could be
common for example when ESN should be changed. Adding a new
UVERBS_FLOW_ACTION_ESP_MODIFY method for changing ESP parameters of an
existing ESP flow_action.
The new method uses the UVERBS_FLOW_ACTION_ESP_CREATE attributes, but
adds a new IB_FLOW_ACTION_ESP_FLAGS_MOD_ESP_ATTRS which means ESP_ATTRS
should be changed.
In addition, we add a new FLOW_ACTION_ESP_REPLAY_NONE replay type that
could be used when one wants to disable a replay protection over a
specific flow_action.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-04 12:06:26 -06:00
Boris Pismenny
21e82d3e1d IB/uverbs: Introduce egress flow steering
The egress flag indicates that this flow steering rule is for egress
traffic. The scope of an egress rule is port-wide, meaning all packets
originated from that port, which match the steering rule specification
will be effected by this steering rule's action.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-04 12:06:25 -06:00
Matan Barak
9b82844197 IB/uverbs: Add action_handle flow steering specification
Binding a flow_action to flow steering rule requires using a new
specification. Therefore, adding such an IB_FLOW_SPEC_ACTION_HANDLE flow
specification.

Flow steering rules could use flow_action(s) and as of that we need to
avoid deleting flow_action(s) as long as they're being used.
Moreover, when the attached rules are deleted, action_handle reference
count should be decremented. Introducing a new mechanism of flow
resources to keep track on the attached action_handle(s). Later on, this
mechanism should be extended to other attached flow steering resources
like flow counters.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-04 12:06:25 -06:00
Matan Barak
2eb9beaee5 IB/uverbs: Add flow_action create and destroy verbs
A verbs application may receive and transmits packets using a data
path pipeline. Sometimes, the first stage in the receive pipeline or
the last stage in the transmit pipeline involves transforming a
packet, either in order to make it easier for later stages to process
it or to prepare it for transmission over the wire. Such transformation
could be stripping/encapsulating the packet (i.e. vxlan),
decrypting/encrypting it (i.e. ipsec), altering headers, doing some
complex FPGA changes, etc.

Some hardware could do such transformations without software data path
intervention at all. The flow steering API supports steering a
packet (either to a QP or dropping it) and some simple packet
immutable actions (i.e. tagging a packet). Complex actions, that may
change the packet, could bloat the flow steering API extensively.
Sometimes the same action should be applied to several flows.
In this case, it's easier to bind several flows to the same action and
modify it than change all matching flows.

Introducing a new flow_action object that abstracts any packet
transformation (out of a standard and well defined set of actions).
This flow_action object could be tied to a flow steering rule via a
new specification.

Currently, we support esp flow_action, which encrypts or decrypts a
packet according to the given parameters. However, we present a
flexible schema that could be used to other transformation actions tied
to flow rules.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-04 12:06:25 -06:00
Parav Pandit
414448d249 RDMA: Use ib_gid_attr during GID modification
Now that ib_gid_attr contains device, port and index, simplify the
provider APIs add_gid() and del_gid() to use device, port and index
fields from the ib_gid_attr attributes structure.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03 21:34:16 -06:00
Parav Pandit
598ff6bae6 IB/core: Refactor GID modify code for RoCE
Code is refactored to prepare separate functions for RoCE which can do more
complex operations related to reference counting, while still
maintainining code readability. This includes
(a) Simplification to not perform netdevice checks and modifications
for IB link layer.
(b) Do not add RoCE GID entry which has NULL netdevice; instead return
an error.
(c) If GID addition fails at provider level add_gid(), do not add the
entry in the cache and keep the entry marked as INVALID.
(d) Simplify and reuse the ib_cache_gid_add()/del() routines so that they
can be used even for modifying default GIDs. This avoid some code
duplication in modifying default GIDs.
(e) find_gid() routine refers to the data entry flags to qualify a GID
as valid or invalid GID rather than depending on attributes and zeroness
of the GID content.
(f) gid_table_reserve_default() sets the GID default attribute at
beginning while setting up the GID table. There is no need to use
default_gid flag in low level functions such as write_gid(), add_gid(),
del_gid(), as they never need to update the DEFAULT property of the GID
entry while during GID table update.

As as result of this refactor, reserved GID 0:0:0:0:0:0:0:0 is no longer
searchable as described below.

A unicast GID entry of 0:0:0:0:0:0:0:0 is Reserved GID as per the IB
spec version 1.3 section 4.1.1, point (6) whose snippet is below.

"The unicast GID address 0:0:0:0:0:0:0:0 is reserved - referred to as
the Reserved GID. It shall never be assigned to any endport. It shall
not be used as a destination address or in a global routing header
(GRH)."

GID table cache now only stores valid GID entries. Before this patch,
Reserved GID 0:0:0:0:0:0:0:0 was searchable in the GID table using
ib_find_cached_gid_by_port() and other similar find routines.

Zero GID is no longer searchable as it shall not to be present in GRH or
path recored entry as described in IB spec version 1.3 section 4.1.1,
point (6), section 12.7.10 and section 12.7.20.

ib_cache_update() is simplified to check link layer once, use unified
locking scheme for all link layers, removed temporary gid table
allocation/free logic.

Additionally,
(a) Expand ib_gid_attr to store port and index so that GID query
routines can get port and index information from the attribute structure.
(b) Expand ib_gid_attr to store device as well so that in future code when
GID reference counting is done, device is used to reach back to the GID
table entry.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03 21:33:50 -06:00
Parav Pandit
72e1ff0fb7 RDMA/core: Update query_gid documentation for HCA drivers
query_gid() should return right GID value for iWarp and IB link layers.
It is a no-op for RoCE link layer.  Update the documentation to reflect
this.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03 21:09:01 -06:00
Mark Bloch
e945130b52 IB/core: Protect against concurrent access to hardware stats
Currently access to hardware stats buffer isn't protected, this can
result in multiple writes and reads at the same time to the same
memory location. This can lead to providing an incorrect value to
the user. Add a mutex to protect against it.

Fixes: b40f4757da ("IB/core: Make device counter infrastructure dynamic")
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-27 15:07:21 -06:00
Kirill Tkhai
070f2d7e26 net: Drop NETDEV_UNREGISTER_FINAL
Last user is gone after bdf5bd7f21 "rds: tcp: remove
register_netdevice_notifier infrastructure.", so we can
remove this netdevice command. This allows to delete
rtnl_lock() in netdev_run_todo(), which is hot path for
net namespace unregistration.

dev_change_net_namespace() and netdev_wait_allrefs()
have rcu_barrier() before NETDEV_UNREGISTER_FINAL call,
and the source commits say they were introduced to
delemit the call with NETDEV_UNREGISTER, but this patch
leaves them on the places, since they require additional
analysis, whether we need in them for something else.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 11:34:00 -04:00
Matan Barak
c66db31113 IB/uverbs: Safely extend existing attributes
Previously, we've used UVERBS_ATTR_SPEC_F_MIN_SZ for extending existing
attributes. The behavior of this flag was the kernel accepts anything
bigger than the minimum size it specified. This is unsafe, since in
order to safely extend an attribute, we need to make sure unknown size
is zeroed. Replacing UVERBS_ATTR_SPEC_F_MIN_SZ with
UVERBS_ATTR_SPEC_F_MIN_SZ_OR_ZERO, which essentially checks that the
unknown size is zero. In addition, attributes are now decorated with
UVERBS_ATTR_TYPE and UVERBS_ATTR_STRUCT, so we can provide the minimum
and known length.

Users of this flag needs to use copy_from_or_zero functions/macros.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-19 14:45:17 -06:00
Matan Barak
0ede73bc01 IB/uverbs: Extend uverbs_ioctl header with driver_id
Extending uverbs_ioctl header with driver_id and another reserved
field. driver_id should be used in order to identify the driver.
Since every driver could have its own parsing tree, this is necessary
for strace support.
Downstream patches take off the EXPERIMENTAL flag from the ioctl() IB
support and thus we add some reserved fields for future usage.

Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-19 14:45:17 -06:00
Leon Romanovsky
80cf79ae4f RDMA/verbs: Remove restrack entry from XRCD structure
XRCD object is not implemented in the restrack, so lets remove it.

Fixes: 02d8883f52 ("RDMA/restrack: Add general infrastructure to track RDMA resources")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-19 14:10:30 -06:00
Parav Pandit
b19744e965 IB/core: Remove unimplemented ib_peek_cq
ib_peek_cq() verb doesn't seem be implemented in current code.
There is some past reference to it at [1] about it being unimplemented.

Lot of user documentation created out of kdoc refers to this
unimplemented API. Therefore, remove unimplemented API.

[1] http://lists.openfabrics.org/pipermail/ofw/2008-May/002465.html
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-19 11:41:41 -06:00
Parav Pandit
b26c4a1138 IB/{core, ipoib}: Simplify ib_find_gid() for unused ndev
ib_find_gid() is only used by IPoIB driver. For IB link layer, GID table
entries are not based on netdevice. Netdevice parameter is unused here.
Therefore, it is removed.

Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-15 14:40:37 -06:00
Leon Romanovsky
19b1f54099 RDMA/verbs: Simplify modify QP check
All callers to ib_modify_qp_is_ok() provides enum ib_qp_state
makes the checks of out-of-scope redundant. Let's remove them
together with updating function signature to return boolean result.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-14 15:34:25 -04:00
Steve Wise
fccec5b89a RDMA/nldev: provide detailed MR information
Implement the RDMA nldev netlink interface for dumping detailed
MR information.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-08 15:03:03 -05:00
Don Hiatt
87daac68f7 IB/core: Map iWarp AH type to undefined in rdma_ah_find_type
iWarp devices do not support the creation of address handles
so return AH_ATTR_TYPE_UNDEFINED for all iWarp devices.

While we are here reduce the size of port_num to u8 and add
a comment.

Fixes: 44c58487d5 ("IB/core: Define 'ib' and 'roce' rdma_ah_attr types")
Reported-by: Parav Pandit <parav@mellanox.com>
CC: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:43:31 -07:00
Leon Romanovsky
02d8883f52 RDMA/restrack: Add general infrastructure to track RDMA resources
The RDMA subsystem has very strict set of objects to work with, but it
completely lacks tracking facilities and has no visibility of resource
utilization.

The following patch adds such infrastructure to keep track of RDMA
resources to help with debugging of user space applications. The primary
user of this infrastructure is RDMA nldev netlink (following patches), to
be exposed to userspace via rdmatool, but it is not limited too that.

At this stage, the main three objects (PD, CQ and QP) are added, and more
will be added later.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-29 20:21:39 -07:00
Leon Romanovsky
f66c8ba4c9 RDMA/core: Save kernel caller name when creating PD and CQ objects
The KBUILD_MODNAME variable contains the module name and it is known for
kernel users during compilation, so let's reuse it to track the owners.

Followup patches will store this for resource tracking.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-29 14:01:44 -07:00
Leon Romanovsky
e449644741 RDMA/core: Use the MODNAME instead of the function name for pd callers
Each of our modules only allocates a PD in one place, so there isn't any
loss in detail, while MODNAME is more useful and recognizable as something
to expose to the user.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-29 14:01:44 -07:00
Jason Gunthorpe
beb801ac51 RDMA: Move enum ib_cq_creation_flags to uapi headers
The flags field the enum is used with comes directly from the uapi
so it belongs in the uapi headers for clarity and so userspace can
use it.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-29 12:58:34 -07:00
Parav Pandit
a6532e7139 RDMA/core: Clarify rdma_ah_find_type
iWARP does not use rdma_ah_attr_type, and for this reason we do not have a
RDMA_AH_ATTR_TYPE_IWARP. rdma_ah_find_type should not even be called on iwarp
ports and for clarity it shouldn't have a special test for iWarp.

This changes the result from RDMA_AH_ATTR_TYPE_ROCE to RDMA_AH_ATTR_TYPE_IB
when wrongly called on an iWarp port.

Fixes: 44c58487d5 ("IB/core: Define 'ib' and 'roce' rdma_ah_attr types")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-15 15:33:21 -07:00
Bodong Wang
cd2a6e7d38 IB/core: Fix ib_wc structure size to remain in 64 bytes boundary
The change of slid from u16 to u32 results in sizeof(struct ib_wc)
cross 64B boundary, which causes more cache misses. This patch
rearranges the fields and remain the size to 64B.

Pahole output before this change:

struct ib_wc {
        union {
                u64                wr_id;                /*           8 */
                struct ib_cqe *    wr_cqe;               /*           8 */
        };                                               /*     0     8 */
        enum ib_wc_status          status;               /*     8     4 */
        enum ib_wc_opcode          opcode;               /*    12     4 */
        u32                        vendor_err;           /*    16     4 */
        u32                        byte_len;             /*    20     4 */
        struct ib_qp *             qp;                   /*    24     8 */
        union {
                __be32             imm_data;             /*           4 */
                u32                invalidate_rkey;      /*           4 */
        } ex;                                            /*    32     4 */
        u32                        src_qp;               /*    36     4 */
        int                        wc_flags;             /*    40     4 */
        u16                        pkey_index;           /*    44     2 */

        /* XXX 2 bytes hole, try to pack */

        u32                        slid;                 /*    48     4 */
        u8                         sl;                   /*    52     1 */
        u8                         dlid_path_bits;       /*    53     1 */
        u8                         port_num;             /*    54     1 */
        u8                         smac[6];              /*    55     6 */

        /* XXX 1 byte hole, try to pack */

        u16                        vlan_id;              /*    62     2 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        u8                         network_hdr_type;     /*    64     1 */

        /* size: 72, cachelines: 2, members: 17 */
        /* sum members: 62, holes: 2, sum holes: 3 */
        /* padding: 7 */
        /* last cacheline: 8 bytes */
};

Pahole output after this change:

struct ib_wc {
        union {
                u64                wr_id;                /*           8 */
                struct ib_cqe *    wr_cqe;               /*           8 */
        };                                               /*     0     8 */
        enum ib_wc_status          status;               /*     8     4 */
        enum ib_wc_opcode          opcode;               /*    12     4 */
        u32                        vendor_err;           /*    16     4 */
        u32                        byte_len;             /*    20     4 */
        struct ib_qp *             qp;                   /*    24     8 */
        union {
                __be32             imm_data;             /*           4 */
                u32                invalidate_rkey;      /*           4 */
        } ex;                                            /*    32     4 */
        u32                        src_qp;               /*    36     4 */
        u32                        slid;                 /*    40     4 */
        int                        wc_flags;             /*    44     4 */
        u16                        pkey_index;           /*    48     2 */
        u8                         sl;                   /*    50     1 */
        u8                         dlid_path_bits;       /*    51     1 */
        u8                         port_num;             /*    52     1 */
        u8                         smac[6];              /*    53     6 */

        /* XXX 1 byte hole, try to pack */

        u16                        vlan_id;              /*    60     2 */
        u8                         network_hdr_type;     /*    62     1 */

        /* size: 64, cachelines: 1, members: 17 */
        /* sum members: 62, holes: 1, sum holes: 1 */
        /* padding: 1 */
};

Cc: <stable@vger.kernel.org> # v4.13
Fixes: 7db20ecd1d ("IB/core: Change wc.slid from 16 to 32 bits")
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-15 15:33:21 -07:00
Daniel Jurgens
32f69e4be2 {net, IB}/mlx5: Manage port association for multiport RoCE
When mlx5_ib_add is called determine if the mlx5 core device being
added is capable of dual port RoCE operation. If it is, determine
whether it is a master device or a slave device using the
num_vhca_ports and affiliate_nic_vport_criteria capabilities.

If the device is a slave, attempt to find a master device to affiliate it
with. Devices that can be affiliated will share a system image guid. If
none are found place it on a list of unaffiliated ports. If a master is
found bind the port to it by configuring the port affiliation in the NIC
vport context.

Similarly when mlx5_ib_remove is called determine the port type. If it's
a slave port, unaffiliate it from the master device, otherwise just
remove it from the unaffiliated port list.

The IB device is registered as a multiport device, even if a 2nd port is
not available for affiliation. When the 2nd port is affiliated later the
GID cache must be refreshed in order to get the default GIDs for the 2nd
port in the cache. Export roce_rescan_device to provide a mechanism to
refresh the cache after a new port is bound.

In a multiport configuration all IB object (QP, MR, PD, etc) related
commands should flow through the master mlx5_core_dev, other commands
must be sent to the slave port mlx5_core_mdev, an interface is provide
to get the correct mdev for non IB object commands.

Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-08 11:42:22 -07:00
Moni Shoua
8011c1e336 IB/core: Introduce driver QP type
Vendors can implement type of QPs that are not described in the
InfiniBand specification. To still be able to use the IB/core layer
services (e.g. user object management) without tainting this layer with
driver proprietary logic, a new QP type is added - IB_QPT_DRIVER. This
will be a general QP type that the core layer doesn't know about its true nature.
When a command like create_qp() is passed to a hardware driver the extra
data that is required is taken from the driver channel.
Downstream patches from this series will use that QP type in the mlx5
driver.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-08 11:38:50 -07:00
Parav Pandit
f6bdb14267 IB/{core, umad, cm}: Rename ib_init_ah_from_wc to ib_init_ah_attr_from_wc
Currently ib_init_ah_from_wc initializes address handle attributes and
not the address handle object itself.
To avoid confusion between ah_attr vs ah, ib_init_ah_from_wc is
renamed to ib_init_ah_attr_from_wc to reflect that its initialzes
ah_attr.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2017-12-18 15:37:11 -07:00
Parav Pandit
dbb12562f7 IB/{core, ipoib}: Simplify ib_find_gid to search only for IB link layer
Currently there are no users of ib_find_gid for RoCE transport. It is
only used by IPoIB.
Therefore its simplified to ignore RoCE ports and GID type check which
was previously done for every port.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2017-12-18 15:37:10 -07:00
Leon Romanovsky
4190b4e969 RDMA/core: Rename kernel modify_cq to better describe its usage
Current ib_modify_cq() is used to set CQ moderation parameters.

This patch renames ib_modify_cq() to be rdma_set_cq_moderation(),
because the kernel version of RDMA API doesn't need to follow already
exposed to user's API pattern (create_XXX/modify_XXX/query_XXX/destroy_XXX)
and better to have more accurate name which describes the actual usage.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-11-13 16:59:22 -05:00
Yonatan Cohen
18bd907292 IB/uverbs: Add CQ moderation capability to query_device
The query_device function can now obtain the maximum values for
cq_max_count and cq_period, needed for CQ moderation.
cq_max_count is a 16 bits number that determines the number
of CQEs to accumulate before generating an event.
cq_period is a 16 bits number that determines the timeout in micro
seconds from the last event generated, upon which a new event will
be generated even if cq_max_count was not reached.

Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-11-13 16:59:22 -05:00
Yonatan Cohen
869ddcf8b3 IB/uverbs: Allow CQ moderation with modify CQ
Uverbs support in modify_cq for CQ moderation only.
Gives ability to change cq_max_count and cq_period.
CQ moderation enhance performance by moderating the number
of CQEs needed to create an event instead of application
having to suffer from event per-CQE.
To achieve CQ moderation the application needs to set cq_max_count
and cq_period.
cq_max_count - defines the number of CQEs needed to create an event.
cq_period - defines the timeout (micro seconds) between last
            event and a new one that will occur even if
	    cq_max_count was not satisfied

Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-11-13 16:59:22 -05:00
Noa Osherovich
e1d2e88733 IB/core: Add PCI write end padding flags for WQ and QP
There are root complexes that are able to optimize their
performance when incoming data is multiple full cache lines.

PCI write end padding is the device's ability to pad the ending of
incoming packets (scatter) to full cache line such that the last
upstream write generated by an incoming packet will be a full cache
line.

Add a relevant entry to ib_device_cap_flags to report such capability
of an RDMA device.

Add the QP and WQ create flags:
 * A QP/WQ created with a scatter end padding flag will cause
   HW to pad the last upstream write generated by a packet to cache line.

User should consider several factors before activating this feature:
- In case of high CPU memory load (which may cause PCI back pressure in
  turn), if a large percent of the writes are partial cache line, this
  feature should be checked as an optional solution.
- This feature might reduce performance if most packets are between one
  and two cache lines and PCIe throughput has reached its maximum
  capacity. E.g. 65B packet from the network port will lead to 128B
  write on PCIe, which may cause traffic on PCIe to reach high
  throughput.

Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-11-10 13:50:27 -05:00