42 commits

Author SHA1 Message Date
Dennis Dalessandro
f48ad614c1 IB/hfi1: Move driver out of staging
The TODO list for the hfi1 driver was completed during 4.6. In addition
other objections raised (which are far beyond what was in the TODO list)
have been addressed as well. It is now time to move the driver out of
staging and into the drivers/infiniband sub-tree.

Reviewed-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-05-26 11:35:14 -04:00
Jianxin Xiong
b583faf4dc IB/hfi1: Fix bug that blocks process on exit after port bounce
During the processing of a user SDMA request, if there was an
error before the request counter was increased, the state of
the packet queue could be updated incorrectly, causing the
counter to underflow. As a result, the process could get
stuck later since the counter could never get back to 0.

This patch adds a condition to guard the packet queue update
so that the counter is only decreased if it has been increased
before the error happens.
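
The guard described above can be illustrated with a minimal user-space
sketch. The names used here (pkt_queue, n_reqs, process_request,
complete_request) and the fail_early/fail_late switches are hypothetical
and are not taken from the driver. The only point is that the error path
decrements the counter if, and only if, it was incremented first.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the packet queue; only the counter matters. */
struct pkt_queue {
	int n_reqs;	/* outstanding requests; a waiter needs this to reach 0 */
};

/*
 * Process one request. 'fail_early' simulates an error before the
 * counter is increased, 'fail_late' an error after it.
 * Returns 0 on success and -1 on error.
 */
static int process_request(struct pkt_queue *pq, bool fail_early,
			   bool fail_late)
{
	bool counted = false;
	int ret = 0;

	if (fail_early) {
		ret = -1;
		goto done;
	}

	pq->n_reqs++;
	counted = true;

	if (fail_late) {
		ret = -1;
		goto done;
	}

	/* Success: the request stays counted until it completes. */
	return 0;

done:
	/* Undo the increment only if it actually happened. */
	if (counted)
		pq->n_reqs--;
	return ret;
}

/* Called when a successfully submitted request completes. */
static void complete_request(struct pkt_queue *pq)
{
	assert(pq->n_reqs > 0);
	pq->n_reqs--;
}
```

Without the 'counted' check, the early-error path would drive n_reqs
below zero and a waiter polling for zero would never make progress.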

Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-05-26 11:23:11 -04:00
Mitko Haralanov
9565c6a37a IB/hfi1: Fix an interval RB node reference count leak
Commit e88c9271d9 ("IB/hfi1: Fix buffer cache corner case which
may cause corruption") introduced a bug which may cause a reference
count of an interval RB node to be leaked in the case where an SDMA
transfer from that node completes at the same time as the node is
being extended.

If a node is being extended, it is first removed from the RB tree
in order to be processed without the risk of an invalidation event
removing the node at the same time.

If an SDMA completion happens during that time, the completion handler
will fail to find the node in the RB tree and, therefore, fail to
correctly decrement its refcount. This leaves the node in the tree and
its pages pinned for the duration of the user process.

To prevent this from happening the io vector adds a reference to the
RB node, which is used during the SDMA completion instead of looking
up the node in the RB tree.

This change adds a performance improvement as a side effect by avoiding
the RB tree lookup.

Fixes: e88c9271d9 ("IB/hfi1: Fix buffer cache corner case which may cause corruption")
Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-05-26 11:23:10 -04:00
Sebastian Sanchez
e38d1e4f50 IB/hfi1: Check P_KEY for all sent packets from user mode
Add the P_KEY check for user-context mechanism for
both PIO and SDMA. For PIO, the
SendCtxtCheckEnable.DisallowKDETHPackets is set by
default. When the P_KEY is set,
SendCtxtCheckEnable.DisallowKDETHPackets is cleared.
For SDMA, a software check was included. This change
requires user processes to set the P_KEY before sending
any packets; otherwise, the sent packet will fail. The
original submission didn't have this check but it's
required.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 16:32:27 -04:00
Mitko Haralanov
e88c9271d9 IB/hfi1: Fix buffer cache races which may cause corruption
There are two possible causes for node/memory corruption both
of which are related to the cache eviction algorithm. One way
to cause corruption is due to the asynchronous nature of the
MMU invalidation and the locking used when invalidating a node.

The MMU invalidation routine would temporarily release the
RB tree lock to avoid a deadlock. However, this would allow
the eviction function to take the lock resulting in the removal
of cache nodes.

If the node being removed by the eviction code is the same as
the node being invalidated, the result is use after free.

The same is true in the other direction due to the temporary
release of the eviction list lock in the eviction loop.

Another corner case exists when dealing with the SDMA buffer
cache that could cause memory corruption of kernel memory.
The most common way in which this corruption exhibits itself
is a linked list node corruption. In that case, the kernel will
complain that a node with poisoned pointers is being removed.
The fact that the pointers are already poisoned means that the
node has already been removed from the list.

The root cause of this corruption was a mishandling of the
eviction list maintained by the driver. In order for this
to happen four conditions need to be satisfied:

   1. A node describing a user buffer already exists in the
      interval RB tree,
   2. The beginning of the current user buffer matches that
      node, but the buffer is bigger. This will cause the node
      to be extended.
   3. The amount of cached buffers is close to or at the limit
      of the buffer cache size.
   4. The node has dropped close to the end of the eviction
      list. This will cause the node to be considered for
      eviction.

If all of the above conditions have been satisfied, it is
possible for the eviction algorithm to evict the current node,
which will free the node without the driver knowing.

To solve both issues described above:
   - the locking around the MMU invalidation loop and cache
     eviction loop has been improved so locks are not released in
     the loop body,
   - a new RB function is introduced which will "atomically" find
     and remove the matching node from the RB tree, preventing the
     MMU invalidation loop from touching it, and
   - the node being extended by the pin_vector_pages() function is
     removed from the eviction list prior to calling the eviction
     function.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 16:32:26 -04:00
Mitko Haralanov
f53af85e47 IB/hfi1: Extract and reinsert MMU RB node on lookup
The page pinning function, which also maintains the pin cache,
behaves one of two ways when an exact buffer match is not found:
  1. If no node is found (a buffer with the same starting address
     is not found in the cache), a new node is created, the buffer
     pages are pinned, and the node is inserted into the RB tree, or
  2. If a node is found but the buffer in that node is a subset of
     the new user buffer, the node is extended with the new buffer
     pages.

Both modes of operation require (re-)insertion into the interval RB
tree.

When the node being inserted is a new node, the operations are pretty
simple. However, when the node is already existing and is being
extended, special care must be taken.

First, we want to guard against an asynchronous attempt to
delete the node by the MMU invalidation notifier. The simplest way to
do this is to remove the node from the RB tree, preventing the search
algorithm from finding it.

Second, the node needs to be re-inserted so it lands in the proper place
in the tree and the tree is correctly re-balanced. This also requires
the node to be removed from the RB tree.

This commit adds the hfi1_mmu_rb_extract() function, which will search
for a node in the interval RB tree matching an address and length and
remove it from the RB tree if found. This allows for both of the above
special cases to be handled in a single step.
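
The find-and-remove idea can be sketched in user space as follows. This
is an illustration only: a singly linked list stands in for the interval
RB tree, locking is omitted, and cache_extract() is a hypothetical name
that does not reproduce the signature of hfi1_mmu_rb_extract().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache node: a buffer described by start address and length. */
struct cache_node {
	unsigned long addr;
	unsigned long len;
	struct cache_node *next;
};

/*
 * Search for a node matching addr/len and unlink it in the same pass.
 * Returns the unlinked node, or NULL if there is no match. Once this
 * returns, a later search cannot find the node until the caller
 * re-inserts it.
 */
static struct cache_node *cache_extract(struct cache_node **head,
					unsigned long addr,
					unsigned long len)
{
	struct cache_node **pp;

	for (pp = head; *pp; pp = &(*pp)->next) {
		struct cache_node *n = *pp;

		if (n->addr == addr && n->len == len) {
			*pp = n->next;	/* unlink */
			n->next = NULL;
			return n;
		}
	}
	return NULL;
}
```

In a real driver the search and the removal would have to happen under
the same lock for the operation to appear atomic to other contexts.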

Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 16:32:26 -04:00
Mitko Haralanov
de79093b28 IB/hfi1: Correctly compute node interval
The computation of the interval of an interval RB node
was incorrect leading to data corruption due to the RB
search algorithm not properly finding all the RB nodes
in an MMU invalidation interval.

The problem stemmed from the fact that the beginning
address of the node's range was being aligned to a page
boundary. For certain buffer sizes, this would lead to
an end address calculation that was off by 1 page.

An important aspect of keeping the RB tree sane is also
updating the node's range in the case it's being extended.
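
A small worked example shows how such an off-by-one-page error can
arise. The code below is a sketch under assumptions: the function names
are hypothetical, a 4096-byte page is assumed, and the "buggy" form is
one plausible way to get the error, not a copy of the driver code.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE	UINT64_C(4096)
#define PAGE_MASK	(~(PAGE_SIZE - 1))

/* First byte of the page containing addr. */
static uint64_t node_start(uint64_t addr)
{
	return addr & PAGE_MASK;
}

/*
 * Buggy: the length is added to the page-aligned start, so the offset
 * of addr within its page is lost and the last page can be missed.
 */
static uint64_t node_last_buggy(uint64_t addr, uint64_t len)
{
	return node_start(addr) + len - 1;
}

/* Fixed: round the real end of the buffer up to a page boundary. */
static uint64_t node_last_fixed(uint64_t addr, uint64_t len)
{
	return ((addr + len + PAGE_SIZE - 1) & PAGE_MASK) - 1;
}
```

For a buffer at 0x1800 with length 0x1000, the data spans the pages at
0x1000 and 0x2000. The buggy form yields a last address of 0x1fff and so
covers one page only, while the fixed form yields 0x2fff.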

Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 16:32:26 -04:00
Mitko Haralanov
0ad2d3d05b IB/hfi1: Fix memory leak in user ExpRcv and SDMA
The driver had two memory leaks - one in the user
expected receive code and one in SDMA buffer cache.

The leak in the expected receive code only showed up
when the user/admin had set ulimit sufficiently low
and the driver did not have enough room in the cache
before hitting the limit of allowed cachable memory.

When this condition occurred, the driver returned
early signaling userland that it needed to free some
buffers to free up room in the cache.

The bug was that the driver was not cleaning up
allocated memory prior to returning early.

The leak in the SDMA buffer cache could occur (even
though it never did) when the insertion of a buffer
node in the interval RB tree failed. In this case, the
driver failed to unpin the pages of the node and
erroneously returned success.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 16:32:26 -04:00
Mitko Haralanov
4787bc5e17 IB/hfi1: Don't remove list entries if they are not in a list
The SDMA cache logic maintains an eviction list which is ordered
by most recently used user buffers. Upon errors or buffer freeing,
the list nodes were unconditionally being deleted. This would lead
to list corruption warnings if the nodes were never inserted in the
eviction list to begin with.

This commit prevents this by checking that the nodes are already
part of the eviction list.
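
The check can be sketched with a tiny circular doubly linked list. The
helpers below are simplified re-implementations modeled on the kernel's
list helpers, and evict_list_remove() is a hypothetical name. Whether
the driver performs the check in exactly this way is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* An entry that points at itself is not linked into any list. */
static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add(struct list_head *n, struct list_head *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static void list_del_init(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	init_list_head(e);
}

/* Hypothetical cache node with a link into the eviction list. */
struct cache_node {
	struct list_head evict_link;
};

/* Remove the node from the eviction list only if it is on one. */
static bool evict_list_remove(struct cache_node *node)
{
	if (list_empty(&node->evict_link))
		return false;
	list_del_init(&node->evict_link);
	return true;
}
```

The sketch relies on every node's link being initialized to point at
itself before use, so that "never inserted" can be told apart from
"inserted".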

Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 16:32:26 -04:00
Mitko Haralanov
849e3e9398 IB/hfi1: Prevent unpinning of wrong pages
The routine used by the SDMA cache to handle already
cached nodes can extend an already existing node.

In its error handling code, the routine will unpin pages
when not all pages of the buffer extension were pinned.

There was a bug in that part of the routine, which would
mistakenly unpin pages from the original set rather than
the newly pinned pages.

This commit fixes that bug by offsetting the page array
to the proper place pointing at the beginning of the newly
pinned pages.
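
The offsetting can be shown with a short user-space sketch. Everything
here is hypothetical: struct page is reduced to a pin counter, and
pin_pages(), unpin_pages() and extend_node() are illustrative names, not
the driver's functions. The 'limit' argument only exists to force a
partial pin.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page: only the pin count is modeled. */
struct page {
	int pin_count;
};

/* Pin up to 'want' pages, but stop after 'limit'. Returns pages pinned. */
static size_t pin_pages(struct page **pages, size_t want, size_t limit)
{
	size_t n = want < limit ? want : limit;
	size_t i;

	for (i = 0; i < n; i++)
		pages[i]->pin_count++;
	return n;
}

static void unpin_pages(struct page **pages, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		pages[i]->pin_count--;
}

/*
 * Extend a node from old_npages to new_npages. On a partial pin, undo
 * only the pages pinned by this call. They start at pages + old_npages,
 * not at pages, which holds the original and still valid set.
 */
static int extend_node(struct page **pages, size_t old_npages,
		       size_t new_npages, size_t limit)
{
	size_t want = new_npages - old_npages;
	size_t pinned = pin_pages(pages + old_npages, want, limit);

	if (pinned != want) {
		unpin_pages(pages + old_npages, pinned);
		return -1;
	}
	return 0;
}
```

Passing 'pages' instead of 'pages + old_npages' to unpin_pages() would
reproduce the class of bug described above: the original pages would
lose their pins while the newly pinned ones would stay pinned.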

Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 12:00:38 -04:00
Mitko Haralanov
f19bd643db IB/hfi1: Prevent NULL pointer dereferences in caching code
There is a potential kernel crash when the MMU notifier calls the
invalidation routines in the hfi1 pinned page caching code for sdma.

The invalidation routine could call the remove callback
for the node, which in turn ends up dereferencing the
current task_struct to get a pointer to the mm_struct.
However, the mm_struct pointer could be NULL resulting in
the following backtrace:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
    IP: [<ffffffffa041f75a>] sdma_rb_remove+0xaa/0x100 [hfi1]
    15
    task: ffff88085e66e080 ti: ffff88085c244000 task.ti: ffff88085c244000
    RIP: 0010:[<ffffffffa041f75a>]  [<ffffffffa041f75a>] sdma_rb_remove+0xaa/0x100 [hfi1]
    RSP: 0000:ffff88085c245878  EFLAGS: 00010002
    RAX: 0000000000000000 RBX: ffff88105b9bbd40 RCX: ffffea003931a830
    RDX: 0000000000000004 RSI: ffff88105754a9c0 RDI: ffff88105754a9c0
    RBP: ffff88085c245890 R08: ffff88105b9bbd70 R09: 00000000fffffffb
    R10: ffff88105b9bbd58 R11: 0000000000000013 R12: ffff88105754a9c0
    R13: 0000000000000001 R14: 0000000000000001 R15: ffff88105b9bbd40
    FS:  0000000000000000(0000) GS:ffff88107ef40000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000000000a8 CR3: 0000000001a0b000 CR4: 00000000001407e0
    Stack:
     ffff88105b9bbd40 ffff88080ec481a8 ffff88080ec481b8 ffff88085c2458c0
     ffffffffa03fa00e ffff88080ec48190 ffff88080ed9cd00 0000000001024000
     0000000000000000 ffff88085c245920 ffffffffa03fa0e7 0000000000000282
    Call Trace:
     [<ffffffffa03fa00e>] __mmu_rb_remove.isra.5+0x5e/0x70 [hfi1]
     [<ffffffffa03fa0e7>] mmu_notifier_mem_invalidate+0xc7/0xf0 [hfi1]
     [<ffffffffa03fa143>] mmu_notifier_page+0x13/0x20 [hfi1]
     [<ffffffff81156dd0>] __mmu_notifier_invalidate_page+0x50/0x70
     [<ffffffff81140bbb>] try_to_unmap_one+0x20b/0x470
     [<ffffffff81141ee7>] try_to_unmap_anon+0xa7/0x120
     [<ffffffff81141fad>] try_to_unmap+0x4d/0x60
     [<ffffffff8111fd7b>] shrink_page_list+0x2eb/0x9d0
     [<ffffffff81120ab3>] shrink_inactive_list+0x243/0x490
     [<ffffffff81121491>] shrink_lruvec+0x4c1/0x640
     [<ffffffff81121641>] shrink_zone+0x31/0x100
     [<ffffffff81121b0f>] kswapd_shrink_zone.constprop.62+0xef/0x1c0
     [<ffffffff811229e3>] kswapd+0x403/0x7e0
     [<ffffffff811225e0>] ? shrink_all_memory+0xf0/0xf0
     [<ffffffff81068ac0>] kthread+0xc0/0xd0
     [<ffffffff81068a00>] ? insert_kthread_work+0x40/0x40
     [<ffffffff814ff8ec>] ret_from_fork+0x7c/0xb0
     [<ffffffff81068a00>] ? insert_kthread_work+0x40/0x40

To correct this, the mm_struct passed to us by the MMU notifier is
used (which is what should have been done to begin with). This avoids
the broken dereferences and ensures that the correct mm_struct is used.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 12:00:38 -04:00
Linus Torvalds
b8ba452683 Round two of 4.6 merge window patches
 - A few minor core fixups needed for the next patch series
 - The IB SRIOV series.  This has bounced around for several versions.
   Of note is the fact that the first patch in this series affects
   the net core.  It was directed to netdev and DaveM for each iteration
   of the series (three versions total).  Dave did not object, but did
   not respond either.  I've taken this as permission to move forward
   with the series.
 - The new Intel X722 iWARP driver
 - A huge set of updates to the Intel hfi1 driver.  Of particular interest
   here is that we have left the driver in staging since it still has an
   API that people object to.  Intel is working on a fix, but getting
   these patches in now helps keep me sane as the upstream and Intel's
   trees were over 300 patches apart.

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull more rdma updates from Doug Ledford:
 "Round two of 4.6 merge window patches.

  This is a monster pull request.  I held off on the hfi1 driver updates
  (the hfi1 driver is intimately tied to the qib driver and the new
  rdmavt software library that was created to help both of them) in my
  first pull request.  The hfi1/qib/rdmavt update is probably 90% of
  this pull request.  The hfi1 driver is being left in staging so that
  it can be fixed up in regards to the API that Al and yourself didn't
  like.  Intel has agreed to do the work, but in the meantime, this
  clears out 300+ patches in the backlog queue and brings my tree and
  their tree closer to sync.

  This also includes about 10 patches to the core and a few to mlx5 to
  create an infrastructure for configuring SRIOV ports on IB devices.
  That series includes one patch to the net core that we sent to netdev@
  and Dave Miller with each of the three revisions to the series.  We
  didn't get any response to the patch, so we took that as implicit
  approval.

  Finally, this series includes Intel's new iWARP driver for their x722
  cards.  It's not nearly the beast as the hfi1 driver.  It also has a
  linux-next merge issue, but that has been resolved and it now passes
  just fine.

  Summary:

   - A few minor core fixups needed for the next patch series

   - The IB SRIOV series.  This has bounced around for several versions.
     Of note is the fact that the first patch in this series affects the
     net core.  It was directed to netdev and DaveM for each iteration
     of the series (three versions total).  Dave did not object, but did
     not respond either.  I've taken this as permission to move forward
     with the series.

   - The new Intel X722 iWARP driver

   - A huge set of updates to the Intel hfi1 driver.  Of particular
     interest here is that we have left the driver in staging since it
     still has an API that people object to.  Intel is working on a fix,
     but getting these patches in now helps keep me sane as the upstream
     and Intel's trees were over 300 patches apart"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (362 commits)
  IB/ipoib: Allow mcast packets from other VFs
  IB/mlx5: Implement callbacks for manipulating VFs
  net/mlx5_core: Implement modify HCA vport command
  net/mlx5_core: Add VF param when querying vport counter
  IB/ipoib: Add ndo operations for configuring VFs
  IB/core: Add interfaces to control VF attributes
  IB/core: Support accessing SA in virtualized environment
  IB/core: Add subnet prefix to port info
  IB/mlx5: Fix decision on using MAD_IFC
  net/core: Add support for configuring VF GUIDs
  IB/{core, ulp} Support above 32 possible device capability flags
  IB/core: Replace setting the zero values in ib_uverbs_ex_query_device
  net/mlx5_core: Introduce offload arithmetic hardware capabilities
  net/mlx5_core: Refactor device capability function
  net/mlx5_core: Fix caching ATOMIC endian mode capability
  ib_srpt: fix a WARN_ON() message
  i40iw: Replace the obsolete crypto hash interface with shash
  IB/hfi1: Add SDMA cache eviction algorithm
  IB/hfi1: Switch to using the pin query function
  IB/hfi1: Specify mm when releasing pages
  ...
2016-03-22 15:48:44 -07:00
Mitko Haralanov
5511d78107 IB/hfi1: Add SDMA cache eviction algorithm
This commit adds a cache eviction algorithm for the SDMA
user buffer cache.

Besides the interval RB tree used for node lookup, the cache
nodes are also arranged in a doubly-linked list. When a node is
used, it is put at the beginning of the list. Less frequently
used nodes naturally move to the tail of the list.

When the cache limit is reached, the eviction code starts
traversing the linked list in reverse, freeing buffers until
enough space has been freed to fit the new user buffer. This
guarantees that only the least used cache nodes will be removed
from the cache.
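
The eviction scheme described above can be sketched as follows. This is
a simplified user-space model with hypothetical names (cache,
cache_node, cache_touch, cache_evict). The interval RB tree used for
lookup is left out, and "freeing" a buffer is reduced to setting a flag.

```c
#include <assert.h>
#include <stddef.h>

struct cache_node {
	size_t npages;
	struct cache_node *prev, *next;
	int evicted;
};

struct cache {
	struct cache_node *head;	/* most recently used */
	struct cache_node *tail;	/* least recently used */
	size_t used;			/* pages currently cached */
};

static void unlink_node(struct cache *c, struct cache_node *n)
{
	if (n->prev)
		n->prev->next = n->next;
	else
		c->head = n->next;
	if (n->next)
		n->next->prev = n->prev;
	else
		c->tail = n->prev;
	n->prev = n->next = NULL;
}

static void push_front(struct cache *c, struct cache_node *n)
{
	n->prev = NULL;
	n->next = c->head;
	if (c->head)
		c->head->prev = n;
	else
		c->tail = n;
	c->head = n;
}

static void cache_insert(struct cache *c, struct cache_node *n)
{
	push_front(c, n);
	c->used += n->npages;
}

/* A node that is used moves to the beginning of the list. */
static void cache_touch(struct cache *c, struct cache_node *n)
{
	unlink_node(c, n);
	push_front(c, n);
}

/* Walk from the tail, evicting until at least 'target' pages are freed. */
static size_t cache_evict(struct cache *c, size_t target)
{
	size_t freed = 0;

	while (freed < target && c->tail) {
		struct cache_node *victim = c->tail;

		unlink_node(c, victim);
		c->used -= victim->npages;
		freed += victim->npages;
		victim->evicted = 1;
	}
	return freed;
}
```

Because eviction works on whole nodes, it can free more pages than were
asked for, as long as it frees at least the target when enough nodes
exist.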

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-21 15:55:25 -04:00
Mitko Haralanov
bd3a8947de IB/hfi1: Specify mm when releasing pages
This change adds a pointer to the process mm_struct when
calling hfi1_release_user_pages().

Previously, the function used the mm_struct of the current
process to adjust the number of pinned pages. However, in some
cases, namely when unpinning pages due to an MMU notifier call,
we do not want to drop into that code block as it will cause a
deadlock (the MMU notifiers take the process' mmap_sem prior to
calling the callbacks).

By allowing the caller to specify the pointer to the mm_struct,
the caller has finer control over that part of hfi1_release_user_pages().

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-21 15:55:25 -04:00
Mitko Haralanov
5cd3a88d7f IB/hfi1: Implement SDMA-side buffer caching
Add support for caching of user buffers used for SDMA
transfers. This change improves performance by
avoiding repeatedly pinning the pages of buffers, which
are being re-used by the application.

While the cost of the pinning operation has been made
heavier by adding the extra code to search the cache tree,
re-allocate pages arrays, and future cache evictions,
that cost will be amortized against the savings when the
same buffer is re-used. It is also worth noting that in
most cases, the cost of pinning should be much lower due
to the buffer already being in the cache.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-21 15:55:24 -04:00
Amitoj Kaur Chawla
8444991755 staging: rdma: hfi1: Replace ALIGN with PAGE_ALIGN
mm.h contains a helper macro, PAGE_ALIGN, which aligns an address to
the page boundary. Use it instead of ALIGN(expression, PAGE_SIZE).

This change was made with the help of the following Coccinelle
semantic patch:
//<smpl>
@@
expression e;
symbol PAGE_SIZE;
@@
(
- ALIGN(e, PAGE_SIZE)
+ PAGE_ALIGN(e)
|
- IS_ALIGNED(e, PAGE_SIZE)
+ PAGE_ALIGNED(e)
)
//</smpl>
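
The equivalence that the semantic patch relies on can be checked in user
space. The macros below are simplified stand-ins written for this
sketch. The kernel's own definitions are more elaborate, and the
4096-byte page size is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified user-space versions of the kernel macros. */
#define PAGE_SIZE		4096UL
#define ALIGN(x, a)		(((x) + (a) - 1) & ~((a) - 1))
#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)
#define PAGE_ALIGN(x)		ALIGN(x, PAGE_SIZE)
#define PAGE_ALIGNED(x)		IS_ALIGNED(x, PAGE_SIZE)

/* True if the page-specific helpers agree with the generic forms. */
static bool same_result(unsigned long addr)
{
	return ALIGN(addr, PAGE_SIZE) == PAGE_ALIGN(addr) &&
	       IS_ALIGNED(addr, PAGE_SIZE) == PAGE_ALIGNED(addr);
}
```

With these definitions, PAGE_ALIGN(4097UL) rounds up to 8192 and
PAGE_ALIGNED(1UL) is false, exactly as the longer ALIGN and IS_ALIGNED
forms would report.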

Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-11 22:09:09 -08:00
Janani Ravichandran
16ccad0475 staging: rdma: hfi1: user_sdma.c: Drop void pointer cast
Void pointers need not be cast to other pointer types.
Semantic patch used:

@r@
expression x;
void *e;
type T;
identifier f;
@@

(
  *((T *)e)
|
  ((T *)x) [...]
|
  ((T *)x)->f
|
- (T *)
  e
)

Signed-off-by: Janani Ravichandran <janani.rvchndrn@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-11 22:09:09 -08:00
Jubin John
05d6ac1d82 staging/rdma/hfi1: Fix header
Fix the header by moving the copyright notice out of the license text
and to the top of the header. Also, update the copyright date.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-10 20:45:42 -05:00
Jubin John
e490974e67 staging/rdma/hfi1: Add braces on all arms of statement
Add braces on all arms of statements to fix checkpatch check:
CHECK: braces {} should be used on all arms of this statement

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-10 20:45:42 -05:00
Jubin John
17fb4f2923 staging/rdma/hfi1: Fix code alignment
Fix code alignment to fix checkpatch check:
CHECK: Alignment should match open parenthesis

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-10 20:45:41 -05:00
Jubin John
4d114fdd90 staging/rdma/hfi1: Fix block comments
Fix block comments with proper formatting to fix checkpatch warnings:
WARNING: Block comments use * on subsequent lines
WARNING: Block comments use a trailing */ on a separate line

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-10 20:45:41 -05:00
Jubin John
5161fc3ef6 staging/rdma/hfi1: Remove blank line before close brace
Remove extra blank line before close brace to fix checkpatch check:
CHECK: Blank lines aren't necessary before a close brace '}'

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-10 20:45:37 -05:00
Jubin John
50e5dcbed6 staging/rdma/hfi1: Remove space after cast
Remove the space after a cast to fix checkpatch check:
CHECK: No space is necessary after a cast

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-10 20:45:36 -05:00
Jubin John
8638b77f13 staging/rdma/hfi1: Add spaces around binary operators
Add spaces around binary operators.

Fixes checkpatch check:
CHECK: spaces preferred around that 'x'
where x is a binary operator

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-10 20:45:33 -05:00
Mike Marciniszyn
a545f5308b staging/rdma/hfi: fix CQ completion order issue
The current implementation of the sdma_wait variable
has a timing hole that can cause a completion queue entry
to be returned from a pio send prior to an older
sdma packet's completion queue entry.

The sdma_wait variable used to be decremented prior to
calling the packet complete routine.  The hole is between the decrement
and the verbs completion, where a send engine using pio could return
an out-of-order completion in that window.

This patch closes the hole by allowing an API option to
specify an sdma_drained callback.   The atomic dec
is positioned after the complete callback to avoid the
window as long as the pio path doesn't execute when
there is a non-zero sdma count.

Reviewed-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-10 20:38:14 -05:00
Mitko Haralanov
a402d6ab40 staging/rdma/hfi1: Fix bug that could block the process on context exit
A race was discovered in the user SDMA code, which could result
in a process being stuck in the kernel call indefinitely in
certain error conditions.

If, during the processing of a user SDMA request, there was an
error *and* all outstanding SDMA descriptors had been completed
by the time that error case was handled in the calling function,
the state of the packet queue would not get correctly updated
resulting in the process subsequently getting stuck, thinking that
there are more descriptors to be completed.

To handle this scenario, the driver now checks the submitted
packet count vs. the completed. If all submitted packets have also
been completed, the driver can safely free the request and signal
user level. Otherwise, this will be handled by the completion
callback.
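
The submitted-versus-completed check can be sketched like this. The
structure and function names are hypothetical, and the sketch is single
threaded. A real driver would need atomic counters or locking, because
the completion callback and the error path can run concurrently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request state. */
struct sdma_req {
	unsigned int submitted;	/* packets handed to the SDMA engine */
	unsigned int completed;	/* packets the engine has finished */
	bool has_error;
	bool freed;		/* request freed and user level signaled */
};

static bool all_completed(const struct sdma_req *req)
{
	return req->submitted == req->completed;
}

/* Error path in the submitting function. */
static void handle_error(struct sdma_req *req)
{
	req->has_error = true;
	/* Nothing outstanding: no callback will come, so free here. */
	if (all_completed(req))
		req->freed = true;
	/* Otherwise the completion callback finishes the job. */
}

/* Completion callback, run once per completed packet. */
static void completion_cb(struct sdma_req *req)
{
	req->completed++;
	if (req->has_error && all_completed(req))
		req->freed = true;
}
```

Either path can end up freeing the request, but only the one that sees
the two counts match does so.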

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-10 20:38:00 -05:00
Mitko Haralanov
0840aea98c staging/rdma/hfi1: Improve performance of user SDMA
To facilitate locked page counting, the user SDMA
routines would maintain a list of io vectors, which
were freed in the completion callback. The associated
pages were then unpinned during the next call into the
kernel.

Since the size of this list was unbounded, doing this
was bad for performance because the driver ended up
spending too much time freeing the io vectors.

This commit changes how the io vector freeing is done
by moving the actual page unpinning into the callback and
maintaining a count of unpinned pages. This count can
then be used during the next call into the kernel to
update the mm->pinned_vm variable (since that requires
process context and the ability to sleep.)
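
The deferred accounting can be sketched with C11 atomics. The names
(mm_model, pkt_queue, unpinned_pages, flush_unpinned) are hypothetical,
and the actual unpinning of pages is left out. Only the split between
the callback and the later update of the pinned-page total is shown.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the per-process pinned-page accounting. */
struct mm_model {
	unsigned long pinned_vm;
};

struct pkt_queue {
	atomic_ulong unpinned_pages;	/* unpinned, but not yet accounted */
};

/* Completion callback: only bump a counter, never sleep. */
static void completion_cb(struct pkt_queue *pq, unsigned long npages)
{
	atomic_fetch_add(&pq->unpinned_pages, npages);
}

/* Next call from the user process: fold the count into pinned_vm. */
static void flush_unpinned(struct pkt_queue *pq, struct mm_model *mm)
{
	unsigned long n = atomic_exchange(&pq->unpinned_pages, 0);

	mm->pinned_vm -= n;
}
```

The work done per call is constant regardless of how many completions
arrived in between, which is the property the unbounded list lacked.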

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-10 20:37:59 -05:00
Mitko Haralanov
c7cbf2fabb staging/rdma/hfi1: Properly determine error status of SDMA slots
To ensure correct operation between the driver and PSM
with respect to managing the SDMA request ring, it is
important that the status for a particular request slot
is set at the correct time. Otherwise, PSM can get out
of sync with the driver, which could lead to hangs or
errors on new requests.

Properly determining when to set the error status of
an SDMA slot depends on knowing exactly when the last txreq
for that request has been completed. This in turn requires
that the driver knows exactly how many requests have been
generated and how many of those requests have been successfully
submitted to the SDMA queue.

The previous implementation of the mid-layer SDMA API did not
provide a way for the caller of sdma_send_txlist() to know how
many of the txreqs in the input list have actually been submitted
without traversing the list and counting. Since sdma_send_txlist()
already traverses the list in order to process it, requiring
such traversal in the caller is completely unnecessary. Therefore,
it is much easier to enhance sdma_send_txlist() to return the
number of successfully submitted txreqs.

This, in turn, allows the caller to accurately determine the
progress of the SDMA request and, therefore, correctly set the
error status at the right time.
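
The idea of reporting the submitted count from the function that already
walks the list can be sketched as below. This is not the signature of
sdma_send_txlist(). The ring model, the send_list() name and the out
parameter are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical SDMA ring: only the number of free slots is modeled. */
struct sdma_ring {
	size_t free_slots;
};

/*
 * Submit up to 'ntx' requests. The loop already visits every request,
 * so it can report how many were queued at no extra cost.
 * Returns 0 if all were submitted and -1 if the ring filled up.
 * '*submitted' is valid in both cases.
 */
static int send_list(struct sdma_ring *ring, size_t ntx, size_t *submitted)
{
	*submitted = 0;
	while (*submitted < ntx) {
		if (ring->free_slots == 0)
			return -1;
		ring->free_slots--;
		(*submitted)++;
	}
	return 0;
}
```

A caller that generated five txreqs and gets back a count of three knows
that exactly three completions are still to come before the slot's error
status may be set.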

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-10 20:37:55 -05:00
Mitko Haralanov
0f2d87d282 staging/rdma/hfi1: Improve performance of SDMA transfers
Commit a0d406934a ("staging/rdma/hfi1: Add page lock limit
check for SDMA requests") added a mechanism to
delay the clean-up of user SDMA requests in order to facilitate
proper locked page counting.

This delayed processing was done using a kernel workqueue, which
meant that a kernel thread would have to spin up and take CPU
cycles to do the clean-up.

This proved detrimental to performance because now there are two
execution threads (the kernel workqueue and the user process)
needing cycles on the same CPU.

Performance-wise, it is much better to do as much of the clean-up
as can be done in interrupt context (during the callback) and do
the remaining work in-line during subsequent calls of the user
process into the driver.

The changes required to implement the above also significantly
simplify the entire SDMA completion processing code and eliminate
a memory corruption causing the following observed crash:

    [ 2881.703362] BUG: unable to handle kernel NULL pointer dereference at        (null)
    [ 2881.703389] IP: [<ffffffffa02897e4>] user_sdma_send_pkts+0xcd4/0x18e0 [hfi1]
    [ 2881.703422] PGD 7d4d25067 PUD 77d96d067 PMD 0
    [ 2881.703427] Oops: 0000 [#1] SMP
    [ 2881.703431] Modules linked in:
    [ 2881.703504] CPU: 28 PID: 6668 Comm: mpi_stress Tainted: G           OENX 3.12.28-4-default #1
    [ 2881.703508] Hardware name: Intel Corporation S2600KP/S2600KP, BIOS SE5C610.86B.11.01.0044.090
    [ 2881.703512] task: ffff88077da8e0c0 ti: ffff880856772000 task.ti: ffff880856772000
    [ 2881.703515] RIP: 0010:[<ffffffffa02897e4>]  [<ffffffffa02897e4>] user_sdma_send_pkts+0xcd4/0x
    [ 2881.703529] RSP: 0018:ffff880856773c48  EFLAGS: 00010287
    [ 2881.703531] RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000002000
    [ 2881.703534] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000002000
    [ 2881.703537] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
    [ 2881.703540] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    [ 2881.703543] R13: 0000000000000000 R14: ffff88071e782e68 R15: ffff8810532955c0
    [ 2881.703546] FS:  00007f9c4375e700(0000) GS:ffff88107eec0000(0000) knlGS:0000000000000000
    [ 2881.703549] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 2881.703551] CR2: 0000000000000000 CR3: 00000007d4cba000 CR4: 00000000003407e0
    [ 2881.703554] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 2881.703556] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 2881.703558] Stack:
    [ 2881.703559]  ffffffff00002000 ffff881000001800 ffffffff00000000 00000000000080d0
    [ 2881.703570]  0000000000000000 0000200000000000 0000000000000000 ffff88071e782db8
    [ 2881.703580]  ffff8807d4d08d80 ffff881053295600 0000000000000008 ffff88071e782fc8
    [ 2881.703589] Call Trace:
    [ 2881.703691]  [<ffffffffa028b5da>] hfi1_user_sdma_process_request+0x84a/0xab0 [hfi1]
    [ 2881.703777]  [<ffffffffa0255412>] hfi1_aio_write+0xd2/0x110 [hfi1]
    [ 2881.703828]  [<ffffffff8119e3d8>] do_sync_readv_writev+0x48/0x80
    [ 2881.703837]  [<ffffffff8119f78b>] do_readv_writev+0xbb/0x230
    [ 2881.703843]  [<ffffffff8119fab8>] SyS_writev+0x48/0xc0

This commit also addresses issues related to notification of user
processes of SDMA request slot availability. The slot should be
cleaned up first before the user process is notified of its
availability.

Reviewed-by: Arthur Kepner <arthur.kepner@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-10 20:37:55 -05:00
Amitoj Kaur Chawla
84d899e709 staging: rdma: hfi1: Remove header file
Remove duplicate include file. Found using includecheck.

Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-22 12:03:22 -08:00
Amitoj Kaur Chawla
72a5f6a8df staging: rdma: hfi1: Use offset_in_page macro
Use offset_in_page macro instead of (var & ~PAGE_MASK)

The Coccinelle semantic patch used to make this change is as follows:
// <smpl>
@@
unsigned long p;
@@
- p & ~PAGE_MASK
+ offset_in_page(p)
// </smpl>
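
For illustration only, the following user-space sketch shows that the
two forms compute the same value. The MODEL_ macros are local stand-ins
written for this example and assume a 4 KiB page; the kernel's own
PAGE_MASK and offset_in_page() come from its headers and depend on the
architecture.

```c
/* Assumption for this sketch: 4 KiB pages. */
#define MODEL_PAGE_SHIFT	12
#define MODEL_PAGE_SIZE		(1UL << MODEL_PAGE_SHIFT)
#define MODEL_PAGE_MASK		(~(MODEL_PAGE_SIZE - 1))

/* Stand-in with the same shape as offset_in_page(). */
#define model_offset_in_page(p)	((unsigned long)(p) & ~MODEL_PAGE_MASK)

/* The open-coded form that the semantic patch removes. */
static unsigned long open_coded_offset(unsigned long p)
{
	return p & ~MODEL_PAGE_MASK;
}

/* The macro form that the semantic patch introduces. */
static unsigned long macro_offset(unsigned long p)
{
	return model_offset_in_page(p);
}
```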

Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-20 14:54:05 -08:00
Bhumika Goyal
a4d7d05b2d Staging: rdma: hfi1: Delete NULL check before vfree
The vfree() function tests whether its argument is NULL and returns
immediately if so, so a NULL test is not needed before calling vfree().

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-14 16:44:06 -08:00
Mitko Haralanov
6a5464f224 staging/rdma/hfi1: Detect SDMA transmission error early
It is possible for an SDMA transmission error to happen
during the processing of a user SDMA transfer. In that
case it is better to detect it early and abort any further
attempts to send more packets.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-21 13:57:55 -08:00
Mitko Haralanov
faa98b8620 staging/rdma/hfi1: Clean-up unnecessary goto statements
Clean-up unnecessary goto statements based on feedback from the
mailing list on previous patch submissions.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-21 13:57:55 -08:00
Mitko Haralanov
a0d406934a staging/rdma/hfi1: Add page lock limit check for SDMA requests
The driver pins pages on behalf of user processes in two
separate instances: when the process has submitted an
SDMA transfer and when the process programs an expected
receive buffer.

When pinning pages, the driver is required to observe the
locked page limit set by the system administrator and refuse
to lock more pages than allowed. Such a check was done for
expected receives but was missing from the SDMA transfer
code path.

This commit adds the missing check for SDMA transfers. As of
this commit, user SDMA or expected receive requests will be
rejected if the number of pages required to be pinned would
exceed the set limit.

Because the driver needs to take the MM semaphore (which can sleep)
in order to update the locked page count, the update cannot be done
by the callback function, which is executed in interrupt context.
Therefore, it is necessary to put
all the completed SDMA tx requests onto a separate list (txcmp) and
offload the actual clean-up and unpinning work to a workqueue.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-21 13:57:55 -08:00
Sunny Kumar
cb32649d08 staging: rdma: hfi1 : Prefer using the BIT macro
This patch replaces open-coded shifts of 1 with the BIT(x) macro.
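
As a small illustration, the sketch below compares the two spellings in
user space. MODEL_BIT is a local stand-in defined for this example; the
flag name and bit position are made up.

```c
/* Local stand-in with the same shape as the kernel's BIT() macro. */
#define MODEL_BIT(nr)	(1UL << (nr))

/* Hypothetical flag position used only in this example. */
#define MODEL_FLAG_SHIFT	3

/* Open-coded form, as written before the change. */
static unsigned long flag_open_coded(void)
{
	return 1UL << MODEL_FLAG_SHIFT;
}

/* Macro form, as written after the change. */
static unsigned long flag_with_macro(void)
{
	return MODEL_BIT(MODEL_FLAG_SHIFT);
}
```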

Signed-off-by: Sunny Kumar <sunnyk@cdac.in>
Acked-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-11-15 20:02:47 -08:00
Ira Weiny
9e10af4787 staging/rdma/hfi1: Remove file pointer macros
Remove the following macros in favor of explicit use of struct hfi1_filedata and
various sub structures.

ctxt_fp
subctxt_fp
tidcursor_fp
user_sdma_pkt_fp
user_sdma_comp_fp

Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-11-15 20:02:47 -08:00
Mitko Haralanov
b9fb6318d0 staging/rdma/hfi1: Prevent silent data corruption with user SDMA
User SDMA keeps track of progress into the submitted IO vectors by tracking an
offset into the vectors when packets are submitted. This offset is updated
after a successful submission of a txreq to the SDMA engine.

The same offset was used when determining whether an IO vector should be
'freed' (pages unpinned) in the SDMA callback functions.

This was causing a silent data corruption in big jobs (> 2 nodes, 120 ranks
each) on the receive side because the send side was mistakenly unpinning the
vector pages before the HW had processed all descriptors referencing the
vector.
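
The hazard can be sketched with a simplified model that keeps separate
progress counters for submission and completion. This is one possible
way to express the distinction, not necessarily how the fix is
implemented; the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-vector bookkeeping; not the driver's structure. */
struct model_iovec {
	size_t len;		/* total bytes in the vector */
	size_t submitted;	/* bytes handed to the engine so far */
	size_t completed;	/* bytes the engine has reported done */
};

static void model_submit(struct model_iovec *v, size_t n)
{
	v->submitted += n;
}

static void model_complete(struct model_iovec *v, size_t n)
{
	v->completed += n;
}

/* Unsafe test: true as soon as everything has been queued. */
static bool model_can_unpin_by_submission(const struct model_iovec *v)
{
	return v->submitted == v->len;
}

/* Safe test: true only once the engine has finished with every byte. */
static bool model_can_unpin(const struct model_iovec *v)
{
	return v->completed == v->len;
}
```

In this model the submission-based test becomes true while descriptors
are still outstanding, which is the window in which pages could be
unpinned too early.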

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-27 17:19:22 +09:00
Alison Schofield
806e6e1bec staging: rdma: hfi1: remove unnecessary out of memory messages
Out of memory messages are unnecessary in the drivers as they are
reported by memory management.

Addresses checkpatch.pl: WARNING: Possible unnecessary 'out of memory' message

Signed-off-by: Alison Schofield <amsfield22@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-12 20:35:21 -07:00
Shraddha Barke
314fcc0d53 Staging: rdma: hfi1: Use kcalloc instead of kzalloc to allocate array
The advantage of kcalloc is that it prevents integer overflows which
could result from the multiplication of the number of elements and the
element size; it is also a bit nicer to read.
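
The overflow concern can be illustrated with a user-space sketch.
model_calloc() is a hypothetical helper written for this example, built
on malloc() and memset(); it is not the kernel's implementation.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Allocate a zeroed array of n elements of the given size. Return NULL
 * if n * size would overflow size_t, instead of silently allocating a
 * buffer that is too small.
 */
static void *model_calloc(size_t n, size_t size)
{
	void *p;

	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* multiplication would overflow */

	p = malloc(n * size);
	if (p)
		memset(p, 0, n * size);
	return p;
}
```

An open-coded zeroing allocation of n * size bytes has no such guard, so
a wrapped product would yield an undersized buffer.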

Signed-off-by: Shraddha Barke <shraddha.6596@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-12 20:35:21 -07:00
Julia Lawall
adad44d132 hfi1: drop null test before destroy functions
Remove unneeded NULL test.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@ expression x; @@
-if (x != NULL)
  \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Mike Marciniszyn <infinipath@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-15 06:40:57 -07:00
Mike Marciniszyn
7724105686 IB/hfi1: add driver files
Signed-off-by: Andrew Friedley <andrew.friedley@intel.com>
Signed-off-by: Arthur Kepner <arthur.kepner@intel.com>
Signed-off-by: Brendan Cunningham <brendan.cunningham@intel.com>
Signed-off-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Caz Yokoyama <caz.yokoyama@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jim Snow <jim.m.snow@intel.com>
Signed-off-by: John Gregor <john.a.gregor@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Kevin Pine <kevin.pine@intel.com>
Signed-off-by: Kyle Liddell <kyle.liddell@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Ravi Krishnaswamy <ravi.krishnaswamy@intel.com>
Signed-off-by: Sadanand Warrier <sadanand.warrier@intel.com>
Signed-off-by: Sanath Kumar <sanath.s.kumar@intel.com>
Signed-off-by: Sudeep Dutt <sudeep.dutt@intel.com>
Signed-off-by: Vlad Danushevsky <vladimir.danusevsky@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-28 22:59:36 -04:00