Commit graph

2163 commits

Author SHA1 Message Date
Linus Torvalds
009bd55dfc RDMA 5.11 pull request
A smaller set of patches, nothing stands out as being particularly major
 this cycle:
 
 - Driver bug fixes and updates: bnxt_re, cxgb4, rxe, hns, i40iw, cxgb4,
   mlx4 and mlx5
 
 - Bug fixes and polishing for the new rts ULP
 
 - Cleanup of uverbs checking for allowed driver operations
 
 - Use sysfs_emit all over the place
 
 - Lots of bug fixes and clarity improvements for hns
 
 - hip09 support for hns
 
 - NDR and 50/100Gb signaling rates
 
 - Remove dma_virt_ops and go back to using the IB DMA wrappers
 
 - mlx5 optimizations for contiguous DMA regions
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl/aNXUACgkQOG33FX4g
 mxqlMQ/+O6UhxKnDAnMB+HzDGvOm+KXNHOQBuzxz4ZWXqtUrW8WU5ca3PhXovc4z
 /QX0HhMhQmVsva5mjp1OGVATxQ2E+yasqFLg4QXAFWFR3N7s0u/sikE9i1DoPvOC
 lsmLTeRauCFaE4mJD5nvYwm+riECX0GmyVVW7v6V05xwAp0hwdhyU7Kb6Yh3lxsE
 umTz+onPNJcD6Tc4snziyC5QEp5ebEjAaj4dVI1YPR5X0c2RwC5E1CIDI6u4OQ2k
 j7/+Kvo8LNdYNERGiR169x6c1L7WS6dYnGMMeXRgyy0BVbVdRGDnvCV9VRmF66w5
 99fHfDjNMNmqbGNt/4/gwNdVrR9aI4jMZWCh7SmsguX6XwNOlhYldy3x3WnlkfkQ
 e4O0huJceJqcB2Uya70GqufnAetRXsbjzcvWxpR5YAwRmcRkm1f6aGK3BxPjWEbr
 BbYRpiKMxxT4yTe65BuuThzx6g4pNQHe0z3BM/dzMJQAX+PZcs1CPQR8F8PbCrZR
 Ad7qw4HJ587PoSxPi3toVMpYZRP6cISh1zx9q/JCj8cxH9Ri4MovUCS3cF63Ny3B
 1LJ2q0x8FuLLjgZJogKUyEkS8OO6q7NL8WumjvrYWWx19+jcYsV81jTRGSkH3bfY
 F7Esv5K2T1F2gVsCe1ZFFplQg6ja1afIcc+LEl8cMJSyTdoSub4=
 =9t8b
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "A smaller set of patches, nothing stands out as being particularly
  major this cycle. The biggest item would be the new HIP09 HW support
  from HNS, otherwise it was pretty quiet for new work here:

   - Driver bug fixes and updates: bnxt_re, cxgb4, rxe, hns, i40iw,
     cxgb4, mlx4 and mlx5

   - Bug fixes and polishing for the new rts ULP

   - Cleanup of uverbs checking for allowed driver operations

   - Use sysfs_emit all over the place

   - Lots of bug fixes and clarity improvements for hns

   - hip09 support for hns

   - NDR and 50/100Gb signaling rates

   - Remove dma_virt_ops and go back to using the IB DMA wrappers

   - mlx5 optimizations for contiguous DMA regions"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (147 commits)
  RDMA/cma: Don't overwrite sgid_attr after device is released
  RDMA/mlx5: Fix MR cache memory leak
  RDMA/rxe: Use acquire/release for memory ordering
  RDMA/hns: Simplify AEQE process for different types of queue
  RDMA/hns: Fix inaccurate prints
  RDMA/hns: Fix incorrect symbol types
  RDMA/hns: Clear redundant variable initialization
  RDMA/hns: Fix coding style issues
  RDMA/hns: Remove unnecessary access right set during INIT2INIT
  RDMA/hns: WARN_ON if get a reserved sl from users
  RDMA/hns: Avoid filling sl in high 3 bits of vlan_id
  RDMA/hns: Do shift on traffic class when using RoCEv2
  RDMA/hns: Normalization the judgment of some features
  RDMA/hns: Limit the length of data copied between kernel and userspace
  RDMA/mlx4: Remove bogus dev_base_lock usage
  RDMA/uverbs: Fix incorrect variable type
  RDMA/core: Do not indicate device ready when device enablement fails
  RDMA/core: Clean up cq pool mechanism
  RDMA/core: Update kernel documentation for ib_create_named_qp()
  MAINTAINERS: SOFT-ROCE: Change Zhu Yanjun's email address
  ...
2020-12-16 13:42:26 -08:00
Linus Torvalds
69f637c335 for-5.11/drivers-2020-12-14
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/XgdYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjTBD/4me2TNvGOogbcL0b1leAotndJ7spI/IcFM
 NUMNy3pOGuRBcRjwle85xq44puAjlNkZE2LLatem5sT7ZvS+8lPNnOIoTYgfaCjt
 PhKx2sKlLumVm3BwymYAPcPtke4fikGG15Mwu5nX1oOehmyGrjObGAr3Lo6gexCT
 tQoCOczVqaTsV+iTXrLlmgEgs07J9Tm93uh2cNR8Jgroxb8ivuWeUq4YgbV4kWk+
 Y8XvOyVE/yba0vQf5/hHtWuVoC6RdELnqZ6NCkcP/EicdBecwk1GMJAej1S3zPS1
 0BT7GSFTpm3YUHcygD6LRmRg4I/BmWDTDtMi84+jLat6VvSG1HwIm//qHiCJh3ku
 SlvFZENIWAv5LP92x2vlR5Lt7uE3GK2V/5Pxt2fekyzCth6mzu+hLH4CBPQ3xgyd
 E1JqIQ/ilbXstp+EYoivV5x8yltZQnKEZRopws0EOqj1LsmDPj9XT1wzE9RnB0o+
 PWu/DNhQFhhcmP7Z8uLgPiKIVpyGs+vjxiJLlTtGDFTCy6M5JbcgzGkEkSmnybxH
 7lSanjpLt1dWj85FBMc6fNtJkv2rBPfb4+j0d1kZ45Dzcr4umirGIh7wtCHcgc83
 brmXSt29hlKHseSHMMuNWK8haXcgAE7gq9tD8GZ/kzM7+vkmLLxHJa22Qhq5rp4w
 URPeaBaQJw==
 =ayp2
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "Nothing major in here:

   - NVMe pull request from Christoph:
        - nvmet passthrough improvements (Chaitanya Kulkarni)
        - fcloop error injection support (James Smart)
        - read-only support for zoned namespaces without Zone Append
          (Javier González)
        - improve some error message (Minwoo Im)
        - reject I/O to offline fabrics namespaces (Victor Gladkov)
        - PCI queue allocation cleanups (Niklas Schnelle)
        - remove an unused allocation in nvmet (Amit Engel)
        - a Kconfig spelling fix (Colin Ian King)
        - nvme_req_qid simplication (Baolin Wang)

   - MD pull request from Song:
        - Fix race condition in md_ioctl() (Dae R. Jeong)
        - Initialize read_slot properly for raid10 (Kevin Vigor)
        - Code cleanup (Pankaj Gupta)
        - md-cluster resync/reshape fix (Zhao Heming)

   - Move null_blk into its own directory (Damien Le Moal)

   - null_blk zone and discard improvements (Damien Le Moal)

   - bcache race fix (Dongsheng Yang)

   - Set of rnbd fixes/improvements (Gioh Kim, Guoqing Jiang, Jack Wang,
     Lutz Pogrell, Md Haris Iqbal)

   - lightnvm NULL pointer deref fix (tangzhenhao)

   - sr in_interrupt() removal (Sebastian Andrzej Siewior)

   - FC endpoint security support for s390/dasd (Jan Höppner, Sebastian
     Ott, Vineeth Vijayan). From the s390 arch guys, arch bits included
     as it made it easier for them to funnel the feature through the
     block driver tree.

   - Follow up fixes (Colin Ian King)"

* tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block: (64 commits)
  block: drop dead assignments in loop_init()
  sr: Remove in_interrupt() usage in sr_init_command().
  sr: Switch the sector size back to 2048 if sr_read_sector() changed it.
  cdrom: Reset sector_size back it is not 2048.
  drivers/lightnvm: fix a null-ptr-deref bug in pblk-core.c
  null_blk: Move driver into its own directory
  null_blk: Allow controlling max_hw_sectors limit
  null_blk: discard zones on reset
  null_blk: cleanup discard handling
  null_blk: Improve implicit zone close
  null_blk: improve zone locking
  block: Align max_hw_sectors to logical blocksize
  null_blk: Fail zone append to conventional zones
  null_blk: Fix zone size initialization
  bcache: fix race between setting bdev state to none and new write request direct to backing
  block/rnbd: fix a null pointer dereference on dev->blk_symlink_name
  block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name
  block/rnbd: call kobject_put in the failure path
  Documentation/ABI/rnbd-srv: add document for force_close
  block/rnbd-srv: close a mapped device from server side.
  ...
2020-12-16 13:09:32 -08:00
Linus Torvalds
ac7ac4618c for-5.11/block-2020-12-14
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/Xec8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoLbEACzXypgZWwMdfgRckA/Vt333rXHtbhUV+hK
 2XP+P81iRvr9Esi31UPbRp82vrgcDO0cpI1QmQojS5U5TIQP88BfXptfRZZu48eb
 wT5RDDNQ34HItqAh/yEuYsv9yUKcxeIrB99tBVvM+4UmQg9zTdIW3mg6PvCBdbhV
 N38jI0tCF/PJatjfRuphT/nXonQLPWBlVDmZk06KZQFOwQe9ep1vUi1+nbiRPuo3
 geFBpTh1Kp6Vl1B3n4RpECs6Y7I0RRuJdaH2sDizICla1/BW91F9fQwHimNnUxUq
 e1Q1kMuh6ftcQGkYlHSYcPhuv6CvorldTZCO5arPxWpcwvxriTSMRPWAgUr5pEiF
 fhiGhqeDu9e6vl9vS31wUD1B30hy+jFz9wyjRrDwJ3cPHH1JVBjTzvdX+cIh/1ku
 IbIwUMteUtvUrzqAv/DzbGhedp7xWtOFaVo8j0QFYh9zkjd6b8yDOF/yztwX2gjY
 Xt1cd+KpDSiN449ZRaoMI0sCJAxqzhMa6nsWlb0L7KuNyWKAbvKQBm9Rb47FLV9A
 Vx70KC+zkFoyw23capvIahmQazerriUJ5PGe0lVm6ROgmIFdCpXTPDjnrvq/6RZ/
 GEpD7gTW9atGJ7EuEE8686sAfKD5kneChWLX5EHXf0d0AG5Mr2lKsluiGp5LpPJg
 Q1Xqs6xwww==
 =zo4w
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Another series of killing more code than what is being added, again
  thanks to Christoph's relentless cleanups and tech debt tackling.

  This contains:

   - blk-iocost improvements (Baolin Wang)

   - part0 iostat fix (Jeffle Xu)

   - Disable iopoll for split bios (Jeffle Xu)

   - block tracepoint cleanups (Christoph Hellwig)

   - Merging of struct block_device and hd_struct (Christoph Hellwig)

   - Rework/cleanup of how block device sizes are updated (Christoph
     Hellwig)

   - Simplification of gendisk lookup and removal of block device
     aliasing (Christoph Hellwig)

   - Block device ioctl cleanups (Christoph Hellwig)

   - Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig)

   - Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig)

   - sbitmap improvements (Pavel Begunkov)

   - Hybrid polling fix (Pavel Begunkov)

   - bvec iteration improvements (Pavel Begunkov)

   - Zone revalidation fixes (Damien Le Moal)

   - blk-throttle limit fix (Yu Kuai)

   - Various little fixes"

* tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits)
  blk-mq: fix msec comment from micro to milli seconds
  blk-mq: update arg in comment of blk_mq_map_queue
  blk-mq: add helper allocating tagset->tags
  Revert "block: Fix a lockdep complaint triggered by request queue flushing"
  nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class
  blk-mq: add new API of blk_mq_hctx_set_fq_lock_class
  block: disable iopoll for split bio
  block: Improve blk_revalidate_disk_zones() checks
  sbitmap: simplify wrap check
  sbitmap: replace CAS with atomic and
  sbitmap: remove swap_lock
  sbitmap: optimise sbitmap_deferred_clear()
  blk-mq: skip hybrid polling if iopoll doesn't spin
  blk-iocost: Factor out the base vrate change into a separate function
  blk-iocost: Factor out the active iocgs' state check into a separate function
  blk-iocost: Move the usage ratio calculation to the correct place
  blk-iocost: Remove unnecessary advance declaration
  blk-iocost: Fix some typos in comments
  blktrace: fix up a kerneldoc comment
  block: remove the request_queue to argument request based tracepoints
  ...
2020-12-16 12:57:51 -08:00
Ming Lei
88c9979334 nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class
Set nvme-loop's lock class via blk_mq_hctx_set_fq_lock_class for avoiding
lockdep possible recursive locking, then we can remove the dynamically
allocated lock class for each flush queue, finally we can avoid horrible
SCSI probe delay.

This way may not address situation in which one nvme-loop is backed on
another nvme-loop. However, in reality, people seldom uses this way
for test. Even though someone played in this way, it is just one
recursive locking false positive, no real deadlock issue.

Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reported-by: Qian Cai <cai@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-07 20:30:19 -07:00
Christoph Hellwig
1c02fca620 block: remove the request_queue argument to the block_bio_remap tracepoint
The request_queue can trivially be derived from the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-04 09:42:00 -07:00
Christoph Hellwig
8446fe9255 block: switch partition lookup to use struct block_device
Use struct block_device to lookup partitions on a disk.  This removes
all usage of struct hd_struct from the I/O path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de>			[bcache]
Acked-by: Chao Yu <yuchao0@huawei.com>			[f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Javier González
2f4c9ba23b nvme: export zoned namespaces without Zone Append support read-only
Allow ZNS NVMe SSDs to present a read-only namespace when append is not
supported, instead of rejecting the namespace directly.

This allows (i) the namespace to be used in read-only mode, which is not
a problem as the append command only affects the write path, and (ii) to
use standard management tools such as nvme-cli to choose a different
format or firmware slot that is compatible with the Linux zoned block
device.

Signed-off-by: Javier González <javier.gonz@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:38 +01:00
Javier González
ba4fb32056 nvme: rename bdev operations
Remane block device operations in preparation to add char device file
operations.

Signed-off-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:38 +01:00
Javier González
f68abd9cc0 nvme: rename controller base dev_t char device
Rename controller base dev_t char device in preparation for adding a
namespace char device.

Signed-off-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:38 +01:00
Javier González
e1aaf5cacb nvme: remove unnecessary return values
Cleanup unnecessary ret values that are not checked or used in
nvme_alloc_ns().

Signed-off-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:38 +01:00
Minwoo Im
f781f3dd6a nvme: print a warning for when listing active namespaces fails
During the scan_work, an Identify command is issued to figure out which
namespaces are active.  If this command fails, the nvme driver falls back
to scanning namespaces sequentially.  In this situation, we don't see
any warnings and don't even know whether list-ns command has been failed
or not easiliy.

Printa warning when the Identify command executin fail:

[    1.108399] nvme nvme0: Identify NS List failed (status=0x400b)
[    1.109583] nvme0n1: detected capacity change from 0 to 1048576
[    1.112186] nvme nvme0: Identify Descriptors failed (nsid=2, status=0x4002)
[    1.113929] nvme nvme0: Identify Descriptors failed (nsid=3, status=0x4002)
[    1.116537] nvme nvme0: Identify Descriptors failed (nsid=4, status=0x4002)
...

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:37 +01:00
Minwoo Im
aa9d729592 nvme: improve an error message on Identify failure
Add the namespace ID to the error message when the Identify command used
to retrieve the Namespace Identification Descriptor list fails.

This avoids rather useless and duplicative messages like the following:
[    1.321031] nvme nvme0: Identify Descriptors failed (16386)
[    1.321948] nvme nvme0: Identify Descriptors failed (16386)
[    1.322872] nvme nvme0: Identify Descriptors failed (16386)
[    1.323775] nvme nvme0: Identify Descriptors failed (16386)
[    1.324687] nvme nvme0: Identify Descriptors failed (16386)
...

Also, print the nvme status code in hexadecimal rather than decimal
format rather for better readability.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:37 +01:00
Victor Gladkov
8c4dfea97f nvme-fabrics: reject I/O to offline device
Commands get stuck while Host NVMe-oF controller is in reconnect state.
The controller enters into reconnect state when it loses connection with
the target.  It tries to reconnect every 10 seconds (default) until
a successful reconnect or until the reconnect time-out is reached.
The default reconnect time out is 10 minutes.

Applications are expecting commands to complete with success or error
within a certain timeout (30 seconds by default).  The NVMe host is
enforcing that timeout while it is connected, but during reconnect the
timeout is not enforced and commands may get stuck for a long period or
even forever.

To fix this long delay due to the default timeout, introduce new
"fast_io_fail_tmo" session parameter.  The timeout is measured in seconds
from the controller reconnect and any command beyond that timeout is
rejected.  The new parameter value may be passed during 'connect'.
The default value of -1 means no timeout (similar to current behavior).

Signed-off-by: Victor Gladkov <victor.gladkov@kioxia.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:37 +01:00
Colin Ian King
9f20599c48 nvmet: fix a spelling mistake "incuding" -> "including" in Kconfig
There is a spelling mistake in the Kconfig help text. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:37 +01:00
Max Gurtovoy
0068a7b010 nvmet: make sure discovery change log event is protected
Generation counter is protected by nvmet_config_sem. Make sure the
callers that call functions that might change it, are calling it
properly.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:37 +01:00
Amit
6d65aeab7b nvmet: remove unused ctrl->cqs
remove unused cqs from nvmet_ctrl struct
this will reduce the allocated memory.

Signed-off-by: Amit <amit.engel@dell.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:36 +01:00
Niklas Schnelle
e3aef0950a nvme-pci: don't allocate unused I/O queues
currently the NVME_QUIRK_SHARED_TAGS quirk for Apple devices is handled
during the assignment of nr_io_queues in nvme_setup_io_queues().
This however means that for these devices nvme_max_io_queues() will
actually not return the supported maximum which is confusing and
unexpected and also means that in nvme_probe() we are allocating
for I/O queues that will never be used.
Fix this by moving the quirk handling into nvme_max_io_queues().

Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:36 +01:00
Niklas Schnelle
ff4e5fbad0 nvme-pci: drop min() from nr_io_queues assignment
in nvme_setup_io_queues() the number of I/O queues is set to either 1 in
case of a quirky Apple device or to the min of nvme_max_io_queues() or
dev->nr_allocated_queues - 1.
This is unnecessarily complicated as dev->nr_allocated_queues is only
assigned once and is nvme_max_io_queues() + 1.

Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:36 +01:00
Chaitanya Kulkarni
dab3902b19 nvmet: use inline bio for passthru fast path
In nvmet_passthru_execute_cmd() which is a high frequency function
it uses bio_alloc() which leads to memory allocation from the fs pool
for each I/O.

For NVMeoF nvmet_req we already have inline_bvec allocated as a part of
request allocation that can be used with preallocated bio when we
already know the size of request before bio allocation with bio_alloc(),
which we already do.

Introduce a bio member for the nvmet_req passthru anon union. In the
fast path, check if we can get away with inline bvec and bio from
nvmet_req with bio_init() call before actually allocating from the
bio_alloc().

This will be useful to avoid any new memory allocation under high
memory pressure situation and get rid of any extra work of
allocation (bio_alloc()) vs initialization (bio_init()) when
transfer len is < NVMET_MAX_INLINE_DATA_LEN that user can configure at
compile time.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:36 +01:00
Chaitanya Kulkarni
a4fe2d3afe nvmet: use blk_rq_bio_prep instead of blk_rq_append_bio
The function blk_rq_append_bio() is a genereric API written for all
types driver (having bounce buffers) and different context (where
request is already having a bio i.e. rq->bio != NULL).

It does mainly three things: calculating the segments, bounce queue and
if req->bio == NULL call blk_rq_bio_prep() or handle low level merge()
case.

The NVMe PCIe and fabrics transports currently does not use queue
bounce mechanism. In order to find this for each request processing
in the passthru blk_rq_append_bio() does extra work in the fast path
for each request.

When I ran I/Os with different block sizes on the passthru controller
I found that we can reuse the req->sg_cnt instead of iterating over the
bvecs to find out nr_segs in blk_rq_append_bio(). This calculation in
blk_rq_append_bio() is a duplication of work given that we have the
value in req->sg_cnt. (correct me here if I'm wrong).

With NVMe passthru request based driver we allocate fresh request each
time, so every call to blk_rq_append_bio() rq->bio will be NULL i.e.
we don't really need the second condition in the blk_rq_append_bio()
and the resulting error condition in the caller of blk_rq_append_bio().

So for NVMeOF passthru driver recalculating the segments, bounce check
and ll_back_merge code is not needed such that we can get away with the
minimal version of the blk_rq_append_bio() which removes the error check
in the fast path along with extra variable in nvmet_passthru_map_sg().

This patch updates the nvmet_passthru_map_sg() such that it does only
appending the bio to the request in the context of the NVMeOF Passthru
driver. Following are perf numbers :-

With current implementation (blk_rq_append_bio()) :-
----------------------------------------------------
+    5.80%     0.02%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    5.44%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    4.88%     0.00%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    5.44%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    4.86%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    5.17%     0.00%  kworker/0:2-eve  [nvmet]  [k] nvmet_passthru_execute_cmd

With this patch using blk_rq_bio_prep() :-
----------------------------------------------------
+    3.14%     0.02%  kworker/0:2-eve  [nvmet]  [k] nvmet_passthru_execute_cmd
+    3.26%     0.01%  kworker/0:2-eve  [nvmet]  [k] nvmet_passthru_execute_cmd
+    5.37%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    5.18%     0.02%  kworker/0:2-eve  [nvmet]  [k] nvmet_passthru_execute_cmd
+    4.84%     0.02%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    4.87%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:36 +01:00
Chaitanya Kulkarni
06b3bec820 nvmet: remove op_flags for passthru commands
For passthru commands setting op_flags has no meaning. Remove the code
that sets the op flags in nvmet_passthru_map_sg().

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:36 +01:00
Chaitanya Kulkarni
39dfe84451 nvme: split nvme_alloc_request()
Right now nvme_alloc_request() allocates a request from block layer
based on the value of the qid. When qid set to NVME_QID_ANY it used
blk_mq_alloc_request() else blk_mq_alloc_request_hctx().

The function nvme_alloc_request() is called from different context, The
only place where it uses non NVME_QID_ANY value is for fabrics connect
commands :-

nvme_submit_sync_cmd()		NVME_QID_ANY
nvme_features()			NVME_QID_ANY
nvme_sec_submit()		NVME_QID_ANY
nvmf_reg_read32()		NVME_QID_ANY
nvmf_reg_read64()		NVME_QID_ANY
nvmf_reg_write32()		NVME_QID_ANY
nvmf_connect_admin_queue()	NVME_QID_ANY
nvme_submit_user_cmd()		NVME_QID_ANY
	nvme_alloc_request()
nvme_keep_alive()		NVME_QID_ANY
	nvme_alloc_request()
nvme_timeout()			NVME_QID_ANY
	nvme_alloc_request()
nvme_delete_queue()		NVME_QID_ANY
	nvme_alloc_request()
nvmet_passthru_execute_cmd()	NVME_QID_ANY
	nvme_alloc_request()
nvmf_connect_io_queue() 	QID
	__nvme_submit_sync_cmd()
		nvme_alloc_request()

With passthru nvme_alloc_request() now falls into the I/O fast path such
that blk_mq_alloc_request_hctx() is never gets called and that adds
additional branch check in fast path.

Split the nvme_alloc_request() into nvme_alloc_request() and
nvme_alloc_request_qid().

Replace each call of the nvme_alloc_request() with NVME_QID_ANY param
with a call to newly added nvme_alloc_request() without NVME_QID_ANY.

Replace a call to nvme_alloc_request() with QID param with a call to
newly added nvme_alloc_request() and nvme_alloc_request_qid()
based on the qid value set in the __nvme_submit_sync_cmd().

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:35 +01:00
Chaitanya Kulkarni
47e9730c26 nvmet: add passthru io timeout value attr
NVMeOF controller in the passsthru mode is capable of handling wide set
of I/O commands including vender specific passhtru io comands.

The vendor specific I/O commands are used to read the large drive
logs and can take longer than default NVMe commands, i.e. for
passthru requests the timeout value may differ from the passthru
controller's default timeout values (nvme-core:io_timeout).

Add a configfs attribute so that user can set the io timeout values.
In case if this configfs value is not set nvme_alloc_request() will set
the NVME_IO_TIMEOUT value when request queuedata is NULL.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:35 +01:00
Chaitanya Kulkarni
a2f6a2b8ce nvmet: add passthru admin timeout value attr
NVMeOF controller in the passsthru mode is capable of handling wide set
of admin commands including vender specific passhtru admin comands.

The vendor specific admin commands are used to read the large drive
logs and can take longer than default NVMe commands, i.e. for
passthru requests the timeout value may differ from the passthru
controller's default timeout values (nvme-core:admin_timeout).

Add a configfs attribute so that user can set the admin timeout values.
In case if this configfs value is not set nvme_alloc_request() will set
the ADMIN_TIMEOUT value when request queuedata is NULL.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:35 +01:00
Chaitanya Kulkarni
dc96f93874 nvme: use consistent macro name for timeout
This is purely a clenaup patch, add prefix NVME to the ADMIN_TIMEOUT to
make consistent with NVME_IO_TIMEOUT.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:35 +01:00
Chaitanya Kulkarni
0d2e7c840b nvme: centralize setting the timeout in nvme_alloc_request
The function nvme_alloc_request() is called from different context
(I/O and Admin queue) where callers do not consider the I/O timeout when
called from I/O queue context.

Update nvme_alloc_request() to set the default I/O and Admin timeout
value based on whether the queuedata is set or not.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:35 +01:00
Baolin Wang
84115d6d80 nvme: simplify nvme_req_qid()
Use the request's '->mq_hctx->queue_num' directly to simplify the
nvme_req_qid() function.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:34 +01:00
James Smart
03d99e5d63 nvme-fcloop: add sysfs attribute to inject command drop
Add sysfs attribute to specify parameters for dropping a command.  The
attribute takes a string of:

  <opcode>:<starting a what instance>:<number of times>

Opcode is formatted as lower 8 bits are opcode.  If a fabrics opcode, a
bit above bits 7:0 will be set.

Once set, each sqe is looked at. If the opcode matches the running
instance count is updated. If the instance count is in the range of where
to drop (based on starting and # of times), then drop the command by not
passing it to the target layer.

Signed-off-by: James Smart <james.smart@broadcom.com>
2020-12-01 20:36:34 +01:00
Jason Gunthorpe
ed92f6a52b Linux 5.10-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl+69egeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGTSYH/ifRBlaxy5UiHFc0
 2zdR7pkjWrYfDTTT3sazIAhdlzzcfnkUqgFxOP45F4ZIqeTzunH3sUY+5UlT9IX7
 liUgnLxQ/1R9Gx8kPGQfu+tLCey78xVFydGsqJoW9sPRw2R+apMdGGa/lOrk+OXz
 DXIN+dDnGFqwCCNJpK+rxQQhFf++IPpSI8z6Y23moOFhsDZrEziHuVFy2FGyRM6z
 prZ/us/tcobE8ptCk1RmOxLoJ1DR6UxpA2vLimTE+JD8siOsSWPbjE0KudnWCnd5
 BLqIjrsPJbSxyuzzK3v9dnO5wMv7tMDuMIuYM/MQTXDttNwtsqt/aP6gdnUCym7N
 5eHEj5g=
 =MuO1
 -----END PGP SIGNATURE-----

Merge tag 'v5.10-rc5' into rdma.git for-next

For dependencies in following patches

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-23 16:50:59 -04:00
Christoph Hellwig
5a7a9e038b RDMA/core: remove use of dma_virt_ops
Use the ib_dma_* helpers to skip the DMA translation instead.  This
removes the last user if dma_virt_ops and keeps the weird layering
violation inside the RDMA core instead of burderning the DMA mapping
subsystems with it.  This also means the software RDMA drivers now don't
have to mess with DMA parameters that are not relevant to them at all, and
that in the future we can use PCI P2P transfers even for software RDMA, as
there is no first fake layer of DMA mapping that the P2P DMA support.

Link: https://lore.kernel.org/r/20201106181941.1878556-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-17 15:22:07 -04:00
Jason Gunthorpe
bf3b7b7ba9 Merge branch 'for-rc' into rdma.git
From https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git

The rc RDMA branch is needed due to dependencies on the next patches.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-17 15:20:26 -04:00
Christoph Hellwig
d17e66aadb nvme: use set_capacity_and_notify in nvme_set_queue_dying
Use the block layer helper to update both the disk and block device
sizes.  Contrary to the name no notification is sent in this case,
as a size 0 is special cased.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:34:14 -07:00
Christoph Hellwig
449f4ec989 block: remove the update_bdev parameter to set_capacity_revalidate_and_notify
The update_bdev argument is always set to true, so remove it.  Also
rename the function to the slighly less verbose set_capacity_and_notify,
as propagating the disk size to the block device isn't really
revalidation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:34:14 -07:00
Christoph Hellwig
5dd55749b7 nvme: let set_capacity_revalidate_and_notify update the bdev size
There is no good reason to call revalidate_disk_size separately.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:34:14 -07:00
Keith Busch
8168d23fbc nvme: fix memory leak freeing command effects
xa_destroy() frees only internal data. The caller is responsible for
freeing the exteranl objects referenced by an xarray.

Fixes: 1cf7a12e09 ("nvme: use an xarray to lookup the Commands Supported and Effects log")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-11-14 09:57:55 +01:00
Keith Busch
f6224b8681 nvme: directly cache command effects log
Remove the struct used for tracking known command effects logs in a
list. This is now saved in an xarray that doesn't use these elements.
Instead, store the log directly instead of the wrapper struct.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-11-14 09:57:55 +01:00
Minwoo Im
0f0d2c876c nvme: free sq/cq dbbuf pointers when dbbuf set fails
If Doorbell Buffer Config command fails even 'dev->dbbuf_dbs != NULL'
which means OACS indicates that NVME_CTRL_OACS_DBBUF_SUPP is set,
nvme_dbbuf_update_and_check_event() will check event even it's not been
successfully set.

This patch fixes mismatch among dbbuf for sq/cqs in case that dbbuf
command fails.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-11-14 09:57:55 +01:00
Christoph Hellwig
22dd4c7076 nvme-rdma: Use ibdev_to_node instead of dereferencing ->dma_device
->dma_device is a private implementation detail of the RDMA core.  Use the
ibdev_to_node helper to get the NUMA node for a ib_device instead of
poking into ->dma_device.

Link: https://lore.kernel.org/r/20201106181941.1878556-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-12 13:33:44 -04:00
Sagi Grimberg
65c5a055b0 nvme: fix incorrect behavior when BLKROSET is called by the user
The offending commit breaks BLKROSET ioctl because a device
revalidation will blindly override BLKROSET setting. Hence,
we remove the disk rw setting in case NVME_NS_ATTR_RO is cleared
from by the controller.

Fixes: 1293477f4f ("nvme: set gendisk read only based on nsattr")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-11-09 17:39:15 +01:00
Jens Axboe
7ae7a8de05 nvme fixes for 5.10:
- revert a nvme_queue size optimization (Keith Bush)
  - fabrics timeout races fixes (Chao Leng and Sagi Grimberg)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl+jrl0LHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYNBWQ//Q4xtPjq86h5Ki3bHj5KVH2l43pcbQ0q/oR1LTzvb
 Y/gLRHix8+rldX+xRSK27W2tFpz5mLCrFtQmJXXUgB9rcYbfAGOs0zLk8XRyBwvr
 WnEWFF6YPFBv33ratSS76lN78+GF99bJ1sVRkU5NGV4c1fjSc4C4df0azYBh1fHo
 hyY1dL56DSBOcu1GCD4+BaYkpuKG5vSwe7klxk91rMmSjJUCMbCECrxXI4U7iv2Y
 GFn9SNKFnfoBy9n58fKr34e5UtmVKT3mQtWBe8JG6J0J6NKNDEaP93/FcvyIFwdH
 tthHZpWJayD5uAvX2Ud4zt8O3T7unZoaTluT4JvgseD0p8Ox0anEulJiKfeJktfP
 JzOW7le9Asxatl0R/QePs/V8jVjZJmcCAmspqTsBdmyhGXLq+eTeAAfc7vywAWX4
 9C3Pi17s65MVOJFSszZTkB8QN2t+2H7ByV6XaQFOWVgSPw//0wRdjnbfiRPvOSzb
 UZudTxwiGHhLHuzjwHqx7O7sWGuLKYj/3FDZ6RrzYUh23BZIYry6amWAFOpHipTm
 6TzNGy0ZQQB57vPDAyn91dDDXDwKaNFpH671aGyL4I/Y+QpFcGdk0yiFkKj1/AQi
 ShfrMbUKHq8QubDsh54xf7LpRJsIsJUXkzzme/Dx+Imz7lICIx5cAp+SQqJe6rlK
 3tw=
 =QSBc
 -----END PGP SIGNATURE-----

Merge tag 'nvme-5.10-2020-11-05' of git://git.infradead.org/nvme into block-5.10

Pull NVMe fixes from Christoph:

"nvme fixes for 5.10:

 - revert a nvme_queue size optimization (Keith Bush)
 - fabrics timeout races fixes (Chao Leng and Sagi Grimberg)"

* tag 'nvme-5.10-2020-11-05' of git://git.infradead.org/nvme:
  nvme-tcp: avoid repeated request completion
  nvme-rdma: avoid repeated request completion
  nvme-tcp: avoid race between time out and tear down
  nvme-rdma: avoid race between time out and tear down
  nvme: introduce nvme_sync_io_queues
  Revert "nvme-pci: remove last_sq_tail"
2020-11-05 07:10:50 -07:00
Sagi Grimberg
0a8a2c85b8 nvme-tcp: avoid repeated request completion
The request may be executed asynchronously, and rq->state may be
changed to IDLE. To avoid repeated request completion, only
MQ_RQ_COMPLETE of rq->state is checked in nvme_tcp_complete_timed_out.
It is not safe, so need adding check IDLE for rq->state.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-11-03 10:26:02 +01:00
Sagi Grimberg
fdf58e02ad nvme-rdma: avoid repeated request completion
The request may be executed asynchronously, and rq->state may be
changed to IDLE. To avoid repeated request completion, only
MQ_RQ_COMPLETE of rq->state is checked in nvme_rdma_complete_timed_out.
It is not safe, so need adding check IDLE for rq->state.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-11-03 10:26:02 +01:00
Chao Leng
d6f66210f4 nvme-tcp: avoid race between time out and tear down
Now use teardown_lock to serialize for time out and tear down. This may
cause abnormal: first cancel all request in tear down, then time out may
complete the request again, but the request may already be freed or
restarted.

To avoid race between time out and tear down, in tear down process,
first we quiesce the queue, and then delete the timer and cancel
the time out work for the queue. At the same time we need to delete
teardown_lock.

Signed-off-by: Chao Leng <lengchao@huawei.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-11-03 10:26:02 +01:00
Chao Leng
3017013dcc nvme-rdma: avoid race between time out and tear down
Now use teardown_lock to serialize for time out and tear down. This may
cause abnormal: first cancel all request in tear down, then time out may
complete the request again, but the request may already be freed or
restarted.

To avoid race between time out and tear down, in tear down process,
first we quiesce the queue, and then delete the timer and cancel
the time out work for the queue. At the same time we need to delete
teardown_lock.

Signed-off-by: Chao Leng <lengchao@huawei.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-11-03 10:26:01 +01:00
Chao Leng
04800fbff4 nvme: introduce nvme_sync_io_queues
Introduce sync io queues for some scenarios which just only need sync
io queues not sync all queues.

Signed-off-by: Chao Leng <lengchao@huawei.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-11-03 10:25:55 +01:00
Keith Busch
38210800bf Revert "nvme-pci: remove last_sq_tail"
Multiple CPUs may be mapped to the same hctx, allowing mulitple
submission contexts to attempt commit_rqs(). We need to verify we're
not writing the same doorbell value multiple times since that's a spec
violation.

Revert commit 54b2fcee1d.

Link: https://bugzilla.redhat.com/show_bug.cgi?id=1878596
Reported-by: "B.L. Jones" <brandon.gustav@googlemail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-11-02 19:03:14 +01:00
Linus Torvalds
5fc6b075e1 block-5.10-2020-10-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+cRukQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmRAD/9Q4gYoXjnV1XcE5pgdvJzmSCRsyPJvdOAI
 T1KTaakmGckVL5wz2oUs67DFBi8bnf6BwPrcajXd6nucY3PI2KiUlXAfpKIiW7yo
 AQwAkY+V8wS38u4lZg2fL92D4sd1weQWFaGkmYiCHzVHS8wF/boEVKb1mrMtPJZo
 aHZM7kRUZvwVfO1GcOzflFpE0VMLj0ew5hBfyxSVyROe0ouRPVqQx6Q77WxFt44Z
 YP4EhdFTz4ABXLKmkrdiptp5rR5Fx8QBhFiK8NUNxNLsT3Utj5CbTWxmurDre7ST
 p+d5cLaVN2BZeQZSwJSyEv77n4xnLGBkNgPdXni2A9B2y8L1xFIJfN2z6e3BweMo
 hY5eK3iiChmbX09pY/rEG6RKZvQ1ea96eqIXu94e43e2ptxlBuGygAZqkF51Owxa
 SCMROQFMtutkfZ72fS2H72z7OfscmPHDpW+Twsn/Dv6xdfAkXLpBENUtlhF7mTG7
 4QU427tPS76uNsfGx7tek042GmuT4aKrVLmLZuh7Xd34246riCrAPELYvGte2zQz
 gezv7SSanmwm58128I/aj4f2t9fOHD5QMY1pQCvwAec8rFrJJEjGrLyXlBGdZdJd
 lfE9zPySjWv9a7QhQgnPToGiEeK3rpRjRxT1CfGBGFY7+a5ldG3+WFSijAKZtUjG
 Yz33Xw5Okg==
 =8wYV
 -----END PGP SIGNATURE-----

Merge tag 'block-5.10-2020-10-30' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - null_blk zone fixes (Damien, Kanchan)

 - NVMe pull request from Christoph:
       - improve zone revalidation (Keith Busch)
       - gracefully handle zero length messages in nvme-rdma (zhenwei pi)
       - nvme-fc error handling fixes (James Smart)
       - nvmet tracing NULL pointer dereference fix (Chaitanya Kulkarni)"

 - xsysace platform fixes (Andy)

 - scatterlist type cleanup (David)

 - blk-cgroup memory fixes (Gabriel)

 - nbd block size update fix (Ming)

 - Flush completion state fix (Ming)

 - bio_add_hw_page() iteration fix (Naohiro)

* tag 'block-5.10-2020-10-30' of git://git.kernel.dk/linux-block:
  blk-mq: mark flush request as IDLE in flush_end_io()
  lib/scatterlist: use consistent sg_copy_buffer() return type
  xsysace: use platform_get_resource() and platform_get_irq_optional()
  null_blk: Fix locking in zoned mode
  null_blk: Fix zone reset all tracing
  nbd: don't update block size after device is started
  block: advance iov_iter on bio_add_hw_page failure
  null_blk: synchronization fix for zoned device
  nvmet: fix a NULL pointer dereference when tracing the flush command
  nvme-fc: remove nvme_fc_terminate_io()
  nvme-fc: eliminate terminate_io use by nvme_fc_error_recovery
  nvme-fc: remove err_work work item
  nvme-fc: track error_recovery while connecting
  nvme-rdma: handle unexpected nvme completion data length
  nvme: ignore zone validate errors on subsequent scans
  blk-cgroup: Pre-allocate tree node on blkg_conf_prep
  blk-cgroup: Fix memleak on error path
2020-10-30 15:02:49 -07:00
Jason Gunthorpe
071ba4cc55 RDMA: Add rdma_connect_locked()
There are two flows for handling RDMA_CM_EVENT_ROUTE_RESOLVED, either the
handler triggers a completion and another thread does rdma_connect() or
the handler directly calls rdma_connect().

In all cases rdma_connect() needs to hold the handler_mutex, but when
handler's are invoked this is already held by the core code. This causes
ULPs using the 2nd method to deadlock.

Provide a rdma_connect_locked() and have all ULPs call it from their
handlers.

Link: https://lore.kernel.org/r/0-v2-53c22d5c1405+33-rdma_connect_locking_jgg@nvidia.com
Reported-and-tested-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Fixes: 2a7cec5381 ("RDMA/cma: Fix locking for the RDMA_CM_CONNECT state")
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-28 09:14:49 -03:00
Chaitanya Kulkarni
3c3751f2da nvmet: fix a NULL pointer dereference when tracing the flush command
When target side trace in turned on and flush command is issued from the
host it results in the following Oops.

[  856.789724] BUG: kernel NULL pointer dereference, address: 0000000000000068
[  856.790686] #PF: supervisor read access in kernel mode
[  856.791262] #PF: error_code(0x0000) - not-present page
[  856.791863] PGD 6d7110067 P4D 6d7110067 PUD 66f0ad067 PMD 0
[  856.792527] Oops: 0000 [#1] SMP NOPTI
[  856.792950] CPU: 15 PID: 7034 Comm: nvme Tainted: G           OE     5.9.0nvme-5.9+ #71
[  856.793790] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e3214
[  856.794956] RIP: 0010:trace_event_raw_event_nvmet_req_init+0x13e/0x170 [nvmet]
[  856.795734] Code: 41 5c 41 5d c3 31 d2 31 f6 e8 4e 9b b8 e0 e9 0e ff ff ff 49 8b 55 00 48 8b 38 8b 0
[  856.797740] RSP: 0018:ffffc90001be3a60 EFLAGS: 00010246
[  856.798375] RAX: 0000000000000000 RBX: ffff8887e7d2c01c RCX: 0000000000000000
[  856.799234] RDX: 0000000000000020 RSI: 0000000057e70ea2 RDI: ffff8887e7d2c034
[  856.800088] RBP: ffff88869f710578 R08: ffff888807500d40 R09: 00000000fffffffe
[  856.800951] R10: 0000000064c66670 R11: 00000000ef955201 R12: ffff8887e7d2c034
[  856.801807] R13: ffff88869f7105c8 R14: 0000000000000040 R15: ffff88869f710440
[  856.802667] FS:  00007f6a22bd8780(0000) GS:ffff888813a00000(0000) knlGS:0000000000000000
[  856.803635] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  856.804367] CR2: 0000000000000068 CR3: 00000006d73e0000 CR4: 00000000003506e0
[  856.805283] Call Trace:
[  856.805613]  nvmet_req_init+0x27c/0x480 [nvmet]
[  856.806200]  nvme_loop_queue_rq+0xcb/0x1d0 [nvme_loop]
[  856.806862]  blk_mq_dispatch_rq_list+0x123/0x7b0
[  856.807459]  ? kvm_sched_clock_read+0x14/0x30
[  856.808025]  __blk_mq_sched_dispatch_requests+0xc7/0x170
[  856.808708]  blk_mq_sched_dispatch_requests+0x30/0x60
[  856.809372]  __blk_mq_run_hw_queue+0x70/0x100
[  856.809935]  __blk_mq_delay_run_hw_queue+0x156/0x170
[  856.810574]  blk_mq_run_hw_queue+0x86/0xe0
[  856.811104]  blk_mq_sched_insert_request+0xef/0x160
[  856.811733]  blk_execute_rq+0x69/0xc0
[  856.812212]  ? blk_mq_rq_ctx_init+0xd0/0x230
[  856.812784]  nvme_execute_passthru_rq+0x57/0x130 [nvme_core]
[  856.813461]  nvme_submit_user_cmd+0xeb/0x300 [nvme_core]
[  856.814099]  nvme_user_cmd.isra.82+0x11e/0x1a0 [nvme_core]
[  856.814752]  blkdev_ioctl+0x1dc/0x2c0
[  856.815197]  block_ioctl+0x3f/0x50
[  856.815606]  __x64_sys_ioctl+0x84/0xc0
[  856.816074]  do_syscall_64+0x33/0x40
[  856.816533]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  856.817168] RIP: 0033:0x7f6a222ed107
[  856.817617] Code: 44 00 00 48 8b 05 81 cd 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 8
[  856.819901] RSP: 002b:00007ffca848f058 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[  856.820846] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f6a222ed107
[  856.821726] RDX: 00007ffca848f060 RSI: 00000000c0484e43 RDI: 0000000000000003
[  856.822603] RBP: 0000000000000003 R08: 000000000000003f R09: 0000000000000005
[  856.823478] R10: 00007ffca848ece0 R11: 0000000000000202 R12: 00007ffca84912d3
[  856.824359] R13: 00007ffca848f4d0 R14: 0000000000000002 R15: 000000000067e900
[  856.825236] Modules linked in: nvme_loop(OE) nvmet(OE) nvme_fabrics(OE) null_blk nvme(OE) nvme_corel

Move the nvmet_req_init() tracepoint after we parse the command in
nvmet_req_init() so that we can get rid of the duplicate
nvmet_find_namespace() call.
Rename __assign_disk_name() ->  __assign_req_name(). Now that we call
tracepoint after parsing the command simplify the newly added
__assign_req_name() which fixes this bug.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-27 10:02:50 +01:00
James Smart
ac9b820e71 nvme-fc: remove nvme_fc_terminate_io()
__nvme_fc_terminate_io() is now called by only 1 place, in reset_work.
Consoldate and move the functionality of terminate_io into reset_work.

In reset_work, rather than calling the create_association directly,
schedule the connect work element to do its thing. After scheduling,
flush the connect work element to continue with semantic of not
returning until connect has been attempted at least once.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-27 10:02:29 +01:00