Commit graph

662396 commits

Author SHA1 Message Date
Christoph Hellwig
44e44b29fb nvme: move the retries count to struct nvme_request
The way NVMe uses this field is entirely different from the older
SCSI/BLOCK_PC usage, so move it into struct nvme_request.

Also reduce the size of the field to an unsigned char so that we leave
space for additional smaller fields that will appear soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-05 12:05:08 -06:00
Christoph Hellwig
83f3aeb386 nvme: mark nvme_max_retries static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-05 12:05:08 -06:00
Christoph Hellwig
f6324b1bb7 nvme: cleanup nvme_req_needs_retry
Don't pass the status explicitly but derive it from the request,
and unwind the complex condition to be more readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-05 12:05:08 -06:00
Christoph Hellwig
987f699a8f nvme: move ->retries setup to nvme_setup_cmd
->retries counts the number of times a command is resubmitted, and must
be cleared the first time we see the command.  We currently don't do
that for non-PCIe commands, which is easily fixed by moving the setup
to common code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-05 12:05:08 -06:00
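The retry accounting touched by this series can be sketched in plain C as follows. This is a minimal userspace illustration, not the kernel code: struct sketch_request, the prepared flag, SKETCH_MAX_RETRIES and the SKETCH_STATUS_DNR value are all hypothetical names and values chosen for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
enum { SKETCH_MAX_RETRIES = 5 };
#define SKETCH_STATUS_DNR 0x4000u /* assumed "do not retry" status bit */

struct sketch_request {
    bool prepared;    /* set once the command has been seen */
    uint8_t retries;  /* small counter, leaves room for other fields */
    uint16_t status;  /* 0 means success */
};

/* Common setup path: clear the counter only the first time we see the command. */
void sketch_setup_cmd(struct sketch_request *req)
{
    if (!req->prepared) {
        req->retries = 0;
        req->prepared = true;
    }
}

/* Derive the decision from the request itself instead of a separate status argument. */
bool sketch_req_needs_retry(const struct sketch_request *req)
{
    if (req->status & SKETCH_STATUS_DNR)
        return false;
    if (req->retries >= SKETCH_MAX_RETRIES)
        return false;
    return true;
}

/*
 * Completion path: returns true if the request was requeued. The counter is
 * incremented before requeuing so that the retry limit holds.
 */
bool sketch_complete(struct sketch_request *req)
{
    if (req->status != 0 && sketch_req_needs_retry(req)) {
        req->retries++;
        return true;
    }
    return false;
}
```

Because the counter is cleared in one shared setup function and bumped before every requeue, a transport that calls both helpers should hit the same retry limit as any other.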
Christoph Hellwig
8e14be53f4 remove the obsolete hd driver
This driver is for pre-IDE hard disks that are only found in PCs from the
stone age of personal computing, and which we don't support elsewhere
in the kernel these days.

It's also been marked broken forever.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-05 09:50:34 -06:00
Bart Van Assche
f2fbc9dd78 blk-mq: Remove blk_mq_queue_data.list
The block layer core sets blk_mq_queue_data.list but no block
drivers read that member. Hence remove it and also the code that
is used to set this member.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-05 09:40:15 -06:00
Jan Kara
142bbdfccc cfq: Disable writeback throttling by default
Writeback throttling does not play well with CFQ since that also tries
to throttle async writes. As a result async writeback can get starved in
presence of readers. As an example take a benchmark simulating
PostgreSQL database running over a standard rotating SATA drive. There
are 16 processes doing random reads from a huge file (2*machine memory),
1 process doing random writes to the huge file and calling fsync once
per 50000 writes and 1 process doing sequential 8k writes to a
relatively small file wrapping around at the end of the file and calling
fsync every 5 writes. Under this load read latency easily exceeds the
target latency of 75 ms (just because there are so many reads happening
against a relatively slow disk) and thus writeback is throttled to a
point where only 1 write request is allowed at a time. Blktrace data
then looks like:

  8,0    1        0     8.347751764     0  m   N cfq workload slice:40000000
  8,0    1        0     8.347755256     0  m   N cfq293A  / set_active wl_class: 0 wl_type:0
  8,0    1        0     8.347784100     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    1     3814     8.347763916  5839 UT   N [kworker/u9:2] 1
  8,0    0        0     8.347777605     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    1        0     8.347784100     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    3     1596     8.354364057     0  C   R 156109528 + 8 (6906954) [0]
  8,0    3        0     8.354383193     0  m   N cfq6196SN / complete rqnoidle 0
  8,0    3        0     8.354386476     0  m   N cfq schedule dispatch
  8,0    3        0     8.354399397     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    3        0     8.354404705     0  m   N cfq293A  / dispatch_insert
  8,0    3        0     8.354409454     0  m   N cfq293A  / dispatched a request
  8,0    3        0     8.354412527     0  m   N cfq293A  / activate rq, drv=1
  8,0    3     1597     8.354414692     0  D   W 145961400 + 24 (6718452) [swapper/0]
  8,0    3        0     8.354484184     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    3        0     8.354487536     0  m   N cfq293A  / slice expired t=0
  8,0    3        0     8.354498013     0  m   N / served: vt=5888102466265088 min_vt=5888074869387264
  8,0    3        0     8.354502692     0  m   N cfq293A  / sl_used=6737519 disp=1 charge=6737519 iops=0 sect=24
  8,0    3        0     8.354505695     0  m   N cfq293A  / del_from_rr
...
  8,0    0     1810     8.354728768     0  C   W 145961400 + 24 (314076) [0]
  8,0    0        0     8.354746927     0  m   N cfq293A  / complete rqnoidle 0
...
  8,0    1     3829     8.389886102  5839  G   W 145962968 + 24 [kworker/u9:2]
  8,0    1     3830     8.389888127  5839  P   N [kworker/u9:2]
  8,0    1     3831     8.389908102  5839  A   W 145978336 + 24 <- (8,4) 44000
  8,0    1     3832     8.389910477  5839  Q   W 145978336 + 24 [kworker/u9:2]
  8,0    1     3833     8.389914248  5839  I   W 145962968 + 24 (28146) [kworker/u9:2]
  8,0    1        0     8.389919137     0  m   N cfq293A  / insert_request
  8,0    1        0     8.389924305     0  m   N cfq293A  / add_to_rr
  8,0    1     3834     8.389933175  5839 UT   N [kworker/u9:2] 1
...
  8,0    0        0     9.455290997     0  m   N cfq workload slice:40000000
  8,0    0        0     9.455294769     0  m   N cfq293A  / set_active wl_class:0 wl_type:0
  8,0    0        0     9.455303499     0  m   N cfq293A  / fifo=ffff880003166090
  8,0    0        0     9.455306851     0  m   N cfq293A  / dispatch_insert
  8,0    0        0     9.455311251     0  m   N cfq293A  / dispatched a request
  8,0    0        0     9.455314324     0  m   N cfq293A  / activate rq, drv=1
  8,0    0     2043     9.455316210  6204  D   W 145962968 + 24 (1065401962) [pgioperf]
  8,0    0        0     9.455392407     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    0        0     9.455395969     0  m   N cfq293A  / slice expired t=0
  8,0    0        0     9.455404210     0  m   N / served: vt=5888958194597888 min_vt=5888941810597888
  8,0    0        0     9.455410077     0  m   N cfq293A  / sl_used=4000000 disp=1 charge=4000000 iops=0 sect=24
  8,0    0        0     9.455416851     0  m   N cfq293A  / del_from_rr
...
  8,0    0     2045     9.455648515     0  C   W 145962968 + 24 (332305) [0]
  8,0    0        0     9.455668350     0  m   N cfq293A  / complete rqnoidle 0
...
  8,0    1     4371     9.455710115  5839  G   W 145978336 + 24 [kworker/u9:2]
  8,0    1     4372     9.455712350  5839  P   N [kworker/u9:2]
  8,0    1     4373     9.455730159  5839  A   W 145986616 + 24 <- (8,4) 52280
  8,0    1     4374     9.455732674  5839  Q   W 145986616 + 24 [kworker/u9:2]
  8,0    1     4375     9.455737563  5839  I   W 145978336 + 24 (27448) [kworker/u9:2]
  8,0    1        0     9.455742871     0  m   N cfq293A  / insert_request
  8,0    1        0     9.455747550     0  m   N cfq293A  / add_to_rr
  8,0    1     4376     9.455756629  5839 UT   N [kworker/u9:2] 1

So we can see a Q event for a write request, then IO is blocked by
writeback throttling and G and I events for the request happen only once
other writeback IO is completed. Thus CFQ always sees only one write
request. When it sees it, it queues the async queue behind all the read
queues and the async queue gets scheduled after about one second. When
it is scheduled, that one request gets dispatched and async queue is
expired as it has no more requests to submit. Overall we submit about
one write request per second.

Although this scheduling is beneficial for read latency, writes are
heavily starved and this causes large delays all over the system (due to
processes blocking on page lock, transaction starts, etc.). When
writeback throttling is disabled, write throughput is about one fifth of
the read throughput, which roughly matches the readers/writers ratio, and
overall the system stalls are much shorter.

Mixing writeback throttling logic with CFQ throttling logic is always a
recipe for surprises as CFQ assumes it sees the big part of the picture
which is not necessarily true when writeback throttling is blocking
requests. So disable writeback throttling logic by default when CFQ is
used as an IO scheduler.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-05 08:15:08 -06:00
Adam Manzanares
85003a446e block: fix inheriting request priority from bio
In 4.10 I introduced a patch that associates the ioc priority with
each request in the block layer. This work was done in the single queue
block layer code. This patch unifies ioc priority to request mapping across
the single/multi queue block layers.

I have tested this patch with the null block device driver with the following
parameters.

null_blk queue_mode=2 irqmode=0 use_per_node_hctx=1 nr_devices=1

I have not seen a performance regression with this patch and I would appreciate
any feedback or additional testing.

I have also verified that io priorities are passed to the device when using
the SQ and MQ path to a SATA HDD that supports io priorities.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 15:39:47 -06:00
Christoph Hellwig
77f02a7acd nvme: factor request completion code into a common helper
This avoids duplicating the logic four times, and it also allows us to keep
some helpers static in core.c or just open-code them.

Note that this loses printing the aborted status on completions in the
PCI driver as that uses a data structure not available any more.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Christoph Hellwig
4bca70d067 nvme-fc: drop ctrl for all command completions
A requeue means we go through nvme_fc_start_fcp_op again and get
another controller reference.  To make sure the refcount doesn't
leak we also need to drop it for every completion that came from
the LLDD.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Sagi Grimberg
f2cd54d3eb nvme-fc: increment request retries counter before requeuing
This way our max retry limit holds as well.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Sagi Grimberg
7d9a5e7176 nvme-loop: increment request retries counter before requeuing
This way our max retry limit holds as well.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Sagi Grimberg
e806666e25 nvme-rdma: increment request retries counter before requeuing
This way our max retry limit holds as well.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
James Smart
62eeacb0e0 nvme_fc: Clean up host fcpio done status handling
As Dan Carpenter pointed out, the code was mixing 16-bit nvme status with
32-bit error status from the driver. Correct the comment on the fcp request
struct status field, and convert the done routine to explicitly set nvme
status codes for nvme status.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
James Smart
c820ad4cda nvmet_fc: Clear SG list to avoid double frees
Clear SG list to avoid double frees of payload page list

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
James Smart
f77fc87c37 nvme_fc: correct LS validation
LS validations shouldn't have been independent checks.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
James Smart
4083aa986f nvmet_fc: Sync NVME LS reject reasons with spec
nvmet_fc: Sync NVME LS reject reasons with spec

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
James Smart
726a1080e5 nvme_fc: Add check of status_code in ERSP_IU
Add check of status_code in ERSP_IU

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
James Smart
0f222ccce3 nvme_fc: Sync FC-NVME header with standard
Update FC-NVME definitions to match FC-NVME r1.14 (16-020vB) plus
change voted in by 2/22 FC-NVME Adhoc (see HOSTID below).

Includes the following:
- Addition of "status_code" field to ERSP IU
- Addition of FC-NVME LS RJT reason_codes and reason_explanations
- CreateAssociation payload, HostID field shortened to 16 bytes

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Sagi Grimberg
fd8563ced8 nvme-rdma: Support ctrl_loss_tmo
Before scheduling a reconnect attempt, check
nr_reconnects against max_reconnects, if not
exhausted (or max_reconnects is not -1), schedule
a reconnect attempt, otherwise schedule ctrl
removal.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Sagi Grimberg
42a45274c2 nvme-fabrics: Allow ctrl loss timeout configuration
When a host senses that its controller session is damaged,
it tries to re-establish it periodically (reconnect every
reconnect_delay). It may very well be that the controller
is gone and never coming back, in this case the host will
try to reconnect forever.

Add a ctrl_loss_tmo to bound the number of reconnect attempts
to a specific controller (default to a reasonable 10 minutes).
The timeout is not scheduled on its own; it is translated into a number
of reconnect attempts by dividing it by reconnect_delay. This is useful
to prevent
racing flows of remove and reconnect, and it doesn't really
matter if we remove slightly sooner than what the user requested.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
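The translation from a loss timeout to an attempt budget described above amounts to one division. The sketch below is a standalone illustration under stated assumptions: the function name is hypothetical, and rounding down is assumed here only because the message says removal may happen slightly sooner than requested.

```c
/*
 * Hypothetical helper: turn a controller loss timeout into a number of
 * reconnect attempts. Both arguments are in seconds. Integer division
 * rounds down, which is an assumption of this sketch.
 */
unsigned int sketch_max_reconnects(unsigned int ctrl_loss_tmo,
                                   unsigned int reconnect_delay)
{
    if (reconnect_delay == 0)
        return 0; /* guard against division by zero */
    return ctrl_loss_tmo / reconnect_delay;
}
```

With the 10 minute default (600 seconds) and an assumed reconnect_delay of 10 seconds, this yields 60 attempts. Counting attempts rather than arming a second timer means only the reconnect path ever decides when to give up.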
Sagi Grimberg
7777bdedf3 nvme-rdma: get rid of local reconnect_delay
we already have it in opts.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Sagi Grimberg
3b06837630 nvme-loop: retrieve iod from the cqe command_id
Useful to validate that we didn't mess up
the command_id.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Sagi Grimberg
d89a39be5f nvme-loop: remove unneeded includes
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Sagi Grimberg
c0e4a6f594 nvme-fc: fix module_init (theoretical) error path
If nvmf_register_transport happened to fail
(it can't, but theoretically) we leak memory.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Sagi Grimberg
d19eef029d nvme-loop: fix module_init (theoretical) error path
If nvmf_register_transport happened to fail, we
need to call nvmet_unregister_transport as well.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Sagi Grimberg
a56c79cfd3 nvme-rdma: fix module_init (theoretical) error path
If nvmf_register_transport happened to fail
(it can't, but theoretically) we leak memory.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Max Gurtovoy
2ca0786d5a nvmet: use symbolic constants for log identifiers
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Parav Pandit
64a0ca88ea nvmet: Introduced helper routine for controller status check.
This patch introduces helper function for checking controller
status during admin and io command processing which returns u16
status. As to bring consistency on returning status, other
friend functions also now return u16 status instead of int
to match the spec.

As part of this, the error log prints now also print the qid on
which the command error occurred.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Parav Pandit
4151dd9a58 nvmet: Fixed avoided printing nvmet: twice in error logs.
This patch avoids printing "nvmet:" twice in error logs as it's already
coming through pr_fmt macro.

Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Sagi Grimberg
4459e04297 iscsi-target: use generic inet_pton_with_scope
Instead of parsing address strings, use a generic
helper.

Acked-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Sagi Grimberg
0928f9b4f1 nvme-rdma: use inet_pton_with_scope helper
Both the destination and the host addresses are now
parsed using inet_pton_with_scope helper. We also
get ipv6 (with address scopes support).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Sagi Grimberg
670c2a3ad5 nvmet-rdma: use generic inet_pton_with_scope
Instead of parsing address strings, use a generic
helper. This also adds ipv6 (with address scopes)
support.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Sagi Grimberg
b1a951fe46 net/utils: generic inet_pton_with_scope helper
Several locations in the stack need to handle ipv4/ipv6
(with scope) and port strings conversion to sockaddr.
Add a helper that takes either AF_INET, AF_INET6 or
AF_UNSPEC (for wildcard) to centralize this handling.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
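A userspace analogue of such a helper might look like the sketch below. It is not the kernel's inet_pton_with_scope: sketch_pton and sketch_parse_port are hypothetical names, the POSIX inet_pton() call stands in for the kernel parsers, and IPv6 scope identifiers (the "%ifname" suffix) are deliberately left out.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

/* Parse a decimal port string; returns 0 on success, -1 on error. */
static int sketch_parse_port(const char *port, unsigned short *out)
{
    char *end;
    unsigned long v = strtoul(port, &end, 10);

    if (end == port || *end != '\0' || v > 65535)
        return -1;
    *out = (unsigned short)v;
    return 0;
}

/*
 * Convert address and port strings into a sockaddr. af may be AF_INET,
 * AF_INET6, or AF_UNSPEC; with AF_UNSPEC, IPv4 is tried first, then IPv6.
 * Returns 0 on success, -1 on error.
 */
int sketch_pton(int af, const char *addr, const char *port,
                struct sockaddr_storage *ss)
{
    unsigned short p;

    if (sketch_parse_port(port, &p) != 0)
        return -1;

    if (af == AF_INET || af == AF_UNSPEC) {
        struct sockaddr_in *in4 = (struct sockaddr_in *)ss;

        memset(ss, 0, sizeof(*ss));
        if (inet_pton(AF_INET, addr, &in4->sin_addr) == 1) {
            in4->sin_family = AF_INET;
            in4->sin_port = htons(p);
            return 0;
        }
        if (af == AF_INET)
            return -1;
    }

    if (af == AF_INET6 || af == AF_UNSPEC) {
        struct sockaddr_in6 *in6 = (struct sockaddr_in6 *)ss;

        memset(ss, 0, sizeof(*ss));
        if (inet_pton(AF_INET6, addr, &in6->sin6_addr) == 1) {
            in6->sin6_family = AF_INET6;
            in6->sin6_port = htons(p);
            return 0;
        }
    }
    return -1;
}
```

Callers that know the family pass it explicitly and get a hard failure on mismatch; callers that accept either family pass AF_UNSPEC and inspect ss_family afterwards.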
Sagi Grimberg
297186d640 nvme-loop: remove some code duplication
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Sagi Grimberg
782d820ca4 nvme-rdma: Give some more grace for rdma connection establishment
The target might be occupied with multiple hosts so let's
give it some more grace before failing the connection
establishment.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Sagi Grimberg
777dc82395 nvmet-rdma: occasionally flush ongoing controller teardown
If we are attacked with establishments/teardowns we need to
make sure we do not consume too much system memory. Thus
let ongoing controller teardowns complete before accepting
new controller establishments.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Sagi Grimberg
dc2ad16ab2 nvme-rdma: handle cpu unplug when re-establishing the controller
If a cpu unplug event has occurred, we need to take the minimum
of the provided nr_io_queues and the number of online cpus,
otherwise we won't be able to connect them as blk-mq mapping
won't dispatch to those queues.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
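The clamping rule described above is a simple minimum. The sketch below takes the online CPU count as a parameter so that it stays self-contained; the function name is hypothetical, and in the kernel the count would come from num_online_cpus().

```c
/*
 * Hypothetical helper: number of I/O queues to set up when re-establishing
 * a controller. Never use more queues than there are online CPUs, since
 * queues beyond that count would not get requests dispatched to them.
 */
unsigned int sketch_clamp_io_queues(unsigned int requested,
                                    unsigned int online_cpus)
{
    return requested < online_cpus ? requested : online_cpus;
}
```

For example, a controller created with 8 I/O queues that reconnects after four CPUs went offline on an 8 CPU machine would ask for only 4 queues.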
Sagi Grimberg
8d61413db6 nvmet-rdma: Fix a possible uninitialized variable dereference
When handling a new recv command, we grab a new rsp resource and
check for the queue state being live. In case the queue is not in
live state, we simply restore the rsp back to the free list. However
in this flow we didn't set rsp->queue yet, so we cannot dereference it.

Instead, make sure to initialize rsp->queue (and other rsp members)
as soon as possible so we won't reference uninitialized variables.

Reported-by: Yi Zhang <yizhan@redhat.com>
Reported-by: Raju Rangoju <rajur@chelsio.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Sagi Grimberg
427242ce99 nvmet: confirm sq percpu has scheduled and switched to atomic
percpu_ref_kill is not enough to prevent subsequent
percpu_ref_tryget_live from succeeding. Hence call
percpu_ref_kill_and_confirm to make it safe.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Sagi Grimberg
6ecda70ea9 nvme-loop: handle cpu unplug when re-establishing the controller
If a cpu unplug event has occurred, we need to take the minimum
of the provided nr_io_queues and the number of online cpus,
otherwise we won't be able to connect them as blk-mq mapping
won't dispatch to those queues.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-04-04 09:44:43 -06:00
Sagi Grimberg
d476983ea0 nvme-loop: fix a possible use-after-free when destroying the admin queue
we need to destroy the nvmet sq and let it finish gracefully
before continue to cleanup the queue.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-04-04 09:44:41 -06:00
Eric Biggers
f363b089be blk-mq: constify struct blk_mq_ops
Constify all instances of blk_mq_ops, as they are never modified.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-31 08:28:58 -06:00
Jens Axboe
db5bcf87bb null_blk: add blocking mode
This adds a new module parameter to null_blk, blocking. If set, null_blk
will set the BLK_MQ_F_BLOCKING flag, indicating that it sometimes/always
needs to block in its ->queue_rq() function.  The intent is to help find
regressions in blocking drivers, since not many of them exist.

If null_blk is loaded with submit_queues > 1 and blocking=1, this
shows the regression recently fixed by bf4907c05e.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-30 13:44:26 -06:00
Jens Axboe
bf4907c05e blk-mq: fix schedule-under-preempt for blocking drivers
Commit a4d907b6a3 unified the single and multi queue request handlers,
but in the process, it also screwed up the locking balance and calls
blk_mq_try_issue_directly() with the ctx preempt lock held. This is a
problem for drivers that have set BLK_MQ_F_BLOCKING, since now they
can't reliably sleep.

While in there, protect against similar issues in the future, by adding
a might_sleep() trigger in the BLOCKING path for direct issue or queue
run.

Reported-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Josef Bacik <josef@toxicpanda.com>
Fixes: a4d907b6a3 ("blk-mq: streamline blk_mq_make_request")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-30 12:30:39 -06:00
Colin Ian King
47d752076a block/sed-opal: fix spelling mistake: "Lifcycle" -> "Lifecycle"
trivial fix to spelling mistake in pr_err error message

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-30 09:22:53 -06:00
Minchan Kim
3e06eb3dac block: do not put mq context in blk_mq_alloc_request_hctx
In blk_mq_alloc_request_hctx, blk_mq_sched_get_request doesn't
get sw context so we don't need to put the context with
blk_mq_put_ctx. Otherwise, we will see a preempt counter underflow.

Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-30 08:13:05 -06:00
Jens Axboe
3e8a7069b9 blk-mq: include errors in did_work calculation
Currently we return true in blk_mq_dispatch_rq_list() if we queued IO
successfully, but we really want to return whether or not we made
progress. Progress includes if we got an error return.  If we don't
count errors as progress, this can lead to a hang in
blk_mq_sched_dispatch_requests() when a driver is draining IO by
returning BLK_MQ_QUEUE_ERROR instead of manually ending the IO in
error and returning BLK_MQ_QUEUE_OK.

Tested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29 13:21:13 -06:00
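The progress rule described above can be sketched as a loop over per-request driver results. This is a simplified userspace model with hypothetical names (enum sketch_rc, sketch_dispatch); the real dispatch function works on a request list and calls into the driver.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes modelled on the ones named in the message. */
enum sketch_rc {
    SKETCH_QUEUE_OK,
    SKETCH_QUEUE_BUSY,
    SKETCH_QUEUE_ERROR
};

/*
 * Walk the driver results for a batch of requests and report whether any
 * progress was made. A request that failed with an error has still been
 * consumed, so it counts as progress just like a queued one.
 */
bool sketch_dispatch(const enum sketch_rc *results, size_t n)
{
    size_t queued = 0, errors = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (results[i] == SKETCH_QUEUE_OK) {
            queued++;
        } else if (results[i] == SKETCH_QUEUE_ERROR) {
            errors++; /* request is ended in error, keep going */
        } else {
            break;    /* busy: stop and leave the rest for later */
        }
    }
    return (queued + errors) != 0;
}
```

A driver that drains its queue by failing every request then still reports progress, so a caller that loops until no work was done keeps going until the queue is empty.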
Josef Bacik
b58e176914 block-mq: don't re-queue if we get a queue error
When we try to issue a request directly and fail, we requeue the
request, but call blk_mq_end_request() as well.  This leads to the
completed request being on a queuelist and getting ended twice, which
causes list corruption in schedulers and other shenanigans.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29 13:18:18 -06:00
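The invariant behind this fix is that a failed direct issue is either requeued or ended, never both. The sketch below models that with hypothetical types and counters; it illustrates the rule only and is not the kernel's direct-issue path.

```c
#include <stdbool.h>

/* Hypothetical result codes and request state, for illustration only. */
enum sketch_issue_rc {
    SKETCH_ISSUE_OK,
    SKETCH_ISSUE_BUSY,
    SKETCH_ISSUE_ERROR
};

struct sketch_rq {
    bool on_queuelist; /* true once the request has been requeued */
    int end_count;     /* how many times the request has been ended */
};

/* Handle the driver's answer to a direct issue attempt. */
void sketch_handle_direct_issue(struct sketch_rq *rq, enum sketch_issue_rc rc)
{
    switch (rc) {
    case SKETCH_ISSUE_OK:
        break;                   /* driver owns the request now */
    case SKETCH_ISSUE_BUSY:
        rq->on_queuelist = true; /* requeue, try again later */
        break;
    case SKETCH_ISSUE_ERROR:
        rq->end_count++;         /* end in error, do not requeue */
        break;
    }
}
```

Keeping the two outcomes in separate branches makes it impossible for a completed request to remain on a queue list, which is the corruption the message describes.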
Tahsin Erdogan
457e490f2b blkcg: allocate struct blkcg_gq outside request queue spinlock
blkg_conf_prep() currently calls blkg_lookup_create() while holding
request queue spinlock. This means allocating memory for struct
blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
failures in call paths like below:

  pcpu_alloc+0x68f/0x710
  __alloc_percpu_gfp+0xd/0x10
  __percpu_counter_init+0x55/0xc0
  cfq_pd_alloc+0x3b2/0x4e0
  blkg_alloc+0x187/0x230
  blkg_create+0x489/0x670
  blkg_lookup_create+0x9a/0x230
  blkg_conf_prep+0x1fb/0x240
  __cfqg_set_weight_device.isra.105+0x5c/0x180
  cfq_set_weight_on_dfl+0x69/0xc0
  cgroup_file_write+0x39/0x1c0
  kernfs_fop_write+0x13f/0x1d0
  __vfs_write+0x23/0x120
  vfs_write+0xc2/0x1f0
  SyS_write+0x44/0xb0
  entry_SYSCALL_64_fastpath+0x18/0xad

In the code path above, percpu allocator cannot call vmalloc() due to
queue spinlock.

A failure in this call path gives grief to tools which are trying to
configure io weights. We see occasional failures happen shortly after
reboots even when system is not under any memory pressure. Machines
with a lot of cpus are more vulnerable to this condition.

Do struct blkcg_gq allocations outside the queue spinlock to allow
blocking during memory allocations.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29 11:27:19 -06:00
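The general pattern of allocating before taking a spinlock can be sketched in userspace C as follows. Everything here is a hypothetical model: struct sketch_group stands in for the per-queue group object, a C11 atomic_flag stands in for the queue spinlock, and calloc() stands in for an allocation that may block.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical group object kept on a simple linked list. */
struct sketch_group {
    int id;
    struct sketch_group *next;
};

static atomic_flag sketch_lock = ATOMIC_FLAG_INIT;
static struct sketch_group *sketch_groups;

static void sketch_spin_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&sketch_lock,
                                             memory_order_acquire))
        ; /* spin: nothing that can sleep may run while this is held */
}

static void sketch_spin_unlock(void)
{
    atomic_flag_clear_explicit(&sketch_lock, memory_order_release);
}

/* Look up the group with the given id, creating it if it does not exist. */
struct sketch_group *sketch_lookup_create(int id)
{
    /* Allocate up front, before the lock, where blocking is allowed. */
    struct sketch_group *fresh = calloc(1, sizeof(*fresh));
    struct sketch_group *g;

    sketch_spin_lock();
    for (g = sketch_groups; g; g = g->next)
        if (g->id == id)
            break;
    if (!g && fresh) {
        fresh->id = id;
        fresh->next = sketch_groups;
        sketch_groups = fresh;
        g = fresh;
        fresh = NULL; /* ownership moved to the list */
    }
    sketch_spin_unlock();

    free(fresh); /* unused if the group already existed; free(NULL) is fine */
    return g;
}
```

The cost of this approach is an occasional wasted allocation when the group already exists, in exchange for never having to allocate in a context that cannot block.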