Commit graph

4489 commits

Author SHA1 Message Date
Christoph Hellwig
163cc2d3cd block: use bio_release_pages in bio_unmap_user
Use bio_release_pages instead of open coding it.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29 09:47:31 -06:00
Christoph Hellwig
d241a95f35 block: optionally mark pages dirty in bio_release_pages
A lot of callers of bio_release_pages also want to mark the released
pages as dirty.  Add a mark_dirty parameter to avoid a second
relatively expensive bio_for_each_segment_all loop.

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29 09:47:31 -06:00
Christoph Hellwig
b2d0d99135 block: move the BIO_NO_PAGE_REF check into bio_release_pages
Move the BIO_NO_PAGE_REF check into bio_release_pages instead of
duplicating it in both callers.

Also make the function available outside of bio.c so that we can
reuse it in other direct I/O implementations.

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29 09:47:31 -06:00
Revanth Rajashekar
15ddffcb34 block: sed-opal: "Never True" conditions
'who' an unsigned variable in stucture opal_session_info
can never be lesser than zero. Hence, the condition
"who < OPAL_ADMIN1" can never be true.

Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29 09:40:31 -06:00
Revanth Rajashekar
5e4c7cf60e block: sed-opal: PSID reverttper capability
PSID is a 32 character password printed on the drive label,
to prove its physical access. This PSID reverttper function
is very useful to regain the control over the drive when it
is locked and the user can no longer access it because of some
failures. However, *all the data on the drive is completely
erased*. This method is advisable only when the user is exhausted
of all other recovery methods.

PSID capabilities are described in:
https://trustedcomputinggroup.org/wp-content/uploads/TCG_Storage-Opal_Feature_Set_PSID_v1.00_r1.00.pdf

Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29 09:40:30 -06:00
Douglas Anderson
dbc3117d4c block, bfq: NULL out the bic when it's no longer valid
In reboot tests on several devices we were seeing a "use after free"
when slub_debug or KASAN was enabled.  The kernel complained about:

  Unable to handle kernel paging request at virtual address 6b6b6c2b

...which is a classic sign of use after free under slub_debug.  The
stack crawl in kgdb looked like:

 0  test_bit (addr=<optimized out>, nr=<optimized out>)
 1  bfq_bfqq_busy (bfqq=<optimized out>)
 2  bfq_select_queue (bfqd=<optimized out>)
 3  __bfq_dispatch_request (hctx=<optimized out>)
 4  bfq_dispatch_request (hctx=<optimized out>)
 5  0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440)
 6  0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440)
 7  0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440)
 8  0xc0568d94 in blk_mq_run_work_fn (work=<optimized out>)
 9  0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480)
 10 0xc024cff4 in worker_thread (__worker=0xec6d4640)

Digging in kgdb, it could be found that, though bfqq looked fine,
bfqq->bic had been freed.

Through further digging, I postulated that perhaps it is illegal to
access a "bic" (AKA an "icq") after bfq_exit_icq() had been called
because the "bic" can be freed at some point in time after this call
is made.  I confirmed that there certainly were cases where the exact
crashing code path would access the "bic" after bfq_exit_icq() had
been called.  Sspecifically I set the "bfqq->bic" to (void *)0x7 and
saw that the bic was 0x7 at the time of the crash.

To understand a bit more about why this crash was fairly uncommon (I
saw it only once in a few hundred reboots), you can see that much of
the time bfq_exit_icq_fbqq() fully frees the bfqq and thus it can't
access the ->bic anymore.  The only case it doesn't is if
bfq_put_queue() sees a reference still held.

However, even in the case when bfqq isn't freed, the crash is still
rare.  Why?  I tracked what happened to the "bic" after the exit
routine.  It doesn't get freed right away.  Rather,
put_io_context_active() eventually called put_io_context() which
queued up freeing on a workqueue.  The freeing then actually happened
later than that through call_rcu().  Despite all these delays, some
extra debugging showed that all the hoops could be jumped through in
time and the memory could be freed causing the original crash.  Phew!

To make a long story short, assuming it truly is illegal to access an
icq after the "exit_icq" callback is finished, this patch is needed.

Cc: stable@vger.kernel.org
Reviewed-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:44:19 -06:00
Damien Le Moal
a5b47a40be block: Remove unused code
bio_flush_dcache_pages() is unused. Remove it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-27 07:34:25 -06:00
Douglas Anderson
2b50f230f7 block, bfq: Init saved_wr_start_at_switch_to_srt in unlikely case
Some debug code suggested by Paolo was tripping when I did reboot
stress tests.  Specifically in bfq_bfqq_resume_state()
"bic->saved_wr_start_at_switch_to_srt" was later than the current
value of "jiffies".  A bit of debugging showed that
"bic->saved_wr_start_at_switch_to_srt" was actually 0 and a bit more
debugging showed that was because we had run through the "unlikely"
case in the bfq_bfqq_save_state() function.

Let's init "saved_wr_start_at_switch_to_srt" in the unlikely case to
something sane.

NOTE: this fixes no known real-world errors.

Reviewed-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-26 16:25:35 -06:00
Paolo Valente
3726112ec7 block, bfq: re-schedule empty queues if they deserve I/O plugging
Consider, on one side, a bfq_queue Q that remains empty while in
service, and, on the other side, the pending I/O of bfq_queues that,
according to their timestamps, have to be served after Q.  If an
uncontrolled amount of I/O from the latter bfq_queues were dispatched
while Q is waiting for its new I/O to arrive, then Q's bandwidth
guarantees would be violated. To prevent this, I/O dispatch is plugged
until Q receives new I/O (except for a properly controlled amount of
injected I/O). Unfortunately, preemption breaks I/O-dispatch plugging,
for the following reason.

Preemption is performed in two steps. First, Q is expired and
re-scheduled. Second, the new bfq_queue to serve is chosen. The first
step is needed by the second, as the second can be performed only
after Q's timestamps have been properly updated (done in the
expiration step), and Q has been re-queued for service. This
dependency is a consequence of the way how BFQ's scheduling algorithm
is currently implemented.

But Q is not re-scheduled at all in the first step, because Q is
empty. As a consequence, an uncontrolled amount of I/O may be
dispatched until Q becomes non empty again. This breaks Q's service
guarantees.

This commit addresses this issue by re-scheduling Q even if it is
empty. This in turn breaks the assumption that all scheduled queues
are non empty. Then a few extra checks are now needed.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25 09:07:35 -06:00
Paolo Valente
96a291c38c block, bfq: preempt lower-weight or lower-priority queues
BFQ enqueues the I/O coming from each process into a separate
bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be
served for at most timeout_sync milliseconds (default: 125 ms). This
service scheme is prone to the following inaccuracy.

While a bfq_queue Q1 is in service, some empty bfq_queue Q2 may
receive I/O, and, according to BFQ's scheduling policy, may become the
right bfq_queue to serve, in place of the currently in-service
bfq_queue. In this respect, postponing the service of Q2 to after the
service of Q1 finishes may delay the completion of Q2's I/O, compared
with an ideal service in which all non-empty bfq_queues are served in
parallel, and every non-empty bfq_queue is served at a rate
proportional to the bfq_queue's weight. This additional delay is equal
at most to the time Q1 may unjustly remain in service before switching
to Q2.

If Q1 and Q2 have the same weight, then this time is most likely
negligible compared with the completion time to be guaranteed to Q2's
I/O. In addition, first, one of the reasons why BFQ may want to serve
Q1 for a while is that this boosts throughput and, second, serving Q1
longer reduces BFQ's overhead. As a conclusion, it is usually better
not to preempt Q1 if both Q1 and Q2 have the same weight.

In contrast, as Q2's weight or priority becomes higher and higher
compared with that of Q1, the above delay becomes larger and larger,
compared with the I/O completion times that have to be guaranteed to
Q2 according to Q2's weight. So reducing this delay may be more
important than avoiding the costs of preempting Q1.

Accordingly, this commit preempts Q1 if Q2 has a higher weight or a
higher priority than Q1. Preemption causes Q1 to be re-scheduled, and
triggers a new choice of the next bfq_queue to serve. If Q2 really is
the next bfq_queue to serve, then Q2 will be set in service
immediately.

This change reduces the component of the I/O latency caused by the
above delay by about 80%. For example, on an (old) PLEXTOR PX-256M5
SSD, the maximum latency reported by fio drops from 15.1 to 3.2 ms for
a process doing sporadic random reads while another process is doing
continuous sequential reads.

Signed-off-by: Nicola Bottura <bottura.nicola95@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25 09:07:35 -06:00
Paolo Valente
13a857a4c4 block, bfq: detect wakers and unconditionally inject their I/O
A bfq_queue Q may happen to be synchronized with another
bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to
receive new I/O. We call Q2 "waker queue".

If I/O plugging is being performed for Q, and Q is not receiving any
more I/O because of the above synchronization, then, thanks to BFQ's
injection mechanism, the waker queue is likely to get served before
the I/O-plugging timeout fires.

Unfortunately, this fact may not be sufficient to guarantee a high
throughput during the I/O plugging, because the inject limit for Q may
be too low to guarantee a lot of injected I/O. In addition, the
duration of the plugging, i.e., the time before Q finally receives new
I/O, may not be minimized, because the waker queue may happen to be
served only after other queues.

To address these issues, this commit introduces the explicit detection
of the waker queue, and the unconditional injection of a pending I/O
request of the waker queue on each invocation of
bfq_dispatch_request().

One may be concerned that this systematic injection of I/O from the
waker queue delays the service of Q's I/O. Fortunately, it doesn't. On
the contrary, next Q's I/O is brought forward dramatically, for it is
not blocked for milliseconds.

Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25 09:07:34 -06:00
Paolo Valente
a3f9bce369 block, bfq: bring forward seek&think time update
Until the base value for request service times gets finally computed
for a bfq_queue, the inject limit for that queue does depend on the
think-time state (short|long) of the queue. A timely update of the
think time then guarantees a quicker activation or deactivation of the
injection. Fortunately, the think time of a bfq_queue is updated in
the same code path as the inject limit; but after the inject limit.

This commits moves the update of the think time before the update of
the inject limit. For coherence, it moves the update of the seek time
too.

Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25 09:07:34 -06:00
Paolo Valente
24792ad01c block, bfq: update base request service times when possible
I/O injection gets reduced if it increases the request service times
of the victim queue beyond a certain threshold.  The threshold, in its
turn, is computed as a function of the base service time enjoyed by
the queue when it undergoes no injection.

As a consequence, for injection to work properly, the above base value
has to be accurate. In this respect, such a value may vary over
time. For example, it varies if the size or the spatial locality of
the I/O requests in the queue change. It is then important to update
this value whenever possible. This commit performs this update.

Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25 09:07:34 -06:00
Paolo Valente
db599f9ed9 block, bfq: fix rq_in_driver check in bfq_update_inject_limit
One of the cases where the parameters for injection may be updated is
when there are no more in-flight I/O requests. The number of in-flight
requests is stored in the field bfqd->rq_in_driver of the descriptor
bfqd of the device. So, the controlled condition is
bfqd->rq_in_driver == 0.

Unfortunately, this is wrong because, the instruction that checks this
condition is in the code path that handles the completion of a
request, and, in particular, the instruction is executed before
bfqd->rq_in_driver is decremented in such a code path.

This commit fixes this issue by just replacing 0 with 1 in the
comparison.

Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25 09:07:34 -06:00
Paolo Valente
766d61412e block, bfq: reset inject limit when think-time state changes
Until the base value of the request service times gets finally
computed for a bfq_queue, the inject limit does depend on the
think-time state (short|long). The limit must be 0 or 1 if the think
time is deemed, respectively, as short or long. However, such a check
and possible limit update is performed only periodically, once per
second. So, to make the injection mechanism much more reactive, this
commit performs the update also every time the think-time state
changes.

In addition, in the following special case, this commit lets the
inject limit of a bfq_queue bfqq remain equal to 1 even if bfqq's
think time is short: bfqq's I/O is synchronized with that of some
other queue, i.e., bfqq may receive new I/O only after the I/O of the
other queue is completed. Keeping the inject limit to 1 allows the
blocking I/O to be served while bfqq is in service. And this is very
convenient both for bfqq and for the total throughput, as explained
in detail in the comments in bfq_update_has_short_ttime().

Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25 09:07:34 -06:00
Chaitanya Kulkarni
b0e5168a77 block: update print_req_error()
Improve the print_req_error with additional request fields which are
helpful for debugging. Use newly introduced blk_op_str() to print the
REQ_OP_XXX in the string format.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 13:03:51 -06:00
Chaitanya Kulkarni
874c893bf0 block: use blk_op_str() in blk-mq-debugfs.c
Now that we've a helper function blk_op_str() to convert the
REQ_OP_XXX to string XXX, adjust the code to use that. Get rid of
the duplicate array op_name which is now present in the blk-core.c
which we renamed it to "blk_op_name" and open coding in the
blk-mq-debugfs.c.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 13:03:51 -06:00
Chaitanya Kulkarni
e47bc4eda9 block: add centralize REQ_OP_XXX to string helper
In order to centralize the REQ_OP_XXX to string conversion which can be
used in the block layer and different places in the kernel like f2fs,
this patch adds a new helper function along with an array similar to the
one present in the blk-mq-debugfs.c.

We keep this helper functionality centralize under blk-core.c instead of
blk-mq-debugfs.c since blk-core.c is configured using CONFIG_BLOCK and
it will not be dependent on blk-mq-debugfs.c which is configured using
CONFIG_BLK_DEBUG_FS.

Next patch adjusts the code in the blk-mq-debugfs.c with newly
introduced helper.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 13:03:51 -06:00
Christoph Hellwig
178cc590e5 block: improve print_req_error
Print the calling function instead of print_req_error as a prefix, and
print the operation and op_flags separately instead of the whole field.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 13:03:51 -06:00
Christoph Hellwig
8060c47ba8 block: rename CONFIG_DEBUG_BLK_CGROUP to CONFIG_BFQ_CGROUP_DEBUG
This option is entirely bfq specific, give it an appropinquate name.

Also make it depend on CONFIG_BFQ_GROUP_IOSCHED in Kconfig, as all
the functionality already does so anyway.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 10:32:35 -06:00
Christoph Hellwig
d6258980da bfq-iosched: move bfq_stat_recursive_sum into the only caller
This function was moved from core block code and is way to generic.
Fold it into the only caller and simplify it based on the actually
passed arguments.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 10:32:34 -06:00
Christoph Hellwig
c0ce79dca5 blk-cgroup: move struct blkg_stat to bfq
This structure and assorted infrastructure is only used by the bfq I/O
scheduler.  Move it there instead of bloating the common code.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 10:32:34 -06:00
Christoph Hellwig
7af6fd9112 blk-cgroup: introduce a new struct blkg_rwstat_sample
When sampling the blkcg counts we don't need atomics or per-cpu
variables.  Introduce a new structure just containing plain u64
counters.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 10:32:34 -06:00
Christoph Hellwig
5d0b6e48cb blk-cgroup: pass blkg_rwstat structures by reference
Returning a structure generates rather bad code, so switch to passing
by reference.  Also don't require the structure to be zeroed and add
to the 0-initialized counters, but actually set the counters to the
calculated value.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 10:32:34 -06:00
Christoph Hellwig
239eeb0857 blk-cgroup: factor out a helper to read rwstat counter
Trying to break up the crazy statements to something readable.
Also switch to an unsigned counter as it can't ever turn negative.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 10:32:34 -06:00
Christoph Hellwig
1aa0a133fb block: mark blk_rq_bio_prep as inline
This function just has a few trivial assignments, has two callers with
one of them being in the fastpath.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 10:29:22 -06:00
Christoph Hellwig
d627065d88 block: untangle the end of blk_bio_segment_split
Now that we don't need to assign the front/back segment sizes, we can
duplicating the segs assignment for the split vs no-split case and
remove a whole chunk of boilerplate code.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 10:29:22 -06:00
Christoph Hellwig
e9cd19c0c1 block: simplify blk_recalc_rq_segments
Return the segement and let the callers assign them, which makes the code
a littler more obvious.  Also pass the request instead of q plus bio
chain, allowing for the use of rq_for_each_bvec.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 10:29:22 -06:00
Christoph Hellwig
14ccb66b3f block: remove the bi_phys_segments field in struct bio
We only need the number of segments in the blk-mq submission path.
Remove the field from struct bio, and return it from a variant of
blk_queue_split instead of that it can passed as an argument to
those functions that need the value.

This also means we stop recounting segments except for cloning
and partial segments.

To keep the number of arguments in this how path down remove
pointless struct request_queue arguments from any of the functions
that had it and grew a nr_segs argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 10:29:22 -06:00
Christoph Hellwig
f924cddebc block: remove blk_init_request_from_bio
lightnvm should have never used this function, as it is sending
passthrough requests, so switch it to blk_rq_append_bio like all the
other passthrough request users.  Inline blk_init_request_from_bio into
the only remaining caller.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 10:29:22 -06:00
Christoph Hellwig
0c8cf8c2a5 block: initialize the write priority in blk_rq_bio_prep
The priority field also makes sense for passthrough requests, so
initialize it in blk_rq_bio_prep.

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 10:29:22 -06:00
Dennis Zhou
a3fb01ba5a blk-iolatency: only account submitted bios
As is, iolatency recognizes done_bio and cleanup as ending paths. If a
request is marked REQ_NOWAIT and fails to get a request, the bio is
cleaned up via rq_qos_cleanup() and ended in bio_wouldblock_error().
This results in underflowing the inflight counter. Fix this by only
accounting bios that were actually submitted.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 03:29:56 -06:00
Pavel Begunkov
3a211b7152 blk-core: Remove blk_end_request*() declarations
Commit a1ce35fa49 ("block: remove dead elevator code")
deleted blk_end_request() and friends, but some declaration are still
left. Purge them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 03:26:19 -06:00
Chaitanya Kulkarni
243d9f78d9 block: code cleanup queue_poll_stat_show()
This is a pure code cleanup patch and doesn't change any functionality.
Having multiple coding styles in the code creates confusion when
someone tries to add a new code.

Make queue_poll_stat_show() consistent by adding spaces around binary
operators with the rest of the code.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 03:20:05 -06:00
Chaitanya Kulkarni
3f6d385f81 block: use right format specifier for op
In function __blk_mq_debugfs_rq_show variable op has unsigned int type.
Since op can never be negative use %u format specifier to match the
variable type.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 03:20:03 -06:00
Chaitanya Kulkarni
ee1e03598f block: get rid of redundant else
This is a pure code cleanup patch and doesn't change any functionality.
This removes the redundant else in the code which is not needed since
we are returning from function anyway.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 03:20:02 -06:00
Chaitanya Kulkarni
f9bc64a0f0 block: use req_op() to maintain consistency
This is a pure code cleanup patch and doesn't change any functionality.
In block layer to identify the request operation req_op() macro is
used, so change the open coding the req_op() in the blk-mq-debugfs.c.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-17 09:34:18 -06:00
Tejun Heo
71c814077d blkcg: blkcg_activate_policy() should initialize ancestors first
When blkcg_activate_policy() is creating blkg_policy_data for existing
blkgs, it did in the wrong order - descendants first.  Fix it.  None
of the existing controllers seem affected by this.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 10:39:40 -06:00
Tejun Heo
ef069b97fe blkcg: perpcu_ref init/exit should be done from blkg_alloc/free()
blkg alloc is performed as a separate step from the rest of blkg
creation so that GFP_KERNEL allocations can be used when creating
blkgs from configuration file writes because otherwise user actions
may fail due to failures of opportunistic GFP_NOWAIT allocations.

While making blkgs use percpu_ref, 7fcf2b033b ("blkcg: change blkg
reference counting to use percpu_ref") incorrectly added unconditional
opportunistic percpu_ref_init() to blkg_create() breaking this
guarantee.

This patch moves percpu_ref_init() to blkg_alloc() so makes it use
@gfp_mask that blkg_alloc() is called with.  Also, percpu_ref_exit()
is moved to blkg_free() for consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 7fcf2b033b ("blkcg: change blkg reference counting to use percpu_ref")
Cc: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 10:39:39 -06:00
Tejun Heo
f539da82f2 blkcg: update blkcg_print_stat() to handle larger outputs
Depending on the number of devices, blkcg stats can go over the
default seqfile buf size.  seqfile normally retries with a larger
buffer but since the ->pd_stat() addition, blkcg_print_stat() doesn't
tell seqfile that overflow has happened and the output gets printed
truncated.  Fix it by calling seq_commit() w/ -1 on possible
overflows.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 903d23f0a3 ("blk-cgroup: allow controllers to output their own stats")
Cc: stable@vger.kernel.org # v4.19+
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 10:39:37 -06:00
Tejun Heo
5de0073fcd blk-iolatency: clear use_delay when io.latency is set to zero
If use_delay was non-zero when the latency target of a cgroup was set
to zero, it will stay stuck until io.latency is enabled on the cgroup
again.  This keeps readahead disabled for the cgroup impacting
performance negatively.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <jbacik@fb.com>
Fixes: d706751215 ("block: introduce blk-iolatency io controller")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 10:39:36 -06:00
Gustavo A. R. Silva
f1f8f292cd block: bio: Use struct_size() in kmalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct bio_map_data {
	...
        struct iovec iov[];
};

instance = kmalloc(sizeof(sizeof(struct bio_map_data) + sizeof(struct iovec) *
                          count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kmalloc(struct_size(instance, iov, count), GFP_KERNEL);

This code was detected with the help of Coccinelle.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:47:58 -06:00
Gustavo A. R. Silva
78b90a2ce8 block: genhd: Use struct_size() helper
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes, in particular in the
context in which this code is being used.

So, replace the following form:

sizeof(*new_ptbl) + target * sizeof(new_ptbl->part[0])

with:

struct_size(new_ptbl, part, target)

Also, notice that variable size is unnecessary, hence it is removed.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:46:09 -06:00
Pavel Begunkov
315eb65664 blk-mq/debugfs: Fix improper print qualifier
struct blk_rq_stat::mean is a u64 value, so use %llu

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:38:59 -06:00
Ming Lei
c326f846eb blk-mq: remove WARN_ON(!q->elevator) from blk_mq_sched_free_requests
blk_mq_sched_free_requests() may be called in failure path in which
q->elevator may not be setup yet, so remove WARN_ON(!q->elevator) from
blk_mq_sched_free_requests for avoiding the false positive.

This function is actually safe to call in case of !q->elevator because
hctx->sched_tags is checked.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Yi Zhang <yi.zhang@redhat.com>
Fixes: c3e2219216 ("block: free sched's request pool in blk_cleanup_queue")
Reported-by: syzbot+b9d0d56867048c7bcfde@syzkaller.appspotmail.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-13 03:05:58 -06:00
Greg Kroah-Hartman
6cfc0081b0 blk-mq: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

When all of these checks are cleaned up, lots of the functions used in
the blk-mq-debugfs code can now return void, as no need to check the
return value of them either.

Overall, this ends up cleaning up the code and making it smaller, always
a nice win.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-13 03:00:30 -06:00
Damien Le Moal
b9aef63aca block: force select mq-deadline for zoned block devices
In most use cases of zoned block devices (aka SMR disks), the
mq-deadline scheduler is mandatory as it implements sequential write
command processing guarantees with zone write locking. So make sure that
this scheduler is always enabled if CONFIG_BLK_DEV_ZONED is selected.

Tested-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-13 03:00:31 -06:00
Jens Axboe
cf8929885d cgroup/bfq: revert bfq.weight symlink change
There's some discussion on how to do this the best, and Tejun prefers
that BFQ just create the file itself instead of having cgroups support
a symlink feature.

Hence revert commit 54b7b868e8 and 19e9da9e86 for 5.2, and this
can be done properly for 5.3.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-10 03:35:41 -06:00
Angelo Ruocco
19e9da9e86 block, bfq: add weight symlink to the bfq.weight cgroup parameter
Many userspace tools and services use the proportional-share policy of
the blkio/io cgroups controller. The CFQ I/O scheduler implemented
this policy for the legacy block layer. To modify the weight of a
group in case CFQ was in charge, the 'weight' parameter of the group
must be modified. On the other hand, the BFQ I/O scheduler implements
the same policy in blk-mq, but, with BFQ, the parameter to modify has
a different name: bfq.weight (forced choice until legacy block was
present, because two different policies cannot share a common parameter
in cgroups).

Due to CFQ legacy, most if not all userspace configurations still use
the parameter 'weight', and for the moment do not seem likely to be
changed. But, when CFQ went away with legacy block, such a parameter
ceased to exist.

So, a simple workaround has been proposed [1] to make all
configurations work: add a symlink, named weight, to bfq.weight. This
commit adds such a symlink.

[1] https://lkml.org/lkml/2019/4/8/555

Suggested-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-07 01:29:40 -06:00
Ming Lei
c3e2219216 block: free sched's request pool in blk_cleanup_queue
In theory, IO scheduler belongs to request queue, and the request pool
of sched tags belongs to the request queue too.

However, the current tags allocation interfaces are re-used for both
driver tags and sched tags, and driver tags is definitely host wide,
and doesn't belong to any request queue, same with its request pool.
So we need tagset instance for freeing request of sched tags.

Meantime, blk_mq_free_tag_set() often follows blk_cleanup_queue() in case
of non-BLK_MQ_F_TAG_SHARED, this way requires that request pool of sched
tags to be freed before calling blk_mq_free_tag_set().

Commit 47cdee29ef ("block: move blk_exit_queue into __blk_release_queue")
moves blk_exit_queue into __blk_release_queue for simplying the fast
path in generic_make_request(), then causes oops during freeing requests
of sched tags in __blk_release_queue().

Fix the above issue by move freeing request pool of sched tags into
blk_cleanup_queue(), this way is safe becasue queue has been frozen and no any
in-queue requests at that time. Freeing sched tags has to be kept in queue's
release handler becasue there might be un-completed dispatch activity
which might refer to sched tags.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: 47cdee29ef ("block: move blk_exit_queue into __blk_release_queue")
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-06 22:39:39 -06:00