Commit graph

6037 commits

Author SHA1 Message Date
Ming Lei
73f3760edd blk-mq: don't use plug->mq_list->q directly in blk_mq_run_dispatch_ops()
blk_mq_run_dispatch_ops() is defined as one macro, and plug->mq_list
will be changed when running 'dispatch_ops', so add one local variable
for holding request queue.

Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 4cafe86c92 ("blk-mq: run dispatch lock once in case of issuing from list")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-06 09:41:40 -07:00
Ming Lei
41adf531e3 blk-mq: don't run might_sleep() if the operation needn't blocking
The operation protected via blk_mq_run_dispatch_ops() in blk_mq_run_hw_queue
won't sleep, so don't run might_sleep() for it.

Reported-and-tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-06 09:40:42 -07:00
Ming Lei
4cafe86c92 blk-mq: run dispatch lock once in case of issuing from list
It isn't necessary to call blk_mq_run_dispatch_ops() once for issuing
single request directly, and enough to do it one time when issuing from
whole list.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-03 14:51:29 -07:00
Ming Lei
bcc330f42f blk-mq: pass request queue to blk_mq_run_dispatch_ops
We have switched to allocate srcu into request queue, so it is fine
to pass request queue to blk_mq_run_dispatch_ops().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-03 14:51:29 -07:00
Ming Lei
704b914f15 blk-mq: move srcu from blk_mq_hw_ctx to request_queue
In case of BLK_MQ_F_BLOCKING, per-hctx srcu is used to protect dispatch
critical area. However, this srcu instance stays at the end of hctx, and
it often takes standalone cacheline, often cold.

Inside srcu_read_lock() and srcu_read_unlock(), WRITE is always done on
the indirect percpu variable which is allocated from heap instead of
being embedded, srcu->srcu_idx is read only in srcu_read_lock(). It
doesn't matter if srcu structure stays in hctx or request queue.

So switch to per-request-queue srcu for protecting dispatch, and this
way simplifies quiesce a lot, not mention quiesce is always done on the
request queue wide.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-03 14:51:29 -07:00
Ming Lei
2a904d0085 blk-mq: remove hctx_lock and hctx_unlock
Remove hctx_lock and hctx_unlock, and add one helper of
blk_mq_run_dispatch_ops() to run code block defined in dispatch_ops
with rcu/srcu read held.

Compared with hctx_lock()/hctx_unlock():

1) remove 2 branch to 1, so we just need to check
(hctx->flags & BLK_MQ_F_BLOCKING) once when running one dispatch_ops

2) srcu_idx needn't to be touched in case of non-blocking

3) might_sleep_if() can be moved to the blocking branch

Also put the added blk_mq_run_dispatch_ops() in private header, so that
the following patch can use it out of blk-mq.c.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-03 14:51:29 -07:00
Jens Axboe
0a467d0fdd block: switch to atomic_t for request references
refcount_t is not as expensive as it used to be, but it's still more
expensive than the io_uring method of using atomic_t and just checking
for potential over/underflow.

This borrows that same implementation, which in turn is based on the
mm implementation from Linus.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-03 14:51:29 -07:00
Jens Axboe
ceaa762527 block: move direct_IO into our own read_iter handler
Don't call into generic_file_read_iter() if we know it's O_DIRECT, just
set it up ourselves and call our own handler. This avoids an indirect call
for O_DIRECT.

Fall back to filemap_read() if we fail.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-03 14:51:29 -07:00
Jakub Kicinski
8581fd402a treewide: Add missing includes masked by cgroup -> bpf dependency
cgroup.h (therefore swap.h, therefore half of the universe)
includes bpf.h which in turn includes module.h and slab.h.
Since we're about to get rid of that dependency we need
to clean things up.

v2: drop the cpu.h include from cacheinfo.h, it's not necessary
and it makes riscv sensitive to ordering of include files.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Krzysztof Wilczyński <kw@linux.com>
Acked-by: Peter Chen <peter.chen@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/all/20211120035253.72074-1-kuba@kernel.org/  # v1
Link: https://lore.kernel.org/all/20211120165528.197359-1-kuba@kernel.org/ # cacheinfo discussion
Link: https://lore.kernel.org/bpf/20211202203400.1208663-1-kuba@kernel.org
2021-12-03 10:58:13 -08:00
Jens Axboe
a08ed9aae8 block: fix double bio queue when merging in cached request path
When we attempt to merge off the cached request path, we return NULL
if successful. This makes the caller believe that it's should allocate
a new request, and hence we end up with the bio both merged and associated
with a new request. This, predictably, leads to all sorts of crashes.

Pass in a pointer to the bio pointer, and clear it for the merge case.
Then the caller knows that the bio is already queued, and no new requests
need to get allocated.

Fixes: 5b13bc8a3f ("blk-mq: cleanup request allocation")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-02 19:39:01 -07:00
Jens Axboe
373b5416b4 block: get rid of useless goto and label in blk_mq_get_new_requests()
Expected case is returning a request, just check for success and return
the request rather than having an error label.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-02 12:42:58 -07:00
Ming Lei
18d78171c0 blk-mq: check q->poll_stat in queue_poll_stat_show
Without checking q->poll_stat in queue_poll_stat_show(), kernel panic
may be caused if q->poll_stat isn't allocated.

Fixes: 48b5c1fbcd ("block: only allocate poll_stats if there's a user of them")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211202090716.3292244-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-02 08:20:49 -07:00
Ye Bin
8a7518931b block: Fix fsync always failed if once failed
We do test with inject error fault base on v4.19, after test some time we found
sync /dev/sda always failed.
[root@localhost] sync /dev/sda
sync: error syncing '/dev/sda': Input/output error

scsi log as follows:
[19069.812296] sd 0:0:0:0: [sda] tag#64 Send: scmd 0x00000000d03a0b6b
[19069.812302] sd 0:0:0:0: [sda] tag#64 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[19069.812533] sd 0:0:0:0: [sda] tag#64 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[19069.812536] sd 0:0:0:0: [sda] tag#64 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[19069.812539] sd 0:0:0:0: [sda] tag#64 scsi host busy 1 failed 0
[19069.812542] sd 0:0:0:0: Notifying upper driver of completion (result 0)
[19069.812546] sd 0:0:0:0: [sda] tag#64 sd_done: completed 0 of 0 bytes
[19069.812549] sd 0:0:0:0: [sda] tag#64 0 sectors total, 0 bytes done.
[19069.812564] print_req_error: I/O error, dev sda, sector 0

ftrace log as follows:
 rep-306069 [007] .... 19654.923315: block_bio_queue: 8,0 FWS 0 + 0 [rep]
 rep-306069 [007] .... 19654.923333: block_getrq: 8,0 FWS 0 + 0 [rep]
 kworker/7:1H-250   [007] .... 19654.923352: block_rq_issue: 8,0 FF 0 () 0 + 0 [kworker/7:1H]
 <idle>-0     [007] ..s. 19654.923562: block_rq_complete: 8,0 FF () 18446744073709551615 + 0 [0]
 <idle>-0     [007] d.s. 19654.923576: block_rq_complete: 8,0 WS () 0 + 0 [-5]

As 8d6996630c introduce 'fq->rq_status', this data only update when 'flush_rq'
reference count isn't zero. If flush request once failed and record error code
in 'fq->rq_status'. If there is no chance to update 'fq->rq_status',then do fsync
will always failed.
To address this issue reset 'fq->rq_status' after return error code to upper layer.

Fixes: 8d6996630c03("block: fix null pointer dereference in blk_mq_rq_timed_out()")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211129012659.1553733-1-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:46:04 -07:00
Christoph Hellwig
b84ba30b6c block: remove the gendisk argument to blk_execute_rq
Remove the gendisk aregument to blk_execute_rq and blk_execute_rq_nowait
given that it is unused now.  Also convert the boolean at_head parameter
to actually use the bool type while touching the prototype.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
f3fa33acca block: remove the ->rq_disk field in struct request
Just use the disk attached to the request_queue instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
79bb1dbd12 block: don't check ->rq_disk in merges
There is a 1:1 relationship between request_queues and gendisks now, so
no need for these extra checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Colin Ian King
af22fef3e7 block: Remove redundant initialization of variable ret
The variable ret is being initialized with a value that is never
read, it is being updated later on. The assignment is redundant and
can be removed.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20211126230652.1175636-1-colin.i.king@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
eca5892a5d block: simplify ioc_lookup_icq
Remove the ioc argument as it always points to current->io_context.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
18b74c4dca block: simplify ioc_create_icq
Remove the ioc and gfp_mask argument, which are hard coded by the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
d538ea4cb8 block: return the io_context from create_task_io_context
Grab a reference to the newly allocated or existing io_context in
create_task_io_context and return it.  This simplifies the callers and
removes the need for double lookups.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
8ffc13680e block: use alloc_io_context in __copy_io
In __copy_io we know that the newly allocate task_struct does not have
an I/O context yet and is not exiting.  So just allocate the I/O context
struct and install it directly.  There is no need to lock the task
either as it is just being created.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
a0f14d8baa block: factor out a alloc_io_context helper
Factor out a helper that just allocate an I/O context.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
50569c24be block: remove get_io_context_active
Fold it into it's only caller, and remove a lof of the debug checks
that are not needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
222ee581b8 block: move the remaining elv.icq handling to the I/O scheduler
After the prepare side has been moved to the only I/O scheduler that
cares, do the same for the cleanup and the NULL initialization.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
87dd1d63dc block: move blk_mq_sched_assign_ioc to blk-ioc.c
Move blk_mq_sched_assign_ioc so that many interfaces from the file can
be marked static.  Rename the function to ioc_find_get_icq as well and
return the icq to simplify the interface.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
3304742562 block: mark put_io_context_active static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
c2a32464f4 Revert "block: Provide blk_mq_sched_get_icq()"
This reverts commit 4896c4e64ba5d5d5acdbcf68c5910dd4f6d8fa62.

The helper is not needed any more.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
a0725c22cd bfq: use bfq_bic_lookup in bfq_limit_depth
No need to create a new I/O context if there is none present yet in
->limit_depth.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
836b394b63 bfq: simplify bfq_bic_lookup
Remove the unused bfqd argument, and hardcode ioc to current->io_context.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
88c9a2ce52 fork: move copy_io to block/blk-ioc.c
Move the copying of the I/O context to the block layer as that is where
we can use the proper low-level interfaces.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Ming Lei
5f480b1a63 blk-mq: use bio->bi_opf after bio is checked
bio->bi_opf isn't finalized before checking the bio, so use it after
submit_bio_checks() returns.

Fixes: 5b13bc8a3f ("blk-mq: cleanup request allocation")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:23 -07:00
Jan Kara
c65e6fd460 bfq: Do not let waker requests skip proper accounting
Commit 7cc4ffc555 ("block, bfq: put reqs of waker and woken in
dispatch list") added a condition to bfq_insert_request() which added
waker's requests directly to dispatch list. The rationale was that
completing waker's IO is needed to get more IO for the current queue.
Although this rationale is valid, there is a hole in it. The waker does
not necessarily serve the IO only for the current queue and maybe it's
current IO is not needed for current queue to make progress. Furthermore
injecting IO like this completely bypasses any service accounting within
bfq and thus we do not properly track how much service is waker's queue
getting or that the waker is actually doing any IO. Depending on the
conditions this can result in the waker getting too much or too few
service.

Consider for example the following job file:

[global]
directory=/mnt/repro/
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync

[slowwriter]
numjobs=1
prioclass=2
prio=7
fsync=200

[fastwriter]
numjobs=1
prioclass=2
prio=0
fsync=200

Despite processes have very different IO priorities, they get the same
about of service. The reason is that bfq identifies these processes as
having waker-wakee relationship and once that happens, IO from
fastwriter gets injected during slowwriter's time slice. As a result bfq
is not aware that fastwriter has any IO to do and constantly schedules
only slowwriter's queue. Thus fastwriter is forced to compete with
slowwriter's IO all the time instead of getting its share of time based
on IO priority.

Drop the special injection condition from bfq_insert_request(). As a
result, requests will be tracked and queued in a normal way and on next
dispatch bfq_select_queue() can decide whether the waker's inserted
requests should be injected during the current queue's timeslice or not.

Fixes: 7cc4ffc555 ("block, bfq: put reqs of waker and woken in dispatch list")
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-8-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:39:31 -07:00
Jan Kara
1eb17f5e15 bfq: Log waker detections
Waker - wakee relationships are important in deciding whether one queue
can preempt the other one. Print information about detected waker-wakee
relationships so that scheduling decisions can be better understood from
block traces.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-7-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:39:31 -07:00
Jan Kara
582f04e19a bfq: Provide helper to generate bfqq name
Instead of having helper formating bfqq pid, provide a helper to
generate full bfqq name as used in the traces. It saves some code
duplication and will save more in the coming tracepoints.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-6-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:52 -07:00
Jan Kara
1f18b7005b bfq: Limit waker detection in time
Currently, when process A starts issuing requests shortly after process
B has completed some IO three times in a row, we decide that B is a
"waker" of A meaning that completing IO of B is needed for A to make
progress and generally stop separating A's and B's IO much. This logic
is useful to avoid unnecessary idling and thus throughput loss for cases
where workload needs to switch e.g. between the process and the
journaling thread doing IO. However the detection heuristic tends to
frequently give false positives when A and B are fighting IO bandwidth
and other processes aren't doing much IO as we are basically deemed to
eventually accumulate three occurences of a situation where one process
starts issuing requests after the other has completed some IO. To reduce
these false positives, cancel the waker detection also if we didn't
accumulate three detected wakeups within given timeout. The rationale is
that if wakeups are really rare, the pointless idling doesn't hurt
throughput that much anyway.

This significantly reduces false waker detection for workload like:

[global]
directory=/mnt/repro/
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync

[slowwriter]
numjobs=1
fsync=200

[fastwriter]
numjobs=1
fsync=200

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-5-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:51 -07:00
Jan Kara
76f1df88bb bfq: Limit number of requests consumed by each cgroup
When cgroup IO scheduling is used with BFQ it does not really provide
service differentiation if the cgroup drives a big IO depth. That for
example happens with writeback which asynchronously submits lots of IO
but it can happen with AIO as well. The problem is that if we have two
cgroups that submit IO with different weights, the cgroup with higher
weight properly gets more IO time and is able to dispatch more IO.
However this causes lower weight cgroup to accumulate more requests
inside BFQ and eventually lower weight cgroup consumes most of IO
scheduler tags. At that point higher weight cgroup stops getting better
service as it is mostly blocked waiting for a scheduler tag while its
queues inside BFQ are empty and thus lower weight cgroup gets served.

Check how many requests submitting cgroup has allocated in
bfq_limit_depth() and if it consumes more requests than what would
correspond to its weight limit available depth to 1 so that the cgroup
cannot consume many more requests. With this limitation the higher
weight cgroup gets proper service even with writeback.

Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-4-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:51 -07:00
Jan Kara
44dfa279f1 bfq: Store full bitmap depth in bfq_data
Store bitmap depth shift inside bfq_data so that we can use it in
bfq_limit_depth() for proportioning when limiting number of available
request tags for a cgroup.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:51 -07:00
Jan Kara
98f044999b bfq: Track number of allocated requests in bfq_entity
When we want to limit number of requests used by each bfqq and also
cgroup, we need to track also number of requests used by each cgroup.
So track number of allocated requests for each bfq_entity.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-2-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:51 -07:00
Jan Kara
790cf9c848 block: Provide blk_mq_sched_get_icq()
Currently we lookup ICQ only after the request is allocated. However BFQ
will want to decide how many scheduler tags it allows a given bfq queue
(effectively a process) to consume based on cgroup weight. So provide a
function blk_mq_sched_get_icq() so that BFQ can lookup ICQ earlier.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:51 -07:00
Eric Biggers
72cd9df2ef blk-crypto: remove blk_crypto_unregister()
This function is trivial and is only used in one place.  Having this
function is misleading because it implies that blk_crypto_register()
needs to be paired with blk_crypto_unregister(), which is not the case.
Just set disk->queue->crypto_profile to NULL directly.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211124013733.347612-1-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:51 -07:00
Christoph Hellwig
5b13bc8a3f blk-mq: cleanup request allocation
Refactor the request alloction so that blk_mq_get_cached_request tries
to find a cached request first, and the entirely separate and now
self contained blk_mq_get_new_requests allocates one or more requests
if that is not possible.

There is a small change in behavior as submit_bio_checks is called
twice now if a cached request is present but can't be used, but that
is a small price to pay for unwinding this code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211124062856.1444266-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:51 -07:00
Christoph Hellwig
82d981d423 block: don't include <linux/part_stat.h> in blk.h
Not needed, shift it into the source files that need it instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Christoph Hellwig
ca5b304cab block: don't include <linux/idr.h> in blk.h
Not needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Christoph Hellwig
a2ff7781cf block: don't include <linux/blk-mq.h> in blk.h
Not needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Christoph Hellwig
e4a19f7289 block: don't include blk-mq.h in blk.h
No needed, shift a blk-stat.h include into the source file that needs it
instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Christoph Hellwig
2aa7745bf6 block: don't include blk-mq-sched.h in blk.h
No needed, shift it into the source files that need it instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Christoph Hellwig
0c6cb3a293 block: remove the e argument to elevator_exit
All callers pass q->elevator.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Christoph Hellwig
f46b81c54b block: remove elevator_exit
Open code elevator_exit in it's only caller, and rename __elevator_exit to
elevator_exit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Christoph Hellwig
0281ed3cf4 block: move blk_get_flush_queue to blk-flush.c
blk_get_flush_queue is only used in blk-flush.c, so move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Guo Zhengkui
35c90e6ec9 blk_mq: remove repeated includes
Remove a repeated "#include<linux/sched/sysctl.h>".

Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211123063340.25882-1-guozhengkui@vivo.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Jens Axboe
5a9d041ba2 block: move io_context creation into where it's needed
The only user of the io_context for IO is BFQ, yet we put the checking
and logic of it into the normal IO path.

Put the creation into blk_mq_sched_assign_ioc(), and have BFQ use that
helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Jens Axboe
48b5c1fbcd block: only allocate poll_stats if there's a user of them
This is essentially never used, yet it's about 1/3rd of the total
queue size. Allocate it when needed, and don't embed it in the queue.

Kill the queue flag for this while at it, since we can just check the
assigned pointer now.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:35 -07:00
Jens Axboe
25c4b5e058 blk-ioprio: don't set bio priority if not needed
We don't need to write to the bio if:

1) No ioprio value has ever been assigned to the blkcg
2) We wouldn't anyway, depending on bio and blkcg IO priority

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:35 -07:00
Christoph Hellwig
1e9c23034d blk-mq: move more plug handling from blk_mq_submit_bio into blk_add_rq_to_plug
Keep all the functionality for adding a request to a plug in a single place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123160443.1315598-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:35 -07:00
Christoph Hellwig
0c5bcc92d9 blk-mq: simplify the plug handling in blk_mq_submit_bio
blk_mq_submit_bio has two different plug cases, one that uses full
plugging and a limited plugging one.

The limited plugging case is only used for a corner case that does
not matter in real life:

 - no ->commit_rqs (so not NVMe)
 - no shared tags (so not SCSI)
 - not rotational (so no old disk or floppy driver)
 - must have multiple queues (so no eMMC)

Remove the limited merging case and all the related junk to simplify
blk_mq_submit_bio and the functions called from it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123160443.1315598-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:35 -07:00
Christoph Hellwig
9f18db572c block: don't set GENHD_FL_NO_PART for hidden gendisks
Hidden gendisks can't be opened using blkdev_get_*, so we can't really
reach any of the partition scanning paths or partitioning ioctls except
for the initial partition scan from add_disk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:35 -07:00
Christoph Hellwig
1ebe2e5f9d block: remove GENHD_FL_EXT_DEVT
All modern drivers can support extra partitions using the extended
dev_t.  In fact except for the ioctl method drivers never even see
partitions in normal operation.

So remove the GENHD_FL_EXT_DEVT and allow extra partitions for all
block devices that do support partitions, and require those that
do not support partitions to explicit disallow them using
GENHD_FL_NO_PART.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:35 -07:00
Christoph Hellwig
3b5149ac50 block: remove GENHD_FL_SUPPRESS_PARTITION_INFO
This flag is not set directly anywhere and only inherited from
GENHD_FL_HIDDEN.  Just check for GENHD_FL_HIDDEN instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:35 -07:00
Christoph Hellwig
140862805a block: remove the GENHD_FL_HIDDEN check in blkdev_get_no_open
Hidden gendisks never hash the block device inode, so this can't happen.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:04 -07:00
Christoph Hellwig
46e7eac647 block: rename GENHD_FL_NO_PART_SCAN to GENHD_FL_NO_PART
The GENHD_FL_NO_PART_SCAN controls more than just partitions canning,
so rename it to GENHD_FL_NO_PART.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20211122130625.1136848-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:35:21 -07:00
Christoph Hellwig
e16e506ccd block: merge disk_scan_partitions and blkdev_reread_part
Unify the functionality that implements a partition rescan for a
gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:35:21 -07:00
Christoph Hellwig
e3b3bad3f2 block: remove a dead check in show_partition
disk_max_parts never returns 0 given that ->minors for devices not using
the extended dev_t must be non-zero, and disk_max_parts always returns
DISK_MAX_PARTS for the latter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:35:21 -07:00
Christoph Hellwig
1545e0b419 block: move GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE to disk->event_flags
GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE is all about the event reporting
mechanism, so move it to the event_flags field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:35:21 -07:00
Christoph Hellwig
8641691646 block: move GENHD_FL_NATIVE_CAPACITY to disk->state
The flag to indicate an unlocked native capacity is dynamic state,
not a driver capability flag, so move it to disk->state.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:35:21 -07:00
Christoph Hellwig
d9337a420a block: don't include blk-mq headers in blk-core.c
All request based code is in the blk-mq files now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:35:21 -07:00
Christoph Hellwig
0d7a29a2b5 block: move blk_print_req_error to blk-mq.c
This function is only used by the request completion path.  Factor out
a blk_status_to_str to keep blk_errors private in blk-core.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:35:21 -07:00
Christoph Hellwig
22350ad7f1 block: move blk_dump_rq_flags to blk-mq.c
blk_dump_rq_flags deals with a request, so move it to blk-mq.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:34:51 -07:00
Christoph Hellwig
450b7879e3 block: move blk_account_io_{start,done} to blk-mq.c
These are only used for request based I/O, so move them where they are
used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:34:51 -07:00
Christoph Hellwig
f2b8f3ce98 block: move blk_steal_bios to blk-mq.c
Keep all the request based code together.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:34:51 -07:00
Christoph Hellwig
52fdbbcc83 block: move blk_rq_init to blk-mq.c
blk_rq_init deals with a request structure, so move it to blk-mq.c

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:34:51 -07:00
Christoph Hellwig
06c8c691e2 block: move request based cloning helpers to blk-mq.c
Keep all the request based code together.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:34:50 -07:00
Christoph Hellwig
b84c5b50d3 blk-mq: move blk_mq_flush_plug_list
Move blk_mq_flush_plug_list and blk_mq_plug_issue_direct down in blk-mq.c
to prepare for marking blk_mq_request_issue_directly static without the
need of a forward declaration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:34:50 -07:00
Christoph Hellwig
4054cff92c block: remove blk-exec.c
All this code is tightly coupled to the blk-mq core, so move it
there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-4-hch@lst.de
[axboe: remove doc generation for blk-exec.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:34:50 -07:00
Christoph Hellwig
786d4e01c5 block: remove rq_flush_dcache_pages
This function is trivial, and flush_dcache_page is always defined, so
just open code it in the 2.5 callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:34:50 -07:00
Christoph Hellwig
79478bf9ea block: move blk_rq_err_bytes to scsi
blk_rq_err_bytes is only used by the scsi midlayer, so move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:34:50 -07:00
Jens Axboe
98b26a0e76 block: call rq_qos_done() before ref check in batch completions
We need to call rq_qos_done() regardless of whether or not we're freeing
the request or not, as the reference count doesn't cover the IO completion
tracking.

Fixes: f794f3351f ("block: add support for blk_mq_end_request_batch()")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reported-by: Kenneth R. Crudup <kenny@panix.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-26 09:53:23 -07:00
Yang Guang
e30028ace8 block: fix parameter not described warning
The build warning:
block/blk-core.c:968: warning: Function parameter or member 'iob'
not described in 'bio_poll'.

Fixes: 5a72e899ce ("block: add a struct io_comp_batch argument to fops->iopoll()")
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Yang Guang <yang.guang5@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-25 09:32:19 -07:00
Ming Lei
efcf593223 block: avoid to touch unloaded module instance when opening bdev
disk->fops->owner is grabbed in blkdev_get_no_open() after the disk
kobject refcount is increased. This way can't make sure that
disk->fops->owner is still alive since del_gendisk() still can move
on if the kobject refcount of disk is grabbed by open() and
disk->fops->open() isn't called yet.

Fixes the issue by moving try_module_get() into blkdev_get_by_dev()
with ->open_mutex() held, then we can drain the in-progress open()
in del_gendisk(). Meantime new open() won't succeed because disk
becomes not alive.

This way is reasonable because blkdev_get_no_open() needn't to touch
disk->fops or defined callbacks.

Cc: Christoph Hellwig <hch@lst.de>
Cc: czhong@redhat.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211111020343.316126-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-22 18:35:37 -07:00
Ming Lei
2b504bd484 blk-mq: don't insert FUA request with data into scheduler queue
We never insert flush request into scheduler queue before.

Recently commit d92ca9d834 ("blk-mq: don't handle non-flush requests in
blk_insert_flush") tries to handle FUA data request as normal request.
This way has caused warning[1] in mq-deadline dd_exit_sched() or io hang in
case of kyber since RQF_ELVPRIV isn't set for flush request, then
->finish_request won't be called.

Fix the issue by inserting FUA data request with blk_mq_request_bypass_insert()
when the device supports FUA, just like what we did before.

[1] https://lore.kernel.org/linux-block/CAHj4cs-_vkTW=dAzbZYGxpEWSpzpcmaNeY1R=vH311+9vMUSdg@mail.gmail.com/

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: d92ca9d834 ("blk-mq: don't handle non-flush requests in blk_insert_flush")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20211118153041.2163228-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-19 06:28:18 -07:00
Yu Kuai
15c3010496 blk-cgroup: fix missing put device in error path from blkg_conf_pref()
If blk_queue_enter() failed due to queue is dying, the
blkdev_put_no_open() is needed because blkcg_conf_open_bdev() succeeded.

Fixes: 0c9d338c84 ("blk-cgroup: synchronize blkg creation against policy deactivation")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20211102020705.2321858-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-19 06:26:45 -07:00
Ming Lei
245a489e81 block: avoid to quiesce queue in elevator_init_mq
elevator_init_mq() is only called before adding disk, when there isn't
any FS I/O, only passthrough requests can be queued, so freezing queue
plus canceling dispatch work is enough to drain any dispatch activities,
then we can avoid synchronize_srcu() in blk_mq_quiesce_queue().

Long boot latency issue can be fixed in case of lots of disks added
during booting.

Fixes: 737eb78e82 ("block: Delay default elevator initialization")
Reported-by: yangerkun <yangerkun@huawei.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211117115502.1600950-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-17 07:43:26 -07:00
Ming Lei
2a19b28f79 blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release()
For avoiding to slow down queue destroy, we don't call
blk_mq_quiesce_queue() in blk_cleanup_queue(), instead of delaying to
cancel dispatch work in blk_release_queue().

However, this way has caused kernel oops[1], reported by Changhui. The log
shows that scsi_device can be freed before running blk_release_queue(),
which is expected too since scsi_device is released after the scsi disk
is closed and the scsi_device is removed.

Fixes the issue by canceling blk-mq dispatch work in both blk_cleanup_queue()
and disk_release():

1) when disk_release() is run, the disk has been closed, and any sync
dispatch activities have been done, so canceling dispatch work is enough to
quiesce filesystem I/O dispatch activity.

2) in blk_cleanup_queue(), we only focus on passthrough request, and
passthrough request is always explicitly allocated & freed by
its caller, so once queue is frozen, all sync dispatch activity
for passthrough request has been done, then it is enough to just cancel
dispatch work for avoiding any dispatch activity.

[1] kernel panic log
[12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300
[12622.777186] #PF: supervisor read access in kernel mode
[12622.782918] #PF: error_code(0x0000) - not-present page
[12622.788649] PGD 0 P4D 0
[12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI
[12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1
[12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
[12622.813321] Workqueue: kblockd blk_mq_run_work_fn
[12622.818572] RIP: 0010:sbitmap_get+0x75/0x190
[12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 <48> 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58
[12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202
[12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004
[12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030
[12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334
[12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
[12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030
[12622.889926] FS:  0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000
[12622.898956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0
[12622.913328] Call Trace:
[12622.916055]  <TASK>
[12622.918394]  scsi_mq_get_budget+0x1a/0x110
[12622.922969]  __blk_mq_do_dispatch_sched+0x1d4/0x320
[12622.928404]  ? pick_next_task_fair+0x39/0x390
[12622.933268]  __blk_mq_sched_dispatch_requests+0xf4/0x140
[12622.939194]  blk_mq_sched_dispatch_requests+0x30/0x60
[12622.944829]  __blk_mq_run_hw_queue+0x30/0xa0
[12622.949593]  process_one_work+0x1e8/0x3c0
[12622.954059]  worker_thread+0x50/0x3b0
[12622.958144]  ? rescuer_thread+0x370/0x370
[12622.962616]  kthread+0x158/0x180
[12622.966218]  ? set_kthread_struct+0x40/0x40
[12622.970884]  ret_from_fork+0x22/0x30
[12622.974875]  </TASK>
[12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]

Reported-by: ChanghuiZhong <czhong@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-15 19:22:13 -07:00
Jens Axboe
95febeb61b block: fix missing queue put in error path
If we fail the submission queue checks, we don't put the queue afterwards.
This can cause various issues like stalls on scheduler switch or failure
to remove the device, or like in the original bug report, timeout waiting
for the device on reboot/restart.

While in there, fix a few whitespace discrepancies in the surrounding
code.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215039
Fixes: b637108a40 ("blk-mq: fix filesystem I/O request allocation")
Reported-and-tested-by: Stephen Smith <stephenmsmith@blueyonder.co.uk>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-15 17:00:54 -07:00
Alistair Delva
94c4b4fd25 block: Check ADMIN before NICE for IOPRIO_CLASS_RT
Booting to Android userspace on 5.14 or newer triggers the following
SELinux denial:

avc: denied { sys_nice } for comm="init" capability=23
     scontext=u:r:init:s0 tcontext=u:r:init:s0 tclass=capability
     permissive=0

Init is PID 0 running as root, so it already has CAP_SYS_ADMIN. For
better compatibility with older SEPolicy, check ADMIN before NICE.

Fixes: 9d3a39a5f1 ("block: grant IOPRIO_CLASS_RT to CAP_SYS_NICE")
Signed-off-by: Alistair Delva <adelva@google.com>
Cc: Khazhismel Kumykov <khazhy@google.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: selinux@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: kernel-team@android.com
Cc: stable@vger.kernel.org # v5.14+
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Link: https://lore.kernel.org/r/20211115181655.3608659-1-adelva@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-15 14:28:59 -07:00
Ming Lei
b637108a40 blk-mq: fix filesystem I/O request allocation
submit_bio_checks() may update bio->bi_opf, so we have to initialize
blk_mq_alloc_data.cmd_flags with bio->bi_opf after submit_bio_checks()
returns when allocating new request.

In case of using cached request, fallback to allocate new request if
cached rq isn't compatible with the incoming bio, otherwise change
rq->cmd_flags with incoming bio->bi_opf.

Fixes: 900e080752 ("block: move queue enter logic into blk_mq_submit_bio()")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-12 09:31:13 -07:00
Laibin Qiu
b781d8db58 blkcg: Remove extra blkcg_bio_issue_init
KASAN reports a use-after-free report when doing block test:

==================================================================
[10050.967049] BUG: KASAN: use-after-free in
submit_bio_checks+0x1539/0x1550

[10050.977638] Call Trace:
[10050.978190]  dump_stack+0x9b/0xce
[10050.979674]  print_address_description.constprop.6+0x3e/0x60
[10050.983510]  kasan_report.cold.9+0x22/0x3a
[10050.986089]  submit_bio_checks+0x1539/0x1550
[10050.989576]  submit_bio_noacct+0x83/0xc80
[10050.993714]  submit_bio+0xa7/0x330
[10050.994435]  mpage_readahead+0x380/0x500
[10050.998009]  read_pages+0x1c1/0xbf0
[10051.002057]  page_cache_ra_unbounded+0x4c2/0x6f0
[10051.007413]  do_page_cache_ra+0xda/0x110
[10051.008207]  force_page_cache_ra+0x23d/0x3d0
[10051.009087]  page_cache_sync_ra+0xca/0x300
[10051.009970]  generic_file_buffered_read+0xbea/0x2130
[10051.012685]  generic_file_read_iter+0x315/0x490
[10051.014472]  blkdev_read_iter+0x113/0x1b0
[10051.015300]  aio_read+0x2ad/0x450
[10051.023786]  io_submit_one+0xc8e/0x1d60
[10051.029855]  __se_sys_io_submit+0x125/0x350
[10051.033442]  do_syscall_64+0x2d/0x40
[10051.034156]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[10051.048733] Allocated by task 18598:
[10051.049482]  kasan_save_stack+0x19/0x40
[10051.050263]  __kasan_kmalloc.constprop.1+0xc1/0xd0
[10051.051230]  kmem_cache_alloc+0x146/0x440
[10051.052060]  mempool_alloc+0x125/0x2f0
[10051.052818]  bio_alloc_bioset+0x353/0x590
[10051.053658]  mpage_alloc+0x3b/0x240
[10051.054382]  do_mpage_readpage+0xddf/0x1ef0
[10051.055250]  mpage_readahead+0x264/0x500
[10051.056060]  read_pages+0x1c1/0xbf0
[10051.056758]  page_cache_ra_unbounded+0x4c2/0x6f0
[10051.057702]  do_page_cache_ra+0xda/0x110
[10051.058511]  force_page_cache_ra+0x23d/0x3d0
[10051.059373]  page_cache_sync_ra+0xca/0x300
[10051.060198]  generic_file_buffered_read+0xbea/0x2130
[10051.061195]  generic_file_read_iter+0x315/0x490
[10051.062189]  blkdev_read_iter+0x113/0x1b0
[10051.063015]  aio_read+0x2ad/0x450
[10051.063686]  io_submit_one+0xc8e/0x1d60
[10051.064467]  __se_sys_io_submit+0x125/0x350
[10051.065318]  do_syscall_64+0x2d/0x40
[10051.066082]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[10051.067455] Freed by task 13307:
[10051.068136]  kasan_save_stack+0x19/0x40
[10051.068931]  kasan_set_track+0x1c/0x30
[10051.069726]  kasan_set_free_info+0x1b/0x30
[10051.070621]  __kasan_slab_free+0x111/0x160
[10051.071480]  kmem_cache_free+0x94/0x460
[10051.072256]  mempool_free+0xd6/0x320
[10051.072985]  bio_free+0xe0/0x130
[10051.073630]  bio_put+0xab/0xe0
[10051.074252]  bio_endio+0x3a6/0x5d0
[10051.074984]  blk_update_request+0x590/0x1370
[10051.075870]  scsi_end_request+0x7d/0x400
[10051.076667]  scsi_io_completion+0x1aa/0xe50
[10051.077503]  scsi_softirq_done+0x11b/0x240
[10051.078344]  blk_mq_complete_request+0xd4/0x120
[10051.079275]  scsi_mq_done+0xf0/0x200
[10051.080036]  virtscsi_vq_done+0xbc/0x150
[10051.080850]  vring_interrupt+0x179/0x390
[10051.081650]  __handle_irq_event_percpu+0xf7/0x490
[10051.082626]  handle_irq_event_percpu+0x7b/0x160
[10051.083527]  handle_irq_event+0xcc/0x170
[10051.084297]  handle_edge_irq+0x215/0xb20
[10051.085122]  asm_call_irq_on_stack+0xf/0x20
[10051.085986]  common_interrupt+0xae/0x120
[10051.086830]  asm_common_interrupt+0x1e/0x40

==================================================================

Bio will be checked at beginning of submit_bio_noacct(). If bio needs
to be throttled, it will start the timer and stop submit bio directly.
Bio will submit in blk_throtl_dispatch_work_fn() when the timer expires.
But in the current process, if bio is throttled, it will still set bio
issue->value by blkcg_bio_issue_init(). This is redundant and may cause
the above use-after-free.

CPU0                                   CPU1
submit_bio
submit_bio_noacct
  submit_bio_checks
    blk_throtl_bio()
      <=mod_timer(&sq->pending_timer
                                      blk_throtl_dispatch_work_fn
                                        submit_bio_noacct() <= bio have
                                        throttle tag, will throw directly
                                        and bio issue->value will be set
                                        here

                                      bio_endio()
                                      bio_put()
                                      bio_free() <= free this bio

    blkcg_bio_issue_init(bio)
      <= bio has been freed and
      will lead to UAF
  return BLK_QC_T_NONE

Fix this by remove extra blkcg_bio_issue_init.

Fixes: e439bedf6b (blkcg: consolidate bio_issue_init() to be a part of core)
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Link: https://lore.kernel.org/r/20211112093354.3581504-1-qiulaibin@huawei.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-12 05:46:07 -07:00
Shin'ichiro Kawasaki
86399ea071 block: Hold invalidate_lock in BLKRESETZONE ioctl
When BLKRESETZONE ioctl and data read race, the data read leaves stale
page cache. The commit e511350590 ("block: Discard page cache of zone
reset target range") added page cache truncation to avoid stale page
cache after the ioctl. However, the stale page cache still can be read
during the reset zone operation for the ioctl. To avoid the stale page
cache completely, hold invalidate_lock of the block device file mapping.

Fixes: e511350590 ("block: Discard page cache of zone reset target range")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211111085238.942492-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-11 11:52:46 -07:00
Ming Lei
b131f20111 blk-mq: rename blk_attempt_bio_merge
It is very annoying to have two block layer functions which share same
name, so rename blk_attempt_bio_merge in blk-mq.c as
blk_mq_attempt_bio_merge.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211111085134.345235-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-11 11:52:33 -07:00
Ming Lei
10f7335e36 blk-mq: don't grab ->q_usage_counter in blk_mq_sched_bio_merge
blk_mq_sched_bio_merge is only called from blk-mq.c:blk_attempt_bio_merge(),
which is called when queue usage counter is grabbed already:

1) blk_mq_get_new_requests()

2) blk_mq_get_request()
- cached request in current plug owns one queue usage counter

So don't grab ->q_usage_counter in blk_mq_sched_bio_merge(), and more
importantly this nest way causes hang in blk_mq_freeze_queue_wait().

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211111085134.345235-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-11 11:52:33 -07:00
Jens Axboe
438cd74223 block: fix kerneldoc for disk_register_independent_access__ranges()
The naming got changed as part of a revision of the patchset, but the
kerneldoc apparently never got updated. Fix it.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: a2247f19ee ("block: Add independent access ranges support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-11 11:52:30 -07:00
Luis Chamberlain
278167fd2f block: add __must_check for *add_disk*() callers
Now that we have done a spring cleaning on all drivers and added
error checking / handling, let's keep it that way and ensure
no new drivers fail to stick with it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20211110002949.999380-1-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-09 19:19:34 -07:00
Jens Axboe
ecaf97f474 block: use enum type for blk_mq_alloc_data->rq_flags
kernel test robot reports that we now trigger some sparse warnings:

block/blk-mq.h:169:32: sparse: sparse: restricted req_flags_t degrades to integer
block/blk-mq.h:169:32: sparse: sparse: restricted req_flags_t degrades to integer
block/blk-mq.h:169:32: sparse: sparse: restricted req_flags_t degrades to integer

which is due to ->rq_flags being an unsigned int, rather than the
stronger type req_flags_t enum.

Change the type to req_flags_t to silence this warning.

Fixes: 56f8da642b ("block: add rq_flags to struct blk_mq_alloc_data")
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-09 19:19:15 -07:00
Shin'ichiro Kawasaki
35e4c6c1a2 block: Hold invalidate_lock in BLKZEROOUT ioctl
When BLKZEROOUT ioctl and data read race, the data read leaves stale
page cache. To avoid the stale page cache, hold invalidate_lock of the
block device file mapping. The stale page cache is observed when
blktests test case block/009 is modified to call "blkdiscard -z" command
and repeated hundreds of times.

This patch can be applied back to the stable kernel version v5.15.y.
Rework is required for older stable kernels.

Fixes: 22dd6d3566 ("block: invalidate the page cache when issuing BLKZEROOUT")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211109104723.835533-3-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-09 12:41:12 -07:00
Shin'ichiro Kawasaki
7607c44c15 block: Hold invalidate_lock in BLKDISCARD ioctl
When BLKDISCARD ioctl and data read race, the data read leaves stale
page cache. To avoid the stale page cache, hold invalidate_lock of the
block device file mapping. The stale page cache is observed when
blktests test case block/009 is repeated hundreds of times.

This patch can be applied back to the stable kernel version v5.15.y
with slight patch edit. Rework is required for older stable kernels.

Fixes: 351499a172 ("block: Invalidate cache on discard v2")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211109104723.835533-2-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-09 12:41:12 -07:00
Linus Torvalds
cb690f5238 for-5.16/drivers-2021-11-09
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmGKqOcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr3yEADD9Cx8oNk3KzWV3c3JlIR4JQtvpczS3dho
 KkGU0D5fOh1sViXbLBNr6VxypcEIKQoHxDQQ6qid1kOu/B3mCNM1duLsVjyj3Qa0
 7nbm2dVUsD/EVDuXedRmMvcfCUx6Z23DbpI182wXtIPaCsEEmsANzHnZNg38OV44
 25SYG0QUvb9ViSz1Y1GORu0ttEJNF2GhZfiBpb0WveRnY7eTSL/PnHNDzHsSeFv4
 zD0W205g7jKbt0+57kgNElTz7DbdM3p8XVex+aXPlFaHz2qx4ZoJJIsaMv/P8tT5
 14b50cB41xnPvlGTvqr1WfZZfJocDNq2rG+fh6N5D1sO86ogWpj7psiiADfa0pb6
 ZWoJqhk3BvEUMPQ5N/BJ/8j3FWGIYWtKQf4QcyxrJYpqDwtwbBfMlzKkc7JMPFYk
 JAi6uq1uF5SbA4x99G90tK85LvxsbkseyIYXgBJ/GIyW5doIPkD9TPDEzJMCdHOe
 laynHS5PMHzuhPLuEDDn9sTVXpZWAMBnoy4j1L4wGmBjiogYWLTSJVobODzCAqHY
 1Va2oP6SXfCdVRkCysFbcrdsjJuoIWlMKrdE40tNvkmU0v7sEX0Zd+GLHiaWdIZa
 fgxC9fmZtDDOowCp+Iw0VaAqPeeptmyUrof06ZktJleOAscX7kSwbxPdmr1FM0jy
 dbnLDyaq/A==
 =QaFI
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/drivers-2021-11-09' of git://git.kernel.dk/linux-block

Pull more block driver updates from Jens Axboe:

 - Last series adding error handling support for add_disk() in drivers.
   After this one, and once the SCSI side has been merged, we can
   finally annotate add_disk() as must_check. (Luis)

 - bcache fixes (Coly)

 - zram fixes (Ming)

 - ataflop locking fix (Tetsuo)

 - nbd fixes (Ye, Yu)

 - MD merge via Song
      - Cleanup (Yang)
      - sysfs fix (Guoqing)

 - Misc fixes (Geert, Wu, luo)

* tag 'for-5.16/drivers-2021-11-09' of git://git.kernel.dk/linux-block: (34 commits)
  bcache: Revert "bcache: use bvec_virt"
  ataflop: Add missing semicolon to return statement
  floppy: address add_disk() error handling on probe
  ataflop: address add_disk() error handling on probe
  block: update __register_blkdev() probe documentation
  ataflop: remove ataflop_probe_lock mutex
  mtd/ubi/block: add error handling support for add_disk()
  block/sunvdc: add error handling support for add_disk()
  z2ram: add error handling support for add_disk()
  nvdimm/pmem: use add_disk() error handling
  nvdimm/pmem: cleanup the disk if pmem_release_disk() is yet assigned
  nvdimm/blk: add error handling support for add_disk()
  nvdimm/blk: avoid calling del_gendisk() on early failures
  nvdimm/btt: add error handling support for add_disk()
  nvdimm/btt: use goto error labels on btt_blk_init()
  loop: Remove duplicate assignments
  drbd: Fix double free problem in drbd_create_device
  nvdimm/btt: do not call del_gendisk() if not needed
  bcache: fix use-after-free problem in bcache_device_free()
  zram: replace fsync_bdev with sync_blockdev
  ...
2021-11-09 11:24:08 -08:00
Linus Torvalds
3e28850cbd for-5.16/block-2021-11-09
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmGKqAcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpojrD/4yA+GgV+jWeIepYWvU81TQFpt9AJmzWbrY
 uryj4dy7EdMjun+JkAP8k4qreqvTZRsJMkr9dhmS4qaM8/Vt8K/RU/0n/lxNVmqc
 1//ZaTS6DURVAc52GHIXD3q4cv8pHofTZZlrj1Hgz35shlOayStGJtktH5f8uQl4
 5Yxjh+HKr15Chym+fKlbR6T7BgVxxNyhT9q89BgUwMAJX+1KRVtwtkyVK5IbObFy
 zOeiC+n9niQ6iJHcLoqb7LjfBOs/VjdNOQYGSCAnrBxuQ8GnEP2xDw2nvFlOPE12
 5tWEwTgAX7381ilbL6VvNTlTafIs/Axt8mI0cY/OMW7ApiHwO3rXjQSqA4yrnKCJ
 h6M1QavqThd2DtMnOi0U5wwgtD2UjS+CMpK5XFxeIyl6GqTgZcaWm3VqRnG68KZD
 r5+o99GKWCHy0cckxq2WiWJouReeNZ9u9R6HNDw0Vb8UNyWgBR+v2MkX+SHS/c85
 2gXm10hwBH7BFnC4X8ceiuT/bm7xm9S6D/3LCVitlUTBRfqobsQEQjSciPeoOtL0
 rRSTKob7jtokiB2q01wx3q1jnUMpxE1fqJkpLjUvebTzw+a+xfPwy0nNTGq0XXIv
 WMVRRpSWCZm04Ru0q/K8cj0GOyur5x+ilefZ1V+/sRU5dVmGuJgbJUxei1HPC6eV
 z9Rn0aFv4g==
 =1GPi
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/block-2021-11-09' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Set of fixes for the batched tag allocation (Ming, me)

 - add_disk() error handling fix (Luis)

 - Nested queue quiesce fixes (Ming)

 - Shared tags init error handling fix (Ye)

 - Misc cleanups (Jean, Ming, me)

* tag 'for-5.16/block-2021-11-09' of git://git.kernel.dk/linux-block:
  nvme: wait until quiesce is done
  scsi: make sure that request queue queiesce and unquiesce balanced
  scsi: avoid to quiesce sdev->request_queue two times
  blk-mq: add one API for waiting until quiesce is done
  blk-mq: don't free tags if the tag_set is used by other device in queue initialztion
  block: fix device_add_disk() kobject_create_and_add() error handling
  block: ensure cached plug request matches the current queue
  block: move queue enter logic into blk_mq_submit_bio()
  block: make bio_queue_enter() fast-path available inline
  block: split request allocation components into helpers
  block: have plug stored requests hold references to the queue
  blk-mq: update hctx->nr_active in blk_mq_end_request_batch()
  blk-mq: add RQF_ELV debug entry
  blk-mq: only try to run plug merge if request has same queue with incoming bio
  block: move RQF_ELV setting into allocators
  dm: don't stop request queue after the dm device is suspended
  block: replace always false argument with 'false'
  block: assign correct tag before doing prefetch of request
  blk-mq: fix redundant check of !e expression
2021-11-09 11:20:07 -08:00
Linus Torvalds
1dc1f92e24 for-5.16/bdev-size-2021-11-09
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmGKp6YQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphBaD/9Wm1lUI5qPB9tIQERWNsx4phr2Xb5WV3rg
 6ZFPDnjw0EEdfJmHLMKT5w/lguARtTKJAPeMmvX38RhB8U+44jeok0Bc+mJajjBD
 7OkE2KHQ/GnnZxzHQVyoYYo390zLOwNqALDLE55bfQsxKjJXXM03Stwcubzg9MZ+
 JRR4mFISU+7qhYu6Qx1RAvEqrGqWzstgrkbsnf/MJ7dKAy3cyqOiXKc/gDKGipIr
 skSrwbsB9AE6zW5UMdWN0r/wz1fm2bfJOR1meYgUj83exi1EoWDt52oceYWPky2D
 ievHRWUXBkcGgDNvKO6ZO99spJqs/MUy0jhzAW46qYDYUfMcbzthewWzX5mxQBDx
 YtDgH949DLVB/jPg0eoTeB8j1hcfO/gU6jdR9bJl4ZQKE/9UB9RMfusXHklmcSTL
 nZY8SfWp2ecu6HkGUPXJf051mEwCHCaROIslVeNeZBAbh/duVz3VG9EIO+seLOFR
 JJaqsB3xVk1HhqUYoQ+IiAHAkROa73jSv4hsvDu8BX923pv4JWAbTslRFU8lCG1w
 heJZ/QiP20QtEcEVCg14wEMiH5tBMvM7OhslQbhOSs+wcm8zyShjRhoa2idq/aTw
 k289DEfI52k8SIyI0+rClbWR+hEQ9y2K5opur8gR1SuIv9rbM+UzBc5hDbfT6t3W
 4xPpYESkhg==
 =bjgx
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/bdev-size-2021-11-09' of git://git.kernel.dk/linux-block

Pull more bdev size updates from Jens Axboe:
 "Two followup changes for the bdev-size series from this merge window:

   - Add loff_t cast to bdev_nr_bytes() (Christoph)

   - Use bdev_nr_bytes() consistently for the block parts at least (me)"

* tag 'for-5.16/bdev-size-2021-11-09' of git://git.kernel.dk/linux-block:
  block: use new bdev_nr_bytes() helper for blkdev_{read,write}_iter()
  block: add a loff_t cast to bdev_nr_bytes
2021-11-09 11:16:20 -08:00
Ming Lei
9ef4d0209c blk-mq: add one API for waiting until quiesce is done
Some drivers(NVMe, SCSI) need to call quiesce and unquiesce in pair, but it
is hard to switch to this style, so these drivers need one atomic flag for
helping to balance quiesce and unquiesce.

When quiesce is in-progress, the driver still needs to wait until
the quiesce is done, so add API of blk_mq_wait_quiesce_done() for
these drivers.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211109071144.181581-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-09 08:14:27 -07:00
Ye Bin
a846a8e6c9 blk-mq: don't free tags if the tag_set is used by other device in queue initialztion
We got UAF report on v5.10 as follows:
[ 1446.674930] ==================================================================
[ 1446.675970] BUG: KASAN: use-after-free in blk_mq_get_driver_tag+0x9a4/0xa90
[ 1446.676902] Read of size 8 at addr ffff8880185afd10 by task kworker/1:2/12348
[ 1446.677851]
[ 1446.678073] CPU: 1 PID: 12348 Comm: kworker/1:2 Not tainted 5.10.0-10177-gc9c81b1e346a #2
[ 1446.679168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 1446.680692] Workqueue: kthrotld blk_throtl_dispatch_work_fn
[ 1446.681448] Call Trace:
[ 1446.681800]  dump_stack+0x9b/0xce
[ 1446.682916]  print_address_description.constprop.6+0x3e/0x60
[ 1446.685999]  kasan_report.cold.9+0x22/0x3a
[ 1446.687186]  blk_mq_get_driver_tag+0x9a4/0xa90
[ 1446.687785]  blk_mq_dispatch_rq_list+0x21a/0x1d40
[ 1446.692576]  __blk_mq_do_dispatch_sched+0x394/0x830
[ 1446.695758]  __blk_mq_sched_dispatch_requests+0x398/0x4f0
[ 1446.698279]  blk_mq_sched_dispatch_requests+0xdf/0x140
[ 1446.698967]  __blk_mq_run_hw_queue+0xc0/0x270
[ 1446.699561]  __blk_mq_delay_run_hw_queue+0x4cc/0x550
[ 1446.701407]  blk_mq_run_hw_queue+0x13b/0x2b0
[ 1446.702593]  blk_mq_sched_insert_requests+0x1de/0x390
[ 1446.703309]  blk_mq_flush_plug_list+0x4b4/0x760
[ 1446.705408]  blk_flush_plug_list+0x2c5/0x480
[ 1446.708471]  blk_finish_plug+0x55/0xa0
[ 1446.708980]  blk_throtl_dispatch_work_fn+0x23b/0x2e0
[ 1446.711236]  process_one_work+0x6d4/0xfe0
[ 1446.711778]  worker_thread+0x91/0xc80
[ 1446.713400]  kthread+0x32d/0x3f0
[ 1446.714362]  ret_from_fork+0x1f/0x30
[ 1446.714846]
[ 1446.715062] Allocated by task 1:
[ 1446.715509]  kasan_save_stack+0x19/0x40
[ 1446.716026]  __kasan_kmalloc.constprop.1+0xc1/0xd0
[ 1446.716673]  blk_mq_init_tags+0x6d/0x330
[ 1446.717207]  blk_mq_alloc_rq_map+0x50/0x1c0
[ 1446.717769]  __blk_mq_alloc_map_and_request+0xe5/0x320
[ 1446.718459]  blk_mq_alloc_tag_set+0x679/0xdc0
[ 1446.719050]  scsi_add_host_with_dma.cold.3+0xa0/0x5db
[ 1446.719736]  virtscsi_probe+0x7bf/0xbd0
[ 1446.720265]  virtio_dev_probe+0x402/0x6c0
[ 1446.720808]  really_probe+0x276/0xde0
[ 1446.721320]  driver_probe_device+0x267/0x3d0
[ 1446.721892]  device_driver_attach+0xfe/0x140
[ 1446.722491]  __driver_attach+0x13a/0x2c0
[ 1446.723037]  bus_for_each_dev+0x146/0x1c0
[ 1446.723603]  bus_add_driver+0x3fc/0x680
[ 1446.724145]  driver_register+0x1c0/0x400
[ 1446.724693]  init+0xa2/0xe8
[ 1446.725091]  do_one_initcall+0x9e/0x310
[ 1446.725626]  kernel_init_freeable+0xc56/0xcb9
[ 1446.726231]  kernel_init+0x11/0x198
[ 1446.726714]  ret_from_fork+0x1f/0x30
[ 1446.727212]
[ 1446.727433] Freed by task 26992:
[ 1446.727882]  kasan_save_stack+0x19/0x40
[ 1446.728420]  kasan_set_track+0x1c/0x30
[ 1446.728943]  kasan_set_free_info+0x1b/0x30
[ 1446.729517]  __kasan_slab_free+0x111/0x160
[ 1446.730084]  kfree+0xb8/0x520
[ 1446.730507]  blk_mq_free_map_and_requests+0x10b/0x1b0
[ 1446.731206]  blk_mq_realloc_hw_ctxs+0x8cb/0x15b0
[ 1446.731844]  blk_mq_init_allocated_queue+0x374/0x1380
[ 1446.732540]  blk_mq_init_queue_data+0x7f/0xd0
[ 1446.733155]  scsi_mq_alloc_queue+0x45/0x170
[ 1446.733730]  scsi_alloc_sdev+0x73c/0xb20
[ 1446.734281]  scsi_probe_and_add_lun+0x9a6/0x2d90
[ 1446.734916]  __scsi_scan_target+0x208/0xc50
[ 1446.735500]  scsi_scan_channel.part.3+0x113/0x170
[ 1446.736149]  scsi_scan_host_selected+0x25a/0x360
[ 1446.736783]  store_scan+0x290/0x2d0
[ 1446.737275]  dev_attr_store+0x55/0x80
[ 1446.737782]  sysfs_kf_write+0x132/0x190
[ 1446.738313]  kernfs_fop_write_iter+0x319/0x4b0
[ 1446.738921]  new_sync_write+0x40e/0x5c0
[ 1446.739429]  vfs_write+0x519/0x720
[ 1446.739877]  ksys_write+0xf8/0x1f0
[ 1446.740332]  do_syscall_64+0x2d/0x40
[ 1446.740802]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1446.741462]
[ 1446.741670] The buggy address belongs to the object at ffff8880185afd00
[ 1446.741670]  which belongs to the cache kmalloc-256 of size 256
[ 1446.743276] The buggy address is located 16 bytes inside of
[ 1446.743276]  256-byte region [ffff8880185afd00, ffff8880185afe00)
[ 1446.744765] The buggy address belongs to the page:
[ 1446.745416] page:ffffea0000616b00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x185ac
[ 1446.746694] head:ffffea0000616b00 order:2 compound_mapcount:0 compound_pincount:0
[ 1446.747719] flags: 0x1fffff80010200(slab|head)
[ 1446.748337] raw: 001fffff80010200 ffffea00006a3208 ffffea000061bf08 ffff88801004f240
[ 1446.749404] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
[ 1446.750455] page dumped because: kasan: bad access detected
[ 1446.751227]
[ 1446.751445] Memory state around the buggy address:
[ 1446.752102]  ffff8880185afc00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.753090]  ffff8880185afc80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.754079] >ffff8880185afd00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1446.755065]                          ^
[ 1446.755589]  ffff8880185afd80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1446.756574]  ffff8880185afe00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.757566] ==================================================================

Flag 'BLK_MQ_F_TAG_QUEUE_SHARED' will be set if the second device on the
same host initializes it's queue successfully. However, if the second
device failed to allocate memory in blk_mq_alloc_and_init_hctx() from
blk_mq_realloc_hw_ctxs() from blk_mq_init_allocated_queue(),
__blk_mq_free_map_and_rqs() will be called on error path, and if
'BLK_MQ_TAG_HCTX_SHARED' is not set, 'tag_set->tags' will be freed
while it's still used by the first device.

To fix this issue we move release newly allocated hardware context from
blk_mq_realloc_hw_ctxs to __blk_mq_update_nr_hw_queues. As there is needn't to
release hardware context in blk_mq_init_allocated_queue.

Fixes: 868f2f0b72 ("blk-mq: dynamic h/w context count")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211108074019.1058843-1-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-08 06:23:49 -07:00
Jens Axboe
138c1a3811 block: use new bdev_nr_bytes() helper for blkdev_{read,write}_iter()
We have new helpers for this, use them rather than the slower inode
size reads. This makes the read/write path consistent with most of
the rest of block as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/a72767cd-3c6d-47f7-80f4-aa025a17b2cb@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-05 08:32:05 -06:00