Commit graph

6418 commits

Author SHA1 Message Date
Yu Kuai
285febabac blk-wbt: fix that 'rwb->wc' is always set to 1 in wbt_init()
commit 8c5035dfbb ("blk-wbt: call rq_qos_add() after wb_normal is
initialized") moves wbt_set_write_cache() before rq_qos_add(), which
is wrong because wbt_rq_qos() is still NULL.

Fix the problem by removing wbt_set_write_cache() and setting 'rwb->wc'
directly. Noted that this patch also remove the redundant setting of
'rab->wc'.

Fixes: 8c5035dfbb ("blk-wbt: call rq_qos_add() after wb_normal is initialized")
Reported-by: kernel test robot <yujie.liu@intel.com>
Link: https://lore.kernel.org/r/202210081045.77ddf59b-yujie.liu@intel.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20221009101038.1692875-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-09 07:48:16 -06:00
Deming Wang
340e134727 block: Remove the repeat word 'can'
Remove the repeat word 'can' from the comments of bio_kmalloc.

Signed-off-by: Deming Wang <wangdeming@inspur.com>
Link: https://lore.kernel.org/r/20221006084450.1513-1-wangdeming@inspur.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-06 07:23:13 -06:00
Pankaj Raghav
110fdb4486 block: add rationale for not using blk_mq_plug() when applicable
There are two places in the block layer at the moment where
blk_mq_plug() helper could be used instead of directly accessing the
plug from struct current. In both these cases, directly accessing the plug
should not have any consequences for zoned devices.

Make the intent explicit by adding comments instead of introducing unwanted
checks with blk_mq_plug() helper.[1]

[1] https://lore.kernel.org/linux-block/f6e54907-1035-2b2c-6387-ed178be05ccb@kernel.dk/

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Suggested-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220929144141.140077-1-p.raghav@samsung.com
[axboe: fixup multi-line comment style]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29 09:02:49 -06:00
Pankaj Raghav
8cafdb5ab9 block: adapt blk_mq_plug() to not plug for writes that require a zone lock
The current implementation of blk_mq_plug() disables plugging for all
operations that involves a transfer to the device as we just check if
the last bit in op_is_write() function.

Modify blk_mq_plug() to disable plugging only for REQ_OP_WRITE and
REQ_OP_WRITE_ZEROS as they might require a zone lock.

Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220929074745.103073-2-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29 07:45:47 -06:00
Christoph Hellwig
5765033cf7 blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
blkg_conf_prep just creates a new blkg structure, there is no real
need to update the lookup hint which should only be done on a
successful lookup in the I/O path.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220927065425.257876-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-27 11:50:05 -06:00
Keith Busch
8237c01f16 blk-mq: use quiesced elevator switch when reinitializing queues
The hctx's run_work may be racing with the elevator switch when
reinitializing hardware queues. The queue is merely frozen in this
context, but that only prevents requests from allocating and doesn't
stop the hctx work from running. The work may get an elevator pointer
that's being torn down, and can result in use-after-free errors and
kernel panics (example below). Use the quiesced elevator switch instead,
and make the previous one static since it is now only used locally.

  nvme nvme0: resetting controller
  nvme nvme0: 32/0/0 default/read/poll queues
  BUG: kernel NULL pointer dereference, address: 0000000000000008
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 80000020c8861067 P4D 80000020c8861067 PUD 250f8c8067 PMD 0
  Oops: 0000 [#1] SMP PTI
  Workqueue: kblockd blk_mq_run_work_fn
  RIP: 0010:kyber_has_work+0x29/0x70

...

  Call Trace:
   __blk_mq_do_dispatch_sched+0x83/0x2b0
   __blk_mq_sched_dispatch_requests+0x12e/0x170
   blk_mq_sched_dispatch_requests+0x30/0x60
   __blk_mq_run_hw_queue+0x2b/0x50
   process_one_work+0x1ef/0x380
   worker_thread+0x2d/0x3e0

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220927155652.3260724-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-27 09:58:56 -06:00
Christoph Hellwig
568ec936bf block: replace blk_queue_nowait with bdev_nowait
Replace blk_queue_nowait with a bdev_nowait helpers that takes the
block_device given that the I/O submission path should not have to
look into the request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220927075815.269694-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-27 09:57:58 -06:00
Christoph Hellwig
99e6038743 blk-cgroup: pass a gendisk to the blkg allocation helpers
Prepare for storing the blkcg information in the gendisk instead of
the request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:28 -06:00
Christoph Hellwig
de185b56e8 blk-cgroup: pass a gendisk to blkcg_schedule_throttle
Pass the gendisk to blkcg_schedule_throttle as part of moving the
blk-cgroup infrastructure to be gendisk based.  Remove the unused
!BLK_CGROUP stub while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:28 -06:00
Christoph Hellwig
00ad6991bb blk-cgroup: pass a gendisk to blkg_destroy_all
Pass the gendisk to blkg_destroy_all as part of moving the blk-cgroup
infrastructure to be gendisk based.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:28 -06:00
Christoph Hellwig
cad9266abc blk-throttle: pass a gendisk to blk_throtl_cancel_bios
Pass the gendisk to blk_throtl_cancel_bios as part of moving the
blk-cgroup infrastructure to be gendisk based.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:28 -06:00
Christoph Hellwig
5f6dc7522a blk-throttle: pass a gendisk to blk_throtl_register_queue
Pass the gendisk to blk_throtl_register_queue as part of moving the
blk-cgroup infrastructure to be gendisk based.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:27 -06:00
Christoph Hellwig
e13793bae6 blk-throttle: pass a gendisk to blk_throtl_init and blk_throtl_exit
Pass the gendisk to blk_throtl_init and blk_throtl_exit as part of moving
the blk-cgroup infrastructure to be gendisk based.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:27 -06:00
Christoph Hellwig
3657647e33 blk-iocost: cleanup ioc_qos_write
Use a local disk variable instead of retrieving the disk and
request_queue over and over by various means.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:27 -06:00
Christoph Hellwig
57b6455497 blk-iocost: pass a gendisk to blk_iocost_init
Pass the gendisk to blk_iocost_init as part of moving the blk-cgroup
infrastructure to be gendisk based.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:27 -06:00
Christoph Hellwig
9df3e65139 blk-iocost: simplify ioc_name
Just directly dereference the disk name instead of going through multiple
hoops to find the same value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:27 -06:00
Christoph Hellwig
16fac1b591 blk-iolatency: pass a gendisk to blk_iolatency_init
Pass the gendisk to blk_iolatency_init as part of moving the blk-cgroup
infrastructure to be gendisk based.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-9-hch@lst.de
[axboe: missed inline for blk_iolatency_init() and !CONFIG_BLK_CGROUP_IOLATENCY]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:24 -06:00
Christoph Hellwig
b0dde3f5d6 blk-ioprio: pass a gendisk to blk_ioprio_init and blk_ioprio_exit
Pass the gendisk to blk_ioprio_init and blk_ioprio_exit as part of moving
the blk-cgroup infrastructure to be gendisk based.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:09:31 -06:00
Christoph Hellwig
9823538fb7 blk-cgroup: pass a gendisk to blkcg_init_queue and blkcg_exit_queue
Pass the gendisk to blkcg_init_disk and blkcg_exit_disk as part of moving
the blk-cgroup infrastructure to be gendisk based.  Also remove the
rather pointless kerneldoc comments for these internal functions with a
single caller each.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:09:31 -06:00
Christoph Hellwig
f753526e32 blk-cgroup: remove blkg_lookup_check
The combinations of an error check with an ERR_PTR return and a lookup
with a NULL return leads to ugly handling of the return values in the
callers.  Just open coding the check and the lookup is much simpler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:09:31 -06:00
Christoph Hellwig
4a69f325aa blk-cgroup: cleanup the blkg_lookup family of functions
Add a fully inlined blkg_lookup as the extra two checks aren't going
to generated a lot more code vs the call to the slowpath routine, and
open code the hint update in the two callers that care.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:09:31 -06:00
Christoph Hellwig
79fcc5be93 blk-cgroup: remove open coded blkg_lookup instances
Use blkg_lookup instead of open coding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:09:31 -06:00
Christoph Hellwig
928f6f00a9 blk-cgroup: remove blk_queue_root_blkg
Just open code it in the only caller and drop the unused !BLK_CGROUP
stub.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:09:31 -06:00
Christoph Hellwig
33dc62796c blk-cgroup: fix error unwinding in blkcg_init_queue
When blk_throtl_init fails, we need to call blk_ioprio_exit.  Switch to
proper goto based unwinding to fix this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:09:31 -06:00
Liu Song
f168420c62 blk-mq: don't redirect completion for hctx withs only one ctx mapping
High-performance NVMe devices usually support a large hw queues, which
ensures a 1:1 mapping of hctx and ctx. In this case there will be no
remote request, so we don't need to care about it.

Signed-off-by: Liu Song <liusong@linux.alibaba.com>
Link: https://lore.kernel.org/r/1663731123-81536-1-git-send-email-liusong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-24 09:02:30 -06:00
Yu Kuai
81c7a63abc blk-throttle: improve bypassing bios checkings
"tg->has_rules" is extended to "tg->has_rules_iops/bps", thus bios that
don't need to be throttled can be checked accurately.

With this patch, bio will be throttled if:

1) Bio is read/write, and corresponding read/write iops limit exist.
2) If corresponding doesn't exist, corresponding bps limit exist and
bio is not throttled before.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921095309.1481289-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-24 08:59:43 -06:00
Yu Kuai
8549674990 blk-throttle: remove THROTL_TG_HAS_IOPS_LIMIT
Currently, "tg->has_rules" and "tg->flags & THROTL_TG_HAS_IOPS_LIMIT"
both try to bypass bios that don't need to be throttled, however, they are
a little redundant and both not perfect:

1) "tg->has_rules" only distinguish read and write, but not iops and bps
   limit.
2) "tg->flags & THROTL_TG_HAS_IOPS_LIMIT" only check if iops limit
   exist, read and write is not distinguished, and bps limit is not
   checked.

tg->has_rules will extended to distinguish bps and iops in the following
patch. There is no need to keep the flag.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921095309.1481289-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-24 08:59:43 -06:00
Li Jinlin
9713a67067 block/blk-rq-qos: delete useless enmu RQ_QOS_IOPRIO
Since blk-ioprio handing was converted from a rqos policy to a direct call,
RQ_QOS_IOPRIO is not used anymore, just delete it.

Signed-off-by: Li Jinlin <lijinlin3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220916023241.32926-1-lijinlin3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 19:50:53 -06:00
Yu Kuai
8c5035dfbb blk-wbt: call rq_qos_add() after wb_normal is initialized
Our test found a problem that wbt inflight counter is negative, which
will cause io hang(noted that this problem doesn't exist in mainline):

t1: device create	t2: issue io
add_disk
 blk_register_queue
  wbt_enable_default
   wbt_init
    rq_qos_add
    // wb_normal is still 0
			/*
			 * in mainline, disk can't be opened before
			 * bdev_add(), however, in old kernels, disk
			 * can be opened before blk_register_queue().
			 */
			blkdev_issue_flush
                        // disk size is 0, however, it's not checked
                         submit_bio_wait
                          submit_bio
                           blk_mq_submit_bio
                            rq_qos_throttle
                             wbt_wait
			      bio_to_wbt_flags
                               rwb_enabled
			       // wb_normal is 0, inflight is not increased

    wbt_queue_depth_changed(&rwb->rqos);
     wbt_update_limits
     // wb_normal is initialized
                            rq_qos_track
                             wbt_track
                              rq->wbt_flags |= bio_to_wbt_flags(rwb, bio);
			      // wb_normal is not 0,wbt_flags will be set
t3: io completion
blk_mq_free_request
 rq_qos_done
  wbt_done
   wbt_is_tracked
   // return true
   __wbt_done
    wbt_rqw_done
     atomic_dec_return(&rqw->inflight);
     // inflight is decreased

commit 8235b5c1e8 ("block: call bdev_add later in device_add_disk") can
avoid this problem, however it's better to fix this problem in wbt:

1) Lower kernel can't backport this patch due to lots of refactor.
2) Root cause is that wbt call rq_qos_add() before wb_normal is
initialized.

Fixes: e34cbd3074 ("blk-wbt: add general throttling mechanism")
Cc: <stable@vger.kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20220913105749.3086243-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 08:36:13 -06:00
Li zeming
a7609c68f7 blk-iocost: Remove unnecessary (void*) conversions
The key pointer is void and hence does not need an explicit cast.

Signed-off-by: Li zeming <zeming@nfschina.com>
Link: https://lore.kernel.org/r/20220919012825.2936-1-zeming@nfschina.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-20 08:28:30 -06:00
Christoph Hellwig
118f3663fb block: remove PSI accounting from the bio layer
PSI accounting is now done by the VM code, where it should have been
since the beginning.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220915094200.139713-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-20 08:24:38 -06:00
Ping-Xiang Chen
e88480871b block: fix comment typo in submit_bio of block-core.c.
This patch fix a comment typo in block-core.c.

Signed-off-by: Ping-Xiang Chen <p.x.chen@uci.edu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220914074237.31621-1-p.x.chen@uci.edu
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-20 08:22:47 -06:00
Yu Kuai
c013710e1a blk-throttle: cleanup tg_update_disptime()
tg_update_disptime() only need to adjust postion for 'tg' in
'parent_sq', there is no need to call throtl_enqueue/dequeue_tg(),
since they will set/clear flag THROTL_TG_PENDING and increase/decrease
nr_pending, which is useless. By the way, clear the flag/decrease
nr_pending while there are still throttled bios is not good for debugging.

There are no functional changes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220827101637.1775111-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-12 00:20:08 -06:00
Yu Kuai
8c25ed0cb9 blk-throttle: calling throtl_dequeue/enqueue_tg in pairs
It's a little weird to call throtl_dequeue_tg() unconditionally in
throtl_select_dispatch(), since it will be called in tg_update_disptime()
again if some bio is still throttled. Thus call it later if there are no
throttled bio. There are no functional changes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220827101637.1775111-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-12 00:20:07 -06:00
Yu Kuai
7e9c5c54d4 blk-throttle: use 'READ/WRITE' instead of '0/1'
Make the code easier to read, like everywhere else.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220827101637.1775111-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-12 00:20:07 -06:00
Yu Kuai
a880ae93e5 blk-throttle: fix io hung due to configuration updates
If new configuration is submitted while a bio is throttled, then new
waiting time is recalculated regardless that the bio might already wait
for some time:

tg_conf_updated
 throtl_start_new_slice
  tg_update_disptime
  throtl_schedule_next_dispatch

Then io hung can be triggered by always submmiting new configuration
before the throttled bio is dispatched.

Fix the problem by respecting the time that throttled bio already waited.
In order to do that, add new fields to record how many bytes/io are
waited, and use it to calculate wait time for throttled bio under new
configuration.

Some simple test:
1)
cd /sys/fs/cgroup/blkio/
echo $$ > cgroup.procs
echo "8:0 2048" > blkio.throttle.write_bps_device
{
        sleep 2
        echo "8:0 1024" > blkio.throttle.write_bps_device
} &
dd if=/dev/zero of=/dev/sda bs=8k count=1 oflag=direct

2)
cd /sys/fs/cgroup/blkio/
echo $$ > cgroup.procs
echo "8:0 1024" > blkio.throttle.write_bps_device
{
        sleep 4
        echo "8:0 2048" > blkio.throttle.write_bps_device
} &
dd if=/dev/zero of=/dev/sda bs=8k count=1 oflag=direct

test results: io finish time
	before this patch	with this patch
1)	10s			6s
2)	8s			6s

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220829022240.3348319-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-12 00:19:48 -06:00
Yu Kuai
681cd46fff blk-throttle: factor out code to calculate ios/bytes_allowed
No functional changes, new apis will be used in later patches to
calculate wait time for throttled bios when new configuration is
submitted.

Noted this patch also rename tg_with_in_iops/bps_limit() to
tg_within_iops/bps_limit().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220829022240.3348319-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-12 00:19:48 -06:00
Yu Kuai
8d6bbaada2 blk-throttle: prevent overflow while calculating wait time
There is a problem found by code review in tg_with_in_bps_limit() that
'bps_limit * jiffy_elapsed_rnd' might overflow. Fix the problem by
calling mul_u64_u64_div_u64() instead.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220829022240.3348319-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-12 00:19:48 -06:00
Yu Kuai
320fb0f91e blk-throttle: fix that io throttle can only work for single bio
Test scripts:
cd /sys/fs/cgroup/blkio/
echo "8:0 1024" > blkio.throttle.write_bps_device
echo $$ > cgroup.procs
dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &
dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &

Test result:
10240 bytes (10 kB, 10 KiB) copied, 10.0134 s, 1.0 kB/s
10240 bytes (10 kB, 10 KiB) copied, 10.0135 s, 1.0 kB/s

The problem is that the second bio is finished after 10s instead of 20s.

Root cause:
1) second bio will be flagged:

__blk_throtl_bio
 while (true) {
  ...
  if (sq->nr_queued[rw]) -> some bio is throttled already
   break
 };
 bio_set_flag(bio, BIO_THROTTLED); -> flag the bio

2) flagged bio will be dispatched without waiting:

throtl_dispatch_tg
 tg_may_dispatch
  tg_with_in_bps_limit
   if (bps_limit == U64_MAX || bio_flagged(bio, BIO_THROTTLED))
    *wait = 0; -> wait time is zero
    return true;

commit 9f5ede3c01 ("block: throttle split bio in case of iops limit")
support to count split bios for iops limit, thus it adds flagged bio
checking in tg_with_in_bps_limit() so that split bios will only count
once for bps limit, however, it introduce a new problem that io throttle
won't work if multiple bios are throttled.

In order to fix the problem, handle iops/bps limit in different ways:

1) for iops limit, there is no flag to record if the bio is throttled,
   and iops is always applied.
2) for bps limit, original bio will be flagged with BIO_BPS_THROTTLED,
   and io throttle will ignore bio with the flag.

Noted this patch also remove the code to set flag in __bio_clone(), it's
introduced in commit 111be88398 ("block-throttle: avoid double
charge"), and author thinks split bio can be resubmited and throttled
again, which is wrong because split bio will continue to dispatch from
caller.

Fixes: 9f5ede3c01 ("block: throttle split bio in case of iops limit")
Cc: <stable@vger.kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220829022240.3348319-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-12 00:19:48 -06:00
Keith Busch
4acb83417c sbitmap: fix batched wait_cnt accounting
Batched completions can clear multiple bits, but we're only decrementing
the wait_cnt by one each time. This can cause waiters to never be woken,
stalling IO. Use the batched count instead.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215679
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20220909184022.1709476-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-12 00:10:34 -06:00
Miaohe Lin
bdb7d420c6 block: remove unneeded return value of bio_check_ro()
bio_check_ro() always return false now. Remove this unneeded return value
and cleanup the sole caller. No functional change intended.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Link: https://lore.kernel.org/r/20220905102754.1942-1-linmiaohe@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-05 11:45:35 -06:00
Miaohe Lin
6d5e8d21e8 blk-mq: remove unneeded needs_restart check
If code reaches here, needs_restart must be true. Remove this unneeded
needs_restart check. No functional change intended.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Link: https://lore.kernel.org/r/20220905101950.4606-1-linmiaohe@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-05 11:45:24 -06:00
Jiapeng Chong
91e5adda5c block/blk-map: Remove set but unused variable 'added'
The variable added is not effectively used in the function, so delete
it.

block/blk-map.c:273:16: warning: variable 'added' set but not used.

Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2049
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220905063253.120082-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-05 11:44:37 -06:00
Yu Kuai
2d8f7a3b9f blk-throttle: clean up codes that can't be reached
While doing code coverage testing while CONFIG_BLK_DEV_THROTTLING_LOW is
disabled, we found that there are many codes can never be reached.

This patch move such codes inside "#ifdef CONFIG_BLK_DEV_THROTTLING_LOW".

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220903062826.1099085-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-04 14:38:18 -06:00
Jens Axboe
bce1b56c73 Revert "sbitmap: fix batched wait_cnt accounting"
This reverts commit 16ede66973.

This is causing issues with CPU stalls on my test box, revert it for
now until we understand what is going on. It looks like infinite
looping off sbitmap_queue_wake_up(), but hard to tell with a lot of
CPUs hitting this issue and the console scrolling infinitely.

Link: https://lore.kernel.org/linux-block/e742813b-ce5c-0d58-205b-1626f639b1bd@kernel.dk/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-04 06:39:25 -06:00
Jens Axboe
12c5b70c18 block: enable per-cpu bio caching for the fs bio set
This is useful for polled IO on a file, or for polled IO with the
io_uring passthrough mechanism. If bio allocations are done with
REQ_POLLED for those cases, then initializing the bio set with
BIOSET_PERCPU_CACHE enables the local per-cpu cache which eliminates
allocations (and frees) of bio structs when possible.

Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-02 13:03:33 -06:00
Keith Busch
16ede66973 sbitmap: fix batched wait_cnt accounting
Batched completions can clear multiple bits, but we're only decrementing
the wait_cnt by one each time. This can cause waiters to never be woken,
stalling IO. Use the batched count instead.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215679
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20220825145312.1217900-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-01 10:42:41 -06:00
Jens Axboe
e88811bc43 block: use on-stack page vec for <= UIO_FASTIOV
Avoid a kmalloc+kfree for each page array, if we only have a few pages
that are mapped. An alloc+free for each IO is quite expensive, and
it's pretty pointless if we're only dealing with 1 or a few vecs.

Use UIO_FASTIOV like we do in other spots to set a sane limit for how
big of an IO we want to avoid allocations for.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-22 10:07:56 -06:00
Jens Axboe
8af870aa5b block: enable bio caching use for passthru IO
bdev based polled O_DIRECT is currently quite a bit faster than
passthru on the same device, and one of the reaons is that we're not
able to use the bio caching for passthru IO.

If REQ_POLLED is set on the request, use the fs bio set for grabbing a
bio from the caches, if available. This saves 5-6% of CPU over head
for polled passthru IO.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-22 10:07:56 -06:00
Jens Axboe
f5d632d15e block: shrink rq_map_data a bit
We don't need full ints for several of these members. Change the
page_order and nr_entries to unsigned shorts, and the true/false from_user
and null_mapped to booleans.

This shrinks the struct from 32 to 24 bytes on 64-bit archs.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-22 10:07:56 -06:00