Commit Graph

6280 Commits

Author SHA1 Message Date
Yu Kuai 7a5dc0f4bc blk-wbt: fix that 'rwb->wc' is always set to 1 in wbt_init()
commit 285febabac upstream.

commit 8c5035dfbb ("blk-wbt: call rq_qos_add() after wb_normal is
initialized") moves wbt_set_write_cache() before rq_qos_add(), which
is wrong because wbt_rq_qos() is still NULL.

Fix the problem by removing wbt_set_write_cache() and setting 'rwb->wc'
directly. Noted that this patch also remove the redundant setting of
'rab->wc'.

Fixes: 8c5035dfbb ("blk-wbt: call rq_qos_add() after wb_normal is initialized")
Reported-by: kernel test robot <yujie.liu@intel.com>
Link: https://lore.kernel.org/r/202210081045.77ddf59b-yujie.liu@intel.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20221009101038.1692875-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-24 09:58:30 +02:00
Keith Busch 63a681bcc3 blk-mq: use quiesced elevator switch when reinitializing queues
[ Upstream commit 8237c01f16 ]

The hctx's run_work may be racing with the elevator switch when
reinitializing hardware queues. The queue is merely frozen in this
context, but that only prevents requests from allocating and doesn't
stop the hctx work from running. The work may get an elevator pointer
that's being torn down, and can result in use-after-free errors and
kernel panics (example below). Use the quiesced elevator switch instead,
and make the previous one static since it is now only used locally.

  nvme nvme0: resetting controller
  nvme nvme0: 32/0/0 default/read/poll queues
  BUG: kernel NULL pointer dereference, address: 0000000000000008
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 80000020c8861067 P4D 80000020c8861067 PUD 250f8c8067 PMD 0
  Oops: 0000 [#1] SMP PTI
  Workqueue: kblockd blk_mq_run_work_fn
  RIP: 0010:kyber_has_work+0x29/0x70

...

  Call Trace:
   __blk_mq_do_dispatch_sched+0x83/0x2b0
   __blk_mq_sched_dispatch_requests+0x12e/0x170
   blk_mq_sched_dispatch_requests+0x30/0x60
   __blk_mq_run_hw_queue+0x2b/0x50
   process_one_work+0x1ef/0x380
   worker_thread+0x2d/0x3e0

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220927155652.3260724-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-24 09:58:28 +02:00
Yu Kuai cc6f0855bf blk-throttle: prevent overflow while calculating wait time
[ Upstream commit 8d6bbaada2 ]

There is a problem found by code review in tg_with_in_bps_limit() that
'bps_limit * jiffy_elapsed_rnd' might overflow. Fix the problem by
calling mul_u64_u64_div_u64() instead.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220829022240.3348319-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-24 09:58:25 +02:00
Yu Kuai dd54b94e72 blk-wbt: call rq_qos_add() after wb_normal is initialized
commit 8c5035dfbb upstream.

Our test found a problem that wbt inflight counter is negative, which
will cause io hang(noted that this problem doesn't exist in mainline):

t1: device create	t2: issue io
add_disk
 blk_register_queue
  wbt_enable_default
   wbt_init
    rq_qos_add
    // wb_normal is still 0
			/*
			 * in mainline, disk can't be opened before
			 * bdev_add(), however, in old kernels, disk
			 * can be opened before blk_register_queue().
			 */
			blkdev_issue_flush
                        // disk size is 0, however, it's not checked
                         submit_bio_wait
                          submit_bio
                           blk_mq_submit_bio
                            rq_qos_throttle
                             wbt_wait
			      bio_to_wbt_flags
                               rwb_enabled
			       // wb_normal is 0, inflight is not increased

    wbt_queue_depth_changed(&rwb->rqos);
     wbt_update_limits
     // wb_normal is initialized
                            rq_qos_track
                             wbt_track
                              rq->wbt_flags |= bio_to_wbt_flags(rwb, bio);
			      // wb_normal is not 0,wbt_flags will be set
t3: io completion
blk_mq_free_request
 rq_qos_done
  wbt_done
   wbt_is_tracked
   // return true
   __wbt_done
    wbt_rqw_done
     atomic_dec_return(&rqw->inflight);
     // inflight is decreased

commit 8235b5c1e8 ("block: call bdev_add later in device_add_disk") can
avoid this problem, however it's better to fix this problem in wbt:

1) Lower kernel can't backport this patch due to lots of refactor.
2) Root cause is that wbt call rq_qos_add() before wb_normal is
initialized.

Fixes: e34cbd3074 ("blk-wbt: add general throttling mechanism")
Cc: <stable@vger.kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20220913105749.3086243-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-24 09:56:59 +02:00
Yu Kuai 8b1f9fde48 blk-throttle: fix that io throttle can only work for single bio
commit 320fb0f91e upstream.

Test scripts:
cd /sys/fs/cgroup/blkio/
echo "8:0 1024" > blkio.throttle.write_bps_device
echo $$ > cgroup.procs
dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &
dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &

Test result:
10240 bytes (10 kB, 10 KiB) copied, 10.0134 s, 1.0 kB/s
10240 bytes (10 kB, 10 KiB) copied, 10.0135 s, 1.0 kB/s

The problem is that the second bio is finished after 10s instead of 20s.

Root cause:
1) second bio will be flagged:

__blk_throtl_bio
 while (true) {
  ...
  if (sq->nr_queued[rw]) -> some bio is throttled already
   break
 };
 bio_set_flag(bio, BIO_THROTTLED); -> flag the bio

2) flagged bio will be dispatched without waiting:

throtl_dispatch_tg
 tg_may_dispatch
  tg_with_in_bps_limit
   if (bps_limit == U64_MAX || bio_flagged(bio, BIO_THROTTLED))
    *wait = 0; -> wait time is zero
    return true;

commit 9f5ede3c01 ("block: throttle split bio in case of iops limit")
support to count split bios for iops limit, thus it adds flagged bio
checking in tg_with_in_bps_limit() so that split bios will only count
once for bps limit, however, it introduce a new problem that io throttle
won't work if multiple bios are throttled.

In order to fix the problem, handle iops/bps limit in different ways:

1) for iops limit, there is no flag to record if the bio is throttled,
   and iops is always applied.
2) for bps limit, original bio will be flagged with BIO_BPS_THROTTLED,
   and io throttle will ignore bio with the flag.

Noted this patch also remove the code to set flag in __bio_clone(), it's
introduced in commit 111be88398 ("block-throttle: avoid double
charge"), and author thinks split bio can be resubmited and throttled
again, which is wrong because split bio will continue to dispatch from
caller.

Fixes: 9f5ede3c01 ("block: throttle split bio in case of iops limit")
Cc: <stable@vger.kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220829022240.3348319-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-24 09:56:59 +02:00
Christoph Hellwig 48a12961e8 Revert "block: freeze the queue earlier in del_gendisk"
commit 4c66a326b5 upstream.

This reverts commit a09b314005.

Dusty Mabe reported consistent hang during CoreOS shutdown with a MD
RAID1 setup.  Although apparently similar hangs happened before,
and this patch most likely is not the root cause it made it much
more severe.  Revert it until we can figure out what is going on
with the md driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220919144049.978907-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-28 11:32:28 +02:00
Rafael Mendonca 98756ca258 block: Do not call blk_put_queue() if gendisk allocation fails
commit aa0c680c3a upstream.

Commit 6f8191fdf4 ("block: simplify disk shutdown") removed the call
to blk_get_queue() during gendisk allocation but missed to remove the
corresponding cleanup code blk_put_queue() for it. Thus, if the gendisk
allocation fails, the request_queue refcount gets decremented and
reaches 0, causing blk_mq_release() to be called with a hctx still
alive. That triggers a WARNING report, as found by syzkaller:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 23016 at block/blk-mq.c:3881
blk_mq_release+0xf8/0x3e0 block/blk-mq.c:3881
[...] stripped
RIP: 0010:blk_mq_release+0xf8/0x3e0 block/blk-mq.c:3881
[...] stripped
Call Trace:
 <TASK>
 blk_release_queue+0x153/0x270 block/blk-sysfs.c:780
 kobject_cleanup lib/kobject.c:673 [inline]
 kobject_release lib/kobject.c:704 [inline]
 kref_put include/linux/kref.h:65 [inline]
 kobject_put+0x1c8/0x540 lib/kobject.c:721
 __alloc_disk_node+0x4f7/0x610 block/genhd.c:1388
 __blk_mq_alloc_disk+0x13b/0x1f0 block/blk-mq.c:3961
 loop_add+0x3e2/0xaf0 drivers/block/loop.c:1978
 loop_control_ioctl+0x133/0x620 drivers/block/loop.c:2150
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:870 [inline]
 __se_sys_ioctl fs/ioctl.c:856 [inline]
 __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:856
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
[...] stripped

Fixes: 6f8191fdf4 ("block: simplify disk shutdown")
Reported-by: syzbot+31c9594f6e43b9289b25@syzkaller.appspotmail.com
Suggested-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220811232338.254673-1-rafaelmendsr@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-28 11:32:23 +02:00
Christoph Hellwig 2f092fd2ce block: call blk_mq_exit_queue from disk_release for never added disks
commit c5db2cfc62 upstream.

To undo the all initialization from blk_mq_init_allocated_queue in case
of a probe failure where add_disk is never called we have to call
blk_mq_exit_queue from put_disk.

This relies on the fact that drivers always call blk_mq_free_tag_set
after calling put_disk in the probe error path if they have a gendisk
at all.

We should be doing this in general, but can't do it for the normal
teardown case (yet) as the tagset can be gone by the time the disk is
released once it was added.  I hope to sort this out properly eventually
but for now this isolated hack will do it.

Fixes: 6f8191fdf4 ("block: simplify disk shutdown")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220720130541.1323531-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-28 11:32:23 +02:00
Christoph Hellwig 47f57236ba blk-mq: fix error handling in __blk_mq_alloc_disk
commit 0a3e5cc7bb upstream.

To fully clean up the queue if the disk allocation fails we need to
call blk_mq_destroy_queue and not just blk_put_queue.

Fixes: 6f8191fdf4 ("block: simplify disk shutdown")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220720130541.1323531-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-28 11:32:23 +02:00
Christoph Hellwig d27b66257d block: simplify disk shutdown
[ Upstream commit 6f8191fdf4 ]

Set the queue dying flag and call blk_mq_exit_queue from del_gendisk for
all disks that do not have separately allocated queues, and thus remove
the need to call blk_cleanup_queue for them.

Rename blk_cleanup_disk to blk_mq_destroy_queue to make it clear that
this function is intended only for separately allocated blk-mq queues.

This saves an extra queue freeze for devices without a separately
allocated queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: 8fe4ce5836 ("scsi: core: Fix a use-after-free")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-28 11:32:01 +02:00
Christoph Hellwig fdb28e9688 block: stop setting the nomerges flags in blk_cleanup_queue
[ Upstream commit 0e3534022f ]

These flags only apply to file system I/O, and all file system I/O is
already drained by del_gendisk and thus can't be in progress when
blk_cleanup_queue is called.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: 8fe4ce5836 ("scsi: core: Fix a use-after-free")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-28 11:32:00 +02:00
Christoph Hellwig ab85cb5297 block: remove QUEUE_FLAG_DEAD
[ Upstream commit 1f90307e5f ]

Disallow setting the blk-mq state on any queue that is already dying as
setting the state even then is a bad idea, and remove the now unused
QUEUE_FLAG_DEAD flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: 8fe4ce5836 ("scsi: core: Fix a use-after-free")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-28 11:32:00 +02:00
Mikulas Patocka 46c716a31f blk-lib: fix blkdev_issue_secure_erase
commit c4fa368466 upstream.

There's a bug in blkdev_issue_secure_erase. The statement
"unsigned int len = min_t(sector_t, nr_sects, max_sectors);"
sets the variable "len" to the length in sectors, but the statement
"bio->bi_iter.bi_size = len" treats it as if it were in bytes.
The statements "sector += len << SECTOR_SHIFT" and "nr_sects -= len <<
SECTOR_SHIFT" are thinko.

This patch fixes it.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org	# v5.19
Fixes: 44abff2c0b ("block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD")
Link: https://lore.kernel.org/r/alpine.LRH.2.02.2209141549480.28100@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-23 14:14:05 +02:00
Stefan Roesch 248c48ced2 block: blk_queue_enter() / __bio_queue_enter() must return -EAGAIN for nowait
[ Upstream commit 56f99b8d06 ]

Today blk_queue_enter() and __bio_queue_enter() return -EBUSY for the
nowait code path. This is not correct: they should return -EAGAIN
instead.

This problem was detected by fio. The following command exposed the
above problem:

t/io_uring -p0 -d128 -b4096 -s32 -c32 -F1 -B0 -R0 -X1 -n24 -P1 -u1 -O0 /dev/ng0n1

By applying the patch, the retry case is handled correctly in the slow
path.

Signed-off-by: Stefan Roesch <shr@fb.com>
Fixes: bfd343aa17 ("blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-23 14:14:04 +02:00
Ming Lei 4c9a8adb14 block: don't add partitions if GD_SUPPRESS_PART_SCAN is set
[ Upstream commit 748008e1da ]

Commit b9684a71fc ("block, loop: support partitions without scanning")
adds GD_SUPPRESS_PART_SCAN for replacing part function of
GENHD_FL_NO_PART. But looks blk_add_partitions() is missed, since
loop doesn't want to add partitions if GENHD_FL_NO_PART was set.
And it causes regression on libblockdev (as called from udisks) which
operates with the LO_FLAGS_PARTSCAN.

Fixes the issue by not adding partitions if GD_SUPPRESS_PART_SCAN is
set.

Fixes: b9684a71fc ("block, loop: support partitions without scanning")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220823103819.395776-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-15 10:47:15 +02:00
Yu Kuai b2f10baf4d blk-mq: fix io hung due to missing commit_rqs
commit 65fac0d54f upstream.

Currently, in virtio_scsi, if 'bd->last' is not set to true while
dispatching request, such io will stay in driver's queue, and driver
will wait for block layer to dispatch more rqs. However, if block
layer failed to dispatch more rq, it should trigger commit_rqs to
inform driver.

There is a problem in blk_mq_try_issue_list_directly() that commit_rqs
won't be called:

// assume that queue_depth is set to 1, list contains two rq
blk_mq_try_issue_list_directly
 blk_mq_request_issue_directly
 // dispatch first rq
 // last is false
  __blk_mq_try_issue_directly
   blk_mq_get_dispatch_budget
   // succeed to get first budget
   __blk_mq_issue_directly
    scsi_queue_rq
     cmd->flags |= SCMD_LAST
      virtscsi_queuecommand
       kick = (sc->flags & SCMD_LAST) != 0
       // kick is false, first rq won't issue to disk
 queued++

 blk_mq_request_issue_directly
 // dispatch second rq
  __blk_mq_try_issue_directly
   blk_mq_get_dispatch_budget
   // failed to get second budget
 ret == BLK_STS_RESOURCE
  blk_mq_request_bypass_insert
 // errors is still 0

 if (!list_empty(list) || errors && ...)
  // won't pass, commit_rqs won't be called

In this situation, first rq relied on second rq to dispatch, while
second rq relied on first rq to complete, thus they will both hung.

Fix the problem by also treat 'BLK_STS_*RESOURCE' as 'errors' since
it means that request is not queued successfully.

Same problem exists in blk_mq_dispatch_rq_list(), 'BLK_STS_*RESOURCE'
can't be treated as 'errors' here, fix the problem by calling
commit_rqs if queue_rq return 'BLK_STS_*RESOURCE'.

Fixes: d666ba98f8 ("blk-mq: add mq_ops->commit_rqs()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220726122224.1790882-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-31 17:18:19 +02:00
Yufen Yu 5d4dc30b4c blk-mq: run queue no matter whether the request is the last request
commit d3b3859687 upstream.

We do test on a virtio scsi device (/dev/sda) and the default mq
scheduler is 'none'. We found a IO hung as following:

blk_finish_plug
  blk_mq_plug_issue_direct
      scsi_mq_get_budget
      //get budget_token fail and sdev->restarts=1

			     	 scsi_end_request
				   scsi_run_queue_async
                                   //sdev->restart=0 and run queue

     blk_mq_request_bypass_insert
        //add request to hctx->dispatch list

  //continue to dispath plug list
  blk_mq_dispatch_plug_list
      blk_mq_try_issue_list_directly
        //success issue all requests from plug list

After .get_budget fail, scsi_mq_get_budget will increase 'restarts'.
Normally, it will run hw queue when io complete and set 'restarts'
as 0. But if we run queue before adding request to the dispatch list
and blk_mq_dispatch_plug_list also success issue all requests, then
on one will run queue, and the request will be stall in the dispatch
list and cannot complete forever.

It is wrong to use last request of plug list to decide if run queue is
needed since all the remained requests in plug list may be from other
hctxs. To fix the bug, pass run_queue as true always to
blk_mq_request_bypass_insert().

Fix-suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Fixes: dc5fc361d8 ("block: attempt direct issue of plug list")
Link: https://lore.kernel.org/r/20220803023355.3687360-1-yuyufen@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-25 11:45:36 +02:00
Jinke Han 0c9bb1acd1 block: don't allow the same type rq_qos add more than once
[ Upstream commit 14a6e2eb7d ]

In our test of iocost, we encountered some list add/del corruptions of
inner_walk list in ioc_timer_fn.

The reason can be described as follows:

cpu 0					cpu 1
ioc_qos_write				ioc_qos_write

ioc = q_to_ioc(queue);
if (!ioc) {
        ioc = kzalloc();
					ioc = q_to_ioc(queue);
					if (!ioc) {
						ioc = kzalloc();
						...
						rq_qos_add(q, rqos);
					}
        ...
        rq_qos_add(q, rqos);
        ...
}

When the io.cost.qos file is written by two cpus concurrently, rq_qos may
be added to one disk twice. In that case, there will be two iocs enabled
and running on one disk. They own different iocgs on their active list. In
the ioc_timer_fn function, because of the iocgs from two iocs have the
same root iocg, the root iocg's walk_list may be overwritten by each other
and this leads to list add/del corruptions in building or destroying the
inner_walk list.

And so far, the blk-rq-qos framework works in case that one instance for
one type rq_qos per queue by default. This patch make this explicit and
also fix the crash above.

Signed-off-by: Jinke Han <hanjinke.666@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220720093616.70584-1-hanjinke.666@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 15:16:10 +02:00
Keith Busch 38fb0d7f39 block: ensure iov_iter advances for added pages
[ Upstream commit 325347d965 ]

There are cases where a bio may not accept additional pages, and the iov
needs to advance to the last data length that was accepted. The zone
append used to handle this correctly, but was inadvertently broken when
the setup was made common with the normal r/w case.

Fixes: 576ed91354 ("block: use bio_add_page in bio_iov_iter_get_pages")
Fixes: c58c0074c5 ("block/bio: remove duplicate append pages code")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20220712153256.2202024-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 15:15:43 +02:00
Keith Busch 70edccb32b block/bio: remove duplicate append pages code
[ Upstream commit c58c0074c5 ]

The getting pages setup for zone append and normal IO are identical. Use
common code for each.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-3-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 15:15:43 +02:00
Ming Lei 53f2de4b7b blk-mq: don't create hctx debugfs dir until q->debugfs_dir is created
[ Upstream commit f3ec5d1155 ]

blk_mq_debugfs_register_hctx() can be called by blk_mq_update_nr_hw_queues
when gendisk isn't added yet, such as nvme tcp.

Fixes the warning of 'debugfs: Directory 'hctx0' with parent '/' already present!'
which can be observed reliably when running blktests nvme/005.

Fixes: 6cfc0081b0 ("blk-mq: no need to check return value of debugfs_create functions")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220711090808.259682-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 15:14:16 +02:00
Keith Busch f3edcfd871 block: fix infinite loop for invalid zone append
[ Upstream commit b82d9fa257 ]

Returning 0 early from __bio_iov_append_get_pages() for the
max_append_sectors warning just creates an infinite loop since 0 means
success, and the bio will never fill from the unadvancing iov_iter. We
could turn the return into an error value, but it will already be turned
into an error value later on, so just remove the warning. Clearly no one
ever hit it anyway.

Fixes: 0512a75b98 ("block: Introduce REQ_OP_ZONE_APPEND")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220610195830.3574005-2-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 15:14:10 +02:00
Jan Kara 28a9cbc1c9 block: fix default IO priority handling again
commit e589f46445 upstream.

Commit e70344c059 ("block: fix default IO priority handling")
introduced an inconsistency in get_current_ioprio() that tasks without
IO context return IOPRIO_DEFAULT priority while tasks with freshly
allocated IO context will return 0 (IOPRIO_CLASS_NONE/0) IO priority.
Tasks without IO context used to be rare before 5a9d041ba2 ("block:
move io_context creation into where it's needed") but after this commit
they became common because now only BFQ IO scheduler setups task's IO
context. Similar inconsistency is there for get_task_ioprio() so this
inconsistency is now exposed to userspace and userspace will see
different IO priority for tasks operating on devices with BFQ compared
to devices without BFQ. Furthemore the changes done by commit
e70344c059 change the behavior when no IO priority is set for BFQ IO
scheduler which is also documented in ioprio_set(2) manpage:

"If no I/O scheduler has been set for a thread, then by default the I/O
priority will follow the CPU nice value (setpriority(2)).  In Linux
kernels before version 2.6.24, once an I/O priority had been set using
ioprio_set(), there was no way to reset the I/O scheduling behavior to
the default. Since Linux 2.6.24, specifying ioprio as 0 can be used to
reset to the default I/O scheduling behavior."

So make sure we default to IOPRIO_CLASS_NONE as used to be the case
before commit e70344c059. Also cleanup alloc_io_context() to
explicitely set this IO priority for the allocated IO context to avoid
future surprises. Note that we tweak ioprio_best() to maintain
ioprio_get(2) behavior and make this commit easily backportable.

CC: stable@vger.kernel.org
Fixes: e70344c059 ("block: fix default IO priority handling")
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-11 13:22:02 +02:00
Muchun Song 957a2b345c block: fix missing blkcg_bio_issue_init
The commit 513616843d ("block: remove superfluous calls to
blkcg_bio_issue_init") has removed blkcg_bio_issue_init from
__bio_clone since submit_bio will override ->bi_issue.
However, __blk_queue_split is called after blkcg_bio_issue_init
(see blk_mq_submit_bio) in submit_bio. In this case, the
->bi_issue is 0. Fix it.

Fixes: 513616843d ("block: remove superfluous calls to blkcg_bio_issue_init")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Link: https://lore.kernel.org/r/20220713140226.68135-1-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 10:54:49 -06:00
Li Nan ca2a3343d6 block: remove WARN_ON() from bd_link_disk_holder
Since commit 83cbce957446("block: add error handling for device_add_disk /
add_disk"), bdev->bd_holder_dir can not be empty now, so remove WARN_ON()
from bd_link_disk_holder.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074100.2251301-1-linan122@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-23 07:48:05 -06:00
Jens Axboe 2645672ffe block: pop cached rq before potentially blocking rq_qos_throttle()
If rq_qos_throttle() ends up blocking, then we will have invalidated and
flushed our current plug. Since blk_mq_get_cached_request() hasn't
popped the cached request off the plug list just yet, we end holding a
pointer to a request that is no longer valid. This insta-crashes with
rq->mq_hctx being NULL in the validity checks just after.

Pop the request off the cached list before doing rq_qos_throttle() to
avoid using a potentially stale request.

Fixes: 0a5aa8d161 ("block: fix blk_mq_attempt_bio_merge and rq_qos_throttle protection")
Reported-by: Dylan Yudaken <dylany@fb.com>
Tested-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-21 10:59:58 -06:00
Damien Le Moal 9243fc4cd2 block: remove queue from struct blk_independent_access_range
The request queue pointer in struct blk_independent_access_range is
unused. Remove it.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Fixes: 41e46b3c2a ("block: Fix potential deadlock in blk_ia_range_sysfs_show()")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220603053529.76405-1-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-19 18:40:11 -06:00
Christoph Hellwig a09b314005 block: freeze the queue earlier in del_gendisk
Freeze the queue earlier in del_gendisk so that the state does not
change while we remove debugfs and sysfs files.

Ming mentioned that being able to observer request in debugfs might
be useful while the queue is being frozen in del_gendisk, which is
made possible by this change.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220614074827.458955-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-17 07:31:05 -06:00
Christoph Hellwig 99d055b4fd block: remove per-disk debugfs files in blk_unregister_queue
The block debugfs files are created in blk_register_queue, which is
called by add_disk and use a naming scheme based on the disk_name.
After del_gendisk returns that name can be reused and thus we must not
leave these debugfs files around, otherwise the kernel is unhappy
and spews messages like:

	Directory XXXXX with parent 'block' already present!

and the newly created devices will not have working debugfs files.

Move the unregistration to blk_unregister_queue instead (which matches
the sysfs unregistration) to make sure the debugfs life time rules match
those of the disk name.

As part of the move also make sure the whole debugfs unregistration is
inside a single debugfs_mutex critical section.

Note that this breaks blktests block/002, which checks that the debugfs
directory has not been removed while blktests is running, but that
particular check should simply be removed from the test case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220614074827.458955-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-17 07:31:05 -06:00
Christoph Hellwig 5cf9c91ba9 block: serialize all debugfs operations using q->debugfs_mutex
Various places like I/O schedulers or the QOS infrastructure try to
register debugfs files on demans, which can race with creating and
removing the main queue debugfs directory.  Use the existing
debugfs_mutex to serialize all debugfs operations that rely on
q->debugfs_dir or the directories hanging off it.

To make the teardown code a little simpler declare all debugfs dentry
pointers and not just the main one uncoditionally in blkdev.h.

Move debugfs_mutex next to the dentries that it protects and document
what it is used for.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220614074827.458955-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-17 07:31:05 -06:00
Christoph Hellwig 50e34d7881 block: disable the elevator int del_gendisk
The elevator is only used for file system requests, which are stopped in
del_gendisk.  Move disabling the elevator and freeing the scheduler tags
to the end of del_gendisk instead of doing that work in disk_release and
blk_cleanup_queue to avoid a use after free on q->tag_set from
disk_release as the tag_set might not be alive at that point.

Move the blk_qos_exit call as well, as it just depends on the elevator
exit and would be the only reason to keep the not exactly cheap queue
freeze in disk_release.

Fixes: e155b0c238 ("blk-mq: Use shared tags for shared sbitmap support")
Reported-by: syzbot+3e3f419f4a7816471838@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: syzbot+3e3f419f4a7816471838@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20220614074827.458955-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-17 07:31:05 -06:00
Bart Van Assche b96f3cab59 block/bfq: Enable I/O statistics
BFQ uses io_start_time_ns. That member variable is only set if I/O
statistics are enabled. Hence this patch that enables I/O statistics
at the time BFQ is associated with a request queue.

Compile-tested only.

Reported-by: Cixi Geng <cixi.geng1@unisoc.com>
Cc: Cixi Geng <cixi.geng1@unisoc.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Paolo Valente <paolo.valente@unimore.it>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-16 16:59:28 -06:00
Ming Lei 6cfeadbff3 blk-mq: don't clear flush_rq from tags->rqs[]
commit 364b61818f ("blk-mq: clearing flush request reference in
tags->rqs[]") is added to clear the to-be-free flush request from
tags->rqs[] for avoiding use-after-free on the flush rq.

Yu Kuai reported that blk_mq_clear_flush_rq_mapping() slows down boot time
by ~8s because running scsi probe which may create and remove lots of
unpresent LUNs on megaraid-sas which uses BLK_MQ_F_TAG_HCTX_SHARED and
each request queue has lots of hw queues.

Improve the situation by not running blk_mq_clear_flush_rq_mapping if
disk isn't added when there can't be any flush request issued.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220616014401.817001-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-16 14:45:15 -06:00
Ming Lei 4d337cebcb blk-mq: avoid to touch q->elevator without any protection
q->elevator is referred in blk_mq_has_sqsched() without any protection,
no .q_usage_counter is held, no queue srcu and rcu read lock is held,
so potential use-after-free may be triggered.

Fix the issue by adding one queue flag for checking if the elevator
uses single queue style dispatch. Meantime the elevator feature flag
of ELEVATOR_F_MQ_AWARE isn't needed any more.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220616014401.817001-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-16 14:45:15 -06:00
Ming Lei 5fd7a84a09 blk-mq: protect q->elevator by ->sysfs_lock in blk_mq_elv_switch_none
elevator can be tore down by sysfs switch interface or disk release, so
hold ->sysfs_lock before referring to q->elevator, then potential
use-after-free can be avoided.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220616014401.817001-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-16 14:45:15 -06:00
Bart Van Assche 14dc7a18ab block: Fix handling of offline queues in blk_mq_alloc_request_hctx()
This patch prevents that test nvme/004 triggers the following:

UBSAN: array-index-out-of-bounds in block/blk-mq.h:135:9
index 512 is out of range for type 'long unsigned int [512]'
Call Trace:
 show_stack+0x52/0x58
 dump_stack_lvl+0x49/0x5e
 dump_stack+0x10/0x12
 ubsan_epilogue+0x9/0x3b
 __ubsan_handle_out_of_bounds.cold+0x44/0x49
 blk_mq_alloc_request_hctx+0x304/0x310
 __nvme_submit_sync_cmd+0x70/0x200 [nvme_core]
 nvmf_connect_io_queue+0x23e/0x2a0 [nvme_fabrics]
 nvme_loop_connect_io_queues+0x8d/0xb0 [nvme_loop]
 nvme_loop_create_ctrl+0x58e/0x7d0 [nvme_loop]
 nvmf_create_ctrl+0x1d7/0x4d0 [nvme_fabrics]
 nvmf_dev_write+0xae/0x111 [nvme_fabrics]
 vfs_write+0x144/0x560
 ksys_write+0xb7/0x140
 __x64_sys_write+0x42/0x50
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220615210004.1031820-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-16 14:43:31 -06:00
Christoph Hellwig d5a37b1998 block: remove bioset_init_from_src
Unused now, and the interface never really made a whole lot of sense to
start with.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-06-08 14:04:14 -04:00
Linus Torvalds 78c6499c92 for-5.19/drivers-2022-06-02
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKZmkoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqyrD/4iyg2ULBPyljLoM3Ed8AONbrFBApenKDFN
 FjiFRZNll8zvJLTtP0GQqJIrljPuRlqb0IkbgqXl+yvPZ+wpB9HgIe2aohkOaqqJ
 KS49UNR/aIHwC2y7lwlcsdgVqqoPdc4wnZeaQvCsWPCBhCca/k0kR7s3uEhHMK92
 OWpV9osl/thLfOBwwt4IEaO1Koz8PM/fCR4XA2KLbs8E4P8EcSFglqi0ap7foLNr
 pQAJIlPjkmF6nw4xg5fdBjVBo//kVcuf9IBMi5/XinmUL1taFAcn5WyeOvBbi0Fs
 Sqp/pKkveM8xWZKrDyA/wf8nzRNpBBl6TOQEMFtV6FkZrij3pbKHlCiyL2gBR13e
 5gkbVvXgtdqDVlnqlvIV/Swfh5YQFtn7+vlHFOUjP+iObsBRo4fTvDhPTLoO/VCf
 tIAA7xnq/pquRKS/QGGC7ZxRVc3T1r+EpvbBP5Dc4CDGbbbyZLCSrOh5HWIb0I3k
 95GSiipTtf54KqZ9HiG/u+xNAFIdapXgU4Xm+JyDRWLdxSs5Nmy8VgpdflvxBfuo
 hCvJJw3vtDusjyHc7IafxaZlJVQT9tPgPshs8GfrCMCP19RCALD/5irspFh6dD35
 BQTIkhC68XNa0iNn/NTP3uxir/JwRoovxQkA9eD+r1NHsAbL8GTypfr5kKJxDaIK
 UhawfyZE3Q==
 =YqhC
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/drivers-2022-06-02' of git://git.kernel.dk/linux-block

Pull more block driver updates from Jens Axboe:
 "A collection of stragglers that were late on sending in their changes
  and just followup fixes.

   - NVMe fixes pull request via Christoph:
       - set controller enable bit in a separate write (Niklas Cassel)
       - disable namespace identifiers for the MAXIO MAP1001 (Christoph)
       - fix a comment typo (Julia Lawall)"

   - MD fixes pull request via Song:
       - Remove uses of bdevname (Christoph Hellwig)
       - Bug fixes (Guoqing Jiang, and Xiao Ni)

   - bcache fixes series (Coly)

   - null_blk zoned write fix (Damien)

   - nbd fixes (Yu, Zhang)

   - Fix for loop partition scanning (Christoph)"

* tag 'for-5.19/drivers-2022-06-02' of git://git.kernel.dk/linux-block: (23 commits)
  block: null_blk: Fix null_zone_write()
  nvmet: fix typo in comment
  nvme: set controller enable bit in a separate write
  nvme-pci: disable namespace identifiers for the MAXIO MAP1001
  bcache: avoid unnecessary soft lockup in kworker update_writeback_rate()
  nbd: use pr_err to output error message
  nbd: fix possible overflow on 'first_minor' in nbd_dev_add()
  nbd: fix io hung while disconnecting device
  nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
  nbd: fix race between nbd_alloc_config() and module removal
  nbd: call genl_unregister_family() first in nbd_cleanup()
  md: bcache: check the return value of kzalloc() in detached_dev_do_request()
  bcache: memset on stack variables in bch_btree_check() and bch_sectors_dirty_init()
  block, loop: support partitions without scanning
  bcache: avoid journal no-space deadlock by reserving 1 journal bucket
  bcache: remove incremental dirty sector counting for bch_sectors_dirty_init()
  bcache: improve multithreaded bch_sectors_dirty_init()
  bcache: improve multithreaded bch_btree_check()
  md: fix double free of io_acct_set bioset
  md: Don't set mddev private to NULL in raid0 pers->free
  ...
2022-06-03 10:25:56 -07:00
Linus Torvalds 72fbbc3d0e for-5.19/block-exec-2022-06-02
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKZmh0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqg6EACCbkwoH7rrr38iU++xP3c9oqFJCfYR95ho
 qn9/3FiPulua0Dwg8Fbp12ubqqBy/iNj+4Mk7XTo28P7ahjGtKJec2DZguDuHC5X
 G3kucgQcDOLs1IMWoil+KrnjGC8qeT9ZPFNaUF0IY084NPxnj1wAOjo1J00QVieN
 WFgHX1sxzBje8abebf3UxAyXImzfyY2uXbp1F3thzf0ZwHXkSDsbWI3fvpdYF4QC
 p3z6CX0sR+5v7ZLWF3X6H8MBSO+eRlprYji3O/0jVslLBAS8FlTdizQtzx7C6Hsv
 JZVY4ZsUswYxtsHCBR0McglDeu/iXZRZ9HX4iiOobYJNaXfycMltvS+4Tb/TsFTB
 GaG6tbL4JS+NT063ctl5h355vUVhIbw6qEhsiF47+0hgawvRr/xxP0aSq1MPmfjw
 OgG4Jn0htXF47tpKnszfaj3BmgvgV56mV0IGwF5Sh5NXDnHF+MHFmrhRsP1NenjL
 12FTnvWGYyTRGDVIVFDkuwaI9o9iNKdFw0JkKIZa/G5RVmmjukvMGTvvSnKmSxJg
 dgbYLSBA2ZxCcIPjJDvZroe3QxvGNyqUYtxwyWl4a1HK/qljZfwwyRE1rfjJ47hK
 F8jNEkOThcjr1anoV2nvSLE1mM3SyA/UqDsntIdwnUlG/ObYsByTgtKc+ETs1RzS
 8Ovp6lv0Mw==
 =lBeb
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/block-exec-2022-06-02' of git://git.kernel.dk/linux-block

Pull block request execute cleanups from Jens Axboe:
 "This change was advertised in the initial core block pull request, but
  didn't actually make that branch as we deferred it to a post-merge
  pull request to avoid a bunch of cross branch issues.

  This series cleans up the block execute path quite nicely"

* tag 'for-5.19/block-exec-2022-06-02' of git://git.kernel.dk/linux-block:
  blk-mq: remove the done argument to blk_execute_rq_nowait
  blk-mq: avoid a mess of casts for blk_end_sync_rq
  blk-mq: remove __blk_execute_rq_nowait
2022-06-03 10:21:43 -07:00
Linus Torvalds 34845d92bc for-5.19/block-2022-06-02
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKZmfQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsLlEACPbK/ms8dMDwKjfEF/RMoc7uL/j6oC0cpf
 0D2sfMka8D41QdrUfMiUismXZ61dyKdsiX/U/Q0gcjIomnlco8ZeLcLa6DlafjwY
 DtvO2aCb+eBAkII5sX2WM4ANNgFTy08Y4wBmgEy5En5u4nPlIGZ8DsulQQodqygx
 1lJh31OXQKw+2kIyUdAeC0GMiD9nddYDsH0CTFDSZsAijCcOBDOHbDPk27wHapzM
 GR1UAK5/SA7RfZgIMRHHclF6Ea49/uPJ45crD1T+8p6jLW+ldbxpiRD3ux9BnK2v
 U7EWS5MLMFAvb/nTLc8T37srJuEhBAT0r2bn614rjOiJofalPeD0eDeHfz4vRpPe
 +qTQREtpBUtJizYN+8rpcxP8f9S/hmPOBvIKD3XC0TlOo1NCf35fqWLWMli2hkTQ
 AfcY1auKjC/UYcnR0TQ91aHo1puM4fK5Pdc6lDGznrcxy9t1g1NvKAEL9Y3xK3No
 paglrliBCUbAN8vogKr4jc7jRkh/GLEqkxV2LIpOVp3lyT9GepvYM1xLQ8X/rszn
 /Il3fAwf5AyP+1RoVcmmOy1XW0ptUbKXWn03NlxN55Ya8x3tKCwWWDSmL2CP8SwV
 Vo5Qt+rKUkqA/TmHW8HOd7i+44Sa8oD/6WpSSPkwXN2cgRQmvmtaGmpXCKNTn5tk
 PMgFJOq3uw==
 =7JDU
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/block-2022-06-02' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Just a collection of fixes that have been queued up since the initial
  merge window pull request, the majority of which are targeted for
  stable as well.

  One bio_set fix that fixes an issue with the dm adoption of cached bio
  structs that got introduced in this merge window"

* tag 'for-5.19/block-2022-06-02' of git://git.kernel.dk/linux-block:
  block: Fix potential deadlock in blk_ia_range_sysfs_show()
  block: fix bio_clone_blkg_association() to associate with proper blkcg_gq
  block: remove useless BUG_ON() in blk_mq_put_tag()
  blk-mq: do not update io_ticks with passthrough requests
  block: make bioset_exit() fully resilient against being called twice
  block: use bio_queue_enter instead of blk_queue_enter in bio_poll
  block: document BLK_STS_AGAIN usage
  block: take destination bvec offsets into account in bio_copy_data_iter
  blk-iolatency: Fix inflight count imbalances and IO hangs on offline
  blk-mq: don't touch ->tagset in blk_mq_get_sq_hctx
2022-06-03 10:14:48 -07:00
Damien Le Moal 41e46b3c2a block: Fix potential deadlock in blk_ia_range_sysfs_show()
When being read, a sysfs attribute is already protected against removal
with the kobject node active reference counter. As a result, in
blk_ia_range_sysfs_show(), there is no need to take the queue sysfs
lock when reading the value of a range attribute. Using the queue sysfs
lock in this function creates a potential deadlock situation with the
disk removal, something that a lockdep signals with a splat when the
device is removed:

[  760.703551]  Possible unsafe locking scenario:
[  760.703551]
[  760.703554]        CPU0                    CPU1
[  760.703556]        ----                    ----
[  760.703558]   lock(&q->sysfs_lock);
[  760.703565]                                lock(kn->active#385);
[  760.703573]                                lock(&q->sysfs_lock);
[  760.703579]   lock(kn->active#385);
[  760.703587]
[  760.703587]  *** DEADLOCK ***

Solve this by removing the mutex_lock()/mutex_unlock() calls from
blk_ia_range_sysfs_show().

Fixes: a2247f19ee ("block: Add independent access ranges support")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220603021905.1441419-1-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-02 23:02:37 -06:00
Jan Kara 22b106e535 block: fix bio_clone_blkg_association() to associate with proper blkcg_gq
Commit d92c370a16 ("block: really clone the block cgroup in
bio_clone_blkg_association") changed bio_clone_blkg_association() to
just clone bio->bi_blkg reference from source to destination bio. This
is however wrong if the source and destination bios are against
different block devices because struct blkcg_gq is different for each
bdev-blkcg pair. This will result in IOs being accounted (and throttled
as a result) multiple times against the same device (src bdev) while
throttling of the other device (dst bdev) is ignored. In case of BFQ the
inconsistency can even result in crashes in bfq_bic_update_cgroup().
Fix the problem by looking up correct blkcg_gq for the cloned bio.

Reported-by: Logan Gunthorpe <logang@deltatee.com>
Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de>
Fixes: d92c370a16 ("block: really clone the block cgroup in bio_clone_blkg_association")
CC: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220602081242.7731-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-02 02:15:05 -06:00
Damien Le Moal ff47dbd18b block: remove useless BUG_ON() in blk_mq_put_tag()
Since the if condition in blk_mq_put_tag() checks that the tag to put is
not a reserved one, the BUG_ON() check in the else branch checking if
the tag is indeed a reserved one is useless. Remove it.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220602075159.1273366-1-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-02 02:05:56 -06:00
Haisu Wang b81c14ca14 blk-mq: do not update io_ticks with passthrough requests
Flush or passthrough requests are not accounted as normal IO in completion.
To reflect iostat for slow IO, io_ticks is updated when stat show called
based on inflight numbers.
It may cause inconsistent io_ticks calculation result.

So do not account non-passthrough request when check inflight.

Fixes: 86d7331299 ("block: update io_ticks when io hang")
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: samuelliao <samuelliao@tencent.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220530064059.1120058-1-haisuwang@tencent.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-30 07:13:12 -06:00
Jens Axboe 605f7415ec block: make bioset_exit() fully resilient against being called twice
Most of bioset_exit() is fine being called twice, as it clears the
various allocations etc when they are freed. The exception is
bio_alloc_cache_destroy(), which does not clear ->cache when it has
freed it.

This isn't necessarily a bug, but can be if buggy users does call the
exit path more then once, or with just a memset() bioset which has
never been initialized. dm appears to be one such user.

Fixes: be4d234d7a ("bio: add allocation cache abstraction")
Link: https://lore.kernel.org/linux-block/YpK7m+14A+pZKs5k@casper.infradead.org/
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-29 07:36:31 -06:00
Christoph Hellwig e2e5308672 blk-mq: remove the done argument to blk_execute_rq_nowait
Let the caller set it together with the end_io_data instead of passing
a pointless argument.  Note the the target code did in fact already
set it and then just overrode it again by calling blk_execute_rq_nowait.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220524121530.943123-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-28 06:15:27 -06:00
Christoph Hellwig 32ac5a9b8b blk-mq: avoid a mess of casts for blk_end_sync_rq
Instead of trying to cast a __bitwise 32-bit integer to a larger integer
and then a pointer, just allow a struct with the blk_status_t and the
completion on stack and set the end_io_data to that.  Use the
opportunity to move the code to where it belongs and drop rather
confusing comments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220524121530.943123-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-28 06:15:27 -06:00
Christoph Hellwig ae948fd6d0 blk-mq: remove __blk_execute_rq_nowait
We don't want to plug for synchronous execution that where we immediately
wait for the request.  Once that is done not a whole lot of code is
shared, so just remove __blk_execute_rq_nowait.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220524121530.943123-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-28 06:15:27 -06:00
Christoph Hellwig ebd076bf7d block: use bio_queue_enter instead of blk_queue_enter in bio_poll
We want to have a valid live gendisk to call ->poll and not just a
request_queue, so call the right helper.

Fixes: 3e08773c38 ("block: switch polling to be bio based")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220523124302.526186-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-28 06:14:35 -06:00
Christoph Hellwig 403d50341c block: take destination bvec offsets into account in bio_copy_data_iter
Appartly bcache can copy into bios that do not just contain fresh
pages but can have offsets into the bio_vecs.  Restore support for tht
in bio_copy_data_iter.

Fixes: f8b679a070 ("block: rewrite bio_copy_data_iter to use bvec_kmap_local and memcpy_to_bvec")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220524143919.1155501-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-27 20:35:55 -06:00