linux-stable/block
Tejun Heo 515d077ee3 blk-iolatency: Fix inflight count imbalances and IO hangs on offline
commit 8a177a36da upstream.

iolatency needs to track the number of inflight IOs per cgroup. As this
tracking can be expensive, it is disabled when no cgroup has iolatency
configured for the device. To ensure that the inflight counters stay
balanced, iolatency_set_limit() freezes the request_queue while manipulating
the enabled counter, which ensures that no IO is in flight and thus all
counters are zero.

Unfortunately, iolatency_set_limit() isn't the only place where the enabled
counter is manipulated. iolatency_pd_offline() can also decrement the counter
and trigger disabling. As this disabling happens without freezing the queue,
it can easily happen while some IOs are in flight and thus leak the counts.
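
The leak described above can be illustrated with a toy model. This is only a
sketch and not kernel code: the class and function names (ToyIolatency,
leak_demo) are hypothetical, and it assumes that both the submit and the
completion side skip accounting whenever tracking is disabled.

```python
class ToyIolatency:
    """Toy model: inflight accounting only runs while `enabled` is True."""

    def __init__(self):
        self.enabled = False
        self.inflight = 0

    def submit(self):
        # Count the IO only if tracking is currently enabled.
        if self.enabled:
            self.inflight += 1

    def complete(self):
        # Completion is skipped too when tracking is disabled (assumption).
        if self.enabled:
            self.inflight -= 1


def leak_demo():
    lat = ToyIolatency()
    lat.enabled = True
    for _ in range(3):
        lat.submit()          # three IOs are now counted as in flight
    lat.enabled = False       # disabled WITHOUT waiting for them to finish
    for _ in range(3):
        lat.complete()        # completions are no longer accounted
    return lat.inflight       # the counter never returns to zero


print(leak_demo())
```

In this model the counter stays at the number of IOs that were in flight when
tracking was switched off, which mirrors the stuck inflight values described
in this message.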

This can be easily demonstrated by turning on iolatency on one empty
cgroup while IOs are in flight in other cgroups and then removing the
cgroup. Note that iolatency must not be enabled elsewhere in the system,
so that removing the cgroup disables iolatency for the whole device.

The following keeps flipping iolatency on and off on sda:

  echo +io > /sys/fs/cgroup/cgroup.subtree_control
  while true; do
      mkdir -p /sys/fs/cgroup/test
      echo '8:0 target=100000' > /sys/fs/cgroup/test/io.latency
      sleep 1
      rmdir /sys/fs/cgroup/test
      sleep 1
  done

and there's concurrent fio generating direct rand reads:

  fio --name test --filename=/dev/sda --direct=1 --rw=randread \
      --runtime=600 --time_based --iodepth=256 --numjobs=4 --bs=4k

while monitoring with the following drgn script:

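  # Run as a script with drgn; `prog` is provided by drgn itself.
  import time
  from drgn import container_of
  from drgn.helpers.linux.block import disk_name
  from drgn.helpers.linux.cgroup import cgroup_path, css_for_each_descendant_pre
  from drgn.helpers.linux.list import hlist_for_each

  while True:
    for css in css_for_each_descendant_pre(prog['blkcg_root'].css.address_of_()):
        for pos in hlist_for_each(container_of(css, 'struct blkcg', 'css').blkg_list):
            blkg = container_of(pos, 'struct blkcg_gq', 'blkcg_node')
            pd = blkg.pd[prog['blkcg_policy_iolatency'].plid]
            if pd.value_() == 0:
                continue
            iolat = container_of(pd, 'struct iolatency_grp', 'pd')
            inflight = iolat.rq_wait.inflight.counter.value_()
            if inflight:
                print(f'inflight={inflight} {disk_name(blkg.q.disk).decode("utf-8")} '
                      f'{cgroup_path(css.cgroup).decode("utf-8")}')
    time.sleep(1)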

The monitoring output looks like the following:

  inflight=1 sda /user.slice
  inflight=1 sda /user.slice
  ...
  inflight=14 sda /user.slice
  inflight=13 sda /user.slice
  inflight=17 sda /user.slice
  inflight=15 sda /user.slice
  inflight=18 sda /user.slice
  inflight=17 sda /user.slice
  inflight=20 sda /user.slice
  inflight=19 sda /user.slice <- fio stopped, inflight stuck at 19
  inflight=19 sda /user.slice
  inflight=19 sda /user.slice

If a cgroup with stuck inflight ends up getting throttled, the throttled IOs
will never get issued as there's no completion event to wake them up, leading
to an indefinite hang.

This patch fixes the bug by unifying enable handling into a work item, which
is automatically kicked off from iolatency_set_min_lat_nsec(). That function
is called from both the iolatency_set_limit() and iolatency_pd_offline() paths.
Punting to a work item is necessary because iolatency_pd_offline() is called
under spinlocks, while freezing a request_queue requires a sleepable context.
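
The deferral idea can be sketched with the same kind of toy model. This is an
illustration under stated assumptions, not the kernel implementation: ToyQueue,
set_min_lat, enable_work and run_pending are hypothetical names, the deque
stands in for a work item, and draining outstanding IOs stands in for freezing
the request_queue.

```python
from collections import deque


class ToyQueue:
    """Toy model of deferring the enable/disable toggle to a work item."""

    def __init__(self):
        self.enabled = False
        self.nr_configured = 0   # cgroups with a latency target set
        self.outstanding = 0     # IOs really in flight
        self.inflight = 0        # IOs the accounting believes are in flight
        self.pending = deque()   # stand-in for queued work items

    def submit(self):
        self.outstanding += 1
        if self.enabled:
            self.inflight += 1

    def complete(self):
        self.outstanding -= 1
        if self.enabled:
            self.inflight -= 1

    def set_min_lat(self, old, new):
        # Called from both the set-limit and the offline paths. It may run
        # in a context that cannot sleep, so it only queues the work.
        if bool(old) != bool(new):
            self.nr_configured += 1 if new else -1
            self.pending.append(self.enable_work)

    def enable_work(self):
        # Runs later in a context that may sleep.
        want = self.nr_configured > 0
        if want != self.enabled:
            # Stand-in for freezing: a real freeze waits for in-flight IOs
            # to finish; the toy simply completes them here.
            while self.outstanding:
                self.complete()
            self.enabled = want

    def run_pending(self):
        while self.pending:
            self.pending.popleft()()


def deferred_demo():
    q = ToyQueue()
    q.set_min_lat(0, 100000)
    q.run_pending()                      # tracking is now enabled
    for _ in range(3):
        q.submit()
    q.set_min_lat(100000, 0)             # offline-like path: only queues work
    before = (q.enabled, q.inflight)     # still enabled, still counting
    q.run_pending()                      # drain first, then disable
    return before, (q.enabled, q.inflight, q.outstanding)


print(deferred_demo())
```

Because the toggle only happens after the drain, the counter is back at zero
by the time tracking is switched off, regardless of which path requested the
change.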

This also simplifies the code, reducing LOC (not counting comments), and
avoids the unnecessary freezes that happened whenever a cgroup's latency
target was newly set or cleared.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Liu Bo <bo.liu@linux.alibaba.com>
Fixes: 8c772a9bfc ("blk-iolatency: fix IO hang due to negative inflight counter")
Cc: stable@vger.kernel.org # v5.0+
Link: https://lore.kernel.org/r/Yn9ScX6Nx2qIiQQi@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-14 16:59:30 +02:00
partitions partitions/aix: append null character to print data from disk 2018-07-27 09:17:41 -06:00
Kconfig block: introduce blk-iolatency io controller 2018-07-09 09:07:54 -06:00
Kconfig.iosched
Makefile block: introduce blk-iolatency io controller 2018-07-09 09:07:54 -06:00
badblocks.c badblocks: fix wrong return value in badblocks_set if badblocks are disabled 2017-11-03 11:29:50 -07:00
bfq-cgroup.c block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group() 2020-03-25 08:06:08 +01:00
bfq-iosched.c Revert "Revert "block, bfq: honor already-setup queue merges"" 2022-04-15 14:14:54 +02:00
bfq-iosched.h block, bfq: fix use after free in bfq_bfqq_expire 2021-12-29 12:20:43 +01:00
bfq-wf2q.c block, bfq: fix use after free in bfq_bfqq_expire 2021-12-29 12:20:43 +01:00
bio-integrity.c block: bio-integrity: Advance seed correctly for larger interval sizes 2022-02-08 18:23:14 +01:00
bio.c block-map: add __GFP_ZERO flag for alloc_page in function bio_copy_kern 2022-06-06 08:24:21 +02:00
blk-cgroup.c bdi: use bdi_dev_name() to get device name 2021-08-08 08:54:29 +02:00
blk-core.c block: split .sysfs_lock into two locks 2021-03-04 09:39:30 +01:00
blk-exec.c blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request 2018-01-17 09:49:21 -07:00
blk-flush.c block: Fix fsync always failed if once failed 2022-03-08 19:04:08 +01:00
blk-integrity.c block drivers/block: Use octal not symbolic permissions 2018-05-24 13:38:59 -06:00
blk-ioc.c block: Fix use-after-free issue accessing struct io_cq 2020-04-17 10:48:41 +02:00
blk-iolatency.c blk-iolatency: Fix inflight count imbalances and IO hangs on offline 2022-06-14 16:59:30 +02:00
blk-lib.c block: fix 32 bit overflow in __blkdev_issue_discard() 2020-02-01 09:37:12 +00:00
blk-map.c block: fix memleak when __blk_rq_map_user_iov() is failed 2020-01-12 12:17:22 +01:00
blk-merge.c block: don't merge across cgroup boundaries if blkcg is enabled 2022-04-15 14:14:41 +02:00
blk-mq-cpumap.c blk-mq: don't keep offline CPUs mapped to hctx 0 2018-04-10 08:38:46 -06:00
blk-mq-debugfs-zoned.c block: Make struct request_queue smaller for CONFIG_BLK_DEV_ZONED=n 2018-07-09 09:07:52 -06:00
blk-mq-debugfs.c block, scsi: Change the preempt-only flag into a counter 2019-08-04 09:30:57 +02:00
blk-mq-debugfs.h block: Make struct request_queue smaller for CONFIG_BLK_DEV_ZONED=n 2018-07-09 09:07:52 -06:00
blk-mq-pci.c blk-mq: code clean-up by adding an API to clear set->mq_map 2018-07-09 09:07:53 -06:00
blk-mq-rdma.c
blk-mq-sched.c blk-mq: order adding requests to hctx->dispatch and checking SCHED_RESTART 2020-09-03 11:24:26 +02:00
blk-mq-sched.h block: mq-deadline: Fix write completion handling 2019-01-13 09:51:07 +01:00
blk-mq-sysfs.c block: split .sysfs_lock into two locks 2021-03-04 09:39:30 +01:00
blk-mq-tag.c blk-mq: Allow blocking queue tag iter callbacks 2018-09-25 20:17:59 -06:00
blk-mq-tag.h Merge branch 'for-4.15/block' of git://git.kernel.dk/linux-block 2017-11-14 15:32:19 -08:00
blk-mq-virtio.c
blk-mq.c blk-mq: Swap two calls in blk_mq_exit_queue() 2021-05-22 10:59:46 +02:00
blk-mq.h blk-mq: free hw queue's resource in hctx's release handler 2019-09-16 08:22:13 +02:00
blk-rq-qos.c blk-wbt: fix performance regression in wbt scale_up/scale_down 2019-10-17 13:45:16 -07:00
blk-rq-qos.h blk-rq-qos: fix first node deletion of rq_qos_del() 2019-10-29 09:20:09 +01:00
blk-settings.c blk-settings: align max_sectors on "logical_block_size" boundary 2021-03-04 09:39:51 +01:00
blk-softirq.c block: fix timeout changes for legacy request drivers 2018-06-19 11:27:18 -06:00
blk-stat.c blk-stat: export helpers for modifying blk_rq_stat 2018-07-09 09:07:54 -06:00
blk-stat.h block: deactivate blk_stat timer in wbt_disable_default() 2019-01-13 09:51:06 +01:00
blk-sysfs.c block: don't delete queue kobject before its children 2022-04-15 14:14:43 +02:00
blk-tag.c for-linus-20180616 2018-06-17 05:37:55 +09:00
blk-throttle.c blk-throttle: fix UAF by deleteing timer in blk_throtl_exit() 2021-09-26 13:39:49 +02:00
blk-timeout.c blk-mq: Fix timeout handling in case the timeout handler returns BLK_EH_DONE 2018-06-23 10:25:45 -06:00
blk-wbt.c blk-wbt: make sure throttle is enabled properly 2021-07-20 16:15:48 +02:00
blk-wbt.h blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled() 2021-07-20 16:15:48 +02:00
blk-zoned.c blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN 2021-09-22 11:47:57 +02:00
blk.h block: split .sysfs_lock into two locks 2021-03-04 09:39:30 +01:00
bounce.c block: copy ioprio in __bio_clone_fast() and bounce 2018-12-01 09:37:32 +01:00
bsg-lib.c block/bsg-lib: use PTR_ERR_OR_ZERO to simplify the flow path 2018-08-01 09:13:03 -06:00
bsg.c block: bsg: move atomic_t ref_count variable to refcount API 2018-08-27 19:17:02 -06:00
cfq-iosched.c cfq: Suppress compiler warnings about comparisons 2018-08-07 17:57:13 -06:00
cmdline-parser.c
compat_ioctl.c block/compat_ioctl: fix range check in BLKGETSIZE 2022-04-27 13:39:45 +02:00
deadline-iosched.c block drivers/block: Use octal not symbolic permissions 2018-05-24 13:38:59 -06:00
elevator.c block/wbt: fix negative inflight counter when remove scsi device 2022-02-23 11:58:40 +01:00
genhd.c block: Suppress uevent for hidden device when removed 2021-03-30 14:36:59 +02:00
ioctl.c block: pass inclusive 'lend' parameter to truncate_inode_pages_range 2018-02-23 15:20:19 -07:00
ioprio.c block: fix ioprio_get(IOPRIO_WHO_PGRP) vs setuid(2) 2021-12-14 10:18:07 +01:00
kyber-iosched.c block: kyber: make kyber more friendly with merging 2018-05-30 10:47:40 -06:00
mq-deadline.c block: mq-deadline: Fix queue restart handling 2019-10-07 18:57:19 +02:00
noop-iosched.c
opal_proto.h
partition-generic.c block: fix use-after-free on gendisk 2019-05-31 06:46:18 -07:00
scsi_ioctl.c block: consistently use GFP_NOIO instead of __GFP_NORECLAIM 2018-05-14 08:55:18 -06:00
sed-opal.c block: sed-opal: fix IOC_OPAL_ENABLE_DISABLE_MBR 2019-05-31 06:46:24 -07:00
t10-pi.c block: move dif_prepare/dif_complete functions to block layer 2018-07-30 08:27:02 -06:00