linux-stable/block
Tejun Heo a30acbb5df blk-iolatency: Fix inflight count imbalances and IO hangs on offline
commit 8a177a36da upstream.

iolatency needs to track the number of inflight IOs per cgroup. As this
tracking can be expensive, it is disabled when no cgroup has iolatency
configured for the device. To ensure that the inflight counters stay
balanced, iolatency_set_limit() freezes the request_queue while manipulating
the enabled counter, which ensures that no IO is in flight and thus all
counters are zero.

Unfortunately, iolatency_set_limit() isn't the only place where the enabled
counter is manipulated. iolatency_pd_offline() can also dec the counter and
trigger disabling. As this disabling happens without freezing the q, this
can easily happen while some IOs are in flight and thus leak the counts.

This can be easily demonstrated by turning on iolatency on an one empty
cgroup while IOs are in flight in other cgroups and then removing the
cgroup. Note that iolatency shouldn't have been enabled elsewhere in the
system to ensure that removing the cgroup disables iolatency for the whole
device.

The following keeps flipping on and off iolatency on sda:

  echo +io > /sys/fs/cgroup/cgroup.subtree_control
  while true; do
      mkdir -p /sys/fs/cgroup/test
      echo '8:0 target=100000' > /sys/fs/cgroup/test/io.latency
      sleep 1
      rmdir /sys/fs/cgroup/test
      sleep 1
  done

and there's concurrent fio generating direct rand reads:

  fio --name test --filename=/dev/sda --direct=1 --rw=randread \
      --runtime=600 --time_based --iodepth=256 --numjobs=4 --bs=4k

while monitoring with the following drgn script:

  while True:
    for css in css_for_each_descendant_pre(prog['blkcg_root'].css.address_of_()):
        for pos in hlist_for_each(container_of(css, 'struct blkcg', 'css').blkg_list):
            blkg = container_of(pos, 'struct blkcg_gq', 'blkcg_node')
            pd = blkg.pd[prog['blkcg_policy_iolatency'].plid]
            if pd.value_() == 0:
                continue
            iolat = container_of(pd, 'struct iolatency_grp', 'pd')
            inflight = iolat.rq_wait.inflight.counter.value_()
            if inflight:
                print(f'inflight={inflight} {disk_name(blkg.q.disk).decode("utf-8")} '
                      f'{cgroup_path(css.cgroup).decode("utf-8")}')
    time.sleep(1)

The monitoring output looks like the following:

  inflight=1 sda /user.slice
  inflight=1 sda /user.slice
  ...
  inflight=14 sda /user.slice
  inflight=13 sda /user.slice
  inflight=17 sda /user.slice
  inflight=15 sda /user.slice
  inflight=18 sda /user.slice
  inflight=17 sda /user.slice
  inflight=20 sda /user.slice
  inflight=19 sda /user.slice <- fio stopped, inflight stuck at 19
  inflight=19 sda /user.slice
  inflight=19 sda /user.slice

If a cgroup with stuck inflight ends up getting throttled, the throttled IOs
will never get issued as there's no completion event to wake it up leading
to an indefinite hang.

This patch fixes the bug by unifying enable handling into a work item which
is automatically kicked off from iolatency_set_min_lat_nsec() which is
called from both iolatency_set_limit() and iolatency_pd_offline() paths.
Punting to a work item is necessary as iolatency_pd_offline() is called
under spinlocks while freezing a request_queue requires a sleepable context.

This also simplifies the code reducing LOC sans the comments and avoids the
unnecessary freezes which were happening whenever a cgroup's latency target
is newly set or cleared.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Liu Bo <bo.liu@linux.alibaba.com>
Fixes: 8c772a9bfc ("blk-iolatency: fix IO hang due to negative inflight counter")
Cc: stable@vger.kernel.org # v5.0+
Link: https://lore.kernel.org/r/Yn9ScX6Nx2qIiQQi@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09 10:23:30 +02:00
..
partitions block: drop unused includes in <linux/genhd.h> 2022-03-16 14:23:46 +01:00
badblocks.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
bdev.c block: simplify the block device syncing code 2022-04-27 14:38:50 +02:00
bfq-cgroup.c bfq: Make sure bfqg for which we are queueing requests is online 2022-06-09 10:23:19 +02:00
bfq-iosched.c bfq: Get rid of __bio_blkcg() usage 2022-06-09 10:23:19 +02:00
bfq-iosched.h bfq: Get rid of __bio_blkcg() usage 2022-06-09 10:23:19 +02:00
bfq-wf2q.c block/bfq_wf2q: correct weight to ioprio 2022-04-08 14:23:55 +02:00
bio-integrity.c block: bio-integrity: Advance seed correctly for larger interval sizes 2022-02-08 18:34:05 +01:00
bio.c block: fix offset/size check in bio_trim() 2022-04-20 09:34:14 +02:00
blk-cgroup-rwstat.c blk-cgroup: Fix the recursive blkg rwstat 2021-03-05 11:32:15 -07:00
blk-cgroup-rwstat.h blk-cgroup: separate out blkg_rwstat under CONFIG_BLK_CGROUP_RWSTAT 2019-11-07 12:28:13 -07:00
blk-cgroup.c blk-cgroup: set blkg iostat after percpu stat aggregation 2022-04-08 14:23:06 +02:00
blk-core.c block: release rq qos structures for queue without disk 2022-03-23 09:16:41 +01:00
blk-crypto-fallback.c block: rename BIO_MAX_PAGES to BIO_MAX_VECS 2021-03-11 07:47:48 -07:00
blk-crypto-internal.h block: make blk_crypto_rq_bio_prep() able to fail 2020-10-05 10:47:43 -06:00
blk-crypto.c blk-crypto: fix check for too-large dun_bytes 2021-08-25 06:45:00 -06:00
blk-exec.c block: return errors from blk_execute_rq() 2021-06-30 15:35:45 -06:00
blk-flush.c block: Fix fsync always failed if once failed 2022-01-27 11:05:25 +01:00
blk-integrity.c block: flush the integrity workqueue in blk_integrity_unregister 2021-09-14 20:03:30 -06:00
blk-ioc.c block: remove retry loop in ioc_release_fn() 2020-07-16 10:22:15 -06:00
blk-iocost.c iocost: don't reset the inuse weight of under-weighted debtors 2022-05-09 09:14:31 +02:00
blk-iolatency.c blk-iolatency: Fix inflight count imbalances and IO hangs on offline 2022-06-09 10:23:30 +02:00
blk-ioprio.c block: Introduce the ioprio rq-qos policy 2021-06-21 15:03:40 -06:00
blk-ioprio.h block: Introduce the ioprio rq-qos policy 2021-06-21 15:03:40 -06:00
blk-lib.c block: export blk_next_bio() 2021-06-17 15:51:20 +02:00
blk-map.c block-map: add __GFP_ZERO flag for alloc_page in function bio_copy_kern 2022-03-08 19:12:31 +01:00
blk-merge.c block: don't merge across cgroup boundaries if blkcg is enabled 2022-04-08 14:22:59 +02:00
blk-mq-cpumap.c blk-mq: remove the calling of local_memory_node() 2020-10-20 07:08:17 -06:00
blk-mq-debugfs-zoned.c
blk-mq-debugfs.c block: decode QUEUE_FLAG_HCTX_ACTIVE in debugfs output 2021-10-04 06:58:39 -06:00
blk-mq-debugfs.h blk-mq: no need to check return value of debugfs_create functions 2019-06-13 03:00:30 -06:00
blk-mq-pci.c block: Fix blk_mq_*_map_queues() kernel-doc headers 2019-05-31 15:12:34 -06:00
blk-mq-rdma.c block: Fix blk_mq_*_map_queues() kernel-doc headers 2019-05-31 15:12:34 -06:00
blk-mq-sched.c block: limit request dispatch loop duration 2022-04-08 14:22:59 +02:00
blk-mq-sched.h blk: Fix lock inversion between ioc lock and bfqd lock 2021-06-24 18:43:55 -06:00
blk-mq-sysfs.c block: remove blk-mq-sysfs dead code 2021-08-02 13:37:29 -06:00
blk-mq-tag.c blk-mq: avoid to iterate over stale request 2021-09-12 19:32:43 -06:00
blk-mq-tag.h blk-mq: Some tag allocation code refactoring 2021-05-24 06:47:22 -06:00
blk-mq-virtio.c blk-mq: Fix typo in comment 2020-03-17 20:55:21 +01:00
blk-mq.c blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release() 2021-12-01 09:04:56 +01:00
blk-mq.h blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release() 2021-12-01 09:04:56 +01:00
blk-pm.c scsi: block: pm: Always set request queue runtime active in blk_post_runtime_resume() 2022-01-27 11:04:15 +01:00
blk-pm.h block: Remove unused blk_pm_*() function definitions 2021-02-22 06:33:48 -07:00
blk-rq-qos.c rq-qos: fix missed wake-ups in rq_qos_throttle try two 2021-06-08 15:12:57 -06:00
blk-rq-qos.h block: Introduce the ioprio rq-qos policy 2021-06-21 15:03:40 -06:00
blk-settings.c block: Fix partition check for host-aware zoned block devices 2021-10-27 06:58:01 -06:00
blk-stat.c blk-stat: make q->stats->lock irqsafe 2020-09-01 16:48:46 -06:00
blk-stat.h
blk-sysfs.c block: don't delete queue kobject before its children 2022-04-08 14:23:07 +02:00
blk-throttle.c blk-throttle: fix UAF by deleteing timer in blk_throtl_exit() 2021-09-07 08:36:56 -06:00
blk-timeout.c block: blk-timeout: delete duplicated word 2020-07-31 16:29:47 -06:00
blk-wbt.c blk-wbt: prevent NULL pointer dereference in wb_timer_fn 2021-11-18 19:16:34 +01:00
blk-wbt.h blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled() 2021-06-21 15:03:41 -06:00
blk-zoned.c block: Hold invalidate_lock in BLKRESETZONE ioctl 2021-11-18 19:17:15 +01:00
blk.h block: bump max plugged deferred size from 16 to 32 2021-11-18 19:16:16 +01:00
bounce.c block: use memcpy_from_bvec in __blk_queue_bounce 2021-08-02 13:37:28 -06:00
bsg-lib.c scsi: bsg-lib: Fix commands without data transfer in bsg_transport_sg_io_fn() 2021-08-01 13:21:40 -04:00
bsg.c scsi: bsg: Fix device unregistration 2021-09-14 00:22:15 -04:00
disk-events.c block: return errors from disk_alloc_events 2021-08-23 12:55:45 -06:00
elevator.c block/wbt: fix negative inflight counter when remove scsi device 2022-02-23 12:03:15 +01:00
fops.c block: hold ->invalidate_lock in blkdev_fallocate 2021-09-24 11:06:58 -06:00
genhd.c block: Fix the maximum minor value is blk_alloc_ext_minor() 2022-04-08 14:24:11 +02:00
holder.c block: drop unused includes in <linux/genhd.h> 2022-03-16 14:23:46 +01:00
ioctl.c block/compat_ioctl: fix range check in BLKGETSIZE 2022-04-27 14:39:02 +02:00
ioprio.c block: fix ioprio_get(IOPRIO_WHO_PGRP) vs setuid(2) 2021-12-14 10:57:16 +01:00
Kconfig SCSI misc on 20210902 2021-09-02 15:09:46 -07:00
Kconfig.iosched Revert "block/mq-deadline: Add cgroup support" 2021-08-11 13:47:26 -06:00
keyslot-manager.c - Fix DM integrity's HMAC support to provide enhanced security of 2021-02-22 10:22:54 -08:00
kyber-iosched.c kyber: avoid q->disk dereferences in trace points 2021-10-15 21:02:57 -06:00
Makefile block-5.15-2021-09-11 2021-09-11 10:19:51 -07:00
mq-deadline.c block: fix async_depth sysfs interface for mq-deadline 2022-01-27 11:05:25 +01:00
opal_proto.h block: sed-opal: Change the check condition for regular session validity 2020-03-12 08:00:10 -06:00
sed-opal.c block: sed-opal: Change the check condition for regular session validity 2020-03-12 08:00:10 -06:00
t10-pi.c block: use bvec_kmap_local in t10_pi_type1_{prepare,complete} 2021-08-02 13:37:28 -06:00