linux-stable/drivers/md
Mauricio Faria de Oliveira 47796938c4 Revert "dm mpath: fix stalls when handling invalid ioctls"
This reverts commit a1989b3300.

That commit introduced a regression at least for the case of the SG_IO ioctl()
running without CAP_SYS_RAWIO capability (e.g., unprivileged users) when there
are no active paths: the ioctl() fails with the ENOTTY errno immediately rather
than blocking due to queue_if_no_path until a path becomes active, for example.

That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
(qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2])
from multipath devices; which leads to SCSI/filesystem errors in such a guest.

More general scenarios can hit that regression too. The following demonstration
employs a SG_IO ioctl() with a standard SCSI INQUIRY command for this objective
(some output & user changes omitted for brevity and comments added for clarity).

Reverting that commit restores normal operation (queueing) in failing scenarios;
tested on linux-next (next-20151022).

1) Test-case is based on sg_simple0 [3] (just SG_IO; remove SG_GET_VERSION_NUM)

    $ cat sg_simple0.c
    ... see [3] ...
    $ sed '/SG_GET_VERSION_NUM/,/}/d' sg_simple0.c > sgio_inquiry.c
    $ gcc sgio_inquiry.c -o sgio_inquiry

2) The ioctl() works fine with active paths present.

    # multipath -l 85ag56
    85ag56 (...) dm-19 IBM     ,2145
    size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
    |-+- policy='service-time 0' prio=0 status=active
    | |- 8:0:11:0  sdz  65:144  active undef running
    | `- 9:0:9:0   sdbf 67:144  active undef running
    `-+- policy='service-time 0' prio=0 status=enabled
      |- 8:0:12:0  sdae 65:224  active undef running
      `- 9:0:12:0  sdbo 68:32   active undef running

    $ ./sgio_inquiry /dev/mapper/85ag56
    Some of the INQUIRY command's response:
        IBM       2145              0000
    INQUIRY duration=0 millisecs, resid=0

3) The ioctl() fails with ENOTTY errno with _no_ active paths present,
   for unprivileged users (rather than blocking due to queue_if_no_path).

    # for path in $(multipath -l 85ag56 | grep -o 'sd[a-z]\+'); \
          do multipathd -k"fail path $path"; done

    # multipath -l 85ag56
    85ag56 (...) dm-19 IBM     ,2145
    size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
    |-+- policy='service-time 0' prio=0 status=enabled
    | |- 8:0:11:0  sdz  65:144  failed undef running
    | `- 9:0:9:0   sdbf 67:144  failed undef running
    `-+- policy='service-time 0' prio=0 status=enabled
      |- 8:0:12:0  sdae 65:224  failed undef running
      `- 9:0:12:0  sdbo 68:32   failed undef running

    $ ./sgio_inquiry /dev/mapper/85ag56
    sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device

4) dmesg shows that scsi_verify_blk_ioctl() failed for SG_IO (0x2285);
   it returns -ENOIOCTLCMD, later replaced with -ENOTTY in vfs_ioctl().

    $ dmesg
    <...>
    [] device-mapper: multipath: Failing path 65:144.
    [] device-mapper: multipath: Failing path 67:144.
    [] device-mapper: multipath: Failing path 65:224.
    [] device-mapper: multipath: Failing path 68:32.
    [] sgio_inquiry: sending ioctl 2285 to a partition!

5) The ioctl() only works if the SYS_CAP_RAWIO capability is present
   (then queueing happens -- in this example, queue_if_no_path is set);
   this is due to a conditional check in scsi_verify_blk_ioctl().

    # capsh --drop=cap_sys_rawio -- -c './sgio_inquiry /dev/mapper/85ag56'
    sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device

    # ./sgio_inquiry /dev/mapper/85ag56 &
    [1] 72830

    # cat /proc/72830/stack
    [<c00000171c0df700>] 0xc00000171c0df700
    [<c000000000015934>] __switch_to+0x204/0x350
    [<c000000000152d4c>] msleep+0x5c/0x80
    [<c00000000077dfb0>] dm_blk_ioctl+0x70/0x170
    [<c000000000487c40>] blkdev_ioctl+0x2b0/0x9b0
    [<c0000000003128e4>] block_ioctl+0x64/0xd0
    [<c0000000002dd3b0>] do_vfs_ioctl+0x490/0x780
    [<c0000000002dd774>] SyS_ioctl+0xd4/0xf0
    [<c000000000009358>] system_call+0x38/0xd0

6) This is the function call chain exercised in this analysis:

SYSCALL_DEFINE3(ioctl, <...>) @ fs/ioctl.c
    -> do_vfs_ioctl()
        -> vfs_ioctl()
            ...
            error = filp->f_op->unlocked_ioctl(filp, cmd, arg);
            ...
                -> dm_blk_ioctl() @ drivers/md/dm.c
                    -> multipath_ioctl() @ drivers/md/dm-mpath.c
                        ...
                        (bdev = NULL, due to no active paths)
                        ...
                        if (!bdev || <...>) {
                            int err = scsi_verify_blk_ioctl(NULL, cmd);
                            if (err)
                                r = err;
                        }
                        ...
                            -> scsi_verify_blk_ioctl() @ block/scsi_ioctl.c
                                ...
                                if (bd && bd == bd->bd_contains) // not taken (bd = NULL)
                                    return 0;
                                ...
                                if (capable(CAP_SYS_RAWIO)) // not taken (unprivileged user)
                                    return 0;
                                ...
                                printk_ratelimited(KERN_WARNING
                                           "%s: sending ioctl %x to a partition!\n" <...>);

                                return -ENOIOCTLCMD;
                            <-
                        ...
                        return r ? : <...>
                    <-
            ...
            if (error == -ENOIOCTLCMD)
                error = -ENOTTY;
             out:
                return error;
            ...

Links:
[1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
[2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')
[3] http://tldp.org/HOWTO/SCSI-Generic-HOWTO/pexample.html (Revision 1.2, 2002-05-03)

Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-10-31 18:53:51 -04:00
..
bcache bcache: remove driver private bio splitting code 2015-08-13 12:31:40 -06:00
persistent-data dm: remove unlikely() before IS_ERR() 2015-08-12 11:32:21 -04:00
bitmap.c md/bitmap: don't pass -1 to bitmap_storage_alloc. 2015-10-02 17:24:13 +10:00
bitmap.h md-cluster: re-add capabilities 2015-04-22 07:59:39 +10:00
dm-bio-prison.c block: add a bi_error field to struct bio 2015-07-29 08:55:15 -06:00
dm-bio-prison.h dm bio prison: add dm_cell_promote_or_release() 2015-05-29 14:19:06 -04:00
dm-bio-record.h
dm-bufio.c block: add a bi_error field to struct bio 2015-07-29 08:55:15 -06:00
dm-bufio.h
dm-builtin.c
dm-cache-block-types.h dm cache: revert "remove remainder of distinct discard block size" 2014-11-10 15:25:30 -05:00
dm-cache-metadata.c dm cache: add fail io mode and needs_check flag 2015-06-11 17:13:00 -04:00
dm-cache-metadata.h dm cache: add fail io mode and needs_check flag 2015-06-11 17:13:00 -04:00
dm-cache-policy-cleaner.c dm cache: pass a new 'critical' flag to the policies when requesting writeback work 2015-05-29 14:19:04 -04:00
dm-cache-policy-internal.h dm cache: age and write back cache entries even without active IO 2015-06-11 17:13:01 -04:00
dm-cache-policy-mq.c dm cache policy smq: move 'dm-cache-default' module alias to SMQ 2015-08-12 11:27:29 -04:00
dm-cache-policy-smq.c dm cache policy smq: change the mutex to a spinlock 2015-08-12 11:32:19 -04:00
dm-cache-policy.c
dm-cache-policy.h dm cache: age and write back cache entries even without active IO 2015-06-11 17:13:01 -04:00
dm-cache-target.c Merge tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm 2015-09-02 16:35:26 -07:00
dm-crypt.c dm crypt: constrain crypt device's max_segment_size to PAGE_SIZE 2015-09-14 12:04:24 -04:00
dm-delay.c dm: do not override error code returned from dm_get_device() 2015-08-12 11:32:21 -04:00
dm-era-target.c block: kill merge_bvec_fn() completely 2015-08-13 12:31:57 -06:00
dm-exception-store.c
dm-exception-store.h
dm-flakey.c Merge tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm 2015-09-02 16:35:26 -07:00
dm-io.c block: remove bio_get_nr_vecs() 2015-08-13 12:32:04 -06:00
dm-ioctl.c char: make misc_deregister a void function 2015-08-05 10:35:49 -07:00
dm-kcopyd.c
dm-linear.c Merge tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm 2015-09-02 16:35:26 -07:00
dm-log-userspace-base.c dm log userspace base: fix compile warning 2015-04-15 12:10:20 -04:00
dm-log-userspace-transfer.c dm log userspace transfer: match wait_for_completion_timeout return type 2015-04-15 12:10:20 -04:00
dm-log-userspace-transfer.h
dm-log-writes.c Merge tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm 2015-09-02 16:35:26 -07:00
dm-log.c
dm-mpath.c Revert "dm mpath: fix stalls when handling invalid ioctls" 2015-10-31 18:53:51 -04:00
dm-mpath.h
dm-path-selector.c
dm-path-selector.h
dm-queue-length.c
dm-raid.c block: kill merge_bvec_fn() completely 2015-08-13 12:31:57 -06:00
dm-raid1.c Merge tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm 2015-09-02 16:35:26 -07:00
dm-region-hash.c
dm-round-robin.c
dm-service-time.c
dm-snap-persistent.c dm: remove unlikely() before IS_ERR() 2015-08-12 11:32:21 -04:00
dm-snap-transient.c
dm-snap.c Merge tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm 2015-09-02 16:35:26 -07:00
dm-stats.c dm stats: report precise_timestamps and histogram in @stats_list output 2015-08-18 17:20:03 -04:00
dm-stats.h dm stats: support precise timestamps 2015-06-17 12:40:40 -04:00
dm-stripe.c Merge tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm 2015-09-02 16:35:26 -07:00
dm-switch.c
dm-sysfs.c dm: add 'use_blk_mq' module param and expose in per-device ro sysfs attr 2015-04-15 12:10:17 -04:00
dm-table.c block: Replace SG_GAPS with new queue limits mask 2015-08-19 14:26:02 -07:00
dm-target.c dm: allocate requests in target when stacking on blk-mq devices 2015-02-09 13:06:47 -05:00
dm-thin-metadata.c dm thin metadata: delete btrees when releasing metadata snapshot 2015-08-12 10:42:51 -04:00
dm-thin-metadata.h dm thin metadata: add dm_thin_remove_range() 2015-06-11 17:13:04 -04:00
dm-thin.c dm thin: disable discard support for thin devices if pool's is disabled 2015-09-13 21:32:10 -04:00
dm-uevent.c
dm-uevent.h
dm-verity.c Merge tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm 2015-09-02 16:35:26 -07:00
dm-zero.c block: add a bi_error field to struct bio 2015-07-29 08:55:15 -06:00
dm.c dm: initialize non-blk-mq queue data before queue is used 2015-10-29 22:09:40 -04:00
dm.h block: kill merge_bvec_fn() completely 2015-08-13 12:31:57 -06:00
faulty.c block: add a bi_error field to struct bio 2015-07-29 08:55:15 -06:00
Kconfig SCSI misc on 20150911 2015-09-11 18:15:18 -07:00
linear.c block: kill merge_bvec_fn() completely 2015-08-13 12:31:57 -06:00
linear.h
Makefile dm cache: add stochastic-multi-queue (smq) policy 2015-06-11 17:12:59 -04:00
md-cluster.c md-cluster: remove inappropriate try_module_get from join() 2015-08-31 19:43:17 +02:00
md-cluster.h Fix read-balancing during node failure 2015-07-24 13:37:59 +10:00
md.c md: clear CHANGE_PENDING in readonly array 2015-10-02 17:23:44 +10:00
md.h block: kill merge_bvec_fn() completely 2015-08-13 12:31:57 -06:00
multipath.c md: drop null test before destroy functions 2015-10-02 17:23:44 +10:00
multipath.h
raid0.c md/raid0: apply base queue limits *before* disk_stack_limits 2015-10-02 17:23:44 +10:00
raid0.h block: kill merge_bvec_fn() completely 2015-08-13 12:31:57 -06:00
raid1.c md/raid1: Avoid raid1 resync getting stuck 2015-10-02 17:23:44 +10:00
raid1.h md/raid1: ensure device failure recorded before write request returns. 2015-08-31 19:43:23 +02:00
raid5.c md: drop null test before destroy functions 2015-10-02 17:23:44 +10:00
raid5.h md/raid5: ensure device failure recorded before write request returns. 2015-08-31 19:43:59 +02:00
raid10.c md: drop null test before destroy functions 2015-10-02 17:23:44 +10:00
raid10.h md/raid10: ensure device failure recorded before write request returns. 2015-08-31 19:43:45 +02:00