Commit graph

247 commits

Author SHA1 Message Date
Jan Kara
fd2ef39cc9 blk: Fix lock inversion between ioc lock and bfqd lock
Lockdep complains about lock inversion between ioc->lock and bfqd->lock:

bfqd -> ioc:
 put_io_context+0x33/0x90 -> ioc->lock grabbed
 blk_mq_free_request+0x51/0x140
 blk_put_request+0xe/0x10
 blk_attempt_req_merge+0x1d/0x30
 elv_attempt_insert_merge+0x56/0xa0
 blk_mq_sched_try_insert_merge+0x4b/0x60
 bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed
 blk_mq_sched_insert_requests+0xd6/0x2b0
 blk_mq_flush_plug_list+0x154/0x280
 blk_finish_plug+0x40/0x60
 ext4_writepages+0x696/0x1320
 do_writepages+0x1c/0x80
 __filemap_fdatawrite_range+0xd7/0x120
 sync_file_range+0xac/0xf0

ioc->bfqd:
 bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed
 put_io_context_active+0x78/0xb0 -> ioc->lock grabbed
 exit_io_context+0x48/0x50
 do_exit+0x7e9/0xdd0
 do_group_exit+0x54/0xc0

To avoid this inversion we change blk_mq_sched_try_insert_merge() to not
free the merged request but rather leave that upto the caller similarly
to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure
to free all the merged requests after dropping bfqd->lock.

Fixes: aee69d78de ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-24 18:43:55 -06:00
Christoph Hellwig
2bc8cda5ea block: add the events* attributes to disk_attrs
Add the events attributes to the disk_attrs array, which ensures they are
added by the driver core when the device is created rather than adding
them after the device has been added, which is racy versus uevents and
requires more boilerplate code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210624073843.251178-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-24 12:00:22 -06:00
Christoph Hellwig
d5870edfa3 block: move the disk events code to a separate file
Move the code for handling disk events from genhd.c into a new file
as it isn't very related to the rest of the file while at the same
time requiring lots of forward declarations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210624073843.251178-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-24 12:00:22 -06:00
Christoph Hellwig
26a9750aa8 blk-mq: improve the blk_mq_init_allocated_queue interface
Don't return the passed in request_queue but a normal error code, and
drop the elevator_init argument in favor of just calling elevator_init_mq
directly from dm-rq.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210602065345.355274-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-11 11:53:02 -06:00
Christoph Hellwig
da7ba72960 block: unexport blk_alloc_queue
blk_alloc_queue is just an internal helper now, unexport it and remove
it from the public header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20210521055116.1053587-27-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-01 07:42:24 -06:00
Christoph Hellwig
7c3f828b52 block: refactor device number setup in __device_add_disk
Untangle the mess around blk_alloc_devt by moving the check for
the used allocation scheme into the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20210521055116.1053587-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-01 07:42:22 -06:00
Christoph Hellwig
d3c4a43d92 block: refactor blk_drop_partitions
Move the busy check and disk-wide sync into the only caller, so that
the remainder can be shared with del_gendisk.  Also pass the gendisk
instead of the bdev as that is all that is needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210406062303.811835-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08 10:24:36 -06:00
Christoph Hellwig
9bb33f24ab block: refactor the bounce buffering code
Get rid of all the PFN arithmetics and just use an enum for the two
remaining options, and use PageHighMem for the actual bounce decision.

Add a fast path to entirely avoid the call for the common case of a queue
not using the legacy bouncing code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210331073001.46776-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-06 09:28:17 -06:00
Christoph Hellwig
ce288e0535 block: remove BLK_BOUNCE_ISA support
Remove the BLK_BOUNCE_ISA support now that all users are gone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210331073001.46776-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-06 09:28:17 -06:00
Damien Le Moal
508aebb805 block: introduce blk_queue_clear_zone_settings()
Introduce the internal function blk_queue_clear_zone_settings() to
cleanup all limits and resources related to zoned block devices. This
new function is called from blk_queue_set_zoned() when a disk zoned
model is set to BLK_ZONED_NONE. This particular case can happens when a
partition is created on a host-aware scsi disk.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:44:41 -07:00
Christoph Hellwig
7a800a20ae block: use bi_max_vecs to find the bvec pool
Instead of encoding of the bvec pool using magic bio flags, just use
a helper to find the pool based on the max_vecs value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-08 08:33:16 -07:00
Christoph Hellwig
dc0b8a57ad block: reuse BIO_INLINE_VECS for integrity bvecs
bvec_alloc always uses biovec_slabs, and thus always needs to use the
same number of inline vecs.  Share a single definition for the data
and integrity bvecs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-08 08:33:15 -07:00
Ming Lei
eec716a1c1 block: move three bvec helpers declaration into private helper
bvec_alloc(), bvec_free() and bvec_nr_vecs() are only used inside block
layer core functions, no need to declare them in public header.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24 21:24:06 -07:00
Christoph Hellwig
a33df75c63 block: use an xarray for disk->part_tbl
Now that no fast path lookups in the partition table are left, there is
no point in micro-optimizing the data structure for it.  Just use a bog
standard xarray.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24 18:17:20 -07:00
Christoph Hellwig
0b6e522cdc blk-mq: use ->bi_bdev for I/O accounting
Remove the reverse map from a sector to a partition for I/O accounting by
simply using ->bi_bdev.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24 18:17:20 -07:00
Christoph Hellwig
309dca309f block: store a block_device pointer in struct bio
Replace the gendisk pointer in struct bio with a pointer to the newly
improved struct block device.  From that the gendisk can be trivially
accessed with an extra indirection, but it also allows to directly
look up all information related to partition remapping.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24 18:17:20 -07:00
Linus Torvalds
69f637c335 for-5.11/drivers-2020-12-14
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/XgdYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjTBD/4me2TNvGOogbcL0b1leAotndJ7spI/IcFM
 NUMNy3pOGuRBcRjwle85xq44puAjlNkZE2LLatem5sT7ZvS+8lPNnOIoTYgfaCjt
 PhKx2sKlLumVm3BwymYAPcPtke4fikGG15Mwu5nX1oOehmyGrjObGAr3Lo6gexCT
 tQoCOczVqaTsV+iTXrLlmgEgs07J9Tm93uh2cNR8Jgroxb8ivuWeUq4YgbV4kWk+
 Y8XvOyVE/yba0vQf5/hHtWuVoC6RdELnqZ6NCkcP/EicdBecwk1GMJAej1S3zPS1
 0BT7GSFTpm3YUHcygD6LRmRg4I/BmWDTDtMi84+jLat6VvSG1HwIm//qHiCJh3ku
 SlvFZENIWAv5LP92x2vlR5Lt7uE3GK2V/5Pxt2fekyzCth6mzu+hLH4CBPQ3xgyd
 E1JqIQ/ilbXstp+EYoivV5x8yltZQnKEZRopws0EOqj1LsmDPj9XT1wzE9RnB0o+
 PWu/DNhQFhhcmP7Z8uLgPiKIVpyGs+vjxiJLlTtGDFTCy6M5JbcgzGkEkSmnybxH
 7lSanjpLt1dWj85FBMc6fNtJkv2rBPfb4+j0d1kZ45Dzcr4umirGIh7wtCHcgc83
 brmXSt29hlKHseSHMMuNWK8haXcgAE7gq9tD8GZ/kzM7+vkmLLxHJa22Qhq5rp4w
 URPeaBaQJw==
 =ayp2
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "Nothing major in here:

   - NVMe pull request from Christoph:
        - nvmet passthrough improvements (Chaitanya Kulkarni)
        - fcloop error injection support (James Smart)
        - read-only support for zoned namespaces without Zone Append
          (Javier González)
        - improve some error message (Minwoo Im)
        - reject I/O to offline fabrics namespaces (Victor Gladkov)
        - PCI queue allocation cleanups (Niklas Schnelle)
        - remove an unused allocation in nvmet (Amit Engel)
        - a Kconfig spelling fix (Colin Ian King)
        - nvme_req_qid simplication (Baolin Wang)

   - MD pull request from Song:
        - Fix race condition in md_ioctl() (Dae R. Jeong)
        - Initialize read_slot properly for raid10 (Kevin Vigor)
        - Code cleanup (Pankaj Gupta)
        - md-cluster resync/reshape fix (Zhao Heming)

   - Move null_blk into its own directory (Damien Le Moal)

   - null_blk zone and discard improvements (Damien Le Moal)

   - bcache race fix (Dongsheng Yang)

   - Set of rnbd fixes/improvements (Gioh Kim, Guoqing Jiang, Jack Wang,
     Lutz Pogrell, Md Haris Iqbal)

   - lightnvm NULL pointer deref fix (tangzhenhao)

   - sr in_interrupt() removal (Sebastian Andrzej Siewior)

   - FC endpoint security support for s390/dasd (Jan Höppner, Sebastian
     Ott, Vineeth Vijayan). From the s390 arch guys, arch bits included
     as it made it easier for them to funnel the feature through the
     block driver tree.

   - Follow up fixes (Colin Ian King)"

* tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block: (64 commits)
  block: drop dead assignments in loop_init()
  sr: Remove in_interrupt() usage in sr_init_command().
  sr: Switch the sector size back to 2048 if sr_read_sector() changed it.
  cdrom: Reset sector_size back it is not 2048.
  drivers/lightnvm: fix a null-ptr-deref bug in pblk-core.c
  null_blk: Move driver into its own directory
  null_blk: Allow controlling max_hw_sectors limit
  null_blk: discard zones on reset
  null_blk: cleanup discard handling
  null_blk: Improve implicit zone close
  null_blk: improve zone locking
  block: Align max_hw_sectors to logical blocksize
  null_blk: Fail zone append to conventional zones
  null_blk: Fix zone size initialization
  bcache: fix race between setting bdev state to none and new write request direct to backing
  block/rnbd: fix a null pointer dereference on dev->blk_symlink_name
  block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name
  block/rnbd: call kobject_put in the failure path
  Documentation/ABI/rnbd-srv: add document for force_close
  block/rnbd-srv: close a mapped device from server side.
  ...
2020-12-16 13:09:32 -08:00
Ming Lei
7aa390ec2d Revert "block: Fix a lockdep complaint triggered by request queue flushing"
This reverts commit b3c6a59975.

Now we can avoid nvme-loop lockdep warning of 'lockdep possible recursive locking'
by nvme-loop's lock class, no need to apply dynamically allocated lock class key,
so revert commit b3c6a5997541("block: Fix a lockdep complaint triggered by request
queue flushing").

This way fixes horrible SCSI probe delay issue on megaraid_sas, and it is reported
the whole probe may take more than half an hour.

Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reported-by: Qian Cai <cai@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-07 20:30:19 -07:00
Christoph Hellwig
0d02129e76 block: merge struct block_device and struct hd_struct
Instead of having two structures that represent each block device with
different life time rules, merge them into a single one.  This also
greatly simplifies the reference counting rules, as we can use the inode
reference count as the main reference count for the new struct
block_device, with the device model reference front ending it for device
model interaction.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
9fc995a6e0 block: pass a block_device to blk_alloc_devt
Pass the block_device actually needed instead of the hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
8446fe9255 block: switch partition lookup to use struct block_device
Use struct block_device to lookup partitions on a disk.  This removes
all usage of struct hd_struct from the I/O path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de>			[bcache]
Acked-by: Chao Yu <yuchao0@huawei.com>			[f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
cb8432d650 block: allocate struct hd_struct as part of struct bdev_inode
Allocate hd_struct together with struct block_device to pre-load
the lifetime rule changes in preparation of merging the two structures.

Note that part0 was previously embedded into struct gendisk, but is
a separate allocation now, and already points to the block_device instead
of the hd_struct.  The lifetime of struct gendisk is still controlled by
the struct device embedded in the part0 hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
231926dbf0 block: move the partition_meta_info to struct block_device
Move the partition_meta_info to struct block_device in preparation for
killing struct hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
15e3d2c5cd block: move disk stat accounting to struct block_device
Move the dkstats and stamp field to struct block_device in preparation
of killing struct hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
a782483cc1 block: remove the nr_sects field in struct hd_struct
Now that the hd_struct always has a block device attached to it, there is
no need for having two size field that just get out of sync.

Additionally the field in hd_struct did not use proper serialization,
possibly allowing for torn writes.  By only using the block_device field
this problem also gets fixed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de>			[bcache]
Acked-by: Chao Yu <yuchao0@huawei.com>			[f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
22ae8ce8b8 block: simplify bdev/disk lookup in blkdev_get
To simplify block device lookup and a few other upcoming areas, make sure
that we always have a struct block_device available for each disk and
each partition, and only find existing block devices in bdget.  The only
downside of this is that each device and partition uses a little more
memory.  The upside will be that a lot of code can be simplified.

With that all we need to look up the block device is to lookup the inode
and do a few sanity checks on the gendisk, instead of the separate lookup
for the gendisk.  For blk-cgroup which wants to access a gendisk without
opening it, a new blkdev_{get,put}_no_open low-level interface is added
to replace the previous get_gendisk use.

Note that the change to look up block device directly instead of the two
step lookup using struct gendisk causes a subtile change in behavior:
accessing a non-existing partition on an existing block device can now
cause a call to request_module.  That call is harmless, and in practice
no recent system will access these nodes as they aren't created by udev
and static /dev/ setups are unusual.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Chaitanya Kulkarni
53ffabfd4d block: move blk_rq_bio_prep() to linux/blk-mq.h
This is a preparation patch to have minimal block layer request bio
append functionality in the context of the NVMeOF Passthru driver which
falls in the fast path and doesn't need calls from blk_rq_append_bio().

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-12-01 20:36:35 +01:00
Christoph Hellwig
eda5cc997a block: move blk_mq_sched_try_merge to blk-merge.c
Move blk_mq_sched_try_merge to blk-merge.c, which allows to mark
a lot of the merge infrastructure static there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-06 07:29:53 -06:00
Christoph Hellwig
d59da41998 block: remove the unused blk_integrity_merge_bio export
Also move the definition from the public blkdev.h to the private
block/blk.h header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-06 07:29:53 -06:00
Christoph Hellwig
92cf2fd156 block: remove the unused blk_integrity_merge_rq export
Also move the definition from the public blkdev.h to the private
block/blk.h header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-06 07:29:53 -06:00
Christoph Hellwig
8328eb2836 block: remove the disk argument to delete_partition
We can trivially derive the gendisk from the hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:25 -06:00
Baolin Wang
7d7ca7c526 block: Add a new helper to attempt to merge a bio
There are lots of duplicated code when trying to merge a bio from
plug list and sw queue, we can introduce a new helper to attempt
to merge a bio, which can simplify the blk_bio_list_merge()
and blk_attempt_plug_merge().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:26 -06:00
Baolin Wang
bdc6a287bc block: Move blk_mq_bio_list_merge() into blk-merge.c
Move the blk_mq_bio_list_merge() into blk-merge.c and
rename it as a generic name.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:26 -06:00
Coly Li
9b15d109a6 block: improve discard bio alignment in __blkdev_issue_discard()
This patch improves discard bio split for address and size alignment in
__blkdev_issue_discard(). The aligned discard bio may help underlying
device controller to perform better discard and internal garbage
collection, and avoid unnecessary internal fragment.

Current discard bio split algorithm in __blkdev_issue_discard() may have
non-discarded fregment on device even the discard bio LBA and size are
both aligned to device's discard granularity size.

Here is the example steps on how to reproduce the above problem.
- On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
  with thin mode and give it to a Linux virtual machine.
- Inside the Linux virtual machine, if the 50GB virtual disk shows up as
  /dev/sdb, fill data into the first 50GB by,
        # dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
- Discard the 50GB range from offset 0 on /dev/sdb,
        # blkdiscard /dev/sdb -o 0 -l 53687091200
- Observe the underlying mapping status of the device
        # sg_get_lba_status /dev/sdb -m 1048 --lba=0
  descriptor LBA: 0x0000000000000000  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000000000800  blocks: 16773120  deallocated
  descriptor LBA: 0x0000000000fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000001000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000017ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000001800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000001fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000002000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000027ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000002800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000002fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000003000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000037ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000003800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000003fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000004000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000047ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000004800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000004fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000005000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000057ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000005800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000005fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000006000000  blocks: 6291456  deallocated
  descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

Although the discard bio starts at LBA 0 and has 50<<30 bytes size which
are perfect aligned to the discard granularity, from the above list
these are many 1MB (2048 sectors) internal fragments exist unexpectedly.

The problem is in __blkdev_issue_discard(), an improper algorithm causes
an improper bio size which is not aligned.

 25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 26                 sector_t nr_sects, gfp_t gfp_mask, int flags,
 27                 struct bio **biop)
 28 {
 29         struct request_queue *q = bdev_get_queue(bdev);
   [snipped]
 56
 57         while (nr_sects) {
 58                 sector_t req_sects = min_t(sector_t, nr_sects,
 59                                 bio_allowed_max_sectors(q));
 60
 61                 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
 62
 63                 bio = blk_next_bio(bio, 0, gfp_mask);
 64                 bio->bi_iter.bi_sector = sector;
 65                 bio_set_dev(bio, bdev);
 66                 bio_set_op_attrs(bio, op, 0);
 67
 68                 bio->bi_iter.bi_size = req_sects << 9;
 69                 sector += req_sects;
 70                 nr_sects -= req_sects;
   [snipped]
 79         }
 80
 81         *biop = bio;
 82         return 0;
 83 }
 84 EXPORT_SYMBOL(__blkdev_issue_discard);

At line 58-59, to discard a 50GB range, req_sects is set as return value
of bio_allowed_max_sectors(q), which is 8388607 sectors. In the above
case, the discard granularity is 2048 sectors, although the start LBA
and discard length are aligned to discard granularity, req_sects never
has chance to be aligned to discard granularity. This is why there are
some still-mapped 2048 sectors fragment in every 4 or 8 GB range.

If req_sects at line 58 is set to a value aligned to discard_granularity
and close to UNIT_MAX, then all consequent split bios inside device
driver are (almostly) aligned to discard_granularity of the device
queue. The 2048 sectors still-mapped fragment will disappear.

This patch introduces bio_aligned_discard_max_sectors() to return the
the value which is aligned to q->limits.discard_granularity and closest
to UINT_MAX. Then this patch replaces bio_allowed_max_sectors() with
this new routine to decide a more proper split bio length.

But we still need to handle the situation when discard start LBA is not
aligned to q->limits.discard_granularity, otherwise even the length is
aligned, current code may still leave 2048 fragment around every 4GB
range. Therefore, to calculate req_sects, firstly the start LBA of
discard range is checked (including partition offset), if it is not
aligned to discard granularity, the first split location should make
sure following bio has bi_sector aligned to discard granularity. Then
there won't be still-mapped fragment in the middle of the discard range.

The above is how this patch improves discard bio alignment in
__blkdev_issue_discard(). Now with this patch, after discard with same
command line mentiond previously, sg_get_lba_status returns,
descriptor LBA: 0x0000000000000000  blocks: 106954752  deallocated
descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

We an see there is no 2048 sectors segment anymore, everything is clean.

Reported-and-tested-by: Acshai Manoj <acshai.manoj@microfocus.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Enzo Matsumiya <ematsumiya@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-17 07:15:10 -06:00
Ming Lei
568f270065 blk-mq: centralise related handling into blk_mq_get_driver_tag
Move .nr_active update and request assignment into blk_mq_get_driver_tag(),
all are good to do during getting driver tag.

Meantime blk-flush related code is simplified and flush request needn't
to update the request table manually any more.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08 16:06:42 -06:00
Jens Axboe
4e2f62e566 Revert "blk-mq: put driver tag when this request is completed"
This reverts commits the following commits:

	37f4a24c24
	723bf178f1
	36a3df5a45

The last one is the culprit, but we have to go a bit deeper to get this
to revert cleanly. There's been a report that this breaks some MMC
setups [1], and also causes an issue with swap [2]. Until this can be
figured out, revert the offending commits.

[1] https://lore.kernel.org/linux-block/57fb09b1-54ba-f3aa-f82c-d709b0e6b281@samsung.com/
[2] https://lore.kernel.org/linux-block/20200702043721.GA1087@lca.pw/

Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 22:58:32 -06:00
Christoph Hellwig
c62b37d96b block: move ->make_request_fn to struct block_device_operations
The make_request_fn is a little weird in that it sits directly in
struct request_queue instead of an operation vector.  Replace it with
a block_device_operations method called submit_bio (which describes much
better what it does).  Also remove the request_queue argument to it, as
the queue can be derived pretty trivially from the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:24 -06:00
Christoph Hellwig
f695ca3886 block: remove the request_queue argument from blk_queue_split
The queue can be trivially derived from the bio, so pass one less
argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:23 -06:00
Ming Lei
37f4a24c24 blk-mq: centralise related handling into blk_mq_get_driver_tag
Move .nr_active update and request assignment into blk_mq_get_driver_tag(),
all are good to do during getting driver tag.

Meantime blk-flush related code is simplified and flush request needn't
to update the request table manually any more.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 12:57:59 -06:00
Christoph Hellwig
db18a53e5b blk-cgroup: remove blkcg_bio_issue_check
blkcg_bio_issue_check is a giant inline function that does three entirely
different things.  Factor out the blk-cgroup related bio initalization
into a new helper, and the open code the sequence in the only caller,
relying on the fact that all the actual functionality is stubbed out for
non-cgroup builds.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-29 09:09:08 -06:00
Luis Chamberlain
85e0cbbb8a block: create the request_queue debugfs_dir on registration
We were only creating the request_queue debugfs_dir only
for make_request block drivers (multiqueue), but never for
request-based block drivers. We did this as we were only
creating non-blktrace additional debugfs files on that directory
for make_request drivers. However, since blktrace *always* creates
that directory anyway, we special-case the use of that directory
on blktrace. Other than this being an eye-sore, this exposes
request-based block drivers to the same debugfs fragile
race that used to exist with make_request block drivers
where if we start adding files onto that directory we can later
run a race with a double removal of dentries on the directory
if we don't deal with this carefully on blktrace.

Instead, just simplify things by always creating the request_queue
debugfs_dir on request_queue registration. Rename the mutex also to
reflect the fact that this is used outside of the blktrace context.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:58 -06:00
Christoph Hellwig
15f73f5b3e blk-mq: move failure injection out of blk_mq_complete_request
Move the call to blk_should_fake_timeout out of blk_mq_complete_request
and into the drivers, skipping call sites that are obvious error
handlers, and remove the now superflous blk_mq_force_complete_rq helper.
This ensures we don't keep injecting errors into completions that just
terminate the Linux request after the hardware has been reset or the
command has been aborted.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:57 -06:00
Ahmed S. Darwish
15b81ce5ab block: nr_sects_write(): Disable preemption on seqcount write
For optimized block readers not holding a mutex, the "number of sectors"
64-bit value is protected from tearing on 32-bit architectures by a
sequence counter.

Disable preemption before entering that sequence counter's write side
critical section. Otherwise, the read side can preempt the write side
section and spin for the entire scheduler tick. If the reader belongs to
a real-time scheduling class, it can spin forever and the kernel will
livelock.

Fixes: c83f6bf98d ("block: add partition resize function to blkpg ioctl")
Cc: <stable@vger.kernel.org>
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04 21:22:28 -06:00
Guoqing Jiang
b77412372b blk-throttle: remove blk_throtl_drain
After the commit 5addeae1be ("blk-cgroup: remove blkcg_drain_queue"),
there is no caller of blk_throtl_drain, so let's remove it.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29 16:30:39 -06:00
Konstantin Khlebnikov
b5af37ab3a block: add a blk_account_io_merge_bio helper
Move the non-"new_io" branch of blk_account_io_start() into separate
function.  Fix merge accounting for discards (they were counted as write
merges).

The new blk_account_io_merge_bio() doesn't call update_io_ticks() unlike
blk_account_io_start(), as there is no reason for that.

[hch: rebased]

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:21:23 -06:00
Christoph Hellwig
58d4f14fc3 block: always use a percpu variable for disk stats
percpu variables have a perfectly fine working stub implementation
for UP kernels, so use that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:21:23 -06:00
Christoph Hellwig
9123bf6f21 block: move update_io_ticks to blk-core.c
All callers are in blk-core.c, so move update_io_ticks over.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:21:23 -06:00
Baolin Wang
172ce41db4 block: Remove unused flush_queue_delayed in struct blk_flush_queue
The flush_queue_delayed was introdued to hold queue if flush is
running for non-queueable flush drive by commit 3ac0cc4508
("hold queue if flush is running for non-queueable flush drive"),
but the non mq parts of the flush code had been removed by
commit 7e992f847a ("block: remove non mq parts from the flush code"),
as well as removing the usage of the flush_queue_delayed flag.
Thus remove the unused flush_queue_delayed flag.

Signed-off-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-19 09:42:46 -06:00
Christoph Hellwig
10ec5e86f9 block: merge part_{inc,dev}_in_flight into their only callers
part_inc_in_flight and part_dec_in_flight only have one caller each, and
those callers are purely for bio based drivers.  Merge each function into
the only caller, and remove the superflous blk-mq checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-19 09:35:24 -06:00
Christoph Hellwig
f1394b7988 block: mark blk_account_io_completion static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-19 09:35:24 -06:00