linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-10-01 14:44:12 +00:00

Author	SHA1	Message	Date
Tejun Heo	e11d80a849	blk-stat: make q->stats->lock irqsafe blk-iocost calls blk_stat_enable_accounting() while holding an irqsafe lock which triggers a lockdep splat because q->stats->lock isn't irqsafe. Let's make it irqsafe. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `cd006509b0` ("blk-iocost: account for IO size when testing latencies") Cc: stable@vger.kernel.org # v5.8+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-01 16:48:46 -06:00
Tejun Heo	5aeac7c4b1	blk-iocost: ioc_pd_free() shouldn't assume irq disabled ioc_pd_free() grabs irq-safe ioc->lock without ensuring that irq is disabled when it can be called with irq disabled or enabled. This has a small chance of causing A-A deadlocks and triggers lockdep splats. Use irqsave operations instead. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `7caa47151a` ("blkcg: implement blk-iocost") Cc: stable@vger.kernel.org # v5.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-01 16:48:44 -06:00
Christoph Hellwig	08fc1ab6d7	block: fix locking in bdev_del_partition We need to hold the whole device bd_mutex to protect against other thread concurrently deleting out partition before we get to it, and thus causing a use after free. Fixes: `cddae808ae` ("block: pass a hd_struct to delete_partition") Reported-by: syzbot+6448f3c229bc52b82f69@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-01 08:35:35 -06:00
Ming Lei	cafe01ef8f	block: release disk reference in hd_struct_free_work Commit `e8c7d14ac6` ("block: revert back to synchronous request_queue removal") stops to release request queue from wq context because that commit supposed all blk_put_queue() is called in context which is allowed to sleep. However, this assumption isn't true because we release disk's reference in partition's percpu_ref's ->release() which doesn't allow to sleep, because the ->release() is run via call_rcu(). Fixes this issue by moving put disk reference into hd_struct_free_work() Fixes: `e8c7d14ac6` ("block: revert back to synchronous request_queue removal") Reported-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-01 08:34:08 -06:00
Jens Axboe	de1b0ee490	block: ensure bdi->io_pages is always initialized If a driver leaves the limit settings as the defaults, then we don't initialize bdi->io_pages. This means that file systems may need to work around bdi->io_pages == 0, which is somewhat messy. Initialize the default value just like we do for ->ra_pages. Cc: stable@vger.kernel.org Fixes: `9491ae4aad` ("mm: don't cap request size based on read-ahead setting") Reported-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-01 08:00:14 -06:00
Linus Torvalds	c41c3ec4a2	io_uring-5.9-2020-08-23 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9CwtMQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpsehEAC4ReB53LLbZxqcmoA2RNs9yz1I4DM2PU6z C+NSGGEnAFHQAhLbfCAzxbtQa6x/m64zoLd+8zHZNAeanJXarszcgSuqhXQFlEfX 7Jz/vdXGdu7Q4zgkLuO3FxleDoPoUC5qOSFHWYtMu6KvHLOkmc9DvdSUsFMDSThX 6RsoaQY2gDOD/pwtm8Cqmy89nLZdFoyxadXyk/lzxLodjeRZOwoVc+YM8YWmrXZ0 mKEEuO4uBWxUUmoyAwUABNqWWAkwTDEhrYCiiG81DkAa1Cu0mRXodN0xycr72cLZ Ik2OlnTLCE6B0UXsBu2c0+qXGArWsvDyhEEkwF+O+Ump4IBIr72EmgZb+o2nnkXo Uu4X/r0qeQ6XD+vBTHcE6oPUjJhV6uEXXon5aesE+vh277ILmHgMyjJKaSiJcY/E efM5SuPRq2kuROKWLKiLJnpuJ/9ZTU/4nk4k1pOlWWOVGLHien0sWBBzQ+iWr6mm eRl5EkI3JoahqIrNFz0+qF3DwKPVfu+B02/EzA8OXoYHIRV9KMS5eWX5hK12aZ3i 4AT3xuAanfcNs4qBAScOfHQxQu9U5Z7Mu4JQJ58xdsJd+UWBnbznUmSLob9KKk+c X8AvAcYhb684F87VCmaCzDlIPMb46OYxLBgI6sz7L0xdc7i8TCeeEDbQCN1HixZ3 SNtKzalNXA== =fAwK -----END PGP SIGNATURE----- Merge tag 'io_uring-5.9-2020-08-23' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - NVMe pull request from Sagi: - nvme completion rework from Christoph and Chao that mostly came from a bit of divergence of how we classify errors related to pathing/retry etc. - nvmet passthru fixes from Chaitanya - minor nvmet fixes from Amit and I - mpath round-robin path selection fix from Martin - ignore noiob for zoned devices from Keith - minor nvme-fc fix from Tianjia" - BFQ cgroup leak fix (Dmitry) - block layer MAINTAINERS addition (Geert) - fix null_blk FUA checking (Hou) - get_max_io_size() size fix (Keith) - fix block page_is_mergeable() for compound pages (Matthew) - discard granularity fixes (Ming) - IO scheduler ordering fix (Ming) - misc fixes * tag 'io_uring-5.9-2020-08-23' of git://git.kernel.dk/linux-block: (31 commits) null_blk: fix passing of REQ_FUA flag in null_handle_rq nvmet: Disable keep-alive timer when kato is cleared to 0h nvme: redirect commands on dying queue nvme: just check the status code type in nvme_is_path_error nvme: refactor command completion nvme: rename and document nvme_end_request nvme: skip noiob for zoned devices nvme-pci: fix PRP pool size nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth nvme: Use spin_lock_irq() when taking the ctrl->lock nvmet: call blk_mq_free_request() directly nvmet: fix oops in pt cmd execution nvmet: add ns tear down label for pt-cmd handling nvme: multipath: round-robin: eliminate "fallback" variable nvme: multipath: round-robin: fix single non-optimized path case nvme-fc: Fix wrong return value in __nvme_fc_init_request() nvmet-passthru: Reject commands with non-sgl flags set nvmet: fix a memory leak blkcg: fix memleak for iolatency MAINTAINERS: Add missing header files to BLOCK LAYER section ...	2020-08-24 11:53:15 -07:00
Gustavo A. R. Silva	df561f6688	treewide: Use fallthrough pseudo-keyword Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>	2020-08-23 17:36:59 -05:00
Yufen Yu	27029b4b18	blkcg: fix memleak for iolatency Normally, blkcg_iolatency_exit() will free related memory in iolatency when cleanup queue. But if blk_throtl_init() return error and queue init fail, blkcg_iolatency_exit() will not do that for us. Then it cause memory leak. Fixes: `d706751215` ("block: introduce blk-iolatency io controller") Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:14:27 -06:00
Keith Busch	e4b469c66f	block: fix get_max_io_size() A previous commit aligning splits to physical block sizes inadvertently modified one return case such that that it now returns 0 length splits when the number of sectors doesn't exceed the physical offset. This later hits a BUG in bio_split(). Restore the previous working behavior. Fixes: `9cc5169cd4` ("block: Improve physical block alignment of split bios") Reported-by: Eric Deal <eric.deal@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Cc: Bart Van Assche <bvanassche@acm.org> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:09:22 -06:00
Ming Lei	db03f88fae	blk-mq: insert request not through ->queue_rq into sw/scheduler queue `c616cbee97` ("blk-mq: punt failed direct issue to dispatch list") supposed to add request which has been through ->queue_rq() to the hw queue dispatch list, however it adds request running out of budget or driver tag to hw queue too. This way basically bypasses request merge, and causes too many request dispatched to LLD, and system% is unnecessary increased. Fixes this issue by adding request not through ->queue_rq into sw/scheduler queue, and this way is safe because no ->queue_rq is called on this request yet. High %system can be observed on Azure storvsc device, and even soft lock is observed. This patch reduces %system during heavy sequential IO, meantime decreases soft lockup risk. Fixes: `c616cbee97` ("blk-mq: punt failed direct issue to dispatch list") Signed-off-by: Ming Lei <ming.lei@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:09:22 -06:00
Dmitry Monakhov	2de791ab49	bfq: fix blkio cgroup leakage v4 Changes from v1: - update commit description with proper ref-accounting justification commit `db37a34c56` ("block, bfq: get a ref to a group when adding it to a service tree") introduce leak forbfq_group and blkcg_gq objects because of get/put imbalance. In fact whole idea of original commit is wrong because bfq_group entity can not dissapear under us because it is referenced by child bfq_queue's entities from here: -> bfq_init_entity() ->bfqg_and_blkg_get(bfqg); ->entity->parent = bfqg->my_entity -> bfq_put_queue(bfqq) FINAL_PUT ->bfqg_and_blkg_put(bfqq_group(bfqq)) ->kmem_cache_free(bfq_pool, bfqq); So parent entity can not disappear while child entity is in tree, and child entities already has proper protection. This patch revert commit `db37a34c56` ("block, bfq: get a ref to a group when adding it to a service tree") bfq_group leak trace caused by bad commit: -> blkg_alloc -> bfq_pq_alloc -> bfqg_get (+1) ->bfq_activate_bfqq ->bfq_activate_requeue_entity -> __bfq_activate_entity ->bfq_get_entity ->bfqg_and_blkg_get (+1) <==== : Note1 ->bfq_del_bfqq_busy ->bfq_deactivate_entity+0x53/0xc0 [bfq] ->__bfq_deactivate_entity+0x1b8/0x210 [bfq] -> bfq_forget_entity(is_in_service = true) entity->on_st_or_in_serv = false <=== :Note2 if (is_in_service) return; ==> do not touch reference -> blkcg_css_offline -> blkcg_destroy_blkgs -> blkg_destroy -> bfq_pd_offline -> __bfq_deactivate_entity if (!entity->on_st_or_in_serv) /* true, because (Note2) return false; -> bfq_pd_free -> bfqg_put() (-1, byt bfqg->ref == 2) because of (Note2) So bfq_group and blkcg_gq will leak forever, see test-case below. ##TESTCASE_BEGIN: #!/bin/bash max_iters=${1:-100} #prep cgroup mounts mount -t tmpfs cgroup_root /sys/fs/cgroup mkdir /sys/fs/cgroup/blkio mount -t cgroup -o blkio none /sys/fs/cgroup/blkio # Prepare blkdev grep blkio /proc/cgroups truncate -s 1M img losetup /dev/loop0 img echo bfq > /sys/block/loop0/queue/scheduler grep blkio /proc/cgroups for ((i=0;i<max_iters;i++)) do mkdir -p /sys/fs/cgroup/blkio/a echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null echo 0 > /sys/fs/cgroup/blkio/cgroup.procs rmdir /sys/fs/cgroup/blkio/a grep blkio /proc/cgroups done ##TESTCASE_END: Fixes: `db37a34c56` ("block, bfq: get a ref to a group when adding it to a service tree") Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-18 07:48:08 -07:00
Matthew Wilcox (Oracle)	d81665198b	block: Fix page_is_mergeable() for compound pages If we pass in an offset which is larger than PAGE_SIZE, then page_is_mergeable() thinks it's not mergeable with the previous bio_vec, leading to a large number of bio_vecs being used. Use a slightly more obvious test that the two pages are compatible with each other. Fixes: `52d52d1c98` ("block: only allow contiguous page structs in a bio_vec") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-17 19:35:53 -07:00
Ming Lei	943b40c832	block: respect queue limit of max discard segment When queue_max_discard_segments(q) is 1, blk_discard_mergable() will return false for discard request, then normal request merge is applied. However, only queue_max_segments() is checked, so max discard segment limit isn't respected. Check max discard segment limit in the request merge code for fixing the issue. Discard request failure of virtio_blk is fixed. Fixes: `6984046608` ("block: fix the DISCARD request merge") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-17 06:59:41 -07:00
Ming Lei	d7d8535f37	blk-mq: order adding requests to hctx->dispatch and checking SCHED_RESTART SCHED_RESTART code path is relied to re-run queue for dispatch requests in hctx->dispatch. Meantime the SCHED_RSTART flag is checked when adding requests to hctx->dispatch. memory barriers have to be used for ordering the following two pair of OPs: 1) adding requests to hctx->dispatch and checking SCHED_RESTART in blk_mq_dispatch_rq_list() 2) clearing SCHED_RESTART and checking if there is request in hctx->dispatch in blk_mq_sched_restart(). Without the added memory barrier, either: 1) blk_mq_sched_restart() may miss requests added to hctx->dispatch meantime blk_mq_dispatch_rq_list() observes SCHED_RESTART, and not run queue in dispatch side or 2) blk_mq_dispatch_rq_list still sees SCHED_RESTART, and not run queue in dispatch side, meantime checking if there is request in hctx->dispatch from blk_mq_sched_restart() is missed. IO hang in ltp/fs_fill test is reported by kernel test robot: https://lkml.org/lkml/2020/7/26/77 Turns out it is caused by the above out-of-order OPs. And the IO hang can't be observed any more after applying this patch. Fixes: `bd166ef183` ("blk-mq-sched: add framework for MQ capable IO schedulers") Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: David Jeffery <djeffery@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-17 06:57:49 -07:00
Xu Wang	03ef5941a0	bsg-lib: convert comma to semicolon Replace a comma between expression statements by a semicolon. Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-16 20:07:12 -07:00
Randy Dunlap	26bfeb2662	block: blk-mq.c: fix @at_head kernel-doc warning Fix a kernel-doc warning in block/blk-mq.c: ../block/blk-mq.c:1844: warning: Function parameter or member 'at_head' not described in 'blk_mq_request_bypass_insert' Fixes: `01e99aeca3` ("blk-mq: insert passthrough request into hctx->dispatch directly") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: André Almeida <andrealmeid@collabora.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ming Lei <ming.lei@redhat.com> Cc: linux-block@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-16 16:50:28 -07:00
Linus Torvalds	4b6c093e21	block-5.9-2020-08-14 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl83DGUQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgplTYEACs5kPPpSVdzVsOeHj0LkfQXiSlli8dbPBo NB2+TIyr9NxJgxn8B4x+5/4DZgJaCoHOeyyzOocXQmvWGwWOTkrfX/OSyQOlRB5z dpzqF0Huhw31MSQEiwA/8lo3omBmat9cMzMa5PJYPghMGfqyQDzVJk1lIX51a1th oE01eBpNNsDK0OTwKrl6Rx2/OuFZnA0P3lQwgPZSLnDM6Hq+xeHTdx2LNSyE2QFv GzYl4dFoXg3NReLv9D57b7hE6Dc95NcCDDeU7Y3cE7XPksKMA/TkVYOD20ysJ31l 9uzscvvcm2UugN2r0d/B35lf6NWmOG24SmkLMKTtExPGHOCQIbDAlSP/QQ4zz9pQ 2yA+eImpQnRsCzPbGcnBzwEF3yX5+lQYmFWac+0AHDiWEWkb8e3MzNSWPZrsN+cD 7U7c5Zw6zDEtl/naJccuZZPgQGbZgFJ/P6Wo6l5ywIPtE7wzv4MUe4eUxZhitL9M 0ZP6WIQd8oNQdNoCYVQDwPdYJYMq7uUQFUo40vaSfntZxVKZQao7cvUHwmzVzNlZ v5UazETAx+4Eg6MNwfjKp+kt3rr6Xul7K9Nzn6R/cVacIU349FovUshm7WieoAUu niZ40gXltxj7NDwHj3p/dqesW5Nhv/qk6hlVWoi9vdmh8vAVBy/fedQfocvKrFJy prCI1h1UOQ== =10Pr -----END PGP SIGNATURE----- Merge tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "A few fixes on the block side of things: - Discard granularity fix (Coly) - rnbd cleanups (Guoqing) - md error handling fix (Dan) - md sysfs fix (Junxiao) - Fix flush request accounting, which caused an IO slowdown for some configurations (Ming) - Properly propagate loop flag for partition scanning (Lennart)" * tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block: block: fix double account of flush request's driver tag loop: unset GENHD_FL_NO_PART_SCAN on LOOP_CONFIGURE rnbd: no need to set bi_end_io in rnbd_bio_map_kern rnbd: remove rnbd_dev_submit_io md-cluster: Fix potential error pointer dereference in resize_bitmaps() block: check queue's limits.discard_granularity in __blkdev_issue_discard() md: get sysfs entry after redundancy attr group create	2020-08-15 20:36:42 -07:00
Ming Lei	c1e2b8422b	block: fix double account of flush request's driver tag In case of none scheduler, we share data request's driver tag for flush request, so have to mark the flush request as INFLIGHT for avoiding double account of this driver tag. Fixes: `568f270065` ("blk-mq: centralise related handling into blk_mq_get_driver_tag") Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-11 13:53:32 -06:00
Linus Torvalds	97d052ea3f	A set of locking fixes and updates: - Untangle the header spaghetti which causes build failures in various situations caused by the lockdep additions to seqcount to validate that the write side critical sections are non-preemptible. - The seqcount associated lock debug addons which were blocked by the above fallout. seqcount writers contrary to seqlock writers must be externally serialized, which usually happens via locking - except for strict per CPU seqcounts. As the lock is not part of the seqcount, lockdep cannot validate that the lock is held. This new debug mechanism adds the concept of associated locks. sequence count has now lock type variants and corresponding initializers which take a pointer to the associated lock used for writer serialization. If lockdep is enabled the pointer is stored and write_seqcount_begin() has a lockdep assertion to validate that the lock is held. Aside of the type and the initializer no other code changes are required at the seqcount usage sites. The rest of the seqcount API is unchanged and determines the type at compile time with the help of _Generic which is possible now that the minimal GCC version has been moved up. Adding this lockdep coverage unearthed a handful of seqcount bugs which have been addressed already independent of this. While generaly useful this comes with a Trojan Horse twist: On RT kernels the write side critical section can become preemtible if the writers are serialized by an associated lock, which leads to the well known reader preempts writer livelock. RT prevents this by storing the associated lock pointer independent of lockdep in the seqcount and changing the reader side to block on the lock when a reader detects that a writer is in the write side critical section. - Conversion of seqcount usage sites to associated types and initializers. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8xmPYTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoTuQEACyzQCjU8PgehPp9oMqWzaX2fcVyuZO QU2yw6gmz2oTz3ZHUNwdW8UnzGh2OWosK3kDruoD9FtSS51lER1/ISfSPCGfyqxC KTjOcB1Kvxwq/3LcCx7Zi3ZxWApat74qs3EhYhKtEiQ2Y9xv9rLq8VV1UWAwyxq0 eHpjlIJ6b6rbt+ARslaB7drnccOsdK+W/roNj4kfyt+gezjBfojGRdMGQNMFcpnv shuTC+vYurAVIiVA/0IuizgHfwZiXOtVpjVoEWaxg6bBH6HNuYMYzdSa/YrlDkZs n/aBI/Xkvx+Eacu8b1Zwmbzs5EnikUK/2dMqbzXKUZK61eV4hX5c2xrnr1yGWKTs F/juh69Squ7X6VZyKVgJ9RIccVueqwR2EprXWgH3+RMice5kjnXH4zURp0GHALxa DFPfB6fawcH3Ps87kcRFvjgm6FBo0hJ1AxmsW1dY4ACFB9azFa2euW+AARDzHOy2 VRsUdhL9CGwtPjXcZ/9Rhej6fZLGBXKr8uq5QiMuvttp4b6+j9FEfBgD4S6h8csl AT2c2I9LcbWqyUM9P4S7zY/YgOZw88vHRuDH7tEBdIeoiHfrbSBU7EQ9jlAKq/59 f+Htu2Io281c005g7DEeuCYvpzSYnJnAitj5Lmp/kzk2Wn3utY1uIAVszqwf95Ul 81ppn2KlvzUK8g== =7Gj+ -----END PGP SIGNATURE----- Merge tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Thomas Gleixner: "A set of locking fixes and updates: - Untangle the header spaghetti which causes build failures in various situations caused by the lockdep additions to seqcount to validate that the write side critical sections are non-preemptible. - The seqcount associated lock debug addons which were blocked by the above fallout. seqcount writers contrary to seqlock writers must be externally serialized, which usually happens via locking - except for strict per CPU seqcounts. As the lock is not part of the seqcount, lockdep cannot validate that the lock is held. This new debug mechanism adds the concept of associated locks. sequence count has now lock type variants and corresponding initializers which take a pointer to the associated lock used for writer serialization. If lockdep is enabled the pointer is stored and write_seqcount_begin() has a lockdep assertion to validate that the lock is held. Aside of the type and the initializer no other code changes are required at the seqcount usage sites. The rest of the seqcount API is unchanged and determines the type at compile time with the help of _Generic which is possible now that the minimal GCC version has been moved up. Adding this lockdep coverage unearthed a handful of seqcount bugs which have been addressed already independent of this. While generally useful this comes with a Trojan Horse twist: On RT kernels the write side critical section can become preemtible if the writers are serialized by an associated lock, which leads to the well known reader preempts writer livelock. RT prevents this by storing the associated lock pointer independent of lockdep in the seqcount and changing the reader side to block on the lock when a reader detects that a writer is in the write side critical section. - Conversion of seqcount usage sites to associated types and initializers" * tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits) locking/seqlock, headers: Untangle the spaghetti monster locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header x86/headers: Remove APIC headers from <asm/smp.h> seqcount: More consistent seqprop names seqcount: Compress SEQCNT_LOCKNAME_ZERO() seqlock: Fold seqcount_LOCKNAME_init() definition seqlock: Fold seqcount_LOCKNAME_t definition seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g hrtimer: Use sequence counter with associated raw spinlock kvm/eventfd: Use sequence counter with associated spinlock userfaultfd: Use sequence counter with associated spinlock NFSv4: Use sequence counter with associated spinlock iocost: Use sequence counter with associated spinlock raid5: Use sequence counter with associated spinlock vfs: Use sequence counter with associated spinlock timekeeping: Use sequence counter with associated raw spinlock xfrm: policy: Use sequence counters with associated lock netfilter: nft_set_rbtree: Use sequence counter with associated rwlock netfilter: conntrack: Use sequence counter with associated spinlock sched: tasks: Use sequence counter with associated spinlock ...	2020-08-10 19:07:44 -07:00
Linus Torvalds	dfdf16ecfd	SCSI misc on 20200806 This series consists of the usual driver updates (ufs, qla2xxx, tcmu, lpfc, hpsa, zfcp, scsi_debug) and minor bug fixes. We also have a huge docbook fix update like most other subsystems and no major update to the core (the few non trivial updates are either minor fixes or removing an unused feature [scsi_sdb_cache]). Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com> -----BEGIN PGP SIGNATURE----- iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXyxq1yYcamFtZXMuYm90 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishSoAAQChZ4i8 ZqYW3pL33JO3fA8vdjvLuyC489Hj4wzIsl3/bQEAxYyM6BSLvMoLWR2Plq/JmTLm 4W/LDptarpTiDI3NuDc= =4b0W -----END PGP SIGNATURE----- Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI updates from James Bottomley: "This consists of the usual driver updates (ufs, qla2xxx, tcmu, lpfc, hpsa, zfcp, scsi_debug) and minor bug fixes. We also have a huge docbook fix update like most other subsystems and no major update to the core (the few non trivial updates are either minor fixes or removing an unused feature [scsi_sdb_cache])" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (307 commits) scsi: scsi_transport_srp: Sanitize scsi_target_block/unblock sequences scsi: ufs-mediatek: Apply DELAY_AFTER_LPM quirk to Micron devices scsi: ufs: Introduce device quirk "DELAY_AFTER_LPM" scsi: virtio-scsi: Correctly handle the case where all LUNs are unplugged scsi: scsi_debug: Implement tur_ms_to_ready parameter scsi: scsi_debug: Fix request sense scsi: lpfc: Fix typo in comment for ULP scsi: ufs-mediatek: Prevent LPM operation on undeclared VCC scsi: iscsi: Do not put host in iscsi_set_flashnode_param() scsi: hpsa: Correct ctrl queue depth scsi: target: tcmu: Make TMR notification optional scsi: target: tcmu: Implement tmr_notify callback scsi: target: tcmu: Fix and simplify timeout handling scsi: target: tcmu: Factor out new helper ring_insert_padding scsi: target: tcmu: Do not queue aborted commands scsi: target: tcmu: Use priv pointer in se_cmd scsi: target: Add tmr_notify backend function scsi: target: Modify core_tmr_abort_task() scsi: target: iscsi: Fix inconsistent debug message scsi: target: iscsi: Fix login error when receiving ...	2020-08-06 16:50:07 -07:00
Coly Li	b35fd7422c	block: check queue's limits.discard_granularity in __blkdev_issue_discard() If create a loop device with a backing NVMe SSD, current loop device driver doesn't correctly set its queue's limits.discard_granularity and leaves it as 0. If a discard request at LBA 0 on this loop device, in __blkdev_issue_discard() the calculated req_sects will be 0, and a zero length discard request will trigger a BUG() panic in generic block layer code at block/blk-mq.c:563. [ 955.565006][ C39] ------------[ cut here ]------------ [ 955.559660][ C39] invalid opcode: 0000 [#1] SMP NOPTI [ 955.622171][ C39] CPU: 39 PID: 248 Comm: ksoftirqd/39 Tainted: G E 5.8.0-default+ #40 [ 955.622171][ C39] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE160M-2.70]- 07/17/2020 [ 955.622175][ C39] RIP: 0010:blk_mq_end_request+0x107/0x110 [ 955.622177][ C39] Code: 48 8b 03 e9 59 ff ff ff 48 89 df 5b 5d 41 5c e9 9f ed ff ff 48 8b 35 98 3c f4 00 48 83 c7 10 48 83 c6 19 e8 cb 56 c9 ff eb cb <0f> 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 54 [ 955.622179][ C39] RSP: 0018:ffffb1288701fe28 EFLAGS: 00010202 [ 955.749277][ C39] RAX: 0000000000000001 RBX: ffff956fffba5080 RCX: 0000000000004003 [ 955.749278][ C39] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000 [ 955.749279][ C39] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [ 955.749279][ C39] R10: ffffb1288701fd28 R11: 0000000000000001 R12: ffffffffa8e05160 [ 955.749280][ C39] R13: 0000000000000004 R14: 0000000000000004 R15: ffffffffa7ad3a1e [ 955.749281][ C39] FS: 0000000000000000(0000) GS:ffff95bfbda00000(0000) knlGS:0000000000000000 [ 955.749282][ C39] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 955.749282][ C39] CR2: 00007f6f0ef766a8 CR3: 0000005a37012002 CR4: 00000000007606e0 [ 955.749283][ C39] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 955.749284][ C39] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 955.749284][ C39] PKRU: 55555554 [ 955.749285][ C39] Call Trace: [ 955.749290][ C39] blk_done_softirq+0x99/0xc0 [ 957.550669][ C39] __do_softirq+0xd3/0x45f [ 957.550677][ C39] ? smpboot_thread_fn+0x2f/0x1e0 [ 957.550679][ C39] ? smpboot_thread_fn+0x74/0x1e0 [ 957.550680][ C39] ? smpboot_thread_fn+0x14e/0x1e0 [ 957.550684][ C39] run_ksoftirqd+0x30/0x60 [ 957.550687][ C39] smpboot_thread_fn+0x149/0x1e0 [ 957.886225][ C39] ? sort_range+0x20/0x20 [ 957.886226][ C39] kthread+0x137/0x160 [ 957.886228][ C39] ? kthread_park+0x90/0x90 [ 957.886231][ C39] ret_from_fork+0x22/0x30 [ 959.117120][ C39] ---[ end trace 3dacdac97e2ed164 ]--- This is the procedure to reproduce the panic, # modprobe scsi_debug delay=0 dev_size_mb=2048 max_queue=1 # losetup -f /dev/nvme0n1 --direct-io=on # blkdiscard /dev/loop0 -o 0 -l 0x200 This patch fixes the issue by checking q->limits.discard_granularity in __blkdev_issue_discard() before composing the discard bio. If the value is 0, then prints a warning oops information and returns -EOPNOTSUPP to the caller to indicate that this buggy device driver doesn't support discard request. Fixes: `9b15d109a6` ("block: improve discard bio alignment in __blkdev_issue_discard()") Fixes: `c52abf5630` ("loop: Better discard support for block devices") Reported-and-suggested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Jack Wang <jinpu.wang@cloud.ionos.com> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Enzo Matsumiya <ematsumiya@suse.com> Cc: Evan Green <evgreen@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Xiao Ni <xni@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-05 17:15:47 -06:00
Linus Torvalds	060a72a268	for-5.9/block-merge-20200804 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8pjAEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgphuHEAC5hNAi99HLktAQ7qZy4cBqnGNKCrguFszq Kxiecp3Nrb9EnAPNWYG+QMO0kD9z8quML85beBaJNxN1PlOk9pawqFAd4ziFncFI ruZwIMP+oH/0OmPUA2a4ymrqu+rpyFvfsDL2RKJ9dirAt9fuv9W0RZM5g7Oz83Xi cNdPRn0tOhK0DTPxL4M1/NR2OutSgvKDfA5Et3IrDFl7+bJAEFqmSO8wOSdZtvFp KcR4O/DXnr5Wl6cPvzlvooQze8vGGJkXAyIKaC9cuBm/nlzMCBGG8kE0v3kRJ8Sc uSSFkC+P+OlktY4JwXN+mCacDUdVBiiL/uUs1zel6HmociBgh67mgyJ6AfQtGZry yVl9mj44qWZjAzCODv5KnuxlH+gBacdmjcQqwxsZ2P477gfNkxmBXgHeWdfzO9A/ zTUXaBDXg3VdYxQfD8zTWPkCwXYp+YG3SRb9pfrIWIiYuz2UECZTvl/8Upnacz2B POTf+6vcNDlILCtboVE0mKEYR0ckxqrbs0NQloQdmVOfXNyhLml9OrXmwJIffVtE pZ9g428c5bm44lIOiB2eW+QPsXo0s8GxqIrMtxzKsJ3WgFefwLiVDLJBqEt78jRJ RvpGUxrMLgWFubowH8yDmWV+Fp0NpqcqF+GU45z8nGC3OTS+i0ZvUFYgLM6a2uOf sv4bzDPDBg== =uMth -----END PGP SIGNATURE----- Merge tag 'for-5.9/block-merge-20200804' of git://git.kernel.dk/linux-block Pull block stacking updates from Jens Axboe: "The stacking related fixes depended on both the core block and drivers branches, so here's a topic branch with that change. Outside of that, a late fix from Johannes for zone revalidation" * tag 'for-5.9/block-merge-20200804' of git://git.kernel.dk/linux-block: block: don't do revalidate zones on invalid devices block: remove blk_queue_stack_limits block: remove bdev_stack_limits block: inherit the zoned characteristics in blk_stack_limits	2020-08-05 11:12:34 -07:00
Linus Torvalds	e0fc99e21e	for-5.9/drivers-20200803 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8od3oQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgppkpD/9D+XqD9qYcYTj+ShVCc5+3RtMG5ZiAAX0y l4QXomentn/1Y0UYXFGJH7JLZWrKYT0QiktLtfpe5pmTqRUkckTIyJQlsHb+K6Dz lFjtywRK9pcFYgiWIUg80wlJKrTa8QdnrlS/Esn4YITKGRbgMIdFvq2jymXC+1ho RgodlgzcBUREgHSLo0H3cqEKA53fQiJhKC6CbFrFdrkpf2yUpcTfEDtpSwuIuPj3 2AUed1qXUtNjdHciCn3N37OuHqXKAA9noXAWfg9Gx/5zfGUNX9QJvlsny1AopgS0 jJvPSDVAhu/qRLHW6q/ZOT0JAlHegguuTAOtgMh2cMpAS5sumCAtltxVcI7Qnx41 HalMpTefXsVoBo0gfjqldnIPt34ZNj5aH5GYaH/wPpSg6VkTVBJK8GuQDBvg27qT w+U/T6EzuqniWXh/P3COhfrMCR9ueUOY1qWCRwzomlpeIfBhCzidt2wUqIxX1TOA Q0Ltf0eERDevsZbE+tIm+VAAg98kHehcS2t8lfFYFO6/PKu2iJpJt/HtJbZNBE+W rm96E4qXRiy1UuL7D9vBkaWsbnosuNHgGQXx57GlokQU+2IGBmOxV52XHiSxxpXd AS1ZTd56ItmID8VaU09Pbf7ZFbiCgdEAxIbUFzaCuvo+lxryHFphIUARNi/zPnNT UC2OzunCqA== =oADH -----END PGP SIGNATURE----- Merge tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: - NVMe: - ZNS support (Aravind, Keith, Matias, Niklas) - Misc cleanups, optimizations, fixes (Baolin, Chaitanya, David, Dongli, Max, Sagi) - null_blk zone capacity support (Aravind) - MD: - raid5/6 fixes (ChangSyun) - Warning fixes (Damien) - raid5 stripe fixes (Guoqing, Song, Yufen) - sysfs deadlock fix (Junxiao) - raid10 deadlock fix (Vitaly) - struct_size conversions (Gustavo) - Set of bcache updates/fixes (Coly) * tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block: (117 commits) md/raid5: Allow degraded raid6 to do rmw md/raid5: Fix Force reconstruct-write io stuck in degraded raid5 raid5: don't duplicate code for different paths in handle_stripe raid5-cache: hold spinlock instead of mutex in r5c_journal_mode_show md: print errno in super_written md/raid5: remove the redundant setting of STRIPE_HANDLE md: register new md sysfs file 'uuid' read-only md: fix max sectors calculation for super 1.0 nvme-loop: remove extra variable in create ctrl nvme-loop: set ctrl state connecting after init nvme-multipath: do not fall back to __nvme_find_path() for non-optimized paths nvme-multipath: fix logic for non-optimized paths nvme-rdma: fix controller reset hang during traffic nvme-tcp: fix controller reset hang during traffic nvmet: introduce the passthru Kconfig option nvmet: introduce the passthru configfs interface nvmet: Add passthru enable/disable helpers nvmet: add passthru code to process commands nvme: export nvme_find_get_ns() and nvme_put_ns() nvme: introduce nvme_ctrl_get_by_path() ...	2020-08-05 10:51:40 -07:00
Linus Torvalds	99ea1521a0	Remove uninitialized_var() macro for v5.9-rc1 - Clean up non-trivial uses of uninitialized_var() - Update documentation and checkpatch for uninitialized_var() removal - Treewide removal of uninitialized_var() -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oYLQWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsfjEACvf0D3WL3H7sLHtZ2HeMwOgAzq il08t6vUscINQwiIIK3Be43ok3uQ1Q+bj8sr2gSYTwunV2IYHFferzgzhyMMno3o XBIGd1E+v1E4DGBOiRXJvacBivKrfvrdZ7AWiGlVBKfg2E0fL1aQbe9AYJ6eJSbp UGqkBkE207dugS5SQcwrlk1tWKUL089lhDAPd7iy/5RK76OsLRCJFzIerLHF2ZK2 BwvA+NWXVQI6pNZ0aRtEtbbxwEU4X+2J/uaXH5kJDszMwRrgBT2qoedVu5LXFPi8 +B84IzM2lii1HAFbrFlRyL/EMueVFzieN40EOB6O8wt60Y4iCy5wOUzAdZwFuSTI h0xT3JI8BWtpB3W+ryas9cl9GoOHHtPA8dShuV+Y+Q2bWe1Fs6kTl2Z4m4zKq56z 63wQCdveFOkqiCLZb8s6FhnS11wKtAX4czvXRXaUPgdVQS1Ibyba851CRHIEY+9I AbtogoPN8FXzLsJn7pIxHR4ADz+eZ0dQ18f2hhQpP6/co65bYizNP5H3h+t9hGHG k3r2k8T+jpFPaddpZMvRvIVD8O2HvJZQTyY6Vvneuv6pnQWtr2DqPFn2YooRnzoa dbBMtpon+vYz6OWokC5QNWLqHWqvY9TmMfcVFUXE4AFse8vh4wJ8jJCNOFVp8On+ drhmmImUr1YylrtVOw== =xHmk -----END PGP SIGNATURE----- Merge tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull uninitialized_var() macro removal from Kees Cook: "This is long overdue, and has hidden too many bugs over the years. The series has several "by hand" fixes, and then a trivial treewide replacement. - Clean up non-trivial uses of uninitialized_var() - Update documentation and checkpatch for uninitialized_var() removal - Treewide removal of uninitialized_var()" * tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: compiler: Remove uninitialized_var() macro treewide: Remove uninitialized_var() usage checkpatch: Remove awareness of uninitialized_var() macro mm/debug_vm_pgtable: Remove uninitialized_var() usage f2fs: Eliminate usage of uninitialized_var() macro media: sur40: Remove uninitialized_var() usage KVM: PPC: Book3S PR: Remove uninitialized_var() usage clk: spear: Remove uninitialized_var() usage clk: st: Remove uninitialized_var() usage spi: davinci: Remove uninitialized_var() usage ide: Remove uninitialized_var() usage rtlwifi: rtl8192cu: Remove uninitialized_var() usage b43: Remove uninitialized_var() usage drbd: Remove uninitialized_var() usage x86/mm/numa: Remove uninitialized_var() usage docs: deprecated.rst: Add uninitialized_var()	2020-08-04 13:49:43 -07:00
Linus Torvalds	cdc8fcb499	for-5.9/io_uring-20200802 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7asQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgplrCD/0S17kio+k4cOJDGwl88WoJw+QiYmM5019k decZ1JymQvV1HXRmlcZiEAu0hHDD0FoovSRrw7II3gw3GouETmYQM62f6ZTpDeMD CED/fidnfULAkPaI6h+bj3jyI0cEuujG/R47rGSQEkIIr3RttqKZUzVkB9KN+KMw +OBuXZtMIoFFEVJ91qwC2dm2qHLqOn1/5MlT59knso/xbPOYOXsFQpGiACJqF97x 6qSSI8uGE+HZqvL2OLWPDBbLEJhrq+dzCgxln5VlvLele4UcRhOdonUb7nUwEKCe zwvtXzz16u1D1b8bJL4Kg5bGqyUAQUCSShsfBJJxh6vTTULiHyCX5sQaai1OEB16 4dpBL9E+nOUUix4wo9XBY0/KIYaPWg5L1CoEwkAXqkXPhFvNUucsC0u6KvmzZR3V 1OogVTjl6GhS8uEVQjTKNshkTIC9QHEMXDUOHtINDCb/sLU+ANXU5UpvsuzZ9+kt KGc4mdyCwaKBq4YW9sVwhhq/RHLD4AUtWZiUVfOE+0cltCLJUNMbQsJ+XrcYaQnm W4zz22Rep+SJuQNVcCW/w7N2zN3yB6gC1qeroSLvzw4b5el2TdFp+BcgVlLHK+uh xjsGNCq++fyzNk7vvMZ5hVq4JGXYjza7AiP5HlQ8nqdiPUKUPatWCBqUm9i9Cz/B n+0dlYbRwQ== =2vmy -----END PGP SIGNATURE----- Merge tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block Pull io_uring updates from Jens Axboe: "Lots of cleanups in here, hardening the code and/or making it easier to read and fixing bugs, but a core feature/change too adding support for real async buffered reads. With the latter in place, we just need buffered write async support and we're done relying on kthreads for the fast path. In detail: - Cleanup how memory accounting is done on ring setup/free (Bijan) - sq array offset calculation fixup (Dmitry) - Consistently handle blocking off O_DIRECT submission path (me) - Support proper async buffered reads, instead of relying on kthread offload for that. This uses the page waitqueue to drive retries from task_work, like we handle poll based retry. (me) - IO completion optimizations (me) - Fix race with accounting and ring fd install (me) - Support EPOLLEXCLUSIVE (Jiufei) - Get rid of the io_kiocb unionizing, made possible by shrinking other bits (Pavel) - Completion side cleanups (Pavel) - Cleanup REQ_F_ flags handling, and kill off many of them (Pavel) - Request environment grabbing cleanups (Pavel) - File and socket read/write cleanups (Pavel) - Improve kiocb_set_rw_flags() (Pavel) - Tons of fixes and cleanups (Pavel) - IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)" * tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block: (127 commits) io_uring: flip if handling after io_setup_async_rw fs: optimise kiocb_set_rw_flags() io_uring: don't touch 'ctx' after installing file descriptor io_uring: get rid of atomic FAA for cq_timeouts io_uring: consolidate *_check_overflow accounting io_uring: fix stalled deferred requests io_uring: fix racy overflow count reporting io_uring: deduplicate __io_complete_rw() io_uring: de-unionise io_kiocb io-wq: update hash bits io_uring: fix missing io_queue_linked_timeout() io_uring: mark ->work uninitialised after cleanup io_uring: deduplicate io_grab_files() calls io_uring: don't do opcode prep twice io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works io_uring: batch put_task_struct() tasks: add put_task_struct_many() io_uring: return locked and pinned page accounting io_uring: don't miscount pinned memory io_uring: don't open-code recv kbuf managment ...	2020-08-03 13:01:22 -07:00
Linus Torvalds	382625d0d4	for-5.9/block-20200802 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7YwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpt+dEAC7a0HYuX2OrkyawBnsgd1QQR/soC7surec yDDa7SMM8cOq3935bfzcYHV9FWJszEGIknchiGb9R3/T+vmSohbvDsM5zgwya9u/ FHUIuTq324I6JWXKl30k4rwjiX9wQeMt+WZ5gC8KJYCWA296i2IpJwd0A45aaKuS x4bTjxqknE+fD4gQiMUSt+bmuOUAp81fEku3EPapCRYDPAj8f5uoY7R2arT/POwB b+s+AtXqzBymIqx1z0sZ/XcdZKmDuhdurGCWu7BfJFIzw5kQ2Qe3W8rUmrQ3pGut 8a21YfilhUFiBv+B4wptfrzJuzU6Ps0BXHCnBsQjzvXwq5uFcZH495mM/4E4OJvh SbjL2K4iFj+O1ngFkukG/F8tdEM1zKBYy2ZEkGoWKUpyQanbAaGI6QKKJA+DCdBi yPEb7yRAa5KfLqMiocm1qCEO1I56HRiNHaJVMqCPOZxLmpXj19Fs71yIRplP1Trv GGXdWZsccjuY6OljoXWdEfnxAr5zBsO3Yf2yFT95AD+egtGsU1oOzlqAaU1mtflw ABo452pvh6FFpxGXqz6oK4VqY4Et7WgXOiljA4yIGoPpG/08L1Yle4eVc2EE01Jb +BL49xNJVeUhGFrvUjPGl9kVMeLmubPFbmgrtipW+VRg9W8+Yirw7DPP6K+gbPAR RzAUdZFbWw== =abJG -----END PGP SIGNATURE----- Merge tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "Good amount of cleanups and tech debt removals in here, and as a result, the diffstat shows a nice net reduction in code. - Softirq completion cleanups (Christoph) - Stop using ->queuedata (Christoph) - Cleanup bd claiming (Christoph) - Use check_events, moving away from the legacy media change (Christoph) - Use inode i_blkbits consistently (Christoph) - Remove old unused writeback congestion bits (Christoph) - Cleanup/unify submission path (Christoph) - Use bio_uninit consistently, instead of bio_disassociate_blkg (Christoph) - sbitmap cleared bits handling (John) - Request merging blktrace event addition (Jan) - sysfs add/remove race fixes (Luis) - blk-mq tag fixes/optimizations (Ming) - Duplicate words in comments (Randy) - Flush deferral cleanup (Yufen) - IO context locking/retry fixes (John) - struct_size() usage (Gustavo) - blk-iocost fixes (Chengming) - blk-cgroup IO stats fixes (Boris) - Various little fixes" * tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits) block: blk-timeout: delete duplicated word block: blk-mq-sched: delete duplicated word block: blk-mq: delete duplicated word block: genhd: delete duplicated words block: elevator: delete duplicated word and fix typos block: bio: delete duplicated words block: bfq-iosched: fix duplicated word iocost_monitor: start from the oldest usage index iocost: Fix check condition of iocg abs_vdebt block: Remove callback typedefs for blk_mq_ops block: Use non _rcu version of list functions for tag_set_list blk-cgroup: show global disk stats in root cgroup io.stat blk-cgroup: make iostat functions visible to stat printing block: improve discard bio alignment in __blkdev_issue_discard() block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers block: defer flush request no matter whether we have elevator block: make blk_timeout_init() static block: remove retry loop in ioc_release_fn() block: remove unnecessary ioc nested locking block: integrate bd_start_claiming into __blkdev_get ...	2020-08-03 11:57:03 -07:00
Johannes Thumshirn	1a1206dc4c	block: don't do revalidate zones on invalid devices When we loose a device for whatever reason while (re)scanning zones, we trip over a NULL pointer in blk_revalidate_zone_cb, like in the following log: sd 0:0:0:0: [sda] 3418095616 4096-byte logical blocks: (14.0 TB/12.7 TiB) sd 0:0:0:0: [sda] 52156 zones of 65536 logical blocks sd 0:0:0:0: [sda] Write Protect is off sd 0:0:0:0: [sda] Mode Sense: 37 00 00 08 sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sd 0:0:0:0: [sda] REPORT ZONES start lba 1065287680 failed sd 0:0:0:0: [sda] REPORT ZONES: Result: hostbyte=0x00 driverbyte=0x08 sd 0:0:0:0: [sda] Sense Key : 0xb [current] sd 0:0:0:0: [sda] ASC=0x0 ASCQ=0x6 sda: failed to revalidate zones sd 0:0:0:0: [sda] 0 4096-byte logical blocks: (0 B/0 B) sda: detected capacity change from 14000519643136 to 0 ================================================================== BUG: KASAN: null-ptr-deref in blk_revalidate_zone_cb+0x1b7/0x550 Write of size 8 at addr 0000000000000010 by task kworker/u4:1/58 CPU: 1 PID: 58 Comm: kworker/u4:1 Not tainted 5.8.0-rc1 #692 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014 Workqueue: events_unbound async_run_entry_fn Call Trace: dump_stack+0x7d/0xb0 ? blk_revalidate_zone_cb+0x1b7/0x550 kasan_report.cold+0x5/0x37 ? blk_revalidate_zone_cb+0x1b7/0x550 check_memory_region+0x145/0x1a0 blk_revalidate_zone_cb+0x1b7/0x550 sd_zbc_parse_report+0x1f1/0x370 ? blk_req_zone_write_trylock+0x200/0x200 ? sectors_to_logical+0x60/0x60 ? blk_req_zone_write_trylock+0x200/0x200 ? blk_req_zone_write_trylock+0x200/0x200 sd_zbc_report_zones+0x3c4/0x5e0 ? sd_dif_config_host+0x500/0x500 blk_revalidate_disk_zones+0x231/0x44d ? _raw_write_lock_irqsave+0xb0/0xb0 ? blk_queue_free_zone_bitmaps+0xd0/0xd0 sd_zbc_read_zones+0x8cf/0x11a0 sd_revalidate_disk+0x305c/0x64e0 ? __device_add_disk+0x776/0xf20 ? read_capacity_16.part.0+0x1080/0x1080 ? blk_alloc_devt+0x250/0x250 ? create_object.isra.0+0x595/0xa20 ? kasan_unpoison_shadow+0x33/0x40 sd_probe+0x8dc/0xcd2 really_probe+0x20e/0xaf0 __driver_attach_async_helper+0x249/0x2d0 async_run_entry_fn+0xbe/0x560 process_one_work+0x764/0x1290 ? _raw_read_unlock_irqrestore+0x30/0x30 worker_thread+0x598/0x12f0 ? __kthread_parkme+0xc6/0x1b0 ? schedule+0xed/0x2c0 ? process_one_work+0x1290/0x1290 kthread+0x36b/0x440 ? kthread_create_worker_on_cpu+0xa0/0xa0 ret_from_fork+0x22/0x30 ================================================================== When the device is already gone we end up with the following scenario: The device's capacity is 0 and thus the number of zones will be 0 as well. When allocating the bitmap for the conventional zones, we then trip over a NULL pointer. So if we encounter a zoned block device with a 0 capacity, don't dare to revalidate the zones sizes. Fixes: `6c6b354914` ("block: set the zone size in blk_revalidate_disk_zones atomically") Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-03 09:24:04 -06:00
Randy Dunlap	d958e343bd	block: blk-timeout: delete duplicated word Drop the repeated word "request". Change to the correct kernel-doc notation for function name separtor. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-31 16:29:47 -06:00
Randy Dunlap	c4aecaa256	block: blk-mq-sched: delete duplicated word Drop the repeated word "to". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-31 16:29:47 -06:00
Randy Dunlap	70f15a4fd9	block: blk-mq: delete duplicated word Drop the repeated word "the". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-31 16:29:47 -06:00
Randy Dunlap	0d20dcc277	block: genhd: delete duplicated words Drop the repeated word "to" in multiple places. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-31 16:29:47 -06:00
Randy Dunlap	5b8f65e1f9	block: elevator: delete duplicated word and fix typos Drop the repeated word "the". Fix typos of "features" and "specified". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-31 16:29:47 -06:00
Randy Dunlap	3cf1488917	block: bio: delete duplicated words Drop the repeated words "a" and "the". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-31 16:29:47 -06:00
Randy Dunlap	f06678af91	block: bfq-iosched: fix duplicated word Change "at at" to "at a". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-31 16:29:47 -06:00
Chengming Zhou	d9012a59db	iocost: Fix check condition of iocg abs_vdebt We shouldn't skip iocg when its abs_vdebt is not zero. Fixes: `0b80f9866e` ("iocost: protect iocg->abs_vdebt with iocg->waitq.lock") Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-30 11:45:12 -06:00
Ahmed S. Darwish	67b7b641ca	iocost: Use sequence counter with associated spinlock A sequence counter write side critical section must be protected by some form of locking to serialize writers. A plain seqcount_t does not contain the information of which lock must be held when entering a write side critical section. Use the new seqcount_spinlock_t data type, which allows to associate a spinlock with the sequence counter. This enables lockdep to verify that the spinlock used for writer serialization is held when the write side critical section is entered. If lockdep is disabled this lock association is compiled out and has neither storage size nor runtime overhead. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Link: https://lkml.kernel.org/r/20200720155530.1173732-21-a.darwish@linutronix.de	2020-07-29 16:14:28 +02:00
Daniel Wagner	08c875cbf4	block: Use non _rcu version of list functions for tag_set_list tag_set_list is only accessed under the tag_set_lock lock. There is no need for using the _rcu list functions. The _rcu list function were introduced to allow read access to the tag_set_list protected under RCU, see `705cda97ee` ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list") and `05b7941394` ("Revert "blk-mq: don't handle TAG_SHARED in restart""). Those changes got reverted later but the cleanup commit missed a couple of places to undo the changes. Fixes: `97889f9ac2` ("blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()" Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-28 09:37:09 -06:00
Alan Stern	8f38f8e0a3	scsi: block: pm: Simplify resume handling Commit `05d18ae1cc` ("scsi: pm: Balance pm_only counter of request queue during system resume") fixed a problem in the block layer's runtime-PM code: blk_set_runtime_active() failed to call blk_clear_pm_only(). However, the commit's implementation was awkward; it forced the SCSI system-resume handler to choose whether to call blk_post_runtime_resume() or blk_set_runtime_active(), depending on whether or not the SCSI device had previously been runtime suspended. This patch simplifies the situation considerably by adding the missing function call directly into blk_set_runtime_active() (under the condition that the queue is not already in the RPM_ACTIVE state). This allows the SCSI routine to revert back to its original form. Furthermore, making this change reveals that blk_post_runtime_resume() (in its success pathway) does exactly the same thing as blk_set_runtime_active(). The duplicate code is easily removed by making one routine call the other. No functional changes are intended. Link: https://lore.kernel.org/r/20200706151436.GA702867@rowland.harvard.edu CC: Can Guo <cang@codeaurora.org> CC: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2020-07-24 22:09:55 -04:00
Christoph Hellwig	b9b1a5d715	block: remove blk_queue_stack_limits This function is just a tiny wrapper around blk_stack_limits. Open code it int the two callers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Tested-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-20 15:38:52 -06:00
Christoph Hellwig	9efa82ef2b	block: remove bdev_stack_limits This function is just a tiny wrapper around blk_stack_limit and has two callers. Simplify the stack a bit by open coding it in the two callers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Tested-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-20 15:38:52 -06:00
Christoph Hellwig	3093a47972	block: inherit the zoned characteristics in blk_stack_limits Lift the code from device mapper into blk_stack_limits to inherity the stacking limitations. This ensures we do the right thing for all stacked zoned block devices. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Tested-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-20 15:38:52 -06:00
Jens Axboe	4f43d64807	Merge branch 'for-5.9/drivers' into for-5.9/block-merge * for-5.9/drivers: (38 commits) block: add max_active_zones to blk-sysfs block: add max_open_zones to blk-sysfs s390/dasd: Use struct_size() helper s390/dasd: fix inability to use DASD with DIAG driver md-cluster: fix wild pointer of unlock_all_bitmaps() md/raid5-cache: clear MD_SB_CHANGE_PENDING before flushing stripes md: fix deadlock causing by sysfs_notify md: improve io stats accounting md: raid0/linear: fix dereference before null check on pointer mddev rsxx: switch from 'pci_free_consistent()' to 'dma_free_coherent()' nvme: remove ns->disk checks nvme-pci: use standard block status symbolic names nvme-pci: use the consistent return type of nvme_pci_iod_alloc_size() nvme-pci: add a blank line after declarations nvme-pci: fix some comments issues nvme-pci: remove redundant segment validation nvme: document quirked Intel models nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs nvme: support for zoned namespaces nvme: support for multiple Command Sets Supported and Effects log pages ...	2020-07-20 15:38:27 -06:00
Jens Axboe	9caaa66c91	Merge branch 'for-5.9/block' into for-5.9/block-merge * for-5.9/block: (124 commits) blk-cgroup: show global disk stats in root cgroup io.stat blk-cgroup: make iostat functions visible to stat printing block: improve discard bio alignment in __blkdev_issue_discard() block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers block: defer flush request no matter whether we have elevator block: make blk_timeout_init() static block: remove retry loop in ioc_release_fn() block: remove unnecessary ioc nested locking block: integrate bd_start_claiming into __blkdev_get block: use bd_prepare_to_claim directly in the loop driver block: refactor bd_start_claiming block: simplify the restart case in __blkdev_get Revert "blk-rq-qos: remove redundant finish_wait to rq_qos_wait." block: always remove partitions from blk_drop_partitions() block: relax jiffies rounding for timeouts blk-mq: remove redundant validation in __blk_mq_end_request() blk-mq: Remove unnecessary local variable writeback: remove bdi->congested_fn writeback: remove struct bdi_writeback_congested writeback: remove {set,clear}_wb_congested ...	2020-07-20 15:38:23 -06:00
Boris Burkov	ef45fe470e	blk-cgroup: show global disk stats in root cgroup io.stat In order to improve consistency and usability in cgroup stat accounting, we would like to support the root cgroup's io.stat. Since the root cgroup has processes doing io even if the system has no explicitly created cgroups, we need to be careful to avoid overhead in that case. For that reason, the rstat algorithms don't handle the root cgroup, so just turning the file on wouldn't give correct statistics. To get around this, we simulate flushing the iostat struct by filling it out directly from global disk stats. The result is a root cgroup io.stat file consistent with both /proc/diskstats and io.stat. Note that in order to collect the disk stats, we needed to iterate over devices. To facilitate that, we had to change the linkage of a disk_type to external so that it can be used from blk-cgroup.c to iterate over disks. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Boris Burkov <boris@bur.io> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-17 20:18:00 -06:00
Boris Burkov	cd1fc4b98f	blk-cgroup: make iostat functions visible to stat printing Previously, the code which printed io.stat only needed access to the generic rstat flushing code, but since we plan to write some more specific code for preparing root cgroup stats, we need to manipulate iostat structs directly. Since declaring static functions ahead does not seem like common practice in this file, simply move the iostat functions up. We only plan to use blkg_iostat_set, but it seems better to keep them all together. Signed-off-by: Boris Burkov <boris@bur.io> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-17 20:17:59 -06:00
Coly Li	9b15d109a6	block: improve discard bio alignment in __blkdev_issue_discard() This patch improves discard bio split for address and size alignment in __blkdev_issue_discard(). The aligned discard bio may help underlying device controller to perform better discard and internal garbage collection, and avoid unnecessary internal fragment. Current discard bio split algorithm in __blkdev_issue_discard() may have non-discarded fregment on device even the discard bio LBA and size are both aligned to device's discard granularity size. Here is the example steps on how to reproduce the above problem. - On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk with thin mode and give it to a Linux virtual machine. - Inside the Linux virtual machine, if the 50GB virtual disk shows up as /dev/sdb, fill data into the first 50GB by, # dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200 - Discard the 50GB range from offset 0 on /dev/sdb, # blkdiscard /dev/sdb -o 0 -l 53687091200 - Observe the underlying mapping status of the device # sg_get_lba_status /dev/sdb -m 1048 --lba=0 descriptor LBA: 0x0000000000000000 blocks: 2048 mapped (or unknown) descriptor LBA: 0x0000000000000800 blocks: 16773120 deallocated descriptor LBA: 0x0000000000fff800 blocks: 2048 mapped (or unknown) descriptor LBA: 0x0000000001000000 blocks: `8386560` deallocated descriptor LBA: 0x00000000017ff800 blocks: 2048 mapped (or unknown) descriptor LBA: 0x0000000001800000 blocks: `8386560` deallocated descriptor LBA: 0x0000000001fff800 blocks: 2048 mapped (or unknown) descriptor LBA: 0x0000000002000000 blocks: `8386560` deallocated descriptor LBA: 0x00000000027ff800 blocks: 2048 mapped (or unknown) descriptor LBA: 0x0000000002800000 blocks: `8386560` deallocated descriptor LBA: 0x0000000002fff800 blocks: 2048 mapped (or unknown) descriptor LBA: 0x0000000003000000 blocks: `8386560` deallocated descriptor LBA: 0x00000000037ff800 blocks: 2048 mapped (or unknown) descriptor LBA: 0x0000000003800000 blocks: `8386560` deallocated descriptor LBA: 0x0000000003fff800 blocks: 2048 mapped (or unknown) descriptor LBA: 0x0000000004000000 blocks: `8386560` deallocated descriptor LBA: 0x00000000047ff800 blocks: 2048 mapped (or unknown) descriptor LBA: 0x0000000004800000 blocks: `8386560` deallocated descriptor LBA: 0x0000000004fff800 blocks: 2048 mapped (or unknown) descriptor LBA: 0x0000000005000000 blocks: `8386560` deallocated descriptor LBA: 0x00000000057ff800 blocks: 2048 mapped (or unknown) descriptor LBA: 0x0000000005800000 blocks: `8386560` deallocated descriptor LBA: 0x0000000005fff800 blocks: 2048 mapped (or unknown) descriptor LBA: 0x0000000006000000 blocks: 6291456 deallocated descriptor LBA: 0x0000000006600000 blocks: 0 deallocated Although the discard bio starts at LBA 0 and has 50<<30 bytes size which are perfect aligned to the discard granularity, from the above list these are many 1MB (2048 sectors) internal fragments exist unexpectedly. The problem is in __blkdev_issue_discard(), an improper algorithm causes an improper bio size which is not aligned. 25 int __blkdev_issue_discard(struct block_device bdev, sector_t sector, 26 sector_t nr_sects, gfp_t gfp_mask, int flags, 27 struct bio biop) 28 { 29 struct request_queue q = bdev_get_queue(bdev); [snipped] 56 57 while (nr_sects) { 58 sector_t req_sects = min_t(sector_t, nr_sects, 59 bio_allowed_max_sectors(q)); 60 61 WARN_ON_ONCE((req_sects << 9) > UINT_MAX); 62 63 bio = blk_next_bio(bio, 0, gfp_mask); 64 bio->bi_iter.bi_sector = sector; 65 bio_set_dev(bio, bdev); 66 bio_set_op_attrs(bio, op, 0); 67 68 bio->bi_iter.bi_size = req_sects << 9; 69 sector += req_sects; 70 nr_sects -= req_sects; [snipped] 79 } 80 81 *biop = bio; 82 return 0; 83 } 84 EXPORT_SYMBOL(__blkdev_issue_discard); At line 58-59, to discard a 50GB range, req_sects is set as return value of bio_allowed_max_sectors(q), which is 8388607 sectors. In the above case, the discard granularity is 2048 sectors, although the start LBA and discard length are aligned to discard granularity, req_sects never has chance to be aligned to discard granularity. This is why there are some still-mapped 2048 sectors fragment in every 4 or 8 GB range. If req_sects at line 58 is set to a value aligned to discard_granularity and close to UNIT_MAX, then all consequent split bios inside device driver are (almostly) aligned to discard_granularity of the device queue. The 2048 sectors still-mapped fragment will disappear. This patch introduces bio_aligned_discard_max_sectors() to return the the value which is aligned to q->limits.discard_granularity and closest to UINT_MAX. Then this patch replaces bio_allowed_max_sectors() with this new routine to decide a more proper split bio length. But we still need to handle the situation when discard start LBA is not aligned to q->limits.discard_granularity, otherwise even the length is aligned, current code may still leave 2048 fragment around every 4GB range. Therefore, to calculate req_sects, firstly the start LBA of discard range is checked (including partition offset), if it is not aligned to discard granularity, the first split location should make sure following bio has bi_sector aligned to discard granularity. Then there won't be still-mapped fragment in the middle of the discard range. The above is how this patch improves discard bio alignment in __blkdev_issue_discard(). Now with this patch, after discard with same command line mentiond previously, sg_get_lba_status returns, descriptor LBA: 0x0000000000000000 blocks: 106954752 deallocated descriptor LBA: 0x0000000006600000 blocks: 0 deallocated We an see there is no 2048 sectors segment anymore, everything is clean. Reported-and-tested-by: Acshai Manoj <acshai.manoj@microfocus.com> Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Xiao Ni <xni@redhat.com> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Enzo Matsumiya <ematsumiya@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-17 07:15:10 -06:00
Yufen Yu	b5718d6c00	block: defer flush request no matter whether we have elevator Commit `7520872c0c` ("block: don't defer flushes on blk-mq + scheduling") tried to fix deadlock for cycled wait between flush requests and data request into flush_data_in_flight. The former holded all driver tags and wait for data request completion, but the latter can not complete for waiting free driver tags. After commit `923218f616` ("blk-mq: don't allocate driver tag upfront for flush rq"), flush requests will not get driver tag before queuing into flush queue. * With elevator, flush request just get sched_tags before inserting flush queue. It will not get driver tag until issue them to driver. data request on list fq->flush_data_in_flight will complete in the end. * Without elevator, each flush request will get a driver tag when allocate request. Then data request on fq->flush_data_in_flight don't worry about lacking driver tag. In both of these cases, cycled wait cannot be true. So we may allow to defer flush request. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-17 07:14:28 -06:00
Wei Yongjun	943c4d9074	block: make blk_timeout_init() static The sparse tool complains as follows: block/blk-timeout.c:93:12: warning: symbol 'blk_timeout_init' was not declared. Should it be static? Function blk_timeout_init() is not used outside of blk-timeout.c, so mark it static. Fixes: `9054650fac` ("block: relax jiffies rounding for timeouts") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-17 07:13:42 -06:00
Kees Cook	3f649ab728	treewide: Remove uninitialized_var() usage Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' \| cut -d: -f1 \| sort -u \| \ xargs perl -pi -e \ 's/\buninitialized_var$([^$]+)\)/\1/g; s:\s/\ (GCC be quiet\|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-16 12:35:15 -07:00
John Ogness	ab96bbab46	block: remove retry loop in ioc_release_fn() The reverse-order double lock dance in ioc_release_fn() is using a retry loop. This is a problem on PREEMPT_RT because it could preempt the task that would release q->queue_lock and thus live lock in the retry loop. RCU is already managing the freeing of the request queue and icq. If the trylock fails, use RCU to guarantee that the request queue and icq are not freed and re-acquire the locks in the correct order, allowing forward progress. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-16 10:22:15 -06:00
John Ogness	a43f085f87	block: remove unnecessary ioc nested locking The legacy CFQ IO scheduler could call put_io_context() in its exit_icq() elevator callback. This led to a lockdep warning, which was fixed in commit `d8c66c5d59` ("block: fix lockdep warning on io_context release put_io_context()") by using a nested subclass for the ioc spinlock. However, with commit `f382fb0bce` ("block: remove legacy IO schedulers") the CFQ IO scheduler no longer exists. The BFQ IO scheduler also implements the exit_icq() elevator callback but does not call put_io_context(). The nested subclass for the ioc spinlock is no longer needed. Since it existed as an exception and no longer applies, remove the nested subclass usage. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-16 10:22:15 -06:00
Niklas Cassel	659bf827ba	block: add max_active_zones to blk-sysfs Add a new max_active zones definition in the sysfs documentation. This definition will be common for all devices utilizing the zoned block device support in the kernel. Export max_active_zones according to this new definition for NVMe Zoned Namespace devices, ZAC ATA devices (which are treated as SCSI devices by the kernel), and ZBC SCSI devices. Add the new max_active_zones member to struct request_queue, rather than as a queue limit, since this property cannot be split across stacking drivers. For SCSI devices, even though max active zones is not part of the ZBC/ZAC spec, export max_active_zones as 0, signifying "no limit". Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Javier González <javier@javigon.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-15 14:26:11 -06:00
Niklas Cassel	e15864f8ea	block: add max_open_zones to blk-sysfs Add a new max_open_zones definition in the sysfs documentation. This definition will be common for all devices utilizing the zoned block device support in the kernel. Export max open zones according to this new definition for NVMe Zoned Namespace devices, ZAC ATA devices (which are treated as SCSI devices by the kernel), and ZBC SCSI devices. Add the new max_open_zones member to struct request_queue, rather than as a queue limit, since this property cannot be split across stacking drivers. Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Javier González <javier@javigon.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-15 14:26:11 -06:00
Jens Axboe	e791ee6885	Revert "blk-rq-qos: remove redundant finish_wait to rq_qos_wait." This reverts commit `826f2f48da`. Qian Cai reports that this commit causes stalls with swap. Revert until the reason can be figured out. Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-15 09:33:37 -06:00
Ming Lei	d0f0f1b4c5	block: always remove partitions from blk_drop_partitions() In theory, when GENHD_FL_NO_PART_SCAN is set, no partitions can be created on one disk. However, ioctl(BLKPG, BLKPG_ADD_PARTITION) doesn't check GENHD_FL_NO_PART_SCAN, so partitions still can be added even though GENHD_FL_NO_PART_SCAN is set. So far blk_drop_partitions() only removes partitions when disk_part_scan_enabled() return true. This way can make ghost partition on loop device after changing/clearing FD in case that PARTSCAN is disabled, such as partitions can be added via 'parted' on loop disk even though GENHD_FL_NO_PART_SCAN is set. Fix this issue by always removing partitions in blk_drop_partitions(), and this way is correct because the current code supposes that no partitions can be added in case of GENHD_FL_NO_PART_SCAN. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-15 09:23:42 -06:00
Jens Axboe	9054650fac	block: relax jiffies rounding for timeouts In doing high IOPS testing, blk-mq is generally pretty well optimized. There are a few things that stuck out as using more CPU than what is really warranted, and one thing is the round_jiffies_up() that we do twice for each request. That accounts for about 0.8% of the CPU in my testing. We can make this cheaper by avoiding an integer division, by just adding a rough HZ mask that we can AND with instead. The timeouts are only on a second granularity already, we don't have to be that accurate here and this patch barely changes that. All we care about is nice grouping. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-15 09:23:35 -06:00
Linus Torvalds	d33db70274	block-5.8-2020-07-10 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8Ij/kQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgprHQD/4rS7rwnzbE0HAYqnxG40Pkzcedsj8DMBSY whlPe016D4tqGuqbudm2crVO8TKPRVhluw+VL85xgDh5Hg09uljjpjCh9q4K8GxF moTNlhrHkg7SC8P6wEUBR4oPA0g8NQ0CLMk9CB+2YmahcyBV1neZKH3oItZniSw5 8GwHDRI0o+FfhpgwuCMGvThQ2q4OLnwVa6rxVFaVgniCB/vjcuRQGLlGTTLIon21 7jRWr950XcbBu0g4fp5jP7bXXEwr/fQDYRk3LNGS3Ku+v0sCzWscpkOrQZxH/xLd l41QQK0dP8dG7GphYGupnlGenqHhLpW+9hrG3nJTt+Y2V16/wdJoFfKgi5wHj1lP ltSnBo+0/V2M3IfzNnLu0khzRl3//65dffZQIJznMqMy7L+ggZfpDm3MKUdxRrmc +yyL5NJmvg1p8CQdky8A22yrIzK6NAyS/rxg1rzdy5PuF+y9z91vcARxorrB0t31 bM89PYLDnUkwT0kGErdU1TtqvW8OEE2kzjJf/sfoRdY9w+ZlD/vzlC8axso1FrjX ep8BBHH7oHPy0q8gYYVUUWdsJSi0DZEHwML7lb+CDIlfE0UVBtOuvLYDfZUrckOg Ahs3Mc7odJ2Am9qESUZQnG5AhkYmVLYhtB5NGeaOMuzDQnfZPrYYrUdvZ6kN7pww c4h7ZSPVUQ== =BZ6W -----END PGP SIGNATURE----- Merge tag 'block-5.8-2020-07-10' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - Fix for inflight accounting, which affects only dm (Ming) - Fix documentation error for bfq (Yufen) - Fix memory leak for nbd (Zheng) * tag 'block-5.8-2020-07-10' of git://git.kernel.dk/linux-block: nbd: Fix memory leak in nbd_add_socket blk-mq: consider non-idle request as "inflight" in blk_mq_rq_inflight() docs: block: update and fix tiny error for bfq	2020-07-10 09:55:46 -07:00
Baolin Wang	87890092ee	blk-mq: remove redundant validation in __blk_mq_end_request() We've already validated the 'q->elevator' before calling ->ops.completed_request() in blk_mq_sched_completed_request(), thus no need to validate rq->internal_tag again. Rmove it. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-10 07:58:33 -06:00
Baolin Wang	106e71c512	blk-mq: Remove unnecessary local variable Remove unnecessary local variable 'ret' in blk_mq_dispatch_hctx_list(). Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-10 07:58:09 -06:00
Christoph Hellwig	8c911f3d4c	writeback: remove struct bdi_writeback_congested We never set any congested bits in the group writeback instances of it. And for the simpler bdi-wide case a simple scalar field is all that that is needed. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-08 17:05:53 -06:00
Christoph Hellwig	a564e23f0f	md: switch to ->check_events for media change notifications md is the last driver using the legacy media_changed method. Switch it over to (not so) new ->clear_events approach, which also removes the need for the ->revalidate_disk method. Signed-off-by: Christoph Hellwig <hch@lst.de> [axboe: remove unused 'bdops' variable in disk_clear_events()] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-08 16:19:47 -06:00
Ming Lei	568f270065	blk-mq: centralise related handling into blk_mq_get_driver_tag Move .nr_active update and request assignment into blk_mq_get_driver_tag(), all are good to do during getting driver tag. Meantime blk-flush related code is simplified and flush request needn't to update the request table manually any more. Signed-off-by: Ming Lei <ming.lei@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-08 16:06:42 -06:00
Ming Lei	7bf137298c	blk-mq: streamline handling of q->mq_ops->queue_rq result Current handling of q->mq_ops->queue_rq result is a bit ugly: - two branches which needs to 'continue' have to check if the dispatch local list is empty, otherwise one bad request may be retrieved via 'rq = list_first_entry(list, struct request, queuelist);' - the branch of 'if (unlikely(ret != BLK_STS_OK))' isn't easy to follow, since it is actually one error branch. Streamline this handling, so the code becomes more readable, meantime potential kernel oops can be avoided in case that the last request in local dispatch list is failed. Fixes: `fc17b6534e` ("blk-mq: switch ->queue_rq return value to blk_status_t") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-08 16:04:39 -06:00
Keith Busch	240e6ee272	nvme: support for zoned namespaces Add support for NVM Express Zoned Namespaces (ZNS) Command Set defined in NVM Express TP4053. Zoned namespaces are discovered based on their Command Set Identifier reported in the namespaces Namespace Identification Descriptor list. A successfully discovered Zoned Namespace will be registered with the block layer as a host managed zoned block device with Zone Append command support. A namespace that does not support append is not supported by the driver. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com> Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Keith Busch <keith.busch@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:20 +02:00
Matias Bjørling	82394db738	block: add capacity field to zone descriptors In the zoned storage model, the sectors within a zone are typically all writeable. With the introduction of the Zoned Namespace (ZNS) Command Set in the NVM Express organization, the model was extended to have a specific writeable capacity. Extend the zone descriptor data structure with a zone capacity field to indicate to the user how many sectors in a zone are writeable. Introduce backward compatibility in the zone report ioctl by extending the zone report header data structure with a flags field to indicate if the capacity field is available. Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:19 +02:00
Jens Axboe	482c6b614a	Linux 5.8-rc4 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl8CYDYeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGcQkH/2vOsPf79yWtsc7x hd2LpCPfrm7T1xlQcYcXbEbyRI8sqPmguixO8pRI1ePl2lBZ7KurfyeYgYZNGpFU t74Ph6A6dSWoCgO68Genm/SQuK8ic6o9n1Vr8tDsGDp5KlHWNaweq4JwHrsPmO1T cI0PR/ClAhLG8cQZ4x988Es5HTNGY17XK27e+M/zKYxSMGY2NRdJBGQIq964i5Q8 2d9G0rtVCaVDzgjrLwaFm6RBu21Il7HV6KsBsacyTFiL1ywx2vnUHzeZQyvuJSOQ 4YpLo9v4tBP10WHC50LRStZyO0qRwPVd/Yl7fL4R/CKsJT9H4uiwasVoEBVSL/k6 CUn3JL0= =P/Vx -----END PGP SIGNATURE----- Merge tag 'v5.8-rc4' into for-5.9/drivers Merge in 5.8-rc4 for-5.9/block to setup for-5.9/drivers, to provide a clean base and making the life for the NVMe changes easier. Signed-off-by: Jens Axboe <axboe@kernel.dk> * tag 'v5.8-rc4': (732 commits) Linux 5.8-rc4 x86/ldt: use "pr_info_once()" instead of open-coding it badly MIPS: Do not use smp_processor_id() in preemptible code MIPS: Add missing EHB in mtc0 -> mfc0 sequence for DSPen .gitignore: Do not track `defconfig` from `make savedefconfig` io_uring: fix regression with always ignoring signals in io_cqring_wait() x86/ldt: Disable 16-bit segments on Xen PV x86/entry/32: Fix #MC and #DB wiring on x86_32 x86/entry/xen: Route #DB correctly on Xen PV x86/entry, selftests: Further improve user entry sanity checks x86/entry/compat: Clear RAX high bits on Xen PV SYSENTER i2c: mlxcpld: check correct size of maximum RECV_LEN packet i2c: add Kconfig help text for slave mode i2c: slave-eeprom: update documentation i2c: eg20t: Load module automatically if ID matches i2c: designware: platdrv: Set class based on DMI i2c: algo-pca: Add 0x78 as SCL stuck low status for PCA9665 mm/page_alloc: fix documentation error vmalloc: fix the owner argument for the new __vmalloc_node_range callers mm/cma.c: use exact_nid true to fix possible per-numa cma leak ...	2020-07-08 08:02:13 -06:00
Christoph Hellwig	0e6e255e7a	block: remove a bogus warning in __submit_bio_noacct_mq If blk_mq_submit_bio flushes the plug list, bios for other disks can show up on current->bio_list. As that doesn't involve any stacking of block device it is entirely harmless and we should not warn about this case. Fixes: `ff93ea0ce7` ("block: shortcut __submit_bio_noacct for blk-mq drivers") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-07 11:45:59 -06:00
Ming Lei	05a4fed69f	blk-mq: consider non-idle request as "inflight" in blk_mq_rq_inflight() dm-multipath is the only user of blk_mq_queue_inflight(). When dm-multipath calls blk_mq_queue_inflight() to check if it has outstanding IO it can get a false negative. The reason for this is blk_mq_rq_inflight() doesn't consider requests that are no longer MQ_RQ_IN_FLIGHT but that are now MQ_RQ_COMPLETE (->complete isn't called or finished yet) as "inflight". This causes request-based dm-multipath's dm_wait_for_completion() to return before all outstanding dm-multipath requests have actually completed. This breaks DM multipath's suspend functionality because blk-mq requests complete after DM's suspend has finished -- which shouldn't happen. Fix this by considering any request not in the MQ_RQ_IDLE state (so either MQ_RQ_COMPLETE or MQ_RQ_IN_FLIGHT) as "inflight" in blk_mq_rq_inflight(). Fixes: `3c94d83cb3` ("blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-07 09:06:25 -06:00
Linus Torvalds	29206c6314	block-5.8-2020-07-05 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8BDy4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqyUD/0XI7Jo1W63aEwgW9wD1Xiyadc7eKzEFc/x upfqYBGiRUQehTKdmNBfr1ocrWF9OGj1g4NtPlU81Zjp1Y6c6pBuzeFF6NrfwEVi GrOO4nm04t4BOIk9AnsIjqknnk2XenbjFZmBNo0TKz3W3ftOPXSNDtJDgjxJ+rGd y5WOMfFCrE5rvo+JWiG3vxZIfTx8cxtraNw2PWcmxqjwOL+jNiN7E5rW/O4t0+DS 1ajqv5KseTEVtDNKG/Vn04cXxMVG8upG+Jv3xvxu4AlqJk84/va1LxkfvUuPuxJe c7dbGfR5db/KVdTsHU/WVo6URJ5nioftkMIHgIhOIIJR5D/B7WGFPu5AZtwRze6s C7BNIF49rBfbxyfLsVdIaAiw8GLQmsJWLs13OEVNRNGDxPO65as74J0E3UO9vOPa MCKffqkeSVHGK5LaXnhzn0lTEn35StUjWXRuyKAFxTWtSNDptopaoGZCrFO1IFXz EQfFlwU/fUNyfujAkMq7kNCxeQ0Kh7co6v41zphn8gBanpKgk5AqhnBJOSbI7OAS TDVMaQTzi3M+kMJV0Fu0rYQ5E3eiY3VAwif3L+6QiccwwgEygwIkdSLo2g63Q7CX Ogw8J2LIhuwbB/fCHhs7WfgfMRmgQcsaGWFLjI27UYd0FQks+rbDS8DItDuWgiBQ 74skOmzL5w== =dhze -----END PGP SIGNATURE----- Merge tag 'block-5.8-2020-07-05' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - NVMe fixes from Christoph: - Fix crash in multi-path disk add (Christoph) - Fix ignore of identify error (Sagi) - Fix a compiler complaint that a function should be static (Wei) * tag 'block-5.8-2020-07-05' of git://git.kernel.dk/linux-block: block: make function __bio_integrity_free() static nvme: fix a crash in nvme_mpath_add_disk nvme: fix identify error status silent ignore	2020-07-05 10:45:31 -07:00
Linus Torvalds	7cc2a8ea10	block-5.8-2020-07-01 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl79YWAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpuRIEACL2tFFKxKhWEJoRt5SQIV2fcJ8eM0MwYPk W3UdumUj5BnVaJJsu6U/lKYNGhl86sLheDKBUquKlILJa99pYhkoaphzTQG4HDoo 07HHHOhryRhVyIZ/5G+ALsGhC8cBJY3QkW2aU2TWd3VguQsBF1Hxud1O24Ks9hYe D2riudXIR5GE0q5APIAPEF1nNlc9pEa6STaIpWBLFzXEqaZwWX0yV2eF/ppmAubZ WcyrmMQebRAskP8cTOKFoUL57/2A3XT1gg7pDuVJE0qOmFVaqdFI/+2xZmZ4rpFO 6kvEeBglSY68h+rVbet5BBnD1y9nAunVphBDKSFqMuu1ORG2p6yPea8OWIDE+Z+z 9jSrRIf2A9qVLHf0yoPNUL+jCziEwITdnxLvnNo9Of+NJugwpxfzIDs6GnLAt8W2 JNX8HuGY7h/BupXxdzwyU0g0thlurIFJKoQMBkw/7SxGelKwEUwIPNqbuhNpdyB+ D86gdpkVQJEvULO6KUeObE32f2/nrPkwBiX81baeBLNSEoDsBnVdQhj8dIhsx0RD sViv9YQghE7UpNVnAqj2Elr/MSeaqYoVqWxM3GK56mIVMlGYg/iyLph4c/pJAKSt KOh3Z5tjMj4V257sfmZH7E14LWxI3bQwO9h7oBaNKhazH2xzzQHsTnjdPu6V63hv aLKP98uH+A== =5kZ8 -----END PGP SIGNATURE----- Merge tag 'block-5.8-2020-07-01' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - Use kvfree_sensitive() for the block keyslot free (Eric) - Sync blk-mq debugfs flags (Hou) - Memory leak fix in virtio-blk error path (Hou) * tag 'block-5.8-2020-07-01' of git://git.kernel.dk/linux-block: virtio-blk: free vblk-vqs in error path of virtblk_probe() block/keyslot-manager: use kvfree_sensitive() blk-mq-debugfs: update blk_queue_flag_name[] accordingly for new flags	2020-07-02 15:13:51 -07:00
Christoph Hellwig	7c792f33c1	block: initialize current->bio_list[1] in __submit_bio_noacct_mq bio_alloc_bioset references current->bio_list[1], so we need to initialize it for the blk-mq submission path as well. Fixes: `ff93ea0ce7` ("block: shortcut __submit_bio_noacct for blk-mq drivers") Reported-by: Qian Cai <cai@lca.pw> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-02 13:34:30 -06:00
Wei Yongjun	3197d48a7c	block: make function __bio_integrity_free() static Fix sparse build warning: block/bio-integrity.c:27:6: warning: symbol '__bio_integrity_free' was not declared. Should it be static? Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-02 12:38:18 -06:00
Jens Axboe	4e2f62e566	Revert "blk-mq: put driver tag when this request is completed" This reverts commits the following commits: `37f4a24c24` `723bf178f1` `36a3df5a45` The last one is the culprit, but we have to go a bit deeper to get this to revert cleanly. There's been a report that this breaks some MMC setups [1], and also causes an issue with swap [2]. Until this can be figured out, revert the offending commits. [1] https://lore.kernel.org/linux-block/57fb09b1-54ba-f3aa-f82c-d709b0e6b281@samsung.com/ [2] https://lore.kernel.org/linux-block/20200702043721.GA1087@lca.pw/ Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 22:58:32 -06:00
Hongnan Li	6e2fa4dd68	blk-iolatency: only call ktime_get() if needed ktime_to_ns(ktime_get()), which is expensive, does not need to be called if blk_iolatency_enabled() return false in blkcg_iolatency_done_bio(). Postponing ktime_to_ns(ktime_get()) execution reduces the CPU usage when blk_iolatency is disabled. Signed-off-by: Hongnan Li <hongnan.li@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 08:02:38 -06:00
Christoph Hellwig	5a6c35f9af	block: remove direct_make_request Now that submit_bio_noacct has a decent blk-mq fast path there is no more need for this bypass. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 07:27:24 -06:00
Christoph Hellwig	ff93ea0ce7	block: shortcut __submit_bio_noacct for blk-mq drivers For blk-mq drivers bios can only be inserted for the same queue. So bypass the complicated sorting logic in __submit_bio_noacct with a blk-mq simpler submission helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 07:27:24 -06:00
Christoph Hellwig	566acf2daa	block: refator submit_bio_noacct Split out a __submit_bio_noacct helper for the actual de-recursion algorithm, and simplify the loop by using a continue when we can't enter the queue for a bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 07:27:24 -06:00
Christoph Hellwig	ed00aabd5e	block: rename generic_make_request to submit_bio_noacct generic_make_request has always been very confusingly misnamed, so rename it to submit_bio_noacct to make it clear that it is submit_bio minus accounting and a few checks. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 07:27:24 -06:00
Christoph Hellwig	c62b37d96b	block: move ->make_request_fn to struct block_device_operations The make_request_fn is a little weird in that it sits directly in struct request_queue instead of an operation vector. Replace it with a block_device_operations method called submit_bio (which describes much better what it does). Also remove the request_queue argument to it, as the queue can be derived pretty trivially from the bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 07:27:24 -06:00
Christoph Hellwig	e439ab710f	block: remove the nr_sectors variable in generic_make_request_checks The variable is only used once, so just open code the bio_sector() there. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 07:27:24 -06:00
Christoph Hellwig	833f84e2b9	block: remove the NULL queue check in generic_make_request_checks All registers disks must have a valid queue pointer, so don't bother to log a warning for that case. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 07:27:24 -06:00
Christoph Hellwig	c817867460	block: tidy up a warning in bio_check_ro The "generic_make_request: " prefix has no value, and will soon become stale. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 07:27:23 -06:00
Christoph Hellwig	f695ca3886	block: remove the request_queue argument from blk_queue_split The queue can be trivially derived from the bio, so pass one less argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 07:27:23 -06:00
Hou Tao	b5fc1e8bed	blk-mq: remove pointless call of list_entry_rq() in hctx_show_busy_rq() Just use rq directly, the usage of list_entry_rq() doesn't make any sense. Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 07:25:36 -06:00
Colin Ian King	0b8cc25d94	blk-cgroup: clean up indentation There is a statement that is indented one level too deeply, fix it by removing a tab. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 13:00:04 -06:00
Ming Lei	37f4a24c24	blk-mq: centralise related handling into blk_mq_get_driver_tag Move .nr_active update and request assignment into blk_mq_get_driver_tag(), all are good to do during getting driver tag. Meantime blk-flush related code is simplified and flush request needn't to update the request table manually any more. Signed-off-by: Ming Lei <ming.lei@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 12:57:59 -06:00
Ming Lei	723bf178f1	blk-mq: move blk_mq_put_driver_tag() into blk-mq.c It is used by blk-mq.c only, so move it to the source file. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 12:57:59 -06:00
Ming Lei	570e9b73b0	blk-mq: move blk_mq_get_driver_tag into blk-mq.c blk_mq_get_driver_tag() is only used by blk-mq.c and is supposed to stay in blk-mq.c, so move it and preparing for cleanup code of get/put driver tag. Meantime hctx_may_queue() is moved to header file and it is fine since it is defined as inline always. No functional change. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 12:57:59 -06:00
Ming Lei	6e6fcbc27e	blk-mq: support batching dispatch in case of io More and more drivers want to get batching requests queued from block layer, such as mmc, and tcp based storage drivers. Also current in-tree users have virtio-scsi, virtio-blk and nvme. For none, we already support batching dispatch. But for io scheduler, every time we just take one request from scheduler and pass the single request to blk_mq_dispatch_rq_list(). This way makes batching dispatch not possible when io scheduler is applied. One reason is that we don't want to hurt sequential IO performance, becasue IO merge chance is reduced if more requests are dequeued from scheduler queue. Try to support batching dispatch for io scheduler by starting with the following simple approach: 1) still make sure we can get budget before dequeueing request 2) use hctx->dispatch_busy to evaluate if queue is busy, if it is busy we fackback to non-batching dispatch, otherwise dequeue as many as possible requests from scheduler, and pass them to blk_mq_dispatch_rq_list(). Wrt. 2), we use similar policy for none, and turns out that SCSI SSD performance got improved much. In future, maybe we can develop more intelligent algorithem for batching dispatch. Baolin has tested this patch and found that MMC performance is improved[3]. [1] https://lore.kernel.org/linux-block/20200512075501.GF1531898@T590/#r [2] https://lore.kernel.org/linux-block/fe6bd8b9-6ed9-b225-f80c-314746133722@grimberg.me/ [3] https://lore.kernel.org/linux-block/CADBw62o9eTQDJ9RvNgEqSpXmg6Xcq=2TxH0Hfxhp29uF2W=TXA@mail.gmail.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Baolin Wang <baolin.wang7@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Baolin Wang <baolin.wang7@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 07:51:48 -06:00
Ming Lei	1fd40b5ea7	blk-mq: pass obtained budget count to blk_mq_dispatch_rq_list Pass obtained budget count to blk_mq_dispatch_rq_list(), and prepare for supporting fully batching submission. With the obtained budget count, it is easier to put extra budgets in case of .queue_rq failure. Meantime remove the old 'got_budget' parameter. Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Baolin Wang <baolin.wang7@gmail.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Baolin Wang <baolin.wang7@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 07:51:48 -06:00
Ming Lei	bbdb3c5d94	blk-mq: remove dead check from blk_mq_dispatch_rq_list When BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE is returned from .queue_rq, the 'list' variable always holds this rq which isn't queued to LLD successfully. So blk_mq_dispatch_rq_list() always returns false from the branch of '!list_empty(list)'. No functional change. Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Baolin Wang <baolin.wang7@gmail.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@infradead.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Baolin Wang <baolin.wang7@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 07:51:48 -06:00
Ming Lei	7538352453	blk-mq: move getting driver tag and budget into one helper Move code for getting driver tag and budget into one helper, so blk_mq_dispatch_rq_list gets a bit simplified, and easier to read. Meantime move updating of 'no_tag' and 'no_budget_available' into the branch for handling partial dispatch because that is exactly consumer of the two local variables. Also rename the parameter of 'got_budget' as 'ask_budget'. No functional change. Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Baolin Wang <baolin.wang7@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Baolin Wang <baolin.wang7@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 07:51:48 -06:00
Ming Lei	445874e89f	blk-mq: pass hctx to blk_mq_dispatch_rq_list All requests in the 'list' of blk_mq_dispatch_rq_list belong to same hctx, so it is better to pass hctx instead of request queue, because blk-mq's dispatch target is hctx instead of request queue. Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Baolin Wang <baolin.wang7@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Baolin Wang <baolin.wang7@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 07:51:48 -06:00
Ming Lei	65c7636943	blk-mq: pass request queue into get/put budget callback blk-mq budget is abstract from scsi's device queue depth, and it is always per-request-queue instead of hctx. It can be quite absurd to get a budget from one hctx, then dequeue a request from scheduler queue, and this request may not belong to this hctx, at least for bfq and deadline. So fix the mess and always pass request queue to get/put budget callback. Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Baolin Wang <baolin.wang7@gmail.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Baolin Wang <baolin.wang7@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Douglas Anderson <dianders@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 07:51:48 -06:00
Eric Biggers	3e20aa9630	block/keyslot-manager: use kvfree_sensitive() Make blk_ksm_destroy() use the kvfree_sensitive() function (which was introduced in v5.8-rc1) instead of open-coding it. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-29 13:24:05 -06:00
Christoph Hellwig	42fdc5e49c	blk-mq: remove the BLK_MQ_REQ_INTERNAL flag Just check for a non-NULL elevator directly to make the code more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-29 09:56:18 -06:00
Ming Lei	36a3df5a45	blk-mq: put driver tag when this request is completed It is natural to release driver tag when this request is completed by LLD or device since its purpose is for LLD use. One big benefit is that the released tag can be re-used quicker since bio_endio() may take too long. Meantime we don't need to release driver tag for flush request. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-29 09:56:10 -06:00
Christoph Hellwig	a2e83ef9c3	blk-cgroup: remove a dead check in blk_throtl_bio bios must have a valid block group by the time they are submitted. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-29 09:09:08 -06:00
Christoph Hellwig	db18a53e5b	blk-cgroup: remove blkcg_bio_issue_check blkcg_bio_issue_check is a giant inline function that does three entirely different things. Factor out the blk-cgroup related bio initalization into a new helper, and the open code the sequence in the only caller, relying on the fact that all the actual functionality is stubbed out for non-cgroup builds. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-29 09:09:08 -06:00
Christoph Hellwig	93b8063804	blk-cgroup: move rcu locking from blkcg_bio_issue_check to blk_throtl_bio The only thing in blkcg_bio_issue_check that needs to be under rcu_read_lock is blk_throtl_bio, so move the locking there. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-29 09:09:08 -06:00
Christoph Hellwig	13c7863d48	block: move the initial blkg lookup into blkg_tryget_closest By moving the initial blkg lookup into blkg_tryget_closest we get a nicely self contained routines that does all the RCU locking. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-29 09:09:08 -06:00
Christoph Hellwig	a5b97526bf	block: bypass blkg_tryget_closest for the root_blkg The root_blkg is only torn down at the very end of removing a queue. So in the I/O submission path is always has a life reference and we can just grab another one using blkg_get instead of doing a tryget and parent walk that won't lead anywhere. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-29 09:09:08 -06:00
Christoph Hellwig	8c54628752	block: merge blkg_lookup_create and __blkg_lookup_create No good reason to keep these two functions split. Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-29 09:09:08 -06:00
Christoph Hellwig	28fc591ff9	block: move the bio cgroup associatation helpers to blk-cgroup.c Keep the cgroup code together. Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-29 09:09:08 -06:00
Christoph Hellwig	a18b9b1590	block: move bio_associate_blkg_from_page to mm/page_io.c bio_associate_blkg_from_page is a special purpose helper for swap bios that doesn't need access to bio internals. Move it to the swap code instead of having it in bio.c. Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-29 09:09:08 -06:00
Christoph Hellwig	2badf06cf9	block: merge __bio_associate_blkg into bio_associate_blkg_from_css Merge __bio_associate_blkg into the only caller, which allows to slightly reduce the RCU crticial section and better explain the code flow. Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-29 09:09:08 -06:00
Christoph Hellwig	d92c370a16	block: really clone the block cgroup in bio_clone_blkg_association bio_clone_blkg_association is supposed to clone the associatation, but actually ends up doing a search with a tryget. As we know we have a reference on the source cgroup just get an unconditional additional reference to it and call it a day. That also removes the need for a RCU critical section. Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-29 09:09:08 -06:00
Christoph Hellwig	db9819c76c	block: remove bio_disassociate_blkg bio_disassociate_blkg has two callers, of which one immediately assigns a new value to >bi_blkg. Just open code the function in the two callers. Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-29 09:09:08 -06:00
Hou Tao	bfe373f608	blk-mq-debugfs: update blk_queue_flag_name[] accordingly for new flags Else there may be magic numbers in /sys/kernel/debug/block/*/state. Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-29 07:45:09 -06:00
Guo Xuenan	826f2f48da	blk-rq-qos: remove redundant finish_wait to rq_qos_wait. It is no need do finish_wait twice after acquiring inflight. Signed-off-by: Guo Xuenan <guoxuenan@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-28 08:11:14 -06:00
Linus Torvalds	9b8d020796	block-5.8-2020-06-26 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl72TjIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvxcD/9i6yjjq7Qzx9pUIaowcCah0PTUfqNuXoL/ muA01DbynjcP3uxP5XxG416z5rPmLDRLtof6QoV/8tbwxQUsg2RaZW9i/OADYrq0 qnISlNfQ0+rlLkV1v7S+WWM2npYGoh2j+WizmeNcHYMFFo8ueds7gUM9usFkx+dw 3RXUGxColF18uXizjRYMlLgxqddNmC1H7B/Z7Y3kooRuqYcd56QXrh/gDLQuzo0e SnBybTYyIiUSsMakyoRBcYleSJu6mLQQ/BT665tkdWgQpwFaWQ7nwYtKMwXee/Ul uRyKnTK4tGnp66PCt6nDGu5Ud3IQkWqlXJvqmN/5Cggs3pWklzO+HZkxFusJOTtS NqDIs7vkVMcgi5LxXUIb5+uRqMSYXLmfhv3iyJ11/fZlT8mv6SaQJfWHo2jDJvz9 CuLEr4+auRPrcXRau8FNySnssJ3NZ4iuEH4CI0r+Zgzdm7C3kmEj9w16t8/CNuCW s3/EyyCBvwPnPGYJEukYirVoVPKQL1Pn5hHqtStyWfFH0lUlhl/GUXc0S8Qhl9YU cOBRGxjR1aIv65kK9zWeSpNq9lZCLCWeACFbA/4nIdhURtxdiH8nVW38qdVcGM3/ nr+KKTBCdOeK9iTt64XIuqIRX2J3p2NjGzExAugmlBQzeAqGbcZgCvX/WYK5Roay hel6eOLY4A== =pQH7 -----END PGP SIGNATURE----- Merge tag 'block-5.8-2020-06-26' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - NVMe pull request from Christoph: - multipath deadlock fixes (Anton) - NUMA fixes (Max) - RDMA completion vector fix (Max) - IO deadlock fix (Sagi) - multipath reference fix (Sagi) - NS mutation fix (Sagi) - Use right allocator when freeing bip in error path (Chengguang) * tag 'block-5.8-2020-06-26' of git://git.kernel.dk/linux-block: nvme-multipath: fix bogus request queue reference put nvme-multipath: fix deadlock due to head->lock nvme: don't protect ns mutation with ns->head->lock nvme-multipath: fix deadlock between ana_work and scan_work nvme: fix possible deadlock when I/O is blocked nvme-rdma: assign completion vector correctly nvme-loop: initialize tagset numa value to the value of the ctrl nvme-tcp: initialize tagset numa value to the value of the ctrl nvme-pci: initialize tagset numa value to the value of the ctrl nvme-pci: override the value of the controller's numa node nvme: set initial value for controller's numa node block: release bip in a right way in error path	2020-06-27 08:59:32 -07:00
Jan Kara	f3bdc62fd8	blktrace: Provide event for request merging Currently blk-mq does not report any event when two requests get merged in the elevator. This then results in difficult to understand sequence of events like: ... 8,0 34 1579 0.608765271 2718 I WS 215023504 + 40 [dbench] 8,0 34 1584 0.609184613 2719 A WS 215023544 + 56 <- (8,4) 2160568 8,0 34 1585 0.609184850 2719 Q WS 215023544 + 56 [dbench] 8,0 34 1586 0.609188524 2719 G WS 215023544 + 56 [dbench] 8,0 3 602 0.609684162 773 D WS 215023504 + 96 [kworker/3:1H] 8,0 34 1591 0.609843593 0 C WS 215023504 + 96 [0] and you can only guess (after quite some headscratching since the above excerpt is intermixed with a lot of other IO) that request 215023544+56 got merged to request 215023504+40. Provide proper event for request merging like we used to do in the legacy block layer. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-25 21:06:11 -06:00
Gustavo A. R. Silva	f61d6e259c	blk-iocost: Use struct_size() in kzalloc_node() Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. This code was detected with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Addresses-KSPP-ID: https://github.com/KSPP/linux/issues/83 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:58 -06:00
Gustavo A. R. Silva	1f4fe21cf4	block: bio: Use struct_size() in kmalloc() Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. This code was detected with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Addresses-KSPP-ID: https://github.com/KSPP/linux/issues/83 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:58 -06:00
Luis Chamberlain	85e0cbbb8a	block: create the request_queue debugfs_dir on registration We were only creating the request_queue debugfs_dir only for make_request block drivers (multiqueue), but never for request-based block drivers. We did this as we were only creating non-blktrace additional debugfs files on that directory for make_request drivers. However, since blktrace always creates that directory anyway, we special-case the use of that directory on blktrace. Other than this being an eye-sore, this exposes request-based block drivers to the same debugfs fragile race that used to exist with make_request block drivers where if we start adding files onto that directory we can later run a race with a double removal of dentries on the directory if we don't deal with this carefully on blktrace. Instead, just simplify things by always creating the request_queue debugfs_dir on request_queue registration. Rename the mutex also to reflect the fact that this is used outside of the blktrace context. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:58 -06:00
Luis Chamberlain	e8c7d14ac6	block: revert back to synchronous request_queue removal Commit `dc9edc44de` ("block: Fix a blk_exit_rl() regression") merged on v4.12 moved the work behind blk_release_queue() into a workqueue after a splat floated around which indicated some work on blk_release_queue() could sleep in blk_exit_rl(). This splat would be possible when a driver called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue() as its final call) from an atomic context. blk_put_queue() decrements the refcount for the request_queue kobject, and upon reaching 0 blk_release_queue() is called. Although blk_exit_rl() is now removed through commit `db6d995235` ("block: remove request_list code") on v5.0, we reserve the right to be able to sleep within blk_release_queue() context. The last reference for the request_queue must not be called from atomic context. When the last reference to the request_queue reaches 0 varies, and so let's take the opportunity to document when that is expected to happen and also document the context of the related calls as best as possible so we can avoid future issues, and with the hopes that the synchronous request_queue removal sticks. We revert back to synchronous request_queue removal because asynchronous removal creates a regression with expected userspace interaction with several drivers. An example is when removing the loopback driver, one uses ioctls from userspace to do so, but upon return and if successful, one expects the device to be removed. Likewise if one races to add another device the new one may not be added as it is still being removed. This was expected behavior before and it now fails as the device is still present and busy still. Moving to asynchronous request_queue removal could have broken many scripts which relied on the removal to have been completed if there was no error. Document this expectation as well so that this doesn't regress userspace again. Using asynchronous request_queue removal however has helped us find other bugs. In the future we can test what could break with this arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE. While at it, update the docs with the context expectations for the request_queue / gendisk refcount decrement, and make these expectations explicit by using might_sleep(). Fixes: `dc9edc44de` ("block: Fix a blk_exit_rl() regression") Suggested-by: Nicolai Stange <nstange@suse.de> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Omar Sandoval <osandov@fb.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Nicolai Stange <nstange@suse.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: yu kuai <yukuai3@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:58 -06:00
Luis Chamberlain	763b58923a	block: clarify context for refcount increment helpers Let us clarify the context under which the helpers to increment the refcount for the gendisk and request_queue can be called under. We make this explicit on the places where we may sleep with might_sleep(). We don't address the decrement context yet, as that needs some extra work and fixes, but will be addressed in the next patch. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:58 -06:00
Luis Chamberlain	b5bd357cf8	block: add docs for gendisk / request_queue refcount helpers This adds documentation for the gendisk / request_queue refcount helpers. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:57 -06:00
Christoph Hellwig	40d09b53bf	blk-mq: add a new blk_mq_complete_request_remote API This is a variant of blk_mq_complete_request_remote that only completes the request if it needs to be bounced to another CPU or a softirq. If the request can be completed locally the function returns false and lets the driver complete it without requring and indirect function call. Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:57 -06:00
Christoph Hellwig	963395269c	blk-mq: factor out a blk_mq_complete_need_ipi helper Add a helper to decide if we can complete locally or need an IPI. Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:57 -06:00
Christoph Hellwig	4c8fc19686	blk-mq: remove the get_cpu/put_cpu pair in blk_mq_complete_request We don't really care if we get migrated during the I/O completion. In the worth case we either perform an IPI that wasn't required, or complete the request on a CPU which we just migrated off. Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:57 -06:00
Christoph Hellwig	15f73f5b3e	blk-mq: move failure injection out of blk_mq_complete_request Move the call to blk_should_fake_timeout out of blk_mq_complete_request and into the drivers, skipping call sites that are obvious error handlers, and remove the now superflous blk_mq_force_complete_rq helper. This ensures we don't keep injecting errors into completions that just terminate the Linux request after the hardware has been reset or the command has been aborted. Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:57 -06:00
Christoph Hellwig	d391a7a399	blk-mq: merge the softirq vs non-softirq IPI logic Both the softirq path for single queue devices and the multi-queue completion handler share the same logic to figure out if we need an IPI for the completion and eventually issue it. Merge the two versions into a single unified code path. Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:57 -06:00
Christoph Hellwig	d6cc464cc5	blk-mq: short cut the IPI path in blk_mq_force_complete_rq for !SMP Let the compile optimize out the entire IPI path, given that we are obviously not going to use it. Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:57 -06:00
Christoph Hellwig	6aab1da603	blk-mq: complete polled requests directly Even for single queue devices there is no point in offloading a polled completion to the softirq, given that blk_mq_force_complete_rq is called from the polling thread in that case and thus there are no starvation issues. Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:57 -06:00
Christoph Hellwig	dea6f39938	blk-mq: remove raise_blk_irq By open coding raise_blk_irq in the only caller, and replacing the ifdef CONFIG_SMP with an IS_ENABLED check the flow in the caller can be significantly simplified. Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:57 -06:00
Christoph Hellwig	115243f553	blk-mq: factor out a helper to reise the block softirq Add a helper to deduplicate the logic that raises the block softirq. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:56 -06:00
Christoph Hellwig	c3077b5d97	blk-mq: merge blk-softirq.c into blk-mq.c __blk_complete_request is only called from the blk-mq code, and duplicates a lot of code from blk-mq.c. Move it there to prepare for better code sharing and simplifications. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:56 -06:00
Chengguang Xu	0b8eb629a7	block: release bip in a right way in error path Release bip using kfree() in error path when that was allocated by kmalloc(). Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 08:49:07 -06:00
Jens Axboe	5a473e8311	block: provide plug based way of signaling forced no-wait semantics Provide a way for the caller to specify that IO should be marked with REQ_NOWAIT to avoid blocking on allocation. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-21 20:44:25 -06:00
Linus Torvalds	d2b1c81f5f	block-5.8-2020-06-19 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7s0SAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpp+YEACVqFvsfzxKCqa61IzyuOaPfnj9awyP+MY2 7V6y9sDDHL8sp6aPDbHvqFnqz0O7E+7nHVZD2rf2qc6tKKMvJYNO/BFZSXPvWTZV KQ4cBChf/LDwqAKOnI4ZhmF5UcSyyob1yMy4uJ+U0gQiXXrRMbwJ3N1K24a9dr4c epkzGavR0Q+PJ9BbUgjACjbRdT+vrP4bOu0cuyCGkIpD9eCerKJ6mFaUAj0FDthD bg4BJj+c8Ij6LO0V++Wga6OxccmL43KeP0ky8B3x07PfAl+tDWqsbHSlU2YPtdcq 5nKgMMTW16mVnZeO2/W0JB7tn89VubsmyvIFcm2KNeeRqSnEZyW9HI8n4kq994Ju xMH24lgbsU4trNeYkgOmzPoJJZ+LShkn+rnldyI1U/fhpEYub7DqfVySuT7ti9in uFpQdeRUmPsdw92F3+o6h8OYAflpcQQ7CblkzxPEeV4OyzOZasb+S9tMNPe59KBh 0MtHv9IfzgtDihR6HuXifitXaP+GtH4x3D2z0dzEdooHKHC/+P3WycS5daG+3WKQ xV5lJruvpTuxhXKLFAH0wRrxnVlB0VUvhQ21T3WgHrwF0btbdmQMHFc83XOxBIB4 jHWJMHGc4xp1ZdpWFBC8Cj79OmJh1w/ao8+/cf8SUoTB0LzFce1B8LvwnxgpcpUk VjIOrl7zhQ== =LeLd -----END PGP SIGNATURE----- Merge tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - Use import_uuid() where appropriate (Andy) - bcache fixes (Coly, Mauricio, Zhiqiang) - blktrace sparse warnings fix (Jan) - blktrace concurrent setup fix (Luis) - blkdev_get use-after-free fix (Jason) - Ensure all blk-mq maps are updated (Weiping) - Loop invalidate bdev fix (Zheng) * tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-block: block: make function 'kill_bdev' static loop: replace kill_bdev with invalidate_bdev partitions/ldm: Replace uuid_copy() with import_uuid() where it makes sense block: update hctx map when use multiple maps blktrace: Avoid sparse warnings when assigning q->blk_trace blktrace: break out of blktrace setup on concurrent calls block: Fix use-after-free in blkdev_get() trace/events/block.h: drop kernel-doc for dropped function parameter blk-mq: Remove redundant 'return' statement bcache: pr_info() format clean up in bcache_device_init() bcache: use delayed kworker fo asynchronous devices registration bcache: check and adjust logical block size for backing devices bcache: fix potential deadlock problem in btree_gc_coalesce	2020-06-19 13:11:26 -07:00
Andy Shevchenko	bc163c2046	partitions/ldm: Replace uuid_copy() with import_uuid() where it makes sense There is a specific API to treat raw data as UUID, i.e. import_uuid(). Use it instead of uuid_copy() with explicit casting. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-18 09:17:54 -06:00
Weiping Zhang	fe35ec58f0	block: update hctx map when use multiple maps There is an issue when tune the number for read and write queues, if the total queue count was not changed. The hctx->type cannot be updated, since __blk_mq_update_nr_hw_queues will return directly if the total queue count has not been changed. Reproduce: dmesg \| grep "default/read/poll" [ 2.607459] nvme nvme0: 48/0/0 default/read/poll queues cat /sys/kernel/debug/block/nvme0n1/hctx/type \| sort \| uniq -c 48 default tune the write queues to 24: echo 24 > /sys/module/nvme/parameters/write_queues echo 1 > /sys/block/nvme0n1/device/reset_controller dmesg \| grep "default/read/poll" [ 433.547235] nvme nvme0: 24/24/0 default/read/poll queues cat /sys/kernel/debug/block/nvme0n1/hctx/type \| sort \| uniq -c 48 default The driver's hardware queue mapping is not same as block layer. Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-17 21:33:04 -06:00
Gustavo A. R. Silva	8a631c26bd	block: Replace zero-length array with flexible-array There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://github.com/KSPP/linux/issues/21 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>	2020-06-15 23:08:32 -05:00
Baolin Wang	a8a5e383cf	blk-mq: Remove redundant 'return' statement The blk_mq_all_tag_iter() is a void function, thus remove the redundant 'return' statement in this function. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-15 08:34:43 -06:00
Linus Torvalds	6adc19fd13	Kbuild updates for v5.8 (2nd) - fix build rules in binderfs sample - fix build errors when Kbuild recurses to the top Makefile - covert '---help---' in Kconfig to 'help' -----BEGIN PGP SIGNATURE----- iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl7lBuYVHG1hc2FoaXJv eUBrZXJuZWwub3JnAAoJED2LAQed4NsGHvIP/3iErjPshpg/phwH8NTCS4SFkiti BZRM+2lupSn7Qs53BTpVzIkXoHBJQZlJxlQ5HY8ScO+fiz28rKZr+b40us+je1Q+ SkvSPfwZzxjEg7lAZutznG4KgItJLWJKmDyh9T8Y8TAuG4f8WO0hKnXoAp3YorS2 zppEIxso8O5spZPjp+fF/fPbxPjIsabGK7Jp2LpSVFR5pVDHI/ycTlKQS+MFpMEx 6JIpdFRw7TkvKew1dr5uAWT5btWHatEqjSR3JeyVHv3EICTGQwHmcHK67cJzGInK T51+DT7/CpKtmRgGMiTEu/INfMzzoQAKl6Fcu+vMaShTN97Hk9DpdtQyvA6P/h3L 8GA4UBct05J7fjjIB7iUD+GYQ0EZbaFujzRXLYk+dQqEJRbhcCwvdzggGp0WvGRs 1f8/AIpgnQv8JSL/bOMgGMS5uL2dSLsgbzTdr6RzWf1jlYdI1i4u7AZ/nBrwWP+Z iOBkKsVceEoJrTbaynl3eoYqFLtWyDau+//oBc2gUvmhn8ioM5dfqBRiJjxJnPG9 /giRj6xRIqMMEw8Gg8PCG7WebfWxWyaIQwlWBbPok7DwISURK5mvOyakZL+Q25/y 6MBr2H8NEJsf35q0GTINpfZnot7NX4JXrrndJH8NIRC7HEhwd29S041xlQJdP0rs E76xsOr3hrAmBu4P =1NIT -----END PGP SIGNATURE----- Merge tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull more Kbuild updates from Masahiro Yamada: - fix build rules in binderfs sample - fix build errors when Kbuild recurses to the top Makefile - covert '---help---' in Kconfig to 'help' * tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: treewide: replace '---help---' in Kconfig files with 'help' kbuild: fix broken builds because of GZIP,BZIP2,LZOP variables samples: binderfs: really compile this sample and fix build issues	2020-06-13 13:29:16 -07:00
Masahiro Yamada	a7f7f6248d	treewide: replace '---help---' in Kconfig files with 'help' Since commit `84af7a6194` ("checkpatch: kconfig: prefer 'help' over '---help---'"), the number of '---help---' has been gradually decreasing, but there are still more than 2400 instances. This commit finishes the conversion. While I touched the lines, I also fixed the indentation. There are a variety of indentation styles found. a) 4 spaces + '---help---' b) 7 spaces + '---help---' c) 8 spaces + '---help---' d) 1 space + 1 tab + '---help---' e) 1 tab + '---help---' (correct indentation) f) 1 tab + 1 space + '---help---' g) 1 tab + 2 spaces + '---help---' In order to convert all of them to 1 tab + 'help', I ran the following commend: $ find . -name 'Kconfig' \| xargs sed -i 's/^[[:space:]]---help---/\thelp/' Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>	2020-06-14 01:57:21 +09:00
Ming Lei	22f614bc0f	blk-mq: fix blk_mq_all_tag_iter blk_mq_all_tag_iter() is added to iterate all requests, so we should fetch the request from ->static_rqs][] instead of ->rqs[] which is for holding in-flight request only. Fix it by adding flag of BT_TAG_ITER_STATIC_RQS. Fixes: `bf0beec060` ("blk-mq: drain I/O when all CPUs in a hctx are offline") Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: John Garry <john.garry@huawei.com> Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Daniel Wagner <dwagner@suse.de> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-07 08:56:50 -06:00
Christoph Hellwig	d94ecfc399	blk-mq: split out a __blk_mq_get_driver_tag helper Allocation of the driver tag in the case of using a scheduler shares very little code with the "normal" tag allocation. Split out a new helper to streamline this path, and untangle it from the complex normal tag allocation. This way also avoids to fail driver tag allocation because of inactive hctx during cpu hotplug, and fixes potential hang risk. Fixes: `bf0beec060` ("blk-mq: drain I/O when all CPUs in a hctx are offline") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: John Garry <john.garry@huawei.com> Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-07 08:56:50 -06:00
Ahmed S. Darwish	15b81ce5ab	block: nr_sects_write(): Disable preemption on seqcount write For optimized block readers not holding a mutex, the "number of sectors" 64-bit value is protected from tearing on 32-bit architectures by a sequence counter. Disable preemption before entering that sequence counter's write side critical section. Otherwise, the read side can preempt the write side section and spin for the entire scheduler tick. If the reader belongs to a real-time scheduling class, it can spin forever and the kernel will livelock. Fixes: `c83f6bf98d` ("block: add partition resize function to blkpg ioctl") Cc: <stable@vger.kernel.org> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-04 21:22:28 -06:00
Christoph Hellwig	d24de76af8	block: remove the error argument to the block_bio_complete tracepoint The status can be trivially derived from the bio itself. That also avoid callers like NVMe to incorrectly pass a blk_status_t instead of the errno, and the overhead of translating the blk_status_t to the errno in the I/O completion fast path when no tracing is enabled. Fixes: `35fe0d12c8` ("nvme: trace bio completion") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-04 21:16:11 -06:00
yu kuai	a75ca93031	block/bio-integrity: don't free 'buf' if bio_integrity_add_page() failed commit `e7bf90e5af` ("block/bio-integrity: fix a memory leak bug") added a kfree() for 'buf' if bio_integrity_add_page() returns '0'. However, the object will be freed in bio_integrity_free() since 'bio->bi_opf' and 'bio->bi_integrity' were set previousy in bio_integrity_alloc(). Fixes: commit `e7bf90e5af` ("block/bio-integrity: fix a memory leak bug") Signed-off-by: yu kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-02 17:21:50 -06:00
Linus Torvalds	bce159d734	for-5.8/drivers-2020-06-01 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7VPc4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpgQkEACnQlzWOfNQMz1AzgUAv/S8IYDJCLrkbjLZ JK4pJv8Hjhss/7sS+fd8kyKe9VtaZz2IjmrXcC66RMMwtpx4iHnkRffoNAgEdGOl /M5TCZGhs+F/mp3Lc0WdR5DFHkM6yy2Tkk9wCFLreB4bW67janAWnd7nbU4INqJj +WqIgpzNMc/kfUhpBYTeQLORhL4e2TG9ADTi/zeUITlpnEsA65LOgXKEpeIFYnSX KTl4GIZ9tjazG3Y1Eva7DYHDIErNNAtX67KBqf+WBgMV98eB0O6xIPN1WlmhDTqj FGMLkb8msH1HHntvxDAuc4/ortnUy8vPI4o6zKP89HJJNjIM5p5eHEuVF5JnBw42 Rtu9Om6JqWx51nhAhJNBj9bUStYbhEl0vVQCwbkfPbDJhzTy3RR8z709q9+ZwOrL xbp4aJBzqrzscjBEiSQbNCf2PyuOAdU0r1x81UN81ZN41d5qUcumcinjw4Y7vru8 z5zMlo1Iy/AWQYyu7jgHmnpI7ZyA/1Qclo5dV7aa72bLFaJa35e7QxgfQOFBA5dY UZl6QPJRlnB80uGRzD5jCh2O2sQ3XZqYnpaKsUAka1GgbceCp9IC4A5mfZvpACsh Xk8VXjlhvY/iPJsKLqrh4Oedg4Dj5M3PLL9C3MDfYeIP2qgXpbnk87UV1TPNSpY0 QcTxsXXXIw== =H+/Z -----END PGP SIGNATURE----- Merge tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: "On top of the core changes, here are the block driver changes for this merge window: - NVMe changes: - NVMe over Fibre Channel protocol updates, which also reach over to drivers/scsi/lpfc (James Smart) - namespace revalidation support on the target (Anthony Iliopoulos) - gcc zero length array fix (Arnd Bergmann) - nvmet cleanups (Chaitanya Kulkarni) - misc cleanups and fixes (me, Keith Busch, Sagi Grimberg) - use a SRQ per completion vector (Max Gurtovoy) - fix handling of runtime changes to the queue count (Weiping Zhang) - t10 protection information support for nvme-rdma and nvmet-rdma (Israel Rukshin and Max Gurtovoy) - target side AEN improvements (Chaitanya Kulkarni) - various fixes and minor improvements all over, icluding the nvme part of the lpfc driver" - Floppy code cleanup series (Willy, Denis) - Floppy contention fix (Jiri) - Loop CONFIGURE support (Martijn) - bcache fixes/improvements (Coly, Joe, Colin) - q->queuedata cleanups (Christoph) - Get rid of ioctl_by_bdev (Christoph, Stefan) - md/raid5 allocation fixes (Coly) - zero length array fixes (Gustavo) - swim3 task state fix (Xu)" * tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block: (166 commits) bcache: configure the asynchronous registertion to be experimental bcache: asynchronous devices registration bcache: fix refcount underflow in bcache_device_free() bcache: Convert pr_<level> uses to a more typical style bcache: remove redundant variables i and n lpfc: Fix return value in __lpfc_nvme_ls_abort lpfc: fix axchg pointer reference after free and double frees lpfc: Fix pointer checks and comments in LS receive refactoring nvme: set dma alignment to qword nvmet: cleanups the loop in nvmet_async_events_process nvmet: fix memory leak when removing namespaces and controllers concurrently nvmet-rdma: add metadata/T10-PI support nvmet: add metadata support for block devices nvmet: add metadata/T10-PI support nvme: add Metadata Capabilities enumerations nvmet: rename nvmet_check_data_len to nvmet_check_transfer_len nvmet: rename nvmet_rw_len to nvmet_rw_data_len nvmet: add metadata characteristics for a namespace nvme-rdma: add metadata/T10-PI support nvme-rdma: introduce nvme_rdma_sgl structure ...	2020-06-02 15:37:03 -07:00
Linus Torvalds	750a02ab8d	for-5.8/block-2020-06-01 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7VOwMQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpoR7EADAlz3TCkb4wwuHytTBDrm6gVDdsJ9zUfQW Cl2ASLtufA8PWZUCEI3vhFyOe6P5e+ZZ0O2HjljSevmHyogCaRYXFYVfbWKcQKuk AcxiTgnYNevh8KbGLfJY1WL4eXsY+C3QUGivg35cCgrx+kr9oDaHMeqA9Tm1plyM FSprDBoSmHPqRxiV/1gnr8uXLX6K7i/fHzwmKgySMhavum7Ma8W3wdAGebzvQwrO SbFSuJVgz06e4B1Fzr/wSvVNUE/qW/KqfGuQKIp7VQFIywbgG7TgRMHjE1FSnpnh gn+BfL+O5gc0sTvcOTGOE0SRWWwLx961WNg8Azq08l3fzsxLA6h8/AnoDf3i+QMA rHmLpWZIic2xPSvjaFHX3/V9ITyGYeAMpAR77EL+4ivWrKv5JrBhnSLDt1fKILdg 5elxm7RDI+C4nCP4xuTlVCy5gCd6gwjgytKj+NUWhNq1WiGAD0B54SSiV+SbCSH6 Om2f5trcxz8E4pqWcf0k3LjFapVKRNV8v/+TmVkCdRPBl3y9P0h0wFTkkcEquqnJ y7Yq6efdWviRCnX5w/r/yj0qBuk4xo5hMVsPmlthCWtnBm+xZQ6LwMRcq4HQgZgR 2SYNscZ3OFMekHssH7DvY4DAy1J+n83ims+KzbScbLg2zCZjh/scQuv38R5Eh9WZ rCS8c+T7Ig== =HYf4 -----END PGP SIGNATURE----- Merge tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: "Core block changes that have been queued up for this release: - Remove dead blk-throttle and blk-wbt code (Guoqing) - Include pid in blktrace note traces (Jan) - Don't spew I/O errors on wouldblock termination (me) - Zone append addition (Johannes, Keith, Damien) - IO accounting improvements (Konstantin, Christoph) - blk-mq hardware map update improvements (Ming) - Scheduler dispatch improvement (Salman) - Inline block encryption support (Satya) - Request map fixes and improvements (Weiping) - blk-iocost tweaks (Tejun) - Fix for timeout failing with error injection (Keith) - Queue re-run fixes (Douglas) - CPU hotplug improvements (Christoph) - Queue entry/exit improvements (Christoph) - Move DMA drain handling to the few drivers that use it (Christoph) - Partition handling cleanups (Christoph)" * tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits) block: mark bio_wouldblock_error() bio with BIO_QUIET blk-wbt: rename __wbt_update_limits to wbt_update_limits blk-wbt: remove wbt_update_limits blk-throttle: remove tg_drain_bios blk-throttle: remove blk_throtl_drain null_blk: force complete for timeout request blk-mq: drain I/O when all CPUs in a hctx are offline blk-mq: add blk_mq_all_tag_iter blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx blk-mq: use BLK_MQ_NO_TAG in more places blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG blk-mq: move more request initialization to blk_mq_rq_ctx_init blk-mq: simplify the blk_mq_get_request calling convention blk-mq: remove the bio argument to ->prepare_request nvme: force complete cancelled requests blk-mq: blk-mq: provide forced completion method block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds block: blk-crypto-fallback: remove redundant initialization of variable err block: reduce part_stat_lock() scope block: use __this_cpu_add() instead of access by smp_processor_id() ...	2020-06-02 15:29:19 -07:00
Matthew Wilcox (Oracle)	cee9a0c4e8	mm: move readahead prototypes from mm.h Patch series "Change readahead API", v11. This series adds a readahead address_space operation to replace the readpages operation. The key difference is that pages are added to the page cache as they are allocated (and then looked up by the filesystem) instead of passing them on a list to the readpages operation and having the filesystem add them to the page cache. It's a net reduction in code for each implementation, more efficient than walking a list, and solves the direct-write vs buffered-read problem reported by yu kuai at http://lkml.kernel.org/r/20200116063601.39201-1-yukuai3@huawei.com The only unconverted filesystems are those which use fscache. Their conversion is pending Dave Howells' rewrite which will make the conversion substantially easier. This should be completed by the end of the year. I want to thank the reviewers/testers; Dave Chinner, John Hubbard, Eric Biggers, Johannes Thumshirn, Dave Sterba, Zi Yan, Christoph Hellwig and Miklos Szeredi have done a marvellous job of providing constructive criticism. These patches pass an xfstests run on ext4, xfs & btrfs with no regressions that I can tell (some of the tests seem a little flaky before and remain flaky afterwards). This patch (of 25): The readahead code is part of the page cache so should be found in the pagemap.h file. force_page_cache_readahead is only used within mm, so move it to mm/internal.h instead. Remove the parameter names where they add no value, and rename the ones which were actively misleading. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-1-willy@infradead.org Link: http://lkml.kernel.org/r/20200414150233.24495-2-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-02 10:59:06 -07:00
Guoqing Jiang	4d89e1d112	blk-wbt: rename __wbt_update_limits to wbt_update_limits Now let's rename __wbt_update_limits to wbt_update_limits after the previous one is deleted. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-29 16:30:39 -06:00
Guoqing Jiang	26e0ca12e0	blk-wbt: remove wbt_update_limits No one call this function after commit `2af2783f2e` ("rq-qos: get rid of redundant wbt_update_limits()"), so remove it. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-29 16:30:39 -06:00
Guoqing Jiang	32e3374304	blk-throttle: remove tg_drain_bios After blk_throtl_drain is removed, there is no caller of tg_drain_bios, so remove it as well. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-29 16:30:39 -06:00
Guoqing Jiang	b77412372b	blk-throttle: remove blk_throtl_drain After the commit `5addeae1be` ("blk-cgroup: remove blkcg_drain_queue"), there is no caller of blk_throtl_drain, so let's remove it. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-29 16:30:39 -06:00
Ming Lei	bf0beec060	blk-mq: drain I/O when all CPUs in a hctx are offline Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup up queue mapping. Thomas mentioned the following point[1]: "That was the constraint of managed interrupts from the very beginning: The driver/subsystem has to quiesce the interrupt line and the associated queue _before_ it gets shutdown in CPU unplug and not fiddle with it until it's restarted by the core when the CPU is plugged in again." However, current blk-mq implementation doesn't quiesce hw queue before the last CPU in the hctx is shutdown. Even worse, CPUHP_BLK_MQ_DEAD is a cpuhp state handled after the CPU is down, so there isn't any chance to quiesce the hctx before shutting down the CPU. Add new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs where the last CPU goes away, and wait for completion of in-flight requests. This guarantees that there is no inflight I/O before shutting down the managed IRQ. Add a BLK_MQ_F_STACKING and set it for dm-rq and loop, so we don't need to wait for completion of in-flight requests from these drivers to avoid a potential dead-lock. It is safe to do this for stacking drivers as those do not use interrupts at all and their I/O completions are triggered by underlying devices I/O completion. [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/ [hch: different retry mechanism, merged two patches, minor cleanups] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-29 10:23:25 -06:00
Ming Lei	602380d28e	blk-mq: add blk_mq_all_tag_iter Add a new blk_mq_all_tag_iter function to iterate over all allocated scheduler tags and driver tags. This is more flexible than the existing blk_mq_all_tag_busy_iter function as it allows the callers to do whatever they want on allocated request instead of being limited to started requests. It will be used to implement draining allocated requests on specified hctx in this patchset. [hch: switch from the two booleans to a more readable flags field and consolidate the tags iter functions] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Bart van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-29 10:23:25 -06:00
Christoph Hellwig	600c3b0cea	blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx blk_mq_alloc_request_hctx is only used for NVMeoF connect commands, so tailor it to the specific requirements, and don't bother the general fast path code with its special twinkles. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-29 10:23:25 -06:00
Christoph Hellwig	766473681c	blk-mq: use BLK_MQ_NO_TAG in more places Replace various magic -1 constants for tags with BLK_MQ_NO_TAG. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-29 10:23:25 -06:00
Christoph Hellwig	419c3d5e80	blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG To prepare for wider use of this constant give it a more applicable name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-29 10:23:25 -06:00
Christoph Hellwig	7ea4d8a4d6	blk-mq: move more request initialization to blk_mq_rq_ctx_init Don't split request initialization between __blk_mq_alloc_request and blk_mq_rq_ctx_init. Also remove the op argument as it can be derived from the blk_mq_alloc_data structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-29 10:23:24 -06:00
Christoph Hellwig	e6e7abffe3	blk-mq: simplify the blk_mq_get_request calling convention The bio argument is entirely unused, and the request_queue can be passed through the alloc_data, given that it needs to be filled out for the low-level tag allocation anyway. Also rename the function to __blk_mq_alloc_request as the switch between get and alloc in the call chains is rather confusing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-29 10:23:24 -06:00
Christoph Hellwig	5d9c305b8e	blk-mq: remove the bio argument to ->prepare_request None of the I/O schedulers actually needs it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-29 10:23:24 -06:00
Keith Busch	7b11eab041	blk-mq: blk-mq: provide forced completion method Drivers may need to bypass error injection for error recovery. Rename __blk_mq_complete_request() to blk_mq_force_complete_rq() and export that function so drivers may skip potential fake timeouts after they've reclaimed lost requests. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-29 10:21:59 -06:00
Jens Axboe	b0beb28097	Revert "block: end bio with BLK_STS_AGAIN in case of non-mq devs and REQ_NOWAIT" This reverts commit `c58c1f8343`. io_uring does do the right thing for this case, and we're still returning -EAGAIN to userspace for the cases we don't support. Revert this change to avoid doing endless spins of resubmits. Cc: stable@vger.kernel.org # v5.6 Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-28 13:20:39 -06:00
Colin Ian King	e7ecc142e9	block: blk-crypto-fallback: remove redundant initialization of variable err The variable err is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Satya Tangirala <satyat@google.com> Addresses-Coverity: ("Unused value") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-27 05:41:59 -06:00
Christoph Hellwig	524f9ffd6a	block: reduce part_stat_lock() scope We only need the stats lock (aka preempt_disable()) for updating the states, not for looking up or dropping the hd_struct reference. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-27 05:21:23 -06:00
Konstantin Khlebnikov	8ab1d40a64	block: remove rcu_read_lock() from part_stat_lock() The RCU lock is required only in disk_map_sector_rcu() to lookup the partition. After that request holds reference to related hd_struct. Replace get_cpu() with preempt_disable() - returned cpu index is unused. [hch: rebased] Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-27 05:21:23 -06:00
Konstantin Khlebnikov	b5af37ab3a	block: add a blk_account_io_merge_bio helper Move the non-"new_io" branch of blk_account_io_start() into separate function. Fix merge accounting for discards (they were counted as write merges). The new blk_account_io_merge_bio() doesn't call update_io_ticks() unlike blk_account_io_start(), as there is no reason for that. [hch: rebased] Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-27 05:21:23 -06:00
Konstantin Khlebnikov	b9c54f5660	block: account merge of two requests Also rename blk_account_io_merge() into blk_account_io_merge_request() to distinguish it from merging request and bio. [hch: rebased] Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-27 05:21:23 -06:00
Christoph Hellwig	58d4f14fc3	block: always use a percpu variable for disk stats percpu variables have a perfectly fine working stub implementation for UP kernels, so use that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-27 05:21:23 -06:00
Christoph Hellwig	9123bf6f21	block: move update_io_ticks to blk-core.c All callers are in blk-core.c, so move update_io_ticks over. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-27 05:21:23 -06:00
Christoph Hellwig	e722fff238	block: remove generic_{start,end}_io_acct Remove these now unused functions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-27 05:21:23 -06:00
Christoph Hellwig	956d510ee7	block: add disk/bio-based accounting helpers Add two new helpers to simplify I/O accounting for bio based drivers. Currently these drivers use the generic_start_io_acct and generic_end_io_acct helpers which have very cumbersome calling conventions, don't actually return the time they started accounting, and try to deal with accounting for partitions, which can't happen for bio based drivers. The new helpers will be used to subsequently replace uses of the old helpers. The main API is the bio based wrappes in blkdev.h, but for zram which wants to account rw_page based I/O lower level routines are provided as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-27 05:21:23 -06:00
Christoph Hellwig	c81b49d4d6	block: remove the disk and queue NULL checks in blkdev_issue_flush Both of these never can be NULL for a live block device. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-22 08:45:59 -06:00
Christoph Hellwig	9398554fb3	block: remove the error_sector argument to blkdev_issue_flush The argument isn't used by any caller, and drivers don't fill out bi_sector for flush requests either. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-22 08:45:46 -06:00
Stefan Haberland	26d7e28e38	s390/dasd: remove ioctl_by_bdev calls The IBM partition parser requires device type specific information only available to the DASD driver to correctly register partitions. The current approach of using ioctl_by_bdev with a fake user space pointer is discouraged. Fix this by replacing IOCTL calls with direct in-kernel function calls. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com> Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-21 08:22:20 -06:00
Baolin Wang	172ce41db4	block: Remove unused flush_queue_delayed in struct blk_flush_queue The flush_queue_delayed was introdued to hold queue if flush is running for non-queueable flush drive by commit `3ac0cc4508` ("hold queue if flush is running for non-queueable flush drive"), but the non mq parts of the flush code had been removed by commit `7e992f847a` ("block: remove non mq parts from the flush code"), as well as removing the usage of the flush_queue_delayed flag. Thus remove the unused flush_queue_delayed flag. Signed-off-by: Baolin Wang <baolin.wang7@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-19 09:42:46 -06:00
Bart Van Assche	c8210a5765	block: Fix type of first compat_put_{,u}long() argument This patch fixes the following sparse warnings: block/ioctl.c:209:16: warning: incorrect type in argument 1 (different address spaces) block/ioctl.c:209:16: expected void const volatile [noderef] <asn:1> * block/ioctl.c:209:16: got signed int [usertype] argp block/ioctl.c:214:16: warning: incorrect type in argument 1 (different address spaces) block/ioctl.c:214:16: expected void const volatile [noderef] <asn:1> block/ioctl.c:214:16: got unsigned int [usertype] argp block/ioctl.c:666:40: warning: incorrect type in argument 1 (different address spaces) block/ioctl.c:666:40: expected signed int [usertype] argp block/ioctl.c:666:40: got void [noderef] <asn:1> argp block/ioctl.c:672:41: warning: incorrect type in argument 1 (different address spaces) block/ioctl.c:672:41: expected unsigned int [usertype] argp block/ioctl.c:672:41: got void [noderef] <asn:1> *argp Fixes: `9b81648cb5` ("compat_ioctl: simplify up block/ioctl.c") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-19 09:40:29 -06:00
Christoph Hellwig	10ec5e86f9	block: merge part_{inc,dev}_in_flight into their only callers part_inc_in_flight and part_dec_in_flight only have one caller each, and those callers are purely for bio based drivers. Merge each function into the only caller, and remove the superflous blk-mq checks. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-19 09:35:24 -06:00
Christoph Hellwig	76268f3ac0	block: don't call part_{inc,dec}_in_flight for blk-mq devices part_inc_in_flight and part_dec_in_flight are no-ops for blk-mq queues, so remove the calls in purely blk-mq callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-19 09:35:24 -06:00
Christoph Hellwig	b2f609e191	block: move the blk-mq calls out of part_in_flight{,_rw} Don't bother to call part_in_flight / part_in_flight_rw on blk-mq devices, just call the blk-mq versions directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-19 09:35:24 -06:00
Christoph Hellwig	f1394b7988	block: mark blk_account_io_completion static Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-19 09:35:24 -06:00
Christoph Hellwig	ac7c5675fa	blk-mq: allow blk_mq_make_request to consume the q_usage_counter reference blk_mq_make_request currently needs to grab an q_usage_counter reference when allocating a request. This is because the block layer grabs one before calling blk_mq_make_request, but also releases it as soon as blk_mq_make_request returns. Remove the blk_queue_exit call after blk_mq_make_request returns, and instead let it consume the reference. This works perfectly fine for the block layer caller, just device mapper needs an extra reference as the old problem still persists there. Open code blk_queue_enter_live in device mapper, as there should be no other callers and this allows better documenting why we do a non-try get. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-19 09:34:29 -06:00
Christoph Hellwig	35b371ff01	blk-mq: remove a pointless queue enter pair in blk_mq_alloc_request_hctx No need for two queue references. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-19 09:34:29 -06:00
Christoph Hellwig	22fa792cd8	blk-mq: remove a pointless queue enter pair in blk_mq_alloc_request No need for two queue references. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-19 09:34:29 -06:00
Christoph Hellwig	a5ea581105	blk-mq: move the call to blk_queue_enter_live out of blk_mq_get_request Move the blk_queue_enter_live calls into the callers, where they can successively be cleaned up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-19 09:34:29 -06:00
Satya Tangirala	488f6682c8	block: blk-crypto-fallback for Inline Encryption Blk-crypto delegates crypto operations to inline encryption hardware when available. The separately configurable blk-crypto-fallback contains a software fallback to the kernel crypto API - when enabled, blk-crypto will use this fallback for en/decryption when inline encryption hardware is not available. This lets upper layers not have to worry about whether or not the underlying device has support for inline encryption before deciding to specify an encryption context for a bio. It also allows for testing without actual inline encryption hardware - in particular, it makes it possible to test the inline encryption code in ext4 and f2fs simply by running xfstests with the inlinecrypt mount option, which in turn allows for things like the regular upstream regression testing of ext4 to cover the inline encryption code paths. For more details, refer to Documentation/block/inline-encryption.rst. Signed-off-by: Satya Tangirala <satyat@google.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-14 09:48:03 -06:00
Satya Tangirala	d145dc2303	block: Make blk-integrity preclude hardware inline encryption Whenever a device supports blk-integrity, make the kernel pretend that the device doesn't support inline encryption (essentially by setting the keyslot manager in the request queue to NULL). There's no hardware currently that supports both integrity and inline encryption. However, it seems possible that there will be such hardware in the near future (like the NVMe key per I/O support that might support both inline encryption and PI). But properly integrating both features is not trivial, and without real hardware that implements both, it is difficult to tell if it will be done correctly by the majority of hardware that support both. So it seems best not to support both features together right now, and to decide what to do at probe time. Signed-off-by: Satya Tangirala <satyat@google.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-14 09:48:03 -06:00
Satya Tangirala	a892c8d52c	block: Inline encryption support for blk-mq We must have some way of letting a storage device driver know what encryption context it should use for en/decrypting a request. However, it's the upper layers (like the filesystem/fscrypt) that know about and manages encryption contexts. As such, when the upper layer submits a bio to the block layer, and this bio eventually reaches a device driver with support for inline encryption, the device driver will need to have been told the encryption context for that bio. We want to communicate the encryption context from the upper layer to the storage device along with the bio, when the bio is submitted to the block layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can represent an encryption context (note that we can't use the bi_private field in struct bio to do this because that field does not function to pass information across layers in the storage stack). We also introduce various functions to manipulate the bio_crypt_ctx and make the bio/request merging logic aware of the bio_crypt_ctx. We also make changes to blk-mq to make it handle bios with encryption contexts. blk-mq can merge many bios into the same request. These bios need to have contiguous data unit numbers (the necessary changes to blk-merge are also made to ensure this) - as such, it suffices to keep the data unit number of just the first bio, since that's all a storage driver needs to infer the data unit number to use for each data block in each bio in a request. blk-mq keeps track of the encryption context to be used for all the bios in a request with the request's rq_crypt_ctx. When the first bio is added to an empty request, blk-mq will program the encryption context of that bio into the request_queue's keyslot manager, and store the returned keyslot in the request's rq_crypt_ctx. All the functions to operate on encryption contexts are in blk-crypto.c. Upper layers only need to call bio_crypt_set_ctx with the encryption key, algorithm and data_unit_num; they don't have to worry about getting a keyslot for each encryption context, as blk-mq/blk-crypto handles that. Blk-crypto also makes it possible for request-based layered devices like dm-rq to make use of inline encryption hardware by cloning the rq_crypt_ctx and programming a keyslot in the new request_queue when necessary. Note that any user of the block layer can submit bios with an encryption context, such as filesystems, device-mapper targets, etc. Signed-off-by: Satya Tangirala <satyat@google.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-14 09:47:53 -06:00
Satya Tangirala	1b26283970	block: Keyslot Manager for Inline Encryption Inline Encryption hardware allows software to specify an encryption context (an encryption key, crypto algorithm, data unit num, data unit size) along with a data transfer request to a storage device, and the inline encryption hardware will use that context to en/decrypt the data. The inline encryption hardware is part of the storage device, and it conceptually sits on the data path between system memory and the storage device. Inline Encryption hardware implementations often function around the concept of "keyslots". These implementations often have a limited number of "keyslots", each of which can hold a key (we say that a key can be "programmed" into a keyslot). Requests made to the storage device may have a keyslot and a data unit number associated with them, and the inline encryption hardware will en/decrypt the data in the requests using the key programmed into that associated keyslot and the data unit number specified with the request. As keyslots are limited, and programming keys may be expensive in many implementations, and multiple requests may use exactly the same encryption contexts, we introduce a Keyslot Manager to efficiently manage keyslots. We also introduce a blk_crypto_key, which will represent the key that's programmed into keyslots managed by keyslot managers. The keyslot manager also functions as the interface that upper layers will use to program keys into inline encryption hardware. For more information on the Keyslot Manager, refer to documentation found in block/keyslot-manager.c and linux/keyslot-manager.h. Co-developed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Satya Tangirala <satyat@google.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-14 09:46:54 -06:00
Tejun Heo	81ca627a93	iocost: don't let vrate run wild while there's no saturation signal When the QoS targets are met and nothing is being throttled, there's no way to tell how saturated the underlying device is - it could be almost entirely idle, at the cusp of saturation or anywhere inbetween. Given that there's no information, it's best to keep vrate as-is in this state. Before `7cd806a9a9` ("iocost: improve nr_lagging handling"), this was the case - if the device isn't missing QoS targets and nothing is being throttled, busy_level was reset to zero. While fixing nr_lagging handling, `7cd806a9a9` ("iocost: improve nr_lagging handling") broke this. Now, while the device is hitting QoS targets and nothing is being throttled, vrate keeps getting adjusted according to the existing busy_level. This led to vrate keeping climing till it hits max when there's an IO issuer with limited request concurrency if the vrate started low. vrate starts getting adjusted upwards until the issuer can issue IOs w/o being throttled. From then on, QoS targets keeps getting met and nothing on the system needs throttling and vrate keeps getting increased due to the existing busy_level. This patch makes the following changes to the busy_level logic. * Reset busy_level if nr_shortages is zero to avoid the above scenario. * Make non-zero nr_lagging block lowering nr_level but still clear positive busy_level if there's clear non-saturation signal - QoS targets are met and nr_shortages is non-zero. nr_lagging's role is preventing adjusting vrate upwards while there are long-running commands and it shouldn't keep busy_level positive while there's clear non-saturation signal. * Restructure code for clarity and add comments. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Andy Newell <newella@fb.com> Fixes: `7cd806a9a9` ("iocost: improve nr_lagging handling") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-14 09:32:09 -06:00
Ming Lei	71ac860af8	block: move blk_io_schedule() out of header file blk_io_schedule() isn't called from performance sensitive code path, and it is easier to maintain by exporting it as symbol. Also blk_io_schedule() is only called by CONFIG_BLOCK code, so it is safe to do this way. Meantime fixes build failure when CONFIG_BLOCK is off. Cc: Christoph Hellwig <hch@infradead.org> Fixes: `e6249cdd46` ("block: add blk_io_schedule() for avoiding task hung in sync dio") Reported-by: Satya Tangirala <satyat@google.com> Tested-by: Satya Tangirala <satyat@google.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-14 08:06:04 -06:00
Johannes Thumshirn	29b2a3aa29	block: export bio_release_pages and bio_iov_iter_get_pages Export bio_release_pages and bio_iov_iter_get_pages, so they can be used from modular code. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-12 20:36:28 -06:00
Damien Le Moal	e732671aa5	block: Modify revalidate zones Modify the interface of blk_revalidate_disk_zones() to add an optional driver callback function that a driver can use to extend processing done during zone revalidation. The callback, if defined, is executed with the device request queue frozen, after all zones have been inspected. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-12 20:36:28 -06:00
Johannes Thumshirn	1392d37018	block: introduce blk_req_zone_write_trylock Introduce blk_req_zone_write_trylock(), which either grabs the write-lock for a sequential zone or returns false, if the zone is already locked. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-12 20:36:28 -06:00
Keith Busch	0512a75b98	block: Introduce REQ_OP_ZONE_APPEND Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned block device. This is a no-merge write operation. A zone append write BIO must: * Target a zoned block device * Have a sector position indicating the start sector of the target zone * The target zone must be a sequential write zone * The BIO must not cross a zone boundary * The BIO size must not be split to ensure that a single range of LBAs is written with a single command. Implement these checks in generic_make_request_checks() using the helper function blk_check_zone_append(). To avoid write append BIO splitting, introduce the new max_zone_append_sectors queue limit attribute and ensure that a BIO size is always lower than this limit. Export this new limit through sysfs and check these limits in bio_full(). Also when a LLDD can't dispatch a request to a specific zone, it will return BLK_STS_ZONE_RESOURCE indicating this request needs to be delayed, e.g. because the zone it will be dispatched to is still write-locked. If this happens set the request aside in a local list to continue trying dispatching requests such as READ requests or a WRITE/ZONE_APPEND requests targetting other zones. This way we can still keep a high queue depth without starving other requests even if one request can't be served due to zone write-locking. Finally, make sure that the bio sector position indicates the actual write position as indicated by the device on completion. Signed-off-by: Keith Busch <kbusch@kernel.org> [ jth: added zone-append specific add_page and merge_page helpers ] Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-12 20:36:28 -06:00
Christoph Hellwig	e458110577	block: rename __bio_add_pc_page to bio_add_hw_page Rename __bio_add_pc_page() to bio_add_hw_page() and explicitly pass in a max_sectors argument. This max_sectors argument can be used to specify constraints from the hardware. Signed-off-by: Christoph Hellwig <hch@lst.de> [ jth: rebased and made public for blk-map.c ] Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-12 20:36:28 -06:00
Ming Lei	27eb3af9a3	block: don't hold part0's refcount in IO path gendisk can't be gone when there is IO activity, so not hold part0's refcount in IO path. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Cc: Yufen Yu <yuyufen@huawei.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hou Tao <houtao1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-12 20:31:40 -06:00
Ming Lei	07c4e1e834	block: only define 'nr_sects_seq' in hd_part for 32bit SMP The seqcount of 'nr_sects_seq' is only needed in case of 32bit SMP, so define it just for 32bit SMP. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Cc: Yufen Yu <yuyufen@huawei.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hou Tao <houtao1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-12 20:31:39 -06:00
Ming Lei	b7d6c30333	block: fix use-after-free on cached last_lookup partition delete_partition() clears the cached last_lookup partition. However the .last_lookup cache may be overwritten by one IO path after it is cleared from delete_partition(). Then another IO path may use the cached deleting partition after hd_struct_free() is called, then use-after-free is triggered on the cached partition. Fixes the issue by the following approach: 1) always get the partition's refcount via hd_struct_try_get() before setting .last_lookup 2) move clearing .last_lookup from delete_partition() to hd_struct_free() which is the release handle of the partition's percpu-refcount, so that no IO path can cache deleteing partition via .last_lookup. It is one candidate approach of Yufen's patch[1] which adds overhead in fast path by indirect lookup which may introduce one extra cacheline in IO path. Also this patch relies on percpu-refcount's protection, and it is easier to understand and verify. [1] https://lore.kernel.org/linux-block/20200109013551.GB9655@ming.t460p/T/#t Reported-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hou Tao <houtao1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-12 20:31:39 -06:00
Weiping Zhang	aa880ad690	block: reset mapping if failed to update hardware queue count When we increase hardware queue count, blk_mq_update_queue_map will reset the mapping between cpu and hardware queue base on the hardware queue count(set->nr_hw_queues). The mapping cannot be reset if it encounters error in blk_mq_realloc_hw_ctxs, but the fallback flow will continue using it, then blk_mq_map_swqueue will touch a invalid memory, because the mapping points to a wrong hctx. blktest block/030: null_blk: module loaded Increasing nr_hw_queues to 8 fails, fallback to 1 ================================================================== BUG: KASAN: null-ptr-deref in blk_mq_map_swqueue+0x2f2/0x830 Read of size 8 at addr 0000000000000128 by task nproc/8541 CPU: 5 PID: 8541 Comm: nproc Not tainted 5.7.0-rc4-dbg+ #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014 Call Trace: dump_stack+0xa5/0xe6 __kasan_report.cold+0x65/0xbb kasan_report+0x45/0x60 check_memory_region+0x15e/0x1c0 __kasan_check_read+0x15/0x20 blk_mq_map_swqueue+0x2f2/0x830 __blk_mq_update_nr_hw_queues+0x3df/0x690 blk_mq_update_nr_hw_queues+0x32/0x50 nullb_device_submit_queues_store+0xde/0x160 [null_blk] configfs_write_file+0x1c4/0x250 [configfs] __vfs_write+0x4c/0x90 vfs_write+0x14b/0x2d0 ksys_write+0xdd/0x180 __x64_sys_write+0x47/0x50 do_syscall_64+0x6f/0x310 entry_SYSCALL_64_after_hwframe+0x49/0xb3 Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Tested-by: Bart van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-12 20:20:22 -06:00
Christoph Hellwig	1cd925d583	bdi: remove the name field in struct backing_dev_info The name is only printed for a not registered bdi in writeback. Use the device name there as is more useful anyway for the unlike case that the warning triggers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:15:13 -06:00
Christoph Hellwig	aef33c2ff8	bdi: simplify bdi_alloc Merge the _node vs normal version and drop the superflous gfp_t argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:15:13 -06:00
Christoph Hellwig	3c5d202b55	bdi: remove bdi_register_owner Split out a new bdi_set_owner helper to set the owner, and move the policy for creating the bdi name back into genhd.c, where it belongs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:15:13 -06:00
Weiping Zhang	79fab52879	block: rename blk_mq_alloc_rq_maps rename blk_mq_alloc_rq_maps to blk_mq_alloc_map_and_requests, this function allocs both map and request, make function name align with funtion. Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:15:13 -06:00
Weiping Zhang	03b63b029d	block: rename __blk_mq_alloc_rq_map rename __blk_mq_alloc_rq_map to __blk_mq_alloc_map_and_request, actually it alloc both map and request, make function name align with function. Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:15:13 -06:00
Ming Lei	fd689871bb	block: alloc map and request for new hardware queue Alloc new map and request for new hardware queue when increse hardware queue count. Before this patch, it will show a warning for each new hardware queue, but it's not enough, these hctx have no maps and reqeust, when a bio was mapped to these hardware queue, it will trigger kernel panic when get request from these hctx. Test environment: * A NVMe disk supports 128 io queues * 96 cpus in system A corner case can always trigger this panic, there are 96 io queues allocated for HCTX_TYPE_DEFAULT type, the corresponding kernel log: nvme nvme0: 96/0/0 default/read/poll queues. Now we set nvme write queues to 96, then nvme will alloc others(32) queues for read, but blk_mq_update_nr_hw_queues does not alloc map and request for these new added io queues. So when process read nvme disk, it will trigger kernel panic when get request from these hardware context. Reproduce script: nr=$(expr `cat /sys/block/nvme0n1/device/queue_count` - 1) echo $nr > /sys/module/nvme/parameters/write_queues echo 1 > /sys/block/nvme0n1/device/reset_controller dd if=/dev/nvme0n1 of=/dev/null bs=4K count=1 [ 8040.805626] ------------[ cut here ]------------ [ 8040.805627] WARNING: CPU: 82 PID: 12921 at block/blk-mq.c:2578 blk_mq_map_swqueue+0x2b6/0x2c0 [ 8040.805627] Modules linked in: nvme nvme_core nf_conntrack_netlink xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nf_conntrack_tftp nft_masq nf_tables_set nft_fib_inet nft_f ib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack tun bridge nf_defrag_ipv6 nf_defrag_ipv4 stp llc ip6_tables ip_tables nft_compat rfkill ip_set nf_tables nfne tlink sunrpc intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_ cstate intel_uncore raid0 joydev intel_rapl_perf ipmi_si pcspkr mei_me ioatdma sg ipmi_devintf mei i2c_i801 dca lpc_ich ipmi_msghandler acpi_power_meter acpi_pad xfs libcrc32c sd_mod ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm d rm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops [ 8040.805637] ahci drm i40e libahci crc32c_intel libata t10_pi wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nvme_core] [ 8040.805640] CPU: 82 PID: 12921 Comm: kworker/u194:2 Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2 [ 8040.805640] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019 [ 8040.805641] Workqueue: nvme-reset-wq nvme_reset_work [nvme] [ 8040.805642] RIP: 0010:blk_mq_map_swqueue+0x2b6/0x2c0 [ 8040.805643] Code: 00 00 00 00 00 41 83 c5 01 44 39 6d 50 77 b8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b bb 98 00 00 00 89 d6 e8 8c 81 03 00 eb 83 <0f> 0b e9 52 ff ff ff 0f 1f 00 0f 1f 44 00 00 41 57 48 89 f1 41 56 [ 8040.805643] RSP: 0018:ffffba590d2e7d48 EFLAGS: 00010246 [ 8040.805643] RAX: 0000000000000000 RBX: ffff9f013e1ba800 RCX: 000000000000003d [ 8040.805644] RDX: ffff9f00ffff6000 RSI: 0000000000000003 RDI: ffff9ed200246d90 [ 8040.805644] RBP: ffff9f00f6a79860 R08: 0000000000000000 R09: 000000000000003d [ 8040.805645] R10: 0000000000000001 R11: ffff9f0138c3d000 R12: ffff9f00fb3a9008 [ 8040.805645] R13: 000000000000007f R14: ffffffff96822660 R15: 000000000000005f [ 8040.805645] FS: 0000000000000000(0000) GS:ffff9f013fa80000(0000) knlGS:0000000000000000 [ 8040.805646] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8040.805646] CR2: 00007f7f397fa6f8 CR3: 0000003d8240a002 CR4: 00000000007606e0 [ 8040.805647] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 8040.805647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 8040.805647] PKRU: 55555554 [ 8040.805647] Call Trace: [ 8040.805649] blk_mq_update_nr_hw_queues+0x31b/0x390 [ 8040.805650] nvme_reset_work+0xb4b/0xeab [nvme] [ 8040.805651] process_one_work+0x1a7/0x370 [ 8040.805652] worker_thread+0x1c9/0x380 [ 8040.805653] ? max_active_store+0x80/0x80 [ 8040.805655] kthread+0x112/0x130 [ 8040.805656] ? __kthread_parkme+0x70/0x70 [ 8040.805657] ret_from_fork+0x35/0x40 [ 8040.805658] ---[ end trace b5f13b1e73ccb5d3 ]--- [ 8229.365135] BUG: kernel NULL pointer dereference, address: 0000000000000004 [ 8229.365165] #PF: supervisor read access in kernel mode [ 8229.365178] #PF: error_code(0x0000) - not-present page [ 8229.365191] PGD 0 P4D 0 [ 8229.365201] Oops: 0000 [#1] SMP PTI [ 8229.365212] CPU: 77 PID: 13024 Comm: dd Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2 [ 8229.365232] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019 [ 8229.365253] RIP: 0010:blk_mq_get_tag+0x227/0x250 [ 8229.365265] Code: 44 24 04 44 01 e0 48 8b 74 24 38 65 48 33 34 25 28 00 00 00 75 33 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e c3 48 8d 68 10 4c 89 ef <44> 8b 60 04 48 89 ee e8 dd f9 ff ff 83 f8 ff 75 c8 e9 67 fe ff ff [ 8229.365304] RSP: 0018:ffffba590e977970 EFLAGS: 00010246 [ 8229.365317] RAX: 0000000000000000 RBX: ffff9f00f6a79860 RCX: ffffba590e977998 [ 8229.365333] RDX: 0000000000000000 RSI: ffff9f012039b140 RDI: ffffba590e977a38 [ 8229.365349] RBP: 0000000000000010 R08: ffffda58ff94e190 R09: ffffda58ff94e198 [ 8229.365365] R10: 0000000000000011 R11: ffff9f00f6a79860 R12: 0000000000000000 [ 8229.365381] R13: ffffba590e977a38 R14: ffff9f012039b140 R15: 0000000000000001 [ 8229.365397] FS: 00007f481c230580(0000) GS:ffff9f013f940000(0000) knlGS:0000000000000000 [ 8229.365415] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8229.365428] CR2: 0000000000000004 CR3: 0000005f35e26004 CR4: 00000000007606e0 [ 8229.365444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 8229.365460] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 8229.365476] PKRU: 55555554 [ 8229.365484] Call Trace: [ 8229.365498] ? finish_wait+0x80/0x80 [ 8229.365512] blk_mq_get_request+0xcb/0x3f0 [ 8229.365525] blk_mq_make_request+0x143/0x5d0 [ 8229.365538] generic_make_request+0xcf/0x310 [ 8229.365553] ? scan_shadow_nodes+0x30/0x30 [ 8229.365564] submit_bio+0x3c/0x150 [ 8229.365576] mpage_readpages+0x163/0x1a0 [ 8229.365588] ? blkdev_direct_IO+0x490/0x490 [ 8229.365601] read_pages+0x6b/0x190 [ 8229.365612] __do_page_cache_readahead+0x1c1/0x1e0 [ 8229.365626] ondemand_readahead+0x182/0x2f0 [ 8229.365639] generic_file_buffered_read+0x590/0xab0 [ 8229.365655] new_sync_read+0x12a/0x1c0 [ 8229.365666] vfs_read+0x8a/0x140 [ 8229.365676] ksys_read+0x59/0xd0 [ 8229.365688] do_syscall_64+0x55/0x1d0 [ 8229.365700] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Tested-by: Weiping Zhang <zhangweiping@didiglobal.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:15:13 -06:00
Weiping Zhang	a2584e43f5	block: save previous hardware queue count before udpate blk_mq_realloc_tag_set_tags will update set->nr_hw_queues, so save old set->nr_hw_queues before call this function. Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:15:12 -06:00
Weiping Zhang	2e194422f1	block: free both rq_map and request Allocation: __blk_mq_alloc_rq_map blk_mq_alloc_rq_map blk_mq_alloc_rq_map tags = blk_mq_init_tags : kzalloc_node: tags->rqs = kcalloc_node tags->static_rqs = kcalloc_node blk_mq_alloc_rqs p = alloc_pages_node tags->static_rqs[i] = p + offset; Free: blk_mq_free_rq_map kfree(tags->rqs); kfree(tags->static_rqs); blk_mq_free_tags kfree(tags); The page allocated in blk_mq_alloc_rqs cannot be released, so we should use blk_mq_free_map_and_requests here. blk_mq_free_map_and_requests blk_mq_free_rqs __free_pages : cleanup for blk_mq_alloc_rqs blk_mq_free_rq_map : cleanup for blk_mq_alloc_rq_map Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:15:12 -06:00
Jens Axboe	873f1c8df7	Merge branch 'block-5.7' into for-5.8/block Pull in block-5.7 fixes for 5.8. Mostly to resolve a conflict with the blk-iocost changes, but we also need the base of the bdi use-after-free as well as we build on top of it. * block-5.7: nvme: fix possible hang when ns scanning fails during error recovery nvme-pci: fix "slimmer CQ head update" bdi: add a ->dev_name field to struct backing_dev_info bdi: use bdi_dev_name() to get device name bdi: move bdi_dev_name out of line vboxsf: don't use the source name in the bdi name iocost: protect iocg->abs_vdebt with iocg->waitq.lock block: remove the bd_openers checks in blk_drop_partitions nvme: prevent double free in nvme_alloc_ns() error handling null_blk: Cleanup zoned device initialization null_blk: Fix zoned command handling block: remove unused header blk-iocost: Fix error on iocost_ioc_vrate_adj bdev: Reduce time holding bd_mutex in sync in blkdev_close() buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:13:58 -06:00
Yufen Yu	d51cfc53ad	bdi: use bdi_dev_name() to get device name Use the common interface bdi_dev_name() to get device name. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Add missing <linux/backing-dev.h> include BFQ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:07:39 -06:00
Tejun Heo	0b80f9866e	iocost: protect iocg->abs_vdebt with iocg->waitq.lock abs_vdebt is an atomic_64 which tracks how much over budget a given cgroup is and controls the activation of use_delay mechanism. Once a cgroup goes over budget from forced IOs, it has to pay it back with its future budget. The progress guarantee on debt paying comes from the iocg being active - active iocgs are processed by the periodic timer, which ensures that as time passes the debts dissipate and the iocg returns to normal operation. However, both iocg activation and vdebt handling are asynchronous and a sequence like the following may happen. 1. The iocg is in the process of being deactivated by the periodic timer. 2. A bio enters ioc_rqos_throttle(), calls iocg_activate() which returns without anything because it still sees that the iocg is already active. 3. The iocg is deactivated. 4. The bio from #2 is over budget but needs to be forced. It increases abs_vdebt and goes over the threshold and enables use_delay. 5. IO control is enabled for the iocg's subtree and now IOs are attributed to the descendant cgroups and the iocg itself no longer issues IOs. This leaves the iocg with stuck abs_vdebt - it has debt but inactive and no further IOs which can activate it. This can end up unduly punishing all the descendants cgroups. The usual throttling path has the same issue - the iocg must be active while throttled to ensure that future event will wake it up - and solves the problem by synchronizing the throttling path with a spinlock. abs_vdebt handling is another form of overage handling and shares a lot of characteristics including the fact that it isn't in the hottest path. This patch fixes the above and other possible races by strictly synchronizing abs_vdebt and use_delay handling with iocg->waitq.lock. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Vlad Dmitriev <vvd@fb.com> Cc: stable@vger.kernel.org # v5.4+ Fixes: `e1518f63f2` ("blk-iocost: Don't let merges push vtime into the future") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-05 09:23:18 -06:00
Tejun Heo	cd006509b0	blk-iocost: account for IO size when testing latencies On each IO completion, iocost decides whether the IO met or missed its latency target. Currently, the targets are fixed numbers per IO type. While this can be good enough for loose latency targets way higher than typical completion latencies, the effect of IO size makes it difficult to tighten the latency target - a target adequate for 4k IOs might be too tight for 512k IOs and vice-versa. iocost already has all the necessary information to account for different IO sizes when testing whether the latency target is met as iocost can calculate the size vtime cost of a given IO. This patch updates the completion path to calculate the size vtime cost of the IO, deduct the nsec equivalent from the observed latency and use the adjusted value to decide whether the target is met. This makes latency targets independent from IO size and enables determining adequate latency targets with fixed size fio runs. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andy Newell <newella@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-30 15:54:45 -06:00
Tejun Heo	54c52e10dc	blk-iocost: switch to fixed non-auto-decaying use_delay The use_delay mechanism was introduced by blk-iolatency to hold memory allocators accountable for the reclaim and other shared IOs they cause. The duration of the delay is dynamically balanced between iolatency increasing the value on each target miss and it auto-decaying as time passes and threads get delayed on it. While this works well for iolatency, iocost's control model isn't compatible with it. There is no repeated "violation" events which can be balanced against auto-decaying. iocost instead knows how much a given cgroup is over budget and wants to prevent that cgroup from issuing IOs while over budget. Until now, iocost has been adding the cost of force-issued IOs. However, this doesn't reflect the amount which is already over budget and is simply not enough to counter the auto-decaying allowing anon-memory leaking low priority cgroup to go over its alloted share of IOs. As auto-decaying doesn't make much sense for iocost, this patch introduces a different mode of operation for use_delay - when blkcg_set_delay() are used insted of blkcg_add/use_delay(), the delay duration is not auto-decayed until it is explicitly cleared with blkcg_clear_delay(). iocost is updated to keep the delay duration synchronized to the budget overage amount. With this change, iocost can effectively police cgroups which generate significant amount of force-issued IOs. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-30 15:54:45 -06:00
Christoph Hellwig	10c70d95c0	block: remove the bd_openers checks in blk_drop_partitions When replacing the bd_super check with a bd_openers I followed a logical conclusion, which turns out to be utterly wrong. When a block device has bd_super sets it has a mount file system on it (although not every mounted file system sets bd_super), but that also implies it doesn't even have partitions to start with. So instead of trying to come up with a logical check for all openers, just remove the check entirely. Fixes: `d3ef553627` ("block: fix busy device checking in blk_drop_partitions") Fixes: `cb6b771b05` ("block: fix busy device checking in blk_drop_partitions again") Reported-by: Michal Koutný <mkoutny@suse.com> Reported-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-30 10:25:43 -06:00
Christoph Hellwig	accea322f5	block: add a bio_queue_enter helper Add a little helper that passes the right nowait flag to blk_queue_enter based on the bio flag, and terminates the bio with the right error code if entering the queue fails. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-29 09:33:26 -06:00
Christoph Hellwig	0376e9efe1	block: replace BIO_QUEUE_ENTERED with BIO_CGROUP_ACCT BIO_QUEUE_ENTERED is only used for cgroup accounting now, so rename the flag and move setting it into the cgroup code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-29 09:33:26 -06:00
Christoph Hellwig	760f83ea63	block: cleanup the memory stall accounting in submit_bio Instead of a convoluted chain just check for REQ_OP_READ directly, and keep all the memory stall code together in a single unlikely branch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-29 09:33:26 -06:00
Christoph Hellwig	3fdd40861d	block: improve the submit_bio and generic_make_request documentation The current documentation is a little weird, as it doesn't clearly explain which function to use, and also has the guts of the information on generic_make_request, which is the internal interface for stacking drivers. Fix this up by properly documenting submit_bio, and only documenting the differences and the use case for generic_make_request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-29 09:33:26 -06:00
Zheng Bin	e1b586f2b8	blk-mq: make function '__blk_mq_sched_dispatch_requests' static Fix sparse warnings: block/blk-mq-sched.c:209:5: warning: symbol '__blk_mq_sched_dispatch_requests' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-29 09:16:53 -06:00
Christoph Hellwig	8cf7961dab	block: bypass ->make_request_fn for blk-mq drivers Call blk_mq_make_request when no ->make_request_fn is set. This is safe now that blk_alloc_queue always sets up the pointer for make_request based drivers. This avoids an indirect call in the blk-mq driver I/O fast path, which is rather expensive due to spectre mitigations. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-25 09:45:44 -06:00
Christoph Hellwig	3e82c3485e	block: remove create_io_context create_io_context just has a single caller, which also happens to not even use the return value. Just open code it there. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-25 09:44:40 -06:00
Salman Qazi	28d65729b0	block: Limit number of items taken from the I/O scheduler in one go Flushes bypass the I/O scheduler and get added to hctx->dispatch in blk_mq_sched_bypass_insert. This can happen while a kworker is running hctx->run_work work item and is past the point in blk_mq_sched_dispatch_requests where hctx->dispatch is checked. The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time, because the I/O scheduler can feed an arbitrary number of commands. Since we have only one hctx->run_work, the commands waiting in hctx->dispatch will wait an arbitrary length of time for run_work to be rerun. A similar phenomenon exists with dispatches from the software queue. The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and blk_mq_do_dispatch_ctx and return from the run_work handler and let it rerun. Signed-off-by: Salman Qazi <sqazi@google.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-24 09:16:56 -06:00
Christoph Hellwig	bdf8710d69	block: move dma_pad handling from blk_rq_map_sg into the callers There are only two callers of blk_rq_map_sg/__blk_rq_map_sg that set the dma_pad value in the queue. Move the handling into those callers instead of burdening the common code, and move the ->extra_len field from struct request to struct scsi_cmnd. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-22 10:47:39 -06:00
Christoph Hellwig	cc97923a5b	block: move dma drain handling to scsi Don't burden the common block code with with specifics of the libata DMA draining mechanism. Instead move most of the code to the scsi midlayer. That also means the nr_phys_segments adjustments in the blk-mq fast path can go away entirely, given that SCSI never looks at nr_phys_segments after mapping the request to a scatterlist. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-22 10:47:35 -06:00
Christoph Hellwig	89de1504d5	block: provide a blk_rq_map_sg variant that returns the last element To be able to move some of the special purpose hacks in blk_rq_map_sg into the callers we need a variant that returns the last mapped S/G list element to the caller. Add that variant as __blk_rq_map_sg and make blk_rq_map_sg a trivial inline wrapper around it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-22 10:47:06 -06:00
Christoph Hellwig	e64a0e1692	block: remove RQF_COPY_USER The RQF_COPY_USER is set for bio where the passthrough request mapping helpers decided that bounce buffering is required. It is then used to pad scatterlist for drivers that required it. But given that non-passthrough requests are per definition aligned, and directly mapped pass-through request must be aligned it is not actually required at all. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-22 10:47:06 -06:00
Waiman Long	d6c8e949a3	blk-iocost: Fix error on iocost_ioc_vrate_adj Systemtap 4.2 is unable to correctly interpret the "u32 (missed_ppm)[2]" argument of the iocost_ioc_vrate_adj trace entry defined in include/trace/events/iocost.h leading to the following error: /tmp/stapAcz0G0/stap_c89c58b83cea1724e26395efa9ed4939_6321_aux_6.c:78:8: error: expected ‘;’, ‘,’ or ‘)’ before ‘’ token , u32[]* __tracepoint_arg_missed_ppm That argument type is indeed rather complex and hard to read. Looking at block/blk-iocost.c. It is just a 2-entry u32 array. By simplifying the argument to a simple "u32 *missed_ppm" and adjusting the trace entry accordingly, the compilation error was gone. Fixes: `7caa47151a` ("blkcg: implement blk-iocost") Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-21 09:49:36 -06:00
Christoph Hellwig	9bc5c397d8	block: fold bdev_unhash_inode into invalidate_partition invalidate_partition and bdev_unhash_inode are always paired, and invalidate_partition already does an icache lookup for the block device inode. Piggy back on that to remove the inode from the hash. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-20 11:33:00 -06:00
Christoph Hellwig	02d33b6771	block: mark invalidate_partition static invalidate_partition is only used in genhd.c, so mark it static. Also drop the return value given that is is always ignored. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-20 11:32:59 -06:00
Christoph Hellwig	d5f3178ec9	block: simplify block device syncing in bdev_del_partition We just checked a little above that the block device for the partition im busy. That implies no file system is mounted, and thus the only thing in fsync_bdev that actually is used is sync_blockdev. Just call sync_blockdev directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-20 11:32:59 -06:00
Christoph Hellwig	e669c1da03	block: don't call invalidate_partition from blk_drop_partitions Given that the device must not be busy, most of the calls from invalidate_partition that are related to file system metadata are guranteed to not happen. Just open code the calls to sync_blockdev and invalidate_bdev instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-20 11:32:59 -06:00
Christoph Hellwig	21be6cdc00	dasd: use blk_drop_partitions instead of badly reimplementing it Use the blk_drop_partitions function instead of messing around with ioctls that get kernel pointers. For this blk_drop_partitions needs to be exported, which it normally shouldn't - make an exception for s390 only. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-20 11:32:59 -06:00
Christoph Hellwig	d46430bf5a	block: remove the disk argument from blk_drop_partitions The gendisk can be trivially deducted from the block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-20 11:32:59 -06:00
Christoph Hellwig	4377b48da6	block: remove hd_struct_kill The function has a single caller, so just open code it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-20 11:32:59 -06:00
Christoph Hellwig	8da2892e27	block: cleanup hd_struct freeing Move hd_ref_init out of line as there it isn't anywhere near a fast path, and rename the rcu ref freeing callbacks to be more descriptive. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-20 11:32:59 -06:00
Christoph Hellwig	cddae808ae	block: pass a hd_struct to delete_partition All callers have the hd_struct at hand, so pass it instead of performing another lookup. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-20 11:32:59 -06:00
Christoph Hellwig	fa9156ae59	block: refactor blkpg_ioctl Split each sub-command out into a separate helper, and move those helpers to block/partitions/core.c instead of having a lot of partition manipulation logic open coded in block/ioctl.c. Signed-off-by: Christoph Hellwig <hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-20 11:32:59 -06:00
Douglas Anderson	a0823421a4	blk-mq: Rerun dispatching in the case of budget contention If ever a thread running blk-mq code tries to get budget and fails it immediately stops doing work and assumes that whenever budget is freed up that queues will be kicked and whatever work the thread was trying to do will be tried again. One path where budget is freed and queues are kicked in the normal case can be seen in scsi_finish_command(). Specifically: - scsi_finish_command() - scsi_device_unbusy() - # Decrement "device_busy", AKA release budget - scsi_io_completion() - scsi_end_request() - blk_mq_run_hw_queues() The above is all well and good. The problem comes up when a thread claims the budget but then releases it without actually dispatching any work. Since we didn't schedule any work we'll never run the path of finishing work / kicking the queues. This isn't often actually a problem which is why this issue has existed for a while and nobody noticed. Specifically we only get into this situation when we unexpectedly found that we weren't going to do any work. Code that later receives new work kicks the queues. All good, right? The problem shows up, however, if timing is just wrong and we hit a race. To see this race let's think about the case where we only have a budget of 1 (only one thread can hold budget). Now imagine that a thread got budget and then decided not to dispatch work. It's about to call put_budget() but then the thread gets context switched out for a long, long time. While in this state, any and all kicks of the queue (like the when we received new work) will be no-ops because nobody can get budget. Finally the thread holding budget gets to run again and returns. All the normal kicks will have been no-ops and we have an I/O stall. As you can see from the above, you need just the right timing to see the race. To start with, the only case it happens if we thought we had work, actually managed to get the budget, but then actually didn't have work. That's pretty rare to start with. Even then, there's usually a very small amount of time between realizing that there's no work and putting the budget. During this small amount of time new work has to come in and the queue kick has to make it all the way to trying to get the budget and fail. It's pretty unlikely. One case where this could have failed is illustrated by an example of threads running blk_mq_do_dispatch_sched(): * Threads A and B both run has_work() at the same time with the same "hctx". Imagine has_work() is exact. There's no lock, so it's OK if Thread A and B both get back true. * Thread B gets interrupted for a long time right after it decides that there is work. Maybe its CPU gets an interrupt and the interrupt handler is slow. * Thread A runs, get budget, dispatches work. * Thread A's work finishes and budget is released. * Thread B finally runs again and gets budget. * Since Thread A already took care of the work and no new work has come in, Thread B will get NULL from dispatch_request(). I believe this is specifically why dispatch_request() is allowed to return NULL in the first place if has_work() must be exact. * Thread B will now be holding the budget and is about to call put_budget(), but hasn't called it yet. * Thread B gets interrupted for a long time (again). Dang interrupts. * Now Thread C (maybe with a different "hctx" but the same queue) comes along and runs blk_mq_do_dispatch_sched(). * Thread C won't do anything because it can't get budget. * Finally Thread B will run again and put the budget without kicking any queues. Even though the example above is with blk_mq_do_dispatch_sched() I believe the race is possible any time someone is holding budget but doesn't do work. Unfortunately, the unlikely has become more likely if you happen to be using the BFQ I/O scheduler. BFQ, by design, sometimes returns "true" for has_work() but then NULL for dispatch_request() and stays in this state for a while (currently up to 9 ms). Suddenly you only need one race to hit, not two races in a row. With my current setup this is easy to reproduce in reboot tests and traces have actually shown that we hit a race similar to the one described above. Note that we only need to fix blk_mq_do_dispatch_sched() and blk_mq_do_dispatch_ctx() and not the other places that put budget. In other cases we know that we have work to do on at least one "hctx" and code already exists to kick that "hctx"'s queue. When that work finally finishes all the queues will be kicked using the normal flow. One last note is that (at least in the SCSI case) budget is shared by all "hctx"s that have the same queue. Thus we need to make sure to kick the whole queue, not just re-run dispatching on a single "hctx". Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-20 10:34:56 -06:00
Douglas Anderson	b9151e7bca	blk-mq: Add blk_mq_delay_run_hw_queues() API call We have: * blk_mq_run_hw_queue() * blk_mq_delay_run_hw_queue() * blk_mq_run_hw_queues() ...but not blk_mq_delay_run_hw_queues(), presumably because nobody needed it before now. Since we need it for a later patch in this series, add it. Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-20 10:34:56 -06:00
Douglas Anderson	ab3cee3762	blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick In blk_mq_dispatch_rq_list(), if blk_mq_sched_needs_restart() returns true and the driver returns BLK_STS_RESOURCE then we'll kick the queue. However, there's another case where we might need to kick it. If we were unable to get budget we can be in much the same state as when the driver returns BLK_STS_RESOURCE, so we should treat it the same. It should be noted that even if we add a whole bunch of extra kicking to the queue in other patches this patch is still important. Specifically any kicking that happened before we re-spliced leftover requests into 'hctx->dispatch' wouldn't have found any work, so we really need to make sure we kick ourselves after we've done the splicing. Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-20 10:34:56 -06:00
Tommi Rantala	3a89c25d98	blk-wbt: Use tracepoint_string() for wbt_step tracepoint string literals Use tracepoint_string() for string literals that are used in the wbt_step tracepoint, so that userspace tools can display the string content. Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-17 08:21:44 -06:00
John Garry	5fe56de799	blk-mq: Put driver tag in blk_mq_dispatch_rq_list() when no budget If in blk_mq_dispatch_rq_list() we find no budget, then we break of the dispatch loop, but the request may keep the driver tag, evaulated in 'nxt' in the previous loop iteration. Fix by putting the driver tag for that request. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-16 09:27:03 -06:00
Linus Torvalds	8df2a0a6da	block-5.7-2020-04-10 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6QhDIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpsE/EADOQ0xDMOa8EmzRvjuCkiaB9yK2zXiBSAj5 ZBi7ReownfXhCR7nVc8Bv1s2f00PD6CFNURXZmdgyDDrXEd2ojueDoAZNBk59t0e i2CAF2wLAQ5EfuVaxSHVEOrVEmtu+ue+Ix83JNlnGPd7pf9s7uKc/W4iKGpgpxIo 1CpXmWwm5RwjX4z/Qsiaka2lB7QojjImp1n3C+XI5+pp/bJXiftep1lxH5Y3nSWU iR4jO81uxDMxhTEZ9z2cb1HarhctKvnihcb39gQYQ/kYYu7hSZnBPZo5zp5Dyb/t 4tGuDsfXCQCbF0smkusUrcyeT19vh9tOsGkiMzJ/ihm7TMyN4fT23h6DUb/7pAON jnlcB7r5Ezs8jLz9i+mAoq06djd5u54kiuKFog8170sTrtYsncZbyc01wLNAla/V /6KX1sMbPlbXZ+a3l3i7i/gcCBJ7ci6pV3x2elvM9dKHxyqJmwEGMlFVwt4s26ev wS+7+dktLAC73889Zyn8LutA4bWy5FmisSPA4PydSUSOZA+7JjlbILcz15jjwlP2 HzYk+TXsd3yJUQRYX5P0FcDaBUTISr/xeUUB+KT1rLv4Lhtso+S/9cvSc8x5mOa9 989gmqNfFAWoj1nKEIKeRwLjk0b6YA9qMv4jOwwiuobsT55aBxpbP80huNoRVj5L xFIWgBSwzg== =3woC -----END PGP SIGNATURE----- Merge tag 'block-5.7-2020-04-10' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Here's a set of fixes that should go into this merge window. This contains: - NVMe pull request from Christoph with various fixes - Better discard support for loop (Evan) - Only call ->commit_rqs() if we have queued IO (Keith) - blkcg offlining fixes (Tejun) - fix (and fix the fix) for busy partitions" * tag 'block-5.7-2020-04-10' of git://git.kernel.dk/linux-block: block: fix busy device checking in blk_drop_partitions again block: fix busy device checking in blk_drop_partitions nvmet-rdma: fix double free of rdma queue blk-mq: don't commit_rqs() if none were queued nvme-fc: Revert "add module to ops template to allow module references" nvme: fix deadlock caused by ANA update wrong locking nvmet-rdma: fix bonding failover possible NULL deref loop: Better discard support for block devices loop: Report EOPNOTSUPP properly nvmet: fix NULL dereference when removing a referral nvme: inherit stable pages constraint in the mpath stack device blkcg: don't offline parent blkcg first blkcg: rename blkcg->cgwb_refcnt to ->online_pin and always use it nvme-tcp: fix possible crash in recv error flow nvme-tcp: don't poll a non-live queue nvme-tcp: fix possible crash in write_zeroes processing nvmet-fc: fix typo in comment nvme-rdma: Replace comma with a semicolon nvme-fcloop: fix deallocation of working context nvme: fix compat address handling in several ioctls	2020-04-10 10:06:54 -07:00
Christoph Hellwig	cb6b771b05	block: fix busy device checking in blk_drop_partitions again The previous fix had an off by one in the bd_openers checking, counting the callers blkdev_get. Fixes: `d3ef553627` ("block: fix busy device checking in blk_drop_partitions") Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Qian Cai <cai@lca.pw> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-10 08:34:11 -06:00
Christoph Hellwig	d3ef553627	block: fix busy device checking in blk_drop_partitions bd_super is only set by get_tree_bdev and mount_bdev, and thus not by other openers like btrfs or the XFS realtime and log devices, as well as block devices directly opened from user space. Check bd_openers instead. Fixes: `77032ca66f` ("Return EBUSY from BLKRRPART for mounted whole-dev fs") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-07 14:42:59 -06:00
Keith Busch	536167d47a	blk-mq: don't commit_rqs() if none were queued Unburden the drivers from checking if a call to commit_rqs() is valid by not calling it when there are no requests to commit. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-06 13:43:56 -06:00
Linus Torvalds	79f51b7b9c	SCSI misc on 20200402 update changing all our txt files to rst ones. Excluding that, we have the usual driver updates (qla2xxx, ufs, lpfc, zfcp, ibmvfc, pm80xx, aacraid), a treewide update for scnprintf and some other minor updates. The major core update is Hannes moving functions out of the aacraid driver and into the core. Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com> -----BEGIN PGP SIGNATURE----- iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXoYKiyYcamFtZXMuYm90 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishSasAP4iGwSB Y8tFaZgWadu76+wj5MdqTBoXdhnIuFF0rZG3pQEAiIKdsfQlbSFdm75+gUtx5hG/ GOilX/pJczTRJDCGNis= =g7Sk -----END PGP SIGNATURE----- Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI updates from James Bottomley: "This series has a huge amount of churn because it pulls in Mauro's doc update changing all our txt files to rst ones. Excluding that, we have the usual driver updates (qla2xxx, ufs, lpfc, zfcp, ibmvfc, pm80xx, aacraid), a treewide update for scnprintf and some other minor updates. The major core change is Hannes moving functions out of the aacraid driver and into the core" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (223 commits) scsi: aic7xxx: aic97xx: Remove FreeBSD-specific code scsi: ufs: Do not rely on prefetched data scsi: dc395x: remove dc395x_bios_param scsi: libiscsi: Fix error count for active session scsi: hpsa: correct race condition in offload enabled scsi: message: fusion: Replace zero-length array with flexible-array member scsi: qedi: Add PCI shutdown handler support scsi: qedi: Add MFW error recovery process scsi: ufs: Enable block layer runtime PM for well-known logical units scsi: ufs-qcom: Override devfreq parameters scsi: ufshcd: Let vendor override devfreq parameters scsi: ufshcd: Update the set frequency to devfreq scsi: ufs: Resume ufs host before accessing ufs device scsi: ufs-mediatek: customize the delay for enabling host scsi: ufs: make HCE polling more compact to improve initialization latency scsi: ufs: allow custom delay prior to host enabling scsi: ufs-mediatek: use common delay function scsi: ufs: introduce common and flexible delay function scsi: ufs: use an enum for host capabilities scsi: ufs: fix uninitialized tx_lanes in ufshcd_disable_tx_lcc() ...	2020-04-02 17:03:53 -07:00
Linus Torvalds	69c1fd9726	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "My attempt to revitalize trivial queue I've been neglecting for years (what a disaster that was for this world, right? :) ) with patches collected from backlog that were still relevant and not applied elsewhere in the meantime" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: err.h: remove deprecated PTR_RET for good blk-mq: Fix typo in comment x86/boot: Fix comment spelling sh: mach-highlander: Fix comment spelling s390/dasd: Fix comment spelling mfd: wm8994: Fix comment spelling docs: Add reference in binfmt-misc.rst genirq: fix kerneldoc comment for irq_desc drm/amdgpu: fix two documentation mismatch issues HID: fix Kconfig word ordering list/hashtable: minor documentation corrections.	2020-04-01 14:52:59 -07:00
Tejun Heo	4308a434e5	blkcg: don't offline parent blkcg first blkcg->cgwb_refcnt is used to delay blkcg offlining so that blkgs don't get offlined while there are active cgwbs on them. However, it ends up making offlining unordered sometimes causing parents to be offlined before children. Let's fix this by making child blkcgs pin the parents' online states. Note that pin/unpin names are chosen over get/put intentionally because css uses get/put online for something different. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-01 14:56:44 -06:00
Tejun Heo	d866dbf617	blkcg: rename blkcg->cgwb_refcnt to ->online_pin and always use it blkcg->cgwb_refcnt is used to delay blkcg offlining so that blkgs don't get offlined while there are active cgwbs on them. However, it ends up making offlining unordered sometimes causing parents to be offlined before children. To fix it, we want child blkcgs to pin the parents' online states turning the refcnt into a more generic online pinning mechanism. In prepartion, * blkcg->cgwb_refcnt -> blkcg->online_pin * blkcg_cgwb_get/put() -> blkcg_pin/unpin_online() * Take them out of CONFIG_CGROUP_WRITEBACK Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-01 14:56:42 -06:00
Linus Torvalds	a776c270a0	Merge branch 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI updates from Ingo Molnar: "The EFI changes in this cycle are much larger than usual, for two (positive) reasons: - The GRUB project is showing signs of life again, resulting in the introduction of the generic Linux/UEFI boot protocol, instead of x86 specific hacks which are increasingly difficult to maintain. There's hope that all future extensions will now go through that boot protocol. - Preparatory work for RISC-V EFI support. The main changes are: - Boot time GDT handling changes - Simplify handling of EFI properties table on arm64 - Generic EFI stub cleanups, to improve command line handling, file I/O, memory allocation, etc. - Introduce a generic initrd loading method based on calling back into the firmware, instead of relying on the x86 EFI handover protocol or device tree. - Introduce a mixed mode boot method that does not rely on the x86 EFI handover protocol either, and could potentially be adopted by other architectures (if another one ever surfaces where one execution mode is a superset of another) - Clean up the contents of 'struct efi', and move out everything that doesn't need to be stored there. - Incorporate support for UEFI spec v2.8A changes that permit firmware implementations to return EFI_UNSUPPORTED from UEFI runtime services at OS runtime, and expose a mask of which ones are supported or unsupported via a configuration table. - Partial fix for the lack of by-VA cache maintenance in the decompressor on 32-bit ARM. - Changes to load device firmware from EFI boot service memory regions - Various documentation updates and minor code cleanups and fixes" * 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits) efi/libstub/arm: Fix spurious message that an initrd was loaded efi/libstub/arm64: Avoid image_base value from efi_loaded_image partitions/efi: Fix partition name parsing in GUID partition entry efi/x86: Fix cast of image argument efi/libstub/x86: Use ULONG_MAX as upper bound for all allocations efi: Fix a mistype in comments mentioning efivar_entry_iter_begin() efi/libstub: Avoid linking libstub/lib-ksyms.o into vmlinux efi/x86: Preserve %ebx correctly in efi_set_virtual_address_map() efi/x86: Ignore the memory attributes table on i386 efi/x86: Don't relocate the kernel unless necessary efi/x86: Remove extra headroom for setup block efi/x86: Add kernel preferred address to PE header efi/x86: Decompress at start of PE image load address x86/boot/compressed/32: Save the output address instead of recalculating it efi/libstub/x86: Deal with exit() boot service returning x86/boot: Use unsigned comparison for addresses efi/x86: Avoid using code32_start efi/x86: Make efi32_pe_entry() more readable efi/x86: Respect 32-bit ABI in efi32_pe_entry() efi/x86: Annotate the LOADED_IMAGE_PROTOCOL_GUID with SYM_DATA ...	2020-03-30 16:13:08 -07:00
Linus Torvalds	1592614838	for-5.7/drivers-2020-03-29 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6BJDYQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgplhMD/95jd4nlVetHAo54z+Zk2ExE13+yDamRKyh vc7t2tz1reqFOimtVr5aVuTXCTgOx4CpiIox5qcn6qAExN4JtCChOBRGize/0u8S ckxnhHbN2C0rfnGldvrYYeNRonFI+7QKimnurWUSYYGN0xqbo21BxJ7dFaohMseo q4K8sIW0ctE6AOlw28Jerkg614s2NDGZ7q1laheXnYHn5c9f1m0NaKN/jyTGgr0X TLBiLbX2yRrAuvpctBj6Fna6YN7Vdd9jsf2Bt6ipUI1XgHQoVUGMxQNhWPyjsbSv GzRQUNAfVcasLzCP/Mj/47144OkUtDDpn2mjeXDaFljLDGFULD+jp/SsOmLCxkPC gI7G2yfBvF96/SOyT0JXrLyMcBd1R2vRoASbc5tPu82mZhx7YJZH5WYtOB9h2gra RTYo3xcm0EoN6yeMaH+xOuXxTWWInIrgKPONW4H8s7hxEiMt5oFNVBI7vqPr4LVp tpfxiKZDavKOofKXogNV4W7mSMP/Ir5Q9Ha4g5SXHBGp0z/PHmnQ0xDGNq0KDnU4 eNO0UYCFNCNa+0AOhpNxaVuVm9LjrgvyXRjePgOZQ4akhohwHO6DLrHK1f8Hb1vD 8Ih6uR+F5zZlKsouWro8HLGYm5w40Wq9tbCI8QbPYH6nkGoDmzpPv9jbAeWgJU5c KqP/5TBSLA== =Bs4E -----END PGP SIGNATURE----- Merge tag 'for-5.7/drivers-2020-03-29' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: - floppy driver cleanup series from Willy - NVMe updates and fixes (Various) - null_blk trace improvements (Chaitanya) - bcache fixes (Coly) - md fixes (via Song) - loop block size change optimizations (Martijn) - scnprintf() use (Takashi) * tag 'for-5.7/drivers-2020-03-29' of git://git.kernel.dk/linux-block: (81 commits) null_blk: add trace in null_blk_zoned.c null_blk: add tracepoint helpers for zoned mode block: add a zone condition debug helper nvme: cleanup namespace identifier reporting in nvme_init_ns_head nvme: rename __nvme_find_ns_head to nvme_find_ns_head nvme: refactor nvme_identify_ns_descs error handling nvme-tcp: Add warning on state change failure at nvme_tcp_setup_ctrl nvme-rdma: Add warning on state change failure at nvme_rdma_setup_ctrl nvme: Fix controller creation races with teardown flow nvme: Make nvme_uninit_ctrl symmetric to nvme_init_ctrl nvme: Fix ctrl use-after-free during sysfs deletion nvme-pci: Re-order nvme_pci_free_ctrl nvme: Remove unused return code from nvme_delete_ctrl_sync nvme: Use nvme_state_terminal helper nvme: release ida resources nvme: Add compat_ioctl handler for NVME_IOCTL_SUBMIT_IO nvmet-tcp: optimize tcp stack TX when data digest is used nvme-fabrics: Use scnprintf() for avoiding potential buffer overflow nvme-multipath: do not reset on unknown status nvmet-rdma: allocate RW ctxs according to mdts ...	2020-03-30 11:43:51 -07:00
Linus Torvalds	10f36b1e80	for-5.7/block-2020-03-29 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6BJCoQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvziEACqQC+QRKiqR6X5yaPWJ9LqjKE7lfI1PUb7 0a1z1mKuf8d6z0qNleUwdSOEaS5zJiswou2K8GLvEtTQH41QYsQkxc9GLjAyTveK szAyzZaa3BNUy9hkczm9i2arv3fI8XoTE3JvRM0e9wL8fBJDYCtKtHFJvF4hisOQ ydaJlU6tcwzd9bdV7K5dLwBxu3AeAJjzS3Tyfw25u9N9O/btUxJ91RTqBb2+Xeoz AVasfRlAqf/CzdjxCCmDgWE2QM4852pAeQ7UJJBGISNWNoiwkezMg+6HD0jEOLee bQ8uDyQdihIWTY+/zQasotX8/71uLV8QgtjWLXR9zrjrubIBWHGzoWSQ4kPg5DfQ bJmKO0VvWN2sshZEpWvzzAFGYxZViNphbK2Pb4hKOcv7jtMcC8mmEogh/7EqbD/n KB3IM9qVoXM8INm5o0dTy5uDRJxiHiHYkqsZaKz55BB/R4Geym5TINT3nXgxhQrn JoSwp4zdm3/NJOySruDi2eETqWJC2bsz3FsQSyCQTPOuP0nLtFKBb1UKHpmYTCXG H4LCyCKFJ6s006qBcdaNPZBw1mrSNwoxEulHnpYA4BFfPeXi72yrnMZQkdwWONpW LIVuD0hBm8X/pulbvEEdjzXBqZVkqK3xFX+uX5+bnwwaUKddXAC/h9SQKpBP2Mbb AeZToMklKw== =6Glq -----END PGP SIGNATURE----- Merge tag 'for-5.7/block-2020-03-29' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: - Online capacity resizing (Balbir) - Number of hardware queue change fixes (Bart) - null_blk fault injection addition (Bart) - Cleanup of queue allocation, unifying the node/no-node API (Christoph) - Cleanup of genhd, moving code to where it makes sense (Christoph) - Cleanup of the partition handling code (Christoph) - disk stat fixes/improvements (Konstantin) - BFQ improvements (Paolo) - Various fixes and improvements * tag 'for-5.7/block-2020-03-29' of git://git.kernel.dk/linux-block: (72 commits) block: return NULL in blk_alloc_queue() on error block: move bio_map_* to blk-map.c Revert "blkdev: check for valid request queue before issuing flush" block: simplify queue allocation bcache: pass the make_request methods to blk_queue_make_request null_blk: use blk_mq_init_queue_data block: add a blk_mq_init_queue_data helper block: move the ->devnode callback to struct block_device_operations block: move the part_stat* helpers from genhd.h to a new header block: move block layer internals out of include/linux/genhd.h block: move guard_bio_eod to bio.c block: unexport get_gendisk block: unexport disk_map_sector_rcu block: unexport disk_get_part block: mark part_in_flight and part_in_flight_rw static block: mark block_depr static block: factor out requeue handling from dispatch code block/diskstats: replace time_in_queue with sum of request times block/diskstats: accumulate all per-cpu counters in one pass block/diskstats: more accurate approximation of io_ticks for slow disks ...	2020-03-30 11:20:13 -07:00
Chaitanya Kulkarni	654a3667df	block: return NULL in blk_alloc_queue() on error This patch fixes follwoing warning: block/blk-core.c: In function ‘blk_alloc_queue’: block/blk-core.c:558:10: warning: returning ‘int’ from a function with return type ‘struct request_queue *’ makes pointer from integer without a cast [-Wint-conversion] return -EINVAL; Fixes: `3d745ea5b0` ("block: simplify queue allocation") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-29 10:08:26 -06:00
Chaitanya Kulkarni	02694e8635	block: add a zone condition debug helper Add a helper to stringify the zone conditions. We use this helper in the next patch to track zone conditions in tracepoints. Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-27 13:39:09 -06:00
Christoph Hellwig	130879f1ee	block: move bio_map_* to blk-map.c The bio_map_* helpers are just the low-level helpers for the blk_rq_map_* APIs. Move them together for better logical grouping, as no there isn't much overlap with other code in bio.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-27 12:04:34 -06:00
Christoph Hellwig	f01b411f41	Revert "blkdev: check for valid request queue before issuing flush" This reverts commit `f10d9f617a`. We can't have queues without a make_request_fn any more (and the loop device uses blk-mq these days anyway..). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-27 10:23:44 -06:00
Christoph Hellwig	3d745ea5b0	block: simplify queue allocation Current make_request based drivers use either blk_alloc_queue_node or blk_alloc_queue to allocate a queue, and then set up the make_request_fn function pointer and a few parameters using the blk_queue_make_request helper. Simplify this by passing the make_request pointer to blk_alloc_queue, and while at it merge the _node variant into the main helper by always passing a node_id, and remove the superfluous gfp_mask parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-27 10:23:43 -06:00
Christoph Hellwig	2f227bb999	block: add a blk_mq_init_queue_data helper This allows a driver to pass a queuedata member before ->init_hctx is called. null_blk currently open codes this logic, but I'd rather have it in the core to ease future maintainance. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-27 10:23:43 -06:00
Christoph Hellwig	348e114bbd	block: move the ->devnode callback to struct block_device_operations There really isn't any good reason to stash a method directly into struct gendisk. Move it together with the other block device operations. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-27 09:50:05 -06:00
Christoph Hellwig	c6a564ffad	block: move the part_stat* helpers from genhd.h to a new header These macros are just used by a few files. Move them out of genhd.h, which is included everywhere into a new standalone header. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-25 09:50:09 -06:00
Christoph Hellwig	581e26004a	block: move block layer internals out of include/linux/genhd.h None of this needs to be exposed to drivers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-25 09:50:08 -06:00
Christoph Hellwig	29125ed624	block: move guard_bio_eod to bio.c This is bio layer functionality and not related to buffer heads. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-25 09:50:08 -06:00
Christoph Hellwig	1b4d4dbdae	block: unexport get_gendisk get_gendisk is not used by any modular code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-25 09:50:08 -06:00
Christoph Hellwig	a7818aedda	block: unexport disk_map_sector_rcu disk_map_sector_rcu is not used by any modular code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-25 09:50:08 -06:00
Christoph Hellwig	572e7fc85b	block: unexport disk_get_part disk_get_part is not used by any modular code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-25 09:50:08 -06:00
Christoph Hellwig	6005771c17	block: mark part_in_flight and part_in_flight_rw static Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-25 09:50:08 -06:00
Christoph Hellwig	31eb618679	block: mark block_depr static Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-25 09:50:08 -06:00
Johannes Thumshirn	c92a41031a	block: factor out requeue handling from dispatch code Factor out the requeue handling from the dispatch code, this will make subsequent addition of different requeueing schemes easier. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-25 09:40:24 -06:00
Konstantin Khlebnikov	8cd5b8fc00	block/diskstats: replace time_in_queue with sum of request times Column "time_in_queue" in diskstats is supposed to show total waiting time of all requests. I.e. value should be equal to the sum of times from other columns. But this is not true, because column "time_in_queue" is counted separately in jiffies rather than in nanoseconds as other times. This patch removes redundant counter for "time_in_queue" and shows total time of read, write, discard and flush requests. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-25 08:49:12 -06:00
Konstantin Khlebnikov	ea18e0f0a6	block/diskstats: accumulate all per-cpu counters in one pass Reading /proc/diskstats iterates over all cpus for summing each field. It's faster to sum all fields in one pass. Hammering /proc/diskstats with fio shows 2x performance improvement: fio --name=test --numjobs=$JOBS --filename=/proc/diskstats \ --size=1k --bs=1k --fallocate=none --create_on_open=1 \ --time_based=1 --runtime=10 --invalidate=0 --group_report JOBS=1 JOBS=10 Before: 7k iops 64k iops After: 18k iops 120k iops Also this way code is more compact: add/remove: 1/0 grow/shrink: 0/2 up/down: 194/-1540 (-1346) Function old new delta part_stat_read_all - 194 +194 diskstats_show 1344 631 -713 part_stat_show 1219 392 -827 Total: Before=14966947, After=14965601, chg -0.01% Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-25 08:49:10 -06:00
Konstantin Khlebnikov	2b8bd42361	block/diskstats: more accurate approximation of io_ticks for slow disks Currently io_ticks is approximated by adding one at each start and end of requests if jiffies counter has changed. This works perfectly for requests shorter than a jiffy or if one of requests starts/ends at each jiffy. If disk executes just one request at a time and they are longer than two jiffies then only first and last jiffies will be accounted. Fix is simple: at the end of request add up into io_ticks jiffies passed since last update rather than just one jiffy. Example: common HDD executes random read 4k requests around 12ms. fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 & iostat -x 10 sdb Note changes of iostat's "%util" 8,43% -> 99,99% before/after patch: Before: Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdb 0,00 0,00 82,60 0,00 330,40 0,00 8,00 0,96 12,09 12,09 0,00 1,02 8,43 After: Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdb 0,00 0,00 82,50 0,00 330,00 0,00 8,00 1,00 12,10 12,10 0,00 12,12 99,99 Now io_ticks does not loose time between start and end of requests, but for queue-depth > 1 some I/O time between adjacent starts might be lost. For load estimation "%util" is not as useful as average queue length, but it clearly shows how often disk queue is completely empty. Fixes: `5b18b5a737` ("block: delete part_round_stats and switch to less precise counting") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-25 08:49:08 -06:00
Christoph Hellwig	387048bf67	block: merge partition-generic.c and check.c Merge block/partition-generic.c and block/partitions/check.c into a single block/partitions/core.c as the content is closely related and both files are tiny. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 07:57:08 -06:00
Christoph Hellwig	3f4fc59c13	block: move the various x86 Unix label formats out of genhd.h All these are just used in block/partitions/msdos.c, so move them out of the genhd.h driver included by every driver. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 07:57:08 -06:00
Christoph Hellwig	cb0ab52652	partitions/msdos: remove LINUX_SWAP_PARTITION Just always use NEW_SOLARIS_X86_PARTITION and explain the situation, as that is less confusing than two names for a single value. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 07:57:08 -06:00
Christoph Hellwig	0226e9ead4	block: move the _PARTITION enum out of genhd.h The enum containing the _PARTITION symbolic names is only relevant for the partition parser. More specifically most values are MSDOS partition table system indicators and thus should go straight into msdos.c. One value is only used by the sun partition parser, and the sun and sgi partition parsers use the same value as the x86 Linux RAID indicator to also indicate RAID autodetection. Duplicate them in sun.c and sgi.c given that the different partition types use entirely different values otherwise. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 07:57:08 -06:00
Christoph Hellwig	1442f76d43	block: move struct partition out of genhd.h struct partition is the on-disk format of a MSDOS partition table entry. Move it out of genhd.h into a new msdos_partition.h header and give it a msdos_ prefix to avoid confusion. Also move the magic number from block/partitions/msdos.h to the new header so that it can be used by the SCSI drivers looking at the DOS partition tables. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 07:57:08 -06:00
Christoph Hellwig	cbb5cb3b29	block: remove block/partitions/sun.h Just move the two defines to block/partitions/sun.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 07:57:08 -06:00
Christoph Hellwig	95f77ef35a	block: remove block/partitions/sgi.h Just move the single define to block/partitions/sgi.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 07:57:08 -06:00
Christoph Hellwig	3466f63a7c	block: remove block/partitions/osf.h Just move the single define to block/partitions/osf.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 07:57:08 -06:00
Christoph Hellwig	f6d17358dc	block: remove block/partitions/karma.h Just move the single define to block/partitions/karma.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 07:57:08 -06:00
Christoph Hellwig	3f1b95ef81	block: declare all partition detection routines in check.h There is no good reason to include one header per partition type in core.c. Instead move the prototypes for the detection routins to check.h, and remove all now empty headers in block/partitions/. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 07:57:08 -06:00
Christoph Hellwig	ffa9ed647a	block: remove warn_no_part The warn_no_part is initialized to 1 and never changed. Remove it and execute the code keyed off from it unconditionally. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 07:57:08 -06:00
Christoph Hellwig	74cc979c3c	block: cleanup how md_autodetect_dev is called Add a new include/linux/raid/detect.h header to declare the md_autodetect_dev prototype which can be shared between md and the partition code. Then use IS_BUILTIN to call it instead of the ifdef magic. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 07:57:08 -06:00
Christoph Hellwig	1a9fba3a77	block: unexport read_dev_sector and put_dev_sector read_dev_sector and put_dev_sector are now only used by the partition parsing code. Remove the export for read_dev_sector and merge it into the only caller. Clean the mess up a bit by using goto labels and the SECTOR_SHIFT constant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 07:57:08 -06:00
Christoph Hellwig	f17c21c1ec	block: remove alloc_part_info and free_part_info There isn't any good reason not to simply open code the allocation and freeing of the partition_meta_info structure. Especially as one of the branches in alloc_part_info is entirely dead code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 07:57:07 -06:00
Christoph Hellwig	3ad5cee5cd	block: move sysfs methods shared by disks and partitions to genhd.c Move the sysfs _show methods that are used both on the full disk and partition nodes to genhd.c instead of hiding them in the partitioning code. Also move the declaration for these methods to block/blk.h so that we don't expose them to drivers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 07:57:07 -06:00
Christoph Hellwig	5cbd28e3ce	block: move disk_name and related helpers out of partition-generic.c Thes functions aren't really related to partition support, so move them to a more suitable place. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 07:57:07 -06:00
Christoph Hellwig	ea3edd4dc2	block: remove __bdevname There is no good reason for __bdevname to exist. Just open code printing the string in the callers. For three of them the format string can be trivially merged into existing printk statements, and in init/do_mounts.c we can at least do the scnprintf once at the start of the function, and unconditional of CONFIG_BLOCK to make the output for tiny configfs a little more helpful. Acked-by: Theodore Ts'o <tytso@mit.edu> # for ext4 Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 07:57:07 -06:00
Christoph Hellwig	d2332c5c04	block: remove the blk_lookup_devt export This function is only used by init/do_mounts.c, which can't be modular. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 07:57:07 -06:00
Paolo Valente	4d38a87fbb	block, bfq: invoke flush_idle_tree after reparent_active_queues in pd_offline In bfq_pd_offline(), the function bfq_flush_idle_tree() is invoked to flush the rb tree that contains all idle entities belonging to the pd (cgroup) being destroyed. In particular, bfq_flush_idle_tree() is invoked before bfq_reparent_active_queues(). Yet the latter may happen to add some entities to the idle tree. It happens if, in some of the calls to bfq_bfqq_move() performed by bfq_reparent_active_queues(), the queue to move is empty and gets expired. This commit simply reverses the invocation order between bfq_flush_idle_tree() and bfq_reparent_active_queues(). Tested-by: cki-project@redhat.com Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-21 14:31:03 -06:00
Paolo Valente	576682fa52	block, bfq: make reparent_leaf_entity actually work only on leaf entities bfq_reparent_leaf_entity() reparents the input leaf entity (a leaf entity represents just a bfq_queue in an entity tree). Yet, the input entity is guaranteed to always be a leaf entity only in two-level entity trees. In this respect, because of the error fixed by commit `14afc59361` ("block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group()"), all (wrongly collapsed) entity trees happened to actually have only two levels. After the latter commit, this does not hold any longer. This commit fixes this problem by modifying bfq_reparent_leaf_entity(), so that it searches an active leaf entity down the path that stems from the input entity. Such a leaf entity is guaranteed to exist when bfq_reparent_leaf_entity() is invoked. Tested-by: cki-project@redhat.com Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-21 14:31:02 -06:00
Paolo Valente	c899773665	block, bfq: turn put_queue into release_process_ref in __bfq_bic_change_cgroup A bfq_put_queue() may be invoked in __bfq_bic_change_cgroup(). The goal of this put is to release a process reference to a bfq_queue. But process-reference releases may trigger also some extra operation, and, to this goal, are handled through bfq_release_process_ref(). So, turn the invocation of bfq_put_queue() into an invocation of bfq_release_process_ref(). Tested-by: cki-project@redhat.com Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-21 14:31:00 -06:00
Paolo Valente	fd1bb3ae54	block, bfq: move forward the getting of an extra ref in bfq_bfqq_move Commit `ecedd3d7e1` ("block, bfq: get extra ref to prevent a queue from being freed during a group move") gets an extra reference to a bfq_queue before possibly deactivating it (temporarily), in bfq_bfqq_move(). This prevents the bfq_queue from disappearing before being reactivated in its new group. Yet, the bfq_queue may also be expired (i.e., its service may be stopped) before the bfq_queue is deactivated. And also an expiration may lead to a premature freeing. This commit fixes this issue by simply moving forward the getting of the extra reference already introduced by commit `ecedd3d7e1` ("block, bfq: get extra ref to prevent a queue from being freed during a group move"). Reported-by: cki-project@redhat.com Tested-by: cki-project@redhat.com Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-21 14:30:58 -06:00
Zhiqiang Liu	2f95fa5c95	block, bfq: fix use-after-free in bfq_idle_slice_timer_body In bfq_idle_slice_timer func, bfqq = bfqd->in_service_queue is not in bfqd-lock critical section. The bfqq, which is not equal to NULL in bfq_idle_slice_timer, may be freed after passing to bfq_idle_slice_timer_body. So we will access the freed memory. In addition, considering the bfqq may be in race, we should firstly check whether bfqq is in service before doing something on it in bfq_idle_slice_timer_body func. If the bfqq in race is not in service, it means the bfqq has been expired through __bfq_bfqq_expire func, and wait_request flags has been cleared in __bfq_bfqd_reset_in_service func. So we do not need to re-clear the wait_request of bfqq which is not in service. KASAN log is given as follows: [13058.354613] ================================================================== [13058.354640] BUG: KASAN: use-after-free in bfq_idle_slice_timer+0xac/0x290 [13058.354644] Read of size 8 at addr ffffa02cf3e63f78 by task fork13/19767 [13058.354646] [13058.354655] CPU: 96 PID: 19767 Comm: fork13 [13058.354661] Call trace: [13058.354667] dump_backtrace+0x0/0x310 [13058.354672] show_stack+0x28/0x38 [13058.354681] dump_stack+0xd8/0x108 [13058.354687] print_address_description+0x68/0x2d0 [13058.354690] kasan_report+0x124/0x2e0 [13058.354697] __asan_load8+0x88/0xb0 [13058.354702] bfq_idle_slice_timer+0xac/0x290 [13058.354707] __hrtimer_run_queues+0x298/0x8b8 [13058.354710] hrtimer_interrupt+0x1b8/0x678 [13058.354716] arch_timer_handler_phys+0x4c/0x78 [13058.354722] handle_percpu_devid_irq+0xf0/0x558 [13058.354731] generic_handle_irq+0x50/0x70 [13058.354735] __handle_domain_irq+0x94/0x110 [13058.354739] gic_handle_irq+0x8c/0x1b0 [13058.354742] el1_irq+0xb8/0x140 [13058.354748] do_wp_page+0x260/0xe28 [13058.354752] __handle_mm_fault+0x8ec/0x9b0 [13058.354756] handle_mm_fault+0x280/0x460 [13058.354762] do_page_fault+0x3ec/0x890 [13058.354765] do_mem_abort+0xc0/0x1b0 [13058.354768] el0_da+0x24/0x28 [13058.354770] [13058.354773] Allocated by task 19731: [13058.354780] kasan_kmalloc+0xe0/0x190 [13058.354784] kasan_slab_alloc+0x14/0x20 [13058.354788] kmem_cache_alloc_node+0x130/0x440 [13058.354793] bfq_get_queue+0x138/0x858 [13058.354797] bfq_get_bfqq_handle_split+0xd4/0x328 [13058.354801] bfq_init_rq+0x1f4/0x1180 [13058.354806] bfq_insert_requests+0x264/0x1c98 [13058.354811] blk_mq_sched_insert_requests+0x1c4/0x488 [13058.354818] blk_mq_flush_plug_list+0x2d4/0x6e0 [13058.354826] blk_flush_plug_list+0x230/0x548 [13058.354830] blk_finish_plug+0x60/0x80 [13058.354838] read_pages+0xec/0x2c0 [13058.354842] __do_page_cache_readahead+0x374/0x438 [13058.354846] ondemand_readahead+0x24c/0x6b0 [13058.354851] page_cache_sync_readahead+0x17c/0x2f8 [13058.354858] generic_file_buffered_read+0x588/0xc58 [13058.354862] generic_file_read_iter+0x1b4/0x278 [13058.354965] ext4_file_read_iter+0xa8/0x1d8 [ext4] [13058.354972] __vfs_read+0x238/0x320 [13058.354976] vfs_read+0xbc/0x1c0 [13058.354980] ksys_read+0xdc/0x1b8 [13058.354984] __arm64_sys_read+0x50/0x60 [13058.354990] el0_svc_common+0xb4/0x1d8 [13058.354994] el0_svc_handler+0x50/0xa8 [13058.354998] el0_svc+0x8/0xc [13058.354999] [13058.355001] Freed by task 19731: [13058.355007] __kasan_slab_free+0x120/0x228 [13058.355010] kasan_slab_free+0x10/0x18 [13058.355014] kmem_cache_free+0x288/0x3f0 [13058.355018] bfq_put_queue+0x134/0x208 [13058.355022] bfq_exit_icq_bfqq+0x164/0x348 [13058.355026] bfq_exit_icq+0x28/0x40 [13058.355030] ioc_exit_icq+0xa0/0x150 [13058.355035] put_io_context_active+0x250/0x438 [13058.355038] exit_io_context+0xd0/0x138 [13058.355045] do_exit+0x734/0xc58 [13058.355050] do_group_exit+0x78/0x220 [13058.355054] __wake_up_parent+0x0/0x50 [13058.355058] el0_svc_common+0xb4/0x1d8 [13058.355062] el0_svc_handler+0x50/0xa8 [13058.355066] el0_svc+0x8/0xc [13058.355067] [13058.355071] The buggy address belongs to the object at ffffa02cf3e63e70#012 which belongs to the cache bfq_queue of size 464 [13058.355075] The buggy address is located 264 bytes inside of#012 464-byte region [ffffa02cf3e63e70, ffffa02cf3e64040) [13058.355077] The buggy address belongs to the page: [13058.355083] page:ffff7e80b3cf9800 count:1 mapcount:0 mapping:ffff802db5c90780 index:0xffffa02cf3e606f0 compound_mapcount: 0 [13058.366175] flags: 0x2ffffe0000008100(slab\|head) [13058.370781] raw: 2ffffe0000008100 ffff7e80b53b1408 ffffa02d730c1c90 ffff802db5c90780 [13058.370787] raw: ffffa02cf3e606f0 0000000000370023 00000001ffffffff 0000000000000000 [13058.370789] page dumped because: kasan: bad access detected [13058.370791] [13058.370792] Memory state around the buggy address: [13058.370797] ffffa02cf3e63e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fb fb [13058.370801] ffffa02cf3e63e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [13058.370805] >ffffa02cf3e63f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [13058.370808] ^ [13058.370811] ffffa02cf3e63f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [13058.370815] ffffa02cf3e64000: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc [13058.370817] ================================================================== [13058.370820] Disabling lock debugging due to kernel taint Here, we directly pass the bfqd to bfq_idle_slice_timer_body func. -- V2->V3: rewrite the comment as suggested by Paolo Valente V1->V2: add one comment, and add Fixes and Reported-by tag. Fixes: `aee69d78d` ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Acked-by: Paolo Valente <paolo.valente@linaro.org> Reported-by: Wang Wang <wangwang2@huawei.com> Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Signed-off-by: Feilong Lin <linfeilong@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-21 14:29:44 -06:00
Balbir Singh	e598a72fae	block/genhd: Notify udev about capacity change Allow block/genhd to notify user space (via udev) about disk size changes using a new helper set_capacity_revalidate_and_notify(), which is a wrapper on top of set_capacity(). set_capacity_revalidate_and_notify() will only notify via udev if the current capacity or the target capacity is not zero and iff the capacity changes. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Someswarudu Sangaraju <ssomesh@amazon.com> Signed-off-by: Balbir Singh <sblbir@amazon.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-18 15:13:21 -06:00
Ming Lei	de6a78b601	block: Prevent hung_check firing during long sync IO submit_bio_wait() can be called from ioctl(BLKSECDISCARD), which may take long time to complete, as Salman mentioned, 4K BLKSECDISCARD takes up to 100 second on some devices. Also any block I/O operation that occurs after the BLKSECDISCARD is submitted will also potentially be affected by the hung task timeouts. Another report is that task hang can be observed when running mkfs over raid10 which takes a small max discard sectors limit because of chunk size. So prevent hung_check from firing by taking same approach used in blk_execute_rq(), and the wake-up interval is set as half the hung_check timer period, which keeps overhead low enough. Cc: Salman Qazi <sqazi@google.com> Cc: Jesse Barnes <jsbarnes@google.com> Cc: Bart Van Assche <bvanassche@acm.org> Link: https://lkml.org/lkml/2020/2/12/1193 Reported-by: Salman Qazi <sqazi@google.com> Reviewed-by: Jesse Barnes <jsbarnes@google.com> Reviewed-by: Salman Qazi <sqazi@google.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-18 08:48:03 -06:00
Gabriela Bittencourt	7901b6e4e6	blk-mq: Fix typo in comment Fix typo in words: 'vector' and 'query'. Signed-off-by: Gabriela Bittencourt <gabrielabittencourt00@gmail.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2020-03-17 20:55:21 +01:00
Konstantin Khlebnikov	e74d93e96d	block: keep bdi->io_pages in sync with max_sectors_kb for stacked devices Field bdi->io_pages added in commit `9491ae4aad` ("mm: don't cap request size based on read-ahead setting") removes unneeded split of read requests. Stacked drivers do not call blk_queue_max_hw_sectors(). Instead they set limits of their devices by blk_set_stacking_limits() + disk_stack_limits(). Field bio->io_pages stays zero until user set max_sectors_kb via sysfs. This patch updates io_pages after merging limits in disk_stack_limits(). Commit `c6d6e9b0f6` ("dm: do not allow readahead to limit IO size") fixed the same problem for device-mapper devices, this one fixes MD RAIDs. Fixes: `9491ae4aad` ("mm: don't cap request size based on read-ahead setting") Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-03-17 10:53:07 -07:00
Ryan Attard	6fdb79ff27	scsi: core: Allow non-root users to perform ZBC commands Allow users with read permissions to issue REPORT ZONE commands and users with write permissions to manage zones on block devices supporting the ZBC specification. Link: https://lore.kernel.org/r/20200226170518.92963-2-ryanattard@ryanattard.info Signed-off-by: Ryan Attard <ryanattard@ryanattard.info> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2020-03-16 18:26:31 -04:00
Alexey Dobriyan	11bde98600	block, zoned: fix integer overflow with BLKRESETZONE et al Check for overflow in addition before checking for end-of-block-device. Steps to reproduce: #define _GNU_SOURCE 1 #include <sys/ioctl.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> typedef unsigned long long __u64; struct blk_zone_range { __u64 sector; __u64 nr_sectors; }; #define BLKRESETZONE _IOW(0x12, 131, struct blk_zone_range) int main(void) { int fd = open("/dev/nullb0", O_RDWR\|O_DIRECT); struct blk_zone_range zr = {4096, 0xfffffffffffff000ULL}; ioctl(fd, BLKRESETZONE, &zr); return 0; } BUG: KASAN: null-ptr-deref in submit_bio_wait+0x74/0xe0 Write of size 8 at addr 0000000000000040 by task a.out/1590 CPU: 8 PID: 1590 Comm: a.out Not tainted 5.6.0-rc1-00019-g359c92c02bfa #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014 Call Trace: dump_stack+0x76/0xa0 __kasan_report.cold+0x5/0x3e kasan_report+0xe/0x20 submit_bio_wait+0x74/0xe0 blkdev_zone_mgmt+0x26f/0x2a0 blkdev_zone_mgmt_ioctl+0x14b/0x1b0 blkdev_ioctl+0xb28/0xe60 block_ioctl+0x69/0x80 ksys_ioctl+0x3af/0xa50 Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-12 09:10:52 -06:00
Weiping Zhang	fa800d73c8	blk-iocost: remove duplicated lines in comments Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-12 08:02:20 -06:00
Revanth Rajashekar	88d6041d07	block: sed-opal: Change the check condition for regular session validity This patch changes the check condition for the validity/authentication of the session. 1. The Host Session Number(HSN) in the response should match the HSN for the session. 2. The TPER Session Number(TSN) can never be less than 4096 for a regular session. Reference: Section 3.2.2.1 of https://trustedcomputinggroup.org/wp-content/uploads/TCG_Storage_Opal_SSC_Application_Note_1-00_1-00-Final.pdf Section 3.3.7.1.1 of https://trustedcomputinggroup.org/wp-content/uploads/TCG_Storage_Architecture_Core_Spec_v2.01_r1.00.pdf Co-developed-by: Andrzej Jakowski <andrzej.jakowski@linux.intel.com> Signed-off-by: Andrzej Jakowski <andrzej.jakowski@linux.intel.com> Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-12 08:00:10 -06:00
Shin'ichiro Kawasaki	b53df2e744	block: Fix partition support for host aware zoned block devices Commit `b72053072c` ("block: allow partitions on host aware zone devices") introduced the helper function disk_has_partitions() to check if a given disk has valid partitions. However, since this function result directly depends on the disk partition table length rather than the actual existence of valid partitions in the table, it returns true even after all partitions are removed from the disk. For host aware zoned block devices, this results in zone management support to be kept disabled even after removing all partitions. Fix this by changing disk_has_partitions() to walk through the partition table entries and return true if and only if a valid non-zero size partition is found. Fixes: `b72053072c` ("block: allow partitions on host aware zone devices") Cc: stable@vger.kernel.org # 5.5 Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-12 07:54:39 -06:00
Guoqing Jiang	ce24f736f2	block: cleanup comment for blk_flush_complete_seq Remove the comment about return value, since it is not valid after commit `404b8f5a03` ("block: cleanup kick/queued handling"). Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-12 07:42:54 -06:00
Guoqing Jiang	754a15726f	block: remove unneeded argument from blk_alloc_flush_queue Remove 'q' from arguments since it is not used anymore after commit `7e992f847a` ("block: remove non mq parts from the flush code"). Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-12 07:42:54 -06:00
Guoqing Jiang	361301a222	block: cleanup for _blk/blk_rq_prep_clone Both cmd and sense had been moved to scsi_request, so remove the related comments to avoid confusion. And as Bart suggested, move _blk_rq_prep_clone into the only caller (blk_rq_prep_clone). Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-12 07:42:54 -06:00
Guoqing Jiang	fc4cc77210	block: remove redundant setting of QUEUE_FLAG_DYING Previously, blk_cleanup_queue has called blk_set_queue_dying to set the flag, no need to do it again. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-12 07:42:54 -06:00
Guoqing Jiang	35ed78b32c	block: use bio_{wouldblock,io}_error in direct_make_request Use the two functions to simplify code. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-12 07:42:54 -06:00
Guoqing Jiang	0d72031820	block: fix comment for blk_cloned_rq_check_limits Since the later description mentioned "checked against the new queue limits", so make the change to avoid confusion. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-12 07:42:54 -06:00
Ming Lei	cc3200eac4	blk-mq: insert flush request to the front of dispatch queue commit `01e99aeca3` ("blk-mq: insert passthrough request into hctx->dispatch directly") may change to add flush request to the tail of dispatch by applying the 'add_head' parameter of blk_mq_sched_insert_request. Turns out this way causes performance regression on NCQ controller because flush is non-NCQ command, which can't be queued when there is any in-flight NCQ command. When adding flush rq to the front of hctx->dispatch, it is easier to introduce extra time to flush rq's latency compared with adding to the tail of dispatch queue because of S_SCHED_RESTART, then chance of flush merge is increased, and less flush requests may be issued to controller. So always insert flush request to the front of dispatch queue just like before applying commit `01e99aeca3` ("blk-mq: insert passthrough request into hctx->dispatch directly"). Cc: Damien Le Moal <Damien.LeMoal@wdc.com> Cc: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: `01e99aeca3` ("blk-mq: insert passthrough request into hctx->dispatch directly") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-12 07:26:12 -06:00
Sahitya Tummala	30a2da7b7e	block: Fix use-after-free issue accessing struct io_cq There is a potential race between ioc_release_fn() and ioc_clear_queue() as shown below, due to which below kernel crash is observed. It also can result into use-after-free issue. context#1: context#2: ioc_release_fn() __ioc_clear_queue() gets the same icq ->spin_lock(&ioc->lock); ->spin_lock(&ioc->lock); ->ioc_destroy_icq(icq); ->list_del_init(&icq->q_node); ->call_rcu(&icq->__rcu_head, icq_free_icq_rcu); ->spin_unlock(&ioc->lock); ->ioc_destroy_icq(icq); ->hlist_del_init(&icq->ioc_node); This results into below crash as this memory is now used by icq->__rcu_head in context#1. There is a chance that icq could be free'd as well. 22150.386550: <6> Unable to handle kernel write to read-only memory at virtual address ffffffaa8d31ca50 ... Call trace: 22150.607350: <2> ioc_destroy_icq+0x44/0x110 22150.611202: <2> ioc_clear_queue+0xac/0x148 22150.615056: <2> blk_cleanup_queue+0x11c/0x1a0 22150.619174: <2> __scsi_remove_device+0xdc/0x128 22150.623465: <2> scsi_forget_host+0x2c/0x78 22150.627315: <2> scsi_remove_host+0x7c/0x2a0 22150.631257: <2> usb_stor_disconnect+0x74/0xc8 22150.635371: <2> usb_unbind_interface+0xc8/0x278 22150.639665: <2> device_release_driver_internal+0x198/0x250 22150.644897: <2> device_release_driver+0x24/0x30 22150.649176: <2> bus_remove_device+0xec/0x140 22150.653204: <2> device_del+0x270/0x460 22150.656712: <2> usb_disable_device+0x120/0x390 22150.660918: <2> usb_disconnect+0xf4/0x2e0 22150.664684: <2> hub_event+0xd70/0x17e8 22150.668197: <2> process_one_work+0x210/0x480 22150.672222: <2> worker_thread+0x32c/0x4c8 Fix this by adding a new ICQ_DESTROYED flag in ioc_destroy_icq() to indicate this icq is once marked as destroyed. Also, ensure __ioc_clear_queue() is accessing icq within rcu_read_lock/unlock so that icq doesn't get free'd up while it is still using it. Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Co-developed-by: Pradeep P V K <ppvk@codeaurora.org> Signed-off-by: Pradeep P V K <ppvk@codeaurora.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-12 07:07:38 -06:00
Tejun Heo	dcd6589b11	blk-iocost: fix incorrect vtime comparison in iocg_is_idle() vtimes may wrap and time_before/after64() should be used to determine whether a given vtime is before or after another. iocg_is_idle() was incorrectly using plain "<" comparison do determine whether done_vtime is before vtime. Here, the only thing we're interested in is whether done_vtime matches vtime which indicates that there's nothing in flight. Let's test for inequality instead. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `7caa47151a` ("blkcg: implement blk-iocost") Cc: stable@vger.kernel.org # v5.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-10 11:37:00 -06:00
Bart Van Assche	d0930bb8f4	blk-mq: Fix a recently introduced regression in blk_mq_realloc_hw_ctxs() q->nr_hw_queues must only be updated once it is known that blk_mq_realloc_hw_ctxs() has succeeded. Otherwise it can happen that reallocation fails and that q->nr_hw_queues is larger than the number of allocated hardware queues. This patch fixes the following crash if increasing the number of hardware queues fails: BUG: KASAN: null-ptr-deref in blk_mq_map_swqueue+0x775/0x810 Write of size 8 at addr 0000000000000118 by task check/977 CPU: 3 PID: 977 Comm: check Not tainted 5.6.0-rc1-dbg+ #8 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: dump_stack+0xa5/0xe6 __kasan_report.cold+0x65/0x99 kasan_report+0x16/0x20 check_memory_region+0x140/0x1b0 memset+0x28/0x40 blk_mq_map_swqueue+0x775/0x810 blk_mq_update_nr_hw_queues+0x468/0x710 nullb_device_submit_queues_store+0xf7/0x1a0 [null_blk] configfs_write_file+0x1c4/0x250 [configfs] __vfs_write+0x4c/0x90 vfs_write+0x145/0x2c0 ksys_write+0xd7/0x180 __x64_sys_write+0x47/0x50 do_syscall_64+0x6f/0x2f0 entry_SYSCALL_64_after_hwframe+0x49/0xbe Fixes: `ac0d6b926e` ("block: Reduce the amount of memory required per request queue") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Johannes Thumshirn <jth@kernel.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-10 07:09:59 -06:00
Bart Van Assche	6e66b49392	blk-mq: Keep set->nr_hw_queues and set->map[].nr_queues in sync blk_mq_map_queues() and multiple .map_queues() implementations expect that set->map[HCTX_TYPE_DEFAULT].nr_queues is set to the number of hardware queues. Hence set .nr_queues before calling these functions. This patch fixes the following kernel warning: WARNING: CPU: 0 PID: 2501 at include/linux/cpumask.h:137 Call Trace: blk_mq_run_hw_queue+0x19d/0x350 block/blk-mq.c:1508 blk_mq_run_hw_queues+0x112/0x1a0 block/blk-mq.c:1525 blk_mq_requeue_work+0x502/0x780 block/blk-mq.c:775 process_one_work+0x9af/0x1740 kernel/workqueue.c:2269 worker_thread+0x98/0xe40 kernel/workqueue.c:2415 kthread+0x361/0x430 kernel/kthread.c:255 Fixes: `ed76e329d7` ("blk-mq: abstract out queue map") # v5.0 Reported-by: syzbot+d44e1b26ce5c3e77458d@syzkaller.appspotmail.com Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Cc: Johannes Thumshirn <jth@kernel.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-10 07:09:59 -06:00
Nikolai Merinov	d5528d5e91	partitions/efi: Fix partition name parsing in GUID partition entry GUID partition entry defined to have a partition name as 36 UTF-16LE code units. This means that on big-endian platforms ASCII symbols would be read with 0xXX00 efi_char16_t character code. In order to correctly extract ASCII characters from a partition name field we should be converted from 16LE to CPU architecture. The problem exists on all big endian platforms. [ mingo: Minor edits. ] Fixes: `eec7ecfede` ("genhd, efi: add efi partition metadata to hd_structs") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nikolai Merinov <n.merinov@inango-systems.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200308080859.21568-29-ardb@kernel.org Link: https://lore.kernel.org/r/797777312.1324734.1582544319435.JavaMail.zimbra@inango-systems.com/	2020-03-08 10:00:09 +01:00
Carlo Nonato	14afc59361	block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group() The bfq_find_set_group() function takes as input a blkcg (which represents a cgroup) and retrieves the corresponding bfq_group, then it updates the bfq internal group hierarchy (see comments inside the function for why this is needed) and finally it returns the bfq_group. In the hierarchy update cycle, the pointer holding the correct bfq_group that has to be returned is mistakenly used to traverse the hierarchy bottom to top, meaning that in each iteration it gets overwritten with the parent of the current group. Since the update cycle stops at root's children (depth = 2), the overwrite becomes a problem only if the blkcg describes a cgroup at a hierarchy level deeper than that (depth > 2). In this case the root's child that happens to be also an ancestor of the correct bfq_group is returned. The main consequence is that processes contained in a cgroup at depth greater than 2 are wrongly placed in the group described above by BFQ. This commits fixes this problem by using a different bfq_group pointer in the update cycle in order to avoid the overwrite of the variable holding the original group reference. Reported-by: Kwon Je Oh <kwonje.oh2@gmail.com> Signed-off-by: Carlo Nonato <carlo.nonato95@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-06 07:00:58 -07:00
Daniel Wagner	e959e5405f	block: Remove used kblockd_schedule_work_on() Commit `ee63cfa7fc` ("block: add kblockd_schedule_work_on()") introduced the helper in 2016. Remove it because since then no caller was added. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-02 07:17:31 -07:00
John Garry	cae740a04b	blk-mq: Remove some unused function arguments The struct blk_mq_hw_ctx pointer argument in blk_mq_put_tag(), blk_mq_poll_nsecs(), and blk_mq_poll_hybrid_sleep() is unused, so remove it. Overall obj code size shows a minor reduction, before: text data bss dec hex filename 27306 1312 0 28618 6fca block/blk-mq.o 4303 272 0 4575 11df block/blk-mq-tag.o after: 27282 1312 0 28594 6fb2 block/blk-mq.o 4311 272 0 4583 11e7 block/blk-mq-tag.o Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.garry@huawei.com> -- This minor patch had been carried as part of the blk-mq shared tags RFC, I'd rather not carry it anymore as it required rebasing, so now or never.. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-26 10:34:41 -07:00
Ming Lei	01e99aeca3	blk-mq: insert passthrough request into hctx->dispatch directly For some reason, device may be in one situation which can't handle FS request, so STS_RESOURCE is always returned and the FS request will be added to hctx->dispatch. However passthrough request may be required at that time for fixing the problem. If passthrough request is added to scheduler queue, there isn't any chance for blk-mq to dispatch it given we prioritize requests in hctx->dispatch. Then the FS IO request may never be completed, and IO hang is caused. So passthrough request has to be added to hctx->dispatch directly for fixing the IO hang. Fix this issue by inserting passthrough request into hctx->dispatch directly together withing adding FS request to the tail of hctx->dispatch in blk_mq_dispatch_rq_list(). Actually we add FS request to tail of hctx->dispatch at default, see blk_mq_request_bypass_insert(). Then it becomes consistent with original legacy IO request path, in which passthrough request is always added to q->queue_head. Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-24 18:50:48 -07:00
Linus Torvalds	ed535f2c9e	block-5.6-2020-02-05 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl47ML4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvm2EACGaxAxP7pLniNV30cRotF8lPpQ5nUrpiem H1r5WqeI5osCGkRKHaJQ4O0Sw8IV2pWzHTWz+9bv56zLM40yIMaEHLRU00AM047n KFdA2x4xH+HhbR9lF+flYz1oInlIEXxPiERKm/p1pvQEbquzi4X5cQqv6q2pdzJ9 sf8OBJhKs4rp/ooqzWwjVOeP/n1sT2r+XDg9C9WC5aXaVZbbLw50r1WRYFt1zf7N oa+91fq2lasxK1c79OtbbGJlBXWTurAtUaKBM0KKPguiH2h9j47pAs0HsV02kZ2M 1ZltwKTyfDNMzBEgvkdB3R0G9nU422nIF+w319i6on8P8xfz8Px13d1KCQGAmfD6 K1YuaCgOjWuVhOKpMwBq9ql6QVP+1LIMKIl2OGJkrBgl9ZzfE8KMZa2QZTGrGO/U xE/hirYdj5T1O8umUQ4cmZHTROASOJZ8/eU9XHA1vf/eJYXiS31/4ewgRzP3oGX2 5Jvz3o144nBeBTOiFlzs3Fe+wX63QABNG22bijzEGoNTxjXJFroBDYzeiOELjECZ /xGRZG1bLOGMj8Gg4ZADSILQDkqISsQHofl1I9mWTbwB1j7g69ZjV8Ie2dyMaX6b 5z5Smqzd9gcok9hr8NGWkV3c3NypPxIWxrOcyzYbGLUPDGqa+QjGtlLrGgeinhLM SitalHw0KA== =05d8 -----END PGP SIGNATURE----- Merge tag 'block-5.6-2020-02-05' of git://git.kernel.dk/linux-block Pull more block updates from Jens Axboe: "Some later arrivals, but all fixes at this point: - bcache fix series (Coly) - Series of BFQ fixes (Paolo) - NVMe pull request from Keith with a few minor NVMe fixes - Various little tweaks" * tag 'block-5.6-2020-02-05' of git://git.kernel.dk/linux-block: (23 commits) nvmet: update AEN list and array at one place nvmet: Fix controller use after free nvmet: Fix error print message at nvmet_install_queue function brd: check and limit max_part par nvme-pci: remove nvmeq->tags nvmet: fix dsm failure when payload does not match sgl descriptor nvmet: Pass lockdep expression to RCU lists block, bfq: clarify the goal of bfq_split_bfqq() block, bfq: get a ref to a group when adding it to a service tree block, bfq: remove ifdefs from around gets/puts of bfq groups block, bfq: extend incomplete name of field on_st block, bfq: get extra ref to prevent a queue from being freed during a group move block, bfq: do not insert oom queue into position tree block, bfq: do not plug I/O for bfq_queues with no proc refs bcache: check return value of prio_read() bcache: fix incorrect data type usage in btree_flush_write() bcache: add readahead cache policy options via sysfs interface bcache: explicity type cast in bset_bkey_last() bcache: fix memory corruption in bch_cache_accounting_clear() xen/blkfront: limit allocated memory size to actual use case ...	2020-02-06 06:15:23 +00:00
Paolo Valente	c92bddee77	block, bfq: clarify the goal of bfq_split_bfqq() The exact, general goal of the function bfq_split_bfqq() is not that apparent. Add a comment to make it clear. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 06:58:15 -07:00
Paolo Valente	db37a34c56	block, bfq: get a ref to a group when adding it to a service tree BFQ schedules generic entities, which may represent either bfq_queues or groups of bfq_queues. When an entity is inserted into a service tree, a reference must be taken, to make sure that the entity does not disappear while still referred in the tree. Unfortunately, such a reference is mistakenly taken only if the entity represents a bfq_queue. This commit takes a reference also in case the entity represents a group. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Chris Evich <cevich@redhat.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 06:58:15 -07:00
Paolo Valente	4d8340d0d4	block, bfq: remove ifdefs from around gets/puts of bfq groups ifdefs around gets and puts of bfq groups reduce readability, remove them. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 06:58:15 -07:00
Paolo Valente	33a16a9804	block, bfq: extend incomplete name of field on_st The flag on_st in the bfq_entity data structure is true if the entity is on a service tree or is in service. Yet the name of the field, confusingly, does not mention the second, very important case. Extend the name to mention the second case too. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 06:58:15 -07:00
Paolo Valente	ecedd3d7e1	block, bfq: get extra ref to prevent a queue from being freed during a group move In bfq_bfqq_move(), the bfq_queue, say Q, to be moved to a new group may happen to be deactivated in the scheduling data structures of the source group (and then activated in the destination group). If Q is referred only by the data structures in the source group when the deactivation happens, then Q is freed upon the deactivation. This commit addresses this issue by getting an extra reference before the possible deactivation, and releasing this extra reference after Q has been moved. Tested-by: Chris Evich <cevich@redhat.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 06:58:15 -07:00
Paolo Valente	32c59e3a9a	block, bfq: do not insert oom queue into position tree BFQ maintains an ordered list, implemented with an RB tree, of head-request positions of non-empty bfq_queues. This position tree, inherited from CFQ, is used to find bfq_queues that contain I/O close to each other. BFQ merges these bfq_queues into a single shared queue, if this boosts throughput on the device at hand. There is however a special-purpose bfq_queue that does not participate in queue merging, the oom bfq_queue. Yet, also this bfq_queue could be wrongly added to the position tree. So bfqq_find_close() could return the oom bfq_queue, which is a source of further troubles in an out-of-memory situation. This commit prevents the oom bfq_queue from being inserted into the position tree. Tested-by: Patrick Dung <patdung100@gmail.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 06:58:15 -07:00
Paolo Valente	f718b09327	block, bfq: do not plug I/O for bfq_queues with no proc refs Commit `478de3380c` ("block, bfq: deschedule empty bfq_queues not referred by any process") fixed commit `3726112ec7` ("block, bfq: re-schedule empty queues if they deserve I/O plugging") by descheduling an empty bfq_queue when it remains with not process reference. Yet, this still left a case uncovered: an empty bfq_queue with not process reference that remains in service. This happens for an in-service sync bfq_queue that is deemed to deserve I/O-dispatch plugging when it remains empty. Yet no new requests will arrive for such a bfq_queue if no process sends requests to it any longer. Even worse, the bfq_queue may happen to be prematurely freed while still in service (because there may remain no reference to it any longer). This commit solves this problem by preventing I/O dispatch from being plugged for the in-service bfq_queue, if the latter has no process reference (the bfq_queue is then prevented from remaining in service). Fixes: `3726112ec7` ("block, bfq: re-schedule empty queues if they deserve I/O plugging") Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Reported-by: Patrick Dung <patdung100@gmail.com> Tested-by: Patrick Dung <patdung100@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 06:58:14 -07:00
Linus Torvalds	33c84e89ab	SCSI misc on 20200129 This series is slightly unusual because it includes Arnd's compat ioctl tree here: `1c46a2cf2d` Merge tag 'block-ioctl-cleanup-5.6' into 5.6/scsi-queue Excluding Arnd's changes, this is mostly an update of the usual drivers: megaraid_sas, mpt3sas, qla2xxx, ufs, lpfc, hisi_sas. There are a couple of core and base updates around error propagation and atomicity in the attribute container base we use for the SCSI transport classes. The rest is minor changes and updates. Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com> -----BEGIN PGP SIGNATURE----- iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXjHQJyYcamFtZXMuYm90 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishZZ8AQC02N+v iUnTl1YxGPjIWBbnHuUxN2Qbb9D3C6gAT1LkigEArlk163K3A1XEQHF/VNCdAz/f 01XYTd3p1VHuegIBHlk= =Cn52 -----END PGP SIGNATURE----- Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI updates from James Bottomley: "This series is slightly unusual because it includes Arnd's compat ioctl tree here: `1c46a2cf2d` Merge tag 'block-ioctl-cleanup-5.6' into 5.6/scsi-queue Excluding Arnd's changes, this is mostly an update of the usual drivers: megaraid_sas, mpt3sas, qla2xxx, ufs, lpfc, hisi_sas. There are a couple of core and base updates around error propagation and atomicity in the attribute container base we use for the SCSI transport classes. The rest is minor changes and updates" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (149 commits) scsi: hisi_sas: Rename hisi_sas_cq.pci_irq_mask scsi: hisi_sas: Add prints for v3 hw interrupt converge and automatic affinity scsi: hisi_sas: Modify the file permissions of trigger_dump to write only scsi: hisi_sas: Replace magic number when handle channel interrupt scsi: hisi_sas: replace spin_lock_irqsave/spin_unlock_restore with spin_lock/spin_unlock scsi: hisi_sas: use threaded irq to process CQ interrupts scsi: ufs: Use UFS device indicated maximum LU number scsi: ufs: Add max_lu_supported in struct ufs_dev_info scsi: ufs: Delete is_init_prefetch from struct ufs_hba scsi: ufs: Inline two functions into their callers scsi: ufs: Move ufshcd_get_max_pwr_mode() to ufshcd_device_params_init() scsi: ufs: Split ufshcd_probe_hba() based on its called flow scsi: ufs: Delete struct ufs_dev_desc scsi: ufs: Fix ufshcd_probe_hba() reture value in case ufshcd_scsi_add_wlus() fails scsi: ufs-mediatek: enable low-power mode for hibern8 state scsi: ufs: export some functions for vendor usage scsi: ufs-mediatek: add dbg_register_dump implementation scsi: qla2xxx: Fix a NULL pointer dereference in an error path scsi: qla1280: Make checking for 64bit support consistent scsi: megaraid_sas: Update driver version to 07.713.01.00-rc1 ...	2020-01-29 18:16:16 -08:00
Linus Torvalds	48b4b4ff1e	for-5.6/block-2020-01-27 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl4vOqAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgppYBD/wLczY7hyjF2loc71MC9HloUq3BVbATktM3 OF6wRbyxbeiOj/7Px0lE0M67tQbnEoIP26gS03fd6e7HE19//gmzGuB3Z2R2CJ5q XKkTamqz0pcPcX5FdDO5JFQZf27/1Qs3g7Nkr7FjVcR2XQ8PFv5B/FLMhse4frJI k92Sj0V1OwdNtMXozKqno/7xPwL/kQKWoF6aFDgO27xLfsFmi8Wbgf/CslOOTHIN vAUaz3Cue6V17M5y98wD4nwpjG7Ve+aY1i6oFPBE7Az9TA0xoiBA/tNPKW7iS10C GEP1aoI6lpgkxAzvyR29K1ayjzV11hEIig3rNIWxNfmCSGaawttWXAPEi7jU5u2D ZXbzUJxKnfeg8yrAj0CTcKLA9i4v1cZXPCUXqMO2+wHEWgmxq2IWuWjSl/V4fn3Y zgTPBngDM4Gx3fAqvD8SVfCW7xwI4VRP+da58WCFOjwnOgYSouxS7RnCtm+yPUbk Es6m2XBb+3ycaJPT58LcXPrnTJWZeRincs3MfFJeTXRn5T7IzlBjKdIvQiQSHQXo caZzWHEJW827+wfQFNreXpk5KPi+D6boeziYe96UcII8L5qVw3N0X5hOpr6IRhkX hn2CUb/CmY6bl8PJJPVc4ygqgiavyvynJu+A0uJvFSjvXX6jjXNEsSJ6bz8aBxdm 4rmgPFTlqA== =yJZi -----END PGP SIGNATURE----- Merge tag 'for-5.6/block-2020-01-27' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "This may be the most quiet round we've had in years. I'm not complaining. Really not a lot to detail here, outside of spelling and documentation improvements/fixes, we have: - Allow t10-pi to be modular (Herbert) - Remove dead code in bfq (Alex) - Mark zone management requests with REQ_SYNC (Chaitanya) - BFQ division improvement (Wen) - Small series improving plugging (Pavel)" * tag 'for-5.6/block-2020-01-27' of git://git.kernel.dk/linux-block: partitions/ldm: fix spelling mistake "to" -> "too" block, bfq: improve arithmetic division in bfq_delta() block/bfq: remove unused bfq_class_rt which never used block: mark zone-mgmt bios with REQ_SYNC blk-mq: Document functions for sending request block: Allow t10-pi to be modular blk-mq: optimise blk_mq_flush_plug_list() list: introduce list_for_each_continue() blk-mq: optimise rq sort function	2020-01-27 12:38:25 -08:00
Christoph Hellwig	b72053072c	block: allow partitions on host aware zone devices Host-aware SMR drives can be used with the commands to explicitly manage zone state, but they can also be used as normal disks. In the former case it makes perfect sense to allow partitions on them, in the latter it does not, just like for host managed devices. Add a check to add_partition to allow partitions on host aware devices, but give up any zone management capabilities in that case, which also catches the previously missed case of adding a partition vs just scanning it. Because sd can rescan the attribute at runtime it needs to check if a disk has partitions, for which a new helper is added to genhd.h. Fixes: `5eac3eb30c` ("block: Remove partition support for zoned block devices") Reported-by: Borislav Petkov <bp@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-26 09:59:08 -07:00
Colin Ian King	5336da37a5	partitions/ldm: fix spelling mistake "to" -> "too" There is a spelling mistake in a ldm_error message. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-23 11:41:45 -07:00
Wen Yang	554d21efb0	block, bfq: improve arithmetic division in bfq_delta() do_div() does a 64-by-32 division. Use div64_ul() instead of it if the divisor is unsigned long, to avoid truncation to 32-bit. And as a nice side effect also cleans up the function a bit. Signed-off-by: Wen Yang <wenyang@linux.alibaba.com> Cc: Paolo Valente <paolo.valente@linaro.org> Cc: Jens Axboe <axboe@fb.com> Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-22 10:34:11 -07:00
Alex Shi	b7f22d993f	block/bfq: remove unused bfq_class_rt which never used This macro is never used after introduced from commit `aee69d78de` ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Better to remove it. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Paolo Valente <paolo.valente@linaro.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-22 10:31:20 -07:00
Mikulas Patocka	ad6bf88a6c	block: fix an integer overflow in logical block size Logical block size has type unsigned short. That means that it can be at most 32768. However, there are architectures that can run with 64k pages (for example arm64) and on these architectures, it may be possible to create block devices with 64k block size. For exmaple (run this on an architecture with 64k pages): Mount will fail with this error because it tries to read the superblock using 2-sector access: device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536 EXT4-fs (dm-0): unable to read superblock This patch changes the logical block size from unsigned short to unsigned int to avoid the overflow. Cc: stable@vger.kernel.org Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-15 21:43:09 -07:00
Ming Lei	4a2f704eb2	block: fix get_max_segment_size() overflow on 32bit arch Commit `429120f3df` starts to take account of segment's start dma address when computing max segment size, and data type of 'unsigned long' is used to do that. However, the segment mask may be 0xffffffff, so the figured out segment size may be overflowed in case of zero physical address on 32bit arch. Fix the issue by returning queue_max_segment_size() directly when that happens. Fixes: `429120f3df` ("block: fix splitting segments on boundary masks") Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: Christoph Hellwig <hch@lst.de> Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-14 13:37:40 -07:00
Ming Lei	83c9c54716	fs: move guard_bio_eod() after bio_set_op_attrs Commit `85a8ce62c2` ("block: add bio_truncate to fix guard_bio_eod") adds bio_truncate() for handling bio EOD. However, bio_truncate() doesn't use the passed 'op' parameter from guard_bio_eod's callers. So bio_trunacate() may retrieve wrong 'op', and zering pages may not be done for READ bio. Fixes this issue by moving guard_bio_eod() after bio_set_op_attrs() in submit_bh_wbc() so that bio_truncate() can always retrieve correct op info. Meantime remove the 'op' parameter from guard_bio_eod() because it isn't used any more. Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: linux-fsdevel@vger.kernel.org Fixes: `85a8ce62c2` ("block: add bio_truncate to fix guard_bio_eod") Signed-off-by: Ming Lei <ming.lei@redhat.com> Fold in kerneldoc and bio_op() change. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-09 08:16:12 -07:00
Chaitanya Kulkarni	8e42d239cb	block: mark zone-mgmt bios with REQ_SYNC In the current implementation, final zone-mgmt request is issued with submit_bio_wait() which marks the bio REQ_SYNC. This is needed since immediate action is expected for zone-mgmt requests as these are blocking operations. This also bypasses the scheduler in the blk_mq_make_request() and dispatches the request directly into the hw ctx. This patch marks all the chained bios REQ_SYNC so that we can have above-mentioned behavior for non-final bios also. Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-09 07:59:12 -07:00
André Almeida	105663f73e	blk-mq: Document functions for sending request Add or improve documentation for function regarding creating and sending IO requests to the hardware. Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-06 21:00:27 -07:00
Herbert Xu	a754bd5f18	block: Allow t10-pi to be modular Currently t10-pi can only be built into the block layer which via crc-t10dif pulls in a whole chunk of the Crypto API. In fact all users of t10-pi work as modules and there is no reason for it to always be built-in. This patch adds a new hidden option for t10-pi that is selected automatically based on BLK_DEV_INTEGRITY and whether the users of t10-pi are built-in or not. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-06 20:59:04 -07:00
Arnd Bergmann	9b81648cb5	compat_ioctl: simplify up block/ioctl.c Having separate implementations of blkdev_ioctl() often leads to these getting out of sync, despite the comment at the top. Since most of the ioctl commands are compatible, and we try very hard not to add any new incompatible ones, move all the common bits into a shared function and leave only the ones that are historically different in separate functions for native/compat mode. To deal with the compat_ptr() conversion, pass both the integer argument and the pointer argument into the new blkdev_common_ioctl() and make sure to always use the correct one of these. blkdev_ioctl() is now only kept as a separate exported interfact for drivers/char/raw.c, which lacks a compat_ioctl variant. We should probably either move raw.c to staging if there are no more users, or export blkdev_compat_ioctl() as well. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:42:52 +01:00
Arnd Bergmann	5fb889f587	compat_ioctl: block: simplify compat_blkpg_ioctl() There is no need to go through a compat_alloc_user_space() copy any more, just wrap the function in a small helper that works the same way for native and compat mode. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:42:52 +01:00
Arnd Bergmann	bdc1ddad3e	compat_ioctl: block: move blkdev_compat_ioctl() into ioctl.c Having both in the same file allows a number of simplifications to the compat path, and makes it more likely that changes to the native path get applied to the compat version as well. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:42:52 +01:00
Arnd Bergmann	1df23c6fe5	compat_ioctl: move HDIO ioctl handling into drivers/ide Most of the HDIO ioctls are only used by the obsolete drivers/ide subsystem, these can be handled by changing ide_cmd_ioctl() to be aware of compat mode and doing the correct transformations in place and using it as both native and compat handlers for all drivers. The SCSI drivers implementing the same commands are already doing this in the drivers, so the compat_blkdev_driver_ioctl() function is no longer needed now. The BLKSECTSET and HDIO_GETGEO_BIG ioctls are not implemented in any driver any more and no longer need any conversion. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:42:52 +01:00
Arnd Bergmann	64cbfa9655	compat_ioctl: move cdrom commands into cdrom.c There is no need for the special cases for the cdrom ioctls any more now, so make sure that each cdrom driver has a .compat_ioctl() callback and calls cdrom_compat_ioctl() directly there. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:42:52 +01:00
Arnd Bergmann	fe0da4e5e8	compat_ioctl: bsg: add handler bsg_ioctl() calls into scsi_cmd_ioctl() for a couple of generic commands and relies on fs/compat_ioctl.c to handle it correctly in compat mode. Adding a private compat_ioctl() handler avoids that round-trip and lets us get rid of the generic emulation once this is done. Note that bsg implements an SG_IO command that is different from the other drivers and does not need emulation. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:33:21 +01:00
Arnd Bergmann	8f8f562038	compat_ioctl: move CDROMREADADIO to cdrom.c Again, there is only one file that needs this, so move the conversion handler into the native implementation. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:33:08 +01:00
Arnd Bergmann	f3ee6e63a9	compat_ioctl: move CDROM_SEND_PACKET handling into scsi There is only one implementation of this ioctl, so move the handling out of the common block layer code into the place where it's actually needed. It also gets called indirectly through pktcdvd, which needs to be aware of this change. As I noticed, the old implementation of the compat handler failed to convert the structure on the way out, so the updated fields never got written back to user space. This is either not important, or it has never worked and should be fixed now. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:33:05 +01:00
Arnd Bergmann	ee6a129dff	compat_ioctl: block: add blkdev_compat_ptr_ioctl A lot of block drivers need only a trivial .compat_ioctl callback. Add a helper function that can be set as the callback pointer to only convert the argument using the compat_ptr() conversion and otherwise assume all input and output data is compatible, or handled using in_compat_syscall() checks. This mirrors the compat_ptr_ioctl() helper function used in character devices. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:32:59 +01:00
Arnd Bergmann	78ed001d9e	compat: scsi: sg: fix v3 compat read/write interface In the v5.4 merge window, a cleanup patch from Al Viro conflicted with my rework of the compat handling for sg.c read(). Linus Torvalds did a correct merge but pointed out that the resulting code is still unsatisfactory. I later noticed that the sg_new_read() function still gets the compat mode wrong, when the 'count' argument is large enough to pass a compat_sg_io_hdr object, but not a nativ sg_io_hdr. To address both of these, move the definition of compat_sg_io_hdr into a scsi/sg.h to make it visible to sg.c and rewrite the logic for reading req_pack_id as well as the size check to a simpler version that gets the expected results. Fixes: `c35a5cfb41` ("scsi: sg: sg_read(): simplify reading ->pack_id of userland sg_io_hdr_t") Fixes: `98aaaec4a1` ("compat_ioctl: reimplement SG_IO handling") Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:32:54 +01:00
Ming Lei	429120f3df	block: fix splitting segments on boundary masks We ran into a problem with a mpt3sas based controller, where we would see random (and hard to reproduce) file corruption). The issue seemed specific to this controller, but wasn't specific to the file system. After a lot of debugging, we find out that it's caused by segments spanning a 4G memory boundary. This shouldn't happen, as the default setting for segment boundary masks is 4G. Turns out there are two issues in get_max_segment_size(): 1) The default segment boundary mask is bypassed 2) The segment start address isn't taken into account when checking segment boundary limit Fix these two issues by removing the bypass of the segment boundary check even if the mask is set to the default value, and taking into account the actual start address of the request when checking if a segment needs splitting. Cc: stable@vger.kernel.org # v5.1+ Reviewed-by: Chris Mason <clm@fb.com> Tested-by: Chris Mason <clm@fb.com> Fixes: `dcebd75592` ("block: use bio_for_each_bvec() to compute multi-page bvec count") Signed-off-by: Ming Lei <ming.lei@redhat.com> Dropped const on the page pointer, ppc page_to_phys() doesn't mark the page as const... Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-30 08:51:18 -07:00
Ming Lei	85a8ce62c2	block: add bio_truncate to fix guard_bio_eod Some filesystem, such as vfat, may send bio which crosses device boundary, and the worse thing is that the IO request starting within device boundaries can contain more than one segment past EOD. Commit `dce30ca9e3` ("fs: fix guard_bio_eod to check for real EOD errors") tries to fix this issue by returning -EIO for this situation. However, this way lets fs user code lose chance to handle -EIO, then sync_inodes_sb() may hang for ever. Also the current truncating on last segment is dangerous by updating the last bvec, given bvec table becomes not immutable any more, and fs bio users may not retrieve the truncated pages via bio_for_each_segment_all() in its .end_io callback. Fixes this issue by supporting multi-segment truncating. And the approach is simpler: - just update bio size since block layer can make correct bvec with the updated bio size. Then bvec table becomes really immutable. - zero all truncated segments for read bio Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: linux-fsdevel@vger.kernel.org Fixed-by: `dce30ca9e3` ("fs: fix guard_bio_eod to check for real EOD errors") Reported-by: syzbot+2b9e54155c8c25d8d165@syzkaller.appspotmail.com Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-28 09:44:56 -07:00
Arnd Bergmann	b2c0fcd287	compat_ioctl: block: handle Persistent Reservations These were added to blkdev_ioctl() in linux-5.5 but not blkdev_compat_ioctl, so add them now. Cc: <stable@vger.kernel.org> # v4.4+ Fixes: `bbd3e06436` ("block: add an API for Persistent Reservations") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fold in followup patch from Arnd with missing pr.h header include. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-21 07:26:56 -07:00
Arnd Bergmann	4b43f31d65	compat_ioctl: block: handle add zone open, close and finish ioctl These were added to blkdev_ioctl() in linux-5.5 but not blkdev_compat_ioctl, so add them now. Fixes: `e876df1fe0` ("block: add zone open, close and finish ioctl support") Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-21 07:26:41 -07:00
Arnd Bergmann	21d3734091	compat_ioctl: block: handle BLKGETZONESZ/BLKGETNRZONES These were added to blkdev_ioctl() in v4.20 but not blkdev_compat_ioctl, so add them now. Cc: <stable@vger.kernel.org> # v4.20+ Fixes: `72cd87576d` ("block: Introduce BLKGETZONESZ ioctl") Fixes: `65e4e3eee8` ("block: Introduce BLKGETNRZONES ioctl") Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-21 07:26:41 -07:00
Arnd Bergmann	673bdf8ce0	compat_ioctl: block: handle BLKREPORTZONE/BLKRESETZONE These were added to blkdev_ioctl() but not blkdev_compat_ioctl, so add them now. Cc: <stable@vger.kernel.org> # v4.10+ Fixes: `3ed05a987e` ("blk-zoned: implement ioctls") Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-21 07:26:40 -07:00
Yang Yingliang	3b7995a98a	block: fix memleak when __blk_rq_map_user_iov() is failed When I doing fuzzy test, get the memleak report: BUG: memory leak unreferenced object 0xffff88837af80000 (size 4096): comm "memleak", pid 3557, jiffies 4294817681 (age 112.499s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 20 00 00 00 10 01 00 00 00 00 00 00 01 00 00 00 ............... backtrace: [<000000001c894df8>] bio_alloc_bioset+0x393/0x590 [<000000008b139a3c>] bio_copy_user_iov+0x300/0xcd0 [<00000000a998bd8c>] blk_rq_map_user_iov+0x2f1/0x5f0 [<000000005ceb7f05>] blk_rq_map_user+0xf2/0x160 [<000000006454da92>] sg_common_write.isra.21+0x1094/0x1870 [<00000000064bb208>] sg_write.part.25+0x5d9/0x950 [<000000004fc670f6>] sg_write+0x5f/0x8c [<00000000b0d05c7b>] __vfs_write+0x7c/0x100 [<000000008e177714>] vfs_write+0x1c3/0x500 [<0000000087d23f34>] ksys_write+0xf9/0x200 [<000000002c8dbc9d>] do_syscall_64+0x9f/0x4f0 [<00000000678d8e9a>] entry_SYSCALL_64_after_hwframe+0x49/0xbe If __blk_rq_map_user_iov() is failed in blk_rq_map_user_iov(), the bio(s) which is allocated before this failing will leak. The refcount of the bio(s) is init to 1 and increased to 2 by calling bio_get(), but __blk_rq_unmap_user() only decrease it to 1, so the bio cannot be freed. Fix it by calling blk_rq_unmap_user(). Reviewed-by: Bob Liu <bob.liu@oracle.com> Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 11:52:01 -07:00
Bart Van Assche	b3c6a59975	block: Fix a lockdep complaint triggered by request queue flushing Avoid that running test nvme/012 from the blktests suite triggers the following false positive lockdep complaint: ============================================ WARNING: possible recursive locking detected 5.0.0-rc3-xfstests-00015-g1236f7d60242 #841 Not tainted -------------------------------------------- ksoftirqd/1/16 is trying to acquire lock: 000000000282032e (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0 but task is already holding lock: 00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&fq->mq_flush_lock)->rlock); lock(&(&fq->mq_flush_lock)->rlock); * DEADLOCK * May be due to missing lock nesting notation 1 lock held by ksoftirqd/1/16: #0: 00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0 stack backtrace: CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.0.0-rc3-xfstests-00015-g1236f7d60242 #841 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: dump_stack+0x67/0x90 __lock_acquire.cold.45+0x2b4/0x313 lock_acquire+0x98/0x160 _raw_spin_lock_irqsave+0x3b/0x80 flush_end_io+0x4e/0x1d0 blk_mq_complete_request+0x76/0x110 nvmet_req_complete+0x15/0x110 [nvmet] nvmet_bio_done+0x27/0x50 [nvmet] blk_update_request+0xd7/0x2d0 blk_mq_end_request+0x1a/0x100 blk_flush_complete_seq+0xe5/0x350 flush_end_io+0x12f/0x1d0 blk_done_softirq+0x9f/0xd0 __do_softirq+0xca/0x440 run_ksoftirqd+0x24/0x50 smpboot_thread_fn+0x113/0x1e0 kthread+0x121/0x140 ret_from_fork+0x3a/0x50 Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 11:52:01 -07:00
Bart Van Assche	c44a4edb20	block: Fix the type of 'sts' in bsg_queue_rq() This patch fixes the following sparse warnings: block/bsg-lib.c:269:19: warning: incorrect type in initializer (different base types) block/bsg-lib.c:269:19: expected int sts block/bsg-lib.c:269:19: got restricted blk_status_t [usertype] block/bsg-lib.c:286:16: warning: incorrect type in return expression (different base types) block/bsg-lib.c:286:16: expected restricted blk_status_t block/bsg-lib.c:286:16: got int [assigned] sts Cc: Martin Wilck <mwilck@suse.com> Fixes: `d46fe2cb2d` ("block: drop device references in bsg_queue_rq()") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 11:52:01 -07:00
Pavel Begunkov	95ed0c5b12	blk-mq: optimise blk_mq_flush_plug_list() Instead of using list_del_init() in a loop, that generates a lot of unnecessary memory read/writes, iterate from the first request of a batch and cut out a sublist with list_cut_before(). Apart from removing the list node initialisation part, this is more register-friendly, and the assembly uses the stack less intensively. list_empty() at the beginning is done with hope, that the compiler can optimise out the same check in the following list_splice_init(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-19 06:08:50 -07:00
Pavel Begunkov	7d30a62102	blk-mq: optimise rq sort function Check "!=" in multi-layer comparisons. The same memory usage, fewer instructions, and 2 from 4 jumps are replaced with SETcc. Note, that list_sort() doesn't differ 0 and <0. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-19 06:08:50 -07:00
Roman Penyaev	c58c1f8343	block: end bio with BLK_STS_AGAIN in case of non-mq devs and REQ_NOWAIT Non-mq devs do not honor REQ_NOWAIT so give a chance to the caller to repeat request gracefully on -EAGAIN error. The problem is well reproduced using io_uring: mkfs.ext4 /dev/ram0 mount /dev/ram0 /mnt # Preallocate a file dd if=/dev/zero of=/mnt/file bs=1M count=1 # Start fio with io_uring and get -EIO fio --rw=write --ioengine=io_uring --size=1M --direct=1 --name=job --filename=/mnt/file Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 09:01:43 -07:00
Tejun Heo	d7bd15a138	iocost: over-budget forced IOs should schedule async delay When over-budget IOs are force-issued through root cgroup, iocg_kick_delay() adjusts the async delay accordingly but doesn't actually schedule async throttle for the issuing task. This bug is pretty well masked because sooner or later the offending threads are gonna get directly throttled on regular IOs or have async delay scheduled by mem_cgroup_throttle_swaprate(). However, it can affect control quality on filesystem metadata heavy operations. Let's fix it by invoking blkcg_schedule_throttle() when iocg_kick_delay() says async delay is needed. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `7caa47151a` ("blkcg: implement blk-iocost") Cc: stable@vger.kernel.org Reported-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-16 16:10:17 -07:00
Linus Torvalds	f1fcd7786e	for-linus-20191212 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3y54EQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqJuD/93LZmzS5UEWrNLkRaAaCyAy40MPxuXRZEp 42yk7cvAT4OcCr+W6nkAgG6IHGRXOz8QvOzt0P5/HfugpNlB2oz5a/6+TiTtcZTt YNt0Z4yuBMU5SXIIxc3lUMcJGxslzOr+L+9ZXD4u5UqIdG1fSrECAexSCrlmmTwu Fx02TakDc/bbUYDfLAQD1+/Z066rp1ZWDkjXqA4kUvbFzt8F7qEOc1Evq47SuR7d Iw0bM3LVASXwTq2lRc1bFFL2glku6wwkccjwdyjSrQmK4+8LhF396fQGtXuj0Mrs OzuWhaOoGhan7dpj1D8e4tqugflQy9rv9bcy6Z9PjBY+VauuFdgPr3iFcwPaPbXm 17ir4y7xJJxXlhZl/Bn06KIB2h+nLWDIaundFys5JnMmTiZvWIgSJ6Q3gWtMxgfH zWZLMw/UtRAmjHhLqvGsMaBTfgKX5ATpMbfGeZeXheVtVaOgGTunXunT56o7oRHB q4XWZqbydsYyHBUhgSzhBr03i67wbotxtebqg9VZ0UD8XM4iM8Kor/DleK03oUqD DsltKF66NAGNeOcV3TNzJuXHyF6S/vZdO7JdFHY29+pdljoTj5GB88+W9CbhwQRe WiKVpq7sAe/bh0wtqrD+QCByjSNSVU62kVgRhfqms47804j/vNqNvOKaC5UWTd0I 2LG4jfSbeg== =hmxJ -----END PGP SIGNATURE----- Merge tag 'for-linus-20191212' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - stable fix for the bi_size overflow. Not a corruption issue, but a case wher we could merge but disallowed (Andreas) - NVMe pull request via Keith, with various fixes. - MD pull request from Song. - Merge window regression fix for the rq passthrough stats (Logan) - Remove unused blkcg_drain_queue() function (Guoqing) * tag 'for-linus-20191212' of git://git.kernel.dk/linux-block: blk-cgroup: remove blkcg_drain_queue block: fix NULL pointer dereference in account statistics with IDE md: make sure desc_nr less than MD_SB_DISKS md: raid1: check rdev before reference in raid1_sync_request func raid5: need to set STRIPE_HANDLE for batch head block: fix "check bi_size overflow before merge" nvme/pci: Fix read queue count nvme/pci Limit write queue sizes to possible cpus nvme/pci: Fix write and poll queue types nvme/pci: Remove last_cq_head nvme: Namepace identification descriptor list is optional nvme-fc: fix double-free scenarios on hw queues nvme: else following return is not needed nvme: add error message on mismatching controller ids nvme_fc: add module to ops template to allow module references nvmet-loop: Avoid preallocating big SGL for data nvme-fc: Avoid preallocating big SGL for data nvme-rdma: Avoid preallocating big SGL for data	2019-12-13 14:27:19 -08:00
Guoqing Jiang	5addeae1be	blk-cgroup: remove blkcg_drain_queue Since blk_drain_queue had already been removed, so this function is not needed anymore. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-12 09:26:55 -07:00
Logan Gunthorpe	ecb6186cf7	block: fix NULL pointer dereference in account statistics with IDE The IDE driver creates some passthru requests which never get submitted to the block layer in such a way that blk_account_io_start() gets called. However, the driver still calls __blk_mq_end_request() in ide_end_rq() which will call blk_account_io_completion() which tries to dereferences req->part which is never set. See ide_prep_sense() for an example of where these requests come from. To fix this, blk_account_io_completion() and blk_account_io_done() should do nothing if req->part is not set. The back trace of this bug is: BUG: kernel NULL pointer dereference, address: 000002ac #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page *pde = 00000000 Oops: 0002 [#1] CPU: 0 PID: 237 Comm: kworker/0:1H Not tainted 5.4.0-rc2-00011-g48d9b0d43105e #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Workqueue: kblockd drive_rq_insert_work EIP: blk_account_io_completion+0x7a/0xf0 Code: 89 54 24 08 31 d2 89 4c 24 04 31 c9 c7 04 24 02 00 00 00 c1 ee 09 e8 f5 21 a6 ff e8 70 5c a7 ff 8b 53 60 8d 04 bd 00 00 00 00 <01> b4 02 ac 02 00 00 8b 9a 88 02 00 00 85 db 74 11 85 d2 74 51 8b EAX: 00000000 EBX: f5b80000 ECX: 00000000 EDX: 00000000 ESI: 00000000 EDI: 00000000 EBP: f3031e70 ESP: f3031e54 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00010046 CR0: 80050033 CR2: 000002ac CR3: 03c25000 CR4: 000406d0 Call Trace: <IRQ> blk_update_request+0x85/0x420 ide_end_rq+0x38/0xa0 ide_complete_rq+0x3d/0x70 cdrom_newpc_intr+0x258/0xba0 ide_intr+0x135/0x250 __handle_irq_event_percpu+0x3e/0x250 handle_irq_event_percpu+0x1f/0x50 handle_irq_event+0x32/0x60 handle_level_irq+0x6c/0x110 handle_irq+0x72/0xa0 </IRQ> do_IRQ+0x45/0xad common_interrupt+0x115/0x11c Fixes: `48d9b0d431` ("block: account statistics for passthrough requests") Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-12 08:12:50 -07:00
Andreas Gruenbacher	cc90bc6842	block: fix "check bi_size overflow before merge" This partially reverts commit `e3a5d8e386`. Commit `e3a5d8e386` ("check bi_size overflow before merge") adds a bio_full check to __bio_try_merge_page. This will cause __bio_try_merge_page to fail when the last bi_io_vec has been reached. Instead, what we want here is only the bi_size overflow check. Fixes: `e3a5d8e386` ("block: check bi_size overflow before merge") Cc: stable@vger.kernel.org # v5.4+ Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-09 22:04:35 -07:00
Pankaj Bharadiya	c593642c8b	treewide: Use sizeof_field() macro Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except at places where these are defined. Later patches will remove the unused definition of FIELD_SIZEOF(). This patch is generated using following script: EXCLUDE_FILES="include/linux/stddef.h\|include/linux/kernel.h" git grep -l -e "\bFIELD_SIZEOF\b" \| while read file; do if [[ "$file" =~ $EXCLUDE_FILES ]]; then continue fi sed -i -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file; done Signed-off-by: Pankaj Bharadiya <pankaj.laxminarayan.bharadiya@intel.com> Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: David Miller <davem@davemloft.net> # for net	2019-12-09 10:36:44 -08:00
Justin Tee	ece841abbe	block: fix memleak of bio integrity data `7c20f11680` ("bio-integrity: stop abusing bi_end_io") moves bio_integrity_free from bio_uninit() to bio_integrity_verify_fn() and bio_endio(). This way looks wrong because bio may be freed without calling bio_endio(), for example, blk_rq_unprep_clone() is called from dm_mq_queue_rq() when the underlying queue of dm-mpath is busy. So memory leak of bio integrity data is caused by commit `7c20f11680`. Fixes this issue by re-adding bio_integrity_free() to bio_uninit(). Fixes: `7c20f11680` ("bio-integrity: stop abusing bi_end_io") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by Justin Tee <justin.tee@broadcom.com> Add commit log, and simplify/fix the original patch wroten by Justin. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-05 11:38:36 -07:00
Hou Tao	08802ed665	bfq-iosched: Ensure bio->bi_blkg is valid before using it bio->bi_blkg will be NULL when the issue of the request has bypassed the block layer as shown in the following oops: Internal error: Oops: 96000005 [#1] SMP CPU: 17 PID: 2996 Comm: scsi_id Not tainted 5.4.0 #4 Call trace: percpu_counter_add_batch+0x38/0x4c8 bfqg_stats_update_legacy_io+0x9c/0x280 bfq_insert_requests+0xbac/0x2190 blk_mq_sched_insert_request+0x288/0x670 blk_execute_rq_nowait+0x140/0x178 blk_execute_rq+0x8c/0x140 sg_io+0x604/0x9c0 scsi_cmd_ioctl+0xe38/0x10a8 scsi_cmd_blk_ioctl+0xac/0xe8 sd_ioctl+0xe4/0x238 blkdev_ioctl+0x590/0x20e0 block_ioctl+0x60/0x98 do_vfs_ioctl+0xe0/0x1b58 ksys_ioctl+0x80/0xd8 __arm64_sys_ioctl+0x40/0x78 el0_svc_handler+0xc4/0x270 so ensure its validity before using it. Fixes: `fd41e60331` ("bfq-iosched: stop using blkg->stat_bytes and ->stat_ios") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-05 07:10:09 -07:00
Christoph Hellwig	6c6b354914	block: set the zone size in blk_revalidate_disk_zones atomically The current zone revalidation code has a major problem in that it doesn't update the zone size and q->nr_zones atomically, leading to a short window where an out of bounds access to the zone arrays is possible. To fix this move the setting of the zone size into the crticial sections blk_revalidate_disk_zones so that it gets updated together with the zone bitmaps and q->nr_zones. This also slightly simplifies the caller as it deducts the zone size from the report_zones. This change also allows to check for a power of two zone size in generic code. Reported-by: Hans Holmberg <hans@owltronix.com> Reviewed-by: Javier González <javier@javigon.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-03 10:18:22 -07:00
Christoph Hellwig	ae58954d87	block: don't handle bio based drivers in blk_revalidate_disk_zones bio based drivers only need to update q->nr_zones. Do that manually instead of overloading blk_revalidate_disk_zones to keep that function simpler for the next round of changes that will rely even more on the request based functionality. Reviewed-by: Javier González <javier@javigon.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-03 08:51:25 -07:00
Christoph Hellwig	e94f581944	block: allocate the zone bitmaps lazily Allocate the conventional zone bitmap and the sequential zone locking bitmap only when we find a zone of the respective type. This avoids wasting memory on the conventional zone bitmap for devices that only have sequential zones, and will also prepare for other future changes. Reviewed-by: Javier González <javier@javigon.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-03 08:51:25 -07:00
Christoph Hellwig	f216fdd77b	block: replace seq_zones_bitmap with conv_zones_bitmap Invert the meaning of seq_zones_bitmap by keeping a bitmap of conventional zones. This allows not having a bitmap for devices that do not have conventional zones. Reviewed-by: Javier González <javier@javigon.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-03 08:51:25 -07:00
Christoph Hellwig	9b38bb4b1e	block: simplify blkdev_nr_zones Simplify the arguments to blkdev_nr_zones by passing a gendisk instead of the block_device and capacity. This also removes the need for __blkdev_nr_zones as all callers are outside the fast path and can deal with the additional branch. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-03 08:51:24 -07:00
Christoph Hellwig	bb55628288	block: remove the empty line at the end of blk-zoned.c Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-03 08:51:24 -07:00
Linus Torvalds	0da522107e	compat_ioctl: remove most of fs/compat_ioctl.c As part of the cleanup of some remaining y2038 issues, I came to fs/compat_ioctl.c, which still has a couple of commands that need support for time64_t. In completely unrelated work, I spent time on cleaning up parts of this file in the past, moving things out into drivers instead. After Al Viro reviewed an earlier version of this series and did a lot more of that cleanup, I decided to try to completely eliminate the rest of it and move it all into drivers. This series incorporates some of Al's work and many patches of my own, but in the end stops short of actually removing the last part, which is the scsi ioctl handlers. I have patches for those as well, but they need more testing or possibly a rewrite. Signed-off-by: Arnd Bergmann <arnd@arndb.de> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJdsHCdAAoJEJpsee/mABjZtYkP/1JGl3jFv3Iq/5BCdPkaePP1 RtMJRNfURgK3GeuHUui330PvVjI/pLWXU/VXMK2MPTASpJLzYz3uCaZrpVWEMpDZ +ImzGmgJkITlW1uWU3zOcQhOxTyb1hCZ0Ci+2xn9QAmyOL7prXoXCXDWv3h6iyiF lwG+nW+HNtyx41YG+9bRfKNoG0ZJ+nkJ70BV6u0acQHXWn7Xuupa9YUmBL87hxAL 6dlJfLTJg6q8QSv/Q6LxslfWk2Ti8OOJZOwtFM5R8Bgl0iUcvshiRCKfv/3t9jXD dJNvF1uq8z+gracWK49Qsfq5dnZ2ZxHFUo9u0NjbCrxNvWH/sdvhbaUBuJI75seH VIznCkdxFhrqitJJ8KmxANxG08u+9zSKjSlxG2SmlA4qFx/AoStoHwQXcogJscNb YIXYKmWBvwPzYu09QFAXdHFPmZvp/3HhMWU6o92lvDhsDwzkSGt3XKhCJea4DCaT m+oCcoACqSWhMwdbJOEFofSub4bY43s5iaYuKes+c8O261/Dwg6v/pgIVez9mxXm TBnvCsotq5m8wbwzv99eFqGeJH8zpDHrXxEtRR5KQqMqjLq/OQVaEzmpHZTEuK7n e/V/PAKo2/V63g4k6GApQXDxnjwT+m0aWToWoeEzPYXS6KmtWC91r4bWtslu3rdl bN65armTm7bFFR32Avnu =lgCl -----END PGP SIGNATURE----- Merge tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann: "As part of the cleanup of some remaining y2038 issues, I came to fs/compat_ioctl.c, which still has a couple of commands that need support for time64_t. In completely unrelated work, I spent time on cleaning up parts of this file in the past, moving things out into drivers instead. After Al Viro reviewed an earlier version of this series and did a lot more of that cleanup, I decided to try to completely eliminate the rest of it and move it all into drivers. This series incorporates some of Al's work and many patches of my own, but in the end stops short of actually removing the last part, which is the scsi ioctl handlers. I have patches for those as well, but they need more testing or possibly a rewrite" * tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits) scsi: sd: enable compat ioctls for sed-opal pktcdvd: add compat_ioctl handler compat_ioctl: move SG_GET_REQUEST_TABLE handling compat_ioctl: ppp: move simple commands into ppp_generic.c compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic compat_ioctl: unify copy-in of ppp filters tty: handle compat PPP ioctls compat_ioctl: move SIOCOUTQ out of compat_ioctl.c compat_ioctl: handle SIOCOUTQNSD af_unix: add compat_ioctl support compat_ioctl: reimplement SG_IO handling compat_ioctl: move WDIOC handling into wdt drivers fs: compat_ioctl: move FITRIM emulation into file systems gfs2: add compat_ioctl support compat_ioctl: remove unused convert_in_user macro compat_ioctl: remove last RAID handling code compat_ioctl: remove /dev/raw ioctl translation compat_ioctl: remove PCI ioctl translation compat_ioctl: remove joystick ioctl translation ...	2019-12-01 13:46:15 -08:00
Linus Torvalds	7e5192b93c	for-5.5/disk-revalidate-20191122 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3YA5sQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgplFxEACM7CwrWsullPX6b3j62NW6VepU5JQdzwVW S+bmLpb8Z2I4wzEnaVuWAY5hEhGaS9NFtQLdBG0W0YOzH7sweNmL38dZfCE4+oFj ZwytpXQQhAQUwkgANJCpNfzDymHduPsTz7RYqRr1plmhna1KC/dnhuMwg8lVOBf5 myWjqcCHxxoQn6KFqcX9/Azz29ZrgzV28lOnZdiw9yoTjraBmS/ymx4woaa3pc2v UNw0Cgx53vHENJzEL9FNSxc0ENZq/bQhpDolnc2AlPGy9+vPg4afMitJb60KTT7r HpDcLGkYAIKLrfk8DUmFW8lZhWsxTchXvK2+zwQV7nXMcdUgGN/G3HTIdvWEHFv8 oGbPB8cfdA2vNC9QAybwWEum/S0H/GfYsBVplNCUCdFXE7yj1cbKD5dPfCyIvmPz BjgMae5vH/KoH+vNdZ8NL5oFz2eFC3rLxa/Ss78pcEoBdiiV3WQHPv9MBmn/OQ/v CeUAM7omyWpbv3lcByNzIOkeeO3m6Ne28EpEMc2pzLnDPu2btvSyetdO488DE+7O MNfApZULVX91W7jWnhM5GR+1SJTdEXZnoxnFV+J/j4deog5vUR7Dt1VkujpUILfL 7jMl3erF6C53wNrc465z8iLRp1ZM+aTpwatXXRfucNXeomExKK9zF+/+O1ACckUB jWDCR9NTcw== =e5Lx -----END PGP SIGNATURE----- Merge tag 'for-5.5/disk-revalidate-20191122' of git://git.kernel.dk/linux-block Pull disk revalidation updates from Jens Axboe: "This continues the work that Jan Kara started to thoroughly cleanup and consolidate how we handle rescans and revalidations" * tag 'for-5.5/disk-revalidate-20191122' of git://git.kernel.dk/linux-block: block: move clearing bd_invalidated into check_disk_size_change block: remove (__)blkdev_reread_part as an exported API block: fix bdev_disk_changed for non-partitioned devices block: move rescan_partitions to fs/block_dev.c block: merge invalidate_partitions into rescan_partitions block: refactor rescan_partitions	2019-11-25 11:37:01 -08:00
Linus Torvalds	464a47f45d	for-5.5/zoned-20191122 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3YAiAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpsJRD/wNfUGWVdIckw7iiFNuuipKBEy0Nd2VLt0B I+pVW/YjDsG2oxWXWPs5Nxc7ca2A8EzRXcWP0xEjBfOCcBh/9mULi1flkLRoWKcq v/OuTVif3ATvgJcwNkbMcoi0bYA/VwKi2dWC6ALhDDmZhyMTLeE362oIeOUNNnl6 GM8CGZHaRfmBzcH5t+WnxiS6rBlt5iwFJ35EvZo3GMXGGiLGlryxEXPAwZrf4haA Z4atNinKcNXhb80LWHo23aK3bpnaumwKP4BPuLEyvnjS4iU8SeYTXy+w5yq1BE+h HBP5s3no/mPiBAG8b6EZXqOJUGlN596AQfNLu7vCR78tmImZF0jKRFsHEAaKXf+B 1yRgZi7J+gV0qzK/Ufulg43vItk5/sTzEuV9YLfCpKTr14MFcWw908BAqaI5Kk1K e8uGqnb2KbZOLTW4QdPvpWg3eYtqEoluSoZUQ5elHxqQZ4MSZ1lK78FF1TeaW/pw sYH+v6rsWoVjEcFSwGoaaOMravzU4MKtavNAZrTJwKZx7qCqkwmi3R1k8WF6KsSV rTRAzUC1wpTdSOm1MYPMMKM/h5+BJRSJ/RjljOF4fXLnvpD5q0lequCWjrrEzc6c HPRKIgSBq7S620A19QD8UxwvZJ8bOivESqr0bux29v1Vpf7vJBrRMng8nLUrXfJs jdma5mK1UA== =/G9l -----END PGP SIGNATURE----- Merge tag 'for-5.5/zoned-20191122' of git://git.kernel.dk/linux-block Pull zoned block device update from Jens Axboe: "Enhancements and improvements to the zoned device support" * tag 'for-5.5/zoned-20191122' of git://git.kernel.dk/linux-block: scsi: sd_zbc: Remove set but not used variable 'buflen' block: rework zone reporting scsi: sd_zbc: Cleanup sd_zbc_alloc_report_buffer() null_blk: Add zone_nr_conv to features null_blk: clean up report zones null_blk: clean up the block device operations block: Remove partition support for zoned block devices block: Simplify report zones execution block: cleanup the !zoned case in blk_revalidate_disk_zones block: Enhance blk_revalidate_disk_zones()	2019-11-25 11:22:37 -08:00
Linus Torvalds	ff6814b078	for-5.5/block-20191121 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3WxrEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpuH5D/9qQKfIIuQDUNO4Xx+dIHimTDCrfiEOeO9e CRaMuSj+yMxLDMwfX8RnDmR17H3ZVoiIY1CT24U9ZkA5iDjeAH4xmzkH30US7LR7 /64YVZTxB0OrWppRK8RiIhaJJZDQ6+HPUQsn6PRaLVuFHi2unMoTQnj/ZQKz03QA Pl8Xx7qBtH1JwYCzQ21f/uryAcNg9eWabRLN2f1uiOXLmvRxOfh6Z/iaezlaZlmL qeJdcdLjjvOgOPwEOfNjfS6pd+XBz3gdEhn0l+11nHITxWZmVBwsWTKyUQlCmKnl yuCWDVyx5d6zCnlrLYG0l2Fn2lr9SwAkdkq3YAKV03hA/6s6P9q9bm31VvOf828x 7gmr4YVz68y7H9bM0QAHCvDpjll0aIEUw6XFzSOCDtZ9B6/pppYQWzMU71J05eyF 8DOKv2M2EVNLUjf6u0RDyolnWGU0kIjt5ryWE3OsGcezAVa2wYstgUJTKbrn1YgT j+4KTpaI+sg8GKDFauvxcSa6gwoRp6jweFNW+7vC090/shXmrGmVLOnQZKRuHho/ O4W8y/1/deM8CCIAETpiNxA8RV5U/EZygrFGDFc7yzTtVDGHY356M/B4Bmm2qkVu K3WgeZp8Fc0lH0QF6Pp9ZlBkZEpGNCAPVsPkXIsxQXbctftkn3KY//uIubfpFEB1 PpHSicvkww== =HYYq -----END PGP SIGNATURE----- Merge tag 'for-5.5/block-20191121' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "Due to more granular branches, this one is small and will be followed with other core branches that add specific features. I meant to just have a core and drivers branch, but external dependencies we ended up adding a few more that are also core. The changes are: - Fixes and improvements for the zoned device support (Ajay, Damien) - sed-opal table writing and datastore UID (Revanth) - blk-cgroup (and bfq) blk-cgroup stat fixes (Tejun) - Improvements to the block stats tracking (Pavel) - Fix for overruning sysfs buffer for large number of CPUs (Ming) - Optimization for small IO (Ming, Christoph) - Fix typo in RWH lifetime hint (Eugene) - Dead code removal and documentation (Bart) - Reduction in memory usage for queue and tag set (Bart) - Kerneldoc header documentation (André) - Device/partition revalidation fixes (Jan) - Stats tracking for flush requests (Konstantin) - Various other little fixes here and there (et al)" * tag 'for-5.5/block-20191121' of git://git.kernel.dk/linux-block: (48 commits) Revert "block: split bio if the only bvec's length is > SZ_4K" block: add iostat counters for flush requests block,bfq: Skip tracing hooks if possible block: sed-opal: Introduce SUM_SET_LIST parameter and append it using 'add_token_u64' blk-cgroup: cgroup_rstat_updated() shouldn't be called on cgroup1 block: Don't disable interrupts in trigger_softirq() sbitmap: Delete sbitmap_any_bit_clear() blk-mq: Delete blk_mq_has_free_tags() and blk_mq_can_queue() block: split bio if the only bvec's length is > SZ_4K block: still try to split bio if the bvec crosses pages blk-cgroup: separate out blkg_rwstat under CONFIG_BLK_CGROUP_RWSTAT blk-cgroup: reimplement basic IO stats using cgroup rstat blk-cgroup: remove now unused blkg_print_stat_{bytes\|ios}_recursive() blk-throtl: stop using blkg->stat_bytes and ->stat_ios bfq-iosched: stop using blkg->stat_bytes and ->stat_ios bfq-iosched: relocate bfqg_rwstat() helpers block: add zone open, close and finish ioctl support block: add zone open, close and finish operations block: Simplify REQ_OP_ZONE_RESET_ALL handling block: Remove REQ_OP_ZONE_RESET plugging ...	2019-11-25 10:59:41 -08:00
Jens Axboe	1e279153df	Revert "block: split bio if the only bvec's length is > SZ_4K" We really don't need this, as the slow path will do the right thing anyway. This reverts commit `6952a7f844`. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-21 10:16:12 -07:00
Konstantin Khlebnikov	b686631865	block: add iostat counters for flush requests Requests that triggers flushing volatile writeback cache to disk (barriers) have significant effect to overall performance. Block layer has sophisticated engine for combining several flush requests into one. But there is no statistics for actual flushes executed by disk. Requests which trigger flushes usually are barriers - zero-size writes. This patch adds two iostat counters into /sys/class/block/$dev/stat and /proc/diskstats - count of completed flush requests and their total time. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-21 09:06:47 -07:00
Dmitry Monakhov	40d47c155e	block,bfq: Skip tracing hooks if possible In most cases blk_tracing is not active, but bfq_log_bfqq macro generate pid_str unconditionally, which result in significant overhead. ## Test modprobe null_blk echo bfq > /sys/block/nullb0/queue/scheduler fio --name=t --ioengine=libaio --direct=1 --filename=/dev/nullb0 \ --runtime=30 --time_based=1 --rw=write --iodepth=128 --bs=4k # Results \| \| baseline \| w/ patch \| gain \| \| iops \| 113.19K \| 126.42K \| +11% \| Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-20 16:10:29 -07:00
Revanth Rajashekar	c6da429ea9	block: sed-opal: Introduce SUM_SET_LIST parameter and append it using 'add_token_u64' In function 'activate_lsp', rather than hard-coding the short atom header(0x83), we need to let the function 'add_short_atom_header' append the header based on the parameter being appended. The parameter has been defined in Section 3.1.2.1 of https://trustedcomputinggroup.org/wp-content/uploads/TCG_Storage-Opal_Feature_Set_Single_User_Mode_v1-00_r1-00-Final.pdf Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-18 09:49:15 -07:00
Sebastian Andrzej Siewior	de678bc63c	block: Don't disable interrupts in trigger_softirq() trigger_softirq() is always invoked as a SMP-function call which is always invoked with disables interrupts. Don't disable interrupt in trigger_softirq() because interrupts are already disabled. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-18 07:29:22 -07:00
Jiufei Xue	8b37bc277f	iocost: check active_list of all the ancestors in iocg_activate() There is a bug that checking the same active_list over and over again in iocg_activate(). The intention of the code was checking whether all the ancestors and self have already been activated. So fix it. Fixes: `7caa47151a` ("blkcg: implement blk-iocost") Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-14 13:56:54 -07:00
Christoph Hellwig	f0b870df80	block: remove (__)blkdev_reread_part as an exported API In general drivers should never mess with partition tables directly. Unfortunately s390 and loop do for somewhat historic reasons, but they can use bdev_disk_changed directly instead when we export it as they satisfy the sanity checks we have in __blkdev_reread_part. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Stefan Haberland <sth@linux.ibm.com> [dasd] Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-14 07:43:59 -07:00
Christoph Hellwig	142fe8f4bb	block: fix bdev_disk_changed for non-partitioned devices We still have to set the capacity to 0 if invalidating or call revalidate_disk if not even if the disk has no partitions. Fix that by merging rescan_partitions into bdev_disk_changed and just stubbing out blk_add_partitions and blk_drop_partitions for non-partitioned devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-14 07:43:53 -07:00
Christoph Hellwig	a1548b6744	block: move rescan_partitions to fs/block_dev.c Large parts of rescan_partitions aren't about partitions, and moving it to block_dev.c will allow for some further cleanups by merging it into its only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-14 07:43:21 -07:00
Christoph Hellwig	6917d06899	block: merge invalidate_partitions into rescan_partitions A lot of the logic in invalidate_partitions and rescan_partitions is shared. Merge the two functions to simplify things. There is a small behavior change in that we now send the kevent change notice also if we were not invalidating but no partitions were found, which seems like the right thing to do. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-14 07:42:41 -07:00
Christoph Hellwig	f902b02600	block: refactor rescan_partitions Split out a helper that adds one single partition, and another one calling that dealing with the parsed_partitions state. This makes it much more obvious how we clean up all state and start again when using the rescan label. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-14 07:40:55 -07:00
Paolo Valente	478de3380c	block, bfq: deschedule empty bfq_queues not referred by any process Since commit `3726112ec7` ("block, bfq: re-schedule empty queues if they deserve I/O plugging"), to prevent the service guarantees of a bfq_queue from being violated, the bfq_queue may be left busy, i.e., scheduled for service, even if empty (see comments in __bfq_bfqq_expire() for details). But, if no process will send requests to the bfq_queue any longer, then there is no point in keeping the bfq_queue scheduled for service. In addition, keeping the bfq_queue scheduled for service, but with no process reference any longer, may cause the bfq_queue to be freed when descheduled from service. But this is assumed to never happen, and causes a UAF if it happens. This, in turn, caused crashes [1, 2]. This commit fixes this issue by descheduling an empty bfq_queue when it remains with not process reference. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1767539 [2] https://bugzilla.kernel.org/show_bug.cgi?id=205447 Fixes: `3726112ec7` ("block, bfq: re-schedule empty queues if they deserve I/O plugging") Reported-by: Chris Evich <cevich@redhat.com> Reported-by: Patrick Dung <patdung100@gmail.com> Reported-by: Thorsten Schubert <tschubert@bafh.org> Tested-by: Thorsten Schubert <tschubert@bafh.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-14 07:00:54 -07:00
John Garry	cb711b91a3	blk-mq: Delete blk_mq_has_free_tags() and blk_mq_can_queue() These functions are not referenced, so delete them. Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-13 12:50:38 -07:00
Christoph Hellwig	d41003513e	block: rework zone reporting Avoid the need to allocate a potentially large array of struct blk_zone in the block layer by switching the ->report_zones method interface to a callback model. Now the caller simply supplies a callback that is executed on each reported zone, and private data for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-12 19:12:07 -07:00
Damien Le Moal	5eac3eb30c	block: Remove partition support for zoned block devices No known partitioning tool supports zoned block devices, especially the host managed flavor with strong sequential write constraints. Furthermore, there are also no known user nor use cases for partitioned zoned block devices. This patch removes partition device creation for zoned block devices, which allows simplifying the processing of zone commands for zoned block devices. A warning is added if a partition table is found on the device. For report zones operations no zone sector information remapping is necessary anymore, simplifying the code. Of note is that remapping of zone reports for DM targets is still necessary as done by dm_remap_zone_report(). Similarly, remaping of a zone reset bio is not necessary anymore. Testing for the applicability of the zone reset all request also becomes simpler and only needs to check that the number of sectors of the requested zone range is equal to the disk capacity. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-12 19:11:57 -07:00
Damien Le Moal	ceeb373aa6	block: Simplify report zones execution All kernel users of blkdev_report_zones() as well as applications use through ioctl(BLKZONEREPORT) expect to potentially get less zone descriptors than requested. As such, the use of the internal report zones command execution loop implemented by blk_report_zones() is not necessary and can even be harmful to performance by causing the execution of inefficient small zones report command to service the reminder of a requested zone array. This patch removes blk_report_zones(), simplifying the code. Also remove a now incorrect comment in dm_blk_report_zones(). Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Javier Gonzalez <javier@javigon.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-12 19:11:56 -07:00
Christoph Hellwig	c98c3d09fc	block: cleanup the !zoned case in blk_revalidate_disk_zones blk_revalidate_disk_zones is never called for non-zoned devices. Just return early and warn instead of trying to handle this case. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-12 19:11:54 -07:00
Damien Le Moal	d9dd73087a	block: Enhance blk_revalidate_disk_zones() For ZBC and ZAC zoned devices, the scsi driver revalidation processing implemented by sd_revalidate_disk() includes a call to sd_zbc_read_zones() which executes a full disk zone report used to check that all zones of the disk are the same size. This processing is followed by a call to blk_revalidate_disk_zones(), used to initialize the device request queue zone bitmaps (zone type and zone write lock bitmaps). To do so, blk_revalidate_disk_zones() also executes a full device zone report to obtain zone types. As a result, the entire zoned block device revalidation process includes two full device zone report. By moving the zone size checks into blk_revalidate_disk_zones(), this process can be optimized to a single full device zone report, leading to shorter device scan and revalidation times. This patch implements this optimization, reducing the original full device zone report implemented in sd_zbc_check_zones() to a single, small, report zones command execution to obtain the size of the first zone of the device. Checks whether all zones of the device are the same size as the first zone size are moved to the generic blk_check_zone() function called from blk_revalidate_disk_zones(). This optimization also has the following benefits: 1) fewer memory allocations in the scsi layer during disk revalidation as the potentailly large buffer for zone report execution is not needed. 2) Implement zone checks in a generic manner, reducing the burden on device driver which only need to obtain the zone size and check that this size is a power of 2 number of LBAs. Any new type of zoned block device will benefit from this. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-12 19:11:52 -07:00
Junichi Nomura	e3a5d8e386	block: check bi_size overflow before merge __bio_try_merge_page() may merge a page to bio without bio_full() check and cause bi_size overflow. The overflow typically ends up with sd_init_command() warning on zero segment request with call trace like this: ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1986 at drivers/scsi/scsi_lib.c:1025 scsi_init_io+0x156/0x180 CPU: 2 PID: 1986 Comm: kworker/2:1H Kdump: loaded Not tainted 5.4.0-rc7 #1 Workqueue: kblockd blk_mq_run_work_fn RIP: 0010:scsi_init_io+0x156/0x180 RSP: 0018:ffffa11487663bf0 EFLAGS: 00010246 RAX: 00000000002be0a0 RBX: ffff8e6e9ff30118 RCX: 0000000000000000 RDX: 00000000ffffffe1 RSI: 0000000000000000 RDI: ffff8e6e9ff30118 RBP: ffffa11487663c18 R08: ffffa11487663d28 R09: ffff8e6e9ff30150 R10: 0000000000000001 R11: 0000000000000000 R12: ffff8e6e9ff30000 R13: 0000000000000001 R14: ffff8e74a1cf1800 R15: ffff8e6e9ff30000 FS: 0000000000000000(0000) GS:ffff8e6ea7680000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fff18cf0fe8 CR3: 0000000659f0a001 CR4: 00000000001606e0 Call Trace: sd_init_command+0x326/0xb40 [sd_mod] scsi_queue_rq+0x502/0xaa0 ? blk_mq_get_driver_tag+0xe7/0x120 blk_mq_dispatch_rq_list+0x256/0x5a0 ? elv_rb_del+0x24/0x30 ? deadline_remove_request+0x7b/0xc0 blk_mq_do_dispatch_sched+0xa3/0x140 blk_mq_sched_dispatch_requests+0xfb/0x170 __blk_mq_run_hw_queue+0x81/0x130 blk_mq_run_work_fn+0x1b/0x20 process_one_work+0x179/0x390 worker_thread+0x4f/0x3e0 kthread+0x105/0x140 ? max_active_store+0x80/0x80 ? kthread_bind+0x20/0x20 ret_from_fork+0x35/0x40 ---[ end trace f9036abf5af4a4d3 ]--- blk_update_request: I/O error, dev sdd, sector 2875552 op 0x1:(WRITE) flags 0x0 phys_seg 0 prio class 0 XFS (sdd1): writeback error on sector 2875552 __bio_try_merge_page() should check the overflow before actually doing merge. Fixes: `07173c3ec2` ("block: enable multipage bvecs") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-12 07:26:27 -07:00
Ming Lei	6952a7f844	block: split bio if the only bvec's length is > SZ_4K 64K PAGE_SIZE is popular on ARM64 or other ARCHs, and 64K has been big enough to break some devices probably, so change the logic to split bio if the only bvec's length is > SZ_4K instead of PAGE_SIZE. Fixes: `fa53228721` (block: avoid blk_bio_segment_split for small I/O operations) Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-08 06:59:51 -07:00
Ming Lei	59db8ba2f6	block: still try to split bio if the bvec crosses pages Some device may set segment boundary as PAGE_SIZE - 1. If the bvec crosses pages, and meantime its length is <= PAGE_SIZE, we still need to split the bvec into 2 segments. Fixes this issue by still splitting bio if the single bvec crosses pages. Reported-by: kernel test robot <lkp@intel.com> Fixes: `fa53228721` (block: avoid blk_bio_segment_split for small I/O operations) Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-08 06:59:49 -07:00
Tejun Heo	1d156646e0	blk-cgroup: separate out blkg_rwstat under CONFIG_BLK_CGROUP_RWSTAT blkg_rwstat is now only used by bfq-iosched and blk-throtl when on cgroup1. Let's move it into its own files and gate it behind a config option. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-07 12:28:13 -07:00
Tejun Heo	f733164829	blk-cgroup: reimplement basic IO stats using cgroup rstat blk-cgroup has been using blkg_rwstat to track basic IO stats. Unfortunately, reading recursive stats scales badly as itinvolves walking all descendants. On systems with a huge number of cgroups (dead or alive), this can lead to substantial CPU cost when reading IO stats. This patch reimplements basic IO stats using cgroup rstat which uses more memory but makes recursive stat reading O(# descendants which have been active since last reading) instead of O(# descendants). * blk-cgroup core no longer uses sync/async stats. Introduce new stat enums - BLKG_IOSTAT_{READ\|WRITE\|DISCARD}. * Add blkg_iostat[_set] which encapsulates byte and io stats, last values for propagation delta calculation and u64_stats_sync for correctness on 32bit archs. * Update the new percpu stat counters directly and implement blkcg_rstat_flush() to implement propagation. * blkg_print_stat() can now bring the stats up to date by calling cgroup_rstat_flush() and print them instead of directly summing up all descendants. * It now allocates 96 bytes per cpu. It used to be 40 bytes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Dan Schatzberg <dschatzberg@fb.com> Cc: Daniel Xu <dlxu@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-07 12:28:13 -07:00
Tejun Heo	8a80d5d663	blk-cgroup: remove now unused blkg_print_stat_{bytes\|ios}_recursive() These don't have users anymore. Remove them. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-07 12:28:13 -07:00

... 6 7 8 9 10 ...

5410 commits