linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-09-13 22:25:03 +00:00

Author	SHA1	Message	Date
Christoph Hellwig	ba91c849fa	blk-rq-qos: store a gendisk instead of request_queue in struct rq_qos This is what about half of the users already want, and it's only going to grow more. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	3963d84df7	blk-rq-qos: constify rq_qos_ops These op vectors are constant, so mark them const. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	ce57b55860	blk-rq-qos: make rq_qos_add and rq_qos_del more useful Switch to passing a gendisk, and make rq_qos_add initialize all required fields and drop the not required q argument from rq_qos_del. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	b494f9c566	blk-rq-qos: move rq_qos_add and rq_qos_del out of line These two functions are rather larger and not in a fast path, so move them out of line. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	4e1d91ae87	blk-wbt: open code wbt_queue_depth_changed in wbt_init wbt_queue_depth_changed just updates a field and calls another function. Open code it in wbt_init, so that the local queue variable can be used instead of the one stored in the rq_qos. This will allow delaying that rq_qos->queue assignment in a subsequent patch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	0bc65bd41d	blk-wbt: move private information from blk-wbt.h to blk-wbt.c A large part of blk-wbt.h is only used in blk-wbt.c, so move it there. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	958f296547	blk-wbt: pass a gendisk to wbt_init Pass a gendisk to wbt_init to prepare for phasing out usage of the request_queue in the blk-cgroup code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	04aad37be1	blk-wbt: pass a gendisk to wbt_{enable,disable}_default Pass a gendisk to wbt_enable_default and wbt_disable_default to prepare for phasing out usage of the request_queue in the blk-cgroup code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	f05837ed73	blk-cgroup: store a gendisk to throttle in struct task_struct Switch from a request_queue pointer and reference to a gendisk one for the throttle information in struct task_struct. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Link: https://lore.kernel.org/r/20230203150400.3199230-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	84d7d462b1	blk-cgroup: pin the gendisk in struct blkcg_gq Currently each blkcg_gq holds a request_queue reference, which is what is used in the policies. But a lot of these interfaces will move over to use a gendisk, so store a disk in struct blkcg_gq and hold a reference to it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	180b04d450	blk-cgroup: remove the !bdi->dev check in blkg_dev_name bdi_dev_name already performs the same check. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	27b642b07a	blk-cgroup: simplify blkg freeing from initialization failure paths There is no need to delay freeing a blkg to a workqueue when freeing it after an initialization failure. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	0b6f93bdf0	blk-cgroup: improve error unwinding in blkg_alloc Unwind only the previous initialization steps that happened in blkg_alloc using goto-based unwinding. This avoids the need for the !queue special case in blkg_free and thus ensures that any blkg seen outside of blkg_alloc is always fully constructed. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:04 -07:00
Christoph Hellwig	178fa7d498	blk-cgroup: delay blk-cgroup initialization until add_disk There is no need to initialize the cgroup code before the disk is marked live. Moving the cgroup initialization later will help to have a fully initialized struct device in the gendisk for the cgroup code to use in the future. Similarly tear the cgroup information down in del_gendisk to be symmetric and because none of the cgroup tracking is needed once non-passthrough I/O stops. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:04 -07:00
Christoph Hellwig	a886001c2d	block: don't call blk_throtl_stat_add for non-READ/WRITE commands blk_throtl_stat_add is called from blk_stat_add explicitly, unlike the other stats that go through q->stats->callbacks. To prepare for cgroup data moving to the gendisk, ensure blk_throtl_stat_add is only called for the plain READ and WRITE commands that it actually handles internally, as blk_stat_add can also be called for passthrough commands on queues that do not have a gendisk associated with them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:04 -07:00
Christoph Hellwig	3222d8c2a7	block: remove ->rw_page The ->rw_page method is a special purpose bypass of the usual bio handling path that is limited to single-page reads and writes and is synchronous, which causes a lot of extra code in the drivers, callers and the block layer. The only remaining user is the MM swap code. Switch that swap code to simply submit a single-vec on-stack bio and synchronously wait on it based on a newly added QUEUE_FLAG_SYNCHRONOUS flag set by the drivers that currently implement ->rw_page instead. While this touches one extra cache line and executes extra code, it simplifies the block layer and drivers and ensures that all features are properly supported by all drivers, e.g. right now ->rw_page bypassed cgroup writeback entirely. [akpm@linux-foundation.org: fix comment typo, per Dan] Link: https://lkml.kernel.org/r/20230125133436.447864-8-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <kbusch@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-02 22:33:34 -08:00
Ming Lei	0416f3be58	blk-cgroup: don't update io stat for root cgroup We source root cgroup stats from the system-wide stats, see blkcg_print_stat and blkcg_rstat_flush, so don't update io state for root cgroup. Fixes blkg leak issue introduced in commit `3b8cc62987` ("blk-cgroup: Optimize blkcg_rstat_flush()") which starts to grab blkg's reference when adding iostat_cpu into percpu blkcg list, but this state won't be consumed by blkcg_rstat_flush() where the blkg reference is dropped. Tested-by: Bart van Assche <bvanassche@acm.org> Reported-by: Bart van Assche <bvanassche@acm.org> Fixes: `3b8cc62987` ("blk-cgroup: Optimize blkcg_rstat_flush()") Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230202021804.278582-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-01 19:26:41 -07:00
Bart Van Assche	81ea42b9c3	block: Fix the blk_mq_destroy_queue() documentation Commit `2b3f056f72` moved a blk_put_queue() call from blk_mq_destroy_queue() into its callers. Reflect this change in the documentation block above blk_mq_destroy_queue(). Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Keith Busch <kbusch@kernel.org> Fixes: `2b3f056f72` ("blk-mq: move the call to blk_put_queue out of blk_mq_destroy_queue") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230130211233.831613-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-31 11:46:15 -07:00
Ulf Hansson	4a6a7bc21d	block: Default to use cgroup support for BFQ Assuming that both Kconfig options, BLK_CGROUP and IOSCHED_BFQ are set, we most likely want cgroup support for BFQ too (BFQ_GROUP_IOSCHED), so let's make it default y. Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/20230130121240.159456-1-ulf.hansson@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-30 09:42:42 -07:00
Kemeng Shi	323745a3aa	block, bfq: remove unused bfq_wr_max_time in struct bfq_data bfqd->bfq_wr_max_time is set to 0 in bfq_init_queue and is never changed. It is only used in bfq_wr_duration when bfq_wr_max_time > 0, which is never true, so bfqd->bfq_wr_max_time is not actually used. Just remove it. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-9-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 20:03:49 -07:00
Kemeng Shi	87c971de81	block, bfq: remove unnecessary goto tag in bfq_dispatch_rq_from_bfqq We jump to tag only for returning current rq. Return directly to remove this tag. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230116095153.3810101-8-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 20:03:49 -07:00
Kemeng Shi	433d4b03e7	block, bfq: remove redundant check in bfq_put_cooperator We have already avoided a circular list in bfq_setup_merge (see comments in bfq_setup_merge() for details), so bfq_queue will not appear in its new_bfqq list. Just remove this check. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-7-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 20:03:49 -07:00
Kemeng Shi	86f8382e6d	block, bfq: remove unnecessary dereference to get async_bfqq The async_bfqq is assigned with bfqq->bic->bfqq[0], use it directly. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-6-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 20:03:49 -07:00
Kemeng Shi	8ac2e43c35	block, bfq: use helper macro RQ_BFQQ to get bfqq of request Use helper macro RQ_BFQQ to get bfqq of request. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-5-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 20:03:49 -07:00
Kemeng Shi	1c970450a7	block, bfq: initialize bfqq->decrease_time_jif correctly Inject limit is updated or reset when time_is_before_eq_jiffies(decrease_time_jif + several msecs) or think-time state changes. decrease_time_jif is initialized to 0 and will be set to current jiffies when inject limit is updated or reset. If jiffies is slightly greater than LONG_MAX, time_is_after_eq_jiffies(0) will hold for a long time, and so will time_is_after_eq_jiffies(decrease_time_jif + several msecs). If the think-time state never changes, then the injection will not work as expected for a long time. To be more specific: Function bfq_update_inject_limit may be triggered when jiffies passes decrease_time_jif + msecs_to_jiffies(10) in bfq_add_request by setting bfqd->wait_dispatch to true. Function bfq_reset_inject_limit is called under two conditions: 1. jiffies passes bfqq->decrease_time_jif + msecs_to_jiffies(1000) in function bfq_add_request. 2. jiffies passes bfqq->decrease_time_jif + msecs_to_jiffies(100) or bfq think-time state changes from short to long. Fix this by initializing bfqq->decrease_time_jif to current jiffies to trigger service injection soon when service injection conditions are met. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-4-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 20:03:49 -07:00
Kemeng Shi	bebeb9e582	block, bfq: remove unsed parameter reason in bfq_bfqq_is_slow Parameter reason is never used, just remove it. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-3-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 20:03:49 -07:00
Kemeng Shi	0c3e09e885	block, bfq: correctly raise inject limit in bfq_choose_bfqq_for_injection Function bfq_choose_bfqq_for_injection may temporarily raise inject limit to one request if current inject_limit is 0 before search of the source queue for injection. However the search below will reset inject limit to bfqd->in_service_queue which is zero for raised inject limit. Then the temporarily raised inject limit never works as expected. Assignment of limit to bfqd->in_service_queue in search is needed as limit may be overwritten to min_t(unsigned int, 1, limit) for the condition that a large in-flight request is on non-rotational devices in found queue. So we need to reset limit to bfqd->in_service_queue for the normal case. Actually, we have already made sure bfqd->rq_in_driver is < limit before search, so limit is >= 1 as bfqd->rq_in_driver is >= 0. Then min_t(unsigned int, 1, limit) is always 1. So we can simply check bfqd->rq_in_driver against 1 instead of the result of min_t(unsigned int, 1, limit) for the large request in non-rotational device case to avoid overwriting limit, and the bug is gone. For the normal case, we have already checked that bfqd->rq_in_driver is < limit, so we can return the found bfqq unconditionally to remove the unnecessary check. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-2-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 20:03:49 -07:00
Yu Kuai	b600de2d7d	block, bfq: fix uaf for bfqq in bic_set_bfqq() After commit `64dc8c732f` ("block, bfq: fix possible uaf for 'bfqq->bic'"), bic->bfqq will be accessed in bic_set_bfqq(), however, in some context bic->bfqq will be freed, and bic_set_bfqq() is called with the freed bic->bfqq. Fix the problem by always freeing bfqq after bic_set_bfqq(). Fixes: `64dc8c732f` ("block, bfq: fix possible uaf for 'bfqq->bic'") Reported-and-tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230130014136.591038-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 18:57:01 -07:00
Yu Kuai	f1c006f1c6	blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy() Currently parent pd can be freed before child pd: t1: remove cgroup C1 blkcg_destroy_blkgs blkg_destroy list_del_init(&blkg->q_node) // remove blkg from queue list percpu_ref_kill(&blkg->refcnt) blkg_release call_rcu t2: from t1 __blkg_release blkg_free schedule_work t4: deactivate policy blkcg_deactivate_policy pd_free_fn // parent of C1 is freed first t3: from t2 blkg_free_workfn pd_free_fn If policy(for example, ioc_timer_fn() from iocost) access parent pd from child pd after pd_offline_fn(), then UAF can be triggered. Fix the problem by delaying 'list_del_init(&blkg->q_node)' from blkg_destroy() to blkg_free_workfn(), and using a new disk level mutex to synchronize blkg_free_workfn() and blkcg_deactivate_policy(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230119110350.2287325-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:19:04 -07:00
Yu Kuai	dfd6200a09	blk-cgroup: support to track if policy is online A new field 'online' is added to blkg_policy_data to fix the following two problems: 1) In blkcg_activate_policy(), if pd_alloc_fn() with 'GFP_NOWAIT' failed, 'queue_lock' will be dropped and pd_alloc_fn() will try again without 'GFP_NOWAIT'. In the meantime, removing a cgroup can race with it, and pd_offline_fn() will be called without pd_init_fn() and pd_online_fn(). This way a null-ptr-dereference can be triggered. 2) In order to synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy(), 'list_del_init(&blkg->q_node)' will be delayed to blkg_free_workfn(), hence pd_offline_fn() can be called first in blkg_destroy(), and then blkcg_deactivate_policy() will call it again, which we must prevent. The new field 'online' will be set after pd_online_fn() and will be cleared after pd_offline_fn(); in the meantime pd_offline_fn() will only be called if 'online' is set. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230119110350.2287325-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:19:04 -07:00
Yu Kuai	c7241babf0	blk-cgroup: dropping parent refcount after pd_free_fn() is done Some cgroup policies will access parent pd through child pd even after pd_offline_fn() is done. If pd_free_fn() for parent is called before child, then UAF can be triggered. Hence it's better to guarantee the order of pd_free_fn(). Currently refcount of parent blkg is dropped in __blkg_release(), which is before pd_free_fn() is called in blkg_free_work_fn() while blkg_free_work_fn() is called asynchronously. This patch makes sure that pd_free_fn() called from removing a cgroup is ordered, by delaying dropping the parent refcount until after calling pd_free_fn() for the child. BTW, pd_free_fn() will also be called from blkcg_deactivate_policy() when deleting a device, and following patches will guarantee the order. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230119110350.2287325-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:19:04 -07:00
Zhong Jinghua	b36781034c	blk-mq: cleanup unused methods: blk_mq_hw_sysfs_store We found that the blk_mq_hw_sysfs_store interface has no place to use. The object default_hw_ctx_attrs using blk_mq_hw_sysfs_ops only uses the show method and does not use the store method. Since this patch: `4a46f05ebf` ("blk-mq: move hctx and ctx counters from sysfs to debugfs") moved the store method to debugfs, the store method is not used anymore. So let me do some tiny work to clean up unused code. Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230128030419.2780298-1-zhongjinghua@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:35 -07:00
Jens Axboe	33391eecd6	block: treat poll queue enter similarly to timeouts We ran into an issue where a production workload would randomly grind to a halt and not continue until the pending IO had timed out. This turned out to be a complicated interaction between queue freezing and polled IO: 1) You have an application that does polled IO. At any point in time, there may be polled IO pending. 2) You have a monitoring application that issues a passthrough command, which is marked with side effects such that it needs to freeze the queue. 3) Passthrough command is started, which calls blk_freeze_queue_start() on the device. At this point the queue is marked frozen, and any attempt to enter the queue will fail (for non-blocking) or block. 4) Now the driver calls blk_mq_freeze_queue_wait(), which will return when the queue is quiesced and pending IO has completed. 5) The pending IO is polled IO, but any attempt to poll IO through the normal iocb_bio_iopoll() -> bio_poll() will fail when it gets to bio_queue_enter() as the queue is frozen. Rather than poll and complete IO, the polling threads will sit in a tight loop attempting to poll, but failing to enter the queue to do so. The end result is that progress for either application will be stalled until all pending polled IO has timed out. This causes obvious huge latency issues for the application doing polled IO, but also long delays for passthrough command. Fix this by treating queue enter for polled IO just like we do for timeouts. This allows quick quiesce of the queue as we still poll and complete this IO, while still disallowing queueing up new IO. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Li Nan	b326032965	blk-iocost: change div64_u64 to DIV64_U64_ROUND_UP in ioc_refresh_params() vrate_min is calculated by DIV64_U64_ROUND_UP, but vrate_max is calculated by div64_u64. vrate_min may be 1 greater than vrate_max if the input values min and max of cost.qos are equal. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230117070806.3857142-6-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Li Nan	984af1e66b	blk-iocost: fix divide by 0 error in calc_lcoefs() echo max of u64 to cost.model can cause divide by 0 error. # echo 8:0 rbps=18446744073709551615 > /sys/fs/cgroup/io.cost.model divide error: 0000 [#1] PREEMPT SMP RIP: 0010:calc_lcoefs+0x4c/0xc0 Call Trace: <TASK> ioc_refresh_params+0x2b3/0x4f0 ioc_cost_model_write+0x3cb/0x4c0 ? _copy_from_iter+0x6d/0x6c0 ? kernfs_fop_write_iter+0xfc/0x270 cgroup_file_write+0xa0/0x200 kernfs_fop_write_iter+0x17d/0x270 vfs_write+0x414/0x620 ksys_write+0x73/0x160 __x64_sys_write+0x1e/0x30 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd calc_lcoefs() uses the input value of cost.model in DIV_ROUND_UP_ULL, overflow would happen if bps plus IOC_PAGE_SIZE is greater than ULLONG_MAX, it can cause divide by 0 error. Fix the problem by setting basecost Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230117070806.3857142-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Yu Kuai	35198e3230	blk-iocost: read params inside lock in sysfs apis Otherwise, user might get abnormal values if params is updated concurrently. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230117070806.3857142-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Yu Kuai	235a5a83f6	blk-iocost: don't allow to configure bio based device iocost is based on rq_qos, which can only work for request based device, thus it doesn't make sense to configure iocost for bio based device. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230117070806.3857142-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Yu Kuai	7b7c5ae440	blk-iocost: check return value of match_u64() This patch fixes that the return value of match_u64() from ioc_qos_write() is not checked. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230117070806.3857142-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Arnd Bergmann	5f2779dfa7	blk-iocost: avoid 64-bit division in ioc_timer_fn The behavior of 'enum' types has changed in gcc-13, so now the UNBUSY_THR_PCT constant is interpreted as a 64-bit number because it is defined as part of the same enum definition as some other constants that do not fit within a 32-bit integer. This in turn leads to some inefficient code on 32-bit architectures as well as a link error: arm-linux-gnueabi/bin/arm-linux-gnueabi-ld: block/blk-iocost.o: in function `ioc_timer_fn': blk-iocost.c:(.text+0x68e8): undefined reference to `__aeabi_uldivmod' arm-linux-gnueabi-ld: blk-iocost.c:(.text+0x6908): undefined reference to `__aeabi_uldivmod' Split the enum definition to keep the 64-bit timing constants in a separate enum type from those constants that can clearly fit within a smaller type. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230118080706.3303186-1-arnd@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Pankaj Raghav	e29b210021	block: add a new helper bdev_{is_zone_start, offset_from_zone_start} Instead of open coding to check for zone start, add a helper to improve readability and store the logic in one place. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230110143635.77300-3-p.raghav@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Anuj Gupta	7e2e355dd9	block: extend bio-cache for non-polled requests This patch modifies the present check, so that bio-cache is not limited to iopoll. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20230117120638.72254-3-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Jens Axboe	67d59247d4	block: don't allow multiple bios for IOCB_NOWAIT issue If we're doing a large IO request which needs to be split into multiple bios for issue, then we can run into the same situation as the below marked commit fixes - parts will complete just fine, one or more parts will fail to allocate a request. This will result in a partially completed read or write request, where the caller gets EAGAIN even though parts of the IO completed just fine. Do the same for large bios as we do for splits - fail a NOWAIT request with EAGAIN. This isn't technically fixing an issue in the below marked patch, but for stable purposes, we should have either none of them or both. This depends on: `613b14884b` ("block: handle bio_split_to_limits() NULL return") Cc: stable@vger.kernel.org # 5.15+ Fixes: `9cea62b2cb` ("block: don't allow splitting of a REQ_NOWAIT bio") Link: https://github.com/axboe/liburing/issues/766 Reported-and-tested-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Jens Axboe	a3df2e456c	block: add a BUILD_BUG_ON() for adding more bio flags than we have space We have BIO_FLAG_LAST in the enum for bio specific flags, but it's not used to check that we're not exceeding the size of them. Add such a check. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Keith Busch	c9c77418a9	block: save user max_sectors limit The user can set the max_sectors limit to any valid value via sysfs /sys/block/<dev>/queue/max_sectors_kb attribute. If the device limits are ever rescanned, though, the limit reverts back to the potentially artificially low BLK_DEF_MAX_SECTORS value. Preserve the user's setting as the max_sectors limit as long as it's valid. The user can reset back to defaults by writing 0 to the sysfs file. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230105205146.3610282-3-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Keith Busch	0a26f327e4	block: make BLK_DEF_MAX_SECTORS unsigned This is used as an unsigned value, so define it that way to avoid having to cast it. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230105205146.3610282-2-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Davide Zini	1bd43e19de	block, bfq: balance I/O injection among underutilized actuators Upon the invocation of its dispatch function, BFQ returns the next I/O request of the in-service bfq_queue, unless some exception holds. One such exception is that there is some underutilized actuator, different from the actuator for which the in-service queue contains I/O, and that some other bfq_queue happens to contain I/O for such an actuator. In this case, the next I/O request of the latter bfq_queue, and not of the in-service bfq_queue, is returned (I/O is injected from that bfq_queue). To find such an actuator, a linear scan, in increasing index order, is performed among actuators. Performing a linear scan entails a prioritization among actuators: an underutilized actuator may be considered for injection only if all actuators with a lower index are currently fully utilized, or if there is no pending I/O for any lower-index actuator that happens to be underutilized. This commit breaks this prioritization and tends to distribute injection uniformly across actuators. This is obtained by adding the following condition to the linear scan: even if an actuator A is underutilized, A is nevertheless skipped if its load is higher than that of the next actuator. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Davide Zini <davidezini2@gmail.com> Link: https://lore.kernel.org/r/20230103145503.71712-9-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Davide Zini	2d31c684a0	block, bfq: inject I/O to underutilized actuators The main service scheme of BFQ for sync I/O is serving one sync bfq_queue at a time, for a while. In particular, BFQ enforces this scheme when it deems the latter necessary to boost throughput or to preserve service guarantees. Unfortunately, when BFQ enforces this policy, only one actuator at a time gets served for a while, because each bfq_queue contains I/O only for one actuator. The other actuators may remain underutilized. Actually, BFQ may serve (inject) extra I/O, taken from other bfq_queues, in parallel with that of the in-service queue. This injection mechanism may provide the ground for dealing also with the above actuator-underutilization problem. Yet BFQ does not take the actuator load into account when choosing which queue to pick extra I/O from. In addition, BFQ may happen to inject extra I/O only when the in-service queue is temporarily empty. In view of these facts, this commit extends the injection mechanism in such a way that the latter: (1) takes into account also the actuator load; (2) checks such a load on each dispatch, and injects I/O for an underutilized actuator, if there is one and there is I/O for it. To perform the check in (2), this commit introduces a load threshold, currently set to 4. A linear scan of each actuator is performed, until an actuator is found for which the following two conditions hold: the load of the actuator is below the threshold, and there is at least one non-in-service queue that contains I/O for that actuator. If such a pair (actuator, queue) is found, then the head request of that queue is returned for dispatch, instead of the head request of the in-service queue. We have set the threshold, empirically, to the minimum possible value for which an actuator is fully utilized, or close to being fully utilized. By doing so, injected I/O 'steals' as few drive-queue slots as possible from the in-service queue. 
This reduces as much as possible the probability that the service of I/O from the in-service bfq_queue gets delayed because of slot exhaustion, i.e., because all the slots of the drive queue are filled with I/O injected from other queues (NCQ provides for 32 slots). This new mechanism also counters actuator underutilization in the case of asymmetric configurations of bfq_queues. Namely if there are few bfq_queues containing I/O for some actuators and many bfq_queues containing I/O for other actuators. Or if the bfq_queues containing I/O for some actuators have lower weights than the other bfq_queues. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Davide Zini <davidezini2@gmail.com> Link: https://lore.kernel.org/r/20230103145503.71712-8-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Federico Gavioli	4fdb3b9f2a	block, bfq: retrieve independent access ranges from request queue This patch implements the code to gather the content of the independent_access_ranges structure from the request_queue and copy it into the queue's bfq_data. This copy is done at queue initialization. We copy the access ranges into the bfq_data to avoid taking the queue lock each time we access the ranges. This implementation, however, puts a limit to the maximum independent ranges supported by the scheduler. Such a limit is equal to the constant BFQ_MAX_ACTUATORS. This limit was placed to avoid the allocation of dynamic memory. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Co-developed-by: Rory Chen <rory.c.chen@seagate.com> Signed-off-by: Rory Chen <rory.c.chen@seagate.com> Signed-off-by: Federico Gavioli <f.gavioli97@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-7-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Davide Zini	8b7fd74111	block, bfq: split also async bfq_queues on a per-actuator basis Similarly to sync bfq_queues, also async bfq_queues need to be split on a per-actuator basis. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Davide Zini <davidezini2@gmail.com> Link: https://lore.kernel.org/r/20230103145503.71712-6-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Paolo Valente	fd571df0ac	block, bfq: turn bfqq_data into an array in bfq_io_cq When a bfq_queue Q is merged with another queue, several pieces of information are saved about Q. These pieces are stored in the bfqq_data field in the bfq_io_cq data structure of the process associated with Q. Yet, with a multi-actuator drive, a process may get associated with multiple bfq_queues: one queue for each of the N actuators. Each of these queues may undergo a merge. So, the bfq_io_cq data structure must be able to accommodate the above information for N queues. This commit solves this problem by turning the bfqq_data scalar field into an array of N elements (and by changing code so as to handle this array). This solution is written under the assumption that bfq_queues associated with different actuators cannot be cross-merged. This assumption holds naturally with basic queue merging: the latter is triggered by spatial locality, and sectors for different actuators are not close to each other (apart from the corner case of the last sectors served by a given actuator and the first sectors served by the next actuator). As for stable cross-merging, the assumption here is that it is disabled. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Gabriele Felici <felicigb@gmail.com> Signed-off-by: Gianmarco Lusvardi <glusvardi@posteo.net> Signed-off-by: Giulio Barabino <giuliobarabino99@gmail.com> Signed-off-by: Emiliano Maccaferri <inbox@emilianomaccaferri.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-5-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:32 -07:00
Paolo Valente	a61230470c	block, bfq: move io_cq-persistent bfqq data into a dedicated struct With a multi-actuator drive, a process may get associated with multiple bfq_queues: one queue for each of the N actuators. So, the bfq_io_cq data structure must be able to accommodate its per-queue persistent information for N queues. Currently it stores this information for just one queue, in several scalar fields. This is a preparatory commit for moving to accommodating persistent information for N queues. In particular, this commit packs all the above scalar fields into a single data structure. Then there is now only one field, in bfq_io_cq, that stores all the above information. This scalar field will then be turned into an array by a following commit. Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Gianmarco Lusvardi <glusvardi@posteo.net> Signed-off-by: Giulio Barabino <giuliobarabino99@gmail.com> Signed-off-by: Emiliano Maccaferri <inbox@emilianomaccaferri.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-4-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:32 -07:00
Paolo Valente	b752989897	block, bfq: forbid stable merging of queues associated with different actuators If queues associated with different actuators are merged, then control is lost on each actuator. Therefore some actuator may be underutilized, and throughput may decrease. This problem cannot occur with basic queue merging, because the latter is triggered by spatial locality, and sectors for different actuators are not close to each other. Yet it may happen with stable merging. To address this issue, this commit prevents stable merging from occurring among queues associated with different actuators. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-3-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:32 -07:00
Paolo Valente	9778369a2d	block, bfq: split sync bfq_queues on a per-actuator basis Single-LUN multi-actuator SCSI drives, as well as all multi-actuator SATA drives appear as a single device to the I/O subsystem [1]. Yet they address commands to different actuators internally, as a function of Logical Block Addressing (LBAs). A given sector is reachable by only one of the actuators. For example, Seagate’s Serial Advanced Technology Attachment (SATA) version contains two actuators and maps the lower half of the SATA LBA space to the lower actuator and the upper half to the upper actuator. Evidently, to fully utilize actuators, no actuator must be left idle or underutilized while there is pending I/O for it. The block layer must somehow control the load of each actuator individually. This commit lays the ground for allowing BFQ to provide such a per-actuator control. BFQ associates an I/O-request sync bfq_queue with each process doing synchronous I/O, or with a group of processes, in case of queue merging. Then BFQ serves one bfq_queue at a time. While in service, a bfq_queue is emptied in request-position order. Yet the same process, or group of processes, may generate I/O for different actuators. In this case, different streams of I/O (each for a different actuator) get all inserted into the same sync bfq_queue. So there is basically no individual control on when each stream is served, i.e., on when the I/O requests of the stream are picked from the bfq_queue and dispatched to the drive. This commit enables BFQ to control the service of each actuator individually for synchronous I/O, by simply splitting each sync bfq_queue into N queues, one for each actuator. In other words, a sync bfq_queue is now associated to a pair (process, actuator). As a consequence of this split, the per-queue proportional-share policy implemented by BFQ will guarantee that the sync I/O generated for each actuator, by each process, receives its fair share of service. 
This is just a preparatory patch. If the I/O of the same process happens to be sent to different queues, then each of these queues may undergo queue merging. To handle this event, the bfq_io_cq data structure must be properly extended. In addition, stable merging must be disabled to avoid loss of control on individual actuators. Finally, also async queues must be split. These issues are described in detail and addressed in next commits. As for this commit, although multiple per-process bfq_queues are provided, the I/O of each process or group of processes is still sent to only one queue, regardless of the actuator the I/O is for. The forwarding to distinct bfq_queues will be enabled after addressing the above issues. [1] https://www.linaro.org/blog/budget-fair-queueing-bfq-linux-io-scheduler-optimizations-for-multi-actuator-sata-hard-drives/ Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Gabriele Felici <felicigb@gmail.com> Signed-off-by: Carmine Zaccagnino <carmine@carminezacc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-2-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:32 -07:00
Greg Kroah-Hartman	a9b12f8b4e	driver core: make struct device_type.devnode() take a const * The devnode() callback in struct device_type should not be modifying the device that is passed into it, so mark it as a const * and propagate the function signature changes out into all relevant subsystems that use this callback. Cc: Jens Axboe <axboe@kernel.dk> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Ben Widawsky <bwidawsk@kernel.org> Cc: Jeremy Kerr <jk@ozlabs.org> Cc: Joel Stanley <joel@jms.id.au> Cc: Alistar Popple <alistair@popple.id.au> Cc: Eddie James <eajames@linux.ibm.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jilin Yuan <yuanjilin@cdjrlc.com> Cc: Heikki Krogerus <heikki.krogerus@linux.intel.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Won Chung <wonchung@google.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Acked-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230111113018.459199-7-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-01-27 13:45:38 +01:00
Greg Kroah-Hartman	162736b0d7	driver core: make struct device_type.uevent() take a const * The uevent() callback in struct device_type should not be modifying the device that is passed into it, so mark it as a const * and propagate the function signature changes out into all relevant subsystems that use this callback. Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Andreas Noever <andreas.noever@gmail.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Bard Liao <yung-chuan.liao@linux.intel.com> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Frank Rowand <frowand.list@gmail.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jilin Yuan <yuanjilin@cdjrlc.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Len Brown <lenb@kernel.org> Cc: Mark Gross <markgross@kernel.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Maximilian Luz <luzmaximilian@gmail.com> Cc: Michael Jamet <michael.jamet@intel.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Cc: Rob Herring <robh+dt@kernel.org> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Sanyog Kale <sanyog.r.kale@intel.com> Cc: Sean Young <sean@mess.org> Cc: Stefan Richter <stefanr@s5r6.in-berlin.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Won Chung <wonchung@google.com> Cc: Yehezkel Bernat <YehezkelShB@gmail.com> Acked-by: Rafael J. 
Wysocki <rafael.j.wysocki@intel.com> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for Thunderbolt Acked-by: Mauro Carvalho Chehab <mchehab@kernel.org> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Acked-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Acked-by: Wolfram Sang <wsa@kernel.org> Acked-by: Vinod Koul <vkoul@kernel.org> Acked-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230111113018.459199-6-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-01-27 13:45:36 +01:00
Linus Torvalds	edc00350d2	block-6.2-2023-01-20 Merge tag 'block-6.2-2023-01-20' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: "Various little tweaks all over the place: - NVMe pull request via Christoph: - fix controller shutdown regression in nvme-apple (Janne Grunau) - fix a polling on timeout regression in nvme-pci (Keith Busch) - Fix a bug in the read request side request allocation caching (Pavel) - pktcdvd was brought back after we configured a NULL return on bio splits, make it consistent with the others (me) - BFQ refcount fix (Yu) - Block cgroup policy activation fix (Yu) - Fix for an md regression introduced in the 6.2 cycle (Adrian)" * tag 'block-6.2-2023-01-20' of git://git.kernel.dk/linux: nvme-pci: fix timeout request state check nvme-apple: only reset the controller when RTKit is running nvme-apple: reset controller during shutdown block: fix hctx checks for batch allocation block/rnbd-clt: fix wrong max ID in ida_alloc_max blk-cgroup: fix missing pd_online_fn() while activating policy pktcdvd: check for NULL returna fter calling bio_split_to_limits() block, bfq: switch 'bfqg->ref' to use 
atomic refcount apis md: fix incorrect declaration about claim_rdev in md_import_device	2023-01-20 12:44:41 -08:00
Ming Lei	6a6dcae8f4	blk-mq: Build default queue map via group_cpus_evenly() The default queue mapping builder of blk_mq_map_queues doesn't take NUMA topology into account, so the built mapping is pretty bad, since CPUs belonging to different NUMA nodes are assigned to the same queue. It is observed that IOPS drops by ~30% when running two jobs on the same hctx of null_blk from two CPUs belonging to two NUMA nodes, compared with running them from the same NUMA node. Address the issue by reusing group_cpus_evenly() for building the queue mapping, since group_cpus_evenly() groups CPUs according to CPU/NUMA locality. Performance data also becomes more stable once the queue mapping respects NUMA locality. For example, on a two-node arm64 machine with 160 cpus, node 0(cpu 0~79), node 1(cpu 80~159): 1) modprobe null_blk nr_devices=1 submit_queues=2 2) run 'fio(t/io_uring -p 0 -n 4 -r 20 /dev/nullb0)', and observe that IOPS becomes much more stable across multiple tests: - unpatched: IOPS is 2.5M ~ 4.5M - patched: IOPS is 4.3M ~ 5.0M Lots of drivers may benefit from the change, such as nvme pci poll, nvme tcp, ... Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20221227022905.352674-7-ming.lei@redhat.com	2023-01-17 18:50:06 +01:00
Pavel Begunkov	7746564793	block: fix hctx checks for batch allocation When there are no read queues read requests will be assigned a default queue on allocation. However, blk_mq_get_cached_request() is not prepared for that and will fail all attempts to grab read requests from the cache. Worst case it doubles the number of requests allocated, roughly half of which will be returned by blk_mq_free_plug_rqs(). It only affects batched allocations and so is io_uring specific. For reference, QD8 t/io_uring benchmark improves by 20-35%. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/80d4511011d7d4751b4cf6375c4e38f237d935e3.1673955390.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-17 09:56:52 -07:00
Yu Kuai	e3ff8887e7	blk-cgroup: fix missing pd_online_fn() while activating policy If the policy defines pd_online_fn(), it should be called after pd_init_fn(), like blkg_create(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230103112833.2013432-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-16 19:04:07 -07:00
Yu Kuai	216f764716	block, bfq: switch 'bfqg->ref' to use atomic refcount apis The updating of 'bfqg->ref' should be protected by 'bfqd->lock', however, during code review, we found that bfq_pd_free() updates 'bfqg->ref' without holding the lock, which is problematic: 1) bfq_pd_free() triggered by removing cgroup is called asynchronously; 2) bfqq will grab a bfqg reference, and exiting the bfqq will drop the reference, which can run concurrently with 1). Unfortunately, 'bfqd->lock' can't be held here because 'bfqd' might already be freed in bfq_pd_free(). Fix the problem by using atomic refcount apis. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230103084755.1256479-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-15 20:53:27 -07:00
Linus Torvalds	97ec4d559d	block-6.2-2023-01-13 Merge tag 'block-6.2-2023-01-13' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: "Nothing major in here, just a collection of NVMe fixes and dropping a wrong might_sleep() that static checkers tripped over but which isn't valid" * tag 'block-6.2-2023-01-13' of git://git.kernel.dk/linux: MAINTAINERS: stop nvme matching for nvmem files nvme: don't allow unprivileged passthrough on partitions nvme: replace the "bool vec" arguments with flags in the ioctl path nvme: remove __nvme_ioctl nvme-pci: fix error handling in nvme_pci_enable() nvme-pci: add NVME_QUIRK_IDENTIFY_CNS quirk to Apple T2 controllers nvme-apple: add NVME_QUIRK_IDENTIFY_CNS quirk to fix regression block: Drop spurious might_sleep() from blk_put_queue()	2023-01-13 17:41:19 -06:00
Christoph Hellwig	b4a6bb3a67	block: add a sanity check for non-write flush/fua bios Check that the PREFLUSH and FUA flags are only set on write bios, given that the flush state machine expects that. [Damien] The check is also extended to REQ_OP_ZONE_APPEND operations as these are data write operations used by btrfs and zonefs and may also have the REQ_FUA bit set. Reported-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de>	2023-01-14 07:32:41 +09:00
Keith Busch	d46aa786fa	block: use iter_ubuf for single range This is more efficient than iter_iov. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> [axboe: fold in iovec assumption fix] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-11 10:31:49 -07:00
Tejun Heo	49e4d04f04	block: Drop spurious might_sleep() from blk_put_queue() Dan reports that smatch detected the following: block/blk-cgroup.c:1863 blkcg_schedule_throttle() warn: sleeping in atomic context caused by blkcg_schedule_throttle() calling blk_put_queue() in a non-sleepable context. blk_put_queue() acquired might_sleep() in `63f93fd6fa` ("block: mark blk_put_queue as potentially blocking") which transferred the might_sleep() from blk_free_queue(). blk_free_queue() acquired might_sleep() in `e8c7d14ac6` ("block: revert back to synchronous request_queue removal") while turning request_queue removal synchronous. However, this isn't necessary as nothing in the free path actually requires sleeping. It's pretty unusual to require a sleeping context in a put operation and it's not needed in the first place. Let's drop it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dan Carpenter <error27@gmail.com> Link: https://lkml.kernel.org/r/Y7g3L6fntnTtOm63@kili Cc: Christoph Hellwig <hch@lst.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Fixes: `e8c7d14ac6` ("block: revert back to synchronous request_queue removal") # v5.9+ Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/Y7iFwjN+XzWvLv3y@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-08 20:29:28 -07:00
Linus Torvalds	a689b938df	block-2023-01-06 Merge tag 'block-2023-01-06' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: "The big change here is obviously the revert of the pktcdvd driver removal. Outside of that, just minor tweaks. In detail: - Re-instate the pktcdvd driver, which necessitates adding back bio_copy_data_iter() and the fops->devnode() hook for now (me) - Fix for splitting of a bio marked as NOWAIT, causing either nowait reads or writes to error with EAGAIN even if parts of the IO completed (me) - Fix for ublk, punting management commands to io-wq as they can all easily block for extended periods of time (Ming) - Removal of SRCU dependency for the block layer (Paul)" * tag 'block-2023-01-06' of git://git.kernel.dk/linux: block: Remove "select SRCU" Revert "pktcdvd: remove driver." Revert "block: remove devnode callback from struct block_device_operations" Revert "block: bio_copy_data_iter" ublk: honor IO_URING_F_NONBLOCK for handling control command block: don't allow splitting of a REQ_NOWAIT bio block: handle bio_split_to_limits() NULL return	2023-01-06 13:12:42 -08:00
Paul E. McKenney	b2b50d5721	block: Remove "select SRCU" Now that the SRCU Kconfig option is unconditionally selected, there is no longer any point in selecting it. Therefore, remove the "select SRCU" Kconfig statements. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-05 08:50:10 -07:00
Jens Axboe	050a4f341f	Revert "block: remove devnode callback from struct block_device_operations" This reverts commit `85d6ce58e4`. We're reinstating the pktcdvd driver, which needs this API. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-04 14:44:02 -07:00
Jens Axboe	ee4b4e2248	Revert "block: bio_copy_data_iter" This reverts commit `db1c7d7797`. We're reinstating the pktcdvd driver, which needs this API. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-04 14:43:27 -07:00
Jens Axboe	9cea62b2cb	block: don't allow splitting of a REQ_NOWAIT bio If we split a bio marked with REQ_NOWAIT, then we can trigger spurious EAGAIN if constituent parts of that split bio end up failing request allocations. Parts will complete just fine, but just a single failure in one of the chained bios will yield an EAGAIN final result for the parent bio. Return EAGAIN early if we end up needing to split such a bio, which allows for saner recovery handling. Cc: stable@vger.kernel.org # 5.15+ Link: https://github.com/axboe/liburing/issues/766 Reported-by: Michael Kelley <mikelley@microsoft.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-04 13:24:44 -07:00
Jens Axboe	613b14884b	block: handle bio_split_to_limits() NULL return This can't happen right now, but in preparation for allowing bio_split_to_limits() returning NULL if it ended the bio, check for it in all the callers. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-04 09:05:23 -07:00
Linus Torvalds	bff687b3da	block-6.2-2022-12-29 Merge tag 'block-6.2-2022-12-29' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: "Mostly just NVMe, but also a single fixup for BFQ for a regression that happened during the merge window. 
In detail: - NVMe pull requests via Christoph: - Fix doorbell buffer value endianness (Klaus Jensen) - Fix Linux vs NVMe page size mismatch (Keith Busch) - Fix a potential memory access beyond the allocation limit (Keith Busch) - Fix a multipath vs blktrace NULL pointer dereference (Yanjun Zhang) - Fix various problems in handling the Command Supported and Effects log (Christoph Hellwig) - Don't allow unprivileged passthrough of commands that don't transfer data but modify logical block content (Christoph Hellwig) - Add a features and quirks policy document (Christoph Hellwig) - Fix some really nasty code that was correct but made smatch complain (Sagi Grimberg) - Use-after-free regression in BFQ from this merge window (Yu)" * tag 'block-6.2-2022-12-29' of git://git.kernel.dk/linux: nvme-auth: fix smatch warning complaints nvme: consult the CSE log page for unprivileged passthrough nvme: also return I/O command effects from nvme_command_effects nvmet: don't defer passthrough commands with trivial effects to the workqueue nvmet: set the LBCC bit for commands that modify data nvmet: use NVME_CMD_EFFECTS_CSUPP instead of open coding it nvme: fix the NVME_CMD_EFFECTS_CSE_MASK definition docs, nvme: add a feature and quirk policy document nvme-pci: update sqsize when adjusting the queue depth nvme: fix setting the queue depth in nvme_alloc_io_tag_set block, bfq: fix uaf for bfqq in bfq_exit_icq_bfqq nvme: fix multipath crash caused by flush request when blktrace is enabled nvme-pci: fix page size checks nvme-pci: fix mempool alloc size nvme-pci: fix doorbell buffer value endianness	2022-12-29 16:57:29 -08:00
Yu Kuai	246cf66e30	block, bfq: fix uaf for bfqq in bfq_exit_icq_bfqq Commit `64dc8c732f` ("block, bfq: fix possible uaf for 'bfqq->bic'") will access 'bic->bfqq' in bic_set_bfqq(), however, bfq_exit_icq_bfqq() can free bfqq first, and then call bic_set_bfqq(), which will cause uaf. Fix the problem by moving bfq_exit_bfqq() behind bic_set_bfqq(). Fixes: `64dc8c732f` ("block, bfq: fix possible uaf for 'bfqq->bic'") Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20221226030605.1437081-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-26 12:09:56 -07:00
Steven Rostedt (Google)	292a089d78	treewide: Convert del_timer() to timer_shutdown() Due to several bugs caused by timers being re-armed after they are shutdown and just before they are freed, a new state of timers was added called "shutdown". After a timer is set to this state, then it can no longer be re-armed. The following script was run to find all the trivial locations where del_timer() or del_timer_sync() is called in the same function that the object holding the timer is freed. It also ignores any locations where the timer->function is modified between the del_timer*() and the free(), as that is not considered a "trivial" case. This was created by using a coccinelle script and the following commands: $ cat timer.cocci @@ expression ptr, slab; identifier timer, rfield; @@ ( - del_timer(&ptr->timer); + timer_shutdown(&ptr->timer); \| - del_timer_sync(&ptr->timer); + timer_shutdown_sync(&ptr->timer); ) ... when strict when != ptr->timer ( kfree_rcu(ptr, rfield); \| kmem_cache_free(slab, ptr); \| kfree(ptr); ) $ spatch timer.cocci . > /tmp/t.patch $ patch -p1 < /tmp/t.patch Link: https://lore.kernel.org/lkml/20221123201306.823305113@linutronix.de/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Pavel Machek <pavel@ucw.cz> [ LED ] Acked-by: Kalle Valo <kvalo@kernel.org> [ wireless ] Acked-by: Paolo Abeni <pabeni@redhat.com> [ networking ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-12-25 13:38:09 -08:00
Linus Torvalds	569c3a283c	block-6.2-2022-12-19 Merge tag 'block-6.2-2022-12-19' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - Various fixes for BFQ (Yu, Yuwei) - Fix for loop command line parsing (Isaac) - No need to specifically clear REQ_ALLOC_CACHE on IOPOLL downgrade anymore (me) - blk-iocost enum fix for newer gcc (Jiri) - UAF fix for queue release (Ming) - blk-iolatency error handling memory leak fix (Tejun) * tag 'block-6.2-2022-12-19' of git://git.kernel.dk/linux: block: don't clear REQ_ALLOC_CACHE for non-polled requests block: fix use-after-free of q->q_usage_counter block, bfq: only do counting of pending-request for BFQ_GROUP_IOSCHED blk-iolatency: Fix memory leak on add_disk() failures loop: Fix the max_loop commandline argument treatment when it is set to 0 block/blk-iocost (gcc13): keep large values in a new enum block, bfq: replace 0/1 with false/true in bic apis block, bfq: don't return bfqg from __bfq_bic_change_cgroup() block, bfq: fix possible uaf for 'bfqq->bic'	2022-12-21 16:35:26 -08:00
Linus Torvalds	71a7507afb	Driver Core changes for 6.2-rc1 Here is the set of driver core and kernfs changes for 6.2-rc1. The "big" change in here is the addition of a new macro, container_of_const() that will preserve the "const-ness" of a pointer passed into it. The "problem" of the current container_of() macro is that if you pass in a "const *", out of it can come a non-const pointer unless you specifically ask for it. For many usages, we want to preserve the "const" attribute by using the same call. For a specific example, this series changes the kobj_to_dev() macro to use it, allowing it to be used no matter what the const value is. This prevents every subsystem from having to declare 2 different individual macros (i.e. kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce the const value at build time, which having 2 macros would not do either. The driver for all of this has been discussions with the Rust kernel developers as to how to properly mark driver core, and kobject, objects as being "non-mutable". The changes to the kobject and driver core in this pull request are the result of that, as there are lots of paths where kobjects and device pointers are not modified at all, so marking them as "const" allows the compiler to enforce this. So, a nice side effect of the Rust development effort has been already to clean up the driver core code to be more obvious about object rules. All of this has been bike-shedded in quite a lot of detail on lkml with different names and implementations resulting in the tiny version we have in here, much better than my original proposal. Lots of subsystem maintainers have acked the changes as well. Other than this change, included in here is smaller stuff like: - kernfs fixes and updates to handle lock contention better - vmlinux.lds.h fixes and updates - sysfs and debugfs documentation updates - device property updates All of these have been in the linux-next tree for quite a while with no problems, OTHER than some merge issues with other trees that should be obvious when you hit them (block tree deletes a driver that this tree modifies, iommufd tree modifies code that this tree also touches). If there are merge problems with these trees, please let me know. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Merge tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH. * tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (58 commits) device property: Fix documentation for fwnode_get_next_parent() firmware_loader: fix up to_fw_sysfs() to preserve const usb.h: take advantage of container_of_const() device.h: move kobj_to_dev() to use container_of_const() container_of: add container_of_const() that preserves const-ness of the pointer driver core: fix up missed drivers/s390/char/hmcdrv_dev.c class.devnode() conversion. driver core: fix up missed scsi/cxlflash class.devnode() conversion. driver core: fix up some missing class.devnode() conversions. 
driver core: make struct class.devnode() take a const * driver core: make struct class.dev_uevent() take a const * cacheinfo: Remove of_node_put() for fw_token device property: Add a blank line in Kconfig of tests device property: Rename goto label to be more precise device property: Move PROPERTY_ENTRY_BOOL() a bit down device property: Get rid of __PROPERTY_ENTRY_ARRAY_ELSIZE() kernfs: fix all kernel-doc warnings and multiple typos driver core: pass a const * into of_device_uevent() kobject: kset_uevent_ops: make name() callback take a const * kobject: kset_uevent_ops: make filter() callback take a const * kobject: make kobject_namespace take a const * ...	2022-12-16 03:54:54 -08:00
Ming Lei	d36a9ea5e7	block: fix use-after-free of q->q_usage_counter For blk-mq, queue release handler is usually called after blk_mq_freeze_queue_wait() returns. However, the q_usage_counter->release() handler may not be run yet at that time, so this can cause a use-after-free. Fix the issue by moving percpu_ref_exit() into blk_free_queue_rcu(). Since ->release() is called with rcu read lock held, it is agreed that the race should be covered in caller per discussion from the two links. Reported-by: Zhang Wensheng <zhangwensheng@huaweicloud.com> Reported-by: Zhong Jinghua <zhongjinghua@huawei.com> Link: https://lore.kernel.org/linux-block/Y5prfOjyyjQKUrtH@T590/T/#u Link: https://lore.kernel.org/lkml/Y4%2FmzMd4evRg9yDi@fedora/ Cc: Hillf Danton <hdanton@sina.com> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Dennis Zhou <dennis@kernel.org> Fixes: `2b0d3d3e4f` ("percpu_ref: reduce memory footprint of percpu_ref in fast path") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20221215021629.74870-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-15 05:23:11 -07:00
Yuwei Guan	1eb206208b	block, bfq: only do counting of pending-request for BFQ_GROUP_IOSCHED The 'bfqd->num_groups_with_pending_reqs' is used when CONFIG_BFQ_GROUP_IOSCHED is enabled, so let the variables and processes take effect when CONFIG_BFQ_GROUP_IOSCHED is enabled. Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Yuwei Guan <Yuwei.Guan@zeekrlife.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20221110112622.389332-1-Yuwei.Guan@zeekrlife.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-15 05:11:59 -07:00
Tejun Heo	813e693023	blk-iolatency: Fix memory leak on add_disk() failures When a gendisk is successfully initialized but add_disk() fails such as when a loop device has invalid number of minor device numbers specified, blkcg_init_disk() is called during init and then blkcg_exit_disk() during error handling. Unfortunately, iolatency gets initialized in the former but doesn't get cleaned up in the latter. This is because, in non-error cases, the cleanup is performed by del_gendisk() calling rq_qos_exit(), the assumption being that rq_qos policies, iolatency being one of them, can only be activated once the disk is fully registered and visible. That assumption is true for wbt and iocost, but not so for iolatency as it gets initialized before add_disk() is called. It is desirable to lazy-init rq_qos policies because they are optional features and add to hot path overhead once initialized - each IO has to walk all the registered rq_qos policies. So, we want to switch iolatency to lazy init too. However, that's a bigger change. As a fix for the immediate problem, let's just add an extra call to rq_qos_exit() in blkcg_exit_disk(). This is safe because duplicate calls to rq_qos_exit() become noop's. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: darklight2357@icloud.com Cc: Josef Bacik <josef@toxicpanda.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Fixes: `d706751215` ("block: introduce blk-iolatency io controller") Cc: stable@vger.kernel.org # v4.19+ Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/Y5TQ5gm3O4HXrXR3@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-14 12:42:55 -07:00
Jiri Slaby (SUSE)	ff1cc97b1f	block/blk-iocost (gcc13): keep large values in a new enum Since gcc13, each member of an enum has the same type as the enum [1]. And that is inherited from its members. Provided: VTIME_PER_SEC_SHIFT = 37, VTIME_PER_SEC = 1LLU << VTIME_PER_SEC_SHIFT, ... AUTOP_CYCLE_NSEC = 10LLU * NSEC_PER_SEC, the named type is unsigned long. This generates warnings with gcc-13: block/blk-iocost.c: In function 'ioc_weight_prfill': block/blk-iocost.c:3037:37: error: format '%u' expects argument of type 'unsigned int', but argument 4 has type 'long unsigned int' block/blk-iocost.c: In function 'ioc_weight_show': block/blk-iocost.c:3047:34: error: format '%u' expects argument of type 'unsigned int', but argument 3 has type 'long unsigned int' So split the anonymous enum with large values to a separate enum, so that they don't affect other members. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36113 Cc: Martin Liska <mliska@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: cgroups@vger.kernel.org Cc: linux-block@vger.kernel.org Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Link: https://lore.kernel.org/r/20221213120826.17446-1-jirislaby@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-14 09:56:10 -07:00
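The enum split described in the blk-iocost commit above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the kernel code: the `EXAMPLE_*` names and the helper functions are hypothetical, and only `VTIME_PER_SEC_SHIFT` and `VTIME_PER_SEC` are taken from the commit message. Enumerator values wider than `int` rely on a compiler extension before C23, which GCC and Clang accept by default.

```c
#include <assert.h>

/* Hypothetical small constants: every value fits in int. */
enum {
	EXAMPLE_WEIGHT_MIN = 1,
	EXAMPLE_WEIGHT_DFL = 100,
};

/*
 * Large values live in their own anonymous enum, so whatever wide type
 * the compiler picks for them cannot leak into the small constants.
 */
enum {
	VTIME_PER_SEC_SHIFT = 37,
	VTIME_PER_SEC = 1LLU << VTIME_PER_SEC_SHIFT,
};

/* The small enumerators keep type int, so "%u"-style formatting of
 * them does not see a 'long unsigned int' argument. */
_Static_assert(_Generic(EXAMPLE_WEIGHT_DFL, int: 1, default: 0),
	       "small enumerators should have type int");

/* Hypothetical accessor, returns the wide constant unchanged. */
unsigned long long example_vtime_per_sec(void)
{
	return VTIME_PER_SEC;
}

/* Hypothetical self-check: the wide constant keeps its full value. */
void example_enum_check(void)
{
	assert(example_vtime_per_sec() == (1ULL << 37));
}
```

Had the small and large constants shared one enum, a compiler that gives every enumerator the enum's own type would widen the small ones too, which is the situation the warnings quoted in the commit message point at.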
Yu Kuai	337366e02b	block, bfq: replace 0/1 with false/true in bic apis Just to make the code a little cleaner; there are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221214033155.3455754-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-14 09:51:11 -07:00
Yu Kuai	452af7dc59	block, bfq: don't return bfqg from __bfq_bic_change_cgroup() The return value is not used, hence remove it. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221214033155.3455754-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-14 09:51:11 -07:00
Yu Kuai	64dc8c732f	block, bfq: fix possible uaf for 'bfqq->bic' Our test reports a uaf for 'bfqq->bic' in 5.10: ================================================================== BUG: KASAN: use-after-free in bfq_select_queue+0x378/0xa30 CPU: 6 PID: 2318352 Comm: fsstress Kdump: loaded Not tainted 5.10.0-60.18.0.50.h602.kasan.eulerosv2r11.x86_64 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-20220320_160524-szxrtosci10000 04/01/2014 Call Trace: bfq_select_queue+0x378/0xa30 bfq_dispatch_request+0xe8/0x130 blk_mq_do_dispatch_sched+0x62/0xb0 __blk_mq_sched_dispatch_requests+0x215/0x2a0 blk_mq_sched_dispatch_requests+0x8f/0xd0 __blk_mq_run_hw_queue+0x98/0x180 __blk_mq_delay_run_hw_queue+0x22b/0x240 blk_mq_run_hw_queue+0xe3/0x190 blk_mq_sched_insert_requests+0x107/0x200 blk_mq_flush_plug_list+0x26e/0x3c0 blk_finish_plug+0x63/0x90 __iomap_dio_rw+0x7b5/0x910 iomap_dio_rw+0x36/0x80 ext4_dio_read_iter+0x146/0x190 [ext4] ext4_file_read_iter+0x1e2/0x230 [ext4] new_sync_read+0x29f/0x400 vfs_read+0x24e/0x2d0 ksys_read+0xd5/0x1b0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x61/0xc6 Commit `3bc5e683c6` ("bfq: Split shared queues on move between cgroups") changed things so that moving a process to a new cgroup allocates a new bfqq to use; however, the old bfqq and the new bfqq can point to the same bic: 1) Initial state, two processes with io in the same cgroup. Process 1 Process 2 (BIC1) (BIC2) \| Λ \| Λ \| \| \| \| V \| V \| bfqq1 bfqq2 2) bfqq1 is merged to bfqq2. Process 1 Process 2 (BIC1) (BIC2) \| \| \-------------\\| V bfqq1 bfqq2(coop) 3) Process 1 exits, then new io (denoted IOA) is issued from Process 2. (BIC2) \| Λ \| \| V \| bfqq2(coop) 4) Before IOA is completed, move Process 2 to another cgroup and issue io. Process 2 (BIC2) Λ \|\--------------\ \| V bfqq2 bfqq3 Now BIC2 points to bfqq3, while bfqq2 and bfqq3 both point to BIC2. 
If all the requests are completed and Process 2 exits, BIC2 will be freed while there is no guarantee that bfqq2 will be freed before BIC2. Fix the problem by clearing bfqq->bic while bfqq is detached from bic. Fixes: `3bc5e683c6` ("bfq: Split shared queues on move between cgroups") Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221214030430.3304151-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-14 09:50:34 -07:00
Linus Torvalds	ce8a79d560	for-6.2/block-2022-12-08 Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - Support some passthrough commands without CAP_SYS_ADMIN (Kanchan Joshi) - Refactor PCIe probing and reset (Christoph Hellwig) - Various fabrics authentication fixes and improvements (Sagi Grimberg) - Avoid fallback to sequential scan due to transient issues (Uday Shankar) - Implement support for the DEAC bit in Write Zeroes (Christoph Hellwig) - Allow overriding the IEEE OUI and firmware revision in configfs for nvmet (Aleksandr Miloserdov) - Force reconnect when number of queue changes in nvmet (Daniel Wagner) - Minor fixes and improvements (Uros Bizjak, Joel Granados, Sagi Grimberg, Christoph Hellwig, Christophe JAILLET) - Fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni) - Use the common tagset helpers in nvme-pci driver (Christoph Hellwig) - Cleanup the nvme-pci removal path (Christoph Hellwig) - Use kstrtobool() instead of strtobool (Christophe JAILLET) - Allow unprivileged passthrough of Identify Controller (Joel 
Granados) - Support io stats on the mpath device (Sagi Grimberg) - Minor nvmet cleanup (Sagi Grimberg) - MD pull requests via Song: - Code cleanups (Christoph) - Various fixes - Floppy pull request from Denis: - Fix a memory leak in the init error path (Yuan) - Series fixing some batch wakeup issues with sbitmap (Gabriel) - Removal of the pktcdvd driver that was deprecated more than 5 years ago, and subsequent removal of the devnode callback in struct block_device_operations as no users are now left (Greg) - Fix for partition read on an exclusively opened bdev (Jan) - Series of elevator API cleanups (Jinlong, Christoph) - Series of fixes and cleanups for blk-iocost (Kemeng) - Series of fixes and cleanups for blk-throttle (Kemeng) - Series adding concurrent support for sync queues in BFQ (Yu) - Series bringing drbd a bit closer to the out-of-tree maintained version (Christian, Joel, Lars, Philipp) - Misc drbd fixes (Wang) - blk-wbt fixes and tweaks for enable/disable (Yu) - Fixes for mq-deadline for zoned devices (Damien) - Add support for read-only and offline zones for null_blk (Shin'ichiro) - Series fixing the delayed holder tracking, as used by DM (Yu, Christoph) - Series enabling bio alloc caching for IRQ based IO (Pavel) - Series enabling userspace peer-to-peer DMA (Logan) - BFQ waker fixes (Khazhismel) - Series fixing elevator refcount issues (Christoph, Jinlong) - Series cleaning up references around queue destruction (Christoph) - Series doing quiesce by tagset, enabling cleanups in drivers (Christoph, Chao) - Series untangling the queue kobject and queue references (Christoph) - Misc fixes and cleanups (Bart, David, Dawei, Jinlong, Kemeng, Ye, Yang, Waiman, Shin'ichiro, Randy, Pankaj, Christoph) * tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux: (247 commits) blktrace: Fix output non-blktrace event when blk_classic option enabled block: sed-opal: Don't include <linux/kernel.h> sed-opal: allow using IOC_OPAL_SAVE for locking too blk-cgroup: Fix 
typo in comment block: remove bio_set_op_attrs nvmet: don't open-code NVME_NS_ATTR_RO enumeration nvme-pci: use the tagset alloc/free helpers nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers nvme: consolidate setting the tagset flags nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set block: bio_copy_data_iter nvme-pci: split out a nvme_pci_ctrl_is_dead helper nvme-pci: return early on ctrl state mismatch in nvme_reset_work nvme-pci: rename nvme_disable_io_queues nvme-pci: cleanup nvme_suspend_queue nvme-pci: remove nvme_pci_disable nvme-pci: remove nvme_disable_admin_queue nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl nvme: use nvme_wait_ready in nvme_shutdown_ctrl ...	2022-12-13 10:43:59 -08:00
Linus Torvalds	8129bac60f	fscrypt updates for 6.2 This release adds SM4 encryption support, contributed by Tianjia Zhang. SM4 is a Chinese block cipher that is an alternative to AES. I recommend against using SM4, but (according to Tianjia) some people are being required to use it. Since SM4 has been turning up in many other places (crypto API, wireless, TLS, OpenSSL, ARMv8 CPUs, etc.), it hasn't been very controversial, and some people have to use it, I don't think it would be fair for me to reject this optional feature. Besides the above, there are a couple cleanups. Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt Pull fscrypt updates from Eric Biggers. * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fscrypt: add additional documentation for SM4 support fscrypt: remove unused Speck definitions fscrypt: Add SM4 XTS/CTS symmetric algorithm support blk-crypto: Add support for SM4-XTS blk crypto mode fscrypt: add comment for fscrypt_valid_enc_modes_v1() fscrypt: pass super_block to fscrypt_put_master_key_activeref()	2022-12-12 20:03:50 -08:00
Luca Boccassi	c1f480b2d0	sed-opal: allow using IOC_OPAL_SAVE for locking too Usually when closing a crypto device (eg: dm-crypt with LUKS) the volume key is not required, as it requires root privileges anyway, and root can deny access to a disk in many ways regardless. Requiring the volume key to lock the device is a peculiarity of the OPAL specification. Given we might already have saved the key if the user requested it via the 'IOC_OPAL_SAVE' ioctl, we can use that key to lock the device if no key was provided here and the locking range matches, and the user sets the appropriate flag with 'IOC_OPAL_SAVE'. This allows integrating OPAL with tools and libraries that are used to the common behaviour and do not ask for the volume key when closing a device. Callers can always pass a non-zero key and it will be used regardless, as before. Suggested-by: Štěpán Horáček <stepan.horacek@gmail.com> Signed-off-by: Luca Boccassi <bluca@debian.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20221206092913.4625-1-luca.boccassi@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-08 09:17:45 -07:00
Kemeng Shi	37754595e9	blk-cgroup: Fix typo in comment Replace assocating with associating. Replace intiailized with initialized. Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221206093307.378249-1-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-08 09:17:21 -07:00
Christoph Hellwig	db1c7d7797	block: bio_copy_data_iter With the pktcdvd removal, bio_copy_data_iter is unused now. Fold the logic into bio_copy_data and remove the separate lower level function. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221206144407.722049-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-06 10:18:27 -07:00
Kemeng Shi	eea3e8b74a	blk-throttle: Use more suitable time_after check for update of slice_start There is no need to update tg->slice_start[rw] to start when they are equal already. So remove "eq" part of check before update slice_start. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-10-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-05 13:45:31 -07:00
Kemeng Shi	9c9f209d9d	blk-throttle: remove repeat check of elapsed time There is no need to check the elapsed time since the last upgrade for each node in the hierarchy. Move this check before the traversal, as throtl_can_upgrade does, to remove the repeated check. Reported-by: kernel test robot <lkp@intel.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-9-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-05 13:45:11 -07:00
Kemeng Shi	e3031d4c7d	blk-throttle: remove incorrect comment for tg_last_low_overflow_time Function tg_last_low_overflow_time is called with intermediate node as following: throtl_hierarchy_can_downgrade throtl_tg_can_downgrade tg_last_low_overflow_time throtl_hierarchy_can_upgrade throtl_tg_can_upgrade tg_last_low_overflow_time throtl_hierarchy_can_downgrade/throtl_hierarchy_can_upgrade will traverse from leaf node to sub-root node and pass traversed intermediate node to tg_last_low_overflow_time. No such limit could be found from context and implementation of tg_last_low_overflow_time, so remove this limit in comment. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-8-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-05 13:45:05 -07:00
Kemeng Shi	009df34171	blk-throttle: fix typo in comment of throtl_adjusted_limit lapsed time -> elapsed time Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-7-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-05 13:44:59 -07:00
Kemeng Shi	a4d508e333	blk-throttle: simpfy low limit reached check in throtl_tg_can_upgrade Commit `c79892c557` ("blk-throttle: add upgrade logic for LIMIT_LOW state") added upgrade logic for low limit and mentioned that 1. "To determine if a cgroup exceeds its limitation, we check if the cgroup has pending request. Since cgroup is throttled according to the limit, pending request means the cgroup reaches the limit." 2. "If a cgroup has limit set for both read and write, we consider the combination of them for upgrade. The reason is read IO and write IO can interfere with each other. If we do the upgrade based in one direction IO, the other direction IO could be severely harmed." Besides, we also determine that cgroup reaches low limit if low limit is 0, see comment in throtl_tg_can_upgrade. Collecting the information above, the design of the upgrade check is as follows: 1. The low limit is reached if limit is zero or io is already queued. 2. Cgroup will pass upgrade check if low limits of READ and WRITE are both reached. Simplify the check code described above to remove the repeated check and improve readability. There is no functional change. 
Detailed equivalence proof is as follows: All replaced conditions to return true are as follows: condition 1 (!read_limit && !write_limit) condition 2 read_limit && sq->nr_queued[READ] && (!write_limit \|\| sq->nr_queued[WRITE]) condition 3 write_limit && sq->nr_queued[WRITE] && (!read_limit \|\| sq->nr_queued[READ]) Transforming condition 2 as follows: (read_limit && sq->nr_queued[READ]) && (!write_limit \|\| sq->nr_queued[WRITE]) is equivalent to (read_limit && sq->nr_queued[READ]) && (!write_limit \|\| (write_limit && sq->nr_queued[WRITE])) is equivalent to condition 2.1 (read_limit && sq->nr_queued[READ] && !write_limit) \|\| condition 2.2 (read_limit && sq->nr_queued[READ] && (write_limit && sq->nr_queued[WRITE])) Transforming condition 3 as follows: write_limit && sq->nr_queued[WRITE] && (!read_limit \|\| sq->nr_queued[READ]) is equivalent to (write_limit && sq->nr_queued[WRITE]) && (!read_limit \|\| (read_limit && sq->nr_queued[READ])) is equivalent to condition 3.1 ((write_limit && sq->nr_queued[WRITE]) && !read_limit) \|\| condition 3.2 ((write_limit && sq->nr_queued[WRITE]) && (read_limit && sq->nr_queued[READ])) Condition 3.2 is the same as condition 2.2, so all conditions we get to return are as follows: (!read_limit && !write_limit) (1) (!read_limit && (write_limit && sq->nr_queued[WRITE])) (3.1) ((read_limit && sq->nr_queued[READ]) && !write_limit) (2.1) ((write_limit && sq->nr_queued[WRITE]) && (read_limit && sq->nr_queued[READ])) (2.2) As we can expand conditions "(a1 \|\| a2) && (b1 \|\| b2)" to: a1 && b1 a1 && b2 a2 && b1 a2 && b2 Considering that: a1 = !read_limit a2 = read_limit && sq->nr_queued[READ] b1 = !write_limit b2 = write_limit && sq->nr_queued[WRITE] We can pack replaced conditions to (!read_limit \|\| (read_limit && sq->nr_queued[READ])) && (!write_limit \|\| (write_limit && sq->nr_queued[WRITE])) which is equivalent to (!read_limit \|\| sq->nr_queued[READ]) && (!write_limit \|\| sq->nr_queued[WRITE]) Reported-by: kernel test 
robot <lkp@intel.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-6-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-05 13:44:48 -07:00
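The equivalence argued in the blk-throttle commit above can also be checked by brute force. The sketch below is illustrative only and is not the kernel code: it models the limits and the `sq->nr_queued[READ]` / `sq->nr_queued[WRITE]` counts as plain booleans (zero versus non-zero), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* The three replaced conditions (1, 2 and 3) from the commit message. */
bool can_upgrade_old(bool read_limit, bool write_limit,
		     bool nr_read, bool nr_write)
{
	if (!read_limit && !write_limit)
		return true;
	if (read_limit && nr_read && (!write_limit || nr_write))
		return true;
	if (write_limit && nr_write && (!read_limit || nr_read))
		return true;
	return false;
}

/* The simplified condition the commit message arrives at. */
bool can_upgrade_new(bool read_limit, bool write_limit,
		     bool nr_read, bool nr_write)
{
	return (!read_limit || nr_read) && (!write_limit || nr_write);
}

/* Walk all 16 input combinations and count disagreements. */
int count_mismatches(void)
{
	int mismatches = 0;

	for (int bits = 0; bits < 16; bits++) {
		bool rl = bits & 1, wl = bits & 2;
		bool nr = bits & 4, nw = bits & 8;

		if (can_upgrade_old(rl, wl, nr, nw) !=
		    can_upgrade_new(rl, wl, nr, nw))
			mismatches++;
	}
	return mismatches;
}

/* Hypothetical self-check: the two forms agree on every input. */
void check_upgrade_equivalence(void)
{
	assert(count_mismatches() == 0);
}
```

With only four boolean inputs, exhaustive enumeration covers the same ground as the case analysis in the commit message.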
Kemeng Shi	183daeb11d	blk-throttle: correct calculation of wait time in tg_may_dispatch In C, when executing "if (expression1 && expression2)" and expression1 evaluates to false, expression2 is not evaluated. For "tg_within_bps_limit(tg, bio, bps_limit, &bps_wait) && tg_within_iops_limit(tg, bio, iops_limit, &iops_wait))", if bps is limited, tg_within_bps_limit will return false and tg_within_iops_limit will not be called. So even if bps and iops are both limited, iops_wait will not be calculated and is always zero, and the wait time of iops is always ignored. Fix this by always calling tg_within_bps_limit and tg_within_iops_limit to get the wait time for both bps and iops. Observed that: 1. The wait time in tg_within_iops_limit/tg_within_bps_limit always needs to be stored, as the wait argument is always passed. 2. The stored wait time is zero if the bio is within the iops/bps limit; otherwise a non-zero value is stored. Simplify tg_within_iops_limit/tg_within_bps_limit by removing the wait argument and returning the wait time directly. Caller tg_may_dispatch checks whether the wait time is zero to find out if iops/bps is limited. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-5-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-05 13:44:42 -07:00
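The short-circuit problem described in the tg_may_dispatch commit above can be reproduced in a small standalone C sketch. Everything here is a hypothetical stand-in rather than the kernel code: `within_limit()` mimics a check that reports its wait time through an out parameter, and combining the two waits by taking the larger one is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical check: stores the wait time, returns true when it is zero. */
bool within_limit(unsigned long wait_needed, unsigned long *wait)
{
	*wait = wait_needed;
	return wait_needed == 0;
}

/* Old shape: && skips the second call when the first returns false. */
unsigned long wait_short_circuit(unsigned long bps_needed,
				 unsigned long iops_needed)
{
	unsigned long bps_wait = 0, iops_wait = 0;

	if (within_limit(bps_needed, &bps_wait) &&
	    within_limit(iops_needed, &iops_wait))
		return 0;
	/* iops_wait is still 0 here if the bps check already failed. */
	return bps_wait > iops_wait ? bps_wait : iops_wait;
}

/* Fixed shape: both waits are always computed, then compared. */
unsigned long wait_fixed(unsigned long bps_needed,
			 unsigned long iops_needed)
{
	unsigned long bps_wait = bps_needed;
	unsigned long iops_wait = iops_needed;

	if (bps_wait + iops_wait == 0)
		return 0;
	return bps_wait > iops_wait ? bps_wait : iops_wait;
}

/* Hypothetical self-check: the old shape loses the longer iops wait. */
void check_wait_shapes(void)
{
	assert(wait_short_circuit(5, 20) == 5);
	assert(wait_fixed(5, 20) == 20);
}
```

When both limits are exceeded, the old shape reports only the bps wait, which matches the "iops_wait ... is always zero" behaviour the commit message describes.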
Kemeng Shi	eb18479182	blk-throttle: ignore cgroup without io queued in blk_throtl_cancel_bios Ignore cgroup without io queued in blk_throtl_cancel_bios for two reasons: 1. Save CPU cycles otherwise spent trying to dispatch a cgroup that has no io queued. 2. Avoid an inconsistent state in which the cgroup is inserted into the service queue without THROTL_TG_PENDING set, as tg_update_disptime will unconditionally re-insert the cgroup into the service queue. If we are on the default hierarchy, IO dispatched from a child in tg_dispatch_one_bio will then trigger inserting the cgroup into the service queue without erasing it first, ruining the tree. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-4-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-05 13:44:34 -07:00
Kemeng Shi	84aca0a7e0	blk-throttle: Fix that bps of child could exceed bps limited in parent Consider the following situation (on the default hierarchy): HDD \| root (bps limit: 4k) \| child (bps limit: 8k) \| fio bs=8k The rate of fio is supposed to be 4k, but the result is 8k. The reason is as follows: the size of a single IO from fio is larger than the bytes allowed in one throtl_slice in the child, so IOs are always queued in the child group first. When the queued IOs in the child are dispatched to the parent group, BIO_BPS_THROTTLED is set and these IOs are no longer limited by tg_within_bps_limit. Fix this by only setting BIO_BPS_THROTTLED when the bio has traversed the entire tree. This patch has no influence on setups that are not on the default hierarchy, as each group there is a single root group without a parent. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-3-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-05 13:44:27 -07:00
Kemeng Shi	f56019aef3	blk-throttle: correct stale comment in throtl_pd_init On the default hierarchy (cgroup2), the throttle interface files don't exist in the root cgroup, so the ability to limit the whole system by configuring the root group no longer exists. In general, cgroup doesn't want to be in the business of restricting resources at the system level, so correct the stale comment from "we can limit the whole system" to "we can only limit a subtree". Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-2-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-05 13:44:19 -07:00
Greg Kroah-Hartman	85d6ce58e4	block: remove devnode callback from struct block_device_operations With the removal of the pktcdvd driver, there are no in-kernel users of the devnode callback in struct block_device_operations, so it can be safely removed. If it is needed for new block drivers in the future, it can be brought back. Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20221203140747.1942969-1-gregkh@linuxfoundation.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-03 09:19:48 -07:00
Yang Li	1d6df9d352	blk-cgroup: Fix some kernel-doc comments Change the description of @gendisk to @disk in blkcg_schedule_throttle() to clear the warnings below: block/blk-cgroup.c:1850: warning: Function parameter or member 'disk' not described in 'blkcg_schedule_throttle' block/blk-cgroup.c:1850: warning: Excess function parameter 'gendisk' description in 'blkcg_schedule_throttle' Fixes: `de185b56e8` ("blk-cgroup: pass a gendisk to blkcg_schedule_throttle") Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3338 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/20221202011713.14834-1-yang.lee@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-01 18:22:12 -07:00
Tianjia Zhang	d209ce353a	blk-crypto: Add support for SM4-XTS blk crypto mode SM4 is a symmetric cipher algorithm widely used in China. The SM4-XTS variant is used to encrypt length-preserving data. This is the mandatory algorithm in some special scenarios. Add support for the algorithm to block inline encryption. This is needed for the inlinecrypt mount option to be supported via blk-crypto-fallback, as it is for the other fscrypt modes. Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20221201125819.36932-2-tianjia.zhang@linux.alibaba.com	2022-12-01 10:57:10 -08:00
Randy Dunlap	2e833c8c8c	block: bdev & blktrace: use consistent function doc. notation Use only one hyphen in kernel-doc notation between the function name and its short description. This is the documented kernel-doc format. It also fixes the HTML presentation to be consistent with other functions. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Link: https://lore.kernel.org/r/20221201070331.25685-1-rdunlap@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-01 09:16:46 -07:00
Kemeng Shi	7a88b1a826	blk-iocost: Correct comment in blk_iocost_init There is no iocg_pd_init function. The pd_alloc_fn function pointer of iocost policy is set with ioc_pd_init. Just correct it. Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20221018121932.10792-6-shikemeng@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-01 07:44:13 -07:00
Kemeng Shi	6c31be320c	blk-iocost: Remove vrate member in struct ioc_now If we trace vtime_base_rate instead of vtime_rate, there is nowhere which accesses now->vrate except function ioc_now using now->vrate locally. Just remove it. Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20221018121932.10792-5-shikemeng@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-01 07:44:12 -07:00
Kemeng Shi	63c9eac4b6	blk-iocost: Trace vtime_base_rate instead of vtime_rate Commit `ac33e91e2d` ("blk-iocost: implement vtime loss compensation") renamed the original vtime_rate to vtime_base_rate; the current vtime_rate is the original vtime_rate with compensation applied. The rate currently shown in tracepoints is a mix of vtime_rate and vtime_base_rate: 1) In function ioc_adjust_base_vrate, the first trace_iocost_ioc_vrate_adj shows vtime_rate, the second trace_iocost_ioc_vrate_adj shows vtime_base_rate. 2) Function iocg_activate shows vtime_rate by calling TRACE_IOCG_PATH(iocg_activate... 3) Function ioc_check_iocgs shows vtime_rate by calling TRACE_IOCG_PATH(iocg_idle... Trace vtime_base_rate instead of vtime_rate because: 1) Before commit `ac33e91e2d` ("blk-iocost: implement vtime loss compensation"), the traced rate was without compensation, so keep showing the rate without compensation. 2) vtime_base_rate is more stable, while vtime_rate depends heavily on the excess budget of the current period, which may change abruptly in the next period. Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20221018121932.10792-4-shikemeng@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-01 07:44:12 -07:00
Kemeng Shi	c6d2efdd38	blk-iocost: Reset vtime_base_rate in ioc_refresh_params Since commit ac33e91e2daca ("blk-iocost: implement vtime loss compensation") split vtime_rate into vtime_rate and vtime_base_rate, we need to reset both vtime_base_rate and vtime_rate when device parameters are refreshed. If vtime_base_rate is not reset here, vtime_rate will soon be overwritten with the old vtime_base_rate in ioc_refresh_vrate. Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20221018121932.10792-3-shikemeng@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-01 07:44:12 -07:00
Kemeng Shi	ecaaaabeea	blk-iocost: Fix typo in comment soley -> solely Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20221018121932.10792-2-shikemeng@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-01 07:44:12 -07:00
Jan Kara	36369f46e9	block: Do not reread partition table on exclusively open device Since commit `10c70d95c0` ("block: remove the bd_openers checks in blk_drop_partitions") we allow rereading of the partition table even though there are users of the block device. This has the undesirable consequence that, e.g., if sda and sdb are assembled into a RAID1 device md0 with partitions, a BLKRRPART ioctl on sda will rescan the partition table and create an sda1 device. This partition device under a RAID device confuses some programs (such as libstorage-ng, used for initial partitioning during distribution installation), leading to failures. Fix the problem by refusing to rescan partitions if there is another user that has the block device exclusively open. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20221130135344.2ul4cyfstfs3znxg@quack3 Fixes: `10c70d95c0` ("block: remove the bd_openers checks in blk_drop_partitions") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221130175653.24299-1-jack@suse.cz [axboe: fold in followup fix] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-01 07:44:03 -07:00
Christoph Hellwig	63f93fd6fa	block: mark blk_put_queue as potentially blocking We can't just say that the last reference release may block, as any reference dropped could be the last one. So move the might_sleep() from blk_free_queue to blk_put_queue and update the documentation. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221114042637.1009333-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-30 11:09:00 -07:00
Christoph Hellwig	2bd85221a6	block: untangle request_queue refcounting from sysfs The kobject embedded into the request_queue is used for the queue directory in sysfs, but that is a child of the gendisks directory and is intimately tied to it. Move this kobject to the gendisk and use a refcount_t in the request_queue for the actual request_queue refcounting that is completely unrelated to the device model. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221114042637.1009333-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-30 11:09:00 -07:00
Christoph Hellwig	40602997be	block: fix error unwinding in blk_register_queue blk_register_queue fails to handle errors from blk_mq_sysfs_register, leaks various resources on errors and accidentally sets queue refs percpu refcount to percpu mode on kobject_add failure. Fix all that by properly unwinding on errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221114042637.1009333-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-30 11:09:00 -07:00
Christoph Hellwig	6fc75f309d	block: factor out a blk_debugfs_remove helper Split the debugfs removal from blk_unregister_queue into a helper so that it can be reused for blk_register_queue error handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221114042637.1009333-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-30 11:09:00 -07:00
Christoph Hellwig	450deb93df	blk-crypto: pass a gendisk to blk_crypto_sysfs_{,un}register Prepare for changes to the block layer sysfs handling by passing the readily available gendisk to blk_crypto_sysfs_{,un}register. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20221114042637.1009333-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-30 11:09:00 -07:00
Jens Axboe	c62256dda3	Revert "blk-cgroup: Flush stats at blkgs destruction path" This reverts commit `dae590a6c9`. We've had a few reports on this causing a crash at boot time, because of a reference issue. While this problem seemingly did exist before the patch and needs solving separately, this patch makes it a lot easier to trigger. Link: https://lore.kernel.org/linux-block/CA+QYu4oxiRKC6hJ7F27whXy-PRBx=Tvb+-7TQTONN8qTtV3aDA@mail.gmail.com/ Link: https://lore.kernel.org/linux-block/69af7ccb-6901-c84c-0e95-5682ccfb750c@acm.org/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-30 08:25:46 -07:00
Jinlong Chen	8d283ee62b	block: use bool as the return type of elv_iosched_allow_bio_merge We have bool type now, update the old signature. Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/0db0a0298758d60d0f4df8b7126ac6a381e5a5bb.1669736350.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-29 10:53:10 -07:00
Jinlong Chen	c6451ede40	block: replace "len+name" with "name+len" in elv_iosched_show The "pointer + offset" pattern is more reasonable. Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/d9beaee71b14f7b2a39ab0db6458dc0f7d961ceb.1669736350.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-29 10:53:10 -07:00
Jinlong Chen	7a3b3660fd	block: always use 'e' when printing scheduler name Printing e->elevator_name in all cases improves the readability, and 'e' and 'cur' are identical in this branch. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn> Link: https://lore.kernel.org/r/4bae180ffbac608ea0cf46ffa9739ce0973b60aa.1669736350.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-29 10:53:10 -07:00
Jinlong Chen	5998249e32	block: replace continue with else-if in elv_iosched_show else-if is more readable than continue here. Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn> Link: https://lore.kernel.org/r/77ac19ba556efd2c8639a6396eb4203c59bc13d6.1669736350.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-29 10:53:06 -07:00
Jinlong Chen	7919d679ae	block: include 'none' for initial elv_iosched_show call This makes the printing order of the io schedulers consistent, and removes a redundant q->elevator check. Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/bdd7083ed4f232e3285f39081e3c5f30b20b8da2.1669736350.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-29 10:48:59 -07:00
Damien Le Moal	3692fec8bb	block: mq-deadline: Rename deadline_is_seq_writes() Rename deadline_is_seq_writes() to deadline_is_seq_write() (remove the "s" plural) to more correctly reflect the fact that this function tests a single request, not multiple requests. Fixes: `015d02f485` ("block: mq-deadline: Do not break sequential write streams to zoned HDDs") Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20221126025550.967914-2-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-28 19:27:45 -07:00
Linus Torvalds	990f320031	block-6.1-2022-11-25 Merge tag 'block-6.1-2022-11-25' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - A few fixes for s390 dasd (Stefan, Colin) - Ensure that ublk doesn't reorder requests, as that can be problematic on devices that need specific ordering (Ming) - Fix a queue reference leak in disk allocation handling (Christoph) * tag 'block-6.1-2022-11-25' of git://git.kernel.dk/linux: ublk_drv: don't forward io commands in reserve order s390/dasd: fix possible buffer overflow in copy_pair_show s390/dasd: fix no record found for raw_track_access s390/dasd: increase printing of debug data payload s390/dasd: Fix spelling mistake "Ivalid" -> "Invalid" blk-mq: fix queue reference leak on blk_mq_alloc_disk_for_queue failure	2022-11-25 17:50:57 -08:00
Ye Bin	4b7a21c57b	blk-mq: fix possible memleak when register 'hctx' failed The following issue shows up during a fault injection test: unreferenced object 0xffff888132a9f400 (size 512): comm "insmod", pid 308021, jiffies 4324277909 (age 509.733s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 08 f4 a9 32 81 88 ff ff ...........2.... 08 f4 a9 32 81 88 ff ff 00 00 00 00 00 00 00 00 ...2............ backtrace: [<00000000e8952bb4>] kmalloc_node_trace+0x22/0xa0 [<00000000f9980e0f>] blk_mq_alloc_and_init_hctx+0x3f1/0x7e0 [<000000002e719efa>] blk_mq_realloc_hw_ctxs+0x1e6/0x230 [<000000004f1fda40>] blk_mq_init_allocated_queue+0x27e/0x910 [<00000000287123ec>] __blk_mq_alloc_disk+0x67/0xf0 [<00000000a2a34657>] 0xffffffffa2ad310f [<00000000b173f718>] 0xffffffffa2af824a [<0000000095a1dabb>] do_one_initcall+0x87/0x2a0 [<00000000f32fdf93>] do_init_module+0xdf/0x320 [<00000000cbe8541e>] load_module+0x3006/0x3390 [<0000000069ed1bdb>] __do_sys_finit_module+0x113/0x1b0 [<00000000a1a29ae8>] do_syscall_64+0x35/0x80 [<000000009cd878b0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 Fault injection context as follows: kobject_add blk_mq_register_hctx blk_mq_sysfs_register blk_register_queue device_add_disk null_add_dev.part.0 [null_blk] 'blk_mq_register_hctx' may already have added some objects when it fails halfway, but it does not roll them back, and the caller doesn't know which objects failed to be added. To solve the above issue, roll back the already added objects when adding fails halfway in 'blk_mq_register_hctx'. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20221117022940.873959-1-yebin@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-25 06:34:03 -07:00
Greg Kroah-Hartman	ff62b8e658	driver core: make struct class.devnode() take a const * The devnode() in struct class should not be modifying the device that is passed into it, so mark it as a const * and propagate the function signature changes out into all relevant subsystems that use this callback. Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Reinette Chatre <reinette.chatre@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Jens Axboe <axboe@kernel.dk> Cc: Justin Sanders <justin@coraid.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Benjamin Gaignard <benjamin.gaignard@collabora.com> Cc: Liam Mark <lmark@codeaurora.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Brian Starkey <Brian.Starkey@arm.com> Cc: John Stultz <jstultz@google.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: David Airlie <airlied@gmail.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Sean Young <sean@mess.org> Cc: Frank Haverkamp <haver@linux.ibm.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Cornelia Huck <cohuck@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Takashi Iwai <tiwai@suse.com> Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Xie Yongji <xieyongji@bytedance.com> Cc: Gautam Dawar <gautam.dawar@xilinx.com> Cc: Dan Carpenter <error27@gmail.com> Cc: Eli Cohen <elic@nvidia.com> Cc: Parav Pandit <parav@nvidia.com> Cc: Maxime Coquelin <maxime.coquelin@redhat.com> Cc: alsa-devel@alsa-project.org Cc: dri-devel@lists.freedesktop.org Cc: kvm@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-block@vger.kernel.org Cc: linux-input@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: linux-rdma@vger.kernel.org Cc: linux-scsi@vger.kernel.org Cc: linux-usb@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Link: https://lore.kernel.org/r/20221123122523.1332370-2-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-11-24 17:12:27 +01:00
Greg Kroah-Hartman	23680f0b7d	driver core: make struct class.dev_uevent() take a const * The dev_uevent() in struct class should not be modifying the device that is passed into it, so mark it as a const * and propagate the function signature changes out into all relevant subsystems that use this callback. Cc: Jens Axboe <axboe@kernel.dk> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Russ Weight <russell.h.weight@intel.com> Cc: Jean Delvare <jdelvare@suse.com> Cc: Johan Hovold <johan@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Karsten Keil <isdn@linux-pingi.de> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Keith Busch <kbusch@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Wolfram Sang <wsa+renesas@sang-engineering.com> Cc: Raed Salem <raeds@nvidia.com> Cc: Chen Zhongjin <chenzhongjin@huawei.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Avihai Horon <avihaih@nvidia.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Colin Ian King <colin.i.king@gmail.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Jakob Koschel <jakobkoschel@gmail.com> Cc: Antoine Tenart <atenart@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Wang Yufen <wangyufen@huawei.com> Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: linux-nvme@lists.infradead.org Cc: linux-pm@vger.kernel.org Cc: linux-rdma@vger.kernel.org Cc: linux-usb@vger.kernel.org Cc: linux-wireless@vger.kernel.org Cc: netdev@vger.kernel.org Acked-by: Sebastian Reichel <sre@kernel.org> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lore.kernel.org/r/20221123122523.1332370-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-11-24 17:12:15 +01:00
Ye Bin	90b0296ece	block: fix crash in 'blk_mq_elv_switch_none' Syzbot found the following issue: general protection fault, probably for non-canonical address 0xdffffc000000001d: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x00000000000000e8-0x00000000000000ef] CPU: 0 PID: 5234 Comm: syz-executor931 Not tainted 6.1.0-rc3-next-20221102-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 RIP: 0010:__elevator_get block/elevator.h:94 [inline] RIP: 0010:blk_mq_elv_switch_none block/blk-mq.c:4593 [inline] RIP: 0010:__blk_mq_update_nr_hw_queues block/blk-mq.c:4658 [inline] RIP: 0010:blk_mq_update_nr_hw_queues+0x304/0xe40 block/blk-mq.c:4709 RSP: 0018:ffffc90003cdfc08 EFLAGS: 00010206 RAX: 0000000000000000 RBX: dffffc0000000000 RCX: 0000000000000000 RDX: 000000000000001d RSI: 0000000000000002 RDI: 00000000000000e8 RBP: ffff88801dbd0000 R08: ffff888027c89398 R09: ffffffff8de2e517 R10: fffffbfff1bc5ca2 R11: 0000000000000000 R12: ffffc90003cdfc70 R13: ffff88801dbd0008 R14: ffff88801dbd03f8 R15: ffff888027c89380 FS: 0000555557259300(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000005d84c8 CR3: 000000007a7cb000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> nbd_start_device+0x153/0xc30 drivers/block/nbd.c:1355 nbd_start_device_ioctl drivers/block/nbd.c:1405 [inline] __nbd_ioctl drivers/block/nbd.c:1481 [inline] nbd_ioctl+0x5a1/0xbd0 drivers/block/nbd.c:1521 blkdev_ioctl+0x36e/0x800 block/ioctl.c:614 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Commit `dd6f7f17bf` moved '__elevator_get(qe->type)' before the assignment of 'qe->type', which leads to a wild pointer access. To solve the above issue, call '__elevator_get(qe->type)' after 'qe->type' is set. Reported-by: syzbot+746a4eece09f86bc39d7@syzkaller.appspotmail.com Fixes: dd6f7f17bf58 ("block: add proper helpers for elevator_type module refcount management") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221107033956.3276891-1-yebin@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-24 06:58:11 -07:00
Damien Le Moal	015d02f485	block: mq-deadline: Do not break sequential write streams to zoned HDDs mq-deadline ensures in-order dispatching of write requests to zoned block devices using a per-zone lock (a bit). This implies that for any purely sequential write workload, the drive is exercised most of the time at a maximum queue depth of one. However, when such a sequential write workload crosses a zone boundary (when sequentially writing multiple contiguous zones), zone write locking may prevent the last write to one zone from being issued (as the previous write is still being executed) but allow the first write to the following zone to be issued (as that zone is not yet being written and is not locked). This results in an out-of-order delivery of the sequential write commands to the device every time a zone boundary is crossed. While such behavior does not break the sequential write constraint of zoned block devices (and does not generate any write error), some zoned hard disks react badly to seeing these out-of-order writes, resulting in lower write throughput. This problem can be addressed by always dispatching the first request of a stream of sequential write requests, regardless of the zones targeted by these sequential writes. To do so, the function deadline_skip_seq_writes() is introduced and used in deadline_next_request() to select the next write command to issue if the target device is an HDD (blk_queue_nonrot() being false). deadline_fifo_request() is modified using the new deadline_earlier_request() and deadline_is_seq_write() helpers to ignore requests in the fifo list that have a preceding request in lba order that is sequential. With this fix, a sequential write workload executed with the following fio command: fio --name=seq-write --filename=/dev/sda --zonemode=zbd --direct=1 \ --size=68719476736 --ioengine=libaio --iodepth=32 --rw=write \ --bs=65536 results in an increase from 225 MB/s to 250 MB/s of the write throughput of an SMR HDD (11% increase). Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20221124021208.242541-3-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-24 06:29:36 -07:00
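The idea of locating the first request of a sequential write stream, as described in the commit above, can be illustrated with a small user-space sketch. It is not the mq-deadline implementation: `struct wreq`, `is_seq_write` and `seq_stream_head` are hypothetical names, and the requests are assumed to sit in a plain array sorted by LBA rather than in the scheduler's own data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified request: start sector and length in sectors. */
struct wreq {
	unsigned long long pos;
	unsigned int sectors;
};

/* True if rq starts exactly where prev ends, i.e. the pair is sequential. */
bool is_seq_write(const struct wreq *prev, const struct wreq *rq)
{
	return prev->pos + prev->sectors == rq->pos;
}

/* Given writes sorted by LBA, return the index of the first request of
 * the sequential stream that contains reqs[i]. */
size_t seq_stream_head(const struct wreq *reqs, size_t n, size_t i)
{
	assert(i < n);
	/* Walk backwards while the preceding request is contiguous. */
	while (i > 0 && is_seq_write(&reqs[i - 1], &reqs[i]))
		i--;
	return i;
}
```

For writes at sectors 0, 8 and 16 (8 sectors each) followed by writes at 100 and 108, the head of the stream containing the third request is index 0, and the head of the stream containing the last request is index 3.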
Damien Le Moal	2820e5d082	block: mq-deadline: Fix dd_finish_request() for zoned devices dd_finish_request() tests if the per prio fifo_list is not empty to determine if request dispatching must be restarted for handling blocked write requests to zoned devices with a call to blk_mq_sched_mark_restart_hctx(). While simple, this implementation has 2 problems: 1) Only the priority level of the completed request is considered. However, writes to a zone may be blocked due to other writes to the same zone using a different priority level. While this is unlikely to happen in practice, as writing a zone with different IO priorities does not make sense, nothing in the code prevents this from happening. 2) The use of list_empty() is dangerous as dd_finish_request() does not take dd->lock and may run concurrently with the insert and dispatch code. Fix these 2 problems by testing the write fifo list of all priority levels using the new helper dd_has_write_work(), and by testing each fifo list using list_empty_careful(). Fixes: `c807ab520f` ("block/mq-deadline: Add I/O priority support") Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20221124021208.242541-2-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-24 06:29:36 -07:00
Bart Van Assche	85168d416e	blk-crypto: Add a missing include directive Allow the compiler to verify consistency of function declarations and function definitions. This patch fixes the following sparse errors: block/blk-crypto-profile.c:241:14: error: no previous prototype for ‘blk_crypto_get_keyslot’ [-Werror=missing-prototypes] 241 \| blk_status_t blk_crypto_get_keyslot(struct blk_crypto_profile *profile, \| ^~~~~~~~~~~~~~~~~~~~~~ block/blk-crypto-profile.c:318:6: error: no previous prototype for ‘blk_crypto_put_keyslot’ [-Werror=missing-prototypes] 318 \| void blk_crypto_put_keyslot(struct blk_crypto_keyslot *slot) \| ^~~~~~~~~~~~~~~~~~~~~~ block/blk-crypto-profile.c:344:6: error: no previous prototype for ‘__blk_crypto_cfg_supported’ [-Werror=missing-prototypes] 344 \| bool __blk_crypto_cfg_supported(struct blk_crypto_profile *profile, \| ^~~~~~~~~~~~~~~~~~~~~~~~~~ block/blk-crypto-profile.c:373:5: error: no previous prototype for ‘__blk_crypto_evict_key’ [-Werror=missing-prototypes] 373 \| int __blk_crypto_evict_key(struct blk_crypto_profile *profile, \| ^~~~~~~~~~~~~~~~~~~~~~ Cc: Eric Biggers <ebiggers@google.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20221123172923.434339-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-23 10:38:54 -07:00
Jinlong Chen	4284354758	elevator: remove an outdated comment in elevator_change mq is no longer a special case. Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/cbf47824fc726440371e74c867bf635ae1b671a3.1669126766.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-23 06:48:20 -07:00
Jinlong Chen	f69b5e8f35	elevator: update the document of elevator_match elevator_match does not care about elevator_features any more. Remove related descriptions from its document. Fixes: `ffb86425ee` ("block: don't check for required features in elevator_match") Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/a58424555202c07a9ccf7f60c3ad7e247da09e25.1669126766.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-23 06:48:20 -07:00
Jinlong Chen	e0cca8bc9c	elevator: printk a warning if switching to a new io scheduler fails printk a warning to indicate that the io scheduler has been set to none if switching to a new io scheduler fails. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/d51ed0fb457db7a4f9cbb0dbce36d534e22be457.1669126766.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-23 06:48:08 -07:00
Jinlong Chen	ac1171bd2c	elevator: update the document of elevator_switch We no longer support falling back to the old io scheduler if switching to the new one fails. Update the document to indicate that. Fixes: `a1ce35fa49` ("block: remove dead elevator code") Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/94250961689ba7d2e67a7d9e7995a11166fedb31.1669126766.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-23 06:47:46 -07:00
Christoph Hellwig	22c17e279a	blk-mq: fix queue reference leak on blk_mq_alloc_disk_for_queue failure Drop the request queue reference just acquired when __alloc_disk_node failed. Fixes: `6f8191fdf4` ("block: simplify disk shutdown") Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20221122072753.426077-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-22 15:58:08 -07:00
Shin'ichiro Kawasaki	d4b2e0d433	block: fix missing nr_hw_queues update in blk_mq_realloc_tag_set_tags The commit `ee9d55210c` ("blk-mq: simplify blk_mq_realloc_tag_set_tags") cleaned up the function blk_mq_realloc_tag_set_tags. After this change, the function does not update nr_hw_queues of struct blk_mq_tag_set when new nr_hw_queues value is smaller than original. This results in failure of queue number change of block devices. To avoid the failure, add the missing nr_hw_queues update. Fixes: `ee9d55210c` ("blk-mq: simplify blk_mq_realloc_tag_set_tags") Reported-by: Chaitanya Kulkarni <chaitanyak@nvidia.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/linux-block/20221118140640.featvt3fxktfquwh@shindev/ Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221122084917.2034220-1-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-22 06:10:54 -07:00
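The missing-update bug described in the entry above can be illustrated with a small user-space model. This is only a sketch: `TagSet` and `realloc_tag_set_tags` below are hypothetical Python stand-ins named after the kernel symbols, not the kernel code, and the model assumes that the array is only reallocated when the queue count grows.

```python
class TagSet:
    """Toy stand-in for struct blk_mq_tag_set (only two fields modelled)."""

    def __init__(self, nr_hw_queues):
        self.nr_hw_queues = nr_hw_queues
        self.tags = [None] * nr_hw_queues


def realloc_tag_set_tags(tag_set, new_nr_hw_queues):
    """Grow the tags array if needed and always record the new count."""
    if new_nr_hw_queues > tag_set.nr_hw_queues:
        # Growing: allocate a larger array and copy the live entries over.
        grown = [None] * new_nr_hw_queues
        grown[:tag_set.nr_hw_queues] = tag_set.tags[:tag_set.nr_hw_queues]
        tag_set.tags = grown
    # The point of the fix: this assignment must also run on the
    # "no reallocation needed" (shrinking) path.
    tag_set.nr_hw_queues = new_nr_hw_queues
```

In this model, returning early on the shrinking path without the final assignment would leave `nr_hw_queues` at its old value, which is the kind of failure the commit message reports.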
Christoph Hellwig	3569788c08	blk-crypto: move internal only declarations to blk-crypto-internal.h blk_crypto_get_keyslot, blk_crypto_put_keyslot, __blk_crypto_evict_key and __blk_crypto_cfg_supported are only used internally by the blk-crypto code, so move them out of blk-crypto-profile.h, which is included by drivers that supply blk-crypto functionality. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20221114042944.1009870-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-21 11:39:05 -07:00
Christoph Hellwig	6715c98b6c	blk-crypto: add a blk_crypto_config_supported_natively helper Add a blk_crypto_config_supported_natively helper that wraps __blk_crypto_cfg_supported to retrieve the crypto_profile from the request queue. With this fscrypt can stop including blk-crypto-profile.h and rely on the public consumer interface in blk-crypto.h. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20221114042944.1009870-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-21 11:39:05 -07:00
Christoph Hellwig	fce3caea0f	blk-crypto: don't use struct request_queue for public interfaces Switch all public blk-crypto interfaces to use struct block_device arguments to specify the device they operate on instead of the request_queue, which is a block layer implementation detail. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20221114042944.1009870-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-21 11:39:05 -07:00
Linus Torvalds	f4408c3dfc	block-6.1-2022-11-18 Merge tag 'block-6.1-2022-11-18' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - NVMe pull request via Christoph: - Two more bogus nid quirks (Bean Huo, Tiago Dias Ferreira) - Memory leak fix in nvmet (Sagi Grimberg) - Regression fix for block cgroups pinning the wrong blkcg, causing leaks of cgroups and blkcgs (Chris) - UAF fix for drbd setup error handling (Dan) - Fix DMA alignment propagation in DM (Keith) * tag 'block-6.1-2022-11-18' of git://git.kernel.dk/linux: dm-log-writes: set dma_alignment limit in io_hints dm-integrity: set dma_alignment limit in io_hints block: make blk_set_default_limits() private dm-crypt: provide dma_alignment limit in io_hints block: make dma_alignment a stacking queue_limit nvmet: fix a memory leak in nvmet_auth_set_key nvme-pci: add NVME_QUIRK_BOGUS_NID for Netac NV7000 drbd: use after free in drbd_create_device() nvme-pci: add NVME_QUIRK_BOGUS_NID for Micron Nitro blk-cgroup: properly pin the parent in blkcg_css_online	2022-11-18 13:59:45 -08:00
Waiman Long	dae590a6c9	blk-cgroup: Flush stats at blkgs destruction path As noted by Michal, the blkg_iostat_set's in the lockless list hold reference to blkg's to protect against their removal. Those blkg's hold reference to blkcg. When a cgroup is being destroyed, cgroup_rstat_flush() is only called at css_release_work_fn() which is called when the blkcg reference count reaches 0. This circular dependency will prevent blkcg from being freed until some other events cause cgroup_rstat_flush() to be called to flush out the pending blkcg stats. To prevent this delayed blkcg removal, add a new cgroup_rstat_css_flush() function to flush stats for a given css and cpu and call it at the blkgs destruction path, blkcg_destroy_blkgs(), whenever there are still some pending stats to be flushed. This will ensure that blkcg reference count can reach 0 ASAP. Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20221105005902.407297-4-longman@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 16:58:44 -07:00
Waiman Long	3b8cc62987	blk-cgroup: Optimize blkcg_rstat_flush() For a system with many CPUs and block devices, the time to do blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It can be especially problematic as interrupts are disabled during the flush. It was reported that it might take seconds to complete in some extreme cases, leading to hard lockup messages. As it is likely that not all the percpu blkg_iostat_set's have been updated since the last flush, those stale blkg_iostat_set's don't need to be flushed in this case. This patch optimizes blkcg_rstat_flush() by keeping a lockless list of recently updated blkg_iostat_set's in a newly added percpu blkcg->lhead pointer. The blkg_iostat_set is added to a lockless list on the update side in blk_cgroup_bio_start(). It is removed from the lockless list when flushed in blkcg_rstat_flush(). Due to racing, it is possible that blkg_iostat_set's in the lockless list may have no new IO stats to be flushed, but that is OK. To protect against destruction of blkg, a percpu reference is taken when putting into the lockless list and put back when removed. When booting up an instrumented test kernel with this patch on a 2-socket 96-thread system with cgroup v2, out of the 2051 calls to cgroup_rstat_flush() after bootup, 1788 of the calls exited immediately because of an empty lockless list. After an all-cpu kernel build, the ratio became 6295424/6340513. That was more than 99%. Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20221105005902.407297-3-longman@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 16:58:44 -07:00
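The "only flush what was recently updated" idea in the entry above can be sketched as a single-threaded user-space model. This is an illustration under stated assumptions, not the kernel implementation: `StatSet` and `CpuStats` are hypothetical names, a plain Python list stands in for the lockless list, and the percpu reference counting is left out.

```python
class StatSet:
    """Toy stand-in for one per-CPU blkg_iostat_set."""

    def __init__(self):
        self.pending = 0     # bytes recorded since the last flush
        self.total = 0       # bytes already propagated by a flush
        self.queued = False  # True while the set sits on the update list


class CpuStats:
    """Per-CPU list of stat sets that changed since the last flush."""

    def __init__(self):
        self.updated = []

    def record_io(self, stat_set, nbytes):
        """Update side: account the I/O and queue the set once."""
        stat_set.pending += nbytes
        if not stat_set.queued:
            stat_set.queued = True
            self.updated.append(stat_set)

    def flush(self):
        """Flush side: walk only the queued sets, return how many."""
        batch, self.updated = self.updated, []
        for stat_set in batch:
            stat_set.queued = False
            stat_set.total += stat_set.pending
            stat_set.pending = 0
        return len(batch)
```

With this structure a flush on a CPU that has seen no I/O does no per-set work, which corresponds to the "exited immediately because of an empty lockless list" case in the commit message.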
Waiman Long	b5a9adcbd5	blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path For blkcg_css_alloc(), the only error that will be returned is -ENOMEM. Simplify error handling code by returning this error directly instead of setting an intermediate "ret" variable. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20221105005902.407297-2-longman@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 16:58:44 -07:00
Keith Busch	b3228254bb	block: make blk_set_default_limits() private There are no external users of this function. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221110184501.2451620-4-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 15:58:11 -07:00
Keith Busch	c964d62f5c	block: make dma_alignment a stacking queue_limit Device mappers had always been getting the default 511 dma mask, but the underlying device might have a larger alignment requirement. Since this value is used to determine allowable direct-io alignment, this needs to be a stackable limit. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221110184501.2451620-2-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 15:58:11 -07:00
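What "stacking" a DMA alignment limit means can be sketched as follows. This is a user-space model, not the kernel code: the function names are hypothetical, and it assumes that alignment masks have the form 2^n - 1 and that a stacked device takes the largest mask among its own default and those of its underlying devices.

```python
DEFAULT_DMA_ALIGNMENT = 511  # the default mask named in the commit message


def stack_dma_alignment(top_mask, bottom_mask):
    """A stacked device must honour the strictest (largest) mask."""
    return max(top_mask, bottom_mask)


def stacked_mask(bottom_masks):
    """Fold the masks of all underlying devices into one limit."""
    mask = DEFAULT_DMA_ALIGNMENT
    for bottom in bottom_masks:
        mask = stack_dma_alignment(mask, bottom)
    return mask


def is_aligned(addr, mask):
    """An address satisfies the limit when none of the mask bits are set."""
    return (addr & mask) == 0
```

In this model a mapped device on top of a member with mask 4095 reports 4095 rather than the default 511, so a buffer at offset 512 would be rejected for direct I/O instead of being passed down to a device that cannot handle it.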
Yu Kuai	077a403354	block: don't allow a disk link holder to itself After creating a dm device, then user can reload such dm with itself, and dead loop will be triggered because dm keep looking up to itself. Test procedures: 1) dmsetup create test --table "xxx sda", assume dm-0 is created 2) dmsetup suspend test 3) dmsetup reload test --table "xxx dm-0" 4) dmsetup resume test Test result: BUG: TASK stack guard page was hit at 00000000736a261f (stack is 000000008d12c88d..00000000c8dd82d5) stack guard page: 0000 [#1] PREEMPT SMP CPU: 29 PID: 946 Comm: systemd-udevd Not tainted 6.1.0-rc3-next-20221101-00006-g17640ca3b0ee #1295 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 RIP: 0010:dm_prepare_ioctl+0xf/0x1e0 Code: da 48 83 05 4a 7c 99 0b 01 41 89 c4 eb cd e8 b8 1f 40 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 48 83 05 a1 5a 99 0b 01 <41> 56 49 89 d6 41 55 4c 8d af 90 02 00 00 9 RSP: 0018:ffffc90002090000 EFLAGS: 00010206 RAX: ffff8881049d6800 RBX: ffff88817e589000 RCX: 0000000000000000 RDX: ffffc90002090010 RSI: ffffc9000209001c RDI: ffff88817e589000 RBP: 00000000484a101d R08: 0000000000000000 R09: 0000000000000007 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000005331 R13: 0000000000005331 R14: 0000000000000000 R15: 0000000000000000 FS: 00007fddf9609200(0000) GS:ffff889fbfd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffc9000208fff8 CR3: 0000000179043000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> dm_blk_ioctl+0x50/0x1c0 ? 
dm_prepare_ioctl+0xe0/0x1e0 dm_blk_ioctl+0x88/0x1c0 dm_blk_ioctl+0x88/0x1c0 ......(a lot of same lines) dm_blk_ioctl+0x88/0x1c0 dm_blk_ioctl+0x88/0x1c0 blkdev_ioctl+0x184/0x3e0 __x64_sys_ioctl+0xa3/0x110 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fddf7306577 Code: b3 66 90 48 8b 05 11 89 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e1 88 8 RSP: 002b:00007ffd0b2ec318 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00005634ef478320 RCX: 00007fddf7306577 RDX: 0000000000000000 RSI: 0000000000005331 RDI: 0000000000000007 RBP: 0000000000000007 R08: 00005634ef4843e0 R09: 0000000000000080 R10: 00007fddf75cfb38 R11: 0000000000000246 R12: 00000000030d4000 R13: 0000000000000000 R14: 0000000000000000 R15: 00005634ef48b800 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:dm_prepare_ioctl+0xf/0x1e0 Code: da 48 83 05 4a 7c 99 0b 01 41 89 c4 eb cd e8 b8 1f 40 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 48 83 05 a1 5a 99 0b 01 <41> 56 49 89 d6 41 55 4c 8d af 90 02 00 00 9 RSP: 0018:ffffc90002090000 EFLAGS: 00010206 RAX: ffff8881049d6800 RBX: ffff88817e589000 RCX: 0000000000000000 RDX: ffffc90002090010 RSI: ffffc9000209001c RDI: ffff88817e589000 RBP: 00000000484a101d R08: 0000000000000000 R09: 0000000000000007 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000005331 R13: 0000000000005331 R14: 0000000000000000 R15: 0000000000000000 FS: 00007fddf9609200(0000) GS:ffff889fbfd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffc9000208fff8 CR3: 0000000179043000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception in interrupt 
]--- Fix the problem by forbidding a disk from creating a link to itself. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221115141054.1051801-11-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 15:19:56 -07:00
Yu Kuai	3b3449c1e6	block: store the holder kobject in bd_holder_disk We hold a reference to the holder kobject for each bd_holder_disk, so to make the code a bit more robust, use a reference to it instead of the block_device. As long as no one clears ->bd_holder_dir before freeing the disk, this isn't strictly required, but it does make the code clearer and more robust. Orignally-From: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221115141054.1051801-10-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 15:19:56 -07:00
Yu Kuai	62f535e1f0	block: fix use after free for bd_holder_dir Currently, the caller of bd_link_disk_holder() gets 'bdev' by blkdev_get_by_dev(), which will look up 'bdev' by inode number 'dev'. However, it's possible that del_gendisk() is called concurrently, and 'bd_holder_dir' can be freed before bd_link_disk_holder() accesses it, thus triggering a use after free. t1: t2: bdev = blkdev_get_by_dev del_gendisk kobject_put(bd_holder_dir) kobject_free() bd_link_disk_holder Fix the problem by checking that the disk is still live and grabbing a reference to 'bd_holder_dir' first in bd_link_disk_holder(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221115141054.1051801-9-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 15:19:56 -07:00
Christoph Hellwig	7abc077788	block: remove delayed holder registration Now that dm has been fixed to keep track of holder registrations before add_disk, the somewhat buggy block layer code can be safely removed. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20221115141054.1051801-8-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 15:19:56 -07:00
Christoph Hellwig	d90db3b1c8	block: clear ->slave_dir when dropping the main slave_dir reference Zero out the pointer to ->slave_dir so that the holder code doesn't incorrectly treat the object as alive when add_disk failed or after del_gendisk was called. Fixes: `89f871af1b` ("dm: delay registering the gendisk") Reported-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221115141054.1051801-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 15:19:53 -07:00
Christoph Hellwig	470373e888	block: remove blkdev_writepages While the block device code should switch to implementing ->writepages instead of ->writepage eventually, the current implementation is entirely pointless as it does the same looping over ->writepage as the generic code if no ->writepages is present. Remove blkdev_writepages so that we can eventually unexport generic_writepages. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221116132035.2192924-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 11:32:53 -07:00
Pavel Begunkov	42b2b2fb6e	bio: shrink max number of pcpu cached bios The downside of the bio pcpu cache is that the bios of a cpu will never be freed unless new I/O is issued from that cpu. We currently keep at most 512 bios, which feels like too much, so halve it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bc198e8efb27d8c740d80c8ce477432729075096.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 09:44:26 -07:00
Pavel Begunkov	b99182c501	bio: add pcpu caching for non-polling bio_put This patch extends REQ_ALLOC_CACHE to IRQ completions, whereas currently it's limited to iopoll. Instead of guarding the list with irq toggling on alloc, which is expensive, it keeps an additional irq-safe list from which bios are spliced in batches to amortise overhead. On the put side it toggles irqs, but in many cases they're already disabled and so cheap. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c2306de96b900ab9264f4428ec37768ddcf0da36.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 09:44:26 -07:00
Pavel Begunkov	f25cf75a45	bio: split pcpu cache part of bio_put into a helper Extract a helper out of bio_put for recycling into percpu caches. It's a preparation patch without functional changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e97ab2026a89098ee1bfdd09bcb9451fced95f87.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 09:44:26 -07:00
Pavel Begunkov	759aa12f19	bio: don't rob starving biosets of bios Biosets keep a mempool, so as long as requests complete we can always allocate and have forward progress. Percpu bio caches break that assumption, as we may complete into the cache of one CPU and afterwards try and fail to allocate with another CPU. We also can't grab from another CPU's cache without tricky sync. If we're allocating a bio while the mempool is undersaturated, remove the REQ_ALLOC_CACHE flag, so on put it will go straight to the mempool. It might try to free more requests into the mempool than required, but assuming that there is no memory starvation in the system it'll stabilise and never hit that path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/aa150caf9c263fa92269e86d7826cc8fa65f38de.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 09:44:26 -07:00
Chris Mason	d7dbd43f4a	blk-cgroup: properly pin the parent in blkcg_css_online blkcg_css_online is supposed to pin the blkcg of the parent, but `397c9f46ee` refactored things and along the way, changed it to pin the css instead. This results in extra pins, and we end up leaking blkcgs and cgroups. Fixes: `397c9f46ee` ("blk-cgroup: move blkcg_{pin,unpin}_online out of line") Signed-off-by: Chris Mason <clm@fb.com> Spotted-by: Rik van Riel <riel@surriel.com> Cc: <stable@vger.kernel.org> # v5.19+ Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20221114181930.2093706-1-clm@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-14 12:13:19 -07:00
Linus Torvalds	b0b6e2c9d3	block-6.1-2022-11-11 Merge tag 'block-6.1-2022-11-11' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - NVMe pull request via Christoph: - Quiet user passthrough command errors (Keith Busch) - Fix memory leak in nvmet_subsys_attr_model_store_locked - Fix a memory leak in nvmet-auth (Sagi Grimberg) - Fix a potential NULL pointer deref in bfq (Yu) - Allocate command/response buffers separately for DMA for sed-opal, rather than rely on embedded alignment (Serge) * tag 'block-6.1-2022-11-11' of git://git.kernel.dk/linux: nvmet: fix a memory leak nvmet: fix memory leak in nvmet_subsys_attr_model_store_locked nvme: quiet user passthrough command errors block: sed-opal: kmalloc the cmd/resp buffers block, bfq: fix null pointer dereference in bfq_bio_bfqg()	2022-11-11 14:08:30 -08:00
Christoph Hellwig	ee9d55210c	blk-mq: simplify blk_mq_realloc_tag_set_tags Use set->nr_hw_queues for the current number of tags, and remove the duplicate set->nr_hw_queues update in the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221109100811.2413423-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-10 11:18:09 -07:00
Christoph Hellwig	5ee20298ff	blk-mq: remove blk_mq_alloc_tag_set_tags There is no point in trying to share any code with the realloc case when all that is needed by the initial tagset allocation is a simple kcalloc_node. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221109100811.2413423-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-10 11:18:09 -07:00
Khazhismel Kumykov	99771d73ff	bfq: ignore oom_bfqq in bfq_check_waker oom_bfqq is just a fallback bfqq, so shouldn't be used with waker detection. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221108181030.1611703-2-khazhy@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-09 12:42:26 -07:00
Khazhismel Kumykov	a1795c2ccb	bfq: fix waker_bfqq inconsistency crash This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL, but woken_list_node still being hashed. This would happen when bfq_init_rq() expects a brand new allocated queue to be returned from bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq without resetting woken_list_node. Since we can always return oom_bfqq when attempting to allocate, we cannot assume waker_bfqq starts as NULL. Avoid setting woken_bfqq for oom_bfqq entirely, as it's not useful. Crashes would have a stacktrace like: [160595.656560] bfq_add_bfqq_busy+0x110/0x1ec [160595.661142] bfq_add_request+0x6bc/0x980 [160595.666602] bfq_insert_request+0x8ec/0x1240 [160595.671762] bfq_insert_requests+0x58/0x9c [160595.676420] blk_mq_sched_insert_request+0x11c/0x198 [160595.682107] blk_mq_submit_bio+0x270/0x62c [160595.686759] __submit_bio_noacct_mq+0xec/0x178 [160595.691926] submit_bio+0x120/0x184 [160595.695990] ext4_mpage_readpages+0x77c/0x7c8 [160595.701026] ext4_readpage+0x60/0xb0 [160595.705158] filemap_read_page+0x54/0x114 [160595.711961] filemap_fault+0x228/0x5f4 [160595.716272] do_read_fault+0xe0/0x1f0 [160595.720487] do_fault+0x40/0x1c8 Tested by injecting random failures into bfq_get_queue, crashes go away completely. Fixes: `8ef3fc3a04` ("block, bfq: make shared queues inherit wakers") Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221108181030.1611703-1-khazhy@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-09 12:42:26 -07:00
Logan Gunthorpe	7ee4ccf574	block: set FOLL_PCI_P2PDMA in bio_map_user_iov() When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be passed from userspace and enables the NVMe passthru requests to use P2PDMA pages. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221021174116.7200-8-logang@deltatee.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-09 11:29:21 -07:00
Logan Gunthorpe	5e3e3f2e15	block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages() When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be passed from userspace and enables the O_DIRECT path in iomap based filesystems and direct to block devices. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221021174116.7200-7-logang@deltatee.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-09 11:29:21 -07:00
Logan Gunthorpe	49580e6907	block: add check when merging zone device pages Consecutive zone device pages should not be merged into the same sgl or bvec segment with other types of pages or if they belong to different pgmaps. Otherwise getting the pgmap of a given segment is not possible without scanning the entire segment. This helper returns true either if both pages are not zone device pages or both pages are zone device pages with the same pgmap. Add a helper to determine if zone device pages are mergeable and use this helper in page_is_mergeable(). Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221021174116.7200-5-logang@deltatee.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-09 11:29:21 -07:00
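The merge rule stated in the entry above (two pages are mergeable if neither is a zone device page, or if both are and they share a pgmap) can be written down as a small predicate. This is a user-space sketch only: `Page` and `zone_device_pages_mergeable` are hypothetical Python names, and a page's pgmap is modelled as an optional string where `None` means "not a zone device page".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Page:
    """Toy page: pgmap is None for ordinary (non zone device) pages."""
    pgmap: Optional[str] = None


def zone_device_pages_mergeable(a: Page, b: Page) -> bool:
    """True if both pages are ordinary, or both share the same pgmap."""
    a_is_zone_device = a.pgmap is not None
    b_is_zone_device = b.pgmap is not None
    if a_is_zone_device != b_is_zone_device:
        return False  # never mix zone device pages with ordinary pages
    if not a_is_zone_device:
        return True   # two ordinary pages
    return a.pgmap == b.pgmap
```

Keeping every segment homogeneous in this sense is what lets the pgmap of a segment be read from any one of its pages, instead of scanning the whole segment.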
Serge Semin	f829230dd5	block: sed-opal: kmalloc the cmd/resp buffers In accordance with [1], DMA-able memory buffers must be cacheline-aligned; otherwise the cache write-back and invalidation performed during the mapping may cause adjacent data to be lost. This is specifically required for DMA-noncoherent platforms [2]. Since the opal_dev.{cmd,resp} buffers are implicitly used for DMA in the NVMe and SCSI/SD drivers, by the nvme_sec_submit() and sd_sec_submit() methods respectively, they must be cacheline-aligned to prevent this problem. One option to guarantee that is to kmalloc the buffers [2]. Let's explicitly allocate them instead of embedding them into the opal_dev structure instance. Note this fix was inspired by the commit `c94b7f9bab` ("nvme-hwmon: kmalloc the NVME SMART log buffer"). [1] Documentation/core-api/dma-api.rst [2] Documentation/core-api/dma-api-howto.rst Fixes: `455a7b238c` ("block: Add Sed-opal library") Signed-off-by: Serge Semin <Sergey.Semin@baikalelectronics.ru> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221107203944.31686-1-Sergey.Semin@baikalelectronics.ru Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-08 07:14:35 -07:00
Yu Kuai	f02be9002c	block, bfq: fix null pointer dereference in bfq_bio_bfqg() Out test found a following problem in kernel 5.10, and the same problem should exist in mainline: BUG: kernel NULL pointer dereference, address: 0000000000000094 PGD 0 P4D 0 Oops: 0000 [#1] SMP CPU: 7 PID: 155 Comm: kworker/7:1 Not tainted 5.10.0-01932-g19e0ace2ca1d-dirty 4 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-b4 Workqueue: kthrotld blk_throtl_dispatch_work_fn RIP: 0010:bfq_bio_bfqg+0x52/0xc0 Code: 94 00 00 00 00 75 2e 48 8b 40 30 48 83 05 35 06 c8 0b 01 48 85 c0 74 3d 4b RSP: 0018:ffffc90001a1fba0 EFLAGS: 00010002 RAX: ffff888100d60400 RBX: ffff8881132e7000 RCX: 0000000000000000 RDX: 0000000000000017 RSI: ffff888103580a18 RDI: ffff888103580a18 RBP: ffff8881132e7000 R08: 0000000000000000 R09: ffffc90001a1fe10 R10: 0000000000000a20 R11: 0000000000034320 R12: 0000000000000000 R13: ffff888103580a18 R14: ffff888114447000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88881fdc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000094 CR3: 0000000100cdb000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: bfq_bic_update_cgroup+0x3c/0x350 ? ioc_create_icq+0x42/0x270 bfq_init_rq+0xfd/0x1060 bfq_insert_requests+0x20f/0x1cc0 ? ioc_create_icq+0x122/0x270 blk_mq_sched_insert_requests+0x86/0x1d0 blk_mq_flush_plug_list+0x193/0x2a0 blk_flush_plug_list+0x127/0x170 blk_finish_plug+0x31/0x50 blk_throtl_dispatch_work_fn+0x151/0x190 process_one_work+0x27c/0x5f0 worker_thread+0x28b/0x6b0 ? rescuer_thread+0x590/0x590 kthread+0x153/0x1b0 ? 
kthread_flush_work+0x170/0x170 ret_from_fork+0x1f/0x30 Modules linked in: CR2: 0000000000000094 ---[ end trace e2e59ac014314547 ]--- RIP: 0010:bfq_bio_bfqg+0x52/0xc0 Code: 94 00 00 00 00 75 2e 48 8b 40 30 48 83 05 35 06 c8 0b 01 48 85 c0 74 3d 4b RSP: 0018:ffffc90001a1fba0 EFLAGS: 00010002 RAX: ffff888100d60400 RBX: ffff8881132e7000 RCX: 0000000000000000 RDX: 0000000000000017 RSI: ffff888103580a18 RDI: ffff888103580a18 RBP: ffff8881132e7000 R08: 0000000000000000 R09: ffffc90001a1fe10 R10: 0000000000000a20 R11: 0000000000034320 R12: 0000000000000000 R13: ffff888103580a18 R14: ffff888114447000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88881fdc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000094 CR3: 0000000100cdb000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Root cause is quite complex: 1) use bfq elevator for the test device. 2) create a cgroup CG 3) config blk throtl in CG blkg_conf_prep blkg_create 4) create a thread T1 and issue async io in CG: bio_init bio_associate_blkg ... 
submit_bio submit_bio_noacct blk_throtl_bio -> io is throttled // io submit is done 5) switch elevator: bfq_exit_queue blkcg_deactivate_policy list_for_each_entry(blkg, &q->blkg_list, q_node) blkg->pd[] = NULL // bfq policy is removed 6) thread T1 exits, then remove the cgroup CG: blkcg_unpin_online blkcg_destroy_blkgs blkg_destroy list_del_init(&blkg->q_node) // blkg is removed from queue list 7) switch elevator back to bfq bfq_init_queue bfq_create_group_hierarchy blkcg_activate_policy list_for_each_entry_reverse(blkg, &q->blkg_list) // blkg is removed from list, hence bfq policy is still NULL 8) throttled io is dispatched to bfq: bfq_insert_requests bfq_init_rq bfq_bic_update_cgroup bfq_bio_bfqg bfqg = blkg_to_bfqg(blkg) // bfqg is NULL because bfq policy is NULL The problem is only possible in bfq because only bfq can be deactivated and activated while the queue is online, while others can only be deactivated when the device is removed. Fix the problem in bfq by checking if blkg is online before calling blkg_to_bfqg(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221108103434.2853269-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-08 07:13:25 -07:00
Yang Li	5b2560c4c2	block: Fix some kernel-doc comments Remove the description of @required_features in elevator_match() to clear the below warning: block/elevator.c:103: warning: Excess function parameter 'required_features' description in 'elevator_match' Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2734 Fixes: `ffb86425ee` ("block: don't check for required features in elevator_match") Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/20221107062255.2685-1-yang.lee@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-07 07:21:45 -07:00
Linus Torvalds	4869f5750a	block-6.1-2022-11-05 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmNmdGEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvLpD/9pL9SLpoUAnSvYAzaJC0dJhFHzQhmQgA55 qxwgC4NDyxYgvsLuHVVoR5qRSNQO37nKkgoeHsqSX56UTloQnggurg0Cr94VJcjC seT1++Dl6BPz1M9h/UFS2cwm26GC+cQmsLIQoACSi0lNEOLytPP/emq6Vuqz0udx ah1ACXebiHe07A8Kvpt7orHlpM/dKH0/4g5/7h0E5RWrC9yg1WEOHPjd/MQ5amy0 9YkhtqM5OQfNsVY0DcRbgRPr115xSi/L6No3Q6pMAVqzM7ZRk3iD039be7Sooqn8 sl54gZB3AWGzgrFnJLjKCcQg4qg/wyYhZXEuV2JdzYeXCBK6RMcV0I2hP6vWP7Au dqlw5khvQOwx32qYNlXHU7g/ve5qY7hblIHbyqtKjQicIQ8LP18Ek1QWQcywiK4E hyYJ/3gYRjVqigyw32++cMSRLbLktiY38+J7NxujIj6J1aOYosCA5kIxTSa11tLG VGeXny5CS5l0zrl3irGBRI1Qi33T0hnbmf99v+MndFhRfsYAF8tKwuJyI+d+rJvj S8grDzsmlzwe1INXEbnMEg+SsHOPe5On0bzYIYX9Oi0BSsZf1i4u7SdDb9tu2Tiw WSJyYBNGCsl7wFSoLmY75j1OvWY/iXYPKqZ8bt9STQbO9vL+VHksFzcnnkvzrBG6 Zs1uD17jwQ== =JQlF -----END PGP SIGNATURE----- Merge tag 'block-6.1-2022-11-05' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - Fixes for the ublk driver (Ming) - Fixes for error handling memory leaks (Chen Jun, Chen Zhongjin) - Explicitly clear the last request in a chain when the plug is flushed, as it may have already been issued (Al) * tag 'block-6.1-2022-11-05' of git://git.kernel.dk/linux: block: blk_add_rq_to_plug(): clear stale 'last' after flush blk-mq: Fix kmemleak in blk_mq_init_allocated_queue block: Fix possible memory leak for rq_wb on add_disk failure ublk_drv: add ublk_queue_cmd() for cleanup ublk_drv: avoid to touch io_uring cmd in blk_mq io path ublk_drv: comment on ublk_driver entry of Kconfig ublk_drv: return flag of UBLK_F_URING_CMD_COMP_IN_TASK in case of module	2022-11-05 09:02:28 -07:00
Jinlong Chen	4046728253	blk-mq: use if-else instead of goto in blk_mq_alloc_cached_request() if-else is more readable than goto here. Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn> Link: https://lore.kernel.org/r/d3306fa4e92dc9cc614edc8f1802686096bafef2.1667356813.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-02 08:36:50 -06:00
Jinlong Chen	7edfd68165	blk-mq: improve error handling in blk_mq_alloc_rq_map() Use goto-style error handling like we do elsewhere in the kernel. Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn> Link: https://lore.kernel.org/r/bbbc2d9b17b137798c7fb92042141ca4cbbc58cc.1667356813.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-02 08:36:50 -06:00
Chao Leng	414dd48e88	blk-mq: add tagset quiesce interface Drivers that have shared tagsets may need to quiesce potentially a lot of request queues that all share a single tagset (e.g. nvme). Add an interface to quiesce all the queues on a given tagset. This interface is useful because it can speed up the quiesce by doing it in parallel. Because some queues should not need to be quiesced (e.g. the nvme connect_q) when quiescing the tagset, introduce a QUEUE_FLAG_SKIP_TAGSET_QUIESCE flag to allow this new interface to skip quiescing a particular queue. Signed-off-by: Chao Leng <lengchao@huawei.com> [hch: simplify for the per-tag_set srcu_struct] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-02 08:35:34 -06:00
Christoph Hellwig	483239c75b	blk-mq: pass a tagset to blk_mq_wait_quiesce_done Nothing in blk_mq_wait_quiesce_done needs the request_queue now, so just pass the tagset, and move the non-mq check into the only caller that needs it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-02 08:35:34 -06:00
Christoph Hellwig	80bd4a7aab	blk-mq: move the srcu_struct used for quiescing to the tagset All I/O submissions have fairly similar latencies, and a tagset-wide quiesce is a fairly common operation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-12-hch@lst.de [axboe: fix whitespace] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-02 08:35:34 -06:00
Christoph Hellwig	8537380bb9	blk-mq: skip non-mq queues in blk_mq_quiesce_queue For submit_bio based queues there is no (S)RCU critical section during I/O submission and thus nothing to wait for in blk_mq_wait_quiesce_done, so skip doing any synchronization. No non-mq driver should be calling this, but for now we have core callers that unconditionally call into it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-02 08:35:34 -06:00
Christoph Hellwig	71b26083d5	block: set the disk capacity to 0 in blk_mark_disk_dead nvme and xen-blkfront are already doing this to stop buffered writes from creating dirty pages that can't be written out later. Move it to the common code. This also removes the comment about the ordering from nvme, as bd_mutex not only is gone entirely, but also hasn't been used for locking updates to the disk size long before that, and thus the ordering requirement documented there doesn't apply any more. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-02 08:32:31 -06:00
Yu Kuai	aa625117d6	block, bfq: don't declare 'bfqd' as type 'void *' in bfq_group Prevent unnecessary format conversion for bfqg->bfqd in multiple places. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@unimore.it> Link: https://lore.kernel.org/r/20221102022542.3621219-6-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 20:10:55 -06:00
Yu Kuai	918fdea388	block, bfq: remove dead code for updating 'rq_in_driver' Such code is not even compiled since it is inside an "#if 0" block. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@unimore.it> Link: https://lore.kernel.org/r/20221102022542.3621219-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 20:10:55 -06:00
Yu Kuai	f6fd119b1a	block, bfq: cleanup bfq_activate_requeue_entity() Just make the code a little cleaner by removing the unnecessary variable 'sd'. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@unimore.it> Link: https://lore.kernel.org/r/20221102022542.3621219-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 20:10:54 -06:00
Yu Kuai	e5c63eb4b5	block, bfq: factor out code to update 'active_entities' Current code is a bit ugly and hard to read. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@unimore.it> Link: https://lore.kernel.org/r/20221102022542.3621219-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 20:10:54 -06:00
Yu Kuai	060d9217d3	block, bfq: remove set but not used variable in __bfq_entity_update_weight_prio After the patch "block, bfq: cleanup bfq_weights_tree add/remove apis", the local variable 'bfqd' is not used anymore, thus remove it. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20221102022542.3621219-2-yukuai1@huaweicloud.com Fixes: `afdba14612` ("block, bfq: cleanup bfq_weights_tree add/remove apis") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 20:10:48 -06:00
Kemeng Shi	dc572f418a	block: Replace struct rq_depth with unsigned int in struct iolatency_grp We only need a max queue depth for every iolatency to limit the inflight io number. Replace struct rq_depth with unsigned int to simplify "struct iolatency_grp" and save memory. Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20221018111240.22612-4-shikemeng@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 09:12:24 -06:00
Kemeng Shi	6891f96898	block: Correct comment for scale_cookie_change Default queue depth of iolatency_grp is unlimited, so we scale down quickly (once by half) in scale_cookie_change. Remove the "subtract 1/16th" part, which is not accurate, and describe the actual way we scale down. Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221018111240.22612-3-shikemeng@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 09:12:24 -06:00
Kemeng Shi	db5896e9cf	block: Remove redundant parent blkcg_gp check in check_scale_change Function blkcg_iolatency_throttle makes sure blkg->parent is not NULL before calling check_scale_change, and check_scale_change is only called from blkcg_iolatency_throttle. Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20221018111240.22612-2-shikemeng@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 09:12:24 -06:00
Christoph Hellwig	64b36075eb	block: split elevator_switch Split an elevator_disable helper from elevator_switch for the case where we want to switch to no scheduler at all. This includes removing the pointless elevator_switch_mq helper and removing the switch to no schedule logic from blk_mq_init_sched. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221030100714.876891-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 09:12:24 -06:00
Christoph Hellwig	ffb86425ee	block: don't check for required features in elevator_match Checking for the required features in the callers simplifies the code quite a bit, so do that. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221030100714.876891-7-hch@lst.de [axboe: adjust for dropping patch 1, use __elevator_find()] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 09:12:05 -06:00
Christoph Hellwig	2eef17a209	block: simplify the check for the current elevator in elv_iosched_show Just compare the pointers instead of using the string based elevator_match. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221030100714.876891-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 08:02:47 -06:00
Christoph Hellwig	16095af2fa	block: cleanup the variable naming in elv_iosched_store Use eq for the elevator_queue as done elsewhere. This frees e to be used for the loop iterator instead of the odd __ prefix. In addition rename elv to cur to make it more clear it is the currently selected elevator. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221030100714.876891-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 08:02:47 -06:00
Christoph Hellwig	aae2a643f5	block: exit elv_iosched_show early when I/O schedulers are not supported If the tag_set has BLK_MQ_F_NO_SCHED flag set we will never show any scheduler, so exit early. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221030100714.876891-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 08:02:47 -06:00
Christoph Hellwig	81eaca442e	block: cleanup elevator_get Do the request_module and repeated lookup in the only caller that cares, pick a saner name that explains that we are actually doing a lookup, and use a sane calling convention that passes the queue first. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221030100714.876891-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 08:02:47 -06:00
Yu Kuai	eb5bca7365	block, bfq: cleanup __bfq_weights_tree_remove() It's the same as bfq_weights_tree_remove() now. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220916071942.214222-7-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 07:09:44 -06:00
Yu Kuai	afdba14612	block, bfq: cleanup bfq_weights_tree add/remove apis The 'bfq_data' and 'rb_root_cached' can both be accessed through 'bfq_queue', thus only pass 'bfq_queue' as parameter. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220916071942.214222-6-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 07:09:44 -06:00
Yu Kuai	eed3ecc991	block, bfq: do not idle if only one group is activated Now that root group is counted into 'num_groups_with_pending_reqs', 'num_groups_with_pending_reqs > 0' is always true in bfq_asymmetric_scenario(). Thus change the condition to '> 1'. On the other hand, this change can enable concurrent sync io if only one group is activated. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220916071942.214222-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 07:09:44 -06:00
Yu Kuai	71f8ca77cb	block, bfq: refactor the counting of 'num_groups_with_pending_reqs' Currently, bfq can't handle sync io concurrently as long as it is not issued from the root group. This is because 'bfqd->num_groups_with_pending_reqs > 0' is always true in bfq_asymmetric_scenario(). The way that bfqg is counted into 'num_groups_with_pending_reqs': Before this patch: 1) root group will never be counted. 2) Count if bfqg or its child bfqgs have pending requests. 3) Don't count if bfqg and its child bfqgs complete all the requests. After this patch: 1) root group is counted. 2) Count if bfqg has pending requests. 3) Don't count if bfqg completes all the requests. With this change, the case where only one group is activated can be detected, and the next patch will support concurrent sync io in that case. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220916071942.214222-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 07:09:44 -06:00
Yu Kuai	60a6e10c53	block, bfq: record how many queues have pending requests Prepare to refactor the counting of 'num_groups_with_pending_reqs'. Add a counter in bfq_group, update it while tracking if bfqq has pending requests and when bfq_bfqq_move() is called. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220916071942.214222-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 07:09:44 -06:00
Yu Kuai	3d89bd12d3	block, bfq: support to track if bfqq has pending requests If entity belongs to bfqq, then entity->in_groups_with_pending_reqs is not used currently. This patch uses it to track if bfqq has pending requests through callers of weights_tree insertion and removal. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220916071942.214222-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-01 07:09:44 -06:00
Al Viro	878eb6e48f	block: blk_add_rq_to_plug(): clear stale 'last' after flush blk_mq_flush_plug_list() empties ->mq_list and request we'd peeked there before that call is gone; in any case, we are not dealing with a mix of requests for different queues now - there's no requests left in the plug. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-31 20:21:38 -06:00
Chen Jun	943f45b939	blk-mq: Fix kmemleak in blk_mq_init_allocated_queue There is a kmemleak caused by modprobe null_blk.ko unreferenced object 0xffff8881acb1f000 (size 1024): comm "modprobe", pid 836, jiffies 4294971190 (age 27.068s) hex dump (first 32 bytes): 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N.......... ff ff ff ff ff ff ff ff 00 53 99 9e ff ff ff ff .........S...... backtrace: [<000000004a10c249>] kmalloc_node_trace+0x22/0x60 [<00000000648f7950>] blk_mq_alloc_and_init_hctx+0x289/0x350 [<00000000af06de0e>] blk_mq_realloc_hw_ctxs+0x2fe/0x3d0 [<00000000e00c1872>] blk_mq_init_allocated_queue+0x48c/0x1440 [<00000000d16b4e68>] __blk_mq_alloc_disk+0xc8/0x1c0 [<00000000d10c98c3>] 0xffffffffc450d69d [<00000000b9299f48>] 0xffffffffc4538392 [<0000000061c39ed6>] do_one_initcall+0xd0/0x4f0 [<00000000b389383b>] do_init_module+0x1a4/0x680 [<0000000087cf3542>] load_module+0x6249/0x7110 [<00000000beba61b8>] __do_sys_finit_module+0x140/0x200 [<00000000fdcfff51>] do_syscall_64+0x35/0x80 [<000000003c0f1f71>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 That is because q->mq_ops is set to NULL before blk_release_queue is called. blk_mq_init_queue_data blk_mq_init_allocated_queue blk_mq_realloc_hw_ctxs for (i = 0; i < set->nr_hw_queues; i++) { old_hctx = xa_load(&q->hctx_table, i); if (!blk_mq_alloc_and_init_hctx(.., i, ..)) [1] if (!old_hctx) break; xa_for_each_start(&q->hctx_table, j, hctx, j) blk_mq_exit_hctx(q, set, hctx, j); [2] if (!q->nr_hw_queues) [3] goto err_hctxs; err_exit: q->mq_ops = NULL; [4] blk_put_queue blk_release_queue if (queue_is_mq(q)) [5] blk_mq_release(q); [1]: blk_mq_alloc_and_init_hctx failed at i != 0. [2]: The hctxs allocated by [1] are moved to q->unused_hctx_list and will be cleaned up in blk_mq_release. [3]: q->nr_hw_queues is 0. [4]: Set q->mq_ops to NULL. [5]: queue_is_mq returns false due to [4]. And blk_mq_release will not be called. The hctxs in q->unused_hctx_list are leaked. To fix it, call blk_release_queue in the exception path. 
Fixes: `2f8f1336a4` ("blk-mq: always free hctx after request queue is freed") Signed-off-by: Yuan Can <yuancan@huawei.com> Signed-off-by: Chen Jun <chenjun102@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20221031031242.94107-1-chenjun102@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-31 08:30:47 -06:00
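The ordering problem in steps [4] and [5] of the entry above can be reproduced in a tiny user-space model. This is a sketch only: the `toy_*` types and functions are hypothetical, the "hctx list" is reduced to a counter, and the "fixed" ordering shown is just one arrangement that avoids the leak in this model, not a claim about what the actual patch does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct toy_ops {
	int unused;
};

struct toy_queue {
	const struct toy_ops *mq_ops;	/* non-NULL means "this is an mq queue" */
	int unused_hctxs;		/* hctxs parked for release-time cleanup */
};

static bool toy_queue_is_mq(const struct toy_queue *q)
{
	return q->mq_ops != NULL;
}

/*
 * Release path: parked hctxs are only freed for mq queues.
 * Returns the number of hctxs left behind, i.e. leaked.
 */
static int toy_release_queue(struct toy_queue *q)
{
	if (toy_queue_is_mq(q))
		q->unused_hctxs = 0;
	return q->unused_hctxs;
}

/* Error path in the order described by steps [4] and [5]. */
static int toy_leak_buggy(int parked_hctxs)
{
	static const struct toy_ops ops = { 0 };
	struct toy_queue q = { .mq_ops = &ops, .unused_hctxs = parked_hctxs };

	q.mq_ops = NULL;		/* [4] queue no longer looks like mq */
	return toy_release_queue(&q);	/* [5] cleanup is skipped */
}

/* Same error path, but cleanup runs while mq_ops is still set. */
static int toy_leak_fixed(int parked_hctxs)
{
	static const struct toy_ops ops = { 0 };
	struct toy_queue q = { .mq_ops = &ops, .unused_hctxs = parked_hctxs };
	int leaked = toy_release_queue(&q);

	q.mq_ops = NULL;
	return leaked;
}
```

With three parked hctxs, the first ordering reports three leaked entries and the second reports none.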
Jinlong Chen	56c1ee9224	blk-mq: remove redundant call to blk_freeze_queue_start in blk_mq_destroy_queue The calling relationship in blk_mq_destroy_queue() is as follows: blk_mq_destroy_queue() ... -> blk_queue_start_drain() -> blk_freeze_queue_start() <- called ... -> blk_freeze_queue() -> blk_freeze_queue_start() <- called again -> blk_mq_freeze_queue_wait() ... So there is a redundant call to blk_freeze_queue_start(). Replace blk_freeze_queue() with blk_mq_freeze_queue_wait() to avoid the redundant call. Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221030083212.1251255-1-nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-31 07:30:42 -06:00
Chen Zhongjin	fa81cbafbf	block: Fix possible memory leak for rq_wb on add_disk failure kmemleak reported memory leaks in device_add_disk(): kmemleak: 3 new suspected memory leaks unreferenced object 0xffff88800f420800 (size 512): comm "modprobe", pid 4275, jiffies 4295639067 (age 223.512s) hex dump (first 32 bytes): 04 00 00 00 08 00 00 00 01 00 00 00 00 00 00 00 ................ 00 e1 f5 05 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000d3662699>] kmalloc_trace+0x26/0x60 [<00000000edc7aadc>] wbt_init+0x50/0x6f0 [<0000000069601d16>] wbt_enable_default+0x157/0x1c0 [<0000000028fc393f>] blk_register_queue+0x2a4/0x420 [<000000007345a042>] device_add_disk+0x6fd/0xe40 [<0000000060e6aab0>] nbd_dev_add+0x828/0xbf0 [nbd] ... It is because the memory allocated in wbt_enable_default() is not released in the device_add_disk() error path. Normally, this memory is freed in: del_gendisk() rq_qos_exit() rqos->ops->exit(rqos); wbt_exit() So rq_qos_exit() is called to free the rq_wb memory allocated by wbt_init(). However, in the error path of device_add_disk(), only blk_unregister_queue() is called, which leaks the rq_wb memory. Add rq_qos_exit() to the error path to fix it. Fixes: `83cbce9574` ("block: add error handling for device_add_disk / add_disk") Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221029071355.35462-1-chenzhongjin@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-31 07:29:53 -06:00
Jinlong Chen	219cf43c55	blk-mq: move queue_is_mq out of blk_mq_cancel_work_sync The only caller that needs queue_is_mq check is del_gendisk, so move the check into it. Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221030094730.1275463-1-nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-31 07:27:54 -06:00
David Jeffery	82c229476b	blk-mq: avoid double ->queue_rq() because of early timeout David Jeffery found a double ->queue_rq() issue. So far it can be triggered in VM use cases because of long vmexit latency, preempt latency of the vCPU pthread, or a long page fault in the vCPU pthread: a block IO request can time out before it is queued to hardware but after blk_mq_start_request() has been called during ->queue_rq(). The timeout handler may then handle it by requeueing, which causes a double ->queue_rq() and a kernel panic. So far, it has been the driver's responsibility to cover the race between timeout and completion, so in theory this should be solved in the driver, given that the driver has enough knowledge. But it is really a common problem: lots of drivers could have a similar issue, it could be hard to fix all affected drivers, and it isn't even easy for a driver to handle the race. So David suggests this patch, which solves the issue by draining in-progress ->queue_rq(). Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Keith Busch <kbusch@kernel.org> Cc: virtualization@lists.linux-foundation.org Cc: Bart Van Assche <bvanassche@acm.org> Signed-off-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20221026051957.358818-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-31 07:24:22 -06:00
Linus Torvalds	c6e0e874a8	block-6.1-2022-10-28 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmNcRTYQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpkc2D/4yxjQ+lAmXLrqZOGAZc+r2GpiCFsgFcpKP i3ezeeo3zmdaoUH778DcbPo0oeWY/iIvV2RDo3/0PBHIlGL43W9e7zsnauRxUwtw /Aj140Tsm5/lKnBy8n0nT+DO4LE22JnBHi5XjlFELBwM+deBxS+izinFtcOQC8sj 58XWSmKag/Lv5JLvcYMj+PprtGOKzfNAacXvTjouy0IlEyb9E/yPMELS8lWFv8i2 QELtvuEDxODpQtA+Ph0O/o00A8Fg/lC4EH5uvExFMr8k74CGFm32Bar1UuaJ9QAs 5b8wateTra51yOGW3NEl1ph+4qVe9e4mutrLOFrChYylk5LePOVAki3wYb3lREiU rTOEKzUj3P/LHLpl4els0yIQ0gHXs60/M/Vn3TC50+2DnV00qEfvaocZ8vtXOux4 YR+2cKUxk2CRNyj2BB3WRlrIkCIVk+ehl17E2cdrg0m8SMqk0GAYbpXD753L9uiy I7IQEqYB+op501pmTcVskFUfW9ozT96YD53fwSOTR/pEK+esHN0GfqxI6lcA6Q0O M2AWEiu8t1PbSONVH/p895gfgGHdRHl6zgvR+ADJMDEmc7dpEoAxsoTj4HIirXbe sGHi7ycrQR6aLdHahjCukjUVkZkuhXJkAQmq2XURJgmEcz7iJme23WqtWWUUoQvi pk6e1RSqSA== =Zfnu -----END PGP SIGNATURE----- Merge tag 'block-6.1-2022-10-28' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - NVMe pull request via Christoph: - make the multipath dma alignment match the non-multipath one (Keith Busch) - fix a bogus use of sg_init_marker() (Nam Cao) - fix circular locking in nvme-tcp (Sagi Grimberg) - Initialization fix for requests allocated via the special hw queue allocator (John) - Fix for a regression added in this release with the batched completions of end_io backed requests (Ming) - Error handling leak fix for rbd (Yang) - Error handling leak fix for add_disk() failure (Yu) * tag 'block-6.1-2022-10-28' of git://git.kernel.dk/linux: blk-mq: Properly init requests from blk_mq_alloc_request_hctx() blk-mq: don't add non-pt request with ->end_io to batch rbd: fix possible memory leak in rbd_sysfs_init() nvme-multipath: set queue dma alignment to 3 nvme-tcp: fix possible circular locking when deleting a controller under memory pressure nvme-tcp: replace sg_init_marker() with sg_init_table() block: fix memory leak for elevator on add_disk failure	2022-10-29 18:06:52 -07:00
John Garry	e3c5a78cdb	blk-mq: Properly init requests from blk_mq_alloc_request_hctx() Function blk_mq_alloc_request_hctx() is missing zeroing/init of rq->bio, biotail, __sector, and __data_len members, which blk_mq_alloc_request() has, so duplicate what we do in blk_mq_alloc_request(). Fixes: `1f5bd336b9` ("blk-mq: add blk_mq_alloc_request_hctx") Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/1666780513-121650-1-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-28 07:54:47 -06:00
Bart Van Assche	9546531884	block: Micro-optimize get_max_segment_size() This patch removes a conditional jump from get_max_segment_size(). The x86-64 assembler code for this function without this patch is as follows: 206 return min_not_zero(mask - offset + 1, 0x0000000000000118 <+72>: not %rax 0x000000000000011b <+75>: and 0x8(%r10),%rax 0x000000000000011f <+79>: add $0x1,%rax 0x0000000000000123 <+83>: je 0x138 <bvec_split_segs+104> 0x0000000000000125 <+85>: cmp %rdx,%rax 0x0000000000000128 <+88>: mov %rdx,%r12 0x000000000000012b <+91>: cmovbe %rax,%r12 0x000000000000012f <+95>: test %rdx,%rdx 0x0000000000000132 <+98>: mov %eax,%edx 0x0000000000000134 <+100>: cmovne %r12d,%edx With this patch applied: 206 return min(mask - offset, (unsigned long)lim->max_segment_size - 1) + 1; 0x000000000000003f <+63>: mov 0x28(%rdi),%ebp 0x0000000000000042 <+66>: not %rax 0x0000000000000045 <+69>: and 0x8(%rdi),%rax 0x0000000000000049 <+73>: sub $0x1,%rbp 0x000000000000004d <+77>: cmp %rbp,%rax 0x0000000000000050 <+80>: cmova %rbp,%rax 0x0000000000000054 <+84>: add $0x1,%eax Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Keith Busch <kbusch@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20221025191755.1711437-4-bvanassche@acm.org Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-25 13:41:20 -06:00
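The two return expressions quoted in the entry above can be compared in plain user-space C. The sketch below is an assumption-laden illustration: `min_not_zero_ul`, `seg_size_old` and `seg_size_new` are hypothetical names, the kernel's `min_not_zero()` is approximated by a small helper, and everything is kept in `unsigned long` for simplicity. The point it shows is that taking the minimum of the values reduced by one and then adding one back handles the wrap-around case (`mask - offset + 1 == 0`) without a separate zero test.

```c
#include <limits.h>

/*
 * Hypothetical stand-in for the kernel's min_not_zero(): the smaller
 * of the two values, ignoring a value that is zero.
 */
static unsigned long min_not_zero_ul(unsigned long a, unsigned long b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/* Expression quoted as the "without this patch" form. */
static unsigned long seg_size_old(unsigned long mask, unsigned long offset,
				  unsigned long max_segment_size)
{
	return min_not_zero_ul(mask - offset + 1, max_segment_size);
}

/*
 * Expression quoted as the "with this patch" form: compare the values
 * minus one, then add one back, so a wrapped-to-zero operand never
 * has to be special-cased.
 */
static unsigned long seg_size_new(unsigned long mask, unsigned long offset,
				  unsigned long max_segment_size)
{
	unsigned long a = mask - offset;
	unsigned long b = max_segment_size - 1;

	return (a < b ? a : b) + 1;
}
```

For example, with `mask = ULONG_MAX` and `offset = 0` the old form computes `mask - offset + 1 == 0` and falls back to the segment size limit, while the new form reaches the same result through ordinary unsigned arithmetic.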
Bart Van Assche	aa261f2058	block: Constify most queue limits pointers Document which functions do not modify the queue limits. Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20221025191755.1711437-3-bvanassche@acm.org Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-25 13:41:17 -06:00
Christoph Hellwig	a55b70f127	block: remove bio_start_io_acct_time bio_start_io_acct_time is not actually used anywhere, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221025155916.270303-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-25 12:08:31 -06:00
Christoph Hellwig	2b3f056f72	blk-mq: move the call to blk_put_queue out of blk_mq_destroy_queue The fact that blk_mq_destroy_queue also drops a queue reference leads to various places having to grab an extra reference. Move the call to blk_put_queue into the callers to allow removing the extra references. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20221018135720.670094-2-hch@lst.de [axboe: fix fabrics_q vs admin_q conflict in nvme core.c] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-25 08:25:10 -06:00
Jinlong Chen	8ed40ee35d	block: fix up elevator_type refcounting The current reference management logic of io scheduler modules contains refcnt problems. For example, blk_mq_init_sched may fail before or after calling e->ops.init_sched. If it fails before the call, it does nothing to the reference to the io scheduler module. But if it fails after the call, it releases the reference by calling kobject_put(&eq->kobj). As the callers of blk_mq_init_sched can't know exactly where the failure happens, they can't handle the reference to the io scheduler module properly: releasing the reference on failure results in double-release if blk_mq_init_sched has released it, and not releasing the reference results in ghost reference if blk_mq_init_sched did not release it either. The same problem also exists in io schedulers' init_sched implementations. We can address the problem by adding releasing statements to the error handling procedures of blk_mq_init_sched and init_sched implementations. But that is counterintuitive and requires modifications to existing io schedulers. Instead, we make elevator_alloc get the io scheduler module references that will be released by elevator_release. And then, we match each elevator_get with an elevator_put. Therefore, each reference to an io scheduler module explicitly has its own getter and releaser, and we no longer need to worry about the refcnt problems. The bugs and the patch can be validated with tools here: https://github.com/nickyc975/linux_elv_refcnt_bug.git [hch: split out a few bits into separate patches, use a non-try module_get in elevator_alloc] Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221020064819.1469928-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-23 18:59:17 -06:00
Jinlong Chen	b54c2ad9b7	block: check for an unchanged elevator earlier in __elevator_change No need to find the actual elevator_type struct for this comparison, the name is all that is needed. Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn> [hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221020064819.1469928-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-23 18:59:17 -06:00
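The pairing rule described in the refcounting entry above (alloc takes its own reference, release drops it, and every lookup get is matched by a put) can be modelled in a few lines of user-space C. This is a sketch under assumptions: the `toy_*` names are hypothetical, and a plain integer stands in for the scheduler module reference count.

```c
#include <stddef.h>

/* Hypothetical model: an int stands in for the module reference count. */
struct toy_elv_type {
	int module_refs;
};

struct toy_elv_queue {
	struct toy_elv_type *type;
};

static void toy_elevator_get(struct toy_elv_type *e)
{
	e->module_refs++;
}

static void toy_elevator_put(struct toy_elv_type *e)
{
	e->module_refs--;
}

/* Allocation takes its own reference ... */
static void toy_elevator_alloc(struct toy_elv_queue *eq, struct toy_elv_type *e)
{
	toy_elevator_get(e);
	eq->type = e;
}

/* ... and release drops exactly that reference. */
static void toy_elevator_release(struct toy_elv_queue *eq)
{
	toy_elevator_put(eq->type);
	eq->type = NULL;
}

/*
 * Simulate a scheduler switch whose init step may fail before or after
 * the elevator queue is allocated. Returns the final reference count.
 */
static int toy_switch(int fail_before_alloc, int fail_after_alloc)
{
	struct toy_elv_type e = { 0 };
	struct toy_elv_queue eq = { NULL };

	toy_elevator_get(&e);			/* caller's lookup reference */
	if (!fail_before_alloc) {
		toy_elevator_alloc(&eq, &e);
		if (fail_after_alloc)
			toy_elevator_release(&eq);
	}
	toy_elevator_put(&e);			/* caller always drops its own */
	return e.module_refs;
}
```

Because each reference has one getter and one releaser, the caller does not need to know where the failure happened: both failure cases end with a count of zero, and the success case leaves exactly the one reference held by the live elevator queue.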
Christoph Hellwig	58367c8a5f	block: sanitize the elevator name before passing it to __elevator_change The stripped name should also be used for the none check. To do so strip it in the caller and pass in the sanitized name. Drop the pointless __ prefix in the function name while we're at it. Based on a patch from Jinlong Chen <nickyc975@zju.edu.cn>. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221020064819.1469928-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-23 18:59:17 -06:00
Christoph Hellwig	dd6f7f17bf	block: add proper helpers for elevator_type module refcount management Make sure we have helpers for all relevant module refcount operations on the elevator_type in elevator.h, and use them. Move the call to the get helper in blk_mq_elv_switch_none a bit so that it is obvious with a less verbose comment. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221020064819.1469928-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-23 18:59:17 -06:00
Yu Kuai	671fae5e51	blk-wbt: don't enable throttling if default elevator is bfq Commit `b5dc5d4d1f` ("block,bfq: Disable writeback throttling") tries to disable wbt for bfq by calling wbt_disable_default() in bfq_init_queue(). However, wbt is still enabled if the default elevator is bfq: device_add_disk -> elevator_init_mq -> bfq_init_queue -> wbt_disable_default (does nothing); then blk_register_queue -> wbt_enable_default (wbt is enabled). Fix the problem by adding a new flag ELEVATOR_FLAG_DISABLE_WBT: bfq sets the flag in bfq_init_queue(), and the following wbt_enable_default() won't enable wbt while the flag is set. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221019121518.3865235-7-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-23 18:59:17 -06:00
Yu Kuai	181d066374	elevator: add new field flags in struct elevator_queue Currently there is only one flag, which indicates that the elevator is registered. Prepare to add a flag to disable wbt if the default elevator is bfq. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221019121518.3865235-6-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-23 18:59:17 -06:00
Yu Kuai	3642ef4d95	blk-wbt: don't show valid wbt_lat_usec in sysfs while wbt is disabled Currently, if wbt is initialized and then disabled by wbt_disable_default(), sysfs will still show valid wbt_lat_usec, which will confuse users that wbt is still enabled. This patch shows wbt_lat_usec as zero if it's disabled. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reported-and-tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221019121518.3865235-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-23 18:59:17 -06:00
Yu Kuai	a9a236d238	blk-wbt: make enable_state more accurate Currently, if the user disables wbt through sysfs, 'enable_state' will be 'WBT_STATE_ON_MANUAL', which is confusing. Add a new state 'WBT_STATE_OFF_MANUAL' to cover that case. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221019121518.3865235-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-23 18:59:17 -06:00
Yu Kuai	b11d31ae01	blk-wbt: remove unnecessary check in wbt_enable_default() If CONFIG_BLK_WBT_MQ is disabled, wbt_init() won't do anything. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221019121518.3865235-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-23 18:59:17 -06:00
Yu Kuai	6d9f4cf125	elevator: remove redundant code in elv_unregister_queue() "elevator_queue *e" is already declared and initialized in the beginning of elv_unregister_queue(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20221019121518.3865235-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-23 18:59:17 -06:00
Yu Kuai	074501bce3	blk-iocost: read 'ioc->params' inside 'ioc->lock' in ioc_timer_fn() 'ioc->params' is updated in ioc_refresh_params(), which is protected by 'ioc->lock'; however, ioc_timer_fn() reads the params outside the lock. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20221012094035.390056-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-23 18:59:17 -06:00
Yu Kuai	2b2da2f6dc	blk-iocost: prevent configuration update concurrent with io throttling This won't cause any severe problem currently; however, it doesn't seem appropriate: 1) 'ioc->params' is read from multiple places without holding 'ioc->lock', so an unexpected value might be read if it is written concurrently. 2) If the configuration is changed while io is being throttled, the functionality might be affected. For example, if module params are updated and the cost becomes smaller, waiting for a timer that was calculated under the old configuration is not appropriate. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20221012094035.390056-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-23 18:59:16 -06:00
Yu Kuai	2c06479884	blk-iocost: don't release 'ioc->lock' while updating params ioc_qos_write() and ioc_cost_model_write() follow the same pattern: 1) hold lock to read 'ioc->params' to local variable; 2) update params to local variable without lock; 3) hold lock to write local variable to 'ioc->params'; In theory, if users update params concurrently, an update might be lost: t1 (update params a): spin_lock_irq(&ioc->lock); memcpy(qos, ioc->params.qos, sizeof(qos)) spin_unlock_irq(&ioc->lock); qos[a] = xxx; t2 (update params b): spin_lock_irq(&ioc->lock); memcpy(qos, ioc->params.qos, sizeof(qos)) spin_unlock_irq(&ioc->lock); qos[b] = xxx; t1: spin_lock_irq(&ioc->lock); memcpy(ioc->params.qos, qos, sizeof(qos)); ioc_refresh_params(ioc, true); spin_unlock_irq(&ioc->lock); t2: spin_lock_irq(&ioc->lock); // updates of a will be lost memcpy(ioc->params.qos, qos, sizeof(qos)); ioc_refresh_params(ioc, true); spin_unlock_irq(&ioc->lock); Although this is not a common case, the problem can be fixed easily by holding the lock through the read, update, write process. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20221012094035.390056-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-23 18:59:16 -06:00
Yu Kuai	8796acbc9a	blk-iocost: disable writeback throttling Commit `b5dc5d4d1f` ("block,bfq: Disable writeback throttling") disables wbt for bfq, because different write-throttling heuristics should not work together. For the same reason, wbt and iocost should not work together either, unless the admin really wants to do that despite the performance impact. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20221012094035.390056-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-23 18:59:16 -06:00
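The lost update described in 2c06479884 can be reproduced outside the kernel. The following is a minimal userspace sketch, not the blk-iocost code: `struct params_store`, `NR_QOS`, and every `store_*` helper are hypothetical names invented for illustration, and a C11 `atomic_flag` spin loop stands in for `ioc->lock`. It separates the old pattern (snapshot, edit without the lock, commit) from the fixed one (a single lock hold around the whole read, update, write).

```c
#include <stdatomic.h>
#include <string.h>

#define NR_QOS 4	/* hypothetical number of parameters */

/* Hypothetical stand-in for ioc->params.qos guarded by ioc->lock. */
struct params_store {
	atomic_flag lock;
	int qos[NR_QOS];
};

static void store_init(struct params_store *s)
{
	atomic_flag_clear(&s->lock);
	memset(s->qos, 0, sizeof(s->qos));
}

static void store_lock(struct params_store *s)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
		;
}

static void store_unlock(struct params_store *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Old pattern, step 1: copy the shared values out, then drop the lock. */
static void store_snapshot(struct params_store *s, int *local)
{
	store_lock(s);
	memcpy(local, s->qos, sizeof(s->qos));
	store_unlock(s);
}

/* Old pattern, step 3: retake the lock and overwrite every value. */
static void store_commit(struct params_store *s, const int *local)
{
	store_lock(s);
	memcpy(s->qos, local, sizeof(s->qos));
	store_unlock(s);
}

/* Fixed pattern: the update happens under a single lock hold. */
static void store_update_locked(struct params_store *s, int idx, int val)
{
	store_lock(s);
	s->qos[idx] = val;
	store_unlock(s);
}
```

If two writers each call `store_snapshot()`, edit different slots of their local copies, and then call `store_commit()` in turn, the second commit writes back a stale value for the slot the first writer changed. With `store_update_locked()` no stale copy exists, so neither update can be overwritten.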
Yu Kuai	02341a08c9	block: fix memory leak for elevator on add_disk failure The default elevator is allocated in the beginning of device_add_disk(), however, it's not freed in the following error path. Fixes: `50e34d7881` ("block: disable the elevator int del_gendisk") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Yan <yanaijie@huawei.com> Link: https://lore.kernel.org/r/20221022021615.2756171-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-22 15:14:38 -06:00
Linus Torvalds	d4b7332eef	block-6.1-2022-10-20 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmNSA3oQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpoTfEACwFAyQsAGMVVsTmU/XggFsNRczTKH+J7dO 4tBhOhfrVEWyDbB++3kUepN0GenSiu3dvcAzWQbX0lPiLjkqFWvpj2GIrqHwXJCt Li8SPoPsQARBvJYUERmdlQDUeEKCK/IwOREQSH4LZaNVvU2l0MSAqjihUjmcXHBE OeCWwYLJ27Bo0j5py9xSo7Z1C84YtXHDKAbx1nVhX6ByyCvpSGLLunrwT7oeEjT7 FAbVvL4p5xwZe5e3XiiC5J7g85BkIZGwhqViYrVyHPWrUVWycSvXQdqXGGHiSqlj g3jcT6nLNBscAPuWnnP4jobaFVjm1Lr3eXtVX1k66kRwrxVcB+tZGszvjh4cuneO N23OjrRjtGIXenSyOslCxMAIxgMjsxeuuoQb66Js9g8bA9zJeKPcpKWbXplkaXoH VsT23W24Xp72MkBAz9nP9sHXbQEAkfNGK+6qZLBpkb8bJsymc+4Vn6lpV6ybl8WF LJWouLss2/I+G/D/WLvVcygvefQxVSLDFjWcU8BcJjDkdXgQWUjZfcZEsRE2v0cr tLZOh1WhHgVwTBCZxEus+QoAE1tXn2C+xhMu0uqRQ7zlngy5AycM6eSJtQ5Lwm2H 91MSYoUvGkOx5Hqp265AnZEBayOg+xnsp7qMZghoQqLUc5yU4zyUbvxPi0vNuduJ mbys/sxVsg== =YfkM -----END PGP SIGNATURE----- Merge tag 'block-6.1-2022-10-20' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - NVMe pull request via Christoph: - fix nvme-hwmon for DMA non-cohehrent architectures (Serge Semin) - add a nvme-hwmong maintainer (Christoph Hellwig) - fix error pointer dereference in error handling (Dan Carpenter) - fix invalid memory reference in nvmet_subsys_attr_qid_max_show (Daniel Wagner) - don't limit the DMA segment size in nvme-apple (Russell King) - fix workqueue MEM_RECLAIM flushing dependency (Sagi Grimberg) - disable write zeroes on various Kingston SSDs (Xander Li) - fix a memory leak with block device tracing (Ye) - flexible-array fix for ublk (Yushan) - document the ublk recovery feature from this merge window (ZiyangZhang) - remove dead bfq variable in struct (Yuwei) - error handling rq clearing fix (Yu) - add an IRQ safety check for the cached bio freeing (Pavel) - drbd bio cloning fix (Christoph) * tag 'block-6.1-2022-10-20' of git://git.kernel.dk/linux: blktrace: remove unnessary stop block trace in 'blk_trace_shutdown' blktrace: fix possible memleak in 
'__blk_trace_remove' blktrace: introduce 'blk_trace_{start,stop}' helper bio: safeguard REQ_ALLOC_CACHE bio put block, bfq: remove unused variable for bfq_queue drbd: only clone bio if we have a backing device ublk_drv: use flexible-array member instead of zero-length array nvmet: fix invalid memory reference in nvmet_subsys_attr_qid_max_show nvmet: fix workqueue MEM_RECLAIM flushing dependency nvme-hwmon: kmalloc the NVME SMART log buffer nvme-hwmon: consistently ignore errors from nvme_hwmon_init nvme: add Guenther as nvme-hwmon maintainer nvme-apple: don't limit DMA segement size nvme-pci: disable write zeroes on various Kingston SSD nvme: fix error pointer dereference in error handling Documentation: document ublk user recovery feature blk-mq: fix null pointer dereference in blk_mq_clear_rq_mapping()	2022-10-21 15:14:14 -07:00
Pavel Begunkov	d4347d5040	bio: safeguard REQ_ALLOC_CACHE bio put bio_put() with REQ_ALLOC_CACHE assumes that it's executed not from an irq context. Let's add a warning if the invariant is not respected, especially since there is a couple of places removing REQ_POLLED by hand without also clearing REQ_ALLOC_CACHE. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/558d78313476c4e9c233902efa0092644c3d420a.1666122465.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-20 05:50:29 -07:00
Yuwei Guan	33566f92cd	block, bfq: remove unused variable for bfq_queue It was defined in `d0edc2473b`, but it is not used anywhere, so remove it. Signed-off-by: Yuwei Guan <Yuwei.Guan@zeekrlife.com> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20221018030139.159-1-Yuwei.Guan@zeekrlife.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-20 05:46:49 -07:00
Yu Kuai	76dd298094	blk-mq: fix null pointer dereference in blk_mq_clear_rq_mapping() Our syzkaller reported a null pointer dereference; the root cause is as follows: __blk_mq_alloc_map_and_rqs set->tags[hctx_idx] = blk_mq_alloc_map_and_rqs blk_mq_alloc_map_and_rqs blk_mq_alloc_rqs // failed due to oom alloc_pages_node // set->tags[hctx_idx] is still NULL blk_mq_free_rqs drv_tags = set->tags[hctx_idx]; // null pointer dereference is triggered blk_mq_clear_rq_mapping(drv_tags, ...) This is because commit `63064be150` ("blk-mq: Add blk_mq_alloc_map_and_rqs()") merged the two steps: 1) set->tags[hctx_idx] = blk_mq_alloc_rq_map() 2) blk_mq_alloc_rqs(..., set->tags[hctx_idx]) into one step: set->tags[hctx_idx] = blk_mq_alloc_map_and_rqs() Since tags is not initialized yet in this case, fix the problem by checking whether tags is a NULL pointer in blk_mq_clear_rq_mapping(). Fixes: `63064be150` ("blk-mq: Add blk_mq_alloc_map_and_rqs()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/20221011142253.4015966-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-16 17:22:51 -06:00
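The fix in 76dd298094 amounts to tolerating a slot that was never assigned because the combined allocation failed part-way. A minimal sketch of that guard follows, assuming a simplified, hypothetical `struct tag_map` and a hypothetical `clear_rq_mapping()` helper. It is not the blk-mq implementation, only the shape of the check.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a tag map and its request table. */
struct tag_map {
	unsigned int nr_rqs;
	void **rqs;
};

/*
 * Clear every cached request pointer in drv_tags and return how many
 * entries were cleared. A NULL drv_tags means the slot was never
 * assigned (allocation failed part-way), so there is nothing to clear.
 */
static unsigned int clear_rq_mapping(struct tag_map *drv_tags)
{
	unsigned int i, cleared = 0;

	if (!drv_tags)
		return 0;

	for (i = 0; i < drv_tags->nr_rqs; i++) {
		if (drv_tags->rqs[i]) {
			drv_tags->rqs[i] = NULL;
			cleared++;
		}
	}
	return cleared;
}
```

Placing the check in the callee rather than in each caller mirrors the choice the commit message describes: the error path that reaches this function cannot easily tell whether the slot has been populated yet.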
Linus Torvalds	f1947d7c8a	Random number generator fixes for Linux 6.1-rc1. -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmNHYD0ACgkQSfxwEqXe A655AA//dJK0PdRghqrKQsl18GOCffV5TUw5i1VbJQbI9d8anfxNjVUQiNGZi4et qUwZ8OqVXxYx1Z1UDgUE39PjEDSG9/cCvOpMUWqN20/+6955WlNZjwA7Fk6zjvlM R30fz5CIJns9RFvGT4SwKqbVLXIMvfg/wDENUN+8sxt36+VD2gGol7J2JJdngEhM lW+zqzi0ABqYy5so4TU2kixpKmpC08rqFvQbD1GPid+50+JsOiIqftDErt9Eg1Mg MqYivoFCvbAlxxxRh3+UHBd7ZpJLtp1UFEOl2Rf00OXO+ZclLCAQAsTczucIWK9M 8LCZjb7d4lPJv9RpXFAl3R1xvfc+Uy2ga5KeXvufZtc5G3aMUKPuIU7k28ZyblVS XXsXEYhjTSd0tgi3d0JlValrIreSuj0z2QGT5pVcC9utuAqAqRIlosiPmgPlzXjr Us4jXaUhOIPKI+Musv/fqrxsTQziT0jgVA3Njlt4cuAGm/EeUbLUkMWwKXjZLTsv vDsBhEQFmyZqxWu4pYo534VX2mQWTaKRV1SUVVhQEHm57b00EAiZohoOvweB09SR 4KiJapikoopmW4oAUFotUXUL1PM6yi+MXguTuc1SEYuLz/tCFtK8DJVwNpfnWZpE lZKvXyJnHq2Sgod/hEZq58PMvT6aNzTzSg7YzZy+VabxQGOO5mc= =M+mV -----END PGP SIGNATURE----- Merge tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull more random number generator updates from Jason Donenfeld: "This time with some large scale treewide cleanups. The intent of this pull is to clean up the way callers fetch random integers. The current rules for doing this right are: - If you want a secure or an insecure random u64, use get_random_u64() - If you want a secure or an insecure random u32, use get_random_u32() The old function prandom_u32() has been deprecated for a while now and is just a wrapper around get_random_u32(). Same for get_random_int(). - If you want a secure or an insecure random u16, use get_random_u16() - If you want a secure or an insecure random u8, use get_random_u8() - If you want secure or insecure random bytes, use get_random_bytes(). 
The old function prandom_bytes() has been deprecated for a while now and has long been a wrapper around get_random_bytes() - If you want a non-uniform random u32, u16, or u8 bounded by a certain open interval maximum, use prandom_u32_max() I say "non-uniform", because it doesn't do any rejection sampling or divisions. Hence, it stays within the prandom_() namespace, not the get_random_() namespace. I'm currently investigating a "uniform" function for 6.2. We'll see what comes of that. By applying these rules uniformly, we get several benefits: - By using prandom_u32_max() with an upper-bound that the compiler can prove at compile-time is ≤65536 or ≤256, internally get_random_u16() or get_random_u8() is used, which wastes fewer batched random bytes, and hence has higher throughput. - By using prandom_u32_max() instead of %, when the upper-bound is not a constant, division is still avoided, because prandom_u32_max() uses a faster multiplication-based trick instead. - By using get_random_u16() or get_random_u8() in cases where the return value is intended to indeed be a u16 or a u8, we waste fewer batched random bytes, and hence have higher throughput. This series was originally done by hand while I was on an airplane without Internet. Later, Kees and I worked on retroactively figuring out what could be done with Coccinelle and what had to be done manually, and then we split things up based on that. So while this touches a lot of files, the actual amount of code that's hand fiddled is comfortably small" * tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: prandom: remove unused functions treewide: use get_random_bytes() when possible treewide: use get_random_u32() when possible treewide: use get_random_{u8,u16}() when possible, part 2 treewide: use get_random_{u8,u16}() when possible, part 1 treewide: use prandom_u32_max() when possible, part 2 treewide: use prandom_u32_max() when possible, part 1	2022-10-16 15:27:07 -07:00
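The "multiplication-based trick" mentioned in the pull message above can be sketched in plain C. This is an illustration under stated assumptions, not the kernel source: `bounded_u32()` is a hypothetical name, and the caller is assumed to supply an already-random 32-bit value `rnd`. The function maps `rnd` into `[0, ceil)` with one widening multiply and a shift, with no division and no rejection sampling, which is also why the result is only approximately uniform.

```c
#include <stdint.h>

/*
 * Map a random 32-bit value into [0, ceil) without dividing.
 * The 64-bit product rnd * ceil is always below ceil * 2^32, so its
 * upper 32 bits are always below ceil. A ceil of 0 yields 0.
 */
static uint32_t bounded_u32(uint32_t rnd, uint32_t ceil)
{
	return (uint32_t)(((uint64_t)rnd * ceil) >> 32);
}
```

For example, with `ceil` set to 10, an input of `0x80000000` (half of the 32-bit range) lands on 5, and the maximum input `0xFFFFFFFF` lands on 9.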
Linus Torvalds	a521fc3cfb	block-6.1-2022-10-13 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmNIXQAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpnplD/wP1Kcbn0wc6UktiOaGHy6/DjRmpi2a857g ioxTioycD7/ZchIGsfXLca9xBuOufLK7//irG1CpxJT7QXXQTprHspXOZLpw2nUb t77hGTgCHkI5QOPuVnZ4+agjvxefUQ0/TOEPLYlb1fh+Q81ZAR7NecV9XwXnGy4V 75gS1t3cP4WcmeFi6MIkvg8BZhusS2qnPvmviig6eDMDvlJ5Ct/x3zJGnxPH/jt/ fJtCvZUIUVSfuUirCh/Xv3gmXptPmwrQPKGW66dCZeZaZpUAcFkmvQu/4739q+x2 5XGhr02FlHxm6zX165rF4jvXNyn1K9nPj/gUQcPLKK+JesA56OPGsUCjkNE7azbm ZWf710Tyzx1B2EAIk4keRNIe5YBcUQenwIbGd2czOf+oAuSmNX+tRLozIsVdwD7Z NPvLFYfH+evPqMzIhESjt4q476eCl1Zu8pGfBvYmPJ9NNnjcG3wjFZLmmsfA4KLG ffkJqBGLWc9LD4sUjsVhYxV79jcQDr7Wh+fkiQ1u9ML6qjNsp9dQijnvIb7dC6Vj UKvMbB45oDPffI71dHx08xX+G7Kf5cLWm/Wf0DLRjJaL5uxUA/OGITMdjmvFzSLZ FNdlAkwfIomLO3mFK1LrSUYMpwZ84ZBH7iZfO6omOk0MrAhMA3M1U1OnXUS+GgxK R5A4tOD4Ow== =kxra -----END PGP SIGNATURE----- Merge tag 'block-6.1-2022-10-13' of git://git.kernel.dk/linux Pull more block updates from Jens Axboe: "Fixes that ended up landing later than the initial block pull request. 
Nothing really major in here: - NVMe pull request via Christoph: - add NVME_QUIRK_BOGUS_NID for Lexar NM760 (Abhijit) - add NVME_QUIRK_NO_DEEPEST_PS to avoid the deepest sleep state on ZHITAI TiPro5000 SSDs (Xi Ruoyao) - fix possible hang caused during ctrl deletion (Sagi Grimberg) - fix possible hang in live ns resize with ANA access (Sagi Grimberg) - Proactively avoid a sign extension issue with the queue flags (Brian) - Regression fix for hidden disks (Christoph) - Update OPAL maintainers entry (Jonathan) - blk-wbt regression initialization fix (Yu)" * tag 'block-6.1-2022-10-13' of git://git.kernel.dk/linux: nvme-multipath: fix possible hang in live ns resize with ANA access nvme-pci: avoid the deepest sleep state on ZHITAI TiPro5000 SSDs nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM760 nvme-tcp: fix possible hang caused during ctrl deletion nvme-rdma: fix possible hang caused during ctrl deletion block: fix leaking minors of hidden disks block: avoid sign extend problem with default queue flags mask blk-wbt: fix that 'rwb->wc' is always set to 1 in wbt_init() block: Remove the repeat word 'can' MAINTAINERS: Update SED-Opal Maintainers	2022-10-13 21:25:57 -07:00
Jason A. Donenfeld	197173db99	treewide: use get_random_bytes() when possible The prandom_bytes() function has been a deprecated inline wrapper around get_random_bytes() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> # powerpc Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>	2022-10-11 17:42:58 -06:00
Linus Torvalds	27bc50fc90	- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam R. Howlett. An overlapping range-based tree for vmas. It it apparently slight more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat (https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com). This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. 
For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU= =xfWx -----END PGP SIGNATURE----- Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. 
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. 
Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ...	2022-10-10 17:53:04 -07:00
Linus Torvalds	adf4bfc4a9	cgroup changes for v6.1-rc1. * cpuset now support isolated cpus.partition type, which will enable dynamic CPU isolation. * pids.peak added to remember the max number of pids used. * Holes in cgroup namespace plugged. * Internal cleanups. Note that for-6.1-fixes was pulled into for-6.1 twice. Both were for follow-up cleanups and each merge commit has details. Also, `8a693f7766` ("cgroup: Remove CFTYPE_PRESSURE") removes the flag used by PSI changes in the tip tree and the merged result won't compile due to the missing flag. Simply removing the struct init lines specifying the flag is the correct resolution. linux-next already contains the correct fix: https://lkml.kernel.org/r/20220912161812.072aaa3b@canb.auug.org.au -----BEGIN PGP SIGNATURE----- iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCYzsl7w4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGYsxAP4kad4YPw+CueLyyEMiYgBHouqDt8cG0+FJWK3X svTC7wD/eCLfxZM8TjjSrMmvaMrml586mr3NoQaFeW0x3twptQQ= =LERu -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - cpuset now support isolated cpus.partition type, which will enable dynamic CPU isolation - pids.peak added to remember the max number of pids used - holes in cgroup namespace plugged - internal cleanups * tag 'cgroup-for-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (25 commits) cgroup: use strscpy() is more robust and safer iocost_monitor: reorder BlkgIterator cgroup: simplify code in cgroup_apply_control cgroup: Make cgroup_get_from_id() prettier cgroup/cpuset: remove unreachable code cgroup: Remove CFTYPE_PRESSURE cgroup: Improve cftype add/rm error handling kselftest/cgroup: Add cpuset v2 partition root state test cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity rule cgroup/cpuset: Relocate a code block in validate_change() cgroup/cpuset: 
Show invalid partition reason string cgroup/cpuset: Add a new isolated cpus.partition type cgroup/cpuset: Relax constraints to partition & cpus changes cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective cgroup/cpuset: Miscellaneous cleanups & add helper functions cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset cgroup: add pids.peak interface for pids controller cgroup: Remove data-race around cgrp_dfl_visible cgroup: Fix build failure when CONFIG_SHRINKER_DEBUG ...	2022-10-10 11:12:25 -07:00
Jens Axboe	24a403340d	Merge branch 'for-6.1/block' into block-6.1 Merge in later fixes. * for-6.1/block: block: fix leaking minors of hidden disks block: avoid sign extend problem with default queue flags mask blk-wbt: fix that 'rwb->wc' is always set to 1 in wbt_init() block: Remove the repeat word 'can' MAINTAINERS: Update SED-Opal Maintainers	2022-10-10 11:26:40 -06:00
Christoph Hellwig	a0a6314ae7	block: fix leaking minors of hidden disks The major/minor of a hidden gendisk is not propagated to the block device because it is never registered using bdev_add. But the lack of bd_dev also causes the dynamic major minor number not to be freed. Assign bd_dev manually to ensure the dynamic major minor gets freed. Based on a patch by Keith Busch. Fixes: `8ddcd65325` ("block: introduce GENHD_FL_HIDDEN") Reported-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20221010131857.748129-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-10 08:48:59 -06:00
Yu Kuai	285febabac	blk-wbt: fix that 'rwb->wc' is always set to 1 in wbt_init() Commit `8c5035dfbb` ("blk-wbt: call rq_qos_add() after wb_normal is initialized") moves wbt_set_write_cache() before rq_qos_add(), which is wrong because wbt_rq_qos() is still NULL. Fix the problem by removing wbt_set_write_cache() and setting 'rwb->wc' directly. Note that this patch also removes the redundant setting of 'rwb->wc'. Fixes: `8c5035dfbb` ("blk-wbt: call rq_qos_add() after wb_normal is initialized") Reported-by: kernel test robot <yujie.liu@intel.com> Link: https://lore.kernel.org/r/202210081045.77ddf59b-yujie.liu@intel.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20221009101038.1692875-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-09 07:48:16 -06:00
Linus Torvalds	7c989b1da3	for-6.1/passthrough-2022-10-04 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM8rp4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjTHD/9eeWwaG7oSSu5J1YzkKn+hptaDzZwreL98 Mh8euiQScUVpvHGkNowBhjBZ5cIAAcYaH17rjW7dWu6A7tv/iqygWd/YvIbs1JOe STSD9yf0RV4dI0MG6Wu2w6YxObaLvE5BTRxqb/WuFWNTgsYf2HEp4PM9sTio71+H WwWdRvsIxsRxVYemds3vBxd+BcM8vm26EoUTSaCwRhfopaJwBNceCYIIrM7VHUNM 5G6+DJkm3mB1a8nsdguYZQC/y8F/9P5Ch9CdxA12yOZEryr3wzsyRNGdm7oRmFGM bAkjFcddhwk5+SuTzGX6t4/Z3ODIjeCXbMBg4p7AShHws4Yx1trJePiqoNQ8xd5A PkMfxhQpBPlDFKLmwtObPLInyzMpp5P8KYMIZfyymKD/+XjmqAlR6TXbFUTihzBU lHSFhwG8ysT2cAVrFBMDJu4UPIThIHqfkkF/nTkHePTSArJ/k5rGV7v5sQpZ+jtY R0gvoNHTq2IvgKGEEbTgDjpwVcCn5ERVorZuGjVN2nMdLj35kXpo7YNgyYMaD5LJ 9SOR5a8iQjjudAfdGyZCGzNaOecizVFjABozUYc1XJi/boNuFTsq4XCE/tCLTixc V4sElRpgrlXxNXkiVdbuWIPuYo4sDw5gqZQynpVNH5PkmX/NqmpWYVEWJ20o+pwg 3ag39nZQVQ== =nwLk -----END PGP SIGNATURE----- Merge tag 'for-6.1/passthrough-2022-10-04' of git://git.kernel.dk/linux Pull passthrough updates from Jens Axboe: "With these changes, passthrough NVMe support over io_uring now performs at the same level as block device O_DIRECT, and in many cases 6-8% better. 
This contains: - Add support for fixed buffers for passthrough (Anuj, Kanchan) - Enable batched allocations and freeing on passthrough, similarly to what we support on the normal storage path (me) - Fix from Geert fixing an issue with !CONFIG_IO_URING" * tag 'for-6.1/passthrough-2022-10-04' of git://git.kernel.dk/linux: io_uring: Add missing inline to io_uring_cmd_import_fixed() dummy nvme: wire up fixed buffer support for nvme passthrough nvme: pass ubuffer as an integer block: extend functionality to map bvec iterator block: factor out blk_rq_map_bio_alloc helper block: rename bio_map_put to blk_mq_map_bio_put nvme: refactor nvme_alloc_request nvme: refactor nvme_add_user_metadata nvme: Use blk_rq_map_user_io helper scsi: Use blk_rq_map_user_io helper block: add blk_rq_map_user_io io_uring: introduce fixed buffer support for io_uring_cmd io_uring: add io_uring_cmd_import_fixed nvme: enable batched completions of passthrough IO nvme: split out metadata vs non metadata end_io uring_cmd completions block: allow end_io based requests in the completion batch handling block: change request end_io handler to pass back a return value block: enable batched allocation for blk_mq_alloc_request() block: kill deprecated BUG_ON() in the flush handling	2022-10-07 09:35:50 -07:00
Linus Torvalds	513389809e	for-6.1/block-2022-10-03 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67XkQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpiHoD/9eN+6YnNRPu5+2zeGnnm1Nlwic6YMZeORr KFIeC0COMWoFhNBIPFkgAKT+0qIH+uGt5UsHSM3Y5La7wMR8yLxD4PAnvTZ/Ijtt yxVIOmonJoQ0OrQ2kTbvDXL/9OCUrzwXXyUIEPJnH0Ca1mxeNOgDHbE7VGF6DMul 0D3pI8qs2WLnHlDi1V/8kH5qZ6WoAJSDcb8sTzOUVnyveZPNaZhGQJuHA2XAYMtg fqKMDJqgmNk6jdTMUgdF5B+rV64PQoCy28I7fXqGkEe+RE5TBy57vAa0XY84V8XR /a8CEuwMts2ypk1hIcJG8Vv8K6u5war9yPM5MTngKsoMpzNIlhrhaJQVyjKdcs+E Ixwzexu6xTYcrcq+mUARgeTh79FzTBM/uXEdbCG2G3S6HPd6UZWUJZGfxw/l0Aem V4xB7lj6SQaJDU1iJCYUaHcekNXhQAPvyVG+R2ED1SO3McTpTPIM1aeigxw6vj7u bH3Kfdr94Z8HNuoLuiS6YYfjNt2Shf4LEB6GxKJ9TYHtyhdOyO0H64jGHpygrWqN cSnkWPUqUUNpF7srKM0ZgbliCshvmyJc4aMOFd0gBY/kXf5J/j7IXvh8TFCi9rHH 0KyZH3/3Zsu9geUn3ynznlr4FXU+BcqE6boaa/iWb9sN1m+Rvaahv8cSch/dh44a vQNj/iOBQA== =R05e -----END PGP SIGNATURE----- Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - handle number of queue changes in the TCP and RDMA drivers (Daniel Wagner) - allow changing the number of queues in nvmet (Daniel Wagner) - also consider host_iface when checking ip options (Daniel Wagner) - don't map pages which can't come from HIGHMEM (Fabio M. 
De Francesco) - avoid unnecessary flush bios in nvmet (Guixin Liu) - shrink and better pack the nvme_iod structure (Keith Busch) - add comment for unaligned "fake" nqn (Linjun Bao) - print actual source IP address through sysfs "address" attr (Martin Belanger) - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang) - handle effects after freeing the request (Keith Busch) - copy firmware_rev on each init (Keith Busch) - restrict management ioctls to admin (Keith Busch) - ensure subsystem reset is single threaded (Keith Busch) - report the actual number of tagset maps in nvme-pci (Keith Busch) - small fabrics authentication fixups (Christoph Hellwig) - add common code for tagset allocation and freeing (Christoph Hellwig) - stop using the request_queue in nvmet (Christoph Hellwig) - set min_align_mask before calculating max_hw_sectors (Rishabh Bhatnagar) - send a rediscover uevent when a persistent discovery controller reconnects (Sagi Grimberg) - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi) - MD pull request via Song: - Various raid5 fix and clean up, by Logan Gunthorpe and David Sloan. - Raid10 performance optimization, by Yu Kuai. 
- sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu) - IO scheduler switching quiesce fix (Keith) - s390/dasd block driver updates (Stefan) - support for recovery for the ublk driver (ZiyangZhang) - rnbd driver fixes and updates (Guoqing, Santosh, ye, Christoph) - blk-mq and null_blk map fixes (Bart) - various bcache fixes (Coly, Jilin, Jules) - nbd signal hang fix (Shigeru) - block writeback throttling fix (Yu) - optimize the passthrough mapping handling (me) - prepare block cgroups for being gendisk based (Christoph) - get rid of an old PSI hack in the block layer, moving it to the callers instead where it belongs (Christoph) - blk-throttle fixes and cleanups (Yu) - misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj, Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming, Miaohe, Bart, Coly, Gaosheng) * tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits) sbitmap: fix lockup while swapping block: add rationale for not using blk_mq_plug() when applicable block: adapt blk_mq_plug() to not plug for writes that require a zone lock s390/dasd: use blk_mq_alloc_disk blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all blk-mq: use quiesced elevator switch when reinitializing queues block: replace blk_queue_nowait with bdev_nowait nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data ...	2022-10-07 09:19:14 -07:00
Linus Torvalds	0a78a376ef	for-6.1/io_uring-2022-10-03 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67S0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgppnPEACkBzilBLKwT9MWdUAITwyrMXsAa1R9gsR9 Tb3Xs+mNO2meuycLAUh4LIbb28NNr7/S5rwWet5NRZ71hgv4Q/WA/0EemAGGXYqd +3MEBAWU3FBFkC/cJXCnT8F5yCXYRkT5n/hzCSNEpNKjQ5JnAhHDlWAjgzZRuD/A A+YJjoBVJJuI1wY4I5XCpeQXEmg/Wc1MgXfyHgWVtGKnYrrxibiCnBZnqbAMZNvD hGn1Vl02ooamGTFm/nW/OAt71DtqsjWUCVOHKmlZ+zBUjbUj6FMXmPVV7vCV9o2w PT4Dx3CTc2iXwa8KfEFNPvXBzy0Qfu8edweP/MvZHWHVZREpEAh4cG6GhwW8whD+ 5mPisqmRjZKe0BBS4k/wKN1RXEypSQoTU4EdljfbQPU/usn35lmjMmEXXgs3IhqM fcTdO5ZUOp+CGyzI0Bc7UtS8vilJbX9ynN8G80MUUAZzuQg39MH7lNQYSJSSvJfU OlvzmL3lhRLYM1s/KKiZzdDBoMvC7R4oHmzCveOjQTMIHf6WNyqKFlrWScq2wzpN oRxqt0xiVQ3PFMmFj6N08f145qtbASuF3sKv7dbU3QXTsCAos3wdTdX+PejYApEZ W3dr0TDjNBicLNVPiSj132p0ZRtdZvLGuGVkBD4GPQeH2NwswxMHQAfz8e2lqmA4 9bWG6BM7Yw== =m9kX -----END PGP SIGNATURE----- Merge tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux Pull io_uring updates from Jens Axboe: - Add supported for more directly managed task_work running. This is beneficial for real world applications that end up issuing lots of system calls as part of handling work. Normal task_work will always execute as we transition in and out of the kernel, even for "unrelated" system calls. It's more efficient to defer the handling of io_uring's deferred work until the application wants it to be run, generally in batches. As part of ongoing work to write an io_uring network backend for Thrift, this has been shown to greatly improve performance. 
(Dylan) - Add IOPOLL support for passthrough (Kanchan) - Improvements and fixes to the send zero-copy support (Pavel) - Partial IO handling fixes (Pavel) - CQE ordering fixes around CQ ring overflow (Pavel) - Support sendto() for non-zc as well (Pavel) - Support sendmsg for zerocopy (Pavel) - Networking iov_iter fix (Stefan) - Misc fixes and cleanups (Pavel, me) * tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux: (56 commits) io_uring/net: fix notif cqe reordering io_uring/net: don't update msg_name if not provided io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL io_uring/rw: defer fsnotify calls to task context io_uring/net: fix fast_iov assignment in io_setup_async_msg() io_uring/net: fix non-zc send with address io_uring/net: don't skip notifs for failed requests io_uring/rw: don't lose short results on io_setup_async_rw() io_uring/rw: fix unexpected link breakage io_uring/net: fix cleanup double free free_iov init io_uring: fix CQE reordering io_uring/net: fix UAF in io_sendrecv_fail() selftest/net: adjust io_uring sendzc notif handling io_uring: ensure local task_work marks task as running io_uring/net: zerocopy sendmsg io_uring/net: combine fail handlers io_uring/net: rename io_sendzc() io_uring/net: support non-zerocopy sendto io_uring/net: refactor io_setup_async_addr io_uring/net: don't lose partial send_zc on fail ...	2022-10-07 08:52:43 -07:00
Deming Wang	340e134727	block: Remove the repeat word 'can' Remove the repeat word 'can' from the comments of bio_kmalloc. Signed-off-by: Deming Wang <wangdeming@inspur.com> Link: https://lore.kernel.org/r/20221006084450.1513-1-wangdeming@inspur.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-10-06 07:23:13 -06:00
Linus Torvalds	725737e7c2	STATX_DIOALIGN for 6.1 Make statx() support reporting direct I/O (DIO) alignment information. This provides a generic interface for userspace programs to determine whether a file supports DIO, and if so with what alignment restrictions. Specifically, STATX_DIOALIGN works on block devices, and on regular files when their containing filesystem has implemented support. An interface like this has been requested for years, since the conditions for when DIO is supported in Linux have gotten increasingly complex over time. Today, DIO support and alignment requirements can be affected by various filesystem features such as multi-device support, data journalling, inline data, encryption, verity, compression, checkpoint disabling, log-structured mode, etc. Further complicating things, Linux v6.0 relaxed the traditional rule of DIO needing to be aligned to the block device's logical block size; now user buffers (but not file offsets) only need to be aligned to the DMA alignment. The approach of uplifting the XFS specific ioctl XFS_IOC_DIOINFO was discarded in favor of creating a clean new interface with statx(). For more information, see the individual commits and the man page update https://lore.kernel.org/r/20220722074229.148925-1-ebiggers@kernel.org. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYzpV2xQcZWJpZ2dlcnNA Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKwF1AQDetPX5hyuq0/mwikOywLTTJsoHgGY5 euO+dISqjH/InwD9HAQqfPRkdM1j4ml82BjjkAfrhzZXOOWPKJm0zOhMIQg= =0Oav -----END PGP SIGNATURE----- Merge tag 'statx-dioalign-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull STATX_DIOALIGN support from Eric Biggers: "Make statx() support reporting direct I/O (DIO) alignment information. This provides a generic interface for userspace programs to determine whether a file supports DIO, and if so with what alignment restrictions. 
Specifically, STATX_DIOALIGN works on block devices, and on regular files when their containing filesystem has implemented support. An interface like this has been requested for years, since the conditions for when DIO is supported in Linux have gotten increasingly complex over time. Today, DIO support and alignment requirements can be affected by various filesystem features such as multi-device support, data journalling, inline data, encryption, verity, compression, checkpoint disabling, log-structured mode, etc. Further complicating things, Linux v6.0 relaxed the traditional rule of DIO needing to be aligned to the block device's logical block size; now user buffers (but not file offsets) only need to be aligned to the DMA alignment. The approach of uplifting the XFS specific ioctl XFS_IOC_DIOINFO was discarded in favor of creating a clean new interface with statx(). For more information, see the individual commits and the man page update[1]" Link: https://lore.kernel.org/r/20220722074229.148925-1-ebiggers@kernel.org [1] * tag 'statx-dioalign-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: xfs: support STATX_DIOALIGN f2fs: support STATX_DIOALIGN f2fs: simplify f2fs_force_buffered_io() f2fs: move f2fs_force_buffered_io() into file.c ext4: support STATX_DIOALIGN fscrypt: change fscrypt_dio_supported() to prepare for STATX_DIOALIGN vfs: support STATX_DIOALIGN on block devices statx: add direct I/O alignment information	2022-10-03 20:33:41 -07:00
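The alignment rule described in the STATX_DIOALIGN entry above can be sketched in plain C. This is a minimal illustration, not code from the merged series: `dio_request_ok` is a hypothetical helper, and its two alignment parameters stand in for the values the statx(2) man page calls `stx_dio_mem_align` and `stx_dio_offset_align` (field names taken from the man page, not from this log). It assumes a reported alignment of 0 means direct I/O is not supported on the file.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel or libc API): decide whether one
 * direct I/O request satisfies the two alignments reported through
 * STATX_DIOALIGN. The memory alignment applies to the user buffer
 * address; the offset alignment applies to the file offset and to the
 * I/O length. An alignment of 0 is treated as "DIO not supported".
 */
static bool dio_request_ok(uint32_t mem_align, uint32_t offset_align,
                           uintptr_t buf_addr, uint64_t file_offset,
                           size_t length)
{
    if (mem_align == 0 || offset_align == 0)
        return false;               /* DIO unsupported on this file */
    if (buf_addr % mem_align != 0)
        return false;               /* user buffer misaligned */
    if (file_offset % offset_align != 0)
        return false;               /* file offset misaligned */
    return length % offset_align == 0;  /* length must match too */
}
```

With a memory alignment of 4 and an offset alignment of 512, a buffer at address 0x1004 passes the check as long as the file offset and length are multiples of 512, which mirrors the relaxed v6.0 rule that user buffers (but not file offsets) only need DMA alignment.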
Alexander Potapenko	11b331f857	block: kmsan: skip bio block merging logic for KMSAN KMSAN doesn't allow treating adjacent memory pages as such, if they were allocated by different alloc_pages() calls. The block layer however does so: adjacent pages end up being used together. To prevent this, make page_is_mergeable() return false under KMSAN. Link: https://lkml.kernel.org/r/20220915150417.722975-29-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Eric Biggers <ebiggers@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-10-03 14:03:23 -07:00
Alexander Potapenko	f630a5d0ca	kmsan: disable physical page merging in biovec KMSAN metadata for adjacent physical pages may not be adjacent, therefore accessing such pages together may lead to metadata corruption. We disable merging pages in biovec to prevent such corruptions. Link: https://lkml.kernel.org/r/20220915150417.722975-28-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-10-03 14:03:23 -07:00
Kanchan Joshi	3798754793	block: extend functionality to map bvec iterator Extend blk_rq_map_user_iov so that it can handle a bvec iterator, using the new blk_rq_map_user_bvec function. It maps the pages from the bvec iterator into a bio and places the bio into the request. This helper will be used by nvme for the uring-passthrough path when IO is done using pre-mapped buffers. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Suggested-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220930062749.152261-11-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-30 07:51:13 -06:00
Kanchan Joshi	ab89e8e7ca	block: factor out blk_rq_map_bio_alloc helper Move bio allocation logic from bio_map_user_iov to a new helper blk_rq_map_bio_alloc. It is named so because functionality is opposite of what is done inside blk_mq_map_bio_put. This is a prep patch. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220930062749.152261-10-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-30 07:51:13 -06:00
Anuj Gupta	32f1c71b15	block: rename bio_map_put to blk_mq_map_bio_put This patch renames existing bio_map_put function to blk_mq_map_bio_put. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220930062749.152261-9-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-30 07:51:13 -06:00
Anuj Gupta	557654025d	block: add blk_rq_map_user_io Create a helper blk_rq_map_user_io for mapping of vectored as well as non-vectored requests. This will help avoid duplication of code in a few places in scsi and nvme. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220930062749.152261-4-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-30 07:51:13 -06:00
Jens Axboe	ab3e1d3bba	block: allow end_io based requests in the completion batch handling With end_io handlers now being able to potentially pass ownership of the request upon completion, we can allow requests with end_io handlers in the batch completion handling. Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Co-developed-by: Stefan Roesch <shr@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-30 07:49:11 -06:00
Jens Axboe	de671d6116	block: change request end_io handler to pass back a return value Everything is just converted to returning RQ_END_IO_NONE, and there should be no functional changes with this patch. In preparation for allowing the end_io handler to pass ownership back to the block layer, rather than retain ownership of the request. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-30 07:49:09 -06:00
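The ownership hand-off that the end_io return value prepares for can be modelled in a few lines of standalone C. This is a sketch under stated assumptions, not the kernel's code: every type and function name below is hypothetical, and only the idea of a "none" return value (RQ_END_IO_NONE in the entry above) comes from this log. The second value, here `END_IO_FREE`, is an assumed counterpart meaning "the handler is done, the caller may free the request".

```c
/* Hypothetical model of an end_io handler that reports ownership. */
enum end_io_ret_model {
    END_IO_NONE,   /* handler keeps ownership of the request */
    END_IO_FREE,   /* assumed: handler hands the request back */
};

struct request_model {
    int freed_by_block_layer;   /* set when the caller frees it */
};

typedef enum end_io_ret_model (*end_io_fn)(struct request_model *rq);

/* A handler that retains the request, as all converted handlers do. */
static enum end_io_ret_model handler_keeps(struct request_model *rq)
{
    (void)rq;
    return END_IO_NONE;
}

/* A handler that passes ownership back to the caller. */
static enum end_io_ret_model handler_releases(struct request_model *rq)
{
    (void)rq;
    return END_IO_FREE;
}

/* Completion path: free the request only if the handler says so. */
static void complete_request(struct request_model *rq, end_io_fn end_io)
{
    if (end_io(rq) == END_IO_FREE)
        rq->freed_by_block_layer = 1;
}
```

Because the completion path learns from the return value whether it still owns the request, it can collect such requests and release them together, which is what lets end_io based requests join the batch completion handling.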
Jens Axboe	4b6a5d9cea	block: enable batched allocation for blk_mq_alloc_request() The filesystem IO path can take advantage of allocating batches of requests, if the underlying submitter tells the block layer about it through the blk_plug. For passthrough IO, the exported API is the blk_mq_alloc_request() helper, and that one does not allow for request caching. Wire up request caching for blk_mq_alloc_request(), which is generally done without having a bio available upfront. Tested-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-30 07:48:00 -06:00
Jens Axboe	e73a625bc2	block: kill deprecated BUG_ON() in the flush handling We've never had any useful reports from this BUG_ON(), and in fact a number of the BUG_ON()'s in the flush handling need to be turned into more graceful handling. In preparation for allowing batched completions of the end_io handling, where we can enter the flush completion with queuelist having been reused for the batch, get rid of this BUG_ON(). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-30 07:48:00 -06:00
Jens Axboe	5853a7b551	Merge branch 'for-6.1/io_uring' into for-6.1/passthrough * for-6.1/io_uring: (56 commits) io_uring/net: fix notif cqe reordering io_uring/net: don't update msg_name if not provided io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL io_uring/rw: defer fsnotify calls to task context io_uring/net: fix fast_iov assignment in io_setup_async_msg() io_uring/net: fix non-zc send with address io_uring/net: don't skip notifs for failed requests io_uring/rw: don't lose short results on io_setup_async_rw() io_uring/rw: fix unexpected link breakage io_uring/net: fix cleanup double free free_iov init io_uring: fix CQE reordering io_uring/net: fix UAF in io_sendrecv_fail() selftest/net: adjust io_uring sendzc notif handling io_uring: ensure local task_work marks task as running io_uring/net: zerocopy sendmsg io_uring/net: combine fail handlers io_uring/net: rename io_sendzc() io_uring/net: support non-zerocopy sendto io_uring/net: refactor io_setup_async_addr io_uring/net: don't lose partial send_zc on fail ...	2022-09-30 07:47:42 -06:00
Jens Axboe	736feaa3a0	Merge branch 'for-6.1/block' into for-6.1/passthrough * for-6.1/block: (162 commits) sbitmap: fix lockup while swapping block: add rationale for not using blk_mq_plug() when applicable block: adapt blk_mq_plug() to not plug for writes that require a zone lock s390/dasd: use blk_mq_alloc_disk blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all blk-mq: use quiesced elevator switch when reinitializing queues block: replace blk_queue_nowait with bdev_nowait nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data ... Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-30 07:47:38 -06:00
Pankaj Raghav	110fdb4486	block: add rationale for not using blk_mq_plug() when applicable There are two places in the block layer at the moment where blk_mq_plug() helper could be used instead of directly accessing the plug from struct current. In both these cases, directly accessing the plug should not have any consequences for zoned devices. Make the intent explicit by adding comments instead of introducing unwanted checks with blk_mq_plug() helper.[1] [1] https://lore.kernel.org/linux-block/f6e54907-1035-2b2c-6387-ed178be05ccb@kernel.dk/ Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Suggested-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20220929144141.140077-1-p.raghav@samsung.com [axboe: fixup multi-line comment style] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-29 09:02:49 -06:00
Pankaj Raghav	8cafdb5ab9	block: adapt blk_mq_plug() to not plug for writes that require a zone lock The current implementation of blk_mq_plug() disables plugging for all operations that involve a transfer to the device, as it only checks the last bit of the operation via the op_is_write() function. Modify blk_mq_plug() to disable plugging only for REQ_OP_WRITE and REQ_OP_WRITE_ZEROES, as they might require a zone lock. Suggested-by: Christoph Hellwig <hch@lst.de> Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220929074745.103073-2-p.raghav@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-29 07:45:47 -06:00
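The behavioural difference described in the blk_mq_plug() entry above can be shown with a small standalone model. This is an illustrative sketch, not the kernel implementation: the enum, its numeric values, and both `may_plug_*` functions are hypothetical. The only property assumed from the entry is that the low bit of an opcode marks a transfer to the device, which is what the op_is_write() check looks at.

```c
#include <stdbool.h>

/* Illustrative opcodes; odd values carry data towards the device. */
enum req_op_model {
    OP_READ         = 0,
    OP_WRITE        = 1,
    OP_DISCARD      = 3,
    OP_WRITE_ZEROES = 9,
};

/* Model of the low-bit test used to classify an op as a write. */
static bool op_is_write_model(enum req_op_model op)
{
    return (op & 1) != 0;
}

/* Before: on zoned devices, never plug any op with the write bit set. */
static bool may_plug_before(bool zoned, enum req_op_model op)
{
    return !(zoned && op_is_write_model(op));
}

/* After: on zoned devices, skip plugging only for the two ops that
 * might need a zone lock. */
static bool may_plug_after(bool zoned, enum req_op_model op)
{
    return !(zoned && (op == OP_WRITE || op == OP_WRITE_ZEROES));
}
```

In this model a discard on a zoned device is refused plugging by the old check but allowed by the new one, while plain writes are treated the same way by both.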
Christoph Hellwig	5765033cf7	blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep blkg_conf_prep just creates a new blkg structure, there is no real need to update the lookup hint which should only be done on a successful lookup in the I/O path. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220927065425.257876-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-27 11:50:05 -06:00
Keith Busch	8237c01f16	blk-mq: use quiesced elevator switch when reinitializing queues The hctx's run_work may be racing with the elevator switch when reinitializing hardware queues. The queue is merely frozen in this context, but that only prevents requests from allocating and doesn't stop the hctx work from running. The work may get an elevator pointer that's being torn down, and can result in use-after-free errors and kernel panics (example below). Use the quiesced elevator switch instead, and make the previous one static since it is now only used locally. nvme nvme0: resetting controller nvme nvme0: 32/0/0 default/read/poll queues BUG: kernel NULL pointer dereference, address: 0000000000000008 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 80000020c8861067 P4D 80000020c8861067 PUD 250f8c8067 PMD 0 Oops: 0000 [#1] SMP PTI Workqueue: kblockd blk_mq_run_work_fn RIP: 0010:kyber_has_work+0x29/0x70 ... Call Trace: __blk_mq_do_dispatch_sched+0x83/0x2b0 __blk_mq_sched_dispatch_requests+0x12e/0x170 blk_mq_sched_dispatch_requests+0x30/0x60 __blk_mq_run_hw_queue+0x2b/0x50 process_one_work+0x1ef/0x380 worker_thread+0x2d/0x3e0 Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220927155652.3260724-1-kbusch@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-27 09:58:56 -06:00
Christoph Hellwig	568ec936bf	block: replace blk_queue_nowait with bdev_nowait Replace blk_queue_nowait with a bdev_nowait helper that takes the block_device, given that the I/O submission path should not have to look into the request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20220927075815.269694-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-27 09:57:58 -06:00
Christoph Hellwig	99e6038743	blk-cgroup: pass a gendisk to the blkg allocation helpers Prepare for storing the blkcg information in the gendisk instead of the request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:17:28 -06:00
Christoph Hellwig	de185b56e8	blk-cgroup: pass a gendisk to blkcg_schedule_throttle Pass the gendisk to blkcg_schedule_throttle as part of moving the blk-cgroup infrastructure to be gendisk based. Remove the unused !BLK_CGROUP stub while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:17:28 -06:00
Christoph Hellwig	00ad6991bb	blk-cgroup: pass a gendisk to blkg_destroy_all Pass the gendisk to blkg_destroy_all as part of moving the blk-cgroup infrastructure to be gendisk based. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:17:28 -06:00
Christoph Hellwig	cad9266abc	blk-throttle: pass a gendisk to blk_throtl_cancel_bios Pass the gendisk to blk_throtl_cancel_bios as part of moving the blk-cgroup infrastructure to be gendisk based. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:17:28 -06:00
Christoph Hellwig	5f6dc7522a	blk-throttle: pass a gendisk to blk_throtl_register_queue Pass the gendisk to blk_throtl_register_queue as part of moving the blk-cgroup infrastructure to be gendisk based. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:17:27 -06:00
Christoph Hellwig	e13793bae6	blk-throttle: pass a gendisk to blk_throtl_init and blk_throtl_exit Pass the gendisk to blk_throtl_init and blk_throtl_exit as part of moving the blk-cgroup infrastructure to be gendisk based. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:17:27 -06:00
Christoph Hellwig	3657647e33	blk-iocost: cleanup ioc_qos_write Use a local disk variable instead of retrieving the disk and request_queue over and over by various means. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:17:27 -06:00
Christoph Hellwig	57b6455497	blk-iocost: pass a gendisk to blk_iocost_init Pass the gendisk to blk_iocost_init as part of moving the blk-cgroup infrastructure to be gendisk based. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:17:27 -06:00
Christoph Hellwig	9df3e65139	blk-iocost: simplify ioc_name Just directly dereference the disk name instead of going through multiple hoops to find the same value. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:17:27 -06:00
Christoph Hellwig	16fac1b591	blk-iolatency: pass a gendisk to blk_iolatency_init Pass the gendisk to blk_iolatency_init as part of moving the blk-cgroup infrastructure to be gendisk based. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-9-hch@lst.de [axboe: missed inline for blk_iolatency_init() and !CONFIG_BLK_CGROUP_IOLATENCY] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:17:24 -06:00
Christoph Hellwig	b0dde3f5d6	blk-ioprio: pass a gendisk to blk_ioprio_init and blk_ioprio_exit Pass the gendisk to blk_ioprio_init and blk_ioprio_exit as part of moving the blk-cgroup infrastructure to be gendisk based. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:09:31 -06:00
Christoph Hellwig	9823538fb7	blk-cgroup: pass a gendisk to blkcg_init_queue and blkcg_exit_queue Pass the gendisk to blkcg_init_disk and blkcg_exit_disk as part of moving the blk-cgroup infrastructure to be gendisk based. Also remove the rather pointless kerneldoc comments for these internal functions with a single caller each. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:09:31 -06:00
Christoph Hellwig	f753526e32	blk-cgroup: remove blkg_lookup_check The combinations of an error check with an ERR_PTR return and a lookup with a NULL return leads to ugly handling of the return values in the callers. Just open coding the check and the lookup is much simpler. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:09:31 -06:00
Christoph Hellwig	4a69f325aa	blk-cgroup: cleanup the blkg_lookup family of functions Add a fully inlined blkg_lookup, as the extra two checks aren't going to generate a lot more code vs the call to the slowpath routine, and open code the hint update in the two callers that care. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:09:31 -06:00
Christoph Hellwig	79fcc5be93	blk-cgroup: remove open coded blkg_lookup instances Use blkg_lookup instead of open coding it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:09:31 -06:00
Christoph Hellwig	928f6f00a9	blk-cgroup: remove blk_queue_root_blkg Just open code it in the only caller and drop the unused !BLK_CGROUP stub. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:09:31 -06:00
Christoph Hellwig	33dc62796c	blk-cgroup: fix error unwinding in blkcg_init_queue When blk_throtl_init fails, we need to call blk_ioprio_exit. Switch to proper goto based unwinding to fix this. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:09:31 -06:00
Linus Torvalds	0be27f7be2	block-6.0-2022-09-22 Merge tag 'block-6.0-2022-09-22' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: "Fix a regression that's been plaguing us by reverting the offending commit, as attempts to both reproduce the issue and fix it in a saner fashion have failed. Fix for a potential oops condition in the s390 dasd block driver" * tag 'block-6.0-2022-09-22' of git://git.kernel.dk/linux: Revert "block: freeze the queue earlier in del_gendisk" s390/dasd: fix Oops in dasd_alias_get_start_dev due to missing pavgroup	2022-09-24 08:22:53 -07:00
Liu Song	f168420c62	blk-mq: don't redirect completion for hctx withs only one ctx mapping High-performance NVMe devices usually support a large hw queues, which ensures a 1:1 mapping of hctx and ctx. In this case there will be no remote request, so we don't need to care about it. Signed-off-by: Liu Song <liusong@linux.alibaba.com> Link: https://lore.kernel.org/r/1663731123-81536-1-git-send-email-liusong@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-24 09:02:30 -06:00
Yu Kuai	81c7a63abc	blk-throttle: improve bypassing bios checkings "tg->has_rules" is extended to "tg->has_rules_iops/bps", thus bios that don't need to be throttled can be checked accurately. With this patch, bio will be throttled if: 1) Bio is read/write, and corresponding read/write iops limit exist. 2) If corresponding doesn't exist, corresponding bps limit exist and bio is not throttled before. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921095309.1481289-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-24 08:59:43 -06:00
Yu Kuai	8549674990	blk-throttle: remove THROTL_TG_HAS_IOPS_LIMIT Currently, "tg->has_rules" and "tg->flags & THROTL_TG_HAS_IOPS_LIMIT" both try to bypass bios that don't need to be throttled, however, they are a little redundant and both not perfect: 1) "tg->has_rules" only distinguish read and write, but not iops and bps limit. 2) "tg->flags & THROTL_TG_HAS_IOPS_LIMIT" only check if iops limit exist, read and write is not distinguished, and bps limit is not checked. tg->has_rules will be extended to distinguish bps and iops in the following patch. There is no need to keep the flag. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921095309.1481289-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-24 08:59:43 -06:00
Tejun Heo	026e14a276	Merge branch 'for-6.0-fixes' into for-6.1 for-6.0 has the following fix for cgroup_get_from_id(). 836ac87d ("cgroup: fix cgroup_get_from_id") which conflicts with the following two commits in for-6.1. `4534dee9` ("cgroup: cgroup: Honor caller's cgroup NS when resolving cgroup id") `fa7e439c` ("cgroup: Homogenize cgroup_get_from_id() return value") While the resolution is straightforward, the code ends up pretty ugly afterwards. Let's pull for-6.0-fixes into for-6.1 so that the code can be fixed up there. Signed-off-by: Tejun Heo <tj@kernel.org>	2022-09-23 07:19:38 -10:00
Li Jinlin	9713a67067	block/blk-rq-qos: delete useless enmu RQ_QOS_IOPRIO Since blk-ioprio handling was converted from a rqos policy to a direct call, RQ_QOS_IOPRIO is not used anymore, just delete it. Signed-off-by: Li Jinlin <lijinlin3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220916023241.32926-1-lijinlin3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-21 19:50:53 -06:00
Kanchan Joshi	c6e99ea482	block: export blk_rq_is_poll This is in preparation to support iopoll for nvme passthrough. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220823161443.49436-4-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-21 10:30:42 -06:00
Yu Kuai	8c5035dfbb	blk-wbt: call rq_qos_add() after wb_normal is initialized Our test found a problem that wbt inflight counter is negative, which will cause io hang(noted that this problem doesn't exist in mainline): t1: device create t2: issue io add_disk blk_register_queue wbt_enable_default wbt_init rq_qos_add // wb_normal is still 0 /* * in mainline, disk can't be opened before * bdev_add(), however, in old kernels, disk * can be opened before blk_register_queue(). */ blkdev_issue_flush // disk size is 0, however, it's not checked submit_bio_wait submit_bio blk_mq_submit_bio rq_qos_throttle wbt_wait bio_to_wbt_flags rwb_enabled // wb_normal is 0, inflight is not increased wbt_queue_depth_changed(&rwb->rqos); wbt_update_limits // wb_normal is initialized rq_qos_track wbt_track rq->wbt_flags |= bio_to_wbt_flags(rwb, bio); // wb_normal is not 0, wbt_flags will be set t3: io completion blk_mq_free_request rq_qos_done wbt_done wbt_is_tracked // return true __wbt_done wbt_rqw_done atomic_dec_return(&rqw->inflight); // inflight is decreased commit `8235b5c1e8` ("block: call bdev_add later in device_add_disk") can avoid this problem, however it's better to fix this problem in wbt: 1) Lower kernel can't backport this patch due to lots of refactor. 2) Root cause is that wbt call rq_qos_add() before wb_normal is initialized. Fixes: `e34cbd3074` ("blk-wbt: add general throttling mechanism") Cc: <stable@vger.kernel.org> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20220913105749.3086243-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-21 08:36:13 -06:00
Li zeming	a7609c68f7	blk-iocost: Remove unnecessary (void*) conversions The key pointer is void and hence does not need an explicit cast. Signed-off-by: Li zeming <zeming@nfschina.com> Link: https://lore.kernel.org/r/20220919012825.2936-1-zeming@nfschina.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-20 08:28:30 -06:00
Christoph Hellwig	118f3663fb	block: remove PSI accounting from the bio layer PSI accounting is now done by the VM code, where it should have been since the beginning. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220915094200.139713-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-20 08:24:38 -06:00
Ping-Xiang Chen	e88480871b	block: fix comment typo in submit_bio of block-core.c. This patch fix a comment typo in block-core.c. Signed-off-by: Ping-Xiang Chen <p.x.chen@uci.edu> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220914074237.31621-1-p.x.chen@uci.edu Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-20 08:22:47 -06:00
Christoph Hellwig	4c66a326b5	Revert "block: freeze the queue earlier in del_gendisk" This reverts commit `a09b314005`. Dusty Mabe reported consistent hang during CoreOS shutdown with a MD RAID1 setup. Although apparently similar hangs happened before, and this patch most likely is not the root cause it made it much more severe. Revert it until we can figure out what is going on with the md driver. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220919144049.978907-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-20 08:15:44 -06:00
Linus Torvalds	68e777e44c	block-6.0-2022-09-16 Merge tag 'block-6.0-2022-09-16' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Two fixes for -rc6: - Fix a mixup of sectors and bytes in the secure erase ioctl (Mikulas) - Fix for a bad return value for a non-blocking bio/blk queue enter call (me)" * tag 'block-6.0-2022-09-16' of git://git.kernel.dk/linux-block: blk-lib: fix blkdev_issue_secure_erase block: blk_queue_enter() / __bio_queue_enter() must return -EAGAIN for nowait	2022-09-16 06:58:04 -07:00
Mikulas Patocka	c4fa368466	blk-lib: fix blkdev_issue_secure_erase There's a bug in blkdev_issue_secure_erase. The statement "unsigned int len = min_t(sector_t, nr_sects, max_sectors);" sets the variable "len" to the length in sectors, but the statement "bio->bi_iter.bi_size = len" treats it as if it were in bytes. The statements "sector += len << SECTOR_SHIFT" and "nr_sects -= len << SECTOR_SHIFT" are thinko. This patch fixes it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # v5.19 Fixes: `44abff2c0b` ("block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD") Link: https://lore.kernel.org/r/alpine.LRH.2.02.2209141549480.28100@file01.intranet.prod.int.rdu2.redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-15 00:25:17 -06:00
Stefan Roesch	56f99b8d06	block: blk_queue_enter() / __bio_queue_enter() must return -EAGAIN for nowait Today blk_queue_enter() and __bio_queue_enter() return -EBUSY for the nowait code path. This is not correct: they should return -EAGAIN instead. This problem was detected by fio. The following command exposed the above problem: t/io_uring -p0 -d128 -b4096 -s32 -c32 -F1 -B0 -R0 -X1 -n24 -P1 -u1 -O0 /dev/ng0n1 By applying the patch, the retry case is handled correctly in the slow path. Signed-off-by: Stefan Roesch <shr@fb.com> Fixes: `bfd343aa17` ("blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-13 15:06:39 -06:00
Yu Kuai	c013710e1a	blk-throttle: cleanup tg_update_disptime() tg_update_disptime() only need to adjust position for 'tg' in 'parent_sq', there is no need to call throtl_enqueue/dequeue_tg(), since they will set/clear flag THROTL_TG_PENDING and increase/decrease nr_pending, which is useless. By the way, clear the flag/decrease nr_pending while there are still throttled bios is not good for debugging. There are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220827101637.1775111-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-12 00:20:08 -06:00
Yu Kuai	8c25ed0cb9	blk-throttle: calling throtl_dequeue/enqueue_tg in pairs It's a little weird to call throtl_dequeue_tg() unconditionally in throtl_select_dispatch(), since it will be called in tg_update_disptime() again if some bio is still throttled. Thus call it later if there are no throttled bio. There are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220827101637.1775111-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-12 00:20:07 -06:00
Yu Kuai	7e9c5c54d4	blk-throttle: use 'READ/WRITE' instead of '0/1' Make the code easier to read, like everywhere else. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220827101637.1775111-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-12 00:20:07 -06:00
Yu Kuai	a880ae93e5	blk-throttle: fix io hung due to configuration updates If new configuration is submitted while a bio is throttled, then new waiting time is recalculated regardless that the bio might already wait for some time: tg_conf_updated throtl_start_new_slice tg_update_disptime throtl_schedule_next_dispatch Then io hung can be triggered by always submitting new configuration before the throttled bio is dispatched. Fix the problem by respecting the time that throttled bio already waited. In order to do that, add new fields to record how many bytes/io are waited, and use it to calculate wait time for throttled bio under new configuration. Some simple test: 1) cd /sys/fs/cgroup/blkio/ echo $$ > cgroup.procs echo "8:0 2048" > blkio.throttle.write_bps_device { sleep 2 echo "8:0 1024" > blkio.throttle.write_bps_device } & dd if=/dev/zero of=/dev/sda bs=8k count=1 oflag=direct 2) cd /sys/fs/cgroup/blkio/ echo $$ > cgroup.procs echo "8:0 1024" > blkio.throttle.write_bps_device { sleep 4 echo "8:0 2048" > blkio.throttle.write_bps_device } & dd if=/dev/zero of=/dev/sda bs=8k count=1 oflag=direct test results: io finish time before this patch with this patch 1) 10s 6s 2) 8s 6s Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220829022240.3348319-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-12 00:19:48 -06:00
Yu Kuai	681cd46fff	blk-throttle: factor out code to calculate ios/bytes_allowed No functional changes, new apis will be used in later patches to calculate wait time for throttled bios when new configuration is submitted. Noted this patch also rename tg_with_in_iops/bps_limit() to tg_within_iops/bps_limit(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220829022240.3348319-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-12 00:19:48 -06:00
Yu Kuai	8d6bbaada2	blk-throttle: prevent overflow while calculating wait time There is a problem found by code review in tg_with_in_bps_limit() that 'bps_limit * jiffy_elapsed_rnd' might overflow. Fix the problem by calling mul_u64_u64_div_u64() instead. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220829022240.3348319-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-12 00:19:48 -06:00
Yu Kuai	320fb0f91e	blk-throttle: fix that io throttle can only work for single bio Test scripts: cd /sys/fs/cgroup/blkio/ echo "8:0 1024" > blkio.throttle.write_bps_device echo $$ > cgroup.procs dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct & dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct & Test result: 10240 bytes (10 kB, 10 KiB) copied, 10.0134 s, 1.0 kB/s 10240 bytes (10 kB, 10 KiB) copied, 10.0135 s, 1.0 kB/s The problem is that the second bio is finished after 10s instead of 20s. Root cause: 1) second bio will be flagged: __blk_throtl_bio while (true) { ... if (sq->nr_queued[rw]) -> some bio is throttled already break }; bio_set_flag(bio, BIO_THROTTLED); -> flag the bio 2) flagged bio will be dispatched without waiting: throtl_dispatch_tg tg_may_dispatch tg_with_in_bps_limit if (bps_limit == U64_MAX || bio_flagged(bio, BIO_THROTTLED)) *wait = 0; -> wait time is zero return true; commit `9f5ede3c01` ("block: throttle split bio in case of iops limit") support to count split bios for iops limit, thus it adds flagged bio checking in tg_with_in_bps_limit() so that split bios will only count once for bps limit, however, it introduce a new problem that io throttle won't work if multiple bios are throttled. In order to fix the problem, handle iops/bps limit in different ways: 1) for iops limit, there is no flag to record if the bio is throttled, and iops is always applied. 2) for bps limit, original bio will be flagged with BIO_BPS_THROTTLED, and io throttle will ignore bio with the flag. Noted this patch also remove the code to set flag in __bio_clone(), it's introduced in commit `111be88398` ("block-throttle: avoid double charge"), and author thinks split bio can be resubmited and throttled again, which is wrong because split bio will continue to dispatch from caller. Fixes: `9f5ede3c01` ("block: throttle split bio in case of iops limit") Cc: <stable@vger.kernel.org> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220829022240.3348319-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-12 00:19:48 -06:00
Keith Busch	4acb83417c	sbitmap: fix batched wait_cnt accounting Batched completions can clear multiple bits, but we're only decrementing the wait_cnt by one each time. This can cause waiters to never be woken, stalling IO. Use the batched count instead. Link: https://bugzilla.kernel.org/show_bug.cgi?id=215679 Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20220909184022.1709476-1-kbusch@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-12 00:10:34 -06:00
Eric Biggers	2d985f8c6b	vfs: support STATX_DIOALIGN on block devices Add support for STATX_DIOALIGN to block devices, so that direct I/O alignment restrictions are exposed to userspace in a generic way. Note that this breaks the tradition of stat operating only on the block device node, not the block device itself. However, it was felt that doing this is preferable, in order to make the interface useful and avoid needing separate interfaces for regular files and block devices. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Link: https://lore.kernel.org/r/20220827065851.135710-3-ebiggers@kernel.org	2022-09-11 19:47:12 -05:00
Linus Torvalds	9ebc0ecb21	block-6.0-2022-09-09 Merge tag 'block-6.0-2022-09-09' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - NVMe pull via Christoph: - fix a use after free in nvmet (Bart Van Assche) - fix a use after free when detecting digest errors (Sagi Grimberg) - fix regression that causes sporadic TCP requests to time out (Sagi Grimberg) - fix two off by ones errors in the nvmet ZNS support (Dennis Maisenbacher) - requeue aen after firmware activation (Keith Busch) - Fix missing request flags in debugfs code (me) - Partition scan fix (Ming) * tag 'block-6.0-2022-09-09' of git://git.kernel.dk/linux-block: block: add missing request flags to debugfs code nvme: requeue aen after firmware activation nvmet: fix mar and mor off-by-one errors nvme-tcp: fix regression that causes sporadic requests to time out nvme-tcp: fix UAF when detecting digest errors nvmet: fix a use-after-free block: don't add partitions if GD_SUPPRESS_PART_SCAN is set	2022-09-09 15:03:08 -04:00
Jens Axboe	745ed37277	block: add missing request flags to debugfs code We're missing TIMED_OUT and RESV. Particularly the former is handy for debugging, let's get them added. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-09 05:57:52 -06:00
Miaohe Lin	bdb7d420c6	block: remove unneeded return value of bio_check_ro() bio_check_ro() always return false now. Remove this unneeded return value and cleanup the sole caller. No functional change intended. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Link: https://lore.kernel.org/r/20220905102754.1942-1-linmiaohe@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-05 11:45:35 -06:00
Miaohe Lin	6d5e8d21e8	blk-mq: remove unneeded needs_restart check If code reaches here, needs_restart must be true. Remove this unneeded needs_restart check. No functional change intended. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Link: https://lore.kernel.org/r/20220905101950.4606-1-linmiaohe@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-05 11:45:24 -06:00
Jiapeng Chong	91e5adda5c	block/blk-map: Remove set but unused variable 'added' The variable added is not effectively used in the function, so delete it. block/blk-map.c:273:16: warning: variable 'added' set but not used. Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2049 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Link: https://lore.kernel.org/r/20220905063253.120082-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-05 11:44:37 -06:00
Yu Kuai	2d8f7a3b9f	blk-throttle: clean up codes that can't be reached While doing code coverage testing while CONFIG_BLK_DEV_THROTTLING_LOW is disabled, we found that there are many codes can never be reached. This patch move such codes inside "#ifdef CONFIG_BLK_DEV_THROTTLING_LOW". Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220903062826.1099085-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-04 14:38:18 -06:00
Jens Axboe	bce1b56c73	Revert "sbitmap: fix batched wait_cnt accounting" This reverts commit `16ede66973`. This is causing issues with CPU stalls on my test box, revert it for now until we understand what is going on. It looks like infinite looping off sbitmap_queue_wake_up(), but hard to tell with a lot of CPUs hitting this issue and the console scrolling infinitely. Link: https://lore.kernel.org/linux-block/e742813b-ce5c-0d58-205b-1626f639b1bd@kernel.dk/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-04 06:39:25 -06:00
Ming Lei	748008e1da	block: don't add partitions if GD_SUPPRESS_PART_SCAN is set Commit `b9684a71fc` ("block, loop: support partitions without scanning") adds GD_SUPPRESS_PART_SCAN for replacing part function of GENHD_FL_NO_PART. But looks blk_add_partitions() is missed, since loop doesn't want to add partitions if GENHD_FL_NO_PART was set. And it causes regression on libblockdev (as called from udisks) which operates with the LO_FLAGS_PARTSCAN. Fixes the issue by not adding partitions if GD_SUPPRESS_PART_SCAN is set. Fixes: `b9684a71fc` ("block, loop: support partitions without scanning") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220823103819.395776-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-03 11:29:03 -06:00
Jens Axboe	12c5b70c18	block: enable per-cpu bio caching for the fs bio set This is useful for polled IO on a file, or for polled IO with the io_uring passthrough mechanism. If bio allocations are done with REQ_POLLED for those cases, then initializing the bio set with BIOSET_PERCPU_CACHE enables the local per-cpu cache which eliminates allocations (and frees) of bio structs when possible. Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-02 13:03:33 -06:00
Keith Busch	16ede66973	sbitmap: fix batched wait_cnt accounting Batched completions can clear multiple bits, but we're only decrementing the wait_cnt by one each time. This can cause waiters to never be woken, stalling IO. Use the batched count instead. Link: https://bugzilla.kernel.org/show_bug.cgi?id=215679 Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20220825145312.1217900-1-kbusch@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-01 10:42:41 -06:00
Michal Koutný	fa7e439cf9	cgroup: Homogenize cgroup_get_from_id() return value Cgroup id is user provided datum hence extend its return domain to include possible error reason (similar to cgroup_get_from_fd()). This change also fixes commit `d4ccaf58a8` ("bpf: Introduce cgroup iter") that would use NULL instead of proper error handling in `d4ccaf58a8` ("bpf: Introduce cgroup iter"). Additionally, neither of: fc_appid_store, bpf_iter_attach_cgroup, mem_cgroup_get_from_ino (callers of cgroup_get_from_fd) is built without CONFIG_CGROUPS (depends via CONFIG_BLK_CGROUP, direct, transitive CONFIG_MEMCG respectively) transitive, so drop the singular definition not needed with !CONFIG_CGROUPS. Fixes: `d4ccaf58a8` ("bpf: Introduce cgroup iter") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2022-08-26 10:57:41 -10:00
Linus Torvalds	3e5c673f0d	block-6.0-2022-08-26 Merge tag 'block-6.0-2022-08-26' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - MD pull request via Song: - Fix for clustered raid (Guoqing Jiang) - req_op fix (Bart Van Assche) - Fix race condition in raid recreate (David Sloan) - loop configuration overflow fix (Siddh) - Fix missing commit_rqs call for certain conditions (Yu) * tag 'block-6.0-2022-08-26' of git://git.kernel.dk/linux-block: md: call __md_stop_writes in md_stop Revert "md-raid: destroy the bitmap after destroying the thread" md: Flush workqueue md_rdev_misc_wq in md_alloc() md/raid10: Fix the data type of an r10_sync_page_io() argument loop: Check for overflow while configuring loop blk-mq: fix io hung due to missing commit_rqs	2022-08-26 11:05:54 -07:00
Jens Axboe	e88811bc43	block: use on-stack page vec for <= UIO_FASTIOV Avoid a kmalloc+kfree for each page array, if we only have a few pages that are mapped. An alloc+free for each IO is quite expensive, and it's pretty pointless if we're only dealing with 1 or a few vecs. Use UIO_FASTIOV like we do in other spots to set a sane limit for how big of an IO we want to avoid allocations for. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-22 10:07:56 -06:00
Jens Axboe	8af870aa5b	block: enable bio caching use for passthru IO bdev based polled O_DIRECT is currently quite a bit faster than passthru on the same device, and one of the reasons is that we're not able to use the bio caching for passthru IO. If REQ_POLLED is set on the request, use the fs bio set for grabbing a bio from the caches, if available. This saves 5-6% of CPU overhead for polled passthru IO. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-22 10:07:56 -06:00
Jens Axboe	f5d632d15e	block: shrink rq_map_data a bit We don't need full ints for several of these members. Change the page_order and nr_entries to unsigned shorts, and the true/false from_user and null_mapped to booleans. This shrinks the struct from 32 to 24 bytes on 64-bit archs. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-22 10:07:56 -06:00
Yu Kuai	d322f355e9	block, bfq: remove useless parameter for bfq_add/del_bfqq_busy() 'bfqd' can be accessed through 'bfqq->bfqd', there is no need to pass it as a parameter separately. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220816015631.1323948-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-22 10:07:56 -06:00
Yu Kuai	1e3cc2125d	block, bfq: remove useless checking in bfq_put_queue() 'bfqq->bfqd' is ensured to set in bfq_init_queue(), and it will never change afterwards. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220816015631.1323948-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-22 10:07:56 -06:00
Yu Kuai	c2090eabac	block, bfq: remove unused functions While doing code coverage testing(CONFIG_BFQ_CGROUP_DEBUG is disabled), we found that some functions doesn't have caller, thus remove them. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220816015631.1323948-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-22 10:07:56 -06:00
Bart Van Assche	a4e1d0b76e	block: Change the return type of blk_mq_map_queues() into void Since blk_mq_map_queues() and the .map_queues() callbacks always return 0, change their return type into void. Most callers ignore the returned value anyway. Cc: Christoph Hellwig <hch@lst.de> Cc: Jason Wang <jasowang@redhat.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Doug Gilbert <dgilbert@interlog.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: John Garry <john.garry@huawei.com> Acked-by: Md Haris Iqbal <haris.iqbal@ionos.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Link: https://lore.kernel.org/r/20220815170043.19489-3-bvanassche@acm.org [axboe: fold in fix from Bart] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-22 10:07:53 -06:00
dougmill@linux.vnet.ibm.com	c6ea706042	block: sed-opal: Add ioctl to return device status Provide a mechanism to retrieve basic status information about the device, including the "supported" flag indicating whether SED-OPAL is supported. The information returned is from the various feature descriptors received during the discovery0 step, and so this ioctl does nothing more than perform the discovery0 step and then save the information received. See "struct opal_status" and OPAL_FL_* bits for the status information currently returned. This is necessary to be able to check whether a device is OPAL enabled, set up, locked or unlocked from userspace programs like systemd-cryptsetup and libcryptsetup. Right now we just have to assume the user 'knows' or blindly attempt setup/lock/unlock operations. Signed-off-by: Douglas Miller <dougmill@linux.vnet.ibm.com> Tested-by: Luca Boccassi <bluca@debian.org> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Link: https://lore.kernel.org/r/20220816140713.84893-1-luca.boccassi@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-22 07:52:51 -06:00
Linus Torvalds	b9bce6e553	block-6.0-2022-08-19 Merge tag 'block-6.0-2022-08-19' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "A few fixes that should go into this release: - Small series of patches for ublk (ZiyangZhang) - Remove dead function (Yu) - Fix for running a block queue in case of resource starvation (Yufen)" * tag 'block-6.0-2022-08-19' of git://git.kernel.dk/linux-block: blk-mq: run queue no matter whether the request is the last request blk-mq: remove unused function blk_mq_queue_stopped() ublk_drv: do not add a re-issued request aborted previously to ioucmd's task_work ublk_drv: update comment for __ublk_fail_req() ublk_drv: check ubq_daemon_is_dying() in __ublk_rq_task_work() ublk_drv: update iod->addr for UBLK_IO_NEED_GET_DATA	2022-08-20 10:17:05 -07:00
Yu Kuai	65fac0d54f	blk-mq: fix io hung due to missing commit_rqs Currently, in virtio_scsi, if 'bd->last' is not set to true while dispatching request, such io will stay in driver's queue, and driver will wait for block layer to dispatch more rqs. However, if block layer failed to dispatch more rq, it should trigger commit_rqs to inform driver. There is a problem in blk_mq_try_issue_list_directly() that commit_rqs won't be called: // assume that queue_depth is set to 1, list contains two rq blk_mq_try_issue_list_directly blk_mq_request_issue_directly // dispatch first rq // last is false __blk_mq_try_issue_directly blk_mq_get_dispatch_budget // succeed to get first budget __blk_mq_issue_directly scsi_queue_rq cmd->flags |= SCMD_LAST virtscsi_queuecommand kick = (sc->flags & SCMD_LAST) != 0 // kick is false, first rq won't issue to disk queued++ blk_mq_request_issue_directly // dispatch second rq __blk_mq_try_issue_directly blk_mq_get_dispatch_budget // failed to get second budget ret == BLK_STS_RESOURCE blk_mq_request_bypass_insert // errors is still 0 if (!list_empty(list) || errors && ...) // won't pass, commit_rqs won't be called In this situation, first rq relied on second rq to dispatch, while second rq relied on first rq to complete, thus they will both hung. Fix the problem by also treat 'BLK_STS_RESOURCE' as 'errors' since it means that request is not queued successfully. Same problem exists in blk_mq_dispatch_rq_list(), 'BLK_STS_RESOURCE' can't be treated as 'errors' here, fix the problem by calling commit_rqs if queue_rq return 'BLK_STS_*RESOURCE'. Fixes: `d666ba98f8` ("blk-mq: add mq_ops->commit_rqs()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220726122224.1790882-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-19 20:45:58 +00:00
Yufen Yu	d3b3859687	blk-mq: run queue no matter whether the request is the last request We do test on a virtio scsi device (/dev/sda) and the default mq scheduler is 'none'. We found an IO hang as follows: blk_finish_plug blk_mq_plug_issue_direct scsi_mq_get_budget //get budget_token fail and sdev->restarts=1 scsi_end_request scsi_run_queue_async //sdev->restart=0 and run queue blk_mq_request_bypass_insert //add request to hctx->dispatch list //continue to dispatch plug list blk_mq_dispatch_plug_list blk_mq_try_issue_list_directly //success issue all requests from plug list After .get_budget fail, scsi_mq_get_budget will increase 'restarts'. Normally, it will run hw queue when io complete and set 'restarts' as 0. But if we run queue before adding request to the dispatch list and blk_mq_dispatch_plug_list also success issue all requests, then no one will run queue, and the request will be stalled in the dispatch list and cannot complete forever. It is wrong to use last request of plug list to decide if run queue is needed since all the remaining requests in plug list may be from other hctxs. To fix the bug, pass run_queue as true always to blk_mq_request_bypass_insert(). Fix-suggested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Fixes: `dc5fc361d8` ("block: attempt direct issue of plug list") Link: https://lore.kernel.org/r/20220803023355.3687360-1-yuyufen@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-18 07:39:01 -06:00
Yu Kuai	a8239f0342	blk-mq: remove unused function blk_mq_queue_stopped() blk_mq_queue_stopped() doesn't have any caller, which was found by code coverage test, thus remove it. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20220818063555.3741222-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-18 07:38:10 -06:00
Linus Torvalds	abe7a481aa	block-6.0-2022-08-12 Merge tag 'block-6.0-2022-08-12' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - NVMe pull request - print nvme connect Linux error codes properly (Amit Engel) - fix the fc_appid_store return value (Christoph Hellwig) - fix a typo in an error message (Christophe JAILLET) - add another non-unique identifier quirk (Dennis P. Kliem) - check if the queue is allocated before stopping it in nvme-tcp (Maurizio Lombardi) - restart admin queue if the caller needs to restart queue in nvme-fc (Ming Lei) - use kmemdup instead of kmalloc + memcpy in nvme-auth (Zhang Xiaoxu) - __alloc_disk_node() error handling fix (Rafael) * tag 'block-6.0-2022-08-12' of git://git.kernel.dk/linux-block: block: Do not call blk_put_queue() if gendisk allocation fails nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA XPG GAMMIX S70 nvme-tcp: check if the queue is allocated before stopping it nvme-fabrics: Fix a typo in an error message nvme-fabrics: parse nvme connect Linux error codes nvmet-auth: use kmemdup instead of kmalloc + memcpy nvme-fc: fix the fc_appid_store return value nvme-fc: restart admin queue if the caller needs to restart queue	2022-08-13 13:37:36 -07:00
Rafael Mendonca	aa0c680c3a	block: Do not call blk_put_queue() if gendisk allocation fails Commit `6f8191fdf4` ("block: simplify disk shutdown") removed the call to blk_get_queue() during gendisk allocation but missed to remove the corresponding cleanup code blk_put_queue() for it. Thus, if the gendisk allocation fails, the request_queue refcount gets decremented and reaches 0, causing blk_mq_release() to be called with a hctx still alive. That triggers a WARNING report, as found by syzkaller: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 23016 at block/blk-mq.c:3881 blk_mq_release+0xf8/0x3e0 block/blk-mq.c:3881 [...] stripped RIP: 0010:blk_mq_release+0xf8/0x3e0 block/blk-mq.c:3881 [...] stripped Call Trace: <TASK> blk_release_queue+0x153/0x270 block/blk-sysfs.c:780 kobject_cleanup lib/kobject.c:673 [inline] kobject_release lib/kobject.c:704 [inline] kref_put include/linux/kref.h:65 [inline] kobject_put+0x1c8/0x540 lib/kobject.c:721 __alloc_disk_node+0x4f7/0x610 block/genhd.c:1388 __blk_mq_alloc_disk+0x13b/0x1f0 block/blk-mq.c:3961 loop_add+0x3e2/0xaf0 drivers/block/loop.c:1978 loop_control_ioctl+0x133/0x620 drivers/block/loop.c:2150 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd [...] stripped Fixes: `6f8191fdf4` ("block: simplify disk shutdown") Reported-by: syzbot+31c9594f6e43b9289b25@syzkaller.appspotmail.com Suggested-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220811232338.254673-1-rafaelmendsr@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-12 06:42:06 -06:00
Al Viro	480cb846c2	block: convert to advancing variants of iov_iter_get_pages{,_alloc}() ... doing revert if we end up not using some pages Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2022-08-08 22:37:22 -04:00
Al Viro	fcb14cb1bd	new iov_iter flavour - ITER_UBUF Equivalent of single-segment iovec. Initialized by iov_iter_ubuf(), checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC ones. We are going to expose the things like ->write_iter() et.al. to those in subsequent commits. New predicate (user_backed_iter()) that is true for ITER_IOVEC and ITER_UBUF; places like direct-IO handling should use that for checking that pages we modify after getting them from iov_iter_get_pages() would need to be dirtied. DO NOT assume that replacing iter_is_iovec() with user_backed_iter() will solve all problems - there's code that uses iter_is_iovec() to decide how to poke around in iov_iter guts and for that the predicate replacement obviously won't suffice. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2022-08-08 22:37:15 -04:00
Linus Torvalds	fa9db655d0	for-5.20/block-2022-08-04 Merge tag 'for-5.20/block-2022-08-04' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: - NVMe pull requests via Christoph: - add support for In-Band authentication (Hannes Reinecke) - handle the persistent internal error AER (Michael Kelley) - use in-capsule data for TCP I/O queue connect (Caleb Sander) - remove timeout for getting RDMA-CM established event (Israel Rukshin) - misc cleanups (Joel Granados, Sagi Grimberg, Chaitanya Kulkarni, Guixin Liu, Xiang wangx) - use command_id instead of req->tag in trace_nvme_complete_rq() (Bean Huo) - various fixes for the new authentication code (Lukas Bulwahn, Dan Carpenter, Colin Ian King, Chaitanya Kulkarni, Hannes Reinecke) - small cleanups (Liu Song, Christoph Hellwig) - restore compat_ioctl support (Nick Bowler) - make a nvmet-tcp workqueue lockdep-safe (Sagi Grimberg) - enable generic interface (/dev/ngXnY) for unknown command sets (Joel Granados, Christoph Hellwig) - don't always build constants.o (Christoph Hellwig) - print the command name of aborted commands (Christoph Hellwig) - MD pull requests via Song: - Improve raid5 lock contention, by Logan Gunthorpe. - Misc fixes to raid5, by Logan Gunthorpe. - Fix race condition with md_reap_sync_thread(), by Guoqing Jiang. - Fix potential deadlock with raid5_quiesce and raid5_get_active_stripe, by Logan Gunthorpe. - Refactoring md_alloc(), by Christoph" - Fix md disk_name lifetime problems, by Christoph Hellwig - Convert prepare_to_wait() to wait_woken() api, by Logan Gunthorpe; - Fix sectors_to_do bitmap issue, by Logan Gunthorpe. - Work on unifying the null_blk module parameters and configfs API (Vincent) - drbd bitmap IO error fix (Lars) - Set of rnbd fixes (Guoqing, Md Haris) - Remove experimental marker on bcache async device registration (Coly) - Series from cleaning up the bio splitting (Christoph) - Removal of the sx8 block driver. This hardware never really widespread, and it didn't receive a lot of attention after the initial merge of it back in 2005 (Christoph) - A few fixes for s390 dasd (Eric, Jiang) - Followup set of fixes for ublk (Ming) - Support for UBLK_IO_NEED_GET_DATA for ublk (ZiyangZhang) - Fixes for the dio dma alignment (Keith) - Misc fixes and cleanups (Ming, Yu, Dan, Christophe * tag 'for-5.20/block-2022-08-04' of git://git.kernel.dk/linux-block: (136 commits) s390/dasd: Establish DMA alignment s390/dasd: drop unexpected word 'for' in comments ublk_drv: add support for UBLK_IO_NEED_GET_DATA ublk_cmd.h: add one new ublk command: UBLK_IO_NEED_GET_DATA ublk_drv: cleanup ublksrv_ctrl_dev_info ublk_drv: add SET_PARAMS/GET_PARAMS control command ublk_drv: fix ublk device leak in case that add_disk fails ublk_drv: cancel device even though disk isn't up block: fix leaking page ref on truncated direct io block: ensure bio_iov_add_page can't fail block: ensure iov_iter advances for added pages drivers:md:fix a potential use-after-free bug md/raid5: Ensure batch_last is released before sleeping for quiesce md/raid5: Move stripe_request_ctx up md/raid5: Drop unnecessary call to r5c_check_stripe_cache_usage() md/raid5: Make is_inactive_blocked() helper md/raid5: Refactor raid5_get_active_stripe() block: pass struct queue_limits to the bio splitting helpers block: move bio_allowed_max_sectors to blk-merge.c block: move the call to get_max_io_size out of blk_bio_segment_split ...	2022-08-04 20:00:14 -07:00
Linus Torvalds	746fc76b82	SCSI misc on 20220804 Updates to the usual drivers (ufs, qla2xx, target, lpfc, smartpqi, mpi3mr). The main driver change that might cause issues on down the road is the conversion of some of our oldest surviving drivers to the DMA API (should only affect m68k). The only major core change is the rework of async resume; the rest are either completely trivial or for updating deprecated APIs. Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com> Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI updates from James Bottomley: "Updates to the usual drivers (ufs, qla2xx, target, lpfc, smartpqi, mpi3mr). The main driver change that might cause issues on down the road is the conversion of some of our oldest surviving drivers to the DMA API (should only affect m68k). The only major core change is the rework of async resume; the rest are either completely trivial or for updating deprecated APIs" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (195 commits) scsi: target: Remove XDWRITEREAD emulated support scsi: megaraid: Remove the static variable initialisation scsi: ch: Do not initialise statics to 0 scsi: ufs: core: Fix spelling mistake "Cannnot" -> "Cannot" scsi: target: iscsi: Do not require target authentication scsi: target: iscsi: Allow AuthMethod=None scsi: target: iscsi: Support base64 in CHAP scsi: target: iscsi: Add support for extended CDB AHS scsi: ufs: dt-bindings: Add SC8280XP binding scsi: target: iscsi: Fix clang -Wformat warnings scsi: ufs: core: Read device property for ref clock scsi: libsas: Resume SAS host for phy reset or enable via sysfs scsi: hisi_sas: Modify v3 HW SATA completion error processing scsi: hisi_sas: Relocate DMA unmap of SMP task scsi: hisi_sas: Remove unnecessary variable to hold DMA map elements scsi: hisi_sas: Call hisi_sas_slave_configure() from slave_configure_v3_hw() scsi: mpi3mr: Delete a stray tab scsi: mpi3mr: Unlock on error path scsi: mpi3mr: Reduce VD queue depth on detecting throttling scsi: mpi3mr: Resource Based Metering ...	2022-08-04 19:47:37 -07:00
Linus Torvalds	5264406cdb	iov_iter work, part 1 - isolated cleanups and optimizations. One of the goals is to reduce the overhead of using ->read_iter() and ->write_iter() instead of ->read()/->write(); new_sync_{read,write}() has a surprising amount of overhead, in particular inside iocb_flags(). That's why the beginning of the series is in this pile; it's not directly iov_iter-related, but it's a part of the same work... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Merge tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs iov_iter updates from Al Viro: "Part 1 - isolated cleanups and optimizations. One of the goals is to reduce the overhead of using ->read_iter() and ->write_iter() instead of ->read()/->write(). new_sync_{read,write}() has a surprising amount of overhead, in particular inside iocb_flags(). That's the explanation for the beginning of the series is in this pile; it's not directly iov_iter-related, but it's a part of the same work..." * tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: first_iovec_segment(): just return address iov_iter: massage calling conventions for first_{iovec,bvec}_segment() iov_iter: first_{iovec,bvec}_segment() - simplify a bit iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment() iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT iov_iter_bvec_advance(): don't bother with bvec_iter copy_page_{to,from}_iter(): switch iovec variants to generic keep iocb_flags() result cached in struct file iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC struct file: use anonymous union member for rcuhead and llist btrfs: use IOMAP_DIO_NOSYNC teach iomap_dio_rw() to suppress dsync No need of likely/unlikely on calls of check_copy_size()	2022-08-03 13:50:22 -07:00
Linus Torvalds	f00654007f	Folio changes for 6.0 - Fix an accounting bug that made NR_FILE_DIRTY grow without limit when running xfstests - Convert more of mpage to use folios - Remove add_to_page_cache() and add_to_page_cache_locked() - Convert find_get_pages_range() to filemap_get_folios() - Improvements to the read_cache_page() family of functions - Remove a few unnecessary checks of PageError - Some straightforward filesystem conversions to use folios - Split PageMovable users out from address_space_operations into their own movable_operations - Convert aops->migratepage to aops->migrate_folio - Remove nobh support (Christoph Hellwig) Merge tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache Pull folio updates from Matthew Wilcox: - Fix an accounting bug that made NR_FILE_DIRTY grow without limit when running xfstests - Convert more of mpage to use folios - Remove add_to_page_cache() and add_to_page_cache_locked() - Convert find_get_pages_range() to filemap_get_folios() - Improvements to the read_cache_page() family of functions - Remove a few unnecessary checks of PageError - Some straightforward filesystem conversions to use folios - Split PageMovable users out from address_space_operations into their own movable_operations - Convert aops->migratepage to aops->migrate_folio - Remove nobh support (Christoph Hellwig) * tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits) fs: remove the NULL get_block case in mpage_writepages fs: don't call ->writepage from __mpage_writepage fs: remove the nobh helpers jfs: stop using the nobh helper ext2: remove nobh support ntfs3: refactor ntfs_writepages mm/folio-compat: Remove migration compatibility functions fs: Remove aops->migratepage() secretmem: Convert to migrate_folio hugetlb: Convert to migrate_folio aio: Convert to migrate_folio f2fs: Convert to filemap_migrate_folio() ubifs: Convert to filemap_migrate_folio() btrfs: Convert btrfs_migratepage to migrate_folio mm/migrate: Add filemap_migrate_folio() mm/migrate: Convert migrate_page() to migrate_folio() nfs: Convert to migrate_folio btrfs: Convert btree_migratepage to migrate_folio mm/migrate: Convert expected_page_refs() to folio_expected_refs() mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio() ...	2022-08-03 10:35:43 -07:00
Keith Busch	e97424fd44	block: fix leaking page ref on truncated direct io The size being added to a bio from an iov is aligned to a block size after the pages were gotten. If the new aligned size truncates the last page, its reference was being leaked. Ensure all pages that were not added to the bio have their reference released. Since this essentially requires doing the same that bio_put_pages(), and there was only one caller for that function, this patch makes the put_page() loop common for everyone. Fixes: `b1a000d3b8` ("block: relax direct io memory alignment") Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20220712153256.2202024-3-kbusch@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 21:08:54 -06:00
Keith Busch	34cdb8c825	block: ensure bio_iov_add_page can't fail Adding the page could fail on the bio_full() condition, which checks for either exceeding the bio's max segments or total size exceeding UINT_MAX. We already ensure the max segments can't be exceeded, so just ensure the total size won't reach the limit. This simplifies error handling and removes unnecessary repeated bio_full() checks. Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20220712153256.2202024-2-kbusch@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 21:08:54 -06:00
Keith Busch	325347d965	block: ensure iov_iter advances for added pages There are cases where a bio may not accept additional pages, and the iov needs to advance to the last data length that was accepted. The zone append used to handle this correctly, but was inadvertently broken when the setup was made common with the normal r/w case. Fixes: `576ed91354` ("block: use bio_add_page in bio_iov_iter_get_pages") Fixes: `c58c0074c5` ("block/bio: remove duplicate append pages code") Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20220712153256.2202024-1-kbusch@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 21:08:54 -06:00
Christoph Hellwig	c55ddd9082	block: pass struct queue_limits to the bio splitting helpers Allow using the splitting helpers on just a queue_limits instead of a full request_queue structure. This will eventually allow file systems or remapping drivers to split REQ_OP_ZONE_APPEND bios based on limits calculated as the minimum common capabilities over multiple devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220727162300.3089193-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 21:08:53 -06:00
Christoph Hellwig	b6dc6198eb	block: move bio_allowed_max_sectors to blk-merge.c Move this helper into the only file where it is used. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220727162300.3089193-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 21:08:53 -06:00
Christoph Hellwig	a85b36375b	block: move the call to get_max_io_size out of blk_bio_segment_split Prepare for reusing blk_bio_segment_split for (file system controlled) splits of REQ_OP_ZONE_APPEND bios by letting the caller control the maximum size of the bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220727162300.3089193-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 21:08:53 -06:00
Christoph Hellwig	46754bd056	block: move ->bio_split to the gendisk Only non-passthrough requests are split by the block layer and use the ->bio_split bio_set. Move it from the request_queue to the gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220727162300.3089193-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 21:08:49 -06:00
Christoph Hellwig	51d798cdb5	block: change the blk_queue_bounce calling convention The double indirect bio leads to somewhat suboptimal code generation. Instead return the (original or split) bio, and make sure the request_queue arguments to the lower level helpers is passed after the bio to avoid constant reshuffling of the argument passing registers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220727162300.3089193-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:54 -06:00
Christoph Hellwig	5a97806f7d	block: change the blk_queue_split calling convention The double indirect bio leads to somewhat suboptimal code generation. Instead return the (original or split) bio, and make sure the request_queue arguments to the lower level helpers is passed after the bio to avoid constant reshuffling of the argument passing registers. Also give it and the helpers used to implement it more descriptive names. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220727162300.3089193-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:53 -06:00
Linus Torvalds	c013d0af81	for-5.20/block-2022-07-29 Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: - Improve the type checking of request flags (Bart) - Ensure queue mapping for a single queues always picks the right queue (Bart) - Sanitize the io priority handling (Jan) - rq-qos race fix (Jinke) - Reserved tags handling improvements (John) - Separate memory alignment from file/disk offset aligment for O_DIRECT (Keith) - Add new ublk driver, userspace block driver using io_uring for communication with the userspace backend (Ming) - Use try_cmpxchg() to cleanup the code in various spots (Uros) - Finally remove bdevname() (Christoph) - Clean up the zoned device handling (Christoph) - Clean up independent access range support (Christoph) - Clean up and improve block sysfs handling (Christoph) - Clean up and improve teardown of block devices. This turns the usual two step process into something that is simpler to implement and handle in block drivers (Christoph) - Clean up chunk size handling (Christoph) - Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu, Ming, Sebastian, Yang, Ying) * tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits) ublk_drv: fix double shift bug ublk_drv: make sure that correct flags(features) returned to userspace ublk_drv: fix error handling of ublk_add_dev ublk_drv: fix lockdep warning block: remove __blk_get_queue block: call blk_mq_exit_queue from disk_release for never added disks blk-mq: fix error handling in __blk_mq_alloc_disk ublk: defer disk allocation ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask ublk: fold __ublk_create_dev into ublk_ctrl_add_dev ublk: cleanup ublk_ctrl_uring_cmd ublk: simplify ublk_ch_open and ublk_ch_release ublk: remove the empty open and release block device operations ublk: remove UBLK_IO_F_PREFLUSH ublk: add a MAINTAINERS entry block: don't allow the same type rq_qos add more than once mmc: fix disk/queue leak in case of adding disk failure ublk_drv: fix an IS_ERR() vs NULL check ublk: remove UBLK_IO_F_INTEGRITY ublk_drv: remove unneeded semicolon ...	2022-08-02 13:46:35 -07:00
Matthew Wilcox (Oracle)	67235182a4	mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio() Use a folio throughout __buffer_migrate_folio(), add kernel-doc for buffer_migrate_folio() and buffer_migrate_folio_norefs(), move their declarations to buffer.h and switch all filesystems that have wired them up. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)	4fdc08d418	block: Convert read_part_sector() to use a folio This relatively straightforward conversion saves a call to compound_head() hidden inside put_page(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>	2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)	069fc464f1	block: Use PAGE_SECTORS_SHIFT The bare use of '9' confuses some people. We also don't need this cast, since the compiler does exactly that cast for us. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>	2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)	98d8ba69ff	block: Handle partition read errors more consistently Set p->v to NULL if we try to read beyond the end of the disk, just like we do if we get an error returned from trying to read the disk. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>	2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)	8b5d143c95	block: Simplify read_part_sector() That rather complicated expression is just trying to find the offset of this sector within a page, and there are easier ways to express that. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>	2022-08-02 12:34:03 -04:00
Christoph Hellwig	828b5f017d	block: remove __blk_get_queue __blk_get_queue is only called by blk_get_queue, so merge the two. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220721063432.1714609-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-21 11:03:17 -06:00
Christoph Hellwig	c5db2cfc62	block: call blk_mq_exit_queue from disk_release for never added disks To undo all the initialization from blk_mq_init_allocated_queue in case of a probe failure where add_disk is never called, we have to call blk_mq_exit_queue from put_disk. This relies on the fact that drivers always call blk_mq_free_tag_set after calling put_disk in the probe error path if they have a gendisk at all. We should be doing this in general, but can't do it for the normal teardown case (yet) as the tagset can be gone by the time the disk is released once it was added. I hope to sort this out properly eventually but for now this isolated hack will do it. Fixes: `6f8191fdf4` ("block: simplify disk shutdown") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220720130541.1323531-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-21 10:59:02 -06:00
Christoph Hellwig	0a3e5cc7bb	blk-mq: fix error handling in __blk_mq_alloc_disk To fully clean up the queue if the disk allocation fails we need to call blk_mq_destroy_queue and not just blk_put_queue. Fixes: `6f8191fdf4` ("block: simplify disk shutdown") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220720130541.1323531-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-21 10:59:02 -06:00
Jinke Han	14a6e2eb7d	block: don't allow the same type rq_qos add more than once In our test of iocost, we encountered some list add/del corruptions of inner_walk list in ioc_timer_fn. The reason can be described as the following interleaving of two ioc_qos_write calls: [cpu 0] ioc = q_to_ioc(queue); if (!ioc) { ioc = kzalloc(); [cpu 1] ioc = q_to_ioc(queue); if (!ioc) { ioc = kzalloc(); ... rq_qos_add(q, rqos); } [cpu 0] ... rq_qos_add(q, rqos); ... } When the io.cost.qos file is written by two cpus concurrently, rq_qos may be added to one disk twice. In that case, there will be two iocs enabled and running on one disk. They own different iocgs on their active list. In the ioc_timer_fn function, because the iocgs from the two iocs have the same root iocg, the root iocg's walk_list may be overwritten by each other, and this leads to list add/del corruptions in building or destroying the inner_walk list. So far, the blk-rq-qos framework has implicitly assumed one instance of each rq_qos type per queue. This patch makes this explicit and also fixes the crash above. Signed-off-by: Jinke Han <hanjinke.666@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220720093616.70584-1-hanjinke.666@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-20 06:44:14 -06:00
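The race described in the rq_qos entry above is a check-then-add performed without holding a lock across both steps. A minimal userspace sketch of the general remedy, assuming a pthread mutex in place of the queue lock, is to do the duplicate check and the list insertion inside the same critical section. All names here (`struct queue`, `struct policy`, `queue_add_policy`) are hypothetical, the `-EBUSY` return value is a choice made for the sketch, and none of this is taken from the patch itself.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical policy types, standing in for the rq_qos ids. */
enum policy_id { POLICY_WBT, POLICY_LATENCY, POLICY_COST };

struct policy {
    enum policy_id id;
    struct policy *next;
};

/* Hypothetical queue: a lock plus a singly linked list of policies. */
struct queue {
    pthread_mutex_t lock;
    struct policy *head;
};

/*
 * Add a policy unless one of the same type is already attached.
 * The lookup and the insertion share one critical section, so two
 * concurrent callers cannot both pass the "not present" check.
 */
static int queue_add_policy(struct queue *q, struct policy *p)
{
    int ret = 0;

    assert(q != NULL && p != NULL);
    pthread_mutex_lock(&q->lock);
    for (struct policy *it = q->head; it != NULL; it = it->next) {
        if (it->id == p->id) {
            ret = -EBUSY; /* same type already present: refuse */
            goto out;
        }
    }
    p->next = q->head;
    q->head = p;
out:
    pthread_mutex_unlock(&q->lock);
    return ret;
}
```

A caller that loses the race gets an error back and is expected to free its own freshly allocated instance rather than leave two instances active on one queue.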
Bart Van Assche	d625fecd8a	block/kyber: Use the new blk_opf_t type Use the new blk_opf_t type for arguments that represent a bitwise combination of a request operation and request flags. Rename those arguments from 'op' into 'opf'. This patch does not change any functionality. Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-10-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:30 -06:00
Bart Van Assche	f8359efe47	block/mq-deadline: Use the new blk_opf_t type Use the new blk_opf_t type for an argument that represents a bitwise combination of a request operation and request flags. Rename that argument from 'op' into 'opf'. This patch does not change any functionality. Cc: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-9-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:30 -06:00
Bart Van Assche	dc469ba2e7	block/bfq: Use the new blk_opf_t type Use the new blk_opf_t type for arguments and variables that represent request flags or a bitwise combination of a request operation and request flags. Rename those variables from 'op' into 'opf'. This patch does not change any functionality. Cc: Jan Kara <jack@suse.cz> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-8-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:30 -06:00
Bart Van Assche	16458cf3bd	block: Use the new blk_opf_t type Use the new blk_opf_t type for arguments and variables that represent request flags or a bitwise combination of a request operation and request flags. Rename the function arguments and also a structure member that hold a request operation and flags from 'rw' into 'opf'. This patch does not change any functionality. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-7-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:30 -06:00
Bart Van Assche	2d9b02be73	block: Change the type of req_op() and bio_op() into enum req_op Improve static type checking by changing the type of the value returned by req_op() and bio_op() from unsigned int into enum req_op. Insert 'default: break;' in switch statements on the enum req_op type to prevent that the compiler warns about these switch statements. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Tim Waugh <tim@cyberelk.net> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-5-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:30 -06:00
Bart Van Assche	77e7ffd7ad	block: Use enum req_op where appropriate Change the type of the arguments that are used to pass a REQ_OP_* value from int or unsigned int into enum req_op to improve static type checking. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:30 -06:00
Bart Van Assche	ff07a02e9e	treewide: Rename enum req_opf into enum req_op The type name enum req_opf is misleading since it suggests that values of this type include both an operation type and flags. Since values of this type represent an operation only, change the type name into enum req_op. Convert the enum req_op documentation into kernel-doc format. Move a few definitions such that the enum req_op documentation occurs just above the enum req_op definition. The name "req_opf" was introduced by commit `ef295ecf09` ("block: better op and flags encoding"). Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-2-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:30 -06:00
Muchun Song	957a2b345c	block: fix missing blkcg_bio_issue_init The commit `513616843d` ("block: remove superfluous calls to blkcg_bio_issue_init") has removed blkcg_bio_issue_init from __bio_clone since submit_bio will override ->bi_issue. However, __blk_queue_split is called after blkcg_bio_issue_init (see blk_mq_submit_bio) in submit_bio. In this case, the ->bi_issue is 0. Fix it. Fixes: `513616843d` ("block: remove superfluous calls to blkcg_bio_issue_init") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Link: https://lore.kernel.org/r/20220713140226.68135-1-songmuchun@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 10:54:49 -06:00
Christoph Hellwig	900d156bac	block: remove bdevname Replace the remaining calls of bdevname with snprintf using the %pg format specifier. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220713055317.1888500-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 10:27:56 -06:00
Christoph Hellwig	02ff3dd20f	block: stop using bdevname in __blkdev_issue_discard Just use the %pg format specifier instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220713055317.1888500-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 10:27:56 -06:00
Christoph Hellwig	5bf83e9a14	block: stop using bdevname in bdev_write_inode Just use the %pg format specifier instead. Also reformat the printk statement to be more readable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220713055317.1888500-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 10:27:56 -06:00
Uros Bizjak	96388f57d2	blk-cgroup: Use atomic{,64}_try_cmpxchg Use atomic_try_cmpxchg instead of atomic_cmpxchg (ptr, old, new) == old in blkcg_unuse_delay, blkcg_set_delay and blkcg_clear_delay and atomic64_try_cmpxchg in blkcg_scale_delay. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, atomic_try_cmpxchg implicitly assigns old ptr value to "old" when cmpxchg fails, enabling further code simplifications. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20220712154455.66868-1-ubizjak@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-12 15:46:03 -06:00
Uros Bizjak	aee8960c2e	blk-iolatency: Use atomic{,64}_try_cmpxchg Use atomic_try_cmpxchg instead of atomic_cmpxchg (*ptr, old, new) == old in check_scale_change and atomic64_try_cmpxchg in blkcg_iolatency_done_bio. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20220712151947.6783-1-ubizjak@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-12 15:45:31 -06:00
Uros Bizjak	939f9dd040	block: Use try_cmpxchg in update_io_ticks Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in update_io_ticks. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20220712152741.7324-1-ubizjak@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-12 14:39:06 -06:00
Uros Bizjak	f4b1e27db4	block/rq_qos: Use atomic_try_cmpxchg in atomic_inc_below Use atomic_try_cmpxchg instead of atomic_cmpxchg (ptr, old, new) == old in atomic_inc_below. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, atomic_try_cmpxchg implicitly assigns old ptr value to "old" when cmpxchg fails, enabling further code simplifications. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20220712150547.5786-1-ubizjak@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-12 14:38:52 -06:00
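The conversion pattern in the try_cmpxchg entries above can be illustrated with a userspace analogue in C11, where `atomic_compare_exchange_weak()` likewise returns a success flag and writes the observed value back into its expected-value argument on failure. This is a sketch under that assumption only: `cmpxchg_int`, `inc_below_cmpxchg` and `inc_below_try_cmpxchg` are hypothetical names, and the code is not copied from the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper mimicking a cmpxchg that returns the value it
 * observed, so the caller has to compare it against "old" itself.
 */
static int cmpxchg_int(atomic_int *v, int old, int new_val)
{
    /* On failure C11 stores the observed value into old; on success
     * old already equals the observed value. */
    atomic_compare_exchange_strong(v, &old, new_val);
    return old;
}

/* "Before" shape: explicit compare after the cmpxchg, manual reload. */
static bool inc_below_cmpxchg(atomic_int *v, int below)
{
    assert(v != NULL);
    int cur = atomic_load(v);

    for (;;) {
        if (cur >= below)
            return false;
        int old = cmpxchg_int(v, cur, cur + 1);
        if (old == cur)
            return true;
        cur = old; /* retry with the value another thread installed */
    }
}

/* "After" shape: the success flag drives the loop, and cur is
 * refreshed implicitly whenever the exchange fails. */
static bool inc_below_try_cmpxchg(atomic_int *v, int below)
{
    assert(v != NULL);
    int cur = atomic_load(v);

    do {
        if (cur >= below)
            return false;
    } while (!atomic_compare_exchange_weak(v, &cur, cur + 1));

    return true;
}
```

Both functions increment the counter only while it is below the limit. The second form drops the separate comparison and the manual reassignment of `cur`, which is the simplification the commit messages describe.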
Ming Lei	f3ec5d1155	blk-mq: don't create hctx debugfs dir until q->debugfs_dir is created blk_mq_debugfs_register_hctx() can be called by blk_mq_update_nr_hw_queues when gendisk isn't added yet, such as nvme tcp. Fixes the warning of 'debugfs: Directory 'hctx0' with parent '/' already present!' which can be observed reliably when running blktests nvme/005. Fixes: `6cfc0081b0` ("blk-mq: no need to check return value of debugfs_create functions") Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220711090808.259682-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-12 06:52:27 -06:00
Christoph Hellwig	d86e716aa4	block: move zone related fields to struct gendisk Move the zone related fields that are currently stored in struct request_queue to struct gendisk as these are part of the highlevel block layer API and are only used for non-passthrough I/O that requires the gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:46:26 -06:00
Christoph Hellwig	375c140c19	block: use bdev based helpers in blkdev_zone_mgmt{,all} Use the bdev based helpers instead of the queue based ones to clean up the code a bit and prepare for storing all zone related fields in struct gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:46:26 -06:00
Christoph Hellwig	b623e34732	block: replace blkdev_nr_zones with bdev_nr_zones Pass a block_device instead of a request_queue as that is what most callers have at hand. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:46:26 -06:00
Christoph Hellwig	1dc0172027	block: remove queue_max_open_zones and queue_max_active_zones Always use the bdev based helpers instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:46:26 -06:00
Christoph Hellwig	5d40066567	block: pass a gendisk to blk_queue_free_zone_bitmaps Switch to a gendisk based API in preparation for moving all zone related fields from the request_queue to the gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:46:26 -06:00
Christoph Hellwig	b3c72f8138	block: pass a gendisk to blk_queue_clear_zone_settings Switch to a gendisk based API in preparation for moving all zone related fields from the request_queue to the gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:46:26 -06:00
Christoph Hellwig	6b2bd27474	block: pass a gendisk to blk_queue_set_zoned Prepare for storing the zone related field in struct gendisk instead of struct request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:46:26 -06:00
Christoph Hellwig	052e545c92	block: simplify blk_check_zone_append Use the bdev based helpers instead of open coding them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:46:25 -06:00
Christoph Hellwig	6deacb3bfa	block: simplify blk_mq_plug Drop the unused q argument, and invert the check to move the exception into a branch and the regular path as the normal return. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:46:25 -06:00
Christoph Hellwig	edd1dbc83b	block: use bdev_is_zoned instead of open coding it Use bdev_is_zoned in all places where a block_device is available instead of open coding it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:46:25 -06:00
Christoph Hellwig	6cc37a672a	block: call blk_queue_free_zone_bitmaps from disk_release The zone bitmaps are only used for non-passthrough I/O, so free them as soon as the disk is released. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:46:25 -06:00
John Garry	4cf6e6c010	blk-mq: Drop local variable for reserved tag The local variable is now only referenced once so drop it. Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/1657109034-206040-7-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:33:53 -06:00
John Garry	2dd6532e95	blk-mq: Drop 'reserved' arg of busy_tag_iter_fn We no longer use the 'reserved' arg in busy_tag_iter_fn for any iter function so it may be dropped. Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> #nvme Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/1657109034-206040-6-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:33:53 -06:00
John Garry	9bdb4833dd	blk-mq: Drop blk_mq_ops.timeout 'reserved' arg With new API blk_mq_is_reserved_rq() we can tell if a request is from the reserved pool, so stop passing 'reserved' arg. There is actually only a single user of that arg for all the callback implementations, which can use blk_mq_is_reserved_rq() instead. This will also allow us to stop passing the same 'reserved' around the blk-mq iter functions next. Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/1657109034-206040-4-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:33:53 -06:00
John Garry	99e48cd685	blk-mq: Add a flag for reserved requests Add a flag for reserved requests so that drivers may know this for any special handling. Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/1657109034-206040-3-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:33:53 -06:00
Jason Yan	e55cf79814	blk-cgroup: factor out blkcg_free_all_cpd() To reduce some duplicated code, factor out blkcg_free_all_cpd(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220629070917.3113016-3-yanaijie@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-29 11:09:53 -06:00
Jason Yan	362b8c16f8	blk-cgroup: factor out blkcg_iostat_update() To reduce some duplicated code, factor out blkcg_iostat_update(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220629070917.3113016-2-yanaijie@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-29 11:09:53 -06:00
Christoph Hellwig	22d0c4080f	block: simplify disk_set_independent_access_ranges Lift setting disk->ia_ranges from disk_register_independent_access_ranges into disk_set_independent_access_ranges, and make the behavior the same for the registered vs non-registered queue cases. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20220629062013.1331068-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-29 08:36:46 -06:00
Christoph Hellwig	6a27d28c81	block: move ->ia_ranges from the request_queue to the gendisk Independent access ranges only matter for file system I/O and are only valid with a registered gendisk, so move them there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20220629062013.1331068-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-29 08:36:46 -06:00
Ying Sun	b9a1c179bd	block: remove "select BLK_RQ_IO_DATA_LEN" from BLK_CGROUP_IOCOST dependency The configuration item BLK_RQ_IO_DATA_LEN is not declared in the kernel, so selecting it is meaningless and the select can be removed. Signed-off-by: Ying Sun <sunying@nj.iscas.ac.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220629062409.19458-1-sunying@nj.iscas.ac.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-29 08:35:57 -06:00
Matthew Wilcox (Oracle)	0e00fa5f83	block: Remove check of PageError If read_mapping_page() sees a page with PageError set, it returns a PTR_ERR(). Checking PageError again is simply dead code. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>	2022-06-29 08:51:06 -04:00
Christoph Hellwig	8682b92e5a	blk-mq: cleanup disk sysfs registration Pass a gendisk to the sysfs register/unregister functions and give them descriptive names. Also move the unregistration helper next to the one doing the registration. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220628171850.1313069-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-28 11:32:42 -06:00
Christoph Hellwig	eaa870f975	blk-mq: rename blk_mq_sysfs_{,un}register Add a _hctx postfix to better describe what the functions do, match the debugfs equivalents and release the old names for functions that should be using them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220628171850.1313069-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-28 11:32:42 -06:00
Christoph Hellwig	81f0c2ef41	block: remove the extra gendisk reference in __blk_mq_register_dev kobject_add already grabs a reference to the parent, no need to have another one. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220628171850.1313069-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-28 11:32:42 -06:00
Christoph Hellwig	4a8d14bba4	block: use default groups to register the queue attributes Set up the default_groups for blk_queue_ktype instead of manually calling sysfs_create_group. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220628171850.1313069-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-28 11:32:42 -06:00
Christoph Hellwig	060f131e9c	block: remove a superflous queue kobject reference kobject_add already adds a reference to the parent that is dropped on deletion, so don't bother grabbing another one. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220628171850.1313069-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-28 11:32:42 -06:00
Christoph Hellwig	cc5c516df0	block: simplify blktrace sysfs attribute creation Add the trace attributes to the default gendisk attributes, just like we already do for partitions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220628171850.1313069-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-28 11:32:42 -06:00
Christoph Hellwig	8b9ab62662	block: remove blk_cleanup_disk blk_cleanup_disk is nothing but a trivial wrapper for put_disk now, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20220619060552.1850436-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-28 06:33:15 -06:00
Christoph Hellwig	6f8191fdf4	block: simplify disk shutdown Set the queue dying flag and call blk_mq_exit_queue from del_gendisk for all disks that do not have separately allocated queues, and thus remove the need to call blk_cleanup_queue for them. Rename blk_cleanup_disk to blk_mq_destroy_queue to make it clear that this function is intended only for separately allocated blk-mq queues. This saves an extra queue freeze for devices without a separately allocated queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20220619060552.1850436-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-28 06:30:26 -06:00
Christoph Hellwig	0e3534022f	block: stop setting the nomerges flags in blk_cleanup_queue These flags only apply to file system I/O, and all file system I/O is already drained by del_gendisk and thus can't be in progress when blk_cleanup_queue is called. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20220619060552.1850436-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-28 06:30:26 -06:00
Christoph Hellwig	1f90307e5f	block: remove QUEUE_FLAG_DEAD Disallow setting the blk-mq state on any queue that is already dying as setting the state even then is a bad idea, and remove the now unused QUEUE_FLAG_DEAD flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20220619060552.1850436-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-28 06:30:26 -06:00
Liu Song	ee78ec1077	blk-mq: blk_mq_tag_busy is no need to return a value Currently "blk_mq_tag_busy" return value has no effect, so adjust it. Some code implementations have also been adjusted to enhance readability. Signed-off-by: Liu Song <liusong@linux.alibaba.com> Link: https://lore.kernel.org/r/1656170121-1619-1-git-send-email-liusong@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:12 -06:00
Jan Kara	a78418e6a0	block: Always initialize bio IO priority on submit Currently, the IO priority set in a task's IO context is not reflected in bio->bi_ioprio for most IO (only io_uring and direct IO set it). This leads to odd behavior where a process submits some bios with one priority and other bios with a different (unset) priority, and because the priorities differ the bios cannot be merged. Make sure bio->bi_ioprio is always set on bio submission. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220623074840.5960-9-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:12 -06:00
Jan Kara	9c6227e043	block: Initialize bio priority earlier Bio's IO priority needs to be initialized before we try to merge the bio with other bios. Otherwise we could merge bios which would otherwise receive different IO priorities leading to possible QoS issues. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220623074840.5960-8-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:12 -06:00
Jan Kara	82b74cac28	blk-ioprio: Convert from rqos policy to direct call Convert blk-ioprio handling from a rqos policy to a direct call from blk_mq_submit_bio(). Firstly, blk-ioprio is not much of a rqos policy anyway, it just needs a hook in bio submission path to set the bio's IO priority. Secondly, the rqos .track hook gets actually called too late for blk-ioprio purposes and introducing a special rqos hook just for blk-ioprio looks even weirder. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220623074840.5960-7-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:12 -06:00
Jan Kara	f258654472	blk-ioprio: Remove unneeded field blkcg->ioprio_set field is not really useful except for avoiding possibly more expensive checks inside blkcg_ioprio_track(). The check for blkcg->prio_policy being equal to POLICY_NO_CHANGE does the same service so just remove the ioprio_set field and replace the check. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220623074840.5960-6-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:12 -06:00
Jan Kara	4b838d9ee9	block: Fix handling of tasks without ioprio in ioprio_get(2) ioprio_get(2) can be asked to return the best IO priority from several tasks (IOPRIO_WHO_PGRP, IOPRIO_WHO_USER). Currently the call treats tasks without set IO priority as having priority IOPRIO_CLASS_BE/IOPRIO_BE_NORM however this does not really reflect the IO priority the task will get (which depends on task's nice value). Fix the code to use the real IO priority task's IO will use. We have to modify code for ioprio_get(IOPRIO_WHO_PROCESS) to keep returning IOPRIO_CLASS_NONE priority for tasks without set IO priority as a special case to maintain userspace visible API. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220623074840.5960-5-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:12 -06:00
Jan Kara	fc25545e17	block: Make ioprio_best() static Nobody outside of block/ioprio.c uses it. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220623074840.5960-4-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:12 -06:00
Jan Kara	893e5d32d5	block: Generalize get_current_ioprio() for any task get_current_ioprio() operates only on current task. We will need the same functionality for other tasks as well. Generalize get_current_ioprio() for that and also move the bulk out of the header file because it is large enough. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220623074840.5960-3-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:12 -06:00
Jan Kara	e589f46445	block: fix default IO priority handling again Commit `e70344c059` ("block: fix default IO priority handling") introduced an inconsistency in get_current_ioprio() that tasks without IO context return IOPRIO_DEFAULT priority while tasks with freshly allocated IO context will return 0 (IOPRIO_CLASS_NONE/0) IO priority. Tasks without IO context used to be rare before `5a9d041ba2` ("block: move io_context creation into where it's needed") but after this commit they became common because now only BFQ IO scheduler setups task's IO context. Similar inconsistency is there for get_task_ioprio() so this inconsistency is now exposed to userspace and userspace will see different IO priority for tasks operating on devices with BFQ compared to devices without BFQ. Furthermore the changes done by commit `e70344c059` change the behavior when no IO priority is set for BFQ IO scheduler which is also documented in ioprio_set(2) manpage: "If no I/O scheduler has been set for a thread, then by default the I/O priority will follow the CPU nice value (setpriority(2)). In Linux kernels before version 2.6.24, once an I/O priority had been set using ioprio_set(), there was no way to reset the I/O scheduling behavior to the default. Since Linux 2.6.24, specifying ioprio as 0 can be used to reset to the default I/O scheduling behavior." So make sure we default to IOPRIO_CLASS_NONE as used to be the case before commit `e70344c059`. Also cleanup alloc_io_context() to explicitly set this IO priority for the allocated IO context to avoid future surprises. Note that we tweak ioprio_best() to maintain ioprio_get(2) behavior and make this commit easily backportable.
CC: stable@vger.kernel.org Fixes: `e70344c059` ("block: fix default IO priority handling") Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220623074840.5960-1-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:12 -06:00

7027 commits