Commit graph

5802 commits

Author SHA1 Message Date
Ming Lei
d4ad8736ac block: release rq qos structures for queue without disk
commit daaca3522a upstream.

blkcg_init_queue() may add rq qos structures to request queue, previously
blk_cleanup_queue() calls rq_qos_exit() to release them, but commit
8e141f9eb8 ("block: drain file system I/O on del_gendisk")
moves rq_qos_exit() into del_gendisk(), so memory leak is caused
because queues may not have disk, such as un-present scsi luns, nvme
admin queue, ...

Fixes the issue by adding rq_qos_exit() to blk_cleanup_queue() back.

BTW, v5.18 won't need this patch any more since we move
blkcg_init_queue()/blkcg_exit_queue() into disk allocation/release
handler, and patches have been in for-5.18/block.

Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Fixes: 8e141f9eb8 ("block: drain file system I/O on del_gendisk")
Reported-by: syzbot+b42749a851a47a0f581b@syzkaller.appspotmail.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220314043018.177141-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-23 09:16:41 +01:00
Christoph Hellwig
69b80587f6 block: drop unused includes in <linux/genhd.h>
commit b81e0c2372 upstream.

Drop various include not actually used in genhd.h itself, and
move the remaning includes closer together.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>a
Reported-by: "H. Nikolaus Schaller" <hns@goldelico.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: "Maciej W. Rozycki" <macro@orcam.me.uk>
[ resolves MIPS build failure by luck, root cause needs to be fixed in
  Linus's tree properly, but this is needed for now to fix the build - gregkh ]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-16 14:23:46 +01:00
Haimin Zhang
a1ba987315 block-map: add __GFP_ZERO flag for alloc_page in function bio_copy_kern
[ Upstream commit cc8f7fe1f5 ]

Add __GFP_ZERO flag for alloc_page in function bio_copy_kern to initialize
the buffer of a bio.

Signed-off-by: Haimin Zhang <tcs.kernel@gmail.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220216084038.15635-1-tcs.kernel@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-03-08 19:12:31 +01:00
Christoph Hellwig
e65450a12c block: fix surprise removal for drivers calling blk_set_queue_dying
commit 7a5428dcb7 upstream.

Various block drivers call blk_set_queue_dying to mark a disk as dead due
to surprise removal events, but since commit 8e141f9eb8 that doesn't
work given that the GD_DEAD flag needs to be set to stop I/O.

Replace the driver calls to blk_set_queue_dying with a new (and properly
documented) blk_mark_disk_dead API, and fold blk_set_queue_dying into the
only remaining caller.

Fixes: 8e141f9eb8 ("block: drain file system I/O on del_gendisk")
Reported-by: Markus Blöchl <markus.bloechl@ipetronik.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20220217075231.1140-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-23 12:03:15 +01:00
Laibin Qiu
4b9c861a58 block/wbt: fix negative inflight counter when remove scsi device
commit e92bc4cd34 upstream.

Now that we disable wbt by set WBT_STATE_OFF_DEFAULT in
wbt_disable_default() when switch elevator to bfq. And when
we remove scsi device, wbt will be enabled by wbt_enable_default.
If it become false positive between wbt_wait() and wbt_track()
when submit write request.

The following is the scenario that triggered the problem.

T1                          T2                           T3
                            elevator_switch_mq
                            bfq_init_queue
                            wbt_disable_default <= Set
                            rwb->enable_state (OFF)
Submit_bio
blk_mq_make_request
rq_qos_throttle
<= rwb->enable_state (OFF)
                                                         scsi_remove_device
                                                         sd_remove
                                                         del_gendisk
                                                         blk_unregister_queue
                                                         elv_unregister_queue
                                                         wbt_enable_default
                                                         <= Set rwb->enable_state (ON)
q_qos_track
<= rwb->enable_state (ON)
^^^^^^ this request will mark WBT_TRACKED without inflight add and will
lead to drop rqw->inflight to -1 in wbt_done() which will trigger IO hung.

Fix this by move wbt_enable_default() from elv_unregister to
bfq_exit_queue(). Only re-enable wbt when bfq exit.

Fixes: 76a8040817 ("blk-wbt: make sure throttle is enabled properly")

Remove oneline stale comment, and kill one oneshot local variable.

Signed-off-by: Ming Lei <ming.lei@rehdat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-block/20211214133103.551813-1-qiulaibin@huawei.com/
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-23 12:03:15 +01:00
Martin K. Petersen
f576721152 block: bio-integrity: Advance seed correctly for larger interval sizes
commit b13e0c7185 upstream.

Commit 309a62fa3a ("bio-integrity: bio_integrity_advance must update
integrity seed") added code to update the integrity seed value when
advancing a bio. However, it failed to take into account that the
integrity interval might be larger than the 512-byte block layer
sector size. This broke bio splitting on PI devices with 4KB logical
blocks.

The seed value should be advanced by bio_integrity_intervals() and not
the number of sectors.

Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: stable@vger.kernel.org
Fixes: 309a62fa3a ("bio-integrity: bio_integrity_advance must update integrity seed")
Tested-by: Dmitry Ivanov <dmitry.ivanov2@hpe.com>
Reported-by: Alexey Lyashkov <alexey.lyashkov@hpe.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220204034209.4193-1-martin.petersen@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-08 18:34:05 +01:00
OGAWA Hirofumi
4633a79ff8 block: Fix wrong offset in bio_truncate()
commit 3ee859e384 upstream.

bio_truncate() clears the buffer outside of last block of bdev, however
current bio_truncate() is using the wrong offset of page. So it can
return the uninitialized data.

This happened when both of truncated/corrupted FS and userspace (via
bdev) are trying to read the last of bdev.

Reported-by: syzbot+ac94ae5f68b84197f41c@syzkaller.appspotmail.com
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/875yqt1c9g.fsf@mail.parknet.co.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01 17:27:16 +01:00
Mike Snitzer
4cca3e3ef2 block: add bio_start_io_acct_time() to control start_time
commit e45c47d1f9 upstream.

bio_start_io_acct_time() interface is like bio_start_io_acct() that
allows start_time to be passed in. This gives drivers the ability to
defer starting accounting until after IO is issued (but possibily not
entirely due to bio splitting).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220128155841.39644-2-snitzer@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01 17:27:03 +01:00
Ye Bin
b157f4bf1f block: Fix fsync always failed if once failed
commit 8a7518931b upstream.

We do test with inject error fault base on v4.19, after test some time we found
sync /dev/sda always failed.
[root@localhost] sync /dev/sda
sync: error syncing '/dev/sda': Input/output error

scsi log as follows:
[19069.812296] sd 0:0:0:0: [sda] tag#64 Send: scmd 0x00000000d03a0b6b
[19069.812302] sd 0:0:0:0: [sda] tag#64 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[19069.812533] sd 0:0:0:0: [sda] tag#64 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[19069.812536] sd 0:0:0:0: [sda] tag#64 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[19069.812539] sd 0:0:0:0: [sda] tag#64 scsi host busy 1 failed 0
[19069.812542] sd 0:0:0:0: Notifying upper driver of completion (result 0)
[19069.812546] sd 0:0:0:0: [sda] tag#64 sd_done: completed 0 of 0 bytes
[19069.812549] sd 0:0:0:0: [sda] tag#64 0 sectors total, 0 bytes done.
[19069.812564] print_req_error: I/O error, dev sda, sector 0

ftrace log as follows:
 rep-306069 [007] .... 19654.923315: block_bio_queue: 8,0 FWS 0 + 0 [rep]
 rep-306069 [007] .... 19654.923333: block_getrq: 8,0 FWS 0 + 0 [rep]
 kworker/7:1H-250   [007] .... 19654.923352: block_rq_issue: 8,0 FF 0 () 0 + 0 [kworker/7:1H]
 <idle>-0     [007] ..s. 19654.923562: block_rq_complete: 8,0 FF () 18446744073709551615 + 0 [0]
 <idle>-0     [007] d.s. 19654.923576: block_rq_complete: 8,0 WS () 0 + 0 [-5]

As 8d6996630c introduce 'fq->rq_status', this data only update when 'flush_rq'
reference count isn't zero. If flush request once failed and record error code
in 'fq->rq_status'. If there is no chance to update 'fq->rq_status',then do fsync
will always failed.
To address this issue reset 'fq->rq_status' after return error code to upper layer.

Fixes: 8d6996630c03("block: fix null pointer dereference in blk_mq_rq_timed_out()")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211129012659.1553733-1-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-27 11:05:25 +01:00
Jens Axboe
269fbc2686 block: fix async_depth sysfs interface for mq-deadline
commit 46cdc45acb upstream.

A previous commit added this feature, but it inadvertently used the wrong
variable to show/store the setting from/to, victimized by copy/paste. Fix
it up so that the async_depth sysfs interface reads and writes from the
right setting.

Fixes: 07757588e5 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215485
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-27 11:05:25 +01:00
Tetsuo Handa
f75868e020 block: check minor range in device_add_disk()
[ Upstream commit e338924bd0 ]

ioctl(fd, LOOP_CTL_ADD, 1048576) causes

  sysfs: cannot create duplicate filename '/dev/block/7:0'

message because such request is treated as if ioctl(fd, LOOP_CTL_ADD, 0)
due to MINORMASK == 1048575. Verify that all minor numbers for that device
fit in the minor range.

Reported-by: wangyangbo <wangyangbo@uniontech.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/b1b19379-23ee-5379-0eb5-94bf5f79f1b4@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27 11:04:48 +01:00
Alan Stern
af8cab36bf scsi: block: pm: Always set request queue runtime active in blk_post_runtime_resume()
[ Upstream commit 6e1fcab00a ]

John Garry reported a deadlock that occurs when trying to access a
runtime-suspended SATA device.  For obscure reasons, the rescan procedure
causes the link to be hard-reset, which disconnects the device.

The rescan tries to carry out a runtime resume when accessing the device.
scsi_rescan_device() holds the SCSI device lock and won't release it until
it can put commands onto the device's block queue.  This can't happen until
the queue is successfully runtime-resumed or the device is unregistered.
But the runtime resume fails because the device is disconnected, and
__scsi_remove_device() can't do the unregistration because it can't get the
device lock.

The best way to resolve this deadlock appears to be to allow the block
queue to start running again even after an unsuccessful runtime resume.
The idea is that the driver or the SCSI error handler will need to be able
to use the queue to resolve the runtime resume failure.

This patch removes the err argument to blk_post_runtime_resume() and makes
the routine act as though the resume was successful always.  This fixes the
deadlock.

Link: https://lore.kernel.org/r/1639999298-244569-4-git-send-email-chenxiang66@hisilicon.com
Fixes: e27829dc92 ("scsi: serialize ->rescan against ->remove")
Reported-and-tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27 11:04:15 +01:00
Christoph Hellwig
7be5c25e31 block: fix error unwinding in device_add_disk
[ Upstream commit 99d8690aae ]

One device_add is called disk->ev will be freed by disk_release, so we
should free it twice.  Fix this by allocating disk->ev after device_add
so that the extra local unwinding can be removed entirely.

Based on an earlier patch from Tetsuo Handa.

Reported-by: syzbot <syzbot+28a66a9fbc621c939000@syzkaller.appspotmail.com>
Tested-by: syzbot <syzbot+28a66a9fbc621c939000@syzkaller.appspotmail.com>
Fixes: 83cbce9574 ("block: add error handling for device_add_disk / add_disk")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211221161851.788424-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27 11:03:43 +01:00
Jan Kara
162bc7bd73 bfq: Do not let waker requests skip proper accounting
[ Upstream commit c65e6fd460 ]

Commit 7cc4ffc555 ("block, bfq: put reqs of waker and woken in
dispatch list") added a condition to bfq_insert_request() which added
waker's requests directly to dispatch list. The rationale was that
completing waker's IO is needed to get more IO for the current queue.
Although this rationale is valid, there is a hole in it. The waker does
not necessarily serve the IO only for the current queue and maybe it's
current IO is not needed for current queue to make progress. Furthermore
injecting IO like this completely bypasses any service accounting within
bfq and thus we do not properly track how much service is waker's queue
getting or that the waker is actually doing any IO. Depending on the
conditions this can result in the waker getting too much or too few
service.

Consider for example the following job file:

[global]
directory=/mnt/repro/
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync

[slowwriter]
numjobs=1
prioclass=2
prio=7
fsync=200

[fastwriter]
numjobs=1
prioclass=2
prio=0
fsync=200

Despite processes have very different IO priorities, they get the same
about of service. The reason is that bfq identifies these processes as
having waker-wakee relationship and once that happens, IO from
fastwriter gets injected during slowwriter's time slice. As a result bfq
is not aware that fastwriter has any IO to do and constantly schedules
only slowwriter's queue. Thus fastwriter is forced to compete with
slowwriter's IO all the time instead of getting its share of time based
on IO priority.

Drop the special injection condition from bfq_insert_request(). As a
result, requests will be tracked and queued in a normal way and on next
dispatch bfq_select_queue() can decide whether the waker's inserted
requests should be injected during the current queue's timeslice or not.

Fixes: 7cc4ffc555 ("block, bfq: put reqs of waker and woken in dispatch list")
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-8-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27 11:03:18 +01:00
Tejun Heo
3a1a4eb574 iocost: Fix divide-by-zero on donation from low hweight cgroup
commit edaa26334c upstream.

The donation calculation logic assumes that the donor has non-zero
after-donation hweight, so the lowest active hweight a donating cgroup can
have is 2 so that it can donate 1 while keeping the other 1 for itself.
Earlier, we only donated from cgroups with sizable surpluses so this
condition was always true. However, with the precise donation algorithm
implemented, f1de2439ec ("blk-iocost: revamp donation amount
determination") made the donation amount calculation exact enabling even low
hweight cgroups to donate.

This means that in rare occasions, a cgroup with active hweight of 1 can
enter donation calculation triggering the following warning and then a
divide-by-zero oops.

 WARNING: CPU: 4 PID: 0 at block/blk-iocost.c:1928 transfer_surpluses.cold+0x0/0x53 [884/94867]
 ...
 RIP: 0010:transfer_surpluses.cold+0x0/0x53
 Code: 92 ff 48 c7 c7 28 d1 ab b5 65 48 8b 34 25 00 ae 01 00 48 81 c6 90 06 00 00 e8 8b 3f fe ff 48 c7 c0 ea ff ff ff e9 95 ff 92 ff <0f> 0b 48 c7 c7 30 da ab b5 e8 71 3f fe ff 4c 89 e8 4d 85 ed 74 0
4
 ...
 Call Trace:
  <IRQ>
  ioc_timer_fn+0x1043/0x1390
  call_timer_fn+0xa1/0x2c0
  __run_timers.part.0+0x1ec/0x2e0
  run_timer_softirq+0x35/0x70
 ...
 iocg: invalid donation weights in /a/b: active=1 donating=1 after=0

Fix it by excluding cgroups w/ active hweight < 2 from donating. Excluding
these extreme low hweight donations shouldn't affect work conservation in
any meaningful way.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: f1de2439ec ("blk-iocost: revamp donation amount determination")
Cc: stable@vger.kernel.org # v5.10+
Link: https://lore.kernel.org/r/Ybfh86iSvpWKxhVM@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-22 09:32:48 +01:00
Davidlohr Bueso
d429b302c1 block: fix ioprio_get(IOPRIO_WHO_PGRP) vs setuid(2)
commit e6a59aac8a upstream.

do_each_pid_thread(PIDTYPE_PGID) can race with a concurrent
change_pid(PIDTYPE_PGID) that can move the task from one hlist
to another while iterating. Serialize ioprio_get to take
the tasklist_lock in this case, just like it's set counterpart.

Fixes: d69b78ba1d (ioprio: grab rcu_read_lock in sys_ioprio_{set,get}())
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Link: https://lore.kernel.org/r/20211210182058.43417-1-dave@stgolabs.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-14 10:57:16 +01:00
Ming Lei
db8ed1e61b block: avoid to quiesce queue in elevator_init_mq
commit 245a489e81 upstream.

elevator_init_mq() is only called before adding disk, when there isn't
any FS I/O, only passthrough requests can be queued, so freezing queue
plus canceling dispatch work is enough to drain any dispatch activities,
then we can avoid synchronize_srcu() in blk_mq_quiesce_queue().

Long boot latency issue can be fixed in case of lots of disks added
during booting.

Fixes: 737eb78e82 ("block: Delay default elevator initialization")
Reported-by: yangerkun <yangerkun@huawei.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211117115502.1600950-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-01 09:04:56 +01:00
Ming Lei
e03513f589 blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release()
commit 2a19b28f79 upstream.

For avoiding to slow down queue destroy, we don't call
blk_mq_quiesce_queue() in blk_cleanup_queue(), instead of delaying to
cancel dispatch work in blk_release_queue().

However, this way has caused kernel oops[1], reported by Changhui. The log
shows that scsi_device can be freed before running blk_release_queue(),
which is expected too since scsi_device is released after the scsi disk
is closed and the scsi_device is removed.

Fixes the issue by canceling blk-mq dispatch work in both blk_cleanup_queue()
and disk_release():

1) when disk_release() is run, the disk has been closed, and any sync
dispatch activities have been done, so canceling dispatch work is enough to
quiesce filesystem I/O dispatch activity.

2) in blk_cleanup_queue(), we only focus on passthrough request, and
passthrough request is always explicitly allocated & freed by
its caller, so once queue is frozen, all sync dispatch activity
for passthrough request has been done, then it is enough to just cancel
dispatch work for avoiding any dispatch activity.

[1] kernel panic log
[12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300
[12622.777186] #PF: supervisor read access in kernel mode
[12622.782918] #PF: error_code(0x0000) - not-present page
[12622.788649] PGD 0 P4D 0
[12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI
[12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1
[12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
[12622.813321] Workqueue: kblockd blk_mq_run_work_fn
[12622.818572] RIP: 0010:sbitmap_get+0x75/0x190
[12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 <48> 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58
[12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202
[12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004
[12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030
[12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334
[12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
[12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030
[12622.889926] FS:  0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000
[12622.898956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0
[12622.913328] Call Trace:
[12622.916055]  <TASK>
[12622.918394]  scsi_mq_get_budget+0x1a/0x110
[12622.922969]  __blk_mq_do_dispatch_sched+0x1d4/0x320
[12622.928404]  ? pick_next_task_fair+0x39/0x390
[12622.933268]  __blk_mq_sched_dispatch_requests+0xf4/0x140
[12622.939194]  blk_mq_sched_dispatch_requests+0x30/0x60
[12622.944829]  __blk_mq_run_hw_queue+0x30/0xa0
[12622.949593]  process_one_work+0x1e8/0x3c0
[12622.954059]  worker_thread+0x50/0x3b0
[12622.958144]  ? rescuer_thread+0x370/0x370
[12622.962616]  kthread+0x158/0x180
[12622.966218]  ? set_kthread_struct+0x40/0x40
[12622.970884]  ret_from_fork+0x22/0x30
[12622.974875]  </TASK>
[12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]

Reported-by: ChanghuiZhong <czhong@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-01 09:04:56 +01:00
Alistair Delva
1560763677 block: Check ADMIN before NICE for IOPRIO_CLASS_RT
commit 94c4b4fd25 upstream.

Booting to Android userspace on 5.14 or newer triggers the following
SELinux denial:

avc: denied { sys_nice } for comm="init" capability=23
     scontext=u:r:init:s0 tcontext=u:r:init:s0 tclass=capability
     permissive=0

Init is PID 0 running as root, so it already has CAP_SYS_ADMIN. For
better compatibility with older SEPolicy, check ADMIN before NICE.

Fixes: 9d3a39a5f1 ("block: grant IOPRIO_CLASS_RT to CAP_SYS_NICE")
Signed-off-by: Alistair Delva <adelva@google.com>
Cc: Khazhismel Kumykov <khazhy@google.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: selinux@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: kernel-team@android.com
Cc: stable@vger.kernel.org # v5.14+
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Link: https://lore.kernel.org/r/20211115181655.3608659-1-adelva@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-25 09:48:45 +01:00
Yu Kuai
6ffad92d05 blk-cgroup: fix missing put device in error path from blkg_conf_pref()
[ Upstream commit 15c3010496 ]

If blk_queue_enter() failed due to queue is dying, the
blkdev_put_no_open() is needed because blkcg_conf_open_bdev() succeeded.

Fixes: 0c9d338c84 ("blk-cgroup: synchronize blkg creation against policy deactivation")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20211102020705.2321858-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-25 09:48:41 +01:00
Laibin Qiu
04096d1b66 blkcg: Remove extra blkcg_bio_issue_init
[ Upstream commit b781d8db58 ]

KASAN reports a use-after-free report when doing block test:

==================================================================
[10050.967049] BUG: KASAN: use-after-free in
submit_bio_checks+0x1539/0x1550

[10050.977638] Call Trace:
[10050.978190]  dump_stack+0x9b/0xce
[10050.979674]  print_address_description.constprop.6+0x3e/0x60
[10050.983510]  kasan_report.cold.9+0x22/0x3a
[10050.986089]  submit_bio_checks+0x1539/0x1550
[10050.989576]  submit_bio_noacct+0x83/0xc80
[10050.993714]  submit_bio+0xa7/0x330
[10050.994435]  mpage_readahead+0x380/0x500
[10050.998009]  read_pages+0x1c1/0xbf0
[10051.002057]  page_cache_ra_unbounded+0x4c2/0x6f0
[10051.007413]  do_page_cache_ra+0xda/0x110
[10051.008207]  force_page_cache_ra+0x23d/0x3d0
[10051.009087]  page_cache_sync_ra+0xca/0x300
[10051.009970]  generic_file_buffered_read+0xbea/0x2130
[10051.012685]  generic_file_read_iter+0x315/0x490
[10051.014472]  blkdev_read_iter+0x113/0x1b0
[10051.015300]  aio_read+0x2ad/0x450
[10051.023786]  io_submit_one+0xc8e/0x1d60
[10051.029855]  __se_sys_io_submit+0x125/0x350
[10051.033442]  do_syscall_64+0x2d/0x40
[10051.034156]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[10051.048733] Allocated by task 18598:
[10051.049482]  kasan_save_stack+0x19/0x40
[10051.050263]  __kasan_kmalloc.constprop.1+0xc1/0xd0
[10051.051230]  kmem_cache_alloc+0x146/0x440
[10051.052060]  mempool_alloc+0x125/0x2f0
[10051.052818]  bio_alloc_bioset+0x353/0x590
[10051.053658]  mpage_alloc+0x3b/0x240
[10051.054382]  do_mpage_readpage+0xddf/0x1ef0
[10051.055250]  mpage_readahead+0x264/0x500
[10051.056060]  read_pages+0x1c1/0xbf0
[10051.056758]  page_cache_ra_unbounded+0x4c2/0x6f0
[10051.057702]  do_page_cache_ra+0xda/0x110
[10051.058511]  force_page_cache_ra+0x23d/0x3d0
[10051.059373]  page_cache_sync_ra+0xca/0x300
[10051.060198]  generic_file_buffered_read+0xbea/0x2130
[10051.061195]  generic_file_read_iter+0x315/0x490
[10051.062189]  blkdev_read_iter+0x113/0x1b0
[10051.063015]  aio_read+0x2ad/0x450
[10051.063686]  io_submit_one+0xc8e/0x1d60
[10051.064467]  __se_sys_io_submit+0x125/0x350
[10051.065318]  do_syscall_64+0x2d/0x40
[10051.066082]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[10051.067455] Freed by task 13307:
[10051.068136]  kasan_save_stack+0x19/0x40
[10051.068931]  kasan_set_track+0x1c/0x30
[10051.069726]  kasan_set_free_info+0x1b/0x30
[10051.070621]  __kasan_slab_free+0x111/0x160
[10051.071480]  kmem_cache_free+0x94/0x460
[10051.072256]  mempool_free+0xd6/0x320
[10051.072985]  bio_free+0xe0/0x130
[10051.073630]  bio_put+0xab/0xe0
[10051.074252]  bio_endio+0x3a6/0x5d0
[10051.074984]  blk_update_request+0x590/0x1370
[10051.075870]  scsi_end_request+0x7d/0x400
[10051.076667]  scsi_io_completion+0x1aa/0xe50
[10051.077503]  scsi_softirq_done+0x11b/0x240
[10051.078344]  blk_mq_complete_request+0xd4/0x120
[10051.079275]  scsi_mq_done+0xf0/0x200
[10051.080036]  virtscsi_vq_done+0xbc/0x150
[10051.080850]  vring_interrupt+0x179/0x390
[10051.081650]  __handle_irq_event_percpu+0xf7/0x490
[10051.082626]  handle_irq_event_percpu+0x7b/0x160
[10051.083527]  handle_irq_event+0xcc/0x170
[10051.084297]  handle_edge_irq+0x215/0xb20
[10051.085122]  asm_call_irq_on_stack+0xf/0x20
[10051.085986]  common_interrupt+0xae/0x120
[10051.086830]  asm_common_interrupt+0x1e/0x40

==================================================================

Bio will be checked at beginning of submit_bio_noacct(). If bio needs
to be throttled, it will start the timer and stop submit bio directly.
Bio will submit in blk_throtl_dispatch_work_fn() when the timer expires.
But in the current process, if bio is throttled, it will still set bio
issue->value by blkcg_bio_issue_init(). This is redundant and may cause
the above use-after-free.

CPU0                                   CPU1
submit_bio
submit_bio_noacct
  submit_bio_checks
    blk_throtl_bio()
      <=mod_timer(&sq->pending_timer
                                      blk_throtl_dispatch_work_fn
                                        submit_bio_noacct() <= bio have
                                        throttle tag, will throw directly
                                        and bio issue->value will be set
                                        here

                                      bio_endio()
                                      bio_put()
                                      bio_free() <= free this bio

    blkcg_bio_issue_init(bio)
      <= bio has been freed and
      will lead to UAF
  return BLK_QC_T_NONE

Fix this by remove extra blkcg_bio_issue_init.

Fixes: e439bedf6b (blkcg: consolidate bio_issue_init() to be a part of core)
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Link: https://lore.kernel.org/r/20211112093354.3581504-1-qiulaibin@huawei.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-25 09:48:32 +01:00
Shin'ichiro Kawasaki
5e84e9d61d block: Hold invalidate_lock in BLKRESETZONE ioctl
commit 86399ea071 upstream.

When BLKRESETZONE ioctl and data read race, the data read leaves stale
page cache. The commit e511350590 ("block: Discard page cache of zone
reset target range") added page cache truncation to avoid stale page
cache after the ioctl. However, the stale page cache still can be read
during the reset zone operation for the ioctl. To avoid the stale page
cache completely, hold invalidate_lock of the block device file mapping.

Fixes: e511350590 ("block: Discard page cache of zone reset target range")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211111085238.942492-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18 19:17:15 +01:00
Shin'ichiro Kawasaki
373c2bfecb block: Hold invalidate_lock in BLKZEROOUT ioctl
commit 35e4c6c1a2 upstream.

When BLKZEROOUT ioctl and data read race, the data read leaves stale
page cache. To avoid the stale page cache, hold invalidate_lock of the
block device file mapping. The stale page cache is observed when
blktests test case block/009 is modified to call "blkdiscard -z" command
and repeated hundreds of times.

This patch can be applied back to the stable kernel version v5.15.y.
Rework is required for older stable kernels.

Fixes: 22dd6d3566 ("block: invalidate the page cache when issuing BLKZEROOUT")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211109104723.835533-3-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18 19:17:15 +01:00
Shin'ichiro Kawasaki
f9bed86a35 block: Hold invalidate_lock in BLKDISCARD ioctl
commit 7607c44c15 upstream.

When BLKDISCARD ioctl and data read race, the data read leaves stale
page cache. To avoid the stale page cache, hold invalidate_lock of the
block device file mapping. The stale page cache is observed when
blktests test case block/009 is repeated hundreds of times.

This patch can be applied back to the stable kernel version v5.15.y
with slight patch edit. Rework is required for older stable kernels.

Fixes: 351499a172 ("block: Invalidate cache on discard v2")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211109104723.835533-2-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18 19:17:15 +01:00
Luis Chamberlain
9d5c7b0eee block: fix device_add_disk() kobject_create_and_add() error handling
[ Upstream commit fe7d064fa3 ]

Commit 83cbce9574 ("block: add error handling for device_add_disk /
add_disk") added error handling to device_add_disk(), however the goto
label for the kobject_create_and_add() failure did not set the return
value correctly, and so we can end up in a situation where
kobject_create_and_add() fails but we report success.

Fixes: 83cbce9574 ("block: add error handling for device_add_disk / add_disk")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211103164023.1384821-1-mcgrof@kernel.org
[axboe: fold in followup fix from Wu Bo <wubo40@huawei.com>]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18 19:17:08 +01:00
Yu Kuai
af25dbc846 blk-cgroup: synchronize blkg creation against policy deactivation
[ Upstream commit 0c9d338c84 ]

Our test reports a null pointer dereference:

[  168.534653] ==================================================================
[  168.535614] Disabling lock debugging due to kernel taint
[  168.536346] BUG: kernel NULL pointer dereference, address: 0000000000000008
[  168.537274] #PF: supervisor read access in kernel mode
[  168.537964] #PF: error_code(0x0000) - not-present page
[  168.538667] PGD 0 P4D 0
[  168.539025] Oops: 0000 [#1] PREEMPT SMP KASAN
[  168.539656] CPU: 13 PID: 759 Comm: bash Tainted: G    B             5.15.0-rc2-next-202100
[  168.540954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_0738364
[  168.542736] RIP: 0010:bfq_pd_init+0x88/0x1e0
[  168.543318] Code: 98 00 00 00 e8 c9 e4 5b ff 4c 8b 65 00 49 8d 7c 24 08 e8 bb e4 5b ff 4d0
[  168.545803] RSP: 0018:ffff88817095f9c0 EFLAGS: 00010002
[  168.546497] RAX: 0000000000000001 RBX: ffff888101a1c000 RCX: 0000000000000000
[  168.547438] RDX: 0000000000000003 RSI: 0000000000000002 RDI: ffff888106553428
[  168.548402] RBP: ffff888106553400 R08: ffffffff961bcaf4 R09: 0000000000000001
[  168.549365] R10: ffffffffa2e16c27 R11: fffffbfff45c2d84 R12: 0000000000000000
[  168.550291] R13: ffff888101a1c098 R14: ffff88810c7a08c8 R15: ffffffffa55541a0
[  168.551221] FS:  00007fac75227700(0000) GS:ffff88839ba80000(0000) knlGS:0000000000000000
[  168.552278] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  168.553040] CR2: 0000000000000008 CR3: 0000000165ce7000 CR4: 00000000000006e0
[  168.554000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  168.554929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  168.555888] Call Trace:
[  168.556221]  <TASK>
[  168.556510]  blkg_create+0x1c0/0x8c0
[  168.556989]  blkg_conf_prep+0x574/0x650
[  168.557502]  ? stack_trace_save+0x99/0xd0
[  168.558033]  ? blkcg_conf_open_bdev+0x1b0/0x1b0
[  168.558629]  tg_set_conf.constprop.0+0xb9/0x280
[  168.559231]  ? kasan_set_track+0x29/0x40
[  168.559758]  ? kasan_set_free_info+0x30/0x60
[  168.560344]  ? tg_set_limit+0xae0/0xae0
[  168.560853]  ? do_sys_openat2+0x33b/0x640
[  168.561383]  ? do_sys_open+0xa2/0x100
[  168.561877]  ? __x64_sys_open+0x4e/0x60
[  168.562383]  ? __kasan_check_write+0x20/0x30
[  168.562951]  ? copyin+0x48/0x70
[  168.563390]  ? _copy_from_iter+0x234/0x9e0
[  168.563948]  tg_set_conf_u64+0x17/0x20
[  168.564467]  cgroup_file_write+0x1ad/0x380
[  168.565014]  ? cgroup_file_poll+0x80/0x80
[  168.565568]  ? __mutex_lock_slowpath+0x30/0x30
[  168.566165]  ? pgd_free+0x100/0x160
[  168.566649]  kernfs_fop_write_iter+0x21d/0x340
[  168.567246]  ? cgroup_file_poll+0x80/0x80
[  168.567796]  new_sync_write+0x29f/0x3c0
[  168.568314]  ? new_sync_read+0x410/0x410
[  168.568840]  ? __handle_mm_fault+0x1c97/0x2d80
[  168.569425]  ? copy_page_range+0x2b10/0x2b10
[  168.570007]  ? _raw_read_lock_bh+0xa0/0xa0
[  168.570622]  vfs_write+0x46e/0x630
[  168.571091]  ksys_write+0xcd/0x1e0
[  168.571563]  ? __x64_sys_read+0x60/0x60
[  168.572081]  ? __kasan_check_write+0x20/0x30
[  168.572659]  ? do_user_addr_fault+0x446/0xff0
[  168.573264]  __x64_sys_write+0x46/0x60
[  168.573774]  do_syscall_64+0x35/0x80
[  168.574264]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  168.574960] RIP: 0033:0x7fac74915130
[  168.575456] Code: 73 01 c3 48 8b 0d 58 ed 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 444
[  168.577969] RSP: 002b:00007ffc3080e288 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  168.578986] RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007fac74915130
[  168.579937] RDX: 0000000000000009 RSI: 000056007669f080 RDI: 0000000000000001
[  168.580884] RBP: 000056007669f080 R08: 000000000000000a R09: 00007fac75227700
[  168.581841] R10: 000056007655c8f0 R11: 0000000000000246 R12: 0000000000000009
[  168.582796] R13: 0000000000000001 R14: 00007fac74be55e0 R15: 00007fac74be08c0
[  168.583757]  </TASK>
[  168.584063] Modules linked in:
[  168.584494] CR2: 0000000000000008
[  168.584964] ---[ end trace 2475611ad0f77a1a ]---

This is because blkg_alloc() is called from blkg_conf_prep() without
holding 'q->queue_lock', and elevator is exited before blkg_create():

thread 1                            thread 2
blkg_conf_prep
 spin_lock_irq(&q->queue_lock);
 blkg_lookup_check -> return NULL
 spin_unlock_irq(&q->queue_lock);

 blkg_alloc
  blkcg_policy_enabled -> true
  pd = ->pd_alloc_fn
  blkg->pd[i] = pd
                                   blk_mq_exit_sched
                                    bfq_exit_queue
                                     blkcg_deactivate_policy
                                      spin_lock_irq(&q->queue_lock);
                                      __clear_bit(pol->plid, q->blkcg_pols);
                                      spin_unlock_irq(&q->queue_lock);
                                    q->elevator = NULL;
  spin_lock_irq(&q->queue_lock);
   blkg_create
    if (blkg->pd[i])
     ->pd_init_fn -> q->elevator is NULL
  spin_unlock_irq(&q->queue_lock);

Because blkcg_deactivate_policy() requires queue to be frozen, we can
grab q_usage_counter to synchoronize blkg_conf_prep() against
blkcg_deactivate_policy().

Fixes: e21b7a0b98 ("block, bfq: add full hierarchical scheduling and cgroups support")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20211020014036.2141723-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18 19:16:41 +01:00
Andrea Righi
3e6e11f853 blk-wbt: prevent NULL pointer dereference in wb_timer_fn
[ Upstream commit 480d42dc00 ]

The timer callback used to evaluate if the latency is exceeded can be
executed after the corresponding disk has been released, causing the
following NULL pointer dereference:

[ 119.987108] BUG: kernel NULL pointer dereference, address: 0000000000000098
[ 119.987617] #PF: supervisor read access in kernel mode
[ 119.987971] #PF: error_code(0x0000) - not-present page
[ 119.988325] PGD 7c4a4067 P4D 7c4a4067 PUD 7bf63067 PMD 0
[ 119.988697] Oops: 0000 [#1] SMP NOPTI
[ 119.988959] CPU: 1 PID: 9353 Comm: cloud-init Not tainted 5.15-rc5+arighi #rc5+arighi
[ 119.989520] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
[ 119.990055] RIP: 0010:wb_timer_fn+0x44/0x3c0
[ 119.990376] Code: 41 8b 9c 24 98 00 00 00 41 8b 94 24 b8 00 00 00 41 8b 84 24 d8 00 00 00 4d 8b 74 24 28 01 d3 01 c3 49 8b 44 24 60 48 8b 40 78 <4c> 8b b8 98 00 00 00 4d 85 f6 0f 84 c4 00 00 00 49 83 7c 24 30 00
[ 119.991578] RSP: 0000:ffffb5f580957da8 EFLAGS: 00010246
[ 119.991937] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
[ 119.992412] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88f476d7f780
[ 119.992895] RBP: ffffb5f580957dd0 R08: 0000000000000000 R09: 0000000000000000
[ 119.993371] R10: 0000000000000004 R11: 0000000000000002 R12: ffff88f476c84500
[ 119.993847] R13: ffff88f4434390c0 R14: 0000000000000000 R15: ffff88f4bdc98c00
[ 119.994323] FS: 00007fb90bcd9c00(0000) GS:ffff88f4bdc80000(0000) knlGS:0000000000000000
[ 119.994952] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 119.995380] CR2: 0000000000000098 CR3: 000000007c0d6000 CR4: 00000000000006e0
[ 119.995906] Call Trace:
[ 119.996130] ? blk_stat_free_callback_rcu+0x30/0x30
[ 119.996505] blk_stat_timer_fn+0x138/0x140
[ 119.996830] call_timer_fn+0x2b/0x100
[ 119.997136] __run_timers.part.0+0x1d1/0x240
[ 119.997470] ? kvm_clock_get_cycles+0x11/0x20
[ 119.997826] ? ktime_get+0x3e/0xa0
[ 119.998110] ? native_apic_msr_write+0x2c/0x30
[ 119.998456] ? lapic_next_event+0x20/0x30
[ 119.998779] ? clockevents_program_event+0x94/0xf0
[ 119.999150] run_timer_softirq+0x2a/0x50
[ 119.999465] __do_softirq+0xcb/0x26f
[ 119.999764] irq_exit_rcu+0x8c/0xb0
[ 120.000057] sysvec_apic_timer_interrupt+0x43/0x90
[ 120.000429] ? asm_sysvec_apic_timer_interrupt+0xa/0x20
[ 120.000836] asm_sysvec_apic_timer_interrupt+0x12/0x20

In this case simply return from the timer callback (no action
required) to prevent the NULL pointer dereference.

BugLink: https://bugs.launchpad.net/bugs/1947557
Link: https://lore.kernel.org/linux-mm/YWRNVTk9N8K0RMst@arighi-desktop/
Fixes: 34dbad5d26 ("blk-stat: convert to callback-based statistics reporting")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Link: https://lore.kernel.org/r/YW6N2qXpBU3oc50q@arighi-desktop
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18 19:16:34 +01:00
Jens Axboe
a3d08aae18 block: remove inaccurate requeue check
[ Upstream commit 037057a5a9 ]

This check is meant to catch cases where a requeue is attempted on a
request that is still inserted. It's never really been useful to catch any
misuse, and now it's actively wrong. Outside of that, this should not be a
BUG_ON() to begin with.

Remove the check as it's now causing active harm, as requeue off the plug
path will trigger it even though the request state is just fine.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Link: https://lore.kernel.org/linux-block/CAHj4cs80zAUc2grnCZ015-2Rvd-=gXRfB_dFKy=RTm+wRo09HQ@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18 19:16:17 +01:00
Jens Axboe
396c9e834d block: bump max plugged deferred size from 16 to 32
[ Upstream commit ba0ffdd8ce ]

Particularly for NVMe with efficient deferred submission for many
requests, there are nice benefits to be seen by bumping the default max
plug count from 16 to 32. This is especially true for virtualized setups,
where the submit part is more expensive. But can be noticed even on
native hardware.

Reduce the multiple queue factor from 4 to 2, since we're changing the
default size.

While changing it, move the defines into the block layer private header.
These aren't values that anyone outside of the block layer uses, or
should use.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18 19:16:16 +01:00
Linus Torvalds
a379fbbcb8 block-5.15-2021-10-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8DN8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptM8D/44F9YcY8qRhrZmsUFr0QFlvFHHUCVCWtDR
 JW3JQN3hV0zBEVIvc0P3NSKAih/1+rJ3WZmVZA0lczm5OHv4C+ESZSmcl3Muv4Tk
 skOWxwDTIoSCvC+DzDw8k5UluOucLU9V7uLHQYqDOsqHngLwUerGDMwGfkMXKkNb
 zRvVaqQMUJufY0tN5QaEjl+GsaXiZJ0pid0MOtXo8NeU+K0BDyoBUF5Gco3/8ZYa
 NtD4hwM48kYoCNJDAeJmRNo3vArPpZdiJ77jeVXHrj42Mp20LK/jD7PdguEbUzq5
 3uXhn0boZCKFGhWQntkL18WwaZbFRZzTBpBqpFIjQKIvicNRoGArIguwwmwdt53P
 lbsgGgyMqQ3KvuOIEgrAFieA/mQ8iw8Pf/QWiQRk2aYA5n+miex1XfmVX7dVipdm
 OcV5HtLrKPR1newr0/eZIvN31C3tgaViYxxQOunfW57fXPthCazal+ON6K5w9ZZ8
 y79P+K1czCS/edKLTB+idvmWWijoF4GRguUMoCKsD4uXOZO0tk/f/U/ds8/+LBxm
 KQv8T9wBd+r5h225cB+boM+zslkB4vqnCT+MyiIp070ZAbi/ohirYcFC2fM5j57B
 UZ58C+WjGjtC0nhb0xS8EGM43ow6hkS+8LPub3fb01AKKvfaw22grc9vHBaFgXgW
 5rqLDaAIcg==
 =73x5
 -----END PGP SIGNATURE-----

Merge tag 'block-5.15-2021-10-29' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request:
      - fix nvmet-tcp header digest verification (Amit Engel)
      - fix a memory leak in nvmet-tcp when releasing a queue (Maurizio
        Lombardi)
      - fix nvme-tcp H2CData PDU send accounting again (Sagi Grimberg)
      - fix digest pointer calculation in nvme-tcp and nvmet-tcp (Varun
        Prakash)
      - fix possible nvme-tcp req->offset corruption (Varun Prakash)

 - Queue drain ordering fix (Ming)

 - Partition check regression for zoned devices (Shin'ichiro)

 - Zone queue restart fix (Naohiro)

* tag 'block-5.15-2021-10-29' of git://git.kernel.dk/linux-block:
  block: Fix partition check for host-aware zoned block devices
  nvmet-tcp: fix header digest verification
  nvmet-tcp: fix data digest pointer calculation
  nvme-tcp: fix data digest pointer calculation
  nvme-tcp: fix possible req->offset corruption
  block: schedule queue restart after BLK_STS_ZONE_RESOURCE
  block: drain queue after disk is removed from sysfs
  nvme-tcp: fix H2CData PDU send accounting (again)
  nvmet-tcp: fix a memory leak when releasing a queue
2021-10-29 11:10:29 -07:00
Shin'ichiro Kawasaki
e0c60d0102 block: Fix partition check for host-aware zoned block devices
Commit a33df75c63 ("block: use an xarray for disk->part_tbl") modified
the method to check partition existence in host-aware zoned block
devices from disk_has_partitions() helper function call to empty check
of xarray disk->part_tbl. However, disk->part_tbl always has single
entry for disk->part0 and never becomes empty. This resulted in the
host-aware zoned devices always judged to have partitions, and it made
the sysfs queue/zoned attribute to be "none" instead of "host-aware"
regardless of partition existence in the devices.

This also caused DEBUG_LOCKS_WARN_ON(lock->magic != lock) for
sdkp->rev_mutex in scsi layer when the kernel detects host-aware zoned
device. Since block layer handled the host-aware zoned devices as non-
zoned devices, scsi layer did not have chance to initialize the mutex
for zone revalidation. Therefore, the warning was triggered.

To fix the issues, call the helper function disk_has_partitions() in
place of disk->part_tbl empty check. Since the function was removed with
the commit a33df75c63, reimplement it to walk through entries in the
xarray disk->part_tbl.

Fixes: a33df75c63 ("block: use an xarray for disk->part_tbl")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.14+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211026060115.753746-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27 06:58:01 -06:00
Naohiro Aota
9586e67b91 block: schedule queue restart after BLK_STS_ZONE_RESOURCE
When dispatching a zone append write request to a SCSI zoned block device,
if the target zone of the request is already locked, the device driver will
return BLK_STS_ZONE_RESOURCE and the request will be pushed back to the
hctx dipatch queue. The queue will be marked as RESTART in
dd_finish_request() and restarted in __blk_mq_free_request(). However, this
restart applies to the hctx of the completed request. If the requeued
request is on a different hctx, dispatch will no be retried until another
request is submitted or the next periodic queue run triggers, leading to up
to 30 seconds latency for the requeued request.

Fix this problem by scheduling a queue restart similarly to the
BLK_STS_RESOURCE case or when we cannot get the budget.

Also, consolidate the checks into the "need_resource" variable to simplify
the condition.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Link: https://lore.kernel.org/r/20211026165127.4151055-1-naohiro.aota@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-26 16:00:36 -06:00
Ming Lei
d308ae0d29 block: drain queue after disk is removed from sysfs
Before removing disk from sysfs, userspace still may change queue via
sysfs, such as switching elevator or setting wbt latency, both may
reinitialize wbt, then the warning in blk_free_queue_stats() will be
triggered since rq_qos_exit() is moved to del_gendisk().

Fixes the issue by moving draining queue & tearing down after disk is
removed from sysfs, at that time no one can come into queue's
store()/show().

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 8e141f9eb8 ("block: drain file system I/O on del_gendisk")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211026101204.2897166-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-26 08:44:38 -06:00
Linus Torvalds
9c0c4d24ac block-5.15-2021-10-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFzfzwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqvoEACZB9dFKiYyFcv6X6ARAhfKDuE4ImaJ/hLI
 VCt4M52P06nqywPQ7iyZMRWR/EVKW8ADDJiiAeXy3mDBgMqD6O27u894i1JP06sn
 5gHcSvfH1b9PNFMUw04aI3iXXFUzQU7pwn5z6o2nXlA4onPVW7vwTp8fcHd41Kep
 LqvZJbihW/iF9d0Wjs9LqPRBWXchtsVyxiNDBgC+kx5IYFn+oTnZOhlxw8ZiT/KH
 0v8FIq9HY+5n6UP7InZF2gtIQUyDTR5L1zKKvJu5LDoHvcNlM6Ke0m3DVPcgP79D
 2kKGyHOGfqC9Gr37qqOjgKqRO/Z/9SCvG39dmocAd/hh3AfUgKpDQs3HgLyx7ECT
 aRAe5n0XbfIVcHX1XaOc8cGrszan9YhJvt/dMCmkjaG/3hASlzl2kV4QF3f5IVjx
 oMgB1Kj8kyu6SqG8mCCjyGCxPpzNq8lVplJRlpifoz+ID/+hgt03aDoYVfPZkDRL
 nf4VdQCRSl3ZEXkHy1j6l6Nb2UgNEZP1B3a/9onSyBJ/WYqSfFMXrx29PSirz7m7
 x4jGOJvdqtNx09zjWHXc/d+I8BEXp4JDXe0GH0OHMiwCwz5PoMo99HRb+IuffKjR
 lWl4EimH0bfzOA/3vFr5TigfqbnDJ7HCRrGsodQX8gJhVaxVTWxeZG+7Y9qkLqnD
 JGlZeMQ37w==
 =uhGw
 -----END PGP SIGNATURE-----

Merge tag 'block-5.15-2021-10-22' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Fix for the cgroup code not ussing irq safe stats updates, and one fix
  for an error handling condition in add_partition()"

* tag 'block-5.15-2021-10-22' of git://git.kernel.dk/linux-block:
  block: fix incorrect references to disk objects
  blk-cgroup: blk_cgroup_bio_start() should use irq-safe operations on blkg->iostat_cpu
2021-10-22 17:42:13 -10:00
Zqiang
9fbfabfda2 block: fix incorrect references to disk objects
When adding partitions to the disk, the reference count of the disk
object is increased. then alloc partition device and called
device_add(), if the device_add() return error, the reference
count of the disk object will be reduced twice, at put_device(pdev)
and put_disk(disk). this leads to the end of the object's life cycle
prematurely, and trigger following calltrace.

  __init_work+0x2d/0x50 kernel/workqueue.c:519
  synchronize_rcu_expedited+0x3af/0x650 kernel/rcu/tree_exp.h:847
  bdi_remove_from_list mm/backing-dev.c:938 [inline]
  bdi_unregister+0x17f/0x5c0 mm/backing-dev.c:946
  release_bdi+0xa1/0xc0 mm/backing-dev.c:968
  kref_put include/linux/kref.h:65 [inline]
  bdi_put+0x72/0xa0 mm/backing-dev.c:976
  bdev_free_inode+0x11e/0x220 block/bdev.c:408
  i_callback+0x3f/0x70 fs/inode.c:226
  rcu_do_batch kernel/rcu/tree.c:2508 [inline]
  rcu_core+0x76d/0x16c0 kernel/rcu/tree.c:2743
  __do_softirq+0x1d7/0x93b kernel/softirq.c:558
  invoke_softirq kernel/softirq.c:432 [inline]
  __irq_exit_rcu kernel/softirq.c:636 [inline]
  irq_exit_rcu+0xf2/0x130 kernel/softirq.c:648
  sysvec_apic_timer_interrupt+0x93/0xc0

making disk is NULL when calling put_disk().

Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211018103422.2043-1-qiang.zhang1211@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 11:20:38 -06:00
Linus Torvalds
f2b3420b92 block-5.15-2021-10-17
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFsIqAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppbBEACDewLUv7bg1VFIGdroRN51OGiOv1oV+8HP
 ruY7O9CPtV7wcb3lA1Zy9igICuzuC5culHjbRJrNIUeTdWCQHCFk/sfKSD6VGMoT
 cFTqpKxV7M3vYr9G2m5TFWgY2mfS+I5fxyDZxK2z2esHCFw6TZ7A5W13xScVXKP+
 QdNFSlTrGkpggsSIEeHApG+NLsIecnkT4qzm8zPfUodUtQ3A8JMjQjnYUFEAWfWv
 l9x9zDIzaGjPtXf5soFEvmdh1ALh3WWiYb1kIwK1FeP/PYX0JV/3zCMgqOwpK+4b
 69OM3Q0NPHvu2TgSRK+ghekAtz5qgPDMCrzdhSgLYJEL/PGAOboqjrB9E+wWoEjd
 IKrYLx4Xao2TUZLJF2y34hHfODGdasx7d+wS191UpVFEZHFhDhIaazZ2rDd5xnQK
 LdzQw1JQF/igJovHauhSkGFIdJWBSDneLQoMimBnitZlsWARUmFSZej34FFRLZsW
 8ZXfqipn/x+fh4sQ/HdEfWxnGHtveDpU+0Ka5bMUe/tJ9RPtmn/Ye7nFjYecC6NY
 4UzFSNn+4e9DpHaDuP3I/eA1YBmVlcB5Hum3ve7X6ovwpjArYg3dgJOEi8uCZjfb
 hdMANmkVptcPiEO9njEHhC7S8+Nm3t+8o3qQceN81j6Vcjgzt/Y/n3Z6UkKeSlkn
 Ila+cZI1oA==
 =J/e4
 -----END PGP SIGNATURE-----

Merge tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Bigger than usual for this point in time, the majority is fixing some
  issues around BDI lifetimes with the move from the request_queue to
  the disk in this release. In detail:

   - Series on draining fs IO for del_gendisk() (Christoph)

   - NVMe pull request via Christoph:
        - fix the abort command id (Keith Busch)
        - nvme: fix per-namespace chardev deletion (Adam Manzanares)

   - brd locking scope fix (Tetsuo)

   - BFQ fix (Paolo)"

* tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-block:
  block, bfq: reset last_bfqq_created on group change
  block: warn when putting the final reference on a registered disk
  brd: reduce the brd_devices_mutex scope
  kyber: avoid q->disk dereferences in trace points
  block: keep q_usage_counter in atomic mode after del_gendisk
  block: drain file system I/O on del_gendisk
  block: split bio_queue_enter from blk_queue_enter
  block: factor out a blk_try_enter_queue helper
  block: call submit_bio_checks under q_usage_counter
  nvme: fix per-namespace chardev deletion
  block/rnbd-clt-sysfs: fix a couple uninitialized variable bugs
  nvme-pci: Fix abort command id
2021-10-17 19:25:20 -10:00
Tejun Heo
5370b0f490 blk-cgroup: blk_cgroup_bio_start() should use irq-safe operations on blkg->iostat_cpu
c3df5fb57f ("cgroup: rstat: fix A-A deadlock on 32bit around
u64_stats_sync") made u64_stats updates irq-safe to avoid A-A deadlocks.
Unfortunately, the conversion missed one in blk_cgroup_bio_start(). Fix it.

Fixes: 2d146aa3aa ("mm: memcontrol: switch to rstat")
Cc: stable@vger.kernel.org # v5.13+
Reported-by: syzbot+9738c8815b375ce482a1@syzkaller.appspotmail.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/YWi7NrQdVlxD6J9W@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-17 10:40:10 -06:00
Paolo Valente
d29bd41428 block, bfq: reset last_bfqq_created on group change
Since commit 430a67f9d6 ("block, bfq: merge bursts of newly-created
queues"), BFQ maintains a per-group pointer to the last bfq_queue
created. If such a queue, say bfqq, happens to move to a different
group, then bfqq is no more a valid last bfq_queue created for its
previous group. That pointer must then be cleared. Not resetting such
a pointer may also cause UAF, if bfqq happens to also be freed after
being moved to a different group. This commit performs this missing
reset. As such it fixes commit 430a67f9d6 ("block, bfq: merge bursts
of newly-created queues").

Such a missing reset is most likely the cause of the crash reported in [1].
With some analysis, we found that this crash was due to the
above UAF. And such UAF did go away with this commit applied [1].

Anyway, before this commit, that crash happened to be triggered in
conjunction with commit 2d52c58b9c ("block, bfq: honor already-setup
queue merges"). The latter was then reverted by commit ebc69e897e
("Revert "block, bfq: honor already-setup queue merges""). Yet commit
2d52c58b9c ("block, bfq: honor already-setup queue merges") contains
no error related with the above UAF, and can then be restored.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=214503

Fixes: 430a67f9d6 ("block, bfq: merge bursts of newly-created queues")
Tested-by: Grzegorz Kowal <custos.mentis@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20211015144336.45894-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-17 07:03:02 -06:00
Christoph Hellwig
a20417611b block: warn when putting the final reference on a registered disk
Warn when the last reference on a live disk is put without calling
del_gendisk first.  There are some BDI related bug reports that look
like a case of this, so make sure we have the proper instrumentation
to catch it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014130231.1468538-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-17 06:59:45 -06:00
Christoph Hellwig
c41108049d kyber: avoid q->disk dereferences in trace points
q->disk becomes invalid after the gendisk is removed.  Work around this
by caching the dev_t for the tracepoints.  The real fix would be to
properly tear down the I/O schedulers with the gendisk, but that is
a much more invasive change.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012093301.GA27795@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15 21:02:57 -06:00
Christoph Hellwig
aec89dc5d4 block: keep q_usage_counter in atomic mode after del_gendisk
Don't switch back to percpu mode to avoid the double RCU grace period
when tearing down SCSI devices.  After removing the disk only passthrough
commands can be send anyway.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-6-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15 21:02:54 -06:00
Christoph Hellwig
8e141f9eb8 block: drain file system I/O on del_gendisk
Instead of delaying draining of file system I/O related items like the
blk-qos queues, the integrity read workqueue and timeouts only when the
request_queue is removed, do that when del_gendisk is called.  This is
important for SCSI where the upper level drivers that control the gendisk
are separate entities, and the disk can be freed much earlier than the
request_queue, or can even be unbound without tearing down the queue.

Fixes: edb0872f44 ("block: move the bdi from the request_queue to the gendisk")
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-5-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15 21:02:50 -06:00
Christoph Hellwig
a6741536f4 block: split bio_queue_enter from blk_queue_enter
To prepare for fixing a gendisk shutdown race, open code the
blk_queue_enter logic in bio_queue_enter.  This also removes the
pointless flags translation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-4-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15 21:02:47 -06:00
Christoph Hellwig
1f14a09890 block: factor out a blk_try_enter_queue helper
Factor out the code to try to get q_usage_counter without blocking into
a separate helper.  Both to improve code readability and to prepare for
splitting bio_queue_enter from blk_queue_enter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-3-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15 21:02:44 -06:00
Christoph Hellwig
cc9c884dd7 block: call submit_bio_checks under q_usage_counter
Ensure all bios check the current values of the queue under freeze
protection, i.e. to make sure the zero capacity set by del_gendisk
is actually seen before dispatching to the driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210929071241.934472-2-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15 21:02:36 -06:00
Linus Torvalds
50eb0a06e6 block-5.15-2021-10-09
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFh4SQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphKJEACwasArWhI3zxFrl4QQfGsILTbqn7Ent7qp
 XsEunsQ/27wp8Pv2nTrQT/7Z2qYVuIs+EzPR1+DTFTVWycG3ZZCMp6ACIuAVH73b
 nD2vNQdHGcWqeipxbRYA/JBON+tVcFGbsFPREQUqn++3FIBHhKUvd/NGRPrim62s
 a63WEKZuHo/HHptvG9zmtqQAJ1N9wdA75VBn9GNgpqpW/lLePufNykEdox7/HHFg
 jJzvW8rppRBaLzoDj4RC4hp7Xgf0CPWScKqI3R20CHg5WjXkjDSvIWT3VQuFA/kU
 aRBO2cvoVYUw7w4xRosxTdJObaDXZZnXewA0SOGwGVAb9pIW5dhzby+bpuRx/Wkw
 3WrYGxndu1eis3YAoCVxllzo0zFMXGLjM3nfERketPHFPicLVVd0Pye+YAemyvMp
 Y9RWIRTRg6w/t0NicfYdK8LBTqEqdoys5OZZgaCyIPDV5dwhHCSDuACAwpR4qTAF
 PodCBG4DAc5TpE0Vy5QPywIe1cRvF8lnTZ+ZYXR7g3Ub1KoIl24gXuLvNMYrm0qb
 92l9S1+Bk1lT3nVThxz+rJSHsumlZHROd5TLkQs1S2bb7E9Pxyc6IW+H0hviutgc
 bN+aBf7O+U+oDoqkOGWFXB4aJMqFjjVW6z3DmhqE7MyVPE5jkpFSXY19ge2HOVwd
 AVXocaNAdw==
 =AbHC
 -----END PGP SIGNATURE-----

Merge tag 'block-5.15-2021-10-09' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Two small fixes for this release:

   - Add missing QUEUE_FLAG_HCTX_ACTIVE in the debugfs handling
     (Johannes)

   - Fix double free / UAF issue in __alloc_disk_node (Tetsuo)"

* tag 'block-5.15-2021-10-09' of git://git.kernel.dk/linux-block:
  block: decode QUEUE_FLAG_HCTX_ACTIVE in debugfs output
  block: genhd: fix double kfree() in __alloc_disk_node()
2021-10-09 14:51:59 -07:00
Johannes Thumshirn
1dbdd99b51 block: decode QUEUE_FLAG_HCTX_ACTIVE in debugfs output
While debugging an issue we've found that $DEBUGFS/block/$disk/state
doesn't decode QUEUE_FLAG_HCTX_ACTIVE but only displays its numerical
value.

Add QUEUE_FLAG(HCTX_ACTIVE) to the blk_queue_flag_name array so it'll get
decoded properly.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/4351076388918075bd80ef07756f9d2ce63be12c.1633332053.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-04 06:58:39 -06:00
Linus Torvalds
ab2a7a35c4 block-5.15-2021-10-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFXvGIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprAgD/9WKQIVPyzXC3Ui8OkSb9DM+dnU47+4b6sL
 Naud4khgrvMjTIdZC9YjwAhKBRNsXbcTXRLbBsxVzk7HhGI31jR711qaOZ/jMRbc
 dPwcpF82/Yr8BlIl5i3SBvxFNW8FAJWLTVvPvJXSjUBPECTgY6MnchojOjYz1CrN
 a0TmtHs5Feu+nLyzKgQlPMv3P8tvrS1hQNpfvbK1oWeEQuyONGacK82nl853ak30
 c0uywVsvO+yiBY2xWhiTqperbPwnLxCYOMOdBqbNvlPlGIkEbkW1YWhm2/5f3vRi
 37C3ibmt2SCnJaAyiNUKrYSTCK9BoeWFqtxVwyGB8WxWsUXrdocEOoBdohxDHrcp
 gHIqxPVUD2hyEroRQseq4NTvERkRvOiF4jAFA3lHOw5zLtPGNEvC0EUFRMm3Ur3/
 DyviDBPy321pvrbqGg5uIL3o6wuXDQNnwxdXa2wElWDLuTYp8tB25rqAe2IRJ+iS
 k32AsgfL+y5fhjsA6ros/hpQ6FQTAKSW3i6S/0L+QB6H0jYb1ue9JPrA+3MoizvI
 p+PdmpLeFVQxnfthT2J+6pzQADoxcMwg5ALgyLFvhhL1YAqyhHkKgdAIS3gKXxYm
 GbQLd2hkv7wPZHEsS/XiGCj+zzyawjsr5hSHRqkbZ1QlyfnuTz5uI65WzST68o4d
 KOAvQTx4aQ==
 =SMoX
 -----END PGP SIGNATURE-----

Merge tag 'block-5.15-2021-10-01' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A few block fixes for this release:

   - Revert a BFQ commit that causes breakage for people. Unfortunately
     it was auto-selected for stable as well, so now 5.14.7 suffers from
     it too. Hopefully stable will pick up this revert quickly too, so
     we can remove the issue on that end as well.

   - Add a quirk for Apple NVMe controllers, which due to their
     non-compliance broke due to the introduction of command sequences
     (Keith)

   - Use shifts in nbd, fixing a __divdi3 issue (Nick)"

* tag 'block-5.15-2021-10-01' of git://git.kernel.dk/linux-block:
  nbd: use shifts rather than multiplies
  Revert "block, bfq: honor already-setup queue merges"
  nvme: add command id quirk for apple controllers
2021-10-02 11:00:36 -07:00
Tetsuo Handa
06cc978d3f block: genhd: fix double kfree() in __alloc_disk_node()
syzbot is reporting use-after-free read at bdev_free_inode() [1], for
kfree() from __alloc_disk_node() is called before bdev_free_inode()
(which is called after RCU grace period) reads bdev->bd_disk and calls
kfree(bdev->bd_disk).

Fix use-after-free read followed by double kfree() problem
by making sure that bdev->bd_disk is NULL when calling iput().

Link: https://syzkaller.appspot.com/bug?extid=8281086e8a6fbfbd952a [1]
Reported-by: syzbot <syzbot+8281086e8a6fbfbd952a@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/e6dd13c5-8db0-4392-6e78-a42ee5d2a1c4@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-02 07:29:20 -06:00
Jens Axboe
ebc69e897e Revert "block, bfq: honor already-setup queue merges"
This reverts commit 2d52c58b9c.

We have had several folks complain that this causes hangs for them, which
is especially problematic as the commit has also hit stable already.

As no resolution seems to be forthcoming right now, revert the patch.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=214503
Fixes: 2d52c58b9c ("block, bfq: honor already-setup queue merges")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-28 06:33:15 -06:00
Linus Torvalds
bb19237bf6 SCSI fixes on 20210925
Thirty Three fixes, I'm afraid.  Essentially the build up from the
 last couple of weeks while I've been dealling with Linux Plumbers
 conference infrastructure issues.  It's mostly the usual assortment of
 spelling fixes and minor corrections.  The only core relevant changes
 are to the sd driver to reduce the spin up message spew and fix a
 small memory leak on the freeing path.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJsEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYU+VfiYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishVuMAP9Y2+rE
 AyQ3Y9dubie/AyKQ/2ZEFO/G2wz+jEJvfppXJwD4ulhhDFh/l4iNPVa7GW9M/ti3
 gd+6gMzzRC9/B+u2gQ==
 =+K+l
 -----END PGP SIGNATURE-----

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "Thirty-three fixes, I'm afraid.

  Essentially the build up from the last couple of weeks while I've been
  dealling with Linux Plumbers conference infrastructure issues. It's
  mostly the usual assortment of spelling fixes and minor corrections.

  The only core relevant changes are to the sd driver to reduce the spin
  up message spew and fix a small memory leak on the freeing path"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (33 commits)
  scsi: ses: Retry failed Send/Receive Diagnostic commands
  scsi: target: Fix spelling mistake "CONFLIFT" -> "CONFLICT"
  scsi: lpfc: Fix gcc -Wstringop-overread warning, again
  scsi: lpfc: Use correct scnprintf() limit
  scsi: lpfc: Fix sprintf() overflow in lpfc_display_fpin_wwpn()
  scsi: core: Remove 'current_tag'
  scsi: acornscsi: Remove tagged queuing vestiges
  scsi: fas216: Kill scmd->tag
  scsi: qla2xxx: Restore initiator in dual mode
  scsi: ufs: core: Unbreak the reset handler
  scsi: sd_zbc: Support disks with more than 2**32 logical blocks
  scsi: ufs: core: Revert "scsi: ufs: Synchronize SCSI and UFS error handling"
  scsi: bsg: Fix device unregistration
  scsi: sd: Make sd_spinup_disk() less noisy
  scsi: ufs: ufs-pci: Fix Intel LKF link stability
  scsi: mpt3sas: Clean up some inconsistent indenting
  scsi: megaraid: Clean up some inconsistent indenting
  scsi: sr: Fix spelling mistake "does'nt" -> "doesn't"
  scsi: Remove SCSI CDROM MAINTAINERS entry
  scsi: megaraid: Fix Coccinelle warning
  ...
2021-09-25 16:05:56 -07:00
Ming Lei
f278eb3d81 block: hold ->invalidate_lock in blkdev_fallocate
When running ->fallocate(), blkdev_fallocate() should hold
mapping->invalidate_lock to prevent page cache from being accessed,
otherwise stale data may be read in page cache.

Without this patch, blktests block/009 fails sometimes. With this patch,
block/009 can pass always.

Also as Jan pointed out, no pages can be created in the discarded area
while you are holding the invalidate_lock, so remove the 2nd
truncate_bdev_range().

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210923023751.1441091-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24 11:06:58 -06:00
Ming Lei
a647a524a4 block: don't call rq_qos_ops->done_bio if the bio isn't tracked
rq_qos framework is only applied on request based driver, so:

1) rq_qos_done_bio() needn't to be called for bio based driver

2) rq_qos_done_bio() needn't to be called for bio which isn't tracked,
such as bios ended from error handling code.

Especially in bio_endio():

1) request queue is referred via bio->bi_bdev->bd_disk->queue, which
may be gone since request queue refcount may not be held in above two
cases

2) q->rq_qos may be freed in blk_cleanup_queue() when calling into
__rq_qos_done_bio()

Fix the potential kernel panic by not calling rq_qos_ops->done_bio if
the bio isn't tracked. This way is safe because both ioc_rqos_done_bio()
and blkcg_iolatency_done_bio() are nop if the bio isn't tracked.

Reported-by: Yu Kuai <yukuai3@huawei.com>
Cc: tj@kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210924110704.1541818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24 11:04:39 -06:00
Li Jinlin
858560b276 blk-cgroup: fix UAF by grabbing blkcg lock before destroying blkg pd
KASAN reports a use-after-free report when doing fuzz test:

[693354.104835] ==================================================================
[693354.105094] BUG: KASAN: use-after-free in bfq_io_set_weight_legacy+0xd3/0x160
[693354.105336] Read of size 4 at addr ffff888be0a35664 by task sh/1453338

[693354.105607] CPU: 41 PID: 1453338 Comm: sh Kdump: loaded Not tainted 4.18.0-147
[693354.105610] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.81 07/02/2018
[693354.105612] Call Trace:
[693354.105621]  dump_stack+0xf1/0x19b
[693354.105626]  ? show_regs_print_info+0x5/0x5
[693354.105634]  ? printk+0x9c/0xc3
[693354.105638]  ? cpumask_weight+0x1f/0x1f
[693354.105648]  print_address_description+0x70/0x360
[693354.105654]  kasan_report+0x1b2/0x330
[693354.105659]  ? bfq_io_set_weight_legacy+0xd3/0x160
[693354.105665]  ? bfq_io_set_weight_legacy+0xd3/0x160
[693354.105670]  bfq_io_set_weight_legacy+0xd3/0x160
[693354.105675]  ? bfq_cpd_init+0x20/0x20
[693354.105683]  cgroup_file_write+0x3aa/0x510
[693354.105693]  ? ___slab_alloc+0x507/0x540
[693354.105698]  ? cgroup_file_poll+0x60/0x60
[693354.105702]  ? 0xffffffff89600000
[693354.105708]  ? usercopy_abort+0x90/0x90
[693354.105716]  ? mutex_lock+0xef/0x180
[693354.105726]  kernfs_fop_write+0x1ab/0x280
[693354.105732]  ? cgroup_file_poll+0x60/0x60
[693354.105738]  vfs_write+0xe7/0x230
[693354.105744]  ksys_write+0xb0/0x140
[693354.105749]  ? __ia32_sys_read+0x50/0x50
[693354.105760]  do_syscall_64+0x112/0x370
[693354.105766]  ? syscall_return_slowpath+0x260/0x260
[693354.105772]  ? do_page_fault+0x9b/0x270
[693354.105779]  ? prepare_exit_to_usermode+0xf9/0x1a0
[693354.105784]  ? enter_from_user_mode+0x30/0x30
[693354.105793]  entry_SYSCALL_64_after_hwframe+0x65/0xca

[693354.105875] Allocated by task 1453337:
[693354.106001]  kasan_kmalloc+0xa0/0xd0
[693354.106006]  kmem_cache_alloc_node_trace+0x108/0x220
[693354.106010]  bfq_pd_alloc+0x96/0x120
[693354.106015]  blkcg_activate_policy+0x1b7/0x2b0
[693354.106020]  bfq_create_group_hierarchy+0x1e/0x80
[693354.106026]  bfq_init_queue+0x678/0x8c0
[693354.106031]  blk_mq_init_sched+0x1f8/0x460
[693354.106037]  elevator_switch_mq+0xe1/0x240
[693354.106041]  elevator_switch+0x25/0x40
[693354.106045]  elv_iosched_store+0x1a1/0x230
[693354.106049]  queue_attr_store+0x78/0xb0
[693354.106053]  kernfs_fop_write+0x1ab/0x280
[693354.106056]  vfs_write+0xe7/0x230
[693354.106060]  ksys_write+0xb0/0x140
[693354.106064]  do_syscall_64+0x112/0x370
[693354.106069]  entry_SYSCALL_64_after_hwframe+0x65/0xca

[693354.106114] Freed by task 1453336:
[693354.106225]  __kasan_slab_free+0x130/0x180
[693354.106229]  kfree+0x90/0x1b0
[693354.106233]  blkcg_deactivate_policy+0x12c/0x220
[693354.106238]  bfq_exit_queue+0xf5/0x110
[693354.106241]  blk_mq_exit_sched+0x104/0x130
[693354.106245]  __elevator_exit+0x45/0x60
[693354.106249]  elevator_switch_mq+0xd6/0x240
[693354.106253]  elevator_switch+0x25/0x40
[693354.106257]  elv_iosched_store+0x1a1/0x230
[693354.106261]  queue_attr_store+0x78/0xb0
[693354.106264]  kernfs_fop_write+0x1ab/0x280
[693354.106268]  vfs_write+0xe7/0x230
[693354.106271]  ksys_write+0xb0/0x140
[693354.106275]  do_syscall_64+0x112/0x370
[693354.106280]  entry_SYSCALL_64_after_hwframe+0x65/0xca

[693354.106329] The buggy address belongs to the object at ffff888be0a35580
                 which belongs to the cache kmalloc-1k of size 1024
[693354.106736] The buggy address is located 228 bytes inside of
                 1024-byte region [ffff888be0a35580, ffff888be0a35980)
[693354.107114] The buggy address belongs to the page:
[693354.107273] page:ffffea002f828c00 count:1 mapcount:0 mapping:ffff888107c17080 index:0x0 compound_mapcount: 0
[693354.107606] flags: 0x17ffffc0008100(slab|head)
[693354.107760] raw: 0017ffffc0008100 ffffea002fcbc808 ffffea0030bd3a08 ffff888107c17080
[693354.108020] raw: 0000000000000000 00000000001c001c 00000001ffffffff 0000000000000000
[693354.108278] page dumped because: kasan: bad access detected

[693354.108511] Memory state around the buggy address:
[693354.108671]  ffff888be0a35500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[693354.116396]  ffff888be0a35580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[693354.124473] >ffff888be0a35600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[693354.132421]                                                        ^
[693354.140284]  ffff888be0a35680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[693354.147912]  ffff888be0a35700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[693354.155281] ==================================================================

blkgs are protected by both queue and blkcg locks and holding
either should stabilize them. However, the path of destroying
blkg policy data is only protected by queue lock in
blkcg_activate_policy()/blkcg_deactivate_policy(). Other tasks
can get the blkg policy data before the blkg policy data is
destroyed, and use it after destroyed, which will result in a
use-after-free.

CPU0                             CPU1
blkcg_deactivate_policy
  spin_lock_irq(&q->queue_lock)
                                 bfq_io_set_weight_legacy
                                   spin_lock_irq(&blkcg->lock)
                                   blkg_to_bfqg(blkg)
                                     pd_to_bfqg(blkg->pd[pol->plid])
                                     ^^^^^^blkg->pd[pol->plid] != NULL
                                           bfqg != NULL
  pol->pd_free_fn(blkg->pd[pol->plid])
    pd_to_bfqg(blkg->pd[pol->plid])
    bfqg_put(bfqg)
      kfree(bfqg)
  blkg->pd[pol->plid] = NULL
  spin_unlock_irq(q->queue_lock);
                                   bfq_group_set_weight(bfqg, val, 0)
                                     bfqg->entity.new_weight
                                     ^^^^^^trigger uaf here
                                   spin_unlock_irq(&blkcg->lock);

Fix by grabbing the matching blkcg lock before trying to
destroy blkg policy data.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Li Jinlin <lijinlin3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210914042605.3260596-1-lijinlin3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-15 12:03:18 -06:00
Yanfei Xu
6f5ddde410 blkcg: fix memory leak in blk_iolatency_init
BUG: memory leak
unreferenced object 0xffff888129acdb80 (size 96):
  comm "syz-executor.1", pid 12661, jiffies 4294962682 (age 15.220s)
  hex dump (first 32 bytes):
    20 47 c9 85 ff ff ff ff 20 d4 8e 29 81 88 ff ff   G...... ..)....
    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff82264ec8>] kmalloc include/linux/slab.h:591 [inline]
    [<ffffffff82264ec8>] kzalloc include/linux/slab.h:721 [inline]
    [<ffffffff82264ec8>] blk_iolatency_init+0x28/0x190 block/blk-iolatency.c:724
    [<ffffffff8225b8c4>] blkcg_init_queue+0xb4/0x1c0 block/blk-cgroup.c:1185
    [<ffffffff822253da>] blk_alloc_queue+0x22a/0x2e0 block/blk-core.c:566
    [<ffffffff8223b175>] blk_mq_init_queue_data block/blk-mq.c:3100 [inline]
    [<ffffffff8223b175>] __blk_mq_alloc_disk+0x25/0xd0 block/blk-mq.c:3124
    [<ffffffff826a9303>] loop_add+0x1c3/0x360 drivers/block/loop.c:2344
    [<ffffffff826a966e>] loop_control_get_free drivers/block/loop.c:2501 [inline]
    [<ffffffff826a966e>] loop_control_ioctl+0x17e/0x2e0 drivers/block/loop.c:2516
    [<ffffffff81597eec>] vfs_ioctl fs/ioctl.c:51 [inline]
    [<ffffffff81597eec>] __do_sys_ioctl fs/ioctl.c:874 [inline]
    [<ffffffff81597eec>] __se_sys_ioctl fs/ioctl.c:860 [inline]
    [<ffffffff81597eec>] __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:860
    [<ffffffff843fa745>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    [<ffffffff843fa745>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    [<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae

Once blk_throtl_init() queue init failed, blkcg_iolatency_exit() will
not be invoked for cleanup. That leads a memory leak. Swap the
blk_throtl_init() and blk_iolatency_init() calls can solve this.

Reported-by: syzbot+01321b15cc98e6bf96d6@syzkaller.appspotmail.com
Fixes: 19688d7f95 (block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls)
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210915072426.4022924-1-yanfei.xu@windriver.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-15 11:57:44 -06:00
Lihong Kou
3df49967f6 block: flush the integrity workqueue in blk_integrity_unregister
When the integrity profile is unregistered there can still be integrity
reads queued up which could see a NULL verify_fn as shown by the race
window below:

CPU0                                    CPU1
  process_one_work                      nvme_validate_ns
    bio_integrity_verify_fn                nvme_update_ns_info
	                                     nvme_update_disk_info
	                                       blk_integrity_unregister
                                               ---set queue->integrity as 0
	bio_integrity_process
	--access bi->profile->verify_fn(bi is a pointer of queue->integity)

Before calling blk_integrity_unregister in nvme_update_disk_info, we must
make sure that there is no work item in the kintegrityd_wq. Just call
blk_flush_integrity to flush the work queue so the bug can be resolved.

Signed-off-by: Lihong Kou <koulihong@huawei.com>
[hch: split up and shortened the changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20210914070657.87677-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-14 20:03:30 -06:00
Christoph Hellwig
783a40a1b3 block: check if a profile is actually registered in blk_integrity_unregister
While clearing the profile itself is harmless, we really should not clear
the stable writes flag if it wasn't set due to a registered integrity
profile.

Reported-by: Lihong Kou <koulihong@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20210914070657.87677-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-14 20:03:30 -06:00
Zenghui Yu
1a0db7744e scsi: bsg: Fix device unregistration
device_initialize() is used to take a refcount on the device. However,
put_device() is not called during device teardown. This leads to a
leak of private data of the driver core, dev_name(), etc. This is
reported by kmemleak at boot time if we compile kernel with
DEBUG_TEST_DRIVER_REMOVE.

Fix memory leaks during unregistration and implement a release
function.

Link: https://lore.kernel.org/r/20210911105306.1511-1-yuzenghui@huawei.com
Fixes: ead09dd3ae ("scsi: bsg: Simplify device registration")
Reviewed-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-09-14 00:22:15 -04:00
Ming Lei
67f3b2f822 blk-mq: avoid to iterate over stale request
blk-mq can't run allocating driver tag and updating ->rqs[tag]
atomically, meantime blk-mq doesn't clear ->rqs[tag] after the driver
tag is released.

So there is chance to iterating over one stale request just after the
tag is allocated and before updating ->rqs[tag].

scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi
in-flight requests after scsi host is blocked, so no new scsi command can
be marked as SCMD_STATE_INFLIGHT. However, driver tag allocation still can
be run by blk-mq core. One request is marked as SCMD_STATE_INFLIGHT,
but this request may have been kept in another slot of ->rqs[], meantime
the slot can be allocated out but ->rqs[] isn't updated yet. Then this
in-flight request is counted twice as SCMD_STATE_INFLIGHT. This way causes
trouble in handling scsi error.

Fixes the issue by not iterating over stale request.

Cc: linux-scsi@vger.kernel.org
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Reported-by: luojiaxing <luojiaxing@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210906065003.439019-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-12 19:32:43 -06:00
Linus Torvalds
c0f7e49fc4 block-5.15-2021-09-11
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmE8ueIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkSYD/9eaQ1Hxc+X+4eVb3A9Cpy36Qy/uY/hArnT
 kSUDtQitrRigqhStaD0MGpknWFnZE4cSojbYN0OoEWL7GC8idSZXx7KrVJpSHGbM
 XGVEflohvjDLNPkV99gmlzF2o6zPlWESApU1/HO2x+Ws1oKaYDAfFVf0CPGPe2C6
 MRerU5v3HSmTC0eFZxU246bwwX/phNuNDokndR27rrsjK0mLF5UoMKySeqy3INp5
 6mj3R+HNIW5j8eQk/HJPW7dgiKpWYneWV2Z90DuOLbcJ+wnx7s07wT1yRnOFUTsb
 p2ojVWmXtCJ1kRex6bK/eeIJC5TYvT3bNwsnIRmJHd9btHqhm2uKy77m3S1AuE7w
 K8bN581aXlr/3pUbFyYZDZQbYshUn25YP9OlyS9r4pklCh9C5KneL1b4xswWTDTB
 whvPZlkot3rGD8LHDpV5xVVzeaAcbSXanIRROjxHqQSRRTA9BjG3E4A2cDh8nmYD
 mRGEimfZcoojF2EQJYswPOQ24cZwpnihPpJO9NkOodRqfasn6XakAGg6SONFYyQ0
 Ewa6QzIOCebBgOVGbzMtpoDpnySE12ONmrDCbSEiYFJLXBMMiqgNON/Xaq0tmXHT
 lsDpyz3ytWAB9OZ3M0/9arZzlFf/E+FRqt4ExelmwxiutKRb1dIKQq8xip/YxdA+
 Y86kwUoAXQ==
 =1ajD
 -----END PGP SIGNATURE-----

Merge tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request from Christoph:
     - fix nvmet command set reporting for passthrough controllers (Adam Manzanares)
     - update a MAINTAINERS email address (Chaitanya Kulkarni)
     - set QUEUE_FLAG_NOWAIT for nvme-multipth (me)
     - handle errors from add_disk() (Luis Chamberlain)
     - update the keep alive interval when kato is modified (Tatsuya Sasaki)
     - fix a buffer overrun in nvmet_subsys_attr_serial (Hannes Reinecke)
     - do not reset transport on data digest errors in nvme-tcp (Daniel Wagner)
     - only call synchronize_srcu when clearing current path (Daniel Wagner)
     - revalidate paths during rescan (Hannes Reinecke)

 - Split out the fs/block_dev into block/fops.c and block/bdev.c, which
   has been long overdue. Do this now before -rc1, to avoid annoying
   conflicts due to this (Christoph)

 - blk-throtl use-after-free fix (Li)

 - Improve plug depth for multi-device plugs, greatly increasing md
   resync performance (Song)

 - blkdev_show() locking fix (Tetsuo)

 - n64cart error check fix (Yang)

* tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-block:
  n64cart: fix return value check in n64cart_probe()
  blk-mq: allow 4x BLK_MAX_REQUEST_COUNT at blk_plug for multiple_queues
  block: move fs/block_dev.c to block/bdev.c
  block: split out operations on block special files
  blk-throttle: fix UAF by deleteing timer in blk_throtl_exit()
  block: genhd: don't call blkdev_show() with major_names_lock held
  nvme: update MAINTAINERS email address
  nvme: add error handling support for add_disk()
  nvme: only call synchronize_srcu when clearing current path
  nvme: update keep alive interval when kato is modified
  nvme-tcp: Do not reset transport on data digest errors
  nvmet: fixup buffer overrun in nvmet_subsys_attr_serial()
  nvmet: return bool from nvmet_passthru_ctrl and nvmet_is_passthru_req
  nvmet: looks at the passthrough controller when initializing CAP
  nvme: move nvme_multi_css into nvme.h
  nvme-multipath: revalidate paths during rescan
  nvme-multipath: set QUEUE_FLAG_NOWAIT
2021-09-11 10:19:51 -07:00
Song Liu
7f2a6a69f7 blk-mq: allow 4x BLK_MAX_REQUEST_COUNT at blk_plug for multiple_queues
Limiting number of request to BLK_MAX_REQUEST_COUNT at blk_plug hurts
performance for large md arrays. [1] shows resync speed of md array drops
for md array with more than 16 HDDs.

Fix this by allowing more request at plug queue. The multiple_queue flag
is used to only apply higher limit to multiple queue cases.

[1] https://lore.kernel.org/linux-raid/CAFDAVznS71BXW8Jxv6k9dXc2iR3ysX3iZRBww_rzA8WifBFxGg@mail.gmail.com/
Tested-by: Marcin Wanat <marcin.wanat@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07 17:05:12 -06:00
Christoph Hellwig
0dca4462ed block: move fs/block_dev.c to block/bdev.c
Move it together with the rest of the block layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210907141303.1371844-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07 08:39:40 -06:00
Christoph Hellwig
cd82cca7eb block: split out operations on block special files
Add a new block/fops.c for all the file and address_space operations
that provide the block special file support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210907141303.1371844-2-hch@lst.de
[axboe: correct trailing whitespace while at it]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07 08:39:29 -06:00
Li Jinlin
884f0e84f1 blk-throttle: fix UAF by deleteing timer in blk_throtl_exit()
The pending timer has been set up in blk_throtl_init(). However, the
timer is not deleted in blk_throtl_exit(). This means that the timer
handler may still be running after freeing the timer, which would
result in a use-after-free.

Fix by calling del_timer_sync() to delete the timer in blk_throtl_exit().

Signed-off-by: Li Jinlin <lijinlin3@huawei.com>
Link: https://lore.kernel.org/r/20210907121242.2885564-1-lijinlin3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07 08:36:56 -06:00
Tetsuo Handa
dfbb3409b2 block: genhd: don't call blkdev_show() with major_names_lock held
If CONFIG_BLK_DEV_LOOP && CONFIG_MTD (at least; there might be other
combinations), lockdep complains circular locking dependency at
__loop_clr_fd(), for major_names_lock serves as a locking dependency
aggregating hub across multiple block modules.

 ======================================================
 WARNING: possible circular locking dependency detected
 5.14.0+ #757 Tainted: G            E
 ------------------------------------------------------
 systemd-udevd/7568 is trying to acquire lock:
 ffff88800f334d48 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x70/0x560

 but task is already holding lock:
 ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop]

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #6 (&lo->lo_mutex){+.+.}-{3:3}:
        lock_acquire+0xbe/0x1f0
        __mutex_lock_common+0xb6/0xe10
        mutex_lock_killable_nested+0x17/0x20
        lo_open+0x23/0x50 [loop]
        blkdev_get_by_dev+0x199/0x540
        blkdev_open+0x58/0x90
        do_dentry_open+0x144/0x3a0
        path_openat+0xa57/0xda0
        do_filp_open+0x9f/0x140
        do_sys_openat2+0x71/0x150
        __x64_sys_openat+0x78/0xa0
        do_syscall_64+0x3d/0xb0
        entry_SYSCALL_64_after_hwframe+0x44/0xae

 -> #5 (&disk->open_mutex){+.+.}-{3:3}:
        lock_acquire+0xbe/0x1f0
        __mutex_lock_common+0xb6/0xe10
        mutex_lock_nested+0x17/0x20
        bd_register_pending_holders+0x20/0x100
        device_add_disk+0x1ae/0x390
        loop_add+0x29c/0x2d0 [loop]
        blk_request_module+0x5a/0xb0
        blkdev_get_no_open+0x27/0xa0
        blkdev_get_by_dev+0x5f/0x540
        blkdev_open+0x58/0x90
        do_dentry_open+0x144/0x3a0
        path_openat+0xa57/0xda0
        do_filp_open+0x9f/0x140
        do_sys_openat2+0x71/0x150
        __x64_sys_openat+0x78/0xa0
        do_syscall_64+0x3d/0xb0
        entry_SYSCALL_64_after_hwframe+0x44/0xae

 -> #4 (major_names_lock){+.+.}-{3:3}:
        lock_acquire+0xbe/0x1f0
        __mutex_lock_common+0xb6/0xe10
        mutex_lock_nested+0x17/0x20
        blkdev_show+0x19/0x80
        devinfo_show+0x52/0x60
        seq_read_iter+0x2d5/0x3e0
        proc_reg_read_iter+0x41/0x80
        vfs_read+0x2ac/0x330
        ksys_read+0x6b/0xd0
        do_syscall_64+0x3d/0xb0
        entry_SYSCALL_64_after_hwframe+0x44/0xae

 -> #3 (&p->lock){+.+.}-{3:3}:
        lock_acquire+0xbe/0x1f0
        __mutex_lock_common+0xb6/0xe10
        mutex_lock_nested+0x17/0x20
        seq_read_iter+0x37/0x3e0
        generic_file_splice_read+0xf3/0x170
        splice_direct_to_actor+0x14e/0x350
        do_splice_direct+0x84/0xd0
        do_sendfile+0x263/0x430
        __se_sys_sendfile64+0x96/0xc0
        do_syscall_64+0x3d/0xb0
        entry_SYSCALL_64_after_hwframe+0x44/0xae

 -> #2 (sb_writers#3){.+.+}-{0:0}:
        lock_acquire+0xbe/0x1f0
        lo_write_bvec+0x96/0x280 [loop]
        loop_process_work+0xa68/0xc10 [loop]
        process_one_work+0x293/0x480
        worker_thread+0x23d/0x4b0
        kthread+0x163/0x180
        ret_from_fork+0x1f/0x30

 -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
        lock_acquire+0xbe/0x1f0
        process_one_work+0x280/0x480
        worker_thread+0x23d/0x4b0
        kthread+0x163/0x180
        ret_from_fork+0x1f/0x30

 -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
        validate_chain+0x1f0d/0x33e0
        __lock_acquire+0x92d/0x1030
        lock_acquire+0xbe/0x1f0
        flush_workqueue+0x8c/0x560
        drain_workqueue+0x80/0x140
        destroy_workqueue+0x47/0x4f0
        __loop_clr_fd+0xb4/0x400 [loop]
        blkdev_put+0x14a/0x1d0
        blkdev_close+0x1c/0x20
        __fput+0xfd/0x220
        task_work_run+0x69/0xc0
        exit_to_user_mode_prepare+0x1ce/0x1f0
        syscall_exit_to_user_mode+0x26/0x60
        do_syscall_64+0x4c/0xb0
        entry_SYSCALL_64_after_hwframe+0x44/0xae

 other info that might help us debug this:

 Chain exists of:
   (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&lo->lo_mutex);
                                lock(&disk->open_mutex);
                                lock(&lo->lo_mutex);
   lock((wq_completion)loop0);

  *** DEADLOCK ***

 2 locks held by systemd-udevd/7568:
  #0: ffff888012554128 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0x4c/0x1d0
  #1: ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop]

 stack backtrace:
 CPU: 0 PID: 7568 Comm: systemd-udevd Tainted: G            E     5.14.0+ #757
 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
 Call Trace:
  dump_stack_lvl+0x79/0xbf
  print_circular_bug+0x5d6/0x5e0
  ? stack_trace_save+0x42/0x60
  ? save_trace+0x3d/0x2d0
  check_noncircular+0x10b/0x120
  validate_chain+0x1f0d/0x33e0
  ? __lock_acquire+0x953/0x1030
  ? __lock_acquire+0x953/0x1030
  __lock_acquire+0x92d/0x1030
  ? flush_workqueue+0x70/0x560
  lock_acquire+0xbe/0x1f0
  ? flush_workqueue+0x70/0x560
  flush_workqueue+0x8c/0x560
  ? flush_workqueue+0x70/0x560
  ? sched_clock_cpu+0xe/0x1a0
  ? drain_workqueue+0x41/0x140
  drain_workqueue+0x80/0x140
  destroy_workqueue+0x47/0x4f0
  ? blk_mq_freeze_queue_wait+0xac/0xd0
  __loop_clr_fd+0xb4/0x400 [loop]
  ? __mutex_unlock_slowpath+0x35/0x230
  blkdev_put+0x14a/0x1d0
  blkdev_close+0x1c/0x20
  __fput+0xfd/0x220
  task_work_run+0x69/0xc0
  exit_to_user_mode_prepare+0x1ce/0x1f0
  syscall_exit_to_user_mode+0x26/0x60
  do_syscall_64+0x4c/0xb0
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7f0fd4c661f7
 Code: 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 13 fc ff ff
 RSP: 002b:00007ffd1c9e9fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
 RAX: 0000000000000000 RBX: 00007f0fd46be6c8 RCX: 00007f0fd4c661f7
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000006
 RBP: 0000000000000006 R08: 000055fff1eaf400 R09: 0000000000000000
 R10: 00007f0fd46be6c8 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000000000000000 R14: 0000000000002f08 R15: 00007ffd1c9ea050

Commit 1c500ad706 ("loop: reduce the loop_ctl_mutex scope") is for
breaking "loop_ctl_mutex => &lo->lo_mutex" dependency chain. But enabling
a different block module results in forming circular locking dependency
due to shared major_names_lock mutex.

The simplest fix is to call probe function without holding
major_names_lock [1], but Christoph Hellwig does not like such idea.
Therefore, instead of holding major_names_lock in blkdev_show(),
introduce a different lock for blkdev_show() in order to break
"sb_writers#$N => &p->lock => major_names_lock" dependency chain.

Link: https://lkml.kernel.org/r/b2af8a5b-3c1b-204e-7f56-bea0b15848d6@i-love.sakura.ne.jp [1]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://lore.kernel.org/r/18a02da2-0bf3-550e-b071-2b4ab13c49f0@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07 08:36:21 -06:00
Linus Torvalds
1dbe7e386f block-5.15-2021-09-05
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmE1hQcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprjfEACdG+medwQOPpKNSoAvQmYyQnZRMbPjiruv
 A4nW2L6MKaExO59qQLVbBYHaH2+ng2UR/p5jNi2AKm+hrQEYllxlNvuCkRBIn97J
 r45R48mzBbHjR4kE3Fdu1mOFpBWOuU9JrtzHI+JF/Sl/qPIxKYNHf5E66T6l90Fz
 0hJkorAoVB7+hQYixdmkM9quZy11D5SY3aM+bG8r2uNjZTBEHMfmOen8o1giR0vC
 EOHzObuC6WLjLGQInNW+Cq2//vVVybQa79mhOUMp93z5nhDMtwUu7MH4B4kmGpix
 GLjDa1DukUZe7nGcnsRKmjjXQ+BpG6YF52Z2RfVZpWZn83t5c4YQsq++TPZ8KfpK
 4NAFFuSbGM/+QWwEiiyWu00syvpzrEJ4ZIJyZX3FYEeKyKWVRGHqlMDcS9LstYOk
 4OfgQUcJ7f/fXeedwi0OGJS1BLr6fi8RnazIafCNIIJLe1XIwTsNufPCNxWYqDAi
 0XhH+uYGD38VoUiR5JymZku6frwY4kxssA1khPPE5jWbzCZXiHprwwzaP4hBNNeZ
 c5cn9/1ZQSoTE3ebrX9pzTn5wRZwAL+iDhZ2SpLlN2Ji1BJ4EM9H8qFGj3U/CSM4
 OWKY2c0VwJYQUhjO4QDBx0MblJgNy8HsvmqGETuxUlk56j3Q1Mx3ViPV43amP9eM
 OM4mGige3Q==
 =4SCA
 -----END PGP SIGNATURE-----

Merge tag 'block-5.15-2021-09-05' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Was going to send this one in later this week, but given that -Werror
  is now enabled (or at least available), the mq-deadline fix really
  should go in for the folks hitting that.

   - Ensure dd_queued() is only there if needed (Geert)

   - Fix a kerneldoc warning for bio_alloc_kiocb()

   - BFQ fix for queue merging

   - loop locking fix (Tetsuo)"

* tag 'block-5.15-2021-09-05' of git://git.kernel.dk/linux-block:
  loop: reduce the loop_ctl_mutex scope
  bio: fix kerneldoc documentation for bio_alloc_kiocb()
  block, bfq: honor already-setup queue merges
  block/mq-deadline: Move dd_queued() to fix defined but not used warning
2021-09-06 10:06:26 -07:00
Linus Torvalds
14726903c8 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "173 patches.

  Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
  pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
  bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
  hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
  oom-kill, migration, ksm, percpu, vmstat, and madvise)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
  mm/madvise: add MADV_WILLNEED to process_madvise()
  mm/vmstat: remove unneeded return value
  mm/vmstat: simplify the array size calculation
  mm/vmstat: correct some wrong comments
  mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
  selftests: vm: add COW time test for KSM pages
  selftests: vm: add KSM merging time test
  mm: KSM: fix data type
  selftests: vm: add KSM merging across nodes test
  selftests: vm: add KSM zero page merging test
  selftests: vm: add KSM unmerge test
  selftests: vm: add KSM merge test
  mm/migrate: correct kernel-doc notation
  mm: wire up syscall process_mrelease
  mm: introduce process_mrelease system call
  memblock: make memblock_find_in_range method private
  mm/mempolicy.c: use in_task() in mempolicy_slab_node()
  mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
  mm/mempolicy: advertise new MPOL_PREFERRED_MANY
  mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
  ...
2021-09-03 10:08:28 -07:00
Christoph Hellwig
f358afc52c mm: remove flush_kernel_dcache_page
flush_kernel_dcache_page is a rather confusing interface that implements a
subset of flush_dcache_page by not being able to properly handle page
cache mapped pages.

The only callers left are in the exec code as all other previous callers
were incorrect as they could have dealt with page cache pages.  Replace
the calls to flush_kernel_dcache_page with calls to flush_dcache_page,
which for all architectures does either exactly the same thing, can
contains one or more of the following:

 1) an optimization to defer the cache flush for page cache pages not
    mapped into userspace
 2) additional flushing for mapped page cache pages if cache aliases
    are possible

Link: https://lkml.kernel.org/r/20210712060928.4161649-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:13 -07:00
Jens Axboe
0ef47db1cb bio: fix kerneldoc documentation for bio_alloc_kiocb()
Apparently the last fixup got butter fingered a bit, the correct variable
name is 'nr_vecs', not 'nr_iovecs'.

Link: https://lore.kernel.org/lkml/20210903164939.02f6e8c5@canb.auug.org.au/
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-03 07:42:13 -06:00
Linus Torvalds
a9c9a6f741 SCSI misc on 20210902
This series consists of the usual driver updates (ufs, qla2xxx,
 target, smartpqi, lpfc, mpt3sas).  The core change causing the most
 churn was replacing the command request field request with a macro,
 allowing us to offset map to it and remove the redundant field; the
 same was also done for the tag field.  The most impactful change is
 the final removal of scsi_ioctl, which has been deprecated for over a
 decade.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYTD/TiYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishdUkAQCjb3Ux
 4K9438mMelHlzM4er1S1IJ0WNnvObaVMNO9LBwD+JUz+rHsrKvuEX9j3g3C3u6JH
 hC3BUEW8f2LLnujWanQ=
 =lC5o
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
 "This series consists of the usual driver updates (ufs, qla2xxx,
  target, smartpqi, lpfc, mpt3sas).

  The core change causing the most churn was replacing the command
  request field request with a macro, allowing us to offset map to it
  and remove the redundant field; the same was also done for the tag
  field.

  The most impactful change is the final removal of scsi_ioctl, which
  has been deprecated for over a decade"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (293 commits)
  scsi: ufs: Fix ufshcd_request_sense_async() for Samsung KLUFG8RHDA-B2D1
  scsi: ufs: ufs-exynos: Fix static checker warning
  scsi: mpt3sas: Use the proper SCSI midlayer interfaces for PI
  scsi: lpfc: Use the proper SCSI midlayer interfaces for PI
  scsi: lpfc: Copyright updates for 14.0.0.1 patches
  scsi: lpfc: Update lpfc version to 14.0.0.1
  scsi: lpfc: Add bsg support for retrieving adapter cmf data
  scsi: lpfc: Add cmf_info sysfs entry
  scsi: lpfc: Add debugfs support for cm framework buffers
  scsi: lpfc: Add support for maintaining the cm statistics buffer
  scsi: lpfc: Add rx monitoring statistics
  scsi: lpfc: Add support for the CM framework
  scsi: lpfc: Add cmfsync WQE support
  scsi: lpfc: Add support for cm enablement buffer
  scsi: lpfc: Add cm statistics buffer support
  scsi: lpfc: Add EDC ELS support
  scsi: lpfc: Expand FPIN and RDF receive logging
  scsi: lpfc: Add MIB feature enablement support
  scsi: lpfc: Add SET_HOST_DATA mbox cmd to pass date/time info to firmware
  scsi: fc: Add EDC ELS definition
  ...
2021-09-02 15:09:46 -07:00
Paolo Valente
2d52c58b9c block, bfq: honor already-setup queue merges
The function bfq_setup_merge prepares the merging between two
bfq_queues, say bfqq and new_bfqq. To this goal, it assigns
bfqq->new_bfqq = new_bfqq. Then, each time some I/O for bfqq arrives,
the process that generated that I/O is disassociated from bfqq and
associated with new_bfqq (merging is actually a redirection). In this
respect, bfq_setup_merge increases new_bfqq->ref in advance, adding
the number of processes that are expected to be associated with
new_bfqq.

Unfortunately, the stable-merging mechanism interferes with this
setup. After bfqq->new_bfqq has been set by bfq_setup_merge, and
before all the expected processes have been associated with
bfqq->new_bfqq, bfqq may happen to be stably merged with a different
queue than the current bfqq->new_bfqq. In this case, bfqq->new_bfqq
gets changed. So, some of the processes that have been already
accounted for in the ref counter of the previous new_bfqq will not be
associated with that queue.  This creates an unbalance, because those
references will never be decremented.

This commit fixes this issue by reestablishing the previous, natural
behaviour: once bfqq->new_bfqq has been set, it will not be changed
until all expected redirections have occurred.

Signed-off-by: Davide Zini <davidezini2@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210802141352.74353-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-02 06:36:58 -06:00
Geert Uytterhoeven
55a51ea140 block/mq-deadline: Move dd_queued() to fix defined but not used warning
If CONFIG_BLK_DEBUG_FS=n:

    block/mq-deadline.c:274:12: warning: ‘dd_queued’ defined but not used [-Wunused-function]
      274 | static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
	  |            ^~~~~~~~~

Fix this by moving dd_queued() just before the sole function that calls
it.

Fixes: 7b05bf7710 ("Revert "block/mq-deadline: Prioritize high-priority requests"")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes: 38ba64d12d ("block/mq-deadline: Track I/O statistics")
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210830091128.1854266-1-geert@linux-m68k.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-02 06:34:45 -06:00
Linus Torvalds
87045e6546 for-5.15-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmEs2NIACgkQxWXV+ddt
 WDsJMQ/+PJ/yXfI85mAeAzTJLWQ0zD6YO3iBhf3wOeyychWC4on435pj+zW8zR/U
 /bix25ygoWF4MvGF6p0uyv4Z5mnvkZXE5lapUcJu6wXG7se1QRPH0broTh05IBXK
 SnT93Eb9RexaiNFk7DVma9XkviqZ/ZISPtkJ9wYrfIba7j/U/wa+PtEFS7wk58hP
 rFQXgV64xm/pcP28YYHfOkCjdyUMdJrnBUvfKOlX6d94lmYbP5lyiTL+XJEXExzN
 wPakD0UsnXPr4TRvf+YRTPeFHPPUgyORII7otVUOKmGywWtcJrELX8rXFoW+6GwB
 dzZIcSYXHUxU5UrtMbZgiztVBJ+bQY5juYMIrj13eYOMYkijxAqPP84iDO15+TSV
 zNqyAVjUglHCGUGjhSpAxnAmtp+IJTZfVAWcvIKq3VqvJtb8tssQsk9bqFjH1xlH
 qNJLE57CYe3tjw05K9y0keMh2iJWRWkXZYkgI/zjwo5nreemobpN+3fO4yneVLh7
 ecdBmSl/JVSzAB1NamLOCZNGZLUqiiuTvZlJtI6ZsekrN1+4A6QzVcU/MGjSYL1v
 C7W0hK0LF+e3xIBkxTKVq8noolsgbmlWacxJq8fZq9HwZy5IVJOVm9STDlCuLaIo
 gPr0V0itkclcsMU0CHTyCjMsfuHYUwJZXwg93wKfJf5UCzS4OWU=
 =ALO9
 -----END PGP SIGNATURE-----

Merge tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "The highlights of this round are integrations with fs-verity and
  idmapped mounts, the rest is usual mix of minor improvements, speedups
  and cleanups.

  There are some patches outside of btrfs, namely updating some VFS
  interfaces, all straightforward and acked.

  Features:

   - fs-verity support, using standard ioctls, backward compatible with
     read-only limitation on inodes with previously enabled fs-verity

   - idmapped mount support

   - make mount with rescue=ibadroots more tolerant to partially damaged
     trees

   - allow raid0 on a single device and raid10 on two devices,
     degenerate cases but might be useful as an intermediate step during
     conversion to other profiles

   - zoned mode block group auto reclaim can be disabled via sysfs knob

  Performance improvements:

   - continue readahead of node siblings even if target node is in
     memory, could speed up full send (on sample test +11%)

   - batching of delayed items can speed up creating many files

   - fsync/tree-log speedups
       - avoid unnecessary work (gains +2% throughput, -2% run time on
         sample load)
       - reduced lock contention on renames (on dbench +4% throughput,
         up to -30% latency)

  Fixes:

   - various zoned mode fixes

   - preemptive flushing threshold tuning, avoid excessive work on
     almost full filesystems

  Core:

   - continued subpage support, preparation for implementing remaining
     features like compression and defragmentation; with some
     limitations, write is now enabled on 64K page systems with 4K
     sectors, still considered experimental
       - no readahead on compressed reads
       - inline extents disabled
       - disabled raid56 profile conversion and mount

   - improved flushing logic, fixing early ENOSPC on some workloads

   - inode flags have been internally split to read-only and read-write
     incompat bit parts, used by fs-verity

   - new tree items for fs-verity
       - descriptor item
       - Merkle tree item

   - inode operations extended to be namespace-aware

   - cleanups and refactoring

  Generic code changes:

   - fs: new export filemap_fdatawrite_wbc

   - fs: removed sync_inode

   - block: bio_trim argument type fixups

   - vfs: add namespace-aware lookup"

* tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits)
  btrfs: reset replace target device to allocation state on close
  btrfs: zoned: fix ordered extent boundary calculation
  btrfs: do not do preemptive flushing if the majority is global rsv
  btrfs: reduce the preemptive flushing threshold to 90%
  btrfs: tree-log: check btrfs_lookup_data_extent return value
  btrfs: avoid unnecessarily logging directories that had no changes
  btrfs: allow idmapped mount
  btrfs: handle ACLs on idmapped mounts
  btrfs: allow idmapped INO_LOOKUP_USER ioctl
  btrfs: allow idmapped SUBVOL_SETFLAGS ioctl
  btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls
  btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids
  btrfs: allow idmapped SNAP_DESTROY ioctls
  btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls
  btrfs: check whether fsgid/fsuid are mapped during subvolume creation
  btrfs: allow idmapped permission inode op
  btrfs: allow idmapped setattr inode op
  btrfs: allow idmapped tmpfile inode op
  btrfs: allow idmapped symlink inode op
  btrfs: allow idmapped mkdir inode op
  ...
2021-08-31 09:41:22 -07:00
Linus Torvalds
3b629f8d6d io_uring-bio-cache.5-2021-08-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs8QQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgAgD/wP9gGxrFE5oxtdozDPkEYTXn5e0QKseDyV
 cNxLmSb3wc4WIEPwjCavdQHpy0fnbjaYwGveHf9ygQwDZPj9WBgEL3ipPYXCCzFA
 ysoV86kBRxKDI476r2InxI8WaW7hV0IWxPlScUTA1QeeNAzRJDymQvRuwg5KvVRS
 Jt6R58khzWpEGYO2CqFTpGsA7x01R0kvZ54xmFgKZ+Pxo+Bk03fkO32YUFC49Wm8
 Zy+JMsaiIlLgucDTJ4zAKjQUXiwP2GMEw5Vk/lLUFGBvyw0AN2rO9g18L7QW2ZUu
 vnkaJQwBbMUbgveXlI/y6GG/vuKUG2i4AmzNJH17qFCnimO3JY6vgzUOg5dqOiwx
 bx7ZzmnBWgQp95/cSAlZ4QwRYf3z0hvVFKPj9U3X9wKGmuxUKHiLResQwp7bzRdd
 4L4Jo1WFDDHR/1MOOzzW0uxE3uTm0LKcncsi4hJL20dl+16RXCIbzHWUTAd8yyMV
 9QeUAumc4GHOeswa1Ms8jLPAgXyEoAkec7ca7cRIY/NW+DXGLG9tYBgCw1eLe6BN
 M7LwMsPNlS2v2dMUbiuw8XxkA+uYso728e2vd/edca2jxXj8+SVnm020aYBnxIzh
 nmjbf69+QddBPEnk/EPvRj8tXOhr3k7FklI4R7qlei/+IGTujGPvM4kn3p6fnHrx
 d7bsu/jtaQ==
 =izfH
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block

Pull support for struct bio recycling from Jens Axboe:
 "This adds bio recycling support for polled IO, allowing quick reuse of
  a bio for high IOPS scenarios via a percpu bio_set list.

  It's good for almost a 10% improvement in performance, bumping our
  per-core IO limit from ~3.2M IOPS to ~3.5M IOPS"

* tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block:
  bio: improve kerneldoc documentation for bio_alloc_kiocb()
  block: provide bio_clear_hipri() helper
  block: use the percpu bio cache in __blkdev_direct_IO
  io_uring: enable use of bio alloc cache
  block: clear BIO_PERCPU_CACHE flag if polling isn't supported
  bio: add allocation cache abstraction
  fs: add kiocb alloc cache flag
  bio: optimize initialization of a bio
2021-08-30 19:30:30 -07:00
Linus Torvalds
679369114e for-5.15/block-2021-08-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs6H0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpukbD/9Qk9fQte+WJVmpbdvhV40gcKBVnGOVH0ke
 k+36x6AB/gWKnFHwtprsSyVqPxmzqwTv9VIq5l/s3Vydt3L61znvTneBeN03Wlkn
 UTxD0lY8HzyVWnZb82LBBjjy7cs6EzrFG4kBH/ZiTAyTcBsCAvzo5J7mywb4gFjj
 L/HeBq58EJ3WCUlxlVW1ijctvi7wnGoaH5bZY1TE00GGT6TysN2bEPfzjkuYHrDz
 RqhoQdWPLDz6h3x9lAncPw2MWlcmlGvJ96ABseAKFPKvXxE2PzgolSoQfVUUJtko
 bqGyy2ns+pxN11SrcGYjogEKVKhONoms/5UN1RtwRBVsgvecxlHER/SgyZ8luBDo
 lFhVXulkSjpswbWutRy3USge98GwMu2Z4ppP2CDmO7hkQd0DF8sL0kPKyaREkcHi
 NmsD/0zF2uUhUVN+PRC/MuzngAmL4Mmxjk70L+MohlK7e+H3pnEo1ec3OMcXe+wB
 dG6t/BFD9bYmj0UjsHeXEoR/iRuvSba1L8zBz5dhRaHH6DvdycYhpynXWWlU3C8K
 3nzEVVpcDINMsiRl1Vqb6g6HsMwHIH84FRl7Mc51UmhW9C4gLfWMCt1guQuzOj72
 yEbmCLydE/FR2IUPY7eqX8hRG8GTUlMtSvGdgnvBOcWj+K3buT/c5yVTHgTrN8ox
 LCOXHSvV6w==
 =S8fs
 -----END PGP SIGNATURE-----

Merge tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Nothing major in here - lots of good cleanups and tech debt handling,
  which is also evident in the diffstats. In particular:

   - Add disk sequence numbers (Matteo)

   - Discard merge fix (Ming)

   - Relax disk zoned reporting restrictions (Niklas)

   - Bio error handling zoned leak fix (Pavel)

   - Start of proper add_disk() error handling (Luis, Christoph)

   - blk crypto fix (Eric)

   - Non-standard GPT location support (Dmitry)

   - IO priority improvements and cleanups (Damien)o

   - blk-throtl improvements (Chunguang)

   - diskstats_show() stack reduction (Abd-Alrhman)

   - Loop scheduler selection (Bart)

   - Switch block layer to use kmap_local_page() (Christoph)

   - Remove obsolete disk_name helper (Christoph)

   - block_device refcounting improvements (Christoph)

   - Ensure gendisk always has a request queue reference (Christoph)

   - Misc fixes/cleanups (Shaokun, Oliver, Guoqing)"

* tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block: (129 commits)
  sg: pass the device name to blk_trace_setup
  block, bfq: cleanup the repeated declaration
  blk-crypto: fix check for too-large dun_bytes
  blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN
  blk-zoned: allow zone management send operations without CAP_SYS_ADMIN
  block: mark blkdev_fsync static
  block: refine the disk_live check in del_gendisk
  mmc: sdhci-tegra: Enable MMC_CAP2_ALT_GPT_TEGRA
  mmc: block: Support alternative_gpt_sector() operation
  partitions/efi: Support non-standard GPT location
  block: Add alternative_gpt_sector() operation
  bio: fix page leak bio_add_hw_page failure
  block: remove CONFIG_DEBUG_BLOCK_EXT_DEVT
  block: remove a pointless call to MINOR() in device_add_disk
  null_blk: add error handling support for add_disk()
  virtio_blk: add error handling support for add_disk()
  block: add error handling for device_add_disk / add_disk
  block: return errors from disk_alloc_events
  block: return errors from blk_integrity_add
  block: call blk_register_queue earlier in device_add_disk
  ...
2021-08-30 18:52:11 -07:00
Linus Torvalds
7d6e3fa87e Updates to the interrupt core and driver subsystems:
Core changes:
 
    - The usual set of small fixes and improvements all over the place, but nothing
      outstanding
 
 MSI changes:
 
    - Further consolidation of the PCI/MSI interrupt chip code
 
    - Make MSI sysfs code independent of PCI/MSI and expose the MSI interrupts
      of platform devices in the same way as PCI exposes them.
 
 Driver changes:
 
    - Support for ARM GICv3 EPPI partitions
 
    - Treewide conversion to generic_handle_domain_irq() for all chained
      interrupt controllers
 
    - Conversion to bitmap_zalloc() throughout the irq chip drivers
 
    - The usual set of small fixes and improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEsnpsTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoS+/EACQdpRkzl3IDIYqThxVZ8KQzp2rKKVn
 qisAQiWg/6koNJx/yYy62KNAUyKjCIObNtRnWi7OAOx6OvNtQTD2WOLAwkh3Pgw1
 8ePYYl55k+yCs8VoITsZM9jYeO+Tk878pU2A6R943zR+g6G7bskGJrxEyZ9TbzIe
 qKfusNKnRY9/jMQaRALUAAtA9VIVR867GqORX5X8hKz8yE2rqlpb4y+1CFba5BTV
 Vlxw7cIXvXBn7BKAom5diRqEGDNJEbX+56jJ7yDZshgLo7m11D7QLw72kmb6TNVC
 g7PchvFi4afpc1ifEAAp0tk4RiSIAQ91nS3n0+jLcLbodOjIkl14eY02ZCJGAP29
 uslyzUbmy1wgejG6CA63JtZ4MYdrf/OSMGuoN78qnOKYcIsWFzOvlJmBWWNW34qW
 LCaUF9QdJ/slXu6B4vIx30GfN9q4myml8bFUobE5q9mBRrEk4R0B7iyBvPu1xKYr
 ZEan67prI5VEu+afJGpp4r294m4HNVkMLfl3nYmE5+y4MoLeMNKDY3IPTvI9iP4G
 kaFgoPvQo23WnuclNYpJ+CaA4aRASlB2nTY+oAXIYfehbey9EW5vq4/EK864ek6w
 oyUTepxxNhE81tG2jpQbf2tR4COsEHy986clxqPP4AvsZXcbypCw8O2FcflpQbHO
 5DLEAfTmp7cziQ==
 =qyll
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
 "Updates to the interrupt core and driver subsystems:

  Core changes:

   - The usual set of small fixes and improvements all over the place,
     but nothing stands out

  MSI changes:

   - Further consolidation of the PCI/MSI interrupt chip code

   - Make MSI sysfs code independent of PCI/MSI and expose the MSI
     interrupts of platform devices in the same way as PCI exposes them.

  Driver changes:

   - Support for ARM GICv3 EPPI partitions

   - Treewide conversion to generic_handle_domain_irq() for all chained
     interrupt controllers

   - Conversion to bitmap_zalloc() throughout the irq chip drivers

   - The usual set of small fixes and improvements"

* tag 'irq-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
  platform-msi: Add ABI to show msi_irqs of platform devices
  genirq/msi: Move MSI sysfs handling from PCI to MSI core
  genirq/cpuhotplug: Demote debug printk to KERN_DEBUG
  irqchip/qcom-pdc: Trim unused levels of the interrupt hierarchy
  irqdomain: Export irq_domain_disconnect_hierarchy()
  irqchip/gic-v3: Fix priority comparison when non-secure priorities are used
  irqchip/apple-aic: Fix irq_disable from within irq handlers
  pinctrl/rockchip: drop the gpio related codes
  gpio/rockchip: drop irq_gc_lock/irq_gc_unlock for irq set type
  gpio/rockchip: support next version gpio controller
  gpio/rockchip: use struct rockchip_gpio_regs for gpio controller
  gpio/rockchip: add driver for rockchip gpio
  dt-bindings: gpio: change items restriction of clock for rockchip,gpio-bank
  pinctrl/rockchip: add pinctrl device to gpio bank struct
  pinctrl/rockchip: separate struct rockchip_pin_bank to a head file
  pinctrl/rockchip: always enable clock for gpio controller
  genirq: Fix kernel doc indentation
  EDAC/altera: Convert to generic_handle_domain_irq()
  powerpc: Bulk conversion to generic_handle_domain_irq()
  nios2: Bulk conversion to generic_handle_domain_irq()
  ...
2021-08-30 14:38:37 -07:00
Linus Torvalds
64b4fc45be block-5.14-2021-08-27
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEpVAkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmhjD/4+DT4BhDTOAaXzjLs51Y0vu2+wLBNhLoi8
 sP+l+ZI4EWiOmbgejSyyHF5Wa3wA8UxhAq+9CpyKhtniWmlkeh6uw+SOOq41CYS+
 4mdwyZCC/3eYkV1mgD1OmAzq3om4CF5tR/Mp5UO9rCQjoWdzqxcu8jvTiJYj5R3X
 NvThxKjHaPaDiRZd1ZKu3jYo+lEmnLZ/j9ErYEsrT7OdZZCrUosoCdLsbQxKDWeK
 HTRbnZzFdCEKWsWZVUgFqTtQhNwepYK47gHh14egZ8ESVSkRUKL0j0mXMeV6IM+d
 upzdfnoIxPF5oPZ75cYxAIHaCKaZRqiPRE7rDM72U5J//xm5+m10S9Zs5ythj6/0
 TGmNvR5XUg/OyX1gjyn7coOXQqCJrb0UOUsViA6De29GUau/s1DQITOaTRtB0rh0
 gPkOLxp4WZYpss6FCHoPItSxh3lw2PhsBZawm4/mkVnuayPvVEadeKqdimKO5Vco
 JGHB2R3HM/jGObXpaqhzAJKAIElbsjIZhTNomMmAOW+pmYCgjrgq5SG2q5YYSQhm
 0R7LKXknuywr5koRx4NzsBAprakKKsLy80kKtOeQV1H0OigeZqRpi2rO0RYgvvcf
 yxmsa68ZbkrRUxfOKMUb9bOHq79s+XXiRB1w76WgZKB5YqouM5O8VDntwYFvk7yN
 cr0rSofAtw==
 =ZfeN
 -----END PGP SIGNATURE-----

Merge tag 'block-5.14-2021-08-27' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Revert the mq-deadline priority handling, it's causing serious
   performance regressions. While experimental patches exists to fix
   this up, it's too late to do so now. Revert it and re-do it properly
   for 5.15 instead.

 - Fix a NULL vs IS_ERR() regression in this release (Dan)

 - Fix a mq-deadline accounting regression in this release (Bart)

 - Mark cryptoloop as deprecated. It's broken and dm-crypt fully
   supports it, and it's actively intefering with loop. Plan on removal
   for 5.16 (Christoph)

* tag 'block-5.14-2021-08-27' of git://git.kernel.dk/linux-block:
  cryptoloop: add a deprecation warning
  pd: fix a NULL vs IS_ERR() check
  Revert "block/mq-deadline: Prioritize high-priority requests"
  mq-deadline: Fix request accounting
2021-08-27 16:08:29 -07:00
Jens Axboe
7b05bf7710 Revert "block/mq-deadline: Prioritize high-priority requests"
This reverts commit fb926032b3.

Zhen reports that this commit slows down mq-deadline on a 128 thread
box, going from 258K IOPS to 170-180K. My testing shows that Optane
gen2 IOPS goes from 2.3M IOPS to 1.2M IOPS on a 64 thread box.

Looking in detail at the code, the main culprit here is needing to sum
percpu counters in the dispatch hot path, leading to very high CPU
utilization there. To make matters worse, the code currently needs to
sum 2 percpu counters, and it does so in the most naive way of iterating
possible CPUs _twice_.

Since we're close to release, revert this commit and we can re-do it
with regular per-priority counters instead for the 5.15 kernel.

Link: https://lore.kernel.org/linux-block/20210826144039.2143-1-thunder.leizhen@huawei.com/
Reported-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-26 12:59:44 -06:00
Shaokun Zhang
1e294970fc block, bfq: cleanup the repeated declaration
Function 'bfq_entity_to_bfqq' is declared twice, so remove the
repeated declaration and blank line.

Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Link: https://lore.kernel.org/r/1629872391-46399-1-git-send-email-zhangshaokun@hisilicon.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-25 06:45:33 -06:00
Eric Biggers
cc40b72251 blk-crypto: fix check for too-large dun_bytes
dun_bytes needs to be less than or equal to the IV size of the
encryption mode, not just less than or equal to BLK_CRYPTO_MAX_IV_SIZE.

Currently this doesn't matter since blk_crypto_init_key() is never
actually passed invalid values, but we might as well fix this.

Fixes: a892c8d52c ("block: Inline encryption support for blk-mq")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20210825055918.51975-1-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-25 06:45:00 -06:00
Bart Van Assche
b6d2b054e8 mq-deadline: Fix request accounting
The block layer may call the I/O scheduler .finish_request() callback
without having called the .insert_requests() callback. Make sure that the
mq-deadline I/O statistics are correct if the block layer inserts an I/O
request that bypasses the I/O scheduler. This patch prevents that lower
priority I/O is delayed longer than necessary for mixed I/O priority
workloads.

Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Fixes: 08a9ad8bf6 ("block/mq-deadline: Add cgroup support")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210824170520.1659173-1-bvanassche@acm.org
Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
Tested-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-24 16:18:01 -06:00
Niklas Cassel
4d643b6608 blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN
A user space process should not need the CAP_SYS_ADMIN capability set
in order to perform a BLKREPORTZONE ioctl.

Getting the zone report is required in order to get the write pointer.
Neither read() nor write() requires CAP_SYS_ADMIN, so it is reasonable
that a user space process that can read/write from/to the device, also
can get the write pointer. (Since e.g. writes have to be at the write
pointer.)

Fixes: 3ed05a987e ("blk-zoned: implement ioctls")
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Reviewed-by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: stable@vger.kernel.org # v4.10+
Link: https://lore.kernel.org/r/20210811110505.29649-3-Niklas.Cassel@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-24 10:12:36 -06:00
Niklas Cassel
ead3b768bb blk-zoned: allow zone management send operations without CAP_SYS_ADMIN
Zone management send operations (BLKRESETZONE, BLKOPENZONE, BLKCLOSEZONE
and BLKFINISHZONE) should be allowed under the same permissions as write().
(write() does not require CAP_SYS_ADMIN).

Additionally, other ioctls like BLKSECDISCARD and BLKZEROOUT only check if
the fd was successfully opened with FMODE_WRITE.
(They do not require CAP_SYS_ADMIN).

Currently, zone management send operations require both CAP_SYS_ADMIN
and that the fd was successfully opened with FMODE_WRITE.

Remove the CAP_SYS_ADMIN requirement, so that zone management send
operations match the access control requirement of write(), BLKSECDISCARD
and BLKZEROOUT.

Fixes: 3ed05a987e ("blk-zoned: implement ioctls")
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Reviewed-by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: stable@vger.kernel.org # v4.10+
Link: https://lore.kernel.org/r/20210811110505.29649-2-Niklas.Cassel@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-24 10:12:36 -06:00
Christoph Hellwig
9f2869921f block: refine the disk_live check in del_gendisk
hidden gendisks will never be marked live.

Fixes: 40b3a52ffc ("block: add a sanity check for a live disk in del_gendisk")
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210824144310.1487816-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-24 10:10:08 -06:00
Dmitry Osipenko
466d9c4904 partitions/efi: Support non-standard GPT location
Support looking up GPT at a non-standard location specified by a block
device driver.

Acked-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20210820004536.15791-3-digetx@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-24 10:09:06 -06:00
Pavel Begunkov
d9cf3bd531 bio: fix page leak bio_add_hw_page failure
__bio_iov_append_get_pages() doesn't put not appended pages on
bio_add_hw_page() failure, so potentially leaking them, fix it. Also, do
the same for __bio_iov_iter_get_pages(), even though it looks like it
can't be triggered by userspace in this case.

Fixes: 0512a75b98 ("block: Introduce REQ_OP_ZONE_APPEND")
Cc: stable@vger.kernel.org # 5.8+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1edfa6a2ffd66d55e6345a477df5387d2c1415d0.1626653825.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-24 08:04:58 -06:00
Christoph Hellwig
c4b2b7d150 block: remove CONFIG_DEBUG_BLOCK_EXT_DEVT
This might have been a neat debug aid when the extended dev_t was
added, but that time is long gone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210824075216.1179406-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-24 06:42:40 -06:00
Christoph Hellwig
539711d7d6 block: remove a pointless call to MINOR() in device_add_disk
blk_alloc_ext_minor already returns just a minor number, so no need to
mask the high bits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210824075216.1179406-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-24 06:42:32 -06:00
Jens Axboe
3d5b3fbeda bio: improve kerneldoc documentation for bio_alloc_kiocb()
We're missing a description for the 'nr_vecs' parameter. While in there,
clarify that freeing a bio allocated through this function must be done
from process context.

Fixes: 1cbbd31c4ada ("bio: add allocation cache abstraction")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:45:40 -06:00
Jens Axboe
270a1c913e block: provide bio_clear_hipri() helper
Any case that turns off REQ_HIPRI must also clear BIO_PERCPU_CACHE,
as non-polled IO may complete through hard/soft IRQ and hence isn't
safe for our polled bio alloc cache.

Provide a helper that does just that, and use it in the merging code as
well if we split a bio and turn off polling.

Fixes: be863b9e43 ("block: clear BIO_PERCPU_CACHE flag if polling isn't supported")
Reported-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:45:15 -06:00
Jens Axboe
be863b9e43 block: clear BIO_PERCPU_CACHE flag if polling isn't supported
The bio alloc cache relies on the fact that a polled bio will complete
in process context, clear the cacheable flag if we disable polling
for a given bio.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:44:51 -06:00
Jens Axboe
be4d234d7a bio: add allocation cache abstraction
Add a per-cpu bio_set cache for bio allocations, enabling us to quickly
recycle them instead of going through the slab allocator. This cache
isn't IRQ safe, and hence is only really suitable for polled IO.

Very simple - keeps a count of bio's in the cache, and maintains a max
of 512 with a slack of 64. If we get above max + slack, we drop slack
number of bio's.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:44:43 -06:00
Jens Axboe
da521626ac bio: optimize initialization of a bio
The memset() used is measurably slower in targeted benchmarks, wasting
about 1% of the total runtime, or 50% of the (later) hot path cached
bio alloc. Get rid of it and fill in the bio manually.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:44:02 -06:00
Luis Chamberlain
83cbce9574 block: add error handling for device_add_disk / add_disk
Properly unwind on errors in device_add_disk.  This is the initial work
as drivers are not converted yet, which will follow in separate patches.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
[hch: major rebase.  All bugs are probably mine]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 12:55:45 -06:00
Luis Chamberlain
92e7755ebc block: return errors from disk_alloc_events
Prepare for proper error handling in add_disk.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 12:55:45 -06:00
Luis Chamberlain
614310c9c8 block: return errors from blk_integrity_add
Prepare for proper error handling in add_disk.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 12:55:45 -06:00
Christoph Hellwig
75f4dca596 block: call blk_register_queue earlier in device_add_disk
Ensure that all the sysfs bits are set up before bdev_add is called,
as that will make the upcomding error handling much easier.  However
this means the call to disk_update_readahead has to be split as that
requires a bdi.  Also remove various sanity checks that don't make
sense now that blk_register_queue only has a single caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210818144542.19305-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 12:55:45 -06:00
Christoph Hellwig
bab53f6b61 block: call blk_integrity_add earlier in device_add_disk
Doing all the sysfs file creation before adding the bdev and thus
allowing it to be opened will simplify the about to be added error
handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 12:55:45 -06:00
Christoph Hellwig
9d5ee6767c block: create the bdi link earlier in device_add_disk
This will simplify error handling going forward.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 12:55:45 -06:00
Christoph Hellwig
8235b5c1e8 block: call bdev_add later in device_add_disk
Once bdev_add is called userspace can open the block device.  Ensure
that the struct device, which is used for refcounting of the disk
besides various other things, is fully setup at that point.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 12:55:45 -06:00