linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-09-29 22:02:02 +00:00

Author	SHA1	Message	Date
Nitesh Shetty	7ccc017fea	null_blk: Fix: memory release when memory_backed=1 [ Upstream commit `8cfb98196c` ] Memory/pages are not freed, when unloading nullblk driver. Steps to reproduce issue 1.free -h total used free shared buff/cache available Mem: 7.8Gi 260Mi 7.1Gi 3.0Mi 395Mi 7.3Gi Swap: 0B 0B 0B 2.modprobe null_blk memory_backed=1 3.dd if=/dev/urandom of=/dev/nullb0 oflag=direct bs=1M count=1000 4.modprobe -r null_blk 5.free -h total used free shared buff/cache available Mem: 7.8Gi 1.2Gi 6.1Gi 3.0Mi 398Mi 6.3Gi Swap: 0B 0B 0B Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> Link: https://lore.kernel.org/r/20230605062354.24785-1-nj.shetty@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-06-28 11:14:23 +02:00
Michael S. Tsirkin	cb23a5f55b	Revert "virtio-blk: support completion batching for the IRQ path" commit `afd384f0db` upstream. This reverts commit `07b679f70d`. This change appears to have broken things... We now see applications hanging during disk accesses. e.g. multi-port virtio-blk device running in h/w (FPGA) Host running a simple 'fio' test. [global] thread=1 direct=1 ioengine=libaio norandommap=1 group_reporting=1 bs=4K rw=read iodepth=128 runtime=1 numjobs=4 time_based [job0] filename=/dev/vda [job1] filename=/dev/vdb [job2] filename=/dev/vdc ... [job15] filename=/dev/vdp i.e. 16 disks; 4 queues per disk; simple burst of 4KB reads This is repeatedly run in a loop. After a few, normally <10 seconds, fio hangs. With 64 queues (16 disks), failure occurs within a few seconds; with 8 queues (2 disks) it may take ~hour before hanging. Last message: fio-3.19 Starting 8 threads Jobs: 1 (f=1): [_(7),R(1)][68.3%][eta 03h:11m:06s] I think this means at the end of the run 1 queue was left incomplete. 'diskstats' (run while fio is hung) shows no outstanding transactions. e.g. $ cat /proc/diskstats ... 252 0 vda 1843140071 0 14745120568 712568645 0 0 0 0 0 `3117947` 712568645 0 0 0 0 0 0 252 16 vdb 1816291511 0 14530332088 704905623 0 0 0 0 0 3117711 704905623 0 0 0 0 0 0 ... Other stats (in the h/w, and added to the virtio-blk driver ([a]virtio_queue_rq(), [b]virtblk_handle_req(), [c]virtblk_request_done()) all agree, and show every request had a completion, and that virtblk_request_done() never gets called. e.g. PF= 0 vq=0 1 2 3 [a]request_count - 839416590 813148916 105586179 84988123 [b]completion1_count - 839416590 813148916 105586179 84988123 [c]completion2_count - 0 0 0 0 PF= 1 vq=0 1 2 3 [a]request_count - 823335887 812516140 104582672 75856549 [b]completion1_count - 823335887 812516140 104582672 75856549 [c]completion2_count - 0 0 0 0 i.e. the issue is after the virtio-blk driver. This change was introduced in kernel 6.3.0. I am seeing this using 6.3.3. If I run with an earlier kernel (5.15), it does not occur. If I make a simple patch to the 6.3.3 virtio-blk driver, to skip the blk_mq_add_to_batch()call, it does not fail. e.g. kernel 5.15 - this is OK virtio_blk.c,virtblk_done() [irq handler] if (likely(!blk_should_fake_timeout(req->q))) { blk_mq_complete_request(req); } kernel 6.3.3 - this fails virtio_blk.c,virtblk_handle_req() [irq handler] if (likely(!blk_should_fake_timeout(req->q))) { if (!blk_mq_complete_request_remote(req)) { if (!blk_mq_add_to_batch(req, iob, virtblk_vbr_status(vbr), virtblk_complete_batch)) { virtblk_request_done(req); //this never gets called... so blk_mq_add_to_batch() must always succeed } } } If I do, kernel 6.3.3 - this is OK virtio_blk.c,virtblk_handle_req() [irq handler] if (likely(!blk_should_fake_timeout(req->q))) { if (!blk_mq_complete_request_remote(req)) { virtblk_request_done(req); //force this here... if (!blk_mq_add_to_batch(req, iob, virtblk_vbr_status(vbr), virtblk_complete_batch)) { virtblk_request_done(req); //this never gets called... so blk_mq_add_to_batch() must always succeed } } } Perhaps you might like to fix/test/revert this change... Martin Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202306090826.C1fZmdMe-lkp@intel.com/ Cc: Suwan Kim <suwan.kim027@gmail.com> Tested-by: edliaw@google.com Reported-by: "Roberts, Martin" <martin.roberts@intel.com> Message-Id: <336455b4f630f329380a8f53ee8cad3868764d5c.1686295549.git.mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-06-28 11:13:59 +02:00
Ross Lagerwall	8765aec0cb	xen/blkfront: Only check REQ_FUA for writes [ Upstream commit `b6ebaa8100` ] The existing code silently converts read operations with the REQ_FUA bit set into write-barrier operations. This results in data loss as the backend scribbles zeroes over the data instead of returning it. While the REQ_FUA bit doesn't make sense on a read operation, at least one well-known out-of-tree kernel module does set it and since it results in data loss, let's be safe here and only look at REQ_FUA for writes. Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230426164005.2213139-1-ross.lagerwall@citrix.com Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-06-21 16:02:07 +02:00
Ilya Dryomov	40a9ec704b	rbd: get snapshot context after exclusive lock is ensured to be held commit `870611e487` upstream. Move capturing the snapshot context into the image request state machine, after exclusive lock is ensured to be held for the duration of dealing with the image request. This is needed to ensure correctness of fast-diff states (OBJECT_EXISTS vs OBJECT_EXISTS_CLEAN) and object deltas computed based off of them. Otherwise the object map that is forked for the snapshot isn't guaranteed to accurately reflect the contents of the snapshot when the snapshot is taken under I/O. This breaks differential backup and snapshot-based mirroring use cases with fast-diff enabled: since some object deltas may be incomplete, the destination image may get corrupted. Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/61472 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-06-14 11:16:58 +02:00
Ilya Dryomov	b57d8a8ef7	rbd: move RBD_OBJ_FLAG_COPYUP_ENABLED flag setting commit `09fe05c57b` upstream. Move RBD_OBJ_FLAG_COPYUP_ENABLED flag setting into the object request state machine to allow for the snapshot context to be captured in the image request state machine rather than in rbd_queue_workfn(). Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-06-14 11:16:58 +02:00
Ming Lei	936048a565	ublk: fix AB-BA lockdep warning [ Upstream commit `ac5902f84b` ] When handling UBLK_IO_FETCH_REQ, ctx->uring_lock is grabbed first, then ub->mutex is acquired. When handling UBLK_CMD_STOP_DEV or UBLK_CMD_DEL_DEV, ub->mutex is grabbed first, then calling io_uring_cmd_done() for canceling uring command, in which ctx->uring_lock may be required. Real deadlock only happens when all the above commands are issued from same uring context, and in reality different uring contexts are often used for handing control command and IO command. Fix the issue by using io_uring_cmd_complete_in_task() to cancel command in ublk_cancel_dev(ublk_cancel_queue). Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/linux-block/becol2g7sawl4rsjq2dztsbc7mqypfqko6wzsyoyazqydoasml@rcxarzwidrhk Cc: Ziyang Zhang <ZiyangZhang@linux.alibaba.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20230517133408.210944-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-06-09 10:48:09 +02:00
Guoqing Jiang	faf4ccfaf1	block/rnbd: replace REQ_OP_FLUSH with REQ_OP_WRITE [ Upstream commit `5e6e08087a` ] Since flush bios are implemented as writes with no data and the preflush flag per Christoph's comment [1]. And we need to change it in rnbd accordingly. Otherwise, I got splatting when create fs from rnbd client. [ 464.028545] ------------[ cut here ]------------ [ 464.028553] WARNING: CPU: 0 PID: 65 at block/blk-core.c:751 submit_bio_noacct+0x32c/0x5d0 [ ... ] [ 464.028668] CPU: 0 PID: 65 Comm: kworker/0:1H Tainted: G OE 6.4.0-rc1 #9 [ 464.028671] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014 [ 464.028673] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core] [ 464.028717] RIP: 0010:submit_bio_noacct+0x32c/0x5d0 [ 464.028720] Code: 03 0f 85 51 fe ff ff 48 8b 43 18 8b 88 04 03 00 00 85 c9 0f 85 3f fe ff ff e9 be fd ff ff 0f b6 d0 3c 0d 74 26 83 fa 01 74 21 <0f> 0b b8 0a 00 00 00 e9 56 fd ff ff 4c 89 e7 e8 70 a1 03 00 84 c0 [ 464.028722] RSP: 0018:ffffaf3680b57c68 EFLAGS: 00010202 [ 464.028724] RAX: 0000000000060802 RBX: ffffa09dcc18bf00 RCX: 0000000000000000 [ 464.028726] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffffa09dde081d00 [ 464.028727] RBP: ffffaf3680b57c98 R08: ffffa09dde081d00 R09: ffffa09e38327200 [ 464.028729] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa09dde081d00 [ 464.028730] R13: ffffa09dcb06e1e8 R14: 0000000000000000 R15: 0000000000200000 [ 464.028733] FS: 0000000000000000(0000) GS:ffffa09e3bc00000(0000) knlGS:0000000000000000 [ 464.028735] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 464.028736] CR2: 000055a4e8206c40 CR3: 0000000119f06000 CR4: 00000000003506f0 [ 464.028738] Call Trace: [ 464.028740] <TASK> [ 464.028746] submit_bio+0x1b/0x80 [ 464.028748] rnbd_srv_rdma_ev+0x50d/0x10c0 [rnbd_server] [ 464.028754] ? percpu_ref_get_many.constprop.0+0x55/0x140 [rtrs_server] [ 464.028760] ? __this_cpu_preempt_check+0x13/0x20 [ 464.028769] process_io_req+0x1dc/0x450 [rtrs_server] [ 464.028775] rtrs_srv_inv_rkey_done+0x67/0xb0 [rtrs_server] [ 464.028780] __ib_process_cq+0xbc/0x1f0 [ib_core] [ 464.028793] ib_cq_poll_work+0x2b/0xa0 [ib_core] [ 464.028804] process_one_work+0x2a9/0x580 [1]. https://lore.kernel.org/all/ZFHgefWofVt24tRl@infradead.org/ Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20230512034631.28686-1-guoqing.jiang@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-06-09 10:48:01 +02:00
Ivan Orlov	2ba70aa402	nbd: Fix debugfs_create_dir error checking [ Upstream commit `4913cfcf01` ] The debugfs_create_dir function returns ERR_PTR in case of error, and the only correct way to check if an error occurred is 'IS_ERR' inline function. This patch will replace the null-comparison with IS_ERR. Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com> Link: https://lore.kernel.org/r/20230512130533.98709-1-ivan.orlov0322@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-06-09 10:48:01 +02:00
Zhong Jinghua	ffb75ffaa6	nbd: fix incomplete validation of ioctl arg [ Upstream commit `55793ea54d` ] We tested and found an alarm caused by nbd_ioctl arg without verification. The UBSAN warning calltrace like below: UBSAN: Undefined behaviour in fs/buffer.c:1709:35 signed integer overflow: -9223372036854775808 - 1 cannot be represented in type 'long long int' CPU: 3 PID: 2523 Comm: syz-executor.0 Not tainted 4.19.90 #1 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x3f0 arch/arm64/kernel/time.c:78 show_stack+0x28/0x38 arch/arm64/kernel/traps.c:158 __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x170/0x1dc lib/dump_stack.c:118 ubsan_epilogue+0x18/0xb4 lib/ubsan.c:161 handle_overflow+0x188/0x1dc lib/ubsan.c:192 __ubsan_handle_sub_overflow+0x34/0x44 lib/ubsan.c:206 __block_write_full_page+0x94c/0xa20 fs/buffer.c:1709 block_write_full_page+0x1f0/0x280 fs/buffer.c:2934 blkdev_writepage+0x34/0x40 fs/block_dev.c:607 __writepage+0x68/0xe8 mm/page-writeback.c:2305 write_cache_pages+0x44c/0xc70 mm/page-writeback.c:2240 generic_writepages+0xdc/0x148 mm/page-writeback.c:2329 blkdev_writepages+0x2c/0x38 fs/block_dev.c:2114 do_writepages+0xd4/0x250 mm/page-writeback.c:2344 The reason for triggering this warning is __block_write_full_page() -> i_size_read(inode) - 1 overflow. inode->i_size is assigned in __nbd_ioctl() -> nbd_set_size() -> bytesize. We think it is necessary to limit the size of arg to prevent errors. Moreover, __nbd_ioctl() -> nbd_add_socket(), arg will be cast to int. Assuming the value of arg is 0x80000000000000001) (on a 64-bit machine), it will become 1 after the coercion, which will return unexpected results. Fix it by adding checks to prevent passing in too large numbers. Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20230206145805.2645671-1-zhongjinghua@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-24 17:30:07 +01:00
Chaitanya Kulkarni	651260e563	null_blk: Always check queue mode setting from configfs [ Upstream commit `63f8793ee6` ] Make sure to check device queue mode in the null_validate_conf() and return error for NULL_Q_RQ as we don't allow legacy I/O path, without this patch we get OOPs when queue mode is set to 1 from configfs, following are repro steps :- modprobe null_blk nr_devices=0 mkdir config/nullb/nullb0 echo 1 > config/nullb/nullb0/memory_backed echo 4096 > config/nullb/nullb0/blocksize echo 20480 > config/nullb/nullb0/size echo 1 > config/nullb/nullb0/queue_mode echo 1 > config/nullb/nullb0/power Entering kdb (current=0xffff88810acdd080, pid 2372) on processor 42 Oops: (null) due to oops @ 0xffffffffc041c329 CPU: 42 PID: 2372 Comm: sh Tainted: G O N 6.3.0-rc5lblk+ #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 RIP: 0010:null_add_dev.part.0+0xd9/0x720 [null_blk] Code: 01 00 00 85 d2 0f 85 a1 03 00 00 48 83 bb 08 01 00 00 00 0f 85 f7 03 00 00 80 bb 62 01 00 00 00 48 8b 75 20 0f 85 6d 02 00 00 <48> 89 6e 60 48 8b 75 20 bf 06 00 00 00 e8 f5 37 2c c1 48 8b 75 20 RSP: 0018:ffffc900052cbde0 EFLAGS: 00010246 RAX: 0000000000000001 RBX: ffff88811084d800 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888100042e00 RBP: ffff8881053d8200 R08: ffffc900052cbd68 R09: ffff888105db2000 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000002 R13: ffff888104765200 R14: ffff88810eec1748 R15: ffff88810eec1740 FS: 00007fd445fd1740(0000) GS:ffff8897dfc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000060 CR3: 0000000166a00000 CR4: 0000000000350ee0 DR0: ffffffff8437a488 DR1: ffffffff8437a489 DR2: ffffffff8437a48a DR3: ffffffff8437a48b DR6: 00000000ffff0ff0 DR7: 0000000000000400 Call Trace: <TASK> nullb_device_power_store+0xd1/0x120 [null_blk] configfs_write_iter+0xb4/0x120 vfs_write+0x2ba/0x3c0 ksys_write+0x5f/0xe0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7fd4460c57a7 Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 RSP: 002b:00007ffd3792a4a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fd4460c57a7 RDX: 0000000000000002 RSI: 000055b43c02e4c0 RDI: 0000000000000001 RBP: 000055b43c02e4c0 R08: 000000000000000a R09: 00007fd44615b4e0 R10: 00007fd44615b3e0 R11: 0000000000000246 R12: 0000000000000002 R13: 00007fd446198520 R14: 0000000000000002 R15: 00007fd446198700 </TASK> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Link: https://lore.kernel.org/r/20230416220339.43845-1-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-24 17:30:07 +01:00
Ming Lei	d243c61de1	ublk: add timeout handler [ Upstream commit `c0b79b0ff5` ] Add timeout handler, so that we can provide forward progress guarantee for unprivileged ublk, which can't be trusted. One thing is that sync() calls sync_bdevs(wait) for all block devices after running sync_bdevs(no_wait), and if one device can't move on, the sync() won't return any more. Add timeout for unprivileged ublk to avoid such affect for other users which call sync() syscall. Meantime clear UBLK_F_USER_RECOVERY_REISSUE for unprivileged ublk since that feature may cause IO hang too. Fixes: `4093cb5a06` ("ublk_drv: add mechanism for supporting unprivileged ublk device") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230502024231.888498-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-17 14:01:46 +02:00
Christoph Böhmwalder	3cd4698f12	drbd: correctly submit flush bio on barrier commit `3899d94e38` upstream. When we receive a flush command (or "barrier" in DRBD), we currently use a REQ_OP_FLUSH with the REQ_PREFLUSH flag set. The correct way to submit a flush bio is by using a REQ_OP_WRITE without any data, and set the REQ_PREFLUSH flag. Since commit `b4a6bb3a67` ("block: add a sanity check for non-write flush/fua bios"), this triggers a warning in the block layer, but this has been broken for quite some time before that. So use the correct set of flags to actually make the flush happen. Cc: Christoph Hellwig <hch@infradead.org> Cc: stable@vger.kernel.org Fixes: `f9ff0da564` ("drbd: allow parallel flushes for multi-volume resources") Reported-by: Thomas Voegtle <tv@lio96.de> Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230503121937.17232-1-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-11 23:16:55 +09:00
Linus Torvalds	dfc1915448	virtio: last minute fixes Some last minute fixes - most of them for regressions. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmQz3aoPHG1zdEByZWRo YXQuY29tAAoJECgfDbjSjVRp/MQIAJ0uzzyTAg7Bx/73M7JZckxgScyVOI/192mw ylT/1EZrlZi8JOC1uf9gC+fHlg4jgS/0Yn6HSJlBH0CAJPRNpGckzA1aEr/gc9jS 2wWxCrUeO2hvlMN7HToesmZxu79xKDIoYIFnwVV/IDLTi9Y8Fh+hUIkRR16IBiXb dgrI2qkYvyqGj5hJDnM9MaMVbTARYXb2q2SgiqGjh/TUNrkqDI0e0m1tj9eXNR90 uIz1lfbIH6JKiq2Zq/CK+h87dMOVBNoH+LZvMRKPx0beEfxC4P/1Re1Rag12kG1s TE7n556BOsqBQ333+rIgtNSq2dDAepG+CASZsR6Dw4F+41G43Qo= =WjYr -----END PGP SIGNATURE----- Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio fixes from Michael Tsirkin: "Some last minute fixes - most of them for regressions" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vdpa_sim_net: complete the initialization before register the device vdpa/mlx5: Add and remove debugfs in setup/teardown driver tools/virtio: fix typo in README instructions vhost-scsi: Fix crash during LUN unmapping vhost-scsi: Fix vhost_scsi struct use after free virtio-blk: fix ZBD probe in kernels without ZBD support virtio-blk: fix to match virtio spec	2023-04-10 13:35:54 -07:00
Ming Lei	1d1665279a	block: ublk: make sure that block size is set correctly block size is one very key setting for block layer, and bad block size could panic kernel easily. Make sure that block size is set correctly. Meantime if ublk_validate_params() fails, clear ub->params so that disk is prevented from being added. Fixes: `71f28f3136` ("ublk_drv: add io_uring based userspace block driver") Reported-and-tested-by: Breno Leitao <leitao@debian.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-06 08:12:08 -06:00
Jens Axboe	8c68ae3b22	ublk: read any SQE values upfront Since SQE memory is shared with userspace, we should only be reading it once. We cannot read it multiple times, particularly when it's read once for validation and then read again for the actual use. ublk_ch_uring_cmd() is safe when called as a retry operation, as the memory backing is stable at that point. But for normal issue, we want to ensure that we only read ublksrv_io_cmd once. Wrap the function in a helper that reads the value into an on-stack copy of the struct. Cc: stable@vger.kernel.org # 6.0+ Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-05 20:59:17 -06:00
Dmitry Fomichev	10805eb5d6	virtio-blk: fix ZBD probe in kernels without ZBD support When the kernel is built without support for zoned block devices, virtio-blk probe needs to error out any host-managed device scans to prevent such devices from appearing in the system as non-zoned. The current virtio-blk code simply bypasses all ZBD checks if CONFIG_BLK_DEV_ZONED is not defined and this leads to host-managed block devices being presented as non-zoned in the OS. This is one of the main problems this patch series is aimed to fix. In this patch, make VIRTIO_BLK_F_ZONED feature defined even when CONFIG_BLK_DEV_ZONED is not. This change makes the code compliant with the voted revision of virtio-blk ZBD spec. Modify the probe code to look at the situation when VIRTIO_BLK_F_ZONED is negotiated in a kernel that is built without ZBD support. In this case, the code checks the zoned model of the device and fails the probe is the device is host-managed. The patch also adds the comment to clarify that the call to perform the zoned device probe is correctly placed after virtio_device ready(). Fixes: `95bfec41bd` ("virtio-blk: add support for zoned block devices") Cc: stable@vger.kernel.org Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Message-Id: <20230330214953.1088216-3-dmitry.fomichev@wdc.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2023-04-04 11:01:57 -04:00
Dmitry Fomichev	f1ba4e674f	virtio-blk: fix to match virtio spec The merged patch series to support zoned block devices in virtio-blk is not the most up to date version. The merged patch can be found at https://lore.kernel.org/linux-block/20221016034127.330942-3-dmitry.fomichev@wdc.com/ but the latest and reviewed version is https://lore.kernel.org/linux-block/20221110053952.3378990-3-dmitry.fomichev@wdc.com/ The reason is apparently that the correct mailing lists and maintainers were not copied. The differences between the two are mostly cleanups, but there is one change that is very important in terms of compatibility with the approved virtio-zbd specification. Before it was approved, the OASIS virtio spec had a change in VIRTIO_BLK_T_ZONE_APPEND request layout that is not reflected in the current virtio-blk driver code. In the running code, the status is the first byte of the in-header that is followed by some pad bytes and the u64 that carries the sector at which the data has been written to the zone back to the driver, aka the append sector. This layout turned out to be problematic for implementing in QEMU and the request status byte has been eventually made the last byte of the in-header. The current code doesn't expect that and this causes the append sector value always come as zero to the block layer. This needs to be fixed ASAP. Fixes: `95bfec41bd` ("virtio-blk: add support for zoned block devices") Cc: stable@vger.kernel.org Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Message-Id: <20230330214953.1088216-2-dmitry.fomichev@wdc.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2023-04-04 11:01:57 -04:00
Alyssa Ross	bb430b6942	loop: LOOP_CONFIGURE: send uevents for partitions LOOP_CONFIGURE is, as far as I understand it, supposed to be a way to combine LOOP_SET_FD and LOOP_SET_STATUS64 into a single syscall. When using LOOP_SET_FD+LOOP_SET_STATUS64, a single uevent would be sent for each partition found on the loop device after the second ioctl(), but when using LOOP_CONFIGURE, no such uevent was being sent. In the old setup, uevents are disabled for LOOP_SET_FD, but not for LOOP_SET_STATUS64. This makes sense, as it prevents uevents being sent for a partially configured device during LOOP_SET_FD - they're only sent at the end of LOOP_SET_STATUS64. But for LOOP_CONFIGURE, uevents were disabled for the entire operation, so that final notification was never issued. To fix this, reduce the critical section to exclude the loop_reread_partitions() call, which causes the uevents to be issued, to after uevents are re-enabled, matching the behaviour of the LOOP_SET_FD+LOOP_SET_STATUS64 combination. I noticed this because Busybox's losetup program recently changed from using LOOP_SET_FD+LOOP_SET_STATUS64 to LOOP_CONFIGURE, and this broke my setup, for which I want a notification from the kernel any time a new partition becomes available. Signed-off-by: Alyssa Ross <hi@alyssa.is> [hch: reduced the critical section] Signed-off-by: Christoph Hellwig <hch@lst.de> Fixes: `3448914e8c` ("loop: Add LOOP_CONFIGURE ioctl") Link: https://lore.kernel.org/r/20230320125430.55367-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-27 13:27:06 -06:00
Jens Axboe	9d2789ac9d	block/io_uring: pass in issue_flags for uring_cmd task_work handling io_uring_cmd_done() currently assumes that the uring_lock is held when invoked, and while it generally is, this is not guaranteed. Pass in the issue_flags associated with it, so that we have IO_URING_F_UNLOCKED available to be able to lock the CQ ring appropriately when completing events. Cc: stable@vger.kernel.org Fixes: `ee692a21e9` ("fs,io_uring: add infrastructure for uring-cmd") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-20 20:01:25 -06:00
Ming Lei	4985e7b2c0	block: ublk_drv: mark device as LIVE before adding disk IO can be started before add_disk() returns, such as reading parititon table, then the monitor work should work for making forward progress. So mark device as LIVE before adding disk, meantime change to DEAD if add_disk() fails. Fixed: `71f28f3136` ("ublk_drv: add io_uring based userspace block driver") Reviewed-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230318141231.55562-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-18 08:32:46 -06:00
Liang He	6030363199	block: sunvdc: add check for mdesc_grab() returning NULL In vdc_port_probe(), we should check the return value of mdesc_grab() as it may return NULL, which can cause potential NPD bug. Fixes: `43fdf27470` ("[SPARC64]: Abstract out mdesc accesses for better MD update handling.") Signed-off-by: Liang He <windhl@126.com> Link: https://lore.kernel.org/r/20230315062032.1741692-1-windhl@126.com [axboe: style cleanup] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-15 08:48:58 -06:00
Damien Le Moal	b6402014ca	block: null_blk: cleanup null_queue_rq() Use a local struct request pointer variable to avoid having to dereference struct blk_mq_queue_data multiple times. While at it, also fix the function argument indentation and remove a useless "else" after a return. Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20230314041106.19173-2-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-15 06:50:24 -06:00
Damien Le Moal	63f8865970	block: null_blk: Fix handling of fake timeout request When injecting a fake timeout into the null_blk driver using fail_io_timeout, the request timeout handler does not execute blk_mq_complete_request(), so the complete callback is never executed for a timedout request. The null_blk driver also has a driver-specific fake timeout mechanism which does not have this problem. Fix the problem with fail_io_timeout by using the same meachanism as null_blk internal timeout feature, using the fake_timeout field of null_blk commands. Reported-by: Akinobu Mita <akinobu.mita@gmail.com> Fixes: `de3510e52b` ("null_blk: fix command timeout completion handling") Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230314041106.19173-2-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-15 06:50:23 -06:00
Bart Van Assche	9b0cb770f5	loop: Fix use-after-free issues do_req_filebacked() calls blk_mq_complete_request() synchronously or asynchronously when using asynchronous I/O unless memory allocation fails. Hence, modify loop_handle_cmd() such that it does not dereference 'cmd' nor 'rq' after do_req_filebacked() finished unless we are sure that the request has not yet been completed. This patch fixes the following kernel crash: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000054 Call trace: css_put.42938+0x1c/0x1ac loop_process_work+0xc8c/0xfd4 loop_rootcg_workfn+0x24/0x34 process_one_work+0x244/0x558 worker_thread+0x400/0x8fc kthread+0x16c/0x1e0 ret_from_fork+0x10/0x20 Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Fixes: `c74d40e8b5` ("loop: charge i/o to mem and blk cg") Fixes: `bc07c10a36` ("block: loop: support DIO & AIO") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230314182155.80625-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-14 19:20:48 -06:00
Linus Torvalds	9d0281b56b	block-6.3-2023-03-03 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmQB57MQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgputpEADVrc1OFzHOivJq+LJ3HS3ufhLBthtgu1Lp sEHvDNp9tBGXMLkomuCYpAju5TBAEKC+AJTZyj9iS1j++ItoezdoP55YRIH7t2Or UTy8ex3rLPGkQk6k3o8roWCyajTW/ZS+4fmk+NkVYMLsQBp9I+kFbxgJa5bbREdU Z8b/9hcBGz58R8Kq+TEMp/bO7oCV4c8xWumrKER+MktDDx0kc5d+afWXoy7bEKFg jLB3gleTM9HUpa9a2GPc4fxqdb0KanQdMtiyn/oplg0JcZLMiHfRbiRnsgQkjN0O RVtUcdxXmOkQeFra4GXPiHmQBcIfE85wP4wxb8p/F2StYRhb1epzzeCXOhuNZvv4 dd6OSARgtzWt3OlHka4aC63H4kzs9SxJp0F2uwuPLV0fM91TP1oOTWV+53FrQr9Z OQYyB8d9Il4K72NFLwU4ukJ1fPoCRHjpgAXIIkasEjaBftpJlMNnfblncTZTBumy XumFVdKfvqc3OFt8LLKWqLDV0j3TknVeCMPKhsbRwQ0NG4vlNOSWaLkGJCDLJ7ga ebf8AD5eaLCT9qyYquBuW5VBKZH5Z4rf5yHta9Dx+Omu0JTQYtTkiiM3UTdpDbtq SObZ31UvLoYK2dOZcVgjhE2RgM/AV5jJcx7aHhT3UptavAehHbePgiNhuEEntlKv L87kXJkSSQ== =ezrg -----END PGP SIGNATURE----- Merge tag 'block-6.3-2023-03-03' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - NVMe pull request via Christoph: - Don't access released socket during error recovery (Akinobu Mita) - Bring back auto-removal of deleted namespaces during sequential scan (Christoph Hellwig) - Fix an error code in nvme_auth_process_dhchap_challenge (Dan Carpenter) - Show well known discovery name (Daniel Wagner) - Add a missing endianess conversion in effects masking (Keith Busch) - Fix for a regression introduced in blk-rq-qos during init in this merge window (Breno) - Reorder a few fields in struct blk_mq_tag_set, eliminating a few holes and shrinking it (Christophe) - Remove redundant bdev_get_queue() NULL checks (Juhyung) - Add sed-opal single user mode support flag (Luca) - Remove SQE128 check in ublk as it isn't needed, saving some memory (Ming) - Op specific segment checking for cloned requests (Uday) - Exclusive open partition scan fixes (Yu) - Loop offset/size checking before assigning them in the device (Zhong) - Bio polling fixes (me) * tag 'block-6.3-2023-03-03' of git://git.kernel.dk/linux: blk-mq: enforce op-specific segment limits in blk_insert_cloned_request nvme-fabrics: show well known discovery name nvme-tcp: don't access released socket during error recovery nvme-auth: fix an error code in nvme_auth_process_dhchap_challenge() nvme: bring back auto-removal of deleted namespaces during sequential scan blk-iocost: Pass gendisk to ioc_refresh_params nvme: fix sparse warning on effects masking block: be a bit more careful in checking for NULL bdev while polling block: clear bio->bi_bdev when putting a bio back in the cache loop: loop_set_status_from_info() check before assignment ublk: remove check IO_URING_F_SQE128 in ublk_ch_uring_cmd block: remove more NULL checks after bdev_get_queue() blk-mq: Reorder fields in 'struct blk_mq_tag_set' block: fix scan partition for exclusively open device again block: Revert "block: Do not reread partition table on exclusively open device" sed-opal: add support flag for SUM in status ioctl	2023-03-03 10:21:39 -08:00
Linus Torvalds	c3f9b9fa10	Two small fixes from Xiubo and myself, marked for stable. -----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmQA0uETHGlkcnlvbW92 QGdtYWlsLmNvbQAKCRBKf944AhHzi92pB/4yZ7Go/7j2zb84N9nEYPCHV23v1vED YGZIiWHYv6X3dJyTYpcU7Mn9TF00naTGDKi9NpTZjKOUIkibXPFJfbG7Dh4T2HhN TKw9EbldCaXE1mR7o+g/mrVQFM1PIR1VbtIeszL3eD2qO0aXEGyBMvPfUNqFX/M7 lNWVjuglIaYUL235Uid/wt0zfmPDvtGD24fjpN0e22UQh/aBFnodIDpa/AapsFKp yifzqe/ADbvgnHwOhMiEMG1gRFd3vywVfPDQmQ41oSMnf7yTtLWE9t47wTfyoTY5 IwZY2K1H51QJej/mObYJmClp/y81xSLXEydFdQ571MqZbDeDfQeM23/7 =cWWl -----END PGP SIGNATURE----- Merge tag 'ceph-for-6.3-rc1' of https://github.com/ceph/ceph-client Pull ceph fixes from Ilya Dryomov: "Two small fixes from Xiubo and myself, marked for stable" * tag 'ceph-for-6.3-rc1' of https://github.com/ceph/ceph-client: rbd: avoid use-after-free in do_rbd_add() when rbd_dev_create() fails ceph: update the time stamps and try to drop the suid/sgid	2023-03-02 10:48:30 -08:00
Ilya Dryomov	f7c4d9b133	rbd: avoid use-after-free in do_rbd_add() when rbd_dev_create() fails If getting an ID or setting up a work queue in rbd_dev_create() fails, use-after-free on rbd_dev->rbd_client, rbd_dev->spec and rbd_dev->opts is triggered in do_rbd_add(). The root cause is that the ownership of these structures is transfered to rbd_dev prematurely and they all end up getting freed when rbd_dev_create() calls rbd_dev_free() prior to returning to do_rbd_add(). Found by Linux Verification Center (linuxtesting.org) with SVACE, an incomplete patch submitted by Natalia Petrova <n.petrova@fintech.ru>. Cc: stable@vger.kernel.org Fixes: `1643dfa4c2` ("rbd: introduce a per-device ordered workqueue") Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2023-02-26 20:03:14 +01:00
Linus Torvalds	84cc6674b7	virtio,vhost,vdpa: features, fixes device feature provisioning in ifcvf, mlx5 new SolidNET driver support for zoned block device in virtio blk numa support in virtio pmem VIRTIO_F_RING_RESET support in vhost-net more debugfs entries in mlx5 resume support in vdpa completion batching in virtio blk cleanup of dma api use in vdpa now simulating more features in vdpa-sim documentation, features, fixes all over the place Signed-off-by: Michael S. Tsirkin <mst@redhat.com> -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmP0D98PHG1zdEByZWRo YXQuY29tAAoJECgfDbjSjVRpV6IH/iecRgLMWWjp3n31IFdu31f/J4HpF7dczVjK qtV98eJ1N2pkgeJkdCfmB5XszfvFBeAurrS7++FTHiJhrRfR3Z+2ml/Qtvh5DEyP qxz6wOw6VVsi/txdUxM1wsxLeEmmzkmFdAmPM+FXeIjhWj76GOgy/4A3eaj6TgzV W8ShsBve/UZ5qMOC3XbIscvdOrudHJ18tH90Tiz3NZfH1fAs5E4uWbU6Mrz9DJVr canGvf4kAI9z8qram5HSgzPIXRJEYiF4q/eiStdtiiME8gL1mHLRZDNP1I1LeCAb q6Q6RCRKi3Ek+LGdH6u+nR1Swu03N2b/g+vgKtv30kJo06oZVzw= =EasV -----END PGP SIGNATURE----- Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio updates from Michael Tsirkin: - device feature provisioning in ifcvf, mlx5 - new SolidNET driver - support for zoned block device in virtio blk - numa support in virtio pmem - VIRTIO_F_RING_RESET support in vhost-net - more debugfs entries in mlx5 - resume support in vdpa - completion batching in virtio blk - cleanup of dma api use in vdpa - now simulating more features in vdpa-sim - documentation, features, fixes all over the place * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (64 commits) vdpa/mlx5: support device features provisioning vdpa/mlx5: make MTU/STATUS presence conditional on feature bits vdpa: validate device feature provisioning against supported class vdpa: validate provisioned device features against specified attribute vdpa: conditionally read STATUS in config space vdpa: fix improper error message when adding vdpa dev vdpa/mlx5: Initialize CVQ iotlb spinlock vdpa/mlx5: Don't clear mr struct on destroy MR vdpa/mlx5: Directly assign memory key tools/virtio: enable to build with retpoline vringh: fix a typo in comments for vringh_kiov vhost-vdpa: print warning when vhost_vdpa_alloc_domain fails scsi: virtio_scsi: fix handling of kmalloc failure vdpa: Fix a couple of spelling mistakes in some messages vhost-net: support VIRTIO_F_RING_RESET vhost-scsi: convert sysfs snprintf and sprintf to sysfs_emit vdpa: mlx5: support per virtqueue dma device vdpa: set dma mask for vDPA device virtio-vdpa: support per vq dma device vdpa: introduce get_vq_dma_device() ...	2023-02-25 11:48:02 -08:00
Linus Torvalds	3822a7c409	- Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K DmxHkn0LAitGgJRS/W9w81yrgig9tAQ= =MlGs -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...	2023-02-23 17:09:35 -08:00
Zhong Jinghua	9f6ad5d533	loop: loop_set_status_from_info() check before assignment In loop_set_status_from_info(), lo->lo_offset and lo->lo_sizelimit should be checked before reassignment, because if an overflow error occurs, the original correct value will be changed to the wrong value, and it will not be changed back. More, the original patch did not solve the problem, the value was set and ioctl returned an error, but the subsequent io used the value in the loop driver, which still caused an alarm: loop_handle_cmd do_req_filebacked loff_t pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset; lo_rw_aio cmd->iocb.ki_pos = pos Fixes: `c490a0b5a4` ("loop: Check for overflow while configuring loop") Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20230221095027.3656193-1-zhongjinghua@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-22 20:43:09 -07:00
Linus Torvalds	6861eaf791	ATA changes for 6.3-rc1 * Small cleanup of the pata_octeon driver to drop a useless platform callback, from Uwe. * Simplify ata_scsi_cmd_error_handler() code using the fact that ap->ops->error_handler is NULL most of the time, from Wenchao. * Several patches improving libata error handling. This is in preparation for supporting the command duration limits (CDL) feature. The changes allow handling corner cases of ATA NCQ errors which do not happen with regular drives but will be triggered with CDL drives. From Niklas. * Simplify the qc_fill_rtf operation, from me. * Improve SCSI command translation for the REPORT_SUPPORTED_OPERATION_CODES command, from me. * Cleanup of libata FUA handling. This falls short of enabling FUA for ATA drives that support it by default as there were concerns that old drives would break. The series howeverfixes several issues with the FUA support to ensure that FUA is reported as being supported only for drives that can handle all possible write cases (NCQ and non-NCQ). A check in the block layer is also added to ensure that we never see read FUA commands (current behavior). From me. * Several patches to move the old PARIDE (parallel port IDE) driver to libata as pata_parport. Given that this driver also needs protocol modules, the driver code resides in its own pata_parport directoy under drivers/ata. From Ondrej. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCY/VTnQAKCRDdoc3SxdoY dk77AQCA1frczKhcOFe2PK/FsFAiO9Nlx/snk7V95JdjVG8GlwEAkey7mvbXMfX0 fDbqpaCkWFb6SvwxdMSATlqUvwEpSQ8= =tqQP -----END PGP SIGNATURE----- Merge tag 'ata-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata Pull ATA updates from Damien Le Moal: - Small cleanup of the pata_octeon driver to drop a useless platform callback (Uwe) - Simplify ata_scsi_cmd_error_handler() code using the fact that ap->ops->error_handler is NULL most of the time (Wenchao) - Several patches improving libata error handling. This is in preparation for supporting the command duration limits (CDL) feature. The changes allow handling corner cases of ATA NCQ errors which do not happen with regular drives but will be triggered with CDL drives (Niklas) - Simplify the qc_fill_rtf operation (me) - Improve SCSI command translation for REPORT_SUPPORTED_OPERATION_CODES command (me) - Cleanup of libata FUA handling. This falls short of enabling FUA for ATA drives that support it by default as there were concerns that old drives would break. The series however fixes several issues with the FUA support to ensure that FUA is reported as being supported only for drives that can handle all possible write cases (NCQ and non-NCQ). A check in the block layer is also added to ensure that we never see read FUA commands (current behavior) (me) - Several patches to move the old PARIDE (parallel port IDE) driver to libata as pata_parport. Given that this driver also needs protocol modules, the driver code resides in its own pata_parport directoy under drivers/ata (Ondrej) * tag 'ata-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata: ata: pata_parport: Fix ida_alloc return value error check drivers/block: Move PARIDE protocol modules to drivers/ata/pata_parport drivers/block: Remove PARIDE core and high-level protocols ata: pata_parport: add driver (PARIDE replacement) ata: libata: exclude FUA support for known buggy drives ata: libata: Fix FUA handling in ata_build_rw_tf() ata: libata: cleanup fua support detection ata: libata: Rename and cleanup ata_rwcmd_protocol() ata: libata: Introduce ata_ncq_supported() block: add a sanity check for non-write flush/fua bios ata: libata-scsi: improve ata_scsiop_maint_in() ata: libata-scsi: do not overwrite SCSI ML and status bytes ata: libata: move NCQ related ATA_DFLAGs ata: libata: respect successfully completed commands during errors ata: libata: read the shared status for successful NCQ commands once ata: libata: simplify qc_fill_rtf port operation interface ata: scsi: rename flag ATA_QCFLAG_FAILED to ATA_QCFLAG_EH ata: libata-eh: Cleanup ata_scsi_cmd_error_handler() ata: octeon: Drop empty platform remove function	2023-02-22 13:35:51 -08:00
Ming Lei	9c7c4bc986	ublk: remove check IO_URING_F_SQE128 in ublk_ch_uring_cmd sizeof(struct ublksrv_io_cmd) is 16bytes, which can be held in 64byte SQE, so not necessary to check IO_URING_F_SQE128. With this change, we get chance to save half SQ ring memory. Fixed: `71f28f3136` ("ublk_drv: add io_uring based userspace block driver") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230220041413.1524335-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-21 09:27:23 -07:00
Suwan Kim	07b679f70d	virtio-blk: support completion batching for the IRQ path This patch adds completion batching to the IRQ path. It reuses batch completion code of virtblk_poll(). It collects requests to io_comp_batch and processes them all at once. It can boost up the performance by 2%. To validate the performance improvement and stabilty, I did fio test with 4 vCPU VM and 12 vCPU VM respectively. Both VMs have 8GB ram and the same number of HW queues as vCPU. The fio cammad is as follows and I ran the fio 5 times and got IOPS average. (io_uring, randread, direct=1, bs=512, iodepth=64 numjobs=2,4) Test result shows about 2% improvement. 4 vcpu VM \| numjobs=2 \| numjobs=4 ----------------------------------------------------------- fio without patch \| 367.2K IOPS \| 397.6K IOPS ----------------------------------------------------------- fio with patch \| 372.8K IOPS \| 407.7K IOPS 12 vcpu VM \| numjobs=2 \| numjobs=4 ----------------------------------------------------------- fio without patch \| 363.6K IOPS \| 374.8K IOPS ----------------------------------------------------------- fio with patch \| 373.8K IOPS \| 385.3K IOPS Signed-off-by: Suwan Kim <suwan.kim027@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Message-Id: <20221221145456.281218-3-suwan.kim027@gmail.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2023-02-20 19:26:57 -05:00
Suwan Kim	489e18f3d7	virtio-blk: set req->state to MQ_RQ_COMPLETE after polling I/O is finished Driver should set req->state to MQ_RQ_COMPLETE after it finishes to process req. But virtio-blk doesn't set MQ_RQ_COMPLETE after virtblk_poll() handles req and req->state still remains MQ_RQ_IN_FLIGHT. Fortunately so far there is no issue about it because blk_mq_end_request_batch() sets req->state to MQ_RQ_IDLE. In this patch, virblk_poll() calls blk_mq_complete_request_remote() to set req->state to MQ_RQ_COMPLETE before it adds req to a batch completion list. So it properly sets req->state after polling I/O is finished. Fixes: `4e04005256` ("virtio-blk: support polling I/O") Signed-off-by: Suwan Kim <suwan.kim027@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Message-Id: <20221221145456.281218-2-suwan.kim027@gmail.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2023-02-20 19:26:57 -05:00
Linus Torvalds	5b0ed59649	for-6.3/block-2023-02-16 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPvfncQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpob2EADXJxcr2jjYHm/7cjKkyuVX8fr80dNdMeuY JFdsjG1k6Uj73BVhQQWYTcs/PsrWBHWRsv6uz4WgOELj55eXmf5Q0kJszyUeJW33 /DjqLvtoppVcYf80xE13wKvCfn73BjwQo6xkGM0qAYn15eaXiD/Ax3xC6eJlsBeK PEw7EJyhacbSxZa/1D2B6+mqII1jUQWProTCc3udZ4JHi3WvdWa3Rda0qCqHl4a1 +K2aP2YTFIRPxBzfMNa/CafWVIFubTdht+4Ds6R60RImzB9e0VUBfcsiUyW5Zg7L Fwv7ptXuWrALwVNdW56Oz1QikBxn2pdRR2HMLwKJW1MD8kP9r8LMm2jV5Rhiwe0B OQsGRYkOzBvw+bxeP5fvk0iPGVMz6ActH4gkraA5QdLqayDaFYOadlhqz0uRo5SH Fb42Vl658K/MHDSIk8U58TNkmrsIJsBGohXI9DOGINPPvv3XOPi4Q1HmXkGRmii0 y+lNU/QEGh7xXXew29SPP76uQpQaYfC7NxXCMw/OpOMwehzjsjshmM2lpxi8zsgt PJUmfHv5qxCplNmTJXmUpmX7sS7550HUdu9FJb13DM+gzKg8bk9jWVuLrzqrVlG5 1hKWEl1+heg1heRfaIuJVLbPI0au6Sb4uqhih/PHyrP9TWIoAruDbDJM65GKTxyE 2uEgcHzHQw== =poRc -----END PGP SIGNATURE----- Merge tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - NVMe updates via Christoph: - Small improvements to the logging functionality (Amit Engel) - Authentication cleanups (Hannes Reinecke) - Cleanup and optimize the DMA mapping cod in the PCIe driver (Keith Busch) - Work around the command effects for Format NVM (Keith Busch) - Misc cleanups (Keith Busch, Christoph Hellwig) - Fix and cleanup freeing single sgl (Keith Busch) - MD updates via Song: - Fix a rare crash during the takeover process - Don't update recovery_cp when curr_resync is ACTIVE - Free writes_pending in md_stop - Change active_io to percpu - Updates to drbd, inching us closer to unifying the out-of-tree driver with the in-tree one (Andreas, Christoph, Lars, Robert) - BFQ update adding support for multi-actuator drives (Paolo, Federico, Davide) - Make brd compliant with REQ_NOWAIT (me) - Fix for IOPOLL and queue entering, fixing stalled IO waiting on timeouts (me) - Fix for REQ_NOWAIT with multiple bios (me) - Fix memory leak in blktrace cleanup (Greg) - Clean up sbitmap and fix a potential hang (Kemeng) - Clean up some bits in BFQ, and fix a bug in the request injection (Kemeng) - Clean up the request allocation and issue code, and fix some bugs related to that (Kemeng) - ublk updates and fixes: - Add support for unprivileged ublk (Ming) - Improve device deletion handling (Ming) - Misc (Liu, Ziyang) - s390 dasd fixes (Alexander, Qiheng) - Improve utility of request caching and fixes (Anuj, Xiao) - zoned cleanups (Pankaj) - More constification for kobjs (Thomas) - blk-iocost cleanups (Yu) - Remove bio splitting from drivers that don't need it (Christoph) - Switch blk-cgroups to use struct gendisk. Some of this is now incomplete as select late reverts were done. (Christoph) - Add bvec initialization helpers, and convert callers to use that rather than open-coding it (Christoph) - Misc fixes and cleanups (Jinke, Keith, Arnd, Bart, Li, Martin, Matthew, Ulf, Zhong) * tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux: (169 commits) brd: use radix_tree_maybe_preload instead of radix_tree_preload block: use proper return value from bio_failfast() block: bio-integrity: Copy flags when bio_integrity_payload is cloned block: Fix io statistics for cgroup in throttle path brd: mark as nowait compatible brd: check for REQ_NOWAIT and set correct page allocation mask brd: return 0/-error from brd_insert_page() block: sync mixed merged request's failfast with 1st bio's Revert "blk-cgroup: pin the gendisk in struct blkcg_gq" Revert "blk-cgroup: pass a gendisk to blkg_lookup" Revert "blk-cgroup: delay blk-cgroup initialization until add_disk" Revert "blk-cgroup: delay calling blkcg_exit_disk until disk_release" Revert "blk-cgroup: move the cgroup information to struct gendisk" nvme-pci: remove iod use_sgls nvme-pci: fix freeing single sgl block: ublk: check IO buffer based on flag need_get_data s390/dasd: Fix potential memleak in dasd_eckd_init() s390/dasd: sort out physical vs virtual pointers usage block: Remove the ALLOC_CACHE_SLACK constant block: make kobj_type structures constant ...	2023-02-20 14:27:21 -08:00
Pankaj Raghav	0aa2988e4f	brd: use radix_tree_maybe_preload instead of radix_tree_preload Unconditionally calling radix_tree_preload_end() results in a OOPS message as the preload is only conditionally called for gfpflags_allow_blocking(). [ 20.267323] BUG: using smp_processor_id() in preemptible [00000000] code: fio/416 [ 20.267837] caller is brd_insert_page.part.0+0xbe/0x190 [brd] [ 20.269436] Call Trace: [ 20.269598] <TASK> [ 20.269742] dump_stack_lvl+0x32/0x50 [ 20.269982] check_preemption_disabled+0xd1/0xe0 [ 20.270289] brd_insert_page.part.0+0xbe/0x190 [brd] [ 20.270664] brd_submit_bio+0x33f/0xf40 [brd] Use radix_tree_maybe_preload() which does preload only if gfpflags_allow_blocking() is true but also takes the lock. Therefore, unconditionally calling radix_tree_preload_end() should not create any issues and the message disappears. Fixes: `6ded703c56` ("brd: check for REQ_NOWAIT and set correct page allocation mask") Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20230217121442.33914-1-p.raghav@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-17 06:15:53 -07:00
Jens Axboe	67205f80be	brd: mark as nowait compatible By default, non-mq drivers do not support nowait. This causes io_uring to use a slower path as the driver cannot be trust not to block. brd can safely set the nowait flag, as worst case all it does is a NOIO allocation. For io_uring, this makes a substantial difference. Before: submitter=0, tid=453, file=/dev/ram0, node=-1 polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=128 Engine=io_uring, sq_ring=128, cq_ring=128 IOPS=440.03K, BW=1718MiB/s, IOS/call=32/31 IOPS=428.96K, BW=1675MiB/s, IOS/call=32/32 IOPS=442.59K, BW=1728MiB/s, IOS/call=32/31 IOPS=419.65K, BW=1639MiB/s, IOS/call=32/32 IOPS=426.82K, BW=1667MiB/s, IOS/call=32/31 and after: submitter=0, tid=354, file=/dev/ram0, node=-1 polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=128 Engine=io_uring, sq_ring=128, cq_ring=128 IOPS=3.37M, BW=13.15GiB/s, IOS/call=32/31 IOPS=3.45M, BW=13.46GiB/s, IOS/call=32/31 IOPS=3.43M, BW=13.42GiB/s, IOS/call=32/32 IOPS=3.43M, BW=13.39GiB/s, IOS/call=32/31 IOPS=3.43M, BW=13.38GiB/s, IOS/call=32/31 or about an 8x in difference. Now that brd is prepared to deal with REQ_NOWAIT reads/writes, mark it as supporting that. Cc: stable@vger.kernel.org # 5.10+ Link: https://lore.kernel.org/linux-block/20230203103005.31290-1-p.raghav@samsung.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-16 10:02:55 -07:00
Jens Axboe	6ded703c56	brd: check for REQ_NOWAIT and set correct page allocation mask If REQ_NOWAIT is set, then do a non-blocking allocation if the operation is a write and we need to insert a new page. Currently REQ_NOWAIT cannot be set as the queue isn't marked as supporting nowait, this change is in preparation for allowing that. radix_tree_preload() warns on attempting to call it with an allocation mask that doesn't allow blocking. While that warning could arguably be removed, we need to handle radix insertion failures anyway as they are more likely if we cannot block to get memory. Remove legacy BUG_ON()'s and turn them into proper errors instead, one for the allocation failure and one for finding a page that doesn't match the correct index. Cc: stable@vger.kernel.org # 5.10+ Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-16 10:02:41 -07:00
Jens Axboe	db0ccc44a2	brd: return 0/-error from brd_insert_page() It currently returns a page, but callers just check for NULL/page to gauge success. Clean this up and return the appropriate error directly instead. Cc: stable@vger.kernel.org # 5.10+ Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-16 10:02:31 -07:00
Michael S. Tsirkin	2a9c844e89	virtio_blk: zone append in header type tweak virtio blk returns a 64 bit append_sector in an input buffer, in LE format. This field is not tagged as LE correctly, so even though the generated code is ok, we get warnings from sparse: drivers/block/virtio_blk.c:332:33: sparse: sparse: cast to restricted __le64 Make sparse happy by using the correct type. Message-Id: <20221220125154.564265-1-mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2023-02-15 06:46:22 -05:00
Michael S. Tsirkin	04e5421e6f	virtio_blk: temporary variable type tweak virtblk_result returns blk_status_t which is a bitwise restricted type, so we are not supposed to stuff it in a plain int temporary variable. All we do with it is pass it on to a function expecting blk_status_t so the generated code is ok, but we get warnings from sparse: drivers/block/virtio_blk.c:326:36: sparse: sparse: incorrect type in initializer (different base types) @@ expected int status @@ +got restricted blk_status_t @@ drivers/block/virtio_blk.c:334:33: sparse: sparse: incorrect type in argument 2 (different base types) @@ expected restricted +blk_status_t [usertype] error @@ got int status @@ Make sparse happy by using the correct type. Message-Id: <20221220124152.523531-1-mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>	2023-02-15 06:46:22 -05:00
Dmitry Fomichev	95bfec41bd	virtio-blk: add support for zoned block devices This patch adds support for Zoned Block Devices (ZBDs) to the kernel virtio-blk driver. The patch accompanies the virtio-blk ZBD support draft that is now being proposed for standardization. The latest version of the draft is linked at https://github.com/oasis-tcs/virtio-spec/issues/143 . The QEMU zoned device code that implements these protocol extensions has been developed by Sam Li and it is currently in review at the QEMU mailing list. A number of virtblk request structure changes has been introduced to accommodate the functionality that is specific to zoned block devices and, most importantly, make room for carrying the Zoned Append sector value from the device back to the driver along with the request status. The zone-specific code in the patch is heavily influenced by NVMe ZNS code in drivers/nvme/host/zns.c, but it is simpler because the proposed virtio ZBD draft only covers the zoned device features that are relevant to the zoned functionality provided by Linux block layer. includes the following fixup: virtio-blk: fix probe without CONFIG_BLK_DEV_ZONED When building without CONFIG_BLK_DEV_ZONED, VIRTIO_BLK_F_ZONED is excluded from array of driver features. As a result virtio_has_feature panics in virtio_check_driver_offered_feature since that by design verifies that a feature we are checking for is listed in the feature array. To fix, replace the call to virtio_has_feature with a stub. Message-Id: <20221016034127.330942-3-dmitry.fomichev@wdc.com> Co-developed-by: Stefan Hajnoczi <stefanha@gmail.com> Signed-off-by: Stefan Hajnoczi <stefanha@gmail.com> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com> Message-Id: <20221220112340.518841-1-mst@redhat.com> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Reported-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Debugged-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Tested-by: Anders Roxell <anders.roxell@linaro.org> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>	2023-02-15 06:46:22 -05:00
Liu Xiaodong	2f1e07dda1	block: ublk: check IO buffer based on flag need_get_data Currently, uring_cmd with UBLK_IO_FETCH_REQ or UBLK_IO_COMMIT_AND_FETCH_REQ is always checked whether userspace server has provided IO buffer even flag UBLK_F_NEED_GET_DATA is configured. This is a excessive check. If UBLK_F_NEED_GET_DATA is configured, FETCH_RQ doesn't need to provide IO buffer; COMMIT_AND_FETCH_REQ also doesn't need to do that if the IO type is not READ. Check ub_cmd->addr together with ublk_need_get_data() and IO type in ublk_ch_uring_cmd(). With this fix, userspace server doesn't need to preserve buffers for every ublk_io when flag UBLK_F_NEED_GET_DATA is configured, in order to save memory. Signed-off-by: Liu Xiaodong <xiaodong.liu@intel.com> Fixes: `c86019ff75` ("ublk_drv: add support for UBLK_IO_NEED_GET_DATA") Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230210141356.112321-1-xiaodong.liu@intel.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-13 08:36:23 -07:00
Ming Lei	0abe39dec0	block: ublk: improve handling device deletion Inside ublk_ctrl_del_dev(), when the device is removed, we wait until the device number is freed with holding global lock of ublk_ctl_mutex, this way isn't friendly from user viewpoint: 1) if device is in-use, the current delete command hangs in ublk_ctrl_del_dev(), and user can't break from the handling because wait_event() is used 2) global lock is held, so any new device can't be added and other old devices can't be removed. Improve the deleting handling by the following way, suggested by Nadav: 1) wait without holding the global lock 2) replace wait_event() with wait_event_interruptible() Reported-by: Nadav Amit <nadav.amit@gmail.com> Suggested-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230207150700.545530-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-07 18:53:51 -07:00
Ziyang Zhang	1972d038a5	ublk: pass NULL to blk_mq_alloc_disk() as queuedata queuedata is not referenced in ublk_drv and we can use driver_data instead. Pass NULL to blk_mq_alloc_disk() as queuedata while allocating ublk's gendisk. Signed-off-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230207070839.370817-4-ZiyangZhang@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-07 07:21:31 -07:00
Ziyang Zhang	b352389e7b	ublk: mention WRITE_ZEROES in comment of ublk_complete_rq() WRITE_ZEROES won't return bytes returned just like FLUSH and DISCARD, and we can end it directly. Add missing comment for it in ublk_complete_rq(). Signed-off-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230207070839.370817-3-ZiyangZhang@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-07 07:21:31 -07:00
Ziyang Zhang	731e208d7b	ublk: remove unnecessary NULL check in ublk_rq_has_data() bio_has_data() allows a NULL bio so the NULL check in ublk_rq_has_data() is unnecessary. Signed-off-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230207070839.370817-2-ZiyangZhang@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-07 07:21:31 -07:00
Linus Torvalds	0136d86b78	block-6.2-2023-02-03 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPdRq8QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjcqEADcWlRjkcLzRpEMD9g3IyDShasT1JVeSvV6 xqDuA0kRF6DyObu82jE2wiZ49FRpeCUw6S6ZdVhvwGHgPpfLBuPWonFnTqxYAnSz XCYnt4QdZHGiydIHVxkyP8Raz6d24kZawlUmbE7dcfksNziyGR5UjbCsk1HNJhmf EvnLZ2EozZwsZLW/RRYZrh9Q8ccB8kJeX+JuUVw7sboNyJ+bW+x+7prlm3CKgopX IiP69E6qIPe6RHkyLRdKgYgxRdcgeq6uJk/nuZ/6uPCcyrz+0QEtge3CkTe7zLkF CPmbWlqngmNfNsS93nPTK2kHWTz8P2spo+UTkXIegSYBA8CIr9lDxazSFKT0B6zH yIWzmQoE7YXRI5B21rlPvNGE/gPSy48mSn1ym/MCf+UyWGneRypeU/K//2Ww3UJK F1Xl2c1v/EEr28qPuC8VQbAsQ56GOcZ6zW4Q0grxTYm0KzzJ2O5B3FEHdCWlS/x9 KY5v3a8a3nXg9rNio0ruXiyD5l7PE5nFESNrBFDS4kEfxk4cx50ZfgDH68d515/W //EnNjx9nN20yF+LcKD70KJHxPdWaUXGT2c1+E/tdbrgUKReCpER+5hQc8+YxQML DCbzr7LJjX5mmDQ5YI6Y09/L6luzFMjrnxpmXkL7nyWQlSYkMqus3vPtDcJ5Xk2J shHBlzIcuw== =/+rE -----END PGP SIGNATURE----- Merge tag 'block-6.2-2023-02-03' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: "A bit bigger than I'd like at this point, but mostly a bunch of little fixes. In detail: - NVMe pull request via Christoph: - Fix a missing queue put in nvmet_fc_ls_create_association (Amit Engel) - Clear queue pointers on tag_set initialization failure (Maurizio Lombardi) - Use workqueue dedicated to authentication (Shin'ichiro Kawasaki) - Fix for an overflow in ublk (Liu) - Fix for leaking a queue reference in block cgroups (Ming) - Fix for a use-after-free in BFQ (Yu)" * tag 'block-6.2-2023-02-03' of git://git.kernel.dk/linux: blk-cgroup: don't update io stat for root cgroup nvme-auth: use workqueue dedicated to authentication nvme: clear the request_queue pointers on failure in nvme_alloc_io_tag_set nvme: clear the request_queue pointers on failure in nvme_alloc_admin_tag_set nvme-fc: fix a missing queue put in nvmet_fc_ls_create_association block: Fix the blk_mq_destroy_queue() documentation block: ublk: extending queue_size to fix overflow block, bfq: fix uaf for bfqq in bic_set_bfqq()	2023-02-03 11:35:42 -08:00
Christoph Hellwig	13ae4db0c0	zram: use bvec_set_page to initialize bvecs Use the bvec_set_page helper to initialize bvecs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230203150634.3199647-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:55 -07:00
Christoph Hellwig	b831f3a103	virtio_blk: use bvec_set_virt to initialize special_vec Use the bvec_set_virt helper to initialize the special_vec. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20230203150634.3199647-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:55 -07:00

1 2 3 4 5 ...

7816 commits