linux-stable

Commit Graph

Author	SHA1	Message	Date
Christoph Hellwig	5d13243820	blk-wbt: remove the separate write cache tracking Use the queue wide write back cache tracking instead of duplicating the value in struct rq_wb. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231226090747.204969-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-26 09:28:10 -07:00
Christoph Hellwig	1c042f8d4b	block: reject invalid operation in submit_bio_noacct submit_bio_noacct allows completely invalid operations, or operations that are not supported in the bio path. Extend the existing switch statement to reject all invalid types. Move the code point for REQ_OP_ZONE_APPEND so that it's not right in the middle of the zone management operations and the switch statement can follow the numerical order of the operations. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231221070538.1112446-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-26 09:27:14 -07:00
Randy Dunlap	8aabc11c8f	drbd: actlog: fix kernel-doc warnings and spelling Fix all kernel-doc warnings in drbd_actlog.c: drbd_actlog.c:963: warning: No description found for return value of 'drbd_rs_begin_io' drbd_actlog.c:1015: warning: Function parameter or member 'peer_device' not described in 'drbd_try_rs_begin_io' drbd_actlog.c:1015: warning: Excess function parameter 'device' description in 'drbd_try_rs_begin_io' drbd_actlog.c:1015: warning: No description found for return value of 'drbd_try_rs_begin_io' drbd_actlog.c:1197: warning: No description found for return value of 'drbd_rs_del_all' Fix one spelling error (s/ore/or/). Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Cc: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Cc: <drbd-dev@lists.linbit.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <linux-block@vger.kernel.org> Link: https://lore.kernel.org/r/20231222061909.8791-1-rdunlap@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-22 07:18:35 -07:00
Kundan Kumar	8e6e83d772	block: skip start/end time stamping for passthrough IO Commit `41fa722239` ("blk-mq: do not include passthrough requests in I/O accounting") disables I/O accounting for passthrough requests. Since tools like 'iostat' do not show anything useful for passthrough I/O, it's wasteful to do start/end time-stamping. So do away with that. Avoiding the time-stamping improves the I/O performance by ~7%. Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20231222101707.6921-1-kundan.kumar@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-22 07:14:15 -07:00
Jens Axboe	f70a479228	Merge tag 'nvme-6.8-2023-12-21' of git://git.infradead.org/nvme into for-6.8/block Pull NVMe updates from Keith: "nvme updates for Linux 6.8 - nvme fabrics spec updates (Guixin, Max) - nvme target updates (Guixin, Evan) - nvme attribute refactoring (Daniel) - nvme-fc numa fix (Keith)" * tag 'nvme-6.8-2023-12-21' of git://git.infradead.org/nvme: nvme-fc: set numa_node after nvme_init_ctrl nvme-fabrics: don't check discovery ioccsz/iorcsz nvmet: configfs: use ctrl->instance to track passthru subsystems nvme: repack struct nvme_ns_head nvme: add csi, ms and nuse to sysfs nvme: rename ns attribute group nvme: refactor ns info setup function nvme: refactor ns info helpers nvme: move ns id info to struct nvme_ns_head nvmet: remove cntlid_min and cntlid_max check in nvmet_alloc_ctrl nvmet: allow identical cntlid_min and cntlid_max settings nvme-fabrics: check ioccsz and iorcsz nvme: introduce nvme_check_ctrl_fabric_info helper	2023-12-21 14:44:17 -07:00
Keith Busch	5d51dc8db1	nvme-fc: set numa_node after nvme_init_ctrl nvme_init_ctrl() resets numa_node to NUMA_NO_NODE, so be sure to set the desired value after that function call so it won't be overwritten. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2023-12-21 09:19:01 -08:00
Max Gurtovoy	7642138e17	nvme-fabrics: don't check discovery ioccsz/iorcsz IOCCSZ and IORCSZ are reserved for discovery controllers. Avoid checking their values during identify controller phase. Fixes: `2fcd3ab398` ("nvme-fabrics: check ioccsz and iorcsz") Reported-by: Daniel Wagner <dwagner@suse.de> Tested-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2023-12-21 09:18:25 -08:00
Jens Axboe	5165799f0d	block: export disk_clear_zoned() A previous commit split disk_set_zoned(..., bool) into not taking an argument for whether to set or clear, and instead added disk_clear_zoned() as the counterpart. However, that commit neglected to export the new symbol, causing failures for modular drivers that used it. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: `d73e93b4df` ("block: simplify disk_set_zoned") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-20 20:32:12 -07:00
Christoph Hellwig	5cc99b8978	sd: only call disk_clear_zoned when needed disk_clear_zoned only needs to be called when a device reported zone managed mode first and we clear it. Add a check so that disk_clear_zoned isn't called on devices that were never zoned. This avoids a fairly expensive queue freezing when revalidating conventional devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20231217165359.604246-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-19 20:17:43 -07:00
Christoph Hellwig	d73e93b4df	block: simplify disk_set_zoned Only use disk_set_zoned to actually enable zoned device support. For clearing it, call disk_clear_zoned, which is renamed from disk_clear_zone_settings and now directly clears the zoned flag as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20231217165359.604246-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-19 20:17:43 -07:00
Christoph Hellwig	7437bb73f0	block: remove support for the host aware zone model When zones were first added to the SCSI and ATA specs, two different models were supported (in addition to the drive managed one that is invisible to the host): - host managed, where for non-conventional zones there is a strict requirement to write at the write pointer, or else an error is returned - host aware, where a write pointer is maintained if writes always happen at it, otherwise it is left in an under-defined state and the sequential write preferred zones behave like conventional zones (probably very badly performing ones, though) Not surprisingly this lukewarm model didn't prove to be very useful and was finally removed from the ZBC and SBC specs (NVMe never implemented it). Due to the easily disappearing write pointer, host software could never rely on the write pointer to actually be useful for, say, recovery. Fortunately only a few HDD prototypes shipped using this model, and they never made it to mass production. Drop the support before it is too late. Note that any such host aware prototype HDD can still be used with Linux as we'll now treat it as a conventional HDD. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20231217165359.604246-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-19 20:17:43 -07:00
Christoph Hellwig	a971ed8002	virtio_blk: remove the broken zone revalidation support virtblk_revalidate_zones is called unconditionally from virtblk_config_changed_work from the virtio config_changed callback. virtblk_revalidate_zones is a bit odd in that it re-clears the zoned state for host aware or non-zoned devices, which isn't needed unless the zoned mode changed - but a zone mode change to a host managed model isn't handled at all, and virtio_blk also doesn't handle any other config change except for a capacity change (and even if it did, the upper layers above virtio_blk wouldn't handle it very well). But even the useful case of a size change that would add or remove zones isn't handled properly as blk_revalidate_disk_zones expects the device capacity to cover all zones, but the capacity is only updated after virtblk_revalidate_zones. As this code appears to be entirely untested and is getting in the way, remove it for now, but it can be re-added in a fixed version with proper test coverage if needed. Fixes: `95bfec41bd` ("virtio-blk: add support for zoned block devices") Fixes: `f1ba4e674f` ("virtio-blk: fix to match virtio spec") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20231217165359.604246-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-19 20:17:43 -07:00
Christoph Hellwig	77360cadaa	virtio_blk: cleanup zoned device probing Move reading and checking the zoned model from virtblk_probe_zoned_device into the caller, leaving only the code to perform the actual setup for host managed zoned devices in virtblk_probe_zoned_device. This allows sharing the model reading and checking between builds with and without CONFIG_BLK_DEV_ZONED, and improves it for the !CONFIG_BLK_DEV_ZONED case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20231217165359.604246-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-19 20:17:43 -07:00
Jens Axboe	0bd7c5d802	Merge tag 'md-next-20231219' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.8/block Pull MD updates from Song: "1. Remove deprecated flavors, by Song Liu; 2. raid1 read error check support, by Li Nan; 3. Better handle events off-by-1 case, by Alex Lyakas." * tag 'md-next-20231219' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: Remove deprecated CONFIG_MD_FAULTY md: Remove deprecated CONFIG_MD_MULTIPATH md: Remove deprecated CONFIG_MD_LINEAR md/raid1: support read error check md: factor out a helper exceed_read_errors() to check read_errors md: When assembling the array, consult the superblock of the freshest device md/raid1: remove unnecessary null checking	2023-12-19 15:49:23 -07:00
Song Liu	415c745187	md: Remove deprecated CONFIG_MD_FAULTY md-faulty has been marked as deprecated for 2.5 years. Remove it. Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Guoqing Jiang <guoqing.jiang@linux.dev> Cc: Mateusz Grzonka <mateusz.grzonka@intel.com> Cc: Jes Sorensen <jes@trained-monkey.org> Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20231214222107.2016042-4-song@kernel.org	2023-12-19 10:37:50 -08:00
Song Liu	d8730f0cf4	md: Remove deprecated CONFIG_MD_MULTIPATH md-multipath has been marked as deprecated for 2.5 years. Remove it. Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Guoqing Jiang <guoqing.jiang@linux.dev> Cc: Mateusz Grzonka <mateusz.grzonka@intel.com> Cc: Jes Sorensen <jes@trained-monkey.org> Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20231214222107.2016042-3-song@kernel.org	2023-12-19 10:37:41 -08:00
Song Liu	849d18e27b	md: Remove deprecated CONFIG_MD_LINEAR md-linear has been marked as deprecated for 2.5 years. Remove it. Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Guoqing Jiang <guoqing.jiang@linux.dev> Cc: Mateusz Grzonka <mateusz.grzonka@intel.com> Cc: Jes Sorensen <jes@trained-monkey.org> Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20231214222107.2016042-2-song@kernel.org	2023-12-19 10:16:51 -08:00
Evan Burgess	536ecccbaf	nvmet: configfs: use ctrl->instance to track passthru subsystems To prevent enabling more than one passthrough subsystem per NVMe controller, passthru.c maintains an xarray indexed by cntlid values. Passthrough for a given nvmet subsystem cannot be enabled by configfs if the subsystem's passthru_ctrl->cntlid value is already accounted for in the xarray. However, according to the NVMe spec (rev 2.0c, p.145), "The Controller ID (CNTLID) value returned in the Identify Controller data structure may be used to uniquely identify a controller within an NVM subsystem," meaning that cntlid values are not guaranteed to be globally unique across multiple subsystems. Instead, the cntlid only uniquely identifies multiple controllers _within_ a subsystem. As a result, multiple unique & valid NVMe targets can be blocked from enabling passthrough at the same time if their controllers share cntlid values, a behavior allowed by the spec. Fix this by indexing the xarray with passthru_ctrl->instance values, which are allocated per controller by IDA and thus should be truly unique. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Evan Burgess <evan.burgess@seagate.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2023-12-19 09:10:22 -08:00
Daniel Wagner	9639296151	nvme: repack struct nvme_ns_head ns_id, lba_shift and ms are always accessed for every read/write I/O in nvme_setup_rw. By grouping these variables into one cacheline we can save some cycles. 4k sequential reads (baseline vs. patched): Bandwidth: 1620 vs. 1634; IOPs: 66345579 vs. 66910939. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2023-12-19 09:10:13 -08:00
Daniel Wagner	a1a825ab6a	nvme: add csi, ms and nuse to sysfs libnvme uses sysfs to enumerate the nvme resources, but a few attributes are missing from sysfs. For these, libnvme issues commands during discovery. As the kernel already knows all these attributes and we would like to avoid having libnvme issue commands all the time, expose the missing attributes. The nuse value is updated on request because nuse is a volatile value. Since any user can read the sysfs attribute, a very simple rate limit is added (update once every 5 seconds). A more sophisticated update strategy can be added later if there is actually a need for it. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2023-12-19 09:10:08 -08:00
Daniel Wagner	83ac678e59	nvme: rename ns attribute group Drop the 'id' part of the attribute group name because we want to expose non 'id' related attributes via the ns attribute group. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2023-12-19 09:10:05 -08:00
Daniel Wagner	d386aedc94	nvme: refactor ns info setup function Use nvme_ns_head instead of nvme_ns where possible. This reduces the coupling between the different data structures. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2023-12-19 09:10:01 -08:00
Daniel Wagner	0372dd4e36	nvme: refactor ns info helpers Pass in the nvme_ns_head pointer directly. This reduces the need for the caller to have the nvme_ns data structure present. Thus we can refactor the caller side in the next step as well. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2023-12-19 09:09:58 -08:00
Daniel Wagner	9419e71b8d	nvme: move ns id info to struct nvme_ns_head Move the namespace info to struct nvme_ns_head, because it's the same for all associated namespaces. Note: with multipathing enabled the PI information is shared between all paths. If a path is using a different PI configuration it will overwrite the previous settings. This is obviously not correct and such a configuration will be rejected in the future. For the time being we expect correctly configured storage. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2023-12-19 09:09:15 -08:00
Li Nan	4c434392c4	block: add check of 'minors' and 'first_minor' in device_add_disk() 'first_minor' represents the starting minor number of disks, and 'minors' represents the number of partitions in the device. Neither of them can be greater than MINORMASK + 1. Commit `e338924bd0` ("block: check minor range in device_add_disk()") only added the check of 'first_minor + minors'. However, their sum might be less than MINORMASK but their values are wrong. Complete the checks now. Fixes: `e338924bd0` ("block: check minor range in device_add_disk()") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231219075942.840255-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-19 08:23:12 -07:00
Kundan Kumar	6c9b97085c	block: skip cgroups for passthrough io Even if BLK_CGROUP is enabled, it does not work for passthrough io. So skip setting up blkg for passthrough bio. Reduced processing gives ~5% hike in peak-performance workload. Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20231218152722.1768-1-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-18 09:46:53 -07:00
Li Nan	ca294b34aa	md/raid1: support read error check After commit `1e50915fe0` ("raid: improve MD/raid10 handling of correctable read errors."), an rdev in raid10 is set to faulty if it hits read errors too many times. Add this mechanism to raid1 now. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231215023852.3478228-3-linan666@huaweicloud.com	2023-12-15 15:22:15 -08:00
Li Nan	1979dbbe32	md: factor out a helper exceed_read_errors() to check read_errors Move check_decay_read_errors() to raid1-10.c and factor out a helper exceed_read_errors() to check if read_errors exceeds the limit, so that raid1 can also use it. There are no functional changes. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231215023852.3478228-2-linan666@huaweicloud.com	2023-12-15 15:22:15 -08:00
Alex Lyakas	dc1cc22ed5	md: When assembling the array, consult the superblock of the freshest device Upon assembling the array, both kernel and mdadm allow the devices to have an event counter difference of 1, and still consider them as up-to-date. However, a device whose event count is behind by 1 may in fact not be up-to-date, and array resync with such a device may cause data corruption. To avoid this, consult the superblock of the freshest device about the status of a device whose event counter is behind by 1. Signed-off-by: Alex Lyakas <alex.lyakas@zadara.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/1702470271-16073-1-git-send-email-alex.lyakas@zadara.com	2023-12-15 15:21:24 -08:00
Jens Axboe	0c734c5ea7	block: improve struct request_queue layout It's clearly been a while since someone looked at this, so I gave it a quick shot. There are a few issues in here: - Random bundling of members that are mostly read-only and often written - Random holes that need not be there This moves the most frequently used bits into cacheline 1 and 2, with the 2nd one being more write intensive than the first one, which is basically read-only. Outside of making this work a bit more efficiently, it also reduces the size of struct request_queue for my test setup from 864 bytes (spanning 14 cachelines!) to 832 bytes and 13 cachelines. Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/d2b7b61c-4868-45c0-9060-4f9c73de9d7e@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-15 07:34:51 -07:00
Christoph Hellwig	6ef02df154	block: support adding less than len in bio_add_hw_page bio_add_hw_page currently always fails or succeeds. This is fine for the existing callers that always add PAGE_SIZE worth given that the max_segment_size and max_sectors must always allow at least a page worth of data. But when we want to add it for bigger amounts of data this means it can also fail when adding the data to a bio, and creating a fallback for that becomes really annoying in the callers. Make use of the existing API design that allows to return a smaller length than the one passed in and add up to max_segment_size worth of data from a larger input. All the existing callers are fine with this - not because they handle this return correctly, but because they never pass more than a page in. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20231204173419.782378-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-15 07:34:27 -07:00
Christoph Hellwig	3f034c374a	block: prevent an integer overflow in bvec_try_merge_hw_page Reordered a check to avoid a possible overflow when adding len to bv_len. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20231204173419.782378-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-15 07:34:27 -07:00
Gou Hao	af140f806a	md/raid1: remove unnecessary null checking If %__GFP_DIRECT_RECLAIM is set then bio_alloc_bioset will always be able to allocate a bio. See comment of bio_alloc_bioset. Signed-off-by: Gou Hao <gouhao@uniontech.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231214151458.28970-1-gouhao@uniontech.com	2023-12-15 00:45:37 -08:00
Bart Van Assche	f19d1e3b17	block: Use pr_info() instead of printk(KERN_INFO ...) Switch to the modern style of printing kernel messages. Use %u instead of %d to print unsigned integers. Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20231213194702.90381-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-14 10:28:56 -07:00
Guixin Liu	4ba8b3f7d3	nvmet: remove cntlid_min and cntlid_max check in nvmet_alloc_ctrl The cntlid_min and cntlid_max are checked in configfs, don't check again in nvmet_alloc_ctrl(). Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>	2023-12-13 14:53:33 -08:00
Guixin Liu	906dbc47b1	nvmet: allow identical cntlid_min and cntlid_max settings When the user wants to restrict to only creating one controller, they can set cntlid_min and cntlid_max to the same value. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>	2023-12-13 14:53:33 -08:00
Min Li	6f64f866aa	block: add check that partition length needs to be aligned with block size Before calling add partition or resize partition, there is no check on whether the length is aligned with the logical block size. If the logical block size of the disk is larger than 512 bytes, then the partition size may not be a multiple of the logical block size, and when the last sector is read, bio_truncate() will adjust the bio size, resulting in an IO error if the size of the read command is smaller than the logical block size. If integrity data is supported, this will also result in a null pointer dereference when calling bio_integrity_free. Cc: <stable@vger.kernel.org> Signed-off-by: Min Li <min15.li@samsung.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230629142517.121241-1-min15.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-13 08:19:14 -07:00
Li Nan	5fa3d1a00c	block: Set memalloc_noio to false on device_add_disk() error path On the error path of device_add_disk(), the device's memalloc_noio flag was set but not cleared. As the comment on pm_runtime_set_memalloc_noio() says, "The function should be called between device_add() and device_del()". Clear this flag before device_del() now. Fixes: `25e823c8c3` ("block/genhd.c: apply pm_runtime_set_memalloc_noio on block devices") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231211075356.1839282-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-13 08:17:02 -07:00
Kees Cook	9e4bf6a08d	block/rnbd-srv: Check for unlikely string overflow Since "dev_search_path" can technically be as large as PATH_MAX, there was a risk of truncation when copying it and a second string into "full_path" since it was also PATH_MAX sized. The W=1 builds were reporting this warning: drivers/block/rnbd/rnbd-srv.c: In function 'process_msg_open.isra': drivers/block/rnbd/rnbd-srv.c:616:51: warning: '%s' directive output may be truncated writing up to 254 bytes into a region of size between 0 and 4095 [-Wformat-truncation=] 616 | snprintf(full_path, PATH_MAX, "%s/%s", | ^~ In function 'rnbd_srv_get_full_path', inlined from 'process_msg_open.isra' at drivers/block/rnbd/rnbd-srv.c:721:14: drivers/block/rnbd/rnbd-srv.c:616:17: note: 'snprintf' output between 2 and 4351 bytes into a destination of size 4096 616 | snprintf(full_path, PATH_MAX, "%s/%s", | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 617 | dev_search_path, dev_name); | ~~~~~~~~~~~~~~~~~~~~~~~~~~ To fix this, unconditionally check for truncation (as was already done for the case where "%SESSNAME%" was present). Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202312100355.lHoJPgKy-lkp@intel.com/ Cc: Md. Haris Iqbal <haris.iqbal@ionos.com> Cc: Jack Wang <jinpu.wang@ionos.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <linux-block@vger.kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Acked-by: Jack Wang <jinpu.wang@ionos.com> Link: https://lore.kernel.org/r/20231212214738.work.169-kees@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-13 08:15:48 -07:00
Jens Axboe	f788893d5e	Merge tag 'md-next-20231208' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.8/block Pull MD updates from Song: "1. Fix/Cleanup RCU usage from conf->disks[i].rdev, by Yu Kuai; 2. Fix raid5 hang issue, by Junxiao Bi; 3. Add Yu Kuai as Reviewer of the md subsystem." * tag 'md-next-20231208' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: synchronize flush io with array reconfiguration MAINTAINERS: SOFTWARE RAID: Add Yu Kuai as Reviewer md/md-multipath: remove rcu protection to access rdev from conf md/raid5: remove rcu protection to access rdev from conf md/raid1: remove rcu protection to access rdev from conf md/raid10: remove rcu protection to access rdev from conf md: remove flag RemoveSynchronized Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d" md: bypass block throttle for superblock update	2023-12-08 16:25:39 -07:00
Matthew Wilcox (Oracle)	1b151e2435	block: Remove special-casing of compound pages The special casing was originally added in pre-git history; reproducing the commit log here: > commit a318a92567d77 > Author: Andrew Morton <akpm@osdl.org> > Date: Sun Sep 21 01:42:22 2003 -0700 > > [PATCH] Speed up direct-io hugetlbpage handling > > This patch short-circuits all the direct-io page dirtying logic for > higher-order pages. Without this, we pointlessly bounce BIOs up to > keventd all the time. In the last twenty years, compound pages have become used for more than just hugetlb. Rewrite these functions to operate on folios instead of pages and remove the special case for hugetlbfs; I don't think it's needed any more (and if it is, we can put it back in as a call to folio_test_hugetlb()). This was found by inspection; as far as I can tell, this bug can lead to pages used as the destination of a direct I/O read not being marked as dirty. If those pages are then reclaimed by the MM without being dirtied for some other reason, they won't be written out. Then when they're faulted back in, they will not contain the data they should. It'll take a pretty unusual setup to produce this problem with several races all going the wrong way. This problem predates the folio work; it could for example have been triggered by mmaping a THP in tmpfs and using that as the target of an O_DIRECT read. Fixes: `800d8c63b2` ("shmem: add huge pages support") Cc: <stable@vger.kernel.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-07 14:02:20 -07:00
Guixin Liu	2fcd3ab398	nvme-fabrics: check ioccsz and iorcsz Make sure that ioccsz and iorcsz returned by target are correct before use it. Per 2.0a base NVMe spec: I/O Queue Command Capsule Supported Size (IOCCSZ): This field defines the maximum I/O command capsule size in 16 byte units. The minimum value that shall be indicated is 4 corresponding to 64 bytes. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2023-12-06 13:57:43 -08:00
Guixin Liu	68999d1dd2	nvme: introduce nvme_check_ctrl_fabric_info helper Introduce nvme_check_ctrl_fabric_info helper to check fabric controller info returned by target. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2023-12-06 13:57:40 -08:00
Kundan Kumar	847c5bcdfb	block: skip QUEUE_FLAG_STATS and rq-qos for passthrough io Write-back throttling (WBT) enables QUEUE_FLAG_STATS on the request queue. But WBT does not make sense for passthrough io, so skip QUEUE_FLAG_STATS processing. Also skip rq_qos_issue/done for passthrough io. Overall, the change gives ~11% hike in peak performance. Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20231123190331.7934-1-kundan.kumar@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-01 18:29:18 -07:00
Keith Busch	8fadb86d4c	io_uring: remove uring_cmd cookie No more users of this field. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20231130215309.2923568-5-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-01 18:29:18 -07:00
Keith Busch	e5da71f1e3	iouring: remove IORING_URING_CMD_POLLED No more users of this flag. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20231130215309.2923568-4-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-01 18:29:18 -07:00
Keith Busch	d6aacee925	nvme: use bio_integrity_map_user Map user metadata buffers directly. Now that the bio tracks the metadata, nvme doesn't need special metadata handling and tracking with callbacks and additional fields in the pdu. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20231130215309.2923568-3-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-01 18:29:18 -07:00
Keith Busch	492c5d4559	block: bio-integrity: directly map user buffers Passthrough commands that utilize metadata currently need to bounce the user space buffer through the kernel. Add support for mapping user space directly so that we can avoid this costly overhead. This is similar to how the normal bio data payload utilizes user addresses with bio_map_user_iov(). If the user address can't directly be used for some reason, like too many segments or address misalignment, fall back to a copy of the user vec while keeping the user address pinned for the IO duration so that it can safely be copied on completion in any process context. Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20231130215309.2923568-2-kbusch@meta.com [axboe: fold in fix from Kanchan Joshi] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-01 18:29:00 -07:00
Yu Kuai	fa2bbff7b0	md: synchronize flush io with array reconfiguration Currently rcu is used to protect iterating rdev from submit_flushes(): submit_flushes remove_and_add_spares synchronize_rcu pers->hot_remove_disk() rcu_read_lock() rdev_for_each_rcu if (rdev->raid_disk >= 0) rdev->raid_disk = -1; atomic_inc(&rdev->nr_pending) rcu_read_unlock() bi = bio_alloc_bioset() bi->bi_end_io = md_end_flush bi->private = rdev submit_bio // issue io for removed rdev Fix this problem by grabbing 'active_io' before iterating rdev, to make sure that remove_and_add_spares() won't run concurrently with submit_flushes(). Fixes: `a2826aa92e` ("md: support barrier requests on all personalities.") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231129020234.1586910-1-yukuai1@huaweicloud.com	2023-12-01 15:49:42 -08:00
Song Liu	15da990f8d	MAINTAINERS: SOFTWARE RAID: Add Yu Kuai as Reviewer Add Yu Kuai as reviewer for md/raid subsystem. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20231128035807.3191738-1-song@kernel.org	2023-11-28 09:51:12 -08:00
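
The capsule-size rule quoted in commit 2fcd3ab398 (nvme-fabrics: check ioccsz and iorcsz) can be illustrated with a small user-space sketch. This is not the kernel's nvme_check_ctrl_fabric_info implementation: the function and macro names below are hypothetical, and the IORCSZ minimum of 1 unit (16 bytes) is taken from the NVMe base specification rather than from the commit message, which only quotes the IOCCSZ minimum.

```c
#include <errno.h>
#include <stdint.h>

/* Both fields are reported in units of 16 bytes (per the quoted spec text). */
#define CAPSULE_UNIT_BYTES 16u
/* IOCCSZ minimum quoted in the commit message: 4 units = 64 bytes. */
#define IOCCSZ_MIN_UNITS 4u
/* Assumption from the NVMe base spec, not stated in the commit: 1 unit = 16 bytes. */
#define IORCSZ_MIN_UNITS 1u

/* Hypothetical helper: convert a capsule size field from 16-byte units to bytes. */
static uint64_t capsule_units_to_bytes(uint32_t units)
{
	return (uint64_t)units * CAPSULE_UNIT_BYTES;
}

/*
 * Hypothetical stand-in for the validation idea: reject values reported by
 * the target that fall below the minimums, before anything uses them.
 * Returns 0 when both values are acceptable, -EINVAL otherwise.
 */
static int check_fabric_capsule_sizes(uint32_t ioccsz, uint32_t iorcsz)
{
	if (ioccsz < IOCCSZ_MIN_UNITS)
		return -EINVAL;
	if (iorcsz < IORCSZ_MIN_UNITS)
		return -EINVAL;
	return 0;
}
```

A caller would run check_fabric_capsule_sizes() on the values read from the target's Identify Controller data and refuse to continue on a non-zero return. The real helper works on kernel controller structures and logs an error, which this sketch leaves out.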


1234150 Commits
