linux-stable

Commit Graph

Author	SHA1	Message	Date
Keith Busch	a5df5e79c4	nvme: allow user toggling hmb usage The NVMe host memory buffer may consume a non-negligable amount of memory. Controllers are required to function without the host memory buffer enabled, but with possibly degraded performance. Export a sysfs property to toggle this feature on a per-device granularity so users may choose to reclaim memory at the expense of storage performance. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-08-16 14:42:24 +02:00
Keith Busch	e5ad96f388	nvme-pci: disable hmb on idle suspend An idle suspend may or may not disable host memory access from devices placed in low power mode. Either way, it should always be safe to disable the host memory buffer prior to entering the low power mode, and this should also always be faster than a full device shutdown. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-08-16 14:42:24 +02:00
Colin Ian King	ad0e9a80ba	nvmet: remove redundant assignments of variable status There are two occurrances where variable status is being assigned a value that is never read and it is being re-assigned a new value almost immediately afterwards on an error exit path. The assignments are redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-08-16 14:42:23 +02:00
Hou Pu	8d84f9de69	nvmet: add set feature tracing support A nvme connect command produces following trace from the target side. Before: kworker/0:1H-56 [000] .... 9012.155139: nvmet_req_init: nvmet1: qid=0, cmdid=16, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, cdw10=07 00 00 00 07 00 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00) kworker/0:1H-56 [000] .... 9012.872272: nvmet_req_init: nvmet1: qid=0, cmdid=13, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, cdw10=0b 00 00 00 00 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00) cmdline:/sys/kernel/debug/tracing# cat trace \| grep feature kworker/0:1H-56 [000] .... 203.493914: nvmet_req_init: nvmet1: qid=0, cmdid=29, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, fid=0x7, sv=0x0, cdw11=0x70007) kworker/0:1H-56 [000] .... 204.197079: nvmet_req_init: nvmet1: qid=0, cmdid=29, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, fid=0xb, sv=0x0, cdw11=0x900) Using ',' to separate different field like others in nvmet_trace_admin_get_features. Signed-off-by: Hou Pu <houpu.main@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-08-16 14:42:23 +02:00
Hou Pu	a7b5e8d864	nvme: add set feature tracing support A nvme connect command produces following trace. Before: /sys/kernel/debug/tracing# cat trace \| grep feature kworker/5:1H-98 [005] .... 3221.294844: nvme_setup_cmd: nvme0: qid=0, cmdid=25, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features cdw10=07 00 00 00 07 00 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00) kworker/4:1H-124 [004] .... 3222.009186: nvme_setup_cmd: nvme0: qid=0, cmdid=17, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features cdw10=0b 00 00 00 00 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00) After: /sys/kernel/debug/tracing# cat trace \| grep feature kworker/0:1H-253 [000] .... 196.060509: nvme_setup_cmd: nvme0: qid=0, cmdid=29, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features fid=0x7, sv=0x0, cdw11=0x70007) kworker/0:1H-253 [000] .... 196.763947: nvme_setup_cmd: nvme0: qid=0, cmdid=29, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features fid=0xb, sv=0x0, cdw11=0x900) Using ',' to separate different field like others in nvmet_trace_admin_get_features. Signed-off-by: Hou Pu <houpu.main@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-08-16 14:42:23 +02:00
Hou Pu	e23439e977	nvme-fabrics: remove superfluous nvmf_host_put in nvmf_parse_options Opts->host is NULL there. It is checked just before. So remove nvmf_host_put. It is introduced by commit `59a2f3f00f` ("nvme: fix potential memory leak in option parsing"). Signed-off-by: Hou Pu <houpu.main@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-08-16 14:42:23 +02:00
Keith Busch	1751e97aa9	nvme-pci: cmb sysfs: one file, one value An attribute should only be exporting one value as recommended in Documentation/filesystems/sysfs.rst. Implement CMB attributes this way. The old attribute will remain for backward compatibility. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-08-16 14:42:23 +02:00
Keith Busch	0521905e85	nvme-pci: use attribute group for cmb sysfs Appending sysfs files to the controller kobject is a bit clunky and becomes a maintenance problem as more attributes are added. The attribute group infrastructure handles this better, so use that. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-08-16 14:42:22 +02:00
Sagi Grimberg	e7006de6c2	nvme: code command_id with a genctr for use-after-free validation We cannot detect a (perhaps buggy) controller that is sending us a completion for a request that was already completed (for example sending a completion twice), this phenomenon was seen in the wild a few times. So to protect against this, we use the upper 4 msbits of the nvme sqe command_id to use as a 4-bit generation counter and verify it matches the existing request generation that is incrementing on every execution. The 16-bit command_id structure now is constructed by: \| xxxx \| xxxxxxxxxxxx \| gen request tag This means that we are giving up some possible queue depth as 12 bits allow for a maximum queue depth of 4095 instead of 65536, however we never create such long queues anyways so no real harm done. Suggested-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Tested-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-08-16 14:42:22 +02:00
Sagi Grimberg	3b01a9d0ca	nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu data We already validate it when receiving the c2hdata pdu header and this is not changing so this is a redundant check. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-08-16 14:42:22 +02:00
Sagi Grimberg	27453b45e6	nvme-pci: limit maximum queue depth to 4095 We are going to use the upper 4-bits of the command_id for a generation counter, so enforce the new queue depth upper limit. As we enforce both min and max queue depth, use param_set_uint_minmax istead of open coding it. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-08-16 14:42:22 +02:00
Christoph Hellwig	9ea9b9c483	remove the lightnvm subsystem Lightnvm supports the OCSSD 1.x and 2.0 specs which were early attempts to produce Open Channel SSDs and never made it into the NVMe spec proper. They have since been superceeded by NVMe enhancements such as ZNS support. Remove the support per the deprecation schedule. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210812132308.38486-1-hch@lst.de Reviewed-by: Matias Bjørling <mb@lightnvm.io> Reviewed-by: Javier González <javier@javigon.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-14 15:54:09 -06:00
Christoph Hellwig	50b4aecfbb	block: remove GENHD_FL_UP Just check inode_unhashed on the whole device bdev inode instead, and provide a helper to check for that information. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210809064028.1198327-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-12 10:29:36 -06:00
Christoph Hellwig	916a470da0	nvme: replace the GENHD_FL_UP check in nvme_mpath_shutdown_disk Use the nvme-internal NVME_NSHEAD_DISK_LIVE flag instead of abusing the block layer state. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210809064028.1198327-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-12 10:29:36 -06:00
Christoph Hellwig	5eba200526	nvme: remove the GENHD_FL_UP check in nvme_ns_remove Early probe failure never reaches nvme_ns_remove, so GENHD_FL_UP must be set at this point. Remove the check. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210809064028.1198327-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-12 10:29:36 -06:00
Christoph Hellwig	471aa704db	block: pass a gendisk to blk_queue_update_readahead .. and rename the function to disk_update_readahead. This is in preparation for moving the BDI from the request_queue to the gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210809141744.1203023-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-09 11:52:28 -06:00
Linus Torvalds	4d4a60cede	block-5.14-2021-07-24 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmD8NkkQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgprfxEACKUWYgb6OSc121lltJTFTZ8LDLxKLn4dNt 5SREbhPdUXKZ64fTo6dKx3ZQ0SZkY+90aTOQQClSNps3/hsIuGsXcLnaPBgEET21 GgCaJfT1B4i/NCjguwboprGPQjETLUTXHri8e/5C8OBgFSL4Q62Cvsl2l5U1V9rP v4O21TazHNVWnV60uJ4FZvQ56zU/mlvU77eWnfa1MJQaupl1atUwhDNTidDNVMYN v69nJqxgYIo5pvNVkMwHxWcp6Ckkwj5GaxptcJ5EPBU69sImBCZ8fpYVrrS3NzTL 82Xf/HFVmJdegRqk0fctWGOfkorBl1ah/+PhL69In/6y6lHOs9y9+lTHoWq/+z8z cPj/Ru6440Gf5D81rzbJRmCKNE4k2ToYnvRTHj2kUSRoAxDih0S3axZMF7m9eEnQ DNOG29bsskJhYPx/0HxffCTAsKf2iPlMCRQFTqR296F5Hbg1eMo9rDSXY5JCtlNU jOKKPmBM5fZWn19QIvjlxICsS3TRRun4jpcpbQZKoY1ItuN9x5ANmlr54K7T4gD3 RYXZ51dPDaEFQGO0bwKlRj6w45F7wZb7HBNufXl2WsuSbM88sGEXUkYlBQ55xjJK Y8+amkzA2XUorQIrfsegzkCw67CuCB7Pg2IQWoiyFLagBTfoX8QJPslRRwUFiBav QZi3lITl5Q== =uzwV -----END PGP SIGNATURE----- Merge tag 'block-5.14-2021-07-24' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - NVMe pull request (Christoph): - tracing fix (Keith Busch) - fix multipath head refcounting (Hannes Reinecke) - Write Zeroes vs PI fix (me) - drop a bogus WARN_ON (Zhihao Cheng) - Increase max blk-cgroup policy size, now that mq-deadline uses it too (Oleksandr) * tag 'block-5.14-2021-07-24' of git://git.kernel.dk/linux-block: nvme: set the PRACT bit when using Write Zeroes with T10 PI nvme: fix nvme_setup_command metadata trace event nvme: fix refcounting imbalance when all paths are down nvme-pci: don't WARN_ON in nvme_reset_work if ctrl.state is not RESETTING block: increase BLKCG_MAX_POLS	2021-07-24 12:57:06 -07:00
Christoph Hellwig	aaeb7bb061	nvme: set the PRACT bit when using Write Zeroes with T10 PI When using Write Zeroes on a namespace that has protection information enabled they behavior without the PRACT bit counter-intuitive and will generally lead to validation failures when reading the written blocks. Fix this by always setting the PRACT bit that generates matching PI data on the fly. Fixes: `6e02318eae` ("nvme: add support for the Write Zeroes command") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>	2021-07-21 17:24:10 +02:00
Keith Busch	234211b8dd	nvme: fix nvme_setup_command metadata trace event The metadata address is set after the trace event, so the trace is not capturing anything useful. Rather than logging the memory address, it's useful to know if the command carries a metadata payload, so change the trace event to log that true/false state instead. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-07-21 09:55:44 +02:00
Hannes Reinecke	5396fdac56	nvme: fix refcounting imbalance when all paths are down When the last path to a ns_head drops the current code removes the ns_head from the subsystem list, but will only delete the disk itself if the last reference to the ns_head drops. This is causing an refcounting imbalance eg when applications have a reference to the disk, as then they'll never get notified that the disk is in fact dead. This patch moves the call 'del_gendisk' into nvme_mpath_check_last_path(), ensuring that the disk can be properly removed and applications get the appropriate notifications. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-07-21 09:55:40 +02:00
Zhihao Cheng	7764656b10	nvme-pci: don't WARN_ON in nvme_reset_work if ctrl.state is not RESETTING Followling process: nvme_probe nvme_reset_ctrl nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING) queue_work(nvme_reset_wq, &ctrl->reset_work) --------------> nvme_remove nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING) worker_thread process_one_work nvme_reset_work WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING) , which will trigger WARN_ON in nvme_reset_work(): [ 127.534298] WARNING: CPU: 0 PID: 139 at drivers/nvme/host/pci.c:2594 [ 127.536161] CPU: 0 PID: 139 Comm: kworker/u8:7 Not tainted 5.13.0 [ 127.552518] Call Trace: [ 127.552840] ? kvm_sched_clock_read+0x25/0x40 [ 127.553936] ? native_send_call_func_single_ipi+0x1c/0x30 [ 127.555117] ? send_call_function_single_ipi+0x9b/0x130 [ 127.556263] ? __smp_call_single_queue+0x48/0x60 [ 127.557278] ? ttwu_queue_wakelist+0xfa/0x1c0 [ 127.558231] ? try_to_wake_up+0x265/0x9d0 [ 127.559120] ? ext4_end_io_rsv_work+0x160/0x290 [ 127.560118] process_one_work+0x28c/0x640 [ 127.561002] worker_thread+0x39a/0x700 [ 127.561833] ? rescuer_thread+0x580/0x580 [ 127.562714] kthread+0x18c/0x1e0 [ 127.563444] ? set_kthread_struct+0x70/0x70 [ 127.564347] ret_from_fork+0x1f/0x30 The preceding problem can be easily reproduced by executing following script (based on blktests suite): test() { pdev="$(_get_pci_dev_from_blkdev)" sysfs="/sys/bus/pci/devices/${pdev}" for ((i = 0; i < 10; i++)); do echo 1 > "$sysfs/remove" echo 1 > /sys/bus/pci/rescan done } Since the device ctrl could be updated as an non-RESETTING state by repeating probe/remove in userspace (which is a normal situation), we can replace stack dumping WARN_ON with a warnning message. Fixes: `82b057caef` ("nvme-pci: fix multiple ctrl removal schedulin") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>	2021-07-21 09:55:40 +02:00
Linus Torvalds	0d18c12b28	block-5.14-2021-07-16 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDxlZAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjrgD/49ZEQ6XU6EsEAdcm6A4vZDLw4i5g6ZdyOE +SNyn/vgAP0q4g/C8R0DiC4/kubZrMqWnXvmAWdbODnTUWi2xvejwsSMtaK2Plyw WxOY/AI1SZUOUDKuibZBCDQMJ95mI+v0aEZyvgYjULzw+2D9X3wVYhsjbvQFsW+A CdmjkQGtLWiGkEWDWRunn3XQvKvRQBrb89yv03VJF9RmkgbMkWKPAwDN9vcwtYXY OLgLApbeLeEtqNU509NqdBrvPcD6E3gnxbWYSl5XinzI8MmglfHzQBuwB38lDmGS +oSntoqDQw1FRc5D/X196ToIGKKGC+0rICVDSctpdGUbcYCNwc8Jj6ZZXD3x9Y/W WA7gprHPyBQvpcKiL82TNffXk29ZHzbIShJVmMubjZfEGlUu9m7ZHcssYJUja/zu XHVlE4GBCFq/cTE40c7sUImwHIofdFHZr8EdyGTgBKNSsqmPtoow/PCL15IbjZeK 2AZVG+HP5PtDw7PbZlHECaUJV/AE8vb4MK2PA3cftbW+fWqfVsxvmzdWA1GmKStG uw/FWbW0L9y09vHJRpDqp+oeLF6qxIPqMRrgJG5zZJvai9hqMFg0HTWjmUMyGLKv QziEiDM44+wkLr0479GElHHLAMscPsbzBA4kEVJPWlIvLSNKZnQq/xbh/Apq6yXR ORo+S1biZg== =Zeni -----END PGP SIGNATURE----- Merge tag 'block-5.14-2021-07-16' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - NVMe fixes via Christoph: - fix various races in nvme-pci when shutting down just after probing (Casey Chen) - fix a net_device leak in nvme-tcp (Prabhakar Kushwaha) - Fix regression in xen-blkfront by cleaning up the removal state machine (Christoph) - Fix tag_set and queue cleanup ordering regression in nbd (Wang) - Fix tag_set and queue cleanup ordering regression in pd (Guoqing) * tag 'block-5.14-2021-07-16' of git://git.kernel.dk/linux-block: xen-blkfront: sanitize the removal state machine nbd: fix order of cleaning up the queue and freeing the tagset pd: fix order of cleaning up the queue and freeing the tagset nvme-pci: do not call nvme_dev_remove_admin from nvme_remove nvme-pci: fix multiple races in nvme_setup_io_queues nvme-tcp: use __dev_get_by_name instead dev_get_by_name for OPT_HOST_IFACE	2021-07-16 12:31:44 -07:00
Casey Chen	251ef6f71b	nvme-pci: do not call nvme_dev_remove_admin from nvme_remove nvme_dev_remove_admin could free dev->admin_q and the admin_tagset while they are being accessed by nvme_dev_disable(), which can be called by nvme_reset_work via nvme_remove_dead_ctrl. Commit `cb4bfda62a` ("nvme-pci: fix hot removal during error handling") intended to avoid requests being stuck on a removed controller by killing the admin queue. But the later fix `c8e9e9b764` ("nvme-pci: unquiesce admin queue on shutdown"), together with nvme_dev_disable(dev, true) right before nvme_dev_remove_admin() could help dispatch requests and fail them early, so we don't need nvme_dev_remove_admin() any more. Fixes: `cb4bfda62a` ("nvme-pci: fix hot removal during error handling") Signed-off-by: Casey Chen <cachen@purestorage.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-07-13 12:03:20 +02:00
Casey Chen	e4b9852a0f	nvme-pci: fix multiple races in nvme_setup_io_queues Below two paths could overlap each other if we power off a drive quickly after powering it on. There are multiple races in nvme_setup_io_queues() because of shutdown_lock missing and improper use of NVMEQ_ENABLED bit. nvme_reset_work() nvme_remove() nvme_setup_io_queues() nvme_dev_disable() ... ... A1 clear NVMEQ_ENABLED bit for admin queue lock retry: B1 nvme_suspend_io_queues() A2 pci_free_irq() admin queue B2 nvme_suspend_queue() admin queue A3 pci_free_irq_vectors() nvme_pci_disable() A4 nvme_setup_irqs(); B3 pci_free_irq_vectors() ... unlock A5 queue_request_irq() for admin queue set NVMEQ_ENABLED bit ... nvme_create_io_queues() A6 result = queue_request_irq(); set NVMEQ_ENABLED bit ... fail to allocate enough IO queues: A7 nvme_suspend_io_queues() goto retry If B3 runs in between A1 and A2, it will crash if irqaction haven't been freed by A2. B2 is supposed to free admin queue IRQ but it simply can't fulfill the job as A1 has cleared NVMEQ_ENABLED bit. Fix: combine A1 A2 so IRQ get freed as soon as the NVMEQ_ENABLED bit gets cleared. After solved #1, A2 could race with B3 if A2 is freeing IRQ while B3 is checking irqaction. A3 also could race with B2 if B2 is freeing IRQ while A3 is checking irqaction. Fix: A2 and A3 take lock for mutual exclusion. A3 could race with B3 since they could run free_msi_irqs() in parallel. Fix: A3 takes lock for mutual exclusion. A4 could fail to allocate all needed IRQ vectors if A3 and A4 are interrupted by B3. Fix: A4 takes lock for mutual exclusion. If A5/A6 happened after B2/B1, B3 will crash since irqaction is not NULL. They are just allocated by A5/A6. Fix: Lock queue_request_irq() and setting of NVMEQ_ENABLED bit. A7 could get chance to pci_free_irq() for certain IO queue while B3 is checking irqaction. Fix: A7 takes lock. nvme_dev->online_queues need to be protected by shutdown_lock. Since it is not atomic, both paths could modify it using its own copy. Co-developed-by: Yuanyuan Zhong <yzhong@purestorage.com> Signed-off-by: Casey Chen <cachen@purestorage.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-07-13 11:40:42 +02:00
Prabhakar Kushwaha	8b43ced64d	nvme-tcp: use __dev_get_by_name instead dev_get_by_name for OPT_HOST_IFACE dev_get_by_name() finds network device by name but it also increases the reference count. If a nvme-tcp queue is present and the network device driver is removed before nvme_tcp, we will face the following continuous log: "kernel:unregister_netdevice: waiting for <eth> to become free. Usage count = 2" And rmmod further halts. Similar case arises during reboot/shutdown with nvme-tcp queue present and both never completes. To fix this, use __dev_get_by_name() which finds network device by name without increasing any reference counter. Fixes: `3ede8f72a9` ("nvme-tcp: allow selecting the network interface for connections") Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com> Signed-off-by: Shai Malin <smalin@marvell.com> Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> [hch: remove the ->ndev member entirely] Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-07-13 11:34:24 +02:00
Linus Torvalds	a022f7d575	block-5.14-2021-07-08 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDnGVYQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpv6UEAC78zkseI8TmKaowNfkz/+MkP9eSFb1pVn3 rxpbPOsZompHoZpeWt4oHL+3Rmm3a9iRo/APA2ELas4zvp+Q+6uG7eha2Dc4hUA9 YgeO4z9YfG8wQNZc3x7bncb6ZwqEE5nnbFe/m25SyrAZVLlZ7FKHxfoZDqjhlGFC eLNiYO6vdvwgCoBMcotyCDttrPfEu6947/5vB1zevv57twdQQaEWGUhvyx1XrlDX 0YD5fmdOjNU2isgxt4xo2Ur2zL6w254/hvj58sV3Z7JfkJpI9DCK+ztKEfzuyEhA WYz06rDAT1+1KuVLfowaZ+pYiPPOIsL0+QXI83r3nLaE7WGGlfS8Hmz//1FbziYs ZSZI826kEN+/lKeWTcKOOMhmkYyXEFFuQZS34eg9KI4xwML8v+ILlHmcp+tjebw9 vzNF6f7N2ki+jnyxxyNxeMHxeAMWsqnIRROOhZg6bbs6UVNpDy4qRzpQaDOaJsVe uSAQ6PTd/etR9KE+ClhLe6X7Rmp/lfZCPe64wqM/3k1qV2KWhE1fwCQO4c5o1MBN rpk3Ef5PZYP3aakCvZnfcjMWlpZNbq/xMc6vPc+yq32akq1t1KbODVBiR5odcH0C Gt5N11im50SO06haBt7EOe4JMQLbK5sxG15t4C6mNQZgPegGfaLlVkKpzIkOzUha OkRofKMcDA== =gHse -----END PGP SIGNATURE----- Merge tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-block Pull more block updates from Jens Axboe: "A combination of changes that ended up depending on both the driver and core branch (and/or the IDE removal), and a few late arriving fixes. In detail: - Fix io ticks wrap-around issue (Chunguang) - nvme-tcp sock locking fix (Maurizio) - s390-dasd fixes (Kees, Christoph) - blk_execute_rq polling support (Keith) - blk-cgroup RCU iteration fix (Yu) - nbd backend ID addition (Prasanna) - Partition deletion fix (Yufen) - Use blk_mq_alloc_disk for mmc, mtip32xx, ubd (Christoph) - Removal of now dead block request types due to IDE removal (Christoph) - Loop probing and control device cleanups (Christoph) - Device uevent fix (Christoph) - Misc cleanups/fixes (Tetsuo, Christoph)" * tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-block: (34 commits) blk-cgroup: prevent rcu_sched detected stalls warnings while iterating blkgs block: fix the problem of io_ticks becoming smaller nvme-tcp: can't set sk_user_data without write_lock loop: remove unused variable in loop_set_status() block: remove the bdgrab in blk_drop_partitions block: grab a device refcount in disk_uevent s390/dasd: Avoid field over-reading memcpy() dasd: unexport dasd_set_target_state block: check disk exist before trying to add partition ubd: remove dead code in ubd_setup_common nvme: use return value from blk_execute_rq() block: return errors from blk_execute_rq() nvme: use blk_execute_rq() for passthrough commands block: support polling through blk_execute_rq block: remove REQ_OP_SCSI_{IN,OUT} block: mark blk_mq_init_queue_data static loop: rewrite loop_exit using idr_for_each_entry loop: split loop_lookup loop: don't allow deleting an unspecified loop device loop: move loop_ctl_mutex locking into loop_add ...	2021-07-09 12:05:33 -07:00
Maurizio Lombardi	0755d3be2d	nvme-tcp: can't set sk_user_data without write_lock The sk_user_data pointer is supposed to be modified only while holding the write_lock "sk_callback_lock", otherwise we could race with other threads and crash the kernel. we can't take the write_lock in nvmet_tcp_state_change() because it would cause a deadlock, but the release_work queue will set the pointer to NULL later so we can simply remove the assignment. Fixes: `b5332a9f3f` ("nvmet-tcp: fix incorrect locking in state_change sk callback") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-07-05 10:14:44 +02:00
Linus Torvalds	bd31b9efbf	SCSI misc on 20210702 This series consists of the usual driver updates (ufs, ibmvfc, megaraid_sas, lpfc, elx, mpi3mr, qedi, iscsi, storvsc, mpt3sas) with elx and mpi3mr being new drivers. The major core change is a rework to drop the status byte handling macros and the old bit shifted definitions and the rest of the updates are minor fixes. Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com> -----BEGIN PGP SIGNATURE----- iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYN7I6iYcamFtZXMuYm90 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishXpRAQCkngYZ 35yQrqOxgOk2pfrysE95tHrV1MfJm2U49NFTwAEAuZutEvBUTfBF+sbcJ06r6q7i H0hkJN/Io7enFs5v3WA= =zwIa -----END PGP SIGNATURE----- Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI updates from James Bottomley: "This series consists of the usual driver updates (ufs, ibmvfc, megaraid_sas, lpfc, elx, mpi3mr, qedi, iscsi, storvsc, mpt3sas) with elx and mpi3mr being new drivers. The major core change is a rework to drop the status byte handling macros and the old bit shifted definitions and the rest of the updates are minor fixes" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (287 commits) scsi: aha1740: Avoid over-read of sense buffer scsi: arcmsr: Avoid over-read of sense buffer scsi: ips: Avoid over-read of sense buffer scsi: ufs: ufs-mediatek: Add missing of_node_put() in ufs_mtk_probe() scsi: elx: libefc: Fix IRQ restore in efc_domain_dispatch_frame() scsi: elx: libefc: Fix less than zero comparison of a unsigned int scsi: elx: efct: Fix pointer error checking in debugfs init scsi: elx: efct: Fix is_originator return code type scsi: elx: efct: Fix link error for _bad_cmpxchg scsi: elx: efct: Eliminate unnecessary boolean check in efct_hw_command_cancel() scsi: elx: efct: Do not use id uninitialized in efct_lio_setup_session() scsi: elx: efct: Fix error handling in efct_hw_init() scsi: elx: efct: Remove redundant initialization of variable lun scsi: elx: efct: Fix spelling mistake "Unexected" -> "Unexpected" scsi: lpfc: Fix build error in lpfc_scsi.c scsi: target: iscsi: Remove redundant continue statement scsi: qla4xxx: Remove redundant continue statement scsi: ppa: Switch to use module_parport_driver() scsi: imm: Switch to use module_parport_driver() scsi: mpt3sas: Fix error return value in _scsih_expander_add() ...	2021-07-02 15:14:36 -07:00
Keith Busch	ae5e6886b4	nvme: use return value from blk_execute_rq() We don't have an nvme status to report if the driver's .queue_rq() returns an error without dispatching the requested nvme command. Check the return value from blk_execute_rq() for all passthrough commands so the caller may know their command was not successful. If the command is from the target passthrough interface and fails to dispatch, synthesize the response back to the host as a internal target error. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Link: https://lore.kernel.org/r/20210610214437.641245-5-kbusch@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-30 15:35:45 -06:00
Keith Busch	be42a33b92	nvme: use blk_execute_rq() for passthrough commands The generic blk_execute_rq() knows how to handle polled completions. Use that instead of implementing an nvme specific handler. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Link: https://lore.kernel.org/r/20210610214437.641245-3-kbusch@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-30 15:35:38 -06:00
Linus Torvalds	440462198d	for-5.14/drivers-2021-06-29 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDbd5UQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvsNEADCJKP81boFzRcdJo7EqaNDAzZyKOIg9Oq7 4GZE0Wm0SgA6+04bKrNVd9KLcKvQ+NC1pK7UJemSSH2y9ir+zHfyYgAV0/+wFmYm NgHlDjBvf80XSI5wezcb6MxZT+R7IaIpDsW1ZvV9hFtPSncn5o2OIWiSdJtHT/Rv enlgZPc7OwNWoVMX8eR58IoO0k3S6GLpctUZHt/AUukaKgoOks0X523qhEPf3Upr RkbIZuqLWVgpdT6457iSE/OijUczD4thTI8bdprxzhgimOm2vV52sO6F5HtHc7GX qW+PWYUaiUk7UpObuOuyv0yyUG45ii73iY1W0w66RiyCjVTgtpdwwMQ38VlBcoOg zcE1jneAEJt6TiS6zfRaER/10JoCIG4gp1+apPuaXud/o3BqWI0cagVHAgaLziBI F7bDJkbJZIR6GrWMgemBI+mc5/LACBePxzPGLScKFptejtQ/ysfZQ6aCLROJWB2U 4EnysAaUBf6tywj30JqfQvqFNGkHIgY95FKiXJW6GzqqwgBouNf48vS15BgkwI+2 EijcqUhlOVNfc3RIc0ZL5c9KcPIN9t5sqBrWZe3wgCErhxAx6w6Za9nDdP+US9bl /apCpvDFlu59g8n1wtkNE/uC+XqdKDwsplYhnfpX0FGni5wIknhQq3bSe4dPFgSn pG5VMrw3pA== =D6dS -----END PGP SIGNATURE----- Merge tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: "Pretty calm round, mostly just NVMe and a bit of MD: - NVMe updates (via Christoph) - improve the APST configuration algorithm (Alexey Bogoslavsky) - look for StorageD3Enable on companion ACPI device (Mario Limonciello) - allow selecting the network interface for TCP connections (Martin Belanger) - misc cleanups (Amit Engel, Chaitanya Kulkarni, Colin Ian King, Christoph) - move the ACPI StorageD3 code to drivers/acpi/ and add quirks for certain AMD CPUs (Mario Limonciello) - zoned device support for nvmet (Chaitanya Kulkarni) - fix the rules for changing the serial number in nvmet (Noam Gottlieb) - various small fixes and cleanups (Dan Carpenter, JK Kim, Chaitanya Kulkarni, Hannes Reinecke, Wesley Sheng, Geert Uytterhoeven, Daniel Wagner) - MD updates (Via Song) - iostats rewrite (Guoqing Jiang) - raid5 lock contention optimization (Gal Ofri) - Fall through warning fix (Gustavo) - Misc fixes (Gustavo, Jiapeng)" * tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block: (78 commits) nvmet: use NVMET_MAX_NAMESPACES to set nn value loop: Fix missing discard support when using LOOP_CONFIGURE nvme.h: add missing nvme_lba_range_type endianness annotations nvme: remove zeroout memset call for struct nvme-pci: remove zeroout memset call for struct nvmet: remove zeroout memset call for struct nvmet: add ZBD over ZNS backend support nvmet: add Command Set Identifier support nvmet: add nvmet_req_bio put helper for backends nvmet: add req cns error complete helper block: export blk_next_bio() nvmet: remove local variable nvmet: use nvme status value directly nvmet: use u32 type for the local variable nsid nvmet: use u32 for nvmet_subsys max_nsid nvmet: use req->cmd directly in file-ns fast path nvmet: use req->cmd directly in bdev-ns fast path nvmet: make ver stable once connection established nvmet: allow mn change if subsys not discovered nvmet: make sn stable once connection was established ...	2021-06-30 12:21:16 -07:00
Linus Torvalds	df668a5fe4	for-5.14/block-2021-06-29 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDbXAwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpr0HEADDJaSgjpnWQwH1RVLNagJa9KnktxZYsEs+ as3QmDdpKRG3rEC9bdE7FLe/xq3WBaO5j1hTQ9P6IguqLyS1Df72DtTlKyaCrZoe zv9eIlY4lZUfksE2nzWmlN9uG0FBVXeEQpHCLSNbUZeK1zvV6+NNhQqw2kc0sEqu hReUFeMUbsMcu/w5T3XMVJNsTMCql9wta2H0q5hONQyJQSrIwa1D+sUdE5I8fO4j bnoYX9yxHX26EztX1UJiGRgoq5Trz7LY7hAfljKSkewpFwiHE2vBdq2L0C2RKsIV tTs2DjMCMQyPNeA7WAG8HlR4aPG+7+/fuBP1KJHkykjWXglWN7OqISuBv6rrBgQs gNRnZ4qmb1CzD6aLEBk59nHt6po6eMxXIW856YktKy8rKcrgK29qP44Z+oomkPKo ZjQ0wqN5CvpObM/dIKxl9bAJ4zQDHBt49d5nTTQLfWl/mgevu6ZNWD/hONyCQmFy zKKqQ/wkxWHutOsjC5/MKNb3ZRNH9tt9X+HfULO2DU6IqqifYw/ex4z4MVsBopJC 7pPfd81kgC73TgXe1AaCwHqNWsrqYCuTK0ew1CtGudlS3lucMwtap4GBiCgg5gbu M8pEgwO4OcCLHyRUc8zdfqI7HumbprbFmojPkwGSEe0ofVD74lMhzbUj5jvTYY2B t8D2XcgyOA== =lhon -----END PGP SIGNATURE----- Merge tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: - disk events cleanup (Christoph) - gendisk and request queue allocation simplifications (Christoph) - bdev_disk_changed cleanups (Christoph) - IO priority improvements (Bart) - Chained bio completion trace fix (Edward) - blk-wbt fixes (Jan) - blk-wbt enable/disable fix (Zhang) - Scheduler dispatch improvements (Jan, Ming) - Shared tagset scheduler improvements (John) - BFQ updates (Paolo, Luca, Pietro) - BFQ lock inversion fix (Jan) - Documentation improvements (Kir) - CLONE_IO block cgroup fix (Tejun) - Remove of ancient and deprecated block dump feature (zhangyi) - Discard merge fix (Ming) - Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas, Yang) * tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits) block: fix discard request merge block/mq-deadline: Remove a WARN_ON_ONCE() call blk-mq: update hctx->dispatch_busy in case of real scheduler blk: Fix lock inversion between ioc lock and bfqd lock bfq: Remove merged request already in bfq_requests_merged() block: pass a gendisk to bdev_disk_changed block: move bdev_disk_changed block: add the events* attributes to disk_attrs block: move the disk events code to a separate file block: fix trace completion for chained bio block/partitions/msdos: Fix typo inidicator -> indicator block, bfq: reset waker pointer with shared queues block, bfq: check waker only for queues with no in-flight I/O block, bfq: avoid delayed merge of async queues block, bfq: boost throughput by extending queue-merging times block, bfq: consider also creation time in delayed stable merge block, bfq: fix delayed stable merge check block, bfq: let also stably merged queues enjoy weight raising blk-wbt: make sure throttle is enabled properly blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled() ...	2021-06-30 12:12:56 -07:00
Chaitanya Kulkarni	3c3ee16532	nvmet: use NVMET_MAX_NAMESPACES to set nn value For Spec regarding MNAN value:- If the controller supports Asymmetric Namespace Access Reporting, then this field shall be set to a non-zero value that is less than or equal to the NN value. Instead of using subsys->max_nsid that gets calculated dynamically, use NVMET_MAX_NAMESPACES value to report NN. This way we will maintain the MNAN value spec compliant with NN. Without this patch, code results in the following error :- [337976.409142] nvme nvme1: Invalid MNAN value 1024 Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-21 08:34:10 +02:00
Chaitanya Kulkarni	cc72c44267	nvme: remove zeroout memset call for struct Declare and initialize structure variables to zero values so that we can remove zeroout memset calls in the host/core.c. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:21 +02:00
Chaitanya Kulkarni	f66e2804d6	nvme-pci: remove zeroout memset call for struct Declare and initialize structure variables to zero values so that we can remove zeroout memset calls in the host/pci.c. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:21 +02:00
Chaitanya Kulkarni	8abd7e2a75	nvmet: remove zeroout memset call for struct Declare and initialize structure variables to zero values so that we can remove zeroout memset calls in the target/rdma.c. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:21 +02:00
Chaitanya Kulkarni	aaf2e048af	nvmet: add ZBD over ZNS backend support NVMe TP 4053 – Zoned Namespaces (ZNS) allows host software to communicate with a non-volatile memory subsystem using zones for NVMe protocol-based controllers. NVMeOF already support the ZNS NVMe Protocol compliant devices on the target in the passthru mode. There are generic zoned block devices like Shingled Magnetic Recording (SMR) HDDs that are not based on the NVMe protocol. This patch adds ZNS backend support for non-ZNS zoned block devices as NVMeOF targets. This support includes implementing the new command set NVME_CSI_ZNS, adding different command handlers for ZNS command set such as NVMe Identify Controller, NVMe Identify Namespace, NVMe Zone Append, NVMe Zone Management Send and NVMe Zone Management Receive. With the new command set identifier, we also update the target command effects logs to reflect the ZNS compliant commands. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:21 +02:00
Chaitanya Kulkarni	ab5d0b38c0	nvmet: add Command Set Identifier support NVMe TP 4056 allows controllers to support different command sets. NVMeoF target currently only supports namespaces that contain traditional logical blocks that may be randomly read and written. In some applications there is a value in exposing namespaces that contain logical blocks that have special access rules (e.g. sequentially write required namespace such as Zoned Namespace (ZNS)). In order to support the Zoned Block Devices (ZBD) backend, controllers need to have support for ZNS Command Set Identifier (CSI). In this preparation patch, we adjust the code such that it can now support the default command set identifier. We update the namespace data structure to store the CSI value which defaults to NVME_CSI_NVM that represents traditional logical blocks namespace type. The CSI support is required to implement the ZBD backend for NVMeOF with host side NVMe ZNS interface, since ZNS commands belong to the different command set than the default one. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:20 +02:00
Chaitanya Kulkarni	9a01b58c22	nvmet: add nvmet_req_bio put helper for backends In current code there exists two backends which are using inline bio optimization, that adds a duplicate code for freeing the bio. For Zoned Block Device backend we also use the same optimzation and it will lead to having duplicate code in the three backends: generic bdev, passsthru, and generic zns. Add a helper function to avoid duplicate code and update the respective backends. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:20 +02:00
Chaitanya Kulkarni	6e597263f9	nvmet: add req cns error complete helper We report error and complete the request when identify cns value is not handled in nvmet_execute_identify(). This error reporting is also needed for Zone Block Device backend for NVMeOF target. Add a helper nvmet_req_cns_error_compplete() to report an error and complete the request when idenitfy command cns not handled value. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:20 +02:00
Chaitanya Kulkarni	7860569ad4	nvmet: remove local variable In function errno_to_nvme_status() we store the value of the NVMe status into the local variable and don't do anything useful with that but just return. Remove the local variable and return the value directly from switch. This also removed extra break statements. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:20 +02:00
Chaitanya Kulkarni	8bb6cb9b97	nvmet: use nvme status value directly There is no point in keeping the status variable that is used only once in the function nvmet_async_events_failall(). Remove the variable and use the value directly. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:20 +02:00
Chaitanya Kulkarni	245067e37d	nvmet: use u32 type for the local variable nsid In function nvmet_max_nsid() we calculate the max nsid by iterating over the XArray and store it in the variable nsid that has type of unsigned long. Since the value of this function is stored into the subsys->max_nsid which is of type u32, change the local variable nsid type and the return type of the same function to u32. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:20 +02:00
Chaitanya Kulkarni	86693c43bb	nvmet: use u32 for nvmet_subsys max_nsid Use u32 type for the nsid_max member of the nvmet_subsys structure. This avoids the type confusion when updating the subsys->nax_nsid from ns->nsid. This also matches the nvmet_ns->nsid member. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:20 +02:00
Chaitanya Kulkarni	f3dce2add3	nvmet: use req->cmd directly in file-ns fast path The function nvmet_file_parse_io_cmd() is called from the fast path. The local variable to that function cmd is only used once. Remove the local variable and use req->cmd directly. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:20 +02:00
Chaitanya Kulkarni	46eca4702d	nvmet: use req->cmd directly in bdev-ns fast path The function nvmet_bdev_parse_io_cmd() is called from the fast path. The local variable to that function cmd is only used once. Remove the local variable and use req->cmd directly. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:20 +02:00
Noam Gottlieb	87fd4cc1c0	nvmet: make ver stable once connection established Once some host has connected to the nvmf target, make sure that the version number is stable and cannot be changed. Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Noam Gottlieb <ngottlieb@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:19 +02:00
Noam Gottlieb	0d148efdf0	nvmet: allow mn change if subsys not discovered Currently, once the subsystem's model_number is set for the first time there is no way to change it. However, as long as no connection was established to nvmf target, there is no reason for such restriction and we should allow to change the subsystem's model_number as many times as needed. In addition, in order to simplfy the changes and make the model number flow more similar to the rest of the attributes in the Identify Controller data structure, we set a default value for the model number at the initiation of the subsystem. Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Noam Gottlieb <ngottlieb@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:19 +02:00
Noam Gottlieb	7ae023c5aa	nvmet: make sn stable once connection was established Once some host has connected to the target, make sure that the serial number is stable and cannot be changed. Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Noam Gottlieb <ngottlieb@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:19 +02:00
Noam Gottlieb	e13b061589	nvmet: change sn size and check validity According to the NVM specification, the serial_number should be 20 bytes (bytes 23:04 of the Identify Controller data structure), and should contain only ASCII characters. In accordance, the serial_number size is changed to 20 bytes and before any attempt to store a new value in serial_number we check that the input is valid - i.e. contains only ASCII characters, is not empty and does not exceed 20 bytes. Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Noam Gottlieb <ngottlieb@nvidia.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:19 +02:00
Hannes Reinecke	2a4a910aa4	nvmet-fc: do not check for invalid target port in nvmet_fc_handle_fcp_rqst() When parsing a request in nvmet_fc_handle_fcp_rqst() we should not check for invalid target ports; if we do the command is aborted from the fcp layer, causing the host to assume a transport error. Rather we should still forward this request to the nvmet layer, which will then correctly fail the command with an appropriate error status. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:19 +02:00
Chaitanya Kulkarni	eff4423ec0	nvme-fabrics: remove memset in connect io q Declare and initialize structure variable to the zero values so that we can get rid of the zeroout memset call. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:19 +02:00
Chaitanya Kulkarni	bfa9d1222d	nvme-fabrics: remove memset in connect admin q Declare and initialize structure variable to the zero values so that we can get rid of the zeroout memset call. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:19 +02:00
Chaitanya Kulkarni	c22c272013	nvme-fabrics: remove memset in nvmf_reg_write32() Declare and initialize structure variable to the zero values so that we can get rid of the zeroout memset call. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:19 +02:00
Chaitanya Kulkarni	2796a8e409	nvme-fabrics: remove memset in nvmf_reg_read64() Declare and initialize structure variable to the zero values so that we can get rid of the zeroout memset call. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:19 +02:00
Chaitanya Kulkarni	3b54064fbc	nvme-tcp: use ctrl sgl check helper Use the helper to check NVMe controller's SGL support. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:19 +02:00
Chaitanya Kulkarni	253a0b76a1	nvme-pci: use ctrl sgl check helper Use the helper to check NVMe controller's SGL support. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:19 +02:00
Chaitanya Kulkarni	b61678bcd4	nvme-fc: use ctrl sgl check helper Use the helper to check NVMe controller's SGL support. Reviewed-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:18 +02:00
Chaitanya Kulkarni	73eefc270a	nvme: add a helper to check ctrl sgl support For various transports such as fc/tcp/pci it is common to check if NVMe SGLs are supported or not by the controller. In this preparation patch we add a helper to avoid the open coding of such checks in the various transport. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:18 +02:00
Chaitanya Kulkarni	cb1b10e7ac	nvme-pci: remove trailing lines for helpers Remove the extra white line at the end of the functions. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:18 +02:00
JK Kim	a0aac973a2	nvme-pci: fix var. type for increasing cq_head nvmeq->cq_head is compared with nvmeq->q_depth and changed the value and cq_phase for handling the next cq db. but, nvmeq->q_depth's type is u32 and max. value is 0x10000 when CQP.MSQE is 0xffff and io_queue_depth is 0x10000. current temp. variable for comparing with nvmeq->q_depth is overflowed when previous nvmeq->cq_head is 0xffff. in this case, nvmeq->cq_phase is not updated. so, fix data type for temp. variable to u32. Signed-off-by: JK Kim <jongkang.kim2@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-17 15:51:18 +02:00
Dan Carpenter	522af60cb2	nvme-tcp: fix error codes in nvme_tcp_setup_ctrl() These error paths currently return success but they should return -EOPNOTSUPP. Fixes: `73ffcefcfc` ("nvme-tcp: check sgl supported by target") Fixes: `3f2304f8c6` ("nvme-tcp: add NVMe over TCP host driver") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-16 05:36:16 +02:00
Chaitanya Kulkarni	e7d4b5493a	nvme: factor out a nvme_validate_passthru_nsid helper Add a helper nvme_validate_passthru_nsid() to validate the nsid that removes the nsid validation and error message print code from nvme_user_cmd() and nvme_user_cmd64(). Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-16 05:36:16 +02:00
Geert Uytterhoeven	d399742cd0	nvme: fix grammar in the CONFIG_NVME_MULTIPATH kconfig help text Fix a singular/plural mismatch in the CONFIG_NVME_MULTIPATH help text. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-16 05:36:15 +02:00
Daniel Wagner	2411424143	nvme: remove superfluous bio_set_dev in nvme_requeue_work Commit `ce86dad222` ("nvme-multipath: reset bdev to ns head when failover") moved the reset code where the bio is added to the requeue_list for the failover path. But it left the original bio_set_dev in nvme_requeue_work. There is a second path to nvme_requee_work. It is via nvme_ns_head_submit_bio. Though we don't have to set bio->bi_bdev for this path either, as it points to the correct bdev already. Let's remove the bio_set_dev. It's updating the bio->bi_bdev with the same pointer and thus it's unnecessary. Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-16 05:36:15 +02:00
Daniel Wagner	120bb3624d	nvme: verify MNAN value if ANA is enabled The controller is required to have a non-zero MNAN value if it supports ANA: If the controller supports Asymmetric Namespace Access Reporting, then this field shall be set to a non-zero value that is less than or equal to the NN value. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-16 05:36:15 +02:00
Mario Limonciello	2744d7a073	ACPI: Check StorageD3Enable _DSD property in ACPI code Although first implemented for NVME, this check may be usable by other drivers as well. Microsoft's specification explicitly mentions that is may be usable by SATA and AHCI devices. Google also indicates that they have used this with SDHCI in a downstream kernel tree that a user can plug a storage device into. Link: https://docs.microsoft.com/en-us/windows-hardware/design/component-guidelines/power-management-for-storage-hardware-devices-intro Suggested-by: Keith Busch <kbusch@kernel.org> CC: Shyam-sundar S-k <Shyam-sundar.S-k@amd.com> CC: Alexander Deucher <Alexander.Deucher@amd.com> CC: Rafael J. Wysocki <rjw@rjwysocki.net> CC: Prike Liang <prike.liang@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-16 05:14:59 +02:00
Muneendra Kumar	3dbbca75ed	scsi: nvme: Added a new sysfs attribute appid_store Add a new sysfs attribute, appid_store, which can be used to set the application identifier in the blkcg associated with a cgroup id. Below is the interface provided to set the app_id: echo "<cgroupid>:<appid>" >> /sys/class/fc/fc_udev_device/appid_store echo "457E:100000109b521d27" >> /sys/class/fc/fc_udev_device/appid_store Link: https://lore.kernel.org/r/20210608043556.274139-4-muneendra.kumar@broadcom.com Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Muneendra Kumar <muneendra.kumar@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2021-06-10 10:01:32 -04:00
Chaitanya Kulkarni	346ac785ba	nvmet: remove a superfluous variable Remove the superfluous variable "bdev" that is only used once in the nvmet_bdev_alloc_bip() and use req->ns->bdev that is used everywhere in the code to access the nvmet request's bdev. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-03 10:29:26 +03:00
Amit Engel	f6e8bd59c4	nvmet: move ka_work initialization to nvmet_alloc_ctrl Initialize keep-alive work only once, as part of alloc_ctrl and not each time that nvmet_start_keep_alive_timer is being called Signed-off-by: Amit Engel <amit.engel@dell.com> Reviewed-by: Hou Pu <houpu.main@gmail.com>	2021-06-03 10:29:26 +03:00
Christoph Hellwig	f1cf35e17e	nvme: remove nvme_{get,put}_ns_from_disk Now that only one caller is left remove the helpers by restructuring nvme_pr_command so that it has two helpers for sending a command of to a given nsid using either the ns_head for multipath, or the namespace stored in the gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2021-06-03 10:29:26 +03:00
Christoph Hellwig	8b4fb0f968	nvme: split nvme_report_zones Split multipath support out of nvme_report_zones into a separate helper and simplify the non-multipath version as a result. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2021-06-03 10:29:26 +03:00
Christoph Hellwig	d8ca66e821	nvme: move the CSI sanity check into nvme_ns_report_zones Move the CSI check into nvme_ns_report_zones to clean up the code a little bit and prepare for further refactoring. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2021-06-03 10:29:25 +03:00
Christoph Hellwig	85b790a7ae	nvme: add a sparse annotation to nvme_ns_head_ctrl_ioctl Add the __releases annotation to tell sparse that nvme_ns_head_ctrl_ioctl is expected to unlock head->srcu. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2021-06-03 10:29:25 +03:00
Christoph Hellwig	3e7d1a5516	nvme: open code nvme_put_ns_from_disk in nvme_ns_head_ctrl_ioctl nvme_ns_head_ctrl_ioctl is always used on multipath nodes, so just call srcu_read_unlock directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2021-06-03 10:29:25 +03:00
Christoph Hellwig	86b4284d98	nvme: open code nvme_{get,put}_ns_from_disk in nvme_ns_head_ioctl nvme_ns_head_ioctl is always used on multipath nodes, no need to deal with the de-multiplexers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2021-06-03 10:29:25 +03:00
Christoph Hellwig	f423c85cd3	nvme: open code nvme_put_ns_from_disk in nvme_ns_head_chr_ioctl nvme_ns_head_chr_ioctl is always used on multipath nodes, so just call srcu_read_unlock and consolidate the two unlock paths. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2021-06-03 10:29:25 +03:00
Chaitanya Kulkarni	97ba6931ba	nvme-fabrics: remove extra braces No need to use the braces around ~ operator. No functionality change in this patch. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-03 10:29:25 +03:00
Chaitanya Kulkarni	6f860c9225	nvme-fabrics: remove an extra comment Remove the comment at the end of the switch that is not needed as function is small enough. No functionality change in this patch. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-03 10:29:25 +03:00
Chaitanya Kulkarni	63d20f54a3	nvme-fabrics: remove extra new lines in the switch Remove the extra lines in the switch block that is not common practice in the kernel code. No functionality change in this patch. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-03 10:29:25 +03:00
Chaitanya Kulkarni	25e1de8c40	nvme-fabrics: fix the kerneldco comment for nvmf_log_connect_error() Fix the comment style that matches existing code. No functionality change in this patch. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-03 10:29:24 +03:00
Martin Belanger	3ede8f72a9	nvme-tcp: allow selecting the network interface for connections In our application, we need a way to force TCP connections to go out a specific IP interface instead of letting Linux select the interface based on the routing tables. Add the 'host-iface' option to allow specifying the interface to use. When the option host-iface is specified, the driver uses the specified interface to set the option SO_BINDTODEVICE on the TCP socket before connecting. This new option is needed in addtion to the existing host-traddr for the following reasons: Specifying an IP interface by its associated IP address is less intuitive than specifying the actual interface name and, in some cases, simply doesn't work. That's because the association between interfaces and IP addresses is not predictable. IP addresses can be changed or can change by themselves over time (e.g. DHCP). Interface names are predictable [1] and will persist over time. Consider the following configuration. 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state ... link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 100.0.0.100/24 scope global lo valid_lft forever preferred_lft forever 2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ... link/ether 08:00:27:21:65:ec brd ff:ff:ff:ff:ff:ff inet 100.0.0.100/24 scope global enp0s3 valid_lft forever preferred_lft forever 3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ... link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff inet 100.0.0.100/24 scope global enp0s8 valid_lft forever preferred_lft forever The above is a VM that I configured with the same IP address (100.0.0.100) on all interfaces. Doing a reverse lookup to identify the unique interface associated with 100.0.0.100 does not work here. And this is why the option host_iface is required. I understand that the above config does not represent a standard host system, but I'm using this to prove a point: "We can never know how users will configure their systems". By te way, The above configuration is perfectly fine by Linux. The current TCP implementation for host_traddr performs a bind()-before-connect(). This is a common construct to set the source IP address on a TCP socket before connecting. This has no effect on how Linux selects the interface for the connection. That's because Linux uses the Weak End System model as described in RFC1122 [2]. On the other hand, setting the Source IP Address has benefits and should be supported by linux-nvme. In fact, setting the Source IP Address is a mandatory FedGov requirement (e.g. connection to a RADIUS/TACACS+ server). Consider the following configuration. $ ip addr list dev enp0s8 3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ... link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff inet 192.168.56.101/24 brd 192.168.56.255 scope global enp0s8 valid_lft 426sec preferred_lft 426sec inet 192.168.56.102/24 scope global secondary enp0s8 valid_lft forever preferred_lft forever inet 192.168.56.103/24 scope global secondary enp0s8 valid_lft forever preferred_lft forever inet 192.168.56.104/24 scope global secondary enp0s8 valid_lft forever preferred_lft forever Here we can see that several addresses are associated with interface enp0s8. By default, Linux always selects the default IP address, 192.168.56.101, as the source address when connecting over interface enp0s8. Some users, however, want the ability to specify a different source address (e.g., 192.168.56.102, 192.168.56.103, ...). The option host_traddr can be used as-is to perform this function. In conclusion, I believe that we need 2 options for TCP connections. One that can be used to specify an interface (host-iface). And one that can be used to set the source address (host-traddr). Users should be allowed to use one or the other, or both, or none. Of course, the documentation for host_traddr will need some clarification. It should state that when used for TCP connection, this option only sets the source address. And the documentation for host_iface should say that this option is only available for TCP connections. References: [1] https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/ [2] https://tools.ietf.org/html/rfc1122 Tested both IPv4 and IPv6 connections. Signed-off-by: Martin Belanger <martin.belanger@dell.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-03 10:29:24 +03:00
Mario Limonciello	e21e0243e7	nvme-pci: look for StorageD3Enable on companion ACPI device instead The documentation around the StorageD3Enable property hints that it should be made on the PCI device. This is where newer AMD systems set the property and it's required for S0i3 support. So rather than look for nodes of the root port only present on Intel systems, switch to the companion ACPI device for all systems. David Box from Intel indicated this should work on Intel as well. Link: https://lore.kernel.org/linux-nvme/YK6gmAWqaRmvpJXb@google.com/T/#m900552229fa455867ee29c33b854845fce80ba70 Link: https://docs.microsoft.com/en-us/windows-hardware/design/component-guidelines/power-management-for-storage-hardware-devices-intro Fixes: `df4f9bc4fb` ("nvme-pci: add support for ACPI StorageD3Enable property") Suggested-by: Liang Prike <Prike.Liang@amd.com> Acked-by: Raul E Rangel <rrangel@chromium.org> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: David E. Box <david.e.box@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-03 10:29:24 +03:00
Alexey Bogoslavsky	ebd8a93aa4	nvme: extend and modify the APST configuration algorithm The algorithm that was used until now for building the APST configuration table has been found to produce entries with excessively long ITPT (idle time prior to transition) for devices declaring relatively long entry and exit latencies for non-operational power states. This leads to unnecessary waste of power and, as a result, failure to pass mandatory power consumption tests on Chromebook platforms. The new algorithm is based on two predefined ITPT values and two predefined latency tolerances. Based on these values, as well as on exit and entry latencies reported by the device, the algorithm looks for up to 2 suitable non-operational power states to use as primary and secondary APST transition targets. The predefined values are supplied to the nvme driver as module parameters: - apst_primary_timeout_ms (default: 100) - apst_secondary_timeout_ms (default: 2000) - apst_primary_latency_tol_us (default: 15000) - apst_secondary_latency_tol_us (default: 100000) The algorithm echoes the approach used by Intel's and Microsoft's drivers on Windows. The specific default parameter values are also based on those drivers. Yet, this patch doesn't introduce the ability to dynamically regenerate the APST table in the event of switching the power source from AC to battery and back. Adding this functionality may be considered in the future. In the meantime, the timeouts and tolerances reflect a compromise between values used by Microsoft for AC and battery scenarios. In most NVMe devices the new algorithm causes them to implement a more aggressive power saving policy. While beneficial in most cases, this sometimes comes at the price of a higher IO processing latency in certain scenarios as well as at the price of a potential impact on the drive's endurance (due to more frequent context saving when entering deep non- operational states). So in order to provide a fallback for systems where these regressions cannot be tolerated, the patch allows to revert to the legacy behavior by setting either apst_primary_timeout_ms or apst_primary_latency_tol_us parameter to 0. Eventually (and possibly after fine tuning the default values of the module parameters) the legacy behavior can be removed. TESTING. The new algorithm has been extensively tested. Initially, simulations were used to compare APST tables generated by old and new algorithms for a wide range of devices. After that, power consumption, performance and latencies were measured under different workloads on devices from multiple vendors (WD, Intel, Samsung, Hynix, Kioxia). Below is the description of the tests and the findings. General observations. The effect the patch has on the APST table varies depending on the entry and exit latencies advertised by the devices. For some devices, the effect is negligible (e.g. Kioxia KBG40ZNS), for some significant, making the transitions to PS3 and PS4 much quicker (e.g. WD SN530, Intel 760P), or making the sleep deeper, PS4 rather than PS3 after a similar amount of time (e.g. SK Hynix BC511). For some devices (e.g. Samsung PM991) the effect is mixed: the initial transition happens after a longer idle time, but takes the device to a lower power state. Workflows. In order to evaluate the patch's effect on the power consumption and latency, 7 workflows were used for each device. The workflows were designed to test the scenarios where significant differences between the old and new behaviors are most likely. Each workflow was tested twice: with the new and with the old APST table generation implementation. Power consumption, performance and latency were measured in the process. The following workflows were used: 1) Consecutive write at the maximum rate with IO depth of 2, with no pauses 2) Repeated pattern of 1000 consecutive writes of 4K packets followed by 50ms idle time 3) Repeated pattern of 1000 consecutive writes of 4K packets followed by 150ms idle time 4) Repeated pattern of 1000 consecutive writes of 4K packets followed by 500ms idle time 5) Repeated pattern of 1000 consecutive writes of 4K packets followed by 1.5s idle time 6) Repeated pattern of 1000 consecutive writes of 4K packets followed by 5s idle time 7) Repeated pattern of a single random read of a 4K packet followed by 150ms idle time Power consumption Actual power consumption measurements produced predictable results in accordance with the APST mechanism's theory of operation. Devices with long entry and exit latencies such as WD SN530 showed huge improvement on scenarios 4,5 and 6 of up to 62%. Devices such as Kioxia KBG40ZNS where the resulting APST table looks virtually identical with both legacy and new algorithms, showed little or no change in the average power consumption on all workflows. Devices with extra short latencies such as Samsung PM991 showed moderate increase in power consumption of up to 18% in worst case scenarios. In addition, on Intel and Samsung devices a more complex impact was observed on scenarios 3, 4 and 7. Our understanding is that due to longer stay in deep non-operational states between the writes the devices start performing background operations leading to an increase of power consumption. With the old APST tables part of these operations are delayed until the scenario is over and a longer idle period begins, but eventually this extra power is consumed anyway. Performance. In terms of performance measured on sustained write or read scenarios, the effect of the patch is minimal as in this case the device doesn't enter low power states. Latency As expected, in devices where the patch causes a more aggressive power saving policy (e.g. WD SN530, Intel 760P), an increase in latency was observed in certain scenarios. Workflow number 7, specifically designed to simulate the worst case scenario as far as latency is concerned, indeed shows a sharp increase in average latency (~2ms -> ~53ms on Intel 760P and 0.6 -> 10ms on WD SN530). The latency increase on other workloads and other devices is much milder or non-existent. Signed-off-by: Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-03 10:29:24 +03:00
Colin Ian King	13ce7e625a	nvme: remove redundant initialization of variable ret The variable ret is being initialized with a value that is never read, it is being updated later on. The assignment is redundant and can be removed. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-03 10:29:24 +03:00
Max Gurtovoy	bcd9a0797d	nvmet: fix freeing unallocated p2pmem In case p2p device was found but the p2p pool is empty, the nvme target is still trying to free the sgl from the p2p pool instead of the regular sgl pool and causing a crash (BUG() is called). Instead, assign the p2p_dev for the request only if it was allocated from p2p pool. This is the crash that was caused: [Sun May 30 19:13:53 2021] ------------[ cut here ]------------ [Sun May 30 19:13:53 2021] kernel BUG at lib/genalloc.c:518! [Sun May 30 19:13:53 2021] invalid opcode: 0000 [#1] SMP PTI ... [Sun May 30 19:13:53 2021] kernel BUG at lib/genalloc.c:518! ... [Sun May 30 19:13:53 2021] RIP: 0010:gen_pool_free_owner+0xa8/0xb0 ... [Sun May 30 19:13:53 2021] Call Trace: [Sun May 30 19:13:53 2021] ------------[ cut here ]------------ [Sun May 30 19:13:53 2021] pci_free_p2pmem+0x2b/0x70 [Sun May 30 19:13:53 2021] pci_p2pmem_free_sgl+0x4f/0x80 [Sun May 30 19:13:53 2021] nvmet_req_free_sgls+0x1e/0x80 [nvmet] [Sun May 30 19:13:53 2021] kernel BUG at lib/genalloc.c:518! [Sun May 30 19:13:53 2021] nvmet_rdma_release_rsp+0x4e/0x1f0 [nvmet_rdma] [Sun May 30 19:13:53 2021] nvmet_rdma_send_done+0x1c/0x60 [nvmet_rdma] Fixes: `c6e3f13398` ("nvmet: add metadata support for block devices") Reviewed-by: Israel Rukshin <israelr@nvidia.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-02 10:10:38 +03:00
Hannes Reinecke	6622f9acd2	nvme-loop: do not warn for deleted controllers during reset During concurrent reset and delete calls the reset workqueue is flushed, causing nvme_loop_reset_ctrl_work() to be executed when the controller is in state DELETING or DELETING_NOIO. But this is expected, so we shouldn't issue a WARN_ON here. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-02 10:06:53 +03:00
Hannes Reinecke	4237de2f73	nvme-loop: check for NVME_LOOP_Q_LIVE in nvme_loop_destroy_admin_queue() We need to check the NVME_LOOP_Q_LIVE flag in nvme_loop_destroy_admin_queue() to protect against duplicate invocations eg during concurrent reset and remove calls. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-02 10:06:53 +03:00
Hannes Reinecke	1c5f8e882a	nvme-loop: clear NVME_LOOP_Q_LIVE when nvme_loop_configure_admin_queue() fails When the call to nvme_enable_ctrl() in nvme_loop_configure_admin_queue() fails the NVME_LOOP_Q_LIVE flag is not cleared. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-02 10:06:53 +03:00
Hannes Reinecke	a6c144f3d2	nvme-loop: reset queue count to 1 in nvme_loop_destroy_io_queues() The queue count is increased in nvme_loop_init_io_queues(), so we need to reset it to 1 at the end of nvme_loop_destroy_io_queues(). Otherwise the function is not re-entrant safe, and crash will happen during concurrent reset and remove calls. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-06-02 10:06:53 +03:00
Christoph Hellwig	f165fb89b7	nvme-multipath: convert to blk_alloc_disk/blk_cleanup_disk Convert the nvme-multipath driver to use the blk_alloc_disk and blk_cleanup_disk helpers to simplify gendisk and request_queue allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20210521055116.1053587-19-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-01 07:42:23 -06:00
Christoph Hellwig	0d1feb72ff	block: automatically enable GENHD_FL_EXT_DEVT Automatically set the GENHD_FL_EXT_DEVT flag for all disks allocated without an explicit number of minors. This is what all new block drivers should do, so make sure it is the default without boilerplate code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20210521055116.1053587-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-01 07:42:23 -06:00
Sagi Grimberg	12b2aaadb6	nvme-rdma: fix in-casule data send for chained sgls We have only 2 inline sg entries and we allow 4 sg entries for the send wr sge. Larger sgls entries will be chained. However when we build in-capsule send wr sge, we iterate without taking into account that the sgl may be chained and still fit in-capsule (which can happen if the sgl is bigger than 2, but lower-equal to 4). Fix in-capsule data mapping to correctly iterate chained sgls. Fixes: `38e1800275` ("nvme-rdma: Avoid preallocating big SGL for data") Reported-by: Walker, Benjamin <benjamin.walker@intel.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-31 09:06:11 +03:00
Sagi Grimberg	aaeadd7075	nvmet: fix false keep-alive timeout when a controller is torn down Controller teardown flow may take some time in case it has many I/O queues, and the host may not send us keep-alive during this period. Hence reset the traffic based keep-alive timer so we don't trigger a controller teardown as a result of a keep-alive expiration. Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-26 16:18:22 +02:00
Hou Pu	25df1acd2d	nvmet-tcp: fix inline data size comparison in nvmet_tcp_queue_response Using "<=" instead "<" to compare inline data size. Fixes: `bdaf132791` ("nvmet-tcp: fix a segmentation fault during io parsing error") Signed-off-by: Hou Pu <houpu.main@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-26 16:18:22 +02:00
Sagi Grimberg	042a3eaad6	nvme-tcp: remove incorrect Kconfig dep in BLK_DEV_NVME We need to select NVME_CORE. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-26 16:18:22 +02:00
Hannes Reinecke	4d9442bf26	nvme-fabrics: decode host pathing error for connect Add an additional decoding for 'host pathing error' during connect. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-25 09:21:16 +02:00
Hannes Reinecke	f25f8ef70c	nvme-fc: short-circuit reconnect retries Returning an nvme status from nvme_fc_create_association() indicates that the association is established, and we should honour the DNR bit. If it's set a reconnect attempt will just return the same error, so we can short-circuit the reconnect attempts and fail the connection directly. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-25 09:21:15 +02:00
Guoqing Jiang	3596a06583	nvme: fix potential memory leaks in nvme_cdev_add We need to call put_device if cdev_device_add failed, otherwise kmemleak has below report. [<0000000024c71758>] kmem_cache_alloc_trace+0x233/0x480 [<00000000ad2813ed>] device_add+0x7ff/0xe10 [<0000000035bc54c4>] cdev_device_add+0x72/0xa0 [<000000006c9aa1e8>] nvme_cdev_add+0xa9/0xf0 [nvme_core] [<000000003c4d492d>] nvme_mpath_set_live+0x251/0x290 [nvme_core] [<00000000889a58da>] nvme_mpath_add_disk+0x268/0x320 [nvme_core] [<00000000192e7161>] nvme_alloc_ns+0x669/0xac0 [nvme_core] [<000000007a1a6041>] nvme_validate_or_alloc_ns+0x156/0x280 [nvme_core] [<000000003a763c35>] nvme_scan_work+0x221/0x3c0 [nvme_core] [<000000009ff10706>] process_one_work+0x5cf/0xb10 [<000000000644ee25>] worker_thread+0x7a/0x680 [<00000000285ebd2f>] kthread+0x1c6/0x210 [<00000000e297c6ea>] ret_from_fork+0x22/0x30 Fixes: `2637baed78` ("nvme: introduce generic per-namespace chardev") Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-25 09:21:15 +02:00
James Smart	a7d139145a	nvme-fc: clear q_live at beginning of association teardown The __nvmf_check_ready() routine used to bounce all filesystem io if the controller state isn't LIVE. However, a later patch changed the logic so that it rejection ends up being based on the Q live check. The FC transport has a slightly different sequence from rdma and tcp for shutting down queues/marking them non-live. FC marks its queue non-live after aborting all ios and waiting for their termination, leaving a rather large window for filesystem io to continue to hit the transport. Unfortunately this resulted in filesystem I/O or applications seeing I/O errors. Change the FC transport to mark the queues non-live at the first sign of teardown for the association (when I/O is initially terminated). Fixes: `73a5379937` ("nvme-fabrics: allow to queue requests for live queues") Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-19 08:40:24 +02:00
Keith Busch	a0fdd14180	nvme-tcp: rerun io_work if req_list is not empty A possible race condition exists where the request to send data is enqueued from nvme_tcp_handle_r2t()'s will not be observed by nvme_tcp_send_all() if it happens to be running. The driver relies on io_work to send the enqueued request when it is runs again, but the concurrently running nvme_tcp_send_all() may not have released the send_mutex at that time. If no future commands are enqueued to re-kick the io_work, the request will timeout in the SEND_H2C state, resulting in a timeout error like: nvme nvme0: queue 1: timeout request 0x3 type 6 Ensure the io_work continues to run as long as the req_list is not empty. Fixes: `db5ad6b7f8` ("nvme-tcp: try to send request in queue_rq context") Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-19 08:33:42 +02:00
Sagi Grimberg	825619b09a	nvme-tcp: fix possible use-after-completion Commit `db5ad6b7f8` ("nvme-tcp: try to send request in queue_rq context") added a second context that may perform a network send. This means that now RX and TX are not serialized in nvme_tcp_io_work and can run concurrently. While there is correct mutual exclusion in the TX path (where the send_mutex protect the queue socket send activity) RX activity, and more specifically request completion may run concurrently. This means we must guarantee that any mutation of the request state related to its lifetime, bytes sent must not be accessed when a completion may have possibly arrived back (and processed). The race may trigger when a request completion arrives, processed _and_ reused as a fresh new request, exactly in the (relatively short) window between the last data payload sent and before the request iov_iter is advanced. Consider the following race: 1. 16K write request is queued 2. The nvme command and the data is sent to the controller (in-capsule or solicited by r2t) 3. After the last payload is sent but before the req.iter is advanced, the controller sends back a completion. 4. The completion is processed, the request is completed, and reused to transfer a new request (write or read) 5. The new request is queued, and the driver reset the request parameters (nvme_tcp_setup_cmd_pdu). 6. Now context in (2) resumes execution and advances the req.iter ==> use-after-completion as this is already a new request. Fix this by making sure the request is not advanced after the last data payload send, knowing that a completion may have arrived already. An alternative solution would have been to delay the request completion or state change waiting for reference counting on the TX path, but besides adding atomic operations to the hot-path, it may present challenges in multi-stage R2T scenarios where a r2t handler needs to be deferred to an async execution. Reported-by: Narayan Ayalasomayajula <narayan.ayalasomayajula@wdc.com> Tested-by: Anil Mishra <anil.mishra@wdc.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Cc: stable@vger.kernel.org # v5.8+ Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-19 08:33:42 +02:00
Wu Bo	03504e3b54	nvme-loop: fix memory leak in nvme_loop_create_ctrl() When creating loop ctrl in nvme_loop_create_ctrl(), if nvme_init_ctrl() fails, the loop ctrl should be freed before jumping to the "out" label. Fixes: `3a85a5de29` ("nvme-loop: add a NVMe loopback host driver") Signed-off-by: Wu Bo <wubo40@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-19 08:33:42 +02:00
Wu Bo	fec356a61a	nvmet: fix memory leak in nvmet_alloc_ctrl() When creating ctrl in nvmet_alloc_ctrl(), if the cntlid_min is larger than cntlid_max of the subsystem, and jumps to the "out_free_changed_ns_list" label, but the ctrl->sqs lack of be freed. Fix this by jumping to the "out_free_sqs" label. Fixes: `94a39d61f8` ("nvmet: make ctrl-id configurable") Signed-off-by: Wu Bo <wubo40@huawei.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-19 08:33:39 +02:00
Hou Pu	e181811bd0	nvmet: use new ana_log_size instead the old one The new ana_log_size should be used instead of the old one. Or kernel NULL pointer dereference will happen like below: [ 38.957849][ T69] BUG: kernel NULL pointer dereference, address: 000000000000003c [ 38.975550][ T69] #PF: supervisor write access in kernel mode [ 38.975955][ T69] #PF: error_code(0x0002) - not-present page [ 38.976905][ T69] PGD 0 P4D 0 [ 38.979388][ T69] Oops: 0002 [#1] SMP NOPTI [ 38.980488][ T69] CPU: 0 PID: 69 Comm: kworker/0:2 Not tainted 5.12.0+ #54 [ 38.981254][ T69] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 38.982502][ T69] Workqueue: events nvme_loop_execute_work [ 38.985219][ T69] RIP: 0010:memcpy_orig+0x68/0x10f [ 38.986203][ T69] Code: 83 c2 20 eb 44 48 01 d6 48 01 d7 48 83 ea 20 0f 1f 00 48 83 ea 20 4c 8b 46 f8 4c 8b 4e f0 4c 8b 56 e8 4c 8b 5e e0 48 8d 76 e0 <4c> 89 47 f8 4c 89 4f f0 4c 89 57 e8 4c 89 5f e0 48 8d 7f e0 73 d2 [ 38.987677][ T69] RSP: 0018:ffffc900001b7d48 EFLAGS: 00000287 [ 38.987996][ T69] RAX: 0000000000000020 RBX: 0000000000000024 RCX: 0000000000000010 [ 38.988327][ T69] RDX: ffffffffffffffe4 RSI: ffff8881084bc004 RDI: 0000000000000044 [ 38.988620][ T69] RBP: 0000000000000024 R08: 0000000100000000 R09: 0000000000000000 [ 38.988991][ T69] R10: 0000000100000000 R11: 0000000000000001 R12: 0000000000000024 [ 38.989289][ T69] R13: ffff8881084bc000 R14: 0000000000000000 R15: 0000000000000024 [ 38.989845][ T69] FS: 0000000000000000(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000 [ 38.990234][ T69] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 38.990490][ T69] CR2: 000000000000003c CR3: 00000001085b2000 CR4: 00000000000006f0 [ 38.991105][ T69] Call Trace: [ 38.994157][ T69] sg_copy_buffer+0xb8/0xf0 [ 38.995357][ T69] nvmet_copy_to_sgl+0x48/0x6d [ 38.995565][ T69] nvmet_execute_get_log_page_ana+0xd4/0x1cb [ 38.995792][ T69] nvmet_execute_get_log_page+0xc9/0x146 [ 38.995992][ T69] nvme_loop_execute_work+0x3e/0x44 [ 38.996181][ T69] process_one_work+0x1c3/0x3c0 [ 38.996393][ T69] worker_thread+0x44/0x3d0 [ 38.996600][ T69] ? cancel_delayed_work+0x90/0x90 [ 38.996804][ T69] kthread+0xf7/0x130 [ 38.996961][ T69] ? kthread_create_worker_on_cpu+0x70/0x70 [ 38.997171][ T69] ret_from_fork+0x22/0x30 [ 38.997705][ T69] Modules linked in: [ 38.998741][ T69] CR2: 000000000000003c [ 39.000104][ T69] ---[ end trace e719927b609d0fa0 ]--- Fixes: `5e1f689913` ("nvme-multipath: fix double initialization of ANA state") Signed-off-by: Hou Pu <houpu.main@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-13 16:33:32 +02:00
Daniel Wagner	85428beac8	nvmet: seset ns->file when open fails Reset the ns->file value to NULL also in the error case in nvmet_file_ns_enable(). The ns->file variable points either to file object or contains the error code after the filp_open() call. This can lead to following problem: When the user first setups an invalid file backend and tries to enable the ns, it will fail. Then the user switches over to a bdev backend and enables successfully the ns. The first received I/O will crash the system because the IO backend is chosen based on the ns->file value: static u16 nvmet_parse_io_cmd(struct nvmet_req *req) { [...] if (req->ns->file) return nvmet_file_parse_io_cmd(req); return nvmet_bdev_parse_io_cmd(req); } Reported-by: Enzo Matsumiya <ematsumiya@suse.com> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-12 19:48:48 +02:00
Chaitanya Kulkarni	7a4ffd20ec	nvmet: demote fabrics cmd parse err msg to debug Host can send invalid commands and flood the target with error messages. Demote the error message from pr_err() to pr_debug() in nvmet_parse_fabrics_cmd() and nvmet_parse_connect_cmd(). Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-11 18:30:46 +02:00
Chaitanya Kulkarni	4c2dab2bf5	nvmet: use helper to remove the duplicate code Use the helper nvmet_report_invalid_opcode() to report invalid opcode so we can remove the duplicate code. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-11 18:30:45 +02:00
Chaitanya Kulkarni	3651aaacd1	nvmet: demote discovery cmd parse err msg to debug Host can send invalid commands and flood the target with error messages for the discovery controller. Demote the error message from pr_err() to pr_debug( in nvmet_parse_discovery_cmd(). Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-11 18:30:45 +02:00
Michal Kalderon	8cc365f955	nvmet-rdma: Fix NULL deref when SEND is completed with error When running some traffic and taking down the link on peer, a retry counter exceeded error is received. This leads to nvmet_rdma_error_comp which tried accessing the cq_context to obtain the queue. The cq_context is no longer valid after the fix to use shared CQ mechanism and should be obtained similar to how it is obtained in other functions from the wc->qp. [ 905.786331] nvmet_rdma: SEND for CQE 0x00000000e3337f90 failed with status transport retry counter exceeded (12). [ 905.832048] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048 [ 905.839919] PGD 0 P4D 0 [ 905.842464] Oops: 0000 1 SMP NOPTI [ 905.846144] CPU: 13 PID: 1557 Comm: kworker/13:1H Kdump: loaded Tainted: G OE --------- - - 4.18.0-304.el8.x86_64 #1 [ 905.872135] RIP: 0010:nvmet_rdma_error_comp+0x5/0x1b [nvmet_rdma] [ 905.878259] Code: 19 4f c0 e8 89 b3 a5 f6 e9 5b e0 ff ff 0f b7 75 14 4c 89 ea 48 c7 c7 08 1a 4f c0 e8 71 b3 a5 f6 e9 4b e0 ff ff 0f 1f 44 00 00 <48> 8b 47 48 48 85 c0 74 08 48 89 c7 e9 98 bf 49 00 e9 c3 e3 ff ff [ 905.897135] RSP: 0018:ffffab601c45fe28 EFLAGS: 00010246 [ 905.902387] RAX: 0000000000000065 RBX: ffff9e729ea2f800 RCX: 0000000000000000 [ 905.909558] RDX: 0000000000000000 RSI: ffff9e72df9567c8 RDI: 0000000000000000 [ 905.916731] RBP: ffff9e729ea2b400 R08: 000000000000074d R09: 0000000000000074 [ 905.923903] R10: 0000000000000000 R11: ffffab601c45fcc0 R12: 0000000000000010 [ 905.931074] R13: 0000000000000000 R14: 0000000000000010 R15: ffff9e729ea2f400 [ 905.938247] FS: 0000000000000000(0000) GS:ffff9e72df940000(0000) knlGS:0000000000000000 [ 905.938249] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 905.950067] nvmet_rdma: SEND for CQE 0x00000000c7356cca failed with status transport retry counter exceeded (12). [ 905.961855] CR2: 0000000000000048 CR3: 000000678d010004 CR4: 00000000007706e0 [ 905.961855] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 905.961856] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 905.961857] PKRU: 55555554 [ 906.010315] Call Trace: [ 906.012778] __ib_process_cq+0x89/0x170 [ib_core] [ 906.017509] ib_cq_poll_work+0x26/0x80 [ib_core] [ 906.022152] process_one_work+0x1a7/0x360 [ 906.026182] ? create_worker+0x1a0/0x1a0 [ 906.030123] worker_thread+0x30/0x390 [ 906.033802] ? create_worker+0x1a0/0x1a0 [ 906.037744] kthread+0x116/0x130 [ 906.040988] ? kthread_flush_work_fn+0x10/0x10 [ 906.045456] ret_from_fork+0x1f/0x40 Fixes: `ca0f1a8055` ("nvmet-rdma: use new shared CQ mechanism") Signed-off-by: Shai Malin <smalin@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-11 18:30:45 +02:00
Chaitanya Kulkarni	ab96de5def	nvmet: fix inline bio check for passthru When handling passthru commands, for inline bio allocation we only consider the transfer size. This works well when req->sg_cnt fits into the req->inline_bvec, but it will result in the early return from bio_add_hw_page() when req->sg_cnt > NVMET_MAX_INLINE_BVEC. Consider an I/O of size 32768 and first buffer is not aligned to the page boundary, then I/O is split in following manner :- [ 2206.256140] nvmet: sg->length 3440 sg->offset 656 [ 2206.256144] nvmet: sg->length 4096 sg->offset 0 [ 2206.256148] nvmet: sg->length 4096 sg->offset 0 [ 2206.256152] nvmet: sg->length 4096 sg->offset 0 [ 2206.256155] nvmet: sg->length 4096 sg->offset 0 [ 2206.256159] nvmet: sg->length 4096 sg->offset 0 [ 2206.256163] nvmet: sg->length 4096 sg->offset 0 [ 2206.256166] nvmet: sg->length 4096 sg->offset 0 [ 2206.256170] nvmet: sg->length 656 sg->offset 0 Now the req->transfer_size == NVMET_MAX_INLINE_DATA_LEN i.e. 32768, but the req->sg_cnt is (9) > NVMET_MAX_INLINE_BIOVEC which is (8). This will result in early return in the following code path :- nvmet_bdev_execute_rw() bio_add_pc_page() bio_add_hw_page() if (bio_full(bio, len)) return 0; Use previously introduced helper nvmet_use_inline_bvec() to consider req->sg_cnt when using inline bio. This only affects nvme-loop transport. Fixes: `dab3902b19` ("nvmet: use inline bio for passthru fast path") Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-11 18:30:45 +02:00
Chaitanya Kulkarni	608a969046	nvmet: fix inline bio check for bdev-ns When handling rw commands, for inline bio case we only consider transfer size. This works well when req->sg_cnt fits into the req->inline_bvec, but it will result in the warning in __bio_add_page() when req->sg_cnt > NVMET_MAX_INLINE_BVEC. Consider an I/O size 32768 and first page is not aligned to the page boundary, then I/O is split in following manner :- [ 2206.256140] nvmet: sg->length 3440 sg->offset 656 [ 2206.256144] nvmet: sg->length 4096 sg->offset 0 [ 2206.256148] nvmet: sg->length 4096 sg->offset 0 [ 2206.256152] nvmet: sg->length 4096 sg->offset 0 [ 2206.256155] nvmet: sg->length 4096 sg->offset 0 [ 2206.256159] nvmet: sg->length 4096 sg->offset 0 [ 2206.256163] nvmet: sg->length 4096 sg->offset 0 [ 2206.256166] nvmet: sg->length 4096 sg->offset 0 [ 2206.256170] nvmet: sg->length 656 sg->offset 0 Now the req->transfer_size == NVMET_MAX_INLINE_DATA_LEN i.e. 32768, but the req->sg_cnt is (9) > NVMET_MAX_INLINE_BIOVEC which is (8). This will result in the following warning message :- nvmet_bdev_execute_rw() bio_add_page() __bio_add_page() WARN_ON_ONCE(bio_full(bio, len)); This scenario is very hard to reproduce on the nvme-loop transport only with rw commands issued with the passthru IOCTL interface from the host application and the data buffer is allocated with the malloc() and not the posix_memalign(). Fixes: `73383adfad` ("nvmet: don't split large I/Os unconditionally") Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-11 18:30:45 +02:00
Christoph Hellwig	5e1f689913	nvme-multipath: fix double initialization of ANA state nvme_init_identify and thus nvme_mpath_init can be called multiple times and thus must not overwrite potentially initialized or in-use fields. Split out a helper for the basic initialization when the controller is initialized and make sure the init_identify path does not blindly change in-use data structures. Fixes: `0d0b660f21` ("nvme: add ANA support") Reported-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de>	2021-05-11 18:30:45 +02:00
Keith Busch	4a20342572	nvmet: remove unsupported command noise Nothing can stop a host from submitting invalid commands. The target just needs to respond with an appropriate status, but that's not a target error. Demote invalid command messages to the debug level so these events don't spam the kernel logs. Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-04 09:39:26 +02:00
Daniel Wagner	ce86dad222	nvme-multipath: reset bdev to ns head when failover When a request finally completes in end_io() after it has failed over, the bdev pointer can be stale and thus the system can crash. Set the bdev back to ns head, so the request is map to an active path when resubmitted. Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-04 09:39:23 +02:00
Tao Chiu	d4060d2be1	nvme-pci: fix controller reset hang when racing with nvme_timeout reset_work() in nvme-pci may hang forever in the following scenario: 1) A reset caused by a command timeout occurs due to a controller being temporarily irresponsive. 2) nvme_reset_work() restarts admin queue at nvme_alloc_admin_tags(). At the same time, a user-submitted admin command is queued and waiting for completion. Then, reset_work() changes its state to CONNECTING, and submits an identify command. 3) However, the controller does still not respond to any command, causing a timeout being fired at the user-submitted command. Unfortunately, nvme_timeout() does not see the completion on cq, and any timeout that takes place under CONNECTING state causes a controller shutdown. 4) Normally, the identify command in reset_work() would be canceled with SC_HOST_ABORTED by nvme_dev_disable(), then reset_work can tear down the controller accordingly. But the controller happens to return online and respond the identify command before nvme_dev_disable() should have been reaped it off. 5) reset_work() continues to setup_io_queues() as it observes no error in init_identify(). However, the admin queue has already been quiesced in dev_disable(). Thus, any following commands would be blocked forever in blk_execute_rq(). This can be fixed by restricting usercmd commands when controller is not in a LIVE state in nvme_queue_rq(), as what has been done previously in fabrics. ``` nvme_reset_work(): \| nvme_alloc_admin_tags() \| \| nvme_submit_user_cmd(): nvme_init_identify(): \| ... __nvme_submit_sync_cmd(): \| ... \| ... ---------------------------------------> nvme_timeout(): (Controller starts reponding commands) \| nvme_dev_disable(, true): nvme_setup_io_queues(): \| __nvme_submit_sync_cmd(): \| (hung in blk_execute_rq \| since run_hw_queue sees \| queue quiesced) \| ``` Signed-off-by: Tao Chiu <taochiu@synology.com> Signed-off-by: Cody Wong <codywong@synology.com> Reviewed-by: Leon Chien <leonchien@synology.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-04 09:35:50 +02:00
Tao Chiu	a97157440e	nvme: move the fabrics queue ready check routines to core queue_rq() in pci only checks if the dispatched queue (nvmeq) is ready, e.g. not being suspended. Since nvme_alloc_admin_tags() in reset flow restarts the admin queue, users are able to submit admin commands to a controller before reset_work() completes. Commands submitted under this condition may interfere with commands that performs identify, IO queue setup in reset_work(), and may result in a hang described in the following patch. As seen in the fabrics, user commands are prevented from being executed under inproper controller states. We may reuse this logic to maintain a clear admin queue during reset_work(). Signed-off-by: Tao Chiu <taochiu@synology.com> Signed-off-by: Cody Wong <codywong@synology.com> Reviewed-by: Leon Chien <leonchien@synology.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-04 09:35:49 +02:00
Kanchan Joshi	51ad06cd69	nvme: avoid memset for passthrough requests nvme_clear_nvme_request() clears the nvme_command, which is unncessary for passthrough requests as nvme_command is overwritten immediately. Move clearing part from this helper to the caller, so that double memset for passthrough requests is avoided. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-04 09:35:49 +02:00
Kanchan Joshi	4c74d1f803	nvme: add nvme_get_ns helper Add a helper to avoid opencoding ns->kref increment. Decrement is already done via nvme_put_ns helper. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-04 09:35:48 +02:00
Minwoo Im	48145b6256	nvme: fix controller ioctl through ns_head In multipath case, we should consider namespace attachment with controllers in a subsystem when we find out the live controller for the namespace. This patch manually reverted the commit `3557a44097` ("nvme: don't bother to look up a namespace for controller ioctls") with few more updates to nvme_ns_head_chr_ioctl which has been newly updated. Fixes: `3557a44097` ("nvme: don't bother to look up a namespace for controller ioctls") Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-04 09:35:47 +02:00
Linus Torvalds	fc05860628	for-5.13/drivers-2021-04-27 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCIJYcQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpieWD/92qbtWl/z+9oCY212xV+YMoMqj/vGROX+U 9i/FQJ3AIC/AUoNjZeW3NIbiaNqde5mrLlUSCHgn6RLsHK7p0GQJ4ohpbIGFG5+i 2+Efm+vjlCxLVGrkeZEwMtsht7w/NbOYDr1Rgv9b4lQ6iWI11Mg8E337Whl1me1k h6bEXaioK9yqxYtsLgcn9I1qQ2p7gok0HX7zFU/XxEUZylqH6E4vQhj2+NL8UUqE 7siFHADZE99Z7LXtOkl8YyOlGU52RCUzqDHWydvkipKjgYBi95HLXGT64Z+WCEvz HI54oVDRWr+uWdqDFfy+ncHm8pNeP0GV9JPhDz4ELRTSndoxB2il7wRLvp6wxV9d 8Y4j7vb30i+8GGbM0c79dnlG76D9r5ivbTKixcXFKB128NusQR6JymIv1pKlSKhk H871/iOarrepAAUwVR5CtldDDJCy/q1Hks+7UXbaM3F9iNitxsJNZryQq9xdTu/N ThFOTz+VECG4RJLxIwmsWGiLgwr52/ybAl2MBcn+s7uC4jM/TFKpdQBfQnOAiINb MLlfuYRRSMg1Osb2fYZneR2ifmSNOMRdDJb+tsZGz4xWmZcj0uL4QgqcsOvuiOEQ veF/Ky50qw57hWtiEhvqa7/WIxzNF3G3wejqqA8hpT9Qifu0QawYTnXGUttYNBB1 mO9R3/ccaw== =c0x4 -----END PGP SIGNATURE----- Merge tag 'for-5.13/drivers-2021-04-27' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: - MD changes via Song: - raid5 POWER fix - raid1 failure fix - UAF fix for md cluster - mddev_find_or_alloc() clean up - Fix NULL pointer deref with external bitmap - Performance improvement for raid10 discard requests - Fix missing information of /proc/mdstat - rsxx const qualifier removal (Arnd) - Expose allocated brd pages (Calvin) - rnbd via Gioh Kim: - Change maintainer - Change domain address of maintainers' email - Add polling IO mode and document update - Fix memory leak and some bug detected by static code analysis tools - Code refactoring - Series of floppy cleanups/fixes (Denis) - s390 dasd fixes (Julian) - kerneldoc fixes (Lee) - null_blk double free (Lv) - null_blk virtual boundary addition (Max) - Remove xsysace driver (Michal) - umem driver removal (Davidlohr) - ataflop fixes (Dan) - Revalidate disk removal (Christoph) - Bounce buffer cleanups (Christoph) - Mark lightnvm as deprecated (Christoph) - mtip32xx init cleanups (Shixin) - Various fixes (Tian, Gustavo, Coly, Yang, Zhang, Zhiqiang) * tag 'for-5.13/drivers-2021-04-27' of git://git.kernel.dk/linux-block: (143 commits) async_xor: increase src_offs when dropping destination page drivers/block/null_blk/main: Fix a double free in null_init. md/raid1: properly indicate failure when ending a failed write request md-cluster: fix use-after-free issue when removing rdev nvme: introduce generic per-namespace chardev nvme: cleanup nvme_configure_apst nvme: do not try to reconfigure APST when the controller is not live nvme: add 'kato' sysfs attribute nvme: sanitize KATO setting nvmet: avoid queuing keep-alive timer if it is disabled brd: expose number of allocated pages in debugfs ataflop: fix off by one in ataflop_probe() ataflop: potential out of bounds in do_format() drbd: Fix fall-through warnings for Clang block/rnbd: Use strscpy instead of strlcpy block/rnbd-clt-sysfs: Remove copy buffer overlap in rnbd_clt_get_path_name block/rnbd-clt: Remove max_segment_size block/rnbd-clt: Generate kobject_uevent when the rnbd device state changes block/rnbd-srv: Remove unused arguments of rnbd_srv_rdma_ev Documentation/ABI/rnbd-clt: Add description for nr_poll_queues ...	2021-04-28 14:39:37 -07:00
Linus Torvalds	6c00292113	for-5.13/block-2021-04-27 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCIJW0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpr8sD/4qP+MsFTB1IFUu8fW7BjBPdduoK8Vq9o3S HB8iF/yhJZ73nLecMMdn/jTO8SCW0Iw+okywW3BugGnNPbwXo0UQ4jLhzbTts76P JvZaguZFhBsF3ceFOt3CRCQDOeoDfMp3sitLUVivkN+2vwMs9vJpVNaEeUjcCC1Z 8QjlpqYSMuakTwEn7QhlnKxVWn1V2B6PDjZMcf48ONRZGsCkoOXH1SE4Ge8nxjqa KHKO5bvwgRzGhKpvdHEIl8dmFL9WEWElBVoY3vE2EHL0SPE32zHlxtYLS0NAhY2M aprkJ0QP0Rgl8HpYiCstwAnJGKDg4a0ArWhf/CJTuLAWmTNFR7v5n7vw2SilJHTG 0FtiFiOnpvvBmUC0B1PUEQX8AiFcdXueLb6xboExcp2WtxIAe8wPoGFl6T1tobBY qsfWggGs/vD1RVrJISPC+20cJemcRyeakMV48w+n3Lt/ES3IEv/LXx6PO/PbXvOo B7HJXTofkoaX52A/1+NxraGapwzhYouhi6Sb6Fc++X59/a/oBuOUGuur0eZ+/oWA 9787mUUDmW/sahfZUgZh5AxqKo2jJULjeggANCICW9/RN6duV8TBQVOLW1/0Wddp 9lndiA9ZMveWF+J19+sjBoiYMYawLmURaOlDK77ctTCcR/ji3l4GZ+2KvBEMeIT8 O1OYEnwaIQ== =oza6 -----END PGP SIGNATURE----- Merge tag 'for-5.13/block-2021-04-27' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: "Pretty quiet round this time, which is nice. In detail: - Series revamping bounce buffer support (Christoph) - Dead code removal (Christoph, Bart) - Partition iteration revamp, now using xarray (Christoph) - Passthrough request scheduler improvements (Lin) - Series of BFQ improvements (Paolo) - Fix ioprio task iteration (Peter) - Various little tweaks and fixes (Tejun, Saravanan, Bhaskar, Max, Nikolay)" * tag 'for-5.13/block-2021-04-27' of git://git.kernel.dk/linux-block: (41 commits) blk-iocost: don't ignore vrate_min on QD contention blk-mq: Fix spurious debugfs directory creation during initialization bfq/mq-deadline: remove redundant check for passthrough request blk-mq: bypass IO scheduler's limit_depth for passthrough request block: Remove an obsolete comment from sg_io() block: move bio_list_copy_data to pktcdvd block: remove zero_fill_bio_iter block: add queue_to_disk() to get gendisk from request_queue block: remove an incorrect check from blk_rq_append_bio block: initialize ret in bdev_disk_changed block: Fix sys_ioprio_set(.which=IOPRIO_WHO_PGRP) task iteration block: remove disk_part_iter block: simplify diskstats_show block: simplify show_partition block: simplify printk_all_partitions block: simplify partition_overlaps block: simplify partition removal block: take bd_mutex around delete_partitions in del_gendisk block: refactor blk_drop_partitions block: move more syncing and invalidation to delete_partition ...	2021-04-28 14:27:12 -07:00
Minwoo Im	2637baed78	nvme: introduce generic per-namespace chardev Userspace has not been allowed to I/O to device that's failed to be initialized. This patch introduces generic per-namespace character device to allow userspace to I/O regardless the block device is there or not. The chardev naming convention will similar to the existing blkdev naming, using a ng prefix instead of nvme, i.e. - /dev/ngXnY It also supports multipath which means it will not expose chardev for the hidden namespace blkdevs (e.g., nvmeXcYnZ). If /dev/ngXnY is created for a ns_head, then I/O request will be routed to a specific controller selected by the iopolicy of the subsystem. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Tested-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-04-22 07:25:17 +02:00
Christoph Hellwig	60df5de9b0	nvme: cleanup nvme_configure_apst Remove a level of indentation from the main code implementating the table search by using a goto for the APST not supported case. Also move the main comment above the function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>	2021-04-21 19:13:16 +02:00
Christoph Hellwig	53fe2a30bc	nvme: do not try to reconfigure APST when the controller is not live Do not call nvme_configure_apst when the controller is not live, given that nvme_configure_apst will fail due the lack of an admin queue when the controller is being torn down and nvme_set_latency_tolerance is called from dev_pm_qos_hide_latency_tolerance. Fixes: 510a405d945b("nvme: fix memory leak for power latency tolerance") Reported-by: Peng Liu <liupeng17@lenovo.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org>	2021-04-21 19:13:16 +02:00
Hannes Reinecke	74c22990f0	nvme: add 'kato' sysfs attribute Add a 'kato' controller sysfs attribute to display the current keep-alive timeout value (if any). This allows userspace to identify persistent discovery controllers, as these will have a non-zero KATO value. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-04-21 19:13:15 +02:00
Hannes Reinecke	a70b81bd4d	nvme: sanitize KATO setting According to the NVMe base spec the KATO commands should be sent at half of the KATO interval, to properly account for round-trip times. As we now will only ever send one KATO command per connection we can easily use the recommended values. This also fixes a potential issue where the request timeout for the KATO command does not match the value in the connect command, which might be causing spurious connection drops from the target. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-04-21 19:13:15 +02:00
Hou Pu	8f864c595b	nvmet: avoid queuing keep-alive timer if it is disabled Issue following command: nvme set-feature -f 0xf -v 0 /dev/nvme1n1 # disable keep-alive timer nvme admin-passthru -o 0x18 /dev/nvme1n1 # send keep-alive command will make keep-alive timer fired and thus delete the controller like below: [247459.907635] nvmet: ctrl 1 keep-alive timer (0 seconds) expired! [247459.930294] nvmet: ctrl 1 fatal error occurred! Avoid this by not queuing delayed keep-alive if it is disabled when keep-alive command is received from the admin queue. Signed-off-by: Hou Pu <houpu.main@gmail.com> Tested-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-04-21 19:13:15 +02:00
Gopal Tiwari	d6609084b0	nvme: fix NULL derefence in nvme_ctrl_fast_io_fail_tmo_show/store Adding entry for dev_attr_fast_io_fail_tmo to avoid the kernel crash while reading and writing the fast_io_fail_tmo. Fixes: `09fbed6363` (nvme: export fast_io_fail_tmo to sysfs) Signed-off-by: Gopal Tiwari <gtiwari@redhat.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-04-15 08:12:56 +02:00
Christoph Hellwig	a9e0e6bc72	nvme: let namespace probing continue for unsupported features Instead of failing to scan the namespace entirely when unsupported features are detected, just mark the gendisk hidden but allow other access like the upcoming per-namespace character device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com>	2021-04-15 08:12:56 +02:00
Christoph Hellwig	f5b9a51db2	nvme: factor out nvme_ns_open and nvme_ns_release helpers These will be reused for the per-namespace character devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2021-04-15 08:12:55 +02:00
Christoph Hellwig	1496bd4936	nvme: move nvme_ns_head_ops to multipath.c Move the multipath block_device_operations to multipath.c, where they belong. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2021-04-15 08:12:55 +02:00
Christoph Hellwig	871ca3ef13	nvme: factor out a nvme_tryget_ns_head helper Add a helper to avoid opencoding ns_head->ref manipulations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2021-04-15 08:12:55 +02:00
Christoph Hellwig	2405252a68	nvme: move the ioctl code to a separate file Split out the ioctl code from core.c into a new file. Also update copyrights while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2021-04-15 08:12:55 +02:00
Christoph Hellwig	3557a44097	nvme: don't bother to look up a namespace for controller ioctls Don't bother to look up a namespace just to drop if after retreiving the controller for the multipath case. Just look up a live controller for the subsystem directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com>	2021-04-15 08:12:55 +02:00
Christoph Hellwig	2f907f7f96	nvme: simplify block device ioctl handling for the !multipath case Only use the existing ioctl handler for the multipath case, and add a simpler one that reverts to the pre-multipath case for not shared use case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com>	2021-04-15 08:12:55 +02:00
Christoph Hellwig	89b3d6e605	nvme: simplify the compat ioctl handling Don't bother defining a separate compat_ioctl handler, and just handle the NVME_IOCTL_SUBMIT_IO32 case inline. Also only defined it for those ABIs (currently just i386 vs x86_64) that are affected. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com>	2021-04-15 08:12:55 +02:00
Christoph Hellwig	a5d737f100	nvme: factor out a nvme_ns_ioctl helper Factor out a helper for the namespace based ioctls. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2021-04-15 08:12:54 +02:00
Christoph Hellwig	d7790d3739	nvme: pass a user pointer to nvme_nvm_ioctl Pass the proper user pointer instead of the not all that useful integer representation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com>	2021-04-15 08:12:54 +02:00
Christoph Hellwig	9953ab0c5a	nvme: cleanup setting the disk name Return false from nvme_set_disk_name and let the caller set the non-multipath name instead of duplicating the naming information in two places. Also remove the pointless local variables for the disk name and flags and the not needed ctrl argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com>	2021-04-15 08:12:54 +02:00
Minwoo Im	3089738868	nvme: add a nvme_ns_head_multipath helper Move the multipath gendisk out of #ifdef CONFIG_NVME_MULTIPATH and add a new nvme_ns_head_multipath that uses it to check if a ns_head has a multipath device associated with it. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> [hch: added the IS_ENABLED, converted a few existing users] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2021-04-15 08:12:54 +02:00
Niklas Cassel	95d54bd1a4	nvme: remove single trailing whitespace There is a single trailing whitespace in core.c. Since this is just a single whitespace, the chances of this affecting backports to stable should be quite low, so let's just remove it. Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-04-15 08:12:54 +02:00
Niklas Cassel	e234f1f8bb	nvme-multipath: remove single trailing whitespace There is a single trailing whitespace in multipath.c. Since this is just a single whitespace, the chances of this affecting backports to stable should be quite low, so let's just remove it. Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-04-15 08:12:54 +02:00
Niklas Cassel	53dc180e7c	nvme-pci: remove single trailing whitespace There is a single trailing whitespace in pci.c. Since this is just a single whitespace, the chances of this affecting backports to stable should be quite low, so let's just remove it. Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-04-15 08:12:54 +02:00
Niklas Cassel	e51183be1f	nvme-pci: don't simple map sgl when sgls are disabled According to the module parameter description for sgl_threshold, a value of 0 means that SGLs are disabled. If SGLs are disabled, we should respect that, even for the case where the request is made up of a single physical segment. Fixes: `297910571f` ("nvme-pci: optimize mapping single segment requests using SGLs") Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-04-15 08:12:53 +02:00
Colin Ian King	ccc1003b5b	nvmet: fix a spelling mistake "nubmer" -> "number" There is a spelling mistake in a pr_err error message. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-04-15 08:12:53 +02:00
Amit Engel	0d8ddeea11	nvmet-fc: simplify nvmet_fc_alloc_hostport Once a host is already created, avoid allocate additional hostports that will be thrown away. add an helper function to handle host search. Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Amit Engel <amit.engel@dell.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-04-15 08:12:53 +02:00
Elad Grupi	bdaf132791	nvmet-tcp: fix a segmentation fault during io parsing error In case there is an io that contains inline data and it goes to parsing error flow, command response will free command and iov before clearing the data on the socket buffer. This will delay the command response until receive flow is completed. Fixes: `872d26a391` ("nvmet-tcp: add NVMe over TCP target driver") Signed-off-by: Elad Grupi <elad.grupi@dell.com> Signed-off-by: Hou Pu <houpu.main@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-04-15 08:12:50 +02:00
Chaitanya Kulkarni	327e1d2957	lightnvm: use kobj_to_dev() This fixs coccicheck warning: drivers/nvme//host/lightnvm.c:1243:60-61: WARNING opportunity for kobj_to_dev() Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com> Link: https://lore.kernel.org/r/20210413105257.159260-2-matias.bjorling@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-13 09:16:12 -06:00
Sami Tolvanen	4f0f586bf0	treewide: Change list_sort to use const pointers list_sort() internally casts the comparison function passed to it to a different type with constant struct list_head pointers, and uses this pointer to call the functions, which trips indirect call Control-Flow Integrity (CFI) checking. Instead of removing the consts, this change defines the list_cmp_func_t type and changes the comparison function types of all list_sort() callers to use const pointers, thus avoiding type mismatches. Suggested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com	2021-04-08 16:04:22 -07:00

1 2 3 4 5 ...

2556 Commits