linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-10-01 22:54:01 +00:00

Author	SHA1	Message	Date
Jens Axboe	b7405176b5	Merge branch 'nvme-4.18-2' of git://git.infradead.org/nvme into for-4.18/block Pull NVMe changes from Christoph: "Here is the current batch of nvme updates for 4.18, we have a few more patches in the queue, but I'd like to get this pile into your tree and linux-next ASAP. The biggest item is support for file-backed namespaces in the NVMe target from Chaitanya, in addition to that we mostly small fixes from all the usual suspects." * 'nvme-4.18-2' of git://git.infradead.org/nvme: nvme: fixup memory leak in nvme_init_identify() nvme: fix KASAN warning when parsing host nqn nvmet-loop: use nr_phys_segments when map rq to sgl nvmet-fc: increase LS buffer count per fc port nvmet: add simple file backed ns support nvmet: remove duplicate NULL initialization for req->ns nvmet: make a few error messages more generic nvme-fabrics: allow duplicate connections to the discovery controller nvme-fabrics: centralize discovery controller defaults nvme-fabrics: remove unnecessary controller subnqn validation nvme-fc: remove setting DNR on exception conditions nvme-rdma: stop admin queue before freeing it nvme-pci: Fix AER reset handling nvme-pci: set nvmeq->cq_vector after alloc cq/sq nvme: host: core: fix precedence of ternary operator nvme: fix lockdep warning in nvme_mpath_clear_current_path	2018-05-29 12:56:20 -06:00
Christoph Hellwig	5afb78356c	block: don't print a message when the device went away The information about a size change in this case just creates confusion. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-29 08:59:21 -06:00
Christoph Hellwig	4163a03984	block: unexport check_disk_size_change Only used in block_dev.c and the partitions code, and it should remain that way.. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-29 08:59:21 -06:00
Jens Axboe	0b7576d8eb	block: move ->timeout request member After the recent timeout handling changes, we have two holes in the struct. Move the timeout near the deadline, killing both, and moving related members closer together. On my config on x86-64, this shrinks struct request from 312 to 304 bytes. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-29 08:59:21 -06:00
Christoph Hellwig	d1210d5afb	blk-mq: simplify blk_mq_rq_timed_out Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-29 08:59:21 -06:00
Christoph Hellwig	88b0cfad28	block: document the blk_eh_timer_return values Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-29 08:59:21 -06:00
Christoph Hellwig	f6e7d48a78	block: remove BLK_EH_HANDLED Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-29 08:59:21 -06:00
Christoph Hellwig	adb2b769d4	libiscsi: don't try to bypass SCSI EH libiscsi is the only SCSI code that return BLK_EH_HANDLED, thus trying to bypass the normal SCSI EH code. We are going to remove this return value at the block layer, and at least from a quick look it doesn't look too harmful to try to send an abort for these cases, especially as the first one should not actually be possible. If this doesn't work out iscsi will probably need its own eh_strategy_handler instead to just do the right thing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-29 08:59:21 -06:00
Christoph Hellwig	ad73d6fead	mmc: complete requests from ->timeout By completing the request entirely in the driver we can remove the BLK_EH_HANDLED return value and thus the split responsibility between the driver and the block layer that has been causing trouble. [While this keeps existing behavior it seems to mismatch the comment, maintainers please chime in!] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-29 08:59:21 -06:00
Christoph Hellwig	1fc2b62edb	scsi_transport_fc: complete requests from ->timeout By completing the request entirely in the driver we can remove the BLK_EH_HANDLED return value and thus the split responsibility between the driver and the block layer that has been causing trouble. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-29 08:59:21 -06:00
Christoph Hellwig	0df0bb080a	null_blk: complete requests from ->timeout By completing the request entirely in the driver we can remove the BLK_EH_HANDLED return value and thus the split responsibility between the driver and the block layer that has been causing trouble. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-29 08:59:21 -06:00
Christoph Hellwig	c5fb85b7ff	mtip32xx: complete requests from ->timeout By completing the request entirely in the driver we can remove the BLK_EH_HANDLED return value and thus the split responsibility between the driver and the block layer that has been causing trouble. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-29 08:59:21 -06:00
Christoph Hellwig	e5eab01704	nbd: complete requests from ->timeout By completing the request entirely in the driver we can remove the BLK_EH_HANDLED return value and thus the split responsibility between the driver and the block layer that has been causing trouble. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-29 08:59:21 -06:00
Christoph Hellwig	db8c48e4b2	nvme: return BLK_EH_DONE from ->timeout NVMe always completes the request before returning from ->timeout, either by polling for it, or by disabling the controller. Return BLK_EH_DONE so that the block layer doesn't even try to complete it again. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-29 08:59:21 -06:00
Christoph Hellwig	6600593cbd	block: rename BLK_EH_NOT_HANDLED to BLK_EH_DONE The BLK_EH_NOT_HANDLED implies nothing happen, but very often that is not what is happening - instead the driver already completed the command. Fix the symbolic name to reflect that a little better. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-29 08:59:21 -06:00
Keith Busch	12f5b93145	blk-mq: Remove generation seqeunce This patch simplifies the timeout handling by relying on the request reference counting to ensure the iterator is operating on an inflight and truly timed out request. Since the reference counting prevents the tag from being reallocated, the block layer no longer needs to prevent drivers from completing their requests while the timeout handler is operating on it: a driver completing a request is allowed to proceed to the next state without additional syncronization with the block layer. This also removes any need for generation sequence numbers since the request lifetime is prevented from being reallocated as a new sequence while timeout handling is operating on it. To enables this a refcount is added to struct request so that request users can be sure they're operating on the same request without it changing while they're processing it. The request's tag won't be released for reuse until both the timeout handler and the completion are done with it. Signed-off-by: Keith Busch <keith.busch@intel.com> [hch: slight cleanups, added back submission side hctx lock, use cmpxchg for completions] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-29 08:59:21 -06:00
Keith Busch	ad103e7983	blk-mq: Fix timeout and state order The block layer had been setting the state to in-flight prior to updating the timer. This is the wrong order since the timeout handler could observe the in-flight state with the older timeout, believing the request had expired when in fact it is just getting started. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-29 08:47:40 -06:00
Christoph Hellwig	01fc27d969	libata: remove ata_scsi_timed_out As far as I can tell this function can't even be called any more, given that ATA implements its own eh_strategy_handler with ata_scsi_error, which never calls ->eh_timed_out. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-29 08:47:40 -06:00
Andy Shevchenko	ce4c3e19e5	bcache: Replace bch_read_string_list() by __sysfs_match_string() Kernel library has a common function to match user input from sysfs against an array of strings. Thus, replace bch_read_string_list() by __sysfs_match_string(). Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-28 14:53:22 -06:00
Andy Shevchenko	ecb37ce9ba	bcache: Move couple of functions to sysfs.c There is couple of functions that are used exclusively in sysfs.c. Move it to there and make them static. Besides above, it will allow further clean up. No functional change intended. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-28 14:53:20 -06:00
Andy Shevchenko	04cbc21137	bcache: Move couple of string arrays to sysfs.c There is couple of string arrays that are used exclusively in sysfs.c. Move it to there and make them static. Besides above, it will allow further clean up. No functional change intended. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-28 14:53:18 -06:00
Coly Li	0f0709e6bf	bcache: stop bcache device when backing device is offline Currently bcache does not handle backing device failure, if backing device is offline and disconnected from system, its bcache device can still be accessible. If the bcache device is in writeback mode, I/O requests even can success if the requests hit on cache device. That is to say, when and how bcache handles offline backing device is undefined. This patch tries to handle backing device offline in a rather simple way, - Add cached_dev->status_update_thread kernel thread to update backing device status in every 1 second. - Add cached_dev->offline_seconds to record how many seconds the backing device is observed to be offline. If the backing device is offline for BACKING_DEV_OFFLINE_TIMEOUT (30) seconds, set dc->io_disable to 1 and call bcache_device_stop() to stop the bache device which linked to the offline backing device. Now if a backing device is offline for BACKING_DEV_OFFLINE_TIMEOUT seconds, its bcache device will be removed, then user space application writing on it will get error immediately, and handler the device failure in time. This patch is quite simple, does not handle more complicated situations. Once the bcache device is stopped, users need to recovery the backing device, register and attach it manually. Changelog: v3: call wait_for_kthread_stop() before exits kernel thread. v2: remove "bcache: " prefix when calling pr_warn(). v1: initial version. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Cc: Michael Lyle <mlyle@lyle.org> Cc: Junhui Tang <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-28 14:53:16 -06:00
Liu Bo	6723d8dc9d	null_blk: add blocking description and remove lightnvm - The description of 'blocking' is missing in null_blk.txt - The 'lightnvm' parameter has been removed in null_blk.c This updates both in null_blk.txt. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-25 08:51:58 -06:00
Hannes Reinecke	75c8b19a23	nvme: fixup memory leak in nvme_init_identify() If nvme_get_effects_log() failed the 'id' buffer from the previous nvme_identify_ctrl() call will never be freed. Signed-off-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-25 16:50:12 +02:00
Hannes Reinecke	1e5f446162	nvme: fix KASAN warning when parsing host nqn The host nqn actually is smaller than the space reserved for it, so we should be using strlcpy to keep KASAN happy. Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-25 16:50:12 +02:00
Chaitanya Kulkarni	eb464833a2	nvmet-loop: use nr_phys_segments when map rq to sgl Use blk_rq_nr_phys_segments() instead of blk_rq_payload_bytes() to check if a command contains data to me mapped. This fixes the case where a struct requests contains LBAs, but no data will actually be send, e.g. the pending Write Zeroes support. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-25 16:50:12 +02:00
James Smart	17d78252ee	nvmet-fc: increase LS buffer count per fc port Todays limit on concurrent LS's is very small - 4 buffers. With large subsystem counts or large numbers of initiators connecting, the limit may be exceeded. Raise the LS buffer count to 256. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-25 16:50:12 +02:00
Chaitanya Kulkarni	d5eff33ee6	nvmet: add simple file backed ns support This patch adds simple file backed namespace support for NVMeOF target. The new file io-cmd-file.c is responsible for handling the code for I/O commands when ns is file backed. Also, we introduce mempools based slow path using sync I/Os for file backed ns to ensure forward progress under reclaim. The old block device based implementation is moved to io-cmd-bdev.c and use a "nvmet_bdev_" symbol prefix. The enable/disable calls are also move into the respective files. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> [hch: updated changelog, fixed double req->ns lookup in bdev case] Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-25 16:50:12 +02:00
Chaitanya Kulkarni	618cff4285	nvmet: remove duplicate NULL initialization for req->ns Remove the duplicate NULL initialization for req->ns. req->ns is always initialized to NULL in nvmet_req_init(), so there is no need to reset it later on failures unless we have previously assigned a value to it. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-25 16:50:12 +02:00
Chaitanya Kulkarni	b40b83e365	nvmet: make a few error messages more generic "nvmet_check_ctrl_status()" is called from admin-cmd.c along with io-cmd.c, make the error message more generic. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-25 16:50:12 +02:00
Hannes Reinecke	181303d035	nvme-fabrics: allow duplicate connections to the discovery controller The whole point of the discovery controller is that it can accept multiple connections. Additionally the cmic field is not even defined for the discovery controller identify page. Signed-off-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-25 16:50:12 +02:00
Hannes Reinecke	461fbc8f0e	nvme-fabrics: centralize discovery controller defaults When connecting to the discovery controller we have certain defaults to observe, so centralize them to avoid inconsistencies due to argument ordering. Signed-off-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-25 16:50:12 +02:00
James Smart	ffecb0b452	nvme-fabrics: remove unnecessary controller subnqn validation After creating the nvme controller, nvmf_create_ctrl() validates the newly created subsysnqn vs the one specified by the options. In general, this is an unnecessary check as the Connect message should implicitly ensure this value matches. With the change to the FC transport to do an asynchronous connect for the first association create, the transport will return to nvmf_create_ctrl() before that first association has been established, thus the subnqn will not yet be set. Remove the unnecessary validation. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-25 16:50:15 +02:00
James Smart	90fcaf5d54	nvme-fc: remove setting DNR on exception conditions Current code will set DNR if the controller is deleting or there is an error during controller init. None of this is necessary. Remove the code that sets DNR Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-25 16:50:12 +02:00
Jianchao Wang	2e050f00a0	nvme-rdma: stop admin queue before freeing it For any failure after nvme_rdma_start_queue in nvme_rdma_configure_admin_queue, the admin queue will be freed with the NVME_RDMA_Q_LIVE flag still set. Once nvme_rdma_stop_queue is invoked, that will cause a use-after-free. BUG: KASAN: use-after-free in rdma_disconnect+0x1f/0xe0 [rdma_cm] To fix it, call nvme_rdma_stop_queue for all the failed cases after nvme_rdma_start_queue. Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Suggested-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-25 16:50:12 +02:00
Keith Busch	72cd4cc28e	nvme-pci: Fix AER reset handling The nvme timeout handling doesn't do anything if the pci channel is offline, which is the case when recovering from PCI error event, so it was a bad idea to sync the controller reset in this state. This patch flushes the reset work in the error_resume callback instead when the channel is back to online. This keeps AER handling serialized and can recover from timeouts. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199757 Fixes: `cc1d5e749a` ("nvme/pci: Sync controller reset for AER slot_reset") Reported-by: Alex Gagniuc <mr.nuke.me@gmail.com> Tested-by: Alex Gagniuc <mr.nuke.me@gmail.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-25 16:50:12 +02:00
Jianchao Wang	a8e3e0bb74	nvme-pci: set nvmeq->cq_vector after alloc cq/sq Set cq_vector after alloc cq/sq, otherwise nvme_suspend_queue will invoke free_irq for it and cause a 'Trying to free already-free IRQ xxx' warning if the create CQ/SQ command times out. Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Reviewed-by: Keith Busch <keith.busch@intel.com> [hch: fixed to pass a s16 and clean up the comment] Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-25 16:50:12 +02:00
Joe Perches	5657a819a8	block drivers/block: Use octal not symbolic permissions Convert the S_<FOO> symbolic permissions to their octal equivalents as using octal and not symbolic permissions is preferred by many as more readable. see: https://lkml.org/lkml/2016/8/2/1945 Done with automated conversion via: $ ./scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace <files...> Miscellanea: o Wrapped modified multi-line calls to a single line where appropriate o Realign modified multi-line calls to open parenthesis Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-24 13:38:59 -06:00
Ming Lei	e6fc464987	blk-mq: avoid starving tag allocation after allocating process migrates When the allocation process is scheduled back and the mapped hw queue is changed, fake one extra wake up on previous queue for compensating wake up miss, so other allocations on the previous queue won't be starved. This patch fixes one request allocation hang issue, which can be triggered easily in case of very low nr_request. The race is as follows: 1) 2 hw queues, nr_requests are 2, and wake_batch is one 2) there are 3 waiters on hw queue 0 3) two in-flight requests in hw queue 0 are completed, and only two waiters of 3 are waken up because of wake_batch, but both the two waiters can be scheduled to another CPU and cause to switch to hw queue 1 4) then the 3rd waiter will wait for ever, since no in-flight request is in hw queue 0 any more. 5) this patch fixes it by the fake wakeup when waiter is scheduled to another hw queue Cc: <stable@vger.kernel.org> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Modified commit message to make it clearer, and make it apply on top of the 4.18 branch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-24 11:00:39 -06:00
Tejun Heo	f183464684	bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue From 0aa2e9b921d6db71150633ff290199554f0842a8 Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Wed, 23 May 2018 10:29:00 -0700 cgwb_release() punts the actual release to cgwb_release_workfn() on system_wq. Depending on the number of cgroups or block devices, there can be a lot of cgwb_release_workfn() in flight at the same time. We're periodically seeing close to 256 kworkers getting stuck with the following stack trace and overtime the entire system gets stuck. [<ffffffff810ee40c>] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330 [<ffffffff810ee634>] synchronize_rcu_expedited+0x24/0x30 [<ffffffff811ccf23>] bdi_unregister+0x53/0x290 [<ffffffff811cd1e9>] release_bdi+0x89/0xc0 [<ffffffff811cd645>] wb_exit+0x85/0xa0 [<ffffffff811cdc84>] cgwb_release_workfn+0x54/0xb0 [<ffffffff810a68d0>] process_one_work+0x150/0x410 [<ffffffff810a71fd>] worker_thread+0x6d/0x520 [<ffffffff810ad3dc>] kthread+0x12c/0x160 [<ffffffff81969019>] ret_from_fork+0x29/0x40 [<ffffffffffffffff>] 0xffffffffffffffff The events leading to the lockup are... 1. A lot of cgwb_release_workfn() is queued at the same time and all system_wq kworkers are assigned to execute them. 2. They all end up calling synchronize_rcu_expedited(). One of them wins and tries to perform the expedited synchronization. 3. However, that invovles queueing rcu_exp_work to system_wq and waiting for it. Because #1 is holding all available kworkers on system_wq, rcu_exp_work can't be executed. cgwb_release_workfn() is waiting for synchronize_rcu_expedited() which in turn is waiting for cgwb_release_workfn() to free up some of the kworkers. We shouldn't be scheduling hundreds of cgwb_release_workfn() at the same time. There's nothing to be gained from that. This patch updates cgwb release path to use a dedicated percpu workqueue with @max_active of 1. While this resolves the problem at hand, it might be a good idea to isolate rcu_exp_work to its own workqueue too as it can be used from various paths and is prone to this sort of indirect A-A deadlocks. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-23 15:28:50 -06:00
Josef Bacik	6df133a149	nbd: set discard granularity properly For some reason we had discard granularity set to 512 always even when discards were disabled. Fix this by having the default be 0, and then if we turn it on set the discard granularity to the blocksize. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-23 15:27:26 -06:00
Ivan Bornyakov	e9a9853c23	nvme: host: core: fix precedence of ternary operator Ternary operator have lower precedence then bitwise or, so 'cdw10' was calculated wrong. Signed-off-by: Ivan Bornyakov <brnkv.i1@gmail.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Keith Busch <keith.busch@intel.com>	2018-05-23 09:11:37 -06:00
Johannes Thumshirn	978628ec79	nvme: fix lockdep warning in nvme_mpath_clear_current_path When running blktest's nvme/005 with a lockdep enabled kernel the test case fails due to the following lockdep splat in dmesg: ============================= WARNING: suspicious RCU usage 4.17.0-rc5 #881 Not tainted ----------------------------- drivers/nvme/host/nvme.h:457 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 3 locks held by kworker/u32:5/1102: #0: (ptrval) ((wq_completion)"nvme-wq"){+.+.}, at: process_one_work+0x152/0x5c0 #1: (ptrval) ((work_completion)(&ctrl->scan_work)){+.+.}, at: process_one_work+0x152/0x5c0 #2: (ptrval) (&subsys->lock#2){+.+.}, at: nvme_ns_remove+0x43/0x1c0 [nvme_core] The only caller of nvme_mpath_clear_current_path() is nvme_ns_remove() which holds the subsys lock so it's likely a false positive, but when using rcu_access_pointer(), we're telling rcu and lockdep that we're only after the pointer falue. Fixes: `32acab3181` ("nvme: implement multipath access to nvme subsystems") Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com>	2018-05-23 08:58:27 -06:00
Bart Van Assche	327ea4adcf	blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers Avoid that complaints similar to the following appear in the kernel log if the number of zones is sufficiently large: fio: page allocation failure: order:9, mode:0x140c0c0(GFP_KERNEL\|__GFP_COMP\|__GFP_ZERO), nodemask=(null) Call Trace: dump_stack+0x63/0x88 warn_alloc+0xf5/0x190 __alloc_pages_slowpath+0x8f0/0xb0d __alloc_pages_nodemask+0x242/0x260 alloc_pages_current+0x6a/0xb0 kmalloc_order+0x18/0x50 kmalloc_order_trace+0x26/0xb0 __kmalloc+0x20e/0x220 blkdev_report_zones_ioctl+0xa5/0x1a0 blkdev_ioctl+0x1ba/0x930 block_ioctl+0x41/0x50 do_vfs_ioctl+0xaa/0x610 SyS_ioctl+0x79/0x90 do_syscall_64+0x79/0x1b0 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Fixes: `3ed05a987e` ("blk-zoned: implement ioctls") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Shaun Tancheff <shaun.tancheff@seagate.com> Cc: Damien Le Moal <damien.lemoal@hgst.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Hannes Reinecke <hare@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-22 11:58:07 -06:00
Dan Melnic	2189c97cdb	block/ndb: add WQ_UNBOUND to the knbd-recv workqueue Add WQ_UNBOUND to the knbd-recv workqueue so we're not bound to a single CPU that is selected at device creation time. Signed-off-by: Dan Melnic <dmm@fb.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-22 11:47:34 -06:00
huhai	b4f6f38d9f	blk-mq: remove wrong 'unlikely' check When dispatch_rq_from_ctx is called, in the vast majority of cases the ctx->rq_list is not empty. Signed-off-by: huhai <huhai@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-22 08:38:04 -06:00
Jens Axboe	68fa9dbe08	nvme-pci: fix race between poll and IRQ completions If polling completions are racing with the IRQ triggered by a completion, the IRQ handler will find no work and return IRQ_NONE. This can trigger complaints about spurious interrupts: [ 560.169153] irq 630: nobody cared (try booting with the "irqpoll" option) [ 560.175988] CPU: 40 PID: 0 Comm: swapper/40 Not tainted 4.17.0-rc2+ #65 [ 560.175990] Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.00.01.0010.010920180151 01/09/2018 [ 560.175991] Call Trace: [ 560.175994] <IRQ> [ 560.176005] dump_stack+0x5c/0x7b [ 560.176010] __report_bad_irq+0x30/0xc0 [ 560.176013] note_interrupt+0x235/0x280 [ 560.176020] handle_irq_event_percpu+0x51/0x70 [ 560.176023] handle_irq_event+0x27/0x50 [ 560.176026] handle_edge_irq+0x6d/0x180 [ 560.176031] handle_irq+0xa5/0x110 [ 560.176036] do_IRQ+0x41/0xc0 [ 560.176042] common_interrupt+0xf/0xf [ 560.176043] </IRQ> [ 560.176050] RIP: 0010:cpuidle_enter_state+0x9b/0x2b0 [ 560.176052] RSP: 0018:ffffa0ed4659fe98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd [ 560.176055] RAX: ffff9527beb20a80 RBX: 000000826caee491 RCX: 000000000000001f [ 560.176056] RDX: 000000826caee491 RSI: 00000000335206ee RDI: 0000000000000000 [ 560.176057] RBP: 0000000000000001 R08: 00000000ffffffff R09: 0000000000000008 [ 560.176059] R10: ffffa0ed4659fe78 R11: 0000000000000001 R12: ffff9527beb29358 [ 560.176060] R13: ffffffffa235d4b8 R14: 0000000000000000 R15: 000000826caed593 [ 560.176065] ? cpuidle_enter_state+0x8b/0x2b0 [ 560.176071] do_idle+0x1f4/0x260 [ 560.176075] cpu_startup_entry+0x6f/0x80 [ 560.176080] start_secondary+0x184/0x1d0 [ 560.176085] secondary_startup_64+0xa5/0xb0 [ 560.176088] handlers: [ 560.178387] [<00000000efb612be>] nvme_irq [nvme] [ 560.183019] Disabling IRQ #630 A previous commit removed ->cqe_seen that was handling this case, but we need to handle this a bit differently due to completions now running outside the queue lock. Return IRQ_HANDLED from the IRQ handler, if the completion ring head was moved since we last saw it. Fixes: `5cb525c831` ("nvme-pci: handle completions outside of the queue lock") Reported-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Tested-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-21 08:41:52 -06:00
Jens Axboe	81b1dab458	Merge branch 'nvme-4.18' of git://git.infradead.org/nvme into for-4.18/block Pull NVMe changes from Keith: "This is just the first nvme pull request for 4.18. There are several fabrics and target patches that I missed, so there will be more to come." * 'nvme-4.18' of git://git.infradead.org/nvme: nvme-pci: drop IRQ disabling on submission queue lock nvme-pci: split the nvme queue lock into submission and completion locks nvme-pci: handle completions outside of the queue lock nvme-pci: move ->cq_vector == -1 check outside of ->q_lock nvme-pci: remove cq check after submission nvme-pci: simplify nvme_cqe_valid nvme: mark the result argument to nvme_complete_async_event volatile nvme/pci: Sync controller reset for AER slot_reset nvme/pci: Hold controller reference during async probe nvme: only reconfigure discard if necessary nvme/pci: Use async_schedule for initial reset work nvme: lightnvm: add granby support NVMe: Add Quirk Delay before CHK RDY for Seagate Nytro Flash Storage nvme: change order of qid and cmdid in completion trace nvme: fc: provide a descriptive error	2018-05-21 08:33:37 -06:00
Jens Axboe	1eae349d18	nvme-pci: drop IRQ disabling on submission queue lock Since we aren't sharing the lock for completions now, we don't have to make it IRQ safe. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-18 14:41:36 -06:00
Jens Axboe	1ab0cd6966	nvme-pci: split the nvme queue lock into submission and completion locks This is now feasible. We protect the submission queue ring with ->sq_lock, and the completion side with ->cq_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-18 14:41:36 -06:00

1 2 3 4 5 ...

753195 commits