Commit graph

798559 commits

Author SHA1 Message Date
Shenghui Wang
ae17102316 bcache: do not check if debug dentry is ERR or NULL explicitly on remove
debugfs_remove and debugfs_remove_recursive will check if the dentry
pointer is NULL or ERR, and will do nothing in that case.

Remove the check in cache_set_free and bch_debug_init.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Shenghui Wang
d2f96f487f bcache: add comment for cache_set->fill_iter
We have the following define for btree iterator:
	struct btree_iter {
		size_t size, used;
	#ifdef CONFIG_BCACHE_DEBUG
		struct btree_keys *b;
	#endif
		struct btree_iter_set {
			struct bkey *k, *end;
		} data[MAX_BSETS];
	};

We can see that the length of data[] field is static MAX_BSETS, which is
defined as 4 currently.

But a btree node on disk could have too many bsets for an iterator to fit
on the stack - maybe far more that MAX_BSETS. Have to dynamically allocate
space to host more btree_iter_sets.

bch_cache_set_alloc() will make sure the pool cache_set->fill_iter can
allocate an iterator equipped with enough room that can host
	(sb.bucket_size / sb.block_size)
btree_iter_sets, which is more than static MAX_BSETS.

bch_btree_node_read_done() will use that pool to allocate one iterator, to
host many bsets in one btree node.

Add more comment around cache_set->fill_iter to make code less confusing.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Dennis Zhou
0273ac349f blkcg: handle dying request_queue when associating a blkg
Between v3 [1] and v4 [2] of the blkg association series, the
association point moved from generic_make_request_checks(), which is
called after the request enters the queue, to bio_set_dev(), which is when
the bio is formed before submit_bio(). When the request_queue goes away,
the blkgs supporting the request_queue are destroyed and then the
q->root_blkg is set to %NULL.

This patch adds a %NULL check to blkg_tryget_closest() to prevent the
NPE caused by the above. It also adds a guard to see if the
request_queue is dying when creating a blkg to prevent creating a blkg
for a dead request_queue.

[1] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
[2] https://lore.kernel.org/lkml/20181126211946.77067-1-dennis@kernel.org/

Fixes: 5cdf2e3fea ("blkcg: associate blkg when associating a device")
Reported-and-tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-12 17:43:33 -07:00
Ming Lei
544fbd16a4 block: deactivate blk_stat timer in wbt_disable_default()
rwb_enabled() can't be changed when there is any inflight IO.

wbt_disable_default() may set rwb->wb_normal as zero, however the
blk_stat timer may still be pending, and the timer function will update
wrb->wb_normal again.

This patch introduces blk_stat_deactivate() and applies it in
wbt_disable_default(), then the following IO hang triggered when running
parted & switching io scheduler can be fixed:

[  369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
[  369.938941]       Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
[  369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  369.940768] parted          D    0  3645   3239 0x00000000
[  369.941500] Call Trace:
[  369.941874]  ? __schedule+0x6d9/0x74c
[  369.942392]  ? wbt_done+0x5e/0x5e
[  369.942864]  ? wbt_cleanup_cb+0x16/0x16
[  369.943404]  ? wbt_done+0x5e/0x5e
[  369.943874]  schedule+0x67/0x78
[  369.944298]  io_schedule+0x12/0x33
[  369.944771]  rq_qos_wait+0xb5/0x119
[  369.945193]  ? karma_partition+0x1c2/0x1c2
[  369.945691]  ? wbt_cleanup_cb+0x16/0x16
[  369.946151]  wbt_wait+0x85/0xb6
[  369.946540]  __rq_qos_throttle+0x23/0x2f
[  369.947014]  blk_mq_make_request+0xe6/0x40a
[  369.947518]  generic_make_request+0x192/0x2fe
[  369.948042]  ? submit_bio+0x103/0x11f
[  369.948486]  ? __radix_tree_lookup+0x35/0xb5
[  369.949011]  submit_bio+0x103/0x11f
[  369.949436]  ? blkg_lookup_slowpath+0x25/0x44
[  369.949962]  submit_bio_wait+0x53/0x7f
[  369.950469]  blkdev_issue_flush+0x8a/0xae
[  369.951032]  blkdev_fsync+0x2f/0x3a
[  369.951502]  do_fsync+0x2e/0x47
[  369.951887]  __x64_sys_fsync+0x10/0x13
[  369.952374]  do_syscall_64+0x89/0x149
[  369.952819]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  369.953492] RIP: 0033:0x7f95a1e729d4
[  369.953996] Code: Bad RIP value.
[  369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[  369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
[  369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
[  369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
[  369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
[  369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008

Cc: stable@vger.kernel.org
Cc: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-12 06:47:51 -07:00
Jens Axboe
b2dbff1bb8 sbitmap: flush deferred clears for resize and shallow gets
We're missing a deferred clear off the shallow get, which can cause
a hang. Additionally, when we resize the sbitmap, we should also
flush deferred clears for good measure.

Ensure we have full coverage on batch clears, even for paths where
we would not be doing deferred clear. This makes it less error
prone for future additions.

Reported-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 18:39:41 -07:00
Igor Konopko
2c4d5356e6 lightnvm: pblk: do not overwrite ppa list with meta list
Ehen using pblk with 0 sized metadata both ppa list and meta list
points to the same memory since pblk_dma_meta_size() returns 0 in
that case.

This patch fix that issue by ensuring that pblk_dma_meta_size()
always returns space equal to sizeof(struct pblk_sec_meta) and thus
ppa list and meta list points to different memory address.

Even that in that case drive does not really care about meta_list
pointer, this is the easiest way to fix that issue without introducing
changes in many places in the code just for 0 sized metadata case.

The same approach needs to be also done for pblk_get_sec_meta()
since we also cannot point to the same memory address in meta buffer
when we are using it for pblk recovery process

Reported-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Tested-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:35 -07:00
Igor Konopko
55d8ec3539 lightnvm: pblk: support packed metadata
pblk performs recovery of open lines by storing the LBA in the per LBA
metadata field. Recovery therefore only works for drives that has this
field.

This patch adds support for packed metadata, which store l2p mapping
for open lines in last sector of every write unit and enables drives
without per IO metadata to recover open lines.

After this patch, drives with OOB size <16B will use packed metadata
and metadata size larger than16B will continue to use the device per
IO metadata.

Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:35 -07:00
Igor Konopko
a16816b9e4 lightnvm: disable interleaved metadata
Currently pblk only check the size of I/O metadata and does not take
into account if this metadata is in a separate buffer or interleaved
in a single metadata buffer.

In reality only the first scenario is supported, where second mode will
break pblk functionality during any IO operation.

This patch prevents pblk to be instantiated in case device only
supports interleaved metadata.

Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:35 -07:00
Igor Konopko
24828d0536 lightnvm: dynamic DMA pool entry size
Currently lightnvm and pblk uses single DMA pool, for which the entry
size always is equal to PAGE_SIZE. The contents of each entry allocated
from the DMA pool consists of a PPA list (8bytes * 64), leaving
56bytes * 64 space for metadata. Since the metadata field can be bigger,
such as 128 bytes, the static size does not cover this use-case.

This patch adds support for I/O metadata above 56 bytes by changing DMA
pool size based on device meta size and allows pblk to use OOB metadata
>=16B.

Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:35 -07:00
Igor Konopko
faa79f27f0 lightnvm: pblk: add helpers for OOB metadata
pblk currently assumes that size of OOB metadata on drive is always
equal to size of pblk_sec_meta struct. This commit add helpers which will
allow to handle different sizes of OOB metadata on drive in the future.

After this patch only OOB metadata equal to 16 bytes is supported.

Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:35 -07:00
Igor Konopko
dd439496df lightnvm: pblk: move lba list to partial read context
Currently DMA allocated memory is reused on partial read
for lba_list_mem and lba_list_media arrays. In preparation
for dynamic DMA pool sizes we need to move this arrays
into pblk_pr_ctx structures.

Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:34 -07:00
Javier González
42bd0384d7 lightnvm: pblk: avoid ref warning on cache creation
The current kref implementation around pblk global caches triggers a
false positive on refcount_inc_checked() (when called) as the kref is
initialized to 0. Instead of usint kref_inc() on a 0 reference, which is
in principle correct, use kref_init() to avoid the check. This is also
more explicit about what actually happens on cache creation.

In the process, do a small refactoring to use kref helpers.

Fixes: 1864de94ec "lightnvm: pblk: stop recreating global caches"
Signed-off-by: Javier González <javier@cnexlabs.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:34 -07:00
Matias Bjørling
85136c0102 lightnvm: simplify geometry enumeration
Currently the geometry of an OCSSD is enumerated using a two step
approach:

First, nvm_register is called, the OCSSD identify command is issued,
and second the geometry sos and csecs values are read either from the
OCSSD identify if it is a 1.2 drive, or from the NVMe namespace data
structure if it is a 2.0 device.

This patch recombines it into a single step, such that nvm_register can
use the csecs and sos fields independent of which version is used. This
enables one to dynamically size the lightnvm subsystem dma pool.

Reviewed-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:34 -07:00
Javier González
361d889f83 lightnvm: pblk: add comments wrt locking in recovery path
pblk's recovery path is single threaded and therefore a number of
assumptions regarding concurrency can be made. To avoid confusion, make
this explicit with a couple of comments in the code.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:34 -07:00
Hua Su
fde201a466 lightnvm: pblk: add lock protection to list operations
Protect the list_add on the pblk_line_init_bb() error
path in case this code is used for some other purpose
in the future.

Signed-off-by: Hua Su <suhua.tanke@gmail.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:34 -07:00
Hua Su
6e82f0ba00 lightnvm: pblk: fix spelling in comment
Signed-off-by: Hua Su <suhua.tanke@gmail.com>
Updated description.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:34 -07:00
Hans Holmberg
e698d9f4e6 lightnvm: pblk: remove dead code in pblk_recov_l2p
Remove the call to pblk_line_replace_data as it returns
directly because we have not set l_mg->data_next yet.

Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:34 -07:00
Hans Holmberg
0934ce87b5 lightnvm: pblk: fix pblk_lines_init error handling path
The chunk metadata is allocated with vmalloc, so we need to use
vfree to free it.

Fixes: 090ee26fd5 ("lightnvm: use internal allocation for chunk log page")
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:34 -07:00
Hans Holmberg
c9a1d640d5 lightnvm: pblk: remove unused macro
ADDR_POOL_SIZE is not used anymore, so remove the macro.

Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:34 -07:00
Hans Holmberg
3bcebc5bac lightnvm: pblk: set conservative threshold for user writes
In a worst-case scenario (random writes), OP% of sectors
in each line will be invalid, and we will then need
to move data out of 100/OP% lines to free a single line.

So, to prevent the possibility of running out of lines,
temporarily block user writes when there is less than
100/OP% free lines.

Also ensure that pblk creation does not produce instances
with insufficient over provisioning.

Insufficient over-provising is not a problem on real hardware,
but often an issue when running QEMU simulations (with few lines).
100 lines is enough to create a sane instance with the standard
(11%) over provisioning.

Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:33 -07:00
Hans Holmberg
525f7bb2c9 lightnvm: pblk: stop writes gracefully when running out of lines
If mapping fails (i.e. when running out of lines), handle the error
and stop writing.

Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:33 -07:00
Hans Holmberg
ab3887be1e lightnvm: pblk: account for write error sectors in emeta
Lines inflicted with write errors lines might be recovered
if they have not been recycled after write error garbage collection.

Ensure that the emeta accounting of valid lbas is correct
for such lines to avoid recovery inconsistencies.

Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:33 -07:00
Hans Holmberg
c12fa401ac lightnvm: pblk: fix resubmission of overwritten write err lbas
Make sure we only look up valid lba addresses on the resubmission path.

If an lba is invalidated in the write buffer, that sector will be
submitted to disk (as it is already mapped to a ppa), and that write
might fail, resulting in a crash when trying to look up the lba in the
mapping table (as the lba is marked as invalid).

Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:33 -07:00
Hans Holmberg
96076f7dde lightnvm: pblk: fix chunk close trace event check
The check for chunk closes suffers from an off-by-one issue, leading
to chunk close events not being traced.

Fixes: 4c44abf43d ("lightnvm: pblk: add trace events for chunk states")
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:33 -07:00
Geert Uytterhoeven
55e58c5e78 lightnvm: Fix uninitialized return value in nvm_get_chunk_meta()
With gcc 4.1:

    drivers/lightnvm/core.c: In function ‘nvm_get_bb_meta’:
    drivers/lightnvm/core.c:977: warning: ‘ret’ may be used uninitialized in this function

and

    drivers/nvme/host/lightnvm.c: In function ‘nvme_nvm_get_chk_meta’:
    drivers/nvme/host/lightnvm.c:580: warning: ‘ret’ may be used uninitialized in this function

Indeed, if (for the former) the number of channels or LUNs is zero, or
(for both) the passed number of chunks is zero, ret will be returned
uninitialized.

Fix this by preinitializing ret to zero.

Fixes: aff3fb18f9 ("lightnvm: move bad block and chunk state logic to core")
Fixes: a294c19945 ("lightnvm: implement get log report chunk helpers")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:33 -07:00
Zhoujie Wu
f40a62d267 lightnvm: pblk: ignore the smeta oob area scan
The smeta area l2p mapping is empty, and actually the
recovery procedure only need to restore data sector's l2p
mapping. So ignore the smeta oob scan.

Signed-off-by: Zhoujie Wu <zjwu@marvell.com>
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:33 -07:00
Mike Snitzer
c4576aed8d dm: fix request-based dm's use of dm_wait_for_completion
The md->wait waitqueue is used by both bio-based and request-based DM.
Commit dbd3bbd291 ("dm rq: leverage blk_mq_queue_busy() to check for
outstanding IO") lost sight of the requirement that
dm_wait_for_completion() must work with all types of DM devices.

Fix md_in_flight() to call the blk-mq or bio-based method accordingly.

Fixes: dbd3bbd291 ("dm rq: leverage blk_mq_queue_busy() to check for outstanding IO")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 07:40:02 -07:00
Jens Axboe
6451fe73fa nvme: fix irq vs io_queue calculations
Guenter reported an boot hang issue on HPPA after we default to 0 poll
queues. We have two issues in the queue count calculations:

1) We don't separate the poll queues from the read/write queues. This is
   important, since the former doesn't need interrupts.
2) The adjust logic is broken.

Adjust the poll queue count before doing nvme_calc_io_queues(). The poll
queue count is only limited by the IO queue count we were able to get
from the controller, not failures in the IRQ allocation loop. This
leaves nvme_calc_io_queues() just adjusting the read/write queue map.

Reported-by: Reported-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 06:27:46 -07:00
Jens Axboe
b7934ba414 dm: fix inflight IO check
After switching to percpu inflight counters, the inflight check
is totally buggy. It's perfectly valid for some counters to be
non-zero while having a total inflight IO count of 0, that's how
these kinds of counters work (inc on one CPU, dec on another).
Fix the md_in_flight() check to sum all counters before returning
a false positive, potentially.

While at it, remove the inflight read for IO completion. We don't
need it, just wake anyone that's waiting for the IO count to drop
to zero. The caller needs to re-check that value anyway when woken,
which it does.

Fixes: 6f75723190 ("dm: remove the pending IO accounting")
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 18:10:34 -07:00
Jens Axboe
4ba09f69e2 mtip32xx: use BLK_STS_DEV_RESOURCE for device resources
For cases where we can only fail with IO in-flight, we should be using
BLK_STS_DEV_RESOURCE instead of BLK_STS_RESOURCE. The latter refers to
system wide resource constraints.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 14:45:19 -07:00
Arnd Bergmann
e4025e46f0 mtip32xx: avoid using semaphores
The "cmd_slot_unal" semaphore is never used in a blocking way
but only as an atomic counter. Change the code to using
atomic_dec_if_positive() as a better API.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 14:44:56 -07:00
Mikulas Patocka
6f75723190 dm: remove the pending IO accounting
Remove the "pending" atomic counters, that duplicate block-core's
in_flight counters, and update md_in_flight() to look at percpu
in_flight counters.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 08:30:38 -07:00
Mikulas Patocka
e016b78201 block: return just one value from part_in_flight
The previous patches deleted all the code that needed the second value
returned from part_in_flight - now the kernel only uses the first value.

Consequently, part_in_flight (and blk_mq_in_flight) may be changed so that
it only returns one value.

This patch just refactors the code, there's no functional change.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 08:30:38 -07:00
Mikulas Patocka
1226b8dd0e block: switch to per-cpu in-flight counters
Now when part_round_stats is gone, we can switch to per-cpu in-flight
counters.

We use the local-atomic type local_t, so that if part_inc_in_flight or
part_dec_in_flight is reentrantly called from an interrupt, the value will
be correct.

The other counters could be corrupted due to reentrant interrupt, but the
corruption only results in slight counter skew - the in_flight counter
must be exact, so it needs local_t.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 08:30:37 -07:00
Mikulas Patocka
5b18b5a737 block: delete part_round_stats and switch to less precise counting
We want to convert to per-cpu in_flight counters.

The function part_round_stats needs the in_flight counter every jiffy, it
would be too costly to sum all the percpu variables every jiffy, so it
must be deleted. part_round_stats is used to calculate two counters -
time_in_queue and io_ticks.

time_in_queue can be calculated without part_round_stats, by adding the
duration of the I/O when the I/O ends (the value is almost as exact as the
previously calculated value, except that time for in-progress I/Os is not
counted).

io_ticks can be approximated by increasing the value when I/O is started
or ended and the jiffies value has changed. If the I/Os take less than a
jiffy, the value is as exact as the previously calculated value. If the
I/Os take more than a jiffy, io_ticks can drift behind the previously
calculated value.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 08:30:37 -07:00
Mike Snitzer
112f158f66 block: stop passing 'cpu' to all percpu stats methods
All of part_stat_* and related methods are used with preempt disabled,
so there is no need to pass cpu around to allow of them.  Just call
smp_processor_id() as needed.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 08:30:37 -07:00
Mike Snitzer
dbd3bbd291 dm rq: leverage blk_mq_queue_busy() to check for outstanding IO
Now that request-based dm-multipath only supports blk-mq, make use of
the newly introduced blk_mq_queue_busy() to check for outstanding IO --
rather than (ab)using the block core's in_flight counters.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 08:30:37 -07:00
Mikulas Patocka
80a787ba38 dm: dont rewrite dm_disk(md)->part0.in_flight
generic_start_io_acct and generic_end_io_acct already update the variable
in_flight using atomic operations, so we don't have to overwrite them
again.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 08:30:37 -07:00
Jens Axboe
96f774106e Linux 4.20-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlwNpb0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwGwH/00UHnXfxww3ixxz
 zwTVDzptA6SPm6s84yJOWatM5fXhPiAltZaHSYF9lzRzNU71NCq7Frhq3fQUIXKM
 OxqDn9nfSTWcjWTk2q5n2keyRV/KIn67YX7UgqFc1bO/mqtVjEgNWaMyblhI+e9E
 giu1ZXayHr43jK1cDOmGExZubXUq7Vsc9TOlrd+d2SwIqeEP7TCMrPhnHDwCNvX2
 UU5dtANpVzGtHaBcr37wJj+L8kODCc0f+PQ3g2ar5jTHst5SLlHp2u0AMRnUmgdi
 VkGx+mu/uk8mtwUqMIMqhplklVoqK6LTeLqsY5Xt32SKruw9UqyJGdphLjW2QP/g
 MkmA1lI=
 =7kaD
 -----END PGP SIGNATURE-----

Merge tag 'v4.20-rc6' into for-4.21/block

Pull in v4.20-rc6 to resolve the conflict in NVMe, but also to get the
two corruption fixes. We're going to be overhauling the direct dispatch
path, and we need to do that on top of the changes we made for that
in mainline.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-09 17:45:40 -07:00
Jens Axboe
58ab5e32e6 sbitmap: silence bogus lockdep IRQ warning
Ming reports that lockdep spews the following trace. What this
essentially says is that the sbitmap swap_lock was used inconsistently
in IRQ enabled and disabled context, and that is usually indicative of a
bug that will cause a deadlock.

For this case, it's a false positive. The swap_lock is used from process
context only, when we swap the bits in the word and cleared mask. We
also end up doing that when we are getting a driver tag, from the
blk_mq_mark_tag_wait(), and from there we hold the waitqueue lock with
IRQs disabled. However, this isn't from an actual IRQ, it's still
process context.

In lieu of a better way to fix this, simply always disable interrupts
when grabbing the swap_lock if lockdep is enabled.

[  100.967642] ================start test sanity/001================
[  101.238280] null: module loaded
[  106.093735]
[  106.094012] =====================================================
[  106.094854] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
[  106.095759] 4.20.0-rc3_5d2ee7122c73_for-next+ #1 Not tainted
[  106.096551] -----------------------------------------------------
[  106.097386] fio/1043 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[  106.098231] 000000004c43fa71
(&(&sb->map[i].swap_lock)->rlock){+.+.}, at: sbitmap_get+0xd5/0x22c
[  106.099431]
[  106.099431] and this task is already holding:
[  106.100229] 000000007eec8b2f
(&(&hctx->dispatch_wait_lock)->rlock){....}, at:
blk_mq_dispatch_rq_list+0x4c1/0xd7c
[  106.101630] which would create a new lock dependency:
[  106.102326]  (&(&hctx->dispatch_wait_lock)->rlock){....} ->
(&(&sb->map[i].swap_lock)->rlock){+.+.}
[  106.103553]
[  106.103553] but this new dependency connects a SOFTIRQ-irq-safe lock:
[  106.104580]  (&sbq->ws[i].wait){..-.}
[  106.104582]
[  106.104582] ... which became SOFTIRQ-irq-safe at:
[  106.105751]   _raw_spin_lock_irqsave+0x4b/0x82
[  106.106284]   __wake_up_common_lock+0x119/0x1b9
[  106.106825]   sbitmap_queue_wake_up+0x33f/0x383
[  106.107456]   sbitmap_queue_clear+0x4c/0x9a
[  106.108046]   __blk_mq_free_request+0x188/0x1d3
[  106.108581]   blk_mq_free_request+0x23b/0x26b
[  106.109102]   scsi_end_request+0x345/0x5d7
[  106.109587]   scsi_io_completion+0x4b5/0x8f0
[  106.110099]   scsi_finish_command+0x412/0x456
[  106.110615]   scsi_softirq_done+0x23f/0x29b
[  106.111115]   blk_done_softirq+0x2a7/0x2e6
[  106.111608]   __do_softirq+0x360/0x6ad
[  106.112062]   run_ksoftirqd+0x2f/0x5b
[  106.112499]   smpboot_thread_fn+0x3a5/0x3db
[  106.113000]   kthread+0x1d4/0x1e4
[  106.113457]   ret_from_fork+0x3a/0x50
[  106.113969]
[  106.113969] to a SOFTIRQ-irq-unsafe lock:
[  106.114672]  (&(&sb->map[i].swap_lock)->rlock){+.+.}
[  106.114674]
[  106.114674] ... which became SOFTIRQ-irq-unsafe at:
[  106.116000] ...
[  106.116003]   _raw_spin_lock+0x33/0x64
[  106.116676]   sbitmap_get+0xd5/0x22c
[  106.117134]   __sbitmap_queue_get+0xe8/0x177
[  106.117731]   __blk_mq_get_tag+0x1e6/0x22d
[  106.118286]   blk_mq_get_tag+0x1db/0x6e4
[  106.118756]   blk_mq_get_driver_tag+0x161/0x258
[  106.119383]   blk_mq_dispatch_rq_list+0x28e/0xd7c
[  106.120043]   blk_mq_do_dispatch_sched+0x23a/0x287
[  106.120607]   blk_mq_sched_dispatch_requests+0x379/0x3fc
[  106.121234]   __blk_mq_run_hw_queue+0x137/0x17e
[  106.121781]   __blk_mq_delay_run_hw_queue+0x80/0x25f
[  106.122366]   blk_mq_run_hw_queue+0x151/0x187
[  106.122887]   blk_mq_sched_insert_requests+0x13f/0x175
[  106.123492]   blk_mq_flush_plug_list+0x7d6/0x81b
[  106.124042]   blk_flush_plug_list+0x392/0x3d7
[  106.124557]   blk_finish_plug+0x37/0x4f
[  106.125019]   read_pages+0x3ef/0x430
[  106.125446]   __do_page_cache_readahead+0x18e/0x2fc
[  106.126027]   force_page_cache_readahead+0x121/0x133
[  106.126621]   page_cache_sync_readahead+0x35f/0x3bb
[  106.127229]   generic_file_buffered_read+0x410/0x1860
[  106.127932]   __vfs_read+0x319/0x38f
[  106.128415]   vfs_read+0xd2/0x19a
[  106.128817]   ksys_read+0xb9/0x135
[  106.129225]   do_syscall_64+0x140/0x385
[  106.129684]   entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  106.130292]
[  106.130292] other info that might help us debug this:
[  106.130292]
[  106.131226] Chain exists of:
[  106.131226]   &sbq->ws[i].wait -->
&(&hctx->dispatch_wait_lock)->rlock -->
&(&sb->map[i].swap_lock)->rlock
[  106.131226]
[  106.132865]  Possible interrupt unsafe locking scenario:
[  106.132865]
[  106.133659]        CPU0                    CPU1
[  106.134194]        ----                    ----
[  106.134733]   lock(&(&sb->map[i].swap_lock)->rlock);
[  106.135318]                                local_irq_disable();
[  106.136014]                                lock(&sbq->ws[i].wait);
[  106.136747]
lock(&(&hctx->dispatch_wait_lock)->rlock);
[  106.137742]   <Interrupt>
[  106.138110]     lock(&sbq->ws[i].wait);
[  106.138625]
[  106.138625]  *** DEADLOCK ***
[  106.138625]
[  106.139430] 3 locks held by fio/1043:
[  106.139947]  #0: 0000000076ff0fd9 (rcu_read_lock){....}, at:
hctx_lock+0x29/0xe8
[  106.140813]  #1: 000000002feb1016 (&sbq->ws[i].wait){..-.}, at:
blk_mq_dispatch_rq_list+0x4ad/0xd7c
[  106.141877]  #2: 000000007eec8b2f
(&(&hctx->dispatch_wait_lock)->rlock){....}, at:
blk_mq_dispatch_rq_list+0x4c1/0xd7c
[  106.143267]
[  106.143267] the dependencies between SOFTIRQ-irq-safe lock and the
holding lock:
[  106.144351]  -> (&sbq->ws[i].wait){..-.} ops: 82 {
[  106.144926]     IN-SOFTIRQ-W at:
[  106.145314]                       _raw_spin_lock_irqsave+0x4b/0x82
[  106.146042]                       __wake_up_common_lock+0x119/0x1b9
[  106.146785]                       sbitmap_queue_wake_up+0x33f/0x383
[  106.147567]                       sbitmap_queue_clear+0x4c/0x9a
[  106.148379]                       __blk_mq_free_request+0x188/0x1d3
[  106.149148]                       blk_mq_free_request+0x23b/0x26b
[  106.149864]                       scsi_end_request+0x345/0x5d7
[  106.150546]                       scsi_io_completion+0x4b5/0x8f0
[  106.151367]                       scsi_finish_command+0x412/0x456
[  106.152157]                       scsi_softirq_done+0x23f/0x29b
[  106.152855]                       blk_done_softirq+0x2a7/0x2e6
[  106.153537]                       __do_softirq+0x360/0x6ad
[  106.154280]                       run_ksoftirqd+0x2f/0x5b
[  106.155020]                       smpboot_thread_fn+0x3a5/0x3db
[  106.155828]                       kthread+0x1d4/0x1e4
[  106.156526]                       ret_from_fork+0x3a/0x50
[  106.157267]     INITIAL USE at:
[  106.157713]                      _raw_spin_lock_irqsave+0x4b/0x82
[  106.158542]                      prepare_to_wait_exclusive+0xa8/0x215
[  106.159421]                      blk_mq_get_tag+0x34f/0x6e4
[  106.160186]                      blk_mq_get_request+0x48e/0xaef
[  106.160997]                      blk_mq_make_request+0x27e/0xbd2
[  106.161828]                      generic_make_request+0x4d1/0x873
[  106.162661]                      submit_bio+0x20c/0x253
[  106.163379]                      mpage_bio_submit+0x44/0x4b
[  106.164142]                      mpage_readpages+0x3c2/0x407
[  106.164919]                      read_pages+0x13a/0x430
[  106.165633]                      __do_page_cache_readahead+0x18e/0x2fc
[  106.166530]                      force_page_cache_readahead+0x121/0x133
[  106.167439]                      page_cache_sync_readahead+0x35f/0x3bb
[  106.168337]                      generic_file_buffered_read+0x410/0x1860
[  106.169255]                      __vfs_read+0x319/0x38f
[  106.169977]                      vfs_read+0xd2/0x19a
[  106.170662]                      ksys_read+0xb9/0x135
[  106.171356]                      do_syscall_64+0x140/0x385
[  106.172120]                      entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  106.173051]   }
[  106.173308]   ... key      at: [<ffffffff85094600>] __key.26481+0x0/0x40
[  106.174219]   ... acquired at:
[  106.174646]    _raw_spin_lock+0x33/0x64
[  106.175183]    blk_mq_dispatch_rq_list+0x4c1/0xd7c
[  106.175843]    blk_mq_do_dispatch_sched+0x23a/0x287
[  106.176518]    blk_mq_sched_dispatch_requests+0x379/0x3fc
[  106.177262]    __blk_mq_run_hw_queue+0x137/0x17e
[  106.177900]    __blk_mq_delay_run_hw_queue+0x80/0x25f
[  106.178591]    blk_mq_run_hw_queue+0x151/0x187
[  106.179207]    blk_mq_sched_insert_requests+0x13f/0x175
[  106.179926]    blk_mq_flush_plug_list+0x7d6/0x81b
[  106.180571]    blk_flush_plug_list+0x392/0x3d7
[  106.181187]    blk_finish_plug+0x37/0x4f
[  106.181737]    __se_sys_io_submit+0x171/0x304
[  106.182346]    do_syscall_64+0x140/0x385
[  106.182895]    entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  106.183607]
[  106.183830] -> (&(&hctx->dispatch_wait_lock)->rlock){....} ops: 1 {
[  106.184691]    INITIAL USE at:
[  106.185119]                    _raw_spin_lock+0x33/0x64
[  106.185838]                    blk_mq_dispatch_rq_list+0x4c1/0xd7c
[  106.186697]                    blk_mq_do_dispatch_sched+0x23a/0x287
[  106.187551]                    blk_mq_sched_dispatch_requests+0x379/0x3fc
[  106.188481]                    __blk_mq_run_hw_queue+0x137/0x17e
[  106.189307]                    __blk_mq_delay_run_hw_queue+0x80/0x25f
[  106.190189]                    blk_mq_run_hw_queue+0x151/0x187
[  106.190989]                    blk_mq_sched_insert_requests+0x13f/0x175
[  106.191902]                    blk_mq_flush_plug_list+0x7d6/0x81b
[  106.192739]                    blk_flush_plug_list+0x392/0x3d7
[  106.193535]                    blk_finish_plug+0x37/0x4f
[  106.194269]                    __se_sys_io_submit+0x171/0x304
[  106.195059]                    do_syscall_64+0x140/0x385
[  106.195794]                    entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  106.196705]  }
[  106.196950]  ... key      at: [<ffffffff84880620>] __key.51231+0x0/0x40
[  106.197853]  ... acquired at:
[  106.198270]    lock_acquire+0x280/0x2f3
[  106.198806]    _raw_spin_lock+0x33/0x64
[  106.199337]    sbitmap_get+0xd5/0x22c
[  106.199850]    __sbitmap_queue_get+0xe8/0x177
[  106.200450]    __blk_mq_get_tag+0x1e6/0x22d
[  106.201035]    blk_mq_get_tag+0x1db/0x6e4
[  106.201589]    blk_mq_get_driver_tag+0x161/0x258
[  106.202237]    blk_mq_dispatch_rq_list+0x5b9/0xd7c
[  106.202902]    blk_mq_do_dispatch_sched+0x23a/0x287
[  106.203572]    blk_mq_sched_dispatch_requests+0x379/0x3fc
[  106.204316]    __blk_mq_run_hw_queue+0x137/0x17e
[  106.204956]    __blk_mq_delay_run_hw_queue+0x80/0x25f
[  106.205649]    blk_mq_run_hw_queue+0x151/0x187
[  106.206269]    blk_mq_sched_insert_requests+0x13f/0x175
[  106.206997]    blk_mq_flush_plug_list+0x7d6/0x81b
[  106.207644]    blk_flush_plug_list+0x392/0x3d7
[  106.208264]    blk_finish_plug+0x37/0x4f
[  106.208814]    __se_sys_io_submit+0x171/0x304
[  106.209415]    do_syscall_64+0x140/0x385
[  106.209965]    entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  106.210684]
[  106.210904]
[  106.210904] the dependencies between the lock to be acquired
[  106.210905]  and SOFTIRQ-irq-unsafe lock:
[  106.212541] -> (&(&sb->map[i].swap_lock)->rlock){+.+.} ops: 1969 {
[  106.213393]    HARDIRQ-ON-W at:
[  106.213840]                     _raw_spin_lock+0x33/0x64
[  106.214570]                     sbitmap_get+0xd5/0x22c
[  106.215282]                     __sbitmap_queue_get+0xe8/0x177
[  106.216086]                     __blk_mq_get_tag+0x1e6/0x22d
[  106.216876]                     blk_mq_get_tag+0x1db/0x6e4
[  106.217627]                     blk_mq_get_driver_tag+0x161/0x258
[  106.218465]                     blk_mq_dispatch_rq_list+0x28e/0xd7c
[  106.219326]                     blk_mq_do_dispatch_sched+0x23a/0x287
[  106.220198]                     blk_mq_sched_dispatch_requests+0x379/0x3fc
[  106.221138]                     __blk_mq_run_hw_queue+0x137/0x17e
[  106.221975]                     __blk_mq_delay_run_hw_queue+0x80/0x25f
[  106.222874]                     blk_mq_run_hw_queue+0x151/0x187
[  106.223686]                     blk_mq_sched_insert_requests+0x13f/0x175
[  106.224597]                     blk_mq_flush_plug_list+0x7d6/0x81b
[  106.225444]                     blk_flush_plug_list+0x392/0x3d7
[  106.226255]                     blk_finish_plug+0x37/0x4f
[  106.227006]                     read_pages+0x3ef/0x430
[  106.227717]                     __do_page_cache_readahead+0x18e/0x2fc
[  106.228595]                     force_page_cache_readahead+0x121/0x133
[  106.229491]                     page_cache_sync_readahead+0x35f/0x3bb
[  106.230373]                     generic_file_buffered_read+0x410/0x1860
[  106.231277]                     __vfs_read+0x319/0x38f
[  106.231986]                     vfs_read+0xd2/0x19a
[  106.232666]                     ksys_read+0xb9/0x135
[  106.233350]                     do_syscall_64+0x140/0x385
[  106.234097]                     entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  106.235012]    SOFTIRQ-ON-W at:
[  106.235460]                     _raw_spin_lock+0x33/0x64
[  106.236195]                     sbitmap_get+0xd5/0x22c
[  106.236913]                     __sbitmap_queue_get+0xe8/0x177
[  106.237715]                     __blk_mq_get_tag+0x1e6/0x22d
[  106.238488]                     blk_mq_get_tag+0x1db/0x6e4
[  106.239244]                     blk_mq_get_driver_tag+0x161/0x258
[  106.240079]                     blk_mq_dispatch_rq_list+0x28e/0xd7c
[  106.240937]                     blk_mq_do_dispatch_sched+0x23a/0x287
[  106.241806]                     blk_mq_sched_dispatch_requests+0x379/0x3fc
[  106.242751]                     __blk_mq_run_hw_queue+0x137/0x17e
[  106.243579]                     __blk_mq_delay_run_hw_queue+0x80/0x25f
[  106.244469]                     blk_mq_run_hw_queue+0x151/0x187
[  106.245277]                     blk_mq_sched_insert_requests+0x13f/0x175
[  106.246191]                     blk_mq_flush_plug_list+0x7d6/0x81b
[  106.247044]                     blk_flush_plug_list+0x392/0x3d7
[  106.247859]                     blk_finish_plug+0x37/0x4f
[  106.248749]                     read_pages+0x3ef/0x430
[  106.249463]                     __do_page_cache_readahead+0x18e/0x2fc
[  106.250357]                     force_page_cache_readahead+0x121/0x133
[  106.251263]                     page_cache_sync_readahead+0x35f/0x3bb
[  106.252157]                     generic_file_buffered_read+0x410/0x1860
[  106.253084]                     __vfs_read+0x319/0x38f
[  106.253808]                     vfs_read+0xd2/0x19a
[  106.254488]                     ksys_read+0xb9/0x135
[  106.255186]                     do_syscall_64+0x140/0x385
[  106.255943]                     entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  106.256867]    INITIAL USE at:
[  106.257300]                    _raw_spin_lock+0x33/0x64
[  106.258033]                    sbitmap_get+0xd5/0x22c
[  106.258747]                    __sbitmap_queue_get+0xe8/0x177
[  106.259542]                    __blk_mq_get_tag+0x1e6/0x22d
[  106.260320]                    blk_mq_get_tag+0x1db/0x6e4
[  106.261072]                    blk_mq_get_driver_tag+0x161/0x258
[  106.261902]                    blk_mq_dispatch_rq_list+0x28e/0xd7c
[  106.262762]                    blk_mq_do_dispatch_sched+0x23a/0x287
[  106.263626]                    blk_mq_sched_dispatch_requests+0x379/0x3fc
[  106.264571]                    __blk_mq_run_hw_queue+0x137/0x17e
[  106.265409]                    __blk_mq_delay_run_hw_queue+0x80/0x25f
[  106.266302]                    blk_mq_run_hw_queue+0x151/0x187
[  106.267111]                    blk_mq_sched_insert_requests+0x13f/0x175
[  106.268028]                    blk_mq_flush_plug_list+0x7d6/0x81b
[  106.268878]                    blk_flush_plug_list+0x392/0x3d7
[  106.269694]                    blk_finish_plug+0x37/0x4f
[  106.270432]                    read_pages+0x3ef/0x430
[  106.271139]                    __do_page_cache_readahead+0x18e/0x2fc
[  106.272040]                    force_page_cache_readahead+0x121/0x133
[  106.272932]                    page_cache_sync_readahead+0x35f/0x3bb
[  106.273811]                    generic_file_buffered_read+0x410/0x1860
[  106.274709]                    __vfs_read+0x319/0x38f
[  106.275407]                    vfs_read+0xd2/0x19a
[  106.276074]                    ksys_read+0xb9/0x135
[  106.276764]                    do_syscall_64+0x140/0x385
[  106.277500]                    entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  106.278417]  }
[  106.278676]  ... key      at: [<ffffffff85094640>] __key.26212+0x0/0x40
[  106.279586]  ... acquired at:
[  106.280026]    lock_acquire+0x280/0x2f3
[  106.280559]    _raw_spin_lock+0x33/0x64
[  106.281101]    sbitmap_get+0xd5/0x22c
[  106.281610]    __sbitmap_queue_get+0xe8/0x177
[  106.282221]    __blk_mq_get_tag+0x1e6/0x22d
[  106.282809]    blk_mq_get_tag+0x1db/0x6e4
[  106.283368]    blk_mq_get_driver_tag+0x161/0x258
[  106.284018]    blk_mq_dispatch_rq_list+0x5b9/0xd7c
[  106.284685]    blk_mq_do_dispatch_sched+0x23a/0x287
[  106.285371]    blk_mq_sched_dispatch_requests+0x379/0x3fc
[  106.286135]    __blk_mq_run_hw_queue+0x137/0x17e
[  106.286806]    __blk_mq_delay_run_hw_queue+0x80/0x25f
[  106.287515]    blk_mq_run_hw_queue+0x151/0x187
[  106.288149]    blk_mq_sched_insert_requests+0x13f/0x175
[  106.289041]    blk_mq_flush_plug_list+0x7d6/0x81b
[  106.289912]    blk_flush_plug_list+0x392/0x3d7
[  106.290590]    blk_finish_plug+0x37/0x4f
[  106.291238]    __se_sys_io_submit+0x171/0x304
[  106.291864]    do_syscall_64+0x140/0x385
[  106.292534]    entry_SYSCALL_64_after_hwframe+0x49/0xbe

Reported-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-09 17:43:20 -07:00
Linus Torvalds
40e020c129 Linux 4.20-rc6 2018-12-09 15:31:00 -08:00
Linus Torvalds
d48f782e4f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "A decent batch of fixes here. I'd say about half are for problems that
  have existed for a while, and half are for new regressions added in
  the 4.20 merge window.

   1) Fix 10G SFP phy module detection in mvpp2, from Baruch Siach.

   2) Revert bogus emac driver change, from Benjamin Herrenschmidt.

   3) Handle BPF exported data structure with pointers when building
      32-bit userland, from Daniel Borkmann.

   4) Memory leak fix in act_police, from Davide Caratti.

   5) Check RX checksum offload in RX descriptors properly in aquantia
      driver, from Dmitry Bogdanov.

   6) SKB unlink fix in various spots, from Edward Cree.

   7) ndo_dflt_fdb_dump() only works with ethernet, enforce this, from
      Eric Dumazet.

   8) Fix FID leak in mlxsw driver, from Ido Schimmel.

   9) IOTLB locking fix in vhost, from Jean-Philippe Brucker.

  10) Fix SKB truesize accounting in ipv4/ipv6/netfilter frag memory
      limits otherwise namespace exit can hang. From Jiri Wiesner.

  11) Address block parsing length fixes in x25 from Martin Schiller.

  12) IRQ and ring accounting fixes in bnxt_en, from Michael Chan.

  13) For tun interfaces, only iface delete works with rtnl ops, enforce
      this by disallowing add. From Nicolas Dichtel.

  14) Use after free in liquidio, from Pan Bian.

  15) Fix SKB use after passing to netif_receive_skb(), from Prashant
      Bhole.

  16) Static key accounting and other fixes in XPS from Sabrina Dubroca.

  17) Partially initialized flow key passed to ip6_route_output(), from
      Shmulik Ladkani.

  18) Fix RTNL deadlock during reset in ibmvnic driver, from Thomas
      Falcon.

  19) Several small TCP fixes (off-by-one on window probe abort, NULL
      deref in tail loss probe, SNMP mis-estimations) from Yuchung
      Cheng"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (93 commits)
  net/sched: cls_flower: Reject duplicated rules also under skip_sw
  bnxt_en: Fix _bnxt_get_max_rings() for 57500 chips.
  bnxt_en: Fix NQ/CP rings accounting on the new 57500 chips.
  bnxt_en: Keep track of reserved IRQs.
  bnxt_en: Fix CNP CoS queue regression.
  net/mlx4_core: Correctly set PFC param if global pause is turned off.
  Revert "net/ibm/emac: wrong bit is used for STA control"
  neighbour: Avoid writing before skb->head in neigh_hh_output()
  ipv6: Check available headroom in ip6_xmit() even without options
  tcp: lack of available data can also cause TSO defer
  ipv6: sr: properly initialize flowi6 prior passing to ip6_route_output
  mlxsw: spectrum_switchdev: Fix VLAN device deletion via ioctl
  mlxsw: spectrum_router: Relax GRE decap matching check
  mlxsw: spectrum_switchdev: Avoid leaking FID's reference count
  mlxsw: spectrum_nve: Remove easily triggerable warnings
  ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
  sctp: frag_point sanity check
  tcp: fix NULL ref in tail loss probe
  tcp: Do not underestimate rwnd_limited
  net: use skb_list_del_init() to remove from RX sublists
  ...
2018-12-09 15:12:33 -08:00
Linus Torvalds
8586ca8a21 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Three fixes: a boot parameter re-(re-)fix, a retpoline build artifact
  fix and an LLVM workaround"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vdso: Drop implicit common-page-size linker flag
  x86/build: Fix compiler support check for CONFIG_RETPOLINE
  x86/boot: Clear RSDP address in boot_params for broken loaders
2018-12-09 15:09:55 -08:00
Linus Torvalds
ebbd30004d Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull kprobes fixes from Ingo Molnar:
 "Two kprobes fixes: a blacklist fix and an instruction patching related
  corruption fix"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kprobes/x86: Blacklist non-attachable interrupt functions
  kprobes/x86: Fix instruction patching corruption when copying more than one RIP-relative instruction
2018-12-09 14:21:33 -08:00
Linus Torvalds
4b04e73a78 Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI fixes from Ingo Molnar:
 "Two fixes: a large-system fix and an earlyprintk fix with certain
  resolutions"

* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/earlyprintk/efi: Fix infinite loop on some screen widths
  x86/efi: Allocate e820 buffer before calling efi_exit_boot_service
2018-12-09 14:03:56 -08:00
Or Gerlitz
35cc3cefc4 net/sched: cls_flower: Reject duplicated rules also under skip_sw
Currently, duplicated rules are rejected only for skip_hw or "none",
hence allowing users to push duplicates into HW for no reason.

Use the flower tables to protect for that.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reported-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-09 11:55:08 -08:00
David S. Miller
d4b60e94e9 Merge branch 'bnxt_en-Bug-fixes'
Michael Chan says:

====================
bnxt_en: Bug fixes.

The first patch fixes a regression on CoS queue setup, introduced
recently by the 57500 new chip support patches.  The rest are
fixes related to ring and resource accounting on the new 57500 chips.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-09 11:46:59 -08:00
Michael Chan
e30fbc3319 bnxt_en: Fix _bnxt_get_max_rings() for 57500 chips.
The CP rings are accounted differently on the new 57500 chips.  There
must be enough CP rings for the sum of RX and TX rings on the new
chips.  The current logic may be over-estimating the RX and TX rings.

The output parameter max_cp should be the maximum NQs capped by
MSIX vectors available for networking in the context of 57500 chips.
The existing code which uses CMPL rings capped by the MSIX vectors
works most of the time but is not always correct.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-09 11:46:58 -08:00
Michael Chan
c0b8cda05e bnxt_en: Fix NQ/CP rings accounting on the new 57500 chips.
The new 57500 chips have introduced the NQ structure in addition to
the existing CP rings in all chips.  We need to introduce a new
bnxt_nq_rings_in_use().  On legacy chips, the 2 functions are the
same and one will just call the other.  On the new chips, they
refer to the 2 separate ring structures.  The new function is now
called to determine the resource (NQ or CP rings) associated with
MSIX that are in use.

On 57500 chips, the RDMA driver does not use the CP rings so
we don't need to do the subtraction adjustment.

Fixes: 41e8d79837 ("bnxt_en: Modify the ring reservation functions for 57500 series chips.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-09 11:46:58 -08:00
Michael Chan
75720e6323 bnxt_en: Keep track of reserved IRQs.
The new 57500 chips use 1 NQ per MSIX vector, whereas legacy chips use
1 CP ring per MSIX vector.  To better unify this, add a resv_irqs
field to struct bnxt_hw_resc.  On legacy chips, we initialize resv_irqs
with resv_cp_rings.  On new chips, we initialize it with the allocated
MSIX resources.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-09 11:46:58 -08:00