linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-08-25 02:09:42 +00:00

Author	SHA1	Message	Date
Shaohua Li	53696b8d21	blk-throttle: add latency target support One hard problem adding .low limit is to detect idle cgroup. If one cgroup doesn't dispatch enough IO against its low limit, we must have a mechanism to determine if other cgroups dispatch more IO. We added the think time detection mechanism before, but it doesn't work for all workloads. Here we add a latency based approach. We already have mechanism to calculate latency threshold for each IO size. For every IO dispatched from a cgorup, we compare its latency against its threshold and record the info. If most IO latency is below threshold (in the code I use 75%), the cgroup could be treated idle and other cgroups can dispatch more IO. Currently this latency target check is only for SSD as we can't calcualte the latency target for hard disk. And this is only for cgroup leaf node so far. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-28 08:02:20 -06:00
Shaohua Li	b9147dd1ba	blk-throttle: add a mechanism to estimate IO latency User configures latency target, but the latency threshold for each request size isn't fixed. For a SSD, the IO latency highly depends on request size. To calculate latency threshold, we sample some data, eg, average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency threshold of each request size will be the sample latency (I'll call it base latency) plus latency target. For example, the base latency for request size 4k is 80us and user configures latency target 60us. The 4k latency threshold will be 80 + 60 = 140us. To sample data, we calculate the order base 2 of rounded up IO sectors. If the IO size is bigger than 1M, it will be accounted as 1M. Since the calculation does round up, the base latency will be slightly smaller than actual value. Also if there isn't any IO dispatched for a specific IO size, we will use the base latency of smaller IO size for this IO size. But we shouldn't sample data at any time. The base latency is supposed to be latency where disk isn't congested, because we use latency threshold to schedule IOs between cgroups. If disk is congested, the latency is higher, using it for scheduling is meaningless. Hence we only do the sampling when block throttling is in the LOW limit, with assumption disk isn't congested in such state. If the assumption isn't true, eg, low limit is too high, calculated latency threshold will be higher. Hard disk is completely different. Latency depends on spindle seek instead of request size. Currently this feature is SSD only, we probably can use a fixed threshold like 4ms for hard disk though. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-28 08:02:20 -06:00
Shaohua Li	88eeca495b	block: track request size in blk_issue_stat Currently there is no way to know the request size when the request is finished. Next patch will need this info. We could add extra field to record the size, but blk_issue_stat has enough space to record it, so this patch just overloads blk_issue_stat. With this, we will have 49bits to track time, which still is very long time. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-28 08:02:20 -06:00
Shaohua Li	ec80991d6f	blk-throttle: add interface for per-cgroup target latency Here we introduce per-cgroup latency target. The target determines how a cgroup can afford latency increasement. We will use the target latency to calculate a threshold and use it to schedule IO for cgroups. If a cgroup's bandwidth is below its low limit but its average latency is below the threshold, other cgroups can safely dispatch more IO even their bandwidth is higher than their low limits. On the other hand, if the first cgroup's latency is higher than the threshold, other cgroups are throttled to their low limits. So the target latency determines how we efficiently utilize free disk resource without sacifice of worload's IO latency. For example, assume 4k IO average latency is 50us when disk isn't congested. A cgroup sets the target latency to 30us. Then the cgroup can accept 50+30=80us IO latency. If the cgroupt's average IO latency is 90us and its bandwidth is below low limit, other cgroups are throttled to their low limit. If the cgroup's average IO latency is 60us, other cgroups are allowed to dispatch more IO. When other cgroups dispatch more IO, the first cgroup's IO latency will increase. If it increases to 81us, we then throttle other cgroups. User will configure the interface in this way: echo "8:16 rbps=2097152 wbps=max latency=100 idle=200" > io.low latency is in microsecond unit By default, latency target is 0, which means to guarantee IO latency. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-28 08:02:20 -06:00
Shaohua Li	fa6fb5aab8	blk-throttle: ignore idle cgroup limit Last patch introduces a way to detect idle cgroup. We use it to make upgrade/downgrade decision. And the new algorithm can detect completely idle cgroup too, so we can delete the corresponding code. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-28 08:02:20 -06:00
Shaohua Li	ada75b6e5b	blk-throttle: add interface to configure idle time threshold Add interface to configure the threshold. The io.low interface will like: echo "8:16 rbps=2097152 wbps=max idle=2000" > io.low idle is in microsecond unit. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-28 08:02:20 -06:00
Shaohua Li	9e234eeafb	blk-throttle: add a simple idle detection A cgroup gets assigned a low limit, but the cgroup could never dispatch enough IO to cross the low limit. In such case, the queue state machine will remain in LIMIT_LOW state and all other cgroups will be throttled according to low limit. This is unfair for other cgroups. We should treat the cgroup idle and upgrade the state machine to lower state. We also have a downgrade logic. If the state machine upgrades because of cgroup idle (real idle), the state machine will downgrade soon as the cgroup is below its low limit. This isn't what we want. A more complicated case is cgroup isn't idle when queue is in LIMIT_LOW. But when queue gets upgraded to lower state, other cgroups could dispatch more IO and this cgroup can't dispatch enough IO, so the cgroup is below its low limit and looks like idle (fake idle). In this case, the queue should downgrade soon. The key to determine if we should do downgrade is to detect if cgroup is truely idle. Unfortunately it's very hard to determine if a cgroup is real idle. This patch uses the 'think time check' idea from CFQ for the purpose. Please note, the idea doesn't work for all workloads. For example, a workload with io depth 8 has disk utilization 100%, hence think time is 0, eg, not idle. But the workload can run higher bandwidth with io depth 16. Compared to io depth 16, the io depth 8 workload is idle. We use the idea to roughly determine if a cgroup is idle. We treat a cgroup idle if its think time is above a threshold (by default 1ms for SSD and 100ms for HD). The idea is think time above the threshold will start to harm performance. HD is much slower so a longer think time is ok. The patch (and the latter patches) uses 'unsigned long' to track time. We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses precision, should not a big deal. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-28 08:02:20 -06:00
Shaohua Li	7394e31fa4	blk-throttle: make bandwidth change smooth When cgroups all reach low limit, cgroups can dispatch more IO. This could make some cgroups dispatch more IO but others not, and even some cgroups could dispatch less IO than their low limit. For example, cg1 low limit 10MB/s, cg2 limit 80MB/s, assume disk maximum bandwidth is 120M/s for the workload. Their bps could something like this: cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80 At T1, all cgroups reach low limit, so they can dispatch more IO later. Then cg1 dispatch more IO and cg2 has no room to dispatch enough IO. At T2, cg2 only dispatches 60M/s. Since We detect cg2 dispatches less IO than its low limit 80M/s, we downgrade the queue from LIMIT_MAX to LIMIT_LOW, then all cgroups are throttled to their low limit (T3). cg2 will have bandwidth below its low limit at most time. The big problem here is we don't know the maximum bandwidth of the workload, so we can't make smart decision to avoid the situation. This patch makes cgroup bandwidth change smooth. After disk upgrades from LIMIT_LOW to LIMIT_MAX, we don't allow cgroups use all bandwidth upto their max limit immediately. Their bandwidth limit will be increased gradually to avoid above situation. So above example will became something like: cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80 -> 45/75 -> 22/98 In this way cgroups bandwidth will be above their limit in majority time, this still doesn't fully utilize disk bandwidth, but that's something we pay for sharing. Scale up is linear. The limit scales up 1/2 .low limit every throtl_slice after upgrade. The scale up will stop if the adjusted limit hits .max limit. Scale down is exponential. We cut the scale value half if a cgroup doesn't hit its .low limit. If the scale becomes 0, we then fully downgrade the queue to LIMIT_LOW state. Note this doesn't completely avoid cgroup running under its low limit. The best way to guarantee cgroup doesn't run under its limit is to set max limit. For example, if we set cg1 max limit to 40, cg2 will never run under its low limit. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-28 08:02:20 -06:00
Shaohua Li	aec242468c	blk-throttle: detect completed idle cgroup cgroup could be assigned a limit, but doesn't dispatch enough IO, eg the cgroup is idle. When this happens, the cgroup doesn't hit its limit, so we can't move the state machine to higher level and all cgroups will be throttled to their lower limit, so we waste bandwidth. Detecting idle cgroup is hard. This patch handles a simple case, a cgroup doesn't dispatch any IO. We ignore such cgroup's limit, so other cgroups can use the bandwidth. Please note this will be replaced with a more sophisticated algorithm later, but this demonstrates the idea how we handle idle cgroups, so I leave it here. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-28 08:02:20 -06:00
Shaohua Li	d61fcfa4bb	blk-throttle: choose a small throtl_slice for SSD The throtl_slice is 100ms by default. This is a long time for SSD, a lot of IO can run. To make cgroups have smoother throughput, we choose a small value (20ms) for SSD. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-28 08:02:20 -06:00
Shaohua Li	297e3d8547	blk-throttle: make throtl_slice tunable throtl_slice is important for blk-throttling. It's called slice internally but it really is a time window blk-throttling samples data. blk-throttling will make decision based on the samplings. An example is bandwidth measurement. A cgroup's bandwidth is measured in the time interval of throtl_slice. A small throtl_slice meanse cgroups have smoother throughput but burn more CPUs. It has 100ms default value, which is not appropriate for all disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch makes it tunable. Since throtl_slice isn't a time slice, the sysfs name 'throttle_sample_time' reflects its character better. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-28 08:02:20 -06:00
Shaohua Li	06cceedcca	blk-throttle: make sure expire time isn't too big cgroup could be throttled to a limit but when all cgroups cross high limit, queue enters a higher state and so the group should be throttled to a higher limit. It's possible the cgroup is sleeping because of throttle and other cgroups don't dispatch IO any more. In this case, nobody can trigger current downgrade/upgrade logic. To fix this issue, we could either set up a timer to wakeup the cgroup if other cgroups are idle or make sure this cgroup doesn't sleep too long. Setting up a timer means we must change the timer very frequently. This patch chooses the latter. Making cgroup sleep time not too big wouldn't change cgroup bps/iops, but could make it wakeup more frequently, which isn't a big issue because throtl_slice * 8 is already quite big. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-28 08:02:20 -06:00
Shaohua Li	3f0abd8066	blk-throttle: add downgrade logic When queue state machine is in LIMIT_MAX state, but a cgroup is below its low limit for some time, the queue should be downgraded to lower state as one cgroup's low limit isn't met. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-28 08:02:20 -06:00
Shaohua Li	c79892c557	blk-throttle: add upgrade logic for LIMIT_LOW state When queue is in LIMIT_LOW state and all cgroups with low limit cross the bps/iops limitation, we will upgrade queue's state to LIMIT_MAX. To determine if a cgroup exceeds its limitation, we check if the cgroup has pending request. Since cgroup is throttled according to the limit, pending request means the cgroup reaches the limit. If a cgroup has limit set for both read and write, we consider the combination of them for upgrade. The reason is read IO and write IO can interfere with each other. If we do the upgrade based in one direction IO, the other direction IO could be severly harmed. For a cgroup hierarchy, there are two cases. Children has lower low limit than parent. Parent's low limit is meaningless. If children's bps/iops cross low limit, we can upgrade queue state. The other case is children has higher low limit than parent. Children's low limit is meaningless. As long as parent's bps/iops (which is a sum of childrens bps/iops) cross low limit, we can upgrade queue state. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-28 08:02:20 -06:00
Shaohua Li	b22c417c88	blk-throttle: configure bps/iops limit for cgroup in low limit each queue will have a state machine. Initially queue is in LIMIT_LOW state, which means all cgroups will be throttled according to their low limit. After all cgroups with low limit cross the limit, the queue state gets upgraded to LIMIT_MAX state. For max limit, cgroup will use the limit configured by user. For low limit, cgroup will use the minimal value between low limit and max limit configured by user. If the minimal value is 0, which means the cgroup doesn't configure low limit, we will use max limit to throttle the cgroup and the cgroup is ready to upgrade to LIMIT_MAX Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-28 08:02:20 -06:00
Shaohua Li	cd5ab1b0fc	blk-throttle: add .low interface Add low limit for cgroup and corresponding cgroup interface. To be consistent with memcg, we allow users configure .low limit higher than .max limit. But the internal logic always assumes .low limit is lower than .max limit. So we add extra bps/iops_conf fields in throtl_grp for userspace configuration. Old bps/iops fields in throtl_grp will be the actual limit we use for throttling. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-28 08:02:20 -06:00
Shaohua Li	327ffb9b37	blk-throttle: add configure option for new .low interface As discussed in LSF, add configure option for the interface and mark it as experimental, so people can try/test. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-28 08:02:20 -06:00
Shaohua Li	9f626e372a	blk-throttle: prepare support multiple limits We are going to support low/max limit, each cgroup will have 2 limits after that. This patch prepares for the multiple limits change. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-28 08:02:20 -06:00
Shaohua Li	2ab5492de5	blk-throttle: use U64_MAX/UINT_MAX to replace -1 clean up the code to avoid using -1 Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-28 08:02:20 -06:00
Eric Biggers	869ab90f0a	block: constify struct blk_integrity_profile blk_integrity_profile's are never modified, so mark them 'const' so that they are placed in .rodata and benefit from memory protection. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-24 20:34:39 -06:00
Jens Axboe	93efe9817e	blk-mq: include errors in did_work calculation Currently we return true in blk_mq_dispatch_rq_list() if we queued IO successfully, but we really want to return whether or not the we made progress. Progress includes if we got an error return. If we don't, this can lead to a hang in blk_mq_sched_dispatch_requests() when a driver is draining IO by returning BLK_MQ_QUEUE_ERROR instead of manually ending the IO in error and return BLK_MQ_QUEUE_OK. Tested-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-24 15:42:47 -06:00
Eric Biggers	1be7d2073e	block: remove outdated part of blkdev_issue_flush() comment blkdev_issue_flush() is now always synchronous, and it no longer has a flags argument. So remove the part of the comment about the WAIT flag. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-24 15:41:30 -06:00
Eric Biggers	e554911c23	block: correct documentation for blkdev_issue_discard() flags BLKDEV_IFL_* flags no longer exist; blkdev_issue_discard() now actually takes BLKDEV_DISCARD_* flags. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-24 15:41:28 -06:00
Dan Carpenter	7a88fa1919	block: make nr_iovecs unsigned in bio_alloc_bioset() There isn't a bug here, but Smatch is not smart enough to know that "nr_iovecs" can't be negative so it complains about underflows. Really, it's slightly cleaner to make this parameter unsigned. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-23 08:16:11 -06:00
Christoph Hellwig	a4d907b6a3	blk-mq: streamline blk_mq_make_request Turn the different ways of merging or issuing I/O into a series of if/else statements instead of the current maze of gotos. Note that this means we pin the CPU a little longer for some cases as the CTX put is moved to common code at the end of the function. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-22 20:17:03 -06:00
Christoph Hellwig	2299722c4b	blk-mq: split the plug and sync cases in blk_mq_make_request Now that we have a nice direct issue heper this helps simplifying the code a bit, and also gets rid of the old_rq variable. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-22 20:17:01 -06:00
Christoph Hellwig	5eb6126e1c	blk-mq: improve blk_mq_try_issue_directly Rename blk_mq_try_issue_directly to __blk_mq_try_issue_directly and add a new wrapper that takes care of RCU / SRCU locking to avoid having boileplate code in the caller which would get duplicated with new callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-22 20:16:59 -06:00
Christoph Hellwig	254d259da0	blk-mq: merge mq and sq make_request instances They are mostly the same code anyway - this just one small conditional for the plug case that is different for both variants. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-22 20:16:58 -06:00
Christoph Hellwig	7642747d67	blk-mq: remove BLK_MQ_F_DEFER_ISSUE This flag was never used since it was introduced. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-22 20:16:56 -06:00
Jan Kara	d01b2dcb44	block: Fix oops scsi_disk_get() When device open races with device shutdown, we can get the following oops in scsi_disk_get(): [11863.044351] general protection fault: 0000 [#1] SMP [11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop] [11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W 4.10.0-rc2-xen+ #35 [11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [11863.048030] task: ffff88007f438200 task.stack: ffffc90000fd0000 [11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70 [11863.048030] RSP: 0018:ffffc90000fd3a08 EFLAGS: 00010202 [11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007f56d000 RCX: 0000000000000000 [11863.048030] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff81a8d880 [11863.048030] RBP: ffffc90000fd3a18 R08: 0000000000000000 R09: 0000000000000001 [11863.059217] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffa [11863.059217] R13: ffff880078872800 R14: ffff880070915540 R15: 000000000000001d [11863.059217] FS: 00007f2611f71800(0000) GS:ffff88007f0c0000(0000) knlGS:0000000000000000 [11863.059217] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [11863.059217] CR2: 000000000060e048 CR3: 00000000778d4000 CR4: 00000000000006e0 [11863.059217] Call Trace: [11863.059217] ? disk_get_part+0x22/0x1f0 [11863.059217] sd_open+0x39/0x130 [11863.059217] __blkdev_get+0x69/0x430 [11863.059217] ? bd_acquire+0x7f/0xc0 [11863.059217] ? bd_acquire+0x96/0xc0 [11863.059217] ? blkdev_get+0x350/0x350 [11863.059217] blkdev_get+0x126/0x350 [11863.059217] ? _raw_spin_unlock+0x2b/0x40 [11863.059217] ? bd_acquire+0x7f/0xc0 [11863.059217] ? blkdev_get+0x350/0x350 [11863.059217] blkdev_open+0x65/0x80 ... As you can see RAX value is already poisoned showing that gendisk we got is already freed. The problem is that get_gendisk() looks up device number in ext_devt_idr and then does get_disk() which does kobject_get() on the disks kobject. However the disk gets removed from ext_devt_idr only in disk_release() (through blk_free_devt()) at which moment it has already 0 refcount and is already on its way to be freed. Indeed we've got a warning from kobject_get() about 0 refcount shortly before the oops. We fix the problem by using kobject_get_unless_zero() in get_disk() so that get_disk() cannot get reference on a disk that is already being freed. Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-22 20:11:37 -06:00
Ming Lei	95a4960370	blk-mq: don't complete un-started request in timeout handler When iterating busy requests in timeout handler, if the STARTED flag of one request isn't set, that means the request is being processed in block layer or driver, and isn't submitted to hardware yet. In current implementation of blk_mq_check_expired(), if the request queue becomes dying, un-started requests are handled as being completed/freed immediately. This way is wrong, and can cause rq corruption or double allocation[1][2], when doing I/O and removing&resetting NVMe device at the sametime. This patch fixes several issues reported by Yi Zhang. [1]. oops log 1 [ 581.789754] ------------[ cut here ]------------ [ 581.789758] kernel BUG at block/blk-mq.c:374! [ 581.789760] invalid opcode: 0000 [#1] SMP [ 581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror dm_region_hash dm_log dm_mod [ 581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4 [ 581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016 [ 581.789804] Workqueue: kblockd blk_mq_timeout_work [ 581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000 [ 581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70 [ 581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202 [ 581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88 [ 581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000 [ 581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00 [ 581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb [ 581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80 [ 581.789817] FS: 0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000 [ 581.789818] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0 [ 581.789820] Call Trace: [ 581.789825] blk_mq_check_expired+0x76/0x80 [ 581.789828] bt_iter+0x45/0x50 [ 581.789830] blk_mq_queue_tag_busy_iter+0xdd/0x1f0 [ 581.789832] ? blk_mq_rq_timed_out+0x70/0x70 [ 581.789833] ? blk_mq_rq_timed_out+0x70/0x70 [ 581.789840] ? __switch_to+0x140/0x450 [ 581.789841] blk_mq_timeout_work+0x88/0x170 [ 581.789845] process_one_work+0x165/0x410 [ 581.789847] worker_thread+0x137/0x4c0 [ 581.789851] kthread+0x101/0x140 [ 581.789853] ? rescuer_thread+0x3b0/0x3b0 [ 581.789855] ? kthread_park+0x90/0x90 [ 581.789860] ret_from_fork+0x2c/0x40 [ 581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48 8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f> 0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00 [ 581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50 [ 581.789889] ---[ end trace bcaf03d9a14a0a70 ]--- [2]. oops log2 [ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 [ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme] [ 6984.857373] PGD 0 [ 6984.857374] [ 6984.857376] Oops: 0000 [#1] SMP [ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror dm_region_hash dm_log dm_mod [ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted 4.10.0-2.el7.bz1420297.x86_64 #1 [ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016 [ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn [ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000 [ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme] [ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246 [ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000 [ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950 [ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000 [ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000 [ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780 [ 6984.857439] FS: 0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000 [ 6984.857440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0 [ 6984.857443] Call Trace: [ 6984.857451] ? mempool_free+0x2b/0x80 [ 6984.857455] ? bio_free+0x4e/0x60 [ 6984.857459] blk_mq_dispatch_rq_list+0xf5/0x230 [ 6984.857462] blk_mq_process_rq_list+0x133/0x170 [ 6984.857465] __blk_mq_run_hw_queue+0x8c/0xa0 [ 6984.857467] blk_mq_run_work_fn+0x12/0x20 [ 6984.857473] process_one_work+0x165/0x410 [ 6984.857475] worker_thread+0x137/0x4c0 [ 6984.857478] kthread+0x101/0x140 [ 6984.857480] ? rescuer_thread+0x3b0/0x3b0 [ 6984.857481] ? kthread_park+0x90/0x90 [ 6984.857489] ret_from_fork+0x2c/0x40 [ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44 89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c> 8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff [ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50 [ 6984.857512] CR2: 0000000000000010 [ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]--- Cc: stable@vger.kernel.org Reported-by: Yi Zhang <yizhan@redhat.com> Tested-by: Yi Zhang <yizhan@redhat.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-22 08:03:35 -06:00
Jens Axboe	a83b576c9c	block: fix stacked driver stats init and free If a driver allocates a queue for stacked usage, then it does not currently get stats allocated. This causes the later init of, eg, writeback throttling to blow up. Move the init to the queue allocation instead. Additionally, allow a NULL callback unregistration. This avoids having the caller check for that, fixing another oops on removal of a block device that doesn't have poll stats allocated. Fixes: `34dbad5d26` ("blk-stat: convert to callback-based statistics reporting") Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-21 17:20:01 -06:00
Omar Sandoval	34dbad5d26	blk-stat: convert to callback-based statistics reporting Currently, statistics are gathered in ~0.13s windows, and users grab the statistics whenever they need them. This is not ideal for both in-tree users: 1. Writeback throttling wants its own dynamically sized window of statistics. Since the blk-stats statistics are reset after every window and the wbt windows don't line up with the blk-stats windows, wbt doesn't see every I/O. 2. Polling currently grabs the statistics on every I/O. Again, depending on how the window lines up, we may miss some I/Os. It's also unnecessary overhead to get the statistics on every I/O; the hybrid polling heuristic would be just as happy with the statistics from the previous full window. This reworks the blk-stats infrastructure to be callback-based: users register a callback that they want called at a given time with all of the statistics from the window during which the callback was active. Users can dynamically bucketize the statistics. wbt and polling both currently use read vs. write, but polling can be extended to further subdivide based on request size. The callbacks are kept on an RCU list, and each callback has percpu stats buffers. There will only be a few users, so the overhead on the I/O completion side is low. The stats flushing is also simplified considerably: since the timer function is responsible for clearing the statistics, we don't have to worry about stale statistics. wbt is a trivial conversion. After the conversion, the windowing problem mentioned above is fixed. For polling, we register an extra callback that caches the previous window's statistics in the struct request_queue for the hybrid polling heuristic to use. Since we no longer have a single stats buffer for the request queue, this also removes the sysfs and debugfs stats entries. To replace those, we add a debugfs entry for the poll statistics. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-21 10:03:11 -06:00
Omar Sandoval	4875253fdd	blk-stat: move BLK_RQ_STAT_BATCH definition to blk-stat.c This is an implementation detail that no-one outside of blk-stat.c uses. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-21 10:03:08 -06:00
Omar Sandoval	fa2e39cb9e	blk-stat: use READ and WRITE instead of BLK_STAT_{READ,WRITE} The stats buckets will become generic soon, so make the existing users use the common READ and WRITE definitions instead of one internal to blk-stat. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-21 10:03:08 -06:00
Omar Sandoval	0315b15908	block: remove extra calls to wbt_exit() We always call wbt_exit() from blk_release_queue(), so these are unnecessary. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-21 10:03:08 -06:00
Omar Sandoval	7d8d001407	blk-stat: fix blk_stat_sum() if all samples are batched We need to flush the batch _before_ we check the number of samples, otherwise we'll miss all of the batched samples. Fixes: `cf43e6b` ("block: add scalable completion tracking of requests") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-21 10:00:55 -06:00
Omar Sandoval	efd4b81abb	blk-stat: fix blk_stat_sum() if all samples are batched We need to flush the batch _before_ we check the number of samples, otherwise we'll miss all of the batched samples. Fixes: `cf43e6b` ("block: add scalable completion tracking of requests") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-16 09:46:14 -06:00
Jens Axboe	9c62110454	blk-mq-sched: don't run the queue async from blk_mq_try_issue_directly() If we have scheduling enabled, we jump directly to insert-and-run. That's fine, but we run the queue async and we don't pass in information on whether we can block from this context or not. Fixup both these cases. Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-14 11:51:59 -06:00
Sagi Grimberg	0067d4b020	blk-mq: Fix tagset reinit in the presence of cpu hot-unplug In case cpu was unplugged, we need to make sure not to assume that the tags for that cpu are still allocated. so check for null tags when reinitializing a tagset. Reported-by: Yi Zhang <yizhan@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-13 08:14:23 -06:00
NeilBrown	f5fe1b5190	blk: Ensure users for current->bio_list can see the full list. Commit `79bd99596b` ("blk: improve order of bio handling in generic_make_request()") changed current->bio_list so that it did not contain all of the queued bios, but only those submitted by the currently running make_request_fn. There are two places which walk the list and requeue selected bios, and others that check if the list is empty. These are no longer correct. So redefine current->bio_list to point to an array of two lists, which contain all queued bios, and adjust various code to test or walk both lists. Signed-off-by: NeilBrown <neilb@suse.com> Fixes: `79bd99596b` ("blk: improve order of bio handling in generic_make_request()") Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-11 15:31:37 -07:00
NeilBrown	79bd99596b	blk: improve order of bio handling in generic_make_request() To avoid recursion on the kernel stack when stacked block devices are in use, generic_make_request() will, when called recursively, queue new requests for later handling. They will be handled when the make_request_fn for the current bio completes. If any bios are submitted by a make_request_fn, these will ultimately be handled seqeuntially. If the handling of one of those generates further requests, they will be added to the end of the queue. This strict first-in-first-out behaviour can lead to deadlocks in various ways, normally because a request might need to wait for a previous request to the same device to complete. This can happen when they share a mempool, and can happen due to interdependencies particular to the device. Both md and dm have examples where this happens. These deadlocks can be erradicated by more selective ordering of bios. Specifically by handling them in depth-first order. That is: when the handling of one bio generates one or more further bios, they are handled immediately after the parent, before any siblings of the parent. That way, when generic_make_request() calls make_request_fn for some particular device, we can be certain that all previously submited requests for that device have been completely handled and are not waiting for anything in the queue of requests maintained in generic_make_request(). An easy way to achieve this would be to use a last-in-first-out stack instead of a queue. However this will change the order of consecutive bios submitted by a make_request_fn, which could have unexpected consequences. Instead we take a slightly more complex approach. A fresh queue is created for each call to a make_request_fn. After it completes, any bios for a different device are placed on the front of the main queue, followed by any bios for the same device, followed by all bios that were already on the queue before the make_request_fn was called. This provides the depth-first approach without reordering bios on the same level. This, by itself, it not enough to remove all deadlocks. It just makes it possible for drivers to take the extra step required themselves. To avoid deadlocks, drivers must never risk waiting for a request after submitting one to generic_make_request. This includes never allocing from a mempool twice in the one call to a make_request_fn. A common pattern in drivers is to call bio_split() in a loop, handling the first part and then looping around to possibly split the next part. Instead, a driver that finds it needs to split a bio should queue (with generic_make_request) the second part, handle the first part, and then return. The new code in generic_make_request will ensure the requests to underlying bios are processed first, then the second bio that was split off. If it splits again, the same process happens. In each case one bio will be completely handled before the next one is attempted. With this is place, it should be possible to disable the punt_bios_to_recover() recovery thread for many block devices, and eventually it may be possible to remove it completely. Ref: http://www.spinics.net/lists/raid/msg54680.html Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com> Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-08 10:55:17 -07:00
Jan Kara	c01228db4b	Revert "scsi, block: fix duplicate bdi name registration crashes" This reverts commit `0dba1314d4`. It causes leaking of device numbers for SCSI when SCSI registers multiple gendisks for one request_queue in succession. It can be easily reproduced using Omar's script [1] on kernel with CONFIG_DEBUG_TEST_DRIVER_REMOVE. Furthermore the protection provided by this commit is not needed anymore as the problem it was fixing got also fixed by commit `165a5e22fa` "block: Move bdi_unregister() to del_gendisk()". [1]: http://marc.info/?l=linux-block&m=148554717109098&w=2 Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-08 10:55:17 -07:00
Jan Kara	90f16fddcc	block: Make del_gendisk() safer for disks without queues Commit `165a5e22fa` "block: Move bdi_unregister() to del_gendisk()" added disk->queue dereference to del_gendisk(). Although del_gendisk() is not supposed to be called without disk->queue valid and blk_unregister_queue() warns in that case, this change will make it oops instead. Return to the old more robust behavior of just warning when del_gendisk() gets called for gendisk with disk->queue being NULL. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-08 10:55:17 -07:00
Jon Derrick	b0bfdfc2bf	block/sed: Fix opal user range check and unused variables Fixes check that the opal user is within the range, and cleans up unused method variables. Signed-off-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-08 09:56:12 -07:00
Ming Lei	01388df376	blk-mq: free hctx->cpumask in release handler of hctx's kobject It is obviously that hctx->cpumask is per hctx, and both share same lifetime, so this patch moves freeing of hctx->cpumask into release handler of hctx's kobject. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-08 09:56:12 -07:00
Ming Lei	6c8b232efe	blk-mq: make lifetime consistent between hctx and its kobject This patch removes kobject_put() over hctx in __blk_mq_unregister_dev(), and trys to keep lifetime consistent between hctx and hctx's kobject. Now blk_mq_sysfs_register() and blk_mq_sysfs_unregister() become totally symmetrical, and kobject's refcounter drops to zero just when the hctx is freed. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-08 09:56:12 -07:00
Ming Lei	7ea5fe31c1	blk-mq: make lifetime consitent between q/ctx and its kobject Currently from kobject view, both q->mq_kobj and ctx->kobj can be released during one cycle of blk_mq_register_dev() and blk_mq_unregister_dev(). Actually, sw queue's lifetime is same with its request queue's, which is covered by request_queue->kobj. So we don't need to call kobject_put() for the two kinds of kobject in __blk_mq_unregister_dev(), instead we do that in release handler of request queue. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-08 09:56:12 -07:00
Ming Lei	737f98cfe7	blk-mq: initialize mq kobjects in blk_mq_init_allocated_queue() Both q->mq_kobj and sw queues' kobjects should have been initialized once, instead of doing that each add_disk context. Also this patch removes clearing of ctx in blk_mq_init_cpu_queues() because percpu allocator fills zero to allocated variable. This patch fixes one issue[1] reported from Omar. [1] kernel wearning when doing unbind/bind on one scsi-mq device [ 19.347924] kobject (ffff8800791ea0b8): tried to init an initialized object, something is seriously wrong. [ 19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0-rc7-00210-g53f39eeaa263 #34 [ 19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014 [ 19.350920] Workqueue: events_unbound async_run_entry_fn [ 19.350920] Call Trace: [ 19.350920] dump_stack+0x63/0x83 [ 19.350920] kobject_init+0x77/0x90 [ 19.350920] blk_mq_register_dev+0x40/0x130 [ 19.350920] blk_register_queue+0xb6/0x190 [ 19.350920] device_add_disk+0x1ec/0x4b0 [ 19.350920] sd_probe_async+0x10d/0x1c0 [sd_mod] [ 19.350920] async_run_entry_fn+0x48/0x150 [ 19.350920] process_one_work+0x1d0/0x480 [ 19.350920] worker_thread+0x48/0x4e0 [ 19.350920] kthread+0x101/0x140 [ 19.350920] ? process_one_work+0x480/0x480 [ 19.350920] ? kthread_create_on_node+0x60/0x60 [ 19.350920] ret_from_fork+0x2c/0x40 Cc: Omar Sandoval <osandov@osandov.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-08 09:56:12 -07:00
Linus Torvalds	e0d072250a	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block layer fixes from Jens Axboe: "A collection of fixes for this merge window, either fixes for existing issues, or parts that were waiting for acks to come in. This pull request contains: - Allocation of nvme queues on the right node from Shaohua. This was ready long before the merge window, but waiting on an ack from Bjorn on the PCI bit. Now that we have that, the three patches can go in. - Two fixes for blk-mq-sched with nvmeof, which uses hctx specific request allocations. This caused an oops. One part from Sagi, one part from Omar. - A loop partition scan deadlock fix from Omar, fixing a regression in this merge window. - A three-patch series from Keith, closing up a hole on clearing out requests on shutdown/resume. - A stable fix for nbd from Josef, fixing a leak of sockets. - Two fixes for a regression in this window from Jan, fixing a problem with one of his earlier patches dealing with queue vs bdi life times. - A fix for a regression with virtio-blk, causing an IO stall if scheduling is used. From me. - A fix for an io context lock ordering problem. From me" * 'for-linus' of git://git.kernel.dk/linux-block: block: Move bdi_unregister() to del_gendisk() blk-mq: ensure that bd->last is always set correctly block: don't call ioc_exit_icq() with the queue lock held for blk-mq block: Initialize bd_bdi on inode initialization loop: fix LO_FLAGS_PARTSCAN hang nvme: Complete all stuck requests blk-mq: Provide freeze queue timeout blk-mq: Export blk_mq_freeze_queue_wait nbd: stop leaking sockets blk-mq: move update of tags->rqs to __blk_mq_alloc_request() blk-mq: kill blk_mq_set_alloc_data() blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset nvme: allocate nvme_queue in correct node PCI: add an API to get node from vector blk-mq: allocate blk_mq_tags and requests in correct node	2017-03-03 10:53:35 -08:00
Linus Torvalds	1827adb11a	Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull sched.h split-up from Ingo Molnar: "The point of these changes is to significantly reduce the <linux/sched.h> header footprint, to speed up the kernel build and to have a cleaner header structure. After these changes the new <linux/sched.h>'s typical preprocessed size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K lines), which is around 40% faster to build on typical configs. Not much changed from the last version (-v2) posted three weeks ago: I eliminated quirks, backmerged fixes plus I rebased it to an upstream SHA1 from yesterday that includes most changes queued up in -next plus all sched.h changes that were pending from Andrew. I've re-tested the series both on x86 and on cross-arch defconfigs, and did a bisectability test at a number of random points. I tried to test as many build configurations as possible, but some build breakage is probably still left - but it should be mostly limited to architectures that have no cross-compiler binaries available on kernel.org, and non-default configurations" * 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits) sched/headers: Clean up <linux/sched.h> sched/headers: Remove #ifdefs from <linux/sched.h> sched/headers: Remove the <linux/topology.h> include from <linux/sched.h> sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h> sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h> sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h> sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h> sched/headers: Remove <linux/sched.h> from <linux/sched/init.h> sched/core: Remove unused prefetch_stack() sched/headers: Remove <linux/rculist.h> from <linux/sched.h> sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h> sched/headers: Remove <linux/signal.h> from <linux/sched.h> sched/headers: Remove <linux/rwsem.h> from <linux/sched.h> sched/headers: Remove the runqueue_is_locked() prototype sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h> sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h> sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h> sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h> ...	2017-03-03 10:16:38 -08:00
Jan Kara	165a5e22fa	block: Move bdi_unregister() to del_gendisk() Commit `6cd18e711d` "block: destroy bdi before blockdev is unregistered." moved bdi unregistration (at that time through bdi_destroy()) from blk_release_queue() to blk_cleanup_queue() because it needs to happen before blk_unregister_region() call in del_gendisk() for MD. SCSI though will free up the device number from sd_remove() called through a maze of callbacks from device_del() in __scsi_remove_device() before blk_cleanup_queue() and thus similar races as described in `6cd18e711d` can happen for SCSI as well as reported by Omar [1]. Moving bdi_unregister() to del_gendisk() works for MD and fixes the problem for SCSI since del_gendisk() gets called from sd_remove() before freeing the device number. This also makes device_add_disk() (calling bdi_register_owner()) more symmetric with del_gendisk(). [1] http://marc.info/?l=linux-block&m=148554717109098&w=2 Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-02 16:08:35 -07:00
Linus Torvalds	54d7989f47	virtio, vhost: optimizations, fixes Looks like a quiet cycle for vhost/virtio, just a couple of minor tweaks. Most notable is automatic interrupt affinity for blk and scsi. Hopefully other devices are not far behind. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> -----BEGIN PGP SIGNATURE----- iQEcBAABAgAGBQJYt1rRAAoJECgfDbjSjVRpEZsIALSHevdXWtRHBZUb0ZkqPLQb /x2Vn49CcALS1p7iSuP9L027MPeaLKyr0NBT9hptBChp/4b9lnZWyyAo6vYQrzfx Ia/hLBYsK4ml6lEwbyfLwqkF2cmYCrZhBSVAILifn84lTPoN7CT0PlYDfA+OCaNR geo75qF8KR+AUO0aqchwMRL3RV3OxZKxQr2AR6LttCuhiBgnV3Xqxffg/M3x6ONM 0ffFFdodm6slem3hIEiGUMwKj4NKQhcOleV+y0fVBzWfLQG9210pZbQyRBRikIL0 7IsaarpaUr7OrLAZFMGF6nJnyRAaRrt6WknTHZkyvyggrePrGcmGgPm4jrODwY4= =2zwv -----END PGP SIGNATURE----- Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull vhost updates from Michael Tsirkin: "virtio, vhost: optimizations, fixes Looks like a quiet cycle for vhost/virtio, just a couple of minor tweaks. Most notable is automatic interrupt affinity for blk and scsi. Hopefully other devices are not far behind" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: virtio-console: avoid DMA from stack vhost: introduce O(1) vq metadata cache virtio_scsi: use virtio IRQ affinity virtio_blk: use virtio IRQ affinity blk-mq: provide a default queue mapping for virtio device virtio: provide a method to get the IRQ affinity mask for a virtqueue virtio: allow drivers to request IRQ affinity when creating VQs virtio_pci: simplify MSI-X setup virtio_pci: don't duplicate the msix_enable flag in struct pci_dev virtio_pci: use shared interrupts for virtqueues virtio_pci: remove struct virtio_pci_vq_info vhost: try avoiding avail index access when getting descriptor virtio_mmio: expose header to userspace	2017-03-02 13:53:13 -08:00
Jens Axboe	113285b473	blk-mq: ensure that bd->last is always set correctly When drivers are called with a request in blk-mq, blk-mq flags the state such that the driver knows if this is the last request in this call chain or not. The driver can then use that information to defer kicking off IO until bd->last is true. However, with blk-mq and scheduling, we need to allocate a driver tag for a request before it can be issued. If we fail to allocate such a tag, we could end up in the situation where the last request issued did not have bd->last == true set. This can then cause a driver hang. This fixes a hang with virtio-blk, which uses bd->last as a hint on whether to kick the queue or not. Reported-by: Chris Mason <clm@fb.com> Tested-by: Chris Mason <clm@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-02 14:30:51 -07:00
Jens Axboe	7b36a7189f	block: don't call ioc_exit_icq() with the queue lock held for blk-mq For legacy scheduling, we always call ioc_exit_icq() with both the ioc and queue lock held. This poses a problem for blk-mq with scheduling, since the queue lock isn't what we use in the scheduler. And since we don't need the queue lock held for ioc exit there, don't grab it and leave any extra locking up to the blk-mq scheduler. Reported-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Paolo Valente <paolo.valente@linaro.org> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-02 13:59:08 -07:00
Keith Busch	f91328c40a	blk-mq: Provide freeze queue timeout A driver may wish to take corrective action if queued requests do not complete within a set time. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-02 08:56:04 -07:00
Keith Busch	6bae363ee3	blk-mq: Export blk_mq_freeze_queue_wait Drivers can start a freeze, so this provides a way to wait for frozen. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-02 08:56:04 -07:00
Omar Sandoval	562bef4259	blk-mq: move update of tags->rqs to __blk_mq_alloc_request() No functional difference, it just makes a little more sense to update the tag map where we actually allocate the tag. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com> Tested-by: Sagi Grimberg <sagi@grimberg.me>	2017-03-02 08:56:04 -07:00
Omar Sandoval	5974839899	blk-mq: kill blk_mq_set_alloc_data() Nothing is using it anymore. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com> Tested-by: Sagi Grimberg <sagi@grimberg.me>	2017-03-02 08:56:04 -07:00
Omar Sandoval	6d2809d51a	blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request blk_mq_alloc_request_hctx() allocates a driver request directly, unlike its blk_mq_alloc_request() counterpart. It also crashes because it doesn't update the tags->rqs map. Fix it by making it allocate a scheduler request. Reported-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com> Tested-by: Sagi Grimberg <sagi@grimberg.me>	2017-03-02 08:56:04 -07:00
Sagi Grimberg	415b806de5	blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Modified by me to also check at driver tag allocation time if the original request was reserved, so we can be sure to allocate a properly reserved tag at that point in time, too. Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-02 08:56:04 -07:00
Shaohua Li	59f082e464	blk-mq: allocate blk_mq_tags and requests in correct node blk_mq_tags/requests of specific hardware queue are mostly used in specific cpus, which might not be in the same numa node as disk. For example, a nvme card is in node 0. half hardware queue will be used by node 0, the other node 1. Signed-off-by: Shaohua Li <shli@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-03-02 08:56:04 -07:00
Ingo Molnar	f719ff9bce	sched/headers: Prepare to move the task_lock()/unlock() APIs to <linux/sched/task.h> But first update the code that uses these facilities with the new header. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:38 +01:00
Ingo Molnar	68db0cf106	sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task_stack.h> We are going to split <linux/sched/task_stack.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/task_stack.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:36 +01:00
Ingo Molnar	174cd4b1e5	sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> Fix up affected files that include this signal functionality via sched.h. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:32 +01:00
Ingo Molnar	5b825c3af1	sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h> Add #include <linux/cred.h> dependencies to all .c files rely on sched.h doing that for them. Note that even if the count where we need to add extra headers seems high, it's still a net win, because <linux/sched.h> is included in over 2,200 files ... Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:31 +01:00
Ingo Molnar	8703e8a465	sched/headers: Prepare for new header dependencies before moving code to <linux/sched/user.h> We are going to split <linux/sched/user.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/user.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:29 +01:00
Ingo Molnar	e601757102	sched/headers: Prepare for new header dependencies before moving code to <linux/sched/clock.h> We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which will have to be picked up from other headers and .c files. Create a trivial placeholder <linux/sched/clock.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:27 +01:00
Ingo Molnar	105ab3d8ce	sched/headers: Prepare for new header dependencies before moving code to <linux/sched/topology.h> We are going to split <linux/sched/topology.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/topology.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:26 +01:00
Alexey Dobriyan	5b5e0928f7	lib/vsprintf.c: remove %Z support Now that %z is standartised in C99 there is no reason to support %Z. Unlike %L it doesn't even make format strings smaller. Use BUILD_BUG_ON in a couple ATM drivers. In case anyone didn't notice lib/vsprintf.o is about half of SLUB which is in my opinion is quite an achievement. Hopefully this patch inspires someone else to trim vsprintf.c more. Link: http://lkml.kernel.org/r/20170103230126.GA30170@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-27 18:43:47 -08:00
Masahiro Yamada	b43daedc0e	scripts/spelling.txt: add "embeded" pattern and fix typo instances Fix typos and add the following to the scripts/spelling.txt: embeded\|\|embedded Link: http://lkml.kernel.org/r/1481573103-11329-12-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-27 18:43:47 -08:00
Christoph Hellwig	73473427bb	blk-mq: provide a default queue mapping for virtio device Similar to the PCI version, just calling into virtio instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2017-02-27 20:54:05 +02:00
Linus Torvalds	a682e00354	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md Pull md updates from Shaohua Li: "Mainly fixes bugs and improves performance: - Improve scalability for raid1 from Coly - Improve raid5-cache read performance, disk efficiency and IO pattern from Song and me - Fix a race condition of disk hotplug for linear from Coly - A few cleanup patches from Ming and Byungchul - Fix a memory leak from Neil - Fix WRITE SAME IO failure from me - Add doc for raid5-cache from me" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (23 commits) md/raid1: fix write behind issues introduced by bio_clone_bioset_partial md/raid1: handle flush request correctly md/linear: shutup lockdep warnning md/raid1: fix a use-after-free bug RAID1: avoid unnecessary spin locks in I/O barrier code RAID1: a new I/O barrier implementation to remove resync window md/raid5: Don't reinvent the wheel but use existing llist API md: fast clone bio in bio_clone_mddev() md: remove unnecessary check on mddev md/raid1: use bio_clone_bioset_partial() in case of write behind md: fail if mddev->bio_set can't be created block: introduce bio_clone_bioset_partial() md: disable WRITE SAME if it fails in underlayer disks md/raid5-cache: exclude reclaiming stripes in reclaim check md/raid5-cache: stripe reclaim only counts valid stripes MD: add doc for raid5-cache Documentation: move MD related doc into a separate dir md: ensure md devices are freed before module is unloaded. md/r5cache: improve journal device efficiency md/r5cache: enable chunk_aligned_read with write back cache ...	2017-02-24 14:42:19 -08:00
Linus Torvalds	1802979ab1	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block updates and fixes from Jens Axboe: - NVMe updates and fixes that missed the first pull request. This includes bug fixes, and support for autonomous power management. - Fix from Christoph for missing clear of the request payload, causing a problem with (at least) the storvsc driver. - Further fixes for the queue/bdi life time issues from Jan. - The Kconfig mq scheduler update from me. - Fixing a use-after-free in dm-rq, spotted by Bart, introduced in this merge window. - Three fixes for nbd from Josef. - Bug fix from Omar, fixing a bug in sas transport code that oopses when bsg ioctls were used. From Omar. - Improvements to the queue restart and tag wait from from Omar. - Set of fixes for the sed/opal code from Scott. - Three trivial patches to cciss from Tobin * 'for-linus' of git://git.kernel.dk/linux-block: (41 commits) dm-rq: don't dereference request payload after ending request blk-mq-sched: separate mark hctx and queue restart operations blk-mq: use sbq wait queues instead of restart for driver tags block/sed-opal: Propagate original error message to userland. nvme/pci: re-check security protocol support after reset block/sed-opal: Introduce free_opal_dev to free the structure and clean up state nvme: detect NVMe controller in recent MacBooks nvme-rdma: add support for host_traddr nvmet-rdma: Fix error handling nvmet-rdma: use nvme cm status helper nvme-rdma: move nvme cm status helper to .h file nvme-fc: don't bother to validate ioccsz and iorcsz nvme/pci: No special case for queue busy on IO nvme/core: Fix race kicking freed request_queue nvme/pci: Disable on removal when disconnected nvme: Enable autonomous power state transitions nvme: Add a quirk mechanism that uses identify_ctrl nvme: make nvmf_register_transport require a create_ctrl callback nvme: Use CNS as 8-bit field and avoid endianness conversion nvme: add semicolon in nvme_command setting ...	2017-02-24 14:13:34 -08:00
Omar Sandoval	d38d351555	blk-mq-sched: separate mark hctx and queue restart operations In blk_mq_sched_dispatch_requests(), we call blk_mq_sched_mark_restart() after we dispatch requests left over on our hardware queue dispatch list. This is so we'll go back and dispatch requests from the scheduler. In this case, it's only necessary to restart the hardware queue that we are running; there's no reason to run other hardware queues just because we are using shared tags. So, split out blk_mq_sched_mark_restart() into two operations, one for just the hardware queue and one for the whole request queue. The core code only needs the hctx variant, but I/O schedulers will want to use both. This also requires adjusting blk_mq_sched_restart_queues() to always check the queue restart flag, not just when using shared tags. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-23 11:55:47 -07:00
Omar Sandoval	da55f2cc78	blk-mq: use sbq wait queues instead of restart for driver tags Commit `50e1dab86a` ("blk-mq-sched: fix starvation for multiple hardware queues and shared tags") fixed one starvation issue for shared tags. However, we can still get into a situation where we fail to allocate a tag because all tags are allocated but we don't have any pending requests on any hardware queue. One solution for this would be to restart all queues that share a tag map, but that really sucks. Ideally, we could just block and wait for a tag, but that isn't always possible from blk_mq_dispatch_rq_list(). However, we can still use the struct sbitmap_queue wait queues with a custom callback instead of blocking. This has a few benefits: 1. It avoids iterating over all hardware queues when completing an I/O, which the current restart code has to do. 2. It benefits from the existing rolling wakeup code. 3. It avoids punting to another thread just to have it block. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-23 11:55:46 -07:00
Scott Bauer	2d19020b08	block/sed-opal: Propagate original error message to userland. During an error on a comannd, ex: user provides wrong pw to unlock range, we will gracefully terminate the opal session. We want to propagate the original error to userland instead of the result of the session termination, which is almost always a success. Signed-off-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-23 11:55:44 -07:00
Scott Bauer	7d6d15789d	block/sed-opal: Introduce free_opal_dev to free the structure and clean up state Before we free the opal structure we need to clean up any saved locking ranges that the user had told us to unlock from a suspend. Signed-off-by: Scott Bauer <scott.bauer@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-23 11:55:41 -07:00
Tetsuo Handa	612dafabb6	block: use for_each_thread() in sys_ioprio_set()/sys_ioprio_get() IOPRIO_WHO_USER case in sys_ioprio_set()/sys_ioprio_get() are using while_each_thread(), which is unsafe under RCU lock according to commit `0c740d0afc` ("introduce for_each_thread() to replace the buggy while_each_thread()"). Use for_each_thread() (via for_each_process_thread()) which is safe under RCU lock. Link: http://lkml.kernel.org/r/201702011947.DBD56740.OMVHOLOtSJFFFQ@I-love.SAKURA.ne.jp Link: http://lkml.kernel.org/r/1486041779-4401-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:27 -08:00
Jens Axboe	b86dd815ff	block: get rid of blk-mq default scheduler choice Kconfig entries The wording in the entries were poor and not understandable by even deities. Kill the selection for default block scheduler, and impose a policy with sane defaults. Architected-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-22 13:19:45 -07:00
Jon Derrick	eed64951f1	block/sed: Embed function data into the function sequence By embedding the function data with the function sequence, we can eliminate the external function data and state variable code. It also made obvious some other small cleanups. Signed-off-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-22 11:54:49 -07:00
Jon Derrick	77039b9631	block/sed: Check received header lengths Add a buffer size check against discovery and response header lengths before we loop over their buffers. Signed-off-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Scott Bauer <scott.bauer@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-22 11:54:49 -07:00
Jon Derrick	cccb92417d	block/sed: Add helper to qualify response tokens Add helper which verifies the response token is valid and matches the expected value. Merges token_type and response_get_token. Signed-off-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Scott Bauer <scott.bauer@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-22 11:54:49 -07:00
Jon Derrick	aedb6e2411	block/sed: Use ssize_t on atom parsers to return errors The short atom parser can return an errno from decoding but does not currently return the error as a signed value. Convert all of the parsers to ssize_t. Signed-off-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Scott Bauer <scott.bauer@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-22 11:54:49 -07:00
Jan Kara	d06e05c026	block: Unhash also block device inode for the whole device Iteration over partitions in del_gendisk() omits part0. Add bdev_unhash_inode() call for the whole device. Otherwise if the device number gets reused, bdev inode will be still associated with the old (stale) bdi. Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-21 12:51:54 -07:00
Jan Kara	4b8c861a7c	block: Move bdev_unhash_inode() after invalidate_partition() Move bdev_unhash_inode() after invalidate_partition() as invalidate_partition() looks up bdev and it cannot find the right bdev inode after bdev_unhash_inode() is called. Thus invalidate_partition() would not invalidate page cache of the previously used bdev. Also use part_devt() when calling bdev_unhash_inode() instead of manually creating the device number. Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-21 12:51:54 -07:00
Linus Torvalds	772c8f6f3b	for-4.11/linus-merge-signed -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABCAAGBQJYqeb8AAoJEPfTWPspceCmB3UP/3UtcPrzEm8w2cxB9MaWhZN3 J+jiwlO4vaqhm2HVzQtoJqfaqRlud/iDx5cIXE2S7FnIM54ZKs3CANbKu8X+b1zm eJije3zMI8A8qyftigbz6a/Y2kWE4ZqFEc9WU5CWawfTl3ImCVUi8+F5X0wOLU/h r50zAQOEyURH4G5usNl9q0olF6FonJ82AcYm1iJ0QP2wYWZRJauC0rRn8IT93tyK bZPHnGKdkd7km8yi3zr2GNWOfuZZuA0HWAaF4qfrHPZQ883gITFAUIlFb1f+2TNl DkQzRrBB2wPWPnlbfb9KejMkvL94hflzsLb5rHt835DyVXFRyjxsgyAI8A+LPGSz vqZ3rsbWj6H4F9z2CkZ+T+AP/ZSWDNjwc0RXPm9HYdR5CDeTxIUVvnFQ44YNsmTv Xd5BKrUJ2oKegAxQG6zcuFx23p8JzhT70l+mNrMdtyeKnDD9FRdDvhKG9AHeTipn o/DnGivhS3UMQoQ7D68KOO+kuhLDeo7my5XGsnjzMO/iHqg++7IP2HyYYs/Ba4qZ cYaCtSDQW71Zt0vsqa6dvPuXBveu4h8Qh8R7uAGjSGS9IAFFb4Cab2tiUdISE6PE YnMWzY+G6pT8imlLVOL5/QFuo2Q4pUsaL0AHpXMCN9TZnQtbqXa8eqwnKnQ0m2KN 7ut0IYYEPaYUX5xFn1K6 =z7AL -----END PGP SIGNATURE----- Merge tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block Pull block layer updates from Jens Axboe: - blk-mq scheduling framework from me and Omar, with a port of the deadline scheduler for this framework. A port of BFQ from Paolo is in the works, and should be ready for 4.12. - Various fixups and improvements to the above scheduling framework from Omar, Paolo, Bart, me, others. - Cleanup of the exported sysfs blk-mq data into debugfs, from Omar. This allows us to export more information that helps debug hangs or performance issues, without cluttering or abusing the sysfs API. - Fixes for the sbitmap code, the scalable bitmap code that was migrated from blk-mq, from Omar. - Removal of the BLOCK_PC support in struct request, and refactoring of carrying SCSI payloads in the block layer. This cleans up the code nicely, and enables us to kill the SCSI specific parts of struct request, shrinking it down nicely. From Christoph mainly, with help from Hannes. - Support for ranged discard requests and discard merging, also from Christoph. - Support for OPAL in the block layer, and for NVMe as well. Mainly from Scott Bauer, with fixes/updates from various others folks. - Error code fixup for gdrom from Christophe. - cciss pci irq allocation cleanup from Christoph. - Making the cdrom device operations read only, from Kees Cook. - Fixes for duplicate bdi registrations and bdi/queue life time problems from Jan and Dan. - Set of fixes and updates for lightnvm, from Matias and Javier. - A few fixes for nbd from Josef, using idr to name devices and a workqueue deadlock fix on receive. Also marks Josef as the current maintainer of nbd. - Fix from Josef, overwriting queue settings when the number of hardware queues is updated for a blk-mq device. - NVMe fix from Keith, ensuring that we don't repeatedly mark and IO aborted, if we didn't end up aborting it. - SG gap merging fix from Ming Lei for block. - Loop fix also from Ming, fixing a race and crash between setting loop status and IO. - Two block race fixes from Tahsin, fixing request list iteration and fixing a race between device registration and udev device add notifiations. - Double free fix from cgroup writeback, from Tejun. - Another double free fix in blkcg, from Hou Tao. - Partition overflow fix for EFI from Alden Tondettar. * tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block: (156 commits) nvme: Check for Security send/recv support before issuing commands. block/sed-opal: allocate struct opal_dev dynamically block/sed-opal: tone down not supported warnings block: don't defer flushes on blk-mq + scheduling blk-mq-sched: ask scheduler for work, if we failed dispatching leftovers blk-mq: don't special case flush inserts for blk-mq-sched blk-mq-sched: don't add flushes to the head of requeue queue blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not block: do not allow updates through sysfs until registration completes lightnvm: set default lun range when no luns are specified lightnvm: fix off-by-one error on target initialization Maintainers: Modify SED list from nvme to block Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN uapi: sed-opal fix IOW for activate lsp to use correct struct cdrom: Make device operations read-only elevator: fix loading wrong elevator type for blk-mq devices cciss: switch to pci_irq_alloc_vectors block/loop: fix race between I/O and set_status blk-mq-sched: don't hold queue_lock when calling exit_icq block: set make_request_fn manually in blk_mq_update_nr_hw_queues ...	2017-02-21 10:57:33 -08:00
Jens Axboe	818551e2b2	Merge branch 'for-4.11/next' into for-4.11/linus-merge Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-17 14:08:19 -07:00
Jens Axboe	6010720da8	Merge branch 'for-4.11/block' into for-4.11/linus-merge Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-17 14:06:45 -07:00
Christoph Hellwig	4f1244c829	block/sed-opal: allocate struct opal_dev dynamically Insted of bloating the containing structure with it all the time this allocates struct opal_dev dynamically. Additionally this allows moving the definition of struct opal_dev into sed-opal.c. For this a new private data field is added to it that is passed to the send/receive callback. After that a lot of internals can be made private as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Scott Bauer <scott.bauer@intel.com> Reviewed-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-17 12:41:47 -07:00
Christoph Hellwig	f5b37b7c23	block/sed-opal: tone down not supported warnings Not having OPAL or a sub-feature supported is an entirely normal condition for many drives, so don't warn about it. Keep the messages, but tone them down to debug only. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-17 12:41:44 -07:00
Jens Axboe	7520872c0c	block: don't defer flushes on blk-mq + scheduling For blk-mq with scheduling, we can potentially end up with ALL driver tags assigned and sitting on the flush queues. If we defer because of an inlfight data request, then we can deadlock if that data request doesn't already have a tag assigned. This fixes a deadlock with running the xfs/297 xfstest, where thousands of syncs can cause the drive queue to stall. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com>	2017-02-17 12:35:47 -07:00
Jens Axboe	64765a75ef	blk-mq-sched: ask scheduler for work, if we failed dispatching leftovers Usually we don't ask the scheduler for work, if we already have leftovers on the dispatch list. This is done to leave work on the scheduler side for as long as possible, for proper merging. But if we do have work leftover but didn't dispatch anything, then we should ask the scheduler since we could potentially issue requests from that. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com>	2017-02-17 12:35:47 -07:00
Jens Axboe	0c2a6fe4dc	blk-mq: don't special case flush inserts for blk-mq-sched The current request insertion machinery works just fine for directly inserting flushes, so no need to special case this anymore. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com>	2017-02-17 12:35:47 -07:00
Jens Axboe	c7a571b450	blk-mq-sched: don't add flushes to the head of requeue queue If we are currently out of driver tags, we don't want to add a new flush (without a tag) to the head of the requeue list. We want to add it to the back, behind the others that are potentially also waiting for a tag. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com>	2017-02-17 12:35:47 -07:00
Jens Axboe	2aa0f21d54	blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not Currently we're almost there, but if we dispatch nothing, then we still return success. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com>	2017-02-17 12:35:47 -07:00
Jens Axboe	5d7f5ce151	cfq-iosched: don't call wbt_disable_default() with IRQs disabled wbt_disable_default() calls del_timer_sync() to wait for the wbt timer to finish before disabling throttling. We can't do this with IRQs disable. This fixes a lockdep splat on boot, if non-root cgroups are used. Reported-by: Gabriel C <nix.or.die@gmail.com> Fixes: `87760e5eef` ("block: hook up writeback throttling") Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-16 08:02:06 -07:00
Ming Lei	c18a1e0900	block: introduce bio_clone_bioset_partial() md still need bio clone(not the fast version) for behind write, and it is more efficient to use bio_clone_bioset_partial(). The idea is simple and just copy the bvecs range specified from parameters. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@fb.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-02-15 11:22:05 -08:00
Tahsin Erdogan	b410aff2bd	block: do not allow updates through sysfs until registration completes When a new disk shows up, sysfs queue directory is created before elevator is registered. This allows a user to attempt a scheduler switch even though the initial registration hasn't completed yet. In one scenario, blk_register_queue() calls elv_register_queue() and right before cfq_registered_queue() is called, another process executes elevator_switch() and replaces q->elevator with deadline scheduler. When cfq_registered_queue() executes it interprets e->elevator_data as struct cfq_data even though it is actually struct deadline_data. Grab q->sysfs_lock in blk_register_queue() to synchronize with sysfs callers. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-15 08:40:04 -07:00
Scott Bauer	e225c20eb0	Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN When CONFIG_KASAN is enabled, compilation fails: block/sed-opal.c: In function 'sed_ioctl': block/sed-opal.c:2447:1: error: the frame size of 2256 bytes is larger than 2048 bytes [-Werror=frame-larger-than=] Moved all the ioctl structures off the stack and dynamically allocate using _IOC_SIZE() Fixes: `455a7b238c` ("block: Add Sed-opal library") Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-14 19:47:18 -07:00

1 2 3 4 5 ...

3213 commits