Commit graph

159 commits

Author SHA1 Message Date
Christoph Hellwig
2a842acab1 block: introduce new block status code type
Currently we use nornal Linux errno values in the block layer, and while
we accept any error a few have overloaded magic meanings.  This patch
instead introduces a new  blk_status_t value that holds block layer specific
status codes and explicitly explains their meaning.  Helpers to convert from
and to the previous special meanings are provided for now, but I suspect
we want to get rid of them in the long run - those drivers that have a
errno input (e.g. networking) usually get errnos that don't know about
the special block layer overloads, and similarly returning them to userspace
will usually return somethings that strictly speaking isn't correct
for file system operations, but that's left as an exercise for later.

For now the set of errors is a very limited set that closely corresponds
to the previous overloaded errno values, but there is some low hanging
fruite to improve it.

blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
typechecking, so that we can easily catch places passing the wrong values.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig
d928be9f85 block: add a REQ_NOUNMAP flag for REQ_OP_WRITE_ZEROES
If this flag is set logical provisioning capable device should
release space for the zeroed blocks if possible, if it is not set
devices should keep the blocks anchored.

Also remove an out of sync kerneldoc comment for a static function
that would have become even more out of data with this change.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08 11:25:38 -06:00
Christoph Hellwig
1d62ac1363 block: renumber REQ_OP_WRITE_ZEROES
Make life easy for implementations that needs to send a data buffer
to the device (e.g. SCSI) by numbering it as a data out command.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08 11:25:38 -06:00
NeilBrown
fbbaf700e7 block: trace completion of all bios.
Currently only dm and md/raid5 bios trigger
trace_block_bio_complete().  Now that we have bio_chain() and
bio_inc_remaining(), it is not possible, in general, for a driver to
know when the bio is really complete.  Only bio_endio() knows that.

So move the trace_block_bio_complete() call to bio_endio().

Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.

There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
   trace event at the 'request' level, there is no point generating
   one at the bio level too.  In this case the bi_sector and bi_size
   will have changed, so the bio level event would be wrong

2/ If the bio hasn't actually been queued yet, but is being aborted
   early, then a trace event could be confusing.  Some filesystems
   call bio_endio() but do not want tracing.

3/ The bio_integrity code interposes itself by replacing bi_end_io,
   then restoring it and calling bio_endio() again.  This would produce
   two identical trace events if left like that.

To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
produce the trace event when this is set.
We address point 1 above by clearing the flag in blk_update_request().
We address point 2 above by only setting the flag when
generic_make_request() is called.
We address point 3 above by clearing the flag after generating a
completion event.

When bio_split() is used on a bio, particularly in blk_queue_split(),
there is an extra complication.  A new bio is split off the front, and
may be handle directly without going through generic_make_request().
The old bio, which has been advanced, is passed to
generic_make_request(), so it will trigger a trace event a second
time.
Probably the best result when a split happens is to see a single
'queue' event for the whole bio, then multiple 'complete' events - one
for each component.  To achieve this was can:
- copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
- avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
This way, the split-off bio won't create a queue event, the original
won't either even if it re-submitted to generic_make_request(),
but both will produce completion events, each for their own range.

So if generic_make_request() is called (which generates a QUEUED
event), then bi_endio() will create a single COMPLETE event for each
range that the bio is split into, unless the driver has explicitly
requested it not to.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 09:40:52 -06:00
NeilBrown
dbde775cdb block: simple improvements for bio->flags
The comment for the 'flags' field of 'bio' mentions
"command" which is no longer stored there, and doesn't
mention the bvec pool number, which is.

BIO_RESET_BITS is set in such a way that it would need to be
updated if new bits were added, which is easy to miss.

BVEC_POOL_BITS is larger than needed.  The BVEC_POOL_IDX()
ranges from 0 to 6, so 3 bits are sufficient.

This patch make improvements in each of these areas.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 09:39:29 -06:00
Shaohua Li
b9147dd1ba blk-throttle: add a mechanism to estimate IO latency
User configures latency target, but the latency threshold for each
request size isn't fixed. For a SSD, the IO latency highly depends on
request size. To calculate latency threshold, we sample some data, eg,
average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
threshold of each request size will be the sample latency (I'll call it
base latency) plus latency target. For example, the base latency for
request size 4k is 80us and user configures latency target 60us. The 4k
latency threshold will be 80 + 60 = 140us.

To sample data, we calculate the order base 2 of rounded up IO sectors.
If the IO size is bigger than 1M, it will be accounted as 1M. Since the
calculation does round up, the base latency will be slightly smaller
than actual value. Also if there isn't any IO dispatched for a specific
IO size, we will use the base latency of smaller IO size for this IO
size.

But we shouldn't sample data at any time. The base latency is supposed
to be latency where disk isn't congested, because we use latency
threshold to schedule IOs between cgroups. If disk is congested, the
latency is higher, using it for scheduling is meaningless. Hence we only
do the sampling when block throttling is in the LOW limit, with
assumption disk isn't congested in such state. If the assumption isn't
true, eg, low limit is too high, calculated latency threshold will be
higher.

Hard disk is completely different. Latency depends on spindle seek
instead of request size. Currently this feature is SSD only, we probably
can use a fixed threshold like 4ms for hard disk though.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 08:02:20 -06:00
Shaohua Li
88eeca495b block: track request size in blk_issue_stat
Currently there is no way to know the request size when the request is
finished. Next patch will need this info. We could add extra field to
record the size, but blk_issue_stat has enough space to record it, so
this patch just overloads blk_issue_stat. With this, we will have 49bits
to track time, which still is very long time.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 08:02:20 -06:00
Shaohua Li
9e234eeafb blk-throttle: add a simple idle detection
A cgroup gets assigned a low limit, but the cgroup could never dispatch
enough IO to cross the low limit. In such case, the queue state machine
will remain in LIMIT_LOW state and all other cgroups will be throttled
according to low limit. This is unfair for other cgroups. We should
treat the cgroup idle and upgrade the state machine to lower state.

We also have a downgrade logic. If the state machine upgrades because of
cgroup idle (real idle), the state machine will downgrade soon as the
cgroup is below its low limit. This isn't what we want. A more
complicated case is cgroup isn't idle when queue is in LIMIT_LOW. But
when queue gets upgraded to lower state, other cgroups could dispatch
more IO and this cgroup can't dispatch enough IO, so the cgroup is below
its low limit and looks like idle (fake idle). In this case, the queue
should downgrade soon. The key to determine if we should do downgrade is
to detect if cgroup is truely idle.

Unfortunately it's very hard to determine if a cgroup is real idle. This
patch uses the 'think time check' idea from CFQ for the purpose. Please
note, the idea doesn't work for all workloads. For example, a workload
with io depth 8 has disk utilization 100%, hence think time is 0, eg,
not idle. But the workload can run higher bandwidth with io depth 16.
Compared to io depth 16, the io depth 8 workload is idle. We use the
idea to roughly determine if a cgroup is idle.

We treat a cgroup idle if its think time is above a threshold (by
default 1ms for SSD and 100ms for HD). The idea is think time above the
threshold will start to harm performance. HD is much slower so a longer
think time is ok.

The patch (and the latter patches) uses 'unsigned long' to track time.
We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
precision, should not a big deal.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 08:02:20 -06:00
Omar Sandoval
34dbad5d26 blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:

1. Writeback throttling wants its own dynamically sized window of
   statistics. Since the blk-stats statistics are reset after every
   window and the wbt windows don't line up with the blk-stats windows,
   wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
   on how the window lines up, we may miss some I/Os. It's also
   unnecessary overhead to get the statistics on every I/O; the hybrid
   polling heuristic would be just as happy with the statistics from the
   previous full window.

This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.

The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.

wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.

For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.

Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 10:03:11 -06:00
Omar Sandoval
4875253fdd blk-stat: move BLK_RQ_STAT_BATCH definition to blk-stat.c
This is an implementation detail that no-one outside of blk-stat.c uses.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 10:03:08 -06:00
Christoph Hellwig
aebf526b53 block: fold cmd_type into the REQ_OP_ space
Instead of keeping two levels of indirection for requests types, fold it
all into the operations.  The little caveat here is that previously
cmd_type only applied to struct request, while the request and bio op
fields were set to plain REQ_OP_READ/WRITE even for passthrough
operations.

Instead this patch adds new REQ_OP_* for SCSI passthrough and driver
private requests, althought it has to add two for each so that we
can communicate the data in/out nature of the request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-31 14:00:44 -07:00
Christoph Hellwig
f73f44eb00 block: add a op_is_flush helper
This centralizes the checks for bios that needs to be go into the flush
state machine.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27 09:01:45 -07:00
Jens Axboe
fd2d332677 blk-mq: add support for carrying internal tag information in blk_qc_t
No functional change in this patch, just in preparation for having
two types of tags available to the block layer for a single request.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:04:09 -07:00
Chaitanya Kulkarni
a6f0788ec2 block: add support for REQ_OP_WRITE_ZEROES
This adds a new block layer operation to zero out a range of
LBAs. This allows to implement zeroing for devices that don't use
either discard with a predictable zero pattern or WRITE SAME of zeroes.
The prominent example of that is NVMe with the Write Zeroes command,
but in the future, this should also help with improving the way
zeroing discards work. For this operation, suitable entry is exported in
sysfs which indicate the number of maximum bytes allowed in one
write zeroes operation by the device.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-01 07:58:40 -07:00
Christoph Hellwig
93c5bdf7ab block: clear all of bi_opf in bio_set_op_attrs
Since commit 87374179 ("block: add a proper block layer data direction
encoding") we only or the new op and flags into bi_opf in bio_set_op_attrs
instead of clearing the old value.  I've not seen any breakage with the
new behavior, but it seems dangerous.

Also convert it to an inline function to make the argument passing
safer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-21 09:35:05 -07:00
Jens Axboe
cf43e6be86 block: add scalable completion tracking of requests
For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.

The stats are tracked in, roughly, 0.1s interval windows.

Add sysfs files to display the stats.

The feature is off by default, to avoid any extra overhead. In-kernel
users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
flags. We currently don't turn it on if someone just reads any of
the stats files, that is something we could add as well.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 13:53:26 -07:00
Jens Axboe
1d796d6a96 block: add REQ_BACKGROUND
This adds a new request flag, REQ_BACKGROUND, that callers can use to
tell the block layer that this is background (non-urgent) IO.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-11-02 10:24:01 -06:00
Christoph Hellwig
7281b4526c block: remove the CONFIG_BLOCK ifdef in blk_types.h
Now that we have a separate header for struct bio_vec there is absolutely
no excuse for including this header from non-block I/O code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Christoph Hellwig
a2b809672e block: replace REQ_NOIDLE with REQ_IDLE
Noidle should be the default for writes as seen by all the compounds
definitions in fs.h using it.  In fact only direct I/O really should
be using NODILE, so turn the whole flag around to get the defaults
right, which will make our life much easier especially onces the
WRITE_* defines go away.

This assumes all the existing "raw" users of REQ_SYNC for writes
want noidle behavior, which seems to be spot on from a quick audit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Christoph Hellwig
b685d3d65a block: treat REQ_FUA and REQ_PREFLUSH as synchronous
Instead of requiring everyone to specify the REQ_SYNC flag aѕ well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Christoph Hellwig
87374179c5 block: add a proper block layer data direction encoding
Currently the block layer op_is_write, bio_data_dir and rq_data_dir
helper treat every operation that is not a READ as a data out operation.
This worked surprisingly long, but the new REQ_OP_ZONE_REPORT operation
actually adds a second operation that reads data from the device.
Surprisingly nothing critical relied on this direction, but this might
be a good opportunity to properly fix this issue up.

We take a little inspiration and use the least significant bit of the
operation number to encode the data direction, which just requires us
to renumber the operations to fix this scheme.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28 08:48:54 -06:00
Christoph Hellwig
ef295ecf09 block: better op and flags encoding
Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields.  This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.

In addition this allows passing around only one value in the block layer
instead of two (and eventuall also in the file systems, but we can do
that later) and thus clean up a lot of code.

Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32-bits.  Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28 08:48:16 -06:00
Christoph Hellwig
e806402130 block: split out request-only flags into a new namespace
A lot of the REQ_* flags are only used on struct requests, and only of
use to the block layer and a few drivers that dig into struct request
internals.

This patch adds a new req_flags_t rq_flags field to struct request for
them, and thus dramatically shrinks the number of common requests.  It
also removes the unfortunate situation where we have to fit the fields
from the same enum into 32 bits for struct bio and 64 bits for
struct request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28 08:45:17 -06:00
Christoph Hellwig
8d2bbd4c82 block: replace REQ_THROTTLED with a bio flag
It's the last bio-only REQ_* flag, and we have space for it in the bio
bi_flags field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28 08:45:17 -06:00
Christoph Hellwig
188bd2b16b block: move REQ_RAHEAD to common flags
The information that am I/O is a read-ahead can be useful for drivers.
In fact the NVMe driver already checks it, even if it won't ever be set
at the moment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28 08:45:17 -06:00
Christoph Hellwig
bd1c1c2174 block: REQ_NOMERGE is common to the bio and request
So move it into the common setion of the request flags.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28 08:45:17 -06:00
Shaun Tancheff
2d253440b5 block: Define zoned block device operations
Define REQ_OP_ZONE_REPORT and REQ_OP_ZONE_RESET for handling zones of
host-managed and host-aware zoned block devices. With with these two
new operations, the total number of operations defined reaches 8 and
still fits with the 3 bits definition of REQ_OP_BITS.

Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Damien Le Moal <damien.lemoal@hgst.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-18 10:02:05 -06:00
Christoph Hellwig
3f7c624aa5 block: remove bio_destructor_t
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-14 09:18:06 -06:00
Bart Van Assche
3e1de31b9b block: Improve bio_set_op_attrs() robustness
Since REQ_OP_BITS == 3 and __REQ_NR_BITS == 30 it is not that hard
to pass an op_flags argument to bio_set_op_attrs() that is larger
than the number of bits reserved for the op_flags argument. Complain
if this happens. Additionally, ensure that negative arguments trigger
a complaint (1 << ... is signed while 1U << ... is unsigned; adding
0U to an integer expression causes it to be promoted to an unsigned
type).

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@hgst.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-14 08:48:29 -06:00
Bart Van Assche
4382e33ad3 block, dm-crypt, btrfs: Introduce bio_flags()
Introduce the bio_flags() macro. Ensure that the second argument of
bio_set_op_attrs() only contains flags and no operation. This patch
does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Chris Mason <clm@fb.com> (maintainer:BTRFS FILE SYSTEM)
Cc: Josef Bacik <jbacik@fb.com> (maintainer:BTRFS FILE SYSTEM)
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@hgst.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-14 08:48:27 -06:00
Bart Van Assche
637ca77bd1 block: Document that bio_op() uses the data type of bio.bi_opf
Make it clear that the sizeof(unsigned int) expression in BIO_OP_SHIFT
refers to the bi_opf member of struct bio.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@hgst.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-14 08:48:26 -06:00
Jens Axboe
1eff9d322a block: rename bio bi_rw to bi_opf
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Jens Axboe
c11f0c0b5b block/mm: make bdev_ops->rw_page() take a bool for read/write
Commit abf545484d changed it from an 'rw' flags type to the
newer ops based interface, but now we're effectively leaking
some bdev internals to the rest of the kernel. Since we only
care about whether it's a read or a write at that level, just
pass in a bool 'is_write' parameter instead.

Then we can also move op_is_write() and friends back under
CONFIG_BLOCK protection.

Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Mike Christie
abf545484d mm/block: convert rw_page users to bio op use
The rw_page users were not converted to use bio/req ops. As a result
bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
be sent down as reads.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixes: 4e1b2d52a8 ("block, fs, drivers: remove REQ_OP compat defs and related code")

Modified by me to:

1) Drop op_flags passing into ->rw_page(), as we don't use it.
2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:25:33 -06:00
Linus Torvalds
3fc9d69093 Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "This branch also contains core changes.  I've come to the conclusion
  that from 4.9 and forward, I'll be doing just a single branch.  We
  often have dependencies between core and drivers, and it's hard to
  always split them up appropriately without pulling core into drivers
  when that happens.

  That said, this contains:

   - separate secure erase type for the core block layer, from
     Christoph.

   - set of discard fixes, from Christoph.

   - bio shrinking fixes from Christoph, as a followup up to the
     op/flags change in the core branch.

   - map and append request fixes from Christoph.

   - NVMeF (NVMe over Fabrics) code from Christoph.  This is pretty
     exciting!

   - nvme-loop fixes from Arnd.

   - removal of ->driverfs_dev from Dan, after providing a
     device_add_disk() helper.

   - bcache fixes from Bhaktipriya and Yijing.

   - cdrom subchannel read fix from Vchannaiah.

   - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

   - set of drbd updates and fixes from Fabian, Lars, and Philipp.

   - mg_disk error path fix from Bart.

   - user notification for failed device add for loop, from Minfei.

   - NVMe in general:
        + NVMe delay quirk from Guilherme.
        + SR-IOV support and command retry limits from Keith.
        + fix for memory-less NUMA node from Masayoshi.
        + use UINT_MAX for discard sectors, from Minfei.
        + cancel IO fixes from Ming.
        + don't allocate unused major, from Neil.
        + error code fixup from Dan.
        + use constants for PSDT/FUSE from James.
        + variable init fix from Jay.
        + fabrics fixes from Ming, Sagi, and Wei.
        + various fixes"

* 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
  nvme/pci: Provide SR-IOV support
  nvme: initialize variable before logical OR'ing it
  block: unexport various bio mapping helpers
  scsi/osd: open code blk_make_request
  target: stop using blk_make_request
  block: simplify and export blk_rq_append_bio
  block: ensure bios return from blk_get_request are properly initialized
  virtio_blk: use blk_rq_map_kern
  memstick: don't allow REQ_TYPE_BLOCK_PC requests
  block: shrink bio size again
  block: simplify and cleanup bvec pool handling
  block: get rid of bio_rw and READA
  block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
  block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
  NVMe: don't allocate unused nvme_major
  nvme: avoid crashes when node 0 is memoryless node.
  nvme: Limit command retries
  loop: Make user notify for adding loop device failed
  nvme-loop: fix nvme-loop Kconfig dependencies
  nvmet: fix return value check in nvmet_subsys_alloc()
  ...
2016-07-26 15:37:51 -07:00
Christoph Hellwig
c0acf12a50 block: shrink bio size again
The recent ops split grew the bio by adding the new ioprio field.
Shrink it again by using a 16-bit field for the bi_flags value and
filling the holes near the beginning of the structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 17:37:04 -06:00
Christoph Hellwig
ed996a52c8 block: simplify and cleanup bvec pool handling
Instead of a flag and an index just make sure an index of 0 means
no need to free the bvec array.  Also move the constants related
to the bvec pools together and use a consistent naming scheme for
them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 17:37:02 -06:00
Ming Lei
0781e79eb2 block: move two bvec structure into bvec.h
This patch moves 'struct bio_vec' and 'struct bvec_iter'
into 'include/linux/bvec.h', then always include this header
into 'include/linux/blk_types.h'.

With this change, both 'struct bvec_iter' and bvec iterator
helpers don't depend on CONFIG_BLOCK any more, then we can
use bvec iterator to implement iterate_bvec(): lib/iov_iter.c.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-09 10:02:47 -06:00
Christoph Hellwig
288dab8a35 block: add a separate operation type for secure erase
Instead of overloading the discard support with the REQ_SECURE flag.
Use the opportunity to rename the queue flag as well, and remove the
dead checks for this flag in the RAID 1 and RAID 10 drivers that don't
claim support for secure erase.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-09 09:52:25 -06:00
Mike Christie
28a8f0d317 block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
3a5e02ced1 block, drivers: add REQ_OP_FLUSH operation
This adds a REQ_OP_FLUSH operation that is sent to request_fn
based drivers by the block layer's flush code, instead of
sending requests with the request->cmd_flags REQ_FLUSH bit set.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
4e1b2d52a8 block, fs, drivers: remove REQ_OP compat defs and related code
This patch drops the compat definition of req_op where it matches
the rq_flag_bits definitions, and drops the related old and compat
code that allowed users to set either the op or flags for the operation.

We also then store the operation in the bi_rw/cmd_flags field similar
to how we used to store the bio ioprio where it sat in the upper bits
of the field.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
6296b9604f block, drivers, fs: shrink bi_rw from long to int
We don't need bi_rw to be so large on 64 bit archs, so
reduce it to unsigned int.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
43b62ce3ff block: move bio io prio to a new field
In the next patch, we move drop the compat code and make
the op a separate value that is hidden in bi_rw. To give
the op and rq bits flags room to grow this moves prio to
its own field.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
f21508211d block: add REQ_OP definitions and helpers
The following patches separate the operation (WRITE, READ, DISCARD,
etc) from the rq_flag_bits flags. This patch adds definitions for
request/bio operations (REQ_OPs) and adds request/bio accessors to
get/set the op.

In this patch the REQ_OPs match the REQ rq_flag_bits ones
for compat reasons while all the code is converted to use the
op accessors in the set. In the last patches the op will become a
number and the accessors and helpers in this patch will be dropped
or updated.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Shaohua Li
1fcbcc333f block: copy NOMERGE flag from bio to request
bio might have NOMERGE flag set, for example blk_queue_split sets it.
When we initiate request, copy this flag too.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-25 19:13:44 -06:00
Christoph Hellwig
bbc758ec04 block: remove REQ_NO_TIMEOUT flag
This was added for the 'magic' AEN requests in the NVMe driver that never
return.  We now handle them purely inside the driver and don't need this
core hack any more.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-22 09:38:34 -07:00
Jens Axboe
e3a7a3bf36 block: don't hardcode blk_qc_t -> tag mask
Use the shift/mask we use elsewhere.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-11 09:37:34 -07:00
Jens Axboe
dece16353e block: change ->make_request_fn() and users to return a queue cookie
No functional changes in this patch, but it prepares us for returning
a more useful cookie related to the IO that was queued up.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
2015-11-07 10:40:46 -07:00
Linus Torvalds
e31fb9e005 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext3 removal, quota & udf fixes from Jan Kara:
 "The biggest change in the pull is the removal of ext3 filesystem
  driver (~28k lines removed).  Ext4 driver is a full featured
  replacement these days and both RH and SUSE use it for several years
  without issues.  Also there are some workarounds in VM & block layer
  mainly for ext3 which we could eventually get rid of.

  Other larger change is addition of proper error handling for
  dquot_initialize().  The rest is small fixes and cleanups"

[ I wasn't convinced about the ext3 removal and worried about things
  falling through the cracks for legacy users, but ext4 maintainers
  piped up and were all unanimously in favor of removal, and maintaining
  all legacy ext3 support inside ext4.   - Linus ]

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Don't modify filesystem for read-only mounts
  quota: remove an unneeded condition
  ext4: memory leak on error in ext4_symlink()
  mm/Kconfig: NEED_BOUNCE_POOL: clean-up condition
  ext4: Improve ext4 Kconfig test
  block: Remove forced page bouncing under IO
  fs: Remove ext3 filesystem driver
  doc: Update doc about journalling layer
  jfs: Handle error from dquot_initialize()
  reiserfs: Handle error from dquot_initialize()
  ocfs2: Handle error from dquot_initialize()
  ext4: Handle error from dquot_initialize()
  ext2: Handle error from dquot_initalize()
  quota: Propagate error from ->acquire_dquot()
2015-09-03 12:28:30 -07:00
Jens Axboe
2c68f6dc6e block: shrink struct bio down to 2 cache lines again
Commit bcf2843b3f8f added ->bi_error to cleanup the error passing
for struct bio, but that ended up adding 4 bytes and a 4 byte hole
to the size of struct bio. For a clean config, that bumped it from
128 bytes, to 136 bytes, on x86-64.

The ->bi_flags member is currently an unsigned long, but it fits
easily within an int. Change it to an unsigned int, adjust the
the pool offset code, and move ->bi_error into the new hole. Then
we end up with a 128 byte bio again.

Change the bio flag set/clear to use cmpxchg to ensure we don't
lose any flags when manipulating them.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29 08:55:25 -06:00
Jens Axboe
b7c44ed9d2 block: manipulate bio->bi_flags through helpers
Some places use helpers now, others don't. We only have the 'is set'
helper, add helpers for setting and clearing flags too.

It was a bit of a mess of atomic vs non-atomic access. With
BIO_UPTODATE gone, we don't have any risk of concurrent access to the
flags. So relax the restriction and don't make any of them atomic. The
flags that do have serialization issues (reffed and chained), we
already handle those separately.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29 08:55:20 -06:00
Christoph Hellwig
4246a0b63b block: add a bi_error field to struct bio
Currently we have two different ways to signal an I/O error on a BIO:

 (1) by clearing the BIO_UPTODATE flag
 (2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario.  Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29 08:55:15 -06:00
Jan Kara
a3ad0a9da8 block: Remove forced page bouncing under IO
JBD layer wrote back data buffers without setting PageWriteback bit.
Thus standard mechanism for guaranteeing stable pages under IO did not
work. Since JBD is gone now and there is no other user of the
functionality, just remove it.

Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jan Kara <jack@suse.cz>
2015-07-23 20:59:40 +02:00
Mike Snitzer
78d8e58a08 Revert "block, dm: don't copy bios for request clones"
This reverts commit 5f1b670d0b.

Justification for revert as reported in this dm-devel post:
https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html

this change should not be pushed to mainline yet.

Firstly, Christoph has a newer version of the patch that fixes silent
data corruption problem:
  https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html

And the new version still depends on LLDDs to always complete requests
to the end when error happens, while block API doesn't enforce such a
requirement. If the assumption is ever broken, the inconsistency between
request and bio (e.g. rq->__sector and rq->bio) will cause silent data
corruption:
  https://www.redhat.com/archives/dm-devel/2015-June/msg00022.html

Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-26 10:11:58 -04:00
Mike Snitzer
183f7802e7 Merge remote-tracking branch 'jens/for-4.2/core' into dm-4.2 2015-05-29 14:17:16 -04:00
Christoph Hellwig
5f1b670d0b block, dm: don't copy bios for request clones
Currently dm-multipath has to clone the bios for every request sent
to the lower devices, which wastes cpu cycles and ties down memory.

This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
to not complete bios attached to a request, which we set on clone
requests similar to bios in a flush sequence.  With this change I/O
errors on a path failure only get propagated to dm-multipath, which
can then either resubmit the I/O or complete the bios on the original
request.

I've done some basic testing of this on a Linux target with ALUA support,
and it survives path failures during I/O nicely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 08:58:57 -06:00
Jens Axboe
b2dbe0a60f block: collapse bio bit space
Various previous patches removed bits and left holes, collapse them
all. Leave the reset start bit where it is, we don't need to change
that.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-19 09:18:28 -06:00
Christoph Hellwig
97ca223c3b block: remove unused BIO_RW_BLOCK and BIO_EOF flags
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-19 09:17:05 -06:00
Christoph Hellwig
b25de9d6da block: remove BIO_EOPNOTSUPP
Since the big barrier rewrite/removal in 2007 we never fail FLUSH or
FUA requests, which means we can remove the magic BIO_EOPNOTSUPP flag
to help propagating those to the buffer_head layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-19 09:17:03 -06:00
Jens Axboe
dac56212e8 bio: skip atomic inc/dec of ->bi_cnt for most use cases
Struct bio has a reference count that controls when it can be freed.
Most uses cases is allocating the bio, which then returns with a
single reference to it, doing IO, and then dropping that single
reference. We can remove this atomic_dec_and_test() in the completion
path, if nobody else is holding a reference to the bio.

If someone does call bio_get() on the bio, then we flag the bio as
now having valid count and that we must properly honor the reference
count when it's being put.

Tested-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:32:49 -06:00
Jens Axboe
c4cf5261f8 bio: skip atomic inc/dec of ->bi_remaining for non-chains
Struct bio has an atomic ref count for chained bio's, and we use this
to know when to end IO on the bio. However, most bio's are not chained,
so we don't need to always introduce this atomic operation as part of
ending IO.

Add a helper to elevate the bi_remaining count, and flag the bio as
now actually needing the decrement at end_io time. Rename the field
to __bi_remaining to catch any current users of this doing the
incrementing manually.

For high IOPS workloads, this reduces the overhead of bio_endio()
substantially.

Tested-by: Robert Elliott <elliott@hp.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:32:47 -06:00
Shaohua Li
b2387ddcce blk-mq: fix FUA request hang
When a FUA request enters its DATA stage of flush pipeline, the
request is added to mq requeue list, the request will then be added to
ctx->rq_list. blk_mq_attempt_merge() might merge the request with a bio.
Later when the request is finished the flush pipeline, the
request->__data_len is 0. Then I only saw the bio gets endio called, the
original request never finish.

Adding REQ_FLUSH_SEQ into REQ_NOMERGE_FLAGS looks an easy fix.

stable: 3.15+

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-04 13:09:55 -06:00
Bart Van Assche
bba0bdd7ad Defer processing of REQ_PREEMPT requests for blocked devices
SCSI transport drivers and SCSI LLDs block a SCSI device if the
transport layer is not operational. This means that in this state
no requests should be processed, even if the REQ_PREEMPT flag has
been set. This patch avoids that a rescan shortly after a cable
pull sporadically triggers the following kernel oops:

BUG: unable to handle kernel paging request at ffffc9001a6bc084
IP: [<ffffffffa04e08f2>] mlx4_ib_post_send+0xd2/0xb30 [mlx4_ib]
Process rescan-scsi-bus (pid: 9241, threadinfo ffff88053484a000, task ffff880534aae100)
Call Trace:
 [<ffffffffa0718135>] srp_post_send+0x65/0x70 [ib_srp]
 [<ffffffffa071b9df>] srp_queuecommand+0x1cf/0x3e0 [ib_srp]
 [<ffffffffa0001ff1>] scsi_dispatch_cmd+0x101/0x280 [scsi_mod]
 [<ffffffffa0009ad1>] scsi_request_fn+0x411/0x4d0 [scsi_mod]
 [<ffffffff81223b37>] __blk_run_queue+0x27/0x30
 [<ffffffff8122a8d2>] blk_execute_rq_nowait+0x82/0x110
 [<ffffffff8122a9c2>] blk_execute_rq+0x62/0xf0
 [<ffffffffa000b0e8>] scsi_execute+0xe8/0x190 [scsi_mod]
 [<ffffffffa000b2f3>] scsi_execute_req+0xa3/0x130 [scsi_mod]
 [<ffffffffa000c1aa>] scsi_probe_lun+0x17a/0x450 [scsi_mod]
 [<ffffffffa000ce86>] scsi_probe_and_add_lun+0x156/0x480 [scsi_mod]
 [<ffffffffa000dc2f>] __scsi_scan_target+0xdf/0x1f0 [scsi_mod]
 [<ffffffffa000dfa3>] scsi_scan_host_selected+0x183/0x1c0 [scsi_mod]
 [<ffffffffa000edfb>] scsi_scan+0xdb/0xe0 [scsi_mod]
 [<ffffffffa000ee13>] store_scan+0x13/0x20 [scsi_mod]
 [<ffffffff811c8d9b>] sysfs_write_file+0xcb/0x160
 [<ffffffff811589de>] vfs_write+0xce/0x140
 [<ffffffff81158b53>] sys_write+0x53/0xa0
 [<ffffffff81464592>] system_call_fastpath+0x16/0x1b
 [<00007f611c9d9300>] 0x7f611c9d92ff

Reported-by: Max Gurtuvoy <maxg@mellanox.com>
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Cc: <stable@vger.kernel.org>
Signed-off-by: James Bottomley <JBottomley@Odin.com>
2015-04-08 09:41:41 -07:00
Keith Busch
5b3f25fc34 blk-mq: Allow requests to never expire
Some types of requests may be started that are not gauranteed to ever
complete. This adds a request flag that a driver can use so mark the
request as such.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-08 08:59:01 -07:00
Martin K. Petersen
e19a8a0ad2 block: Remove REQ_KERNEL
REQ_KERNEL is no longer used. Remove it and drop the redundant uio
argument to nfs_file_direct_{read,write}.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-14 09:00:44 -06:00
Martin K. Petersen
b1f0138857 block: Relocate bio integrity flags
Move flags affecting the integrity code out of the bio bi_flags and into
the block integrity payload.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-27 09:14:54 -06:00
Martin K. Petersen
180b2f95dd block: Replace bi_integrity with bi_special
For commands like REQ_COPY we need a way to pass extra information along
with each bio. Like integrity metadata this information must be
available at the bottom of the stack so bi_private does not suffice.

Rename the existing bi_integrity field to bi_special and make it a union
so we can have different bio extensions for each class of command.

We previously used bi_integrity != NULL as a way to identify whether a
bio had integrity metadata or not. Introduce a REQ_INTEGRITY to be the
indicator now that bi_special can contain different things.

In addition, bio_integrity(bio) will now return a pointer to the
integrity payload (when applicable).

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-27 09:14:46 -06:00
Christoph Hellwig
bf57229745 blk-mq: remove REQ_END
Pass an explicit parameter for the last request in a batch to ->queue_rq
instead of using a request flag.  Besides being a cleaner and non-stateful
interface this is also required for the next patch, which fixes the blk-mq
I/O submission code to not start a time too early.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-22 12:00:07 -06:00
Linus Torvalds
16b9057804 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "This the bunch that sat in -next + lock_parent() fix.  This is the
  minimal set; there's more pending stuff.

  In particular, I really hope to get acct.c fixes merged this cycle -
  we need that to deal sanely with delayed-mntput stuff.  In the next
  pile, hopefully - that series is fairly short and localized
  (kernel/acct.c, fs/super.c and fs/namespace.c).  In this pile: more
  iov_iter work.  Most of prereqs for ->splice_write with sane locking
  order are there and Kent's dio rewrite would also fit nicely on top of
  this pile"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
  lock_parent: don't step on stale ->d_parent of all-but-freed one
  kill generic_file_splice_write()
  ceph: switch to iter_file_splice_write()
  shmem: switch to iter_file_splice_write()
  nfs: switch to iter_splice_write_file()
  fs/splice.c: remove unneeded exports
  ocfs2: switch to iter_file_splice_write()
  ->splice_write() via ->write_iter()
  bio_vec-backed iov_iter
  optimize copy_page_{to,from}_iter()
  bury generic_file_aio_{read,write}
  lustre: get rid of messing with iovecs
  ceph: switch to ->write_iter()
  ceph_sync_direct_write: stop poking into iov_iter guts
  ceph_sync_read: stop poking into iov_iter guts
  new helper: copy_page_from_iter()
  fuse: switch to ->write_iter()
  btrfs: switch to ->write_iter()
  ocfs2: switch to ->write_iter()
  xfs: switch to ->write_iter()
  ...
2014-06-12 10:30:18 -07:00
Jens Axboe
0d2602ca30 blk-mq: improve support for shared tags maps
This adds support for active queue tracking, meaning that the
blk-mq tagging maintains a count of active users of a tag set.
This allows us to maintain a notion of fairness between users,
so that we can distribute the tag depth evenly without starving
some users while allowing others to try unfair deep queues.

If sharing of a tag set is detected, each hardware queue will
track the depth of its own queue. And if this exceeds the total
depth divided by the number of active queues, the user is actively
throttled down.

The active queue count is done lazily to avoid bouncing that data
between submitter and completer. Each hardware queue gets marked
active when it allocates its first tag, and gets marked inactive
when 1) the last tag is cleared, and 2) the queue timeout grace
period has passed.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-13 15:10:52 -06:00
Al Viro
62a8067a7f bio_vec-backed iov_iter
New variant of iov_iter - ITER_BVEC in iter->type, backed with
bio_vec array instead of iovec one.  Primitives taught to deal
with such beasts, __swap_write() switched to using that kind
of iov_iter.

Note that bio_vec is just a <page, offset, length> triple - there's
nothing block-specific about it.  I've left the definition where it
was, but took it from under ifdef CONFIG_BLOCK.

Next target: ->splice_write()...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:45 -04:00
Jens Axboe
360f92c244 block: fix regression with block enabled tagging
Martin reported that his test system would not boot with
current git, it oopsed with this:

BUG: unable to handle kernel paging request at ffff88046c6c9e80
IP: [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
PGD 1ddf067 PUD 1de2067 PMD 47fc7d067 PTE 800000046c6c9060
Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: sd_mod lpfc(+) scsi_transport_fc scsi_tgt oracleasm
rpcsec_gss_krb5 ipv6 igb dca i2c_algo_bit i2c_core hwmon
CPU: 3 PID: 87 Comm: kworker/u17:1 Not tainted 3.14.0+ #246
Hardware name: Supermicro X9DRX+-F/X9DRX+-F, BIOS 3.00 07/09/2013
Workqueue: events_unbound async_run_entry_fn
task: ffff8802743c2150 ti: ffff880273d02000 task.ti: ffff880273d02000
RIP: 0010:[<ffffffff812971e0>]  [<ffffffff812971e0>]
blk_queue_start_tag+0x90/0x150
RSP: 0018:ffff880273d03a58  EFLAGS: 00010092
RAX: ffff88046c6c9e78 RBX: ffff880077208e78 RCX: 00000000fffc8da6
RDX: 00000000fffc186d RSI: 0000000000000009 RDI: 00000000fffc8d9d
RBP: ffff880273d03a88 R08: 0000000000000001 R09: ffff8800021c2410
R10: 0000000000000005 R11: 0000000000015b30 R12: ffff88046c5bb8a0
R13: ffff88046c5c0890 R14: 000000000000001e R15: 000000000000001e
FS:  0000000000000000(0000) GS:ffff880277b00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88046c6c9e80 CR3: 00000000018f6000 CR4: 00000000000407e0
Stack:
 ffff880273d03a98 ffff880474b18800 0000000000000000 ffff880474157000
 ffff88046c5c0890 ffff880077208e78 ffff880273d03ae8 ffffffff813b9e62
 ffff880200000010 ffff880474b18968 ffff880474b18848 ffff88046c5c0cd8
Call Trace:
 [<ffffffff813b9e62>] scsi_request_fn+0xf2/0x510
 [<ffffffff81293167>] __blk_run_queue+0x37/0x50
 [<ffffffff8129ac43>] blk_execute_rq_nowait+0xb3/0x130
 [<ffffffff8129ad24>] blk_execute_rq+0x64/0xf0
 [<ffffffff8108d2b0>] ? bit_waitqueue+0xd0/0xd0
 [<ffffffff813bba35>] scsi_execute+0xe5/0x180
 [<ffffffff813bbe4a>] scsi_execute_req_flags+0x9a/0x110
 [<ffffffffa01b1304>] sd_spinup_disk+0x94/0x460 [sd_mod]
 [<ffffffff81160000>] ? __unmap_hugepage_range+0x200/0x2f0
 [<ffffffffa01b2b9a>] sd_revalidate_disk+0xaa/0x3f0 [sd_mod]
 [<ffffffffa01b2fb8>] sd_probe_async+0xd8/0x200 [sd_mod]
 [<ffffffff8107703f>] async_run_entry_fn+0x3f/0x140
 [<ffffffff8106a1c5>] process_one_work+0x175/0x410
 [<ffffffff8106b373>] worker_thread+0x123/0x400
 [<ffffffff8106b250>] ? manage_workers+0x160/0x160
 [<ffffffff8107104e>] kthread+0xce/0xf0
 [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff815f0bac>] ret_from_fork+0x7c/0xb0
 [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
Code: 48 0f ab 11 72 db 48 81 4b 40 00 00 10 00 89 83 08 01 00 00 48 89
df 49 8b 04 24 48 89 1c d0 e8 f7 a8 ff ff 49 8b 85 28 05 00 00 <48> 89
58 08 48 89 03 49 8d 85 28 05 00 00 48 89 43 08 49 89 9d
RIP  [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
 RSP <ffff880273d03a58>
CR2: ffff88046c6c9e80

Martin bisected and found this to be the problem patch;

	commit 6d113398dc
	Author: Jan Kara <jack@suse.cz>
	Date:   Mon Feb 24 16:39:54 2014 +0100

	    block: Stop abusing rq->csd.list in blk-softirq

and the problem was immediately apparent. The patch states that
it is safe to reuse queuelist at completion time, since it is
no longer used. However, that is not true if a device is using
block enabled tagging. If that is the case, then the queuelist
is reused to keep track of busy tags. If a device also ended
up using softirq completions, we'd reuse ->queuelist for the
IPI handling while block tagging was still using it. Boom.

Fix this by adding a new ipi_list list head, and share the
memory used with the request hash table. The hash table is
never used after the request is moved to the dispatch list,
which happens long before any potential completion of the
request. Add a new request bit for this, so we don't have
cases that check rq->hash while it could potentially have
been reused for the IPI completion.

Reported-by: Martin K. Petersen <martin.petersen@oracle.com>
Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-09 21:54:06 -06:00
Kent Overstreet
196d38bccf block: Generic bio chaining
This adds a generic mechanism for chaining bio completions. This is
going to be used for a bio_split() replacement, and it turns out to be
very useful in a fair amount of driver code - a fair number of drivers
were implementing this in their own roundabout ways, often painfully.

Note that this means it's no longer to call bio_endio() more than once
on the same bio! This can cause problems for drivers that save/restore
bi_end_io. Arguably they shouldn't be saving/restoring bi_end_io at all
- in all but the simplest cases they'd be better off just cloning the
bio, and immutable biovecs is making bio cloning cheaper. But for now,
we add a bio_endio_nodec() for these cases.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
2013-11-23 22:33:56 -08:00
Kent Overstreet
4550dd6c6b block: Immutable bio vecs
This adds a mechanism by which we can advance a bio by an arbitrary
number of bytes without modifying the biovec: bio->bi_iter.bi_bvec_done
indicates the number of bytes completed in the current bvec.

Various driver code still needs to be updated to not refer to the bvec
directly before we can use this for interesting things, like efficient
bio splitting.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Cc: drbd-user@lists.linbit.com
Cc: nbd-general@lists.sourceforge.net
2013-11-23 22:33:49 -08:00
Kent Overstreet
4f024f3797 block: Abstract out bvec iterator
Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6
2013-11-23 22:33:47 -08:00
Jens Axboe
320ae51fee blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:

- The classic request_fn based approach, where drivers use struct
  request units for IO. The block layer provides various helper
  functionalities to let drivers share code, things like tag
  management, timeout handling, queueing, etc.

- The "stacked" approach, where a driver squeezes in between the
  block layer and IO submitter. Since this bypasses the IO stack,
  driver generally have to manage everything themselves.

With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.

The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.

This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.

blk-mq provides various helper functions, which include:

- Scalable support for request tagging. Most devices need to
  be able to uniquely identify a request both in the driver and
  to the hardware. The tagging uses per-cpu caches for freed
  tags, to enable cache hot reuse.

- Timeout handling without tracking request on a per-device
  basis. Basically the driver should be able to get a notification,
  if a request happens to fail.

- Optional support for non 1:1 mappings between issue and
  submission queues. blk-mq can redirect IO completions to the
  desired location.

- Support for per-request payloads. Drivers almost always need
  to associate a request structure with some driver private
  command structure. Drivers can tell blk-mq this at init time,
  and then any request handed to the driver will have the
  required size of memory associated with it.

- Support for merging of IO, and plugging. The stacked model
  gets neither of these. Even for high IOPS devices, merging
  sequential IO reduces per-command overhead and thus
  increases bandwidth.

For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).

Contributions in this patch from the following people:

Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>

Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-25 11:56:00 +01:00
Jens Axboe
5953316dbf block: make rq->cmd_flags be 64-bit
We have officially run out of flags in a 32-bit space. Extend it
to 64-bit even on 32-bit archs.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-25 11:55:59 +01:00
Linus Torvalds
4de13d7aa8 Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block
Pull block core updates from Jens Axboe:

 - Major bit is Kents prep work for immutable bio vecs.

 - Stable candidate fix for a scheduling-while-atomic in the queue
   bypass operation.

 - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
   discard bios.

 - Tejuns changes to convert the writeback thread pool to the generic
   workqueue mechanism.

 - Runtime PM framework, SCSI patches exists on top of these in James'
   tree.

 - A few random fixes.

* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
  relay: move remove_buf_file inside relay_close_buf
  partitions/efi.c: replace useless kzalloc's by kmalloc's
  fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
  block: fix max discard sectors limit
  blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
  Documentation: cfq-iosched: update documentation help for cfq tunables
  writeback: expose the bdi_wq workqueue
  writeback: replace custom worker pool implementation with unbound workqueue
  writeback: remove unused bdi_pending_list
  aoe: Fix unitialized var usage
  bio-integrity: Add explicit field for owner of bip_buf
  block: Add an explicit bio flag for bios that own their bvec
  block: Add bio_alloc_pages()
  block: Convert some code to bio_for_each_segment_all()
  block: Add bio_for_each_segment_all()
  bounce: Refactor __blk_queue_bounce to not use bi_io_vec
  raid1: use bio_copy_data()
  pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
  pktcdvd: use bio_copy_data()
  block: Add bio_copy_data()
  ...
2013-05-08 10:13:35 -07:00
Darrick J. Wong
7136851117 mm: make snapshotting pages for stable writes a per-bio operation
Walking a bio's page mappings has proved problematic, so create a new
bio flag to indicate that a bio's data needs to be snapshotted in order
to guarantee stable pages during writeback.  Next, for the one user
(ext3/jbd) of snapshotting, hook all the places where writes can be
initiated without PG_writeback set, and set BIO_SNAP_STABLE there.

We must also flag journal "metadata" bios for stable writeout, since
file data can be written through the journal.  Finally, the
MS_SNAP_STABLE mount flag (only used by ext3) is now superfluous, so get
rid of it.

[akpm@linux-foundation.org: rename _submit_bh()'s `flags' to `bio_flags', delobotomize the _submit_bh declaration]
[akpm@linux-foundation.org: teeny cleanup]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:33 -07:00
Jens Axboe
705cd0ea1c Merge branch 'for-jens' of http://evilpiepirate.org/git/linux-bcache into for-3.10/core
This contains Kents prep work for the immutable bio_vecs.
2013-03-24 21:38:59 -06:00
Kent Overstreet
a38352e0ac block: Add an explicit bio flag for bios that own their bvec
This is for the new bio splitting code. When we split a bio, if the
split occured on a bvec boundry we reuse the bvec for the new bio. But
that means bio_free() can't free it, hence the explicit flag.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
Acked-by: Tejun Heo <tj@kernel.org>
2013-03-23 14:26:33 -07:00
Kent Overstreet
054bdf646e block: Add bio_advance()
This is prep work for immutable bio vecs; we first want to centralize
where bvecs are modified.

Next two patches convert some existing code to use this function.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
2013-03-23 14:15:27 -07:00
Lin Ming
6631127469 block: add a flag to identify PM request
Add a flag REQ_PM to identify the request is PM related, such requests
will not change the device request queue's runtime status. It is
intended to be used in driver's runtime PM callback, so that driver can
perform some IO to the device there with the queue's runtime status
unaffected. e.g. in SCSI disk's runtime suspend callback, the disk will
be put into stopped power state, and this require sending a command to
the device. Such command processing should not change the disk's runtime
status.

Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 22:22:15 -06:00
Martin K. Petersen
4363ac7c13 block: Implement support for WRITE SAME
The WRITE SAME command supported on some SCSI devices allows the same
block to be efficiently replicated throughout a block range. Only a
single logical block is transferred from the host and the storage device
writes the same data to all blocks described by the I/O.

This patch implements support for WRITE SAME in the block layer. The
blkdev_issue_write_same() function can be used by filesystems and block
drivers to replicate a buffer across a block range. This can be used to
efficiently initialize software RAID devices, etc.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-20 14:31:45 +02:00
Martin K. Petersen
e2a60da74f block: Clean up special command handling logic
Remove special-casing of non-rw fs style requests (discard). The nomerge
flags are consolidated in blk_types.h, and rq_mergeable() and
bio_mergeable() have been modified to use them.

bio_is_rw() is used in place of bio_has_data() a few places. This is
done to to distinguish true reads and writes from other fs type requests
that carry a payload (e.g. write same).

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-20 14:31:38 +02:00
Kent Overstreet
4254bba17d block: Kill bi_destructor
Now that we've got generic code for freeing bios allocated from bio
pools, this isn't needed anymore.

This patch also makes bio_free() static, since without bi_destructor
there should be no need for it to be called anywhere else.

bio_free() is now only called from bio_put, so we can refactor those a
bit - move some code from bio_put() to bio_free() and kill the redundant
bio->bi_next = NULL.

v5: Switch to BIO_KMALLOC_POOL ((void *)~0), per Boaz
v6: BIO_KMALLOC_POOL now NULL, drop bio_free's EXPORT_SYMBOL
v7: No #define BIO_KMALLOC_POOL anymore

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09 10:35:39 +02:00
Kent Overstreet
f44b48c769 block: Add bio_reset()
Reusing bios is something that's been highly frowned upon in the past,
but driver code keeps doing it anyways. If it's going to happen anyways,
we should provide a generic method.

This'll help with getting rid of bi_destructor - drivers/block/pktcdvd.c
was open coding it, by doing a bio_init() and resetting bi_destructor.

This required reordering struct bio, but the block layer is not yet
nearly fast enough for any cacheline effects to matter here.

v5: Add a define BIO_RESET_BITS, to be very explicit about what parts of
bio->bi_flags are saved.
v6: Further commenting verbosity, per Tejun
v9: Add a function comment

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09 10:35:39 +02:00
Kent Overstreet
395c72a707 block: Generalized bio pool freeing
With the old code, when you allocate a bio from a bio pool you have to
implement your own destructor that knows how to find the bio pool the
bio was originally allocated from.

This adds a new field to struct bio (bi_pool) and changes
bio_alloc_bioset() to use it. This makes various bio destructors
unnecessary, so they're then deleted.

v6: Explain the temporary if statement in bio_put

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
CC: Alasdair Kergon <agk@redhat.com>
CC: Nicholas Bellinger <nab@linux-iscsi.org>
CC: Lars Ellenberg <lars.ellenberg@linbit.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09 10:35:38 +02:00
Mel Gorman
18022c5d86 mm: add get_kernel_page[s] for pinning of kernel addresses for I/O
This patch adds two new APIs get_kernel_pages() and get_kernel_page() that
may be used to pin a vector of kernel addresses for IO.  The initial user
is expected to be NFS for allowing pages to be written to swap using
aops->direct_IO().  Strictly speaking, swap-over-NFS only needs to pin one
page for IO but it makes sense to express the API in terms of a vector and
add a helper for pinning single pages.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Xiaotian Feng <dfeng@redhat.com>
Cc: Mark Salter <msalter@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:47 -07:00
Tejun Heo
852c788f83 block: implement bio_associate_current()
IO scheduling and cgroup are tied to the issuing task via io_context
and cgroup of %current.  Unfortunately, there are cases where IOs need
to be routed via a different task which makes scheduling and cgroup
limit enforcement applied completely incorrectly.

For example, all bios delayed by blk-throttle end up being issued by a
delayed work item and get assigned the io_context of the worker task
which happens to serve the work item and dumped to the default block
cgroup.  This is double confusing as bios which aren't delayed end up
in the correct cgroup and makes using blk-throttle and cfq propio
together impossible.

Any code which punts IO issuing to another task is affected which is
getting more and more common (e.g. btrfs).  As both io_context and
cgroup are firmly tied to task including userland visible APIs to
manipulate them, it makes a lot of sense to match up tasks to bios.

This patch implements bio_associate_current() which associates the
specified bio with %current.  The bio will record the associated ioc
and blkcg at that point and block layer will use the recorded ones
regardless of which task actually ends up issuing the bio.  bio
release puts the associated ioc and blkcg.

It grabs and remembers ioc and blkcg instead of the task itself
because task may already be dead by the time the bio is issued making
ioc and blkcg inaccessible and those are all block layer cares about.

elevator_set_req_fn() is updated such that the bio elvdata is being
allocated for is available to the elevator.

This doesn't update block cgroup policies yet.  Further patches will
implement the support.

-v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in
     rq_ioc() to fix build breakage.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 21:27:24 +01:00
Tao Ma
9562ad9ab3 block: Remove the control of complete cpu from bio.
bio originally has the functionality to set the complete cpu, but
it is broken.

Chirstoph said that "This code is unused, and from the all the
discussions lately pretty obviously broken.  The only thing keeping
it serves is creating more confusion and possibly more bugs."

And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine
with leaving cpu control to the request based drivers, they are the
only ones that can toggle the setting anyway".

So this patch tries to remove all the work of controling complete cpu
from a bio.

Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-24 16:11:30 +02:00
Christoph Hellwig
65299a3b78 block: separate priority boosting from REQ_META
Add a new REQ_PRIO to let requests preempt others in the cfq I/O schedule,
and lave REQ_META purely for marking requests as metadata in blktrace.

All existing callers of REQ_META except for XFS are updated to also
set REQ_PRIO for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-23 14:50:29 +02:00
Matthew Wilcox
8e4bf84474 Move some REQ flags to the common bio/request area
REQ_SECURE, REQ_FLUSH and REQ_FUA may all be set on a bio as well as
on a request, so relocate them to the shared part of the enum.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-11 10:36:03 +02:00
Namhyung Kim
155d109b5f block: add REQ_SECURE to REQ_COMMON_MASK
Add REQ_SECURE flag to REQ_COMMON_MASK so that
init_request_from_bio() can pass it to @req->cmd_flags.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: stable@kernel.org # 2.6.36 and newer
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-20 13:23:14 +02:00
Jens Axboe
771949d03b block: get rid of on-stack plugging debug checks
We don't need them anymore, so kill:

- REQ_ON_PLUG checks in various places
- !rq_mergeable() check in plug merging

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:52:16 +02:00
Jens Axboe
721a9602e6 block: kill off REQ_UNPLUG
With the plugging now being explicitly controlled by the
submitter, callers need not pass down unplugging hints
to the block layer. If they want to unplug, it's because they
manually plugged on their own - in which case, they should just
unplug at will.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10 08:52:27 +01:00
Jens Axboe
73c1010119 block: initial patch for on-stack per-task plugging
This patch adds support for creating a queuing context outside
of the queue itself. This enables us to batch up pieces of IO
before grabbing the block device queue lock and submitting them to
the IO scheduler.

The context is created on the stack of the process and assigned in
the task structure, so that we can auto-unplug it if we hit a schedule
event.

The current queue plugging happens implicitly if IO is submitted to
an empty device, yet callers have to remember to unplug that IO when
they are going to wait for it. This is an ugly API and has caused bugs
in the past. Additionally, it requires hacks in the vm (->sync_page()
callback) to handle that logic. By switching to an explicit plugging
scheme we make the API a lot nicer and can get rid of the ->sync_page()
hack in the vm.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10 08:45:54 +01:00
Tejun Heo
414b4ff5ee block: add REQ_FLUSH_SEQ
rq == &q->flush_rq was used to determine whether a rq is part of a
flush sequence, which worked because all requests in a flush sequence
were sequenced using the single dedicated request.  This is about to
change, so introduce REQ_FLUSH_SEQ flag to distinguish flush sequence
requests.

This patch doesn't cause any behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-01-25 12:43:49 +01:00
Christoph Hellwig
02e031cbc8 block: remove REQ_HARDBARRIER
REQ_HARDBARRIER is dead now, so remove the leftovers.  What's left
at this point is:

 - various checks inside the block layer.
 - sanity checks in bio based drivers.
 - now unused bio_empty_barrier helper.
 - Xen blockfront use of BLKIF_OP_WRITE_BARRIER - it's dead for a while,
   but Xen really needs to sort out it's barrier situaton.
 - setting of ordered tags in uas - dead code copied from old scsi
   drivers.
 - scsi different retry for barriers - it's dead and should have been
   removed when flushes were converted to FS requests.
 - blktrace handling of barriers - removed.  Someone who knows blktrace
   better should add support for REQ_FLUSH and REQ_FUA, though.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-11-10 14:54:09 +01:00