Commit graph

965 commits

Author SHA1 Message Date
Jens Axboe
fb1e75389b block: improve queue_should_plug() by looking at IO depths
Instead of just checking whether this device uses block layer
tagging, we can improve the detection by looking at the maximum
queue depth it has reached. If that crosses 4, then deem it a
queuing device.

This is important on high IOPS devices, since plugging hurts
the performance there (it can be as much as 10-15% of the sys
time).

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 14:33:31 +02:00
Jens Axboe
1f98a13f62 bio: first step in sanitizing the bio->bi_rw flag testing
Get rid of any functions that test for these bits and make callers
use bio_rw_flagged() directly. Then it is at least directly apparent
what variable and flag they check.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 14:33:31 +02:00
Hannes Reinecke
e3264a4d7d Send uevents for write_protect changes
Whenever a block device changes it's read-only attribute
notify the userspace about it.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 14:33:31 +02:00
Vivek Goyal
d58b85e1e8 cfq-iosched: no need to keep track of busy_rt_queues
o Get rid of busy_rt_queues infrastructure. Looks like it is redundant.

o Once an RT queue gets request it will preempt any of the BE or IDLE queues
  immediately. Otherwise this queue will be put on service tree and scheduler
  will anyway select this queue before any of the BE or IDLE queue. Hence
  looks like there is no need to keep track of how many busy RT queues are
  currently on service tree.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 14:33:30 +02:00
Jens Axboe
5ad531db6e cfq-iosched: drain device queue before switching to a sync queue
To lessen the impact of async IO on sync IO, let the device drain of
any async IO in progress when switching to a sync cfqq that has idling
enabled.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 14:33:30 +02:00
Tejun Heo
da6c5c720c scsi,block: update SCSI to handle mixed merge failures
Update scsi_io_completion() such that it only fails requests till the
next error boundary and retry the leftover.  This enables block layer
to merge requests with different failfast settings and still behave
correctly on errors.  Allow merge of requests of different failfast
settings.

As SCSI is currently the only subsystem which follows failfast status,
there's no need to worry about other block drivers for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Niel Lambrechts <niel.lambrechts@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 14:33:30 +02:00
Tejun Heo
80a761fd33 block: implement mixed merge of different failfast requests
Failfast has characteristics from other attributes.  When issuing,
executing and successuflly completing requests, failfast doesn't make
any difference.  It only affects how a request is handled on failure.
Allowing requests with different failfast settings to be merged cause
normal IOs to fail prematurely while not allowing has performance
penalties as failfast is used for read aheads which are likely to be
located near in-flight or to-be-issued normal IOs.

This patch introduces the concept of 'mixed merge'.  A request is a
mixed merge if it is merge of segments which require different
handling on failure.  Currently the only mixable attributes are
failfast ones (or lack thereof).

When a bio with different failfast settings is added to an existing
request or requests of different failfast settings are merged, the
merged request is marked mixed.  Each bio carries failfast settings
and the request always tracks failfast state of the first bio.  When
the request fails, blk_rq_err_bytes() can be used to determine how
many bytes can be safely failed without crossing into an area which
requires further retrials.

This allows request merging regardless of failfast settings while
keeping the failure handling correct.

This patch only implements mixed merge but doesn't enable it.  The
next one will update SCSI to make use of mixed merge.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Niel Lambrechts <niel.lambrechts@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 14:33:30 +02:00
Tejun Heo
a82afdfcb8 block: use the same failfast bits for bio and request
bio and request use the same set of failfast bits.  This patch makes
the following changes to simplify things.

* enumify BIO_RW* bits and reorder bits such that BIOS_RW_FAILFAST_*
  bits coincide with __REQ_FAILFAST_* bits.

* The above pushes BIO_RW_AHEAD out of sync with __REQ_FAILFAST_DEV
  but the matching is useless anyway.  init_request_from_bio() is
  responsible for setting FAILFAST bits on FS requests and non-FS
  requests never use BIO_RW_AHEAD.  Drop the code and comment from
  blk_rq_bio_prep().

* Define REQ_FAILFAST_MASK which is OR of all FAILFAST bits and
  simplify FAILFAST flags handling in init_request_from_bio().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 14:33:27 +02:00
Nikanth Karthikesan
c295fc0578 block: Allow changing max_sectors_kb above the default 512
The patch "block: Use accessor functions for queue limits"
(ae03bf639a) changed queue_max_sectors_store()
to use blk_queue_max_sectors() instead of directly assigning the value.

But blk_queue_max_sectors() differs a bit
1. It sets both max_sectors_kb, and max_hw_sectors_kb
2. Never allows one to change max_sectors_kb above BLK_DEF_MAX_SECTORS. If one
specifies a value greater then max_hw_sectors is set to that value but
max_sectors is set to BLK_DEF_MAX_SECTORS

I am not sure whether blk_queue_max_sectors() should be changed, as it seems
to be that way for a long time. And there may be callers dependent on that
behaviour.

This patch simply reverts to the older way of directly assigning the value to
max_sectors as it was before.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-01 22:40:15 +02:00
John Stoffel
14d9fa3525 Make SCSI SG v4 driver enabled by default and remove EXPERIMENTAL dependency, since udev depends on BSG
Make Block Layer SG support v4 the default, since recent udev versions
depend on this to access serial numbers and other low level info properly.

This should be backported to older kernels as well, since most distros have
enabled this for a long time.

Signed-off-by: John Stoffel <john@stoffel.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-08-04 22:10:17 +02:00
Martin K. Petersen
7e5f5fb09e block: Update topology documentation
Update topology comments and sysfs documentation based upon discussions
with Neil Brown.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-08-01 10:24:35 +02:00
Martin K. Petersen
70dd5bf3b9 block: Stack optimal I/O size
When stacking block devices ensure that optimal I/O size is scaled
accordingly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-08-01 10:24:35 +02:00
Martin K. Petersen
7c958e3264 block: Add a wrapper for setting minimum request size without a queue
Introduce blk_limits_io_min() and make blk_queue_io_min() call it.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-08-01 10:24:35 +02:00
Martin K. Petersen
fef246672b block: Make blk_queue_stack_limits use the new stacking interface
blk_queue_stack_limits() has been superceded by blk_stack_limits() and
disk_stack_limits().  Wrap the function call for now, we'll deprecate it
later.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-08-01 10:24:35 +02:00
Jens Axboe
56ad1740d9 block: make the end_io functions be non-GPL exports
Prior to the change for more sane end_io functions, we exported
the helpers with the normal EXPORT_SYMBOL(). That got changed
to _GPL() for the new interface. Revert that particular change,
on the basis that this is basic functionality and doesn't dip
into internal structures. If these exports can't be non-GPL,
then we may as well make EXPORT_SYMBOL() imply GPL for
everything.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-07-28 22:11:24 +02:00
Xiaotian Feng
3839e4b29b block: fix improper kobject release in blk_integrity_unregister
blk_integrity_unregister should use kobject_put to release the kobject,
otherwise after bi is freed, memory of bi->kobj->name is leaked.

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-07-28 09:11:14 +02:00
Jens Axboe
a4e7d46407 block: always assign default lock to queues
Move the assignment of a default lock below blk_init_queue() to
blk_queue_make_request(), so we also get to set the default lock
for ->make_request_fn() based drivers. This is important since the
queue flag locking requires a lock to be in place.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-07-28 09:07:29 +02:00
Xiaotian Feng
9cb308ce8d block: sysfs fix mismatched queue_var_{store,show} in 64bit kernel
In blk-sysfs.c, queue_var_store uses unsigned long to store data,
but queue_var_show uses unsigned int to show data.  This causes,

	# echo 70000000000 > /sys/block/<dev>/queue/read_ahead_kb
	# cat /sys/block/<dev>/queue/read_ahead_kb => get wrong value

Fix it by using unsigned long.

While at it, convert queue_rq_affinity_show() such that it uses bool
variable instead of explicit != 0 testing.

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2009-07-17 16:54:15 +09:00
Tejun Heo
0a09f4319c block: fix failfast merge testing in elv_rq_merge_ok()
Commit ab0fd1debe tries to prevent merge
of requests with different failfast settings.  In elv_rq_merge_ok(),
it compares new bio's failfast flags against the merge target
request's.  However, the flag testing accessors for bio and blk don't
return boolean but the tested bit value directly and FAILFAST on bio
and blk don't match, so directly comparing them with == results in
false negative unnecessary preventing merge of readahead requests.

This patch convert the results to boolean by negating them before
comparison.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jeff Garzik <jeff@garzik.org>
2009-07-17 14:50:43 +09:00
Vivek Goyal
32f2e807a3 cfq-iosched: reset oom_cfqq in cfq_set_request()
In case memory is scarce, we now default to oom_cfqq. Once memory is
available again, we should allocate a new cfqq and stop using oom_cfqq for
a particular io context.

Once a new request comes in, check if we are using oom_cfqq, and if yes,
try to allocate a new cfqq.

Tested the patch by forcing the use of oom_cfqq and upon next request thread
realized that it was using oom_cfqq and it allocated a new cfqq.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-07-10 20:31:54 +02:00
FUJITA Tomonori
76da03467a block: call blk_scsi_ioctl_init()
Currently, blk_scsi_ioctl_init() is not called since it lacks
an initcall marking. This causes the command table to be
unitialized, hence somce commands are block when they should
not have been.

This fixes a regression introduced by commit
018e044689

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-07-10 20:31:53 +02:00
Tejun Heo
ab0fd1debe block: don't merge requests of different failfast settings
Block layer used to merge requests and bios with different failfast
settings.  This caused regular IOs to fail prematurely when they were
merged into failfast requests for readahead.

Niel Lambrechts could trigger the problem semi-reliably on ext4 when
resuming from STR.  ext4 uses readahead when reading inodes and
combined with the deterministic extra SATA PHY exception cycle during
resume on the specific configuration, non-readahead inode read would
fail causing ext4 errors.  Please read the following thread for
details.

  http://lkml.org/lkml/2009/5/23/21

This patch makes block layer reject merging if the failfast settings
don't match.  This is correct but likely to lower IO performance by
preventing regular IOs from mingling into surrounding readahead
requests.  Changes to allow such mixed merges and handle errors
correctly will be added later.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Niel Lambrechts <niel.lambrechts@gmail.com>
Cc: Theodore Tso <tytso@mit.edu>
Signed-off-by: Jens Axboe <axboe@carl.(none)>
2009-07-03 21:06:45 +02:00
Shan Wei
b706f64281 cfq-iosched: remove redundant check for NULL cfqq in cfq_set_request()
With the changes for falling back to an oom_cfqq, we never fail
to find/allocate a queue in cfq_get_queue(). So remove the check.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-07-01 12:41:14 +02:00
NeilBrown
db64f680ba blocK: Restore barrier support for md and probably other virtual devices.
The next_ordered flag is only meaningful for devices that use __make_request.
So move the test against next_ordered out of generic code and in to
__make_request

Since this test was added, barriers have not worked on md or any
devices that don't use __make_request and so don't bother to set
next_ordered.  (dm explicitly sets something other than
QUEUE_ORDERED_NONE since
  commit 99360b4c18
but notes in the comments that it is otherwise meaningless).

Cc: Ken Milmore <ken.milmore@googlemail.com>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-07-01 10:56:26 +02:00
Jens Axboe
018e044689 block: get rid of queue-private command filter
The initial patches to support this through sysfs export were broken
and have been if 0'ed out in any release. So lets just kill the code
and reclaim some space in struct request_queue, if anyone would later
like to fixup the sysfs bits, the git history can easily restore
the removed bits.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-07-01 10:56:26 +02:00
Martin K. Petersen
7878cba9f0 block: Create bip slabs with embedded integrity vectors
This patch restores stacking ability to the block layer integrity
infrastructure by creating a set of dedicated bip slabs.  Each bip slab
has an embedded bio_vec array at the end.  This cuts down on memory
allocations and also simplifies the code compared to the original bvec
version.  Only the largest bip slab is backed by a mempool.  The pool is
contained in the bio_set so stacking drivers can ensure forward
progress.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@carl.(none)>
2009-07-01 10:56:25 +02:00
Jens Axboe
6118b70b3a cfq-iosched: get rid of the need for __GFP_NOFAIL in cfq_find_alloc_queue()
Setup an emergency fallback cfqq that we allocate at IO scheduler init
time. If the slab allocation fails in cfq_find_alloc_queue(), we'll just
punt IO to that cfqq instead. This ensures that cfq_find_alloc_queue()
never fails without having to ensure free memory.

On cfqq lookup, always try to allocate a new cfqq if the given cfq io
context has the oom_cfqq assigned. This ensures that we only temporarily
punt to this shared queue.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-07-01 10:56:25 +02:00
Jens Axboe
d5036d770f cfq-iosched: move cfqq initialization out of cfq_find_alloc_queue()
We're going to be needing that init code outside of that function
to get rid of the __GFP_NOFAIL in cfqq allocation.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-07-01 10:56:25 +02:00
FUJITA Tomonori
1119865935 block: revert "bsg: setting rq->bio to NULL"
The SMP handler (sas_smp_request) was fixed to use the block API
properly, so we don't need this workaround to avoid blk_put_request()
warning.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2009-06-21 10:52:42 -05:00
Randy Dunlap
f740f5ca05 Fix kernel-doc parameter name typo in blk-settings.c:
Warning(block/blk-settings.c:108): No description found for parameter 'lim'
Warning(block/blk-settings.c:108): Excess function parameter 'limits' description in 'blk_set_default_limits'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-19 09:18:32 +02:00
Bartlomiej Zolnierkiewicz
90c699a9ee block: rename CONFIG_LBD to CONFIG_LBDAF
Follow-up to "block: enable by default support for large devices
and files on 32-bit archs".

Rename CONFIG_LBD to CONFIG_LBDAF to:
- allow update of existing [def]configs for "default y" change
- reflect that it is used also for large files support nowadays

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-19 08:08:50 +02:00
Martin K. Petersen
3a02c8e814 block: Fix bounce_pfn setting
Correct stacking bounce_pfn limit setting and prevent warnings on
32-bit.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-18 09:56:20 +02:00
Linus Torvalds
6fd03301d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (64 commits)
  debugfs: use specified mode to possibly mark files read/write only
  debugfs: Fix terminology inconsistency of dir name to mount debugfs filesystem.
  xen: remove driver_data direct access of struct device from more drivers
  usb: gadget: at91_udc: remove driver_data direct access of struct device
  uml: remove driver_data direct access of struct device
  block/ps3: remove driver_data direct access of struct device
  s390: remove driver_data direct access of struct device
  parport: remove driver_data direct access of struct device
  parisc: remove driver_data direct access of struct device
  of_serial: remove driver_data direct access of struct device
  mips: remove driver_data direct access of struct device
  ipmi: remove driver_data direct access of struct device
  infiniband: ehca: remove driver_data direct access of struct device
  ibmvscsi: gadget: at91_udc: remove driver_data direct access of struct device
  hvcs: remove driver_data direct access of struct device
  xen block: remove driver_data direct access of struct device
  thermal: remove driver_data direct access of struct device
  scsi: remove driver_data direct access of struct device
  pcmcia: remove driver_data direct access of struct device
  PCIE: remove driver_data direct access of struct device
  ...

Manually fix up trivial conflicts due to different direct driver_data
direct access fixups in drivers/block/{ps3disk.c,ps3vram.c}
2009-06-16 12:57:37 -07:00
Li Zefan
e212d6f250 block: remove some includings of blktrace_api.h
When porting blktrace to tracepoints, we changed to trace/block.h
for trace prober declarations.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-16 11:19:36 +02:00
Martin K. Petersen
e475bba2fd block: Introduce helper to reset queue limits to default values
DM reuses the request queue when swapping in a new device table
Introduce blk_set_default_limits() which can be used to reset the the
queue_limits prior to stacking devices.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-16 08:23:52 +02:00
Jeff Moyer
6923715ae3 cfq: remove extraneous '\n' in blktrace output
I noticed a blank line in blktrace output.  This patch fixes that.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-16 08:21:04 +02:00
Jens Axboe
0989a025d2 block: don't overwrite bdi->state after bdi_init() has been run
Move the defaults to where we do the init of the backing_dev_info.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-16 08:21:03 +02:00
Gui Jianfeng
81be834713 cfq: cleanup for last_end_request in cfq_data
Actually, last_end_request in cfq_data isn't used now. So lets
just remove it.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-16 08:21:03 +02:00
Kay Sievers
2bdf914915 Driver Core: bsg: add nodename for bsg driver
This adds support to the BSG driver to report the proper device name to
userspace for the bsg devices.

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-06-15 21:30:26 -07:00
Kay Sievers
b03f38b685 Driver Core: block: add nodename support for block drivers.
This adds support for block drivers to report their requested nodename
to userspace.  It also updates a number of block drivers to provide the
needed subdirectory and device name to be used for them.

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-06-15 21:30:25 -07:00
Randy Dunlap
8ebf975608 block: fix kernel-doc in recent block/ changes
Fix kernel-doc warnings in recently changed block/ source code.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-11 20:14:23 -07:00
Linus Torvalds
c9059598ea Merge branch 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
  block: add request clone interface (v2)
  floppy: fix hibernation
  ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
  fs/bio.c: add missing __user annotation
  block: prevent possible io_context->refcount overflow
  Add serial number support for virtio_blk, V4a
  block: Add missing bounce_pfn stacking and fix comments
  Revert "block: Fix bounce limit setting in DM"
  cciss: decode unit attention in SCSI error handling code
  cciss: Remove no longer needed sendcmd reject processing code
  cciss: change SCSI error handling routines to work with interrupts enabled.
  cciss: separate error processing and command retrying code in sendcmd_withirq_core()
  cciss: factor out fix target status processing code from sendcmd functions
  cciss: simplify interface of sendcmd() and sendcmd_withirq()
  cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
  cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
  block: needs to set the residual length of a bidi request
  Revert "block: implement blkdev_readpages"
  block: Fix bounce limit setting in DM
  Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
  ...

Manually fix conflicts with tracing updates in:
	block/blk-sysfs.c
	drivers/ide/ide-atapi.c
	drivers/ide/ide-cd.c
	drivers/ide/ide-floppy.c
	drivers/ide/ide-tape.c
	include/trace/events/block.h
	kernel/trace/blktrace.c
2009-06-11 11:10:35 -07:00
Linus Torvalds
27951daa71 Merge branch 'for-2.6.31' of git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6
* 'for-2.6.31' of git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6: (28 commits)
  ide-tape: fix debug call
  alim15x3: Remove historical hacks, re-enable init_hwif for PowerPC
  ide-dma: don't reset request fields on dma_timeout_retry()
  ide: drop rq->data handling from ide_map_sg()
  ide-atapi: kill unused fields and callbacks
  ide-tape: simplify read/write functions
  ide-tape: use byte size instead of sectors on rw issue functions
  ide-tape: unify r/w init paths
  ide-tape: kill idetape_bh
  ide-tape: use standard data transfer mechanism
  ide-tape: use single continuous buffer
  ide-atapi,tape,floppy: allow ->pc_callback() to change rq->data_len
  ide-tape,floppy: fix failed command completion after request sense
  ide-pm: don't abuse rq->data
  ide-cd,atapi: use bio for internal commands
  ide-atapi: convert ide-{floppy,tape} to using preallocated sense buffer
  ide-cd: convert to using generic sense request
  ide: add helpers for preparing sense requests
  ide-cd: don't abuse rq->buffer
  ide-atapi: don't abuse rq->buffer
  ...
2009-06-11 10:00:03 -07:00
Kiyoshi Ueda
b0fd271d5f block: add request clone interface (v2)
This patch adds the following 2 interfaces for request-stacking drivers:

  - blk_rq_prep_clone(struct request *clone, struct request *orig,
		      struct bio_set *bs, gfp_t gfp_mask,
		      int (*bio_ctr)(struct bio *, struct bio*, void *),
		      void *data)
      * Clones bios in the original request to the clone request
        (bio_ctr is called for each cloned bios.)
      * Copies attributes of the original request to the clone request.
        The actual data parts (e.g. ->cmd, ->buffer, ->sense) are not
        copied.

  - blk_rq_unprep_clone(struct request *clone)
      * Frees cloned bios from the clone request.

Request stacking drivers (e.g. request-based dm) need to make a clone
request for a submitted request and dispatch it to other devices.

To allocate request for the clone, request stacking drivers may not
be able to use blk_get_request() because the allocation may be done
in an irq-disabled context.
So blk_rq_prep_clone() takes a request allocated by the caller
as an argument.

For each clone bio in the clone request, request stacking drivers
should be able to set up their own completion handler.
So blk_rq_prep_clone() takes a callback function which is called
for each clone bio, and a pointer for private data which is passed
to the callback.

NOTE:
blk_rq_prep_clone() doesn't copy any actual data of the original
request.  Pages are shared between original bios and cloned bios.
So caller must not complete the original request before the clone
request.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-11 13:11:05 +02:00
Linus Torvalds
8623661180 Merge branch 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (244 commits)
  Revert "x86, bts: reenable ptrace branch trace support"
  tracing: do not translate event helper macros in print format
  ftrace/documentation: fix typo in function grapher name
  tracing/events: convert block trace points to TRACE_EVENT(), fix !CONFIG_BLOCK
  tracing: add protection around module events unload
  tracing: add trace_seq_vprint interface
  tracing: fix the block trace points print size
  tracing/events: convert block trace points to TRACE_EVENT()
  ring-buffer: fix ret in rb_add_time_stamp
  ring-buffer: pass in lockdep class key for reader_lock
  tracing: add annotation to what type of stack trace is recorded
  tracing: fix multiple use of __print_flags and __print_symbolic
  tracing/events: fix output format of user stack
  tracing/events: fix output format of kernel stack
  tracing/trace_stack: fix the number of entries in the header
  ring-buffer: discard timestamps that are at the start of the buffer
  ring-buffer: try to discard unneeded timestamps
  ring-buffer: fix bug in ring_buffer_discard_commit
  ftrace: do not profile functions when disabled
  tracing: make trace pipe recognize latency format flag
  ...
2009-06-10 19:53:40 -07:00
Nikanth Karthikesan
d9c7d394a8 block: prevent possible io_context->refcount overflow
Currently io_context has an atomic_t(32-bit) as refcount.  In the case of
cfq, for each device against whcih a task does I/O, a reference to the
io_context would be taken.  And when there are multiple process sharing
io_contexts(CLONE_IO) would also have a reference to the same io_context.

Theoretically the possible maximum number of processes sharing the same
io_context + the number of disks/cfq_data referring to the same io_context
can overflow the 32-bit counter on a very high-end machine.

Even though it is an improbable case, let us make it atomic_long_t.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-10 23:07:15 +02:00
Li Zefan
55782138e4 tracing/events: convert block trace points to TRACE_EVENT()
TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:

  - zero-copy and per-cpu splice() tracing
  - binary tracing without printf overhead
  - structured logging records exposed under /debug/tracing/events
  - trace events embedded in function tracer output and other plugins
  - user-defined, per tracepoint filter expressions
  ...

Cons:

  - no dev_t info for the output of plug, unplug_timer and unplug_io events.
    no dev_t info for getrq and sleeprq events if bio == NULL.
    no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.

    This is mainly because we can't get the deivce from a request queue.
    But this may change in the future.

  - A packet command is converted to a string in TP_assign, not TP_print.
    While blktrace do the convertion just before output.

    Since pc requests should be rather rare, this is not a big issue.

  - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
    has a unique format, which means we have some unused data in a trace entry.

    The overhead is minimized by using __dynamic_array() instead of __array().

I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:

      dd                   dd + ioctl blktrace       dd + TRACE_EVENT (splice)
1     7.36s, 42.7 MB/s     7.50s, 42.0 MB/s          7.41s, 42.5 MB/s
2     7.43s, 42.3 MB/s     7.48s, 42.1 MB/s          7.43s, 42.4 MB/s
3     7.38s, 42.6 MB/s     7.45s, 42.2 MB/s          7.41s, 42.5 MB/s

So the overhead of tracing is very small, and no regression when using
those trace events vs blktrace.

And the binary output of TRACE_EVENT is much smaller than blktrace:

 # ls -l -h
 -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
 -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
 -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out

Following are some comparisons between TRACE_EVENT and blktrace:

plug:
  kjournald-480   [000]   303.084981: block_plug: [kjournald]
  kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]

unplug_io:
  kblockd/0-118   [000]   300.052973: block_unplug_io: [kblockd/0] 1
  kblockd/0-118   [000]   300.052974:   8,0    U   N [kblockd/0] 1

remap:
  kjournald-480   [000]   303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
  kjournald-480   [000]   303.085043:   8,0    A   W 102736992 + 8 <- (8,8) 33384

bio_backmerge:
  kjournald-480   [000]   303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
  kjournald-480   [000]   303.085086:   8,0    M   W 102737032 + 8 [kjournald]

getrq:
  kjournald-480   [000]   303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
  kjournald-480   [000]   303.084975:   8,0    G   W 102736984 + 8 [kjournald]

  bash-2066  [001]  1072.953770:   8,0    G   N [bash]
  bash-2066  [001]  1072.953773: block_getrq: 0,0 N 0 + 0 [bash]

rq_complete:
  konsole-2065  [001]   300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
  konsole-2065  [001]   300.053191:   8,0    C   W 103669040 + 16 [0]

  ksoftirqd/1-7   [001]  1072.953811:   8,0    C   N (5a 00 08 00 00 00 00 00 24 00) [0]
  ksoftirqd/1-7   [001]  1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]

rq_insert:
  kjournald-480   [000]   303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
  kjournald-480   [000]   303.084986:   8,0    I   W 102736984 + 8 [kjournald]

Changelog from v2 -> v3:

- use the newly introduced __dynamic_array().

Changelog from v1 -> v2:

- use __string() instead of __array() to minimize the memory required
  to store hex dump of rq->cmd().

- support large pc requests.

- add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.

- some cleanups.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-09 12:34:23 -04:00
FUJITA Tomonori
c1d4c41f2f bsg: setting rq->bio to NULL
Due to commit 1cd96c242a ("block: WARN
in __blk_put_request() for potential bio leak"), BSG SMP requests get
the false warnings:

WARNING: at block/blk-core.c:1068 __blk_put_request+0x52/0xc0()

This sets rq->bio to NULL to avoid that false warnings.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-09 15:17:37 +02:00
Martin K. Petersen
77634f33d4 block: Add missing bounce_pfn stacking and fix comments
DM no longer needs to set limits explicitly when calling blk_stack_limits.
Let the latter automatically deal with bounce_pfn scaling.

Fix kerneldoc variable names.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-09 06:23:22 +02:00
Jens Axboe
9df1bb9b51 Revert "block: Fix bounce limit setting in DM"
This reverts commit a05c0205ba.

DM doesn't need to access the bounce_pfn directly.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-09 06:22:57 +02:00