linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-08-23 17:30:02 +00:00

Author	SHA1	Message	Date
Martin K. Petersen	4363ac7c13	block: Implement support for WRITE SAME The WRITE SAME command supported on some SCSI devices allows the same block to be efficiently replicated throughout a block range. Only a single logical block is transferred from the host and the storage device writes the same data to all blocks described by the I/O. This patch implements support for WRITE SAME in the block layer. The blkdev_issue_write_same() function can be used by filesystems and block drivers to replicate a buffer across a block range. This can be used to efficiently initialize software RAID devices, etc. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-20 14:31:45 +02:00
NeilBrown	6dafab6b13	md: make sure metadata is updated when spares are activated or removed. It isn't always necessary to update the metadata when spares are removed as the presence-or-not of a spare isn't really important to the integrity of an array. Also activating a spare doesn't always require updating the metadata as the update on 'recovery-completed' is usually sufficient. However the introduction of 'replacement' devices have made these transitions sometimes more important. For example the 'Replacement' flag isn't cleared until the original device is removed, so we need to ensure a metadata update after that 'spare' is removed. So set MD_CHANGE_DEVS whenever a spare is activated or removed, to complement the current situation where it is set when a spare is added or a device is failed (or a number of other less common situations). This is suitable for -stable as out-of-data metadata could lead to data corruption. This is only relevant for 3.3 and later 9when 'replacement' as introduced. Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2012-09-19 12:54:22 +10:00
NeilBrown	e5c86471f9	md/raid5: fix calculate of 'degraded' when a replacement becomes active. When a replacement device becomes active, we mark the device that it replaces as 'faulty' so that it can subsequently get removed. However 'calc_degraded' only pays attention to the primary device, not the replacement, so the array appears to become degraded, which is wrong. So teach 'calc_degraded' to consider any replacement if a primary device is faulty. This is suitable for -stable as an incorrect 'degraded' value can confuse md and could lead to data corruption. This is only relevant for 3.3 and later. Cc: stable@vger.kernel.org Reported-by: Robin Hill <robin@robinhill.me.uk> Reported-by: John Drescher <drescherjm@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-09-19 12:52:30 +10:00
NeilBrown	a852d7b8a0	Revert "md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE." This reverts commit `895e3c5c58`. While this patch seemed like a good idea and did help some workloads, it hurts other workloads. Large sequential O_DIRECT writes were faster, Small random O_DIRECT writes were slower. Other changes (batching RAID5 writes) have improved the sequential writes using a different mechanism, so the net result of this patch is definitely negative. So revert it. Reported-by: Shaohua Li <shli@kernel.org> Tested-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-09-19 12:48:30 +10:00
Kent Overstreet	bf800ef181	block: Add bio_clone_bioset(), bio_clone_kmalloc() Previously, there was bio_clone() but it only allocated from the fs bio set; as a result various users were open coding it and using __bio_clone(). This changes bio_clone() to become bio_clone_bioset(), and then we add bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of the functionality the last patch adedd. This will also help in a later patch changing how bio cloning works. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de> CC: Alasdair Kergon <agk@redhat.com> CC: Boaz Harrosh <bharrosh@panasas.com> CC: Jeff Garzik <jeff@garzik.org> Acked-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-09 10:35:39 +02:00
Kent Overstreet	9481874231	dm: Use bioset's front_pad for dm_rq_clone_bio_info Previously, dm_rq_clone_bio_info needed to be freed by the bio's destructor to avoid a memory leak in the blk_rq_prep_clone() error path. This gets rid of a memory allocation and means we can kill dm_rq_bio_destructor. The _rq_bio_info_cache kmem cache is unused now and needs to be deleted, but due to the way io_pool is used and overloaded this looks not quite trivial so I'm leaving it for a later patch. v6: Fix comment on struct dm_rq_clone_bio_info, per Tejun Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Alasdair Kergon <agk@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-09 10:35:38 +02:00
Kent Overstreet	1e2a410ff7	block: Ues bi_pool for bio_integrity_alloc() Now that bios keep track of where they were allocated from, bio_integrity_alloc_bioset() becomes redundant. Remove bio_integrity_alloc_bioset() and drop bio_set argument from the related functions and make them use bio->bi_pool. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-09 10:35:38 +02:00
Kent Overstreet	395c72a707	block: Generalized bio pool freeing With the old code, when you allocate a bio from a bio pool you have to implement your own destructor that knows how to find the bio pool the bio was originally allocated from. This adds a new field to struct bio (bi_pool) and changes bio_alloc_bioset() to use it. This makes various bio destructors unnecessary, so they're then deleted. v6: Explain the temporary if statement in bio_put Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de> CC: Alasdair Kergon <agk@redhat.com> CC: Nicholas Bellinger <nab@linux-iscsi.org> CC: Lars Ellenberg <lars.ellenberg@linbit.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-09 10:35:38 +02:00
Tejun Heo	43829731dd	workqueue: deprecate flush[_delayed]_work_sync() flush[_delayed]_work_sync() are now spurious. Mark them deprecated and convert all users to flush[_delayed]_work(). If you're cc'd and wondering what's going on: Now all workqueues are non-reentrant and the regular flushes guarantee that the work item is not pending or running on any CPU on return, so there's no reason to use the sync flushes at all and they're going away. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mattia Dongili <malattia@linux.it> Cc: Kent Yoder <key@linux.vnet.ibm.com> Cc: David Airlie <airlied@linux.ie> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Karsten Keil <isdn@linux-pingi.de> Cc: Bryan Wu <bryan.wu@canonical.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: David Woodhouse <dwmw2@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: linux-wireless@vger.kernel.org Cc: Anton Vorontsov <cbou@mail.ru> Cc: Sangbeom Kim <sbkim73@samsung.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Petr Vandrovec <petr@vandrovec.name> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Avi Kivity <avi@redhat.com>	2012-08-20 14:51:24 -07:00
NeilBrown	e0ee778528	md/raid10: fix problem with on-stack allocation of r10bio structure. A 'struct r10bio' has an array of per-copy information at the end. This array is declared with size [0] and r10bio_pool_alloc allocates enough extra space to store the per-copy information depending on the number of copies needed. So declaring a 'struct r10bio on the stack isn't going to work. It won't allocate enough space, and memory corruption will ensue. So in the two places where this is done, declare a sufficiently large structure and use that instead. The two call-sites of this bug were introduced in 3.4 and 3.5 so this is suitable for both those kernels. The patch will have to be modified for 3.4 as it only has one bug. Cc: stable@vger.kernel.org Reported-by: Ivan Vasilyev <ivan.vasilyev@gmail.com> Tested-by: Ivan Vasilyev <ivan.vasilyev@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-08-18 09:51:42 +10:00
NeilBrown	667a5313ec	md: Don't truncate size at 4TB for RAID0 and Linear commit `27a7b260f7` md: Fix handling for devices from 2TB to 4TB in 0.90 metadata. changed 0.90 metadata handling to truncated size to 4TB as that is all that 0.90 can record. However for RAID0 and Linear, 0.90 doesn't need to record the size, so this truncation is not needed and causes working arrays to become too small. So avoid the truncation for RAID0 and Linear This bug was introduced in 3.1 and is suitable for any stable kernels from then onwards. As the offending commit was tagged for 'stable', any stable kernel that it was applied to should also get this patch. That includes at least 2.6.32, 2.6.33 and 3.0. (Thanks to Ben Hutchings for providing that list). Cc: stable@vger.kernel.org Signed-off-by: Neil Brown <neilb@suse.de>	2012-08-16 16:46:12 +10:00
Linus Torvalds	25aa6a7ae4	Additional md update for 3.6 This contains a few patches that depend on plugging changes in the block layer so needs to wait for those. It also contains a Kconfig fix for the new RAID10 support in dm-raid. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (GNU/Linux) iQIVAwUAUBnKUznsnt1WYoG5AQJOQA/+M7RoVnF63+TbGIqdNDotuF8FxvudCZBl Ou2yG47EOPtWf/RoqPyfpydDgdjyXsk4T5TfXoc0hsXVr4shCYo51uT9K34TMSDJ 2GzGWuyugRJFyvxW7PBgM+zFWlcVdgUGcwsdmIUMtHRz8Q10TqO5fE22RNLkhwOl fvGCK1KYnQqlG87DbulHWMo22vyZVic8jBqFSw55CPuuFMSJMxCw0rOPUnvk5Q8v jWzZzuUqrM8iiOxTDHsbCA0IleCbGl/m0tgk02Vj4tkCvz9N/xzQW2se0H6uECiK k8odbAiNBOh1q135sa7ASrBzxT+JqSiQ25rLheTEzzNxjFv6/NlntXmYu6HB+lD3 DoHAvRjgMxiLCdisW6TJb10NItitXwE/HSpQOVRxyYtINdzmhIDaCccgfN8ZMkho nmE/uzO+CAoCFpZC2C/nY8D0BZs5fw4hgDAsci66mvs+88dy+SoA4AbyNEMAusOS tiL8ZEjnYXvxTh3JFaMIaqQd6PkbahmtEtvorwXsUYUdY0ybkcs2FYVksvkgYdyW WlejOZVurY2i5biqck3UqjesxeJA5TMAlAUQR7vXu1Fa9fYFXZbqJom/KnPRTfek xerCWPMbhuzmcyEjUOGfjs6GFEnEmRT6Q6fN3CBaQMS2Q/z+6AkTOXKVl5Fhvoyl aeu1m8nZLuI= =ovN2 -----END PGP SIGNATURE----- Merge tag 'md-3.6' of git://neil.brown.name/md Pull additional md update from NeilBrown: "This contains a few patches that depend on plugging changes in the block layer so needed to wait for those. It also contains a Kconfig fix for the new RAID10 support in dm-raid." * tag 'md-3.6' of git://neil.brown.name/md: md/dm-raid: DM_RAID should select MD_RAID10 md/raid1: submit IO from originating thread instead of md thread. raid5: raid5d handle stripe in batch way raid5: make_request use batch stripe release	2012-08-02 11:34:40 -07:00
NeilBrown	d9f691c365	md/dm-raid: DM_RAID should select MD_RAID10 Now that DM_RAID supports raid10, it needs to select that code to ensure it is included. Cc: Jonathan Brassow <jbrassow@redhat.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-08-02 08:35:43 +10:00
NeilBrown	f54a9d0e59	md/raid1: submit IO from originating thread instead of md thread. queuing writes to the md thread means that all requests go through the one processor which may not be able to keep up with very high request rates. So use the plugging infrastructure to submit all requests on unplug. If a 'schedule' is needed, we fall back on the old approach of handing the requests to the thread for it to handle. Signed-off-by: NeilBrown <neilb@suse.de>	2012-08-02 08:33:20 +10:00
Shaohua Li	46a06401f6	raid5: raid5d handle stripe in batch way Let raid5d handle stripe in batch way to reduce conf->device_lock locking. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-08-02 08:33:15 +10:00
Shaohua Li	8811b5968f	raid5: make_request use batch stripe release make_request() does stripe release for every stripe and the stripe usually has count 1, which makes previous release_stripe() optimization not work. In my test, this release_stripe() becomes the heaviest pleace to take conf->device_lock after previous patches applied. Below patch makes stripe release batch. All the stripes will be released in unplug. The STRIPE_ON_UNPLUG_LIST bit is to protect concurrent access stripe lru. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-08-02 08:33:00 +10:00
Linus Torvalds	eff0d13f38	Merge branch 'for-3.6/drivers' of git://git.kernel.dk/linux-block Pull block driver changes from Jens Axboe: - Making the plugging support for drivers a bit more sane from Neil. This supersedes the plugging change from Shaohua as well. - The usual round of drbd updates. - Using a tail add instead of a head add in the request completion for ndb, making us find the most completed request more quickly. - A few floppy changes, getting rid of a duplicated flag and also running the floppy init async (since it takes forever in boot terms) from Andi. * 'for-3.6/drivers' of git://git.kernel.dk/linux-block: floppy: remove duplicated flag FD_RAW_NEED_DISK blk: pass from_schedule to non-request unplug functions. block: stack unplug blk: centralize non-request unplug handling. md: remove plug_cnt feature of plugging. block/nbd: micro-optimization in nbd request completion drbd: announce FLUSH/FUA capability to upper layers drbd: fix max_bio_size to be unsigned drbd: flush drbd work queue before invalidate/invalidate remote drbd: fix potential access after free drbd: call local-io-error handler early drbd: do not reset rs_pending_cnt too early drbd: reset congestion information before reporting it in /proc/drbd drbd: report congestion if we are waiting for some userland callback drbd: differentiate between normal and forced detach drbd: cleanup, remove two unused global flags floppy: Run floppy initialization asynchronous	2012-08-01 09:06:47 -07:00
Linus Torvalds	fcff06c438	Merge branch 'for-next' of git://neil.brown.name/md Pull md updates from NeilBrown. * 'for-next' of git://neil.brown.name/md: DM RAID: Add support for MD RAID10 md/RAID1: Add missing case for attempting to repair known bad blocks. md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE. md/raid1: don't abort a resync on the first badblock. md: remove duplicated test on ->openers when calling do_md_stop() raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layer md/raid1: prevent merging too large request md/raid1: read balance chooses idlest disk for SSD md/raid1: make sequential read detection per disk based MD RAID10: Export md_raid10_congested MD: Move macros from raid1.h to raid1.c MD RAID1: rename mirror_info structure MD RAID10: rename mirror_info structure MD RAID10: Fix compiler warning. raid5: add a per-stripe lock raid5: remove unnecessary bitmap write optimization raid5: lockless access raid5 overrided bi_phys_segments raid5: reduce chance release_stripe() taking device_lock	2012-08-01 09:02:01 -07:00
Jonathan Brassow	63f33b8dda	DM RAID: Add support for MD RAID10 Support the MD RAID10 personality through dm-raid.c Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-08-01 20:41:20 +10:00
NeilBrown	bb181e2e48	Merge commit 'c039c332f23e794deb6d6f37b9f07ff3b27fb2cf' into md Pull in pre-requisites for adding raid10 support to dm-raid.	2012-08-01 20:40:02 +10:00
NeilBrown	74018dc306	blk: pass from_schedule to non-request unplug functions. This will allow md/raid to know why the unplug was called, and will be able to act according - if !from_schedule it is safe to perform tasks which could themselves schedule. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-07-31 09:08:15 +02:00
NeilBrown	9cbb175088	blk: centralize non-request unplug handling. Both md and umem has similar code for getting notified on an blk_finish_plug event. Centralize this code in block/ and allow each driver to provide its distinctive difference. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-07-31 09:08:14 +02:00
NeilBrown	0021b7bc04	md: remove plug_cnt feature of plugging. This seemed like a good idea at the time, but after further thought I cannot see it making a difference other than very occasionally and testing to try to exercise the case it is most likely to help did not show any performance difference by removing it. So remove the counting of active plugs and allow 'pending writes' to be activated at any time, not just when no plugs are active. This is only relevant when there is a write-intent bitmap, and the updating of the bitmap will likely introduce enough delay that the single-threading of bitmap updates will be enough to collect large numbers of updates together. Removing this will make it easier to centralise the unplug code, and will clear the other for other unplug enhancements which have a measurable effect. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-07-31 09:08:14 +02:00
Alexander Lyakas	d57368afe6	md/RAID1: Add missing case for attempting to repair known bad blocks. When doing resync or repair, attempt to correct bad blocks, according to WriteErrorSeen policy Signed-off-by: Alex Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-31 12:01:29 +10:00
Linus Torvalds	27c1ee3f92	Merge branch 'akpm' (Andrew's patch-bomb) Merge Andrew's first set of patches: "Non-MM patches: - lots of misc bits - tree-wide have_clk() cleanups - quite a lot of printk tweaks. I draw your attention to "printk: convert the format for KERN_<LEVEL> to a 2 byte pattern" which looks a bit scary. But afaict it's solid. - backlight updates - lib/ feature work (notably the addition and use of memweight()) - checkpatch updates - rtc updates - nilfs updates - fatfs updates (partial, still waiting for acks) - kdump, proc, fork, IPC, sysctl, taskstats, pps, etc - new fault-injection feature work" * Merge emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits) drivers/misc/lkdtm.c: fix missing allocation failure check lib/scatterlist: do not re-write gfp_flags in __sg_alloc_table() fault-injection: add tool to run command with failslab or fail_page_alloc fault-injection: add selftests for cpu and memory hotplug powerpc: pSeries reconfig notifier error injection module memory: memory notifier error injection module PM: PM notifier error injection module cpu: rewrite cpu-notifier-error-inject module fault-injection: notifier error injection c/r: fcntl: add F_GETOWNER_UIDS option resource: make sure requested range is included in the root range include/linux/aio.h: cpp->C conversions fs: cachefiles: add support for large files in filesystem caching pps: return PTR_ERR on error in device_create taskstats: check nla_reserve() return sysctl: suppress kmemleak messages ipc: use Kconfig options for __ARCH_WANT_[COMPAT_]IPC_PARSE_VERSION ipc: compat: use signed size_t types for msgsnd and msgrcv ipc: allow compat IPC version field parsing if !ARCH_WANT_OLD_COMPAT_IPC ipc: add COMPAT_SHMLBA support ...	2012-07-30 17:25:34 -07:00
Akinobu Mita	8fb980e35b	dm: use memweight() Use memweight() to count the total number of bits set in memory area. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Alasdair Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-07-30 17:25:16 -07:00
majianpeng	895e3c5c58	md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE. 'sync' writes set both REQ_SYNC and REQ_NOIDLE. O_DIRECT writes set REQ_SYNC but not REQ_NOIDLE. We currently assume that a REQ_SYNC request will not be followed by more requests and so set STRIPE_PREREAD_ACTIVE to expedite the request. This is appropriate for sync requests, but not for O_DIRECT requests. So make the setting of STRIPE_PREREAD_ACTIVE conditional on REQ_NOIDLE rather than REQ_SYNC. This is consistent with the documented meaning of REQ_NOIDLE: __REQ_NOIDLE, /* don't anticipate more IO after this one */ Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-31 10:05:44 +10:00
NeilBrown	b7219ccb33	md/raid1: don't abort a resync on the first badblock. If a resync of a RAID1 array with 2 devices finds a known bad block one device it will neither read from, or write to, that device for this block offset. So there will be one read_target (The other device) and zero write targets. This condition causes md/raid1 to abort the resync assuming that it has finished - without known bad blocks this would be true. When there are no write targets because of the presence of bad blocks we should only skip over the area covered by the bad block. RAID10 already gets this right, raid1 doesn't. Or didn't. As this can cause a 'sync' to abort early and appear to have succeeded it could lead to some data corruption, so it suitable for -stable. Cc: stable@vger.kernel.org Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-31 10:05:34 +10:00
NeilBrown	90cf195d9b	md: remove duplicated test on ->openers when calling do_md_stop() do_md_stop tests mddev->openers while holding ->open_mutex, and fails if this count is too high. So callers do not need to check mddev->openers and doing so isn't very meaningful as they don't hold ->open_mutex so the number could change. So remove the unnecessary tests on mddev->openers. These are not called often enough for there to be any gain in an early test on ->open_mutex to avoid the need for a slightly more costly mutex_lock call. Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-31 10:04:55 +10:00
majianpeng	3f9e7c140e	raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layer Because bios will merge at block-layer,so bios-error may caused by other bio which be merged into to the same request. Using this flag,it will find exactly error-sector and not do redundant operation like re-write and re-read. V0->V1:Using REQ_FLUSH instead REQ_NOMERGE avoid bio merging at block layer. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-31 10:04:21 +10:00
Shaohua Li	12cee5a8a2	md/raid1: prevent merging too large request For SSD, if request size exceeds specific value (optimal io size), request size isn't important for bandwidth. In such condition, if making request size bigger will cause some disks idle, the total throughput will actually drop. A good example is doing a readahead in a two-disk raid1 setup. So when should we split big requests? We absolutly don't want to split big request to very small requests. Even in SSD, big request transfer is more efficient. This patch only considers request with size above optimal io size. If all disks are busy, is it worth doing a split? Say optimal io size is 16k, two requests 32k and two disks. We can let each disk run one 32k request, or split the requests to 4 16k requests and each disk runs two. It's hard to say which case is better, depending on hardware. So only consider case where there are idle disks. For readahead, split is always better in this case. And in my test, below patch can improve > 30% thoughput. Hmm, not 100%, because disk isn't 100% busy. Such case can happen not just in readahead, for example, in directio. But I suppose directio usually will have bigger IO depth and make all disks busy, so I ignored it. Note: if the raid uses any hard disk, we don't prevent merging. That will make performace worse. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-31 10:03:53 +10:00
Shaohua Li	9dedf60313	md/raid1: read balance chooses idlest disk for SSD SSD hasn't spindle, distance between requests means nothing. And the original distance based algorithm sometimes can cause severe performance issue for SSD raid. Considering two thread groups, one accesses file A, the other access file B. The first group will access one disk and the second will access the other disk, because requests are near from one group and far between groups. In this case, read balance might keep one disk very busy but the other relative idle. For SSD, we should try best to distribute requests to as many disks as possible. There isn't spindle move penality anyway. With below patch, I can see more than 50% throughput improvement sometimes depending on workloads. The only exception is small requests can be merged to a big request which typically can drive higher throughput for SSD too. Such small requests are sequential reads. Unlike hard disk, sequential read which can't be merged (for example direct IO, or read without readahead) can be ignored for SSD. Again there is no spindle move penality. readahead dispatches small requests and such requests can be merged. Last patch can help detect sequential read well, at least if concurrent read number isn't greater than raid disk number. In that case, distance based algorithm doesn't work well too. V2: For hard disk and SSD mixed raid, doesn't use distance based algorithm for random IO too. This makes the algorithm generic for raid with SSD. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-31 10:03:53 +10:00
Shaohua Li	be4d3280b1	md/raid1: make sequential read detection per disk based Currently the sequential read detection is global wide. It's natural to make it per disk based, which can improve the detection for concurrent multiple sequential reads. And next patch will make SSD read balance not use distance based algorithm, where this change help detect truly sequential read for SSD. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-31 10:03:53 +10:00
Jonathan Brassow	cc4d1efdd0	MD RAID10: Export md_raid10_congested md/raid10: Export is_congested test. In similar fashion to commits `11d8a6e371` `1ed7242e59` we export the RAID10 congestion checking function so that dm-raid.c can make use of it and make use of the personality. The 'queue' and 'gendisk' structures will not be available to the MD code when device-mapper sets up the device, so we conditionalize access to these fields also. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-31 10:03:53 +10:00
Jonathan Brassow	473e87ce48	MD: Move macros from raid1.h to raid1.c MD RAID1/RAID10: Move some macros from .h file to .c file There are three macros (IO_BLOCKED,IO_MADE_GOOD,BIO_SPECIAL) which are defined in both raid1.h and raid10.h. They are only used in there respective .c files. However, if we wish to make RAID10 accessible to the device-mapper RAID target (dm-raid.c), then we need to move these macros into the .c files where they are used so that they do not conflict with each other. The macros from the two files are identical and could be moved into md.h, but I chose to leave the duplication and have them remain in the personality files. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-31 10:03:52 +10:00
Jonathan Brassow	0eaf822cb3	MD RAID1: rename mirror_info structure MD RAID1: Rename the structure 'mirror_info' to 'raid1_info' The same structure name ('mirror_info') is used by raid10. Each of these structures are defined in there respective header files. If dm-raid is to support both RAID1 and RAID10, the header files will be included and the structure names must not collide. While only one of these structure names needs to change, this patch adds consistency to the naming of the structure. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-31 10:03:52 +10:00
Jonathan Brassow	dc280d987f	MD RAID10: rename mirror_info structure MD RAID10: Rename the structure 'mirror_info' to 'raid10_info' The same structure name ('mirror_info') is used by raid1. Each of these structures are defined in there respective header files. If dm-raid is to support both RAID1 and RAID10, the header files will be included and the structure names must not collide. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-31 10:03:52 +10:00
Jonathan Brassow	3bbae04b12	MD RAID10: Fix compiler warning. MD RAID10: Fix compiler warning. Initialize variable to prevent compiler warning. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-31 10:03:52 +10:00
Alasdair G Kergon	1f4e0ff079	dm thin: commit before gathering status Commit outstanding metadata before returning the status for a dm thin pool so that the numbers reported are as up-to-date as possible. The commit is not performed if the device is suspended or if the DM_NOFLUSH_FLAG is supplied by userspace and passed to the target through a new 'status_flags' parameter in the target's dm_status_fn. The userspace dmsetup tool will support the --noflush flag with the 'dmsetup status' and 'dmsetup wait' commands from version 1.02.76 onwards. Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:16 +01:00
Joe Thornber	e49e582965	dm thin: add read only and fail io modes Add read-only and fail-io modes to thin provisioning. If a transaction commit fails the pool's metadata device will transition to "read-only" mode. If a commit fails once already in read-only mode the transition to "fail-io" mode occurs. Once in fail-io mode the pool and all associated thin devices will report a status of "Fail". Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:16 +01:00
Joe Thornber	da105ed5fd	dm thin metadata: introduce dm_pool_abort_metadata Introduce dm_pool_abort_metadata to abort the current metadata transaction. Generally this will only be called when bad things are happening and dm-thin is trying to roll back to a good state for read-only mode. It's complicated by the fact that the metadata device may have failed completely causing the abort to be unable to read the old transaction. In this case the metadata object is placed in a 'fail' mode and everything fails apart from destroying it. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:15 +01:00
Joe Thornber	12ba58af46	dm thin metadata: introduce dm_pool_metadata_set_read_only Introduce dm_pool_metadata_set_read_only to put the underlying block manager into read-only mode. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:15 +01:00
Joe Thornber	310975573b	dm persistent data: introduce dm_bm_set_read_only Introduce dm_bm_set_read_only to switch the block manager into a read-only mode. To be used when dm-thin degrades due to io errors on the metadata device. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:15 +01:00
Joe Thornber	4afdd680f7	dm thin: reduce number of metadata commits Reduce the number of metadata commits by using dm_thin_changed_this_transaction to check if metadata was changed on a per thin device granularity. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:14 +01:00
Joe Thornber	40db5a5376	dm thin metadata: add dm_thin_changed_this_transaction Introduce dm_thin_changed_this_transaction to dm-thin-metadata to publish a useful bit of information we're already tracking. This will help dm thin decide when to commit. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:14 +01:00
Joe Thornber	66b1edc05e	dm thin metadata: add format option to dm_pool_metadata_open Add a parameter to dm_pool_metadata_open to indicate whether or not an unformatted metadata area should be formatted. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:14 +01:00
Joe Thornber	0fa5b17b08	dm thin metadata: tidy up open and format error paths Tidy up error path in __open_metadata and __format_metadata in dm-thin-metadata. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:14 +01:00
Mike Snitzer	d73ec52538	dm thin metadata: only check incompat features on open Factor out __check_incompat_features and only call it once when we open the metadata device rather than at the beginning of every transaction. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:13 +01:00
Joe Thornber	b793995108	dm thin metadata: remove duplicate pmd initialisation Remove some duplicate initialisation of struct dm_pool_metadata. These pmd fields are initialised by both: __format_metadata's calls to dm_btree_empty __write_initial_superblock + __begin_transaction Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:13 +01:00
Joe Thornber	8801e06945	dm thin metadata: remove create parameter from __create_persistent_data_objects Remove 'create' parameter from __create_persistent_data_objects() in dm-thin-metadata. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:13 +01:00
Joe Thornber	237074c0a3	dm thin metadata: move __superblock_all_zeroes to __open_or_format_metadata Move the check for __superblock_all_zeroes from __create_persistent_data_objects() down to __open_or_format_metadata in dm-thin-metadata. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:13 +01:00
Joe Thornber	a97e5e6fd0	dm thin metadata: remove nr_blocks arg from __create_persistent_data_objects Remove nr_blocks arg from __create_persistent_data_objects in dm-thin-metadata. It was always passed as zero. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:12 +01:00
Joe Thornber	e4d2205cdf	dm thin metadata: split __open or format metadata Split __open_or_format_metadata into __format_metadata and __open_metadata in dm-thin-metadata. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:12 +01:00
Joe Thornber	d6332814e3	dm thin metadata: use struct dm_pool_metadata members in __open_or_format_metadata Clean up __open_or_format_metadata in dm-thin-metadata by using struct dm_pool_metadata members to replace local variables. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:12 +01:00
Joe Thornber	583ceee2ed	dm thin metadata: zero unused superblock uuid Zero the unused uuid when initialising the metadata superblock. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:11 +01:00
Joe Thornber	270938bac5	dm thin metadata: lift __begin_transaction out of __write_initial_superblock Lift the call to __begin_transaction out of __write_initial_superblock in dm-thin-metadata. Called higher up the call chain now. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:11 +01:00
Joe Thornber	10d2a9ff7c	dm thin metadata: move dm_commit_pool_metadata into __write_initial_superblock Move dm_commit_pool_metadata inline into __write_initial_superblock in dm-thin-metadata. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:11 +01:00
Joe Thornber	9cb6653f9a	dm thin metadata: factor out __write_initial_superblock Factor out __write_initial_superblock and also pull some other initial creation code out of dm_pool_metadata_open. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:10 +01:00
Joe Thornber	6a0ebd31b6	dm thin metadata: lift some initialisation out of __open_or_format_metadata Lift some initialisation out of __open_or_format_metadata in dm-thin-metadata. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:10 +01:00
Joe Thornber	f9dd9352b9	dm thin metadata: factor __destroy_persistent_data out of dm_pool_metadata_close Factor __destroy_persistent_data_objects out of dm_pool_metadata_close. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:10 +01:00
Joe Thornber	332627db00	dm thin metadata: move bm creation code into create_persistent_data_objects Move block manager creation and the check for unformatted metadata into __create_persistent_data_objects(). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:10 +01:00
Joe Thornber	77f49a4027	dm thin metadata: rename init_pmd to __create_persistent_data_objects Rename init_pmd to __create_persistent_data_objects in dm-thin-metadata. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:10 +01:00
Joe Thornber	2597119206	dm thin metadata: wrap superblock locking Introduce wrappers to handle write locking the superblock appropriately. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:09 +01:00
Joe Thornber	3c9ad9bd87	dm persistent data: stop using dm_bm_unlock_move when shadowing blocks in tm Stop using dm_bm_unlock_move when shadowing blocks in the transaction manager as an optimisation and remove the function as it is then no longer used. Some code, such as the space maps, keeps using on-disk data structures from the previous transaction. It can do this because blocks won't be reallocated until the subsequent transaction. Using dm_bm_unlock_move to copy blocks sounds like a win, but it forces a synchronous read should the old block be accessed. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:09 +01:00
Joe Thornber	384ef0e62e	dm persistent data: tidy transaction manager creation fns Tidy the transaction manager creation functions. They no longer lock the superblock. Superblock locking is pulled out to the caller. Also export dm_bm_write_lock_zero. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:09 +01:00
Joe Thornber	eb04cf634f	dm thin metadata: stop tracking need for commit Remove an optimisation that tracks whether or not a thin metadata commit is needed. If dm_pool_commit_metadata() is called and no changes have been made to the metadata then this optimisation avoided writing to disk. Removing because we're going to do something better later. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:08 +01:00
Joe Thornber	51a0f659c0	dm persistent data: create new dm_block_manager struct This patch introduces a separate struct for the block_manager. It also uses IS_ERR to check the return value of dm_bufio_client_create instead of testing incorrectly for NULL. Prior to this patch a struct dm_block_manager was really an alias for a struct dm_bufio_client. We want to add some functionality to the block manager that will require extra fields, so this one to one mapping is no longer valid. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:08 +01:00
Joe Thornber	41675aea32	dm thin metadata: factor __setup_btree_details out of init_pmd Factor __setup_btree_details out of init_pmd in dm-thin-metadata. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:08 +01:00
Alasdair G Kergon	0ac55489d9	dm: use bool bitfields in struct dm_target Use boolean bit fields for flags in struct dm_target. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:08 +01:00
Joe Thornber	16ad3d103d	dm thin: set flush_supported The thin provisioning target commits internal metadata on flush. So it should receive flushes regardless of whether the underlying devices support them. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:07 +01:00
Joe Thornber	0e9c24ed74	dm: allow targets to request flushes regardless of underlying device support Allow targets to override the 'supports flush' calculation. Set 'flush_supported' if a target needs to receive flushes regardless of whether or not its underlying devices have support. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:07 +01:00
Joe Thornber	f4b90369d3	dm persistent data: only commit space map if index changed Introduce bitmap_index_changed to track whether or not the index changed then only commit a space map if it did. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:06 +01:00
Joe Thornber	8d44c98aac	dm persistent data: always unlock superblock in dm_bm_flush_and_unlock Unlock the superblock even if initial dm_bufio_write_dirty_buffers fails. Also, remove redundant flush calls. dm_bm_flush_and_unlock's calls to dm_bufio_write_dirty_buffers already result in dm_bufio_issue_flush being called. This avoids warnings about unflushed dirty buffers from bufio. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:06 +01:00
Joe Thornber	6004970136	dm thin: avoid unnecessarily breaking sharing for flushes There's no need to break sharing, triggering a copy, for a write that has no data (i.e. a flush). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:06 +01:00
Joe Thornber	905386f82d	dm thin: fix memory leak in process_prepared_mapping error paths Fix memory leak in process_prepared_mapping by always freeing the dm_thin_new_mapping structs from the mapping_pool mempool on the error paths. Signed-off-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:05 +01:00
Mikulas Patocka	c66029f4d4	dm crypt: rename struct convert_context sector field Rename sector to cc_sector in dm-crypt's convert_context struct. This is preparation for a future patch that merges dm_io and convert_context which both have a "sector" field. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:05 +01:00
Alasdair G Kergon	49a8a9204b	dm crypt: store crypt_config instead of dm_target struct Store the crypt_config struct pointer directly in struct dm_crypt_io instead of the dm_target struct pointer. Target information is never used - only target->private is referenced, thus we can change it to point directly to struct crypt_config. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:05 +01:00
Mikulas Patocka	fd2d231faf	dm crypt: move cipher data out of per_cpu struct Move static dm-crypt cipher data out of per-cpu structure. Cipher information is static, so it does not have to be in a per-cpu structure. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:05 +01:00
Jonathan E Brassow	c039c332f2	dm raid: move sectors_per_dev calculation In preparation for RAID10 inclusion in dm-raid, we move the sectors_per_dev calculation later in the device creation process. This is because we won't know up-front how many stripes vs how many mirrors there are which will change the calculation. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:04 +01:00
Mikulas Patocka	40b6229b69	dm crypt: rename pending field There are two dm crypt structures that have a field called "pending". This patch renames them to "cc_pending" and "io_pending" to reduce confusion and ease searching the code. Also remove unnecessary initialisation of r in crypt_convert_block(). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:04 +01:00
Jonathan E Brassow	f999e8fe70	dm raid: restructure parse_raid_params In preparation for RAID10 addition to dm-raid, we change an 'if' conditional to a 'switch' conditional to make it easier to see what is being checked for each RAID type. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:04 +01:00
Mike Snitzer	a58a935d5a	dm mpath: add retain_attached_hw_handler feature A SCSI device handler might get attached to a device during the initial device scan. We do not necessarily want to override this when loading a multipath table, so this patch adds a new multipath feature argument "retain_attached_hw_handler". During SCSI device scan all loaded SCSI device handlers will be consulted for a match (via scsi_dh's provided .match). If a match is found that device handler will be attached. We need a way to have userspace multipathd's provided 'hw_handler' not override the already attached hardware handler. When specifying the new feature 'retain_attached_hw_handler' multipath will use the currently attached hardware handler instead of trying to attach the one specified during table load. If no hardware handler is attached the specified hardware handler will still be used. Leverages scsi_dh_attach's ability to increment the scsi_dh's reference count if the same scsi_dh name is provided when attaching - currently attached scsi_dh name is determined with scsi_dh_attached_handler_name. Depends upon commit `7e8a74b177` ("[SCSI] scsi_dh: add scsi_dh_attached_handler_name"). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Tested-by: Babu Moger <babu.moger@netapp.com> Reviewed-by: Chandra Seetharaman <sekharan@us.ibm.com> Acked-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:04 +01:00
Mikulas Patocka	f9a8e0cd26	dm thin: optimize power of two block size dm-thin will be most likely used with a block size that is a power of two. So it should be optimized for this case. This patch changes division and modulo operations to shifts and bit masks if block size is a power of two. A test that bi_sector is divisible by a block size is removed from io_overlaps_block. Device mapper never sends bios that span a block boundary. Consequently, if we tested that bi_size is equivalent to block size, bi_sector must already be on a block boundary. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:03 +01:00
Mikulas Patocka	4929630901	dm thin: split discards on block boundary This patch sets the variable "ti->split_discard_requests" for the dm thin target so that device mapper core splits discard requests on a block boundary. Consequently, a discard request that spans multiple blocks is never sent to dm-thin. The patch also removes some code in process_discard that deals with discards that span multiple blocks. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:03 +01:00
Mikulas Patocka	7acf0277ce	dm: introduce split_discard_requests This patch introduces a new variable split_discard_requests. It can be set by targets so that discard requests are split on max_io_len boundaries. When split_discard_requests is not set, discard requests are only split on boundaries between targets, as was the case before this patch. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:03 +01:00
Mike Snitzer	55f2b8bdb0	dm thin: support for non power of 2 pool blocksize Non power of 2 blocksize support is needed to properly align thinp IO on storage that has non power of 2 optimal IO sizes (e.g. RAID6 10+2). Use sector_div to support non power of 2 blocksize for the pool's data device. This provides comparable performance to the power of 2 math that was performed until now (as tested on modern x86_64 hardware). The kernel currently assumes that limits->discard_granularity is a power of two so the thin target only enables discard support if the block size is a power of two. Eliminate pool structure's 'block_shift', 'offset_mask' and remaining 4 byte holes. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:02 +01:00
Mikulas Patocka	33d07c0dfa	dm stripe: optimize chunk_size calculations dm-stripe is usually used with a chunk size that is a power of two. Use faster shifts and bit masks in such cases. stripe_width is already optimized in a similar way. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:02 +01:00
Mikulas Patocka	8f069b41bc	dm stripe: remove minimum stripe size There is no technical limitation in device mapper that would prevent the dm-stripe target from using a stripe size smaller than page size. This patch removes the limit and makes stripe volumes portable across architectures with different page size. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:01 +01:00
Mike Snitzer	eb850de608	dm stripe: support for non power of 2 chunksize Support non-power-of-2 chunk sizes with dm striping for proper alignment of stripe IO on storage that has non-power-of-2 optimal IO sizes (e.g. RAID6 10+2). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:01 +01:00
Mike Snitzer	542f903814	dm: support non power of two target max_io_len Remove the restriction that limits a target's specified maximum incoming I/O size to be a power of 2. Rename this setting from 'split_io' to the less-ambiguous 'max_io_len'. Change it from sector_t to uint32_t, which is plenty big enough, and introduce a wrapper function dm_set_target_max_io_len() to set it. Use sector_div() to process it now that it is not necessarily a power of 2. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:00 +01:00
Mikulas Patocka	1df05483d7	dm stripe: remove stripes_mask The structure stripe_c contains a stripes_mask field. This field is useless because it can be trivially calculated by subtracting one from stripes. It is used only at one place. This patch removes it. The patch also changes ffs(stripes) - 1 to __ffs(stripes). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:00 +01:00
Mikulas Patocka	f14fa693c9	dm stripe: fix size test dm-stripe is supposed to ensure that all the space allocated to the stripes is fully used and that all stripes are the same size. This patch fixes the test. It checks that device length is divisible by the chunk size and checks that the resulting quotient is divisible by the number of stripes (which is equivalent to testing if device length is divisible by chunk_size * stripes). Previously, the code only tested that the number of sectors in the target was divisible by each of the chunk size and the number of stripes separately, which could leave entire stripes unused. (A setup that genuinely needs some stripes to be shorter than others can be created by concatenating striped targets.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:00 +01:00
Mike Snitzer	f09996c993	dm thin: provide specific errors for two table load failure cases Provide specific error message strings for two pool_ctr() failure cases that currently give just "Unknown error". Reference: test_two_pools_pointing_to_the_same_metadata_fails and test_different_pool_cant_replace_pool in thinp-test-suite. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:07:59 +01:00
majianpeng	1a66a08ae8	dm: replace simple_strtoul Replace obsolete simple_strtoul() with kstrtou8/kstrtouint. Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:07:59 +01:00
Alasdair G Kergon	70c4861102	dm snapshot: remove redundant assignment in merge fn Remove redundant bvm->bi_sector self-assignment in dm snapshot's origin_merge(). Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:07:59 +01:00
Joe Thornber	8c971178a7	dm thin metadata: introduce THIN_MAX_CONCURRENT_LOCKS Introduce THIN_MAX_CONCURRENT_LOCKS into dm-thin-metadata to give a name to an otherwise "magic" number. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:07:58 +01:00
Joe Thornber	d973ac196b	dm thin metadata: remove pointless label from __commit_transaction Remove the pointless label 'out' from __commit_transaction in dm-thin-metadata.c Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:07:58 +01:00
Joe Thornber	3caf6d73d4	dm persistent data: remove debug space map checker Remove debug space map checker from dm persistent data. The space map checker is a wrapper for other space maps that double checks the reference counts are correct. It holds all these reference counts in memory rather than on disk, so uses a lot of memory and is thus restricted to small pools. As yet, this checker hasn't found any issues, but has caused a few of its own due to people turning it on by default with larger pools. Removing. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:07:58 +01:00
Mike Snitzer	17b7d63f7e	dm thin: clean up compiler warning Clean up "warning: dubious: !x & y". Also make it clear that __snapshotted_since() returns a bool and that dm_thin_lookup_result's 'shared' member is a flag. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:07:57 +01:00
Alasdair G Kergon	7768ed33cc	dm thin: reduce endio_hook pool size Reduce the slab size used for the dm_thin_endio_hook mempool. Allocation has been seen to fail on machines with smaller amounts of memory due to fragmentation. lvm: page allocation failure. order:5, mode:0xd0 device-mapper: table: 253:38: thin-pool: Error creating pool's endio_hook mempool Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:07:57 +01:00

1 2 3 4 5 ...

2537 commits