linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-09-20 01:20:54 +00:00

Author	SHA1	Message	Date
Hidehiro Kawai	7ffe1ea894	ext4: add checks for errors from jbd2 If the journal has aborted due to a checkpointing failure, we have to keep the contents of the journal space. Otherwise, the filesystem will lose uncheckpointed metadata completely and become inconsistent. To avoid this, we need to keep needs_recovery flag if checkpoint has failed. With this patch, ext4_put_super() detects a checkpointing failure from the return value of journal_destroy(), then it invokes ext4_abort() to make the filesystem read only and keep needs_recovery flag. Errors from jbd2_journal_flush() are also handled by this patch in some places. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-10 20:29:21 -04:00
Hidehiro Kawai	44519faf22	jbd2: fix error handling for checkpoint io When a checkpointing IO fails, current JBD2 code doesn't check the error and continues journaling. This means latest metadata can be lost from both the journal and filesystem. This patch leaves the failed metadata blocks in the journal space and aborts journaling in the case of jbd2_log_do_checkpoint(). To achieve this, we need to do: 1. don't remove the failed buffer from the checkpoint list in the case of __try_to_free_cp_buf() because it may be released or overwritten by a later transaction 2. jbd2_log_do_checkpoint() is the last chance, remove the failed buffer from the checkpoint list and abort the journal 3. when checkpointing fails, don't update the journal super block to prevent the journaled contents from being cleaned. For safety, don't update j_tail and j_tail_sequence either 4. when checkpointing fails, notify this error to the ext4 layer so that ext4 doesn't clear the needs_recovery flag, otherwise the journaled contents are ignored and cleaned in the recovery phase 5. if the recovery fails, keep the needs_recovery flag 6. prevent jbd2_cleanup_journal_tail() from being called between __jbd2_journal_drop_transaction() and jbd2_journal_abort() (a possible race issue between jbd2_log_do_checkpoint()s called by jbd2_journal_flush() and __jbd2_log_wait_for_space()) Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-10 20:29:13 -04:00
Hidehiro Kawai	77e841de8a	jbd2: abort when failed to log metadata buffers If we failed to write metadata buffers to the journal space and succeeded to write the commit record, stale data can be written back to the filesystem as metadata in the recovery phase. To avoid this, when we failed to write out metadata buffers, abort the journal before writing the commit record. We can also avoid this kind of corruption by using the journal checksum feature because it can detect invalid metadata blocks in the journal and avoid them from being replayed. So we don't need to care about asynchronous commit record writeout with a checksum. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-12 16:39:16 -04:00
Theodore Ts'o	03010a3350	ext4: Rename ext4dev to ext4 The ext4 filesystem is getting stable enough that it's time to drop the "dev" prefix. Also remove the requirement for the TEST_FILESYS flag. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-10 20:02:48 -04:00
Linus Torvalds	13dd7f876d	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm: dlm: choose better identifiers dlm: remove bkl dlm: fix address compare dlm: fix locking of lockspace list in dlm_scand dlm: detect available userspace daemon dlm: allow multiple lockspace creates	2008-10-10 11:13:55 -07:00
Christoph Hellwig	73f6aa4d44	Fix barrier fail detection in XFS Currently we disable barriers as soon as we get a buffer in xlog_iodone that has the XBF_ORDERED flag cleared. But this can be the case not only for buffers where the barrier failed, but also the first buffer of a split log write in case of a log wraparound. Due to the disabled barriers we can easily get directory corruption on unclean shutdowns. So instead of using this check add a new buffer flag for failed barrier writes. This is a regression vs 2.6.26 caused by patch to use the right macro to check for the ORDERED flag, as we previously got true returned for every buffer. Thanks to Toei Rei for reporting the bug. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: David Chinner <david@fromorbit.com> Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-10 11:08:07 -07:00
Linus Torvalds	445e1ceda3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: GFS2: Support for I/O barriers GFS2: Add UUID to GFS2 sb GFS2: high time to take some time over atime GFS2: The war on bloat GFS2: GFS2 will panic if you misspell any mount options GFS2: Direct IO write at end of file error GFS2: Use an IS_ERR test rather than a NULL test GFS2: Fix race relating to glock min-hold time GFS2: Fix & clean up GFS2 rename GFS2: rm on multiple nodes causes panic GFS2: Fix metafs mounts GFS2: Fix debugfs glock file iterator	2008-10-10 11:02:22 -07:00
Linus Torvalds	e26feff647	Merge branch 'for-2.6.28' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.28' of git://git.kernel.dk/linux-2.6-block: (132 commits) doc/cdrom: Trvial documentation error, file not present block_dev: fix kernel-doc in new functions block: add some comments around the bio read-write flags block: mark bio_split_pool static block: Find bio sector offset given idx and offset block: gendisk integrity wrapper block: Switch blk_integrity_compare from bdev to gendisk block: Fix double put in blk_integrity_unregister block: Introduce integrity data ownership flag block: revert part of d7533ad0e132f92e75c1b2eb7c26387b25a583c1 bio.h: Remove unused conditional code block: remove end_{queued|dequeued}_request() block: change elevator to use __blk_end_request() gdrom: change to use __blk_end_request() memstick: change to use __blk_end_request() virtio_blk: change to use __blk_end_request() blktrace: use BLKTRACE_BDEV_SIZE as the name size for setup structure block: add lld busy state exporting interface block: Fix blk_start_queueing() to not kick a stopped queue include blktrace_api.h in headers_install ...	2008-10-10 10:52:45 -07:00
Alexey Dobriyan	3bbfe05967	proc: remove kernel.maps_protect After commit `831830b5a2` aka "restrict reading from /proc/<pid>/maps to those who share ->mm or can ptrace" sysctl stopped being relevant because commit moved security checks from ->show time to ->start time (mm_for_maps()). Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Kees Cook <kees.cook@canonical.com>	2008-10-10 04:24:51 +04:00
Alexey Dobriyan	45acb8db06	proc: remove now unneeded ADDBUF macro After local seq_file conversion it was forgotten. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:58 +04:00
Kees Cook	4783072308	[PATCH] proc: show personality via /proc/pid/personality Make process personality flags visible in /proc. Since a process's personality is potentially sensitive (e.g. READ_IMPLIES_EXEC), make this file only readable by the process owner. Signed-off-by: Kees Cook <kees.cook@canonical.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:57 +04:00
Lai Jiangshan	a6bebbc87a	[PATCH] signal, procfs: some lock_task_sighand() users do not need rcu_read_lock() lock_task_sighand() make sure task->sighand is being protected, so we do not need rcu_read_lock(). [ exec() will get task->sighand->siglock before change task->sighand! ] But code using rcu_read_lock() _just_ to protect lock_task_sighand() only appear in procfs. (and some code in procfs use lock_task_sighand() without such redundant protection.) Other subsystem may put lock_task_sighand() into rcu_read_lock() critical region, but these rcu_read_lock() are used for protecting "for_each_process()", "find_task_by_vpid()" etc. , not for protecting lock_task_sighand(). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> [ok from Oleg] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:57 +04:00
Alexey Dobriyan	53167a3ef2	proc: move PROC_PAGE_MONITOR to fs/proc/Kconfig Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:57 +04:00
Adrian Bunk	81324364b7	proc: make grab_header() static Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:56 +04:00
Alexey Dobriyan	a70973c214	proc: remove unused get_dma_list() Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:56 +04:00
Alexey Dobriyan	a04f4de641	proc: remove dummy vmcore_open() Empty ->open is equivalent to always succeeding ->open. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:55 +04:00
Alexey Dobriyan	e1675231ce	proc: proc_sys_root tweak Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:55 +04:00
Alexey Dobriyan	300b994b74	proc: fix return value of proc_reg_open() in "too late" case If ->open() wasn't called, returning 0 is misleading and, theoretically, oopsable: 1) remove_proc_entry clears ->proc_fops, drops lock, 2) ->open "succeeds", 3) ->release oopses, because it assumes ->open was called (single_release()). Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:54 +04:00
Linus Torvalds	efc968d450	Don't allow splice() to files opened with O_APPEND This is debatable, but while we're debating it, let's disallow the combination of splice and an O_APPEND destination. It's not entirely clear what the semantics of O_APPEND should be, and POSIX apparently expects pwrite() to ignore O_APPEND, for example. So we could make up any semantics we want, including the old ones. But Miklos convinced me that we should at least give it some thought, and that accepting writes at arbitrary offsets is wrong at least for IS_APPEND() files (which always have O_APPEND set, even if the reverse isn't true: you can obviously have O_APPEND set on a regular file). So disallow O_APPEND entirely for now. I doubt anybody cares, and this way we have one less gray area to worry about. Reported-and-argued-for-by: Miklos Szeredi <miklos@szeredi.hu> Acked-by: Jens Axboe <ens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-09 14:26:38 -07:00
Randy Dunlap	57d1b5366f	block_dev: fix kernel-doc in new functions Fix kernel-doc in new functions: Error(mmotm-2008-1002-1617//fs/block_dev.c:895): duplicate section name 'Description' Error(mmotm-2008-1002-1617//fs/block_dev.c:924): duplicate section name 'Description' Warning(mmotm-2008-1002-1617//fs/block_dev.c:1282): No description found for parameter 'pathname' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> cc: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 10:42:38 +02:00
Denis ChengRq	6feef531f5	block: mark bio_split_pool static Since all bio_split calls refer the same single bio_split_pool, the bio_split function can use bio_split_pool directly instead of the mempool_t parameter; then the mempool_t parameter can be removed from bio_split param list, and bio_split_pool is only referred in fs/bio.c file, can be marked static. Signed-off-by: Denis ChengRq <crquan@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:57:05 +02:00
Martin K. Petersen	ad3316bf4e	block: Find bio sector offset given idx and offset Helper function to find the sector offset in a bio given bvec index and page offset. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:22 +02:00
Martin K. Petersen	74aa8c2cc0	block: Introduce integrity data ownership flag A filesystem might supply its own integrity metadata. Introduce a flag that indicates whether the filesystem or the block layer owns the integrity buffer. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:21 +02:00
Jens Axboe	b04accc425	block: revert part of d7533ad0e132f92e75c1b2eb7c26387b25a583c1 We need bdev_get_integrity() to support the pending md/dm patches. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:21 +02:00
Jens Axboe	9c02f2b02e	block: cleanup some of the integrity stuff in blkdev.h Don't put functions that are only used in fs/bio-integrity.c in blkdev.h, it's much cleaner to just keep it in there. Also kill completely unused bdev_get_tag_size() Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:17 +02:00
Jens Axboe	0a0d96b03a	block: add bio_kmalloc() Not all callers need (or want!) the mempool backing guarantee, it essentially means that you can only use bio_alloc() for short allocations and not for preallocating some bio's at setup or init time. So add bio_kmalloc() which does the same thing as bio_alloc(), except it just uses kmalloc() as the backing instead of the bio mempools. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:17 +02:00
Andrew Patterson	608aeef17a	Call flush_disk() after detecting an online resize. We call flush_disk() to make sure the buffer cache for the disk is flushed after a disk resize. There are two resize cases, growing and shrinking. Given that users can shrink/then grow a disk before revalidate_disk() is called, we treat the grow case identically to shrinking. We need to flush the buffer cache after an online shrink because, as James Bottomley puts it, The two use cases for shrinking I can see are 1. planned: the fs is already shrunk to within the new boundaries and all data is relocated, so invalidate is fine (any dirty buffers that might exist in the shrunk region are there only because they were relocated but not yet written to their original location). 2. unplanned: In this case, the fs is probably toast, so whether we invalidate or not isn't going to make a whole lot of difference; it's still going to try to read or write from sectors beyond the new size and get I/O errors. Immediately invalidating shrunk disks will cause errors for outstanding I/Os for reads/writes beyond the new end of the disk to be generated earlier than if we waited for the normal buffer cache operation. It also removes a potential security hole where we might keep old data around from beyond the end of the shrunk disk if the disk was not invalidated. Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:13 +02:00
Andrew Patterson	56ade44b46	Added flush_disk to factor out common buffer cache flushing code. We need to be able to flush the buffer cache for more than just when a disk is changed, so we factor out common cache flush code in check_disk_change() to an internal flush_disk() routine. This routine will then be used for both disk changes and disk resizes (in a later patch). Include the disk name in the text indicating that there are busy inodes on the device and increase the KERN severity of the message. Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:13 +02:00
Andrew Patterson	9bc3ffbfbd	Check for device resize when rescanning partitions Check for device resize in the rescan_partitions() routine. If the device has been resized, the bdev size is set to match. The rescan_partitions() routine is called when opening the device and when calling the BLKRRPART ioctl. Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:12 +02:00
Andrew Patterson	c3279d1454	Adjust block device size after an online resize of a disk. The revalidate_disk routine now checks if a disk has been resized by comparing the gendisk capacity to the bdev inode size. If they are different (usually because the disk has been resized underneath the kernel) the bdev inode size is adjusted to match the capacity. Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:12 +02:00
Andrew Patterson	0c002c2f74	Wrapper for lower-level revalidate_disk routines. This is a wrapper for the lower-level revalidate_disk call-backs such as sd_revalidate_disk(). It allows us to perform pre and post operations when calling them. We will use this wrapper in a later patch to adjust block device sizes after an online resize (a _post_ operation). Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:12 +02:00
FUJITA Tomonori	818827669d	block: make blk_rq_map_user take a NULL user-space buffer This patch changes blk_rq_map_user to accept a NULL user-space buffer with a READ command if rq_map_data is not NULL. Thus a caller can pass page frames to blk_rq_map_user to just set up a request and bios with page frames properly. bio_uncopy_user (called via blk_rq_unmap_user) doesn't copy data to user space with such a request. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:11 +02:00
FUJITA Tomonori	4d8ab62e08	bio: convert bio_copy_kern to use bio_copy_user bio_copy_kern and bio_copy_user are very similar. This converts bio_copy_kern to use bio_copy_user. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:10 +02:00
FUJITA Tomonori	152e283fdf	block: introduce struct rq_map_data to use reserved pages This patch introduces struct rq_map_data to enable bio_copy_user_iov() to use reserved pages. Currently, bio_copy_user_iov allocates bounce pages but drivers/scsi/sg.c wants to allocate pages by itself and use them. struct rq_map_data can be used to pass allocated pages to bio_copy_user_iov. The current users of bio_copy_user_iov simply pass NULL (they don't want to use pre-allocated pages). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Douglas Gilbert <dougg@torque.net> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:10 +02:00
FUJITA Tomonori	a3bce90edd	block: add gfp_mask argument to blk_rq_map_user and blk_rq_map_user_iov Currently, blk_rq_map_user and blk_rq_map_user_iov always do GFP_KERNEL allocation. This adds gfp_mask argument to blk_rq_map_user and blk_rq_map_user_iov so sg can use it (sg always does GFP_ATOMIC allocation). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Douglas Gilbert <dougg@torque.net> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:10 +02:00
Jens Axboe	c7c22e4d5c	block: add support for IO CPU affinity This patch adds support for controlling the IO completion CPU of either all requests on a queue, or on a per-request basis. We export a sysfs variable (rq_affinity) which, if set, migrates completions of requests to the CPU that originally submitted it. A bio helper (bio_set_completion_cpu()) is also added, so that queuers can ask for completion on that specific CPU. In testing, this has been shown to cut the system time by as much as 20-40% on synthetic workloads where CPU affinity is desired. This requires a little help from the architecture, so it'll only work as designed for archs that are using the new generic smp helper infrastructure. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:09 +02:00
Tejun Heo	3e1a7ff8a0	block: allow disk to have extended device number Now that disk and partition handlings are mostly unified, it's easy to allow disk to have extended device number. This patch makes add_disk() use extended device number if disk->minors is zero. Both sd and ide-disk are updated to use this. * sd_format_disk_name() is implemented which can generically determine the drive name. This removes disk number restriction stemming from limited device names. * If sd index goes over SD_MAX_DISKS (which can be increased now BTW), sd simply doesn't initialize minors letting block layer choose extended device number. * If CONFIG_DEBUG_EXT_DEVT is set, both sd and ide-disk always set minors to 0 and use extended device numbers. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	689d6fac40	block: replace @ext_minors with GENHD_FL_EXT_DEVT With previous changes, it's meaningless to limit the number of partitions. Replace @ext_minors with GENHD_FL_EXT_DEVT such that setting the flag allows the disk to have maximum number of allowed partitions (only limited by the number of entries in parsed_partitions as determined by MAX_PART constant). This kills not-too-pretty alloc_disk_ext[_node]() functions and makes @minors parameter to alloc_disk[_node]() unnecessary. The parameter is left alone to avoid disturbing the users. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	540eed5637	block: make partition array dynamic disk->__part used to be statically allocated to the maximum possible number of partitions. This patch makes partition array allocation dynamic. The added overhead is minimal as only real change is one memory dereference changed to RCU one. This saves both a bit of memory and cpu cycles iterating through unoccupied slots and makes increasing partition limit easier. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	074a7aca7a	block: move stats from disk to part0 Move stats related fields - stamp, in_flight, dkstats - from disk to part0 and unify stat handling such that... * part_stat_*() now updates part0 together if the specified partition is not part0. ie. part_stat_*() are now essentially all_stat_*(). {disk|all}_stat_*() are gone. part_round_stats() is updated similarly. It handles part0 stats automatically and disk_round_stats() is killed. * part_{inc|dec}_in_flight() is implemented which automatically updates part0 stats for parts other than part0. * disk_map_sector_rcu() is updated to return part0 if no part matches. Combined with the above changes, this makes NULL special case handling in callers unnecessary. * Separate stats show code paths for disk are collapsed into part stats show code paths. * Rename disk_stat_lock/unlock() to part_stat_lock/unlock() While at it, reposition stat handling macros a bit and add missing parentheses around macro parameters. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	eddb2e26b5	block: kill GENHD_FL_FAIL and use part0->make_it_fail GENHD_FL_FAIL for disk is what make_it_fail is for parts. Kill it and use part0->make_it_fail. Sysfs node handling is unified too. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	0762b8bde9	block: always set bdev->bd_part Till now, bdev->bd_part is set only if the bdev was for parts other than part0. This patch makes bdev->bd_part always set so that code paths don't have to differentiate common handling. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	4c46501d16	block: move holder_dir from disk to part0 Move disk->holder_dir to part0->holder_dir. Kill now mostly superfluous bdev_get_holder(). While at it, kill superfluous kobject_get/put() around holder_dir, slave_dir and cmd_filter creation and collapse disk_sysfs_add_subdirs() into register_disk(). These serve no purpose but obfuscating the code. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	b7db9956e5	block: move policy from disk to part0 Move disk->policy to part0->policy. Implement and use get_disk_ro(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:07 +02:00
Tejun Heo	e561052149	block: unify sysfs size node handling Now that capacity and __dev are moved to part0, part0 and others can share the same method. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:07 +02:00
Tejun Heo	80795aefb7	block: move capacity from disk to part0 Move disk->capacity to part0->nr_sects and convert all users who directly accessed the field to use {get|set}_capacity(). This is done early to allow the __dev field to be moved. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:07 +02:00
Tejun Heo	b5d0b9df0b	block: introduce partition 0 genhd and partition code handled disk and partitions separately. All information about the whole disk was in struct genhd and partitions in struct hd_struct. However, the whole disk (part0) and other partitions have a lot in common and the data structures end up having good number of common fields and thus separate code paths doing the same thing. Also, the partition array was indexed by partno - 1 which gets pretty confusing at times. This patch introduces partition 0 and makes the partition array indexed by partno. Following patches will unify the handling of disk and parts piece-by-piece. This patch also implements disk_partitionable() which tests whether a disk is partitionable. With coming dynamic partition array change, the most common usage of disk_max_parts() will be testing whether a disk is partitionable and the number of max partitions will become much less important. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:07 +02:00
Tejun Heo	ed9e198234	block: implement and use {disk|part}_to_dev() Implement {disk|part}_to_dev() and use them to access generic device instead of directly dereferencing {disk|part}->dev. To make sure no user is left behind, rename generic devices fields to __dev. This is in preparation of unifying partition 0 handling with other partitions. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:07 +02:00
Tejun Heo	bcce3de1be	block: implement extended dev numbers Implement extended device numbers. A block driver can tell block layer that it wants to use extended device numbers. After the usual minor space is used up, block layer automatically allocates devt's from EXT_BLOCK_MAJOR. Currently only one major number is allocated for this but as the allocation is strictly on-demand, ~1mil minor space under it should suffice unless the system actually has more than ~1mil partitions and if that ever happens adding more majors to the extended devt area is easy. Due to internal implementation issues, the first partition can't be allocated on the extended area. In other words, genhd->minors should at least be 1. This limitation will be lifted by later changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:06 +02:00
Tejun Heo	c995905916	block: fix diskstats access There are two variants of stat functions - ones prefixed with double underbars which don't care about preemption and ones without which disable preemption before manipulating per-cpu counters. It's unclear whether the underbarred ones assume that preemption is disabled on entry as some callers don't do that. This patch unifies diskstats access by implementing disk_stat_lock() and disk_stat_unlock() which take care of both RCU (for partition access) and preemption (for per-cpu counter access). diskstats access should always be enclosed between the two functions. As such, there's no need for the versions which disable preemption. They're removed and double underbars ones are renamed to drop the underbars. As an extra argument is added, there's no danger of using the old version unconverted. disk_stat_lock() uses get_cpu() and returns the cpu index and all diskstat functions which access per-cpu counters now have @cpu argument to help RT. This change adds RCU or preemption operations at some places but also collapses several preemption ops into one at others. Overall, the performance difference should be negligible as all involved ops are very lightweight per-cpu ones. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:06 +02:00
Tejun Heo	e71bf0d0ee	block: fix disk->part[] dereferencing race disk->part[] is protected by its matching bdev's lock. However, non-critical accesses like collecting stats and printing out sysfs and proc information used to be performed without any locking. As partitions can come and go dynamically, partitions can go away underneath those non-critical accesses. As some of those accesses are writes, this theoretically can lead to silent corruption. This patch fixes the race by using RCU for the partition array and dev reference counter to hold partitions. * Rename disk->part[] to disk->__part[] to make sure no one outside genhd layer proper accesses it directly. * Use RCU for disk->__part[] dereferencing. * Implement disk_{get|put}_part() which can be used to get and put partitions from gendisk respectively. * Iterators are implemented to help iterate through all partitions safely. * Functions which require RCU readlock are marked with _rcu suffix. * Use disk_put_part() in __blkdev_put() instead of directly putting the contained kobject. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:06 +02:00
Tejun Heo	f331c0296f	block: don't depend on consecutive minor space * Implement disk_devt() and part_devt() and use them to directly access devt instead of computing it from ->major and ->first_minor. Note that all references to ->major and ->first_minor outside of block layer are used to determine devt of the disk (the part0) and as ->major and ->first_minor will continue to represent devt for the disk, converting these users isn't strictly necessary. However, convert them for consistency. * Implement disk_max_parts() to avoid directly dereferencing genhd->minors. * Update bdget_disk() such that it doesn't assume consecutive minor space. * Move devt computation from register_disk() to add_disk() and make it the only one (all other usages use the initially determined value). These changes clean up the code and will help disk->part dereference fix and extended block device numbers. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:05 +02:00
Tejun Heo	cf771cb5a7	block: make variable and argument names more consistent In hd_struct, @partno is used to denote partition number and a number of other places use @part to denote hd_struct. Functions use @part and @index instead. This causes confusion and makes it difficult to use consistent variable names for hd_struct. Always use @partno if a variable represents partition number. Also, print out functions use @f or @part for seq_file argument. Use @seqf uniformly instead. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:05 +02:00
Tejun Heo	88e341261c	block: update add_partition() error handling `d805dda4` tried to fix error case handling in add_partition() but had a few problems. * disk->part[] entry is set early and left dangling if operation fails. * Once device initialized, the last put_device() is responsible for freeing all the resources. The failure path freed part_stats and p regardless of put_device() causing double free. * holders subdir holds reference to the disk device, so failure path should remove it to release resources properly which was missing. This patch fixes the above problems and while at it move partition slot busy check into add_partition() for completeness and inlines holders subdirectory creation. Using separate function for it just obfuscates the code. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Abdel Benamrouche <draconux@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:04 +02:00
Tejun Heo	ec2cdedf79	block: allow deleting zero length partition delete_partition() was a no-op for zero length partitions. As the addition code allows creating zero length partitions and deletion is assumed to always succeed, this causes a memory leak for zero length partitions. Allow zero length partitions to end their meaningless lives. While at it, allow deleting zero length partitions via the BLKPG_DEL_PARTITION ioctl too. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:04 +02:00
Mikulas Patocka	5df97b91b5	drop vmerge accounting Remove hw_segments field from struct bio and struct request. Without virtual merge accounting they have no purpose. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:03 +02:00
Mikulas Patocka	b8b3e16cfe	block: drop virtual merging accounting Remove virtual merge accounting. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:03 +02:00
David Woodhouse	8c540a96c1	Let the block device know when sectors can be discarded [hirofumi@mail.parknet.co.jp: discard _after_ checking for corrupt chains] Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:01 +02:00
Steve French	b77d753c41	[CIFS] Check that last search entry resume key is valid Jeff's recent patch to add a last_entry field in the search structure to better construct resume keys did not validate that the server sent us a plausible pointer to the last entry. This adds that. Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-10-08 19:13:46 +00:00
Steve French	0752f1522a	[CIFS] make sure we have the right resume info before calling CIFSFindNext When we do a seekdir() or equivalent, we usually end up doing a FindFirst call and then call FindNext until we get to the offset that we want. The problem is that when we call FindNext, the code usually doesn't have the proper info (mostly, the filename of the entry from the last search) to resume the search. Add a "last_entry" field to the cifs_search_info that points to the last entry in the search. We calculate this pointer by using the LastNameOffset field from the search parms that are returned. We then use that info to do a cifs_save_resume_key before we call CIFSFindNext. This patch allows CIFS to reliably pass the "telldir" connectathon test. Signed-off-by: Jeff Layton <jlayton@redhat.com> CC: Stable <stable@kernel.org> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-10-07 20:03:33 +00:00
Steve French	6050247d80	[CIFS] clean up error handling in cifs_unlink Currently, if a standard delete fails and we end up getting -EACCES we try to clear ATTR_READONLY and try the delete again. If that then fails with -ETXTBSY then we try a rename_pending_delete. We aren't handling other errors appropriately though. Another client could have deleted the file in the meantime and we get back -ENOENT, for instance. In that case we wouldn't do a d_drop. Instead of retrying in a separate call, simply goto the original call and use the error handling from that. Also, we weren't properly undoing any attribute changes that were done before returning an error back to the caller. CC: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-10-07 18:42:52 +00:00
Andi Kleen	39d80c33a0	ext4: Avoid double dirtying of super block in ext4_put_super() While reading code I noticed that ext4_put_super() dirties the superblock bh twice. It is always done in ext4_commit_super() too. Remove the redundant dirty operation. Should be a nop semantically. Signed-off-by: Andi Kleen <ak@linux.intel.com>	2008-10-06 21:37:44 -04:00
Eric Sandeen	6873fa0de1	Hook ext4 to the vfs fiemap interface. ext4_ext_walk_space() was reinstated to be used for iterating over file extents with a callback; it is used by the ext4 fiemap implementation. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-ext4@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org	2008-10-07 00:46:36 -04:00
Jeff Layton	6b37faa175	[CIFS] fix some settings of cifsAttrs after calling SetFileInfo and SetPathInfo We only need to set them when we call SetFileInfo or SetPathInfo directly, and as soon as possible after then. We had one place setting it where it didn't need to be, and another place where it was missing. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-10-06 21:54:41 +00:00
Josef Bacik	68c9d702bb	generic block based fiemap implementation Any block based fs (this patch includes ext3) just has to declare its own fiemap() function and then call this generic function with its own get_block_t. This works well for block based filesystems that will map multiple contiguous blocks at one time, but will also work for filesystems that only map one block at a time; you will just end up with an "extent" for each block. One gotcha is this will not play nicely where there is hole+data after the EOF. This function will assume it has hit the end of the data as soon as it hits a hole after the EOF, so if there is any data past that it will not pick that up. AFAIK no block based fs does this anyway, but it's in the comments of the function just in case. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org	2008-10-03 17:32:43 -04:00
Mark Fasheh	00dc417fa3	ocfs2: fiemap support Plug ocfs2 into ->fiemap. Some portions of ocfs2_get_clusters() had to be refactored so that the extent cache can be skipped in favor of going directly to the on-disk records. This makes it easier for us to determine which extent is the last one in the btree. Also, I'm not sure we want to be caching fiemap lookups anyway as they're not directly related to data read/write. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: ocfs2-devel@oss.oracle.com Cc: linux-fsdevel@vger.kernel.org	2008-10-03 17:32:11 -04:00
Mark Fasheh	c4b929b85b	vfs: vfs-level fiemap interface Basic vfs-level fiemap infrastructure, which sets up a new ->fiemap inode operation. Userspace can get extent information on a file via fiemap ioctl. As input, the fiemap ioctl takes a struct fiemap which includes an array of struct fiemap_extent (fm_extents). Size of the extent array is passed as fm_extent_count and number of extents returned will be written into fm_mapped_extents. Offset and length fields on the fiemap structure (fm_start, fm_length) describe a logical range which will be searched for extents. All extents returned will at least partially contain this range. The actual extent offsets and ranges returned will be unmodified from their offset and range on-disk. The fiemap ioctl returns '0' on success. On error, -1 is returned and errno is set. If errno is equal to EBADR, then fm_flags will contain those flags which were passed in which the kernel did not understand. On all other errors, the contents of fm_extents is undefined. As fiemap evolved, there have been many authors of the vfs patch. As far as I can tell, the list includes: Kalpak Shah <kalpak.shah@sun.com> Andreas Dilger <adilger@sun.com> Eric Sandeen <sandeen@redhat.com> Mark Fasheh <mfasheh@suse.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: linux-api@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org	2008-10-08 19:44:18 -04:00
Kalpak Shah	4d20c685fa	ext4: fix xattr deadlock ext4_xattr_set_handle() eventually ends up calling ext4_mark_inode_dirty() which tries to expand the inode by shifting the EAs. This leads to the xattr_sem being downed again and leading to a deadlock. This patch makes sure that if ext4_xattr_set_handle() is in the call-chain, ext4_mark_inode_dirty() will not expand the inode. Signed-off-by: Kalpak Shah <kalpak.shah@sun.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-08 23:21:54 -04:00
Theodore Ts'o	45a90bfd90	jbd2: Fix buffer head leak when writing the commit block Also make sure the buffer heads are marked clean before submitting bh for writing. The previous code was marking the buffer head dirty, which would have forced an unneeded write (and seek) to the journal for no good reason. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-06 12:04:02 -04:00
Theodore Ts'o	ede86cc473	ext4: Add debugging markers that can be used by systemtap These debugging markers are designed to debug problems such as the random filesystem latency problems reported by Arjan. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-05 20:50:06 -04:00
Duane Griffin	23f8b79eae	jbd2: abort instead of waiting for nonexistent transaction The __jbd2_log_wait_for_space function sits in a loop checkpointing transactions until there is sufficient space free in the journal. However, if there are no transactions to be processed (e.g. because the free space calculation is wrong due to a corrupted filesystem) it will never progress. Check for space being required when no transactions are outstanding and abort the journal instead of endlessly looping. This patch fixes the bug reported by Sami Liedes at: http://bugzilla.kernel.org/show_bug.cgi?id=10976 Signed-off-by: Duane Griffin <duaneg@dghda.com> Cc: Sami Liedes <sliedes@cc.hut.fi> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-08 23:28:31 -04:00
Frederic Bohe	c806e68f56	ext4: fix initialization of UNINIT bitmap blocks This fixes a bug which caused on-line resizing of filesystems with a 1k blocksize to fail. The root cause of this bug was the fact that if an uninitialized bitmap block gets read in by userspace (which e2fsprogs does try to avoid, but can happen when the blocksize is less than the pagesize and an adjacent block is read into memory) ext4_read_block_bitmap() was erroneously depending on the buffer uptodate flag to decide whether it needed to initialize the bitmap block in memory --- i.e., to set the standard set of blocks in use by a block group (superblock, bitmaps, inode table, etc.). Essentially, ext4_read_block_bitmap() assumed it was the only routine that might try to read a block containing a block bitmap, which is simply not true. To fix this, ext4_read_block_bitmap() and ext4_read_inode_bitmap() must always initialize uninitialized bitmap blocks. Once a block or inode is allocated out of that bitmap, it will be marked as initialized in the block group descriptor, so in general this won't result in any extra unnecessary work. Signed-off-by: Frederic Bohe <frederic.bohe@bull.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-10 08:09:18 -04:00
Theodore Ts'o	c2ea3fde61	ext4: Remove old legacy block allocator Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-10 09:40:52 -04:00
Theodore Ts'o	240799cdf2	ext4: Use readahead when reading an inode from the inode table With modern hard drives, reading 64k takes roughly the same time as reading a 4k block. So request readahead for adjacent inode table blocks to reduce the time it takes when iterating over directories (especially when doing this in htree sort order) in a cold cache case. With this patch, the time it takes to run "git status" on a kernel tree after flushing the caches via "echo 3 > /proc/sys/vm/drop_caches" is reduced by 21%. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-09 23:53:47 -04:00
Nick Piggin	4b19de6d1c	mm: tiny-shmem nommu fix The previous patch `db203d53d4` ("mm: tiny-shmem fix lock ordering: mmap_sem vs i_mutex") to fix the lock ordering in tiny-shmem breaks shared anonymous and IPC memory on NOMMU architectures because it was using the expanding truncate to signal ramfs to allocate a physically contiguous RAM backing the inode (otherwise it is unusable for "memory mapping" it to userspace). However do_truncate is what caused the lock ordering error, due to it taking i_mutex. In this case, we can actually just call ramfs directly to allocate memory for the mapping, rather than go via truncate. Acked-by: David Howells <dhowells@redhat.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-02 15:53:13 -07:00
Nick Piggin	16dbc6c961	inotify: fix lock ordering wrt do_page_fault's mmap_sem Fix inotify lock order reversal with mmap_sem due to holding locks over copy_to_user. Signed-off-by: Nick Piggin <npiggin@suse.de> Reported-by: "Daniel J Blueman" <daniel.blueman@gmail.com> Tested-by: "Daniel J Blueman" <daniel.blueman@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-02 15:53:13 -07:00
Balbir Singh	31a78f23ba	mm owner: fix race between swapoff and exit There's a race between mm->owner assignment and swapoff, more easily seen when task slab poisoning is turned on. The condition occurs when try_to_unuse() runs in parallel with an exiting task. A similar race can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats> or ptrace or page migration. CPU0 CPU1 try_to_unuse looks at mm = task0->mm increments mm->mm_users task 0 exits mm->owner needs to be updated, but no new owner is found (mm_users > 1, but no other task has task->mm = task0->mm) mm_update_next_owner() leaves mmput(mm) decrements mm->mm_users task0 freed dereferencing mm->owner fails The fix is to notify the subsystem via mm_owner_changed callback(), if no new owner is found, by specifying the new task as NULL. Jiri Slaby: mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but must be set after that, so as not to pass NULL as old owner causing oops. Daisuke Nishimura: mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task() and its callers need to take account of this situation to avoid oops. Hugh Dickins: Lockdep warning and hang below exec_mmap() when testing these patches. exit_mm() up_reads mmap_sem before calling mm_update_next_owner(), so exec_mmap() now needs to do the same. And with that repositioning, there's now no point in mm_need_new_owner() allowing for NULL mm. Reported-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-29 08:41:47 -07:00
Linus Torvalds	d0185c0882	Fix NULL pointer dereference in proc_sys_compare The VFS interface for the 'd_compare()' is a bit special (read: 'odd'), because it really just essentially replaces a memcmp(). The filesystem is supposed to just compare the two names with whatever case-independent or other function. And when I say 'is supposed to', I obviously mean that 'procfs does odd things, and actually looks at the dentry that we don't even pass down, rather than just the name'. Which results in problems, because we actually call d_compare before we have even verified that the dentry is still hashed at all. And that causes a problem since the inode that procfs looks at may have been free'd and the d_inode pointer is NULL. procfs just assumes that all dentries are positive, since procfs itself never generates a negative one. But memory pressure will still result in the dentry getting torn down, and as it is removed by RCU, it still remains visible on some lists - and to d_compare. If the filesystem just did a name comparison, we wouldn't care. And we could just fix procfs to know about negative dentries too. But rather than have the low-level filesystems know about internal VFS details, just move the check for a unhashed dentry up a bit, so that we will only call d_compare on dentries that are still active. The actual oops this caused didn't look like a NULL pointer dereference because procfs did a 'container_of(inode, struct proc_inode, vfs_inode)' to get at its internal proc_inode information from the inode pointer, and accessed a field below the inode. So the oops would look something like BUG: unable to handle kernel paging request at fffffffffffffff0 IP: [<ffffffff802bc6c6>] proc_sys_compare+0x36/0x50 and was seen on both x86-64 (Alexey Dobriyan and Hugh Dickins) and ppc64 (Hugh Dickins). Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Hugh Dickins <hugh@veritas.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-29 07:42:57 -07:00
Linus Torvalds	ec4d90287e	Merge git://oss.sgi.com:8090/xfs/linux-2.6 * git://oss.sgi.com:8090/xfs/linux-2.6: [XFS] Remove xfs_iext_irec_compact_full() [XFS] Fix extent list corruption in xfs_iext_irec_compact_full().	2008-09-26 08:49:34 -07:00
Linus Torvalds	bde40fe071	Merge branch 'linux-next' of git://git.infradead.org/~dedekind/ubifs-2.6 * 'linux-next' of git://git.infradead.org/~dedekind/ubifs-2.6: UBIFS: fix printk format warnings UBIFS: remove incorrect assert UBIFS: TNC / GC race fixes UBIFS: create the name of the background thread in every case	2008-09-26 08:20:26 -07:00
Steven Whitehouse	254db57f9b	GFS2: Support for I/O barriers This patch adds barrier support to GFS2. There is not a lot of change really... we just add the barrier flag when we write journal header blocks. If the underlying device refuses to support them, we fall back to the previous way of doing things (wait for the I/O and hope) since there is nothing else we can do. There is no user configuration, barriers will always be on unless the device refuses to support them. This seems a reasonable solution to me since this is a correctness issue. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2008-09-26 10:23:22 +01:00
Lachlan McIlroy	71a8c87fb3	[XFS] Remove xfs_iext_irec_compact_full() Yet another bug was found in xfs_iext_irec_compact_full() and while the source of the bug was found it wasn't an easy task to track it down because the conditions are very difficult to reproduce. A HUGE thank-you goes to Russell Cattelan and Eric Sandeen for their significant effort in tracking down the source of this corruption. xfs_iext_irec_compact_full() and xfs_iext_irec_compact_pages() are almost identical - they both compact indirect extent lists by moving extents from subsequent buffers into earlier ones. xfs_iext_irec_compact_pages() only moves extents if all of the extents in the next buffer will fit into the empty space in the buffer before it. xfs_iext_irec_compact_full() will go a step further and move part of the next buffer if all the extents won't fit. It will then shift the remaining extents in the next buffer up to the start of the buffer. The bug here was that we did not update er_extoff and this caused extent list corruption. It does not appear that this extra functionality gains us much. Calling xfs_iext_irec_compact_pages() instead will do a good enough job at compacting the indirect list and will be quicker too. For the case in xfs_iext_indirect_to_direct() the total number of extents in the indirect list will fit into one buffer so we will never need the extra functionality of xfs_iext_irec_compact_full() there. Also xfs_iext_irec_compact_pages() doesn't need to do a memmove() (the buffers will never overlap) so we don't want the performance hit that can incur. SGI-PV: 987159 SGI-Modid: xfs-linux-melb:xfs-kern:32166a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>	2008-09-26 12:17:57 +10:00
Lachlan McIlroy	f1ccd29551	[XFS] Fix extent list corruption in xfs_iext_irec_compact_full(). If we don't move all the records from the next buffer into the current buffer then we need to update the er_extoff field of the next buffer as we shift the remaining records to the start of the buffer. SGI-PV: 987159 SGI-Modid: xfs-linux-melb:xfs-kern:32165a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Russell Cattelan <cattelan@thebarn.com>	2008-09-26 12:16:46 +10:00
Julien Brunel	62aa528e02	9p: use an IS_ERR test rather than a NULL test In case of error, the function p9_client_walk returns an ERR pointer, but never returns a NULL pointer. So a NULL test that comes after an IS_ERR test should be deleted. The semantic match that finds this problem is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @match_bad_null_test@ expression x, E; statement S1,S2; @@ x = p9_client_walk(...) ... when != x = E * if (x != NULL) S1 else S2 // </smpl> Signed-off-by: Julien Brunel <brunel@diku.dk> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2008-09-24 16:22:22 -05:00
Jeff Layton	dfd15c46a6	cifs: explicitly revoke SPNEGO key after session setup The SPNEGO blob returned by an upcall can only be used once. Explicitly revoke it to make sure that we never pick it up again after session setup exits. This doesn't seem to be that big an issue on more recent kernels, but older kernels seem to link keys into the session keyring by default. That said, explicitly revoking the key seems like a reasonable thing to do here. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-09-24 20:59:37 +00:00
Nick Piggin	d9414774dc	cifs: Convert cifs to new aops. This patch is based on the one originally posted by Nick Piggin. His patch was very close, but had a couple of small bugs. Nick's original comments follow: This is another relatively naive conversion. Always do the read upfront when the page is not uptodate (unless we're in the writethrough path). Fix an uninitialized data exposure where SetPageUptodate was called before the page was uptodate. SetPageUptodate and switch to writeback mode in the case that the full page was dirtied. Acked-by: Shaggy <shaggy@austin.ibm.com> Acked-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-09-24 19:32:56 +00:00
Steve French	d388908ec4	[CIFS] update DOS attributes in cifsInode if we successfully changed them Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-09-24 19:22:52 +00:00
Jeff Layton	391e575556	cifs: remove NULL termination from rename target in CIFSSMBRenameOpenFIle The rename destination isn't supposed to be null terminated. Also, change the name string arg to be const. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-09-24 19:07:36 +00:00
Jeff Layton	7ce86d5a93	cifs: work around samba returning -ENOENT on SetFileDisposition call Samba seems to return STATUS_OBJECT_NAME_NOT_FOUND when we try to set the delete on close bit after doing a rename by filehandle. This looks like a samba bug to me, but a lot of servers will do this. For now, pretend an -ENOENT return is a success. Samba does however seem to respect the CREATE_DELETE_ON_CLOSE bit when opening files that already exist. Windows will ignore it, so adding it to the open flags should be harmless. We're also currently ignoring the return code on the rename by filehandle, so no need to set rc based on it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-09-24 18:59:20 +00:00
Jeff Layton	74553b1b6a	cifs: fix inverted NULL check after kmalloc Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-09-24 18:55:11 +00:00
Theodore Ts'o	5e8814f2f7	ext4: Combine proc file handling into a single set of functions Previously mballoc created a separate set of functions for each proc file. This combines the tunables into a single set of functions which gets used for all of the per-superblock proc files, saving approximately 2k of compiled object code. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-23 18:07:35 -04:00
Steve French	9d81523480	[CIFS] clean up upcall handling for dns_resolver keys We're given the datalen in the downcall, so there's no need to do any calls to strlen(). Just keep track of the datalen in the key. Finally, add a sanity check of the data in the downcall to make sure that it looks like a real IP address. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-09-23 18:46:07 +00:00
Steve French	ee2fd967fb	[CIFS] fix busy-file renames and refactor cifs_rename logic Break out the code that does the actual renaming into a separate function and have cifs_rename call that. That function will attempt a path based rename first and then do a filehandle based one if it looks like the source is busy. The existing logic tried a path based rename first, but if we needed to remove the destination then it only attempted a filehandle based rename afterward. Not all servers support renaming by filehandle, so we need to always attempt path rename first and fall back to filehandle rename if it doesn't work. This also fixes renames of open files on windows servers (at least when the source and destination directories are the same). CC: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-09-23 18:23:33 +00:00
Jeff Layton	6d22f09896	cifs: add function to set file disposition The proper way to set the delete on close bit on an already existing file is to use SET_FILE_INFO with an infolevel of SMB_FILE_DISPOSITION_INFO. Add a function to do that and have the silly-rename code use it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-09-23 17:39:28 +00:00
Steve French	7c9c3760b3	[CIFS] add constants for string lengths of keynames in SPNEGO upcall string Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-09-23 17:23:09 +00:00
Jeff Layton	a12a1ac7a4	cifs: move rename and delete-on-close logic into helper function When a file is still open on the server, we attempt to set the DELETE_ON_CLOSE bit and rename it to a new filename. When the last opener closes the file, the server should delete it. This patch moves this mechanism into a helper function and has the two places in cifs_unlink that do this procedure call it. It also fixes the open flags to be correct. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-09-23 17:11:03 +00:00
Theodore Ts'o	9f6200bbfc	ext4: move /proc setup and teardown out of mballoc.c ...and into the core setup/teardown code in fs/ext4/super.c so that other parts of ext4 can define tuning parameters. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-23 09:18:24 -04:00
Jeff Layton	2846d38647	cifs: have find_writeable_file prefer filehandles opened by same task When the CIFS client goes to write out pages, it needs to pick a filehandle to write to. find_writeable_file however just picks the first filehandle that it finds. This can cause problems when a lock is issued against a particular filehandle and we pick a different filehandle to write to. This patch tries to avert this situation by having find_writable_file prefer filehandles that have a pid that matches the current task. This seems to fix lock test 11 from the connectathon test suite when run against a windows server. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-09-23 02:24:13 +00:00
Pekka Enberg	232087cb73	cifs: don't use GFP_KERNEL with GFP_NOFS GFP_KERNEL and GFP_NOFS are mutually exclusive. If you combine them, you end up with plain GFP_KERNEL which can deadlock in cases where you really want GFP_NOFS. Cc: Steve French <sfrench@samba.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-09-22 22:23:56 +00:00
Theodore Ts'o	f702ba0fd7	ext4: Don't use 'struct dentry' for internal lookups This is a port of a patch from Linus which fixes a 200+ byte stack usage problem in ext4_get_parent(). It's more efficient to pass down only the actual parts of the dentry that matter: the parent inode and the name, instead of allocating a struct dentry on the stack. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-22 15:21:01 -04:00
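
The reasoning in commit 232087cb73 above (GFP_KERNEL and GFP_NOFS are mutually exclusive, and OR-ing them yields plain GFP_KERNEL) can be illustrated with a small userspace sketch. It assumes GFP_NOFS is GFP_KERNEL without the "may re-enter the filesystem" bit. The `DEMO_*` names and bit values below are hypothetical stand-ins for illustration, not the kernel's actual definitions, which live in include/linux/gfp.h and vary between releases.

```c
#include <stdio.h>

/* Hypothetical stand-in bits; NOT the kernel's real values. */
#define DEMO_GFP_WAIT 0x10u /* allocation may sleep */
#define DEMO_GFP_IO   0x40u /* allocation may start physical I/O */
#define DEMO_GFP_FS   0x80u /* allocation may call back into the filesystem */

/* Assumed relationship: NOFS is KERNEL minus the filesystem bit. */
#define DEMO_GFP_NOFS   (DEMO_GFP_WAIT | DEMO_GFP_IO)
#define DEMO_GFP_KERNEL (DEMO_GFP_WAIT | DEMO_GFP_IO | DEMO_GFP_FS)

/* Returns nonzero when the mask permits filesystem re-entry. */
static int demo_may_enter_fs(unsigned int mask)
{
    return (mask & DEMO_GFP_FS) != 0;
}

/* Prints the mask and whether it allows filesystem re-entry. */
static void demo_describe(const char *name, unsigned int mask)
{
    printf("%s = 0x%02x, may enter fs: %s\n",
           name, mask, demo_may_enter_fs(mask) ? "yes" : "no");
}
```

Calling `demo_describe("KERNEL|NOFS", DEMO_GFP_KERNEL | DEMO_GFP_NOFS)` reports that filesystem re-entry is allowed: the OR restores the bit that NOFS was meant to withhold, which is why the combined mask can deadlock where NOFS alone would not.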

10180 commits