Commit Graph

15691 Commits

Author SHA1 Message Date
Ryusuke Konishi 05b4358ad5 nilfs2: add zero-fill for new btree node buffers
Adds missing initialization of newly allocated b-tree node buffers.
This avoids garbage data to be mixed in b-tree node blocks.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-03 12:32:03 +09:00
Ryusuke Konishi aeda7f6343 nilfs2: fix irregular checkpoint creation due to data flush
When nilfs flushes out dirty data to reduce memory pressure, creation
of checkpoints is wrongly postponed.  This bug causes irregular
checkpoint creation especially in small footprint systems.

To correct this issue, a timer for the checkpoint creation has to be
continued if a log writer does not create a checkpoint.

This will do the correction.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-03 12:32:03 +09:00
Ryusuke Konishi b1e19e5601 nilfs2: fix dirty page accounting leak causing hang at write
Bruno Prémont and Dunphy, Bill noticed me that NILFS will certainly
hang on ARM-based targets.

I found this was caused by an underflow of dirty pages counter.  A
b-tree cache routine was marking page dirty without adjusting page
account information.

This fixes the dirty page accounting leak and resolves the hang on
arm-based targets.

Reported-by: Bruno Prémont <bonbons@linux-vserver.org>
Reported-by: Dunphy, Bill <WDunphy@tandbergdata.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Bruno Prémont <bonbons@linux-vserver.org>
Cc: stable <stable@kernel.org>
2009-11-03 12:31:36 +09:00
Linus Torvalds 1836d95928 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  9p: fix readdir corner cases
  9p: fix readlink
  9p: fix a small bug in readdir for long directories
2009-11-02 12:23:21 -08:00
Linus Torvalds d4da6c9ccf Revert "ext4: Remove journal_checksum mount option and enable it by default"
This reverts commit d0646f7b63, as
requested by Eric Sandeen.

It can basically cause an ext4 filesystem to miss recovery (and thus get
mounted with errors) if the journal checksum does not match.

Quoth Eric:

   "My hand-wavy hunch about what is happening is that we're finding a
    bad checksum on the last partially-written transaction, which is
    not surprising, but if we have a wrapped log and we're doing the
    initial scan for head/tail, and we abort scanning on that bad
    checksum, then we are essentially running an unrecovered filesystem.

    But that's hand-wavy and I need to go look at the code.

    We lived without journal checksums on by default until now, and at
    this point they're doing more harm than good, so we should revert
    the default-changing commit until we can fix it and do some good
    power-fail testing with the fixes in place."

See

	http://bugzilla.kernel.org/show_bug.cgi?id=14354

for all the gory details.

Requested-by: Eric Sandeen <sandeen@redhat.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Alexey Fisher <bug-track@fisher-privat.net>
Cc: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mathias Burén <mathias.buren@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-11-02 10:15:27 -08:00
Eric Van Hensbergen 3e2796a90c 9p: fix readdir corner cases
The patch below also addresses a couple of other corner cases in readdir
seen with a large (e.g. 64k) msize.  I'm not sure what people think of
my co-opting of fid->aux here.  I'd be happy to rework if there's a better
way.

When the size of the user supplied buffer passed to readdir is smaller
than the data returned in one go by the 9P read request, v9fs_dir_readdir()
currently discards extra data so that, on the next call, a 9P read
request will be issued with offset < previous offset + bytes returned,
which voilates the constraint described in paragraph 3 of read(5) description.
This patch preseves the leftover data in fid->aux for use in the next call.

Signed-off-by: Jim Garlick <garlick@llnl.gov>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2009-11-02 08:43:45 -06:00
Martin Stava 2511cd0b3b 9p: fix readlink
I do not know if you've looked on the patch, but unfortunately it is
incorrect. A suggested better version is in this email (the old
version didn't work in case the user provided buffer was not long
enough - it incorrectly appended null byte on a position of last char,
and thus broke the contract of the readlink method). However, I'm
still not sure this is 100% correct thing to do, I think readlink is
supposed to return buffer without last null byte in all cases, but we
do return last null byte (even the old version).. on the other hand it
is likely unspecified what is in the remaining part of the buffer, so
null character may be fine there ;):

Signed-off-by: Martin Stava <martin.stava@gmail.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2009-11-02 08:43:45 -06:00
Martin Stava f91b90993f 9p: fix a small bug in readdir for long directories
Here is a proposed patch for bug in readdir. Listing of dirs with
many files fails without this patch.

Signed-off-by: Martin Stava <martin.stava@gmail.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2009-11-02 08:43:44 -06:00
Linus Torvalds a80a66caf8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/fs/xfs/xfs
* 'for-linus' of git://git.kernel.org/pub/scm/fs/xfs/xfs:
  xfs: fix xfs_quota remove error
  xfs: free temporary cursor in xfs_dialloc
2009-10-31 12:12:49 -07:00
Ryota Yamauchi c7ff91d722 xfs: fix xfs_quota remove error
The xfs_quota returns ENOSYS when remove command is executed.
Reproducable with following steps.

    # mount -t xfs -o uquota /dev/sda7 /mnt/mp1
    # xfs_quota -x -c off -c remove
    XFS_QUOTARM: Function not implemented.

The remove command is allowed during quotaoff, but xfs_fs_set_xstate()
checks whether quota is running, and it leads to ENOSYS.

To solve this problem, add a check for X_QUOTARM.

Signed-off-by: Ryota Yamauchi <r-yamauchi@vf.jp.nec.com>
Signed-off-by: Utako Kusaka <u-kusaka@wm.jp.nec.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2009-10-30 09:27:44 +01:00
Eric Sandeen 3b826386d3 xfs: free temporary cursor in xfs_dialloc
Commit bd16956599 seems
to have a slight regression where this code path:

    if (!--searchdistance) {
        /*
         * Not in range - save last search
         * location and allocate a new inode
         */
        ...
        goto newino;
    }

doesn't free the temporary cursor (tcur) that got dup'd in
this function.

This leaks an item in the xfs_btree_cur zone, and it's caught
on module unload:

===========================================================
BUG xfs_btree_cur: Objects remaining on kmem_cache_close()
-----------------------------------------------------------

It seems like maybe a single free at the end of the function might
be cleaner, but for now put a del_cursor right in this code block
similar to the handling in the rest of the function.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2009-10-30 09:27:07 +01:00
Linus Torvalds 68e71d1902 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  backing-dev: ensure that a removed bdi no longer has super_block referencing it
  block: use after free bug in __blkdev_get
  block: silently error unsupported empty barriers too
2009-10-29 09:17:19 -07:00
Linus Torvalds 0d43f5123d Merge branch 'sh/for-2.6.32' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6
* 'sh/for-2.6.32' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6:
  sh: Fix hugetlbfs dependencies for SH-3 && MMU configurations.
  sh: Document uImage.bin target in archhelp.
  sh: add uImage.bin target
  sh: rsk7203 CONFIG_MTD=n fix
  sh: Check for return_to_handler when unwinding the stack
  sh: Build fix: define more __movmem* symbols
  sh: __irq_entry annotate do_IRQ().

Fix up sh/powerpc conflicts in fs/Kconfig
2009-10-29 09:07:15 -07:00
Linus Torvalds fb3165b59f Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFSv4: The link() operation should return any delegation on the file
  NFSv4: Fix two unbalanced put_rpccred() issues.
  NFSv4: Fix a bug when the server returns NFS4ERR_RESOURCE
  nfs: Panic when commit fails
2009-10-29 09:02:24 -07:00
Linus Torvalds 36f8a53ff2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  [CIFS] Fixing to avoid invalid kfree() in cifs_get_tcp_session()
2009-10-29 09:02:01 -07:00
Linus Torvalds 0a53f1693c Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
  powerpc/ppc64: Use preempt_schedule_irq instead of preempt_schedule
  powerpc: Minor cleanup to lib/Kconfig.debug
  powerpc: Minor cleanup to sound/ppc/Kconfig
  powerpc: Minor cleanup to init/Kconfig
  powerpc: Limit memory hotplug support to PPC64 Book-3S machines
  powerpc: Limit hugetlbfs support to PPC64 Book-3S machines
  powerpc: Fix compile errors found by new ppc64e_defconfig
  powerpc: Add a Book-3E 64-bit defconfig
  powerpc/booke: Fix xmon single step on PowerPC Book-E
  powerpc: Align vDSO base address
  powerpc: Fix segment mapping in vdso32
  powerpc/iseries: Remove compiler version dependent hack
  powerpc/perf_events: Fix priority of MSR HV vs PR bits
  powerpc/5200: Update defconfigs
  drivers/serial/mpc52xx_uart.c: Use UPIO_MEM rather than SERIAL_IO_MEM
  powerpc/boot/dts: drop obsolete 'fsl5200-clocking'
  of: Remove nested function
  mpc5200: support for the MAN mpc5200 based board mucmc52
  mpc5200: support for the MAN mpc5200 based board uc101
2009-10-29 08:59:06 -07:00
Linus Torvalds 2375669214 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: fix double IRELE in xfs_dqrele_inode
2009-10-29 08:18:25 -07:00
Jeff Mahoney 47f365eb57 hfs: fix oops on mount with corrupted btree extent records
A particular fsfuzzer run caused an hfs file system to crash on mount.
This is due to a corrupted MDB extent record causing a miscalculation of
HFS_I(inode)->first_blocks for the extent tree.  If the extent records are
zereod out, it won't trigger the first_blocks special case.  Instead it
falls through to the extent code which we're still in the middle of
initializing.

This patch catches the 0 size extent records, reports the corruption, and
fails the mount.

Reported-by: Ramon de Carvalho Valle <rcvalle@linux.vnet.ibm.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-29 07:39:29 -07:00
Ben Hutchings 5c36fe3d87 hfsplus: refuse to mount volumes larger than 2TB
As found in <http://bugs.debian.org/550010>, hfsplus is using type u32
rather than sector_t for some sector number calculations.

In particular, hfsplus_get_block() does:

        u32 ablock, dblock, mask;
...
        map_bh(bh_result, sb, (dblock << HFSPLUS_SB(sb).fs_shift) + HFSPLUS_SB(sb).blockoffset + (iblock & mask));

I am not confident that I can find and fix all cases where a sector number
may be truncated.  For now, avoid data loss by refusing to mount HFS+
volumes with more than 2^32 sectors (2TB).

[akpm@linux-foundation.org: fix 32 and 64-bit issues]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Eric Sesterhenn <snakebyte@gmx.de>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-29 07:39:27 -07:00
Hugh Dickins 370c28def6 hwpoison: fix/proc/meminfo alignment
Given such a long name, the kB count in /proc/meminfo's HardwareCorrupted
line is being shown too far right (it does align with x86_64's VmallocChunk
above, but I hope nobody will ever have that much corrupted!).  Align it.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-29 07:39:25 -07:00
Kumar Gala 0cd9ad73b8 powerpc: Limit hugetlbfs support to PPC64 Book-3S machines
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-10-27 16:42:41 +11:00
Paul Mundt ffb4a73d89 sh: Fix hugetlbfs dependencies for SH-3 && MMU configurations.
The hugetlb dependencies presently depend on SUPERH && MMU while the
hugetlb page size definitions depend on CPU_SH4 or CPU_SH5. This
unfortunately allows SH-3 + MMU configurations to enable hugetlbfs
without a corresponding HPAGE_SHIFT definition, resulting in the build
blowing up.

As SH-3 doesn't support variable page sizes, we tighten up the
dependenies a bit to prevent hugetlbfs from being enabled. These days
we also have a shiny new SYS_SUPPORTS_HUGETLBFS, so switch to using
that rather than adding to the list of corner cases in fs/Kconfig.

Reported-by: Kristoffer Ericson <kristoffer.ericson@gmail.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2009-10-27 07:22:37 +09:00
Neil Brown 960cc0f4fe block: use after free bug in __blkdev_get
commit 0762b8bde9
(from 14 months ago) introduced a use-after-free bug which has just
recently started manifesting in my md testing.
I tried git bisect to find out what caused the bug to start
manifesting, and it could have been the recent change to
blk_unregister_queue (48c0d4d4c0) but the results were inconclusive.

This patch certainly fixes my symptoms and looks correct as the two
calls are now in the same order as elsewhere in that function.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-10-26 15:27:11 +01:00
Trond Myklebust 9a3936aac1 NFSv4: The link() operation should return any delegation on the file
Otherwise, we have to wait for the server to recall it.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-10-26 08:09:46 -04:00
Trond Myklebust 141aeb9f26 NFSv4: Fix two unbalanced put_rpccred() issues.
Commits 29fba38b (nfs41: lease renewal) and fc01cea9 (nfs41: sequence
operation) introduce a couple of put_rpccred() calls on credentials for
which there is no corresponding get_rpccred().

See http://bugzilla.kernel.org/show_bug.cgi?id=14249

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-10-26 08:09:46 -04:00
Trond Myklebust 52567b03ca NFSv4: Fix a bug when the server returns NFS4ERR_RESOURCE
RFC 3530 states that when we recieve the error NFS4ERR_RESOURCE, we are not
supposed to bump the sequence number on OPEN, LOCK, LOCKU, CLOSE, etc
operations. The problem is that we map that error into EREMOTEIO in the XDR
layer, and so the NFSv4 middle-layer routines like seqid_mutating_err(),
and nfs_increment_seqid() don't recognise it.

The fix is to defer the mapping until after the middle layers have
processed the error.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-10-23 14:46:42 -04:00
Terry Loftin a8b40bc7e6 nfs: Panic when commit fails
Actually pass the NFS_FILE_SYNC option to the server to avoid a
Panic in nfs_direct_write_complete() when a commit fails.

At the end of an nfs write, if the nfs commit fails, all the writes
will be rescheduled.  They are supposed to be rescheduled as NFS_FILE_SYNC
writes, but the rpc_task structure is not completely intialized and so
the option is not passed.  When the rescheduled writes complete, the
return indicates that they are NFS_UNSTABLE and we try to do another
commit.  This leads to a Panic because the commit data structure pointer
was set to null in the initial (failed) commit attempt.

Signed-off-by: Terry Loftin <terry.loftin@hp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-10-23 14:16:30 -04:00
Linus Torvalds d995053d04 Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
  dnotify: ignore FS_EVENT_ON_CHILD
  inotify: fix coalesce duplicate events into a single event in special case
  inotify: deprecate the inotify kernel interface
  fsnotify: do not set group for a mark before it is on the i_list
2009-10-22 08:28:28 +09:00
Yinghai Lu 4223a4a155 nfs: Fix nfs_parse_mount_options() kfree() leak
Fix a (small) memory leak in one of the error paths of the NFS mount
options parsing code.

Regression introduced in 2.6.30 by commit a67d18f (NFS: load the
rpc/rdma transport module automatically).

Reported-by: Yinghai Lu <yinghai@kernel.org>
Reported-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-22 08:15:23 +09:00
Earl Chew ad3960243e fs: pipe.c null pointer dereference
This patch fixes a null pointer exception in pipe_rdwr_open() which
generates the stack trace:

> Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP:
>  [<ffffffff802899a5>] pipe_rdwr_open+0x35/0x70
>  [<ffffffff8028125c>] __dentry_open+0x13c/0x230
>  [<ffffffff8028143d>] do_filp_open+0x2d/0x40
>  [<ffffffff802814aa>] do_sys_open+0x5a/0x100
>  [<ffffffff8021faf3>] sysenter_do_call+0x1b/0x67

The failure mode is triggered by an attempt to open an anonymous
pipe via /proc/pid/fd/* as exemplified by this script:

=============================================================
while : ; do
   { echo y ; sleep 1 ; } | { while read ; do echo z$REPLY; done ; } &
   PID=$!
   OUT=$(ps -efl | grep 'sleep 1' | grep -v grep |
        { read PID REST ; echo $PID; } )
   OUT="${OUT%% *}"
   DELAY=$((RANDOM * 1000 / 32768))
   usleep $((DELAY * 1000 + RANDOM % 1000 ))
   echo n > /proc/$OUT/fd/1                 # Trigger defect
done
=============================================================

Note that the failure window is quite small and I could only
reliably reproduce the defect by inserting a small delay
in pipe_rdwr_open(). For example:

 static int
 pipe_rdwr_open(struct inode *inode, struct file *filp)
 {
       msleep(100);
       mutex_lock(&inode->i_mutex);

Although the defect was observed in pipe_rdwr_open(), I think it
makes sense to replicate the change through all the pipe_*_open()
functions.

The core of the change is to verify that inode->i_pipe has not
been released before attempting to manipulate it. If inode->i_pipe
is no longer present, return ENOENT to indicate so.

The comment about potentially using atomic_t for i_pipe->readers
and i_pipe->writers has also been removed because it is no longer
relevant in this context. The inode->i_mutex lock must be used so
that inode->i_pipe can be dealt with correctly.

Signed-off-by: Earl Chew <earl_chew@agilent.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-22 08:11:44 +09:00
Andreas Gruenbacher 945526846a dnotify: ignore FS_EVENT_ON_CHILD
Mask off FS_EVENT_ON_CHILD in dnotify_handle_event().  Otherwise, when there
is more than one watch on a directory and dnotify_should_send_event()
succeeds, events with FS_EVENT_ON_CHILD set will trigger all watches and cause
spurious events.

This case was overlooked in commit e42e2773.

	#define _GNU_SOURCE

	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <signal.h>
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <string.h>

	static void create_event(int s, siginfo_t* si, void* p)
	{
		printf("create\n");
	}

	static void delete_event(int s, siginfo_t* si, void* p)
	{
		printf("delete\n");
	}

	int main (void) {
		struct sigaction action;
		char *tmpdir, *file;
		int fd1, fd2;

		sigemptyset (&action.sa_mask);
		action.sa_flags = SA_SIGINFO;

		action.sa_sigaction = create_event;
		sigaction (SIGRTMIN + 0, &action, NULL);

		action.sa_sigaction = delete_event;
		sigaction (SIGRTMIN + 1, &action, NULL);

	#	define TMPDIR "/tmp/test.XXXXXX"
		tmpdir = malloc(strlen(TMPDIR) + 1);
		strcpy(tmpdir, TMPDIR);
		mkdtemp(tmpdir);

	#	define TMPFILE "/file"
		file = malloc(strlen(tmpdir) + strlen(TMPFILE) + 1);
		sprintf(file, "%s/%s", tmpdir, TMPFILE);

		fd1 = open (tmpdir, O_RDONLY);
		fcntl(fd1, F_SETSIG, SIGRTMIN);
		fcntl(fd1, F_NOTIFY, DN_MULTISHOT | DN_CREATE);

		fd2 = open (tmpdir, O_RDONLY);
		fcntl(fd2, F_SETSIG, SIGRTMIN + 1);
		fcntl(fd2, F_NOTIFY, DN_MULTISHOT | DN_DELETE);

		if (fork()) {
			/* This triggers a create event */
			creat(file, 0600);
			/* This triggers a create and delete event (!) */
			unlink(file);
		} else {
			sleep(1);
			rmdir(tmpdir);
		}

		return 0;
	}

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2009-10-20 18:02:33 -04:00
Wei Yongjun 3de0ef4f20 inotify: fix coalesce duplicate events into a single event in special case
If we do rename a dir entry, like this:

  rename("/tmp/ino7UrgoJ.rename1", "/tmp/ino7UrgoJ.rename2")
  rename("/tmp/ino7UrgoJ.rename2", "/tmp/ino7UrgoJ")

The duplicate events should be coalesced into a single event. But those two
events do not be coalesced into a single event, due to some bad check in
event_compare(). It can not match the two NULL inodes as the same event.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2009-10-18 15:49:38 -04:00
Eric Paris 9f0d793b52 fsnotify: do not set group for a mark before it is on the i_list
fsnotify_add_mark is supposed to add a mark to the g_list and i_list and to
set the group and inode for the mark.  fsnotify_destroy_mark_by_entry uses
the fact that ->group != NULL to know if this group should be destroyed or
if it's already been done.

But fsnotify_add_mark sets the group and inode before it actually adds the
mark to the i_list and g_list.  This can result in a race in inotify, it
requires 3 threads.

sys_inotify_add_watch("file")	sys_inotify_add_watch("file")	sys_inotify_rm_watch([a])
inotify_update_watch()
inotify_new_watch()
inotify_add_to_idr()
   ^--- returns wd = [a]
				inotfiy_update_watch()
				inotify_new_watch()
				inotify_add_to_idr()
				fsnotify_add_mark()
				   ^--- returns wd = [b]
				returns to userspace;
								inotify_idr_find([a])
								   ^--- gives us the pointer from task 1
fsnotify_add_mark()
   ^--- this is going to set the mark->group and mark->inode fields, but will
return -EEXIST because of the race with [b].
								fsnotify_destroy_mark()
								   ^--- since ->group != NULL we call back
									into inotify_freeing_mark() which calls
								inotify_remove_from_idr([a])

since fsnotify_add_mark() failed we call:
inotify_remove_from_idr([a])     <------WHOOPS it's not in the idr, this could
					have been any entry added later!

The fix is to make sure we don't set mark->group until we are sure the mark is
on the inode and fsnotify_add_mark will return success.

Signed-off-by: Eric Paris <eparis@redhat.com>
2009-10-18 15:49:38 -04:00
Linus Torvalds 8e7768d3f4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
  dlm: fix socket fd translation
  dlm: fix lowcomms_connect_node for sctp
2009-10-15 15:21:20 -07:00
Linus Torvalds dcbeb0bec5 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: always pin metadata in discard mode
  Btrfs: enable discard support
  Btrfs: add -o discard option
  Btrfs: properly wait log writers during log sync
  Btrfs: fix possible ENOSPC problems with truncate
  Btrfs: fix btrfs acl #ifdef checks
  Btrfs: streamline tree-log btree block writeout
  Btrfs: avoid tree log commit when there are no changes
  Btrfs: only write one super copy during fsync
2009-10-15 15:06:37 -07:00
Neil Brown 83db93f4de sysfs: Allow sysfs_notify_dirent to be called from interrupt context.
sysfs_notify_dirent is a simple atomic operation that can be used to
alert user-space that new data can be read from a sysfs attribute.

Unfortunately it cannot currently be called from non-process context
because of its use of spin_lock which is sometimes taken with
interrupts enabled.

So change all lockers of sysfs_open_dirent_lock to disable interrupts,
thus making sysfs_notify_dirent safe to be called from non-process
context (as drivers/md does in md_safemode_timeout).

sysfs_get_open_dirent is (documented as being) only called from
process context, so it uses spin_lock_irq.  Other places
use spin_lock_irqsave.

The usage for sysfs_notify_dirent in md_safemode_timeout was
introduced in 2.6.28, so this patch is suitable for that and more
recent kernels.

Reported-by: Joel Andres Granados <jgranado@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-10-14 15:16:25 -07:00
Cornelia Huck a6a8357788 sysfs: Allow sysfs_move_dir(..., NULL) again.
As device_move() and kobject_move() both handle a NULL destination,
sysfs_move_dir() should do this as well (again) and fall back to
sysfs_root in that case.

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Phil Carmody <ext-phil.2.carmody@nokia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-10-14 15:16:25 -07:00
Chris Mason 444528b3e6 Btrfs: always pin metadata in discard mode
We have an optimization in btrfs to allow blocks to be
immediately freed if they were allocated in this transaction and never
written.  Otherwise they are pinned and freed when the transaction
commits.

This isn't optimal for discard mode because immediately freeing
them means immediately discarding them.  It is better to give the
block to the pinning code and letting the (slow) discard happen later.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-14 10:32:50 -04:00
Christoph Hellwig 0634857488 Btrfs: enable discard support
The discard support code in btrfs currently is guarded by ifdefs for
BIO_RW_DISCARD, which is never defines as it's the name of an enum
memeber.  Just remove the useless ifdefs to actually enable the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-14 10:32:49 -04:00
Christoph Hellwig e244a0aeb6 Btrfs: add -o discard option
Enable discard by default is not a good idea given the the trim speed
of SSD prototypes we've seen, and the carecteristics for many high-end
arrays.  Turn of discards by default and require the -o discard option
to enable them on.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-14 10:32:49 -04:00
Yan, Zheng 86df7eb921 Btrfs: properly wait log writers during log sync
A recently fsync optimization make btrfs_sync_log skip calling
wait_for_writer in the single log writer case. This is incorrect
since the writer count can also be increased by btrfs_pin_log.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-14 10:32:48 -04:00
Josef Bacik 5d5e103a70 Btrfs: fix possible ENOSPC problems with truncate
There's a problem where we don't do any space reservation for truncates, which
can cause you to OOPs because you will be allowed to go off in the weeds a bit
since we don't account for the delalloc bytes that are created as a result of
the truncate.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-14 10:32:47 -04:00
Alex Elder ba313e68fa Merge branch 'master' of ssh://oss.sgi.com/oss/git/xfs/xfs into for-linus 2009-10-13 15:47:22 -05:00
Christoph Hellwig 05277c75f6 xfs: fix double IRELE in xfs_dqrele_inode
xfs_dqrele_inode calls xfs_iput to release the ilock and a reference
and then also calls IRELE which does a second decrement of the reference
count.  This leads to a premature freeing of inodes when quotas were turned
off while the filesystem was mounted.

Thanks to Utako Kusaka for reporting the bug and provinding a good testcase.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Utako Kusaka <u-kusaka@wm.jp.nec.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-10-13 13:16:36 -05:00
Chris Mason 0eda294dfc Btrfs: fix btrfs acl #ifdef checks
The btrfs acl code was #ifdefing for a define
that didn't exist.  This correctly matches it
to the values used by the Kconfig file.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-13 13:51:39 -04:00
Chris Mason 690587d109 Btrfs: streamline tree-log btree block writeout
Syncing the tree log is a 3 phase operation.

1) write and wait for all the tree log blocks for a given root.

2) write and wait for all the tree log blocks for the
tree of tree log roots.

3) write and wait for the super blocks (barriers here)

This isn't as efficient as it could be because there is
no requirement to wait for the blocks from step one to hit the disk
before we start writing the blocks from step two.  This commit
changes the sequence so that we don't start waiting until
all the tree blocks from both steps one and two have been sent
to disk.

We do this by breaking up btrfs_write_wait_marked_extents into
two functions, which is trivial because it was already broken
up into two parts.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-13 13:35:12 -04:00
Chris Mason 257c62e1bc Btrfs: avoid tree log commit when there are no changes
rpm has a habit of running fdatasync when the file hasn't
changed.  We already detect if a file hasn't been changed
in the current transaction but it might have been sent to
the tree-log in this transaction and not changed since
the last call to fsync.

In this case, we want to avoid a tree log sync, which includes
a number of synchronous writes and barriers.  This commit
extends the existing tracking of the last transaction to change
a file to also track the last sub-transaction.

The end result is that rpm -ivh and -Uvh are roughly twice as fast,
and on par with ext3.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-13 13:35:12 -04:00
Chris Mason 4722607db6 Btrfs: only write one super copy during fsync
During a tree-log commit for fsync, we've been writing at least
two copies of the super block and forcing them to disk.

The other filesystems write only one, and this change brings us on
par with them.  A full transaction commit will write all the super
copies, so we still have redundant info written on a regular
basis.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-13 13:35:11 -04:00
Linus Torvalds 80f506918f Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  cciss: Add cciss_allow_hpsa module parameter
  cciss: Fix multiple calls to pci_release_regions
  blk-settings: fix function parameter kernel-doc notation
  writeback: kill space in debugfs item name
  writeback: account IO throttling wait as iowait
  elv_iosched_store(): fix strstrip() misuse
  cfq-iosched: avoid probable slice overrun when idling
  cfq-iosched: apply bool value where we return 0/1
  cfq-iosched: fix think time allowed for seekers
  cfq-iosched: fix the slice residual sign
  cfq-iosched: abstract out the 'may this cfqq dispatch' logic
  block: use proper BLK_RW_ASYNC in blk_queue_start_tag()
  block: Seperate read and write statistics of in_flight requests v2
  block: get rid of kblock_schedule_delayed_work()
  cfq-iosched: fix possible problem with jiffies wraparound
  cfq-iosched: fix issue with rq-rq merging and fifo list ordering
2009-10-13 10:21:33 -07:00
Theodore Ts'o 96ec2e0a71 ext3: Don't update superblock write time when filesystem is read-only
This avoids updating the superblock write time when we are mounting
the root file system read/only but we need to replay the journal; at
that point, for people who are east of GMT and who make their clock
tick in localtime for Windows bug-for-bug compatibility, and this will
cause e2fsck to complain and force a full file system check.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-10-13 00:06:43 +02:00