Commit graph

22820 commits

Author SHA1 Message Date
Linus Torvalds
bc9bc72e2f Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
  Squashfs: update email address
  Squashfs: add extra sanity checks at mount time
  Squashfs: add sanity checks to fragment reading at mount time
  Squashfs: add sanity checks to lookup table reading at mount time
  Squashfs: add sanity checks to id reading at mount time
  Squashfs: add sanity checks to xattr reading at mount time
  Squashfs: reverse order of filesystem table reading
  Squashfs: move table allocation into squashfs_read_table()
2011-05-26 17:27:35 -07:00
Timo Warns
3eb8e74ec7 fs/partitions/efi.c: corrupted GUID partition tables can cause kernel oops
The kernel automatically evaluates partition tables of storage devices.
The code for evaluating GUID partitions (in fs/partitions/efi.c) contains
a bug that causes a kernel oops on certain corrupted GUID partition
tables.

This bug has a security impact because it allows an attacker, for example,
to prepare a storage device that crashes a kernel subsystem when the device
is connected (e.g., a "USB Stick of (Partial) Death").

	crc = efi_crc32((const unsigned char *) (*gpt), le32_to_cpu((*gpt)->header_size));

computes a CRC32 checksum over gpt covering (*gpt)->header_size bytes.
There is no validation of (*gpt)->header_size before the efi_crc32 call.

A corrupted partition table may have large values for (*gpt)->header_size.
In this case, the CRC32 computation accesses memory beyond the memory
allocated for gpt, which may cause a kernel heap overflow.

Validate value of GUID partition table header size.
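
As a rough user-space illustration of the idea (not the kernel patch itself), the sketch below refuses to checksum a header whose claimed size exceeds the buffer that was actually read. The names crc32_bytes and checked_header_crc are hypothetical, and the CRC routine is a generic bitwise CRC-32 used only as a stand-in for efi_crc32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generic bitwise CRC-32 (reflected polynomial 0xEDB88320); illustration only. */
static uint32_t crc32_bytes(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/*
 * Hypothetical checker: compute the header checksum only if the size claimed
 * by the header fits inside the buffer that was actually read.
 * Returns 0 and stores the CRC on success, -1 if header_size is out of range.
 */
static int checked_header_crc(const unsigned char *buf, size_t buf_len,
                              uint32_t header_size, uint32_t *crc_out)
{
    if (header_size > buf_len)
        return -1;              /* reject before touching any memory */
    *crc_out = crc32_bytes(buf, header_size);
    return 0;
}
```

The point of the ordering is that the untrusted size is compared against a trusted bound before it is ever used as a length.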

[akpm@linux-foundation.org: fix layout and indenting]
Signed-off-by: Timo Warns <warns@pre-sense.de>
Cc: Matt Domsch <Matt_Domsch@dell.com>
Cc: Eugene Teo <eugeneteo@kernel.sg>
Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:37 -07:00
Olaf Hering
997c136f51 fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages
The balloon driver in a Xen guest frees guest pages and marks them as
mmio.  When the kernel crashes and the crash kernel attempts to read the
oldmem via /proc/vmcore a read from ballooned pages will generate 100%
load in dom0 because Xen asks qemu-dm for the page content.  Since the
reads come in as 8-byte requests, each ballooned page is tried 512 times.

With this change a hook can be registered which checks whether the given
pfn is really ram.  The hook has to return a value > 0 for ram pages, a
value < 0 on error (because the hypercall is not known) and 0 for non-ram
pages.
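
The hook contract described above can be sketched in user space roughly as follows. All names here (register_pfn_hook, read_old_page, slow_read_page, example_hook) are hypothetical, the 4096-byte page size is an assumption, and zero-filling non-ram pages and falling back to the real read on a hook error are assumed policies for the sketch, not statements about the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 4096u   /* assumed page size: 4096 / 8 = 512 reads per page */

/* Hook contract from the commit text: >0 = ram, 0 = non-ram, <0 = error. */
typedef int (*pfn_is_ram_fn)(unsigned long pfn);

static pfn_is_ram_fn pfn_is_ram_hook;   /* NULL means no hook registered */

static void register_pfn_hook(pfn_is_ram_fn fn)
{
    pfn_is_ram_hook = fn;
}

/* Stand-in for the expensive read of old memory; counts how often it runs. */
static unsigned long slow_reads;

static void slow_read_page(unsigned long pfn, unsigned char *dst)
{
    slow_reads++;
    memset(dst, (int)(pfn & 0xff), PAGE_BYTES);
}

/* Read one page, zero-filling pages the hook reports as non-ram. */
static void read_old_page(unsigned long pfn, unsigned char *dst)
{
    if (pfn_is_ram_hook && pfn_is_ram_hook(pfn) == 0) {
        memset(dst, 0, PAGE_BYTES);
        return;
    }
    /* No hook, ram page, or hook error: fall back to the real read. */
    slow_read_page(pfn, dst);
}

/* Example hook: pretend pfns 100..199 were ballooned out. */
static int example_hook(unsigned long pfn)
{
    return (pfn >= 100 && pfn < 200) ? 0 : 1;
}
```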

This will reduce the time to read /proc/vmcore.  Without this change a
512M guest with 128M crashkernel region needs 200 seconds to read it, with
this change it takes just 2 seconds.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:37 -07:00
KOSAKI Motohiro
98bc93e505 proc: fix pagemap_read() error case
Currently, pagemap_read() mishandles three error and/or corner cases.

 (1) If the ppos parameter is wrong, the mm refcount is leaked.
 (2) If the count parameter is 0, the mm refcount is leaked too.
 (3) If the current task is sleeping in kmalloc() while the system
     is out of memory and the oom-killer kills the task associated
     with the proc file, the mm refcount prevents the task from
     freeing its memory. The system may then hang.

<Quoting Hugh's explanation of why we should call kmalloc() before get_mm()>

  check_mem_permission gets a reference to the mm.  If we
  __get_free_page after check_mem_permission, imagine what happens if the
  system is out of memory, and the mm we're looking at is selected for
  killing by the OOM killer: while we wait in __get_free_page for more
  memory, no memory is freed from the selected mm because it cannot reach
  exit_mmap while we hold that reference.

This patch fixes the above three.
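
A minimal user-space sketch of the ordering argued for above: validate arguments and allocate first, take the reference last, and drop it on every later exit. The mock_mm type and the mm_get, mm_put, and mock_read helpers are hypothetical stand-ins, and the 8-byte alignment check is only an assumed example of a "wrong" position.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Mock of a reference-counted object standing in for the mm. */
struct mock_mm { int refs; };

static void mm_get(struct mock_mm *mm) { mm->refs++; }
static void mm_put(struct mock_mm *mm) { mm->refs--; }

/*
 * Hypothetical read routine. Argument checks and the allocation happen
 * before the reference is taken, so early exits have nothing to undo.
 */
static long mock_read(struct mock_mm *mm, long pos, size_t count)
{
    unsigned char *buf;
    long ret;

    if (pos < 0 || pos % 8 != 0)
        return -EINVAL;           /* no reference held yet */
    if (count == 0)
        return 0;                 /* no reference held yet */

    buf = malloc(count);          /* may block or fail; still no reference */
    if (!buf)
        return -ENOMEM;

    mm_get(mm);
    ret = (long)count;            /* placeholder for the real work */
    mm_put(mm);

    free(buf);
    return ret;
}
```

With this ordering, a task blocked in the allocator never pins the object that the OOM killer may be trying to tear down.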

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jovi Zhang <bookjovi@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Stephen Wilson <wilsons@start.ca>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:37 -07:00
KOSAKI Motohiro
30cd890391 proc: put check_mem_permission after __get_free_page in mem_write
It would be better to put check_mem_permission after __get_free_page in
mem_write, to match the function mem_read.

Hugh Dickins explained the reason.

    check_mem_permission gets a reference to the mm.  If we __get_free_page
    after check_mem_permission, imagine what happens if the system is out
    of memory, and the mm we're looking at is selected for killing by the
    OOM killer: while we wait in __get_free_page for more memory, no memory
    is freed from the selected mm because it cannot reach exit_mmap while
    we hold that reference.

Reported-by: Jovi Zhang <bookjovi@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Stephen Wilson <wilsons@start.ca>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:37 -07:00
Yuanhan Liu
a4dbf0ec2a proc/stat: use defined macro KMALLOC_MAX_SIZE
There is a macro for the max size kmalloc can allocate, so use it instead
of a hardcoded number.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:37 -07:00
Mike Frysinger
e130aa70f4 proc: constify status array
No need for this local array to be writable, so mark it const.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:36 -07:00
Alexey Dobriyan
0a8cb8e341 fs/proc: convert to kstrtoX()
Convert fs/proc/ from strict_strto*() to kstrto*() functions.
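
For readers unfamiliar with strict whole-string parsing, here is a user-space sketch of the general style: the entire input must be a number, optionally followed by one newline, and the result is reported through an error code. The helper parse_decimal_strict is hypothetical and built on strtoul; it is an analogy under those assumptions, not the kernel implementation of kstrto*().

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse s as a decimal unsigned long.
 * Returns 0 on success, -EINVAL on malformed input, -ERANGE on overflow.
 */
static int parse_decimal_strict(const char *s, unsigned long *res)
{
    char *end;
    unsigned long val;

    /* Reject empty strings, leading whitespace, and signs up front. */
    if (!isdigit((unsigned char)s[0]))
        return -EINVAL;

    errno = 0;
    val = strtoul(s, &end, 10);
    if (errno == ERANGE)
        return -ERANGE;

    if (*end == '\n')             /* tolerate one trailing newline */
        end++;
    if (*end != '\0')             /* anything else is trailing garbage */
        return -EINVAL;

    *res = val;
    return 0;
}
```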

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:36 -07:00
Jiri Slaby
57cc083ad9 coredump: add support for exe_file in core name
Now, exe_file is not proc FS dependent, so we can use it to name the core
file.  So we add a %E pattern for core file name creation which extracts
the path from mm_struct->exe_file.  It then converts slashes to exclamation
marks and pastes the result into the core file name itself.

This is useful for environments where binary names are longer than 16
characters (the current->comm limitation), where there are binaries
with the same name but in different paths, and where the binary itself
changes its current->comm after exec.

So by doing (s/$/#/ -- # is treated as git comment):

  $ sysctl kernel.core_pattern='core.%p.%e.%E'
  $ ln /bin/cat cat45678901234567890
  $ ./cat45678901234567890
  ^Z
  $ rm cat45678901234567890
  $ fg
  ^\Quit (core dumped)
  $ ls core*

we now get:

  core.2434.cat456789012345.!root!cat45678901234567890 (deleted)
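
The slash-to-exclamation-mark conversion seen in the output above can be sketched in user space as follows. The helper path_to_core_component is hypothetical and only illustrates the character substitution, not how the kernel obtains or formats the path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy path into out, replacing each '/' with '!'.
 * Returns 0 on success, -1 if out is too small to hold the result.
 */
static int path_to_core_component(const char *path, char *out, size_t out_len)
{
    size_t len = strlen(path);

    if (len + 1 > out_len)
        return -1;
    for (size_t i = 0; i < len; i++)
        out[i] = (path[i] == '/') ? '!' : path[i];
    out[len] = '\0';
    return 0;
}
```

Replacing the separator keeps the whole path inside a single file name component, so the core file still lands in one directory.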

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:36 -07:00
Jiri Slaby
3864601387 mm: extract exe_file handling from procfs
Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
This was because exe_file was needed only for /proc/<pid>/exe.  Since we
will also need the exe_file functionality for core dumps (so the core name
can contain the full binary path), always build this functionality into
the kernel.

To achieve that, move it out of proc FS to kernel/, where it in fact
belongs.  By doing that we can make dup_mm_exe_file static.  Also we
can drop linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:36 -07:00
Ying Han
456f998ec8 memcg: add the pagefault count into memcg stats
Add two new stats to the per-memcg memory.stat which track the number of
page faults and the number of major page faults.

  "pgfault"
  "pgmajfault"

They differ from the "pgpgin"/"pgpgout" stats, which count the number of
pages charged to/uncharged from the cgroup and say nothing about pages
being read from or written to disk.

Tracking the two stats is valuable for measuring both application
performance and the efficiency of the kernel page reclaim path.
Counting pagefaults per process is useful, but we also need the aggregated
value since processes are monitored and controlled on a per-cgroup basis
in memcg.

Functional test: check the total number of pgfault/pgmajfault of all
memcgs and compare with global vmstat value:

  $ cat /proc/vmstat | grep fault
  pgfault 1070751
  pgmajfault 553

  $ cat /dev/cgroup/memory.stat | grep fault
  pgfault 1071138
  pgmajfault 553
  total_pgfault 1071142
  total_pgmajfault 553

  $ cat /dev/cgroup/A/memory.stat | grep fault
  pgfault 199
  pgmajfault 0
  total_pgfault 199
  total_pgmajfault 0

Performance test: run the page fault test (pft) with 16 threads, faulting
in 15G of anon pages in a 16G container.  No regression was noticed in
"flt/cpu/s".

Sample output from pft:

  TAG pft:anon-sys-default:
    Gb  Thr CLine   User     System     Wall    flt/cpu/s fault/wsec
    15   16   1     0.67s   233.41s    14.76s   16798.546 266356.260

  +-------------------------------------------------------------------------+
      N           Min           Max        Median           Avg        Stddev
  x  10     16682.962     17344.027     16913.524     16928.812      166.5362
  +  10     16695.568     16923.896     16820.604     16824.652     84.816568
  No difference proven at 95.0% confidence

[akpm@linux-foundation.org: fix build]
[hughd@google.com: shmem fix]
Signed-off-by: Ying Han <yinghan@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:36 -07:00
Dan Carpenter
1d5827235d ufs: fix truncated values handling 64 bit metadata
Originally i_lastfrag was 32 bits but then we added support for handling
64 bit metadata and it became a 64 bit variable.  That was during 2007, in
54fb996ac1 "[PATCH] ufs2 write: block allocation update".  Unfortunately
these casts got left behind so the value got truncated to 32 bit again.
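
The effect of a leftover narrowing cast can be shown with a small stand-alone sketch. The functions min_with_cast and min_full_width are hypothetical and only illustrate the general pattern of comparing 64-bit values through a 32-bit type; they are not the ufs code.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy pattern: leftover 32-bit casts discard the upper half of each value. */
static uint64_t min_with_cast(uint64_t a, uint64_t b)
{
    uint32_t a32 = (uint32_t)a;
    uint32_t b32 = (uint32_t)b;

    return a32 < b32 ? a32 : b32;
}

/* Fixed pattern: compare at the full width of the variables. */
static uint64_t min_full_width(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}
```

For a value just above 2^32, the cast version compares only the low 32 bits and so picks the wrong operand.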

[akpm@linux-foundation.org: remove now-unneeded min_t/max_t casting]
Signed-off-by: Dan Carpenter <error27@gmail.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:33 -07:00
Linus Torvalds
b7c2f03628 Merge branch 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6
* 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6:
  gfs2: Drop __TIME__ usage
  isdn/diva: Drop __TIME__ usage
  atm: Drop __TIME__ usage
  dlm: Drop __TIME__ usage
  wan/pc300: Drop __TIME__ usage
  parport: Drop __TIME__ usage
  hdlcdrv: Drop __TIME__ usage
  baycom: Drop __TIME__ usage
  pmcraid: Drop __DATE__ usage
  edac: Drop __DATE__ usage
  rio: Drop __DATE__ usage
  scsi/wd33c93: Drop __TIME__ usage
  scsi/in2000: Drop __TIME__ usage
  aacraid: Drop __TIME__ usage
  media/cx231xx: Drop __TIME__ usage
  media/radio-maxiradio: Drop __TIME__ usage
  nozomi: Drop __TIME__ usage
  cyclades: Drop __TIME__ usage
2011-05-26 13:19:00 -07:00
Linus Torvalds
a74b81b0af Merge branch 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (28 commits)
  Ocfs2: Teach local-mounted ocfs2 to handle unwritten_extents correctly.
  ocfs2/dlm: Do not migrate resource to a node that is leaving the domain
  ocfs2/dlm: Add new dlm message DLM_BEGIN_EXIT_DOMAIN_MSG
  Ocfs2/move_extents: Set several trivial constraints for threshold.
  Ocfs2/move_extents: Let defrag handle partial extent moving.
  Ocfs2/move_extents: move/defrag extents within a certain range.
  Ocfs2/move_extents: helper to calculate the defraging length in one run.
  Ocfs2/move_extents: move entire/partial extent.
  Ocfs2/move_extents: helpers to update the group descriptor and global bitmap inode.
  Ocfs2/move_extents: helper to probe a proper region to move in an alloc group.
  Ocfs2/move_extents: helper to validate and adjust moving goal.
  Ocfs2/move_extents: find the victim alloc group, where the given #blk fits.
  Ocfs2/move_extents: defrag a range of extent.
  Ocfs2/move_extents: move a range of extent.
  Ocfs2/move_extents: lock allocators and reserve metadata blocks and data clusters for extents moving.
  Ocfs2/move_extents: Add basic framework and source files for extent moving.
  Ocfs2/move_extents: Adding new ioctl code 'OCFS2_IOC_MOVE_EXT' to ocfs2.
  Ocfs2/refcounttree: Publicize couple of funcs from refcounttree.c
  Ocfs2: Add a new code 'OCFS2_INFO_FREEFRAG' for o2info ioctl.
  Ocfs2: Add a new code 'OCFS2_INFO_FREEINODE' for o2info ioctl.
  ...
2011-05-26 10:55:15 -07:00
Linus Torvalds
f8d613e2a6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem:
  xen: cleancache shim to Xen Transcendent Memory
  ocfs2: add cleancache support
  ext4: add cleancache support
  btrfs: add cleancache support
  ext3: add cleancache support
  mm/fs: add hooks to support cleancache
  mm: cleancache core ops functions and config
  fs: add field to superblock to support cleancache
  mm/fs: cleancache documentation

Fix up trivial conflict in fs/btrfs/extent_io.c due to includes
2011-05-26 10:50:56 -07:00
Linus Torvalds
8a0599dd24 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: correctly decrement the extent buffer index in xfs_bmap_del_extent
  xfs: check for valid indices in xfs_iext_get_ext and xfs_iext_idx_to_irec
  xfs: fix up asserts in xfs_iflush_fork
  xfs: do not do pointer arithmetic on extent records
  xfs: do not use unchecked extent indices in xfs_bunmapi
  xfs: do not use unchecked extent indices in xfs_bmapi
  xfs: do not use unchecked extent indices in xfs_bmap_add_extent_*
  xfs: remove if_lastex
  xfs: remove the unused XFS_BMAPI_RSVBLOCKS flag
  xfs: do not discard alloc btree blocks
  xfs: add online discard support
2011-05-26 10:49:11 -07:00
Linus Torvalds
35806b4f7c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (61 commits)
  jbd2: Add MAINTAINERS entry
  jbd2: fix a potential leak of a journal_head on an error path
  ext4: teach ext4_ext_split to calculate extents efficiently
  ext4: Convert ext4 to new truncate calling convention
  ext4: do not normalize block requests from fallocate()
  ext4: enable "punch hole" functionality
  ext4: add "punch hole" flag to ext4_map_blocks()
  ext4: punch out extents
  ext4: add new function ext4_block_zero_page_range()
  ext4: add flag to ext4_has_free_blocks
  ext4: reserve inodes and feature code for 'quota' feature
  ext4: add support for multiple mount protection
  ext4: ensure f_bfree returned by ext4_statfs() is non-negative
  ext4: protect bb_first_free in ext4_trim_all_free() with group lock
  ext4: only load buddy bitmap in ext4_trim_fs() when it is needed
  jbd2: Fix comment to match the code in jbd2__journal_start()
  ext4: fix waiting and sending of a barrier in ext4_sync_file()
  jbd2: Add function jbd2_trans_will_send_data_barrier()
  jbd2: fix sending of data flush on journal commit
  ext4: fix ext4_ext_fiemap_cb() to handle blocks before request range correctly
  ...
2011-05-26 09:53:20 -07:00
Linus Torvalds
32e51f141f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (25 commits)
  cifs: remove unnecessary dentry_unhash on rmdir/rename_dir
  ocfs2: remove unnecessary dentry_unhash on rmdir/rename_dir
  exofs: remove unnecessary dentry_unhash on rmdir/rename_dir
  nfs: remove unnecessary dentry_unhash on rmdir/rename_dir
  ext2: remove unnecessary dentry_unhash on rmdir/rename_dir
  ext3: remove unnecessary dentry_unhash on rmdir/rename_dir
  ext4: remove unnecessary dentry_unhash on rmdir/rename_dir
  btrfs: remove unnecessary dentry_unhash in rmdir/rename_dir
  ceph: remove unnecessary dentry_unhash calls
  vfs: clean up vfs_rename_other
  vfs: clean up vfs_rename_dir
  vfs: clean up vfs_rmdir
  vfs: fix vfs_rename_dir for FS_RENAME_DOES_D_MOVE filesystems
  libfs: drop unneeded dentry_unhash
  vfs: update dentry_unhash() comment
  vfs: push dentry_unhash on rename_dir into file systems
  vfs: push dentry_unhash on rmdir into file systems
  vfs: remove dget() from dentry_unhash()
  vfs: dentry_unhash immediately prior to rmdir
  vfs: Block mmapped writes while the fs is frozen
  ...
2011-05-26 09:52:14 -07:00
KOSAKI Motohiro
ca16d140af mm: don't access vm_flags as 'int'
The type of vma->vm_flags is 'unsigned long', neither 'int' nor
'unsigned int'. This patch fixes such misuse.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
[ Changed to use a typedef - we'll extend it to cover more cases
  later, since there has been discussion about making it a 64-bit
  type..                      - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 09:20:31 -07:00
Dan Magenheimer
1cfd8bd0f9 ocfs2: add cleancache support
This eighth patch of eight in this cleancache series "opts-in"
cleancache for ocfs2.  Clustered filesystems must explicitly enable
cleancache by calling cleancache_init_shared_fs anytime an instance
of the filesystem is mounted.  Ocfs2 is currently the only user of
the clustered filesystem interface but nevertheless, the cleancache
hooks in the VFS layer are sufficient for ocfs2 including the matching
cleancache_flush_fs hook which must be called on unmount.

Details and a FAQ can be found in Documentation/vm/cleancache.txt

[v8: trivial merge conflict update]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Ted Tso <tytso@mit.edu>
Cc: Nitin Gupta <ngupta@vflare.org>
2011-05-26 10:02:08 -06:00
Dan Magenheimer
7abc52c2ed ext4: add cleancache support
This seventh patch of eight in this cleancache series "opts-in"
cleancache for ext4.  Filesystems must explicitly enable cleancache
by calling cleancache_init_fs anytime an instance of the filesystem
is mounted. For ext4, all other cleancache hooks are in
the VFS layer including the matching cleancache_flush_fs
hook which must be called on unmount.

Details and a FAQ can be found in Documentation/vm/cleancache.txt

[v6-v8: no changes]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
2011-05-26 10:02:03 -06:00
Dan Magenheimer
90a887c9a2 btrfs: add cleancache support
This sixth patch of eight in this cleancache series "opts-in"
cleancache for btrfs.  Filesystems must explicitly enable
cleancache by calling cleancache_init_fs anytime an instance
of the filesystem is mounted.  Btrfs uses its own readpage
which must be hooked, but all other cleancache hooks are in
the VFS layer including the matching cleancache_flush_fs hook
which must be called on unmount.

Details and a FAQ can be found in Documentation/vm/cleancache.txt

[v6-v8: no changes]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
2011-05-26 10:01:56 -06:00
Dan Magenheimer
d71bc6db5e ext3: add cleancache support
This fifth patch of eight in this cleancache series "opts-in"
cleancache for ext3.  Filesystems must explicitly enable
cleancache by calling cleancache_init_fs anytime an instance
of the filesystem is mounted. For ext3, all other cleancache
hooks are in the VFS layer including the matching cleancache_flush_fs
hook which must be called on unmount.

Details and a FAQ can be found in Documentation/vm/cleancache.txt

[v6-v8: no changes]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
2011-05-26 10:01:49 -06:00
Dan Magenheimer
c515e1fd36 mm/fs: add hooks to support cleancache
This fourth patch of eight in this cleancache series provides the
core hooks in VFS for: initializing cleancache per filesystem;
capturing clean pages reclaimed by page cache; attempting to get
pages from cleancache before filesystem read; and ensuring coherency
between pagecache, disk, and cleancache.  Note that the placement
of these hooks was stable from 2.6.18 to 2.6.38; a minor semantic
change was required due to a patchset in 2.6.39.

All hooks become no-ops if CONFIG_CLEANCACHE is unset, or become
a check of a boolean global if CONFIG_CLEANCACHE is set but no
cleancache "backend" has claimed cleancache_ops.
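
The two cases described above (hooks compiled away, or hooks reduced to a test of a global boolean) follow a common C pattern that can be sketched as below. The names hook_put_page, backend_put_page, cache_backend_registered, and backend_calls are hypothetical; only the CONFIG_CLEANCACHE symbol comes from the text, and here it is defined by hand for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define CONFIG_CLEANCACHE 1   /* remove this line to compile the hook away */

static int backend_calls;     /* counts calls that reach the stand-in backend */

#ifdef CONFIG_CLEANCACHE
static bool cache_backend_registered;   /* global checked on every hook call */

static void backend_put_page(unsigned long index)
{
    (void)index;
    backend_calls++;
}

static inline void hook_put_page(unsigned long index)
{
    /* Cheap boolean test while no backend has claimed the ops. */
    if (cache_backend_registered)
        backend_put_page(index);
}
#else
/* Feature disabled at build time: the hook is an empty inline function. */
static inline void hook_put_page(unsigned long index)
{
    (void)index;
}
#endif
```

Call sites stay identical in both configurations, which is what lets the hooks sit in common code paths at little cost.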

Details and a FAQ can be found in Documentation/vm/cleancache.txt

[v8: minchan.kim@gmail.com: adapt to new remove_from_page_cache function]
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
2011-05-26 10:01:43 -06:00
Sage Weil
b6ff24a333 cifs: remove unnecessary dentry_unhash on rmdir/rename_dir
Cifs has no problems with lingering references to unlinked directory
inodes.

CC: Steve French <sfrench@samba.org>
CC: linux-cifs@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:59 -04:00
Sage Weil
7ca5736388 ocfs2: remove unnecessary dentry_unhash on rmdir/rename_dir
Ocfs2 has no issues with lingering references to unlinked directory inodes.

CC: Mark Fasheh <mfasheh@suse.com>
CC: ocfs2-devel@oss.oracle.com
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:58 -04:00
Sage Weil
8cbfa53b1c exofs: remove unnecessary dentry_unhash on rmdir/rename_dir
Exofs has no problems with lingering references to unlinked directory
inodes.

CC: Benny Halevy <bhalevy@panasas.com>
CC: osd-dev@open-osd.org
Acked-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:57 -04:00
Sage Weil
052e2a1ba2 nfs: remove unnecessary dentry_unhash on rmdir/rename_dir
NFS has no problems with lingering references to unlinked directory
inodes.

CC: Trond Myklebust <Trond.Myklebust@netapp.com>
CC: linux-nfs@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:57 -04:00
Sage Weil
5afcb940fa ext2: remove unnecessary dentry_unhash on rmdir/rename_dir
ext2 has no problems with lingering references to unlinked directory
inodes.

CC: Jan Kara <jack@suse.cz>
CC: linux-ext4@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:56 -04:00
Sage Weil
5a61a245f7 ext3: remove unnecessary dentry_unhash on rmdir/rename_dir
ext3 has no problems with lingering references to unlinked directory
inodes.

CC: Jan Kara <jack@suse.cz>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Andreas Dilger <adilger.kernel@dilger.ca>
CC: linux-ext4@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:55 -04:00
Sage Weil
40ebc0af58 ext4: remove unnecessary dentry_unhash on rmdir/rename_dir
ext4 has no problems with lingering references to unlinked directory
inodes.

CC: "Theodore Ts'o" <tytso@mit.edu>
CC: Andreas Dilger <adilger.kernel@dilger.ca>
CC: linux-ext4@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:55 -04:00
Sage Weil
f64f58f854 btrfs: remove unnecessary dentry_unhash in rmdir/rename_dir
Btrfs has no problems with lingering references to unlinked directory
inodes.

CC: Chris Mason <chris.mason@oracle.com>
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:54 -04:00
Sage Weil
051e8f0ee2 ceph: remove unnecessary dentry_unhash calls
Ceph does not need these, and they screw up our use of the dcache as a
consistent cache.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:53 -04:00
Sage Weil
51892bbb57 vfs: clean up vfs_rename_other
Simplify control flow to match vfs_rename_dir.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:53 -04:00
Sage Weil
9055cba711 vfs: clean up vfs_rename_dir
Simplify control flow through vfs_rename_dir.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:52 -04:00
Sage Weil
912dbc15d9 vfs: clean up vfs_rmdir
Simplify the control flow with an out label.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:51 -04:00
Miklos Szeredi
b5afd2c406 vfs: fix vfs_rename_dir for FS_RENAME_DOES_D_MOVE filesystems
vfs_rename_dir() doesn't properly account for filesystems with
FS_RENAME_DOES_D_MOVE.  If new_dentry has a target inode attached, it
unhashes the new_dentry prior to the rename() iop and rehashes it after,
but doesn't account for the possibility that rename() may have swapped
{old,new}_dentry.  For FS_RENAME_DOES_D_MOVE filesystems, it rehashes
new_dentry (now the old renamed-from name, which d_move() expected to go
away), such that a subsequent lookup will find it.  Currently all
FS_RENAME_DOES_D_MOVE filesystems compensate for this by failing in
d_revalidate.

The bug was introduced by: commit 349457ccf2
"[PATCH] Allow file systems to manually d_move() inside of ->rename()"

Fix by not rehashing the new dentry.  Rehashing used to be needed by
d_move() but isn't anymore.

Reported-by: Sage Weil <sage@newdream.net>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:50 -04:00
Sage Weil
5c5d3f3b87 libfs: drop unneeded dentry_unhash
There are no libfs issues with dangling references to empty directories.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:50 -04:00
Sage Weil
a71905f0db vfs: update dentry_unhash() comment
The helper is now only called by file systems, not the VFS.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:49 -04:00
Sage Weil
e4eaac06bc vfs: push dentry_unhash on rename_dir into file systems
Only a few file systems need this.  Start by pushing it down into each
rename method (except gfs2 and xfs) so that it can be dealt with on a
per-fs basis.

Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:48 -04:00
Sage Weil
79bf7c732b vfs: push dentry_unhash on rmdir into file systems
Only a few file systems need this.  Start by pushing it down into each
fs rmdir method (except gfs2 and xfs) so it can be dealt with on a per-fs
basis.

This does not change behavior for any in-tree file systems.

Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:47 -04:00
Sage Weil
64252c75a2 vfs: remove dget() from dentry_unhash()
This serves no useful purpose that I can discern.  All callers (rename,
rmdir) hold their own reference to the dentry.

A quick audit of all file systems showed no relevant checks on the value
of d_count in vfs_rmdir/vfs_rename_dir paths.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:46 -04:00
Sage Weil
48293699a0 vfs: dentry_unhash immediately prior to rmdir
This presumes that there is no reason to unhash a dentry if we fail because
it is a mountpoint or the LSM check fails, and that the LSM checks do not
depend on the dentry being unhashed.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:46 -04:00
Jan Kara
ea13a86463 vfs: Block mmapped writes while the fs is frozen
We should not allow file modification via mmap while the filesystem is
frozen. So block in block_page_mkwrite() while the filesystem is frozen.
We cannot do the blocking wait in __block_page_mkwrite() since e.g. ext4
will want to call that function with a transaction started in some cases,
and that would deadlock. But we can at least do the non-blocking reliable
check in __block_page_mkwrite(), which is the hardest part anyway.

We have to check for a frozen filesystem with the page marked dirty and under
the page lock, which we still hold when returning from ->page_mkwrite(). Only
that way can we avoid racing with writeback done by the freezing code: either
we mark the page dirty after the writeback has started, see freezing in
progress and block, or writeback waits for our page lock, which is released
only when the fault is done, and then writes out and write-protects the page
again.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:45 -04:00
Jan Kara
24da4fab5a vfs: Create __block_page_mkwrite() helper passing error values back
Create a __block_page_mkwrite() helper which does everything
block_page_mkwrite() does, except that it passes back errors from the
__block_write_begin() / block_commit_write() calls.
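
The split described above can be sketched in userspace C. This is only an illustration of the pattern (an inner helper that hands the raw error back, and an outer wrapper that collapses it into a coarse fault result); it is not the kernel code. The names inner_page_mkwrite(), outer_page_mkwrite() and the fault_result enum are hypothetical, and the particular errno-to-result mapping is an assumption chosen for the example.

```c
#include <errno.h>

/* Hypothetical stand-ins for the coarse result codes a fault handler returns. */
enum fault_result { FAULT_LOCKED, FAULT_NOPAGE, FAULT_OOM, FAULT_SIGBUS };

/*
 * Inner helper: returns 0 on success or a negative errno, unchanged.
 * `simulated_err` stands in for whatever the lower-level calls returned.
 */
int inner_page_mkwrite(int simulated_err)
{
	return simulated_err;
}

/*
 * Outer wrapper: keeps the old interface by translating the errno into a
 * coarse result.  The mapping below is an assumption for illustration.
 */
enum fault_result outer_page_mkwrite(int simulated_err)
{
	int err = inner_page_mkwrite(simulated_err);

	switch (err) {
	case 0:
		return FAULT_LOCKED;
	case -ENOMEM:
		return FAULT_OOM;
	case -EAGAIN:
		return FAULT_NOPAGE;
	default:
		return FAULT_SIGBUS;
	}
}
```

A caller that needs the precise error (for example to retry on a specific errno) would call the inner helper directly; everyone else keeps using the wrapper.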

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:44 -04:00
Roman Borisov
7c6e984dfc fs/namespace.c: bound mount propagation fix
This issue was discovered by users of busybox.  The bug is real for
busybox users; I don't know how it affects others.  Apparently, mount is
called both with and without MS_SILENT, and this affects mount() behaviour,
although MS_SILENT is only supposed to affect kernel logging verbosity.

The following script was run in an empty test directory:

mkdir -p mount.dir mount.shared1 mount.shared2
touch mount.dir/a mount.dir/b
mount -vv --bind         mount.shared1 mount.shared1
mount -vv --make-rshared mount.shared1
mount -vv --bind         mount.shared2 mount.shared2
mount -vv --make-rshared mount.shared2
mount -vv --bind mount.shared2 mount.shared1
mount -vv --bind mount.dir     mount.shared2
ls -R mount.dir mount.shared1 mount.shared2
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
rm -f mount.dir/a mount.dir/b mount.dir/c
rmdir mount.dir mount.shared1 mount.shared2

mount -vv was used to show the mount() call arguments and result.
The output shows that the flags argument has the MS_SILENT bit (0x00008000) set:

mount: mount('mount.shared1','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared1','',0x0010c000,''):0
mount: mount('mount.shared2','mount.shared2','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared2','',0x0010c000,''):0
mount: mount('mount.shared2','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('mount.dir','mount.shared2','(null)',0x00009000,'(null)'):0
mount.dir:
a
b

mount.shared1:

mount.shared2:
a
b

After adding --loud option to remove MS_SILENT bit from just one mount cmd:

mkdir -p mount.dir mount.shared1 mount.shared2
touch mount.dir/a mount.dir/b
mount -vv --bind         mount.shared1 mount.shared1 2>&1
mount -vv --make-rshared mount.shared1               2>&1
mount -vv --bind         mount.shared2 mount.shared2 2>&1
mount -vv --loud --make-rshared mount.shared2               2>&1  # <-HERE
mount -vv --bind mount.shared2 mount.shared1         2>&1
mount -vv --bind mount.dir     mount.shared2         2>&1
ls -R mount.dir mount.shared1 mount.shared2      2>&1
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
rm -f mount.dir/a mount.dir/b mount.dir/c
rmdir mount.dir mount.shared1 mount.shared2

The result is different now: look closely at the mount.shared1 directory
listing.  It now shows files 'a' and 'b':

mount: mount('mount.shared1','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared1','',0x0010c000,''):0
mount: mount('mount.shared2','mount.shared2','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared2','',0x00104000,''):0
mount: mount('mount.shared2','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('mount.dir','mount.shared2','(null)',0x00009000,'(null)'):0

mount.dir:
a
b

mount.shared1:
a
b

mount.shared2:
a
b

Analysis shows that the MS_SILENT flag, which is on by default in every
busybox mount operation, reaches flags_to_propagation_type() and makes it
return an error at the is_power_of_2() check, because the function expects
exactly one bit to be set.  As a result, busybox mount cannot be used with
any of the --make-[r]shared, --make-[r]private etc. options.

Moreover, the recently added flags_to_propagation_type() function doesn't
allow operations such as --make-[r]private, --make-[r]shared etc. when
MS_SILENT is on.  The idea of clearing the MS_SILENT flag came from
Denys Vlasenko.
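
The single-bit check described above can be reproduced with a small userspace sketch. It is a simplified model, not the kernel function: propagation_type() and its two wrappers are hypothetical names, only the power-of-two test is modelled, and the real code may perform further validation. The flag values are the ones used by mount(2).

```c
#include <stdbool.h>

/* mount(2) flag values. */
#define MS_REC     0x4000u
#define MS_SILENT  0x8000u
#define MS_SHARED  (1u << 20)

/* True when exactly one bit is set in n. */
bool is_power_of_2(unsigned int n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical model: strip the `ignored` bits, then require that exactly
 * one bit remains.  Returns the remaining bit, or 0 if the flags are invalid.
 */
unsigned int propagation_type(unsigned int flags, unsigned int ignored)
{
	unsigned int type = flags & ~ignored;

	if (!is_power_of_2(type))
		return 0;
	return type;
}

/* Behaviour as described in the report: only MS_REC is stripped. */
unsigned int type_before_fix(unsigned int flags)
{
	return propagation_type(flags, MS_REC);
}

/* Behaviour with MS_SILENT cleared as well. */
unsigned int type_after_fix(unsigned int flags)
{
	return propagation_type(flags, MS_REC | MS_SILENT);
}
```

With the flags from the logs above, 0x0010c000 (MS_SHARED | MS_REC | MS_SILENT) leaves two bits set once MS_REC is stripped, so type_before_fix() returns 0, while type_after_fix() returns MS_SHARED.  The --loud value 0x00104000 yields MS_SHARED in both cases.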

Signed-off-by: Roman Borisov <ext-roman.borisov@nokia.com>
Reported-by: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Chuck Ebbert <cebbert@redhat.com>
Cc: Alexander Shishkin <virtuoso@slind.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:44 -04:00
Jonas Gorski
79fead47c5 exportfs: reallow building as a module
Commit 990d6c2d7a ("vfs: Add name to file handle conversion support")
changed EXPORTFS to be a bool.  This was needed for earlier revisions of
the original patch, but the actual commit put the code needing it into its
own file, which is only compiled when FHANDLE is selected, which in turn
selects EXPORTFS.  So EXPORTFS can safely be built as a module when FHANDLE
is not selected.

Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:43 -04:00
Al Viro
9f1fafee9e merge handle_reval_dot and nameidata_drop_rcu_last
new helper: complete_walk().  Done on successful completion
of walk, drops out of RCU mode, does d_revalidate of final
result if that hadn't been done already.

handle_reval_dot() and nameidata_drop_rcu_last() subsumed into
that one; callers converted to use of complete_walk().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:32 -04:00
Al Viro
19660af736 consolidate nameidata_..._drop_rcu()
Merge these into a single function (unlazy_walk(nd, dentry)),
kill ..._maybe variants

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:02 -04:00
Phillip Lougher
d7f2ff6718 Squashfs: update email address
My existing email address may stop working in a month or two, so update
email to one that will continue working.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-05-26 10:49:11 +01:00