Commit graph

220 commits

Author SHA1 Message Date
Christoph Hellwig
0e51a8e191 xfs: optimize bio handling in the buffer writeback path
This patch implements two closely related changes:  First it embeds
a bio the ioend structure so that we don't have to allocate one
separately.  Second it uses the block layer bio chaining mechanism
to chain additional bios off this first one if needed instead of
manually accounting for multiple bio completions in the ioend
structure.  Together this removes a memory allocation per ioend and
greatly simplifies the ioend setup and I/O completion path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 08:34:30 +10:00
Eryu Guan
ce5c767db0 xfs: add missing break in xfs_parseargs()
Commit 2e74af0e11 ("xfs: convert mount option parsing to tokens")
missed a 'break;' in xfs_parseargs() which causes mount to fail with
"-o pqnoenforce" option when mounting a v4 filesystem. xfs/050
catches this failure:

XFS (vda6): Super block does not support project and group quota together

Fixes: 2e74af0e11 ("xfs: convert mount option parsing to tokens")
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 07:19:40 +10:00
Eric Sandeen
d0a58e8339 xfs: disallow rw remount on fs with unknown ro-compat features
Today, a kernel which refuses to mount a filesystem read-write
due to unknown ro-compat features can still transition to read-write
via the remount path.  The old kernel is most likely none the wiser,
because it's unaware of the new feature, and isn't using it.  However,
writing to the filesystem may well corrupt metadata related to that
new feature, and moving to a newer kernel which understand the feature
will have problems.

Right now the only ro-compat feature we have is the free inode btree,
which showed up in v3.16.  It would be good to push this back to
all the active stable kernels, I think, so that if anyone is using
newer mkfs (which enables the finobt feature) with older kernel
releases, they'll be protected.

Cc: <stable@vger.kernel.org> # 3.10.x-
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 07:05:41 +10:00
Kirill A. Shutemov
ea1754a084 mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
Mostly direct substitution with occasional adjustment or removing
outdated comments.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Dave Chinner
ab9d1e4f7b Merge branch 'xfs-misc-fixes-4.6-3' into for-next 2016-03-09 08:18:30 +11:00
Darrick J. Wong
30cbc591c3 xfs: check sizes of XFS on-disk structures at compile time
Check the sizes of XFS on-disk structures when compiling the kernel.
Use this to catch inadvertent changes in structure size due to padding
and alignment issues, etc.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-09 08:15:14 +11:00
Eric Sandeen
12c3f05c7b xfs: fix up inode32/64 (re)mount handling
inode32/inode64 allocator behavior with respect to mount, remount
and growfs is a little tricky.

The inode32 mount option should only enable the inode32 allocator
heuristics if the filesystem is large enough for 64-bit inodes to
exist.  Today, it has this behavior on the initial mount, but a
remount with inode32 unconditionally changes the allocation
heuristics, even for a small fs.

Also, an inode32 mounted small filesystem should transition to the
inode32 allocator if the filesystem is subsequently grown to a
sufficient size.  Today that does not happen.

This patch consolidates xfs_set_inode32 and xfs_set_inode64 into a
single new function, and moves the "is the maximum inode number big
enough to matter" test into that function, so it doesn't rely on the
caller to get it right - which remount did not do, previously.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-02 09:58:09 +11:00
Eric Sandeen
a08ee40a79 xfs: sanitize remount options
Perform basic sanitization of remount options by
passing the option string and a dummy mount structure
through xfs_parseargs and returning the result.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-02 09:56:31 +11:00
Eric Sandeen
2e74af0e11 xfs: convert mount option parsing to tokens
This should be a no-op change, just switch to token parsing
like every other respectable filesystem does.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-02 09:55:38 +11:00
Vladimir Davydov
5d097056c9 kmemcg: account certain kmem allocations to memcg
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg.  For the list, see below:

 - threadinfo
 - task_struct
 - task_delay_info
 - pid
 - cred
 - mm_struct
 - vm_area_struct and vm_region (nommu)
 - anon_vma and anon_vma_chain
 - signal_struct
 - sighand_struct
 - fs_struct
 - files_struct
 - fdtable and fdtable->full_fds_bits
 - dentry and external_name
 - inode for all filesystems. This is the most tedious part, because
   most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds.  Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Markus Elfring
a841b64df2 XFS: Use a signed return type for suffix_kstrtoint()
The return type "unsigned long" was used by the suffix_kstrtoint()
function even though it will eventually return a negative error code.
Improve this implementation detail by using the type "int" instead.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-04 16:13:21 +11:00
Dave Chinner
4e14e49a91 Merge branch 'xfs-misc-fixes-for-4.4-3' into for-next 2015-11-10 10:20:48 +11:00
Chris Mason
7a29ac474a xfs: give all workqueues rescuer threads
We're consistently hitting deadlocks here with XFS on recent kernels.
After some digging through the crash files, it looks like everyone in
the system is waiting for XFS to reclaim memory.

Something like this:

PID: 2733434  TASK: ffff8808cd242800  CPU: 19  COMMAND: "java"
 #0 [ffff880019c53588] __schedule at ffffffff818c4df2
 #1 [ffff880019c535d8] schedule at ffffffff818c5517
 #2 [ffff880019c535f8] _xfs_log_force_lsn at ffffffff81316348
 #3 [ffff880019c53688] xfs_log_force_lsn at ffffffff813164fb
 #4 [ffff880019c536b8] xfs_iunpin_wait at ffffffff8130835e
 #5 [ffff880019c53728] xfs_reclaim_inode at ffffffff812fd453
 #6 [ffff880019c53778] xfs_reclaim_inodes_ag at ffffffff812fd8c7
 #7 [ffff880019c53928] xfs_reclaim_inodes_nr at ffffffff812fe433
 #8 [ffff880019c53958] xfs_fs_free_cached_objects at ffffffff8130d3b9
 #9 [ffff880019c53968] super_cache_scan at ffffffff811a6f73
#10 [ffff880019c539c8] shrink_slab at ffffffff811460e6
#11 [ffff880019c53aa8] shrink_zone at ffffffff8114a53f
#12 [ffff880019c53b48] do_try_to_free_pages at ffffffff8114a8ba
#13 [ffff880019c53be8] try_to_free_pages at ffffffff8114ad5a
#14 [ffff880019c53c78] __alloc_pages_nodemask at ffffffff8113e1b8
#15 [ffff880019c53d88] alloc_kmem_pages_node at ffffffff8113e671
#16 [ffff880019c53dd8] copy_process at ffffffff8104f781
#17 [ffff880019c53ec8] do_fork at ffffffff8105129c
#18 [ffff880019c53f38] sys_clone at ffffffff810515b6
#19 [ffff880019c53f48] stub_clone at ffffffff818c8e4d

xfs_log_force_lsn is waiting for logs to get cleaned, which is waiting
for IO, which is waiting for workers to complete the IO which is waiting
for worker threads that don't exist yet:

PID: 2752451  TASK: ffff880bd6bdda00  CPU: 37  COMMAND: "kworker/37:1"
 #0 [ffff8808d20abbb0] __schedule at ffffffff818c4df2
 #1 [ffff8808d20abc00] schedule at ffffffff818c5517
 #2 [ffff8808d20abc20] schedule_timeout at ffffffff818c7c6c
 #3 [ffff8808d20abcc0] wait_for_completion_killable at ffffffff818c6495
 #4 [ffff8808d20abd30] kthread_create_on_node at ffffffff8106ec82
 #5 [ffff8808d20abdf0] create_worker at ffffffff8106752f
 #6 [ffff8808d20abe40] worker_thread at ffffffff810699be
 #7 [ffff8808d20abec0] kthread at ffffffff8106ef59
 #8 [ffff8808d20abf50] ret_from_fork at ffffffff818c8ac8

I think we should be using WQ_MEM_RECLAIM to make sure this thread
pool makes progress when we're not able to allocate new workers.

[dchinner: make all workqueues WQ_MEM_RECLAIM]

Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-10 10:10:34 +11:00
Dave Chinner
2da5c4b05a Merge branch 'xfs-misc-fixes-for-4.4-2' into for-next 2015-11-03 13:27:58 +11:00
Darrick J. Wong
af3b63822e xfs: don't leak uuid table on rmmod
Don't leak the UUID table when the module is unloaded.
(Found with kmemleak.)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03 13:06:34 +11:00
Dan Carpenter
f9d460b341 xfs: fix an error code in xfs_fs_fill_super()
If alloc_percpu() fails, we accidentally return PTR_ERR(NULL), which
means success, but we intended to return -ENOMEM.

Fixes: 225e463558 ('xfs: per-filesystem stats in sysfs')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-19 08:42:47 +11:00
Bill O'Donnell
ff6d6af235 xfs: per-filesystem stats counter implementation
This patch modifies the stats counting macros and the callers
to those macros to properly increment, decrement, and add-to
the xfs stats counts. The counts for global and per-fs stats
are correctly advanced, and cleared by writing a "1" to the
corresponding clear file.

global counts: /sys/fs/xfs/stats/stats
per-fs counts: /sys/fs/xfs/sda*/stats/stats

global clear:  /sys/fs/xfs/stats/stats_clear
per-fs clear:  /sys/fs/xfs/sda*/stats/stats_clear

[dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around
 stats structures and macros. ]

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 18:21:22 +11:00
Bill O'Donnell
225e463558 xfs: per-filesystem stats in sysfs
This patch implements per-filesystem stats objects in sysfs. It
depends on the application of the previous patch series that
develops the infrastructure to support both xfs global stats and
xfs per-fs stats in sysfs.

Stats objects are instantiated when an xfs filesystem is mounted
and deleted on unmount. With this patch, the stats directory is
created and populated with the familiar stats and stats_clear files.
Example:
        /sys/fs/xfs/sda9/stats/stats
        /sys/fs/xfs/sda9/stats/stats_clear

With this patch, the individual counts within the new per-fs
stats file(s) remain at zero. Functions that use the the macros
to increment, decrement, and add-to the per-fs stats counts will
be covered in a separate new patch to follow this one. Note that
the counts within the global stats file (/sys/fs/xfs/stats/stats)
advance normally and can be cleared as it was prior to this patch.

[dchinner: move setup/teardown to xfs_fs_{fill|put}_super() so
it is down before/after any path that uses the per-mount stats. ]

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 18:21:19 +11:00
Bill O'Donnell
80529c45ab xfs: pass xfsstats structures to handlers and macros
This patch is the next step toward per-fs xfs stats. The patch makes
the show and clear routines able to handle any stats structure
associated with a kobject.

Instead of a single global xfsstats structure, add kobject and a pointer
to a per-cpu struct xfsstats. Modify the macros that manipulate the stats
accordingly: XFS_STATS_INC, XFS_STATS_DEC, and XFS_STATS_ADD now access
xfsstats->xs_stats.

The sysfs functions need to get from the kobject back to the xfsstats
structure which contains it, and pass the pointer to the ->xs_stats
percpu structure into the show & clear routines.

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 05:19:45 +11:00
Bill O'Donnell
bb230c1247 xfs: create global stats and stats_clear in sysfs
Currently, xfs global stats are in procfs. This patch introduces
(replicates) the global stats in sysfs. Additionally a stats_clear file
is introduced in sysfs.

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 05:15:45 +11:00
Linus Torvalds
77a78806c7 xfs: updates for 4.3-rc1
This update contains:
 o large rework of EFI/EFD lifecycle handling to fix log recovery corruption
   issues, crashes and unmount hangs
 o separate metadata UUID on disk to enable changing boot label UUID for v5
   filesystems
 o fixes for gcc miscompilation on certain platforms and optimisation levels
 o remote attribute allocation and recovery corruption fixes
 o inode lockdep annotation rework to fix bugs with too many subclasses
 o directory inode locking changes to prevent lockdep false positives
 o a handful of minor corruption fixes
 o various other small cleanups and bug fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJV7VlyAAoJEK3oKUf0dfodLDAQAMTYAERZGp8sI1ZZo9qTtMis
 HE3X7X1jpA2/CrSlsQtw5FEahl9NDoVZKInEFzpDeFogloOwLy+aNz6F9s6SQvSO
 p7r+Tkv8k5WCWCpYhm6N5yVSwMRkCVBJ9+DsxKeabQaNobu2nBRYWA7RcTPbwhL6
 eZZ41NT/1x4Di3MppjRcSHMRxq+DOYsoTj7ey2tB3jFK4w99pfhBqhMsxOMCyThQ
 g61Rj/IIwbUKWDZNBP1vdG9y8eN9xEan7+uQRJYpwjrdPAXeZMg9J0U5dIoZXmOA
 o7UDvyhxZP06vZGG52rMCMWl5kWbEyFGAa/bzh+L+O3/5DZAdoJQxZUF00AsLaxQ
 2kQ4L8vUEuGvPpUcFFopSjvpJmjmdg4O8KCkxKp4bcONA+2Z108e68zVxffnQPgm
 0d2msqRRHCVRSw+o52Nf8R1A29cjhShyxBq4Xw154zrK2lJNwWWx36LG+XgrW2R6
 CHXj2OoMvQZIJWpbhwZqJCcl1dmhcjES082Wvb+RyKvvcQzerOjb5p2R7uqwXVg+
 uR27KstQ3tJ3J+hmq2FwhB7E2GMnvYDL9qt+3RgMIJrM7rxAOB0b/QS+yO9hzgQH
 /I0KzyX72Lcwwxqd0aWLqlqoIWfn44eBK+V2vdXFRNTeWu3kDEW9q0JRQjxVBsFt
 /SMKetOh+gj7yAs+kgOh
 =Eikc
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs updates from Dave Chinner:
 "There isn't a whole lot to this update - it's mostly bug fixes and
  they are spread pretty much all over XFS.  There are some corruption
  fixes, some fixes for log recovery, some fixes that prevent unount
  from hanging, a lockdep annotation rework for inode locking to prevent
  false positives and the usual random bunch of cleanups and minor
  improvements.

  Deatils:

   - large rework of EFI/EFD lifecycle handling to fix log recovery
     corruption issues, crashes and unmount hangs

   - separate metadata UUID on disk to enable changing boot label UUID
     for v5 filesystems

   - fixes for gcc miscompilation on certain platforms and optimisation
     levels

   - remote attribute allocation and recovery corruption fixes

   - inode lockdep annotation rework to fix bugs with too many
     subclasses

   - directory inode locking changes to prevent lockdep false positives

   - a handful of minor corruption fixes

   - various other small cleanups and bug fixes"

* tag 'xfs-for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (42 commits)
  xfs: fix error gotos in xfs_setattr_nonsize
  xfs: add mssing inode cache attempts counter increment
  xfs: return errors from partial I/O failures to files
  libxfs: bad magic number should set da block buffer error
  xfs: fix non-debug build warnings
  xfs: collapse allocsize and biosize mount option handling
  xfs: Fix file type directory corruption for btree directories
  xfs: lockdep annotations throw warnings on non-debug builds
  xfs: Fix uninitialized return value in xfs_alloc_fix_freelist()
  xfs: inode lockdep annotations broke non-lockdep build
  xfs: flush entire file on dio read/write to cached file
  xfs: Fix xfs_attr_leafblock definition
  libxfs: readahead of dir3 data blocks should use the read verifier
  xfs: stop holding ILOCK over filldir callbacks
  xfs: clean up inode lockdep annotations
  xfs: swap leaf buffer into path struct atomically during path shift
  xfs: relocate sparse inode mount warning
  xfs: dquots should be stamped with sb_meta_uuid
  xfs: log recovery needs to validate against sb_meta_uuid
  xfs: growfs not aware of sb_meta_uuid
  ...
2015-09-07 13:28:32 -07:00
Kees Cook
a068acf2ee fs: create and use seq_show_option for escaping
Many file systems that implement the show_options hook fail to correctly
escape their output which could lead to unescaped characters (e.g.  new
lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files.  This
could lead to confusion, spoofed entries (resulting in things like
systemd issuing false d-bus "mount" notifications), and who knows what
else.  This looks like it would only be the root user stepping on
themselves, but it's possible weird things could happen in containers or
in other situations with delegated mount privileges.

Here's an example using overlay with setuid fusermount trusting the
contents of /proc/mounts (via the /etc/mtab symlink).  Imagine the use
of "sudo" is something more sneaky:

  $ BASE="ovl"
  $ MNT="$BASE/mnt"
  $ LOW="$BASE/lower"
  $ UP="$BASE/upper"
  $ WORK="$BASE/work/ 0 0
  none /proc fuse.pwn user_id=1000"
  $ mkdir -p "$LOW" "$UP" "$WORK"
  $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
  $ cat /proc/mounts
  none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
  none /proc fuse.pwn user_id=1000 0 0
  $ fusermount -u /proc
  $ cat /proc/mounts
  cat: /proc/mounts: No such file or directory

This fixes the problem by adding new seq_show_option and
seq_show_option_n helpers, and updating the vulnerable show_option
handlers to use them as needed.  Some, like SELinux, need to be open
coded due to unusual existing escape mechanisms.

[akpm@linux-foundation.org: add lost chunk, per Kees]
[keescook@chromium.org: seq_show_option should be using const parameters]
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Eric Sandeen
2ccf4a9b18 xfs: collapse allocsize and biosize mount option handling
The allocsize and biosize mount options are handled identically,
other than allocsize accepting suffixes.  suffix_kstrtoint handles
bare numbers just fine too, so these can be collapsed.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-25 10:05:13 +10:00
Brian Foster
1b867d3ab5 xfs: relocate sparse inode mount warning
The sparse inodes feature is currently considered experimental. We warn
at mount time from xfs_mount_validate_sb(). This function is part of the
superblock verifier codepath, however, which means it could be invoked
repeatedly on superblock reads or writes. This is currently only
noticeable from userspace, where mkfs produces multiple warnings at
format time.

As mkfs warnings were not the intent of this change, relocate the mount
time warning to xfs_fs_fill_super(), which is only invoked once and only
in kernel space.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-19 10:32:14 +10:00
Dave Chinner
cbe4dab119 xfs: add initial DAX support
Add initial DAX support to XFS. To do this we need a new mount
option to turn DAX on filesystem, and we need to propagate this into
the inode flags whenever an inode is instantiated so that the
per-inode checks throughout the code Do The Right Thing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-04 09:19:18 +10:00
Linus Torvalds
9ec3a646fe Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull fourth vfs update from Al Viro:
 "d_inode() annotations from David Howells (sat in for-next since before
  the beginning of merge window) + four assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  RCU pathwalk breakage when running into a symlink overmounting something
  fix I_DIO_WAKEUP definition
  direct-io: only inc/dec inode->i_dio_count for file systems
  fs/9p: fix readdir()
  VFS: assorted d_backing_inode() annotations
  VFS: fs/inode.c helpers: d_inode() annotations
  VFS: fs/cachefiles: d_backing_inode() annotations
  VFS: fs library helpers: d_inode() annotations
  VFS: assorted weird filesystems: d_inode() annotations
  VFS: normal filesystems (and lustre): d_inode() annotations
  VFS: security/: d_inode() annotations
  VFS: security/: d_backing_inode() annotations
  VFS: net/: d_inode() annotations
  VFS: net/unix: d_backing_inode() annotations
  VFS: kernel/: d_inode() annotations
  VFS: audit: d_backing_inode() annotations
  VFS: Fix up some ->d_inode accesses in the chelsio driver
  VFS: Cachefiles should perform fs modifications on the top layer only
  VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-26 17:22:07 -07:00
David Howells
2b0143b5c9 VFS: normal filesystems (and lustre): d_inode() annotations
that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15 15:06:57 -04:00
Dave Chinner
6a63ef064b Merge branch 'xfs-misc-fixes-for-4.1-3' into for-next
Conflicts:
	fs/xfs/xfs_iops.c
2015-04-13 11:40:16 +10:00
Eric Sandeen
bbe051c841 xfs: disallow ro->rw remount on norecovery mount
There's a bit of a loophole in norecovery mount handling right
now: an initial mount must be readonly, but nothing prevents
a mount -o remount,rw from producing a writable, unrecovered
xfs filesystem.

It might be possible to try to perform a log recovery when this
is requested, but I'm not sure it's worth the effort.  For now,
simply disallow this sort of transition.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-13 11:25:41 +10:00
Dave Chinner
2b93681f59 Merge branch 'xfs-misc-fixes-for-4.1-2' into for-next
Conflicts:
	fs/xfs/libxfs/xfs_bmap.c
	fs/xfs/xfs_inode.c
2015-03-25 15:12:30 +11:00
Joe Perches
5e9383f97e xfs: Fix incorrect positive ENOMEM return
added a positive error return value.

This value filters up through the return layers and should be
negative as the other return values are in the same function.

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-03-25 15:00:24 +11:00
Dave Chinner
88e8fda99a Merge branch 'xfs-mmap-lock' into for-next 2015-02-24 10:27:47 +11:00
Dave Chinner
4225441a1e Merge branch 'xfs-generic-sb-counters' into for-next
Conflicts:
	fs/xfs/xfs_super.c
2015-02-24 10:27:28 +11:00
Eric Sandeen
444a702231 xfs: remove deprecated mount options
We recently removed deprecated sysctls; may as well
remove deprecated mount options as well, we've stated
that they'd be gone by now in the docs.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-24 10:17:04 +11:00
Eric Sandeen
3b9ce795fa xfs: log unmount events on console
There are times, when doing triage and forensics,
that we would like to know whether a filesystem was unmounted,
or if the plug was pulled without a clean unmount.  Log
unmounts at the same level (NOTICE) as we log mounts.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-24 10:13:37 +11:00
Dave Chinner
653c60b633 xfs: introduce mmap/truncate lock
Right now we cannot serialise mmap against truncate or hole punch
sanely. ->page_mkwrite is not able to take locks that the read IO
path normally takes (i.e. the inode iolock) because that could
result in lock inversions (read - iolock - page fault - page_mkwrite
- iolock) and so we cannot use an IO path lock to serialise page
write faults against truncate operations.

Instead, introduce a new lock that is used *only* in the
->page_mkwrite path that is the equivalent of the iolock. The lock
ordering in a page fault is i_mmaplock -> page lock -> i_ilock,
and so in truncate we can i_iolock -> i_mmaplock and so lock out
new write faults during the process of truncation.

Because i_mmap_lock is outside the page lock, we can hold it across
all the same operations we hold the i_iolock for. The only
difference is that we never hold the i_mmaplock in the normal IO
path and so do not ever have the possibility that we can page fault
inside it. Hence there are no recursion issues on the i_mmap_lock
and so we can use it to serialise page fault IO against inode
modification operations that affect the IO path.

This patch introduces the i_mmaplock infrastructure, lockdep
annotations and initialisation/destruction code. Use of the new lock
will be in subsequent patches.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-23 21:43:37 +11:00
Dave Chinner
5681ca4006 xfs: Remove icsb infrastructure
Now that the in-core superblock infrastructure has been replaced with
generic per-cpu counters, we don't need it anymore. Nuke it from
orbit so we are sure that it won't haunt us again...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-23 21:22:31 +11:00
Dave Chinner
0d485ada40 xfs: use generic percpu counters for free block counter
XFS has hand-rolled per-cpu counters for the superblock since before
there was any generic implementation. The free block counter is
special in that it is used for ENOSPC detection outside transaction
contexts for for delayed allocation. This means that the counter
needs to be accurate at zero. The current per-cpu counter code jumps
through lots of hoops to ensure we never run past zero, but we don't
need to make all those jumps with the generic counter
implementation.

The generic counter implementation allows us to pass a "batch"
threshold at which the addition/subtraction to the counter value
will be folded back into global value under lock. We can use this
feature to reduce the batch size as we approach 0 in a very similar
manner to the existing counters and their rebalance algorithm. If we
use a batch size of 1 as we approach 0, then every addition and
subtraction will be done against the global value and hence allow
accurate detection of zero threshold crossing.

Hence we can replace the handrolled, accurate-at-zero counters with
generic percpu counters.

Note: this removes just enough of the icsb infrastructure to compile
without warnings. The rest will go in subsequent commits.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-23 21:22:03 +11:00
Dave Chinner
e88b64ea1f xfs: use generic percpu counters for free inode counter
XFS has hand-rolled per-cpu counters for the superblock since before
there was any generic implementation. The free inode counter is not
used for any limit enforcement - the per-AG free inode counters are
used during allocation to determine if there are inode available for
allocation.

Hence we don't need any of the complexity of the hand-rolled
counters and we can simply replace them with generic per-cpu
counters similar to the inode counter.

This version introduces a xfs_mod_ifree() helper function from
Christoph Hellwig.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-23 21:19:53 +11:00
Dave Chinner
501ab32387 xfs: use generic percpu counters for inode counter
XFS has hand-rolled per-cpu counters for the superblock since before
there was any generic implementation. There are some warts around
the  use of them for the inode counter as the hand rolled counter is
designed to be accurate at zero, but has no specific accurracy at
any other value. This design causes problems for the maximum inode
count threshold enforcement, as there is no trigger that balances
the counters as they get close tothe maximum threshold.

Instead of designing new triggers for balancing, just replace the
handrolled per-cpu counter with a generic counter.  This enables us
to update the counter through the normal superblock modification
funtions, but rather than do that we add a xfs_mod_icount() helper
function (from Christoph Hellwig) and keep the percpu counter
outside the superblock in the struct xfs_mount.

This means we still need to initialise the per-cpu counter
specifically when we read the superblock, and vice versa when we
log/write it, but it does mean that we don't need to change any
other code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-23 21:19:28 +11:00
Vladimir Davydov
4101b62435 fs: consolidate {nr,free}_cached_objects args in shrink_control
We are going to make FS shrinkers memcg-aware.  To achieve that, we will
have to pass the memcg to scan to the nr_cached_objects and
free_cached_objects VFS methods, which currently take only the NUMA node
to scan.  Since the shrink_control structure already holds the node, and
the memcg to scan will be added to it when we introduce memcg-aware
vmscan, let us consolidate the methods' arguments in this structure to
keep things clean.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:08 -08:00
Dave Chinner
bad962662d Merge branch 'xfs-misc-fixes-for-3.20-4' into for-next 2015-02-10 09:24:25 +11:00
Eric Sandeen
01f9882eac xfs: report proper f_files in statfs if we overshoot imaxpct
Normally, a statfs syscall reports m_maxicount as f_files
(total file nodes in file system) because it is supposed
to be the upper limit for dynamically-allocated inodes.

It's possible, however, to overshoot imaxpct / m_maxicount.
If this happens, we should report the actual number of allocated
inodes, which is contained in sb_icount.  Add one more adjustment
to the statfs code to make this happen.

Reported-by: Alexander Tsvetkov <alexander.tsvetkov@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-06 09:53:02 +11:00
Dave Chinner
465e2def7c Merge branch 'xfs-sb-logging-rework' into for-next
Conflicts:
	fs/xfs/xfs_mount.c
2015-01-22 09:20:53 +11:00
Dave Chinner
61e63ecb57 xfs: consolidate superblock logging functions
We now have several superblock loggin functions that are identical
except for the transaction reservation and whether it shoul dbe a
synchronous transaction or not. Consolidate these all into a single
function, a single reserveration and a sync flag and call it
xfs_sync_sb().

Also, xfs_mod_sb() is not really a modification function - it's the
operation of logging the superblock buffer. hence change the name of
it to reflect this.

Note that we have to change the mp->m_update_flags that are passed
around at mount time to a boolean simply to indicate a superblock
update is needed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22 09:10:31 +11:00
Dave Chinner
4d11a40239 xfs: remove bitfield based superblock updates
When we log changes to the superblock, we first have to write them
to the on-disk buffer, and then log that. Right now we have a
complex bitfield based arrangement to only write the modified field
to the buffer before we log it.

This used to be necessary as a performance optimisation because we
logged the superblock buffer in every extent or inode allocation or
freeing, and so performance was extremely important. We haven't done
this for years, however, ever since the lazy superblock counters
pulled the superblock logging out of the transaction commit
fast path.

Hence we have a bunch of complexity that is not necessary that makes
writing the in-core superblock to disk much more complex than it
needs to be. We only need to log the superblock now during
management operations (e.g. during mount, unmount or quota control
operations) so it is not a performance critical path anymore.

As such, remove the complex field based logging mechanism and
replace it with a simple conversion function similar to what we use
for all other on-disk structures.

This means we always log the entirity of the superblock, but again
because we rarely modify the superblock this is not an issue for log
bandwidth or CPU time. Indeed, if we do log the superblock
frequently, delayed logging will minimise the impact of this
overhead.

[Fixed gquota/pquota inode sharing regression noticed by bfoster.]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22 09:10:26 +11:00
Eric Sandeen
77af574eef xfs: remove extra newlines from xfs messages
xfs_warn() and friends add a newline by default, but some
messages add another one.

Particularly for the failing write message below, this can
waste a lot of console real estate!

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-24 09:47:27 +11:00
Linus Torvalds
c05e14f7b3 xfs: update for 3.19-rc1
This update contains:
 o more on-disk format header consolidation
 o move some structures shared with userspace to libxfs
 o new per-mount workqueue to fix for deadlocks between nested loop
   mounted filesystems
 o various bug fixes for ENOSPC, stats, quota off and preallocation
 o a bunch of compiler warning fixes for set-but-unused variables
 o various code cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJUihOWAAoJEK3oKUf0dfodYbkP/iXuIYOhpmc1rUORMDl2JDBc
 iTjXqz1Ydp6vJrq2+3qeAsCbJciNdZ72eNKdvgRbFAN4BW8tv1Wc9QR5m2ZIpCkf
 7buCzbkI64j9HoNAiZJhrMp/eyJ0X1hRGk1ANUaBT9ouXWOBDaOD/sNj9cMptWOA
 72BpTMN0FszAJxW6rNEk1M/i+W2ly0qmD0QJPQU18Z62NU5E+D/uMkg2xif4dhwK
 CSNMgCIv0X1qmve2lMOgwHbgkmHRwbXKSb4Z5vV8pDUh49tkRtxJ2ky7mE7aglrq
 pjChpEqDktkCL/RHAT3XJ77tRIyBXwvpC7ewHXiYBY83OcGfRFv0jMCJ+R+1b3KD
 p1faOVwd/H0tStd+0rF+tMMn8TuujQ597upLGhWdy1BpY3nnkJ7iJ8lyJv+aiCzr
 Oh3DvyX1XgxnEo7yVr+x64TFz/GPkyuvVPSfL3gspqEZErC4BN+AEP/3fF+5SGed
 x9QplB+lcy7IpzB+HURPZL4TqWl4Ib29pArZY1mQ1rJz6IFFbDSzj6lo36YDBrP8
 HRG2LDxgc1udPPMxdZ3PAV3nt4/ufaxSTmT5HGV0Aj+hjkSfLvBDFMuVz9t6vfn9
 YN3ocKWxJr2QISc0fcQ/hsBDiHVyoFgDOikBAetaqpdoM7OM7FHtLXtwLDILldx9
 DZAIS0msNrjc7gGCrbxj
 =2SJP
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs update from Dave Chinner:
 "There's relatively little change in this update; it is mainly bug
  fixes, cleanups and more of the on-going libxfs restructuring and
  on-disk format header consolidation work.

  Details:
   - more on-disk format header consolidation
   - move some structures shared with userspace to libxfs
   - new per-mount workqueue to fix for deadlocks between nested loop
     mounted filesystems
   - various bug fixes for ENOSPC, stats, quota off and preallocation
   - a bunch of compiler warning fixes for set-but-unused variables
   - various code cleanups"

* tag 'xfs-for-linus-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (24 commits)
  xfs: split metadata and log buffer completion to separate workqueues
  xfs: fix set-but-unused warnings
  xfs: move type conversion functions to xfs_dir.h
  xfs: move ftype conversion functions to libxfs
  xfs: lobotomise xfs_trans_read_buf_map()
  xfs: active inodes stat is broken
  xfs: cleanup xfs_bmse_merge returns
  xfs: cleanup xfs_bmse_shift_one goto mess
  xfs: fix premature enospc on inode allocation
  xfs: overflow in xfs_iomap_eof_align_last_fsb
  xfs: fix simple_return.cocci warning in xfs_bmse_shift_one
  xfs: fix simple_return.cocci warning in xfs_file_readdir
  libxfs: fix simple_return.cocci warnings
  xfs: remove unnecessary null checks
  xfs: merge xfs_inum.h into xfs_format.h
  xfs: move most of xfs_sb.h to xfs_format.h
  xfs: merge xfs_ag.h into xfs_format.h
  xfs: move acl structures to xfs_format.h
  xfs: merge xfs_dinode.h into xfs_format.h
  xfs: catch invalid negative blknos in _xfs_buf_find()
  ...
2014-12-12 09:48:17 -08:00
Dave Chinner
6044e4386c Merge branch 'xfs-misc-fixes-for-3.19-2' into for-next
Conflicts:
	fs/xfs/xfs_iops.c
2014-12-04 09:46:17 +11:00
Brian Foster
b29c70f598 xfs: split metadata and log buffer completion to separate workqueues
XFS traditionally sends all buffer I/O completion work to a single
workqueue. This includes metadata buffer completion and log buffer
completion. The log buffer completion requires a high priority queue to
prevent stalls due to log forces getting stuck behind other queued work.

Rather than continue to prioritize all buffer I/O completion due to the
needs of log completion, split log buffer completion off to
m_log_workqueue and move the high priority flag from m_buf_workqueue to
m_log_workqueue.

Add a b_ioend_wq wq pointer to xfs_buf to allow completion workqueue
customization on a per-buffer basis. Initialize b_ioend_wq to
m_buf_workqueue by default in the generic buffer I/O submission path.
Finally, override the default wq with the high priority m_log_workqueue
in the log buffer I/O submission path.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:43:17 +11:00
Dave Chinner
cdc9cec7c0 xfs: active inodes stat is broken
vn_active only ever gets decremented, so it has a very large
negative number.  Make it track the inode count we currently have
allocated properly so we can easily track the size of the inode
cache via tools like PCP.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:42:40 +11:00
Dave Chinner
e77b8547ca Merge branch 'xfs-coccinelle-cleanups' into xfs-misc-fixes-for-3.19-2 2014-12-04 09:18:21 +11:00
Dave Chinner
c14fc01340 Merge branch 'xfs-coccinelle-cleanups' into for-next 2014-12-01 09:03:02 +11:00
Markus Elfring
d2a5e3c6fc xfs: remove unnecessary null checks
The functions xfs_blkdev_put() and xfs_qm_dqrele() test whether
their argument is NULL and then return immediately.  Thus the test
around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-01 08:24:20 +11:00
Dave Chinner
216875a594 Merge branch 'xfs-consolidate-format-defs' into for-next 2014-11-28 14:52:16 +11:00
Christoph Hellwig
508b6b3b73 xfs: merge xfs_inum.h into xfs_format.h
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:27:10 +11:00
Christoph Hellwig
4fb6e8ade2 xfs: merge xfs_ag.h into xfs_format.h
More on-disk format consolidation.  A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:25:04 +11:00
Christoph Hellwig
6d3ebaae7c xfs: merge xfs_dinode.h into xfs_format.h
More consolidatation for the on-disk format defintions.  Note that the
XFS_IS_REALTIME_INODE moves to xfs_linux.h instead as it is not related
to the on disk format, but depends on a CONFIG_ option.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:24:06 +11:00
Brian Foster
78c931b8be xfs: replace global xfslogd wq with per-mount wq
The xfslogd workqueue is a global, single-job workqueue for buffer ioend
processing. This means we allow for a single work item at a time for all
possible XFS mounts on a system. fsstress testing in loopback XFS over
XFS configurations has reproduced xfslogd deadlocks due to the single
threaded nature of the queue and dependencies introduced between the
separate XFS instances by online discard (-o discard).

Discard over a loopback device converts the discard request to a hole
punch (fallocate) on the underlying file. Online discard requests are
issued synchronously and from xfslogd context in XFS, hence the xfslogd
workqueue is blocked in the upper fs waiting on a hole punch request to
be servied in the lower fs. If the lower fs issues I/O that depends on
xfslogd to complete, both filesystems end up hung indefinitely. This is
reproduced reliabily by generic/013 on XFS->loop->XFS test devices with
the '-o discard' mount option.

Further, docker implementations appear to use this kind of configuration
for container instance filesystems by default (container fs->dm->
loop->base fs) and therefore are subject to this deadlock when running
on XFS.

Replace the global xfslogd workqueue with a per-mount variant. This
guarantees each mount access to a single worker and prevents deadlocks
due to inter-fs dependencies introduced by discard. Since the queue is
only responsible for buffer iodone processing at this point in time,
rename xfslogd to xfs-buf.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 13:59:58 +11:00
Jan Kara
17ef4fdd37 xfs: Set allowed quota types
We support user, group, and project quotas. Tell VFS about it.

CC: xfs@oss.sgi.com
CC: Dave Chinner <david@fromorbit.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-10 10:06:09 +01:00
Dave Chinner
e3aed1a081 xfs: xfs_kset should be static
As it is accessed through the struct xfs_mount and can be set up
entirely from fs/xfs/xfs_super.c

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29 10:46:08 +10:00
Brian Foster
65b65735fe xfs: add debug sysfs attribute set
Create a top-level debug directory for global debug sysfs attributes.
This directory is added and removed on XFS module initialization and
removal respectively for DEBUG mode kernels only. It typically resides
at /sys/fs/xfs/debug. It is located at the top level of the xfs sysfs
hierarchy as attributes might define global behavior or behavior that
must be configured before an xfs mount is available (e.g., log recovery
behavior).

Define the global debug kobject that represents the debug sysfs
directory and add generic attribute show/store helpers to support future
attributes. No debug attributes are exported as of yet.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09 11:52:42 +10:00
Brian Foster
8018ec083c xfs: mark all internal workqueues as freezable
Workqueues must be explicitly set as freezable to ensure they are frozen
in the assocated part of the hibernation/suspend sequence. Freezing of
workqueues and kernel threads is important to ensure that modifications
are not made on-disk after the hibernation image has been created.
Otherwise, the in-memory state can become inconsistent with what is on
disk and eventually lead to filesystem corruption. We have reports of
free space btree corruptions that occur immediately after restore from
hibernate that suggest the xfs-eofblocks workqueue could be causing
such problems if it races with hibernation.

Mark all of the internal XFS workqueues as freezable to ensure nothing
changes on-disk once the freezer infrastructure freezes kernel threads
and creates the hibernation image.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reported-by: Carlos E. R. <carlos.e.r@opensuse.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09 11:44:46 +10:00
Dave Chinner
4d7eece2c0 Merge branch 'xfs-misc-fixes-3.17-1' into for-next 2014-08-04 13:54:14 +10:00
Christoph Hellwig
d5cf09bace xfs: require 64-bit sector_t
Trying to support tiny disks only and saving a bit memory might have
made sense on an SGI O2 15 years ago, but is pretty pointless today.

Remove the rarely tested codepath that uses various smaller in-memory
types to reduce our test matrix and make the codebase a little bit
smaller and less complicated.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-30 09:12:05 +10:00
Eric Sandeen
54aa61f82d xfs: tidy up xfs_set_inode32
xfs_set_inode32() caught my eye because it had weird spacing around
the "-1's".  In cleaning that up, I realized that the assignment in
the declaration of "ino" is never used; it's rewritten before it
gets read.

Drop the ino initializer from its declaration since it's not used,
and move the agino initialization into the body of the function,
mostly so that we can have pretty whitespace and not exceed 80
columns.  :)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 20:53:10 +10:00
Eric Sandeen
9de67c3ba9 xfs: allow inode allocations in post-growfs disk space
Today, if we perform an xfs_growfs which adds allocation groups,
mp->m_maxagi is not properly updated when the growfs is complete.

Therefore inodes will continue to be allocated only in the
AGs which existed prior to the growfs, and the new space
won't be utilized.

This is because of this path in xfs_growfs_data_private():

xfs_growfs_data_private
	xfs_initialize_perag(mp, nagcount, &nagimax);
		if (mp->m_flags & XFS_MOUNT_32BITINODES)
			index = xfs_set_inode32(mp);
		else
			index = xfs_set_inode64(mp);

		if (maxagi)
			*maxagi = index;

where xfs_set_inode* iterates over the (old) agcount in
mp->m_sb.sb_agblocks, which has not yet been updated
in the growfs path.  So "index" will be returned based on
the old agcount, not the new one, and new AGs are not available
for inode allocation.

Fix this by explicitly passing the proper AG count (which
xfs_initialize_perag() already has) down another level,
so that xfs_set_inode* can make the proper decision about
acceptable AGs for inode allocation in the potentially
newly-added AGs.

This has been broken since 3.7, when these two
xfs_set_inode* functions were added in commit 2d2194f.
Prior to that, we looped over "agcount" not sb_agblocks
in these calculations.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 20:51:54 +10:00
Brian Foster
3d8712265c xfs: add a sysfs kset
Create a sysfs kset to contain all sub-objects associated with the XFS
module. The kset is created and removed on module initialization and
removal respectively. The kset uses fs_obj as a parent. This leads to
the creation of a /sys/fs/xfs directory when the kset exists.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 07:41:37 +10:00
Dave Chinner
2451337dd0 xfs: global error sign conversion
Convert all the errors the core XFs code to negative error signs
like the rest of the kernel and remove all the sign conversion we
do in the interface layers.

Errors for conversion (and comparison) found via searches like:

$ git grep " E" fs/xfs
$ git grep "return E" fs/xfs
$ git grep " E[A-Z].*;$" fs/xfs

Negation points found via searches like:

$ git grep "= -[a-z,A-Z]" fs/xfs
$ git grep "return -[a-z,A-D,F-Z]" fs/xfs
$ git grep " -[a-z].*;" fs/xfs

[ with some bits I missed from Brian Foster ]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-25 14:58:08 +10:00
Eric Sandeen
b474c7ae43 xfs: Nuke XFS_ERROR macro
XFS_ERROR was designed long ago to trap return values, but it's not
runtime configurable, it's not consistently used, and we can do
similar error trapping with ftrace scripts and triggers from
userspace.

Just nuke XFS_ERROR and associated bits.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-22 15:04:54 +10:00
Dave Chinner
232c2f5c65 Merge branch 'xfs-filestreams-lookup' into for-next 2014-05-15 09:36:59 +10:00
Dave Chinner
fdd3a2ae2e Merge branch 'xfs-unused-args-cleanup' into for-next 2014-05-15 09:36:35 +10:00
Dave Chinner
bc147822d5 xfs: negate xfs_icsb_init_counters error value
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:23:07 +10:00
Dave Chinner
45687642e4 xfs: negate mount workqueue init error value
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:22:53 +10:00
Christoph Hellwig
1919adda07 xfs: don't create a slab cache for filestream items
We only have very few of these around, and allocation isn't that
much of a hot path.  Remove the slab cache to simplify the code,
and to not waste any resources for the usual case of not having
any inodes that use the filestream allocator.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23 07:11:51 +10:00
Eric Sandeen
34dcefd717 xfs: remove unused args from xfs_alloc_buftarg()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:01:00 +10:00
Eric Sandeen
a96c41519a xfs: remove unused blocksize arg from xfs_setsize_buftarg()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:00:29 +10:00
Linus Torvalds
24e7ea3bea Major changes for 3.14 include support for the newly added ZERO_RANGE
and COLLAPSE_RANGE fallocate operations, and scalability improvements
 in the jbd2 layer and in xattr handling when the extended attributes
 spill over into an external block.
 
 Other than that, the usual clean ups and minor bug fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJTPbD2AAoJENNvdpvBGATwDmUQANSfGYIQazB8XKKgtNTMiG/Y
 Ky7n1JzN9lTX/6nMsqQnbfCweLRmxqpWUBuyKDRHUi8IG0/voXSTFsAOOgz0R15A
 ERRRWkVvHixLpohuL/iBdEMFHwNZYPGr3jkm0EIgzhtXNgk5DNmiuMwvHmCY27kI
 kdNZIw9fip/WRNoFLDBGnLGC37aanoHhCIbVlySy5o9LN1pkC8BgXAYV0Rk19SVd
 bWCudSJEirFEqWS5H8vsBAEm/ioxTjwnNL8tX8qms6orZ6h8yMLFkHoIGWPw3Q15
 a0TSUoMyav50Yr59QaDeWx9uaPQVeK41wiYFI2rZOnyG2ts0u0YXs/nLwJqTovgs
 rzvbdl6cd3Nj++rPi97MTA7iXK96WQPjsDJoeeEgnB0d/qPyTk6mLKgftzLTNgSa
 ZmWjrB19kr6CMbebMC4L6eqJ8Fr66pCT8c/iue8wc4MUHi7FwHKH64fqWvzp2YT/
 +165dqqo2JnUv7tIp6sUi1geun+bmDHLZFXgFa7fNYFtcU3I+uY1mRr3eMVAJndA
 2d6ASe/KhQbpVnjKJdQ8/b833ZS3p+zkgVPrd68bBr3t7gUmX91wk+p1ct6rUPLr
 700F+q/pQWL8ap0pU9Ht/h3gEJIfmRzTwxlOeYyOwDseqKuS87PSB3BzV3dDunSU
 DrPKlXwIgva7zq5/S0Vr
 =4s1Z
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Major changes for 3.14 include support for the newly added ZERO_RANGE
  and COLLAPSE_RANGE fallocate operations, and scalability improvements
  in the jbd2 layer and in xattr handling when the extended attributes
  spill over into an external block.

  Other than that, the usual clean ups and minor bug fixes"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
  ext4: fix premature freeing of partial clusters split across leaf blocks
  ext4: remove unneeded test of ret variable
  ext4: fix comment typo
  ext4: make ext4_block_zero_page_range static
  ext4: atomically set inode->i_flags in ext4_set_inode_flags()
  ext4: optimize Hurd tests when reading/writing inodes
  ext4: kill i_version support for Hurd-castrated file systems
  ext4: each filesystem creates and uses its own mb_cache
  fs/mbcache.c: doucple the locking of local from global data
  fs/mbcache.c: change block and index hash chain to hlist_bl_node
  ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
  ext4: refactor ext4_fallocate code
  ext4: Update inode i_size after the preallocation
  ext4: fix partial cluster handling for bigalloc file systems
  ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
  ext4: only call sync_filesystm() when remounting read-only
  fs: push sync_filesystem() down to the file system's remount_fs()
  jbd2: improve error messages for inconsistent journal heads
  jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
  jbd2: minimize region locked by j_list_lock in journal_get_create_access()
  ...
2014-04-04 15:39:39 -07:00
Johannes Weiner
91b0abe36a mm + fs: store shadow entries in page cache
Reclaim will be leaving shadow entries in the page cache radix tree upon
evicting the real page.  As those pages are found from the LRU, an
iput() can lead to the inode being freed concurrently.  At this point,
reclaim must no longer install shadow pages because the inode freeing
code needs to ensure the page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code sets
under the tree lock before doing the final truncate.  Reclaim will check
for this flag before installing shadow pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:01 -07:00
Theodore Ts'o
02b9984d64 fs: push sync_filesystem() down to the file system's remount_fs()
Previously, the no-op "mount -o mount /dev/xxx" operation when the
file system is already mounted read-write causes an implied,
unconditional syncfs().  This seems pretty stupid, and it's certainly
documented or guaraunteed to do this, nor is it particularly useful,
except in the case where the file system was mounted rw and is getting
remounted read-only.

However, it's possible that there might be some file systems that are
actually depending on this behavior.  In most file systems, it's
probably fine to only call sync_filesystem() when transitioning from
read-write to read-only, and there are some file systems where this is
not needed at all (for example, for a pseudo-filesystem or something
like romfs).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Jan Kara <jack@suse.cz>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Anders Larsen <al@alarsen.net>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: Petr Vandrovec <petr@vandrovec.name>
Cc: xfs@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Cc: codalist@coda.cs.cmu.edu
Cc: linux-ext4@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: fuse-devel@lists.sourceforge.net
Cc: cluster-devel@redhat.com
Cc: linux-mtd@lists.infradead.org
Cc: jfs-discussion@lists.sourceforge.net
Cc: linux-nfs@vger.kernel.org
Cc: linux-nilfs@vger.kernel.org
Cc: linux-ntfs-dev@lists.sourceforge.net
Cc: ocfs2-devel@oss.oracle.com
Cc: reiserfs-devel@vger.kernel.org
2014-03-13 10:14:33 -04:00
Jan Kara
0dc83bd30b Revert "writeback: do not sync data dirtied after sync start"
This reverts commit c4a391b53a. Dave
Chinner <david@fromorbit.com> has reported the commit may cause some
inodes to be left out from sync(2). This is because we can call
redirty_tail() for some inode (which sets i_dirtied_when to current time)
after sync(2) has started or similarly requeue_inode() can set
i_dirtied_when to current time if writeback had to skip some pages. The
real problem is in the functions clobbering i_dirtied_when but fixing
that isn't trivial so revert is a safer choice for now.

CC: stable@vger.kernel.org # >= 3.13
Signed-off-by: Jan Kara <jack@suse.cz>
2014-02-22 02:02:28 +01:00
Linus Torvalds
7e1a1e9378 xfs: update for v3.13-rc1
For 3.13-rc1 we have an eclectic assortment of bugfixes, cleanups, and
 refactoring.  Bugfixes that stand out are the fix for the AGF/AGI
 deadlock, incore extent list fixes, verifier fixes for v4 superblocks
 and growfs, and memory leaks.  There are some asserts, warnings, and
 strings that were cleaned up.  There was further rearrangement of code
 to make libxfs and the kernel sync up more easily, differences between
 v2 and v3 directory code were abstracted using an ops vector,
 xfs_inactive was reworked, and the preallocation/hole punching code was
 refactored.
 
 - simplify kmem_zone_zalloc
 - add traces for AGF/AGI read ops
 - add additional AIL traces
 - fix xfs_remove AGF vs AGI deadlock
 - fix the extent count of new incore extent page in the indirection array
 - don't fail bad secondary superblocks verification on v4 filesystems
   due to unzeroed bits after v4 fields
 - fix possible NULL dereference in xlog_verify_iclog
 - remove redundant assert in xfs_dir2_leafn_split
 - prevent stack overflows from page cache allocation
 - fix some sparse warnings
 - fix directory block format verifier to check the leaf entry count
 - abstract the differences in dir2/dir3 via an ops vector
 - continue process of reorganization to make libxfs/kernel code merges easier
 - refactor the preallocation and hole punching code
 - fix for growfs and verifiers
 - remove unnecessary scary corruption error when probing non-xfs filesystems
 - remove extra newlines from strings passed to printk
 - prevent deadlock trying to cover an active log
 - rework xfs_inactive()
 - add the inode directory type support to XFS_IOC_FSGEOM
 - cleanup (remove) usage of is_bad_inode
 - fix miscalculation in xfs_iext_realloc_direct which results in oversized
   direct extent list
 - remove unnecessary count arg to xfs_iomap_write_allocate
 - fix memory leak in xlog_recover_add_to_trans
 - check superblock instead of block magic to determine if dtype field
   is present
 - fix lockdep annotation due to project quotas
 - fix regression in xfs_node_toosmall which can lead to incorrect directory
   btree node collapse
 - make log recovery verify filesystem uuid of recovering blocks
 - fix XFS_IOC_FREE_EOFBLOCKS definition
 - remove invalid assert in xfs_inode_free
 - fix for AIL lock regression
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJShBL7AAoJENaLyazVq6ZOaRwP/1B3QkWxFArRSD4wl15oBxpN
 Zv6D7woTmAvON87OIG4m67gyTr5/yNrPy8bg6Qw4YoL6lHTgle+RDUaKhgnsODoX
 Gd/oOiKCBqGfe93zs2fzIQzZ+yn+xdXr+q8uyEwEe8QHK6/wg6lEHNNae8VXEBlO
 20ec4b0U9dxoOJyG8nJNdytI++jp3TWzmZGpmLwisRogt4b86JM+QRhKFOe18AeI
 c9ky0uQmOQ6gX6h1VKN1L1u66GpTtFgj8XqPp/V6D8xHb1XGNiutDAKD7Mt/Rcgf
 njQsky2lXSQIuOnhyS1+lPvR8x19srs6UdnxcWdJvvwsICb14ZHEsZQA7M1bkLnw
 zNYtwn5RSneVSdjUZ+55dU1oDfTw2fxRHmKcm3bKrJCG7aOcH5vhEKs6HS0eVZAW
 4AcjThA43UpcEv47sghd7WJ+hFc4tKDVh9BOLUNi9zlkltVdP6WmWduMco0mRNeJ
 gd++CFRv9R3cQ0UUNsNMGQ9a8k/TW5uHYRsfX2IRBcgXQD2Ip1HBGLGSft2/JA4G
 U53mM08RntInGKctp1PjJea74QPJrYT7wBMlBl917tmnZ59i20nDs/OfeD2Dsnod
 9ekK5J7cMGHdWnQ3+o2b9Awuypcl+d9vdNKgNmNVTPlptfkI5OjJ5+BhqScyDw7m
 LJ1JmPIPIJF7vdIqBJWL
 =XMd/
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-v3.13-rc1' of git://oss.sgi.com/xfs/xfs

Pull xfs update from Ben Myers:
 "For 3.13-rc1 we have an eclectic assortment of bugfixes, cleanups, and
  refactoring.  Bugfixes that stand out are the fix for the AGF/AGI
  deadlock, incore extent list fixes, verifier fixes for v4 superblocks
  and growfs, and memory leaks.  There are some asserts, warnings, and
  strings that were cleaned up.  There was further rearrangement of code
  to make libxfs and the kernel sync up more easily, differences between
  v2 and v3 directory code were abstracted using an ops vector,
  xfs_inactive was reworked, and the preallocation/hole punching code
  was refactored.

   - simplify kmem_zone_zalloc
   - add traces for AGF/AGI read ops
   - add additional AIL traces
   - fix xfs_remove AGF vs AGI deadlock
   - fix the extent count of new incore extent page in the indirection
     array
   - don't fail bad secondary superblocks verification on v4 filesystems
     due to unzeroed bits after v4 fields
   - fix possible NULL dereference in xlog_verify_iclog
   - remove redundant assert in xfs_dir2_leafn_split
   - prevent stack overflows from page cache allocation
   - fix some sparse warnings
   - fix directory block format verifier to check the leaf entry count
   - abstract the differences in dir2/dir3 via an ops vector
   - continue process of reorganization to make libxfs/kernel code
     merges easier
   - refactor the preallocation and hole punching code
   - fix for growfs and verifiers
   - remove unnecessary scary corruption error when probing non-xfs
     filesystems
   - remove extra newlines from strings passed to printk
   - prevent deadlock trying to cover an active log
   - rework xfs_inactive()
   - add the inode directory type support to XFS_IOC_FSGEOM
   - cleanup (remove) usage of is_bad_inode
   - fix miscalculation in xfs_iext_realloc_direct which results in
     oversized direct extent list
   - remove unnecessary count arg to xfs_iomap_write_allocate
   - fix memory leak in xlog_recover_add_to_trans
   - check superblock instead of block magic to determine if dtype field
     is present
   - fix lockdep annotation due to project quotas
   - fix regression in xfs_node_toosmall which can lead to incorrect
     directory btree node collapse
   - make log recovery verify filesystem uuid of recovering blocks
   - fix XFS_IOC_FREE_EOFBLOCKS definition
   - remove invalid assert in xfs_inode_free
   - fix for AIL lock regression"

* tag 'xfs-for-linus-v3.13-rc1' of git://oss.sgi.com/xfs/xfs: (49 commits)
  xfs: simplify kmem_{zone_}zalloc
  xfs: add tracepoints to AGF/AGI read operations
  xfs: trace AIL manipulations
  xfs: xfs_remove deadlocks due to inverted AGF vs AGI lock ordering
  xfs: fix the extent count when allocating an new indirection array entry
  xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields
  xfs: fix possible NULL dereference in xlog_verify_iclog
  xfs:xfs_dir2_node.c: pointer use before check for null
  xfs: prevent stack overflows from page cache allocation
  xfs: fix static and extern sparse warnings
  xfs: validity check the directory block leaf entry count
  xfs: make dir2 ftype offset pointers explicit
  xfs: convert directory vector functions to constants
  xfs: convert directory vector functions to constants
  xfs: vectorise encoding/decoding directory headers
  xfs: vectorise DA btree operations
  xfs: vectorise directory leaf operations
  xfs: vectorise directory data operations part 2
  xfs: vectorise directory data operations
  xfs: vectorise remaining shortform dir2 ops
  ...
2013-11-14 17:16:35 +09:00
Jan Kara
c4a391b53a writeback: do not sync data dirtied after sync start
When there are processes heavily creating small files while sync(2) is
running, it can easily happen that quite some new files are created
between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2).  That can happen
especially if there are several busy filesystems (remember that sync
traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
fs before starting it on another fs).  Because WB_SYNC_ALL pass is slow
(e.g.  causes a transaction commit and cache flush for each inode in
ext3), resulting sync(2) times are rather large.

The following script reproduces the problem:

  function run_writers
  {
    for (( i = 0; i < 10; i++ )); do
      mkdir $1/dir$i
      for (( j = 0; j < 40000; j++ )); do
        dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
      done &
    done
  }

  for dir in "$@"; do
    run_writers $dir
  done

  sleep 40
  time sync

Fix the problem by disregarding inodes dirtied after sync(2) was called
in the WB_SYNC_ALL pass.  To allow for this, sync_inodes_sb() now takes
a time stamp when sync has started which is used for setting up work for
flusher threads.

To give some numbers, when above script is run on two ext4 filesystems
on simple SATA drive, the average sync time from 10 runs is 267.549
seconds with standard deviation 104.799426.  With the patched kernel,
the average sync time from 10 runs is 2.995 seconds with standard
deviation 0.096.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:07 +09:00
Dave Chinner
632b89e82b xfs: fix static and extern sparse warnings
The kbuild test robot indicated that there were some new sparse
warnings in fs/xfs/xfs_dquot_buf.c. Actually, there were a lot more
that is wasn't warning about, so fix them all up.

Reported-by: kbuild test robot
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-30 13:59:56 -05:00
Dave Chinner
a4fbe6ab1e xfs: decouple inode and bmap btree header files
Currently the xfs_inode.h header has a dependency on the definition
of the BMAP btree records as the inode fork includes an array of
xfs_bmbt_rec_host_t objects in it's definition.

Move all the btree format definitions from xfs_btree.h,
xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to
xfs_format.h to continue the process of centralising the on-disk
format definitions. With this done, the xfs inode definitions are no
longer dependent on btree header files.

The enables a massive culling of unnecessary includes, with close to
200 #include directives removed from the XFS kernel code base.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-23 16:28:49 -05:00
Dave Chinner
239880ef64 xfs: decouple log and transaction headers
xfs_trans.h has a dependency on xfs_log.h for a couple of
structures. Most code that does transactions doesn't need to know
anything about the log, but this dependency means that they have to
include xfs_log.h. Decouple the xfs_trans.h and xfs_log.h header
files and clean up the includes to be in dependency order.

In doing this, remove the direct include of xfs_trans_reserve.h from
xfs_trans.h so that we remove the dependency between xfs_trans.h and
xfs_mount.h. Hence the xfs_trans.h include can be moved to the
indicate the actual dependencies other header files have on it.

Note that these are kernel only header files, so this does not
translate to any userspace changes at all.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-23 16:17:44 -05:00
Dave Chinner
5706278758 xfs: unify directory/attribute format definitions
The on-disk format definitions for the directory and attribute
structures are spread across 3 header files right now, only one of
which is dedicated to defining on-disk structures and their
manipulation (xfs_dir2_format.h). Pull all the format definitions
into a single header file - xfs_da_format.h - and switch all the
code over to point at that.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-23 14:21:40 -05:00
Dave Chinner
70a9883c5f xfs: create a shared header file for format-related information
All of the buffer operations structures are needed to be exported
for xfs_db, so move them all to a common location rather than
spreading them all over the place. They are verifying the on-disk
format, so while xfs_format.h might be a good place, it is not part
of the on disk format.

Hence we need to create a new header file that we centralise these
related definitions. Start by moving the bffer operations
structures, and then also move all the other definitions that have
crept into xfs_log_format.h and xfs_format.h as there was no other
shared header file to put them in.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-23 14:11:30 -05:00
Eric Sandeen
08e96e1a3c xfs: remove newlines from strings passed to __xfs_printk
__xfs_printk adds its own "\n".  Having it in the original string
leads to unintentional blank lines from these messages.

Most format strings have no newline, but a few do, leading to
i.e.:

[ 7347.119911] XFS (sdb2): Access to block zero in inode 132 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 1a05
[ 7347.119911] 
[ 7347.119919] XFS (sdb2): Access to block zero in inode 132 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 1a05
[ 7347.119919] 

Fix them all.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-17 13:30:29 -05:00
Ben Myers
d948709b8e xfs: remove usage of is_bad_inode
XFS never calls mark_inode_bad or iget_failed, so it will never see a
bad inode.  Remove all checks for is_bad_inode because they are
unnecessary.

Signed-off-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2013-10-01 17:38:16 -05:00
Dave Chinner
9b17c62382 fs: convert inode and dentry shrinking to be node aware
Now that the shrinker is passing a node in the scan control structure, we
can pass this to the the generic LRU list code to isolate reclaim to the
lists on matching nodes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:31 -04:00
Dave Chinner
0a234c6dcb shrinker: convert superblock shrinkers to new API
Convert superblock shrinker to use the new count/scan API, and propagate
the API changes through to the filesystem callouts.  The filesystem
callouts already use a count/scan API, so it's just changing counters to
longs to match the VM API.

This requires the dentry and inode shrinker callouts to be converted to
the count/scan API.  This is mainly a mechanical change.

[glommer@openvz.org: use mult_frac for fractional proportions, build fixes]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:30 -04:00
Dave Chinner
e546cb79ef xfs: consolidate xfs_utils.c
There are a few small helper functions in xfs_util, all related to
xfs_inode modifications. Move them all to xfs_inode.c so all
xfs_inode operations are consiolidated in the one place.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:55:17 -05:00
Dave Chinner
c24b5dfadc xfs: kill xfs_vnodeops.[ch]
Now we have xfs_inode.c for holding kernel-only XFS inode
operations, move all the inode operations from xfs_vnodeops.c to
this new file as it holds another set of kernel-only inode
operations. The name of this file traces back to the days of Irix
and it's vnodes which we don't have anymore.

Essentially this move consolidates the inode locking functions
and a bunch of XFS inode operations into the one file. Eventually
the high level functions will be merged into the VFS interface
functions in xfs_iops.c.

This leaves only internal preallocation, EOF block manipulation and
hole punching functions in vnodeops.c. Move these to xfs_bmap_util.c
where we are already consolidating various in-kernel physical extent
manipulation and querying functions.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:53:39 -05:00
Dave Chinner
2b9ab5ab9c xfs: reshuffle dir2 definitions around for userspace
Many of the definitions within xfs_dir2_priv.h are needed in
userspace outside libxfs. Definitions within xfs_dir2_priv.h are
wholly contained within libxfs, so we need to shuffle some of the
definitions around to keep consistency across files shared between
user and kernel space.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:40:57 -05:00
Dave Chinner
6ca1c9063d xfs: separate dquot on disk format definitions out of xfs_quota.h
The on disk format definitions of the on-disk dquot, log formats and
quota off log formats are all intertwined with other definitions for
quotas. Separate them out into their own header file so they can
easily be shared with userspace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:09:52 -05:00
Tejun Heo
7a378c9aea xfs: WQ_NON_REENTRANT is meaningless and going away
dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
WQ_NON_REENTRANT no-op and the flag is going away.  Remove its usages.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: Alex Elder <elder@kernel.org>
Cc: xfs@oss.sgi.com
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-07-30 13:11:17 -05:00
Chandra Seetharaman
d892d5864f xfs: Start using pquotaino from the superblock.
Start using pquotino and define a macro to check if the
superblock has pquotino.

Keep backward compatibilty by alowing mount of older superblock
with no separate pquota inode.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-07-22 14:46:26 -05:00
Chandra Seetharaman
83e782e1a1 xfs: Remove incore use of XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD
Remove all incore use of XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD. Instead,
start using XFS_GQUOTA_.* XFS_PQUOTA_.* counterparts for GQUOTA and
PQUOTA respectively.

On-disk copy still uses XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD.

Read and write of the superblock does the conversion from *OQUOTA*
to *[PG]QUOTA*.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-28 17:39:22 -05:00