Commit Graph

2894 Commits

Author SHA1 Message Date
Dave Chinner e80dfa1997 xfs: convert buftarg LRU to generic code
Convert the buftarg LRU to use the new generic LRU list and take advantage
of the functionality it supplies to make the buffer cache shrinker node
aware.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:31 -04:00
Dave Chinner 9b17c62382 fs: convert inode and dentry shrinking to be node aware
Now that the shrinker is passing a node in the scan control structure, we
can pass this to the the generic LRU list code to isolate reclaim to the
lists on matching nodes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:31 -04:00
Dave Chinner 0a234c6dcb shrinker: convert superblock shrinkers to new API
Convert superblock shrinker to use the new count/scan API, and propagate
the API changes through to the filesystem callouts.  The filesystem
callouts already use a count/scan API, so it's just changing counters to
longs to match the VM API.

This requires the dentry and inode shrinker callouts to be converted to
the count/scan API.  This is mainly a mechanical change.

[glommer@openvz.org: use mult_frac for fractional proportions, build fixes]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:30 -04:00
Glauber Costa 55f841ce93 super: fix calculation of shrinkable objects for small numbers
The sysctl knob sysctl_vfs_cache_pressure is used to determine which
percentage of the shrinkable objects in our cache we should actively try
to shrink.

It works great in situations in which we have many objects (at least more
than 100), because the aproximation errors will be negligible.  But if
this is not the case, specially when total_objects < 100, we may end up
concluding that we have no objects at all (total / 100 = 0, if total <
100).

This is certainly not the biggest killer in the world, but may matter in
very low kernel memory situations.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:29 -04:00
Linus Torvalds 300893b08f xfs: update for v3.12-rc1
For 3.12-rc1 there are a number of bugfixes in addition to work to ease usage
 of shared code between libxfs and the kernel, the rest of the work to enable
 project and group quotas to be used simultaneously, performance optimisations
 in the log and the CIL, directory entry file type support, fixes for log space
 reservations, some spelling/grammar cleanups, and the addition of user
 namespace support.
 
 - introduce readahead to log recovery
 - add directory entry file type support
 - fix a number of spelling errors in comments
 - introduce new Q_XGETQSTATV quotactl for project quotas
 - add USER_NS support
 - log space reservation rework
 - CIL optimisations
 - kernel/userspace libxfs rework
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJSLeikAAoJENaLyazVq6ZOciEP/3tc850sQsPlNwP9aqd1l2Wk
 S1RJ8i+MUQ2W/PlbswCXvdUCT8DIwXWxL31tGvi8vtaLhh6t8ICSZwqNil+/GCIJ
 BErVvY4oXhEMHhlbIRRvpxblTfJGiYy3puUEz9VI0yDdUVnC33+DuEeLTQ/0mibo
 /UUqKFmM3KYpOc8vIQvH5K5i8PkjtMt9yge0k4l9COD30gtY2okkaD4b1voOsKc+
 5YFqulq7zcXBUYti+EFCQeV8aUBTGEPN4PJRdcS12/ylzsTzZivAOO+QREu7qBW8
 x+Gj8fOC+yYWCttmJlfa1n8taxge3ndEuzKN97nvvfQgjvvunMvwJ499skryYVdB
 EcPnBnpDUQuz/y7exKBT9uROK817vZBtfHzSova29ayQSWC+qDpNE4xXeDIqeCtT
 CPxdHuWMOvIdZg41E4x7je0elaZl8EAZ8hycc2WuRhtukEkIdE1O8aD7IVrMYee8
 kg+aVHG5nmYRInO1WuMinbtiCzwvVoBJToWM3y4cbfgW0dILASRyL53HDd+eCr1j
 kOpPIVgXlBZgiPMmdYahWxyVVWcE7zyex0w4frzWVlJMZ4lP5brppD6qfQg1JwOB
 z21Y95F5C2GxSyN/Lwps0G6jujHrpe6GVeYK7uKCtnqTD83nSShv5Naln7pQ3AUs
 qUMsqmJob4+bwt94Xgbx
 =V4s4
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-v3.12-rc1' of git://oss.sgi.com/xfs/xfs

Pull xfs updates from Ben Myers:
 "For 3.12-rc1 there are a number of bugfixes in addition to work to
  ease usage of shared code between libxfs and the kernel, the rest of
  the work to enable project and group quotas to be used simultaneously,
  performance optimisations in the log and the CIL, directory entry file
  type support, fixes for log space reservations, some spelling/grammar
  cleanups, and the addition of user namespace support.

   - introduce readahead to log recovery
   - add directory entry file type support
   - fix a number of spelling errors in comments
   - introduce new Q_XGETQSTATV quotactl for project quotas
   - add USER_NS support
   - log space reservation rework
   - CIL optimisations
  - kernel/userspace libxfs rework"

* tag 'xfs-for-linus-v3.12-rc1' of git://oss.sgi.com/xfs/xfs: (112 commits)
  xfs: XFS_MOUNT_QUOTA_ALL needed by userspace
  xfs: dtype changed xfs_dir2_sfe_put_ino to xfs_dir3_sfe_put_ino
  Fix wrong flag ASSERT in xfs_attr_shortform_getvalue
  xfs: finish removing IOP_* macros.
  xfs: inode log reservations are too small
  xfs: check correct status variable for xfs_inobt_get_rec() call
  xfs: inode buffers may not be valid during recovery readahead
  xfs: check LSN ordering for v5 superblocks during recovery
  xfs: btree block LSN escaping to disk uninitialised
  XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568
  xfs: fix bad dquot buffer size in log recovery readahead
  xfs: don't account buffer cancellation during log recovery readahead
  xfs: check for underflow in xfs_iformat_fork()
  xfs: xfs_dir3_sfe_put_ino can be static
  xfs: introduce object readahead to log recovery
  xfs: Simplify xfs_ail_min() with list_first_entry_or_null()
  xfs: Register hotcpu notifier after initialization
  xfs: add xfs sb v4 support for dirent filetype field
  xfs: Add write support for dirent filetype field
  xfs: Add read-only support for dirent filetype field
  ...
2013-09-09 11:19:09 -07:00
Christoph Hellwig 7b7a8665ed direct-io: Implement generic deferred AIO completions
Add support to the core direct-io code to defer AIO completions to user
context using a workqueue.  This replaces opencoded and less efficient
code in XFS and ext4 (we save a memory allocation for each direct IO)
and will be needed to properly support O_(D)SYNC for AIO.

The communication between the filesystem and the direct I/O code requires
a new buffer head flag, which is a bit ugly but not avoidable until the
direct I/O code stops abusing the buffer_head structure for communicating
with the filesystems.

Currently this creates a per-superblock unbound workqueue for these
completions, which is taken from an earlier patch by Jan Kara.  I'm
not really convinced about this use and would prefer a "normal" global
workqueue with a high concurrency limit, but this needs further discussion.

JK: Fixed ext4 part, dynamic allocation of the workqueue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-04 09:23:46 -04:00
Dave Chinner 1d03c6fa88 xfs: XFS_MOUNT_QUOTA_ALL needed by userspace
So move it to a header file shared with userspace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-03 15:00:06 -05:00
Dave Chinner 50fc5f7acc xfs: dtype changed xfs_dir2_sfe_put_ino to xfs_dir3_sfe_put_ino
So fix up the export in xfs_dir2.h that is needed by userspace.

<sigh>

Now xfs_dir3_sfe_put_ino has been made static. Revert 98f7462 ("xfs:
xfs_dir3_sfe_put_ino can be static") to being non static so that the
code shared with userspace is identical again.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-03 14:51:16 -05:00
Eric Sandeen 914ed44b17 Fix wrong flag ASSERT in xfs_attr_shortform_getvalue
This ASSERT is testing an if_flags flag value against
a di_aformat enum value.  di_aformat is never assigned
XFS_IFINLINE.

This happens to work for now, because XFS_IFINLINE has
the same value as XFS_DINODE_FMT_LOCAL, and that's tested
just before we call this function.

However, I think the intention is to assert that we have
read in the data, i.e. XFS_IFINLINE on if_flags, before
we use if_data.  This is done in other places through the
code as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-30 15:20:50 -05:00
Dave Chinner 904c17e683 xfs: finish removing IOP_* macros.
In optimising the CIL operations, some of the IOP_* macros for
calling log item operations were removed. Remove the rest of them as
Christoph requested.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Geoffrey Wehrman <gwehrman@sgi.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-30 14:14:35 -05:00
Dave Chinner 239567033c xfs: inode log reservations are too small
We've been seeing occasional problems with log space leaks and
transaction underruns such as this for some time:

 XFS (dm-0): xlog_write: reservation summary:
   trans type  = FSYNC_TS (36)
   unit res    = 2740 bytes
   current res = -4 bytes
   total reg   = 0 bytes (o/flow = 0 bytes)
   ophdrs      = 0 (ophdr space = 0 bytes)
   ophdr + reg = 0 bytes
   num regions = 0

Turns out that xfstests generic/311 is reliably reproducing this
problem with the test it runs at sequence 16 of it execution. It is
a 100% reliable reproducer with the mkfs configuration of "-b
size=1024 -m crc=1" on a 10GB scratch device.

The problem? Inode forks in btree format are logged in memory
format, not disk format (i.e. bmbt format, not bmdr format). That
means there is a btree block header being logged, when such a
structure is never written to the inode fork in bmdr format. The
bmdr header in the inode is only 4 bytes, while the bmbt header is
24 bytes for v4 filesystems and 72 bytes for v5 filesystems.

We currently reserve the inode size plus the rounded up overhead of
a logging a buffer, which is 128 bytes. That means the reservation
for a 512 byte inode is 640 bytes. What we can actually log is:

	inode core, data and attr fork = 512 bytes
	inode log format + log op header = 56 + 12 = 68 bytes
	data fork bmbt hdr = 24/72 bytes
	attr fork bmbt hdr = 24/72 bytes

So, for a v2 inodes we can log at least 628 bytes, but if we split that
inode over the end of the log across log buffers, we need to also
another log op header, which takes us to 640 bytes. If there's
another reservation taken out of this that I haven't taken into
account (perhaps multiple iclog splits?) or I haven't corectly
calculated the bmbt format space used (entirely possible), then
we will overun it.

For v3 inodes the maximum is actually 724 bytes, and even a
single maximally sized btree format fork can blow it (652 bytes).
And that's exactly what is happening with the FSYNC_TS transaction
in the above output - it's consumed 644 bytes of space after the CIL
context took the space reserved for it (2100 bytes).

This problem has always been present in the XFS code - the btree
format inode forks have always been logged in this manner. Hence
there has always been the possibility of an overrun with such a
transaction. The CRC code has just exposed it frequently enough to
be able to debug and understand the root cause....

So, let's fix all the inode log space reservations.

[ I'm so glad we spent the effort to clean up the transaction
  reservation code. This is an easy fix now. ]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-30 13:59:30 -05:00
Brian Foster b121099d84 xfs: check correct status variable for xfs_inobt_get_rec() call
The call to xfs_inobt_get_rec() in xfs_dialloc_ag() passes 'j' as
the output status variable. The immediately following
XFS_WANT_CORRUPTED_GOTO() checks the value of 'i,' which is from
the previous lookup call and has already been checked. Fix the
corruption check to use 'j.'

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-30 13:48:35 -05:00
Dave Chinner d8914002a0 xfs: inode buffers may not be valid during recovery readahead
CRC enabled filesystems fail log recovery with 100% reliability on
xfstests xfs/085 with the following failure:

XFS (vdb): Mounting Filesystem
XFS (vdb): Starting recovery (logdev: internal)
XFS (vdb): Corruption detected. Unmount and run xfs_repair
XFS (vdb): bad inode magic/vsn daddr 144 #0 (magic=0)
XFS: Assertion failed: 0, file: fs/xfs/xfs_inode_buf.c, line: 95

The problem is that the inode buffer has not been recovered before
the readahead on the inode buffer is issued. The checkpoint being
recovered actually allocates the inode chunk we are doing readahead
from, so what comes from disk during readahead is essentially
random and the verifier barfs on it.

This inode buffer readahead problem affects non-crc filesystems,
too, but xfstests does not trigger it at all on such
configurations....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-30 13:45:49 -05:00
Dave Chinner 50d5c8d8e9 xfs: check LSN ordering for v5 superblocks during recovery
Log recovery has some strict ordering requirements which unordered
or reordered metadata writeback can defeat. This can occur when an
item is logged in a transaction, written back to disk, and then
logged in a new transaction before the tail of the log is moved past
the original modification.

The result of this is that when we read an object off disk for
recovery purposes, the buffer that we read may not contain the
object type that recovery is expecting and hence at the end of the
checkpoint being recovered we have an invalid object in memory.

This isn't usually a problem, as recovery will then replay all the
other checkpoints and that brings the object back to a valid and
correct state, but the issue is that while the object is in the
invalid state it can be flushed to disk. This results in the object
verifier failing and triggering a corruption shutdown of log
recover. This is correct behaviour for the verifiers - the problem
is that we are not detecting that the object we've read off disk is
newer than the transaction we are replaying.

All metadata in v5 filesystems has the LSN of it's last modification
stamped in it. This enabled log recover to read that field and
determine the age of the object on disk correctly. If the LSN of the
object on disk is older than the transaction being replayed, then we
replay the modification. If the LSN of the object matches or is more
recent than the transaction's LSN, then we should avoid overwriting
the object as that is what leads to the transient corrupt state.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-30 13:44:53 -05:00
Dave Chinner b58fa554e9 xfs: btree block LSN escaping to disk uninitialised
When testing LSN ordering code for v5 superblocks, it was discovered
that the the LSN embedded in the generic btree blocks was
occasionally uninitialised. These values didn't get written to disk
by metadata writeback - they got written by previous transactions in
log recovery.

The issue is here that the when the block is first allocated and
initialised, the LSN field was not initialised - it gets overwritten
before IO is issued on the buffer - but the value that is logged by
transactions that modify the header before it is written to disk
(and initialised) contain garbage. Hence the first recovery of the
buffer will stamp garbage into the LSN field, and that can cause
subsequent transactions to not replay correctly.

The fix is simply to initialise the bb_lsn field to zero when we
initialise the block for the first time.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-30 13:43:34 -05:00
Dave Chinner 3780437612 XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568
The calculation doesn't take into account the size of the dir v3
header, so overestimates the hash entries in a node. This causes
directory buffer overruns when splitting and merging nodes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-30 09:48:59 -05:00
Dave Chinner 0f0d334595 xfs: fix bad dquot buffer size in log recovery readahead
xfstests xfs/087 fails 100% reliably with this assert:

XFS (vdb): Mounting Filesystem
XFS (vdb): Starting recovery (logdev: internal)
XFS: Assertion failed: bp->b_flags & XBF_STALE, file: fs/xfs/xfs_buf.c, line: 548

while trying to read a dquot buffer in xlog_recover_dquot_ra_pass2().

The issue is that the buffer length to read that is passed to
xfs_buf_readahead is in units of filesystem blocks, not disk blocks.
(i.e. FSB, not daddr). Fix it but putting the correct conversion in
place.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-29 10:51:35 -05:00
Dave Chinner 84a5b7300c xfs: don't account buffer cancellation during log recovery readahead
When doing readhaead in log recovery, we check to see if buffers are
cancelled before doing readahead. If we find a cancelled buffer,
however, we always decrement the reference count we have on it, and
that means that readahead is causing a double decrement of the
cancelled buffer reference count.

This results in log recovery *replaying cancelled buffers* as the
actual recovery pass does not find the cancelled buffer entry in the
commit phase of the second pass across a transaction. On debug
kernels, this results in an ASSERT failure like so:

XFS: Assertion failed: !(flags & XFS_BLF_CANCEL), file: fs/xfs/xfs_log_recover.c, line: 1815

xfstests generic/311 reproduces this ASSERT failure with 100%
reproducability.

Fix it by making readahead only peek at the buffer cancelled state
rather than the full accounting that xlog_check_buffer_cancelled()
does.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-29 10:37:06 -05:00
Dan Carpenter 0d0ab120d1 xfs: check for underflow in xfs_iformat_fork()
The "di_size" variable comes from the disk and it's a signed 64 bit.
We check the upper limit but we should check for negative numbers as
well.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-26 11:28:08 -05:00
Fengguang Wu 98f7462c43 xfs: xfs_dir3_sfe_put_ino can be static
TO: Dave Chinner <david@fromorbit.com>
CC: Ben Myers <bpm@sgi.com>
CC: linux-kernel@vger.kernel.org 

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-26 11:27:40 -05:00
Zhi Yong Wu 00574da199 xfs: introduce object readahead to log recovery
It can take a long time to run log recovery operation because it is
single threaded and is bound by read latency. We can find that it took
most of the time to wait for the read IO to occur, so if one object
readahead is introduced to log recovery, it will obviously reduce the
log recovery time.

Log recovery time stat:

          w/o this patch        w/ this patch

real:        0m15.023s             0m7.802s
user:        0m0.001s              0m0.001s
sys:         0m0.246s              0m0.107s

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-23 14:32:50 -05:00
Jie Liu 8d1d40832b xfs: Simplify xfs_ail_min() with list_first_entry_or_null()
At xfs_ail_min(), we do check if the AIL list is empty or not before
returning the first item in it with list_empty() and list_first_entry().

This can be simplified a bit with a new list operation routine that is
the list_first_entry_or_null() which has been introduced by:

commit 6d7581e62f
    list: introduce list_first_entry_or_null

v2: make xfs_ail_min() as a static inline function and move it to
    xfs_trans_priv.h as per Dave Chinner's comments.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-23 12:57:43 -05:00
Richard Weinberger 46677e679f xfs: Register hotcpu notifier after initialization
Currently the code initializizes mp->m_icsb_mutex and other things
_after_ register_hotcpu_notifier().
As the notifier takes mp->m_icsb_mutex it can happen
that it takes the lock before it's initialization.

Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-22 14:05:27 -05:00
Mark Tinguely 3e3c51cee9 xfs: add xfs sb v4 support for dirent filetype field
Add XFS superblock v4 support for the file type field in the
directory entry feature.

This support adds a feature bit for version 4 superblocks and
leaves the original superblock 5 incompatibility bit.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Geoffrey Wehrman <gwehrman@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-22 08:49:59 -05:00
Dave Chinner 1c55cece08 xfs: Add write support for dirent filetype field
Add support to propagate and add filetype values into the on-disk
directs. This involves passing the filetype into the xfs_da_args
structure along with the name and namelength for direct operations,
and encoding it into the dirent at the same time we write the inode
number into the dirent.

With write support, add the feature flag to the
XFS_SB_FEAT_INCOMPAT_ALL mask so we can now mount filesystems with
this feature set.

Performance of directory recursion is now much improved. Parallel
walk of ~50 million directory entries across hundreds of directories
improves significantly. Unpatched, no CRCs:

Walking via ls -R

real    3m19.886s
user    6m36.960s
sys     28m19.087s

THis is doing roughly 500 getdents() calls per second, and 250,000
inode lookups per second to determine the inode type at roughly
17,000 read IOPS. CPU usage is 90% kernel space.

With dtype support patched in and the fileset recreated with CRCs
enabled:

Walking via ls -R

real    0m31.316s
user    6m32.975s
sys     0m21.111s

This is doing roughly 3500 getdents() calls per second at 16,000
IOPS. There are no inode lookups at all. CPU usages is almost 100%
userspace.

This is a big win for recursive directory walks that only need to
find file names and file types.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-22 08:44:49 -05:00
Dave Chinner 0cb97766f2 xfs: Add read-only support for dirent filetype field
Add support for the file type field in directory entries so that
readdir can return the type of the inode the dirent points to to
userspace without first having to read the inode off disk.

The encoding of the type field is a single byte that is added to the
end of the directory entry name length. For all intents and
purposes, it appends a "hidden" byte to the name field which
contains the type information. As the directory entry is already of
dynamic size, helpers are already required to access and decode the
direct entry structures.

Hence the relevent extraction and iteration helpers are updated to
understand the hidden byte.  Helpers for reading and writing the
filetype field from the directory entries are also added. Only the
read helpers are used by this patch.  It also adds all the code
necessary to read the type information out of the dirents on disk.

Further we add the superblock feature bit and helpers to indicate
that we understand the on-disk format change. This is not a
compatible change - existing kernels cannot read the new format
successfully - so an incompatible feature flag is added. We don't
yet allow filesystems to mount with this flag yet - that will be
added once write support is added.

Finally, the code to take the type from the VFS, convert it to an
XFS on-disk type and put it into the xfs_name structures passed
around is added, but the directory code does not use this field yet.
That will be in the next patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-22 08:40:24 -05:00
Chandra Seetharaman 5d5e3d5760 xfs: Add support for the Q_XGETQSTATV
For XFS, add support for Q_XGETQSTATV quotactl command.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 17:00:38 -05:00
Zhi Yong Wu c2bfbc9b48 xfs: fix the comment of xfs_mountfs()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:53:07 -05:00
Zhi Yong Wu 2533787a43 xfs: fix the comment of xfs_sb_quiet_read_verify()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:51:49 -05:00
Zhi Yong Wu 8ba701ee9e xfs: fix the comment of xlog_recover_do_dquot_buffer()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:49:06 -05:00
Zhi Yong Wu 8e159e72e2 xfs: fix the comment of xfs_log_unmount_write()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:47:18 -05:00
Zhi Yong Wu 0b8182dba6 xfs: fix the comment of xfs_ifree_cluster()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:44:36 -05:00
Zhi Yong Wu 2f21ff1cca xfs: fix the comment of xfs_ialloc_ag_select()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:42:41 -05:00
Zhi Yong Wu b3c496343b xfs: fix the comment of xfs_extent_busy_update_extent()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:41:08 -05:00
Zhi Yong Wu 8b4ad79cc6 xfs: fix the comment of xfs_setsize_buftarg_early()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:40:39 -05:00
Zhi Yong Wu ad4809bf22 xfs: fix the comment of xfs_bmap_punch_delalloc_range()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:40:16 -05:00
Zhi Yong Wu 02bb4873db xfs: fix the comment of xfs_bmap_last_before()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:36:28 -05:00
Zhi Yong Wu a97f4df7b5 xfs: fix the comment of xfs_bmap_validate_ret()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:35:55 -05:00
Zhi Yong Wu 8be11e92b6 xfs: fix the comment of xfs_bmap_count_tree()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:35:32 -05:00
Zhi Yong Wu c7c1a7d8bb xfs: rename bio_add_buffer() to xfs_bio_add_buffer()
Follow up with xfs naming style.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:35:00 -05:00
Zhi Yong Wu 0a94da24b9 xfs: fix the comment of xlog_find_head()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:34:34 -05:00
Zhi Yong Wu 34be5ff378 xfs: fix the comment of xlog_recover_buffer_pass2()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:30:53 -05:00
Zhi Yong Wu 5c75390924 xfs: remove two unused macro definitions in xfs_linux.h
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:30:23 -05:00
Zhi Yong Wu 1cb9386354 xfs: fix the comment of xfs_btree_get_iroot()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:17:35 -05:00
Zhi Yong Wu f6c2734924 xfs: fix the comment of xfs_iroot_realloc()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:17:11 -05:00
Zhi Yong Wu 7c3e664051 xfs: remove one blank line in xfs_btree_make_block_unfull()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:16:07 -05:00
Zhi Yong Wu ac0e300fa5 xfs: fix the comment of xlog_write_setup_copy()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 14:59:31 -05:00
Zhi Yong Wu 99e738b783 xfs: fix the comment of xfs_mod_incore_sb_unlocked()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 14:59:05 -05:00
Zhi Yong Wu 49d3da149c xfs: fix the comment of xfs_btree_lookup()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 14:45:23 -05:00
Zhi Yong Wu b46fe8259b xfs: fix the comment of xfs_buf_free()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 14:44:39 -05:00