When we need to go from the log to the AIL, we have to go via the
xfs_mount. Add a xfs_ail pointer to the log so we can go directly to the
AIL associated with the log.
SGI-PV: 988143
SGI-Modid: xfs-linux-melb:xfs-kern:32351a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Bring the ail lock inside the struct xfs_ail. This means the AIL can be
entirely manipulated via the struct xfs_ail rather than needing both the
struct xfs_mount and the struct xfs_ail.
SGI-PV: 988143
SGI-Modid: xfs-linux-melb:xfs-kern:32350a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
With the new cursor interface, it makes sense to make all the traversing
code use the cursor interface and make the old one go away. This means
more of the AIL interfacing is done by passing struct xfs_ail pointers
around the place instead of struct xfs_mount pointers.
We can replace the use of xfs_trans_first_ail() in xfs_log_need_covered()
as it is only checking if the AIL is empty. We can do that with a call to
xfs_trans_ail_tail() instead, where a zero LSN returned indicates and
empty AIL...
SGI-PV: 988143
SGI-Modid: xfs-linux-melb:xfs-kern:32348a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
To replace the current generation number ensuring sanity of the AIL
traversal, replace it with an external cursor that is linked to the AIL.
Basically, we store the next item in the cursor whenever we want to drop
the AIL lock to do something to the current item. When we regain the lock.
the current item may already be free, so we can't reference it, but the
next item in the traversal is already held in the cursor.
When we move or delete an object, we search all the active cursors and if
there is an item match we clear the cursor(s) that point to the object.
This forces the traversal to restart transparently.
We don't invalidate the cursor on insert because the cursor still points
to a valid item. If the intem is inserted between the current item and the
cursor it does not matter; the traversal is considered to be past the
insertion point so it will be picked up in the next traversal.
Hence traversal restarts pretty much disappear altogether with this method
of traversal, which should substantially reduce the overhead of pushing on
a busy AIL.
Version 2 o add restart logic o comment cursor interface o minor cleanups
SGI-PV: 988143
SGI-Modid: xfs-linux-melb:xfs-kern:32347a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
structures.
Always use the generic xfs_btree_block type instead of the short / long
structures. Add XFS_BTREE_SBLOCK_LEN / XFS_BTREE_LBLOCK_LEN defines for
the length of a short / long form block. The rationale for this is that we
will grow more btree block header variants to support CRCs and other RAS
information, and always accessing them through the same datatype with
unions for the short / long form pointers makes implementing this much
easier.
SGI-PV: 988146
SGI-Modid: xfs-linux-melb:xfs-kern:32300a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Clean up the way the maximum and minimum records for the btree blocks are
calculated. For the alloc and inobt btrees all the values are
pre-calculated in xfs_mount_common, and we switch the current loop around
the ugly generic macros that use cpp token pasting to generate type names
to two small helpers in normal C code. For the bmbt and bmdr trees these
helpers also exist, but can be called during runtime, too. Here we also
kill various macros dealing with them and inline the logic into the
get_minrecs / get_maxrecs / get_dmaxrecs methods in xfs_bmap_btree.c.
Note that all these new helpers take an xfs_mount * argument which will be
needed to determine the size of a btree block once we add support for
extended btree blocks with CRCs and other RAS information.
SGI-PV: 988146
SGI-Modid: xfs-linux-melb:xfs-kern:32292a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Remove all the useless flags and code keyed off it in xfs_mountfs.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31831a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
kmem_free() function takes (ptr, size) arguments but doesn't actually use
second one.
This patch removes size argument from all callsites.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31050a
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_buf_associate_memory() can fail, but the return is never checked.
Propagate the error through XFS_BUF_SET_PTR() so that failures are
detected.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30831a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Catch errors from xfs_imap() in log recovery when we might be trying to
map an invalid inode number due to a corrupted log.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30829a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfsbdstrat() is declared to return an error. That is never checked because
the error is propagated by the xfs_buf_t that is passed through the
function.
Mark xfsbdstrat() as returning void and comment the prototype on the
methods needed for error checking.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30823a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_free_extent() can fail, but log recovery never bothers to check if it
successfully free the extent it was supposed to. This could lead to silent
corruption during log recovery. Abort log recovery if we fail to free an
extent.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30801a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_trans_commit() can return errors when there are problems in the
transaction subsystem. They are indicative that the entire transaction may
be incomplete, and hence the error should be propagated as there is a good
possibility that there is something fatally wrong in the filesystem. Catch
and propagate or warn about commit errors in the places where they are
currently ignored.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30795a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_trans_reserve() reports errors that should not be ignored. For
example, a shutdown filesystem will report errors through
xfs_trans_reserve() to prevent further changes from being attempted on a
damaged filesystem. Catch and propagate all error conditions from
xfs_trans_reserve().
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30794a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Most VN_RELE calls either directly contain a XFS_ITOV or have the
corresponding xfs_inode already in scope. Use the IRELE helper instead of
VN_RELE to clarify the code. With a little more work we can kill VN_RELE
altogether and define IRELE in terms of iput directly.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30710a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When pdflush is writing back inodes, it can get stuck on inode cluster
buffers that are currently under I/O. This occurs when we write data to
multiple inodes in the same inode cluster at the same time.
Effectively, delayed allocation marks the inode dirty during the data
writeback. Hence if the inode cluster was flushed during the writeback of
the first inode, the writeback of the second inode will block waiting for
the inode cluster write to complete before writing it again for the newly
dirtied inode.
Basically, we want to avoid this from happening so we don't block pdflush
and slow down all of writeback. Hence we introduce a non-blocking async
inode flush flag that pdflush uses. If this flag is set, we use
non-blocking operations (e.g. try locks) whereever we can to avoid
blocking or extra I/O being issued.
SGI-PV: 970925
SGI-Modid: xfs-linux-melb:xfs-kern:30501a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Remove macro-to-small-function indirection from xfs_sb.h, and remove some
which are completely unused.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30528a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
A problem was reported where a system panicked in log recovery due to a
corrupt log record. The cause of the corruption is not known but this
change will at least prevent a crash for this specific scenario. Log
recovery definitely needs some more work in this area.
SGI-PV: 974151
SGI-Modid: xfs-linux-melb:xfs-kern:30318a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Sparse trips over the locking order in xlog_recover_do_efd_trans() when
xfs_trans_delete_ail() drops the ail lock. Because the unlock is
conditional, we need to either annotate with a "fake unlock" or change the
structure of the code so sparse thinks the function always unlocks.
Reordering the code makes it simpler, so do that.
SGI-PV: 972755
SGI-Modid: xfs-linux-melb:xfs-kern:30003a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
These are mostly locking annotations, marking things static, casts where
needed and declaring stuff in header files.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30002a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Mostly trivial conversion with one exceptions: h_num_logops was kept in
native endian previously and only converted to big endian in xlog_sync,
but we always keep it big endian now. With todays cpus fast byteswap
instructions that's not an issue but the new variant keeps the code clean
and maintainable.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:29821a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
- the various assign lsn macros are replaced by a single inline,
xlog_assign_lsn, which is equivalent to ASSIGN_ANY_LSN_HOST except
for a more sane calling convention. ASSIGN_LSN_DISK is replaced
by xlog_assign_lsn and a manual bytespap, and ASSIGN_LSN by the same,
except we pass the cycle and block arguments explicitly instead of a
log paramter. The latter two variants only had 2, respectively one
user anyway.
- the GET_CYCLE is replaced by a xlog_get_cycle inline with exactly the
same calling conventions.
- GET_CLIENT_ID is replaced by xlog_get_client_id which leaves away
the unused arch argument. Instead of conditional defintions
depending on host endianess we now do an unconditional swap and shift
then, which generates equal code.
- the unused XLOG_SET macro is removed.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:29820a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
- the various assign lsn macros are replaced by a single inline,
xlog_assign_lsn, which is equivalent to ASSIGN_ANY_LSN_HOST except
for a more sane calling convention. ASSIGN_LSN_DISK is replaced
by xlog_assign_lsn and a manual bytespap, and ASSIGN_LSN by the same,
except we pass the cycle and block arguments explicitly instead of a
log paramter. The latter two variants only had 2, respectively one
user anyway.
- the GET_CYCLE is replaced by a xlog_get_cycle inline with exactly the
same calling conventions.
- GET_CLIENT_ID is replaced by xlog_get_client_id which leaves away
the unused arch argument. Instead of conditional defintions
depending on host endianess we now do an unconditional swap and shift
then, which generates equal code.
- the unused XLOG_SET macro is removed.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:29819a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Creates a new xfs_dsb_t that is __be annotated and keeps xfs_sb_t for the
incore one. xfs_xlatesb is renamed to xfs_sb_to_disk and only handles the
incore -> disk conversion. A new helper xfs_sb_from_disk handles the other
direction and doesn't need the slightly hacky table-driven approach
because we only ever read the full sb from disk.
The handling of shared r/o filesystems has been buggy on little endian
system and fixing this required shuffling around of some code in that
area.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29477a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Biggest bit is duplicating the dinode structure so we have one annotated for
native endianess and one for disk endianess. The other significant change
is that xfs_xlate_dinode_core is split into one helper per direction to
allow for proper annotations, everything else is trivial.
As a sidenode splitting out the incore dinode means we can move it into
xfs_inode.h in a later patch and severely improving on the include hell in
xfs.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29476a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The new xlog_recover_do_reg_buffer checks call be16_to_cpu on di_gen which
is a 32bit value so sparse rightly complains. Fortunately the warning is
harmless because we don't care for the value, but only whether it's
non-NULL. Due to that fact we can simply kill the endian swaps on this and
the previous di_mode check entirely.
SGI-PV: 969656
SGI-Modid: xfs-linux-melb:xfs-kern:29709a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Sparse now warns about comparing pointers to 0, so change all instance
where that happens to NULL instead.
SGI-PV: 968555
SGI-Modid: xfs-linux-melb:xfs-kern:29308a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
When we have a couple of hundred transactions on the fly at once, they all
typically modify the on disk superblock in some way.
create/unclink/mkdir/rmdir modify inode counts, allocation/freeing modify
free block counts.
When these counts are modified in a transaction, they must eventually lock
the superblock buffer and apply the mods. The buffer then remains locked
until the transaction is committed into the incore log buffer. The result
of this is that with enough transactions on the fly the incore superblock
buffer becomes a bottleneck.
The result of contention on the incore superblock buffer is that
transaction rates fall - the more pressure that is put on the superblock
buffer, the slower things go.
The key to removing the contention is to not require the superblock fields
in question to be locked. We do that by not marking the superblock dirty
in the transaction. IOWs, we modify the incore superblock but do not
modify the cached superblock buffer. In short, we do not log superblock
modifications to critical fields in the superblock on every transaction.
In fact we only do it just before we write the superblock to disk every
sync period or just before unmount.
This creates an interesting problem - if we don't log or write out the
fields in every transaction, then how do the values get recovered after a
crash? the answer is simple - we keep enough duplicate, logged information
in other structures that we can reconstruct the correct count after log
recovery has been performed.
It is the AGF and AGI structures that contain the duplicate information;
after recovery, we walk every AGI and AGF and sum their individual
counters to get the correct value, and we do a transaction into the log to
correct them. An optimisation of this is that if we have a clean unmount
record, we know the value in the superblock is correct, so we can avoid
the summation walk under normal conditions and so mount/recovery times do
not change under normal operation.
One wrinkle that was discovered during development was that the blocks
used in the freespace btrees are never accounted for in the AGF counters.
This was once a valid optimisation to make; when the filesystem is full,
the free space btrees are empty and consume no space. Hence when it
matters, the "accounting" is correct. But that means the when we do the
AGF summations, we would not have a correct count and xfs_check would
complain. Hence a new counter was added to track the number of blocks used
by the free space btrees. This is an *on-disk format change*.
As a result of this, lazy superblock counters are a mkfs option and at the
moment on linux there is no way to convert an old filesystem. This is
possible - xfs_db can be used to twiddle the right bits and then
xfs_repair will do the format conversion for you. Similarly, you can
convert backwards as well. At some point we'll add functionality to
xfs_admin to do the bit twiddling easily....
SGI-PV: 964999
SGI-Modid: xfs-linux-melb:xfs-kern:28652a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
After filesystem recovery the superblock is re-read to bring in any
changes. If the per-cpu superblock counters are not re-initialized from
the superblock then the next time the per-cpu counters are disabled they
might overwrite the global counter with a bogus value.
SGI-PV: 957348
SGI-Modid: xfs-linux-melb:xfs-kern:27999a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
logged version of di_next_unlinked which is actually always stored in the
correct ondisk format. This was pointed out to us by Shailendra Tripathi.
And is evident in the xfs qa test of 121.
SGI-PV: 953263
SGI-Modid: xfs-linux-melb:xfs-kern:26044a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
64bit kernels allow recovery to handle both versions and do the necessary
decoding
SGI-PV: 952214
SGI-Modid: xfs-linux-melb:xfs-kern:26011a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>