Commit graph

5365 commits

Author SHA1 Message Date
Brian Foster
175d1a013e xfs: use ->t_dfops for all xfs_bmapi_write() callers
Attach ->t_dfops for all remaining callers of xfs_bmapi_write().
This prepares the latter to no longer require a separate dfops
parameter.

Note that xfs_symlink() already uses ->t_dfops. Fix up the local
references for consistency.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:12 -07:00
Brian Foster
2ba1372125 xfs: use ->t_dfops in dqalloc transaction
xfs_dquot_disk_alloc() receives a transaction from the caller and
passes a local dfops along to xfs_bmapi_write(). If we attach this
dfops to the transaction, we have to make sure to clear it before
returning to avoid invalid access of stack memory.

Since xfs_qm_dqread_alloc() is the only caller, pull dfops into the
caller and attach it to the transaction to eliminate this pattern
entirely.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:11 -07:00
Brian Foster
32a9b7c65c xfs: replace xfs_da_args->dfops accesses with ->t_dfops and remove
Now that xfs_da_args->dfops is always assigned from a ->t_dfops
pointer (or one that is immediately attached), replace all
downstream accesses of the former with the latter and remove the
field from struct xfs_da_args.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:11 -07:00
Brian Foster
d76e6ce8ed xfs: use ->t_dfops in extent split tx and remove param
Attach the local dfops to ->t_dfops of the extent split transaction.
Since this is the only caller of xfs_bmap_split_extent_at(), remove
the dfops parameter as well.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:10 -07:00
Brian Foster
0bd6207f83 xfs: remove dfops param in attr fork add path
Now that the attribute fork add tx carries dfops along with the
transaction, it is unnecessary to pass it down the stack. Remove the
dfops parameter and access ->t_dfops directly where necessary. This
patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:10 -07:00
Brian Foster
40d03ac6aa xfs: use ->t_dfops for attr set/remove operations
Attach the local dfops to the transaction allocated for xattr add
and remove operations. Add an earlier initialization in
xfs_attr_remove() to ensure the structure is valid if it remains
unused at transaction commit time.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:09 -07:00
Brian Foster
813d08cb6d xfs: use ->t_dfops for recovery of [b|c]ui log items
Log recovery passes down a central dfops structure to recovery
handlers for bui and cui log items. Each of these handlers allocates
and commits a transaction and defers any remaining operations to be
completed by the main recovery sequence.

Since dfops outlives the transaction in this context, set and clear
->t_dfops appropriately such that the *_finish_item() paths and
below (i.e., xfs_bmapi*()) can expect to find the dfops in the
transaction without it being committed with the dfops attached. This
is required because transaction commit expects that an associated
dfops is finished and in this context the dfops may be populated at
commit time.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:09 -07:00
Brian Foster
c9cfdb3811 xfs: remove dfops param from high level dirname calls
All callers of the directory create, rename and remove interfaces
already associate the dfops with the transaction. Drop the dfops
parameters in these calls in preparation for further cleanups in the
layers below. This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:08 -07:00
Brian Foster
0e0417f3e5 xfs: remove dfops parameter from ifree call stack
The inode free callchain starting in xfs_inactive_ifree() already
associates its dfops with the transaction. It still passes the dfops
on the stack down through xfs_difree_inobt(), however.

Clean up the call stack and reference dfops directly from the
transaction. This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:07 -07:00
Brian Foster
6aa6718439 xfs: rename xfs_trans ->t_agfl_dfops to ->t_dfops
The ->t_agfl_dfops field is currently used to defer agfl block frees
from associated transaction contexts. While all known problematic
contexts have already been updated to use ->t_agfl_dfops, the
broader goal is defer agfl frees from all callers that already use a
deferred operations structure. Further, the transaction field
facilitates a good amount of code clean up where the transaction and
dfops have historically been passed down through the stack
separately.

Rename the field to something more generic to prepare to use it as
such throughout XFS. This patch does not change behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:07 -07:00
Brian Foster
8a74938649 xfs: cow unwritten conversion uses uninitialized dfops
A couple COW fork unwritten extent conversion helpers pass an
uninitialized dfops pointer to xfs_bmapi_write(). This does not
cause problems because conversion does not use a transaction or the
dfops structure for the COW fork.  Drop the uninitialized usage of
dfops in these codepaths and pass NULL along to xfs_bmapi_write()
instead.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:06 -07:00
Christoph Hellwig
98c1a7c0ec xfs: update my copyrights for the writeback and iomap code
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:06 -07:00
Christoph Hellwig
82cb14175e xfs: add support for sub-pagesize writeback without buffer_heads
Switch to using the iomap_page structure for checking sub-page uptodate
status and track sub-page I/O completion status, and remove large
quantities of boilerplate code working around buffer heads.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:05 -07:00
Christoph Hellwig
ac8ee54669 xfs: allow writeback on pages without buffer heads
Disable the IOMAP_F_BUFFER_HEAD flag on file systems with a block size
equal to the page size, and deal with pages without buffer heads in
writeback.  Thanks to the previous refactoring this is basically trivial
now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:04 -07:00
Christoph Hellwig
8e1f065bea xfs: refactor the tail of xfs_writepage_map
Rejuggle how we deal with the different error vs non-error and have
ioends vs not have ioend cases to keep the fast path streamlined, and
the duplicate code at a minimum.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:04 -07:00
Christoph Hellwig
1b65d3dd2d xfs: remove xfs_start_page_writeback
This helper only has two callers, one of them with a constant error
argument.  Remove it to make pending changes to the code a little easier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:03 -07:00
Christoph Hellwig
6d465e8953 xfs: move all writeback buffer_head manipulation into xfs_map_at_offset
This keeps it in a single place so it can be made otional more easily.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:03 -07:00
Christoph Hellwig
3faed66764 xfs: don't look at buffer heads in xfs_add_to_ioend
Calculate all information for the bio based on the passed in information
without requiring a buffer_head structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:02 -07:00
Christoph Hellwig
889c65b3f6 xfs: remove the imap_valid flag
Simplify the way we check for a valid imap - we know we have a valid
mapping after xfs_map_blocks returned successfully, and we know we can
call xfs_imap_valid on any imap, as it will always fail on a
zero-initialized map.

We can also remove the xfs_imap_valid function and fold it into
xfs_map_blocks now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:02 -07:00
Christoph Hellwig
3345746ef3 xfs: simplify xfs_map_blocks by using xfs_iext_lookup_extent directly
xfs_bmapi_read adds zero value in xfs_map_blocks.  Replace it with a
direct call to the low-level extent lookup function.

Note that we now always pass a 0 length to the trace points as we ask
for an unspecified len.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:02 -07:00
Christoph Hellwig
060d4eaa0b xfs: remove xfs_reflink_find_cow_mapping
We only have one caller left, and open coding the simple extent list
lookup in it allows us to make the code both more understandable and
reuse calculations and variables already present.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:01 -07:00
Christoph Hellwig
c3a2f9fff1 xfs: remove the now unused XFS_BMAPI_IGSTATE flag
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:01 -07:00
Dave Chinner
e2f6ad4624 xfs: make xfs_writepage_map extent map centric
xfs_writepage_map() iterates over the bufferheads on a page to decide
what sort of IO to do and what actions to take.  However, when it comes
to reflink and deciding when it needs to execute a COW operation, we no
longer look at the bufferhead state but instead we ignore than and look
up internal state held in the COW fork extent list.

This means xfs_writepage_map() is somewhat confused. It does stuff, then
ignores it, then tries to handle the impedence mismatch by shovelling the
results inside the existing mapping code.  It works, but it's a bit of a
mess and it makes it hard to fix the cached map bug that the writepage
code currently has.

To unify the two different mechanisms, we first have to choose a direction.
That's already been set - we're de-emphasising bufferheads so they are no
longer a control structure as we need to do taht to allow for eventual
removal.  Hence we need to move away from looking at bufferhead state to
determine what operations we need to perform.

We can't completely get rid of bufferheads yet - they do contain some
state that is absolutely necessary, such as whether that part of the page
contains valid data or not (buffer_uptodate()).  Other state in the
bufferhead is redundant:

	BH_dirty - the page is dirty, so we can ignore this and just
		write it
	BH_delay - we have delalloc extent info in the DATA fork extent
		tree
	BH_unwritten - same as BH_delay
	BH_mapped - indicates we've already used it once for IO and it is
		mapped to a disk address. Needs to be ignored for COW
		blocks.

The BH_mapped flag is an interesting case - it's supposed to indicate that
it's already mapped to disk and so we can just use it "as is".  In theory,
we don't even have to do an extent lookup to find where to write it too,
but we have to do that anyway to determine we are actually writing over a
valid extent.  Hence it's not even serving the purpose of avoiding a an
extent lookup during writeback, and so we can pretty much ignore it.
Especially as we have to ignore it for COW operations...

Therefore, use the extent map as the source of information to tell us
what actions we need to take and what sort of IO we should perform.  The
first step is to have xfs_map_blocks() set the io type according to what
it looks up.  This means it can easily handle both normal overwrite and
COW cases.  The only thing we also need to add is the ability to return
hole mappings.

We need to return and cache hole mappings now for the case of multiple
blocks per page.  We no longer use the BH_mapped to indicate a block over
a hole, so we have to get that info from xfs_map_blocks().  We cache it so
that holes that span two pages don't need separate lookups.  This allows us
to avoid ever doing write IO over a hole, too.

Now that we have xfs_map_blocks() returning both a cached map and the type
of IO we need to perform, we can rewrite xfs_writepage_map() to drop all
the bufferhead control. It's also much simplified because it doesn't need
to explicitly handle COW operations.  Instead of iterating bufferheads, it
iterates blocks within the page and then looks up what per-block state is
required from the appropriate bufferhead.  It then validates the cached
map, and if it's not valid, we get a new map.  If we don't get a valid map
or it's over a hole, we skip the block.

At this point, we have to remap the bufferhead via xfs_map_at_offset().
As previously noted, we had to do this even if the buffer was already
mapped as the mapping would be stale for XFS_IO_DELALLOC, XFS_IO_UNWRITTEN
and XFS_IO_COW IO types.  With xfs_map_blocks() now controlling the type,
even XFS_IO_OVERWRITE types need remapping, as converted-but-not-yet-
written delalloc extents beyond EOF can be reported at XFS_IO_OVERWRITE.
Bufferheads that span such regions still need their BH_Delay flags cleared
and their block numbers calculated, so we now unconditionally map each
bufferhead before submission.

But wait! There's more - remember the old "treat unwritten extents as
holes on read" hack?  Yeah, that means we can have a dirty page with
unmapped, unwritten bufferheads that contain data!  What makes these so
special is that the unwritten "hole" bufferheads do not have a valid block
device pointer, so if we attempt to write them xfs_add_to_ioend() blows
up. So we make xfs_map_at_offset() do the "realtime or data device"
lookup from the inode and ignore what was or wasn't put into the
bufferhead when the buffer was instantiated.

The astute reader will have realised by now that this code treats
unwritten extents in multiple-blocks-per-page situations differently.
If we get any combination of unwritten blocks on a dirty page that contain
valid data in the page, we're going to convert them to real extents.  This
can actually be a win, because it means that pages with interleaving
unwritten and written blocks will get converted to a single written extent
with zeros replacing the interspersed unwritten blocks.  This is actually
good for reducing extent list and conversion overhead, and it means we
issue a contiguous IO instead of lots of little ones.  The downside is
that we use up a little extra IO bandwidth.  Neither of these seem like a
bad thing given that spinning disks are seek sensitive, and SSDs/pmem have
bandwidth to burn and the lower Io latency/CPU overhead of fewer, larger
IOs will result in better performance on them...

As a result of all this, the only state we actually care about from the
bufferhead is a single flag - BH_Uptodate. We still use the bufferhead to
pass some information to the bio via xfs_add_to_ioend(), but that is
trivial to separate and pass explicitly.  This means we really only need
1 bit of state per block per page from the buffered write path in the
writeback path.  Everything else we do with the bufferhead is purely to
make the buffered IO front end continue to work correctly. i.e we've
pretty much marginalised bufferheads in the writeback path completely.

Signed-off-By: Dave Chinner <dchinner@redhat.com>
[hch: forward port, refactor and split off bits into other commits]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:00 -07:00
Christoph Hellwig
6a4c950136 xfs: rename the offset variable in xfs_writepage_map
Calling it file_offset makes the usage more clear, especially with
a new poffset variable that will be added soon for the offset inside
the page.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:00 -07:00
Christoph Hellwig
5c665e5b5a xfs: remove xfs_map_cow
We can handle the existing cow mapping case as a special case directly
in xfs_writepage_map, and share code for allocating delalloc blocks
with regular I/O in xfs_map_blocks.  This means we need to always
call xfs_map_blocks for reflink inodes, but we can still skip most of
the work if it turns out that there is no COW mapping overlapping the
current block.

As a subtle detail we need to start caching holes in the wpc to deal
with the case of COW reservations between EOF.  But we'll need that
infrastructure later anyway, so this is no big deal.

Based on a patch from Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:59 -07:00
Christoph Hellwig
fca8c80542 xfs: remove xfs_reflink_trim_irec_to_next_cow
We already have to check for overlapping COW extents everytime we
come back to a page in xfs_writepage_map / xfs_map_cow, so this
additional trim is not required.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:59 -07:00
Christoph Hellwig
a7b28f72ab xfs: don't use XFS_BMAPI_IGSTATE in xfs_map_blocks
We want to be able to use the extent state as a reliably indicator for
the type of I/O, and stop using the buffer head state.  For this we
need to stop using the XFS_BMAPI_IGSTATE so that we don't see merged
extents of different types.

Based on a patch from Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:59 -07:00
Christoph Hellwig
c57371a16d xfs: don't clear imap_valid for a non-uptodate buffers
Finding a buffer that isn't uptodate doesn't invalidate the mapping for
any given block.  The last_sector check will already take care of starting
another ioend as soon as we find any non-update buffer, and if the current
mapping doesn't include the next uptodate buffer the xfs_imap_valid check
will take care of it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:58 -07:00
Christoph Hellwig
91cdfd1761 xfs: do not set the page uptodate in xfs_writepage_map
We already track the page uptodate status based on the buffer uptodate
status, which is updated whenever reading or zeroing blocks.

This code has been there since commit a ptool commit in 2002, which
claims to:

    "merge" the 2.4 fsx fix for block size < page size to 2.5.  This needed
    major changes to actually fit.

and isn't present in other writepage implementations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:58 -07:00
Christoph Hellwig
d438017757 xfs: move locking into xfs_bmap_punch_delalloc_range
Both callers want the same looking, so do it only once.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:57 -07:00
Christoph Hellwig
0362572138 xfs: simplify xfs_aops_discard_page
Instead of looking at the buffer heads to see if a block is delalloc just
call xfs_bmap_punch_delalloc_range on the whole page - this will leave
any non-delalloc block intact and handle the iteration for us.  As a side
effect one more place stops caring about buffer heads and we can remove the
xfs_check_page_type function entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:57 -07:00
Christoph Hellwig
8b2e77c163 xfs: use iomap for blocksize == PAGE_SIZE readpage and readpages
For file systems with a block size that equals the page size we never do
partial reads, so we can use the buffer_head-less iomap versions of
readpage and readpages without conflicting with the buffer_head structures
create later in write_begin.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:56 -07:00
Darrick J. Wong
c2efdfc100 Merge branch 'iomap-4.19-merge' into xfs-4.19-merge 2018-07-11 22:24:40 -07:00
Miklos Szeredi
87eb5eb242 vfs: dedupe: rationalize args
Clean up f_op->dedupe_file_range() interface.

1) Use loff_t for offsets and length instead of u64
2) Order the arguments the same way as {copy|clone}_file_range().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-06 23:57:03 +02:00
Miklos Szeredi
5740c99e9d vfs: dedupe: return int
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-06 23:57:03 +02:00
Carlos Maiolino
9991274fdd xfs: Initialize variables in xfs_alloc_get_rec before using them
Make sure we initialize *bno and *len, before jumping to out_bad_rec
label, and risk calling xfs_warn() with uninitialized variables.

Coverity: 100898
Coverity: 1437081
Coverity: 1437129
Coverity: 1437191
Coverity: 1437201
Coverity: 1437212
Coverity: 1437341
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-28 06:56:23 -07:00
Darrick J. Wong
d8cb5e4237 xfs: fix fdblocks accounting w/ RMAPBT per-AG reservation
In __xfs_ag_resv_init we incorrectly calculate the amount by which to
decrease fdblocks when reserving blocks for the rmapbt.  Because rmapbt
allocations do not decrease fdblocks, we must decrease fdblocks by the
entire size of the requested reservation in order to achieve our goal of
always having enough free blocks to satisfy an rmapbt expansion.

This is in contrast to the refcountbt/finobt, which /do/ subtract from
fdblocks whenever they allocate a block.  For this allocation type we
preserve the existing behavior where we decrease fdblocks only by the
requested reservation minus the size of the existing tree.

This fixes the problem where the available block counts reported by
statfs change across a remount if there had been an rmapbt size change
since mount time.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2018-06-24 12:00:12 -07:00
Darrick J. Wong
e53c4b5983 xfs: ensure post-EOF zeroing happens after zeroing part of a file
If a user asks us to zero_range part of a file, the end of the range is
EOF, and not aligned to a page boundary, invoke writeback of the EOF
page to ensure that the post-EOF part of the page is zeroed.  This
ensures that we don't expose stale memory contents via mmap, if in a
clumsy manner.

Found by running generic/127 when it runs zero_range and mapread at EOF
one after the other.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2018-06-24 11:56:36 -07:00
Darrick J. Wong
a3a374bf18 xfs: fix off-by-one error in xfs_rtalloc_query_range
In commit 8ad560d256 ("xfs: strengthen rtalloc query range checks")
we strengthened the input parameter checks in the rtbitmap range query
function, but introduced an off-by-one error in the process.  The call
to xfs_rtfind_forw deals with the high key being rextents, but we clamp
the high key to rextents - 1.  This causes the returned results to stop
one block short of the end of the rtdev, which is incorrect.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-06-24 11:56:36 -07:00
Darrick J. Wong
232d0a24b0 xfs: fix uninitialized field in rtbitmap fsmap backend
Initialize the extent count field of the high key so that when we use
the high key to synthesize an 'unknown owner' record (i.e. used space
record) at the end of the queried range we have a field with which to
compute rm_blockcount.  This is not strictly necessary because the
synthesizer never uses the rm_blockcount field, but we can shut up the
static code analysis anyway.

Coverity-id: 1437358
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-06-24 11:56:36 -07:00
Darrick J. Wong
5bd88d1539 xfs: recheck reflink state after grabbing ILOCK_SHARED for a write
The reflink iflag could have changed since the earlier unlocked check,
so if we got ILOCK_SHARED for a write and but we're now a reflink inode
we have to switch to ILOCK_EXCL and relock.

This helps us avoid blowing lock assertions in things like generic/166:

XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/xfs_reflink.c, line: 383
WARNING: CPU: 1 PID: 24707 at fs/xfs/xfs_message.c:104 assfail+0x25/0x30 [xfs]
Modules linked in: deadline_iosched dm_snapshot dm_bufio ext4 mbcache jbd2 dm_flakey xfs libcrc32c dax_pmem device_dax nd_pmem sch_fq_codel af_packet [last unloaded: scsi_debug]
CPU: 1 PID: 24707 Comm: xfs_io Not tainted 4.18.0-rc1-djw #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
RIP: 0010:assfail+0x25/0x30 [xfs]
Code: ff 0f 0b c3 90 66 66 66 66 90 48 89 f1 41 89 d0 48 c7 c6 e8 ef 1b a0 48 89 fa 31 ff e8 54 f9 ff ff 80 3d fd ba 0f 00 00 75 03 <0f> 0b c3 0f 0b 66 0f 1f 44 00 00 66 66 66 66 90 48 63 f6 49 89 f9
RSP: 0018:ffffc90006423ad8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff880030b65e80 RCX: 0000000000000000
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffa01b0447
RBP: ffffc90006423c10 R08: 0000000000000000 R09: 0000000000000000
R10: ffff88003d43fc30 R11: f000000000000000 R12: ffff880077cda000
R13: 0000000000000000 R14: ffffc90006423c30 R15: ffffc90006423bf9
FS:  00007feba8986800(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000138ab58 CR3: 000000003d40a000 CR4: 00000000000006a0
Call Trace:
 xfs_reflink_allocate_cow+0x24c/0x3d0 [xfs]
 xfs_file_iomap_begin+0x6d2/0xeb0 [xfs]
 ? iomap_to_fiemap+0x80/0x80
 iomap_apply+0x5e/0x130
 iomap_dio_rw+0x2e0/0x400
 ? iomap_to_fiemap+0x80/0x80
 ? xfs_file_dio_aio_write+0x133/0x4a0 [xfs]
 xfs_file_dio_aio_write+0x133/0x4a0 [xfs]
 xfs_file_write_iter+0x7b/0xb0 [xfs]
 __vfs_write+0x16f/0x1f0
 vfs_write+0xc8/0x1c0
 ksys_pwrite64+0x74/0x90
 do_syscall_64+0x56/0x180
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-06-24 11:56:36 -07:00
Darrick J. Wong
f62cb48e43 xfs: don't allow insert-range to shift extents past the maximum offset
Zorro Lang reports that generic/485 blows an assert on a filesystem with
512 byte blocks.  The test tries to fallocate a post-eof extent at the
maximum file size and calls insert range to shift the extents right by
two blocks.  On a 512b block filesystem this causes startoff to overflow
the 54-bit startoff field, leading to the assert.

Therefore, always check the rightmost extent to see if it would overflow
prior to invoking the insert range machinery.

Reported-by: zlang@redhat.com
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200137
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-06-24 11:56:36 -07:00
Darrick J. Wong
aafe12cee0 xfs: don't trip over negative free space in xfs_reserve_blocks
If we somehow end up with a filesystem that has fewer free blocks than
the blocks set aside to avoid ENOSPC deadlocks, it's possible that the
free space calculation in xfs_reserve_blocks will spit out a negative
number (because percpu_counter_sum returns s64).  We fail to notice
this negative number and set fdblks_delta to it.  Now we increment
fdblocks(!) and the unsigned type of m_resblks means that we end up
setting a ridiculously huge m_resblks reservation.

Avoid this comedy of errors by detecting the negative free space and
returning -ENOSPC.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-06-24 11:56:36 -07:00
Darrick J. Wong
10ee25268e xfs: allow empty transactions while frozen
In commit e89c041338 ("xfs: implement the GETFSMAP ioctl") we
created the ability to obtain empty transactions.  These transactions
have no log or block reservations and therefore can't modify anything.
Since they're also NO_WRITECOUNT they can run while the fs is frozen,
so we don't need to WARN_ON about that usage.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-06-24 11:56:35 -07:00
Dave Chinner
e53946dbd3 xfs: xfs_iflush_abort() can be called twice on cluster writeback failure
When a corrupt inode is detected during xfs_iflush_cluster, we can
get a shutdown ASSERT failure like this:

XFS (pmem1): Metadata corruption detected at xfs_symlink_shortform_verify+0x5c/0xa0, inode 0x86627 data fork
XFS (pmem1): Unmount and run xfs_repair
XFS (pmem1): xfs_do_force_shutdown(0x8) called from line 3372 of file fs/xfs/xfs_inode.c.  Return address = ffffffff814f4116
XFS (pmem1): Corruption of in-memory data detected.  Shutting down filesystem
XFS (pmem1): xfs_do_force_shutdown(0x1) called from line 222 of file fs/xfs/libxfs/xfs_defer.c.  Return address = ffffffff814a8a88
XFS (pmem1): xfs_do_force_shutdown(0x1) called from line 222 of file fs/xfs/libxfs/xfs_defer.c.  Return address = ffffffff814a8ef9
XFS (pmem1): Please umount the filesystem and rectify the problem(s)
XFS: Assertion failed: xfs_isiflocked(ip), file: fs/xfs/xfs_inode.h, line: 258
.....
Call Trace:
 xfs_iflush_abort+0x10a/0x110
 xfs_iflush+0xf3/0x390
 xfs_inode_item_push+0x126/0x1e0
 xfsaild+0x2c5/0x890
 kthread+0x11c/0x140
 ret_from_fork+0x24/0x30

Essentially, xfs_iflush_abort() has been called twice on the
original inode that that was flushed. This happens because the
inode has been flushed to teh buffer successfully via
xfs_iflush_int(), and so when another inode is detected as corrupt
in xfs_iflush_cluster, the buffer is marked stale and EIO, and
iodone callbacks are run on it.

Running the iodone callbacks walks across the original inode and
calls xfs_iflush_abort() on it. When xfs_iflush_cluster() returns
to xfs_iflush(), it runs the error path for that function, and that
calls xfs_iflush_abort() on the inode a second time, leading to the
above assert failure as the inode is not flush locked anymore.

This bug has been there a long time.

The simple fix would be to just avoid calling xfs_iflush_abort() in
xfs_iflush() if we've got a failure from xfs_iflush_cluster().
However, xfs_iflush_cluster() has magic delwri buffer handling that
means it may or may not have run IO completion on the buffer, and
hence sometimes we have to call xfs_iflush_abort() from
xfs_iflush(), and sometimes we shouldn't.

After reading through all the error paths and the delwri buffer
code, it's clear that the error handling in xfs_iflush_cluster() is
unnecessary. If the buffer is delwri, it leaves it on the delwri
list so that when the delwri list is submitted it sees a shutdown
fliesystem in xfs_buf_submit() and that marks the buffer stale, EIO
and runs IO completion. i.e. exactly what xfs+iflush_cluster() does
when it's not a delwri buffer. Further, marking a buffer stale
clears the _XBF_DELWRI_Q flag on the buffer, which means when
submission of the buffer occurs, it just skips over it and releases
it.

IOWs, the error handling in xfs_iflush_cluster doesn't need to care
if the buffer is already on a the delwri queue or not - it just
needs to mark the buffer stale, EIO and run completions. That means
we can just use the easy fix for xfs_iflush() to avoid the double
abort.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-21 23:31:38 -07:00
Dave Chinner
23fcb3340d xfs: More robust inode extent count validation
When the inode is in extent format, it can't have more extents that
fit in the inode fork. We don't currenty check this, and so this
corruption goes unnoticed by the inode verifiers. This can lead to
crashes operating on invalid in-memory structures.

Attempts to access such a inode will now error out in the verifier
rather than allowing modification operations to proceed.

Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix a typedef, add some braces and breaks to shut up compiler warnings]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-21 23:25:57 -07:00
Christoph Hellwig
e2ac836307 xfs: simplify xfs_bmap_punch_delalloc_range
Instead of using xfs_bmapi_read to find delalloc extents and then punch
them out using xfs_bunmapi, opencode the loop to iterate over the extents
and call xfs_bmap_del_extent_delay directly.  This both simplifies the
code and reduces the number of extent tree lookups required.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-21 23:24:38 -07:00
Christoph Hellwig
c03cea4214 iomap: add initial support for writes without buffer heads
For now just limited to blocksize == PAGE_SIZE, where we can simply read
in the full page in write begin, and just set the whole page dirty after
copying data into it.  This code is enabled by default and XFS will now
be feed pages without buffer heads in ->writepage and ->writepages.

If a file system sets the IOMAP_F_BUFFER_HEAD flag on the iomap the old
path will still be used, this both helps the transition in XFS and
prepares for the gfs2 migration to the iomap infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-20 09:32:41 -07:00
Linus Torvalds
7a932516f5 vfs/y2038: inode timestamps conversion to timespec64
This is a late set of changes from Deepa Dinamani doing an automated
 treewide conversion of the inode and iattr structures from 'timespec'
 to 'timespec64', to push the conversion from the VFS layer into the
 individual file systems.
 
 There were no conflicts between this and the contents of linux-next
 until just before the merge window, when we saw multiple problems:
 
 - A minor conflict with my own y2038 fixes, which I could address
   by adding another patch on top here.
 - One semantic conflict with late changes to the NFS tree. I addressed
   this by merging Deepa's original branch on top of the changes that
   now got merged into mainline and making sure the merge commit includes
   the necessary changes as produced by coccinelle.
 - A trivial conflict against the removal of staging/lustre.
 - Multiple conflicts against the VFS changes in the overlayfs tree.
   These are still part of linux-next, but apparently this is no longer
   intended for 4.18 [1], so I am ignoring that part.
 
 As Deepa writes:
 
   The series aims to switch vfs timestamps to use struct timespec64.
   Currently vfs uses struct timespec, which is not y2038 safe.
 
   The series involves the following:
   1. Add vfs helper functions for supporting struct timepec64 timestamps.
   2. Cast prints of vfs timestamps to avoid warnings after the switch.
   3. Simplify code using vfs timestamps so that the actual
      replacement becomes easy.
   4. Convert vfs timestamps to use struct timespec64 using a script.
      This is a flag day patch.
 
   Next steps:
   1. Convert APIs that can handle timespec64, instead of converting
      timestamps at the boundaries.
   2. Update internal data structures to avoid timestamp conversions.
 
 Thomas Gleixner adds:
 
   I think there is no point to drag that out for the next merge window.
   The whole thing needs to be done in one go for the core changes which
   means that you're going to play that catchup game forever. Let's get
   over with it towards the end of the merge window.
 
 [1] https://www.spinics.net/lists/linux-fsdevel/msg128294.html
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJbInZAAAoJEGCrR//JCVInReoQAIlVIIMt5ZX6wmaKbrjy9Itf
 MfgbFihQ/djLnuSPVQ3nztcxF0d66BKHZ9puVjz6+mIHqfDvJTRwZs9nU+sOF/T1
 g78fRkM1cxq6ZCkGYAbzyjyo5aC4PnSMP/NQLmwqvi0MXqqrbDoq5ZdP9DHJw39h
 L9lD8FM/P7T29Fgp9tq/pT5l9X8VU8+s5KQG1uhB5hii4VL6pD6JyLElDita7rg+
 Z7/V7jkxIGEUWF7vGaiR1QTFzEtpUA/exDf9cnsf51OGtK/LJfQ0oiZPPuq3oA/E
 LSbt8YQQObc+dvfnGxwgxEg1k5WP5ekj/Wdibv/+rQKgGyLOTz6Q4xK6r8F2ahxs
 nyZQBdXqHhJYyKr1H1reUH3mrSgQbE5U5R1i3My0xV2dSn+vtK5vgF21v2Ku3A1G
 wJratdtF/kVBzSEQUhsYTw14Un+xhBLRWzcq0cELonqxaKvRQK9r92KHLIWNE7/v
 c0TmhFbkZA+zR8HdsaL3iYf1+0W/eYy8PcvepyldKNeW2pVk3CyvdTfY2Z87G2XK
 tIkK+BUWbG3drEGG3hxZ3757Ln3a9qWyC5ruD3mBVkuug/wekbI8PykYJS7Mx4s/
 WNXl0dAL0Eeu1M8uEJejRAe1Q3eXoMWZbvCYZc+wAm92pATfHVcKwPOh8P7NHlfy
 A3HkjIBrKW5AgQDxfgvm
 =CZX2
 -----END PGP SIGNATURE-----

Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground

Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
 "This is a late set of changes from Deepa Dinamani doing an automated
  treewide conversion of the inode and iattr structures from 'timespec'
  to 'timespec64', to push the conversion from the VFS layer into the
  individual file systems.

  As Deepa writes:

   'The series aims to switch vfs timestamps to use struct timespec64.
    Currently vfs uses struct timespec, which is not y2038 safe.

    The series involves the following:
    1. Add vfs helper functions for supporting struct timepec64
       timestamps.
    2. Cast prints of vfs timestamps to avoid warnings after the switch.
    3. Simplify code using vfs timestamps so that the actual replacement
       becomes easy.
    4. Convert vfs timestamps to use struct timespec64 using a script.
       This is a flag day patch.

    Next steps:
    1. Convert APIs that can handle timespec64, instead of converting
       timestamps at the boundaries.
    2. Update internal data structures to avoid timestamp conversions'

  Thomas Gleixner adds:

   'I think there is no point to drag that out for the next merge
    window. The whole thing needs to be done in one go for the core
    changes which means that you're going to play that catchup game
    forever. Let's get over with it towards the end of the merge window'"

* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
  pstore: Remove bogus format string definition
  vfs: change inode times to use struct timespec64
  pstore: Convert internal records to timespec64
  udf: Simplify calls to udf_disk_stamp_to_time
  fs: nfs: get rid of memcpys for inode times
  ceph: make inode time prints to be long long
  lustre: Use long long type to print inode time
  fs: add timespec64_truncate()
2018-06-15 07:31:07 +09:00
Linus Torvalds
a205f0c974 Changes since last update:
- Strengthen metadata checking to avoid ASSERTing on bad disk contents
 - Validate btree records that are being retrieved for clients
 - Strengthen root inode verification
 - Convert license blurbs to SPDX tags
 - Enable changing DAX flag on directories
 - Fix some writeback deadlocks in reflink
 - Refactor out some old xfs helpers
 - Move type verifiers to a separate file
 - Fix some fuzzer crashes
 - Various other bug fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlsfUegACgkQ+H93GTRK
 tOv0sw/9HW63zhuzGs9uihmyGvtqNeeUBfKc+3ovJ80wnhYOa0n8yYCBORS1EaMS
 YPk74IbD0yAak0H9ePdOpont43gGVTDox6/K8+6rPFnWtX30Z2/ckb6BWE4UfoW8
 QeojpB2+aS6fqfO1wcSb3i//XRu4h90ORQY0xNkHYcN4GWwIDwPCyBf+AT9HH1E+
 GFHtB3QWANZg6LRT7X0GVgz5r68lzyxX1WisJ4uAm0NwKR5zVb9NWFCcOszQ45Ky
 +YFw4kfgithbIHlwTpo3LrvQk7+cBhlSpWuASZOYjugxcQ2d85B/+9mF/QDnLOey
 ddbO6WK+wo0KZImpFvOOQZY07cO7vtWwkWHraz0PkUdaEab5rcnooLoJg9UTMZa4
 WT8wM8CrX1kkFvJQCuAMV9jblovjETeYhHfG8ak8Z/lWc3WEnEBUFQiO9ZVQdiAv
 B02xMmpOkfi0fqRCg6li9u3CJtN+2vxPiNEME3lz5zdY5aE2aXSmCspvP3aPVZMt
 y1fZ90u5NONz6Q9WrIh0plEru4oynhwVuqRrnVRDPCT4X64IZXuf/fBmYqrfZGmJ
 K45P/LQDvfcHj3xBLhfkKv5OpXtyYgDtLSBNqYAYrcGS4sW7Z4Ts8ohqcOhF1OqR
 g3mFp75aO4Ekw6hFbg9CRX13G4mu80BmnRKDVwFjThkl6d0Xyxw=
 =SD3u
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.18-merge-10' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull more xfs updates from Darrick Wong:
 "Here's the second round of patches for XFS for 4.18. Most of the
  commits are small cleanups, bug fixes, and continued strengthening of
  metadata verifiers; the bulk of the diff is the conversion of the
  fs/xfs/ tree to use SPDX tags.

  This series has been run through a full xfstests run over the weekend
  and through a quick xfstests run against this morning's master, with
  no major failures reported.

  Summary:

   - Strengthen metadata checking to avoid ASSERTing on bad disk
     contents

   - Validate btree records that are being retrieved for clients

   - Strengthen root inode verification

   - Convert license blurbs to SPDX tags

   - Enable changing DAX flag on directories

   - Fix some writeback deadlocks in reflink

   - Refactor out some old xfs helpers

   - Move type verifiers to a separate file

   - Fix some fuzzer crashes

   - Various other bug fixes"

* tag 'xfs-4.18-merge-10' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (31 commits)
  xfs: update incore per-AG inode count
  xfs: replace do_mod with native operations
  xfs: don't call xfs_da_shrink_inode with NULL bp
  xfs: clean up MIN/MAX
  xfs: move various type verifiers to common file
  xfs: xfs_reflink_convert_cow() memory allocation deadlock
  xfs: setup VFS i_rwsem lockdep state correctly
  xfs: fix string handling in label get/set functions
  xfs: convert to SPDX license tags
  xfs: validate btree records on retrieval
  xfs: push corruption -> ESTALE conversion to xfs_nfs_get_inode()
  xfs: verify root inode more thoroughly
  xfs: verify COW extent size hint is valid in inode verifier
  xfs: verify extent size hint is valid in inode verifier
  xfs: catch bad stripe alignment configurations
  iomap: fsync swap files before iterating mappings
  xfs: use xfs_trans_getsb in xfs_sync_sb_buf
  xfs: don't assert on corrupted unlinked inode list
  xfs: explicitly pass buffer size to xfs_corruption_error
  xfs: don't assert when on-disk btree pointers are garbage
  ...
2018-06-12 15:49:00 -07:00
Darrick J. Wong
89e9b5c091 xfs: update incore per-AG inode count
For whatever reason we never actually update pagi_count (the in-core
perag inode count) when we allocate or free inode chunks.  Online scrub
is going to use it, so we need to fix the accounting.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-06-11 21:53:52 -07:00
Linus Torvalds
7d3bf613e9 libnvdimm for 4.18
* DAX broke a fundamental assumption of truncate of file mapped pages.
   The truncate path assumed that it is safe to disconnect a pinned page
   from a file and let the filesystem reclaim the physical block. With DAX
   the page is equivalent to the filesystem block. Introduce
   dax_layout_busy_page() to enable filesystems to wait for pinned DAX
   pages to be released. Without this wait a filesystem could allocate
   blocks under active device-DMA to a new file.
 
 * DAX arranges for the block layer to be bypassed and uses
   dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
   However, the memcpy_mcsafe() facility is available through the pmem
   block driver. In order to safely handle media errors, via the DAX
   block-layer bypass, introduce copy_to_iter_mcsafe().
 
 * Fix cache management policy relative to the ACPI NFIT Platform
   Capabilities Structure to properly elide cache flushes when they are not
   necessary. The table indicates whether CPU caches are power-fail
   protected. Clarify that a deep flush is always performed on
   REQ_{FUA,PREFLUSH} requests.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbGxI7AAoJEB7SkWpmfYgCDjsP/2Lcibu9Kf4tKIzuInsle6iE
 6qP29qlkpHVTpDKbhvIxTYTYL9sMU0DNUrpPCJR/EYdeyztLWDFC5EAT1wF240vf
 maV37s/uP331jSC/2VJnKWzBs2ztQxmKLEIQCxh6aT0qs9cbaOvJgB/WlVu+qtsl
 aGJFLmb6vdQacp31noU5plKrMgMA1pADyF5qx9I9K2HwowHE7T368ZEFS/3S//c3
 LXmpx/Nfq52sGu/qbRbu6B1CTJhIGhmarObyQnvBYoKntK1Ov4e8DS95wD3EhNDe
 FuRkOCUKhjl6cFy7QVWh1ct1bFm84ny+b4/AtbpOmv9l/+0mveJ7e+5mu8HQTifT
 wYiEe2xzXJ+OG/xntv8SvlZKMpjP3BqI0jYsTutsjT4oHrciiXdXM186cyS+BiGp
 KtFmWyncQJgfiTq6+Hj5XpP9BapNS+OYdYgUagw9ZwzdzptuGFYUMSVOBrYrn6c/
 fwqtxjubykJoW0P3pkIoT91arFSea7nxOKnGwft06imQ7TwR4ARsI308feQ9itJq
 2P2e7/20nYMsw2aRaUDDA70Yu+Lagn1m8WL87IybUGeUDLb1BAkjphAlWa6COJ+u
 PhvAD2tvyM9m0c7O5Mytvz7iWKG6SVgatoAyOPkaeplQK8khZ+wEpuK58sO6C1w8
 4GBvt9ri9i/Ww/A+ppWs
 =4bfw
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "This adds a user for the new 'bytes-remaining' updates to
  memcpy_mcsafe() that you already received through Ingo via the
  x86-dax- for-linus pull.

  Not included here, but still targeting this cycle, is support for
  handling memory media errors (poison) consumed via userspace dax
  mappings.

  Summary:

   - DAX broke a fundamental assumption of truncate of file mapped
     pages. The truncate path assumed that it is safe to disconnect a
     pinned page from a file and let the filesystem reclaim the physical
     block. With DAX the page is equivalent to the filesystem block.
     Introduce dax_layout_busy_page() to enable filesystems to wait for
     pinned DAX pages to be released. Without this wait a filesystem
     could allocate blocks under active device-DMA to a new file.

   - DAX arranges for the block layer to be bypassed and uses
     dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
     However, the memcpy_mcsafe() facility is available through the pmem
     block driver. In order to safely handle media errors, via the DAX
     block-layer bypass, introduce copy_to_iter_mcsafe().

   - Fix cache management policy relative to the ACPI NFIT Platform
     Capabilities Structure to properly elide cache flushes when they
     are not necessary. The table indicates whether CPU caches are
     power-fail protected. Clarify that a deep flush is always performed
     on REQ_{FUA,PREFLUSH} requests"

* tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
  dax: Use dax_write_cache* helpers
  libnvdimm, pmem: Do not flush power-fail protected CPU caches
  libnvdimm, pmem: Unconditionally deep flush on *sync
  libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
  acpi, nfit: Remove ecc_unit_size
  dax: dax_insert_mapping_entry always succeeds
  libnvdimm, e820: Register all pmem resources
  libnvdimm: Debug probe times
  linvdimm, pmem: Preserve read-only setting for pmem devices
  x86, nfit_test: Add unit test for memcpy_mcsafe()
  pmem: Switch to copy_to_iter_mcsafe()
  dax: Report bytes remaining in dax_iomap_actor()
  dax: Introduce a ->copy_to_iter dax operation
  uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
  xfs, dax: introduce xfs_break_dax_layouts()
  xfs: prepare xfs_break_layouts() for another layout type
  xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
  mm, fs, dax: handle layout changes to pinned dax mappings
  mm: fix __gup_device_huge vs unmap
  mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
  ...
2018-06-08 17:21:52 -07:00
Dan Williams
b56845794e Merge branch 'for-4.18/dax' into libnvdimm-for-next 2018-06-08 15:16:40 -07:00
Dave Chinner
0703a8e1c1 xfs: replace do_mod with native operations
do_mod() is a hold-over from when we have different sizes for file
offsets and and other internal values for 40 bit XFS filesystems.
Hence depending on build flags variables passed to do_mod() could
change size. We no longer support those small format filesystems and
hence everything is of fixed size theses days, even on 32 bit
platforms.

As such, we can convert all the do_mod() callers to platform
optimised modulus operations as defined by linux/math64.h.
Individual conversions depend on the types of variables being used.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-08 10:07:52 -07:00
Eric Sandeen
bb3d48dcf8 xfs: don't call xfs_da_shrink_inode with NULL bp
xfs_attr3_leaf_create may have errored out before instantiating a buffer,
for example if the blkno is out of range.  In that case there is no work
to do to remove it, and in fact xfs_da_shrink_inode will lead to an oops
if we try.

This also seems to fix a flaw where the original error from
xfs_attr3_leaf_create gets overwritten in the cleanup case, and it
removes a pointless assignment to bp which isn't used after this.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199969
Reported-by: Xu, Wen <wen.xu@gatech.edu>
Tested-by: Xu, Wen <wen.xu@gatech.edu>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-08 10:07:52 -07:00
Dave Chinner
9bb54cb56a xfs: clean up MIN/MAX
Get rid of the MIN/MAX macros and just use the native min/max macros
directly in the XFS code.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-08 10:07:52 -07:00
Dave Chinner
86210fbeba xfs: move various type verifiers to common file
New verification functions like xfs_verify_fsbno() and
xfs_verify_agino() are spread across multiple files and different
header files. They really don't fit cleanly into the places they've
been put, and have wider scope than the current header includes.

Move the type verifiers to a new file in libxfs (xfs-types.c) and
the prototypes to xfs_types.h where they will be visible to all the
code that uses the types.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-08 10:07:51 -07:00
Dave Chinner
4a2d01b076 xfs: xfs_reflink_convert_cow() memory allocation deadlock
xfs_reflink_convert_cow() manipulates the incore extent list
in GFP_KERNEL context in the IO submission path whilst holding
locked pages under writeback. This is a memory reclaim deadlock
vector. This code is not in a transaction, so any memory allocations
it makes aren't protected via the memalloc_nofs_save() context that
transactions carry.

Hence we need to run this call under memalloc_nofs_save() context to
prevent potential memory allocations from being run as GFP_KERNEL
and deadlocking.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-08 10:07:51 -07:00
Dave Chinner
ef215e394e xfs: setup VFS i_rwsem lockdep state correctly
When lockdep is enabled, it changes the type of the inode i_rwsem
semaphore before unlocking a newly instantiated inode. THere is the
possibility that there is already a waiter on that inode lock by the
time we unlock the new inode, so having lockdep re-initialise the
lock is a vector for trouble.

Avoid this whole situation by setting up the i_rwsem lockdep class
at the same time we set up the XFS inode i_ilock classes and so the
VFS doesn't have to change the lock class itself when it is
potentially unsafe.

This change is necessary because the equivalent fixes to the VFS code
made in commit 1e2e547a93 ("do d_instantiate/unlock_new_inode
combinations safely") are not relevant to XFS as it has it's own
internal inode cache lookup and instantiation routines.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-08 10:07:51 -07:00
Arnd Bergmann
4bb8b65a04 xfs: fix string handling in label get/set functions
[sandeen: fix subject, avoid copy-out of uninit data in getlabel]

gcc-8 reports two warnings for the newly added getlabel/setlabel code:

fs/xfs/xfs_ioctl.c: In function 'xfs_ioc_getlabel':
fs/xfs/xfs_ioctl.c:1822:38: error: argument to 'sizeof' in 'strncpy' call is the same expression as the source; did you mean to use the size of the destination? [-Werror=sizeof-pointer-memaccess]
  strncpy(label, sbp->sb_fname, sizeof(sbp->sb_fname));
                                      ^
In function 'strncpy',
    inlined from 'xfs_ioc_setlabel' at /git/arm-soc/fs/xfs/xfs_ioctl.c:1863:2,
    inlined from 'xfs_file_ioctl' at /git/arm-soc/fs/xfs/xfs_ioctl.c:1918:10:
include/linux/string.h:254:9: error: '__builtin_strncpy' output may be truncated copying 12 bytes from a string of length 12 [-Werror=stringop-truncation]
  return __builtin_strncpy(p, q, size);

In both cases, part of the problem is that one of the strncpy()
arguments is a fixed-length character array with zero-padding rather
than a zero-terminated string. In the first one case, we also get an
odd warning about sizeof-pointer-memaccess, which doesn't seem right
(the sizeof is for an array that happens to be the same as the second
strncpy argument).

To work around the bogus warning, I use a plain 'XFSLABEL_MAX' for
the strncpy() length when copying the label in getlabel. For setlabel(),
using memcpy() with the correct length that is already known avoids
the second warning and is slightly simpler.

In a related issue, it appears that we accidentally skip the trailing
\0 when copying a 12-character label back to user space in getlabel().
Using the correct sizeof() argument here copies the extra character.

Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85602
Fixes: f7664b3197 ("xfs: implement online get/set fs label")
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Martin Sebor <msebor@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 14:17:53 -07:00
Dave Chinner
0b61f8a407 xfs: convert to SPDX license tags
Remove the verbose license text from XFS files and replace them
with SPDX tags. This does not change the license of any of the code,
merely refers to the common, up-to-date license files in LICENSES/

This change was mostly scripted. fs/xfs/Makefile and
fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
and modified by the following command:

for f in `git grep -l "GNU General" fs/xfs/` ; do
	echo $f
	cat $f | awk -f hdr.awk > $f.new
	mv -f $f.new $f
done

And the hdr.awk script that did the modification (including
detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
is as follows:

$ cat hdr.awk
BEGIN {
	hdr = 1.0
	tag = "GPL-2.0"
	str = ""
}

/^ \* This program is free software/ {
	hdr = 2.0;
	next
}

/any later version./ {
	tag = "GPL-2.0+"
	next
}

/^ \*\// {
	if (hdr > 0.0) {
		print "// SPDX-License-Identifier: " tag
		print str
		print $0
		str=""
		hdr = 0.0
		next
	}
	print $0
	next
}

/^ \* / {
	if (hdr > 1.0)
		next
	if (hdr > 0.0) {
		if (str != "")
			str = str "\n"
		str = str $0
		next
	}
	print $0
	next
}

/^ \*/ {
	if (hdr > 0.0)
		next
	print $0
	next
}

// {
	if (hdr > 0.0) {
		if (str != "")
			str = str "\n"
		str = str $0
		next
	}
	print $0
}

END { }
$

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 14:17:53 -07:00
Dave Chinner
9e6c08d4a8 xfs: validate btree records on retrieval
So we don't check the validity of records as we walk the btree. When
there are corrupt records in the free space btree (e.g. zero
startblock/length or beyond EOAG) we just blindly use it and things
go bad from there. That leads to assert failures on debug kernels
like this:

XFS: Assertion failed: fs_is_ok, file: fs/xfs/libxfs/xfs_alloc.c, line: 450
....
Call Trace:
 xfs_alloc_fixup_trees+0x368/0x5c0
 xfs_alloc_ag_vextent_near+0x79a/0xe20
 xfs_alloc_ag_vextent+0x1d3/0x330
 xfs_alloc_vextent+0x5e9/0x870

Or crashes like this:

XFS (loop0): xfs_buf_find: daddr 0x7fb28 out of range, EOFS 0x8000
.....
BUG: unable to handle kernel NULL pointer dereference at 00000000000000c8
....
Call Trace:
 xfs_bmap_add_extent_hole_real+0x67d/0x930
 xfs_bmapi_write+0x934/0xc90
 xfs_da_grow_inode_int+0x27e/0x2f0
 xfs_dir2_grow_inode+0x55/0x130
 xfs_dir2_sf_to_block+0x94/0x5d0
 xfs_dir2_sf_addname+0xd0/0x590
 xfs_dir_createname+0x168/0x1a0
 xfs_rename+0x658/0x9b0

By checking that free space records pulled from the trees are
within the valid range, we catch many of these corruptions before
they can do damage.

This is a generic btree record checking deficiency. We need to
validate the records we fetch from all the different btrees before
we use them to catch corruptions like this.

This patch results in a corrupt record emitting an error message and
returning -EFSCORRUPTED, and the higher layers catch that and abort:

 XFS (loop0): Size Freespace BTree record corruption in AG 0 detected!
 XFS (loop0): start block 0x0 block count 0x0
 XFS (loop0): Internal error xfs_trans_cancel at line 1012 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x42a/0x670
 .....
 Call Trace:
  dump_stack+0x85/0xcb
  xfs_trans_cancel+0x19f/0x1c0
  xfs_create+0x42a/0x670
  xfs_generic_create+0x1f6/0x2c0
  vfs_create+0xf9/0x180
  do_mknodat+0x1f9/0x210
  do_syscall_64+0x5a/0x180
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
.....
 XFS (loop0): xfs_do_force_shutdown(0x8) called from line 1013 of file fs/xfs/xfs_trans.c.  Return address = ffffffff81500868
 XFS (loop0): Corruption of in-memory data detected.  Shutting down filesystem

Signed-off-by: Dave Chinner <dchinner@redhat.com>

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 08:12:00 -07:00
Dave Chinner
29cad0b3ed xfs: push corruption -> ESTALE conversion to xfs_nfs_get_inode()
In xfs_imap_to_bp(), we convert a -EFSCORRUPTED error to -EINVAL if
we are doing an untrusted lookup. This is done because we need
failed filehandle lookups to report -ESTALE to the caller, and it
does this by converting -EINVAL and -ENOENT errors to -ESTALE.

The squashing of EFSCORRUPTED in imap_to_bp makes it impossible for
for xfs_iget(UNTRUSTED) callers to determine the difference between
"inode does not exist" and "corruption detected during lookup". We
realy need that distinction in places calling xfS_iget(UNTRUSTED),
so move the filehandle error case handling all the way out to
xfs_nfs_get_inode() where it is needed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 08:10:26 -07:00
Dave Chinner
541b5acc85 xfs: verify root inode more thoroughly
When looking up the root inode at mount time, we don't actually do
any verification to check that the inode is allocated and accounted
for correctly in the INOBT. Make the checks on the root inode more
robust by making it an untrusted lookup. This forces the inode
lookup to use the inode btree to verify the inode is allocated
and mapped correctly to disk. This will also have the effect of
catching a significant number of AGI/INOBT related corruptions in
AG 0 at mount time.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 08:10:26 -07:00
Dave Chinner
02a0fda875 xfs: verify COW extent size hint is valid in inode verifier
There are rules for vald extent size hints. We enforce them when
applications set them, but fuzzers violate those rules and that
screws us over. Validate COW extent size hint rules in the inode
verifier to catch this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 08:10:26 -07:00
Dave Chinner
7d71a671a2 xfs: verify extent size hint is valid in inode verifier
There are rules for vald extent size hints. We enforce them when
applications set them, but fuzzers violate those rules and that
screws us over.

This results in alignment assertion failures when setting up
allocations such as this in direct IO:

XFS: Assertion failed: ap->length, file: fs/xfs/libxfs/xfs_bmap.c, line: 3432
....
Call Trace:
 xfs_bmap_btalloc+0x415/0x910
 xfs_bmapi_write+0x71c/0x12e0
 xfs_iomap_write_direct+0x2a9/0x420
 xfs_file_iomap_begin+0x4dc/0xa70
 iomap_apply+0x43/0x100
 iomap_file_buffered_write+0x62/0x90
 xfs_file_buffered_aio_write+0xba/0x300
 __vfs_write+0xd5/0x150
 vfs_write+0xb6/0x180
 ksys_write+0x45/0xa0
 do_syscall_64+0x5a/0x180
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

And from xfs_db:

core.extsize = 10380288

Which is not an integer multiple of the block size, and so violates
Rule #7 for setting extent size hints. Validate extent size hint
rules in the inode verifier to catch this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 08:10:26 -07:00
Dave Chinner
fa4ca9c557 xfs: catch bad stripe alignment configurations
When stripe alignments are invalid, data alignment algorithms in the
allocator may not work correctly. Ensure we catch superblocks with
invalid stripe alignment setups at mount time. These data alignment
mismatches are now detected at mount time like this:

XFS (loop0): SB stripe unit sanity check failed
XFS (loop0): Metadata corruption detected at xfs_sb_read_verify+0xab/0x110, xfs_sb block 0xffffffffffffffff
XFS (loop0): Unmount and run xfs_repair
XFS (loop0): First 128 bytes of corrupted metadata buffer:
0000000091c2de02: 58 46 53 42 00 00 10 00 00 00 00 00 00 00 10 00  XFSB............
0000000023bff869: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000000cdd8c893: 17 32 37 15 ff ca 46 3d 9a 17 d3 33 04 b5 f1 a2  .27...F=...3....
000000009fd2844f: 00 00 00 00 00 00 00 04 00 00 00 00 00 00 06 d0  ................
0000000088e9b0bb: 00 00 00 00 00 00 06 d1 00 00 00 00 00 00 06 d2  ................
00000000ff233a20: 00 00 00 01 00 00 10 00 00 00 00 01 00 00 00 00  ................
000000009db0ac8b: 00 00 03 60 e1 34 02 00 08 00 00 02 00 00 00 00  ...`.4..........
00000000f7022460: 00 00 00 00 00 00 00 00 0c 09 0b 01 0c 00 00 19  ................
XFS (loop0): SB validate failed with error -117.

And the mount fails.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 08:10:26 -07:00
Deepa Dinamani
95582b0083 vfs: change inode times to use struct timespec64
struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.

The change was made with the help of the following cocinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.

The script can be a little shorter by combining different cases.
But, this version was sufficient for my usecase.

virtual patch

@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
  current_time ( ... )
  {
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
  ...
- return timespec_trunc(
+ return timespec64_trunc(
  ... );
  }

@ depends on patch @
identifier xtime;
@@
 struct \( iattr \| inode \| kstat \) {
 ...
-       struct timespec xtime;
+       struct timespec64 xtime;
 ...
 }

@ depends on patch @
identifier t;
@@
 struct inode_operations {
 ...
int (*update_time) (...,
-       struct timespec t,
+       struct timespec64 t,
...);
 ...
 }

@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
 fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
 ...) { ... }

@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
  ) { ... }

@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)

<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 =  timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)

@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
 fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}

@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1  ;
|
 node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
 node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
 node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
 stat->xtime = node2->i_xtime1;
|
 stat1.xtime = node2->i_xtime1;
|
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1  ;
|
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
 node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
 node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>
2018-06-05 16:57:31 -07:00
Linus Torvalds
6567af78ac Changes for 4.18:
- Strengthen inode number and structure validation when allocating inodes.
 - Reduce pointless buffer allocations during cache miss
 - Use FUA for pure data O_DSYNC directio writes
 - Various iomap refactorings
 - Strengthen quota metadata verification to avoid unfixable broken quota
 - Make AGFL block freeing a deferred operation to avoid blowing out
   transaction reservations when running complex operations
 - Get rid of the log item descriptors to reduce log overhead
 - Fix various reflink bugs where inodes were double-joined to
   transactions
 - Don't issue discards when trimming unwritten extents
 - Refactor incore dquot initialization and retrieval interfaces
 - Fix some locking problmes in the quota scrub code
 - Strengthen btree structure checks in scrub code
 - Rewrite swapfile activation to use iomap and support unwritten extents
 - Make scrub exit to userspace sooner when corruptions or
   cross-referencing problems are found
 - Make scrub invoke the data fork scrubber directly on metadata inodes
 - Don't do background reclamation of post-eof and cow blocks when the fs
   is suspended
 - Fix secondary superblock buffer lifespan hinting
 - Refactor growfs to use table-dispatched functions instead of long
   stringy functions
 - Move growfs code to libxfs
 - Implement online fs label getting and setting
 - Introduce online filesystem repair (in a very limited capacity)
 - Fix unit conversion problems in the realtime freemap iteration
   functions
 - Various refactorings and cleanups in preparation to remove buffer
   heads in a future release
 - Reimplement the old bmap call with iomap
 - Remove direct buffer head accesses from seek hole/data
 - Various bug fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlsR9dEACgkQ+H93GTRK
 tOv0dw//cBwRgY4jhC6b9oMk2DNRWUiTt1F2yoqr28661GPo124iXAMLIwJe1DiV
 W/qpN3HUz7P46xKOVY+MXaj0JIDFxJ8c5tHAQMH/TkDc49S+mkcGyaoPJ39hnc6u
 yikG+Hq4m0YWhHaeUhKTe8pnhXBaziz5A2NtKtwh6lPOIW+Wds51T77DJnViqADq
 tZzmAq8fS9/ELpxe0Th/2D7iTWCr2c3FLsW2KgbbNvQ4e34zVE1ix1eBtEzQE+Mm
 GUjdQhYVS1oCzqZfCxJkzR4R/1TAFyS0FXOW7PHo8FAX/kas9aQbRlnHSAQ/08EE
 8Z2p3GsFip7dgmd6O6nAmFAStW6GRvgyycJ7Y+Y0IsJj6aDp9OxhRExyF+uocJR9
 b9ChOH6PMEtRB/RRlBg66pbS61abvNGutzl61ZQZGBHEvL3VqDcd68IomdD5bNSB
 pXo6mOJIcKuXsghZszsHAV9uuMe4zQAMbLy7QH6V8LyWeSAG9hTXOT9EA4MWktEJ
 SCQFf7RRPgU5pEAgOS8LgKrawqnBaqFcFvkvWsQhyiltTFz29cwxH7tjSXYMAOFE
 W+RMp8kbkPnGOaJJeKxT+/RGRB534URk0jIEKtRb679xkEF3HE58exXEVrnojJq6
 0m712+EYuZSYhFBwrvEnQjNHr0x2r/A/iBJZ6HhyV0aO1RWm4n4=
 =11pr
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.18-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "New features this cycle include the ability to relabel mounted
  filesystems, support for fallocated swapfiles, and using FUA for pure
  data O_DSYNC directio writes. With this cycle we begin to integrate
  online filesystem repair and refactor the growfs code in preparation
  for eventual subvolume support, though the road ahead for both
  features is quite long.

  There are also numerous refactorings of the iomap code to remove
  unnecessary log overhead, to disentangle some of the quota code, and
  to prepare for buffer head removal in a future upstream kernel.

  Metadata validation continues to improve, both in the hot path
  veifiers and the online filesystem check code. I anticipate sending a
  second pull request in a few days with more metadata validation
  improvements.

  This series has been run through a full xfstests run over the weekend
  and through a quick xfstests run against this morning's master, with
  no major failures reported.

  Summary:

   - Strengthen inode number and structure validation when allocating
     inodes.

   - Reduce pointless buffer allocations during cache miss

   - Use FUA for pure data O_DSYNC directio writes

   - Various iomap refactorings

   - Strengthen quota metadata verification to avoid unfixable broken
     quota

   - Make AGFL block freeing a deferred operation to avoid blowing out
     transaction reservations when running complex operations

   - Get rid of the log item descriptors to reduce log overhead

   - Fix various reflink bugs where inodes were double-joined to
     transactions

   - Don't issue discards when trimming unwritten extents

   - Refactor incore dquot initialization and retrieval interfaces

   - Fix some locking problmes in the quota scrub code

   - Strengthen btree structure checks in scrub code

   - Rewrite swapfile activation to use iomap and support unwritten
     extents

   - Make scrub exit to userspace sooner when corruptions or
     cross-referencing problems are found

   - Make scrub invoke the data fork scrubber directly on metadata
     inodes

   - Don't do background reclamation of post-eof and cow blocks when the
     fs is suspended

   - Fix secondary superblock buffer lifespan hinting

   - Refactor growfs to use table-dispatched functions instead of long
     stringy functions

   - Move growfs code to libxfs

   - Implement online fs label getting and setting

   - Introduce online filesystem repair (in a very limited capacity)

   - Fix unit conversion problems in the realtime freemap iteration
     functions

   - Various refactorings and cleanups in preparation to remove buffer
     heads in a future release

   - Reimplement the old bmap call with iomap

   - Remove direct buffer head accesses from seek hole/data

   - Various bug fixes"

* tag 'xfs-4.18-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (121 commits)
  fs: use ->is_partially_uptodate in page_cache_seek_hole_data
  fs: remove the buffer_unwritten check in page_seek_hole_data
  fs: move page_cache_seek_hole_data to iomap.c
  xfs: use iomap_bmap
  iomap: add an iomap-based bmap implementation
  iomap: add a iomap_sector helper
  iomap: use __bio_add_page in iomap_dio_zero
  iomap: move IOMAP_F_BOUNDARY to gfs2
  iomap: fix the comment describing IOMAP_NOWAIT
  iomap: inline data should be an iomap type, not a flag
  mm: split ->readpages calls to avoid non-contiguous pages lists
  mm: return an unsigned int from __do_page_cache_readahead
  mm: give the 'ret' variable a better name __do_page_cache_readahead
  block: add a lower-level bio_add_page interface
  xfs: fix error handling in xfs_refcount_insert()
  xfs: fix xfs_rtalloc_rec units
  xfs: strengthen rtalloc query range checks
  xfs: xfs_rtbuf_get should check the bmapi_read results
  xfs: xfs_rtword_t should be unsigned, not signed
  dax: change bdev_dax_supported() to support boolean returns
  ...
2018-06-05 13:24:20 -07:00
Eric Sandeen
89c2e71123 xfs: use xfs_trans_getsb in xfs_sync_sb_buf
Use xfs_trans_getsb rather than reaching right in for
mp->m_sb_bp; I think this is more correct, and it facilitates
building this libxfs code in userspace as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-04 18:25:05 -07:00
Darrick J. Wong
d2e7366542 xfs: don't assert on corrupted unlinked inode list
Use the per-ag inode number verifiers to detect corrupt lists and error
out, instead of using ASSERTs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 18:25:05 -07:00
Darrick J. Wong
2551a53053 xfs: explicitly pass buffer size to xfs_corruption_error
Explicitly pass the buffer length to xfs_corruption_error() instead of
assuming XFS_CORRUPTION_DUMP_LEN so that we avoid dumping off the end
of the buffer.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 18:25:05 -07:00
Darrick J. Wong
85ae01098c xfs: don't assert when on-disk btree pointers are garbage
Don't ASSERT when we encounter bad on-disk btree pointers in the debug
check functions.  Log the error to leave breadcrumbs and let the upper
layers deal with it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 18:25:05 -07:00
Darrick J. Wong
e63a1008ee xfs: strengthen btree pointer checks before use
Instead of ASSERTing on null btree pointers in xfs_btree_ptr_to_daddr,
use the new block number verifiers to ensure that the btree pointer
doesn't point to any sensitive areas (AG headers, past-EOFS) and return
-EFSCORRUPTED if this is the case.  Remove the ASSERT because on-disk
corruptions shouldn't trigger ASSERTs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 18:25:05 -07:00
Darrick J. Wong
4cbae4b816 xfs: introduce xfs_btree_debug_check_ptr
Make xfs_btree_check_ptr a non-debug function and introduce a new _debug
version that only runs when #ifdef DEBUG.   This will enable us to reuse
the checking logic with other parts of the btree code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 18:25:05 -07:00
Darrick J. Wong
e4f45eff86 xfs: check directory bestfree information in the verifier
Create a variant of xfs_dir2_data_freefind that is suitable for use in a
verifier.  Because _freefind is called by the verifier, we simply
duplicate the _freefind function, convert the ASSERTs to return
__this_address, and modify the verifier to call our new function.  Once
we've made it impossible for directory blocks with bad bestfree data to
make it into the filesystem we can remove the DEBUG code from the
regular _freefind function.

Underlying argument: corruption of on-disk metadata should return
-EFSCORRUPTED instead of blowing ASSERTs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 18:25:04 -07:00
Darrick J. Wong
924cade4df xfs: don't return garbage buffers in xfs_da3_node_read
If we're reading a node in a dir/attr btree and the buffer comes off the
disk with a magic number we don't recognize, don't ASSERT and don't set
a garbage buffer type (0 also triggers ASSERTs).  Instead, report the
corruption, release the buffer, and return -EFSCORRUPTED because that's
what the dabtree is -- corrupt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:30 -07:00
Darrick J. Wong
1f5c071d19 xfs: don't ASSERT on short form btree root pointer of zero
Don't ASSERT if the short form btree root pointer is zero.  Now that we
use xfs_verify_agbno to check all short form btree pointers, we'll let
that log the error and pass it to the upper layers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:30 -07:00
Darrick J. Wong
eeee0d6a9b xfs: btree lookup shouldn't ASSERT on empty btree nodes
If a btree lookup encounters an empty btree node or an empty btree leaf
on a multi-level btree, that's evidence of a corrupt on-disk btree.
Therefore, we should return -EFSCORRUPTED to the upper levels, not an
ASSERT failure.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:30 -07:00
Darrick J. Wong
a37f7b127e xfs: xfs_alloc_get_rec should return EFSCORRUPTED for obvious bnobt corruption
Return -EFSCORRUPTED when the bnobt/cntbt return obviously corrupt
values, rather than letting them bounce around in the internal code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:30 -07:00
Darrick J. Wong
b3986010ce xfs: remove redundant ASSERT on insufficient bestfree length in _leaf_addname
In xfs_dir2_leaf_addname we ASSERT if the length of the unused space
described by bestfree[0] is less the amount of space we wish to consume.
Immediately after it is a call to xfs_dir2_data_use_free where the
offset parameter is offset of the unused space and the length parameter
is the amount of space we wish to consume.  Both values (and the unused
space pointer) are passed into xfs_dir2_data_check_free, which also
validates that the region of unused space is big enough to cover the
space we wish to consume.  This is effectively the same check that the
ASSERT covers, and since a check failure results in a corruption message
being logged we can remove the ASSERT.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:29 -07:00
Darrick J. Wong
17ba2cc7b5 xfs: don't assert when reporting on-disk corruption while loading btree
Don't bother ASSERTing when we're already going to log and return the
corruption status.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:29 -07:00
Darrick J. Wong
aaacdd257f xfs: don't forbid setting dax flag on directories if device doesn't dax
On a directory, the DAX flag is merely a hint that files created in the
directory should have the DAX flag set at creation time.  We don't care
if the underlying device supports DAX or not because directory metadata
are always cached in DRAM.  We don't care if new files get the flag even
if the device doesn't support DAX because we always check for DAX
support before setting the VFS flag (S_DAX).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:29 -07:00
Linus Torvalds
b058efc1ac Merge branch 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dcache lookup cleanups from Al Viro:
 "Cleaning ->lookup() instances up - mostly d_splice_alias() conversions"

* 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (29 commits)
  switch the rest of procfs lookups to d_splice_alias()
  procfs: switch instantiate_t to d_splice_alias()
  don't bother with tid_fd_revalidate() in lookups
  proc_lookupfd_common(): don't bother with instantiate unless the file is open
  procfs: get rid of ancient BS in pid_revalidate() uses
  cifs_lookup(): switch to d_splice_alias()
  cifs_lookup(): cifs_get_inode_...() never returns 0 with *inode left NULL
  9p: unify paths in v9fs_vfs_lookup()
  ncp_lookup(): use d_splice_alias()
  hfsplus: switch to d_splice_alias()
  hfs: don't allow mounting over .../rsrc
  hfs: use d_splice_alias()
  omfs_lookup(): report IO errors, use d_splice_alias()
  orangefs_lookup: simplify
  openpromfs: switch to d_splice_alias()
  xfs_vn_lookup: simplify a bit
  adfs_lookup: do not fail with ENOENT on negatives, use d_splice_alias()
  adfs_lookup_byname: .. *is* taken care of in fs/namei.c
  romfs_lookup: switch to d_splice_alias()
  qnx6_lookup: switch to d_splice_alias()
  ...
2018-06-04 13:46:22 -07:00
Linus Torvalds
cf626b0da7 Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull procfs updates from Al Viro:
 "Christoph's proc_create_... cleanups series"

* 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (44 commits)
  xfs, proc: hide unused xfs procfs helpers
  isdn/gigaset: add back gigaset_procinfo assignment
  proc: update SIZEOF_PDE_INLINE_NAME for the new pde fields
  tty: replace ->proc_fops with ->proc_show
  ide: replace ->proc_fops with ->proc_show
  ide: remove ide_driver_proc_write
  isdn: replace ->proc_fops with ->proc_show
  atm: switch to proc_create_seq_private
  atm: simplify procfs code
  bluetooth: switch to proc_create_seq_data
  netfilter/x_tables: switch to proc_create_seq_private
  netfilter/xt_hashlimit: switch to proc_create_{seq,single}_data
  neigh: switch to proc_create_seq_data
  hostap: switch to proc_create_{seq,single}_data
  bonding: switch to proc_create_seq_data
  rtc/proc: switch to proc_create_single_data
  drbd: switch to proc_create_single
  resource: switch to proc_create_seq_data
  staging/rtl8192u: simplify procfs code
  jfs: simplify procfs code
  ...
2018-06-04 10:00:01 -07:00
Dave Chinner
9f96cc958e xfs: verify AGI unlinked list contains valid blocks
The heads of tha AGI unlinked list are only scanned on debug
kernels when the verifier runs. Change that to always scan the heads
and validate that the inode numbers are valid.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-03 16:12:16 -07:00
Christoph Hellwig
b84e772299 xfs: use iomap_bmap
Switch to the iomap based bmap implementation to get rid of one of the
last users of xfs_get_blocks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-01 18:37:33 -07:00
Dave Chinner
16858f7c21 xfs: fix error handling in xfs_refcount_insert()
generic/475 fired an assert failure just after the filesystem was
shut down:

XFS: Assertion failed: fs_is_ok, file: fs/xfs/libxfs/xfs_refcount.c, line: 182
.....
Call Trace:
 xfs_refcount_insert+0x151/0x190
 xfs_refcount_adjust_extents.constprop.11+0x9c/0x470
 xfs_refcount_adjust.constprop.10+0xb0/0x270
 xfs_refcount_finish_one+0x25a/0x420
 xfs_trans_log_finish_refcount_update+0x2a/0x40
 xfs_refcount_update_finish_item+0x35/0xa0
 xfs_defer_finish+0x15e/0x4d0
 xfs_reflink_remap_extent+0x1bc/0x610
 xfs_reflink_remap_blocks+0x6e/0x280
 xfs_reflink_remap_range+0x311/0x530
 vfs_clone_file_range+0x119/0x200
 ....

If xfs_btree_insert() returns an error, the corruption check fires
instead of passing the error back the caller. The corruption check
should be after we've checked for an error, not before, thereby
avoiding assert failures if the filesystem shuts down during a
refcount btree record insert.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-01 09:00:16 -07:00
Darrick J. Wong
a0e5c435ba xfs: fix xfs_rtalloc_rec units
All the realtime allocation functions deal with space on the rtdev in
units of realtime extents.  However, struct xfs_rtalloc_rec confusingly
uses the word 'block' in the name, even though they're really extents.

Fix the naming problem and fix all the unit handling problems in the two
existing users.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2018-06-01 09:00:16 -07:00
Darrick J. Wong
8ad560d256 xfs: strengthen rtalloc query range checks
Strengthen the rtalloc range query checks to make sure that the keys do
not run off the end of the realtime device inappropriately.  Note that
the query range functions require units of rt extents, not blocks,
despite the type name.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2018-06-01 09:00:16 -07:00
Darrick J. Wong
a03f1641c7 xfs: xfs_rtbuf_get should check the bmapi_read results
The xfs_rtbuf_get function should check the block mapping it gets back
from bmapi_read.  If there are no mappings or the mapping isn't a real
extent, we should return -EFSCORRUPTED rather than trying to read a
garbage value.  We also require realtime bitmap blocks to be real,
written allocations.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2018-06-01 09:00:16 -07:00
Darrick J. Wong
2483113f3d xfs: xfs_rtword_t should be unsigned, not signed
xfs_rtword_t is used for bit manipulations in the realtime bitmap file.
Since we're performing bit shifts with this type, we don't want sign
extension and we don't want to be left shifting negative quantities
because that's undefined behavior.

This also shuts up these UBSAN warnings:
UBSAN: Undefined behaviour in fs/xfs/libxfs/xfs_rtbitmap.c:833:48
signed integer overflow:
-2147483648 - 1 cannot be represented in type 'int'

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2018-06-01 09:00:16 -07:00
Dave Jiang
80660f2025 dax: change bdev_dax_supported() to support boolean returns
The function return values are confusing with the way the function is
named. We expect a true or false return value but it actually returns
0/-errno.  This makes the code very confusing. Changing the return values
to return a bool where if DAX is supported then return true and no DAX
support returns false.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-31 08:58:34 -07:00
Darrick J. Wong
ba23cba9b3 fs: allow per-device dax status checking for filesystems
Change bdev_dax_supported so it takes a bdev parameter.  This enables
multi-device filesystems like xfs to check that a dax device can work for
the particular filesystem.  Once that's in place, actually fix all the
parts of XFS where we need to be able to distinguish between datadev and
rtdev.

This patch fixes the problem where we screw up the dax support checking
in xfs if the datadev and rtdev have different dax capabilities.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[rez: Re-added __bdev_dax_supported() for !CONFIG_FS_DAX cases]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-05-31 08:58:33 -07:00
Kent Overstreet
e292d7bc63 xfs: convert to bioset_init()/mempool_init()
Convert XFS to embedded bio sets.

Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Darrick J. Wong
d25522f10c xfs: repair superblocks
If one of the backup superblocks is found to differ seriously from
superblock 0, write out a fresh copy from the in-core sb.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:15 -07:00
Darrick J. Wong
7e85bc6c87 xfs: add helpers to attach quotas to inodes
Add a helper routine to attach quota information to inodes that are
about to undergo repair.  If that fails, we need to schedule a
quotacheck for the next mount but allow the corrupted metadata repair to
continue.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:15 -07:00
Darrick J. Wong
04a2b7b254 xfs: recover AG btree roots from rmap data
Add a helper function to help us recover btree roots from the rmap data.
Callers pass in a list of rmap owner codes, buffer ops, and magic
numbers.  We iterate the rmap records looking for owner matches, and
then read the matching blocks to see if the magic number & uuid match.
If so, we then read-verify the block, and if that passes then we retain
a pointer to the block with the highest level, assuming that by the end
of the call we will have found the root.  This will be used to reset the
AGF/AGI btree root fields during their rebuild procedures.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
12c6510e2f xfs: add helpers to dispose of old btree blocks after a repair
Now that we've plumbed in the ability to construct a list of dead btree
blocks following a repair, add more helpers to dispose of them.  This is
done by examining the rmapbt -- if the btree was the only owner we can
free the block, otherwise it's crosslinked and we can only remove the
rmapbt record.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
64a39d876e xfs: add helpers to collect and sift btree block pointers during repair
Add some helpers to assemble a list of fs block extents.  Generally,
repair functions will iterate the rmapbt to make a list (1) of all
extents owned by the nominal owner of the metadata structure; then they
will iterate all other structures with the same rmap owner to make a
list (2) of active blocks; and finally we have a subtraction function to
subtract all the blocks in (2) from (1), with the result that (1) is now
a list of blocks that were owned by the old btree and must be disposed.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
73d6b42aa4 xfs: add helpers to allocate and initialize fresh btree roots
Add a pair of helper functions to allocate and initialize fresh btree
roots.  The repair functions will use these as part of recreating
corrupted metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
0a9633fa2f xfs: add helpers to deal with transaction allocation and rolling
For repairs, we need to reserve at least as many blocks as we think
we're going to need to rebuild the data structure, and we're going to
need some helpers to roll transactions while maintaining locks on the AG
headers so that other threads cannot wander into the middle of a repair.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
51863d7dd7 xfs: grab the per-ag structure whenever relevant
Grab and hold the per-AG data across a scrub run whenever relevant.
This helps us avoid repeated trips through rcu and the radix tree
in the repair code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:14 -07:00
Souptick Joarder
05edd888d1 fs: xfs: Change return type to vm_fault_t
Use new return type vm_fault_t for fault handlers.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-29 10:46:03 -07:00
Darrick J. Wong
2e050e648a xfs: fix inobt magic number check
In commit a6a781a58b ("xfs: have buffer verifier functions
report failing address") the bad magic number return was ported
incorrectly.

Fixes: a6a781a58b
Reported-by: syzbot+08ab33be0178b76851c8@syzkaller.appspotmail.com
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-05-29 10:46:03 -07:00
Arnd Bergmann
5ef03dbd91 xfs, proc: hide unused xfs procfs helpers
These two functions now trigger a warning when CONFIG_PROC_FS is disabled:

fs/xfs/xfs_stats.c:128:12: error: 'xqmstat_proc_show' defined but not used [-Werror=unused-function]
 static int xqmstat_proc_show(struct seq_file *m, void *v)
            ^~~~~~~~~~~~~~~~~
fs/xfs/xfs_stats.c:118:12: error: 'xqm_proc_show' defined but not used [-Werror=unused-function]
 static int xqm_proc_show(struct seq_file *m, void *v)
            ^~~~~~~~~~~~~

Previously, they were referenced from an unused 'static const' structure,
which is silently dropped by gcc.

We can address the warning by adding the same #ifdef around them that
hides the reference.

Fixes: 3f3942aca6 ("proc: introduce proc_create_single{,_data}")
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-25 20:43:08 -04:00
Al Viro
b113a6d3cf xfs_vn_lookup: simplify a bit
have all post-xfs_lookup() branches converge on d_splice_alias()

Cc: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-22 14:27:57 -04:00
Dan Williams
d6dc57e251 xfs, dax: introduce xfs_break_dax_layouts()
xfs_break_dax_layouts(), similar to xfs_break_leased_layouts(), scans
for busy / pinned dax pages and waits for those pages to go idle before
any potential extent unmap operation.

dax_layout_busy_page() handles synchronizing against new page-busy
events (get_user_pages). It invalidates all mappings to trigger the
get_user_pages slow path which will eventually block on the xfs inode
lock held in XFS_MMAPLOCK_EXCL mode. If dax_layout_busy_page() finds a
busy page it returns it for xfs to wait for the page-idle event that
will fire when the page reference count reaches 1 (recall ZONE_DEVICE
pages are idle at count 1, see generic_dax_pagefree()).

While waiting, the XFS_MMAPLOCK_EXCL lock is dropped in order to not
deadlock the process that might be trying to elevate the page count of
more pages before arranging for any of them to go idle. I.e. the typical
case of submitting I/O is that iov_iter_get_pages() elevates the
reference count of all pages in the I/O before starting I/O on the first
page. The process of elevating the reference count of all pages involved
in an I/O may cause faults that need to take XFS_MMAPLOCK_EXCL.

Although XFS_MMAPLOCK_EXCL is dropped while waiting, XFS_IOLOCK_EXCL is
held while sleeping. We need this to prevent starvation of the truncate
path as continuous submission of direct-I/O could starve the truncate
path indefinitely if the lock is dropped.

Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-05-22 07:19:08 -07:00
Dan Williams
69eb5fa10e xfs: prepare xfs_break_layouts() for another layout type
When xfs is operating as the back-end of a pNFS block server, it
prevents collisions between local and remote operations by requiring a
lease to be held for remotely accessed blocks. Local filesystem
operations break those leases before writing or mutating the extent map
of the file.

A similar mechanism is needed to prevent operations on pinned dax
mappings, like device-DMA, from colliding with extent unmap operations.

BREAK_WRITE and BREAK_UNMAP are introduced as two distinct levels of
layout breaking.

Layouts are broken in the BREAK_WRITE case to ensure that layout-holders
do not collide with local writes. Additionally, layouts are broken in
the BREAK_UNMAP case to make sure the layout-holder has a consistent
view of the file's extent map. While BREAK_WRITE breaks can be satisfied
be recalling FL_LAYOUT leases, BREAK_UNMAP breaks additionally require
waiting for busy dax-pages to go idle while holding XFS_MMAPLOCK_EXCL.

After this refactoring xfs_break_layouts() becomes the entry point for
coordinating both types of breaks. Finally, xfs_break_leased_layouts()
becomes just the BREAK_WRITE handler.

Note that the unlock tracking is needed in a follow on change. That will
coordinate retrying either break handler until both successfully test
for a lease break while maintaining the lock state.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-05-22 07:19:08 -07:00
Dan Williams
c63a8eae63 xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
In preparation for adding coordination between extent unmap operations
and busy dax-pages, update xfs_break_layouts() to permit it to be called
with the mmap lock held. This lock scheme will be required for
coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE)
pages mapped into the file's address space). Breaking dax layouts will
be added to xfs_break_layouts() in a future patch, for now this preps
the unmap call sites to take and hold XFS_MMAPLOCK_EXCL over the call to
xfs_break_layouts().

Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <darrick.wong@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-05-22 07:19:08 -07:00
Eric Sandeen
f7664b3197 xfs: implement online get/set fs label
The GET ioctl is trivial, just return the current label.

The SET ioctl is more involved:
It transactionally modifies the superblock to write a new filesystem
label to the primary super.

A new variant of xfs_sync_sb then writes the superblock buffer
immediately to disk so that the change is visible from userspace.

It then invalidates any page cache that userspace might have previously
read on the block device so that i.e. blkid can see the change
immediately, and updates all secondary superblocks as userspace relable
does.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[darrick: use dchinner's new xfs_update_secondary_sbs function]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-16 08:50:16 -07:00
Christoph Hellwig
3f3942aca6 proc: introduce proc_create_single{,_data}
Variants of proc_create{,_data} that directly take a seq_file show
callback and drastically reduces the boilerplate code in the callers.

All trivial callers converted over.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:23:35 +02:00
Dave Chinner
49dd56f26e xfs: factor the ag length extension code into libxfs
Growfs currently manually codes the extension of the last AG in a
filesytem during the growfs process. Factor that out of the growfs
code and move it into libxfs along with teh rest of the AG header
modification code.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Dave Chinner
b16817b66b xfs: move growfs core to libxfs
So it can be shared with userspace (e.g. mkfs) easily.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Dave Chinner
8125147288 xfs: rework secondary superblock updates in growfs
Right now we wait until we've committed changes to the primary
superblock before we initialise any of the new secondary
superblocks. This means that if we have any write errors for new
secondary superblocks we end up with garbage in place rather than
zeros or even an "in progress" superblock to indicate a grow
operation is being done.

To ensure we can write the secondary superblocks, initialise them
earlier in the same loop that initialises the AG headers. We stamp
the new secondary superblocks here with the old geometry, but set
the "sb_inprogress" field to indicate that updates are being done to
the superblock so they cannot be used.  This will result in the
secondary superblock fields being updated or triggering errors that
will abort the grow before we commit any permanent changes.

This also means we can change the update mechanism of the secondary
superblocks.  We know that we are going to wholly overwrite the
information in the struct xfs_sb in the buffer, so there's no point
reading it from disk. Just allocate an uncached buffer, zero it in
memory, stamp the new superblock structure in it and write it out.
If we fail to write it out, then we'll leave the existing sb (old or
new w/ inprogress) on disk for repair to deal with later.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Dave Chinner
83a7f86e39 xfs: separate secondary sb update in growfs
This happens after all the transactions to update the superblock
occur, and errors need to be handled slightly differently. Seperate
out the code into it's own function, and clean up the error goto
stack in the core growfs code as it is now much simpler.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Dave Chinner
87444b8c26 xfs: make imaxpct changes in growfs separate
When growfs changes the imaxpct value of the filesystem, it runs
through all the "change size" growfs code, whether it needs to or
not. Separate out changing imaxpct into it's own function and
transaction to simplify the rest of the growfs code.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Dave Chinner
532ff647d8 xfs: turn ag header initialisation into a table driven operation
There's still more cookie cutter code in setting up each AG header.
Separate all the variables into a simple structure and iterate a
table of header definitions to initialise everything.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Dave Chinner
0410c3bb2b xfs: factor ag btree root block initialisation
Cookie cutter code, easily factored.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Dave Chinner
9aebe805a5 xfs: convert growfs AG header init to use buffer lists
We currently write all new AG headers synchronously, which can be
slow for large grow operations. All we really need to do is ensure
all the headers are on disk before we run the growfs transaction, so
convert this to a buffer list and a delayed write operation. We
block waiting for the delayed write buffer submission to complete,
so this will fulfill the requirement to have all the buffers written
correctly before proceeding.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Dave Chinner
cce77bcf48 xfs: factor out AG header initialisation from growfs core
The intialisation of new AG headers is mostly common with the
userspace mkfs code and growfs in the kernel, so start factoring it
out so we can move it to libxfs and use it in both places.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Dave Chinner
879de98ead xfs: one-shot cached buffers
For the new growfs work, we want to ensure that we serialise
secondary superblock updates with other operations (e.g. scrub)
correctly, but we don't want to cache the buffers for long term
reuse. We need cached buffers for serialisation, however.

To solve this, introduce a "oneshot" buffer which will be marshalled
through the cache but then released once the last current reference
goes away. If the buffer is already cached, then we ignore the
"one-shot" behaviour and leave the buffer in the state it was prior
to the one-shot command being run. This means we don't perturb
either the working set or existing cached buffer state by a one-shot
operation.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Darrick J. Wong
84d42ea6b6 xfs: implement the metadata repair ioctl flag
Plumb in the pieces necessary to make the "scrub" subfunction of
the scrub ioctl actually work.  This means that we make the IFLAG_REPAIR
flag to the scrub ioctl actually do something, and we add an errortag
knob so that xfstests can force the kernel to rebuild a metadata
structure even if there's nothing wrong with it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-15 18:12:50 -07:00
Darrick J. Wong
718fa74b15 xfs: create tracepoints for online repair
These tracepoints will be used to debug the online repair routines.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-15 18:12:50 -07:00
Darrick J. Wong
7644bd988d xfs: teach xfs_bmapi_remap to accept some bmapi flags
Teach xfs_bmapi_remap how to map in unwritten extent and to skip rmap
updates.  This enables us to rebuild real and unwritten extents from the
rmapbt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 18:12:50 -07:00
Darrick J. Wong
7cf199ba5a xfs: make xfs_bmapi_remapi work with attribute forks
Add a new flags argument to xfs_bmapi_remapi so that we can pass BMAPI
flags into the function.  This enables us to pass in BMAPI_ATTRFORK so
that we can remap things into the attribute fork.  Eventually the
online repair code will use this to rebuild attribute forks, so make it
non-static.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 18:12:50 -07:00
Darrick J. Wong
9f3a080ef1 xfs: hoist xfs_scrub_agfl_walk to libxfs as xfs_agfl_walk
This function is basically a generic AGFL block iterator, so promote it
to libxfs ahead of online repair wanting to use it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 18:12:50 -07:00
Darrick J. Wong
ddd10c2fe2 xfs: avoid ABBA deadlock when scrubbing parent pointers
In normal operation, the XFS convention is to take an inode's iolock
and then allocate a transaction.  However, when scrubbing parent inodes
this is inverted -- we allocated the transaction to do the scrub, and
now we're trying to grab the parent's iolock.  This can lead to ABBA
deadlocks: some thread grabbed the parent's iolock and is waiting for
space for a transaction while our parent scrubber is sitting on a
transaction trying to get the parent's iolock.

Therefore, convert all iolock attempts to use trylock; if that fails,
they can use the existing mechanisms to back off and try again.

The ABBA deadlock didn't happen with a non-repair scrub because the
transactions don't reserve any space, but repair scrubs require
reservation in order to update metadata.  However, any other concurrent
metadata update (e.g. directory create in the parent) could also induce
this deadlock with the parent scrubber.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 18:12:50 -07:00
Darrick J. Wong
517b32b7fa xfs: scrub the data fork of the realtime inodes
The realtime bitmap and summary inodes live on the metadata device, so
we can scrub their data forks with the regular scrubbers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 18:12:50 -07:00
Darrick J. Wong
87d9d609c2 xfs: quota scrub should use bmapbtd scrubber
Replace the quota scrubber's open-coded data fork scrubber with a
redirected call to the bmapbtd scrubber.  This strengthens the quota
scrub to include all the cross-referencing that it does.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 18:12:50 -07:00
Darrick J. Wong
8bc763c24d xfs: don't continue scrub if already corrupt
If we've already decided that something is corrupt, we might as well
abort all the loops and exit as quickly as possible.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 18:12:50 -07:00
Darrick J. Wong
eac69e1676 xfs: refactor quota limits initialization
Replace all the if (!error) weirdness with helper functions that follow
our regular coding practices, and factor out the ternary expression soup.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
689e11c84b xfs: superblock scrub should use short-lived buffers
Secondary superblocks are rarely used, so create a helper to read a
given non-primary AG's superblock and ensure that it won't stick around
hogging memory.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
8389f3ffa2 xfs: skip scrub xref if corruption already noted
Don't bother looking for cross-referencing problems if the metadata is
already corrupt or we've already found a cross-referencing problem.
Since we added a helper function for flags testing, convert existing
users to use it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 17:57:05 -07:00
Dave Chinner
c9fbd7bbc2 xfs: clear sb->s_fs_info on mount failure
We recently had an oops reported on a 4.14 kernel in
xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
and so the m_perag_tree lookup walked into lala land.

Essentially, the machine was under memory pressure when the mount
was being run, xfs_fs_fill_super() failed after allocating the
xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and
freed the xfs_mount, but the sb->s_fs_info field still pointed to
the freed memory. Hence when the superblock shrinker then ran
it fell off the bad pointer.

With the superblock shrinker problem fixed at teh VFS level, this
stale s_fs_info pointer is still a problem - we use it
unconditionally in ->put_super when the superblock is being torn
down, and hence we can still trip over it after a ->fill_super
call failure. Hence we need to clear s_fs_info if
xfs-fs_fill_super() fails, and we need to check if it's valid in
the places it can potentially be dereferenced after a ->fill_super
failure.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 17:57:05 -07:00
Dave Chinner
dae5cd8118 xfs: add mount delay debug option
Similar to log_recovery_delay, this delay occurs between the VFS
superblock being initialised and the xfs_mount being fully
initialised. It also poisons the per-ag radix tree node so that it
can be used for triggering shrinker races during mount
such as the following:

<run memory pressure workload in background>

$ cat dirty-mount.sh
#! /bin/bash

umount -f /dev/pmem0
mkfs.xfs -f /dev/pmem0
mount /dev/pmem0 /mnt/test
rm -f /mnt/test/foo
xfs_io -fxc "pwrite 0 4k" -c fsync -c "shutdown" /mnt/test/foo
umount /dev/pmem0

# let's crash it now!
echo 30 > /sys/fs/xfs/debug/mount_delay
mount /dev/pmem0 /mnt/test
echo 0 > /sys/fs/xfs/debug/mount_delay
umount /dev/pmem0
$ sudo ./dirty-mount.sh
.....
[   60.378118] CPU: 3 PID: 3577 Comm: fs_mark Tainted: G      D W        4.16.0-rc5-dgc #440
[   60.378120] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[   60.378124] RIP: 0010:radix_tree_next_chunk+0x76/0x320
[   60.378127] RSP: 0018:ffffc9000276f4f8 EFLAGS: 00010282
[   60.383670] RAX: a5a5a5a5a5a5a5a4 RBX: 0000000000000010 RCX: 000000000000001a
[   60.385277] RDX: 0000000000000000 RSI: ffffc9000276f540 RDI: 0000000000000000
[   60.386554] RBP: 0000000000000000 R08: 0000000000000000 R09: a5a5a5a5a5a5a5a5
[   60.388194] R10: 0000000000000006 R11: 0000000000000001 R12: ffffc9000276f598
[   60.389288] R13: 0000000000000040 R14: 0000000000000228 R15: ffff880816cd6458
[   60.390827] FS:  00007f5c124b9740(0000) GS:ffff88083fc00000(0000) knlGS:0000000000000000
[   60.392253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   60.393423] CR2: 00007f5c11bba0b8 CR3: 000000035580e001 CR4: 00000000000606e0
[   60.394519] Call Trace:
[   60.395252]  radix_tree_gang_lookup_tag+0xc4/0x130
[   60.395948]  xfs_perag_get_tag+0x37/0xf0
[   60.396522]  xfs_reclaim_inodes_count+0x32/0x40
[   60.397178]  xfs_fs_nr_cached_objects+0x11/0x20
[   60.397837]  super_cache_count+0x35/0xc0
[   60.399159]  shrink_slab.part.66+0xb1/0x370
[   60.400194]  shrink_node+0x7e/0x1a0
[   60.401058]  try_to_free_pages+0x199/0x470
[   60.402081]  __alloc_pages_slowpath+0x3a1/0xd20
[   60.403729]  __alloc_pages_nodemask+0x1c3/0x200
[   60.404941]  cache_grow_begin+0x20b/0x2e0
[   60.406164]  fallback_alloc+0x160/0x200
[   60.407088]  kmem_cache_alloc+0x111/0x4e0
[   60.408038]  ? xfs_buf_rele+0x61/0x430
[   60.408925]  kmem_zone_alloc+0x61/0xe0
[   60.409965]  xfs_inode_alloc+0x24/0x1d0
.....


Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 17:57:05 -07:00
Brian Foster
4e529339af xfs: factor out nodiscard helpers
The changes to skip discards of speculative preallocation and
unwritten extents introduced several new wrapper functions through
the bunmapi -> extent free codepath to reduce churn in all of the
associated callers. In several cases, these wrappers simply toggle a
single flag to skip or not skip discards for the resulting blocks.

The explicit _nodiscard() wrappers for such an isolated set of
callers is a bit overkill. Kill off these wrappers and replace with
the calls to the underlying functions in the contexts that need to
control discard behavior. Retain the wrappers that preserve the
original calling conventions to serve the original purpose of
reducing code churn.

This is a refactoring patch and does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
67482129cd iomap: add a swapfile activation function
Add a new iomap_swapfile_activate function so that filesystems can
activate swap files without having to use the obsolete and slow bmap
function.  This enables XFS to support fallocate'd swap files and
swap files on realtime devices.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
d6b636ebb1 xfs: halt auto-reclamation activities while rebuilding rmap
Rebuilding the reverse-mapping tree requires us to quiesce all inodes in
the filesystem, so we must stop background reclamation of post-EOF and
CoW prealloc blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
95eb308caa xfs: add BMAPI_NORMAP flag to perform block remapping without updating rmapbt
Add a new flag, XFS_BMAPI_NORMAP, which will perform file block
remapping without updating the rmapbt.  This will be used by the repair
code to reconstruct bmbts from the rmapbt, in which case we don't want
the rmapbt update.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
08daa3ccf5 xfs: add repair helpers for the reference count btree
Add a couple of functions to the refcount btree and generic btree code
that will be used to repair the refcountbt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
4d4f86b49f xfs: add repair helpers for the reverse mapping btree
Add a couple of functions to the reverse mapping btree that will be used
to repair the rmapbt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
7f8f1313d9 xfs: expose various functions to repair code
Expose various helpers that the repair code will want to use.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
14861c4740 xfs: add helpers to calculate btree size
Add a bunch of helper functions that calculate the sizes of various
btrees.  These will be used to repair btrees and btree headers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
9d9c90286a xfs: refactor scrub transaction allocation function
Since the transaction allocation helper is about to become more complex,
move it to common.c and remove the redundant parameters.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
08a3a692ef xfs: btree scrub should check minrecs
Strengthen the btree block header checks to detect the number of records
being less than the btree type's minimum record count.  Certain blocks
are allowed to violate this constraint -- specifically any btree block
at the top of the tree can have fewer than minrecs records.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
631fc955bd xfs: clean up scrub usage of KM_NOFS
All scrub code runs in transaction context, which means that memory
allocations are automatically run in PF_MEMALLOC_NOFS context.  It's
therefore unnecessary to pass in KM_NOFS to allocation routines, so
clean them all out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
eb41c93fef xfs: avoid ilock games in the quota scrubber
Refactor the quota scrubber to take the quotaofflock and grab the quota
inode in the setup function so that we can treat quota in the same
"scrub in the context of this inode" (i.e. sc->ip) manner as we treat
any other inode.  We do have to drop the quota inode's ILOCK_EXCL to use
dqiterate, but since dquots have their own individual locks the ILOCK
wasn't helping us anyway.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
554ba96540 xfs: refactor dquot iteration
Create a helper function to iterate all the dquots of a given type in
the system, and refactor the dquot scrub to use it.  This will get more
use in the quota repair code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-15 17:56:59 -07:00
Darrick J. Wong
28b9060bd8 xfs: rename on-disk dquot counter zap functions
The function 'xfs_qm_dqiterate' doesn't iterate dquots at all, it
iterates all dquot blocks of a quota inode and clears the counters.
Therefore, change the name to something more descriptive so that we can
introduce a real dquot iterator later.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-10 08:56:48 -07:00
Darrick J. Wong
30ab2dcf2c xfs: replace XFS_QMOPT_DQALLOC with a simple boolean
DQALLOC is only ever used with xfs_qm_dqget*, and the only flag that the
_dqget family of functions cares about is DQALLOC.  Therefore, change
it to a boolean 'can alloc?' flag for the dqget interfaces where that
makes sense.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-10 08:56:48 -07:00
Darrick J. Wong
114e73ccfa xfs: remove direct calls to _qm_dqread
The quota initialization code needs an "uncached" variant of _dqget to
read in default quota limits and timers before the dquot cache is fully
set up.  We've already split up _dqget into its component pieces so
create a fourth variant to address this need, and make dqread internal
to xfs_dquot.c again.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-10 08:56:48 -07:00
Darrick J. Wong
d63192c890 xfs: refactor xfs_qm_dqtobp and xfs_qm_dqalloc
Separate the disk dquot read and allocation functionality into
two helper functions, then refactor dqread to call them directly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-10 08:56:48 -07:00
Darrick J. Wong
617cd5c12c xfs: refactor incore dquot initialization functions
Create two incore dquot initialization functions that will help us to
disentangle dqget and dqread.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-10 08:56:48 -07:00
Darrick J. Wong
0fcef1270f xfs: fetch dquots directly during quotacheck
Quotacheck only runs during mount, which means that there are no other
processes in the system that could be doing chown or chproj.  Therefore
there's no potential for racing to attach dquots to the inode so we can
drop all the ILOCK and race detection bits from quotacheck.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-10 08:56:48 -07:00
Darrick J. Wong
4882c19d2a xfs: split out dqget for inodes from regular dqget
There are two uses of dqget here -- one is to return the dquot for a
given type and id, and the other is to return the dquot for a given type
and inode.  Those are two separate things, so split them into two
smaller functions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-10 08:56:48 -07:00
Darrick J. Wong
c14cfccabe xfs: remove unnecessary xfs_qm_dqattach parameter
The flags argument is always zero, get rid of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-10 08:56:47 -07:00
Darrick J. Wong
d7103eeb00 xfs: delegate dqget input checks to helper function
Move the dqget input checks to a separate function in preparation for
splitting up the dqget functionality.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-10 08:56:47 -07:00
Darrick J. Wong
cc2047c4d0 xfs: refactor dquot cache handling
Delegate the dquot cache handling (radix tree lookup and insertion) to
separate helper functions so that we can continue to simplify the body
of dqget.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-10 08:56:47 -07:00
Darrick J. Wong
2e330e76e0 xfs: refactor XFS_QMOPT_DQNEXT out of existence
There's only one caller of DQNEXT and its semantics can be moved into a
separate function, so create the function and get rid of the flag.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-10 08:56:47 -07:00
Darrick J. Wong
609001bca4 xfs: don't spray logs when dquot flush/purge fail
When dquot flush or purge fail there's no need to spam the logs, we've
already logged the IO error or fs shutdown that caused the flush
failures.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-10 08:56:47 -07:00
Darrick J. Wong
7b6b50f55c xfs: release new dquot buffer on defer_finish error
In commit efa092f3d4 "[XFS] Fixes a bug in the quota code when
allocating a new dquot record", we allocate a new dquot block, grab a
buffer to initialize it, and return the locked initialized dquot buffer
to the caller for further in-core dquot initialization.  Unfortunately,
if the _bmap_finish errored out, _qm_dqalloc would also error out
without bothering to free the (locked) buffer.  Leaking a locked buffer
caused hangs in generic/388 when quotas are enabled.

Furthermore, the _bmap_finish -> _defer_finish conversion in
310a75a3c6 ("xfs: change xfs_bmap_{finish,cancel,init,free} ->
xfs_defer_*") failed to observe that the buffer was held going into
_defer_finish and therefore failed to notice that the buffer lock is
/not/ maintained afterwards.  Now that we can bjoin a buffer to a
defer_ops, use this mechanism to ensure that the buffer stays locked
across the _defer_finish.  Release the holds and locks on the buffer as
appropriate if we have to error out.

There is a subtlety here for the caller in that the buffer emerges
locked and held to the transaction, so if the _trans_commit fails we
have to release the buffer explicitly.  This fixes the unmount hang
in generic/388 when quotas are enabled.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-10 08:56:47 -07:00
Brian Foster
84ca484ecf xfs: don't discard on free of unwritten extents
Unwritten extents by definition have not been written to until they
are converted to normal written extents. If unwritten extents are
freed from a file, it is therefore guaranteed that the blocks have
not been written to since allocation (note that zero range punches
and reallocates blocks).

To cut down on online discards generated from workloads that make
use of preallocation, skip discards of extents if they are in the
unwritten state when the extent is freed.

Note that this optimization does not apply to log recovery, during
which all freed extents are discarded if online discard is enabled.
Also note that it may be possible for a filesystem crash to occur
after write completion of an unwritten extent but before unwritten
conversion such that the extent remains unwritten after log
recovery. Since this pseudo-inconsistency may already be possible
after a crash (consider writing to recently allocated blocks where
the allocation transaction is lost after a crash), this change
shouldn't introduce any fundamental limitations that don't already
exist. In short, on storage stacks where discards are important,
it's good practice to run an occasional fstrim even with online
discard enabled in the filesystem, particularly after a crash.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:47 -07:00
Brian Foster
13b86fc337 xfs: skip online discard during eofblocks trims
We've had reports of online discard operations being sent from XFS
on write-only workloads. These discards occur as a result of
eofblocks trims that can occur after a large file copy completes.

These discards are slightly confusing for users who might be paying
close attention to online discards (i.e., vdo) due to performance
sensitivity. They also happen to be spurious because freed post-eof
blocks by definition have not been written to during the current
allocation cycle.

Update xfs_free_eofblocks() to skip discards that are purely
attributed to eofblocks trims. This cuts down the number of spurious
discards that may occur on write-only workloads due to normal
preallocation activity.

Note that discards of post-eof extents can still occur from other
codepaths that do not isolate handling of post-eof blocks from those
within eof. For example, file unlinks and truncates may still cause
discards for any file blocks affected by the operation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:47 -07:00
Brian Foster
fcb762f5de xfs: add bmapi nodiscard flag
Freed extents are unconditionally discarded when online discard is
enabled. Define XFS_BMAPI_NODISCARD to allow callers to bypass
discards when unnecessary. For example, this will be useful for
eofblocks trimming.

This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:46 -07:00
Dave Chinner
e6631f8554 xfs: get rid of the log item descriptor
It's just a connector between a transaction and a log item. There's
a 1:1 relationship between a log item descriptor and a log item,
and a 1:1 relationship between a log item descriptor and a
transaction. Both relationships are created and terminated at the
same time, so why do we even have the descriptor?

Replace it with a specific list_head in the log item and a new
log item dirtied flag to replace the XFS_LID_DIRTY flag.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[darrick: fix up deferred agfl intent finish_item use of LID_DIRTY]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:46 -07:00
Dave Chinner
1a2ebf835a xfs: add some more debug checks to buffer log item reuse
Just to make sure the item isn't associated with another
transaction when we try to reuse it.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:46 -07:00
Dave Chinner
844e5e74c1 xfs: fix double ijoin in xfs_reflink_clear_inode_flag()
xfs_reflink_clear_inode_flag double-joins an inode to a transaction,
which is not allowed.  Fix that and document that the caller must have
already joined it.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
[darrick: edit out trace for nonexistent ASSERT]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:46 -07:00
Dave Chinner
c5295c6aad xfs: fix double ijoin in xfs_reflink_cancel_cow_range
xfs_reflink_cancel_cow_range joins an inode twice to the same
transaction.  This is not allowed, so fix it and document that the
callers of xfs_reflink_cancel_cow_blocks() must have already joined the
inode to the permanent transaction passed in.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
[darrick: edited the commit log to remove trace for nonexistent ASSERT]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:46 -07:00
Dave Chinner
3565b660e5 xfs: fix double ijoin in xfs_inactive_symlink_rmt()
xfs_inactive_symlink_rmt() does something nasty - it joins an inode
into a transaction it is already joined to. This means the inode can
have multiple log item descriptors attached to the transaction for
it. This breaks teh 1:1 mapping that is supposed to exist
between the log item and log item descriptor.

This results in the log item being processed twice during
transaction commit and CIL formatting, and there are lots of other
potential issues tha arise from double processing of log items in
the transaction commit state machine.

In this case, the inode is already held by the rolling transaction
returned from xfs_defer_finish(), so there's no need to join it
again.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:46 -07:00
Dave Chinner
d686d12d23 xfs: don't assert fail with AIL lock held
Been hitting AIL ordering assert failures recently, but been unable
to trace them down because the system immediately hangs up onteh
spinlock that was held when this assert fires:

XFS: Assertion failed: XFS_LSN_CMP(prev_lip->li_lsn, lip->li_lsn) <= 0, file: fs/xfs/xfs_trans_ail.c, line: 52

Move the assertions outside of the spinlock so the corpse can
be dissected. Thanks to Brian Foster for supplying a clean
way of doing this.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:46 -07:00
Dave Chinner
e632a5690c xfs: adder caller IP to xfs_defer* tracepoints
So it's clear in the trace where they are being called from.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:46 -07:00
Dave Chinner
ba18781b91 xfs: add tracing to high level transaction operations
Because currently we have no idea what the transaction context we
are operating in is, and I need to know that information to track
down bugs in multiple log item joins to transactions.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:46 -07:00
Dave Chinner
22525c17ed xfs: log item flags are racy
The log item flags contain a field that is protected by the AIL
lock - the XFS_LI_IN_AIL flag. We use non-atomic RMW operations to
set and clear these flags, but most of the updates and checks are
not done with the AIL lock held and so are susceptible to update
races.

Fix this by changing the log item flags to use atomic bitops rather
than be reliant on the AIL lock for update serialisation.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:41 -07:00
Darrick J. Wong
52101dfe56 xfs: add missing rmap error return
xfs_rmap_lookup_le_range can return errors, so we need to check for
them and bail out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-10 08:56:41 -07:00
Darrick J. Wong
cec572561a xfs: bmap debugging should never panic the system
Don't panic() the system if the bmap records are garbage, just call
ASSERT which gives us the same backtrace but enables developers to
control if the system goes down or not.  This makes debugging with
generic/388 much easier because it won't reboot the machine midway
through a run just because btree_read_bufl returns EIO when the fs has
already shut down.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-09 10:04:02 -07:00
Brian Foster
8804630e1e xfs: defer agfl frees from directory op transactions
Directory operations can perform block allocations as entries are
added/removed from directories. Defer AGFL block frees from the
remaining directory operation transactions. This covers the hard
link, remove and rename operations.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:02 -07:00
Brian Foster
8b922f0e6a xfs: defer frees from common inode allocation paths
Inode allocation can require block allocation for physical inode
chunk allocation, inode btree record insertion, and/or directory
block allocation for entry insertion. Any of these block allocation
requests can require AGFL fixups prior to the actual allocation.
Update the common file creation transacions to defer AGFL frees from
these contexts to avoid too much log reservation consumption
per-transaction.

Since these transactions are already passed down through the btree
cursors and da_args structure, this simply requires to attach dfops
to the transaction. Note that this covers tr_create, tr_mkdir and
tr_symlink. Other transactions such as tr_create_tmpfile do not
already make use of deferred operations and so are left alone for
the time being.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:02 -07:00
Brian Foster
658f8f9511 xfs: defer agfl frees from inode inactivation
XFS inode chunks are already freed via deferred operations (which
now also defer AGFL block frees), but inode btree blocks are freed
directly in the associated context. This has been known to lead to
log reservation overruns in particular workloads where an inobt
block free may require several AGFL block frees (and thus several
allocation btree modifications) before the inobt block itself is
actually freed.

To avoid this problem, defer the frees of any AGFL blocks before the
inobt block free takes place. This requires passing the dfops from
xfs_inactive_ifree() down through the inobt ->[alloc|free]_block()
callouts, which essentially only requires to attach the dfops to the
transaction since it is already carried all the way through to the
inobt update and allocation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:02 -07:00
Brian Foster
2bc5eba8b6 xfs: defer agfl block frees from deferred ops processing context
Now that AGFL block frees are deferred when dfops is set in the
transaction, start deferring AGFL block frees from contexts that are
known to push the limits of existing log reservations.

The first such context is deferred operation processing itself. This
primarily targets deferred extent frees (such as file extents and
inode chunks), but in doing so covers all allocation operations that
occur in deferred operation processing context.

Update xfs_defer_finish() to set and reset ->t_agfl_dfops across the
processing sequence. This means that any AGFL block frees due to
allocation events result in the addition of new EFIs to the dfops
rather than being processed immediately. xfs_defer_finish() rolls
the transaction at least once more to process the frees of the AGFL
blocks back to the allocation btrees and returns once the AGFL is
rectified.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:02 -07:00
Brian Foster
f8f2835a9c xfs: defer agfl block frees when dfops is available
The AGFL fixup code executes before every block allocation/free and
rectifies the AGFL based on the current, dynamic allocation
requirements of the fs. The AGFL must hold a minimum number of
blocks to satisfy a worst case split of the free space btrees caused
by the impending allocation operation. The AGFL is also updated to
maintain the implicit requirement for a minimum number of free slots
to satisfy a worst case join of the free space btrees.

Since the AGFL caches individual blocks, AGFL reduction typically
involves multiple, single block frees. We've had reports of
transaction overrun problems during certain workloads that boil down
to AGFL reduction freeing multiple blocks and consuming more space
in the log than was reserved for the transaction.

Since the objective of freeing AGFL blocks is to ensure free AGFL
free slots are available for the upcoming allocation, one way to
address this problem is to release surplus blocks from the AGFL
immediately but defer the free of those blocks (similar to how
file-mapped blocks are unmapped from the file in one transaction and
freed via a deferred operation) until the transaction is rolled.
This turns AGFL reduction into an operation with predictable log
reservation consumption.

Add the capability to defer AGFL block frees when a deferred ops
list is available to the AGFL fixup code. Add a dfops pointer to the
transaction to carry dfops through various contexts to the allocator
context. Deferring AGFL frees is  conditional behavior based on
whether the transaction pointer is populated. The long term
objective is to reuse the transaction pointer to clean up all
unrelated callchains that pass dfops on the stack along with a
transaction and in doing so, consistently defer AGFL blocks from the
allocator.

A bit of customization is required to handle deferred completion
processing because AGFL blocks are accounted against a per-ag
reservation pool and AGFL blocks are not inserted into the extent
busy list when freed (they are inserted when used and released back
to the AGFL). Reuse the majority of the existing deferred extent
free infrastructure and customize it appropriately to handle AGFL
blocks.

Note that this patch only adds infrastructure. It does not change
behavior because no callers have been updated to pass ->t_agfl_dfops
into the allocation code.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:02 -07:00
Brian Foster
4223f659dd xfs: create agfl block free helper function
Refactor the AGFL block free code into a new helper such that it can
be invoked from deferred context. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:01 -07:00
Eric Sandeen
72c5c5f6d0 xfs: print specific dqblk that failed verifiers
Rather than printing the top of the buffer that held a corrupted dqblk,
restructure things to print out the specific one that failed by pushing
the calls to the verifier_error function down into the verifier which
iterates over the buffer and detects the error.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:01 -07:00
Eric Sandeen
7224fa482a xfs: add full xfs_dqblk verifier
Add an xfs_dqblk verifier so that it can check the uuid on V5 filesystems;
it calls the existing xfs_dquot_verify verifier to validate the
xfs_disk_dquot_t contained inside it.  This lets us move the uuid
verification out of the crc verifier, which makes little sense.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:01 -07:00
Eric Sandeen
48fa1db87f xfs: pass full xfs_dqblk to repair during quotacheck
It's a bit dicey to pass in the smaller xfs_disk_dquot and then cast it to
something larger; pass in the full xfs_dqblk so we know the caller has sent
us the right thing.  Rename the function to xfs_dqblk_repair for
clarity.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:01 -07:00
Eric Sandeen
57ab324553 xfs: check type in quota verifier during quotacheck
During quotacheck we send in the quota type, so verify that as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:01 -07:00
Eric Sandeen
e381a0f6c2 xfs: remove unused flags arg from xfs_dquot_verify
Long ago the flags argument was used to determine whether to issue warnings
about corruptions, but that's done elsewhere now and the flag is unused
here, so remove it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:01 -07:00
Dave Chinner
dfa03a5f80 xfs: clean up locking in xfs_file_iomap_begin
Rather than checking what kind of locking is needed in a helper
function and then jumping through hoops to do the locking in line,
move the locking to the helper function that does all the checks
and rename it to xfs_ilock_for_iomap().

This also allows us to hoist all the nonblocking checks up into the
locking helper, further simplifier the code flow in
xfs_file_iomap_begin() and making it easier to understand.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:01 -07:00
Dave Chinner
d064178094 xfs: simplify xfs_file_iomap_begin() logic
The current logic that determines whether allocation should be done
has grown somewhat spaghetti like with the addition of IOMAP_NOWAIT
functionality. Separate out each of the different cases into single,
obvious checks to get rid most of the nested IOMAP_NOWAIT checks
in the allocation logic.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:01 -07:00
Dave Chinner
4f8ff44ba0 iomap: iomap_dio_rw() handles all sync writes
Currently iomap_dio_rw() only handles (data)sync write completions
for AIO. This means we can't optimised non-AIO IO to minimise device
flushes as we can't tell the caller whether a flush is required or
not.

To solve this problem and enable further optimisations, make
iomap_dio_rw responsible for data sync behaviour for all IO, not
just AIO.

In doing so, the sync operation is now accounted as part of the DIO
IO by inode_dio_end(), hence post-IO data stability updates will no
long race against operations that serialise via inode_dio_wait()
such as truncate or hole punch.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:01 -07:00
Dave Chinner
ed5c3e66a3 xfs: move generic_write_sync calls inwards
To prepare for iomap iinfrastructure based DSYNC optimisations.

While moving the code araound, move the XFS write bytes metric
update for direct IO into xfs_dio_write_end_io callback so that we
always capture the amount of data written via AIO+DIO. This fixes
the problem where queued AIO+DIO writes are not accounted to this
metric.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:00 -07:00
Dave Chinner
b027d4c97b xfs: don't retry xfs_buf_find on XBF_TRYLOCK failure
When looking at an event trace recently, I noticed that non-blocking
buffer lookup attempts would fail on cached locked buffers and then
run the slow cache-miss path. This means we are doing an xfs_buf
allocation, lookup and free unnecessarily every time we avoid
blocking on a locked buffer.

Fix this by changing _xfs_buf_find() to return an error status to
the caller to indicate that we failed the lock attempt rather than
just returning a NULL. This allows the higher level code to
discriminate between a cache miss and an cache hit that we failed to
lock.

This also allows us to return a -EFSCORRUPTED state if we are asked
to look up a block number outside the range of the filesystem in
_xfs_buf_find(), which moves us one step closer to being able to
handle such errors in a more graceful manner at the higher levels.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:00 -07:00
Dave Chinner
8925a3dc47 xfs: make xfs_buf_incore out of line
Move xfs_buf_incore out of line and make it the only way to look up
a buffer in the buffer cache from outside the buffer cache. Convert
the external users of _xfs_buf_find() to xfs_buf_incore() and make
_xfs_buf_find() static.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: actually rename xfs_incore -> xfs_buf_incore]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:00 -07:00
Eric Sandeen
e443523d19 xfs: trace ATTR flags in xattr tracepoints
This will trace i.e. the ATTR_SECURE/ATTR_CREATE/ATTR_REPLACE
flags as well as the OP_FLAGS.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:00 -07:00
Dave Chinner
8b26984dbd xfs: validate allocated inode number
When we have corrupted free inode btrees, we can attempt to
allocate inodes that we know are already allocated. Catch allocation
of these inodes and report corruption as early as possible to
prevent corruption propagation or deadlocks.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:00 -07:00
Dave Chinner
afca6c5b25 xfs: validate cached inodes are free when allocated
A recent fuzzed filesystem image cached random dcache corruption
when the reproducer was run. This often showed up as panics in
lookup_slow() on a null inode->i_ops pointer when doing pathwalks.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
....
Call Trace:
 lookup_slow+0x44/0x60
 walk_component+0x3dd/0x9f0
 link_path_walk+0x4a7/0x830
 path_lookupat+0xc1/0x470
 filename_lookup+0x129/0x270
 user_path_at_empty+0x36/0x40
 path_listxattr+0x98/0x110
 SyS_listxattr+0x13/0x20
 do_syscall_64+0xf5/0x280
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

but had many different failure modes including deadlocks trying to
lock the inode that was just allocated or KASAN reports of
use-after-free violations.

The cause of the problem was a corrupt INOBT on a v4 fs where the
root inode was marked as free in the inobt record. Hence when we
allocated an inode, it chose the root inode to allocate, found it in
the cache and re-initialised it.

We recently fixed a similar inode allocation issue caused by inobt
record corruption problem in xfs_iget_cache_miss() in commit
ee457001ed ("xfs: catch inode allocation state mismatch
corruption"). This change adds similar checks to the cache-hit path
to catch it, and turns the reproducer into a corruption shutdown
situation.

Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix typos in comment]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:00 -07:00
Darrick J. Wong
021ba8e98f xfs: cap the length of deduplication requests
Since deduplication potentially has to read in all the pages in both
files in order to compare the contents, cap the deduplication request
length at MAX_RW_COUNT/2 (roughly 1GB) so that we have /some/ upper bound
on the request length and can't just lock up the kernel forever.  Found
by running generic/304 after commit 1ddae54555b62 ("common/rc: add
missing 'local' keywords").

Reported-by: matorola@gmail.com
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-05-02 09:21:33 -07:00
Darrick J. Wong
7b38460dc8 xfs: don't fail when converting shortform attr to long form during ATTR_REPLACE
Kanda Motohiro reported that expanding a tiny xattr into a large xattr
fails on XFS because we remove the tiny xattr from a shortform fork and
then try to re-add it after converting the fork to extents format having
not removed the ATTR_REPLACE flag.  This fails because the attr is no
longer present, causing a fs shutdown.

This is derived from the patch in his bug report, but we really
shouldn't ignore a nonzero retval from the remove call.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199119
Reported-by: kanda.motohiro@gmail.com
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-17 19:10:15 -07:00
Darrick J. Wong
7d83fb1425 xfs: prevent creating negative-sized file via INSERT_RANGE
During the "insert range" fallocate operation, i_size grows by the
specified 'len' bytes.  XFS verifies that i_size + len < s_maxbytes, as
it should.  But this comparison is done using the signed 'loff_t', and
'i_size + len' can wrap around to a negative value, causing the check to
incorrectly pass, resulting in an inode with "negative" i_size.  This is
possible on 64-bit platforms, where XFS sets s_maxbytes = LLONG_MAX.
ext4 and f2fs don't run into this because they set a smaller s_maxbytes.

Fix it by using subtraction instead.

Reproducer:
    xfs_io -f file -c "truncate $(((1<<63)-1))" -c "finsert 0 4096"

Fixes: a904b1ca57 ("xfs: Add support FALLOC_FL_INSERT_RANGE for fallocate")
Cc: <stable@vger.kernel.org> # v4.1+
Originally-From: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix signed integer addition overflow too]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-17 17:29:11 -07:00
Eric Sandeen
2c4306f719 xfs: set format back to extents if xfs_bmap_extents_to_btree
If xfs_bmap_extents_to_btree fails in a mode where we call
xfs_iroot_realloc(-1) to de-allocate the root, set the
format back to extents.

Otherwise we can assume we can dereference ifp->if_broot
based on the XFS_DINODE_FMT_BTREE format, and crash.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199423
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-17 17:10:17 -07:00