Commit Graph

179 Commits

Author SHA1 Message Date
Dave Chinner 576af73228 xfs: convert btree cursor ag-private member name
bc_private.a -> bc_ag conversion via script:

`sed -i 's/bc_private\.a/bc_ag/g' fs/xfs/*[ch] fs/xfs/*/*[ch]`

And then revert the change to the bc_ag #define in
fs/xfs/libxfs/xfs_btree.h manually.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-03-13 10:37:14 -07:00
Darrick J. Wong 8d57c21600 xfs: add a function to deal with corrupt buffers post-verifiers
Add a helper function to get rid of buffers that we have decided are
corrupt after the verifiers have run.  This function is intended to
handle metadata checks that can't happen in the verifiers, such as
inter-block relationship checking.  Note that we now mark the buffer
stale so that it will not end up on any LRU and will be purged on
release.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-03-12 07:58:12 -07:00
Christoph Hellwig 9798f615ad xfs: remove XFS_BUF_TO_AGF
Just dereference bp->b_addr directly and make the code a little
simpler and more clear.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-11 09:11:39 -07:00
Christoph Hellwig 183606d824 xfs: remove the agfl_bno member from struct xfs_agfl
struct xfs_agfl is a header in front of the AGFL entries that exists
for CRC enabled file systems.  For not CRC enabled file systems the AGFL
is simply a list of agbno.  Make the CRC case similar to that by just
using the list behind the new header.  This indirectly solves a problem
with modern gcc versions that warn about taking addresses of packed
structures (and we have to pack the AGFL given that gcc rounds up
structure sizes).  Also replace the helper macro to get from a buffer
with an inline function in xfs_alloc.h to make the code easier to
read.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-11 09:11:38 -07:00
Zheng Bin d0c7feaf87 xfs: add agf freeblocks verify in xfs_agf_verify
We recently used fuzz(hydra) to test XFS and automatically generate
tmp.img(XFS v5 format, but some metadata is wrong)

xfs_repair information(just one AG):
agf_freeblks 0, counted 3224 in ag 0
agf_longest 536874136, counted 3224 in ag 0
sb_fdblocks 613, counted 3228

Test as follows:
mount tmp.img tmpdir
cp file1M tmpdir
sync

In 4.19-stable, sync will stuck, the reason is:
xfs_mountfs
  xfs_check_summary_counts
    if ((!xfs_sb_version_haslazysbcount(&mp->m_sb) ||
       XFS_LAST_UNMOUNT_WAS_CLEAN(mp)) &&
       !xfs_fs_has_sickness(mp, XFS_SICK_FS_COUNTERS))
	return 0;  -->just return, incore sb_fdblocks still be 613
    xfs_initialize_perag_data

cp file1M tmpdir -->ok(write file to pagecache)
sync -->stuck(write pagecache to disk)
xfs_map_blocks
  xfs_iomap_write_allocate
    while (count_fsb != 0) {
      nimaps = 0;
      while (nimaps == 0) { --> endless loop
         nimaps = 1;
         xfs_bmapi_write(..., &nimaps) --> nimaps becomes 0 again
xfs_bmapi_write
  xfs_bmap_alloc
    xfs_bmap_btalloc
      xfs_alloc_vextent
        xfs_alloc_fix_freelist
          xfs_alloc_space_available -->fail(agf_freeblks is 0)

In linux-next, sync not stuck, cause commit c2b3164320 ("xfs:
use the latest extent at writeback delalloc conversion time") remove
the above while, dmesg is as follows:
[   55.250114] XFS (loop0): page discard on page ffffea0008bc7380, inode 0x1b0c, offset 0.

Users do not know why this page is discard, the better soultion is:
1. Like xfs_repair, make sure sb_fdblocks is equal to counted
(xfs_initialize_perag_data did this, who is not called at this mount)
2. Add agf verify, if fail, will tell users to repair

This patch use the second soultion.

Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Signed-off-by: Ren Xudong <renxudong1@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-02 20:55:50 -08:00
Darrick J. Wong f48e2df8a8 xfs: make xfs_*read_agf return EAGAIN to ALLOC_FLAG_TRYLOCK callers
Refactor xfs_read_agf and xfs_alloc_read_agf to return EAGAIN if the
caller passed TRYLOCK and we weren't able to get the lock; and change
the callers to recognize this.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-01-26 14:32:26 -08:00
Darrick J. Wong ee647f85cb xfs: remove the xfs_btree_get_buf[ls] functions
Remove the xfs_btree_get_bufs and xfs_btree_get_bufl functions, since
they're pretty trivial oneliners.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-01-26 14:32:26 -08:00
Darrick J. Wong 4ed8e27b4f xfs: make xfs_buf_read_map return an error code
Convert xfs_buf_read_map() to return numeric error codes like most
everywhere else in xfs.  This involves moving the open-coded logic that
reports metadata IO read / corruption errors and stales the buffer into
xfs_buf_read_map so that the logic is all in one place.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-01-26 14:32:26 -08:00
Darrick J. Wong 1cac233cfe xfs: refactor agfl length computation function
Refactor xfs_alloc_min_freelist to accept a NULL @pag argument, in which
case it returns the largest possible minimum length.  This will be used
in an upcoming patch to compute the length of the AGFL at mkfs time.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-12-19 07:53:48 -08:00
Darrick J. Wong a71895c5da xfs: convert open coded corruption check to use XFS_IS_CORRUPT
Convert the last of the open coded corruption check and report idioms to
use the XFS_IS_CORRUPT macro.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-13 11:08:01 -08:00
Darrick J. Wong f9e0370648 xfs: kill the XFS_WANT_CORRUPT_* macros
The XFS_WANT_CORRUPT_* macros conceal subtle side effects such as the
creation of local variables and redirections of the code flow.  This is
pretty ugly, so replace them with explicit XFS_IS_CORRUPT tests that
remove both of those ugly points.  The change was performed with the
following coccinelle script:

@@
expression mp, test;
identifier label;
@@

- XFS_WANT_CORRUPTED_GOTO(mp, test, label);
+ if (XFS_IS_CORRUPT(mp, !test)) { error = -EFSCORRUPTED; goto label; }

@@
expression mp, test;
@@

- XFS_WANT_CORRUPTED_RETURN(mp, test);
+ if (XFS_IS_CORRUPT(mp, !test)) return -EFSCORRUPTED;

@@
expression mp, lval, rval;
@@

- XFS_IS_CORRUPT(mp, !(lval == rval))
+ XFS_IS_CORRUPT(mp, lval != rval)

@@
expression mp, e1, e2;
@@

- XFS_IS_CORRUPT(mp, !(e1 && e2))
+ XFS_IS_CORRUPT(mp, !e1 || !e2)

@@
expression e1, e2;
@@

- !(e1 == e2)
+ e1 != e2

@@
expression e1, e2, e3, e4, e5, e6;
@@

- !(e1 == e2 && e3 == e4) || e5 != e6
+ e1 != e2 || e3 != e4 || e5 != e6

@@
expression e1, e2, e3, e4, e5, e6;
@@

- !(e1 == e2 || (e3 <= e4 && e5 <= e6))
+ e1 != e2 && (e3 > e4 || e5 > e6)

@@
expression mp, e1, e2;
@@

- XFS_IS_CORRUPT(mp, !(e1 <= e2))
+ XFS_IS_CORRUPT(mp, e1 > e2)

@@
expression mp, e1, e2;
@@

- XFS_IS_CORRUPT(mp, !(e1 < e2))
+ XFS_IS_CORRUPT(mp, e1 >= e2)

@@
expression mp, e1;
@@

- XFS_IS_CORRUPT(mp, !!e1)
+ XFS_IS_CORRUPT(mp, e1)

@@
expression mp, e1, e2;
@@

- XFS_IS_CORRUPT(mp, !(e1 || e2))
+ XFS_IS_CORRUPT(mp, !e1 && !e2)

@@
expression mp, e1, e2, e3, e4;
@@

- XFS_IS_CORRUPT(mp, !(e1 == e2) && !(e3 == e4))
+ XFS_IS_CORRUPT(mp, e1 != e2 && e3 != e4)

@@
expression mp, e1, e2, e3, e4;
@@

- XFS_IS_CORRUPT(mp, !(e1 <= e2) || !(e3 >= e4))
+ XFS_IS_CORRUPT(mp, e1 > e2 || e3 < e4)

@@
expression mp, e1, e2, e3, e4;
@@

- XFS_IS_CORRUPT(mp, !(e1 == e2) && !(e3 <= e4))
+ XFS_IS_CORRUPT(mp, e1 != e2 && e3 > e4)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-12 17:19:02 -08:00
Darrick J. Wong 5113f8ec37 xfs: clean up weird while loop in xfs_alloc_ag_vextent_near
Refactor the weird while loop out of existence.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-10 10:21:58 -08:00
Joe Perches cf085a1b5d xfs: Correct comment tyops -> typos
Just fix the typos checkpatch notices...

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 10:21:57 -08:00
Darrick J. Wong a5155b870d xfs: always log corruption errors
Make sure we log something to dmesg whenever we return -EFSCORRUPTED up
the call stack.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-04 13:55:54 -08:00
Christoph Hellwig c34d570d15 xfs: cleanup use of the XFS_ALLOC_ flags
Always set XFS_ALLOC_USERDATA for data fork allocations, and check it
in xfs_alloc_is_userdata instead of the current obsfucated check.
Also remove the xfs_alloc_is_userdata and xfs_alloc_allow_busy_reuse
helpers to make the code a little easier to understand.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-03 10:22:31 -08:00
Christoph Hellwig fd638f1de1 xfs: move extent zeroing to xfs_bmapi_allocate
Move the extent zeroing case there for the XFS_BMAPI_ZERO flag outside
the low-level allocator and into xfs_bmapi_allocate, where is still
is in transaction context, but outside the very lowlevel code where
it doesn't belong.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-03 10:22:30 -08:00
Dave Chinner 1c743574de xfs: cap longest free extent to maximum allocatable
Cap longest extent to the largest we can allocate based on limits
calculated at mount time. Dynamic state (such as finobt blocks)
can result in the longest free extent exceeding the size we can
allocate, and that results in failure to align full AG allocations
when the AG is empty.

Result:

xfs_io-4413  [003]   426.412459: xfs_alloc_vextent_loopfailed: dev 8:96 agno 0 agbno 32 minlen 243968 maxlen 244000 mod 0 prod 1 minleft 1 total 262148 alignment 32 minalignslop 0 len 0 type NEAR_BNO otype START_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x5 firstblock 0xffffffffffffffff

minlen and maxlen are now separated by the alignment size, and
allocation fails because args.total > free space in the AG.

[bfoster: Added xfs_bmap_btalloc() changes.]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-23 17:01:07 -07:00
Brian Foster dc8e69bd72 xfs: optimize near mode bnobt scans with concurrent cntbt lookups
The near mode fallback algorithm consists of a left/right scan of
the bnobt. This algorithm has very poor breakdown characteristics
under worst case free space fragmentation conditions. If a suitable
extent is far enough from the locality hint, each allocation may
scan most or all of the bnobt before it completes. This causes
pathological behavior and extremely high allocation latencies.

While locality is important to near mode allocations, it is not so
important as to incur pathological allocation latency to provide the
asolute best available locality for every allocation. If the
allocation is large enough or far enough away, there is a point of
diminishing returns. As such, we can bound the overall operation by
including an iterative cntbt lookup in the broader search. The cntbt
lookup is optimized to immediately find the extent with best
locality for the given size on each iteration. Since the cntbt is
indexed by extent size, the lookup repeats with a variably
aggressive increasing search key size until it runs off the edge of
the tree.

This approach provides a natural balance between the two algorithms
for various situations. For example, the bnobt scan is able to
satisfy smaller allocations such as for inode chunks or btree blocks
more quickly where the cntbt search may have to search through a
large set of extent sizes when the search key starts off small
relative to the largest extent in the tree. On the other hand, the
cntbt search more deterministically covers the set of suitable
extents for larger data extent allocation requests that the bnobt
scan may have to search the entire tree to locate.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21 09:04:58 -07:00
Brian Foster d29688257f xfs: factor out tree fixup logic into helper
Lift the btree fixup path into a helper function.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21 09:04:58 -07:00
Brian Foster 0e26d5ca4a xfs: refactor near mode alloc bnobt scan into separate function
In preparation to enhance the near mode allocation bnobt scan algorithm, lift
it into a separate function. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21 09:04:58 -07:00
Brian Foster 78d7aabdee xfs: refactor and reuse best extent scanning logic
The bnobt "find best" helper implements a simple btree walker
function. This general pattern, or a subset thereof, is reused in
various parts of a near mode allocation operation. For example, the
bnobt left/right scans are each iterative btree walks along with the
cntbt lastblock scan.

Rework this function into a generic btree walker, add a couple
parameters to control termination behavior from various contexts and
reuse it where applicable.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21 09:04:58 -07:00
Brian Foster 4a65b7c2c7 xfs: refactor allocation tree fixup code
Both algorithms duplicate the same btree allocation code. Eliminate
the duplication and reuse the fallback algorithm codepath.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21 09:04:58 -07:00
Brian Foster fec0afdaf4 xfs: reuse best extent tracking logic for bnobt scan
The near mode bnobt scan searches left and right in the bnobt
looking for the closest free extent to the allocation hint that
satisfies minlen. Once such an extent is found, the left/right
search terminates, we search one more time in the opposite direction
and finish the allocation with the best overall extent.

The left/right and find best searches are currently controlled via a
combination of cursor state and local variables. Clean up this code
and prepare for further improvements to the near mode fallback
algorithm by reusing the allocation cursor best extent tracking
mechanism. Update the tracking logic to deactivate bnobt cursors
when out of allocation range and replace open-coded extent checks to
calls to the common helper. In doing so, rename some misnamed local
variables in the top-level near mode allocation function.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21 09:04:58 -07:00
Brian Foster 396bbf3c65 xfs: refactor cntbt lastblock scan best extent logic into helper
The cntbt lastblock scan checks the size, alignment, locality, etc.
of each free extent in the block and compares it with the current
best candidate. This logic will be reused by the upcoming optimized
cntbt algorithm, so refactor it into a separate helper. Note that
acur->diff is now initialized to -1 (unsigned) instead of 0 to
support the more granular comparison logic in the new helper.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21 09:04:58 -07:00
Brian Foster c62321a2a0 xfs: track best extent from cntbt lastblock scan in alloc cursor
If the size lookup lands in the last block of the by-size btree, the
near mode algorithm scans the entire block for the extent with best
available locality. In preparation for similar best available
extent tracking across both btrees, extend the allocation cursor
with best extent data and lift the associated state from the cntbt
last block scan code. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21 09:04:58 -07:00
Brian Foster d6d3aff203 xfs: track allocation busy state in allocation cursor
Extend the allocation cursor to track extent busy state for an
allocation attempt. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21 09:04:58 -07:00
Brian Foster f5e7dbea1e xfs: introduce allocation cursor data structure
Introduce a new allocation cursor data structure to encapsulate the
various states and structures used to perform an extent allocation.
This structure will eventually be used to track overall allocation
state across different search algorithms on both free space btrees.

To start, include the three btree cursors (one for the cntbt and two
for the bnobt left/right search) used by the near mode allocation
algorithm and refactor the cursor setup and teardown code into
helpers. This slightly changes cursor memory allocation patterns,
but otherwise makes no functional changes to the allocation
algorithm.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix sparse complaints]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21 09:04:58 -07:00
Brian Foster f6b428a46d xfs: track active state of allocation btree cursors
The upcoming allocation algorithm update searches multiple
allocation btree cursors concurrently. As such, it requires an
active state to track when a particular cursor should continue
searching. While active state will be modified based on higher level
logic, we can define base functionality based on the result of
allocation btree lookups.

Define an active flag in the private area of the btree cursor.
Update it based on the result of lookups in the existing allocation
btree helpers. Finally, provide a new helper to query the current
state.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21 09:04:58 -07:00
Tetsuo Handa 707e0ddaf6 fs: xfs: Remove KM_NOSLEEP and KM_SLEEP.
Since no caller is using KM_NOSLEEP and no callee branches on KM_SLEEP,
we can remove KM_NOSLEEP and replace KM_SLEEP with 0.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-26 12:06:22 -07:00
Darrick J. Wong 5bb46e3e18 xfs: create iterator error codes
Currently, xfs doesn't have generic error codes defined for "stop
iterating"; we just reuse the XFS_BTREE_QUERY_* return values.  This
looks a little weird if we're not actually iterating a btree index.
Before we start adding more iterators, we should create general
XFS_ITER_{CONTINUE,ABORT} return values and define the XFS_BTREE_QUERY_*
ones from that.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:05 -07:00
Eric Sandeen 250d4b4c40 xfs: remove unused header files
There are many, many xfs header files which are included but
unneeded (or included twice) in the xfs code, so remove them.

nb: xfs_linux.h includes about 9 headers for everyone, so those
explicit includes get removed by this.  I'm not sure what the
preference is, but if we wanted explicit includes everywhere,
a followup patch could remove those xfs_*.h includes from
xfs_linux.h and move them into the files that need them.
Or it could be left as-is.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:30:43 -07:00
Brian Foster 7e36a3a63d xfs: always update params on small allocation
xfs_alloc_ag_vextent_small() doesn't update the output parameters in
the event of an AGFL allocation. Instead, it updates the
xfs_alloc_arg structure directly to complete the allocation.

Update both args and the output params to provide consistent
behavior for future callers.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:30:20 -07:00
Brian Foster 6691cd9267 xfs: skip small alloc cntbt logic on NULL cursor
The small allocation helper is implemented in a way that is fairly
tightly integrated to the existing allocation algorithms. It expects
a cntbt cursor beyond the end of the tree, attempts to locate the
last record in the tree and only attempts an AGFL allocation if the
cntbt is empty.

The upcoming generic algorithm doesn't rely on the cntbt processing
of this function. It will only call this function when the cntbt
doesn't have a big enough extent or is empty and thus AGFL
allocation is the only remaining option. Tweak
xfs_alloc_ag_vextent_small() to handle a NULL cntbt cursor and skip
the cntbt logic. This facilitates use by the existing allocation
code and new code that only requires an AGFL allocation attempt.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:30:20 -07:00
Brian Foster c63cdd4fc9 xfs: move small allocation helper
Move the small allocation helper further up in the file to avoid the
need for a function declaration. The remaining declarations will be
removed by followup patches. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:30:19 -07:00
Brian Foster 2a4f35f984 xfs: clean up small allocation helper
xfs_alloc_ag_vextent_small() is kind of a mess. Clean it up in
preparation for future changes. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:30:19 -07:00
Christoph Hellwig dbd329f1e4 xfs: add struct xfs_mount pointer to struct xfs_buf
We need to derive the mount pointer from a buffer in a lot of place.
Add a direct pointer to short cut the pointer chasing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:29 -07:00
Eric Sandeen f5b999c03f xfs: remove unused flag arguments
There are several functions which take a flag argument that is
only ever passed as "0," so remove these arguments.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-12 09:00:00 -07:00
Brian Foster 362f5e745a xfs: assert that we don't enter agfl freeing with a non-permanent transaction
Block allocation requires a permanent transaction for deferred AGFL
frees.  Add an assert in the block allocation path to make explicit and
obvious to future callers the requirement of a transaction with a
permanent reservation.

Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: split this out from the previous patch per hch request]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-04-23 08:36:23 -07:00
Brian Foster 1ca89fbc48 xfs: don't account extra agfl blocks as available
The block allocation AG selection code has parameters that allow a
caller to perform multiple allocations from a single AG and
transaction (under certain conditions). The parameters specify the
total block allocation count required by the transaction and the AG
selection code selects and locks an AG that will be able to satisfy
the overall requirement. If the available block accounting
calculation turns out to be inaccurate and a subsequent allocation
call fails with -ENOSPC, the resulting transaction cancel leads to
filesystem shutdown because the transaction is dirty.

This exact problem can be reproduced with a highly parallel space
consumer and fsstress workload running long enough to a large
filesystem against -ENOSPC conditions. A bmbt block allocation
request made for inode extent to bmap format conversion after an
extent allocation is expected to be satisfied by the same AG and the
same transaction as the extent allocation. The bmbt block allocation
fails, however, because the block availability of the AG has changed
since the AG was selected (outside of the blocks used for the extent
itself).

The inconsistent block availability calculation is caused by the
deferred block freeing behavior of the AGFL. This immediately
removes extra blocks from the AGFL to free up AGFL slots, but rather
than immediately freeing such blocks as was done in the past, the
block free is deferred such that said blocks are not available for
allocation until the current transaction commits. The AG selection
logic currently considers all AGFL blocks as available and executes
shortly before any extra AGFL blocks are freed. This means the block
availability of the current AG can change before the first
allocation even occurs, but in practice a failure is more likely to
manifest via a subsequent allocation because extent allocation
usually has a contiguity requirement larger than a single block that
can't be satisfied from the AGFL.

In general, XFS prefers operational robustness to absolute
allocation efficiency. In other words, we prefer to return -ENOSPC
slightly earlier at the expense of not being able to allocate every
last block in an AG to avoid this kind of problem. As such, update
the AG block availability calculation to consider extra AGFL blocks
as unavailable since they are immediately removed following the
calculation and will not become available until the current
transaction commits.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-04-14 18:15:57 -07:00
Brian Foster 39708c20ab xfs: miscellaneous verifier magic value fixups
Most buffer verifiers have hardcoded magic value checks
conditionalized on the version of the filesystem. The magic value
field of the verifier structure facilitates abstraction of some of
this code. Populate the ->magic field of various verifiers to take
advantage of this abstraction. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster e34d3e74eb xfs: always check magic values in on-disk byte order
Most verifiers that check on-disk magic values convert the CPU
endian magic value constant to disk endian to facilitate compile
time optimization of the byte swap and reduce the need for runtime
byte swaps in buffer verifiers. Several buffer verifiers do not
follow this pattern. Update those verifiers for consistency.

Also fix up a random typo in the inode readahead verifier name.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong 7280fedaf3 xfs: remove xfs_rmap_ag_owner and friends
Owner information for static fs metadata can be defined readonly at
build time because it never changes across filesystems.  This enables us
to reduce stack usage (particularly in scrub) because we can use the
statically defined oinfo structures.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12 08:47:16 -08:00
Darrick J. Wong 66e3237e72 xfs: const-ify xfs_owner_info arguments
Only certain functions actually change the contents of an
xfs_owner_info; the rest can accept a const struct pointer.  This will
enable us to save stack space by hoisting static owner info types to
be const global variables.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12 08:47:16 -08:00
Pan Bian fe5ed6c22e xfs: libxfs: move xfs_perag_put late
The function xfs_alloc_get_freelist calls xfs_perag_put to drop the
reference. However, pag->pagf_btreeblks is read and written after the
put operation. This patch moves the put operation later.

Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
[darrick: minor changelog edits]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-12 08:46:20 -08:00
Brian Foster c03edc9e49 xfs: always defer agfl block frees
The AGFL fixup code conditionally defers block frees from the free
list based on whether the current transaction has an associated
xfs_defer_ops structure. Now that dfops is embedded in the
transaction and the internal dfops is used unconditionally, this
invariant is always true.

Remove the now dead logic to check for ->t_dfops in
xfs_alloc_fix_freelist() and unconditionally defer AGFL block frees.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster 0f37d1780c xfs: pass transaction to xfs_defer_add()
The majority of remaining references to struct xfs_defer_ops in XFS
are associated with xfs_defer_add(). At this point, there are no
more external xfs_defer_ops users left. All instances of
xfs_defer_ops are embedded in the transaction, which means we can
safely pass the transaction down to the dfops add interface.

Update xfs_defer_add() to receive the transaction as a parameter.
Various subsystems implement wrappers to allocate and construct the
context specific data structures for the associated deferred
operation type. Update these to also carry the transaction down as
needed and clean up unused dfops parameters along the way.

This removes most of the remaining references to struct
xfs_defer_ops throughout the code and facilitates removal of the
structure.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[darrick: fix unused variable warnings with ftrace disabled]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Darrick J. Wong ff23f4af7e xfs: move extent busy tree initialization to xfs_initialize_perag
Move the per-AG busy extent tree initialization to the per-ag structure
initialization since we don't want online repair to leak the old tree.
We only deconstruct the tree at unmount time, so this should be safe.
This also enables us to eliminate the commented out initialization in
the xfsprogs libxfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-31 13:18:09 -07:00
Carlos Maiolino efe8032773 xfs: Initialize variables in xfs_alloc_get_rec before using them
Make sure we initialize *bno and *len, before jumping to out_bad_rec
label, and risk calling xfs_warn() with uninitialized variables.

Coverity: 100898
Coverity: 1437081
Coverity: 1437129
Coverity: 1437191
Coverity: 1437201
Coverity: 1437212
Coverity: 1437341
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:36 -07:00
Brian Foster 64396ff2c2 xfs: remove xfs_alloc_arg firstblock field
The xfs_alloc_arg.firstblock field is used to control the starting
agno for an allocation. The structure already carries a pointer to
the transaction, which carries the current firstblock value.

Remove the field and access ->t_firstblock directly in the
allocation code.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:30 -07:00
Brian Foster 6aa6718439 xfs: rename xfs_trans ->t_agfl_dfops to ->t_dfops
The ->t_agfl_dfops field is currently used to defer agfl block frees
from associated transaction contexts. While all known problematic
contexts have already been updated to use ->t_agfl_dfops, the
broader goal is defer agfl frees from all callers that already use a
deferred operations structure. Further, the transaction field
facilitates a good amount of code clean up where the transaction and
dfops have historically been passed down through the stack
separately.

Rename the field to something more generic to prepare to use it as
such throughout XFS. This patch does not change behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:07 -07:00
Dave Chinner 86210fbeba xfs: move various type verifiers to common file
New verification functions like xfs_verify_fsbno() and
xfs_verify_agino() are spread across multiple files and different
header files. They really don't fit cleanly into the places they've
been put, and have wider scope than the current header includes.

Move the type verifiers to a new file in libxfs (xfs-types.c) and
the prototypes to xfs_types.h where they will be visible to all the
code that uses the types.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-08 10:07:51 -07:00
Dave Chinner 0b61f8a407 xfs: convert to SPDX license tags
Remove the verbose license text from XFS files and replace them
with SPDX tags. This does not change the license of any of the code,
merely refers to the common, up-to-date license files in LICENSES/

This change was mostly scripted. fs/xfs/Makefile and
fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
and modified by the following command:

for f in `git grep -l "GNU General" fs/xfs/` ; do
	echo $f
	cat $f | awk -f hdr.awk > $f.new
	mv -f $f.new $f
done

And the hdr.awk script that did the modification (including
detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
is as follows:

$ cat hdr.awk
BEGIN {
	hdr = 1.0
	tag = "GPL-2.0"
	str = ""
}

/^ \* This program is free software/ {
	hdr = 2.0;
	next
}

/any later version./ {
	tag = "GPL-2.0+"
	next
}

/^ \*\// {
	if (hdr > 0.0) {
		print "// SPDX-License-Identifier: " tag
		print str
		print $0
		str=""
		hdr = 0.0
		next
	}
	print $0
	next
}

/^ \* / {
	if (hdr > 1.0)
		next
	if (hdr > 0.0) {
		if (str != "")
			str = str "\n"
		str = str $0
		next
	}
	print $0
	next
}

/^ \*/ {
	if (hdr > 0.0)
		next
	print $0
	next
}

// {
	if (hdr > 0.0) {
		if (str != "")
			str = str "\n"
		str = str $0
		next
	}
	print $0
}

END { }
$

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 14:17:53 -07:00
Dave Chinner 9e6c08d4a8 xfs: validate btree records on retrieval
So we don't check the validity of records as we walk the btree. When
there are corrupt records in the free space btree (e.g. zero
startblock/length or beyond EOAG) we just blindly use it and things
go bad from there. That leads to assert failures on debug kernels
like this:

XFS: Assertion failed: fs_is_ok, file: fs/xfs/libxfs/xfs_alloc.c, line: 450
....
Call Trace:
 xfs_alloc_fixup_trees+0x368/0x5c0
 xfs_alloc_ag_vextent_near+0x79a/0xe20
 xfs_alloc_ag_vextent+0x1d3/0x330
 xfs_alloc_vextent+0x5e9/0x870

Or crashes like this:

XFS (loop0): xfs_buf_find: daddr 0x7fb28 out of range, EOFS 0x8000
.....
BUG: unable to handle kernel NULL pointer dereference at 00000000000000c8
....
Call Trace:
 xfs_bmap_add_extent_hole_real+0x67d/0x930
 xfs_bmapi_write+0x934/0xc90
 xfs_da_grow_inode_int+0x27e/0x2f0
 xfs_dir2_grow_inode+0x55/0x130
 xfs_dir2_sf_to_block+0x94/0x5d0
 xfs_dir2_sf_addname+0xd0/0x590
 xfs_dir_createname+0x168/0x1a0
 xfs_rename+0x658/0x9b0

By checking that free space records pulled from the trees are
within the valid range, we catch many of these corruptions before
they can do damage.

This is a generic btree record checking deficiency. We need to
validate the records we fetch from all the different btrees before
we use them to catch corruptions like this.

This patch results in a corrupt record emitting an error message and
returning -EFSCORRUPTED, and the higher layers catch that and abort:

 XFS (loop0): Size Freespace BTree record corruption in AG 0 detected!
 XFS (loop0): start block 0x0 block count 0x0
 XFS (loop0): Internal error xfs_trans_cancel at line 1012 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x42a/0x670
 .....
 Call Trace:
  dump_stack+0x85/0xcb
  xfs_trans_cancel+0x19f/0x1c0
  xfs_create+0x42a/0x670
  xfs_generic_create+0x1f6/0x2c0
  vfs_create+0xf9/0x180
  do_mknodat+0x1f9/0x210
  do_syscall_64+0x5a/0x180
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
.....
 XFS (loop0): xfs_do_force_shutdown(0x8) called from line 1013 of file fs/xfs/xfs_trans.c.  Return address = ffffffff81500868
 XFS (loop0): Corruption of in-memory data detected.  Shutting down filesystem

Signed-off-by: Dave Chinner <dchinner@redhat.com>

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 08:12:00 -07:00
Darrick J. Wong a37f7b127e xfs: xfs_alloc_get_rec should return EFSCORRUPTED for obvious bnobt corruption
Return -EFSCORRUPTED when the bnobt/cntbt return obviously corrupt
values, rather than letting them bounce around in the internal code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:30 -07:00
Darrick J. Wong 9f3a080ef1 xfs: hoist xfs_scrub_agfl_walk to libxfs as xfs_agfl_walk
This function is basically a generic AGFL block iterator, so promote it
to libxfs ahead of online repair wanting to use it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 18:12:50 -07:00
Brian Foster fcb762f5de xfs: add bmapi nodiscard flag
Freed extents are unconditionally discarded when online discard is
enabled. Define XFS_BMAPI_NODISCARD to allow callers to bypass
discards when unnecessary. For example, this will be useful for
eofblocks trimming.

This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:46 -07:00
Brian Foster f8f2835a9c xfs: defer agfl block frees when dfops is available
The AGFL fixup code executes before every block allocation/free and
rectifies the AGFL based on the current, dynamic allocation
requirements of the fs. The AGFL must hold a minimum number of
blocks to satisfy a worst case split of the free space btrees caused
by the impending allocation operation. The AGFL is also updated to
maintain the implicit requirement for a minimum number of free slots
to satisfy a worst case join of the free space btrees.

Since the AGFL caches individual blocks, AGFL reduction typically
involves multiple, single block frees. We've had reports of
transaction overrun problems during certain workloads that boil down
to AGFL reduction freeing multiple blocks and consuming more space
in the log than was reserved for the transaction.

Since the objective of freeing AGFL blocks is to ensure free AGFL
free slots are available for the upcoming allocation, one way to
address this problem is to release surplus blocks from the AGFL
immediately but defer the free of those blocks (similar to how
file-mapped blocks are unmapped from the file in one transaction and
freed via a deferred operation) until the transaction is rolled.
This turns AGFL reduction into an operation with predictable log
reservation consumption.

Add the capability to defer AGFL block frees when a deferred ops
list is available to the AGFL fixup code. Add a dfops pointer to the
transaction to carry dfops through various contexts to the allocator
context. Deferring AGFL frees is  conditional behavior based on
whether the transaction pointer is populated. The long term
objective is to reuse the transaction pointer to clean up all
unrelated callchains that pass dfops on the stack along with a
transaction and in doing so, consistently defer AGFL blocks from the
allocator.

A bit of customization is required to handle deferred completion
processing because AGFL blocks are accounted against a per-ag
reservation pool and AGFL blocks are not inserted into the extent
busy list when freed (they are inserted when used and released back
to the AGFL). Reuse the majority of the existing deferred extent
free infrastructure and customize it appropriately to handle AGFL
blocks.

Note that this patch only adds infrastructure. It does not change
behavior because no callers have been updated to pass ->t_agfl_dfops
into the allocation code.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:02 -07:00
Brian Foster 4223f659dd xfs: create agfl block free helper function
Refactor the AGFL block free code into a new helper such that it can
be invoked from deferred context. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:01 -07:00
Eric Sandeen a1f69417c6 xfs: non-scrub - remove unused function parameters
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-09 10:23:42 -07:00
Brian Foster a27ba2607e xfs: detect agfl count corruption and reset agfl
The struct xfs_agfl v5 header was originally introduced with
unexpected padding that caused the AGFL to operate with one less
slot than intended. The header has since been packed, but the fix
left an incompatibility for users who upgrade from an old kernel
with the unpacked header to a newer kernel with the packed header
while the AGFL happens to wrap around the end. The newer kernel
recognizes one extra slot at the physical end of the AGFL that the
previous kernel did not. The new kernel will eventually attempt to
allocate a block from that slot, which contains invalid data, and
cause a crash.

This condition can be detected by comparing the active range of the
AGFL to the count. While this detects a padding mismatch, it can
also trigger false positives for unrelated flcount corruption. Since
we cannot distinguish a size mismatch due to padding from unrelated
corruption, we can't trust the AGFL enough to simply repopulate the
empty slot.

Instead, avoid unnecessarily complex detection logic and and use a
solution that can handle any form of flcount corruption that slips
through read verifiers: distrust the entire AGFL and reset it to an
empty state. Any valid blocks within the AGFL are intentionally
leaked. This requires xfs_repair to rectify (which was already
necessary based on the state the AGFL was found in). The reset
mitigates the side effect of the padding mismatch problem from a
filesystem crash to a free space accounting inconsistency. The
generic approach also means that this patch can be safely backported
to kernels with or without a packed struct xfs_agfl.

Check the AGF for an invalid freelist count on initial read from
disk. If detected, set a flag on the xfs_perag to indicate that a
reset is required before the AGFL can be used. In the first
transaction that attempts to use a flagged AGFL, reset it to empty,
warn the user about the inconsistency and allow the freelist fixup
code to repopulate the AGFL with new blocks. The xfs_perag flag is
cleared to eliminate the need for repeated checks on each block
allocation operation.

This allows kernels that include the packing fix commit 96f859d52b
("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct")
to handle older unpacked AGFL formats without a filesystem crash.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by Dave Chiluk <chiluk+linuxxfs@indeed.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-23 18:05:06 -07:00
Brian Foster 0ab32086d0 xfs: account only rmapbt-used blocks against rmapbt perag res
The rmapbt perag metadata reservation reserves blocks for the
reverse mapping btree (rmapbt). Since the rmapbt uses blocks from
the agfl and perag accounting is updated as blocks are allocated
from the allocation btrees, the reservation actually accounts blocks
as they are allocated to (or freed from) the agfl rather than the
rmapbt itself.

While this works for blocks that are eventually used for the rmapbt,
not all agfl blocks are destined for the rmapbt. Blocks that are
allocated to the agfl (and thus "reserved" for the rmapbt) but then
used by another structure leads to a growing inconsistency over time
between the runtime tracking of rmapbt usage vs. actual rmapbt
usage. Since the runtime tracking thinks all agfl blocks are rmapbt
blocks, it essentially believes that less future reservation is
required to satisfy the rmapbt than what is actually necessary.

The inconsistency is rectified across mount cycles because the perag
reservation is initialized based on the actual rmapbt usage at mount
time. The problem, however, is that the excessive drain of the
reservation at runtime opens a window to allocate blocks for other
purposes that might be required for the rmapbt on a subsequent
mount. This problem can be demonstrated by a simple test that runs
an allocation workload to consume agfl blocks over time and then
observe the difference in the agfl reservation requirement across an
unmount/mount cycle:

  mount ...: xfs_ag_resv_init: ... resv 3193 ask 3194 len 3194
  ...
  ...      : xfs_ag_resv_alloc_extent: ... resv 2957 ask 3194 len 1
  umount...: xfs_ag_resv_free: ... resv 2956 ask 3194 len 0
  mount ...: xfs_ag_resv_init: ... resv 3052 ask 3194 len 3194

As the above tracepoints show, the reservation requirement reduces
from 3194 blocks to 2956 blocks as the workload runs.  Without any
other changes in the filesystem, the same reservation requirement
jumps from 2956 to 3052 blocks over a umount/mount cycle.

To address this divergence, update the RMAPBT reservation to account
blocks used for the rmapbt only rather than all blocks filled into
the agfl. This patch makes several high-level changes toward that
end:

1.) Reintroduce an AGFL reservation type to serve as an accounting
    no-op for blocks allocated to (or freed from) the AGFL.
2.) Invoke RMAPBT usage accounting from the actual rmapbt block
    allocation path rather than the AGFL allocation path.

The first change is required because agfl blocks are considered free
blocks throughout their lifetime. The perag reservation subsystem is
invoked unconditionally by the allocation subsystem, so we need a
way to tell the perag subsystem (via the allocation subsystem) to
not make any accounting changes for blocks filled into the AGFL.

The second change causes the in-core RMAPBT reservation usage
accounting to remain consistent with the on-disk state at all times
and eliminates the risk of leaving the rmapbt reservation
underfilled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:57 -07:00
Brian Foster 2159286335 xfs: rename agfl perag res type to rmapbt
The AGFL perag reservation type accounts all allocations that feed
into (or are released from) the allocation group free list (agfl).
The purpose of the reservation is to support worst case conditions
for the reverse mapping btree (rmapbt). As such, the agfl
reservation usage accounting only considers rmapbt usage when the
in-core counters are initialized at mount time.

This implementation inconsistency leads to divergence of the in-core
and on-disk usage accounting over time. In preparation to resolve
this inconsistency and adjust the AGFL reservation into an rmapbt
specific reservation, rename the AGFL reservation type and
associated accounting fields to something more rmapbt-specific. Also
fix up a couple tracepoints that incorrectly use the AGFL
reservation type to pass the agfl state of the associated extent
where the raw reservation type is expected.

Note that this patch does not change perag reservation behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:57 -07:00
Dave Chinner a78ee256c3 xfs: convert XFS_AGFL_SIZE to a helper function
The AGFL size calculation is about to get more complex, so lets turn
the macro into a function first and remove the macro.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[darrick: forward port to newer kernel, simplify the helper]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-11 20:27:56 -07:00
Carlos Maiolino fb1755a645 Split buffer's b_fspriv field
By splitting the b_fspriv field into two different fields (b_log_item
and b_li_list). It's possible to get rid of an old ABI workaround, by
using the new b_log_item field to store xfs_buf_log_item separated from
the log items attached to the buffer, which will be linked in the new
b_li_list field.

This way, there is no more need to reorder the log items list to place
the buf_log_item at the beginning of the list, simplifying a bit the
logic to handle buffer IO.

This also opens the possibility to change buffer's log items list into a
proper list_head.

b_log_item field is still defined as a void *, because it is still used
by the log buffers to store xlog_in_core structures, and there is no
need to add an extra field on xfs_buf just for xlog_in_core.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: minor style changes]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-29 07:27:22 -08:00
Darrick J. Wong ce1d802e6a xfs: add scrub cross-referencing helpers for the free space btrees
Add a couple of functions to the free space btrees that will be used
to cross-reference metadata against the bnobt/cntbt, and a generic
btree function that provides the real implementation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-01-17 21:00:44 -08:00
Darrick J. Wong b55725974c xfs: create a new buf_ops pointer to verify structure metadata
Expose all metadata structure buffer verifier functions via buf_ops.
These will be used by the online scrub mechanism to look for problems
with buffers that are already sitting around in memory.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-01-08 10:54:47 -08:00
Darrick J. Wong bc1a09b8e3 xfs: refactor verifier callers to print address of failing check
Refactor the callers of verifiers to print the instruction address of a
failing check.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-01-08 10:54:46 -08:00
Darrick J. Wong a6a781a58b xfs: have buffer verifier functions report failing address
Modify each function that checks the contents of a metadata buffer to
return the instruction address of the failing test so that we can report
more precise failure errors to the log.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-01-08 10:54:46 -08:00
Darrick J. Wong 31ca03c92c xfs: refactor xfs_verifier_error and xfs_buf_ioerror
Since all verification errors also mark the buffer as having an error,
we can combine these two calls.  Later we'll add a xfs_failaddr_t
parameter to promote the idea of reporting corruption errors and the
address of the failing check to enable better debugging reports.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-01-08 10:54:45 -08:00
Darrick J. Wong 33df3a9cf9 xfs: always honor OWN_UNKNOWN rmap removal requests
Calling xfs_rmap_free with an unknown owner is supposed to remove any
rmaps covering that range regardless of owner.  This is used by the EFI
recovery code to say "we're freeing this, it mustn't be owned by
anything anymore", but for whatever reason xfs_free_ag_extent filters
them out.

Therefore, remove the filter and make xfs_rmap_unmap actually treat it
as a wildcard owner -- free anything that's already there, and if
there's no owner at all then that's fine too.

There are two existing callers of bmap_add_free that take care the rmap
deferred ops themselves and use OWN_UNKNOWN to skip the EFI-based rmap
cleanup; convert these to use OWN_NULL (via helpers), and now we really
require that an RUI (if any) gets added to the defer ops before any EFI.

Lastly, now that xfs_free_extent filters out OWN_NULL rmap free requests,
growfs will have to consult directly with the rmap to ensure that there
aren't any rmaps in the grown region.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-21 08:48:38 -08:00
Darrick J. Wong e9e899a2a8 xfs: move error injection tags into their own file
Move the error injection tag names into a libxfs header so that we can
share it between kernel and userspace.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2017-11-01 15:03:16 -07:00
Darrick J. Wong 21ec54168b xfs: create block pointer check functions
Create some helper functions to check that a block pointer points
within the filesystem (or AG) and doesn't point at static metadata.
We will use this for scrub.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2017-10-26 15:38:23 -07:00
Eric Sandeen 93e8befc17 xfs: handle error if xfs_btree_get_bufs fails
Jason reported that a corrupted filesystem failed to replay
the log with a metadata block out of bounds warning:

XFS (dm-2): _xfs_buf_find: Block out of range: block 0x80270fff8, EOFS 0x9c40000

_xfs_buf_find() and xfs_btree_get_bufs() return NULL if
that happens, and then when xfs_alloc_fix_freelist() calls
xfs_trans_binval() on that NULL bp, we oops with:

BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8

We don't handle _xfs_buf_find errors very well, every
caller higher up the stack gets to guess at why it failed.
But we should at least handle it somehow, so return
EFSCORRUPTED here.

Reported-by: Jason L Tibbitts III <tibbs@math.uh.edu>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-11 10:21:07 -07:00
Darrick J. Wong 9e24cfd044 xfs: remove unneeded parameter from XFS_TEST_ERROR
Since we moved the injected error frequency controls to the mountpoint,
we can get rid of the last argument to XFS_TEST_ERROR.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2017-06-27 18:23:20 -07:00
Darrick J. Wong 2678809799 xfs: export various function for the online scrubber
Export various internal functions so that the online scrubber can use
them to check the state of metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19 14:11:34 -07:00
Darrick J. Wong e9a2599a24 xfs: create a function to query all records in a btree
Create a helper function that will query all records in a btree.
This will be used by the online repair functions to examine every
record in a btree to rebuild a second btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-04-03 15:18:17 -07:00
Darrick J. Wong 2d520bfaa2 xfs: provide a query_range function for freespace btrees
Implement a query_range function for the bnobt and cntbt.  This will
be used for getfsmap fallback if there is no rmapbt and by the online
scrub and repair code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-04-03 15:18:17 -07:00
Christoph Hellwig 8d242e932f xfs: remove XFS_ALLOCTYPE_ANY_AG and XFS_ALLOCTYPE_START_AG
XFS_ALLOCTYPE_ANY_AG  was only used for the RT allocator and is unused
now, and XFS_ALLOCTYPE_START_AG has been unused for a while.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-17 20:32:10 -08:00
Christoph Hellwig ebf5587261 xfs: improve handling of busy extents in the low-level allocator
Currently we force the log and simply try again if we hit a busy extent,
but especially with online discard enabled it might take a while after
the log force for the busy extents to disappear, and we might have
already completed our second pass.

So instead we add a new waitqueue and a generation counter to the pag
structure so that we can do wakeups once we've removed busy extents,
and we replace the single retry with an unconditional one - after
all we hold the AGF buffer lock, so no other allocations or frees
can be racing with us in this AG.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-09 10:50:25 -08:00
Christoph Hellwig 12ef830198 xfs: don't rely on ->total in xfs_alloc_space_available
->total is a bit of an odd parameter passed down to the low-level
allocator all the way from the high-level callers.  It's supposed to
contain the maximum number of blocks to be allocated for the whole
transaction [1].

But in xfs_iomap_write_allocate we only convert existing delayed
allocations and thus only have a minimal block reservation for the
current transaction, so xfs_alloc_space_available can't use it for
the allocation decisions.  Use the maximum of args->total and the
calculated block requirement to make a decision.  We probably should
get rid of args->total eventually and instead apply ->minleft more
broadly, but that will require some extensive changes all over.

[1] which creates lots of confusion as most callers don't decrement it
once doing a first allocation.  But that's for a separate series.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-09 13:45:01 -08:00
Christoph Hellwig 54fee133ad xfs: adjust allocation length in xfs_alloc_space_available
We must decide in xfs_alloc_fix_freelist if we can perform an
allocation from a given AG is possible or not based on the available
space, and should not fail the allocation past that point on a
healthy file system.

But currently we have two additional places that second-guess
xfs_alloc_fix_freelist: xfs_alloc_ag_vextent tries to adjust the
maxlen parameter to remove the reservation before doing the
allocation (but ignores the various minium freespace requirements),
and xfs_alloc_fix_minleft tries to fix up the allocated length
after we've found an extent, but ignores the reservations and also
doesn't take the AGFL into account (and thus fails allocations
for not matching minlen in some cases).

Remove all these later fixups and just correct the maxlen argument
inside xfs_alloc_fix_freelist once we have the AGF buffer locked.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-09 13:37:44 -08:00
Christoph Hellwig 255c516278 xfs: fix bogus minleft manipulations
We can't just set minleft to 0 when we're low on space - that's exactly
what we need minleft for: to protect space in the AG for btree block
allocations when we are low on free space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-09 13:36:36 -08:00
Christoph Hellwig 5149fd327f xfs: bump up reserved blocks in xfs_alloc_set_aside
Setting aside 4 blocks globally for bmbt splits isn't all that useful,
as different threads can allocate space in parallel.  Bump it to 4
blocks per AG to allow each thread that is currently doing an
allocation to dip into it separately.  Without that we may no have
enough reserved blocks if there are enough parallel transactions
in an almost out space file system that all run into bmap btree
splits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-09 13:35:00 -08:00
Darrick J. Wong d2a047f31e xfs: forbid AG btrees with level == 0
There is no such thing as a zero-level AG btree since even a single-node
zero-records btree has one level.  Btree cursor constructors read
cur_nlevels straight from disk and then access things like
cur_bufs[cur_nlevels - 1] which is /really/ bad if cur_nlevels is zero!
Therefore, strengthen the verifiers to prevent this possibility.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05 12:32:50 +11:00
Darrick J. Wong d0e853f360 xfs: reserve AG space for the refcount btree root
Reduce the max AG usable space size so that we always have space for
the refcount btree root.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:24 -07:00
Darrick J. Wong bdf28630b7 xfs: add refcount btree operations
Implement the generic btree operations required to manipulate refcount
btree blocks.  The implementation is similar to the bmapbt, though it
will only allocate and free blocks from the AG.

Since the refcount root and level fields are separate from the
existing roots and levels array, they need a separate logging flag.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: fix logging of AGF refcount btree fields]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:19 -07:00
Darrick J. Wong af30dfa144 xfs: refcount btree add more reserved blocks
Since XFS reserves a small amount of space in each AG as the minimum
free space needed for an operation, save some more space in case we
touch the refcount btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:17 -07:00
Darrick J. Wong 46eeb521b9 xfs: introduce refcount btree definitions
Add new per-AG refcount btree definitions to the per-AG structures.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:16 -07:00
Dave Chinner 155cd433b5 Merge branch 'xfs-4.9-log-recovery-fixes' into for-next 2016-10-03 09:56:28 +11:00
Dave Chinner 292378edcb xfs: remote attribute blocks aren't really userdata
When adding a new remote attribute, we write the attribute to the
new extent before the allocation transaction is committed. This
means we cannot reuse busy extents as that violates crash
consistency semantics. Hence we currently treat remote attribute
extent allocation like userdata because it has the same overwrite
ordering constraints as userdata.

Unfortunately, this also allows the allocator to incorrectly apply
extent size hints to the remote attribute extent allocation. This
results in interesting failures, such as transaction block
reservation overruns and in-memory inode attribute fork corruption.

To fix this, we need to separate the busy extent reuse configuration
from the userdata configuration. This changes the definition of
XFS_BMAPI_METADATA slightly - it now means that allocation is
metadata and reuse of busy extents is acceptible due to the metadata
ordering semantics of the journal. If this flag is not set, it
means the allocation is that has unordered data writeback, and hence
busy extent reuse is not allowed. It no longer implies the
allocation is for user data, just that the data write will not be
strictly ordered. This matches the semantics for both user data
and remote attribute block allocation.

As such, This patch changes the "userdata" field to a "datatype"
field, and adds a "no busy reuse" flag to the field.
When we detect an unordered data extent allocation, we immediately set
the no reuse flag. We then set the "user data" flags based on the
inode fork we are allocating the extent to. Hence we only set
userdata flags on data fork allocations now and consider attribute
fork remote extents to be an unordered metadata extent.

The result is that remote attribute extents now have the expected
allocation semantics, and the data fork allocation behaviour is
completely unchanged.

It should be noted that there may be other ways to fix this (e.g.
use ordered metadata buffers for the remote attribute extent data
write) but they are more invasive and difficult to validate both
from a design and implementation POV. Hence this patch takes the
simple, obvious route to fixing the problem...

Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-26 08:21:28 +10:00
Darrick J. Wong 3fd129b63f xfs: set up per-AG free space reservations
One unfortunate quirk of the reference count and reverse mapping
btrees -- they can expand in size when blocks are written to *other*
allocation groups if, say, one large extent becomes a lot of tiny
extents.  Since we don't want to start throwing errors in the middle
of CoWing, we need to reserve some blocks to handle future expansion.
The transaction block reservation counters aren't sufficient here
because we have to have a reserve of blocks in every AG, not just
somewhere in the filesystem.

Therefore, create two per-AG block reservation pools.  One feeds the
AGFL so that rmapbt expansion always succeeds, and the other feeds all
other metadata so that refcountbt expansion never fails.

Use the count of how many reserved blocks we need to have on hand to
create a virtual reservation in the AG.  Through selective clamping of
the maximum length of allocation requests and of the length of the
longest free extent, we can make it look like there's less free space
in the AG unless the reservation owner is asking for blocks.

In other words, play some accounting tricks in-core to make sure that
we always have blocks available.  On the plus side, there's nothing to
clean up if we crash, which is contrast to the strategy that the rough
draft used (actually removing extents from the freespace btrees).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 10:30:52 +10:00
Darrick J. Wong da1f039d69 xfs: don't log the entire end of the AGF
When we're logging the last non-spare field in the AGF, we don't
need to log the spare fields, so plumb in a new AGF logging flag
to help us avoid that.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-26 15:59:31 +10:00
Darrick J. Wong a03f1a6633 xfs: remove OWN_AG rmap when allocating a block from the AGFL
When we're really tight on space, xfs_alloc_ag_vextent_small() can
allocate a block from the AGFL and give it to the caller.  Since the
caller is never the AGFL-fixing method, we must remove the OWN_AG
reverse mapping because it will clash with whatever rmap the caller
wants to set up.  This bug was discovered by running generic/299
repeatedly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17 11:12:57 +10:00
Darrick J. Wong f32866fdc9 xfs: store rmapbt block count in the AGF
Track the number of blocks used for the rmapbt in the AGF.  When we
get to the AG reservation code we need this counter to quickly
make our reservation during mount.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17 08:31:49 +10:00
Darrick J. Wong 3481b68285 xfs: move (and rename) the deferred bmap-free tracepoints
Rename the deferred bmap-free to extent_free and make them only
trigger when we're really running deferred ops.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 12:31:07 +10:00
Darrick J. Wong 04f130605f xfs: don't update rmapbt when fixing agfl
Allow a caller of xfs_alloc_fix_freelist to disable rmapbt updates
when fixing the AG freelist.  xfs_repair needs this during phase 5
to be able to adjust the freelist while it's reconstructing the rmap
btree; the missing entries will be added back at the very end of
phase 5 once the AGFL contents settle down.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 12:19:53 +10:00
Darrick J. Wong 525488520a xfs: rmap btree requires more reserved free space
Originally-From: Dave Chinner <dchinner@redhat.com>

The rmap btree is allocated from the AGFL, which means we have to
ensure ENOSPC is reported to userspace before we run out of free
space in each AG. The last allocation in an AG can cause a full
height rmap btree split, and that means we have to reserve at least
this many blocks *in each AG* to be placed on the AGFL at ENOSPC.
Update the various space calculation functions to handle this.

Also, because the macros are now executing conditional code and are
called quite frequently, convert them to functions that initialise
variables in the struct xfs_mount, use the new variables everywhere
and document the calculations better.

[darrick.wong@oracle.com: don't reserve blocks if !rmap]
[dchinner@redhat.com: update m_ag_max_usable after growfs]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:38:24 +10:00
Darrick J. Wong 673930c34a xfs: introduce rmap extent operation stubs
Originally-From: Dave Chinner <dchinner@redhat.com>

Add the stubs into the extent allocation and freeing paths that the
rmap btree implementation will hook into. While doing this, add the
trace points that will be used to track rmap btree extent
manipulations.

[darrick.wong@oracle.com: Extend the stubs to take full owner info.]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:33:43 +10:00
Darrick J. Wong 340785cca1 xfs: add owner field to extent allocation and freeing
For the rmap btree to work, we have to feed the extent owner
information to the the allocation and freeing functions. This
information is what will end up in the rmap btree that tracks
allocated extents. While we technically don't need the owner
information when freeing extents, passing it allows us to validate
that the extent we are removing from the rmap btree actually
belonged to the owner we expected it to belong to.

We also define a special set of owner values for internal metadata
that would otherwise have no owner. This allows us to tell the
difference between metadata owned by different per-ag btrees, as
well as static fs metadata (e.g. AG headers) and internal journal
blocks.

There are also a couple of special cases we need to take care of -
during EFI recovery, we don't actually know who the original owner
was, so we need to pass a wildcard to indicate that we aren't
checking the owner for validity. We also need special handling in
growfs, as we "free" the space in the last AG when extending it, but
because it's new space it has no actual owner...

While touching the xfs_bmap_add_free() function, re-order the
parameters to put the struct xfs_mount first.

Extend the owner field to include both the owner type and some sort
of index within the owner.  The index field will be used to support
reverse mappings when reflink is enabled.

When we're freeing extents from an EFI, we don't have the owner
information available (rmap updates have their own redo items).
xfs_free_extent therefore doesn't need to do an rmap update. Make
sure that the log replay code signals this correctly.

This is based upon a patch originally from Dave Chinner. It has been
extended to add more owner information with the intent of helping
recovery operations when things go wrong (e.g. offset of user data
block in a file).

[dchinner: de-shout the xfs_rmap_*_owner helpers]
[darrick: minor style fixes suggested by Christoph Hellwig]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:33:42 +10:00
Darrick J. Wong 8018026ef2 xfs: rmap btree add more reserved blocks
Originally-From: Dave Chinner <dchinner@redhat.com>

XFS reserves a small amount of space in each AG for the minimum
number of free blocks needed for operation. Adding the rmap btree
increases the number of reserved blocks, but it also increases the
complexity of the calculation as the free inode btree is optional
(like the rmbt).

Rather than calculate the prealloc blocks every time we need to
check it, add a function to calculate it at mount time and store it
in the struct xfs_mount, and convert the XFS_PREALLOC_BLOCKS macro
just to use the xfs-mount variable directly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:31:47 +10:00