linux-stable/fs/xfs
Darrick J. Wong 27dada070d xfs: change the order in which child and parent defer ops are finished
The defer ops code has been finishing items in the wrong order -- if a
top level defer op creates items A and B, and finishing item A creates
more defer ops A1 and A2, we'll put the new items on the end of the
chain and process them in the order A B A1 A2.  This is kind of weird,
since it's convenient for programmers to be able to think of A and B as
an ordered sequence where all the sub-tasks for A must finish before we
move on to B, e.g. A A1 A2 D.

Right now, our log intent items are not so complex that this matters,
but this will become important for the atomic extent swapping patchset.
In order to maintain correct reference counting of extents, we have to
unmap and remap extents in that order, and we want to complete that work
before moving on to the next range that the user wants to swap.  This
patch fixes defer ops to satsify that requirement.

The primary symptom of the incorrect order was noticed in an early
performance analysis of the atomic extent swap code.  An astonishingly
large number of deferred work items accumulated when userspace requested
an atomic update of two very fragmented files.  The cause of this was
traced to the same ordering bug in the inner loop of
xfs_defer_finish_noroll.

If the ->finish_item method of a deferred operation queues new deferred
operations, those new deferred ops are appended to the tail of the
pending work list.  To illustrate, say that a caller creates a
transaction t0 with four deferred operations D0-D3.  The first thing
defer ops does is roll the transaction to t1, leaving us with:

t1: D0(t0), D1(t0), D2(t0), D3(t0)

Let's say that finishing each of D0-D3 will create two new deferred ops.
After finish D0 and roll, we'll have the following chain:

t2: D1(t0), D2(t0), D3(t0), d4(t1), d5(t1)

d4 and d5 were logged to t1.  Notice that while we're about to start
work on D1, we haven't actually completed all the work implied by D0
being finished.  So far we've been careful (or lucky) to structure the
dfops callers such that D1 doesn't depend on d4 or d5 being finished,
but this is a potential logic bomb.

There's a second problem lurking.  Let's see what happens as we finish
D1-D3:

t3: D2(t0), D3(t0), d4(t1), d5(t1), d6(t2), d7(t2)
t4: D3(t0), d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3)
t5: d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)

Let's say that d4-d11 are simple work items that don't queue any other
operations, which means that we can complete each d4 and roll to t6:

t6: d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
t7: d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
...
t11: d10(t4), d11(t4)
t12: d11(t4)
<done>

When we try to roll to transaction #12, we're holding defer op d11,
which we logged way back in t4.  This means that the tail of the log is
pinned at t4.  If the log is very small or there are a lot of other
threads updating metadata, this means that we might have wrapped the log
and cannot get roll to t11 because there isn't enough space left before
we'd run into t4.

Let's shift back to the original failure.  I mentioned before that I
discovered this flaw while developing the atomic file update code.  In
that scenario, we have a defer op (D0) that finds a range of file blocks
to remap, creates a handful of new defer ops to do that, and then asks
to be continued with however much work remains.

So, D0 is the original swapext deferred op.  The first thing defer ops
does is rolls to t1:

t1: D0(t0)

We try to finish D0, logging d1 and d2 in the process, but can't get all
the work done.  We log a done item and a new intent item for the work
that D0 still has to do, and roll to t2:

t2: D0'(t1), d1(t1), d2(t1)

We roll and try to finish D0', but still can't get all the work done, so
we log a done item and a new intent item for it, requeue D0 a second
time, and roll to t3:

t3: D0''(t2), d1(t1), d2(t1), d3(t2), d4(t2)

If it takes 48 more rolls to complete D0, then we'll finally dispense
with D0 in t50:

t50: D<fifty primes>(t49), d1(t1), ..., d102(t50)

We then try to roll again to get a chain like this:

t51: d1(t1), d2(t1), ..., d101(t50), d102(t50)
...
t152: d102(t50)
<done>

Notice that in rolling to transaction #51, we're holding on to a log
intent item for d1 that was logged in transaction #1.  This means that
the tail of the log is pinned at t1.  If the log is very small or there
are a lot of other threads updating metadata, this means that we might
have wrapped the log and cannot roll to t51 because there isn't enough
space left before we'd run into t1.  This is of course problem #2 again.

But notice the third problem with this scenario: we have 102 defer ops
tied to this transaction!  Each of these items are backed by pinned
kernel memory, which means that we risk OOM if the chains get too long.

Yikes.  Problem #1 is a subtle logic bomb that could hit someone in the
future; problem #2 applies (rarely) to the current upstream, and problem
#3 applies to work under development.

This is not how incremental deferred operations were supposed to work.
The dfops design of logging in the same transaction an intent-done item
and a new intent item for the work remaining was to make it so that we
only have to juggle enough deferred work items to finish that one small
piece of work.  Deferred log item recovery will find that first
unfinished work item and restart it, no matter how many other intent
items might follow it in the log.  Therefore, it's ok to put the new
intents at the start of the dfops chain.

For the first example, the chains look like this:

t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
t3: d5(t1), D1(t0), D2(t0), D3(t0)
...
t9: d9(t7), D3(t0)
t10: D3(t0)
t11: d10(t10), d11(t10)
t12: d11(t10)

For the second example, the chains look like this:

t1: D0(t0)
t2: d1(t1), d2(t1), D0'(t1)
t3: d2(t1), D0'(t1)
t4: D0'(t1)
t5: d1(t4), d2(t4), D0''(t4)
...
t148: D0<50 primes>(t147)
t149: d101(t148), d102(t148)
t150: d102(t148)
<done>

This actually sucks more for pinning the log tail (we try to roll to t10
while holding an intent item that was logged in t1) but we've solved
problem #1.  We've also reduced the maximum chain length from:

    sum(all the new items) + nr_original_items

to:

    max(new items that each original item creates) + nr_original_items

This solves problem #3 by sharply reducing the number of defer ops that
can be attached to a transaction at any given time.  The change makes
the problem of log tail pinning worse, but is improvement we need to
solve problem #2.  Actually solving #2, however, is left to the next
patch.

Note that a subsequent analysis of some hard-to-trigger reflink and COW
livelocks on extremely fragmented filesystems (or systems running a lot
of IO threads) showed the same symptoms -- uncomfortably large numbers
of incore deferred work items and occasional stalls in the transaction
grant code while waiting for log reservations.  I think this patch and
the next one will also solve these problems.

As originally written, the code used list_splice_tail_init instead of
list_splice_init, so change that, and leave a short comment explaining
our actions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07 08:40:28 -07:00
..
libxfs xfs: change the order in which child and parent defer ops are finished 2020-10-07 08:40:28 -07:00
scrub xfs: check dabtree node hash values when loading child blocks 2020-09-23 08:58:51 -07:00
Kconfig xfs: deprecate the V4 format 2020-09-15 20:52:43 -07:00
kmem.c xfs: remove kmem_realloc() 2020-09-06 18:05:51 -07:00
kmem.h xfs: Remove kmem_zalloc_large() 2020-09-15 20:52:41 -07:00
Makefile xfs: refactor log recovery item sorting into a generic dispatch structure 2020-05-08 08:49:58 -07:00
mrlock.h
xfs.h
xfs_acl.c xfs: Remove kmem_zalloc_large() 2020-09-15 20:52:41 -07:00
xfs_acl.h xfs: improve xfs_forget_acl 2020-03-02 20:55:55 -08:00
xfs_aops.c New code for 5.8: 2020-06-02 19:21:40 -07:00
xfs_aops.h
xfs_attr_inactive.c xfs: cleanup xfs_idestroy_fork 2020-05-19 09:40:59 -07:00
xfs_attr_list.c xfs: Convert xfs_attr_sf macros to inline functions 2020-09-15 20:52:42 -07:00
xfs_bio_io.c
xfs_bmap_item.c xfs: fix an incore inode UAF in xfs_bui_recover 2020-10-07 08:40:28 -07:00
xfs_bmap_item.h xfs: refactor intent item RECOVERED flag into the log item 2020-05-08 08:50:01 -07:00
xfs_bmap_util.c xfs: ensure that fpunch, fcollapse, and finsert operations are aligned to rt extent size 2020-09-15 20:52:42 -07:00
xfs_bmap_util.h
xfs_buf.c xfs: reuse _xfs_buf_read for re-reading the superblock 2020-09-15 20:52:39 -07:00
xfs_buf.h xfs: reuse _xfs_buf_read for re-reading the superblock 2020-09-15 20:52:39 -07:00
xfs_buf_item.c xfs: remove xlog_recover_iodone 2020-09-15 20:52:39 -07:00
xfs_buf_item.h xfs: move the buffer retry logic to xfs_buf.c 2020-09-15 20:52:38 -07:00
xfs_buf_item_recover.c xfs: fix finobt btree block recovery ordering 2020-09-30 07:28:52 -07:00
xfs_dir2_readdir.c xfs: move the fork format fields into struct xfs_ifork 2020-05-19 09:40:58 -07:00
xfs_discard.c xfs: remove XFS_BUF_TO_AGF 2020-03-11 09:11:39 -07:00
xfs_discard.h
xfs_dquot.c xfs: fix some comments 2020-09-25 11:34:07 -07:00
xfs_dquot.h xfs: refactor default quota grace period setting code 2020-09-15 20:52:40 -07:00
xfs_dquot_item.c xfs: stop using q_core.d_id in the quota code 2020-07-28 20:24:14 -07:00
xfs_dquot_item.h xfs: factor out quotaoff intent AIL removal and memory free 2020-03-18 08:12:23 -07:00
xfs_dquot_item_recover.c xfs: rename the ondisk dquot d_flags to d_type 2020-07-28 20:24:14 -07:00
xfs_error.c xfs: random buffer write failure errortag 2020-05-07 08:27:48 -07:00
xfs_error.h xfs: xfs_buf_corruption_error should take __this_address 2020-03-12 07:58:12 -07:00
xfs_export.c xfs: delete duplicated words + other fixes 2020-08-05 08:49:58 -07:00
xfs_export.h
xfs_extent_busy.c
xfs_extent_busy.h
xfs_extfree_item.c xfs: fix an incore inode UAF in xfs_bui_recover 2020-10-07 08:40:28 -07:00
xfs_extfree_item.h xfs: refactor intent item RECOVERED flag into the log item 2020-05-08 08:50:01 -07:00
xfs_file.c xfs: force the log after remapping a synchronous-writes file 2020-09-15 20:52:42 -07:00
xfs_filestream.c xfs: drop the obsolete comment on filestream locking 2020-09-25 11:34:08 -07:00
xfs_filestream.h
xfs_fsmap.c xfs: prohibit fs freezing when using empty transactions 2020-03-26 08:19:24 -07:00
xfs_fsmap.h
xfs_fsops.c xfs: remove unused shutdown types 2020-05-07 08:27:48 -07:00
xfs_fsops.h
xfs_globals.c
xfs_health.c
xfs_icache.c xfs: Remove unneeded semicolon 2020-09-15 20:52:42 -07:00
xfs_icache.h xfs: remove SYNC_WAIT and SYNC_TRYLOCK 2020-07-14 08:47:33 -07:00
xfs_icreate_item.c xfs: Remove kmem_zone_zalloc() usage 2020-07-28 20:24:14 -07:00
xfs_icreate_item.h
xfs_inode.c xfs: drop extra transaction roll from inode extent truncate 2020-09-21 09:54:29 -07:00
xfs_inode.h xfs: widen ondisk inode timestamps to deal with y2038+ 2020-09-15 20:52:41 -07:00
xfs_inode_item.c xfs: widen ondisk inode timestamps to deal with y2038+ 2020-09-15 20:52:41 -07:00
xfs_inode_item.h xfs: move the buffer retry logic to xfs_buf.c 2020-09-15 20:52:38 -07:00
xfs_inode_item_recover.c xfs: widen ondisk inode timestamps to deal with y2038+ 2020-09-15 20:52:41 -07:00
xfs_ioctl.c xfs: Remove kmem_zalloc_large() 2020-09-15 20:52:41 -07:00
xfs_ioctl.h xfs: embedded the attrlist cursor into struct xfs_attr_list_context 2020-03-02 20:55:55 -08:00
xfs_ioctl32.c xfs: lift cursor copy in/out into xfs_ioc_attr_list 2020-03-02 20:55:54 -08:00
xfs_ioctl32.h xfs: rename compat_time_t to old_time32_t 2020-01-06 08:57:36 -08:00
xfs_iomap.c xfs: delete duplicated words + other fixes 2020-08-05 08:49:58 -07:00
xfs_iomap.h
xfs_iops.c xfs: directly call xfs_generic_create() for ->create() and ->mkdir() 2020-09-25 11:34:08 -07:00
xfs_iops.h
xfs_itable.c xfs: move the fork format fields into struct xfs_ifork 2020-05-19 09:40:58 -07:00
xfs_itable.h
xfs_iwalk.c xfs: kill the XFS_WANT_CORRUPT_* macros 2019-11-12 17:19:02 -08:00
xfs_iwalk.h
xfs_linux.h xfs: remove the unused SYNCHRONIZE macro 2020-09-25 11:34:07 -07:00
xfs_log.c xfs: clean up calculation of LR header blocks 2020-09-23 09:24:17 -07:00
xfs_log.h xfs: refactor and split xfs_log_done() 2020-03-27 08:32:53 -07:00
xfs_log_cil.c xfs: delete duplicated words + other fixes 2020-08-05 08:49:58 -07:00
xfs_log_priv.h xfs: Modify xlog_ticket_alloc() to use kernel's MM API 2020-07-28 20:24:14 -07:00
xfs_log_recover.c xfs: fix an incore inode UAF in xfs_bui_recover 2020-10-07 08:40:28 -07:00
xfs_message.c xfs: refactor ratelimited buffer error messages into helper 2020-05-07 08:27:46 -07:00
xfs_message.h xfs: refactor ratelimited buffer error messages into helper 2020-05-07 08:27:46 -07:00
xfs_mount.c xfs: remove xfs_getsb 2020-09-15 20:52:39 -07:00
xfs_mount.h xfs: remove xfs_getsb 2020-09-15 20:52:39 -07:00
xfs_mru_cache.c
xfs_mru_cache.h
xfs_ondisk.h xfs: Remove typedef xfs_attr_shortform_t 2020-09-15 20:52:42 -07:00
xfs_pnfs.c xfs: define printk_once variants for xfs messages 2020-05-04 09:03:15 -07:00
xfs_pnfs.h
xfs_pwork.c block: remove the bd_queue field from struct block_device 2020-07-01 08:08:20 -06:00
xfs_pwork.h
xfs_qm.c xfs: remove the unused parameter id from xfs_qm_dqattach_one 2020-09-25 11:34:07 -07:00
xfs_qm.h xfs: refactor quota expiration timer modification 2020-09-15 20:52:40 -07:00
xfs_qm_bhv.c xfs: rename XFS_DQ_{USER,GROUP,PROJ} to XFS_DQTYPE_* 2020-07-28 20:24:14 -07:00
xfs_qm_syscalls.c xfs: refactor default quota grace period setting code 2020-09-15 20:52:40 -07:00
xfs_quota.h xfs: move the buffer retry logic to xfs_buf.c 2020-09-15 20:52:38 -07:00
xfs_quotaops.c xfs: create xfs_dqtype_t to represent quota types 2020-07-28 20:24:14 -07:00
xfs_refcount_item.c xfs: fix an incore inode UAF in xfs_bui_recover 2020-10-07 08:40:28 -07:00
xfs_refcount_item.h xfs: refactor intent item RECOVERED flag into the log item 2020-05-08 08:50:01 -07:00
xfs_reflink.c xfs: delete duplicated words + other fixes 2020-08-05 08:49:58 -07:00
xfs_reflink.h xfs: move helpers that lock and unlock two inodes against userspace IO 2020-07-06 10:46:57 -07:00
xfs_rmap_item.c xfs: fix an incore inode UAF in xfs_bui_recover 2020-10-07 08:40:28 -07:00
xfs_rmap_item.h xfs: refactor intent item RECOVERED flag into the log item 2020-05-08 08:50:01 -07:00
xfs_rtalloc.c xfs: Set xfs_buf's b_ops member when zeroing bitmap/summary files 2020-09-23 08:58:51 -07:00
xfs_rtalloc.h
xfs_stats.c xfs: Use scnprintf() for avoiding potential buffer overflow 2020-03-12 07:58:13 -07:00
xfs_stats.h
xfs_super.c xfs: remove deprecated mount options 2020-09-25 11:34:08 -07:00
xfs_super.h
xfs_symlink.c xfs: move the fork format fields into struct xfs_ifork 2020-05-19 09:40:58 -07:00
xfs_symlink.h
xfs_sysctl.c xfs: remove deprecated sysctl options 2020-09-25 11:34:08 -07:00
xfs_sysctl.h
xfs_sysfs.c
xfs_sysfs.h xfs: Fix UBSAN null-ptr-deref in xfs_sysfs_init 2020-08-07 11:50:17 -07:00
xfs_trace.c xfs: support bulk loading of staged btrees 2020-03-18 08:12:23 -07:00
xfs_trace.h xfs: trace timestamp limits 2020-09-15 20:52:41 -07:00
xfs_trans.c xfs: do the assert for all the log done items in xfs_trans_cancel 2020-09-25 11:34:07 -07:00
xfs_trans.h xfs: proper replay of deferred ops queued during log recovery 2020-10-07 08:40:28 -07:00
xfs_trans_ail.c xfs: delete duplicated words + other fixes 2020-08-05 08:49:58 -07:00
xfs_trans_buf.c xfs: simplify xfs_trans_getsb 2020-09-15 20:52:39 -07:00
xfs_trans_dquot.c xfs: widen ondisk quota expiration timestamps to handle y2038+ 2020-09-15 20:52:41 -07:00
xfs_trans_priv.h xfs: refactor adding recovered intent items to the log 2020-05-08 08:50:00 -07:00
xfs_xattr.c xfs: remove duplicate headers 2020-05-08 08:51:34 -07:00