xfs: document the userspace fsck driver program

Add the sixth chapter of the online fsck design documentation, where
we discuss the details of the data structures and algorithms used by the
driver program xfs_scrub.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
This commit is contained in:
Darrick J. Wong 2023-04-11 18:59:51 -07:00
parent a26aa25247
commit af051dfb81

View file

@ -315,6 +315,9 @@ The seven phases are as follows:
7. Re-check the summary counters and presents the caller with a summary of
space usage and file counts.
This allocation of responsibilities will be :ref:`revisited <scrubcheck>`
later in this document.
Steps for Each Scrub Item
-------------------------
@ -4787,3 +4790,316 @@ The proposed patches are in the
`orphanage adoption
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
series.
6. Userspace Algorithms and Data Structures
===========================================
This section discusses the key algorithms and data structures of the userspace
program, ``xfs_scrub``, that provide the ability to drive metadata checks and
repairs in the kernel, verify file data, and look for other potential problems.
.. _scrubcheck:
Checking Metadata
-----------------
Recall the :ref:`phases of fsck work<scrubphases>` outlined earlier.
That structure follows naturally from the data dependencies designed into the
filesystem from its beginnings in 1993.
In XFS, there are several groups of metadata dependencies:
a. Filesystem summary counts depend on consistency within the inode indices,
the allocation group space btrees, and the realtime volume space
information.
b. Quota resource counts depend on consistency within the quota file data
forks, inode indices, inode records, and the forks of every file on the
system.
c. The naming hierarchy depends on consistency within the directory and
extended attribute structures.
This includes file link counts.
d. Directories, extended attributes, and file data depend on consistency within
the file forks that map directory and extended attribute data to physical
storage media.
e. The file forks depends on consistency within inode records and the space
metadata indices of the allocation groups and the realtime volume.
This includes quota and realtime metadata files.
f. Inode records depends on consistency within the inode metadata indices.
g. Realtime space metadata depend on the inode records and data forks of the
realtime metadata inodes.
h. The allocation group metadata indices (free space, inodes, reference count,
and reverse mapping btrees) depend on consistency within the AG headers and
between all the AG metadata btrees.
i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support
for online fsck functionality.
Therefore, a metadata dependency graph is a convenient way to schedule checking
operations in the ``xfs_scrub`` program:
- Phase 1 checks that the provided path maps to an XFS filesystem and detect
the kernel's scrubbing abilities, which validates group (i).
- Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.
- Phase 3 scans inodes in parallel.
For each inode, groups (f), (e), and (d) are checked, in that order.
- Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
may run reliably.
- Phase 5 starts by checking groups (b) and (c) in parallel before moving on
to checking names.
- Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
to read them, and to report which blocks of which files are affected.
- Phase 7 checks group (a), having validated everything else.
Notice that the data dependencies between groups are enforced by the structure
of the program flow.
Parallel Inode Scans
--------------------
An XFS filesystem can easily contain hundreds of millions of inodes.
Given that XFS targets installations with large high-performance storage,
it is desirable to scrub inodes in parallel to minimize runtime, particularly
if the program has been invoked manually from a command line.
This requires careful scheduling to keep the threads as evenly loaded as
possible.
Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single
workqueue and scheduled a single workqueue item per AG.
Each workqueue item walked the inode btree (with ``XFS_IOC_INUMBERS``) to find
inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to gather enough
information to construct file handles.
The file handle was then passed to a function to generate scrub items for each
metadata object of each inode.
This simple algorithm leads to thread balancing problems in phase 3 if the
filesystem contains one AG with a few large sparse files and the rest of the
AGs contain many smaller files.
The inode scan dispatch function was not sufficiently granular; it should have
been dispatching at the level of individual inodes, or, to constrain memory
consumption, inode btree records.
Thanks to Dave Chinner, bounded workqueues in userspace enable ``xfs_scrub`` to
avoid this problem with ease by adding a second workqueue.
Just like before, the first workqueue is seeded with one workqueue item per AG,
and it uses INUMBERS to find inode btree chunks.
The second workqueue, however, is configured with an upper bound on the number
of items that can be waiting to be run.
Each inode btree chunk found by the first workqueue's workers are queued to the
second workqueue, and it is this second workqueue that queries BULKSTAT,
creates a file handle, and passes it to a function to generate scrub items for
each metadata object of each inode.
If the second workqueue is too full, the workqueue add function blocks the
first workqueue's workers until the backlog eases.
This doesn't completely solve the balancing problem, but reduces it enough to
move on to more pressing issues.
The proposed patchsets are the scrub
`performance tweaks
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaks>`_
and the
`inode scan rebalance
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance>`_
series.
.. _scrubrepair:
Scheduling Repairs
------------------
During phase 2, corruptions and inconsistencies reported in any AGI header or
inode btree are repaired immediately, because phase 3 relies on proper
functioning of the inode indices to find inodes to scan.
Failed repairs are rescheduled to phase 4.
Problems reported in any other space metadata are deferred to phase 4.
Optimization opportunities are always deferred to phase 4, no matter their
origin.
During phase 3, corruptions and inconsistencies reported in any part of a
file's metadata are repaired immediately if all space metadata were validated
during phase 2.
Repairs that fail or cannot be repaired immediately are scheduled for phase 4.
In the original design of ``xfs_scrub``, it was thought that repairs would be
so infrequent that the ``struct xfs_scrub_metadata`` objects used to
communicate with the kernel could also be used as the primary object to
schedule repairs.
With recent increases in the number of optimizations possible for a given
filesystem object, it became much more memory-efficient to track all eligible
repairs for a given filesystem object with a single repair item.
Each repair item represents a single lockable object -- AGs, metadata files,
individual inodes, or a class of summary information.
Phase 4 is responsible for scheduling a lot of repair work in as quick a
manner as is practical.
The :ref:`data dependencies <scrubcheck>` outlined earlier still apply, which
means that ``xfs_scrub`` must try to complete the repair work scheduled by
phase 2 before trying repair work scheduled by phase 3.
The repair process is as follows:
1. Start a round of repair with a workqueue and enough workers to keep the CPUs
as busy as the user desires.
a. For each repair item queued by phase 2,
i. Ask the kernel to repair everything listed in the repair item for a
given filesystem object.
ii. Make a note if the kernel made any progress in reducing the number
of repairs needed for this object.
iii. If the object no longer requires repairs, revalidate all metadata
associated with this object.
If the revalidation succeeds, drop the repair item.
If not, requeue the item for more repairs.
b. If any repairs were made, jump back to 1a to retry all the phase 2 items.
c. For each repair item queued by phase 3,
i. Ask the kernel to repair everything listed in the repair item for a
given filesystem object.
ii. Make a note if the kernel made any progress in reducing the number
of repairs needed for this object.
iii. If the object no longer requires repairs, revalidate all metadata
associated with this object.
If the revalidation succeeds, drop the repair item.
If not, requeue the item for more repairs.
d. If any repairs were made, jump back to 1c to retry all the phase 3 items.
2. If step 1 made any repair progress of any kind, jump back to step 1 to start
another round of repair.
3. If there are items left to repair, run them all serially one more time.
Complain if the repairs were not successful, since this is the last chance
to repair anything.
Corruptions and inconsistencies encountered during phases 5 and 7 are repaired
immediately.
Corrupt file data blocks reported by phase 6 cannot be recovered by the
filesystem.
The proposed patchsets are the
`repair warning improvements
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings>`_,
refactoring of the
`repair data dependency
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps>`_
and
`object tracking
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking>`_,
and the
`repair scheduling
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling>`_
improvement series.
Checking Names for Confusable Unicode Sequences
-----------------------------------------------
If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of
phase 4, it moves on to phase 5, which checks for suspicious looking names in
the filesystem.
These names consist of the filesystem label, names in directory entries, and
the names of extended attributes.
Like most Unix filesystems, XFS imposes the sparest of constraints on the
contents of a name:
- Slashes and null bytes are not allowed in directory entries.
- Null bytes are not allowed in userspace-visible extended attributes.
- Null bytes are not allowed in the filesystem label.
Directory entries and attribute keys store the length of the name explicitly
ondisk, which means that nulls are not name terminators.
For this section, the term "naming domain" refers to any place where names are
presented together -- all the names in a directory, or all the attributes of a
file.
Although the Unix naming constraints are very permissive, the reality of most
modern-day Linux systems is that programs work with Unicode character code
points to support international languages.
These programs typically encode those code points in UTF-8 when interfacing
with the C library because the kernel expects null-terminated names.
In the common case, therefore, names found in an XFS filesystem are actually
UTF-8 encoded Unicode data.
To maximize its expressiveness, the Unicode standard defines separate control
points for various characters that render similarly or identically in writing
systems around the world.
For example, the character "Cyrillic Small Letter A" U+0430 "а" often renders
identically to "Latin Small Letter A" U+0061 "a".
The standard also permits characters to be constructed in multiple ways --
either by using a defined code point, or by combining one code point with
various combining marks.
For example, the character "Angstrom Sign U+212B "Å" can also be expressed
as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring Above"
U+030A "◌̊".
Both sequences render identically.
Like the standards that preceded it, Unicode also defines various control
characters to alter the presentation of text.
For example, the character "Right-to-Left Override" U+202E can trick some
programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as "mootxt.png".
A second category of rendering problems involves whitespace characters.
If the character "Zero Width Space" U+200B is encountered in a file name, the
name will render identically to a name that does not have the zero width
space.
If two names within a naming domain have different byte sequences but render
identically, a user may be confused by it.
The kernel, in its indifference to upper level encoding schemes, permits this.
Most filesystem drivers persist the byte sequence names that are given to them
by the VFS.
Techniques for detecting confusable names are explained in great detail in
sections 4 and 5 of the
`Unicode Security Mechanisms <https://unicode.org/reports/tr39/>`_
document.
When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the
Unicode normalization form NFD in conjunction with the confusable name
detection component of
`libicu <https://github.com/unicode-org/icu>`_
to identify names with a directory or within a file's extended attributes that
could be confused for each other.
Names are also checked for control characters, non-rendering characters, and
mixing of bidirectional characters.
All of these potential issues are reported to the system administrator during
phase 5.
Media Verification of File Data Extents
---------------------------------------
The system administrator can elect to initiate a media scan of all file data
blocks.
This scan after validation of all filesystem metadata (except for the summary
counters) as phase 6.
The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map
to find areas that are allocated to file data fork extents.
Gaps betweeen data fork extents that are smaller than 64k are treated as if
they were data fork extents to reduce the command setup overhead.
When the space map scan accumulates a region larger than 32MB, a media
verification request is sent to the disk as a directio read of the raw block
device.
If the verification read fails, ``xfs_scrub`` retries with single-block reads
to narrow down the failure to the specific region of the media and recorded.
When it has finished issuing verification requests, it again uses the space
mapping ioctl to map the recorded media errors back to metadata structures
and report what has been lost.
For media errors in blocks owned by files, parent pointers can be used to
construct file paths from inode numbers for user-friendly reporting.