Commit Graph

260 Commits

Author SHA1 Message Date
Kent Overstreet 9e203c43dc bcachefs: Fix missing write refs in fs fio paths
bch2_journal_flush_seq requires us to have a write ref

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-13 22:48:17 -04:00
Kent Overstreet a292be3b68 bcachefs: Reconstruct missing snapshot nodes
When the snapshots btree is going, we'll have to delete huge amounts of
data - unless we can reconstruct it by looking at the keys that refer to
it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:46:51 -04:00
Kent Overstreet 55936afe11 bcachefs: Flag btrees with missing data
We need this to know when we should attempt to reconstruct the snapshots
btree

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:46:51 -04:00
Kent Overstreet 4409b8081d bcachefs: Repair pass for scanning for btree nodes
If a btree root or interior btree node goes bad, we're going to lose a
lot of data, unless we can recover the nodes that it pointed to by
scanning.

Fortunately btree node headers are fully self describing, and
additionally the magic number is xored with the filesytem UUID, so we
can do so safely.

This implements the scanning - next patch will rework topology repair to
make use of the found nodes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:44:18 -04:00
Kent Overstreet d2554263ad bcachefs: Split out recovery_passes.c
We've grown a fair amount of code for managing recovery passes; tracking
which ones we're running, which ones need to be run, and flagging in the
superblock which ones need to be run on the next recovery.

So it's worth splitting out into its own file, this code is pretty
different from the code in recovery.c.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet 63332394c7 bcachefs: Move snapshot table size to struct snapshot_table
We need to add bounds checking for snapshot table accesses - it turns
out there are cases where we do need to use the snapshots table before
fsck checks have completed (and indeed, fsck may not have been run).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet a0a466ea98 bcachefs: Split out btree_node_rewrite_worker
This fixes a deadlock due to using btree_interior_update_worker for non
interior updates - async btree node rewrites were blocking, and then
blocking other interior updates.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-17 20:53:12 -04:00
Darrick J. Wong 273960b8f3 bcachefs: time_stats: split stats-with-quantiles into a separate structure
Currently, struct time_stats has the optional ability to quantize the
information that it collects.  This is /probably/ useful for callers who
want to see quantized information, but it more than doubles the size of
the structure from 224 bytes to 464.  For users who don't care about
that (e.g. upcoming xfs patches) and want to avoid wasting 240 bytes per
counter, split the two into separate pieces.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:38:01 -04:00
Kent Overstreet f1ca1abfb0 bcachefs: pull out time_stats.[ch]
prep work for lifting out of fs/bcachefs/

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:30:35 -04:00
Kent Overstreet 894d062254 bcachefs: Rename journal_keys.d -> journal_keys.data
This will let us use some darray helpers in the next patch.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet a393f33123 bcachefs: Split out discard fastpath
Buckets usually can't be discarded until the transaction that made them
empty has been committed in the journal.

Tracing has indicated that we're queuing the discard worker excessively,
only for it to skip over many buckets that are still waiting on a
journal commit, discarding only one or two buckets per iteration.

We want to switch to only queuing the discard worker after a journal
flush write, but there's an important optimization we need to preserve:
if a bucket becomes empty and it was never committed in the journal
while it was in use, we want to discard it and reuse it right away -
since overwriting it before the previous writes are flushed from the
device cache eans those writes only cost bus bandwidth.

So, this patch implements a fast path for buckets that can be discarded
right away. We need new locking between the two discard workers; the new
list of buckets being discarded provides that locking.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet b63570f747 bcachefs: bch2_print_opts()
Make sure early error messages get redirected, for
kernel-fsck-from-userland.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:24 -04:00
Kent Overstreet b26d79147f bcachefs: BTREE_ID_subvolume_children
Add a btree to record a parent -> child subvolume relationships,
according to the filesystem heirarchy.

The subvolume_children btree is a bitset btree: if a bit is set at pos
p, that means p.offset is a child of subvolume p.inode.

This will be used for efficiently listing subvolumes, as well as
recursive deletion.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:24 -04:00
Kent Overstreet 4e07447503 bcachefs: Clamp replicas_required to replicas
This prevents going emergency read only when the user has specified
replicas_required > replicas.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-13 20:33:38 -05:00
Kent Overstreet ec4edd7b9d bcachefs: Prep work for variable size btree node buffers
bcachefs btree nodes are big - typically 256k - and btree roots are
pinned in memory. As we're now up to 18 btrees, we now have significant
memory overhead in mostly empty btree roots.

And in the future we're going to start enforcing that certain btree node
boundaries exist, to solve lock contention issues - analagous to XFS's
AGIs.

Thus, we need to start allocating smaller btree node buffers when we
can. This patch changes code that refers to the filesystem constant
c->opts.btree_node_size to refer to the btree node buffer size -
btree_buf_bytes() - where appropriate.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21 13:27:10 -05:00
Kent Overstreet c72e4d7a30 bcachefs: add time_stats for btree_node_read_done()
Seeing weird latency issues in the btree node read path - add one
bch2_btree_node_read_done().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-05 23:24:20 -05:00
Kent Overstreet d55ddf6e7a bcachefs: Online fsck can now fix errors
BCH_FS_fsck_done -> BCH_FS_fsck_running; set when we might be fixing
fsck errors. Also; set fix_errors to ask by default when fsck is
running.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-05 23:24:20 -05:00
Kent Overstreet 96f37eabe7 bcachefs: factor out thread_with_file, thread_with_stdio
thread_with_stdio now knows how to handle input - fsck can now prompt to
fix errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-05 23:24:19 -05:00
Kent Overstreet 89056f245b bcachefs: track transaction durations
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-05 23:24:19 -05:00
Kent Overstreet a7dc10ce68 bcachefs: Make sure allocation failure errors are logged
The previous patch fixed a bug in allocation path error handling, and it
would've been noticed sooner had it been logged properly.

Generally speaking, errors that shouldn't happen in normal operation and
are being returned up the stack should be logged: the write path was
already logging IO errors, but non IO errors were missed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:41 -05:00
Kent Overstreet cf904c8d96 bcachefs: bch_err_(fn|msg) check if should print
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:41 -05:00
Kent Overstreet e06af20719 bcachefs: fix userspace build errors
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:41 -05:00
Kent Overstreet 679972348d bcachefs: kill btree_trans->wb_updates
the btree write buffer path now creates a journal entry directly

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:41 -05:00
Kent Overstreet 09caeabe1a bcachefs: btree write buffer now slurps keys from journal
Previosuly, the transaction commit path would have to add keys to the
btree write buffer as a separate operation, requiring additional global
synchronization.

This patch introduces a new journal entry type, which indicates that the
keys need to be copied into the btree write buffer prior to being
written out. We switch the journal entry type back to
JSET_ENTRY_btree_keys prior to write, so this is not an on disk format
change.

Flushing the btree write buffer may require pulling keys out of journal
entries yet to be written, and quiescing outstanding journal
reservations; we previously added journal->buf_lock for synchronization
with the journal write path.

We also can't put strict bounds on the number of keys in the journal
destined for the write buffer, which means we might overflow the size of
the preallocated buffer and have to reallocate - this introduces a
potentially fatal memory allocation failure. This is something we'll
have to watch for, if it becomes an issue in practice we can do
additional mitigation.

The transaction commit path no longer has to explicitly check if the
write buffer is full and wait on flushing; this is another performance
optimization. Instead, when the btree write buffer is close to full we
change the journal watermark, so that only reservations for journal
reclaim are allowed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:41 -05:00
Kent Overstreet 24de63dacb bcachefs: Improve trans->extra_journal_entries
Instead of using a darray, we now allocate journal entries for the
transaction commit path with our normal bump allocator - with an inlined
fastpath, and using btree_transaction_stats to remember how much to
initially allocate so as to avoid transaction restarts.

This is prep work for converting write buffer updates to use this
mechanism.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:41 -05:00
Kent Overstreet 249bf593e8 bcachefs: Fix snapshot.c assertion for online fsck
c->curr_recovery_pass can go backwards; this adds a non rewinding
version, c->recovery_pass_done.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:41 -05:00
Kent Overstreet 267b801fda bcachefs: BCH_IOCTL_FSCK_ONLINE
This adds a new ioctl for running fsck on a mounted, in use filesystem.

This reuses the fsck_thread code from the previous patch for running
fsck on an offline, unmounted filesystem, so that log messages for the
fsck thread are redirected to userspace.

Only one running fsck instance is allowed at a time; a new semaphore
(since the lock will be taken by one thread and released by another) is
added for this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:40 -05:00
Kent Overstreet 7f391b2f8e bcachefs: bch2_run_online_recovery_passes()
Add a new helper for running online recovery passes - i.e. online fsck.
This is a subset of our normal recovery passes, and does not - for now -
use or follow c->curr_recovery_pass.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:40 -05:00
Kent Overstreet 2b41226d7f bcachefs: Add ability to redirect log output
Upcoming patches are going to add two new ioctls for running fsck in the
kernel, but pretending that we're running our normal userspace fsck.

This patch adds some plumbing for redirecting our normal log messages
away from the dmesg log to a thread_with_file file descriptor - via a
struct log_output, which will be consumed by the fsck f_op's read method.

The new ioctls will allow for running fsck in the kernel against an
offline filesystem (without mounting it), and an online filesystem. For
an offline filesystem we need a way to pass in a pointer to the
log_output, which is done via a new hidden opts.h option.

For online fsck, we can set c->output directly, but only want to
redirect log messages from the thread running fsck - hence the new
c->output_filter method.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:40 -05:00
Kent Overstreet 63508b7564 bcachefs: c->ro_ref
Add a new refcount for async ops that don't necessarily need the fs to
be RW, with similar lifetime/rules otherwise as c->writes.

To be used by online fsck.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:40 -05:00
Kent Overstreet 7464403009 bcachefs: count_event()
Small helper for event counters.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:39 -05:00
Kent Overstreet 183bcc89b8 bcachefs: Clean up btree write buffer write ref handling
__bch2_btree_write_buffer_flush() now assumes a write ref is already
held (as called by the transaction commit path); and the wrappers
bch2_write_buffer_flush() and flush_sync() take an explicit write ref.

This means internally the write buffer code can always use
BTREE_INSERT_NOCHECK_RW, instead of in the previous code passing flags
around and hoping the NOCHECK_RW flag was always carried around
correctly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:39 -05:00
Kent Overstreet 3c471b6588 bcachefs: convert bch_fs_flags to x-macro
Now we can print out filesystem flags in sysfs, useful for debugging
various "what's my filesystem doing" issues.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:38 -05:00
Kent Overstreet 066a26460b bcachefs: track_event_change()
This introduces a new helper for connecting time_stats to state changes,
i.e. when taking journal reservations is blocked for some reason.

We use this to track separately the different reasons the journal might
be blocked - i.e. space in the journal full, or the journal pin fifo
full.

Also do some cleanup and improvements on the time stats code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:37 -05:00
Kent Overstreet 8b16413cda bcachefs: bch_sb.recovery_passes_required
Add two new superblock fields. Since the main section of the superblock
is now fully, we have to add a new variable length section for them -
bch_sb_field_ext.

 - recovery_passes_requried: recovery passes that must be run on the
   next mount
 - errors_silent: errors that will be silently fixed

These are to improve upgrading and dwongrading: these fields won't be
cleared until after recovery successfully completes, so there won't be
any issues with crashing partway through an upgrade or a downgrade.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:07 -05:00
Kent Overstreet bbc3a46065 bcachefs: Fix zstd compress workspace size
zstd apparently lies about the size of the compression workspace it
requires; if we double it compression succeeds.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-28 17:18:24 -05:00
Kent Overstreet 8a443d3ea1 bcachefs: Proper refcounting for journal_keys
The btree iterator code overlays keys from the journal until journal
replay is finished; since we're now starting copygc/rebalance etc.
before replay is finished, this is multithreaded access and thus needs
refcounting.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-24 02:43:12 -05:00
Gustavo A. R. Silva 274c2f8fd2 bcachefs: Fix multiple -Warray-bounds warnings
Transform zero-length array `entries` into a proper flexible-array
member in `struct journal_seq_blacklist_table`; and fix the following
-Warray-bounds warnings:

fs/bcachefs/journal_seq_blacklist.c:148:26: warning: array subscript idx is outside array bounds of 'struct journal_seq_blacklist_table_entry[0]' [-Warray-bounds=]
fs/bcachefs/journal_seq_blacklist.c:150:30: warning: array subscript idx is outside array bounds of 'struct journal_seq_blacklist_table_entry[0]' [-Warray-bounds=]
fs/bcachefs/journal_seq_blacklist.c:154:27: warning: array subscript idx is outside array bounds of 'struct journal_seq_blacklist_table_entry[0]' [-Warray-bounds=]
fs/bcachefs/journal_seq_blacklist.c:176:27: warning: array subscript i is outside array bounds of 'struct journal_seq_blacklist_table_entry[0]' [-Warray-bounds=]
fs/bcachefs/journal_seq_blacklist.c:177:27: warning: array subscript i is outside array bounds of 'struct journal_seq_blacklist_table_entry[0]' [-Warray-bounds=]
fs/bcachefs/journal_seq_blacklist.c:297:34: warning: array subscript i is outside array bounds of 'struct journal_seq_blacklist_table_entry[0]' [-Warray-bounds=]
fs/bcachefs/journal_seq_blacklist.c:298:34: warning: array subscript i is outside array bounds of 'struct journal_seq_blacklist_table_entry[0]' [-Warray-bounds=]
fs/bcachefs/journal_seq_blacklist.c:300:31: warning: array subscript i is outside array bounds of 'struct journal_seq_blacklist_table_entry[0]' [-Warray-bounds=]

This results in no differences in binary output.

This helps with the ongoing efforts to globally enable -Warray-bounds.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-13 21:42:22 -05:00
Kent Overstreet f5d26fa31e bcachefs: bch_sb_field_errors
Add a new superblock section to keep counts of errors seen since
filesystem creation: we'll be addingcounters for every distinct fsck
error.

The new superblock section has entries of the for [ id, count,
time_of_last_error ]; this is intended to let us see what errors are
occuring - and getting fixed - via show-super output.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-01 21:11:08 -04:00
Kent Overstreet 94119eeb02 bcachefs: Add IO error counts to bch_member
We now track IO errors per device since filesystem creation.

IO error counts can be viewed in sysfs, or with the 'bcachefs
show-super' command.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-01 21:11:08 -04:00
Kent Overstreet fb3f57bb11 bcachefs: rebalance_work
This adds a new btree, rebalance_work, to eliminate scanning required
for finding extents that need work done on them in the background - i.e.
for the background_target and background_compression options.

rebalance_work is a bitset btree, where a KEY_TYPE_set corresponds to an
extent in the extents or reflink btree at the same pos.

A new extent field is added, bch_extent_rebalance, which indicates that
this extent has work that needs to be done in the background - and which
options to use. This allows per-inode options to be propagated to
indirect extents - at least in some circumstances. In this patch,
changing IO options on a file will not propagate the new options to
indirect extents pointed to by that file.

Updating (setting/clearing) the rebalance_work btree is done by the
extent trigger, which looks at the bch_extent_rebalance field.

Scanning is still requrired after changing IO path options - either just
for a given inode, or for the whole filesystem. We indicate that
scanning is required by adding a KEY_TYPE_cookie key to the
rebalance_work btree: the cookie counter is so that we can detect that
scanning is still required when an option has been flipped mid-way
through an existing scan.

Future possible work:
 - Propagate options to indirect extents when being changed
 - Add other IO path options - nr_replicas, ec, to rebalance_work so
   they can be applied in the background when they change
 - Add a counter, for bcachefs fs usage output, showing the pending
   amount of rebalance work: we'll probably want to do this after the
   disk space accounting rewrite (moving it to a new btree)

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-01 21:11:05 -04:00
Kent Overstreet 96a363a7e6 bcachefs: move: move_stats refactoring
data_progress_list is gone - it was redundant with moving_context_list

The upcoming rebalance rewrite is going to have it using two different
move_stats objects with the same moving_context, depending on whether
it's scanning or using the rebalance_work btree - this patch plumbs
stats around a bit differently so that will work.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-31 12:18:38 -04:00
Kent Overstreet 37707bb183 bcachefs: Split out disk_groups_types.h
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-31 12:18:37 -04:00
Kent Overstreet b0b5bbf99f bcachefs: Don't run bch2_delete_dead_snapshots() unnecessarily
Be a bit more careful about when bch2_delete_dead_snapshots needs to
run: it only needs to run synchronously if we're running fsck, and it
only needs to run at all if we have snapshot nodes to delete or if fsck
has noticed that it needs to run.

Also:
  Rename BCH_FS_HAVE_DELETED_SNAPSHOTS -> BCH_FS_NEED_DELETE_DEAD_SNAPSHOTS

  Kill bch2_delete_dead_snapshots_hook(), move functionality to
  bch2_mark_snapshot()

  Factor out bch2_check_snapshot_needs_deletion(), to explicitly check
  if we need to be running snapshot deletion.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-31 12:18:37 -04:00
Kent Overstreet 37fad9497f bcachefs: snapshot_create_lock
Add a new lock for snapshot creation - this addresses a few races with
logged operations and snapshot deletion.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:16 -04:00
Kent Overstreet d2a990d1b1 bcachefs: bch_err_msg(), bch_err_fn() now filters out transaction restart errors
These errors aren't actual errors, and should never be printed - do this
in the common helpers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:15 -04:00
Kent Overstreet 6bd68ec266 bcachefs: Heap allocate btree_trans
We're using more stack than we'd like in a number of functions, and
btree_trans is the biggest object that we stack allocate.

But we have to do a heap allocatation to initialize it anyways, so
there's no real downside to heap allocating the entire thing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:13 -04:00
Kent Overstreet 96dea3d599 bcachefs: Fix W=12 build errors
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:13 -04:00
Kent Overstreet aaad530ac6 bcachefs: BTREE_ID_logged_ops
Add a new btree for long running logged operations - i.e. for logging
operations that we can't do within a single btree transaction, so that
they can be resumed if we crash.

Keys in the logged operations btree will represent operations in
progress, with the state of the operation stored in the value.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:12 -04:00
Kent Overstreet e691b391f0 bcachefs: Add logging to bch2_inode_peek() & related
Add error messages when we fail to lookup an inode, and also add a few
missing bch2_err_class() calls.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:09 -04:00