Commit graph

1216935 commits

Author SHA1 Message Date
Kent Overstreet
1f93726e63 bcachefs: Tracepoint improvements
Delete some obsolete tracepoints, organize alloc tracepoints better,
make a few tracepoints more consistent.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:32 -04:00
Kent Overstreet
8cc052db63 bcachefs: Don't kick journal reclaim unless low on space
We shouldn't kick journal reclaim unnecessarily, it's got its own timer
for that.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:32 -04:00
Kent Overstreet
fd4cecd258 bcachefs: Lock ordering fix
Can't take btree node locks while holding btree_reserve_cache_lock - it
would be nice if we could check this with lockdep.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:32 -04:00
Kent Overstreet
c0960603e2 bcachefs: Shutdown path improvements
We're seeing occasional firings of the assertion in the key cache
shutdown code that nr_dirty == 0, which means we must sometimes be doing
transaction commits after we've gone read only.

Cleanups & changes:
 - BCH_FS_ALLOC_CLEAN renamed to BCH_FS_CLEAN_SHUTDOWN
 - new helper bch2_btree_interior_updates_flush(), which returns true if
   it had to wait
 - bch2_btree_flush_writes() now also returns true if there were btree
   writes in flight
 - __bch2_fs_read_only now checks if btree writes were in flight in the
   shutdown loop: btree write completion does a transaction update, to
   update the pointer in the parent node
 - assert that !BCH_FS_CLEAN_SHUTDOWN in __bch2_trans_commit

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:32 -04:00
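The shutdown loop described in c0960603e2 can be sketched roughly as follows. This is a toy userspace model, not the bcachefs code: struct fs_state, its counters and all function names are hypothetical. The assumption that each finished btree write queues exactly one more interior update is invented here to show why a single flush pass isn't enough.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: counters standing in for in-flight work. */
struct fs_state {
	int interior_updates_in_flight;
	int btree_writes_in_flight;
	bool clean_shutdown;		/* stands in for BCH_FS_CLEAN_SHUTDOWN */
};

/* Modeled on bch2_btree_interior_updates_flush(): true if it had to wait. */
static bool interior_updates_flush(struct fs_state *c)
{
	if (!c->interior_updates_in_flight)
		return false;
	c->interior_updates_in_flight = 0;
	return true;
}

/* Modeled on bch2_btree_flush_writes(): true if writes were in flight.
 * Completing a btree write updates the pointer in its parent node, so in
 * this model each finished write queues one more interior update. */
static bool btree_flush_writes(struct fs_state *c)
{
	if (!c->btree_writes_in_flight)
		return false;
	c->interior_updates_in_flight += c->btree_writes_in_flight;
	c->btree_writes_in_flight = 0;
	return true;
}

/* Keep flushing until a full pass finds nothing to wait on; only then is
 * the shutdown marked clean. Returns the number of passes taken. */
static int fs_read_only(struct fs_state *c)
{
	int passes = 0;
	bool waited;

	do {
		/* |= rather than || so both flushes run on every pass */
		waited = interior_updates_flush(c);
		waited |= btree_flush_writes(c);
		passes++;
	} while (waited);

	c->clean_shutdown = true;
	return passes;
}

/* Mirrors the new assertion in the transaction commit path. */
static void trans_commit(struct fs_state *c)
{
	assert(!c->clean_shutdown);
}
```

In this model, starting from one interior update and two btree writes, fs_read_only() needs three passes: the writes queue new interior updates, which a later pass has to flush before a pass can come back empty.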
Kent Overstreet
d8f31407c8 bcachefs: Fix hash_check_key()
hash_check_key() was incorrectly handling transaction restarts - switch
it to for_each_btree_key_norestart().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:32 -04:00
Kent Overstreet
a729e489ab bcachefs: Allocate some extra room in btree_key_cache_fill()
If we allocate a buffer that's a bit bigger than necessary, the
transaction commit path will be much less likely to have to reallocate -
which requires a transaction restart.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:32 -04:00
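A minimal sketch of the sizing idea in a729e489ab, with sizes measured in u64s. The commit doesn't say how much extra room is allocated; rounding up to the next power of two here is an assumption, and both helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sizing policy (assumption): round the key size up to the
 * next power of two, so a key that grows a little still fits. */
static size_t key_cache_alloc_u64s(size_t key_u64s)
{
	size_t n = 1;

	while (n < key_u64s)
		n <<= 1;
	return n;
}

/* The commit path only has to reallocate (which forces a transaction
 * restart) when the updated key no longer fits in the existing buffer. */
static bool commit_needs_realloc(size_t allocated_u64s, size_t new_key_u64s)
{
	return new_key_u64s > allocated_u64s;
}
```

With an exact-size buffer of 5 u64s, growing the key to 7 forces a reallocation; with the rounded-up buffer of 8 it does not.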
Kent Overstreet
b0babf2a34 bcachefs: bch2_btree_iter_peek_all_levels()
This adds bch2_btree_iter_peek_all_levels(), which returns keys from
every level of the btree - interior nodes included - in monotonically
increasing order, soon to be used by the backpointers check & repair
code.

 - BTREE_ITER_ALL_LEVELS can now be passed to for_each_btree_key() to
   iterate thusly, much like BTREE_ITER_SLOTS

 - The existing algorithm in bch2_btree_iter_advance() doesn't work with
   peek_all_levels(): we have to defer the actual advancing until the
   next time we call peek, where we have the btree path traversed and
   uptodate. So, we add an advanced bit to btree_iter; when
   BTREE_ITER_ALL_LEVELS is set, bch2_btree_iter_advance() just marks
   the iterator as advanced.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:32 -04:00
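The deferred-advance trick from b0babf2a34 can be illustrated with a toy iterator over a sorted array. The struct and function names are hypothetical, and the real iterator walks btree paths rather than an array; the only point shown is that, with the all-levels flag set, advance() just records the request and the next peek() carries it out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy iterator; in the real code peek() is where the btree
 * path is traversed and up to date. */
struct toy_iter {
	const int *keys;
	size_t nr;
	size_t pos;
	bool all_levels;	/* stands in for BTREE_ITER_ALL_LEVELS */
	bool advanced;		/* advance requested but not yet applied */
};

static void toy_iter_advance(struct toy_iter *it)
{
	if (it->all_levels)
		it->advanced = true;	/* defer: just mark the iterator */
	else
		it->pos++;
}

/* Returns the current key, or -1 at the end. */
static int toy_iter_peek(struct toy_iter *it)
{
	if (it->advanced) {
		/* Apply the deferred advance now that we're positioned. */
		it->pos++;
		it->advanced = false;
	}
	return it->pos < it->nr ? it->keys[it->pos] : -1;
}
```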
Kent Overstreet
c4bce58675 bcachefs: btree_path_set_level_(up|down)
This adds two new helpers to btree_iter.c for changing the level of a
path up or down - to be used by the new
bch2_btree_iter_peek_all_levels().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:32 -04:00
Kent Overstreet
2ae4573e57 bcachefs: bch2_btree_iter_peek_slot() now works on interior nodes
The new backpointers code will be using bch2_btree_iter_peek_slot() on
interior nodes - this patch updates peek_slot() to make that work.

 - Pass the correct level to bch2_journal_keys_peek_slot()
 - We should only set BTREE_ITER_CACHED or BTREE_ITER_WITH_KEY_CACHE
   when using bch2_trans_iter_init(), not bch2_trans_node_iter_init()
 - Update assertions

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:32 -04:00
Kent Overstreet
7419646b25 bcachefs: btree_update_interior.c prep for backpointers
Previously, btree_update_interior.c passed keys to bch2_trans_mark_*
that hadn't been fully initialized - they didn't have the key field
filled out, just the value.

With backpointers, we need to make sure keys are fully initialized
before marking them - because the backpointer points back to the
original key.

This patch tweaks the interior update paths to fix this.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:32 -04:00
Kent Overstreet
e1b8f5f5ca bcachefs: Plumb btree_id & level to trans_mark
For backpointers, we'll need the full key location - that means btree_id
and btree level. This patch plumbs it through.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:32 -04:00
Kent Overstreet
0095aa94bc bcachefs: Improve some fsck error messages
We have string names for d_type; use them.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
b33bf1bc0d bcachefs: Go emergency RO when i_blocks underflows
This improves some of our warnings and assertions - they imply possible
filesystem inconsistencies, so they should be calling
bch2_fs_inconsistent().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
42796f74f4 bcachefs: Ensure sysfs show fns print a newline
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
75c8d0305a bcachefs: Kill old rebuild_replicas option
This option was useful when the replicas mechanism was new and still being
debugged, but hasn't been used in ages - let's delete it.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
41fc862224 bcachefs: In fsck, pass BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE when deleting dirents
A user reported an error where we hit an assertion due to deleting a key
in an internal snapshot node, when deleting a dirent that points to a
nonexistent inode.

We try to avoid doing updates to keys for internal snapshot nodes, but
upon inspection of the places where we remove dirents in fsck it appears
BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE is correct for all of them: either
the target dirent doesn't exist, or it's a directory with multiple
dirents pointing to it.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
c609947b5e bcachefs: Fix for getting stuck in journal replay
In journal replay, we weren't immediately dropping journal pins when we
start doing updates that weren't from journal replay - leading to
journal reclaim getting stuck.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
e492e7b6f6 bcachefs: Improve error logging in fsck.c
This adds error logging to a bunch of functions in fsck.c - in fsck,
redundant error messages are probably better than not enough.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
e296b1f9ca bcachefs: Fix inode_backpointer_exists()
If the dirent an inode points to doesn't exist, we shouldn't be
returning an error - just 0/false.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
0b09032653 bcachefs: Improve bch2_lru_delete() error messages
When we detect a filesystem inconsistency, we should include the
relevant keys in the error message. This patch adds a parameter to pass
the key with the lru entry to bch2_lru_delete(), so that it can be
printed.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
5650bb46be bcachefs: Introduce bch2_journal_keys_peek_(upto|slot)()
When many journal replay keys have been overwritten,
bch2_journal_keys_peek() was taking excessively long to scan before it
found a key to return.

Fix this by introducing bch2_journal_keys_peek_upto() which takes a
parameter for the end of the range we want, so that we can terminate the
search much sooner, and replace all uses of bch2_journal_keys_peek()
with peek_upto() or peek_slot().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
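A simplified sketch of the bounded search from 5650bb46be, assuming journal keys sit in an array sorted by a single integer position, each with an overwritten flag. The real keys are ordered by btree ID, level and position; these types and helpers are hypothetical stand-ins, not the bcachefs functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified journal key. */
struct jkey {
	unsigned long pos;
	bool overwritten;
};

/* Return the first live key with start <= pos <= end, or NULL.
 * keys[] is sorted by pos. The scan stops as soon as it passes end, so
 * overwritten keys beyond the range are never visited. */
static const struct jkey *journal_keys_peek_upto(const struct jkey *keys, size_t nr,
						 unsigned long start, unsigned long end)
{
	size_t l = 0, r = nr;

	/* Binary search for the first key with pos >= start. */
	while (l < r) {
		size_t m = l + (r - l) / 2;

		if (keys[m].pos < start)
			l = m + 1;
		else
			r = m;
	}

	for (size_t i = l; i < nr && keys[i].pos <= end; i++)
		if (!keys[i].overwritten)
			return &keys[i];
	return NULL;
}

/* A slot lookup is the degenerate range [pos, pos]. */
static const struct jkey *journal_keys_peek_slot(const struct jkey *keys, size_t nr,
						 unsigned long pos)
{
	return journal_keys_peek_upto(keys, nr, pos, pos);
}
```

An unbounded peek starting at a run of overwritten keys keeps walking until it finds a live key anywhere later in the array; the bounded version gives up at the first position past end.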
Kent Overstreet
9b93596c33 bcachefs: Improve error message when alloc key doesn't match lru entry
Error messages should always print out the full key when available -
this gives us a starting point when looking through the journal to debug
what went wrong.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
7003589dab bcachefs: Ensure buckets have io_time[READ] set
It's an error if a bucket is in state BCH_DATA_cached but not on the LRU
btree - i.e. io_time[READ] == 0 - so, make sure it's set before adding
it.

Also, make some of the LRU code a bit clearer and more direct.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
84befe8ef9 bcachefs: Use bch2_trans_inconsistent_on() in more places
This gets us better error messages.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
3518e6faef bcachefs: Improve bch2_open_buckets_to_text()
This patch updates bch2_open_buckets_to_text() to include the device and
bucket the open_bucket owns.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
ec7ccbde6b bcachefs: Fix CPU usage in journal read path
In journal_entry_add(), we were repeatedly scanning the journal entries
radix tree looking for old entries that can be freed, with O(n^2)
behaviour. This patch tweaks things to remember the previous last_seq,
so we don't have to scan for entries to free from the start.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
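The fix in ec7ccbde6b can be illustrated with a toy model in which journal entries are indexed directly by sequence number in a flat array, standing in for the radix tree. The struct, the MAX_SEQ bound and the helpers are hypothetical; slots_visited exists only to show that each slot is freed once over the whole read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SEQ 64

/* Hypothetical toy model: present[seq] marks an entry we're still holding. */
struct journal_read_state {
	bool present[MAX_SEQ];
	size_t prev_last_seq;	/* everything below this is already freed */
	size_t slots_visited;	/* instrumentation for the example */
};

/* Free entries older than last_seq. Starting at prev_last_seq rather than
 * 0 means each slot is visited at most once across all adds, instead of
 * rescanning from the start every time (O(n^2) overall). */
static void drop_old_entries(struct journal_read_state *j, size_t last_seq)
{
	for (size_t seq = j->prev_last_seq; seq < last_seq; seq++) {
		j->present[seq] = false;
		j->slots_visited++;
	}
	if (last_seq > j->prev_last_seq)
		j->prev_last_seq = last_seq;
}

static void journal_entry_add(struct journal_read_state *j, size_t seq, size_t last_seq)
{
	assert(seq < MAX_SEQ);
	drop_old_entries(j, last_seq);
	if (seq >= last_seq)
		j->present[seq] = true;
}
```

Reading 32 entries where last_seq trails seq by four visits 27 slots in total, one per freed entry.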
Kent Overstreet
6e811bbbc2 bcachefs: Fix a null ptr deref
We start doing allocations before the GC thread is created, which means
we need to check for that to avoid a null ptr deref.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
cf0dd697eb bcachefs: Don't trigger extra assertions in journal replay
We now pass a rw argument to .key_invalid methods so they can trigger
assertions for updates but not on existing keys. We shouldn't trigger
these extra assertions in journal replay - this patch changes the
transaction commit path accordingly.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
a9c0a4cbf1 bcachefs: Minor device removal fixes
 - We weren't clearing the LRU btree
 - bch2_alloc_read() runs before bch2_check_alloc_key() deletes alloc
   keys for devices/buckets that don't exist, so it needs to check for
   that
 - bch2_check_lrus() needs to check that buckets exist
 - improve some error messages

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
502f973dba bcachefs: Fix a few warnings on 32 bit
These showed up when building for MIPS.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
aae29082c6 bcachefs: bch2_btree_delete_extent_at()
New helper, for deleting extents.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
7c4ca54ae6 bcachefs: Don't skip triggers in fcollapse()
With backpointers this doesn't work anymore - backpointers always need
to be updated to point to the new extent position.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
84c72755b9 bcachefs: Initialize ec work structs early
We need to ensure that work structs in bch_fs always get initialized -
otherwise an error in filesystem initialization can pop a warning in the
workqueue code when we try to cancel a work struct that wasn't
initialized.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
ce6201c456 bcachefs: Use a genradix for reading journal entries
Previously, the journal read path used a linked list for storing the
journal entries we read from disk. But there's been a bug that's been
causing journal_flush_delay to incorrectly be set to 0, leading to far
more journal entries than is normal being written out, which then means
filesystems are no longer able to start due to the O(n^2) behaviour of
inserting into/searching that linked list.

Fix this by switching to a radix tree.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
Kent Overstreet
95752a02cb bcachefs: Refactor journal_keys_sort() to return an error code
When there weren't any keys in the journal there's no need to allocate
the buffer - but doing that causes a spurious -ENOMEM.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
Kent Overstreet
822835ffea bcachefs: Fold bucket_state in to BCH_DATA_TYPES()
Previously, we were missing accounting for buckets in need_gc_gens and
need_discard states. This matters because buckets in those states need
other btree operations done before they can be used, so they can't be
counted when checking the current number of free buckets against the
allocation watermark.

Also, we weren't directly counting free buckets at all. Now, data type 0
== BCH_DATA_free, and free buckets are counted; this means we can get
rid of the separate (poorly defined) count of unavailable buckets.

This is a new on-disk format version, with upgrade and fsck required for
the accounting changes.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
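The X-macro pattern implied by BCH_DATA_TYPES() can be sketched like this, with free as data type 0. The list is an illustrative subset, and the names table, struct dev_usage and the helpers are hypothetical. The point is that one list drives both the enum and its string names, and that only BCH_DATA_free buckets count as available, since need_gc_gens and need_discard buckets still need btree work first.

```c
#include <assert.h>
#include <string.h>

/* Illustrative subset; the real list in bcachefs has more entries. */
#define BCH_DATA_TYPES()	\
	x(free)			\
	x(btree)		\
	x(user)			\
	x(cached)		\
	x(need_gc_gens)		\
	x(need_discard)

enum bch_data_type {
#define x(n) BCH_DATA_##n,
	BCH_DATA_TYPES()
#undef x
	BCH_DATA_NR
};

/* Generated from the same list, so names can't drift from the enum. */
static const char * const bch_data_type_names[] = {
#define x(n) #n,
	BCH_DATA_TYPES()
#undef x
};

/* Hypothetical per-device bucket counts, indexed by data type. */
struct dev_usage {
	unsigned long buckets[BCH_DATA_NR];
};

/* Only free buckets count against the allocation watermark. */
static unsigned long buckets_available(const struct dev_usage *u)
{
	return u->buckets[BCH_DATA_free];
}

static int data_type_from_name(const char *name)
{
	for (int i = 0; i < BCH_DATA_NR; i++)
		if (!strcmp(bch_data_type_names[i], name))
			return i;
	return -1;
}
```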
Kent Overstreet
8058ea64c3 bcachefs: Add a sysfs attr for triggering discards
We're currently debugging an issue with discards not getting run; this
patch adds a manual trigger so we can then watch the tracepoint while it
runs.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
Kent Overstreet
48620e5177 bcachefs: Topology repair fixes
 - We were failing to start topology repair, because we hadn't set the
   superblock flag indicating it needed to run
 - set_node_min() forgot to update the btree node's key
 - bch2_gc_alloc_reset() didn't reset data type, leading to inserting an
   invalid key that was empty but had nonzero data type

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
Kent Overstreet
5e05d7ed3d bcachefs: Use bch2_trans_inconsistent() more
This gets us better error messages.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
Kent Overstreet
62491956f4 bcachefs: Move alloc assertion to .key_invalid()
.key_invalid is a better place for this assertion.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
Kent Overstreet
1d8a268940 bcachefs: Improve btree_bad_header()
In the future printbufs will be mempool-ified, so we shouldn't be using
more than one at a time if we don't have to.

This also fixes an extra trailing newline.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
Kent Overstreet
11c7d3e817 bcachefs: Check for read_time == 0 in bch2_alloc_v4_invalid()
We've been seeing this error in fsck and we weren't able to track down
where it came from - but now that .key_invalid methods take a rw
argument, we can safely check for this.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
Kent Overstreet
292dea86df bcachefs: fsck: Work around transaction restarts
In check_extents() and check_dirents(), we're working towards only
handling transaction restarts in one place, at the top level - but we're
not there yet. check_i_sectors() and check_subdir_count() handle
transaction restarts locally, which means the iterator for the
dirent/extent is left unlocked (should_be_locked == 0), leading to
asserts popping when we go to do updates.

This patch hacks around this for now, until we can delete the offending
code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
Kent Overstreet
275c8426fb bcachefs: Add rw to .key_invalid()
This adds a new parameter to .key_invalid() methods for whether the key
is being read or written; the idea being that methods can do more
aggressive checks when a key is newly created and being written; those
checks shouldn't run on existing keys being read, since failing them
would mean deleting the key.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
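A hypothetical sketch of how a .key_invalid-style method might use the rw argument, borrowing the read_time == 0 rule from 11c7d3e817 and 7003589dab (a cached bucket must have io_time[READ] set). The types, enum values and signature are assumptions for illustration; the real methods report errors through a printbuf (see f0ac7df23d), not the plain char buffer used here.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins, not the real bcachefs types or signatures. */
enum key_rw { KEY_READ, KEY_WRITE };

enum { DATA_FREE, DATA_USER, DATA_CACHED, DATA_NR };

struct alloc_key {
	int data_type;
	unsigned long long io_time_read;
};

/* Returns 0 if the key is valid, -1 (with a message in err) if not. */
static int alloc_key_invalid(const struct alloc_key *a, enum key_rw rw,
			     char *err, size_t err_len)
{
	/* Check every key must pass, whether old or newly written. */
	if (a->data_type < 0 || a->data_type >= DATA_NR) {
		snprintf(err, err_len, "invalid data type %d", a->data_type);
		return -1;
	}

	/* Stricter check for keys being written only: an existing key that
	 * fails it isn't treated as invalid (and deleted) on read. */
	if (rw == KEY_WRITE && a->data_type == DATA_CACHED && !a->io_time_read) {
		snprintf(err, err_len, "cached bucket with read_time == 0");
		return -1;
	}

	return 0;
}
```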
Kent Overstreet
e1effd42a1 bcachefs: More improvements for alloc info checks
 - Move checks for whether the device & bucket are valid from the
   .key_invalid method to bch2_check_alloc_key(). This is because
   .key_invalid() is called on keys that may no longer exist (post
   journal replay), which is a problem when removing/resizing devices.

 - We weren't checking the need_discard btree to ensure that every set
   bucket has a corresponding alloc key. This refactors the code for
   checking the freespace btree, so that it now checks both.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
Kent Overstreet
afb6f7f61b bcachefs: Silence spurious copygc err when shutting down
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
Kent Overstreet
f0ac7df23d bcachefs: Convert .key_invalid methods to printbufs
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
Kent Overstreet
d1d7737fd9 bcachefs: Gap buffer for journal keys
Btree updates before we go RW work by inserting into the array of keys
that journal replay will insert - but inserting into a flat array is
O(n), meaning if btree_gc needs to update many alloc keys, we're O(n^2).

Fortunately, btree_gc's updates happen in sequential order,
which means a gap buffer works nicely here - this patch implements a gap
buffer for journal keys.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
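A minimal gap buffer along the lines of d1d7737fd9, assuming journal keys are reduced to sorted ints. The struct and helpers are hypothetical. Live elements sit on both sides of a run of unused slots; an insert first moves the gap to the insertion point, so when inserts arrive in sorted order the gap is already there and nothing has to be shifted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Live elements are in d[0..gap) and d[gap + gap_len..size). */
struct keys_gapbuf {
	int *d;
	size_t nr;	/* live elements */
	size_t size;	/* allocated slots */
	size_t gap;	/* logical index where the gap starts */
};

static size_t gap_len(const struct keys_gapbuf *k)
{
	return k->size - k->nr;
}

/* Map logical index i to its physical slot. */
static int gapbuf_get(const struct keys_gapbuf *k, size_t i)
{
	return i < k->gap ? k->d[i] : k->d[i + gap_len(k)];
}

/* Move the gap to logical index idx; cost is proportional to distance. */
static void move_gap(struct keys_gapbuf *k, size_t idx)
{
	size_t g = gap_len(k);

	if (idx < k->gap)
		memmove(k->d + idx + g, k->d + idx, (k->gap - idx) * sizeof(*k->d));
	else if (idx > k->gap)
		memmove(k->d + k->gap, k->d + k->gap + g, (idx - k->gap) * sizeof(*k->d));
	k->gap = idx;
}

static int gapbuf_insert(struct keys_gapbuf *k, int key)
{
	size_t l = 0, r;

	if (k->nr == k->size) {
		size_t new_size = k->size ? k->size * 2 : 8;
		int *n;

		/* Full means the gap is empty: park it at the end so the new
		 * slots from realloc() become the gap. */
		move_gap(k, k->nr);
		n = realloc(k->d, new_size * sizeof(*n));
		if (!n)
			return -1;
		k->d = n;
		k->size = new_size;
	}

	/* Binary search over logical indices for the first element > key. */
	r = k->nr;
	while (l < r) {
		size_t m = l + (r - l) / 2;

		if (gapbuf_get(k, m) <= key)
			l = m + 1;
		else
			r = m;
	}

	move_gap(k, l);
	k->d[k->gap++] = key;
	k->nr++;
	return 0;
}
```

Appending 0, 2, ..., 198 in order never moves any data; a single out-of-order insert such as 51 pays for one memmove of the elements between the insertion point and the old gap.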
Kent Overstreet
7c7e071d90 bcachefs: Don't normalize to pages in btree cache shrinker
This behavior dates from the early, early days of bcache, and upon
further delving appears to not make any sense. The shrinker only works
in terms of 'objects' of unknown size; normalizing to pages only had the
effect of changing the batch size, which we could do directly - if we
wanted; we probably don't. Normalizing to pages meant our batch size was
very small, which seems to have been keeping us from doing as much
shrinking as we should be under heavy memory pressure; this patch
appears to alleviate some OOMs we've been seeing.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
Kent Overstreet
4254f5bf6e bcachefs: Add a tracepoint for superblock writes
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:30 -04:00