Commit graph

1217488 commits

Author SHA1 Message Date
Kent Overstreet
56cc033dfc bcachefs: Don't run transaction hooks multiple times
transaction hooks aren't supposed to run unless we know the transaction
is going to commit succesfully: this fixes a bug with attempting to
delete a subvolume multiple times.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
26559553e4 bcachefs: Add a fallback when journal_keys doesn't fit in ram
We may end up in a situation where allocating the buffer for the sorted
journal_keys fails - but it would likely succeed, post compaction where
we drop duplicates.

We've had reports of this allocation failing, so this adds a slowpath to
do the compaction incrementally.

This is only a band-aid fix; we need to look at limiting the number of
keys in the journal based on the amount of system RAM.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
2f0815840c bcachefs: Improve the backpointer to missing extent message
We now print the pos where the backpointer was found in the btree, as
well as the exact bucket:bucket_offset of the data, to aid in grepping
through logs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
40a18fe273 bcachefs: Add error message for failing to allocate sorted journal keys
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
b40901b0f7 bcachefs: New erasure coding shutdown path
This implements a new shutdown path for erasure coding, which is needed
for the upcoming BCH_WRITE_WAIT_FOR_EC write path.

The process is:
 - Cancel new stripes being built up
 - Close out/cancel open buckets on write points or the partial list
   that are for stripes
 - Shutdown rebalance/copygc
 - Then wait for in flight new stripes to finish

With BCH_WRITE_WAIT_FOR_EC, move ops will be waiting on stripes to fill
up before they complete; the new ec shutdown path is needed for shutting
down copygc/rebalance without deadlocking.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
b9fa375bab bcachefs: bch2_fs_moving_ctxts_to_text()
This also adds bch2_write_op_to_text(): now we can see outstand moves,
useful for debugging shutdown with the upcoming BCH_WRITE_WAIT_FOR_EC
and likely for other things in the future.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
65d48e3525 bcachefs: Private error codes: ENOMEM
This adds private error codes for most (but not all) of our ENOMEM uses,
which makes it easier to track down assorted allocation failures.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
872c031167 bcachefs: Fix bch2_check_extents_to_backpointers()
In rare cases, bch2_check_extents_to_backpointers() would incorrectly
flag an extent has having a missing backpointer when we just needed to
flush the btree write buffer - we weren't tracking the last flushed
position correctly.

This adds a level field to the last_flushed pos, fixing a bug where we'd
sometimes fail on a new root node.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
c639c29ce6 bcachefs: Fix an assert in copygc thread shutdown path
We're not supposed to have nested (locked) btree_trans on the stack:
this means copygc shutdown needs to exit our btree_trans before exiting
the move_ctxt, which calls bch2_write().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
2d004446c8 bcachefs: bch2_bucket_is_movable() -> BTREE_ITER_CACHED
BTREE_ITER_CACHED should really be the default for cached btrees - this
is an easy mistake to make.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
3997989ae1 bcachefs: Don't use BTREE_ITER_INTENT in make_extent_indirect()
This is a workaround for a btree path overflow - searching with
BTREE_ITER_INTENT periodically saves the iterator position for updates,
which eventually overflows.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
aebe7a679c bcachefs: Fix stripe create error path
If we errored out on a new stripe before fully allocating it, we
shouldn't be zeroing out unwritten data.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
ae1f56238d bcachefs: Mark new snapshots earlier in create path
This fixes a null ptr deref when creating new snapshots:
bch2_create_trans() will lookup the subvolume and find the _new_
snapshot in the BCH_CREATE_SUBVOL path that's being created in that
transaction.

We have to call bch2_mark_snapshot() earlier so that it's properly
initialized, instead of leaving it for transaction commit.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
e6539b0aeb bcachefs: Improve bch2_new_stripes_to_text()
Print out the alloc reserve, and format it a bit more nicely.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
751c025f0d bcachefs: Kill bch_write_op->btree_update_ready
This changes the write path to not add write ops to to the write_point's
list of pending work items until it's ready; this means we have to
change the lock protecting it to an irq-safe lock, but means
bch2_write_point_do_index_updates() no longer has to iterate over the
list, which is beneficial with the way the new BCH_WRITE_WAIT_FOR_EC
code works.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
e28ef07e0e bcachefs: Simplify stripe_idx_to_delete
This is not technically correct - it's subject to a race if we ever end
up with a stripe with all empty blocks (that needs to be deleted) being
held open. But the "correct" version was much too inefficient, and soon
we'll be adding a stripes LRU.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
46e14854fc bcachefs: Fix next_bucket()
This fixes an infinite loop in bch2_get_key_or_real_bucket_hole().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
fba053d2aa bcachefs: Second layer of refcounting for new stripes
This will be used for move writes, which will be waiting until the
stripe is created to do the index update. They need to prevent the
stripe from being reclaimed until their index update is done, so we need
another refcount that just keeps the stripe open.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# Conflicts:
#	fs/bcachefs/ec.c
#	fs/bcachefs/io.c
2023-10-22 17:09:56 -04:00
Kent Overstreet
10d9f7d285 bcachefs: ec: fall back to creating new stripes for copygc
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
57c723de7d bcachefs: Rework __bch2_data_update_index_update()
This makes some improvements to the logic for adding/removing replicas,
as part of the larger erasure coding improvements. We now directly
consider number of replicas desired for the given inode, and
extent/pointer durability: this ensures that the extent ends up with the
desired number of replicas when we're replacing multiple pointers with
one that has higher durability (e.g. erasure coded).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
702ffea204 bcachefs: Extent helper improvements
- __bch2_bkey_drop_ptr() -> bch2_bkey_drop_ptr_noerror(), now available
   outside extents.

 - Split bch2_bkey_has_device() and bch2_bkey_has_device_c(), const and
   non const versions

 - bch2_extent_has_ptr() now returns the pointer it found

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
3f5d3fb402 bcachefs: evacuate_bucket() no longer moves cached ptrs
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
5bf9db0179 bcachefs: evacuate_bucket() no longer calls verify_bucket_evacuated()
The copygc code itself now calls this when all moves from a given bucket
are complete.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
51fe0332b1 bcachefs: Suppress transaction restart err message
This isn't a real error, and doesn't need to be printed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
7635e1a6d6 bcachefs: Rework open bucket partial list allocation
Now, any open_bucket can go on the partial list: allocating from the
partial list has been moved to its own dedicated function,
open_bucket_add_bucets() -> bucket_alloc_set_partial().

In particular, this means that erasure coded buckets can safely go on
the partial list; the new location works with the "allocate an ec bucket
first, then the rest" logic.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Brian Foster
e53d03fe39 bcachefs: don't bump key cache journal seq on nojournal commits
fstest generic/388 occasionally reproduces corruptions where an
inode has extents beyond i_size. This is a deliberate crash and
recovery test, and the post crash+recovery characteristics are
usually the same: the inode exists on disk in an early (i.e. just
allocated) state based on the journal sequence number associated
with the inode. Subsequent inode updates exist in the journal at
higher sequence numbers, but the inode hadn't been written back
before the associated crash and the post-crash recovery processes a
set of journal sequence numbers that doesn't include updates to the
inode. In fact, the sequence with the most recent inode key update
always happens to be the sequence just before the front of the
journal processed by recovery.

This last bit is a significant hint that the problem relates to an
on-disk journal update of the front of the journal. The root cause
of this problem is basically that the inode is updated (multiple
times) in-core and in the key cache, each time bumping the key cache
sequence number used to control the cache flush. The cache flush
skips one or more times, bumping the associated key cache journal
pin to the key cache seq value. This has a side effect of holding
the inode in memory a bit longer than normal, which helps exacerbate
this problem, but is also unsafe in certain cases where the key
cache seq may have been updated by a transaction commit that didn't
journal the associated key.

For example, consider an inode that has been allocated, updated
several times in the key cache, journaled, but not yet written back.
At this stage, everything should be consistent if the fs happens to
crash because the latest update has been journal. Now consider a key
update via bch2_extent_update_i_size_sectors() that uses the
BTREE_UPDATE_NOJOURNAL flag. While this update may not change inode
state, it can have the side effect of bumping ck->seq in
bch2_btree_insert_key_cached(). In turn, if a subsequent key cache
flush skips due to seq not matching the former, the ck->journal pin
is updated to ck->seq even though the most recent key update was not
journaled. If this pin happens to reside at the front (tail) of the
journal, this means a subsequent journal write can update last_seq
to a value beyond that which includes the most recent update to the
inode. If this occurs and the fs happens to crash before the inode
happens to flush, recovery will see the latest last_seq, fail to
recover the inode and leave the inode in the inconsistent state
described above.

To avoid this problem, skip the key cache seq update on NOJOURNAL
commits, except on initial pin add. Pass the insert entry directly
to bch2_btree_insert_key_cached() to make the associated flag
available and be consistent with btree_insert_key_leaf().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
83ec519aea bcachefs: When shutting down, flush btree node writes last
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
adac06fad3 bcachefs: Verbose on by default when CONFIG_BCACHEFS_DEBUG=y
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
db64a8e8a1 fixup bcachefs: Use for_each_btree_key_upto() more consistently
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
4b5b13da52 six locks: be more careful about lost wakeups
This is a workaround for a lost wakeup bug we've been seeing - we still
need to discover the actual bug.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
2640faeb17 bcachefs: Journal resize fixes
- Fix a sleeping-in-atomic bug due to calling
   bch2_journal_buckets_to_sb() under the journal lock.
 - Additionally, now we mark buckets as journal buckets before adding
   them to the journal in memory and the superblock. This ensures that
   if we crash part way through we'll never be writing to journal
   buckets that aren't marked correctly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
511b629aca bcachefs: bch2_btree_iter_peek_node_and_restart()
Minor refactoring for the Rust interface.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
b65499b7b1 bcachefs: bch2_btree_node_ondisk_to_text()
Pulling out a helper from cmd_list.c, as the rest is being rewritten in
Rust but we're not ready to rewrite lower-level btree code in Rust.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
a345b0f393 bcachefs: bch2_btree_node_to_text() const correctness
This is for the Rust interface - Rust cares more about const than C
does.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
26bab33b69 bcachefs: Fix "btree node in stripe" error
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
2a912a9a39 bcachefs: Kill bch2_ec_bucket_written()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
81c771b266 bcachefs: Improve bch2_new_stripes_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
8fcdf81418 bcachefs: Improved copygc pipelining
This improves copygc pipelining across multiple buckets: we now track
each in flight bucket we're evacuating, with separate moving_contexts.

This means that whereas previously we had to wait for outstanding moves
to complete to ensure we didn't try to evacuate the same bucket twice,
we can now just check buckets we want to evacuate against the pending
list.

This also mean we can run the verify_bucket_evacuated() check without
killing pipelining - meaning it can now always be enabled, not just on
debug builds.

This is going to be important for the upcoming erasure coding work,
where moving IOs that are being erasure coded will now skip the initial
replication step; instead the IOs will wait on the stripe to complete.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
0b943b973c bcachefs: Free move buffers as early as possible
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:56 -04:00
Kent Overstreet
5be6a274ff bcachefs: Fix stripe reuse path
It's possible that we reuse a stripe that doesn't have quite the same
configuration as the stripe_head we're allocating from. In that case, we
have to make sure that the new stripe uses the settings from the stripe
we resue, not the stripe head, and make sure the buffer is allocated
correctly.

This fixes the ec_mixed_tiers test.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:55 -04:00
Kent Overstreet
ac2ccddc26 bcachefs: Drop some anonymous structs, unions
Rust bindgen doesn't cope well with anonymous structs and unions. This
patch drops the fancy anonymous structs & unions in bkey_i that let us
use the same helpers for bkey_i and bkey_packed; since bkey_packed is an
internal type that's never exposed to outside code, it's only a minor
inconvenienc.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:55 -04:00
Kent Overstreet
45dd05b3ec bcachefs: BKEY_PADDED_ONSTACK()
Rust bindgen doesn't do anonymous structs very nicely: BKEY_PADDED()
only needs the anonymous struct when it's used on the stack, to
guarantee layout, not when it's embedded in another struct.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:55 -04:00
Kent Overstreet
2f528663c5 bcachefs: moving_context->stats is allowed to be NULL
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:55 -04:00
Kent Overstreet
e84face6f0 bcachefs: RESERVE_stripe
Rework stripe creation path - new algorithm for deciding when to create
new stripes or reuse existing stripes.

We add a new allocation watermark, RESERVE_stripe, above RESERVE_none.
Then we always try to create a new stripe by doing RESERVE_stripe
allocations; if this fails, we reuse an existing stripe and allocate
buckets for it with the reserve watermark for the given write
(RESERVE_none or RESERVE_movinggc).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:55 -04:00
Kent Overstreet
d57c9add59 bcachefs: Improve error message for stripe block sector counts wrong
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:55 -04:00
Kent Overstreet
9d32097f3b bcachefs: More stripe create cleanup/fixes
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:55 -04:00
Kent Overstreet
a1fb08f5df bcachefs: Plumb alloc_reserve through stripe create path
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:55 -04:00
Kent Overstreet
910659763e bcachefs: Mark stripe buckets with correct data type
Currently, we don't use bucket data type for tracking whether buckets
are part of a stripe; parity buckets are BCH_DATA_parity, but data
buckets in a stripe are BCH_DATA_user. There's a separate counter,
buckets_ec, outside the BCH_DATA_TYPES system for tracking number of
buckets on a device that are part of a stripe.

The trouble with this approach is that it's too coarse grained, and we
need better information on fragmentation for debugging copygc.

With this patch, data buckets in a stripe are now tracked as
BCH_DATA_stripe buckets.

This doesn't yet differentiate between erasure coded and non-erasure
coded data in a stripe bucket, nor do we yet track empty data buckets in
stripes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:55 -04:00
Kent Overstreet
3329cf1bb9 bcachefs: Centralize btree node lock initialization
This fixes some confusion in the lockdep code due to initializing btree
node/key cache locks with the same lockdep key, but different names.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:55 -04:00
Kent Overstreet
1306f87de3 bcachefs: Plumb btree_trans through btree cache code
Soon, __bch2_btree_node_write() is going to require a btree_trans: zoned
device support is going to require a new allocation for every btree node
write. This is a bit of prep work.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:55 -04:00