linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-09-22 02:20:40 +00:00

Author	SHA1	Message	Date
Kent Overstreet	1fb4fe6317	six locks: Kill six_lock_state union As suggested by Linus, this drops the six_lock_state union in favor of raw bitmasks. On the one hand, bitfields give more type-level structure to the code. However, a significant amount of the code was working with six_lock_state as a u64/atomic64_t, and the conversions from the bitfields to the u64 were deemed a bit too out-there. More significantly, because bitfield order is poorly defined (#ifdef __LITTLE_ENDIAN_BITFIELD can be used, but is gross), incrementing the sequence number would overflow into the rest of the bitfield if the compiler didn't put the sequence number at the high end of the word. The new code is a bit saner when we're on an architecture without real atomic64_t support - all accesses to lock->state now go through atomic64_*() operations. On architectures with real atomic64_t support, we additionally use atomic bit ops for setting/clearing individual bits. Text size: 7467 bytes -> 4649 bytes - compilers still suck at bitfields. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:02 -04:00
Kent Overstreet	c4bd3491b1	six locks: Simplify dispatch Originally, we used inlining/flattening to cause the compiler to generate different versions of lock/trylock/relock/unlock for each lock type - read, intent, and write. This made the individual functions smaller and let the compiler eliminate table lookups: however, as the code has gotten more complicated these optimizations have gotten less worthwhile, and all the tricky inlining and dispatching made the code less readable. Text size: 11015 bytes -> 7467 bytes, and benchmarks show no loss of performance. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:02 -04:00
Kent Overstreet	d2c86b77de	six locks: Centralize setting of waiting bit Originally, the waiting bit was always set by trylock() on failure: however, it's now set by __six_lock_type_slowpath(), with wait_lock held - which is the more correct place to do it. That made setting the waiting bit in trylock redundant, so this patch deletes that. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:02 -04:00
Kent Overstreet	0157f9c5a7	six locks: Remove hacks for percpu mode lost wakeup The lost wakeup bug hasn't been observed in a while, and we're trying to provoke it and determine if it still exists. This patch removes some defenses that were added to attempt to track it down; if it still exists, this should make it easier to see it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	0d2234a79e	six locks: Kill six_lock_pcpu_(alloc|free) six_lock_pcpu_alloc() is an unsafe interface: it's not safe to allocate or free the percpu reader count on an existing lock that's in use; the only safe time to allocate percpu readers is when the lock is first being initialized. This patch adds a flags parameter to six_lock_init(), and instead of six_lock_pcpu_free() we now expose six_lock_exit(), which does the same thing but is less likely to be misused. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	01bf56a977	six locks: six_lock_readers_add() This moves a helper out of the bcachefs code that shouldn't have been there, since it touches six lock internals. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	f375d6ca58	bcachefs: Don't call local_clock() twice in trans_begin() local_clock() is not as cheap as we'd like it to be, alas Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	962210b281	bcachefs: Fix a buffer overrun in bch2_fs_usage_read() We were copying the size of a struct bch_fs_usage_online to a struct bch_fs_usage, which is 8 bytes smaller. This adds some new helpers so we can do this correctly, and get rid of some magic +1s too. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	0b438c5bfa	bcachefs: Clear btree_node_just_written() when node reused or evicted This fixes the following bug: Journal reclaim attempts to flush a node, but races with the node being evicted from the btree node cache; when we lock the node, the data buffers have already been freed. We don't evict a node that's dirty, so calling btree_node_write() is fine - it's a noop - except that the btree_node_just_written bit causes bch2_btree_post_write_cleanup() to run (resorting the node), which then causes a null ptr deref. 00078 Unable to handle kernel NULL pointer dereference at virtual address 000000000000009e 00078 Mem abort info: 00078 ESR = 0x0000000096000005 00078 EC = 0x25: DABT (current EL), IL = 32 bits 00078 SET = 0, FnV = 0 00078 EA = 0, S1PTW = 0 00078 FSC = 0x05: level 1 translation fault 00078 Data abort info: 00078 ISV = 0, ISS = 0x00000005 00078 CM = 0, WnR = 0 00078 user pgtable: 4k pages, 39-bit VAs, pgdp=000000007ed64000 00078 [000000000000009e] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000 00078 Internal error: Oops: 0000000096000005 [#1] SMP 00078 Modules linked in: 00078 CPU: 75 PID: 1170 Comm: stress-ng-utime Not tainted 6.3.0-ktest-g5ef5b466e77e #2078 00078 Hardware name: linux,dummy-virt (DT) 00078 pstate: 80001005 (Nzcv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--) 00078 pc : btree_node_sort+0xc4/0x568 00078 lr : bch2_btree_post_write_cleanup+0x6c/0x1c0 00078 sp : ffffff803e30b350 00078 x29: ffffff803e30b350 x28: 0000000000000001 x27: ffffff80076e52a8 00078 x26: 0000000000000002 x25: 0000000000000000 x24: ffffffc00912e000 00078 x23: ffffff80076e52a8 x22: 0000000000000000 x21: ffffff80076e52bc 00078 x20: ffffff80076e5200 x19: 0000000000000000 x18: 0000000000000000 00078 x17: fffffffff8000000 x16: 0000000008000000 x15: 0000000008000000 00078 x14: 0000000000000002 x13: 0000000000000000 x12: 00000000000000a0 00078 x11: ffffff803e30b400 x10: ffffff803e30b408 x9 : 0000000000000001 00078 x8 : 0000000000000000 x7 : ffffff803e480000 x6 : 00000000000000a0 00078 x5 : 0000000000000088 x4 : 0000000000000000 x3 : 0000000000000010 00078 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff80076e52a8 00078 Call trace: 00078 btree_node_sort+0xc4/0x568 00078 bch2_btree_post_write_cleanup+0x6c/0x1c0 00078 bch2_btree_node_write+0x108/0x148 00078 __btree_node_flush+0x104/0x160 00078 bch2_btree_node_flush0+0x1c/0x30 00078 journal_flush_pins.constprop.0+0x184/0x2d0 00078 __bch2_journal_reclaim+0x4d4/0x508 00078 bch2_journal_reclaim+0x1c/0x30 00078 __bch2_journal_preres_get+0x244/0x268 00078 bch2_trans_journal_preres_get_cold+0xa4/0x180 00078 __bch2_trans_commit+0x61c/0x1bb0 00078 bch2_setattr_nonsize+0x254/0x318 00078 bch2_setattr+0x5c/0x78 00078 notify_change+0x2bc/0x408 00078 vfs_utimes+0x11c/0x218 00078 do_utimes+0x84/0x140 00078 __arm64_sys_utimensat+0x68/0xa8 00078 invoke_syscall.constprop.0+0x54/0xf0 00078 do_el0_svc+0x48/0xd8 00078 el0_svc+0x14/0x48 00078 el0t_64_sync_handler+0xb0/0xb8 00078 el0t_64_sync+0x14c/0x150 00078 Code: 8b050265 910020c6 8b060266 910060ac (79402cad) 00078 ---[ end trace 0000000000000000 ]--- Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	faa62a2036	bcachefs: alloc_v4_u64s() fix With the recent bkey_ops.min_val_size addition, bkey values are automatically extended to the size of the current version. The check in bch2_alloc_v4_invalid() needs to be updated to take this into account. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	a49bd8c007	bcachefs: Delete an incorrect bch2_trans_unlock() This deletes a bch2_trans_unlock() call from __bch2_move_data(). It was redundant; bch2_move_extent() has the correct unlock call, and it was buggy because when move_extent calls bch2_extent_drop_ptrs() we don't want the transaction to be unlocked yet - this fixes a btree_iter.c assertion. Fixes https://github.com/koverstreet/bcachefs/issues/511. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	d598a9b7e2	bcachefs: Use memcpy_u64s_small() for copying keys Small performance optimization; an open coded loop is better than rep ; movsq for small copies. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	73da30e8e0	bcachefs: Fix check_overlapping_extents() An error check had a flipped conditional - whoops. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	4a2e5d7ba5	bcachefs: Replace a BUG_ON() with fatal error A user hit this BUG_ON() - it's unclear how it happened, so replace it with a fatal error that will cause us to go read only, and print out more information. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	92e637cef4	bcachefs: Delete some dead code in bch2_replicas_gc_end() bch2_replicas_gc_(start|end) is now only used for journal replicas entries, which don't have bucket sector counts - so this code is entirely dead and can be deleted. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Brian Foster	a7b29b8d9a	bcachefs: mark journal replicas before journal write submission The journal write submission path marks the associated replica entries for journal data in journal_write_done(), which is just after journal write bio submission. This creates a small window where journal entries might have been written out, but the associated replica is not marked such that recovery does not know that the associated device contains journal data. Move the replica marking a bit earlier in the write path such that recovery is guaranteed to recognize that the device contains journal data in the event of a crash. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	38e3d93fa1	bcachefs: Improved comment for bch2_replicas_gc2() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	cb1b479dc1	bcachefs: Fix quotas + snapshots Now that we can reliably designate and find the master subvolume out of a tree of snapshots, we can finally make quotas work with snapshots: That is - quotas will now _ignore_ snapshot subvolumes, and only be in effect for the master (non snapshot) subvolume. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	653693beea	bcachefs: Add otime, parent to bch_subvolume Add two new fields to bch_subvolume: - otime: creation time - parent: For snapshots, this is the id of the subvolume the snapshot was created from Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	1c59b483a3	bcachefs: BTREE_ID_snapshot_tree This adds a new btree which gets us a persistent per-snapshot-tree identifier. - BTREE_ID_snapshot_trees - KEY_TYPE_snapshot_tree - bch_snapshot now has a field that points to a snapshot_tree This is going to be used to designate one snapshot ID/subvolume out of a given tree of snapshots as the "main" subvolume, so that we can do quota accounting in that subvolume and not the rest. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	51e84d3bbf	bcachefs: bch2_bkey_get_empty_slot() Add a new helper for allocating a new slot in a btree. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	dbda63bbb0	bcachefs: bch2_bkey_make_mut() now calls bch2_trans_update() It's safe to call bch2_trans_update with a k/v pair where the value hasn't been filled out, as long as the key part has been and the value is filled out by transaction commit time. This patch folds the bch2_trans_update() call into bch2_bkey_make_mut(), eliminating a bit of boilerplate. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	f12a798a89	bcachefs: bch2_bkey_get_mut() now calls bch2_trans_update() It's safe to call bch2_trans_update with a k/v pair where the value hasn't been filled out, as long as the key part has been and the value is filled out by transaction commit time. This patch folds the bch2_trans_update() call into bch2_bkey_get_mut(), eliminating a bit of boilerplate. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	f8cb35fda1	bcachefs: bch2_bkey_alloc() now calls bch2_trans_update() It's safe to call bch2_trans_update with a k/v pair where the value hasn't been filled out, as long as the key part has been and the value is filled out by transaction commit time. This patch folds the bch2_trans_update() call into bch2_bkey_alloc(), eliminating a bit of boilerplate. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	34dfa5db19	bcachefs: bch2_bkey_get_mut() improvements - bch2_bkey_get_mut() now handles types increasing in size, allocating a buffer for the type's current size when necessary - bch2_bkey_make_mut_typed() - bch2_bkey_get_mut() now initializes the iterator, like bch2_bkey_get_iter() Also, refactor so that most of the code is in functions - now macros are only used for wrappers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	d67a16df9c	bcachefs: Move bch2_bkey_make_mut() to btree_update.h It's for doing updates - this is where it belongs, and next patches will be changing these helpers to use items from btree_update.h. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:01 -04:00
Kent Overstreet	bcb79a51cb	bcachefs: bch2_bkey_get_iter() helpers Introduce new helpers for a common pattern: bch2_trans_iter_init(); bch2_btree_iter_peek_slot(); - bch2_bkey_get_iter_type() returns -ENOENT if it doesn't find a key of the correct type - bch2_bkey_get_val_typed() copies the val out of the btree to a (typically stack allocated) variable; it handles the case where the value in the btree is smaller than the current version of the type, zeroing out the remainder. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Kent Overstreet	174f930b8e	bcachefs: bkey_ops.min_val_size This adds a new field to bkey_ops for the minimum size of the value, which standardizes that check and also enforces the new rule (previously done somewhat ad-hoc) that we can extend value types by adding new fields on to the end. To make that work we do _not_ initialize min_val_size with sizeof, instead we initialize it to the size of the first version of those values. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Kent Overstreet	ab158fce47	bcachefs: Converting to typed bkeys is now allowed for err, null ptrs Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Kent Overstreet	95b595a5fc	bcachefs: Btree iterator, update flags no longer conflict Change btree_update_flags to start after the last btree iterator flag, so that we can pass both in the same flags argument. This is needed for the upcoming bch2_bkey_get_mut() helper. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Brian Foster	0a23574ebb	bcachefs: remove unused key cache coherency flag Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Brian Foster	3c434cdff0	bcachefs: fix accounting corruption race between reclaim and dev add When a device is removed from a bcachefs volume, the associated content is removed from the various btrees. The alloc tree uses the key cache, so when keys are removed the deletes exist in cache for a period of time until reclaim comes along and flushes outstanding updates. When a device is re-added to the bcachefs volume, the add process re-adds some of these previously deleted keys. When marking device superblock locations on device add, the keys will likely refer to some of the same alloc keys that were just removed. The memory triggers for these key updates are responsible for further updates, such as bch2_mark_alloc() calling into bch2_dev_usage_update() to update per-device usage accounting. When a new key is added to key cache, the trans update path also flushes the key to the backing btree for coherency reasons for tree walks. With all of this context, if a device is removed and re-added quickly enough such that some key deletes from the remove are still pending a key cache flush, the trans update path can view this as addition of a new key because the old key in the insert entry refers to a deleted key. However the deleted cached key has not been filled by absence of a btree key, but rather refers to an explicit deletion of an existing key that occurred during device removal. The trans update path adds a new update to flush the key and tags the original (cached) update to skip running the memory triggers. This results in running triggers on the non-cached update instead, which in turn will perform accounting updates based on incoherent values. For example, bch2_dev_usage_update() subtracts the old alloc key dirty sector count in the non-cached btree key from the newly initialized (i.e. zeroed) per device counters, leading to underflow and accounting corruption. There are at least a few ways to avoid this problem, the simplest of which may be to run triggers against the cached update rather than the non-cached update. If the key only needs to be flushed when the key is not present in the tree, however, then this still performs an unnecessary update. We could potentially use the cached key dirty state to determine whether the delete is a dirty, cached update vs. a clean cache fill, but this may require transmitting key cache dirty state across layers, which adds complexity and seems to be of limited value. Instead, update flush_new_cached_update() to handle this by simply checking for the key in the btree and only perform the flush when a backing key is not present. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Kent Overstreet	958c347b4b	bcachefs: Mark bch2_copygc() noinline This works around a "stack frame too large" error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Kent Overstreet	3140a3d0e9	bcachefs: Delete obsolete btree ptr check This patch deletes a .key_invalid check for btree pointers that only applies to _very_ old on disk format versions, and potentially complicates the upgrade process. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Kent Overstreet	6b52bcde4a	bcachefs: Always run topology error when CONFIG_BCACHEFS_DEBUG=y Improved test coverage. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Kent Overstreet	a0668d77f0	bcachefs: Fix a userspace build error Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Kent Overstreet	c8d5b71411	bcachefs: Make sure hash info gets initialized in fsck We had some bugs with setting/using first_this_inode in the inode walker in the dirents/xattr code. This patch changes to not clear first_this_inode until after initializing the new hash info. Also, we fix an error message to not print on transaction restart, and add a comment to related fsck error code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Kent Overstreet	1af5227c1d	bcachefs: Kill bch2_verify_bucket_evacuated() With backpointers, it's now impossible for bch2_evacuate_bucket() to be completely reliable: it can race with an extent being partially overwritten or split, which needs a new write buffer flush for the backpointer to be seen. This shouldn't be a real issue in practice; the previous patch added a new tracepoint so we'll be able to see more easily if it is. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Kent Overstreet	5a21764db1	bcachefs: Improve move path tracepoints Move path tracepoints now include the key being moved. Also, add new tracepoints for the start of move_extent, and evacuate_bucket. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Kent Overstreet	09ebfa6113	bcachefs: Drop a redundant error message When we're already read-only, we don't need to print out errors from writing btree nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Brian Foster	02d51bb9a7	bcachefs: remove bucket_gens btree keys on device removal If a device has keys in the bucket_gens btree associated with its buckets and is removed from a bcachefs volume, fsck will complain about the presence of keys associated with an invalid device index. A repair removes the associated keys and restores correctness. Update bch2_dev_remove_alloc() to remove device related keys at device removal time to avoid the problem. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Brian Foster	251babb55d	bcachefs: fix NULL bch_dev deref when checking bucket_gens keys fsck removes bucket_gens keys for devices that do not exist in the volume (i.e., if the device was removed). In 'fsck -n' mode, the associated fsck_err_on() wrapper returns false to skip the key removal. This proceeds on to the rest of the function, which eventually segfaults on a NULL bch_dev because the device does not exist. Update bch2_check_bucket_gens_key() to skip out of the rest of the function when the associated device does not exist, regardless of running fsck in check or repair mode. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Brian Foster	bf98ee10d4	bcachefs: folio pos to bch_folio_sector index helper Create a small helper to translate from file offset to the associated bch_folio_sector index in the underlying bch_folio. The helper assumes the file offset is covered by the passed folio. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Kent Overstreet	e3dc75eb55	bcachefs: Fix a null ptr deref in fsck check_extents() It turns out, in rare situations we need to be passing in a disk reservation, which will be used internally by the transaction commit path when needed. Pass one in... Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Kent Overstreet	615fccada5	bcachefs: Fix a slab-out-of-bounds In __bch2_alloc_to_v4_mut(), we overrun the buffer we allocate if the alloc key had backpointers stored in it (which we no longer support). Fix this with a max() call. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Kent Overstreet	853b7393c2	bcachefs: Allow answering y or n to all fsck errors of given type This changes the ask_yn() function used by fsck to accept Y or N, meaning yes or no for all errors of a given type. With this, the user can be prompted only for distinct error types - useful when a filesystem has lots of errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Brian Foster	6b9857b208	bcachefs: use u64 for folio end pos to avoid overflows Some of the folio_end_() helpers are prone to overflow of signed 64-bit types because the mapping is only limited by the max value of loff_t and the associated helpers return the start offset of the next folio. Therefore, a folio_end_pos() of the max allowable folio in a mapping returns a value that overflows loff_t. This makes it hard to rely on such values when doing folio processing across a range of a file, as bcachefs attempts to do with the recent folio changes. For example, generic/564 causes problems in the buffered write path when testing writes at max boundary conditions. The current understanding is that the pagecache historically limited the mapping to one less page to avoid this problem and this was dropped with some of the folio conversions, but may be reinstated to properly address the problem. In the meantime, update the internal folio_end_() helpers in bcachefs to return a u64, and all of the associated code to use or cast to u64 to avoid overflow problems. This allows generic/564 to pass and can be reverted back to using loff_t if at any point the pagecache subsystem can guarantee these boundary conditions will not overflow. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Brian Foster	335f7d4f22	bcachefs: clean up post-eof folios on -ENOSPC The buffered write path batches folio creations in the file mapping based on the requested size of the write. Under low free space conditions, it is possible to add a bunch of folios to the mapping and then return a short write or -ENOSPC due to lack of space. If this occurs on an extending write, the file size is updated based on the amount of data successfully written to the file. If folios were added beyond the final i_size, they may hang around until reclaimed, truncated or encountered unexpectedly by another operation. For example, generic/083 reproduces a sequence of events where a short write leaves around one or more post-EOF folios on an inode, a subsequent zero range request extends beyond i_size and overlaps with an aforementioned folio, and __bch2_truncate_folio() happens across it and complains. Update __bch2_buffered_write() to keep track of the start offset of the last folio added to the mapping for a prospective write. After i_size is updated, check whether this offset starts beyond EOF. If so, truncate pagecache beyond the latest EOF to clean up any folios that don't reside at least partially within EOF upon completion of the write. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Brian Foster	4ad6aa46e1	bcachefs: fix truncate overflow if folio is beyond EOF generic/083 occasionally reproduces a panic caused by an overflow when accessing the bch_folio_sector array of the folio being processed by __bch2_truncate_folio(). The immediate cause of the overflow is that the folio offset is beyond i_size, and therefore the sector index calculation underflows on subtraction of the folio offset. One cause of this is mainly observed on nocow mounts. When nocow is enabled, fallocate performs physical block allocation (as opposed to block reservation in cow mode), which range_has_data() then interprets as valid data that requires partial zeroing on truncate. Therefore, if a post-eof zero range request lands across post-eof preallocated blocks, __bch2_truncate_folio() may actually create a post-eof folio in order to perform zeroing. To avoid this problem, update range_has_data() to filter out unwritten blocks from folio creation and partial zeroing. Even though we should never create folios beyond EOF like this, the mere existence of such folios is not necessarily a fatal error. Fix up the truncate code to warn about this condition and not overflow the sector array and possibly crash the system. The addition of this warning without the corresponding unwritten extent fix has shown that various other fstests are able to reproduce this problem fairly frequently, but often in ways that don't necessarily result in a kernel panic or a change in user observable behavior, and therefore the problem goes undetected. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:59 -04:00
Kent Overstreet	550a6a496d	bcachefs: Enable large folios Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:59 -04:00
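
The "six locks: Kill six_lock_state union" entry above notes that incrementing the sequence number would overflow into the rest of the state word unless the sequence number sits at the high end. A minimal sketch of that point with raw bitmasks follows. The 32-bit split and every `demo_` / `DEMO_` name are assumptions made for illustration; they are not the kernel's actual six lock layout or API.

```c
#include <stdint.h>

/* Hypothetical layout for illustration only: a 32-bit sequence number
 * packed into one 64-bit state word next to 32 bits of other state. */
#define DEMO_SEQ_SHIFT	32
#define DEMO_SEQ_ONE	(UINT64_C(1) << DEMO_SEQ_SHIFT)
#define DEMO_LOW_MASK	(DEMO_SEQ_ONE - 1)

/* Sequence number in the high 32 bits: when it wraps, the carry falls
 * off the top of the word and the low 32 bits are left untouched. */
static uint64_t demo_seq_inc_high(uint64_t state)
{
	return state + DEMO_SEQ_ONE;
}

/* Contrast case, sequence number in the low 32 bits: when it wraps,
 * the carry lands in the neighbouring high bits and corrupts them. */
static uint64_t demo_seq_inc_low(uint64_t state)
{
	return state + 1;
}

/* Accessors for the "sequence high" layout. */
static uint32_t demo_seq_high(uint64_t state)
{
	return (uint32_t)(state >> DEMO_SEQ_SHIFT);
}

static uint64_t demo_other_bits(uint64_t state)
{
	return state & DEMO_LOW_MASK;
}
```

With explicit masks the position of the sequence number is fixed by the source rather than by the compiler's bitfield ordering, which is the property the commit message relies on.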


1217586 commits