linux-stable/fs
Zygo Blaxell fb8e1608b0 btrfs: fix stripe length calculation for non-zoned data chunk allocation
commit 8a540e990d upstream.

Commit f6fca3917b "btrfs: store chunk size in space-info struct"
broke data chunk allocations on non-zoned multi-device filesystems when
using default chunk_size.  Commit 5da431b71d "btrfs: fix the max chunk
size and stripe length calculation" partially fixed that, and this patch
completes the fix for that case.

After commit f6fca3917b and 5da431b71d, the sequence of events for
a data chunk allocation on a non-zoned filesystem is:

        1.  btrfs_create_chunk calls init_alloc_chunk_ctl, which copies
        space_info->chunk_size (default 10 GiB) to ctl->max_stripe_len
        unmodified.  Before f6fca3917b, ctl->max_stripe_len value was
        1 GiB for non-zoned data chunks and not configurable.

        2.  btrfs_create_chunk calls gather_device_info which consumes
        and produces more fields of chunk_ctl.

        3.  gather_device_info multiplies ctl->max_stripe_len by
        ctl->dev_stripes (which is 1 in all cases except dup)
        and calls find_free_dev_extent with that number as num_bytes.

        4.  find_free_dev_extent locates the first dev_extent hole on
        a device which is at least as large as num_bytes.  With default
        max_chunk_size from f6fca3917b, it finds the first hole which is
        longer than 10 GiB, or the largest hole if that hole is shorter
        than 10 GiB.  This is different from the pre-f6fca3917b4d
        behavior, where num_bytes is 1 GiB, and find_free_dev_extent
        may choose a different hole.

        5.  gather_device_info repeats step 4 with all devices to find
        the first or largest dev_extent hole that can be allocated on
        each device.

        6.  gather_device_info sorts the device list by the hole size
        on each device, using total unallocated space on each device to
        break ties, then returns to btrfs_create_chunk with the list.

        7.  btrfs_create_chunk calls decide_stripe_size_regular.

        8.  decide_stripe_size_regular finds the largest stripe_len that
        fits across the first nr_devs device dev_extent holes that were
        found by gather_device_info (and satisfies other constraints
        on stripe_len that are not relevant here).

        9.  decide_stripe_size_regular caps the length of the stripe it
        computed at 1 GiB.  This cap appeared in 5da431b71d to correct
        one of the other regressions introduced in f6fca3917b.

        10.  btrfs_create_chunk creates a new chunk with the above
        computed size and number of devices.

At step 4, gather_device_info() has found a location where stripe up to
10 GiB in length could be allocated on several devices, and selected
which devices should have a dev_extent allocated on them, but at step
9, only 1 GiB of the space that was found on each device can be used.
This mismatch causes new suboptimal chunk allocation cases that did not
occur in pre-f6fca3917b4d kernels.

Consider a filesystem using raid1 profile with 3 devices.  After some
balances, device 1 has 10x 1 GiB unallocated space, while devices 2
and 3 have 1x 10 GiB unallocated space, i.e. the same total amount of
space, but distributed across different numbers of dev_extent holes.
For visualization, let's ignore all the chunks that were allocated before
this point, and focus on the remaining holes:

        Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10x 1 GiB unallocated)
        Device 2:  [__________] (10 GiB contig unallocated)
        Device 3:  [__________] (10 GiB contig unallocated)

Before f6fca3917b, the allocator would fill these optimally by
allocating chunks with dev_extents on devices 1 and 2 ([12]), 1 and 3
([13]), or 2 and 3 ([23]):

        [after 0 chunk allocations]
        Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
        Device 2:  [__________] (10 GiB)
        Device 3:  [__________] (10 GiB)

        [after 1 chunk allocation]
        Device 1:  [12] [_] [_] [_] [_] [_] [_] [_] [_] [_]
        Device 2:  [12] [_________] (9 GiB)
        Device 3:  [__________] (10 GiB)

        [after 2 chunk allocations]
        Device 1:  [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
        Device 2:  [12] [_________] (9 GiB)
        Device 3:  [13] [_________] (9 GiB)

        [after 3 chunk allocations]
        Device 1:  [12] [13] [12] [_] [_] [_] [_] [_] [_] [_] (7 GiB)
        Device 2:  [12] [12] [________] (8 GiB)
        Device 3:  [13] [_________] (9 GiB)

        [...]

        [after 12 chunk allocations]
        Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [_] [_] (2 GiB)
        Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [__] (2 GiB)
        Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)

        [after 13 chunk allocations]
        Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [12] [_] (1 GiB)
        Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
        Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)

        [after 14 chunk allocations]
        Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
        Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
        Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [13] [_] (1 GiB)

        [after 15 chunk allocations]
        Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
        Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [12] [23] (full)
        Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [13] [23] (full)

This allocates all of the space with no waste.  The sorting function used
by gather_device_info considers free space holes above 1 GiB in length
to be equal to 1 GiB, so once find_free_dev_extent locates a sufficiently
long hole on each device, all the holes appear equal in the sort, and the
comparison falls back to sorting devices by total free space.  This keeps
usable space on each device equal so they can all be filled completely.

After f6fca3917b, the allocator prefers the devices with larger holes
over the devices with more free space, so it makes bad allocation choices:

        [after 1 chunk allocation]
        Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
        Device 2:  [23] [_________] (9 GiB)
        Device 3:  [23] [_________] (9 GiB)

        [after 2 chunk allocations]
        Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
        Device 2:  [23] [23] [________] (8 GiB)
        Device 3:  [23] [23] [________] (8 GiB)

        [after 3 chunk allocations]
        Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
        Device 2:  [23] [23] [23] [_______] (7 GiB)
        Device 3:  [23] [23] [23] [_______] (7 GiB)

        [...]

        [after 9 chunk allocations]
        Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
        Device 2:  [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
        Device 3:  [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)

        [after 10 chunk allocations]
        Device 1:  [12] [_] [_] [_] [_] [_] [_] [_] [_] [_] (9 GiB)
        Device 2:  [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
        Device 3:  [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)

        [after 11 chunk allocations]
        Device 1:  [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
        Device 2:  [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
        Device 3:  [23] [23] [23] [23] [23] [23] [23] [23] [13] (full)

No further allocations are possible, with 8 GiB wasted (4 GiB of data
space).  The sort in gather_device_info now considers free space in
holes longer than 1 GiB to be distinct, so it will prefer devices 2 and
3 over device 1 until all but 1 GiB is allocated on devices 2 and 3.
At that point, with only 1 GiB unallocated on every device, the largest
hole length on each device is equal at 1 GiB, so the sort finally moves
to ordering the devices with the most free space, but by this time it
is too late to make use of the free space on device 1.

Note that it's possible to contrive a case where the pre-f6fca3917b4d
allocator fails the same way, but these cases generally have extensive
dev_extent fragmentation as a precondition (e.g. many holes of 768M
in length on one device, and few holes 1 GiB in length on the others).
With the regression in f6fca3917b, bad chunk allocation can occur even
under optimal conditions, when all dev_extent holes are exact multiples
of stripe_len in length, as in the example above.

Also note that post-f6fca3917b4d kernels do treat dev_extent holes
larger than 10 GiB as equal, so the bad behavior won't show up on a
freshly formatted filesystem; however, as the filesystem ages and fills
up, and holes ranging from 1 GiB to 10 GiB in size appear, the problem
can show up as a failure to balance after adding or removing devices,
or an unexpected shortfall in available space due to unequal allocation.

To fix the regression and make data chunk allocation work
again, set ctl->max_stripe_len back to the original SZ_1G, or
space_info->chunk_size if that's smaller (the latter can happen if the
user set space_info->chunk_size to less than 1 GiB via sysfs, or it's
a 32 MiB system chunk with a hardcoded chunk_size and stripe_len).

While researching the background of the earlier commits, I found that an
identical fix was already proposed at:

  https://lore.kernel.org/linux-btrfs/de83ac46-a4a3-88d3-85ce-255b7abc5249@gmx.com/

The previous review missed one detail:  ctl->max_stripe_len is used
before decide_stripe_size_regular() is called, when it is too late for
the changes in that function to have any effect.  ctl->max_stripe_len is
not used directly by decide_stripe_size_regular(), but the parameter
does heavily influence the per-device free space data presented to
the function.

Fixes: f6fca3917b ("btrfs: store chunk size in space-info struct")
CC: stable@vger.kernel.org # 6.1+
Link: https://lore.kernel.org/linux-btrfs/20231007051421.19657-1-ce3g8jdj@umail.furryterror.org/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25 12:03:04 +02:00
..
9p use less confusing names for iov_iter direction initializers 2023-02-09 11:28:04 +01:00
adfs
affs affs: initialize fsdata in affs_truncate() 2023-02-01 08:34:08 +01:00
afs afs: Fix accidental truncation when storing data 2023-07-19 16:22:06 +02:00
autofs autofs: fix memory leak of waitqueues in autofs_catatonic_mode 2023-09-23 11:10:59 +02:00
befs befs: Convert befs_symlink_read_folio() to use a folio 2022-08-02 12:34:03 -04:00
bfs
btrfs btrfs: fix stripe length calculation for non-zoned data chunk allocation 2023-10-25 12:03:04 +02:00
cachefiles cachefiles: use vfs_tmpfile_open() helper 2022-09-24 07:00:00 +02:00
ceph ceph: fix type promotion bug on 32bit systems 2023-10-19 23:08:57 +02:00
coda coda: Avoid partial allocation of sig_inputArgs 2023-03-10 09:33:52 +01:00
configfs configfs: fix possible memory leak in configfs_create_dir() 2022-12-31 13:32:22 +01:00
cramfs fs/cramfs/inode.c: initialize file_ra_state 2023-03-10 09:34:09 +01:00
crypto blk-crypto: add a blk_crypto_config_supported_natively helper 2023-05-11 23:03:00 +09:00
debugfs debugfs: fix error when writing negative value to atomic_t debugfs file 2022-12-31 13:31:58 +01:00
devpts
dlm dlm: fix plock lookup when using multiple lockspaces 2023-09-13 09:43:02 +02:00
ecryptfs whack-a-mole: constifying struct path * 2022-10-06 17:31:02 -07:00
efivarfs efi: efivars: Fix variable writes without query_variable_store() 2022-10-21 11:09:40 +02:00
efs
erofs erofs: fix memory leak of LZMA global compressed deduplication 2023-10-10 22:00:39 +02:00
exfat exfat: check if filename entries exceeds max filename length 2023-08-11 12:08:26 +02:00
exportfs Change calling conventions for filldir_t 2022-08-17 17:25:04 -04:00
ext2 ext2: fix datatype of block number in ext2_xattr_set2() 2023-09-23 11:11:05 +02:00
ext4 ext4: do not let fstrim block system suspend 2023-10-06 14:56:33 +02:00
f2fs f2fs: get out of a repeat loop when getting a locked data page 2023-10-06 14:56:44 +02:00
fat treewide: use get_random_u32() when possible 2022-10-11 17:42:58 -06:00
freevxfs freevxfs: Convert vxfs_immed_read_folio() to use a folio 2022-08-02 12:34:03 -04:00
fscache fscache: Use clear_and_wake_up_bit() in fscache_create_volume_work() 2023-02-22 12:59:43 +01:00
fuse fuse: nlookup missing decrement in fuse_direntplus_link 2023-09-19 12:28:05 +02:00
gfs2 gfs2: low-memory forced flush fixes 2023-09-19 12:27:58 +02:00
hfs hfs: fix missing hfs_bnode_get() in __hfs_bnode_create 2023-03-10 09:34:07 +01:00
hfsplus fs: hfsplus: remove WARN_ON() from hfsplus_cat_{read,write}_inode() 2023-05-24 17:32:34 +01:00
hostfs hostfs: move from strlcpy with unused retval to strscpy 2022-09-19 22:46:25 +02:00
hpfs
hugetlbfs hugetlbfs: fix null-ptr-deref in hugetlbfs_parse_param() 2022-12-31 13:33:05 +01:00
iomap iomap: Remove large folio handling in iomap_invalidate_folio() 2023-09-13 09:42:27 +02:00
isofs - hfs and hfsplus kmap API modernization from Fabio Francesco 2022-10-12 11:00:22 -07:00
jbd2 jbd2: correct the end of the journal recovery scan range 2023-09-19 12:28:05 +02:00
jffs2 jffs2: reduce stack usage in jffs2_build_xattr_subsystem() 2023-07-19 16:22:11 +02:00
jfs jfs: fix invalid free of JFS_IP(ipimap)->i_imap in diUnmount 2023-09-23 11:11:05 +02:00
kernfs kernfs: fix missing kernfs_idr_lock to remove an ID from the IDR 2023-07-19 16:21:53 +02:00
lockd fs: lockd: avoid possible wrong NULL parameter 2023-09-13 09:42:49 +02:00
minix vfs: open inside ->tmpfile() 2022-09-24 07:00:00 +02:00
netfs netfs: Only call folio_start_fscache() one time for each folio 2023-10-06 14:56:32 +02:00
nfs Revert "NFS: Fix error handling for O_DIRECT write scheduling" 2023-10-15 18:32:41 +02:00
nfs_common
nfsd nfsd: fix change_info in NFSv4 RENAME replies 2023-09-23 11:11:11 +02:00
nilfs2 nilfs2: fix potential use after free in nilfs_gccache_submit_read_data() 2023-10-06 14:57:01 +02:00
nls fs/nls: make load_nls() take a const parameter 2023-09-13 09:42:22 +02:00
notify fanotify: disallow mount/sb marks on kernel internal pseudo fs 2023-07-19 16:22:05 +02:00
ntfs - hfs and hfsplus kmap API modernization from Fabio Francesco 2022-10-12 11:00:22 -07:00
ntfs3 fs/ntfs3: Mark ntfs dirty when on-disk struct is corrupted 2023-08-23 17:52:26 +02:00
ocfs2 fs: ocfs2: namei: check return value of ocfs2_add_entry() 2023-09-13 09:42:33 +02:00
omfs
openpromfs
orangefs use less confusing names for iov_iter direction initializers 2023-02-09 11:28:04 +01:00
overlayfs ovl: fix incorrect fdput() on aio completion 2023-09-23 11:11:10 +02:00
proc proc: nommu: fix empty /proc/<pid>/maps 2023-10-06 14:56:42 +02:00
pstore pstore/ram: Check start of empty przs during init 2023-09-13 09:43:03 +02:00
qnx4
qnx6 fs/qnx6: delete unnecessary checks before brelse() 2022-09-11 21:55:07 -07:00
quota quota: Fix slow quotaoff 2023-10-19 23:08:50 +02:00
ramfs shmem: use ramfs_kill_sb() for kill_sb method of ramfs-based tmpfs 2023-07-19 16:22:11 +02:00
reiserfs reiserfs: Check the return value from __getblk() 2023-09-13 09:42:27 +02:00
romfs
smb ksmbd: not allow to open file if delelete on close bit is set 2023-10-19 23:08:56 +02:00
squashfs revert "squashfs: harden sanity check in squashfs_read_xattr_id_table" 2023-02-22 12:59:50 +01:00
sysfs
sysv fs/sysv: Null check to prevent null-ptr-deref bug 2023-08-11 12:08:23 +02:00
tracefs tracefs: Add missing lockdown check to tracefs_create_dir() 2023-09-23 11:11:12 +02:00
ubifs ubifs: Fix memory leak in do_rename 2023-05-11 23:03:05 +09:00
udf udf: initialize newblock to 0 2023-09-13 09:43:05 +02:00
ufs ufs: replace ll_rw_block() 2022-09-11 20:26:07 -07:00
unicode
vboxsf
verity fsverity: skip PKCS#7 parser when keyring is empty 2023-09-13 09:43:03 +02:00
xfs xfs: fix xfs_inodegc_stop racing with mod_delayed_work 2023-07-19 16:22:15 +02:00
zonefs zonefs: Always invalidate last cached page on append write 2023-04-06 12:10:52 +02:00
aio.c aio: fix mremap after fork null-deref 2023-02-22 12:59:46 +01:00
anon_inodes.c dynamic_dname(): drop unused dentry argument 2022-08-20 11:34:04 -04:00
attr.c attr: block mode changes of symlinks 2023-09-23 11:11:10 +02:00
bad_inode.c vfs: open inside ->tmpfile() 2022-09-24 07:00:00 +02:00
binfmt_elf.c mm: always expand the stack with the mmap write lock held 2023-07-01 13:16:25 +02:00
binfmt_elf_fdpic.c fs: binfmt_elf_efpic: fix personality for ELF-FDPIC 2023-10-06 14:57:06 +02:00
binfmt_elf_test.c
binfmt_flat.c
binfmt_misc.c binfmt_misc: fix shift-out-of-bounds in check_special_flags 2022-12-31 13:32:57 +01:00
binfmt_script.c
buffer.c - hfs and hfsplus kmap API modernization from Fabio Francesco 2022-10-12 11:00:22 -07:00
char_dev.c chardev: fix error handling in cdev_device_add() 2022-12-31 13:32:41 +01:00
compat_binfmt_elf.c
coredump.c coredump: Move dump_emit_page() to kill unused warning 2023-02-22 12:59:50 +01:00
d_path.c d_path.c: typo fix... 2022-08-20 11:34:33 -04:00
dax.c Merge branch 'for-6.0/dax' into libnvdimm-fixes 2022-09-24 18:14:12 -07:00
dcache.c tmpfile API change 2022-10-10 19:45:17 -07:00
direct-io.c block: remove PSI accounting from the bio layer 2022-09-20 08:24:38 -06:00
drop_caches.c
eventfd.c eventfd: prevent underflow for eventfd semaphores 2023-09-13 09:42:27 +02:00
eventpoll.c epoll: ep_autoremove_wake_function should use list_del_init_careful 2023-06-21 16:00:54 +02:00
exec.c mm: always expand the stack with the mmap write lock held 2023-07-01 13:16:25 +02:00
fcntl.c keep iocb_flags() result cached in struct file 2022-06-10 16:10:23 -04:00
fhandle.c do_sys_name_to_handle(): constify path 2022-09-01 17:36:39 -04:00
file.c file: reinstate f_pos locking optimization for regular files 2023-08-11 12:08:23 +02:00
file_table.c locks: fix TOCTOU race when granting write lease 2022-08-16 10:59:54 -04:00
filesystems.c
fs-writeback.c writeback: fix call of incorrect macro 2023-05-17 11:53:33 +02:00
fs_context.c vfs, security: Fix automount superblock LSM init problem, preventing NFS sb sharing 2023-09-13 09:42:28 +02:00
fs_parser.c ext4: journal_path mount options should follow links 2023-01-07 11:11:59 +01:00
fs_pin.c
fs_struct.c
fs_types.c
fsopen.c
init.c
inode.c fs: Establish locking order for unrelated directories 2023-07-19 16:22:12 +02:00
internal.h nfs: use vfs setgid helper 2023-08-30 16:11:10 +02:00
ioctl.c
Kconfig smb: move client and server files to common directory fs/smb 2023-06-28 11:12:40 +02:00
Kconfig.binfmt Xtensa updates for v6.1 2022-10-10 14:21:11 -07:00
kernel_read_file.c fs/kernel_read_file: allow to read files up-to ssize_t 2022-06-16 19:58:21 -07:00
libfs.c libfs: add DEFINE_SIMPLE_ATTRIBUTE_SIGNED for signed value 2022-12-31 13:31:58 +01:00
locks.c locks: fix KASAN: use-after-free in trace_event_raw_event_filelock_lock 2023-09-23 11:11:00 +02:00
Makefile smb: move client and server files to common directory fs/smb 2023-06-28 11:12:40 +02:00
mbcache.c ext4: fix deadlock due to mbcache entry corruption 2023-01-07 11:12:02 +01:00
mount.h switch try_to_unlazy_next() to __legitimize_mnt() 2022-07-05 16:18:21 -04:00
mpage.c Folio changes for 6.0 2022-08-03 10:35:43 -07:00
namei.c fs: Fix error checking for d_hash_and_lookup() 2023-09-13 09:42:27 +02:00
namespace.c fs: drop peer group ids under namespace lock 2023-04-13 16:55:33 +02:00
no-block.c
nsfs.c dynamic_dname(): drop unused dentry argument 2022-08-20 11:34:04 -04:00
open.c open: make RESOLVE_CACHED correctly test for O_TMPFILE 2023-08-11 12:08:22 +02:00
pipe.c dynamic_dname(): drop unused dentry argument 2022-08-20 11:34:04 -04:00
pnode.c pnode: terminate at peers of source 2023-01-04 11:29:01 +01:00
pnode.h
posix_acl.c - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in 2022-10-10 17:53:04 -07:00
proc_namespace.c vfs: escape hash as well 2022-06-28 13:58:05 -04:00
read_write.c use less confusing names for iov_iter direction initializers 2023-02-09 11:28:04 +01:00
readdir.c Change calling conventions for filldir_t 2022-08-17 17:25:04 -04:00
remap_range.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
select.c
seq_file.c use less confusing names for iov_iter direction initializers 2023-02-09 11:28:04 +01:00
signalfd.c
splice.c use less confusing names for iov_iter direction initializers 2023-02-09 11:28:04 +01:00
stack.c
stat.c vfs: support STATX_DIOALIGN on block devices 2022-09-11 19:47:12 -05:00
statfs.c statfs: enforce statfs[64] structure initialization 2023-05-24 17:32:51 +01:00
super.c fs: Protect reconfiguration of sb read-write from racing writes 2023-08-11 12:08:24 +02:00
sync.c
sysctls.c
timerfd.c
userfaultfd.c Revert "userfaultfd: don't fail on unrecognized features" 2023-04-26 14:28:37 +02:00
utimes.c
xattr.c fs: don't audit the capability check in simple_xattr_list() 2022-12-31 13:31:55 +01:00