linux-stable/fs
Omar Sandoval 3d08c52ba1 btrfs: fix crash on racing fsync and size-extending write into prealloc
commit 9d274c19a7 upstream.

We have been seeing crashes on duplicate keys in
btrfs_set_item_key_safe():

  BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192)
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.c:2620!
  invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
  RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs]

With the following stack trace:

  #0  btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4)
  #1  btrfs_drop_extents (fs/btrfs/file.c:411:4)
  #2  log_one_extent (fs/btrfs/tree-log.c:4732:9)
  #3  btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9)
  #4  btrfs_log_inode (fs/btrfs/tree-log.c:6626:9)
  #5  btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8)
  #6  btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8)
  #7  btrfs_sync_file (fs/btrfs/file.c:1933:8)
  #8  vfs_fsync_range (fs/sync.c:188:9)
  #9  vfs_fsync (fs/sync.c:202:9)
  #10 do_fsync (fs/sync.c:212:9)
  #11 __do_sys_fdatasync (fs/sync.c:225:9)
  #12 __se_sys_fdatasync (fs/sync.c:223:1)
  #13 __x64_sys_fdatasync (fs/sync.c:223:1)
  #14 do_syscall_x64 (arch/x86/entry/common.c:52:14)
  #15 do_syscall_64 (arch/x86/entry/common.c:83:7)
  #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)

So we're logging a changed extent from fsync, which is splitting an
extent in the log tree. But this split part already exists in the tree,
triggering the BUG().

This is the state of the log tree at the time of the crash, dumped with
drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py)
to get more details than btrfs_print_leaf() gives us:

  >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
  leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
  leaf 33439744 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
          item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
                  generation 7 transid 9 size 8192 nbytes 8473563889606862198
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 204 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417704.983333333 (2024-05-22 15:41:44)
                  mtime 1716417704.983333333 (2024-05-22 15:41:44)
                  otime 17592186044416.000000000 (559444-03-08 01:40:16)
          item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
                  index 195 namelen 3 name: 193
          item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 4096 ram 12288
                  extent compression 0 (none)
          item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 4096 nr 8192
          item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096
  ...

So the real problem happened earlier: notice that items 4 (4k-12k) and 5
(8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and
item 5 starts at i_size.

Here is the state of the filesystem tree at the time of the crash:

  >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root
  >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0))
  >>> print_extent_buffer(nodes[0])
  leaf 30425088 level 0 items 184 generation 9 owner 5
  leaf 30425088 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
  	...
          item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160
                  generation 7 transid 7 size 4096 nbytes 12288
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 6 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417703.220000000 (2024-05-22 15:41:43)
                  mtime 1716417703.220000000 (2024-05-22 15:41:43)
                  otime 1716417703.220000000 (2024-05-22 15:41:43)
          item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13
                  index 195 namelen 3 name: 193
          item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 8192 ram 12288
                  extent compression 0 (none)
          item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096

Item 5 in the log tree corresponds to item 183 in the filesystem tree,
but nothing matches item 4. Furthermore, item 183 is the last item in
the leaf.

btrfs_log_prealloc_extents() is responsible for logging prealloc extents
beyond i_size. It first truncates any previously logged prealloc extents
that start beyond i_size. Then, it walks the filesystem tree and copies
the prealloc extent items to the log tree.

If it hits the end of a leaf, then it calls btrfs_next_leaf(), which
unlocks the tree and does another search. However, while the filesystem
tree is unlocked, an ordered extent completion may modify the tree. In
particular, it may insert an extent item that overlaps with an extent
item that was already copied to the log tree.

This may manifest in several ways depending on the exact scenario,
including an EEXIST error that is silently translated to a full sync,
overlapping items in the log tree, or this crash. This particular crash
is triggered by the following sequence of events:

- Initially, the file has i_size=4k, a regular extent from 0-4k, and a
  prealloc extent beyond i_size from 4k-12k. The prealloc extent item is
  the last item in its B-tree leaf.
- The file is fsync'd, which copies its inode item and both extent items
  to the log tree.
- An xattr is set on the file, which sets the
  BTRFS_INODE_COPY_EVERYTHING flag.
- The range 4k-8k in the file is written using direct I/O. i_size is
  extended to 8k, but the ordered extent is still in flight.
- The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this
  calls copy_inode_items_to_log(), which calls
  btrfs_log_prealloc_extents().
- btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the
  filesystem tree. Since it starts before i_size, it skips it. Since it
  is the last item in its B-tree leaf, it calls btrfs_next_leaf().
- btrfs_next_leaf() unlocks the path.
- The ordered extent completion runs, which converts the 4k-8k part of
  the prealloc extent to written and inserts the remaining prealloc part
  from 8k-12k.
- btrfs_next_leaf() does a search and finds the new prealloc extent
  8k-12k.
- btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into
  the log tree. Note that it overlaps with the 4k-12k prealloc extent
  that was copied to the log tree by the first fsync.
- fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k
  extent that was written.
- This tries to drop the range 4k-8k in the log tree, which requires
  adjusting the start of the 4k-12k prealloc extent in the log tree to
  8k.
- btrfs_set_item_key_safe() sees that there is already an extent
  starting at 8k in the log tree and calls BUG().

Fix this by detecting when we're about to insert an overlapping file
extent item in the log tree and truncating the part that would overlap.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16 13:47:48 +02:00
..
9p 9p: add missing locking around taking dentry fid list 2024-06-16 13:47:37 +02:00
adfs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
affs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
afs afs: Don't cross .backup mountpoint from backup volume 2024-06-16 13:47:30 +02:00
autofs v6.6-vfs.autofs 2023-08-28 11:39:14 -07:00
befs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
bfs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
btrfs btrfs: fix crash on racing fsync and size-extending write into prealloc 2024-06-16 13:47:48 +02:00
cachefiles cachefiles: fix memory leak in cachefiles_add_cache() 2024-03-01 13:35:00 +01:00
ceph ceph: redirty page before returning AOP_WRITEPAGE_ACTIVATE 2024-04-27 17:11:29 +02:00
coda v6.6-vfs.ctime 2023-08-28 09:31:32 -07:00
configfs configfs: convert to ctime accessor functions 2023-07-13 10:28:05 +02:00
cramfs v6.6-vfs.super 2023-08-28 11:04:18 -07:00
crypto
debugfs debugfs: fix automount d_fsdata usage 2024-01-20 11:51:37 +01:00
devpts v6.6-vfs.misc 2023-08-28 10:17:14 -07:00
dlm dlm: fix user space lock decision to copy lvb 2024-06-12 11:11:38 +02:00
ecryptfs ecryptfs: Fix buffer size for tag 66 packet 2024-06-12 11:11:31 +02:00
efivarfs efivarfs: Request at most 512 bytes for variable names 2024-03-06 14:48:41 +00:00
efs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
erofs erofs: avoid allocating DEFLATE streams before mounting 2024-06-16 13:47:31 +02:00
exfat exfat: support handle zero-size directory 2023-11-28 17:19:44 +00:00
exportfs exportfs: remove kernel-doc warnings in exportfs 2023-08-29 17:45:22 -04:00
ext2 quota: Properly annotate i_dquot arrays with __rcu 2024-03-26 18:19:46 -04:00
ext4 ext4: fix mb_cache_entry's e_refcnt leak in ext4_xattr_block_cache_find() 2024-06-16 13:47:45 +02:00
f2fs f2fs: fix to do sanity check on i_xattr_nid in sanity_check_inode() 2024-06-16 13:47:32 +02:00
fat fat: fix uninitialized field in nostale filehandles 2024-04-03 15:28:20 +02:00
freevxfs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
fscache netfs, fscache: Prevent Oops in fscache_put_cache() 2024-01-31 16:19:01 -08:00
fuse fuse: fix leaked ENOSYS error on first statx call 2024-04-27 17:11:42 +02:00
gfs2 kthread: add kthread_stop_put 2024-06-12 11:12:52 +02:00
hfs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
hfsplus for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
hostfs hostfs: convert to ctime accessor functions 2023-07-24 10:30:00 +02:00
hpfs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
hugetlbfs mm: hugetlb pages should not be reserved by shmat() if SHM_NORESERVE 2024-02-23 09:25:16 +01:00
iomap iomap: fault in smaller chunks for non-large folio mappings 2024-06-16 13:47:40 +02:00
isofs isofs: handle CDs with bad root inode but good Joliet root directory 2024-04-13 13:07:34 +02:00
jbd2 jbd2: fix soft lockup in journal_finish_inode_data_buffers() 2024-01-20 11:51:43 +01:00
jffs2 jffs2: prevent xattr node from overflowing the eraseblock 2024-06-12 11:11:33 +02:00
jfs quota: Properly annotate i_dquot arrays with __rcu 2024-03-26 18:19:46 -04:00
kernfs kernfs: RCU protect kernfs_nodes and avoid kernfs_idr_lock in kernfs_find_and_get_node_by_id() 2024-04-13 13:07:38 +02:00
lockd SUNRPC: Add enum svc_auth_status 2023-08-29 17:45:22 -04:00
minix for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
netfs netfs: Only call folio_start_fscache() one time for each folio 2023-09-18 12:03:46 -07:00
nfs NFS: Fix READ_PLUS when server doesn't support OP_READ_PLUS 2024-06-16 13:47:47 +02:00
nfs_common
nfsd nfsd: hold a lighter-weight client reference over CB_RECALL_ANY 2024-04-10 16:36:01 +02:00
nilfs2 nilfs2: fix out-of-range warning 2024-06-12 11:11:31 +02:00
nls nls: Hide new NLS_UCS2_UTILS 2023-08-31 12:07:34 -05:00
notify fanotify: limit reporting of event with non-decodeable file handles 2023-10-19 16:19:20 +02:00
ntfs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
ntfs3 fs/ntfs3: Use variable length array instead of fixed size 2024-06-12 11:12:39 +02:00
ocfs2 quota: Properly annotate i_dquot arrays with __rcu 2024-03-26 18:19:46 -04:00
omfs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
openpromfs openpromfs: finish conversion to the new mount API 2024-06-12 11:11:30 +02:00
orangefs Julia Lawall reported this null pointer dereference, this should fix it. 2024-04-13 13:07:35 +02:00
overlayfs ovl: remove upper umask handling from ovl_create_upper() 2024-06-12 11:12:24 +02:00
proc mm: /proc/pid/smaps_rollup: avoid skipping vma after getting mmap_lock again 2024-06-16 13:47:42 +02:00
pstore pstore/zone: Add a null pointer check to the psz_kmsg_read 2024-04-13 13:07:31 +02:00
qnx4 for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
qnx6 for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
quota quota: Properly annotate i_dquot arrays with __rcu 2024-03-26 18:19:46 -04:00
ramfs ramfs: convert to ctime accessor functions 2023-07-24 10:30:04 +02:00
reiserfs quota: Properly annotate i_dquot arrays with __rcu 2024-03-26 18:19:46 -04:00
romfs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
smb ksmbd: fix uninitialized symbol 'share' in smb2_tree_connect() 2024-06-12 11:11:30 +02:00
squashfs Squashfs: check the inode number is not the invalid value of zero 2024-05-02 16:32:41 +02:00
sysfs fs: sysfs: Fix reference leak in sysfs_break_active_protection() 2024-04-27 17:11:41 +02:00
sysv sysv: don't call sb_bread() with pointers_lock held 2024-04-13 13:07:34 +02:00
tracefs tracefs: Clear EVENT_INODE flag in tracefs_drop_inode() 2024-06-16 13:47:48 +02:00
ubifs ubifs: Set page uptodate in the correct place 2024-04-03 15:28:20 +02:00
udf udf: Convert udf_expand_file_adinicb() to use a folio 2024-06-12 11:12:23 +02:00
ufs for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
unicode
vboxsf vboxsf: explicitly deny setlease attempts 2024-05-17 12:02:13 +02:00
verity fsverity: use register_sysctl_init() to avoid kmemleak warning 2024-06-16 13:47:33 +02:00
xfs xfs: remove conditional building of rt geometry validator functions 2024-04-03 15:28:49 +02:00
zonefs zonefs: Improve error handling 2024-02-23 09:25:13 +01:00
aio.c fs/aio: Check IOCB_AIO_RW before the struct aio_kiocb conversion 2024-04-03 15:28:44 +02:00
anon_inodes.c
attr.c v6.6-vfs.misc 2023-08-28 10:17:14 -07:00
bad_inode.c fs: drop the timespec64 argument from update_time 2023-08-11 09:04:57 +02:00
binfmt_elf.c Merge branch 'expand-stack' 2023-06-28 20:35:21 -07:00
binfmt_elf_fdpic.c fs: binfmt_elf_efpic: fix personality for ELF-FDPIC 2023-09-29 17:20:45 -07:00
binfmt_elf_test.c
binfmt_flat.c
binfmt_misc.c fs: convert to ctime accessor functions 2023-07-13 10:28:04 +02:00
binfmt_script.c
buffer.c iomap: add a workaround for racy i_size updates on block devices 2023-09-25 08:55:00 -07:00
char_dev.c
compat_binfmt_elf.c
coredump.c
d_path.c
dax.c mm: convert DAX lock/unlock page to lock/unlock folio 2024-01-10 17:16:53 +01:00
dcache.c fast_dput(): handle underflows gracefully 2024-02-05 20:14:26 +00:00
direct-io.c - Yosry Ahmed brought back some cgroup v1 stats in OOM logs. 2023-06-28 10:28:11 -07:00
drop_caches.c fs: drop_caches: draining pages before dropping caches 2023-08-18 10:12:11 -07:00
eventfd.c eventfd: prevent underflow for eventfd semaphores 2023-07-11 11:41:34 +02:00
eventpoll.c epoll: be better about file lifetimes 2024-06-12 11:11:30 +02:00
exec.c exec: Fix NOMMU linux_binprm::exec in transfer_args_to_stack() 2024-04-03 15:28:55 +02:00
fcntl.c fs: Fix rw_hint validation 2024-03-26 18:19:17 -04:00
fhandle.c do_sys_name_to_handle(): use kzalloc() to fix kernel-infoleak 2024-03-26 18:19:15 -04:00
file.c v6.6-vfs.misc 2023-08-28 10:17:14 -07:00
file_table.c fs: use __fput_sync in close(2) 2023-08-08 19:36:51 +02:00
filesystems.c
fs-writeback.c writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs 2023-11-20 11:58:52 +01:00
fs_context.c fs: factor out vfs_parse_monolithic_sep() helper 2023-10-12 18:53:36 +03:00
fs_parser.c
fs_pin.c
fs_struct.c kill do_each_thread() 2023-08-21 13:46:25 -07:00
fs_types.c
fsopen.c fs: add FSCONFIG_CMD_CREATE_EXCL 2023-08-14 18:48:02 +02:00
init.c
inode.c filemap: add a per-mapping stable writes flag 2023-12-03 07:33:03 +01:00
internal.h for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
ioctl.c lsm: new security_file_ioctl_compat() hook 2024-01-31 16:18:54 -08:00
Kconfig for-6.6/block-2023-08-28 2023-08-29 20:21:42 -07:00
Kconfig.binfmt riscv: support the elf-fdpic binfmt loader 2023-08-23 14:17:43 -07:00
kernel_read_file.c fs: Fix kernel-doc warnings 2023-08-19 12:12:12 +02:00
libfs.c fs: new accessor methods for atime and mtime 2024-01-05 15:19:40 +01:00
locks.c NFSD 6.6 Release Notes 2023-08-31 15:32:18 -07:00
Makefile fs: add CONFIG_BUFFER_HEAD 2023-08-02 09:13:09 -06:00
mbcache.c
mnt_idmapping.c
mount.h
mpage.c
namei.c rename(): fix the locking of subdirectories 2024-01-31 16:18:57 -08:00
namespace.c fs: relax mount_setattr() permission checks 2024-02-23 09:25:15 +01:00
nsfs.c fs: convert to ctime accessor functions 2023-07-13 10:28:04 +02:00
open.c cred: get rid of CONFIG_DEBUG_CREDENTIALS 2023-12-20 17:01:51 +01:00
pipe.c fs/pipe: Fix lockdep false-positive in watchqueue pipe_write() 2024-04-10 16:35:57 +02:00
pnode.c
pnode.h
posix_acl.c fs: convert to ctime accessor functions 2023-07-13 10:28:04 +02:00
proc_namespace.c
read_write.c fs: Fix one kernel-doc comment 2023-08-15 08:32:45 +02:00
readdir.c vfs: get rid of old '->iterate' directory operation 2023-08-06 15:08:35 +02:00
remap_range.c
select.c fs/select: rework stack allocation hack for clang 2024-03-26 18:19:17 -04:00
seq_file.c
signalfd.c
splice.c - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") 2023-08-29 14:25:26 -07:00
stack.c fs: convert to ctime accessor functions 2023-07-13 10:28:04 +02:00
stat.c fs: Pass AT_GETATTR_NOSEC flag to getattr interface function 2023-12-03 07:33:03 +01:00
statfs.c
super.c fs: export sget_dev() 2023-08-31 12:47:15 +02:00
sync.c
sysctls.c
timerfd.c
userfaultfd.c mm/userfaultfd: reset ptes when close() for wr-protected ones 2024-05-17 12:02:36 +02:00
utimes.c
xattr.c tmpfs,xattr: GFP_KERNEL_ACCOUNT for simple xattrs 2023-08-22 10:57:46 +02:00