Commit graph

76974 commits

Author SHA1 Message Date
Kiselev, Oleg
026d0d27c4 ext4: reduce computation of overhead during resize
This patch avoids doing an O(n**2)-complexity walk through every flex group.
Instead, it uses the already computed overhead information for the newly
allocated space, and simply adds it to the previously calculated
overhead stored in the superblock.  This drastically reduces the time
taken to resize very large bigalloc filesystems (from 3+ hours for a
64TB fs down to milliseconds).

Signed-off-by: Oleg Kiselev <okiselev@amazon.com>
Link: https://lore.kernel.org/r/CE4F359F-4779-45E6-B6A9-8D67FDFF5AE2@amazon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:56:25 -04:00
Zhihao Cheng
4a734f0869 jbd2: fix assertion 'jh->b_frozen_data == NULL' failure when journal aborted
Following process will fail assertion 'jh->b_frozen_data == NULL' in
jbd2_journal_dirty_metadata():

                   jbd2_journal_commit_transaction
unlink(dir/a)
 jh->b_transaction = trans1
 jh->b_jlist = BJ_Metadata
                    journal->j_running_transaction = NULL
                    trans1->t_state = T_COMMIT
unlink(dir/b)
 handle->h_trans = trans2
 do_get_write_access
  jh->b_modified = 0
  jh->b_frozen_data = frozen_buffer
  jh->b_next_transaction = trans2
 jbd2_journal_dirty_metadata
  is_handle_aborted
   is_journal_aborted // return false

           --> jbd2 abort <--

                     while (commit_transaction->t_buffers)
                      if (is_journal_aborted)
                       jbd2_journal_refile_buffer
                        __jbd2_journal_refile_buffer
                         WRITE_ONCE(jh->b_transaction,
						jh->b_next_transaction)
                         WRITE_ONCE(jh->b_next_transaction, NULL)
                         __jbd2_journal_file_buffer(jh, BJ_Reserved)
        J_ASSERT_JH(jh, jh->b_frozen_data == NULL) // assertion failure !

The reproducer (See detail in [Link]) reports:
 ------------[ cut here ]------------
 kernel BUG at fs/jbd2/transaction.c:1629!
 invalid opcode: 0000 [#1] PREEMPT SMP
 CPU: 2 PID: 584 Comm: unlink Tainted: G        W
 5.19.0-rc6-00115-g4a57a8400075-dirty #697
 RIP: 0010:jbd2_journal_dirty_metadata+0x3c5/0x470
 RSP: 0018:ffffc90000be7ce0 EFLAGS: 00010202
 Call Trace:
  <TASK>
  __ext4_handle_dirty_metadata+0xa0/0x290
  ext4_handle_dirty_dirblock+0x10c/0x1d0
  ext4_delete_entry+0x104/0x200
  __ext4_unlink+0x22b/0x360
  ext4_unlink+0x275/0x390
  vfs_unlink+0x20b/0x4c0
  do_unlinkat+0x42f/0x4c0
  __x64_sys_unlink+0x37/0x50
  do_syscall_64+0x35/0x80

After journal aborting, __jbd2_journal_refile_buffer() is executed with
holding @jh->b_state_lock, we can fix it by moving 'is_handle_aborted()'
into the area protected by @jh->b_state_lock.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216251
Fixes: 470decc613 ("[PATCH] jbd2: initial copy of files from jbd")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Link: https://lore.kernel.org/r/20220715125152.4022726-1-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:56:25 -04:00
Lukas Czerner
1e1c2b86ef ext4: block range must be validated before use in ext4_mb_clear_bb()
Block range to free is validated in ext4_free_blocks() using
ext4_inode_block_valid() and then it's passed to ext4_mb_clear_bb().
However in some situations on bigalloc file system the range might be
adjusted after the validation in ext4_free_blocks() which can lead to
troubles on corrupted file systems such as one found by syzkaller that
resulted in the following BUG

kernel BUG at fs/ext4/ext4.h:3319!
PREEMPT SMP NOPTI
CPU: 28 PID: 4243 Comm: repro Kdump: loaded Not tainted 5.19.0-rc6+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
RIP: 0010:ext4_free_blocks+0x95e/0xa90
Call Trace:
 <TASK>
 ? lock_timer_base+0x61/0x80
 ? __es_remove_extent+0x5a/0x760
 ? __mod_timer+0x256/0x380
 ? ext4_ind_truncate_ensure_credits+0x90/0x220
 ext4_clear_blocks+0x107/0x1b0
 ext4_free_data+0x15b/0x170
 ext4_ind_truncate+0x214/0x2c0
 ? _raw_spin_unlock+0x15/0x30
 ? ext4_discard_preallocations+0x15a/0x410
 ? ext4_journal_check_start+0xe/0x90
 ? __ext4_journal_start_sb+0x2f/0x110
 ext4_truncate+0x1b5/0x460
 ? __ext4_journal_start_sb+0x2f/0x110
 ext4_evict_inode+0x2b4/0x6f0
 evict+0xd0/0x1d0
 ext4_enable_quotas+0x11f/0x1f0
 ext4_orphan_cleanup+0x3de/0x430
 ? proc_create_seq_private+0x43/0x50
 ext4_fill_super+0x295f/0x3ae0
 ? snprintf+0x39/0x40
 ? sget_fc+0x19c/0x330
 ? ext4_reconfigure+0x850/0x850
 get_tree_bdev+0x16d/0x260
 vfs_get_tree+0x25/0xb0
 path_mount+0x431/0xa70
 __x64_sys_mount+0xe2/0x120
 do_syscall_64+0x5b/0x80
 ? do_user_addr_fault+0x1e2/0x670
 ? exc_page_fault+0x70/0x170
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fdf4e512ace

Fix it by making sure that the block range is properly validated before
used every time it changes in ext4_free_blocks() or ext4_mb_clear_bb().

Link: https://syzkaller.appspot.com/bug?id=5266d464285a03cee9dbfda7d2452a72c3c2ae7c
Reported-by: syzbot+15cd994e273307bf5cfa@syzkaller.appspotmail.com
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Tadeusz Struk <tadeusz.struk@linaro.org>
Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Link: https://lore.kernel.org/r/20220714165903.58260-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:56:25 -04:00
Jan Kara
307af6c879 mbcache: automatically delete entries from cache on freeing
Use the fact that entries with elevated refcount are not removed from
the hash and just move removal of the entry from the hash to the entry
freeing time. When doing this we also change the generic code to hold
one reference to the cache entry, not two of them, which makes code
somewhat more obvious.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-10-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:56:25 -04:00
Jan Kara
75896339e4 mbcache: Remove mb_cache_entry_delete()
Nobody uses mb_cache_entry_delete() anymore. Remove it.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-9-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:56:25 -04:00
Jan Kara
1189d8ec51 ext2: avoid deleting xattr block that is being reused
Currently when we decide to reuse xattr block we detect the case when
the last reference to xattr block is being dropped at the same time and
cancel the reuse attempt. Convert ext2 to a new scheme when as soon as
matching mbcache entry is found, we wait with dropping the last xattr
block reference until mbcache entry reference is dropped (meaning either
the xattr block reference is increased or we decided not to reuse the
block).

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-8-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:56:25 -04:00
Jan Kara
b67798d551 ext2: unindent codeblock in ext2_xattr_set()
Replace one else in ext2_xattr_set() with a goto. This makes following
code changes simpler to follow. No functional changes.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220712105436.32204-7-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:56:25 -04:00
Jan Kara
90ae40d243 ext2: factor our freeing of xattr block reference
Free of xattr block reference is opencode in two places. Factor it out
into a separate function and use it.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-6-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:56:25 -04:00
Jan Kara
65f8b80053 ext4: fix race when reusing xattr blocks
When ext4_xattr_block_set() decides to remove xattr block the following
race can happen:

CPU1                                    CPU2
ext4_xattr_block_set()                  ext4_xattr_release_block()
  new_bh = ext4_xattr_block_cache_find()

                                          lock_buffer(bh);
                                          ref = le32_to_cpu(BHDR(bh)->h_refcount);
                                          if (ref == 1) {
                                            ...
                                            mb_cache_entry_delete();
                                            unlock_buffer(bh);
                                            ext4_free_blocks();
                                              ...
                                              ext4_forget(..., bh, ...);
                                                jbd2_journal_revoke(..., bh);

  ext4_journal_get_write_access(..., new_bh, ...)
    do_get_write_access()
      jbd2_journal_cancel_revoke(..., new_bh);

Later the code in ext4_xattr_block_set() finds out the block got freed
and cancels reusal of the block but the revoke stays canceled and so in
case of block reuse and journal replay the filesystem can get corrupted.
If the race works out slightly differently, we can also hit assertions
in the jbd2 code.

Fix the problem by making sure that once matching mbcache entry is
found, code dropping the last xattr block reference (or trying to modify
xattr block in place) waits until the mbcache entry reference is
dropped. This way code trying to reuse xattr block is protected from
someone trying to drop the last reference to xattr block.

Reported-and-tested-by: Ritesh Harjani <ritesh.list@gmail.com>
CC: stable@vger.kernel.org
Fixes: 82939d7999 ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:56:25 -04:00
Jan Kara
fd48e9acdf ext4: unindent codeblock in ext4_xattr_block_set()
Remove unnecessary else (and thus indentation level) from a code block
in ext4_xattr_block_set(). It will also make following code changes
easier. No functional changes.

CC: stable@vger.kernel.org
Fixes: 82939d7999 ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:56:25 -04:00
Jan Kara
6bc0d63dad ext4: remove EA inode entry from mbcache on inode eviction
Currently we remove EA inode from mbcache as soon as its xattr refcount
drops to zero. However there can be pending attempts to reuse the inode
and thus refcount handling code has to handle the situation when
refcount increases from zero anyway. So save some work and just keep EA
inode in mbcache until it is getting evicted. At that moment we are sure
following iget() of EA inode will fail anyway (or wait for eviction to
finish and load things from the disk again) and so removing mbcache
entry at that moment is fine and simplifies the code a bit.

CC: stable@vger.kernel.org
Fixes: 82939d7999 ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:56:25 -04:00
Jan Kara
3dc96bba65 mbcache: add functions to delete entry if unused
Add function mb_cache_entry_delete_or_get() to delete mbcache entry if
it is unused and also add a function to wait for entry to become unused
- mb_cache_entry_wait_unused(). We do not share code between the two
deleting function as one of them will go away soon.

CC: stable@vger.kernel.org
Fixes: 82939d7999 ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:56:25 -04:00
Jan Kara
5831891418 mbcache: don't reclaim used entries
Do not reclaim entries that are currently used by somebody from a
shrinker. Firstly, these entries are likely useful. Secondly, we will
need to keep such entries to protect pending increment of xattr block
refcount.

CC: stable@vger.kernel.org
Fixes: 82939d7999 ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:56:25 -04:00
Lukas Czerner
b8a04fe77e ext4: make sure ext4_append() always allocates new block
ext4_append() must always allocate a new block, otherwise we run the
risk of overwriting existing directory block corrupting the directory
tree in the process resulting in all manner of problems later on.

Add a sanity check to see if the logical block is already allocated and
error out if it is.

Cc: stable@kernel.org
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20220704142721.157985-2-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:56:17 -04:00
Lukas Czerner
65f8ea4cd5 ext4: check if directory block is within i_size
Currently ext4 directory handling code implicitly assumes that the
directory blocks are always within the i_size. In fact ext4_append()
will attempt to allocate next directory block based solely on i_size and
the i_size is then appropriately increased after a successful
allocation.

However, for this to work it requires i_size to be correct. If, for any
reason, the directory inode i_size is corrupted in a way that the
directory tree refers to a valid directory block past i_size, we could
end up corrupting parts of the directory tree structure by overwriting
already used directory blocks when modifying the directory.

Fix it by catching the corruption early in __ext4_read_dirblock().

Addresses Red-Hat-Bugzilla: #2070205
CVE: CVE-2022-1184
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20220704142721.157985-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:56:17 -04:00
Ojaswin Mujoo
3fa5d23e68 ext4: reflect mb_optimize_scan value in options file
Add support to display the mb_optimize_scan value in
/proc/fs/ext4/<dev>/options file. The option is only
displayed when the value is non default.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20220704054603.21462-1-ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:56:17 -04:00
Ye Bin
b24e77ef1c ext4: avoid remove directory when directory is corrupted
Now if check directoy entry is corrupted, ext4_empty_dir may return true
then directory will be removed when file system mounted with "errors=continue".
In order not to make things worse just return false when directory is corrupted.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220622090223.682234-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:56:17 -04:00
Jiang Jian
c64a92992e ext4: aligned '*' in comments
The '*' in the comment is not aligned.

Signed-off-by: Jiang Jian <jiangjian@cdjrlc.com>
Link: https://lore.kernel.org/r/20220621061531.19669-1-jiangjian@cdjrlc.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:56:17 -04:00
Li Lingfeng
07ea7a617d ext4: recover csum seed of tmp_inode after migrating to extents
When migrating to extents, the checksum seed of temporary inode
need to be replaced by inode's, otherwise the inode checksums
will be incorrect when swapping the inodes data.

However, the temporary inode can not match it's checksum to
itself since it has lost it's own checksum seed.

mkfs.ext4 -F /dev/sdc
mount /dev/sdc /mnt/sdc
xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/sdc/testfile
chattr -e /mnt/sdc/testfile
chattr +e /mnt/sdc/testfile
umount /dev/sdc
fsck -fn /dev/sdc

========
...
Pass 1: Checking inodes, blocks, and sizes
Inode 13 passes checks, but checksum does not match inode.  Fix? no
...
========

The fix is simple, save the checksum seed of temporary inode, and
recover it after migrating to extents.

Fixes: e81c9302a6 ("ext4: set csum seed in tmp inode while migrating to extents")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220617062515.2113438-1-lilingfeng3@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:56:02 -04:00
Ye Bin
51ae846cff ext4: fix warning in ext4_iomap_begin as race between bmap and write
We got issue as follows:
------------[ cut here ]------------
WARNING: CPU: 3 PID: 9310 at fs/ext4/inode.c:3441 ext4_iomap_begin+0x182/0x5d0
RIP: 0010:ext4_iomap_begin+0x182/0x5d0
RSP: 0018:ffff88812460fa08 EFLAGS: 00010293
RAX: ffff88811f168000 RBX: 0000000000000000 RCX: ffffffff97793c12
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: ffff88812c669160 R08: ffff88811f168000 R09: ffffed10258cd20f
R10: ffff88812c669077 R11: ffffed10258cd20e R12: 0000000000000001
R13: 00000000000000a4 R14: 000000000000000c R15: ffff88812c6691ee
FS:  00007fd0d6ff3740(0000) GS:ffff8883af180000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd0d6dda290 CR3: 0000000104a62000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 iomap_apply+0x119/0x570
 iomap_bmap+0x124/0x150
 ext4_bmap+0x14f/0x250
 bmap+0x55/0x80
 do_vfs_ioctl+0x952/0xbd0
 __x64_sys_ioctl+0xc6/0x170
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Above issue may happen as follows:
          bmap                    write
bmap
  ext4_bmap
    iomap_bmap
      ext4_iomap_begin
                            ext4_file_write_iter
			      ext4_buffered_write_iter
			        generic_perform_write
				  ext4_da_write_begin
				    ext4_da_write_inline_data_begin
				      ext4_prepare_inline_data
				        ext4_create_inline_data
					  ext4_set_inode_flag(inode,
						EXT4_INODE_INLINE_DATA);
      if (WARN_ON_ONCE(ext4_has_inline_data(inode))) ->trigger bug_on

To solved above issue hold inode lock in ext4_bamp.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20220617013935.397596-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-08-02 23:56:02 -04:00
Baokun Li
fd7e672ea9 ext4: correct the misjudgment in ext4_iget_extra_inode
Use the EXT4_INODE_HAS_XATTR_SPACE macro to more accurately
determine whether the inode have xattr space.

Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220616021358.2504451-5-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:55:55 -04:00
Baokun Li
c9fd167d57 ext4: correct max_inline_xattr_value_size computing
If the ext4 inode does not have xattr space, 0 is returned in the
get_max_inline_xattr_value_size function. Otherwise, the function returns
a negative value when the inode does not contain EXT4_STATE_XATTR.

Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220616021358.2504451-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:52:44 -04:00
Baokun Li
67d7d8ad99 ext4: fix use-after-free in ext4_xattr_set_entry
Hulk Robot reported a issue:
==================================================================
BUG: KASAN: use-after-free in ext4_xattr_set_entry+0x18ab/0x3500
Write of size 4105 at addr ffff8881675ef5f4 by task syz-executor.0/7092

CPU: 1 PID: 7092 Comm: syz-executor.0 Not tainted 4.19.90-dirty #17
Call Trace:
[...]
 memcpy+0x34/0x50 mm/kasan/kasan.c:303
 ext4_xattr_set_entry+0x18ab/0x3500 fs/ext4/xattr.c:1747
 ext4_xattr_ibody_inline_set+0x86/0x2a0 fs/ext4/xattr.c:2205
 ext4_xattr_set_handle+0x940/0x1300 fs/ext4/xattr.c:2386
 ext4_xattr_set+0x1da/0x300 fs/ext4/xattr.c:2498
 __vfs_setxattr+0x112/0x170 fs/xattr.c:149
 __vfs_setxattr_noperm+0x11b/0x2a0 fs/xattr.c:180
 __vfs_setxattr_locked+0x17b/0x250 fs/xattr.c:238
 vfs_setxattr+0xed/0x270 fs/xattr.c:255
 setxattr+0x235/0x330 fs/xattr.c:520
 path_setxattr+0x176/0x190 fs/xattr.c:539
 __do_sys_lsetxattr fs/xattr.c:561 [inline]
 __se_sys_lsetxattr fs/xattr.c:557 [inline]
 __x64_sys_lsetxattr+0xc2/0x160 fs/xattr.c:557
 do_syscall_64+0xdf/0x530 arch/x86/entry/common.c:298
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x459fe9
RSP: 002b:00007fa5e54b4c08 EFLAGS: 00000246 ORIG_RAX: 00000000000000bd
RAX: ffffffffffffffda RBX: 000000000051bf60 RCX: 0000000000459fe9
RDX: 00000000200003c0 RSI: 0000000020000180 RDI: 0000000020000140
RBP: 000000000051bf60 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000001009 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffc73c93fc0 R14: 000000000051bf60 R15: 00007fa5e54b4d80
[...]
==================================================================

Above issue may happen as follows:
-------------------------------------
ext4_xattr_set
  ext4_xattr_set_handle
    ext4_xattr_ibody_find
      >> s->end < s->base
      >> no EXT4_STATE_XATTR
      >> xattr_check_inode is not executed
    ext4_xattr_ibody_set
      ext4_xattr_set_entry
       >> size_t min_offs = s->end - s->base
       >> UAF in memcpy

we can easily reproduce this problem with the following commands:
    mkfs.ext4 -F /dev/sda
    mount -o debug_want_extra_isize=128 /dev/sda /mnt
    touch /mnt/file
    setfattr -n user.cat -v `seq -s z 4096|tr -d '[:digit:]'` /mnt/file

In ext4_xattr_ibody_find, we have the following assignment logic:
  header = IHDR(inode, raw_inode)
         = raw_inode + EXT4_GOOD_OLD_INODE_SIZE + i_extra_isize
  is->s.base = IFIRST(header)
             = header + sizeof(struct ext4_xattr_ibody_header)
  is->s.end = raw_inode + s_inode_size

In ext4_xattr_set_entry
  min_offs = s->end - s->base
           = s_inode_size - EXT4_GOOD_OLD_INODE_SIZE - i_extra_isize -
	     sizeof(struct ext4_xattr_ibody_header)
  last = s->first
  free = min_offs - ((void *)last - s->base) - sizeof(__u32)
       = s_inode_size - EXT4_GOOD_OLD_INODE_SIZE - i_extra_isize -
         sizeof(struct ext4_xattr_ibody_header) - sizeof(__u32)

In the calculation formula, all values except s_inode_size and
i_extra_size are fixed values. When i_extra_size is the maximum value
s_inode_size - EXT4_GOOD_OLD_INODE_SIZE, min_offs is -4 and free is -8.
The value overflows. As a result, the preceding issue is triggered when
memcpy is executed.

Therefore, when finding xattr or setting xattr, check whether
there is space for storing xattr in the inode to resolve this issue.

Cc: stable@kernel.org
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220616021358.2504451-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:52:34 -04:00
Baokun Li
179b14152d ext4: add EXT4_INODE_HAS_XATTR_SPACE macro in xattr.h
When adding an xattr to an inode, we must ensure that the inode_size is
not less than EXT4_GOOD_OLD_INODE_SIZE + extra_isize + pad. Otherwise,
the end position may be greater than the start position, resulting in UAF.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220616021358.2504451-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:52:34 -04:00
Eric Whitney
7f0d8e1d60 ext4: fix extent status tree race in writeback error recovery path
A race can occur in the unlikely event ext4 is unable to allocate a
physical cluster for a delayed allocation in a bigalloc file system
during writeback.  Failure to allocate a cluster forces error recovery
that includes a call to mpage_release_unused_pages().  That function
removes any corresponding delayed allocated blocks from the extent
status tree.  If a new delayed write is in progress on the same cluster
simultaneously, resulting in the addition of an new extent containing
one or more blocks in that cluster to the extent status tree, delayed
block accounting can be thrown off if that delayed write then encounters
a similar cluster allocation failure during future writeback.

Write lock the i_data_sem in mpage_release_unused_pages() to fix this
problem.  Ext4's block/cluster accounting code for bigalloc relies on
i_data_sem for mutual exclusion, as is found in the delayed write path,
and the locking in mpage_release_unused_pages() is missing.

Cc: stable@kernel.org
Reported-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20220615160530.1928801-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:52:19 -04:00
Zhang Yi
a89573ce4a jbd2: fix outstanding credits assert in jbd2_journal_commit_transaction()
We catch an assert problem in jbd2_journal_commit_transaction() when
doing fsstress and request falut injection tests. The problem is
happened in a race condition between jbd2_journal_commit_transaction()
and ext4_end_io_end(). Firstly, ext4_writepages() writeback dirty pages
and start reserved handle, and then the journal was aborted due to some
previous metadata IO error, jbd2_journal_abort() start to commit current
running transaction, the committing procedure could be raced by
ext4_end_io_end() and lead to subtract j_reserved_credits twice from
commit_transaction->t_outstanding_credits, finally the
t_outstanding_credits is mistakenly smaller than t_nr_buffers and
trigger assert.

kjournald2           kworker

jbd2_journal_commit_transaction()
 write_unlock(&journal->j_state_lock);
 atomic_sub(j_reserved_credits, t_outstanding_credits); //sub once

     	             jbd2_journal_start_reserved()
     	              start_this_handle()  //detect aborted journal
     	              jbd2_journal_free_reserved()  //get running transaction
                       read_lock(&journal->j_state_lock)
     	                __jbd2_journal_unreserve_handle()
     	               atomic_sub(j_reserved_credits, t_outstanding_credits);
                       //sub again
                       read_unlock(&journal->j_state_lock);

 journal->j_running_transaction = NULL;
 J_ASSERT(t_nr_buffers <= t_outstanding_credits) //bomb!!!

Fix this issue by using journal->j_state_lock to protect the subtraction
in jbd2_journal_commit_transaction().

Fixes: 96f1e09745 ("jbd2: avoid long hold times of j_state_lock while committing a transaction")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220611130426.2013258-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:52:19 -04:00
Jan Kara
d132495856 jbd2: unexport jbd2_log_start_commit()
jbd2_log_start_commit() is not used outside of jbd2 so unexport it. Also
make __jbd2_log_start_commit() static when we are at it.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220608112355.4397-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:52:19 -04:00
Jan Kara
68af74e92a jbd2: remove unused exports for jbd2 debugging
Jbd2 exports jbd2_journal_enable_debug and __jbd2_debug() depite the
first is used only in fs/jbd2/journal.c and the second only within jbd2
code. Remove the pointless exports make jbd2_journal_enable_debug
static.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220608112355.4397-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:52:19 -04:00
Jan Kara
cb3b3bf22c jbd2: rename jbd_debug() to jbd2_debug()
The name of jbd_debug() is confusing as all functions inside jbd2 have
jbd2_ prefix. Rename jbd_debug() to jbd2_debug(). No functional changes.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220608112355.4397-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:52:19 -04:00
Jan Kara
4978c659e7 ext4: use ext4_debug() instead of jbd_debug()
We use jbd_debug() in some places in ext4. It seems a bit strange to use
jbd2 debugging output function for ext4 code. Also these days
ext4_debug() uses dynamic printk so each debug message can be enabled /
disabled on its own so the time when it made some sense to have these
combined (to allow easier common selecting of messages to report) has
passed. Just convert all jbd_debug() uses in ext4 to ext4_debug().

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220608112355.4397-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:52:19 -04:00
hanjinke
218a69441b ext4: reuse order and buddy in mb_mark_used when buddy split
After each buddy split, mb_mark_used will search the proper order
for the block which may consume some loop in mb_find_order_for_block.
In fact, we can reuse the order and buddy generated by the buddy split.

Reviewed by: lei.rao@intel.com
Signed-off-by: hanjinke <hanjinke.666@bytedance.com>
Link: https://lore.kernel.org/r/20220606155305.74146-1-hanjinke.666@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:52:19 -04:00
Theodore Ts'o
827891a38a ext4: update the s_overhead_clusters in the backup sb's when resizing
When the EXT4_IOC_RESIZE_FS ioctl is complete, update the backup
superblocks.  We don't do this for the old-style resize ioctls since
they are quite ancient, and only used by very old versions of
resize2fs --- and we don't want to update the backup superblocks every
time EXT4_IOC_GROUP_ADD is called, since it might get called a lot.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20220629040026.112371-2-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:52:19 -04:00
Theodore Ts'o
de394a8665 ext4: update s_overhead_clusters in the superblock during an on-line resize
When doing an online resize, the on-disk superblock on-disk wasn't
updated.  This means that when the file system is unmounted and
remounted, and the on-disk overhead value is non-zero, this would
result in the results of statfs(2) to be incorrect.

This was partially fixed by Commits 10b01ee92d ("ext4: fix overhead
calculation to account for the reserved gdt blocks"), 85d825dbf4
("ext4: force overhead calculation if the s_overhead_cluster makes no
sense"), and eb7054212e ("ext4: update the cached overhead value in
the superblock").

However, since it was too expensive to forcibly recalculate the
overhead for bigalloc file systems at every mount, this didn't fix the
problem for bigalloc file systems.  This commit should address the
problem when resizing file systems with the bigalloc feature enabled.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20220629040026.112371-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:52:19 -04:00
Zhang Yi
5a57bca905 ext4: fix reading leftover inlined symlinks
Since commit 6493792d32 ("ext4: convert symlink external data block
mapping to bdev"), create new symlink with inline_data is not supported,
but it missing to handle the leftover inlined symlinks, which could
cause below error message and fail to read symlink.

 ls: cannot read symbolic link 'foo': Structure needs cleaning

 EXT4-fs error (device sda): ext4_map_blocks:605: inode #12: block
 2021161080: comm ls: lblock 0 mapped to illegal pblock 2021161080
 (length 1)

Fix this regression by adding ext4_read_inline_link(), which read the
inline data directly and convert it through a kmalloced buffer.

Fixes: 6493792d32 ("ext4: convert symlink external data block mapping to bdev")
Cc: stable@kernel.org
Reported-by: Torge Matthies <openglfreak@googlemail.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Tested-by: Torge Matthies <openglfreak@googlemail.com>
Link: https://lore.kernel.org/r/20220630090100.2769490-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:37:50 -04:00
Linus Torvalds
20855e4cb3 Fixes for 5.19-rc5:
- Fix statfs blocking on background inode gc workers
  - Fix some broken inode lock assertion code
  - Fix xattr leaf buffer leaks when cancelling a deferred xattr update
    operation
  - Clean up xattr recovery to make it easier to understand.
  - Fix xattr leaf block verifiers tripping over empty blocks.
  - Remove complicated and error prone xattr leaf block bholding mess.
  - Fix a bug where an rt extent crossing EOF was treated as "posteof"
    blocks and cleaned unnecessarily.
  - Fix a UAF when log shutdown races with unmount.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmK/kVMACgkQ+H93GTRK
 tOs0tQ/+PYRhEDKrgocxZGJFNvnxqPRdEDu9k5XCnO2Y/DZRAF52F0JZaPtuiFH4
 12e9vzYYRNrE9KifzPWo4j2L067kFszt4XcAjytJuf5f6k/duX7XbsdMb17Qxd28
 mZDtBBSQCc9fcQo21u5SdZlPaD1SC1843jB4Oe7Sbo3AFvVAMwuBUgnp2TSDA8V0
 0q25PUD0ZvWP3UTQS4M4fW4WhFa5wF+GnLR1DZjryFIzuUp9JwdCQZHIFnp6cHq9
 TZMDJ4WhD9igMSzicRfgPoC8z/D3Mm0cFmRoURbG3GLzAeJ+e7PJ43rvlwq6Ajcv
 v5DhyQvFkiVjKLsrtJyvvUGSpkLL/touNG8MUE9I0heiiwb0QbP108aHWU8AS1Dr
 q7XHIxPaOhvlzVZN1uTuZE4N51/0NWITGKBwF0XU1b5D3wLyvOY6fbI7KLfkX2Sa
 4zHKn4QpHUIE9fs5Na3H6L+ndlJclo2DJA6lF26pLgmrT7NLmJG+r97XagBsp/pr
 X8qOvVMg1XJA37Vy1bTN5cfEYzTTksJk/fQ3AvSKHDCeP5u87kiZ6hqNnW6dD0YF
 D8VTX29rVQr5HavbcGCmAyBZpk4CfclCsWCQrZu9MCnQSW37HnObXPJkIWvzt8Mn
 j6emhPcYHy5TwSChxdpzl733ZX0KdkdOAgkWgqtod2E/7fe+g7Q=
 =8QeL
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.19-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "This fixes some stalling problems and corrects the last of the
  problems (I hope) observed during testing of the new atomic xattr
  update feature.

   - Fix statfs blocking on background inode gc workers

   - Fix some broken inode lock assertion code

   - Fix xattr leaf buffer leaks when cancelling a deferred xattr update
     operation

   - Clean up xattr recovery to make it easier to understand.

   - Fix xattr leaf block verifiers tripping over empty blocks.

   - Remove complicated and error prone xattr leaf block bholding mess.

   - Fix a bug where an rt extent crossing EOF was treated as "posteof"
     blocks and cleaned unnecessarily.

   - Fix a UAF when log shutdown races with unmount"

* tag 'xfs-5.19-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: prevent a UAF when log IO errors race with unmount
  xfs: dont treat rt extents beyond EOF as eofblocks to be cleared
  xfs: don't hold xattr leaf buffers across transaction rolls
  xfs: empty xattr leaf header blocks are not corruption
  xfs: clean up the end of xfs_attri_item_recover
  xfs: always free xattri_leaf_bp when cancelling a deferred op
  xfs: use invalidate_lock to check the state of mmap_lock
  xfs: factor out the common lock flags assert
  xfs: introduce xfs_inodegc_push()
  xfs: bound maximum wait time for inodegc work
2022-07-03 09:42:17 -07:00
Linus Torvalds
69cb6c6556 Notable regression fixes:
- Fix NFSD crash during NFSv4.2 READ_PLUS operation
 - Fix incorrect status code returned by COMMIT operation
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmLAhFIACgkQM2qzM29m
 f5fK2w//TyUAzwoTZQEgmFbxNhXgUYC6uwWPqTCLMieStKtPwT8GsuXviY34QziF
 2vER0NW9Am2GQyL3gtiYFM07OoHhQ4gr86ltaeIHAHhcwm3eIs879ARwEsyN6eDR
 +RDpqnONwtg+yaepfCMc4Bki9Jex+mmoXro86nFPmH+TDM5QiIRY0ncBWSLVWvYT
 YciAgvL6vfo2G79NYOzohoTb15ydotmy6m9H70nN+a2l6bKOIT8cF4S8lZETJZXA
 Rlj+R0eE0iXZTtp7VsAfVAHHfOzexGJjE85hpVzGiZWbxe6o0WNmBpoHICXs9VoP
 WRkq6vBC9P6wJ7EOu84/SRlR/rQaxfrYB7beqTC2kO6t5Ka5/xpJMKZOhfukCMV+
 7+uPAUnSIRqpHG0hWm1kPto00bfXm9fMETQbOEw7UQ9dH/322iOJnXwy03ZEXtJq
 9G0k+yNcydoIs2g5OpNrw4f/wTQcdlbf1ZA5O9dAsxwS1ZTNpKUiE0Sd0Ez/0VIJ
 t5DZFWH44rs/60avxRYxtw965gNugTxb1YiKdXObvbFGf+xYtH0zePce++4kTJlE
 t886stLcHbuPYs4/XItwTxLMSnDM5UR22Icho7ElRHvPiWjZmc6o2fxCkWTzCOE2
 ylZowLFMYZ/KbrP5mcqLARKEP6oFJERXt/U8U0CWeVbZomy9GVY=
 =z5W8
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:
 "Notable regression fixes:

   - Fix NFSD crash during NFSv4.2 READ_PLUS operation

   - Fix incorrect status code returned by COMMIT operation"

* tag 'nfsd-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  SUNRPC: Fix READ_PLUS crasher
  NFSD: restore EINVAL error translation in nfsd_commit()
2022-07-02 11:20:56 -07:00
Linus Torvalds
76ff294e16 More NFS Client Bugfixes for Linux 5.19-rc
- Bugfixes:
   - Allocate a fattr for _nfs4_discover_trunking()
   - Fix module reference count leak in nfs4_run_state_manager()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmK/NHYACgkQ18tUv7Cl
 QOtbXBAAhivn5bqZvrKz4WI4WjddTRcjyvITiW6m26GZZVgNHz1Sc2Tp6TA6QEL8
 c7OLkj2SoA0SmO4kZX+gCpOzsapQA7FUWULFxpPxJFp/NCsgoxYqO5ZfX0qVpk6t
 1w8lGFqsMve3LdRmcqvbaIrvzJPMdsVvixrZwXRQMe/atvtUMmgHo4pkBPuQ5nv9
 FzWh0KRiohhEWSmncD0fjdYBCq4VOqrUEEn4BTeMoXwvg/noLj5GX3mS14CFH12Y
 +iOqQ05O48Ny8qhzeQ8bRat43t4cZoCpFUcwEPB0CWoNCqS4Qoqvn48Ic+4WMxpN
 nPg2CqkqaG2RUJozSJz8m+GQNbEohoGkruZXJh7TQaqWXrIJGRMBhKhI2b7hujBG
 meu0ypETzlbofjleCpevfvnNStoRTuakssMpcU/hfKjnfNIsHnADSYey0HWHWTAH
 ZaBT6N5Z3C6hRWz7wXIn3uTuWadZbDC9+HGvvyMpuP+PDQFM9exJfhoO0JQvQc/G
 dPB7SyINVj2Z9gguJaew4csRoSxZqxeLB28XHQ7yyYsmo7lbUWaeUgGGpynrrsZL
 Ysuh7JVoZyhSSdNb3b2vlnI7zK+F1qIkzg0/u2ZN8CLPfMaBCmPL031oFudpWJ2n
 TLnxoFHhsG2MTwua3CaOfk8QvNc2dT7ZsvxXHl7HxBlxGtBgSN4=
 =OYKI
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.19-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:

 - Allocate a fattr for _nfs4_discover_trunking()

 - Fix module reference count leak in nfs4_run_state_manager()

* tag 'nfs-for-5.19-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4: Add an fattr allocation to _nfs4_discover_trunking()
  NFS: restore module put when manager exits.
2022-07-01 11:11:32 -07:00
Linus Torvalds
6f8693ea2b A filesystem fix, marked for stable. There appears to be a deeper
issue on the MDS side, but for now we are going with this one-liner
 to avoid busy looping and potential soft lockups.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmK/C8MTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi1cEB/9CiJoDsc1v+DrP/4Ud/AbI4LMffMcr
 tkHmUo8ZT5D4feUzSFE6iKgb3gRCJUkYKzesywQ7Xhv7Mr6/DKB4+t9QtrympZFd
 sAg775mHkL0NI6/OLnLSRva/r627PFk6f1v8OWENOjsw01PLOtWAB/B5FqlgN8tG
 EQLfX0G83o4AXt4NcPCcsucPh7FxC2iKe8XWqAE6VTjkKnyz3IQHvSLweWV68U8R
 ht6eun8H+slx8Kw1lSZfW/XoFGFO4uKntCh/CKKH28ZqaXrxrdsfmXSVOMlOi351
 qxPfrTPgaSfvWQLbYQfPdQZCsfyyPgP2wdAVfpy56vk0yoxi2TLGBPsD
 =bu9O
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.19-rc5' of https://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "A ceph filesystem fix, marked for stable.

  There appears to be a deeper issue on the MDS side, but for now we are
  going with this one-liner to avoid busy looping and potential soft
  lockups"

* tag 'ceph-for-5.19-rc5' of https://github.com/ceph/ceph-client:
  ceph: wait on async create before checking caps for syncfs
2022-07-01 11:06:21 -07:00
Linus Torvalds
0a35d1622d io_uring-5.19-2022-07-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmK+6CoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsZPD/9xPZTAJhX3/HNTjbi+FlSvTaJ/4rll98No
 1pzW+nZyBVr4yesnHW2qtLwLRaYMNAFjdJmakn1BIUau4IT4Eqhb8NEz4ZCKnDD2
 Kwi0q/9c0I/GxTnVXmwXPQzQkZarYLa8cppQr1L/L3el1xTU9qXUdpR7+vxPKi4J
 ADDP+7buRYp7Td2RfBD2lD4B7jNMpZYVC/2/Y3fixkuJvK4eYKuf+5K7zgmbahm5
 YOm86k3P7QN7saTxUeyUrwR/G6CoY99Dd54KadQAS4XkU1f6XuNjF6IsYjPUEZ1B
 pKlhK4mhGieMlW8yBti0BdJLLTAHVsL9Pa0Aqsv1EdZ3x/Mfp9kmwig9RAGREyQX
 gNs316VgsfnZb+AdImZ9EItRnPZ/1Z0//VOWiDy7CijKABCZCSFXqOwQ+Yonyfab
 ZoVXlwlvOaxmiQAWhJe2XKxzRtAfeQgyirmF95N+c/wtIH6dWzJeIs2xFLPIKCaY
 tkv5Ah4IBGxofJj1SNqKNRUcv6N/Hr7zs/p6yTQpVEoUzsKqzh1eNz8PDA3ewrq4
 C6nkXnZfidyqPuUZJIfOa02N/cPLUSclxdll6pHQfIMiwLBlV60pFcSsylgdYTE+
 XT/iwiiaSTPUUIkCTYhyoUpfZnNX6IoVpxKOuh5gLOmTz/+xlRfcRjcjuXIoneHQ
 D9qlUWbYLA==
 =Edge
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.19-2022-07-01' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Two minor tweaks:

   - While we still can, adjust the send/recv based flags to be in
     ->ioprio rather than in ->addr2. This is consistent with eg accept,
     and also doesn't waste a full 64-bit field for flags (Pavel)

   - 5.18-stable fix for re-importing provided buffers. Not much real
     world relevance here as it'll only impact non-pollable files gone
     async, which is more of a practical test case rather than something
     that is used in the wild (Dylan)"

* tag 'io_uring-5.19-2022-07-01' of git://git.kernel.dk/linux-block:
  io_uring: fix provided buffer import
  io_uring: keep sendrecv flags in ioprio
2022-07-01 10:52:01 -07:00
Darrick J. Wong
7561cea5db xfs: prevent a UAF when log IO errors race with unmount
KASAN reported the following use after free bug when running
generic/475:

 XFS (dm-0): Mounting V5 Filesystem
 XFS (dm-0): Starting recovery (logdev: internal)
 XFS (dm-0): Ending recovery (logdev: internal)
 Buffer I/O error on dev dm-0, logical block 20639616, async page read
 Buffer I/O error on dev dm-0, logical block 20639617, async page read
 XFS (dm-0): log I/O error -5
 XFS (dm-0): Filesystem has been shut down due to log error (0x2).
 XFS (dm-0): Unmounting Filesystem
 XFS (dm-0): Please unmount the filesystem and rectify the problem(s).
 ==================================================================
 BUG: KASAN: use-after-free in do_raw_spin_lock+0x246/0x270
 Read of size 4 at addr ffff888109dd84c4 by task 3:1H/136

 CPU: 3 PID: 136 Comm: 3:1H Not tainted 5.19.0-rc4-xfsx #rc4 8e53ab5ad0fddeb31cee5e7063ff9c361915a9c4
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
 Workqueue: xfs-log/dm-0 xlog_ioend_work [xfs]
 Call Trace:
  <TASK>
  dump_stack_lvl+0x34/0x44
  print_report.cold+0x2b8/0x661
  ? do_raw_spin_lock+0x246/0x270
  kasan_report+0xab/0x120
  ? do_raw_spin_lock+0x246/0x270
  do_raw_spin_lock+0x246/0x270
  ? rwlock_bug.part.0+0x90/0x90
  xlog_force_shutdown+0xf6/0x370 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
  xlog_ioend_work+0x100/0x190 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
  process_one_work+0x672/0x1040
  worker_thread+0x59b/0xec0
  ? __kthread_parkme+0xc6/0x1f0
  ? process_one_work+0x1040/0x1040
  ? process_one_work+0x1040/0x1040
  kthread+0x29e/0x340
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  </TASK>

 Allocated by task 154099:
  kasan_save_stack+0x1e/0x40
  __kasan_kmalloc+0x81/0xa0
  kmem_alloc+0x8d/0x2e0 [xfs]
  xlog_cil_init+0x1f/0x540 [xfs]
  xlog_alloc_log+0xd1e/0x1260 [xfs]
  xfs_log_mount+0xba/0x640 [xfs]
  xfs_mountfs+0xf2b/0x1d00 [xfs]
  xfs_fs_fill_super+0x10af/0x1910 [xfs]
  get_tree_bdev+0x383/0x670
  vfs_get_tree+0x7d/0x240
  path_mount+0xdb7/0x1890
  __x64_sys_mount+0x1fa/0x270
  do_syscall_64+0x2b/0x80
  entry_SYSCALL_64_after_hwframe+0x46/0xb0

 Freed by task 154151:
  kasan_save_stack+0x1e/0x40
  kasan_set_track+0x21/0x30
  kasan_set_free_info+0x20/0x30
  ____kasan_slab_free+0x110/0x190
  slab_free_freelist_hook+0xab/0x180
  kfree+0xbc/0x310
  xlog_dealloc_log+0x1b/0x2b0 [xfs]
  xfs_unmountfs+0x119/0x200 [xfs]
  xfs_fs_put_super+0x6e/0x2e0 [xfs]
  generic_shutdown_super+0x12b/0x3a0
  kill_block_super+0x95/0xd0
  deactivate_locked_super+0x80/0x130
  cleanup_mnt+0x329/0x4d0
  task_work_run+0xc5/0x160
  exit_to_user_mode_prepare+0xd4/0xe0
  syscall_exit_to_user_mode+0x1d/0x40
  entry_SYSCALL_64_after_hwframe+0x46/0xb0

This appears to be a race between the unmount process, which frees the
CIL and waits for in-flight iclog IO; and the iclog IO completion.  When
generic/475 runs, it starts fsstress in the background, waits a few
seconds, and substitutes a dm-error device to simulate a disk falling
out of a machine.  If the fsstress encounters EIO on a pure data write,
it will exit but the filesystem will still be online.

The next thing the test does is unmount the filesystem, which tries to
clean the log, free the CIL, and wait for iclog IO completion.  If an
iclog was being written when the dm-error switch occurred, it can race
with log unmounting as follows:

Thread 1				Thread 2

					xfs_log_unmount
					xfs_log_clean
					xfs_log_quiesce
xlog_ioend_work
<observe error>
xlog_force_shutdown
test_and_set_bit(XLOG_IOERROR)
					xfs_log_force
					<log is shut down, nop>
					xfs_log_umount_write
					<log is shut down, nop>
					xlog_dealloc_log
					xlog_cil_destroy
					<wait for iclogs>
spin_lock(&log->l_cilp->xc_push_lock)
<KABOOM>

Therefore, free the CIL after waiting for the iclogs to complete.  I
/think/ this race has existed for quite a few years now, though I don't
remember the ~2014 era logging code well enough to know if it was a real
threat then or if the actual race was exposed only more recently.

Fixes: ac983517ec ("xfs: don't sleep in xlog_cil_force_lsn on shutdown")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-07-01 09:09:52 -07:00
Amir Goldstein
868f9f2f8e vfs: fix copy_file_range() regression in cross-fs copies
A regression has been reported by Nicolas Boichat, found while using the
copy_file_range syscall to copy a tracefs file.

Before commit 5dae222a5f ("vfs: allow copy_file_range to copy across
devices") the kernel would return -EXDEV to userspace when trying to
copy a file across different filesystems.  After this commit, the
syscall doesn't fail anymore and instead returns zero (zero bytes
copied), as this file's content is generated on-the-fly and thus reports
a size of zero.

Another regression has been reported by He Zhe - the assertion of
WARN_ON_ONCE(ret == -EOPNOTSUPP) can be triggered from userspace when
copying from a sysfs file whose read operation may return -EOPNOTSUPP.

Since we do not have test coverage for copy_file_range() between any two
types of filesystems, the best way to avoid these sort of issues in the
future is for the kernel to be more picky about filesystems that are
allowed to do copy_file_range().

This patch restores some cross-filesystem copy restrictions that existed
prior to commit 5dae222a5f ("vfs: allow copy_file_range to copy across
devices"), namely, cross-sb copy is not allowed for filesystems that do
not implement ->copy_file_range().

Filesystems that do implement ->copy_file_range() have full control of
the result - if this method returns an error, the error is returned to
the user.  Before this change this was only true for fs that did not
implement the ->remap_file_range() operation (i.e.  nfsv3).

Filesystems that do not implement ->copy_file_range() still fall-back to
the generic_copy_file_range() implementation when the copy is within the
same sb.  This helps the kernel can maintain a more consistent story
about which filesystems support copy_file_range().

nfsd and ksmbd servers are modified to fall-back to the
generic_copy_file_range() implementation in case vfs_copy_file_range()
fails with -EOPNOTSUPP or -EXDEV, which preserves behavior of
server-side-copy.

fall-back to generic_copy_file_range() is not implemented for the smb
operation FSCTL_DUPLICATE_EXTENTS_TO_FILE, which is arguably a correct
change of behavior.

Fixes: 5dae222a5f ("vfs: allow copy_file_range to copy across devices")
Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drinkcat@chromium.org/
Link: https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=ZmpsG47CyHFJwnw7zSX6Q@mail.gmail.com/
Link: https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
Link: https://lore.kernel.org/linux-fsdevel/20210630161320.29006-1-lhenriques@suse.de/
Reported-by: Nicolas Boichat <drinkcat@chromium.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Fixes: 64bf5ff58d ("vfs: no fallback for ->copy_file_range")
Link: https://lore.kernel.org/linux-fsdevel/20f17f64-88cb-4e80-07c1-85cb96c83619@windriver.com/
Reported-by: He Zhe <zhe.he@windriver.com>
Tested-by: Namjae Jeon <linkinjeon@kernel.org>
Tested-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-06-30 15:16:38 -07:00
Scott Mayhew
4f40a5b554 NFSv4: Add an fattr allocation to _nfs4_discover_trunking()
This was missed in c3ed222745 ("NFSv4: Fix free of uninitialized
nfs4_label on referral lookup.") and causes a panic when mounting
with '-o trunkdiscovery':

PID: 1604   TASK: ffff93dac3520000  CPU: 3   COMMAND: "mount.nfs"
 #0 [ffffb79140f738f8] machine_kexec at ffffffffaec64bee
 #1 [ffffb79140f73950] __crash_kexec at ffffffffaeda67fd
 #2 [ffffb79140f73a18] crash_kexec at ffffffffaeda76ed
 #3 [ffffb79140f73a30] oops_end at ffffffffaec2658d
 #4 [ffffb79140f73a50] general_protection at ffffffffaf60111e
    [exception RIP: nfs_fattr_init+0x5]
    RIP: ffffffffc0c18265  RSP: ffffb79140f73b08  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff93dac304a800  RCX: 0000000000000000
    RDX: ffffb79140f73bb0  RSI: ffff93dadc8cbb40  RDI: d03ee11cfaf6bd50
    RBP: ffffb79140f73be8   R8: ffffffffc0691560   R9: 0000000000000006
    R10: ffff93db3ffd3df8  R11: 0000000000000000  R12: ffff93dac4040000
    R13: ffff93dac2848e00  R14: ffffb79140f73b60  R15: ffffb79140f73b30
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffb79140f73b08] _nfs41_proc_get_locations at ffffffffc0c73d53 [nfsv4]
 #6 [ffffb79140f73bf0] nfs4_proc_get_locations at ffffffffc0c83e90 [nfsv4]
 #7 [ffffb79140f73c60] nfs4_discover_trunking at ffffffffc0c83fb7 [nfsv4]
 #8 [ffffb79140f73cd8] nfs_probe_fsinfo at ffffffffc0c0f95f [nfs]
 #9 [ffffb79140f73da0] nfs_probe_server at ffffffffc0c1026a [nfs]
    RIP: 00007f6254fce26e  RSP: 00007ffc69496ac8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000000  RCX: 00007f6254fce26e
    RDX: 00005600220a82a0  RSI: 00005600220a64d0  RDI: 00005600220a6520
    RBP: 00007ffc69496c50   R8: 00005600220a8710   R9: 003035322e323231
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007ffc69496c50
    R13: 00005600220a8440  R14: 0000000000000010  R15: 0000560020650ef9
    ORIG_RAX: 00000000000000a5  CS: 0033  SS: 002b

Fixes: c3ed222745 ("NFSv4: Fix free of uninitialized nfs4_label on referral lookup.")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-06-30 16:13:00 -04:00
NeilBrown
080abad71e NFS: restore module put when manager exits.
Commit f49169c97f ("NFSD: Remove svc_serv_ops::svo_module") removed
calls to module_put_and_kthread_exit() from threads that acted as SUNRPC
servers and had a related svc_serv_ops structure.  This was correct.

It ALSO removed the module_put_and_kthread_exit() call from
nfs4_run_state_manager() which is NOT a SUNRPC service.

Consequently every time the NFSv4 state manager runs the module count
increments and won't be decremented.  So the nfsv4 module cannot be
unloaded.

So restore the module_put_and_kthread_exit() call.

Fixes: f49169c97f ("NFSD: Remove svc_serv_ops::svo_module")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-06-30 16:13:00 -04:00
Dylan Yudaken
09007af2b6 io_uring: fix provided buffer import
io_import_iovec uses the s pointer, but this was changed immediately
after the iovec was re-imported and so it was imported into the wrong
place.

Change the ordering.

Fixes: 2be2eb02e2 ("io_uring: ensure reads re-import for selected buffers")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630132006.2825668-1-dylany@fb.com
[axboe: ensure we don't half-import as well]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-30 11:34:41 -06:00
Linus Torvalds
9fb3bb25d1 Merge tag 'fsnotify_for_v5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fanotify fix from Jan Kara:
 "A fix for recently added fanotify API to have stricter checks and
  refuse some invalid flag combinations to make our life easier in the
  future"

* tag 'fsnotify_for_v5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fanotify: refine the validation checks on non-dir inode mask
2022-06-30 09:57:18 -07:00
Pavel Begunkov
29c1ac230e io_uring: keep sendrecv flags in ioprio
We waste a u64 SQE field for flags even though we don't need as many
bits and it can be used for something more useful later. Store io_uring
specific send/recv flags in sqe->ioprio instead of ->addr2.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Fixes: 0455d4ccec ("io_uring: add POLL_FIRST support for send/sendmsg and recv/recvmsg")
[axboe: change comment in io_uring.h as well]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-30 07:15:50 -06:00
Linus Torvalds
732f306943 six ksmbd server fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmK7GlQACgkQiiy9cAdy
 T1FIPQv/agDjMbpfqI4+3JrDzGofGyUgq31bQRwjKouM8dxC7iE0KRaerzDUYur1
 oZSaqidCZPiFTrROQ2Kuzf8YAKFS0XubaH3HTJd5yu2/WiApsSNwelcDYtrolq5T
 OnC3t5tpkVDUcx8ynHAGrYe0mEkY5+IIYkKU9O9JduPlDX4qC796Iel9ESg5LoDR
 0fx+HTdCr3OdfOipOEUl4HxmwWsx8u7RywmNPusaH21jbuC9w1LWSLBWuFIZywio
 bHKq57P9k/I9OXTGV7DuEaf2uvZd+ceH1ZahgXIvDdHzHM183ltdTWiyIbmtLDAj
 iYyos6UZsdK5IpN8+z9wY1zqUfj0btr+UVhbrXRwv5dIH+C0tnYCwDGNXszUh7Id
 XTeQWnaI+BaBA9VuqpvUtSXIJTsikaTYdnk4m/rLDYPaHVeuCKv1QkcitWpjHvIt
 FPlIvB95QMy6CrWtlGxjwDpNBLnca4NKigLTCESJaIKcvnn25kA8OtRG6tHbpvVM
 KiCkjH14
 =8zjT
 -----END PGP SIGNATURE-----

Merge tag '5.19-rc4-ksmbd-server-fixes' of git://git.samba.org/ksmbd

Pull ksmbd server fixes from Steve French:

 - seek null check (don't use f_seek op directly and blindly)

 - offset validation in FSCTL_SET_ZERO_DATA

 - fallocate fix (relates e.g. to xfstests generic/091 and 263)

 - two cleanup fixes

 - fix socket settings on some arch

* tag '5.19-rc4-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: use vfs_llseek instead of dereferencing NULL
  ksmbd: check invalid FileOffset and BeyondFinalZero in FSCTL_ZERO_DATA
  ksmbd: set the range of bytes to zero without extending file size in FSCTL_ZERO_DATA
  ksmbd: remove duplicate flag set in smb2_write
  ksmbd: smbd: Remove useless license text when SPDX-License-Identifier is already used
  ksmbd: use SOCK_NONBLOCK type for kernel_accept()
2022-06-29 09:20:40 -07:00
Jeff Layton
8692969e91 ceph: wait on async create before checking caps for syncfs
Currently, we'll call ceph_check_caps, but if we're still waiting
on the reply, we'll end up spinning around on the same inode in
flush_dirty_session_caps. Wait for the async create reply before
flushing caps.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/55823
Fixes: fbed7045f5 ("ceph: wait for async create reply before sending any cap messages")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-06-29 18:02:57 +02:00
Darrick J. Wong
8944c6fb8a xfs: dont treat rt extents beyond EOF as eofblocks to be cleared
On a system with a realtime volume and a 28k realtime extent,
generic/491 fails because the test opens a file on a frozen filesystem
and closing it causes xfs_release -> xfs_can_free_eofblocks to
mistakenly think that the the blocks of the realtime extent beyond EOF
are posteof blocks to be freed.  Realtime extents cannot be partially
unmapped, so this is pointless.  Worse yet, this triggers posteof
cleanup, which stalls on a transaction allocation, which is why the test
fails.

Teach the predicate to account for realtime extents properly.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-06-29 08:47:56 -07:00
Darrick J. Wong
e53bcffad0 xfs: don't hold xattr leaf buffers across transaction rolls
Now that we've established (again!) that empty xattr leaf buffers are
ok, we no longer need to bhold them to transactions when we're creating
new leaf blocks.  Get rid of the entire mechanism, which should simplify
the xattr code quite a bit.

The original justification for using bhold here was to prevent the AIL
from trying to write the empty leaf block into the fs during the brief
time that we release the buffer lock.  The reason for /that/ was to
prevent recovery from tripping over the empty ondisk block.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-06-29 08:47:56 -07:00