linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-10-27 22:51:31 +00:00

Author	SHA1	Message	Date
Ojaswin Mujoo	339fee69a1	ext4: fix rbtree traversal bug in ext4_mb_use_preallocated [ Upstream commit `9d3de7ee19` ] During allocations, while looking for preallocations(PA) in the per inode rbtree, we can't do a direct traversal of the tree because ext4_mb_discard_group_preallocation() can paralelly mark the pa deleted and that can cause direct traversal to skip some entries. This was leading to a BUG_ON() being hit [1] when we missed a PA that could satisfy our request and ultimately tried to create a new PA that would overlap with the missed one. To makes sure we handle that case while still keeping the performance of the rbtree, we make use of the fact that the only pa that could possibly overlap the original goal start is the one that satisfies the below conditions: 1. It must have it's logical start immediately to the left of (ie less than) original logical start. 2. It must not be deleted To find this pa we use the following traversal method: 1. Descend into the rbtree normally to find the immediate neighboring PA. Here we keep descending irrespective of if the PA is deleted or if it overlaps with our request etc. The goal is to find an immediately adjacent PA. 2. If the found PA is on right of original goal, use rb_prev() to find the left adjacent PA. 3. Check if this PA is deleted and keep moving left with rb_prev() until a non deleted PA is found. 4. This is the PA we are looking for. Now we can check if it can satisfy the original request and proceed accordingly. This approach also takes care of having deleted PAs in the tree. (While we are at it, also fix a possible overflow bug in calculating the end of a PA) [1] https://lore.kernel.org/linux-ext4/CA+G9fYv2FRpLqBZf34ZinR8bU2_ZRAUOjKAD3+tKRFaEQHtt8Q@mail.gmail.com/ Cc: stable@kernel.org # 6.4 Fixes: `3872778664` ("ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reviewed-by: Ritesh Harjani (IBM) ritesh.list@gmail.com Tested-by: Ritesh Harjani (IBM) ritesh.list@gmail.com Link: https://lore.kernel.org/r/edd2efda6a83e6343c5ace9deea44813e71dbe20.1690045963.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-08-03 10:25:47 +02:00
Ritesh Harjani	a0f2ba1866	ext4: mballoc: Remove useless setting of ac_criteria [ Upstream commit `569f196f1e` ] There will be changes coming in future patches which will introduce a new criteria for block allocation. This removes the useless setting of ac_criteria. AFAIU, this might be only used to differentiate between whether a preallocated blocks was allocated or was regular allocator called for allocating blocks. Hence this also adds the debug prints to identify what type of block allocation was done in ext4_mb_show_ac(). Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/1dbae05617519cb6202f1b299c9d1be3e7cda763.1685449706.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: `9d3de7ee19` ("ext4: fix rbtree traversal bug in ext4_mb_use_preallocated") Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-08-03 10:25:47 +02:00
Kemeng Shi	d254d88140	ext4: add EXT4_MB_HINT_GOAL_ONLY test in ext4_mb_use_preallocated [ Upstream commit `1eff590489` ] ext4_mb_use_preallocated will ignore the demand to alloc goal blocks, although the EXT4_MB_HINT_GOAL_ONLY is requested. For group pa, ext4_mb_group_or_file will not set EXT4_MB_HINT_GROUP_ALLOC if EXT4_MB_HINT_GOAL_ONLY is set. So we will not alloc goal blocks from group pa if EXT4_MB_HINT_GOAL_ONLY is set. For inode pa, ext4_mb_pa_goal_check is added to check if free extent in found inode pa meets goal blocks when EXT4_MB_HINT_GOAL_ONLY is set. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Suggested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230603150327.3596033-6-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: `9d3de7ee19` ("ext4: fix rbtree traversal bug in ext4_mb_use_preallocated") Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-08-03 10:25:47 +02:00
Kemeng Shi	08bd62023f	ext4: fix wrong unit use in ext4_mb_new_blocks commit `2ec6d0a5ea` upstream. Function ext4_free_blocks_simple needs count in cluster. Function ext4_free_blocks accepts count in block. Convert count to cluster to fix the mismatch. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: stable@kernel.org Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230603150327.3596033-12-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-07-23 13:53:56 +02:00
Kemeng Shi	a6cff82d89	ext4: get block from bh in ext4_free_blocks for fast commit replay commit `11b6890be0` upstream. ext4_free_blocks will retrieve block from bh if block parameter is zero. Retrieve block before ext4_free_blocks_simple to avoid potentially passing wrong block to ext4_free_blocks_simple. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: stable@kernel.org Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230603150327.3596033-9-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-07-23 13:53:56 +02:00
Kemeng Shi	ed53b2418b	ext4: fix wrong unit use in ext4_mb_clear_bb commit `247c3d214c` upstream. Function ext4_issue_discard need count in cluster. Pass count_clusters instead of count to fix the mismatch. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: stable@kernel.org Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230603150327.3596033-11-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-07-23 13:53:56 +02:00
Ojaswin Mujoo	3582e74599	Revert "ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits" This reverts commit `32c0869370`. The reverted commit was intended to remove a dead check however it was observed that this check was actually being used to exit early instead of looping sbi->s_mb_max_to_scan times when we are able to find a free extent bigger than the goal extent. Due to this, a my performance tests (fsmark, parallel file writes in a highly fragmented FS) were seeing a 2x-3x regression. Example, the default value of the following variables is: sbi->s_mb_max_to_scan = 200 sbi->s_mb_min_to_scan = 10 In ext4_mb_check_limits() if we find an extent smaller than goal, then we return early and try again. This loop will go on until we have processed sbi->s_mb_max_to_scan(=200) number of free extents at which point we exit and just use whatever we have even if it is smaller than goal extent. Now, the regression comes when we find an extent bigger than goal. Earlier, in this case we would loop only sbi->s_mb_min_to_scan(=10) times and then just use the bigger extent. However with commit `32c08693` that check was removed and hence we would loop sbi->s_mb_max_to_scan(=200) times even though we have a big enough free extent to satisfy the request. The only time we would exit early would be when the free extent is exactly the size of our goal, which is pretty uncommon occurrence and so we would almost always end up looping 200 times. Hence, revert the commit by adding the check back to fix the regression. Also add a comment to outline this policy. Fixes: `32c0869370` ("ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/ddcae9658e46880dfec2fb0aa61d01fb3353d202.1685449706.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-06-02 14:47:29 -04:00
Theodore Ts'o	463808f237	ext4: remove a BUG_ON in ext4_mb_release_group_pa() If a malicious fuzzer overwrites the ext4 superblock while it is mounted such that the s_first_data_block is set to a very large number, the calculation of the block group can underflow, and trigger a BUG_ON check. Change this to be an ext4_warning so that we don't crash the kernel. Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230430154311.579720-3-tytso@mit.edu Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220 Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-05-13 18:05:04 -04:00
Theodore Ts'o	5354b2af34	ext4: allow ext4_get_group_info() to fail Previously, ext4_get_group_info() would treat an invalid group number as BUG(), since in theory it should never happen. However, if a malicious attaker (or fuzzer) modifies the superblock via the block device while it is the file system is mounted, it is possible for s_first_data_block to get set to a very large number. In that case, when calculating the block group of some block number (such as the starting block of a preallocation region), could result in an underflow and very large block group number. Then the BUG_ON check in ext4_get_group_info() would fire, resutling in a denial of service attack that can be triggered by root or someone with write access to the block device. For a quality of implementation perspective, it's best that even if the system administrator does something that they shouldn't, that it will not trigger a BUG. So instead of BUG'ing, ext4_get_group_info() will call ext4_error and return NULL. We also add fallback code in all of the callers of ext4_get_group_info() that it might NULL. Also, since ext4_get_group_info() was already borderline to be an inline function, un-inline it. The results in a next reduction of the compiled text size of ext4 by roughly 2k. Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230430154311.579720-2-tytso@mit.edu Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>	2023-05-13 18:02:46 -04:00
Ojaswin Mujoo	361eb69fc9	ext4: Remove the logic to trim inode PAs Earlier, inode PAs were stored in a linked list. This caused a need to periodically trim the list down inorder to avoid growing it to a very large size, as this would severly affect performance during list iteration. Recent patches changed this list to an rbtree, and since the tree scales up much better, we no longer need to have the trim functionality, hence remove it. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/c409addceaa3ade4b40328e28e3b54b2f259689e.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:13 -04:00
Ojaswin Mujoo	3872778664	ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list Currently, the kernel uses i_prealloc_list to hold all the inode preallocations. This is known to cause degradation in performance in workloads which perform large number of sparse writes on a single file. This is mainly because functions like ext4_mb_normalize_request() and ext4_mb_use_preallocated() iterate over this complete list, resulting in slowdowns when large number of PAs are present. Patch `27bc446e2` partially fixed this by enforcing a limit of 512 for the inode preallocation list and adding logic to continually trim the list if it grows above the threshold, however our testing revealed that a hardcoded value is not suitable for all kinds of workloads. To optimize this, add an rbtree to the inode and hold the inode preallocations in this rbtree. This will make iterating over inode PAs faster and scale much better than a linked list. Additionally, we also had to remove the LRU logic that was added during trimming of the list (in ext4_mb_release_context()) as it will add extra overhead in rbtree. The discards now happen in the lowest-logical-offset-first order. Locking notes With the introduction of rbtree to maintain inode PAs, we can't use RCU to walk the tree for searching since it can result in partial traversals which might miss some nodes(or entire subtrees) while discards happen in parallel (which happens under a lock). Hence this patch converts the ei->i_prealloc_lock spin_lock to rw_lock. Almost all the codepaths that read/modify the PA rbtrees are protected by the higher level inode->i_data_sem (except ext4_mb_discard_group_preallocations() and ext4_clear_inode()) IIUC, the only place we need lock protection is when one thread is reading "searching" the PA rbtree (earlier protected under rcu_read_lock()) and another is "deleting" the PAs in ext4_mb_discard_group_preallocations() function (which iterates all the PAs using the grp->bb_prealloc_list and deletes PAs from the tree without taking any inode lock (i_data_sem)). So, this patch converts all rcu_read_lock/unlock() paths for inode list PA to use read_lock() and all places where we were using ei->i_prealloc_lock spinlock will now be using write_lock(). Note that this makes the fast path (searching of the right PA e.g. ext4_mb_use_preallocated() or ext4_mb_normalize_request()), now use read_lock() instead of rcu_read_lock/unlock(). Ths also will now block due to slow discard path (ext4_mb_discard_group_preallocations()) which uses write_lock(). But this is not as bad as it looks. This is because - 1. The slow path only occurs when the normal allocation failed and we can say that we are low on disk space. One can argue this scenario won't be much frequent. 2. ext4_mb_discard_group_preallocations(), locks and unlocks the rwlock for deleting every individual PA. This gives enough opportunity for the fast path to acquire the read_lock for searching the PA inode list. Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/4137bce8f6948fedd8bae134dabae24acfe699c6.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:13 -04:00
Ojaswin Mujoo	a8e38fd37c	ext4: Convert pa->pa_inode_list and pa->pa_obj_lock into a union Splitting pa->pa_inode_list Currently, we use the same pa->pa_inode_list to add a pa to either the inode preallocation list or the locality group preallocation list. For better clarity, split this list into a union of 2 list_heads and use either of the them based on the type of pa. Splitting pa->pa_obj_lock Currently, pa->pa_obj_lock is either assigned &ei->i_prealloc_lock for inode PAs or lg_prealloc_lock for lg PAs, and is then used to lock the lists containing these PAs. Make the distinction between the 2 PA types clear by changing this lock to a union of 2 locks. Explicitly use the pa_lock_node.inode_lock for inode PAs and pa_lock_node.lg_lock for lg PAs. This patch is required so that the locality group preallocation code remains the same as in upcoming patches we are going to make changes to inode preallocation code to move from list to rbtree based implementation. This patch also makes it easier to review the upcoming patches. There are no functional changes in this patch. Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/1d7ac0557e998c3fc7eef422b52e4bc67bdef2b0.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:13 -04:00
Ojaswin Mujoo	93cdf49f6e	ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa() When the length of best extent found is less than the length of goal extent we need to make sure that the best extent atleast covers the start of the original request. This is done by adjusting the ac_b_ex.fe_logical (logical start) of the extent. While doing so, the current logic sometimes results in the best extent's logical range overflowing the goal extent. Since this best extent is later added to the inode preallocation list, we have a possibility of introducing overlapping preallocations. This is discussed in detail here [1]. As per Jan's suggestion, to fix this, replace the existing logic with the below logic for adjusting best extent as it keeps fragmentation in check while ensuring logical range of best extent doesn't overflow out of goal extent: 1. Check if best extent can be kept at end of goal range and still cover original start. 2. Else, check if best extent can be kept at start of goal range and still cover original start. 3. Else, keep the best extent at start of original request. Also, add a few extra BUG_ONs that might help catch errors faster. [1] https://lore.kernel.org/r/Y+OGkVvzPN0RMv0O@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/f96aca6d415b36d1f90db86c1a8cd7e2e9d7ab0e.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:13 -04:00
Ojaswin Mujoo	0830344c95	ext4: Abstract out overlap fix/check logic in ext4_mb_normalize_request() Abstract out the logic of fixing PA overlaps in ext4_mb_normalize_request to improve readability of code. This also makes it easier to make changes to the overlap logic in future. There are no functional changes in this patch Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/9b35f3955a1d7b66bbd713eca1e63026e01f78c1.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:12 -04:00
Ojaswin Mujoo	7692094ac5	ext4: Move overlap assert logic into a separate function Abstract out the logic to double check for overlaps in normalize_pa to a separate function. Since there has been no reports in past where we have seen any overlaps which hits this bug_on(), in future we can consider calling this function under "#ifdef AGGRESSIVE_CHECK" only. There are no functional changes in this patch Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/35dd5d94fa0b2d1cd2d2947adf8967279c72967d.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:12 -04:00
Ojaswin Mujoo	bcf4349921	ext4: Refactor code in ext4_mb_normalize_request() and ext4_mb_use_preallocated() Change some variable names to be more consistent and refactor some of the code to make it easier to read. There are no functional changes in this patch Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/8edcab489c06cf861b19d87207d9b0ff7ac7f3c1.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:12 -04:00
Ojaswin Mujoo	820897258a	ext4: Refactor code related to freeing PAs This patch makes the following changes: * Rename ext4_mb_pa_free to ext4_mb_pa_put_free to better reflect its purpose * Add new ext4_mb_pa_free() which only handles freeing * Refactor ext4_mb_pa_callback() to use ext4_mb_pa_free() There are no functional changes in this patch Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/b273bc9cbf5bd278f641fa5bc6c0cc9e6cb3330c.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:12 -04:00
Ojaswin Mujoo	e86a718228	ext4: Stop searching if PA doesn't satisfy non-extent file If we come across a PA that matches the logical offset but is unable to satisfy a non-extent file due to its physical start being higher than that supported by non extent files, then simply stop searching for another PA and break out of loop. This is because, since PAs don't overlap, we won't be able to find another inode PA which can satisfy the original request. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/42404ca29bd304ae2c962184c3c32a02e8eefcd0.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:12 -04:00
Theodore Ts'o	19b8b035a7	ext4: convert some BUG_ON's in mballoc to use WARN_RATELIMITED instead In cases where we have an obvious way of continuing, let's use WARN_RATELIMITED() instead of BUG_ON(). Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:12 -04:00
Kemeng Shi	91a48aaf59	ext4: avoid unnecessary pointer dereference in ext4_mb_normalize_request Result of EXT4_SB(ac->ac_sb) is already stored in sbi at beginning of ext4_mb_normalize_request. Use sbi instead of EXT4_SB(ac->ac_sb) to remove unnecessary pointer dereference. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230311170949.1047958-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:12 -04:00
Kemeng Shi	1221b23501	ext4: fix typos in mballoc pa_plen -> pa_len pa_start -> pa_pstart Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230311170949.1047958-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:12 -04:00
Kemeng Shi	253cacb0de	ext4: simplify calculation of blkoff in ext4_mb_new_blocks_simple We try to allocate a block from goal in ext4_mb_new_blocks_simple. We only need get blkoff in first group with goal and set blkoff to 0 for the rest groups. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230303172120.3800725-21-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:12 -04:00
Kemeng Shi	46825e9490	ext4: remove comment code ext4_discard_preallocations Just remove comment code in ext4_discard_preallocations. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-20-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:12 -04:00
Kemeng Shi	3a037b1b88	ext4: remove repeat assignment to ac_f_ex Call trace to assign ac_f_ex: ext4_mb_use_best_found ac->ac_f_ex = ac->ac_b_ex; ext4_mb_new_preallocation ext4_mb_new_group_pa ac->ac_f_ex = ac->ac_b_ex; ext4_mb_new_inode_pa ac->ac_f_ex = ac->ac_b_ex; Actually allocated blocks is already stored in ac_f_ex in ext4_mb_use_best_found, so there is no need to assign ac_f_ex in ext4_mb_new_group_pa and ext4_mb_new_inode_pa. Just remove repeat assignment to ac_f_ex in ext4_mb_new_group_pa and ext4_mb_new_inode_pa. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-19-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:12 -04:00
Kemeng Shi	fb28f9ceec	ext4: remove unnecessary goto in ext4_mb_mark_diskspace_used When ext4_read_block_bitmap fails, we can return PTR_ERR(bitmap_bh) to remove unnecessary NULL check of bitmap_bh. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230303172120.3800725-18-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:12 -04:00
Kemeng Shi	c7f2bafa3c	ext4: remove unnecessary count2 in ext4_free_data_in_buddy count2 is always 1 in mb_debug, just remove unnecessary count2. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-17-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:12 -04:00
Kemeng Shi	df11909514	ext4: remove unnecessary exit_meta_group_info tag We goto exit_meta_group_info only to return -ENOMEM. Return -ENOMEM directly instead of goto to remove this unnecessary tag. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-16-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:11 -04:00
Kemeng Shi	78dc9f844f	ext4: use best found when complex scan of group finishs If any bex which meets bex->fe_len >= gex->fe_len is found, then it will always be used when complex scan of group that bex belongs to finishs. So there will not be any lock-unlock period. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-15-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:11 -04:00
Kemeng Shi	32c0869370	ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits Only call trace of ext4_mb_check_limits is as following: ext4_mb_complex_scan_group ext4_mb_measure_extent ext4_mb_check_limits(ac, e4b, 0); ext4_mb_check_limits(ac, e4b, 1); If the first ac->ac_found > sbi->s_mb_max_to_scan check in ext4_mb_check_limits is met, we will set ac_status to AC_STATUS_BREAK and call ext4_mb_try_best_found to try to use ac->ac_b_ex. If ext4_mb_try_best_found successes, then block allocation finishs, the removed ac->ac_found > sbi->s_mb_min_to_scan check is not reachable. If ext4_mb_try_best_found fails, then we set EXT4_MB_HINT_FIRST and reset ac->ac_b_ex to retry block allocation. We will use any found free extent in ext4_mb_measure_extent before reach the removed ac->ac_found > sbi->s_mb_min_to_scan check. In summary, the removed ac->ac_found > sbi->s_mb_min_to_scan check is not reachable and we can remove that dead check. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-14-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:11 -04:00
Kemeng Shi	976620bd26	ext4: remove dead check in mb_buddy_mark_free We always adjust first to even number and adjust last to odd number, so first == last will never happen. Remove this dead check. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230303172120.3800725-13-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:11 -04:00
Kemeng Shi	aaae558dae	ext4: remove unnecessary check in ext4_mb_new_blocks 1. remove unnecessary ac check: We always go to out tag before ac is successfully allocated, then we can move out tag after free of ac and remove NULL check of ac. 2. remove unnecessary errp check: We always go to errout tag if errp is non-zero, then we can move errout tag into error handle if *errp is non-zero. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-12-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:11 -04:00
Kemeng Shi	285164b801	ext4: remove unnecessary e4b->bd_buddy_page check in ext4_mb_load_buddy_gfp e4b->bd_buddy_page is only set if we initialize ext4_buddy successfully. So e4b->bd_buddy_page is always NULL in error handle branch. Just remove the dead check. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-11-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:11 -04:00
Kemeng Shi	139f46d3b5	ext4: Remove unnecessary release when memory allocation failed in ext4_mb_init_cache If we alloc array of buffer_head failed, there is no resource need to be freed and we can simpily return error. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-10-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:11 -04:00
Kemeng Shi	85b67ffb7d	ext4: remove unused return value of ext4_mb_try_best_found and ext4_mb_free_metadata Return value static function ext4_mb_try_best_found and ext4_mb_free_metadata is not used. Just remove unused return value. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-9-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:11 -04:00
Kemeng Shi	1b5c9d3494	ext4: add missed brelse in ext4_free_blocks_simple Release bitmap buffer_head we got if error occurs. Besides, this patch remove unused assignment to err. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-8-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:11 -04:00
Kemeng Shi	36cb0f52ae	ext4: protect pa->pa_free in ext4_discard_allocated_blocks If ext4_mb_mark_diskspace_used fails in ext4_mb_new_blocks, we may discard pa already in list. Protect pa with pa_lock to avoid race. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-7-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:11 -04:00
Kemeng Shi	1afdc58894	ext4: correct start of used group pa for debug in ext4_mb_use_group_pa As we don't correct pa_lstart here, so there is no need to subtract pa_lstart with consumed len. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-6-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:11 -04:00
Kemeng Shi	abc075d4a5	ext4: correct calculation of s_mb_preallocated We will add pa_free to s_mb_preallocated when new ext4_prealloc_space is created. In ext4_mb_new_inode_pa, we will call ext4_mb_use_inode_pa before adding pa_free to s_mb_preallocated. However, ext4_mb_use_inode_pa will consume pa_free for block allocation which triggerred the creation of ext4_prealloc_space. Add pa_free to s_mb_preallocated before ext4_mb_use_inode_pa to correct calculation of s_mb_preallocated. There is no such problem in ext4_mb_new_group_pa as pa_free of group pa is consumed in ext4_mb_release_context instead of ext4_mb_use_group_pa. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:11 -04:00
Kemeng Shi	22fab98402	ext4: get correct ext4_group_info in ext4_mb_prefetch_fini We always get ext4_group_desc with group + 1 and ext4_group_info with group to check if we need do initialize ext4_group_info for the group. Just get ext4_group_desc with group for ext4_group_info initialization check. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230303172120.3800725-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:11 -04:00
Kemeng Shi	01e4ca2945	ext4: allow to find by goal if EXT4_MB_HINT_GOAL_ONLY is set If EXT4_MB_HINT_GOAL_ONLY is set, ext4_mb_regular_allocator will only allocate blocks from ext4_mb_find_by_goal. Allow to find by goal in ext4_mb_find_by_goal if EXT4_MB_HINT_GOAL_ONLY is set or allocation with EXT4_MB_HINT_GOAL_ONLY set will always fail. EXT4_MB_HINT_GOAL_ONLY is not used at all, so the problem is not found for now. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230303172120.3800725-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:10 -04:00
Kemeng Shi	b07ffe6927	ext4: set goal start correctly in ext4_mb_normalize_request We need to set ac_g_ex to notify the goal start used in ext4_mb_find_by_goal. Set ac_g_ex instead of ac_f_ex in ext4_mb_normalize_request. Besides we should assure goal start is in range [first_data_block, blocks_count) as ext4_mb_initialize_context does. [ Added a check to make sure size is less than ar->pright; otherwise we could end up passing an underflowed value of ar->pright - size to ext4_get_group_no_and_offset(), which will trigger a BUG_ON later on. - TYT ] Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230303172120.3800725-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 01:13:03 -04:00
Kemeng Shi	1df9bde48f	ext4: remove unused group parameter in ext4_block_bitmap_csum_set Remove unused group parameter in ext4_block_bitmap_csum_set. After this, group parameter in ext4_set_bitmap_checksums is also not used, just remove it too. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230221203027.2359920-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-03-23 23:00:08 -04:00
Guoqing Jiang	d73eff68a8	ext4: make ext4_mb_initialize_context return void Change the return type to void since it always return 0, and no need to do the checking in ext4_mb_new_blocks. Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20221202120409.24098-1-guoqing.jiang@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-12-08 21:49:25 -05:00
Jan Kara	a078dff870	ext4: fixup possible uninitialized variable access in ext4_mb_choose_next_group_cr1() Variable 'grp' may be left uninitialized if there's no group with suitable average fragment size (or larger). Fix the problem by initializing it earlier. Link: https://lore.kernel.org/r/20220922091542.pkhedytey7wzp5fi@quack3 Fixes: `83e80a6e35` ("ext4: use buckets for cr 1 block scan instead of rbtree") Cc: stable@kernel.org Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-09-26 13:21:05 -04:00
Theodore Ts'o	80fa46d6b9	ext4: limit the number of retries after discarding preallocations blocks This patch avoids threads live-locking for hours when a large number threads are competing over the last few free extents as they blocks getting added and removed from preallocation pools. From our bug reporter: A reliable way for triggering this has multiple writers continuously write() to files when the filesystem is full, while small amounts of space are freed (e.g. by truncating a large file -1MiB at a time). In the local filesystem, this can be done by simply not checking the return code of write (0) and/or the error (ENOSPACE) that is set. Over NFS with an async mount, even clients with proper error checking will behave this way since the linux NFS client implementation will not propagate the server errors [the write syscalls immediately return success] until the file handle is closed. This leads to a situation where NFS clients send a continuous stream of WRITE rpcs which result in ERRNOSPACE -- but since the client isn't seeing this, the stream of writes continues at maximum network speed. When some space does appear, multiple writers will all attempt to claim it for their current write. For NFS, we may see dozens to hundreds of threads that do this. The real-world scenario of this is database backup tooling (in particular, github.com/mdkent/percona-xtrabackup) which may write large files (>1TiB) to NFS for safe keeping. Some temporary files are written, rewound, and read back -- all before closing the file handle (the temp file is actually unlinked, to trigger automatic deletion on close/crash.) An application like this operating on an async NFS mount will not see an error code until TiB have been written/read. The lockup was observed when running this database backup on large filesystems (64 TiB in this case) with a high number of block groups and no free space. Fragmentation is generally not a factor in this filesystem (~thousands of large files, mostly contiguous except for the parts written while the filesystem is at capacity.) Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2022-09-22 10:51:19 -04:00
Jan Kara	83e80a6e35	ext4: use buckets for cr 1 block scan instead of rbtree Using rbtree for sorting groups by average fragment size is relatively expensive (needs rbtree update on every block freeing or allocation) and leads to wide spreading of allocations because selection of block group is very sentitive both to changes in free space and amount of blocks allocated. Furthermore selecting group with the best matching average fragment size is not necessary anyway, even more so because the variability of fragment sizes within a group is likely large so average is not telling much. We just need a group with large enough average fragment size so that we have high probability of finding large enough free extent and we don't want average fragment size to be too big so that we are likely to find free extent only somewhat larger than what we need. So instead of maintaing rbtree of groups sorted by fragment size keep bins (lists) or groups where average fragment size is in the interval [2^i, 2^(i+1)). This structure requires less updates on block allocation / freeing, generally avoids chaotic spreading of allocations into block groups, and still is able to quickly (even faster that the rbtree) provide a block group which is likely to have a suitably sized free space extent. This patch reduces number of block groups used when untarring archive with medium sized files (size somewhat above 64k which is default mballoc limit for avoiding locality group preallocation) to about half and thus improves write speeds for eMMC flash significantly. Fixes: `196e402adf` ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Link: https://lore.kernel.org/r/20220908092136.11770-5-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-09-21 22:12:03 -04:00
Jan Kara	a9f2a2931d	ext4: use locality group preallocation for small closed files Curently we don't use any preallocation when a file is already closed when allocating blocks (from writeback code when converting delayed allocation). However for small files, using locality group preallocation is actually desirable as that is not specific to a particular file. Rather it is a method to pack small files together to reduce fragmentation and for that the fact the file is closed is actually even stronger hint the file would benefit from packing. So change the logic to allow locality group preallocation in this case. Fixes: `196e402adf` ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Link: https://lore.kernel.org/r/20220908092136.11770-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-09-21 22:12:00 -04:00
Jan Kara	1940265ede	ext4: avoid unnecessary spreading of allocations among groups mb_set_largest_free_order() updates lists containing groups with largest chunk of free space of given order. The way it updates it leads to always moving the group to the tail of the list. Thus allocations looking for free space of given order effectively end up cycling through all groups (and due to initialization in last to first order). This spreads allocations among block groups which reduces performance for rotating disks or low-end flash media. Change mb_set_largest_free_order() to only update lists if the order of the largest free chunk in the group changed. Fixes: `196e402adf` ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Link: https://lore.kernel.org/r/20220908092136.11770-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-09-21 22:11:41 -04:00
Jan Kara	4fca50d440	ext4: make mballoc try target group first even with mb_optimize_scan One of the side-effects of mb_optimize_scan was that the optimized functions to select next group to try were called even before we tried the goal group. As a result we no longer allocate files close to corresponding inodes as well as we don't try to expand currently allocated extent in the same group. This results in reaim regression with workfile.disk workload of upto 8% with many clients on my test machine: baseline mb_optimize_scan Hmean disk-1 2114.16 ( 0.00%) 2099.37 ( -0.70%) Hmean disk-41 87794.43 ( 0.00%) 83787.47 * -4.56%* Hmean disk-81 148170.73 ( 0.00%) 135527.05 * -8.53%* Hmean disk-121 177506.11 ( 0.00%) 166284.93 * -6.32%* Hmean disk-161 220951.51 ( 0.00%) 207563.39 * -6.06%* Hmean disk-201 208722.74 ( 0.00%) 203235.59 ( -2.63%) Hmean disk-241 222051.60 ( 0.00%) 217705.51 ( -1.96%) Hmean disk-281 252244.17 ( 0.00%) 241132.72 * -4.41%* Hmean disk-321 255844.84 ( 0.00%) 245412.84 * -4.08%* Also this is causing huge regression (time increased by a factor of 5 or so) when untarring archive with lots of small files on some eMMC storage cards. Fix the problem by making sure we try goal group first. Fixes: `196e402adf` ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/all/20220727105123.ckwrhbilzrxqpt24@quack3/ Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220908092136.11770-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-09-21 22:11:34 -04:00
Lukas Czerner	1e1c2b86ef	ext4: block range must be validated before use in ext4_mb_clear_bb() Block range to free is validated in ext4_free_blocks() using ext4_inode_block_valid() and then it's passed to ext4_mb_clear_bb(). However in some situations on bigalloc file system the range might be adjusted after the validation in ext4_free_blocks() which can lead to troubles on corrupted file systems such as one found by syzkaller that resulted in the following BUG kernel BUG at fs/ext4/ext4.h:3319! PREEMPT SMP NOPTI CPU: 28 PID: 4243 Comm: repro Kdump: loaded Not tainted 5.19.0-rc6+ #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014 RIP: 0010:ext4_free_blocks+0x95e/0xa90 Call Trace: <TASK> ? lock_timer_base+0x61/0x80 ? __es_remove_extent+0x5a/0x760 ? __mod_timer+0x256/0x380 ? ext4_ind_truncate_ensure_credits+0x90/0x220 ext4_clear_blocks+0x107/0x1b0 ext4_free_data+0x15b/0x170 ext4_ind_truncate+0x214/0x2c0 ? _raw_spin_unlock+0x15/0x30 ? ext4_discard_preallocations+0x15a/0x410 ? ext4_journal_check_start+0xe/0x90 ? __ext4_journal_start_sb+0x2f/0x110 ext4_truncate+0x1b5/0x460 ? __ext4_journal_start_sb+0x2f/0x110 ext4_evict_inode+0x2b4/0x6f0 evict+0xd0/0x1d0 ext4_enable_quotas+0x11f/0x1f0 ext4_orphan_cleanup+0x3de/0x430 ? proc_create_seq_private+0x43/0x50 ext4_fill_super+0x295f/0x3ae0 ? snprintf+0x39/0x40 ? sget_fc+0x19c/0x330 ? ext4_reconfigure+0x850/0x850 get_tree_bdev+0x16d/0x260 vfs_get_tree+0x25/0xb0 path_mount+0x431/0xa70 __x64_sys_mount+0xe2/0x120 do_syscall_64+0x5b/0x80 ? do_user_addr_fault+0x1e2/0x670 ? exc_page_fault+0x70/0x170 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7fdf4e512ace Fix it by making sure that the block range is properly validated before used every time it changes in ext4_free_blocks() or ext4_mb_clear_bb(). Link: https://syzkaller.appspot.com/bug?id=5266d464285a03cee9dbfda7d2452a72c3c2ae7c Reported-by: syzbot+15cd994e273307bf5cfa@syzkaller.appspotmail.com Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Tadeusz Struk <tadeusz.struk@linaro.org> Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org> Link: https://lore.kernel.org/r/20220714165903.58260-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-08-02 23:56:25 -04:00

1 2 3 4 5 ...

531 commits