linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-08-29 04:09:39 +00:00

Author	SHA1	Message	Date
Dmitry Monakhov	8c85447391	ext4: reimplement uninit extent optimization for move_extent_per_page() Uninitialized extent may became initialized(parallel writeback task) at any moment after we drop i_data_sem, so we have to recheck extent's state after we hold page's lock and i_data_sem. If we about to change page's mapping we must hold page's lock in order to serialize other users. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-26 12:54:52 -04:00
Dmitry Monakhov	bb55748805	ext4: clean up online defrag bugs in move_extent_per_page() Non-full list of bugs: 1) uninitialized extent optimization does not hold page's lock, and simply replace brunches after that writeback code goes crazy because block mapping changed under it's feets kernel BUG at fs/ext4/inode.c:1434! ( 288'th xfstress) 2) uninitialized extent may became initialized right after we drop i_data_sem, so extent state must be rechecked 3) Locked pages goes uptodate via following sequence: ->readpage(page); lock_page(page); use_that_page(page) But after readpage() one may invalidate it because it is uptodate and unlocked (reclaimer does that) As result kernel bug at include/linux/buffer_head.c:133! 4) We call write_begin() with already opened stansaction which result in following deadlock: ->move_extent_per_page() ->ext4_journal_start()-> hold journal transaction ->write_begin() ->ext4_da_write_begin() ->ext4_nonda_switch() ->writeback_inodes_sb_if_idle() --> will wait for journal_stop() 5) try_to_release_page() may fail and it does fail if one of page's bh was pinned by journal 6) If we about to change page's mapping we MUST hold it's lock during entire remapping procedure, this is true for both pages(original and donor one) Fixes: - Avoid (1) and (2) simply by temproraly drop uninitialized extent handling optimization, this will be reimplemented later. - Fix (3) by manually forcing page to uptodate state w/o dropping it's lock - Fix (4) by rearranging existing locking: from: journal_start(); ->write_begin to: write_begin(); journal_extend() - Fix (5) simply by checking retvalue - Fix (6) by locking both (original and donor one) pages during extent swap with help of mext_page_double_lock() Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-26 12:52:07 -04:00
Dmitry Monakhov	f066055a34	ext4: online defrag is not supported for journaled files Proper block swap for inodes with full journaling enabled is truly non obvious task. In order to be on a safe side let's explicitly disable it for now. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-09-26 12:32:54 -04:00
Dmitry Monakhov	03bd8b9b89	ext4: move_extent code cleanup - Remove usless checks, because it is too late to check that inode != NULL at the moment it was referenced several times. - Double lock routines looks very ugly and locking ordering relays on order of i_ino, but other kernel code rely on order of pointers. Let's make them simple and clean. - check that inodes belongs to the same SB as soon as possible. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-09-26 12:32:19 -04:00
Tao Ma	0acdb8876f	ext4: don't call update_backups() multiple times for the same bg When performing an online resize, we add a bunch of groups at one time in ext4_flex_group_add, so in most cases a lot of group descriptors will be in the same group block. But in the end of this function, update_backups will be called for every group descriptor and the same block will be copied and journalled again and again. It is really a waste. Fix things so we only update a particular bg descriptor block once and skip subsequent updates of the same block. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-26 00:08:57 -04:00
Dmitry Monakhov	7f1468d1d5	ext4: fix double unlock buffer mess during fs-resize bh_submit_read() is responsible for unlock bh on endio. In addition, we need to use bh_uptodate_or_lock() to avoid races. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-25 23:19:25 -04:00
Yongqiang Yang	f2a09af645	ext4: check free inode count before allocating an inode Recently, I ecountered some corrupted filesystems in which some groups' free inode counts were 65535, it seemed that free inode count was overflow. This patch teaches ext4 to check free inode count before allocaing an inode. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-23 23:16:03 -04:00
Yongqiang Yang	838cd0cf9a	ext4: check free block counters in ext4_mb_find_by_goal Free block counters should be checked before doing allocation. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-23 23:10:51 -04:00
Herton Ronaldo Krzesinski	50df9fd55e	ext4: fix crash when accessing /proc/mounts concurrently The crash was caused by a variable being erronously declared static in token2str(). In addition to /proc/mounts, the problem can also be easily replicated by accessing /proc/fs/ext4/<partition>/options in parallel: $ cat /proc/fs/ext4/<partition>/options > options.txt ... and then running the following command in two different terminals: $ while diff /proc/fs/ext4/<partition>/options options.txt; do true; done This is also the cause of the following a crash while running xfstests #234, as reported in the following bug reports: https://bugs.launchpad.net/bugs/1053019 https://bugzilla.kernel.org/show_bug.cgi?id=47731 Signed-off-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Brad Figg <brad.figg@canonical.com> Cc: stable@vger.kernel.org	2012-09-23 22:49:12 -04:00
Tao Ma	bef53b01fa	ext4: remove erroneous ext4_superblock_csum_set() in update_backups() The update_backups() function is used to backup all the metadata blocks, so we should not take it for granted that 'data' is pointed to a super block and use ext4_superblock_csum_set to calculate the checksum there. In case where the data is a group descriptor block, it will corrupt the last group descriptor, and then e2fsck will complain about it it. As all the metadata checksums should already be OK when we do the backup, remove the wrong ext4_superblock_csum_set and it should be just fine. Reported-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-09-20 11:35:38 -04:00
Theodore Ts'o	00d4e7362e	ext4: fix potential deadlock in ext4_nonda_switch() In ext4_nonda_switch(), if the file system is getting full we used to call writeback_inodes_sb_if_idle(). The problem is that we can be holding i_mutex already, and this causes a potential deadlock when writeback_inodes_sb_if_idle() when it tries to take s_umount. (See lockdep output below). As it turns out we don't need need to hold s_umount; the fact that we are in the middle of the write(2) system call will keep the superblock pinned. Unfortunately writeback_inodes_sb() checks to make sure s_umount is taken, and the VFS uses a different mechanism for making sure the file system doesn't get unmounted out from under us. The simplest way of dealing with this is to just simply grab s_umount using a trylock, and skip kicking the writeback flusher thread in the very unlikely case that we can't take a read lock on s_umount without blocking. Also, we now check the cirteria for kicking the writeback thread before we decide to whether to fall back to non-delayed writeback, so if there are any outstanding delayed allocation writes, we try to get them resolved as soon as possible. [ INFO: possible circular locking dependency detected ] 3.6.0-rc1-00042-gce894ca #367 Not tainted ------------------------------------------------------- dd/8298 is trying to acquire lock: (&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46 but task is already holding lock: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3 which lock already depends on the new lock. 2 locks held by dd/8298: #0: (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3 #1: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3 stack backtrace: Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca #367 Call Trace: [<c015b79c>] ? console_unlock+0x345/0x372 [<c06d62a1>] print_circular_bug+0x190/0x19d [<c019906c>] __lock_acquire+0x86d/0xb6c [<c01999db>] ? mark_held_locks+0x5c/0x7b [<c0199724>] lock_acquire+0x66/0xb9 [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46 [<c06db935>] down_read+0x28/0x58 [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46 [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46 [<c026f3b2>] ext4_nonda_switch+0xe1/0xf4 [<c0271ece>] ext4_da_write_begin+0x27/0x193 [<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb [<c01ddc47>] __generic_file_aio_write+0x1dd/0x205 [<c01ddce7>] generic_file_aio_write+0x78/0xd3 [<c026d336>] ext4_file_write+0x480/0x4a6 [<c0198c1d>] ? __lock_acquire+0x41e/0xb6c [<c0180944>] ? sched_clock_cpu+0x11a/0x13e [<c01967e9>] ? trace_hardirqs_off+0xb/0xd [<c018099f>] ? local_clock+0x37/0x4e [<c0209f2c>] do_sync_write+0x67/0x9d [<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44 [<c020a7b9>] vfs_write+0x7b/0xe6 [<c020a9a6>] sys_write+0x3b/0x64 [<c06dd4bd>] syscall_call+0x7/0xb Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-09-19 22:42:36 -04:00
Andrey Sidorov	18888cf088	ext4: speed up truncate/unlink by not using bforget() unless needed Do not iterate over data blocks scanning for bh's to forget as they're never exist. This improves time taken by unlink / truncate syscall. Tested by continuously truncating file that is being written by dd. Another test is rm -rf of linux tree while tar unpacks it. With ordered data mode condition unlikely(!tbh) was always met in ext4_free_blocks. With journal data mode tbh was found only few times, so optimisation is also possible. Unlinking fallocated 60G file after doing sync && echo 3 > /proc/sys/vm/drop_caches && time rm --help X86 before (linux 3.6-rc4): # time rm -f test1 real 0m2.710s user 0m0.000s sys 0m1.530s X86 after: # time rm -f test1 real 0m0.644s user 0m0.003s sys 0m0.060s MIPS before (linux 2.6.37): # time rm -f test1 real 0m 4.93s user 0m 0.00s sys 0m 4.61s MIPS after: # time rm -f test1 real 0m 0.16s user 0m 0.00s sys 0m 0.06s Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrey Sidorov <qrxd43@motorola.com>	2012-09-19 14:14:53 -04:00
Theodore Ts'o	59e31c156a	ext4: fix online resizing when the # of block groups is constant Commit `1c6bd7173d` introduced a regression where an online resize operation which did not change the number of block groups would fail, i.e: mke2fs -t /dev/vdc 60000 mount /dev/vdc resize2fs /dev/vdc 60001 This was due to a bug in the logic regarding when to try converting the filesystem to use meta_bg. Also fix up a number of other minor issues with the online resizing code: (a) Fix a sparse warning; (b) only check to make sure the device is large enough once, instead of multiple times through the resize loop. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-19 00:55:56 -04:00
Anatol Pomozov	c9b92530a7	ext4: make orphan functions be no-op in no-journal mode Instead of checking whether the handle is valid, we check if journal is enabled. This avoids taking the s_orphan_lock mutex in all cases when there is no journal in use, including the error paths where ext4_orphan_del() is called with a handle set to NULL. Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-18 13:38:59 -04:00
Theodore Ts'o	b5e2368bae	ext4: re-enable -o discard functionality in no-journal mode This is a revert of commit `b56ff9d397`, which removed the call to ext4_issue_discard() to fix a BUG reported because ext4_issue_discard() was being called from inside a block group spinlock. As it turns out this bug had already been fixed by Lukas Czerner in commit `53fdcf992d` by the simple expedient of moving when we call ext4_issue_discard() outside the spinlock. So it should be safe to re-enable this functionality, which I tested by putting an BUG_ON(in_atomic) just after the restored callsite to ext4_issue_discard(). Addresses-Google-Bug: #6750518 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Anatol Pomozov <anatol.pomozov@gmail.com>	2012-09-18 13:33:44 -04:00
Carlos Maiolino	90b0a97323	ext4: fix possible non-initialized variable in htree_dirblock_to_tree() htree_dirblock_to_tree() declares a non-initialized 'err' variable, which is passed as a reference to another functions expecting them to set this variable with their error codes. It's passed to ext4_bread(), which then passes it to ext4_getblk(). If ext4_map_blocks() returns 0 due to a lookup failure, leaving the ext4_getblk() buffer_head uninitialized, it will make ext4_getblk() return to ext4_bread() without initialize the 'err' variable, and ext4_bread() will return to htree_dirblock_to_tree() with this variable still uninitialized. htree_dirblock_to_tree() will pass this variable with garbage back to ext4_htree_fill_tree(), which expects a number of directory entries added to the rb-tree. which, in case, might return a fake non-zero value due the garbage left in the 'err' variable, leading the kernel to an Oops in ext4_dx_readdir(), once this is expecting a filled rb-tree node, when in turn it will have a NULL-ed one, causing an invalid page request when trying to get a fname struct from this NULL-ed rb-tree node in this line: fname = rb_entry(info->curr_node, struct fname, rb_hash); The patch itself initializes the err variable in htree_dirblock_to_tree() to avoid usage mistakes by the called functions, and also fix ext4_getblk() to return a initialized 'err' variable when ext4_map_blocks() fails a lookup. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-17 23:39:12 -04:00
Theodore Ts'o	bc0b75f77a	ext4: do not enable delalloc by default for ext2 Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-17 22:54:36 -04:00
Theodore Ts'o	5e7bbef19c	ext4: advertise the fact that the kernel supports meta_bg resizing Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-13 12:11:40 -04:00
Theodore Ts'o	4da4a56e4f	ext4: log a resize update to the console every 10 seconds For very long online resizes, a periodic update to the console log is helpful for debugging and for progress reporting. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-13 10:24:21 -04:00
Theodore Ts'o	1c6bd7173d	ext4: convert file system to meta_bg if needed during resizing If we have run out of reserved gdt blocks, then clear the resize_inode feature and enable the meta_bg feature, so that we can continue resizing the file system seamlessly. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-13 10:19:24 -04:00
Theodore Ts'o	93f9052643	ext4: set bg_itable_unused when resizing Set bg_itable_unused for file systems that have uninit_bg enabled. This will speed up the first e2fsck run after the file system is resized. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-12 14:32:42 -04:00
Yongqiang Yang	01f795f9e0	ext4: add online resizing support for meta_bg and 64-bit file systems This patch adds support for resizing file systems with the meta_bg and 64bit features. [ Added a fix by tytso to fix a divide by zero when resizing a filesystem from 14 TB to 18TB. Also fixed overhead accounting for meta_bg file systems.] Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-05 01:33:50 -04:00
Theodore Ts'o	28623c2f5b	ext4: grow the s_group_info array as needed Previously we allocated the s_group_info array with enough space for any future possible growth of the file system via online resize. This is unfortunate because it wastes memory, and it doesn't work for the meta_bg scheme, since there is no limit based on the number of reserved gdt blocks. So add the code to grow the s_group_info array as needed. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-05 01:31:50 -04:00
Theodore Ts'o	117fff10d7	ext4: grow the s_flex_groups array as needed when resizing Previously, we allocated the s_flex_groups array to the maximum size that the file system could be resized. There was two problems with this approach. First, it wasted memory in the common case where the file system was not resized. Secondly, once we start allowing online resizing using the meta_bg scheme, there is no maximum size that the file system can be resized. So instead, we need to grow the s_flex_groups at inline resize time. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-05 01:29:50 -04:00
Yongqiang Yang	2ebd1704de	ext4: avoid duplicate writes of the backup bg descriptor blocks The resize code was needlessly writing the backup block group descriptor blocks multiple times (once per block group) during an online resize. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-09-05 01:27:50 -04:00
Yongqiang Yang	6df935ad2f	ext4: don't copy non-existent gdt blocks when resizing The resize code was copying blocks at the beginning of each block group in order to copy the superblock and block group descriptor table (gdt) blocks. This was, unfortunately, being done even for block groups that did not have super blocks or gdt blocks. This is a complete waste of perfectly good I/O bandwidth, to skip writing those blocks for sparse bg's. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-09-05 01:25:50 -04:00
Yongqiang Yang	d7574ad08b	ext4: report the original old blocks count in a debug message when resizing Avoid changing o_blocks_count, since it is used later when reporting old blocks count in debug mode. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-05 01:23:50 -04:00
Yongqiang Yang	03c1c29053	ext4: ignore last group w/o enough space when resizing instead of BUG'ing If the last group does not have enough space for group tables, ignore it instead of calling BUG_ON(). Reported-by: Daniel Drake <dsd@laptop.org> Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-09-05 01:21:50 -04:00
Zheng Liu	8a2f8460e8	ext4: remove duplicated declarations in inode.c In patch `cb20d51883`, ext4_set_bh_endio and ext4_end_io_buffer_write are declared at the beginning of inode.c, and again later on in the middle of the file. Remove the second set of duplicated function declarations. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-08-19 18:07:40 -04:00
Wang Sheng-Hui	30cb27d661	ext4: fix trivial typo in comment Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-08-18 22:38:07 -04:00
Ashish Sangwan	e3d2e433e3	ext4: no need to add inode to orphan list during hole punch While performing punch hole for an inode, i_disksize is not changed. So, there is no need to add the inode to orphan list. Signed-off-by: Ashish Sangwan <ashish.sangwan2@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@gmail.com> Acked-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-08-18 22:29:46 -04:00
Eric Sandeen	eeecef0af5	jbd2: don't write superblock when if its empty This sequence: # truncate --size=1g fsfile # mkfs.ext4 -F fsfile # mount -o loop,ro fsfile /mnt # umount /mnt # dmesg \| tail results in an IO error when unmounting the RO filesystem: [ 318.020828] Buffer I/O error on device loop1, logical block 196608 [ 318.027024] lost page write due to I/O error on loop1 [ 318.032088] JBD2: Error -5 detected when updating journal superblock for loop1-8. This was a regression introduced by commit `24bcc89c7e`: "jbd2: split updating of journal superblock and marking journal empty". Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-08-18 22:29:40 -04:00
Sachin Kamat	caecd0af8f	ext4: replace plain integer with NULL in super.c Fixes the following sparse warning: fs/ext4/super.c:1672:45: warning: Using plain integer as NULL pointer Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-08-18 22:29:18 -04:00
Theodore Ts'o	07724f9897	ext4: drop lock_super()/unlock_super() We don't need lock_super()/unlock_super() any more, since the places where it is used, we are protected by the s_umount r/w semaphore. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Marco Stornelli <marco.stornelli@gmail.com>	2012-08-17 19:08:42 -04:00
Theodore Ts'o	0e376b1e3c	ext4: return an error if kset_create_and_add fails in ext4_init_fs() In the very unlikely case that kset_create_and_add() fails when the ext4.ko module is being loaded (or during kernel startup) set err so that it's clear that the module load failed. https://bugzilla.kernel.org/show_bug.cgi?id=27912 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-08-17 10:04:17 -04:00
Robin Dong	15c006a22f	ext4: remove unused function argument 'order' in mb_find_extent() All the routines call mb_find_extent are setting argument 'order' to 0 just like: mb_find_extent(e4b, 0, ex.fe_start, ex.fe_len, &ex); therefore the useless argument should be removed. Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-08-17 10:02:17 -04:00
Robin Dong	cc6eb18d68	ext4: remove unused macro MB_DEFAULT_MAX_GROUPS_TO_SCAN Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-08-17 10:00:17 -04:00
Theodore Ts'o	a4a39040e9	ext4: check return value of blkdev_issue_flush() blkdev_issue_flush() can fail; make sure the error gets properly propagated. This is a port of the equivalent ext3 patch from commit `44f4f729e7`. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-08-17 09:58:17 -04:00
Theodore Ts'o	316e4cfd0b	jbd2: check return value of blkdev_issue_flush() blkdev_issue_flush() can fail; make sure the error gets properly propagated. This is a port of the equivalent jbd patch from commit `349ecd6a3c`. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-08-17 09:56:17 -04:00
Zheng Liu	67a5da564f	ext4: make the zero-out chunk size tunable Currently in ext4 the length of zero-out chunk is set to 7 file system blocks. But if an inode has uninitailized extents from using fallocate to preallocate space, and the workload issues many random writes, this can cause a fragmented extent tree that will unnecessarily grow the extent tree. So create a new sysfs tunable, extent_max_zeroout_kb, which controls the maximum size where blocks will be zeroed out instead of creating a new uninitialized extent. The default of this has been sent to 32kb. CC: Zach Brown <zab@zabbo.net> CC: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-08-17 09:54:17 -04:00
Anatol Pomozov	8137029172	ext4: add missing space to trace message Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-08-17 09:52:17 -04:00
Anatol Pomozov	210c05264d	ext4: realign trace events structs to make it smaller Most hardware architectures require that data (including struct fields) have to be aligned in memory. To make it happen compiler inserts padding between struct fields if they are not aligned correctly. Reorder fields to remove paddings and make structures denser. Making data smaller saves some memory that is very important for trace events. Tracing buffer has limited size and making objects smaller we can put more of them without overflowing the tracing buffer. To find data struct holes I used 'pahole -H 1 -E -I vmlinux.o' from 'dwarves' package. Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-08-17 09:50:17 -04:00
Theodore Ts'o	df981d03ee	ext4: add max_dir_size_kb mount option Very large directories can cause significant performance problems, or perhaps even invoke the OOM killer, if the process is running in a highly constrained memory environment (whether it is VM's with a small amount of memory or in a small memory cgroup). So it is useful, in cloud server/data center environments, to be able to set a filesystem-wide cap on the maximum size of a directory, to ensure that directories never get larger than a sane size. We do this via a new mount option, max_dir_size_kb. If there is an attempt to grow the directory larger than max_dir_size_kb, the system call will return ENOSPC instead. Google-Bug-Id: 6863013 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-08-17 09:48:17 -04:00
Theodore Ts'o	01fc48e892	ext4: don't load the block bitmap for block groups which have no space Add a short circuit check to ext4_mb_group_group() so that we don't bother to load the block bitmap for a block group which does not have any space available. (Or which does not have enough space until we are in desperation mode, i.e., when cr == 3.) Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=45741 Reported-by: mirek@me.com Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-08-17 09:46:17 -04:00
Theodore Ts'o	ecb94f5fdf	ext4: collapse a single extent tree block into the inode if possible If an inode has more than 4 extents, but then later some of the extents are merged together, we can optimize the file system by moving the extents up into the inode, and discarding the extent tree block. This is important, because if there are a large number of inodes with an external extent tree blocks where the contents could fit in the inode, this can significantly increase the fsck time of the file system. Google-Bug-Id: 6801242 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-08-17 09:44:17 -04:00
Theodore Ts'o	89a4e48f84	ext4: fix kernel BUG on large-scale rm -rf commands Commit `968dee7722`: "ext4: fix hole punch failure when depth is greater than 0" introduced a regression in v3.5.1/v3.6-rc1 which caused kernel crashes when users ran run "rm -rf" on large directory hierarchy on ext4 filesystems on RAID devices: BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 Process rm (pid: 18229, threadinfo ffff8801276bc000, task ffff880123631710) Call Trace: [<ffffffff81236483>] ? __ext4_handle_dirty_metadata+0x83/0x110 [<ffffffff812353d3>] ext4_ext_truncate+0x193/0x1d0 [<ffffffff8120a8cf>] ? ext4_mark_inode_dirty+0x7f/0x1f0 [<ffffffff81207e05>] ext4_truncate+0xf5/0x100 [<ffffffff8120cd51>] ext4_evict_inode+0x461/0x490 [<ffffffff811a1312>] evict+0xa2/0x1a0 [<ffffffff811a1513>] iput+0x103/0x1f0 [<ffffffff81196d84>] do_unlinkat+0x154/0x1c0 [<ffffffff8118cc3a>] ? sys_newfstatat+0x2a/0x40 [<ffffffff81197b0b>] sys_unlinkat+0x1b/0x50 [<ffffffff816135e9>] system_call_fastpath+0x16/0x1b Code: 8b 4d 20 0f b7 41 02 48 8d 04 40 48 8d 04 81 49 89 45 18 0f b7 49 02 48 83 c1 01 49 89 4d 00 e9 ae f8 ff ff 0f 1f 00 49 8b 45 28 <48> 8b 40 28 49 89 45 20 e9 85 f8 ff ff 0f 1f 80 00 00 00 RIP [<ffffffff81233164>] ext4_ext_remove_space+0xa34/0xdf0 This could be reproduced as follows: The problem in commit `968dee7722` was that caused the variable 'i' to be left uninitialized if the truncate required more space than was available in the journal. This resulted in the function ext4_ext_truncate_extend_restart() returning -EAGAIN, which caused ext4_ext_remove_space() to restart the truncate operation after starting a new jbd2 handle. Reported-by: Maciej Żenczykowski <maze@google.com> Reported-by: Marti Raudsepp <marti@juffo.org> Tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-08-17 09:42:17 -04:00
Theodore Ts'o	0548bbb853	ext4: fix long mount times on very big file systems Commit `8aeb00ff85`: "ext4: fix overhead calculation used by ext4_statfs()" introduced a O(n**2) calculation which makes very large file systems take forever to mount. Fix this with an optimization for non-bigalloc file systems. (For bigalloc file systems the overhead needs to be set in the the superblock.) Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-08-17 09:23:00 -04:00
Theodore Ts'o	7a4c5de27e	ext4: don't call ext4_error while block group is locked While in ext4_validate_block_bitmap(), if an block allocation bitmap is found to be invalid, we call ext4_error() while the block group is still locked. This causes ext4_commit_super() to call a function which might sleep while in an atomic context. There's no need to keep the block group locked at this point, so hoist the ext4_error() call up to ext4_validate_block_bitmap() and release the block group spinlock before calling ext4_error(). The reported stack trace can be found at: http://article.gmane.org/gmane.comp.file-systems.ext4/33731 Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-08-17 09:06:06 -04:00
Theodore Ts'o	7e731bc9a1	ext4: avoid kmemcheck complaint from reading uninitialized memory Commit `03179fe923` introduced a kmemcheck complaint in ext4_da_get_block_prep() because we save and restore ei->i_da_metadata_calc_last_lblock even though it is left uninitialized in the case where i_da_metadata_calc_len is zero. This doesn't hurt anything, but silencing the kmemcheck complaint makes it easier for people to find real bugs. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=45631 (which is marked as a regression). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-08-05 23:28:16 -04:00
Theodore Ts'o	d796c52ef0	ext4: make sure the journal sb is written in ext4_clear_journal_err() After we transfer set the EXT4_ERROR_FS bit in the file system superblock, it's not enough to call jbd2_journal_clear_err() to clear the error indication from journal superblock --- we need to call jbd2_journal_update_sb_errno() as well. Otherwise, when the root file system is mounted read-only, the journal is replayed, and the error indicator is transferred to the superblock --- but the s_errno field in the jbd2 superblock is left set (since although we cleared it in memory, we never flushed it out to disk). This can end up confusing e2fsck. We should make e2fsck more robust in this case, but the kernel shouldn't be leaving things in this confused state, either. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2012-08-05 19:04:57 -04:00

1 2 3 4 5 ...

321342 commits