Commit graph

25642 commits

Author SHA1 Message Date
Theodore Ts'o
26092bf524 ext4: use a table-driven handler for mount options
By using a table-drive approach, we shave about 100 lines of code from
ext4, and make the code a bit more regular and factored out.  This
will also make it possible in a future patch to use this table for
displaying the mount options that were specified in /proc/mounts.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-03 23:20:47 -05:00
Theodore Ts'o
72578c33c4 ext4: unify handling of mount options which have been removed
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-03 18:04:40 -05:00
Theodore Ts'o
39ef17f1b0 ext4: simplify handling of the errors=* mount options
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-03 17:56:23 -05:00
Theodore Ts'o
c64db50e76 ext4: remove the I_VERSION mount flag and use the super_block flag instead
There's no point to have two bits that are set in parallel; so use the
MS_I_VERSION flag that is needed by the VFS anyway, and that way we
free up a bit in sbi->s_mount_opts.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-02 12:23:11 -05:00
Theodore Ts'o
ee4a3fcd1d ext4: remove Opt_ignore
This is completely unused so let's just get rid of it.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-02 12:14:24 -05:00
Theodore Ts'o
87f26807e9 ext4: remove deprecation warnings for minix_df and grpid
People complained about removing both of these features, so per
Linus's dictate, we won't be able to remove them.  Sigh...

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-02 00:03:21 -05:00
Santosh Nayak
85d216501a ext4: Fix endianness bug when reading the MMP block
Sparse complained about this endian bug in fs/ext4/mmp.c.

Signed-off-by: Santosh Nayak <santoshprasadnayak@gmail.com>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-27 01:09:03 -05:00
Zheng Liu
9ee4930259 ext4: format flag in dx_probe()
Fix ext4_warning format flag in dx_probe().

CC: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 23:09:36 -05:00
Eric Sandeen
c1bb05a657 ext4: avoid deadlock on sync-mounted FS w/o journal
Processes hang forever on a sync-mounted ext2 file system that
is mounted with the ext4 module (default in Fedora 16).

I can reproduce this reliably by mounting an ext2 partition with
"-o sync" and opening a new file an that partition with vim. vim
will hang in "D" state forever.  The same happens on ext4 without
a journal.

I am attaching a small patch here that solves this issue for me.
In the sync mounted case without a journal,
ext4_handle_dirty_metadata() may call sync_dirty_buffer(), which
can't be called with buffer lock held.

Also move mb_cache_entry_release inside lock to avoid race
fixed previously by 8a2bfdcb ext[34]: EA block reference count racing fix
Note too that ext2 fixed this same problem in 2006 with
b2f49033 [PATCH] fix deadlock in ext2

Signed-off-by: Martin.Wilck@ts.fujitsu.com
[sandeen@redhat.com: move mb_cache_entry_release before unlock, edit commit msg]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 23:06:18 -05:00
Lukas Czerner
a0ade1deb8 ext4: fix resize when resizing within single group
When resizing file system in the way that the new size of the file
system is still in the same group (no new groups are added), then we can
hit a BUG_ON in ext4_alloc_group_tables()

BUG_ON(flex_gd->count == 0 || group_data == NULL);

because flex_gd->count is zero. The reason is the missing check for such
case, so the code always extend the last group fully and then attempt to
add more groups, but at that time n_blocks_count is actually smaller
than o_blocks_count.

It can be easily reproduced like this:

mkfs.ext4 -b 4096 /dev/sda 30M
mount /dev/sda /mnt/test
resize2fs /dev/sda 50M

Fix this by checking whether the resize happens within the singe group
and only add that many blocks into the last group to satisfy user
request. Then o_blocks_count == n_blocks_count and the resize will exit
successfully without and attempt to add more groups into the fs.

Also fix mixing together block number and blocks count which might be
confusing and can easily lead to off-by-one errors (but it is actually
not the case here since the two occurrence of this mix-up will cancel
each other).

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Milan Broz <mbroz@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 23:02:06 -05:00
Jeff Moyer
266991b138 ext4: fix race between unwritten extent conversion and truncate
The following comment in ext4_end_io_dio caught my attention:

	/* XXX: probably should move into the real I/O completion handler */
        inode_dio_done(inode);

The truncate code takes i_mutex, then calls inode_dio_wait.  Because the
ext4 code path above will end up dropping the mutex before it is
reacquired by the worker thread that does the extent conversion, it
seems to me that the truncate can happen out of order.  Jan Kara
mentioned that this might result in error messages in the system logs,
but that should be the extent of the "damage."

The fix is pretty straight-forward: don't call inode_dio_done until the
extent conversion is complete.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-02-20 17:59:24 -05:00
Heiko Carstens
d4dc462f55 ext4: fix balloc.c printk-format-warning
Get rid of this one:

fs/ext4/balloc.c: In function 'ext4_wait_block_bitmap':
fs/ext4/balloc.c:405:3: warning: format '%llu' expects argument of
  type 'long long unsigned int', but argument 6 has type 'sector_t' [-Wformat]

Happens because sector_t is u64 (unsigned long long) or unsigned long
dependent on CONFIG_64BIT.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:57:24 -05:00
Theodore Ts'o
c5e8f3f3bc ext4: remove EXT4_MB_{BITMAP,BUDDY} macros
The EXT4_MB_BITMAP and EXT4_MB_BUDDY macros obfuscate more than they
provide any abstraction.   So remove them.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:54:06 -05:00
Dan Carpenter
a0cc910f15 ext4: using PTR_ERR() on the wrong variable in ext4_ext_migrate()
"inode" is a valid pointer here.  "tmp_inode" was intended.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:06 -05:00
Dan Carpenter
4fda400360 ext4: remove an unneeded NULL check in __ext4_check_dir_entry()
We dereference "bh" unconditionally a couple lines down to find
"by->b_size".  This function is never called with a NULL "bh" so I have
removed the check.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:05 -05:00
Zheng Liu
f1b3a2a753 ext4: remove unneeded variable in ext4_xattr_check_block()
We could return directly from ext4_xattr_check_block(). Thus, we
shouldn't need to define a 'error' variable.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:05 -05:00
Eric Sandeen
661aa52057 ext4: remove the resize mount option
The resize mount option seems to be of limited value,
especially in the age of online resize2fs.  Nuke it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:04 -05:00
Eric Sandeen
43e625d84f ext4: remove the journal=update mount option
The V2 journal format was introduced around ten years ago,
for ext3. It seems highly unlikely that anyone will need this
migration option for ext4.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:04 -05:00
Curt Wohlgemuth
1592d2c557 ext4: mark possibly unused variable in ext4_mb_normalize_request()
The 'orig_size' local variable is only used in a call to
mb_debug().  Mark it with '__maybe_unused'.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:03 -05:00
Yongqiang Yang
9c0e00e5ce jbd2: use KMEM_CACHE instead of kmem_cache_create()
Use the KMEM_CACHE helper macro instead of kmem_cache_create().

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:03 -05:00
Yongqiang Yang
4185a2ac42 jbd2: rename functions which initialize slab caches
This patch renames functions initializing the slab caches for the
journal head and handle structures to so they are consistent with the
names of the corresponding functions which destroys those slab caches.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:03 -05:00
Yongqiang Yang
0c2022eccb jbd2: allocate transaction from separate slab cache
There is normally only a handful of these active at any one time, but
putting them in a separate slab cache makes debugging memory
corruption problems easier.  Manish Katiyar also wanted this make it
easier to test memory failure scenarios in the jbd2 layer.

Cc: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:02 -05:00
Bobi Jam
18aadd47f8 ext4: expand commit callback and
The per-commit callback was used by mballoc code to manage free space
bitmaps after deleted blocks have been released.  This patch expands
it to support multiple different callbacks, to allow other things to
be done after the commit has been completed.

Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:02 -05:00
Eric Sandeen
15291164b2 jbd2: clear BH_Delay & BH_Unwritten in journal_unmap_buffer
journal_unmap_buffer()'s zap_buffer: code clears a lot of buffer head
state ala discard_buffer(), but does not touch _Delay or _Unwritten as
discard_buffer() does.

This can be problematic in some areas of the ext4 code which assume
that if they have found a buffer marked unwritten or delay, then it's
a live one.  Perhaps those spots should check whether it is mapped
as well, but if jbd2 is going to tear down a buffer, let's really
tear it down completely.

Without this I get some fsx failures on sub-page-block filesystems
up until v3.2, at which point 4e96b2dbbf
and 189e868fa8 make the failures go
away, because buried within that large change is some more flag
clearing.  I still think it's worth doing in jbd2, since
->invalidatepage leads here directly, and it's the right place
to clear away these flags.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-02-20 17:53:01 -05:00
Theodore Ts'o
856cbcf9a9 ext4: fix INCOMPAT feature codepoint reservation for INLINEDATA
In commit 9b90e5e028 I incorrectly reserved the wrong bit for
EXT4_FEATURE_INCOMPAT_INLINEDATA per the discussion on the linux-ext4
list on December 7, 2011.  The codepoint 0x2000 should be used for
EXT4_FEATURE_INCOMPAT_USE_META_CSUM, so INLINEDATA will be assigned
the value 0x8000.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:01 -05:00
Seiji Aguchi
2201c590dd jbd2: add drop_transaction/update_superblock_end tracepoints
This patch adds trace_jbd2_drop_transaction and
trace_jbd2_update_superblock_end because there are similar tracepoints
in jbd and they are needed in jbd2 as well.

Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:01 -05:00
Lukas Czerner
3d2b158262 ext4: ignore EXT4_INODE_JOURNAL_DATA flag with delalloc
Ext4 does not support data journalling with delayed allocation enabled.
We even do not allow to mount the file system with delayed allocation
and data journalling enabled, however it can be set via FS_IOC_SETFLAGS
so we can hit the inode with EXT4_INODE_JOURNAL_DATA set even on file
system mounted with delayed allocation (default) and that's where
problem arises. The easies way to reproduce this problem is with the
following set of commands:

 mkfs.ext4 /dev/sdd
 mount /dev/sdd /mnt/test1
 dd if=/dev/zero of=/mnt/test1/file bs=1M count=4
 chattr +j /mnt/test1/file
 dd if=/dev/zero of=/mnt/test1/file bs=1M count=4 conv=notrunc
 chattr -j /mnt/test1/file

Additionally it can be reproduced quite reliably with xfstests 272 and
269. In fact the above reproducer is a part of test 272.

To fix this we should ignore the EXT4_INODE_JOURNAL_DATA inode flag if
the file system is mounted with delayed allocation. This can be easily
done by fixing ext4_should_*_data() functions do ignore data journal
flag when delalloc is set (suggested by Ted). We also have to set the
appropriate address space operations for the inode (again, ignoring data
journal flag if delalloc enabled).

Additionally this commit introduces ext4_inode_journal_mode() function
because ext4_should_*_data() has already had a lot of common code and
this change is putting it all into one function so it is easier to
read.

Successfully tested with xfstests in following configurations:

delalloc + data=ordered
delalloc + data=writeback
data=journal
nodelalloc + data=ordered
nodelalloc + data=writeback
nodelalloc + data=journal

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-02-20 17:53:00 -05:00
Theodore Ts'o
813e57276f ext4: fix race when setting bitmap_uptodate flag
In ext4_read_{inode,block}_bitmap() we were setting bitmap_uptodate()
before submitting the buffer for read.  The is bad, since we check
bitmap_uptodate() without locking the buffer, and so if another
process is racing with us, it's possible that they will think the
bitmap is uptodate even though the read has not completed yet,
resulting in inodes and blocks potentially getting allocated more than
once if we get really unlucky.

Addresses-Google-Bug: 2828254

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:52:46 -05:00
Theodore Ts'o
119c0d4460 ext4: fold ext4_claim_inode into ext4_new_inode
The function ext4_claim_inode() is only called by one function,
ext4_new_inode(), and by folding the functionality into
ext4_new_inode(), we can remove almost 50 lines of code, and put all
of the logic of allocating a new inode into a single place.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-06 20:12:03 -05:00
Linus Torvalds
d3712b9dfc Pull request from git://github.com/prasad-joshi/logfs_upstream.git
There are few important bug fixes for LogFS
 
 Shortlog:
 Joern Engel (5):
      logfs: Prevent memory corruption
      logfs: remove useless BUG_ON
      logfs: Free areas before calling generic_shutdown_super()
      logfs: Grow inode in delete path
      Logfs: Allow NULL block_isbad() methods
 
 Prasad Joshi (5):
      logfs: update page reference count for pined pages
      logfs: take write mutex lock during fsync and sync
      logfs: set superblock shutdown flag after generic sb shutdown
      logfs: Propagate page parameter to __logfs_write_inode
      MAINTAINERS: Add Prasad Joshi in LogFS maintiners
 
 Diffstat:
  MAINTAINERS          |    1 +
  fs/logfs/dev_mtd.c   |   26 +++++++++++-------------
  fs/logfs/dir.c       |    2 +-
  fs/logfs/file.c      |    2 +
  fs/logfs/gc.c        |    2 +-
  fs/logfs/inode.c     |    4 ++-
  fs/logfs/journal.c   |    1 -
  fs/logfs/logfs.h     |    5 +++-
  fs/logfs/readwrite.c |   51 +++++++++++++++++++++++++++++++++----------------
  fs/logfs/segment.c   |   51 ++++++++++++++++++++++++++++++++++++++-----------
  fs/logfs/super.c     |    3 +-
  11 files changed, 99 insertions(+), 49 deletions(-)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJPKByhAAoJEDFA/f+3K+ZNQSUP/3gACcIwcsl+FnXPWBtz9XIG
 g0DjXoRDd/sR0u25nLgjCVdBJgx5FVEyA+PvLgvvUU2KCAsqI5F/EQ+fLJs21YEN
 TzepBO5aHtFZbNEjo6WiXOlDbBePTtk44WrN6jqoCHM/aDeT4Wof3NZBmHWNN1PX
 B2RtEZ0ypJ7/b1OY2LUNcQfTaJXNgVoP8Hkx4KGY5LUVxVrBXxvDTU7YbkS8a+ys
 1Yje/EQ4XD4RyZB42TmFEuTenvGPRgMGVFdnkJKuON8EmJQ8Hc61jEf5d7Q8sWef
 dH5F/ptoAaR9a9LbbO8LoYuBZ8MR8848NPsrNPpr/gWntj46Z79yII8Jr7YoSDyw
 zq5G2dZbwlbVrtVWKGae47THkNB8bljR/g4cijvPAkvuIAku6mg+dgjVHAhZ/t+J
 xu8+Gy2sWHUH2gmoSXuoNyppOvYpPIRd5RB16PizMH3bw+sMad2K8/rfOKnmF1/r
 HTM2jZ5bDcHVDjSuVI6u2m/mQX+PmPXUTffreaFXuSI75YpT0dqN3nponTX4EgFI
 Ad9ZBQvdg8w1LGDsNxIAaqrGx4Q87RxqfUV4W/wo6N8gKsp+I2y4GtYMeD/CEKyi
 wncKg10YwoMXZj7cBAkWgPlgrOBYCPwYZc/1DVRHvqrHo/m13SJrWDKkNKVvoXzH
 2y4Tfi5w1WDRUT7yeoyK
 =TA1A
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://github.com/prasad-joshi/logfs_upstream

There are few important bug fixes for LogFS

* tag 'for-linus' of git://github.com/prasad-joshi/logfs_upstream:
  Logfs: Allow NULL block_isbad() methods
  logfs: Grow inode in delete path
  logfs: Free areas before calling generic_shutdown_super()
  logfs: remove useless BUG_ON
  MAINTAINERS: Add Prasad Joshi in LogFS maintiners
  logfs: Propagate page parameter to __logfs_write_inode
  logfs: set superblock shutdown flag after generic sb shutdown
  logfs: take write mutex lock during fsync and sync
  logfs: Prevent memory corruption
  logfs: update page reference count for pined pages

Fix up conflict in fs/logfs/dev_mtd.c due to semantic change in what
"mtd->block_isbad" means in commit f2933e86ad: "Logfs: Allow NULL
block_isbad() methods" clashing with the abstraction changes in the
commits 7086c19d07: "mtd: introduce mtd_block_isbad interface" and
d58b27ed58: "logfs: do not use 'mtd->block_isbad' directly".

This resolution takes the semantics from commit f2933e86ad, and just
makes mtd_block_isbad() return zero (false) if the 'block_isbad'
function is NULL.  But that also means that now "mtd_can_have_bb()"
always returns 0.

Now, "mtd_block_markbad()" will obviously return an error if the
low-level driver doesn't support bad blocks, so this is somewhat
non-symmetric, but it actually makes sense if a NULL "block_isbad"
function is considered to mean "I assume that all my blocks are always
good".
2012-01-31 09:23:59 -08:00
Linus Torvalds
0a96265754 Here are some patches for the 3.3-rc1 tree.
It contains the removal of the sysdev code, now that all users of it are
 gone, as well as some sysfs bugfixes that have been reported by users.
 There are also some documentation updates here as well.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iEYEABECAAYFAk8jKW4ACgkQMUfUDdst+ynAUwCfVWwHJxpb4DSSMVZhGOnHMQrL
 ZjIAn00gPeSs5u8y1nPvFrFikbon4FDs
 =bzVy
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.3-rc1-bugfixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Here are some patches for the 3.3-rc1 tree.

It contains the removal of the sysdev code, now that all users of it are
gone, as well as some sysfs bugfixes that have been reported by users.
There are also some documentation updates here as well.

* tag 'driver-core-3.3-rc1-bugfixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  sysfs: Complain bitterly about attempts to remove files from nonexistent directories.
  stable: update documentation to ask for kernel version
  base/core.c:fix typo in comment in function device_add
  Documentation: devres: add allocation functions to list of supported calls
  Documentation update for the driver model core
  kernel-doc: fix new warnings in driver-core
  kernel-doc: fix new warnings in debugfs
  kernel-doc: fix new warnings in device.h
  driver core: remove drivers/base/sys.c and include/linux/sysdev.h
2012-01-28 18:20:48 -08:00
Linus Torvalds
67d2433ee7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix reservations in btrfs_page_mkwrite
  Btrfs: advance window_start if we're using a bitmap
  btrfs: mask out gfp flags in releasepage
  Btrfs: fix enospc error caused by wrong checks of the chunk
  Btrfs: do not defrag a file partially
  Btrfs: fix warning for 32-bit build of fs/btrfs/check-integrity.c
  Btrfs: use cluster->window_start when allocating from a cluster bitmap
  Btrfs: Check for NULL page in extent_range_uptodate
  btrfs: Fix busyloops in transaction waiting code
  Btrfs: make sure a bitmap has enough bytes
  Btrfs: fix uninit warning in backref.c
2012-01-28 17:00:19 -08:00
Joern Engel
f2933e86ad Logfs: Allow NULL block_isbad() methods
Not all mtd drivers define block_isbad().  Let's assume no bad blocks
instead of refusing to mount.

Signed-off-by: Joern Engel <joern@logfs.org>
2012-01-28 11:43:40 +05:30
Joern Engel
bbe0138712 logfs: Grow inode in delete path
Can be necessary if an inode gets deleted (through -ENOSPC) before being
written.  Might be better to move this into logfs_write_rec(), but for
now go with the stupid&safe patch.

Signed-off-by: Joern Engel <joern@logfs.org>
2012-01-28 11:43:07 +05:30
Joern Engel
1bcceaff8c logfs: Free areas before calling generic_shutdown_super()
Or hit an assertion in map_invalidatepage() instead.

Signed-off-by: Joern Engel <joern@logfs.org>
2012-01-28 11:42:39 +05:30
Joern Engel
6c69494f6b logfs: remove useless BUG_ON
It prevents write sizes >4k.

Signed-off-by: Joern Engel <joern@logfs.org>
2012-01-28 11:41:56 +05:30
Prasad Joshi
0bd90387ed logfs: Propagate page parameter to __logfs_write_inode
During GC LogFS has to rewrite each valid block to a separate segment.
Rewrite operation reads data from an old segment and writes it to a
newly allocated segment. Since every write operation changes data
block pointers maintained in inode, inode should also be rewritten.

In GC path to avoid AB-BA deadlock LogFS marks a page with
PG_pre_locked in addition to locking the page (PG_locked). The page
lock is ignored iff the page is pre-locked.

LogFS uses a special file called segment file. The segment file
maintains an 8 bytes entry for every segment. It keeps track of erase
count, level etc. for every segment.

Bad things happen with a segment belonging to the segment file is GCed

 ------------[ cut here ]------------
kernel BUG at /home/prasad/logfs/readwrite.c:297!
invalid opcode: 0000 [#1] SMP
Modules linked in: logfs joydev usbhid hid psmouse e1000 i2c_piix4
		serio_raw [last unloaded: logfs]
Pid: 20161, comm: mount Not tainted 3.1.0-rc3+ #3 innotek GmbH
		VirtualBox
EIP: 0060:[<f809132a>] EFLAGS: 00010292 CPU: 0
EIP is at logfs_lock_write_page+0x6a/0x70 [logfs]
EAX: 00000027 EBX: f73f5b20 ECX: c16007c8 EDX: 00000094
ESI: 00000000 EDI: e59be6e4 EBP: c7337b28 ESP: c7337b18
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process mount (pid: 20161, ti=c7336000 task=eb323f70 task.ti=c7336000)
Stack:
f8099a3d c7337b24 f73f5b20 00001002 c7337b50 f8091f6d f8099a4d f80994e4
00000003 00000000 c7337b68 00000000 c67e4400 00001000 c7337b80 f80935e5
00000000 00000000 00000000 00000000 e1fcf000 0000000f e59be618 c70bf900
Call Trace:
[<f8091f6d>] logfs_get_write_page.clone.16+0xdd/0x100 [logfs]
[<f80935e5>] logfs_mod_segment_entry+0x55/0x110 [logfs]
[<f809460d>] logfs_get_segment_entry+0x1d/0x20 [logfs]
[<f8091060>] ? logfs_cleanup_journal+0x50/0x50 [logfs]
[<f809521b>] ostore_get_erase_count+0x1b/0x40 [logfs]
[<f80965b8>] logfs_open_area+0xc8/0x150 [logfs]
[<c141a7ec>] ? kmemleak_alloc+0x2c/0x60
[<f809668e>] __logfs_segment_write.clone.16+0x4e/0x1b0 [logfs]
[<c10dd563>] ? mempool_kmalloc+0x13/0x20
[<c10dd563>] ? mempool_kmalloc+0x13/0x20
[<f809696f>] logfs_segment_write+0x17f/0x1d0 [logfs]
[<f8092e8c>] logfs_write_i0+0x11c/0x180 [logfs]
[<f8092f35>] logfs_write_direct+0x45/0x90 [logfs]
[<f80934cd>] __logfs_write_buf+0xbd/0xf0 [logfs]
[<c102900e>] ? kmap_atomic_prot+0x4e/0xe0
[<f809424b>] logfs_write_buf+0x3b/0x60 [logfs]
[<f80947a9>] __logfs_write_inode+0xa9/0x110 [logfs]
[<f8094cb0>] logfs_rewrite_block+0xc0/0x110 [logfs]
[<f8095300>] ? get_mapping_page+0x10/0x60 [logfs]
[<f8095aa0>] ? logfs_load_object_aliases+0x2e0/0x2f0 [logfs]
[<f808e57d>] logfs_gc_segment+0x2ad/0x310 [logfs]
[<f808e62a>] __logfs_gc_once+0x4a/0x80 [logfs]
[<f808ed43>] logfs_gc_pass+0x683/0x6a0 [logfs]
[<f8097a89>] logfs_mount+0x5a9/0x680 [logfs]
[<c1126b21>] mount_fs+0x21/0xd0
[<c10f6f6f>] ? __alloc_percpu+0xf/0x20
[<c113da41>] ? alloc_vfsmnt+0xb1/0x130
[<c113db4b>] vfs_kern_mount+0x4b/0xa0
[<c113e06e>] do_kern_mount+0x3e/0xe0
[<c113f60d>] do_mount+0x34d/0x670
[<c10f2749>] ? strndup_user+0x49/0x70
[<c113fcab>] sys_mount+0x6b/0xa0
[<c142d87c>] syscall_call+0x7/0xb
Code: f8 e8 8b 93 39 c9 8b 45 f8 3e 0f ba 28 00 19 d2 85 d2 74 ca eb d0 0f 0b 8d 45 fc 89 44 24 04 c7 04 24 3d 9a 09 f8 e8 09 92 39 c9 <0f> 0b 8d 74 26 00 55 89 e5 3e 8d 74 26 00 8b 10 80 e6 01 74 09
EIP: [<f809132a>] logfs_lock_write_page+0x6a/0x70 [logfs] SS:ESP 0068:c7337b18
---[ end trace 96e67d5b3aa3d6ca ]---

The patch passes locked page to __logfs_write_inode. It calls function
logfs_get_wblocks() to pre-lock the page. This ensures any further
attempts to lock the page are ignored (esp from get_erase_count).

Acked-by: Joern Engel <joern@logfs.org>
Signed-off-by: Prasad Joshi <prasadjoshi.linux@gmail.com>
2012-01-28 11:38:25 +05:30
Prasad Joshi
ecfd890991 logfs: set superblock shutdown flag after generic sb shutdown
While unmounting the file system LogFS calls generic_shutdown_super.
The function does file system independent superblock shutdown.
However, it might result in call file system specific inode eviction.

LogFS marks FS shutting down by setting bit LOGFS_SB_FLAG_SHUTDOWN in
super->s_flags. Since, inode eviction might call truncate on inode,
following BUG is observed when file system is unmounted:

------------[ cut here ]------------
kernel BUG at /home/prasad/logfs/segment.c:362!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU 3
Modules linked in: logfs binfmt_misc ppdev virtio_blk parport_pc lp
	parport psmouse floppy virtio_pci serio_raw virtio_ring virtio

Pid: 1933, comm: umount Not tainted 3.0.0+ #4 Bochs Bochs
RIP: 0010:[<ffffffffa008c841>]  [<ffffffffa008c841>]
		logfs_segment_write+0x211/0x230 [logfs]
RSP: 0018:ffff880062d7b9e8  EFLAGS: 00010202
RAX: 000000000000000e RBX: ffff88006eca9000 RCX: 0000000000000000
RDX: ffff88006fd87c40 RSI: ffffea00014ff468 RDI: ffff88007b68e000
RBP: ffff880062d7ba48 R08: 8000000020451430 R09: 0000000000000000
R10: dead000000100100 R11: 0000000000000000 R12: ffff88006fd87c40
R13: ffffea00014ff468 R14: ffff88005ad0a460 R15: 0000000000000000
FS:  00007f25d50ea760(0000) GS:ffff88007fd80000(0000)
	knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000d05e48 CR3: 0000000062c72000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process umount (pid: 1933, threadinfo ffff880062d7a000,
	task ffff880070b44500)
Stack:
ffff880062d7ba38 ffff88005ad0a508 0000000000001000 0000000000000000
8000000020451430 ffffea00014ff468 ffff880062d7ba48 ffff88005ad0a460
ffff880062d7bad8 ffffea00014ff468 ffff88006fd87c40 0000000000000000
Call Trace:
[<ffffffffa0088fee>] logfs_write_i0+0x12e/0x190 [logfs]
[<ffffffffa0089360>] __logfs_write_rec+0x140/0x220 [logfs]
[<ffffffffa0089312>] __logfs_write_rec+0xf2/0x220 [logfs]
[<ffffffffa00894a4>] logfs_write_rec+0x64/0xd0 [logfs]
[<ffffffffa0089616>] __logfs_write_buf+0x106/0x110 [logfs]
[<ffffffffa008a19e>] logfs_write_buf+0x4e/0x80 [logfs]
[<ffffffffa008a6b8>] __logfs_write_inode+0x98/0x110 [logfs]
[<ffffffffa008a7c4>] logfs_truncate+0x54/0x290 [logfs]
[<ffffffffa008abfc>] logfs_evict_inode+0xdc/0x190 [logfs]
[<ffffffff8115eef5>] evict+0x85/0x170
[<ffffffff8115f126>] iput+0xe6/0x1b0
[<ffffffff8115b4a8>] shrink_dcache_for_umount_subtree+0x218/0x280
[<ffffffff8115ce91>] shrink_dcache_for_umount+0x51/0x90
[<ffffffff8114796c>] generic_shutdown_super+0x2c/0x100
[<ffffffffa008cc47>] logfs_kill_sb+0x57/0xf0 [logfs]
[<ffffffff81147de5>] deactivate_locked_super+0x45/0x70
[<ffffffff811487ea>] deactivate_super+0x4a/0x70
[<ffffffff81163934>] mntput_no_expire+0xa4/0xf0
[<ffffffff8116469f>] sys_umount+0x6f/0x380
[<ffffffff814dd46b>] system_call_fastpath+0x16/0x1b
Code: 55 c8 49 8d b6 a8 00 00 00 45 89 f9 45 89 e8 4c 89 e1 4c 89 55
b8 c7 04 24 00 00 00 00 e8 68 fc ff ff 4c 8b 55 b8 e9 3c ff ff ff <0f>
0b 0f 0b c7 45 c0 00 00 00 00 e9 44 fe ff ff 66 66 66 66 66
RIP  [<ffffffffa008c841>] logfs_segment_write+0x211/0x230 [logfs]
RSP <ffff880062d7b9e8>
---[ end trace fe6b040cea952290 ]---

Therefore, move super->s_flags setting after the fs-indenpendent work
has been finished.

Reviewed-by: Joern Engel <joern@logfs.org>
Signed-off-by: Prasad Joshi <prasadjoshi.linux@gmail.com>
2012-01-28 11:37:47 +05:30
Prasad Joshi
13ced29cb2 logfs: take write mutex lock during fsync and sync
LogFS uses super->s_write_mutex while writing data to disk. Taking the
same mutex lock in sync and fsync code path solves the following BUG:

------------[ cut here ]------------
kernel BUG at /home/prasad/logfs/dev_bdev.c:134!

Pid: 2387, comm: flush-253:16 Not tainted 3.0.0+ #4 Bochs Bochs
RIP: 0010:[<ffffffffa007deed>]  [<ffffffffa007deed>]
                bdev_writeseg+0x25d/0x270 [logfs]
Call Trace:
[<ffffffffa007c381>] logfs_open_area+0x91/0x150 [logfs]
[<ffffffff8128dcb2>] ? find_level.clone.9+0x62/0x100
[<ffffffffa007c49c>] __logfs_segment_write.clone.20+0x5c/0x190 [logfs]
[<ffffffff810ef005>] ? mempool_kmalloc+0x15/0x20
[<ffffffff810ef383>] ? mempool_alloc+0x53/0x130
[<ffffffffa007c7a4>] logfs_segment_write+0x1d4/0x230 [logfs]
[<ffffffffa0078f8e>] logfs_write_i0+0x12e/0x190 [logfs]
[<ffffffffa0079300>] __logfs_write_rec+0x140/0x220 [logfs]
[<ffffffffa0079444>] logfs_write_rec+0x64/0xd0 [logfs]
[<ffffffffa00795b6>] __logfs_write_buf+0x106/0x110 [logfs]
[<ffffffffa007a13e>] logfs_write_buf+0x4e/0x80 [logfs]
[<ffffffffa0073e33>] __logfs_writepage+0x23/0x80 [logfs]
[<ffffffffa007410c>] logfs_writepage+0xdc/0x110 [logfs]
[<ffffffff810f5ba7>] __writepage+0x17/0x40
[<ffffffff810f6208>] write_cache_pages+0x208/0x4f0
[<ffffffff810f5b90>] ? set_page_dirty+0x70/0x70
[<ffffffff810f653a>] generic_writepages+0x4a/0x70
[<ffffffff810f75d1>] do_writepages+0x21/0x40
[<ffffffff8116b9d1>] writeback_single_inode+0x101/0x250
[<ffffffff8116bdbd>] writeback_sb_inodes+0xed/0x1c0
[<ffffffff8116c5fb>] writeback_inodes_wb+0x7b/0x1e0
[<ffffffff8116cc23>] wb_writeback+0x4c3/0x530
[<ffffffff814d984d>] ? sub_preempt_count+0x9d/0xd0
[<ffffffff8116cd6b>] wb_do_writeback+0xdb/0x290
[<ffffffff814d984d>] ? sub_preempt_count+0x9d/0xd0
[<ffffffff814d6208>] ? _raw_spin_unlock_irqrestore+0x18/0x40
[<ffffffff8105aa5a>] ? del_timer+0x8a/0x120
[<ffffffff8116cfac>] bdi_writeback_thread+0x8c/0x2e0
[<ffffffff8116cf20>] ? wb_do_writeback+0x290/0x290
[<ffffffff8106d2e6>] kthread+0x96/0xa0
[<ffffffff814de514>] kernel_thread_helper+0x4/0x10
[<ffffffff8106d250>] ? kthread_worker_fn+0x190/0x190
[<ffffffff814de510>] ? gs_change+0xb/0xb
RIP  [<ffffffffa007deed>] bdev_writeseg+0x25d/0x270 [logfs]
---[ end trace 0211ad60a57657c4 ]---

Reviewed-by: Joern Engel <joern@logfs.org>
Signed-off-by: Prasad Joshi <prasadjoshi.linux@gmail.com>
2012-01-28 11:36:06 +05:30
Joern Engel
934eed395d logfs: Prevent memory corruption
This is a bad one.  I wonder whether we were so far protected by
no_free_segments(sb) usually being smaller than LOGFS_NO_AREAS.

Found by Dan Carpenter <dan.carpenter@oracle.com> using smatch.

Signed-off-by: Joern Engel <joern@logfs.org>
Signed-off-by: Prasad Joshi <prasadjoshi.linux@gmail.com>
2012-01-28 11:24:21 +05:30
Prasad Joshi
96150606e2 logfs: update page reference count for pined pages
LogFS sets PG_private flag to indicate a pined page. We assumed that
marking a page as private is enough to ensure its existence. But
instead it is necessary to hold a reference count to the page.

The change resolves the following BUG

BUG: Bad page state in process flush-253:16  pfn:6a6d0
page flags: 0x100000000000808(uptodate|private)

Suggested-and-Acked-by: Joern Engel <joern@logfs.org>
Signed-off-by: Prasad Joshi <prasadjoshi.linux@gmail.com>
2012-01-28 11:23:10 +05:30
Chris Mason
9998eb7034 Btrfs: fix reservations in btrfs_page_mkwrite
Josef fixed btrfs_page_mkwrite to properly release reserved
extents if there was an error.  But if we fail to get a reservation
and we fail to dirty the inode (for ENOSPC reasons), we'll end up
trying to release a reservation we never had.

This makes sure we only release if we were able to reserve.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-27 10:44:44 -05:00
Josef Bacik
9b23062840 Btrfs: advance window_start if we're using a bitmap
If we span a long area in a bitmap we could end up taking a lot of time
searching to the next free area if we're searching from the original
window_start, so advance window_start in order to make sure we don't do any
superficial searching.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:12 -05:00
David Sterba
0c4e538bcc btrfs: mask out gfp flags in releasepage
btree_releasepage is a callback and can be passed unknown gfp flags and then
they may end up in kmem_cache_alloc called from alloc_extent_state, slab
allocator will BUG_ON when there is HIGHMEM or DMA32 flag set.

This may happen when btrfs is mounted from a loop device, which masks out
__GFP_IO flag. The check in try_release_extent_state

3399                 if ((mask & GFP_NOFS) == GFP_NOFS)
3400                         mask = GFP_NOFS;

will not work and passes unfiltered flags further resulting in crash at
mm/slab.c:2963

 [<000000000024ae4c>] cache_alloc_refill+0x3b4/0x5c8
 [<000000000024c810>] kmem_cache_alloc+0x204/0x294
 [<00000000001fd3c2>] mempool_alloc+0x52/0x170
 [<000003c000ced0b0>] alloc_extent_state+0x40/0xd4 [btrfs]
 [<000003c000cee5ae>] __clear_extent_bit+0x38a/0x4cc [btrfs]
 [<000003c000cee78c>] try_release_extent_state+0x9c/0xd4 [btrfs]
 [<000003c000cc4c66>] btree_releasepage+0x7e/0xd0 [btrfs]
 [<0000000000210d84>] shrink_page_list+0x6a0/0x724
 [<0000000000211394>] shrink_inactive_list+0x230/0x578
 [<0000000000211bb8>] shrink_list+0x6c/0x120
 [<0000000000211e4e>] shrink_zone+0x1e2/0x228
 [<0000000000211f24>] shrink_zones+0x90/0x254
 [<0000000000213410>] do_try_to_free_pages+0xac/0x420
 [<0000000000213ae0>] try_to_free_pages+0x13c/0x1b0
 [<0000000000204e6c>] __alloc_pages_nodemask+0x5b4/0x9a8
 [<00000000001fb04a>] grab_cache_page_write_begin+0x7e/0xe8

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:12 -05:00
Miao Xie
9e622d6bea Btrfs: fix enospc error caused by wrong checks of the chunk
When we did sysbench test for inline files, enospc error happened easily though
there was lots of free disk space which could be allocated for new chunks.

Reproduce steps:
 # mkfs.btrfs -b $((2 * 1024 * 1024 * 1024)) <test partition>
 # mount <test partition> /mnt
 # ulimit -n 102400
 # cd /mnt
 # sysbench --num-threads=1 --test=fileio --file-num=81920 \
 > --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
 > --file-test-mode=seqwr prepare
 # sysbench --num-threads=1 --test=fileio --file-num=81920 \
 > --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
 > --file-test-mode=seqwr run
 <soon later, BUG_ON() was triggered by enospc error>

The reason of this bug is:
Now, we can reserve space which is larger than the free space in the chunks if
we have enough free disk space which can be used for new chunks. By this way,
the space allocator should allocate a new chunk by force if there is no free
space in the free space cache. But there are two wrong checks which break this
operation.

One is
	if (ret == -ENOSPC && num_bytes > min_alloc_size)
in btrfs_reserve_extent(), it is wrong, we should try to allocate a new chunk
even we fail to allocate free space by minimum allocable size.

The other is
	if (space_info->force_alloc)
		force = space_info->force_alloc;
in do_chunk_alloc(). It makes the allocator ignore CHUNK_ALLOC_FORCE If someone
sets ->force_alloc to CHUNK_ALLOC_LIMITED, and makes the enospc error happen.

Fix these two wrong checks. Especially the second one, we fix it by changing
the value of CHUNK_ALLOC_LIMITED and CHUNK_ALLOC_FORCE, and make
CHUNK_ALLOC_FORCE greater than CHUNK_ALLOC_LIMITED since CHUNK_ALLOC_FORCE has
higher priority. And if the value which is passed in by the caller is greater
than ->force_alloc, use the passed value.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:12 -05:00
Liu Bo
7ec31b548a Btrfs: do not defrag a file partially
xfstests 218 complains that btrfs defrags a file partially:
 After: 1
 Write backwards sync, but contiguous - should defrag to 1 extent
 Before: 10
-After: 1
+After: 2

To fix this, we need to set max_to_defrag count properly.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:12 -05:00
Stefan Behrens
0b485143d8 Btrfs: fix warning for 32-bit build of fs/btrfs/check-integrity.c
There have been 4 warnings on 32-bit build, they are herewith fixed.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:11 -05:00
Josef Bacik
0b4a9d248f Btrfs: use cluster->window_start when allocating from a cluster bitmap
We specifically set window_start in the cluster struct to indicate where the
cluster starts in a bitmap, but we've been using min_start to indicate where
we're searching from.  This is usually the start of the blockgroup, so
essentially means we're constantly searching from the start of any bitmap we
find, which completely negates all the trouble we go to in order to setup a
cluster.  So start using window_start to make sure we actually use the area we
found.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:11 -05:00
Mitch Harder
8bedd51b61 Btrfs: Check for NULL page in extent_range_uptodate
A user has encountered a NULL pointer kernel oops in btrfs when
encountering media errors.  The problem has been identified
as an unhandled NULL pointer returned from find_get_page().
This modification simply checks for a NULL page, and returns
with an error if found (the extent_range_uptodate() function
returns 1 on errors).

After testing this patch, the user reported that the error with
the NULL pointer oops was solved.  However, there is still a
remaining problem with a thread becoming stuck in
wait_on_page_locked(page) in the read_extent_buffer_pages(...)
function in extent_io.c

       for (i = start_i; i < num_pages; i++) {
               page = extent_buffer_page(eb, i);
               wait_on_page_locked(page);
               if (!PageUptodate(page))
                       ret = -EIO;
       }

This patch leaves the issue with the locked page yet to be resolved.

Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:11 -05:00
Jan Kara
6dd70ce4eb btrfs: Fix busyloops in transaction waiting code
wait_log_commit() and wait_for_writer() were using slightly different
conditions for deciding whether they should call schedule() and whether they
should continue in the wait loop. Thus it could happen that we busylooped when
the first condition was not true while the second one was. That is burning CPU
cycles needlessly and is deadly on UP machines...

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:11 -05:00