Commit graph

38969 commits

Author SHA1 Message Date
Steve French
9ccf321623 Add support for original fallocate
In many cases the simple fallocate call is
a no-op (since the file is already not sparse), or
the file can simply be converted from sparse to non-sparse
if we are fallocating the whole file and keeping
the size.
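
For illustration only (a hypothetical userspace reproducer, not part of the
patch): the case described above is a plain fallocate() over the whole
existing file, which keeps i_size unchanged.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sys/stat.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
          struct stat st;
          int fd;

          if (argc < 2)
                  return 1;
          fd = open(argv[1], O_RDWR);
          if (fd < 0 || fstat(fd, &st) < 0)
                  return 1;
          /* mode 0, offset 0, len = current size: the size stays the
           * same, so the server only has to make the file non-sparse */
          if (st.st_size && fallocate(fd, 0, 0, st.st_size) < 0)
                  perror("fallocate");
          return 0;
  }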

Signed-off-by: Steven French <smfrench@gmail.com>
2014-12-07 22:17:19 -06:00
Dmitry Monakhov
50db71abc5 ext4: ext4_da_convert_inline_data_to_extent drop locked page after error
Testcase:
xfstests generic/270
MKFS_OPTIONS="-q -I 256 -O inline_data,64bit"

Call Trace:
 [<ffffffff81144c76>] lock_page+0x35/0x39 -------> DEADLOCK
 [<ffffffff81145260>] pagecache_get_page+0x65/0x15a
 [<ffffffff811507fc>] truncate_inode_pages_range+0x1db/0x45c
 [<ffffffff8120ea63>] ? ext4_da_get_block_prep+0x439/0x4b6
 [<ffffffff811b29b7>] ? __block_write_begin+0x284/0x29c
 [<ffffffff8120e62a>] ? ext4_change_inode_journal_flag+0x16b/0x16b
 [<ffffffff81150af0>] truncate_inode_pages+0x12/0x14
 [<ffffffff81247cb4>] ext4_truncate_failed_write+0x19/0x25
 [<ffffffff812488cf>] ext4_da_write_inline_data_begin+0x196/0x31c
 [<ffffffff81210dad>] ext4_da_write_begin+0x189/0x302
 [<ffffffff810c07ac>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff810ddd13>] ? read_seqcount_begin.clone.1+0x9f/0xcc
 [<ffffffff8114309d>] generic_perform_write+0xc7/0x1c6
 [<ffffffff810c040e>] ? mark_held_locks+0x59/0x77
 [<ffffffff811445d1>] __generic_file_write_iter+0x17f/0x1c5
 [<ffffffff8120726b>] ext4_file_write_iter+0x2a5/0x354
 [<ffffffff81185656>] ? file_start_write+0x2a/0x2c
 [<ffffffff8107bcdb>] ? bad_area_nosemaphore+0x13/0x15
 [<ffffffff811858ce>] new_sync_write+0x8a/0xb2
 [<ffffffff81186e7b>] vfs_write+0xb5/0x14d
 [<ffffffff81186ffb>] SyS_write+0x5c/0x8c
 [<ffffffff816f2529>] system_call_fastpath+0x12/0x17

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-05 21:37:15 -05:00
Jaegeuk Kim
769ec6e5b7 f2fs: call radix_tree_preload before radix_tree_insert
This patch tries to fix:

 BUG: using smp_processor_id() in preemptible [00000000] code: f2fs_gc-254:0/384
  (radix_tree_node_alloc+0x14/0x74) from [<c033d8a0>] (radix_tree_insert+0x110/0x200)
  (radix_tree_insert+0x110/0x200) from [<c02e8264>] (gc_data_segment+0x340/0x52c)
  (gc_data_segment+0x340/0x52c) from [<c02e8658>] (f2fs_gc+0x208/0x400)
  (f2fs_gc+0x208/0x400) from [<c02e8a98>] (gc_thread_func+0x248/0x28c)
  (gc_thread_func+0x248/0x28c) from [<c0139944>] (kthread+0xa0/0xac)
  (kthread+0xa0/0xac) from [<c0105ef8>] (ret_from_fork+0x14/0x3c)

The reason is that f2fs calls radix_tree_insert with preemption enabled.
So, before calling it, we need to call radix_tree_preload.

Otherwise, we would have to use __GFP_WAIT for the radix tree, and use a
mutex or semaphore to cover the radix tree operations.
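
As a rough sketch of the preload pattern being referred to (hypothetical
caller, not the actual f2fs diff):

  #include <linux/radix-tree.h>
  #include <linux/spinlock.h>

  /* Preload per-cpu nodes while we may still sleep, then insert under a
   * spinlock; radix_tree_preload() disables preemption on success, so
   * the allocator's smp_processor_id() use is safe. */
  static int add_entry(struct radix_tree_root *root, unsigned long index,
                       void *entry, spinlock_t *lock)
  {
          int err = radix_tree_preload(GFP_NOFS);

          if (err)
                  return err;
          spin_lock(lock);
          err = radix_tree_insert(root, index, entry);
          spin_unlock(lock);
          radix_tree_preload_end();       /* re-enables preemption */
          return err;
  }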

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-05 09:51:04 -08:00
Al Viro
f77c80142e bury struct proc_ns in fs/proc
a) make get_proc_ns() return a pointer to struct ns_common
b) mirror ns_ops in dentry->d_fsdata of ns dentries, so that
is_mnt_ns_file() could get away with fewer dereferences.

That way struct proc_ns becomes invisible outside of fs/proc/*.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:34:54 -05:00
Al Viro
33c429405a copy address of proc_ns_ops into ns_common
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:34:47 -05:00
Al Viro
6344c433a4 new helpers: ns_alloc_inum/ns_free_inum
take struct ns_common *, for now simply wrappers around proc_{alloc,free}_inum()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:34:36 -05:00
Al Viro
64964528b2 make proc_ns_operations work with struct ns_common * instead of void *
We can do that now.  And kill ->inum(), while we are at it - all instances
are identical.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:34:17 -05:00
Al Viro
58be28256d make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:33:24 -05:00
Al Viro
435d5f4bb2 common object embedded into various struct ....ns
for now - just move corresponding ->proc_inum instances over there

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:31:00 -05:00
Jaegeuk Kim
8b26ef98da f2fs: use rw_semaphore for nat entry lock
Previously, we used an rwlock for the nat_entry lock.
But now we have a lot of complex operations in set_node_addr
(e.g., allocating kernel memory, handling radix trees, and so on).

So, this patch changes the lock to an rw_semaphore so that the CPU can be
yielded to other threads.
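
Schematically (simplified, not the actual f2fs structures), the change is
from a non-sleeping rwlock to a semaphore that writers may sleep under:

  #include <linux/rwsem.h>

  static DECLARE_RWSEM(nat_tree_lock);   /* was an rwlock_t */

  /* writers may now sleep (memory allocation, radix tree preload, ...)
   * while holding the lock, which an rwlock would not allow */
  static void writer_path(void)
  {
          down_write(&nat_tree_lock);
          /* allocate memory, update radix trees, ... */
          up_write(&nat_tree_lock);
  }

  static void reader_path(void)
  {
          down_read(&nat_tree_lock);
          /* look up nat entries */
          up_read(&nat_tree_lock);
  }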

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-03 21:23:29 -08:00
Jaegeuk Kim
4634d71ed1 f2fs: fix missing kmem_cache_free
This patch fixes a missing kmem_cache_free when handling errors.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-03 16:40:28 -08:00
Dave Chinner
6044e4386c Merge branch 'xfs-misc-fixes-for-3.19-2' into for-next
Conflicts:
	fs/xfs/xfs_iops.c
2014-12-04 09:46:17 +11:00
Brian Foster
b29c70f598 xfs: split metadata and log buffer completion to separate workqueues
XFS traditionally sends all buffer I/O completion work to a single
workqueue. This includes metadata buffer completion and log buffer
completion. The log buffer completion requires a high priority queue to
prevent stalls due to log forces getting stuck behind other queued work.

Rather than continue to prioritize all buffer I/O completion due to the
needs of log completion, split log buffer completion off to
m_log_workqueue and move the high priority flag from m_buf_workqueue to
m_log_workqueue.

Add a b_ioend_wq wq pointer to xfs_buf to allow completion workqueue
customization on a per-buffer basis. Initialize b_ioend_wq to
m_buf_workqueue by default in the generic buffer I/O submission path.
Finally, override the default wq with the high priority m_log_workqueue
in the log buffer I/O submission path.
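
Roughly, the shape of the change is as follows (field names taken from the
text above, everything else guessed, so treat this as a sketch):

  /* buffer setup: default to the ordinary metadata completion queue */
  bp->b_ioend_wq = mp->m_buf_workqueue;

  /* log buffer submission path: override with the high priority queue
   * (m_log_workqueue created with WQ_HIGHPRI) */
  bp->b_ioend_wq = mp->m_log_workqueue;

  /* I/O completion: run the end-I/O work on whichever queue was chosen */
  queue_work(bp->b_ioend_wq, &bp->b_ioend_work);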

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:43:17 +11:00
Dave Chinner
32296f865e xfs: fix set-but-unused warnings
The kernel compile doesn't turn on these checks by default, so it's
only when I do a kernel-user sync that I find that there are lots of
compiler warnings waiting to be fixed. Fix up these set-but-unused
warnings.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:43:17 +11:00
Dave Chinner
9a2cc41cda xfs: move type conversion functions to xfs_dir.h
These are currently considered private to libxfs, but they are
widely used by the userspace code to decode, walk and check
directory structures. Hence they really form part of the external
API and as such need to be moved to xfs_dir2.h.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:43:17 +11:00
Dave Chinner
1b767ee386 xfs: move ftype conversion functions to libxfs
These functions are needed in userspace for repair and mkfs to
do the right thing. Move them to libxfs so they can be easily
shared.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:43:17 +11:00
Dave Chinner
2d3d0c53df xfs: lobotomise xfs_trans_read_buf_map()
There's a case in that code where it checks for a buffer match in a
transaction where the buffer is not marked done. i.e. trying to
catch a buffer we have locked in the transaction but have not
completed IO on.

The only way we can find a buffer that has not had IO completed on
it is if it had readahead issued on it, but we never do readahead on
buffers that we have already joined into a transaction. Hence this
condition cannot occur, and buffers locked and joined into a
transaction should always be marked done and not under IO.

Remove this code and re-order xfs_trans_read_buf_map() to remove
duplicated IO dispatch and error handling code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:43:13 +11:00
Dave Chinner
cdc9cec7c0 xfs: active inodes stat is broken
vn_active only ever gets decremented, so it ends up as a very large
negative number.  Make it properly track the number of inodes we
currently have allocated so we can easily track the size of the inode
cache via tools like PCP.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:42:40 +11:00
Dave Chinner
4db431f57b xfs: cleanup xfs_bmse_merge returns
xfs_bmse_merge() has a jump label for return that just returns the
error value. Convert all the code to just return the error directly
and use XFS_WANT_CORRUPTED_RETURN. This also allows the final call
to xfs_bmbt_update() to return directly.

Noticed while reviewing coccinelle return cleanup patches and
wondering why the same return pattern as in xfs_bmse_shift_one()
wasn't picked up by the checker pattern...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:42:40 +11:00
Dave Chinner
b11bd671ba xfs: cleanup xfs_bmse_shift_one goto mess
xfs_bmse_shift_one() jumps around determining whether to shift or
merge, making the code flow difficult to follow. Clean it up and
use direct error returns (including XFS_WANT_CORRUPTED_RETURN) to
make the code flow better and be easier to read.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:42:24 +11:00
Dave Chinner
7a1df15616 xfs: fix premature enospc on inode allocation
After growing a filesystem, XFS can fail to allocate inodes even
though there is a large amount of space available in the filesystem
for inodes. The issue is caused by a nearly full allocation group
having enough free space in it to be considered for inode
allocation, but not enough contiguous free space to actually
allocate inodes.  This situation results in successful selection
of the AG for allocation, then failure of the allocation resulting
in ENOSPC being reported to the caller.

It is caused by two possible issues. Firstly, we only consider the
longest free extent and whether it would fit an inode chunk. If the
extent is not correctly aligned, then we can't allocate an inode
chunk in it regardless of the fact that it is large enough. This
tends to be a permanent error until space in the AG is freed.

The second issue is that we don't actually lock the AGI or AGF when
we are doing these checks, and so by the time we get to actually
allocating the inode chunk the space we thought we had in the AG may
have been allocated. This tends to be a spurious error as it
requires a race to trigger. Hence this case is ignored in this patch
as the reported problem is for permanent errors.

The first issue could be addressed by simply taking into account the
alignment when checking the longest extent. This, however, would
prevent allocation in AGs that have aligned, exact sized extents
free. However, this case should be fairly rare compared to the
number of allocations that occur near ENOSPC that would trigger this
condition.

Hence, when selecting the inode AG, take into account the inode
cluster alignment when checking the longest free extent in the AG.
If we can't find any AGs with a contiguous free space large
enough to be aligned, drop the alignment addition and just try for
an AG that has enough contiguous free space available for an inode
chunk. This won't prevent issues from occurring, but should avoid
situations where other AGs have lots of free space but the selected
AG can't allocate due to alignment constraints.
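
Sketched as a hypothetical helper (not the real xfs_ialloc code, just the
selection rule described above):

  static bool ag_fits_inode_chunk(xfs_extlen_t longest,
                                  xfs_extlen_t chunk_len,
                                  xfs_extlen_t cluster_align,
                                  bool require_alignment)
  {
          /* first pass: insist on room for an aligned inode chunk */
          if (require_alignment)
                  return longest >= chunk_len + cluster_align;
          /* fallback pass: any contiguous run big enough will do */
          return longest >= chunk_len;
  }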

Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:42:21 +11:00
Peter Watkins
76b5730252 xfs: overflow in xfs_iomap_eof_align_last_fsb
If extsize is set and new_last_fsb is larger than 32 bits, the
roundup to extsize will overflow the align variable. Instead,
combine alignments by rounding stripe size up to extsize.
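
Schematically (types and the helper are illustrative, not the actual
function):

  static xfs_fileoff_t align_eof(xfs_fileoff_t new_last_fsb,
                                 xfs_extlen_t stripe_align,
                                 xfs_extlen_t extsize)
  {
          xfs_extlen_t align = stripe_align;

          /* combining the alignments first keeps a single 64-bit roundup
           * of the offset, instead of a 32-bit roundup that can overflow */
          if (extsize)
                  align = align ? roundup(align, extsize) : extsize;
          if (align)
                  new_last_fsb = roundup_64(new_last_fsb, align);
          return new_last_fsb;
  }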

Signed-off-by: Peter Watkins <treestem@gmail.com>
Reviewed-by: Nathaniel W. Turner <nate@houseofnate.net>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:30:51 +11:00
Dave Chinner
e77b8547ca Merge branch 'xfs-coccinelle-cleanups' into xfs-misc-fixes-for-3.19-2
2014-12-04 09:18:21 +11:00
Al Viro
1ead0e79bf fat: fix oops on corrupted vfat fs
a) don't bother with ->d_time for positives - we only check it for
   negatives anyway.

b) make sure to set it at unlink and rmdir time - at *that* point
   soon-to-be negative dentry matches then-current directory contents

c) don't go into renaming of old alias in vfat_lookup() unless it
   has the same parent (which it will, unless we are seeing corrupted
   image)

[hirofumi@mail.parknet.co.jp: make change minimum, don't call d_move() for dir]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: <stable@vger.kernel.org>	[3.17.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-03 09:36:03 -08:00
Chris Mason
9627aeee3e Merge branch 'raid56-scrub-replace' of git://github.com/miaoxie/linux-btrfs into for-linus
2014-12-02 18:42:03 -08:00
Josef Bacik
cb83b7b816 Btrfs: make get_caching_control unconditionally return the ctl
This was written when we didn't do a caching control for the fast free space
cache loading.  However, we started doing that a long time ago, and there is
still a small window of time during which we could be caching the block group
the fast way, so if there is a caching_ctl at all on the block group just
return it; the callers all wait properly for what they want.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:10 -08:00
Filipe Manana
8dbcd10f69 Btrfs: fix unprotected deletion from pending_chunks list
On block group removal, if the corresponding extent map was on the
transaction->pending_chunks list, we were deleting the extent map
from that list, through remove_extent_mapping(), without any
synchronization with chunk allocation (which iterates that list
and adds new elements to it). Fix this by ensuring that this is done
while the chunk mutex is held, since that's the mutex that protects
the list in the chunk allocation code path.
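
A minimal sketch of the idea (fragment, not the actual diff):

  /* remove_extent_mapping() also unlinks the em from the list it is on,
   * so it must run under the mutex that the chunk allocator holds while
   * walking transaction->pending_chunks */
  mutex_lock(&fs_info->chunk_mutex);
  write_lock(&em_tree->lock);
  remove_extent_mapping(em_tree, em);
  write_unlock(&em_tree->lock);
  mutex_unlock(&fs_info->chunk_mutex);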

This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"

But the issue in fact was already present before that change, it only
became easier to hit after Josef's 3.18 patch that added automatic
removal of empty block groups.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:10 -08:00
Filipe Manana
495e64f4fe Btrfs: fix fs mapping extent map leak
On chunk allocation error (label "error_del_extent"), after adding the
extent map to the tree and to the pending chunks list, we would end up
decrementing the extent map's refcount by 2 instead of 3 (our allocation
+ tree reference + list reference).

Also, on chunk/block group removal, if the block group was on the list
pending_chunks we weren't decrementing the respective list reference.
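
To make the reference counting concrete (a sketch only, assuming each
free_extent_map() call drops exactly one reference):

  free_extent_map(em);    /* tree reference */
  free_extent_map(em);    /* pending_chunks list reference */
  free_extent_map(em);    /* our allocation reference */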

Detected by 'rmmod btrfs':

[20770.105881] kmem_cache_destroy btrfs_extent_map: Slab cache still has objects
[20770.106127] CPU: 2 PID: 11093 Comm: rmmod Tainted: G        W    L 3.17.0-rc5-btrfs-next-1+ #1
[20770.106128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[20770.106130]  0000000000000000 ffff8800ba867eb8 ffffffff813e7a13 ffff8800a2e11040
[20770.106132]  ffff8800ba867ed0 ffffffff81105d0c 0000000000000000 ffff8800ba867ee0
[20770.106134]  ffffffffa035d65e ffff8800ba867ef0 ffffffffa03b0654 ffff8800ba867f78
[20770.106136] Call Trace:
[20770.106142]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
[20770.106145]  [<ffffffff81105d0c>] kmem_cache_destroy+0x4b/0x90
[20770.106164]  [<ffffffffa035d65e>] extent_map_exit+0x1a/0x1c [btrfs]
[20770.106176]  [<ffffffffa03b0654>] exit_btrfs_fs+0x27/0x9d3 [btrfs]
[20770.106179]  [<ffffffff8109dc97>] SyS_delete_module+0x153/0x1c4
[20770.106182]  [<ffffffff8121261b>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[20770.106184]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b

This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"

But the issue in fact was already present before that change, it only
became easier to hit after Josef's 3.18 patch that added automatic
removal of empty block groups.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:10 -08:00
Filipe Manana
946ddbe805 Btrfs: fix memory leak after block remove + trimming
There was a free space entry structure memory leak if a block
group is removed while a free space entry is being trimmed, which
the following diagram explains:

           CPU 1                                          CPU 2

  btrfs_trim_block_group()
      trim_no_bitmap()
          remove free space entry from
          block group cache's rbtree
          do_trimming()

                                                btrfs_remove_block_group()
                                                    btrfs_remove_free_space_cache()

              add back free space entry to
              block group's cache rbtree
  btrfs_put_block_group()

                                                    (...)
                                                    btrfs_put_block_group()
                                                        kfree(bg->free_space_ctl)
                                                        kfree(bg)

The free space entry added after doing the discard of its respective
range ends up never being freed.
Detected after doing an "rmmod btrfs" after running the stress test
recently submitted for fstests:

[ 8234.642212] kmem_cache_destroy btrfs_free_space: Slab cache still has objects
[ 8234.642657] CPU: 1 PID: 32276 Comm: rmmod Tainted: G        W    L 3.17.0-rc5-btrfs-next-2+ #1
[ 8234.642660] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 8234.642664]  0000000000000000 ffff8801af1b3eb8 ffffffff8140c7b6 ffff8801dbedd0c0
[ 8234.642670]  ffff8801af1b3ed0 ffffffff811149ce 0000000000000000 ffff8801af1b3ee0
[ 8234.642676]  ffffffffa042dbe7 ffff8801af1b3ef0 ffffffffa0487422 ffff8801af1b3f78
[ 8234.642682] Call Trace:
[ 8234.642692]  [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
[ 8234.642699]  [<ffffffff811149ce>] kmem_cache_destroy+0x4d/0x92
[ 8234.642731]  [<ffffffffa042dbe7>] btrfs_destroy_cachep+0x63/0x76 [btrfs]
[ 8234.642757]  [<ffffffffa0487422>] exit_btrfs_fs+0x9/0xbe7 [btrfs]
[ 8234.642762]  [<ffffffff810a76a5>] SyS_delete_module+0x155/0x1c6
[ 8234.642768]  [<ffffffff8122a7eb>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 8234.642773]  [<ffffffff814122d2>] system_call_fastpath+0x16/0x1b

This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:09 -08:00
Filipe Manana
c92f6be34c Btrfs: make btrfs_abort_transaction consider existence of new block groups
If the transaction handle doesn't have used blocks but has created new block
groups, make sure we turn the fs into readonly mode too. This is because the
new block groups didn't get all their metadata persisted into the chunk and
device trees, and therefore if a subsequent transaction starts, allocates
space from the new block groups, writes data or metadata into that space,
commits successfully and then after we unmount and mount the filesystem
again, the same space can be allocated again for a new block group,
resulting in file data or metadata corruption.

Example where we don't abort the transaction when we fail to finish the
chunk allocation (add items to the chunk and device trees) and later a
future transaction where the block group is removed fails because it can't
find the chunk item in the chunk tree:

[25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]()
[25230.404301] BTRFS: Transaction aborted (error -28)
[25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs]
[25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1
[25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[25230.404328]  0000000000000000 ffff88004581bb08 ffffffff813e7a13 ffff88004581bb50
[25230.404330]  ffff88004581bb40 ffffffff810423aa ffffffffa049386a 00000000ffffffe4
[25230.404332]  ffffffffa05214c0 000000000000240c ffff88010fc8f800 ffff88004581bba8
[25230.404334] Call Trace:
[25230.404338]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
[25230.404342]  [<ffffffff810423aa>] warn_slowpath_common+0x7f/0x98
[25230.404351]  [<ffffffffa049386a>] ? __btrfs_abort_transaction+0x50/0xfc [btrfs]
[25230.404353]  [<ffffffff8104240b>] warn_slowpath_fmt+0x48/0x50
[25230.404362]  [<ffffffffa049386a>] __btrfs_abort_transaction+0x50/0xfc [btrfs]
[25230.404374]  [<ffffffffa04a8c43>] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
[25230.404387]  [<ffffffffa04b77fd>] __btrfs_end_transaction+0x7e/0x2de [btrfs]
[25230.404398]  [<ffffffffa04b7a6d>] btrfs_end_transaction+0x10/0x12 [btrfs]
[25230.404408]  [<ffffffffa04a3d64>] btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
[25230.404421]  [<ffffffffa04c53bd>] __btrfs_buffered_write+0x160/0x48d [btrfs]
[25230.404425]  [<ffffffff811a9268>] ? cap_inode_need_killpriv+0x2d/0x37
[25230.404429]  [<ffffffff810f6501>] ? get_page+0x1a/0x2b
[25230.404441]  [<ffffffffa04c7c95>] btrfs_file_write_iter+0x321/0x42f [btrfs]
[25230.404443]  [<ffffffff8110f5d9>] ? handle_mm_fault+0x7f3/0x846
[25230.404446]  [<ffffffff813e98c5>] ? mutex_unlock+0x16/0x18
[25230.404449]  [<ffffffff81138d68>] new_sync_write+0x7c/0xa0
[25230.404450]  [<ffffffff81139401>] vfs_write+0xb0/0x112
[25230.404452]  [<ffffffff81139c9d>] SyS_pwrite64+0x66/0x84
[25230.404454]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
[25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
[25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left).
[25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:09 -08:00
Filipe Manana
55507ce361 Btrfs: fix race between writing free space cache and trimming
Trimming is completely transactionless, and the way it operates consists
of hiding free space entries from a block group, performing the trim/discard
and then making the free space entries visible again.
Therefore, while a free space entry is being trimmed, we can have free space
cache writing running in parallel (as part of a transaction commit) which
will miss the free space entry. This means that if we unmount (or
crash/reboot) after that transaction commit and mount again before another
transaction starts/commits after the discard finishes, we will have some
free space that won't be used again unless the free space cache is rebuilt.
After the unmount, fsck (btrfsck, btrfs check) reports the issue like the
following example:

        *** fsck.btrfs output ***
        checking extents
        checking free space cache
        There is no free space entry for 521764864-521781248
        There is no free space entry for 521764864-1103101952
        cache appears valid but isnt 29360128
        Checking filesystem on /dev/sdc
        UUID: b4789e27-4774-4626-98e9-ae8dfbfb0fb5
        found 1235681286 bytes used err is -22
        (...)

Another issue caused by this race is a crash while writing bitmap entries
to the cache, because while the cache writeout task accesses the bitmaps,
the trim task can be concurrently modifying the bitmap or worse might
be freeing the bitmap. The latter case results in the following crash:

[55650.804460] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[55650.804835] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop parport_pc parport i2c_piix4 psmouse evdev pcspkr microcode processor i2ccore serio_raw thermal_sys button ext4 crc16 jbd2 mbcache sg sd_mod crc_t10dif sr_mod cdrom crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata virtio_pci virtio_ring virtio scsi_mod e1000 [last unloaded: btrfs]
[55650.806169] CPU: 1 PID: 31002 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
[55650.806493] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[55650.806867] task: ffff8800b12f6410 ti: ffff880071538000 task.ti: ffff880071538000
[55650.807166] RIP: 0010:[<ffffffffa037cf45>]  [<ffffffffa037cf45>] write_bitmap_entries+0x65/0xbb [btrfs]
[55650.807514] RSP: 0018:ffff88007153bc30  EFLAGS: 00010246
[55650.807687] RAX: 000000005d1ec000 RBX: ffff8800a665df08 RCX: 0000000000000400
[55650.807885] RDX: ffff88005d1ec000 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88005d1ec000
[55650.808017] RBP: ffff88007153bc58 R08: 00000000ddd51536 R09: 00000000000001e0
[55650.808017] R10: 0000000000000000 R11: 0000000000000037 R12: 6b6b6b6b6b6b6b6b
[55650.808017] R13: ffff88007153bca8 R14: 6b6b6b6b6b6b6b6b R15: ffff88007153bc98
[55650.808017] FS:  0000000000000000(0000) GS:ffff88023ec80000(0000) knlGS:0000000000000000
[55650.808017] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[55650.808017] CR2: 0000000002273b88 CR3: 00000000b18f6000 CR4: 00000000000006e0
[55650.808017] Stack:
[55650.808017]  ffff88020e834e00 ffff880172d68db0 0000000000000000 ffff88019257c800
[55650.808017]  ffff8801d42ea720 ffff88007153bd10 ffffffffa037d2fa ffff880224e99180
[55650.808017]  ffff8801469a6188 ffff880224e99140 ffff880172d68c50 00000003000000b7
[55650.808017] Call Trace:
[55650.808017]  [<ffffffffa037d2fa>] __btrfs_write_out_cache+0x1ea/0x37f [btrfs]
[55650.808017]  [<ffffffffa037d959>] btrfs_write_out_cache+0xa1/0xd8 [btrfs]
[55650.808017]  [<ffffffffa033936b>] btrfs_write_dirty_block_groups+0x4b5/0x505 [btrfs]
[55650.808017]  [<ffffffffa03aa98e>] commit_cowonly_roots+0x15e/0x1f7 [btrfs]
[55650.808017]  [<ffffffff813eb9c7>] ? _raw_spin_lock+0xe/0x10
[55650.808017]  [<ffffffffa0346e46>] btrfs_commit_transaction+0x411/0x882 [btrfs]
[55650.808017]  [<ffffffffa03432a4>] transaction_kthread+0xf2/0x1a4 [btrfs]
[55650.808017]  [<ffffffffa03431b2>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[55650.808017]  [<ffffffff8105966b>] kthread+0xb7/0xbf
[55650.808017]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[55650.808017]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[55650.808017]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[55650.808017] Code: 4c 89 ef 8d 70 ff e8 d4 fc ff ff 41 8b 45 34 41 39 45 30 7d 5c 31 f6 4c 89 ef e8 80 f6 ff ff 49 8b 7d 00 4c 89 f6 b9 00 04 00 00 <f3> a5 4c 89 ef 41 8b 45 30 8d 70 ff e8 a3 fc ff ff 41 8b 45 34
[55650.808017] RIP  [<ffffffffa037cf45>] write_bitmap_entries+0x65/0xbb [btrfs]
[55650.808017]  RSP <ffff88007153bc30>
[55650.815725] ---[ end trace 1c032e96b149ff86 ]---

Fix this by serializing both tasks in such a way that cache writeout
doesn't wait for the trim/discard of free space entries to finish and
doesn't miss any free space entry.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:09 -08:00
Filipe Manana
04216820fe Btrfs: fix race between fs trimming and block group remove/allocation
Our fs trim operation, which is completely transactionless (doesn't start
or join an existing transaction), consists of visiting all block groups
and then, for each one, iterating its free space entries and performing a
discard operation against the space range represented by the free space
entries. However, before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.

If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, removing the extent map that maps
logical addresses to physical addresses and the corresponding chunk metadata
from the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.

So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.

If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:

        checking extents
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        owner ref check failed [833912832 16384]
        Errors found in extent allocation tree or chunk allocation
        checking free space cache
        checking fs roots
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        root 5 root dir 256 error
        root 5 inode 260 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 262 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 263 errors 2001, no inode item, link count wrong
        (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:09 -08:00
Zhao Lei
5d3edd8f44 Btrfs, replace: enable dev-replace for raid56
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:20:48 +08:00
Filipe Manana
ae0ab003f2 Btrfs: fix freeing used extents after removing empty block group
There's a race between adding a block group to the list of the unused
block groups and removing an unused block group (cleaner kthread) that
leads to freeing extents that are in use or a crash during transaction
commit. Basically the cleaner kthread, when executing
btrfs_delete_unused_bgs(), might catch the newly added block group to
the list fs_info->unused_bgs and clear the range representing the whole
group from fs_info->freed_extents[] before the task that added the block
group to the list (running update_block_group()) marked the last freed
extent as dirty in fs_info->freed_extents (pinned_extents).

That is:

     CPU 1                                CPU 2

                                  btrfs_delete_unused_bgs()
update_block_group()
   add block group to
   fs_info->unused_bgs
                                    got block group from the list
                                    clear_extent_bits for the whole
                                    block group range in freed_extents[]
   set_extent_dirty for the
   range covering the freed
   extent in freed_extents[]
   (fs_info->pinned_extents)

                                  block group deleted, and a new block
                                  group with the same logical address is
                                  created

                                  reserve space from the new block group
                                  for new data or metadata - the reserved
                                  space overlaps the range specified by
                                  CPU 1 for set_extent_dirty()

                                  commit transaction
                                    find all ranges marked as dirty in
                                    fs_info->pinned_extents, clear them
                                    and add them to the free space cache

Alternatively, if CPU 2 doesn't create a new block group with the same
logical address, we get a crash/BUG_ON at transaction commit when unpinning
extent ranges because we can't find a block group for the range marked as
dirty by CPU 1. Sample trace:

[ 2163.426462] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 2163.426640] Modules linked in: btrfs xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio crc32c_generic libcrc32c dm_mod nfsd auth_rpc
gss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse parport_pc parport i2c_piix4 processor thermal_sys i2ccore evdev button pcspkr microcode serio_raw ext4 crc16 jbd2 mbcache
 sg sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata e1000 scsi_mod virtio_pci virtio_ring virtio
[ 2163.428209] CPU: 0 PID: 11858 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
[ 2163.428519] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 2163.428875] task: ffff88009f2c0650 ti: ffff8801356bc000 task.ti: ffff8801356bc000
[ 2163.429157] RIP: 0010:[<ffffffffa037728e>]  [<ffffffffa037728e>] unpin_extent_range.isra.58+0x62/0x192 [btrfs]
[ 2163.429562] RSP: 0018:ffff8801356bfda8  EFLAGS: 00010246
[ 2163.429802] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2163.429990] RDX: 0000000041bfffff RSI: 0000000001c00000 RDI: ffff880024307080
[ 2163.430042] RBP: ffff8801356bfde8 R08: 0000000000000068 R09: ffff88003734f118
[ 2163.430042] R10: ffff8801356bfcb8 R11: fffffffffffffb69 R12: ffff8800243070d0
[ 2163.430042] R13: 0000000083c04000 R14: ffff8800751b0f00 R15: ffff880024307000
[ 2163.430042] FS:  0000000000000000(0000) GS:ffff88013f400000(0000) knlGS:0000000000000000
[ 2163.430042] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2163.430042] CR2: 00007ff10eb43fc0 CR3: 0000000004cb8000 CR4: 00000000000006f0
[ 2163.430042] Stack:
[ 2163.430042]  ffff8800243070d0 0000000083c08000 0000000083c07fff ffff88012d6bc800
[ 2163.430042]  ffff8800243070d0 ffff8800751b0f18 ffff8800751b0f00 0000000000000000
[ 2163.430042]  ffff8801356bfe18 ffffffffa037a481 0000000083c04000 0000000083c07fff
[ 2163.430042] Call Trace:
[ 2163.430042]  [<ffffffffa037a481>] btrfs_finish_extent_commit+0xac/0xbf [btrfs]
[ 2163.430042]  [<ffffffffa038c06d>] btrfs_commit_transaction+0x6ee/0x882 [btrfs]
[ 2163.430042]  [<ffffffffa03881f1>] transaction_kthread+0xf2/0x1a4 [btrfs]
[ 2163.430042]  [<ffffffffa03880ff>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[ 2163.430042]  [<ffffffff8105966b>] kthread+0xb7/0xbf
[ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[ 2163.430042]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67

So fix this by making update_block_group() first set the range as dirty
in pinned_extents before adding the block group to the unused_bgs list.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:19:17 -08:00
Filipe Manana
4f69cb987e Btrfs: fix crash caused by block group removal
If we remove a block group (because it became empty), we might have left
a caching_ctl structure in fs_info->caching_block_groups that points to
the block group and is accessed at transaction commit time. This results
in accessing an invalid or incorrect block group. This issue became visible
after Josef's patch "Btrfs: remove empty block groups automatically".

So if the block group is removed make sure we don't leave a dangling
caching_ctl in caching_block_groups.

Sample crash trace:

[58380.439449] BUG: unable to handle kernel paging request at ffff8801446eaeb8
[58380.439707] IP: [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 80000001446ea060
[58380.441220] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs]
[58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
[58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[58380.443238] task: ffff88013ac82090 ti: ffff88013896c000 task.ti: ffff88013896c000
[58380.443238] RIP: 0010:[<ffffffffa03f6d05>]  [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.443238] RSP: 0018:ffff88013896fdd8  EFLAGS: 00010283
[58380.443238] RAX: ffff880222cae850 RBX: ffff880119ba74c0 RCX: 0000000000000000
[58380.443238] RDX: 0000000000000000 RSI: ffff880185e16800 RDI: ffff8801446eaeb8
[58380.443238] RBP: ffff88013896fdd8 R08: ffff8801a9ca9fa8 R09: ffff88013896fc60
[58380.443238] R10: ffff88013896fd28 R11: 0000000000000000 R12: ffff880222cae000
[58380.443238] R13: ffff880222cae850 R14: ffff880222cae6b0 R15: ffff8801446eae00
[58380.443238] FS:  0000000000000000(0000) GS:ffff88023ed80000(0000) knlGS:0000000000000000
[58380.443238] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[58380.443238] CR2: ffff8801446eaeb8 CR3: 0000000001811000 CR4: 00000000000006e0
[58380.443238] Stack:
[58380.443238]  ffff88013896fe18 ffffffffa03fe2d5 ffff880222cae850 ffff880185e16800
[58380.443238]  ffff88000dc41c20 0000000000000000 ffff8801a9ca9f00 0000000000000000
[58380.443238]  ffff88013896fe80 ffffffffa040fbcf ffff88018b0dcdb0 ffff88013ac82090
[58380.443238] Call Trace:
[58380.443238]  [<ffffffffa03fe2d5>] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs]
[58380.443238]  [<ffffffffa040fbcf>] btrfs_commit_transaction+0x45c/0x882 [btrfs]
[58380.443238]  [<ffffffffa040c058>] transaction_kthread+0xf2/0x1a4 [btrfs]
[58380.443238]  [<ffffffffa040bf66>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[58380.443238]  [<ffffffff8105966b>] kthread+0xb7/0xbf
[58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[58380.443238]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:19:17 -08:00
Filipe Manana
292cbd51ec Btrfs: fix invalid block group rbtree access after bg is removed
If we grab a block group, for example in btrfs_trim_fs(), we will be holding
a reference on it but the block group can be removed after we got it (via
btrfs_remove_block_group), which means it will no longer be part of the
rbtree.

However, btrfs_remove_block_group() was only calling rb_erase() which leaves
the block group's rb_node left and right child pointers with the same content
they had before calling rb_erase. This was dangerous because a call to
next_block_group() would access the node's left and right child pointers (via
rb_next), which may no longer be valid.

Fix this by clearing a block group's node after removing it from the tree,
and have next_block_group() do a tree search to get the next block group
instead of using rb_next() if our block group was removed.
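
A rough sketch of both halves of the fix (fragments; the lookup helper is
hypothetical):

  /* on removal: erase and clear the node so its stale parent/child
   * pointers can't be followed later */
  rb_erase(&cache->cache_node, &fs_info->block_group_cache_tree);
  RB_CLEAR_NODE(&cache->cache_node);

  /* in next_block_group(): if our group left the tree while we held a
   * reference, fall back to a fresh tree search instead of rb_next() */
  if (RB_EMPTY_NODE(&cache->cache_node))
          cache = lookup_first_block_group_after(fs_info, cache->key.objectid);
  else
          node = rb_next(&cache->cache_node);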

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:19:17 -08:00
Miao Xie
4245215d6a Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56
The commit c404e0dc (Btrfs: fix use-after-free in the finishing
procedure of the device replace) fixed a use-after-free problem
which happened when removing the source device at the end of device
replace, but at that time btrfs didn't support device replace
on raid56, so we didn't fix the problem for the raid56 profile.
Now that we have implemented device replace for raid56, we need
to fix that problem before we enable the function for raid56.

The fix is very simple: we just increase the per-cpu bio counter
before we submit a raid56 io, and decrease the counter when the
raid56 io ends.
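
A sketch of that counting (illustrative; shown with raw percpu_counter
calls rather than the existing btrfs helpers):

  #include <linux/percpu_counter.h>

  /* before submitting a raid56 bio */
  percpu_counter_inc(&fs_info->bio_counter);

  /* ... and in the raid56 end_io path, once the io has finished */
  percpu_counter_dec(&fs_info->bio_counter);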

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:47 +08:00
Miao Xie
7603597690 Btrfs, replace: write raid56 parity into the replace target device
This function reuses the parity scrub code; we just write
the correct or corrected parity into the target device before
the parity scrub ends.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:46 +08:00
Miao Xie
2c8cdd6ee4 Btrfs, replace: write dirty pages into the replace target device
The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
  RAID56, we add the stripes of the replace target devices at the
  end of the stripe array in the btrfs bio, and we sort those target
  device stripes in the array. We also keep the number of the target
  device stripes in the btrfs bio.
- Except for the write operation on RAID56, none of the other operations
  take the target device stripes into account.
- When we do a write operation, we read the data from the common devices
  and calculate the parity. Then we write the dirty data and the new
  parity out; at this point we find the corresponding replace target
  stripes and write the corresponding data into them.

Note: The function that copies old data from the source device to
the target device was implemented in the past; it is similar to
the other RAID types.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:46 +08:00
Miao Xie
5a6ac9eacb Btrfs, raid56: support parity scrub on raid56
The implementation is:
- Read and check all the data with checksums in the same stripe.
  All the data which has a checksum is COW data, and we are sure
  that it is not changed even though we don't lock the stripe, because
  the space of that data can only be reclaimed after the current
  transaction is committed, and only then can the fs use it to store
  other data; since we hold the current transaction while doing scrub,
  that data cannot be reused, so it is safe to read and check it
  outside the stripe lock.
- Lock the stripe
- Read out all the data without checksums and the parity
  The data without checksums and the parity may be changed if we don't
  lock the stripe, so we need to read it in the stripe lock context.
- Check the parity
- Re-calculate the new parity and write it back if the old parity
  is not right
- Unlock the stripe

If we can not read out the data or the data we read is corrupted,
we will try to repair it. If the repair fails, we will mark the
horizontal sub-stripe (the pages at the same horizontal position) as a
corrupted sub-stripe, and we will skip the parity check and repair of
that horizontal sub-stripe.

And in order to skip the horizontal sub-stripes that have no data, we
introduce a bitmap. If there is some data on a horizontal sub-stripe,
we set the corresponding bit to 1, and when we check and repair the
parity, we skip those horizontal sub-stripes whose bits are 0.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:45 +08:00
Miao Xie
1b94b5567e Btrfs, raid56: use a variant to record the operation type
We will introduce a new operation type later. If we keep using integer
variables as booleans to record the operation type, we would have to add a
new variable and increase the size of the raid bio structure. That is not
good; with this patch, we define a different number for each operation and
can just use one variable to record the operation type.
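
Illustratively (names invented), the change amounts to:

  /* one enum member per operation, instead of one bool per operation */
  enum rbio_op {
          RBIO_WRITE,
          RBIO_READ_REBUILD,
          RBIO_PARITY_SCRUB,      /* the new operation type */
  };

  struct example_raid_bio {
          /* ... */
          enum rbio_op operation; /* replaces e.g. "int read_rebuild;" */
  };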

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:45 +08:00
Miao Xie
af8e2d1df9 Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
This patch implements the RAID5/6 common data repair function. The
implementation is similar to scrub on the other RAID levels such as
RAID1; the difference is that we don't read the data from a mirror,
we use the data repair function of RAID5/6.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:45 +08:00
Miao Xie
b89e1b012c Btrfs, raid56: don't change bbio and raid_map
Because we will reuse bbio and raid_map during the scrub later, it is
better that we don't change any members of bbio and don't free it at
the end of the IO request. So we introduce similar members in the raid
bio, and no longer access those bbio members.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:44 +08:00
Zhao Lei
6de6565075 Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
stripe_index's value was set again on a later line:
stripe_index = 0;

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
2014-12-03 10:18:44 +08:00
Zhao Lei
f90523d1aa Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
bbio_ret in this condition is always !NULL because previous code
already has a check-and-skip:
4908 if (!bbio_ret)
4909     goto out;

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
2014-12-03 10:18:44 +08:00
Dmitry Monakhov
14516bb7bb ext4: fix suboptimal seek_{data,hole} extents traversial
It is ridiculous to scan an inode block by block; this technique is
applicable only to old indirect files, and it takes a significant amount
of time for really large files. Let's reuse ext4_fiemap(), which already
traverses the inode tree in the most optimal manner.

TESTCASE:
ftruncate64(fd, 0);
ftruncate64(fd, 1ULL << 40);
/* lseek will spin very long time */
lseek64(fd, 0, SEEK_DATA);
lseek64(fd, 0, SEEK_HOLE);

Original report: https://lkml.org/lkml/2014/10/16/620

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-02 18:08:53 -05:00
Dmitry Monakhov
d952d69e26 ext4: ext4_inline_data_fiemap should respect callers argument
Currently ext4_inline_data_fiemap ignores the requested arguments (start
and len), which may lead to an endless loop if start != 0.  Also fix incorrect
extent length determination.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-02 16:11:20 -05:00
Dmitry Monakhov
5cc28a9eaa ext4: prevent fsreentrance deadlock for inline_data
ext4_da_convert_inline_data_to_extent() invokes
grab_cache_page_write_begin().  grab_cache_page_write_begin performs
memory allocation, so fs-reentrance should be prohibited because we
are inside a journal transaction.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-02 16:09:50 -05:00
Changman Lee
7dda2af83b f2fs: more fast lookup for gc_inode list
If there are many inodes that have data blocks in the victim segment,
it takes a long time to find an inode in the gc_inode list.
Let's use a radix tree to reduce the lookup time.
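
As a sketch of the lookup this enables (illustrative names):

  /* indexed by inode number, the "already collected?" check becomes a
   * radix tree lookup instead of a linear list walk */
  static struct inode *find_gc_inode(struct radix_tree_root *iroot, nid_t ino)
  {
          struct inode_entry *ie = radix_tree_lookup(iroot, ino);

          return ie ? ie->inode : NULL;
  }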

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-02 11:02:50 -08:00
Eric W. Biederman
4fed655c41 mnt: Clear mnt_expire during pivot_root
When inspecting the pivot_root and the current mount expiry logic I
realized that pivot_root fails to clear mnt_expire like mount move does.

Add the missing line in case someone does the interesting feat of
moving an expirable submount.  This gives a strong guarantee that the
root of the filesystem tree will never expire.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-02 10:46:51 -06:00
Eric W. Biederman
381cacb12c mnt: Carefully set CL_UNPRIVILEGED in clone_mnt
old->mnt_expiry should be ignored unless CL_EXPIRE is set.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-02 10:46:50 -06:00
Eric W. Biederman
8486a7882b mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers.
Clear MNT_LOCKED in the callers of copy_tree except copy_mnt_ns and
collect_mounts.  In copy_mnt_ns it is necessary to create an exact
copy of a mount tree, so not clearing MNT_LOCKED is important.
Similarly, collect_mounts is used to take a snapshot of the mount tree
for audit logging purposes, and auditing using a faithful copy of the
tree is important.

This becomes particularly significant when we start setting MNT_LOCKED
on rootfs to prevent it from being unmounted.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-02 10:46:50 -06:00
Eric W. Biederman
da362b09e4 umount: Do not allow unmounting rootfs.
Andrew Vagin <avagin@parallels.com> writes:

> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sched.h>
> #include <unistd.h>
> #include <sys/mount.h>
>
> int main(int argc, char **argv)
> {
>         int fd;
>
>         fd = open("/proc/self/ns/mnt", O_RDONLY);
>         if (fd < 0)
>                 return 1;
>
>         while (1) {
>                 if (umount2("/", MNT_DETACH) ||
>                     setns(fd, CLONE_NEWNS))
>                         break;
>         }
>
>         return 0;
> }
>
> root@ubuntu:/home/avagin# gcc -Wall nsenter.c -o nsenter
> root@ubuntu:/home/avagin# strace ./nsenter
> execve("./nsenter", ["./nsenter"], [/* 22 vars */]) = 0
> ...
> open("/proc/self/ns/mnt", O_RDONLY)     = 3
> umount("/", MNT_DETACH)                 = 0
> setns(3, 131072)                        = 0
> umount("/", MNT_DETACH
>
causes:

> [  260.548301] ------------[ cut here ]------------
> [  260.550941] kernel BUG at /build/buildd/linux-3.13.0/fs/pnode.c:372!
> [  260.552068] invalid opcode: 0000 [#1] SMP
> [  260.552068] Modules linked in: xt_CHECKSUM iptable_mangle xt_tcpudp xt_addrtype xt_conntrack ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack bridge stp llc dm_thin_pool dm_persistent_data dm_bufio dm_bio_prison iptable_filter ip_tables x_tables crct10dif_pclmul crc32_pclmul ghash_clmulni_intel binfmt_misc nfsd auth_rpcgss nfs_acl aesni_intel nfs lockd aes_x86_64 sunrpc fscache lrw gf128mul glue_helper ablk_helper cryptd serio_raw ppdev parport_pc lp parport btrfs xor raid6_pq libcrc32c psmouse floppy
> [  260.552068] CPU: 0 PID: 1723 Comm: nsenter Not tainted 3.13.0-30-generic #55-Ubuntu
> [  260.552068] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [  260.552068] task: ffff8800376097f0 ti: ffff880074824000 task.ti: ffff880074824000
> [  260.552068] RIP: 0010:[<ffffffff811e9483>]  [<ffffffff811e9483>] propagate_umount+0x123/0x130
> [  260.552068] RSP: 0018:ffff880074825e98  EFLAGS: 00010246
> [  260.552068] RAX: ffff88007c741140 RBX: 0000000000000002 RCX: ffff88007c741190
> [  260.552068] RDX: ffff88007c741190 RSI: ffff880074825ec0 RDI: ffff880074825ec0
> [  260.552068] RBP: ffff880074825eb0 R08: 00000000000172e0 R09: ffff88007fc172e0
> [  260.552068] R10: ffffffff811cc642 R11: ffffea0001d59000 R12: ffff88007c741140
> [  260.552068] R13: ffff88007c741140 R14: ffff88007c741140 R15: 0000000000000000
> [  260.552068] FS:  00007fd5c7e41740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
> [  260.552068] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  260.552068] CR2: 00007fd5c7968050 CR3: 0000000070124000 CR4: 00000000000406f0
> [  260.552068] Stack:
> [  260.552068]  0000000000000002 0000000000000002 ffff88007c631000 ffff880074825ed8
> [  260.552068]  ffffffff811dcfac ffff88007c741140 0000000000000002 ffff88007c741160
> [  260.552068]  ffff880074825f38 ffffffff811dd12b ffffffff811cc642 0000000075640000
> [  260.552068] Call Trace:
> [  260.552068]  [<ffffffff811dcfac>] umount_tree+0x20c/0x260
> [  260.552068]  [<ffffffff811dd12b>] do_umount+0x12b/0x300
> [  260.552068]  [<ffffffff811cc642>] ? final_putname+0x22/0x50
> [  260.552068]  [<ffffffff811cc849>] ? putname+0x29/0x40
> [  260.552068]  [<ffffffff811dd88c>] SyS_umount+0xdc/0x100
> [  260.552068]  [<ffffffff8172aeff>] tracesys+0xe1/0xe6
> [  260.552068] Code: 89 50 08 48 8b 50 08 48 89 02 49 89 45 08 e9 72 ff ff ff 0f 1f 44 00 00 4c 89 e6 4c 89 e7 e8 f5 f6 ff ff 48 89 c3 e9 39 ff ff ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 90 66 66 66 66 90 55 b8 01
> [  260.552068] RIP  [<ffffffff811e9483>] propagate_umount+0x123/0x130
> [  260.552068]  RSP <ffff880074825e98>
> [  260.611451] ---[ end trace 11c33d85f1d4c652 ]--

Which in practice is totally uninteresting.  Only the global root user can
do it, and it is just a stupid thing to do.

However that is no excuse to allow a silly way to oops the kernel.

We can avoid this silly problem by setting MNT_LOCKED on the rootfs
mount point and thus avoid needing any special cases in the unmount
code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-02 10:46:49 -06:00
Eric W. Biederman
b2f5d4dc38 umount: Disallow unprivileged mount force
Forced unmount affects not just the mount namespace but the underlying
superblock as well.  Restrict forced unmount to the global root user
for now.  Otherwise it becomes possible for a user in a less privileged
mount namespace to force the shutdown of a superblock of a filesystem
in a more privileged mount namespace, allowing a DoS attack on root.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-02 10:46:48 -06:00
Eric W. Biederman
3e1866410f mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount
Now that remount properly enforces the rule that you can't remove
nodev, at least sandstorm.io breaks when performing a remount.

It turns out that there is an easy, intuitive solution: implicitly
add nodev on remount when nodev was implicitly added on mount.

Tested-by: Cedric Bosdonnat <cbosdonnat@suse.com>
Tested-by: Richard Weinberger <richard@nod.at>
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-02 10:46:39 -06:00
Linus Torvalds
3a18ca0613 Fix an ext4 metadata checksum regression introduced in v3.18-rc3.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUfTKdAAoJENNvdpvBGATwRhsQAJOWfDD1K57CFQQ0MbLdRfeA
 YzdTHlKIYpt4Dm3+Bc/LJskLrA0sn2vmxF/v0Jxr3F/agwfCODLMO8dIsB0nFX0E
 eC8fCx05titBMQLN4adWr59qQTeE1nssQdpA5SomPrJZr5pSabtj3ekFiHQJ+bWb
 9WNY737TxSPLK0ex9iDlAAp/AoxgF4K6zj/azsRY+mmlfM9+dFoprZWqWwgwl99m
 4LVx1waAnLQdU2Yj7ZYGReweFFTTOqGz4ds1GggymB3Z8Q873dVYO7vdbQWDFJgC
 TcAp8YbfrQC6/M/IaASZKVj6hwEPVMTgOs7dUeyfPtSaXBrW0WBGAhM5gFnQ6J+T
 DO4YHC+tH26GLsfBs9IZnHjAoeVZ93JFDKmxfclDs0AGY+0WgSyY8Bt6VJyoWR60
 RPbW15i/0oMSWEPxbuqQmFIqlcj5n9D80SEmvhpn7oJwkrrUMprcxcWTQN+Ca73e
 2EIOW0SHaLrkM7wpYjwlO4dgCxSZWg6QfHyznbuJcVKOw8KnDMuTEOjP2vBvwHwu
 Wax4EIGZw9XqZVI7a9Z1nd+mpUYi5KDgpS8Uo08Qz5QapWEYla3JPt76q3TwSCIz
 ExMwoRBUSMrpSoDMbyjmk4sh5ABTTWOf9SPmCdnVfzZ36EY0dJckeXj0jFqtyVdq
 p1bxjWPARBm1LfVBcQek
 =yIOT
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfix from Ted Ts'o:
 "Fix an ext4 metadata checksum regression introduced in v3.18-rc3"

* tag 'ext4_for_linus_urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd2: fix regression where we fail to initialize checksum seed when loading
2014-12-01 20:11:49 -08:00
Darrick J. Wong
32f3869184 jbd2: fix regression where we fail to initialize checksum seed when loading
When we're enabling journal features, we cannot use the predicate
jbd2_journal_has_csum_v2or3() because we haven't yet set the sb
feature flag fields!  Moreover, we just finished loading the shash
driver, so the test is unnecessary; calculate the seed always.

Without this patch, we fail to initialize the checksum seed the first
time we turn on journal_checksum, which means that all journal blocks
written during that first mount are corrupt.  Transactions written
after the second mount will be fine, since the feature flag will be
set in the journal superblock.  xfstests generic/{034,321,322} are the
regression tests.

(This is important for 3.18.)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-01 21:57:06 -05:00
Changman Lee
9c01503f4d f2fs: cleanup redundant macro
We've already defined fi and sbi for the inode. Let's avoid duplicating the work.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-01 14:16:50 -08:00
Chao Yu
cd34e2969b f2fs: fix to return correct error number in f2fs_write_begin
Fix the wrong error number in error path of f2fs_write_begin.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-01 13:56:02 -08:00
Dan Carpenter
818f2f57f2 nfsd: minor off by one checks in __write_versions()
My static checker complains that if "len == remaining" then it means we
have truncated the last character off the version string.

The intent of the code is that we print as many versions as we can
without truncating a version.  Then we put a newline at the end.  If the
newline can't fit we return -EINVAL.
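
For illustration, the shape of the check in question, as a rough sketch
(hypothetical buffer handling, not the actual __write_versions() code):

    len = snprintf(p, remaining, "%s%u", sep, vers);
    /* "len == remaining" already means the last character was dropped
     * to make room for the NUL, so the check must be ">=", not ">" */
    if (len >= remaining)
        break;
    p += len;
    remaining -= len;

    /* after the loop: the trailing "\n" needs one byte plus the NUL */
    if (remaining < 2)
        return -EINVAL;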

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-12-01 12:45:28 -07:00
Dave Chinner
c14fc01340 Merge branch 'xfs-coccinelle-cleanups' into for-next 2014-12-01 09:03:02 +11:00
kbuild test robot
d254aaec5d xfs: fix simple_return.cocci warning in xfs_bmse_shift_one
fs/xfs/libxfs/xfs_bmap.c:5591:1-6: WARNING: end returns can be simpified

 Simplify a trivial if-return sequence.  Possibly combine with a
 preceding function call.
Generated by: scripts/coccinelle/misc/simple_return.cocci
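
The transformation the warning suggests, sketched on a hypothetical
function (not the actual XFS code):

    /* before: trailing if-return sequence */
    error = xfs_do_something(args);
    if (error)
        return error;
    return 0;

    /* after: return the call's result directly */
    return xfs_do_something(args);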

CC: Brian Foster <bfoster@redhat.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2014-12-01 08:42:52 +11:00
kbuild test robot
8300475ebf xfs: fix simple_return.cocci warning in xfs_file_readdir
fs/xfs/xfs_file.c:919:1-6: WARNING: end returns can be simpified and declaration on line 902 can be dropped

 Simplify a trivial if-return sequence.  Possibly combine with a
 preceding function call.
Generated by: scripts/coccinelle/misc/simple_return.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-01 08:25:28 +11:00
kbuild test robot
b72091f2fb libxfs: fix simple_return.cocci warnings
fs/xfs/libxfs/xfs_ialloc.c:1141:1-6: WARNING: end returns can be simpified

 Simplify a trivial if-return sequence.  Possibly combine with a
 preceding function call.
Generated by: scripts/coccinelle/misc/simple_return.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-01 08:24:58 +11:00
Markus Elfring
d2a5e3c6fc xfs: remove unnecessary null checks
The functions xfs_blkdev_put() and xfs_qm_dqrele() test whether
their argument is NULL and then return immediately.  Thus the test
around the call is not needed.

This issue was detected by using the Coccinelle software.
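
The pattern being removed, roughly (the surrounding context is illustrative):

    /* before: redundant NULL guard around a NULL-tolerant callee */
    if (bdev)
        xfs_blkdev_put(bdev);

    /* after: the callee already returns early on a NULL argument */
    xfs_blkdev_put(bdev);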

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-01 08:24:20 +11:00
Chris Mason
2f19cad94c btrfs: zero out left over bytes after processing compression streams
Don Bailey noticed that our page zeroing for compression at end-io time
isn't complete.  This reworks a patch from Linus to push the zeroing
into the zlib and lzo specific functions instead of trying to handle the
corners inside btrfs_decompress_buf2page

Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reported-by: Don A. Bailey <donb@securitymouse.com>
cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-30 09:33:51 -08:00
Geert Uytterhoeven
4740f49652 jffs2: Drop bogus if in comment
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: linux-mtd@lists.infradead.org
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
2014-11-28 18:23:44 -08:00
Changman Lee
31a3268839 f2fs: cleanup if-statement of phase in gc_data_segment
Little cleanup to distinguish each phase easily

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: modify indentation for code readability]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-27 20:30:17 -08:00
Dave Chinner
216875a594 Merge branch 'xfs-consolidate-format-defs' into for-next 2014-11-28 14:52:16 +11:00
Dave Chinner
4bd47c1bf4 Merge branch 'xfs-misc-fixes-for-3.19-1' into for-next 2014-11-28 14:52:02 +11:00
Christoph Hellwig
508b6b3b73 xfs: merge xfs_inum.h into xfs_format.h
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:27:10 +11:00
Christoph Hellwig
bb58e6188a xfs: move most of xfs_sb.h to xfs_format.h
More on-disk format consolidation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:27:09 +11:00
Christoph Hellwig
4fb6e8ade2 xfs: merge xfs_ag.h into xfs_format.h
More on-disk format consolidation.  A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:25:04 +11:00
Christoph Hellwig
5beda58bf2 xfs: move acl structures to xfs_format.h
Move the on-disk ACL format to xfs_format.h, so that repair can
use the common definition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:24:37 +11:00
Christoph Hellwig
6d3ebaae7c xfs: merge xfs_dinode.h into xfs_format.h
More consolidation of the on-disk format definitions.  Note that
XFS_IS_REALTIME_INODE moves to xfs_linux.h instead, as it is not related
to the on-disk format but depends on a CONFIG_ option.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:24:06 +11:00
Eric Sandeen
db52d09ecb xfs: catch invalid negative blknos in _xfs_buf_find()
Here blkno is a daddr_t, which is a __s64; it's possible to hold
a value which is negative, and thus pass the (blkno >= eofs)
test.  Then we try to do a xfs_perag_get() for a ridiculous
agno via xfs_daddr_to_agno(), and bad things happen when that
fails, and returns a null pag which is dereferenced shortly
thereafter.

Found via a user-supplied fuzzed image...
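
The added guard, schematically (locals are illustrative; the real check
lives in _xfs_buf_find()):

    /* reject daddrs that are negative or beyond the end of the fs
     * before they reach xfs_daddr_to_agno() */
    if (blkno < 0 || blkno >= eofs)
        return NULL;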

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:03:55 +11:00
Brian Foster
91ee575f2b xfs: allow lazy sb counter sync during filesystem freeze sequence
The expectation since the introduction of the lazy superblock counters is
that the counters are synced and the superblock logged appropriately as part
of the filesystem freeze sequence. This does not occur, however, due to
the logic in xfs_fs_writable() that prevents progress when the fs is in
any state other than SB_UNFROZEN.

While this is a bug, it has not been exposed to date because the last
thing XFS does during freeze is dirty the log. The log recovery process
recalculates the counters from AGI/AGF metadata to ensure everything is
correct. Therefore should a crash occur while an fs is frozen, the
subsequent log recovery puts everything back in order. See the following
commit for reference:

	92821e2b [XFS] Lazy Superblock Counters

We might not always want to rely on dirtying the log on a frozen fs.
Modify xfs_log_sbcount() to proceed when the filesystem is freezing but
not once the freeze process has completed. Modify xfs_fs_writable() to
accept the minimum freeze level for which modifications should be
blocked to support various codepaths.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:02:59 +11:00
Brian Foster
5d45ee1b41 xfs: fix error handling in xfs_qm_log_quotaoff()
The error handling in xfs_qm_log_quotaoff() has a couple problems. If
xfs_trans_commit() fails, we fall through to the error block and call
xfs_trans_cancel(). This is incorrect on commit failure. If
xfs_trans_reserve() fails, we jump to the error block, cancel the tp and
restore the superblock qflags to oldsbqflag. However, oldsbqflag has
been initialized to zero and not yet updated from the original flags so
we set the flags to zero.

Fix up the error handling in xfs_qm_log_quotaoff() to not restore flags
if they haven't been modified and not cancel the tp on commit failure.
Remove the flag restore code altogether because commit error is the only
failure condition and we don't know whether the transaction made it to
disk.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:00:53 +11:00
Brian Foster
062647a8b4 xfs: replace on-stack xfs_trans_res with pointer in xfs_create()
There's no need to store a full struct xfs_trans_res on the stack in
xfs_create() and copy the fields. Use a pointer to the appropriate
structures embedded in the xfs_mount.
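
Schematically (a hedged sketch, not the exact diff; the reservation
selection is simplified):

    /* before: copy the reservation into a stack variable */
    struct xfs_trans_res tres;

    tres = M_RES(mp)->tr_create;
    error = xfs_trans_reserve(tp, &tres, resblks, 0);

    /* after: point at the structure embedded in the xfs_mount */
    struct xfs_trans_res *tres = &M_RES(mp)->tr_create;

    error = xfs_trans_reserve(tp, tres, resblks, 0);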

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:00:16 +11:00
Brian Foster
78c931b8be xfs: replace global xfslogd wq with per-mount wq
The xfslogd workqueue is a global, single-job workqueue for buffer ioend
processing. This means we allow for a single work item at a time for all
possible XFS mounts on a system. fsstress testing in loopback XFS over
XFS configurations has reproduced xfslogd deadlocks due to the single
threaded nature of the queue and dependencies introduced between the
separate XFS instances by online discard (-o discard).

Discard over a loopback device converts the discard request to a hole
punch (fallocate) on the underlying file. Online discard requests are
issued synchronously and from xfslogd context in XFS, hence the xfslogd
workqueue is blocked in the upper fs waiting on a hole punch request to
be serviced in the lower fs. If the lower fs issues I/O that depends on
xfslogd to complete, both filesystems end up hung indefinitely. This is
reproduced reliably by generic/013 on XFS->loop->XFS test devices with
the '-o discard' mount option.

Further, docker implementations appear to use this kind of configuration
for container instance filesystems by default (container fs->dm->
loop->base fs) and therefore are subject to this deadlock when running
on XFS.

Replace the global xfslogd workqueue with a per-mount variant. This
guarantees each mount access to a single worker and prevents deadlocks
due to inter-fs dependencies introduced by discard. Since the queue is
only responsible for buffer iodone processing at this point in time,
rename xfslogd to xfs-buf.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 13:59:58 +11:00
Phillip Lougher
62421645bb Squashfs: Add LZ4 compression configuration option
Add the glue code, and also update the documentation.

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
2014-11-27 18:48:44 +00:00
Phillip Lougher
9c06a46f15 Squashfs: add LZ4 compression support
Add support for reading file systems compressed with the
LZ4 compression algorithm.

This patch adds the LZ4 decompressor wrapper code.

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
2014-11-27 07:44:11 +00:00
Arend van Spriel
98210b7f73 debugfs: add helper function to create device related seq_file
This patch adds a helper function that simplifies adding a
so-called single_open sequence file for device drivers. The
calling device driver needs to provide a read function and
a device pointer. The field struct seq_file::private will
reference the device pointer upon a call to the read function,
so the driver can obtain its data from it and do its task
of providing the file content using seq_printf() calls and
the like. Using this helper function also gets rid of the need
to specify file operations per debugfs file.
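
A usage sketch, assuming the helper is the debugfs_create_devm_seqfile()
added by this patch (the driver names below are made up):

    /* driver-provided read function; seq_file::private is the device */
    static int mydrv_stats_show(struct seq_file *s, void *data)
    {
        struct device *dev = s->private;

        seq_printf(s, "device: %s\n", dev_name(dev));
        return 0;
    }

    /* no per-file file_operations needed any more */
    debugfs_create_devm_seqfile(dev, "stats", parent, mydrv_stats_show);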

Signed-off-by: Arend van Spriel <arend@broadcom.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-26 19:38:37 -08:00
Trond Myklebust
1702562db4 NFS: Generic client side changes from Chuck
These patches include fixes for iostats and SETCLIENTID, in addition to
 cleaning up the nfs4_init_callback() function.
 
 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUdfhQAAoJENfLVL+wpUDr9x4QANWmG6xjEU7RuIBLalOoxit6
 eXnEDNZlwYp6NCkYktHVWaTqXdwq7fdGX+3p4eiwNg3C3SrJHkpWvjeFd4KT/SyP
 B/w3vYG3o5H01i3Mb5kCD2uW0gbS9soZXpIg+uHHOl43yZzneC0vPTLhQ/h+9zPc
 B32XcgQhvLIHR3LWrC/+uolsa31lcyya0W45PX+iCpzHF7i9qRBrNLODkTlw1hNQ
 eEnYIvVy5oW00zlHJUYiTHP3e+0EJn5PAngdYbqiboJ9mK7DbB0QDwqyvJbIT7ql
 WAip6cNcJnSv1eiVYqDwlR1ok8drK5X7yQCT3lcLzAMDznLsSAL1Itu0h2Ay3z61
 f8XCyTwI0izq0DbdrMcPoqPSitqyM8nkPElnOuitXwzEroPaG40OF67yss3+ixbl
 JeQZ+u35pnpCkKUaZdCK3Pn83StxmUaBcFx8eg30NBc0SN13Eiz6aZcGEperClrR
 RwMLDUhUtAMcMRunRRxiN9lHafPqqeDeJre7uky0p0sU9CsH+1n5qIKLmk+Ber2d
 ZS29TobdR7ktfjQ52XazMAIFzI7r1v4Zn7ziH/WRbvKoKw9ICcyoUGNy9CS5LyWu
 BxWCZAJmby5H1i7V7ituJK8TImb81L4aPW06hHX9k+0SoTrgFjas//4jZYfysJ9N
 PvR0MRNfMFhuBlsTj7Gy
 =PMdS
 -----END PGP SIGNATURE-----

Merge tag 'nfs-cel-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma into linux-next

Pull additional NFS client changes for 3.19 from Anna Schumaker:
  "NFS: Generic client side changes from Chuck

  These patches include fixes for iostats and SETCLIENTID, in addition to
  cleaning up the nfs4_init_callback() function.

  Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>"

* tag 'nfs-cel-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma:
  NFS: Clean up nfs4_init_callback()
  NFS: SETCLIENTID XDR buffer sizes are incorrect
  SUNRPC: serialize iostats updates
2014-11-26 17:34:14 -05:00
Michael Halcrow
942080643b eCryptfs: Remove buggy and unnecessary write in file name decode routine
Dmitry Chernenkov used KASAN to discover that eCryptfs writes past the
end of the allocated buffer during encrypted filename decoding. This
fix corrects the issue by getting rid of the unnecessary 0 write when
the current bit offset is 2.

Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Reported-by: Dmitry Chernenkov <dmitryc@google.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org # v2.6.29+: 51ca58d eCryptfs: Filename Encryption: Encoding and encryption functions
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2014-11-26 15:55:02 -06:00
Linus Torvalds
b914c5b213 Merge branch 'for-3.18' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfixes from Bruce Fields:
 "These fix one mishandling of the case when security labels are
  configured out, and two races in the 4.1 backchannel code"

* 'for-3.18' of git://linux-nfs.org/~bfields/linux:
  nfsd: Fix slot wake up race in the nfsv4.1 callback code
  SUNRPC: Fix locking around callback channel reply receive
  nfsd: correctly define v4.2 support attributes
2014-11-25 19:05:41 -08:00
Linus Torvalds
277f850fbc Merge git://git.kvack.org/~bcrl/aio-fixes
Pull aio fix from Ben LaHaise:
 "Dirty page accounting fix for aio"

* git://git.kvack.org/~bcrl/aio-fixes:
  aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer
2014-11-25 18:55:44 -08:00
Jaegeuk Kim
95f5b0fc5e f2fs: fix to recover converted inline_data
If an inode has converted inline_data which was written to the disk, we should
set its inode flag for further fsync so that this inline_data can be recovered
after a sudden power-off.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-25 18:08:00 -08:00
Jaegeuk Kim
158c194c37 f2fs: make clean the page before writing
If a page is set to be written to the disk, we can mark the page clean.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-25 17:33:31 -08:00
Changman Lee
80ec2e914d f2fs: no more dirty_nat_entries when flushing
After flushing dirty nat entries, there should be no more dirty nat
entries.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-25 17:26:36 -08:00
Changman Lee
20d047c876 f2fs: check dirty_nat_cnt before flushing nat entries in journal
It's meaningless to check dirty_nat_cnt after re-dirtying nat entries in
the journal. And although there is room for dirty nat entries if dirty_nat_cnt
is zero, it's also meaningless to check __has_cursum_space.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-25 17:26:34 -08:00
Jan Kara
d4f7610743 ext4: forbid journal_async_commit in data=ordered mode
Option journal_async_commit breaks guarantees of data=ordered mode as it
sends only a single cache flush after writing a transaction commit
block. Thus even though the transaction including the commit block is
fully stored on persistent storage, file data may still linger in drive
caches and will be lost on power failure. Since all checksums match on
journal recovery, we replay the transaction, thus possibly exposing stale
user data.

To fix this data exposure issue, remove the possibility to use
journal_async_commit in data=ordered mode.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 20:19:17 -05:00
Theodore Ts'o
d9f39d1e44 jbd2: remove unnecessary NULL check before iput()
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 20:02:37 -05:00
Markus Elfring
bfcba2d035 ext4: Remove an unnecessary check for NULL before iput()
The iput() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.
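
The same micro-pattern as the jbd2 change above, sketched with a
hypothetical inode pointer:

    /* before */
    if (inode)
        iput(inode);

    /* after: iput() already ignores a NULL inode */
    iput(inode);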

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 20:01:37 -05:00
Anna Schumaker
624bd5b7b6 nfs: Add DEALLOCATE support
This patch adds support for using the NFS v4.2 operation DEALLOCATE to
punch holes in a file.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-25 16:38:32 -05:00
Anna Schumaker
f4ac1674f5 nfs: Add ALLOCATE support
This patch adds support for using the NFS v4.2 operation ALLOCATE to
preallocate data in a file.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-25 16:38:32 -05:00
Namjae Jeon
31fc006b12 ext4: remove unneeded code in ext4_unlink
Setting retval to zero is not needed in ext4_unlink.
Remove unneeded code.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 16:34:38 -05:00
Eric Sandeen
b003b52496 ext4: don't count external journal blocks as overhead
This was fixed for ext3 with:

e6d8fb3 ext3: Count internal journal as bsddf overhead in ext3_statfs

but was never fixed for ext4.

With a large external journal and no used disk blocks, df comes
out negative without this, as journal blocks are added to the
overhead & subtracted from used blocks unconditionally.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 16:27:44 -05:00
Jan Kara
733ded2a80 ext4: remove never taken branch from ext4_ext_shift_path_extents()
path[depth].p_hdr can never be NULL for a path passed to us (and even if
it could, EXT_LAST_EXTENT() would make something != NULL from it). So
just remove the branch.

Coverity-id: 1196498
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 16:23:48 -05:00
Chuck Lever
c2ef47b7f5 NFS: Clean up nfs4_init_callback()
nfs4_init_callback() is never invoked for NFS versions other than 4.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-11-25 16:22:16 -05:00
Chuck Lever
6dd3436b9d NFS: SETCLIENTID XDR buffer sizes are incorrect
Use the correct calculation of the maximum size of a clientaddr4
when encoding and decoding SETCLIENTID operations. clientaddr4 is
defined in section 2.2.10 of RFC3530bis-31.

The usage in encode_setclientid_maxsz is missing the 4-byte length
in both strings, but is otherwise correct. decode_setclientid_maxsz
simply asks for a page of receive buffer space, which is
unnecessarily large (more than 4KB).

Note that a SETCLIENTID reply is either clientid+verifier, or
clientaddr4, depending on the returned NFS status. It doesn't
hurt to allocate enough space for both.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-11-25 16:22:16 -05:00
Darrick J. Wong
c6d3d56dd0 ext4: create nojournal_checksum mount option
Create a mount option to disable journal checksumming (because the
metadata_csum feature turns it on by default now), and fix remount not
to allow changing the journal checksumming option, since changing the
mount options has no effect on the journal.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 16:20:50 -05:00
Wang Shilong
58d86a50ee ext4: update comments regarding ext4_delete_inode()
ext4_delete_inode() was renamed a long time ago; update the
comments to reflect this.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 16:17:29 -05:00
Jaegeuk Kim
5f72739583 f2fs: fix deadlock during inline_data conversion
A deadlock can occur:
Thread 1]                             Thread 2]
 - f2fs_write_data_pages              - f2fs_write_begin
   - lock_page(page #0)
                                        - grab_cache_page(page #X)
                                        - get_node_page(inode_page)
                                        - grab_cache_page(page #0)
                                          : to convert inline_data
   - f2fs_write_data_page
     - f2fs_write_inline_data
       - get_node_page(inode_page)

In this case, trying to lock inode_page and page #0 causes deadlock.
In order to avoid this, this patch adds a rule for this locking policy,
which is that page #0 should be locked followed by inode_page lock.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-25 12:08:30 -08:00
Markus Elfring
ce3e6d25f3 f2fs: fix typos for the word "destroy" in jump labels
Two jump labels were adjusted in the implementation of the
create_node_manager_caches() function because these identifiers
contained typos.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-25 12:08:22 -08:00
Dmitry Monakhov
4fdb554318 ext4: cleanup GFP flags inside resize path
We must use GFP_NOFS instead of GFP_KERNEL inside ext4_mb_add_groupinfo
and ext4_calculate_overhead() because they are called from inside a
journal transaction. Call trace:

ioctl
 ->ext4_group_add
   ->journal_start
   ->ext4_setup_new_descs
     ->ext4_mb_add_groupinfo -> GFP_KERNEL
   ->ext4_flex_group_add
     ->ext4_update_super
       ->ext4_calculate_overhead  -> GFP_KERNEL
   ->journal_stop
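
A minimal sketch of the kind of change (the allocation site and variable
are illustrative):

    /* inside a running journal transaction, allocations must not
     * recurse back into the filesystem */
    ptr = kzalloc(len, GFP_NOFS);    /* was GFP_KERNEL */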

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 13:08:04 -05:00
Jan Kara
2be12de98a ext4: introduce aging to extent status tree
Introduce a simple aging to extent status tree. Each extent has a
REFERENCED bit which gets set when the extent is used. Shrinker then
skips entries with referenced bit set and clears the bit. Thus
frequently used extents have higher chances of staying in memory.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:55:24 -05:00
Jan Kara
624d0f1dd7 ext4: cleanup flag definitions for extent status tree
Currently flags for the extent status tree are defined twice, once shifted
and once without being shifted. Consolidate these definitions into one
place and make some computations automatic to make adding flags less
error prone. The compiler should be clever enough to figure out these are
constants and generate the same code.
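
One way to express the consolidation, as a generic sketch (names
approximate the ext4 header; the exact list of flags is abridged):

    /* define the bit numbers once ... */
    enum {
        ES_WRITTEN_B,
        ES_UNWRITTEN_B,
        ES_DELAYED_B,
        ES_HOLE_B,
        ES_FLAGS
    };

    /* ... and derive the shifted masks from them automatically */
    #define EXTENT_STATUS_WRITTEN    (1 << ES_WRITTEN_B)
    #define EXTENT_STATUS_UNWRITTEN  (1 << ES_UNWRITTEN_B)
    #define EXTENT_STATUS_DELAYED    (1 << ES_DELAYED_B)
    #define EXTENT_STATUS_HOLE       (1 << ES_HOLE_B)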

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:53:47 -05:00
Jan Kara
dd47592551 ext4: limit number of scanned extents in status tree shrinker
Currently we scan extent status trees of inodes until we reclaim nr_to_scan
extents. This can however require a lot of scanning when there are lots
of delayed extents (as those cannot be reclaimed).

Change the shrinker to work as shrinkers are supposed to and *scan* only
nr_to_scan extents regardless of how many extents we actually
reclaimed. We however need to be careful and avoid scanning each status
tree from the beginning - that could lead to a situation where we would
not be able to reclaim anything at all when first nr_to_scan extents in
the tree are always unreclaimable. We remember with each inode the offset
where we stopped scanning and continue from there when we next come
across the inode.

Note that we also need to update places calling __es_shrink() manually
to pass reasonable nr_to_scan to have a chance of reclaiming anything and
not just 1.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:51:23 -05:00
Jan Kara
b0dea4c165 ext4: move handling of list of shrinkable inodes into extent status code
Currently callers adding extents to extent status tree were responsible
for adding the inode to the list of inodes with freeable extents. This
is error prone and puts list handling in unnecessarily many places.

Just add the inode to the list automatically when the first non-delay extent
is added to the tree and remove the inode from the list when the last
non-delay extent is removed.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:49:25 -05:00
Zheng Liu
edaa53cac8 ext4: change LRU to round-robin in extent status tree shrinker
In this commit we discard the lru algorithm for inodes with extent
status tree because it takes significant effort to maintain a lru list
in extent status tree shrinker and the shrinker can take a long time to
scan this lru list in order to reclaim some objects.

We replace the lru ordering with a simple round-robin.  After that we
never need to keep a lru list.  That means that the list needn't be
sorted if the shrinker can not reclaim any objects in the first round.

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:45:37 -05:00
Zheng Liu
2f8e0a7c6c ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
Currently the extent status tree doesn't cache an extent hole when a write
looks up the extent tree to check whether a block has been allocated
or not.  In this case, we don't put extent hole in extent cache because
later this extent might be removed and a new delayed extent might be
added back.  But it will cause a defect when we do a lot of writes.  If
we don't put extent hole in extent cache, the following writes also need
to access extent tree to look at whether or not a block has been
allocated.  It brings a cache miss.  This commit fixes this defect.
Also if the inode doesn't have any extent, this extent hole will be
cached as well.

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:44:37 -05:00
Jan Kara
cbd7584e6e ext4: fix block reservation for bigalloc filesystems
For bigalloc filesystems we have to check whether newly requested inode
block isn't already part of a cluster for which we already have delayed
allocation reservation. This check happens in ext4_ext_map_blocks() and
that function sets EXT4_MAP_FROM_CLUSTER if that's the case. However if
ext4_da_map_blocks() finds in extent cache information about the block,
we don't call into ext4_ext_map_blocks() and thus we always end up
getting new reservation even if the space for cluster is already
reserved. This results in overreservation and premature ENOSPC reports.

Fix the problem by checking for existing cluster reservation already in
ext4_da_map_blocks(). That simplifies the logic and actually allows us
to get rid of the EXT4_MAP_FROM_CLUSTER flag completely.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:41:49 -05:00
Filipe Manana
9ea24bbe17 Btrfs: fix snapshot inconsistency after a file write followed by truncate
If right after starting the snapshot creation ioctl we perform a write against a
file followed by a truncate, with both operations increasing the file's size, we
can get a snapshot tree that reflects a state of the source subvolume's tree where
the file truncation happened but the write operation didn't. This leaves a gap
between 2 file extent items of the inode, which makes btrfs' fsck complain about it.

For example, if we perform the following file operations:

    $ mkfs.btrfs -f /dev/vdd
    $ mount /dev/vdd /mnt
    $ xfs_io -f \
          -c "pwrite -S 0xaa -b 32K 0 32K" \
          -c "fsync" \
          -c "pwrite -S 0xbb -b 32770 16K 32770" \
          -c "truncate 90123" \
          /mnt/foobar

and the snapshot creation ioctl was just called before the second write, we often
can get the following inode items in the snapshot's btree:

        item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
                inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
        item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                inode ref index 282 namelen 10 name: foobar
        item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                extent data disk byte 1104855040 nr 32768
                extent data offset 0 nr 32768 ram 32768
                extent compression 0
        item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                extent data disk byte 0 nr 0
                extent data offset 0 nr 40960 ram 40960
                extent compression 0

There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
for which there's no file extent item covering it. This is because the file write
and file truncate operations happened both right after the snapshot creation ioctl
called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
ordered extent that matches the write and, in btrfs_setsize(), we were able to call
btrfs_cont_expand() before being able to commit the current transaction in the
snapshot creation ioctl. So this made it possible to insert the hole file extent
item in the source subvolume (which represents the region added by the truncate)
right before the transaction commit from the snapshot creation ioctl.

Btrfs' fsck tool complains about such cases with a message like the following:

    "root 331 inode 257 errors 100, file extent discount"

From a user perspective, the expectation when a snapshot is created while those
file operations are being performed is that the snapshot will have a file that
either:

1) is empty
2) only the first write was captured
3) only the 2 writes were captured
4) both writes and the truncation were captured

But it should never capture a state where only the first write and the truncation were
captured (since the second write was performed before the truncation).

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-25 07:41:23 -08:00
Filipe Manana
e5fa8f865b Btrfs: ensure send always works on roots without orphans
Move the logic from the snapshot creation ioctl into send. This avoids
doing the transaction commit if send isn't used, and ensures that if
a crash/reboot happens after the transaction commit that created the
snapshot and before the transaction commit that switched the commit
root, send will not get a commit root that differs from the main root
(that has orphan items).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-25 07:41:23 -08:00
Filipe Manana
758eb51e71 Btrfs: fix freeing used extent after removing empty block group
Due to ignoring errors returned by clear_extent_bits (at the moment only
-ENOMEM is possible), we can end up freeing an extent that is actually in
use (i.e. return the extent to the free space cache).

The sequence of steps that lead to this:

1) Cleaner thread starts execution and calls btrfs_delete_unused_bgs(), with
   the goal of freeing empty block groups;

2) btrfs_delete_unused_bgs() finds an empty block group, joins the current
   transaction (or starts a new one if none is running) and attempts to
   clear the EXTENT_DIRTY bit for the block group's range from freed_extents[0]
   and freed_extents[1] (of which one corresponds to fs_info->pinned_extents);

3) Clearing the EXTENT_DIRTY bit (via clear_extent_bits()) fails with
   -ENOMEM, but such error is ignored and btrfs_delete_unused_bgs() proceeds
   to delete the block group and the respective chunk, while pinned_extents
   remains with that bit set for the whole (or a part of the) range covered
   by the block group;

4) Later while the transaction is still running, the chunk ends up being reused
   for a new block group (maybe for different purpose, data or metadata), and
   extents belonging to the new block group are allocated for file data or btree
   nodes/leafs;

5) The current transaction is committed, meaning that we unpinned one or more
   extents from the new block group (through btrfs_finish_extent_commit() and
   unpin_extent_range()) which are now being used for new file data or new
   metadata (through btrfs_finish_extent_commit() and unpin_extent_range()).
   And unpinning means we returned the extents to the free space cache of the
   new block group, which implies those extents can be used for future allocations
   while they're still in use.

Alternatively, we can hit a BUG_ON() when doing a lookup for a block group's cache
object in unpin_extent_range() if a new block group didn't end up being allocated for
the same chunk (step 4 above).

Fix this by not freeing the block group and chunk if we fail to clear the dirty bit.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-25 07:41:23 -08:00
Chris Mason
8f608de699 Btrfs: include vmalloc.h in check-integrity.c
Fengguang's build monster reported warnings on some arches because we
don't have vmalloc.h included

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: fengguang.wu@intel.com
2014-11-25 06:01:11 -08:00
Qu Wenruo
084b6e7c76 btrfs: Fix a lockdep warning when running xfstest.
The following lockdep warning is triggered during xfstests:

[ 1702.980872] =========================================================
[ 1702.981181] [ INFO: possible irq lock inversion dependency detected ]
[ 1702.981482] 3.18.0-rc1 #27 Not tainted
[ 1702.981781] ---------------------------------------------------------
[ 1702.982095] kswapd0/77 just changed the state of lock:
[ 1702.982415]  (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa03b0b51>] __btrfs_release_delayed_node+0x41/0x1f0 [btrfs]
[ 1702.982794] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1702.983160]  (&fs_info->dev_replace.lock){+.+.+.}

and interrupts could create inverse lock ordering between them.

[ 1702.984675]
other info that might help us debug this:
[ 1702.985524] Chain exists of:
  &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock

[ 1702.986799]  Possible interrupt unsafe locking scenario:

[ 1702.987681]        CPU0                    CPU1
[ 1702.988137]        ----                    ----
[ 1702.988598]   lock(&fs_info->dev_replace.lock);
[ 1702.989069]                                local_irq_disable();
[ 1702.989534]                                lock(&delayed_node->mutex);
[ 1702.990038]                                lock(&found->groups_sem);
[ 1702.990494]   <Interrupt>
[ 1702.990938]     lock(&delayed_node->mutex);
[ 1702.991407]
 *** DEADLOCK ***

This is because btrfs_kobj_{add/rm}_device() will allocate memory
with GFP_KERNEL, which may flush fs page cache to free space, waiting
for itself to do the commit, causing the deadlock.

To solve the problem, move btrfs_kobj_{add/rm}_device() out of the
dev_replace lock range, which also involves splitting the
btrfs_rm_dev_replace_srcdev() function into remove and free parts.

Now only btrfs_rm_dev_replace_remove_srcdev() is called in dev_replace
lock range, and kobj_{add/rm} and btrfs_rm_dev_replace_free_srcdev() are
called out of the lock range.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-25 05:55:38 -08:00
Chris Mason
ad27c0dab7 Merge branch 'dev/pending-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus 2014-11-25 05:45:30 -08:00
Li RongQing
e9f456ca50 nfs: define nfs_inc_fscache_stats and using it as possible
Define and use nfs_inc_fscache_stats when incrementing by one, which
saves passing one parameter.
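
The wrapper idea, sketched (a sketch assuming the existing
nfs_add_fscache_stats() helper; the counter name is illustrative):

    static inline void nfs_inc_fscache_stats(struct inode *inode,
                                             enum nfs_stat_fscachecounters stat)
    {
        nfs_add_fscache_stats(inode, stat, 1);
    }

    /* call sites then become */
    nfs_inc_fscache_stats(inode, NFSIOS_FSCACHE_PAGES_READ_OK);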

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24 20:08:47 -05:00
Li RongQing
5a254d08b0 nfs: replace nfs_add_stats with nfs_inc_stats when add one
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24 20:08:47 -05:00
Markus Elfring
fe0bf1185d NFS: Deletion of unnecessary checks before the function call "nfs_put_client"
The nfs_put_client() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24 20:07:27 -05:00
Jeff Layton
10b89567db lockd: eliminate LOCKD_DEBUG
LOCKD_DEBUG is always the same value as CONFIG_SUNRPC_DEBUG, so we can
just use it instead.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24 17:24:08 -05:00
Peng Tao
4bd5a980de nfs41: fix nfs4_proc_layoutget error handling
nfs4_layoutget_release() drops layout hdr refcnt. Grab the refcnt
early so that it is safe to call .release in case nfs4_alloc_pages
fails.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Fixes: a47970ff78 ("NFSv4.1: Hold reference to layout hdr in layoutget")
Cc: stable@vger.kernel.org # 3.9+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24 17:14:54 -05:00
Weston Andros Adamson
cb1410c71e NFS: fix subtle change in COMMIT behavior
Recent work in the pgio layer made it possible for there to be more than one
request per page. This caused a subtle change in commit behavior, because
write.c:nfs_commit_unstable_pages compares the number of *pages* waiting for
writeback against the number of requests on a commit list to choose when to
send a COMMIT in a non-blocking flush.

This is probably hard to hit in normal operation - you have to be using
rsize/wsize < PAGE_SIZE, or pnfs with lots of boundaries that are not page
aligned to have a noticeable change in behavior.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24 17:00:42 -05:00
Christoph Hellwig
6a74c0c940 pnfs/blocklayout: fix end calculation in pnfs_num_cont_bytes
Use the number of pages in the pagecache mapping instead of the
number of pnfs requests which is only slightly related.

Reported-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24 17:00:41 -05:00
Anna Schumaker
878ffa9f85 NFS: Use nfs_server_capable() for checking NFS_CAP_SEEK
This should make the code easier to maintain in the future.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24 12:49:13 -05:00
Jan Kara
c1b69b1ca1 nfs: Remove dead case from nfs4_map_errors()
NFS4ERR_ACCESS has number 13 and thus is matched and returned
immediately at the beginning of nfs4_map_errors() and there's no point
in checking it later.

Coverity-id: 733891
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24 12:48:14 -05:00
Paul Burton
774c105ed8 binfmt_elf: allow arch code to examine PT_LOPROC ... PT_HIPROC headers
MIPS is introducing new variants of its O32 ABI which differ in their
handling of floating point, in order to enable a gradual transition
towards a world where mips32 binaries can take advantage of new hardware
features only available when configured for certain FP modes. In order
to do this ELF binaries are being augmented with a new section that
indicates, amongst other things, the FP mode requirements of the binary.
The presence & location of such a section is indicated by a program
header in the PT_LOPROC ... PT_HIPROC range.

In order to allow the MIPS architecture code to examine the program
header & section in question, pass all program headers in this range
to an architecture-specific arch_elf_pt_proc function. This function
may return an error if the header is deemed invalid or unsuitable for
the system, in which case that error will be returned from
load_elf_binary and upwards through the execve syscall.

A means is required for the architecture code to make a decision once
it is known that all such headers have been seen, but before it is too
late to return from an execve syscall. For this purpose the
arch_check_elf function is added, and called once, after all PT_LOPROC
to PT_HIPROC headers have been passed to arch_elf_pt_proc but before
the code which invoked execve has been lost. This enables the
architecture code to make a decision based upon all the headers present
in an ELF binary and its interpreter, as is required to forbid
conflicting FP ABI requirements between an ELF & its interpreter.

In order to allow data to be stored throughout the calls to the above
functions, struct arch_elf_state is introduced.

Finally a variant of the SET_PERSONALITY macro is introduced which
accepts a pointer to the struct arch_elf_state, allowing it to act
based upon state observed from the architecture specific program
headers.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/7679/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2014-11-24 07:45:02 +01:00
Paul Burton
a9d9ef133f binfmt_elf: load interpreter program headers earlier
Load the program headers of an ELF interpreter early enough in
load_elf_binary that they can be examined before it's too late to return
an error from an exec syscall. This patch does not perform any such
checking, it merely lays the groundwork for a further patch to do so.

No functional change is intended.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/7675/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2014-11-24 07:45:02 +01:00
Paul Burton
6a8d38945c binfmt_elf: Hoist ELF program header loading to a function
load_elf_binary & load_elf_interp both load program headers from an ELF
executable in the same way, duplicating the code. This patch introduces
a helper function (load_elf_phdrs) which performs this common task &
calls it from both load_elf_binary & load_elf_interp. In addition to
reducing code duplication, this is part of preparing to load the ELF
interpreter headers earlier such that they can be examined before it's
too late to return an error from an exec syscall.
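
The shape of the hoisted helper, approximated (error handling and size
sanity checks are abridged; this is not the literal patch):

    static struct elf_phdr *load_elf_phdrs(struct elfhdr *elf_ex,
                                           struct file *elf_file)
    {
        struct elf_phdr *phdata;
        int size = sizeof(struct elf_phdr) * elf_ex->e_phnum;

        phdata = kmalloc(size, GFP_KERNEL);
        if (!phdata)
            return NULL;

        /* read all program headers in one go */
        if (kernel_read(elf_file, elf_ex->e_phoff,
                        (char *)phdata, size) != size) {
            kfree(phdata);
            return NULL;
        }
        return phdata;
    }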

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/7676/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2014-11-24 07:45:02 +01:00
Jaegeuk Kim
0341845efc f2fs: fix livelock calling f2fs_iget during f2fs_evict_inode
In f2fs_evict_inode,
 commit_inmemory_pages
   f2fs_gc
     f2fs_iget
       iget_locked
         -> wait for inode free

Here, if the inode is the same as the one to be evicted, f2fs will wait forever.
Actually, we should not call f2fs_balance_fs during f2fs_evict_inode to avoid
this.

But, the commit_inmem_pages calls f2fs_balance_fs by default, even if
f2fs_evict_inode wants to free inmemory pages only.

Hence, this patch triggers f2fs_balance_fs only when there is something
to write.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-23 21:51:57 -08:00
Jaegeuk Kim
9486ba442b f2fs: introduce f2fs_dentry_kunmap to clean up
This patch introduces f2fs_dentry_kunmap to clean up dirty code.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-23 21:51:53 -08:00
Changman Lee
c9ee00857c f2fs: fix wrong data structure when create slab
It used nat_entry_set when creating the slab for sit_entry_set.
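
The slip, roughly (a sketch of the fix, not the literal diff):

    /* before: sit_entry_set cache created with the wrong object size */
    sit_entry_set_slab = f2fs_kmem_cache_create("sit_entry_set",
                sizeof(struct nat_entry_set));

    /* after */
    sit_entry_set_slab = f2fs_kmem_cache_create("sit_entry_set",
                sizeof(struct sit_entry_set));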

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-23 21:48:49 -08:00
Jaegeuk Kim
09b8b3c839 f2fs: call flush_dcache_page when the page was updated
Whenever f2fs updates mapped pages, it needs to call flush_dcache_page.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-23 21:48:31 -08:00
Linus Torvalds
d038a63ace Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs deadlock fix from Chris Mason:
 "This has a fix for a long standing deadlock that we've been trying to
  nail down for a while.  It ended up being a bad interaction with the
  fair reader/writer locks and the order btrfs reacquires locks in the
  btree"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix lockups from btrfs_clear_path_blocking
2014-11-23 11:16:36 -08:00
Eric Whitney
0756b908a3 ext4: fix end of region partial cluster handling
ext4_ext_remove_space() can incorrectly free a partial_cluster if
EAGAIN is encountered while truncating or punching.  Extent removal
should be retried in this case.

It also fails to free a partial cluster when the punched region begins
at the start of a file on that unaligned cluster and where the entire
file has not been punched.  Remove the requirement that all blocks in
the file must have been freed in order to free the partial cluster.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-23 00:59:39 -05:00
Eric Whitney
345ee94748 ext4: miscellaneous partial cluster cleanups
Add some casts and rearrange a few statements for improved readability.
Some code can also be simplified and made more readable if we set
partial_cluster to 0 rather than to a negative value when we can tell
we've hit the left edge of the punched region.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-23 00:59:39 -05:00
Eric Whitney
5bf4376065 ext4: fix end of leaf partial cluster handling
The fix in commit ad6599ab3a ("ext4: fix premature freeing of
partial clusters split across leaf blocks"), intended to avoid
dereferencing an invalid extent pointer when determining whether a
partial cluster should be freed, wasn't quite good enough.  Assure that
at least one extent remains at the start of the leaf once the hole has
been punched.  Otherwise, the pointer to the extent to the right of the
hole will be invalid and a partial cluster will be incorrectly freed.

Set partial_cluster to 0 when we can tell we've hit the left edge of
the punched region within the leaf.  This prevents incorrect freeing
of a partial cluster when ext4_ext_rm_leaf is called one last time
during extent tree traversal after the punched region has been removed.

Adjust comments to reflect code changes and a correction.  Remove a bit
of dead code.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-23 00:58:11 -05:00
Eric Whitney
f4226d9ea4 ext4: fix partial cluster initialization
The partial_cluster variable is not always initialized correctly when
hole punching on bigalloc file systems.  Although commit c063449394
("ext4: fix partial cluster handling for bigalloc file systems")
addressed the case where the right edge of the punched region and the
next extent to its right were within the same leaf, it didn't handle
the case where the next extent to its right is in the next leaf.  This
causes xfstest generic/300 to fail.

Fix this by replacing the code in c0634493922 with a more general
solution that can continue the search for the first cluster to the
right of the punched region into the next leaf if present.  If found,
partial_cluster is initialized to this cluster's negative value.
There's no need to determine if that cluster is actually shared;  we
simply record it so its blocks won't be freed in the event it does
happen to be shared.

Also, minimize the burden on non-bigalloc file systems with some minor
code simplification.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-23 00:55:42 -05:00
Filipe Manana
b38ef71cb1 Btrfs: ensure ordered extent errors aren't missed on fsync
When doing an fsync with a fast path we have a time window where we can miss
the fact that writeback of some file data failed, and therefore we end up
returning success (0) from fsync when we should return an error.
The steps that lead to this are the following:

1) We start all ordered extents by calling filemap_fdatawrite_range();

2) We do some other work like locking the inode's i_mutex, start a transaction,
   start a log transaction, etc;

3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the
   ordered extents from inode's ordered tree into a list;

4) But by the time we do ordered extent collection, some ordered extents we started
   at step 1) might have already completed with an error, and therefore we didn't
   find them in the ordered tree and had no idea they finished with an error. This
   makes our fsync return success (0) to userspace, but has no bad effects on the log
   like for example insertion of file extent items into the log that point to unwritten
   extents, because the invalid extent maps were removed before the ordered extent
   completed (in inode.c:btrfs_finish_ordered_io).

So after collecting the ordered extents just check if the inode's i_mapping has any
error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever
writeback fails for a page of an ordered extent, we call mapping_set_error (done in
extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage)
that sets one of those error flags in the inode's i_mapping flags.

This change also has the side effect of fixing the issue where for fast fsyncs we
never checked/cleared the error flags from the inode's i_mapping flags, which means
that a full fsync performed after a fast fsync could get such errors that belonged
to the fast fsync - because the full fsync calls btrfs_wait_ordered_range() which
calls filemap_fdatawait_range(), and the later checks for and clears those flags,
while for fast fsyncs we never call filemap_fdatawait_range() or anything else
that checks for and clears the error flags from the inode's i_mapping.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-21 11:59:57 -08:00
Filipe Manana
0870295b23 Btrfs: collect only the necessary ordered extents on ranged fsync
Instead of collecting all ordered extents from the inode's ordered tree
and then wait for all of them to complete, just collect the ones that
overlap the fsync range.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-21 11:59:56 -08:00
Filipe Manana
5ab5e44a36 Btrfs: don't ignore log btree writeback errors
If an error happens during writeback of log btree extents, make sure the
error is returned to the caller (fsync), so that it takes proper action
(commit current transaction) instead of writing a superblock that points
to log btrees with all or some nodes that weren't durably persisted.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-21 11:59:55 -08:00
Josef Bacik
a28046956c Btrfs: do not move em to modified list when unpinning
We use the modified list to keep track of which extents have been modified so we
know which ones are candidates for logging at fsync() time.  Newly modified
extents are added to the list at modification time, around the same time the
ordered extent is created.  We do this so that we don't have to wait for ordered
extents to complete before we know what we need to log.  The problem is when
something like this happens

log extent 0-4k on inode 1
copy csum for 0-4k from ordered extent into log
sync log
commit transaction
log some other extent on inode 1
ordered extent for 0-4k completes and adds itself onto modified list again
log changed extents
see ordered extent for 0-4k has already been logged
	at this point we assume the csum has been copied
sync log
crash

On replay we will see the extent 0-4k in the log, drop the original 0-4k extent
which is the same one that we are replaying which also drops the csum, and then
we won't find the csum in the log for that bytenr.  This of course causes us to
have errors about not having csums for certain ranges of our inode.  So remove
the modified list manipulation in unpin_extent_cache, any modified extents
should have been added well before now, and we don't want them re-logged.  This
fixes my test that I could reliably reproduce this problem with.  Thanks,

cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-21 11:59:54 -08:00
Josef Bacik
50d9aa99bd Btrfs: make sure logged extents complete in the current transaction V3
Liu Bo pointed out that my previous fix would lose the generation update in the
scenario I described.  It is actually much worse than that, we could lose the
entire extent if we lose power right after the transaction commits.  Consider
the following

write extent 0-4k
log extent in log tree
commit transaction
	< power fail happens here
ordered extent completes

We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
the transaction commit will reset the log so it isn't replayed.  If we lose
power before the transaction commit we are safe, otherwise we are not.

Fix this by keeping track of all extents we logged in this transaction.  Then
when we go to commit the transaction make sure we wait for all of those ordered
extents to complete before proceeding.  This will make sure that if we lose
power after the transaction commit we still have our data.  This also fixes the
problem of the improperly updated extent generation.  Thanks,

cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-21 11:58:32 -08:00
Al Viro
3035b675ad Merge branch 'overlayfs-current' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into for-linus
"The biggest change is to rename the filesystem from "overlayfs" to "overlay".
This will allow legacy overlayfs to be easily carried by distros alongside the
new mainline one.  Also fix a couple of copy-up races and allow escaping comma
character in filenames."

The last bit is about commas in pathname mount options...
2014-11-21 11:51:08 -05:00
Josef Bacik
9dba8cf128 Btrfs: make sure we wait on logged extents when fsyncing two subvols
If two fsync()'s race on different subvols, one will do all of its work
to get into the log_tree, wait on its outstanding IO, and then allow the
log_tree to finish its commit.  The problem is we were just freeing that
subvol's logged extents instead of waiting on them, so whoever lost the race
wouldn't really have their data on disk.  Fix this by waiting properly instead
of freeing the logged extents.  Thanks,

cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:20:10 -08:00
David Sterba
0d95c1bec9 btrfs: fix wrong accounting of raid1 data profile in statfs
The sizes that are obtained from space infos are in raw units and have
to be adjusted according to the raid factor. This was missing for
f_bavail and df reported a doubled size for raid1.

Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Fixes: ba7b6e62f4 ("btrfs: adjust statfs calculations according to raid profiles")
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:20:09 -08:00
Gui Hecheng
321592427c btrfs: fix dead lock while running replace and defrag concurrently
This can be reproduced by fstests: btrfs/070

The scenario is like the following:

replace worker thread		defrag thread
---------------------		-------------
copy_nocow_pages_worker		btrfs_defrag_file
  copy_nocow_pages_for_inode	    ...
				  btrfs_writepages
  |A| lock_extent_bits		    extent_write_cache_pages
				|B|   lock_page
					__extent_writepage
		...			  writepage_delalloc
					    find_lock_delalloc_range
				|B| 	      lock_extent_bits
  find_or_create_page
    pagecache_get_page
  |A| lock_page

This leads to an ABBA pattern deadlock. To fix it,
o we just change it to an AABB pattern, which means we @unlock_extent_bits()
  before we @lock_page(); this way @extent_read_full_page_nolock()
  is no longer called in a locked context, so change it back to @extent_read_full_page()
  to regain protection.

o Since we @unlock_extent_bits() earlier, then before @write_page_nocow(),
  the extent may not really point at the physical block we want, so we
  have to check it before writing.

Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:20:08 -08:00
Filipe Manana
5f5bc6b1e2 Btrfs: make xattr replace operations atomic
Replacing an xattr consists of doing a lookup for its existing value, deleting
the current value from the respective leaf, releasing the search path and then
finally inserting the new value. This leaves a time window where readers (getxattr,
listxattr) won't see any value for the xattr. Xattrs are used to store ACLs,
so this has security implications.

This change also fixes 2 other existing issues which were:

*) Deleting the old xattr value without verifying first if the new xattr will
   fit in the existing leaf item (in case multiple xattrs are packed in the
   same item due to name hash collision);

*) Returning -EEXIST when the flag XATTR_CREATE is given and the xattr doesn't
   exist but we have an existing item that packs multiple xattrs with
   the same name hash as the input xattr. In this case we should return -ENOSPC.

A test case for xfstests follows soon.

Thanks to Alexandre Oliva for reporting the non-atomicity of the xattr replace
implementation.

Reported-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:20:07 -08:00
Filipe Manana
c7bc6319c5 Btrfs: avoid premature -ENOMEM in clear_extent_bit()
We try to allocate an extent state structure before acquiring the extent
state tree's spinlock, as we might need a new one later and therefore want
to avoid doing an atomic allocation later while holding the tree's spinlock.
However we returned -ENOMEM if that initial non-atomic allocation failed,
which is a bit excessive since we might end up not needing the pre-allocated
extent state at all - for the case where the tree doesn't have any extent
states that cover the input range while also covering other ranges (i.e. no
state needs to be split).  Therefore don't return -ENOMEM if that
pre-allocation fails.
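
Roughly, the pre-allocation becomes a best-effort step (a sketch, not the
literal diff; alloc_extent_state() is the existing helper):

	if (!prealloc && (mask & __GFP_WAIT)) {
		/*
		 * Best effort only: if the allocation fails we may not even
		 * need the pre-allocated extent state (no state to split),
		 * and if we do end up needing one we allocate it later anyway.
		 */
		prealloc = alloc_extent_state(mask);
	}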

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:20:06 -08:00
Josef Bacik
7e33fd993a Btrfs: don't take the chunk_mutex/dev_list mutex in statfs V2
Our gluster boxes get several thousand statfs() calls per second, which begins
to suck hardcore with all of the lock contention on the chunk mutex and dev list
mutex.  We don't really need to hold these things, if we have transient
weirdness with statfs() because of the chunk allocator we don't care, so remove
this locking.

We still need the dev_list lock if you mount with -o alloc_start however, which
is a good argument for nuking that thing from orbit, but that's a patch for
another day.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:20:05 -08:00
Josef Bacik
633c0aad4c Btrfs: move read only block groups onto their own list V2
Our gluster boxes were spending lots of time in statfs because our fs'es are
huge.  The problem is statfs loops through all of the block groups looking for
read only block groups, and when you have several terabytes worth of data that
ends up being a lot of block groups.  Move the read only block groups onto a
read only list and only process that list in
btrfs_account_ro_block_groups_free_space to reduce the amount of churn.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:20:04 -08:00
David Sterba
cd743fac42 btrfs: fix typos in btrfs_check_super_valid
Fix copy&paste errors in some messages and add a few more missing macro
accessors.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:20:03 -08:00
Stefan Behrens
cf90c59e68 Btrfs: check-int: don't complain about balanced blocks
The xfstest btrfs/014, which tests the balance operation, caused the
check_int module to complain that known blocks changed their physical
location. Since this is not an error in this case, only print such a
message if verbose mode is enabled.

Reported-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Tested-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:30 -08:00
Stefan Behrens
f382e4653f Btrfs: check_int: use the known block location
The xfstest btrfs/014 which tests the balance operation caused issues with
the check_int module. The attempt was made to use btrfs_map_block() to
find the physical location for a written block. However, this was not
needed at all, since the location of the written block was already known -
a hook to submit_bio() was the reason for entering the check_int module.
Additionally, after a block relocation it happened that btrfs_map_block()
failed causing misleading error messages afterwards.

This patch changes the check_int module to use the known information of
the physical location from the bio.

Reported-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Tested-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:29 -08:00
Filipe Manana
c8fd3de79f Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early
We try to allocate an extent state before acquiring the tree's spinlock
just in case we end up needing to split an existing extent state into two.
If that allocation fails, we would return -ENOMEM.
However, our single caller (the transaction/log commit code) passes in
an extent state that was cached from a call to find_first_extent_bit() and
that has a very high chance of matching exactly the input range (always true
for a transaction commit and very often, but not always, true for a log
commit) - in this case we end up not needing that initial extent state used
for an eventual split at all. Therefore just don't return -ENOMEM if
we can't allocate the temporary extent state, since we might not need it
at all, and if we end up needing one, we'll allocate it later anyway.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:29 -08:00
Filipe Manana
e38e2ed701 Btrfs: make find_first_extent_bit be able to cache any state
Right now the only caller of find_first_extent_bit() that is interested
in caching extent states (transaction or log commit) never gets an extent
state cached. This is because find_first_extent_bit() only caches states
that have at least one of the flags EXTENT_IOBITS or EXTENT_BOUNDARY, and
the transaction/log commit caller always passes a tree that never has
extent states with any of those flags (they can only have one of the
following flags: EXTENT_DIRTY, EXTENT_NEW or EXTENT_NEED_WAIT).

This change together with the following one in the patch series (titled
"Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early") will
help significantly reduce the chances of calls to convert_extent_bit()
failing with -ENOMEM when called from the transaction/log commit code.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:29 -08:00
Filipe Manana
663dfbb077 Btrfs: deal with convert_extent_bit errors to avoid fs corruption
When committing a transaction or a log, we look for btree extents that
need to be durably persisted by searching for ranges in a io tree that
have some bits set (EXTENT_DIRTY or EXTENT_NEW). We then attempt to clear
those bits and set the EXTENT_NEED_WAIT bit, with calls to the function
convert_extent_bit, and then start writeback for the extents.

That function however can return an error (at the moment only -ENOMEM
is possible, especially when it does GFP_ATOMIC allocation requests
through alloc_extent_state_atomic) - that means the ranges didn't get
the EXTENT_NEED_WAIT bit set (or at least not for the whole range),
which in turn means a call to btrfs_wait_marked_extents() won't find
those ranges for which we started writeback, causing a transaction
commit or a log commit to persist a new superblock without waiting
for the writeback of extents in that range to finish first.

Therefore if a crash happens after persisting the new superblock and
before writeback finishes, we have a superblock pointing to roots that
weren't fully persisted or roots that point to nodes or leafs that weren't
fully persisted, causing all sorts of unexpected/bad behaviour as we end up
reading garbage from disk or the content of some node/leaf from a past
generation that got cowed or deleted and is no longer valid (for this latter
case we end up getting error messages like "parent transid verify failed on
X wanted Y found Z" when reading btree nodes/leafs from disk).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:29 -08:00
Eryu Guan
2fc9f6baa2 Btrfs: return failure if btrfs_dev_replace_finishing() failed
device replace could fail due to another running scrub process or any
other errors btrfs_scrub_dev() may hit, but this failure doesn't get
returned to userspace.

The following steps could reproduce this issue

	mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
	mount /dev/sdb1 /mnt/btrfs
	while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done &
	btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
	# if this replace succeeded, do the following and repeat until
	# you see this log in dmesg
	# BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
	#btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs

	# once you see the error log in dmesg, check return value of
	# replace
	echo $?

Introduce a new dev replace result

BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS

to catch -EINPROGRESS explicitly and return other errors directly to
userspace.

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:28 -08:00
Shilong Wang
6b3a4d60db Btrfs: fix memory allocation failure for btrfsic_state structure
The @btrfsic_state structure needs more than 2M of memory, so allocating it
with kzalloc() is very likely to fail. See the following message:

[91428.902148] Call Trace:
[<ffffffff816f6e0f>] dump_stack+0x4d/0x66
[<ffffffff811b1c7f>] warn_alloc_failed+0xff/0x170
[<ffffffff811b66e1>] __alloc_pages_nodemask+0x951/0xc30
[<ffffffff811fd9da>] alloc_pages_current+0x11a/0x1f0
[<ffffffff811b1e0b>] ? alloc_kmem_pages+0x3b/0xf0
[<ffffffff811b1e0b>] alloc_kmem_pages+0x3b/0xf0
[<ffffffff811d1018>] kmalloc_order+0x18/0x50
[<ffffffff811d1074>] kmalloc_order_trace+0x24/0x140
[<ffffffffa06c097b>] btrfsic_mount+0x8b/0xae0 [btrfs]
[<ffffffff810af555>] ? check_preempt_curr+0x85/0xa0
[<ffffffff810b2de3>] ? try_to_wake_up+0x103/0x430
[<ffffffffa063d200>] open_ctree+0x1bd0/0x2130 [btrfs]
[<ffffffffa060fdde>] btrfs_mount+0x62e/0x8b0 [btrfs]
[<ffffffff811fd9da>] ? alloc_pages_current+0x11a/0x1f0
[<ffffffff811b0a5e>] ? __get_free_pages+0xe/0x50
[<ffffffff81230429>] mount_fs+0x39/0x1b0
[<ffffffff812509fb>] vfs_kern_mount+0x6b/0x150
[<ffffffff812537fb>] do_mount+0x27b/0xc30
[<ffffffff811b0a5e>] ? __get_free_pages+0xe/0x50
[<ffffffff812544f6>] SyS_mount+0x96/0xf0
[<ffffffff81701970>] system_call_fastpath+0x16/0x1b

Since we are allocating memory for a hash table array, it would be
good if we could allocate contiguous pages here.

Fix this problem by first trying kzalloc() and, if that fails, using
vzalloc() instead.
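
The fallback pattern looks roughly like this (a sketch, not the literal diff):

	state = kzalloc(sizeof(*state), GFP_KERNEL | __GFP_NOWARN);
	if (!state)
		state = vzalloc(sizeof(*state));
	if (!state)
		return -ENOMEM;
	/* ... use state ...; on teardown, kvfree(state) handles both cases */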

Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:28 -08:00
Filipe Manana
e6eb43142a Btrfs: report error after failure inlining extent in compressed write path
If cow_file_range_inline() failed when called from compress_file_range(),
we were tagging the locked page for writeback, ending its writeback and unlocking it,
but not marking it with an error nor setting AS_EIO in the inode's mapping flags.

This made it impossible for a caller of filemap_fdatawrite_range (writepages)
or filemap_fdatawait_range() to know that an error happened. And the return
value of compress_file_range() is useless because it's returned to a workqueue
task and not to the task calling filemap_fdatawrite_range (writepages).

This change applies on top of the previous patchset starting at the patch
titled:

    "[1/5] Btrfs: set page and mapping error on compressed write failure"

Which changed extent_clear_unlock_delalloc() to use SetPageError and
mapping_set_error().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:28 -08:00
Filipe Manana
728404dacf Btrfs: add helper btrfs_fdatawrite_range
To avoid duplicating this double filemap_fdatawrite_range() call for
inodes with async extents (compressed writes) so often.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:28 -08:00
Filipe Manana
075bdbdbe9 Btrfs: correctly flush compressed data before/after direct IO
For compressed writes, after doing the first filemap_fdatawrite_range() we
don't get the pages tagged for writeback immediately. Instead we create
a workqueue task, which is run by another kthread, and keep the pages locked.
That other kthread compresses data, creates the respective ordered extent/s,
tags the pages for writeback and unlocks them. Therefore we need a second
call to filemap_fdatawrite_range() if we have compressed writes, as this
second call will wait for the pages to become unlocked, then see they became
tagged for writeback and finally wait for the writeback to finish.
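
The helper introduced by "Btrfs: add helper btrfs_fdatawrite_range" captures
that, roughly (a sketch of its shape, not necessarily the exact code):

	int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end)
	{
		int ret;

		ret = filemap_fdatawrite_range(inode->i_mapping, start, end);
		/*
		 * Compressed writes hand the still-locked pages to a worker, so
		 * a second call is needed: it waits for the pages to be unlocked
		 * and tagged for writeback by that worker, and then for the
		 * writeback itself.
		 */
		if (!ret && test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
				     &BTRFS_I(inode)->runtime_flags))
			ret = filemap_fdatawrite_range(inode->i_mapping, start, end);

		return ret;
	}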

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:27 -08:00
Filipe Manana
c44f649e28 Btrfs: make inode.c:compress_file_range() return void
Its return value is useless, its single caller ignores it and can't do
anything with it anyway, since it's a workqueue task and not the task
calling filemap_fdatawrite_range (writepages) nor filemap_fdatawait_range().
Failure is communicated to such functions via start and end of writeback
with the respective pages tagged with an error and AS_EIO flag set in the
inode's mapping.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:27 -08:00
Shilong Wang
4bcbb33255 Btrfs: fix incorrect compression ratio detection
Steps to reproduce:
 # mkfs.btrfs -f /dev/sdb
 # mount -t btrfs /dev/sdb /mnt -o compress=lzo
 # dd if=/dev/zero of=/mnt/data bs=$((33*4096)) count=1

After the previous steps, the inode will be detected as having a bad compression
ratio, and the NOCOMPRESS flag will be set for that inode.

The reason is that compression has a max limit of pages each time (128K); if a
132k write comes in, it will be split into two writes (128k+4k). This bug
is a leftover from commit 68bb462d42a (Btrfs: don't compress for a small write).

Fix this problem by checking, every time before compression, whether it is a
small write (<=blocksize); if so, we bail out and fall into nocompression directly.
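
The added check in compress_file_range() looks roughly like this (a
reconstructed sketch, not a literal quote of the patch):

	/*
	 * Skip compression for a small write (at most one block) that is not
	 * going to become an inline extent; it saves no disk space and would
	 * only poison the inode's compression-ratio heuristic.
	 */
	if ((end - start + 1) <= blocksize &&
	    (start > 0 || end + 1 < BTRFS_I(inode)->disk_i_size))
		goto cleanup_and_bail_uncompressed;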

Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:27 -08:00
Filipe Manana
7bdcefc103 Btrfs: don't ignore compressed bio write errors
Our compressed bio write end callback was essentially ignoring the error
parameter. When a write error happens, it must pass a value of 0 to the
inode's write_page_end_io_hook callback, SetPageError on the respective
pages and set AS_EIO in the inode's mapping flags, so that a call to
filemap_fdatawait_range() / filemap_fdatawait() can find out that errors
happened (we surely don't want silent failures on fsync for example).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:26 -08:00
Filipe Manana
dec8f17563 Btrfs: make inode.c:submit_compressed_extents() return void
Its return value is completely ignored by its single caller and it's
useless anyway, since errors are indicated through SetPageError and
the bit AS_EIO set in the flags of the inode's mapping. The caller
can't do anything with the value, as it's invoked from a workqueue
task and not by the task calling filemap_fdatawrite_range (which calls
the writepages address space callback, which in turn calls the inode's
fill_delalloc callback).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:26 -08:00
Filipe Manana
3d7a820f71 Btrfs: process all async extents on compressed write failure
If we had an error when processing one of the async extents from our list,
we were not processing the remaining async extents, meaning we would leak
those async_extent structs, never release the pages with the compressed
data and never unlock and clear the dirty flag from the inode's pages (those
that correspond to the uncompressed content).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:26 -08:00
Filipe Manana
40ae837b43 Btrfs: don't leak pages and memory on compressed write error
In inode.c:submit_compressed_extents(), if we fail before calling
btrfs_submit_compressed_write(), or when that function fails, we
were freeing the async_extent structure without releasing its pages
and freeing the pages array.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:26 -08:00
Filipe Manana
fce2a4e6b2 Btrfs: fix hang on compressed write error
In inode.c:submit_compressed_extents(), before calling btrfs_submit_compressed_write()
we start writeback for all pages, clear their dirty flag, unlock them, etc, but if
btrfs_submit_compressed_write() fails (at the moment it can only fail with -ENOMEM),
we never end the writeback on the pages, so any filemap_fdatawait_range() call will
hang forever. We were also not calling the writepage end io hook, which means the
corresponding ordered extent will never complete and all its waiters will block
forever, such as a full fsync (via btrfs_wait_ordered_range()).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:25 -08:00
Filipe Manana
704de49d2b Btrfs: set page and mapping error on compressed write failure
If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(),
we start and end the writeback for the pages (clear their dirty flag, unlock them, etc)
but we don't tag the pages, nor the inode's mapping, with an error. This makes it
impossible for a caller of filemap_fdatawait_range() (fsync, or transaction commit
for example) to know that there was an error.

Note that the return value of submit_compressed_extents() is useless, as that function
is executed by a workqueue task and not directly by the fill_delalloc callback. This
means the writepage/s callbacks of the inode's address space operations don't get that
return value.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:25 -08:00
Al Viro
b93b41d4c7 ext4: kill ext4_kvfree()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-20 12:19:11 -05:00
Miklos Szeredi
7676895f47 ovl: ovl_dir_fsync() cleanup
Check against !OVL_PATH_LOWER instead of OVL_PATH_MERGE.  For a copied up
directory the two are currently equivalent.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-11-20 16:40:02 +01:00
Miklos Szeredi
c9f00fdb9a ovl: pass dentry into ovl_dir_read_merged()
Pass dentry into ovl_dir_read_merged() instead of upperpath and lowerpath.
This cleans up callers and paves the way for multi-layer directory reads.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-11-20 16:40:01 +01:00
Miklos Szeredi
71d509280f ovl: use lockless_dereference() for upperdentry
Don't open code lockless_dereference() in ovl_upperdentry_dereference().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-11-20 16:40:01 +01:00
Miklos Szeredi
91c7794713 ovl: allow filenames with comma
Allow option separator (comma) to be escaped with backslash.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-11-20 16:40:00 +01:00
Miklos Szeredi
521484639e ovl: fix race in private xattr checks
Xattr operations can race with copy up.  This does not matter as long as
we consistently filter out the "trusted.overlay.opaque" attribute on upper
directories.

Previously we checked parent against OVL_PATH_MERGE.  This is too general,
and prone to race with copy-up.  I.e. we found the parent to be on the
lower layer but ovl_dentry_real() would return the copied-up dentry,
possibly with the "opaque" attribute.

So instead use ovl_path_real() and decide to filter the attributes based on
the actual type of the dentry we'll use.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-11-20 16:40:00 +01:00
Miklos Szeredi
a105d685a8 ovl: fix remove/copy-up race
ovl_remove_and_whiteout() needs to check if upper dentry exists or not
after having locked upper parent directory.

Previously we used a "type" value computed before locking the upper parent
directory, which is susceptible to racing with copy-up.

There's a similar check in ovl_check_empty_and_clear().  This one is not
actually racy, since copy-up doesn't change the "emptiness" property of a
directory.  Add a comment to this effect, and check the existence of upper
dentry locally to make the code cleaner.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-11-20 16:39:59 +01:00
Miklos Szeredi
ef94b1864d ovl: rename filesystem type to "overlay"
Some distributions carry an "old" format of overlayfs while mainline has a
"new" format.

The distros will possibly want to keep the old overlayfs alongside the new
for compatibility reasons.

To make it possible to differentiate the two versions change the name of
the new one from "overlayfs" to "overlay".

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reported-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Andy Whitcroft <apw@canonical.com>
2014-11-20 16:39:59 +01:00
Masanari Iida
6774def642 treewide: fix typo in printk and Kconfig
This patch fixes spelling typos in printk messages and Kconfig within
various parts of the kernel sources.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-11-20 14:56:11 +01:00
Al Viro
ec7d879c45 GFS2: gfs2_atomic_open(): simplify the use of finish_no_open()
In ->atomic_open(inode, dentry, file, opened) calling finish_no_open(file, NULL)
is equivalent to dget(dentry); return finish_no_open(file, dentry);

No need to open-code that...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-11-20 11:18:08 +00:00
Al Viro
9265f1d0c7 GFS2: gfs2_dir_get_hash_table(): avoiding deferred vfree() is easy here...
vfree() is allowed under spinlock these days, but it's cheaper when
it doesn't step into the deferred case, and here it's very easy to avoid.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-11-20 10:29:44 +00:00
Al Viro
3cdcf63ed2 GFS2: use kvfree() instead of open-coding it
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-11-20 10:29:14 +00:00
Al Viro
44bb31bac5 GFS2: gfs2_create_inode(): don't bother with d_splice_alias()
dentry is always hashed and negative, inode - non-error, non-NULL and
non-directory.  In such conditions d_splice_alias() is equivalent to
"d_instantiate(dentry, inode) and return NULL", which simplifies the
downstream code and is consistent with the "have to create a new object"
case.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-11-20 10:17:39 +00:00
Al Viro
571a4b5797 GFS2: bugger off early if O_CREAT open finds a directory
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-11-20 10:16:58 +00:00
Jaegeuk Kim
857dc4e059 f2fs: write SSA pages under memory pressure
Under memory pressure, we don't need to skip SSA page writes.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-19 22:49:33 -08:00
Jaegeuk Kim
27c6bd60ac f2fs: submit bio for node blocks in the reclaim path
If a node page is requested to be written during the reclaim path, we should
submit the bio to avoid the write staying pending and blocking reclaim of it.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-19 22:49:32 -08:00
Chao Yu
67298804f3 f2fs: introduce struct inode_management to wrap inner fields
Now in f2fs, we have three inode caches: ORPHAN_INO, APPEND_INO, UPDATE_INO,
and we manage the fields related to each inode cache type separately in
struct f2fs_sb_info.
This makes the code a bit messy, so this patch introduces a new struct
inode_management to wrap the inner fields as follows, which makes the code neater.

/* for inner inode cache management */
struct inode_management {
	struct radix_tree_root ino_root;	/* ino entry array */
	spinlock_t ino_lock;			/* for ino entry lock */
	struct list_head ino_list;		/* inode list head */
	unsigned long ino_num;			/* number of entries */
};

struct f2fs_sb_info {
	...
	struct inode_management im[MAX_INO_ENTRY];      /* manage inode cache */
	...
}

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-19 22:49:32 -08:00
Chao Yu
aba291b3d8 f2fs: remove unneeded check code with option in f2fs_remount
Because we have already checked the contrary condition in the "if" branch, we do
not need to check the condition again in the "else" branch. Let's remove
it.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-19 22:49:31 -08:00
Chao Yu
6c02993203 f2fs: avoid unable to restart gc thread in remount
In f2fs_remount, we will stop the gc thread and set need_restart_gc to true when
the new options are set without BG_GC; then, if any error occurs in the following
procedure, we can recover by restarting the gc thread.
But after that, we will fail to restore the gc thread in start_gc_thread because
BG_GC is not set in the new options, so we'd better move this condition check out
of start_gc_thread to fix this issue.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-19 22:49:30 -08:00
Markus Elfring
fdf2657bc8 udf: One function call less in udf_fill_super() after error detection
The iput() function was called in up to three cases by the udf_fill_super()
function during error handling even if the passed data structure element
still contained a null pointer. This implementation detail could be improved
by the introduction of another jump label.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-19 21:56:06 +01:00
Markus Elfring
0d454e4a44 udf: Deletion of unnecessary checks before the function call "iput"
The iput() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
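
In other words, call sites of the shape shown below can be collapsed (a
generic sketch):

	/* before */
	if (inode)
		iput(inode);

	/* after - iput() already copes with a NULL argument */
	iput(inode);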

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-19 21:55:45 +01:00
David Teigland
2ab4bd8ea3 dlm: adopt orphan locks
A process may exit, leaving an orphan lock in the lockspace.
This adds the capability for another process to acquire the
orphan lock.  Acquiring the orphan just moves the lock from
the orphan list onto the acquiring process's list of locks.

An adopting process must specify the resource name and mode
of the lock it wants to adopt.  If a matching lock is found,
the lock is moved to the caller's list of locks, and the
lkid of the lock is returned like the lkid of a new lock.

If an orphan with a different mode is found, then -EAGAIN is
returned.  If no orphan lock is found on the resource, then
-ENOENT is returned.  No async completion is used because
the result is immediately available.

Also, when orphans are purged, allow a zero nodeid to refer
to the local nodeid so the caller does not need to look up
the local nodeid.

Signed-off-by: David Teigland <teigland@redhat.com>
2014-11-19 14:48:02 -06:00
Trond Myklebust
c6c15e1ed3 nfsd: Fix slot wake up race in the nfsv4.1 callback code
The current code for nfsd41_cb_get_slot() and nfsd4_cb_done() has no
locking to guarantee atomicity, and so allows for races of
the following form:

Task 1                                  Task 2
======                                  ======
if (test_and_set_bit(0) != 0) {
                                        clear_bit(0)
                                        rpc_wake_up_next(queue)
        rpc_sleep_on(queue)
        return false;
}

This patch breaks the race condition by adding a retest of the bit
after the call to rpc_sleep_on().
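
The shape of the fix is roughly as follows (a sketch; the slot bit and wait
queue fields are abbreviated, not the literal patch):

	if (test_and_set_bit(0, &clp->cl_cb_slot_busy) != 0) {
		rpc_sleep_on(&clp->cl_cb_waitq, task, NULL);

		/*
		 * Race breaker: the slot may have been released (and the queue
		 * woken) between the test_and_set_bit() above and rpc_sleep_on().
		 */
		if (test_and_set_bit(0, &clp->cl_cb_slot_busy) != 0)
			return false;	/* still busy, stay queued */
		rpc_wake_up_queued_task(&clp->cl_cb_waitq, task);
	}
	return true;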

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-11-19 15:45:44 -05:00
Chris Mason
f82c458a2c btrfs: fix lockups from btrfs_clear_path_blocking
The fair reader/writer locks mean that btrfs_clear_path_blocking needs
to strictly follow lock ordering rules even when we already have
blocking locks on a given path.

Before we can clear a blocking lock on the path, we need to make sure
all of the locks have been converted to blocking.  This will remove lock
inversions against anyone spinning in write_lock() against the buffers
we're trying to get read locks on.  These inversions didn't exist before
the fair read/writer locks, but now we need to be more careful.

We papered over this deadlock in the past by changing
btrfs_try_read_lock() to be a true trylock against both the spinlock and
the blocking lock.  This was slower, and not sufficient to fix all the
deadlocks.  This patch adds a btrfs_tree_read_lock_atomic(), which
basically means get the spinlock but trylock on the blocking lock.

Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reported-by: Patrick Schmid <schmid@phys.ethz.ch>
cc: stable@vger.kernel.org #v3.15+
2014-11-19 10:34:35 -08:00
Arnd Bergmann
7ca2f23440 isofs: avoid unused function warning
With the isofs_hash() function removed, isofs_hash_ms() is the only user
of isofs_hash_common(), but it's defined inside of an #ifdef, which triggers
this gcc warning in ARM axm55xx_defconfig starting with v3.18-rc3:

fs/isofs/inode.c:177:1: warning: 'isofs_hash_common' defined but not used [-Wunused-function]

This patch moves the function inside of the same #ifdef section to avoid that
warning, which seems the best compromise of a relatively harmless patch for
a late -rc.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: b0afd8e5db ("isofs: don't bother with ->d_op for normal case")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:09:37 -05:00
Yan, Zheng
4a7795d35e vfs: fix reference leak in d_prune_aliases()
In "d_prune_alias(): just lock the parent and call __dentry_kill()" the old
dget + d_drop + dput has been replaced with lock_parent + __dentry_kill;
unfortunately, dput() does more than just killing the dentry - it also drops the
reference to the parent.  The new variant leaks that reference and needs a
dput(parent) after killing the child off.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:07:20 -05:00
Al Viro
8ce74dd605 Merge tag 'trace-seq-file-cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into for-next
Pull the beginning of seq_file cleanup from Steven:
  "I'm looking to clean up the seq_file code and to eventually merge the
  trace_seq code with seq_file as well, since they basically do the same thing.

  Part of this process is to remove the return code of seq_printf() and friends
  as they are rather inconsistent. It is better to use the new function
  seq_has_overflowed() if you want to stop processing when the buffer
  is full. Note, if the buffer is full, the seq_file code will throw away
  the contents, allocate a bigger buffer, and then call your code again
  to fill in the data. The only thing that breaking out of the function
  early does is to save a little time which is probably never noticed.

  I started with patches from Joe Perches and modified them as well.
  There's many more places that need to be updated before we can convert
  seq_printf() and friends to return void. But this patch set introduces
  the seq_has_overflowed() and does some initial updates."
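
For illustration, a show() method using the new helper could look like this
(a hedged sketch, not code from the series):

	static int foo_show(struct seq_file *m, void *v)
	{
		seq_printf(m, "first chunk of output\n");

		/*
		 * Optional early-out: if the buffer overflowed, the seq_file
		 * core will throw the contents away, allocate a bigger buffer
		 * and call us again, so bailing out here only saves some work.
		 */
		if (seq_has_overflowed(m))
			return 0;

		seq_printf(m, "second chunk of output\n");
		return 0;
	}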
2014-11-19 13:02:53 -05:00
Mikulas Patocka
08d4f77222 dcache: fix kmemcheck warning in switch_names
This patch fixes kmemcheck warning in switch_names. The function
switch_names swaps inline names of two dentries. It swaps full arrays
d_iname, no matter how many bytes are really used by the strings. Reading
data beyond string ends results in kmemcheck warning.

We fix the bug by marking both arrays as fully initialized.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # v3.15
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:26 -05:00
Al Viro
9f45f5bf30 new helper: audit_file()
... for situations when we don't have any candidate in pathnames - basically,
in descriptor-based syscalls.

[Folded the build fix for !CONFIG_AUDITSYSCALL configs from Chen Gang]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:26 -05:00
Al Viro
6f4e0d5aaa nfsd_vfs_write(): use file_inode()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:26 -05:00
Al Viro
a67f797db6 ncpfs: use file_inode()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:25 -05:00
Al Viro
b583043e99 kill f_dentry uses
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:25 -05:00
Al Viro
30e46aba8f lockd: get rid of ->f_path.dentry->d_sb
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:24 -05:00
Al Viro
3aa3377fbc procfs: get rid of ->f_dentry
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:24 -05:00
Al Viro
ef8a1a10e9 nfsd: get rid of ->f_dentry
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:23 -05:00
Al Viro
32a59234ae rpc_pipefs.c: get rid of f_dentry
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:23 -05:00
Al Viro
3c981bfc57 afs_fsync: don't bother with ->f_path.dentry
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:22 -05:00
Al Viro
7119e220a7 cifs: get rid of ->f_path.dentry->d_sb uses, add a new helper
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:22 -05:00
Al Viro
ddb52f4fd2 btrfs: get rid of f_dentry use
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:21 -05:00
Al Viro
244c7d444b nfsd/nfsctl.c: new helper
... to get from opened file on nfsctl to relevant struct net *

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:21 -05:00
Al Viro
a455589f18 assorted conversions to %p[dD]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:20 -05:00
Al Viro
41d28bca2d switch d_materialise_unique() users to d_splice_alias()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:20 -05:00
Al Viro
b5ae6b15bd merge d_materialise_unique() into d_splice_alias()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:19 -05:00
Al Viro
154e80e4c3 Merge branch 'for-gfs2' into for-next 2014-11-19 13:00:57 -05:00
Al Viro
427c77d461 d_add_ci() should just accept a hashed exact match if it finds one
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:00:10 -05:00
Al Viro
845409b49b gfs2_atomic_open(): simplify the use of finish_no_open()
In ->atomic_open(inode, dentry, file, opened) calling finish_no_open(file, NULL)
is equivalent to dget(dentry); return finish_no_open(file, dentry);

No need to open-code that...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 12:57:21 -05:00
Al Viro
81295ce635 gfs2_create_inode(): don't bother with d_splice_alias()
dentry is always hashed and negative, inode - non-error, non-NULL and
non-directory.  In such conditions d_splice_alias() is equivalent to
"d_instantiate(dentry, inode) and return NULL", which simplifies the
downstream code and is consistent with the "have to create a new object"
case.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 12:57:21 -05:00
Al Viro
986cdb862e gfs2: bugger off early if O_CREAT open finds a directory
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 12:57:14 -05:00
J. Bruce Fields
56429e9b3b merge nfs bugfixes into nfsd for-3.19 branch
In addition to nfsd bugfixes, there are some fixes in -rc5 for client
bugs that can interfere with my testing.
2014-11-19 12:06:30 -05:00
Christoph Hellwig
6d0ba0432a nfsd: correctly define v4.2 support attributes
Even when security labels are disabled we support at least the same
attributes as v4.1.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-11-19 12:03:19 -05:00
James Morris
a6aacbde40 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity into next 2014-11-19 21:36:07 +11:00
James Morris
b10778a00d Merge commit 'v3.17' into next 2014-11-19 21:32:12 +11:00
Jaegeuk Kim
8cdcb71322 f2fs: put the inode page when an error occurs
We should put the inode page when an error occurs.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-18 17:04:33 -08:00
Jaegeuk Kim
6d20aff83c f2fs: fix to call put_page at the error handling routine
The locked page should be released before returning from the function.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-18 17:02:47 -08:00
Markus Elfring
30badc9543 GFS2: Deletion of unnecessary checks before two function calls
The functions iput() and put_pid() test whether their argument is NULL
and then return immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-11-18 10:57:58 +00:00
Markus Elfring
11cc9f56a1 jbd: Deletion of an unnecessary check before the function call "iput"
The iput() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-18 10:15:29 +01:00
Dmitry Kasatkin
6fb5032ebb VFS: refactor vfs_read()
integrity_kernel_read() duplicates the file read operation code
in vfs_read(). This patch refactors the vfs_read() code, creating a
helper function __vfs_read(). It is used by both vfs_read() and
integrity_kernel_read().
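
The helper is essentially the dispatch part of vfs_read(), roughly:

	ssize_t __vfs_read(struct file *file, char __user *buf, size_t count,
			   loff_t *pos)
	{
		if (file->f_op->read)
			return file->f_op->read(file, buf, count, pos);
		else if (file->f_op->aio_read)
			return do_sync_read(file, buf, count, pos);
		else if (file->f_op->read_iter)
			return new_sync_read(file, buf, count, pos);
		else
			return -EINVAL;
	}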

Signed-off-by: Dmitry Kasatkin <d.kasatkin@samsung.com>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
2014-11-17 23:14:22 -05:00
Dave Hansen
abe1e395f6 fs: Do not include mpx.h in exec.c
We no longer need mpx.h in exec.c.  This will obviously also
break the build for non-x86 builds.  We get the MPX includes that
we need from mmu_context.h now.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141118003608.837015B3@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-18 02:01:40 +01:00
Dave Hansen
fe3d197f84 x86, mpx: On-demand kernel allocation of bounds tables
This is really the meat of the MPX patch set.  If there is one patch to
review in the entire series, this is the one.  There is a new ABI here
and this kernel code also interacts with userspace memory in a
relatively unusual manner.  (small FAQ below).

Long Description:

This patch adds two prctl() commands to enable or disable the
management of bounds tables in the kernel, including on-demand kernel
allocation (See the patch "on-demand kernel allocation of bounds tables")
and cleanup (See the patch "cleanup unused bound tables"). Applications
do not strictly need the kernel to manage bounds tables and we expect
some applications to use MPX without taking advantage of this kernel
support. This means the kernel can not simply infer whether an application
needs bounds table management from the MPX registers.  The prctl() is an
explicit signal from userspace.

PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
require kernel's help in managing bounds tables.

PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace doesn't
want the kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
won't allocate and free bounds tables even if the CPU supports MPX.

PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
directory out of a userspace register (bndcfgu) and then cache it into
a new field (->bd_addr) in the 'mm_struct'.  PR_MPX_DISABLE_MANAGEMENT
will set "bd_addr" to an invalid address.  Using this scheme, we can
use "bd_addr" to determine whether the management of bounds tables in
kernel is enabled.

Also, the only way to access that bndcfgu register is via an xsaves,
which can be expensive.  Caching "bd_addr" like this also helps reduce
the cost of those xsaves when doing table cleanup at munmap() time.
Unfortunately, we can not apply this optimization to #BR fault time
because we need an xsave to get the value of BNDSTATUS.
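
From userspace the new interface would be used roughly like this (a sketch;
PR_MPX_ENABLE_MANAGEMENT and PR_MPX_DISABLE_MANAGEMENT are the constants
added by this series):

	#include <stdio.h>
	#include <sys/prctl.h>

	int main(void)
	{
		/* Ask the kernel to manage bounds tables for this process. */
		if (prctl(PR_MPX_ENABLE_MANAGEMENT, 0, 0, 0, 0))
			perror("PR_MPX_ENABLE_MANAGEMENT");

		/* ... run MPX-instrumented code here ... */

		/* Tell the kernel to stop managing bounds tables. */
		if (prctl(PR_MPX_DISABLE_MANAGEMENT, 0, 0, 0, 0))
			perror("PR_MPX_DISABLE_MANAGEMENT");

		return 0;
	}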

==== Why does the hardware even have these Bounds Tables? ====

MPX only has 4 hardware registers for storing bounds information.
If MPX-enabled code needs more than these 4 registers, it needs to
spill them somewhere. It has two special instructions for this
which allow the bounds to be moved between the bounds registers
and some new "bounds tables".

They are similar conceptually to a page fault and will be raised by
the MPX hardware during both bounds violations or when the tables
are not present. This patch handles those #BR exceptions for
not-present tables by carving the space out of the normal process's
address space (essentially calling the new mmap() interface introduced
earlier in this patch set) and then pointing the bounds directory
over to it.

The tables *need* to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register pointing to
memory is dereferenced. Any direct kernel involvement (like a syscall)
to access the tables would obviously destroy performance.

==== Why not do this in userspace? ====

This patch is obviously doing this allocation in the kernel.
However, MPX does not strictly *require* anything in the kernel.
It can theoretically be done completely from userspace. Here are
a few ways this *could* be done. I don't think any of them are
practical in the real-world, but here they are.

Q: Can virtual space simply be reserved for the bounds tables so
   that we never have to allocate them?
A: As noted earlier, these tables are *HUGE*. An X-GB virtual
   area needs 4*X GB of virtual space, plus 2GB for the bounds
   directory. If we were to preallocate them for the 128TB of
   user virtual address space, we would need to reserve 512TB+2GB,
   which is larger than the entire virtual address space today.
   This means they can not be reserved ahead of time. Also, a
   single process's pre-populated bounds directory consumes 2GB
   of virtual *AND* physical memory. IOW, it's completely
   infeasible to prepopulate bounds directories.

Q: Can we preallocate bounds table space at the same time memory
   is allocated which might contain pointers that might eventually
   need bounds tables?
A: This would work if we could hook the site of each and every
   memory allocation syscall. This can be done for small,
   constrained applications. But, it isn't practical at a larger
   scale since a given app has no way of controlling how all the
   parts of the app might allocate memory (think libraries). The
   kernel is really the only place to intercept these calls.

Q: Could a bounds fault be handed to userspace and the tables
   allocated there in a signal handler instead of in the kernel?
A: (thanks to tglx) mmap() is not on the list of safe async
   handler functions and even if mmap() would work it still
   requires locking or nasty tricks to keep track of the
   allocation state there.

Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.

Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-18 00:58:53 +01:00
Qiaowei Ren
4aae7e436f x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specific
MPX-enabled applications using large swaths of memory can
potentially have large numbers of bounds tables in process
address space to save bounds information. These tables can take
up huge swaths of memory (as much as 80% of the memory on the
system) even if we clean them up aggressively. In the worst-case
scenario, the tables can be 4x the size of the data structure
being tracked. IOW, a 1-page structure can require 4 bounds-table
pages.

Being this huge, our expectation is that folks using MPX are
going to be keen on figuring out how much memory is being
dedicated to it. So we need a way to track memory use for MPX.

If we want to specifically track MPX VMAs we need to be able to
distinguish them from normal VMAs, and keep them from getting
merged with normal VMAs. A new VM_ flag set only on MPX VMAs does
both of those things. With this flag, MPX bounds-table VMAs can
be distinguished from other VMAs, and userspace can also walk
/proc/$pid/smaps to get memory usage for MPX.

In addition to this flag, we also introduce a special ->vm_ops
specific to MPX VMAs (see the patch "add MPX specific mmap
interface"), but currently different ->vm_ops do not by
themselves prevent VMA merging, so we still need this flag.

We understand that VM_ flags are scarce and are open to other
options.

Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151825.565625B3@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-18 00:58:53 +01:00
Benjamin Marzinski
2e60d7683c GFS2: update freeze code to use freeze/thaw_super on all nodes
The current gfs2 freezing code is considerably more complicated than it
should be because it doesn't use the vfs freezing code on any node except
the one that begins the freeze.  This is because it needs to acquire a
cluster glock before calling the vfs code to prevent a deadlock, and
without the new freeze_super and thaw_super hooks, that was impossible. To
deal with the issue, gfs2 had to do some hacky locking tricks to make sure
that a frozen node couldn't be holding onto a lock it needed to do the
unfreeze ioctl.

This patch makes use of the new hooks to simplify the gfs2 locking code. Now,
all the nodes in the cluster freeze and thaw in exactly the same way. Every
node in the cluster caches the freeze glock in the shared state.  The new
freeze_super hook allows the freezing node to grab this freeze glock in
the exclusive state without first calling the vfs freeze_super function.
All the nodes in the cluster see this lock change, and call the vfs
freeze_super function. The vfs locking code guarantees that the nodes can't
get stuck holding the glocks necessary to unfreeze the system.  To
unfreeze, the freezing node uses the new thaw_super hook to drop the freeze
glock. Again, all the nodes notice this, reacquire the glock in shared mode
and call the vfs thaw_super function.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-11-17 10:36:39 +00:00
Benjamin Marzinski
48b6bca6b7 fs: add freeze_super/thaw_super fs hooks
Currently, freezing a filesystem involves calling freeze_super, which locks
sb->s_umount and then calls the fs-specific freeze_fs hook. This makes it
hard for gfs2 (and potentially other cluster filesystems) to use the vfs
freezing code to do freezes on all the cluster nodes.

In order to communicate that a freeze has been requested, and to make sure
that only one node is trying to freeze at a time, gfs2 uses a glock
(sd_freeze_gl). The problem is that there is no hook for gfs2 to acquire
this lock before calling freeze_super. This means that two nodes can
attempt to freeze the filesystem by both calling freeze_super, acquiring
the sb->s_umount lock, and then attempting to grab the cluster glock
sd_freeze_gl. Only one will succeed, and the other will be stuck in
freeze_super, making it impossible to finish freezing the node.

To solve this problem, this patch adds the freeze_super and thaw_super
hooks.  If a filesystem implements these hooks, they are called instead of
the vfs freeze_super and thaw_super functions. This means that every
filesystem that implements these hooks must call the vfs freeze_super and
thaw_super functions itself within the hook function to make use of the vfs
freezing code.
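
A hypothetical filesystem wiring up the new hooks would look roughly like
this (sketch only; the cluster-lock helpers are made-up names):

	static int myfs_freeze_super(struct super_block *sb)
	{
		int error;

		/* fs-specific step, e.g. take a cluster-wide freeze lock (hypothetical) */
		error = myfs_acquire_freeze_lock(sb);
		if (error)
			return error;

		/* the hook must itself call the vfs helper to do the real freeze */
		return freeze_super(sb);
	}

	static int myfs_thaw_super(struct super_block *sb)
	{
		myfs_release_freeze_lock(sb);
		return thaw_super(sb);
	}

	static const struct super_operations myfs_super_ops = {
		/* ... */
		.freeze_super	= myfs_freeze_super,
		.thaw_super	= myfs_thaw_super,
	};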

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-11-17 10:35:17 +00:00
Ingo Molnar
e9ac5f0fa8 Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying more changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:50:25 +01:00
Ingo Molnar
595247f61f * Support module unload for efivarfs - Mathias Krause
* Another attempt at moving x86 to libstub taking advantage of the
    __pure attribute - Ard Biesheuvel
 
  * Add EFI runtime services section to ptdump - Mathias Krause
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUZnQGAAoJEC84WcCNIz1VfyUP/1MCVt4vepl7+0JzdUP/eVPs
 CwPM6gBOvgx1PviWrtvSU8UyjtYDqZx7jnCvyvbmlgixAqqIoFm80x5sd9DfyJBj
 vSrmavXaJgQomJN3N+fvaIpGJXp8NQmeNT87++UMb6VE5nYvx7suDcwfTqOaxcYt
 yDwKatTXTvQxDLlGgtymp2UhgVKBICs9WVo8weevB5LPmpt4TFCi1GDSimJkfg+0
 JTvkKF+QxPmVqgwY7bgdFFcfhsYCux5VgtbD4DJKS3LgfLJLAMKPOt83DbeOSmIa
 8zqtlF3eMwHgecKrMLSfIZH3XIl2gsIdPdvT6iBQkwwGZuSzG93JgwJ90HYiKoDm
 yNlffnhmdgI2RXO97UZJpzqGor+eNc0auuS4485PcE8NtZ1tbo20A/OCpfIzK8j8
 Vk7sfZxaaHKF5PUtRe6vo3myRlUCofMIuSEWSF8d709R6AEEuia6RZ3Y45EJPROn
 fKOiLsf7Og1Mk43Iy2lb7kFT766OsUnZZHU/xiIZj/v94HPWFWoFPtxPC0IURvPx
 24oiJxCnyXWGtoyn+SSprl+NAPuPsxVFYriTwaq2RBuoY0NAdy7NIXKe2HTp/WI0
 oSTtYkCRcwiHv0aSrg+yQHmwH7y7m39S3yIS4t5LXenn2G4ObUUjhcAQdF2Ft0Pr
 MT/l+stTt390cpfpcQbE
 =he2O
 -----END PGP SIGNATURE-----

Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/efi

Pull EFI updates for v3.19 from Matt Fleming:

 - Support module unload for efivarfs - Mathias Krause

 - Another attempt at moving x86 to libstub taking advantage of the
   __pure attribute - Ard Biesheuvel

 - Add EFI runtime services section to ptdump - Mathias Krause

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:48:53 +01:00
Linus Torvalds
1afcb6ed0d NFS client bugfixes for Linux 3.18
Highlights include:
 
 - Stable patches to fix NFSv4.x delegation reclaim error paths
 - Fix a bug whereby we were advertising NFSv4.1 but using NFSv4.2 features
 - Fix a use-after-free problem with pNFS block layouts
 - Fix a memory leak in the pNFS files O_DIRECT code
 - Replace an intrusive and Oops-prone performance fix in the NFSv4 atomic
   open code with a safer one-line version and revert the two original patches.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUZol9AAoJEGcL54qWCgDyNQQQALnngvpPR51BoO/iTz9ruXol
 fGZy0SRIlTUKm1ArsQsQ+HGbV5K0hgP3Tg+z2AtEEZ8u/2Fi2Bqdl6+eNY12tKHd
 uUctDdM5TXLrETAn1UULrnd2eX1cvPMBfOlXlAdNHHsGEgC7w7YQ+rzGwnls+HDy
 LYXzY7Y3jYGdTMaRgZc5YRdtd8JBpCxciRvPEQLDIobwP0JnZC1afTLe1XInqB2I
 TZ4NTHT+DEWA+Ou1P2deL7+RuJNEAeWWBvULJy76n4BqKvN/HNedOO5HyBYXrwSd
 3UX3wbx9CWRxN1F0EqNKxjxZ/597JwqBeNoTDRcofLsqumUfAOtlbym1EahcD3Ls
 pykopNfgUhGuhxolStmuHdS6CnyQPERpR5lFZcDp7XtcwSq4FcwD8DRzLJMZW5dg
 N1lkfFlwQN3rqdk/NEHL+IxS41Hlk4HXjMoP6MNbRtqzIN6tW9tvC4MtAWd1aYxO
 YuUW281pbWxXQ731s0kTIrMUdQ9vGSRBMcbnO9rL3o+xkh8y5SPVkx9lhdhJN0UD
 VbQ5Ws/xZ54bD1PfyYb+Yx659lI8MSFOsDuMuLmDtfYnVicHwCA3H63StvQ3ihf/
 q0gu8Iex9YbNNjf7IfYGuWPmPn3gwPBoURPC0bcZvMPdY6DXodU6Oj4BRTQ5VCie
 9N0pt2wp2eRjaSzD7r5A
 =8YN6
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

   - stable patches to fix NFSv4.x delegation reclaim error paths
   - fix a bug whereby we were advertising NFSv4.1 but using NFSv4.2
     features
   - fix a use-after-free problem with pNFS block layouts
   - fix a memory leak in the pNFS files O_DIRECT code
   - replace an intrusive and Oops-prone performance fix in the NFSv4
     atomic open code with a safer one-line version and revert the two
     original patches"

* tag 'nfs-for-3.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  sunrpc: fix sleeping under rcu_read_lock in gss_stringify_acceptor
  NFS: Don't try to reclaim delegation open state if recovery failed
  NFSv4: Ensure that we call FREE_STATEID when NFSv4.x stateids are revoked
  NFSv4: Fix races between nfs_remove_bad_delegation() and delegation return
  NFSv4.1: nfs41_clear_delegation_stateid shouldn't trust NFS_DELEGATED_STATE
  NFSv4: Ensure that we remove NFSv4.0 delegations when state has expired
  NFS: SEEK is an NFS v4.2 feature
  nfs: Fix use of uninitialized variable in nfs_getattr()
  nfs: Remove bogus assignment
  nfs: remove spurious WARN_ON_ONCE in write path
  pnfs/blocklayout: serialize GETDEVICEINFO calls
  nfs: fix pnfs direct write memory leak
  Revert "NFS: nfs4_do_open should add negative results to the dcache."
  Revert "NFS: remove BUG possibility in nfs4_open_and_get_state"
  NFSv4: Ensure nfs_atomic_open set the dentry verifier on ENOENT
2014-11-15 14:15:16 -08:00
Andrew Price
98f1a696a1 GFS2: Update timestamps on fallocate
gfs2_fallocate() wasn't updating ctime and mtime when modifying the
inode. Add a call to file_update_time() to do that.
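
A minimal sketch of the shape of the fix, assuming the call sits near the end of
gfs2_fallocate() once the allocation work has succeeded (placement and the
surrounding names are illustrative, not copied from the actual patch):

    /* refresh ctime/mtime and mark the inode dirty after a successful fallocate */
    if (error == 0)
        error = file_update_time(file);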

Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-11-14 14:16:33 +00:00
Andrew Price
1885867b84 GFS2: Update i_size properly on fallocate
This addresses an issue caught by fsx where the inode size was not being
updated to the expected value after fallocate(2) with mode 0.

The problem was caused by the offset and len parameters being converted
to multiples of the file system's block size, so i_size would be rounded
up to the nearest block-size multiple instead of being set to the requested size.

This replaces the per-chunk i_size updates with a single i_size_write on
successful completion of the operation.  With this patch gfs2 gets
through a complete run of fsx.

For clarity, the check for (error == 0) following the loop is removed as
all failures before that point jump to out_* labels or return.
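
A hedged sketch of the single post-loop size update (variable names such as
offset and len are assumed for illustration; the real gfs2 code may differ):

    /* after the allocation loop has completed without error */
    if (!(mode & FALLOC_FL_KEEP_SIZE) && (offset + len) > inode->i_size)
        i_size_write(inode, offset + len);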

Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-11-14 14:15:04 +00:00
Andrew Price
9c9f1159a5 GFS2: Use inode_newsize_ok and get_write_access in fallocate
gfs2_fallocate wasn't checking inode_newsize_ok or calling get_write_access.
Split out the context setup and inode locking pieces into a separate
function to make the code clearer, and add these missing calls.

inode_newsize_ok is only called when FALLOC_FL_KEEP_SIZE is not set, as there
is no need to enforce a file size limit if the size isn't going to change.
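
A hedged sketch of the added checks (error labels and local variable names are
illustrative only):

    int ret;

    ret = get_write_access(inode);
    if (ret)
        return ret;

    if (!(mode & FALLOC_FL_KEEP_SIZE)) {
        /* only enforce the size limit when the size can actually grow */
        ret = inode_newsize_ok(inode, offset + len);
        if (ret)
            goto out_putw;
    }

    /* ... do the allocation ... */

out_putw:
    put_write_access(inode);
    return ret;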

Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-11-14 14:14:30 +00:00
Linus Torvalds
971ad4e4d6 Merge branch 'akpm' (fixes from Andrew Morton)
Merge misc fixes from Andrew Morton:
 "15 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  MAINTAINERS: add IIO include files
  kernel/panic.c: update comments for print_tainted
  mem-hotplug: reset node present pages when hot-adding a new pgdat
  mem-hotplug: reset node managed pages when hot-adding a new pgdat
  mm/debug-pagealloc: correct freepage accounting and order resetting
  fanotify: fix notification of groups with inode & mount marks
  mm, compaction: prevent infinite loop in compact_zone
  mm: alloc_contig_range: demote pages busy message from warn to info
  mm/slab: fix unalignment problem on Malta with EVA due to slab merge
  mm/page_alloc: restrict max order of merging on isolated pageblock
  mm/page_alloc: move freepage counting logic to __free_one_page()
  mm/page_alloc: add freepage on isolate pageblock to correct buddy list
  mm/page_alloc: fix incorrect isolation behavior by rechecking migratetype
  mm/compaction: skip the range until proper target pageblock is met
  zram: avoid kunmap_atomic() of a NULL pointer
2014-11-13 16:57:25 -08:00
Jan Kara
8edc6e1688 fanotify: fix notification of groups with inode & mount marks
fsnotify() needs to merge the inode and mount mark lists when notifying
groups about events so that ignore masks from inode marks are reflected
in mount mark notifications and groups are notified in proper order
(according to priorities).

Currently the sorting of the lists done by fsnotify_add_inode_mark() /
fsnotify_add_vfsmount_mark() and fsnotify() differed, which resulted in
ignore masks not being used in some cases.

Fix the problem by always using the same comparison function when
sorting / merging the mark lists.
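
A sketch of the kind of shared comparator this implies, ordering marks by group
priority and then by group pointer (illustrative; the actual fs/notify helper
may differ in detail):

    static int fsnotify_compare_groups(struct fsnotify_group *a,
                                       struct fsnotify_group *b)
    {
        if (a == b)
            return 0;
        if (!a)
            return 1;
        if (!b)
            return -1;
        if (a->priority < b->priority)
            return 1;
        if (a->priority > b->priority)
            return -1;
        if (a < b)
            return 1;
        return -1;
    }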

Thanks to Heinrich Schuchardt for improvements of my patch.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=87721
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Tested-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-13 16:17:06 -08:00
Yan, Zheng
3231300bb9 ceph: fix flush tid comparison
TID of cap flush ack is 64 bits, but ceph_inode_info::flushing_cap_tid
is only 16 bits. 16 bits should be plenty to let the cap flush updates
pipeline appropriately, but we need to cast in the proper direction when
comparing these differently-sized values. So downcast the 64-bit one
to 16 bits.
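
A hedged sketch of the comparison after the change (the helper name is invented
for illustration):

    /* ack TIDs arrive as 64-bit values; the cached value is only 16 bits wide,
     * so compare at 16 bits by downcasting the wider value */
    static inline bool cap_flush_tid_matches(u64 ack_tid, u16 flushing_cap_tid)
    {
        return (u16)ack_tid == flushing_cap_tid;
    }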

Reflects ceph.git commit a5184cf46a6e867287e24aeb731634828467cd98.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
2014-11-13 22:19:05 +03:00
Trond Myklebust
f8ebf7a8ca NFS: Don't try to reclaim delegation open state if recovery failed
If state recovery failed, then we should not attempt to reclaim delegated
state.

http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-12 17:19:04 -05:00
Trond Myklebust
c606bb8857 NFSv4: Ensure that we call FREE_STATEID when NFSv4.x stateids are revoked
NFSv4.x (x>0) requires us to call TEST_STATEID+FREE_STATEID if a stateid is
revoked. We will currently fail to do this if the stateid is a delegation.

http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-12 17:19:04 -05:00
Trond Myklebust
869f9dfa4d NFSv4: Fix races between nfs_remove_bad_delegation() and delegation return
Any attempt to call nfs_remove_bad_delegation() while a delegation is being
returned is currently a no-op. This means that we can end up looping
forever in nfs_end_delegation_return() if something causes the delegation
to be revoked.
This patch adds a mechanism whereby the state recovery code can communicate
to the delegation return code that the delegation is no longer valid and
that it should not be used when reclaiming state.
It also changes the return value for nfs4_handle_delegation_recall_error()
to ensure that nfs_end_delegation_return() does not reattempt the lock
reclaim before state recovery is done.

http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-12 17:19:04 -05:00
Trond Myklebust
0c116cadd9 NFSv4.1: nfs41_clear_delegation_stateid shouldn't trust NFS_DELEGATED_STATE
This patch removes the assumption made previously, that we only need to
check the delegation stateid when it matches the stateid on a cached
open.

If we believe that we hold a delegation for this file, then we must assume
that its stateid may have been revoked or expired too. If we don't test it
then our state recovery process may end up caching open/lock state in a
situation where it should not.
We therefore rename the function nfs41_clear_delegation_stateid as
nfs41_check_delegation_stateid, and change it to always run through the
delegation stateid test and recovery process as outlined in RFC5661.

http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-12 17:01:33 -05:00
Trond Myklebust
4dfd4f7af0 NFSv4: Ensure that we remove NFSv4.0 delegations when state has expired
NFSv4.0 does not have TEST_STATEID/FREE_STATEID functionality, so
unlike NFSv4.1, the recovery procedure when stateids have expired or
have been revoked requires us to just forget the delegation.

http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-12 17:00:09 -05:00
Anna Schumaker
e983120e92 NFS: SEEK is an NFS v4.2 feature
Somehow the nfs_v4_1_minor_ops had the NFS_CAP_SEEK flag set, enabling
SEEK over v4.1.  This is wrong, and can make servers crash.
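
A hedged sketch of where the capability flag should (and should not) live; the
structure and field names here are assumptions about fs/nfs/nfs4proc.c rather
than the exact patch:

    static const struct nfs4_minor_version_ops nfs_v4_1_minor_ops = {
        .minor_version = 1,
        .init_caps = NFS_CAP_ATOMIC_OPEN_V1,    /* no NFS_CAP_SEEK here */
        /* ... */
    };

    static const struct nfs4_minor_version_ops nfs_v4_2_minor_ops = {
        .minor_version = 2,
        .init_caps = NFS_CAP_ATOMIC_OPEN_V1 | NFS_CAP_SEEK,
        /* ... */
    };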

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-12 14:22:54 -05:00
Jan Kara
16caf5b610 nfs: Fix use of uninitialized variable in nfs_getattr()
Variable 'err' may be uninitialized at the point where nfs_getattr() uses it
to check whether it should call generic_fillattr() or not. That can result
in spurious error returns. Initialize 'err' properly.
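
A minimal sketch of the fix (the surrounding control flow is paraphrased, not
copied from fs/nfs/inode.c):

    int err = 0;    /* previously left uninitialized on some paths */

    if (need_atime)
        err = __nfs_revalidate_inode(server, inode);
    if (!err)
        generic_fillattr(inode, stat);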

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-12 14:22:53 -05:00