Commit graph

48443 commits

Author SHA1 Message Date
Darrick J. Wong
0c9ec4beec ext4: support GETFSMAP ioctls
Support the GETFSMAP ioctls so that we can use the xfs free space
management tools to probe ext4 as well.  Note that this is a partial
implementation -- we only report fixed-location metadata and free space;
everything else is reported as "unknown".

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 00:36:53 -04:00
Eric Biggers
7b4cc9787f ext4: evict inline data when writing to memory map
Currently the case of writing via mmap to a file with inline data is not
handled.  This is maybe a rare case since it requires a writable memory
map of a very small file, but it is trivial to trigger with on
inline_data filesystem, and it causes the
'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in
ext4_writepages() to be hit:

    mkfs.ext4 -O inline_data /dev/vdb
    mount /dev/vdb /mnt
    xfs_io -f /mnt/file \
	-c 'pwrite 0 1' \
	-c 'mmap -w 0 1m' \
	-c 'mwrite 0 1' \
	-c 'fsync'

	kernel BUG at fs/ext4/inode.c:2723!
	invalid opcode: 0000 [#1] SMP
	CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
	task: ffff88003d3a8040 task.stack: ffffc90000300000
	RIP: 0010:ext4_writepages+0xc89/0xf8a
	RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283
	RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc
	RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246
	RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5
	R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698
	R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff
	FS:  00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0
	Call Trace:
	 ? _raw_spin_unlock+0x27/0x2a
	 ? kvm_clock_read+0x1e/0x20
	 do_writepages+0x23/0x2c
	 ? do_writepages+0x23/0x2c
	 __filemap_fdatawrite_range+0x80/0x87
	 filemap_write_and_wait_range+0x67/0x8c
	 ext4_sync_file+0x20e/0x472
	 vfs_fsync_range+0x8e/0x9f
	 ? syscall_trace_enter+0x25b/0x2d0
	 vfs_fsync+0x1c/0x1e
	 do_fsync+0x31/0x4a
	 SyS_fsync+0x10/0x14
	 do_syscall_64+0x69/0x131
	 entry_SYSCALL64_slow_path+0x25/0x25

We could try to be smart and keep the inline data in this case, or at
least support delayed allocation when allocating the block, but these
solutions would be more complicated and don't seem worthwhile given how
rare this case seems to be.  So just fix the bug by calling
ext4_convert_inline_data() when we're asked to make a page writable, so
that any inline data gets evicted, with the block allocated immediately.

Reported-by: Nick Alcock <nick.alcock@oracle.com>
Cc: stable@vger.kernel.org
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 00:10:50 -04:00
Eric Biggers
6ba644b9fd ext4: remove ext4_xattr_check_entry()
ext4_xattr_check_entry() was redundant with validation of the full xattr
entries list in ext4_xattr_check_entries(), which all callers also did.
ext4_xattr_check_entry() also didn't actually do correct validation;
specifically, it never checked that the value doesn't overlap the xattr
names, nor did it account for padding when checking whether the xattr
value overflows the available space.  So remove it to eliminate any
potential confusion.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 00:01:02 -04:00
Eric Biggers
2c4f992337 ext4: rename ext4_xattr_check_names() to ext4_xattr_check_entries()
ext4_xattr_check_names() actually validates both the xattr names and
values, not just the names.  So rename it to ext4_xattr_check_entries()
to avoid confusion.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 23:56:52 -04:00
Eric Biggers
ba7ea1d8f4 ext4: merge ext4_xattr_list() into ext4_listxattr()
There's no difference between ext4_xattr_list() and ext4_listxattr(), so
merge them together and just have ext4_listxattr().  Some years ago they
took different arguments, but that's no longer the case.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 23:53:17 -04:00
Eric Biggers
d600618673 ext4: constify static data that is never modified
Constify static data in ext4 that is never (intentionally) modified so
that it is placed in .rodata and benefits from memory protection.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 23:47:50 -04:00
Eric Biggers
1bc0af600b ext4: trim return value and 'dir' argument from ext4_insert_dentry()
In the initial implementation of ext4 encryption, the filename was
encrypted in ext4_insert_dentry(), which could fail and also required
access to the 'dir' inode.  Since then ext4 filename encryption has been
changed to encrypt the filename earlier, so we can revert the additions
to ext4_insert_dentry().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 23:27:26 -04:00
Jan Kara
5052b069ac jbd2: fix dbench4 performance regression for 'nobarrier' mounts
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_FUA implementation. Since
JBD2 strips REQ_FUA and REQ_FLUSH flags from submitted IO when the
filesystem is mounted with nobarrier mount option, journal superblock
writes ended up being async writes after this patch and that caused
heavy performance regression for dbench4 benchmark with high number of
processes. In my test setup with HP RAID array with non-volatile write
cache and 32 GB ram, dbench4 runs with 8 processes regressed by ~25%.

Fix the problem by making sure journal superblock writes are always
treated as synchronous since they generally block progress of the
journalling machinery and thus the whole filesystem.

Fixes: b685d3d65a
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 21:07:30 -04:00
Jan Kara
c52c47e4b4 jbd2: Fix lockdep splat with generic/270 test
I've hit a lockdep splat with generic/270 test complaining that:

3216.fsstress.b/3533 is trying to acquire lock:
 (jbd2_handle){++++..}, at: [<ffffffff813152e0>] jbd2_log_wait_commit+0x0/0x150

but task is already holding lock:
 (jbd2_handle){++++..}, at: [<ffffffff8130bd3b>] start_this_handle+0x35b/0x850

The underlying problem is that jbd2_journal_force_commit_nested()
(called from ext4_should_retry_alloc()) may get called while a
transaction handle is started. In such case it takes care to not wait
for commit of the running transaction (which would deadlock) but only
for a commit of a transaction that is already committing (which is safe
as that doesn't wait for any filesystem locks).

In fact there are also other callers of jbd2_log_wait_commit() that take
care to pass tid of a transaction that is already committing and for
those cases, the lockdep instrumentation is too restrictive and leading
to false positive reports. Fix the problem by calling
jbd2_might_wait_for_commit() from jbd2_log_wait_commit() only if the
transaction isn't already committing.

Fixes: 1eaa566d36
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 20:12:16 -04:00
Linus Torvalds
84ced7fd06 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
 "This is a set of CIFS/SMB3 fixes for stable.

  There is another set of four SMB3 reconnect fixes for stable in
  progress but they are still being reviewed/tested, so didn't want to
  wait any longer to send these five below"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  Reset TreeId to zero on SMB2 TREE_CONNECT
  CIFS: Fix build failure with smb2
  Introduce cifs_copy_file_range()
  SMB3: Rename clone_range to copychunk_range
  Handle mismatched open calls
2017-04-09 09:10:02 -07:00
Linus Torvalds
5b50be743f Driver core fixes for 4.11-rc6
Here are 3 small fixes for 4.11-rc6.  One resolves a reported issue with
 sysfs files that NeilBrown found, one is a documenatation fix for the
 stable kernel rules, and the last is a small MAINTAINERS file update for
 kernfs.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWOnrMw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yk/JQCfQKjOpGDAR9Hs6u4YQ4hJrAHFneYAn1F4MLDW
 3b0ZMnlZHkDq834UwKnB
 =iiei
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
 "Here are 3 small fixes for 4.11-rc6.

  One resolves a reported issue with sysfs files that NeilBrown found,
  one is a documenatation fix for the stable kernel rules, and the last
  is a small MAINTAINERS file update for kernfs"

* tag 'driver-core-4.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  MAINTAINERS: separate out kernfs maintainership
  sysfs: be careful of error returns from ops->show()
  Documentation: stable-kernel-rules: fix stable-tag format
2017-04-09 09:03:51 -07:00
Linus Torvalds
2a610b8aa8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS fixes from Al Viro:
 "statx followup fixes and a fix for stack-smashing on alpha"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  alpha: fix stack smashing in old_adjtimex(2)
  statx: Include a mask for stx_attributes in struct statx
  statx: Reserve the top bit of the mask for future struct expansion
  xfs: report crtime and attribute flags to statx
  ext4: Add statx support
  statx: optimize copy of struct statx to userspace
  statx: remove incorrect part of vfs_statx() comment
  statx: reject unknown flags when using NULL path
  Documentation/filesystems: fix documentation for ->getattr()
2017-04-09 08:26:21 -07:00
NeilBrown
c8a139d001 sysfs: be careful of error returns from ops->show()
ops->show() can return a negative error code.
Commit 65da3484d9 ("sysfs: correctly handle short reads on PREALLOC attrs.")
(in v4.4) caused this to be stored in an unsigned 'size_t' variable, so errors
would look like large numbers.
As a result, if an error is returned, sysfs_kf_read() will return the
value of 'count', typically 4096.

Commit 17d0774f80 ("sysfs: correctly handle read offset on PREALLOC attrs")
(in v4.8) extended this error to use the unsigned large 'len' as a size for
memmove().
Consequently, if ->show returns an error, then the first read() on the
sysfs file will return 4096 and could return uninitialized memory to
user-space.
If the application performs a subsequent read, this will trigger a memmove()
with extremely large count, and is likely to crash the machine is bizarre ways.

This bug can currently only be triggered by reading from an md
sysfs attribute declared with __ATTR_PREALLOC() during the
brief period between when mddev_put() deletes an mddev from
the ->all_mddevs list, and when mddev_delayed_delete() - which is
scheduled on a workqueue - completes.
Before this, an error won't be returned by the ->show()
After this, the ->show() won't be called.

I can reproduce it reliably only by putting delay like
	usleep_range(500000,700000);
early in mddev_delayed_delete(). Then after creating an
md device md0 run
  echo clear > /sys/block/md0/md/array_state; cat /sys/block/md0/md/array_state

The bug can be triggered without the usleep.

Fixes: 65da3484d9 ("sysfs: correctly handle short reads on PREALLOC attrs.")
Fixes: 17d0774f80 ("sysfs: correctly handle read offset on PREALLOC attrs")
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-08 17:33:32 +02:00
Linus Torvalds
56c2997965 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "10 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: move pcp and lru-pcp draining into single wq
  mailmap: update Yakir Yang email address
  mm, swap_cgroup: reschedule when neeed in swap_cgroup_swapoff()
  dax: fix radix tree insertion race
  mm, thp: fix setting of defer+madvise thp defrag mode
  ptrace: fix PTRACE_LISTEN race corrupting task->state
  vmlinux.lds: add missing VMLINUX_SYMBOL macros
  mm/page_alloc.c: fix print order in show_free_areas()
  userfaultfd: report actual registered features in fdinfo
  mm: fix page_vma_mapped_walk() for ksm pages
2017-04-08 01:35:32 -07:00
Ross Zwisler
e11f8b7b6c dax: fix radix tree insertion race
While running generic/340 in my test setup I hit the following race.  It
can happen with kernels that support FS DAX PMDs, so v4.10 thru
v4.11-rc5.

Thread 1				Thread 2
--------				--------
dax_iomap_pmd_fault()
  grab_mapping_entry()
    spin_lock_irq()
    get_unlocked_mapping_entry()
    'entry' is NULL, can't call lock_slot()
    spin_unlock_irq()
    radix_tree_preload()
					dax_iomap_pmd_fault()
					  grab_mapping_entry()
					    spin_lock_irq()
					    get_unlocked_mapping_entry()
					    ...
					    lock_slot()
					    spin_unlock_irq()
					  dax_pmd_insert_mapping()
					    <inserts a PMD mapping>
    spin_lock_irq()
    __radix_tree_insert() fails with -EEXIST
    <fall back to 4k fault, and die horribly
     when inserting a 4k entry where a PMD exists>

The issue is that we have to drop mapping->tree_lock while calling
radix_tree_preload(), but since we didn't have a radix tree entry to
lock (unlike in the pmd_downgrade case) we have no protection against
Thread 2 coming along and inserting a PMD at the same index.  For 4k
entries we handled this with a special-case response to -EEXIST coming
from the __radix_tree_insert(), but this doesn't save us for PMDs
because the -EEXIST case can also mean that we collided with a 4k entry
in the radix tree at a different index, but one that is covered by our
PMD range.

So, correctly handle both the 4k and 2M collision cases by explicitly
re-checking the radix tree for an entry at our index once we reacquire
mapping->tree_lock.

This patch has made it through a clean xfstests run with the current
v4.11-rc5 based linux/master, and it also ran generic/340 500 times in a
loop.  It used to fail within the first 10 iterations.

Link: http://lkml.kernel.org/r/20170406212944.2866-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: <stable@vger.kernel.org>    [4.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-08 00:47:49 -07:00
Mike Rapoport
045098e944 userfaultfd: report actual registered features in fdinfo
fdinfo for userfault file descriptor reports UFFD_API_FEATURES.  Up
until recently, the UFFD_API_FEATURES was defined as 0, therefore
corresponding field in fdinfo always contained zero.  Now, with
introduction of several additional features, UFFD_API_FEATURES is not
longer 0 and it seems better to report actual features requested for the
userfaultfd object described by the fdinfo.

First, the applications that were using userfault will still see zero at
the features field in fdinfo.  Next, reporting actual features rather
than available features, gives clear indication of what userfault
features are used by an application.

Link: http://lkml.kernel.org/r/1491140181-22121-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-08 00:47:48 -07:00
Martin Brandenburg
cefdc26e86 orangefs: move features validation to fix filesystem hang
Without this fix (and another to the userspace component itself
described later), the kernel will be unable to process any OrangeFS
requests after the userspace component is restarted (due to a crash or
at the administrator's behest).

The bug here is that inside orangefs_remount, the orangefs_request_mutex
is locked.  When the userspace component restarts while the filesystem
is mounted, it sends a ORANGEFS_DEV_REMOUNT_ALL ioctl to the device,
which causes the kernel to send it a few requests aimed at synchronizing
the state between the two.  While this is happening the
orangefs_request_mutex is locked to prevent any other requests going
through.

This is only half of the bugfix.  The other half is in the userspace
component which outright ignores(!) requests made before it considers
the filesystem remounted, which is after the ioctl returns.  Of course
the ioctl doesn't return until after the userspace component responds to
the request it ignores.  The userspace component has been changed to
allow ORANGEFS_VFS_OP_FEATURES regardless of the mount status.

Mike Marshall says:
 "I've tested this patch against the fixed userspace part. This patch is
  real important, I hope it can make it into 4.11...

  Here's what happens when the userspace daemon is restarted, without
  the patch:

    =============================================
    [ INFO: possible recursive locking detected ]
    [   4.10.0-00007-ge98bdb3 #1 Not tainted    ]
    ---------------------------------------------
    pvfs2-client-co/29032 is trying to acquire lock:
     (orangefs_request_mutex){+.+.+.}, at: service_operation+0x3c7/0x7b0 [orangefs]
                  but task is already holding lock:
     (orangefs_request_mutex){+.+.+.}, at: dispatch_ioctl_command+0x1bf/0x330 [orangefs]

    CPU: 0 PID: 29032 Comm: pvfs2-client-co Not tainted 4.10.0-00007-ge98bdb3 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
    Call Trace:
     __lock_acquire+0x7eb/0x1290
     lock_acquire+0xe8/0x1d0
     mutex_lock_killable_nested+0x6f/0x6e0
     service_operation+0x3c7/0x7b0 [orangefs]
     orangefs_remount+0xea/0x150 [orangefs]
     dispatch_ioctl_command+0x227/0x330 [orangefs]
     orangefs_devreq_ioctl+0x29/0x70 [orangefs]
     do_vfs_ioctl+0xa3/0x6e0
     SyS_ioctl+0x79/0x90"

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Acked-by: Mike Marshall <hubcap@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-07 13:41:22 -07:00
Liping Zhang
1680a3868f sysctl: add sanity check for proc_douintvec
Commit e7d316a02f ("sysctl: handle error writing UINT_MAX to u32
fields") introduced the proc_douintvec helper function, but it forgot to
add the related sanity check when doing register_sysctl_table.  So add
it now.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-07 09:46:44 -07:00
Jan-Marek Glogowski
806a28efe9 Reset TreeId to zero on SMB2 TREE_CONNECT
Currently the cifs module breaks the CIFS specs on reconnect as
described in http://msdn.microsoft.com/en-us/library/cc246529.aspx:

"TreeId (4 bytes): Uniquely identifies the tree connect for the
command. This MUST be 0 for the SMB2 TREE_CONNECT Request."

Signed-off-by: Jan-Marek Glogowski <glogow@fbihome.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Tested-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
2017-04-07 08:04:41 -05:00
Tobias Regnery
4fa8e504e5 CIFS: Fix build failure with smb2
I saw the following build error during a randconfig build:

fs/cifs/smb2ops.c: In function 'smb2_new_lease_key':
fs/cifs/smb2ops.c:1104:2: error: implicit declaration of function 'generate_random_uuid' [-Werror=implicit-function-declaration]

Explicit include the right header to fix this issue.

Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-04-07 08:04:41 -05:00
Sachin Prabhu
620d8745b3 Introduce cifs_copy_file_range()
The earlier changes to copy range for cifs unintentionally disabled the more
common form of server side copy.

The patch introduces the file_operations helper cifs_copy_file_range()
which is used by the syscall copy_file_range. The new file operations
helper allows us to perform server side copies for SMB2.0 and 2.1
servers as well as SMB 3.0+ servers which do not support the ioctl
FSCTL_DUPLICATE_EXTENTS_TO_FILE.

The new helper uses the ioctl FSCTL_SRV_COPYCHUNK_WRITE to perform
server side copies. The helper is called by vfs_copy_file_range() only
once an attempt to clone the file using the ioctl
FSCTL_DUPLICATE_EXTENTS_TO_FILE has failed.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable  <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-04-07 08:04:41 -05:00
Sachin Prabhu
312bbc5946 SMB3: Rename clone_range to copychunk_range
Server side copy is one of the most important mechanisms smb2/smb3
supports and it was unintentionally disabled for most use cases.

Renaming calls to reflect the underlying smb2 ioctl called. This is
similar to the name duplicate_extents used for a similar ioctl which is
also used to duplicate files by reusing fs blocks. The name change is to
avoid confusion.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-04-07 08:04:40 -05:00
Sachin Prabhu
38bd49064a Handle mismatched open calls
A signal can interrupt a SendReceive call which result in incoming
responses to the call being ignored. This is a problem for calls such as
open which results in the successful response being ignored. This
results in an open file resource on the server.

The patch looks into responses which were cancelled after being sent and
in case of successful open closes the open fids.

For this patch, the check is only done in SendReceive2()

RH-bz: 1403319

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Cc: Stable <stable@vger.kernel.org>
2017-04-07 08:04:40 -05:00
Linus Torvalds
269c930e66 Changes since last update:
- Rework the inline directory verifier to avoid crashes on disk corruption
 - Don't change file size when punching holes w/ KEEP_SIZE
 - Close a kernel memory exposure bug
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJY4qGYAAoJEPh/dxk0SrTr3mwQAIDqc1RlZThYETn5Mru9BeQ0
 NmiDbgO1394OSxSxpBVZCVW1jU23j3eXbOgO0oD8iEOySdTOwAoCxb78aYUkHsbS
 wKKvix2kprIfsAYfGW264MZjX2JYiBUTjhV8dYw9UB5Kot3TPvc+IgC7ICJtcb08
 y+/ycjbgMwlWxOQwkReaPhVYhKXa3/vqLBNN0E9oacSdpT10Mkb95RDxwFpe5zX2
 RjPYode8RHsoI9+OdvVwvbmPrvylxEt3jKZdnStRMjmAX/X2aUNVDYSCOSpNg+sz
 9kXec96kkoJ4oQVfiaEo0k4LWlTZnDSFe25CX8Gxu7yPgGt8XsnDADkieGBdVcYq
 2q3AfEpmQrYgnLFRu4D3A8KEX8IrP8MLAh/zY/heVp9fTx9uhDvnjF+f+F6mMYXF
 wzEf27RN+DeaXRLbHp35NOKPriT63JpRB/Hsgew6ibEiVPdy8J1+VYoJSohixr4d
 yryDiwtph25spMYxeH6HIiPCeWa6fMr08vwuasWJTEZirxYmF/ZEHTK/bU+ydlDp
 R9/UDTmiP0V10xO09bjLgFiREIO/b1fiOoniAMhhrT4tW2FL31hRpzl8jPZ++2X/
 W2xLMJ1iyVw/Cf1NDWGFIOFX6Y+lnN6dZHftNUno6MFlmdmajNaTbRTBiSGnPH4U
 URH7Dxpmkgp7xuNs8xqv
 =QdXx
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.11-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull XFS fixes from Darrick Wong:
 "Here are three more fixes for 4.11.

  The first one reworks the inline directory verifier to check the
  working copy of the directory metadata and to avoid triggering a
  periodic crash in xfs/348. The second patch fixes a regression in hole
  punching at EOF that corrupts files; and the third patch closes a
  kernel memory disclosure bug.

  Summary:

   - rework the inline directory verifier to avoid crashes on disk
     corruption

   - don't change file size when punching holes w/ KEEP_SIZE

   - close a kernel memory exposure bug"

* tag 'xfs-4.11-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix kernel memory exposure problems
  xfs: Honor FALLOC_FL_KEEP_SIZE when punching ends of files
  xfs: rework the inline directory verifiers
2017-04-06 14:42:05 -07:00
Darrick J. Wong
bf9216f922 xfs: fix kernel memory exposure problems
Fix a memory exposure problems in inumbers where we allocate an array of
structures with holes, fail to zero the holes, then blindly copy the
kernel memory contents (junk and all) into userspace.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-04-03 12:22:39 -07:00
Calvin Owens
3dd09d5a85 xfs: Honor FALLOC_FL_KEEP_SIZE when punching ends of files
When punching past EOF on XFS, fallocate(mode=PUNCH_HOLE|KEEP_SIZE) will
round the file size up to the nearest multiple of PAGE_SIZE:

  calvinow@vm-disks/generic-xfs-1 ~$ dd if=/dev/urandom of=test bs=2048 count=1
  calvinow@vm-disks/generic-xfs-1 ~$ stat test
    Size: 2048            Blocks: 8          IO Block: 4096   regular file
  calvinow@vm-disks/generic-xfs-1 ~$ fallocate -n -l 2048 -o 2048 -p test
  calvinow@vm-disks/generic-xfs-1 ~$ stat test
    Size: 4096            Blocks: 8          IO Block: 4096   regular file

Commit 3c2bdc912a ("xfs: kill xfs_zero_remaining_bytes") replaced
xfs_zero_remaining_bytes() with calls to iomap helpers. The new helpers
don't enforce that [pos,offset) lies strictly on [0,i_size) when being
called from xfs_free_file_space(), so by "leaking" these ranges into
xfs_zero_range() we get this buggy behavior.

Fix this by reintroducing the checks xfs_zero_remaining_bytes() did
against i_size at the bottom of xfs_free_file_space().

Reported-by: Aaron Gao <gzh@fb.com>
Fixes: 3c2bdc912a ("xfs: kill xfs_zero_remaining_bytes")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Brian Foster <bfoster@redhat.com>
Cc: <stable@vger.kernel.org> # 4.8+
Signed-off-by: Calvin Owens <calvinowens@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-03 12:22:29 -07:00
Darrick J. Wong
78420281a9 xfs: rework the inline directory verifiers
The inline directory verifiers should be called on the inode fork data,
which means after iformat_local on the read side, and prior to
ifork_flush on the write side.  This makes the fork verifier more
consistent with the way buffer verifiers work -- i.e. they will operate
on the memory buffer that the code will be reading and writing directly.

Furthermore, revise the verifier function to return -EFSCORRUPTED so
that we don't flood the logs with corruption messages and assert
notices.  This has been a particular problem with xfs/348, which
triggers the XFS_WANT_CORRUPTED_RETURN assertions, which halts the
kernel when CONFIG_XFS_DEBUG=y.  Disk corruption isn't supposed to do
that, at least not in a verifier.

Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-03 12:22:20 -07:00
David Howells
3209f68b3c statx: Include a mask for stx_attributes in struct statx
Include a mask in struct stat to indicate which bits of stx_attributes the
filesystem actually supports.

This would also be useful if we add another system call that allows you to
do a 'bulk attribute set' and pass in a statx struct with the masks
appropriately set to say what you want to set.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-03 01:06:00 -04:00
David Howells
47071aee6a statx: Reserve the top bit of the mask for future struct expansion
Reserve the top bit of the mask for future expansion of the statx struct
and give an error if statx() sees it set.  All the other bits are ignored
if we see them set but don't support the bit; we just clear the bit in the
returned mask.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-03 01:05:59 -04:00
Darrick J. Wong
5f955f26f3 xfs: report crtime and attribute flags to statx
statx has the ability to report inode creation times and inode flags, so
hook up di_crtime and di_flags to that functionality.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-03 01:05:59 -04:00
David Howells
99652ea56a ext4: Add statx support
Return enhanced file attributes from the Ext4 filesystem.  This includes
the following:

 (1) The inode creation time (i_crtime) as stx_btime, setting STATX_BTIME.

 (2) Certain FS_xxx_FL flags are mapped to stx_attribute flags.

This requires that all ext4 inodes have a getattr call, not just some of
them, so to this end, split the ext4_getattr() function and only call part
of it where appropriate.

Example output:

	[root@andromeda ~]# touch foo
	[root@andromeda ~]# chattr +ai foo
	[root@andromeda ~]# /tmp/test-statx foo
	statx(foo) = 0
	results=fff
	  Size: 0               Blocks: 0          IO Block: 4096    regular file
	Device: 08:12           Inode: 2101950     Links: 1
	Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
	Access: 2016-02-11 17:08:29.031795451+0000
	Modify: 2016-02-11 17:08:29.031795451+0000
	Change: 2016-02-11 17:11:11.987790114+0000
	 Birth: 2016-02-11 17:08:29.031795451+0000
	Attributes: 0000000000000030 (-------- -------- -------- -------- -------- -------- -------- --ai----)

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-03 01:05:58 -04:00
Eric Biggers
64bd72048a statx: optimize copy of struct statx to userspace
I found that statx() was significantly slower than stat().  As a
microbenchmark, I compared 10,000,000 invocations of fstat() on a tmpfs
file to the same with statx() passed a NULL path:

	$ time ./stat_benchmark

	real	0m1.464s
	user	0m0.275s
	sys	0m1.187s

	$ time ./statx_benchmark

	real	0m5.530s
	user	0m0.281s
	sys	0m5.247s

statx is expected to be a little slower than stat because struct statx
is larger than struct stat, but not by *that* much.  It turns out that
most of the overhead was in copying struct statx to userspace, mostly in
all the stac/clac instructions that got generated for each __put_user()
call.  (This was on x86_64, but some other architectures, e.g. arm64,
have something similar now too.)

stat() instead initializes its struct on the stack and copies it to
userspace with a single call to copy_to_user().  This turns out to be
much faster, and changing statx to do this makes it almost as fast as
stat:

	$ time ./statx_benchmark

	real	0m1.624s
	user	0m0.270s
	sys	0m1.354s

For zeroing the reserved fields, start by zeroing the full struct with
memset.  This makes it clear that every byte copied to userspace is
initialized, even implicit padding bytes (though there are none
currently).  In the scenarios I tested, it also performed the same as a
designated initializer.  Manually initializing each field was still
slightly faster, but would have been more error-prone and less
verifiable.

Also rename statx_set_result() to cp_statx() for consistency with
cp_old_stat() et al., and make it noinline so that struct statx doesn't
add to the stack usage during the main portion of the syscall execution.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-03 01:05:57 -04:00
Eric Biggers
b15fb70b82 statx: remove incorrect part of vfs_statx() comment
request_mask and query_flags are function arguments, not passed in
struct kstat.  So remove the part of the comment which claims otherwise.
This was apparently left over from an earlier version of the statx
patch.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-03 01:05:57 -04:00
Eric Biggers
8c7493aa3e statx: reject unknown flags when using NULL path
The statx() system call currently accepts unknown flags when called with
a NULL path to operate on a file descriptor.  Left unchanged, this could
make it hard to introduce new query flags in the future, since
applications may not be able to tell whether a given flag is supported.

Fix this by failing the system call with EINVAL if any flags other than
KSTAT_QUERY_FLAGS are specified in combination with a NULL path.

Arguably, we could still permit known lookup-related flags such as
AT_SYMLINK_NOFOLLOW.  However, that would be inconsistent with how
sys_utimensat() behaves when passed a NULL path, which seems to be the
closest precedent.  And given that the NULL path case is (I believe)
mainly intended to be used to implement a wrapper function like fstatx()
that doesn't have a path argument, I think rejecting lookup-related
flags too is probably the best choice.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-03 01:05:56 -04:00
Linus Torvalds
978e0f92cd Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "11 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  kasan: do not sanitize kexec purgatory
  drivers/rapidio/devices/tsi721.c: make module parameter variable name unique
  mm/hugetlb.c: don't call region_abort if region_chg fails
  kasan: report only the first error by default
  hugetlbfs: initialize shared policy as part of inode allocation
  mm: fix section name for .data..ro_after_init
  mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd()
  mm: workingset: fix premature shadow node shrinking with cgroups
  mm: rmap: fix huge file mmap accounting in the memcg stats
  mm: move mm_percpu_wq initialization earlier
  mm: migrate: fix remove_migration_pte() for ksm pages
2017-04-01 19:45:05 -07:00
Linus Torvalds
09c8b3d1d6 The restriction of NFSv4 to TCP went overboard and also broke the
backchannel; fix.  Also some minor refinements to the nfsd
 version-setting interface that we'd like to get fixed before release.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJY3weFAAoJECebzXlCjuG+fQYQAKh7QVB/vDZwvjWZ7OGSBhoe
 LJfS1mzzf52MfQk6qNEM9to86qu+ywmH4XD2rvgMdif6Kc0qF1xBzzVEnp527lVW
 nt1nLq/oIXeRl7a+5p+FxTnqHn0OBH+iCnxcVZI8tvHgptpR+6TwEzR36k4n4esN
 kvNrrv9dFIoBGx0nSBScHTC3zNArU+C9oZTTHAuPJICNBHsOMqYF7sSAXPg9NFR+
 HnowZfGWWXpSnWZ9DaM14Zb/dX0B9Pexv1MsgfiKYS9Beh7g4JN48+CHtWyVliwd
 M/LVBpT2ZwFM/NzvJ2exVIqm/hwj4El2Sjy67XHwQzDvGjnUn/fz55SfXfzZSMyD
 PMj+IeHKuT3jipNui1AXAlzYEz8gPvuKQ0vQ0vVuNX4Ln28KydGwvVTkUNltDnfq
 E7L7RI6mj03OY1j1p6zeK6UJeueZq1gmjfq1NVTPO+TgQFPhh8A50NsDEVcaiwNN
 W8uX7Qa39y79BT+4OFuYL05AbuqxKR+nAmCbVSLy9Kq4sc/6/YErBZtXXAghzPPl
 4Es4tzlAd8skkxWlcVeCeUqPfcDCqHL6xKqa62m5wUuYXmqPtIMyAyVutF4lWpH/
 dAV5Lcjz7HeQnpCgFZXXtIQW5OIIfFa08s0f7fuFm+uxz+nM/x8gg99tuwWP6OsT
 Za7EtdB84M1ZGgjO3JhY
 =oi0+
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.11-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "The restriction of NFSv4 to TCP went overboard and also broke the
  backchannel; fix.

  Also some minor refinements to the nfsd version-setting interface that
  we'd like to get fixed before release"

* tag 'nfsd-4.11-1' of git://linux-nfs.org/~bfields/linux:
  svcrdma: set XPT_CONG_CTRL flag for bc xprt
  NFSD: fix nfsd_reset_versions for NFSv4.
  NFSD: fix nfsd_minorversion(.., NFSD_AVAIL)
  NFSD: further refinement of content of /proc/fs/nfsd/versions
  nfsd: map the ENOKEY to nfserr_perm for avoiding warning
  SUNRPC/backchanel: set XPT_CONG_CTRL flag for bc xprt
2017-04-01 10:43:37 -07:00
Linus Torvalds
fe8e12b503 Merge branch 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We have three small fixes queued up in my for-linus-4.11 branch"

* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix an integer overflow check
  btrfs: Change qgroup_meta_rsv to 64bit
  Btrfs: bring back repair during read
2017-03-31 17:58:48 -07:00
Mike Kravetz
4742a35d9d hugetlbfs: initialize shared policy as part of inode allocation
Any time after inode allocation, destroy_inode can be called.  The
hugetlbfs inode contains a shared_policy structure, and
mpol_free_shared_policy is unconditionally called as part of
hugetlbfs_destroy_inode.  Initialize the policy as part of inode
allocation so that any quick (error path) calls to destroy_inode will be
handed an initialized policy.

syzkaller fuzzer found this bug, that resulted in the following:

    BUG: KASAN: user-memory-access in atomic_inc
    include/asm-generic/atomic-instrumented.h:87 [inline] at addr
    000000131730bd7a
    BUG: KASAN: user-memory-access in __lock_acquire+0x21a/0x3a80
    kernel/locking/lockdep.c:3239 at addr 000000131730bd7a
    Write of size 4 by task syz-executor6/14086
    CPU: 3 PID: 14086 Comm: syz-executor6 Not tainted 4.11.0-rc3+ #364
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
     atomic_inc include/asm-generic/atomic-instrumented.h:87 [inline]
     __lock_acquire+0x21a/0x3a80 kernel/locking/lockdep.c:3239
     lock_acquire+0x1ee/0x590 kernel/locking/lockdep.c:3762
     __raw_write_lock include/linux/rwlock_api_smp.h:210 [inline]
     _raw_write_lock+0x33/0x50 kernel/locking/spinlock.c:295
     mpol_free_shared_policy+0x43/0xb0 mm/mempolicy.c:2536
     hugetlbfs_destroy_inode+0xca/0x120 fs/hugetlbfs/inode.c:952
     alloc_inode+0x10d/0x180 fs/inode.c:216
     new_inode_pseudo+0x69/0x190 fs/inode.c:889
     new_inode+0x1c/0x40 fs/inode.c:918
     hugetlbfs_get_inode+0x40/0x420 fs/hugetlbfs/inode.c:734
     hugetlb_file_setup+0x329/0x9f0 fs/hugetlbfs/inode.c:1282
     newseg+0x422/0xd30 ipc/shm.c:575
     ipcget_new ipc/util.c:285 [inline]
     ipcget+0x21e/0x580 ipc/util.c:639
     SYSC_shmget ipc/shm.c:673 [inline]
     SyS_shmget+0x158/0x230 ipc/shm.c:657
     entry_SYSCALL_64_fastpath+0x1f/0xc2

Analysis provided by Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Link: http://lkml.kernel.org/r/1490477850-7944-1-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-31 17:13:30 -07:00
Linus Torvalds
f9799ad21b NFS client fixes for 4.11 (part 2)
Stable Bugfixes:
 - Fix infinite loop on BAD_STATEID error
 
 Other Bugfixes:
 - Fix old dentry rehash after move
 - Fix pnfs GETDEVINFO hangs
 - Fix pnfs fallback to MDS on commit errors
 - Fix flexfiles kernel oops
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAljeko0ACgkQ18tUv7Cl
 QOsZbw/+MXYEZCaILgaAjzoWO4qJhNuAzlblh2jSX2nitY6NAva2MORZoAnxqS/G
 2qVWdYLvfQ2rKkIazInktaqBgsnl5Got9EbrD2hdV5LvM953U3KJkeXZD67ncvV7
 YYmxaFipLfTfmLZDXlQ1h5wTKXXXw0VA2v2YL+sRZhAzVhcTyMyf1n89lT6H9Fqx
 UPitMuBbRskCPFMOZ6xP+T6MsOpeIGVHHOvYWSSydCT6IoujnTjNRaZ9+VFSm5iU
 rubL/qokg2VIJ60xbmv/toq61FkhI2xtJTtBFxHZo47En3RdwB53zOez/OmvTFLZ
 lKSCh/Xkk/DDhUQWrYDaydPyGE6mWR+E/18/BZh0tx0n+yvHPg0ax56CTSJwpvg7
 f6pydxQo0RAH3S/IWb0JB3jyi++EKznWK2OokllOdmT3DrYyKD3LiTY4T2MO8Wlu
 +miFq6yk1iQHZR4R4JDkCWzpD7JeeR6lkg19kRZlBJm2Pv+8Rzg4qjBgqAo5OPxV
 RiQ0CuyKoR2rtMz/4pepBai4fM42S/Nc59QJI/45mTUJ40whyoygJEov7s3SdTZi
 H6cL7Ewe8m++MNW2aAsM2M0CVzoyv0D4mnPxgE8t8lNbkcT14bdi1WFwAHo/ICMd
 KpyhsDVltakqUbelD3WWJtTXbAPCgwlOzuPdfbqtBMQJNP1aONs=
 =1Xgv
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.11-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "Here are a few more bugfixes that came in over the last couple of
  weeks. Most of these fix various hangs and loops that people found,
  but we also had a few error handling fixes.

  Stable Bugfixes:
   - fix infinite loop on BAD_STATEID error

  Other Bugfixes:
   - fix old dentry rehash after move
   - fix pnfs GETDEVINFO hangs
   - fix pnfs fallback to MDS on commit errors
   - fix flexfiles kernel oops"

* tag 'nfs-for-4.11-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  nfs: flexfiles: fix kernel OOPS if MDS returns unsupported DS type
  NFSv4.1 fix infinite loop on IO BAD_STATEID error
  PNFS fix fallback to MDS if got error on commit to DS
  NFS filelayout:call GETDEVICEINFO after pnfs_layout_process completes
  NFS store nfs4_deviceid in struct nfs4_filelayout_segment
  NFS cleanup struct nfs4_filelayout_segment
  NFS: Fix old dentry rehash after move
2017-03-31 12:29:03 -07:00
Tigran Mkrtchyan
f17f8a14e8 nfs: flexfiles: fix kernel OOPS if MDS returns unsupported DS type
this fix aims to fix dereferencing of a mirror in an error state when MDS
returns unsupported DS type (IOW, not v3), which causes the following oops:

[  220.370709] BUG: unable to handle kernel NULL pointer dereference at 0000000000000065
[  220.370842] IP: ff_layout_mirror_valid+0x2d/0x110 [nfs_layout_flexfiles]
[  220.370920] PGD 0

[  220.370972] Oops: 0000 [#1] SMP
[  220.371013] Modules linked in: nfnetlink_queue nfnetlink_log bluetooth nfs_layout_flexfiles rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_raw ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security iptable_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security ebtable_filter ebtables ip6table_filter ip6_tables binfmt_misc intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel btrfs kvm arc4 snd_hda_codec_hdmi iwldvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate mac80211 xor uvcvideo
[  220.371814]  videobuf2_vmalloc videobuf2_memops snd_hda_codec_idt mei_wdt videobuf2_v4l2 snd_hda_codec_generic iTCO_wdt ppdev videobuf2_core iTCO_vendor_support dell_rbtn dell_wmi iwlwifi sparse_keymap dell_laptop dell_smbios snd_hda_intel dcdbas videodev snd_hda_codec dell_smm_hwmon snd_hda_core media cfg80211 intel_uncore snd_hwdep raid6_pq snd_seq intel_rapl_perf snd_seq_device joydev i2c_i801 rfkill lpc_ich snd_pcm parport_pc mei_me parport snd_timer dell_smo8800 mei snd shpchp soundcore tpm_tis tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc i915 nouveau mxm_wmi ttm i2c_algo_bit drm_kms_helper crc32c_intel e1000e drm sdhci_pci firewire_ohci sdhci serio_raw mmc_core firewire_core ptp crc_itu_t pps_core wmi fjes video
[  220.372568] CPU: 7 PID: 4988 Comm: cat Not tainted 4.10.5-200.fc25.x86_64 #1
[  220.372647] Hardware name: Dell Inc. Latitude E6520/0J4TFW, BIOS A06 07/11/2011
[  220.372729] task: ffff94791f6ea580 task.stack: ffffb72b88c0c000
[  220.372802] RIP: 0010:ff_layout_mirror_valid+0x2d/0x110 [nfs_layout_flexfiles]
[  220.372883] RSP: 0018:ffffb72b88c0f970 EFLAGS: 00010246
[  220.372945] RAX: 0000000000000000 RBX: ffff9479015ca600 RCX: ffffffffffffffed
[  220.373025] RDX: ffffffffffffffed RSI: ffff9479753dc980 RDI: 0000000000000000
[  220.373104] RBP: ffffb72b88c0f988 R08: 000000000001c980 R09: ffffffffc0ea6112
[  220.373184] R10: ffffef17477d9640 R11: ffff9479753dd6c0 R12: ffff9479211c7440
[  220.373264] R13: ffff9478f45b7790 R14: 0000000000000001 R15: ffff9479015ca600
[  220.373345] FS:  00007f555fa3e700(0000) GS:ffff9479753c0000(0000) knlGS:0000000000000000
[  220.373435] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  220.373506] CR2: 0000000000000065 CR3: 0000000196044000 CR4: 00000000000406e0
[  220.373586] Call Trace:
[  220.373627]  nfs4_ff_layout_prepare_ds+0x5e/0x200 [nfs_layout_flexfiles]
[  220.373708]  ff_layout_pg_init_read+0x81/0x160 [nfs_layout_flexfiles]
[  220.373806]  __nfs_pageio_add_request+0x11f/0x4a0 [nfs]
[  220.373886]  ? nfs_create_request.part.14+0x37/0x330 [nfs]
[  220.373967]  nfs_pageio_add_request+0xb2/0x260 [nfs]
[  220.374042]  readpage_async_filler+0xaf/0x280 [nfs]
[  220.374103]  read_cache_pages+0xef/0x1b0
[  220.374166]  ? nfs_read_completion+0x210/0x210 [nfs]
[  220.374239]  nfs_readpages+0x129/0x200 [nfs]
[  220.374293]  __do_page_cache_readahead+0x1d0/0x2f0
[  220.374352]  ondemand_readahead+0x17d/0x2a0
[  220.374403]  page_cache_sync_readahead+0x2e/0x50
[  220.374460]  generic_file_read_iter+0x6c8/0x950
[  220.374532]  ? nfs_mapping_need_revalidate_inode+0x17/0x40 [nfs]
[  220.374617]  nfs_file_read+0x6e/0xc0 [nfs]
[  220.374670]  __vfs_read+0xe2/0x150
[  220.374715]  vfs_read+0x96/0x130
[  220.374758]  SyS_read+0x55/0xc0
[  220.374801]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[  220.374856] RIP: 0033:0x7f555f570bd0
[  220.374900] RSP: 002b:00007ffeb73e1b38 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  220.374986] RAX: ffffffffffffffda RBX: 00007f555f839ae0 RCX: 00007f555f570bd0
[  220.375066] RDX: 0000000000020000 RSI: 00007f555fa41000 RDI: 0000000000000003
[  220.375145] RBP: 0000000000021010 R08: ffffffffffffffff R09: 0000000000000000
[  220.375226] R10: 00007f555fa40010 R11: 0000000000000246 R12: 0000000000022000
[  220.375305] R13: 0000000000021010 R14: 0000000000001000 R15: 0000000000002710
[  220.375386] Code: 66 66 90 55 48 89 e5 41 54 53 49 89 fc 48 83 ec 08 48 85 f6 74 2e 48 8b 4e 30 48 89 f3 48 81 f9 00 f0 ff ff 77 1e 48 85 c9 74 15 <48> 83 79 78 00 b8 01 00 00 00 74 2c 48 83 c4 08 5b 41 5c 5d c3
[  220.375653] RIP: ff_layout_mirror_valid+0x2d/0x110 [nfs_layout_flexfiles] RSP: ffffb72b88c0f970
[  220.375748] CR2: 0000000000000065
[  220.403538] ---[ end trace bcdca752211b7da9 ]---

Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-03-31 13:30:49 -04:00
Olga Kornievskaia
0e3d3e5df0 NFSv4.1 fix infinite loop on IO BAD_STATEID error
Commit 63d63cbf5e "NFSv4.1: Don't recheck delegations that
have already been checked" introduced a regression where when a
client received BAD_STATEID error it would not send any TEST_STATEID
and instead go into an infinite loop of resending the IO that caused
the BAD_STATEID.

Fixes: 63d63cbf5e ("NFSv4.1: Don't recheck delegations that have already been checked")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Cc: stable@vger.kernel.org # 4.9+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-03-31 13:30:21 -04:00
Olga Kornievskaia
fabbbee0eb PNFS fix fallback to MDS if got error on commit to DS
Upong receiving some errors (EACCES) on commit to the DS the code
doesn't fallback to MDS and intead retrieds to the same DS again.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-03-30 13:27:20 -04:00
Dan Carpenter
457ae7268b Btrfs: fix an integer overflow check
This isn't super serious because you need CAP_ADMIN to run this code.

I added this integer overflow check last year but apparently I am
rubbish at writing integer overflow checks...  There are two issues.
First, access_ok() works on unsigned long type and not u64 so on 32 bit
systems the access_ok() could be checking a truncated size.  The other
issue is that we should be using a stricter limit so we don't overflow
the kzalloc() setting ctx->clone_roots later in the function after the
access_ok():

	alloc_size = sizeof(struct clone_root) * (arg->clone_sources_count + 1);
	sctx->clone_roots = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN);

Fixes: f5ecec3ce2 ("btrfs: send: silence an integer overflow warning")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ added comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-29 14:29:08 +02:00
Goldwyn Rodrigues
ce0dcee626 btrfs: Change qgroup_meta_rsv to 64bit
Using an int value is causing qg->reserved to become negative and
exclusive -EDQUOT to be reached prematurely.

This affects exclusive qgroups only.

TEST CASE:

DEVICE=/dev/vdb
MOUNTPOINT=/mnt
SUBVOL=$MOUNTPOINT/tmp

umount $SUBVOL
umount $MOUNTPOINT

mkfs.btrfs -f $DEVICE
mount /dev/vdb $MOUNTPOINT
btrfs quota enable $MOUNTPOINT
btrfs subvol create $SUBVOL
umount $MOUNTPOINT
mount /dev/vdb $MOUNTPOINT
mount -o subvol=tmp $DEVICE $SUBVOL
btrfs qgroup limit -e 3G $SUBVOL

btrfs quota rescan /mnt -w

for i in `seq 1 44000`; do
  dd if=/dev/zero of=/mnt/tmp/test_$i bs=10k count=1
  if [[ $? > 0 ]]; then
     btrfs qgroup show -pcref $SUBVOL
     exit 1
  fi
done

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
[ add reproducer to changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-29 14:29:08 +02:00
Liu Bo
9d0d1c8b1c Btrfs: bring back repair during read
Commit 20a7db8ab3 ("btrfs: add dummy callback for readpage_io_failed
and drop checks") made a cleanup around readpage_io_failed_hook, and
it was supposed to keep the original sematics, but it also
unexpectedly disabled repair during read for dup, raid1 and raid10.

This fixes the problem by letting data's inode call the generic
readpage_io_failed callback by returning -EAGAIN from its
readpage_io_failed_hook in order to notify end_bio_extent_readpage to
do the rest.  We don't call it directly because the generic one takes
an offset from end_bio_extent_readpage() to calculate the index in the
checksum array and inode's readpage_io_failed_hook doesn't offer that
offset.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ keep the const function attribute ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-29 14:29:07 +02:00
Andy Adamson
8d40b0f148 NFS filelayout:call GETDEVICEINFO after pnfs_layout_process completes
Fix a filelayout GETDEVICEINFO call hang triggered from the LAYOUTGET
pnfs_layout_process where the GETDEVICEINFO call is waiting for a
session slot, and the LAYOUGET call is waiting for pnfs_layout_process
to complete before freeing the slot GETDEVICEINFO is waiting for..

This occurs in testing against the pynfs pNFS server where the
the on-wire reply highest_slotid and slot id are zero, and the
target high slot id is 8 (negotiated in CREATE_SESSION).

The internal fore channel slot table max_slotid, the maximum allowed
table slotid value, has been reduced via nfs41_set_max_slotid_locked
 from 8 to 1.  Thus there is one slot (slotid 0) available for use but
it has not been freed by LAYOUTGET  proir to the GETDEVICEINFO request.

In order to ensure that layoutrecall callbacks are processed in the
correct order, nfs4_proc_layoutget processing needs to be finished
e.g. pnfs_layout_process) before giving up the slot that identifies
the layoutget (see referring_call_exists).

Move the filelayout_check_layout nfs4_find_get_device call outside of
the pnfs_layout_process call tree.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-03-28 11:47:42 -04:00
Andy Adamson
629dc8704b NFS store nfs4_deviceid in struct nfs4_filelayout_segment
In preparation for moving the filelayout getdeviceinfo call from
filelayout_alloc_lseg called by pnfs_process_layout

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-03-28 11:21:52 -04:00
Andy Adamson
551afbb85b NFS cleanup struct nfs4_filelayout_segment
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-03-27 16:51:03 -04:00
Benjamin Coddington
d4ea7e3c5c NFS: Fix old dentry rehash after move
Now that nfs_rename()'s d_move has moved within the RPC task's rpc_call_done
callback, rehashing new_dentry will actually rehash the old dentry's name
in nfs_rename().  d_move() is going to rehash the new dentry for us anyway,
so doing it again here is unnecessary.

Reported-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: 920b4530fb ("NFS: nfs_rename() handle -ERESTARTSYS dentry left behind")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-03-27 13:30:49 -04:00
Linus Torvalds
9e54ef9da5 driver core fix for 4.11-rc4
Here is a single kernfs fix for 4.11-rc4 that resolves a reported issue.
 
 It has been in linux-next with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWNedpw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykLkgCdEVdmtWb9Fd0igfh7bSWBHdD9W20An3vKOror
 nTP7sT8FwSWGKdOpIaik
 =0Eht
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fix from Greg KH:
 "Here is a single kernfs fix for 4.11-rc4 that resolves a reported
  issue.

  It has been in linux-next with no reported issues"

* tag 'driver-core-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  kernfs: Check KERNFS_HAS_RELEASE before calling kernfs_release_file()
2017-03-26 11:05:42 -07:00