Commit graph

39702 commits

Author SHA1 Message Date
Anna Schumaker
7b320382d0 NFS: Rename idmap.c to nfs4idmap.c
I added the nfs4 prefix to make it obvious that this file is built into
the NFS v4 module, and not the generic client.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23 15:16:14 -04:00
Anna Schumaker
40c64c26a4 NFS: Move nfs_idmap.h into fs/nfs/
This file is only used internally to the NFS v4 module, so it doesn't
need to be in the global include path.  I also renamed it from
nfs_idmap.h to nfs4idmap.h to emphasize that it's an NFSv4-only include
file.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23 15:16:14 -04:00
Anna Schumaker
f9ebd61855 NFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.h
The idmapper is completely internal to the NFS v4 module, so this macro
will always evaluate to true.  This patch also removes unnecessary
includes of this file from the generic NFS client.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23 15:16:13 -04:00
Anna Schumaker
7c61f0d389 NFS: Add a stub for GETDEVICELIST
d4b18c3e (pnfs: remove GETDEVICELIST implementation) removed the
GETDEVICELIST operation from the NFS client, but left a "hole" in the
nfs4_procedures array.  This caused /proc/self/mountstats to report an
operation named "51" where GETDEVICELIST used to be.  This patch adds a
stub to fix mountstats.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Fixes: d4b18c3e (pnfs: remove GETDEVICELIST implementation)
Cc: stable@vger.kernel.org # 3.17+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23 15:16:13 -04:00
Peng Tao
05f54903d9 nfs: remove WARN_ON_ONCE from nfs_direct_good_bytes
For flexfiles driver, we might choose to read from mirror index other
than 0 while mirror_count is always 1 for read.

Reported-by: Jean Spector <jean@primarydata.com>
Cc: <stable@vger.kernel.org> # v3.19+
Cc: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23 15:05:19 -04:00
Peng Tao
1ccbad9f9f nfs: fix DIO good bytes calculation
For direct read that has IO size larger than rsize, we'll split
it into several READ requests and nfs_direct_good_bytes() would
count completed bytes incorrectly by eating last zero count reply.

Fix it by handling mirror and non-mirror cases differently such that
we only count mirrored writes differently.

This fixes 5fadeb47("nfs: count DIO good bytes correctly with mirroring").

Reported-by: Jean Spector <jean@primarydata.com>
Cc: <stable@vger.kernel.org> # v3.19+
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23 15:05:18 -04:00
Anna Schumaker
ea96d1ecbe nfs: Fetch MOUNTED_ON_FILEID when updating an inode
2ef47eb1 (NFS: Fix use of nfs_attr_use_mounted_on_fileid()) was a good
start to fixing a circular directory structure warning for NFS v4
"junctioned" mountpoints.  Unfortunately, further testing continued to
generate this error.

My server is configured like this:

anna@nfsd ~ % df
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       9.1G  2.0G  6.5G  24% /
/dev/vdc1      1014M   33M  982M   4% /exports
/dev/vdc2      1014M   33M  982M   4% /exports/vol1
/dev/vdc3      1014M   33M  982M   4% /exports/vol1/vol2

anna@nfsd ~ % cat /etc/exports
/exports/          *(rw,async,no_subtree_check,no_root_squash)
/exports/vol1/     *(rw,async,no_subtree_check,no_root_squash)
/exports/vol1/vol2 *(rw,async,no_subtree_check,no_root_squash)

I've been running chown across the entire mountpoint twice in a row to
hit this problem.  The first run succeeds, but the second one fails with
the circular directory warning along with:

anna@client ~ % dmesg
[Apr 3 14:28] NFS: server 192.168.100.204 error: fileid changed
              fsid 0:39: expected fileid 0x100080, got 0x80

WHere 0x80 is the mountpoint's fileid and 0x100080 is the mounted-on
fileid.

This patch fixes the issue by requesting an updated mounted-on fileid
from the server during nfs_update_inode(), and then checking that the
fileid stored in the nfs_inode matches either the fileid or mounted-on
fileid returned by the server.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23 14:43:54 -04:00
Jeff Layton
5d05e54af3 nfs: fix high load average due to callback thread sleeping
Chuck pointed out a problem that crept in with commit 6ffa30d3f7 (nfs:
don't call blocking operations while !TASK_RUNNING). Linux counts tasks
in uninterruptible sleep against the load average, so this caused the
system's load average to be pinned at at least 1 when there was a
NFSv4.1+ mount active.

Not a huge problem, but it's probably worth fixing before we get too
many complaints about it. This patch converts the code back to use
TASK_INTERRUPTIBLE sleep, simply has it flush any signals on each loop
iteration. In practice no one should really be signalling this thread at
all, so I think this is reasonably safe.

With this change, there's also no need to game the hung task watchdog so
we can also convert the schedule_timeout call back to a normal schedule.

Cc: <stable@vger.kernel.org>
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: commit 6ffa30d3f7 (“nfs: don't call blocking . . .”)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23 14:38:07 -04:00
Anna Schumaker
f830f7ddd9 NFS: Reduce time spent holding the i_mutex during fallocate()
At the very least, we should not be taking the i_mutex until after
checking if the server even supports ALLOCATE or DEALLOCATE, allowing
v4.0 or v4.1 to exit without potentially waiting on a lock.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23 14:36:28 -04:00
Anna Schumaker
9a51940bf6 NFS: Don't zap caches on fallocate()
This patch adds a GETATTR to the end of ALLOCATE and DEALLOCATE
operations so we can set the updated inode size and change attribute
directly.  DEALLOCATE will still need to release pagecache pages, so
nfs42_proc_deallocate() now calls truncate_pagecache_range() before
contacting the server.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23 14:36:28 -04:00
Trond Myklebust
8c18d76bcb NFS: Block new writes while syncing data in nfs_getattr()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-27 12:39:39 -04:00
Trond Myklebust
5bb89b4702 NFSv4.1/pnfs: Separate out metadata and data consistency for pNFS
The LAYOUTCOMMIT operation means different things to different layout types.
For blocks and objects, it is both a data and metadata consistency operation.
For files and flexfiles, it is only a metadata consistency operation.

This patch separates out the 2 cases, allowing the files/flexfiles layout
drivers to optimise away the data consistency calls to layoutcommit.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-27 12:39:38 -04:00
Trond Myklebust
7140171ea9 NFSv4.1/pnfs: Ensure we send layoutcommit before return-on-close
We must not send a close or delegreturn that would result in a
return-on-close of the layout without ensuring that we've also
sent the necessary layoutcommit.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-27 12:39:38 -04:00
Trond Myklebust
a0815d556d NFSv4.1/pnfs: Ensure that writes respect the O_SYNC flag when doing O_DIRECT
If the caller does not specify the O_SYNC flag, then it is legitimate
to return from O_DIRECT without doing a pNFS layoutcommit operation.
However if the file is opened O_DIRECT|O_SYNC then we'd better get it
right.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-27 12:39:37 -04:00
Trond Myklebust
9e1681c2e7 NFSv4: Truncating file opens should also sync O_DIRECT writes
We don't just want to sync out buffered writes, but also O_DIRECT ones.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-27 12:39:37 -04:00
Trond Myklebust
d9dabc1a01 NFS: File unlock needs to be a metadata synchronisation point
File unlock needs to update both data and metadata on the NFS server
in order to act as a synchronisation point for other clients.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-27 12:39:37 -04:00
Trond Myklebust
4d346bea8f NFS: Add a helper to sync both O_DIRECT and buffered writes
Then apply it to nfs_setattr() and nfs_getattr().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-27 12:39:36 -04:00
Trond Myklebust
67af7611ec NFSv4.1/pnfs: Refactor pnfs_set_layoutcommit()
pnfs_set_layoutcommit() and pnfs_commit_set_layoutcommit() are 100% identical
except for the function arguments. Refactor to eliminate the difference.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-27 12:39:36 -04:00
Trond Myklebust
29559b11ae NFSv4.1/pnfs: Fix setting of layoutcommit last write byte
If the NFS_INO_LAYOUTCOMMIT flag was unset, then we _must_ ensure that
we also reset the last write byte (lwb) for that layout. The current
code depends on us clearing the lwb when we clear NFS_INO_LAYOUTCOMMIT,
which is not the case when we call pnfs_clear_layoutcommit().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-27 12:39:35 -04:00
Trond Myklebust
415320fc14 NFSv4: Return the delegation before returning the layout in evict_inode()
Minor optimisation for the case where the layout has return-on-close
enabled.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-27 12:39:35 -04:00
Trond Myklebust
81b79afb50 NFSv4: Allow tracing of NFSv4 fsync calls
I appear to have missed this when adding the ftrace probes.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-27 12:39:34 -04:00
Trond Myklebust
fc87701b91 NFS: Fix free_deveiceid -> free_deviceid
Make it easier to grep for these functions by name.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-27 12:32:24 -04:00
Trond Myklebust
df52699e4f NFSv4.1: Don't cache deviceids that have no notifications
The spec says that once all layouts that reference a given deviceid
have been returned, then we are only allowed to continue to cache
the deviceid if the metadata server supports notifications.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-27 12:32:24 -04:00
Trond Myklebust
4e59080397 NFSv4.1: Allow getdeviceinfo to return notification info back to caller
We are only allowed to cache deviceinfo if the server supports notifications
and actually promises to call us back when changes occur. Right now, we
request those notifications, but then we don't check the server's reply.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-27 12:32:24 -04:00
Trond Myklebust
fb1458f457 NFSv4.1: Cleanup - don't opencode nfs4_put_deviceid_node()
There really is no reason to do so.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-27 12:32:24 -04:00
Trond Myklebust
84a80f62f7 NFSv4.1: Convert pNFS deviceid to use kfree_rcu()
Use of synchronize_rcu() when unmounting and potentially freeing a lot
of deviceids is problematic. There really is no reason why we can't just
use kfree_rcu() here.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-27 12:32:24 -04:00
Peng Tao
2854475f6c nfs: clean up nfs_direct_IO
This follows up "nfs: fix dio deadlock when O_DIRECT flag is flipped"
and removes the unnecessary CONFIG_NFS_SWAP switch.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-13 09:05:25 -04:00
Trond Myklebust
38942ba204 NFSv4: Append delegations to the per-client list instead of prepending
Do so on the assumption that for most use cases, that list will turn into
a more or less LRU-ordered list, and so the list traversals in
nfs_client_return_marked_delegations() are likely to be shorter before
hitting a candidate to return.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-12 12:13:56 -04:00
Linus Torvalds
84399bb075 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Outside of misc fixes, Filipe has a few fsync corners and we're
  pulling in one more of Josef's fixes from production use here"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref.
  Btrfs: fix data loss in the fast fsync path
  Btrfs: remove extra run_delayed_refs in update_cowonly_root
  Btrfs: incremental send, don't rename a directory too soon
  btrfs: fix lost return value due to variable shadowing
  Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattr
  Btrfs: fix off-by-one logic error in btrfs_realloc_node
  Btrfs: add missing inode update when punching hole
  Btrfs: abort the transaction if we fail to update the free space cache inode
  Btrfs: fix fsync race leading to ordered extent memory leaks
2015-03-06 13:52:54 -08:00
Linus Torvalds
7c5bde7ade File locking related fix for v4.0 (#3)
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJU+EEgAAoJEAAOaEEZVoIVNtgP/33VZs6/5rthhMhhFt5Gonwn
 2wKjrA8Tuskxg2Ij7RLSCnZNeoscF1Te7GNRQdalhO4mHoDzQx3zSHuV/STNlILg
 6wNSd0pVj4FjXQN8Kv/slJsKiUu2U33lQwXNiVbgPI4CAdRAVJqYZ/7EIgf7VciQ
 FojqR3/RTSW1CHuYcNL5CTkn9Pdm8cwqacXMcKBWK6t1ZiEBqaP0GD0UhpFsUeRF
 RPv0ba3a+iuf/ToxhKV1fGWjJUplBUR2FMCnEiUFCnOGNPYeknWKp8wGeRp8bYA0
 6Umpk/NLg8mVSzbdW3wxBQ25PRBNVBAmQutFD2aEjIMPXuKPt4IHe/os2SssbfbJ
 wyCZ4LEbogcbmOlBAePtlUloy2GM3F7lQN+6GUgiLAT/I0+1VCjESsBftZP7QKRT
 N6szoAAsIDNmXcaApYPZZk3JBoyiwHurbrhI23V74p91esVNFqWJaVTpAbn8TkD7
 u/hGL7Loi7sXst2g9XISXvcRkcGUKKXf727Ih9wYQx8N6McP1sDgSC9PNYHz2KLo
 Ha4pQo+1+t0tXGmTsGpHBouAsSntURjn5/vv8OHbi0G3hS5v06G/j0vkU3lRtMqv
 koomIyShwUORLJxo6oJ3yKXQdAnz0Q78LkfDdSyq/M17G1FvOkQhw42a/LJ/A1w1
 yzJ5XmwbkM5GcG3LVTLs
 =ND3k
 -----END PGP SIGNATURE-----

Merge tag 'locks-v4.0-3' of git://git.samba.org/jlayton/linux

Pull file locking fix from Jeff Layton:
 "Just a single patch to fix a memory leak that Daniel Wagner discovered
  while doing some testing with leases"

* tag 'locks-v4.0-3' of git://git.samba.org/jlayton/linux:
  locks: fix fasync_struct memory leak in lease upgrade/downgrade handling
2015-03-06 10:31:38 -08:00
Linus Torvalds
1b1bd56191 NFS client bugfixes for Linux 4.0
Highlights include:
 
 - Fix a regression in the NFSv4 open state recovery code
 - Fix a regression in the NFSv4 close code
 - Fix regressions and side-effects of the loop-back mounted NFS fixes
   in 3.18, that cause the NFS read() syscall to return EBUSY.
 - Fix regressions around the readdirplus code and how it interacts with
   the VFS lazy unmount changes that went into v3.18.
 - Fix issues with out-of-order RPC call replies replacing updated
   attributes with stale ones (particularly after a truncate()).
 - Fix an underflow checking issue with RPC/RDMA credits
 - Fix a number of issues with the NFSv4 delegation return/free code.
 - Fix issues around stale NFSv4.1 leases when doing a mount
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJU+SJ6AAoJEGcL54qWCgDyVgQQAKNsF/O2O9ip2uAHZRZvM6TS
 Ev8c3Spj/FmRI1tlCcGi1zZ8uCSwvPQz3uAN0vOTMocUWjokT5sAgN5yBIxHasem
 6YK7jxs9WHiP7MGyReaAFJwG/W6LZndnlNqPWPs9KiaWwKVXIsQ3uFm/Y0lr90Fi
 ew16DQm0DUd4Yvv42WJR9ay7UwUPvT7wmaGIVVK2hjQqr2lx02jspt5kfrC+vsMU
 OYU/0YDofb2ajs5krbah6tUHf3VDnSVrXP6if66IrukCM9S4AvowpnMJQ5QJALh+
 cPlqHDm2ZzuIecpqZEgYLM73wQ2q+KBXTlDLcgYg6LjnqBEivwO9RDn6tfCwKTcS
 tCFohQc9iDOj9rTZ9EQlQME6u/FdxWncpxyTsMyBk7FlcLsOQRqio/FXZhbyJGuH
 gvIdIW3fPseijtejpYkxgabe6JL9NRvjv3SnOay7xHs9Vn4tRkFF2mkkQDZOG6HT
 atkxQp8kB3m9gMoeAmTLdTcJkcFdk6AKnNrcyJaa1GW4msmMuGZtq/6vayKJsBZb
 OIw788bDSOVpcVR+6SAC24/dutcl+WJHlSJShqvIrTKejBxPCc7IVSeg63FJsWTO
 sxfXdUr3wVJ1ooDFCspeBj+3zAimIq7qDmRRs85ekgEtxrgdUldhg/VwbDjvLjmb
 whXFpiCS7Ii0fVYtZvjz
 =qMB7
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.0-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

   - Fix a regression in the NFSv4 open state recovery code
   - Fix a regression in the NFSv4 close code
   - Fix regressions and side-effects of the loop-back mounted NFS fixes
     in 3.18, that cause the NFS read() syscall to return EBUSY.
   - Fix regressions around the readdirplus code and how it interacts
     with the VFS lazy unmount changes that went into v3.18.
   - Fix issues with out-of-order RPC call replies replacing updated
     attributes with stale ones (particularly after a truncate()).
   - Fix an underflow checking issue with RPC/RDMA credits
   - Fix a number of issues with the NFSv4 delegation return/free code.
   - Fix issues around stale NFSv4.1 leases when doing a mount"

* tag 'nfs-for-4.0-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (24 commits)
  NFSv4.1: Clear the old state by our client id before establishing a new lease
  NFSv4: Fix a race in NFSv4.1 server trunking discovery
  NFS: Don't write enable new pages while an invalidation is proceeding
  NFS: Fix a regression in the read() syscall
  NFSv4: Ensure we skip delegations that are already being returned
  NFSv4: Pin the superblock while we're returning the delegation
  NFSv4: Ensure we honour NFS_DELEGATION_RETURNING in nfs_inode_set_delegation()
  NFSv4: Ensure that we don't reap a delegation that is being returned
  NFS: Fix stateid used for NFS v4 closes
  NFSv4: Don't call put_rpccred() under the rcu_read_lock()
  NFS: Don't require a filehandle to refresh the inode in nfs_prime_dcache()
  NFSv3: Use the readdir fileid as the mounted-on-fileid
  NFS: Don't invalidate a submounted dentry in nfs_prime_dcache()
  NFSv4: Set a barrier in the update_changeattr() helper
  NFS: Fix nfs_post_op_update_inode() to set an attribute barrier
  NFS: Remove size hack in nfs_inode_attrs_need_update()
  NFSv4: Add attribute update barriers to delegreturn and pNFS layoutcommit
  NFS: Add attribute update barriers to NFS writebacks
  NFS: Set an attribute barrier on all updates
  NFS: Add attribute update barriers to nfs_setattr_update_inode()
  ...
2015-03-06 10:09:57 -08:00
Quentin Casasnovas
dd9ef135e3 Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref.
Improper arithmetics when calculting the address of the extended ref could
lead to an out of bounds memory read and kernel panic.

Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
cc: stable@vger.kernel.org # v3.7+
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-05 17:28:33 -08:00
Filipe Manana
3a8b36f378 Btrfs: fix data loss in the fast fsync path
When using the fast file fsync code path we can miss the fact that new
writes happened since the last file fsync and therefore return without
waiting for the IO to finish and write the new extents to the fsync log.

Here's an example scenario where the fsync will miss the fact that new
file data exists that wasn't yet durably persisted:

1. fs_info->last_trans_committed == N - 1 and current transaction is
   transaction N (fs_info->generation == N);

2. do a buffered write;

3. fsync our inode, this clears our inode's full sync flag, starts
   an ordered extent and waits for it to complete - when it completes
   at btrfs_finish_ordered_io(), the inode's last_trans is set to the
   value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
   btrfs_set_inode_last_trans);

4. transaction N is committed, so fs_info->last_trans_committed is now
   set to the value N and fs_info->generation remains with the value N;

5. do another buffered write, when this happens btrfs_file_write_iter
   sets our inode's last_trans to the value N + 1 (that is
   fs_info->generation + 1 == N + 1);

6. transaction N + 1 is started and fs_info->generation now has the
   value N + 1;

7. transaction N + 1 is committed, so fs_info->last_trans_committed
   is set to the value N + 1;

8. fsync our inode - because it doesn't have the full sync flag set,
   we only start the ordered extent, we don't wait for it to complete
   (only in a later phase) therefore its last_trans field has the
   value N + 1 set previously by btrfs_file_write_iter(), and so we
   have:

       inode->last_trans <= fs_info->last_trans_committed
           (N + 1)              (N + 1)

   Which made us not log the last buffered write and exit the fsync
   handler immediately, returning success (0) to user space and resulting
   in data loss after a crash.

This can actually be triggered deterministically and the following excerpt
from a testcase I made for xfstests triggers the issue. It moves a dummy
file across directories and then fsyncs the old parent directory - this
is just to trigger a transaction commit, so moving files around isn't
directly related to the issue but it was chosen because running 'sync' for
example does more than just committing the current transaction, as it
flushes/waits for all file data to be persisted. The issue can also happen
at random periods, since the transaction kthread periodicaly commits the
current transaction (about every 30 seconds by default).
The body of the test is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our main test file 'foo', the one we check for data loss.
  # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
  # bit from its flags (btrfs inode specific flags).
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
                  -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io

  # Now create one other file and 2 directories. We will move this second file
  # from one directory to the other later because it forces btrfs to commit its
  # currently open transaction if we fsync the old parent directory. This is
  # necessary to trigger the data loss bug that affected btrfs.
  mkdir $SCRATCH_MNT/testdir_1
  touch $SCRATCH_MNT/testdir_1/bar
  mkdir $SCRATCH_MNT/testdir_2

  # Make sure everything is durably persisted.
  sync

  # Write more 8Kb of data to our file.
  $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io

  # Move our 'bar' file into a new directory.
  mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar

  # Fsync our first directory. Because it had a file moved into some other
  # directory, this made btrfs commit the currently open transaction. This is
  # a condition necessary to trigger the data loss bug.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1

  # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
  # data we wrote previously to be persisted and available if a crash happens.
  # This did not happen with btrfs, because of the transaction commit that
  # happened when we fsynced the parent directory.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Now check that all data we wrote before are available.
  echo "File content after log replay:"
  od -t x1 $SCRATCH_MNT/foo

  status=0
  exit

The expected golden output for the test, which is what we get with this
fix applied (or when running against ext3/4 and xfs), is:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0040000

Without this fix applied, the output shows the test file does not have
the second 8Kb extent that we successfully fsynced:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000

So fix this by skipping the fsync only if we're doing a full sync and
if the inode's last_trans is <= fs_info->last_trans_committed, or if
the inode is already in the log. Also remove setting the inode's
last_trans in btrfs_file_write_iter since it's useless/unreliable.

Also because btrfs_file_write_iter no longer sets inode->last_trans to
fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
bail out if last_trans is 0, otherwise something as simple as the following
example wouldn't log the second write on the last fsync:

  1. write to file

  2. fsync file

  3. fsync file
       |--> btrfs_inode_in_log() returns true and it set last_trans to 0

  4. write to file
       |--> btrfs_file_write_iter() no longers sets last_trans, so it
            remained with a value of 0
  5. fsync
       |--> inode->last_trans == 0, so it bails out without logging the
            second write

A test case for xfstests will be sent soon.

CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-05 17:28:32 -08:00
Josef Bacik
f5c0a12280 Btrfs: remove extra run_delayed_refs in update_cowonly_root
This got added with my dirty_bgs patch, it's not needed.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-05 17:28:30 -08:00
Jeff Layton
0164bf0239 locks: fix fasync_struct memory leak in lease upgrade/downgrade handling
Commit 8634b51f6c (locks: convert lease handling to file_lock_context)
introduced a regression in the handling of lease upgrade/downgrades.

In the event that we already have a lease on a file and are going to
either upgrade or downgrade it, we skip doing any list insertion or
deletion and simply re-call lm_setup on the existing lease.

As of commit 8634b51f6c however, we end up calling lm_setup on the
lease that was passed in, instead of on the existing lease. This causes
us to leak the fasync_struct that was allocated in the event that there
was not already an existing one (as it always appeared that there
wasn't one).

Fixes: 8634b51f6c (locks: convert lease handling to file_lock_context)
Reported-and-Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-03-04 17:34:32 -05:00
Linus Torvalds
8a001af4bb Fixes for proper ioctl handling and an untriggerable buffer overflow
- The eCryptfs ioctl handling functions should only pass known-good ioctl
   commands to the lower filesystem
 - A static checker found a potential buffer overflow. Upon inspection, it is
   not triggerable due to input validation performed on the mount parameters.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJU94JWAAoJENaSAD2qAscK5z0P/3rrZv7w4rWnPeLfzddRCSpt
 QMCPGGXE2fo3nP9w6e7HirAQKg474CFdmNXbNzAKR08irQNRM9lMMEkUp8B9kwXN
 8Ms+lHPVuTZuBPXkqtpG/p47kAJdc1d9QePa0iU5PPp5GcdI/knPR/Md42NxUd4y
 kqMdgK7brO/y16i5aCC40CG2x+OPc5I8Xz/9MT9fm9+NTRzOxFhbLxis5LKUrQgj
 SnzkXn4Z1jsLt3y8OCFhrP9n9sHrKmcHpxBExa7GziADbIcw+nv/ugQy8Dvi8sHK
 zzO1G21uUlZXfMoWdv+DbwgU6subaT/D6NjcyVEhZ0ziYQDdMMOVvqDUFzwIV9W7
 WiIV2fYLxYlZHSv318BlYTT6XiDVveINUcLI2107cw4pBwMYUt9QY3QoKLQ7027o
 HhdX4Ys6chzwggWU0CRRI7W45/LTF0KeJ3g06tAsYVV6GFGoJAnZe7olHA6X7nGE
 s8rXzpT5zZqdcz7Y5ln5NtrzWBER91iBUDivafw6y7b0rqj6o+fEUk4vXNdsyuzZ
 y5Qn5dus1ImPLwtCAsfqZMjUUnNtOaBPd52k0sHIkRoY6W7GlXZNOjXA7ki+X5tX
 Jq4tm0n3fkInEhjJDQ/sxtPa+TcZGgcoyc3qNsX80Lob7QpaTvBczKFgBadNnMnK
 TKJEvJQuLIlAq3Lw5DGL
 =4fdp
 -----END PGP SIGNATURE-----

Merge tag 'ecryptfs-4.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs

Pull eCryptfs fixes from Tyler Hicks:
 "Fixes for proper ioctl handling and an untriggerable buffer overflow

   - The eCryptfs ioctl handling functions should only pass known-good
     ioctl commands to the lower filesystem

   - A static checker found a potential buffer overflow.  Upon
     inspection, it is not triggerable due to input validation performed
     on the mount parameters"

* tag 'ecryptfs-4.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  eCryptfs: don't pass fs-specific ioctl commands through
  eCryptfs: ensure copy to crypt_stat->cipher does not overrun
2015-03-04 14:19:48 -08:00
Trond Myklebust
e11259f920 NFSv4.1: Clear the old state by our client id before establishing a new lease
If the call to exchange-id returns with the EXCHGID4_FLAG_CONFIRMED_R flag
set, then that means our lease was established by a previous mount instance.
Ensure that we detect this situation, and that we clear the state held by
that mount.

Reported-by: Jorge Mora <Jorge.Mora@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-03 21:52:30 -05:00
Trond Myklebust
48d66b9749 NFSv4: Fix a race in NFSv4.1 server trunking discovery
We do not want to allow a race with another NFS mount to cause
nfs41_walk_client_list() to establish a lease on our nfs_client before
we're done checking for trunking.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-03 20:42:23 -05:00
Linus Torvalds
a6c5170d1e Merge branch 'for-4.0' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
 "Three miscellaneous bugfixes, most importantly the clp->cl_revoked
  bug, which we've seen several reports of people hitting"

* 'for-4.0' of git://linux-nfs.org/~bfields/linux:
  sunrpc: integer underflow in rsc_parse()
  nfsd: fix clp->cl_revoked list deletion causing softlock in nfsd
  svcrpc: fix memory leak in gssp_accept_sec_context_upcall
2015-03-03 15:52:50 -08:00
Trond Myklebust
ef070dcb39 NFS: Don't write enable new pages while an invalidation is proceeding
nfs_vm_page_mkwrite() should wait until the page cache invalidation
is finished. This is the second patch in a 2 patch series to deprecate
the NFS client's reliance on nfs_release_page() in the context of
nfs_invalidate_mapping().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-03 13:58:08 -05:00
Trond Myklebust
874f946376 NFS: Fix a regression in the read() syscall
When invalidating the page cache for a regular file, we want to first
sync all dirty data to disk and then call invalidate_inode_pages2().
The latter relies on nfs_launder_page() and nfs_release_page() to deal
respectively with dirty pages, and unstable written pages.

When commit 9590544694 ("NFS: avoid deadlocks with loop-back mounted
NFS filesystems.") changed the behaviour of nfs_release_page(), then it
made it possible for invalidate_inode_pages2() to fail with an EBUSY.
Unfortunately, that error is then propagated back to read().

Let's therefore work around the problem for now by protecting the call
to sync the data and invalidate_inode_pages2() so that they are atomic
w.r.t. the addition of new writes.
Later on, we can revisit whether or not we still need nfs_launder_page()
and nfs_release_page().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-03 13:02:29 -05:00
Tyler Hicks
6d65261a09 eCryptfs: don't pass fs-specific ioctl commands through
eCryptfs can't be aware of what to expect when after passing an
arbitrary ioctl command through to the lower filesystem. The ioctl
command may trigger an action in the lower filesystem that is
incompatible with eCryptfs.

One specific example is when one attempts to use the Btrfs clone
ioctl command when the source file is in the Btrfs filesystem that
eCryptfs is mounted on top of and the destination fd is from a new file
created in the eCryptfs mount. The ioctl syscall incorrectly returns
success because the command is passed down to Btrfs which thinks that it
was able to do the clone operation. However, the result is an empty
eCryptfs file.

This patch allows the trim, {g,s}etflags, and {g,s}etversion ioctl
commands through and then copies up the inode metadata from the lower
inode to the eCryptfs inode to catch any changes made to the lower
inode's metadata. Those five ioctl commands are mostly common across all
filesystems but the whitelist may need to be further pruned in the
future.

https://bugzilla.kernel.org/show_bug.cgi?id=93691
https://launchpad.net/bugs/1305335

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Rocko <rockorequin@hotmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: stable@vger.kernel.org # v2.6.36+: c43f7b8 eCryptfs: Handle ioctl calls with unlocked and compat functions
2015-03-03 02:03:56 -06:00
Trond Myklebust
ec3ca4e57e NFSv4: Ensure we skip delegations that are already being returned
In nfs_client_return_marked_delegations() and nfs_delegation_reap_unclaimed()
we want to optimise the loop traversal by skipping delegations that are
already in the process of being returned.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-02 18:09:15 -05:00
Trond Myklebust
9f0f8e12c4 NFSv4: Pin the superblock while we're returning the delegation
This patch ensures that the superblock doesn't go ahead and disappear
underneath us while the state manager thread is returning delegations.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-02 18:09:14 -05:00
Trond Myklebust
ade04647dd NFSv4: Ensure we honour NFS_DELEGATION_RETURNING in nfs_inode_set_delegation()
Ensure that nfs_inode_set_delegation() doesn't inadvertently detach a
delegation that is already in the process of being returned.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-02 18:09:14 -05:00
Trond Myklebust
b04b22f4ca NFSv4: Ensure that we don't reap a delegation that is being returned
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-02 18:09:13 -05:00
Anna Schumaker
369d6b7f00 NFS: Fix stateid used for NFS v4 closes
After 566fcec60 the client uses the "current stateid" from the
nfs4_state structure to close a file.  This could potentially contain a
delegation stateid, which is disallowed by the protocol and causes
servers to return NFS4ERR_BAD_STATEID.  This patch restores the
(correct) behavior of sending the open stateid to close a file.

Reported-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: 566fcec60 (NFSv4: Fix an atomicity problem in CLOSE)
Signed-off-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-02 18:06:42 -05:00
Filipe Manana
84471e2429 Btrfs: incremental send, don't rename a directory too soon
There's one more case where we can't issue a rename operation for a
directory as soon as we process it. We used to delay directory renames
only if they have some ancestor directory with a higher inode number
that got renamed too, but there's another case where we need to delay
the rename too - when a directory A is renamed to the old name of a
directory B but that directory B has its rename delayed because it
has now (in the send root) an ancestor with a higher inode number that
was renamed. If we don't delay the directory rename in this case, the
receiving end of the send stream will attempt to rename A to the old
name of B before B got renamed to its new name, which results in a
"directory not empty" error. So fix this by delaying directory renames
for this case too.

Steps to reproduce:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/a
  $ mkdir /mnt/b
  $ mkdir /mnt/c
  $ touch /mnt/a/file

  $ btrfs subvolume snapshot -r /mnt /mnt/snap1

  $ mv /mnt/c /mnt/x
  $ mv /mnt/a /mnt/x/y
  $ mv /mnt/b /mnt/a

  $ btrfs subvolume snapshot -r /mnt /mnt/snap2

  $ btrfs send /mnt/snap1 -f /tmp/1.send
  $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.send

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt2
  $ btrfs receive /mnt2 -f /tmp/1.send
  $ btrfs receive /mnt2 -f /tmp/2.send
  ERROR: rename b -> a failed. Directory not empty

A test case for xfstests follows soon.

Reported-by: Ames Cornish <ames@cornishes.net>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:45 -08:00
David Sterba
1932b7be97 btrfs: fix lost return value due to variable shadowing
A block-local variable stores error code but btrfs_get_blocks_direct may
not return it in the end as there's a ret defined in the function scope.

CC: <stable@vger.kernel.org>	# 3.6+
Fixes: d187663ef2 ("Btrfs: lock extents as we map them in DIO")
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:45 -08:00
Filipe Manana
5cdf83edb8 Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattr
The return value from btrfs_lookup_xattr() can be a pointer encoding an
error, therefore deal with it. This fixes commit 5f5bc6b1e2
("Btrfs: make xattr replace operations atomic").

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:45 -08:00