When looking at whether or not our dcache is valid, we really don't care
about the general state of the directory attribute cache. Instead, we
we only care about the state of the change attribute.
This fixes a performance issue when the client is responsible for
changing the directory contents; a number of NFSv4 operations will
atomically update the directory change attribute, but may not return
all the other attributes.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
It now has only one field and is only used in one structure.
So replaced it in that structure by the field it contains.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
An open file description (struct file) in a given process can be
associated with two different lock owners.
It can have a Posix lock owner which will be different in each process
that has a fd on the file.
It can have a Flock owner which will be the same in all processes.
When searching for a lock stateid to use, we need to consider both of these
owners
So add a new "flock_owner" to the "nfs_open_context" (of which there
is one for each open file description).
This flock_owner does not need to be reference-counted as there is a
1-1 relation between 'struct file' and nfs open contexts,
and it will never be part of a list of contexts. So there is no need
for a 'flock_context' - just the owner is enough.
The io_count included in the (Posix) lock_context provides no
guarantee that all read-aheads that could use the state have
completed, so not supporting it for flock locks in not a serious
problem. Synchronization between flock and read-ahead can be added
later if needed.
When creating an open_context for a non-openning create call, we don't have
a 'struct file' to pass in, so the lock context gets initialized with
a NULL owner, but this will never be used.
The flock_owner is not used at all in this patch, that will come later.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
this field is not used in any important way and probably should
have been removed by
Commit: 8003d3c4aa ("nfs4: treat lock owners as opaque values")
which removed the pid argument from nfs4_get_lock_state.
Except in unusual and uninteresting cases, two threads with the same
->tgid will have the same ->files pointer, so keeping them both
for comparison brings no benefit.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Allow dio requests to be scheduled in parallel, but ensuring that they
do not conflict with buffered I/O.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull vfs cleanups from Al Viro:
"More cleanups from Christoph"
* 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
nfsd: use RWF_SYNC
fs: add RWF_DSYNC aand RWF_SYNC
ceph: use generic_write_sync
fs: simplify the generic_write_sync prototype
fs: add IOCB_SYNC and IOCB_DSYNC
direct-io: remove the offset argument to dio_complete
direct-io: eliminate the offset argument to ->direct_IO
xfs: eliminate the pos variable in xfs_file_dio_aio_write
filemap: remove the pos argument to generic_file_direct_write
filemap: remove pos variables in generic_file_read_iter
use d_alloc_parallel() for sillyunlink/lookup exclusion and
explicit rwsem (nfs_rmdir() being a writer and nfs_call_unlink() -
a reader) for rmdir/sillyunlink one.
That ought to make lookup/readdir/!O_CREAT atomic_open really
parallel on NFS.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Including blkdev_direct_IO and dax_do_io. It has to be ki_pos to actually
work, so eliminate the superflous argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
See http: //www.infradead.org/rpr.html
X-Evolution-Source: 1451162204.2173.11@leira.trondhjem.org
Content-Transfer-Encoding: 8bit
Mime-Version: 1.0
We support OFFSET_MAX just fine, so don't round down below it. Also
switch to using min_t to make the helper more readable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Fixes: 433c92379d ("NFS: Clean up nfs_size_to_loff_t()")
Cc: stable@vger.kernel.org # 2.6.23+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Highlights include:
Stable fixes:
- Fix a regression in the SunRPC socket polling code
- Fix the attribute cache revalidation code
- Fix race in __update_open_stateid()
- Fix an lo->plh_block_lgets imbalance in layoutreturn
- Fix an Oopsable typo in ff_mirror_match_fh()
Features:
- pNFS layout recall performance improvements.
- pNFS/flexfiles: Support server-supplied layoutstats sampling period
Bugfixes + cleanups:
- NFSv4: Don't perform cached access checks before we've OPENed the file
- Fix starvation issues with background flushes
- Reclaim writes should be flushed as unstable writes if there are already
entries in the commit lists
- Various bugfixes from Chuck to fix NFS/RDMA send queue ordering problems
- Ensure that we propagate fatal layoutget errors back to the application
- Fixes for sundry flexfiles layoutstats bugs
- Fix files/flexfiles to not cache invalidated layouts in the DS commit buckets
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWmAvPAAoJEGcL54qWCgDysScP/jnaRdQO+VTXTMtKcPiR7ujd
LBcx3lrI1jsLYjlrBguTh9ROGt0maX1TAu/rsLuo4j/0wQC6dsQw+vFjfkI4CzSn
4htK0f4hNjA29iOAjMaziAzsQW9eJ97Nn0HU4XD43OeK7PGh5e93Xk26Va4cO18P
PqSam+FJoXpUSEWOzNzDwjTZTt4Voo3yJDqDTa8dU0x8c1qBktslo2n0WCntBxMn
IbEDdBEIaUZmYCNhu2Sq1SLwYPatLg1Orfq3quMFJjzEeUbd0lVQno4C1fjjuACt
DNXUgZDH0uR3U3naMXrdkqQ02GHEY9G0CO4a6q0Evsbm15wQuY6GMioxR0+ll7rX
TeZGBUMq3cRFDR+/m1gTBZFjo7BUPE9LKXUazINVaoaJMYqpFunhI8V31ghx8/z8
0kiracIEPXaIGmQ5S151+IDETpw9nntipCzdnduVefB2EAfXPeDzF7uFQPm+mvgx
R4YuAFrlbcIZ/lZRYy5z6Fj3KLnytSOjzgXC5daxPQVt92QumQTQ6HC5jL25zVKb
KOeSWHrFel7M+miL96ERvcS2vi+IDzPH9YbE9YTWbLW9LMBOYQKsukf1aaV9CwC4
9OiNMYGQIGtmjbzIOlRcpVTAsXj+P6UVuwCfGTpQOm1Qa1fDbU+xSLkc62gg3WRa
3E/3RMr1iXD8u1Kiz8hb
=RBmi
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- Fix a regression in the SunRPC socket polling code
- Fix the attribute cache revalidation code
- Fix race in __update_open_stateid()
- Fix an lo->plh_block_lgets imbalance in layoutreturn
- Fix an Oopsable typo in ff_mirror_match_fh()
Features:
- pNFS layout recall performance improvements.
- pNFS/flexfiles: Support server-supplied layoutstats sampling period
Bugfixes + cleanups:
- NFSv4: Don't perform cached access checks before we've OPENed the
file
- Fix starvation issues with background flushes
- Reclaim writes should be flushed as unstable writes if there are
already entries in the commit lists
- Various bugfixes from Chuck to fix NFS/RDMA send queue ordering
problems
- Ensure that we propagate fatal layoutget errors back to the
application
- Fixes for sundry flexfiles layoutstats bugs
- Fix files/flexfiles to not cache invalidated layouts in the DS
commit buckets"
* tag 'nfs-for-4.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (68 commits)
NFS: Fix a compile warning about unused variable in nfs_generic_pg_pgios()
NFSv4: Fix a compile warning about no prototype for nfs4_ioctl()
NFS: Use wait_on_atomic_t() for unlock after readahead
SUNRPC: Fixup socket wait for memory
NFSv4.1/pNFS: Cleanup constify struct pnfs_layout_range arguments
NFSv4.1/pnfs: Cleanup copying of pnfs_layout_range structures
NFSv4.1/pNFS: Cleanup pnfs_mark_matching_lsegs_invalid()
NFSv4.1/pNFS: Fix a race in initiate_file_draining()
NFSv4.1/pNFS: pnfs_error_mark_layout_for_return() must always return layout
NFSv4.1/pNFS: pnfs_mark_matching_lsegs_return() should set the iomode
NFSv4.1/pNFS: Use nfs4_stateid_copy for copying stateids
NFSv4.1/pNFS: Don't pass stateids by value to pnfs_send_layoutreturn()
NFS: Relax requirements in nfs_flush_incompatible
NFSv4.1/pNFS: Don't queue up a new commit if the layout segment is invalid
NFS: Allow multiple commit requests in flight per file
NFS/pNFS: Fix up pNFS write reschedule layering violations and bugs
SUNRPC: Fix a missing break in rpc_anyaddr()
pNFS/flexfiles: Fix an Oopsable typo in ff_mirror_match_fh()
NFS: Fix attribute cache revalidation
NFS: Ensure we revalidate attributes before using execute_ok()
...
The use of wait_on_atomic_t() for waiting on I/O to complete before
unlocking allows us to git rid of the NFS_IO_INPROGRESS flag, and thus the
nfs_iocounter's flags member, and finally the nfs_iocounter altogether.
The count of I/O is moved to the lock context, and the counter
increment/decrement functions become simple enough to open-code.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
[Trond: Fix up conflict with existing function nfs_wait_atomic_killable()]
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* pnfs_generic:
NFSv4.1/pNFS: Cleanup constify struct pnfs_layout_range arguments
NFSv4.1/pnfs: Cleanup copying of pnfs_layout_range structures
NFSv4.1/pNFS: Cleanup pnfs_mark_matching_lsegs_invalid()
NFSv4.1/pNFS: Fix a race in initiate_file_draining()
NFSv4.1/pNFS: pnfs_error_mark_layout_for_return() must always return layout
NFSv4.1/pNFS: pnfs_mark_matching_lsegs_return() should set the iomode
NFSv4.1/pNFS: Use nfs4_stateid_copy for copying stateids
NFSv4.1/pNFS: Don't pass stateids by value to pnfs_send_layoutreturn()
NFS: Relax requirements in nfs_flush_incompatible
NFSv4.1/pNFS: Don't queue up a new commit if the layout segment is invalid
NFS: Allow multiple commit requests in flight per file
NFS/pNFS: Fix up pNFS write reschedule layering violations and bugs
NFSv4: List stateid information in the callback tracepoints
NFSv4.1/pNFS: Don't return NFS4ERR_DELAY unnecessarily in CB_LAYOUTRECALL
NFSv4.1/pNFS: Ensure we enforce RFC5661 Section 12.5.5.2.1
pNFS: If we have to delay the layout callback, mark the layout for return
NFSv4.1/pNFS: Add a helper to mark the layout as returned
pNFS: Ensure nfs4_layoutget_prepare returns the correct error
Allow synchronous RPC calls to wait for pending RPC calls to finish,
but also allow asynchronous ones to just fire off another commit.
With this patch, the xfstests generic/074 test completes in 226s
instead of 242s
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Instead of dropping pages when write fails, only do it when
we get fatal failure in launder_page write back.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
based upon the corresponding patch from Neil's March patchset,
again with kmap-related horrors removed.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
And call nfs_file_clear_open_context() directly. This makes it obvious
that nfs_file_release() will always return 0.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
I'm not aware of any existing bugs around this, but the expectation is
that nfs_mark_for_revalidate() should always force a revalidation of
the cached metadata.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
There is no need to report concurrently.
Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Highlights include:
Stable patches:
- Fix a regression in /proc/self/mountstats
- Fix the pNFS flexfiles O_DIRECT support
- Fix high load average due to callback thread sleeping
Bugfixes:
- Various patches to fix the pNFS layoutcommit support
- Do not cache pNFS deviceids unless server notifications are enabled
- Fix a SUNRPC transport reconnection regression
- make debugfs file creation failure non-fatal in SUNRPC
- Another fix for circular directory warnings on NFSv4 "junctioned" mountpoints
- Fix locking around NFSv4.2 fallocate() support
- Truncating NFSv4 file opens should also sync O_DIRECT writes
- Prevent infinite loop in rpcrdma_ep_create()
Features:
- Various improvements to the RDMA transport code's handling of memory
registration
- Various code cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVOmT6AAoJEGcL54qWCgDyrhYQAMPKXB55jrdOR/7UVSF/xPML
7OjMGHvBnTn/y0pNIyLyS1PjTZZsD/WZjoW9EFGpTv727qQNVoFxFRLNUcgi3NoL
1YledCkLf7Q32aqod93SRRFPc9hzBoKhOZpOzBuWaAviyAB3KLi70DWAq9qRReYM
prXUQQjpW5FLU+B2ifaVc2RCnu/rZ2c02YdR2XdtkBaAJxuhB2vR8IY1evwjCv3R
5zyLDd9zSDDoArdpUzM97cxZPcYRSqbOwgTKvaaRnDDq/mKbKMZaqmEfjblwzNFt
b43FbveJzZ3hlPADIpmaiMHjRTbxWjIKc9K1sOF2FPfcuPe2yM3DMAxDegUkEveS
7fkbv/qRZ30NqfchGanX/pmBlLOcdI76qe/bwhN19wCnw48O1eeHi1HK8rWGhU+E
qcrRZ3ZS2ufP/MVBuhauy0qU9Q4wcEtm7NGGP1231ZtmfjHKyBa4pLirNfG1AlJt
dK7tBrknVx+WVm/UddJp/fEsxbP0+fki6TwzioHUSWcz8rDVYF6PFT/QPM54SX2h
0oqwvu6d/uShpkVRm+fbje8FHmUxKdgqDsCYX2fNjWskh1oXSPsItvjqmTmTlE0i
EBmBwVwI0uB1ZQ3PrJLadhRcO3ZJmLQ5gNj456dstvWy6UQds1xyIQ/DgvmlzxWO
E9t0l18xHGRwbndsDa8f
=j5dP
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Another set of mainly bugfixes and a couple of cleanups. No new
functionality in this round.
Highlights include:
Stable patches:
- Fix a regression in /proc/self/mountstats
- Fix the pNFS flexfiles O_DIRECT support
- Fix high load average due to callback thread sleeping
Bugfixes:
- Various patches to fix the pNFS layoutcommit support
- Do not cache pNFS deviceids unless server notifications are enabled
- Fix a SUNRPC transport reconnection regression
- make debugfs file creation failure non-fatal in SUNRPC
- Another fix for circular directory warnings on NFSv4 "junctioned"
mountpoints
- Fix locking around NFSv4.2 fallocate() support
- Truncating NFSv4 file opens should also sync O_DIRECT writes
- Prevent infinite loop in rpcrdma_ep_create()
Features:
- Various improvements to the RDMA transport code's handling of
memory registration
- Various code cleanups"
* tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (55 commits)
fs/nfs: fix new compiler warning about boolean in switch
nfs: Remove unneeded casts in nfs
NFS: Don't attempt to decode missing directory entries
Revert "nfs: replace nfs_add_stats with nfs_inc_stats when add one"
NFS: Rename idmap.c to nfs4idmap.c
NFS: Move nfs_idmap.h into fs/nfs/
NFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.h
NFS: Add a stub for GETDEVICELIST
nfs: remove WARN_ON_ONCE from nfs_direct_good_bytes
nfs: fix DIO good bytes calculation
nfs: Fetch MOUNTED_ON_FILEID when updating an inode
sunrpc: make debugfs file creation failure non-fatal
nfs: fix high load average due to callback thread sleeping
NFS: Reduce time spent holding the i_mutex during fallocate()
NFS: Don't zap caches on fallocate()
xprtrdma: Make rpcrdma_{un}map_one() into inline functions
xprtrdma: Handle non-SEND completions via a callout
xprtrdma: Add "open" memreg op
xprtrdma: Add "destroy MRs" memreg op
xprtrdma: Add "reset MRs" memreg op
...
When invalidating the page cache for a regular file, we want to first
sync all dirty data to disk and then call invalidate_inode_pages2().
The latter relies on nfs_launder_page() and nfs_release_page() to deal
respectively with dirty pages, and unstable written pages.
When commit 9590544694 ("NFS: avoid deadlocks with loop-back mounted
NFS filesystems.") changed the behaviour of nfs_release_page(), then it
made it possible for invalidate_inode_pages2() to fail with an EBUSY.
Unfortunately, that error is then propagated back to read().
Let's therefore work around the problem for now by protecting the call
to sync the data and invalidate_inode_pages2() so that they are atomic
w.r.t. the addition of new writes.
Later on, we can revisit whether or not we still need nfs_launder_page()
and nfs_release_page().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Ensure that other operations that race with our write RPC calls
cannot revert the file size updates that were made on the server.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Ensure that other operations which raced with our setattr RPC call
cannot revert the file attribute changes that were made on the server.
To do so, we artificially bump the attribute generation counter on
the inode so that all calls to nfs_fattr_init() that precede ours
will be dropped.
The motivation for the patch came from Chuck Lever's reports of readaheads
racing with truncate operations and causing the file size to be reverted.
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Recent work in the pgio layer made it possible for there to be more than one
request per page. This caused a subtle change in commit behavior, because
write.c:nfs_commit_unstable_pages compares the number of *pages* waiting for
writeback against the number of requests on a commit list to choose when to
send a COMMIT in a non-blocking flush.
This is probably hard to hit in normal operation - you have to be using
rsize/wsize < PAGE_SIZE, or pnfs with lots of boundaries that are not page
aligned to have a noticeable change in behavior.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull core block layer changes from Jens Axboe:
"This is the core block IO pull request for 3.18. Apart from the new
and improved flush machinery for blk-mq, this is all mostly bug fixes
and cleanups.
- blk-mq timeout updates and fixes from Christoph.
- Removal of REQ_END, also from Christoph. We pass it through the
->queue_rq() hook for blk-mq instead, freeing up one of the request
bits. The space was overly tight on 32-bit, so Martin also killed
REQ_KERNEL since it's no longer used.
- blk integrity updates and fixes from Martin and Gu Zheng.
- Update to the flush machinery for blk-mq from Ming Lei. Now we
have a per hardware context flush request, which both cleans up the
code should scale better for flush intensive workloads on blk-mq.
- Improve the error printing, from Rob Elliott.
- Backing device improvements and cleanups from Tejun.
- Fixup of a misplaced rq_complete() tracepoint from Hannes.
- Make blk_get_request() return error pointers, fixing up issues
where we NULL deref when a device goes bad or missing. From Joe
Lawrence.
- Prep work for drastically reducing the memory consumption of dm
devices from Junichi Nomura. This allows creating clone bio sets
without preallocating a lot of memory.
- Fix a blk-mq hang on certain combinations of queue depths and
hardware queues from me.
- Limit memory consumption for blk-mq devices for crash dump
scenarios and drivers that use crazy high depths (certain SCSI
shared tag setups). We now just use a single queue and limited
depth for that"
* 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
block: Remove REQ_KERNEL
blk-mq: allocate cpumask on the home node
bio-integrity: remove the needless fail handle of bip_slab creating
block: include func name in __get_request prints
block: make blk_update_request print prefix match ratelimited prefix
blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
block: fix alignment_offset math that assumes io_min is a power-of-2
blk-mq: Make bt_clear_tag() easier to read
blk-mq: fix potential hang if rolling wakeup depth is too high
block: add bioset_create_nobvec()
block: use bio_clone_fast() in blk_rq_prep_clone()
block: misplaced rq_complete tracepoint
sd: Honor block layer integrity handling flags
block: Replace strnicmp with strncasecmp
block: Add T10 Protection Information functions
block: Don't merge requests if integrity flags differ
block: Integrity checksum flag
block: Relocate bio integrity flags
block: Add a disk flag to block integrity profile
block: Add prefix to block integrity profile flags
...
REQ_KERNEL is no longer used. Remove it and drop the redundant uio
argument to nfs_file_direct_{read,write}.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This code is internal to the v3 module, so other parts of the client
shouldn't have any knowledge of it.
nfs3_getxattr(), nfs3_setxattr(), and nfs3_removexattr() no longer exist
anywhere so I remove the declarations while I'm here.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The goal is to create a generic NFS module with code that does not
depend on what versions of NFS are enabled.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The access cache is used during RCU-walk path lookups, so it is best
to avoid locking if possible as taking a lock kills concurrency.
The rbtree is not rcu-safe and cannot easily be made so.
Instead we simply check the last (i.e. most recent) entry on the LRU
list. If this doesn't match, then we return -ECHILD and retry in
lock/refcount mode.
This requires freeing the nfs_access_entry struct with rcu, and
requires using rcu access primatives when adding entries to the lru, and
when examining the last entry.
Calling put_rpccred before kfree_rcu looks a bit odd, but as
put_rpccred already provides rcu protection, we know that the cred will
not actually be freed until the next grace period, so any concurrent
access will be safe.
This patch provides about 5% performance improvement on a stat-heavy
synthetic work load with 4 threads on a 2-core CPU.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This requires nfs_check_verifier to take an rcu_walk flag, and requires
an rcu version of nfs_revalidate_inode which returns -ECHILD rather
than making an RPC call.
With this, nfs_lookup_revalidate can call nfs_neg_need_reval in
RCU-walk mode.
We can also move the LOOKUP_RCU check past the nfs_check_verifier()
call in nfs_lookup_revalidate.
If RCU_WALK prevents nfs_check_verifier or nfs_neg_need_reval from
doing a full check, they return a status indicating that a revalidation
is required. As this revalidation will not be possible in RCU_WALK
mode, -ECHILD will ultimately be returned, which is the desired result.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull vfs updates from Al Viro:
"This the bunch that sat in -next + lock_parent() fix. This is the
minimal set; there's more pending stuff.
In particular, I really hope to get acct.c fixes merged this cycle -
we need that to deal sanely with delayed-mntput stuff. In the next
pile, hopefully - that series is fairly short and localized
(kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
iov_iter work. Most of prereqs for ->splice_write with sane locking
order are there and Kent's dio rewrite would also fit nicely on top of
this pile"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
lock_parent: don't step on stale ->d_parent of all-but-freed one
kill generic_file_splice_write()
ceph: switch to iter_file_splice_write()
shmem: switch to iter_file_splice_write()
nfs: switch to iter_splice_write_file()
fs/splice.c: remove unneeded exports
ocfs2: switch to iter_file_splice_write()
->splice_write() via ->write_iter()
bio_vec-backed iov_iter
optimize copy_page_{to,from}_iter()
bury generic_file_aio_{read,write}
lustre: get rid of messing with iovecs
ceph: switch to ->write_iter()
ceph_sync_direct_write: stop poking into iov_iter guts
ceph_sync_read: stop poking into iov_iter guts
new helper: copy_page_from_iter()
fuse: switch to ->write_iter()
btrfs: switch to ->write_iter()
ocfs2: switch to ->write_iter()
xfs: switch to ->write_iter()
...
Combining these functions will let me make a single nfs_rw_common_ops
struct (see the next patch).
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
At this point, the only difference between nfs_read_data and
nfs_write_data is the write verifier.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
...and move the prototype for nfs_sillyrename to internal.h.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Try to detect 'ls -l' by having nfs_getattr() look at whether or not
there is an opendir() file descriptor for the parent directory.
If so, then assume that we want to force use of readdirplus in order
to avoid the multiple GETATTR calls over the wire.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Highlights:
- Fix several races in nfs_revalidate_mapping
- NFSv4.1 slot leakage in the pNFS files driver
- Stable fix for a slot leak in nfs40_sequence_done
- Don't reject NFSv4 servers that support ACLs with only ALLOW aces
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJS7Bb+AAoJEGcL54qWCgDyDuQP/17nKR5e6MLhixcAbvlcH+pN
8CGolAM3HmRXDWUW/PkBH3UguG8Tzx1Ex26vIxipPeTSwZabf6194Twj6L97DEGZ
2SouD158BW1TkAbhEN/alKB/4ZCPos05iXjZkrL7MRff+8FD0UvWR2pBT1F2jQdY
ZftG76Q72qhZHfH07ZMxM/v4Oy2Ge98RDD35gfuuqMSjHpmN9tiB55PeheW33LVY
fu6I/JEwmlJpgy2qUcDv7v0V4mDpjC7XbcjjHpMHL8zp/C5Rx/rdgt9OQPlwmjdV
FD8MWNXLc5TWxIouLDFPVUv3WZPjyu449QHS9Wc95fSqsHcdl4j4SwLAoSvUIdHt
vDI5PtWhw3WAezbtiuCQnT0xcoIOn5bLjOVP13taDcV9vlZLcFlyOpZ5gVE4/Yju
zm4sCW2+imDc74ERGa4fukF6QhzzAVmD8RlCJwuNzwCfXiZ36+xSanMYiPoUiwLL
OVNgymrm0fe7GVFQKWN2D+Vr68OQEmARO+KfA3UzP5rQV+9CU8zSHjbcoRWZ59QG
VahOS5WDLQSrMp8W37yAHH9IiAWveAAKJJTHlOniRqH90QYPgyW18fTo7YcpW313
AQGFgr/1n4t27MWRLu5rdoN5v8+kwNi0UV6oboNIPoP1v15NkEMvc7HKFj5M883R
qEYfe5wqN/eRNj68NT/+
=B7f0
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights:
- Fix several races in nfs_revalidate_mapping
- NFSv4.1 slot leakage in the pNFS files driver
- Stable fix for a slot leak in nfs40_sequence_done
- Don't reject NFSv4 servers that support ACLs with only ALLOW aces"
* tag 'nfs-for-3.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
nfs: initialize the ACL support bits to zero.
NFSv4.1: Cleanup
NFSv4.1: Clean up nfs41_sequence_done
NFSv4: Fix a slot leak in nfs40_sequence_done
NFSv4.1 free slot before resending I/O to MDS
nfs: add memory barriers around NFS_INO_INVALID_DATA and NFS_INO_INVALIDATING
NFS: Fix races in nfs_revalidate_mapping
sunrpc: turn warn_gssd() log message into a dprintk()
NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mapping
nfs: handle servers that support only ALLOW ACE type.
There is a possible race in how the nfs_invalidate_mapping function is
handled. Currently, we go and invalidate the pages in the file and then
clear NFS_INO_INVALID_DATA.
The problem is that it's possible for a stale page to creep into the
mapping after the page was invalidated (i.e., via readahead). If another
writer comes along and sets the flag after that happens but before
invalidate_inode_pages2 returns then we could clear the flag
without the cache having been properly invalidated.
So, we must clear the flag first and then invalidate the pages. Doing
this however, opens another race:
It's possible to have two concurrent read() calls that end up in
nfs_revalidate_mapping at the same time. The first one clears the
NFS_INO_INVALID_DATA flag and then goes to call nfs_invalidate_mapping.
Just before calling that though, the other task races in, checks the
flag and finds it cleared. At that point, it trusts that the mapping is
good and gets the lock on the page, allowing the read() to be satisfied
from the cache even though the data is no longer valid.
These effects are easily manifested by running diotest3 from the LTP
test suite on NFS. That program does a series of DIO writes and buffered
reads. The operations are serialized and page-aligned but the existing
code fails the test since it occasionally allows a read to come out of
the cache incorrectly. While mixing direct and buffered I/O isn't
recommended, I believe it's possible to hit this in other ways that just
use buffered I/O, though that situation is much harder to reproduce.
The problem is that the checking/clearing of that flag and the
invalidation of the mapping really need to be atomic. Fix this by
serializing concurrent invalidations with a bitlock.
At the same time, we also need to allow other places that check
NFS_INO_INVALID_DATA to check whether we might be in the middle of
invalidating the file, so fix up a couple of places that do that
to look for the new NFS_INO_INVALIDATING flag.
Doing this requires us to be careful not to set the bitlock
unnecessarily, so this code only does that if it believes it will
be doing an invalidation.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This causes a small behaviour change in that we don't bother to set
ACLs on file creation if the mode bit can express the access permissions
fully, and thus behaving identical to local filesystems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When CONFIG_NFS_V4_2 is toggled nfsd and lockd will be recompiled,
instead of only the nfs client. This patch moves a small amount of code
into the client directory to avoid unnecessary recompiles.
Signed-off-by: Anna Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use i_writecount to control whether to get an fscache cookie in nfs_open() as
NFS does not do write caching yet. I *think* this is the cause of a problem
encountered by Mark Moseley whereby __fscache_uncache_page() gets a NULL
pointer dereference because cookie->def is NULL:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: [<ffffffff812a1903>] __fscache_uncache_page+0x23/0x160
PGD 0
Thread overran stack, or stack corrupted
Oops: 0000 [#1] SMP
Modules linked in: ...
CPU: 7 PID: 18993 Comm: php Not tainted 3.11.1 #1
Hardware name: Dell Inc. PowerEdge R420/072XWF, BIOS 1.3.5 08/21/2012
task: ffff8804203460c0 ti: ffff880420346640
RIP: 0010:[<ffffffff812a1903>] __fscache_uncache_page+0x23/0x160
RSP: 0018:ffff8801053af878 EFLAGS: 00210286
RAX: 0000000000000000 RBX: ffff8800be2f8780 RCX: ffff88022ffae5e8
RDX: 0000000000004c66 RSI: ffffea00055ff440 RDI: ffff8800be2f8780
RBP: ffff8801053af898 R08: 0000000000000001 R09: 0000000000000003
R10: 0000000000000000 R11: 0000000000000000 R12: ffffea00055ff440
R13: 0000000000001000 R14: ffff8800c50be538 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88042fc60000(0063) knlGS:00000000e439c700
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000000000000010 CR3: 0000000001d8f000 CR4: 00000000000607f0
Stack:
...
Call Trace:
[<ffffffff81365a72>] __nfs_fscache_invalidate_page+0x42/0x70
[<ffffffff813553d5>] nfs_invalidate_page+0x75/0x90
[<ffffffff811b8f5e>] truncate_inode_page+0x8e/0x90
[<ffffffff811b90ad>] truncate_inode_pages_range.part.12+0x14d/0x620
[<ffffffff81d6387d>] ? __mutex_lock_slowpath+0x1fd/0x2e0
[<ffffffff811b95d3>] truncate_inode_pages_range+0x53/0x70
[<ffffffff811b969d>] truncate_inode_pages+0x2d/0x40
[<ffffffff811b96ff>] truncate_pagecache+0x4f/0x70
[<ffffffff81356840>] nfs_setattr_update_inode+0xa0/0x120
[<ffffffff81368de4>] nfs3_proc_setattr+0xc4/0xe0
[<ffffffff81357f78>] nfs_setattr+0xc8/0x150
[<ffffffff8122d95b>] notify_change+0x1cb/0x390
[<ffffffff8120a55b>] do_truncate+0x7b/0xc0
[<ffffffff8121f96c>] do_last+0xa4c/0xfd0
[<ffffffff8121ffbc>] path_openat+0xcc/0x670
[<ffffffff81220a0e>] do_filp_open+0x4e/0xb0
[<ffffffff8120ba1f>] do_sys_open+0x13f/0x2b0
[<ffffffff8126aaf6>] compat_SyS_open+0x36/0x50
[<ffffffff81d7204c>] sysenter_dispatch+0x7/0x24
The code at the instruction pointer was disassembled:
> (gdb) disas __fscache_uncache_page
> Dump of assembler code for function __fscache_uncache_page:
> ...
> 0xffffffff812a18ff <+31>: mov 0x48(%rbx),%rax
> 0xffffffff812a1903 <+35>: cmpb $0x0,0x10(%rax)
> 0xffffffff812a1907 <+39>: je 0xffffffff812a19cd <__fscache_uncache_page+237>
These instructions make up:
ASSERTCMP(cookie->def->type, !=, FSCACHE_COOKIE_TYPE_INDEX);
That cmpb is the faulting instruction (%rax is 0). So cookie->def is NULL -
which presumably means that the cookie has already been at least partway
through __fscache_relinquish_cookie().
What I think may be happening is something like a three-way race on the same
file:
PROCESS 1 PROCESS 2 PROCESS 3
=============== =============== ===============
open(O_TRUNC|O_WRONLY)
open(O_RDONLY)
open(O_WRONLY)
-->nfs_open()
-->nfs_fscache_set_inode_cookie()
nfs_fscache_inode_lock()
nfs_fscache_disable_inode_cookie()
__fscache_relinquish_cookie()
nfs_inode->fscache = NULL
<--nfs_fscache_set_inode_cookie()
-->nfs_open()
-->nfs_fscache_set_inode_cookie()
nfs_fscache_inode_lock()
nfs_fscache_enable_inode_cookie()
__fscache_acquire_cookie()
nfs_inode->fscache = cookie
<--nfs_fscache_set_inode_cookie()
<--nfs_open()
-->nfs_setattr()
...
...
-->nfs_invalidate_page()
-->__nfs_fscache_invalidate_page()
cookie = nfsi->fscache
-->nfs_open()
-->nfs_fscache_set_inode_cookie()
nfs_fscache_inode_lock()
nfs_fscache_disable_inode_cookie()
-->__fscache_relinquish_cookie()
-->__fscache_uncache_page(cookie)
<crash>
<--__fscache_relinquish_cookie()
nfs_inode->fscache = NULL
<--nfs_fscache_set_inode_cookie()
What is needed is something to prevent process #2 from reacquiring the cookie
- and I think checking i_writecount should do the trick.
It's also possible to have a two-way race on this if the file is opened
O_TRUNC|O_RDONLY instead.
Reported-by: Mark Moseley <moseleymark@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>