Commit graph

223 commits

Author SHA1 Message Date
Yan, Zheng
4b9f2042fd ceph: avoid accessing freeing inode in ceph_check_delayed_caps()
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:13 +02:00
Yan, Zheng
efb0ca765a ceph: update the 'approaching max_size' code
The old 'approaching max_size' code expects MDS set max_size to
'2 * reported_size'. This is no longer true. The new code reports
file size when half of previous max_size increment has been used.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:12 +02:00
Yan, Zheng
84eea8c790 ceph: re-request max size after importing caps
The 'wanted max size' could be sent to inode's old auth mds, re-send
it to inode's new auth mds if necessary. Otherwise write syscall may
hang.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:12 +02:00
Jeff Layton
92475f05bd ceph: handle epoch barriers in cap messages
Have the client store and update the osdc epoch_barrier when a cap
message comes in with one.

When sending cap messages, send the epoch barrier as well. This allows
clients to inform servers that their released caps may not be used until
a particular OSD map epoch.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng” <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:21 +02:00
Elena Reshetova
805692d0e0 ceph: convert ceph_cap_snap.nref from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:18 +02:00
Ingo Molnar
174cd4b1e5 sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>
Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:32 +01:00
Ilya Dryomov
55f2a04588 ceph: remove special ack vs commit behavior
- ask for a commit reply instead of an ack reply in
  __ceph_pool_perm_get()
- don't ask for both ack and commit replies in ceph_sync_write()
- since just only one reply is requested now, i_unsafe_writes list
  will always be empty -- kill ceph_sync_write_wait() and go back to
  a standard ->evict_inode()

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-02-24 19:04:57 +01:00
Yan, Zheng
c1944fedd8 ceph: avoid calling ceph_renew_caps() infinitely
__ceph_caps_mds_wanted() ignores caps from stale session. So the
return value of __ceph_caps_mds_wanted() can keep the same across
ceph_renew_caps(). This causes try_get_cap_refs() to keep calling
ceph_renew_caps(). The fix is ignore the session valid check for
the try_get_cap_refs() case. If session is stale, just let the
caps requester sleep.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2017-02-20 12:16:07 +01:00
Yan, Zheng
00f06cba53 ceph: make sure flushing inode in proper session's cap_flushing list
when flushing inode's auth cap changes, we need to move it into the
new auth cap session's cap_flushing list

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2017-02-20 12:16:07 +01:00
Yan, Zheng
eb65b919b9 ceph: avoid updating mds_wanted too frequently
user space may open/close single file frequently. It's not good
to send a clientcaps message to mds for each open/close syscall.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2017-02-20 12:16:06 +01:00
Seraphime Kirkovski
52953d5591 ceph: cleanup ACCESS_ONCE -> READ_ONCE
This removes the uses of ACCESS_ONCE in favor of READ_ONCE

Signed-off-by: Seraphime Kirkovski <kirkseraph@gmail.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2017-02-20 12:16:05 +01:00
Jeff Layton
ca6c8ae0f7 ceph: pass parent inode info to ceph_encode_dentry_release if we have it
If we have a parent inode reference already, then we don't need to
go back up the directory tree to find one.

Link: http://tracker.ceph.com/issues/18148
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20 12:16:05 +01:00
Jeff Layton
adf0d68701 ceph: fix unsafe dcache access in ceph_encode_dentry_release
Accessing d_parent requires some sort of locking or it could vanish
out from under us. Since we take the d_lock anyway, use that to fetch
d_parent and take a reference to it, and then use that reference to
call ceph_encode_inode_release.

Link: http://tracker.ceph.com/issues/18148
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20 12:16:05 +01:00
Yan, Zheng
6e09d0fb64 ceph: fix ceph_get_caps() interruption
Commit 5c341ee328 ("ceph: fix scheduler warning due to nested
blocking") causes infinite loop when process is interrupted.  Fix it.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-01-18 17:58:45 +01:00
Yan, Zheng
dc24de82d6 ceph: properly set issue_seq for cap release
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:28 +01:00
Jeff Layton
1e4ef0c633 ceph: add flags parameter to send_cap_msg
Add a flags parameter to send_cap_msg, so we can request expedited
service from the MDS when we know we'll be waiting on the result.

Set that flag in the case of try_flush_caps. The callers of that
function generally wait synchronously on the result, so it's beneficial
to ask the server to expedite it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:28 +01:00
Jeff Layton
43b2967330 ceph: update cap message struct version to 10
The userland ceph has MClientCaps at struct version 10. This brings the
kernel up the same version.

For now, all of the the new stuff is set to default values including
the flags field, which will be conditionally set in a later patch.

Note that we don't need to set the change_attr and btime to anything
since we aren't currently setting the feature flag. The MDS should
ignore those values.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:28 +01:00
Jeff Layton
0ff8bfb394 ceph: define new argument structure for send_cap_msg
When we get to this many arguments, it's hard to work with positional
parameters. send_cap_msg is already at 25 arguments, with more needed.

Define a new args structure and pass a pointer to it to send_cap_msg.
Eventually it might make sense to embed one of these inside
ceph_cap_snap instead of tracking individual fields.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:28 +01:00
Jeff Layton
9670079f5f ceph: move xattr initialzation before the encoding past the ceph_mds_caps
Just for clarity. This part is inside the header, so it makes sense to
group it with the rest of the stuff in the header.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:28 +01:00
Jeff Layton
4945a08479 ceph: fix minor typo in unsafe_request_wait
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:27 +01:00
Yan, Zheng
2b1ac852eb ceph: try getting buffer capability for readahead/fadvise
For readahead/fadvise cases, caller of ceph_readpages does not
hold buffer capability. Pages can be added to page cache while
there is no buffer capability. This can cause data integrity
issue.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:27 +01:00
Nikolay Borisov
5c341ee328 ceph: fix scheduler warning due to nested blocking
try_get_cap_refs can be used as a condition in a wait_event* calls.
This is all fine until it has to call __ceph_do_pending_vmtruncate,
which in turn acquires the i_truncate_mutex. This leads to a situation
in which a task's state is !TASK_RUNNING and at the same time it's
trying to acquire a sleeping primitive. In essence a nested sleeping
primitives are being used. This causes the following warning:

WARNING: CPU: 22 PID: 11064 at kernel/sched/core.c:7631 __might_sleep+0x9f/0xb0()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8109447d>] prepare_to_wait_event+0x5d/0x110
 ipmi_msghandler tcp_scalable ib_qib dca ib_mad ib_core ib_addr ipv6
CPU: 22 PID: 11064 Comm: fs_checker.pl Tainted: G           O    4.4.20-clouder2 #6
Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1a 10/16/2015
 0000000000000000 ffff8838b416fa88 ffffffff812f4409 ffff8838b416fad0
 ffffffff81a034f2 ffff8838b416fac0 ffffffff81052b46 ffffffff81a0432c
 0000000000000061 0000000000000000 0000000000000000 ffff88167bda54a0
Call Trace:
 [<ffffffff812f4409>] dump_stack+0x67/0x9e
 [<ffffffff81052b46>] warn_slowpath_common+0x86/0xc0
 [<ffffffff81052bcc>] warn_slowpath_fmt+0x4c/0x50
 [<ffffffff8109447d>] ? prepare_to_wait_event+0x5d/0x110
 [<ffffffff8109447d>] ? prepare_to_wait_event+0x5d/0x110
 [<ffffffff8107767f>] __might_sleep+0x9f/0xb0
 [<ffffffff81612d30>] mutex_lock+0x20/0x40
 [<ffffffffa04eea14>] __ceph_do_pending_vmtruncate+0x44/0x1a0 [ceph]
 [<ffffffffa04fa692>] try_get_cap_refs+0xa2/0x320 [ceph]
 [<ffffffffa04fd6f5>] ceph_get_caps+0x255/0x2b0 [ceph]
 [<ffffffff81094370>] ? wait_woken+0xb0/0xb0
 [<ffffffffa04f2c11>] ceph_write_iter+0x2b1/0xde0 [ceph]
 [<ffffffff81613f22>] ? schedule_timeout+0x202/0x260
 [<ffffffff8117f01a>] ? kmem_cache_free+0x1ea/0x200
 [<ffffffff811b46ce>] ? iput+0x9e/0x230
 [<ffffffff81077632>] ? __might_sleep+0x52/0xb0
 [<ffffffff81156147>] ? __might_fault+0x37/0x40
 [<ffffffff8119e123>] ? cp_new_stat+0x153/0x170
 [<ffffffff81198cfa>] __vfs_write+0xaa/0xe0
 [<ffffffff81199369>] vfs_write+0xa9/0x190
 [<ffffffff811b6d01>] ? set_close_on_exec+0x31/0x70
 [<ffffffff8119a056>] SyS_write+0x46/0xa0

This happens since wait_event_interruptible can interfere with the
mutex locking code, since they both fiddle with the task state.

Fix the issue by using the newly-added nested blocking infrastructure
in 61ada528de ("sched/wait: Provide infrastructure to deal with
nested blocking")

Link: https://lwn.net/Articles/628628/
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:27 +01:00
Yan, Zheng
e4d2b16a44 ceph: fix null pointer dereference in ceph_flush_snaps()
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-08-08 21:41:43 +02:00
Yan, Zheng
c8799fc467 ceph: optimize cap flush waiting
Add a 'wake' flag to ceph_cap_flush struct, which indicates if there
is someone waiting for it to finish. When getting flush ack message,
we check the 'wake' flag in corresponding ceph_cap_flush struct to
decide if we should wake up waiters. One corner case is that the
acked cap flush has 'wake' flags is set, but it is not the first one
on the flushing list. We do not wake up waiters in this case, set
'wake' flags of preceding ceph_cap_flush struct instead

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 03:00:45 +02:00
Yan, Zheng
ed9b430c9b ceph: cleanup ceph_flush_snaps()
This patch devide __ceph_flush_snaps() into two stags. In the first
stage, __ceph_flush_snaps() assign snapcaps flush TIDs and add them
to cap flush lists. __ceph_flush_snaps() keeps holding the
i_ceph_lock in this stagge. So inode's auth cap can not change. In
the second stage, __ceph_flush_snaps() send flushsnap cap messages.
i_ceph_lock is unlocked before sending each cap message. If auth cap
changes in the middle, __ceph_flush_snaps() just stops. This is OK
because kick_flushing_inode_caps() will re-send flushsnap cap messages
to inode's new auth MDS.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 03:00:44 +02:00
Yan, Zheng
7bc00fddb9 ceph: kick cap flushes before sending other cap message
If ceph_check_caps() wants to send cap message to a recovering MDS,
make sure it kicks cap flushes first.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 03:00:44 +02:00
Yan, Zheng
70220ac8c2 ceph: introduce an inode flag to indicates if snapflush is needed
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 03:00:43 +02:00
Yan, Zheng
13c2b57d81 ceph: avoid sending duplicated cap flush message
make ceph_kick_flushing_caps() ignore inodes whose cap flushes
have already been re-sent by ceph_early_kick_flushing_caps()

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 03:00:43 +02:00
Yan, Zheng
0e29438789 ceph: unify cap flush and snapcap flush
This patch includes following changes
- Assign flush tid to snapcap flush
- Remove session's s_cap_snaps_flushing list. Add inode to session's
  s_cap_flushing list instead. Inode is removed from the list when
  there is no pending snapcap flush or cap flush.
- make __kick_flushing_caps() re-send both snapcap flushes and cap
  flushes.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 03:00:42 +02:00
Yan, Zheng
e4500b5e35 ceph: use list instead of rbtree to track cap flushes
We don't have requirement of searching cap flush by TID. In most cases,
we just need to know TID of the oldest cap flush. List is ideal for this
usage.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 03:00:42 +02:00
Yan, Zheng
3609404f8c ceph: update types of some local varibles
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 03:00:42 +02:00
Yan, Zheng
9a5530c638 ceph: wait unsafe sync writes for evicting inode
Otherwise ceph_sync_write_unsafe() may access/modify freed inode.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 02:55:40 +02:00
Yan, Zheng
774a6a118c ceph: reduce i_nr_by_mode array size
Track usage count for individual fmode bit. This can reduce the
array size by half.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 02:55:39 +02:00
Yan, Zheng
779fe0fb8e ceph: rados pool namespace support
This patch adds codes that decode pool namespace information in
cap message and request reply. Pool namespace is saved in i_layout,
it will be passed to libceph when doing read/write.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 02:55:38 +02:00
Yan, Zheng
7627151ea3 libceph: define new ceph_file_layout structure
Define new ceph_file_layout structure and rename old ceph_file_layout
to ceph_file_layout_legacy. This is preparation for adding namespace
to ceph_file_layout structure.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 02:55:36 +02:00
Yan, Zheng
f7f7e7a063 ceph: improve fscache revalidation
There are several issues in fscache revalidation code.
- In ceph_revalidate_work(), fscache_invalidate() is called when
  fscache_check_consistency() return 0. This is complete wrong
  because 0 means cache is valid.
- Handle_cap_grant() calls ceph_queue_revalidate() if client
  already has CAP_FILE_CACHE. This code is confusing. Client
  should revalidate the cache each time it got CAP_FILE_CACHE
  anew.
- In Handle_cap_grant(), fscache_invalidate() is called if MDS
  revokes CAP_FILE_CACHE. This is inconsistency with the case
  that inode get evicted. In the later case, the cache is not
  discarded. Client may use the cache when inode is reloaded.

This patch moves the fscache revalidation into ceph_get_caps().
Client revalidates the cache after it gets CAP_FILE_CACHE.
i_rdcache_gen should keep constance while CAP_FILE_CACHE is
used. If i_fscache_gen is not equal to i_rdcache_gen, client
needs to check cache's consistency.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-06-01 10:31:50 +02:00
Yan, Zheng
1464975816 ceph: avoid unnecessary fscache invalidation/revlidation
ceph_fill_file_size() has already called ceph_fscache_invalidate()
if it return true.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-06-01 10:30:41 +02:00
Yan, Zheng
9abd4db713 ceph: don't use truncate_pagecache() to invalidate read cache
truncate_pagecache() drops dirty pages, it's dangerous to use it
to invalidate read cache. Besides, we shouldn't start invalidating
read cache while there are buffer writers. Because buffer writers
may add dirty pages later.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26 01:15:42 +02:00
Yan, Zheng
77310320c2 ceph: renew caps for read/write if mds session got killed.
When mds session gets killed, read/write operation may hang.
Client waits for Frw caps, but mds does not know what caps client
wants. To recover this, client sends an open request to mds. The
request will tell mds what caps client wants.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26 01:15:31 +02:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Yan, Zheng
d1eee0c0e1 ceph: encode ctime in cap message
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:40 +01:00
Yan, Zheng
5ea5c5e0a7 ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support
Add support for the format change of MClientReply/MclientCaps.
Also add code that denies access to inodes with pool_ns layouts.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-03-04 21:00:37 +01:00
Al Viro
5955102c99 wrappers for ->i_mutex access
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-22 18:04:28 -05:00
Yan, Zheng
68cd5b4b76 ceph: make fsync() wait unsafe requests that created/modified inode
If we get a unsafe reply for request that created/modified inode,
add the unsafe request to a list in the newly created/modified
inode. So we can make fsync() wait these unsafe requests.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-11-02 23:36:48 +01:00
Yan, Zheng
5e804ac482 ceph: don't invalidate page cache when inode is no longer used
ceph_check_caps() invalidate page cache when inode is not used
by any open file. This behaviour is not friendly for workload
that repeatly read files.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-11-02 23:36:48 +01:00
Yan, Zheng
48fec5d0a5 ceph: EIO all operations after forced umount
This patch makes try_get_cap_refs() and __do_request() check
if the file system was forced umount, and return -EIO if it was.
This patch also adds a helper function to drops dirty caps and
wakes up blocking operation.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-09-08 23:14:28 +03:00
Yan, Zheng
fc927cd32f ceph: always re-send cap flushes when MDS recovers
commit e548e9b93d makes the kclient
only re-send cap flush once during MDS failover. If the kclient sends
a cap flush after MDS enters reconnect stage but before MDS recovers.
The kclient will skip re-sending the same cap flush when MDS recovers.

This causes problem for newly created inode. The MDS handles cap
flushes before replaying unsafe requests, so it's possible that MDS
find corresponding inode is missing when handling cap flush. The fix
is reverting to old behaviour: always re-send when MDS recovers

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-07-31 11:38:53 +03:00
Yan, Zheng
fdd4e15838 ceph: rework dcache readdir
Previously our dcache readdir code relies on that child dentries in
directory dentry's d_subdir list are sorted by dentry's offset in
descending order. When adding dentries to the dcache, if a dentry
already exists, our readdir code moves it to head of directory
dentry's d_subdir list. This design relies on dcache internals.
Al Viro suggests using ncpfs's approach: keeping array of pointers
to dentries in page cache of directory inode. the validity of those
pointers are presented by directory inode's complete and ordered
flags. When a dentry gets pruned, we clear directory inode's complete
flag in the d_prune() callback. Before moving a dentry to other
directory, we clear the ordered flag for both old and new directory.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-06-25 11:49:32 +03:00
Yan, Zheng
f66fd9f095 ceph: pre-allocate data structure that tracks caps flushing
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-06-25 11:49:31 +03:00
Yan, Zheng
e548e9b93d ceph: re-send flushing caps (which are revoked) in reconnect stage
if flushing caps were revoked, we should re-send the cap flush in
client reconnect stage. This guarantees that MDS processes the cap
flush message before issuing the flushing caps to other client.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-06-25 11:49:31 +03:00