- CEPH_OSD_FLAG_ACK shouldn't be set anymore, so assert on it
- remove support for handling ack replies (OSDs will send ack replies
only if clients request them)
- drop the "do lingering callbacks under osd->lock" logic from
handle_reply() -- lreq->lock is sufficient in all three cases
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
The syntax error was not noticed because dprintk is a macro
and the code is discarded by default.
Reflects ceph.git commit f29b840c64a933b2cb13e3da6f3d785effd73a57.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The is_out() test may require an additional hashing operation, so we
should skip it whenever possible.
Reflects ceph.git commit db107cc7f15cf2481894add325dc93e33479f529.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- Implement wraparound-safe refcount_t and kref_t types based on
generic atomic primitives (Peter Zijlstra)
- Improve and fix the ww_mutex code (Nicolai Hähnle)
- Add self-tests to the ww_mutex code (Chris Wilson)
- Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr
Bueso)
- Micro-optimize the current-task logic all around the core kernel
(Davidlohr Bueso)
- Tidy up after recent optimizations: remove stale code and APIs,
clean up the code (Waiman Long)
- ... plus misc fixes, updates and cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
fork: Fix task_struct alignment
locking/spinlock/debug: Remove spinlock lockup detection code
lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS
lkdtm: Convert to refcount_t testing
kref: Implement 'struct kref' using refcount_t
refcount_t: Introduce a special purpose refcount type
sched/wake_q: Clarify queue reinit comment
sched/wait, rcuwait: Fix typo in comment
locking/mutex: Fix lockdep_assert_held() fail
locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
locking/rwsem: Reinit wake_q after use
locking/rwsem: Remove unnecessary atomic_long_t casts
jump_labels: Move header guard #endif down where it belongs
locking/atomic, kref: Implement kref_put_lock()
locking/ww_mutex: Turn off __must_check for now
locking/atomic, kref: Avoid more abuse
locking/atomic, kref: Use kref_get_unless_zero() more
locking/atomic, kref: Kill kref_sub()
locking/atomic, kref: Add kref_read()
locking/atomic, kref: Add KREF_INIT()
...
To spare checking for "this reply fits into a page, but does it fit
into my buffer?" in some callers, osd_req_op_cls_response_data_pages()
needs to know how big it is.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
With EC overwrites maturing, the kernel client will be getting exposed
to potentially very wide EC pools. While "min(pi->size, X)" works fine
when the cluster is stable and happy, truncating OSD sets interferes
with resend logic (ceph_is_new_interval(), etc). Abort the mapping if
the pool is too wide, assigning the request to the homeless session.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Much like Arlo Guthrie, I decided that one big pile is better than two
little piles.
Reflects ceph.git commit 95c2df6c7e0b22d2ea9d91db500cf8b9441c73ba.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Then add it to the working state. It would be very nice if we didn't
have to take a lock to calculate a crush placement. By moving the
permutation array into the working data, we can treat the CRUSH map as
immutable.
Reflects ceph.git commit cbcd039651c0569551cb90d26ce27e1432671f2a.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This was causing a build failure for openrisc when using musl and
gcc 5.4.0 since the file is not available in the toolchain.
It doesnt seem this is needed and removing it does not cause any build
warnings for me.
Signed-off-by: Stafford Horne <shorne@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
add_to_page_cache_lru() can fails, so the actual pages to read
can be smaller than the initial size of osd request. We need to
update osd request size in that case.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Currently crypto.c gets linux/sched.h indirectly through linux/slab.h
from linux/kasan.h. Include it directly for memalloc_noio_*() inlines.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
... otherwise the crypto stack will align it for us with a GFP_ATOMIC
allocation and a memcpy() -- see skcipher_walk_first().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Since we need to change the implementation, stop exposing internals.
Provide kref_read() to read the current reference count; typically
used for debug messages.
Kills two anti-patterns:
atomic_read(&kref->refcount)
kref->refcount.counter
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
r_safe_completion is currently, and has always been, signaled only if
on-disk ack was requested. It's there for fsync and syncfs, which wait
for in-flight writes to flush - all data write requests set ONDISK.
However, the pool perm check code introduced in 4.2 sends a write
request with only ACK set. An unfortunately timed syncfs can then hang
forever: r_safe_completion won't be signaled because only an unsafe
reply was requested.
We could patch ceph_osdc_sync() to skip !ONDISK write requests, but
that is somewhat incomplete and yet another special case. Instead,
rename this completion to r_done_completion and always signal it when
the OSD client is done with the request, whether unsafe, safe, or
error. This is a bit cleaner and helps with the cancellation code.
Reported-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Include linux/crush/mapper.h in crush/mapper.c to get the prototypes of
crush_find_rule and crush_do_rule which are defined there. This fixes
the following GCC warnings when building with 'W=1':
net/ceph/crush/mapper.c:40:5: warning: no previous prototype for ‘crush_find_rule’ [-Wmissing-prototypes]
net/ceph/crush/mapper.c:793:5: warning: no previous prototype for ‘crush_do_rule’ [-Wmissing-prototypes]
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
[idryomov@gmail.com: corresponding !__KERNEL__ include]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
->get_authorizer(), ->verify_authorizer_reply(), ->sign_message() and
->check_message_signature() shouldn't be doing anything with or on the
connection (like closing it or sending messages).
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
The length of the reply is protocol-dependent - for cephx it's
ceph_x_authorize_reply. Nothing sensible can be passed from the
messenger layer anyway.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
After sending an authorizer (ceph_x_authorize_a + ceph_x_authorize_b),
the client gets back a ceph_x_authorize_reply, which it is supposed to
verify to ensure the authenticity and protect against replay attacks.
The code for doing this is there (ceph_x_verify_authorizer_reply(),
ceph_auth_verify_authorizer_reply() + plumbing), but it is never
invoked by the the messenger.
AFAICT this goes back to 2009, when ceph authentication protocols
support was added to the kernel client in 4e7a5dcd1b ("ceph:
negotiate authentication protocol; implement AUTH_NONE protocol").
The second param of ceph_connection_operations::verify_authorizer_reply
is unused all the way down. Pass 0 to facilitate backporting, and kill
it in the next commit.
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
It's called during inital setup, when everything should be allocated
with GFP_KERNEL.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
- replace an ad-hoc array with a struct
- rename to calc_signature() for consistency
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
It's going to be used as a temporary buffer for in-place en/decryption
with ceph_crypt() instead of on-stack buffers, so rename to enc_buf.
Ensure alignment to avoid GFP_ATOMIC allocations in the crypto stack.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Starting with 4.9, kernel stacks may be vmalloced and therefore not
guaranteed to be physically contiguous; the new CONFIG_VMAP_STACK
option is enabled by default on x86. This makes it invalid to use
on-stack buffers with the crypto scatterlist API, as sg_set_buf()
expects a logical address and won't work with vmalloced addresses.
There isn't a different (e.g. kvec-based) crypto API we could switch
net/ceph/crypto.c to and the current scatterlist.h API isn't getting
updated to accommodate this use case. Allocating a new header and
padding for each operation is a non-starter, so do the en/decryption
in-place on a single pre-assembled (header + data + padding) heap
buffer. This is explicitly supported by the crypto API:
"... the caller may provide the same scatter/gather list for the
plaintext and cipher text. After the completion of the cipher
operation, the plaintext data is replaced with the ciphertext data
in case of an encryption and vice versa for a decryption."
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Since commit 0a990e7093 ("ceph: clean up service ticket decoding"),
th->session_key isn't assigned until everything is decoded.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Pass what's going to be encrypted - that's msg_b, not ticket_blob.
ceph_x_encrypt_buflen() returns the upper bound, so this doesn't change
the maxlen calculation, but makes it a bit clearer.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
osdc->last_linger_id is a counter for lreq->linger_id, which is used
for watch cookies. Starting with a large integer should ease the task
of telling apart kernel and userspace clients.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If your data pool was pool 0, ceph_file_layout_from_legacy()
transform that to -1 unconditionally, which broke upgrades.
We only want do that for a fully zeroed ceph_file_layout,
so that it still maps to a file_layout_t. If any fields
are set, though, we trust the fl_pgpool to be a valid pool.
Fixes: 7627151ea3 ("libceph: define new ceph_file_layout structure")
Link: http://tracker.ceph.com/issues/17825
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This removes the 'write' and 'force' use from get_user_pages_unlocked()
and replaces them with 'gup_flags' to make the use of FOLL_FORCE
explicit in callers as use of this flag can result in surprising
behaviour (and hence bugs) within the mm subsystem.
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use __builtin_clz() supported by GCC and Clang to figure out
how many bits we should shift instead of shifting by a bit
in a loop until the value gets normalized. Improves performance
of this function by up to 3x in worst-case scenario and overall
straw2 performance by ~10%.
Reflects ceph.git commit 110de33ca497d94fc4737e5154d3fe781fa84a0a.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
A static bug finder (EBA) on Linux 4.7:
Double lock in net/ceph/auth.c
second lock at 108: mutex_lock(& ac->mutex); [ceph_auth_build_hello]
after calling from 263: ret = ceph_auth_build_hello(ac, msg_buf, msg_len);
if ! ac->protocol -> true at 262
first lock at 261: mutex_lock(& ac->mutex); [ceph_build_auth]
ceph_auth_build_hello() is never called, because the protocol is always
initialized, whether we are checking existing tickets (in delayed_work())
or getting new ones after invalidation (in invalidate_authorizer()).
Reported-by: Iago Abal <iari@itu.dk>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Export client addr/nonce, so userspace can check if a image is being
blacklisted.
Signed-off-by: Mike Christie <mchristi@redhat.com>
[idryomov@gmail.com: ceph_client_addr(), endianess fix]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add basic support for RBD_FEATURE_EXCLUSIVE_LOCK feature. Maintenance
operations (resize, snapshot create, etc) are offloaded to librbd via
returning -EOPNOTSUPP - librbd should request the lock and execute the
operation.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Tested-by: Mike Christie <mchristi@redhat.com>
Revamp watch code to support retrying watch re-registration:
- add rbd_dev->watch_state for more robust errcb handling
- store watch cookie separately to avoid dereferencing watch_handle
which is set to NULL on unwatch
- move re-register code into a delayed work and retry re-registration
every second, unless the client is blacklisted
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Tested-by: Mike Christie <mchristi@redhat.com>
It's gid / global_id in other places.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reuse ceph_mon_generic_request infrastructure for sending monitor
commands. In particular, add support for 'blacklist add' to prevent
other, non-responsive clients from making further updates.
Signed-off-by: Douglas Fuller <dfuller@redhat.com>
[idryomov@gmail.com: refactor, misc fixes throughout]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Add an interface for the Ceph OSD lock.lock_info method and associated
data structures.
Based heavily on code by Mike Christie <michaelc@cs.wisc.edu>.
Signed-off-by: Douglas Fuller <dfuller@redhat.com>
[idryomov@gmail.com: refactor, misc fixes throughout]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
This patch adds support for rados lock, unlock and break lock.
Based heavily on code by Mike Christie <michaelc@cs.wisc.edu>.
Signed-off-by: Douglas Fuller <dfuller@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Add a convenience function to osd_client to send Ceph OSD
'class' ops. The interface assumes that the request and
reply data each consist of single pages.
Signed-off-by: Douglas Fuller <dfuller@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Add support for this Ceph OSD op, needed to support the RBD exclusive
lock feature.
Signed-off-by: Douglas Fuller <dfuller@redhat.com>
[idryomov@gmail.com: refactor, misc fixes throughout]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Clear up EntityName vs entity_name_t confusion.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
The callback function of call_rcu() just calls a kfree(), so we
can use kfree_rcu() instead of call_rcu() + callback function.
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Fixes the following sparse warning:
net/ceph/mon_client.c:577:6: warning:
symbol 'cancel_generic_request' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In case of error, the function ceph_alloc_page_vector() returns
ERR_PTR() and never returns NULL. The NULL test in the return value
check should be replaced with IS_ERR().
Fixes: 1907920324 ('libceph: support for sending notifies')
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add pool namesapce pointer to struct ceph_file_layout and struct
ceph_object_locator. Pool namespace is used by when mapping object
to PG, it's also used when composing OSD request.
The namespace pointer in struct ceph_file_layout is RCU protected.
So libceph can read namespace without taking lock.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
[idryomov@gmail.com: ceph_oloc_destroy(), misc minor changes]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The data structure is for storing namesapce string. It allows namespace
string to be shared between cephfs inodes with same layout. This data
structure can also be referenced by OSD request.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Define new ceph_file_layout structure and rename old ceph_file_layout
to ceph_file_layout_legacy. This is preparation for adding namespace
to ceph_file_layout structure.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Currently, osd_weight and osd_state fields are updated in the encoding
order. This is wrong, because an incremental map may look like e.g.
new_up_client: { osd=6, addr=... } # set osd_state and addr
new_state: { osd=6, xorstate=EXISTS } # clear osd_state
Suppose osd6's current osd_state is EXISTS (i.e. osd6 is down). After
applying new_up_client, osd_state is changed to EXISTS | UP. Carrying
on with the new_state update, we flip EXISTS and leave osd6 in a weird
"!EXISTS but UP" state. A non-existent OSD is considered down by the
mapping code
2087 for (i = 0; i < pg->pg_temp.len; i++) {
2088 if (ceph_osd_is_down(osdmap, pg->pg_temp.osds[i])) {
2089 if (ceph_can_shift_osds(pi))
2090 continue;
2091
2092 temp->osds[temp->size++] = CRUSH_ITEM_NONE;
and so requests get directed to the second OSD in the set instead of
the first, resulting in OSD-side errors like:
[WRN] : client.4239 192.168.122.21:0/2444980242 misdirected client.4239.1:2827 pg 2.5df899f2 to osd.4 not [1,4,6] in e680/680
and hung rbds on the client:
[ 493.566367] rbd: rbd0: write 400000 at 11cc00000 (0)
[ 493.566805] rbd: rbd0: result -6 xferred 400000
[ 493.567011] blk_update_request: I/O error, dev rbd0, sector 9330688
The fix is to decouple application from the decoding and:
- apply new_weight first
- apply new_state before new_up_client
- twiddle osd_state flags if marking in
- clear out some of the state if osd is destroyed
Fixes: http://tracker.ceph.com/issues/14901
Cc: stable@vger.kernel.org # 3.15+: 6dd74e44dc: libceph: set 'exists' flag for newly up osd
Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Commit d30291b985 ("libceph: variable-sized ceph_object_id") changed
dout()s in what is now encode_request() and ceph_object_locator_to_pg()
to use %pE, mostly to document that, although all rbd and cephfs object
names are NULL-terminated strings, ceph_object_id will handle any RADOS
object name, including the one containing NULs, just fine.
However, it turns out that vbin_printf() can't handle anything but ints
and %s - all %p suffixes are ignored. The buffer %p** points to isn't
recorded, resulting in trash in the messages if the buffer had been
reused by the time bstr_printf() got to it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
handle_reply() may be called twice on the same request: on ack and then
on commit. This occurs on btrfs-formatted OSDs or if cephfs sync write
path is triggered - CEPH_OSD_FLAG_ACK | CEPH_OSD_FLAG_ONDISK.
handle_reply() handles this with the help of done_request().
Fixes: 5aea3dcd50 ("libceph: a major OSD client update")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For the benefit of every single caller, take osdc instead of map.
Also, now that osdc->osdmap can't ever be NULL, drop the check.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ceph_osdc_wait_request() is used when cephfs issues sync IO. In most
cases, the sync IO should be uninterruptible. The fix is use killale
wait function in ceph_osdc_wait_request().
Signed-off-by: Yan, Zheng <zyan@redhat.com>
This patch makes serverl logical caculation functions return bool to
improve readability due to these particular functions only using 0/1
as their return value.
No functional change.
Signed-off-by: Zhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com>
There is now about a dozen CEPH_OSDMAP_* flags. This is a debugging
interface, so just dump in hex instead of spelling each flag out.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This adds the "map check" infrastructure for sending osdmap version
checks on CALC_TARGET_POOL_DNE and completing in-flight requests with
-ENOENT if the target pool doesn't exist or has just been deleted.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION
messages asynchronously and get a callback on completion. Refactor MON
client to allow firing off generic requests asynchronously and add an
async variant of ceph_monc_get_version(). ceph_monc_do_statfs() is
switched over and remains sync.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Implement ceph_osdc_watch_check() to be able to check on status of
watch. Note that the time it takes for a watch/notify event to get
delivered through the notify_wq is taken into account.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Implement ceph_osdc_notify() for sending notifies.
Due to the fact that the current messenger can't do read-in into
pagelists (it can only do write-out from them), I had to go with a page
vector for a NOTIFY_COMPLETE payload, for now.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This adds support and switches rbd to a new, more reliable version of
watch/notify protocol. As with the OSD client update, this is mostly
about getting the right structures linked into the right places so that
reconnects are properly sent when needed. watch/notify v2 also
requires sending regular pings to the OSDs - send_linger_ping().
A major change from the old watch/notify implementation is the
introduction of ceph_osd_linger_request - linger requests no longer
piggy back on ceph_osd_request. ceph_osd_event has been merged into
ceph_osd_linger_request.
All the details are now hidden within libceph, the interface consists
of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack().
ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep
the lifetime management simple.
ceph_osdc_notify_ack() accepts an optional data payload, which is
relayed back to the notifier.
Portions of this patch are loosely based on work by Douglas Fuller
<dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The unwatch timeout is currently implemented in rbd. With
watch/unwatch code moving into libceph, we are going to need
a ceph_osdc_wait_request() variant with a timeout.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This is a major sync up, up to ~Jewel. The highlights are:
- per-session request trees (vs a global per-client tree)
- per-session locking (vs a global per-client rwlock)
- homeless OSD session
- no ad-hoc global per-client lists
- support for pool quotas
- foundation for watch/notify v2 support
- foundation for map check (pool deletion detection) support
The switchover is incomplete: lingering requests can be setup and
teared down but aren't ever reestablished. This functionality is
restored with the introduction of the new lingering infrastructure
(ceph_osd_linger_request, linger_work, etc) in a later commit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
OSD client is getting moved from the big per-client lock to a set of
per-session locks. The big rwlock would only be held for read most of
the time, so a global osdc->osd_lru needs additional protection.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
create_osd() is called way too deep in the stack to be able to error
out in a sane way; a failing create_osd() just messes everything up.
The current req_notarget list solution is broken - the list is never
traversed as it's not entirely clear when to do it, I guess.
If we were to start traversing it at regular intervals and retrying
each request, we wouldn't be far off from what __GFP_NOFAIL is doing,
so allocate OSD sessions with __GFP_NOFAIL, at least until we come up
with a better fix.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Separate osdmap handling from decoding and iterating over a bag of maps
in a fresh MOSDMap message. This sets up the scene for the updated OSD
client.
Of particular importance here is the addition of pi->was_full, which
can be used to answer "did this pool go full -> not-full in this map?".
This is the key bit for supporting pool quotas.
We won't be able to downgrade map_sem for much longer, so drop
downgrade_write().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This leads to a simpler osdmap handling code, particularly when dealing
with pi->was_full, which is introduced in a later commit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Both homeless OSD sessions and watch/notify v2, introduced in later
commits, require periodic ticks which don't depend on ->num_requests.
Schedule the initial tick from ceph_osdc_init() and reschedule from
handle_timeout() unconditionally.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_osdc_stop() isn't called if ceph_osdc_init() fails, so we end up
with handle_osds_timeout() running on invalid memory if any one of the
allocations fails. Call schedule_delayed_work() after everything is
setup, just before returning.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If you specify ACK | ONDISK and set ->r_unsafe_callback, both
->r_callback and ->r_unsafe_callback(true) are called on ack. This is
very confusing. Redo this so that only one of them is called:
->r_unsafe_callback(true), on ack
->r_unsafe_callback(false), on commit
or
->r_callback, on ack|commit
Decode everything in decode_MOSDOpReply() to reduce clutter.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
finish_read(), its only user, uses it to get to hdr.data_len, which is
what ->r_result is set to on success. This gains us the ability to
safely call callbacks from contexts other than reply, e.g. map check.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The crux of this is getting rid of ceph_osdc_build_request(), so that
MOSDOp can be encoded not before but after calc_target() calculates the
actual target. Encoding now happens within ceph_osdc_start_request().
Also nuked is the accompanying bunch of pointers into the encoded
buffer that was used to update fields on each send - instead, the
entire front is re-encoded. If we want to support target->name_len !=
base->name_len in the future, there is no other way, because oid is
surrounded by other fields in the encoded buffer.
Encoding OSD ops and adding data items to the request message were
mixed together in osd_req_encode_op(). While we want to re-encode OSD
ops, we don't want to add duplicate data items to the message when
resending, so all call to ceph_osdc_msg_data_add() are factored out
into a new setup_request_data().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Replace __calc_request_pg() and most of __map_request() with
calc_target() and start using req->r_t.
ceph_osdc_build_request() however still encodes base_oid, because it's
called before calc_target() is and target_oid is empty at that point in
time; a printf in osdc_show() also shows base_oid. This is fixed in
"libceph: switch to calc_target(), part 2".
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Introduce ceph_osd_request_target, containing all mapping-related
fields of ceph_osd_request and calc_target() for calculating mappings
and populating it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add and decode pi->min_size and pi->last_force_request_resend. These
are going to be used by calc_target().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
calc_target() code is going to need to know how to compare PGs. Take
lhs and rhs pgid by const * while at it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to
emphasise that it returns acting primary.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Knowning just acting set isn't enough, we need to be able to record up
set as well to detect interval changes. This means returning (up[],
up_len, up_primary, acting[], acting_len, acting_primary) and passing
it around. Introduce and switch to ceph_osds to help with that.
Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return
both up and acting sets from it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg(). Emphasise
that returned is raw PG and return -ENOENT instead of -EIO if the pool
doesn't exist.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Given
struct foo {
u64 id;
struct rb_node bar_node;
};
generate insert_bar(), erase_bar() and lookup_bar() functions with
DEFINE_RB_FUNCS(bar, struct foo, id, bar_node)
The key is assumed to be an integer (u64, int, etc), compared with
< and >. nodefld has to be initialized with RB_CLEAR_NODE().
Start using it for MDS, MON and OSD requests and OSD sessions.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently ceph_object_id can hold object names of up to 100
(CEPH_MAX_OID_NAME_LEN) characters. This is enough for all use cases,
expect one - long rbd image names:
- a format 1 header is named "<imgname>.rbd"
- an object that points to a format 2 header is named "rbd_id.<imgname>"
We operate on these potentially long-named objects during rbd map, and,
for format 1 images, during header refresh. (A format 2 header name is
a small system-generated string.)
Lift this 100 character limit by making ceph_object_id be able to point
to an externally-allocated string. Apart from being able to work with
almost arbitrarily-long named objects, this allows us to reduce the
size of ceph_object_id from >100 bytes to 64 bytes.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For a message pool message, preallocate a page, just like we do for
osd_op. For a normal message, take ceph_object_id into account and
don't bother subtracting CEPH_OSD_SLAB_OPS ceph_osd_ops.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The size of ->r_request and ->r_reply messages depends on the size of
the object name (ceph_object_id), while the size of ceph_osd_request is
fixed. Move message allocation into a separate function that would
have to be called after ceph_object_id and ceph_object_locator (which
is also going to become variable in size with RADOS namespaces) have
been filled in:
req = ceph_osdc_alloc_request(...);
<fill in req->r_base_oid>
<fill in req->r_base_oloc>
ceph_osdc_alloc_messages(req);
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_osdc_build_request() is going away. Grab snapc and initialize
->r_snapid in ceph_osdc_alloc_request().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Starting the kernel client with cephx disabled and then enabling cephx
and restarting userspace daemons can result in a crash:
[262671.478162] BUG: unable to handle kernel paging request at ffffebe000000000
[262671.531460] IP: [<ffffffff811cd04a>] kfree+0x5a/0x130
[262671.584334] PGD 0
[262671.635847] Oops: 0000 [#1] SMP
[262672.055841] CPU: 22 PID: 2961272 Comm: kworker/22:2 Not tainted 4.2.0-34-generic #39~14.04.1-Ubuntu
[262672.162338] Hardware name: Dell Inc. PowerEdge R720/068CDY, BIOS 2.4.3 07/09/2014
[262672.268937] Workqueue: ceph-msgr con_work [libceph]
[262672.322290] task: ffff88081c2d0dc0 ti: ffff880149ae8000 task.ti: ffff880149ae8000
[262672.428330] RIP: 0010:[<ffffffff811cd04a>] [<ffffffff811cd04a>] kfree+0x5a/0x130
[262672.535880] RSP: 0018:ffff880149aeba58 EFLAGS: 00010286
[262672.589486] RAX: 000001e000000000 RBX: 0000000000000012 RCX: ffff8807e7461018
[262672.695980] RDX: 000077ff80000000 RSI: ffff88081af2be04 RDI: 0000000000000012
[262672.803668] RBP: ffff880149aeba78 R08: 0000000000000000 R09: 0000000000000000
[262672.912299] R10: ffffebe000000000 R11: ffff880819a60e78 R12: ffff8800aec8df40
[262673.021769] R13: ffffffffc035f70f R14: ffff8807e5b138e0 R15: ffff880da9785840
[262673.131722] FS: 0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000
[262673.245377] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[262673.303281] CR2: ffffebe000000000 CR3: 0000000001c0d000 CR4: 00000000001406e0
[262673.417556] Stack:
[262673.472943] ffff880149aeba88 ffff88081af2be04 ffff8800aec8df40 ffff88081af2be04
[262673.583767] ffff880149aeba98 ffffffffc035f70f ffff880149aebac8 ffff8800aec8df00
[262673.694546] ffff880149aebac8 ffffffffc035c89e ffff8807e5b138e0 ffff8805b047f800
[262673.805230] Call Trace:
[262673.859116] [<ffffffffc035f70f>] ceph_x_destroy_authorizer+0x1f/0x50 [libceph]
[262673.968705] [<ffffffffc035c89e>] ceph_auth_destroy_authorizer+0x3e/0x60 [libceph]
[262674.078852] [<ffffffffc0352805>] put_osd+0x45/0x80 [libceph]
[262674.134249] [<ffffffffc035290e>] remove_osd+0xae/0x140 [libceph]
[262674.189124] [<ffffffffc0352aa3>] __reset_osd+0x103/0x150 [libceph]
[262674.243749] [<ffffffffc0354703>] kick_requests+0x223/0x460 [libceph]
[262674.297485] [<ffffffffc03559e2>] ceph_osdc_handle_map+0x282/0x5e0 [libceph]
[262674.350813] [<ffffffffc035022e>] dispatch+0x4e/0x720 [libceph]
[262674.403312] [<ffffffffc034bd91>] try_read+0x3d1/0x1090 [libceph]
[262674.454712] [<ffffffff810ab7c2>] ? dequeue_entity+0x152/0x690
[262674.505096] [<ffffffffc034cb1b>] con_work+0xcb/0x1300 [libceph]
[262674.555104] [<ffffffff8108fb3e>] process_one_work+0x14e/0x3d0
[262674.604072] [<ffffffff810901ea>] worker_thread+0x11a/0x470
[262674.652187] [<ffffffff810900d0>] ? rescuer_thread+0x310/0x310
[262674.699022] [<ffffffff810957a2>] kthread+0xd2/0xf0
[262674.744494] [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
[262674.789543] [<ffffffff817bd81f>] ret_from_fork+0x3f/0x70
[262674.834094] [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
What happens is the following:
(1) new MON session is established
(2) old "none" ac is destroyed
(3) new "cephx" ac is constructed
...
(4) old OSD session (w/ "none" authorizer) is put
ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer)
osd->o_auth.authorizer in the "none" case is just a bare pointer into
ac, which contains a single static copy for all services. By the time
we get to (4), "none" ac, freed in (2), is long gone. On top of that,
a new vtable installed in (3) points us at ceph_x_destroy_authorizer(),
so we end up trying to destroy a "none" authorizer with a "cephx"
destructor operating on invalid memory!
To fix this, decouple authorizer destruction from ac and do away with
a single static "none" authorizer by making a copy for each OSD or MDS
session. Authorizers themselves are independent of ac and so there is
no reason for destroy_authorizer() to be an ac op. Make it an op on
the authorizer itself by turning ceph_authorizer into a real struct.
Fixes: http://tracker.ceph.com/issues/15447
Reported-by: Alan Zhang <alan.zhang@linux.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull Ceph updates from Sage Weil:
"There is quite a bit here, including some overdue refactoring and
cleanup on the mon_client and osd_client code from Ilya, scattered
writeback support for CephFS and a pile of bug fixes from Zheng, and a
few random cleanups and fixes from others"
[ I already decided not to pull this because of it having been rebased
recently, but ended up changing my mind after all. Next time I'll
really hold people to it. Oh well. - Linus ]
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits)
libceph: use KMEM_CACHE macro
ceph: use kmem_cache_zalloc
rbd: use KMEM_CACHE macro
ceph: use lookup request to revalidate dentry
ceph: kill ceph_get_dentry_parent_inode()
ceph: fix security xattr deadlock
ceph: don't request vxattrs from MDS
ceph: fix mounting same fs multiple times
ceph: remove unnecessary NULL check
ceph: avoid updating directory inode's i_size accidentally
ceph: fix race during filling readdir cache
libceph: use sizeof_footer() more
ceph: kill ceph_empty_snapc
ceph: fix a wrong comparison
ceph: replace CURRENT_TIME by current_fs_time()
ceph: scattered page writeback
libceph: add helper that duplicates last extent operation
libceph: enable large, variable-sized OSD requests
libceph: osdc->req_mempool should be backed by a slab pool
libceph: make r_request msg_size calculation clearer
...
Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Don't open-code sizeof_footer() in read_partial_message() and
ceph_msg_revoke(). Also, after switching to sizeof_footer(), it's now
possible to use con_out_kvec_add() in prepare_write_message_footer().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
This helper duplicates last extent operation in OSD request, then
adjusts the new extent operation's offset and length. The helper
is for scatterd page writeback, which adds nonconsecutive dirty
pages to single OSD request.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Turn r_ops into a flexible array member to enable large, consisting of
up to 16 ops, OSD requests. The use case is scattered writeback in
cephfs and, as far as the kernel client is concerned, 16 is just a made
up number.
r_ops had size 3 for copyup+hint+write, but copyup is really a special
case - it can only happen once. ceph_osd_request_cache is therefore
stuffed with num_ops=2 requests, anything bigger than that is allocated
with kmalloc(). req_mempool is backed by ceph_osd_request_cache, which
means either num_ops=1 or num_ops=2 for use_mempool=true - all existing
users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with
that.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_osd_request_cache was introduced a long time ago. Also, osd_req
is about to get a flexible array member, which ceph_osd_request_cache
is going to be aware of.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Although msg_size is calculated correctly, the terms are grouped in
a misleading way - snaps appears to not have room for a u32 length.
Move calculation closer to its use and regroup terms.
No functional change.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This avoids defining large array of r_reply_op_{len,result} in
in struct ceph_osd_request.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This can happen if __close_session() in ceph_monc_stop() races with
a connection reset. We need to ignore such faults, otherwise it's
likely we would take !hunting, call __schedule_delayed() and end up
with delayed_work() executing on invalid memory, among other things.
The (two!) con->private tests are useless, as nothing ever clears
con->private. Nuke them.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Doing __schedule_delayed() in the hunting branch is pointless, as the
tick will have already been scheduled by then.
What we need to do instead is *reschedule* it in the !hunting branch,
after reopen_session() changes hunt_mult, which affects the delay.
This helps with spacing out connection attempts and avoiding things
like two back-to-back attempts followed by a longer period of waiting
around.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
hunting is now set in __open_session() and cleared in finish_hunting(),
instead of all around. The "session lost" message is printed not only
on connection resets, but also on keepalive timeouts.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Unless we are in the process of setting up a client (i.e. connecting to
the monitor cluster for the first time), apply a backoff: every time we
want to reopen a session, increase our timeout by a multiple (currently
2); when we complete the connection, reduce that multipler by 50%.
Mirrors ceph.git commit 794c86fd289bd62a35ed14368fa096c46736e9a2.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Split ping interval and ping timeout: ping interval is 10s; keepalive
timeout is 30s.
Make monc_ping_timeout a constant while at it - it's not actually
exported as a mount option (and the rest of tick-related settings won't
be either), so it's got no place in ceph_options.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Don't try to reconnect to the same monitor when we fail to establish
a session within a timeout or it's lost.
For that, pick_new_mon() needs to see the old value of cur_mon, so
don't clear it in __close_session() - all calls to __close_session()
but one are followed by __open_session() anyway. __open_session() is
only called when a new session needs to be established, so the "already
open?" branch, which is now in the way, is simply dropped.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
It is currently hard-coded in the mon_client that mdsmap and monmap
subs are continuous, while osdmap sub is always "onetime". To better
handle full clusters/pools in the osd_client, we need to be able to
issue continuous osdmap subs. Revamp subs code to allow us to specify
for each sub whether it should be continuous or not.
Although not strictly required for the above, switch to SUBSCRIBE2
protocol while at it, eliminating the ambiguity between a request for
"every map since X" and a request for "just the latest" when we don't
have a map yet (i.e. have epoch 0). SUBSCRIBE2 feature bit is now
required - it's been supported since pre-argonaut (2010).
Move "got mdsmap" call to the end of ceph_mdsc_handle_map() - calling
in before we validate the epoch and successfully install the new map
can mess up mon_client sub state.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Coupling hunting state with subscribe state is not a good idea. Clear
hunting when we complete the authentication handshake.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Our debugfs dir name is a concatenation of cluster fsid and client
unique ID ("global_id"). It used to be the case that we learned
global_id first, nowadays we always learn fsid first - the monmap is
sent before any auth replies are. ceph_debugfs_client_init() call in
ceph_monc_handle_map() is therefore never executed and can be removed.
Its counterpart in handle_auth_reply() doesn't really belong there
either: having to do monc->client and unlocking early to work around
lockdep is a testament to that. Move it into __ceph_open_session(),
where it can be called unconditionally.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Pull x86 protection key support from Ingo Molnar:
"This tree adds support for a new memory protection hardware feature
that is available in upcoming Intel CPUs: 'protection keys' (pkeys).
There's a background article at LWN.net:
https://lwn.net/Articles/643797/
The gist is that protection keys allow the encoding of
user-controllable permission masks in the pte. So instead of having a
fixed protection mask in the pte (which needs a system call to change
and works on a per page basis), the user can map a (handful of)
protection mask variants and can change the masks runtime relatively
cheaply, without having to change every single page in the affected
virtual memory range.
This allows the dynamic switching of the protection bits of large
amounts of virtual memory, via user-space instructions. It also
allows more precise control of MMU permission bits: for example the
executable bit is separate from the read bit (see more about that
below).
This tree adds the MM infrastructure and low level x86 glue needed for
that, plus it adds a high level API to make use of protection keys -
if a user-space application calls:
mmap(..., PROT_EXEC);
or
mprotect(ptr, sz, PROT_EXEC);
(note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
this special case, and will set a special protection key on this
memory range. It also sets the appropriate bits in the Protection
Keys User Rights (PKRU) register so that the memory becomes unreadable
and unwritable.
So using protection keys the kernel is able to implement 'true'
PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
PROT_READ as well. Unreadable executable mappings have security
advantages: they cannot be read via information leaks to figure out
ASLR details, nor can they be scanned for ROP gadgets - and they
cannot be used by exploits for data purposes either.
We know about no user-space code that relies on pure PROT_EXEC
mappings today, but binary loaders could start making use of this new
feature to map binaries and libraries in a more secure fashion.
There is other pending pkeys work that offers more high level system
call APIs to manage protection keys - but those are not part of this
pull request.
Right now there's a Kconfig that controls this feature
(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
(like most x86 CPU feature enablement code that has no runtime
overhead), but it's not user-configurable at the moment. If there's
any serious problem with this then we can make it configurable and/or
flip the default"
* 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
mm/core, x86/mm/pkeys: Add execute-only protection keys support
x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
x86/mm/pkeys: Allow kernel to modify user pkey rights register
x86/fpu: Allow setting of XSAVE state
x86/mm: Factor out LDT init from context init
mm/core, x86/mm/pkeys: Add arch_validate_pkey()
mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
x86/mm/pkeys: Add Kconfig prompt to existing config option
x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
x86/mm/pkeys: Dump PKRU with other kernel registers
mm/core, x86/mm/pkeys: Differentiate instruction fetches
x86/mm/pkeys: Optimize fault handling in access_error()
mm/core: Do not enforce PKEY permissions on remote mm access
um, pkeys: Add UML arch_*_access_permitted() methods
mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
x86/mm/gup: Simplify get_user_pages() PTE bit handling
...
Pull crypto update from Herbert Xu:
"Here is the crypto update for 4.6:
API:
- Convert remaining crypto_hash users to shash or ahash, also convert
blkcipher/ablkcipher users to skcipher.
- Remove crypto_hash interface.
- Remove crypto_pcomp interface.
- Add crypto engine for async cipher drivers.
- Add akcipher documentation.
- Add skcipher documentation.
Algorithms:
- Rename crypto/crc32 to avoid name clash with lib/crc32.
- Fix bug in keywrap where we zero the wrong pointer.
Drivers:
- Support T5/M5, T7/M7 SPARC CPUs in n2 hwrng driver.
- Add PIC32 hwrng driver.
- Support BCM6368 in bcm63xx hwrng driver.
- Pack structs for 32-bit compat users in qat.
- Use crypto engine in omap-aes.
- Add support for sama5d2x SoCs in atmel-sha.
- Make atmel-sha available again.
- Make sahara hashing available again.
- Make ccp hashing available again.
- Make sha1-mb available again.
- Add support for multiple devices in ccp.
- Improve DMA performance in caam.
- Add hashing support to rockchip"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits)
crypto: qat - remove redundant arbiter configuration
crypto: ux500 - fix checks of error code returned by devm_ioremap_resource()
crypto: atmel - fix checks of error code returned by devm_ioremap_resource()
crypto: qat - Change the definition of icp_qat_uof_regtype
hwrng: exynos - use __maybe_unused to hide pm functions
crypto: ccp - Add abstraction for device-specific calls
crypto: ccp - CCP versioning support
crypto: ccp - Support for multiple CCPs
crypto: ccp - Remove check for x86 family and model
crypto: ccp - memset request context to zero during import
lib/mpi: use "static inline" instead of "extern inline"
lib/mpi: avoid assembler warning
hwrng: bcm63xx - fix non device tree compatibility
crypto: testmgr - allow rfc3686 aes-ctr variants in fips mode.
crypto: qat - The AE id should be less than the maximal AE number
lib/mpi: Endianness fix
crypto: rockchip - add hash support for crypto engine in rk3288
crypto: xts - fix compile errors
crypto: doc - add skcipher API documentation
crypto: doc - update AEAD AD handling
...
Commit d15f9d694b ("libceph: check data_len in ->alloc_msg()")
mistakenly bumped the log level on the "tid %llu unknown, skipping"
message. Turn it back into a dout() - stray replies are perfectly
normal when OSDs flap, crash, get killed for testing purposes, etc.
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
ceph_msg_footer is 21 bytes long, while ceph_msg_footer_old is only 13.
Don't skip too much when CEPH_FEATURE_MSG_AUTH isn't negotiated.
Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
The contract between try_read() and try_write() is that when called
each processes as much data as possible. When instructed by osd_client
to skip a message, try_read() is violating this contract by returning
after receiving and discarding a single message instead of checking for
more. try_write() then gets a chance to write out more requests,
generating more replies/skips for try_read() to handle, forcing the
messenger into a starvation loop.
Cc: stable@vger.kernel.org # 3.10+
Reported-by: Varada Kari <Varada.Kari@sandisk.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Varada Kari <Varada.Kari@sandisk.com>
Reviewed-by: Alex Elder <elder@linaro.org>
We will soon modify the vanilla get_user_pages() so it can no
longer be used on mm/tasks other than 'current/current->mm',
which is by far the most common way it is called. For now,
we allow the old-style calls, but warn when they are used.
(implemented in previous patch)
This patch switches all callers of:
get_user_pages()
get_user_pages_unlocked()
get_user_pages_locked()
to stop passing tsk/mm so they will no longer see the warnings.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: jack@suse.cz
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Empty request_redirect_t (struct ceph_request_redirect in the kernel
client) is now encoded with a bool. NEW_OSDOPREPLY_ENCODING feature
bit overlaps with already supported CRUSH_TUNABLES5.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Add a tunable to fix the bug that chooseleaf may cause unnecessary pg
migrations when some device fails.
Reflects ceph.git commit fdb3f664448e80d984470f32f04e2e6f03ab52ec.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Ensure that the take argument is a valid bucket ID before indexing the
buckets array.
Reflects ceph.git commit 93ec538e8a667699876b72459b8ad78966d89c61.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
We were indexing the buckets array without verifying the index was
within the [0,max_buckets) range. This could happen because
a multistep rule does not have enough buckets and has CRUSH_ITEM_NONE
for an intermediate result, which would feed in CRUSH_ITEM_NONE and
make us crash.
Reflects ceph.git commit 976a24a326da8931e689ee22fce35feab5b67b76.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
This patch removes the unnecessary ivsize variabls as they always
have the value of AES_BLOCK_SIZE.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
MClientMount{,Ack} are long gone. The receipt of bare monmap doesn't
actually indicate a mount success as we are yet to authenticate at that
point in time.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
With it gone, no need to preserve ceph_timespec in process_one_ticket()
either.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
If we fault due to authentication, we invalidate the service ticket we
have and request a new one - the idea being that if a service rejected
our authorizer, it must have expired, despite mon_client's attempts at
periodic renewal. (The other possibility is that our ticket is too new
and the service hasn't gotten it yet, in which case invalidating isn't
necessary but doesn't hurt.)
Invalidating just the service ticket is not enough, though. If we
assume a failure on mon_client's part to renew a service ticket, we
have to assume the same for the AUTH ticket. If our AUTH ticket is
bad, we won't get any service tickets no matter how hard we try, so
invalidate AUTH ticket along with the service ticket.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Back in 2013, commit 4b8e8b5d78 ("libceph: fix authorizer
invalidation") tried to fix authorizer invalidation issues by clearing
validity field. However, nothing ever consults this field, so it
doesn't force us to request any new secrets in any way and therefore we
never get out of the exponential backoff mode:
[ 129.973812] libceph: osd2 192.168.122.1:6810 connect authorization failure
[ 130.706785] libceph: osd2 192.168.122.1:6810 connect authorization failure
[ 131.710088] libceph: osd2 192.168.122.1:6810 connect authorization failure
[ 133.708321] libceph: osd2 192.168.122.1:6810 connect authorization failure
[ 137.706598] libceph: osd2 192.168.122.1:6810 connect authorization failure
...
AFAICT this was the case at the time 4b8e8b5d78 was merged, too.
Using timespec solely as a bool isn't nice, so introduce a new have_key
flag, specifically for this purpose.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Commit 20e55c4cc7 ("libceph: clear messenger auth_retry flag when we
authenticate") got us only half way there. We clear the flag if the
second attempt succeeds, but it also needs to be cleared if that
attempt fails, to allow for the exponential backoff to kick in.
Otherwise, if ->should_authenticate() thinks our keys are valid, we
will busy loop, incrementing auth_retry to no avail:
process_connect ffff880079a63830 got BADAUTHORIZER attempt 1
process_connect ffff880079a63830 got BADAUTHORIZER attempt 2
process_connect ffff880079a63830 got BADAUTHORIZER attempt 3
process_connect ffff880079a63830 got BADAUTHORIZER attempt 4
process_connect ffff880079a63830 got BADAUTHORIZER attempt 5
...
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
There are a number of problems with revoking a "was sending" message:
(1) We never make any attempt to revoke data - only kvecs contibute to
con->out_skip. However, once the header (envelope) is written to the
socket, our peer learns data_len and sets itself to expect at least
data_len bytes to follow front or front+middle. If ceph_msg_revoke()
is called while the messenger is sending message's data portion,
anything we send after that call is counted by the OSD towards the now
revoked message's data portion. The effects vary, the most common one
is the eventual hang - higher layers get stuck waiting for the reply to
the message that was sent out after ceph_msg_revoke() returned and
treated by the OSD as a bunch of data bytes. This is what Matt ran
into.
(2) Flat out zeroing con->out_kvec_bytes worth of bytes to handle kvecs
is wrong. If ceph_msg_revoke() is called before the tag is sent out or
while the messenger is sending the header, we will get a connection
reset, either due to a bad tag (0 is not a valid tag) or a bad header
CRC, which kind of defeats the purpose of revoke. Currently the kernel
client refuses to work with header CRCs disabled, but that will likely
change in the future, making this even worse.
(3) con->out_skip is not reset on connection reset, leading to one or
more spurious connection resets if we happen to get a real one between
con->out_skip is set in ceph_msg_revoke() and before it's cleared in
write_partial_skip().
Fixing (1) and (3) is trivial. The idea behind fixing (2) is to never
zero the tag or the header, i.e. send out tag+header regardless of when
ceph_msg_revoke() is called. That way the header is always correct, no
unnecessary resets are induced and revoke stands ready for disabled
CRCs. Since ceph_msg_revoke() rips out con->out_msg, introduce a new
"message out temp" and copy the header into it before sending.
Cc: stable@vger.kernel.org # 4.0+
Reported-by: Matt Conner <matt.conner@keepertech.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Matt Conner <matt.conner@keepertech.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
[idryomov@gmail.com: nuke call to list_splice_init() as well]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
list_next_entry has been defined in list.h, so I replace list_entry_next
with it.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Pull Ceph updates from Sage Weil:
"There are several patches from Ilya fixing RBD allocation lifecycle
issues, a series adding a nocephx_sign_messages option (and associated
bug fixes/cleanups), several patches from Zheng improving the
(directory) fsync behavior, a big improvement in IO for direct-io
requests when striping is enabled from Caifeng, and several other
small fixes and cleanups"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
libceph: clear msg->con in ceph_msg_release() only
libceph: add nocephx_sign_messages option
libceph: stop duplicating client fields in messenger
libceph: drop authorizer check from cephx msg signing routines
libceph: msg signing callouts don't need con argument
libceph: evaluate osd_req_op_data() arguments only once
ceph: make fsync() wait unsafe requests that created/modified inode
ceph: add request to i_unsafe_dirops when getting unsafe reply
libceph: introduce ceph_x_authorizer_cleanup()
ceph: don't invalidate page cache when inode is no longer used
rbd: remove duplicate calls to rbd_dev_mapping_clear()
rbd: set device_type::release instead of device::release
rbd: don't free rbd_dev outside of the release callback
rbd: return -ENOMEM instead of pool id if rbd_dev_create() fails
libceph: use local variable cursor instead of &msg->cursor
libceph: remove con argument in handle_reply()
ceph: combine as many iovec as possile into one OSD request
ceph: fix message length computation
ceph: fix a comment typo
rbd: drop null test before destroy functions
Pull security subsystem update from James Morris:
"This is mostly maintenance updates across the subsystem, with a
notable update for TPM 2.0, and addition of Jarkko Sakkinen as a
maintainer of that"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (40 commits)
apparmor: clarify CRYPTO dependency
selinux: Use a kmem_cache for allocation struct file_security_struct
selinux: ioctl_has_perm should be static
selinux: use sprintf return value
selinux: use kstrdup() in security_get_bools()
selinux: use kmemdup in security_sid_to_context_core()
selinux: remove pointless cast in selinux_inode_setsecurity()
selinux: introduce security_context_str_to_sid
selinux: do not check open perm on ftruncate call
selinux: change CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE default
KEYS: Merge the type-specific data with the payload data
KEYS: Provide a script to extract a module signature
KEYS: Provide a script to extract the sys cert list from a vmlinux file
keys: Be more consistent in selection of union members used
certs: add .gitignore to stop git nagging about x509_certificate_list
KEYS: use kvfree() in add_key
Smack: limited capability for changing process label
TPM: remove unnecessary little endian conversion
vTPM: support little endian guests
char: Drop owner assignment from i2c_driver
...
The following bit in ceph_msg_revoke_incoming() is unsafe:
struct ceph_connection *con = msg->con;
if (!con)
return;
mutex_lock(&con->mutex);
<more msg->con use>
There is nothing preventing con from getting destroyed right after
msg->con test. One easy way to reproduce this is to disable message
signing only on the server side and try to map an image. The system
will go into a
libceph: read_partial_message ffff880073f0ab68 signature check failed
libceph: osd0 192.168.255.155:6801 bad crc/signature
libceph: read_partial_message ffff880073f0ab68 signature check failed
libceph: osd0 192.168.255.155:6801 bad crc/signature
loop which has to be interrupted with Ctrl-C. Hit Ctrl-C and you are
likely to end up with a random GP fault if the reset handler executes
"within" ceph_msg_revoke_incoming():
<yet another reply w/o a signature>
...
<Ctrl-C>
rbd_obj_request_end
ceph_osdc_cancel_request
__unregister_request
ceph_osdc_put_request
ceph_msg_revoke_incoming
...
osd_reset
__kick_osd_requests
__reset_osd
remove_osd
ceph_con_close
reset_connection
<clear con->in_msg->con>
<put con ref>
put_osd
<free osd/con>
<msg->con use> <-- !!!
If ceph_msg_revoke_incoming() executes "before" the reset handler,
osd/con will be leaked because ceph_msg_revoke_incoming() clears
con->in_msg but doesn't put con ref, while reset_connection() only puts
con ref if con->in_msg != NULL.
The current msg->con scheme was introduced by commits 38941f8031
("libceph: have messages point to their connection") and 92ce034b5a
("libceph: have messages take a connection reference"), which defined
when messages get associated with a connection and when that
association goes away. Part of the problem is that this association is
supposed to go away in much too many places; closing this race entirely
requires either a rework of the existing or an addition of a new layer
of synchronization.
In lieu of that, we can make it *much* less likely to hit by
disassociating messages only on their destruction and resend through
a different connection. This makes the code simpler and is probably
a good thing to do regardless - this patch adds a msg_con_set() helper
which is is called from only three places: ceph_con_send() and
ceph_con_in_msg_alloc() to set msg->con and ceph_msg_release() to clear
it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Support for message signing was merged into 3.19, along with
nocephx_require_signatures option. But, all that option does is allow
the kernel client to talk to clusters that don't support MSG_AUTH
feature bit. That's pretty useless, given that it's been supported
since bobtail.
Meanwhile, if one disables message signing on the server side with
"cephx sign messages = false", it becomes impossible to use the kernel
client since it expects messages to be signed if MSG_AUTH was
negotiated. Add nocephx_sign_messages option to support this use case.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
supported_features and required_features serve no purpose at all, while
nocrc and tcp_nodelay belong to ceph_options::flags.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
I don't see a way for auth->authorizer to be NULL in
ceph_x_sign_message() or ceph_x_check_message_signature().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
We can use msg->con instead - at the point we sign an outgoing message
or check the signature on the incoming one, msg->con is always set. We
wouldn't know how to sign a message without an associated session (i.e.
msg->con == NULL) and being able to sign a message using an explicitly
provided authorizer is of no use.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This patch changes the osd_req_op_data() macro to not evaluate
arguments more than once in order to follow the kernel coding style.
Signed-off-by: Ioana Ciornei <ciorneiioana@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
[idryomov@gmail.com: changelog, formatting]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Commit ae385eaf24 ("libceph: store session key in cephx authorizer")
introduced ceph_x_authorizer::session_key, but didn't update all the
exit/error paths. Introduce ceph_x_authorizer_cleanup() to encapsulate
ceph_x_authorizer cleanup and switch to it. This fixes ceph_x_destroy(),
which currently always leaks key and ceph_x_build_authorizer() error
paths.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Use local variable cursor in place of &msg->cursor in
read_partial_msg_data() and write_partial_msg_data().
Signed-off-by: Shraddha Barke <shraddha.6596@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Since handle_reply() does not use its con argument, remove it.
Signed-off-by: Shraddha Barke <shraddha.6596@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This covers only the simplest case - an object size sized write, but
it's still useful in tiering setups when EC is used for the base tier
as writefull op can be proxied, saving an object promotion.
Even though updating ceph_osdc_new_request() to allow writefull should
just be a matter of fixing an assert, I didn't do it because its only
user is cephfs. All other sites were updated.
Reflects ceph.git commit 7bfb7f9025a8ee0d2305f49bf0336d2424da5b5b.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
This
struct ceph_timespec ceph_ts;
...
con_out_kvec_add(con, sizeof(ceph_ts), &ceph_ts);
wraps ceph_ts into a kvec and adds it to con->out_kvec array, yet
ceph_ts becomes invalid on return from prepare_write_keepalive(). As
a result, we send out bogus keepalive2 stamps. Fix this by encoding
into a ceph_timespec member, similar to how acks are read and written.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Pull Ceph update from Sage Weil:
"There are a few fixes for snapshot behavior with CephFS and support
for the new keepalive protocol from Zheng, a libceph fix that affects
both RBD and CephFS, a few bug fixes and cleanups for RBD from Ilya,
and several small fixes and cleanups from Jianpeng and others"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: improve readahead for file holes
ceph: get inode size for each append write
libceph: check data_len in ->alloc_msg()
libceph: use keepalive2 to verify the mon session is alive
rbd: plug rbd_dev->header.object_prefix memory leak
rbd: fix double free on rbd_dev->header_name
libceph: set 'exists' flag for newly up osd
ceph: cleanup use of ceph_msg_get
ceph: no need to get parent inode in ceph_open
ceph: remove the useless judgement
ceph: remove redundant test of head->safe and silence static analysis warnings
ceph: fix queuing inode to mdsdir's snaprealm
libceph: rename con_work() to ceph_con_workfn()
libceph: Avoid holding the zero page on ceph_msgr_slab_init errors
libceph: remove the unused macro AES_KEY_SIZE
ceph: invalidate dirty pages after forced umount
ceph: EIO all operations after forced umount
Only ->alloc_msg() should check data_len of the incoming message
against the preallocated ceph_msg, doing it in the messenger is not
right. The contract is that either ->alloc_msg() returns a ceph_msg
which will fit all of the portions of the incoming message, or it
returns NULL and possibly sets skip, signaling whether NULL is due to
an -ENOMEM. ->alloc_msg() should be the only place where we make the
skip/no-skip decision.
I stumbled upon this while looking at con/osd ref counting. Right now,
if we get a non-extent message with a larger data portion than we are
prepared for, ->alloc_msg() returns a ceph_msg, and then, when we skip
it in the messenger, we don't put the con/osd ref acquired in
ceph_con_in_msg_alloc() (which is normally put in process_message()),
so this also fixes a memory leak.
An existing BUG_ON in ceph_msg_data_cursor_init() ensures we don't
corrupt random memory should a buggy ->alloc_msg() return an unfit
ceph_msg.
While at it, I changed the "unknown tid" dout() to a pr_warn() to make
sure all skips are seen and unified format strings.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Even though it's static, con_work(), being a work func, shows up in
various stacktraces a lot. Prefix it with ceph_.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_msgr_slab_init may fail due to a temporary ENOMEM.
Delay a bit the initialization of zero_page in ceph_msgr_init and
reorder its cleanup in _ceph_msgr_exit so it's done in reverse
order of setup.
BUG_ON() will not suffer to be postponed in case it is triggered.
Signed-off-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This removes the no longer used macro AES_KEY_SIZE as no functions use
this macro anymore and thus this macro can be removed due it no longer
being required.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Many file systems that implement the show_options hook fail to correctly
escape their output which could lead to unescaped characters (e.g. new
lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
could lead to confusion, spoofed entries (resulting in things like
systemd issuing false d-bus "mount" notifications), and who knows what
else. This looks like it would only be the root user stepping on
themselves, but it's possible weird things could happen in containers or
in other situations with delegated mount privileges.
Here's an example using overlay with setuid fusermount trusting the
contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
of "sudo" is something more sneaky:
$ BASE="ovl"
$ MNT="$BASE/mnt"
$ LOW="$BASE/lower"
$ UP="$BASE/upper"
$ WORK="$BASE/work/ 0 0
none /proc fuse.pwn user_id=1000"
$ mkdir -p "$LOW" "$UP" "$WORK"
$ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
$ cat /proc/mounts
none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
none /proc fuse.pwn user_id=1000 0 0
$ fusermount -u /proc
$ cat /proc/mounts
cat: /proc/mounts: No such file or directory
This fixes the problem by adding new seq_show_option and
seq_show_option_n helpers, and updating the vulnerable show_option
handlers to use them as needed. Some, like SELinux, need to be open
coded due to unusual existing escape mechanisms.
[akpm@linux-foundation.org: add lost chunk, per Kees]
[keescook@chromium.org: seq_show_option should be using const parameters]
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
addr_is_blank() should return true if family is neither AF_INET nor
AF_INET6. This is what its counterpart entity_addr_t::is_blank_ip() is
doing and it is the right thing to do: in process_banner() we check if
our address is blank and if it is "learn" it from our peer. As it is,
we never learn our address and always send out a blank one. This goes
way back to ceph.git commit dd732cbfc1c9 ("use sockaddr_storage; and
some ipv6 support groundwork") from 2009.
While at at, do not open-code ipv6_addr_any() and use INADDR_ANY
constant instead of 0.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Grab a reference on a network namespace of the 'rbd map' (in case of
rbd) or 'mount' (in case of ceph) process and use that to open sockets
instead of always using init_net and bailing if network namespace is
anything but init_net. Be careful to not share struct ceph_client
instances between different namespaces and don't add any code in the
!CONFIG_NET_NS case.
This is based on a patch from Hong Zhiguo <zhiguohong@tencent.com>.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Pull Ceph updates from Sage Weil:
"We have a pile of bug fixes from Ilya, including a few patches that
sync up the CRUSH code with the latest from userspace.
There is also a long series from Zheng that fixes various issues with
snapshots, inline data, and directory fsync, some simplification and
improvement in the cap release code, and a rework of the caching of
directory contents.
To top it off there are a few small fixes and cleanups from Benoit and
Hong"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (40 commits)
rbd: use GFP_NOIO in rbd_obj_request_create()
crush: fix a bug in tree bucket decode
libceph: Fix ceph_tcp_sendpage()'s more boolean usage
libceph: Remove spurious kunmap() of the zero page
rbd: queue_depth map option
rbd: store rbd_options in rbd_device
rbd: terminate rbd_opts_tokens with Opt_err
ceph: fix ceph_writepages_start()
rbd: bump queue_max_segments
ceph: rework dcache readdir
crush: sync up with userspace
crush: fix crash from invalid 'take' argument
ceph: switch some GFP_NOFS memory allocation to GFP_KERNEL
ceph: pre-allocate data structure that tracks caps flushing
ceph: re-send flushing caps (which are revoked) in reconnect stage
ceph: send TID of the oldest pending caps flush to MDS
ceph: track pending caps flushing globally
ceph: track pending caps flushing accurately
libceph: fix wrong name "Ceph filesystem for Linux"
ceph: fix directory fsync
...
struct crush_bucket_tree::num_nodes is u8, so ceph_decode_8_safe()
should be used. -Wconversion catches this, but I guess it went
unnoticed in all the noise it spews. The actual problem (at least for
common crushmaps) isn't the u32 -> u8 truncation though - it's the
advancement by 4 bytes instead of 1 in the crushmap buffer.
Fixes: http://tracker.ceph.com/issues/2759
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
From struct ceph_msg_data_cursor in include/linux/ceph/messenger.h:
bool last_piece; /* current is last piece */
In ceph_msg_data_next():
*last_piece = cursor->last_piece;
A call to ceph_msg_data_next() is followed by:
ret = ceph_tcp_sendpage(con->sock, page, page_offset,
length, last_piece);
while ceph_tcp_sendpage() is:
static int ceph_tcp_sendpage(struct socket *sock, struct page *page,
int offset, size_t size, bool more)
The logic is inverted: correct it.
Signed-off-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_tcp_sendpage already does the work of mapping/unmapping
the zero page if needed.
Signed-off-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
.. up to ceph.git commit 1db1abc8328d ("crush: eliminate ad hoc diff
between kernel and userspace"). This fixes a bunch of recently pulled
coding style issues and makes includes a bit cleaner.
A patch "crush:Make the function crush_ln static" from Nicholas Krause
<xerofoify@gmail.com> is folded in as crush_ln() has been made static
in userspace as well.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Verify that the 'take' argument is a valid device or bucket.
Otherwise ignore it (do not add the value to the working vector).
Reflects ceph.git commit 9324d0a1af61e1c234cc48e2175b4e6320fff8f4.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
modinfo libceph prints the module name "Ceph filesystem for Linux",
which is same as the real fs module ceph. It's confusing.
Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
- return -ETIMEDOUT instead of -EIO in case of timeout
- wait_event_interruptible_timeout() returns time left until timeout
and since it can be almost LONG_MAX we had better assign it to long
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
There are currently three libceph-level timeouts that the user can
specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive. All of
these are in seconds and no checking is done on user input: negative
values are accepted, we multiply them all by HZ which may or may not
overflow, arbitrarily large jiffies then get added together, etc.
There is also a bug in the way mount_timeout=0 is handled. It's
supposed to mean "infinite timeout", but that's not how wait.h APIs
treat it and so __ceph_open_session() for example will busy loop
without much chance of being interrupted if none of ceph-mons are
there.
Fix all this by verifying user input, storing timeouts capped by
msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies()
helper for all user-specified waits to handle infinite timeouts
correctly.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
This one sneaked in through vfs tree with commit 2b777c9dd9
("ceph_sync_read: stop poking into iov_iter guts").
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Conflicts:
drivers/net/phy/amd-xgbe-phy.c
drivers/net/wireless/iwlwifi/Kconfig
include/net/mac80211.h
iwlwifi/Kconfig and mac80211.h were both trivial overlapping
changes.
The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and
the bug fix that happened on the 'net' side is already integrated
into the rest of the amd-xgbe driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit ba9d114ec5.
.. which introduced a regression that prevented all lingering requests
requeued in kick_requests() from ever being sent to the OSDs, resulting
in a lot of missed notifies. In retrospect it's pretty obvious that
r_req_lru_item item in the case of lingering requests can be used not
only for notarget, but also for unsent linkage due to how tightly
actual map and enqueue operations are coupled in __map_request().
The assertion that was being silenced is taken care of in the previous
("libceph: request a new osdmap if lingering request maps to no osd")
commit: by always kicking homeless lingering requests we ensure that
none of them ends up on the notarget list outside of the critical
section guarded by request_mutex.
Cc: stable@vger.kernel.org # 3.18+, needs b049453221 "libceph: request a new osdmap if lingering request maps to no osd"
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
This commit does two things. First, if there are any homeless
lingering requests, we now request a new osdmap even if the osdmap that
is being processed brought no changes, i.e. if a given lingering
request turned homeless in one of the previous epochs and remained
homeless in the current epoch. Not doing so leaves us with a stale
osdmap and as a result we may miss our window for reestablishing the
watch and lose notifies.
MON=1 OSD=1:
# cat linger-needmap.sh
#!/bin/bash
rbd create --size 1 test
DEV=$(rbd map test)
ceph osd out 0
rbd map dne/dne # obtain a new osdmap as a side effect (!)
sleep 1
ceph osd in 0
rbd resize --size 2 test
# rbd info test | grep size -> 2M
# blockdev --getsize $DEV -> 1M
N.B.: Not obtaining a new osdmap in between "osd out" and "osd in"
above is enough to make it miss that resize notify, but that is a
bug^Wlimitation of ceph watch/notify v1.
Second, homeless lingering requests are now kicked just like those
lingering requests whose mapping has changed. This is mainly to
recognize that a homeless lingering request makes no sense and to
preserve the invariant that a registered lingering request is not
sitting on any of r_req_lru_item lists. This spares us a WARN_ON,
which commit ba9d114ec5 ("libceph: clear r_req_lru_item in
__unregister_linger_request()") tried to fix the _wrong_ way.
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
This is long overdue, and is part of cleaning up how we allocate kernel
sockets that don't reference count struct net.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is an improved straw bucket that correctly avoids any data movement
between items A and B when neither A nor B's weights are changed. Said
differently, if we adjust the weight of item C (including adding it anew
or removing it completely), we will only see inputs move to or from C,
never between other items in the bucket.
Notably, there is not intermediate scaling factor that needs to be
calculated. The mapping function is a simple function of the item weights.
The below commits were squashed together into this one (mostly to avoid
adding and then yanking a ~6000 lines worth of crush_ln_table):
- crush: add a straw2 bucket type
- crush: add crush_ln to calculate nature log efficently
- crush: improve straw2 adjustment slightly
- crush: change crush_ln to provide 32 more digits
- crush: fix crush_get_bucket_item_weight and bucket destroy for straw2
- crush/mapper: fix divide-by-0 in straw2
(with div64_s64() for draw = ln / w and INT64_MIN -> S64_MIN - need
to create a proper compat.h in ceph.git)
Reflects ceph.git commits 242293c908e923d474910f2b8203fa3b41eb5a53,
32a1ead92efcd351822d22a5fc37d159c65c1338,
6289912418c4a3597a11778bcf29ed5415117ad9,
35fcb04e2945717cf5cfe150b9fa89cb3d2303a1,
6445d9ee7290938de1e4ee9563912a6ab6d8ee5f,
b5921d55d16796e12d66ad2c4add7305f9ce2353.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Crush temporary buffers are allocated as per replica size configured
by the user. When there are more final osds (to be selected as per
rule) than the replicas, buffer overlaps and it causes crash. Now, it
ensures that at most num-rep osds are selected even if more number of
osds are allowed by the rule.
Reflects ceph.git commits 6b4d1aa99718e3b367496326c1e64551330fabc0,
234b066ba04976783d15ff2abc3e81b6cc06fb10.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Split ceph_show_options() into two pieces and move the piece
responsible for printing client (libceph) options into net/ceph. This
way people adding a libceph option wouldn't have to remember to update
code in fs/ceph.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
- specific con->error_msg messages (e.g. "protocol version mismatch")
end up getting overwritten by a catch-all "socket error on read
/ write", introduced in commit 3a140a0d5c ("libceph: report socket
read/write error message")
- "bad message sequence # for incoming message" loses to "bad crc" due
to the fact that -EBADMSG is used for both
Fix it, and tidy up con->error_msg assignments and pr_errs while at it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This reverts commit 89baaa570a.
Dirty page throttling should be sufficient for us in the general case
so there is no need to use __GFP_MEMALLOC - it would be needed only in
the swap-over-rbd case, which we currently don't support. (It would
probably take approximately the commit that is being reverted to add
that support, but we would also need the "swap" option to distinguish
from the general case and make sure swap ceph_client-s aren't shared
with anything else.) See ceph-devel threads [1] and [2] for the
details of why enabling pfmemalloc reserves for all cases is a bad
thing.
On top of potential system lockups related to drained emergency
reserves, this turned out to cause ceph lockups in case peers are on
the same host and communicating via loopback due to sk_filter()
dropping pfmemalloc skbs on the receiving side because the receiving
loopback socket is not tagged with SOCK_MEMALLOC.
[1] "SOCK_MEMALLOC vs loopback"
http://www.spinics.net/lists/ceph-devel/msg22998.html
[2] "[PATCH] libceph: don't set memalloc flags in loopback case"
http://www.spinics.net/lists/ceph-devel/msg23392.html
Conflicts:
net/ceph/messenger.c [ context: tcp_nodelay option ]
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sage Weil <sage@redhat.com>
Cc: stable@vger.kernel.org # 3.18+, needs backporting
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Acked-by: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Mel Gorman <mgorman@suse.de>
Pull Ceph changes from Sage Weil:
"On the RBD side, there is a conversion to blk-mq from Christoph,
several long-standing bug fixes from Ilya, and some cleanup from
Rickard Strandqvist.
On the CephFS side there is a long list of fixes from Zheng, including
improved session handling, a few IO path fixes, some dcache management
correctness fixes, and several blocking while !TASK_RUNNING fixes.
The core code gets a few cleanups and Chaitanya has added support for
TCP_NODELAY (which has been used on the server side for ages but we
somehow missed on the kernel client).
There is also an update to MAINTAINERS to fix up some email addresses
and reflect that Ilya and Zheng are doing most of the maintenance for
RBD and CephFS these days. Do not be surprised to see a pull request
come from one of them in the future if I am unavailable for some
reason"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits)
MAINTAINERS: update Ceph and RBD maintainers
libceph: kfree() in put_osd() shouldn't depend on authorizer
libceph: fix double __remove_osd() problem
rbd: convert to blk-mq
ceph: return error for traceless reply race
ceph: fix dentry leaks
ceph: re-send requests when MDS enters reconnecting stage
ceph: show nocephx_require_signatures and notcp_nodelay options
libceph: tcp_nodelay support
rbd: do not treat standalone as flatten
ceph: fix atomic_open snapdir
ceph: properly mark empty directory as complete
client: include kernel version in client metadata
ceph: provide seperate {inode,file}_operations for snapdir
ceph: fix request time stamp encoding
ceph: fix reading inline data when i_size > PAGE_SIZE
ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions)
ceph: avoid block operation when !TASK_RUNNING (ceph_get_caps)
ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync)
rbd: fix error paths in rbd_dev_refresh()
...
a255651d4c ("ceph: ensure auth ops are defined before use") made
kfree() in put_osd() conditional on the authorizer. A mechanical
mistake most likely - fix it.
Cc: Alex Elder <elder@linaro.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
It turns out it's possible to get __remove_osd() called twice on the
same OSD. That doesn't sit well with rb_erase() - depending on the
shape of the tree we can get a NULL dereference, a soft lockup or
a random crash at some point in the future as we end up touching freed
memory. One scenario that I was able to reproduce is as follows:
<osd3 is idle, on the osd lru list>
<con reset - osd3>
con_fault_finish()
osd_reset()
<osdmap - osd3 down>
ceph_osdc_handle_map()
<takes map_sem>
kick_requests()
<takes request_mutex>
reset_changed_osds()
__reset_osd()
__remove_osd()
<releases request_mutex>
<releases map_sem>
<takes map_sem>
<takes request_mutex>
__kick_osd_requests()
__reset_osd()
__remove_osd() <-- !!!
A case can be made that osd refcounting is imperfect and reworking it
would be a proper resolution, but for now Sage and I decided to fix
this by adding a safe guard around __remove_osd().
Fixes: http://tracker.ceph.com/issues/8087
Cc: Sage Weil <sage@redhat.com>
Cc: stable@vger.kernel.org # 3.9+: 7c6e6fc53e: libceph: assert both regular and lingering lists in __remove_osd()
Cc: stable@vger.kernel.org # 3.9+: cc9f1f518c: libceph: change from BUG to WARN for __remove_osd() asserts
Cc: stable@vger.kernel.org # 3.9+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
On Mon, Dec 22, 2014 at 5:35 PM, Sage Weil <sage@newdream.net> wrote:
> On Mon, 22 Dec 2014, Ilya Dryomov wrote:
>> Actually, pool op stuff has been unused for over two years - looks like
>> it was added for rbd create_snap and that got ripped out in 2012. It's
>> unlikely we'd ever need to manage pools or snaps from the kernel client
>> so I think it makes sense to nuke it. Sage?
>
> Yep!
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
This allows those get_user_pages calls to pass FAULT_FLAG_ALLOW_RETRY to
the page fault in order to release the mmap_sem during the I/O.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
allow specifying position of extent operation in multi-operations
osd request. This is required for cephfs to convert inline data to
normal data (compare xattr, then write object).
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
Add CEPH_OSD_OP_CREATE support. Also change libceph to not treat
CEPH_OSD_OP_DELETE as an extent op and add an assert to that end.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
Session key is required when calculating message signature. Save the session
key in authorizer, this avoid lookup ticket handler for each message
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Use kvfree() from linux/mm.h instead, which is identical. Also fix the
ceph_buffer comment: we will allocate with kmalloc() up to 32k - the
value of PAGE_ALLOC_COSTLY_ORDER, but that really is just an
implementation detail so don't mention it at all.
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
kick_requests() can put linger requests on the notarget list. This
means we need to clear the much-overloaded req->r_req_lru_item in
__unregister_linger_request() as well, or we get an assertion failure
in ceph_osdc_release_request() - !list_empty(&req->r_req_lru_item).
AFAICT the assumption was that registered linger requests cannot be on
any of req->r_req_lru_item lists, but that's clearly not the case.
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Requests have to be unlinked from both osd->o_requests (normal
requests) and osd->o_linger_requests (linger requests) lists when
clearing req->r_osd. Otherwise __unregister_linger_request() gets
confused and we trip over a !list_empty(&osd->o_linger_requests)
assert in __remove_osd().
MON=1 OSD=1:
# cat remove-osd.sh
#!/bin/bash
rbd create --size 1 test
DEV=$(rbd map test)
ceph osd out 0
sleep 3
rbd map dne/dne # obtain a new osdmap as a side effect
rbd unmap $DEV & # will block
sleep 3
ceph osd in 0
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Commit c27a3e4d66 ("libceph: do not hard code max auth ticket len")
while fixing a buffer overlow tried to keep the same as much of the
surrounding code as possible and introduced an unnecessary kmalloc() in
the unencrypted ticket path. It is likely to fail on huge tickets, so
get rid of it.
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
This patch has ceph's lib code use the memalloc flags.
If the VM layer needs to write data out to free up memory to handle new
allocation requests, the block layer must be able to make forward progress.
To handle that requirement we use structs like mempools to reserve memory for
objects like bios and requests.
The problem is when we send/receive block layer requests over the network
layer, net skb allocations can fail and the system can lock up.
To solve this, the memalloc related flags were added. NBD, iSCSI
and NFS uses these flags to tell the network/vm layer that it should
use memory reserves to fullfill allcation requests for structs like
skbs.
I am running ceph in a bunch of VMs in my laptop, so this patch was
not tested very harshly.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
Pull Ceph updates from Sage Weil:
"There is the long-awaited discard support for RBD (Guangliang Zhao,
Josh Durgin), a pile of RBD bug fixes that didn't belong in late -rc's
(Ilya Dryomov, Li RongQing), a pile of fs/ceph bug fixes and
performance and debugging improvements (Yan, Zheng, John Spray), and a
smattering of cleanups (Chao Yu, Fabian Frederick, Joe Perches)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (40 commits)
ceph: fix divide-by-zero in __validate_layout()
rbd: rbd workqueues need a resque worker
libceph: ceph-msgr workqueue needs a resque worker
ceph: fix bool assignments
libceph: separate multiple ops with commas in debugfs output
libceph: sync osd op definitions in rados.h
libceph: remove redundant declaration
ceph: additional debugfs output
ceph: export ceph_session_state_name function
ceph: include the initial ACL in create/mkdir/mknod MDS requests
ceph: use pagelist to present MDS request data
libceph: reference counting pagelist
ceph: fix llistxattr on symlink
ceph: send client metadata to MDS
ceph: remove redundant code for max file size verification
ceph: remove redundant io_iter_advance()
ceph: move ceph_find_inode() outside the s_mutex
ceph: request xattrs if xattr_version is zero
rbd: set the remaining discard properties to enable support
rbd: use helpers to handle discard for layered images correctly
...
Commit f363e45fd1 ("net/ceph: make ceph_msgr_wq non-reentrant")
effectively removed WQ_MEM_RECLAIM flag from ceph_msgr_wq. This is
wrong - libceph is very much a memory reclaim path, so restore it.
Cc: stable@vger.kernel.org # needs backporting for < 3.12
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Tested-by: Micha Krause <micha@krausam.de>
Reviewed-by: Sage Weil <sage@redhat.com>
For requests with multiple ops, separate ops with commas instead of \t,
which is a field separator here.
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Bring in missing osd ops and strings, use macros to eliminate multiple
points of maintenance.
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
this allow pagelist to present data that may be sent multiple times.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
queue_work() doesn't "fail to queue", it returns false if work was
already on a queue, which can't happen here since we allocate
event_work right before we queue it. So don't bother at all.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Use the more common pr_warn.
Other miscellanea:
o Coalesce formats
o Realign arguments
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
If the state variable is krealloced successfully, map->osd_state will be
freed, once following two reallocation failed, and exit the function
without resetting map->osd_state, map->osd_state become a wild pointer.
fix it by resetting them after krealloc successfully.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
We want "cbc(aes)" algorithm, so select CRYPTO_CBC too, not just
CRYPTO_AES. Otherwise on !CRYPTO_CBC kernels we fail rbd map/mount
with
libceph: error -2 building auth method x request
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Both not yet registered (r_linger && list_empty(&r_linger_item)) and
registered linger requests should use the new tid on resend to avoid
the dup op detection logic on the OSDs, yet we were doing this only for
"registered" case. Factor out and simplify the "registered" logic and
use the new helper for "not registered" case as well.
Fixes: http://tracker.ceph.com/issues/8806
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Pull security subsystem updates from James Morris.
Mostly ima, selinux, smack and key handling updates.
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits)
integrity: do zero padding of the key id
KEYS: output last portion of fingerprint in /proc/keys
KEYS: strip 'id:' from ca_keyid
KEYS: use swapped SKID for performing partial matching
KEYS: Restore partial ID matching functionality for asymmetric keys
X.509: If available, use the raw subjKeyId to form the key description
KEYS: handle error code encoded in pointer
selinux: normalize audit log formatting
selinux: cleanup error reporting in selinux_nlmsg_perm()
KEYS: Check hex2bin()'s return when generating an asymmetric key ID
ima: detect violations for mmaped files
ima: fix race condition on ima_rdwr_violation_check and process_measurement
ima: added ima_policy_flag variable
ima: return an error code from ima_add_boot_aggregate()
ima: provide 'ima_appraise=log' kernel option
ima: move keyring initialization to ima_init()
PKCS#7: Handle PKCS#7 messages that contain no X.509 certs
PKCS#7: Better handling of unsupported crypto
KEYS: Overhaul key identification when searching for asymmetric keys
KEYS: Implement binary asymmetric key ID handling
...
A previous patch added a ->match_preparse() method to the key type. This is
allowed to override the function called by the iteration algorithm.
Therefore, we can just set a default that simply checks for an exact match of
the key description with the original criterion data and allow match_preparse
to override it as needed.
The key_type::match op is then redundant and can be removed, as can the
user_match() function.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
We hard code cephx auth ticket buffer size to 256 bytes. This isn't
enough for any moderate setups and, in case tickets themselves are not
encrypted, leads to buffer overflows (ceph_x_decrypt() errors out, but
ceph_decode_copy() doesn't - it's just a memcpy() wrapper). Since the
buffer is allocated dynamically anyway, allocated it a bit later, at
the point where we know how much is going to be needed.
Fixes: http://tracker.ceph.com/issues/8979
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Add a helper for processing individual cephx auth tickets. Needed for
the next commit, which deals with allocating ticket buffers. (Most of
the diff here is whitespace - view with git diff -b).
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@redhat.com>
We preallocate a few of the message types we get back from the mon. If we
get a larger message than we are expecting, fall back to trying to allocate
a new one instead of blindly using the one we have.
CC: stable@vger.kernel.org
Signed-off-by: Sage Weil <sage@redhat.com>
Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Pull Ceph updates from Sage Weil:
"There is a lot of refactoring and hardening of the libceph and rbd
code here from Ilya that fix various smaller bugs, and a few more
important fixes with clone overlap. The main fix is a critical change
to the request_fn handling to not sleep that was exposed by the recent
mutex changes (which will also go to the 3.16 stable series).
Yan Zheng has several fixes in here for CephFS fixing ACL handling,
time stamps, and request resends when the MDS restarts.
Finally, there are a few cleanups from Himangi Saraogi based on
Coccinelle"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (39 commits)
libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly
rbd: remove extra newlines from rbd_warn() messages
rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC
rbd: rework rbd_request_fn()
ceph: fix kick_requests()
ceph: fix append mode write
ceph: fix sizeof(struct tYpO *) typo
ceph: remove redundant memset(0)
rbd: take snap_id into account when reading in parent info
rbd: do not read in parent info before snap context
rbd: update mapping size only on refresh
rbd: harden rbd_dev_refresh() and callers a bit
rbd: split rbd_dev_spec_update() into two functions
rbd: remove unnecessary asserts in rbd_dev_image_probe()
rbd: introduce rbd_dev_header_info()
rbd: show the entire chain of parent images
ceph: replace comma with a semicolon
rbd: use rbd_segment_name_free() instead of kfree()
ceph: check zero length in ceph_sync_read()
ceph: reset r_resend_mds after receiving -ESTALE
...
Determining ->last_piece based on the value of ->page_offset + length
is incorrect because length here is the length of the entire message.
->last_piece set to false even if page array data item length is <=
PAGE_SIZE, which results in invalid length passed to
ceph_tcp_{send,recv}page() and causes various asserts to fire.
# cat pages-cursor-init.sh
#!/bin/bash
rbd create --size 10 --image-format 2 foo
FOO_DEV=$(rbd map foo)
dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null
rbd snap create foo@snap
rbd snap protect foo@snap
rbd clone foo@snap bar
# rbd_resize calls librbd rbd_resize(), size is in bytes
./rbd_resize bar $(((4 << 20) + 512))
rbd resize --size 10 bar
BAR_DEV=$(rbd map bar)
# trigger a 512-byte copyup -- 512-byte page array data item
dd if=/dev/urandom of=$BAR_DEV bs=1M count=1 seek=5
The problem exists only in ceph_msg_data_pages_cursor_init(),
ceph_msg_data_pages_advance() does the right thing. The size_t cast is
unnecessary.
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Ceph can use user_match() instead of defining its own identical function.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
cc: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Make use of key preparsing in Ceph so that quota size determination can take
place prior to keyring locking when a key is being added.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
cc: Tommi Virtanen <tommi.virtanen@dreamhost.com>
queue_con() bumps osd ref count. We should do the reverse when
canceling con work.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Introduce ceph_osdc_cancel_request() intended for canceling requests
from the higher layers (rbd and cephfs). Because higher layers are in
charge and are supposed to know what and when they are canceling, the
request is not completed, only unref'ed and removed from the libceph
data structures.
__cancel_request() is no longer called before __unregister_request(),
because __unregister_request() unconditionally revokes r_request and
there is no point in trying to do it twice.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
We should check if request is on the linger request list of any of the
OSDs, not whether request is registered or not.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Linger requests that have not yet been registered should not be
unregistered by __unregister_linger_request(). This messes up ref
count and leads to use-after-free.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
It is important that both regular and lingering requests lists are
empty when the OSD is removed.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Add some WARN_ONs to alert us when we try to destroy requests that are
still registered.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Add dout()s to ceph_osdc_request_{get,put}(). Also move them to .c and
turn kref release callback into a static function.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Add dout()s to ceph_msg_{get,put}(). Also move them to .c and turn
kref release callback into a static function.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Abstract out __move_osd_to_lru() logic from __unregister_request() and
__unregister_linger_request().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
So that:
req->r_osd_item --> osd->o_requests list
req->r_linger_osd_item --> osd->o_linger_requests list
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Pull Ceph updates from Sage Weil:
"This has a mix of bug fixes and cleanups.
Alex's patch fixes a rare race in RBD. Ilya's patches fix an ENOENT
check when a second rbd image is mapped and a couple memory leaks.
Zheng fixes several issues with fragmented directories and multiple
MDSs. Josh fixes a spin/sleep issue, and Josh and Guangliang's
patches fix setting and unsetting RBD images read-only.
Naturally there are several other cleanups mixed in for good measure"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
rbd: only set disk to read-only once
rbd: move calls that may sleep out of spin lock range
rbd: add ioctl for rbd
ceph: use truncate_pagecache() instead of truncate_inode_pages()
ceph: include time stamp in every MDS request
rbd: fix ida/idr memory leak
rbd: use reference counts for image requests
rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()
rbd: make sure we have latest osdmap on 'rbd map'
libceph: add ceph_monc_wait_osdmap()
libceph: mon_get_version request infrastructure
libceph: recognize poolop requests in debugfs
ceph: refactor readpage_nounlock() to make the logic clearer
mds: check cap ID when handling cap export message
ceph: remember subtree root dirfrag's auth MDS
ceph: introduce ceph_fill_fragtree()
ceph: handle cap import atomically
ceph: pre-allocate ceph_cap struct for ceph_add_cap()
ceph: update inode fields according to issued caps
rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
...
Pull networking updates from David Miller:
1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.
2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
Benniston.
3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
Mork.
4) BPF now has a "random" opcode, from Chema Gonzalez.
5) Add more BPF documentation and improve test framework, from Daniel
Borkmann.
6) Support TCP fastopen over ipv6, from Daniel Lee.
7) Add software TSO helper functions and use them to support software
TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.
8) Support software TSO in fec driver too, from Nimrod Andy.
9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.
10) Handle broadcasts more gracefully over macvlan when there are large
numbers of interfaces configured, from Herbert Xu.
11) Allow more control over fwmark used for non-socket based responses,
from Lorenzo Colitti.
12) Do TCP congestion window limiting based upon measurements, from Neal
Cardwell.
13) Support busy polling in SCTP, from Neal Horman.
14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.
15) Bridge promisc mode handling improvements from Vlad Yasevich.
16) Don't use inetpeer entries to implement ID generation any more, it
performs poorly, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
tcp: fixing TLP's FIN recovery
net: fec: Add software TSO support
net: fec: Add Scatter/gather support
net: fec: Increase buffer descriptor entry number
net: fec: Factorize feature setting
net: fec: Enable IP header hardware checksum
net: fec: Factorize the .xmit transmit function
bridge: fix compile error when compiling without IPv6 support
bridge: fix smatch warning / potential null pointer dereference
via-rhine: fix full-duplex with autoneg disable
bnx2x: Enlarge the dorq threshold for VFs
bnx2x: Check for UNDI in uncommon branch
bnx2x: Fix 1G-baseT link
bnx2x: Fix link for KR with swapped polarity lane
sctp: Fix sk_ack_backlog wrap-around problem
net/core: Add VF link state control policy
net/fsl: xgmac_mdio is dependent on OF_MDIO
net/fsl: Make xgmac_mdio read error message useful
net_sched: drr: warn when qdisc is not work conserving
...
Sparse complained about this bogus extern on definition of
a function.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ceph_monc_wait_osdmap(), which will block until the osdmap with the
specified epoch is received or timeout occurs.
Export both of these as they are going to be needed by rbd.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Add support for mon_get_version requests to libceph. This reuses much
of the ceph_mon_generic_request infrastructure, with one exception.
Older OSDs don't set mon_get_version reply hdr->tid even if the
original request had a non-zero tid, which makes it impossible to
lookup ceph_mon_generic_request contexts by tid in get_generic_reply()
for such replies. As a workaround, we allocate a reply message on the
reply path. This can probably interfere with revoke, but I don't see
a better way.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Recognize poolop requests in debugfs monc dump, fix prink format
specifiers - tid is unsigned.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Commit e2b149cc4b ("crush: add chooseleaf_vary_r tunable") added the
crush_map::chooseleaf_vary_r field but missed the decode part. This
lead to misdirected requests caused by incorrect raw crush mapping
sets.
Fixes: http://tracker.ceph.com/issues/8226
Reported-and-Tested-by: Dmitry Smirnov <onlyjob@member.fsf.org>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
It has been reported that using ZFSonLinux on rbd will result in memory
corruption. The bug report can be found here:
https://github.com/zfsonlinux/spl/issues/241http://tracker.ceph.com/issues/7790
The reason is that ZFS will send pages with page_count 0 into rbd, which in
turns send them to tcp_sendpage. However, tcp_sendpage cannot deal with
page_count 0, as it will do get_page and put_page, and erroneously free the
page.
This type of issue has been noted before, and handled in iscsi, drbd,
etc. So, rbd should also handle this. This fix address this issue by fall back
to slower sendmsg when page_count 0 detected.
Cc: Sage Weil <sage@inktank.com>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: stable@vger.kernel.org
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Pull Ceph fixes from Sage Weil:
"First, there is a critical fix for the new primary-affinity function
that went into -rc1.
The second batch of patches from Zheng fix a range of problems with
directory fragmentation, readdir, and a few odds and ends for cephfs"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: reserve caps for file layout/lock MDS requests
ceph: avoid releasing caps that are being used
ceph: clear directory's completeness when creating file
libceph: fix non-default values check in apply_primary_affinity()
ceph: use fpos_cmp() to compare dentry positions
ceph: check directory's completeness before emitting directory entry
osd_primary_affinity array is indexed into incorrectly when checking
for non-default primary-affinity values. This nullifies the impact of
the rest of the apply_primary_affinity() and results in misdirected
requests.
if (osds[i] != CRUSH_ITEM_NONE &&
osdmap->osd_primary_affinity[i] !=
^^^
CEPH_OSD_DEFAULT_PRIMARY_AFFINITY) {
For a pool with size 2, this always ends up checking osd0 and osd1
primary_affinity values, instead of the values that correspond to the
osds in question. E.g., given a [2,3] up set and a [max,max,0,max]
primary affinity vector, requests are still sent to osd2, because both
osd0 and osd1 happen to have max primary_affinity values and therefore
we return from apply_primary_affinity() early on the premise that all
osds in the given set have max (default) values. Fix it.
Fixes: http://tracker.ceph.com/issues/7954
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Several spots in the kernel perform a sequence like:
skb_queue_tail(&sk->s_receive_queue, skb);
sk->sk_data_ready(sk, skb->len);
But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up. So this skb->len access is potentially
to freed up memory.
Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.
And finally, no actual implementation of this callback actually uses
the length argument. And since nobody actually cared about it's
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.
So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.
Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull Ceph updates from Sage Weil:
"The biggest chunk is a series of patches from Ilya that add support
for new Ceph osd and crush map features, including some new tunables,
primary affinity, and the new encoding that is needed for erasure
coding support. This brings things into parity with the server side
and the looming firefly release. There is also support for allocation
hints in RBD that help limit fragmentation on the server side.
There is also a series of patches from Zheng fixing NFS reexport,
directory fragmentation support, flock vs fnctl behavior, and some
issues with clustered MDS.
Finally, there are some miscellaneous fixes from Yunchuan Wen for
fscache, Fabian Frederick for ACLs, and from me for fsync(dirfd)
behavior"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (79 commits)
ceph: skip invalid dentry during dcache readdir
libceph: dump pool {read,write}_tier to debugfs
libceph: output primary affinity values on osdmap updates
ceph: flush cap release queue when trimming session caps
ceph: don't grabs open file reference for aborted request
ceph: drop extra open file reference in ceph_atomic_open()
ceph: preallocate buffer for readdir reply
libceph: enable PRIMARY_AFFINITY feature bit
libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
libceph: add support for osd primary affinity
libceph: add support for primary_temp mappings
libceph: return primary from ceph_calc_pg_acting()
libceph: switch ceph_calc_pg_acting() to new helpers
libceph: introduce apply_temps() helper
libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers
libceph: ceph_can_shift_osds(pool) and pool type defines
libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions
libceph: enable OSDMAP_ENC feature bit
libceph: primary_affinity decode bits
libceph: primary_affinity infrastructure
...
Dump pool {read,write}_tier to debugfs. While at it, fixup printk type
specifiers and remove the unnecessary cast to unsigned long long.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reimplement ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
and get rid of the now unused calc_pg_raw().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Respond to non-default primary_affinity values accordingly. (Primary
affinity allows the admin to shift 'primary responsibility' away from
specific osds, effectively shifting around the read side of the
workload and whatever overhead is incurred by peering and writes by
virtue of being the primary).
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Change apply_temp() to override primary in the same way pg_temp
overrides osd set. primary_temp overrides pg_temp primary too.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
In preparation for adding support for primary_temp, stop assuming
primaryness: add a primary out parameter to ceph_calc_pg_acting() and
change call sites accordingly. Primary is now specified separately
from the order of osds in the set.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Switch ceph_calc_pg_acting() to new helpers: pg_to_raw_osds(),
raw_to_up_osds() and apply_temps().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
apply_temp() helper for applying various temporary mappings (at this
point only pg_temp mappings) to the up set, therefore transforming it
into an acting set.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
pg_to_raw_osds() helper for computing a raw (crush) set, which can
contain non-existant and down osds.
raw_to_up_osds() helper for pruning non-existant and down osds from the
raw set, therefore transforming it into an up set, and determining up
primary.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Add two helpers to decode primary_affinity (full map, vector<u32>) and
new_primary_affinity (inc map, map<u32, u32>) and switch to them.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Add primary_affinity infrastructure. primary_affinity values are
stored in an max_osd-sized array, hanging off ceph_osdmap, similar to
a osd_weight array.
Introduce {get,set}_primary_affinity() helpers, primarily to return
CEPH_OSD_DEFAULT_PRIMARY_AFFINITY when no affinity has been set and to
abstract out osd_primary_affinity array allocation and initialization.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Add a common helper to decode both primary_temp (full map, map<pg_t,
u32>) and new_primary_temp (inc map, same) and switch to it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Add primary_temp mappings infrastructure. struct ceph_pg_mapping is
overloaded, primary_temp mappings are stored in an rb-tree, rooted at
ceph_osdmap, in a manner similar to pg_temp mappings.
Dump primary_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap,
one 'primary_temp <pgid> <osd>' per line, e.g:
primary_temp 2.6 4
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
In preparation for adding support for primary_temp mappings, generalize
struct ceph_pg_mapping so it can hold mappings other than pg_temp.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Full and incremental osdmaps are structured identically and have
identical headers. Add a helper to decode both "old" (16-bit version,
v6) and "new" (8-bit struct_v+struct_compat+struct_len, v7) osdmap
enconding headers and switch to it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Consolidate pg_temp (full map, map<pg_t, vector<u32>>) and new_pg_temp
(inc map, same) decoding logic into a common helper and switch to it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Use krealloc() instead of rolling our own. (krealloc() with a NULL
first argument acts as a kmalloc()). Properly initalize the new array
elements. This is needed to make future additions to osdmap easier.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Consolidate pools (full map, map<u64, pg_pool_t>) and new_pools (inc
map, same) decoding logic into a common helper and switch to it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
To be in line with all the other osdmap decode helpers.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Sum up sizeof(...) results instead of (incorrectly) hard-coding the
number of bytes, expressed in ints and longs.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Only version 6 of osdmap encoding is supported, anything other than
version 6 results in an error and halts the decoding process. Checking
if version is >= 5 is therefore bogus.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
The existing error handling scheme requires resetting err to -EINVAL
prior to calling any ceph_decode_* macro. This is ugly and fragile,
and there already are a few places where we would return 0 on error,
due to a missing reset. Follow osdmap_decode() and fix this by adding
a special e_inval label to be used by all ceph_decode_* macros.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
The size of the memory area feeded to crush_decode() should be limited
not only by osdmap end, but also by the crush map length. Also, drop
unnecessary dout() (dout() in crush_decode() conveys the same info) and
step past crush map only if it is decoded successfully.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Check length of osd_state, osd_weight and osd_addr arrays. They
should all have exactly max_osd elements after the call to
osdmap_set_max_osd().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
max_osd value is not covered by any ceph_decode_need(). Use a safe
version of ceph_decode_* macro to decode it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
The existing error handling scheme requires resetting err to -EINVAL
prior to calling any ceph_decode_* macro. This is ugly and fragile,
and there already are a few places where we would return 0 on error,
due to a missing reset. Fix this by adding a special e_inval label to
be used by all ceph_decode_* macros.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Split osdmap allocation and initialization into a separate function,
ceph_osdmap_decode().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Dump osdmap in hex on both full and incremental decode errors, to make
it easier to match the contents with error offset. dout() map epoch
and max_osd value on success.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
To save screen space in anticipation of more fields (e.g. primary
affinity).
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
This lets you adjust the vary_r tunable on a per-rule basis.
Reflects ceph.git commit f944ccc20aee60a7d8da7e405ec75ad1cd449fac.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The current crush_choose_firstn code will re-use the same 'r' value for
the recursive call. That means that if we are hitting a collision or
rejection for some reason (say, an OSD that is marked out) and need to
retry, we will keep making the same (bad) choice in that recursive
selection.
Introduce a tunable that fixes that behavior by incorporating the parent
'r' value into the recursive starting point, so that a different path
will be taken in subsequent placement attempts.
Note that this was done from the get-go for the new crush_choose_indep
algorithm.
This was exposed by a user who was seeing PGs stuck in active+remapped
after reweight-by-utilization because the up set mapped to a single OSD.
Reflects ceph.git commit a8e6c9fbf88bad056dd05d3eb790e98a5e43451a.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
These two fields are misnomers; they are *retry* counts.
Reflects ceph.git commit f17caba8ae0cad7b6f8f35e53e5f73b444696835.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Back in 27f4d1f6bc32c2ed7b2c5080cbd58b14df622607 we refactored the CRUSH
code to allow adjustment of the retry counts on a per-pool basis. That
commit had an off-by-one bug: the previous "tries" counter was a *retry*
count, not a *try* count, but the new code was passing in 1 meaning
there should be no retries.
Fix the ftotal vs tries comparison to use < instead of <= to fix the
problem. Note that the original code used <= here, which means the
global "choose_total_tries" tunable is actually counting retries.
Compensate for that by adding 1 in crush_do_rule when we pull the tunable
into the local variable.
This was noticed looking at output from a user provided osdmap.
Unfortunately the map doesn't illustrate the change in mapping behavior
and I haven't managed to construct one yet that does. Inspection of the
crush debug output now aligns with prior versions, though.
Reflects ceph.git commit 795704fd615f0b008dcc81aa088a859b2d075138.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When there is no more data, ceph_msg_data_{pages,pagelist}_advance()
should not move on to the next page.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
This is primarily for rbd's benefit and is supposed to combat
fragmentation:
"... knowing that rbd images have a 4m size, librbd can pass a hint
that will let the osd do the xfs allocation size ioctl on new files so
that they are allocated in 1m or 4m chunks. We've seen cases where
users with rbd workloads have very high levels of fragmentation in xfs
and this would mitigate that and probably have a pretty nice
performance benefit."
SETALLOCHINT is considered advisory, so our backwards compatibility
mechanism here is to set FAILOK flag for all SETALLOCHINT ops.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Encode ceph_osd_op::flags field so that it gets sent over the wire.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
With the addition of erasure coding support in the future, scratch
variable-length array in crush_do_rule_ary() is going to grow to at
least 200 bytes on average, on top of another 128 bytes consumed by
rawosd/osd arrays in the call chain. Replace it with a buffer inside
struct osdmap and a mutex. This shouldn't result in any contention,
because all osd requests were already serialized by request_mutex at
that point; the only unlocked caller was ceph_ioctl_get_dataloc().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
One of my pet coding style peeves is the practice of
adding extra return; at the end of function.
Kill several instances of this in network code.
I suppose some coccinelle wizardy could do this automatically.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit f38a5181d9 ("ceph: Convert to immutable biovecs") introduced
a NULL pointer dereference, which broke rbd in -rc1. Fix it.
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Handling redirect replies requires both map_sem and request_mutex.
Taking map_sem unconditionally near the top of handle_reply() avoids
possible race conditions that arise from releasing request_mutex to be
able to acquire map_sem in redirect reply case. (Lock ordering is:
map_sem, request_mutex, crush_mutex.)
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Factor out logic from ceph_osdc_start_request() into a new helper,
__ceph_osdc_start_request(). ceph_osdc_start_request() now amounts to
taking locks and calling __ceph_osdc_start_request().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
msgpool_op_reply message pool isn't destroyed if workqueue construction
fails. Fix it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Pull core block IO changes from Jens Axboe:
"The major piece in here is the immutable bio_ve series from Kent, the
rest is fairly minor. It was supposed to go in last round, but
various issues pushed it to this release instead. The pull request
contains:
- Various smaller blk-mq fixes from different folks. Nothing major
here, just minor fixes and cleanups.
- Fix for a memory leak in the error path in the block ioctl code
from Christian Engelmayer.
- Header export fix from CaiZhiyong.
- Finally the immutable biovec changes from Kent Overstreet. This
enables some nice future work on making arbitrarily sized bios
possible, and splitting more efficient. Related fixes to immutable
bio_vecs:
- dm-cache immutable fixup from Mike Snitzer.
- btrfs immutable fixup from Muthu Kumar.
- bio-integrity fix from Nic Bellinger, which is also going to stable"
* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
xtensa: fixup simdisk driver to work with immutable bio_vecs
block/blk-mq-cpu.c: use hotcpu_notifier()
blk-mq: for_each_* macro correctness
block: Fix memory leak in rw_copy_check_uvector() handling
bio-integrity: Fix bio_integrity_verify segment start bug
block: remove unrelated header files and export symbol
blk-mq: uses page->list incorrectly
blk-mq: use __smp_call_function_single directly
btrfs: fix missing increment of bi_remaining
Revert "block: Warn and free bio if bi_end_io is not set"
block: Warn and free bio if bi_end_io is not set
blk-mq: fix initializing request's start time
block: blk-mq: don't export blk_mq_free_queue()
block: blk-mq: make blk_sync_queue support mq
block: blk-mq: support draining mq queue
dm cache: increment bi_remaining when bi_end_io is restored
block: fixup for generic bio chaining
block: Really silence spurious compiler warnings
block: Silence spurious compiler warnings
block: Kill bio_pair_split()
...
Follow redirect replies from osds, for details see ceph.git commit
fbbe3ad1220799b7bb00ea30fce581c5eadaf034.
v1 (current) version of redirect reply consists of oloc and oid, which
expands to pool, key, nspace, hash and oid. However, server-side code
that would populate anything other than pool doesn't exist yet, and
hence this commit adds support for pool redirects only. To make sure
that future server-side updates don't break us, we decode all fields
and, if any of key, nspace, hash or oid have a non-default value, error
out with "corrupt osd_op_reply ..." message.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid} before
introducing r_target_{oloc,oid} needed for redirects.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Overwrite ceph_osd_request::r_oloc.pool with read_tier for read ops and
write_tier for write and read+write ops (aka basic tiering support).
{read,write}_tier are part of pg_pool_t since v9. This commit bumps
our pg_pool_t decode compat version from v7 to v9, all new fields
except for {read,write}_tier are ignored.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
"Lookup pool info by ID" function is hidden in osdmap.c. Expose it to
the rest of libceph.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Switch ceph_calc_ceph_pg() to new oloc and oid abstractions and rename
it to ceph_oloc_oid_to_pg() to make its purpose more clear.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
In preparation for tiering support, which would require having two
(base and target) object names for each osd request and also copying
those names around, introduce struct ceph_object_id (oid) and a couple
helpers to facilitate those copies and encapsulate the fact that object
name is not necessarily a NUL-terminated string.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
In preparation for adding oid abstraction, rename MAX_OBJ_NAME_SIZE to
CEPH_MAX_OID_NAME_LEN.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Instead of relying on pool fields in ceph_file_layout (for mapping) and
ceph_pg (for enconding), start using ceph_object_locator (oloc)
abstraction. Note that userspace oloc currently consists of pool, key,
nspace and hash fields, while this one contains only a pool. This is
OK, because at this point we only send (i.e. encode) olocs and never
have to receive (i.e. decode) them.
This makes keeping a copy of ceph_file_layout in every osd request
unnecessary, so ceph_osd_request::r_file_layout field is nuked.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Encapsulate kmalloc vs vmalloc memory allocation and freeing logic into
two helpers, ceph_kvmalloc() and ceph_kvfree(), and switch to them.
ceph_kvmalloc() kmalloc()'s a maximum of 8 pages, anything bigger is
vmalloc()'ed with __GFP_HIGHMEM set. This changes the existing
behaviour:
- for buffers (ceph_buffer_new()), from trying to kmalloc() everything
and using vmalloc() just as a fallback
- for messages (ceph_msg_new()), from going to vmalloc() for anything
bigger than a page
- for messages (ceph_msg_new()), from disallowing vmalloc() to use high
memory
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
The check that makes sure that we have enough memory allocated to read
in the entire header of the message in question is currently busted.
It compares front_len of the incoming message with iov_len field of
ceph_msg::front structure, which is used primarily to indicate the
amount of data already read in, and not the size of the allocated
buffer. Under certain conditions (e.g. a short read from a socket
followed by that socket's shutdown and owning ceph_connection reset)
this results in a warning similar to
[85688.975866] libceph: get_reply front 198 > preallocated 122 (4#0)
and, through another bug, leads to forever hung tasks and forced
reboots. Fix this by comparing front_len with front_alloc_len field of
struct ceph_msg, which stores the actual size of the buffer.
Fixes: http://tracker.ceph.com/issues/5425
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Rename front local variable to front_len in get_reply() to make its
purpose more clear.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Rename front_max field of struct ceph_msg to front_alloc_len to make
its purpose more clear.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Similar to userspace, don't bail with "parse_ips bad ip ..." if the
specified port is port 0, instead use port CEPH_MON_PORT (6789, the
default monitor port).
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
This allows all of the tunables to be overridden by a specific rule.
Reflects ceph.git commits d129e09e57fbc61cfd4f492e3ee77d0750c9d292,
0497db49e5973b50df26251ed0e3f4ac7578e66e.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
The legacy behavior is to make the normal number of tries for the
recursive chooseleaf call. The descend_once tunable changed this to
making a single try and bail if we get a reject (note that it is
impossible to collide in the recursive case).
The new set_chooseleaf_tries lets you select the number of recursive
chooseleaf attempts for indep mode, or default to 1. Use the same
behavior for firstn, except default to total_tries when the legacy
tunables are set (for compatibility). This makes the rule step
override the (new) default of 1 recursive attempt, keeping behavior
consistent with indep mode.
Reflects ceph.git commit 685c6950ef3df325ef04ce7c986e36ca2514c5f1.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
This aligns the internal identifier names with the user-visible names in
the decompiled crush map language.
Reflects ceph.git commit caa0e22e15e4226c3671318ba1f61314bf6da2a6.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Since we can specify the recursive retries in a rule, we may as well also
specify the non-recursive tries too for completeness.
Reflects ceph.git commit d1b97462cffccc871914859eaee562f2786abfd1.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Parameterize the attempts for the _firstn choose method, and apply the
rule-specified tries count to firstn mode as well. Note that we have
slightly different behavior here than with indep:
If the firstn value is not specified for firstn, we pass through the
normal attempt count. This maintains compatibility with legacy behavior.
Note that this is usually *not* actually N^2 work, though, because of the
descend_once tunable. However, descend_once is unfortunately *not* the
same thing as 1 chooseleaf try because it is only checked on a reject but
not on a collision. Sigh.
In contrast, for indep, if tries is not specified we default to 1
recursive attempt, because that is simply more sane, and we have the
option to do so. The descend_once tunable has no effect for indep.
Reflects ceph.git commit 64aeded50d80942d66a5ec7b604ff2fcbf5d7b63.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Explicitly control the number of sample attempts, and allow the number of
tries in the recursive call to be explicitly controlled via the rule. This
is important because the amount of time we want to spend looking for a
solution may be rule dependent (e.g., higher for the wide indep pool than
the rep pools).
(We should do the same for the other tunables, by the way!)
Reflects ceph.git commit c43c893be872f709c787bc57f46c0e97876ff681.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Pass down the parent's 'r' value so that we will sample different values in
the recursive call when the parent tries multiple times. This avoids doing
useless work (calling multiple times and trying the same values).
Reflects ceph.git commit 2731d3030d7a3e80922b7f1b7756f9a4a124bac5.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Pass numrep (the width of the result) separately from the number of results
we want *this* iteration. This makes things less awkward when we do a
recursive call (for chooseleaf) and want only one item.
Reflects ceph.git commit 1b567ee08972f268c11b43fc881e57b5984dd08b.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Now that indep is handled by crush_choose_indep, rename crush_choose to
crush_choose_firstn and remove all the conditionals. This ends up
stripping out *lots* of code.
Note that it *also* makes it obvious that the shenanigans we were playing
with r' for uniform buckets were broken for firstn mode. This appears to
have happened waaaay back in commit dae8bec9 (or earlier)... 2007.
Reflects ceph.git commit 94350996cb2035850bcbece6a77a9b0394177ec9.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
For firstn mode, if we fail to make a valid placement choice, we just
continue and return a short result to the caller. For indep mode, however,
we need to make the position stable, and return an undefined value on
failed placements to avoid shifting later results to the left.
Reflects ceph.git commit b1d4dd4eb044875874a1d01c01c7d766db5d0a80.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
This is only present to size the temporary scratch arrays that we put on
the stack. Let the caller allocate them as they wish and remove the
limitation.
Reflects ceph.git commit 1cfe140bf2dab99517589a82a916f4c75b9492d1.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Pass the size of the weight vector into crush_do_rule() to ensure that we
don't access values past the end. This can happen if the caller misbehaves
and passes a weight vector that is smaller than max_devices.
Currently the monitor tries to prevent that from happening, but this will
gracefully tolerate previous bad osdmaps that got into this state. It's
also a bit more defensive.
Reflects ceph.git commit 5922e2c2b8335b5e46c9504349c3a55b7434c01a.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
This updates ceph_features.h so that it has all feature bits defined in
ceph.git. In the interim since the last update, ceph.git crossed the
"32 feature bits" point, and, the addition of the 33rd bit wasn't
handled correctly. The work-around is squashed into this commit and
reflects ceph.git commit 053659d05e0349053ef703b414f44965f368b9f0.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
In preparation for ceph_features.h update, change all features fields
from unsigned int/u32 to u64. (ceph.git has ~40 feature bits at this
point.)
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
With the current full handling, there is a race between osds and
clients getting the first map marked full. If the osd wins, it will
return -ENOSPC to any writes, but the client may already have writes
in flight. This results in the client getting the error and
propagating it up the stack. For rbd, the block layer turns this into
EIO, which can cause corruption in filesystems above it.
To avoid this race, osds are being changed to drop writes that came
from clients with an osdmap older than the last osdmap marked full.
In order for this to work, clients must resend all writes after they
encounter a full -> not full transition in the osdmap. osds will wait
for an updated map instead of processing a request from a client with
a newer map, so resent writes will not be dropped by the osd unless
there is another not full -> full transition.
This approach requires both osds and clients to be fixed to avoid the
race. Old clients talking to osds with this fix may hang instead of
returning EIO and potentially corrupting an fs. New clients talking to
old osds have the same behavior as before if they encounter this race.
Fixes: http://tracker.ceph.com/issues/6938
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
The PAUSEWR and PAUSERD flags are meant to stop the cluster from
processing writes and reads, respectively. The FULL flag is set when
the cluster determines that it is out of space, and will no longer
process writes. PAUSEWR and PAUSERD are purely client-side settings
already implemented in userspace clients. The osd does nothing special
with these flags.
When the FULL flag is set, however, the osd responds to all writes
with -ENOSPC. For cephfs, this makes sense, but for rbd the block
layer translates this into EIO. If a cluster goes from full to
non-full quickly, a filesystem on top of rbd will not behave well,
since some writes succeed while others get EIO.
Fix this by blocking any writes when the FULL flag is set in the osd
client. This is the same strategy used by userspace, so apply it by
default. A follow-on patch makes this configurable.
__map_request() is called to re-target osd requests in case the
available osds changed. Add a paused field to a ceph_osd_request, and
set it whenever an appropriate osd map flag is set. Avoid queueing
paused requests in __map_request(), but force them to be resent if
they become unpaused.
Also subscribe to the next osd map from the monitor if any of these
flags are set, so paused requests can be unblocked as soon as
possible.
Fixes: http://tracker.ceph.com/issues/6079
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Wake up possible waiters, invoke the call back if any, unregister the request
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Now that we've got a mechanism for immutable biovecs -
bi_iter.bi_bvec_done - we need to convert drivers to use primitives that
respect it instead of using the bvec array directly.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sage Weil <sage@inktank.com>
Cc: ceph-devel@vger.kernel.org
There are a mix of function prototypes with and without extern
in the kernel sources. Standardize on not using extern for
function prototypes.
Function prototypes don't need to be written with extern.
extern is assumed by the compiler. Its use is as unnecessary as
using auto to declare automatic/local variables in a block.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull ceph fixes from Sage Weil:
"These fix several bugs with RBD from 3.11 that didn't get tested in
time for the merge window: some error handling, a use-after-free, and
a sequencing issue when unmapping and image races with a notify
operation.
There is also a patch fixing a problem with the new ceph + fscache
code that just went in"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
fscache: check consistency does not decrement refcount
rbd: fix error handling from rbd_snap_name()
rbd: ignore unmapped snapshots that no longer exist
rbd: fix use-after free of rbd_dev->disk
rbd: make rbd_obj_notify_ack() synchronous
rbd: complete notifies before cleaning up osd_client and rbd_dev
libceph: add function to ensure notifies are complete
Without a way to flush the osd client's notify workqueue, a watch
event that is unregistered could continue receiving callbacks
indefinitely.
Unregistering the event simply means no new notifies are added to the
queue, but there may still be events in the queue that will call the
watch callback for the event. If the queue is flushed after the event
is unregistered, the caller can be sure no more watch callbacks will
occur for the canceled watch.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Pull ceph updates from Sage Weil:
"This includes both the first pile of Ceph patches (which I sent to
torvalds@vger, sigh) and a few new patches that add support for
fscache for Ceph. That includes a few fscache core fixes that David
Howells asked go through the Ceph tree. (Thanks go to Milosz Tanski
for putting this feature together)
This first batch of patches (included here) had (has) several
important RBD bug fixes, hole punch support, several different
cleanups in the page cache interactions, improvements in the truncate
code (new truncate mutex to avoid shenanigans with i_mutex), and a
series of fixes in the synchronous striping read/write code.
On top of that is a random collection of small fixes all across the
tree (error code checks and error path cleanup, obsolete wq flags,
etc)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (43 commits)
ceph: use d_invalidate() to invalidate aliases
ceph: remove ceph_lookup_inode()
ceph: trivial buildbot warnings fix
ceph: Do not do invalidate if the filesystem is mounted nofsc
ceph: page still marked private_2
ceph: ceph_readpage_to_fscache didn't check if marked
ceph: clean PgPrivate2 on returning from readpages
ceph: use fscache as a local presisent cache
fscache: Netfs function for cleanup post readpages
FS-Cache: Fix heading in documentation
CacheFiles: Implement interface to check cache consistency
FS-Cache: Add interface to check consistency of a cached object
rbd: fix null dereference in dout
rbd: fix buffer size for writes to images with snapshots
libceph: use pg_num_mask instead of pgp_num_mask for pg.seed calc
rbd: fix I/O error propagation for reads
ceph: use vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
ceph: allow sync_read/write return partial successed size of read/write.
ceph: fix bugs about handling short-read for sync read mode.
ceph: remove useless variable revoked_rdcache
...
Fix a typo that used the wrong bitmask for the pg.seed calculation. This
is normally unnoticed because in most cases pg_num == pgp_num. It is, however,
a bug that is easily corrected.
CC: stable@vger.kernel.org
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linary.org>
create_singlethread_workqueue() returns NULL on error, and it doesn't
return ERR_PTRs.
I tweaked the error handling a little to be consistent with earlier in
the function.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Sage Weil <sage@inktank.com>
There are two places where we read "nr_maps" if both of them are set to
zero then we would hit a NULL dereference here.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Sage Weil <sage@inktank.com>
We've tried to fix the error paths in this function before, but there
is still a hidden goto in the ceph_decode_need() macro which goes to the
wrong place. We need to release the "req" and unlock a mutex before
returning.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Sage Weil <sage@inktank.com>
This patch implements fallocate and punch hole support for Ceph kernel client.
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages.
This patch doesn't introduce any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Sage Weil <sage@inktank.com>
Cc: ceph-devel@vger.kernel.org