Commit graph

358 commits

Author SHA1 Message Date
Jordan Rife
7563cf17dc libceph: use kernel_connect()
Direct calls to ops->connect() can overwrite the address parameter when
used in conjunction with BPF SOCK_ADDR hooks. Recent changes to
kernel_connect() ensure that callers are insulated from such side
effects. This patch wraps the direct call to ops->connect() with
kernel_connect() to prevent unexpected changes to the address passed to
ceph_tcp_connect().

This change was originally part of a larger patch targeting the net tree
addressing all instances of unprotected calls to ops->connect()
throughout the kernel, but this change was split up into several patches
targeting various trees.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/netdev/20230821100007.559638-1-jrife@google.com/
Link: https://lore.kernel.org/netdev/9944248dba1bce861375fcce9de663934d933ba9.camel@redhat.com/
Fixes: d74bad4e74 ("bpf: Hooks for sys_connect")
Signed-off-by: Jordan Rife <jrife@google.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-10-09 13:35:24 +02:00
Jeff Layton
dee0c5f834 libceph: add new iov_iter-based ceph_msg_data_type and ceph_osd_data_type
Add an iov_iter to the unions in ceph_msg_data and ceph_msg_data_cursor.
Instead of requiring a list of pages or bvecs, we can just use an
iov_iter directly, and avoid extra allocations.

We assume that the pages represented by the iter are pinned such that
they shouldn't incur page faults, which is the case for the iov_iters
created by netfs.

While working on this, Al Viro informed me that he was going to change
iov_iter_get_pages to auto-advance the iterator as that pattern is more
or less required for ITER_PIPE anyway. We emulate that here for now by
advancing in the _next op and tracking that amount in the "lastlen"
field.

In the event that _next is called twice without an intervening
_advance, we revert the iov_iter by the remaining lastlen before
calling iov_iter_get_pages.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-22 09:01:48 +02:00
Jeff Layton
ec3bc567ea libceph: new sparse_read op, support sparse reads on msgr2 crc codepath
Add support for a new sparse_read ceph_connection operation. The idea is
that the client driver can define this operation use it to do special
handling for incoming reads.

The alloc_msg routine will look at the request and determine whether the
reply is expected to be sparse. If it is, then we'll dispatch to a
different set of state machine states that will repeatedly call the
driver's sparse_read op to get length and placement info for reading the
extent map, and the extents themselves.

This necessitates adding some new field to some other structs:

- The msg gets a new bool to track whether it's a sparse_read request.

- A new field is added to the cursor to track the amount remaining in the
current extent. This is used to cap the read from the socket into the
msg_data

- Handing a revoke with all of this is particularly difficult, so I've
added a new data_len_remain field to the v2 connection info, and then
use that to skip that much on a revoke. We may want to expand the use of
that to the normal read path as well, just for consistency's sake.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-22 09:01:47 +02:00
Ilya Dryomov
8ff2c64c97 rbd: harden get_lock_owner_info() a bit
- we want the exclusive lock type, so test for it directly
- use sscanf() to actually parse the lock cookie and avoid admitting
  invalid handles
- bail if locker has a blank address

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2023-07-26 15:08:09 +02:00
Peilin Ye
40e0b09081 net/sock: Introduce trace_sk_data_ready()
As suggested by Cong, introduce a tracepoint for all ->sk_data_ready()
callback implementations.  For example:

<...>
  iperf-609  [002] .....  70.660425: sk_data_ready: family=2 protocol=6 func=sock_def_readable
  iperf-609  [002] .....  70.660436: sk_data_ready: family=2 protocol=6 func=sock_def_readable
<...>

Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-23 11:26:50 +00:00
Benjamin Coddington
98123866fc Treewide: Stop corrupting socket's task_frag
Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
GFP_NOIO flag on sk_allocation which the networking system uses to decide
when it is safe to use current->task_frag.  The results of this are
unexpected corruption in task_frag when SUNRPC is involved in memory
reclaim.

The corruption can be seen in crashes, but the root cause is often
difficult to ascertain as a crashing machine's stack trace will have no
evidence of being near NFS or SUNRPC code.  I believe this problem to
be much more pervasive than reports to the community may indicate.

Fix this by having kernel users of sockets that may corrupt task_frag due
to reclaim set sk_use_task_frag = false.  Preemptively correcting this
situation for users that still set sk_allocation allows them to convert to
memalloc_nofs_save/restore without the same unexpected corruptions that are
sure to follow, unlikely to show up in testing, and difficult to bisect.

CC: Philipp Reisner <philipp.reisner@linbit.com>
CC: Lars Ellenberg <lars.ellenberg@linbit.com>
CC: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Josef Bacik <josef@toxicpanda.com>
CC: Keith Busch <kbusch@kernel.org>
CC: Christoph Hellwig <hch@lst.de>
CC: Sagi Grimberg <sagi@grimberg.me>
CC: Lee Duncan <lduncan@suse.com>
CC: Chris Leech <cleech@redhat.com>
CC: Mike Christie <michael.christie@oracle.com>
CC: "James E.J. Bottomley" <jejb@linux.ibm.com>
CC: "Martin K. Petersen" <martin.petersen@oracle.com>
CC: Valentina Manea <valentina.manea.m@gmail.com>
CC: Shuah Khan <shuah@kernel.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: David Howells <dhowells@redhat.com>
CC: Marc Dionne <marc.dionne@auristor.com>
CC: Steve French <sfrench@samba.org>
CC: Christine Caulfield <ccaulfie@redhat.com>
CC: David Teigland <teigland@redhat.com>
CC: Mark Fasheh <mark@fasheh.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: Joseph Qi <joseph.qi@linux.alibaba.com>
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: Dominique Martinet <asmadeus@codewreck.org>
CC: Ilya Dryomov <idryomov@gmail.com>
CC: Xiubo Li <xiubli@redhat.com>
CC: Chuck Lever <chuck.lever@oracle.com>
CC: Jeff Layton <jlayton@kernel.org>
CC: Trond Myklebust <trond.myklebust@hammerspace.com>
CC: Anna Schumaker <anna@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>

Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-19 17:28:49 -08:00
Jeff Layton
da4ab869e3 libceph: drop last_piece flag from ceph_msg_data_cursor
ceph_msg_data_next is always passed a NULL pointer for this field. Some
of the "next" operations look at it in order to determine the length,
but we can just take the min of the data on the page or cursor->resid.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-10-04 19:18:08 +02:00
Ilya Dryomov
038b8d1d1a libceph: optionally use bounce buffer on recv path in crc mode
Both msgr1 and msgr2 in crc mode are zero copy in the sense that
message data is read from the socket directly into the destination
buffer.  We assume that the destination buffer is stable (i.e. remains
unchanged while it is being read to) though.  Otherwise, CRC errors
ensue:

  libceph: read_partial_message 0000000048edf8ad data crc 1063286393 != exp. 228122706
  libceph: osd1 (1)192.168.122.1:6843 bad crc/signature

  libceph: bad data crc, calculated 57958023, expected 1805382778
  libceph: osd2 (2)192.168.122.1:6876 integrity error, bad crc

Introduce rxbounce option to enable use of a bounce buffer when
receiving message data.  In particular this is needed if a mapped
image is a Windows VM disk, passed to QEMU.  Windows has a system-wide
"dummy" page that may be mapped into the destination buffer (potentially
more than once into the same buffer) by the Windows Memory Manager in
an effort to generate a single large I/O [1][2].  QEMU makes a point of
preserving overlap relationships when cloning I/O vectors, so krbd gets
exposed to this behaviour.

[1] "What Is Really in That MDL?"
    https://docs.microsoft.com/en-us/previous-versions/windows/hardware/design/dn614012(v=vs.85)
[2] https://blogs.msmvps.com/kernelmustard/2005/05/04/dummy-pages/

URL: https://bugzilla.redhat.com/show_bug.cgi?id=1973317
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2022-02-02 18:50:36 +01:00
Linus Torvalds
64f29d8856 The highlight is the new mount "device" string syntax implemented
by Venky Shankar.  It solves some long-standing issues with using
 different auth entities and/or mounting different CephFS filesystems
 from the same cluster, remounting and also misleading /proc/mounts
 contents.  The existing syntax of course remains to be maintained.
 
 On top of that, there is a couple of fixes for edge cases in quota
 and a new mount option for turning on unbuffered I/O mode globally
 instead of on a per-file basis with ioctl(CEPH_IOC_SYNCIO).
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmHpP5ATHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi0TgB/480i2lPHgA3ujJNqo5Q6z+W0vtTA2+
 Wx+4rAUgIESJVunbFxvecPbzyUXTe7wWFI11TCVHPpf6GyIIDTD+uHd3kKWtLsfL
 Zkk1/2PN9Q5Dh29R+N8rP9NaP8tIaTQjyiO3iqmRZlo+k0Z/lYtWUb+fUP05XlVY
 ML/ktW543tkKeYwl3SWdW5MqAAOVGDbTt+L51CraDhVoiUac5ptkP+cmDmIqsnGa
 ZHVqpwugxgndEIyuBHDLBps+5/LrEaL10xDhGcMtP9hwGYhyNr6Yj+azfGtHWwOi
 jdVsdHDiecUBVtGyZ351Y4pCMOmP0uJif6MOUZFXYYSSeUBUhH8UjgEi
 =jcte
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.17-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The highlight is the new mount "device" string syntax implemented by
  Venky Shankar. It solves some long-standing issues with using
  different auth entities and/or mounting different CephFS filesystems
  from the same cluster, remounting and also misleading /proc/mounts
  contents. The existing syntax of course remains to be maintained.

  On top of that, there is a couple of fixes for edge cases in quota and
  a new mount option for turning on unbuffered I/O mode globally instead
  of on a per-file basis with ioctl(CEPH_IOC_SYNCIO)"

* tag 'ceph-for-5.17-rc1' of git://github.com/ceph/ceph-client:
  ceph: move CEPH_SUPER_MAGIC definition to magic.h
  ceph: remove redundant Lsx caps check
  ceph: add new "nopagecache" option
  ceph: don't check for quotas on MDS stray dirs
  ceph: drop send metrics debug message
  rbd: make const pointer spaces a static const array
  ceph: Fix incorrect statfs report for small quota
  ceph: mount syntax module parameter
  doc: document new CephFS mount device syntax
  ceph: record updated mon_addr on remount
  ceph: new device mount syntax
  libceph: rename parse_fsid() to ceph_parse_fsid() and export
  libceph: generalize addr/ip parsing based on delimiter
2022-01-20 13:46:20 +02:00
Michal Hocko
a421ef3030 mm: allow !GFP_KERNEL allocations for kvmalloc
Support for GFP_NO{FS,IO} and __GFP_NOFAIL has been implemented by
previous patches so we can allow the support for kvmalloc.  This will
allow some external users to simplify or completely remove their
helpers.

GFP_NOWAIT semantic hasn't been supported so far but it hasn't been
explicitly documented so let's add a note about that.

ceph_kvmalloc is the first helper to be dropped and changed to kvmalloc.

Link: https://lkml.kernel.org/r/20211122153233.9924-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15 16:30:29 +02:00
Venky Shankar
2d7c86a8f9 libceph: generalize addr/ip parsing based on delimiter
... and remove hardcoded function name in ceph_parse_ips().

[ idryomov: delim parameter, drop CEPH_ADDR_PARSE_DEFAULT_DELIM ]

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-01-13 13:40:05 +01:00
Ilya Dryomov
cd1a677cad libceph, ceph: implement msgr2.1 protocol (crc and secure modes)
Implement msgr2.1 wire protocol, available since nautilus 14.2.11
and octopus 15.2.5.  msgr2.0 wire protocol is not implemented -- it
has several security, integrity and robustness issues and therefore
considered deprecated.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
a56dd9bf47 libceph: move msgr1 protocol specific fields to its own struct
A couple whitespace fixups, no functional changes.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
2f713615dd libceph: move msgr1 protocol implementation to its own file
A pure move, no other changes.

Note that ceph_tcp_recv{msg,page}() and ceph_tcp_send{msg,page}()
helpers are also moved.  msgr2 will bring its own, more efficient,
variants based on iov_iter.  Switching msgr1 to them was considered
but decided against to avoid subtle regressions.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
566050e17e libceph: separate msgr1 protocol implementation
In preparation for msgr2, define internal messenger <-> protocol
interface (as opposed to external messenger <-> client interface, which
is struct ceph_connection_operations) consisting of try_read(),
try_write(), revoke(), revoke_incoming(), opened(), reset_session() and
reset_protocol() ops.  The semantics are exactly the same as they are
now.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
6503e0b69c libceph: export remaining protocol independent infrastructure
In preparation for msgr2, make all protocol independent functions
in messenger.c global.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
699921d9e6 libceph: export zero_page
In preparation for msgr2, make zero_page global.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
3fefd43e74 libceph: rename and export con->flags bits
In preparation for msgr2, move the defines to the header file.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
6d7f62bfb5 libceph: rename and export con->state states
In preparation for msgr2, rename msgr1 specific states and move the
defines to the header file.

Also drop state transition comments.  They don't cover all possible
transitions (e.g. NEGOTIATING -> STANDBY, etc) and currently do more
harm than good.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
30be780a87 libceph: make con->state an int
unsigned long is a leftover from when con->state used to be a set of
bits managed with set_bit(), clear_bit(), etc.  Save a bit of memory.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
2f68738037 libceph: don't export ceph_messenger_{init_fini}() to modules
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
fd1a154cad libceph: make sure our addr->port is zero and addr->nonce is non-zero
Our messenger instance addr->port is normally zero -- anything else is
nonsensical because as a client we connect to multiple servers and don't
listen on any port.  However, a user can supply an arbitrary addr:port
via ip option and the port is currently preserved.  Zero it.

Conversely, make sure our addr->nonce is non-zero.  A zero nonce is
special: in combination with a zero port, it is used to blocklist the
entire ip.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
771294fe07 libceph: factor out ceph_con_get_out_msg()
Move the logic of grabbing the next message from the queue into its own
function.  Like ceph_con_in_msg_alloc(), this is protocol independent.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
fc4c128e15 libceph: change ceph_con_in_msg_alloc() to take hdr
ceph_con_in_msg_alloc() is protocol independent, but con->in_hdr (and
struct ceph_msg_header in general) is msgr1 specific.  While the struct
is deeply ingrained inside and outside the messenger, con->in_hdr field
can be separated.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
8ee8abf797 libceph: change ceph_msg_data_cursor_init() to take cursor
Make it possible to have local cursors and embed them outside struct
ceph_msg.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
0247192809 libceph: handle discarding acked and requeued messages separately
Make it easier to follow and remove dependency on msgr1 specific
CEPH_MSGR_TAG_SEQ.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
5cd8da3a1c libceph: drop msg->ack_stamp field
It is set in process_ack() but never used.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
d3c1248cac libceph: remove redundant session reset log message
Stick with pr_info message because session reset isn't an error most of
the time.  When it is (i.e. if the server denies the reconnect attempt),
we get a bunch of other pr_err messages.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
a3da057bbd libceph: clear con->peer_global_seq on RESETSESSION
con->peer_global_seq is part of session state.  Clear it when
the server tells us to reset, not just in ceph_con_close().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Ilya Dryomov
5963c3d01c libceph: rename reset_connection() to ceph_con_reset_session()
With just session reset bits left, rename appropriately.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Ilya Dryomov
3596f4c124 libceph: split protocol reset bits out of reset_connection()
Move protocol reset bits into ceph_con_reset_protocol(), leaving
just session reset bits.

Note that con->out_skip is now reset on faults.  This fixes a crash
in the case of a stateful session getting a fault while in the middle
of revoking a message.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Ilya Dryomov
90b6561a05 libceph: don't call reset_connection() on version/feature mismatches
A fault due to a version mismatch or a feature set mismatch used to be
treated differently from other faults: the connection would get closed
without trying to reconnect and there was a ->bad_proto() connection op
for notifying about that.

This changed a long time ago, see commits 6384bb8b8e ("libceph: kill
bad_proto ceph connection op") and 0fa6ebc600 ("libceph: fix protocol
feature mismatch failure path").  Nowadays these aren't any different
from other faults (i.e. we try to reconnect even though the mismatch
won't resolve until the server is replaced).  reset_connection() calls
there are rather confusing because reset_connection() resets a session
together an individual instance of the protocol.  This is cleaned up
in the next patch.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Ilya Dryomov
418af5b3bf libceph: lower exponential backoff delay
The current setting allows the backoff to climb up to 5 minutes.  This
is too high -- it becomes hard to tell whether the client is stuck on
something or just in backoff.

In userspace, ms_max_backoff is defaulted to 15 seconds.  Let's do the
same.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Ilya Dryomov
b77f8f0e4f libceph: include middle_len in process_message() dout
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Ilya Dryomov
28e1581c3b libceph: clear con->out_msg on Policy::stateful_server faults
con->out_msg must be cleared on Policy::stateful_server
(!CEPH_MSG_CONNECT_LOSSY) faults.  Not doing so botches the
reconnection attempt, because after writing the banner the
messenger moves on to writing the data section of that message
(either from where it got interrupted by the connection reset or
from the beginning) instead of writing struct ceph_msg_connect.
This results in a bizarre error message because the server
sends CEPH_MSGR_TAG_BADPROTOVER but we think we wrote struct
ceph_msg_connect:

  libceph: mds0 (1)172.21.15.45:6828 socket error on write
  ceph: mds0 reconnect start
  libceph: mds0 (1)172.21.15.45:6829 socket closed (con state OPEN)
  libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch, my 32 != server's 32
  libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch

AFAICT this bug goes back to the dawn of the kernel client.
The reason it survived for so long is that only MDS sessions
are stateful and only two MDS messages have a data section:
CEPH_MSG_CLIENT_RECONNECT (always, but reconnecting is rare)
and CEPH_MSG_CLIENT_REQUEST (only when xattrs are involved).
The connection has to get reset precisely when such message
is being sent -- in this case it was the former.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/47723
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2020-10-12 15:29:27 +02:00
Ilya Dryomov
a9dfe31e5c libceph: format ceph_entity_addr nonces as unsigned
Match the server side logs.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Ilya Dryomov
5a5036c89f libceph: move a dout in queue_con_delay()
The queued con->work can start executing (and therefore logging)
before we get to this "con->work has been queued" message, making
the logs confusing.  Move it up, with the meaning of "con->work
is about to be queued".

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Coly Li
40efc4dc73 libceph: use sendpage_ok() in ceph_tcp_sendpage()
In libceph, ceph_tcp_sendpage() does the following checks before handle
the page by network layer's zero copy sendpage method,
	if (page_count(page) >= 1 && !PageSlab(page))

This check is exactly what sendpage_ok() does. This patch replace the
open coded checks by sendpage_ok() as a code cleanup.

Signed-off-by: Coly Li <colyli@suse.de>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-02 15:27:08 -07:00
Gustavo A. R. Silva
df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
Christoph Hellwig
12abc5ee78 tcp: add tcp_sock_set_nodelay
Add a helper to directly set the TCP_NODELAY sockopt from kernel space
without going through a fake uaccess.  Cleanup the callers to avoid
pointless wrappers now that this is a simple function call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Ilya Dryomov
e886274031 libceph: fix alloc_msg_with_page_vector() memory leaks
Make it so that CEPH_MSG_DATA_PAGES data item can own pages,
fixing a bunch of memory leaks for a page vector allocated in
alloc_msg_with_page_vector().  Currently, only watch-notify
messages trigger this allocation, and normally the page vector
is freed either in handle_watch_notify() or by the caller of
ceph_osdc_notify().  But if the message is freed before that
(e.g. if the session faults while reading in the message or
if the notify is stale), we leak the page vector.

This was supposed to be fixed by switching to a message-owned
pagelist, but that never happened.

Fixes: 1907920324 ("libceph: support for sending notifies")
Reported-by: Roman Penyaev <rpenyaev@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
2020-03-23 13:07:08 +01:00
David Howells
82995cc6c5 libceph, rbd, ceph: convert to use the new mount API
Convert the ceph filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

[ Numerous string handling, leak and regression fixes; rbd conversion
  was particularly broken and had to be redone almost from scratch. ]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-27 22:28:37 +01:00
Yan, Zheng
120a75ea9f libceph: add function that reset client's entity addr
This function also re-open connections to OSD/MON, and re-send in-flight
OSD requests after re-opening connections to OSD.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-09-16 12:06:23 +02:00
Linus Torvalds
d9b9c89304 Lots of exciting things this time!
- support for rbd object-map and fast-diff features (myself).  This
   will speed up reads, discards and things like snap diffs on sparse
   images.
 
 - ceph.snap.btime vxattr to expose snapshot creation time (David
   Disseldorp).  This will be used to integrate with "Restore Previous
   Versions" feature added in Windows 7 for folks who reexport ceph
   through SMB.
 
 - security xattrs for ceph (Zheng Yan).  Only selinux is supported
   for now due to the limitations of ->dentry_init_security().
 
 - support for MSG_ADDR2, FS_BTIME and FS_CHANGE_ATTR features (Jeff
   Layton).  This is actually a single feature bit which was missing
   because of the filesystem pieces.  With this in, the kernel client
   will finally be reported as "luminous" by "ceph features" -- it is
   still being reported as "jewel" even though all required Luminous
   features were implemented in 4.13.
 
 - stop NULL-terminating ceph vxattrs (Jeff Layton).  The convention
   with xattrs is to not terminate and this was causing inconsistencies
   with ceph-fuse.
 
 - change filesystem time granularity from 1 us to 1 ns, again fixing
   an inconsistency with ceph-fuse (Luis Henriques).
 
 On top of this there are some additional dentry name handling and cap
 flushing fixes from Zheng.  Finally, Jeff is formally taking over for
 Zheng as the filesystem maintainer.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl0u+X8THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi9byB/9fIxzoxtDMvixtJabuGSJRtlijDlWF
 GlO6yIWCXl/8v8easR2PCF75U/xv0+QFQmze8PVi8u4Xz589P247NnEuyEZ9n84i
 aCavARho6QLZPEL+B04NaqoHBl+ORKQTA6eKGhyKwRp/rn83z5Ubuw2tN7krHT3b
 kCY61FuTQGxNY2o/WKv/iLwINYr7H23hCf0WwyyKH1bp7OegiQ14Ebn1NtfS3sMx
 hS6h8Ya826vmUW0bCSS/9kzKYBCjksTig0HphUOHq6BoZJs++0b7GukIulRyuLfD
 J9Gr9HGPoDCVzdmFfpn2FSlxdmqfO9amUSagd0ftLQfFlPlrpoULi0GW
 =Bgxr
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.3-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "Lots of exciting things this time!

   - support for rbd object-map and fast-diff features (myself). This
     will speed up reads, discards and things like snap diffs on sparse
     images.

   - ceph.snap.btime vxattr to expose snapshot creation time (David
     Disseldorp). This will be used to integrate with "Restore Previous
     Versions" feature added in Windows 7 for folks who reexport ceph
     through SMB.

   - security xattrs for ceph (Zheng Yan). Only selinux is supported for
     now due to the limitations of ->dentry_init_security().

   - support for MSG_ADDR2, FS_BTIME and FS_CHANGE_ATTR features (Jeff
     Layton). This is actually a single feature bit which was missing
     because of the filesystem pieces. With this in, the kernel client
     will finally be reported as "luminous" by "ceph features" -- it is
     still being reported as "jewel" even though all required Luminous
     features were implemented in 4.13.

   - stop NULL-terminating ceph vxattrs (Jeff Layton). The convention
     with xattrs is to not terminate and this was causing
     inconsistencies with ceph-fuse.

   - change filesystem time granularity from 1 us to 1 ns, again fixing
     an inconsistency with ceph-fuse (Luis Henriques).

  On top of this there are some additional dentry name handling and cap
  flushing fixes from Zheng. Finally, Jeff is formally taking over for
  Zheng as the filesystem maintainer"

* tag 'ceph-for-5.3-rc1' of git://github.com/ceph/ceph-client: (71 commits)
  ceph: fix end offset in truncate_inode_pages_range call
  ceph: use generic_delete_inode() for ->drop_inode
  ceph: use ceph_evict_inode to cleanup inode's resource
  ceph: initialize superblock s_time_gran to 1
  MAINTAINERS: take over for Zheng as CephFS kernel client maintainer
  rbd: setallochint only if object doesn't exist
  rbd: support for object-map and fast-diff
  rbd: call rbd_dev_mapping_set() from rbd_dev_image_probe()
  libceph: export osd_req_op_data() macro
  libceph: change ceph_osdc_call() to take page vector for response
  libceph: bump CEPH_MSG_MAX_DATA_LEN (again)
  rbd: new exclusive lock wait/wake code
  rbd: quiescing lock should wait for image requests
  rbd: lock should be quiesced on reacquire
  rbd: introduce copyup state machine
  rbd: rename rbd_obj_setup_*() to rbd_obj_init_*()
  rbd: move OSD request allocation into object request state machines
  rbd: factor out __rbd_osd_setup_discard_ops()
  rbd: factor out rbd_osd_setup_copyup()
  rbd: introduce obj_req->osd_reqs list
  ...
2019-07-18 11:05:25 -07:00
Jeff Layton
2c66de560f libceph: rename ceph_encode_addr to ceph_encode_banner_addr
...ditto for the decode function. We only use these functions to fix
up banner addresses now, so let's name them more appropriately.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:43 +02:00
Jeff Layton
d3c3c0a841 libceph: use TYPE_LEGACY for entity addrs instead of TYPE_NONE
Going forward, we'll have different address types so let's use
the addr2 TYPE_LEGACY for internal tracking rather than TYPE_NONE.

Also, make ceph_pr_addr print the address type value as well.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:43 +02:00
Jeff Layton
bc07532cc5 libceph: fix sa_family just after reading address
It doesn't make sense to leave it undecoded until later.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:43 +02:00
David Howells
a58946c158 keys: Pass the network namespace into request_key mechanism
Create a request_key_net() function and use it to pass the network
namespace domain tag into DNS revolver keys and rxrpc/AFS keys so that keys
for different domains can coexist in the same keyring.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: netdev@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-afs@lists.infradead.org
2019-06-27 23:02:12 +01:00
Linus Torvalds
227747fb9e AFS fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAXN2BC/u3V2unywtrAQJm9w/+L7ufbRkj6XGVongmhf4n+auBQXMJ4jec
 zN6bjWrp/SN9kJfOqOKA+sk9s3cCOCV8SF/2eM5P8DJNtrB6aXlg590u1wSkOp99
 FdSM8Fy7v4bTwW9hCBhvcFpC+layVUEv/WAsCCIZi94W+H43XFY4QM79cqoqIx8r
 nTLu9EcjWFpUoBIAYEU0x/h4IA5Cyl6CUw3YZhZYaGoLLfi9EZkgBLlUU+6OXpDO
 Uepzn1gnpXMCNsiBE/Hr9LR0pfOTtzdJuNADrppRnbPfky8RsPE8tuk6kT6301U1
 IxG66SafYsvbQGzyIdfTydl022DFj5LOtCPFtfALviJqdBOGE/zPPnrBPinHg4oJ
 40P2tIJ/+Ksz5cPzmkA1KanSXaQ2v0sLBVdQJ7yt5EFuAMzj/roWpiPmEmQd6KqB
 ixZdZLehKFPaAB5cR41fHV1jB30HN7oakwqCoYmXd1Chu3AlB15yV9WZMSqjPS8P
 pkNC/X5mU5hDnZUx9e3Fbu8LqoGOjnGvDn5jOxihdKfaGu3A4OlbSerIUbRHvnT8
 u8XDPoq4j61f04MiI9z/bPDFTRYyycIQPcHYQpi4MJt9lSkkydP217P60BJsUv2n
 NIPYwgI7VIse0Gdo8shIg+RnSnJaKHT9Sf86h8pyDFO6wZp/GVVqPSdjjU+Lv5fv
 CZGJ7PCYcfs=
 =2q2Y
 -----END PGP SIGNATURE-----

Merge tag 'afs-fixes-20190516' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull misc AFS fixes from David Howells:
 "This fixes a set of miscellaneous issues in the afs filesystem,
  including:

   - leak of keys on file close.

   - broken error handling in xattr functions.

   - missing locking when updating VL server list.

   - volume location server DNS lookup whereby preloaded cells may not
     ever get a lookup and regular DNS lookups to maintain server lists
     consume power unnecessarily.

   - incorrect error propagation and handling in the fileserver
     iteration code causes operations to sometimes apparently succeed.

   - interruption of server record check/update side op during
     fileserver iteration causes uninterruptible main operations to fail
     unexpectedly.

   - callback promise expiry time miscalculation.

   - over invalidation of the callback promise on directories.

   - double locking on callback break waking up file locking waiters.

   - double increment of the vnode callback break counter.

  Note that it makes some changes outside of the afs code, including:

   - an extra parameter to dns_query() to allow the dns_resolver key
     just accessed to be immediately invalidated. AFS is caching the
     results itself, so the key can be discarded.

   - an interruptible version of wait_var_event().

   - an rxrpc function to allow the maximum lifespan to be set on a
     call.

   - a way for an rxrpc call to be marked as non-interruptible"

* tag 'afs-fixes-20190516' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Fix double inc of vnode->cb_break
  afs: Fix lock-wait/callback-break double locking
  afs: Don't invalidate callback if AFS_VNODE_DIR_VALID not set
  afs: Fix calculation of callback expiry time
  afs: Make dynamic root population wait uninterruptibly for proc_cells_lock
  afs: Make some RPC operations non-interruptible
  rxrpc: Allow the kernel to mark a call as being non-interruptible
  afs: Fix error propagation from server record check/update
  afs: Fix the maximum lifespan of VL and probe calls
  rxrpc: Provide kernel interface to set max lifespan on a call
  afs: Fix "kAFS: AFS vnode with undefined type 0"
  afs: Fix cell DNS lookup
  Add wait_var_event_interruptible()
  dns_resolver: Allow used keys to be invalidated
  afs: Fix afs_cell records to always have a VL server list record
  afs: Fix missing lock when replacing VL server list
  afs: Fix afs_xattr_get_yfs() to not try freeing an error value
  afs: Fix incorrect error handling in afs_xattr_get_acl()
  afs: Fix key leak in afs_release() and afs_evict_inode()
2019-05-16 17:00:13 -07:00
David Howells
d0660f0b3b dns_resolver: Allow used keys to be invalidated
Allow used DNS resolver keys to be invalidated after use if the caller is
doing its own caching of the results.  This reduces the amount of resources
required.

Fix AFS to invalidate DNS results to kill off permanent failure records
that get lodged in the resolver keyring and prevent future lookups from
happening.

Fixes: 0a5143f2f8 ("afs: Implement VL server rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-15 17:35:54 +01:00