Commit Graph

2409 Commits

Author SHA1 Message Date
Xiubo Li 834c93dcef ceph: stop copying to iter at EOF on sync reads
[ Upstream commit 1065da21e5 ]

If EOF is encountered, ceph_sync_read() return value is adjusted down
according to i_size, but the "to" iter is advanced by the actual number
of bytes read.  Then, when retrying, the remainder of the range may be
skipped incorrectly.

Ensure that the "to" iter is advanced only until EOF.

[ idryomov: changelog ]

Fixes: c3d8e0b5de ("ceph: return the real size read when it hits EOF")
Reported-by: Frank Hsiao <frankhsiao@qnap.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Frank Hsiao <frankhsiao@qnap.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-26 18:17:36 -04:00
Xiubo Li 51d31149a8 ceph: switch to corrected encoding of max_xattr_size in mdsmap
The addition of bal_rank_mask with encoding version 17 was merged
into ceph.git in Oct 2022 and made it into v18.2.0 release normally.
A few months later, the much delayed addition of max_xattr_size got
merged, also with encoding version 17, placed before bal_rank_mask
in the encoding -- but it didn't make v18.2.0 release.

The way this ended up being resolved on the MDS side is that
bal_rank_mask will continue to be encoded in version 17 while
max_xattr_size is now encoded in version 18.  This does mean that
older kernels will misdecode version 17, but this is also true for
v18.2.0 and v18.2.1 clients in userspace.

The best we can do is backport this adjustment -- see ceph.git
commit 78abfeaff27fee343fb664db633de5b221699a73 for details.

[ idryomov: changelog ]

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/64440
Fixes: d93231a6bc ("ceph: prevent a client from exceeding the MDS maximum xattr size")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-02-26 19:20:30 +01:00
Xiubo Li dbc347ef7f ceph: add ceph_cap_unlink_work to fire check_caps() immediately
When unlinking a file the check caps could be delayed for more than
5 seconds, but in MDS side it maybe waiting for the clients to
release caps.

This will use the cap_wq work queue and a dedicated list to help
fire the check_caps() and dirty buffer flushing immediately.

Link: https://tracker.ceph.com/issues/50223
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-02-13 11:22:54 +01:00
Xiubo Li 902d6d013f ceph: always queue a writeback when revoking the Fb caps
In case there is 'Fw' dirty caps and 'CHECK_CAPS_FLUSH' is set we
will always ignore queue a writeback. Queue a writeback is very
important because it will block kclient flushing the snapcaps to
MDS and which will block MDS waiting for revoking the 'Fb' caps.

Link: https://tracker.ceph.com/issues/50223
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-02-13 11:22:35 +01:00
Xiubo Li 07045648c0 ceph: always check dir caps asynchronously
The MDS will issue the 'Fr' caps for async dirop, while there is
buggy in kclient and it could miss releasing the async dirop caps,
which is 'Fsxr'. And then the MDS will complain with:

"[WRN] client.xxx isn't responding to mclientcaps(revoke) ..."

So when releasing the dirop async requests or when they fail we
should always make sure that being revoked caps could be released.

Link: https://tracker.ceph.com/issues/50223
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-02-07 14:58:02 +01:00
Rishabh Dave cda4672da1 ceph: prevent use-after-free in encode_cap_msg()
In fs/ceph/caps.c, in encode_cap_msg(), "use after free" error was
caught by KASAN at this line - 'ceph_buffer_get(arg->xattr_buf);'. This
implies before the refcount could be increment here, it was freed.

In same file, in "handle_cap_grant()" refcount is decremented by this
line - 'ceph_buffer_put(ci->i_xattrs.blob);'. It appears that a race
occurred and resource was freed by the latter line before the former
line could increment it.

encode_cap_msg() is called by __send_cap() and __send_cap() is called by
ceph_check_caps() after calling __prep_cap(). __prep_cap() is where
arg->xattr_buf is assigned to ci->i_xattrs.blob. This is the spot where
the refcount must be increased to prevent "use after free" error.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/59259
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-02-07 14:58:02 +01:00
Xiubo Li bbb20ea993 ceph: always set initial i_blkbits to CEPH_FSCRYPT_BLOCK_SHIFT
The fscrypt code will use i_blkbits to setup ci_data_unit_bits when
allocating the new inode, but ceph will initiate i_blkbits ater when
filling the inode, which is too late. Since ci_data_unit_bits will only
be used by the fscrypt framework so initiating i_blkbits with
CEPH_FSCRYPT_BLOCK_SHIFT is safe.

Link: https://tracker.ceph.com/issues/64035
Fixes: 5b11888471 ("fscrypt: support crypto data unit size less than filesystem block size")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-02-07 14:43:29 +01:00
Linus Torvalds 556e2d17ca Assorted CephFS fixes and cleanups with nothing standing out.
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmWqmP8THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi3cQB/0XJABiPkolqNtd3dSGw8x2YnpS6ciV
 yHxpJViF0+qmnS5l6Vn2lEDr/57h/jts0t3kXUUSDVbitK9glim5ar2FsBeuY7gi
 lQbqhFPfQ+G3APDn2Dn27JYvO1VQLMmvuFJyE4rJ03XZjvOYpq4zM3zPO0jPGvCN
 Gnw0VqPst/h4eobcsFEsHvHuMkkVy6YIOQPsDkiYUShaY6OBUWM4kewrlztmEvaK
 fyuo/FSNmZeEkoc5R7Pfo1FE4PZzfdUie7RmEznxqgHUWFmx2jKZ5TwnCZt1D2av
 dV2e2JWnZUZZL9vAnCQddvnYrj8j+an/IbGZ+0Wa5DZo/eMglDd01VV2
 =kNSw
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.8-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "Assorted CephFS fixes and cleanups with nothing standing out"

* tag 'ceph-for-6.8-rc1' of https://github.com/ceph/ceph-client:
  ceph: get rid of passing callbacks in __dentry_leases_walk()
  ceph: d_obtain_{alias,root}(ERR_PTR(...)) will do the right thing
  ceph: fix invalid pointer access if get_quota_realm return ERR_PTR
  ceph: remove duplicated code in ceph_netfs_issue_read()
  ceph: send oldest_client_tid when renewing caps
  ceph: rename create_session_open_msg() to create_session_full_msg()
  ceph: select FS_ENCRYPTION_ALGS if FS_ENCRYPTION
  ceph: fix deadlock or deadcode of misusing dget()
  ceph: try to allocate a smaller extent map for sparse read
  libceph: remove MAX_EXTENTS check for sparse reads
  ceph: reinitialize mds feature bit even when session in open
  ceph: skip reconnecting if MDS is not ready
2024-01-19 09:58:55 -08:00
Linus Torvalds 16df6e07d6 vfs-6.8.netfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZabMrQAKCRCRxhvAZXjc
 ovnUAQDgCOonb1tjtTvC8s8IMDUEoaVYZI91KVfsZQSJYN1sdQD+KfJmX1BhJnWG
 l0cEffGfnWGXMZkZqDgLPHUIPzFrmws=
 =1b3j
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8.netfs' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull netfs updates from Christian Brauner:
 "This extends the netfs helper library that network filesystems can use
  to replace their own implementations. Both afs and 9p are ported. cifs
  is ready as well but the patches are way bigger and will be routed
  separately once this is merged. That will remove lots of code as well.

  The overal goal is to get high-level I/O and knowledge of the page
  cache and ouf of the filesystem drivers. This includes knowledge about
  the existence of pages and folios

  The pull request converts afs and 9p. This removes about 800 lines of
  code from afs and 300 from 9p. For 9p it is now possible to do writes
  in larger than a page chunks. Additionally, multipage folio support
  can be turned on for 9p. Separate patches exist for cifs removing
  another 2000+ lines. I've included detailed information in the
  individual pulls I took.

  Summary:

   - Add NFS-style (and Ceph-style) locking around DIO vs buffered I/O
     calls to prevent these from happening at the same time.

   - Support for direct and unbuffered I/O.

   - Support for write-through caching in the page cache.

   - O_*SYNC and RWF_*SYNC writes use write-through rather than writing
     to the page cache and then flushing afterwards.

   - Support for write-streaming.

   - Support for write grouping.

   - Skip reads for which the server could only return zeros or EOF.

   - The fscache module is now part of the netfs library and the
     corresponding maintainer entry is updated.

   - Some helpers from the fscache subsystem are renamed to mark them as
     belonging to the netfs library.

   - Follow-up fixes for the netfs library.

   - Follow-up fixes for the 9p conversion"

* tag 'vfs-6.8.netfs' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (50 commits)
  netfs: Fix wrong #ifdef hiding wait
  cachefiles: Fix signed/unsigned mixup
  netfs: Fix the loop that unmarks folios after writing to the cache
  netfs: Fix interaction between write-streaming and cachefiles culling
  netfs: Count DIO writes
  netfs: Mark netfs_unbuffered_write_iter_locked() static
  netfs: Fix proc/fs/fscache symlink to point to "netfs" not "../netfs"
  netfs: Rearrange netfs_io_subrequest to put request pointer first
  9p: Use length of data written to the server in preference to error
  9p: Do a couple of cleanups
  9p: Fix initialisation of netfs_inode for 9p
  cachefiles: Fix __cachefiles_prepare_write()
  9p: Use netfslib read/write_iter
  afs: Use the netfs write helpers
  netfs: Export the netfs_sreq tracepoint
  netfs: Optimise away reads above the point at which there can be no data
  netfs: Implement a write-through caching option
  netfs: Provide a launder_folio implementation
  netfs: Provide a writepages implementation
  netfs, cachefiles: Pass upper bound length to allow expansion
  ...
2024-01-19 09:10:23 -08:00
Al Viro 2a965d1b15 ceph: get rid of passing callbacks in __dentry_leases_walk()
__dentry_leases_walk() gets a callback and calls it for
a bunch of denties; there are exactly two callers and
we already have a flag telling them apart - lwc->dir_lease.

Seeing that indirect calls are costly these days, let's
get rid of the callback and just call the right function
directly.  Has a side benefit of saner signatures...

[ xiubli: a minor fix in the commit title ]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:54:54 +01:00
Al Viro f6fb21b22f ceph: d_obtain_{alias,root}(ERR_PTR(...)) will do the right thing
Clean up the code.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:40:51 +01:00
Wenchao Hao 0f4cf64eab ceph: fix invalid pointer access if get_quota_realm return ERR_PTR
This issue is reported by smatch that get_quota_realm() might return
ERR_PTR but we did not handle it. It's not a immediate bug, while we
still should address it to avoid potential bugs if get_quota_realm()
is changed to return other ERR_PTR in future.

Set ceph_snap_realm's pointer in get_quota_realm()'s to address this
issue, the pointer would be set to NULL if get_quota_realm() failed
to get struct ceph_snap_realm, so no ERR_PTR would happen any more.

[ xiubli: minor code style clean up ]

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:40:51 +01:00
Xiubo Li b36b03344f ceph: remove duplicated code in ceph_netfs_issue_read()
When allocating an osd request the libceph.ko will add the
'read_from_replica' flag by default.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:40:51 +01:00
Xiubo Li 6df89bf220 ceph: send oldest_client_tid when renewing caps
Update the oldest_client_tid via the session renew caps msg to
make sure that the MDSs won't pile up the completed request list
in a very large size.

[ idryomov: drop inapplicable comment ]

Link: https://tracker.ceph.com/issues/63364
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:40:51 +01:00
Xiubo Li 66207de308 ceph: rename create_session_open_msg() to create_session_full_msg()
Makes the create session msg helper to be more general and could
be used by other ops.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:40:51 +01:00
Eric Biggers 9c896d6bc3 ceph: select FS_ENCRYPTION_ALGS if FS_ENCRYPTION
The kconfig options for filesystems that support FS_ENCRYPTION are
supposed to select FS_ENCRYPTION_ALGS.  This is needed to ensure that
required crypto algorithms get enabled as loadable modules or builtin as
is appropriate for the set of enabled filesystems.  Do this for CEPH_FS
so that there aren't any missing algorithms if someone happens to have
CEPH_FS as their only enabled filesystem that supports encryption.

Cc: stable@vger.kernel.org
Fixes: f061feda6c ("ceph: add fscrypt ioctls and ceph.fscrypt.auth vxattr")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:40:50 +01:00
Xiubo Li b493ad718b ceph: fix deadlock or deadcode of misusing dget()
The lock order is incorrect between denty and its parent, we should
always make sure that the parent get the lock first.

But since this deadcode is never used and the parent dir will always
be set from the callers, let's just remove it.

Link: https://lore.kernel.org/r/20231116081919.GZ1957730@ZenIV
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:40:50 +01:00
Xiubo Li aaefabc4a5 ceph: try to allocate a smaller extent map for sparse read
In fscrypt case and for a smaller read length we can predict the
max count of the extent map. And for small read length use cases
this could save some memories.

[ idryomov: squash into a single patch to avoid build break, drop
  redundant variable in ceph_alloc_sparse_ext_map() ]

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:40:50 +01:00
Venky Shankar f48e0342a7 ceph: reinitialize mds feature bit even when session in open
Following along the same lines as per the user-space fix. Right
now this isn't really an issue with the ceph kernel driver because
of the feature bit laginess, however, that can change over time
(when the new snaprealm info type is ported to the kernel driver)
and depending on the MDS version that's being upgraded can cause
message decoding issues - so, fix that early on.

Link: http://tracker.ceph.com/issues/63188
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:40:50 +01:00
Xiubo Li cbcb358b74 ceph: skip reconnecting if MDS is not ready
When MDS closed the session the kclient will send to reconnect to
it immediately, but if the MDS just restarted and still not ready
yet, such as still in the up:replay state and the sessionmap journal
logs hasn't be replayed, the MDS will close the session.

And then the kclient could remove the session and later when the
mdsmap is in RECONNECT phrase it will skip reconnecting. But the MDS
will wait until timeout and then evict the kclient.

Just skip sending the reconnection request until the MDS is ready.

Link: https://tracker.ceph.com/issues/62489
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:40:50 +01:00
Linus Torvalds 499aa1ca4e dcache stuff for this cycle
change of locking rules for __dentry_kill(), regularized refcounting
 rules in that area, assorted cleanups and removal of weird corner
 cases (e.g. now ->d_iput() on child is always called before the parent
 might hit __dentry_kill(), etc.)
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZZ+sQQAKCRBZ7Krx/gZQ
 6ybjAQDM5jiS93IUzfHjCWq0nVBX5YGbDAkZOeqxbmIdQb+2UAEA6elP5r0fBBcA
 seo3bry4DirQMDaA/Cjh4+8r71YSOQs=
 =7+Hk
 -----END PGP SIGNATURE-----

Merge tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull dcache updates from Al Viro:
 "Change of locking rules for __dentry_kill(), regularized refcounting
  rules in that area, assorted cleanups and removal of weird corner
  cases (e.g. now ->d_iput() on child is always called before the parent
  might hit __dentry_kill(), etc)"

* tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
  dcache: remove unnecessary NULL check in dget_dlock()
  kill DCACHE_MAY_FREE
  __d_unalias() doesn't use inode argument
  d_alloc_parallel(): in-lookup hash insertion doesn't need an RCU variant
  get rid of DCACHE_GENOCIDE
  d_genocide(): move the extern into fs/internal.h
  simple_fill_super(): don't bother with d_genocide() on failure
  nsfs: use d_make_root()
  d_alloc_pseudo(): move setting ->d_op there from the (sole) caller
  kill d_instantate_anon(), fold __d_instantiate_anon() into remaining caller
  retain_dentry(): introduce a trimmed-down lockless variant
  __dentry_kill(): new locking scheme
  d_prune_aliases(): use a shrink list
  switch select_collect{,2}() to use of to_shrink_list()
  to_shrink_list(): call only if refcount is 0
  fold dentry_kill() into dput()
  don't try to cut corners in shrink_lock_dentry()
  fold the call of retain_dentry() into fast_dput()
  Call retain_dentry() with refcount 0
  dentry_kill(): don't bother with retain_dentry() on slow path
  ...
2024-01-11 20:11:35 -08:00
Linus Torvalds fb46e22a9e Many singleton patches against the MM code. The patch series which
are included in this merge do the following:
 
 - Peng Zhang has done some mapletree maintainance work in the
   series
 
 	"maple_tree: add mt_free_one() and mt_attr() helpers"
 	"Some cleanups of maple tree"
 
 - In the series "mm: use memmap_on_memory semantics for dax/kmem"
   Vishal Verma has altered the interworking between memory-hotplug
   and dax/kmem so that newly added 'device memory' can more easily
   have its memmap placed within that newly added memory.
 
 - Matthew Wilcox continues folio-related work (including a few
   fixes) in the patch series
 
 	"Add folio_zero_tail() and folio_fill_tail()"
 	"Make folio_start_writeback return void"
 	"Fix fault handler's handling of poisoned tail pages"
 	"Convert aops->error_remove_page to ->error_remove_folio"
 	"Finish two folio conversions"
 	"More swap folio conversions"
 
 - Kefeng Wang has also contributed folio-related work in the series
 
 	"mm: cleanup and use more folio in page fault"
 
 - Jim Cromie has improved the kmemleak reporting output in the
   series "tweak kmemleak report format".
 
 - In the series "stackdepot: allow evicting stack traces" Andrey
   Konovalov to permits clients (in this case KASAN) to cause
   eviction of no longer needed stack traces.
 
 - Charan Teja Kalla has fixed some accounting issues in the page
   allocator's atomic reserve calculations in the series "mm:
   page_alloc: fixes for high atomic reserve caluculations".
 
 - Dmitry Rokosov has added to the samples/ dorectory some sample
   code for a userspace memcg event listener application.  See the
   series "samples: introduce cgroup events listeners".
 
 - Some mapletree maintanance work from Liam Howlett in the series
   "maple_tree: iterator state changes".
 
 - Nhat Pham has improved zswap's approach to writeback in the
   series "workload-specific and memory pressure-driven zswap
   writeback".
 
 - DAMON/DAMOS feature and maintenance work from SeongJae Park in
   the series
 
 	"mm/damon: let users feed and tame/auto-tune DAMOS"
 	"selftests/damon: add Python-written DAMON functionality tests"
 	"mm/damon: misc updates for 6.8"
 
 - Yosry Ahmed has improved memcg's stats flushing in the series
   "mm: memcg: subtree stats flushing and thresholds".
 
 - In the series "Multi-size THP for anonymous memory" Ryan Roberts
   has added a runtime opt-in feature to transparent hugepages which
   improves performance by allocating larger chunks of memory during
   anonymous page faults.
 
 - Matthew Wilcox has also contributed some cleanup and maintenance
   work against eh buffer_head code int he series "More buffer_head
   cleanups".
 
 - Suren Baghdasaryan has done work on Andrea Arcangeli's series
   "userfaultfd move option".  UFFDIO_MOVE permits userspace heap
   compaction algorithms to move userspace's pages around rather than
   UFFDIO_COPY'a alloc/copy/free.
 
 - Stefan Roesch has developed a "KSM Advisor", in the series
   "mm/ksm: Add ksm advisor".  This is a governor which tunes KSM's
   scanning aggressiveness in response to userspace's current needs.
 
 - Chengming Zhou has optimized zswap's temporary working memory
   use in the series "mm/zswap: dstmem reuse optimizations and
   cleanups".
 
 - Matthew Wilcox has performed some maintenance work on the
   writeback code, both code and within filesystems.  The series is
   "Clean up the writeback paths".
 
 - Andrey Konovalov has optimized KASAN's handling of alloc and
   free stack traces for secondary-level allocators, in the series
   "kasan: save mempool stack traces".
 
 - Andrey also performed some KASAN maintenance work in the series
   "kasan: assorted clean-ups".
 
 - David Hildenbrand has gone to town on the rmap code.  Cleanups,
   more pte batching, folio conversions and more.  See the series
   "mm/rmap: interface overhaul".
 
 - Kinsey Ho has contributed some maintenance work on the MGLRU
   code in the series "mm/mglru: Kconfig cleanup".
 
 - Matthew Wilcox has contributed lruvec page accounting code
   cleanups in the series "Remove some lruvec page accounting
   functions".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZyF2wAKCRDdBJ7gKXxA
 jjWjAP42LHvGSjp5M+Rs2rKFL0daBQsrlvy6/jCHUequSdWjSgEAmOx7bc5fbF27
 Oa8+DxGM9C+fwqZ/7YxU2w/WuUmLPgU=
 =0NHs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Peng Zhang has done some mapletree maintainance work in the series

	'maple_tree: add mt_free_one() and mt_attr() helpers'
	'Some cleanups of maple tree'

   - In the series 'mm: use memmap_on_memory semantics for dax/kmem'
     Vishal Verma has altered the interworking between memory-hotplug
     and dax/kmem so that newly added 'device memory' can more easily
     have its memmap placed within that newly added memory.

   - Matthew Wilcox continues folio-related work (including a few fixes)
     in the patch series

	'Add folio_zero_tail() and folio_fill_tail()'
	'Make folio_start_writeback return void'
	'Fix fault handler's handling of poisoned tail pages'
	'Convert aops->error_remove_page to ->error_remove_folio'
	'Finish two folio conversions'
	'More swap folio conversions'

   - Kefeng Wang has also contributed folio-related work in the series

	'mm: cleanup and use more folio in page fault'

   - Jim Cromie has improved the kmemleak reporting output in the series
     'tweak kmemleak report format'.

   - In the series 'stackdepot: allow evicting stack traces' Andrey
     Konovalov to permits clients (in this case KASAN) to cause eviction
     of no longer needed stack traces.

   - Charan Teja Kalla has fixed some accounting issues in the page
     allocator's atomic reserve calculations in the series 'mm:
     page_alloc: fixes for high atomic reserve caluculations'.

   - Dmitry Rokosov has added to the samples/ dorectory some sample code
     for a userspace memcg event listener application. See the series
     'samples: introduce cgroup events listeners'.

   - Some mapletree maintanance work from Liam Howlett in the series
     'maple_tree: iterator state changes'.

   - Nhat Pham has improved zswap's approach to writeback in the series
     'workload-specific and memory pressure-driven zswap writeback'.

   - DAMON/DAMOS feature and maintenance work from SeongJae Park in the
     series

	'mm/damon: let users feed and tame/auto-tune DAMOS'
	'selftests/damon: add Python-written DAMON functionality tests'
	'mm/damon: misc updates for 6.8'

   - Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
     memcg: subtree stats flushing and thresholds'.

   - In the series 'Multi-size THP for anonymous memory' Ryan Roberts
     has added a runtime opt-in feature to transparent hugepages which
     improves performance by allocating larger chunks of memory during
     anonymous page faults.

   - Matthew Wilcox has also contributed some cleanup and maintenance
     work against eh buffer_head code int he series 'More buffer_head
     cleanups'.

   - Suren Baghdasaryan has done work on Andrea Arcangeli's series
     'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
     compaction algorithms to move userspace's pages around rather than
     UFFDIO_COPY'a alloc/copy/free.

   - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
     Add ksm advisor'. This is a governor which tunes KSM's scanning
     aggressiveness in response to userspace's current needs.

   - Chengming Zhou has optimized zswap's temporary working memory use
     in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.

   - Matthew Wilcox has performed some maintenance work on the writeback
     code, both code and within filesystems. The series is 'Clean up the
     writeback paths'.

   - Andrey Konovalov has optimized KASAN's handling of alloc and free
     stack traces for secondary-level allocators, in the series 'kasan:
     save mempool stack traces'.

   - Andrey also performed some KASAN maintenance work in the series
     'kasan: assorted clean-ups'.

   - David Hildenbrand has gone to town on the rmap code. Cleanups, more
     pte batching, folio conversions and more. See the series 'mm/rmap:
     interface overhaul'.

   - Kinsey Ho has contributed some maintenance work on the MGLRU code
     in the series 'mm/mglru: Kconfig cleanup'.

   - Matthew Wilcox has contributed lruvec page accounting code cleanups
     in the series 'Remove some lruvec page accounting functions'"

* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
  mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
  mm, treewide: introduce NR_PAGE_ORDERS
  selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
  selftests/mm: skip test if application doesn't has root privileges
  selftests/mm: conform test to TAP format output
  selftests: mm: hugepage-mmap: conform to TAP format output
  selftests/mm: gup_test: conform test to TAP format output
  mm/selftests: hugepage-mremap: conform test to TAP format output
  mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
  mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
  mm/memcontrol: remove __mod_lruvec_page_state()
  mm/khugepaged: use a folio more in collapse_file()
  slub: use a folio in __kmalloc_large_node
  slub: use folio APIs in free_large_kmalloc()
  slub: use alloc_pages_node() in alloc_slab_page()
  mm: remove inc/dec lruvec page state functions
  mm: ratelimit stat flush from workingset shrinker
  kasan: stop leaking stack trace handles
  mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
  mm/mglru: add dummy pmd_dirty()
  ...
2024-01-09 11:18:47 -08:00
David Howells 100ccd18bb netfs: Optimise away reads above the point at which there can be no data
Track the file position above which the server is not expected to have any
data (the "zero point") and preemptively assume that we can satisfy
requests by filling them with zeroes locally rather than attempting to
download them if they're over that line - even if we've written data back
to the server.  Assume that any data that was written back above that
position is held in the local cache.  Note that we have to split requests
that straddle the line.

Make use of this to optimise away some reads from the server.  We need to
set the zero point in the following circumstances:

 (1) When we see an extant remote inode and have no cache for it, we set
     the zero_point to i_size.

 (2) On local inode creation, we set zero_point to 0.

 (3) On local truncation down, we reduce zero_point to the new i_size if
     the new i_size is lower.

 (4) On local truncation up, we don't change zero_point.

 (5) On local modification, we don't change zero_point.

 (6) On remote invalidation, we set zero_point to the new i_size.

 (7) If stored data is discarded from the pagecache or culled from fscache,
     we must set zero_point above that if the data also got written to the
     server.

 (8) If dirty data is written back to the server, but not fscache, we must
     set zero_point above that.

 (9) If a direct I/O write is made, set zero_point above that.

Assuming the above, any read from the server at or above the zero_point
position will return all zeroes.

The zero_point value can be stored in the cache, provided the above rules
are applied to it by any code that culls part of the local cache.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2023-12-28 09:45:27 +00:00
David Howells c1ec4d7c2e netfs: Provide invalidate_folio and release_folio calls
Provide default invalidate_folio and release_folio calls.  These will need
to interact with invalidation correctly at some point.  They will be needed
if netfslib is to make use of folio->private for its own purposes.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2023-12-24 15:08:51 +00:00
David Howells c9c4ff12df netfs: Move pinning-for-writeback from fscache to netfs
Move the resource pinning-for-writeback from fscache code to netfslib code.
This is used to keep a cache backing object pinned whilst we have dirty
pages on the netfs inode in the pagecache such that VM writeback will be
able to reach it.

Whilst we're at it, switch the parameters of netfs_unpin_writeback() to
match ->write_inode() so that it can be used for that directly.

Note that this mechanism could be more generically useful than that for
network filesystems.  Quite often they have to keep around other resources
(e.g. authentication tokens or network connections) until the writeback is
complete.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2023-12-24 15:08:49 +00:00
David Howells 4498a8eccc netfs, fscache: Remove ->begin_cache_operation
Remove ->begin_cache_operation() in favour of just calling fscache directly.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Christian Brauner <christian@brauner.io>
cc: linux-fsdevel@vger.kernel.org
cc: linux-cachefs@redhat.com
2023-12-24 15:08:48 +00:00
Amir Goldstein 705bcfcbde
fs: use splice_copy_file_range() inline helper
generic_copy_file_range() is just a wrapper around splice_file_range(),
which caps the maximum copy length.

The only caller of splice_file_range(), namely __ceph_copy_file_range()
is already ready to cope with short copy.

Move the length capping into splice_file_range() and replace the exported
symbol generic_copy_file_range() with a simple inline helper.

Suggested-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-fsdevel/20231204083849.GC32438@lst.de/
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231212094440.250945-3-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-12 16:20:02 +01:00
Matthew Wilcox (Oracle) af7628d6ec fs: convert error_remove_page to error_remove_folio
There were already assertions that we were not passing a tail page to
error_remove_page(), so make the compiler enforce that by converting
everything to pass and use a folio.

Link: https://lkml.kernel.org/r/20231117161447.2461643-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:42 -08:00
Amir Goldstein 488e8f6852 fs: fork splice_file_range() from do_splice_direct()
In preparation of calling do_splice_direct() without file_start_write()
held, create a new helper splice_file_range(), to be called from context
of ->copy_file_range() methods instead of do_splice_direct().

Currently, the only difference is that splice_file_range() does not take
flags argument and that it asserts that file_start_write() is held, but
we factor out a common helper do_splice_direct_actor() that will be used
later.

Use the new helper from __ceph_copy_file_range(), that was incorrectly
passing to do_splice_direct() the copy flags argument as splice flags.
The value of copy flags in ceph is always 0, so it is a smenatic bug fix.

Move the declaration of both helpers to linux/splice.h.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231130141624.3338942-2-amir73il@gmail.com
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-01 11:39:50 +01:00
Al Viro da549bdd15 dentry: switch the lists of children to hlist
Saves a pointer per struct dentry and actually makes the things less
clumsy.  Cleaned the d_walk() and dcache_readdir() a bit by use
of hlist_for_... iterators.

A couple of new helpers - d_first_child() and d_next_sibling(),
to make the expressions less awful.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-25 02:32:13 -05:00
Linus Torvalds e21165bfbc Two items:
- support for idmapped mounts in CephFS (Christian Brauner, Alexander
   Mikhalitsyn).  The series was originally developed by Christian and
   later picked up and brought over the finish line by Alexander, who
   also contributed an enabler on the MDS side (separate owner_{u,g}id
   fields on the wire).  The required exports for mnt_idmap_{get,put}()
   in VFS have been acked by Christian and received no objection from
   Christoph.
 
 - a churny change in CephFS logging to include cluster and client
   identifiers in log and debug messages (Xiubo Li).  This would help
   in scenarios with dozens of CephFS mounts on the same node which are
   getting increasingly common, especially in the Kubernetes world.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmVNDyMTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHziy6xCACmUikhgo+pZDO7oQS7HChdZFz5Q2Jb
 5K9K7sJr7Zb2gFKLAV93cyFrz0JRljmgZuA3DmDTH1omrkrVAJfHJ6Md1UFG3W7o
 LaowP3kECsykOiz+YtVU2957sfoFqds/q6KCXdp1Pc8WfNnL1vysCim6EGYtCqUm
 c6vv0zkvEDQp+3kjTN01LHzHFZdHg8+STNM4BMiB0sO/NqbADnaQBEtDpYrgzwh7
 YPwVIKcXutUmAKlb7vjUF9ptSICYkGV7B+loPS4BDva1I6dT42xqLQOu89tTKU0D
 DiQAfE2oRgU+fl+mMooFJNSqD25Q6IkXrPs0HuiSHS4wtLGqYZAwwfHB
 =F9o1
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.7-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:

 - support for idmapped mounts in CephFS (Christian Brauner, Alexander
   Mikhalitsyn).

   The series was originally developed by Christian and later picked up
   and brought over the finish line by Alexander, who also contributed
   an enabler on the MDS side (separate owner_{u,g}id fields on the
   wire).

   The required exports for mnt_idmap_{get,put}() in VFS have been acked
   by Christian and received no objection from Christoph.

 - a churny change in CephFS logging to include cluster and client
   identifiers in log and debug messages (Xiubo Li).

   This would help in scenarios with dozens of CephFS mounts on the same
   node which are getting increasingly common, especially in the
   Kubernetes world.

* tag 'ceph-for-6.7-rc1' of https://github.com/ceph/ceph-client:
  ceph: allow idmapped mounts
  ceph: allow idmapped atomic_open inode op
  ceph: allow idmapped set_acl inode op
  ceph: allow idmapped setattr inode op
  ceph: pass idmap to __ceph_setattr
  ceph: allow idmapped permission inode op
  ceph: allow idmapped getattr inode op
  ceph: pass an idmapping to mknod/symlink/mkdir
  ceph: add enable_unsafe_idmap module parameter
  ceph: handle idmapped mounts in create_request_message()
  ceph: stash idmapping in mdsc request
  fs: export mnt_idmap_get/mnt_idmap_put
  libceph, ceph: move mdsmap.h to fs/ceph
  ceph: print cluster fsid and client global_id in all debug logs
  ceph: rename _to_client() to _to_fs_client()
  ceph: pass the mdsc to several helpers
  libceph: add doutc and *_client debug macros support
2023-11-10 09:52:56 -08:00
Christian Brauner 56d2e2cfa2 ceph: allow idmapped mounts
Now that we converted cephfs internally to account for idmapped mounts
allow the creation of idmapped mounts on by setting the FS_ALLOW_IDMAP
flag.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Christian Brauner 8a051b40ab ceph: allow idmapped atomic_open inode op
Enable ceph_atomic_open() to handle idmapped mounts. This is just a
matter of passing down the mount's idmapping.

[ aleksandr.mikhalitsyn: adapted to 5fadbd9929 ("ceph: rely on vfs for
  setgid stripping") ]

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Christian Brauner 2cce72dda2 ceph: allow idmapped set_acl inode op
Enable ceph_set_acl() to handle idmapped mounts. This is just a matter
of passing down the mount's idmapping.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Christian Brauner a04aff2588 ceph: allow idmapped setattr inode op
Enable __ceph_setattr() to handle idmapped mounts. This is just a matter
of passing down the mount's idmapping.

[ aleksandr.mikhalitsyn: adapted to b27c82e129 ("attr: port attribute
  changes to new types") ]

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Alexander Mikhalitsyn 79c66a0c8c ceph: pass idmap to __ceph_setattr
Just pass down the mount's idmapping to __ceph_setattr,
because we will need it later.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Christian Brauner 8995375fae ceph: allow idmapped permission inode op
Enable ceph_permission() to handle idmapped mounts. This is just a
matter of passing down the mount's idmapping.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Christian Brauner 0513043ec4 ceph: allow idmapped getattr inode op
Enable ceph_getattr() to handle idmapped mounts. This is just a matter
of passing down the mount's idmapping.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Christian Brauner 09838f1bfd ceph: pass an idmapping to mknod/symlink/mkdir
Enable mknod/symlink/mkdir iops to handle idmapped mounts.
This is just a matter of passing down the mount's idmapping.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Alexander Mikhalitsyn 673478b6e5 ceph: add enable_unsafe_idmap module parameter
This parameter is used to decide if we allow
to perform IO on idmapped mount in case when MDS lacks
support of CEPHFS_FEATURE_HAS_OWNER_UIDGID feature.

In this case we can't properly handle MDS permission
checks and if UID/GID-based restrictions are enabled
on the MDS side then IO requests which go through an
idmapped mount may fail with -EACCESS/-EPERM.
Fortunately, for most of users it's not a case and
everything should work fine. But we put work "unsafe"
in the module parameter name to warn users about
possible problems with this feature and encourage
update of cephfs MDS.

Suggested-by: Stéphane Graber <stgraber@ubuntu.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Christian Brauner 5ccd8530dd ceph: handle idmapped mounts in create_request_message()
Inode operations that create a new filesystem object such as ->mknod,
->create, ->mkdir() and others don't take a {g,u}id argument explicitly.
Instead the caller's fs{g,u}id is used for the {g,u}id of the new
filesystem object.

In order to ensure that the correct {g,u}id is used map the caller's
fs{g,u}id for creation requests. This doesn't require complex changes.
It suffices to pass in the relevant idmapping recorded in the request
message. If this request message was triggered from an inode operation
that creates filesystem objects it will have passed down the relevant
idmaping. If this is a request message that was triggered from an inode
operation that doens't need to take idmappings into account the initial
idmapping is passed down which is an identity mapping.

This change uses a new cephfs protocol extension CEPHFS_FEATURE_HAS_OWNER_UIDGID
which adds two new fields (owner_{u,g}id) to the request head structure.
So, we need to ensure that MDS supports it otherwise we need to fail
any IO that comes through an idmapped mount because we can't process it
in a proper way. MDS server without such an extension will use caller_{u,g}id
fields to set a new inode owner UID/GID which is incorrect because caller_{u,g}id
values are unmapped. At the same time we can't map these fields with an
idmapping as it can break UID/GID-based permission checks logic on the
MDS side. This problem was described with a lot of details at [1], [2].

[1] https://lore.kernel.org/lkml/CAEivzxfw1fHO2TFA4dx3u23ZKK6Q+EThfzuibrhA3RKM=ZOYLg@mail.gmail.com/
[2] https://lore.kernel.org/all/20220104140414.155198-3-brauner@kernel.org/

Link: https://github.com/ceph/ceph/pull/52575
Link: https://tracker.ceph.com/issues/62217
Co-Developed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Christian Brauner 9c2df2271c ceph: stash idmapping in mdsc request
When sending a mds request cephfs will send relevant data for the
requested operation. For creation requests the caller's fs{g,u}id is
used to set the ownership of the newly created filesystem object. For
setattr requests the caller can pass in arbitrary {g,u}id values to
which the relevant filesystem object is supposed to be changed.

If the caller is performing the relevant operation via an idmapped mount
cephfs simply needs to take the idmapping into account when it sends the
relevant mds request.

In order to support idmapped mounts for cephfs we stash the idmapping
whenever they are relevant for the operation for the duration of the
request. Since mds requests can be queued and performed asynchronously
we make sure to keep the idmapping around and release it once the
request has finished.

In follow-up patches we will use this to send correct ownership
information over the wire. This patch just adds the basic infrastructure
to keep the idmapping around. The actual conversion patches are all
fairly minimal.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Xiubo Li 522dc5108f libceph, ceph: move mdsmap.h to fs/ceph
The mdsmap.h is only used by CephFS, so move it to fs/ceph.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Xiubo Li 38d46409c4 ceph: print cluster fsid and client global_id in all debug logs
Multiple CephFS mounts on a host is increasingly common so
disambiguating messages like this is necessary and will make it easier
to debug issues.

At the same this will improve the debug logs to make them easier to
troubleshooting issues, such as print the ino# instead only printing
the memory addresses of the corresponding inodes and print the dentry
names instead of the corresponding memory addresses for the dentry,etc.

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Xiubo Li 5995d90d2d ceph: rename _to_client() to _to_fs_client()
We need to covert the inode to ceph_client in the following commit,
and will add one new helper for that, here we rename the old helper
to _fs_client().

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Xiubo Li 197b7d792d ceph: pass the mdsc to several helpers
We will use the 'mdsc' to get the global_id in the following commits.

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Linus Torvalds 8829687a4a fscrypt updates for 6.7
This update adds support for configuring the crypto data unit size (i.e.
 the granularity of file contents encryption) to be less than the
 filesystem block size. This can allow users to use inline encryption
 hardware in some cases when it wouldn't otherwise be possible.
 
 In addition, there are two commits that are prerequisites for the
 extent-based encryption support that the btrfs folks are working on.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZT8acBQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK+czAQDkStgX1ICJQANnxwbrg/SUVdZjPuFH
 sJw3sUVpBR81TwEA/SyWh3YzVNZdpE7PWNrCknrC+qnO8hd9QBEjnQfwIQc=
 =t44a
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux

Pull fscrypt updates from Eric Biggers:
 "This update adds support for configuring the crypto data unit size
  (i.e. the granularity of file contents encryption) to be less than the
  filesystem block size. This can allow users to use inline encryption
  hardware in some cases when it wouldn't otherwise be possible.

  In addition, there are two commits that are prerequisites for the
  extent-based encryption support that the btrfs folks are working on"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
  fscrypt: track master key presence separately from secret
  fscrypt: rename fscrypt_info => fscrypt_inode_info
  fscrypt: support crypto data unit size less than filesystem block size
  fscrypt: replace get_ino_and_lblk_bits with just has_32bit_inodes
  fscrypt: compute max_lblk_bits from s_maxbytes and block size
  fscrypt: make the bounce page pool opt-in instead of opt-out
  fscrypt: make it clearer that key_prefix is deprecated
2023-10-30 10:23:42 -10:00
Linus Torvalds 14ab6d425e vfs-6.7.ctime
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppYgAKCRCRxhvAZXjc
 okIHAP9anLz1QDyMLH12ASuHjgBc0Of3jcB6NB97IWGpL4O21gEA46ohaD+vcJuC
 YkBLU3lXqQ87nfu28ExFAzh10hG2jwM=
 =m4pB
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs inode time accessor updates from Christian Brauner:
 "This finishes the conversion of all inode time fields to accessor
  functions as discussed on list. Changing timestamps manually as we
  used to do before is error prone. Using accessors function makes this
  robust.

  It does not contain the switch of the time fields to discrete 64 bit
  integers to replace struct timespec and free up space in struct inode.
  But after this, the switch can be trivially made and the patch should
  only affect the vfs if we decide to do it"

* tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits)
  fs: rename inode i_atime and i_mtime fields
  security: convert to new timestamp accessors
  selinux: convert to new timestamp accessors
  apparmor: convert to new timestamp accessors
  sunrpc: convert to new timestamp accessors
  mm: convert to new timestamp accessors
  bpf: convert to new timestamp accessors
  ipc: convert to new timestamp accessors
  linux: convert to new timestamp accessors
  zonefs: convert to new timestamp accessors
  xfs: convert to new timestamp accessors
  vboxsf: convert to new timestamp accessors
  ufs: convert to new timestamp accessors
  udf: convert to new timestamp accessors
  ubifs: convert to new timestamp accessors
  tracefs: convert to new timestamp accessors
  sysv: convert to new timestamp accessors
  squashfs: convert to new timestamp accessors
  server: convert to new timestamp accessors
  client: convert to new timestamp accessors
  ...
2023-10-30 09:47:13 -10:00
Linus Torvalds 7352a6765c vfs-6.7.xattr
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppWAAKCRCRxhvAZXjc
 okB2AP4jjoRErJBwj245OIDJqzoj4m4UVOVd0MH2AkiSpANczwD/TToChdpusY2y
 qAYg1fQoGMbDVlb7Txaj9qI9ieCf9w0=
 =2PXg
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs xattr updates from Christian Brauner:
 "The 's_xattr' field of 'struct super_block' currently requires a
  mutable table of 'struct xattr_handler' entries (although each handler
  itself is const). However, no code in vfs actually modifies the
  tables.

  This changes the type of 's_xattr' to allow const tables, and modifies
  existing file systems to move their tables to .rodata. This is
  desirable because these tables contain entries with function pointers
  in them; moving them to .rodata makes it considerably less likely to
  be modified accidentally or maliciously at runtime"

* tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
  const_structs.checkpatch: add xattr_handler
  net: move sockfs_xattr_handlers to .rodata
  shmem: move shmem_xattr_handlers to .rodata
  overlayfs: move xattr tables to .rodata
  xfs: move xfs_xattr_handlers to .rodata
  ubifs: move ubifs_xattr_handlers to .rodata
  squashfs: move squashfs_xattr_handlers to .rodata
  smb: move cifs_xattr_handlers to .rodata
  reiserfs: move reiserfs_xattr_handlers to .rodata
  orangefs: move orangefs_xattr_handlers to .rodata
  ocfs2: move ocfs2_xattr_handlers and ocfs2_xattr_handler_map to .rodata
  ntfs3: move ntfs_xattr_handlers to .rodata
  nfs: move nfs4_xattr_handlers to .rodata
  kernfs: move kernfs_xattr_handlers to .rodata
  jfs: move jfs_xattr_handlers to .rodata
  jffs2: move jffs2_xattr_handlers to .rodata
  hfsplus: move hfsplus_xattr_handlers to .rodata
  hfs: move hfs_xattr_handlers to .rodata
  gfs2: move gfs2_xattr_handlers_max to .rodata
  fuse: move fuse_xattr_handlers to .rodata
  ...
2023-10-30 09:29:44 -10:00
Linus Torvalds d1b0949f23 assorted fixes all over the place
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZTxSwQAKCRBZ7Krx/gZQ
 6zadAP9o/724KPDCY3ybgwKyEQ1UNjHTriFRBeoF3o2q0WgidwEA+/xS0Xk3i25w
 xnSZO/8My1edE1IcK/JDwewH/J+4Kw0=
 =N/Lv
 -----END PGP SIGNATURE-----

Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull misc filesystem fixes from Al Viro:
 "Assorted fixes all over the place: literally nothing in common, could
  have been three separate pull requests.

  All are simple regression fixes, but not for anything from this cycle"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ceph_wait_on_conflict_unlink(): grab reference before dropping ->d_lock
  io_uring: kiocb_done() should *not* trust ->ki_pos if ->{read,write}_iter() failed
  sparc32: fix a braino in fault handling in csum_and_copy_..._user()
2023-10-27 16:44:58 -10:00